# Project Title: Classification Problem Analysis


## Table of Contents
1. [Introduction](#introduction)
2. [Importing Libraries and Dataset](#importing-libraries-and-dataset)
    - 2.1 [Import Necessary Libraries](#import-necessary-libraries)
    - 2.2 [Load the Dataset](#load-the-dataset)
3. [Data Exploration](#data-exploration)
    - 3.1 [Understanding the Dataset Structure](#understanding-the-dataset-structure)
    - 3.2 [Descriptive Statistics](#descriptive-statistics)
    - 3.3 [Checking for Incoherencies](#checking-for-incoherencies)
    - 3.4 [Univariate Analysis](#univariate-analysis)
    - 3.5 [Multivariate Analysis](#multivariate-analysis)
4. [Data Cleaning and Pre-processing](#data-cleaning-and-pre-processing)
    - 4.1 [Handling Missing Values](#handling-missing-values)
    - 4.2 [Outlier Detection and Treatment](#outlier-detection-and-treatment)
    - 4.3 [Dealing with Categorical Variables](#dealing-with-categorical-variables)
    - 4.4 [Feature Engineering](#feature-engineering)
        - 4.4.1 [Feature Creation](#feature-creation)
        - 4.4.2 [Feature Transformation](#feature-transformation)
    - 4.5 [Data Scaling and Normalization](#data-scaling-and-normalization)
5. [Feature Selection](#feature-selection)
    - 5.1 [Feature Importance Analysis](#feature-importance-analysis)
    - 5.2 [Correlation Matrix](#correlation-matrix)
    - 5.3 [Dimensionality Reduction Techniques](#dimensionality-reduction-techniques)
    - 5.4 [Final Feature Selection](#final-feature-selection)
6. [Model Building](#model-building)
    - 6.1 [Problem Type Identification](#problem-type-identification)
    - 6.2 [Algorithm Selection](#algorithm-selection)
    - 6.3 [Model Training](#model-training)
    - 6.4 [Cross-Validation](#cross-validation)
    - 6.5 [Performance Metrics](#performance-metrics)
    - 6.6 [Model Evaluation](#model-evaluation)
7. [Prediction on Test Set](#prediction-on-test-set)
    - 7.1 [Generating Predictions](#generating-predictions)
    - 7.2 [Result Interpretation](#result-interpretation)
8. [Conclusion](#conclusion)
9. [References](#references)
10. [Appendices](#appendices)


<a id='introduction'></a>
## Introduction

- Brief overview of the classification problem.
- Objectives and goals of the analysis.
- Summary of the dataset provided by the client.


<a id='importing-libraries-and-dataset'></a>
## Importing Libraries and Dataset


<a id='import-necessary-libraries'></a>
### 2.1 Import Necessary Libraries

- List all Python libraries required for the analysis.
- Explain the purpose of each library.


In [35]:
# Import necessary libraries
# Example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# etc.

<a id='load-the-dataset'></a>
### 2.2 Load the Dataset

- Instructions on how to import the dataset into the notebook.
- Description of the dataset's structure (e.g., number of rows and columns).


In [36]:
# Load the dataset
# Example:
df = pd.read_csv('../project_data/train_data.csv')
# Display basic information about the dataset





In [37]:
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| Accident Date | 2019-12-30 | 2019-08-30 | 2019-12-06 | | 2019-12-30 |
| Age at Injury | 31.0 | 46.0 | 40.0 | | 61.0 |
| Alternative Dispute Resolution | N | N | N | | N |
| Assembly Date | 2020-01-01 | 2020-01-01 | 2020-01-01 | 2020-01-01 | 2020-01-01 |
| Attorney/Representative | N | Y | N | | N |
| Average Weekly Wage | 0.0 | 1745.93 | 1434.8 | | |
| Birth Year | 1988.0 | 1973.0 | 1979.0 | | 1958.0 |
| C-2 Date | 2019-12-31 | 2020-01-01 | 2020-01-01 | | 2019-12-31 |
| C-3 Date | | 2020-01-14 | | | |
| Carrier Name | NEW HAMPSHIRE INSURANCE CO | ZURICH AMERICAN INSURANCE CO | INDEMNITY INSURANCE CO OF | | STATE INSURANCE FUND |


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 593471 entries, 0 to 593470
Data columns (total 33 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   Accident Date                       570337 non-null  object 
 1   Age at Injury                       574026 non-null  float64
 2   Alternative Dispute Resolution      574026 non-null  object 
 3   Assembly Date                       593471 non-null  object 
 4   Attorney/Representative             574026 non-null  object 
 5   Average Weekly Wage                 545375 non-null  float64
 6   Birth Year                          544948 non-null  float64
 7   C-2 Date                            559466 non-null  object 
 8   C-3 Date                            187245 non-null  object 
 9   Carrier Name                        574026 non-null  object 
 10  Carrier Type                        574026 non-null  object 
 11  Claim Identifier          

In [39]:
df.columns.values

array(['Accident Date', 'Age at Injury', 'Alternative Dispute Resolution',
       'Assembly Date', 'Attorney/Representative', 'Average Weekly Wage',
       'Birth Year', 'C-2 Date', 'C-3 Date', 'Carrier Name',
       'Carrier Type', 'Claim Identifier', 'Claim Injury Type',
       'County of Injury', 'COVID-19 Indicator', 'District Name',
       'First Hearing Date', 'Gender', 'IME-4 Count', 'Industry Code',
       'Industry Code Description', 'Medical Fee Region',
       'OIICS Nature of Injury Description', 'WCIO Cause of Injury Code',
       'WCIO Cause of Injury Description', 'WCIO Nature of Injury Code',
       'WCIO Nature of Injury Description', 'WCIO Part Of Body Code',
       'WCIO Part Of Body Description', 'Zip Code', 'Agreement Reached',
       'WCB Decision', 'Number of Dependents'], dtype=object)


In [41]:

desc_columns = [column for column in df.columns if 'Description' in column]

df[desc_columns]

| | Industry Code Description | OIICS Nature of Injury Description | WCIO Cause of Injury Description | WCIO Nature of Injury Description | WCIO Part Of Body Description |
|---|---|---|---|---|---|
| 0 | RETAIL TRADE | | FROM LIQUID OR GREASE SPILLS | CONTUSION | BUTTOCKS |
| 1 | CONSTRUCTION | | REPETITIVE MOTION | SPRAIN OR TEAR | SHOULDER(S) |
| 2 | ADMINISTRATIVE AND SUPPORT AND WASTE MANAGEMEN... | | OBJECT BEING LIFTED OR HANDLED | CONCUSSION | MULTIPLE HEAD INJURY |
| 3 | | | | | |
| 4 | HEALTH CARE AND SOCIAL ASSISTANCE | | HAND TOOL, UTENSIL; NOT POWERED | PUNCTURE | FINGER(S) |
| ... | ... | ... | ... | ... | ... |
| 593466 | | | | | |
| 593467 | TRANSPORTATION AND WAREHOUSING | | FROM DIFFERENT LEVEL (ELEVATION) | MULTIPLE PHYSICAL INJURIES ONLY | MULTIPLE |
| 593468 | | | | | |
| 593469 | | | | | |


In [42]:
# Percentage of missing values per column
df.isnull().mean()*100

Accident Date                           3.898084
Age at Injury                           3.276487
Alternative Dispute Resolution          3.276487
Assembly Date                           0.000000
Attorney/Representative                 3.276487
Average Weekly Wage                     8.104187
Birth Year                              8.176137
C-2 Date                                5.729850
C-3 Date                               68.449174
Carrier Name                            3.276487
Carrier Type                            3.276487
Claim Identifier                        0.000000
Claim Injury Type                       3.276487
County of Injury                        3.276487
COVID-19 Indicator                      3.276487
District Name                           3.276487
First Hearing Date                     74.590502
Gender                                  3.276487
IME-4 Count                            77.622664
Industry Code                           4.954412
Industry Code Descri

In [43]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age at Injury | 574026.0 | 42.11427 | 14.25643 | 0.0 | 31.0 | 42.0 | 54.0 | 117.0 |
| Average Weekly Wage | 545375.0 | 491.0883 | 6092.918 | 0.0 | 0.0 | 0.0 | 841.0 | 2828079.0 |
| Birth Year | 544948.0 | 1886.768 | 414.6444 | 0.0 | 1965.0 | 1977.0 | 1989.0 | 2018.0 |
| Claim Identifier | 593471.0 | 23667600.0 | 107927100.0 | 5393066.0 | 5593414.5 | 5791212.0 | 5991000.5 | 999891667.0 |
| IME-4 Count | 132803.0 | 3.207337 | 2.832303 | 1.0 | 1.0 | 2.0 | 4.0 | 73.0 |
| Industry Code | 564068.0 | 58.64531 | 19.64417 | 11.0 | 45.0 | 61.0 | 71.0 | 92.0 |
| OIICS Nature of Injury Description | 0.0 | | | | | | | |
| WCIO Cause of Injury Code | 558386.0 | 54.38114 | 25.87428 | 1.0 | 31.0 | 56.0 | 75.0 | 99.0 |
| WCIO Nature of Injury Code | 558369.0 | 41.01384 | 22.20752 | 1.0 | 16.0 | 49.0 | 52.0 | 91.0 |
| WCIO Part Of Body Code | 556944.0 | 39.73815 | 22.36594 | -9.0 | 33.0 | 38.0 | 53.0 | 99.0 |


In [44]:
df.describe(include='object').T

| | count | unique | top | freq |
|---|---|---|---|---|
| Accident Date | 570337 | 5539 | 2020-03-01 | 1245 |
| Alternative Dispute Resolution | 574026 | 3 | N | 571412 |
| Assembly Date | 593471 | 1096 | 2020-03-06 | 1422 |
| Attorney/Representative | 574026 | 2 | N | 392291 |
| C-2 Date | 559466 | 2475 | 2021-05-11 | 1847 |
| C-3 Date | 187245 | 1648 | 2021-04-21 | 350 |
| Carrier Name | 574026 | 2046 | STATE INSURANCE FUND | 111144 |
| Carrier Type | 574026 | 8 | 1A. PRIVATE | 285368 |
| Claim Injury Type | 574026 | 8 | 2. NON-COMP | 291078 |
| County of Injury | 574026 | 63 | SUFFOLK | 60430 |


<a id='data-exploration'></a>
## Data Exploration


<a id='understanding-the-dataset-structure'></a>
### 3.1 Understanding the Dataset Structure

- Display the first few rows of the dataset.
- Discuss the data types of each column.
- Identify the target variable and features.


In [45]:
# Display the first few rows
# df.head()

# Display data types
# df.dtypes

# Identify target variable and features
# target = 'target_column_name'
# features = df.columns.drop(target)
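
The target/feature split could look roughly like the sketch below. It assumes `Claim Injury Type` is the target, which the notebook does not state explicitly, and it runs on a small made-up frame (`toy`) rather than on `df`.

```python
import pandas as pd

# Toy frame: column names mirror the dataset, the values are made up.
toy = pd.DataFrame({
    "Age at Injury": [31.0, 46.0, 40.0],
    "Gender": ["M", "F", "M"],
    "Claim Injury Type": ["2. NON-COMP", "OTHER", "2. NON-COMP"],
})

target = "Claim Injury Type"          # assumed target column
features = toy.columns.drop(target)   # every other column is a candidate feature

X, y = toy[features], toy[target]
print(list(features))
print(y.value_counts().to_dict())     # class balance of the target
```

Looking at the class balance at this point is useful because it later drives the choice of metrics and the use of stratified splits.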

<a id='descriptive-statistics'></a>
### 3.2 Descriptive Statistics

- Calculate summary statistics (mean, median, mode, etc.).
- Analyze the distribution of numerical features.


In [46]:
# Summary statistics
# df.describe()

# Analyze distributions
# numerical_columns = df.select_dtypes(include='number').columns
# for col in numerical_columns:
#     print(f'Distribution of {col}')
#     plt.hist(df[col].dropna())
#     plt.show()
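
As a small illustration of why mean, median, and mode should be read together, the sketch below uses a handful of made-up wage values shaped like `Average Weekly Wage`, where many entries are zero. The series `wage` and the dictionary `summary` are illustrative names only.

```python
import pandas as pd

# Made-up values; many zeros, as the describe() output suggests for this column.
wage = pd.Series([0.0, 0.0, 0.0, 841.0, 1745.93], name="Average Weekly Wage")

summary = {
    "mean": wage.mean(),
    "median": wage.median(),
    "mode": wage.mode().iloc[0],
    "share_zero": (wage == 0).mean(),  # fraction of rows equal to zero
}
print(summary)
```

When the mean sits far above the median, as here, the distribution is right-skewed and the mean alone gives a misleading picture.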

<a id='checking-for-incoherencies'></a>
### 3.3 Checking for Incoherencies

- Identify any inconsistencies or anomalies in the data.
- Discuss potential data entry errors or irregularities.


In [47]:
# Check for inconsistencies
# Example: df['column_name'].unique()
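
The `df.describe()` output above hints at a few implausible values (a `Birth Year` of 0, an `Age at Injury` of 0 or 117, a negative `WCIO Part Of Body Code`). A possible way to count such rows is sketched below on a made-up frame; the age range of 14 to 100 is an illustrative assumption, not a rule from the data provider.

```python
import pandas as pd

# Toy frame with made-up rows that reproduce the kinds of anomalies seen in describe().
toy = pd.DataFrame({
    "Age at Injury": [31.0, 0.0, 117.0, 40.0],
    "Birth Year": [1988.0, 0.0, 1958.0, 1979.0],
    "WCIO Part Of Body Code": [33.0, -9.0, 38.0, 53.0],
    "Accident Date": pd.to_datetime(["2019-12-30", "2020-02-01", "2019-12-06", "2019-08-30"]),
    "Assembly Date": pd.to_datetime(["2020-01-01"] * 4),
})

# One boolean mask per rule; True marks a suspicious row.
checks = {
    "birth_year_zero": toy["Birth Year"] == 0,
    "age_out_of_range": ~toy["Age at Injury"].between(14, 100),  # assumed plausible range
    "negative_body_code": toy["WCIO Part Of Body Code"] < 0,
    "accident_after_assembly": toy["Accident Date"] > toy["Assembly Date"],
}
report = pd.DataFrame(checks).sum()  # number of flagged rows per rule
print(report)
```

Note that `~between(...)` also flags missing values, so on the real data the missing rows should be counted separately before interpreting that rule.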

In [48]:
# Build code -> description lookup tables for WCIO Cause of Injury, WCIO Nature of Injury,
# WCIO Part Of Body, and Industry
C_o_In_indicators = df[['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']].drop_duplicates().set_index('WCIO Cause of Injury Code')
N_o_In_indicators = df[['WCIO Nature of Injury Code', 'WCIO Nature of Injury Description']].drop_duplicates().set_index('WCIO Nature of Injury Code')
P_o_B_indicators = df[['WCIO Part Of Body Code', 'WCIO Part Of Body Description']].drop_duplicates().set_index('WCIO Part Of Body Code')
Ind_indicators = df[['Industry Code', 'Industry Code Description']].drop_duplicates().set_index('Industry Code')

# Count the codes that map to more than one description (0 means every code has one description)
lookup_tables = (C_o_In_indicators, N_o_In_indicators, P_o_B_indicators, Ind_indicators)
dup_counts = [int(table.index.duplicated().sum()) for table in lookup_tables]
print(*dup_counts, sep="\n")

# Only report success if none of the tables contains a repeated code
if not any(dup_counts):
    print("there is no code two times")

0
0
0
0
there is no code two times


* Now we can drop the columns with the descriptions and keep only the codes

* OIICS Nature of Injury Description has no non-null values, so we drop it too --> obtain information from the website:

In [49]:
desc_columns

['Industry Code Description',
 'OIICS Nature of Injury Description',
 'WCIO Cause of Injury Description',
 'WCIO Nature of Injury Description',
 'WCIO Part Of Body Description']

In [54]:
# Description columns (from desc_columns above) that are still present in df;
# the list is empty if they were already dropped
descr_col = [col for col in desc_columns if col in df.columns]

print("columns with description that we are going to drop: \n", desc_columns)

# Drop the description columns; the check makes this cell safe to re-run
if descr_col:
    df.drop(columns=descr_col, inplace=True)
    print("columns dropped successfully")
else:
    print("columns are already dropped")

columns with description that we are going to drop: 
 ['Industry Code Description', 'OIICS Nature of Injury Description', 'WCIO Cause of Injury Description', 'WCIO Nature of Injury Description', 'WCIO Part Of Body Description']
columns are already dropped


In [55]:
df.columns

Index(['Accident Date', 'Age at Injury', 'Alternative Dispute Resolution',
       'Assembly Date', 'Attorney/Representative', 'Average Weekly Wage',
       'Birth Year', 'C-2 Date', 'C-3 Date', 'Carrier Name', 'Carrier Type',
       'Claim Identifier', 'Claim Injury Type', 'County of Injury',
       'COVID-19 Indicator', 'District Name', 'First Hearing Date', 'Gender',
       'IME-4 Count', 'Industry Code', 'Medical Fee Region',
       'WCIO Cause of Injury Code', 'WCIO Nature of Injury Code',
       'WCIO Part Of Body Code', 'Zip Code', 'Agreement Reached',
       'WCB Decision', 'Number of Dependents'],
      dtype='object')

<a id='univariate-analysis'></a>
### 3.4 Univariate Analysis

- Visualize the distribution of individual features using histograms, box plots, etc.
- Interpret the visualizations and note any observations.


In [None]:
# Histograms and box plots for numerical features
# for col in numerical_columns:
#     sns.histplot(df[col])
#     plt.show()
#     sns.boxplot(x=df[col])
#     plt.show()

<a id='multivariate-analysis'></a>
### 3.5 Multivariate Analysis

- Analyze relationships between pairs of variables using scatter plots, heatmaps, etc.
- Investigate correlations and potential interactions between features.


In [None]:
# Correlation matrix
# corr_matrix = df.corr(numeric_only=True)  # restrict to numeric columns; df still holds object columns
# sns.heatmap(corr_matrix, annot=True)
# plt.show()

# Pair plots
# sns.pairplot(df)
# plt.show()

<a id='data-cleaning-and-pre-processing'></a>
## Data Cleaning and Pre-processing


<a id='handling-missing-values'></a>
### 4.1 Handling Missing Values

- Identify features with missing values.
- Discuss different strategies to handle missing data (deletion, imputation, etc.).
- Justify the chosen method and apply it.


In [None]:
# Identify missing values
# df.isnull().sum()

# Handle missing values
# Example: df.dropna() or df.ffill()  (fillna(method='ffill') is deprecated in recent pandas)
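
One imputation strategy that could be applied here is sketched below on a made-up frame: median for a numeric column, most frequent value for a categorical column, plus a flag that records where the numeric value was missing. The flag column name is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values and one gap per column.
toy = pd.DataFrame({
    "Average Weekly Wage": [0.0, 1745.93, np.nan, 1434.8],
    "Gender": ["M", np.nan, "F", "M"],
})

# Keep a flag so a model can still see that the value was originally missing.
toy["Average Weekly Wage missing"] = toy["Average Weekly Wage"].isna().astype(int)

# Numeric column: median imputation. Categorical column: most frequent value.
toy["Average Weekly Wage"] = toy["Average Weekly Wage"].fillna(toy["Average Weekly Wage"].median())
toy["Gender"] = toy["Gender"].fillna(toy["Gender"].mode().iloc[0])

print(toy)
```

In the real workflow the median and the mode should be computed on the training split only and then reused for the validation and test data.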

<a id='outlier-detection-and-treatment'></a>
### 4.2 Outlier Detection and Treatment

- Use statistical methods to detect outliers.
- Visualize outliers using box plots or scatter plots.
- Decide on an approach to handle outliers (removal, transformation, etc.).
- Justify your decisions.


In [None]:
# Detect outliers
# for col in numerical_columns:
#     sns.boxplot(x=df[col])
#     plt.show()

# Handle outliers
# Example: Remove or cap outliers
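
A common statistical rule for this step is the 1.5 x IQR fence. The sketch below applies it to a few made-up values shaped like `IME-4 Count` and caps, rather than removes, the flagged value; whether capping is appropriate for this dataset is a decision still to be justified.

```python
import pandas as pd

# Made-up values with one extreme entry.
values = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 73.0], name="IME-4 Count")

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # Tukey fences

is_outlier = (values < lower) | (values > upper)
capped = values.clip(lower=lower, upper=upper)  # cap instead of dropping rows

print(f"fences: [{lower}, {upper}], outliers: {int(is_outlier.sum())}")
```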

<a id='dealing-with-categorical-variables'></a>
### 4.3 Dealing with Categorical Variables

- Identify categorical features in the dataset.
- Discuss encoding techniques (one-hot encoding, label encoding, etc.).
- Apply the chosen encoding method to transform categorical variables.


In [None]:
# Identify categorical variables
# categorical_columns = df.select_dtypes(include=['object']).columns

# Encode categorical variables
# df = pd.get_dummies(df, columns=categorical_columns)
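
One-hot encoding every object column would create thousands of columns here, since `Carrier Name` alone has 2046 unique values. A possible compromise, sketched on made-up rows, is to one-hot encode low-cardinality columns and frequency-encode high-cardinality ones. The carrier names `CARRIER A` and `CARRIER B` and the type `OTHER` are placeholders.

```python
import pandas as pd

# Toy frame with made-up rows.
toy = pd.DataFrame({
    "Carrier Type": ["1A. PRIVATE", "1A. PRIVATE", "OTHER", "1A. PRIVATE"],
    "Carrier Name": ["STATE INSURANCE FUND", "CARRIER A", "STATE INSURANCE FUND", "CARRIER B"],
})

# Low-cardinality column: one indicator column per category.
encoded = pd.get_dummies(toy, columns=["Carrier Type"], dtype=int)

# High-cardinality column: replace each category by its relative frequency.
freq = toy["Carrier Name"].value_counts(normalize=True)
encoded["Carrier Name"] = toy["Carrier Name"].map(freq)

print(encoded)
```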

<a id='feature-engineering'></a>
### 4.4 Feature Engineering


<a id='feature-creation'></a>
#### 4.4.1 Feature Creation

- Explore the possibility of creating new features from existing ones.
- Describe the rationale behind new feature creation.
- Implement the new features.


In [None]:
# Create new features

# Convert date columns to datetime for calculations
for date_col in ['Accident Date', 'Assembly Date', 'C-2 Date', 'C-3 Date']:
    df[date_col] = pd.to_datetime(df[date_col], errors='coerce')

# Feature 1: Claim Duration (in days) - time between 'Accident Date' and 'Assembly Date'
df['Claim Duration'] = (df['Assembly Date'] - df['Accident Date']).dt.days

# Feature 2: Age at Accident Year
# 'Birth Year' contains 0 (see df.describe()), which is treated as missing here
df['Accident Year'] = df['Accident Date'].dt.year
df['Age at Accident Year'] = df['Accident Year'] - df['Birth Year'].replace(0, np.nan)

# Feature 3: Attorney Involved (binary version of 'Attorney/Representative', which holds 'Y'/'N')
# Missing values are encoded as 0
df['Attorney Involved'] = (df['Attorney/Representative'] == 'Y').astype(int)

# Feature 4: Wage Category (Low, Medium, High)
# include_lowest=True keeps a wage of exactly 0 in 'Low' instead of turning it into NaN
df['Wage Category'] = pd.cut(df['Average Weekly Wage'], bins=[0, 500, 1000, float('inf')],
                             labels=['Low', 'Medium', 'High'], include_lowest=True)

# Feature 5: Claim Filing Delay (in days) - time between 'C-2 Date' and 'C-3 Date'
df['Claim Filing Delay'] = (df['C-3 Date'] - df['C-2 Date']).dt.days

# Display the new features
engineered_features = df[['Claim Duration', 'Age at Accident Year', 'Attorney Involved', 'Wage Category', 'Claim Filing Delay']].head()
engineered_features


<a id='feature-transformation'></a>
#### 4.4.2 Feature Transformation

- Apply transformations to features if necessary (e.g., log transformation).
- Explain the reasoning for transforming features.


In [None]:
# Transform features
# Example: df['transformed_feature'] = np.log(df['feature'])
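
For a column such as `Average Weekly Wage`, which contains zeros and a very large maximum, `np.log` would return `-inf` for the zero entries. A minimal sketch with `np.log1p`, using three values taken from the `describe()` output above:

```python
import numpy as np
import pandas as pd

# min, 75% quantile, and max of Average Weekly Wage from the describe() output.
wage = pd.Series([0.0, 841.0, 2828079.0], name="Average Weekly Wage")

wage_log = np.log1p(wage)       # log(1 + x): defined at 0, compresses the long right tail
restored = np.expm1(wage_log)   # inverse transform, useful for interpretation

print(wage_log)
```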

<a id='data-scaling-and-normalization'></a>
### 4.5 Data Scaling and Normalization

- Discuss the importance of feature scaling.
- Choose appropriate scaling methods (StandardScaler, MinMaxScaler, etc.).
- Apply scaling to the dataset.
- Explain how scaling affects the model performance.


In [None]:
# Scale features (numeric columns only)
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# df_scaled = scaler.fit_transform(df[numerical_columns])
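
To avoid leaking information from the validation rows, the scaler is usually fitted on the training split only. The sketch below shows this on random numbers; the array `X` is synthetic and stands in for the numeric features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features standing in for the real columns.
rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))

X_train, X_val = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training rows only
X_val_scaled = scaler.transform(X_val)          # reuse the same mean/std on validation rows

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

Scaling matters mostly for distance-based and gradient-based models (for example logistic regression or SVM); tree-based models are largely insensitive to it.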

<a id='feature-selection'></a>
## Feature Selection


<a id='feature-importance-analysis'></a>
### 5.1 Feature Importance Analysis

- Use methods like feature importance from tree-based models to assess feature relevance.
- Present the results and interpret them.


In [None]:
# Feature importance using Random Forest
# from sklearn.ensemble import RandomForestClassifier
# model = RandomForestClassifier()
# model.fit(X_train, y_train)
# feature_importances = model.feature_importances_
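
A runnable version of this idea, on synthetic data where only one column carries signal, might look like the sketch below. The column names `informative`, `noise_1`, and `noise_2` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends on the first column only.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "informative": rng.normal(size=n),
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
})
y = (X["informative"] > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Impurity-based importances, labelled with the column names and sorted.
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)
```

Impurity-based importances tend to favour high-cardinality features, so they are best treated as a first screening rather than a final ranking.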

<a id='correlation-matrix'></a>
### 5.2 Correlation Matrix

- Create a correlation matrix to identify highly correlated features.
- Discuss potential issues with multicollinearity.


In [None]:
# Correlation matrix
# corr_matrix = pd.DataFrame(df_scaled).corr()
# sns.heatmap(corr_matrix, annot=True)
# plt.show()
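
Beyond the heatmap, highly correlated pairs can be listed programmatically. The sketch below flags one column of each pair whose absolute correlation exceeds 0.9 on synthetic data; the threshold is an assumption and should be chosen for the problem at hand.

```python
import numpy as np
import pandas as pd

# Synthetic frame: a_copy is almost a linear function of a, b is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a * 2.0 + rng.normal(scale=0.01, size=300),
    "b": rng.normal(size=300),
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is considered once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]

print(to_drop)
```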

<a id='dimensionality-reduction-techniques'></a>
### 5.3 Dimensionality Reduction Techniques

- Consider techniques like PCA if applicable.
- Explain the benefits and drawbacks of dimensionality reduction.


In [None]:
# Apply PCA
# from sklearn.decomposition import PCA
# pca = PCA(n_components=number_of_components)
# df_pca = pca.fit_transform(df_scaled)
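
Instead of fixing `number_of_components` by hand, scikit-learn's `PCA` accepts a float between 0 and 1 and keeps the smallest number of components that explains that share of the variance. A sketch on synthetic, standardized data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data with two nearly redundant columns.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
X[:, 1] = X[:, 0] + rng.normal(scale=0.1, size=300)

X_scaled = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale

# Keep as many components as needed to explain at least 95% of the variance.
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape, pca.explained_variance_ratio_.round(3))
```

The drawback is interpretability: each component is a mix of the original columns, which makes results harder to explain to the client.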

<a id='final-feature-selection'></a>
### 5.4 Final Feature Selection

- Define a clear strategy for selecting the final set of features.
- Justify the selection based on the analysis.
- Prepare the dataset with the selected features for modeling.


In [None]:
# Select final features
# selected_features = ['feature1', 'feature2', 'feature3']
# X = df[selected_features]
# y = df['target']

<a id='model-building'></a>
## Model Building


<a id='problem-type-identification'></a>
### 6.1 Problem Type Identification

- Confirm that the task is a classification problem.
- Discuss the nature of the target variable (binary, multiclass).


<a id='algorithm-selection'></a>
### 6.2 Algorithm Selection

- List potential algorithms suitable for the problem (e.g., Logistic Regression, Decision Trees).
- Justify the choice of algorithms to be used.


In [None]:
# List of algorithms
# algorithms = ['Logistic Regression', 'Decision Tree', 'Random Forest', 'SVM']

<a id='model-training'></a>
### 6.3 Model Training

- Split the dataset into training and validation sets.
- Train the selected models using the training data.
- Provide details on model parameters and settings.


In [None]:
# Split the dataset
# from sklearn.model_selection import train_test_split
# X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train models
# Example with Logistic Regression:
# from sklearn.linear_model import LogisticRegression
# model = LogisticRegression()
# model.fit(X_train, y_train)
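
With an imbalanced multiclass target, a stratified split keeps the class proportions similar in both sets. The sketch below uses synthetic data from `make_classification` and wraps scaling and the model in one pipeline; the class weights and model settings are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced 3-class problem standing in for the real data.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=42)

# stratify=y preserves the class proportions in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# The pipeline fits the scaler on the training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

val_accuracy = model.score(X_val, y_val)
print(f"validation accuracy: {val_accuracy:.3f}")
```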

<a id='cross-validation'></a>
### 6.4 Cross-Validation

- Explain the purpose of cross-validation.
- Implement cross-validation techniques (e.g., k-fold CV).
- Record cross-validation results.


In [None]:
# Cross-validation
# from sklearn.model_selection import cross_val_score
# cv_scores = cross_val_score(model, X_train, y_train, cv=5)
# print(cv_scores)
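
A slightly more explicit variant, sketched on synthetic data, uses `StratifiedKFold` with shuffling and a macro-averaged F1 score, which weights all classes equally. The choice of `f1_macro` is an assumption about what matters for this problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced 3-class problem standing in for the real data.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=3, weights=[0.6, 0.3, 0.1], random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                            cv=cv, scoring="f1_macro")

# Report mean and spread across folds, not only the individual scores.
print(f"macro F1: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```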

<a id='performance-metrics'></a>
### 6.5 Performance Metrics

- Choose appropriate evaluation metrics (accuracy, precision, recall, F1-score, ROC-AUC).
- Justify the choice of metrics based on the problem context.


In [None]:
# Evaluate model
# from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# y_pred = model.predict(X_val)
# print('Accuracy:', accuracy_score(y_val, y_pred))
# For a multiclass target, precision/recall/F1 need an explicit `average` (e.g. 'macro')
# print('Precision:', precision_score(y_val, y_pred, average='macro'))
# print('Recall:', recall_score(y_val, y_pred, average='macro'))
# print('F1 Score:', f1_score(y_val, y_pred, average='macro'))
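
A tiny worked example shows how accuracy can look better than macro F1 when classes are imbalanced. The labels `A` and `B` are made up; the numbers in the comments follow from the six predictions listed.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Made-up labels: class A is the majority, one B is misclassified as A.
y_true = ["A", "A", "A", "A", "B", "B"]
y_pred = ["A", "A", "A", "A", "A", "B"]

acc = accuracy_score(y_true, y_pred)                  # 5 of 6 correct
macro_f1 = f1_score(y_true, y_pred, average="macro")  # mean of per-class F1: (8/9 + 2/3) / 2
cm = confusion_matrix(y_true, y_pred, labels=["A", "B"])

print(f"accuracy={acc:.3f}, macro F1={macro_f1:.3f}")
print(cm)  # rows: true class, columns: predicted class
```

Here accuracy is 5/6 while macro F1 is 7/9, because the minority class B has a recall of only 0.5.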

<a id='model-evaluation'></a>
### 6.6 Model Evaluation

- Evaluate the model performance using the selected metrics.
- Compare results between different models.
- Discuss any observations or insights.


<a id='prediction-on-test-set'></a>
## Prediction on Test Set


<a id='generating-predictions'></a>
### 7.1 Generating Predictions

- Use the trained model to make predictions on the test dataset.
- Explain any pre-processing steps needed for the test data.


In [None]:
# Predict on test set
# test_data = pd.read_csv('test_dataset.csv')
# Apply same pre-processing steps to test_data
# predictions = model.predict(test_data)
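
One way to guarantee that the test data goes through the same pre-processing is to put every step into a single scikit-learn `Pipeline` fitted on the training data. The sketch below uses made-up rows; the column `target` and the category `X` are hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows; "target" is a hypothetical binary label.
train = pd.DataFrame({
    "Age at Injury": [31.0, 46.0, 40.0, 61.0, 25.0, 55.0],
    "Gender": ["M", "F", "M", "F", "M", "F"],
    "target": [0, 1, 0, 1, 0, 1],
})
# Made-up test rows with a missing age and a category never seen in training.
test = pd.DataFrame({
    "Age at Injury": [38.0, None],
    "Gender": ["F", "X"],
})

numeric = Pipeline([("impute", SimpleImputer(strategy="median")),
                    ("scale", StandardScaler())])
preprocess = ColumnTransformer([
    ("num", numeric, ["Age at Injury"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),  # unseen categories become all zeros
])
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

model.fit(train[["Age at Injury", "Gender"]], train["target"])
predictions = model.predict(test)  # imputation, scaling, and encoding are reapplied automatically
print(predictions)
```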

<a id='result-interpretation'></a>
### 7.2 Result Interpretation

- Interpret the prediction results.
- Discuss potential implications or actions based on the predictions.


<a id='conclusion'></a>
## Conclusion

- Summarize the key findings from the analysis.
- Reflect on the model's performance and potential improvements.
- Suggest next steps or recommendations.


<a id='references'></a>
## References

- List any resources, articles, or papers referenced during the analysis.


<a id='appendices'></a>
## Appendices

- Include any additional tables, figures, or code snippets that support the analysis.
