# <center>Machine Learning Project</center>

** **
## <center>*02 - Feature Selection*</center>

** **

The members of the `team` are:
- Ana Farinha - 20211514
- Francisco Capontes - 20211692
- Sofia Gomes - 20240848
- Rui Lourenço - 2021639

## <span style="color:salmon"> Table of Contents </span>

<a class="anchor" id="top"></a>


1. [Filter Methods](#1-filter-methods)<br>  
    1.1 [Univariate Variables](#11-univariate-variables)<br>  
    1.2 [Correlation Indices](#12-correlation-indices)<br>    
    1.3 [Chi-Squared](#13-chi-squared)<br><br>     
2. [Wrapper Methods](#2-wrapper-methods)<br>    
    2.1 [Logistic Regression](#21-logistic-regression)<br>    
    2.2 [Support Vector Machine](#22-support-vector-machine)<br><br>      
3. [Embedded Methods](#3-embedded-methods)<br>     
    3.1 [LassoCV](#31-lassocv)<br>  



In [18]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Sklearn packages
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.impute import KNNImputer
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, f1_score

# Models
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
import lightgbm as lgb

# embedded methods
from sklearn.linear_model import LassoCV
import scipy.stats as stats
from scipy.stats import chi2_contingency
from sklearn.feature_selection import RFE

import warnings
warnings.filterwarnings('ignore')

from utils import *
from utils_feature_selection import *
from utils_dicts import numerical_features, categorical_features

%load_ext autoreload
%autoreload 2



# In the training set we need to impute:
- Average Weekly Wage
- Age at Injury

- Birth Year, recomputed from the newly imputed Age at Injury

*Impute `Birth Year`*

In [8]:
# Import dataset
train_df = pd.read_csv('preprocessed_data/train_data.csv', index_col="Claim Identifier")
test_df =  pd.read_csv('preprocessed_data/test_data.csv', index_col="Claim Identifier")

In [9]:
missing_percentage = train_df.isna().sum() / len(train_df) * 100
for col, percent in missing_percentage.items():
    if not percent == 0:
        print(f"{col}: {percent:.2f}% missing values")

Age at Injury: 0.40% missing values
Average Weekly Wage: 63.43% missing values
Birth Year: 0.40% missing values
Industry Code: 1.73% missing values
WCIO Cause of Injury Code: 2.72% missing values
WCIO Nature of Injury Code: 2.73% missing values
WCIO Part Of Body Code: 2.98% missing values
Zip Code: 4.99% missing values


In [10]:
for col in train_df.columns:
    if not (col in numerical_features or col in categorical_features):
        print(f"'{col}',")

'Accident Date',
'County of Injury',
'District Name',
'Industry Code',
'WCIO Cause of Injury Code',
'WCIO Nature of Injury Code',
'WCIO Part Of Body Code',
'Zip Code',
'Claim Injury Type Encoded',


In [19]:
for col in numerical_features+categorical_features:
    if col not in train_df.columns:
        print(f"'{col}',")

'Enc County of Injury',
'Enc District Name',
'Enc Industry Code',
'Enc WCIO Cause of Injury Code',
'Enc WCIO Nature of Injury Code',
'Enc WCIO Part Of Body Code',
'Enc Zip Code',


In [None]:
X = train_df.drop(["Claim Injury Type Encoded"], axis = 1)
y = train_df["Claim Injury Type Encoded"]

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.25, stratify = y, shuffle = True)

In [None]:
apply_frequency_encoding(X_train,X_val)
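
`apply_frequency_encoding` lives in `utils` and is not shown in this notebook. The sketch below is only an assumption about what such a helper might do: it learns each category's relative frequency on the training split and maps it onto both splits, writing to `Enc <column>` columns like the ones listed above. The function name `frequency_encode` and the choice to map unseen categories to 0 are hypothetical.

```python
import pandas as pd

def frequency_encode(train, val, columns):
    """Hypothetical helper: add an 'Enc <col>' column holding each category's
    relative frequency, learned on the training split only."""
    for col in columns:
        # Share of training rows per category (NaN is ignored by value_counts)
        freq = train[col].value_counts(normalize=True)
        train[f"Enc {col}"] = train[col].map(freq)
        # Assumption: categories never seen in training get frequency 0
        val[f"Enc {col}"] = val[col].map(freq).fillna(0.0)
    return train, val

train = pd.DataFrame({"County of Injury": ["KINGS", "KINGS", "QUEENS", "BRONX"]})
val = pd.DataFrame({"County of Injury": ["KINGS", "ERIE"]})
train, val = frequency_encode(train, val, ["County of Injury"])
```

Fitting the frequencies on the training split alone keeps validation information out of the encoding.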

In [None]:
# Re-run the checks on X_train, since the encoding was applied to the split data
for col in X_train.columns:
    if not (col in numerical_features or col in categorical_features):
        print(f"'{col}',")

In [None]:
for col in numerical_features+categorical_features:
    if col not in X_train.columns:
        print(f"'{col}',")

# Impute Age at Injury, Birth Year and Average Weekly Wage

In [None]:
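to_impute = ["Age at Injury", "Average Weekly Wage"]
# Medians are computed on the training split only and reused for validation
imputation_value = X_train[to_impute].median()
for col in to_impute:
    # Assign back instead of using inplace=True on a column selection
    X_train[col] = X_train[col].fillna(imputation_value[col])
    X_val[col] = X_val[col].fillna(imputation_value[col])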

In [None]:
# Ensure 'Accident Date' is in datetime format
X_train['Accident Date'] = pd.to_datetime(X_train['Accident Date'], errors='coerce')
X_val['Accident Date'] = pd.to_datetime(X_val['Accident Date'], errors='coerce')

# Training split: rows where 'Birth Year' is missing but 'Age at Injury' and 'Accident Date' are known
condition = X_train['Birth Year'].isna() & X_train['Age at Injury'].notna() & X_train['Accident Date'].notna()
# Fill 'Birth Year' with the accident year minus the age at injury
X_train.loc[condition, 'Birth Year'] = X_train.loc[condition, 'Accident Date'].dt.year - X_train.loc[condition, 'Age at Injury']

# Validation split: same rule
condition = X_val['Birth Year'].isna() & X_val['Age at Injury'].notna() & X_val['Accident Date'].notna()
X_val.loc[condition, 'Birth Year'] = X_val.loc[condition, 'Accident Date'].dt.year - X_val.loc[condition, 'Age at Injury']

In [None]:
X_train.drop('Accident Date',axis=1,inplace=True)
X_val.drop('Accident Date',axis=1,inplace=True)

# Creating New Features

*Average Weekly Wage*

Relative Wage Compared to Median Wage:<br>
Flag whether the injured worker's wage is above the median wage of the training set. This may reflect job type or socioeconomic factors.

In [None]:
median_wage = X_train['Average Weekly Wage'].median()
# 1 = above the training median, 0 = at or below it
X_train['Relative_Wage'] = np.where(X_train['Average Weekly Wage'] > median_wage, 1, 0)
X_val['Relative_Wage'] = np.where(X_val['Average Weekly Wage'] > median_wage, 1, 0)

*Financial Impact*

In [None]:
financial_impact(X_train)
financial_impact(X_val)

__Binning:__ Group ages into categories such as "young" or "senior", since these categories might capture different risk profiles.

In [None]:
age_bins = [0, 25, 40, 55, 70, 100]
age_labels = [0,1,2,3,4] #['Young', 'Mid-Age', 'Experienced', 'Senior', 'Elderly']
X_train['Age_Group'] = pd.cut(X_train['Age at Injury'], bins=age_bins, labels=age_labels)
X_val['Age_Group'] = pd.cut(X_val['Age at Injury'], bins=age_bins, labels=age_labels)
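
One detail of `pd.cut` is worth checking against the bins above: intervals are right-inclusive by default and the lowest edge is excluded, so an age of exactly 0 falls outside every bin and becomes `NaN`. The small example below uses made-up ages to show this, along with `include_lowest=True` as one possible way to keep such rows. Whether zero ages exist in this dataset at this point is not established here.

```python
import pandas as pd

age_bins = [0, 25, 40, 55, 70, 100]
age_labels = [0, 1, 2, 3, 4]
ages = pd.Series([0, 25, 26, 70, 100])  # illustrative values only

# Default behaviour: bins are (0, 25], (25, 40], ... so age 0 is not assigned
default_groups = pd.cut(ages, bins=age_bins, labels=age_labels)

# include_lowest=True turns the first bin into [0, 25]
inclusive_groups = pd.cut(ages, bins=age_bins, labels=age_labels, include_lowest=True)
```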

# Scaling

In [None]:
st = StandardScaler()
X_train[numerical_features] = st.fit_transform(X_train[numerical_features])
X_val[numerical_features] = st.transform(X_val[numerical_features])

# Feature Selection

In [None]:
n_features = len(X_train.columns)

*Univariate variables*

In [None]:
X_train[numerical_features].var().sort_values()
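
Note that this variance check runs after `StandardScaler`, which rescales every numerical feature to unit variance, so the values returned here will all be close to 1 regardless of how much each feature varied originally. The synthetic example below illustrates this; it suggests computing the variances before scaling if they are meant to flag near-constant features.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "wide": rng.normal(0, 100, size=1000),     # large spread
    "narrow": rng.normal(0, 0.01, size=1000),  # almost constant
})

# Variance on the raw data separates the two features clearly
raw_var = raw.var()

# After standardisation both variances collapse to about 1
scaled = pd.DataFrame(StandardScaler().fit_transform(raw), columns=raw.columns)
scaled_var = scaled.var()
```

The scaled variances equal n/(n-1) rather than exactly 1, because `StandardScaler` divides by the population standard deviation while `DataFrame.var` uses the sample estimate.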

*Correlation Matrix*

In [None]:
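# Initial correlation matrix; Spearman is requested explicitly because the summary
# tables below report Spearman (pandas defaults to Pearson)
corr_matrix = X_train[numerical_features].corr(method='spearman')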

mask = np.tri(*corr_matrix.shape, k=0, dtype=bool)
# Keeps values where mask is True
corr_matrix = corr_matrix.where(mask)

# defines the figure size
fig, ax = plt.subplots(figsize=(20, 20))
# heatmap of the initial correlation matrix
l = sns.heatmap(corr_matrix, square=True, annot=True, fmt=".2f", vmax=1, vmin=-1, cmap='RdBu', ax=ax)
plt.title('Correlation Between Variables', size=14)
plt.show()
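
Reading redundant pairs off a 20x20 heatmap is error-prone, so it can help to list them directly. The sketch below is one possible way to do that: it returns every feature pair whose absolute Spearman correlation reaches a cut-off. The helper name `correlated_pairs` and the 0.8 threshold are assumptions for illustration, not values taken from this project.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8, method="spearman"):
    """Hypothetical helper: list feature pairs with |correlation| >= threshold."""
    corr = df.corr(method=method)
    # Strict lower triangle: each pair appears once and the diagonal is dropped
    mask = np.tril(np.ones(corr.shape, dtype=bool), k=-1)
    pairs = corr.where(mask).stack()
    # NaN entries (masked cells) never pass the comparison, so they are filtered out
    return pairs[pairs.abs() >= threshold].sort_values(key=np.abs, ascending=False)

demo = pd.DataFrame({"a": np.arange(100)})
demo["b"] = demo["a"] ** 3         # monotonic in a, so Spearman is 1
demo["c"] = np.tile([0, 1], 50)    # alternating values, nearly unrelated to a
high_pairs = correlated_pairs(demo, threshold=0.8)
```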

*XGBoost RFE*

In [None]:
# Pass the hyperparameters directly; the classifier must not be wrapped in another XGBClassifier
XGB = XGBClassifier(max_depth=5, learning_rate=0.2, n_estimators=200)

In [None]:
best_XGB = feature_selection_RFE(X_train,y_train,n_features,model=XGB)

In [None]:
best_XGB
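
`feature_selection_RFE` is defined in `utils_feature_selection` and is not shown here. As a rough illustration of the underlying technique only, the sketch below runs scikit-learn's `RFE` with a tree-based estimator on synthetic data and returns the surviving column names. The helper `rfe_select`, the synthetic dataset, and the choice of three features are all assumptions. The project's helper may also search over the number of features, which this sketch does not do.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

def rfe_select(X, y, n_features_to_select, model):
    """Hypothetical helper: return the names of the features RFE keeps."""
    # RFE refits the model repeatedly, dropping the least important feature each round
    selector = RFE(estimator=model, n_features_to_select=n_features_to_select, step=1)
    selector.fit(X, y)
    return X.columns[selector.support_].tolist()

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(8)])

selected = rfe_select(X_demo, y_demo, 3, DecisionTreeClassifier(max_depth=5, random_state=0))
```

Any estimator exposing `feature_importances_` or `coef_` can be passed as `model`, which is why the same routine works for both gradient-boosting libraries used in this notebook.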

*LightGBM RFE*

In [None]:
LGB = lgb.LGBMClassifier(verbose=-1)

In [None]:
best_LGB = feature_selection_RFE(X_train,y_train,n_features,model=LGB)

In [None]:
best_LGB

*Decision Tree feature importance*

In [None]:
dt = DecisionTreeClassifier(max_depth=5, min_samples_leaf=10, class_weight='balanced').fit(X_train,y_train)

In [None]:
feature_importances = dt.feature_importances_
feature_importance_df = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': feature_importances
})

feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

In [None]:
feature_importance_df

*Lasso*

In [None]:
feature_selection_Lasso(X_train,y_train)
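
`feature_selection_Lasso` is also imported from `utils_feature_selection`, so its body is not visible here. A minimal sketch of Lasso-based selection, under the assumption that the helper fits `LassoCV` and inspects the coefficients, could look like this. The name `lasso_coefficients` and the synthetic data are hypothetical. Features whose coefficient shrinks to zero are the candidates to discard.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV

def lasso_coefficients(X, y, cv=5):
    """Hypothetical helper: fit LassoCV and return coefficients sorted by magnitude."""
    model = LassoCV(cv=cv, random_state=0).fit(X, y)
    return pd.Series(model.coef_, index=X.columns).sort_values(key=np.abs, ascending=False)

# Synthetic data: only x0 and x1 drive the target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x0", "x1", "x2", "x3"])
y_demo = 3 * X_demo["x0"] - 2 * X_demo["x1"] + rng.normal(scale=0.1, size=200)

coefs = lasso_coefficients(X_demo, y_demo)
kept = coefs[coefs != 0].index.tolist()  # features with a non-zero coefficient
```

Lasso is a regression model, so applying it to an encoded class label treats the codes as numbers. Its verdicts are best read alongside the other methods rather than on their own.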

*Chi-squared test*

In [None]:
for col in categorical_features:
    TestIndependence(X_train[col],y_train,col,alpha=0.05)
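
`TestIndependence` comes from the project's utility modules. The sketch below shows one common way to implement such a check with `scipy.stats.chi2_contingency`: build a contingency table of the feature against the target and compare the p-value with `alpha`. The function name `chi2_independence` and the returned dictionary are assumptions, and the project's helper may print its verdict instead of returning it.

```python
import pandas as pd
from scipy.stats import chi2_contingency

def chi2_independence(feature, target, alpha=0.05):
    """Hypothetical helper: chi-squared test of independence between two categoricals."""
    table = pd.crosstab(feature, target)  # observed counts
    chi2, p_value, dof, _expected = chi2_contingency(table)
    # A small p-value rejects independence, so the feature is worth keeping
    return {"chi2": chi2, "p_value": p_value, "dof": dof, "keep": bool(p_value < alpha)}

target = pd.Series([0] * 50 + [1] * 50)
dependent = pd.Series(["a"] * 50 + ["b"] * 50)  # perfectly aligned with the target
independent = pd.Series(["a", "b"] * 50)        # evenly spread across both classes

dep_result = chi2_independence(dependent, target)
ind_result = chi2_independence(independent, target)
```

With a training set this large, almost any feature can reach significance, which may explain why every categorical predictor in the summary table is marked "Keep".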

<hr>

### Numerical Data

| Predictor | Spearman | RFE XGB | RFE LGB | Lasso | Feature Importance DT | What to do? (One possible way to "solve") |
| --- | --- | --- | --- |--- |--- |---|
| Age at Injury | Keep | Keep | Keep | Discard | Discard | Try with and without |
| IME-4 Count | Keep | Keep | Keep | Include | Keep | Include in the model |
| Days_to_First_Hearing | Keep | Keep | Keep | Include | Keep | Include in the model |
| Days_to_C2 | Keep | Discard | Discard | Discard | Discard | Discard |
| Days_to_C3 | Keep | Discard | Discard | Discard | Discard | Discard |
| Average Weekly Wage | Keep | Keep | Keep | Include | Keep | Include in the model |
| Birth Year | Keep | Keep | Keep | Discard | Discard | Try with and without |
| Number of Dependents | Keep | Discard | Keep | Discard | Discard | Discard |
| C-2 Date_Year | Keep | Keep | Keep | Include | Keep | Include in the model |
| C-2 Date_Month | Keep | Discard | Discard | Discard | Discard | Discard |
| C-2 Date_Day | Keep | Discard | Discard | Discard | Discard | Discard |
| C-2 Date_DayOfWeek | Keep | Discard | Discard | Discard | Discard | Discard |
| C-3 Date_Year | Keep | Keep | Keep | Include | Keep | Include in the model |
| C-3 Date_Month | Keep | Discard | Discard | Discard | Discard | Discard |
| C-3 Date_Day | Keep | Keep | Discard | Include | Discard | Try with and without |
| First Hearing Date_Year | Keep | Keep | Keep | Include | Keep | Include in the model |
| First Hearing Date_Month | Keep | Keep | Keep | Discard | Discard | Try with and without |
| First Hearing Date_Day | Keep | Discard | Discard | Discard | Discard | Discard |
| First Hearing Date_DayOfWeek | Keep | Discard | Discard | Discard | Discard | Discard |

<hr>

### Categorical Data

| Predictor | Spearman | Chi-Square |
| --- | --- | --- |
| County of Injury | Keep | Keep |
| District Name | Keep | Keep |
| Industry Code | Keep | Keep |
| Medical Fee Region | Keep | Keep |
| Attorney/Representative | Keep | Keep |
| COVID-19 Indicator | Keep | Keep |
| Known C-2 Date | Keep | Keep |
| Known C-3 Date | Keep | Keep |
| Known First Hearing Date | Keep | Keep |
| Accident Date_Year | Keep | Keep |
| Accident Date_Month | Keep | Keep |
| Accident Date_Day | Keep | Keep |
| Gender_F | Keep | Keep |
| Gender_M | Keep | Keep |
| Weekend_Accident | Keep | Keep |

<hr>

In [None]:
#test_encoder = LabelEncoder()
#test_encoder.classes_ = target_decoder()