# Data Analytics II - Final Project Part I

## Methodology:

- Feature scaling

- Feature importance using Gini Index

- Deletion of unnecessary features

- Correlation between features

- Visualization of some of the best features

- For each classifier:

    - SFS (Sequential Feature Selection) analysis using default hyperparameters

    - Grid search using the selected features

    - Analysis of classification metrics

- Fitting of the best model with all the training data

- Predictions on the test dataset

In [2]:
pip install plotly


## Imports

In [3]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import seaborn as sb
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Training dataset preprocessing

In [4]:
df = pd.read_csv('train_set.csv')
y  = df['y']
X  = pd.DataFrame(
    data    = StandardScaler().fit_transform(df.drop('y', axis=1)),
    columns = df.columns.drop('y')
)

In [None]:
df_label_counts = pd.DataFrame(
    data    = np.transpose(np.unique(y, return_counts=True)),
    columns = ['label', 'count'] 
).sort_values(
    by        = 'count',
    ascending = False
).reset_index(
    drop = True
)

df_label_counts['pct'] = 100*df_label_counts['count']/y.size

df_label_counts

## Feature Importance

In [None]:
rf = RandomForestClassifier()
rf.fit(X, y)

In [None]:
max_features = int(df.columns.size/3)

df_feature_importances = pd.DataFrame(
    data    = zip(df.columns.drop('y'), rf.feature_importances_),
    columns = ['feature', 'importance'] 
).sort_values(
    by        = 'importance',
    ascending = False,
).reset_index(
    drop = True
)[:max_features]

fig = px.line(
    data_frame = df_feature_importances,
    x          = 'feature',
    y          = 'importance'
)
fig.show()

## Deletion of unnecessary features

In [None]:
# Select from the scaled X (not the raw df) so the feature scaling above is kept
X = X[df_feature_importances['feature']]
X.shape

## Correlation between features

In [None]:
fig, ax = plt.subplots(figsize=(18,14))
sb.heatmap(X.corr(), cmap="Blues", annot=True, linewidths=0.1, ax=ax)

The selected features show no strong pairwise correlations, so no features need to be removed on that basis to limit overfitting.
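
The heatmap reading can be backed by a number. The sketch below computes the largest absolute off-diagonal Pearson correlation; the helper name `max_abs_offdiag_corr` and the synthetic `demo` frame are assumptions for illustration, and in the notebook the helper would be called on `X`.

```python
import numpy as np
import pandas as pd


def max_abs_offdiag_corr(frame: pd.DataFrame) -> float:
    """Largest absolute pairwise Pearson correlation, ignoring the diagonal."""
    corr = frame.corr().abs().to_numpy().copy()
    np.fill_diagonal(corr, 0.0)  # drop the trivial self-correlations
    return float(corr.max())


# Hypothetical stand-in data; in the notebook, pass X instead.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(500, 4)), columns=list("abcd"))
# Column "e" is built from "a", so it is strongly correlated with it.
demo["e"] = 0.9 * demo["a"] + rng.normal(scale=0.1, size=500)

print(max_abs_offdiag_corr(demo[list("abcd")]))  # small for independent columns
print(max_abs_offdiag_corr(demo))                # close to 1 once "e" is included
```

What counts as "too correlated" is a judgment call; a value well below 1 for every pair is consistent with the conclusion above.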

## SFS and Gridsearch to each classifier

Note: SFS is skipped for Random Forest, and hyperparameter tuning is skipped for discriminant analysis.

In [None]:
n_jobs               = -1
cv                   = 5
verbose              = 3
n_features_to_select = int(df.columns.size/10)
scoring              = 'accuracy'

### Random Forest

No need for SFS when using trees!

In [None]:
gs_rf = GridSearchCV(
    estimator  = RandomForestClassifier(),
    n_jobs     = n_jobs,
    cv         = cv,
    verbose    = verbose,
    scoring    = scoring,
    param_grid = dict(
        n_estimators      = [100],
        criterion         = ['gini'],
        max_depth         = [18], #[18, 20, 22],
        min_samples_split = [0.0001], #[0.0001, 0.0005, 0.0010],
        class_weight      = [None],
        max_samples       = [0.66]
    )
)
gs_rf.fit(X, y)
gs = gs_rf
print('Estimator:', gs.estimator.__class__.__name__)
print('Best result: {:.3f} +- {:.3f}'.format(
    gs.cv_results_['mean_test_score'][gs.best_index_],
    gs.cv_results_['std_test_score'][gs.best_index_],
))
print('Best hyperparameters:', gs.best_params_)

### LDA

No need for gridsearch when using discriminant analysis!

In [None]:
sfs_lda = SequentialFeatureSelector(
    estimator            = LinearDiscriminantAnalysis(),
    direction            = 'forward',
    n_features_to_select = n_features_to_select,
    n_jobs               = n_jobs,
    cv                   = cv,
    scoring              = scoring,
)
sfs_lda.fit(X, y)

In [None]:
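from sklearn.base import clone

# Cumulative cross-validated accuracy as the selected features are added one by one.
# Note: get_feature_names_out() returns the features in X's column order
# (here, decreasing importance), not in the order SFS picked them.
kf       = KFold(n_splits=cv, shuffle=True)
features = []
accs     = []
stds     = []

for current_feature in sfs_lda.get_feature_names_out():
    features.append(current_feature)
    inner_accs = []
    
    for train_index, test_index in kf.split(X):
        # KFold yields positional indices, so use iloc rather than loc
        X_train, X_test = X.iloc[train_index][features], X.iloc[test_index][features]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        clf = clone(sfs_lda.estimator)  # fresh, unfitted copy for every fold
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        inner_accs.append(accuracy_score(y_test, y_pred))
        
    accs.append(np.mean(inner_accs))
    stds.append(np.std(inner_accs))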

In [None]:
fig = px.line(
    x       = sfs_lda.get_feature_names_out(),
    y       = accs,
    error_y = stds,
    title   = 'Sequential Feature Selector: ' + sfs_lda.estimator.__class__.__name__,
    labels  = dict(
        x = 'Added feature', 
        y = 'Accuracy'
    )
)
fig.show()

In [None]:
features_lda = sfs_lda.get_feature_names_out()[:8]
gs_lda = GridSearchCV(
    estimator  = LinearDiscriminantAnalysis(),
    n_jobs     = n_jobs,
    cv         = cv,
    verbose    = verbose,
    scoring    = scoring,
    param_grid = dict()
)
gs_lda.fit(X[features_lda], y)
gs = gs_lda
print('Estimator:', gs.estimator.__class__.__name__)
print('Best result: {:.3f} +- {:.3f}'.format(
    gs.cv_results_['mean_test_score'][gs.best_index_],
    gs.cv_results_['std_test_score'][gs.best_index_],
))
print('Best features:', features_lda)
print('Best hyperparameters:', gs.best_params_)
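
The methodology lists an analysis of classification metrics for each classifier, and `confusion_matrix` and `classification_report` are imported but not yet used. A minimal sketch of that step, assuming out-of-fold predictions from `cross_val_predict` are an acceptable basis for the metrics; `X_demo` and `y_demo` are synthetic stand-ins for `X[features_lda]` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_predict

# Hypothetical stand-in for X[features_lda] and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)

# Each sample is predicted by a model that did not see it during training.
y_oof = cross_val_predict(LinearDiscriminantAnalysis(), X_demo, y_demo, cv=5)

cm = confusion_matrix(y_demo, y_oof)
print(cm)
print(classification_report(y_demo, y_oof))
```

Per-class precision and recall are worth checking alongside accuracy when the label counts computed earlier are unbalanced.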

### Support Vector Machine

SFS with default parameters, then grid search with the selected features.
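
A possible shape for this step, mirroring the LDA cells. The grid values for `C`, `kernel` and `gamma` are illustrative assumptions, not tuned choices, and `X_demo`, `y_demo` are synthetic stand-ins for `X` and `y`.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical stand-in for X and y.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

# Step 1: forward SFS with default SVC hyperparameters.
sfs_svc = SequentialFeatureSelector(
    estimator            = SVC(),
    direction            = 'forward',
    n_features_to_select = 3,
    cv                   = 3,
    scoring              = 'accuracy',
)
sfs_svc.fit(X_demo, y_demo)
X_sel = sfs_svc.transform(X_demo)  # keep only the selected columns

# Step 2: grid search restricted to the selected features.
gs_svc = GridSearchCV(
    estimator  = SVC(),
    cv         = 3,
    scoring    = 'accuracy',
    param_grid = dict(
        C      = [0.1, 1, 10],        # assumed grid
        kernel = ['rbf', 'linear'],   # assumed grid
        gamma  = ['scale'],
    )
)
gs_svc.fit(X_sel, y_demo)
print('Best result: {:.3f}'.format(gs_svc.best_score_))
print('Best hyperparameters:', gs_svc.best_params_)
```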
    
### K Nearest Neighbors

SFS with default parameters, then grid search with the selected features.
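
KNN is distance-based, so one option is to wrap the scaler and the classifier in a `Pipeline`; the scaler is then refit inside every cross-validation fold. This is a sketch under that assumption: the step names `scale` and `knn`, the helper `make_knn` and the grid values are all illustrative, and `X_demo`, `y_demo` are synthetic stand-ins for the notebook's data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the (unscaled) features and labels.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)


def make_knn():
    """Scaler + KNN in one estimator, so scaling is fit on training folds only."""
    return Pipeline([('scale', StandardScaler()), ('knn', KNeighborsClassifier())])


# Step 1: forward SFS with default KNN hyperparameters.
sfs_knn = SequentialFeatureSelector(
    estimator            = make_knn(),
    direction            = 'forward',
    n_features_to_select = 3,
    cv                   = 3,
    scoring              = 'accuracy',
)
sfs_knn.fit(X_demo, y_demo)
X_sel = sfs_knn.transform(X_demo)

# Step 2: grid search on the selected features; pipeline parameters
# are addressed as <step name>__<parameter>.
gs_knn = GridSearchCV(
    estimator  = make_knn(),
    cv         = 3,
    scoring    = 'accuracy',
    param_grid = {
        'knn__n_neighbors': [3, 5, 11],              # assumed grid
        'knn__weights':     ['uniform', 'distance'],  # assumed grid
    }
)
gs_knn.fit(X_sel, y_demo)
print('Best result: {:.3f}'.format(gs_knn.best_score_))
print('Best hyperparameters:', gs_knn.best_params_)
```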
    
### And so on...

Finally, test XGBoost: it will probably perform best, but it is not part of sklearn, so it can be done later.
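
The last two methodology steps (fitting the best model on all the training data and predicting on the test dataset) are not implemented yet. A minimal sketch of how they could look, assuming the fitted searches are compared by `best_score_`. The data is synthetic, the `searches` dictionary and grid values are hypothetical, and for brevity every classifier here uses the same columns, whereas the notebook selects a feature subset per classifier.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical stand-in for the training and test datasets.
X_all, y_all = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)
X_train, X_test, y_train, _ = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0
)

# One grid search per candidate classifier (grids are illustrative).
searches = {
    'rf': GridSearchCV(
        RandomForestClassifier(n_estimators=50, random_state=0),
        param_grid=dict(max_depth=[4, 8]), cv=3, scoring='accuracy',
    ),
    'lda': GridSearchCV(
        LinearDiscriminantAnalysis(),
        param_grid=dict(solver=['svd', 'lsqr']), cv=3, scoring='accuracy',
    ),
}
for gs in searches.values():
    gs.fit(X_train, y_train)

# Pick the search with the best mean cross-validated score.
best_name = max(searches, key=lambda name: searches[name].best_score_)

# With refit=True (the default), best_estimator_ has already been refit
# on all of X_train using the best hyperparameters.
best_model = searches[best_name].best_estimator_
y_pred = best_model.predict(X_test)
print(best_name, y_pred[:10])
```

If the training features were standardized, the test features need the same fitted scaler applied before calling `predict`.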