___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

# WELCOME!

In this project, you must apply EDA processes for the development of predictive models. Handling outliers, domain knowledge and feature engineering will be challenges.

Also, this project aims to improve your ability to implement algorithms for Multi-Class Classification. Thus, you will have the opportunity to implement many algorithms commonly used for Multi-Class Classification problems.

Before diving into the project, please review the project description (Determines) and the tasks below.

# Determines

The 2012 US Army Anthropometric Survey (ANSUR II) was executed by the Natick Soldier Research, Development and Engineering Center (NSRDEC) from October 2010 to April 2012 and is comprised of personnel representing the total US Army force to include the US Army Active Duty, Reserves, and National Guard. In addition to the anthropometric and demographic data described below, the ANSUR II database also consists of 3D whole body, foot, and head scans of Soldier participants. These 3D data are not publicly available out of respect for the privacy of ANSUR II participants. The data from this survey are used for a wide range of equipment design, sizing, and tariffing applications within the military and has many potential commercial, industrial, and academic applications.

The ANSUR II working databases contain 93 anthropometric measurements which were directly measured, and 15 demographic/administrative variables explained below. The ANSUR II Male working database contains a total sample of 4,082 subjects. The ANSUR II Female working database contains a total sample of 1,986 subjects.


DATA DICT:
https://data.world/datamil/ansur-ii-data-dictionary/workspace/file?filename=ANSUR+II+Databases+Overview.pdf

---

To achieve high prediction success, you must understand the data well and develop different approaches that can affect the dependent variable.

First, try to understand the dataset column by column using the pandas module. Research the domain (body measurements and race characteristics) online to get to know the dataset as quickly as possible.

You will implement ***Logistic Regression, Support Vector Machine, XGBoost, Random Forest*** algorithms. Also, evaluate the success of your models with appropriate performance metrics.

At the end of the project, choose the most successful model, try to improve its scores with ***SMOTE***, and make it ready to deploy. Furthermore, use ***SHAP*** to explain how the best model you chose works.

# Tasks

#### 1. Exploratory Data Analysis (EDA)
- Import Libraries, Load Dataset, Exploring Data

    *i. Import Libraries*
    
    *ii. Ingest Data*
    
    *iii. Explore Data*
    
    *iv. Outlier Detection*
    
    *v.  Drop unnecessary features*

#### 2. Data Preprocessing
- Scale (if needed)
- Separate the data frame for evaluation purposes

#### 3. Multi-class Classification
- Import libraries
- Implement SVM Classifier
- Implement Decision Tree Classifier
- Implement Random Forest Classifier
- Implement XGBoost Classifier
- Compare The Models



# EDA
- Drop unnecessary columns
- Drop DODRace class if value count below 500 (we assume that our data model can't learn if it is below 500)

## Import Libraries
Besides NumPy and pandas, you need to import the necessary modules for data visualization, data preprocessing, and model building and tuning.

*Note: Check out the course materials.*

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact

from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    recall_score,
    precision_score,
    make_scorer,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
    average_precision_score,
    roc_curve,
    auc,
)

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

import warnings

warnings.filterwarnings("ignore")

pd.set_option("display.max_rows", 1000)
pd.set_option("display.max_columns", 1000)
pd.set_option("display.width", 1000)

## Ingest Data from links below and make a dataframe
- Soldiers Male : https://query.data.world/s/h3pbhckz5ck4rc7qmt2wlknlnn7esr
- Soldiers Female : https://query.data.world/s/sq27zz4hawg32yfxksqwijxmpwmynq

In [None]:
# Ingest the data into two variables, then concatenate them into a single DataFrame
df_male = pd.read_csv('https://query.data.world/s/h3pbhckz5ck4rc7qmt2wlknlnn7esr', encoding='latin-1')
df_female = pd.read_csv('https://query.data.world/s/sq27zz4hawg32yfxksqwijxmpwmynq')
# ignore_index=True gives the combined frame a unique 0..n-1 index instead of repeating each file's labels
merged_df = pd.concat([df_female, df_male], ignore_index=True)
merged_df
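
One caveat with `pd.concat` is that each frame keeps its own row labels by default, so the combined index can contain repeated labels. The sketch below uses two made-up frames (the column values are invented, not ANSUR II data) to show the effect and how `ignore_index=True` avoids it.

```python
import pandas as pd

# Two tiny stand-in frames; values are invented for illustration only.
female = pd.DataFrame({"stature": [1560, 1665], "DODRace": [1, 2]})
male = pd.DataFrame({"stature": [1776, 1702], "DODRace": [1, 3]})

stacked = pd.concat([female, male])                    # index: 0, 1, 0, 1
clean = pd.concat([female, male], ignore_index=True)   # index: 0, 1, 2, 3

print(stacked.index.is_unique)  # False
print(clean.index.is_unique)    # True

# Dropping by label on the duplicated index removes every row carrying that label.
print(len(stacked.drop(index=[0])))  # 2 rows left, not 3
```

This matters later in the notebook wherever rows are dropped by index label.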

## Explore Data

In [None]:
# We can see here that we have 108 features (Target excluded)
merged_df.shape

In [None]:
merged_df.info()

In [None]:
# Summary statistics per column; a min or max that sits far from the quartiles hints at outliers
merged_df.describe()
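
The task list asks for outlier detection, so here is a minimal sketch of one common approach, the 1.5 × IQR rule. The helper name `iqr_outlier_counts` and the synthetic frame are assumptions made for illustration; on the real data you would pass `merged_df` instead.

```python
import numpy as np
import pandas as pd

def iqr_outlier_counts(df, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum().sort_values(ascending=False)

# Synthetic stand-in data with one deliberately extreme value in column "b".
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "a": rng.normal(100, 5, size=200),
    "b": rng.normal(50, 2, size=200),
})
demo.loc[0, "b"] = 500

counts = iqr_outlier_counts(demo)
print(counts)
```

Whether a flagged value is a measurement error or a genuine extreme body measurement is a domain decision, so the counts are a starting point rather than a drop list.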

In [None]:
# We notice that we have null values
merged_df.isnull().sum().any()

In [None]:
# No duplicates (I mean, obviously? (fool me once, shame on you; fool me twice, shame on me, fool me 108 times??))
merged_df.duplicated().sum()

In [None]:
merged_df
# Dropped because they are not related to race:
#-------------------------------------------
# Age (recorded when the data was collected; limited to the age range of serving soldiers, so it isn't reasonable to use)

# Date (the date the data was collected, so it has no influence on race in any way)

# SubjectID (we don't need to identify individuals)

# WritingPreference (can vary for many reasons other than race)

# Weightlbs (self-reported, so not 100% reliable; the weight in kg is preferred since it uses the metric system)

# Heightin (self-reported, so not reliable; we can use the stature instead, as it represents the
# natural height of a person)

# Branch, Component, Installation, PrimaryMOS (the placement of soldiers does not vary based on race)

# SubjectNumericRace (it lists all the races reported for a single observation, so it isn't helpful for finding the
# differences between races, and since it also includes the DODRace it would cause data leakage)

# If these reasons aren't convincing, let's call them my assumptions :)

In [None]:
# Now let's look at the features with null values.
# SubjectId and subjectid are going to be dropped anyway, but for the sake of explanation: they contain nulls
# because the female and male files spell the column name differently, so concat treats them as two columns.
# That leaves Ethnicity, which has 4647 null values out of 6068 observations.
# Since more than 70% of its values are missing, we are saying goodbye to Ethnicity.
drop_list = []
for col in merged_df:
    if merged_df[col].isnull().sum() > 0:
        print(f"{col} = {merged_df[col].isnull().sum()}")
        drop_list.append(col)


drop_list
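
The same check can be written without a loop by taking the share of missing values per column. This is a sketch on an invented frame; the 0.7 threshold mirrors the "more than 70%" reasoning above and is otherwise an assumption.

```python
import numpy as np
import pandas as pd

# Invented frame: "Ethnicity" is mostly missing, "Gender" is complete.
demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Ethnicity": [np.nan, np.nan, "Mexican", np.nan],
})

null_share = demo.isnull().mean()   # fraction of missing values per column
mostly_missing = null_share[null_share > 0.7].index.tolist()

print(null_share)
print(mostly_missing)  # ['Ethnicity']
```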

In [None]:
# Now dropping all the features we found unrelated or with too many null values:
print(f"The shape of DataFrame BEFORE dropping null/unnecessary features: rows are {merged_df.shape[0]} and columns are {merged_df.shape[1]}")
merged_df.drop(columns=["Age", "Date", "subjectid", "SubjectId" ,"WritingPreference", "Weightlbs", "Branch","Component","Ethnicity","Installation", "SubjectNumericRace", "PrimaryMOS", "Heightin"
], inplace=True)
print(f"The shape of DataFrame AFTER dropping null/unnecessary features : rows are {merged_df.shape[0]} and columns are {merged_df.shape[1]}")



In [None]:
# We are asked to remove DODRace classes with less than 500 observations assuming the model can't learn from them
# Let's take a look at those classes first:

"""
1 = White 
2 = Black 
3 = Hispanic 
4 = Asian 
5 = Native American 
6 = Pacific Islander
8 = Other
"""
merged_df.DODRace.value_counts()


In [None]:
# Cool, let's see clearly?
print(merged_df["DODRace"].value_counts())
merged_df["DODRace"].value_counts().plot(kind="bar"
                                         , figsize=(10, 10))
plt.xlabel("Races");
plt.ylabel("Number of Soldiers");

In [None]:
"""
Even MORE clearly !

1 = White 
2 = Black 
3 = Hispanic 
4 = Asian
5 = Native American
6 = Pacific Islander 
8 = Other
"""
print(merged_df["DODRace"].value_counts())
merged_df["DODRace"].value_counts().plot(kind="pie" , autopct="%1.1f%%", figsize=(5, 5))
plt.ylabel("");



In [None]:
# Classes 4, 5, 6, and 8 have fewer than 500 observations, so they must be dropped

race_counts = merged_df.DODRace.value_counts()
drop_DODRace = race_counts[race_counts < 500].index
drop_DODRace

In [None]:
print(f"The shape of DataFrame BEFORE dropping classes with less than 500 observations : rows are {merged_df.shape[0]} and columns are {merged_df.shape[1]}")
# Filter with a boolean mask instead of dropping by index label:
# label-based drops remove extra rows when the index contains duplicate labels
merged_df = merged_df[~merged_df["DODRace"].isin(drop_DODRace)].reset_index(drop=True)
print(f"The shape of DataFrame AFTER dropping classes with less than 500 observations : rows are {merged_df.shape[0]} and columns are {merged_df.shape[1]}")



In [None]:
merged_df.DODRace.value_counts()

In [None]:
"""
1 = White 
2 = Black 
3 = Hispanic
"""
print(merged_df["DODRace"].value_counts())
merged_df["DODRace"].value_counts().plot(kind="bar"
                                         , figsize=(5, 5))
plt.xlabel("Races");
plt.ylabel("Number of Soldiers");

In [None]:
print(merged_df["DODRace"].value_counts())
merged_df["DODRace"].value_counts().plot(kind="pie" , autopct="%1.1f%%", figsize=(5, 5))
plt.ylabel("");



In [None]:
# Now let's do some mapping! 

# As you may have noticed we have the Gender as [Male, Female], but we like numbers! 
merged_df["Gender"] =merged_df["Gender"].map({"Female":0,"Male":1})
merged_df["Gender"]

In [None]:
# Now for SubjectsBirthLocation, let's see what values we have
# We notice that US states are listed individually alongside countries, which doesn't make sense
# because the states all belong to one country
#------
# There are multiple ways to solve this issue:
# either group the states by their part of the US (West, East, North, and South)
# or group the countries and states by continent (Asia, Europe, etc.)
# Finally, my chosen way is to split on whether a soldier is from the US (whichever state) or from a foreign country

locations =  merged_df['SubjectsBirthLocation'].unique()
result = ", ".join(locations)
print(result)

In [None]:
# The reason I'm going with the last solution is that most of the observations are from the US 
# (obvious right? since it's the US soldiers data) 
merged_df['SubjectsBirthLocation'].value_counts()

In [None]:
states = [
    'Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida',
    'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine',
    'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska',
    'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio',
    'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas',
    'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming', 'American Samoa',
    'District of Columbia', 'Guam', 'Northern Mariana Islands', 'Puerto Rico', 'US Virgin Islands'
]



In [None]:
merged_df['SubjectsBirthLocation'] = merged_df['SubjectsBirthLocation'].apply(lambda x: 1 if any(state.lower() in x.lower() for state in states) else 0)


In [None]:
# Check the result: 1 = born in the US, 0 = born abroad
merged_df['SubjectsBirthLocation'].value_counts()
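
The encoding above uses a substring test, which counts any location whose text contains a state name as US-born. If that turns out to be too permissive, an exact-match variant is one alternative. The sketch below uses a short invented subset of places and made-up birth locations; whether the real column needs this stricter matching is an assumption to verify against the data.

```python
import pandas as pd

# Short illustrative subset; the full list of states is defined in the notebook.
us_places = {"texas", "new mexico", "georgia", "puerto rico"}

births = pd.Series(["Texas", "Mexico", "New Mexico", "Germany", "puerto rico "])

# Normalise case and whitespace, then test exact membership.
is_us_born = births.str.strip().str.lower().isin(us_places).astype(int)
print(is_us_born.tolist())  # [1, 0, 1, 0, 1]
```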

# DATA Preprocessing
- In this step, we divide our data into X (features) and y (target).
- For training and evaluation purposes, we create train and test sets.
- Lastly, we scale the data if the features are not on the same scale. Why?
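
On the "why": distance-based and regularised models (such as SVM and Logistic Regression) are sensitive to feature magnitudes, so a measurement in the thousands can swamp a 0/1 flag. The sketch below uses synthetic data to show the usual pattern of fitting the scaler on the training split only and reusing its statistics on the test split; the feature values are invented.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (a large measurement and a 0/1 flag).
rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.normal(1700, 90, size=500),
    rng.integers(0, 2, size=500),
])
y_demo = rng.integers(0, 3, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101, stratify=y_demo
)

scaler = StandardScaler().fit(X_tr)    # statistics come from the training split only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)   # the test split reuses the training statistics

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

Putting the scaler inside a `Pipeline`, as the modelling cells do, applies the same rule automatically within each cross-validation fold.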

In [None]:
X = merged_df.drop(columns=["DODRace"])
y = merged_df.DODRace

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=101, stratify=y
)

In [None]:
print("Train independent variables shape : ", X_train.shape)
print("Train dependent variable shape   : ", y_train.shape)
print("Test independent variables shape  : ", X_test.shape)
print("Test dependent variable shape    : ", y_test.shape)

# Modelling
- Fit the model on the train dataset
- Get predictions from the vanilla model on both the train and test sets to check for over/underfitting
- Apply GridSearchCV for both hyperparameter tuning and as a sanity test of the model.
- Use the hyperparameters found by the grid search to make the final prediction, and evaluate the result according to the chosen metric.
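
The steps above can be sketched end to end on a synthetic three-class problem. Everything here (the generated data, the step name `clf`, the small `C` grid) is an assumption for illustration, not the notebook's actual configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class problem standing in for the real data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101, stratify=y_demo
)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression(max_iter=1000))])

# 1) Vanilla fit: compare train and test scores to spot over/underfitting.
pipe.fit(X_tr, y_tr)
train_f1 = f1_score(y_tr, pipe.predict(X_tr), average="weighted")
test_f1 = f1_score(y_te, pipe.predict(X_te), average="weighted")

# 2) Grid search over a small, illustrative grid.
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, scoring="f1_weighted", cv=5)
search.fit(X_tr, y_tr)

# 3) Final evaluation with the refitted best estimator.
tuned_f1 = f1_score(y_te, search.predict(X_te), average="weighted")
print(train_f1, test_f1, search.best_params_, tuned_f1)
```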

## 1. Logistic model

### Vanilla Logistic Model

In [None]:
from sklearn.preprocessing import StandardScaler
Logistic_Model = LogisticRegression()
Standard_Scaler = StandardScaler()

In [None]:
from sklearn.pipeline import Pipeline
Logistick_pip = Pipeline([('scaler',StandardScaler()),('Logistic_Model',LogisticRegression())])
Logistick_pip.fit(X_train,y_train)

In [None]:
y_pred =Logistick_pip.predict(X_test)
y_train_pred = Logistick_pip.predict(X_train)

In [None]:
print(f'''
 -----------------    Train Results      -----------------
{classification_report(y_train, y_train_pred)}
                        
-----------------    Test Results      -----------------
{classification_report(y_test, y_pred)}''')

In [None]:
from sklearn.metrics import make_scorer
from sklearn.metrics import precision_score, recall_score,  f1_score


f1_S = make_scorer(f1_score, average = "weighted")
precision_S = make_scorer(precision_score, average = "weighted")
recall_S = make_scorer(recall_score, average = "weighted")


scoring = {"f1r":f1_S,
           "precision":precision_S,
           "recall":recall_S} 

In [None]:
operations = [("scaler", StandardScaler()), ("logistic", LogisticRegression())]
model = Pipeline(steps=operations)

scores = cross_validate(model, X_train, y_train, scoring = scoring, cv = 10, return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1,11))
df_scores.mean()

In [None]:
df_scores

### Logistic Model GridsearchCV

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
operations = [("scaler", StandardScaler()), ("logistic_model", LogisticRegression(max_iter=5000))]
GridModel = Pipeline(steps=operations)

In [None]:
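# lbfgs supports only the l2 penalty, so it gets its own sub-grid to avoid failed fits
param_grid = [
    {"logistic_model__class_weight": ["balanced"],
     "logistic_model__solver": ["saga", "liblinear"],
     "logistic_model__penalty": ["l1", "l2"],
     "logistic_model__C": [0.01, 0.1, 0.5, 1]},
    {"logistic_model__class_weight": ["balanced"],
     "logistic_model__solver": ["lbfgs"],
     "logistic_model__penalty": ["l2"],
     "logistic_model__C": [0.01, 0.1, 0.5, 1]},
]
# Class 3 (Hispanic) is the worst-scoring class for our model, so we focus the search on it
f1_Hispanic = make_scorer(f1_score, average=None, labels=[3])
# Use average="weighted" or average="macro" instead to optimise across all classes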
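
To see what a single-class scorer measures, here is a small hand-checkable example with made-up labels. For class 3 there are 2 true positives, 0 false positives, and 1 false negative, so F1 = 2·2 / (2·2 + 0 + 1) = 0.8. The variable names are illustrative only.

```python
from sklearn.metrics import f1_score, make_scorer

y_true = [1, 1, 2, 3, 3, 3]
y_hat = [1, 2, 2, 3, 3, 1]

# Restricting labels to [3] scores only that class.
f1_class3_array = f1_score(y_true, y_hat, labels=[3], average=None)      # one-element array
f1_class3_scalar = f1_score(y_true, y_hat, labels=[3], average="macro")  # plain float

print(f1_class3_array, f1_class3_scalar)

# A scorer built on the scalar form can be passed to GridSearchCV(scoring=...).
f1_class3_scorer = make_scorer(f1_score, labels=[3], average="macro")
```

With a single label, `average="macro"` returns the same number as `average=None` but as a plain float, which some tooling handles more gracefully.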

In [None]:
grid_Logistick_pipe = GridSearchCV(GridModel, param_grid = param_grid,scoring=f1_Hispanic, cv=5, return_train_score=True,n_jobs=1)


In [None]:
grid_Logistick_pipe.fit(X_train,y_train)

In [None]:
# The best hyperparameters: 
grid_Logistick_pipe.best_params_

In [None]:
y_pred_grid =grid_Logistick_pipe.predict(X_test)
y_train_pred_grid = grid_Logistick_pipe.predict(X_train)

In [None]:
print(f'''
----------------------------- Train Results -----------------------------\n
{classification_report(y_train, y_train_pred_grid)}
                        
----------------------------- Test Results -----------------------------\n
{classification_report(y_test, y_pred_grid)}''')

In [None]:
from sklearn.metrics import roc_curve, auc
def plot_multiclass_roc(clf, X_test, y_test, n_classes, figsize=(5,5)):
    y_score = clf.decision_function(X_test)

    # structures
    fpr = dict()
    tpr = dict()
    roc_auc = dict()

    # calculate dummies once
    y_test_dummies = pd.get_dummies(y_test, drop_first=False).values
    for i in range(n_classes):
        fpr[i], tpr[i], _ = roc_curve(y_test_dummies[:, i], y_score[:, i])
        roc_auc[i] = auc(fpr[i], tpr[i])

    # roc for each class
    fig, ax = plt.subplots(figsize=figsize)
    ax.plot([0, 1], [0, 1], 'k--')
    ax.set_xlim([0.0, 1.0])
    ax.set_ylim([0.0, 1.05])
    ax.set_xlabel('False Positive Rate')
    ax.set_ylabel('True Positive Rate')
    ax.set_title('Receiver operating characteristic example')
    for i in range(n_classes):
        ax.plot(fpr[i], tpr[i], label='ROC curve (area = %0.2f) for label %i' % (roc_auc[i], i))
    ax.legend(loc="best")
    ax.grid(alpha=.4)
    sns.despine()
    plt.show()
plot_multiclass_roc(grid_Logistick_pipe, X_test, y_test, n_classes=3, figsize=(16, 10));
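
If only a summary number is needed instead of the per-class curves, `roc_auc_score` can compute a one-vs-rest AUC directly from predicted probabilities. The sketch below runs on synthetic data with a plain logistic regression; the data and model are placeholders, not the tuned pipeline above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=101, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)   # shape (n_samples, n_classes), rows sum to 1

# One-vs-rest AUC, averaged over the three classes with equal weight.
macro_auc = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
print(round(macro_auc, 3))
```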

In [None]:
from scikitplot.metrics import plot_roc, plot_precision_recall

y_pred_proba = grid_Logistick_pipe.predict_proba(X_test)

plot_precision_recall(y_test, y_pred_proba)
plt.show();

## 2. SVC

### Vanilla SVC model 

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import PrecisionRecallDisplay, average_precision_score

In [None]:
operations_SVM = [("scaler", StandardScaler()), ("SVC", SVC(class_weight="balanced",probability=True))]
SVM_pipe = Pipeline(steps=operations_SVM)  # named SVM_pipe to match the cells below

In [None]:
SVM_pipe.fit(X_train,y_train)

In [None]:
y_pred_svm = SVM_pipe.predict(X_test)
y_train_pred_SVM = SVM_pipe.predict(X_train)
print(f'''
 -----------------    Train Results      -----------------\n
{classification_report(y_train, y_train_pred_SVM)}
                        
-----------------    Test Results      -----------------\n
{classification_report(y_test, y_pred_svm)}''')

In [None]:
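def plot_dictionary_bar(dictionary, metric="f1-score"):
    # classification_report(output_dict=True) maps each label to a dict of metrics;
    # "accuracy" maps to a plain float, so it is skipped here.
    scores = {k: v[metric] for k, v in dictionary.items() if isinstance(v, dict)}

    plt.bar(list(scores.keys()), list(scores.values()))
    plt.xlabel("Class / average")
    plt.ylabel(metric)
    plt.title(f"{metric} per class")

    plt.show()

plot_dictionary_bar(classification_report(y_test, y_pred_svm, output_dict=True))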

In [None]:
scores = cross_validate(SVM_pipe,
                        X_train,
                        y_train,
                        scoring=scoring,
                        cv = 5,
                        return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 6))
df_scores.mean()[2:]

In [None]:
df_scores

###  SVC Model GridsearchCV

In [None]:
SVM_pipe_grid = Pipeline(steps=operations_SVM)

In [None]:
param_grid = {'SVC__C': [0.01 ,0.1, 0.5,1],
              'SVC__gamma': ["scale", "auto", 0.01],
              'SVC__kernel': ['rbf', 'linear','poly'],             
              'SVC__class_weight': ["balanced"]}

In [None]:
svm_model_grid = GridSearchCV(SVM_pipe_grid, param_grid, verbose=3, scoring=f1_Hispanic, refit=True,n_jobs=1)

In [None]:
svm_model_grid.fit(X_train, y_train)

In [None]:
svm_model_grid.best_params_

In [None]:
y_pred_svm_grid = svm_model_grid.predict(X_test)
y_train_pred_SVM_grid = svm_model_grid.predict(X_train)
print(f'''
 -----------------    Train Results      -----------------\n
{classification_report(y_train, y_train_pred_SVM_grid)}
                        
-----------------    Test Results      -----------------\n
{classification_report(y_test, y_pred_svm_grid)}''')

In [None]:
y_pred_proba = svm_model_grid.predict_proba(X_test)

plot_precision_recall(y_test, y_pred_proba)
plt.show();

## 3. RF

### Vanilla RF Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
RF_model = RandomForestClassifier() 

In [None]:
RF_model.fit(X_train,y_train)

In [None]:
# We notice the model is overfitting

y_pred_RF = RF_model.predict(X_test)
y_train_pred_RF = RF_model.predict(X_train)
print(f'''
 -----------------    Train Results      -----------------
{classification_report(y_train, y_train_pred_RF)}
                        
-----------------    Test Results      -----------------
{classification_report(y_test, y_pred_RF)}''')

### RF Model GridsearchCV

In [None]:
param_grid = {'n_estimators':[400,500],
             'criterion': ["gini","entropy"],
             'max_depth':[2,3,10,14],
             'min_samples_split':[18,20,22],
             'class_weight': ['balanced']}

In [None]:
rf_grid_model = GridSearchCV(RF_model, param_grid, verbose=3, scoring=f1_Hispanic, refit=True,n_jobs=1)

In [None]:
rf_grid_model.fit(X_train,y_train)

In [None]:
rf_grid_model.best_params_

In [None]:
y_pred_RF_grid = rf_grid_model.predict(X_test)
y_train_pred_RF_grid = rf_grid_model.predict(X_train)

In [None]:
print(f'''
 -----------------    Train Results      -----------------
{classification_report(y_train, y_train_pred_RF_grid)}
                        
-----------------    Test Results      -----------------
{classification_report(y_test, y_pred_RF_grid)}''')

In [None]:
y_pred_proba = rf_grid_model.predict_proba(X_test)

plot_precision_recall(y_test, y_pred_proba)
plt.show();

## 4. XGBoost

### Vanilla XGBoost Model

In [None]:
from xgboost import XGBClassifier

In [None]:
xgboost_pipe =Pipeline([('Scaler',StandardScaler()),('XGB',XGBClassifier())]) 

In [None]:
xgboost_pipe

In [None]:
y_train.unique()

In [None]:
y_train_xgb = y_train.map({1: 0, 2:1,3:2})
y_test_xgb = y_test.map({1: 0, 2:1,3:2})

In [None]:
xgboost_pipe.fit(X_train,y_train_xgb)

In [None]:
y_pred_XGB= xgboost_pipe.predict(X_test)
y_train_pred_XGB = xgboost_pipe.predict(X_train)

print(f'''
 -----------------    Train Results      -----------------
{classification_report(y_train_xgb, y_train_pred_XGB)}
                        
-----------------    Test Results      -----------------
{classification_report(y_test_xgb, y_pred_XGB)}''')

In [None]:
from sklearn.utils import class_weight

classes_weights = class_weight.compute_sample_weight(
    class_weight="balanced", y=y_train_xgb
)
classes_weights
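
For reference, the "balanced" setting assigns each sample the weight n_samples / (n_classes × count of its class), so rare classes get larger weights. A small sketch with invented labels checks that formula against `compute_sample_weight`:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Invented, imbalanced labels: six of class 0, three of class 1, one of class 2.
y_demo = np.array([0] * 6 + [1] * 3 + [2] * 1)

weights = compute_sample_weight(class_weight="balanced", y=y_demo)

# "balanced" uses n_samples / (n_classes * count_of_that_class).
counts = np.bincount(y_demo)
expected = len(y_demo) / (len(counts) * counts[y_demo])

print(np.unique(weights).round(3))
```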

In [None]:
my_dict = {"weights": classes_weights, "label": y_train_xgb}

comp = pd.DataFrame(my_dict)

comp.sample(10)

In [None]:
comp.groupby("label").value_counts()

In [None]:
scoring

In [None]:
scores = cross_validate(
    xgboost_pipe,
    X_train,
    y_train_xgb,
    scoring=scoring,
    cv=5,
    n_jobs=1, 
    return_train_score=True,
    # Note: newer scikit-learn releases rename this argument to `params`
    fit_params={"XGB__sample_weight": classes_weights},
)
df_scores = pd.DataFrame(scores, index=range(1, 6))
df_scores.mean()[2:]

In [None]:
df_scores

### XGBoost Model GridsearchCV

In [None]:
param_grid = {
    "XGB__n_estimators": [100, 300],
    "XGB__max_depth": [1,2,5],
    "XGB__learning_rate": [0.03, 0.05],
    "XGB__subsample": [0.3, 0.8,1],
    "XGB__colsample_bytree": [0.5,0.8, 1],
}

In [None]:
xgboost_pipe_grid =Pipeline([('Scaler',StandardScaler()),('XGB',XGBClassifier())]) 
xGBoost_Grid = GridSearchCV(
    xgboost_pipe_grid,
    param_grid,
    scoring=make_scorer(recall_score, average=None, labels=[2]),
    cv=5,
    n_jobs=1,
    return_train_score=True,
)

In [None]:
xGBoost_Grid.fit(X_train, y_train_xgb, XGB__sample_weight=classes_weights)

In [None]:
xGBoost_Grid.best_params_

In [None]:
y_pred_XGB= xGBoost_Grid.predict(X_test)
y_train_pred_XGB = xGBoost_Grid.predict(X_train)

In [None]:
print(f'''
 -----------------    Train Results      -----------------
{classification_report(y_train_xgb, y_train_pred_XGB)}
                        
-----------------    Test Results      -----------------
{classification_report(y_test_xgb, y_pred_XGB)}''')

In [None]:
# The tuned Logistic regression model has the best results 

---

# SMOTE
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/

## SMOTE Implementation

In [None]:
# !pip install imbalanced-learn

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import  Pipeline as Imb_pip

In [None]:
over = SMOTE()
X_train_over, y_train_over = over.fit_resample(X_train, y_train)

In [None]:
print('''
Shape of X_train is  :      {}
Shape of X_train over is  : {}
Shape of y_train is :       {}
Shape of y_train over is :  {}
--------------each class------------------
{}
'''.format(X_train.shape,X_train_over.shape,y_train.shape,y_train_over.shape,y_train_over.value_counts()))

In [None]:
under = RandomUnderSampler()
X_train_under, y_train_under = under.fit_resample(X_train, y_train)

In [None]:
print('''
Shape of X_train is  :       {}
Shape of X_train under is  : {}
Shape of y_train is :        {}
Shape of y_train under is :  {}
--------------each class------------------
{}

'''.format(X_train.shape,X_train_under.shape,y_train.shape,y_train_under.shape, y_train_under.value_counts()))

In [None]:
y_train.value_counts()

In [None]:
over = SMOTE(sampling_strategy={3: 1000})
under = RandomUnderSampler(sampling_strategy={1: 2000})

In [None]:
X_resampled_over, y_resampled_over = over.fit_resample(X_train, y_train)

In [None]:
steps = [("o", over), ("u", under)]


pipeline = Imb_pip(steps=steps)

X_resampled, y_resampled = pipeline.fit_resample(X_train, y_train)

In [None]:
print('''

--------Y before-------- \n
{}
\n\n--------SMOTE sampling--------\n
{}
'''.format(y_train.value_counts(),y_resampled.value_counts()))
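
To build intuition for what SMOTE does, the sketch below creates synthetic minority rows by interpolating between a minority point and one of its nearest minority neighbours. This is a simplified illustration written with NumPy and scikit-learn, not imbalanced-learn's implementation; the function name `smote_like` and the random data are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X_min) - 1)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]

    base = rng.integers(0, len(X_min), size=n_new)              # random minority rows
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour for each
    gap = rng.random((n_new, 1))                                # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 3))   # made-up minority-class rows
X_synth = smote_like(X_minority, n_new=30)
print(X_synth.shape)  # (30, 3)
```

Because each new row lies on a segment between two existing minority rows, the synthetic points stay inside the region the minority class already occupies.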


## Logistic Regression Over/ Under Sampling

In [None]:
operations = [
    ("o", over),
    ("u", under),
    ("scaler", StandardScaler()),
    ("logistic_reg", LogisticRegression(max_iter=10000)),
] 

In [None]:
smote_pipeline = Imb_pip(steps=operations)
smote_pipeline

In [None]:
smote_pipeline.fit(X_train, y_train)

In [None]:
y_pred_smote= smote_pipeline.predict(X_test)
y_train_pred_smote = smote_pipeline.predict(X_train)

print(f'''
 -----------------    Train Results      -----------------
{classification_report(y_train, y_train_pred_smote)}
                        
-----------------    Test Results      -----------------
{classification_report(y_test, y_pred_smote)}''')

In [None]:
model = Imb_pip(steps=operations)

scores = cross_validate(
    model, X_train, y_train, scoring=scoring, cv=10, n_jobs=1, return_train_score=True
)
df_scores = pd.DataFrame(scores, index=range(1, 11))
df_scores.mean()[2:]

In [None]:
df_scores

In [None]:
# lbfgs supports only the l2 penalty, so it gets its own sub-grid to avoid failed fits
C_values = [0.001, 0.01, 0.1, 1, 5, 10, 15, 20, 25]
param_grid = [
    {"logistic_reg__class_weight": ["balanced", None],
     "logistic_reg__penalty": ["l1", "l2"],
     "logistic_reg__solver": ["saga", "liblinear"],
     "logistic_reg__C": C_values},
    {"logistic_reg__class_weight": ["balanced", None],
     "logistic_reg__penalty": ["l2"],
     "logistic_reg__solver": ["lbfgs"],
     "logistic_reg__C": C_values},
]
# Class 3 (Hispanic) is the worst-scoring class for our model, so we focus the search on it
f1_Hispanic = make_scorer(f1_score, average=None, labels=[3])


In [None]:
grid_Logistick_smote_pipe = GridSearchCV(smote_pipeline, param_grid = param_grid,scoring=f1_Hispanic, cv=5, return_train_score=True,n_jobs=-1)


In [None]:
grid_Logistick_smote_pipe.fit(X_train,y_train)

In [None]:
grid_Logistick_smote_pipe.best_params_

In [None]:
y_pred_smote= grid_Logistick_smote_pipe.predict(X_test)
y_train_pred_smote = grid_Logistick_smote_pipe.predict(X_train)
print(f'''
 -----------------    Train Results      -----------------
{classification_report(y_train, y_train_pred_smote)}
                        
-----------------    Test Results      -----------------
{classification_report(y_test, y_pred_smote)}''')

## Other Evaluation Metrics for Multiclass Classification

- Evaluation metrics 
https://towardsdatascience.com/comprehensive-guide-on-multiclass-classification-metrics-af94cfb83fbd

In [None]:
from sklearn.metrics import matthews_corrcoef
matthews_corrcoef?
matthews_corrcoef(y_test, y_pred)

In [None]:
from sklearn.metrics import cohen_kappa_score
cohen_kappa_score?
cohen_kappa_score(y_test, y_pred)
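
Both metrics correct for agreement that could happen by chance, which is why they are useful on imbalanced data. A small sketch with invented labels: a classifier that always predicts the same class scores 0 on both, while a perfect one scores 1.

```python
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef

y_true = [1, 1, 1, 2, 2, 3, 3, 3]
y_all_majority = [1] * 8        # always predicts class 1
y_perfect = list(y_true)

for name, pred in [("majority", y_all_majority), ("perfect", y_perfect)]:
    print(name,
          matthews_corrcoef(y_true, pred),
          cohen_kappa_score(y_true, pred))
```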

# Before the Deployment 
- Choose the model that works best based on your chosen metric.
- As a final step, fit the best model on the whole dataset to get better performance.
- Your model is then ready to deploy: dump your model and scaler.
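
One way to carry out the last two steps is sketched below on synthetic data. It assumes the scaler and classifier are wrapped in a single `Pipeline`, so one `joblib` file holds both; the file name `final_model.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Scaler and model live in one pipeline, so a single file stores both.
final_model = Pipeline([("scaler", StandardScaler()),
                        ("clf", LogisticRegression(max_iter=1000))])
final_model.fit(X_demo, y_demo)   # fit on the whole (synthetic) dataset

path = os.path.join(tempfile.mkdtemp(), "final_model.joblib")  # hypothetical file name
joblib.dump(final_model, path)

# Reload and confirm the restored pipeline predicts the same labels.
restored = joblib.load(path)
same = (restored.predict(X_demo) == final_model.predict(X_demo)).all()
print(same)
```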
