___

<p style="text-align: center;"><img src="https://docs.google.com/uc?id=1lY0Uj5R04yMY3-ZppPWxqCr5pvBLYPnV" class="img-fluid" alt="CLRSWY"></p>

___

# WELCOME!

In this project, you must apply EDA processes for the development of predictive models. Handling outliers, domain knowledge and feature engineering will be challenges.

Also, this project aims to improve your ability to implement algorithms for Multi-Class Classification. Thus, you will have the opportunity to implement many algorithms commonly used for Multi-Class Classification problems.

Before diving into the project, please take a look at the determines and tasks.

# Determines

The 2012 US Army Anthropometric Survey (ANSUR II) was executed by the Natick Soldier Research, Development and Engineering Center (NSRDEC) from October 2010 to April 2012 and is comprised of personnel representing the total US Army force to include the US Army Active Duty, Reserves, and National Guard. In addition to the anthropometric and demographic data described below, the ANSUR II database also consists of 3D whole body, foot, and head scans of Soldier participants. These 3D data are not publicly available out of respect for the privacy of ANSUR II participants. The data from this survey are used for a wide range of equipment design, sizing, and tariffing applications within the military and has many potential commercial, industrial, and academic applications.

The ANSUR II working databases contain 93 anthropometric measurements which were directly measured, and 15 demographic/administrative variables explained below. The ANSUR II Male working database contains a total sample of 4,082 subjects. The ANSUR II Female working database contains a total sample of 1,986 subjects.


DATA DICT:
https://data.world/datamil/ansur-ii-data-dictionary/workspace/file?filename=ANSUR+II+Databases+Overview.pdf

---

To achieve high prediction success, you must understand the data well and develop different approaches that can affect the dependent variable.

First, try to understand the dataset column by column using pandas. Research the domain (body measurements and race characteristics) online to get to know the dataset as quickly as possible.

You will implement ***Logistic Regression, Support Vector Machine, XGBoost, Random Forest*** algorithms. Also, evaluate the success of your models with appropriate performance metrics.

At the end of the project, choose the most successful model, try to improve its scores with ***SMOTE***, and make it ready to deploy. Furthermore, use ***SHAP*** to explain how the best model you chose works.

# Tasks

#### 1. Exploratory Data Analysis (EDA)
- Import Libraries, Load Dataset, Exploring Data

    *i. Import Libraries*
    
    *ii. Ingest Data*
    
    *iii. Explore Data*
    
    *iv. Outlier Detection*
    
    *v.  Drop unnecessary features*

#### 2. Data Preprocessing
- Scale (if needed)
- Separate the data frame for evaluation purposes

#### 3. Multi-class Classification
- Import libraries
- Implement SVM Classifier
- Implement Decision Tree Classifier
- Implement Random Forest Classifier
- Implement XGBoost Classifier
- Compare The Models



# EDA
- Drop unnecessary columns
- Drop DODRace class if value count below 500 (we assume that our data model can't learn if it is below 500)
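
The class-size rule above can be sketched with `value_counts`. This is a minimal example on a made-up frame: the toy data and the `min_count` name are assumptions for illustration, and only the threshold of 500 comes from the note above.

```python
import pandas as pd

# Toy stand-in for the real data: class 1 has 600 rows, class 2 has 550, class 4 has 30.
toy = pd.DataFrame({"DODRace": [1] * 600 + [2] * 550 + [4] * 30})

min_count = 500  # threshold taken from the note above

# Count rows per class and keep only the classes that reach the threshold.
counts = toy["DODRace"].value_counts()
keep_classes = counts[counts >= min_count].index
filtered = toy[toy["DODRace"].isin(keep_classes)]

print(sorted(filtered["DODRace"].unique()))
print(len(filtered))
```

The same pattern should carry over to the real `df` once it is loaded, with `toy` swapped for `df`.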

## Import Libraries
Besides Numpy and Pandas, you need to import the necessary modules for data visualization, data preprocessing, Model building and tuning.

*Note: Check out the course materials.*

In [None]:
pip install sqlalchemy

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

from sqlalchemy import create_engine
import warnings
from IPython.core.pylabtools import figsize
from scipy.stats import zscore
from scipy import stats
from numpy import percentile
font_title = {'family': 'times new roman', 'color': 'darkred', 
              'weight': 'bold', 'size': 14}

warnings.filterwarnings('ignore')
sns.set_style("whitegrid")

plt.rcParams['figure.dpi'] = 100

## Ingest Data from links below and make a dataframe
- Soldiers Male : https://query.data.world/s/h3pbhckz5ck4rc7qmt2wlknlnn7esr
- Soldiers Female : https://query.data.world/s/sq27zz4hawg32yfxksqwijxmpwmynq

In [None]:
male = pd.read_csv("ANSUR II MALE Public.csv",encoding="ISO-8859-1")
female = pd.read_csv("ANSUR II FEMALE Public.csv",encoding="ISO-8859-1")

In [None]:
male.head()

In [None]:
female.head()

In [None]:
male.shape

In [None]:
female.shape

- The columns matter for the concat: both frames must have the same column names and the same number of columns. Otherwise the concatenation adds extra columns and fills the gaps with null values.
- As seen above, the subject ID column is spelled differently in the male and female frames (`subjectid` vs. `SubjectId`). Both hold the same kind of data, but because of the spelling difference a concat would create a new column. We do not want that, so we rename the column in the female frame.
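
The effect of the spelling mismatch can be seen on two tiny frames. This is only a sketch: the frame names and values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Two tiny frames whose ID columns differ only in spelling (toy values).
male_toy = pd.DataFrame({"subjectid": [1, 2], "stature": [1776, 1702]})
female_toy = pd.DataFrame({"SubjectId": [3], "stature": [1560]})

# Without renaming, concat keeps both spellings and pads the gaps with NaN.
mismatched = pd.concat([male_toy, female_toy], ignore_index=True)
print(mismatched.columns.tolist())
print(int(mismatched.isna().sum().sum()))

# After renaming, the columns line up and no NaN appears.
aligned = pd.concat(
    [male_toy, female_toy.rename(columns={"SubjectId": "subjectid"})],
    ignore_index=True,
)
print(aligned.columns.tolist())
print(int(aligned.isna().sum().sum()))
```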

In [None]:
female.rename(columns = {'SubjectId':'subjectid'}, inplace = True)
female

In [None]:
# ignore_index=True gives the combined frame a unique 0..n-1 index
data = pd.concat([male,female], ignore_index=True)
df= data.copy()

## Explore Data

In [None]:
df.head()

In [None]:
df.shape

In [None]:
df.info(verbose=True)
# verbose=True prints the info for every column at once

In [None]:
df.isna().sum().any() # True if at least one null exists, but where?

In [None]:
df.isnull().sum()

In [None]:
df.isna().sum().sort_values(ascending=False)

In [None]:
df.duplicated().sum()

- Now let's look at the number of unique values in the object-typed columns.

In [None]:
categoric = df.select_dtypes(include='object')
for col in categoric.columns:
    print(col)
    print(df[col].nunique())
    print("-------------")

- Now let's take a look at the target column.

In [None]:
df["DODRace"].value_counts() 
# the target column has 7 distinct classes,
# so this is a multi-class classification project

In [None]:
df.describe(include=int).T

- From the first outputs we assumed the two columns below were identical, but they turn out to differ.

In [None]:
df[["DODRace","SubjectNumericRace"]]

In [None]:
df["DODRace"].value_counts(dropna=False)

In [None]:
df["SubjectNumericRace"].sample(10)

In [None]:
df["SubjectNumericRace"].value_counts(dropna=False)

- We saw that these two columns give similar results. To avoid both unnecessary computation and data leakage, I decided to drop `SubjectNumericRace`.

In [None]:
df["SubjectNumericRace"] = df["SubjectNumericRace"].apply(lambda x: 4 if x == 5 or x == 6 or x >= 8  else x)
df["SubjectNumericRace"].value_counts()

- After some domain research, we decided to drop some of the columns.

In [None]:
df["DODRace"] = df["DODRace"].apply(lambda x: 4 if x == 5 or x == 6 or x == 8 else x)
df["DODRace"].value_counts()
# the lambda maps the classes after 4 (5, 6, and 8) to 4,
# because those classes had very few samples anyway

- According to the article, weight is given in pounds, but kilograms are what is wanted.
- Likewise, height is given in inches, while the article wants millimetres.
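
For reference, the two unit conversions mentioned above can be sketched as follows. The helper names are hypothetical and the input values are illustrative, not taken from the dataset; the conversion factors themselves are the exact definitions of the international pound and inch.

```python
# Exact conversion factors (by definition of the international pound and inch).
KG_PER_LB = 0.45359237
MM_PER_INCH = 25.4

def lbs_to_kg(pounds):
    """Convert a weight in pounds to kilograms."""
    return pounds * KG_PER_LB

def inches_to_mm(inches):
    """Convert a length in inches to millimetres."""
    return inches * MM_PER_INCH

# Illustrative values, not taken from the dataset.
print(round(lbs_to_kg(180), 1))
print(round(inches_to_mm(70), 1))
```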

In [None]:
df.drop("SubjectNumericRace", axis = 1, inplace = True)

In [None]:
df["DODRace"].value_counts().plot(kind="pie", autopct='%1.1f%%',figsize=(10,10));


In [None]:
df.head(1).T

In [None]:
df.drop("subjectid", axis = 1, inplace = True)

In [None]:
df.drop(["WritingPreference","Ethnicity","PrimaryMOS","Date","Installation"], axis = 1, inplace = True)

In [None]:
df.Branch.value_counts()

In [None]:
df.groupby(["Branch"])["DODRace"].value_counts()

In [None]:
df.Component.value_counts()

In [None]:
df.groupby(["Component"])["DODRace"].value_counts()

In [None]:
df.groupby(["Component","Branch"])["DODRace"].value_counts()

- Our instructor was convinced: we reasoned that racial bias shows up most in how people are categorized, so we decided to keep `Branch` and drop `Component`.
- If any racial discrimination takes place, it would happen at the branch level, not at the component level.
- In other words, everyone is accepted into sea, land, and air alike, but bias may come into play when it comes to specialization.

In [None]:
df.drop("Component",axis=1,inplace=True)

- The dataset has two weight columns, one in lb and one in kg. We are asked for kg, so the other one can be dropped.



In [None]:
df[['Weightlbs','weightkg']]

In [None]:
plt.figure(figsize=(20,20))
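# correlation is only defined for numeric columns, so select them explicitly
sns.heatmap(df.select_dtypes(include="number").corr(), cmap ="viridis")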

In [None]:
df.head(1).T

In [None]:
numeric =df.select_dtypes(include='int64')

In [None]:
def detect_outliers(df, col_name, tukey=1.5):
    '''
    Detect outliers in df[col_name] with Tukey's rule (tukey * IQR beyond the quartiles).
    Returns the lower limit, the upper limit, and the number of outliers, in that order.
    '''
    first_quartile = np.percentile(df[col_name], 25)
    third_quartile = np.percentile(df[col_name], 75)
    IQR = third_quartile - first_quartile

    upper_limit = third_quartile + (tukey * IQR)
    lower_limit = first_quartile - (tukey * IQR)

    # count the values that fall outside the [lower_limit, upper_limit] band
    values = df[col_name]
    outlier_count = int(((values < lower_limit) | (values > upper_limit)).sum())
    return lower_limit, upper_limit, outlier_count

In [None]:
out_cols = []
for col in numeric:
    # run the detection once per column and reuse the result
    lower, upper, n_outliers = detect_outliers(df, col, 3)
    print(f"{col}\nlower:{lower} \nupper:{upper}\noutlier:{n_outliers}\n*-*-*-*-*-*-*")
    if n_outliers > 0:
        out_cols.append(col)
print(out_cols)    

In [None]:
df= pd.get_dummies(data=df,drop_first=True)

In [None]:
df.DODRace.value_counts()

In [None]:
numeric= df[['abdominalextensiondepthsitting', 'chestdepth', 'hipbreadth', 'hipbreadthsitting', 'lowerthighcircumference', 'thighclearance', 'waistdepth', 'Heightin', 'Weightlbs']]

In [None]:
class_tree = df.groupby('DODRace').size()
class_label = pd.DataFrame(class_tree,columns = ['Size'])
plt.figure(figsize = (8,6))
sns.barplot(x = class_label.index, y = 'Size',data = class_label);

- This data is clearly imbalanced. What should we do about it?


In [None]:
# TODO: write the outlier-handling code here
# draw box plots
# draw a nice heatmap

# box plot of each selected measurement, split by race class
index = 0
plt.figure(figsize=(20,20))
for feature in numeric.columns :
    index += 1
    plt.subplot(3,3,index)  # 3x3 grid for the 9 selected columns
    sns.boxplot(x = 'DODRace', y = feature, data = df.reset_index(drop=True))

# DATA Preprocessing
- In this step, we divide our data into X (features) and y (target).
- For training and evaluation purposes, we create train and test sets.
- Lastly, we scale the data if the features are not on the same scale. Why?
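
One way to approach the "why" is to look at distances. Models that rely on distances or gradients, such as SVM and logistic regression, can be dominated by the feature with the largest range. The sketch below uses made-up numbers (a height-like column in mm and a 0/1 flag) and a hand-written min-max scaling that is fitted on the training rows only. It is an illustration, not the notebook's preprocessing code.

```python
import numpy as np

# Two toy features on very different scales: a height in mm and a 0/1 flag.
train = np.array([[1500.0, 0.0], [1700.0, 1.0], [1900.0, 0.0]])
test = np.array([[1600.0, 1.0]])

# Min-max statistics come from the training set only and are reused on the test set.
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min
train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range

# Distance from the test row to each training row, before and after scaling.
raw_dist = np.linalg.norm(train - test, axis=1)
scaled_dist = np.linalg.norm(train_scaled - test_scaled, axis=1)
print(raw_dist.round(3))
print(scaled_dist.round(3))
```

Before scaling, the first two training rows are almost equally far from the test row because the flag hardly contributes. After scaling, the row that shares the flag value is clearly the nearest.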

In [None]:
pip install --upgrade pip

In [None]:
pip install yellowbrick

In [None]:
pip install xgboost

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sqlalchemy import create_engine
import warnings
from IPython.core.pylabtools import figsize
from scipy.stats import zscore
from scipy import stats
from numpy import percentile
from sklearn.metrics import accuracy_score, classification_report,confusion_matrix,plot_confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
# from statsmodels.formula.api import ols
from scipy.stats import zscore
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier
import seaborn as sns
from xgboost import XGBClassifier
from sklearn.model_selection import TimeSeriesSplit
from yellowbrick.classifier import ClassificationReport
from yellowbrick.datasets import load_occupancy
from sklearn.metrics import f1_score

In [None]:
X = df.drop("DODRace", axis = 1)

y = df["DODRace"]

In [None]:
y.astype(str)

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
# keep y one-dimensional; scikit-learn estimators expect a 1-D target
y = pd.Series(le.fit_transform(y), name="DODRace")

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,stratify=y, random_state=42)

In [None]:
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Modelling
- Fit the model on the train dataset.
- Get predictions from the vanilla model on both the train and test sets to check for over/underfitting.
- Apply GridSearchCV both for hyperparameter tuning and as a sanity check of the model.
- Use the hyperparameters found by the grid search to make the final prediction, and evaluate the result according to the chosen metric.
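
The workflow above can be sketched end to end on synthetic data. Everything here is an assumption for illustration: the data from `make_classification`, the pipeline step names, the small grid over `C`, and the choice of `f1_macro` as the metric. It is not the tuned setup for the ANSUR data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic 4-class data standing in for the real features and target.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=6,
    n_classes=4, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

# Scaling inside the pipeline means each CV fold fits its own scaler.
pipe = Pipeline([
    ("scaler", MinMaxScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid: only the regularization strength C is tuned here.
param_grid = {"model__C": [0.1, 1, 10]}

grid = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=5)
grid.fit(X_tr, y_tr)

print(grid.best_params_)
print(round(grid.score(X_te, y_te), 3))
```

Putting the scaler inside the pipeline is a design choice: it keeps the validation folds from influencing the scaling statistics during the search.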

## 1. Logistic model

### Vanilla Logistic Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, plot_confusion_matrix

In [None]:
log_model = LogisticRegression()

In [None]:
log_model.fit(X_train_scaled,y_train)
y_pred = log_model.predict(X_test_scaled)
y_pred

In [None]:
plot_confusion_matrix(log_model, X_test_scaled, y_test)

In [None]:
def eval_metric(model, X_train, y_train, X_test, y_test):
    y_train_pred = model.predict(X_train)
    y_pred = model.predict(X_test)
    
    print("Test_Set")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print()
    print("Train_Set")
    print(confusion_matrix(y_train, y_train_pred))
    print(classification_report(y_train, y_train_pred))

In [None]:
eval_metric(log_model, X_train_scaled, y_train, X_test_scaled, y_test)

### Cross Validate

In [None]:
from sklearn.model_selection import cross_validate

model = LogisticRegression()

scores = cross_validate(model, X_train_scaled, y_train, scoring = ['accuracy', 'precision_weighted','recall_weighted',
                                                                   'f1_weighted'], cv = 10)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

In [None]:
eval_metric(log_model, X_train_scaled, y_train, X_test_scaled, y_test)

### Logistic Model GridsearchCV

In [None]:
from sklearn.metrics import make_scorer
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

# Score a single class. Label 2 is only an example (DODRace 3 after label encoding);
# replace it with the encoded label of the class you care about.
target_class = 2

f1_class = make_scorer(f1_score, average = None, labels =[target_class])
precision_class = make_scorer(precision_score, average = None, labels =[target_class])
recall_class = make_scorer(recall_score, average = None, labels =[target_class])

In [None]:
model = LogisticRegression()

scores = cross_validate(model, X_train_scaled, y_train, scoring = {"f1_class":f1_class, 
                                                                   "precision_class":precision_class,
                                                                   "recall_class":recall_class}, cv = 10)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean()[2:]

In [None]:
y_pred=log_model.predict(X_test_scaled)
y_pred_proba = log_model.predict_proba(X_test_scaled)

# attach the true labels by position so the result does not depend on index alignment
test_data = X_test.copy()
test_data["DODRace"] = np.ravel(y_test)
test_data["pred"] = y_pred
test_data["pred_proba_0"] = y_pred_proba[:,0]
test_data["pred_proba_1"] = y_pred_proba[:,1]
test_data["pred_proba_2"] = y_pred_proba[:,2]
test_data["pred_proba_3"] = y_pred_proba[:,3]
test_data.sample(10)

## 2. SVC

### Vanilla SVC model 

In [None]:
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
from sklearn.ensemble import BaggingClassifier
modelSVM = BaggingClassifier(SVC()) 

In [None]:
modelSVM.fit(X_train_scaled , y_train)

In [None]:
y_pred_test = modelSVM.predict(X_test_scaled)
y_pred_train = modelSVM.predict(X_train_scaled)

In [None]:
print("TEST REPORT")
print(classification_report(y_test, y_pred_test))
plot_confusion_matrix(modelSVM, X_test_scaled, y_test);

print("\n"*3, "-*"*30)

print("TRAIN REPORT")
print(classification_report(y_train, y_pred_train))
plot_confusion_matrix(modelSVM, X_train_scaled, y_train);

###  SVC Model GridsearchCV

In [None]:
param_grid = {
    'base_estimator__C': np.linspace(0.01,1, 3),
}

In [None]:
from sklearn.model_selection import GridSearchCV
model = BaggingClassifier(SVC())
svm_model_grid = GridSearchCV(model,param_grid,)

svm_model_grid.fit(X_train_scaled, y_train)

In [None]:
svm_model_grid.best_params_

In [None]:
svm_model_grid.best_estimator_

In [None]:
y_pred_test = svm_model_grid.predict(X_test_scaled)
y_pred_train = svm_model_grid.predict(X_train_scaled)

In [None]:
print("TEST REPORT")
print(classification_report(y_test, y_pred_test))
plot_confusion_matrix(svm_model_grid, X_test_scaled, y_test);

print("\n"*3, "-*"*30)
print("TRAIN REPORT")
print(classification_report(y_train, y_pred_train))
plot_confusion_matrix(svm_model_grid, X_train_scaled, y_train);

In [None]:
from yellowbrick.classifier import ClassPredictionError


visualizer = ClassPredictionError(modelSVM)

# Fit the training data to the visualizer
visualizer.fit(X_train_scaled, y_train)

# Evaluate the model on the test data
visualizer.score(X_test_scaled, y_test)

# Draw visualization
visualizer.show()

In [None]:
visualizer = ClassPredictionError(svm_model_grid)

# Fit the training data to the visualizer
visualizer.fit(X_train_scaled, y_train)

# Evaluate the model on the test data
visualizer.score(X_test_scaled, y_test)

# Draw visualization
visualizer.show()

## 3. RF

### Vanilla RF Model

In [None]:
rfc = RandomForestClassifier()

rfc.fit(X_train, y_train)

In [None]:
y_pred_test = rfc.predict(X_test)
y_pred_train = rfc.predict(X_train)

In [None]:
print("TEST REPORT")
print(classification_report(y_test, y_pred_test))
plot_confusion_matrix(rfc, X_test, y_test);

print("\n"*3, "-*"*30)
print("TRAIN REPORT")
print(classification_report(y_train, y_pred_train))
plot_confusion_matrix(rfc, X_train, y_train);

In [None]:
visualizer = ClassPredictionError(rfc)

# Fit the training data to the visualizer
visualizer.fit(X_train, y_train)

# Evaluate the model on the test data
visualizer.score(X_test, y_test)

# Draw visualization
visualizer.show()

### RF Model GridsearchCV

In [None]:
param_grid={}

In [None]:
rf_model = RandomForestClassifier()
rf_grid_model = GridSearchCV(rf_model,
                             param_grid)

rf_grid_model.fit(X_train,y_train)

In [None]:
rf_grid_model.best_params_

In [None]:
y_pred_test = rf_grid_model.predict(X_test)
y_pred_train = rf_grid_model.predict(X_train)

In [None]:
rfc_accuracy_test = accuracy_score(y_test, y_pred_test)
rfc_accuracy_train = accuracy_score(y_train, y_pred_train)

rfc_f1_test = f1_score(y_test, y_pred_test, average='macro')
rfc_f1_train = f1_score(y_train, y_pred_train, average='macro')

rfc_accuracy_test, rfc_accuracy_train, rfc_f1_test, rfc_f1_train

In [None]:
print("TEST REPORT")
print(classification_report(y_test, y_pred_test))
plot_confusion_matrix(rf_grid_model, X_test, y_test);

print("\n"*3, "-*"*30)
print("TRAIN REPORT")
print(classification_report(y_train, y_pred_train))
plot_confusion_matrix(rf_grid_model, X_train, y_train);

In [None]:
visualizer = ClassPredictionError(rf_grid_model)

# Fit the training data to the visualizer
visualizer.fit(X_train, y_train)

# Evaluate the model on the test data
visualizer.score(X_test, y_test)

# Draw visualization
visualizer.show()

## 4. XGBoost

### Vanilla XGBoost Model

In [None]:
xgb_classifier = XGBClassifier()
xgb_classifier.fit(X_train , y_train)

In [None]:
y_pred_test = xgb_classifier.predict(X_test)
y_pred_train = xgb_classifier.predict(X_train)

In [None]:
print("TEST REPORT")
print(classification_report(y_test, y_pred_test))
plot_confusion_matrix(xgb_classifier, X_test, y_test);

print("\n"*3, "-*"*30)
print("TRAIN REPORT")
print(classification_report(y_train, y_pred_train))
plot_confusion_matrix(xgb_classifier, X_train, y_train);

In [None]:
visualizer = ClassPredictionError(xgb_classifier)

# Fit the training data to the visualizer
visualizer.fit(X_train, y_train)

# Evaluate the model on the test data
visualizer.score(X_test, y_test)

# Draw visualization
visualizer.show()

### XGBoost Model GridsearchCV

---

# SMOTE
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
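
To build intuition for what SMOTE does, here is a minimal NumPy sketch of its core idea: a new minority sample is placed on the line segment between an existing minority sample and one of its nearest minority neighbours. The function name, its parameters, and the toy points are assumptions for illustration. This is not the `imblearn` implementation, which should be used for the actual project.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority-class neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        # pick a random minority sample
        base = X_minority[rng.integers(len(X_minority))]
        # find its k nearest minority neighbours
        # (position 0 is the sample itself when there are no duplicate rows)
        distances = np.linalg.norm(X_minority - base, axis=1)
        neighbours = np.argsort(distances)[1:k + 1]
        neighbour = X_minority[rng.choice(neighbours)]
        # place the new point somewhere on the segment between the two
        gap = rng.random()
        synthetic[i] = base + gap * (neighbour - base)
    return synthetic

# Toy minority class with five 2-D points.
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3], [0.9, 0.8]])
new_rows = smote_like_samples(minority, n_new=10)
print(new_rows.shape)
```

Because every synthetic row is a blend of two real minority rows, the new points stay inside the region the minority class already occupies.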

##  Smote implement

In [None]:
!pip install imblearn

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

## SVC Over/Under Sampling

## Xgboost Over/ Under Sampling

- Evaluation metrics 
https://towardsdatascience.com/comprehensive-guide-on-multiclass-classification-metrics-af94cfb83fbd

In [None]:
from sklearn.metrics import matthews_corrcoef

matthews_corrcoef(y_test, y_pred)

In [None]:
from sklearn.metrics import cohen_kappa_score

cohen_kappa_score(y_test, y_pred)

#  SHAP

https://towardsdatascience.com/shap-explain-any-machine-learning-model-in-python-24207127cad7

In [None]:
# !pip install shap

In [None]:
import shap
# log_model was trained on the scaled features, so explain it on the scaled data
explainer = shap.Explainer(log_model,X_train_scaled)
start_index = 203
end_index = 204
shap_values = explainer.shap_values(X_test_scaled[start_index:end_index])

In [None]:
shap_values

In [None]:
print(shap_values[0].shape)

In [None]:
# %% >> Visualize local predictions
shap.initjs()
# Force plot
# predict on the scaled row, since the model was trained on scaled features
prediction = log_model.predict(X_test_scaled[start_index:end_index])[0]
print(f"The log_model predicted: {prediction}")
shap.force_plot(explainer.expected_value[1],
                shap_values[1],
                X_test_scaled[start_index:end_index], # for values
                feature_names= X.columns,) 

In [None]:
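# the feature matrix must contain the same rows that shap_values was computed for
shap.summary_plot(shap_values, X_test_scaled[start_index:end_index],max_display=300,feature_names = X.columns)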

# Before the Deployment 
- Choose the model that works best based on your chosen metric.
- As a final step, fit the best model on the whole dataset to get better performance.
- Your model is then ready to deploy: dump the model and the scaler.
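
The last two steps can be sketched with `joblib`. The toy data, the choice of `LogisticRegression`, and the file names below are placeholders for illustration; substitute whichever model and scaler you actually selected.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for the full dataset (features and encoded labels).
rng = np.random.default_rng(42)
X_all = rng.normal(size=(60, 4))
y_all = np.arange(60) % 3

# Fit the scaler and the chosen model on the whole dataset.
final_scaler = MinMaxScaler().fit(X_all)
final_model = LogisticRegression(max_iter=1000).fit(final_scaler.transform(X_all), y_all)

# Persist both objects; the file names are placeholders.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "final_model.joblib")
scaler_path = os.path.join(out_dir, "scaler.joblib")
joblib.dump(final_model, model_path)
joblib.dump(final_scaler, scaler_path)

# At serving time, load both and apply them in the same order.
loaded_scaler = joblib.load(scaler_path)
loaded_model = joblib.load(model_path)
preds = loaded_model.predict(loaded_scaler.transform(X_all[:5]))
print(preds)
```

Saving the scaler together with the model matters because new inputs must go through exactly the same transformation as the training data.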
