
# ML Tuesdays - Session 2
## Machine Learning Track
### Diabetes Classification Exercise (Solution)

### Guidelines
1. The notebook has been split into multiple steps with fine-grained instructions for each step. Use the instructions for each code cell to complete the code.
2. You can refer to the Logistic Regression Module in the Machine Learning Pack from CellStrat Hub.
3. View the docstrings of functions and classes with the `Shift+Tab` shortcut.
4. Refer to the internet for an explanation of any algorithm.

## About the Dataset
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values. The variable names are as follows:

- Number of times pregnant.
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- Diastolic blood pressure (mm Hg).
- Triceps skinfold thickness (mm).
- 2-Hour serum insulin (mu U/ml).
- Body mass index (weight in kg/(height in m)^2).
- Diabetes pedigree function.
- Age (years).
- Class variable (0 or 1).
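Since missing values are believed to be encoded as zeros, a quick count of zeros per column gives a rough picture of how much data is affected. The sketch below uses only five rows copied from the start of the dataset, and the frame name `sample` is hypothetical. Treating every zero in these columns as missing is an assumption, not something the dataset confirms.

```python
import pandas as pd

# Five rows of selected columns, copied from the start of the dataset.
sample = pd.DataFrame(
    {
        "Glucose": [148, 85, 183, 89, 137],
        "BloodPressure": [72, 66, 64, 66, 40],
        "SkinThickness": [35, 29, 0, 23, 35],
        "Insulin": [0, 0, 0, 94, 168],
        "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    }
)

# Count zeros per column. A zero in these columns is physiologically
# implausible, so it is assumed here to mark a missing value.
zero_counts = (sample == 0).sum()
print(zero_counts)
```

The same expression, `(dataset == 0).sum()`, can be run on the full DataFrame once it is loaded. Columns such as `Pregnancies` and `Outcome` legitimately contain zeros and should be excluded from that reading.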

In [5]:
import pandas as pd

In [7]:
dataset = pd.read_csv('pima-indians-diabetes.csv')

In [8]:
dataset

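| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |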


## Data Preprocessing

1. Split the data into X (features) and y (target).
2. Perform a train-test split.
3. Scale the features (use standardization or normalization).

In [11]:
# Features: every column except the last
X_data = dataset.iloc[:, :-1]
# Target: the last column (Outcome)
y_data = dataset.iloc[:, -1]

In [12]:
from sklearn.model_selection import train_test_split

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=0)
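Because the two classes are not balanced, one option worth knowing is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both splits. The sketch below is illustrative only and is not the split used in this notebook. The labels `y_demo` reproduce the class counts implied by the support columns in the reports further down (500 negatives, 268 positives), and the features `X_demo` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 500 negatives and 268 positives, matching the class
# counts implied by this notebook's reports. The single feature is a dummy.
y_demo = np.array([0] * 500 + [1] * 268)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo asks for the same class ratio in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print("positive share overall:", round(y_demo.mean(), 3))
print("positive share in train:", round(y_tr.mean(), 3))
print("positive share in test:", round(y_te.mean(), 3))
```

Without stratification the test set can end up with a noticeably different share of positives than the training set, which makes per-class metrics harder to compare.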

In [15]:
from sklearn.preprocessing import StandardScaler

In [16]:
scaler = StandardScaler()

In [17]:
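# Fit the scaler on the training data only, then transform it
X_train = scaler.fit_transform(X_train)
# Reuse the training mean and standard deviation on the test data
X_test = scaler.transform(X_test)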
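To make the scaling step concrete, the following sketch checks that `StandardScaler` applies z = (x - mean) / std, with both statistics taken from the training data alone. The small arrays hold Glucose and BMI values copied from the first rows of the dataset. The names `X_train_demo`, `X_test_demo`, and `scaler_demo` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Glucose and BMI for four "training" rows and one "test" row.
X_train_demo = np.array([[148.0, 33.6], [85.0, 26.6], [183.0, 23.3], [89.0, 28.1]])
X_test_demo = np.array([[137.0, 43.1]])

scaler_demo = StandardScaler().fit(X_train_demo)

# Manual standardization using the training statistics only.
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default.
train_mean = X_train_demo.mean(axis=0)
train_std = X_train_demo.std(axis=0)
manual_test = (X_test_demo - train_mean) / train_std

print(scaler_demo.transform(X_test_demo))
print(manual_test)
```

Both printed arrays should agree. Fitting the scaler on the test data as well would let test-set statistics leak into preprocessing.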

## Training

Train four different models:
1. Logistic Regression
2. K Nearest Neighbours (KNN)
3. Decision Tree
4. Random Forest

Use the scikit-learn documentation and web searches to understand each algorithm and how to apply it.

In [35]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [58]:
logistic_model = LogisticRegression()
knn = KNeighborsClassifier()
tree_model = DecisionTreeClassifier()
forest_model = RandomForestClassifier()

In [59]:
logistic_model.fit(X_train, y_train)

LogisticRegression()

In [60]:
knn.fit(X_train, y_train)

KNeighborsClassifier()

In [61]:
tree_model.fit(X_train, y_train)

DecisionTreeClassifier()

In [62]:
forest_model.fit(X_train, y_train)

RandomForestClassifier()
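The four `fit` calls above can also be written as a loop over a dictionary of models, which keeps training and later evaluation in one place. This is a sketch under assumptions: the data comes from `make_classification` as a synthetic stand-in for the scaled training set, and `random_state=0` is added to the tree-based models so that reruns give the same result (the notebook's own models use the defaults).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled training data (illustration only).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit every model on the same data and report its training accuracy.
for name, model in models.items():
    model.fit(X_demo, y_demo)
    print(f"{name}: train accuracy = {model.score(X_demo, y_demo):.2f}")
```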

## Evaluation

1. Evaluate each model using the `classification_report` function in `sklearn.metrics`.
2. Check which model has the best results on the train and test sets.
3. Have any of the models overfitted?

In [63]:
from sklearn.metrics import classification_report

In [64]:
def evaluate(model, X, y):
    """Return a text classification report for the model's predictions on X."""
    return classification_report(y, model.predict(X))
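For the overfitting question, it can help to pull the accuracy out of the report as a number rather than reading it from printed text. The sketch below uses the `output_dict=True` option of `classification_report` for that. The helper name `accuracy_gap` and the synthetic data are assumptions for illustration and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def accuracy_gap(model, X_train, y_train, X_test, y_test):
    """Return (train accuracy, test accuracy, train minus test) for a fitted model."""
    train = classification_report(y_train, model.predict(X_train), output_dict=True)
    test = classification_report(y_test, model.predict(X_test), output_dict=True)
    return train["accuracy"], test["accuracy"], train["accuracy"] - test["accuracy"]


# Synthetic stand-in data (illustration only).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

tree_demo = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc, test_acc, gap = accuracy_gap(tree_demo, X_tr, y_tr, X_te, y_te)
print(f"train={train_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
```

A large positive gap suggests the model has memorized the training data. A gap near zero, or a negative one, does not.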

In [65]:
print('Train Results with Logistic Regression')
print(evaluate(logistic_model, X_train, y_train))

print('\nTest Results with Logistic Regression')
print(evaluate(logistic_model, X_test, y_test))

Train Results with Logistic Regression
              precision    recall  f1-score   support

           0       0.78      0.87      0.82       393
           1       0.71      0.57      0.64       221

    accuracy                           0.76       614
   macro avg       0.75      0.72      0.73       614
weighted avg       0.76      0.76      0.76       614


Test Results with Logistic Regression
              precision    recall  f1-score   support

           0       0.84      0.92      0.88       107
           1       0.76      0.62      0.68        47

    accuracy                           0.82       154
   macro avg       0.80      0.77      0.78       154
weighted avg       0.82      0.82      0.82       154



In [66]:
print('Train Results with KNN')
print(evaluate(knn, X_train, y_train))

print('\nTest Results with KNN')
print(evaluate(knn, X_test, y_test))

Train Results with KNN
              precision    recall  f1-score   support

           0       0.83      0.89      0.86       393
           1       0.77      0.67      0.72       221

    accuracy                           0.81       614
   macro avg       0.80      0.78      0.79       614
weighted avg       0.81      0.81      0.81       614


Test Results with KNN
              precision    recall  f1-score   support

           0       0.85      0.87      0.86       107
           1       0.68      0.64      0.66        47

    accuracy                           0.80       154
   macro avg       0.76      0.75      0.76       154
weighted avg       0.80      0.80      0.80       154



In [67]:
print('Train Results with Decision Tree Classifier')
print(evaluate(tree_model, X_train, y_train))

print('\nTest Results with Decision Tree Classifier')
print(evaluate(tree_model, X_test, y_test))

Train Results with Decision Tree Classifier
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       393
           1       1.00      1.00      1.00       221

    accuracy                           1.00       614
   macro avg       1.00      1.00      1.00       614
weighted avg       1.00      1.00      1.00       614


Test Results with Decision Tree Classifier
              precision    recall  f1-score   support

           0       0.83      0.80      0.82       107
           1       0.59      0.64      0.61        47

    accuracy                           0.75       154
   macro avg       0.71      0.72      0.72       154
weighted avg       0.76      0.75      0.76       154



In [68]:
print('Train Results with Random Forest Classification')
print(evaluate(forest_model, X_train, y_train))

print('\nTest Results with Random Forest Classification')
print(evaluate(forest_model, X_test, y_test))

Train Results with Random Forest Classification
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       393
           1       1.00      1.00      1.00       221

    accuracy                           1.00       614
   macro avg       1.00      1.00      1.00       614
weighted avg       1.00      1.00      1.00       614


Test Results with Random Forest Classification
              precision    recall  f1-score   support

           0       0.85      0.88      0.87       107
           1       0.70      0.66      0.68        47

    accuracy                           0.81       154
   macro avg       0.78      0.77      0.77       154
weighted avg       0.81      0.81      0.81       154



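The tree-based models have overfitted: the Decision Tree and Random Forest both reach 1.00 accuracy on the training set but only 0.75 and 0.81 on the test set. Logistic Regression has the best test accuracy of the four (0.82) and shows no sign of overfitting, since its training accuracy (0.76) is lower than its test accuracy. KNN (0.80) and Random Forest (0.81) are close behind on the test set.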
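One common way to reduce this kind of overfitting is to limit the depth of the tree, and to compare models with cross-validation instead of a single 154-row test set. The sketch below shows the idea on synthetic data from `make_classification`. The depth of 3 and the 5 folds are arbitrary illustrative choices, not tuned values, and the results do not describe the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustration only).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

results = {}
for label, depth in (("unlimited", None), ("max_depth=3", 3)):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # 5-fold cross-validated accuracy: each fold is scored on unseen rows.
    cv_mean = cross_val_score(tree, X_demo, y_demo, cv=5).mean()
    # Training accuracy after fitting on all rows, for comparison.
    train_acc = tree.fit(X_demo, y_demo).score(X_demo, y_demo)
    results[label] = {"train": train_acc, "cv": cv_mean}
    print(f"{label}: train={train_acc:.2f} cv={cv_mean:.2f}")
```

The unlimited tree fits the training rows perfectly, so its training accuracy says little about how it generalizes. The cross-validated score is the number to compare across settings.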