
# **FOUNDATIONS OF MODERN MACHINE LEARNING, IIIT Hyderabad**
### MODULE: CLASSIFICATION-1
### LAB-4 : Appreciating Feature Scaling and Standardization
#### Teaching Assistant: Sahil Bhatt

NOTE: YOU ONLY NEED TO MAKE CHANGES/WRITE CODE IN CELLS THAT SPECIFICALLY MENTION TASK-1, TASK-2, etc.

WRITE ANY OBSERVATION(S), IF REQUIRED BY THE TASK, IN A SEPARATE CELL AT THE BOTTOM OF THE NOTEBOOK.  

---

## Binary Classification Task: Diabetes Dataset

We’ll be using the ML techniques learnt up until now to predict whether a Pima Indian woman has diabetes or not, based on information about the patient such as blood pressure, body mass index (BMI), age, etc.



# Introduction

Scientists carried out a study to investigate the significance of health-related predictors of diabetes in **Pima Indian women**. The study population was women of Pima Indian heritage aged 21 years and above; the data were collected by the National Institute of Diabetes and Digestive and Kidney Diseases.

The research question was: which health-related predictors are associated with the presence of diabetes in Pima Indian women?

The study aimed to test the significance of these predictors.

To investigate this, we look at whether there is a relationship between the number of times a woman was pregnant and her BMI, and between whether a woman has diabetes and her diabetes pedigree function (a function that estimates how likely she is to develop the disease by extrapolating from her ancestors’ history).

## Exploratory Data Analysis (EDA) and Statistical Analysis

## Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style('whitegrid')

import warnings
warnings.filterwarnings('ignore')

## Importing the dataset

In [None]:
!gdown --id 1WVKhG73-JFvXdlerXQcgqPqsXO1_PeGl

In [None]:
diabetes_data = pd.read_csv('preprocessed_diabetes_data.csv')

In [None]:
# View top 10 rows of the Diabetes dataset
diabetes_data.head(10)

## Identification of variables and data types

In [None]:
diabetes_data.shape

The dataset comprises 768 observations and 9 fields.

The following features have been provided to help us predict whether a person is diabetic or not:

* **Pregnancies:** Number of times pregnant
* **Glucose:** Plasma glucose concentration at 2 hours in an oral glucose tolerance test (mg/dL). Less than 140 mg/dL is considered a normal level of glucose.
* **BloodPressure:** Diastolic blood pressure (mm Hg). 120/80 is considered a normal BP reading for women above 18 years old; the second number is the diastolic value.
* **SkinThickness:** Triceps skin fold thickness (mm)
* **Insulin:** 2-hour serum insulin (μU/mL). 16-166 mIU/L is considered the normal level of insulin.
* **BMI:** Body mass index (weight in kg / (height in m)²)
* **DiabetesPedigreeFunction:** Diabetes pedigree function (a function which scores likelihood of diabetes based on family history)
* **Age:** Age (years)
* **Outcome:** Class variable (0 if non-diabetic, 1 if diabetic)


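Let’s also make sure that our data is clean (no null values, plausible ranges, etc.). If nothing is missing, the `count` column of the transposed summary below should read 768 for every feature.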

In [None]:
# Get the details of each column
diabetes_data.describe().T
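`describe()` only hints at missing values through its `count` column. A minimal sketch of a more explicit check follows. The small frame `toy` is a made-up stand-in with one missing value planted on purpose; in the notebook the same calls would be applied to `diabetes_data`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for diabetes_data, with one missing value on purpose
toy = pd.DataFrame({
    "Glucose": [148.0, 85.0, np.nan, 89.0],
    "BMI": [33.6, 26.6, 23.3, 28.1],
    "Outcome": [1, 0, 1, 0],
})

# Number of nulls per column; all zeros would mean nothing is missing
null_counts = toy.isnull().sum()
print(null_counts)

# Column dtypes, to confirm that every field is numeric
print(toy.dtypes)
```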

Let us look at the distribution of the "Pregnancies" feature, together with a boxplot to spot outliers.

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 6))

# Histogram with a KDE overlay (sns.distplot is deprecated in recent seaborn)
sns.histplot(diabetes_data['Pregnancies'], kde=True, ax=axes[0], color='b')
axes[0].set_title('Distribution of Pregnancies', fontdict={'fontsize': 8})
axes[0].set_xlabel('No of Pregnancies')
axes[0].set_ylabel('Frequency')

# Vertical boxplot to show outliers; the column must be passed by keyword
sns.boxplot(y='Pregnancies', data=diabetes_data, ax=axes[1], color='r')
plt.tight_layout()

<p style="font-weight: bold;color:#FF4500">Highlights</p>
* Insulin is positively correlated with Glucose (about 0.58) and, more weakly, with BMI (about 0.23) and Age (about 0.22). As glucose, BMI and age increase, insulin tends to increase as well. This seems plausible, since older people and people with a higher BMI might have higher insulin levels.
* In the same way, SkinThickness is strongly correlated with BMI (about 0.65).
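Correlation figures like the ones quoted above come from a pairwise correlation matrix. The sketch below shows how such a matrix can be computed with pandas. The frame `toy` and its values are invented for illustration, so its coefficients will not match the ones in the highlights.

```python
import pandas as pd

# Hypothetical values; in the notebook you would use diabetes_data instead
toy = pd.DataFrame({
    "Glucose": [85, 89, 116, 137, 148, 183],
    "Insulin": [60, 94, 110, 168, 175, 230],
    "BMI": [26.6, 28.1, 25.6, 43.1, 33.6, 23.3],
})

# Pairwise Pearson correlation (the pandas default)
corr = toy.corr()
print(corr.round(2))

# In the notebook, the matrix can be drawn as an annotated heatmap:
# sns.heatmap(diabetes_data.corr(), annot=True)
```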

## Checking  balance of data

We can produce a seaborn count plot to see whether the output is dominated by one of the classes.

In [None]:
plt.figure(figsize=(12,6))
sns.countplot(x='Outcome',data=diabetes_data, palette='bright')
plt.title("Number of samples per Outcome class")

print(diabetes_data['Outcome'].value_counts())

<p style="font-weight: bold;color:#FF4500">Highlights</p>
A total of 768 women were registered in the database. 268 women (about 35%) had diabetes, while 500 women (about 65%) did not.

The above graph shows that the dataset is imbalanced towards non-diabetic patients: the number of non-diabetics is almost twice the number of diabetics.
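One practical consequence of this imbalance is that a classifier which always predicts "non-diabetic" already reaches a fairly high accuracy. The sketch below rebuilds the `Outcome` column from the class counts quoted above (500 and 268) and computes that majority-class baseline; any model should be judged against it.

```python
import pandas as pd

# Class counts taken from the value_counts() output described above
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

# Share of each class in the dataset
class_share = outcome.value_counts(normalize=True)
print(class_share.round(3))

# Accuracy of always predicting the majority class
baseline_accuracy = class_share.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
```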

## Scatter matrix of data

In [None]:
sns.pairplot(diabetes_data,hue='Outcome')

<p style="font-weight: bold;color:#FF4500">Highlights</p>
The pairs plot builds on two basic figures, the histogram and the scatter plot. The histogram on the diagonal allows us to see the distribution of a single variable while the scatter plots on the upper and lower triangles show the relationship (or lack thereof) between two variables.

In [None]:
#@title
plt.figure(figsize=(12,8))
sns.boxplot(x='Outcome', y='BMI',data=diabetes_data, hue='Outcome')

<p style="font-weight: bold;color:#FF4500">Highlights</p>

Women who tested positive for diabetes had a higher median BMI than those who did not, although the difference between the medians is not large. It is also surprising that the median BMI does not change much as the number of pregnancies increases; I expected a strong positive relationship between the number of pregnancies and BMI.

The expectation going in was that BMI would generally be higher for women with more pregnancies and for those who test positive for diabetes, and that women with a higher pedigree function would tend to test positive while those with a lower pedigree function would tend to test negative.

## Pedigree function vs Diabetes 

<p style="font-weight: bold;color:#FF4500">Highlights</p>
This graph shows more clearly the relationship between the pedigree function and the women's diabetes test results. Those who tested positive have a higher median and more high outliers, which suggests that the pedigree function helps estimate the test result. This is consistent with diabetes having a hereditary component: those whose ancestors suffered from it have a higher risk of developing the disease themselves. Both groups show many outliers, yet the outliers among those who tested negative tend to sit at lower pedigree values than the outliers among those who tested positive. This suggests that a genetic component likely contributes to the emergence of diabetes in Pima Indians and their offspring.

## Pregnancy vs Diabetes

<p style="font-weight: bold;color:#FF4500">Highlights</p>
The average number of pregnancies is higher among diabetic women (4.9) than among non-diabetic women (3.3), a sizeable difference.
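Per-class averages like these can be obtained with a `groupby` on the `Outcome` column. The sketch below uses a small invented frame `toy`, so the means it prints are not the 4.9 and 3.3 quoted above; on the real data the same call would be made on `diabetes_data`, and it works equally well for other features such as `DiabetesPedigreeFunction`.

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    "Pregnancies": [1, 2, 3, 6, 8, 4, 0, 5],
    "Outcome":     [0, 0, 0, 1, 1, 1, 0, 1],
})

# Mean number of pregnancies within each outcome class
mean_by_outcome = toy.groupby("Outcome")["Pregnancies"].mean()
print(mean_by_outcome)
```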

## Diabetic in Normal BMI

Let's try to find out the probability of a woman with a normal BMI having diabetes. Note that the normal BMI range is 18.5 to 25.

In [None]:
#@title
normalBMIData = diabetes_data[(diabetes_data['BMI'] >= 18.5) & (diabetes_data['BMI'] <= 25)]
normalBMIData['Outcome'].value_counts()

In [None]:
#@title
notNormalBMIData = diabetes_data[(diabetes_data['BMI'] < 18.5) | (diabetes_data['BMI'] > 25)]
notNormalBMIData['Outcome'].value_counts()
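Raw counts are hard to compare because the two groups differ in size. A minimal sketch of turning them into rates, a relative risk and an odds ratio follows. The frame `toy` is invented, so the numbers only illustrate the calculation; note that relative risk and odds ratio are different quantities and can differ a lot.

```python
import pandas as pd

# Hypothetical values for illustration only
toy = pd.DataFrame({
    "BMI":     [22.0, 24.5, 19.0, 23.1, 21.7, 30.2, 35.5, 27.8, 41.0, 33.3],
    "Outcome": [0,    0,    0,    1,    0,    1,    1,    0,    1,    0],
})

# True for rows in the normal range 18.5 to 25 (both ends inclusive)
normal = toy["BMI"].between(18.5, 25)

# Share of diabetic women in each group
rate_normal = toy.loc[normal, "Outcome"].mean()
rate_other = toy.loc[~normal, "Outcome"].mean()

# How many times more likely diabetes is outside the normal range
relative_risk = rate_other / rate_normal

# Ratio of the odds p / (1 - p) in the two groups
odds_ratio = (rate_other / (1 - rate_other)) / (rate_normal / (1 - rate_normal))

print(f"rate (normal BMI): {rate_normal:.2f}")
print(f"rate (other BMI):  {rate_other:.2f}")
print(f"relative risk:     {relative_risk:.2f}")
print(f"odds ratio:        {odds_ratio:.2f}")
```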

In [None]:
#@title
plt.figure(figsize=(12,8))
sns.boxplot(x='Outcome', y='BMI',data=notNormalBMIData)

<p style="font-weight: bold;color:#FF4500">Highlights</p>

The Body Mass Index (BMI) shows a clear association with the occurrence of diabetes: comparing the two value counts above, women outside the normal BMI range were at a markedly higher risk of being diabetic (reported as almost 9 times) than women of normal weight.

In addition, the interquartile range for the women who tested positive reaches a higher BMI than the IQR for those who tested negative. Therefore, women could have higher BMIs and not be outliers if they tested positive as opposed to negative, showing that more women who tested positive did, in fact, have higher BMIs than those who tested negative. 


## Age vs Diabetes

<p style="font-weight: bold;color:#FF4500">Highlights</p>
A clear relation can be seen between the age distribution and the occurrence of diabetes. Women older than 31 years were at higher risk of developing diabetes than the younger age group.

# The Importance of Standardizing Data

In [None]:
unchanged_data = diabetes_data.drop('Outcome',axis=1)

In [None]:
unchanged_data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

# Choosing a K Value
Let's go ahead and use the elbow method to pick a good K Value!

*Create a for loop that trains various KNN models with different k values, then keep track of the error_rate for each of these models with a list.*

In [None]:
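def plot_KNN_error_rate(xdata,ydata):
  """Train KNN for k = 1..39 on a 70/30 split, plot the test error rate
  against k, print the best train/test scores, and return the test scores."""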
  error_rate = []
  test_scores = []
  train_scores = []

  X_train, X_test, y_train, y_test = train_test_split(xdata,ydata,test_size=0.30,random_state=101)
  
  for i in range(1,40):
      knn = KNeighborsClassifier(n_neighbors=i)
      knn.fit(X_train,y_train)
      pred_i = knn.predict(X_test)
      
      error_rate.append(np.mean(pred_i != y_test))
      train_scores.append(knn.score(X_train,y_train))
      test_scores.append(knn.score(X_test,y_test))

  plt.figure(figsize=(12,8))
  plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
          markerfacecolor='red', markersize=10)
  plt.title('Error Rate vs. K Value')
  plt.xlabel('K')
  plt.ylabel('Error Rate')
  print()
  ## score that comes from testing on the same datapoints that were used for training
  max_train_score = max(train_scores)
  train_scores_ind = [i for i, v in enumerate(train_scores) if v == max_train_score]
  print('Max train score {} % and k = {}'.format(max_train_score*100,list(map(lambda x: x+1, train_scores_ind))))
  print()
  ## score that comes from testing on the datapoints that were split in the beginning to be used for testing solely
  max_test_score = max(test_scores)
  test_scores_ind = [i for i, v in enumerate(test_scores) if v == max_test_score]
  print('Max test score {} % and k = {}'.format(max_test_score*100,list(map(lambda x: x+1, test_scores_ind))))

  return test_scores

In [None]:
unchanged_test_scores = plot_KNN_error_rate(unchanged_data,diabetes_data['Outcome'])

## Standardize the Variables
Standardization (also called z-score normalization) is the process of putting different variables on the same scale. Standardization transforms your data such that the resulting distribution has a mean of 0 and a standard deviation of 1. 

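$$ Z = \frac{X - \mu}{\sigma} $$

Here $\mu$ is the mean of a feature and $\sigma$ is its standard deviation, both computed per column.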
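As a worked example of the formula, the sketch below standardizes a tiny made-up matrix `X` by hand with NumPy. The two columns are on very different scales on purpose, which is the situation where a distance-based method such as KNN would otherwise be dominated by the larger feature. The values are invented for illustration.

```python
import numpy as np

# Two made-up features on very different scales
X = np.array([
    [148.0, 0.627],
    [85.0, 0.351],
    [183.0, 0.672],
    [89.0, 0.167],
])

mu = X.mean(axis=0)      # per-column mean
sigma = X.std(axis=0)    # per-column population std (ddof=0), as StandardScaler uses

# Z = (X - mu) / sigma, applied column by column via broadcasting
Z = (X - mu) / sigma

print(Z.round(3))
print("column means:", Z.mean(axis=0).round(6))
print("column stds: ", Z.std(axis=0).round(6))
```

After the transformation every column has mean 0 and standard deviation 1 (up to floating-point error), so both features contribute comparably to Euclidean distances.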


In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [None]:
scaler.fit(diabetes_data.drop('Outcome',axis=1))

In [None]:
scaled_data = scaler.transform(diabetes_data.drop('Outcome',axis=1))

In [None]:
df_feat = pd.DataFrame(scaled_data,columns=diabetes_data.columns[:-1])
df_feat.head()

In [None]:
scaled_test_scores = plot_KNN_error_rate(scaled_data,diabetes_data['Outcome'])
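Note that the cells above fit the scaler on the full dataset before `plot_KNN_error_rate` splits it, so the test rows influence the mean and standard deviation used for scaling. A common alternative, sketched below under the assumption that you want to avoid this leakage, is to put the scaler and the classifier in a scikit-learn pipeline so that the scaler is fitted on the training split only. The data here is synthetic and `n_neighbors=5` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples, 2 features on very different scales
X = np.column_stack([rng.normal(120, 30, 200), rng.normal(0.5, 0.3, 200)])
y = (X[:, 1] > 0.5).astype(int)  # label depends on the small-scale feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=101
)

# fit() computes the scaling statistics from X_train only,
# then score() applies those same statistics to X_test
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```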

In [None]:
from sklearn.preprocessing import Normalizer
l2_normalizer = Normalizer()

In [None]:
l2_normalizer.fit(diabetes_data.drop('Outcome',axis=1))
l2_normalized_data = l2_normalizer.transform(diabetes_data.drop('Outcome',axis=1))
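Unlike `StandardScaler`, which rescales each feature (column), `Normalizer` rescales each sample (row) so that the row has unit norm. The sketch below illustrates this on a tiny made-up matrix: with `norm='l2'` every row ends up with Euclidean length 1, and with `norm='l1'` the absolute values in every row sum to 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Made-up rows for illustration
X = np.array([
    [3.0, 4.0],
    [1.0, 1.0],
    [0.5, 2.0],
])

X_l2 = Normalizer(norm="l2").fit_transform(X)
X_l1 = Normalizer(norm="l1").fit_transform(X)

# Each row of X_l2 has Euclidean length 1
l2_row_norms = np.linalg.norm(X_l2, axis=1)
# The absolute values in each row of X_l1 sum to 1
l1_row_sums = np.abs(X_l1).sum(axis=1)

print(X_l2.round(3))
print(l2_row_norms)
print(l1_row_sums)
```

Because the operation is row-wise, it does not put the features on a common scale, which is worth keeping in mind when comparing its KNN scores with those of standardized data.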

In [None]:
df_feat = pd.DataFrame(l2_normalized_data,columns=diabetes_data.columns[:-1])
df_feat.head()

In [None]:
l2_normalized_test_scores = plot_KNN_error_rate(l2_normalized_data,diabetes_data['Outcome'])

In [None]:
l1_normalizer = Normalizer(norm='l1')
l1_normalizer.fit(diabetes_data.drop('Outcome',axis=1))
l1_normalized_data = l1_normalizer.transform(diabetes_data.drop('Outcome',axis=1))
l1_normalized_test_scores = plot_KNN_error_rate(l1_normalized_data,diabetes_data['Outcome'])

## Comparing Accuracy before and after Standardization

In [None]:
k_values = list(range(1, 40))

plt.figure(figsize=(20,8))
plt.title('Accuracy vs. K Value')
# Recent seaborn versions require x and y to be passed as keywords
sns.lineplot(x=k_values, y=unchanged_test_scores, marker='o', label='Unscaled data test score')
sns.lineplot(x=k_values, y=scaled_test_scores, marker='o', label='Scaled data test score')
sns.lineplot(x=k_values, y=l2_normalized_test_scores, marker='o', label='L2 normalized data test score')
sns.lineplot(x=k_values, y=l1_normalized_test_scores, marker='o', label='L1 normalized data test score')
plt.xlabel('K')
plt.ylabel('Test accuracy')
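To read the comparison off as numbers rather than from the plot, a small helper can report the best k for each list of scores. The function `best_k` and the example scores below are hypothetical and only show the call; in the notebook it could be applied to `unchanged_test_scores`, `scaled_test_scores` and the two normalized score lists. Keep in mind that choosing k on the test set gives an optimistic estimate.

```python
def best_k(test_scores, k_values=range(1, 40)):
    """Return (k, score) for the highest test score; ties go to the smallest k."""
    scores = list(test_scores)
    ks = list(k_values)
    # max() returns the first index that attains the maximum score
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return ks[best_index], scores[best_index]


# Hypothetical scores for k = 1..5, only to show the call
example_scores = [0.70, 0.74, 0.78, 0.78, 0.75]
print(best_k(example_scores, k_values=range(1, 6)))  # prints (3, 0.78)
```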

# Conclusion

Overall, there appears to be some association between BMI, number of pregnancies, pedigree function, and the test results for diabetes. It is surprising that the median BMI does not change much as the number of pregnancies increases; I expected a strong positive relationship between the two. Those who tested positive for diabetes had higher BMIs than those who did not, yet I had predicted a larger difference between the medians.

To better understand the relationship between the pedigree function and the test results, it would be interesting to also have males and people under 21 in the sample. That way, possible confounding variables, such as a hormone present only in females that may contribute to diabetes, could be ruled out.

# References

https://www.kaggle.com/dktalaicha/diabetes-prediction-by-knn