<center><img src="https://github.com/insaid2018/Term-1/blob/master/Images/INSAID_Full%20Logo.png?raw=true" width="240" height="100" /></center>

# <center><b>KNN Assignment (Problem)</b></center>

---
# **Table of Contents**
---

**1.** [**Introduction**](#Section1)<br>
**2.** [**Problem Statement**](#Section2)<br>
**3.** [**Installing & Importing Libraries**](#Section3)<br>
  - **3.1** [**Installing Libraries**](#Section31)
  - **3.2** [**Upgrading Libraries**](#Section32)
  - **3.3** [**Importing Libraries**](#Section33)

**4.** [**Data Acquisition & Description**](#Section4)<br>
  - **4.1** [**Data Description**](#Section41)
  - **4.2** [**Data Information**](#Section42)

**5.** [**Data Pre-processing**](#Section5)<br>
  - **5.1** [**Pre-Profiling Report**](#Section51)<br>
  - **5.2** [**Post-Profiling Report**](#Section52)<br>

**6.** [**Exploratory Data Analysis**](#Section6)<br>
**7.** [**Data Post-Processing**](#Section7)<br>
**8.** [**Model Development & Evaluation**](#Section8)<br>
**9.** [**Conclusion**](#Section9)<br>

---
<a name = Section1></a>
# **1. Introduction**
---

- The K-nearest neighbors (KNN) algorithm is a simple, easy-to-implement **supervised machine learning** algorithm.

- It can be used to solve both **classification** and **regression** problems.

- The KNN algorithm assumes that **similar things** exist in **close proximity**. In other words, similar things are near to each other.

# <center>***Birds of a feather flock together***</center>

- KNN captures the idea of **similarity** (distance, proximity, or closeness) by **calculating** the **distance** between points on a graph.<br><br>

<center><img src="https://www.newtechdojo.com/wp-content/uploads/2020/06/KNN-1.gif"></center><br>

- The straight-line distance (also called the **Euclidean distance**) is a **popular** and **familiar** choice for calculating the distance.
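To make the idea above concrete, here is a minimal sketch of the Euclidean distance and a majority vote among the nearest points. The helper names `euclidean_distance` and `knn_predict` and the toy points are assumptions made for illustration; they are not part of the assignment or of any library.

```python
from collections import Counter

import numpy as np


def euclidean_distance(a, b):
    """Straight-line distance between two points of equal dimension."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))


def knn_predict(X_train, y_train, query, k=3):
    """Label a query point by majority vote among its k nearest training points."""
    distances = [euclidean_distance(x, query) for x in X_train]
    nearest = np.argsort(distances)[:k]              # indices of the k closest points
    votes = Counter(y_train[i] for i in nearest)     # count the labels of those points
    return votes.most_common(1)[0][0]


# Toy data (invented for illustration): two small, well-separated clusters
X_toy = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
y_toy = [0, 0, 0, 1, 1, 1]

print('Distance:', euclidean_distance([0, 0], [3, 4]))
print('Predicted label:', knn_predict(X_toy, y_toy, query=[2, 2], k=3))
```

The query point `[2, 2]` sits inside the first cluster, so its three nearest neighbors all carry the label of that cluster.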






---
<a name = Section2></a>
# **2. Problem Statement**
---

- **Diabetes** is one of the most **common** health issues in today's world, and **detecting** it **early** could be **beneficial** for many potential patients.

- Pre-diabetic stages can be detected by various factors such as:
  - **glucose levels**, 
  - **insulin levels**, 
  - **skin thickness**, 
  - **body mass index (BMI)**, and 
  - **blood pressure**.

<center><img src="https://image.freepik.com/free-vector/scientists-with-folder-clipboard-working-with-huge-dna-test-tube-genetic-testing-dna-testing-genetic-diagnosis-concept-white-background-pinkish-coral-bluevector-isolated-illustration_335657-1521.jpg" width=50%></center>

- Let's say **XYZ Diagnostics** has a good record of **detecting diabetes** in the early stages.

- They want to **automate** the process of **detecting diabetes** based on various factors of a patient.

- They have hired a data scientist for this task. Let's say it is you.

- You have been provided with a **dataset** that contains the medical history of **patients**.

- Your task is to design a **classifier** that can classify if a **person** is **diabetic or not**.

---
<a name = Section3></a>
# **3. Installing & Importing Libraries**
---

<a name = Section31></a>
### **3.1 Installing Libraries**

In [None]:
# !pip install -q datascience                                         # Package that is required by pandas profiling
# !pip install -q pandas-profiling                                    # Library to generate basic statistics about data
# !pip install -q yellowbrick                                         # Toolbox for Measuring Machine Performance

<a name = Section32></a>
### **3.2 Upgrading Libraries**

- **After upgrading** the libraries, you need to **restart the runtime** so that the upgraded versions are loaded.

- After restarting the runtime, do not execute the installation cell above (3.1) or the upgrade cell below (3.2) again.

In [None]:
# !pip install -q --upgrade pandas-profiling
# !pip install -q --upgrade yellowbrick

<a name = Section33></a>
### **3.3 Importing Libraries**

In [None]:
#-------------------------------------------------------------------------------------------------------------------------------
import pandas as pd                                                 # Importing for panel data analysis
from pandas_profiling import ProfileReport                          # Import Pandas Profiling (To generate Univariate Analysis) 
pd.set_option('display.max_columns', None)                          # Unfolding hidden features if the cardinality is high      
pd.set_option('display.max_colwidth', None)                         # Unfolding the max feature width for better clarity
pd.set_option('display.max_rows', None)                             # Unfolding hidden data points if the cardinality is high
pd.set_option('mode.chained_assignment', None)                      # Removing restriction over chained assignments operations
pd.set_option('display.float_format', lambda x: '%.2f' % x)         # To suppress scientific notation over exponential values
#-------------------------------------------------------------------------------------------------------------------------------
import numpy as np                                                  # Importing NumPy (for numerical Python)
#-------------------------------------------------------------------------------------------------------------------------------
import matplotlib.pyplot as plt                                     # Importing pyplot interface using matplotlib
import seaborn as sns                                               # Importing seaborn library for statistical visualization
%matplotlib inline
#-------------------------------------------------------------------------------------------------------------------------------
from sklearn.preprocessing import StandardScaler                    # To import a standard scaler for scaling the features
from sklearn.model_selection import train_test_split                # To split the data into train and test datasets
from sklearn.neighbors import KNeighborsClassifier                  # To instantiate a KNN Classifier
from sklearn.metrics import accuracy_score                          # To calculate the accuracy of a classifier
#-------------------------------------------------------------------------------------------------------------------------------
import warnings                                                     # Importing warnings to control runtime warnings
warnings.filterwarnings("ignore")                                   # Suppress all warnings

---
<a name = Section4></a>
# **4. Data Acquisition & Description**
---

- The dataset consists of several **medical predictor variables** and one target variable.

- The predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

<br>

| Records | Features | Dataset Size |
| :-- | :-- | :-- |
| 768 | 9 | 23.0 KB| 

<br>

| Id | Features | Description |
| :-- | :--| :--| 
| 01 | **Pregnancies** | Number of times pregnant. |
| 02 | **Glucose** | Plasma glucose concentration after 2 hours in an oral glucose tolerance test. |
| 03 | **BloodPressure** | Diastolic blood pressure (mm Hg). |
| 04 | **SkinThickness** | Triceps skin fold thickness (mm). |
| 05 | **Insulin** | 2-Hour serum insulin ($\mu$U/ml). |
| 06 | **BMI** | Body mass index (weight in kg / (height in m)$^2$). |
| 07 | **DiabetesPedigreeFunction** | Diabetes pedigree function. |
| 08 | **Age** | Age (years). |
| 09 | **Outcome** | Class variable (**0:** No diabetes or **1:** diabetes).|

In [None]:
data = pd.read_csv(filepath_or_buffer='https://raw.githubusercontent.com/insaid2018/Term-3/master/Data/Assignment/diabetes.csv')
print('Data Shape:', data.shape)
data.head()

Data Shape: (768, 9)


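| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
| :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- | :-- |
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.63 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.35 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.67 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.17 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.29 | 33 | 1 |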


<a name = Section41></a>
### **4.1 Data Description**

- In this section we will get **information about the data** and see some observations.

In [None]:
data.describe()

<a name = Section42></a>
### **4.2 Data Information**

- In this section we will see the **information about the types of features**.

In [None]:
data.info()

<a name = Section5></a>

---
# **5. Data Pre-Processing**
---

<a name = Section51></a>
### **5.1 Pre Profiling Report**

- **Pandas profiling** is very handy for **quick analysis**.

- It generates profile reports from a pandas DataFrame.

- For each column, **statistics** are presented in an interactive HTML report.

In [None]:
# profile = ProfileReport(df=data)
# profile.to_file(output_file='Pre Profiling Report.html')
# print('Accomplished!')

**Performing Operations**


---
**<h4>Question 1:** Create a function that replaces the zero values of the following columns with the median values - **BMI**, **Glucose**, **SkinThickness**, and **BloodPressure**.</h4>

---

<details>

**<summary>Hint:</summary>**

- You can use the `.replace()` method to replace the zeros with the median value.

- You can use `(data[column] == 0).sum()` to check how many zeros are present.

</details>
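The two hints can be tried out on a small frame before writing the function. The sketch below is only an illustration: the `toy` DataFrame and its values are invented, and the median is computed over the column as it stands (zeros included), which is one possible reading of the question rather than the required one.

```python
import pandas as pd

# Toy frame (values invented for illustration); zeros stand in for missing readings
toy = pd.DataFrame({'BMI': [0.0, 22.0, 30.0, 26.0],
                    'Glucose': [100, 0, 120, 140]})

for column in ['BMI', 'Glucose']:
    zeros_before = (toy[column] == 0).sum()              # how many zeros are present
    median_value = toy[column].median()                  # median of the column, zeros included
    toy[column] = toy[column].replace(0, median_value)   # swap every zero for the median
    print(column, '| zeros replaced:', zeros_before, '| median used:', median_value)

print(toy)
```

If the zeros should not influence the median, an alternative is to compute it from the non-zero values only, for example `toy.loc[toy[column] != 0, column].median()`.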

In [None]:
def fill_zeros(data=None, columns=None):
  # Put your code here...

In [None]:
data = fill_zeros(data=data, columns=['BMI', 'Glucose', 'SkinThickness', 'BloodPressure'])

In [None]:
data.describe()

<a name = Section52></a>
### **5.2 Post Profiling Report**

- The post-profiling report is used to check the statistical description of the data after the data cleaning operations.

In [None]:
# profile = ProfileReport(df=data)
# profile.to_file(output_file='Post Profiling Report.html')
# print('Accomplished!')

<a name = Section6></a>

---
# **6. Exploratory Data Analysis**
---


---
**<h4>Question 2:** Create a function that compares age with blood pressure along with the diabetic condition of the patient.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create a 15x7 inches figure.

- Plot a scatterplot using the `sns.scatterplot` method between 'Age' and 'BloodPressure' features.

- Add additional cosmetics like `grid` and `title`.

- Set `fontsize` for labels as 14, for ticks as 12, and title as 16.

- Use `plt.show()` to properly display the plot.

</details>
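As a rough illustration of the hints, the sketch below draws a scatter plot on a handful of invented records. The `toy` DataFrame and the title text are assumptions; only the column names are taken from the dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy records (invented for illustration), using the dataset's column names
toy = pd.DataFrame({'Age': [21, 33, 45, 52, 60, 29],
                    'BloodPressure': [66, 70, 80, 85, 90, 72],
                    'Outcome': [0, 0, 1, 1, 1, 0]})

fig, ax = plt.subplots(figsize=(15, 7))                  # 15x7 inches figure
# hue colours each point by the diabetic outcome
sns.scatterplot(data=toy, x='Age', y='BloodPressure', hue='Outcome', ax=ax)
ax.set_xlabel('Age', fontsize=14)
ax.set_ylabel('BloodPressure', fontsize=14)
ax.tick_params(labelsize=12)
ax.set_title('Age vs BloodPressure by Outcome', fontsize=16)
ax.grid(True)
plt.show()
```

Passing the third column through `hue` is what brings the diabetic condition into the comparison; the same pattern should carry over to the full dataset.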


In [None]:
def age_bp(data=None, column1=None, column2=None, column3=None):
  # Put your code here...

In [None]:
age_bp(data=data, column1='Age', column2='BloodPressure', column3='Outcome')

---
**<h4>Question 3:** Create a function that compares BMI with blood pressure of the patient.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create a 15x7 inches figure.

- Plot a regression plot using `sns.regplot()` on BMI and BloodPressure.

- Add additional cosmetics like `grid` and `title`.

- Set `fontsize` for labels as 14, for ticks as 12, and title as 16.

- Use `plt.show()` to properly display the plot.

</details>


In [None]:
def bmi_bp(data=None, column1=None, column2=None):
  # Put your code here...

In [None]:
bmi_bp(data=data, column1='BMI', column2='BloodPressure')

---
**<h4>Question 4:** Create a function that checks the effect of BMI on the diabetic state of the patient.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create a figure of 20x7 inches with 2 subplots (`nrows=1, ncols=2`).

- Plot histograms on BMI for Outcome = 1 (using ax[1].hist) and Outcome = 0 (using ax[0].hist)

- Add additional cosmetics like `grid` and `title` using the axes.

- Set `fontsize` for labels as 14 and title as 16.

- Use `plt.show()` to properly display the plot.

</details>
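The two-panel layout described in the hints can be sketched as follows. The `toy` values, the number of bins, and the title wording are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy records (invented for illustration), using the dataset's column names
toy = pd.DataFrame({'BMI': [22.1, 24.5, 27.0, 31.2, 35.8, 38.4, 29.9, 33.3],
                    'Outcome': [0, 0, 0, 1, 1, 1, 0, 1]})

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(20, 7))   # one panel per outcome
for outcome in (0, 1):
    values = toy.loc[toy['Outcome'] == outcome, 'BMI']      # BMI values for this outcome
    ax[outcome].hist(values, bins=5)
    ax[outcome].set_title(f'BMI distribution (Outcome = {outcome})', fontsize=16)
    ax[outcome].set_xlabel('BMI', fontsize=14)
    ax[outcome].set_ylabel('Frequency', fontsize=14)
    ax[outcome].grid(True)
plt.show()
```

Indexing the axes array with the outcome value keeps Outcome = 0 on the left panel and Outcome = 1 on the right, matching the hint.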

In [None]:
def bmi(data=None, column1=None):
  # Put your code here...

In [None]:
bmi(data=data, column1='BMI')

---
**<h4>Question 5:** Create a function that checks the glucose levels and their effect on the diabetic state of the patient.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create a figure of 20x7 inches with 2 subplots (`nrows=1, ncols=2`).

- Use `sns.kdeplot()` on Glucose for Outcome = 1 and Outcome = 0.

- Keep fill parameter as True for the kdeplot.

- Assign axis to each of the plots.

- Add additional cosmetics like `grid` and `title` using the axes.

- Set `fontsize` for labels as 14 and title as 16.

- Use `plt.show()` to properly display the plot.

</details>


In [None]:
def glucose(data=None, column1=None):
  # Put your code here...

In [None]:
glucose(data=data, column1='Glucose')

---
**<h4>Question 6:** Create a function that plots a hexbin plot comparing glucose levels across all the ages.</h4>

---

<details>

**<summary>Hint:</summary>**

- You can use `sns.jointplot()` with `kind='hex'` on glucose and age variables.

- Keep the `height` parameter of the jointplot as 7 (this parameter was called `size` in older seaborn versions).

- Add additional cosmetics like `grid` and `suptitle`.

- Set `fontsize` for ticks as 12, labels as 14 and supertitle as 16.

- Use `plt.show()` to properly display the plot.

</details>

In [None]:
def age_glucose(data=None, column1=None, column2=None):
  # Put your code here...

In [None]:
age_glucose(data=data, column1='Age', column2='Glucose')

<a name = Section7></a>

---
# **7. Data Post-Processing**
---

<a name = Section71></a>
### **7.1 Feature Scaling**

- In this section, we will perform standard scaling over the features.

---
**<h4>Question 7:** Create a function that creates two dataframes for dependent and independent features.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create input dataframe X by dropping only "Outcome" feature from axis 1.

- Create target series by using "Outcome" as value.

</details>


In [None]:
def seperate_Xy(data=None):
  # Put your code here...

In [None]:
X, y = seperate_Xy(data=data)


---
**<h4>Question 8:** Create a function that scales the dataframe of independent features using a standard scaler.</h4>

---

<details>

**<summary>Hint:</summary>**

- Instantiate a scaler using StandardScaler()

- Fit and transform **X** using `.fit_transform(X)` method of the **StandardScaler**.

- Put the values of scaled X (in numpy array format) into a new dataframe with column names of X.

</details>
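Scaling matters for KNN because features with a large numeric range would otherwise dominate the distance. The sketch below walks through the hinted steps on an invented frame; `toy_X` and `toy_scaled` are hypothetical names.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy features (values invented for illustration)
toy_X = pd.DataFrame({'Glucose': [85.0, 148.0, 183.0, 89.0],
                      'Age': [31.0, 50.0, 32.0, 21.0]})

scaler = StandardScaler()
scaled_values = scaler.fit_transform(toy_X)              # NumPy array, one column per feature
toy_scaled = pd.DataFrame(scaled_values, columns=toy_X.columns, index=toy_X.index)

print(toy_scaled.mean().round(6))                        # each column is centred near 0
print(toy_scaled.std(ddof=0).round(6))                   # and has unit (population) std
```

`StandardScaler` uses the population standard deviation, which is why `ddof=0` is passed when checking the result with pandas.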

In [None]:
def scale_data(X=None):
  # Put your code here...

In [None]:
scaled_X = scale_data(X=X)
scaled_X.head()

<a name = Section72></a>
### **7.2 Data Preparation**

- Now we will **split** our **data** into **training** and **testing** sets for further development.

---
**<h4>Question 9:** Create a function that splits the data into train and test datasets while keeping random state as 42.</h4>

---

<details>

**<summary>Hint:</summary>**

- Use `train_test_split()` to split the dataset.

- Use `test_size` of **0.20**

- Use `random_state` equal to **42**.

- **Stratify** the target variable.

</details>
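To see what stratification does, the sketch below splits a synthetic array with a 30/70 class balance. All data and variable names here are invented for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (invented for illustration): 100 rows, 30% positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 30 + [0] * 70)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)

print('Train shape:', X_tr.shape, '| Test shape:', X_te.shape)
print('Positive share in train:', y_tr.mean(), '| in test:', y_te.mean())
```

With `stratify=y_demo`, both parts keep the 30% share of positive labels, which is useful when the classes are imbalanced.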

In [None]:
def Xy_splitter(X=None, y=None):
  # Put your code here...

In [None]:
X_train, X_test, y_train, y_test = Xy_splitter(X=scaled_X, y=y)

<a name = Section8></a>

---
# **8. Model Development & Evaluation**
---

- In this section we will develop a KNN Classifier, check its performance for different values of K, and select the best model.

<a name = Section81></a>
### **8.1 Baseline Model Development & Evaluation**

---
**<h4>Question 10:** Create a function that instantiates a baseline version of KNN Classifier.</h4>

---

<details>

**<summary>Hint:</summary>**

- Instantiate a KNN model using KNeighborsClassifier().

- Use `n_neighbors` = 5 as default value.

</details>

In [None]:
def model_classifers_generate(k=5):
  # Put your code here...

In [None]:
clf = model_classifers_generate()

---
**<h4>Question 11:** Create a function that fits the model on train set and evaluates it on test set using the accuracy score.</h4>

---

<details>

**<summary>Hint:</summary>**

- `Fit` the model on training set.
- `Predict` the values on the train set and the test set.
- Evaluate them using the `accuracy_score` on the train set and the test set.

</details>
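The fit, predict, and evaluate steps can be sketched on synthetic data as shown below. The data comes from `make_classification`, not from the diabetes dataset, and the variable names are assumptions, so the printed scores say nothing about the assignment's expected results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_tr, y_tr)                                      # fit on the training set only

train_acc = accuracy_score(y_tr, knn.predict(X_tr))      # accuracy on seen data
test_acc = accuracy_score(y_te, knn.predict(X_te))       # accuracy on unseen data
print('Train accuracy:', round(train_acc, 3), '| Test accuracy:', round(test_acc, 3))
```

Comparing the two scores gives a first indication of overfitting: a train accuracy far above the test accuracy suggests the model does not generalize well.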

In [None]:
def train_n_eval(clf=None):
  # Put your code here...

In [None]:
train_n_eval(clf=clf)

<a name = Section82></a>
### **8.2 Finding the Optimal value of K**

- In this section, we will use various models of **KNN Classifiers** for **comparison** and find the optimal value of K.

---
**<h4>Question 12:** Create a function that fits different models on train set for different values of k and evaluates them on test set using the accuracy score.</h4>

---

<details>

**<summary>Hint:</summary>**

- Set the range of **k** from **1 to 30**.
- `Fit` various KNeighborsClassifiers on the training set.
- `Predict` the values on train and test set.
- Evaluate them using the `accuracy_score` on the train and test set.
- Append the scores in a list so we can find the best achieved accuracy.
- Find the best accuracy and the corresponding value of k.

</details>
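A possible shape for the search over k is sketched below, again on synthetic data from `make_classification` rather than on the diabetes dataset. The list names and the tie-breaking rule (the smallest k with the best test accuracy) are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data (for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)

k_values = list(range(1, 31))                            # k from 1 to 30
train_scores, test_scores = [], []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    train_scores.append(accuracy_score(y_tr, knn.predict(X_tr)))
    test_scores.append(accuracy_score(y_te, knn.predict(X_te)))

best_index = int(np.argmax(test_scores))                 # first position of the best test score
best_k = k_values[best_index]
print('Best k:', best_k, '| test accuracy:', round(test_scores[best_index], 3))
```

With k = 1 every training point is its own nearest neighbor, so the train accuracy is perfect; that is why the test scores, not the train scores, are used to pick k.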

In [None]:
def check_k():
  # Put your code here...

In [None]:
accuracy_train, accuracy_test = check_k()

---
**<h4>Question 13:** Create a function that plots the accuracy scores for various values of k.</h4>

---

<details>

**<summary>Hint:</summary>**

- Create a 15x7 inches figure

- Use the test accuracy scores list and plot them against corresponding values of k.

- Add additional cosmetics like `grid` and `title`.

- Set `fontsize` for ticks as 12, labels as 14 and title as 16.

- Use `plt.show()` to properly display the plot.

</details>
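The plotting steps can be sketched as follows. The accuracy values in `test_scores` are placeholder numbers invented for illustration, not results from the dataset; in the assignment they would come from the list returned by `check_k()`.

```python
import matplotlib.pyplot as plt

# Placeholder test accuracies for k = 1..10 (invented numbers, not real results)
k_values = list(range(1, 11))
test_scores = [0.68, 0.70, 0.71, 0.73, 0.72, 0.74, 0.75, 0.74, 0.73, 0.73]

fig, ax = plt.subplots(figsize=(15, 7))                  # 15x7 inches figure
ax.plot(k_values, test_scores, marker='o')
ax.set_xticks(k_values)                                  # one tick per value of k
ax.tick_params(labelsize=12)
ax.set_xlabel('Value of k', fontsize=14)
ax.set_ylabel('Test accuracy', fontsize=14)
ax.set_title('Test accuracy for different values of k', fontsize=16)
ax.grid(True)
plt.show()
```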

In [None]:
def plot_k():
  # Put your code here...

In [None]:
plot_k()

<a name = Section83></a>
### **8.3 Model Development & Evaluation for Optimal K**

---
**<h4>Question 14:** Use the previously written functions to train a new KNN classifier model with the best value of K and evaluate it on the train and test sets.</h4>

---

<details>

**<summary>Hint:</summary>**

- Use `model_classifers_generate` and `train_n_eval()` functions to fit and evaluate the new model.

- Use the optimal value of K obtained (say 23)

</details>

In [None]:
train_n_eval(clf=model_classifers_generate(k=23))

<a name = Section9></a>

---
# **9. Conclusion**
---