# Diabetes Analysis

## What is Diabetes?
Diabetes is one of the most common and hazardous diseases on the planet.<br>
It requires a lot of care and proper medication to keep the disease in control.<br>
If you are curious about data mining projects in healthcare, you should explore the diabetes dataset.

## Project Objectives
<ul>
<li>Understand the dataset attributes</li>
<li>Apply the required data cleaning methods</li>
<li>Detect outliers</li>
<li>Implement different classification models and compare the performance of each classifier on the diabetes dataset</li>
<li>Record observations and study the parameters (features) to determine the major factors affecting the onset of diabetes, for example:
<ul>
<li>What percentage of younger people are prone to being diagnosed with diabetes?</li>
<li>Are women more prone to diabetes than men, or is it the other way around?</li>
</ul>
</li>
<li>Visualize the results in plots (explore which plots best describe the results)</li>
</ul>

## Features Understanding
The data consist of medical information, laboratory analysis… etc.<br> 
The data that have been entered initially into the system are:
<ol>
<li>No. of Patient: The numerical identifier of each patient in the dataset</li>
<li>Blood Sugar Level: the concentration of glucose in the patient's blood</li>
<li>Age: The age of the patient in years</li>
<li>Gender: Male or Female</li>
<li>Creatinine ratio (Cr): the ratio of the blood levels of urea and creatinine</li>
<li>Body Mass Index (BMI): weight in kg / (height in m)^2</li>
<li>Urea: The amount of urea present in the patient's blood, which can be used to assess kidney function</li>
<li>Cholesterol (Chol): The amount of cholesterol present in the patient's blood</li>
<li>LDL: The amount of LDL cholesterol in a person's bloodstream</li>
<li>VLDL: The amount of very low-density lipoprotein (VLDL) in the patient's blood</li>
<li>Triglycerides (TG): The amount of triglycerides, a type of fat, in the patient's blood</li>
<li>HDL Cholesterol: The amount of high-density lipoprotein (good) cholesterol in the patient's blood</li>
<li>HBA1C: A blood test that measures the average blood sugar levels over the past 2-3 months</li>
<li>Class: the patient's diabetes class (Diabetic, Non-Diabetic, or Predict-Diabetic)</li>
</ol>
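
BMI is weight in kilograms divided by the square of height in metres. The sketch below works through that arithmetic; the function name `bmi` and the sample patient are illustrative assumptions, since the dataset already stores BMI as a column.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body Mass Index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical patient: 70 kg, 1.75 m tall
example_bmi = bmi(70.0, 1.75)
print(round(example_bmi, 2))
```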

## Phase One

## Importing Libraries

In [42]:
import numpy as np
import pandas as pd
import seaborn as sns 
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.linear_model import LinearRegression

## Loading Dataset

In [29]:
# load the data
dataset = pd.read_csv("./Dataset/Dataset of Diabetes.csv")

## Understanding/Exploring Dataset

### Head of Dataset

In [30]:
dataset.head(10)

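| | ID | No_Pation | Gender | AGE | Urea | Cr | HbA1c | Chol | TG | HDL | LDL | VLDL | BMI | CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 502 | 17975 | F | 50 | 4.7 | 46 | 4.9 | 4.2 | 0.9 | 2.4 | 1.4 | 0.5 | 24.0 | N |
| 1 | 735 | 34221 | M | 26 | 4.5 | 62 | 4.9 | 3.7 | 1.4 | 1.1 | 2.1 | 0.6 | 23.0 | N |
| 2 | 420 | 47975 | F | 50 | 4.7 | 46 | 4.9 | 4.2 | 0.9 | 2.4 | 1.4 | 0.5 | 24.0 | N |
| 3 | 680 | 87656 | F | 50 | 4.7 | 46 | 4.9 | 4.2 | 0.9 | 2.4 | 1.4 | 0.5 | 24.0 | N |
| 4 | 504 | 34223 | M | 33 | 7.1 | 46 | 4.9 | 4.9 | 1.0 | 0.8 | 2.0 | 0.4 | 21.0 | N |
| 5 | 634 | 34224 | F | 45 | 2.3 | 24 | 4.0 | 2.9 | 1.0 | 1.0 | 1.5 | 0.4 | 21.0 | N |
| 6 | 721 | 34225 | F | 50 | 2.0 | 50 | 4.0 | 3.6 | 1.3 | 0.9 | 2.1 | 0.6 | 24.0 | N |
| 7 | 421 | 34227 | M | 48 | 4.7 | 47 | 4.0 | 2.9 | 0.8 | 0.9 | 1.6 | 0.4 | 24.0 | N |
| 8 | 670 | 34229 | M | 43 | 2.6 | 67 | 4.0 | 3.8 | 0.9 | 2.4 | 3.7 | 1.0 | 21.0 | N |
| 9 | 759 | 34230 | F | 32 | 3.6 | 28 | 4.0 | 3.8 | 2.0 | 2.4 | 3.8 | 1.0 | 24.0 | N |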


### Dataset Shape

In [31]:
dataset.shape

(1000, 14)

### Types of columns/attributes

In [32]:
dataset.dtypes

ID             int64
No_Pation      int64
Gender        object
AGE            int64
Urea         float64
Cr             int64
HbA1c        float64
Chol         float64
TG           float64
HDL          float64
LDL          float64
VLDL         float64
BMI          float64
CLASS         object
dtype: object

### Dataset Description

In [33]:
dataset.drop(["ID","No_Pation"], axis=1).describe()

| | AGE | Urea | Cr | HbA1c | Chol | TG | HDL | LDL | VLDL | BMI |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 | 1000.0 |
| mean | 53.528 | 5.124743 | 68.943 | 8.28116 | 4.86282 | 2.34961 | 1.20475 | 2.60979 | 1.8547 | 29.57802 |
| std | 8.799241 | 2.935165 | 59.984747 | 2.534003 | 1.301738 | 1.401176 | 0.660414 | 1.115102 | 3.663599 | 4.962388 |
| min | 20.0 | 0.5 | 6.0 | 0.9 | 0.0 | 0.3 | 0.2 | 0.3 | 0.1 | 19.0 |
| 25% | 51.0 | 3.7 | 48.0 | 6.5 | 4.0 | 1.5 | 0.9 | 1.8 | 0.7 | 26.0 |
| 50% | 55.0 | 4.6 | 60.0 | 8.0 | 4.8 | 2.0 | 1.1 | 2.5 | 0.9 | 30.0 |
| 75% | 59.0 | 5.7 | 73.0 | 10.2 | 5.6 | 2.9 | 1.3 | 3.3 | 1.5 | 33.0 |
| max | 79.0 | 38.9 | 800.0 | 16.0 | 10.3 | 13.8 | 9.9 | 9.9 | 35.0 | 47.75 |
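
The summary hints at extreme values: for Cr, the 75th percentile is 73 while the maximum is 800. One common way to flag such values is the 1.5 × IQR rule. The sketch below applies it to a small hypothetical sample; it is not the project's own outlier step, and the helper name `iqr_outliers` and the sample values are assumptions.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical creatinine-like values; 800 mimics the maximum seen in the summary
sample = pd.Series([46, 62, 46, 47, 50, 60, 67, 73, 800], name="Cr")
mask = iqr_outliers(sample)
print(sample[mask].tolist())
```

Whether a flagged value is a data-entry error or a genuine clinical reading is a separate judgment that the rule itself cannot make.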


## Data Cleaning

### Check for null values
The first step of data cleaning was handling missing values.<br>
We counted the missing values in each column, then iterated over the whole dataset as a second check, printing a message whenever a missing value was found:

In [34]:
dataset.isnull().sum()

ID           0
No_Pation    0
Gender       0
AGE          0
Urea         0
Cr           0
HbA1c        0
Chol         0
TG           0
HDL          0
LDL          0
VLDL         0
BMI          0
CLASS        0
dtype: int64

In [35]:
# checking if any value is empty
for row in dataset.values.tolist():
    for val in row:
        # NaN never equals itself, so test with pd.isna instead of ==
        if pd.isna(val):
            print('empty value')
# no empty values

In the case of our dataset, there were no missing values, thus there was no need to handle any missing values.
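
One detail worth keeping in mind when checking for missing values by hand: NaN never compares equal to anything, including itself, so an equality test against `np.nan` cannot find missing entries. The sketch below uses a tiny hypothetical frame (the values are made up) to contrast that with `pd.isna`.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with one missing value
demo = pd.DataFrame({"Urea": [4.7, np.nan, 2.3], "Cr": [46, 62, 24]})

# Equality against NaN is always False, so this count stays at zero
found_by_equality = sum(
    val == np.nan for row in demo.values.tolist() for val in row
)

# pd.isna identifies the missing entry
found_by_isna = sum(
    pd.isna(val) for row in demo.values.tolist() for val in row
)

print(found_by_equality, found_by_isna)
print(demo.isna().sum().to_dict())
```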

## Label Encoding

Looking at our dataset, we observed some categorical data, such as the gender (male, female) and the class (yes, possible, no), that needed to be converted into numeric values so they could be handled easily.<br>
We also noticed that the gender data is binary (male or female) and the class data is ordinal data, so accordingly, we used the label encoder to encode these values into numbers by iterating over each categorical data column as shown in the code below:

In [36]:
categorical = dataset.select_dtypes(include=['object']).columns.tolist()
encoder = LabelEncoder()
for i in categorical:
    # label encoding for categorical data
    dataset[i] = encoder.fit(dataset[i]).transform(dataset[i])
dataset.head()

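| | ID | No_Pation | Gender | AGE | Urea | Cr | HbA1c | Chol | TG | HDL | LDL | VLDL | BMI | CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 502 | 17975 | 0 | 50 | 4.7 | 46 | 4.9 | 4.2 | 0.9 | 2.4 | 1.4 | 0.5 | 24.0 | 0 |
| 1 | 735 | 34221 | 1 | 26 | 4.5 | 62 | 4.9 | 3.7 | 1.4 | 1.1 | 2.1 | 0.6 | 23.0 | 0 |
| 2 | 420 | 47975 | 0 | 50 | 4.7 | 46 | 4.9 | 4.2 | 0.9 | 2.4 | 1.4 | 0.5 | 24.0 | 0 |
| 3 | 680 | 87656 | 0 | 50 | 4.7 | 46 | 4.9 | 4.2 | 0.9 | 2.4 | 1.4 | 0.5 | 24.0 | 0 |
| 4 | 504 | 34223 | 1 | 33 | 7.1 | 46 | 4.9 | 4.9 | 1.0 | 0.8 | 2.0 | 0.4 | 21.0 | 0 |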
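
For reference, `LabelEncoder` assigns integer codes in sorted order of the distinct values it sees. The sketch below, on a few hypothetical single-letter labels, shows the resulting mapping and how to read it back from `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical single-letter class labels
labels = ["N", "Y", "P", "N", "Y"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted distinct labels; the index of each is its code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

With labels like these, the sorted order happens to coincide with a no < possible < yes ordering; for other label sets an explicit mapping may be the safer choice.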


After encoding the data, we inspected the result and noticed that the number of distinct labels in each categorical column was greater than the number of values that column can legitimately take.<br>
We were also able to confirm that by the following code:

In [38]:
# checking if any categorical value is not labelled in the correct range
for val in dataset[['Gender']].values.tolist():
    if val[0] > 1:
        print('Error in Gender column')
        break
for val in dataset[['CLASS']].values.tolist():
    if val[0] > 2:
        print('Error in Class column')
        break

Error in Gender column
Error in Class column


This meant that there were some values that needed to be cleaned before labelling: some entries were not written in the correct format (not capitalized) and others contained extra spaces, which the label encoder treated as new values.<br>
Therefore, we added an extra line in the label encoder code to clean the data before labelling them as follows:

In [40]:
categorical = dataset.select_dtypes(include=['object']).columns.tolist()
encoder = LabelEncoder()
for i in categorical:
    # cleaning data and unifying their format (remove spaces, upper-case);
    # expects the columns to still hold the raw strings
    dataset[i] = dataset[i].str.replace(' ', '', regex=False).str.upper()
    # label encoding for categorical data
    dataset[i] = encoder.fit(dataset[i]).transform(dataset[i])
dataset.head()

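| | ID | No_Pation | Gender | AGE | Urea | Cr | HbA1c | Chol | TG | HDL | LDL | VLDL | BMI | CLASS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 502 | 17975 | 0 | 50 | 4.7 | 46 | 4.9 | 4.2 | 0.9 | 2.4 | 1.4 | 0.5 | 24.0 | 0 |
| 1 | 735 | 34221 | 1 | 26 | 4.5 | 62 | 4.9 | 3.7 | 1.4 | 1.1 | 2.1 | 0.6 | 23.0 | 0 |
| 2 | 420 | 47975 | 0 | 50 | 4.7 | 46 | 4.9 | 4.2 | 0.9 | 2.4 | 1.4 | 0.5 | 24.0 | 0 |
| 3 | 680 | 87656 | 0 | 50 | 4.7 | 46 | 4.9 | 4.2 | 0.9 | 2.4 | 1.4 | 0.5 | 24.0 | 0 |
| 4 | 504 | 34223 | 1 | 33 | 7.1 | 46 | 4.9 | 4.9 | 1.0 | 0.8 | 2.0 | 0.4 | 21.0 | 0 |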


After changing the code, we checked for the number of different labels for each column again, and the result was that the data was labelled correctly.
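
The effect of inconsistent formatting on the label count can be reproduced on a small hypothetical column (the values below are made up for illustration): stray spaces and lowercase letters inflate the number of distinct labels until the strings are normalized.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical raw column with a lowercase entry and stray spaces
raw = pd.Series(["M", "F", "f", "M ", " F"], name="Gender")
labels_before = raw.nunique()

# Remove spaces and unify case before encoding
clean = raw.str.replace(" ", "", regex=False).str.upper()
labels_after = clean.nunique()

codes = LabelEncoder().fit_transform(clean)
print(labels_before, labels_after, codes.tolist())
```

Comparing `nunique()` before and after cleaning gives a direct count of the distinct labels in a column.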

## Removing unnecessary columns

The first two columns (ID and patient number) are only identifiers and were not useful in our pipeline, so we removed them:

In [41]:
# remove unnecessary id and patient number columns
dataset = dataset.iloc[:,2:]
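
Positional slicing depends on the identifier columns being first. As an alternative sketch, the same columns can be dropped by name; the tiny frame below is hypothetical and only reuses column names shown earlier.

```python
import pandas as pd

# Hypothetical frame reusing a few of the dataset's column names
demo = pd.DataFrame({
    "ID": [502, 735],
    "No_Pation": [17975, 34221],
    "Gender": [0, 1],
    "AGE": [50, 26],
})

# Dropping by name does not depend on column order
by_name = demo.drop(columns=["ID", "No_Pation"])
by_position = demo.iloc[:, 2:]

print(by_name.columns.tolist())
print(by_name.equals(by_position))
```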

## Splitting Data

We wrote the following code to separate the class column from the rest of the data and then split the data into 80% training data and 20% testing data:

In [43]:
# splitting the data into training and testing
x = dataset.drop('CLASS', axis=1)
y = dataset[['CLASS']]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

In [47]:
print("x_train shape: ", x_train.shape)
print("x_test shape: ", x_test.shape)
print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)

x_train shape:  (800, 11)
x_test shape:  (200, 11)
y_train shape:  (800, 1)
y_test shape:  (200, 1)
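
The split proportions can be checked on synthetic data. The sketch below also passes `stratify`, an optional addition that is not part of the split used in this project, so that each class keeps the same share in both parts. The random features and the class counts are placeholders, not the real class distribution.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 rows, 11 features, 3 imbalanced classes
x = pd.DataFrame(rng.normal(size=(1000, 11)))
y = pd.Series(np.repeat([0, 1, 2], [100, 50, 850]), name="CLASS")

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0, stratify=y
)

print(x_train.shape, x_test.shape)
# Per-class counts in the test set follow the overall class proportions
print(y_test.value_counts().sort_index().to_dict())
```

Stratifying tends to matter most when one class is rare, since a purely random split could leave very few of its samples in the test set.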
