# 0. KEY INFORMATION FROM THE DATASET DOCUMENTATION

* Independent features:
    * Pregnancies
    * PlasmaGlucose
    * DiastolicBloodPressure
    * TricepsThickness
    * SerumInsulin
    * BMI
    * DiabetesPedigree
    * Age
<br>
<br>

* Dependent feature:
    * Diabetic:
        * 1 = diabetes diagnosed;
        * 0 = no diabetes diagnosed.

# 1. DATA IMPORT & PROJECT SET-UP

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [None]:
# Build the path relative to the project root so the notebook runs on any machine
# (assumption: the working directory is the root of the diabetes_prediction repository)
data_path = os.path.join("assets", "data", "TAIPEI_diabetes.csv")
df = pd.read_csv(data_path)

# 2. INITIAL DATA OVERVIEW

#### This part is dedicated to a general understanding of the available data and the identification of the following:
* General shape of the dataframe
* Features available for the prediction of our target variable
* Columns' data types
* Whether the data points respect the data type of their column
* Potential NULL data points
* Duplicate values

In [None]:
df.shape

(15000, 10)

In [5]:
df.head()

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
0,1354778,0,171,80,34,23,43.509726,1.213191,21,0
1,1147438,8,92,93,47,36,21.240576,0.158365,23,0
2,1640031,7,115,47,52,35,41.511523,0.079019,23,0
3,1883350,9,103,78,25,304,29.582192,1.28287,43,1
4,1424119,1,85,59,27,35,42.604536,0.549542,22,0


In [4]:
df.tail(10)

Unnamed: 0,PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
14990,1220763,5,169,83,31,60,49.004793,0.187397,53,1
14991,1603490,0,114,53,34,40,44.760534,0.143487,23,0
14992,1202654,3,48,60,24,81,29.417154,0.159605,42,1
14993,1165919,1,128,59,21,182,19.766514,0.16728,53,0
14994,1453189,0,72,99,32,32,20.932808,0.545038,22,0
14995,1490300,10,65,60,46,177,33.512468,0.148327,41,1
14996,1744410,2,73,66,27,168,30.132636,0.862252,38,1
14997,1742742,0,93,89,43,57,18.690683,0.427049,24,0
14998,1099353,0,132,98,18,161,19.791645,0.302257,23,0
14999,1386396,3,114,65,47,512,36.215437,0.147363,34,1


In [None]:
# First check of the NULL data points inside the columns and their data types
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   PatientID               15000 non-null  int64  
 1   Pregnancies             15000 non-null  int64  
 2   PlasmaGlucose           15000 non-null  int64  
 3   DiastolicBloodPressure  15000 non-null  int64  
 4   TricepsThickness        15000 non-null  int64  
 5   SerumInsulin            15000 non-null  int64  
 6   BMI                     15000 non-null  float64
 7   DiabetesPedigree        15000 non-null  float64
 8   Age                     15000 non-null  int64  
 9   Diabetic                15000 non-null  int64  
dtypes: float64(2), int64(8)
memory usage: 1.1 MB


In [None]:
# Second check of the NULL values
df.isna().sum()

In [8]:
# Count of repeated values per column (values that already appeared in an earlier row)
for column in df.columns:
    print(f"{column}: {df[column].duplicated().sum()}")

PatientID: 105
Pregnancies: 14985
PlasmaGlucose: 14851
DiastolicBloodPressure: 14910
TricepsThickness: 14931
SerumInsulin: 14337
BMI: 0
DiabetesPedigree: 1
Age: 14944
Diabetic: 14998


#### DATA READING & INTEGRITY CHECK OBSERVATIONS

The dataset contains 15,000 records in total.

It has 10 columns, all of a numerical data type (8 int64 and 2 float64 columns).

There are neither NULL data points nor duplicated rows: every column has 15,000 non-null values, and since every BMI value is unique, no two rows can be identical.

All of the data points respect the data type of the column they are in.

Nonetheless, we can observe an inconsistency in certain records: 105 rows carry a PatientID that already appears in an earlier row.
<br>The initial hypothesis is that some patients might have come for the diabetes check more than once, which we will explore in more detail during the exploratory data analysis stage.

Apart from the PatientID identifier, all of the independent features are numeric (counts or continuous measurements) rather than categorical. The only categorical column is the binary target, Diabetic.
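
The repeated PatientID values reported above can be examined more closely. The following is a minimal sketch on a made-up dataframe (`toy` and its values are hypothetical, not taken from the dataset). It counts fully identical rows, lists every row whose PatientID occurs more than once, and checks whether a repeated patient carries conflicting Diabetic labels.

```python
import pandas as pd

# Toy stand-in for the real dataframe (values are invented for illustration)
toy = pd.DataFrame({
    "PatientID": [101, 102, 101, 103, 102, 104],
    "PlasmaGlucose": [171, 92, 115, 103, 92, 85],
    "Diabetic": [0, 0, 1, 1, 0, 0],
})

# Rows that are identical to an earlier row in every column
n_duplicate_rows = toy.duplicated().sum()

# Every row whose PatientID occurs more than once, sorted so repeats sit together
repeated = toy[toy.duplicated(subset="PatientID", keep=False)].sort_values("PatientID")

# Number of distinct Diabetic labels per repeated patient:
# a value above 1 means the same ID appears with conflicting diagnoses
labels_per_patient = repeated.groupby("PatientID")["Diabetic"].nunique()

print(n_duplicate_rows)
print(repeated)
print(labels_per_patient)
```

Running the same three steps on `df` would show whether the repeated IDs look like genuine repeat visits (different measurements, possibly different diagnoses) or like data-entry issues.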

# 3. EXPLORATORY DATA ANALYSIS (EDA)

#### This part is dedicated to :
* 1
* 2
* 3

#### Questions to answer:
- How should the outliers be treated?
- Which of the features should be encoded, and which encoding type should be chosen?
- Should the model favour false positives or false negatives? Given the medical context, false positives are probably preferable, but this needs a stronger argument.
- Will the model be better at predicting non-diabetic or diabetic patients?
- Formulate hypotheses about the data and test them against the data. When the data disagrees with a hypothesis, re-evaluate it and dig deeper. Always start with a simple hypothesis and move to more complex ones only if the data does not confirm it.
- Label the diabetic and non-diabetic patients.


#### TO DO:
- Take a closer look at the repeated PatientID rows
- Add data distributions for the whole dataframe vs the diabetic and non-diabetic groups
- Add outlier detection using boxplots
- Add an analysis of the relationships between the features
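
For the outlier item, one possible starting point is the IQR rule, which is what boxplots in matplotlib and seaborn use by default to place their whiskers (1.5 times the interquartile range). The sketch below is an assumption about how this could be done, not the project's chosen method. The helper `iqr_outlier_mask` and the sample values are hypothetical.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Invented sample in the style of the SerumInsulin column
toy_insulin = pd.Series([23, 36, 35, 40, 32, 60, 57, 512], name="SerumInsulin")

mask = iqr_outlier_mask(toy_insulin)
print(toy_insulin[mask])  # values a default boxplot would draw as individual points
```

Applied per column (for example with `df[column]` in a loop), the mask gives a count of flagged points that can be compared with what the boxplots show.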

In [9]:
# Division of the dataset into diabetic and non-diabetic dataframes
# Initial dataframe : "df"
df_non_diabetic = df[df["Diabetic"] == 0]
df_diabetic = df[df["Diabetic"] == 1]

In [10]:
# Descriptive statistics of the features of full dataframe
df.describe().drop(columns=["PatientID"])

Unnamed: 0,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age,Diabetic
count,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0,15000.0
mean,3.224533,107.856867,71.220667,28.814,137.852133,31.509646,0.398968,30.137733,0.333333
std,3.39102,31.981975,16.758716,14.555716,133.068252,9.759,0.377944,12.089703,0.47142
min,0.0,44.0,24.0,7.0,14.0,18.200512,0.078044,21.0,0.0
25%,0.0,84.0,58.0,15.0,39.0,21.259887,0.137743,22.0,0.0
50%,2.0,104.0,72.0,31.0,83.0,31.76794,0.200297,24.0,0.0
75%,6.0,129.0,85.0,41.0,195.0,39.259692,0.616285,35.0,1.0
max,14.0,192.0,117.0,93.0,799.0,56.034628,2.301594,77.0,1.0


In [11]:
# Number of records for diabetic vs non-diabetic patients
print(df["Diabetic"].value_counts())
print("\n")
print(df["Diabetic"].value_counts()*100/len(df))

Diabetic
0    10000
1     5000
Name: count, dtype: int64


Diabetic
0    66.666667
1    33.333333
Name: count, dtype: float64


In [None]:
# Mean comparison of diabetic and non-diabetic patients
print("-" * 10 + " Mean " + "-" * 10)
df.groupby("Diabetic").mean().drop(columns=["PatientID"])

In [None]:
# Standard deviation comparison of diabetic and non-diabetic patients
print("-" * 10 + " Standard deviation " + "-" * 10)
df.groupby("Diabetic").std().drop(columns=["PatientID"])

In [None]:
# Minimum values comparison of diabetic and non-diabetic patients
print("-" * 10 + " Minimum values " + "-" * 10)
df.groupby("Diabetic").min().drop(columns=["PatientID"])

In [None]:
# Maximum values comparison of diabetic and non-diabetic patients
print("-" * 10 + " Maximum values " + "-" * 10)
df.groupby("Diabetic").max().drop(columns=["PatientID"])

In [None]:
# Quantile fractions comparison of diabetic and non-diabetic patients
print("-" * 10 + " Quantile fractions " + "-" * 10)
df.groupby("Diabetic").quantile(q=[0.25, 0.5, 0.75]).drop(columns=["PatientID"])

#### Observations - Part 1
- Diabetic patients tend to have more pregnancies compared to non-diabetic patients.
- Higher glucose levels are observed in diabetics, which makes sense because diabetes is linked to high blood sugar.
- Blood pressure is generally higher in diabetics compared to non-diabetics.
- Diabetic patients tend to have thicker skinfolds, which may indicate higher fat distribution.
- Diabetics have significantly higher insulin levels, likely due to insulin resistance.
- Diabetics have a higher BMI, suggesting a link between obesity and diabetes.
- DiabetesPedigree values are higher among diabetics, suggesting a higher genetic risk.
- Diabetic patients are generally older, as diabetes risk increases with age.

These observations will be explored further with the correlation matrix.

Note: there are twice as many records for non-diabetic patients as for diabetic patients (10k vs 5k).
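
A minimal sketch of the correlation step, assuming Pearson correlation and a synthetic dataframe (`toy`, the random generator, and the glucose threshold are all invented for illustration). It drops the identifier column, computes the correlation matrix, and ranks the features by their correlation with the target.

```python
import numpy as np
import pandas as pd

# Synthetic data: the target depends on glucose only, Age is unrelated noise
rng = np.random.default_rng(0)
n = 200
glucose = rng.normal(100, 30, n)
toy = pd.DataFrame({
    "PatientID": np.arange(n),
    "PlasmaGlucose": glucose,
    "Age": rng.integers(21, 78, n),
    "Diabetic": (glucose > 115).astype(int),
})

# The identifier carries no signal, so it is excluded before correlating
corr = toy.drop(columns=["PatientID"]).corr()

# Correlation of each feature with the target, strongest first
target_corr = corr["Diabetic"].drop("Diabetic").sort_values(ascending=False)
print(target_corr)
```

On the real data, `corr` could be passed to `sns.heatmap(corr, annot=True)` for a visual overview. Pearson correlation only captures linear relationships, so a low value does not rule out a useful feature.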


* Two thirds of the records in the dataset belong to non-diabetic patients, so the target variable is imbalanced (2:1). Possible remedies include oversampling the minority class with SMOTE, which synthesizes new minority samples by interpolating between existing ones rather than duplicating them, or class weighting (for example the `class_weight` parameter of scikit-learn's logistic regression and random forest, or `scale_pos_weight` in XGBoost). In a medical setting such as diabetes prediction, false negatives are more harmful than false positives. Class weighting may produce slightly more false positives, which is an acceptable trade-off in this context.
* The diabetic group has noticeably higher SerumInsulin and PlasmaGlucose levels.
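
To make the class-weighting idea concrete, the sketch below computes weights with the common "balanced" heuristic, `n_samples / (n_classes * count_per_class)`, which is the formula scikit-learn documents for `class_weight="balanced"`. The label array is rebuilt from the class counts reported above. The variable names are illustrative.

```python
import numpy as np

# Class counts from the dataset: 10,000 non-diabetic (0) and 5,000 diabetic (1)
y = np.array([0] * 10_000 + [1] * 5_000)

counts = np.bincount(y)  # number of samples per class
n_samples, n_classes = len(y), len(counts)

# Balanced heuristic: rarer classes receive proportionally larger weights
weights = n_samples / (n_classes * counts)
class_weight = {cls: float(w) for cls, w in enumerate(weights)}

print(class_weight)  # {0: 0.75, 1: 1.5}
```

The diabetic class ends up weighted twice as heavily as the non-diabetic class, so a misclassified diabetic patient costs the model twice as much during training. A dictionary of this shape can be passed as `class_weight` to scikit-learn estimators that support it.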


# 4. FEATURE ENGINEERING

Goals:
* Feature selection
* Data normalization
* Outliers treatment
* Data encoding 
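
As an illustration of the normalization goal, here is a minimal sketch of z-score standardization with pandas. This is one option among several (min-max scaling is another), and the choice has not been made in this notebook. The `toy` dataframe and its values are invented.

```python
import pandas as pd

# Invented values in the style of two feature columns
toy = pd.DataFrame({
    "PlasmaGlucose": [171.0, 92.0, 115.0, 103.0, 85.0],
    "BMI": [43.5, 21.2, 41.5, 29.6, 42.6],
})

# Z-score: subtract the column mean and divide by the column standard deviation
# (ddof=0 uses the population standard deviation)
standardized = (toy - toy.mean()) / toy.std(ddof=0)

print(standardized.round(3))
```

After this step each column has mean 0 and standard deviation 1. In a modelling pipeline the mean and standard deviation should be computed on the training split only and then reused on the test split, to avoid leaking information.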

# 5. MODEL EVALUATION