<a href="https://colab.research.google.com/github/RefaelAharouni/Project_Machine_Learning_Refael_Venkatsai_Norbert_William/blob/Refa%C3%ABl/Project_Machine_Learning_Refael_Venkatsai_Norbert_William.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Aharouni Refaël, Kadari Venkatsai, Devaraj Norbert Dias, Aye William $-$ DIA1
<br><br>

<p align="center"><b> Report of the Machine Learning Project</b></p><br><br>




&nbsp;&nbsp;&nbsp;In this dataset, we want to study the relationship between the presence of diabetes in patients and features such as the patient's age, geographical location, history of heart disease and blood glucose level. This is therefore a binary classification problem: given a patient's personal information, we want to predict whether the patient has diabetes (the target feature) or not.

&nbsp;&nbsp;&nbsp;&nbsp;Let us first import the libraries that will be used for this project. We can then read the dataset with the *pd.read_csv* function and store it in the *data* variable.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from collections import Counter
from sklearn.metrics import (confusion_matrix, roc_curve, precision_recall_curve, auc, classification_report)

In [None]:
data = pd.read_csv("diabetes_dataset.csv", sep = ",")

Now, we can display the first rows of the dataset with the *head* method to see what it looks like.

In [None]:
data.head(10)

<br><br>

# **1. Analysis and preprocessing of the data**

&nbsp;&nbsp;&nbsp;&nbsp;Let us begin this project by analyzing the data.
<br>
## &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**A. Variables definition**
&nbsp;&nbsp;&nbsp;The first thing to do is to understand the variables that we are dealing with in this dataset, using the *columns.tolist()* method.

In [None]:
print(data.columns.tolist())

&nbsp;&nbsp;&nbsp;From the previous result, we can draw up a list of the features and their type:

*   **Year**: This feature represents the year in which the patient was tested. At first glance, this variable could seem *quantitative continuous* because, in theory, it can take an infinite number of values. However, the command "len(data["year"].value_counts())" shows that it takes only 7 distinct values, far fewer than 100 000, the number of individuals. Therefore, this variable can be considered *quantitative discrete*. <br><br>

*   **Gender:** This feature represents the gender of the patient, so it is a *qualitative nominal* variable (the possible values are: "male", "female" or "other"). <br><br>

*   **Age:** This feature is *quantitative discrete*, just like "year" (even though, in theory, it can take any value between 0 and 150). It represents the patient's age when the diabetes test was performed. <br><br>

*   **Location:** This variable represents the location of the patient (the American state of the patient). Therefore, this variable is *qualitative nominal*.<br><br>

*   **race:AfricanAmerican**, **race:Asian**, **race:Caucasian**, **race:Hispanic**, **race:Other**: These variables indicate whether the patient is African American or not, Asian or not, Caucasian or not, Hispanic or not, or other (or not). Therefore, they are *quantitative binary*. <br><br>

*  **hypertension**: Quantitative binary, as it indicates whether the patient has hypertension or not (1 --> hypertension, 0 --> no hypertension).<br><br>

*  **heart_disease:** Quantitative binary, as it indicates whether the patient has a heart disease or not (1 --> heart_disease, 0 --> no heart_disease). <br><br>

*  **smoking_history:** Qualitative ordinal because it indicates the patient's smoking frequency (with a hierarchy). The possible values are: "never" (level 0), "not current" (level 1), "former" (level 2), "ever" (level 3), "current" (level 4).<br><br>

*  **bmi:** This variable represents the ratio between the patient's weight and his height squared. Because it can take an infinite number of values, it is a *quantitative continuous* feature. <br><br>

*  **hbA1c_level:** At first glance, this feature, representing the percentage of glycated hemoglobin (hemoglobin bound to glucose) in the blood, seemed *quantitative continuous*. However, just like the *year* feature, the command "len(data["hbA1c_level"].value_counts())" shows that it takes only 18 distinct values, far fewer than 100 000, the number of individuals. Therefore, this variable can also be considered *quantitative discrete*. <br><br>

*  **blood_glucose_level:** This feature characterizes the level of the patient's blood glucose (it is expressed in mg/dL). Because it also takes 18 distinct values, it will be considered as *quantitative discrete*. <br><br>

*  **diabetes:** *Quantitative binary*, as it indicates whether the patient has diabetes or not (1 --> diabetes, 0 --> no diabetes).

<br>

## &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**B. Number of columns, rows and values**

&nbsp;&nbsp;&nbsp;In order to know the exact number of rows and columns, we can use the *.shape* attribute. Its first element is the number of rows (number of individuals), while the second one is the number of columns (number of features).

In [None]:
data.shape

Therefore, our dataset contains 100 000 individuals and 16 features. The *size* attribute tells us that our diabetes dataset contains 1 600 000 values, which is consistent: 100 000 individuals × 16 features = 1 600 000.

In [None]:
data.size

<br>

## &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**C. Missing values, inconsistencies and outliers**

&nbsp;&nbsp;&nbsp;Let us now use *.info()* and *.isnull().sum()* to see if our dataset contains any missing values.

In [None]:
print(data.info())
print("\nNumber of missing values for each feature:")
print(data.isnull().sum())

From what we can see, this dataset does not seem to contain any missing values.<br>However, before going any further, one detail requires our attention: the "age" feature is stored as float64. To simplify the study, let us convert this feature to integers, which truncates fractional ages (for example, 0.54 becomes 0).

In [None]:
data["age"] = data["age"].astype(int)

<br>We also need to ensure that diabetes_dataset.csv does not contain any inconsistencies. This step involves looking for potential spelling mistakes, stray spaces or inconsistent capitalization in the qualitative columns.

Let us look at the values taken by each qualitative feature with the *data[name_feature].value_counts()* method, knowing from the *info* method used previously that all qualitative features have the "object" type in our dataset.

In [None]:
for j in data.select_dtypes(include = ["object"]).columns: # We select the qualitative columns and print the occurrences of every value taken by each feature
  print(data[j].value_counts(), "\n\n")

No spelling mistakes or additional spaces appear in our dataset. <br>However, even if it did not seem so at first glance, "smoking_history" contains more than 35 800 missing values (recorded as "No Info"), which will be handled when we apply one-hot encoding. For this feature, let us also use the *map* method to rename the values so that they begin with an uppercase letter. To check the renaming, we can print the occurrences of this feature's values again using the *.value_counts()* method.

In [None]:
data["smoking_history"] = data["smoking_history"].map({"No Info": "No Info", "never": "Never", "former": "Former", "current": "Current", "not current": "Not Current", "ever": "Ever"})
data["smoking_history"].value_counts()

<br><br>
&nbsp;&nbsp;&nbsp;Let us now evaluate the possible inconsistencies for quantitative features. For example, let us verify that:
* Every patient's age is between 0 and 122 (the official human longevity record is held by Jeanne Louise Calment, who lived 122 years and 164 days, according to Statista).
* Every patient's BMI is between 6.7 kg/m$^2$ and 98 kg/m$^2$ (the highest BMI ever registered was 98 kg/m$^2$ according to the article of PreciDIAB's website, while the lowest was 6.7 kg/m$^2$ according to Psychiatria Polska's article).
* Every patient's blood glucose level is between 40 mg/dL and 600 mg/dL (according to Guideline Central's article, a blood glucose level greater than 600 mg/dL is very serious, while the American Diabetes Association describes a blood glucose level below 70 mg/dL as low blood glucose, i.e. hypoglycemia).
* Every patient's hbA1c is between 0% and 20%.
* Every year in the dataset is between 1900 and 2025, as it represents the year in which the diabetes test was performed on the patient.

In [None]:
print("Inconsistent ages: ", data[(data["age"] < 0) | (data["age"] > 122)])
print("Inconsistent years: ", data[(data["year"] < 1900) | (data["year"] > 2025)])
print("Inconsistent BMI: ", data[(data["bmi"] < 6.7) | (data["bmi"] > 98)])
print("Inconsistent blood glucose level: ", data[(data["blood_glucose_level"] < 40) | (data["blood_glucose_level"] > 600)])
print("Inconsistent hbA1C level: ", data[(data["hbA1c_level"] < 0) | (data["hbA1c_level"] > 20)])

Therefore, no inconsistency appears in our dataset.
<br><br><br>&nbsp;&nbsp;&nbsp;To go further in our analysis, let us display the boxplot of every quantitative feature to identify the outliers in our dataset (the outliers are the points that lie outside the whiskers).

In [None]:
for j in data.select_dtypes(include = ["int64", "float64"]).columns: # We select the quantitative features
  plt.figure(figsize = (4, 4)) # Creation of a new figure
  sns.boxplot(data[j]) # Display of boxplot
  plt.title(f"Boxplot of {j}")
  plt.show()

From what one can see here:
* The feature "Year" has some outliers, but this is because more than 75% of patients had their test in 2019 (hence, all the other values are flagged as outliers). However, all values are between 2015 and 2022, so none of them is really "extreme". <br>
* The feature "age" does not have any extreme values. <br>
* For the binary features, like "race:Asian", "race:AfricanAmerican", "race:Caucasian", "race:Hispanic", "hypertension", "heart_disease" and "diabetes", 1 is always flagged as an outlier. This is because each of these features is imbalanced. <br>
* The "BMI" feature contains many outliers, as many patients have a BMI greater than 40 (morbid obesity), and some have a BMI below 15 (severe underweight).<br>
* The blood glucose level and hbA1c level features also have 2-3 outlying values each, located above 250 mg/dL for the blood glucose and above 8.5% for the hbA1c.
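
The whiskers drawn by *sns.boxplot* follow, by default, the 1.5 × IQR rule, which explains why the value 1 of a rare binary feature is flagged. The sketch below reproduces that rule in plain Python on made-up data. The helper names `whisker_bounds` and `find_outliers` and the toy list are assumptions made for illustration and are not part of the notebook. Quartile conventions also differ slightly between libraries, so the bounds may not match the plots to the last decimal.

```python
import statistics

def whisker_bounds(values, whis=1.5):
    """Return the (lower, upper) limits beyond which a point counts as an outlier."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return q1 - whis * iqr, q3 + whis * iqr

def find_outliers(values, whis=1.5):
    """Return the values lying outside the whisker limits."""
    low, high = whisker_bounds(values, whis)
    return [v for v in values if v < low or v > high]

# Hypothetical binary feature: 90% of zeros and 10% of ones
binary_feature = [0] * 90 + [1] * 10

print(whisker_bounds(binary_feature))      # both quartiles are 0, so the IQR is 0
print(len(find_outliers(binary_feature)))  # every 1 falls outside the limits
```

When more than 75% of the values are identical, both quartiles coincide, the IQR is zero and every other value is drawn as an outlier. This is the same mechanism as for the "Year" feature.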

<br>

## &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**D. Duplicated values**

&nbsp;&nbsp;&nbsp;Knowing whether our dataset contains duplicated rows is also useful and can be done with the *data[data.duplicated()]* expression. We can also use *duplicated().sum()* to know the exact number of duplicates.

In [None]:
print(f"There are {data.duplicated().sum()} duplicated rows:")
data[data.duplicated()]

Given that we have 100 000 individuals in this dataset, removing 15 of them will not have a large impact. Therefore, we can simply delete those rows using the *drop_duplicates()* method.

In [None]:
data = data.drop_duplicates()

To make sure that those rows were really deleted, let us use the *shape* attribute once again to display the number of rows (individuals) and columns (features).

In [None]:
data.shape

<br>

## &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**E. Visualization**

&nbsp;&nbsp;&nbsp;&nbsp;Let us now present a univariate analysis of one feature of each type (quantitative continuous, quantitative discrete, qualitative nominal, qualitative ordinal and binary).
<br> In this part, we will work on:
* bmi (quantitative continuous)
* Age (quantitative discrete)
* Location (qualitative nominal)
* smoking_history (qualitative ordinal)
* diabetes (binary)
<br> &nbsp;&nbsp;&nbsp;&nbsp;We already know that every feature was measured for all the patients of the dataset, which is why every feature is studied over the 99 985 patients.

<br>

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**1. Univariate analysis of bmi**

&nbsp;&nbsp;&nbsp;Let us begin the univariate analysis of bmi, a quantitative continuous feature representing the ratio between the patient's weight and his height squared.
<br>
&nbsp;&nbsp;&nbsp;These are the BMI categories for adults, according to the article written by the Centers for Disease Control and Prevention:
* **Underweight:** BMI $<$ 18.5
* **Healthy Weight:** 18.5 $\leq$ BMI $<$ 25.
* **Overweight:** 25 $\leq$ BMI $<$ 30.
* **Obesity:** BMI $\geq$ 30:<br>
&nbsp;&nbsp;&nbsp; $-$   **First class obesity:** 30 $\leq$ BMI $<$ 35 <br>
&nbsp;&nbsp;&nbsp; $-$   **Second class obesity:** 35 $\leq$ BMI $<$ 40. <br>
&nbsp;&nbsp;&nbsp; $-$   **Third class (morbid obesity):** BMI $\geq$ 40. <br><br>
&nbsp;&nbsp;&nbsp;Using the *describe()* method allows us to have the principal information about our feature.

In [None]:
data["bmi"].describe()

From what we see here, 50% of the patients have a BMI greater than 27 (overweight), while 25% have at most a healthy weight, as the first quartile is 23.6.
<br>Moreover, the minimum and maximum values show that one patient is severely underweight (10.01), while another has extreme obesity (95.69). <br>Finally, the standard deviation of 6.6 shows that BMI values are fairly spread out among the individuals of the dataset, as we will see later.
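
To make the link between the BMI definition, the categories listed above and the summary statistics concrete, here is a minimal plain-Python sketch. The function names `bmi` and `bmi_category` are hypothetical, the weight and height are made-up values, and placing a BMI of exactly 40 in the third obesity class is an assumption of this sketch.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight divided by height squared (kg/m^2)."""
    return weight_kg / height_m ** 2

def bmi_category(value):
    """Map a BMI value to the adult categories listed above."""
    if value < 18.5:
        return "Underweight"
    if value < 25:
        return "Healthy Weight"
    if value < 30:
        return "Overweight"
    if value < 35:
        return "Obesity (class 1)"
    if value < 40:
        return "Obesity (class 2)"
    return "Obesity (class 3)"  # assumption: 40 itself belongs to class 3

# Made-up patient: 70 kg and 1.75 m
print(round(bmi(70, 1.75), 1), bmi_category(bmi(70, 1.75)))

# Values quoted in the text: first quartile, minimum and maximum
for value in (23.6, 10.01, 95.69):
    print(value, bmi_category(value))
```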

<br>&nbsp;&nbsp;&nbsp;Because this feature is continuous, we can split the data into classes (5 bins for instance) and then display the histogram of the number of patients in each class. To better understand the proportion of each BMI class, we can also display the histogram of frequencies and the cumulative distribution function.

In [None]:
plt.figure(figsize = (8, 6))
valuecounts, bins, _ = plt.hist(data["bmi"], bins = 5, edgecolor = "black", linewidth = 1.2) # We draw the histogram

# We center the text indicating the class at the top of each rectangle of the histogram
labels = [f"[{round(bins[i], 1)} - {round(bins[i+1], 1)}[" for i in range(len(bins) - 1)]
centeredLabels = [(bins[i+1] + bins[i])/2 for i in range(len(bins) - 1)]
for i, j, k in zip(centeredLabels, labels, valuecounts):
  plt.text(i, k + max(valuecounts)/100, j, ha = "center", color = "red", fontweight = "bold")

plt.title("Histogram by value counts of BMI classes")
plt.xlabel("BMI")
plt.ylabel("Occurrences of each class")
plt.grid(axis='y', linestyle='-', alpha=0.7) # We add a grid, to make the representation easier to understand
plt.show()

In [None]:
plt.figure(figsize = (8, 6))
valuecounts, bins, _ = plt.hist(data["bmi"], bins = 5, edgecolor = "black", linewidth = 1.2, weights = np.ones(len(data)) / len(data)) # We add the parameter np.ones(len(data))/len(data) to print the histogram in frequency

# We center the text indicating the class at the top of each rectangle of the histogram
labels = [f"[{round(bins[i], 1)} - {round(bins[i+1], 1)}[" for i in range(len(bins) - 1)]
centeredLabels = [(bins[i+1] + bins[i])/2 for i in range(len(bins) - 1)]
for i, j, k in zip(centeredLabels, labels, valuecounts):
  plt.text(i, k + max(valuecounts)/100, j, ha = "center", color = "red", fontweight = "bold")

plt.title("Histogram by frequency of BMI classes")
plt.xlabel("BMI")
plt.ylabel("Frequency of each class")
plt.grid(axis='y', linestyle='-', alpha=0.7) # We add a grid to make the representation easier to understand
plt.show()

In [None]:
plt.figure(figsize = (8, 7))
sns.ecdfplot(data = data, x = "bmi", linewidth = 1.5)
plt.xlabel('BMI')
plt.ylabel("Cumulative distribution function of BMI")
plt.grid(True)
plt.show()

These histograms show that the BMI values were split into 5 classes:
* $[10, 27.1[$, which contains about 40% of the patients ($\approx$ 40 000 individuals): underweight patients, healthy-weight patients and overweight patients with a BMI between 25 and 27.1. <br>
* $[27.1, 44.3[$, which contains about 55 000 patients ($\approx$ 55% of individuals) who are overweight (with a BMI between 27.1 and 30) or obese (class 1, 2 or 3). <br>
* $[44.3, 61.4[$, which contains roughly 3 000 patients ($\approx$ 3% of individuals) with morbid obesity (BMI between 44.3 and 61.4). <br>
* $[61.4, 78.6[$ and $[78.6, 95.7[$, which each contain fewer than 1 000 patients (less than $\approx$ 1% of patients) and correspond to extreme obesity.
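
The class boundaries above come from the fact that, with *bins = 5*, the histogram splits the range between the minimum and the maximum into intervals of equal width. The following sketch recomputes them in plain Python from the minimum (10.01) and maximum (95.69) reported earlier. The function name `equal_width_edges` is an assumption made for this illustration.

```python
def equal_width_edges(lowest, highest, n_bins):
    """Return the n_bins + 1 edges of equal-width bins between lowest and highest."""
    width = (highest - lowest) / n_bins
    return [lowest + i * width for i in range(n_bins + 1)]

# Minimum and maximum BMI taken from the describe() output discussed above
edges = equal_width_edges(10.01, 95.69, 5)
print([round(edge, 1) for edge in edges])
```

Each class is therefore about 17.1 kg/m$^2$ wide. This is why the first class mixes underweight, healthy-weight and some overweight patients.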


<br><br>

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**2. Univariate analysis of age**

&nbsp;&nbsp;&nbsp;Let us now turn to the "age" feature, representing the age of the patient at the time of the diabetes test. Once again, we begin with the *describe* method to display the main statistics of the feature.

In [None]:
data["age"].describe()

&nbsp;&nbsp;&nbsp;This summary shows that the mean age is approximately 42, while the youngest patients were less than one year old, as the minimum value of 0 shows. The oldest patient was 80 years old, and this wide range of ages explains the high standard deviation (patients of all ages took the test).
<br>&nbsp;&nbsp;&nbsp;Finally, 25% of the patients ($\approx$ 24 996 patients) were at most 24 years old, 50% of them were at most 43 years old, and 25% ($\approx$ 24 996 patients) were at least 60 years old.

<br><br>&nbsp;&nbsp;&nbsp;Before looking at the occurrences and frequencies of this feature's values, let us simplify the study by splitting the values into 4 different classes: [0, 20], [21, 40], [41, 60] and [61, 80].

In [None]:
classes = pd.cut(data["age"], bins = [0, 20, 40, 60, 80], labels = ["0-20", "21-40", "41-60", "61-80"], include_lowest = True)
pd.DataFrame({"Occurrences": classes.value_counts().sort_index(), "Frequencies" : classes.value_counts(normalize = True).sort_index()})

<br><br>

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**3. Univariate analysis of Location**

<br><br>

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**4. Univariate analysis of smoking_history**

<br><br>

### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**5. Univariate analysis of diabetes**

<br><br>

In [None]:
print(data.describe())  # describe is a method: the parentheses are needed to compute the summary

In [None]:
# Visualize numeric distributions
data.hist(figsize=(12, 8))
plt.tight_layout()
plt.show()

<br>

## &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**F. Data balance**

&nbsp;&nbsp;&nbsp;Let us now continue our analysis and preprocessing of the dataset by observing the balance of the feature "diabetes". This step will help us decide which models and pipelines to use later.
<br>&nbsp;&nbsp;&nbsp;First, let us use the *value_counts()* method once again to print the number of occurrences of both classes (0 and 1) of the target feature "diabetes".
<br>&nbsp;&nbsp;&nbsp;We can also use the *min* and *max* methods to calculate the ratio between the majority class and the minority one.

In [None]:
print(data["diabetes"].value_counts())
minority = data["diabetes"].value_counts().min()
majority = data["diabetes"].value_counts().max()
ratio = majority / minority
print(f"Imbalance ratio ≈ 1:{ratio:.2f}")

From what one can see here, the majority class is 0 (patients without diabetes), while the minority class is 1 (patients with diabetes).<br>
The fact that there are 10.76 times more patients without diabetes than patients with diabetes shows that the target is highly imbalanced. To counter this imbalance, resampling techniques such as random oversampling or SMOTE will be used later.
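
As an illustration of what random oversampling does, the sketch below duplicates randomly chosen minority rows until both classes have the same size. It is a simplified plain-Python version written for this explanation, not the imblearn implementation. The function name `random_oversample` and the toy data are assumptions.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=42):
    """Duplicate randomly chosen rows of the smaller classes until all classes are equal."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the majority class
    new_rows, new_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        extra = [rng.choice(pool) for _ in range(target - n)]  # sampling with replacement
        new_rows.extend(extra)
        new_labels.extend([cls] * len(extra))
    return new_rows, new_labels

# Toy data: 10 patients without diabetes (0) and 2 with diabetes (1)
rows = [[i] for i in range(12)]
labels = [0] * 10 + [1] * 2

X_bal, y_bal = random_oversample(rows, labels)
print(Counter(labels), "->", Counter(y_bal))
```

Resampling of this kind is meant to be applied to the training set only, so that the test set keeps the real class proportions.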

<br>

## &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**G. Correlation analysis and encoding**

&nbsp;&nbsp;&nbsp;In this part, we will study the correlation between the different features and use encoding to split qualitative features into dummy ones (quantitative binary).
<br>&nbsp;&nbsp;&nbsp;Let us first display the correlation matrix to see which variables have little linear relationship with the others (correlation close to 0) and which ones are strongly linearly related (correlation close to $\pm1$).

In [None]:
plt.figure(figsize=(11, 12)) # Creation of the figure that will represent the correlation matrix
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm') # We keep only quantitative variables and print the correlation matrix
plt.title("Feature Correlation")
plt.show()

If we look at the last line of this matrix, we can see that the features most correlated with "diabetes" are:
* blood_glucose_level (correlation of 0.42)
* hbA1c_level (correlation of 0.4)
* age (correlation of 0.26)

However, these correlations are only moderate, and the qualitative features do not appear in the matrix.
<br>To solve this problem, let us transform the qualitative features into quantitative binary features using the *pd.get_dummies* function. To make sure that the new features take 0/1 values instead of True/False, we can pass the parameter *dtype = int* to *get_dummies*.


In [None]:
data_encoded = pd.get_dummies(data, columns = data.select_dtypes(include = "object").columns, drop_first=True, dtype = int)
data_encoded.head()
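
To show what *drop_first = True* changes, here is a small plain-Python sketch of one-hot encoding on a made-up column. The helper `one_hot_drop_first` is hypothetical, and it assumes that the categories are taken in sorted order, so that the alphabetically first one is dropped.

```python
def one_hot_drop_first(values, prefix):
    """One-hot encode a list of categories, dropping the first category (in sorted order)."""
    categories = sorted(set(values))
    kept = categories[1:]  # the dropped category is represented by all zeros
    return [{f"{prefix}_{c}": int(v == c) for c in kept} for v in values]

# Made-up sample of the smoking_history column
smoking = ["Never", "Current", "Former", "Never"]

for row in one_hot_drop_first(smoking, "smoking_history"):
    print(row)
```

A feature with k categories therefore produces k - 1 new columns, and a row containing only zeros corresponds to the dropped category ("Current" in this toy example).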

Let us now check the new number of columns (features) and rows (individuals) using the *shape* attribute.

In [None]:
data_encoded.shape

As one can see, the encoding raised the number of columns from 16 to 74, while we still have 99 985 individuals.
<br>Because all the features are now quantitative, we can once again print the correlation matrix. To simplify the display, we will apply a filter: we will only keep the line of diabetes (because for the moment, only the relationships between diabetes and the other features matter) and the features that have an absolute correlation of more than 0.25 with diabetes.

In [None]:
matrix = data_encoded.corr(numeric_only = True)
matrix_temp = matrix["diabetes"].drop("diabetes") # We delete the cell showing the relationship between diabetes and itself (useless)
matrix_filtered = matrix_temp[abs(matrix_temp) > 0.25] # We filter the matrix to keep the features having an absolute value correlation with diabetes greater than 0.25
plt.figure(figsize = (3, 3))
sns.heatmap(matrix_filtered.to_frame().T, annot = True, cmap = 'coolwarm') # We convert the Series into a DataFrame and transpose it to get a horizontal result.
plt.title("Correlation matrix keeping only features with an absolute correlation greater than 0.25")
plt.show()

From this final correlation matrix, we understand that the variables most correlated with "diabetes" are the blood glucose level (correlation of 0.42), then the hbA1c_level (correlation of 0.4) and finally the age (0.26).
<br>However, once again, those correlations with diabetes remain moderate (for hbA1c_level and blood_glucose_level) or even weak (for age).
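
The coefficients discussed in this section are Pearson correlations (the default of the pandas *corr* method), which only measure linear association. The plain-Python sketch below, with a hypothetical `pearson` helper and made-up values, shows that a coefficient of 0 does not mean that two variables are unrelated.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]  # perfectly linear in x
squared = [v ** 2 for v in x]    # fully determined by x, but not linearly

print(pearson(x, linear))   # close to 1
print(pearson(x, squared))  # 0: no linear relationship, yet not independent
```

A weak correlation with "diabetes" therefore does not rule out that a feature is useful to a non-linear model such as a decision tree.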

<br><br>

# **2. Train_test_split and implementation of algorithms and pipelines**

&nbsp;&nbsp;&nbsp;As we saw in the first part, the dataset's target feature is "diabetes", while the other features are explanatory ones. This is why the feature "diabetes" will be called *y*, while the others will be grouped in a variable called *X*.

In [None]:
X = data_encoded.drop('diabetes', axis=1)
y = data_encoded['diabetes']

We can now use the *train_test_split* function with a *test_size* of 0.3 to split our dataset into a training set and a test set. Because our dataset is imbalanced (ratio $\approx$ 11, as we saw before), we add the parameter *stratify = y* to keep the same class proportions in both sets.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)
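
To illustrate what stratification guarantees, the following plain-Python sketch splits each class separately so that both sets keep the original class proportions. It is a simplified illustration, not scikit-learn's algorithm. The function name `stratified_split` and the toy labels are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_indices, test_indices), sampling test_size of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(test_size * len(indices))  # same share taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy target: 90 patients without diabetes and 10 with diabetes
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)

print("train:", Counter(labels[i] for i in train_idx))
print("test: ", Counter(labels[i] for i in test_idx))
```

Without stratification, a random split could by chance leave very few diabetic patients in the test set, which would make recall on class 1 unstable.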

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Now that the separation between the training set and the test set has been done, we can develop some models to predict whether a patient has diabetes or not. Before beginning, let us define a dictionary containing the models to compare.

In [None]:
# Define a dictionary of models to compare
models = {
    # Logistic Regression: a linear model that estimates probabilities using a logistic function.
    # max_iter=1000 ensures the solver has enough iterations to converge.
    # random_state=42 ensures reproducibility of results.
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),

    # Decision Tree: a non-linear model that splits data into branches based on feature values.
    # random_state=42 ensures consistent tree structure across runs.
    "Decision Tree": DecisionTreeClassifier(random_state=42),

    # Random Forest: an ensemble of decision trees that improves accuracy and reduces overfitting.
    # random_state=42 ensures reproducibility of the forest structure.
    "Random Forest": RandomForestClassifier(random_state=42)
}


In [None]:
# Train baseline models
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"\n{name} Performance:")
    # classification_report is already imported at the top of the notebook
    print(classification_report(y_test, model.predict(X_test)), "\n\n")

# STEP 3: Handle Class Imbalance with SMOTE

In [None]:
# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print("Resampled class distribution:", Counter(y_resampled))
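
The core idea of SMOTE is to create a synthetic minority sample on the segment joining a minority point and one of its nearest minority neighbors. The sketch below shows only that interpolation step in plain Python. The function `smote_like_sample` and the two points are hypothetical, and the neighbor search performed by the real algorithm is left out.

```python
import random

def smote_like_sample(point, neighbor, rng):
    """Return a synthetic point drawn at random on the segment [point, neighbor]."""
    gap = rng.random()  # uniform value in [0, 1)
    return [p + gap * (q - p) for p, q in zip(point, neighbor)]

rng = random.Random(42)

# Two hypothetical minority patients described by [age, one-hot flag]
patient_a = [50.0, 1.0]
patient_b = [60.0, 0.0]

synthetic = smote_like_sample(patient_a, patient_b, rng)
print(synthetic)
```

Note that the one-hot flag of the synthetic patient is generally a fractional value rather than 0 or 1. With many dummy columns, as in this dataset, a variant designed for mixed data such as imblearn's *SMOTENC* may be worth considering.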

In [None]:
# Retrain Random Forest on resampled data
rf_balanced = RandomForestClassifier(random_state=42)
rf_balanced.fit(X_resampled, y_resampled)
print("\nRandom Forest (SMOTE) Performance:")
print(classification_report(y_test, rf_balanced.predict(X_test)))

<br><br>

# **3. Data normalization and reduction**

# **4. Analysis of the results and management of the overfitting**

In [None]:
def evaluate_model(model, X_test, y_test, label="Model"):
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]

    fpr, tpr, _ = roc_curve(y_test, y_proba)
    precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
    auc_roc = auc(fpr, tpr)
    auc_pr = auc(recall, precision)

    print(f"\n=== {label} ===")
    print(classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))
    print(f"AUC (ROC): {auc_roc:.3f} | AUC (PR): {auc_pr:.3f}")

    # Plot ROC
    plt.figure()
    plt.plot(fpr, tpr, label=f"{label} ROC (AUC={auc_roc:.2f})")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title("ROC Curve")
    plt.legend()
    plt.grid(True)
    plt.show()

    # Plot Precision-Recall
    plt.figure()
    plt.plot(recall, precision, label=f"{label} PR (AUC={auc_pr:.2f})")
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title("Precision-Recall Curve")
    plt.legend()
    plt.grid(True)
    plt.show()

# Evaluate both models
evaluate_model(models["Random Forest"], X_test, y_test, label="Random Forest (Baseline)")
evaluate_model(rf_balanced, X_test, y_test, label="Random Forest (SMOTE)")

# Cost-Sensitive Evaluation
def custom_cost(y_true, y_pred, fn_weight=10, fp_weight=1):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return fn_weight * fn + fp_weight * fp

cost_baseline = custom_cost(y_test, models["Random Forest"].predict(X_test))
cost_smote = custom_cost(y_test, rf_balanced.predict(X_test))

print(f"Custom Cost (Baseline): {cost_baseline}")
print(f"Custom Cost (SMOTE): {cost_smote}")
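
To see how the cost function above ranks models, here is a worked example in plain Python on made-up predictions. The helper names and the two hypothetical models are assumptions, and the weights reuse the defaults of *custom_cost* (a missed diabetic patient costs 10 times more than a false alarm).

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, in the same order as in custom_cost."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

def weighted_cost(y_true, y_pred, fn_weight=10, fp_weight=1):
    """Total cost = fn_weight * false negatives + fp_weight * false positives."""
    _, fp, fn, _ = confusion_counts(y_true, y_pred)
    return fn_weight * fn + fp_weight * fp

y_true    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
cautious  = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # misses 2 diabetic patients, no false alarm
sensitive = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # finds every diabetic patient, 3 false alarms

print(weighted_cost(y_true, cautious))   # 10 * 2 + 1 * 0 = 20
print(weighted_cost(y_true, sensitive))  # 10 * 0 + 1 * 3 = 3
```

With these weights the second model is preferred, even though it makes more errors in total (3 against 2).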




RANDOM OVERSAMPLING

In [None]:

import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load your dataset
df = pd.read_csv('diabetes_dataset.csv')

# One-hot encode categorical columns
df_encoded = pd.get_dummies(df, drop_first=True)

# Separate features and target
X = df_encoded.drop('diabetes', axis=1)
y = df_encoded['diabetes']

# Split into training and test sets (stratified to preserve the class proportions)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

# Apply RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

# Train the model
model = RandomForestClassifier(random_state=42)
model.fit(X_resampled, y_resampled)

# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


RANDOM UNDERSAMPLING

In [None]:
# Import necessary libraries
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Load your dataset
df = pd.read_csv('diabetes_dataset.csv')

# One-hot encode categorical variables
df_encoded = pd.get_dummies(df, drop_first=True)

# Separate features and target
X = df_encoded.drop('diabetes', axis=1)
y = df_encoded['diabetes']

# Split into training and test sets (stratified to preserve the class proportions)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

# Apply RandomUnderSampler to balance the training data
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)

# Train a Random Forest model
model = RandomForestClassifier(random_state=42)
model.fit(X_resampled, y_resampled)

# Predict and evaluate
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


HYPERPARAMETER TUNING FOR RANDOM FOREST USING RANDOM OVERSAMPLING


In [None]:
# Step 1: Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, precision_recall_curve, auc
import matplotlib.pyplot as plt

In [None]:
# Step 2: Load and preprocess the dataset
data = pd.read_csv("diabetes_dataset.csv")

# One-hot encode categorical features
data_encoded = pd.get_dummies(data, columns=['gender', 'location', 'smoking_history'], drop_first=True)

# Separate features and target
X = data_encoded.drop('diabetes', axis=1)
y = data_encoded['diabetes']

In [None]:
# Step 3: Split into training and test sets (stratified to preserve class balance)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

In [None]:
# Step 4: Apply Random OverSampling to balance the training data
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

In [None]:
# Step 5: Scale the features (optional for Random Forest, but good practice)
scaler = StandardScaler()
X_resampled_scaled = scaler.fit_transform(X_resampled)
X_test_scaled = scaler.transform(X_test)
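
Standardization replaces each value x by (x - mean) / std, where the mean and the standard deviation are estimated on the training data only and then reused for the test data. The sketch below illustrates this for a single column in plain Python. The helper names `fit_scaler` and `transform` and the numbers are assumptions made for the example.

```python
import statistics

def fit_scaler(column):
    """Estimate the mean and the population standard deviation on the training column."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Standardize a column with statistics that were estimated beforehand."""
    return [(value - mean) / std for value in column]

train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [3.0, 6.0]

mean, std = fit_scaler(train)            # learned on the training data only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)  # the test data reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics on the test set avoids leaking information about the test data into the preprocessing step.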

In [None]:
# Step 6: Define the Random Forest model and hyperparameter grid
rf = RandomForestClassifier(random_state=42)

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2']
}

In [None]:
# Step 7: Use RandomizedSearchCV for faster tuning
random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_grid,
    n_iter=20,              # Try 20 random combinations
    cv=2,                   # 2-fold cross-validation
    scoring='f1',           # Optimize for F1-score
    verbose=2,
    n_jobs=-1,
    random_state=42
)

In [None]:
# Step 8: Fit the randomized search on the resampled training data
random_search.fit(X_resampled_scaled, y_resampled)

Fitting 2 folds for each of 20 candidates, totalling 40 fits
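
One caveat about the search above: because the training data was oversampled before cross-validation, copies of the same minority row can land in both the fitting fold and the validation fold, so the cross-validated F1 may be optimistic. The sketch below shows the alternative of resampling inside each fold, on synthetic data from `make_classification`. The helper `oversample_minority`, the synthetic data and the small forest are assumptions made for illustration; imbalanced-learn's own `Pipeline` can achieve the same effect together with `RandomizedSearchCV`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic imbalanced data (assumption: stands in for the training split)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=42
)


def oversample_minority(X, y, seed):
    """Hypothetical helper: assumes binary labels with 1 as the minority class."""
    n_extra = int((y == 0).sum() - (y == 1).sum())
    extra = resample(X[y == 1], replace=True, n_samples=n_extra, random_state=seed)
    return np.vstack([X, extra]), np.concatenate([y, np.ones(n_extra, dtype=y.dtype)])


fold_scores = []
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
for train_idx, val_idx in cv.split(X_demo, y_demo):
    # Oversample the fitting fold only; the validation fold stays untouched
    X_tr, y_tr = oversample_minority(X_demo[train_idx], y_demo[train_idx], seed=42)
    model = RandomForestClassifier(n_estimators=50, random_state=42)
    model.fit(X_tr, y_tr)
    fold_scores.append(f1_score(y_demo[val_idx], model.predict(X_demo[val_idx])))

print("F1 per fold:", np.round(fold_scores, 3))
```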


In [None]:
# Step 9: Retrieve the best model and parameters
best_rf = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)

In [None]:
# Step 10: Evaluate the tuned model on the original test set
y_pred = best_rf.predict(X_test_scaled)
y_proba = best_rf.predict_proba(X_test_scaled)[:, 1]

print("\n=== Tuned Random Forest (Random OverSampling) ===")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
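
As a reading aid for the confusion matrix printed above, the following sketch unpacks a binary confusion matrix into its four counts on made-up labels (the toy lists are assumptions, not project results). In scikit-learn, rows are the true classes and columns are the predicted classes, so `ravel()` returns the counts in the order TN, FP, FN, TP.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (assumption): 2 negatives, 3 positives
y_true_toy = [0, 0, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Recall and precision of the positive (diabetic) class
recall_pos = tp / (tp + fn)
precision_pos = tp / (tp + fp)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"recall={recall_pos:.2f} precision={precision_pos:.2f}")
```

For a screening task, the false negatives (diabetic patients predicted as healthy) are usually the count to watch most closely.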

In [None]:
# Step 11: Plot ROC and Precision-Recall curves
fpr, tpr, _ = roc_curve(y_test, y_proba)
precision, recall, _ = precision_recall_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
pr_auc = auc(recall, precision)

plt.figure()
plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.grid(True)
plt.show()

plt.figure()
plt.plot(recall, precision, label=f"PR Curve (AUC = {pr_auc:.2f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.grid(True)
plt.show()
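
The PR AUC above is computed with the trapezoidal rule, which interpolates linearly between points of the precision-recall curve. scikit-learn also offers `average_precision_score`, a step-wise summary that avoids this interpolation. The sketch below compares the two on four made-up scores (the toy arrays are assumptions); the two numbers generally differ slightly.

```python
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Toy labels and scores (assumption)
y_true_toy = [0, 0, 1, 1]
y_score_toy = [0.1, 0.4, 0.35, 0.8]

precision_toy, recall_toy, _ = precision_recall_curve(y_true_toy, y_score_toy)

# Trapezoidal area under the PR curve, as in the cell above
trap_area = auc(recall_toy, precision_toy)

# Step-wise average precision: sum of (recall increase) * precision
ap = average_precision_score(y_true_toy, y_score_toy)

print(f"trapezoidal PR AUC = {trap_area:.3f}")
print(f"average precision  = {ap:.3f}")
```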

LOGISTIC REGRESSION WITH RANDOM OVERSAMPLING

In [None]:
# Step 1: Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, precision_recall_curve, auc
import matplotlib.pyplot as plt

In [None]:
# Step 2: Load and preprocess the dataset
data = pd.read_csv("diabetes_dataset.csv")

# One-hot encode categorical features
data_encoded = pd.get_dummies(data, columns=['gender', 'location', 'smoking_history'], drop_first=True)

# Separate features and target
X = data_encoded.drop('diabetes', axis=1)
y = data_encoded['diabetes']

In [None]:
# Step 3: Split into training and test sets (stratified so both keep the original class proportions)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

In [None]:
# Step 4: Apply Random OverSampling to balance the training data
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

In [None]:
# Step 5: Scale the features (important for Logistic Regression)
scaler = StandardScaler()
X_resampled_scaled = scaler.fit_transform(X_resampled)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Step 6: Define and train Logistic Regression with fixed hyperparameters (no search is run here)
best_log_reg = LogisticRegression(
    C=1.0,                    # Inverse of regularization strength (default = 1.0; smaller = stronger)
    penalty='l2',             # L2 regularization (standard)
    solver='liblinear',       # Works well with small datasets and L2
    max_iter=1000,
    random_state=42
)

best_log_reg.fit(X_resampled_scaled, y_resampled)
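
The model above uses fixed values for `C` and `penalty`. If one wanted to tune `C` instead, a grid search is one option. The following is a minimal sketch on synthetic data, under the assumption that the scaler is placed inside a scikit-learn `Pipeline` so that it is re-fitted on each training fold; the grid values, the 3-fold setting and the data are illustrative choices, not results from the project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (assumption: stands in for the training split)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000, random_state=42)),
])

# Candidate values for C (illustrative)
c_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, c_grid, scoring="f1", cv=3)
search.fit(X_demo, y_demo)

print("Best C:", search.best_params_["clf__C"])
print("Mean CV F1:", round(search.best_score_, 3))
```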

In [None]:
# Step 7: Evaluate the model on the original (non-resampled) test set
y_pred = best_log_reg.predict(X_test_scaled)
y_proba = best_log_reg.predict_proba(X_test_scaled)[:, 1]

print("\n=== Logistic Regression (Random OverSampling) ===")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

In [None]:
# Step 8: Plot ROC and Precision-Recall curves
fpr, tpr, _ = roc_curve(y_test, y_proba)
precision, recall, _ = precision_recall_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
pr_auc = auc(recall, precision)

plt.figure()
plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.grid(True)
plt.show()

plt.figure()
plt.plot(recall, precision, label=f"PR Curve (AUC = {pr_auc:.2f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.grid(True)
plt.show()

DECISION TREE WITH RANDOM OVERSAMPLING

In [None]:
# Step 1: Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from sklearn.metrics import classification_report, confusion_matrix, roc_curve, precision_recall_curve, auc, accuracy_score
import matplotlib.pyplot as plt

In [None]:
# Step 2: Load and preprocess the dataset
data = pd.read_csv("diabetes_dataset.csv")

# One-hot encode categorical features
data_encoded = pd.get_dummies(data, columns=['gender', 'location', 'smoking_history'], drop_first=True)

# Separate features and target
X = data_encoded.drop('diabetes', axis=1)
y = data_encoded['diabetes']

In [None]:
# Step 3: Split into training and test sets (stratified so both keep the original class proportions)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.3, random_state=42)

In [None]:
# Step 4: Apply Random OverSampling to balance the training data
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_train, y_train)

In [None]:
# Step 5: Scale the features (optional for Decision Tree)
scaler = StandardScaler()
X_resampled_scaled = scaler.fit_transform(X_resampled)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Step 6: Define the Decision Tree model and hyperparameter grid
dt = DecisionTreeClassifier(random_state=42)

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
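
To see how much of this grid a randomized search with 20 iterations covers, the number of combinations can be counted with scikit-learn's `ParameterGrid`. This is a small sketch; the names `dt_grid`, `n_total` and `n_iter` are illustrative, and the grid is copied from the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in Step 6 (copied here so the snippet is self-contained)
dt_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

n_total = len(ParameterGrid(dt_grid))  # 2 * 4 * 3 * 3 combinations
n_iter = 20                            # budget of the randomized search

print(f"{n_iter} of {n_total} combinations sampled ({n_iter / n_total:.0%})")
```

Since only a fraction of the grid is tried, the reported best parameters are the best among the sampled candidates, not necessarily the best in the full grid.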

In [None]:
# Step 7: Use RandomizedSearchCV for faster tuning
random_search = RandomizedSearchCV(
    estimator=dt,
    param_distributions=param_grid,
    n_iter=20,              # Try 20 random combinations
    cv=2,                   # 2-fold cross-validation
    scoring='f1',           # Optimize for F1-score
    verbose=2,
    n_jobs=-1,
    random_state=42
)

In [None]:
# Step 8: Fit the randomized search on the resampled training data
random_search.fit(X_resampled_scaled, y_resampled)

In [None]:
# Step 9: Retrieve the best model and parameters
best_dt = random_search.best_estimator_
print("Best Parameters:", random_search.best_params_)

In [None]:
# Step 10: Evaluate the tuned model on the original test set
y_pred = best_dt.predict(X_test_scaled)
y_proba = best_dt.predict_proba(X_test_scaled)[:, 1]
accuracy = accuracy_score(y_test, y_pred)

print("\n=== Tuned Decision Tree (Random OverSampling) ===")
print(f"Accuracy: {accuracy:.4f}")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

In [None]:
# Step 11: Plot ROC and Precision-Recall curves
fpr, tpr, _ = roc_curve(y_test, y_proba)
precision, recall, _ = precision_recall_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
pr_auc = auc(recall, precision)

plt.figure()
plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle='--', color='gray')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.grid(True)
plt.show()

plt.figure()
plt.plot(recall, precision, label=f"PR Curve (AUC = {pr_auc:.2f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.grid(True)
plt.show()