**Diabetes Risk Prediction Model: A Data-Driven Approach to Early Detection**

![alt](https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExMWI0cGQ4ZG5kd25hNTU4Z3V1bmNkN3J6ZGdmdjNrZGFndG84MWlvcyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/8g9pJglyWQxr5xkfky/giphy.gif)

**Project Background and Objective**

- Given the increasing prevalence of diabetes, early identification of at-risk individuals is crucial for timely intervention and better health outcomes. This project aims to develop a prediction model to assess a patient’s risk of developing diabetes based on specific health metrics. Using variables like age, blood pressure, pulse rate, and blood sugar levels, the model aims to support healthcare providers in delivering targeted interventions. This model will lay the foundation for a telemedicine platform focused on early chronic illness detection; particularly aiding underserved communities where early interventions can significantly enhance care accessibility. The model aligns with several UN Sustainable Development Goals (SDGs), notably SDG 3: Good Health and Well-being, SDG 9: Industry, Innovation, and Infrastructure, and SDG 10: Reduced Inequalities, by emphasizing equitable access to preventive healthcare through innovative technology.


**Business Questions**

- Can we accurately predict the likelihood of diabetes in a patient based on their health metrics (age, blood pressure, pulse rate)?

- Which features are the most predictive of diabetes, enabling focused, efficient monitoring?

- How accurate is the model across different demographic groups, including various age groups and gender categories?

- What additional parameters or methods can improve the model’s accuracy and reliability for diabetes risk classification?


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
import xgboost as xgb


from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

import warnings
warnings.filterwarnings("ignore")

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Importing the Data**

In [4]:
df = pd.read_csv("data/Vitals_prep.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'data/Vitals_prep.csv'

# **Data Understanding**

- **`patient_no`**: A unique identifier for each patient in the dataset.

Importance: Ensures each entry is distinct, preventing redundancy and ensuring accurate patient-specific analysis.
- **`gender`**:Specifies the gender of the patient.

Importance: Gender can influence diabetes risk factors and symptom presentation. For example, certain studies suggest that men and women may exhibit different diabetes symptoms and associated risks.
- **`age`**: The age of the patient in years.

Importance: Age is a well-known risk factor for diabetes, with higher prevalence observed in older populations due to insulin resistance and other metabolic changes over time.
- **`weight and height`**: The weight (in kg) and height (in cm) of the patient, used to calculate Body Mass Index (BMI).

Importance: BMI is a key indicator of obesity, a significant risk factor for diabetes. Higher BMI values are commonly associated with an increased risk
- **`dialstolic and systolic`**:Measures of blood pressure, with systolic representing the pressure when the heart beats and diastolic when the heart is at rest.

Importance: Hypertension is frequently associated with diabetes, as both conditions often share underlying risk factors, such as obesity and insulin resistance. High blood pressure can indicate vascular issues, which are common in diabetic patients.              
- **`pulse`**: The resting heart rate of the patient.

Importance: Abnormal pulse rates, particularly elevated resting heart rates, may indicate stress on the cardiovascular system, which can be linked to diabetes complications.
- **`temp`** (Temperature): The body temperature of the patient.

Importance: While not a direct indicator of diabetes, fluctuations in temperature may point to infections or other inflammatory responses. Diabetic patients are at higher risk for infections, which can influence blood glucose control.
- **`resp`** (Respiratory Rate): The number of breaths a patient takes per minute.

Importance: Abnormal respiratory rates, such as rapid breathing, can indicate metabolic acidosis, a potential complication of diabetes. In some cases, breathing irregularities might also suggest diabetes-related respiratory issues.
- **`disease`**:Indicates the presence of any pre-existing conditions or diseases that may predispose patients to diabetes.

Importance: Understanding comorbidities provides a clearer picture of a patient’s overall health and risk factors. Conditions such as hypertension, heart disease, or metabolic syndrome can significantly increase the likelihood of diabetes onset.

# **Data Exploration and Cleaning**
- Data Summary: An overview of the dataset by exploring column data types, null values, and basic statistics.

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
# summary statistics
df.describe()

## **Handling Missing Data**
Imputation: For columns (pulse, temp and resp), using mean or median imputation.

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# impute rows with missing values
df2 = df.copy()
df2['pulse'] = df2['pulse'].fillna(df2['pulse'].median())
df2['temp'] = df2['temp'].fillna(df2['temp'].mean())
df2['resp'] = df2['resp'].fillna(df2['resp'].mean())

df2.isnull().sum()

In [None]:
df2.info()

## **Detecting Outliers**
- The histogram and boxplots below show that there are outliers in the data.
- However, this is expected as the data is extracted from a hospital database where patients come in with extreme and unique conditions.
- Also, being a hospital that receives over 9,000 patients per month, individuals from all walks of life will be received.

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.boxplot(y='age', data=df2)
plt.title('Box Plot of Age')

plt.subplot(1, 3, 2)
sns.boxplot(y='weight', data=df2)
plt.title('Box Plot of Weight')

plt.subplot(1, 3, 3)
sns.boxplot(y='height', data=df2)
plt.title('Box Plot of Height')

plt.tight_layout()
plt.show()

In [None]:
# age, weight and height histogram

plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.hist(df2['age'], bins=20)
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')

plt.subplot(1, 3, 2)
plt.hist(df2['weight'], bins=20)
plt.title('Histogram of Weight')
plt.xlabel('Weight')
plt.ylabel('Frequency')

plt.subplot(1, 3, 3)
plt.hist(df2['height'], bins=20)
plt.title('Histogram of Height')
plt.xlabel('Height')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

- The analysis will focus on diabetes risk in adults, thus, patients below 18 years will be dropped.

In [None]:
# Drop rows where 'age' is below 18 (adults only)
df2 = df2[df2["age"] > 17]

plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.hist(df2["age"], bins=20)
plt.title('Histogram of Age')
plt.xlabel('Age')
plt.ylabel('Frequency')

plt.subplot(1, 3, 2)
plt.hist(df2['weight'], bins=20)
plt.title('Histogram of Weight')
plt.xlabel('Weight')
plt.ylabel('Frequency')

plt.subplot(1, 3, 3)
plt.hist(df2['height'], bins=20)
plt.title('Histogram of Height')
plt.xlabel('Height')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.boxplot(y='age', data=df2)
plt.title('Box Plot of Age')

plt.subplot(1, 3, 2)
sns.boxplot(y='weight', data=df2)
plt.title('Box Plot of Weight')

plt.subplot(1, 3, 3)
sns.boxplot(y='height', data=df2)
plt.title('Box Plot of Height')

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.boxplot(y='pulse', data=df2)
plt.title('Box Plot of Pulse Rate')

plt.subplot(1, 3, 2)
sns.boxplot(y='temp', data=df2)
plt.title('Box Plot of Temperature')

plt.subplot(1, 3, 3)
sns.boxplot(y='resp', data=df2)
plt.title('Box Plot of Respiration Rate')

plt.tight_layout()
plt.show()

In [None]:
# Drop rows where 'pulse' is below 20
df2 = df2[df2["pulse"] >= 20]

# Drop rows where 'temp' is below 30
df2 = df2[df2["temp"]>= 30]


# box plots
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.boxplot(y='pulse', data=df2)
plt.title('Box Plot of Pulse Rate')

plt.subplot(1, 3, 2)
sns.boxplot(y='temp', data=df2)
plt.title('Box Plot of Temperature')

plt.subplot(1, 3, 3)
sns.boxplot(y='resp', data=df2)
plt.title('Box Plot of Respiration Rate')

plt.tight_layout()
plt.show()

In [None]:
df2.shape

# **Feature Engineering**
- Creating additional features that enhance the model’s ability to analyze and predict diabetes risk based on patient health metrics.

## **1. Age Group: `age_group`**
- Segmenting patients into age categories relevant to health risk analysis.

In [None]:
# Define age groups
bins_age = [18, 25, 35, 45, 55, 65, float("inf")]
labels_age = [
    "18-25", "26-35", "36-45", "46-55", "56-65", "Over 65"
]
df2["age_group"] = pd.cut(df2["age"], bins=bins_age, labels=labels_age, right=False)


## **2. Body Mass Index: `bmi`, `bmi_category`**
- Calculate BMI using height and weight.
- Create BMI categories


In [None]:
# Calculate BMI
df2["bmi"] = df2["weight"] / ((df2["height"] / 100) ** 2) # height in cm
df2["bmi"] = df2["bmi"].round(1)

In [None]:
# categorize BMI
def categorize_bmi(bmi):
    if bmi <= 16.9:
        return "Severely Underweight"
    elif 17 <= bmi < 18.4:
        return "Underweight"
    elif 18.4 <= bmi <= 24.9:
        return "Normal"
    elif 25 <= bmi <= 29.9:
        return "Overweight"
    elif 30 <= bmi <= 34.9:
        return "Obese Class 1"
    elif 35 <= bmi <= 39.9:
        return "Obese Class 2"
    elif bmi >= 40:
        return "Obese Class 3"
    else:
        return "Unknown"

# Apply the function to create the 'bmi_category' column
df2['bmi_category'] = df2['bmi'].apply(categorize_bmi)

## **3. Blood Pressure: `blood_pressure_category`**
- Classify blood pressure levels to identify patients at risk of hypertension or hypotension.

In [None]:
# Classify blood pressure
def classify_blood_pressure(row):
    if row["systolic"] < 90 and row["diastolic"] < 60:
        return "low"
    elif row["systolic"] > 140 or row["diastolic"] > 90:
        return "high"
    else:
        return "normal"

df2["blood_pressure_category"] = df2.apply(classify_blood_pressure, axis=1)


## **6. Diabetes Indicator: `is_diabetes`**
- Identify patients who have been diagnosed with diabetes based on the `disease` column.

In [None]:
# Determine if the patient disease has diabetes/diabetic
df2["is_diabetes"] = df2["disease"].str.contains(
    r"diabetes|diabetic", case=False, na=False).apply(lambda x: "yes" if x else "no"
    )

## **7. Patient Has Hypertension: `is_hypertensive`**
-  Identify patients with hypertension based on the `disease` column.

In [None]:
# Determine if the patient disease has hypertension/hypertensive
df2["is_hypertensive"] = df2["disease"].str.contains(
    r"hypertension|hypertensive", case=False, na=False).apply(lambda x: "yes" if x else "no"
    )

In [None]:
df2.info()

In [None]:
df2.head(10)

In [None]:
df2.columns

In [None]:
df2.to_csv("data/cleaned_data_2.csv", index=False)

## **8. Dropping Unnecessary Columns**


- Columns to keep:
   
    patient_no, gender, pulse, temp, resp, bmi_category, age_group, blood_pressure_category, is_hypertensive, is_diabetes

In [None]:
# remain with specified columns
columns_to_keep = ["patient_no", "gender", "pulse", "temp", "resp", "bmi_category", "age_group", "blood_pressure_category", "is_hypertensive", "is_diabetes"]
hospital = df2[columns_to_keep]

# drop age groups Below 18 and Below 14
# hospital = hospital[~hospital['age_group'].isin(["Below 14", "Below 18"])]

In [None]:
hospital.info()

## **9. Creating the Target Column: diabetes_risk**
- Classify patients into risk categories to predict diabetes likelihood.

- Patients are categorized as "Low Risk", "Medium Risk", "High Risk", or "Has Diabetes" based on multiple factors like age, BMI, and blood pressure.

In [None]:
def define_diabetes_risk(row):
    # Has Diabetes
    if row["is_diabetes"] == "yes":
        return "Has Diabetes"

    # High Risk
    if (
        # condition 1
        (row["age_group"] in ["36-45", "46-55", "56-65", "Over 65"] and
         row["bmi_category"] in ["Overweight", "Obese Class 1", "Obese Class 2", "Obese Class 3"] and
         row["blood_pressure_category"] == "high" and
         (row["is_hypertensive"] == "yes" or row["is_hypertensive"] == "no"))
        or # condition 2
        (row["age_group"] in ["18-25", "25-35", "36-45", "46-55"] and
         row["bmi_category"] in ["Obese Class 3"] and
         row["blood_pressure_category"] == "high" and
         (row["is_hypertensive"] == "yes" or row["is_hypertensive"] == "no"))
        or # condition 3
        ((row["pulse"] > 100 or row["pulse"] < 50) and
         row["temp"] > 37.5 and
         row["resp"] > 20 or row["resp"] < 12)
    ):
        return "High Risk"

    # Medium Risk
    if (
        # condition 1
        (row["age_group"] in ["18-25", "25-35", "36-45", "46-55", "56-65", "Over 65"] and
         row["bmi_category"] in ["Overweight", "Obese Class 1"] and
         row["blood_pressure_category"] == "high" and
         (row["is_hypertensive"] == "yes" or row["is_hypertensive"] == "no"))
        or # condition 2
        (row["bmi_category"] in ["Overweight", "Obese Class 1"] and
         row["blood_pressure_category"] == "high" and
         (row["is_hypertensive"] == "yes" or row["is_hypertensive"] == "no"))
        or # condition 3
        ((row["pulse"] > 90 or row["pulse"] < 60) and
         (row["temp"] > 37.0 and row["temp"] <= 37.5) and
         row["resp"] > 16 and row["resp"] < 18)
    ):
        return "Medium Risk"

    # Low Risk
    return "Low Risk"

# Apply the function to create the target column
hospital["diabetes_risk"] = hospital.apply(define_diabetes_risk, axis=1)

# Display value counts for the new diabetes_risk column
print(hospital["diabetes_risk"].value_counts())


# **Exploratory Data Analysis (EDA)**
- Drop records with `"Has Diabetes"` is yes.
- Drop `is_diabetes` column
- Summary statistics
- Univariate, Bivariate and Multivariate analysis.

## **1. Drop records where `"Has Diabetes"` is `yes` and `is_diabetes` column**
- We are predicting the risk of becoming diabetic, thus, the columns are no longer needed.

In [None]:
# drop "Has Diabetes" records
hospital = hospital[hospital["diabetes_risk"] != "Has Diabetes"]
hospital["diabetes_risk"].value_counts()

In [None]:
# Drop 'is_diabetes' column
hospital = hospital.drop("is_diabetes", axis=1)
hospital.info()

In [None]:
hospital = hospital.reset_index(drop=True)
hospital.info()

- There are now 30,538   records remaining.
- The data is imbalanced.

In [None]:
hospital.to_csv("data/modeling_data_2.csv", index=False)

In [None]:
hospital.shape

In [None]:
hospital.head()

In [None]:
# Categorical columns
categorical_cols = ["gender", "bmi_category", "age_group", "blood_pressure_category", "is_hypertensive"]
numerical_cols = ["pulse", "temp", "resp"]
target = ["diabetes_risk"]

## **Summary Statistics**
- To obtain general insights into data spread, central tendency, and feature distribution.

In [None]:
# data summary
hospital.describe()

## **Target Distribution**
- Visualize `diabetes_risk` distribution to understand category imbalances

In [None]:
target_counts = hospital['diabetes_risk'].value_counts()
print(target_counts)

# Create a pie chart
plt.figure(figsize=(7, 6))
plt.pie(target_counts, labels=target_counts.index, autopct='%1.1f%%', startangle=50)
plt.title('Distribution of Diabetes Risk')
plt.axis('equal')
plt.show()

## **Univariate Analysis**

### **Categorical Columns: Value Counts**

-  Analyze value counts for each categorical column

- The features are either object or categorical.

In [None]:
# column value counts
for i, col in enumerate(categorical_cols):
  print(f"\nValue counts: '{col}':")
  print(hospital[col].value_counts())
  # print((hospital[col].value_counts() * 100).round(2))
  print("-" * 100)

### **Countplots**

In [None]:
# plotting the features
def categorical_features_count(df, categorical_cols):
    n_cols = 2
    n_rows = (len(categorical_cols) + 1) // n_cols  # number of rows needed

    # subplots
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, 4 * n_rows))
    axes = axes.flatten()

    for i, col in enumerate(categorical_cols):
        sns.countplot(x=df[col], palette="Paired", ax=axes[i])
        axes[i].set_title(f'Distribution of {col}')
        # axes[i].tick_params(axis='x', rotation=45)
        x_labels = [label.get_text().replace(" ", "\n") for label in axes[i].get_xticklabels()]
        axes[i].set_xticklabels(x_labels, rotation=0)

    for j in range(i + 1, len(axes)):
        fig.delaxes(axes[j])

    plt.tight_layout()
    plt.show()

categorical_features_count(hospital, categorical_cols)

## **Numerical Columns: Data Distribution**

In [None]:
#histograms and box plots

for col in numerical_cols:
  plt.figure(figsize=(8, 4))
  plt.subplot(1, 2, 1)
  sns.histplot(hospital[col], kde=True)
  plt.title(f'Distribution of {col}')

  plt.subplot(1, 2, 2)
  sns.boxplot(y=hospital[col])
  plt.title(f'Boxplot of {col}')

  plt.tight_layout()
  plt.show()

## **Bivariate Analysis**
 - Analyze relationships between categorical features and `diabetes_risk` to assess any associations with risk categories.

### **Features vs Target**

In [None]:
n_cols = 2
n_rows = (len(categorical_cols) + 1) // n_cols
fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, 4 * n_rows))
axes = axes.flatten()

# countplot
for i, col in enumerate(categorical_cols):
    sns.countplot(data=hospital, x=col, hue="diabetes_risk", palette="Set2", ax=axes[i])
    axes[i].set_title(f"{col} Distribution by Diabetes Risk")
    labels = axes[i].get_xticklabels()
    new_labels = [label.get_text().replace(" ", "\n") for label in labels]  # Add line breaks
    axes[i].set_xticklabels(new_labels)
    # axes[i].tick_params(axis='x', rotation=45)

for j in range(i + 1, len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

## **Multivariate Analysis**
- Explore how age and BMI interact with `diabetes_risk` through a heatmap.


### **Age Group vs BMI Category vs Diabetes Risk**

In [None]:
# Age Group vs BMI Category vs Diabetes Risk
plt.figure(figsize=(14, 8))
sns.heatmap(pd.crosstab(hospital["age_group"], hospital["bmi_category"], values=hospital["diabetes_risk"], aggfunc='count', normalize="index"), annot=True, cmap="Blues")
plt.title("Heatmap of Age Group vs BMI Category by Diabetes Risk Count")
plt.show()


### **Gender vs Pulse, Temperature, Respiratory Rate**

In [None]:
# Plotting gender vs pulse, temp, resp
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x='gender', y='pulse', data=df2)
plt.title('Gender vs Pulse Rate')

plt.subplot(1, 3, 2)
sns.boxplot(x='gender', y='temp', data=df2)
plt.title('Gender vs Temperature')

plt.subplot(1, 3, 3)
sns.boxplot(x='gender', y='resp', data=df2)
plt.title('Gender vs Respiration Rate')

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
sns.boxplot(x='gender', y='weight', data=df2)
plt.title('Gender vs Weight')

plt.subplot(1, 3, 2)
sns.boxplot(x='gender', y='height', data=df2)
plt.title('Gender vs Height')

plt.subplot(1, 3, 3)
sns.boxplot(x='gender', y='age', data=df2)
plt.title('Gender vs Age')

plt.tight_layout()
plt.show()

# **Modeling**

## **Data Preprocessing**

### **Data Encoding**
- Convert categorical columns to numerical data type.

In [None]:
# before encoding
hospital.head()

In [None]:
hospital2 = hospital.copy()

# Convert categorical columns to numerical data type

# 1. gender
# Male -0, Female - 1
hospital2["gender"] = hospital2["gender"].map({"Male": 0, "Female": 1})

# 2. bmi_category
hospital2["bmi_category"] = hospital2["bmi_category"].map({
    "Normal": 1,
    "Severely Underweight": 2,
    "Underweight": 3,
    "Overweight": 4,
    "Obese Class 1": 5,
    "Obese Class 2": 6,
    "Obese Class 3": 7
})

# 3. blood_pressure_category
hospital2["blood_pressure_category"] = hospital2["blood_pressure_category"].map({
    "normal": 1,
    "low": 2,
    "high": 3
})

# 4. age_group
hospital2['age_group'] = hospital2['age_group'].map({
    "18-25": 1,
    "26-35": 2,
    "36-45": 3,
    "46-55": 4,
    "56-65": 5,
    "Over 65": 6
})

# 5. is_hypertensive
# 0 for 'no', 1 for 'yes'
hospital2["is_hypertensive"] = hospital2["is_hypertensive"].map({"no": 0, "yes": 1})

# 6. diabetes_risk
hospital2["diabetes_risk"] = hospital2["diabetes_risk"].map({
    "Low Risk": 0,
    "Medium Risk": 1,
    "High Risk": 2,
})

# Check the updated dataframe
hospital2.head()


In [None]:
# Convert to integer type
hospital2 = hospital2.astype({
    "gender": "int",
    "bmi_category": "int",
    "blood_pressure_category": "int",
    "age_group": "int",
    "is_hypertensive": "int",
    "diabetes_risk": "int"
})

### **Separate the Data**
- Define the features and target

In [None]:
X = hospital2.drop(columns=["patient_no", "diabetes_risk"]) # features
y = hospital2["diabetes_risk"] # target

In [None]:
hospital2.info()

In [None]:
hospital2.columns

In [None]:
X.head(10)

In [None]:
y.head(20)

### **Split: Training and Testing Sets**

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)

### **Class Imbalance**
- SMOTE: *Synthetic Minority Oversampling Technique*

In [None]:
smote = SMOTE(sampling_strategy='auto', random_state=42)

# Fit and transform the training set
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
X_train_resampled = X_train_resampled.astype(int)
X_test = X_test.astype(int)

# Check the class distribution
print("Class distribution before SMOTE:", y_train.value_counts())
print("Class distribution after SMOTE:", pd.Series(y_train_resampled).value_counts())

# "Low Risk": 0,
# "Medium Risk": 1,
# "High Risk": 2


In [None]:
print(X_train_resampled.dtypes)
print(X_test.dtypes)


### **Standardize the Data**

In [None]:
# scale test set
scaler = StandardScaler()
X_train_resampled = scaler.fit_transform(X_train_resampled)
X_test = scaler.transform(X_test)

## **Baseline Model: Random Forest Classifier**

In [None]:
rf = RandomForestClassifier(random_state=42)

# Train the model
rf.fit(X_train_resampled, y_train_resampled)

# Make predictions on the test set
y_pred_rf = rf.predict(X_test)

# Evaluate the model
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy:", "{:.4f}".format(accuracy_rf))

print("Classification Report:\n", classification_report(y_test, y_pred_rf))

In [None]:
# plot confusion matrix
y_true = y_test
cm = confusion_matrix(y_true, y_pred_rf, labels=[0, 1, 2])

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=["Low Risk", "Medium Risk", "High Risk"],
            yticklabels=["Low Risk", "Medium Risk", "High Risk"], cbar=False)
plt.title('Confusion Matrix: Random Forest')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

## **Model 2: Logistic Regression**



In [None]:
# Initialize the model
lr = LogisticRegression(max_iter=1000, random_state=42)

# Train the model
lr.fit(X_train_resampled, y_train_resampled)

# Make predictions
y_pred_lr = lr.predict(X_test)

# Evaluate the model
accuracy_lr = accuracy_score(y_test, y_pred_lr)
print("Logistic Regression Accuracy:", "{:.4f}".format(accuracy_lr))

print(classification_report(y_test, y_pred_lr))

In [None]:
y_true = y_test
cm = confusion_matrix(y_true, y_pred_lr, labels=[0, 1, 2])

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=["Low Risk", "Medium Risk", "High Risk"],
            yticklabels=["Low Risk", "Medium Risk", "High Risk"], cbar=False)
plt.title('Confusion Matrix: Logistic Regression')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

## **Model 3: XGBoost Classifier**

In [None]:
xgb_model = xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='mlogloss')

# Train the model on resampled data
xgb_model.fit(X_train_resampled, y_train_resampled)

# Make predictions
y_pred_xgb = xgb_model.predict(X_test)

# Evaluate the model
accuracy_xgb = accuracy_score(y_test, y_pred_xgb)
print("XGBoost Accuracy:", "{:.4f}".format(accuracy_xgb))
print("Classification Report:\n", classification_report(y_test, y_pred_xgb))


In [None]:
y_true = y_test
cm = confusion_matrix(y_true, y_pred_xgb, labels=[0, 1, 2])

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=["Low Risk", "Medium Risk", "High Risk"],
            yticklabels=["Low Risk", "Medium Risk", "High Risk"], cbar=False)
plt.title('Confusion Matrix: XGBoost')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

## **Model 4: Gradient Boosting Classifier**

In [None]:
gb = GradientBoostingClassifier(random_state=42)

# Train the model on resampled data
gb.fit(X_train_resampled, y_train_resampled)

# Make predictions
y_pred_gb = gb.predict(X_test)

# Evaluate the model
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print("Gradient Boosting Accuracy:", "{:.4f}".format(accuracy_gb))
print("Classification Report:\n", classification_report(y_test, y_pred_gb))

In [None]:
y_true = y_test
cm = confusion_matrix(y_true, y_pred_gb, labels=[0, 1, 2])

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=["Low Risk", "Medium Risk", "High Risk"],
            yticklabels=["Low Risk", "Medium Risk", "High Risk"], cbar=False)
plt.title('Confusion Matrix: Gradient Boosting')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

## **Model Performance**

All models are performing well on the majority class, but they differ in how they handle the minority classes.

- The classes are:

  - Low Risk: 1
  - Medium Risk: 2
  - High Risk: 3

**1. Random Forest Classifier**

- The model has a high overall accuracy (99.39%) and excellent performance across all classes.
-  It has high precision, recall, and F1-scores for each class, especially for the underrepresented class (2: Medium Risk).
- This suggests that the Random Forest Classifier handles imbalanced classes better after SMOTE (Synthetic Minority Oversampling Technique), which balanced the data before training.

**2. Logistic Regression**

- Despite the high accuracy (93.86%), the model's performance for class 2 is notably lower than Random Forest.
- Precision for class 2 is low, meaning it has difficulty distinguishing this minority class from the others.
- However, recall for class 2 is high, indicating that Logistic Regression can find most instances of this class, though it misclassifies some other classes as class 2. This could be due to the linear nature of the model that makes it harder for it to separate minority classes accurately even after applying SMOTE.

**3. XGBoost Classifier**
- This model has a 99.46% accuracy coupled with high precision, recall and F1-scores.
- It performs well in predicting low-risk and high-risk categories but less strongly on medium-risk prediction with slightly lower precision (91%), recall (91%) and F1-scores (91%).

**4. Gradient Boosting Classifier**
- At 99.26% accuracy, the weakness in predicting medium-risk class is still prevalent.
- Precision, recall and F1-scores for medium-risk class are 85%, 95% and 90% respectively.

Overall, **XGBoost** performed the best at 99.46% accuracy. It performs highly in precision, recall, and F1-scores for all classes, especially the low-risk (0) and high-risk (2) classes, where precision and recall are close to 1.00.
- When it comes to the most affected class, Class 1 (Medium Risk), XGBoost performs significantly better than Random Forest and Logistic Regression in terms of precision (0.91) and recall (0.91).

## **Model Tuning**

### **Logistic Regression**
- Since this model had the lowest accuracy score (93.86%), it may benefit from tuning to increase its performance.

In [None]:
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['liblinear', 'lbfgs'],
    'max_iter': [100, 200, 300]
}

lr_tuned = LogisticRegression()

# Set up GridSearchCV
grid_search_lr = GridSearchCV(estimator=lr_tuned,
                              param_grid=param_grid,
                              scoring='accuracy',
                              cv=3, verbose=1,
                              n_jobs=-1
                              )

# Fit model
grid_search_lr.fit(X_train_resampled, y_train_resampled)

# Get best parameters
best_params_lr = grid_search_lr.best_params_
print("Best Logistic Regression parameters found: ", best_params_lr)

# Train Logistic Regression model with best parameters
best_lr_model = grid_search_lr.best_estimator_

In [None]:
# Predict on the test set and get accuracy
y_pred = best_lr_model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

print("Classification Report for Tuned Logistic Regression:\n", classification_report(y_test, y_pred))


- The Logistic Regression's test accuracy did not change after tuning.

In [None]:
cv_scores = cross_val_score(best_lr_model, X_train_resampled, y_train_resampled, cv=5, scoring='accuracy')

# Print cross-validation results
print(f"Cross-validation accuracy: {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")

In [None]:
# Train the model on the full training data
best_lr_model.fit(X_train_resampled, y_train_resampled)

# Predict on both training and test sets
train_pred = best_lr_model.predict(X_train_resampled)
test_pred = best_lr_model.predict(X_test)

# Calculate accuracy for both sets
train_accuracy = accuracy_score(y_train_resampled, train_pred)
test_accuracy = accuracy_score(y_test, test_pred)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

- A Mean Cross-Validation Accuracy of 98.45% suggests that the model performs consistently well when tested on different subsets of the training data.
- A low standard deviation (0.0012) indicates that the model's performance is quite stable across the folds. This means that the accuracy is very consistent, with very little variation from one fold to another.
- Overfitting: the higher train accuracy (95.27%) compared to the test accuracy (93.86%) indicates that the model memorised the data rather than generalizing it well (slight overfitting).

### **Adjusting Class Weights**
- Give more importance to class 1.

In [None]:
best_lr_model_2 = LogisticRegression(class_weight={0: 1, 1: 15, 2: 1},
                                   max_iter=100,
                                   solver='lbfgs')

# Train on the original data with adjusted class weights
best_lr_model_2.fit(X_train_resampled, y_train_resampled)


In [None]:
# Predict on both training and test sets
train_pred = best_lr_model_2.predict(X_train_resampled)
test_pred = best_lr_model_2.predict(X_test)

# Calculate accuracy for both sets
train_accuracy = accuracy_score(y_train_resampled, train_pred)
test_accuracy = accuracy_score(y_test, test_pred)

print(f"Training Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")

print("\nClassification Report for Training Set:\n")
print(classification_report(y_train_resampled, train_pred))

print("\nClassification Report for Test Set:\n")
print(classification_report(y_test, test_pred))

- The initial issue of class imbalance still persists even after adjusting the minority class weight. Based on the above report:
  - **Class imbalance:** The model is struggling to predict class 1 on the test set despite good performance on the training set. This is because the class (Medium Risk) is underrepresented.
  - **Overfitting:** Recall for class 1 is high on the training set but very low on the test set.

# **Feature Importance**

In [None]:
# prompt: feature importance

# Feature Importance using Random Forest
feature_importances_rf = rf.feature_importances_
feature_names = X.columns

# Create a DataFrame for feature importance
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances_rf})
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)

# Display the feature importance
print("Feature Importance (Random Forest):")
print(feature_importance_df)

# Visualize feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importance from Random Forest')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

# Feature Importance using XGBoost
feature_importances_xgb = xgb_model.feature_importances_

# Create a DataFrame for feature importance
feature_importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': feature_importances_xgb})
feature_importance_df = feature_importance_df.sort_values('Importance', ascending=False)

# Display the feature importance
print("Feature Importance (XGBoost):")
print(feature_importance_df)

# Visualize feature importance
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importance from XGBoost')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

# **Exporting the Model**
- For a start, the final Logistic Regression Model (best_lr_model_2) will be presented, pending further tuning and optimization.

In [None]:
import joblib

# Save the trained model
joblib.dump(best_lr_model_2, "model/logistic_regression_model.pkl")


['/content/drive/MyDrive/Projects/AWB-Lynette/Data/Predictions/logistic_regression_model.pkl']