# Project 2: Diabetes Prediction Model

### Project Overview

This notebook predicts diabetes from patient diagnostic data. The goal is to build, evaluate, and compare at least two different machine learning models to see which performs best. As per the project requirements, one of these models **must be a Support Vector Machine (SVM)**. We also build a **Random Forest** model, first on the original training data and then on a SMOTE-balanced version, to compare against.

* **Objective**: Compare model performance and select the best one for a final prediction function.
* **Deadline**: Friday, 12 September.

In [1]:
# Core libraries for data manipulation
import pandas as pd
import numpy as np

# Libraries for data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE

# Set visualization style
sns.set(style="whitegrid")

# Load the dataset
df = pd.read_csv('diabetes.csv')

## Phase 1: Exploratory Data Analysis (EDA)

In this phase, our goal is to understand the dataset. We'll check for missing values, analyze the distribution of each feature, and look at the relationships between them.

### 1.1 Initial Data Inspection

Let's start with the basics: `.head()`, `.info()`, and `.describe()` to get a quick overview of the data's structure, types, and statistical summary.

In [2]:
# Display the first 5 rows
print("--- First 5 Rows ---")
print(df.head())

# Display data types and non-null counts
print("\n--- Data Info ---")
df.info()

# Display statistical summary
print("\n--- Descriptive Statistics ---")
print(df.describe())

--- First 5 Rows ---
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1  

--- Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    in

### 1.2 Analysis of Initial Inspection

From the initial inspection, we observe two key things:
1.  **No Null Values**: The `.info()` output shows that all columns have 768 non-null entries.
2.  **"Hidden" Zeros**: The `.describe()` output shows that several columns (`Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, `BMI`) have a minimum value of `0`. A zero is not physiologically plausible for these measurements, so these values most likely stand in for missing data. We will need to clean them.
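
To see how widespread the "hidden" zeros are, one option is to count them per column. The sketch below uses a small hypothetical sample (`sample` is an assumption made up for illustration, not the real dataset); on the actual data the same two lines would be run against `df`.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the dataset
sample = pd.DataFrame({
    "Glucose": [148, 85, 0, 89],
    "BloodPressure": [72, 66, 64, 0],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin": [0, 0, 0, 94],
    "BMI": [33.6, 26.6, 23.3, 0.0],
})

cols_with_hidden_zeros = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count zeros per column and express them as a share of all rows
zero_counts = (sample[cols_with_hidden_zeros] == 0).sum()
zero_share = (zero_counts / len(sample)).round(2)

print(pd.DataFrame({"zeros": zero_counts, "share": zero_share}))
```

Columns with a large share of zeros are the ones where the choice of fill value will affect the models most.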

## Phase 2: Data Preprocessing

This phase prepares the data for our models in four steps: clean the data, split it, cap outliers, and scale it. As noted in the project brief, we must **split the data before scaling** so that the scaler's statistics come from the training set only and no information leaks from the test set.

In [4]:
# --- 1. Clean Data by Replacing Zeros with Median ---
cols_to_clean = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
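for col in cols_to_clean:
    # Assign the result back; chained `inplace=True` on a column is deprecated in pandas
    # Note: the median is taken over the full column, zeros included
    df[col] = df[col].replace(0, df[col].median())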

# --- 2. Separate Features (X) and Target (y) ---
X = df.drop('Outcome', axis=1)
y = df['Outcome']

# --- 3. Split Data into Training and Testing Sets ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- 4. Handle Outliers by Capping ---
X_train = X_train.copy()
X_test = X_test.copy()
for col in X_train.columns:
    Q1 = X_train[col].quantile(0.25)
    Q3 = X_train[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Assign the whole column so integer columns can become float after clipping
    X_train[col] = X_train[col].clip(lower=lower_bound, upper=upper_bound)
    X_test[col] = X_test[col].clip(lower=lower_bound, upper=upper_bound)

# --- 5. Standardize the Features ---
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("Data preprocessing complete. We have X_train_scaled, X_test_scaled, y_train, and y_test ready for modeling.")

Data preprocessing complete. We have X_train_scaled, X_test_scaled, y_train, and y_test ready for modeling.
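
One detail in the cell above: the zeros are replaced before the split, using medians computed over the full dataset (zeros included), so the test rows influence the fill values. A leak-free variant could mark the zeros as missing, split, and then let the imputer and the scaler learn their statistics from the training rows only. The sketch below is a minimal illustration on hypothetical toy data (`toy`, `target`, and `prep` are made-up names), not the notebook's actual pipeline, and it would change the reported scores if adopted.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the real features and target
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Glucose": rng.integers(70, 200, size=40).astype(float),
    "BMI": rng.uniform(18, 45, size=40),
})
toy.loc[[3, 17, 25], "Glucose"] = 0.0  # zeros standing in for missing readings
target = pd.Series([0, 1] * 20)

# Mark zeros as missing instead of overwriting them before the split
toy[["Glucose", "BMI"]] = toy[["Glucose", "BMI"]].replace(0, np.nan)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, target, test_size=0.25, random_state=42
)

# Median imputation and scaling are both fitted on the training rows only
prep = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
X_tr_prepared = prep.fit_transform(X_tr)
X_te_prepared = prep.transform(X_te)

print(X_tr_prepared.shape, X_te_prepared.shape)
```

Because `SimpleImputer` ignores `NaN` when computing the median, the fill value also stops being pulled down by the placeholder zeros.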



## Phase 3: Model Building & Comparison

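Now we build and evaluate our models. We are required to build an SVM and at least one other model. We train an SVM and a Random Forest with default settings, then retrain the Random Forest on SMOTE-balanced data, which gives three variants to compare.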

### 3.1 Model 1: Support Vector Machine (SVM) - Required

In [5]:
# Initialize, train, and evaluate the SVM
svm_model = SVC(random_state=42)
svm_model.fit(X_train_scaled, y_train)
y_pred_svm = svm_model.predict(X_test_scaled)

print("--- Support Vector Machine (Default) ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred_svm):.4f}")
print(classification_report(y_test, y_pred_svm, target_names=['Non-Diabetic', 'Diabetic']))

--- Support Vector Machine (Default) ---
Accuracy: 0.7338
              precision    recall  f1-score   support

Non-Diabetic       0.78      0.81      0.80        99
    Diabetic       0.63      0.60      0.62        55

    accuracy                           0.73       154
   macro avg       0.71      0.70      0.71       154
weighted avg       0.73      0.73      0.73       154



### 3.2 Model 2: Random Forest

In [6]:
# Initialize, train, and evaluate the Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train_scaled, y_train)
y_pred_rf = rf_model.predict(X_test_scaled)

print("--- Random Forest (Default) ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(classification_report(y_test, y_pred_rf, target_names=['Non-Diabetic', 'Diabetic']))

--- Random Forest (Default) ---
Accuracy: 0.7597
              precision    recall  f1-score   support

Non-Diabetic       0.82      0.80      0.81        99
    Diabetic       0.66      0.69      0.67        55

    accuracy                           0.76       154
   macro avg       0.74      0.74      0.74       154
weighted avg       0.76      0.76      0.76       154



### Model Optimization: Addressing Class Imbalance with SMOTE

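Both models are reasonable, but they perform better on the non-diabetic class (recall 0.81 and 0.80) than on the diabetic class (recall 0.60 and 0.69). One likely reason is class imbalance: the test set alone contains 99 non-diabetic patients against 55 diabetic ones. Let's use **SMOTE** to create synthetic samples for the minority (diabetic) class, balance the training set, and check whether recall on the diabetic class improves. SMOTE is applied to the training data only; the test set is left untouched.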

In [7]:
# Apply SMOTE to the training data
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

# Train our best model (Random Forest) on the new, balanced data
rf_smote_model = RandomForestClassifier(random_state=42)
rf_smote_model.fit(X_train_resampled, y_train_resampled)
y_pred_smote = rf_smote_model.predict(X_test_scaled)

print("--- Random Forest with SMOTE ---")
print(f"Accuracy: {accuracy_score(y_test, y_pred_smote):.4f}")
print(classification_report(y_test, y_pred_smote, target_names=['Non-Diabetic', 'Diabetic']))

--- Random Forest with SMOTE ---
Accuracy: 0.7662
              precision    recall  f1-score   support

Non-Diabetic       0.85      0.77      0.81        99
    Diabetic       0.65      0.76      0.70        55

    accuracy                           0.77       154
   macro avg       0.75      0.77      0.75       154
weighted avg       0.78      0.77      0.77       154
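
To make the idea behind SMOTE more concrete: it creates each synthetic minority sample by picking a minority point and interpolating towards one of its minority-class neighbours. The sketch below is a simplified NumPy illustration of that interpolation step, not imbalanced-learn's implementation. It always uses the single nearest neighbour, whereas `SMOTE` picks among several neighbours; the data and the helper `interpolate_minority` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical 2-D training set: 8 majority points and 3 minority points
X_major = rng.normal(loc=0.0, scale=1.0, size=(8, 2))
X_minor = rng.normal(loc=3.0, scale=0.5, size=(3, 2))


def interpolate_minority(X_min, n_new, rng):
    """Create n_new points on segments between minority samples and their nearest minority neighbour."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Distances from the chosen point to every minority point
        dists = np.linalg.norm(X_min - X_min[i], axis=1)
        dists[i] = np.inf  # exclude the point itself
        j = np.argmin(dists)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)


# Generate just enough synthetic points to match the majority class size
n_needed = len(X_major) - len(X_minor)
X_new = interpolate_minority(X_minor, n_needed, rng)
X_minor_balanced = np.vstack([X_minor, X_new])

print("before:", len(X_major), "vs", len(X_minor))
print("after: ", len(X_major), "vs", len(X_minor_balanced))
```

Since every synthetic point lies between two real minority samples, the method adds no information from outside the minority class; it only makes that region of feature space denser.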



### Final Model Comparison

Let's summarize the performance. While accuracy is important, **recall** for the diabetic class is arguably more critical in a medical context, as it measures the share of truly diabetic patients that the model correctly identifies.

| Model | Test Accuracy | Recall (Diabetic) |
| :--- | :--- | :--- |
| SVM (Default) | 73.38% | 0.60 |
| Random Forest (Default) | 75.97% | 0.69 |
| **Random Forest w/ SMOTE** | **76.62%** | **0.76** |

**Conclusion**: The **Random Forest model trained with SMOTE is our champion**. It has the highest test accuracy (76.62%) and, more importantly, the highest recall for the diabetic class (0.76, versus 0.69 for the default Random Forest and 0.60 for the SVM), making it the most useful choice for this problem. These figures come from a single test split of 154 patients, so the small accuracy gap in particular should be read with caution.
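
Recall for the diabetic class is the number of true positives divided by all actual positives, TP / (TP + FN). As a minimal sketch of how it relates to the confusion matrix, the example below uses hypothetical labels (`y_true` and `y_pred` are made up for illustration, not the notebook's predictions).

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = diabetic, 0 = non-diabetic
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For binary labels 0/1, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall_manual = tp / (tp + fn)
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)

print(f"TP={tp}, FN={fn}, FP={fp}, TN={tn}")
print(f"recall={recall_manual:.2f}, accuracy={accuracy_manual:.2f}")

# The manual value matches scikit-learn's recall_score for the positive class
assert abs(recall_manual - recall_score(y_true, y_pred)) < 1e-12
```

A false negative here is a diabetic patient the model misses, which is why recall can outweigh raw accuracy when choosing between models.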

## Phase 4: Launch Your Prediction Engine!

The final step is to create a function that uses our best model to make predictions on new data.

In [8]:
# The final model is the Random Forest trained on SMOTE data
final_model = rf_smote_model

# The scaler was fitted on the original training data
# We must use this same scaler for new predictions
final_scaler = scaler

def predict_diabetes(pregnancies, glucose, bp, skin_thickness, insulin, bmi, dpf, age):
    """
    Takes patient data as input and predicts diabetes using our final trained model.
    """
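    # Build a one-row DataFrame with the training column names, in the same order,
    # so the scaler receives the feature names it was fitted with
    patient_data = pd.DataFrame(
        [[pregnancies, glucose, bp, skin_thickness, insulin, bmi, dpf, age]],
        columns=X.columns,
    )

    # Scale the input with the scaler fitted on the training data
    # (the IQR capping from preprocessing is not re-applied here)
    patient_data_scaled = final_scaler.transform(patient_data)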
    
    # Make a prediction
    prediction = final_model.predict(patient_data_scaled)
    
    # Return the human-readable result
    if prediction[0] == 1:
        return "Patient is likely Diabetic."
    else:
        return "Patient is likely Non-Diabetic."

# --- Example Usage ---
print("--- Prediction Engine Example ---")
# Example 1: Data from a diabetic patient
prediction_1 = predict_diabetes(pregnancies=6, glucose=148, bp=72, skin_thickness=35, insulin=125.0, bmi=33.6, dpf=0.627, age=50)
print(f"Prediction for Patient 1: {prediction_1}")

# Example 2: Data from a non-diabetic patient
prediction_2 = predict_diabetes(pregnancies=1, glucose=85, bp=66, skin_thickness=29, insulin=125.0, bmi=26.6, dpf=0.351, age=31)
print(f"Prediction for Patient 2: {prediction_2}")

--- Prediction Engine Example ---
Prediction for Patient 1: Patient is likely Diabetic.
Prediction for Patient 2: Patient is likely Non-Diabetic.


