In [None]:
from google.colab import files
uploaded = files.upload()

Saving diabetes.csv to diabetes.csv


Mohammad Mahdi Razmjoo - 400101272


# 1) Importing and Preparing the Dataset

We begin by loading the Pima Indians Diabetes dataset, available from [Kaggle](https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database) or a similar platform. An initial exploratory pass checks the structure of the data and looks for missing entries. We then separate the input features from the target variable (`Outcome`), perform a stratified 80-20 train-test split, and standardize the features with a scaler fitted on the training set only. Scaling matters most for distance- and margin-based algorithms such as KNN and SVM.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, f1_score
import warnings

warnings.filterwarnings("ignore")

dataframe = pd.read_csv("diabetes.csv")

print("Sample records from the dataset:")
print(dataframe.head(), "\n")

print(f"Data dimensions: {dataframe.shape}\n")
print("General information:")
dataframe.info()  # info() prints its summary directly and returns None
print()

print("Missing values per column:")
print(dataframe.isnull().sum(), "\n")

print("Descriptive statistics:")
print(dataframe.describe(), "\n")

features = dataframe.drop("Outcome", axis=1)
labels = dataframe["Outcome"]

train_x, test_x, train_y, test_y = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=42
)

scaler = StandardScaler()
train_x_scaled = scaler.fit_transform(train_x)
test_x_scaled = scaler.transform(test_x)

print(f"Training set size: {train_x.shape}")
print(f"Test set size: {test_x.shape}")

Sample records from the dataset:
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35        0  33.6   
1            1       85             66             29        0  26.6   
2            8      183             64              0        0  23.3   
3            1       89             66             23       94  28.1   
4            0      137             40             35      168  43.1   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   21        0  
4                     2.288   33        1   

Data dimensions: (768, 9)

General information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   P
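
The sample rows above already show zeros in `SkinThickness` and `Insulin`. In this dataset a zero in a physiological measurement is commonly read as a placeholder for a missing value, so `isnull()` alone can report no missing entries while gaps remain. The sketch below is an illustration, not part of the original pipeline: it re-types the five printed rows (without `DiabetesPedigreeFunction`, `Age`, and `Outcome`), counts zeros in the columns where a zero is assumed to be implausible (the `zero_as_missing` list is an assumption), and fills them with the column median.

```python
import numpy as np
import pandas as pd

# The five sample rows printed above, re-typed so the sketch is self-contained.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Assumption: a zero in these columns stands for "not measured".
zero_as_missing = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count the suspicious zeros per column.
zero_counts = (sample[zero_as_missing] == 0).sum()
print(zero_counts)

# Mark the placeholders as NaN, then fill each column with its median.
cleaned = sample.copy()
cleaned[zero_as_missing] = cleaned[zero_as_missing].replace(0, np.nan)
cleaned[zero_as_missing] = cleaned[zero_as_missing].fillna(
    cleaned[zero_as_missing].median()
)
print(cleaned)
```

In a full pipeline the medians would be computed on the training split only and then applied to the test split, in the same way the scaler is fitted.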

# 2) Applying Logistic Regression for Predictive Modeling

We now fit a Logistic Regression model to the training data, predict on the test set, and evaluate with the F1-score of the positive class (`Outcome = 1`), which is what `f1_score` reports by default for binary labels. The target specified for this model is an F1-score above 0.75.

In [None]:
from sklearn.linear_model import LogisticRegression

model_lr = LogisticRegression(random_state=42)
model_lr.fit(train_x_scaled, train_y)

predictions_lr = model_lr.predict(test_x_scaled)
score_f1 = f1_score(test_y, predictions_lr)

print("Logistic Regression - Test F1 Score:", score_f1)
print("\nClassification Report for Logistic Regression:")
print(classification_report(test_y, predictions_lr))

Logistic Regression - Test F1 Score: 0.56

Classification Report for Logistic Regression:
              precision    recall  f1-score   support

           0       0.76      0.82      0.79       100
           1       0.61      0.52      0.56        54

    accuracy                           0.71       154
   macro avg       0.68      0.67      0.67       154
weighted avg       0.71      0.71      0.71       154
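
To make the reported F1-score concrete, the arithmetic can be reproduced by hand. The counts below are not printed by the notebook; they are inferred from the rounded report (100 negatives, 54 positives) and are the integers consistent with every figure in it, so treat them as an assumption.

```python
# Confusion-matrix counts inferred from the rounded report above (assumption).
tp, fp, fn, tn = 28, 18, 26, 82

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

The model misses 26 of the 54 diabetic cases, and that low recall is what pulls the positive-class F1 well below the 0.71 accuracy.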



# 3) Linear SVM for Classification

Next, we train a Support Vector Machine with a linear kernel using scikit-learn’s `SVC` and evaluate it on the test data. The target is an F1-score above 0.80.


In [None]:
from sklearn.svm import SVC

model_svm = SVC(kernel="linear", random_state=42)
model_svm.fit(train_x_scaled, train_y)

pred_svm = model_svm.predict(test_x_scaled)
f1_linear_svm = f1_score(test_y, pred_svm)

print("Linear SVM - Test F1 Score:", f1_linear_svm)
print("\nClassification Report for Linear SVM:")
print(classification_report(test_y, pred_svm))

Linear SVM - Test F1 Score: 0.5656565656565656

Classification Report for Linear SVM:
              precision    recall  f1-score   support

           0       0.76      0.83      0.79       100
           1       0.62      0.52      0.57        54

    accuracy                           0.72       154
   macro avg       0.69      0.67      0.68       154
weighted avg       0.71      0.72      0.71       154
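
The single number printed by `f1_score` and the averages in the report measure different things, which matters when comparing against a threshold. The sketch below rebuilds label vectors from confusion counts inferred from the rounded Linear SVM report (an assumption, the notebook does not print them) and evaluates the three common `average` settings.

```python
import numpy as np
from sklearn.metrics import f1_score

# Counts inferred from the rounded Linear SVM report (assumption).
tn, fp, fn, tp = 83, 17, 26, 28

# Rebuild label vectors that produce exactly this confusion matrix.
y_true = np.array([0] * (tn + fp) + [1] * (fn + tp))
y_pred = np.array([0] * tn + [1] * fp + [0] * fn + [1] * tp)

f1_binary = f1_score(y_true, y_pred)                         # positive class only (default)
f1_macro = f1_score(y_true, y_pred, average="macro")         # unweighted mean over both classes
f1_weighted = f1_score(y_true, y_pred, average="weighted")   # mean weighted by class support

print(f"binary={f1_binary:.4f} macro={f1_macro:.2f} weighted={f1_weighted:.2f}")
```

The binary score is the strictest of the three here because it ignores the well-classified majority class.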



# 4) SVM with RBF Kernel

Next, we use a non-linear Support Vector Machine with an RBF kernel to model potentially complex relationships. A grid search tunes the `C` and `gamma` parameters, with the goal of an F1-score above 0.80.


In [None]:
from sklearn.model_selection import GridSearchCV

param_space = {
    'C': [0.1, 1, 10],
    'gamma': ['scale', 'auto', 0.01, 0.1, 1]
}

rbf_svm_model = SVC(kernel='rbf', random_state=42)
grid_search = GridSearchCV(rbf_svm_model, param_space, cv=5, scoring='f1', n_jobs=-1)
grid_search.fit(train_x_scaled, train_y)

optimal_svm_rbf = grid_search.best_estimator_
pred_svm_rbf = optimal_svm_rbf.predict(test_x_scaled)
f1_rbf_svm = f1_score(test_y, pred_svm_rbf)

print("Best RBF-SVM parameters:", grid_search.best_params_)
print("Kernel SVM (RBF) - Test F1 Score:", f1_rbf_svm)
print("\nClassification Report for Kernel SVM (RBF):")
print(classification_report(test_y, pred_svm_rbf))

Best RBF-SVM parameters: {'C': 10, 'gamma': 0.01}
Kernel SVM (RBF) - Test F1 Score: 0.6041666666666666

Classification Report for Kernel SVM (RBF):
              precision    recall  f1-score   support

           0       0.78      0.87      0.82       100
           1       0.69      0.54      0.60        54

    accuracy                           0.75       154
   macro avg       0.73      0.70      0.71       154
weighted avg       0.75      0.75      0.74       154
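
One detail of the grid is worth checking: scikit-learn defines `gamma='auto'` as `1 / n_features` and `gamma='scale'` as `1 / (n_features * X.var())`. On standardized inputs the overall variance is 1, so the two settings coincide. The sketch below verifies this on random stand-in data with 8 features and 614 rows, matching the shape of the training split; the data itself is synthetic and purely an assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the training split (assumption).
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(614, 8))
X_scaled = StandardScaler().fit_transform(X)

n_features = X_scaled.shape[1]
gamma_auto = 1.0 / n_features                       # definition of gamma='auto'
gamma_scale = 1.0 / (n_features * X_scaled.var())   # definition of gamma='scale'

print(gamma_auto, gamma_scale)
```

Both evaluate to about 0.125, so the grid effectively searches four distinct `gamma` values, and a wider numeric range could replace one of the two string options.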



# 5) K-Nearest Neighbors (KNN)

## 5.1) Optimizing the Number of Neighbors (k)

The performance of KNN depends on the choice of `k` (the number of neighbors). We will conduct a grid search over different values of `k` and weight strategies to maximize the F1-score, with a target of surpassing 0.80.

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_param_space = {
    'n_neighbors': [3, 5, 7, 9, 11, 13, 15],
    'weights': ['uniform', 'distance']
}

knn_model = KNeighborsClassifier()
knn_grid_search = GridSearchCV(knn_model, knn_param_space, cv=5, scoring='f1', n_jobs=-1)
knn_grid_search.fit(train_x_scaled, train_y)

best_knn_model = knn_grid_search.best_estimator_
knn_predictions = best_knn_model.predict(test_x_scaled)
f1_knn_score = f1_score(test_y, knn_predictions)

print("Best KNN parameters:", knn_grid_search.best_params_)
print("KNN - Test F1 Score:", f1_knn_score)
print("\nClassification Report for KNN:")
print(classification_report(test_y, knn_predictions))

Best KNN parameters: {'n_neighbors': 7, 'weights': 'uniform'}
KNN - Test F1 Score: 0.6138613861386139

Classification Report for KNN:
              precision    recall  f1-score   support

           0       0.79      0.84      0.81       100
           1       0.66      0.57      0.61        54

    accuracy                           0.75       154
   macro avg       0.72      0.71      0.71       154
weighted avg       0.74      0.75      0.74       154
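
The two `weights` options in the grid can change a prediction even for the same `k`. The toy example below (hypothetical one-dimensional points, not taken from the dataset) places one positive neighbor close to the query and two negative neighbors farther away: a uniform vote follows the majority, while inverse-distance weighting favors the closer point.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one nearby positive, two more distant negatives.
X_toy = np.array([[0.0], [2.0], [2.2]])
y_toy = np.array([1, 0, 0])
query = np.array([[0.5]])

uniform_knn = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_toy, y_toy)
distance_knn = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_toy, y_toy)

uniform_pred = uniform_knn.predict(query)[0]    # majority vote: two 0s against one 1
distance_pred = distance_knn.predict(query)[0]  # 1/0.5 outweighs 1/1.5 + 1/1.7

print("uniform:", uniform_pred, "distance:", distance_pred)
```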



# 6) Decision Trees

## 6.1) Optimizing Maximum Depth

Decision Trees are prone to overfitting. To limit this, we tune `max_depth`, `min_samples_split`, and `min_samples_leaf`, which act as regularizers. The target is an F1-score above 0.80.


In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_param_grid = {
    'max_depth': [2, 3, 4, 5, 6, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 5]
}

dt_model = DecisionTreeClassifier(random_state=42)
dt_grid_search = GridSearchCV(dt_model, dt_param_grid, cv=5, scoring='f1', n_jobs=-1)
dt_grid_search.fit(train_x_scaled, train_y)

best_dt_model = dt_grid_search.best_estimator_
dt_predictions = best_dt_model.predict(test_x_scaled)
f1_dt_score = f1_score(test_y, dt_predictions)

print("Best Decision Tree parameters:", dt_grid_search.best_params_)
print("Decision Tree - Test F1 Score:", f1_dt_score)
print("\nClassification Report for Decision Tree:")
print(classification_report(test_y, dt_predictions))

Best Decision Tree parameters: {'max_depth': 4, 'min_samples_leaf': 1, 'min_samples_split': 2}
Decision Tree - Test F1 Score: 0.693069306930693

Classification Report for Decision Tree:
              precision    recall  f1-score   support

           0       0.82      0.88      0.85       100
           1       0.74      0.65      0.69        54

    accuracy                           0.80       154
   macro avg       0.78      0.76      0.77       154
weighted avg       0.80      0.80      0.80       154



## 6.2) Regularization Techniques for Decision Trees

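To reduce the risk of overfitting, decision trees can be regularized with the following strategies. The grid search in 6.1 tunes the first two; `ccp_alpha` is described here but not tuned in this notebook: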

1. **Limiting Maximum Depth (`max_depth`):**  
   By restricting the tree's depth, we reduce its complexity and prevent it from overfitting to noise in the data.

2. **Minimum Samples for Splitting and Leaf Nodes (`min_samples_split` and `min_samples_leaf`):**  
   `min_samples_split` is the minimum number of samples a node needs before it may be split, and `min_samples_leaf` is the minimum number of samples every leaf must contain. Both prevent splits that are based on very few data points.

3. **Cost Complexity Pruning (`ccp_alpha`):**  
   This technique prunes the tree after training by penalizing its complexity. Increasing the value of `ccp_alpha` leads to a simpler tree.
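
As a minimal sketch of the third strategy, the snippet below computes the pruning path of a fully grown tree and refits at a few `ccp_alpha` values along it. It runs on synthetic data from `make_classification` (an assumption, so the numbers do not describe the diabetes data) and only shows how the leaf count shrinks as `ccp_alpha` grows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 8 features, some label noise.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)

# Grow an unpruned tree and ask for the alphas at which subtrees collapse.
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full_tree.cost_complexity_pruning_path(X, y)
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)  # guard against tiny negative values

# Refit at a handful of alphas, smallest to largest, and record the tree size.
picked = ccp_alphas[:: max(1, len(ccp_alphas) // 5)]
leaf_counts = [
    DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X, y).get_n_leaves()
    for alpha in picked
]

for alpha, leaves in zip(picked, leaf_counts):
    print(f"ccp_alpha={alpha:.5f} -> {leaves} leaves")
```

To tune it alongside the other parameters, `ccp_alpha` could be added as another key in a grid such as `dt_param_grid`.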

# 7) Random Forest

Random Forest is an ensemble learning technique that aggregates multiple decision trees. We will optimize key parameters, such as the number of estimators and maximum depth, through grid search, aiming for an F1-score greater than 0.85.


In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_param_space = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, None],
    'min_samples_leaf': [1, 2, 4]
}

rf_model = RandomForestClassifier(random_state=42)
rf_grid_search = GridSearchCV(rf_model, rf_param_space, cv=5, scoring='f1', n_jobs=-1)
rf_grid_search.fit(train_x_scaled, train_y)

best_rf_model = rf_grid_search.best_estimator_
rf_predictions = best_rf_model.predict(test_x_scaled)
f1_rf_score = f1_score(test_y, rf_predictions)

print("Best Random Forest parameters:", rf_grid_search.best_params_)
print("Random Forest - Test F1 Score:", f1_rf_score)
print("\nClassification Report for Random Forest:")
print(classification_report(test_y, rf_predictions))

Best Random Forest parameters: {'max_depth': None, 'min_samples_leaf': 2, 'n_estimators': 100}
Random Forest - Test F1 Score: 0.6

Classification Report for Random Forest:
              precision    recall  f1-score   support

           0       0.78      0.84      0.81       100
           1       0.65      0.56      0.60        54

    accuracy                           0.74       154
   macro avg       0.71      0.70      0.70       154
weighted avg       0.73      0.74      0.73       154
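
To illustrate what "aggregating" means in scikit-learn's implementation, the sketch below checks that a forest's predicted probabilities equal the mean of its individual trees' probabilities. It uses synthetic data and a small forest (both assumptions chosen for speed), not the tuned model above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=25, min_samples_leaf=2, random_state=0
).fit(X, y)

# Collect each tree's class probabilities for the first five rows ...
per_tree = np.stack([tree.predict_proba(X[:5]) for tree in forest.estimators_])
# ... and average them across trees.
averaged = per_tree.mean(axis=0)

forest_proba = forest.predict_proba(X[:5])
print(np.allclose(averaged, forest_proba))
```

Because the output is an average, adding more trees mainly reduces variance; it does not by itself fix a low recall on the positive class.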



In [None]:
!pip install xgboost



# 8) Bonus: Targeting an F1-Score Above 0.90

For the bonus challenge, we aim to surpass an F1-score of 0.90 using XGBoost, a gradient-boosted tree ensemble whose performance depends strongly on hyperparameter tuning. If class imbalance is present, techniques such as oversampling may help the model on the minority class.


In [None]:
import xgboost as xgb

xgb_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [2, 3, 4, 5],
    'learning_rate': [0.01, 0.1, 0.3],
    'subsample': [0.8, 1.0]
}

xgb_model = xgb.XGBClassifier(random_state=42, use_label_encoder=False, eval_metric='logloss')
xgb_grid_search = GridSearchCV(xgb_model, xgb_param_grid, cv=5, scoring='f1', n_jobs=-1)
xgb_grid_search.fit(train_x_scaled, train_y)

best_xgb_model = xgb_grid_search.best_estimator_
xgb_predictions = best_xgb_model.predict(test_x_scaled)
f1_xgb_score = f1_score(test_y, xgb_predictions)

print("Best XGBoost parameters:", xgb_grid_search.best_params_)
print("XGBoost - Test F1 Score:", f1_xgb_score)
print("\nClassification Report for XGBoost:")
print(classification_report(test_y, xgb_predictions))

Best XGBoost parameters: {'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 100, 'subsample': 0.8}
XGBoost - Test F1 Score: 0.6415094339622641

Classification Report for XGBoost:
              precision    recall  f1-score   support

           0       0.80      0.82      0.81       100
           1       0.65      0.63      0.64        54

    accuracy                           0.75       154
   macro avg       0.73      0.72      0.73       154
weighted avg       0.75      0.75      0.75       154
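
The imbalance mentioned above can be quantified from the reports: the test set holds 100 negatives and 54 positives, and the stratified split keeps roughly the same proportions in the training set. XGBoost's documentation suggests `sum(negative) / sum(positive)` as a starting value for its `scale_pos_weight` parameter. The sketch below only computes that ratio; passing it to the classifier is a suggestion, not something the notebook does, and `extra_xgb_params` is a hypothetical name.

```python
# Class support taken from the classification reports above.
n_negative, n_positive = 100, 54

positive_share = n_positive / (n_negative + n_positive)
imbalance_ratio = n_negative / n_positive

# Hypothetical extra keyword argument for an XGBoost classifier (not used in the notebook).
extra_xgb_params = {"scale_pos_weight": round(imbalance_ratio, 2)}

print(f"positive share: {positive_share:.2f}")
print(extra_xgb_params)
```

A ratio below 2 indicates only moderate imbalance, so reweighting or oversampling can be expected to shift the precision-recall balance more than to lift F1 substantially.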



# Optional: Applying Oversampling to Enhance Performance

If the F1-score remains below 0.90, we can implement oversampling techniques, such as Random OverSampling, to address class imbalance and potentially improve model performance.


In [None]:
from imblearn.over_sampling import RandomOverSampler

oversampler = RandomOverSampler(random_state=42)
X_train_oversampled, y_train_oversampled = oversampler.fit_resample(train_x_scaled, train_y)

xgb_grid_search.fit(X_train_oversampled, y_train_oversampled)
best_xgb_oversampled = xgb_grid_search.best_estimator_
xgb_oversampled_predictions = best_xgb_oversampled.predict(test_x_scaled)
f1_xgb_oversampled_score = f1_score(test_y, xgb_oversampled_predictions)

print("XGBoost with Random Oversampling - Test F1 Score:", f1_xgb_oversampled_score)
print("\nClassification Report for XGBoost with Oversampling:")
print(classification_report(test_y, xgb_oversampled_predictions))

XGBoost with Random Oversampling - Test F1 Score: 0.6306306306306306

Classification Report for XGBoost with Oversampling:
              precision    recall  f1-score   support

           0       0.80      0.78      0.79       100
           1       0.61      0.65      0.63        54

    accuracy                           0.73       154
   macro avg       0.71      0.71      0.71       154
weighted avg       0.74      0.73      0.74       154
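
One caveat with the cell above: the training set is oversampled before `GridSearchCV` splits it, so copies of the same row can land in both the training and validation part of a fold, which tends to inflate the cross-validated scores. The sketch below shows the alternative of resampling inside each fold. It is an illustration under stated assumptions: synthetic data, class 1 as the minority, a hypothetical `oversample_minority` helper, and scikit-learn's `GradientBoostingClassifier` standing in for XGBoost so the example needs no extra packages.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Synthetic, moderately imbalanced stand-in data (assumption).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.65, 0.35], random_state=0
)

def oversample_minority(X_part, y_part, seed):
    """Randomly duplicate class-1 rows until both classes have equal size."""
    minority = np.flatnonzero(y_part == 1)
    majority = np.flatnonzero(y_part == 0)
    extra = resample(
        minority, replace=True,
        n_samples=len(majority) - len(minority), random_state=seed,
    )
    keep = np.concatenate([majority, minority, extra])
    return X_part[keep], y_part[keep]

fold_scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y)):
    # Oversample the training part only; the validation part stays untouched.
    X_bal, y_bal = oversample_minority(X[train_idx], y[train_idx], seed=fold)
    model = GradientBoostingClassifier(random_state=0).fit(X_bal, y_bal)
    fold_scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))

print("per-fold F1:", np.round(fold_scores, 3))
```

With imbalanced-learn, placing the sampler and the classifier in an `imblearn.pipeline.Pipeline` and passing that pipeline to `GridSearchCV` should achieve the same effect, since samplers in that pipeline are applied during fitting only.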



# 9) Summary of All Models' F1-Scores

Below, we summarize the F1-scores of all the models to verify if they meet the required performance thresholds:

- Logistic Regression: F1-score should be > 0.75  
- Linear SVM: F1-score should be > 0.80  
- Kernel SVM: F1-score should be > 0.80  
- KNN: F1-score should be > 0.80  
- Decision Tree: F1-score should be > 0.80  
- Random Forest: F1-score should be > 0.85  
- Bonus: XGBoost (with or without oversampling): F1-score > 0.90 (if achieved)

Note: None of the models reaches its target on this test split. The highest positive-class F1-score is 0.69 (Decision Tree), so all of the thresholds above remain unmet.


In [None]:
print("Logistic Regression (F1):", score_f1)
print("Linear SVM (F1):", f1_linear_svm)
print("Kernel SVM (F1):", f1_rbf_svm)
print("KNN (F1):", f1_knn_score)
print("Decision Tree (F1):", f1_dt_score)
print("Random Forest (F1):", f1_rf_score)
print("XGBoost (F1):", f1_xgb_score, "(Bonus attempt)")
print("XGBoost with Oversampling (F1):", f1_xgb_oversampled_score, "(Optional bonus attempt)")

Logistic Regression (F1): 0.56
Linear SVM (F1): 0.5656565656565656
Kernel SVM (F1): 0.6041666666666666
KNN (F1): 0.6138613861386139
Decision Tree (F1): 0.693069306930693
Random Forest (F1): 0.6
XGBoost (F1): 0.6415094339622641 (Bonus attempt)
XGBoost with Oversampling (F1): 0.6306306306306306 (Optional bonus attempt)
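
The comparison against the thresholds can also be made explicit in code. The sketch below hard-codes the scores printed above (rounded to four decimals) and the targets from the list, then reports the shortfall per model; the dictionary names are illustrative.

```python
# Test F1-scores copied from the output above, rounded to four decimals.
test_f1 = {
    "Logistic Regression": 0.5600,
    "Linear SVM": 0.5657,
    "Kernel SVM": 0.6042,
    "KNN": 0.6139,
    "Decision Tree": 0.6931,
    "Random Forest": 0.6000,
    "XGBoost": 0.6415,
    "XGBoost with Oversampling": 0.6306,
}

# Targets from the summary list.
targets = {
    "Logistic Regression": 0.75,
    "Linear SVM": 0.80,
    "Kernel SVM": 0.80,
    "KNN": 0.80,
    "Decision Tree": 0.80,
    "Random Forest": 0.85,
    "XGBoost": 0.90,
    "XGBoost with Oversampling": 0.90,
}

met = {name: test_f1[name] > targets[name] for name in test_f1}
shortfall = {name: round(targets[name] - test_f1[name], 4) for name in test_f1}
best_model = max(test_f1, key=test_f1.get)

for name in test_f1:
    status = "met" if met[name] else f"short by {shortfall[name]:.4f}"
    print(f"{name:<27} F1={test_f1[name]:.4f} target>{targets[name]:.2f} {status}")
print("Best model:", best_model)
```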


# Conclusion

In this notebook, we presented a complete binary classification pipeline applied to the Pima Indians Diabetes Dataset:

- **Data loading, exploratory data analysis (EDA), and preprocessing** with scaling.
- **Implementation and tuning** of models including **Logistic Regression, Linear SVM, Kernel SVM, KNN, Decision Trees, and Random Forests**.
- Discussion of **regularization methods for Decision Trees**.
- A **bonus challenge** with XGBoost (both with and without oversampling), aimed at achieving an F1-score above 0.90 on the test set.

Each section explains the rationale behind its steps and reports the resulting test-set metrics.