# Midterm Project – Classification Analysis (Diabetes)

**Author:** Beth Spornitz  
**Date:** November 3, 2025

### Introduction
This project predicts the likelihood of diabetes using the Pima Indians Diabetes dataset. The target is `Outcome` (1 = diabetes, 0 = no diabetes). We follow the same framework as Project 3 (Decision Tree, SVM, NN), adapted to this dataset. We load, inspect, clean, engineer features (as needed), train models, evaluate them with standard classification metrics, visualize confusion matrices and decision boundaries, and reflect after each section.


In [None]:
# Core
import pandas as pd
import numpy as np

# Viz
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report, confusion_matrix

# Display settings
pd.set_option("display.max_columns", 100)


## Section 1. Import and Inspect the Data

**Goal:** Load the dataset and confirm structure, types, and basic stats.

We’ll load from a local `data/diabetes.csv` (as required by the midterm), display the first 10 rows, check missing values, and show summary statistics.


In [None]:
# Load Diabetes dataset
# Place diabetes.csv at: ./data/diabetes.csv
# A public mirror URL would also work, but for the midterm keep a local copy in ./data.
DATA_PATH = "data/diabetes.csv"

df = pd.read_csv(DATA_PATH)

# Standard inspect
print("Shape:", df.shape)
display(df.head(10))
df.info()  # info() prints its summary and returns None, so it should not be wrapped in display()
display(df.describe(include="all").T)


**Reflection 1:**  
What do you notice about the dataset? Are there any data issues?
- Notes:
  - Confirm column names and types.
  - Look for suspicious zeros in physiological fields (e.g., BloodPressure = 0).
  - Consider whether any columns need imputation.
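
One way to make the "suspicious zeros" check concrete is to tabulate the count and share of zeros per column. The sketch below uses a small made-up frame (`toy`) rather than the real file, and the `suspect_cols` list is an assumption about which columns cannot legitimately be zero.

```python
import pandas as pd

# Toy frame standing in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    "Glucose":       [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 0],
    "BMI":           [33.6, 26.6, 23.3, 0.0],
    "Outcome":       [1, 0, 1, 0],
})

# Assumed list of columns where a zero cannot be a real measurement.
# Outcome is excluded because 0 is a valid class label there.
suspect_cols = ["Glucose", "BloodPressure", "BMI"]

is_zero = toy[suspect_cols] == 0
zero_summary = pd.DataFrame({
    "zero_count": is_zero.sum(),   # number of zeros per column
    "zero_share": is_zero.mean(),  # fraction of rows that are zero
})
print(zero_summary)
```

Running the same two lines against `df` would show which columns are candidates for imputation in Section 2.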


## Section 2. Data Exploration and Preparation

**Goal:** Explore distributions, check class balance, handle missing/invalid values, and prepare features for modeling.

We'll:
- Plot histograms/boxplots to see distributions and outliers.
- Check class balance of the target (`Outcome`).
- Handle invalid zero entries in certain medical measurements (common in this dataset), imputing with medians as needed.


In [None]:
# List columns
print("Columns:", df.columns.tolist())

# Histograms for numeric columns
df.hist(figsize=(12, 10), bins=30)
plt.tight_layout()
plt.show()

# Boxplots for quick outlier scan
plt.figure(figsize=(12, 8))
sns.boxplot(data=df, orient="h")
plt.title("Boxplots of Features")
plt.tight_layout()
plt.show()

# Target balance
print("Target balance (Outcome):")
display(df['Outcome'].value_counts())
df['Outcome'].value_counts(normalize=True).plot(kind='bar', title='Class Distribution (Outcome)')
plt.show()


In [None]:
# In the Pima dataset, zeros in these columns are biologically implausible and represent missing values.
cols_with_invalid_zero = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Count invalid zeros before
zero_counts_before = (df[cols_with_invalid_zero] == 0).sum()
print("Invalid zero counts BEFORE:", zero_counts_before.to_dict())

# Replace zeros with NaN for listed columns
df[cols_with_invalid_zero] = df[cols_with_invalid_zero].replace(0, np.nan)

# Impute NaN with the per-column median (computed on the full dataset, i.e., before the train/test split)
for c in cols_with_invalid_zero:
    df[c] = df[c].fillna(df[c].median())

# Verify after
zero_counts_after = (df[cols_with_invalid_zero] == 0).sum()
print("Invalid zero counts AFTER (should be 0 now):", zero_counts_after.to_dict())

# Confirm no NaNs remain (prints the total NaN count, which should be 0)
print("Remaining NaN count:", df.isna().sum().sum())


**Reflection 2:**  
What patterns or anomalies did you see? Which preprocessing steps were necessary?
- Notes:
  - Comment on distributions (e.g., Glucose/BMI).
  - Note class balance for `Outcome`.
  - Document the zero→NaN→median imputation choice.
  - Mention any additional prep you considered (scaling, transforms) and why you did/didn’t apply them.
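
When documenting the zero→NaN→median choice, it may be worth noting that medians computed on the full dataset let a little test-set information reach the training data. A possible alternative, sketched here on made-up values, is to split first and fit scikit-learn's `SimpleImputer` on the training rows only. The `toy` frame and the 70/30 split are assumptions for illustration, not part of the notebook's workflow.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up values; zeros play the role of missing measurements.
toy = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137, 116, 0, 115, 197, 125],
    "BMI":     [33.6, 26.6, 23.3, 0.0, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})
features = ["Glucose", "BMI"]

X = toy[features].replace(0, np.nan)  # mark implausible zeros as missing
y = toy["Outcome"]

# Split BEFORE imputing so the test rows never influence the medians.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123
)

imputer = SimpleImputer(strategy="median")
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=features, index=X_train.index
)
# The test set reuses the medians learned from the training set.
X_test_imp = pd.DataFrame(
    imputer.transform(X_test), columns=features, index=X_test.index
)

print("Training medians:", dict(zip(features, imputer.statistics_)))
```

On a dataset of this size the difference from full-data medians is likely small, but the split-first order keeps the test metrics honest.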


## Section 3. Feature Selection and Justification

We’ll follow the same “three cases” pattern as Project 3 to keep comparability:

- **Case 1:** Single feature → `BMI`  
- **Case 2:** Single feature → `Glucose`  
- **Case 3:** Two features → `Glucose` + `BMI`  

**Target:** `Outcome` (0 = no diabetes, 1 = diabetes)
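
A quick, informal way to justify a feature choice is to compare each feature's per-class median and its correlation with the target. The sketch below runs on synthetic data that is deliberately constructed so that `Glucose` separates the classes more strongly than `BMI`; the real dataset may behave differently, and the numbers here say nothing about it.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration, not the Pima dataset).
rng = np.random.default_rng(0)
n = 300
outcome = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "Glucose": 100 + 40 * outcome + rng.normal(0, 15, size=n),  # large class shift
    "BMI":     30 + 1 * outcome + rng.normal(0, 6, size=n),     # small class shift
    "Outcome": outcome,
})
features = ["Glucose", "BMI"]

# Median of each feature within each class.
group_medians = toy.groupby("Outcome")[features].median()

# Pearson correlation of each feature with the 0/1 target.
corr_with_target = toy[features].corrwith(toy["Outcome"])

print(group_medians)
print(corr_with_target.sort_values(ascending=False))
```

Replacing `toy` with `df` gives the same two summaries for the actual features.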


In [None]:
# Case 1: Feature = BMI
X1 = df[['BMI']].dropna()
y1 = df.loc[X1.index, 'Outcome']

# Case 2: Feature = Glucose
X2 = df[['Glucose']].dropna()
y2 = df.loc[X2.index, 'Outcome']

# Case 3: Features = Glucose + BMI
X3 = df[['Glucose', 'BMI']].dropna()
y3 = df.loc[X3.index, 'Outcome']

# Sanity checks
print("Case 1 shape:", X1.shape, y1.shape)
print("Case 2 shape:", X2.shape, y2.shape)
print("Case 3 shape:", X3.shape, y3.shape)


**Reflection 3:**  
Why did you choose these features? How might `Glucose` and `BMI` impact predictions or accuracy? Are there others you would add in a future iteration?


## Section 4. Train a Model (Decision Tree)

We’ll mirror the Project 3 workflow:
- Split each case with **StratifiedShuffleSplit** (80/20) to preserve class proportions.
- Train a **Decision Tree** for each case.
- Evaluate with **classification_report** on train and test splits.
- Plot **confusion matrices** (heatmaps).
- Plot the **decision tree** for each case.
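
The stratified 80/20 split described above can be wrapped in a small helper so that the same logic serves all three cases. This is only a sketch: `stratified_split` is a hypothetical name, and the demo data (`X_demo`, `y_demo`, with a made-up 40/60 class mix) is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

def stratified_split(X, y, test_size=0.2, random_state=123):
    """Return X_train, X_test, y_train, y_test with class proportions preserved."""
    splitter = StratifiedShuffleSplit(
        n_splits=1, test_size=test_size, random_state=random_state
    )
    # n_splits=1, so the generator yields exactly one (train, test) index pair.
    train_idx, test_idx = next(splitter.split(X, y))
    return X.iloc[train_idx], X.iloc[test_idx], y.iloc[train_idx], y.iloc[test_idx]

# Synthetic demo data: 100 rows, 40 positives and 60 negatives.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"BMI": rng.normal(32, 7, size=100)})
y_demo = pd.Series([1] * 40 + [0] * 60, name="Outcome")

X_tr, X_te, y_tr, y_te = stratified_split(X_demo, y_demo)
print("Positive share | all:", y_demo.mean(), "| train:", y_tr.mean(), "| test:", y_te.mean())
```

The printed shares should be close to one another, which is the property stratification is meant to guarantee.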


In [None]:
# Case 1: BMI
splitter1 = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=123)
for train_idx1, test_idx1 in splitter1.split(X1, y1):
    X1_train = X1.iloc[train_idx1]
    X1_test  = X1.iloc[test_idx1]
    y1_train = y1.iloc[train_idx1]
    y1_test  = y1.iloc[test_idx1]
print('Case 1 - BMI | Train:', len(X1_train), '| Test:', len(X1_test))

# Case 2: Glucose
splitter2 = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=123)
for train_idx2, test_idx2 in splitter2.split(X2, y2):
    X2_train = X2.iloc[train_idx2]
    X2_test  = X2.iloc[test_idx2]
    y2_train = y2.iloc[train_idx2]
    y2_test  = y2.iloc[test_idx2]
print('Case 2 - Glucose | Train:', len(X2_train), '| Test:', len(X2_test))

# Case 3: Glucose + BMI
splitter3 = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=123)
for train_idx3, test_idx3 in splitter3.split(X3, y3):
    X3_train = X3.iloc[train_idx3]
    X3_test  = X3.iloc[test_idx3]
    y3_train = y3.iloc[train_idx3]
    y3_test  = y3.iloc[test_idx3]
print('Case 3 - Glucose + BMI | Train:', len(X3_train), '| Test:', len(X3_test))


In [None]:
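# Decision Trees (same style as Project 3)
# random_state is fixed so that tie-breaking between equally good splits is reproducible
tree_model1 = DecisionTreeClassifier(random_state=123)
tree_model1.fit(X1_train, y1_train)

tree_model2 = DecisionTreeClassifier(random_state=123)
tree_model2.fit(X2_train, y2_train)

tree_model3 = DecisionTreeClassifier(random_state=123)
tree_model3.fit(X3_train, y3_train)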


In [None]:
# Case 1
print("Decision Tree — Case 1 (BMI) — TRAIN")
print(classification_report(y1_train, tree_model1.predict(X1_train)))
print("Decision Tree — Case 1 (BMI) — TEST")
y1_test_pred = tree_model1.predict(X1_test)
print(classification_report(y1_test, y1_test_pred))

# Case 2
print("Decision Tree — Case 2 (Glucose) — TRAIN")
print(classification_report(y2_train, tree_model2.predict(X2_train)))
print("Decision Tree — Case 2 (Glucose) — TEST")
y2_test_pred = tree_model2.predict(X2_test)
print(classification_report(y2_test, y2_test_pred))

# Case 3
print("Decision Tree — Case 3 (Glucose + BMI) — TRAIN")
print(classification_report(y3_train, tree_model3.predict(X3_train)))
print("Decision Tree — Case 3 (Glucose + BMI) — TEST")
y3_test_pred = tree_model3.predict(X3_test)
print(classification_report(y3_test, y3_test_pred))


In [None]:
# Case 1
cm1 = confusion_matrix(y1_test, y1_test_pred)
sns.heatmap(cm1, annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix — Case 1: BMI')
plt.xlabel('Predicted'); plt.ylabel('Actual')
plt.show()

# Case 2
cm2 = confusion_matrix(y2_test, y2_test_pred)
sns.heatmap(cm2, annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix — Case 2: Glucose')
plt.xlabel('Predicted'); plt.ylabel('Actual')
plt.show()

# Case 3
cm3 = confusion_matrix(y3_test, y3_test_pred)
sns.heatmap(cm3, annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix — Case 3: Glucose + BMI')
plt.xlabel('Predicted'); plt.ylabel('Actual')
plt.show()


In [None]:
# Case 1
fig = plt.figure(figsize=(12, 6))
plot_tree(tree_model1, feature_names=X1.columns, class_names=['No Diabetes','Diabetes'], filled=True)
plt.title("Decision Tree — Case 1: BMI")
fig.savefig("tree_case1_bmi.png")  # save before show() so the file is written regardless of backend
plt.show()

# Case 2
fig = plt.figure(figsize=(12, 6))
plot_tree(tree_model2, feature_names=X2.columns, class_names=['No Diabetes','Diabetes'], filled=True)
plt.title("Decision Tree — Case 2: Glucose")
fig.savefig("tree_case2_glucose.png")  # save before show() so the file is written regardless of backend
plt.show()

# Case 3
fig = plt.figure(figsize=(16, 8))
plot_tree(tree_model3, feature_names=X3.columns, class_names=['No Diabetes','Diabetes'], filled=True)
plt.title("Decision Tree — Case 3: Glucose + BMI")
fig.savefig("tree_case3_glucose_bmi.png")  # save before show() so the file is written regardless of backend
plt.show()


**Reflection 4:**  
How well did the Decision Trees perform across the three cases? Any overfitting signs (big train vs test gap)? Which inputs worked better and why?
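
To make the train-versus-test gap concrete, one option is to fit an unrestricted tree and a depth-limited tree on the same split and compare accuracies. The sketch below uses synthetic, noisy data, so the specific numbers are illustrative only. The `max_depth=3` setting is an arbitrary choice, not a tuned value.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on feature 0 plus noise, so it cannot be learned perfectly.
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 2))
y = ((X[:, 0] + rng.normal(scale=1.0, size=n)) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=123
)

results = {}
for label, depth in [("unrestricted", None), ("max_depth=3", 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    test_acc = tree.score(X_test, y_test)
    results[label] = {"train": train_acc, "test": test_acc, "gap": train_acc - test_acc}

for label, r in results.items():
    print(f"{label:>13} | train={r['train']:.3f} test={r['test']:.3f} gap={r['gap']:.3f}")
```

An unrestricted tree memorizes the training rows, so a large gap for it and a smaller gap for the shallow tree would be the typical sign of overfitting.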


## Section 5. Improve the Model or Try Alternates (Implement a Second Option)

We will mirror Project 3 and try:
- **Support Vector Classifier (SVC)** for all three cases
- **Neural Network (MLPClassifier)** for **Case 3** (two inputs)  
We’ll evaluate with the same metrics and visualize support vectors (like Project 3).
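
Because the RBF kernel is distance-based, features on different scales (such as `Glucose` and `BMI`) can contribute unevenly. A common option, not used in the cells below, is to put a `StandardScaler` in front of the classifier with a pipeline. This sketch uses synthetic data and default hyperparameters, so it shows the mechanics rather than any result for this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-feature data on deliberately different scales (illustration only).
rng = np.random.default_rng(0)
n = 200
y = pd.Series(rng.integers(0, 2, size=n), name="Outcome")
X = pd.DataFrame({
    "Glucose": 105 + 35 * y + rng.normal(0, 20, size=n),
    "BMI":     30 + 4 * y + rng.normal(0, 6, size=n),
})

# The scaler is fit inside the pipeline, so it only ever sees the data passed to fit().
scaled_svc = make_pipeline(StandardScaler(), SVC())
scaled_svc.fit(X, y)

scaler = scaled_svc.named_steps["standardscaler"]
print("Learned means:", dict(zip(X.columns, scaler.mean_)))
print("Training accuracy:", scaled_svc.score(X, y))
```

In the notebook's setting the pipeline would be fit on `X3_train` and scored on `X3_test`, which keeps the scaling statistics free of test data.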


In [None]:
# SVC — default RBF kernel, consistent with Project 3 examples

# Case 1: BMI
svc_model1 = SVC()
svc_model1.fit(X1_train, y1_train)
y1_svc_pred = svc_model1.predict(X1_test)
print("SVC — Case 1 (BMI) — TEST")
print(classification_report(y1_test, y1_svc_pred))

# Case 2: Glucose
svc_model2 = SVC()
svc_model2.fit(X2_train, y2_train)
y2_svc_pred = svc_model2.predict(X2_test)
print("SVC — Case 2 (Glucose) — TEST")
print(classification_report(y2_test, y2_svc_pred))

# Case 3: Glucose + BMI
svc_model3 = SVC()
svc_model3.fit(X3_train, y3_train)
y3_svc_pred = svc_model3.predict(X3_test)
print("SVC — Case 3 (Glucose + BMI) — TEST")
print(classification_report(y3_test, y3_svc_pred))


In [None]:
# Visualize support vectors for Case 1 (1D BMI). Support vectors are drawn at y = 0.5, the same trick used in Project 3.

# Create groups based on Outcome
diab_bmi = X1_test.loc[y1_test == 1, 'BMI']
nod_bmi  = X1_test.loc[y1_test == 0, 'BMI']

plt.figure(figsize=(8, 6))
plt.scatter(diab_bmi, y1_test.loc[y1_test == 1], c='yellow', marker='s', label='Diabetes (1)')
plt.scatter(nod_bmi,  y1_test.loc[y1_test == 0], c='cyan',   marker='^', label='No Diabetes (0)')

if hasattr(svc_model1, 'support_vectors_'):
    support_x = svc_model1.support_vectors_[:, 0]
    plt.scatter(support_x, [0.5] * len(support_x), c='black', marker='+', s=100, label='Support Vectors')

plt.xlabel('BMI'); plt.ylabel('Outcome (0/1)')
plt.title('Support Vectors — SVC (Case 1: BMI)')
plt.legend(); plt.grid(True)
plt.show()


In [None]:
# Visualize support vectors for Case 3 (Glucose, BMI)

diab = X3_test[y3_test == 1]
nod  = X3_test[y3_test == 0]

plt.figure(figsize=(10, 7))
plt.scatter(diab['Glucose'], diab['BMI'], c='yellow', marker='s', label='Diabetes (1)')
plt.scatter(nod['Glucose'],  nod['BMI'],  c='cyan',   marker='^', label='No Diabetes (0)')

if hasattr(svc_model3, 'support_vectors_'):
    sv = svc_model3.support_vectors_
    plt.scatter(sv[:, 0], sv[:, 1], c='black', marker='+', s=100, label='Support Vectors')

plt.xlabel('Glucose'); plt.ylabel('BMI')
plt.title('Support Vectors — SVC (Case 3: Glucose + BMI)')
plt.legend(); plt.grid(True)
plt.show()


In [None]:
# Neural Network (MLP) — Case 3 (two inputs), consistent with Project 3 style
nn_model3 = MLPClassifier(
    hidden_layer_sizes=(50, 25, 10),
    solver='lbfgs',
    max_iter=1000,
    random_state=42
)
nn_model3.fit(X3_train, y3_train)

y3_nn_pred = nn_model3.predict(X3_test)
print("Neural Network — Case 3 (Glucose + BMI) — TEST")
print(classification_report(y3_test, y3_nn_pred))

# Confusion matrix
cm_nn3 = confusion_matrix(y3_test, y3_nn_pred)
sns.heatmap(cm_nn3, annot=True, cmap='Blues', fmt='g')
plt.title('Confusion Matrix — Neural Network (Case 3)')
plt.xlabel('Predicted'); plt.ylabel('Actual')
plt.show()


In [None]:
# Decision surface for NN on Case 3 (2D)
from matplotlib.colors import ListedColormap

padding = 1
x_min, x_max = X3['Glucose'].min() - padding, X3['Glucose'].max() + padding
y_min, y_max = X3['BMI'].min() - padding, X3['BMI'].max() + padding

xx, yy = np.meshgrid(np.linspace(x_min, x_max, 500),
                     np.linspace(y_min, y_max, 500))

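# Wrap the grid in a DataFrame so its column names match those seen during fitting
# (avoids scikit-learn's "X does not have valid feature names" warning)
grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=X3.columns)
Z = nn_model3.predict(grid).reshape(xx.shape)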

plt.figure(figsize=(10, 7))
cmap_background = ListedColormap(['lightblue', 'lightyellow'])
plt.contourf(xx, yy, Z, cmap=cmap_background, alpha=0.7)

# Overlay test points
plt.scatter(X3_test['Glucose'][y3_test == 0], X3_test['BMI'][y3_test == 0],
            c='blue', marker='^', edgecolor='k', label='No Diabetes (0)')
plt.scatter(X3_test['Glucose'][y3_test == 1], X3_test['BMI'][y3_test == 1],
            c='gold', marker='s', edgecolor='k', label='Diabetes (1)')

plt.xlabel('Glucose'); plt.ylabel('BMI')
plt.title('Neural Network Decision Surface — Case 3 (Glucose + BMI)')
plt.legend(); plt.grid(True)
plt.show()


**Reflection 5:**  
Compare Decision Tree vs SVC vs NN. Which performed best on **test** data? Any surprises? Why might one model be better for this dataset?
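
For the comparison, it can help to collect the test metrics of each model into one table rather than reading separate reports. The sketch below uses hypothetical labels and predictions (`y_true`, `Model A`, `Model B`), and `summarize` is an illustrative helper name. In the notebook the inputs would be `y3_test` and the corresponding `*_pred` arrays.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def summarize(name, y_true, y_pred):
    """Collect metrics for the positive class (1 = diabetes) in one row."""
    return {
        "model": name,
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }

# Hypothetical labels and predictions, made up for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "Model A": [1, 0, 1, 0, 0, 0, 1, 1],
    "Model B": [1, 0, 0, 0, 0, 0, 1, 0],
}

summary = pd.DataFrame(
    [summarize(name, y_true, y_pred) for name, y_pred in predictions.items()]
).set_index("model")
print(summary.round(3))
```

In this toy case both models reach the same accuracy but differ in recall, which is the kind of difference that matters when missed diabetes cases are costly.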


## Section 6. Final Thoughts & Insights

- **6.1 Summary of Findings:** (include a small table of key metrics by model/case)
- **6.2 Challenges:** (data quality, zeros→NaN, class balance, convergence, etc.)
- **6.3 Next Steps:** (try more features, scaling for SVC/NN, logistic regression, Random Forest, hyperparameter tuning)


**Reflection 6:**  
What did you learn from this project? What would you try next with more time?
