# Question 7 – Decision Tree and Random Forest on Pima Diabetes
Dataset: `pima-diabetes.xlsx`.

In [3]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

def load_pima():
    for p in [
        'pima-diabetes.xlsx', 'pima_diabetes.xlsx',
        '/mnt/data/pima-diabetes.xlsx', '/mnt/data/pima_diabetes.xlsx'
    ]:
        if os.path.exists(p):
            path = p
            break
    else:
        raise FileNotFoundError('pima-diabetes.xlsx not found')

    df = pd.read_excel(path)

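    # Normalize the Outcome column name (case- and whitespace-insensitive)
    outcome_cols = [c for c in df.columns if str(c).strip().lower() == 'outcome']
    if not outcome_cols:
        raise KeyError("no 'Outcome' column found in " + path)
    df = df.rename(columns={outcome_cols[0]: 'Outcome'})

    # Map string labels to 0/1; any label outside the mapping becomes NaN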
    df['Outcome'] = (
        df['Outcome']
        .astype(str)
        .str.strip()
        .map({'Non-Diabetic': 0, 'Diabetic': 1})
    )

    return df

pima = load_pima()
print("Outcome value counts (after mapping):")
print(pima['Outcome'].value_counts(dropna=False))


Outcome value counts (after mapping):
Outcome
0    500
1    268
Name: count, dtype: int64


## 7(a) Question – Tunable Parameters and Their Effects
> When using decision tree and random forest to predict diabetes, what parameters can be tuned for
> each model? What are the effects of increasing or decreasing these parameters?

In [4]:
# For modeling: drop rows with missing Outcome only
model_df = pima.dropna(subset=['Outcome']).copy()

X = model_df.drop(columns=['Outcome'])
y = model_df['Outcome']

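# Column-wise median imputation for features (no listwise deletion).
# Note: the medians here are computed on all rows before the split; computing
# them on the training rows only would avoid leaking test-set information.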
X_imputed = X.copy()
for col in X_imputed.columns:
    med = X_imputed[col].median(skipna=True)
    X_imputed[col] = X_imputed[col].fillna(med)

X_train, X_test, y_train, y_test = train_test_split(
    X_imputed, y,
    test_size=0.3, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# PCA features
pca = PCA(n_components=3)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

### 7(a) Explanation – Hyperparameters

**Decision Tree:**
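- `max_depth`: maximum depth of the tree. Increasing it lets the tree fit finer patterns (lower bias, higher variance, greater risk of overfitting); decreasing it gives a simpler tree that may underfit.
- `min_samples_split`: minimum number of samples a node must contain before it can be split. Larger → fewer splits (simpler tree); smaller → deeper, more specific trees.
- `min_samples_leaf`: minimum number of samples in each leaf. Larger → smoother decision boundaries; smaller → leaves that can fit individual training points.
- `criterion`: impurity measure used to choose splits, `gini` or `entropy`. The choice usually has only a small effect on accuracy.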

**Random Forest:**
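- `n_estimators` ($k$ in slides): number of trees. Larger $k$ reduces variance but increases computation, and the gain levels off once $k$ is large; too small a $k$ gives unstable predictions.
- `max_depth`, `min_samples_split`, `min_samples_leaf`: control the complexity of each tree, with the same effects as for a single decision tree.
- `max_features` ($d$): number of features considered at each split. Smaller $d$ increases randomness and diversity
  among trees, often improving generalization (slides suggest $d \approx \sqrt{m}$ for classification, where $m$ is the
  total number of features); larger $d$ makes individual trees stronger but more correlated with each other.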

Tuning these parameters balances **bias vs variance** and controls overfitting.
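
One common way to explore these parameters is a cross-validated grid search. The sketch below is illustrative only: it uses a synthetic dataset from `make_classification` as a stand-in for the Pima data (an assumption, so the printed scores say nothing about the real dataset), and the grids `tree_grid` and `forest_grid` are arbitrary example values, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption): 768 rows, 8 features, roughly 65/35 classes.
# In the notebook, X_train / y_train would be used instead.
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=1,
    weights=[0.65, 0.35], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Example grids (hypothetical values chosen for illustration)
tree_grid = {
    "max_depth": [2, 3, 5, 8, None],
    "min_samples_leaf": [1, 5, 20],
    "criterion": ["gini", "entropy"],
}
forest_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", None],
    "min_samples_leaf": [1, 5],
}

searches = {
    "tree": GridSearchCV(
        DecisionTreeClassifier(random_state=42), tree_grid,
        cv=5, scoring="accuracy",
    ),
    "forest": GridSearchCV(
        RandomForestClassifier(random_state=42), forest_grid,
        cv=5, scoring="accuracy",
    ),
}

for name, search in searches.items():
    search.fit(X_tr, y_tr)  # 5-fold CV on the training split only
    print(name, search.best_params_)
    print("  mean CV accuracy:", round(search.best_score_, 3))
    print("  held-out accuracy:", round(search.score(X_te, y_te), 3))
```

Selecting parameters by cross-validation on the training split keeps the test split untouched, so the held-out accuracy remains an honest estimate.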

## 7(b) Question – Overfitting and Detection
> What is overfitting? How do you check whether overfitting happens or not?

### 7(b) Explanation – Overfitting

Overfitting occurs when a model learns **noise** and specific details in the training data that do not generalize.
Symptoms:
- **Very high training accuracy**.
- **Lower test accuracy** (large train–test performance gap).

To detect overfitting:
1. Split the data into training and test sets.
2. Train the model on the training set only.
3. Evaluate accuracy (or other metrics) on both sets.
4. If training accuracy is much higher than test accuracy, overfitting is likely.

Decision trees with high depth are prone to overfitting; random forests reduce this by averaging many trees built on
bootstrap samples and random feature subsets (bagging & random subspace).
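
The detection steps above can be sketched as a sweep over `max_depth`, printing the train and test accuracy at each setting. This is a minimal illustration on synthetic data (an assumption; `make_classification` with some label noise stands in for the Pima data), and the variable names are made up for the example. The forest's out-of-bag score is included as a second check that needs no separate test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in (assumption); flip_y adds label noise a deep tree can memorize
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=1,
    flip_y=0.1, weights=[0.65, 0.35], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Steps 2-4: fit on the training set, score on both sets, compare
results = {}
for depth in [1, 3, 5, 10, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    train_acc = tree.score(X_tr, y_tr)
    test_acc = tree.score(X_te, y_te)
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.3f} "
          f"test={test_acc:.3f} gap={train_acc - test_acc:.3f}")

# Random forest: the out-of-bag (OOB) score evaluates each tree on the
# training rows left out of its bootstrap sample
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_tr, y_tr)
print(f"forest: train={forest.score(X_tr, y_tr):.3f} "
      f"oob={forest.oob_score_:.3f} test={forest.score(X_te, y_te):.3f}")
```

A gap that widens as `max_depth` grows is the pattern described in step 4.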

## 7(c) Question – Predictions with Full Features vs PCA Features
> Make a prediction of Outcome using:
> 1. All original features.
> 2. Features obtained from PCA analysis.
> Compare the results and state your conclusion.

In [5]:
# 7(c) – Train decision tree and random forest with full features and PCA features

# Decision tree with original features
tree_full = DecisionTreeClassifier(max_depth=5, random_state=42)
tree_full.fit(X_train, y_train)
y_pred_tree_full_train = tree_full.predict(X_train)
y_pred_tree_full_test = tree_full.predict(X_test)
acc_tree_full_train = accuracy_score(y_train, y_pred_tree_full_train)
acc_tree_full_test = accuracy_score(y_test, y_pred_tree_full_test)
print("Decision Tree (full) train acc:", acc_tree_full_train)
print("Decision Tree (full) test acc:", acc_tree_full_test)

# Random forest with original features
rf_full = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=42)
rf_full.fit(X_train, y_train)
y_pred_rf_full_train = rf_full.predict(X_train)
y_pred_rf_full_test = rf_full.predict(X_test)
acc_rf_full_train = accuracy_score(y_train, y_pred_rf_full_train)
acc_rf_full_test = accuracy_score(y_test, y_pred_rf_full_test)
print("Random Forest (full) train acc:", acc_rf_full_train)
print("Random Forest (full) test acc:", acc_rf_full_test)

# Decision tree with PCA features
tree_pca = DecisionTreeClassifier(max_depth=5, random_state=42)
tree_pca.fit(X_train_pca, y_train)
y_pred_tree_pca_train = tree_pca.predict(X_train_pca)
y_pred_tree_pca_test = tree_pca.predict(X_test_pca)
acc_tree_pca_train = accuracy_score(y_train, y_pred_tree_pca_train)
acc_tree_pca_test = accuracy_score(y_test, y_pred_tree_pca_test)
print("Decision Tree (PCA) train acc:", acc_tree_pca_train)
print("Decision Tree (PCA) test acc:", acc_tree_pca_test)

# Random forest with PCA features
rf_pca = RandomForestClassifier(n_estimators=200, max_depth=None, random_state=42)
rf_pca.fit(X_train_pca, y_train)
y_pred_rf_pca_train = rf_pca.predict(X_train_pca)
y_pred_rf_pca_test = rf_pca.predict(X_test_pca)
acc_rf_pca_train = accuracy_score(y_train, y_pred_rf_pca_train)
acc_rf_pca_test = accuracy_score(y_test, y_pred_rf_pca_test)
print("Random Forest (PCA) train acc:", acc_rf_pca_train)
print("Random Forest (PCA) test acc:", acc_rf_pca_test)

print("\nClassification report for Random Forest (full features):")
print(classification_report(y_test, y_pred_rf_full_test))

Decision Tree (full) train acc: 0.8491620111731844
Decision Tree (full) test acc: 0.7316017316017316
Random Forest (full) train acc: 1.0
Random Forest (full) test acc: 0.7445887445887446
Decision Tree (PCA) train acc: 0.8361266294227188
Decision Tree (PCA) test acc: 0.7229437229437229
Random Forest (PCA) train acc: 1.0
Random Forest (PCA) test acc: 0.7359307359307359

Classification report for Random Forest (full features):
              precision    recall  f1-score   support

           0       0.77      0.86      0.81       150
           1       0.67      0.53      0.59        81

    accuracy                           0.74       231
   macro avg       0.72      0.70      0.70       231
weighted avg       0.74      0.74      0.74       231



### 7(c) Explanation – Full vs PCA Features

**Analysis**

**1. Decision Tree (full features)**
- Train accuracy = **0.849**  
- Test accuracy = **0.732**  
- The train–test gap of about 0.12 indicates moderate overfitting, even with `max_depth=5`.  
- Sensitive to noise because a single tree forms very specific rules.

**2. Random Forest (full features)**
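- Train accuracy = **1.000** → expected with `max_depth=None`, since each tree grows until it fits its training sample; the gap of about 0.26 to the test accuracy shows the forest also overfits.  
- Test accuracy = **0.745**, the highest among all four models, but only 3 test samples ahead of the single tree (172 vs 169 correct out of 231).  
- Averaging many trees reduces variance, which is why the forest generalizes at least as well as the single tree despite fitting the training data perfectly.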

**3. Effect of PCA on Tree Models**
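- PCA compresses and mixes the original features → splits are made on components, so the trees lose interpretability.  
- Only 3 components are kept (`n_components=3`), so some of the information in the original features is discarded.  
- Both tree and forest with PCA have **slightly lower** test accuracy (2 fewer correct test samples each):
  - DT PCA = 0.723 (vs 0.732 with full features)  
  - RF PCA = 0.736 (vs 0.745 with full features)  
- Tree models split on one feature at a time (axis-aligned splits), so rotating the features with PCA gives them no inherent advantage.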

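**Conclusion:**  
**Random Forest with full features has the highest test accuracy (0.745)**, and both PCA-based models score slightly below their full-feature counterparts. The margins are only 2 to 5 test samples out of 231, so the ranking is suggestive rather than conclusive; on this split, PCA with 3 components did not help the tree models.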
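
Because the differences on a single split are small, a cross-validated comparison gives a steadier picture. The sketch below shows one way to set it up, under stated assumptions: the data is synthetic (`make_classification`), so its numbers do not describe the Pima results, and names such as `rf_full_demo` and `rf_pca_demo` are hypothetical. Putting the scaler and PCA inside a pipeline means they are refit on each training fold, so no fold's held-out rows influence the projection.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 8 features, roughly 65/35 class split
X_demo, y_demo = make_classification(
    n_samples=768, n_features=8, n_informative=5, n_redundant=1,
    weights=[0.65, 0.35], random_state=42,
)

# How much variance do the first 3 principal components retain?
pca_check = PCA(n_components=3).fit(StandardScaler().fit_transform(X_demo))
kept = pca_check.explained_variance_ratio_.sum()
print(f"variance kept by 3 components: {kept:.3f}")

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Forest on the original features
rf_full_demo = RandomForestClassifier(n_estimators=100, random_state=42)

# Forest on 3 PCA components; scaling and PCA are refit inside each fold
rf_pca_demo = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    RandomForestClassifier(n_estimators=100, random_state=42),
)

scores_full = cross_val_score(rf_full_demo, X_demo, y_demo, cv=cv, scoring="accuracy")
scores_pca = cross_val_score(rf_pca_demo, X_demo, y_demo, cv=cv, scoring="accuracy")

print(f"full features: {scores_full.mean():.3f} +/- {scores_full.std():.3f}")
print(f"PCA features:  {scores_pca.mean():.3f} +/- {scores_pca.std():.3f}")
```

If the mean difference is smaller than the fold-to-fold spread, the two feature sets cannot be reliably ranked.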

**Interpretation**
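- Class **0** (non-diabetic) is classified more reliably (recall 0.86), partly because it is the majority class.  
- Class **1** has lower recall (0.53) → nearly half of the diabetics in the test set are missed.  
- This pattern is consistent with the **class imbalance** seen in the value counts (500 vs 268); always predicting class 0 would already score about 0.65 on the test set (150 of 231).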
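
The counts behind these figures can be recovered with a little arithmetic. The sketch below rebuilds the confusion-matrix entries for the Random Forest (full features) from the support and recall values in the classification report. Those values are rounded to two decimals, so the reconstruction is an inference, checked here against the reported test accuracy and precision.

```python
# Values copied from the classification report (rounded to 2 decimals)
support = {0: 150, 1: 81}
recall = {0: 0.86, 1: 0.53}

# recall = correctly predicted / support, so invert and round to whole samples
tn = round(recall[0] * support[0])  # non-diabetics predicted as 0 -> 129
tp = round(recall[1] * support[1])  # diabetics predicted as 1     -> 43
fp = support[0] - tn                # non-diabetics predicted as 1 -> 21
fn = support[1] - tp                # diabetics predicted as 0     -> 38

n_test = support[0] + support[1]    # 231
accuracy = (tn + tp) / n_test       # 172 / 231, matches the reported 0.7446
precision_1 = tp / (tp + fp)        # 43 / 64, rounds to the reported 0.67
baseline = support[0] / n_test      # always predicting class 0

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"accuracy={accuracy:.4f} precision(1)={precision_1:.2f}")
print(f"majority-class baseline={baseline:.3f}")
```

Under this reconstruction, 38 of the 81 diabetic cases in the test set are predicted as non-diabetic.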

---