
# Module 18 Practice Notebook

**Datasets used here:** `sklearn.datasets.load_wine()` (Part A) and `sklearn.datasets.fetch_california_housing()` (Part B)

### What you will practice
- Train/test split with stratification
- Baseline Random Forest training
- Evaluation (accuracy, classification report, confusion matrix)
- Feature importance
- Hyperparameter tuning with GridSearchCV
- Comparing baseline vs tuned model
- Regression implementation and its analysis

> **Rule for this notebook:** Every section has TODO tasks. Fill them in and run.

This notebook has two parts:
- **Part A:** Classification (Wine)
- **Part B:** Regression (California Housing)

## Part A: Random Forest for Classification

In [None]:
# TODO: Run this cell first (imports)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


### 1) Load the dataset (Wine dataset)

The Wine dataset is a **multiclass classification** problem:
- 3 classes (wine cultivars)
- 13 numeric features

Your job:
- Load the dataset
- Create a DataFrame for features `X`
- Create a Series for target `y`
- Print shapes and class distribution
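
The same loading pattern works for any bundled scikit-learn dataset. The sketch below uses the Iris dataset instead of Wine (so it does not fill in your TODO) and shows how the returned object maps to a DataFrame and a Series. The variable names `iris`, `X_iris`, and `y_iris` are illustrative choices.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load a bundled dataset (Iris here, as a stand-in for Wine)
iris = load_iris()

# Features go into a DataFrame with named columns
X_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# Target goes into a Series
y_iris = pd.Series(iris.target, name="target")

print("X shape:", X_iris.shape)
print("y shape:", y_iris.shape)
print(y_iris.value_counts().sort_index())
```

`value_counts()` is the quickest way to check whether the classes are balanced before you split.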


In [None]:
# TODO: Load the wine dataset
# Hint: data = load_wine()

data = None  # TODO
# TODO: Create X and y
X = None  # TODO (DataFrame)
y = None  # TODO (Series)

# TODO: Print shapes
# print("X shape:", ...)
# print("y shape:", ...)

# TODO: Show class distribution
# print(y.value_counts())



### 2) Train-test split

Requirements:
- test_size = 0.25
- random_state = 42
- stratify by y (important for class balance)

Your job:
- Create X_train, X_test, y_train, y_test
- Print the train/test sizes
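
To see what `stratify` does, the sketch below splits a small made-up label array (not the Wine data) and checks the class proportions in each subset. The 80/20 imbalance and the names `labels_demo` and `features_demo` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of class 0, 20 of class 1
labels_demo = np.array([0] * 80 + [1] * 20)
features_demo = np.arange(100).reshape(-1, 1)  # dummy feature column

# Stratified split: class proportions are preserved in both subsets
_, _, y_tr_demo, y_te_demo = train_test_split(
    features_demo,
    labels_demo,
    test_size=0.25,
    random_state=42,
    stratify=labels_demo,
)

train_share = y_tr_demo.mean()  # fraction of class 1 in the training labels
test_share = y_te_demo.mean()   # fraction of class 1 in the test labels
print(f"Class-1 share -> train: {train_share:.2f}, test: {test_share:.2f}")
```

Try removing `stratify=labels_demo` and changing `random_state` a few times: the two shares will usually drift apart.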


In [None]:
# TODO: Split the dataset
X_train, X_test, y_train, y_test = None, None, None, None  # TODO

# TODO: Print sizes
# print("Train:", X_train.shape, y_train.shape)
# print("Test :", X_test.shape, y_test.shape)



### 3) Baseline Random Forest model

Requirements:
- Use RandomForestClassifier
- n_estimators = 200 (double the scikit-learn default of 100)
- random_state = 42

Your job:
- Initialize the model
- Fit on training data
- Predict on test data


In [None]:
# TODO: Baseline model
rf_baseline = None  # TODO

# TODO: Fit
# rf_baseline.fit(X_train, y_train)

# TODO: Predict
y_pred_baseline = None  # TODO




### 4) Evaluate baseline model

Your job:
1. Compute accuracy
2. Print classification report
3. Build confusion matrix
4. Plot confusion matrix using matplotlib (no seaborn)

Note:
- This is multiclass, so confusion matrix is 3x3.
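
As a reading aid, here is a minimal sketch of a 3x3 confusion matrix built from hand-written labels (hypothetical values, not model output). Rows are actual classes and columns are predicted classes, so the diagonal holds the correct predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for a 3-class problem
y_true_demo = np.array([0, 0, 1, 1, 2, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0, 2])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)
print(cm_demo)

# Correct predictions sit on the diagonal, so accuracy is trace / total
accuracy_from_cm = np.trace(cm_demo) / cm_demo.sum()
print("Accuracy from diagonal:", accuracy_from_cm)
```

Off-diagonal entries tell you which pairs of classes get confused, which a single accuracy number hides.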


In [None]:
# TODO: Accuracy
# acc = accuracy_score(y_test, y_pred_baseline)
# print("Baseline accuracy:", acc)

# TODO: Classification report
# print(classification_report(y_test, y_pred_baseline, target_names=data.target_names))

# TODO: Confusion matrix
cm = None  # TODO

# TODO: Plot confusion matrix (matplotlib)
# plt.figure(figsize=(5,4))
# plt.imshow(cm, interpolation="nearest")
# plt.title("Confusion Matrix (Baseline)")
# plt.colorbar()
# tick_marks = np.arange(len(data.target_names))
# plt.xticks(tick_marks, data.target_names, rotation=45, ha="right")
# plt.yticks(tick_marks, data.target_names)
# plt.xlabel("Predicted")
# plt.ylabel("Actual")
# for i in range(cm.shape[0]):
#     for j in range(cm.shape[1]):
#         plt.text(j, i, cm[i, j], ha="center", va="center")
# plt.tight_layout()
# plt.show()



### 5) Feature importance (baseline)

Your job:
1. Extract `feature_importances_`
2. Create a DataFrame of feature names and importances
3. Sort and show top 5 features
4. Plot top 5 importances using matplotlib

Reminder:
- Feature importance is **global**, not per individual prediction.
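
The sketch below illustrates the "global" point on synthetic data (an assumption for illustration, not the Wine dataset): a fitted forest exposes exactly one importance value per feature, and the values sum to 1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic classification data with 6 features
X_imp_demo, y_imp_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

rf_imp_demo = RandomForestClassifier(n_estimators=50, random_state=0)
rf_imp_demo.fit(X_imp_demo, y_imp_demo)

# One importance per feature, no matter how many rows you later predict on
imp_demo = rf_imp_demo.feature_importances_
print("Number of importances:", imp_demo.shape[0])
print("Sum of importances   :", imp_demo.sum())
```

Because the values are normalized, read them as relative shares rather than absolute effect sizes.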


In [None]:

# TODO: Extract feature importances
importances = None  # TODO

# TODO: Build a sorted DataFrame
feat_imp = None  # TODO DataFrame with columns: feature, importance

# TODO: Print top 5
# display(feat_imp.head(5))

# TODO: Plot top 5
# plt.figure(figsize=(8,4))
# plt.bar(feat_imp["feature"].head(5), feat_imp["importance"].head(5))
# plt.xticks(rotation=45, ha="right")
# plt.title("Top 5 Feature Importances (Baseline)")
# plt.ylabel("Importance")
# plt.tight_layout()
# plt.show()



### 6) Hyperparameter tuning with GridSearchCV

Tune these parameters:
- n_estimators
- max_depth
- min_samples_split
- max_features

Requirements:
- cv = 5
- scoring = "accuracy"
- n_jobs = -1
- random_state = 42 in the estimator

Your job:
1. Define param_grid
2. Run GridSearchCV
3. Print best_params_
4. Create best model and predict on test set
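
Before filling in your own grid, it may help to see the mechanics on a deliberately tiny example. The sketch below runs `GridSearchCV` on synthetic data with a two-by-two grid; the data, the grid values, and the variable names are illustrative assumptions, not recommended settings for this exercise. `n_jobs` is left at its default here to keep the sketch simple.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for the real training set
X_gs, y_gs = make_classification(
    n_samples=150, n_features=8, n_informative=4, random_state=0
)

# 2 values x 2 values = 4 candidate combinations
small_grid = {"n_estimators": [20, 50], "max_depth": [None, 3]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_gs, y_gs)

print("Candidates tried:", len(search.cv_results_["params"]))
print("Best params     :", search.best_params_)
print("Best CV accuracy:", round(search.best_score_, 3))
```

With the default `refit=True`, `search.best_estimator_` is already refit on all the data passed to `fit`, so you can call `predict` on it directly. Note that the number of model fits is (number of candidates) x `cv`, which grows quickly as you add values to the grid.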


In [None]:
# TODO: Define param grid
param_grid = {
    # "n_estimators": [...],
    # "max_depth": [...],
    # "min_samples_split": [...],
    # "max_features": [...]
}

# TODO: Create GridSearchCV
grid = None  # TODO

# TODO: Fit grid search
# grid.fit(X_train, y_train)

# TODO: Print best params
# print("Best params:", grid.best_params_)

# TODO: Best estimator and prediction
best_rf = None  # TODO
y_pred_tuned = None  # TODO



### 7) Evaluate tuned model and compare with baseline

Your job:
1. Compute tuned accuracy
2. Print tuned classification report
3. Compare baseline vs tuned accuracy in one print block
4. (Optional) Plot tuned confusion matrix like before

Write a 2-3 line conclusion:
- Did tuning help?
- If not, why might that happen?


In [None]:
# TODO: Tuned accuracy
# acc_tuned = accuracy_score(y_test, y_pred_tuned)
# print("Tuned accuracy:", acc_tuned)

# TODO: Report
# print(classification_report(y_test, y_pred_tuned, target_names=data.target_names))

# TODO: Compare
# baseline_acc = accuracy_score(y_test, y_pred_baseline)
# print(f"Baseline accuracy: {baseline_acc:.4f}")
# print(f"Tuned accuracy   : {acc_tuned:.4f}")

# TODO: Short written conclusion (as a print or markdown)
# print("Conclusion: ...")



### 8) Challenge Tasks (Optional but recommended)

**Challenge A:** Change `random_state` and rerun. Does the accuracy change a lot? Why?  
**Challenge B:** Increase `n_estimators` to 500. Does it improve? What happens to runtime?  
**Challenge C:** Try `class_weight="balanced"` and compare results (even if classes are not extremely imbalanced).

Write your observations briefly.



## Part B: Random Forest for Regression

In regression, the target is **continuous numeric**, not class labels.

You will use the **California Housing** dataset:
- Features: information about California districts (e.g., median income, house age, rooms, etc.)
- Target: median house value per district (a numeric value, expressed in units of $100,000)

### Your goals for regression part

- Load a regression dataset (X_reg, y_reg)
- Split train and test sets
- Train a RandomForestRegressor
- Evaluate using regression metrics:
  - **MAE** (Mean Absolute Error)
  - **MSE** (Mean Squared Error)
  - **RMSE** (Root Mean Squared Error)
  - **R-squared** (coefficient of determination)
- Do a simple residual analysis



### Step B1: Imports for Regression

You will need these extra imports:
- fetch_california_housing
- RandomForestRegressor
- mean_absolute_error, mean_squared_error, r2_score


In [None]:
# TODO (B1): Import regression-specific tools
# from sklearn.datasets import fetch_california_housing
# from sklearn.ensemble import RandomForestRegressor
# from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

### Step B2: Load the California Housing Dataset

Tasks:
1. Load the dataset using fetch_california_housing()
2. Convert feature matrix into a pandas DataFrame
3. Convert target into a pandas Series
4. Print shapes to confirm:
   - X_reg shape should be (n_samples, n_features)
   - y_reg shape should be (n_samples,)


In [None]:
# TODO (B2): Load California Housing dataset
# data_reg = fetch_california_housing()

# TODO: Create X_reg DataFrame and y_reg Series
# Hint: data_reg.data, data_reg.feature_names, data_reg.target

# TODO: Print shapes
# print(X_reg.shape)
# print(y_reg.shape)

### Step B3: Quick Data Check (Very Important)

Before modeling, always inspect:
- First few rows
- Summary statistics
- Missing values

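Random Forest can model non-linear patterns, but missing values still need your attention: depending on your scikit-learn version, NaNs in the features may raise an error at fit time, so check for them first and decide how to handle them.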
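
A minimal sketch of the missing-value check, using a small hand-made DataFrame (hypothetical values, not the housing data). Median imputation is shown as one simple option among several, not as the required approach.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries
df_na_demo = pd.DataFrame(
    {
        "income": [3.2, np.nan, 5.1, 4.0],
        "rooms": [5.0, 6.0, np.nan, np.nan],
    }
)

# Count missing values per column
missing_counts = df_na_demo.isna().sum()
print(missing_counts)

# One simple option: fill each column with its own median
df_na_filled = df_na_demo.fillna(df_na_demo.median())
print("Missing after fill:", df_na_filled.isna().sum().sum())
```

If you impute in a real workflow, compute the fill values on the training set only and reuse them on the test set, so that no test information leaks into training.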


In [None]:
# TODO (B3): Basic inspection
# 1) X_reg.head()
# 2) X_reg.describe()
# 3) X_reg.isna().sum().sort_values(ascending=False).head(10)

### Step B4: Train-test Split for Regression

Use the same split style:
- test_size = 0.25
- random_state = 42

In regression, we typically do NOT use stratify, because stratification expects discrete class labels and a continuous target has none.


In [None]:
# TODO (B4): Split X_reg and y_reg
# X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(...)

### Step B5: Train a Baseline RandomForestRegressor

Start with a baseline model first. Use reasonable defaults.

Recommended baseline settings:
- n_estimators = 300  (more trees give more stable predictions, but training is slower)
- random_state = 42
- n_jobs = -1 (use all CPU cores if available)

Note: We are not tuning yet. This is the baseline.


In [None]:
# TODO (B5): Initialize and train RandomForestRegressor
# rf_reg = RandomForestRegressor(
#     n_estimators=300,
#     random_state=42,
#     n_jobs=-1
# )
# rf_reg.fit(X_train_reg, y_train_reg)


### Step B6: Make Predictions

After fitting, predict on the test set and store predictions in y_pred_reg.


In [None]:
# TODO (B6): Predict using the trained regressor
# y_pred_reg = rf_reg.predict(X_test_reg)

### Step B7: Evaluate Regression Performance (Step-by-step)

We will compute:

1) **MAE**: average absolute error  
   - Easy to interpret (same units as target)

2) **MSE**: average squared error  
   - Penalizes large errors more

3) **RMSE**: square root of MSE  
   - Same unit as target, but still penalizes large errors

4) **R-squared**: fraction of variance explained  
   - 1.0 is perfect
   - 0.0 means “no better than predicting the mean”
   - can be negative if the model is very poor

Print all four clearly.
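
To make the definitions concrete, the sketch below computes all four metrics by hand with NumPy on four made-up values, then uses `r2_score` to check the two special cases from the list above. All numbers here are toy values chosen for easy arithmetic.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values (hypothetical), small enough to verify by hand
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 2.0, 5.0])

errors = y_true_toy - y_pred_toy

mae_toy = np.mean(np.abs(errors))   # average absolute error
mse_toy = np.mean(errors ** 2)      # average squared error
rmse_toy = np.sqrt(mse_toy)         # back in the target's units

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_toy = 1 - ss_res / ss_tot

print("MAE :", mae_toy)
print("MSE :", mse_toy)
print("RMSE:", rmse_toy)
print("R2  :", r2_toy)

# Special case 1: always predicting the mean gives R-squared of 0
mean_pred = np.full_like(y_true_toy, y_true_toy.mean())
r2_mean = r2_score(y_true_toy, mean_pred)

# Special case 2: predictions worse than the mean give a negative R-squared
r2_bad = r2_score(y_true_toy, y_true_toy[::-1])

print("R2 (mean predictor):", r2_mean)
print("R2 (reversed preds):", r2_bad)
```

Notice that RMSE (0.75) is larger than MAE (0.625) here: squaring gives the two errors of size 1 more weight than the error of size 0.5.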


In [None]:
# TODO (B7): Compute metrics
# mae = mean_absolute_error(y_test_reg, y_pred_reg)
# mse = mean_squared_error(y_test_reg, y_pred_reg)
# rmse = np.sqrt(mse)
# r2 = r2_score(y_test_reg, y_pred_reg)

# TODO: Print metrics nicely
# print("MAE:", mae)
# print("MSE:", mse)
# print("RMSE:", rmse)
# print("R2:", r2)

### Step B8: Visual Check (Predicted vs Actual)

A quick sanity plot:
- x-axis: actual values (y_test_reg)
- y-axis: predicted values (y_pred_reg)

If the model is strong, points should roughly follow the diagonal line.


In [None]:
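# TODO (B8): Scatter plot: Actual vs Predicted
# plt.figure(figsize=(6,6))
# plt.scatter(y_test_reg, y_pred_reg, alpha=0.4)
# # Diagonal reference line: points on it are perfect predictions
# lims = [y_test_reg.min(), y_test_reg.max()]
# plt.plot(lims, lims, linestyle="--", color="black")
# plt.xlabel("Actual")
# plt.ylabel("Predicted")
# plt.title("Random Forest Regression: Actual vs Predicted")
# plt.show()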

### Step B9: Residual Analysis

Residual = Actual - Predicted

A good model should have residuals centered around 0 with no obvious pattern.
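
Besides the plot, a few summary numbers can hint at whether residuals are centered and patternless. The sketch below does this on synthetic data from `make_regression` (an assumption for illustration, not the housing data); the correlation check is a rough heuristic, not a formal test.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data stands in for the real dataset
X_res, y_res = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_tr_res, X_te_res, y_tr_res, y_te_res = train_test_split(
    X_res, y_res, test_size=0.25, random_state=42
)

model_res = RandomForestRegressor(n_estimators=50, random_state=42)
model_res.fit(X_tr_res, y_tr_res)
pred_res = model_res.predict(X_te_res)

# Residual = Actual - Predicted
resid_demo = y_te_res - pred_res

print("Mean residual   :", resid_demo.mean())
print("Std of residuals:", resid_demo.std())

# A correlation far from 0 suggests residuals trend with the predictions
corr_demo = np.corrcoef(pred_res, resid_demo)[0, 1]
print("Corr(predicted, residual):", corr_demo)
```

A mean residual far from 0 points to systematic over- or under-prediction, while a clear trend against the predicted values suggests the model is missing structure in some range of the target.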


In [None]:
# TODO (B9): Residual plot
# residuals = y_test_reg - y_pred_reg
# plt.figure(figsize=(7,4))
# plt.scatter(y_pred_reg, residuals, alpha=0.4)
# plt.axhline(0)
# plt.xlabel("Predicted")
# plt.ylabel("Residual (Actual - Predicted)")
# plt.title("Residual Plot")
# plt.show()

### Step B10: One Mini Experiment

Change ONE hyperparameter and observe the effect.

Choose one:
- max_depth
- min_samples_split
- max_features

Task:
1. Train a second model with your chosen change
2. Compute MAE, RMSE, and R2 again
3. Compare with baseline

Keep everything else identical.
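
One way to keep the comparison fair is to route both models through the same evaluation function. The sketch below does that on synthetic data; `regression_report` is a hypothetical helper written for this example, and the `max_depth=3` setting is an arbitrary illustration rather than a suggested value.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def regression_report(model, X_test, y_test):
    """Hypothetical helper: return MAE, RMSE and R2 for a fitted model."""
    pred = model.predict(X_test)
    return {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }


# Synthetic data stands in for the real dataset
X_syn, y_syn = make_regression(
    n_samples=300, n_features=5, noise=10.0, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

# Two models that differ in exactly one hyperparameter (max_depth)
settings = {"baseline": None, "max_depth=3": 3}
results = {}
for name, depth in settings.items():
    model = RandomForestRegressor(
        n_estimators=50, random_state=42, max_depth=depth
    )
    model.fit(X_a, y_a)
    results[name] = regression_report(model, X_b, y_b)

for name, scores in results.items():
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

Keeping the split, the seed, and the number of trees fixed means any difference in the printed scores can be attributed to the one parameter you changed.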


In [None]:
# TODO (B10): Mini experiment
# Example:
# rf_reg2 = RandomForestRegressor(
#     n_estimators=300,
#     random_state=42,
#     n_jobs=-1,
#     max_depth=10
# )
# rf_reg2.fit(X_train_reg, y_train_reg)
# y_pred_reg2 = rf_reg2.predict(X_test_reg)

# TODO: Compute and print MAE/RMSE/R2 for rf_reg2

### Final Reflection (Write answers, no coding)

1) In regression, why do we use MAE/RMSE instead of accuracy?  
2) What does R-squared mean in simple language?  
3) Which parameter seems most related to overfitting: max_depth or n_estimators? Why?  
4) If your RMSE is high, list two possible reasons.