# Module 13 - Practice Notebook
This notebook includes TODO markers **inside code cells** so students complete the missing parts.

### Welcome to Module 13 Practice!

This notebook contains hands-on exercises to help you practice Multiple Linear Regression and Polynomial Regression concepts.

**What you'll practice:**
1. Working with a real insurance dataset
2. Handling categorical features with OneHotEncoder
3. Building pipelines for preprocessing
4. Comparing different polynomial degrees
5. Understanding overfitting through visualization

**How to use this notebook:**
- Look for `# TODO:` comments in code cells
- Fill in the missing code as instructed
- Run each cell to verify your solution
- Read the explanations to understand each step

**Tips for beginners:**
- Take your time to understand each concept
- Don't worry about making mistakes - that's how we learn!
- Refer back to Mod 13.ipynb for examples
- Ask questions if anything is unclear

Let's get started!

In [None]:
# Import all required libraries
import numpy as np
import pandas as pd

# Scikit-learn imports for machine learning
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import matplotlib.pyplot as plt

# What each library does:
# - numpy: For numerical operations and arrays
# - pandas: For data manipulation and working with DataFrames
# - train_test_split: To split data into training and testing sets
# - LinearRegression: Our main regression model
# - OneHotEncoder: Converts categorical variables to numeric format
# - PolynomialFeatures: Creates polynomial terms for non-linear relationships
# - ColumnTransformer: Applies different transformations to different columns
# - Pipeline: Chains multiple steps together for cleaner code
# - Metrics: Functions to evaluate model performance
# - matplotlib: For creating visualizations and plots

## Load Insurance Dataset

### About the Insurance Dataset

This dataset contains information about individuals and their insurance charges, which makes it a useful regression example.

**Features include:**
- `age`: Age of the primary beneficiary
- `sex`: Gender (male/female) - categorical
- `bmi`: Body mass index (kg/m²)
- `children`: Number of children covered
- `smoker`: Whether the person smokes (yes/no) - categorical
- `region`: Residential region in US (northeast, southeast, southwest, northwest) - categorical

**Target:**
- `charges`: Individual medical costs billed by health insurance

**Why this dataset is interesting:**
- Mix of numeric and categorical features
- Likely non-linear relationships (e.g., age effects might not be linear)
- Clear interactions (smoking probably amplifies age effects)
- Real-world relevance for insurance companies

In [None]:
# Load the insurance dataset
url = "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv"
insurance = pd.read_csv(url)

# TODO: Display first 5 rows


## Dataset Overview

### Understanding Dataset Information

Exploring your dataset is a crucial first step in any machine learning project. This helps you:

**1. Understand your data:**
- How many samples (rows) do you have?
- What are the features (columns)?
- Are there missing values?

**2. Identify data types:**
- Numeric features (integers, floats)
- Categorical features (objects, strings)
- This determines which preprocessing steps are needed

**3. Get basic statistics:**
- Mean, median, standard deviation for numeric features
- Value counts for categorical features
- Range of values to spot potential outliers

**4. Plan your modeling approach:**
- Which features might be predictive?
- Do you need to create new features?
- Should you scale or normalize anything?

Best practice: Always explore your data before modeling!
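
The checks listed above can be sketched on a tiny made-up DataFrame. The `toy` table and its values are hypothetical (this is not the insurance data); the pandas calls are the standard ones for each question.

```python
import pandas as pd

# Hypothetical toy data for illustration only (not the insurance dataset)
toy = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "smoker": ["no", "yes", "no", "no"],
    "charges": [3200.0, 21500.0, 4100.0, None],
})

toy.info()                            # row count, column dtypes, non-null counts
print(toy.describe())                 # summary statistics for numeric columns
print(toy["smoker"].value_counts())   # frequency of each category

missing_per_column = toy.isna().sum() # number of missing values per column
print(missing_per_column)
```

Here `charges` has one missing value, which `info()` and `isna().sum()` both reveal before any modeling starts.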

In [None]:
# TODO: Print dataset info


# TODO: Print descriptive statistics



## Define Features and Target

### Preparing Features and Target

In supervised machine learning, we need to clearly separate:

**Features (X)**: The input variables we use to make predictions
- Also called predictors, independent variables
- Everything except what we want to predict
- Must be in the right format for the model

**Target (y)**: What we want to predict
- Also called response, dependent variable, label
- Single column we're trying to estimate
- Must match the number of rows in X

**Why this separation matters:**
- Models learn patterns from X to predict y
- During training, both X and y are provided
- During prediction, only X is provided
- Clear separation prevents data leakage

**Common mistakes to avoid:**
- Accidentally including target in features
- Wrong shapes (X should be 2D, y should be 1D)
- Mismatched number of samples between X and y
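
As a minimal sketch of this separation, assuming a made-up table with hypothetical column names (`hours`, `slept_well`, `score`), the shape and leakage checks from the list above look like this:

```python
import pandas as pd

# Hypothetical toy data; the column names are for illustration only
toy = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0],
    "slept_well": ["yes", "no", "yes", "yes"],
    "score": [52.0, 60.0, 71.0, 78.0],
})

feature_cols = ["hours", "slept_well"]
target_col = "score"

X_toy = toy[feature_cols]   # selecting a list of columns gives a 2D DataFrame
y_toy = toy[target_col]     # selecting a single column gives a 1D Series

print(X_toy.shape, y_toy.shape)   # (4, 2) (4,)

# Guard against the common mistakes
assert target_col not in X_toy.columns   # target is not among the features
assert len(X_toy) == len(y_toy)          # same number of samples
```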

In [None]:
# TODO: Set numeric and categorical feature lists


# TODO: Define target column


# TODO: Create X and y


# TODO: Display X head



## Train Test Split

### Why Train-Test Split?

Splitting data is fundamental to machine learning because:

**Training Set** (typically 70-80% of data):
- Used to teach/train the model
- Model learns patterns from this data
- Model parameters are adjusted based on this data

**Test Set** (typically 20-30% of data):
- Used only for evaluation
- Model never sees this data during training
- Provides unbiased estimate of real-world performance

**Why not use all data for training?**
- Overfitting (the model memorizing the training data) would go undetected
- No way to know if model generalizes to new data
- Inflated performance metrics
- Poor real-world performance

**The key principle:**
Test set simulates "future" data the model will encounter in production.

**Important:**
- Always split BEFORE fitting any preprocessing (encoders, scalers), so nothing is learned from the test set
- Use same random_state for reproducibility
- Never look at test data during model development
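
A small sketch of what `random_state` buys you, using made-up arrays (`X_demo`, `y_demo` are hypothetical): two calls with the same seed return the same split, and no sample lands in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls with the same random_state
a_train, a_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(a_train.shape, a_test.shape)   # (8, 2) (2, 2)

same_split = np.array_equal(a_test, b_test)                         # reproducible
no_overlap = set(ya_train.tolist()).isdisjoint(ya_test.tolist())    # disjoint sets
print(same_split, no_overlap)
```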

In [None]:
# Perform train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_test.shape


## Multiple Linear Regression Pipeline

### Handling Categorical Features

Most machine learning models require numeric input, but real-world data often contains categorical variables (text labels). 

**Common categorical variables:**
- Gender: male, female
- Yes/No: smoker, non-smoker
- Categories: region, product type
- Ordinal: education level, satisfaction rating

**One-Hot Encoding:**
- Converts each category into a binary (0/1) column
- Example: Region becomes region_northeast, region_southeast, etc.
- At most one of these columns will be 1 for each row (with `drop="first"`, rows in the dropped category are all zeros)
- `drop="first"` removes one category to avoid multicollinearity

**Why not label encoding (0, 1, 2)?**
- Implies ordinal relationship that may not exist
- Models might assume 2 > 1 > 0 has meaning
- One-hot is safer for nominal categories

**ColumnTransformer:**
- Applies different preprocessing to different columns
- Passes through numeric features unchanged
- Applies OneHotEncoder to categorical features
- Keeps track of which columns get which treatment

In [None]:
# Build preprocessing transformer
preprocess = ColumnTransformer(
    transformers=[
        ("num", "passthrough", numeric_features),
        ("cat", OneHotEncoder(drop="first"), categorical_features),
    ]
)

# TODO: Build full pipeline with LinearRegression
# Name it mlr_model, with steps named "preprocess" and "linreg" (later cells rely on these names)


# TODO: Fit the model



## Regression Performance Function

### Creating Evaluation Functions

Creating reusable functions for evaluation makes your code:
1. **Cleaner**: Don't repeat the same evaluation code
2. **Consistent**: Same metrics calculated everywhere
3. **Less error-prone**: Fewer chances to make mistakes
4. **Efficient**: Write once, use many times

**Key metrics for regression:**
- **MAE**: Mean Absolute Error (interpretable in target units)
- **RMSE**: Root Mean Squared Error (punishes large errors more)
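- **R²**: R-squared (proportion of variance explained; at most 1, and it can be negative on test data when the model does worse than always predicting the mean)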

**What makes a good evaluation function:**
- Clear, descriptive labels
- Shows multiple metrics
- Handles both train and test sets
- Easy to compare different models

**Best practices:**
- Always evaluate both train AND test sets
- Look for large gaps (sign of overfitting)
- Consider the context when interpreting metrics
- MAE is most interpretable for stakeholders
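
The three metrics can be worked out by hand on a few made-up numbers. This is a sketch of the arithmetic only, with hypothetical values; in the notebook you would use the imported scikit-learn metric functions.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_pred - y_true                 # [10, -10, 20, -20]
mae = np.mean(np.abs(errors))            # average absolute error
rmse = np.sqrt(np.mean(errors ** 2))     # squaring weights large errors more

ss_res = np.sum(errors ** 2)                      # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  R2={r2:.2f}")   # MAE=15.00  RMSE=15.81  R2=0.98

# Predictions worse than the mean give a negative R2
y_bad = y_true[::-1]
r2_bad = 1 - np.sum((y_bad - y_true) ** 2) / ss_tot
print(f"R2 for reversed predictions: {r2_bad:.1f}")     # -3.0
```

Note that RMSE (15.81) is larger than MAE (15.00) because the two errors of size 20 count more once squared.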

In [None]:
# TODO: Complete function to print performance


## Evaluate Model

In [None]:
# Predict for train and test sets
y_train_pred = mlr_model.predict(X_train)
y_test_pred = mlr_model.predict(X_test)

# TODO: Print train and test performance



## Inspect Coefficients

### Interpreting Model Coefficients

After training a linear model, each coefficient describes how the prediction changes with that feature while the other features are held fixed:

**Positive coefficient:**
- As feature increases, prediction increases
- Example: Higher age → higher insurance charges

**Negative coefficient:**
- As feature increases, prediction decreases
- Example: Being non-smoker → lower charges

**Magnitude matters:**
- Larger absolute value = stronger effect per unit change in that feature
- Small coefficient = weak effect
- Near zero = little to no effect

**One-hot encoded features:**
- Coefficient represents difference from dropped category
- Example: region_southeast = +500 means southeast has $500 higher predicted charges than the reference region (the dropped category), other features held fixed

**Important considerations:**
- Correlation doesn't imply causation
- Coefficients depend on feature scaling
- Correlated features can have unstable coefficients
- Always consider domain knowledge when interpreting
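
A minimal sketch of these points on noise-free, made-up data (the numbers 1000, 50, and 2000 are hypothetical, chosen so the fit can be checked by eye). It also shows why coefficients depend on feature scale: measuring age in months instead of years shrinks that coefficient by a factor of 12 without changing the predictions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noise-free data: charges = 1000 + 50*age + 2000*smoker_yes
age = np.array([20.0, 30.0, 40.0, 50.0, 60.0, 25.0])
smoker_yes = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])   # one-hot dummy, "no" is the reference
charges = 1000 + 50 * age + 2000 * smoker_yes

# Age measured in years
X_years = np.column_stack([age, smoker_yes])
model_years = LinearRegression().fit(X_years, charges)

# Same data with age measured in months
X_months = np.column_stack([age * 12, smoker_yes])
model_months = LinearRegression().fit(X_months, charges)

print(model_years.intercept_, model_years.coef_)   # approximately 1000, [50, 2000]
print(model_months.coef_)                          # approximately [4.17, 2000]
```

The `smoker_yes` coefficient reads as "smokers are predicted to pay about 2000 more than non-smokers of the same age", which is the difference from the dropped category.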

In [None]:
# Extract feature names after OneHotEncoding
ohe = mlr_model.named_steps["preprocess"].named_transformers_["cat"]
cat_feature_names = ohe.get_feature_names_out(categorical_features)

all_feature_names = numeric_features + list(cat_feature_names)

# Extract model coefficients
linreg = mlr_model.named_steps["linreg"]
coeffs = pd.DataFrame({"feature": all_feature_names, "coefficient": linreg.coef_})

# Print intercept and coefficients
print("Intercept:", linreg.intercept_)
coeffs.sort_values("coefficient", ascending=False)


## Plot Actual vs Predicted

### Visualizing Predictions vs Actual

A scatter plot of predicted vs actual values is one of the most important visualizations for regression models:

**What the plot shows:**
- X-axis: Actual (true) values
- Y-axis: Predicted values
- Each point = one prediction

**Ideal pattern:**
- Points form a diagonal line from bottom-left to top-right
- This means predicted = actual (perfect predictions)

**Common patterns and what they mean:**
1. **Tight around diagonal**: Good model
2. **Wide spread**: Model has high error
3. **Above diagonal**: Systematic overprediction (predicted > actual)
4. **Below diagonal**: Systematic underprediction (predicted < actual)
5. **Curve**: Model missing non-linear pattern

**The diagonal reference line:**
- Shows where perfect predictions would fall
- Helps assess model quality visually
- Distance from line = prediction error

**Why this visualization matters:**
- Reveals patterns not captured by metrics alone
- Shows if errors are consistent across value ranges
- Helps identify systematic biases in predictions
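
With actual values on the x-axis and predictions on the y-axis, a point lies above the diagonal exactly when predicted > actual. The sketch below checks this numerically on made-up values (both arrays are hypothetical); the mean signed error summarizes any systematic bias.

```python
import numpy as np

# Hypothetical actual and predicted values
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = np.array([12.0, 25.0, 28.0, 47.0])

signed_error = predicted - actual          # [2, 5, -2, 7]
above_diagonal = signed_error > 0          # True where the model overpredicts
mean_signed_error = signed_error.mean()    # positive -> overprediction on average

print(above_diagonal)        # [ True  True False  True]
print(mean_signed_error)     # 3.0
```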

In [None]:
# Plot scatter of actual vs predicted
plt.figure(figsize=(6, 6))
plt.scatter(y_test, y_test_pred, alpha=0.4)
plt.xlabel("Actual charges")
plt.ylabel("Predicted charges")
plt.title("Actual vs Predicted (Test Set)")

# Add diagonal line
lims = [min(y_test.min(), y_test_pred.min()), max(y_test.max(), y_test_pred.max())]
plt.plot(lims, lims)

plt.show()


## Part B: Polynomial Regression Practice

### Understanding Polynomial Regression on Synthetic Data

This section uses synthetic (artificially generated) data to clearly demonstrate polynomial regression concepts:

**Why use synthetic data?**
- We know the true relationship (perfect for learning)
- Can control the complexity
- No real-world complications such as missing values or unmeasured factors (the only noise is what we add ourselves)
- Clear visualization of concepts

**The generated data:**
- X: Hours studied (0 to 8)
- y: Exam scores with a curved relationship
- Equation: score = 35 + 12*hours - 1*hours² + noise
- This creates an inverted parabola (peaks at 6 hours, then declines)

**What you'll see:**
1. **Linear fit (degree 1)**: Straight line - underfits the curve
2. **Quadratic fit (degree 2)**: Matches the true form - captures the parabola
3. **Cubic fit (degree 3)**: More flexibility than needed - may add slight wiggles
4. **High degree (degree 8)**: Most prone to overfitting - the curve starts to chase noise in the training points

**Key learning goals:**
- Understand how polynomial degree affects fit
- Visualize underfitting vs overfitting
- See why sometimes simpler is better
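
To make the idea concrete, here is a minimal sketch on five noise-free points from the same equation (the small `hours` array is an assumption for illustration, not the notebook's data). `PolynomialFeatures` adds an x² column, after which ordinary linear regression recovers the curve.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free points from score = 35 + 12*hours - hours^2
hours = np.array([0.0, 1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
scores = 35 + 12 * hours[:, 0] - hours[:, 0] ** 2

# Degree 2 without the bias column gives two features: x and x^2
poly = PolynomialFeatures(degree=2, include_bias=False)
hours_poly = poly.fit_transform(hours)
print(hours_poly)
# [[ 0.  0.]
#  [ 1.  1.]
#  [ 2.  4.]
#  [ 3.  9.]
#  [ 4. 16.]]

# The model is still linear in its coefficients
model = LinearRegression().fit(hours_poly, scores)
print(model.intercept_, model.coef_)   # approximately 35, [12, -1]

# Vertex of the fitted parabola: x = -b / (2a)
peak_hours = -model.coef_[0] / (2 * model.coef_[1])
print(peak_hours)                      # approximately 6
```

Polynomial regression is therefore linear regression on expanded features, which is why it fits naturally into a pipeline.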

In [None]:
# Generate synthetic curved dataset
np.random.seed(42)  # Set random seed for reproducible results

# Create X values: 80 points evenly spaced from 0 to 8 hours
X_hours = np.linspace(0, 8, 80).reshape(-1, 1)
# np.linspace creates evenly spaced numbers
# reshape(-1, 1) converts to 2D array (required by scikit-learn)

# Generate y values with a quadratic relationship plus random noise
noise = np.random.normal(0, 5, size=X_hours.shape[0])
# Add Gaussian noise with mean=0, std=5
# This makes the relationship more realistic

# True relationship: y = 35 + 12x - x²
y_scores = 35 + 12 * X_hours[:, 0] - 1 * (X_hours[:, 0] ** 2) + noise
# X_hours[:, 0] extracts the values from 2D array
# This creates an inverted parabola (goes up then down)

# Plot the data to see the pattern
plt.figure(figsize=(10, 6))
plt.scatter(X_hours, y_scores, alpha=0.5)
# Scatter plot shows individual data points
# alpha=0.5 makes points semi-transparent

plt.xlabel("Hours studied")
plt.ylabel("Exam score")
plt.title("Exam Score (Curved Relationship)")
plt.grid(True)
plt.show()

# What this plot shows:
# - Score increases with study time up to a point
# - After ~6 hours, scores decrease (burnout?)
# - This is a classic inverted parabola shape
# - Noise makes it more realistic than perfect data

In [None]:
# Train test split for polynomial data
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_hours, y_scores, test_size=0.2, random_state=42
)
# Split 80% training, 20% testing
# Same random_state ensures reproducible split
# Training: 64 samples, Testing: 16 samples

print(f"Training set shape: {X_train_h.shape}")
print(f"Test set shape: {X_test_h.shape}")

# Why we need train/test split here:
# 1. To detect overfitting in high-degree polynomials
# 2. To ensure our model generalizes to new data
# 3. To compare performance across different degrees
# 4. To find the optimal complexity level

# With synthetic data:
# - We know degree 2 is the true relationship
# - But the model doesn't know this
# - We'll see if train/test split helps identify this
# - Higher degrees can overfit to noise

In [None]:
# TODO: Create helper to fit polynomial model



In [None]:
# TODO: Fit models for degrees 1, 2, 3, 8



In [None]:
# TODO: Plot fitted curves for each degree (show R² and RMSE for each)



## TODO: Final Reflection

After completing all the exercises, write down your answers to these questions:

1. **Which polynomial degree overfits and why?**
   - Look at the R² and RMSE values for different degrees
   - Compare training vs test performance
   - Visualize the fitted curves
   - Explain why this degree overfits (too many parameters, fits noise)

2. **Which degree gives the best generalization?**
   - Which degree has the best test set performance?
   - Is there a large gap between train and test scores?
   - How does this compare to the true relationship (degree 2)?

3. **What did you learn about MLR and polynomial regression?**
   - How do MLR and polynomial regression differ?
   - When would you use each?
   - What are the pros and cons of polynomial regression?
   - What's the relationship between model complexity and overfitting?

4. **How would you apply these concepts to real-world problems?**
   - When would you choose polynomial over linear?
   - How would you select the right degree?
   - What other techniques could help prevent overfitting?

**Bonus question:**
- How does the synthetic data example differ from the insurance dataset?
- What additional challenges do real-world datasets present?