# 📚 Interactive Linear Regression Learning Playbook

Welcome to your comprehensive guide to understanding Linear Regression! This notebook will take you through the fundamental concepts with interactive visualizations and hands-on exercises.

## 🎯 Learning Objectives

By the end of this notebook, you will:
- Understand what linear regression is and when to use it
- Learn the mathematical foundation behind linear regression
- Explore how different parameters affect the model
- Practice with real datasets
- Evaluate model performance

## 📖 Table of Contents

1. [What is Linear Regression?](#section1)
2. [Mathematical Foundation](#section2)
3. [Simple Linear Regression - Interactive Demo](#section3)
4. [Real Dataset Example - Student Scores](#section4)
5. [Multiple Linear Regression](#section5)
6. [Model Evaluation](#section6)
7. [Practice Exercises](#section7)

In [None]:
!pip install -q numpy pandas matplotlib seaborn scikit-learn ipywidgets plotly
!wget https://raw.githubusercontent.com/Khayrulbuet13/Medium-Posts/refs/heads/main/LinnearRegression/utils.py

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

import ipywidgets as widgets
from IPython.display import display, HTML
import warnings
import plotly.graph_objs as go
import plotly.io as pio

# Import our utility functions
import utils

# Suppress warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("📚 Welcome to the Interactive Linear Regression Learning Playbook!")
print("🚀 All libraries loaded successfully!")

📚 Welcome to the Interactive Linear Regression Learning Playbook!
🚀 All libraries loaded successfully!


In [2]:
# Generate the four CSV files used in this notebook
utils.create_simple_student_score_dataset()
utils.create_student_score_dataset()
utils.create_simple_house_price_dataset()
utils.create_house_price_dataset()
print("All csv files created successfully!")

simple student scores dataset saved successfully!
student_scores dataset saved successfully!
simple house price dataset saved successfully!
house_price_dataset saved successfully!
All csv files created successfully!


<a id='section1'></a>

## 1. 🤔 What is Linear Regression?

> **One‑line intuition:** *Linear regression draws the simplest possible trend‑line through your data by **adding up weighted pieces of the input**.*

### Formal Definition

Linear regression models the relationship between a **target** (dependent variable, $y$) and one or more **features** (independent variables, collected in a design matrix $X$) by learning coefficients $\boldsymbol{\beta}$ that appear **linearly** in the equation

$$
\hat y = \beta_0 + \beta_1\,f_1(X)\; + \; \beta_2\,f_2(X)\; + \;\dots + \beta_k\,f_k(X).
$$

The functions $f_j(\cdot)$ may be nonlinear transformations of the raw inputs (e.g. $X^2$, $\log X$), yet the *parameters* $\beta_j$ enter only as *additive, first‑power* terms—this is what keeps the model "linear".

### 🤓 Why “Linear in Parameters” Matters

* **📏 Straight line**: $Y=\beta_0+\beta_1X$
* **🌀 Curved but linear**: $Y=\beta_0+\beta_1X+\beta_2X^2$
* **📉 Log transform**: $Y=\beta_0+\beta_1\log X$
* **🔗 Interaction**: $Y=\beta_0+\beta_1X_1+\beta_2X_2X_3$

If a coefficient is *wrapped inside* a nonlinear function—e.g. $Y = \beta_0 + e^{\beta_1 X}$—the model leaves the linear‑regression family and becomes genuinely *nonlinear*.
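
To make "linear in parameters" concrete, here is a minimal sketch that fits a curved relationship with ordinary least squares. The data are synthetic and noise-free (an assumption made purely for illustration), and NumPy's least-squares solver stands in for any OLS routine.

```python
import numpy as np

# Synthetic, noise-free data from a known quadratic: y = 1 + 2x + 3x^2
x = np.linspace(-2.0, 2.0, 21)
y = 1.0 + 2.0 * x + 3.0 * x**2

# Design matrix with columns [1, x, x^2]:
# nonlinear in x, but each beta still enters as a plain additive weight
X = np.column_stack([np.ones_like(x), x, x**2])

# Ordinary least squares via NumPy's least-squares solver
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta, 3))
```

The recovered coefficients should be close to 1, 2, and 3, even though the fitted curve is a parabola.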

## 📘 Key Concepts

**Dependent variable (Y)**: What we want to predict

**Independent variables (X)**: What we use to make predictions

**Weights / Coefficients ($\beta$)**: Learned parameters that quantify each feature's influence

**Model linearity**: Linear in parameters—even if relationship appears curved

**Design matrix (X)**: Matrix containing all input features plus intercept column

**Residuals ($\varepsilon$)**: Differences between actual and predicted values

### When to Reach for Linear Regression

* You believe the (possibly transformed) features relate approximately linearly to the target.
* You need an **interpretable** baseline before trying fancier models.
* The target is **continuous** and errors are roughly symmetrical.
* You want a quick yard‑stick: linear models train instantly and provide a benchmark for MSE / $R^2$.




<a id='section2'></a>

## 2. 🧮 Mathematical Foundation

### 2.1  Model in Vector / Matrix Form

Given

* Design matrix $X\in\mathbb{R}^{n\times (k+1)}$ (first column = 1’s for the intercept)
* Parameter vector $\boldsymbol{\beta}\in\mathbb{R}^{k+1}$
* Target vector $\mathbf{y}\in\mathbb{R}^{n}$

we write

$$
\hat{\mathbf{y}} = X\,\boldsymbol{\beta}, \qquad \mathbf{y}=\hat{\mathbf{y}}+\boldsymbol{\varepsilon}.
$$
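
As a small sketch of the matrix form, the snippet below builds a design matrix with an intercept column and computes $\hat{\mathbf{y}} = X\boldsymbol{\beta}$. The feature values and the coefficient vector are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# One raw feature with n = 4 observations (hypothetical values)
x = np.array([1.0, 2.0, 3.0, 4.0])

# Design matrix: first column of ones for the intercept, then the feature
X = np.column_stack([np.ones_like(x), x])   # shape (4, 2)

# Hypothetical parameter vector [beta_0, beta_1]
beta = np.array([0.5, 2.0])

# Predictions are a single matrix-vector product
y_hat = X @ beta
print(y_hat)
```

Each prediction is $0.5 + 2x_i$, so the intercept is handled by the column of ones rather than by a separate term.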

### 2.2  Objective Function (Ordinary Least Squares)

The coefficients are chosen to **minimise** the Mean Squared Error (MSE):

$$
J(\boldsymbol{\beta})
= \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat y_i)^2
= \frac{1}{n}\lVert\mathbf{y}-X\boldsymbol{\beta}\rVert_2^{\,2}.
$$

### 2.3  Closed‑form Solution (Normal Equation)

When $X^TX$ is invertible,

$$
\hat{\boldsymbol{\beta}}\;=\;(X^T X)^{-1}X^T\mathbf{y}.
$$

This gives the exact OLS solution in a single matrix step.
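
The normal equation can be sketched in a few lines of NumPy. The data here are synthetic (true intercept 4 and slope 9 are assumptions for the demo), and the linear system is solved with `np.linalg.solve` rather than by forming the inverse explicitly, which is usually the more numerically stable choice.

```python
import numpy as np

# Synthetic data: y = 4 + 9x + noise (values chosen for illustration only)
rng = np.random.default_rng(0)
n = 50
x = rng.uniform(0, 10, size=n)
y = 4.0 + 9.0 * x + rng.normal(0, 1.0, size=n)

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), x])

# Normal equation: solve (X^T X) beta = X^T y
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's general least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equation:", np.round(beta_normal, 3))
print("lstsq:          ", np.round(beta_lstsq, 3))
```

Both routes should agree to within floating-point precision.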

### 2.4  Gradient‑descent Alternative

For huge datasets or streaming contexts we iterate

$$
\boldsymbol{\beta}\leftarrow\boldsymbol{\beta} - \eta \frac{2}{n} X^T(X\boldsymbol{\beta}-\mathbf{y}),
$$

with learning‑rate $\eta$.
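
A minimal sketch of this update rule follows. The synthetic data, the learning rate of 0.1, and the fixed 2000 iterations are all assumptions for the demo; in practice the learning rate needs tuning and features are often standardized first.

```python
import numpy as np

# Synthetic data: y = 2 + 3x + small noise (illustrative values)
rng = np.random.default_rng(1)
n = 100
x = rng.uniform(-1, 1, size=n)
y = 2.0 + 3.0 * x + rng.normal(0, 0.1, size=n)
X = np.column_stack([np.ones(n), x])

eta = 0.1            # learning rate (assumed; must be small enough to converge)
beta = np.zeros(2)   # start from all-zero coefficients

for _ in range(2000):
    # Gradient of the MSE: (2/n) * X^T (X beta - y)
    grad = (2.0 / n) * X.T @ (X @ beta - y)
    beta = beta - eta * grad

# Compare with the exact least-squares solution
beta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)
print("gradient descent:", np.round(beta, 4))
print("exact OLS:       ", np.round(beta_exact, 4))
```

With a suitable learning rate the iterates approach the same coefficients the normal equation produces.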

### 2.5  Quality Metrics

| 📏 Metric | 📐 Formula                                    | 📣 Meaning         |
| --------- | --------------------------------------------- | ------------------ |
| 🎯 $R^2$  | $1-\frac{\sum(y-\hat y)^2}{\sum(y-\bar y)^2}$ | Variance explained |
| 🧮 MSE    | $\frac{1}{n}\sum(y-\hat y)^2$                 | Avg. squared error |
| 📉 MAE    | $\frac{1}{n}\sum\lvert y-\hat y\rvert$        | Avg. absolute error |
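
The three formulas in the table can be checked by hand on a tiny example. The four target values and predictions below are made up for illustration.

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

residuals = y_true - y_pred                      # [0.5, 0.0, -0.5, 0.0]

mse = np.mean(residuals**2)                      # average squared error
mae = np.mean(np.abs(residuals))                 # average absolute error

ss_res = np.sum(residuals**2)                    # unexplained variation
ss_tot = np.sum((y_true - y_true.mean())**2)     # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MSE = {mse:.3f}, MAE = {mae:.3f}, R² = {r2:.3f}")
```

Here MSE is 0.125, MAE is 0.25, and $R^2$ is 0.975; scikit-learn's `mean_squared_error`, `mean_absolute_error`, and `r2_score` compute the same quantities.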

### 2.6  Classical Assumptions (OLS)

1. **Linearity** in parameters
2. **Independence** of errors
3. **Homoscedasticity** (constant error variance)
4. **No perfect multicollinearity** among features
5. **Normality** of errors (for inference)

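Violating these does *not* necessarily stop the model from producing useful predictions, but it undermines the reliability of confidence intervals and hypothesis tests. Perfect multicollinearity is the exception: it makes $X^TX$ singular, so the normal equation has no unique solution.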
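
As a rough illustration of assumption 4, the sketch below builds a design matrix in which one feature is an exact multiple of another. The feature values are hypothetical; the point is only that the matrix loses rank, which is what makes $X^TX$ non-invertible.

```python
import numpy as np

# Two features where x2 is an exact multiple of x1 (perfect multicollinearity)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = 2.0 * x1

# Design matrix with intercept column plus both features
X = np.column_stack([np.ones_like(x1), x1, x2])

rank = np.linalg.matrix_rank(X)
print(f"columns: {X.shape[1]}, rank: {rank}")
```

The matrix has three columns but rank two, so the coefficients of `x1` and `x2` cannot be separated; dropping one of the redundant features restores full rank.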


<a id='section3'></a>

## 3. 🎮 Simple Linear Regression - Interactive Demo

Let's start with a simple example to understand how linear regression works!
Move the sliders to adjust the slope and intercept independently, and try to minimize the MSE.

In [None]:
df = pd.read_csv('simple_student_scores.csv')

# Create an interactive regression demo using only Study_Hours as input
slope_slider, intercept_slider = utils.create_interactive_regression_demo(df)

[Interactive widget output: slope and intercept sliders]

In [5]:
# Let's see what the optimal solution looks like
X_demo = df['Study_Hours'].values
y_demo = df['Exam_Score'].values

# Implement find_optimal_parameters directly here
X_reshaped = X_demo.reshape(-1, 1)


model = LinearRegression()
model.fit(X_reshaped, y_demo)

optimal_slope = model.coef_[0] if X_demo.ndim == 1 else model.coef_
optimal_intercept = model.intercept_
y_pred_optimal = model.predict(X_reshaped)
optimal_mse = mean_squared_error(y_demo, y_pred_optimal)
optimal_r2 = r2_score(y_demo, y_pred_optimal)

print("🎯 OPTIMAL SOLUTION (using sklearn):")
print(f"   Optimal Slope: {optimal_slope:.3f}")
print(f"   Optimal Intercept: {optimal_intercept:.3f}")
print(f"   Minimum MSE: {optimal_mse:.3f}")
print(f"   R² Score: {optimal_r2:.3f}")
print("\n💡 How close were you to the optimal solution?")

🎯 OPTIMAL SOLUTION (using sklearn):
   Optimal Slope: 9.343
   Optimal Intercept: 4.808
   Minimum MSE: 80.658
   R² Score: 0.823

💡 How close were you to the optimal solution?


<a id='section4'></a>

## 4. 📊 Predict Exam Score with the model we built
Hover over the interactive plot to read off the expected exam score for a given number of study hours, or use the linear regression model we just fitted to predict the score directly.

In [6]:
study_hours = [1, 3, 5, 7, 9]

# Example predictions from the simple model fitted above
print("\n📊 Example Predictions:")
for hours in study_hours:
    score = model.predict([[hours]])[0]
    print(f"   {hours} hours → {score:.1f} points")


📊 Example Predictions:
   1 hours → 14.2 points
   3 hours → 32.8 points
   5 hours → 51.5 points
   7 hours → 70.2 points
   9 hours → 88.9 points


<a id='section5'></a>

## 5. 🔢 Multiple Linear Regression

Now let's explore multiple linear regression with more features!

In [7]:
# Import the multiple linear regression dataset from CSV
df_multi = pd.read_csv('student_scores.csv')

print("📊 Multiple Linear Regression Dataset")
print(f"Dataset shape: {df_multi.shape}")
# print("\nFirst 10 students:")
display(df_multi.head(10))

print("\n📝 Description:")
print("This dataset contains simulated student data with the following features:")
print("- Study_Hours: Number of hours spent studying per week")
print("- Sleep_Hours: Average hours of sleep per night")
print("- Previous_Score: Previous exam score (out of 100)")
print("- Attendance_Percent: Class attendance percentage")
print("- Exam_Score: Current exam score (target variable, out of 100)")

📊 Multiple Linear Regression Dataset
Dataset shape: (200, 5)


|   | Study_Hours | Sleep_Hours | Previous_Score | Attendance_Percent | Exam_Score |
| - | ----------- | ----------- | -------------- | ------------------ | ---------- |
| 0 | 4.370861 | 7.85219 | 45.671813 | 66.757403 | 83.77192 |
| 1 | 9.556429 | 4.50484 | 89.64041 | 71.143614 | 100.0 |
| 2 | 7.587945 | 4.969772 | 67.78888 | 67.080419 | 100.0 |
| 3 | 6.387926 | 9.391325 | 85.455161 | 63.548101 | 100.0 |
| 4 | 2.404168 | 7.638574 | 57.602728 | 64.825435 | 82.722329 |
| 5 | 2.403951 | 4.055182 | 89.253778 | 78.431151 | 77.248255 |
| 6 | 1.522753 | 4.608829 | 61.406092 | 68.253349 | 58.524152 |
| 7 | 8.795585 | 7.981011 | 40.596071 | 74.570794 | 100.0 |
| 8 | 6.410035 | 4.03037 | 89.796009 | 80.136691 | 100.0 |
| 9 | 7.372653 | 4.964848 | 45.020767 | 87.615793 | 100.0 |



📝 Description:
This dataset contains simulated student data with the following features:
- Study_Hours: Number of hours spent studying per week
- Sleep_Hours: Average hours of sleep per night
- Previous_Score: Previous exam score (out of 100)
- Attendance_Percent: Class attendance percentage
- Exam_Score: Current exam score (target variable, out of 100)


In [8]:
# Build multiple linear regression model
X_multi = df_multi[['Study_Hours', 'Sleep_Hours', 'Previous_Score', 'Attendance_Percent']]
y_multi = df_multi['Exam_Score']

# Split the data
X_train_multi, X_test_multi, y_train_multi, y_test_multi = train_test_split(
    X_multi, y_multi, test_size=0.2, random_state=42
)

# Create and train the model
model_multi = LinearRegression()
model_multi.fit(X_train_multi, y_train_multi)

# Make predictions
y_train_pred_multi = model_multi.predict(X_train_multi)
y_test_pred_multi = model_multi.predict(X_test_multi)

print(f"🎯 Multiple Regression Model Parameters:")
feature_names = X_multi.columns  # same four feature names used to build X_multi
for feature, coef in zip(feature_names, model_multi.coef_):
    print(f"   {feature}: {coef:.3f}")
print(f"   Intercept: {model_multi.intercept_:.3f}")

print(f"\n📈 Multiple Regression Performance:")
print(f"   Training R² Score: {r2_score(y_train_multi, y_train_pred_multi):.3f}")
print(f"   Test R² Score: {r2_score(y_test_multi, y_test_pred_multi):.3f}")
print(f"   Training MSE: {mean_squared_error(y_train_multi, y_train_pred_multi):.3f}")
print(f"   Test MSE: {mean_squared_error(y_test_multi, y_test_pred_multi):.3f}")

🎯 Multiple Regression Model Parameters:
   Study_Hours: 3.438
   Sleep_Hours: 0.848
   Previous_Score: 0.122
   Attendance_Percent: 0.102
   Intercept: 51.191

📈 Multiple Regression Performance:
   Training R² Score: 0.685
   Test R² Score: 0.691
   Training MSE: 39.618
   Test MSE: 47.660


### 🧮 How Multiple Regression Predicts Exam Scores

A **multiple linear regression** model predicts the target using this equation:

### Multiple Linear Regression:

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \epsilon$$

Where:
- $y$ = Exam Score (what we predict)
- $\beta_0$ = Intercept (baseline score)
- $\beta_1, \beta_2, \beta_3, \beta_4$ = Coefficients for each feature
- $x_1, x_2, x_3, x_4$ = Study Hours, Sleep Hours, Previous Score, Attendance
- $\epsilon$ = Error term

**Example prediction with your model:**

$$\text{Exam Score} = 51.191 + 3.438 \times \text{Study Hours} + 0.848 \times \text{Sleep Hours} + 0.122 \times \text{Previous Score} + 0.102 \times \text{Attendance}$$
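
To see the equation at work, the sketch below plugs one student into it by hand. The coefficients are the ones printed by the model above; the student's feature values are hypothetical.

```python
import numpy as np

# Coefficients reported by the fitted model above
intercept = 51.191
coefs = np.array([3.438, 0.848, 0.122, 0.102])

# Hypothetical student: 5 study hours, 7 sleep hours,
# previous score 70, attendance 80%
student = np.array([5.0, 7.0, 70.0, 80.0])

# Weighted sum of the features plus the intercept
predicted = intercept + coefs @ student
print(f"Predicted exam score: {predicted:.3f}")
```

The terms are 17.190 + 5.936 + 8.540 + 8.160 on top of the intercept, giving a predicted score of about 91.0. Calling `model_multi.predict` on the same row performs this calculation with the unrounded coefficients.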

### Cost Function (Mean Squared Error):

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y_i})^2$$

The goal is to find the best coefficients that **minimize** this cost function!


<a id='section6'></a>

## 6. 📏 Model Evaluation

Understanding how to evaluate your linear regression model is crucial!

In [9]:
# Comprehensive model evaluation, computed directly with scikit-learn metrics
# Make predictions
y_pred = model_multi.predict(X_test_multi)

# Calculate various metrics
r2 = r2_score(y_test_multi, y_pred)
mse = mean_squared_error(y_test_multi, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test_multi, y_pred)

# Additional metrics
n = len(y_test_multi)
p = X_test_multi.shape[1]  # number of features
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Mean and standard deviation of residuals
residuals = y_test_multi - y_pred
residual_mean = np.mean(residuals)
residual_std = np.std(residuals)

print("📊 COMPREHENSIVE MODEL EVALUATION")
print("=" * 50)

print(f"\n🎯 Accuracy Metrics:")
print(f"   R² Score: {r2:.4f}")
print(f"   Adjusted R²: {adjusted_r2:.4f}")
print(f"   Mean Squared Error (MSE): {mse:.4f}")
print(f"   Root Mean Squared Error (RMSE): {rmse:.4f}")
print(f"   Mean Absolute Error (MAE): {mae:.4f}")

print(f"\n🔍 Residual Analysis:")
print(f"   Residual Mean: {residual_mean:.4f} (should be close to 0)")
print(f"   Residual Std Dev: {residual_std:.4f}")

print(f"\n💡 Interpretation:")
print(f"   • R² = {r2:.3f} means {r2*100:.1f}% of variance is explained")
print(f"   • On average, predictions are off by {mae:.1f} points (MAE)")
print(f"   • RMSE of {rmse:.1f} penalizes larger errors more than MAE")

📊 COMPREHENSIVE MODEL EVALUATION

🎯 Accuracy Metrics:
   R² Score: 0.6912
   Adjusted R²: 0.6560
   Mean Squared Error (MSE): 47.6602
   Root Mean Squared Error (RMSE): 6.9036
   Mean Absolute Error (MAE): 5.9182

🔍 Residual Analysis:
   Residual Mean: -0.0780 (should be close to 0)
   Residual Std Dev: 6.9032

💡 Interpretation:
   • R² = 0.691 means 69.1% of variance is explained
   • On average, predictions are off by 5.9 points (MAE)
   • RMSE of 6.9 penalizes larger errors more than MAE
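
The RMSE and adjusted R² above can be reproduced from the other printed numbers. This sketch assumes the test set holds 40 rows (20% of the 200-row dataset) and uses the rounded R² and MSE from the output, so the last digit may differ slightly from the notebook's value.

```python
import numpy as np

r2 = 0.6912      # test R² as printed above (rounded)
mse = 47.6602    # test MSE as printed above
n, p = 40, 4     # assumed test-set size (20% of 200 rows) and number of features

# RMSE is the square root of the MSE, so it is in the same units as the target
rmse = np.sqrt(mse)

# Adjusted R² penalizes R² for the number of features used
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"RMSE ≈ {rmse:.3f}, adjusted R² ≈ {adjusted_r2:.3f}")
```

The results, roughly 6.90 and 0.656, agree with the reported 6.9036 and 0.6560 up to rounding.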


<a id='section7'></a>

## 🏠 Exercise 1: House Price Prediction Challenge

Build a linear regression model to predict house prices and discover key factors that influence property values!

### 📊 The Dataset

The `simple_house_price_dataset.csv` dataset contains:
- **Size_SqFt**: House size in square feet
- **Location_Score**: Location quality (1-10 scale)
- **Age_Years**: Age of the house
- **Price_USD**: House price (target variable)

### 🎯 Your Mission

Create a model that predicts house prices by:
1. **Exploring** the data with quick statistics and visualizations
2. **Building** a multiple linear regression model
3. **Evaluating** performance with R² and MSE
4. **Interpreting** which factors impact price most
5. **Predicting** prices for sample properties

### 📝 Steps to Complete

1. **Explore**: Check statistics and correlations
2. **Visualize**: Create scatter plots
3. **Build**: Use multiple linear regression
4. **Evaluate**: Calculate R², MSE metrics
5. **Interpret**: Analyze coefficients
6. **Predict**: Test on sample houses

Ready to become a real estate pricing expert? 🚀

In [None]:
# Exercise 1 - Solution Space
# TODO: Add your code here to explore the house price dataset
# 1. Create visualizations to understand the data
# 2. Check correlations between features and price
# 3. Build a linear regression model
# 4. Evaluate the model performance

# Example starter code (uncomment and complete):
# df_houses = pd.read_csv('simple_house_price_dataset.csv')
# print("📊 Dataset Statistics:")
# display(df_houses.describe())

# print("\n🔗 Correlations:")
# print(df_houses.corr())

# Your code here...
pass

## 🏠 Exercise 2: Advanced Feature Engineering

Take your house price model to the next level by applying feature engineering techniques to improve prediction accuracy!

### 📊 Enhanced Features

Building on Exercise 1, explore additional features:
- **Ceiling_Height_Ft**: Ceiling height in feet
- **Garage_Size_Cars**: Garage capacity
- **Distance_to_Metro_km**: Distance to nearest metro station

### 🔧 Engineering Techniques

Try these transformations to boost model performance:
- **Polynomial features**: `Size_SqFt²`, `Age_Years³`
- **Logarithmic features**: `log(Size_SqFt)`, `log(Distance_to_Metro_km + 1)`
- **Interaction terms**: `Size_SqFt × Location_Score`
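
The transformations listed above might be built with pandas along these lines. The three rows of data are made up, and the new column names (`Size_SqFt_sq`, `Log_Distance`, and so on) are hypothetical choices rather than names the dataset defines.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; values are invented for illustration only
df_toy = pd.DataFrame({
    "Size_SqFt": [1000.0, 1500.0, 2000.0],
    "Age_Years": [5.0, 20.0, 40.0],
    "Location_Score": [7.0, 5.0, 9.0],
    "Distance_to_Metro_km": [0.5, 2.0, 10.0],
})

engineered = df_toy.assign(
    # Polynomial features
    Size_SqFt_sq=df_toy["Size_SqFt"] ** 2,
    Age_Years_cubed=df_toy["Age_Years"] ** 3,
    # Logarithmic features (log1p computes log(x + 1), safe at zero distance)
    Log_Size=np.log(df_toy["Size_SqFt"]),
    Log_Distance=np.log1p(df_toy["Distance_to_Metro_km"]),
    # Interaction term
    Size_x_Location=df_toy["Size_SqFt"] * df_toy["Location_Score"],
)

print(engineered.columns.tolist())
```

Because every new column is just another input, the resulting model remains linear in its parameters and can still be fitted with `LinearRegression`.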

### 📈 Your Challenge

1. Compare basic vs. engineered models
2. Identify which features contribute most
3. Visualize how feature engineering improves predictions

Can you create a model that outperforms your Exercise 1 solution? 🚀


In [None]:
# Exercise 2 - Solution Space
# TODO: Add your code here to engineer features for the house price dataset
# 1. Add polynomial, logarithmic, and interaction features
# 2. Fit a basic model and an engineered model on the same train/test split
# 3. Compare their R² and MSE on the test set
# 4. Identify which features contribute most

# Example starter code (uncomment and complete):
# print("📊 Dataset Statistics:")
# display(df_houses.describe())

# print("\n🔗 Correlations:")
# print(df_houses.corr())

# Your code here...
pass

## 🎉 Congratulations!

You've completed the Interactive Linear Regression Learning Playbook! Here's what you've learned:

### 📚 Key Takeaways:
1. **Linear Regression Fundamentals**: Understanding the mathematical foundation and when to use it
2. **Interactive Learning**: How adjusting parameters affects model performance
3. **Real-world Application**: Working with the student scores dataset
4. **Multiple Features**: Extending to multiple linear regression
5. **Model Evaluation**: Comprehensive metrics and interpretation
6. **Hands-on Practice**: Building your own models

### 🚀 Next Steps:
- Try the practice exercises with different datasets
- Experiment with feature engineering
- Explore regularization techniques (Ridge, Lasso)
- Learn about polynomial regression
- Study advanced regression techniques

### 📖 Additional Resources:
- [Scikit-learn Linear Regression Documentation](https://scikit-learn.org/stable/modules/linear_model.html)
- [Student Scores Dataset](https://www.kaggle.com/datasets/shubham47/students-score-dataset-linear-regression)
- [Multiple Linear Regression Dataset](https://www.kaggle.com/datasets/hussainnasirkhan/multiple-linear-regression-dataset)

Happy learning! 🎓