## 1. What is Linear Regression? (5–7 minutes)

### 📌 Definition
**Linear Regression** is a **supervised learning** algorithm that predicts a **continuous target variable** from one or more **independent features (predictors)** by modeling the target as a weighted sum of the features plus an intercept.

It's one of the simplest and most widely used algorithms in machine learning and statistics.
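
As a minimal sketch of what "fitting a line" means, the snippet below recovers a slope and an intercept from a small synthetic dataset with NumPy's least-squares solver. The data and the names `true_w` and `true_b` are made up for illustration and are not part of the housing example used later.

```python
import numpy as np

# Synthetic data (assumed for illustration): y = 3x + 2 plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
true_w, true_b = 3.0, 2.0
y = true_w * x + true_b + rng.normal(scale=0.1, size=x.shape)

# Design matrix with a column of ones so the intercept is estimated too
A = np.column_stack([x, np.ones_like(x)])

# Ordinary least squares: minimize ||A @ [w, b] - y||^2
(w_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"estimated slope: {w_hat:.2f}, intercept: {b_hat:.2f}")
```

The estimates should land close to the values used to generate the data; with more noise or fewer points they would drift further away.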


### 🎯 Common Use Cases

- Predicting house prices
- Forecasting sales
- Estimating insurance risk
- Demand prediction
- Stock price estimation

---






In [1]:
# Linear Regression and Logistic Regression with preprocessing
# Using sklearn datasets: California Housing (regression) and Breast Cancer (classification)

# Import libraries
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score, accuracy_score, confusion_matrix

# ----------------------------
# PART 1: Linear Regression
# ----------------------------

print("### Linear Regression Example ###")

# Load California Housing dataset
housing = fetch_california_housing(as_frame=True)
df_housing = housing.frame

# Show first 5 rows before preprocessing
print("\nHousing Data - first 5 rows BEFORE preprocessing:")
df_housing.head()


### Linear Regression Example ###

Housing Data - first 5 rows BEFORE preprocessing:


,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


| Column Name     | Meaning / Description                                                                              |
| --------------- | -------------------------------------------------------------------------------------------------- |
| **MedInc**      | Median income in block group (in tens of thousands of dollars)                                     |
| **HouseAge**    | Median house age in block group (years)                                                            |
| **AveRooms**    | Average number of rooms per household                                                              |
| **AveBedrms**   | Average number of bedrooms per household                                                           |
| **Population**  | Block group population                                                                             |
| **AveOccup**    | Average number of household members                                                                |
| **Latitude**    | Block group latitude                                                                               |
| **Longitude**   | Block group longitude                                                                              |
| **MedHouseVal** | Median house value for California districts (target variable, in hundreds of thousands of dollars) |

In [2]:

# Features and target
X_h = df_housing.drop('MedHouseVal', axis=1)
y_h = df_housing['MedHouseVal']


In [4]:
X_h.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25


In [3]:
y_h.head()

0    4.526
1    3.585
2    3.521
3    3.413
4    3.422
Name: MedHouseVal, dtype: float64

In [4]:

# Check for missing values
print("\nMissing values in housing data:")
X_h.isnull().sum()



Missing values in housing data:


MedInc        0
HouseAge      0
AveRooms      0
AveBedrms     0
Population    0
AveOccup      0
Latitude      0
Longitude     0
dtype: int64

In [5]:

# Split data
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(X_h, y_h, test_size=0.2, random_state=42)


In [6]:
X_h.shape, y_h.shape, X_train_h.shape, X_test_h.shape, y_train_h.shape, y_test_h.shape

((20640, 8), (20640,), (16512, 8), (4128, 8), (16512,), (4128,))
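
The shapes above follow from simple arithmetic: 20% of 20,640 rows is 4,128 test rows, leaving 16,512 for training. The sketch below mimics the idea of a shuffled split with plain NumPy. It is an illustration only, not scikit-learn's implementation, and it will not select the same rows as `train_test_split(..., random_state=42)`.

```python
import numpy as np

n_samples = 20640   # rows in the housing data
test_size = 0.2

# Shuffle the row indices reproducibly, then cut off the test portion
rng = np.random.default_rng(42)
indices = rng.permutation(n_samples)
n_test = int(round(n_samples * test_size))
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(len(train_idx), len(test_idx))  # 16512 4128
```

Every row ends up in exactly one of the two sets, so the test set stays unseen during training.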

In [7]:
X_train_h.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
14196,3.2596,33.0,5.017657,1.006421,2300.0,3.691814,32.71,-117.03
8267,3.8125,49.0,4.473545,1.041005,1314.0,1.738095,33.77,-118.16
17445,4.1563,4.0,5.645833,0.985119,915.0,2.723214,34.66,-120.48
14265,1.9425,36.0,4.002817,1.033803,1418.0,3.994366,32.69,-117.11
2271,3.5542,43.0,6.268421,1.134211,874.0,2.3,36.78,-119.8


In [8]:
y_train_h.head()

14196    1.030
8267     3.821
17445    1.726
14265    0.934
2271     0.965
Name: MedHouseVal, dtype: float64

In [9]:
X_test_h.head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
20046,1.6812,25.0,4.192201,1.022284,1392.0,3.877437,36.06,-119.01
3024,2.5313,30.0,5.039384,1.193493,1565.0,2.679795,35.14,-119.46
15663,3.4801,52.0,3.977155,1.185877,1310.0,1.360332,37.8,-122.44
20484,5.7376,17.0,6.163636,1.020202,1705.0,3.444444,34.28,-118.72
9814,3.725,34.0,5.492991,1.028037,1063.0,2.483645,36.62,-121.93


In [10]:
y_test_h.head()

20046    0.47700
3024     0.45800
15663    5.00001
20484    2.18600
9814     2.78000
Name: MedHouseVal, dtype: float64

In [11]:

# Preprocessing - Feature Scaling
scaler = StandardScaler()

print("\nFeature means BEFORE scaling (training data):")
print(X_train_h.mean())

print("\nFeature std dev BEFORE scaling (training data):")
print(X_train_h.std())

# Fit scaler on training data and transform both train and test
X_train_h_scaled = scaler.fit_transform(X_train_h)
X_test_h_scaled = scaler.transform(X_test_h)

print("\nFeature means AFTER scaling (training data):")
print(X_train_h_scaled.mean(axis=0))  # approx 0

print("\nFeature std dev AFTER scaling (training data):")
print(X_train_h_scaled.std(axis=0))   # approx 1



Feature means BEFORE scaling (training data):
MedInc           3.880754
HouseAge        28.608285
AveRooms         5.435235
AveBedrms        1.096685
Population    1426.453004
AveOccup         3.096961
Latitude        35.643149
Longitude     -119.582290
dtype: float64

Feature std dev BEFORE scaling (training data):
MedInc           1.904294
HouseAge        12.602499
AveRooms         2.387375
AveBedrms        0.433215
Population    1137.056380
AveOccup        11.578744
Latitude         2.136665
Longitude        2.005654
dtype: float64

Feature means AFTER scaling (training data):
[-6.51933288e-17 -9.25185854e-18 -1.98108110e-16 -1.70729064e-16
 -2.15159501e-19  4.93656580e-17  6.40099515e-17  1.75333477e-15]

Feature std dev AFTER scaling (training data):
[1. 1. 1. 1. 1. 1. 1. 1.]
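
Standardization rescales each feature as `z = (x - mean) / std`, with the mean and standard deviation learned from the training set only. The sketch below reproduces that arithmetic with NumPy on a small made-up array (`train` and `test` are illustrative names, not the housing data). One detail worth knowing: `StandardScaler` divides by the population standard deviation (`ddof=0`), while pandas `.std()` defaults to the sample standard deviation (`ddof=1`), so the "before" values printed above differ very slightly from what the scaler actually uses.

```python
import numpy as np

# Made-up training and test features (two columns on very different scales)
train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])
test = np.array([[2.5, 400.0], [5.0, 900.0]])

# Learn the statistics from the training data only
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # ddof=0: population standard deviation

# Apply the same transformation to both sets
train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma

print(train_scaled.mean(axis=0))  # approximately 0
print(train_scaled.std(axis=0))   # approximately 1
print(test_scaled)                # not necessarily mean 0 / std 1
```

Reusing the training statistics on the test set is what keeps information about the test data from leaking into preprocessing.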


In [12]:
pd.DataFrame(X_train_h_scaled, columns=X_train_h.columns).head()

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,-0.326196,0.34849,-0.174916,-0.208365,0.768276,0.051376,-1.372811,1.272587
1,-0.035843,1.618118,-0.402835,-0.12853,-0.098901,-0.117362,-0.876696,0.709162
2,0.144701,-1.95271,0.088216,-0.257538,-0.449818,-0.03228,-0.460146,-0.447603
3,-1.017864,0.586545,-0.600015,-0.145156,-0.007434,0.077507,-1.382172,1.232698
4,-0.171488,1.142008,0.349007,0.086624,-0.485877,-0.068832,0.532084,-0.108551


In [13]:

# Train Linear Regression model on scaled data
lr_model = LinearRegression()
lr_model.fit(X_train_h_scaled, y_train_h)

# Predict and evaluate
y_pred_h = lr_model.predict(X_test_h_scaled)

mse = mean_squared_error(y_test_h, y_pred_h)
r2 = r2_score(y_test_h, y_pred_h)

print(f"\nLinear Regression Results:\nMSE: {mse:.4f}\nR2 Score: {r2:.4f}")



Linear Regression Results:
MSE: 0.5559
R2 Score: 0.5758



---

### ✅ 1. **R² Score (Coefficient of Determination)**

* **Purpose:** Measures how much of the variance in the target variable is explained by the model.
* **Range:**

  * 1.0 = perfect prediction
  * 0 = model does no better than predicting the mean
  * Negative = model is worse than just predicting the mean

#### **Formula:**

$$
R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
$$

* $y_i$: Actual value
* $\hat{y}_i$: Predicted value
* $\bar{y}$: Mean of actual values
* $n$: Number of data points

#### **Interpretation:**

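* If $R^2 = 0.85$, then 85% of the variance in the target variable is explained by the model.
* In the housing example above, $R^2 = 0.5758$, so the linear model explains roughly 58% of the variance in median house value on the test set.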
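
The formula can be checked by hand. The sketch below computes $R^2$ with NumPy for three sets of predictions on made-up values (the arrays are illustrative, not the housing results): a good fit, the mean baseline, and a deliberately poor fit that yields a negative score.

```python
import numpy as np

# Made-up actual values and three candidate sets of predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_good = np.array([2.5, 5.5, 7.0, 9.5])
y_mean = np.full_like(y_true, y_true.mean())   # always predict the mean
y_bad = np.array([9.0, 7.0, 5.0, 3.0])         # predictions in reverse order

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

print(r2(y_true, y_good))  # 0.9625
print(r2(y_true, y_mean))  # 0.0
print(r2(y_true, y_bad))   # -3.0
```

The three results line up with the range described above: close to 1 for a good fit, exactly 0 for the mean baseline, and negative when the model does worse than the baseline.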

---

### ✅ 2. **MSE (Mean Squared Error)**

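* **Purpose:** Measures the average of the squared errors (the differences between actual and predicted values).
* **Always non-negative** (zero only for perfect predictions); lower MSE means better performance.
* **Units:** Expressed in squared units of the target, so it is not directly comparable to the target values themselves.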

#### **Formula:**

$$
MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2
$$

* $y_i$: Actual value
* $\hat{y}_i$: Predicted value
* $n$: Number of samples
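
As a quick worked example, the sketch below computes MSE by hand on made-up values and takes its square root (RMSE) to get back to the units of the target. The last lines apply the same square root to the MSE of 0.5559 reported for the housing model; since the target is in units of $100,000, that corresponds to a typical error of roughly $75,000.

```python
import numpy as np

# Made-up actual and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

mse = np.mean((y_true - y_pred) ** 2)  # (0.25 + 0.25 + 0 + 0.25) / 4 = 0.1875
rmse = np.sqrt(mse)                    # same units as the target

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")

# RMSE for the MSE reported by the housing model above
housing_rmse = np.sqrt(0.5559)
print(f"Housing RMSE: {housing_rmse:.3f} (target is in units of $100,000)")
```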

---

### 🔍 Summary Comparison

| Metric  | Ideal Value | Type     | Indicates                            |
| ------- | ----------- | -------- | ------------------------------------ |
| **R²**  | Close to 1  | Relative | How well the model explains variance |
| **MSE** | Close to 0  | Absolute | Average error magnitude (squared)    |

---



In [14]:

# ----------------------------
# PART 2: Logistic Regression
# ----------------------------

print("\n\n### Logistic Regression Example ###")

# Load Breast Cancer dataset
cancer = load_breast_cancer()
X_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
# Target encoding in this dataset: 0 = malignant, 1 = benign
y_cancer = pd.Series(cancer.target)

print("\nBreast Cancer Data - first 5 rows BEFORE preprocessing:")
print(X_cancer.head())

# Check for missing values
print("\nMissing values in cancer data:")
print(X_cancer.isnull().sum())

# Split data
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(X_cancer, y_cancer, test_size=0.2, random_state=42)

# Preprocessing - Feature Scaling
scaler_c = StandardScaler()

print("\nFeature means BEFORE scaling (training data):")
print(X_train_c.mean())

print("\nFeature std dev BEFORE scaling (training data):")
print(X_train_c.std())

X_train_c_scaled = scaler_c.fit_transform(X_train_c)
X_test_c_scaled = scaler_c.transform(X_test_c)

print("\nFeature means AFTER scaling (training data):")
print(X_train_c_scaled.mean(axis=0))

print("\nFeature std dev AFTER scaling (training data):")
print(X_train_c_scaled.std(axis=0))

# Train Logistic Regression model on scaled data
log_model = LogisticRegression(max_iter=10000)
log_model.fit(X_train_c_scaled, y_train_c)

# Predict and evaluate
y_pred_c = log_model.predict(X_test_c_scaled)

accuracy = accuracy_score(y_test_c, y_pred_c)
conf_matrix = confusion_matrix(y_test_c, y_pred_c)

print(f"\nLogistic Regression Results:\nAccuracy: {accuracy:.4f}\nConfusion Matrix:\n{conf_matrix}")




### Logistic Regression Example ###

Breast Cancer Data - first 5 rows BEFORE preprocessing:
   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809  