# **Supervised Machine Learning: Regression Lab**
This notebook demonstrates how to build and evaluate several regression models—**Linear Regression**, **K-Neighbors Regressor**, and **Support Vector Regression**—using the California Housing dataset. We'll compare performance via MSE, MAE, and R² metrics.

---

## 🔗**Library Overview**

- **NumPy & pandas**: For data manipulation and numerical calculations.
- **Matplotlib**: Optional for visualizing data or residual plots.
- **scikit-learn**:
  - `fetch_california_housing()`: Built-in dataset with California housing features.
  - `train_test_split()`: Splitting into train and test sets.
  - `StandardScaler()`: For scaling features (optional here, but distance- and margin-based models such as KNN and SVR usually benefit from it).
  - **Models**: `LinearRegression`, `KNeighborsRegressor`, `SVR`.
  - **Metrics**: MSE, MAE, R² score for regression tasks.


## **Step 1: Import Required Libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

**Explanation**  
Here we import the essential libraries for:
- Data handling (`numpy`, `pandas`)
- Visualization (`matplotlib` - in case we want to plot)
- Machine learning utilities from **scikit-learn** (train/test split, metrics)
- Feature scaling (`StandardScaler`) if needed


## **Step 2: Load Dataset**


In [None]:
from sklearn.datasets import fetch_california_housing

# Load the dataset
data = fetch_california_housing()

# Separate features (X) and target (y)
X, y = data.data, data.target

# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=42)

# Display the number of training/testing samples
print(f'Training samples: {X_train.shape[0]}, Testing samples: {X_test.shape[0]}')


Training samples: 16512, Testing samples: 4128


**Explanation**  
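We load the built-in **California Housing** dataset from scikit-learn, which describes California districts (census block groups) using eight numeric features:
- `MedInc`: median income
- `HouseAge`: median house age
- `AveRooms` and `AveBedrms`: average number of rooms and bedrooms per household
- `Population` and `AveOccup`: district population and average household size
- `Latitude` and `Longitude`: location of the district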

The target variable (`y`) is the median house value of each district, expressed in units of 100,000 USD (so a value of 2.5 means 250,000 USD).  
We then split the dataset into **training** (80%) and **test** (20%) subsets, printing the sample counts for each.
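
The 80/20 arithmetic can be checked without downloading anything. The sketch below uses placeholder arrays (`X_stub`, `y_stub`, both hypothetical) with the same number of rows as the two printed counts add up to, and assumes eight feature columns. It also shows that fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 16512 + 4128  # total rows implied by the printed counts
# Placeholder data: 8 columns assumed, values are irrelevant for this check
X_stub = np.arange(n_samples * 8).reshape(n_samples, 8)
y_stub = np.arange(n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_stub, y_stub, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same seed assigns the same rows to the test set
_, X_te_again, _, _ = train_test_split(
    X_stub, y_stub, test_size=0.2, random_state=42
)
split_is_reproducible = np.array_equal(X_te, X_te_again)
print(split_is_reproducible)
```

Changing or omitting `random_state` would give a different partition on each run, which makes metric comparisons between models harder to interpret.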


## **Step 3: Train and Evaluate Regression Models**

### **Brief Overview of Each Regression Algorithm**

In this lab, we compare three popular regression algorithms: **Linear Regression**, **K-Nearest Neighbors (KNN) Regression**, and **Support Vector Regression (SVR)**.

#### **1. Linear Regression**

- **Key Idea**: Linear Regression models the relationship between one or more independent variables (features) and a continuous target by fitting a **linear** function. Typically:
  
  $$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n
$$

  
  The parameters (\(\beta_i\)) are learned by **minimizing** the sum of squared residuals (Ordinary Least Squares).
  
- **Use Cases**: Predicting continuous values such as house prices, sales numbers, or any linear-ish trend.
- **Pros**: Simple, fast, easily interpretable coefficients (you can see how each feature affects the outcome).
- **Cons**: Assumes a linear relationship. Not well-suited if the data is significantly non-linear or has many complex interactions.

>>>![Linear Regression Diagram](https://miro.medium.com/v2/resize%3Afit%3A1358/0%2AQp-81eR-lw3zr4Do.png)  
*Figure: Illustration of a Linear Regression model depicting the relationship between an independent variable and a dependent variable in Machine Learning.*
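
To make the least-squares idea concrete, here is a minimal sketch that recovers the coefficients of a known linear function with NumPy alone. The data, `true_beta`, and `true_intercept` are made up for illustration; `np.linalg.lstsq` is used as one way to solve the least-squares problem, not necessarily what scikit-learn does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([1.5, -2.0, 0.5])   # hypothetical coefficients
true_intercept = 4.0                     # hypothetical beta_0
y = true_intercept + X @ true_beta + rng.normal(scale=0.01, size=200)

# Prepend a column of ones so beta_0 is estimated with the other coefficients
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: minimises the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

intercept_hat, coef_hat = beta_hat[0], beta_hat[1:]
print("intercept:", round(float(intercept_hat), 3))
print("coefficients:", coef_hat.round(3))
```

With so little noise, the estimates should land very close to the values used to generate the data.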


---

#### **2. K-Nearest Neighbors (KNN) Regression**

- **Key Idea**: Instead of learning an explicit function, KNN **stores** the training data and **predicts** the target for a new data point by looking at the **average (mean)** of the **k nearest neighbors** in feature space.
  
- **Use Cases**: Works reasonably well for smaller datasets with low to moderate dimensionality, where a non-parametric approach might capture local patterns better than a global linear model.
- **Pros**:  
  - Straightforward, intuitive.  
  - No real “training” phase, because data is just stored.
- **Cons**:  
  - Can be **computationally expensive** at prediction time (you have to find neighbors).  
  - Performance can degrade as the number of features grows (curse of dimensionality).  
  - Sensitive to outliers and scaling of features (it’s often beneficial to scale or normalize your data).

>>>![KNN Regression Diagram](https://media.geeksforgeeks.org/wp-content/uploads/20240617141430/download-%2819%29.png)  
*Figure: Illustration of K-Nearest Neighbors (KNN) regression, showing how predictions are made by averaging the k nearest neighbors in feature space.*
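
The prediction rule is short enough to write by hand. The sketch below is a brute-force illustration with a hypothetical helper `knn_predict` and toy data; scikit-learn's `KNeighborsRegressor` may use faster neighbor-search structures internally.

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=5):
    """Predict each query target as the mean target of its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train)
    diffs = X_query[:, None, :] - X_train[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # Indices of the k smallest distances in each row
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Toy 1-D data where the target equals the feature
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

# The three nearest neighbors of 1.4 are the points at 1.0, 2.0 and 0.0
pred = knn_predict(X_toy, y_toy, np.array([[1.4]]), k=3)
print(pred)  # mean of targets 1, 2 and 0 -> [1.]
```

Note that the distant point at 10.0 never influences this prediction, which is the "local" behavior described above.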


---

#### **3. Support Vector Regression (SVR)**

- **Key Idea**: SVR applies the principles of **Support Vector Machines** to regression tasks. Instead of trying to **classify** data points, it tries to fit them within a **tube** (often described by parameter \(\epsilon\)) around a **best-fit line** (or hyperplane).  
  - Minimizes a loss function that ignores errors less than \(\epsilon\), focusing on finding a flat hyperplane that fits most points within that margin.
- **Use Cases**: Good when the relationship between features and target is complex or high-dimensional. Can use the **kernel trick** (e.g., RBF or polynomial kernels) to capture non-linearities.
- **Pros**:  
  - Can model **non-linear relationships** (with kernels).  
  - Has **regularization** built in, can be less prone to overfitting if tuned properly.
- **Cons**:  
  - Requires **hyperparameter tuning** (C, \(\epsilon\), kernel parameters).  
  - Not as immediately interpretable as Linear Regression (especially with non-linear kernels).  
  - Can be slower on large datasets compared to simpler methods.

>>>![Support Vector Regression Diagram](https://cdn-images-1.medium.com/max/1600/1%2Ars0EfF8RPVpgA-EfgAq85g.jpeg)  
*Figure: Illustration of Support Vector Regression (SVR) showing the epsilon-insensitive tube and support vectors.*
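
The "tube" corresponds to the epsilon-insensitive loss, which can be sketched in a few lines of NumPy. The helper name `epsilon_insensitive_loss` and the sample values are illustrative only; this shows the per-sample loss term, not the full SVR optimization.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the tube, linear in the excess outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 1.0, 1.0, 1.0])
y_pred = np.array([1.0, 1.05, 1.5, 0.0])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print(losses)  # errors of 0 and 0.05 cost nothing; 0.5 and 1.0 cost 0.4 and 0.9
```

Unlike squared error, small deviations are ignored entirely and large ones grow only linearly, which is one reason SVR can be less sensitive to outliers.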


---

### **Choosing the Right Algorithm**

- **Linear Regression**: Start here when you suspect a roughly linear relationship and need interpretability.
- **KNN Regression**: Works well for smaller datasets or when relationships are highly non-linear but can be captured by local neighborhoods.
- **SVR**: Useful for more complex, high-dimensional data, or when you want a robust method that can handle non-linearity (via kernels), but be prepared for hyperparameter tuning.
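
When the choice is not obvious, a small loop that fits every candidate and collects the same metrics makes side-by-side comparison easy. The sketch below runs on synthetic stand-in data from `make_regression` (not the California Housing set), and the variable names (`candidates`, `results`, `Xd_train`, ...) are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

# Synthetic stand-in data, generated from a linear model plus noise
X_demo, y_demo = make_regression(
    n_samples=500, n_features=8, noise=10.0, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

candidates = {
    "LinearRegression": LinearRegression(),
    "KNeighborsRegressor": KNeighborsRegressor(n_neighbors=5),
    "SVR (linear kernel)": SVR(kernel="linear"),
}

results = {}
for name, model in candidates.items():
    model.fit(Xd_train, yd_train)
    pred = model.predict(Xd_test)
    results[name] = {
        "MSE": mean_squared_error(yd_test, pred),
        "MAE": mean_absolute_error(yd_test, pred),
        "R2": r2_score(yd_test, pred),
    }

for name, scores in results.items():
    print(f"{name:22s} MSE={scores['MSE']:.2f}  MAE={scores['MAE']:.2f}  R2={scores['R2']:.2f}")
```

Because this synthetic target is linear by construction, the ranking it produces says nothing about how the models rank on real housing data; the point is the comparison pattern.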



### **Linear Regression**

In [None]:
from sklearn.linear_model import LinearRegression

# Initialize Linear Regression
lr = LinearRegression()

# Train the model
lr.fit(X_train, y_train)

# Predict on the test set
y_pred = lr.predict(X_test)

# Evaluate the model
print(f'MSE: {mean_squared_error(y_test, y_pred):.2f}')
print(f'MAE: {mean_absolute_error(y_test, y_pred):.2f}')
print(f'R2 Score: {r2_score(y_test, y_pred):.2f}')


MSE: 0.56
MAE: 0.53
R2 Score: 0.58


**Explanation**  
1. We instantiate a **LinearRegression** model and fit it to `(X_train, y_train)`.  
2. We then predict on the test set `X_test`.  
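3. We compute:
   - **MSE (Mean Squared Error)**: average of the squared differences between predicted and actual values; large errors are penalized more heavily.  
   - **MAE (Mean Absolute Error)**: average of the absolute differences. Because the target is in units of 100,000 USD, an MAE of 0.53 means predictions are off by roughly 53,000 USD on average.  
   - **R² Score**: proportion of the target's variance explained by the model (1.0 = perfect, 0.0 = no better than always predicting the mean).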
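
The three metrics are simple enough to reproduce by hand. The sketch below uses four made-up target values and predictions (`y_true`, `y_hat`) so that each intermediate number can be checked mentally.

```python
import numpy as np

y_true = np.array([3.0, 1.0, 2.0, 4.0])   # made-up actual values
y_hat = np.array([2.5, 1.0, 2.0, 5.0])    # made-up predictions

residuals = y_true - y_hat                        # [0.5, 0, 0, -1]
mse = np.mean(residuals ** 2)                     # (0.25 + 0 + 0 + 1) / 4
mae = np.mean(np.abs(residuals))                  # (0.5 + 0 + 0 + 1) / 4
ss_res = np.sum(residuals ** 2)                   # variation left unexplained
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation around the mean
r2 = 1.0 - ss_res / ss_tot

print(f"MSE: {mse:.4f}, MAE: {mae:.4f}, R2: {r2:.4f}")
# MSE: 0.3125, MAE: 0.3750, R2: 0.7500
```

The R² formula also shows why the score can be negative: if `ss_res` exceeds `ss_tot`, the model is doing worse than a constant prediction of the mean.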


### **K-Nearest Neighbors Regression**

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# Initialize a KNN regressor with 5 neighbors
knn = KNeighborsRegressor(n_neighbors=5)

# Train the model
knn.fit(X_train, y_train)

# Predict on the test set
y_pred = knn.predict(X_test)

# Evaluate the model
print(f'MSE: {mean_squared_error(y_test, y_pred):.2f}')
print(f'MAE: {mean_absolute_error(y_test, y_pred):.2f}')
print(f'R2 Score: {r2_score(y_test, y_pred):.2f}')


MSE: 1.12
MAE: 0.81
R2 Score: 0.15


**Explanation**  
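1. **KNeighborsRegressor** is a simple, instance-based learning technique that predicts a value by averaging the targets of its 'k' nearest neighbors in feature space.  
2. We set `n_neighbors=5` (the default) and fit the model.  
3. As before, we predict on the test data and compute MSE, MAE, and R².  
4. The R² of 0.15 is far below Linear Regression's 0.58. A likely cause is that the features are unscaled: KNN relies on distances, so large-magnitude features such as population can dominate the neighbor search.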
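
Since KNN depends on distances, feature scale can matter a great deal. The sketch below illustrates this on synthetic data (not the housing set): one small-scale feature drives the target while a large-scale `nuisance` feature is pure noise. All names and values are made up, and wrapping the scaler and model with `make_pipeline` is one common pattern rather than the only option.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000
informative = rng.normal(size=n)                  # small-scale feature that drives the target
nuisance = rng.normal(scale=10_000.0, size=n)     # large-scale feature unrelated to the target
X_syn = np.column_stack([informative, nuisance])
y_syn = 3.0 * informative + rng.normal(scale=0.1, size=n)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Same model, with and without standardisation in front of it
knn_raw = KNeighborsRegressor(n_neighbors=5).fit(Xs_train, ys_train)
knn_scaled = make_pipeline(
    StandardScaler(), KNeighborsRegressor(n_neighbors=5)
).fit(Xs_train, ys_train)

r2_raw = r2_score(ys_test, knn_raw.predict(Xs_test))
r2_scaled = r2_score(ys_test, knn_scaled.predict(Xs_test))
print(f"R2 without scaling: {r2_raw:.2f}")
print(f"R2 with scaling:    {r2_scaled:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so no test-set statistics leak into training.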


### **Support Vector Regression**

In [None]:
from sklearn.svm import SVR

# Initialize an SVR with a linear kernel
svr = SVR(kernel='linear')

# Train the model
svr.fit(X_train, y_train)

# Predict
y_pred = svr.predict(X_test)

# Evaluate performance
print(f'MSE: {mean_squared_error(y_test, y_pred):.2f}')
print(f'MAE: {mean_absolute_error(y_test, y_pred):.2f}')
print(f'R2 Score: {r2_score(y_test, y_pred):.2f}')


**Explanation**  
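1. We use **SVR** (Support Vector Regression) with a linear kernel.  
2. Fit the model on `(X_train, y_train)`.  
3. Predict on `X_test` and compute MSE, MAE, R² to see how well the model explains the variance in the housing prices.  
4. Be aware that `SVR` training time grows faster than linearly with the number of training samples, so this cell can take much longer to run than the previous two, particularly on unscaled features.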
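
To get a feel for what `epsilon` does, the sketch below fits a scaled linear-kernel SVR on a small synthetic dataset (not the housing set) with a narrow and a wide tube, then counts the support vectors via the fitted model's `support_` attribute. The data, the two `epsilon` values, and the name `support_counts` are illustrative choices.

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data; the target is standardised so epsilon is on a comparable scale
X_svr, y_svr = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
y_svr = (y_svr - y_svr.mean()) / y_svr.std()

support_counts = {}
for eps in (0.01, 0.5):
    pipe = make_pipeline(StandardScaler(), SVR(kernel="linear", epsilon=eps))
    pipe.fit(X_svr, y_svr)
    # Support vectors are the training points lying on or outside the epsilon-tube
    support_counts[eps] = len(pipe.named_steps["svr"].support_)

print(support_counts)
```

A wider tube leaves more points strictly inside it, so fewer of them end up as support vectors; that usually means a sparser model at the cost of a looser fit.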


## 📚**OPTIONAL: Practice Tasks**

1. **Experiment with Feature Scaling**  
   - In this notebook, we didn’t demonstrate feature scaling in depth. Try applying a `StandardScaler` or `MinMaxScaler` on your features before training KNN or SVR.  
   - Observe how scaling impacts performance metrics (MSE, MAE, R²).  

2. **Hyperparameter Tuning**  
   - Explore hyperparameter tuning with `GridSearchCV` or `RandomizedSearchCV`:  
     - **KNN**: Number of neighbors (`n_neighbors`), distance metrics (Euclidean, Manhattan), weighting scheme (`uniform` vs. `distance`).  
     - **SVR**: Kernel type (`linear`, `rbf`, `poly`), regularization parameter `C`, tube width `epsilon` in the epsilon-insensitive loss, or kernel parameters (e.g., `gamma` for RBF).  
   - Compare results to see which parameters yield the best performance.

3. **Polynomial & Interaction Features**  
   - Apply `PolynomialFeatures` from `sklearn.preprocessing` to create polynomial and interaction terms (e.g., degree=2) and see if it improves Linear Regression results.  
   - Note if overfitting occurs and how to mitigate it (e.g., regularization).

4. **Cross-Validation**  
   - Instead of a single train/test split, use **k-fold cross-validation** to get a more robust estimate of each model’s generalization ability.  
   - Check variance in cross-validation scores (e.g., using `cross_val_score`).

5. **Compare with Other Regression Algorithms**  
   - Try additional models like **Random Forest Regressor**, **Gradient Boosting**, or **Ridge/Lasso** (regularized linear models).  
   - Observe how the performance compares, and consider trade-offs in terms of interpretability vs. accuracy.    
   
6. **Visualization Tasks**  
   - Create **residual plots** to see the distribution of errors for each model.  
   - Try partial dependence plots or feature importance (especially if you experiment with tree-based models) to understand which features matter most.
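
As a possible starting point for task 3, the sketch below shows `PolynomialFeatures` rescuing a linear model on synthetic data whose target is built from a squared term and an interaction term. The data-generating formula and all variable names are made up for illustration; results on the housing data may look quite different.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X_poly = rng.uniform(-2, 2, size=(400, 2))
# Target depends only on x0^2 and x0*x1, which a plain linear model cannot represent
y_poly = X_poly[:, 0] ** 2 + X_poly[:, 0] * X_poly[:, 1] + rng.normal(scale=0.1, size=400)

Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    X_poly, y_poly, test_size=0.2, random_state=42
)

plain = LinearRegression().fit(Xp_train, yp_train)
expanded = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False), LinearRegression()
).fit(Xp_train, yp_train)

r2_plain = r2_score(yp_test, plain.predict(Xp_test))
r2_expanded = r2_score(yp_test, expanded.predict(Xp_test))
print(f"R2 plain linear:     {r2_plain:.2f}")
print(f"R2 with degree-2:    {r2_expanded:.2f}")
```

With eight input features, a degree-2 expansion produces 44 columns, so the risk of overfitting grows with the degree; a regularized model such as Ridge is a natural follow-up.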

By pursuing these tasks, you’ll gain practical insights into **why** each model performs the way it does, **how** to systematically improve performance, and **when** a certain model or technique might be best for a real-world scenario.
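
Tasks 2 and 4 can share one scaffold. The sketch below, on synthetic stand-in data, cross-validates a scaled KNN pipeline and then searches a deliberately tiny grid. The step names (`scale`, `knn`), the grid values, and the data are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the California Housing set)
X_cv, y_cv = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsRegressor())])

# Task 4: 5-fold cross-validated R2 for the default settings
cv_scores = cross_val_score(pipe, X_cv, y_cv, cv=5, scoring="r2")
print(f"R2 per fold: {cv_scores.round(2)}, mean: {cv_scores.mean():.2f}")

# Task 2: grid search over pipeline parameters, addressed as <step>__<parameter>
param_grid = {
    "knn__n_neighbors": [3, 5, 9],
    "knn__weights": ["uniform", "distance"],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_cv, y_cv)
print(search.best_params_, round(search.best_score_, 2))
```

The spread of `cv_scores` across folds gives a rough sense of how much a single train/test split could mislead, and `search.best_estimator_` is refitted on all the data passed to `fit`.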


## 🔗**Further Reading**
1. **Scikit-Learn Documentation**:
   - [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)
   - [KNeighborsRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html)
   - [SVR](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)

2. **Hands-On Machine Learning with Scikit-Learn & TensorFlow**  
   - By Aurélien Géron. Detailed coverage of regression, hyperparameter tuning, and more.

3. **Kaggle Datasets**  
   - [House Prices - Advanced Regression Techniques](https://www.kaggle.com/c/house-prices-advanced-regression-techniques)
   - Try your regression models on real-world housing data.

4. **California Housing** Additional Notebooks  
   - Check out the [scikit-learn User Guide](https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset) for more examples.

5. **An Introduction to Statistical Learning** (James, Witten, Hastie, Tibshirani) [Free PDF](https://www.statlearning.com/)  
   - Covers regression models, regularization, and more sophisticated techniques in a more theoretical but accessible manner.
