## **Machine Learning Algorithm**

## **Type**: Supervised Learning 

## **Regression + Classification**

## **Day 2**: Ridge Regression + K-NN Algorithm

## **Student**: Muhammad Shafiq

-------------------------------------------

## **What is Ridge Regression?**

**Ridge Regression**, also known as **L2 regularization**, is a technique for analyzing multiple regression data that suffer from **multicollinearity**. It adds a penalty on the size of the regression **coefficients to prevent overfitting**.
Multicollinearity occurs when the independent variables in a regression are highly correlated with each other.

**Ridge regression** is a model-tuning method for data affected by multicollinearity. When multicollinearity occurs, the least-squares estimates remain unbiased, but their variances are large, so predicted values can fall far from the actual values. By accepting a small amount of bias, the **L2 penalty** reduces that variance.

**The cost function for ridge regression:**

$$\min_{\theta} \left( \lVert Y - X\theta \rVert^2 + \lambda \lVert \theta \rVert^2 \right)$$

**Lambda (λ)** controls the strength of the penalty. In scikit-learn's `Ridge` class, λ is set through the `alpha` parameter, so changing `alpha` changes the penalty term. The higher the value of `alpha`, the larger the penalty and the smaller the magnitude of the coefficients.

- It shrinks the parameters, which stabilizes the estimates when predictors are multicollinear (the correlation itself is not removed).
- It reduces model complexity through coefficient shrinkage.
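To make the shrinkage concrete, here is a minimal sketch that solves the penalized cost function in closed form with NumPy. The toy data, the helper name `ridge_closed_form`, and the omission of an intercept are illustrative assumptions, not part of the original notes.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve min ||y - X theta||^2 + lam * ||theta||^2 (no intercept term)."""
    n_features = X.shape[1]
    # Normal equations with the L2 penalty added to the diagonal.
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Hypothetical toy data: x2 is almost a copy of x1, so the columns are collinear.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=100)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
norms = []
for lam in lambdas:
    theta = ridge_closed_form(X, y, lam)
    norms.append(float(np.linalg.norm(theta)))
    print(f"lambda={lam:>6}: theta={np.round(theta, 3)}, ||theta||={norms[-1]:.3f}")
```

The printed coefficient norm should never grow as λ increases, which is the shrinkage effect described in the bullets above.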


### **Standardization** 

In ridge regression, the first step is to standardize the variables (both dependent and independent) by subtracting their means and dividing by their standard deviations. Standardization matters because the penalty treats all coefficients alike, so a predictor measured on a large scale would otherwise be penalized differently from one measured on a small scale. This complicates notation, since a formula must indicate whether its variables are standardized. All ridge regression calculations are based on the standardized variables, and the final regression coefficients are adjusted back to their original scale when displayed. The ridge trace, however, is shown on the standardized scale.
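The round trip between standardized and original scales can be sketched as follows. This is an illustrative NumPy example on made-up data; the variable names and the value `lam = 1.0` are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical predictors on very different scales.
X = np.column_stack([rng.normal(50, 10, 200), rng.normal(0.5, 0.1, 200)])
y = 2.0 * X[:, 0] - 30.0 * X[:, 1] + rng.normal(0, 1, 200)

# Standardize predictors and response: subtract the mean, divide by the std.
x_mean, x_std = X.mean(axis=0), X.std(axis=0)
y_mean, y_std = y.mean(), y.std()
X_s = (X - x_mean) / x_std
y_s = (y - y_mean) / y_std

# Ridge fit on the standardized variables (no intercept needed after centering).
lam = 1.0
theta_s = np.linalg.solve(X_s.T @ X_s + lam * np.eye(X.shape[1]), X_s.T @ y_s)

# Convert the coefficients back to the original units.
coef = theta_s * y_std / x_std
intercept = y_mean - x_mean @ coef

pred_from_standardized = (X_s @ theta_s) * y_std + y_mean
pred_from_original = X @ coef + intercept
print("coefficients on original scale:", coef)
print("intercept on original scale:", intercept)
```

Both prediction vectors agree, which shows that the back-transformed coefficients describe the same fitted model in the original units.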

#### **Bias and variance trade-off**

The bias-variance trade-off is generally complicated when building ridge regression models on an actual dataset. However, the general trend to remember is:

- The bias increases as λ increases.
- The variance decreases as λ increases.


Ridge regression introduces bias into the estimates to reduce their variance. The mean squared error (MSE) of the ridge estimator can be decomposed into bias and variance components:

- **Bias**: Measures the error introduced by approximating a real-world problem, which may be complex, by a simplified model. In ridge regression, as the regularization parameter λ (written k in some texts) increases, the model becomes simpler, which increases bias but reduces variance.

- **Variance**: Measures how much the ridge regression model's predictions would vary if we used different training data. As λ decreases, the model becomes more complex and fits the training data more closely, which reduces bias but increases variance.

- **Irreducible Error**: Represents the noise in the data that cannot be reduced by any model.
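For a fixed design matrix, the bias and variance of the ridge coefficient estimates have closed forms, so the trend can be checked numerically. The sketch below assumes a hypothetical true coefficient vector and noise variance; with A = (XᵀX + λI)⁻¹, the bias is −λAθ and the total variance is σ²·trace(A XᵀX A).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
theta_true = np.array([1.0, -2.0, 0.5])  # hypothetical true coefficients
sigma2 = 1.0                              # hypothetical noise variance

gram = X.T @ X
lambdas = [0.0, 1.0, 10.0, 100.0]
sq_bias, variance = [], []
for lam in lambdas:
    A = np.linalg.inv(gram + lam * np.eye(3))
    bias_vec = -lam * (A @ theta_true)               # E[theta_hat] - theta_true
    sq_bias.append(float(bias_vec @ bias_vec))       # squared bias
    variance.append(float(sigma2 * np.trace(A @ gram @ A)))  # total variance
    print(f"lambda={lam:>6}: bias^2={sq_bias[-1]:.4f}, variance={variance[-1]:.4f}")
```

Squared bias starts at zero for λ = 0 (ordinary least squares) and grows with λ, while the variance falls.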



## **Selection of the Ridge Parameter in Ridge Regression**

Choosing an appropriate value for the ridge parameter k (the penalty weight written λ in the cost function above) is crucial in ridge regression, as it directly influences the bias-variance tradeoff and the overall performance of the model. Several methods have been proposed for selecting the ridge parameter, each with its own advantages and limitations:

#### **1. Cross-Validation**

Cross-validation is a common method for selecting the ridge parameter by dividing data into subsets. The model trains on some subsets and validates on others, repeating this process and averaging the results to find the optimal value of k.

- **K-Fold Cross-Validation**: The data is split into K subsets, training on K-1 folds and validating on the remaining fold. This is repeated K times, with each fold serving as the validation set once.

- **Leave-One-Out Cross-Validation (LOOCV)**: A special case of K-fold where K equals the number of observations, training on all but one observation and validating on the remaining one. It is computationally intensive; its error estimate has low bias but can have high variance.
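As a rough sketch of K-fold selection, the loop below scores a small grid of candidate penalties with scikit-learn's `cross_val_score` on the bundled diabetes data. The grid values and the choice of 5 shuffled folds are assumptions made for illustration.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

alphas = [0.001, 0.01, 0.1, 1.0, 10.0]  # hypothetical candidate grid
cv = KFold(n_splits=5, shuffle=True, random_state=0)

mean_mse = {}
for alpha in alphas:
    # scikit-learn reports negated MSE so that larger is always better.
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=cv,
                             scoring="neg_mean_squared_error")
    mean_mse[alpha] = -scores.mean()
    print(f"alpha={alpha}: mean CV MSE={mean_mse[alpha]:.2f}")

best_alpha = min(mean_mse, key=mean_mse.get)
print("Selected alpha:", best_alpha)
```

The selected value is only the best within the grid, so a finer or wider grid may be worth trying around it.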

#### **2. Generalized Cross-Validation (GCV)**

Generalized Cross-Validation is an extension of cross-validation that provides a more efficient way to estimate the optimal k without explicitly dividing the data. GCV is based on the idea of minimizing a function that approximates the leave-one-out cross-validation error. It is computationally less intensive and often yields similar results to traditional cross-validation methods.
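One common form of the GCV criterion is GCV(λ) = n·‖(I − H)y‖² / (n − trace(H))², where H = X(XᵀX + λI)⁻¹Xᵀ is the hat matrix. The sketch below evaluates it with NumPy on synthetic, centered data; the helper name `gcv_score`, the data, and the grid are illustrative assumptions.

```python
import numpy as np

def gcv_score(X, y, lam):
    """Generalized cross-validation score for ridge with penalty lam."""
    n, p = X.shape
    # Hat matrix maps observed y to fitted values.
    hat = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    residual = y - hat @ y
    return n * float(residual @ residual) / (n - np.trace(hat)) ** 2

# Hypothetical data with two nearly collinear columns.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 5))
X[:, 4] = X[:, 3] + rng.normal(scale=0.05, size=80)
y = X @ np.array([1.0, 0.0, -1.0, 2.0, 0.0]) + rng.normal(size=80)
X = X - X.mean(axis=0)  # center so no intercept is needed
y = y - y.mean()

grid = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {lam: gcv_score(X, y, lam) for lam in grid}
best_lam = min(scores, key=scores.get)
for lam, score in scores.items():
    print(f"lambda={lam:>6}: GCV={score:.4f}")
print("Selected lambda:", best_lam)
```

No data splitting is needed: each candidate costs one linear solve on the full dataset.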

#### **3. Information Criteria**

Information criteria such as the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) can also be used to select the ridge parameter. These criteria balance the goodness of fit of the model with its complexity, penalizing models with more parameters.

#### **4. Empirical Bayes Methods**

Empirical Bayes methods involve estimating the ridge parameter by treating it as a hyperparameter in a Bayesian framework. These methods use prior distributions and observed data to estimate the posterior distribution of the ridge parameter.

- **Empirical Bayes Estimation**: This method involves specifying a prior distribution for k and using the observed data to update this prior to obtain a posterior distribution. The mode or mean of the posterior distribution is then used as the estimate of k.

#### **5. Stability Selection**

Stability selection improves ridge parameter robustness by subsampling data and fitting the model multiple times. The most frequently selected parameter across all subsamples is chosen as the final estimate.

#### **Applications of Ridge Regression**

- **Forecasting Economic Indicators**: Ridge regression helps predict economic factors like GDP, inflation, and unemployment by managing multicollinearity between predictors like interest rates and consumer spending, leading to more accurate forecasts.

- **Medical Diagnosis**: In healthcare, it aids in building diagnostic models by controlling multicollinearity among biomarkers, improving disease diagnosis and prognosis.

- **Sales Prediction**: In marketing, ridge regression forecasts sales based on factors like advertisement costs and promotions, handling correlations between these variables for better sales planning.

- **Climate Modeling**: Ridge regression can stabilize climate models when variables such as temperature and precipitation are strongly correlated, which supports more reliable predictions.

- **Risk Management**: In credit scoring and financial risk analysis, ridge regression evaluates creditworthiness by addressing multicollinearity among financial ratios, enhancing accuracy in risk management.

### **Advantages and Disadvantages of Ridge Regression**

##### **Advantages:**

- **Stability**: Ridge regression provides more stable estimates in the presence of multicollinearity.

- **Bias-Variance Tradeoff**: By introducing bias, ridge regression reduces the variance of the estimates, leading to lower MSE.

- **Interpretability**: Unlike principal component regression, ridge regression retains the original predictors, making the results easier to interpret.

##### **Disadvantages**:

- **Bias Introduction**: The introduction of bias can lead to underestimation of the true effects of the predictors.

- **Parameter Selection**: Choosing the optimal ridge parameter k can be challenging and computationally intensive.

- **Not Suitable for Variable Selection**: Ridge regression does not perform variable selection, meaning all predictors remain in the model, even those with negligible effects.

**Read here for full details of Ridge Regression**

[Machine Learning Ridge Regression](https://www.geeksforgeeks.org/machine-learning/what-is-ridge-regression/)

### **Ridge Regression on Diabetes data**

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error, r2_score

# Load data
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit one model per penalty strength and compare test-set metrics
alphas = [0.01, 0.1, 1, 10, 100]
for alpha in alphas:
    model = Ridge(alpha=alpha)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"Alpha: {alpha}, R2 Score: {r2_score(y_test, y_pred):.3f}, MSE: {mean_squared_error(y_test, y_pred):.2f}")
```

```
Alpha: 0.01, R2 Score: 0.487, MSE: 2836.41
Alpha: 0.1, R2 Score: 0.492, MSE: 2810.04
Alpha: 1, R2 Score: 0.438, MSE: 3105.47
Alpha: 10, R2 Score: 0.156, MSE: 4664.72
Alpha: 100, R2 Score: 0.009, MSE: 5479.45
```


### **Ridge Regression on California housing**

```python
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

# 1. Load dataset
data = fetch_california_housing()
X, y = data.data, data.target

# 2. Scale features (important for regularized models)
# Note: fitting the scaler on all rows before splitting lets test-set
# statistics leak into training; fitting on the training split only is safer.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Split data
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# 4. Train Ridge model
model = Ridge(alpha=1.0)
model.fit(X_train, y_train)

# 5. Predict
y_pred = model.predict(X_test)

# 6. Evaluate
print("R2 Score:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))
```

```
R2 Score: 0.5758185345441319
MSE: 0.5558512007367515
```


| Metric   | Meaning                                         |
| -------- | ----------------------------------------------- |
| R² Score | % of variance explained (closer to 1 is better) |
| MSE      | Mean squared error (closer to 0 is better)      |
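The California housing cell above fits `StandardScaler` on all rows before splitting. One possible way to keep test data out of the scaling step is a scikit-learn `Pipeline`, sketched here on synthetic data from `make_regression` (an assumption, chosen so the example runs without a dataset download).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic regression problem.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training split only, then applies the
# same transformation to the test split at prediction time.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
print("R2 Score:", r2)
print("MSE:", mse)

# The scaler's stored means come from the training rows only.
scaler_means = model.named_steps["standardscaler"].mean_
print("Scaler fitted on training rows only:",
      np.allclose(scaler_means, X_train.mean(axis=0)))
```

The same pipeline object can be passed to cross-validation helpers, which then refit the scaler inside each fold.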


------------------------------------------

# **Topic 2: KNN (K-Nearest Neighbors) Algorithm**

## **K-Nearest Neighbor (KNN) Algorithm**

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm generally used for classification, though it can also be used for regression. It works by finding the "k" closest data points (neighbors) to a given input and making a prediction based on the majority class (for classification) or the average value (for regression) of those neighbors.

K-Nearest Neighbors is also called a lazy learner because it does not build a model from the training set in advance. Instead, it stores the dataset and does the work at prediction time.


### **What is 'K' in K Nearest Neighbour?**

In the k-Nearest Neighbors algorithm, k is simply a number that tells the algorithm how many nearby points (neighbors) to look at when it makes a decision.

**Example:**

Imagine you are deciding which fruit something is based on its shape and size. You compare it to fruits you already know.

- If k = 3, the algorithm looks at the 3 closest fruits to the new one.

- If 2 of those 3 fruits are apples and 1 is a banana, the algorithm says the new fruit is an apple because most of its neighbors are apples.
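The vote in the fruit example can be written in a few lines of plain Python. The list of neighbor labels is hypothetical and simply mirrors the example.

```python
from collections import Counter

# Labels of the 3 nearest known fruits (k = 3), as in the example.
neighbor_labels = ["apple", "apple", "banana"]

votes = Counter(neighbor_labels)
predicted, count = votes.most_common(1)[0]
print(f"Predicted: {predicted} ({count} of {len(neighbor_labels)} votes)")
# Predicted: apple (2 of 3 votes)
```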

### **How to choose the value of k for KNN Algorithm?**

- The value of k in KNN decides how many neighbors the algorithm looks at when making a prediction.

- Choosing the right k is important for good results.

- If the data has lots of noise or outliers, using a larger k can make the predictions more stable, whereas a very small k lets single noisy points decide the result (overfitting).

- But if k is too large, the model may become too simple and miss important patterns. This is called underfitting.

- So k should be picked carefully based on the data.

## **Statistical Methods for Selecting k**

- **Cross-Validation**: A good way to find the best value of k is k-fold cross-validation. The dataset is divided into several parts, called folds (the number of folds is unrelated to the k in KNN). The model is trained on all but one fold and tested on the remaining one, and this is repeated so that each fold is used for testing once. The value of k that gives the highest average accuracy across these tests is usually the best one to use.

- **Elbow Method**: In the elbow method, we plot the error rate (or accuracy) for different values of k. As k increases, the error usually drops at first, but after a certain point it stops decreasing quickly. The point where the curve bends, like an "elbow", is usually a good choice for k.

- **Odd Values for k**: It is a good idea to use an odd number for k, especially in binary classification. This avoids tied votes between the two classes among the neighbors.
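A possible way to apply the cross-validation approach with scikit-learn is sketched below on the bundled breast cancer dataset. The range of odd k values, the 5 folds, and the scaling step inside a pipeline are assumptions made for the sketch.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

candidate_ks = list(range(1, 16, 2))  # odd values only, to avoid tied votes
cv_accuracy = {}
for k in candidate_ks:
    # Scaling inside the pipeline is refit on each training fold.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_accuracy[k] = cross_val_score(model, X, y, cv=5).mean()
    print(f"k={k:>2}: mean CV accuracy={cv_accuracy[k]:.3f}")

best_k = max(cv_accuracy, key=cv_accuracy.get)
print("Selected k:", best_k)
```

Plotting `cv_accuracy` against k would give the curve used by the elbow method.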


## **Distance Metrics Used in KNN Algorithm**

KNN uses a distance metric to identify the nearest neighbors, which are then used for classification or regression. The most common distance metrics are:

### **1. Euclidean Distance**

Euclidean distance is defined as the straight-line distance between two points in a plane or space. You can think of it like the shortest path you would walk if you were to go directly from one point to another.

$$\text{distance}(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}$$



### **2. Manhattan Distance**

This is the total distance you would travel if you could only move along horizontal and vertical lines like a grid or city streets. It’s also called "taxicab distance" because a taxi can only drive along the grid-like streets of a city.

$$d(x, y) = \sum_{i=1}^{n} \lvert x_i - y_i \rvert$$


### **3. Minkowski Distance**

Minkowski distance is like a family of distances, which includes both Euclidean and Manhattan distances as special cases.

$$d(x, y) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p}$$

 
From the formula above, when p=2, it becomes the same as the Euclidean distance formula and when p=1, it turns into the Manhattan distance formula. Minkowski distance is essentially a flexible formula that can represent either Euclidean or Manhattan distance depending on the value of p.
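The relationship between the three metrics can be checked with a small function. The name `minkowski` and the two sample points are illustrative choices.

```python
def minkowski(x, y, p):
    """Minkowski distance between two equal-length sequences of numbers."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)

manhattan = minkowski(a, b, p=1)  # |3| + |4|
euclidean = minkowski(a, b, p=2)  # sqrt(3^2 + 4^2)
print("Manhattan (p=1):", manhattan)  # 7.0
print("Euclidean (p=2):", euclidean)  # 5.0
```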

## **Working of KNN algorithm**

The K-Nearest Neighbors (KNN) algorithm operates on the principle of similarity: it predicts the label or value of a new data point by considering the labels or values of its K nearest neighbors in the training dataset.

### **Step 1: Selecting the optimal value of K**

- K represents the number of nearest neighbors that need to be considered when making a prediction.

### **Step 2: Calculating distance**

- To measure the similarity between the target and the training data points, a distance metric is used (Euclidean distance is the usual default). The distance is calculated between the target point and every data point in the dataset.

### **Step 3: Finding Nearest Neighbors**

- The k data points with the smallest distances to the target point are nearest neighbors.

### **Step 4: Voting for Classification or Taking Average for Regression**

- When you want to classify a data point into a category like spam or not spam, the KNN algorithm looks at the K closest points in the dataset. These closest points are called neighbors. The algorithm then looks at which category the neighbors belong to and picks the one that appears the most. This is called majority voting.

- In regression, the algorithm still looks for the K closest points. But instead of voting for a class in classification, it takes the average of the values of those K neighbors. This average is the predicted value for the new point for the algorithm.

The accompanying illustration (not reproduced here) shows how a test point is classified from its nearest neighbors: as the test point moves, the algorithm identifies the k closest data points (5 in that example) and assigns the test point the majority class label, which is the grey class there.
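The four steps can be put together in a short from-scratch sketch using only the standard library. The function names and the tiny dataset are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import Counter

def k_nearest(train_X, target, k):
    """Steps 2 and 3: indices of the k training points closest to target."""
    distances = sorted((math.dist(x, target), i) for i, x in enumerate(train_X))
    return [i for _, i in distances[:k]]

def knn_classify(train_X, train_y, target, k):
    """Step 4 (classification): majority vote among the k neighbors."""
    idx = k_nearest(train_X, target, k)
    return Counter(train_y[i] for i in idx).most_common(1)[0][0]

def knn_regress(train_X, train_y, target, k):
    """Step 4 (regression): average of the k neighbors' values."""
    idx = k_nearest(train_X, target, k)
    return sum(train_y[i] for i in idx) / k

# Hypothetical training data: two well-separated groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
labels = ["spam", "spam", "spam", "not spam", "not spam", "not spam"]
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

predicted_label = knn_classify(points, labels, (1.8, 1.4), k=3)
predicted_value = knn_regress(points, values, (6.4, 7.0), k=3)
print("Predicted class:", predicted_label)  # spam
print("Predicted value:", predicted_value)  # 11.0
```

Each prediction scans every stored point, which is why KNN becomes slow on large datasets.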


## **Applications of KNN**

- **Recommendation Systems**: Suggests items like movies or products by finding users with similar preferences.
- **Spam Detection**: Identifies spam emails by comparing new emails to known spam and non-spam examples.
- **Customer Segmentation**: Groups customers by comparing their shopping behavior to others.
- **Speech Recognition**: Matches spoken words to known patterns to convert them into text.

## **Advantages of KNN**

- **Simple to use**: Easy to understand and implement.
- **No training step**: No need to train as it just stores the data and uses it during prediction.
- **Few parameters**: Only needs to set the number of neighbors (k) and a distance method.
- **Versatile**: Works for both classification and regression problems.

## **Disadvantages of KNN**

- **Slow with large data**: Needs to compare every point during prediction.
- **Struggles with many features**: Accuracy drops when data has too many features.
- **Can overfit**: It can overfit, especially with a small k or when the data is high-dimensional or noisy.

## **KNN Classifier with Breast Cancer dataset**

```python
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

# 1. Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# 2. Feature scaling (KNN is distance-based, so feature scales matter)
# Note: fitting the scaler on all rows before splitting lets test-set
# statistics leak into training; fitting on the training split only is safer.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 3. Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# 4. Train KNN
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# 5. Predict
y_pred = model.predict(X_test)

# 6. Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

```
Accuracy: 0.9473684210526315
              precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114
```



## **Important parameters of `KNeighborsClassifier`**

| Parameter     | Description                                                                |
| ------------- | -------------------------------------------------------------------------- |
| `n_neighbors` | Number of neighbors to use (default=5)                                     |
| `weights`     | 'uniform' (default) or 'distance' (closer neighbors get more weight)       |
| `metric`      | Distance metric (default 'minkowski'; also `euclidean`, `manhattan`, etc.) |
| `p`           | Power for the Minkowski metric (p=1 Manhattan, p=2 Euclidean; default 2)   |
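As an illustration of how these parameters are passed, the sketch below compares three configurations on the breast cancer data. The particular combinations are arbitrary choices for the example, and the scaler is placed in a pipeline so it is fitted on the training split only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hypothetical configurations to compare.
configs = {
    "uniform weights, Euclidean": dict(weights="uniform", metric="minkowski", p=2),
    "distance weights, Euclidean": dict(weights="distance", metric="minkowski", p=2),
    "distance weights, Manhattan": dict(weights="distance", metric="minkowski", p=1),
}

accuracies = {}
for name, params in configs.items():
    model = make_pipeline(StandardScaler(),
                          KNeighborsClassifier(n_neighbors=5, **params))
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: accuracy={accuracies[name]:.3f}")
```

Differences on a single test split can be small and noisy, so cross-validation is a more reliable way to compare settings.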


## **Try Different `K` values**

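```python
# Reuses X_train, X_test, y_train, y_test from the previous cell.
# Note: picking k by test-set accuracy gives an optimistic estimate;
# cross-validation on the training data is the more reliable approach.
for k in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    acc = model.score(X_test, y_test)
    print(f"K={k} → Accuracy: {acc:.3f}")
```

```
K=1 → Accuracy: 0.939
K=2 → Accuracy: 0.939
K=3 → Accuracy: 0.947
K=4 → Accuracy: 0.956
K=5 → Accuracy: 0.947
K=6 → Accuracy: 0.947
K=7 → Accuracy: 0.947
K=8 → Accuracy: 0.956
K=9 → Accuracy: 0.965
K=10 → Accuracy: 0.956
```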
