# Linear Regression

In this assignment, you will implement linear regression from scratch and evaluate its performance on the Boston Housing dataset.

### **Import Libraries**

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

### **Load the dataset**
Use the `pd.read_csv()` function to read the data from the 'HousingData.csv' file.

In [25]:
data = pd.read_csv('HousingData.csv')
#print(data.to_string())
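
Before preparing the data, it can help to check how many values are missing in each column. The sketch below is only an illustration: it reads a tiny inline CSV instead of 'HousingData.csv', and the three rows of values are made up for the example.

```python
import io
import pandas as pd

# Made-up rows standing in for the real file (assumption: illustrative values only)
csv_text = """RM,LSTAT,MEDV
6.5,4.9,24.0
,9.1,21.6
7.1,,34.7
"""

demo = pd.read_csv(io.StringIO(csv_text))  # empty fields are parsed as NaN
print(demo.shape)                          # (rows, columns)
print(demo.isna().sum())                   # number of missing values per column
```

Running the same `isna().sum()` call on `data` would show which columns of the real dataset contain missing entries.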


### **Data Preparation**

We will standardize the input features, split the dataset into training and testing data with an 80/20 split, and then drop any rows that contain NaN or Inf values.

In [26]:
from sklearn.preprocessing import StandardScaler

X = data.drop(columns='MEDV').values    # All input features
y = data['MEDV'].values                 # Target variable

scaler = StandardScaler()
X = scaler.fit_transform(X)             # Standardize features (zero mean, unit variance) to keep gradient descent numerically stable

# Split into train and test sets; fixing random_state makes the split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Keep only rows whose features are all finite (drops rows with NaN or Inf)
train_mask = np.isfinite(X_train).all(axis=1)
X_train, y_train = X_train[train_mask], y_train[train_mask]

# Same for the test set
test_mask = np.isfinite(X_test).all(axis=1)
X_test, y_test = X_test[test_mask], y_test[test_mask]
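
The cell above fits the scaler on the full dataset before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched here on synthetic data (the `*_demo` and `Xd_*` names are hypothetical, not part of the assignment), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (assumption: not the real data)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

# Split first, so the test rows never influence the scaling statistics
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

demo_scaler = StandardScaler()
Xd_train_scaled = demo_scaler.fit_transform(Xd_train)  # learn mean/std on train only
Xd_test_scaled = demo_scaler.transform(Xd_test)        # reuse the train statistics

print(Xd_train_scaled.shape, Xd_test_scaled.shape)
```

With this ordering the training features have exactly zero mean and unit variance, while the test features are only approximately standardized, which mirrors how the model would see unseen data.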

### **Implement Linear Regression from Scratch**

The three major tasks in this process are:
* Fitting the model using gradient descent
* Predicting values for test data
* Calculating the Mean Squared Error (MSE)

Because this implementation handles multiple input features, it is an instance of multiple linear regression (sometimes loosely called multivariate linear regression).
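
The gradient-descent step can be written out explicitly. As a sketch, with $m$ samples, feature matrix $X$, weights $w$, bias $b$, and learning rate $\alpha$ (the symbol names are chosen here for illustration), the prediction and the MSE cost are:

$$
\hat{y} = Xw + b, \qquad J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \left(\hat{y}_i - y_i\right)^2
$$

Differentiating the cost with respect to each parameter gives:

$$
\frac{\partial J}{\partial w} = \frac{2}{m} X^\top (\hat{y} - y), \qquad \frac{\partial J}{\partial b} = \frac{2}{m} \sum_{i=1}^{m} \left(\hat{y}_i - y_i\right)
$$

Each epoch then moves the parameters a small step against the gradient:

$$
w \leftarrow w - \alpha \frac{\partial J}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}
$$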

In [27]:
class LinearRegressionGD:
    def __init__(self):
        # Initialize weights and bias to None
        self.w = None
        self.b = None
        
    def fit(self, X, y, lr=0.01, epochs=1000):
        # Initialize parameters to zero
        self.m, self.n = X.shape    # m = number of samples, n = number of features
        self.w = np.zeros(self.n)   # weights start at zero
        self.b = 0.0                # bias starts at zero

        # Gradient descent
        for epoch in range(epochs):
            y_pred = self.predict(X) # compute prediction using current weights and bias
            # Compute gradients of the MSE cost
            dw = (2 / self.m) * np.dot(X.T, (y_pred - y)) # Gradient w.r.t weights
            db = (2 / self.m) * np.sum(y_pred - y)  # Gradient w.r.t bias

            # Update weights and bias
            self.w -= lr * dw
            self.b -= lr * db

            # Optional: Print cost every 100 epochs for monitoring
            if epoch % 100 == 0:
                cost = self.mean_squared_error(y, y_pred)
                print(f"Epoch {epoch}/{epochs}, Cost: {cost}")

    def predict(self, X):
        # Compute predictions: y = X * w + b
        return np.dot(X, self.w) + self.b

    def mean_squared_error(self, y_true, y_pred):
        # Calculate the Mean Squared Error (MSE) between true and predicted values
        return np.mean((y_true - y_pred) ** 2)

### **Instantiate the model**
Create an instance of the class named *model* and fit it on the training data with **learning rate = 0.01** and **1000 epochs**.

In [36]:
# Initialize the model
model = LinearRegressionGD()
# Train the model
model.fit(X_train, y_train, lr=0.01, epochs=1000)
# Evaluate on the training data
y_pred_train = model.predict(X_train)
train_mse = model.mean_squared_error(y_train, y_pred_train)
print(f"Training mse: {train_mse}")


Epoch 0/1000, Cost: 595.1172815533981
Epoch 100/1000, Cost: 28.20421029400211
Epoch 200/1000, Cost: 18.79760098585739
Epoch 300/1000, Cost: 18.240633536508515
Epoch 400/1000, Cost: 18.03101327177283
Epoch 500/1000, Cost: 17.91250220863855
Epoch 600/1000, Cost: 17.839372926272766
Epoch 700/1000, Cost: 17.79152753497774
Epoch 800/1000, Cost: 17.758699341759577
Epoch 900/1000, Cost: 17.735311587693456
Training mse: 17.718172415341463


### **Evaluate the Model**
Evaluate the model's performance on the test set using:
1. Mean Squared Error (MSE)
2. R-squared Score

In [43]:
from sklearn.metrics import mean_squared_error, r2_score
y_pred_test = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred_test)
print(f"Mean Squared Error: {mse:.2f}")
r2 = r2_score(y_test, y_pred_test)
print(f"R-squared Score: {r2:.2f}")


Mean Squared Error: 28.86
R-squared Score: 0.64
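
To make the two metrics concrete, here is a minimal NumPy sketch of how they can be computed by hand. The helper names `mse_score` and `r2_from_scratch` and the four sample values are hypothetical, introduced only for this illustration.

```python
import numpy as np

def mse_score(y_true, y_pred):
    # Mean of the squared residuals
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

def r2_from_scratch(y_true, y_pred):
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up values for illustration
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 6.5, 9.5])

print(f"MSE: {mse_score(y_true_demo, y_pred_demo):.2f}")        # MSE: 0.25
print(f"R^2: {r2_from_scratch(y_true_demo, y_pred_demo):.2f}")  # R^2: 0.95
```

An R-squared of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of the targets.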


### **Comparing your results with sklearn's Linear Regression**
To validate your implementation, compare its results with sklearn's `LinearRegression` model.

In [47]:
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_train, y_train)

# Predict and evaluate
y_test_pred_sklearn = lr.predict(X_test)
mse_sklearn = mean_squared_error(y_test, y_test_pred_sklearn)
r2_sklearn = r2_score(y_test, y_test_pred_sklearn)

print(f"Sklearn Model's Mean Squared Error: {mse_sklearn:.2f}")
print(f"Sklearn Model's R-squared Score: {r2_sklearn:.2f}")

# Compare weights and bias
# print(f"Your Model's Weights: {model.w}, Bias: {model.b}")
# print(f"Sklearn Model's Weights: {lr.coef_}, Bias: {lr.intercept_}")

Sklearn Model's Mean Squared Error: 28.28
Sklearn Model's R-squared Score: 0.65
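
Another way to sanity-check a gradient-descent implementation is to compare it with an exact least-squares solution, since both minimize the same squared error. The sketch below uses synthetic data and a standalone loop with the same update rule as the class above; all names in it (`X_demo`, `w_gd`, `step_size`, and so on) are hypothetical.

```python
import numpy as np

# Synthetic regression problem (assumption: not the housing data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=200)
m = len(y_demo)

# Exact solution: append a column of ones for the bias and solve least squares
A = np.column_stack([X_demo, np.ones(m)])
theta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
w_exact, b_exact = theta[:-1], theta[-1]

# Gradient descent on the MSE cost, starting from zero
w_gd, b_gd = np.zeros(3), 0.0
step_size, n_epochs = 0.01, 2000
for _ in range(n_epochs):
    err = X_demo @ w_gd + b_gd - y_demo
    w_gd -= step_size * (2 / m) * (X_demo.T @ err)
    b_gd -= step_size * (2 / m) * err.sum()

print(np.allclose(w_gd, w_exact, atol=1e-6))  # True once gradient descent has converged
print(abs(b_gd - b_exact) < 1e-6)
```

On the housing data the two models report slightly different test errors, which would be consistent with gradient descent not having fully converged after 1000 epochs; running more epochs should narrow the gap.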
