<a href="https://colab.research.google.com/github/TanishaL67/MLFromScratch/blob/main/LinearRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Linear Regression


Linear Regression is one of the simplest machine learning models. A linear model makes a prediction by computing a weighted sum of the input features, plus a constant called the bias term (also called the intercept term).

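Here is the Linear Regression model prediction equation, where $\hat{y}$ is the predicted value, $n$ is the number of features, $x_i$ is the $i$-th feature value, $\beta_0$ is the bias term, and $\beta_1, \dots, \beta_n$ are the feature weights: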
$$
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n
$$
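
As a minimal sketch of this equation, the snippet below computes the prediction for a single sample with NumPy. The values in `beta` and `x` are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical parameters: beta[0] is the bias term, beta[1:] are the feature weights
beta = np.array([0.5, 2.0, -1.0])
# Hypothetical feature values x_1 and x_2 for one sample
x = np.array([3.0, 4.0])

# Weighted sum of the features plus the bias term
y_hat = beta[0] + np.dot(beta[1:], x)
print(y_hat)
```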

Training the model means setting its parameters so that the model best fits the training set. To measure how well the model fits the training set we will use the root mean squared error (RMSE), so training amounts to finding the parameter values that minimize the RMSE. Minimizing the mean squared error (MSE) gives the same parameters, because the square root is a monotonic function.
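
For reference, the RMSE can be computed in a few lines of NumPy. This is only a sketch: the helper name `rmse` and the toy values are assumptions for illustration and do not appear elsewhere in this notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the errors, average them, then take the square root
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy values chosen for illustration only
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
print(rmse(y_true, y_pred))
```

A lower RMSE means the predictions are closer to the targets on average, and a value of 0 means a perfect fit.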

In [2]:
# We will use the pandas and numpy libraries to code the linear regression model
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_california_housing

# Load the California Housing dataset
california = fetch_california_housing()

# Convert to a DataFrame for easy manipulation
df = pd.DataFrame(california.data, columns=california.feature_names)
print(df.head())

   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  \
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88   
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86   
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85   
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85   
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85   

   Longitude  
0    -122.23  
1    -122.22  
2    -122.24  
3    -122.25  
4    -122.25  


## Normal Equation

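To find the parameter values that minimize the cost function, there is a closed-form solution: a mathematical equation that gives the result directly. This is called the Normal Equation.

$$
\hat{\beta} = (X^T X)^{-1} X^T y
$$

Here $X$ is the feature matrix with a leading column of ones for the bias term, $y$ is the vector of target values, and $\hat{\beta}$ is the vector of parameters that minimizes the cost function. In words: take the inverse of $X^T X$, then multiply it by $X^T$ and by $y$.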
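
Before wrapping this in a class, here is a rough sketch of the Normal Equation on its own, applied to synthetic data. The data-generating values (4, 3, -2), the noise level, and the variable names are assumptions chosen for illustration. The result is cross-checked against `np.linalg.lstsq`, NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 4 + 3*x1 - 2*x2 plus a little noise (illustrative values)
X = rng.normal(size=(200, 2))
y = 4.0 + 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.01 * rng.normal(size=200)

# Prepend a column of ones so the first parameter acts as the bias term
X_b = np.concatenate((np.ones((len(X), 1)), X), axis=1)

# Normal equation: beta = (X^T X)^-1 X^T y
beta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# Cross-check against NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(beta)
print(beta_lstsq)
```

Both approaches should recover parameters close to the values used to generate the data. Explicit inversion only works when $X^T X$ is invertible, which is why a least-squares solver is often preferred in practice.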



In [29]:
# We will start coding the linear regression model from scratch.
class LinearRegression:
  def __init__(self):
    #we start with initialising the coefficients and intercepts
    self.coefficients = None
    self.intercept = None

  def feature_scaling(self, X):
    # Apply Standard Scaling to features (Z-score scaling)
    self.mean = np.mean(X, axis=0)
    self.std = np.std(X, axis=0)
    X_scaled = (X - self.mean) / self.std
    return X_scaled

  def fit(self, X, y):
    # Scale features first
    X = self.feature_scaling(X)
    # add a column of ones for the intercept (bias term)
    ones = np.ones((len(X), 1))
    X = np.concatenate((ones, X), axis=1)
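    # Solve the normal equation: beta = (X^T X)^-1 X^T y
    XT = X.T
    XTX = XT.dot(X)
    XTX_inv = np.linalg.inv(XTX)  # assumes X^T X is invertible
    XTy = XT.dot(y)
    # coefficients[0] is the intercept; the remaining entries are the feature weights
    self.coefficients = XTX_inv.dot(XTy)
    self.intercept = self.coefficients[0]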

  def predict(self, X):
    # scale the features with the mean and std learned during fit
    X = (X - self.mean) / self.std
    # add the column of ones to match the structure used in fitting the model
    ones = np.ones((len(X), 1))
    X = np.concatenate((ones, X), axis=1)
    return X.dot(self.coefficients)

  def Rsquared(self, X, y):
    ypred = self.predict(X)
    ss_total = np.sum((y - np.mean(y))**2) #total sum of squares
    ss_residual = np.sum((y - ypred)**2) #residual sum of squares
    r2 = 1 - (ss_residual / ss_total) #R-squared formula
    return r2


In [16]:
df['MedHouseVal'] = california.target  # Target variable
X = df.drop(columns=['MedHouseVal']).values  # Features as numpy array
y = df['MedHouseVal'].values  # Target as numpy array

# Train-test split helper (default is a 70-30 split; we call it below with test_size=0.2 for an 80-20 split)
def train_test_split(X, y, test_size=0.3, random_state=None):
    if random_state is not None:
        np.random.seed(random_state)

    indices = np.arange(X.shape[0])
    np.random.shuffle(indices)

    split_index = int((1 - test_size) * len(X))

    X_train, X_test = X[indices[:split_index]], X[indices[split_index:]]
    y_train, y_test = y[indices[:split_index]], y[indices[split_index:]]

    return X_train, X_test, y_train, y_test

# Perform the manual train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Output the shapes to confirm the split
print("Training set shape:", X_train.shape, y_train.shape)
print("Testing set shape:", X_test.shape, y_test.shape)


Training set shape: (16512, 8) (16512,)
Testing set shape: (4128, 8) (4128,)


In [30]:
# We will now initialize and fit the model
model = LinearRegression()
model.fit(X_train, y_train)


In [31]:
ypred = model.predict(X_test)
print(ypred)

[0.96867243 1.56633224 1.81157253 ... 2.03410525 2.83710153 2.2572115 ]


In [32]:
# Let's calculate the R-squared value on the test set
r2 = model.Rsquared(X_test, y_test)
print(r2)

0.5875292022460028
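
An R-squared of about 0.59 means the model explains roughly 59% of the variance in the test targets. To make the formula behind this score concrete, here is a small sketch on toy numbers. The values and the function name `r_squared` are made up for illustration and are independent of the model above.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_total = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
    ss_residual = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    return 1 - ss_residual / ss_total

# Toy targets and predictions (illustrative only)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
print(r_squared(y_true, y_pred))
```

A score of 1 means perfect predictions, while a score of 0 means the model does no better than always predicting the mean of the targets.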
