#Building a Custom Linear Regression Class: A Comparison with Scikit-Learn

**Objective:** In this project, we fit a Linear Regression model with the Scikit-Learn library and compare its results with a custom-built Linear Regression class. Writing the model by hand clarifies the mechanics of linear regression and the implementation details that the library call hides.

In [None]:
#Importing necessary libraries

from sklearn.datasets import load_diabetes
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

In [None]:
#load_diabetes() returns a dictionary-like Bunch object (data, target, feature_names, ...)
df1 = load_diabetes()

In [None]:
#Converting the feature matrix to a DataFrame
df = pd.DataFrame(df1.data, columns = df1.feature_names)

In [None]:
#Extracting target column from the data
target = pd.DataFrame(df1.target, columns = ['target'])

In [None]:
#Final DataFrame
df = pd.concat([df, target], axis = 1)

In [None]:
df.sample(5)

|     | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | target |
|-----|-----|-----|-----|----|----|----|----|----|----|----|--------|
| 295 | -0.052738 | 0.05068 | 0.039062 | -0.040099 | -0.005697 | -0.0129 | 0.011824 | -0.039493 | 0.016307 | 0.003064 | 85.0 |
| 18 | -0.038207 | -0.044642 | -0.010517 | -0.036656 | -0.037344 | -0.019476 | -0.028674 | -0.002592 | -0.018114 | -0.017646 | 97.0 |
| 157 | -0.001882 | 0.05068 | -0.033151 | -0.018306 | 0.031454 | 0.04284 | -0.013948 | 0.019917 | 0.010227 | 0.027917 | 84.0 |
| 144 | 0.030811 | 0.05068 | 0.046607 | -0.015999 | 0.020446 | 0.050669 | -0.058127 | 0.07121 | 0.006207 | 0.007207 | 174.0 |
| 296 | 0.067136 | -0.044642 | -0.061174 | -0.040099 | -0.026336 | -0.024487 | 0.033914 | -0.039493 | -0.056153 | -0.059067 | 89.0 |


In [None]:
#Train_Test_Split

x_train, x_test, y_train, y_test = train_test_split(df.drop('target', axis = 1), df.target, test_size = 0.3, random_state= 30)

In [None]:
#Model object
LR = LinearRegression()

In [None]:
#Fitting model on x_train and y_train
LR.fit(x_train, y_train)

In [None]:
#Predicting the value of y for x_test data
y_predicted = LR.predict(x_test)

In [None]:
#Displaying all predicted values
y_predicted

array([113.06940069, 204.12959425, 104.97217242, 121.78112233,
       116.41938889, 233.61302737,  57.65530057, 131.6777207 ,
       121.80677783, 172.03843637, 144.70022575,  73.17947892,
       152.48052001, 153.95137768, 100.11974827,  69.34239043,
       161.08654627, 184.78097325, 173.82214399, 108.02237645,
       111.54238624, 156.88201882,  78.06372734,  54.37692745,
       168.58320919, 113.88605844, 103.76026941,  71.67748632,
       167.44618941, 143.98256429,  94.5207867 , 154.88090675,
       181.84934014,  77.09600235, 127.33706006, 133.62282988,
        57.51516277, 182.34090937, 260.44117815,  86.53274751,
       244.11273378, 150.06357386, 117.07215169, 156.03542782,
       178.24488768, 197.01551748, 160.02135531, 256.96212729,
       203.53821677, 121.14657616, 208.25886086, 166.15354522,
       138.0741457 , 191.80070033, 168.74794819, 229.93099634,
       152.26625934,  95.39278585, 110.22489623, 149.67079239,
       222.92206741,  79.44056298, 195.63977271, 256.47

##Points to Note:
1. It is a multiple linear regression problem.
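2. General equation of the best fit hyperplane (with 10 features it is no longer a line) will be ->
$$y_{\text{predicted}} = b_0 + b_1(\text{age}) + b_2(\text{sex}) + b_3(\text{bmi}) + b_4(\text{bp}) + b_5(\text{s1}) + b_6(\text{s2}) + b_7(\text{s3}) + b_8(\text{s4}) + b_9(\text{s5}) + b_{10}(\text{s6})$$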
3. b0 -> intercept
4. b1, b2,......, b10 -> coefficients or weights of each column.
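
The equation in point 2 is a dot product between one row of features and the weight vector, plus the intercept. A minimal sketch on hypothetical toy numbers (three features instead of ten; `b0`, `b`, and `X` are illustrative values, not taken from the diabetes data) shows that the term-by-term sum and the vectorised form agree:

```python
import numpy as np

# Hypothetical toy values for illustration only (not from the diabetes dataset)
b0 = 2.0                           # intercept
b = np.array([0.5, -1.0, 3.0])     # one weight per feature column
X = np.array([[1.0, 2.0, 3.0],     # each row is one observation
              [0.0, 1.0, 0.0]])

# Term-by-term form: b0 + b1*x1 + b2*x2 + b3*x3 for every row
y_loop = np.array([b0 + sum(b_j * x_j for b_j, x_j in zip(b, row)) for row in X])

# Vectorised form: intercept plus matrix-vector product
y_vec = b0 + X @ b

print(y_loop)
print(y_vec)
```

With the real model, `b0` corresponds to `LR.intercept_` and `b` to `LR.coef_`.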

In [None]:
#b1, b2, b3, b4, b5, b6, b7, b8, b9, b10
LR.coef_

array([  -55.91101021,  -244.97829092,   520.38777297,   359.95099617,
       -1337.2394159 ,   901.97668443,   324.40164613,   135.11482593,
         978.07597849,    76.40005332])

In [None]:
#b0
LR.intercept_

np.float64(151.40111436430269)

In [None]:
#Calculating r2 score of our model
r2_score(y_test, y_predicted)

0.47599191698770515
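
For reference, the R2 score is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a small made-up example; `r2_manual` and the toy arrays are hypothetical names and values introduced here for illustration.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot for a single target column."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Made-up values, not from the diabetes dataset
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

score = r2_manual(y_true, y_pred)
print(score)
```

A score of 1 means perfect predictions, and 0 means the model does no better than always predicting the mean of `y_true`.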

###Creating custom "Regression class" using "Ordinary Least Squares" method
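
The closed-form solution used by the class comes from minimising the squared error. In the sketch below, $X$ is assumed to be the feature matrix with a leading column of ones, $\beta = (b_0, b_1, \dots, b_{10})^{\top}$, and $X^{\top}X$ is assumed to be invertible:

```latex
\begin{aligned}
\hat{\beta} &= \arg\min_{\beta}\; \lVert y - X\beta \rVert_2^2 \\
\nabla_{\beta}\, \lVert y - X\beta \rVert_2^2 &= -2\,X^{\top}(y - X\beta) = 0 \\
X^{\top}X\,\hat{\beta} &= X^{\top}y \\
\hat{\beta} &= (X^{\top}X)^{-1}X^{\top}y
\end{aligned}
```

The last line is what the `train` method computes in the steps `term_1` and `term_2`.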

In [None]:
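class Regression:
  def __init__(self):
    self.coef_ = None       # b1, ..., b10 (set by train)
    self.intercept_ = None  # b0 (set by train)

  def train(self, x, y):
    # Prepend a column of ones so the intercept b0 is estimated with the weights
    x1 = x.copy()
    x1.insert(loc=0, column='ones', value=np.ones(len(x)))
    # Normal equation: beta = (X^T X)^-1 X^T y
    term_1 = np.linalg.inv(np.dot(x1.T, x1))
    term_2 = np.dot(term_1, x1.T)
    beta = np.dot(term_2, y)
    self.intercept_ = beta[0]
    self.coef_ = beta[1:]
    # Return the full vector [b0, b1, ..., b10]
    return beta

  def prediction(self, x):
    y_predicted = self.intercept_ + np.dot(x, self.coef_)
    return y_predicted

  def accuracy(self, y_predicted, y_test):
    # Despite the method name, this returns the R2 score
    return r2_score(y_test, y_predicted)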
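
Inverting `X^T X` explicitly works here, but it can become numerically fragile when columns are nearly collinear. As a possible alternative (a sketch, not what the class above does), `np.linalg.lstsq` solves the same least-squares problem without forming the inverse. The synthetic data and the names `true_beta`, `beta_normal`, and `beta_lstsq` are assumptions made for this example.

```python
import numpy as np

# Synthetic data for illustration only (not the diabetes dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_beta = np.array([4.0, 1.5, -2.0, 0.5])  # intercept first, then three weights
y = true_beta[0] + X @ true_beta[1:] + rng.normal(scale=0.1, size=50)

# Design matrix with a leading column of ones for the intercept
X1 = np.column_stack([np.ones(len(X)), X])

# Normal equation, as in the custom class
beta_normal = np.linalg.inv(X1.T @ X1) @ X1.T @ y

# Least-squares solver: same minimisation problem, no explicit inverse
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X1, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))
```

On well-conditioned data such as this, both routes give the same coefficients up to floating-point error.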


In [None]:
lr = Regression()

In [None]:
#b0, b1, b2, ........, b10
coeff = lr.train(x_train, y_train)

In [None]:
#Printing b0, b1, ......, b10
coeff

array([  151.40111436,   -55.91101021,  -244.97829092,   520.38777297,
         359.95099617, -1337.2394159 ,   901.97668443,   324.40164613,
         135.11482593,   978.07597849,    76.40005332])

In [None]:
#Printing all predicted values of x_test
predicted_values = lr.prediction(x_test)
predicted_values

array([113.06940069, 204.12959425, 104.97217242, 121.78112233,
       116.41938889, 233.61302737,  57.65530057, 131.6777207 ,
       121.80677783, 172.03843637, 144.70022575,  73.17947892,
       152.48052001, 153.95137768, 100.11974827,  69.34239043,
       161.08654627, 184.78097325, 173.82214399, 108.02237645,
       111.54238624, 156.88201882,  78.06372734,  54.37692745,
       168.58320919, 113.88605844, 103.76026941,  71.67748632,
       167.44618941, 143.98256429,  94.5207867 , 154.88090675,
       181.84934014,  77.09600235, 127.33706006, 133.62282988,
        57.51516277, 182.34090937, 260.44117815,  86.53274751,
       244.11273378, 150.06357386, 117.07215169, 156.03542782,
       178.24488768, 197.01551748, 160.02135531, 256.96212729,
       203.53821677, 121.14657616, 208.25886086, 166.15354522,
       138.0741457 , 191.80070033, 168.74794819, 229.93099634,
       152.26625934,  95.39278585, 110.22489623, 149.67079239,
       222.92206741,  79.44056298, 195.63977271, 256.47

In [None]:
#Printing the R2 score of the predicted values for x_test against the actual y_test
#The goal is not to improve accuracy but to code the LR algorithm from scratch

lr.accuracy(predicted_values, y_test)

0.47599191698770515

#Conclusion

In this project, I implemented a Linear Regression model using both Scikit-Learn and a custom-built class. Comparing the intercept, the coefficients, and the r2_score shows that the custom implementation matches the Scikit-Learn results to the printed precision, which confirms that the normal-equation solution was coded correctly.
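
Rather than comparing the printed arrays by eye, the match can be checked numerically. The sketch below copies the values from the outputs shown earlier (so they carry only the displayed precision) and compares them with NumPy's tolerance-based helpers; the variable names are introduced here for illustration.

```python
import numpy as np

# Values copied from the printed outputs above (displayed precision only)
sklearn_intercept = 151.40111436430269
sklearn_coef = np.array([-55.91101021, -244.97829092, 520.38777297, 359.95099617,
                         -1337.2394159, 901.97668443, 324.40164613, 135.11482593,
                         978.07597849, 76.40005332])

custom_beta = np.array([151.40111436, -55.91101021, -244.97829092, 520.38777297,
                        359.95099617, -1337.2394159, 901.97668443, 324.40164613,
                        135.11482593, 978.07597849, 76.40005332])

# First entry of the custom vector is b0, the rest are b1..b10
intercept_match = np.isclose(custom_beta[0], sklearn_intercept)
coef_match = np.allclose(custom_beta[1:], sklearn_coef)

print(intercept_match, coef_match)  # True True
```

In the notebook itself, the same check could be run directly on `LR.coef_`, `LR.intercept_`, and `coeff` instead of copied numbers.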

This exercise not only reinforces my knowledge of linear regression but also highlights the importance of understanding the underlying mechanics of machine learning algorithms.