In [123]:
import pandas as pd
import numpy as np

Simple linear regression predicts a target value y from a single feature X by fitting the line y = mX + b, where m is the slope (weight) and b is the intercept.

The values of m (slope) and b (intercept) can be determined by two methods:

- Closed Form Solution (OLS); the formulas used here are for 1-D (single-feature) data
- Non-closed Form Solution (Gradient Descent)
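
The second method can be sketched as follows. This is a minimal illustration, not code from this notebook: the function name `gradient_descent`, the learning rate, the number of epochs and the toy data are all assumptions, and the loss is averaged over the points (mean squared error), which only rescales the learning rate compared with a plain sum.

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=5000):
    """Fit y ~ m*X + b by repeatedly stepping against the MSE gradient."""
    m, b = 0.0, 0.0
    n = len(X)
    for _ in range(epochs):
        residual = y - (m * X + b)                 # actual minus predicted
        grad_m = -2.0 / n * np.sum(X * residual)   # d(MSE)/dm
        grad_b = -2.0 / n * np.sum(residual)       # d(MSE)/db
        m -= lr * grad_m
        b -= lr * grad_b
    return m, b

# Toy data lying exactly on the line y = 2x + 1 (hypothetical values)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * X + 1.0

m, b = gradient_descent(X, y)
print(m, b)  # should approach slope 2 and intercept 1
```

Unlike the closed-form solution, this approach needs a learning rate and an iteration count, and with unscaled features a rate that is too large can make the updates diverge.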

The scikit-learn library implements OLS behind the scenes. For a single feature, the values of m and b are given by the formulas

## b = $\overline{y} - m \, \overline{X}$

## m = $\frac{\sum_{i=1}^{n} (X_i - \overline{X}) (y_i - \overline{y})}{\sum_{i=1}^{n} (X_i - \overline{X})^2 }$
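
As a quick sanity check of the slope and intercept formulas, here is a small worked example on five made-up points (the data is hypothetical). It computes m and b directly from the sums and compares them with `np.polyfit`, which fits a degree-1 polynomial by least squares.

```python
import numpy as np

# Hypothetical toy data
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

x_mean, y_mean = X.mean(), y.mean()

# Slope: sum of co-deviations divided by sum of squared X deviations
m = np.sum((X - x_mean) * (y - y_mean)) / np.sum((X - x_mean) ** 2)
# Intercept: the fitted line passes through the point of means
b = y_mean - m * x_mean
print(m, b)  # approximately 0.6 and 2.2

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
m_ref, b_ref = np.polyfit(X, y, deg=1)
print(np.isclose(m, m_ref), np.isclose(b, b_ref))
```

Here the means are 3 and 4, the numerator is 6 and the denominator is 10, so m is about 0.6 and b is about 4 - 0.6 * 3 = 2.2.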

    


![](./Images/linearreg_error.webp)

In this image, linear regression finds the best-fit line. But how do we decide whether a line is the best fit?

First, let's find the error. For each point it is the difference between the actual y and the predicted y, i.e. $y_i - \hat{y}_i$. Summing over all n points gives

E = $\sum_{i=1}^{n} (y_i - \hat{y}_i)$

In this raw sum, positive and negative errors cancel each other out. We therefore square each error, which also punishes values that deviate by a large margin. So

E = $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

The lower the value of E, the better our model predicts.
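
To make this concrete, the sketch below evaluates E for two candidate lines on a small made-up dataset. The helper name `sum_squared_error` and the data are assumptions for illustration; the first candidate (m = 0.6, b = 2.2) is the least-squares line for these five points, the second is a flat line through the mean of y.

```python
import numpy as np

def sum_squared_error(X, y, m, b):
    """Return E = sum of (y_i - y_hat_i)^2 for the line y_hat = m*X + b."""
    y_hat = m * X + b
    return float(np.sum((y - y_hat) ** 2))

# Hypothetical toy data
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

e_fit = sum_squared_error(X, y, m=0.6, b=2.2)   # least-squares line
e_flat = sum_squared_error(X, y, m=0.0, b=4.0)  # flat line at the mean of y

print(e_fit, e_flat)  # roughly 2.4 versus 6.0
```

The fitted line has the smaller E, so by this criterion it describes the data better than the flat line.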

So our aim is to find the values of m (slope) and b (intercept) that minimize E.

E(m,b) = $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

E(m,b) = $\sum_{i=1}^{n} (y_i - m*X_i - b)^2$
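
One way to see where the closed-form values of m and b come from is to set both partial derivatives of E(m,b) to zero. This is the standard derivation, sketched here as a supplement rather than taken from the notebook itself.

$$\frac{\partial E}{\partial b} = -2 \sum_{i=1}^{n} (y_i - m X_i - b) = 0 \quad\Rightarrow\quad b = \overline{y} - m\,\overline{X}$$

$$\frac{\partial E}{\partial m} = -2 \sum_{i=1}^{n} X_i \,(y_i - m X_i - b) = 0$$

Substituting $b = \overline{y} - m\,\overline{X}$ into the second equation:

$$\sum_{i=1}^{n} X_i \big( (y_i - \overline{y}) - m (X_i - \overline{X}) \big) = 0 \quad\Rightarrow\quad m = \frac{\sum_{i=1}^{n} X_i (y_i - \overline{y})}{\sum_{i=1}^{n} X_i (X_i - \overline{X})} = \frac{\sum_{i=1}^{n} (X_i - \overline{X})(y_i - \overline{y})}{\sum_{i=1}^{n} (X_i - \overline{X})^2}$$

The last equality holds because $\sum_{i=1}^{n} (y_i - \overline{y}) = 0$ and $\sum_{i=1}^{n} (X_i - \overline{X}) = 0$, so subtracting $\overline{X}$ times either sum changes nothing.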


In [112]:

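class LinearRegression:
    """Simple (single-feature) linear regression using the closed-form OLS formulas."""

    def __init__(self) -> None:
        self.m = None  # slope (weight)
        self.b = None  # intercept

    def fit(self, X_train, y_train):
        # Work on the NumPy arrays underlying the pandas Series
        X = X_train.values
        y = y_train.values

        x_mean = X.mean()
        y_mean = y.mean()

        numerator = denominator = 0

        # Accumulate the two sums from the formula for m
        for i in range(X.shape[0]):
            numerator += (X[i] - x_mean) * (y[i] - y_mean)
            denominator += (X[i] - x_mean) ** 2

        self.m = numerator / denominator
        self.b = y_mean - (self.m * x_mean)

        print(self.m, self.b)

    def predict(self, X_test):
        X = X_test.values

        predicted_values = np.zeros(X.shape[0])

        # Apply y_hat = m * X + b to every test point
        for i in range(X.shape[0]):
            predicted_values[i] = (self.m * X[i]) + self.b

        print(predicted_values)
        # Also return the predictions so callers can reuse them
        return predicted_values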



- Training a linear regression model means finding the values of m (slope/weight) and b (intercept).

In [113]:
df = pd.read_csv('./Advertising.csv')


In [114]:
df.head()

Unnamed: 0,TV,radio,newspaper,sales
0,230.1,37.8,69.2,22.1
1,44.5,39.3,45.1,10.4
2,17.2,45.9,69.3,9.3
3,151.5,41.3,58.5,18.5
4,180.8,10.8,58.4,12.9


In [115]:
df['total'] = df['TV'] + df['radio'] + df['newspaper']

df.drop(['TV','radio', 'newspaper'], axis=1, inplace=True)

df.head()

Unnamed: 0,sales,total
0,22.1,337.1
1,10.4,128.9
2,9.3,132.4
3,18.5,251.3
4,12.9,250.0


In [116]:
X = df['total']
 
y = df['sales']

In [117]:
from sklearn.model_selection import train_test_split

In [118]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [119]:
myModel = LinearRegression()

In [120]:
myModel.fit(X_train, y_train)

0.048957096175931 4.169512013489047


In [121]:
myModel.predict(X_test)

[16.30108045 18.66081248 22.00947786  8.76168763 17.51032072 12.13972727
 18.26426     8.08607971 15.86536229 15.38558275  7.02371072  8.65398202
 19.96796695  6.35789421 12.40409559 14.60716492  7.929417   15.67932532
 10.44091603 17.67677485 20.61909633 14.86174182  8.58544209 21.77937951
  8.04691403  7.93431271 18.38175703 12.19847579 10.28425333  6.01519454
 15.11631872  9.4568784  19.26298476 11.51307644 20.40368511 17.76979333
  9.2512586  21.90666796 10.89132132  6.6075754 ]


In [122]:
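# Import under an alias so the class defined above is not shadowed
from sklearn.linear_model import LinearRegression as SklearnLinearRegression

model = SklearnLinearRegression()

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features)
model.fit(X_train.values.reshape(-1,1), y_train)

print(model.coef_)

print(model.intercept_)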

[0.0489571]
4.16951201348904
