## Implementing Linear Regression from Scratch for Multiple Variables


Dataset Details:

The Boston Housing dataset includes the following columns:

- CRIM: Per capita crime rate by town.
- ZN: Proportion of residential land zoned for large lots (over 25,000 sq. ft.).
- INDUS: Proportion of non-retail business acres per town.
- CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise).
- NOX: Nitric oxides concentration (parts per 10 million).
- RM: Average number of rooms per dwelling.
- AGE: Proportion of owner-occupied units built prior to 1940.
- DIS: Weighted distances to five Boston employment centers.
- RAD: Index of accessibility to radial highways.
- TAX: Full-value property-tax rate per $10,000.
- PTRATIO: Pupil-teacher ratio by town.
- B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town.
- LSTAT: Percentage of the population classed as lower status.
- MEDV: Median value of owner-occupied homes, in $1000s.

Each of the 506 rows describes one census tract in the Boston area. MEDV is the target we want to predict; the remaining columns are candidate predictors.


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error  # a from-scratch version is in the Simple Linear Regression file
from sklearn.model_selection import train_test_split  # a from-scratch version is in the Simple Linear Regression file
from Algo_MultivariableLinearRegression_FromScratch import MultiLR  # the from-scratch algorithm file

In [2]:
df = pd.read_csv("https://raw.githubusercontent.com/Sanchit028/Machine-Learning-from-scratch/main/02.%20Multi-Variable%20Linear%20Regression/BostonHousing.csv") #Importing csv Data File

In [3]:
df.head()

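| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | NaN | 36.2 |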


#### Data Cleaning and Pre-processing


In [4]:
df.shape

(506, 14)

In [5]:
df.dtypes  # Every column is numeric, so no encoding is needed before fitting a regression

CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD          int64
TAX          int64
PTRATIO    float64
B          float64
LSTAT      float64
MEDV       float64
dtype: object

In [6]:
df.info()  # Six columns have only 486 non-null values out of 506, so some values are missing

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   CRIM     486 non-null    float64
 1   ZN       486 non-null    float64
 2   INDUS    486 non-null    float64
 3   CHAS     486 non-null    float64
 4   NOX      506 non-null    float64
 5   RM       506 non-null    float64
 6   AGE      486 non-null    float64
 7   DIS      506 non-null    float64
 8   RAD      506 non-null    int64  
 9   TAX      506 non-null    int64  
 10  PTRATIO  506 non-null    float64
 11  B        506 non-null    float64
 12  LSTAT    486 non-null    float64
 13  MEDV     506 non-null    float64
dtypes: float64(12), int64(2)
memory usage: 55.5 KB


In [7]:
df.isnull().sum() 

CRIM       20
ZN         20
INDUS      20
CHAS       20
NOX         0
RM          0
AGE        20
DIS         0
RAD         0
TAX         0
PTRATIO     0
B           0
LSTAT      20
MEDV        0
dtype: int64

In [8]:
# Columns that contain NaN values
na_columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'AGE', 'LSTAT']

# Fill NaN values in each of these columns with that column's mean.
# Note: this also turns missing CHAS entries (a 0/1 dummy) into fractional values.
df[na_columns] = df[na_columns].fillna(df[na_columns].mean())
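
Mean imputation suits continuous columns, but a 0/1 dummy such as CHAS may be better filled with its most frequent value. The following is a minimal sketch on made-up data, not the approach used in this notebook; the frame `toy` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up data: one continuous column and one 0/1 dummy, each with a gap.
toy = pd.DataFrame({
    "rate": [1.0, np.nan, 3.0, 4.0],
    "flag": [0.0, 0.0, np.nan, 1.0],
})

# Continuous column: fill with the mean of the observed values.
toy["rate"] = toy["rate"].fillna(toy["rate"].mean())

# Binary column: fill with the mode so the column stays strictly 0/1.
toy["flag"] = toy["flag"].fillna(toy["flag"].mode()[0])

print(toy)
```

Whether this changes the final error on the housing data is not tested here.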

In [9]:
df.isnull().sum()  # All NaN values are now filled

CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

In [10]:
# Absolute correlation of every column with the target variable
target_corr = np.abs(df.corrwith(df["MEDV"]))
target_corr.sort_values(ascending=False)  # strongest correlations first

MEDV       1.000000
LSTAT      0.721975
RM         0.695360
PTRATIO    0.507787
INDUS      0.478657
TAX        0.468536
NOX        0.427321
RAD        0.381626
AGE        0.380223
CRIM       0.379695
ZN         0.365943
B          0.333461
DIS        0.249929
CHAS       0.179882
dtype: float64
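
The ranking above can also drive feature selection programmatically instead of by hand. The sketch below uses made-up data and a hypothetical cutoff of 0.35; applied to the values above, that cutoff would exclude B, DIS and CHAS and keep ZN.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
strong = rng.normal(size=n)

# Made-up frame: one informative feature, one pure-noise feature, one target.
toy = pd.DataFrame({
    "strong": strong,
    "noise": rng.normal(size=n),
    "target": 3.0 * strong + 0.1 * rng.normal(size=n),
})

threshold = 0.35  # hypothetical cutoff, chosen for illustration only

# Absolute correlation of each feature with the target (target itself removed).
abs_corr = toy.corrwith(toy["target"]).abs().drop("target")

# Keep only the features at or above the cutoff.
selected = abs_corr[abs_corr >= threshold].index.tolist()
print(selected)
```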

In [11]:
X = df.drop(["MEDV", "B", "DIS", "CHAS"], axis=1)  # Drop the target and the three features least correlated with it
y = df["MEDV"]  # Target attribute
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((404, 10), (102, 10), (404,), (102,))
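
For readers without the Simple Linear Regression file at hand, a shuffle-and-slice split can be sketched in a few lines of NumPy. The function name `simple_train_test_split` is hypothetical, and it will not select the same rows as sklearn for a given seed; rounding the test size up does give the same 404/102 sizes as above.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices once, then slice them into test and train parts."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(np.ceil(len(X) * test_size))  # round the test size up
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Made-up arrays with the same shape as the data used here.
X_demo = np.arange(506 * 10).reshape(506, 10)
y_demo = np.arange(506)
Xtr, Xte, ytr, yte = simple_train_test_split(X_demo, y_demo)
print(Xtr.shape, Xte.shape, ytr.shape, yte.shape)
```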

In [12]:
X_train, X_test, y_train, y_test = X_train.values, X_test.values, y_train.values, y_test.values #converting them to numpy arrays

#### Model Training and Prediction


In [13]:
reg = MultiLR(0.00000507, 75000)  # Multivariate linear regression model; the arguments appear to be the learning rate and the iteration count
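
The source of `MultiLR` lives in the imported file and is not shown here. As a rough idea of what such a class might look like, here is a minimal sketch of batch gradient descent on mean squared error. Everything in it is an assumption: the class name `SketchMultiLR`, the argument names, the zero initialisation and the exact cost definition may all differ from the real file. Its `predict` expects the column of ones to be present already, mirroring how `predict` is called in this notebook.

```python
import numpy as np

class SketchMultiLR:
    """Hypothetical stand-in for MultiLR: batch gradient descent on MSE."""

    def __init__(self, learning_rate, n_iterations):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.coefs = None

    def fit(self, X, y, verbose=False):
        Xb = np.insert(X, 0, 1, axis=1)       # column of ones for the intercept
        n_samples, n_features = Xb.shape
        self.coefs = np.zeros(n_features)     # assumed starting point
        costs = []
        for i in range(self.n_iterations):
            residuals = Xb @ self.coefs - y
            cost = (residuals ** 2).mean()    # assumed cost: mean squared error
            costs.append(cost)
            if verbose:
                print(f"Iteration : {i + 1} \t Cost : {cost:.4f}")
            gradient = (2.0 / n_samples) * (Xb.T @ residuals)
            self.coefs -= self.learning_rate * gradient
        return self.coefs, costs

    def predict(self, Xb):
        # Xb must already contain the leading column of ones.
        return Xb @ self.coefs

# Usage on made-up, noise-free data with known coefficients.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 4.0 + X_demo @ np.array([1.5, -2.0, 0.5])
model = SketchMultiLR(0.1, 500)
coefs, costs = model.fit(X_demo, y_demo)
print(coefs)
```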

In [14]:
coefs, costs = reg.fit(X_train, y_train, True)  # If anybody can help me optimise this, please contact me.

Iteration : 1 	 Cost : 241675.8508
Iteration : 2 	 Cost : 239696.1709
Iteration : 3 	 Cost : 237741.5172
Iteration : 4 	 Cost : 235811.5094
Iteration : 5 	 Cost : 233905.7733
Iteration : 6 	 Cost : 232023.9407
Iteration : 7 	 Cost : 230165.6493
Iteration : 8 	 Cost : 228330.5428
Iteration : 9 	 Cost : 226518.2701
Iteration : 10 	 Cost : 224728.4863
Iteration : 11 	 Cost : 222960.8516
Iteration : 12 	 Cost : 221215.0317
Iteration : 13 	 Cost : 219490.6978
Iteration : 14 	 Cost : 217787.5263
Iteration : 15 	 Cost : 216105.1986
Iteration : 16 	 Cost : 214443.4014
Iteration : 17 	 Cost : 212801.8264
Iteration : 18 	 Cost : 211180.1699
Iteration : 19 	 Cost : 209578.1334
Iteration : 20 	 Cost : 207995.4231
Iteration : 21 	 Cost : 206431.7498
Iteration : 22 	 Cost : 204886.8289
Iteration : 23 	 Cost : 203360.3804
Iteration : 24 	 Cost : 201852.1287
Iteration : 25 	 Cost : 200361.8027
Iteration : 26 	 Cost : 198889.1355
Iteration : 27 	 Cost : 197433.8644
Iteration : 28 	 Cost : 195995.7311
...
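
A cost that falls this slowly with such a small learning rate is typical of gradient descent on features with very different scales (TAX is in the hundreds while NOX is below 1). One common remedy is to standardise each feature using statistics from the training set only, which usually allows a much larger learning rate and far fewer iterations. The sketch below uses made-up data and a hypothetical helper `standardize`; it has not been run against `MultiLR`.

```python
import numpy as np

def standardize(train, test):
    """Scale both sets to zero mean and unit variance using training statistics."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return (train - mean) / std, (test - mean) / std

rng = np.random.default_rng(0)
# Made-up columns on very different scales.
train_demo = np.column_stack([rng.normal(400, 170, 300), rng.normal(0.55, 0.1, 300)])
test_demo = np.column_stack([rng.normal(400, 170, 50), rng.normal(0.55, 0.1, 50)])

train_scaled, test_scaled = standardize(train_demo, test_demo)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
```

With scaled inputs the learned coefficients refer to standardised features, so the same transformation must be applied to any data passed to `predict`.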

In [15]:
Y_pred = reg.predict(np.insert(X_test, 0, 1, axis=1))  # Prepend a column of ones so the intercept term is included

In [16]:
mean_squared_error(y_test, Y_pred)

46.73711565520401
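
Mean squared error is simple enough to compute by hand, and its square root puts the error back in the units of MEDV. An MSE of 46.74 corresponds to an RMSE of about 6.84, roughly $6,800 per home. The helper name `mse` below is hypothetical, and the inputs are made up.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Squared errors are 1, 0 and 4, so the mean is 5/3.
mse_demo = mse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
rmse_demo = mse_demo ** 0.5
print(mse_demo, rmse_demo)
```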

#### Checking our result by comparing it with sklearn model results


In [17]:
from sklearn import linear_model


In [18]:
reg2 = linear_model.LinearRegression()
reg2.fit(X_train, y_train)

LinearRegression()

In [19]:
mean_squared_error(y_test, reg2.predict(X_test))  # sklearn scores lower (40.90 vs 46.74): our model is close, but gradient descent has likely not fully converged

40.902230306456715
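
sklearn's `LinearRegression` solves the least-squares problem directly instead of iterating, which is why it serves as a reference point for the gradient-descent result. The same kind of direct solution can be sketched with `np.linalg.lstsq`. The data below is made up and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data with known coefficients and a little noise.
X_demo = rng.normal(size=(100, 3))
y_demo = 2.0 + X_demo @ np.array([1.0, -3.0, 0.5]) + 0.01 * rng.normal(size=100)

# Prepend a column of ones so the first coefficient is the intercept.
Xb = np.insert(X_demo, 0, 1, axis=1)

# Least-squares solution of Xb @ coefs = y_demo.
coefs, *_ = np.linalg.lstsq(Xb, y_demo, rcond=None)
print(coefs)
```

A gradient-descent implementation that has converged should land on approximately the same coefficients as this direct solve.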