Linear Regression Exercise (Closed Form Solution)

In statistics, linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables) [Wikipedia]. The closed form solution to finding the parameter $\theta$ of a linear regression model is given by $$\theta = (X^TX)^{-1}X^TY$$ where X are your features and Y is your target.

In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
import pandas as pd 
df =pd.read_csv("/content/drive/MyDrive/AMMI_Lesson1/qsar_aquatic_toxicity.csv",header=None,sep=";")

In [None]:
cols=["TPSA(Tot)",'SAacc',' H-050','MLOGP','RDCHI','GATS1p','nN','C-040','target']
df.columns=cols


In [None]:
df.head()

Unnamed: 0,TPSA(Tot),SAacc,H-050,MLOGP,RDCHI,GATS1p,nN,C-040,target
0,0.0,0.0,0,2.419,1.225,0.667,0,0,3.74
1,0.0,0.0,0,2.638,1.401,0.632,0,0,4.33
2,9.23,11.0,0,5.799,2.93,0.486,0,0,7.019
3,9.23,11.0,0,5.453,2.887,0.495,0,0,6.723
4,9.23,11.0,0,4.068,2.758,0.695,0,0,5.979


In [None]:
import numpy as np
class LinearReg:
  '''
  class for Linear Regression Model
  '''
  def fit(self,X,y):
    '''
    X :These are the input features
    y : These are the corresponding y values of X
    '''
    X=np.array(X)
    y=np.array(y)
    firstpart=np.dot(X.T,X)#X transpose * X
    firstpart_result =np.linalg.inv(firstpart)#inverse above answer
    second=np.dot(X.T,y)#X transpose * y
    self.theta= np.dot(firstpart_result,second) # combinne all above to get theta
    return self.theta


  def predict(self,X):
    '''
    computes the prediction
    X  :Features
    '''
    X=np.array(X)
    result =np.dot(X,self.theta)
    return result


  def mse(self,y_true,y_pred):
    '''
    n : number of elements
    y_true : actual value of label
    y_pred : predicted value of label
    '''
    n = y_true.shape[0]
    return (1/n)*(np.sum(y_true-y_pred)**2)


In [None]:
A=LinearReg()#initialize our class

In [None]:
y=df['target']#label
X=df.loc[:, df.columns != 'target']#Features

In [None]:
z=A.fit(X,y)#fit the features and corresponding label

In [None]:
res1=A.predict(X)#predictions

In [None]:
A.mse(y,res1)#compute the error

7.800044554426762

PART A

You will be implementing this model on a dataset of your choice using <b>numpy</b>.

Steps

1. Get any interesting dataset online. You can use this dataset repo [mcu dataset](https://archive.ics.uci.edu/ml/datasets.php). We will try to predict one of the features with continuous values. Set the continuous variable as your target column and other columns as your features i.e divide your dataset into $X$ and $y$.
Hint: download the dataset and use pandas to load the data into your environment. You should be familiar with this already.
2. We will bypass the step of exploring your data and assume that your data $(X, y)$ is linearly separable.
3. Create a class called LinearReg: 
    - the \_\_init\_\_ constructor will take hyperparameters for the model class. **Ignore this for this exercise as you do not currently have any hyperparameters**
    - the class will have two major methods **"fit"** and **"predict"**.
    - the fit method takes as input $X$ and $y$, and calculates $\theta$ using the formula above. Make sure you save $\theta$ in as a **class variable** after calculation.
    -  the predict function takes in $X$ and returns predictions as $\hat{y}$. Use the same data $X$ for prediction. Do not worry of overfitting the model at this point.
        $$\hat{Y}=X\theta$$
4. Next create a static method in your class, called **"rmse"** that takes in the original target **y** and your predictions **\hat{y}**, and uses them to calculate the mean square error in prediction (MSE). MSE is computed as;
$$MSE = \sum{(y - \hat{y})^2} $$
The MSE helps us to know how well we were able to model the data. Lower MSE is always better.
5. Run your linear regression model

To run your model
1. Instantiate the model class, model = LinearReg()
2. Run model.fit() with $(X, y)$ as arguments
3. Run $\hat{y}$ = model.predict() with $X$ as argument
4. Compare the predictions to the target with model.rmse(y, $\hat{y})$ . What is the rmse value?

PART B

Well, we have some bugs in our code and this next section will help to fix that. Linear regression models usually have a zeroth parameter, $\theta_0$ which helps to model the "bias" and gives an extra degree of freedom to the model. To fix this, we do the following.

5. Create a function, called **"add_ones"** which takes in $X$ and returns an augmented version where 1s have been concatenated with X. This implies that a new column is added to $X$ which contain ones. Call this new augmented data, $X_{new}$. The function should return $X_{new}$. Note that $X_{new}$ has one column more than $X$.
6. Edit your fit method **to add ones** to $X$ to make $X_{new}$ before computing $\theta$ in your code.
7. Now, calculate the $MSE$ for your predictions using this $X_{new}$. 

- Is it better than the previous MSE you got? Give any reason why it is better or not better.

In [None]:
import numpy as np
class LinearReg:
  '''
  class for Linear Regression Model
  '''
  def fit(self,X,y):
    '''
    X :These are the input features
    y : These are the corresponding y values of X
    firstpart :X transpose times X
    firstpart_result : inverse of firstpart
    second : X transpose y
    theta :dot product firstpart_result and second
    
    '''
    X=np.array(X)
    y=np.array(y)
    firstpart=np.dot(X.T,X)#X transpose * X
    firstpart_result =np.linalg.inv(firstpart)#inverse above answer
    second=np.dot(X.T,y)#X transpose * y
    self.theta= np.dot(firstpart_result,second) # combinne all above to get theta
    return self.theta


  def predict(self,X):
    '''
    Takes in features and gives out predictions
    '''

    X=np.array(X)
    result =np.dot(X,self.theta)
    return result


  def mse(self,y_true,y_pred):
    '''
    takes in actual labela snd predictions and computes the error
    '''
    n = y_true.shape[0]
    return (1/n*(np.sum(y_true-y_pred)**2))

  def add_ones(self,X):
    '''
    Adds a column of ones
    '''
    X=np.array(X)
    X_new=np.c_[np.ones(X.shape[0]),X]
    return X_new



In [None]:
A=LinearReg()

In [None]:
new_X=A.add_ones(X)

In [None]:

z=A.fit(new_X,y)

In [None]:
res2=A.predict(new_X)

In [None]:
error=A.mse(y,res2)

In [None]:
error

4.2756838983055925e-28

In [None]:
round(error,5)

0.0

**Conclussion**

The new mse is far better than the previous mse.This is because we added the vector of ones(new column) which acts as a fake feature that helped us find the intercept unlike in our first step only where  the parameters associated with the features were returned.