Linear Regression Exercise (Closed Form Solution)

In statistics, linear regression is a linear approach to modelling the relationship between a scalar response and one or more explanatory variables (also known as the dependent and independent variables, respectively) [Wikipedia]. The closed-form solution for the parameter vector $\theta$ of a linear regression model is given by $$\theta = (X^TX)^{-1}X^TY$$ where $X$ is the feature matrix and $Y$ is the target vector.
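As a quick sanity check of the formula, here is a minimal sketch on made-up data (the arrays are illustrative and do not come from any dataset). With a single feature and targets that are exactly twice the feature, $X^TX = 14$ and $X^TY = 28$, so the closed form should recover $\theta = 2$.

```python
import numpy as np

# Toy data (assumed for illustration): y = 2 * x exactly, no intercept
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [4.0], [6.0]])

# theta = (X^T X)^{-1} X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y
print(theta)
```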

PART A

You will be implementing this model on a dataset of your choice using <b>numpy</b>.

Steps

1. Get any interesting dataset online. You can use this dataset repo: [UCI dataset](https://archive.ics.uci.edu/ml/datasets.php). We will try to predict one of the columns with continuous values. Set that continuous variable as your target column and the other columns as your features, i.e., divide your dataset into $X$ and $y$.
Hint: download the dataset and use pandas to load the data into your environment. You should be familiar with this already.
2. We will bypass the step of exploring your data and assume that the relationship between $X$ and $y$ is approximately linear.
3. Create a class called LinearReg: 
    - the \_\_init\_\_ constructor will take hyperparameters for the model class. **Ignore this for this exercise as you do not currently have any hyperparameters**
    - the class will have two major methods **"fit"** and **"predict"**.
    - the fit method takes as input $X$ and $y$, and calculates $\theta$ using the formula above. Make sure you save $\theta$ as an attribute of the class instance (for example, self.theta) after calculation.
    - the predict method takes in $X$ and returns predictions as $\hat{y}$. Use the same data $X$ for prediction. Do not worry about overfitting the model at this point.
        $$\hat{Y}=X\theta$$
4. Next create a static method in your class, called **"rmse"**, that takes in the original target $y$ and your predictions $\hat{y}$, and uses them to calculate the mean squared error of the predictions (MSE). Despite its name, the method returns the MSE, not its square root. MSE is computed as:
$$MSE = \frac{1}{n}\sum_{i=1}^{n}{(y_i - \hat{y}_i)^2}$$
where $n$ is the number of samples. The MSE tells us how well the model fits the data: on the same dataset, a lower MSE means a better fit.
5. Run your linear regression model
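To make the MSE computation in step 4 concrete, here is a small worked sketch on made-up numbers. The function name `mse` and the arrays are assumptions for illustration; the sketch takes the mean of the squared residuals, which is what `np.mean` computes.

```python
import numpy as np

def mse(y, y_hat):
    # Mean of the squared residuals
    return np.mean((y - y_hat) ** 2)

# Toy targets and predictions (assumed for illustration)
y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

# Residuals are 1, 0, -2; their squares are 1, 0, 4; the mean is 5/3
print(mse(y, y_hat))
```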

To run your model
1. Instantiate the model class, model = LinearReg()
2. Run model.fit() with $(X, y)$ as arguments
3. Run $\hat{y}$ = model.predict() with $X$ as argument
4. Compare the predictions to the target with model.rmse(y, $\hat{y}$). What is the rmse value?

In [None]:
import pandas as pd
import numpy as np


In [None]:
!pwd

/content


In [None]:
 !wget -P data https://raw.githubusercontent.com/satishgunjal/datasets/master/multivariate_housing_prices_in_portlans_oregon.csv


--2021-02-15 12:46:59--  https://raw.githubusercontent.com/satishgunjal/datasets/master/multivariate_housing_prices_in_portlans_oregon.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 705 [text/plain]
Saving to: ‘data/multivariate_housing_prices_in_portlans_oregon.csv’


2021-02-15 12:46:59 (39.7 MB/s) - ‘data/multivariate_housing_prices_in_portlans_oregon.csv’ saved [705/705]



In [None]:
path = 'data'  # relative directory that `wget -P data` saved the CSV into

In [None]:
Mydata=pd.read_csv("data/multivariate_housing_prices_in_portlans_oregon.csv",sep=',')
Mydata.shape


(47, 3)

In [None]:
Mydata.head(5)

Unnamed: 0,size(in square feet),number of bedrooms,price
0,2104,3,399900
1,1600,3,329900
2,2400,3,369000
3,1416,2,232000
4,3000,4,539900


In [None]:
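class LinearReg():
    def __init__(self, theta=None):
        # No hyperparameters yet; theta is filled in by fit()
        self.theta = theta

    def fit(self, X, y):
        # Closed-form solution: theta = (X^T X)^{-1} X^T y
        self.theta = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
        return self.theta

    def predict(self, X):
        # y_hat = X theta
        y_hat = np.dot(X, self.theta)
        return y_hat

    @staticmethod
    def rmse(y, predict):
        # Named "rmse" as the exercise asks, but returns the mean squared error
        mse = np.mean((predict - y)**2)
        return mse

    def __repr__(self):
        return "My first part is done. Thank you"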
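
Explicitly inverting $X^TX$ is what the exercise asks for, but it can be numerically fragile when the columns of $X$ are nearly collinear. As an aside that is not part of the exercise, the sketch below compares the inverse-based formula with `np.linalg.lstsq` on made-up, noise-free data, where both should agree. The random data and the names `theta_inv` and `theta_lstsq` are assumptions for illustration.

```python
import numpy as np

# Made-up, noise-free data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
true_theta = np.array([[1.5], [-0.5]])
y = X @ true_theta

# Closed form with an explicit inverse, as in the exercise
theta_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver; the first returned value is the solution
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_inv, theta_lstsq))
```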

In [None]:
X = Mydata.values[:, 0:2]
y = Mydata.values[:, 2].reshape(len(Mydata.values[:, 2]), 1)

In [None]:
y.shape

(47, 1)

In [None]:
model =LinearReg()

In [None]:
model.fit(X,y)

array([[  140.86108621],
       [16978.19105903]])

In [None]:
predicted=model.predict(X)
predicted.shape

(47, 1)

In [None]:
model.rmse(y, predicted)

4513951420.499287

In [None]:
model.__repr__()

'My first part is done. Thank you'

PART B

Well, we have some bugs in our code and this next section will help to fix that. Linear regression models usually have a zeroth parameter, $\theta_0$ which helps to model the "bias" and gives an extra degree of freedom to the model. To fix this, we do the following.

5. Create a function called **"add_ones"** that takes in $X$ and returns an augmented version in which a column of 1s has been concatenated with $X$. Call this new augmented data $X_{new}$; it has one more column than $X$, and the function should return it.
6. Edit your fit method **to add ones** to $X$ to make $X_{new}$ before computing $\theta$ in your code.
7. Now, calculate the $MSE$ for your predictions using this $X_{new}$. 

- Is it better than the previous MSE you got? Give any reason why it is better or not better.
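
To see what the extra parameter $\theta_0$ buys, here is a minimal sketch on made-up data generated as $y = 3 + 2x$. The helper names `fit_closed_form` and `mse` are assumptions for illustration. Without the column of ones the fitted line is forced through the origin, so it cannot match the data exactly; with it, the fit should recover the intercept 3 and the slope 2.

```python
import numpy as np

# Toy data (assumed for illustration): y = 3 + 2x, so the true intercept is 3
x = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 3.0 + 2.0 * x

def fit_closed_form(X, y):
    # theta = (X^T X)^{-1} X^T y
    return np.linalg.inv(X.T @ X) @ X.T @ y

def mse(y, y_hat):
    return np.mean((y - y_hat) ** 2)

# Fit without an intercept: the line must pass through the origin
theta_no_bias = fit_closed_form(x, y)
mse_no_bias = mse(y, x @ theta_no_bias)

# Fit with a leading column of ones: theta[0] plays the role of theta_0
X_new = np.c_[np.ones(x.shape[0]), x]
theta_bias = fit_closed_form(X_new, y)
mse_bias = mse(y, X_new @ theta_bias)

print(mse_no_bias, mse_bias)
```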

In [None]:
def add_ones(X):
    # Prepend a column of ones to X so that theta_0 acts as the intercept
    n = X.shape[0]
    X_new = np.c_[np.ones(n), X]
    return X_new
  
add_ones(np.array([[1,2,3],
                   [4,5,6]]))

array([[1., 1., 2., 3.],
       [1., 4., 5., 6.]])

In [None]:
class Lnmodel():
    def __init__(self, theta=None, fit_intercept=True):
        self.theta = theta
        self.fit_intercept = fit_intercept

    def fit(self, X, y):
        # Add the column of ones here so callers pass the raw X
        if self.fit_intercept:
            X_new = add_ones(X)
        else:
            X_new = X
        self.theta = np.linalg.inv(X_new.T.dot(X_new)).dot(X_new.T).dot(y)
        return self.theta

    def predict(self, X_new):
        # Unlike fit, predict does not add ones: pass the already augmented X_new
        y_hat = np.dot(X_new, self.theta)
        return y_hat

    @staticmethod
    def rmse(y, predict):
        # Named "rmse", but returns the mean squared error
        mse = np.mean((predict - y)**2)
        return mse

    def plot(self, reference_line=False):
        # Placeholder, not implemented
        pass

    def __repr__(self):
        return "My second part is done.Thank you"
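
In the class above, fit receives the raw $X$ while predict expects the already augmented $X_{new}$, which is easy to mix up. One possible alternative, sketched here under the hypothetical name `LnmodelAuto` (not part of the exercise), is to let both methods build the augmented matrix through a shared helper so that callers always pass the raw $X$.

```python
import numpy as np

def add_ones(X):
    # Prepend a column of ones to X
    return np.c_[np.ones(X.shape[0]), X]

class LnmodelAuto:  # hypothetical name, for illustration only
    def __init__(self, fit_intercept=True):
        self.theta = None
        self.fit_intercept = fit_intercept

    def _design(self, X):
        # Build the matrix actually used in the formula
        return add_ones(X) if self.fit_intercept else X

    def fit(self, X, y):
        X_new = self._design(X)
        self.theta = np.linalg.inv(X_new.T @ X_new) @ X_new.T @ y
        return self.theta

    def predict(self, X):
        # Same augmentation as in fit, so raw X is accepted here too
        return self._design(X) @ self.theta

# Toy data (assumed for illustration): y = 1 + 2x
X = np.array([[0.0], [1.0], [2.0]])
y = np.array([[1.0], [3.0], [5.0]])
model = LnmodelAuto()
model.fit(X, y)
print(model.predict(X))
```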

In [None]:
model = Lnmodel()
X = Mydata.values[:, 0:2]
y = Mydata.values[:, 2].reshape(-1, 1)  # -1 infers the row count instead of relying on an earlier y
X_new = add_ones(X)


In [None]:
model.fit(X, y )
#plotLnmodel( Lnmodel , X, y )


array([[89597.9095428 ],
       [  139.21067402],
       [-8738.01911233]])

In [None]:
predicted = model.predict(X_new)
predicted.shape

(47, 1)

In [None]:
model.rmse(y, predicted)

4086560101.205656
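
The values returned by `rmse` in Part A and Part B are mean squared errors, so they are in squared price units. As a small sketch, taking the square root of each puts them back in the units of the price column, which is easier to interpret. The two constants are copied from the outputs above; the variable names are assumptions for illustration.

```python
import numpy as np

mse_part_a = 4513951420.499287  # reported in Part A, without the column of ones
mse_part_b = 4086560101.205656  # reported in Part B, with the column of ones

# Root mean squared error, in the same units as the price column
rmse_part_a = np.sqrt(mse_part_a)
rmse_part_b = np.sqrt(mse_part_b)

print(rmse_part_a, rmse_part_b)
```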

In [None]:
model.__repr__()

'My second part is done.Thank you'