# Multivariate Regression (GD)

In this notebook we implement multivariate linear regression using gradient descent. In particular, you will have to:

* Complete the function `cost_function` to implement the cost function for multivariate regression.
* Complete the function `GDmultiLinparamEstimates` to implement the gradient descent algorithm for multivariate regression.

Note that we do not cover single-variable linear regression with gradient descent in this experiment, as it is very similar. You can try it yourself after going through this notebook.

# Import libraries

The required libraries for this notebook are pandas, sklearn and numpy.

In [1]:
# import libraries
import pandas
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split



# Load the data
The data we are using comes from ***multi_regr_data.csv***. It consists of 1000 data points related to student marks. Each data point has 3 columns (marks), and we use all of them for multivariate linear regression: the first 2 marks are the features used to predict the 3rd mark.


In [2]:
# Loading the CSV file
dataset=pandas.read_csv('./multi_regr_data.csv')
print(dataset.shape)  # (number of rows, number of columns)

(1000, 3)


# Split data into training and testing

In [3]:
# Split the data: the first 2 columns are the features and the 3rd column is the target.
X = dataset[list(dataset.columns)[:-1]]
Y = dataset[list(dataset.columns)[-1]]

# As pointed out in the previous lab, we need to add a constant feature (a column of ones) for the intercept
intercept = np.ones((X.shape[0], 1))
X = np.concatenate((intercept, X), axis=1)

# Split the data into training and testing sets (75% training and 25% testing data, the default split)
xtrain, xtest, ytrain, ytest = train_test_split(X, Y, random_state=0)
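
To see why the constant feature is needed, the following minimal sketch checks that multiplying the augmented matrix by beta gives the same result as adding the intercept explicitly. The marks and beta values are made up for illustration, and the names `features`, `X_aug` and `beta_demo` are not part of the lab code.

```python
import numpy as np

# Two hypothetical rows of marks and a hypothetical parameter vector
features = np.array([[2.0, 3.0], [4.0, 5.0]])
beta_demo = np.array([10.0, 1.0, 0.5])  # [intercept, weight_1, weight_2]

# Prepend a column of ones, as done for X above
X_aug = np.concatenate((np.ones((features.shape[0], 1)), features), axis=1)

# One matrix product handles the intercept and the weights together
with_ones_column = X_aug.dot(beta_demo)
explicit_intercept = beta_demo[0] + features.dot(beta_demo[1:])

print(with_ones_column)
print(explicit_intercept)
```

Both lines print the same predictions, so the intercept can be treated as just another entry of beta.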


# Optimize using Gradient Descent Algorithm

The cost function is defined as follows:
\begin{align}
J\left(\beta \right) & = \frac{1}{2n}\sum_{i=1}^n \left(y_i - \hat{y}_i \right)^2
\end{align}
or, equivalently,
\begin{align}
J\left(\beta \right) & = \frac{1}{2n}SSR\left(y,\hat{y} \right)
\end{align}
where SSR is the sum of squared residuals and n is the number of data points.

You are asked to implement this cost function.

In [4]:
def cost_function(X, Y, beta):
    # implement cost function here
    n=X.shape[0]
    J = np.sum((X.dot(beta) - Y) ** 2)/(2 * n)
    return J


# Initialize beta: all the beta values are set to 0.
beta = np.zeros(3, dtype=float)

# Calculate the initial cost
initial_cost = cost_function(X, Y, beta)
print(initial_cost)

2470.11
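
A quick way to sanity-check a cost function is to evaluate it on data where the answer can be worked out by hand. The sketch below uses three made-up points lying exactly on y = 1 + 2x; the names `X_toy` and `Y_toy` are illustrative only.

```python
import numpy as np

def cost_function(X, Y, beta):
    # Half the mean squared residual, matching J(beta) defined above
    n = X.shape[0]
    return np.sum((X.dot(beta) - Y) ** 2) / (2 * n)

# Toy data (made up): intercept column plus one feature, y = 1 + 2x
X_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
Y_toy = np.array([1.0, 3.0, 5.0])

# With beta = 0 every prediction is 0, so J = (1 + 9 + 25) / (2 * 3)
cost_at_zero = cost_function(X_toy, Y_toy, np.zeros(2))
# With the true parameters the fit is exact, so J = 0
cost_at_truth = cost_function(X_toy, Y_toy, np.array([1.0, 2.0]))

print(cost_at_zero, cost_at_truth)
```

If the first value is not 35/6 or the second is not 0, the implementation is probably missing the square or the 1/(2n) factor.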


**Gradient Descent Steps:**
1. Initialize values: \begin{align}\beta_0,\beta_1,\dots,\beta_p\end{align} where p is the number of features. It is suggested that you initialize them with 0.
2. Iteratively update, until convergence: \begin{align} \beta_j := \beta_j - \alpha \frac{\partial}{\partial \beta_j}J\left(\beta \right) \end{align} where α is the learning rate.

**Hint:** The update in step 2 can also be written as \begin{align} \beta_j := \beta_j - \alpha \frac{1}{n}\sum_{i=1}^n \left(\hat{y}_i - y_i \right)x_{ij} \end{align}
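
The hint follows from differentiating the cost function term by term. The short derivation below assumes the linear hypothesis with the constant feature included, i.e. each prediction is the sum of the beta values times the corresponding features, and the constant feature equals 1:
\begin{align}
\hat{y}_i & = \sum_{j} \beta_j x_{ij}, \qquad x_{i0} = 1 \\
\frac{\partial}{\partial \beta_j}J\left(\beta \right) & = \frac{1}{2n}\sum_{i=1}^n 2\left(y_i - \hat{y}_i \right)\left(-\frac{\partial \hat{y}_i}{\partial \beta_j}\right) \\
& = \frac{1}{n}\sum_{i=1}^n \left(\hat{y}_i - y_i \right)x_{ij}
\end{align}
Stacking these partial derivatives for all j gives the vectorized form, which is convenient to compute with numpy:
\begin{align}
\nabla J\left(\beta \right) & = \frac{1}{n}X^T\left(X\beta - y \right)
\end{align}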

In [5]:

def GDmultiLinparamEstimates(X, Y, beta, learning_rate, iterations):
    cost_history = [0] * iterations
    n = X.shape[0]
    
    for iteration in range(iterations):
        # complete the code below.
        # Residuals: predictions minus targets
        loss = X.dot(beta) - Y
        # Gradient of the cost (the 1/n factor is applied in the update below)
        gradient = X.T.dot(loss)
        # Update beta by stepping against the gradient
        beta = beta - learning_rate * gradient / n
        
        cost = cost_function(X, Y, beta)
        cost_history[iteration] = cost
        
    return beta, cost_history

iterations = 100000 # number of gradient descent iterations
learning_rate = 0.0001 # alpha



# run your algorithm
newB, cost_history = GDmultiLinparamEstimates(X, Y, beta, learning_rate, iterations)

# New Values of B
print(newB)
# Final Cost of new B
print(cost_history[-1])


[-0.47889172  0.09137252  0.90144884]
10.475123473539167
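
The function above always runs a fixed number of iterations, whereas the steps say "until convergence". One possible way to express that is to stop once the cost barely changes between iterations. The sketch below is only an illustration: `gd_until_converged`, its `tol` parameter and the toy data are hypothetical and not part of the lab code.

```python
import numpy as np

def cost_function(X, Y, beta):
    n = X.shape[0]
    return np.sum((X.dot(beta) - Y) ** 2) / (2 * n)

def gd_until_converged(X, Y, beta, learning_rate, max_iterations, tol=1e-12):
    # Hypothetical helper: same update rule as above, but it stops early once
    # the cost decreases by less than `tol` between consecutive iterations.
    # It also stops if the cost increases, which usually signals that the
    # learning rate is too large.
    n = X.shape[0]
    cost = cost_function(X, Y, beta)
    for iteration in range(1, max_iterations + 1):
        gradient = X.T.dot(X.dot(beta) - Y) / n
        beta = beta - learning_rate * gradient
        new_cost = cost_function(X, Y, beta)
        converged = cost - new_cost < tol
        cost = new_cost
        if converged:
            return beta, cost, iteration
    return beta, cost, max_iterations

# Toy data (made up for illustration): y = 1 + 2x exactly
X_toy = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
Y_toy = np.array([1.0, 3.0, 5.0, 7.0])

beta_hat, final_cost, steps = gd_until_converged(
    X_toy, Y_toy, np.zeros(2), learning_rate=0.1, max_iterations=10000)
print(beta_hat, final_cost, steps)
```

On this toy problem the loop stops well before the iteration cap, with beta close to the true values 1 and 2.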


The final hypothesis for the whole dataset (i.e. X and Y as defined above) should be:
\begin{align}
y & = -0.47889172 + 0.09137252\,x_1 + 0.90144884\,x_2
\end{align}
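
One way to judge whether gradient descent has run long enough is to compare its result with the exact least-squares solution. The sketch below does this on synthetic data generated for illustration, not on the marks dataset, using `np.linalg.lstsq` as the reference; the names `X_demo`, `Y_demo`, `beta_exact` and `beta_gd`, as well as the learning rate and iteration count, are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the marks data (hypothetical values)
n_points = 200
features = rng.uniform(0, 10, size=(n_points, 2))
X_demo = np.concatenate((np.ones((n_points, 1)), features), axis=1)
Y_demo = X_demo.dot(np.array([-0.5, 0.1, 0.9])) + rng.normal(0, 0.5, size=n_points)

# Reference: exact least-squares solution
beta_exact, *_ = np.linalg.lstsq(X_demo, Y_demo, rcond=None)

# Plain gradient descent with the same update rule as above
beta_gd = np.zeros(3)
for _ in range(50000):
    residuals = X_demo.dot(beta_gd) - Y_demo
    beta_gd = beta_gd - 0.01 * X_demo.T.dot(residuals) / n_points

print(beta_exact)
print(beta_gd)
```

If the two vectors differ noticeably, more iterations or a different learning rate are needed; the intercept is typically the slowest parameter to converge when the features are not centred.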

Next, we fit the library implementation on the training data and evaluate it on the test data:

In [6]:
def multilinearRegrPredict(xtrain, ytrain, xtest, ytest):
    reg = LinearRegression()
    reg.fit(xtrain, ytrain)
    y_pred = reg.predict(xtest)
    print('For the true target: ', list(ytest)[-1])
    print('We predict as: ', list(y_pred)[-1])  # prediction for the last test item
    # reg.score(X, y) returns the coefficient of determination (R^2) of the
    # predictions for X against the true values y; it is not a classification accuracy.
    print("Overall Accuracy Score from library implementation:", reg.score(xtest, ytest))

    return y_pred

y_pred = multilinearRegrPredict(xtrain, ytrain, xtest, ytest)

For the true target:  25
We predict as:  20.603310452986506
Overall Accuracy Score from library implementation: 0.9112675801400184
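
For a regressor, `score` returns the coefficient of determination R^2 rather than a percentage of exact matches. The sketch below recomputes it by hand on synthetic data to show what the number means; the data and the names `X_demo`, `y_demo`, `r2_manual` and `r2_sklearn` are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data (hypothetical): two features and a noisy linear target
X_demo = rng.uniform(0, 10, size=(100, 2))
y_demo = 0.1 * X_demo[:, 0] + 0.9 * X_demo[:, 1] + rng.normal(0, 1.0, size=100)

reg = LinearRegression().fit(X_demo, y_demo)
y_hat = reg.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - y_hat) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot
r2_sklearn = reg.score(X_demo, y_demo)

print(r2_manual, r2_sklearn)
```

A value of 1 means the predictions match the targets exactly, while 0 means the model does no better than always predicting the mean of the targets.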




In [7]:
estimatedB, trainingcost = GDmultiLinparamEstimates(xtrain, ytrain, beta, 0.0001, 10000)

In [11]:
# Predict with the gradient descent estimates; index y_pred1 (not y_pred,
# which holds the library model's predictions)
y_pred1 = xtest.dot(estimatedB)
print("The prediction of last item", y_pred1[-1])
