Gradient Descent - Boston Dataset

* The Boston dataset is one of the datasets that shipped with
sklearn (older releases; `load_boston` was removed in scikit-learn 1.2).
* You are given a training dataset CSV file containing the X train
and Y train data. As studied in the lecture, your task is to implement
the Gradient Descent algorithm and use it to produce predictions for
the given test dataset.

Your task is to:
1. Code Gradient Descent for N features and come up
with predictions.
2. Try and test with various combinations of learning
rates and number of iterations.
3. Try using Feature Scaling, and see if it helps you
in getting better results.
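
As an illustration of the feature scaling mentioned in item 3, here is a minimal sketch of column-wise standardization using NumPy only. The helper name `standardize` and the toy arrays are assumptions made for this example; the idea is the same one `StandardScaler` applies: compute the mean and standard deviation on the training data and reuse them for the test data.

```python
import numpy as np

def standardize(train, test):
    """Scale columns to zero mean and unit variance using training statistics only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (train - mean) / std, (test - mean) / std

# Hypothetical toy data: two features on very different scales
X_train_toy = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
X_test_toy = np.array([[2.0, 1500.0]])
X_train_std, X_test_std = standardize(X_train_toy, X_test_toy)
```

Scaling matters for gradient descent because a single learning rate is shared by all coefficients; when the columns have comparable ranges, the same step size tends to suit every one of them.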

Read the instructions carefully:
1. Use Gradient Descent as the training algorithm and
submit the predicted results.
2. Files are in csv format, you can use genfromtxt
function in numpy to load data from csv file.
Similarly you can use savetxt function to save data
into a file.
3. Submit a csv file with only predictions for X test
data. File name should not have spaces. File should
not have any headers and should only have one
column i.e. predictions. Also predictions shouldn't be
in exponential form.
4. Your score is based on coefficient of
determination.
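
The scoring metric and the submission format described above can be sketched as follows. This is only an illustration: the function name `r2_score_np` and the demo arrays are assumptions, and the predictions are written to an in-memory buffer rather than to a real file.

```python
import io
import numpy as np

def r2_score_np(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ss_res / ss_tot

# Hypothetical values, for illustration only
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 9.5])
score_demo = r2_score_np(y_true_demo, y_pred_demo)

# One prediction per line in fixed-point notation: no header, no exponent
buffer = io.StringIO()
np.savetxt(buffer, y_pred_demo, fmt="%.5f")
csv_text = buffer.getvalue()
```

Passing an explicit `fmt` such as `"%.5f"` is what keeps `savetxt` from writing values in exponential form, which is its default.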

In [1]:
import pandas as pd
import numpy as np
from numpy import genfromtxt, savetxt
from sklearn.model_selection import train_test_split

from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor

In [2]:
data = genfromtxt("./boston/0000000000002417_training_boston_x_y_train.csv", delimiter=',', skip_header=0)

In [3]:
X = data[:,0:-1]
Y = data[:, -1]
print(X.shape)
print(Y.shape)

(379, 13)
(379,)


# Building my own gradient descent algorithm for multiple variables
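
For reference, the cost and update that `step_gradient` and `cost` implement can be written out as below. The symbols $N$ (number of rows), $\alpha$ (learning rate) and the index names are notation chosen for this note; the intercept is treated as one more coefficient $m_k$ whose feature is a constant 1.

```latex
\begin{aligned}
J(m) &= \frac{1}{N} \sum_{i=1}^{N} \Big( y_i - \sum_{k} m_k x_{ik} \Big)^2 \\
\frac{\partial J}{\partial m_j} &= -\frac{2}{N} \sum_{i=1}^{N} \Big( y_i - \sum_{k} m_k x_{ik} \Big)\, x_{ij} \\
m_j &\leftarrow m_j - \alpha \, \frac{\partial J}{\partial m_j}
\end{aligned}
```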

In [13]:
def step_gradient(X, Y, learning_rate, m):
    m_slope = np.zeros(len(X[0]))  # accumulated gradient, one entry per coefficient
    for i in range(len(X)):
        x = X[i]  # feature vector of the i-th row
        y = Y[i]
        # m * x is an element-wise product; summing it gives the prediction m . x
        residual = y - sum(m * x)
        for j in range(len(x)):
            # partial derivative of the mean squared error w.r.t. m[j]
            m_slope[j] += (-2 / len(X)) * residual * x[j]
    new_m = m - (learning_rate * m_slope)  # gradient descent update
    return new_m

def cost(m, X, Y):
    cost=0
    for i in range(len(X)):
        cost+=(1/len(X))*((Y[i]-sum(m*X[i]))**2)
    print(cost)

def gradient_decent(X, Y, learning_rate, num_iterations):
    m = np.zeros(len(X[0]))
    for i in range(num_iterations):
        m = step_gradient(X, Y, learning_rate, m)
        print("itr= ", i, "cost=", end=' ')
        cost(m, X, Y)
    return m

def run(X, Y):
    learning_rate = 0.11
    num_iterations = 100
    # Append a column of ones so the intercept c is learned as one more coefficient
    X = np.append(X, np.ones(len(X)).reshape(-1, 1), axis=1)
    m = gradient_decent(X, Y, learning_rate, num_iterations)
    return m
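
The per-row and per-column loops above can also be expressed with matrix operations. The following is a sketch of an equivalent vectorized batch update, not the code that produced the outputs in this notebook; the function name `gradient_descent_vectorized` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def gradient_descent_vectorized(X, Y, learning_rate=0.1, num_iterations=100):
    """Batch gradient descent for linear regression with a mean squared error cost.

    X is expected to already contain a trailing column of ones for the intercept.
    """
    n_samples, n_features = X.shape
    m = np.zeros(n_features)
    for _ in range(num_iterations):
        residual = Y - X @ m                              # shape (n_samples,)
        gradient = (-2.0 / n_samples) * (X.T @ residual)  # shape (n_features,)
        m = m - learning_rate * gradient
    return m

# Hypothetical noiseless data generated from y = 3*x0 - 2*x1 + 5
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 2))
targets = 3.0 * features[:, 0] - 2.0 * features[:, 1] + 5.0
X_demo = np.hstack([features, np.ones((200, 1))])
m_demo = gradient_descent_vectorized(X_demo, targets, learning_rate=0.1, num_iterations=500)
```

On this synthetic problem `m_demo` should end up close to the generating coefficients `[3, -2, 5]`. The arithmetic per iteration is the same as in `step_gradient`, but the work is done by NumPy instead of Python loops.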



In [16]:
X = data[:, 0:-1]
Y = data[:, -1]
# append the square of every column as additional features
X = np.append(X, X**2, axis=1)

scaler=StandardScaler()
scaler.fit(X)
X=scaler.transform(X)

m=run(X, Y)

itr=  0 cost= 364.2207823129001
itr=  1 cost= 227.07295926827348
itr=  2 cost= 145.27146506086993
itr=  3 cost= 96.01248803958627
itr=  4 cost= 66.18412884892768
itr=  5 cost= 48.044725362943794
itr=  6 cost= 36.967804164141754
itr=  7 cost= 30.170650379759213
itr=  8 cost= 25.97298746848971
itr=  9 cost= 23.357608261016633
itr=  10 cost= 21.707589352143543
itr=  11 cost= 20.648244265075636
itr=  12 cost= 19.951727070680185
itr=  13 cost= 19.479312567906874
itr=  14 cost= 19.146414929296892
itr=  15 cost= 18.901362467477387
itr=  16 cost= 18.712508245097066
itr=  17 cost= 18.560397817514882
itr=  18 cost= 18.43300751430429
itr=  19 cost= 18.32284816796667
itr=  20 cost= 18.225202667245515
itr=  21 cost= 18.13705289895754
itr=  22 cost= 18.056425982089326
itr=  23 cost= 17.98199559545105
itr=  24 cost= 17.912838552068898
itr=  25 cost= 17.848285891355772
itr=  26 cost= 17.78783154466078
itr=  27 cost= 17.73107609422234
itr=  28 cost= 17.677691943489712
itr=  29 cost= 17.62740156894531
...

In [17]:
test_data = genfromtxt("./boston/0000000000002417_test_boston_x_test.csv", delimiter=',', skip_header=0)
test_data.shape

# append the square of every column, as was done for the training data
test_data = np.append(test_data, test_data**2, axis=1)

testing1=scaler.transform(test_data)
x_test=np.append(testing1, np.ones(len(testing1)).reshape(-1, 1), axis=1)
ans=[]
for i in x_test:
    ans.append(sum(i*m)) # prediction = dot product of the row with the coefficients m learned on the training data
for i in ans:
    print(i)
ans=np.array(ans)
np.savetxt(X=ans,fname='./boston/Prediction.csv',delimiter=',', fmt='%.5f')

13.870445989719263
28.59863740125128
22.78456922946407
24.154785055703755
19.347224798555477
13.03437545593491
27.34351664522876
22.969317949015945
18.870849780800697
23.22612077857668
24.8667515427157
17.02992130395709
18.55421529667459
18.795388874605177
49.54392099233067
23.284910490199486
24.393689370611607
26.187625507708542
18.701333816151248
31.51425660264661
21.40492168887355
24.154468796983384
35.40683803113731
35.7338714853582
32.97159693174076
17.06349334955803
22.372235702313002
31.669268697899007
23.067132237578065
31.991951025174217
16.076821977589503
25.737491856444553
23.101133751154986
23.915855236103706
13.23316161579606
28.001353744156088
24.92273523585821
19.19841259778402
22.901599306629585
9.99572815195414
16.985718696801673
26.319145090326597
29.824017734955454
19.363607829989125
18.220866665764163
13.257646001925401
46.994345547104714
23.990054907608688
31.55990434193138
14.29771323410174
16.05567703250442
41.48734120355262
15.032953528271527
20.039795660327616


In [20]:
X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.3, random_state=43)

In [21]:
print(X_train.shape)
print(X_val.shape)
print(Y_train.shape)
print(Y_val.shape)

(265, 13)
(114, 13)
(265,)
(114,)


In [50]:
model = GradientBoostingRegressor()
model.fit(X_train, Y_train)

Y_train_pred = model.predict(X_train)
Y_Val_pred = model.predict(X_val)

In [51]:
train_score = model.score(X_train, Y_train)
val_score_score = model.score(X_val, Y_val)
print(train_score, val_score_score)

0.9849588503525196 0.8388220523369654


# Scaling the features and then fitting the model

In [52]:
scaler = StandardScaler()
scaler.fit(X_train)

X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Fit a separate model on the scaled features; `model` stays the unscaled one
model_scaled = GradientBoostingRegressor()
model_scaled.fit(X_train_scaled, Y_train)

Y_train_pred = model_scaled.predict(X_train_scaled)
Y_Val_pred = model_scaled.predict(X_val_scaled)
train_score = model_scaled.score(X_train_scaled, Y_train)
val_score_score = model_scaled.score(X_val_scaled, Y_val)
print(train_score, val_score_score)

0.9849588503525196 0.8390490388444315


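* Conclusion: Scaling did not change the scores. This is expected for a tree-based model such as GradientBoostingRegressor, whose splits depend only on the ordering of feature values. The tiny difference in the validation score is consistent with the model's `random_state` being unset.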

In [55]:
test_data = genfromtxt("./boston/0000000000002417_test_boston_x_test.csv", delimiter=',', skip_header=0)
test_data.shape

(127, 13)

In [56]:
test_predicted = model.predict(test_data)

In [59]:
savetxt("boston_prediction.csv", test_predicted, fmt='%1.5f')

# Adding extra features (squared columns) and testing the model on the same dataset
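
The feature expansion used in this section can be sketched as a small helper. The name `add_squared_features` and the 2x2 array are assumptions made for this illustration.

```python
import numpy as np

def add_squared_features(X):
    """Return X with the element-wise square of every column appended."""
    X = np.asarray(X, dtype=float)
    return np.hstack([X, X ** 2])

X_small = np.array([[1.0, 2.0], [3.0, 4.0]])  # hypothetical 2x2 example
X_expanded = add_squared_features(X_small)
```

Each call doubles the number of columns, so a helper like this is meant to be applied once to the raw feature matrix; applying it to an already expanded matrix would also add squares of the squares.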


In [61]:
# append the square of every column of X as additional features
X = np.append(X, X**2, axis=1)

X_train, X_val, Y_train, Y_val = train_test_split(X, Y, test_size=0.3, random_state=43)
print(X_train.shape)
print(X_val.shape)
print(Y_train.shape)
print(Y_val.shape)

model = GradientBoostingRegressor()
model.fit(X_train, Y_train)

Y_train_pred = model.predict(X_train)
Y_Val_pred = model.predict(X_val)

train_score = model.score(X_train, Y_train)
val_score_score = model.score(X_val, Y_val)
print(train_score, val_score_score)

(265, 52)
(114, 52)
(265,)
(114,)
0.985269044068177 0.8471330220304578
