In [1]:
import pandas as pd
import numpy as np

In [2]:
x1=np.random.randint(low=1,high=20,size=20000)


x2=np.random.randint(low=1,high=20,size=20000)

In [3]:
y=3+2*x1-4*x2+np.random.random(20000)

You can see that we generated the data so that y is approximately a linear combination of x1 and x2. Next, we'll compute the optimal parameter values using gradient descent and compare them with the results from sklearn to see how well the method works.

In [4]:
x=pd.DataFrame({'intercept':np.ones(x1.shape[0]),'x1':x1,'x2':x2})

w=np.random.random(x.shape[1])



In [9]:
y

array([-56.09137984, -30.66773532, -50.67174653, ...,  11.62136621,
       -20.02142877,   5.09106376])

Let's write the functions for predictions, error, cost and gradient that we discussed above.

In [7]:
def myprediction(features,weights):

    predictions=np.dot(features,weights)
    return(predictions)

myprediction(x,w)

# numpy.dot is used here for matrix multiplication

array([13.18182866, 11.93584716, 12.64021729, ...,  4.84156431,
       11.58366209,  6.2089265 ])

Note that `np.dot` is used here for matrix multiplication. Plain multiplication with `*` is element-wise, which is not what we want in this context.
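To make the difference concrete, here is a minimal sketch on a tiny made-up array. The toy values and the name `elementwise` are illustrative only and are not part of the notebook's data.

```python
import numpy as np

# Hypothetical toy data: 3 observations, 2 features (first column is the intercept)
features = np.array([[1.0, 2.0],
                     [1.0, 3.0],
                     [1.0, 4.0]])
weights = np.array([0.5, 2.0])

# Matrix-vector product: one prediction per row
predictions = np.dot(features, weights)

# Element-wise product: broadcasting multiplies each column by its weight
# and leaves a 3x2 array instead of 3 predictions
elementwise = features * weights

print(predictions.shape)
print(elementwise.shape)

# Summing the element-wise product across columns recovers the dot product
print(np.allclose(elementwise.sum(axis=1), predictions))
```

The element-wise result still has one column per feature, so it would need an extra row-wise sum before it could serve as a prediction.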

In [8]:
def myerror(target,features,weights):
    error=target-myprediction(features,weights)
    return(error)
myerror(y,x,w)

array([-69.2732085 , -42.60358248, -63.31196383, ...,   6.77980189,
       -31.60509087,  -1.11786275])

In [10]:
def mycost(target,features,weights):
    error=myerror(target,features,weights)
    cost=np.dot(error.T,error)
    return(cost)

mycost(y,x,w)

26988672.160963915

In [11]:
def gradient(target,features,weights):
    
    error=myerror(target,features,weights)
    
    gradient=-np.dot(features.T,error)/features.shape[0]
    
    return(gradient)

gradient(y,x,w)

array([ 24.36623849, 184.45321497, 383.48750706])

Note that the gradient here is a vector of 3 values because there are 3 parameters. Since it is evaluated on the entire data set, we scaled it down by the number of observations. Recall that the approximation behind the update rule assumes that the change in the parameters is small. We have no direct control over the gradient, but we can always choose a small value for $\eta$ to keep the change in the parameters small. On the other hand, if we choose too small a value for $\eta$, we'll need a larger number of steps to arrive at the optimal parameter values.
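One way to sanity-check a gradient formula like this is to compare it with finite differences. The sketch below assumes the cost being differentiated is the sum of squared errors divided by $2n$, since that scaling is what makes the gradient equal to $-X^T e / n$. The names `scaled_cost` and `analytic_gradient` and the seed are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch

# Small synthetic data set built the same way as the one above
n = 200
X = np.column_stack([np.ones(n),
                     rng.integers(1, 20, size=n),
                     rng.integers(1, 20, size=n)])
y = 3 + 2 * X[:, 1] - 4 * X[:, 2] + rng.random(n)
w = rng.random(3)

def scaled_cost(target, features, weights):
    # Sum of squared errors divided by 2n (assumed scaling, see lead-in)
    error = target - features @ weights
    return error @ error / (2 * features.shape[0])

def analytic_gradient(target, features, weights):
    # Same formula as the gradient function above: -X^T e / n
    error = target - features @ weights
    return -(features.T @ error) / features.shape[0]

# Central finite differences, one parameter at a time
h = 1e-6
numeric = np.zeros_like(w)
for j in range(w.size):
    step = np.zeros_like(w)
    step[j] = h
    numeric[j] = (scaled_cost(y, X, w + step) - scaled_cost(y, X, w - step)) / (2 * h)

print(analytic_gradient(y, X, w))
print(numeric)
```

If the two printed vectors agree to several decimal places, the formula and the code are consistent with each other.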

Let's look at the expected parameter values from sklearn. Don't worry about the syntax here; we'll discuss it in detail when we formally start with linear models in the next module.

In [12]:
from sklearn.linear_model import LinearRegression

In [13]:
lr=LinearRegression()
lr.fit(x.iloc[:,1:],y)
sk_estimates=([lr.intercept_]+list(lr.coef_))

In [14]:
sk_estimates

[3.497047463340305, 2.0000507703755974, -3.999730754941359]

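When you run the same code, these values might be different for you, since we generated the data randomly. The intercept comes out close to 3.5 rather than 3 because the noise term `np.random.random` is uniform on [0, 1) and so has mean 0.5.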
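If you want the same numbers on every run, one option is to draw the data from a seeded generator. The sketch below assumes NumPy's `default_rng` with an arbitrary seed, and uses `np.linalg.lstsq` as an independent cross-check on the least-squares estimates. It is not the notebook's original setup, which uses the unseeded `np.random` functions.

```python
import numpy as np

rng = np.random.default_rng(42)  # hypothetical seed; any fixed integer works

n = 20000
x1 = rng.integers(low=1, high=20, size=n)   # high is exclusive, as in randint
x2 = rng.integers(low=1, high=20, size=n)
y = 3 + 2 * x1 - 4 * x2 + rng.random(n)     # noise is uniform on [0, 1)

# Least-squares solution computed directly from the design matrix
X = np.column_stack([np.ones(n), x1, x2])
estimates, *_ = np.linalg.lstsq(X, y, rcond=None)
print(estimates)
```

Re-running this cell reproduces the same estimates each time, which makes it easier to compare them against the gradient descent results.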

Now let's write our own version of this using gradient descent.

In [16]:
def my_lr(target,features,learning_rate,num_steps,print_when):
    
    # start with random values of parameters
    weights=np.random.random(features.shape[1])
    
    # change parameter multiple times in sequence 
    # using the cost function gradient which we discussed earlier 
    for i in range(num_steps):
        
        weights =weights- learning_rate*gradient(target,features,weights)
       
        # print the cost function value and the weights every print_when-th iteration
        if i%print_when==0:
            print(mycost(target,features,weights),weights)
        
    return(weights)



In [18]:
my_lr(y,x,.0001,500,10)

37663827.46206986 [0.22601932 0.86074853 0.83003571]
27010820.13853439 [0.19555256 0.59641581 0.38580932]
20124049.214374103 [0.17149693 0.40269982 0.01810048]
15614321.66494527 [ 0.15252241  0.26448127 -0.28850124]
12608404.998134563 [ 0.1375749   0.16979156 -0.54622288]
10557184.76876914 [ 0.125819    0.10915866 -0.76476819]
9115164.508575158 [ 0.11659262  0.07508881 -0.95184064]
8064776.524146876 [ 0.10937104  0.06165565 -1.11355813]
7268838.880803236 [ 0.10373838  0.0641747  -1.2547815 ]
6640691.346660266 [ 0.09936507  0.07894531 -1.37937498]
6125435.742959881 [ 0.09598987  0.10304623 -1.49041263]
5688151.164908491 [ 0.09340578  0.13417351 -1.5903419 ]
5306490.27288684 [ 0.09144872  0.17051216 -1.6811133 ]
4966027.461901615 [ 0.08998865  0.21063434 -1.76428321]
4657335.595387895 [ 0.08892251  0.25341872 -1.84109529]
4374148.546691798 [ 0.08816863  0.29798662 -1.91254512]
4112205.822855315 [ 0.08766227  0.34365134 -1.9794313 ]
3868525.6830490814 [ 0.08735212  0.38987804 -2.04239607]

array([ 0.09901114,  1.48300032, -3.19245568])

In [19]:
sk_estimates

[3.497047463340305, 2.0000507703755974, -3.999730754941359]

You can see that with too few steps at this learning rate, we do not reach the optimal values.

Let's increase the learning rate $\eta$.

In [21]:
my_lr(y,x,.01,500,10)

44586015.81860581 [ 0.64081645 -2.22987593 -3.62215937]
7069311510.619205 [ -1.55890545 -25.50453518 -31.61929409]
1310662687403.5015 [ -31.94735733 -373.07612181 -382.41012641]
243000004275404.0 [ -446.16053115 -5106.53474914 -5158.00179928]
4.505278400043583e+16 [ -6086.62344389 -69558.51956322 -70183.63330206]
8.352894282453912e+18 [ -82889.02983719 -947153.1618916  -955589.19365729]
1.5486466473016364e+21 [ -1128650.59893636 -12896705.57457836 -13011497.05765287]
2.871228052336887e+23 [-1.53680072e+07 -1.75604858e+08 -1.77167813e+08]
5.323325719840954e+25 [-2.09254718e+08 -2.39108055e+09 -2.41236207e+09]
9.869573646877738e+27 [-2.84926561e+09 -3.25575626e+10 -3.28473370e+10]
1.8298426415668165e+30 [-3.87963271e+10 -4.43312074e+11 -4.47257716e+11]
3.392572174539155e+32 [-5.28260683e+11 -6.03625024e+12 -6.08997511e+12]
6.289910234905311e+34 [-7.19293217e+12 -8.21911225e+13 -8.29226540e+13]
1.1661644536284863e+37 [-9.79407985e+13 -1.11913528e+15 -1.12909600e+15]
2.162096885516361e+39 

array([6.88163892e+55, 7.86340834e+56, 7.93339559e+56])

You can see that because of the high learning rate, the change in the parameters is huge and we overshoot the optimal point; the cost function values as well as the parameter values end up exploding. Now let's run with a low learning rate and a higher number of steps.
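For a quadratic cost such as this one, fixed-step gradient descent converges only when the learning rate is below 2 divided by the largest eigenvalue of the Hessian, which here is $X^T X / n$. The sketch below estimates that threshold on data generated like ours. The seed and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch

n = 20000
X = np.column_stack([np.ones(n),
                     rng.integers(1, 20, size=n),
                     rng.integers(1, 20, size=n)])

# Hessian of the cost scaled by 1/(2n); it does not depend on y or the weights
hessian = X.T @ X / n
eigenvalues = np.linalg.eigvalsh(hessian)   # returned in ascending order

largest = eigenvalues[-1]
smallest = eigenvalues[0]

# Upper bound on the learning rate for convergence on a quadratic cost
max_learning_rate = 2 / largest

print(max_learning_rate)
print(largest / smallest)   # condition number: large values mean slow convergence
```

For data like this the threshold should come out a little under 0.009, which is consistent with 0.01 diverging while 0.0004 converges. The large ratio between the largest and smallest eigenvalues also suggests why the intercept takes so many steps to settle.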

In [22]:
my_lr(y,x,.0004,100000,1000)

18356616.836679872 [ 0.3967177  -0.00595057  0.04184493]
25653.253792419913 [ 0.48118684  2.13054596 -3.86747094]
23257.450244518728 [ 0.63566199  2.12387766 -3.87426105]
21100.791927820854 [ 0.78222479  2.11753513 -3.88068772]
19159.407484943178 [ 0.9212805   2.11151747 -3.88678522]
17411.808653053013 [ 1.05321365  2.10580805 -3.89257039]
15838.652090256626 [ 1.17838907  2.10039106 -3.89805924]
14422.525273329902 [ 1.29715288  2.09525154 -3.90326695]
13147.753766764683 [ 1.40983351  2.09037527 -3.90820792]
12000.227729882454 [ 1.51674253  2.08574876 -3.9128958 ]
10967.245741705841 [ 1.61817557  2.08135923 -3.91734357]
10037.374214959103 [ 1.71441311  2.07719454 -3.92156352]
9200.32084311801 [ 1.80572128  2.07324316 -3.92556731]
8446.820679754801 [ 1.89235257  2.06949418 -3.92936603]
7768.533589243868 [ 1.97454652  2.06593723 -3.93297018]
7157.951933759943 [ 2.05253041  2.06256246 -3.93638972]
6608.317474798442 [ 2.12651991  2.05936056 -3.9396341 ]
6113.546569441836 [ 2.19671959  2.056

array([ 3.48049819,  2.00076694, -3.99900508])

We can see that we ended up with pretty good estimates for the $\beta$s, close to those from sklearn.

In [23]:
sk_estimates

[3.497047463340305, 2.0000507703755974, -3.999730754941359]

There are modifications to gradient descent that can achieve the same result in far fewer iterations. We'll discuss them in detail when we start our course on deep learning. For now, we'll conclude this module.