# Gradient Descent and Feature Scaling
As mentioned earlier, gradient descent is an optimization algorithm. Paired with a suitable cost function (in our case the squared error cost function), it repeatedly updates the values of w and b until it reaches a minimum of the cost, which translates to the model making decent predictions.
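
For reference, the update applied at each step can be written as follows. This is the standard form of the rule, where $ \alpha $ is the learning rate (it appears as `alpha` in the code later in this notebook) and both parameters are updated simultaneously from the same old values:

$$ w := w - \alpha \frac{\partial J(w,b)}{\partial w}, \qquad b := b - \alpha \frac{\partial J(w,b)}{\partial b} $$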

# Gradient Descent 
To implement gradient descent we need the derivatives of the cost function $ J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right)^2 $ with respect to w and b. It is clearer and easier to compute these derivatives in a separate function and then use that function inside gradient descent. We will provide two implementations, one for univariate linear regression and one for multivariate linear regression.

## univariate linear regression 
In univariate linear regression each training example consists of one feature, so the parameter w is a single number rather than a vector. We therefore compute two derivatives, one with respect to w and one with respect to b.

In [19]:
# start with the imports
# import numpy
import numpy as np

### computing derivative 
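Let's start by recalling the cost function $ J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right)^2 $, where $ f(x^{(i)}) = w x^{(i)} + b $. We need its derivatives with respect to w and b. By the chain rule, the factor 2 that comes from the square cancels the 2 in the denominator. The derivative with respect to w is
$ \frac{\partial J}{\partial w} = \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right) x^{(i)} $.
The derivative with respect to b is
$ \frac{\partial J}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( f(x^{(i)}) - y^{(i)} \right) $.
Now we will implement the code for the derivatives.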


In [20]:
# define the model
def linear_regression_model_with_vec(x, W, b):
    # x is the feature vector of one example (a single number in the univariate case)
    # W is the vector of parameters w (a single number in the univariate case)
    # b is the bias, a scalar
    # returns the prediction f(x) = W . x + b as a scalar
    f_X = np.dot(x, W) + b
    return f_X

In [23]:
# define the cost function
def squared_error_cost_fuction(X, Y, W, b):
    # X is the set of features in the training set
    # Y is the set of outputs in the training set
    # W is the parameter w (a vector when there are several features)
    # b is the bias
    m = len(X)  # the number of training examples
    diff = 0.0
    for i in range(m):
        f_X = linear_regression_model_with_vec(X[i], W, b)  # prediction for example i
        diff += (f_X - Y[i]) ** 2  # squared error for one training example
    diff /= (2 * m)  # average over the examples, with the conventional factor 1/2
    return diff

In [21]:
# computing the derivative
def computing_the_derivative(X, Y, W, b):
    # X is the set of features in the training set (one number per example)
    # Y is the set of outputs in the training set
    # W is the parameter w (a single number)
    # b is the parameter b (a single number)
    dw = 0.0
    db = 0.0
    m = len(X)  # the number of training examples
    for i in range(m):
        fx = linear_regression_model_with_vec(X[i], W, b)  # compute fx using the model
        dw += (fx - Y[i]) * X[i]  # error times the feature value
        db += (fx - Y[i])  # error only
    dw /= m  # average over the examples
    db /= m

    return dw, db

In [22]:
# define x 
X = np.array ([
    1,2,3,4,5
])
# define Y 
Y = np.array ([
    0,2,3,5,7 
])
# define W 
w = 1 
# define b 
b = 10 
dw , db = computing_the_derivative(X,Y,w,b) 
print (f'dw value is {dw} and db value is {db}')

dw value is 27.4 and db value is 9.6
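
As a sanity check, the derivatives printed above can be compared with a numerical approximation. The sketch below is not part of the original notebook: the helper names `cost`, `analytic_gradient`, and `numerical_gradient` are hypothetical, and the functions are rewritten in vectorized form so that the block runs on its own. It estimates each derivative with central differences and compares the estimate with the formulas.

```python
import numpy as np

def cost(X, Y, w, b):
    # squared error cost J(w, b) for a single feature
    m = len(X)
    return np.sum((w * X + b - Y) ** 2) / (2 * m)

def analytic_gradient(X, Y, w, b):
    # derivatives of J with respect to w and b, from the formulas in this section
    m = len(X)
    error = w * X + b - Y
    return np.sum(error * X) / m, np.sum(error) / m

def numerical_gradient(X, Y, w, b, eps=1e-6):
    # central differences: (J(p + eps) - J(p - eps)) / (2 * eps) for each parameter p
    dw = (cost(X, Y, w + eps, b) - cost(X, Y, w - eps, b)) / (2 * eps)
    db = (cost(X, Y, w, b + eps) - cost(X, Y, w, b - eps)) / (2 * eps)
    return dw, db

# same values as the test cell above
X = np.array([1, 2, 3, 4, 5])
Y = np.array([0, 2, 3, 5, 7])
w, b = 1.0, 10.0

dw, db = analytic_gradient(X, Y, w, b)
dw_num, db_num = numerical_gradient(X, Y, w, b)
print(f'analytic:  dw = {dw}, db = {db}')
print(f'numerical: dw = {dw_num}, db = {db_num}')
```

If the two lines agree to several decimal places, the derivative code is very likely correct. A large difference usually points to a sign error or a missing factor.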


### implementing the gradient descent for univariate 
Now that the derivative is implemented, we can implement gradient descent for univariate linear regression. The function takes X (the feature values of the training set), Y (the outputs of the training set), W (the initial value of the weight), b (the initial value of the bias), and the learning rate alpha. It uses both the derivative function and the cost function: it keeps updating the parameters while the cost keeps decreasing and stops at the first step that does not lower the cost.

In [26]:
# define the gradient descent for univariate
def gradient_descent_univariate(X, Y, W, b, alpha):
    # X is the set of features in the training set
    # Y is the set of outputs in the training set
    # W is the initial value of the parameter w (a number)
    # b is the initial value of the parameter b (a number)
    # alpha is the learning rate
    current_cost = squared_error_cost_fuction(X, Y, W, b)
    while True:
        dw, db = computing_the_derivative(X, Y, W, b)  # both derivatives use the old W and b
        b -= alpha * db
        W -= alpha * dw
        new_cost = squared_error_cost_fuction(X, Y, W, b)
        if new_cost < current_cost:
            current_cost = new_cost
        else:
            break  # stop at the first step that does not lower the cost
    return W, b


In [27]:
# define X 
X = np.array ([1,2,3,4,5,6])
# define Y 
Y = np.array ([1,1,2,3,4,5])
# define W 
w = 11
# define b 
b = 10 
# define alpha 
alpha = 0.001 

w,b = gradient_descent_univariate (X,Y,w,b,alpha)
print (f'the final value of w is {w} and the final value of b is {b}')


the final value of w is 0.8571425654559908 and the final value of b is -0.3333320845639586
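
Because the squared error cost of a linear model has a closed-form minimizer, the result above can be cross-checked without gradient descent. The sketch below is an addition for verification only and uses `np.polyfit`, which fits a least-squares polynomial. With degree 1 it returns the slope and the intercept of the best-fitting line.

```python
import numpy as np

# same data as the cell above
X = np.array([1, 2, 3, 4, 5, 6])
Y = np.array([1, 1, 2, 3, 4, 5])

# degree 1 gives [slope, intercept] of the least-squares line
slope, intercept = np.polyfit(X, Y, 1)
print(f'closed-form w is {slope} and closed-form b is {intercept}')
```

For this data the exact minimizer is w = 6/7 (about 0.857143) and b = -1/3 (about -0.333333), so the gradient descent result above agrees with it to roughly six decimal places.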


## multivariate linear regression
After tackling univariate linear regression and testing it with dummy values, we now implement gradient descent for multivariate linear regression. We start with a function that computes the gradient and then call it inside gradient descent. The gradient has two parts: the partial derivatives with respect to each entry of the parameter vector W, and the partial derivative with respect to the bias b.

$ \frac{\partial J(W, b)}{\partial W_j} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) X_j^{(i)} $

$ \frac{\partial J(W, b)}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{(i)} - y^{(i)} \right) $

Now we will implement the above as a function that returns the gradient with respect to W as a vector and the gradient with respect to the bias b as a number.

In [30]:
# define the gradient function
def compute_the_gradient(X, Y, W, b):
    # X is the set of features in the training set, shape (m, n)
    # Y is the set of outputs in the training set, shape (m,)
    # W is the vector of parameters, shape (n,)
    # b is the bias
    m = len(X)  # the number of examples
    n = len(W)  # the number of features
    dj_db = 0.0  # a number to be returned
    dj_dw = np.zeros(n)  # array to be returned
    for i in range(m):
        f_x = linear_regression_model_with_vec(X[i], W, b)  # compute the prediction
        error = f_x - Y[i]  # prediction error for example i
        dj_db += error  # accumulate the derivative with respect to b
        for j in range(n):
            dj_dw[j] += error * X[i][j]  # accumulate the derivative with respect to W[j]
    dj_db /= m  # average over the examples
    dj_dw /= m  # average over the examples
    return dj_dw, dj_db  # return the values
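
The two nested loops can also be expressed with matrix operations, which is usually shorter and faster in NumPy. The following is a minimal sketch of that idea, not a replacement for the function above. The name `compute_the_gradient_vectorized` is hypothetical, and the data is the small data set used in the gradient descent test later in this notebook.

```python
import numpy as np

def compute_the_gradient_vectorized(X, Y, W, b):
    # X has shape (m, n), Y has shape (m,), W has shape (n,), b is a number
    m = X.shape[0]
    error = X @ W + b - Y    # prediction error for every example, shape (m,)
    dj_dw = X.T @ error / m  # one partial derivative per feature, shape (n,)
    dj_db = error.sum() / m  # partial derivative with respect to the bias
    return dj_dw, dj_db

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y = np.array([7.0, 8.0, 9.0])
W = np.array([1.0, 1.0])
b = 1.0

dj_dw, dj_db = compute_the_gradient_vectorized(X, Y, W, b)
print(dj_dw, dj_db)  # the errors are -3, 0, 3, so dj_dw is [4. 4.] and dj_db is 0.0
```

Both versions compute the same sums. The vectorized one simply lets NumPy perform the loops over examples and features internally.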
        

In [31]:
# define the gradient descent function
def gradient_descent(X, Y, W, b, alpha):
    # X is the set of features in the training set
    # Y is the set of outputs in the training set
    # W is the vector of parameters (initial values)
    # b is the bias (initial value)
    # alpha is the learning rate
    current_cost = squared_error_cost_fuction(X, Y, W, b)
    for i in range(1000):  # at most 1000 iterations
        dw, db = compute_the_gradient(X, Y, W, b)
        W -= (alpha * dw)
        b -= (alpha * db)
        next_cost = squared_error_cost_fuction(X, Y, W, b)
        if next_cost < current_cost:
            print(f'cost at iteration {i} is {next_cost}')
            current_cost = next_cost
        else:
            break  # stop at the first step that does not lower the cost
    return W, b

In [32]:
# define the data set
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
Y = np.array([7.0, 8.0, 9.0])

# Initialize parameters
W = np.array([1.0, 1.0])
b = 1.0
alpha = 0.001
w, b = gradient_descent(X,Y,W,b,alpha)
print (f'the value of w is {w} and the value of b is {b}')

cost at iteration 0 is 2.9684773333333325
cost at iteration 1 is 2.938805147028148
cost at iteration 2 is 2.9108702577252554
cost at iteration 3 is 2.8845664072915587
cost at iteration 4 is 2.8597938390925655
cost at iteration 5 is 2.8364589001911598
cost at iteration 6 is 2.8144736678862365
cost at iteration 7 is 2.793755599102012
cost at iteration 8 is 2.7742272012298024
cost at iteration 9 is 2.7558157231097202
cost at iteration 10 is 2.7384528649199846
cost at iteration 11 is 2.722074505816979
cost at iteration 12 is 2.706620448239933
cost at iteration 13 is 2.6920341778606187
cost at iteration 14 is 2.678262638220767
cost at iteration 15 is 2.6652560191585466
cost at iteration 16 is 2.65296755818041
cost at iteration 17 is 2.641353353986196
cost at iteration 18 is 2.63037219140392
cost at iteration 19 is 2.6199853770361083
cost at iteration 20 is 2.610156584962307
cost at iteration 21 is 2.6008517118824197
cost at iteration 22 is 2.592038741123294
cost at iteration 23 is 2.5836876
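
One detail of `gradient_descent` is worth knowing: the statement `W -= (alpha*dw)` updates a NumPy array in place, so the array that the caller passed as the initial W is overwritten as well. The sketch below illustrates the difference between an in-place update and an update that creates a new array. The helper names `step_in_place` and `step_copy` are hypothetical and exist only for this demonstration.

```python
import numpy as np

def step_in_place(W, grad, alpha):
    W -= alpha * grad     # modifies the array object that the caller passed in
    return W

def step_copy(W, grad, alpha):
    W = W - alpha * grad  # builds a new array; the caller's array is untouched
    return W

grad = np.array([4.0, 4.0])

W_a = np.array([1.0, 1.0])
step_in_place(W_a, grad, 0.001)
print(W_a)           # changed by the call

W_b = np.array([1.0, 1.0])
W_new = step_copy(W_b, grad, 0.001)
print(W_b, W_new)    # W_b keeps its initial values, W_new holds the update
```

If the initial values are needed again after training, one option is to pass `W.copy()` to the function.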

## scaling data 
When the data has multiple features whose ranges differ widely, gradient descent can take a long time to reach the minimum. In that case we need to scale the features, that is, apply a statistical transformation so that all features fall within a similar range.
Two common methods are mean normalization and z-score normalization. We will implement both in the following code cells.
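
To make the effect concrete, the sketch below runs the same number of gradient descent steps on raw and on z-score scaled features and compares the resulting costs. It is an illustration under stated assumptions: the feature matrix is the one used in the z-score example at the end of this section, the target values `Y` are made up for this demonstration, the helper names are hypothetical, and the learning rates are picked by hand (the raw features need a very small `alpha` to keep the updates from diverging).

```python
import numpy as np

def cost(X, Y, W, b):
    # squared error cost for a linear model
    m = X.shape[0]
    return np.sum((X @ W + b - Y) ** 2) / (2 * m)

def run_gradient_descent(X, Y, alpha, iterations):
    # plain batch gradient descent from zero initial parameters,
    # run for a fixed number of steps; returns the final cost
    m, n = X.shape
    W = np.zeros(n)
    b = 0.0
    for _ in range(iterations):
        error = X @ W + b - Y
        W = W - alpha * (X.T @ error) / m
        b = b - alpha * error.sum() / m
    return cost(X, Y, W, b)

# two features with very different ranges
X = np.array([[2100, 3], [1600, 3], [2400, 3], [1416, 2], [3000, 4]], dtype=float)
# hypothetical targets, invented for this illustration only
Y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

# z-score scaling, column by column
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

cost_raw = run_gradient_descent(X, Y, alpha=1e-7, iterations=1000)
cost_scaled = run_gradient_descent(X_scaled, Y, alpha=0.1, iterations=1000)
print(f'cost after 1000 steps without scaling: {cost_raw}')
print(f'cost after 1000 steps with scaling:    {cost_scaled}')
```

With the raw features, the step size is limited by the feature with the largest values, so the bias and the weight of the small-valued feature barely move. After scaling, a much larger learning rate is stable and all parameters make progress.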

### Mean normalization 
To apply this we first find the mean $ \mu_j $, the minimum, and the maximum of each feature j, then compute for every value of that feature:
$ x_j := \frac{x_j - \mu_j}{\max_j - \min_j} $
Now we will implement the function that normalizes the features.

In [40]:
# mean normalization function
def mean_normalization(X):
    # X is the set of features that we want to be normalized, shape (m, n)
    n = X.shape[1]  # the number of features
    X = X.astype(float)  # work on a float copy so the input is not modified
    for i in range(n):  # loop over the features
        x = X[:, i]
        mean = x.mean()
        minval = x.min()
        maxval = x.max()
        X[:, i] -= mean
        X[:, i] /= (maxval - minval)
    return X
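
The same computation can be written without the loop by taking the statistics per column. The sketch below is a hypothetical vectorized variant (the name `mean_normalization_vectorized` is not from the original notebook), applied to the feature matrix that this section uses for the z-score example.

```python
import numpy as np

def mean_normalization_vectorized(X):
    # subtract the column mean and divide by the column range (max - min)
    X = X.astype(float)
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X = np.array([[2100, 3],
              [1600, 3],
              [2400, 3],
              [1416, 2],
              [3000, 4]])
normalized = mean_normalization_vectorized(X)
print(normalized)
print(normalized.min(axis=0), normalized.max(axis=0))
```

After mean normalization every column has mean 0 and its values span a range of exactly 1, so all of them lie between -1 and 1. Note that a feature with the same value in every example would cause a division by zero in both versions.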

### Z score normalization 
To apply this we compute the mean $ \mu_j $ and the standard deviation $ \sigma_j $ of each feature j, then compute for every value of that feature: $ x_j := \frac{x_j - \mu_j}{\sigma_j} $

In [41]:
# z score normalization
def zscore_normalizaton(X):
    # X is the set of features that we want to be normalized, shape (m, n)
    n = X.shape[1]  # the number of features
    X = X.astype(float)  # work on a float copy so the input is not modified
    for i in range(n):  # loop over features
        x = X[:, i]
        mean = x.mean()
        sigma = np.std(x)  # population standard deviation (divides by m)
        X[:, i] -= mean
        X[:, i] /= sigma
    return X

In [45]:
X = np.array([[2100, 3],
              [1600, 3],
              [2400, 3],
              [1416, 2],
              [3000, 4]])
normalized = zscore_normalizaton(X)
print(normalized)

[[-0.00562564  0.        ]
 [-0.88463186  0.        ]
 [ 0.52177809  0.        ]
 [-1.20810614 -1.58113883]
 [ 1.57658555  1.58113883]]
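
When a model trained on scaled features is later used on a new example, that example has to be scaled with the mean and standard deviation of the training set, not with statistics of its own. The function above does not return those values, so the sketch below splits the work into two hypothetical helpers, `zscore_fit` and `zscore_apply`. The new example `x_new` is made up for illustration.

```python
import numpy as np

def zscore_fit(X):
    # compute the per-feature mean and standard deviation of the training set
    X = X.astype(float)
    return X.mean(axis=0), X.std(axis=0)

def zscore_apply(X, mu, sigma):
    # scale any data with previously computed statistics
    return (X.astype(float) - mu) / sigma

X_train = np.array([[2100, 3],
                    [1600, 3],
                    [2400, 3],
                    [1416, 2],
                    [3000, 4]])
mu, sigma = zscore_fit(X_train)
X_train_scaled = zscore_apply(X_train, mu, sigma)  # same values as the output above

# hypothetical new example, scaled with the training statistics
x_new = np.array([[2000, 3]])
x_new_scaled = zscore_apply(x_new, mu, sigma)
print(x_new_scaled)
```

Keeping `mu` and `sigma` around is a design choice that makes training and prediction consistent, since both see features on the same scale.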
