## 1. Absolute Trick

**This assumes that the point in question wants the line to come closer**

Recall that a line has a _slope_ and a _y-intercept_, in the form $y = mx + b$, where $m$ is the slope and $b$ the _y-intercept_.

We also assume that the `learning-rate`, or `alpha`, is positive.

1. The first step is to figure out whether the point is below or above the line given by the linear equation.

2. If the point is below the line, we reduce the _y-intercept_ by the _learning_rate_ to lower the line towards the point, and we subtract `x-value times alpha` from the slope to tilt it towards the point.

3. If the point is above the line, we increase the _y-intercept_ by the _learning_rate_ to lift the line towards the point, and we add `x-value times alpha` to the slope to tilt it towards the point.

4. Thus, to tilt the line, the _Absolute-Trick_ only ever adds `x-value times alpha` to the slope or subtracts it from the slope. The size of the step does not depend on how far the point is from the line.

In [2]:
import numpy as np
import pandas as pd

In [3]:
def absolute_trick(line_equ, point_coord, alpha):
    """Find the new line equation using the
        Absolute-Trick formula
        
    @param line_equ: tuple of floats for slope and y_intercept
    @param point_coord: tuple of floats for 2D point coordinates
    @param alpha: float, the learning-rate
    """
    assert alpha > 0, 'ERROR: Alpha must be positive'
    
    slope, intercept = line_equ[0], line_equ[1]
    x, y = point_coord[0], point_coord[1]
    
    # if point is higher than line, is_higher=True, else False.
    is_higher = y > (slope*x + intercept) 
    print(f'Old-Equ: y = {slope}x + {intercept},\nIs_Point_Higher: {is_higher}\n')
    
    if is_higher:
        intercept += alpha
        slope += x*alpha
    
    else:
        intercept -= alpha
        slope -= x*alpha
    
    print(f'New-Equ: y = {round(slope, 2)}x + {round(intercept,2)}\n')

In [4]:
alpha = 0.1
line_equ = (-0.6, 4)
point_coord = (-5, 3)

absolute_trick(line_equ, point_coord, alpha)

Old-Equ: y = -0.6x + 4,
Is_Point_Higher: False

New-Equ: y = -0.1x + 3.9
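
To see what a fixed step size means in practice, the sketch below applies the same update several times to the example above. `absolute_trick_step` is a hypothetical helper (it is not part of the notebook's code) that returns the new slope and intercept instead of printing them.

```python
def absolute_trick_step(slope, intercept, x, y, alpha):
    """Return (slope, intercept) after one Absolute-Trick update.

    Hypothetical helper: same rule as absolute_trick, but it returns values.
    """
    if y > slope * x + intercept:
        # point is above the line: lift the line and tilt it towards the point
        return slope + x * alpha, intercept + alpha
    # point is on or below the line: lower the line and tilt it towards the point
    return slope - x * alpha, intercept - alpha


slope, intercept = -0.6, 4.0
x, y = -5, 3
errors = []
for _ in range(4):
    # vertical distance between the point and the current line
    errors.append(abs(y - (slope * x + intercept)))
    slope, intercept = absolute_trick_step(slope, intercept, x, y, alpha=0.1)

print([round(e, 2) for e in errors])
```

Each update moves the line's prediction at this `x` by the same amount, `alpha * (x**2 + 1) = 2.6`. The error therefore drops from 4 to about 1.4 and then hovers between roughly 1.2 and 1.4 instead of shrinking to zero.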



## 2. Square Trick

**This assumes that the point in question also wants the line to come closer**

Recall that a line has a _slope_ and a _y-intercept_, in the form $y = mx + b$, where $m$ is the slope and $b$ the _y-intercept_

We also assume that the `learning-rate` or `alpha` must be positive.

The main difference between the _Absolute-Trick_ and the _Square-Trick_ is that, in addition to the point's $x$ value, the _Square-Trick_ uses the vertical distance (along the _y-axis_) between the point and the line.

1. The first step is to consider the difference between the $y$ value of the point and the $y$ value of the line at the same $x$.

2. This difference is multiplied by the `x-value of the point times alpha` and added to the slope to get the new slope.

3. This difference is multiplied by _alpha_ and added to the _y_intercept_ to get the new _y_intercept_.

4. This automatically moves the line closer to the point, irrespective of whether the point is higher or lower than the line, because the sign of the difference sets the direction of the update.

5. Including this distance in the calculation lets the line take steps of an appropriate size towards the point, rather than the fixed step sizes set by _alpha_ alone in the _Absolute-Trick_. In other words, if the point is far from the line, the _Square-Trick_ makes the line move more than if the point were closer to the line.

6. Finally, the _Square-Trick_ automatically takes care of points that may be higher or lower than the line, so we don't need two rules like the `if-else` statements of the _Absolute-Trick_.

**Thus, in the Square-Trick, the magnitude by which the intercept and slope change depends on how large the prediction error is.**


In [5]:
def square_trick(line_equ, point_coord, alpha):
    """Find the new line equation using the
        Square-Trick formula
        
    @param line_equ: tuple of floats for slope and y_intercept
    @param point_coord: tuple of floats for point coordinates
    @param alpha: float, the learning-rate
    """
    assert alpha > 0, 'ERROR: Alpha must be positive'
    
    slope, intercept = line_equ[0], line_equ[1]
    x, y = point_coord[0], point_coord[1]
    
    # let's just see if the point is higher or not
    is_higher = y > (slope*x + intercept) 
    print(f'Old-Equ: y = {slope}x + {intercept},\nIs_Point_Higher: {is_higher}\n')
    
    # calculate the distance from the point to the line,
    # On the y-axis
    dist = y - (slope*x + intercept)
    
    # Include dist in the new-line Equation
    slope += x*dist*alpha
    intercept += dist*alpha
    
    print(f'New-Equ: y = {round(slope, 2)}x + {round(intercept, 2)}\n')

In [6]:
alpha = 0.01
line_equ = (2, 3)
point_coord = (5, 15)

square_trick(line_equ, point_coord, alpha)

Old-Equ: y = 2x + 3,
Is_Point_Higher: True

New-Equ: y = 2.1x + 3.02



In [7]:
alpha = 0.01
line_equ = (-0.6, 4)
point_coord = (-5, 3)

square_trick(line_equ, point_coord, alpha)

Old-Equ: y = -0.6x + 4,
Is_Point_Higher: False

New-Equ: y = -0.4x + 3.96
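
As a rough illustration of the error-dependent step size, the sketch below repeats the Square-Trick update on the first example, the line `(2, 3)` and the point `(5, 15)`. `square_trick_step` is a hypothetical helper (not part of the notebook's code) that returns the new values instead of printing them.

```python
def square_trick_step(slope, intercept, x, y, alpha):
    """Return (slope, intercept) after one Square-Trick update.

    Hypothetical helper: same rule as square_trick, but it returns values.
    """
    # signed vertical distance from the line to the point
    dist = y - (slope * x + intercept)
    return slope + x * dist * alpha, intercept + dist * alpha


slope, intercept = 2.0, 3.0
x, y = 5, 15
errors = []
for _ in range(4):
    errors.append(abs(y - (slope * x + intercept)))
    slope, intercept = square_trick_step(slope, intercept, x, y, alpha=0.01)

print([round(e, 3) for e in errors])
```

For this point and this `alpha`, every update multiplies the remaining error by `1 - alpha * (x**2 + 1) = 0.74`, so the steps get smaller as the line approaches the point.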



## Gradient Descent:

1. First, for each point, we calculate the gradient of the chosen error function with respect to the weights. This error function could be MAE or MSE, for example. The gradient measures how quickly the error changes as each weight changes.
2. Next, we take the negative of the gradient and move in this opposite direction, as this is the direction that reduces the error the most.
3. Taken at full size, this step is usually too big, and in ML we generally avoid steps so big that they overshoot the minimum. So we multiply the negative gradient by the alpha or learning rate and move in the right direction gradually and repeatedly.
4. **Note that:** Gradient Descent takes into consideration the following:
    * The derivative of the error function with respect to the _weight_ $w_1$
    * The derivative of the error function with respect to the _y-intercept_ $w_2$
    * In short, it takes the derivatives of the error or loss function with respect to the existing weights and bias.
5. For a single point, these calculations are exactly the same as the _Absolute-Trick_ (with the absolute error) or the _Square-Trick_ (with the squared error) we computed earlier.
6. Note that a gradient is a vector, so it has both of the following characteristics:
    * A direction
    * A magnitude

The gradient always points in the direction of steepest increase in the loss function. Thus, the gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.
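
The steps above can be sketched as a short loop. This is a minimal sketch under stated assumptions, not a definitive implementation: the error is taken to be half the mean squared error, the four data points are made up for illustration (they lie on `y = 2x + 1`), and `mse` and `mse_gradients` are hypothetical helper names.

```python
def mse(w1, w2, points):
    """Half mean squared error: 1/(2m) * sum((y - y_hat)^2)."""
    return sum((y - (w1 * x + w2)) ** 2 for x, y in points) / (2 * len(points))


def mse_gradients(w1, w2, points):
    """Derivatives of the error above with respect to w1 (slope) and w2 (bias)."""
    m = len(points)
    grad_w1 = sum(-(y - (w1 * x + w2)) * x for x, y in points) / m
    grad_w2 = sum(-(y - (w1 * x + w2)) for x, y in points) / m
    return grad_w1, grad_w2


points = [(0, 1), (1, 3), (2, 5), (3, 7)]  # hypothetical data on y = 2x + 1
w1, w2 = 0.0, 0.0
alpha = 0.1

loss_before = mse(w1, w2, points)
for _ in range(1000):
    g1, g2 = mse_gradients(w1, w2, points)
    # step against the gradient, scaled by the learning rate
    w1 -= alpha * g1
    w2 -= alpha * g2
loss_after = mse(w1, w2, points)

print(round(w1, 2), round(w2, 2))
```

With this data and learning rate the loop should settle close to a slope of 2 and an intercept of 1. A much larger `alpha` would overshoot the minimum, as described in step 3.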

## Error Functions

#### Mean Absolute Error (MAE) and Mean Squared Error (MSE):

**1. MAE:**

This is the sum of the absolute errors between each point's $y$ and the line's prediction $\hat{y}$, divided by the total number of points $m$.

## MAE = $\frac{1}{m}\sum_{i=1}^m{|y_i - \hat{y}_i|}$

* We take the absolute values of the errors (the differences between $y$ and $\hat{y}$) so that negative errors cannot cancel out positive ones.

**2. MSE:**

This is the sum of the squared errors between each point's $y$ and the line's prediction $\hat{y}$, divided by the total number of points $m$ and multiplied by $\frac{1}{2}$. The half is just a convenience when taking derivatives for _Gradient Descent_: it cancels the factor of 2 that comes from differentiating the square.

## MSE = $\frac{1}{2m}\sum_{i=1}^m{(y_i - \hat{y}_i)}^2$

* We square the errors (the differences between $y$ and $\hat{y}$) so that we don't have any negative values, as squared values are never negative.

**Quiz for Mean Absolute Error**

Compute the mean absolute error for the following line and points:

line: `y = 1.2x + 2`

points: `(2, -2), (5, 6), (-4, -4), (-7, 1), (8, 14)`

In [8]:
import numpy as np
import math

In [9]:
def calc_mae(line_equ, point_coords):
    """Find the MAE between points and a line
        
    @param line_equ: a tuple of numbers for slope and y_intercept
    @param point_coords: a list of tuples of numbers for point coordinates
    @return: MAE, a number
    """
    slope, intercept = line_equ[0], line_equ[1]
    xes = [tup[0] for tup in point_coords]
    yes = [tup[1] for tup in point_coords]
    mae_sum = 0
    print(f'Line-Equ: y = {slope}x + {intercept}\n')
    
    # calculate the distance from each point to the line,
    # On the y-axis
    for x, y in zip(xes, yes):  
        y_hat = slope*x + intercept
        mae_dist = np.abs(y - y_hat)
        mae_sum+=mae_dist
    
    return mae_sum / len(point_coords)

In [10]:
line_equ = (1.2, 2)
point_coords = [(2, -2), (5, 6), (-4, -4), (-7, 1), (8, 14)]

calc_mae(line_equ, point_coords)

Line-Equ: y = 1.2x + 2



3.88

In [11]:
def calc_mse(line_equ, point_coords):
    """Find the MSE between points and a line
        
    @param line_equ: a tuple of numbers for slope and y_intercept
    @param point_coords: a list of tuples of numbers for point coordinates
    @return: the MSE (including the 1/2 factor), a number rounded to 2 decimals
    """
    slope, intercept = line_equ[0], line_equ[1]
    xes = [tup[0] for tup in point_coords]
    yes = [tup[1] for tup in point_coords]
    mse_sum = 0
    print(f'Line-Equ: y = {slope}x + {intercept}\n')
    
    # calculate the distance from each point to the line,
    # On the y-axis
    for x, y in zip(xes, yes):  
        y_hat = slope*x + intercept
        mse_dist = np.power(y - y_hat, 2)
        mse_sum+=mse_dist
    
    return round((mse_sum / len(point_coords))*0.5, 2)

In [12]:
calc_mse(line_equ, point_coords)

Line-Equ: y = 1.2x + 2



10.69
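
As a cross-check, both errors can also be computed without an explicit loop. The sketch below assumes NumPy is available and uses a hypothetical helper name, `mae_mse`. It keeps the same $\frac{1}{2}$ factor in the MSE as `calc_mse` but does not round.

```python
import numpy as np


def mae_mse(line_equ, point_coords):
    """Return (MAE, half-MSE) between the points and the line, vectorized."""
    slope, intercept = line_equ
    pts = np.asarray(point_coords, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    # signed vertical distance from the line to every point at once
    residuals = y - (slope * x + intercept)
    mae = float(np.mean(np.abs(residuals)))
    mse = float(np.mean(residuals ** 2) / 2)
    return mae, mse


mae, mse = mae_mse((1.2, 2), [(2, -2), (5, 6), (-4, -4), (-7, 1), (8, 14)])
print(round(mae, 2), round(mse, 2))
```

For the quiz line and points, this should agree with the two results above.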

### Proof: 
That minimizing the error with Gradient Descent is exactly the same as minimizing with the Absolute or Square Tricks

## $\frac{\partial}{\partial w_1}Error = -(y - \hat{y})x$

## $\frac{\partial}{\partial w_2}Error = -(y - \hat{y})$

The above two equations mean that, for the squared error of a single point, the derivative of the error function with respect to the weight or slope $w_1$ is equal to the negative of $y - \hat{y}$ multiplied by $x$, while the derivative with respect to the y-intercept or bias unit $w_2$ is equal to the negative of $y - \hat{y}$. These are derivatives, not the new parameter values. _Gradient Descent_ subtracts `alpha` times each derivative from the corresponding parameter, so the slope changes by $\alpha(y - \hat{y})x$ and the y-intercept by $\alpha(y - \hat{y})$, which are exactly the _Square-Trick_ updates.
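
The two derivatives follow from the chain rule. A short derivation, assuming the error for a single point is $E = \frac{1}{2}(y - \hat{y})^2$ and the prediction is $\hat{y} = w_1 x + w_2$:

$$
\frac{\partial E}{\partial w_1} = (y - \hat{y}) \cdot \frac{\partial (y - \hat{y})}{\partial w_1} = (y - \hat{y})(-x) = -(y - \hat{y})x
$$

$$
\frac{\partial E}{\partial w_2} = (y - \hat{y}) \cdot \frac{\partial (y - \hat{y})}{\partial w_2} = (y - \hat{y})(-1) = -(y - \hat{y})
$$

Stepping against these derivatives with learning rate $\alpha$ gives $w_1 \leftarrow w_1 + \alpha (y - \hat{y}) x$ and $w_2 \leftarrow w_2 + \alpha (y - \hat{y})$.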

Let's try to do this with some concrete examples

In [13]:
alpha = 0.01
line_equ = (2, 3)
point_coord = (5, 15)

square_trick(line_equ, point_coord, alpha)

Old-Equ: y = 2x + 3,
Is_Point_Higher: True

New-Equ: y = 2.1x + 3.02



In [14]:
x, y = point_coord[0], point_coord[1]
slope, bias = line_equ[0], line_equ[1]

y_hat = slope*x + bias
mse = ((y - y_hat)**2)*0.5
print(f'yhat: {y_hat}, mse: {mse}')
print(f'slope: {slope}, bias: {bias}')

yhat: 13, mse: 2.0
slope: 2, bias: 3


In [15]:
yy = 2.1*x + 3.02
yy

13.52

In [16]:
# these are the derivatives (gradients) of the error,
# not the updated slope and bias
grad_slope = -(y - y_hat)*x
grad_bias = -(y - y_hat)

print(f'grad-slope: {grad_slope}, grad_bias: {grad_bias}')

grad-slope: -10, grad_bias: -2
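
The two values printed above are the derivatives of the error, not yet the updated parameters. The sketch below completes the comparison by taking one gradient-descent step with the same numbers. The variable names `grad_slope`, `grad_bias`, `updated_slope` and `updated_bias` are chosen here for illustration.

```python
alpha = 0.01
slope, bias = 2, 3
x, y = 5, 15

y_hat = slope * x + bias

# derivatives of the squared error 0.5 * (y - y_hat)**2
grad_slope = -(y - y_hat) * x
grad_bias = -(y - y_hat)

# gradient descent: move against the gradient, scaled by alpha
updated_slope = slope - alpha * grad_slope
updated_bias = bias - alpha * grad_bias

print(round(updated_slope, 2), round(updated_bias, 2))
```

The step lands on a slope of 2.1 and a bias of 3.02, the same line the Square-Trick produced for this point.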


**Numpy dot and matmul functions**

For 2-D arrays, the numpy.dot() function performs matrix multiplication. It checks the condition for matrix multiplication, that is, the number of columns of the first matrix must be equal to the number of rows of the second, and raises an error otherwise. It also accepts multi-dimensional arrays, and an optional `out` array can be passed to store the result. The @ operator is equivalent to numpy.matmul(), which gives the same result as numpy.dot() for 2-D arrays (the two differ for arrays with more than two dimensions). For example,

In [17]:
a = np.array([[1,2], [2, 3]])
b = np.array([[8,4], [4, 7]])

print(a)
print(' ')
print(b)

[[1 2]
 [2 3]]
 
[[8 4]
 [4 7]]


In [18]:
print(np.dot(a,b))

[[16 18]
 [28 29]]


In [19]:
c = a@b
c

array([[16, 18],
       [28, 29]])
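
The remaining points from the paragraph above (the shape check and the `out` parameter) can be sketched as follows. The names `same`, `out` and `shape_error` are illustrative only.

```python
import numpy as np

a = np.array([[1, 2], [2, 3]])
b = np.array([[8, 4], [4, 7]])

# np.dot, np.matmul and @ agree for 2-D arrays
same = (np.array_equal(np.dot(a, b), np.matmul(a, b))
        and np.array_equal(np.dot(a, b), a @ b))

# out= writes the product into a preallocated array
# of matching shape and dtype
out = np.empty((2, 2), dtype=a.dtype)
np.dot(a, b, out=out)

# a (2, 2) matrix cannot be multiplied by a (3, 2) matrix:
# the inner dimensions (2 and 3) do not match
try:
    np.dot(a, np.ones((3, 2)))
    shape_error = False
except ValueError:
    shape_error = True

print(same, shape_error)
print(out)
```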