# 1. Introduction to Linear Regression

- Consider a regression line that shows the relationship between the size of a house and its price
    - __Question__: What is the best estimate for the price of the house?
    - __Answer__: 120k

![house.png](attachment:house.png)

# 2. Fitting a Line

- Start by drawing a random line, then ask each point what it would want the line to do
- The idea is to take a series of small steps, each of which moves the line closer to the points

![LR.PNG](attachment:LR.PNG)

# 3. Absolute Trick

- First, consider how changing the parameters moves the line
- Consider a line __y = w1 x + w2__
- Here the __slope of the line = w1__ and the __intercept = w2__
- Changing __w1__ rotates the line (it rotates up if we increase w1 and rotates down if we decrease w1)
- Changing __w2__ shifts the line parallel to itself (up if we increase w2 and down if we decrease w2)

### Step1:
- Consider a point say (p,q) and a line y = w1 x + w2
- The point (p,q) wants the line to come closer to it
![absolute_trick1.PNG](attachment:absolute_trick1.PNG)

### Step 2:
- Adding p to the slope and 1 to the intercept we get the new line equation as y = (w1 + p) x + (w2 + 1)
![absolute_trick2.PNG](attachment:absolute_trick2.PNG)

### Step 3:
- But this step is too large and the line overshoots the point. In machine learning we take small steps, so we introduce a learning rate 'alpha': add 'alpha' multiplied by p to the slope and 'alpha' multiplied by 1 to the intercept.
- We get the new line equation as y = (w1 + alpha p) x + (w2 + alpha)
![absolute_trick3.PNG](attachment:absolute_trick3.PNG)
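
The three steps above can be sketched in a few lines of Python. This is only an illustration: the function name `absolute_trick` and the example numbers are made up, and the branch for a point below the line (subtracting instead of adding) is an assumed mirror case that the steps above do not spell out.

```python
def absolute_trick(w1, w2, p, q, alpha=0.01):
    """Move the line y = w1*x + w2 one small step toward the point (p, q)."""
    q_hat = w1 * p + w2  # the line's prediction at x = p
    if q > q_hat:
        # Point is above the line: add alpha*p to the slope, alpha to the intercept
        return w1 + alpha * p, w2 + alpha
    if q < q_hat:
        # Point is below the line: assumed mirror case, subtract instead
        return w1 - alpha * p, w2 - alpha
    return w1, w2  # point already lies on the line


# Line y = 2x + 3 and a point (5, 15) that sits above it (the line predicts 13)
new_w1, new_w2 = absolute_trick(2.0, 3.0, p=5.0, q=15.0, alpha=0.1)
print(new_w1, new_w2)
```

Note that the size of the step depends only on p and alpha, not on how far the point is from the line.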

# 4. Square Trick

- If we have a point (p,q) that is close to the line, then we would want the line to move less
- If we have a point (p,q) that is far away from the line, then we would want the line to move more
- The absolute trick discussed above does not have this property
- Suppose we have a point (p,q) and a line y = w1 x + w2, and let q' = w1 p + w2 be the line's prediction at x = p
    - We add 'alpha' multiplied by p(q-q') to the slope
    - We add 'alpha' multiplied by (q-q') to the intercept

![Square_trick.PNG](attachment:Square_trick.PNG)
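
A minimal sketch of the square trick, assuming q' is the line's prediction w1 p + w2 at x = p. The function name `square_trick` and the sample points are hypothetical and chosen only to show that a far point moves the line more than a near one.

```python
def square_trick(w1, w2, p, q, alpha=0.01):
    """One square-trick update of the line y = w1*x + w2 for the point (p, q)."""
    q_hat = w1 * p + w2                # q' in the notes: prediction at x = p
    new_w1 = w1 + alpha * p * (q - q_hat)
    new_w2 = w2 + alpha * (q - q_hat)
    return new_w1, new_w2


# The line y = 2x + 3 predicts 13 at x = 5
near = square_trick(2.0, 3.0, p=5.0, q=13.5)  # point is 0.5 above the line
far = square_trick(2.0, 3.0, p=5.0, q=23.0)   # point is 10 above the line
print(near, far)
```

Because (q - q') is signed, the same formula also moves the line down when the point is below it.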

# 5. Gradient Descent

- Gradient descent is an iterative optimization algorithm to find the minimum of a function. 
- __Example:__ 
    - Imagine a valley and a person with no sense of direction who wants to get to the bottom of the valley.
    - He goes down the slope and takes large steps when the slope is steep and small steps when the slope is less steep.
    - He decides his next position based on his current position and stops when he gets to the bottom of the valley which was his goal.
- Our goal is to find the line which minimizes the error (where error is the difference between our line and the points in either absolute or squared terms)
- To descend the error curve, we take the derivative of the error with respect to the weights and step in the opposite direction
- In practice, we multiply this derivative by a learning rate so that each step stays small

![gradient_descent.PNG](attachment:gradient_descent.PNG)
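
As a rough sketch of the idea, the loop below runs batch gradient descent for a line y = w1 x + w2. It assumes the error is the squared error averaged as 1/(2m); the function name, learning rate, epoch count, and toy data are illustrative choices, not part of these notes.

```python
import numpy as np


def gradient_descent(x, y, alpha=0.05, epochs=2000):
    """Fit y = w1*x + w2 by repeatedly stepping against the error gradient."""
    w1, w2 = 0.0, 0.0
    m = len(x)
    for _ in range(epochs):
        residual = y - (w1 * x + w2)          # y - y_hat for every point
        grad_w1 = -(residual * x).sum() / m   # d(error)/d(w1)
        grad_w2 = -residual.sum() / m         # d(error)/d(w2)
        w1 -= alpha * grad_w1                 # step downhill, scaled by alpha
        w2 -= alpha * grad_w2
    return w1, w2


# Toy data lying exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
w1, w2 = gradient_descent(x, y)
print(w1, w2)
```

If alpha is too large the updates can overshoot and diverge, and if it is too small the loop needs many more epochs to get close to the minimum.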

# 6. Error Function (Mean Absolute Error)

![MAE.PNG](attachment:MAE.PNG)

- Point has co-ordinates (x, y)
- Our prediction point will be (x, $\hat{y}$)
## - $Error = (y - \hat{y})$
## - $ Mean \ Absolute \ Error = \frac{\sum|(y - \hat{y})|}{m}$
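
The formula above translates directly into NumPy. The helper name and the three sample values below are made up for illustration.

```python
import numpy as np


def mean_absolute_error(y, y_hat):
    """Average of |y - y_hat| over all m points."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return np.abs(y - y_hat).sum() / len(y)


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
mae = mean_absolute_error(y_true, y_pred)  # (1 + 0 + 2) / 3
print(mae)
```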

# 7. Error Function (Mean Squared Error)

![MSE.PNG](attachment:MSE.PNG)

- Point has co-ordinates (x, y)
- Our prediction point will be (x, $\hat{y}$)
## - $Error = (y - \hat{y})$
## - $ Mean \ Squared \ Error = \frac{\sum(y - \hat{y})^2}{2m}$
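
A small NumPy sketch of the formula above, using made-up sample values. It keeps the 1/(2m) factor from these notes; the extra 1/2 is a common convenience that cancels the 2 produced when the square is differentiated, whereas many libraries divide by m only.

```python
import numpy as np


def mean_squared_error_half(y, y_hat):
    """Sum of squared errors divided by 2m, as defined in these notes."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return ((y - y_hat) ** 2).sum() / (2 * len(y))


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
mse = mean_squared_error_half(y_true, y_pred)  # (1 + 0 + 4) / (2 * 3)
print(mse)
```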

# 8. Minimizing Error Functions

![Minimizing_MSE.PNG](attachment:Minimizing_MSE.PNG)

### Case 1: Minimizing MSE

![Minimizing_MSE2.PNG](attachment:Minimizing_MSE2.PNG)

### Case 2: Minimizing MAE

![Minimizing_MSE3.PNG](attachment:Minimizing_MSE3.PNG)

![MAE_MSE.PNG](attachment:MAE_MSE.PNG)
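
As a worked sketch of what minimizing each error involves, assume the prediction is $\hat{y}_i = w_1 x_i + w_2$ over $m$ points and the errors are defined as in sections 6 and 7. The partial derivatives of the Mean Squared Error are then:

$$
\frac{\partial}{\partial w_1}\,\frac{1}{2m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2 = -\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)\,x_i,
\qquad
\frac{\partial}{\partial w_2}\,\frac{1}{2m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2 = -\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)
$$

and, wherever $y_i \neq \hat{y}_i$, those of the Mean Absolute Error are:

$$
\frac{\partial}{\partial w_1}\,\frac{1}{m}\sum_{i=1}^{m}|y_i - \hat{y}_i| = -\frac{1}{m}\sum_{i=1}^{m}\operatorname{sign}(y_i - \hat{y}_i)\,x_i,
\qquad
\frac{\partial}{\partial w_2}\,\frac{1}{m}\sum_{i=1}^{m}|y_i - \hat{y}_i| = -\frac{1}{m}\sum_{i=1}^{m}\operatorname{sign}(y_i - \hat{y}_i)
$$

For a single point, stepping against the MSE gradient with learning rate alpha adds alpha multiplied by $x(y - \hat{y})$ to the slope and alpha multiplied by $(y - \hat{y})$ to the intercept, which is the square trick. Stepping against the MAE gradient adds or subtracts alpha multiplied by $x$ and alpha, which is the absolute trick.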

# 9. MAE vs MSE

![MSE_better.jpg](attachment:MSE_better.jpg)

- Lines A, B and C would give us the same Mean Absolute Error, whereas the line B would give us the smallest Mean Squared Error.

# 10. Linear Regression in scikit-learn

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

In [2]:
# Load dataset
bmi_life_data = pd.read_csv('bmi_data.csv')

In [3]:
# View first few rows of the data
bmi_life_data.head()

| Unnamed: 0 | Country | Life expectancy | BMI |
|---|---|---|---|
| 0 | Afghanistan | 52.8 | 20.62058 |
| 1 | Albania | 76.8 | 26.44657 |
| 2 | Algeria | 75.5 | 24.5962 |
| 3 | Andorra | 84.6 | 27.63048 |
| 4 | Angola | 56.7 | 22.25083 |


In [4]:
# Fit the model and assign it to bmi_life_model
bmi_life_model = LinearRegression()
bmi_life_model.fit(bmi_life_data[['BMI']], bmi_life_data[['Life expectancy']])

LinearRegression()

In [5]:
# Make a prediction using the model
# Predict life expectancy for a BMI value of 21.07931
laos_life_exp = bmi_life_model.predict(np.array([21.07931]).reshape(1,1))
laos_life_exp

array([[60.31564716]])
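
To connect the fitted model back to the line y = w1 x + w2, the sketch below reads the slope and intercept from a fitted `LinearRegression` and reproduces a prediction by hand. The four-row DataFrame is a hypothetical stand-in for `bmi_data.csv`, with values invented so that they lie exactly on a line.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data (not from bmi_data.csv): life expectancy = 3 * BMI - 5
toy = pd.DataFrame({
    "BMI": [20.0, 22.0, 24.0, 26.0],
    "Life expectancy": [55.0, 61.0, 67.0, 73.0],
})

toy_model = LinearRegression()
toy_model.fit(toy[["BMI"]], toy["Life expectancy"])

slope = toy_model.coef_[0]        # w1
intercept = toy_model.intercept_  # w2

# The model's prediction is just w1 * x + w2
manual = slope * 23.0 + intercept
predicted = toy_model.predict(pd.DataFrame({"BMI": [23.0]}))[0]
print(slope, intercept, manual, predicted)
```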

# 11. Multiple Linear Regression

![MLR.PNG](attachment:MLR.PNG)

In [6]:
# Import Libraries
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

In [7]:
# Load the data from the boston house-price dataset
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston_data = load_boston()

In [8]:
# Assign X and y
X = boston_data['data']
y = boston_data['target']

In [9]:
# Fit the model and Assign it to the model variable
model = LinearRegression()
model.fit(X, y)

LinearRegression()

In [10]:
# Make a prediction using the model
sample_house = [[2.29690000e-01, 0.00000000e+00, 1.05900000e+01, 0.00000000e+00, 4.89000000e-01,
                6.32600000e+00, 5.25000000e+01, 4.35490000e+00, 4.00000000e+00, 2.77000000e+02,
                1.86000000e+01, 3.94870000e+02, 1.09700000e+01]]

In [11]:
# Predict housing price for the sample_house
prediction = model.predict(sample_house)
prediction

array([23.68284712])

# 12. Closed Form Solution

![MSE_method.PNG](attachment:MSE_method.PNG)

- In the example above, n = 2, so we get a system of 2 equations with 2 unknowns
- For general n, we get a system of n equations with n unknowns, which has to be solved with matrices; for large n this substantially increases the computing time and effort, which is why gradient descent is often preferred

### 2-D Solution

![2D_1.PNG](attachment:2D_1.PNG)

![2D_2.PNG](attachment:2D_2.PNG)

### n-D Solution

![n_D.PNG](attachment:n_D.PNG)

![n_D2.PNG](attachment:n_D2.PNG)
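
A minimal NumPy sketch of a closed-form solve, assuming the usual normal-equation form $W = (X^T X)^{-1} X^T y$ with a column of ones appended to X for the intercept. The data is invented so that it lies exactly on a known plane.

```python
import numpy as np

# Hypothetical data generated from y = 1.5*x1 - 2.0*x2 + 4.0
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0],
              [4.0, 3.0]])
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + 4.0

# Append a column of ones so the last weight acts as the intercept
X_b = np.column_stack([X, np.ones(len(X))])

# Solve (X^T X) W = X^T y rather than forming the inverse explicitly
W = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)
print(W)
```

Solving the linear system is generally cheaper and numerically safer than computing the inverse, but the cost still grows quickly with the number of features.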

# 13. Linear Regression Warnings

- Linear regression comes with a set of implicit assumptions and is not the best model for every situation

### Linear Regression Works Best When the Data is Linear
- Linear regression produces a straight line model from the training data. If the relationship in the training data is not really linear, you'll need to either make adjustments (transform your training data), add features (we'll come to this next), or use another kind of model.

![linear_graph.png](attachment:linear_graph.png)

### Linear Regression is Sensitive to Outliers
- Linear regression tries to find a 'best fit' line among the training data. If your dataset has some outlying extreme values that don't fit a general pattern, they can have a surprisingly large effect.

- In this first plot, the model fits the data pretty well.
![lin-reg-no-outliers.png](attachment:lin-reg-no-outliers.png)

- However, adding a few points that are outliers and don't fit the pattern really changes the way the model predicts.
![lin-reg-w-outliers.png](attachment:lin-reg-w-outliers.png)

- In most circumstances, you'll want a model that fits most of the data most of the time, so watch out for outliers!
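
The effect is easy to reproduce on toy data. In the sketch below (all values invented), ten points lie exactly on y = 2x + 1, and one extreme point is then added before refitting with `np.polyfit`.

```python
import numpy as np

# Ten points exactly on the line y = 2x + 1
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

# Add a single outlier far below the pattern and refit
x_out = np.append(x, 10.0)
y_out = np.append(y, -50.0)
slope_outlier, intercept_outlier = np.polyfit(x_out, y_out, deg=1)

print(slope_clean, slope_outlier)
```

One point out of eleven is enough here to turn a clearly positive slope into a negative one, because least squares penalizes the large residual quadratically.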

# 14. Polynomial Regression

![ploy_regrn.PNG](attachment:ploy_regrn.PNG)

- When a straight line does not fit the data well, we can fit a polynomial instead
- MSE or MAE can still be used to find the curve that minimizes the error
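
One common way to do this in scikit-learn, sketched here under the assumption of a degree-2 curve and invented data, is to expand x into polynomial features and then fit an ordinary `LinearRegression` on them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical data on the curve y = 0.5*x^2 - x + 2
x = np.linspace(-3, 3, 20).reshape(-1, 1)
y = 0.5 * x[:, 0] ** 2 - x[:, 0] + 2.0

# Expand x into the columns [x, x^2]; the intercept is left to the model
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

poly_model = LinearRegression().fit(x_poly, y)
print(poly_model.coef_, poly_model.intercept_)
```

The model is still linear in its weights, so the same error functions and fitting machinery apply; only the input features have changed.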

# 15. Regularization

- This technique can be applied to both Regression and Classification

#### The model on the left makes a couple of mistakes but is much simpler, while the model on the right overfits and does not generalize well. So, we pick the model on the LEFT

![df_models.PNG](attachment:df_models.PNG)

#### The RIGHT model has many coefficients, so once complexity is counted as part of the error, the complex model ends up with a higher combined error than the simple model
- Notice that the yellow part of the error comes from misclassification, which happens more in the model on the LEFT
- The green part of the error comes from the complexity of the model, i.e. the coefficients of the model, which is larger for the model on the RIGHT

![combined_error.PNG](attachment:combined_error.PNG)

### Question:
How do we go about choosing the model on the LEFT?

### L1 Regularization

#### 1. Model on the RIGHT
![Model1.PNG](attachment:Model1.PNG)

#### 2. Model on the LEFT
![Model2.PNG](attachment:Model2.PNG)

### L2 Regularization

#### 1. Model on the RIGHT
![model3.PNG](attachment:model3.PNG)

#### 2. Model on the LEFT
![model4.PNG](attachment:model4.PNG)

### What is Lambda Parameter?
- If we require low error and are okay with a complex model, then the punishment on complexity should be small.
- If we require a simple model and are okay with some errors, then the punishment on complexity should be large.
- This trade-off is controlled by a parameter lambda, which multiplies the complexity term
    - Small lambda = punish the complex model less, so the complex model wins
    - Large lambda = punish the complex model more, so the simple model wins
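
The trade-off can be sketched numerically. The code below assumes the combined error is the base error plus lambda times a penalty, where the penalty is the sum of absolute coefficients (L1) or of squared coefficients (L2). The coefficient lists and base errors are hypothetical values chosen for illustration.

```python
import numpy as np


def regularized_error(base_error, coefficients, lam, kind="l1"):
    """Base error plus lambda times the L1 or L2 size of the coefficients."""
    w = np.asarray(coefficients, dtype=float)
    if kind == "l1":
        penalty = np.abs(w).sum()
    elif kind == "l2":
        penalty = (w ** 2).sum()
    else:
        raise ValueError("kind must be 'l1' or 'l2'")
    return base_error + lam * penalty


# Hypothetical models: the complex one fits perfectly, the simple one makes errors
complex_w, complex_base = [2, -2, -4, 3, 6, 4], 0.0
simple_w, simple_base = [3, 4], 2.0

for lam in (0.01, 1.0):
    c = regularized_error(complex_base, complex_w, lam)
    s = regularized_error(simple_base, simple_w, lam)
    print(lam, "complex wins" if c < s else "simple wins")
```

With these numbers the complex model has the lower combined error at lambda = 0.01, and the simple model has the lower combined error at lambda = 1.0.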

### L1 vs L2 Regularization

![L1_L2.PNG](attachment:L1_L2.PNG)
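
One practical difference between the two can be seen with scikit-learn's `Lasso` (L1 penalty) and `Ridge` (L2 penalty), where the `alpha` argument plays the role of lambda. The sketch below uses synthetic data in which only two of six features matter; the data, seed, and alpha values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of six features influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 4.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty on the coefficients
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty on the coefficients

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)
print("Exact zeros (Lasso, Ridge):",
      int(np.sum(lasso.coef_ == 0)), int(np.sum(ridge.coef_ == 0)))
```

L1 regularization tends to drive the coefficients of uninformative features to exactly zero, which acts as a form of feature selection, whereas L2 regularization shrinks coefficients toward zero without usually eliminating them.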