<a href="https://colab.research.google.com/github/ReshamWadhwa/ML-models/blob/main/LinearRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Linear Regression

In [8]:
# Feature Vector 
# x = [x_1, x_2, …., x_n]
# REsponse vector
# y = [y_1, y_2, …., y_n]

#  h( x ) = w * x + b  
    
#   b = bias -- needed
#   x represents the feature vector
#   w represents the weight vector.


**WHY IS BIAS TERM NEEDED?**

When bias is absent

y = w1x1+w2x2+w3x3...+wnxn

For situation when x1=x2=x3 = 0 , our y is forced to be 0.
This might lead to underfit or bias in model if original data fit doesn;t pass through 0 but we force it to do so. 
Hence to avoid bias wrt origin, a bias term is added to the equation and the model

What are the assumptions about data before fitting a Linear Regressor?

1. **Data Target variable is continuous (Regression problem).**
    
    Non continous data require classification model

1. **Linear relationship in y and X.**

  Relationship between dependent and independent variables is linear. 
  
  The linearity assumption can be tested using scatter plots.

  **But Why?** Because Non linear data relationships cannot be captured by sum of scalar matrix multiplications - how LR works

1. **Little or no multi-collinearity.**

  Features shouldn't be dependent on each other. 

  There shouldn't be high correlation between two or more independent variables i.e., X. 

  **But why?** Because coefficients/weights are reflective of change in that feature only while computing Y. 
  Let say , y = w1a1+w2a2+w3a3+ w0   //(bias)

  and  that a1 and a2 are correlated.

  Now, model's interpretation of w1 is the change in y caused by a unit change in a1 alone and no other feature, however since a1 and a2 are correlated, this assumption fails.

  Can be resolved using VIF ( Variance Inflation Factor )

1. **Little or no auto-correlation/Independence of observations.**
  
    Autocorrelation is when a variable is related to earlier versions of itself. 

1. **Homoscedasticity**

  Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables. The variance of the residuals is constant.

  **But why ?** 10% change in lower value of x will result in lower error than 10% change in x when it is equal to 1,000,000. This is a case when error will have different variance across different range of features.

  Can be solved by using Weighted Least Square Model - This type of regression assigns a weight to each data point based on the variance of its fitted value.

1. **All independent variables are uncorrelated with the error term**




**Variance inflation factor (VIF)** 

is used to detect the severity of multicollinearity in the ordinary least square (OLS) regression analysis.


Variance Inflation Factors (VIFs) measure the correlation among independent variables in least squares regression models. Statisticians refer to this type of correlation as multicollinearity. Excessive multicollinearity can cause problems for regression models.

*Calculating Variance Inflation Factors*

VIFs use multiple regression to calculate the degree of multicollinearity. Imagine you have four independent variables: X1, X2, X3, and X4. Of course, the model has a dependent variable (Y), but we don’t need to worry about it for our purposes. When your statistical software calculates VIFs, it uses multiple regression to regress all IVs except one on that final IV. It repeats this process for all IVs, as shown below:

X1 ⇐ X2, X3, X4
X2 ⇐ X1, X3, X4
X3 ⇐ X1, X2, X4
X4 ⇐ X1, X2, X3

To calculate the VIFs, all independent variables become a dependent variable. Each model produces an R-squared value indicating the percentage of the variance in the individual IV that the set of IVs explains. Consequently, higher R-squared values indicate higher degrees of multicollinearity. VIF calculations use these R-squared values. The VIF for an independent variable equals the following:

VIF formula.





**R-squared** = Variance Explained (R2)

It records the proportion of variation in the dependent variable explained by the independent variables. 
In range [0,1] -- 0 means X doesn't explain any variance in Y and hence has no impact at all, 1 means X explains all variation present in Y

In case of overfit, R2 will still be high if too many independent features are present . Adjusted R2 considers the nummber of features as well. 


```
Adjusted R Squared = 1 – [((1 – R2) * (n – 1)) / (n – k – 1)]

n == data points 
k == number of independent feature variables

```


Now, in VIF, all independent features are modelled against other IFs. R-squared is coefficient of determination of these individual features. 

If R-sq of X1 =0 , it means that other features do not explain any of its variance. 







```
Tolerance of i-th feature= 1- R-squared
```

i.e., how much variance is left unexplained when using the given features.


```
VIF of i-th feature = 1/tolerance

VIFi = 1/(1-R2)
```

Higher the explained variance for a coef, lower the tolerance, higher the VIF.

The potential solutions include the following:

1. Remove some of the highly correlated independent variables.
1. Linearly combine the independent variables, such as adding them together.
1. Perform an analysis designed for highly correlated variables, such as principal components analysis or partial least squares regression.
1. LASSO and Ridge regression are advanced forms of regression analysis that can handle multicollinearity. If you know how to perform linear least squares regression, you’ll be able to handle these analyses with just a little additional study.

In [6]:
import numpy as np

In [64]:



# y = w0 + w1x

# Aim is to arrive at w0 , w1 values at which error is minimum.

# Error = 

# In Ordinary Least Squares Regression, 

# J(w0,w1)= 1/2m x Sum(y-y`)^2

In [12]:
# # PseudoCode
# 1. Start with random values of w0 and w2
# 2. Calculate error term after getting output
# 3. repeat till convergence:
#     a. Find the change to be made in w0 so that error is min
#     b. Find the change to be made in w1 so that error is min
#     c. Update the weights 
#     d. Find predictions

# repeat until convergence  {
#        tmpi = wi - alpha * dwi          
#        wi = tmpi              
# }
# where alpha is the learning rate.

In [65]:
X = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Y = np.array([0, 2, 4, 5, 8, 9, 12, 14, 16, 19])

In [66]:
for x,y in zip(X,Y):
  print(x,y)

0 0
1 2
2 4
3 5
4 8
5 9
6 12
7 14
8 16
9 19


In [73]:
class LinearRegression:

  X = None
  Y = None
  b = None
  learning_rate = None
  iterations = None

  def __init__(self, learning_rate=0.01, iterations=500):
    self.learning_rate = learning_rate
    self.iterations = iterations

  def initial_weights(self):
     # weight initialization
      self.W = np.zeros( self.n )
      self.b = 0
      

  def fit( self, X, Y ) :
     
      # no_of_training_examples, no_of_features
      self.m, self.n = X.shape[0] if len(X.shape)>1 else X.shape[0],1
      self.X = X
      self.Y = Y
      
      self.initial_weights()

      # gradient descent learning   
      for i in range( self.iterations ) :   
          print("Iteration ",i, "W:",self.W,", b:",self.b) 
          self.update_weights()
            
      return self


  def h(self,x) :
    return self.W*x+self.b
       

  def update_weights( self ) :
             
        error = 0
        for (x,y) in zip(X,Y):
          y_pred = self.h(x)
          y_actual = y
          error += (y_pred - y_actual)*(y_pred - y_actual)
        self.error = error
        dW = -1*(self.learning_rate/self.m)*error*x
       
        db = -1*(self.learning_rate/self.m)*error
          
        # update weights
      
        self.W = self.W - self.learning_rate * dW
        self.b = self.b - self.learning_rate * db
        print
        return self

In [74]:
l = LinearRegression()
model = l.fit(X,Y)

Iteration  0 W: [0.] , b: 0
Iteration  1 W: [0.10323] , b: [0.01147]
Iteration  2 W: [0.19594932] , b: [0.02177215]
Iteration  3 W: [0.2797107] , b: [0.03107897]
Iteration  4 W: [0.35577224] , b: [0.03953025]
Iteration  5 W: [0.42516464] , b: [0.04724052]
Iteration  6 W: [0.48874074] , b: [0.05430453]
Iteration  7 W: [0.5472126] , b: [0.0608014]
Iteration  8 W: [0.60117955] , b: [0.06679773]
Iteration  9 W: [0.65114985] , b: [0.07234998]
Iteration  10 W: [0.69755743] , b: [0.07750638]
Iteration  11 W: [0.7407751] , b: [0.08230834]
Iteration  12 W: [0.781125] , b: [0.08679167]
Iteration  13 W: [0.818887] , b: [0.09098744]
Iteration  14 W: [0.85430542] , b: [0.09492282]
Iteration  15 W: [0.88759455] , b: [0.09862162]
Iteration  16 W: [0.91894314] , b: [0.10210479]
Iteration  17 W: [0.94851815] , b: [0.10539091]
Iteration  18 W: [0.97646777] , b: [0.10849642]
Iteration  19 W: [1.00292404] , b: [0.111436]
Iteration  20 W: [1.02800497] , b: [0.11422277]
Iteration  21 W: [1.0518164] , b: [0.