3A. The following data is used to fit a model of consumption-expenditure on income and wealth:
> Cons-exp ('00 Rs.): 70,65,90,70,110,115,120,140,155,150
> Income (’00 Rs.): 80,100,120,140,160,180,200,220,240,260
> Wealth (’00 Rs.): 810,1009,1273,1425,1633,1876,2052,2201,2435,2686
1. Use two different methods to check for the presence of multicollinearity.
2. Does the additional information on an 11th individual (expenditure 160, income 120, wealth 3,000) and a 12th individual (expenditure 85, income 255, wealth 920) help you to overcome the problem?
3. Use PCA and Ridge regression to solve the problem.

In [101]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as st
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import Ridge
from numpy.linalg import eig
import numpy.linalg as LA
import math

## Load Dataset

In [2]:
Dict= dict({"cons_exp" : [70,65,90,70,110,115,120,140,155,150], 
            "income" : [80,100,120,140,160,180,200,220,240,260], 
            "wealth" : [810,1009,1273,1425,1633,1876,2052,2201,2435,2686] })
data= pd.DataFrame(Dict)
data.head()

   cons_exp  income  wealth
0        70      80     810
1        65     100    1009
2        90     120    1273
3        70     140    1425
4       110     160    1633


#### Multicollinearity occurs when two or more independent variables in a multiple regression model are highly correlated with one another. It can be detected with several techniques, one of which is the Variance Inflation Factor (VIF).

In [75]:
 # the independent variables set
X= data[["income","wealth"]]

pandas.core.frame.DataFrame

In [4]:
# VIF dataframe
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns

In [5]:
# calculating VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i)
                          for i in range(X.shape[1])]
print(vif_data)

  feature         VIF
0  income  4693.97166
1  wealth  4693.97166
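
The VIF of a regressor is 1 / (1 - R²), where R² comes from regressing that variable on the remaining regressors. As far as I am aware, `variance_inflation_factor` runs this auxiliary regression on the columns exactly as passed, so no intercept is used unless the matrix already contains a constant column. The sketch below is a NumPy-only illustration of the intercept-based (centred) version for the same ten observations; the helper name `centred_vif` is hypothetical, not a library function.

```python
import numpy as np

# Income and wealth for the 10 individuals (values copied from the data above).
income = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
wealth = np.array([810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686], dtype=float)
X = np.column_stack([income, wealth])


def centred_vif(X, j):
    """VIF of column j from an auxiliary regression that includes an intercept."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    # Auxiliary design matrix: intercept column plus the other regressors.
    A = np.column_stack([np.ones(len(target)), others])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)


vifs = [centred_vif(X, j) for j in range(X.shape[1])]
for name, v in zip(["income", "wealth"], vifs):
    print(f"{name}: centred VIF = {v:.2f}")
```

With only two regressors both VIFs coincide, because each auxiliary regression has the same R² (the squared correlation between income and wealth).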


#### Checking multicollinearity using the condition number

In [104]:
CN=LA.cond(X)
print("CONDITION NUMBER:",CN)

CONDITION NUMBER: 707.7038938618891


#### The condition number is greater than 30, which indicates high multicollinearity between income and wealth.
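
The condition number is the ratio of the largest to the smallest singular value of the design matrix. The rule of thumb of 30 is, as I understand it, usually stated for a matrix that includes the intercept column and whose columns have been scaled to unit length, so the raw value depends on the units of measurement. The sketch below computes both versions with NumPy; `condition_number` is a hypothetical helper written for this illustration.

```python
import numpy as np

# Income and wealth for the 10 individuals (values copied from the data above).
income = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
wealth = np.array([810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686], dtype=float)


def condition_number(M):
    """Ratio of the largest to the smallest singular value of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return s.max() / s.min()


# Raw regressors, as used in the cell above.
X_raw = np.column_stack([income, wealth])

# Assumption: add an intercept column and scale every column to unit length.
X_full = np.column_stack([np.ones(len(income)), income, wealth])
X_scaled = X_full / np.linalg.norm(X_full, axis=0)

cn_raw = condition_number(X_raw)
cn_scaled = condition_number(X_scaled)
print("condition number (raw columns):   ", cn_raw)
print("condition number (scaled columns):", cn_scaled)
```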

In [6]:
#dependent variable
y= data["cons_exp"]            
y.head()

0     70
1     65
2     90
3     70
4    110
Name: cons_exp, dtype: int64

#### Adding the information on an 11th individual (expenditure 160, income 120, wealth 3,000) and a 12th individual (expenditure 85, income 255, wealth 920)

In [8]:
Dict1= dict({"cons_exp" : [70,65,90,70,110,115,120,140,155,150,160,85], 
             "income" : [80,100,120,140,160,180,200,220,240,260,120,255], 
             "wealth" : [810,1009,1273,1425,1633,1876,2052,2201,2435,2686,3000,920] })
data1= pd.DataFrame(Dict1)
data1.tail()

    cons_exp  income  wealth
7        140     220    2201
8        155     240    2435
9        150     260    2686
10       160     120    3000
11        85     255     920


In [9]:
X1= data1[["income","wealth"]]     # the independent variables set
y1=data1["cons_exp"]               # dependent variable

In [10]:
# VIF dataframe
vif_data1 = pd.DataFrame()
vif_data1["feature"] = X1.columns

In [11]:
# calculating VIF for each feature
vif_data1["VIF"] = [variance_inflation_factor(X1.values, i)
                          for i in range(X1.shape[1])]
print(vif_data1)

  feature       VIF
0  income  7.404982
1  wealth  7.404982


### Performing OLS

In [12]:
x_const=sm.add_constant(X1)                  
model= sm.OLS(y1,x_const)
results=model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:               cons_exp   R-squared:                       0.929
Model:                            OLS   Adj. R-squared:                  0.913
Method:                 Least Squares   F-statistic:                     58.80
Date:                Fri, 03 Jun 2022   Prob (F-statistic):           6.81e-06
Time:                        13:43:57   Log-Likelihood:                -43.269
No. Observations:                  12   AIC:                             92.54
Df Residuals:                       9   BIC:                             93.99
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         17.8521     10.174      1.755      0.1

#### Adding the two extra observations reduces the VIF relative to the original ten observations. However, the VIF is still above 5, so the multicollinearity is still considered high.
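
With two regressors, the intercept-based VIF depends only on the correlation r between income and wealth: VIF = 1 / (1 - r²). The sketch below compares r and this VIF before and after adding the two individuals. It is a NumPy-only illustration with a hypothetical helper `pairwise_vif`, and its figures need not match the VIFs printed above, which come from the columns as passed to `variance_inflation_factor`.

```python
import numpy as np

# Original 10 observations (values copied from the data above).
income10 = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260], dtype=float)
wealth10 = np.array([810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686], dtype=float)

# Append the 11th and 12th individuals.
income12 = np.append(income10, [120, 255])
wealth12 = np.append(wealth10, [3000, 920])


def pairwise_vif(x, z):
    """Correlation between two regressors and the implied VIF, 1 / (1 - r^2)."""
    r = np.corrcoef(x, z)[0, 1]
    return r, 1.0 / (1.0 - r ** 2)


r10, vif10 = pairwise_vif(income10, wealth10)
r12, vif12 = pairwise_vif(income12, wealth12)
print(f"10 observations: r = {r10:.4f}, VIF = {vif10:.2f}")
print(f"12 observations: r = {r12:.4f}, VIF = {vif12:.2f}")
```

The two new individuals lie far from the income-wealth line of the first ten, which is why they weaken the correlation.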

### Use PCA and Ridge regression to solve the problem.

In [13]:
from sklearn.preprocessing import StandardScaler                                  # performing preprocessing part
sc = StandardScaler()
 
X_1 = sc.fit_transform(X1)

from sklearn.decomposition import PCA                                             # Applying PCA function on X component 
pca = PCA(n_components = 2)
 
X_11 = pca.fit_transform(X_1)
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
principalDf = pd.DataFrame(data = X_11
             , columns = ['principal component 1', 'principal component 2'])
print(principalDf)

[0.70562458 0.29437542]
    principal component 1  principal component 2
0                2.096103               0.108740
1                1.654249               0.076009
2                1.145579               0.110094
3                0.752038               0.029049
4                0.300933               0.005570
5               -0.186150               0.018068
6               -0.604362              -0.038306
7               -0.994818              -0.122434
8               -1.472650              -0.119187
9               -1.967957              -0.098465
10              -0.629685               1.885358
11              -0.093280              -1.854496


In [14]:
finalDf = pd.concat([principalDf, y1], axis = 1)
print(finalDf)

    principal component 1  principal component 2  cons_exp
0                2.096103               0.108740        70
1                1.654249               0.076009        65
2                1.145579               0.110094        90
3                0.752038               0.029049        70
4                0.300933               0.005570       110
5               -0.186150               0.018068       115
6               -0.604362              -0.038306       120
7               -0.994818              -0.122434       140
8               -1.472650              -0.119187       155
9               -1.967957              -0.098465       150
10              -0.629685               1.885358       160
11              -0.093280              -1.854496        85


In [15]:
X2=principalDf[["principal component 1","principal component 2"]]
y2=data1["cons_exp"]

In [16]:
model= sm.OLS(y2,X2)                               #performing OLS 
results=model.fit()
print(results.summary())

                                 OLS Regression Results                                
Dep. Variable:               cons_exp   R-squared (uncentered):                   0.077
Model:                            OLS   Adj. R-squared (uncentered):             -0.107
Method:                 Least Squares   F-statistic:                             0.4192
Date:                Fri, 03 Jun 2022   Prob (F-statistic):                       0.669
Time:                        13:43:57   Log-Likelihood:                         -73.562
No. Observations:                  12   AIC:                                      151.1
Df Residuals:                      10   BIC:                                      152.1
Df Model:                           2                                                  
Covariance Type:            nonrobust                                                  
                            coef    std err          t      P>|t|      [0.025      0.975]
------------------------------

#### After applying PCA, the fitted model (estimated without an intercept, which is why the summary reports an uncentered R-squared) is
> Yi = -24.9328 * principal component 1 + 16.4513 * principal component 2
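
Principal component scores have mean zero, so a regression on them without a constant cannot reproduce the mean of the response. The sketch below is a NumPy-only illustration of principal component regression with an intercept on the 12 observations. It assumes standardisation with the population standard deviation (the `StandardScaler` default), and the signs of the component coefficients are arbitrary because eigenvectors are only defined up to sign.

```python
import numpy as np

# The 12 observations (values copied from the data above).
cons_exp = np.array([70, 65, 90, 70, 110, 115, 120, 140, 155, 150, 160, 85], dtype=float)
income = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260, 120, 255], dtype=float)
wealth = np.array([810, 1009, 1273, 1425, 1633, 1876, 2052, 2201, 2435, 2686, 3000, 920], dtype=float)

X = np.column_stack([income, wealth])
Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardised regressors

# Principal directions = eigenvectors of the correlation matrix, largest eigenvalue first.
corr = Z.T @ Z / len(Z)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Z @ eigvecs                              # principal component scores

# Regress consumption on an intercept plus the component scores.
A = np.column_stack([np.ones(len(cons_exp)), scores])
coef, *_ = np.linalg.lstsq(A, cons_exp, rcond=None)
intercept, pc_coefs = coef[0], coef[1:]

fitted = A @ coef
r2 = 1.0 - np.sum((cons_exp - fitted) ** 2) / np.sum((cons_exp - cons_exp.mean()) ** 2)
print("intercept:", intercept)
print("component coefficients:", pc_coefs)
print("R-squared:", r2)
```

When both components are kept, the fit is the same as OLS on the original regressors with a constant; the fit only changes once a component is dropped.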

### DATA SET -2

### B. The following data is given for 20 individuals. Y is the dependent variable and the rest are independent variables. A regression model is developed using the data. Check for the presence of multicollinearity and suggest all possible remedial measures. Compare the effectiveness of these remedial measures and comment.
> Y: 105,115,116,117,112,121,121,110,110,114,114,115,114,106,125,114,106,113,110,122
> X1: 47,49,49,50,51,48,49,47,49,48,47,49,50,45,52,46,46,46,48,56
> X2: 85.4,94.2,95.3,94.7,89.4,99.5,99.8,90.9,89.2,92.7,94.4,94.1,91.6,87.1,101.3,94.5,87.0,94.5,90.5,95.7
> X3: 1.75,2.10,1.98,2.01,1.89,2.25,2.25,1.90,1.83,2.07,2.07,1.98,2.05,1.92,2.19,1.98,1.87,1.90,1.88,2.09
> X4: 5.1,3.8,8.2,5.8,7.0,9.3,2.5,6.2,7.1,5.6,5.3,5.6,10.2,5.6,10.0,7.4,3.6,4.3,9.0,7.0
> X6: 63,70,72,73,72,71,69,66,69,64,74,71,68,67,76,69,62,70,71,75
> X7: 33,14,10,99,95,10,42,8,62,35,90,21,47,80,98,95,18,12,99,99


In [27]:
Dict2= {"X1": [47,49,49,50,51,48,49,47,49,48,47,49,50,45,52,46,46,46,48,56],
             "X2": [85.4,94.2,95.3,94.7,89.4,99.5,99.8,90.9,89.2,92.7,94.4,94.1,91.6,87.1,101.3,94.5,87.0,94.5,90.5,95.7],
             "X3": [1.75,2.10,1.98,2.01,1.89,2.25,2.25,1.90,1.83,2.07,2.07,1.98,2.05,1.92,2.19,1.98,1.87,1.90,1.88,2.09],
             "X4": [5.1,3.8,8.2,5.8,7.0,9.3,2.5,6.2,7.1,5.6,5.3,5.6,10.2,5.6,10.0,7.4,3.6,4.3,9.0,7.0],
             "X5": [63,70,72,73,72,71,69,66,69,64,74,71,68,67,76,69,62,70,71,75],
             "X6": [ 33,14,10,99,95,10,42,8,62,35,90,21,47,80,98,95,18,12,99,99],
             "Y": [105,115,116,117,112,121,121,110,110,114,114,115,114,106,125,114,106,113,110,122]}
data2=pd.DataFrame(Dict2)
data2.head()

   X1    X2    X3   X4  X5  X6    Y
0  47  85.4  1.75  5.1  63  33  105
1  49  94.2  2.10  3.8  70  14  115
2  49  95.3  1.98  8.2  72  10  116
3  50  94.7  2.01  5.8  73  99  117
4  51  89.4  1.89  7.0  72  95  112


In [98]:
X1=data2["X1"]
X2=data2["X2"]
X3=data2["X3"]
X4=data2["X4"]
X5=data2["X5"]
X6=data2["X6"]

In [63]:
arr=data2.to_numpy()                     #data as a NumPy array (Y is the last column)
design_mat=arr[:,:-1]                    #design matrix: columns X1..X6

### Check for the presence of multicollinearity and suggest all possible measures.

In [24]:
X_3= data2[["X1","X2","X3","X4","X5","X6"]]         #independent variable
y_3= data2[["Y"]]                                   #dependent variable

In [41]:
y_= y_3.to_numpy()                                     #target variable

In [19]:
# VIF dataframe
vif_data3 = pd.DataFrame()
vif_data3["feature"] = X_3.columns

In [20]:
# calculating VIF for each feature
vif_data3["VIF"] = [variance_inflation_factor(X_3.values, i)
                          for i in range(X_3.shape[1])]
print(vif_data3)

  feature          VIF
0      X1   536.793041
1      X2  3187.057943
2      X3  1044.246904
3      X4    12.514432
4      X5  1553.320340
5      X6     5.809743


#### Here we can clearly see that X1, X2, X3 and X5 have the highest VIFs, i.e. they are strongly collinear with one another, while the remaining covariates (X4 and X6) have comparatively low VIFs.
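
To see which variables move together, the pairwise correlation matrix can be inspected directly. A known identity is that the intercept-based VIFs are the diagonal entries of the inverse of that correlation matrix. The sketch below is a NumPy-only illustration on the six regressors, so its values need not match the VIFs printed above, which come from the columns as passed to `variance_inflation_factor`.

```python
import numpy as np

# The six regressors for the 20 individuals (values copied from the data above).
X1 = [47, 49, 49, 50, 51, 48, 49, 47, 49, 48, 47, 49, 50, 45, 52, 46, 46, 46, 48, 56]
X2 = [85.4, 94.2, 95.3, 94.7, 89.4, 99.5, 99.8, 90.9, 89.2, 92.7,
      94.4, 94.1, 91.6, 87.1, 101.3, 94.5, 87.0, 94.5, 90.5, 95.7]
X3 = [1.75, 2.10, 1.98, 2.01, 1.89, 2.25, 2.25, 1.90, 1.83, 2.07,
      2.07, 1.98, 2.05, 1.92, 2.19, 1.98, 1.87, 1.90, 1.88, 2.09]
X4 = [5.1, 3.8, 8.2, 5.8, 7.0, 9.3, 2.5, 6.2, 7.1, 5.6,
      5.3, 5.6, 10.2, 5.6, 10.0, 7.4, 3.6, 4.3, 9.0, 7.0]
X5 = [63, 70, 72, 73, 72, 71, 69, 66, 69, 64, 74, 71, 68, 67, 76, 69, 62, 70, 71, 75]
X6 = [33, 14, 10, 99, 95, 10, 42, 8, 62, 35, 90, 21, 47, 80, 98, 95, 18, 12, 99, 99]

names = ["X1", "X2", "X3", "X4", "X5", "X6"]
X = np.column_stack([X1, X2, X3, X4, X5, X6]).astype(float)

# Pairwise correlations between the regressors (columns are variables).
R = np.corrcoef(X, rowvar=False)
print(np.round(R, 3))

# Intercept-based VIFs: diagonal of the inverse correlation matrix.
vif = np.diag(np.linalg.inv(R))
for name, v in zip(names, vif):
    print(f"{name}: centred VIF = {v:.2f}")
```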




### RIDGE REGRESSION:
>  betaR = (X'X + K*I)^-1 * (X'Y)

In [57]:
X_T= design_mat.transpose()
res=np.dot(X_T,design_mat)               #X'X
K=0.01                                   #ridge constant
I=np.identity(6)
a=np.linalg.inv(res + K*I)               #(X'X + K*I)^-1
b=np.dot(X_T,y_)                         #X'Y
betaR= np.dot(a,b)                       #estimated BetaHat of Ridge regression

In [59]:
betaR.shape

(6, 1)


#### The new model is Yi = b1 * X1 + b2 * X2 + b3 * X3 + b4 * X4 + b5 * X5 + b6 * X6, where b1, ..., b6 are the entries of betaR.
#### Ridge regression does not remove the multicollinearity itself; it shrinks the coefficients so that the estimates become more stable.
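
The ridge estimator is usually written as betaR = (X'X + K*I)^-1 X'Y, and it is commonly applied to standardised regressors so that the penalty treats every variable on the same scale. The sketch below is a NumPy-only illustration under that assumption: it standardises the regressors, centres Y, and traces the coefficients over a few values of K. The grid of K values and the helper `ridge_coefs` are illustrative choices, not part of the original analysis.

```python
import numpy as np

# Data for the 20 individuals (values copied from the data above).
X = np.column_stack([
    [47, 49, 49, 50, 51, 48, 49, 47, 49, 48, 47, 49, 50, 45, 52, 46, 46, 46, 48, 56],
    [85.4, 94.2, 95.3, 94.7, 89.4, 99.5, 99.8, 90.9, 89.2, 92.7,
     94.4, 94.1, 91.6, 87.1, 101.3, 94.5, 87.0, 94.5, 90.5, 95.7],
    [1.75, 2.10, 1.98, 2.01, 1.89, 2.25, 2.25, 1.90, 1.83, 2.07,
     2.07, 1.98, 2.05, 1.92, 2.19, 1.98, 1.87, 1.90, 1.88, 2.09],
    [5.1, 3.8, 8.2, 5.8, 7.0, 9.3, 2.5, 6.2, 7.1, 5.6,
     5.3, 5.6, 10.2, 5.6, 10.0, 7.4, 3.6, 4.3, 9.0, 7.0],
    [63, 70, 72, 73, 72, 71, 69, 66, 69, 64, 74, 71, 68, 67, 76, 69, 62, 70, 71, 75],
    [33, 14, 10, 99, 95, 10, 42, 8, 62, 35, 90, 21, 47, 80, 98, 95, 18, 12, 99, 99],
]).astype(float)
Y = np.array([105, 115, 116, 117, 112, 121, 121, 110, 110, 114,
              114, 115, 114, 106, 125, 114, 106, 113, 110, 122], dtype=float)

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardised regressors
yc = Y - Y.mean()                          # centred response (absorbs the intercept)


def ridge_coefs(Z, yc, k):
    """Solve (Z'Z + k*I) beta = Z'y for the ridge coefficients."""
    p = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + k * np.eye(p), Z.T @ yc)


ks = [0.0, 0.01, 0.1, 1.0, 10.0]           # illustrative grid; k = 0 is plain OLS
norms = []
for k in ks:
    beta = ridge_coefs(Z, yc, k)
    norms.append(np.linalg.norm(beta))
    print(f"k = {k:<5}: beta = {np.round(beta, 3)}")
```

The length of the coefficient vector never increases as K grows, which is the shrinkage that stabilises the estimates. A larger K gives more stability at the cost of more bias.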

### PCA

In [75]:
from sklearn.preprocessing import StandardScaler                                  # performing preprocessing part
sc = StandardScaler()
 
X_3 = sc.fit_transform(X_3)

from sklearn.decomposition import PCA                                             # Applying PCA function on X component 
pca = PCA(n_components = 6)
 
X_33 = pca.fit_transform(X_3)
explained_variance = pca.explained_variance_ratio_
print(explained_variance)
principalDf = pd.DataFrame(data = X_33
             , columns = ['principal component 1', 'principal component 2',
                          'principal component 3','principal component 4',
                          'principal component 5','principal component 6'])
finalDf = pd.concat([principalDf, y_3], axis = 1)
print(finalDf)

pandas.core.frame.DataFrame

In [68]:
X=principalDf[["principal component 1","principal component 2","principal component 3"]]
y=y_3

In [69]:
model= sm.OLS(y,X)                               #performing OLS 
results=model.fit()
print(results.summary())

                                 OLS Regression Results                                
Dep. Variable:                      Y   R-squared (uncentered):                   0.002
Model:                            OLS   Adj. R-squared (uncentered):             -0.174
Method:                 Least Squares   F-statistic:                            0.01183
Date:                Fri, 03 Jun 2022   Prob (F-statistic):                       0.998
Time:                        15:09:26   Log-Likelihood:                         -123.10
No. Observations:                  20   AIC:                                      252.2
Df Residuals:                      17   BIC:                                      255.2
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                            coef    std err          t      P>|t|      [0.025      0.975]
------------------------------

##### After performing PCA, the new model (fitted without an intercept, which is why the summary reports an uncentered R-squared) is Yi = 2.8002 * principal component 1 + 1.5891 * principal component 2 - 0.0430 * principal component 3
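
How many components to keep can be judged from the cumulative share of variance they explain. The sketch below is a NumPy-only illustration that takes the eigenvalues of the correlation matrix of the six regressors and picks the smallest number of components reaching a chosen threshold. The 95% threshold is an assumption made for the example, not a value taken from the analysis above.

```python
import numpy as np

# The six regressors for the 20 individuals (values copied from the data above).
X = np.column_stack([
    [47, 49, 49, 50, 51, 48, 49, 47, 49, 48, 47, 49, 50, 45, 52, 46, 46, 46, 48, 56],
    [85.4, 94.2, 95.3, 94.7, 89.4, 99.5, 99.8, 90.9, 89.2, 92.7,
     94.4, 94.1, 91.6, 87.1, 101.3, 94.5, 87.0, 94.5, 90.5, 95.7],
    [1.75, 2.10, 1.98, 2.01, 1.89, 2.25, 2.25, 1.90, 1.83, 2.07,
     2.07, 1.98, 2.05, 1.92, 2.19, 1.98, 1.87, 1.90, 1.88, 2.09],
    [5.1, 3.8, 8.2, 5.8, 7.0, 9.3, 2.5, 6.2, 7.1, 5.6,
     5.3, 5.6, 10.2, 5.6, 10.0, 7.4, 3.6, 4.3, 9.0, 7.0],
    [63, 70, 72, 73, 72, 71, 69, 66, 69, 64, 74, 71, 68, 67, 76, 69, 62, 70, 71, 75],
    [33, 14, 10, 99, 95, 10, 42, 8, 62, 35, 90, 21, 47, 80, 98, 95, 18, 12, 99, 99],
]).astype(float)

# Eigenvalues of the correlation matrix = variances of the principal components
# of the standardised data, sorted from largest to smallest.
R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]

ratio = eigvals / eigvals.sum()            # explained-variance ratio per component
cumulative = np.cumsum(ratio)

threshold = 0.95                           # assumed cut-off for this illustration
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print("explained variance ratio:", np.round(ratio, 4))
print("cumulative:              ", np.round(cumulative, 4))
print("components needed for", threshold, "of the variance:", n_keep)
```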





Submitted by,

> Soumitro Mukherjee


Reg. No. : 213001818010030, Roll No. : 30018021030