## Why PCA
- It is often useful to describe data in terms of its principal components rather than its original axes (the raw features)
- So what are principal components?
- They are the directions of greatest variance in the data, which capture its underlying structure

## Curse of dimensionality
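- High-dimensional data is hard to process: as the number of features grows, the data becomes sparse, distances between points become less informative, and models need many more samples to generalize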
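
One commonly cited symptom of high dimensionality is that the nearest and farthest neighbours of a point end up at almost the same distance. The following is a minimal sketch on uniformly random synthetic data (not the Boston data); `distance_contrast` is a hypothetical helper name introduced only for this illustration.

```python
import numpy as np

def distance_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Relative gap between the farthest and nearest point from one random query."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))   # uniform samples in the unit cube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the number of dimensions grows
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(distance_contrast(500, n_dims), 3))
```

A small contrast means that "near" and "far" are hard to tell apart, which is one reason distance-based methods tend to degrade as features are added.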

## What is PCA
- The process of computing the principal components of a dataset
- Used in exploratory data analysis (EDA) and as a preprocessing step for predictive models
- Dimensionality reduction by projecting each data point onto only the first few principal components
- The goal is to obtain lower-dimensional data while preserving as much of the data's variance as possible

## Steps for PCA
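- Standardize (or normalize) the data
- Compute the covariance matrix
- Calculate the eigenvectors and eigenvalues of the covariance matrix
- Compute the principal components (the eigenvectors sorted by decreasing eigenvalue)
- Reduce dimensions by keeping only the first few components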
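
The steps above can be sketched directly in NumPy. This is an illustrative version on synthetic data, not the implementation used by `sklearn.decomposition.PCA` (which relies on an SVD); `pca_from_scratch` and the `toy` array are names assumed for this example.

```python
import numpy as np

def pca_from_scratch(data: np.ndarray, n_components: int):
    # 1. Standardize each feature to zero mean and unit variance
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    # 2. Covariance matrix of the standardized features (columns are variables)
    cov = np.cov(standardized, rowvar=False)
    # 3. Eigen-decomposition; eigh is meant for symmetric matrices
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Sort by decreasing eigenvalue to get the principal components
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    # 5. Keep the first few components and project the data onto them
    projected = standardized @ eigenvectors[:, :n_components]
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return projected, explained_ratio

rng = np.random.default_rng(0)
toy = rng.normal(size=(200, 5))
toy[:, 1] = 2 * toy[:, 0] + 0.1 * rng.normal(size=200)  # make two features correlated

projected, ratio = pca_from_scratch(toy, n_components=2)
print(projected.shape)
print(ratio)
```

The signs of the projected columns may differ from scikit-learn's output, since an eigenvector is only defined up to sign.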

In [2]:
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats as stats
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import het_breuschpagan
from scipy.special import boxcox,boxcox1p, inv_boxcox
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [5]:
df = pd.read_csv('Boston.csv')
# Drop the first column, which only holds the row index saved with the CSV
df.drop(df.columns[0], axis=1, inplace=True)

In [6]:
df.head()

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


In [7]:
# Features: the first 13 columns; target: medv (the last column)
x = df.iloc[:, 0:13]
y = df.iloc[:, 13]

In [8]:
x_train,x_test,y_train,y_test = train_test_split(x,y,train_size = 0.8,random_state=100)

In [9]:
model_lin = LinearRegression().fit(x_train,y_train)

In [10]:
pred_lin = model_lin.predict(x_test)

In [11]:
r2_score(y_test,pred_lin)

0.7555033086871299

## Performing PCA

In [12]:
x_scaled = StandardScaler().fit_transform(x)

In [13]:
x_scaled

array([[-0.41978194,  0.28482986, -1.2879095 , ..., -1.45900038,
         0.44105193, -1.0755623 ],
       [-0.41733926, -0.48772236, -0.59338101, ..., -0.30309415,
         0.44105193, -0.49243937],
       [-0.41734159, -0.48772236, -0.59338101, ..., -0.30309415,
         0.39642699, -1.2087274 ],
       ...,
       [-0.41344658, -0.48772236,  0.11573841, ...,  1.17646583,
         0.44105193, -0.98304761],
       [-0.40776407, -0.48772236,  0.11573841, ...,  1.17646583,
         0.4032249 , -0.86530163],
       [-0.41500016, -0.48772236,  0.11573841, ...,  1.17646583,
         0.44105193, -0.66905833]])

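`n_components` can be specified in two ways:
- Variance: a float between 0 and 1 keeps the smallest number of components whose cumulative explained variance reaches that fraction
- Number of features: an integer keeps exactly that many components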
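
To illustrate the difference between passing an integer and passing a float to `n_components`, here is a small sketch on synthetic data (the `toy` array is an assumption for this example, not the Boston data).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 3))
# Six features: three independent ones plus a noisy copy of each
toy = np.hstack([base, base + 0.05 * rng.normal(size=(300, 3))])
toy_scaled = StandardScaler().fit_transform(toy)

# Integer: keep exactly two components
pca_by_count = PCA(n_components=2).fit(toy_scaled)
# Float in (0, 1): keep as many components as needed to explain 95% of the variance
pca_by_variance = PCA(n_components=0.95).fit(toy_scaled)

print(pca_by_count.n_components_)
print(pca_by_variance.n_components_)
print(pca_by_variance.explained_variance_ratio_.sum())
```

After fitting, the fitted attribute `n_components_` reports how many components were actually kept.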

In [14]:
pca = PCA(n_components=0.95)

In [15]:
x_pca = pca.fit_transform(x_scaled)

In [16]:
x_pca

array([[-2.09829747,  0.77311275,  0.34294273, ...,  0.31864075,
         0.2958318 , -0.42493682],
       [-1.45725167,  0.59198521, -0.69519931, ...,  0.55386126,
        -0.22366994, -0.16696207],
       [-2.07459756,  0.5996394 ,  0.1671216 , ...,  0.48455996,
         0.10516613,  0.06977513],
       ...,
       [-0.31236047,  1.15524644, -0.40859759, ...,  0.29411936,
        -0.63866037,  0.98103226],
       [-0.27051907,  1.04136158, -0.58545406, ...,  0.27159707,
        -0.57934447,  0.9367553 ],
       [-0.12580322,  0.76197805, -1.294882  , ...,  0.17530965,
        -0.13338197,  0.85468922]])

In [21]:
print('shape of x_scaled: {}\nshape of x_pca: {}'.format(x_scaled.shape,x_pca.shape))

shape of x_scaled: (506, 13)
shape of x_pca: (506, 9)


In [22]:
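# Split the PCA-transformed features; splitting the original x here just reproduces the baseline model.
# The score recorded below is identical to the baseline, which suggests it came from a run that split x; re-run to get the PCA score.
x_train,x_test,y_train,y_test = train_test_split(x_pca,y,train_size = 0.8,random_state=100)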

In [23]:
model_pca = LinearRegression().fit(x_train,y_train)

In [24]:
pred_pca = model_pca.predict(x_test)

In [25]:
r2_score(y_test,pred_pca)

0.7555033086871299
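
An alternative arrangement, sketched here under the assumption that test data should not influence the preprocessing, is to chain scaling, PCA and regression in a scikit-learn pipeline so that the scaler and the PCA are fitted on the training split only. The notebook itself scales and transforms the full `x` before splitting; the data below is synthetic (`make_regression`), so the score says nothing about the Boston data.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 13 features, like the Boston feature matrix
toy_x, toy_y = make_regression(n_samples=500, n_features=13, n_informative=8,
                               noise=10.0, random_state=100)
x_tr, x_te, y_tr, y_te = train_test_split(toy_x, toy_y, train_size=0.8,
                                          random_state=100)

# Scaler and PCA are fitted inside the pipeline, using training rows only
pca_regression = make_pipeline(StandardScaler(), PCA(n_components=0.95),
                               LinearRegression())
pca_regression.fit(x_tr, y_tr)

toy_r2 = r2_score(y_te, pca_regression.predict(x_te))
n_kept = pca_regression.named_steps['pca'].n_components_
print(n_kept, toy_r2)
```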

In [29]:
pca.explained_variance_ratio_

array([0.47129606, 0.11025193, 0.0955859 , 0.06596732, 0.06421661,
       0.05056978, 0.04118124, 0.03046902, 0.02130333])

In [28]:
np.sum(pca.explained_variance_ratio_)

0.9508411978679069
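
The cumulative sum of these ratios shows how many components are needed to reach a given variance threshold. A short sketch using the values printed above (copied by hand, so rounded to eight decimals):

```python
import numpy as np

# Explained variance ratios copied from the output above
ratios = np.array([0.47129606, 0.11025193, 0.0955859, 0.06596732, 0.06421661,
                   0.05056978, 0.04118124, 0.03046902, 0.02130333])

cumulative = np.cumsum(ratios)
threshold = 0.95
# Index of the first cumulative value that reaches the threshold, plus one
n_needed = int(np.searchsorted(cumulative, threshold) + 1)

print(cumulative.round(4))
print(n_needed)  # 9, matching the shape (506, 9) of x_pca
```

The first component alone accounts for about 47% of the variance, and each later component adds progressively less.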

---

In [32]:
df = pd.read_csv('4. NSSO68 data set(1).csv')
df.drop(df.columns[0],axis=1,inplace = True)

In [33]:
df.head()

Unnamed: 0,grp,Round_Centre,FSU_number,Round,Schedule_Number,Sample,Sector,state,State_Region,District,...,pickle_v,sauce_jam_v,Othrprocessed_v,Beveragestotal_v,foodtotal_v,foodtotal_q,state_1,Region,fruits_df_tt_v,fv_tot
0,4.099999999999999e+31,1,41000,68,10,1,2,24,242,7,...,0.0,0.0,0.0,0.0,1141.4924,30.942394,GUJ,2,12.0,154.18
1,4.099999999999999e+31,1,41000,68,10,1,2,24,242,7,...,0.0,0.0,0.0,17.5,1244.5535,29.286153,GUJ,2,333.0,484.95
2,4.099999999999999e+31,1,41000,68,10,1,2,24,242,7,...,0.0,0.0,0.0,0.0,1050.3154,31.527046,GUJ,2,35.0,214.84
3,4.099999999999999e+31,1,41000,68,10,1,2,24,242,7,...,0.0,0.0,0.0,33.333333,1142.591667,27.834607,GUJ,2,168.333333,302.3
4,4.099999999999999e+31,1,41000,68,10,1,2,24,242,7,...,0.0,0.0,0.0,75.0,945.2495,27.600713,GUJ,2,15.0,148.0


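- MPCE: monthly per capita consumer expenditure
- MRP: mixed reference period (a 365-day recall for a few infrequently purchased items, a 30-day recall for the rest)
- URP: uniform reference period (a 30-day recall for all items)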

In [34]:
df.columns.get_loc('MPCE_MRP')

50

In [37]:
df.drop(df.dtypes[df.dtypes == 'object'].index.tolist(),axis=1,inplace=True)

In [43]:
# Drop every column that contains at least one missing value
df.dropna(axis=1, inplace=True)
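
The two cleaning steps above (dropping text columns, then dropping columns with missing values) can be checked on a tiny made-up frame. The column names are borrowed from the survey data, but the values are invented, and `select_dtypes(include="number")` is used here as a stand-in that keeps numeric columns rather than excluding `object` ones.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "state_1": ["GUJ", "GUJ", "MH"],            # text column
    "foodtotal_v": [1141.49, 1244.55, 1050.32],
    "pickle_v": [0.0, np.nan, 0.0],             # contains a missing value
    "Sector": [2, 2, 1],
})

numeric_only = toy.select_dtypes(include="number")  # drops the text column
complete = numeric_only.dropna(axis=1)              # drops columns with any NaN

print(list(complete.columns))
```

Dropping whole columns is a blunt choice; it is reasonable here only because the frame has many columns and the goal is a purely numeric matrix for PCA.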

In [3]:
df.shape


In [2]:
# Correlation of every column with MPCE_MRP, keeping only those above 0.3
corr_mpce = df.corr()['MPCE_MRP']
corr_mpce[corr_mpce > 0.3]
