  <tr>
        <td>
            <div align="left">
                <font size=25px>
                    <b>Dimension Reduction Techniques (PCA)
                    </b>
                </font>
            </div>
        </td>
    </tr>

## Problem Statement:
A key challenge for property sellers is determining the sale price of a property. The ability to predict property values accurately benefits investors as well as buyers, who can plan their finances according to the price trend. Property prices depend on a number of features such as the property area, basement square footage, year built, number of bedrooms, and others. With many, often correlated, predictors a model becomes harder to fit and interpret, so dimension reduction techniques are applied to decrease the number of predictors.

<a id='import_packages'></a>
## 1. Import Packages

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA


<a id='Read_Data'></a>
## 2. Read the Data

In [None]:
raw_data = pd.read_csv('houseprice.csv')
raw_data.head()

,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


<a id='data_preparation'></a>
## 3. Understand and Prepare the Data

#### Change index column

The first column in the data contains a unique number for each observation. We can use this column as the index.

In [None]:

raw_data = pd.read_csv('houseprice.csv', index_col=0)
raw_data.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,,,,0,12,2008,WD,Normal,250000


<a id='Data_Types'></a>
## 3.1 Data Types and Dimensions

In [None]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

In [None]:
print(raw_data.shape)

(1460, 80)


**We see the dataframe has 80 columns and 1460 observations**

In [None]:
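# MSSubClass, OverallQual and OverallCond are stored as integers but encode categories,
# so cast them to 'object' to treat them as categorical rather than numeric
for feature in ['MSSubClass','OverallQual','OverallCond']:
    raw_data[feature] = raw_data[feature].astype('object')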

In [None]:
raw_data.dtypes

MSSubClass        object
MSZoning          object
LotFrontage      float64
LotArea            int64
Street            object
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 80, dtype: object

<a id='Feature_Engineering'></a>
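## 3.2 Feature Engineering

We derive two age features from the construction and remodel years: the age of the building and the time since the last remodel, both measured relative to the current year.
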
In [None]:
current_year = int(dt.datetime.now().year)

In [None]:
Buiding_age = current_year - raw_data.YearBuilt
Remodel_age = current_year - raw_data.YearRemodAdd

In [None]:
raw_data['Buiding_age'] = Buiding_age
raw_data['Remodel_age'] = Remodel_age

In [None]:
raw_data.head()

Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,Buiding_age,Remodel_age
1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,,,0,2,2008,WD,Normal,208500,20,20
2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,,,0,5,2007,WD,Normal,181500,47,47
3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,,,0,9,2008,WD,Normal,223500,22,21
4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,,,0,2,2006,WD,Abnorml,140000,108,53
5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,,,0,12,2008,WD,Normal,250000,23,23


In [None]:
raw_data.shape

(1460, 82)

**We see the dataframe has 82 columns and 1460 observations**
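
One property of these age features is worth keeping in mind: each is a constant minus a year column, so it is perfectly negatively correlated with that column, and after standardization it is simply the negated year column. The sketch below illustrates this on a toy frame; the years are illustrative values and 2023 is only a stand-in for `current_year`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with illustrative construction years (not the full dataset)
toy = pd.DataFrame({'YearBuilt': [2003, 1976, 2001, 1915, 2000]})

# Same construction as in the notebook; 2023 is an assumed stand-in for current_year
toy['Buiding_age'] = 2023 - toy['YearBuilt']

# Correlation between the year and the derived age
corr = toy['YearBuilt'].corr(toy['Buiding_age'])

# After standardization the age column is the negated year column
scaled = StandardScaler().fit_transform(toy)
is_negated = np.allclose(scaled[:, 0], -scaled[:, 1])

print(corr)
print(is_negated)
```

For PCA this means the age column adds no new direction of variation beyond the year column it was derived from. An alternative some analyses prefer is the age at the time of sale (`YrSold - YearBuilt`), which is not a pure relabeling of `YearBuilt`; that is a design choice rather than something this notebook does.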

<a id='Missing_Data_Treatment'></a>
## 3.3 Missing Data Treatment
PCA relies on matrix operations that cannot be carried out on data containing null values, so the missing values must be handled first.
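
As a small illustration of why this step is needed, the sketch below fits scikit-learn's `PCA` on a toy array that contains one missing value. The array holds made-up numbers, not the house price data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data (hypothetical values) with a single missing entry
X_toy = np.array([[1.0, 2.0],
                  [3.0, np.nan],
                  [5.0, 6.0],
                  [7.0, 9.0]])

try:
    PCA(n_components=1).fit(X_toy)
    pca_accepts_nan = True
except ValueError as err:
    # scikit-learn validates its input and refuses arrays that contain NaN
    pca_accepts_nan = False
    print("PCA failed:", err)
```

The fit is rejected during input validation, before any decomposition is attempted.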

In [None]:

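# Number of missing values per column, largest first
Total = raw_data.isnull().sum().sort_values(ascending=False)

# Percentage of missing values per column
Percent = (raw_data.isnull().sum()*100/raw_data.isnull().count()).sort_values(ascending=False)

# Combine both into one summary table
missing_data = pd.concat([Total, Percent], axis=1, keys=['Total', 'Percent'])
missing_data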

,Total,Percent
PoolQC,1453,99.520548
MiscFeature,1406,96.301370
Alley,1369,93.767123
Fence,1179,80.753425
FireplaceQu,690,47.260274
...,...,...
Foundation,0,0.000000
ExterCond,0,0.000000
ExterQual,0,0.000000
Exterior2nd,0,0.000000


**Replace the 'NA' values with their actual meaning as per the data definition**

In [None]:

raw_data['Alley'] = raw_data['Alley'].fillna('No alley access')

In [None]:

raw_data['MasVnrType'] = raw_data['MasVnrType'].fillna('None')

In [None]:

for col in ['BsmtQual','BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2']:
    raw_data[col] = raw_data[col].fillna('No Basement')

In [None]:

raw_data['Electrical'] = raw_data['Electrical'].fillna('SBrkr')

In [None]:

raw_data['FireplaceQu'] = raw_data['FireplaceQu'].fillna('No Fireplace')

In [None]:

for col in ['GarageType','GarageFinish','GarageQual','GarageCond']:
    raw_data[col] = raw_data[col].fillna('No Garage')

In [None]:

raw_data['PoolQC'] = raw_data['PoolQC'].fillna('No Pool')

In [None]:

raw_data['Fence'] = raw_data['Fence'].fillna('No Fence')

In [None]:

raw_data['MiscFeature'] = raw_data['MiscFeature'].fillna('None')

In [None]:

raw_data['LotFrontage'] = raw_data['LotFrontage'].fillna(raw_data['LotFrontage'].median())

In [None]:

raw_data['MasVnrArea'] = raw_data['MasVnrArea'].fillna(0)

In [None]:

raw_data['GarageYrBlt'] = raw_data['GarageYrBlt'].fillna(0)
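
The column-by-column fills above can also be collected in one place, since `DataFrame.fillna` accepts a dictionary that maps column names to fill values. The sketch below shows the pattern on a toy frame with made-up rows; only three of the columns are included, and the name `fill_values` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical rows) with the same kinds of gaps as the real data
toy = pd.DataFrame({
    'Alley': [np.nan, 'Grvl', np.nan],
    'PoolQC': [np.nan, np.nan, 'Gd'],
    'LotFrontage': [65.0, np.nan, 80.0],
})

# One fill value per column: a label for "feature absent", or the column median
fill_values = {
    'Alley': 'No alley access',
    'PoolQC': 'No Pool',
    'LotFrontage': toy['LotFrontage'].median(),
}

# Assigning the result back avoids relying on in-place modification of a column
toy = toy.fillna(value=fill_values)
print(toy)
```

Keeping the mapping in a single dictionary makes it easier to review the fill rules against the data definition.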

**After replacing the null values, check one final time that none remain**

In [None]:

raw_data.isnull().any().sum()

0

<a id='Scale'></a>
### 4.2 Scale the Data
Variables such as 'YearBuilt', 'MasVnrArea', and 'OpenPorchSF' have different value ranges. We scale the variables so that they are all on a comparable scale. This prevents some features from dominating solely because they tend to have larger values than others.
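
Standardization here means subtracting each column's mean and dividing by its standard deviation, z = (x - mean) / std. The sketch below checks this against `StandardScaler` on a small toy array; the numbers are illustrative and the names `X_toy`, `manual` and `scaled` are introduced only for this example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy columns on very different scales (illustrative values)
X_toy = np.array([[8450.0, 2003.0],
                  [9600.0, 1976.0],
                  [11250.0, 2001.0],
                  [9550.0, 1915.0]])

# Manual z-score; np.std uses the population formula (ddof=0) by default,
# which is the convention StandardScaler follows
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

scaled = StandardScaler().fit_transform(X_toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, every column has mean 0 and standard deviation 1, so no column dominates the principal components because of its units.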


In [None]:

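# Assumption: df_num is not defined in the cells shown here. It is taken to be the
# numeric predictors without the target 'SalePrice', which matches the 35 columns printed below.
df_num = raw_data.select_dtypes(include=np.number).drop(columns='SalePrice')

df_num_std = StandardScaler().fit_transform(df_num)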

print(df_num_std)

[[-0.22087509 -0.20714171  1.05099379 ...  0.13877749 -1.05099379
  -0.87866809]
 [ 0.46031974 -0.09188637  0.15673371 ... -0.61443862 -0.15673371
   0.42957697]
 [-0.08463612  0.07347998  0.9847523  ...  0.13877749 -0.9847523
  -0.83021457]
 ...
 [-0.1754621  -0.14781027 -1.00249232 ...  1.64520971  1.00249232
  -1.02402865]
 [-0.08463612 -0.08016039 -0.70440562 ...  1.64520971  0.70440562
  -0.53949344]
 [ 0.23325479 -0.05811155 -0.20759447 ...  0.13877749  0.20759447
   0.96256569]]


In [None]:

print(df_num_std.shape)

(1460, 35)


<a id='pcafunction'></a>
## 5. PCA using sklearn


In [None]:

pca = PCA(n_components=5, random_state=0)
PrincipalComponents = pca.fit_transform(df_num_std)

In [None]:

PCA_df = pd.DataFrame(data = PrincipalComponents, columns = ['PC1', 'PC2','PC3','PC4','PC5'])

PCA_df.head()

,PC1,PC2,PC3,PC4,PC5
0,1.571352,-0.240622,-1.586777,-2.241474,0.753565
1,0.199709,-0.835527,1.042285,0.089567,-0.634687
2,1.741028,-0.25054,-1.38954,-1.627933,0.038217
3,-1.470503,1.87147,1.526313,-0.195221,-1.759654
4,4.216874,1.125024,-0.532845,-1.588267,-0.391245


**Here, we used scikit-learn's built-in PCA class to perform dimension reduction and obtained a new dataset with 5 dimensions**
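
To see how much of the variance a given number of components retains, a fitted `PCA` object exposes `explained_variance_ratio_`. The sketch below uses synthetic data as a stand-in for `df_num_std`, so the printed numbers say nothing about the house price data; the names `X_toy`, `X_std` and `pca_toy` are introduced only for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for df_num_std: 10 correlated columns built from 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X_toy = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))
X_std = StandardScaler().fit_transform(X_toy)

pca_toy = PCA(n_components=5, random_state=0).fit(X_std)

# Share of the total variance captured by each component, in decreasing order
ratios = pca_toy.explained_variance_ratio_

# Running total: variance retained by the first k components
cumulative = np.cumsum(ratios)

print(ratios.round(3))
print(cumulative.round(3))
```

On the notebook's data, the same attribute would be read from the fitted `pca` object; the last cumulative value is the fraction of variance kept by the five components.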

<a id='Conclusion'></a>
## 6. Conclusion

We performed PCA using scikit-learn's built-in `PCA` class and reduced the dimension of the numerical variables from 35 to 5. The principal components are ordered by the variance they capture, so the first five retain the largest share of variance that any five linear components can; the exact share can be read from `pca.explained_variance_ratio_`.
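
If the number of components should follow from a variance target rather than being fixed at 5, scikit-learn's `PCA` also accepts a fraction between 0 and 1 for `n_components`. The sketch below runs on synthetic data, and the 0.90 target is an arbitrary illustrative choice, not a recommendation for this dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized numeric data
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X_toy = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))
X_std = StandardScaler().fit_transform(X_toy)

# A fractional n_components keeps as many components as needed to exceed that
# share of explained variance; svd_solver='full' is set explicitly for this mode
pca_auto = PCA(n_components=0.90, svd_solver='full').fit(X_std)

kept = pca_auto.n_components_
retained = pca_auto.explained_variance_ratio_.sum()
print(kept, round(retained, 3))
```

This trades a fixed output dimension for a guaranteed minimum of retained variance, which can be the more natural requirement when the goal is to lose little information.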