[back](./03C-association-and-correlation.ipynb)

---
## `Dimensionality Reduction`

A very common issue with high-dimensional data, that is, data with a lot of **dimensions** (features), is known as the _curse of dimensionality_.

> _curse of dimensionality:_
>
> As the number of features or dimensions grows, the amount of data that we need to generalize accurately grows exponentially

So, if we have more features, meaning more columns in our DataFrame, then we need many more data points to counter the sparsity that comes with the growth of the feature space.
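
To make the exponential growth concrete, here is a small back-of-the-envelope sketch. It assumes, purely for illustration, that we split each feature's range into 10 bins and want at least one data point in every cell of the resulting grid; `bins_per_feature` is a made-up number, not something taken from the dataset.

```python
# Hypothetical illustration: cover each feature's range with 10 bins.
# The number of grid cells (and so the minimum number of points needed
# to put one point in every cell) is bins_per_feature ** n_features.
bins_per_feature = 10

cells_needed = {d: bins_per_feature ** d for d in (1, 2, 3, 7)}

for n_features, cells in cells_needed.items():
    print(f"{n_features} feature(s): {cells:,} cells to cover")
```

With 7 features, the same coverage that took 10 points in one dimension would take ten million.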

That's why we want to do our best to reduce the number of features. `PCA` - **Principal Component Analysis** is one of the most widely used techniques in data science for this.

Implementing it in code is easy. Behind the scenes, we can think of `PCA` as a **compressor**.

Say we start with about 100 features and compress them so that, in the end, we get 10 new features. Each of the 10 new features is a weighted combination of the originals: some weight on _original feature one_, some weight on _original feature two_, some weight on _original feature three_, and so forth.

What we lose is interpretability. In the previous example, _number of bedrooms_ is easy to understand as a feature, and so are the number of bathrooms or squareFootage.

But if _new feature number one_ is, loosely speaking, 20% bathrooms, 30% bedrooms and the rest squareFootage, then it is not so easy to say what it means. On the flip side, we can avoid the _curse of dimensionality_ and really compress our feature space.
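
The "weighted combination" idea can be checked directly. The sketch below uses synthetic random data as a stand-in (`X_demo` and `pca_demo` are illustrative names, not part of this notebook) and shows that each new feature is the centered original data multiplied by that component's weights. Note that the weights are not literal percentages: they can be negative and do not sum to 1.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data: 200 rows, 5 original features (an assumption
# for illustration; the housing data is not used here).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))

pca_demo = PCA(n_components=2)
Z = pca_demo.fit_transform(X_demo)  # the 2 new features

# Rebuild the new features by hand: center the data, then take a
# weighted sum of the original columns using each component's weights.
Z_manual = (X_demo - pca_demo.mean_) @ pca_demo.components_.T

print(Z.shape)
print(np.allclose(Z, Z_manual))
```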

### `Initial Setup`

In [1]:
# Imports
import pandas as pd
import numpy as np
import seaborn as sns  # Library used to print nicer charts and visualizations
import matplotlib.pyplot as plt

%matplotlib inline


In [2]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split


In [3]:
df = pd.read_csv(r'../../assets/single_family_home_values.csv')
df.head(4)


| | id | address | city | state | zipcode | latitude | longitude | bedrooms | bathrooms | rooms | squareFootage | lotSize | yearBuilt | lastSaleDate | lastSaleAmount | priorSaleDate | priorSaleAmount | estimated_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39525749 | 8171 E 84th Ave | Denver | CO | 80022 | 39.84916 | -104.893468 | 3 | 2.0 | 6 | 1378 | 9968 | 2003.0 | 2009-12-17 | 75000 | 2004-05-13 | 165700.0 | 239753 |
| 1 | 184578398 | 10556 Wheeling St | Denver | CO | 80022 | 39.88802 | -104.83093 | 2 | 2.0 | 6 | 1653 | 6970 | 2004.0 | 2004-09-23 | 216935 | | | 343963 |
| 2 | 184430015 | 3190 Wadsworth Blvd | Denver | CO | 80033 | 39.76171 | -105.08107 | 3 | 1.0 | 0 | 1882 | 23875 | 1917.0 | 2008-04-03 | 330000 | | | 488840 |
| 3 | 155129946 | 3040 Wadsworth Blvd | Denver | CO | 80033 | 39.76078 | -105.08106 | 4 | 3.0 | 0 | 2400 | 11500 | 1956.0 | 2008-12-02 | 185000 | 2008-06-27 | 0.0 | 494073 |


In [4]:
# X holds the selected feature columns; estimated_value (the target) is excluded
X = df[['bedrooms', 'bathrooms', 'rooms', 'squareFootage',
        'lotSize', 'yearBuilt', 'priorSaleAmount']]
X = X.fillna(0)  # replace missing values with 0
X.head()


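| | bedrooms | bathrooms | rooms | squareFootage | lotSize | yearBuilt | priorSaleAmount |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 2.0 | 6 | 1378 | 9968 | 2003.0 | 165700.0 |
| 1 | 2 | 2.0 | 6 | 1653 | 6970 | 2004.0 | 0.0 |
| 2 | 3 | 1.0 | 0 | 1882 | 23875 | 1917.0 | 0.0 |
| 3 | 4 | 3.0 | 0 | 2400 | 11500 | 1956.0 | 0.0 |
| 4 | 3 | 4.0 | 8 | 2305 | 5600 | 1998.0 | 0.0 |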


In [5]:
y = df.estimated_value
y

0         239753
1         343963
2         488840
3         494073
4         513676
          ...   
14995    1080081
14996     807306
14997    1737156
14998    2008794
14999    1421401
Name: estimated_value, Length: 15000, dtype: int64

### `PCA - Principal Component Analysis`

In [6]:
from sklearn.decomposition import PCA


In [7]:
pca = PCA(4)

We are declaring that we want **4 components**. We usually don't know the right number at the beginning, so we may have to do some analysis, such as the elbow approach, to decide where to stop.

The new feature set has to be smaller than the original feature set. Sometimes this takes a lot of trial and error; other times we can look at an elbow plot to see how much additional explained variance each new component contributes _(as we increase the count)_.

That is where we can customize and decide which number is best.
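
One way to do that analysis is to fit `PCA` with all components and look at the cumulative `explained_variance_ratio_`. The sketch below is only an illustration: it runs on synthetic data with made-up column scales (`X_demo`), and the 0.95 `threshold` is an arbitrary choice, not a recommendation from this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in: 7 features with very different scales (assumed values).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 7)) * np.array([1, 1, 2, 500, 5000, 30, 100000])

# Fit with all components so we can see how the variance accumulates.
pca_full = PCA()
pca_full.fit(X_demo)

cumulative = np.cumsum(pca_full.explained_variance_ratio_)
for k, share in enumerate(cumulative, start=1):
    print(f"{k} component(s): {share:.4f} of the variance")

# Smallest number of components whose cumulative share reaches the threshold.
threshold = 0.95  # arbitrary cut-off for this sketch
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print("components needed:", n_components)
```

Plotting `cumulative` against the component count gives the elbow plot mentioned above; the point where the curve flattens is a reasonable place to stop.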

Now we have `X`, our initial feature set, with 15000 rows and 7 columns. Let's take a look at how we can use `PCA` to compress it and get the new feature set.

In [8]:
X.shape

(15000, 7)

In [9]:
X_transformed = pca.fit_transform(X) # This will fit and transform

In [10]:
X_transformed.shape

(15000, 4)

We can see that we keep the same number of rows, but we have the columns _(features)_ reduced to 4.

Now we can use a very useful attribute of the fitted `PCA` object, **.components_**, which gives the weights that define each component.

In [11]:
pca.components_

array([[ 4.34835866e-07,  1.39033126e-06,  1.76645671e-06,
         9.91884229e-04,  1.22556479e-03,  8.13159056e-06,
         9.99998757e-01],
       [ 4.59899754e-05,  8.88602690e-05,  1.02614970e-04,
         9.51591022e-02,  9.95457158e-01,  2.83604230e-03,
        -1.31440908e-03],
       [-7.41279240e-04, -1.10361769e-03, -1.93477104e-03,
        -9.95458475e-01,  9.51576074e-02,  1.08902953e-03,
         8.70755249e-04],
       [ 2.87494377e-03, -3.44585965e-03,  6.23055904e-03,
        -8.24565322e-04,  2.92725018e-03, -9.99965895e-01,
         5.35419321e-06]])

In [12]:
# The shape of .components_
pca.components_.shape

(4, 7)

The shape of **.components_** is 4x7, meaning we have 4 components and each of them is built from the original 7 features.

Let's look at the first component, the first new feature that we have after the transformation.

In [13]:
# The first new feature
pca.components_[0]

array([4.34835866e-07, 1.39033126e-06, 1.76645671e-06, 9.91884229e-04,
       1.22556479e-03, 8.13159056e-06, 9.99998757e-01])

We see that this new component has **seven entries**: these are the weights that `PCA` assigned to the original features.

That is, element 0 is the weight assigned to original feature 0, and so on. Here almost all of the weight (about 0.99999) sits on the last feature, `priorSaleAmount`.
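
Raw arrays of weights are hard to read, so it can help to label them with the feature names. The sketch below does this on synthetic data that only mimics the column names and rough scales of `X` (the `scales` values are assumptions, and `weights` and `dominant` are illustrative names), so the numbers will not match the output above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

feature_names = ['bedrooms', 'bathrooms', 'rooms', 'squareFootage',
                 'lotSize', 'yearBuilt', 'priorSaleAmount']

# Synthetic stand-in with assumed per-column scales (not the real data).
rng = np.random.default_rng(0)
scales = np.array([1, 1, 2, 800, 5000, 30, 150000])
X_demo = pd.DataFrame(rng.normal(size=(500, 7)) * scales, columns=feature_names)

pca_demo = PCA(n_components=4).fit(X_demo)

# One row per component, one labelled column per original feature.
weights = pd.DataFrame(pca_demo.components_,
                       columns=feature_names,
                       index=[f"PC{i + 1}" for i in range(4)])
print(weights.round(4))

# The original feature with the largest absolute weight in each component.
dominant = weights.abs().idxmax(axis=1)
print(dominant)
```

When the features are on very different scales and not standardized, each component tends to be dominated by a single large-scale feature. The output above shows the same pattern, with the components leaning on `priorSaleAmount`, `lotSize`, `squareFootage` and `yearBuilt` in turn.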

In [14]:
lg = LinearRegression()

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y)

In [16]:
lg.fit(X_train, y_train)

LinearRegression()

In [17]:
lg.score(X_test, y_test)

0.7348404575064864

Compared with the previous result from LinearRegression on the untransformed features, **0.7580178906271744**, the score on this **transformed feature-set** is almost the same.

We could say that we did not actually need `PCA` here, as we only have 7 features, which is really not a lot.

But if we were using our original dataset, which has many more fields (excluding the prediction variable), then maybe `PCA` could have been a better choice.
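
Because `train_test_split` shuffles randomly, two scores from different runs are not strictly comparable. A fairer comparison fits both models on the same split. The sketch below shows one way to set that up, on synthetic data and with an arbitrary `random_state=42`; it wraps `PCA` and `LinearRegression` in a pipeline so that `PCA` is fitted on the training rows only. It is an illustration, not the procedure used for the numbers above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic regression problem with 7 features (stand-in for the housing data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 7))
y_demo = X_demo @ rng.normal(size=7) + rng.normal(scale=0.5, size=1000)

# Fixing random_state gives both models exactly the same train/test rows.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, random_state=42)

# Baseline: all 7 original features.
baseline = LinearRegression().fit(X_train, y_train)

# Reduced: PCA to 4 components, then the same regression.
reduced = make_pipeline(PCA(n_components=4), LinearRegression())
reduced.fit(X_train, y_train)

baseline_r2 = baseline.score(X_test, y_test)
reduced_r2 = reduced.score(X_test, y_test)
print(f"all features: R^2 = {baseline_r2:.3f}")
print(f"4 components: R^2 = {reduced_r2:.3f}")
```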

### `Conclusion`

---
[next]()