# Implementation of PCA (Principal Component Analysis)

This notebook contains the experiments shown in the Udemy course "Practical Recommender Systems For Business Applications", plus some additions I deemed relevant.

Here is a link to the course lecture: https://www.udemy.com/course/practical-recommender-systems-for-business-applications/learn/lecture/30386228#content

In [1]:
%matplotlib inline

import matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Loading the IRIS Dataset

In [7]:
iris = sns.load_dataset('iris')
iris.head()

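|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|--------------|-------------|--------------|-------------|---------|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |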


# PCA only works on Numerical Variables!

The `species` column is categorical, so we will run PCA only on the four numerical columns:

$$
\left \langle \text{sepal\_length}, \text{sepal\_width}, \text{petal\_length},\text{petal\_width} \right \rangle
$$
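If you would rather not hard-code the column names, pandas can pick the numerical columns by dtype. This is a minimal sketch of that alternative, not what the lecture does; the small `df` below is a stand-in built from the first rows of the table above so the example runs without downloading anything.

```python
import pandas as pd

# Stand-in for the iris DataFrame, built from the first rows shown above
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7],
    "sepal_width": [3.5, 3.0, 3.2],
    "petal_length": [1.4, 1.4, 1.3],
    "petal_width": [0.2, 0.2, 0.2],
    "species": ["setosa", "setosa", "setosa"],
})

# Keep only the columns with a numeric dtype; 'species' is dropped
numeric_df = df.select_dtypes(include="number")
numeric_columns = list(numeric_df.columns)
print(numeric_columns)
```

On the real dataset the same call would be `iris.select_dtypes(include="number")`.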



In [9]:
numerical_features = iris[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

# Convert to a NumPy array
numerical_features = numerical_features.values
numerical_features

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

# Normalization (Important)

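PCA is highly sensitive to the scale of the features (and, for much the same reason, to outliers): a feature with a larger numeric range has a larger variance and ends up dominating the principal components. So we want to standardize the whole dataset before fitting.

We use the `scale` function from `sklearn.preprocessing` to bring each feature to zero mean and unit variance.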
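To see why the scale matters, here is a minimal sketch on synthetic data (not the iris data): two independent features, one of which is multiplied by 1000 as if it had been recorded in a much smaller unit. The seed, sample size and factor are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)

# Two independent features, both drawn with unit variance
data = rng.normal(size=(200, 2))

# Pretend the second feature was recorded in a unit 1000 times smaller
data[:, 1] *= 1000

# Share of variance captured by each component, before and after scaling
raw_ratio = PCA(n_components=2).fit(data).explained_variance_ratio_
scaled_ratio = PCA(n_components=2).fit(scale(data, axis=0)).explained_variance_ratio_

print("unscaled:", raw_ratio)
print("scaled:  ", scaled_ratio)
```

Without scaling, the first component is essentially just the inflated feature. After scaling, the variance is split far more evenly between the two components, which is what we expect for independent features.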

Question:

    Is it better to normalize along rows or along columns?

    The lecture passes `axis=0`, which looked like row normalization to me and didn't feel right: in theory we want all measurements (features) to be on the same scale, so we should normalize each feature, i.e. each column.

Answer:

    The sklearn convention is counter-intuitive: `axis=0` does not normalize each row. It computes the mean and standard deviation of each column (moving down the rows, along axis 0) and standardizes every column with its own values, so each feature ends up on the same scale.
    You can test this by taking any column and checking that its mean is approximately zero.
    The same goes for the standard deviation, which should be approximately 1.

    ```python3
    X[:, 1].mean() # Should be 0
    X[:, 1].std()  # Should be 1
    ```
    

In [29]:
X = scale(numerical_features, axis=0)
print(f'The column mean: {X[:, 1].mean()} # Should be 0')
print(f'The column std dev: {X[:, 1].std()} # Should be 1')
X[:5]

The column mean: -7.815970093361102e-16 # Should be 0
The column std dev: 0.9999999999999999 # Should be 1


array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
       [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
       [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
       [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
       [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ]])
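As a further check of the `axis` question, the sketch below standardizes five rows taken from the array printed earlier with both `axis=0` and `axis=1`, and compares the `axis=0` result against a manual NumPy computation. The variable names are mine, not the lecture's.

```python
import numpy as np
from sklearn.preprocessing import scale

# Five rows copied from the numerical_features array printed earlier
sample = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [5.4, 3.9, 1.7, 0.4],
    [5.8, 4.0, 1.2, 0.2],
    [5.1, 3.3, 1.7, 0.5],
    [4.9, 3.1, 1.5, 0.1],
])

# axis=0: statistics are computed per column, so each feature is standardized
col_scaled = scale(sample, axis=0)

# axis=1: statistics are computed per row, so each sample is standardized
row_scaled = scale(sample, axis=1)

# Manual column-wise standardization (population std, ddof=0)
manual = (sample - sample.mean(axis=0)) / sample.std(axis=0)

print("axis=0 matches manual column scaling:", np.allclose(col_scaled, manual))
print("column means after axis=0:", col_scaled.mean(axis=0))
print("row means after axis=1:   ", row_scaled.mean(axis=1))
```

For PCA we want the `axis=0` behaviour, since it puts the four features on a common scale while leaving the differences between flowers intact.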