# Experiments with PCA (Principal Component Analysis)

This notebook contains the experiments shown in the Udemy course "Practical Recommender Systems For Business Applications", plus some additions I deemed relevant.

Here is a link to the course lecture: https://www.udemy.com/course/practical-recommender-systems-for-business-applications/learn/lecture/30386228#content


PCA is:
- A dimensionality reduction technique
- A decorrelating procedure

PCA needs:
- Numerical Features
- Scaled Features (Mean = 0 and StdDev = 1)
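
To make the "decorrelating" property concrete, here is a minimal sketch on synthetic data. The toy data and all variable names are made up for illustration and are not part of the course material. Two strongly correlated features go in, and the resulting principal components should come out with a correlation of essentially zero.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: the second feature is a noisy copy of the first
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=500)
toy = np.column_stack([x1, x2])

# Correlation between the two original features
corr_before = np.corrcoef(toy, rowvar=False)[0, 1]

# Project the data onto its principal components
toy_components = PCA(n_components=2).fit_transform(toy)

# Correlation between the two principal components
corr_after = np.corrcoef(toy_components, rowvar=False)[0, 1]

print(f"Correlation before PCA: {corr_before:.3f}")
print(f"Correlation after PCA:  {corr_after:.3f}")
```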

In [1]:
%matplotlib inline

import plotly.express as px
import pandas as pd
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Loading the IRIS Dataset

In [2]:
iris = sns.load_dataset('iris')
iris.head()

| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


# PCA only works on Numerical Variables!

The `species` column is categorical, so we perform PCA only on the four numerical columns:

$$
\left \langle \text{sepal\_length}, \text{sepal\_width}, \text{petal\_length},\text{petal\_width} \right \rangle
$$
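
Instead of listing the numerical columns by hand, pandas can pick them out by dtype. The sketch below uses a hypothetical two-row miniature of the iris table (`toy_iris` is an illustrative name) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical miniature of the iris table (first two rows only)
toy_iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9],
    "sepal_width": [3.5, 3.0],
    "petal_length": [1.4, 1.4],
    "petal_width": [0.2, 0.2],
    "species": ["setosa", "setosa"],
})

# Keep only the numeric columns; the string column `species` is dropped
numeric_only = toy_iris.select_dtypes(include="number")
print(list(numeric_only.columns))
```

Listing the columns explicitly, as done in this notebook, is still the safer choice when a numeric column (an ID, for example) should not take part in the PCA.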



In [3]:
features = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
numerical_features = iris[features]

# Convert to a NumPy array
numerical_features = numerical_features.values
numerical_features

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1],
       [5.4, 3.7, 1.5, 0.2],
       [4.8, 3.4, 1.6, 0.2],
       [4.8, 3. , 1.4, 0.1],
       [4.3, 3. , 1.1, 0.1],
       [5.8, 4. , 1.2, 0.2],
       [5.7, 4.4, 1.5, 0.4],
       [5.4, 3.9, 1.3, 0.4],
       [5.1, 3.5, 1.4, 0.3],
       [5.7, 3.8, 1.7, 0.3],
       [5.1, 3.8, 1.5, 0.3],
       [5.4, 3.4, 1.7, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.6, 3.6, 1. , 0.2],
       [5.1, 3.3, 1.7, 0.5],
       [4.8, 3.4, 1.9, 0.2],
       [5. , 3. , 1.6, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [5.2, 3.5, 1.5, 0.2],
       [5.2, 3.4, 1.4, 0.2],
       [4.7, 3.2, 1.6, 0.2],
       [4.8, 3.1, 1.6, 0.2],
       [5.4, 3.4, 1.5, 0.4],
       [5.2, 4.1, 1.5, 0.1],
       [5.5, 4.2, 1.4, 0.2],
       [4.9, 3

# Scaling (Important)

PCA looks for the directions of largest variance, so a feature measured on a larger scale (or one with extreme outliers) dominates the components simply because its numbers are bigger. We therefore standardize the data so that each variable has mean $\mu = 0$ and standard deviation $\sigma = 1$.

We use the `scale` function from `sklearn.preprocessing` to standardize the features.
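
Standardizing a feature means computing $z = (x - \mu) / \sigma$ for every value in that column. The following sketch checks, on a hypothetical toy matrix, that doing this by hand with NumPy gives the same result as `scale`. It assumes the population standard deviation (`ddof=0`), which is NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical toy matrix: 4 samples, 2 features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

# Standardize each column by hand: z = (x - mean) / std
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# sklearn's `scale` standardizes each column by default (axis=0)
with_sklearn = scale(toy)

print(np.allclose(manual, with_sklearn))
```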


Question:
---
Is it better to scale along a row or along a column?

The lecture uses row scaling, but that doesn't feel right to me: in theory we want all measurements (features) to be on the same scale, so we should standardize each feature (column).

---

Answer:
---

The sklearn `axis` argument is counter-intuitive: `axis=0` does not scale each row. It computes the mean and standard deviation along axis 0 (down the rows), that is, separately for each column, so every feature is standardized independently. This is the per-feature scaling we want.
You can test this by checking that the mean and standard deviation of any column match the values we wanted.

```python3
X[:, 1].mean() # Should be 0
X[:, 1].std()  # Should be 1
```

---
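
To see the difference between the two `axis` values side by side, here is a small sketch on a hypothetical 3x3 matrix (all names are illustrative). With `axis=0` every column ends up with mean 0, while with `axis=1` it is the rows that are standardized and the columns are left on different levels.

```python
import numpy as np
from sklearn.preprocessing import scale

# Hypothetical toy matrix: rows are samples, columns are features
toy = np.array([[1.0, 10.0, 100.0],
                [2.0, 20.0, 200.0],
                [3.0, 30.0, 300.0]])

by_feature = scale(toy, axis=0)  # standardize each column (feature)
by_sample = scale(toy, axis=1)   # standardize each row (sample)

print("axis=0 column means:", by_feature.mean(axis=0))
print("axis=1 row means:   ", by_sample.mean(axis=1))
print("axis=1 column means:", by_sample.mean(axis=0))
```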
    

In [4]:
X = scale(numerical_features, axis=0)
print(f'The column mean: {X[:, 1].mean()} # Should be 0')
print(f'The column std dev: {X[:, 1].std()} # Should be 1')
X[:5]

The column mean: -7.815970093361102e-16 # Should be 0
The column std dev: 0.9999999999999999 # Should be 1


array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
       [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
       [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ],
       [-1.50652052,  0.09821729, -1.2833891 , -1.3154443 ],
       [-1.02184904,  1.24920112, -1.34022653, -1.3154443 ]])

# PCA 

We now fit PCA, specifying the number of components to keep.

In [5]:
pca = PCA(n_components = 3)
pca.fit(X)
pca

In [6]:
pca.explained_variance_ratio_

array([0.72962445, 0.22850762, 0.03668922])
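
As a rough check on where these ratios come from: each one is an eigenvalue of the covariance matrix of the scaled data divided by the sum of all eigenvalues. The sketch below reproduces them by hand. It assumes sklearn's bundled `load_iris` data as a stand-in for the seaborn copy used in this notebook, and `X_demo` and `pca_demo` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Assumption: sklearn's bundled iris data stands in for the seaborn copy
X_demo = scale(load_iris().data)

# Eigenvalues of the covariance matrix, sorted from largest to smallest
eigenvalues = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]

# Each eigenvalue's share of the total variance
manual_ratio = eigenvalues / eigenvalues.sum()

pca_demo = PCA(n_components=3).fit(X_demo)
print(manual_ratio[:3])
print(pca_demo.explained_variance_ratio_)
```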

In [7]:
import plotly.graph_objects as go

# Percentage of variance explained by each retained component
percentages = (100 * pca.explained_variance_ratio_).tolist()
# Remainder: variance not captured by the retained components
percentages.append(100 - (pca.explained_variance_ratio_.sum() * 100))

# Labels for each percentage (you can customize these)
labels = [f"Component {i + 1}" for i in range(len(percentages) - 1)]
labels.append("Lost Information")

# Create the pie chart figure
fig = go.Figure(data=[go.Pie(labels=labels, values=percentages, hole=0.3)])

# Customize the layout (optional)
fig.update_layout(title_text="Share of the dataset's variance explained by each component", width = 800, height = 400)

# Show the figure
fig.show()
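
A common way to decide how many components to keep is to look at the cumulative explained variance. The sketch below is one possible approach, not the course's method: the 95% target is a hypothetical threshold, and sklearn's bundled `load_iris` data is assumed as a stand-in for the seaborn copy.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Assumption: sklearn's bundled iris data stands in for the seaborn copy
X_demo = scale(load_iris().data)

# Fit with all components to see the full variance breakdown
full_pca = PCA().fit(X_demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
print(cumulative)

# Smallest number of components that reaches the (hypothetical) 95% target
target = 0.95
n_needed = int(np.searchsorted(cumulative, target) + 1)
print(n_needed)

# sklearn makes the same selection when n_components is a float in (0, 1)
auto_pca = PCA(n_components=target).fit(X_demo)
print(auto_pca.n_components_)
```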

# Before and After PCA

In [8]:
fig1 = px.scatter_matrix(
    iris,
    dimensions=features,
    color=iris["species"]
)

pca = PCA(n_components=2)
components = pca.fit_transform(X) # Fit on X, then return X projected onto the principal components
fig2 = px.scatter(components, x=0, y = 1, color=iris['species'])

fig1.update_traces(diagonal_visible=True)
fig1.update_layout(title_text="The dataset Before applying Principal Component Analysis", width = 900, height = 700)

fig2.update_layout(title_text="The dataset After Finding the TWO principal Components", width = 900, height = 500)
fig1.show()
fig2.show()
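
The values returned by `fit_transform` are the centered data projected onto the principal axes stored in `pca.components_`. The sketch below checks this by hand. It assumes sklearn's bundled `load_iris` data in place of the seaborn copy, and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Assumption: sklearn's bundled iris data stands in for the seaborn copy
X_demo = scale(load_iris().data)

pca_demo = PCA(n_components=2)
components_demo = pca_demo.fit_transform(X_demo)

# Center the data and project it onto the principal axes
manual_projection = (X_demo - pca_demo.mean_) @ pca_demo.components_.T

print(components_demo.shape)
print(np.allclose(components_demo, manual_projection))
```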

In [9]:
pca = PCA(n_components=3)
components = pca.fit_transform(X)

total_var = pca.explained_variance_ratio_.sum() * 100

fig = px.scatter_3d(
    components, x=0, y=1, z=2, color=iris['species'],
    title=f'Total Explained Variance: {total_var:.2f}%',
    labels={'0': 'PC 1', '1': 'PC 2', '2': 'PC 3'}
)
fig.update_layout(width = 900, height = 500)
fig.show()
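
One way to interpret the variance that is not explained is as reconstruction error: compress the data to a few components, map it back with `inverse_transform`, and measure how much of the original variation is missing. The sketch below assumes sklearn's bundled `load_iris` data in place of the seaborn copy, and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Assumption: sklearn's bundled iris data stands in for the seaborn copy
X_demo = scale(load_iris().data)
pca_demo = PCA(n_components=3).fit(X_demo)

# Compress to 3 components, then map back to the original 4 features
X_restored = pca_demo.inverse_transform(pca_demo.transform(X_demo))

# Fraction of the total variation that the reconstruction fails to recover
residual = ((X_demo - X_restored) ** 2).sum()
total = ((X_demo - X_demo.mean(axis=0)) ** 2).sum()
lost = residual / total

unexplained = 1 - pca_demo.explained_variance_ratio_.sum()
print(f"Lost by reconstruction: {lost:.4%}")
print(f"1 - explained variance: {unexplained:.4%}")
```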