# Essentials of Principal Component Analysis

Principal Component Analysis (PCA) is one of the most widely used unsupervised learning algorithms.

In [1]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 10)

import plotly.express as px
import plotly.io as pio
pio.renderers.default = "plotly_mimetype+notebook_connected"

## Automobile Dataset

We will use the Automobile Data Set [https://archive.ics.uci.edu/ml/datasets/automobile] from the UCI Machine Learning Repository [https://archive-beta.ics.uci.edu/]. It includes categorical and continuous variables.

In [2]:
# Defining the headers
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration", "num_doors", 
            "body_style", "drive_wheels", "engine_location","wheel_base", "length", "width", 
            "height", "curb_weight", "engine_type", "num_cylinders", "engine_size", "fuel_system",
            "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm","city_mpg", 
            "highway_mpg", "price"]

In [3]:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )
print(df.shape)
df.head()

(205, 26)


Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,...,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,...,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,...,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,...,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,...,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,...,115.0,5500.0,18,22,17450.0


## Essentials of PCA

PCA is a fast and flexible unsupervised method for dimensionality reduction in data. 

Let's work with the variables: `city_mpg` and `highway_mpg`.

In [4]:
data = df[['city_mpg', 'highway_mpg']]
print(data.shape)
data.head()

(205, 2)


Unnamed: 0,city_mpg,highway_mpg
0,21,27
1,21,27
2,19,26
3,24,30
4,18,22


In [5]:
# Asking for missing values
data.isnull().sum()

city_mpg       0
highway_mpg    0
dtype: int64

There are no missing values. 

This is a simple example with two variables measured on the same scale (miles per gallon), so it is not necessary to standardize them. We can proceed to visualize the data.

In [6]:
# Plotting the data
fig = px.scatter(data, x='city_mpg', y='highway_mpg', 
           width=500, title="City (MPG) vs Highway (MPG)")

# Set the aspect ratio to be equal, this is done by updating the layout of the figure.
fig.update_layout(title="City vs. Highway (Miles per Gallon)",
                  xaxis=dict(title='x - City (Miles per Gallon)', scaleanchor="y", scaleratio=1),
                  yaxis=dict(title='y - Highway (Miles per Gallon)', scaleanchor="x", scaleratio=1))

# Update the markers and the opacity to get a more beautiful plot.
fig.update_traces(marker=dict(size=8, opacity=0.6))

# Show the figure
fig.show()

There is a linear relationship between `city_mpg` and `highway_mpg`.

The problem setting here differs from regression: rather than attempting to predict the y values (`highway_mpg`) from the x values (`city_mpg`), the unsupervised learning problem attempts to learn about the *relationship* between both variables.

In PCA, this relationship is quantified by finding a list of the *principal axes* in the data and using those axes to describe the dataset.

In [7]:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(data)

The fit learns the `explained variance` and `components` from the data.

In [8]:
print('Explained variance:', pca.explained_variance_)

Explained variance: [88.93313699  1.28957941]


Explained variance refers to the amount of variance in the data that is accounted for by each principal component. It indicates how much of the total variance in the dataset is captured by each component. 

Each principal component captures a certain amount of the total variability in the dataset. The first principal component captures the most variance (`88.9` in our case), followed by the second, and so on.
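As a minimal sketch of where these numbers come from, the explained variances reported by scikit-learn can be compared with the eigenvalues of the sample covariance matrix. The data below is a synthetic stand-in for two correlated features (an assumption for illustration, not the automobile data), and the names `X` and `pca_demo` are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for two correlated features (not the automobile data)
rng = np.random.default_rng(0)
x = rng.normal(25, 6, size=200)
X = np.column_stack([x, 1.05 * x + 4 + rng.normal(0, 1.2, size=200)])

pca_demo = PCA(n_components=2).fit(X)

# Eigenvalues of the sample covariance matrix (np.cov uses n - 1 by default),
# reversed so that they are in descending order like explained_variance_
eigenvalues = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

# Total variance: sum of the per-feature sample variances
total_variance = X.var(axis=0, ddof=1).sum()

print("explained_variance_:   ", pca_demo.explained_variance_)
print("covariance eigenvalues:", eigenvalues)
print("sum of explained variances:", pca_demo.explained_variance_.sum())
print("total feature variance:    ", total_variance)
```

When all components are kept, the explained variances should add up to the total variance of the original features, which is why they can be read as a decomposition of that total.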

In [9]:
print('Components:\n', pca.components_)

Components:
 [[ 0.68820301  0.72551817]
 [-0.72551817  0.68820301]]


The components are a set of orthogonal axes (vectors) in feature space along which the data varies the most, and they define the directions of maximum variance in the data.

They can be used to transform the data into a new coordinate system through a linear transformation.

Dimensions of `pca.components_`:
- *rows*: number of principal components retained: `n_components`, (`2` in our case).
- *columns*: number of features in the input dataset, (`2` in our case).
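The shape and orthogonality of the components can be checked directly. This is a small sketch on synthetic data with four correlated features (an assumption for illustration); `W` and `gram` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 150 samples, 4 correlated features (not the automobile data)
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 4)) @ rng.normal(size=(4, 4))

pca_demo = PCA(n_components=2).fit(X)
W = pca_demo.components_

# Rows are components, columns are input features
print("Shape of components_:", W.shape)

# For orthonormal rows, W @ W.T is the identity matrix
gram = W @ W.T
print(np.round(gram, 6))
```

Each row has unit length and the rows are mutually orthogonal, so the product of the component matrix with its transpose should be the identity matrix up to floating-point error.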

To understand what the `explained variance` and `components` mean, let's visualize them as vectors over the input data.
- The `components` will define the direction of the vectors 
- The `explained variance` will be used to compute the length of the vector.

In [10]:
fig = px.scatter(data, x='city_mpg', y='highway_mpg', width=500)

# Calculate the PCA vectors and lengths for the annotation
for length, vector in zip(pca.explained_variance_, pca.components_):        
    start = pca.mean_      
    end   = pca.mean_ + 3*vector*np.sqrt(length)                  
    fig.add_annotation(x=end[0], y=end[1], ax=start[0], ay=start[1],   
                       xref="x", yref="y", axref="x", ayref="y",
                       showarrow=True, arrowhead=2, arrowwidth=2)                           
    
# Set the titles and aspect ratio to be the same
fig.update_layout(title="PCA Vectors",
                  xaxis=dict(title='x - City (Miles per Gallon)', scaleanchor="y", scaleratio=1),
                  yaxis=dict(title='y - Highway (Miles per Gallon)', scaleanchor="x", scaleratio=1))

# Update the markers and the opacity to get a more beautiful plot
fig.update_traces(marker=dict(size=8, opacity=0.6))

# Show the figure
fig.show()

These vectors represent the *principal axes* of the data, and the length of each vector indicates how "important" that axis is in describing the data distribution. More precisely, each arrow is drawn three standard deviations long, where the standard deviation is the square root of the variance of the data when projected onto that axis.

The projections of the points onto the principal axes are the "`principal components`" of the data.
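The projection can be written out by hand: center the data with the fitted mean, then take the dot product with each principal axis. The sketch below assumes synthetic data and the default `whiten=False`; the variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for two correlated features (not the automobile data)
rng = np.random.default_rng(2)
x = rng.normal(25, 6, size=200)
X = np.column_stack([x, 1.05 * x + 4 + rng.normal(0, 1.2, size=200)])

pca_demo = PCA(n_components=2).fit(X)

# Projection computed by scikit-learn
scores_sklearn = pca_demo.transform(X)

# The same projection by hand: center, then project onto each axis
scores_manual = (X - pca_demo.mean_) @ pca_demo.components_.T

# The sample variance of each projected column is the explained variance
score_variance = scores_sklearn.var(axis=0, ddof=1)

print("Max abs difference:", np.abs(scores_manual - scores_sklearn).max())
print("Variance of scores: ", score_variance)
print("explained_variance_:", pca_demo.explained_variance_)
```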

In [11]:
# Getting the transformed data
data_pca = pd.DataFrame(pca.fit_transform(data), columns=['Component 1', 'Component 2'])
data_pca.head()

Unnamed: 0,Component 1,Component 2
0,-5.625459,0.479732
1,-5.625459,0.479732
2,-7.727383,1.242566
3,-1.384295,0.367787
4,-11.317659,-0.784728


In [12]:
# Plotting the components (using the transformed DataFrame as the data source)
fig = px.scatter(data_pca, x='Component 1', y='Component 2', 
           width=500, title="Component 1 vs Component 2")

# Set the titles and aspect ratio to be the same
fig.update_layout(title="PCA Components",
    xaxis=dict(title='x - Component 1', scaleanchor="y", scaleratio=1),
    yaxis=dict(title='y - Component 2', scaleanchor="x", scaleratio=1))

# Update the markers and the opacity to get a more beautiful plot
fig.update_traces(marker=dict(size=8, opacity=0.6))

# Show the figure
fig.show()

After applying PCA, the data is expressed in the coordinate system of the principal components. Component 1 lies along the direction of greatest variance, which corresponds to the diagonal trend in the original scatter plot. Component 2 is orthogonal to Component 1 and captures the remaining variance.

## PCA as Dimensionality Reduction

Using PCA for dimensionality reduction involves zeroing out one or more of the smallest principal components. This results in a lower-dimensional projection of the data that preserves the maximal data variance.

Let's use the same data, asking for only one principal component.

### Analysis with two features

In [13]:
pca1 = PCA(n_components=1)
pca1.fit(data)

In [14]:
# Getting the explained variance
print('Explained variance: ', pca1.explained_variance_.round(3))
print('Explained variance ratio: ', pca1.explained_variance_ratio_.round(3))

Explained variance:  [88.933]
Explained variance ratio:  [0.986]


The `Explained Variance Ratio` is expressed as a ratio or percentage, representing the fraction of the total variance that is captured by a particular principal component. This ratio is helpful in understanding the importance or significance of each component.

Notice that, in our case, the single component captures `98.6%` of the total variance!
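The ratio can be reproduced by dividing the explained variance by the total variance of all input features, even when only one component is retained. The following sketch assumes synthetic data; `total_variance` and `ratio_manual` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for two correlated features (not the automobile data)
rng = np.random.default_rng(3)
x = rng.normal(25, 6, size=200)
X = np.column_stack([x, 1.05 * x + 4 + rng.normal(0, 1.2, size=200)])

pca_one = PCA(n_components=1).fit(X)

# The denominator is the variance of ALL features, not only the retained ones
total_variance = X.var(axis=0, ddof=1).sum()
ratio_manual = pca_one.explained_variance_ / total_variance

print("Manual ratio:            ", ratio_manual)
print("explained_variance_ratio_:", pca_one.explained_variance_ratio_)
```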

In [15]:
# Showing the component
print('Components:\n', pca1.components_.round(4))

Components:
 [[0.6882 0.7255]]


In [16]:
# Getting the transformed data
data1 = pca1.transform(data)
print("Original shape:   ", data.shape)
print("Transformed shape:", data1.shape)

Original shape:    (205, 2)
Transformed shape: (205, 1)


The original data has two features; the transformed data has been reduced to a single dimension. 

To understand the effect of this dimensionality reduction, we can perform the inverse transform of this reduced data and plot it along with the original data.

In [17]:
data_pca = pd.DataFrame(pca1.inverse_transform(data1), columns=['city_mpg', 'highway_mpg'])
data_pca.head()

Unnamed: 0,city_mpg,highway_mpg
0,21.348054,26.669847
1,21.348054,26.669847
2,19.901504,25.144863
3,24.266836,29.746888
4,17.430665,22.540052


In [18]:
# Plotting the original data
fig = px.scatter(data, x='city_mpg', y='highway_mpg', width=700)

# Name the original-data trace and make sure it appears in the legend
fig.data[0].name = 'Original Data'
fig.data[0].showlegend = True

# Here we update the traces to increase the marker size
fig.update_traces(marker=dict(size=8, opacity=0.7), selector=dict(mode='markers'))

# Plotting the transformed data
fig.add_scatter(x=data_pca.city_mpg, y=data_pca.highway_mpg, opacity=0.8, 
                mode='markers', marker=dict(color='darkorange', size = 7), 
                name='Transformed Data')

# Use the same scale (units per pixel) on both axes
fig.update_xaxes(scaleanchor="y", scaleratio=1)
fig.update_yaxes(constrain='domain') # Satisfy the scale constraint by shrinking the axis domain instead of changing its range

fig.update_layout(title="Original Data vs PCA Transformed Data",
    xaxis=dict(title='x - City (Miles per Gallon)', scaleanchor="y", scaleratio=1),
    yaxis=dict(title='y - Highway (Miles per Gallon)', scaleanchor="x", scaleratio=1))

# Show the figure
fig.show()

The blue points are the original data, while the orange points are the projected version.

This plot illustrates what PCA dimensionality reduction does: the information along the least important principal axis is removed, leaving only the component(s) of the data with the highest variance.

The fraction of variance that is cut out (proportional to the spread of points about the line formed in this figure) is roughly a measure of how much "information" is discarded in this dimensionality reduction.

This reduced-dimension dataset is "good enough" to encode the most important relationships between the points. Despite reducing the data dimension by 50%, the overall relationship between the data points is mostly preserved.
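The amount of variance that is cut out can be measured from the reconstruction error. As a sketch on synthetic data (an assumption for illustration), the code below rebuilds the inverse transform by hand and compares the residual variance with the explained variance of the discarded second component; all variable names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for two correlated features (not the automobile data)
rng = np.random.default_rng(4)
x = rng.normal(25, 6, size=200)
X = np.column_stack([x, 1.05 * x + 4 + rng.normal(0, 1.2, size=200)])

pca_one = PCA(n_components=1).fit(X)
scores = pca_one.transform(X)

# Map the 1-D scores back to the original 2-D feature space
X_back = pca_one.inverse_transform(scores)

# The same mapping by hand: scores times the axis, plus the mean
X_manual = scores @ pca_one.components_ + pca_one.mean_

# Variance lost in the reconstruction (same n - 1 normalization as scikit-learn)
residual_variance = ((X - X_back) ** 2).sum() / (len(X) - 1)

# Variance of the component that was dropped, taken from a full 2-component fit
pca_full = PCA(n_components=2).fit(X)
discarded_variance = pca_full.explained_variance_[1]

print("Residual variance: ", residual_variance)
print("Discarded variance:", discarded_variance)
```

The two numbers should agree, which is one way to read the statement that the discarded variance measures the information lost.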

### Analysis with more features

In [19]:
more_data = df[['wheel_base', 'curb_weight', 'horsepower', 'length', 'width', 'height', 'city_mpg', 'highway_mpg']]
print(more_data.shape)
more_data.head()

(205, 8)


Unnamed: 0,wheel_base,curb_weight,horsepower,length,width,height,city_mpg,highway_mpg
0,88.6,2548,111.0,168.8,64.1,48.8,21,27
1,88.6,2548,111.0,168.8,64.1,48.8,21,27
2,94.5,2823,154.0,171.2,65.5,52.4,19,26
3,99.8,2337,102.0,176.6,66.2,54.3,24,30
4,99.4,2824,115.0,176.6,66.4,54.3,18,22


In [20]:
# Asking for missing values
more_data.isnull().sum()

wheel_base     0
curb_weight    0
horsepower     2
length         0
width          0
height         0
city_mpg       0
highway_mpg    0
dtype: int64

In [21]:
# Removing the missing values
more_data = more_data.dropna()
print(more_data.shape)
more_data.head()

(203, 8)


Unnamed: 0,wheel_base,curb_weight,horsepower,length,width,height,city_mpg,highway_mpg
0,88.6,2548,111.0,168.8,64.1,48.8,21,27
1,88.6,2548,111.0,168.8,64.1,48.8,21,27
2,94.5,2823,154.0,171.2,65.5,52.4,19,26
3,99.8,2337,102.0,176.6,66.2,54.3,24,30
4,99.4,2824,115.0,176.6,66.4,54.3,18,22


In [22]:
# Plotting the data
more_data_long = pd.melt(more_data)

px.box(more_data_long, x='variable', y='value', width=700, height=400)

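As you can see, the variables are not measured on the same scale. We must standardize them before applying PCA; otherwise, the variables with the largest numeric ranges would dominate the principal components.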
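To see why scale matters, here is a small sketch with three synthetic features that share one underlying factor but use very different units. The data and the names `raw` and `scaled` are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Three synthetic features driven by the same latent factor, in different units
rng = np.random.default_rng(5)
n = 300
latent = rng.normal(size=n)
X = np.column_stack([
    2500 + 500 * latent + rng.normal(0, 150, n),  # values in the thousands
    65 + 2 * latent + rng.normal(0, 1, n),        # values in the tens
    25 - 5 * latent + rng.normal(0, 3, n),        # values in the tens, reversed sign
])

# First principal component without and with standardization
raw = PCA(n_components=1).fit(X)
scaled = PCA(n_components=1).fit(StandardScaler().fit_transform(X))

print("Loadings on raw data:         ", np.abs(raw.components_[0]).round(3))
print("Loadings on standardized data:", np.abs(scaled.components_[0]).round(3))
```

On the raw data, the first component points almost entirely along the feature with the largest numeric range. After standardization, all three features contribute with loadings of similar magnitude.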

In [23]:
from sklearn.preprocessing import StandardScaler

more_data_st = StandardScaler().fit_transform(more_data)
more_data_st = pd.DataFrame(more_data_st, columns=more_data.columns)
more_data_st.head()

Unnamed: 0,wheel_base,curb_weight,horsepower,length,width,height,city_mpg,highway_mpg
0,-1.688467,-0.015177,0.170228,-0.420804,-0.838083,-2.024547,-0.647094,-0.543037
1,-1.688467,-0.015177,0.170228,-0.420804,-0.838083,-2.024547,-0.647094,-0.543037
2,-0.710151,0.511728,1.255637,-0.22655,-0.186775,-0.547224,-0.952228,-0.687894
3,0.168675,-0.419457,-0.05695,0.210521,0.138878,0.232474,-0.189394,-0.108465
4,0.102349,0.513644,0.271197,0.210521,0.231922,0.232474,-1.104795,-1.267324


In [24]:
# Plotting the standardized data
more_data_st_long = pd.melt(more_data_st)

px.box(more_data_st_long, x='variable', y='value', width=700, height=400)

In [25]:
pca3 = PCA(n_components=3)
pca3.fit(more_data_st)

In [26]:
# Getting the explained variance
print('Explained variance: ', pca3.explained_variance_.round(2))
print('Explained variance ratio: ', pca3.explained_variance_ratio_.round(4))

Explained variance:  [5.48 1.55 0.46]
Explained variance ratio:  [0.6811 0.1922 0.0572]


The Explained Variance Ratio represents the proportion of the dataset's total variance captured by each principal component. The components are sorted by decreasing variance, so the first ratio is much greater than the second, which in turn is greater than the third.

The first component accounts for approximately `68.11%` (0.6811) of the total variance, the second for about `19.22%` (0.1922), and the third for `5.72%` (0.0572). Together, the three components retain about `93%` of the total variance.

These ratios help understand the relative importance of each principal component in explaining the variance in the dataset.

How many components will be necessary for preserving a good amount of the original variance?
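One common way to approach this question is to look at the cumulative explained variance ratio and pick the smallest number of components that passes a chosen threshold. The sketch below uses synthetic standardized data and a hypothetical threshold of 90%; both are assumptions for illustration, not a recommendation for this dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 features generated from 3 latent factors plus noise
rng = np.random.default_rng(6)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 8)) + 0.3 * rng.normal(size=(200, 8))
X_st = StandardScaler().fit_transform(X)

# Fit with all components and accumulate the explained variance ratios
pca_full = PCA().fit(X_st)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds the threshold
threshold = 0.90  # hypothetical choice
n_needed = int(np.searchsorted(cumulative, threshold, side="right") + 1)

# scikit-learn can make the same selection when n_components is a float in (0, 1)
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_st)

print("Cumulative ratio:", cumulative.round(3))
print("Components needed (manual):", n_needed)
print("Components chosen by PCA:  ", pca_auto.n_components_)
```

The threshold itself is a judgment call that depends on how much information loss is acceptable for the task at hand.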

## References

- https://scikit-learn.org/stable/unsupervised_learning.html
- Müller, A.C. & Guido, S. (2017) Introduction to Machine Learning with Python: A Guide for Data Scientists. USA: O'Reilly Media, Inc. chapter 3.
- VanderPlas, J. (2017) Python Data Science Handbook: Essential Tools for Working with Data. USA: O'Reilly Media, Inc. chapter 5.