# Principal Component Analysis (PCA)

PCA is a dimensionality reduction technique. It reduces the number of features in a dataset by projecting the data onto a lower-dimensional space, while preserving as much of the data's structure as possible. Technically, it finds the lower-dimensional linear projection of the dataset that retains the most information (the definition of "information" is purposefully vague here; in practice it is measured by variance).

Intuitively: imagine you have a dataset with 3 features. You can visualize it in a 3D space. Now imagine you want to reduce the number of features to 2. You can project the data onto a 2D plane, choosing the plane that loses as little of the data's spread as possible. This is what PCA does. It's a bit like taking the best 2D picture of the Eiffel Tower.

<div class="sketchfab-embed-wrapper"> <iframe title="Eiffel Tower model 3D with best quality" frameborder="0" allowfullscreen mozallowfullscreen="true" webkitallowfullscreen="true" allow="autoplay; fullscreen; xr-spatial-tracking" xr-spatial-tracking execution-while-out-of-viewport execution-while-not-rendered web-share src="https://sketchfab.com/models/c3391c293e70471e9a112f7855adcf2f/embed"> </iframe> <p style="font-size: 13px; font-weight: normal; margin: 5px; color: #4A4A4A;"> <a href="https://sketchfab.com/3d-models/eiffel-tower-model-3d-with-best-quality-c3391c293e70471e9a112f7855adcf2f?utm_medium=embed&utm_campaign=share-popup&utm_content=c3391c293e70471e9a112f7855adcf2f" target="_blank" rel="nofollow" style="font-weight: bold; color: #1CAAD9;"> Eiffel Tower model 3D with best quality </a> by <a href="https://sketchfab.com/shatlykxfree?utm_medium=embed&utm_campaign=share-popup&utm_content=c3391c293e70471e9a112f7855adcf2f" target="_blank" rel="nofollow" style="font-weight: bold; color: #1CAAD9;"> shatlykxfree </a> on <a href="https://sketchfab.com?utm_medium=embed&utm_campaign=share-popup&utm_content=c3391c293e70471e9a112f7855adcf2f" target="_blank" rel="nofollow" style="font-weight: bold; color: #1CAAD9;">Sketchfab</a></p></div>
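The "best 2D picture" idea can be sketched with NumPy on synthetic data. This is a toy illustration under stated assumptions, not part of the analysis in this notebook: the point cloud, the random seed and every variable name are made up for the example. It compares the share of variance kept by a 2-component PCA with the share kept by naively dropping the third feature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3D dataset: points spread over a tilted plane, plus a little noise.
n = 500
latent = rng.normal(size=(n, 2)) * [3.0, 1.5]   # two real degrees of freedom
plane = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])             # embeds the plane in 3D
X = latent @ plane + rng.normal(scale=0.1, size=(n, 3))

# PCA via an SVD of the centered data.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
projected = Xc @ Vt[:2].T                       # the "best 2D picture"

# Share of total variance kept by PCA vs. by simply dropping the third axis.
total = (Xc ** 2).sum()
kept_by_pca = (projected ** 2).sum() / total
kept_by_dropping_z = (Xc[:, :2] ** 2).sum() / total
print(f"PCA: {kept_by_pca:.3f}, drop third feature: {kept_by_dropping_z:.3f}")
```

Dropping a feature is also a projection onto a 2D plane, just not the best one: PCA picks the plane on which the projected points keep the most variance.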

PCA has several applications:

- Visualization: PCA can be used to visualize high dimensional data in 2 or 3 dimensions.
- Speeding up Machine Learning (ML) training: PCA can be used to speed up ML training by reducing the number of features.
- Collaborative filtering: PCA can be used in collaborative filtering applications.
- Anomaly detection: PCA can be used to detect anomalies in datasets.
- Simplify: linear regression applied to PCA data is a form of shrinkage (see [principal component regression](https://www.wikiwand.com/en/Principal_component_regression))

In this notebook, we will use PCA for noise reduction and visualization.

We'll be analyzing an emission factor database. These are the environmental impacts of goods and services. In particular, we'll be looking at the emission factors of different food items. The data is from the [Agribalyse](https://agribalyse.ademe.fr/app) database. The raw data can be downloaded from the ADEME website, [here](https://data.ademe.fr/datasets?topics=TQJGtxm2_).

In [1]:
import pandas as pd

food = pd.read_csv('../../data/agribalyse-31-synthese.csv')
indicators = {
    'Changement climatique': 'Changement climatique',
    'Appauvrissement de la couche d\'ozone': 'Couche d\'ozone',
    'Rayonnements ionisants': 'Rayonnements',
    'Formation photochimique d\'ozone': 'Formation d\'ozone',
    'Particules fines': 'Particules fines',
    'Effets toxicologiques sur la santé humaine : substances non-cancérogènes': 'Effets non-cancéreux',
    'Effets toxicologiques sur la santé humaine : substances cancérogènes': 'Effets cancéreux',
    'Acidification terrestre et eaux douces': 'Acidification terrestre/eau douce',
    'Eutrophisation eaux douces': 'Eutrophisation eau douce',
    'Eutrophisation marine': 'Eutrophisation marine',
    'Eutrophisation terrestre': 'Eutrophisation terrestre',
    'Écotoxicité pour écosystèmes aquatiques d\'eau douce': 'Écotoxicité eau douce',
    'Utilisation du sol': 'Utilisation sol',
    'Épuisement des ressources eau': 'Épuisement ressources eau',
    'Épuisement des ressources énergétiques': 'Épuisement ressources énergétiques',
    'Épuisement des ressources minéraux': 'Épuisement ressources minéraux'
}
food = food.rename(columns=indicators)
food = food.set_index('LCI Name')
food.head()


LCI Name,Code AGB,Code CIQUAL,Groupe d'aliment,Sous-groupe d'aliment,Nom du Produit en Français,code saison,code avion,Livraison,Matériau d'emballage,Préparation,DQR,Score unique EF,Changement climatique,Couche d'ozone,Rayonnements,Formation d'ozone,Particules fines,Effets non-cancéreux,Effets cancéreux,Acidification terrestre/eau douce,Eutrophisation eau douce,Eutrophisation marine,Eutrophisation terrestre,Écotoxicité eau douce,Utilisation sol,Épuisement ressources eau,Épuisement ressources énergétiques,Épuisement ressources minéraux
"Seaweed, agar, raw",11084,11084,aides culinaires et ingrédients divers,algues,"Agar (algue), cru",2,0,Ambiant (long),LDPE,Pas de préparation,2.99,1.226025,6.745806,1.017147e-06,11.050022,0.038875,7.172887e-07,4.771949e-08,6.108019e-09,0.099733,0.001868,0.01328,0.12735,43.368466,24.699395,3.079121,315.05717,8.8e-05
"Garlic, powder, dried",11023,11023,aides culinaires et ingrédients divers,herbes,"Ail séché, poudre",2,0,Ambiant (long),Verre,Pas de préparation,4.11,0.103513,0.750274,1.229332e-07,0.172269,0.002936,8.262876e-08,9.281982e-09,4.343632e-10,0.006246,0.000125,0.002379,0.012785,5.672531,20.508995,2.585763,12.701604,7e-06
"Garlic, fresh",11000,11000,aides culinaires et ingrédients divers,herbes,"Ail, cru",2,0,Ambiant (long),Pas d'emballage,Pas de préparation,3.54,0.064652,0.358043,5.483998e-08,0.143971,0.001081,2.344885e-08,6.066248e-09,2.824785e-10,0.002036,6.9e-05,0.002197,0.006394,4.06465,18.871203,3.104146,6.275385,4e-06
"Dill, fresh",11093,11093,aides culinaires et ingrédients divers,herbes,"Aneth, frais",2,0,Ambiant (long),LDPE,Pas de préparation,3.75,0.131581,0.813436,6.268829e-08,0.141342,0.002559,6.89672e-08,9.73999e-09,5.867577e-10,0.008112,0.000203,0.00405,0.031181,5.534832,36.288833,5.290335,12.553109,5e-06
"Sea lettuce (Enteromorpha sp.), dried or dehydrated",20995,20995,aides culinaires et ingrédients divers,algues,"Ao-nori (Enteromorpha sp.), séchée ou déshydratée",2,0,Ambiant (long),LDPE,Pas de préparation,2.99,1.226025,6.745806,1.017147e-06,11.050022,0.038875,7.172887e-07,4.771949e-08,6.108019e-09,0.099733,0.001868,0.01328,0.12735,43.368466,24.699395,3.079121,315.05717,8.8e-05


Let's focus on a particular food group.

In [2]:
food["Sous-groupe d'aliment"].value_counts()


légumes                                              197
charcuteries                                         144
viandes crues                                        136
céréales de petit-déjeuner et biscuits               125
plats composés                                       122
fromages                                             121
viandes cuites                                       105
boissons sans alcool                                 103
fruits                                                98
poissons crus                                         94
eaux                                                  84
produits laitiers frais et assimilés                  79
gâteaux et pâtisseries                                76
pains et viennoiseries                                73
pâtes, riz et céréales                                62
sauces                                                59
poissons cuits                                        55
fruits à coque et graines oléag

For instance, we can pick alcoholic drinks.

In [3]:
drinks = food[food["Sous-groupe d'aliment"] == 'boisson alcoolisées']
drinks = drinks[indicators.values()]
len(drinks)


42

PCA finds structure in the data by studying the correlations between variables. Before running it, we can look at the correlation matrix directly to get a first understanding of the data.

In [4]:
import altair as alt

corr = (
    drinks
    .corr()
    .stack()
    .to_frame()
    .reset_index()
)
corr.columns = ['x', 'y', 'corr']

# Create a heatmap of the correlation matrix
heatmap = alt.Chart(corr).mark_rect().encode(
    x=alt.X('x:O', title=None),
    y=alt.Y('y:O', title=None),
    color=alt.Color('corr:Q', scale=alt.Scale(scheme='blueorange'))
)

# Create text labels for the correlation values
text = alt.Chart(corr).mark_text(baseline='middle').encode(
    x=alt.X('x:O'),
    y=alt.Y('y:O'),
    text=alt.Text('corr:Q', format='.1f'),
)

# Combine the heatmap and text labels
heatmap + text


Clearly there are some correlations. But it's hard to see exactly what's going on. This is the value of PCA: it can help summarize a numeric dataset, which is practical when there are many columns.

In [5]:
import prince

pca = prince.PCA(n_components=5)
pca = pca.fit(drinks)
pca.plot(drinks, x_component=0, y_component=1)


A good PCA is one where a small number of components captures most of the dataset's variance. That is the case here: four components account for more than 90% of the dataset's variance, even though it has 16 variables. This means we can trust the ensuing interpretation.

In [7]:
pca.eigenvalues_summary


component,eigenvalue,% of variance,% of variance (cumulative)
0,7.696,48.10%,48.10%
1,3.35,20.94%,69.04%
2,1.823,11.40%,80.43%
3,1.689,10.55%,90.99%
4,0.73,4.56%,95.55%
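The percentages in this table can be recomputed by hand. The sketch below assumes the variables were standardized before the decomposition, in which case the total variance equals the number of variables (16 here); the numbers in the table are consistent with that assumption. The eigenvalues are copied from the table, so the results match it only up to the rounding of the displayed values.

```python
# Eigenvalues copied from the summary table above.
eigenvalues = [7.696, 3.35, 1.823, 1.689, 0.73]
# Assumption: standardized variables, so total variance = number of variables.
n_variables = 16

cumulative = 0.0
for i, eigenvalue in enumerate(eigenvalues):
    share = eigenvalue / n_variables
    cumulative += share
    print(f"component {i}: {share:.2%} of variance (cumulative {cumulative:.2%})")
```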


We can see which variables contribute to each component. Here the first component mostly reflects eutrophication, toxicological effects on human health, and climate change. The second component has more to do with the ozone layer, ionizing radiation, and fine particles.

In [8]:
(
    pca.column_cosine_similarities_
    .style
    .applymap(lambda x: 'background-color: yellow' if x > 0.5 else '')
)


variable,component 0,component 1,component 2,component 3,component 4
Changement climatique,0.773034,0.005659,0.092593,0.062739,0.041694
Couche d'ozone,0.235993,0.618397,0.071324,0.034904,0.00146
Rayonnements,0.10884,0.818837,0.029075,0.001843,0.02079
Formation d'ozone,0.000971,0.513284,0.315106,0.100727,0.014488
Particules fines,0.023624,0.577621,0.378026,0.014057,7e-06
Effets non-cancéreux,0.757044,0.001849,0.042602,0.178259,0.001951
Effets cancéreux,0.825175,0.014475,0.082554,0.045205,0.003379
Acidification terrestre/eau douce,0.582455,0.130124,0.037258,0.185036,0.032196
Eutrophisation eau douce,0.893773,0.002593,0.000362,0.000401,0.023987
Eutrophisation marine,0.928781,0.006137,0.007235,0.029124,0.000933


We can then check prototypes. These are observations that are very correlated with each principal component.
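To make "very correlated with a component" concrete, here is a minimal sketch of the usual squared-cosine measure: the squared coordinate of a row on a component, divided by the row's squared distance to the origin. It runs on random data with NumPy only. Whether `prince` computes `row_cosine_similarities` in exactly this way is an assumption, not something verified here.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))                  # made-up data: 6 rows, 4 variables
Z = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize each column

# Coordinates of every row on all 4 principal components.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
coords = Z @ Vt.T

# Squared cosine between row i and component k: how much of the row's
# squared distance to the origin lies along that component.
cos2 = coords ** 2 / (coords ** 2).sum(axis=1, keepdims=True)
print(np.round(cos2, 3))
```

Each value lies between 0 and 1, and a row sums to 1 when every component is kept. With only 5 components out of 16, as in this notebook, a row would sum to slightly less than 1.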

In [104]:
pca.row_cosine_similarities(drinks).sort_values(0).tail(8)


LCI Name,component 0,component 1,component 2,component 3,component 4
Whiskey-based cocktail,0.753958,0.073643,0.037128,0.071098,0.001103
"Wine, white, sparkling",0.854678,0.009956,0.010814,0.098537,0.002207
"Wine, white, sparkling, flavoured",0.854678,0.009956,0.010814,0.098537,0.002207
"Wine, white, dry",0.854678,0.009956,0.010814,0.098537,0.002207
Champagne,0.854678,0.009956,0.010814,0.098537,0.002207
Sparkling fruit wine,0.927111,0.002097,0.003016,0.034381,0.003679
Kir (Cocktail of white wine with red fruit liqueur),0.968557,0.005857,0.001516,0.006701,0.001645
Champagne kir (Cocktail of champagne with red fruit liqueur),0.969862,0.005311,0.00337,0.002443,0.002737


In [105]:
pca.row_cosine_similarities(drinks).sort_values(1).tail(8)


LCI Name,component 0,component 1,component 2,component 3,component 4
Liqueur,0.247304,0.366375,0.097338,0.215447,0.053886
"Beer, alcohol-free (<1,2° alcohol)",0.056102,0.831145,0.06682,0.024263,0.021002
"Beer, strong (>8° alcohol)",0.056102,0.831145,0.06682,0.024263,0.021002
"Beer, low alcohol-content (3° alcohol)",0.056102,0.831145,0.06682,0.024263,0.021002
"Beer, dark",0.056102,0.831145,0.06682,0.024263,0.021002
"Beer, special (5-6° alcohol)",0.056102,0.831145,0.06682,0.024263,0.021002
"Beer, ""specialty"", from abbey or regional (varying alcohol content)",0.056102,0.831145,0.06682,0.024263,0.021002
"Beer, regular (4-5° alcohol)",0.056102,0.831145,0.06682,0.024263,0.021002


This provides an interesting insight. It seems that we have a first group with wines, and another with beers.

In [119]:
(
    drinks.loc['Champagne'] -
    drinks.loc['Beer, regular (4-5° alcohol)']
)


Changement climatique                 1.119583e-01
Couche d'ozone                       -3.480632e-08
Rayonnements                         -1.042852e-01
Formation d'ozone                     2.110754e-03
Particules fines                      1.587766e-08
Effets non-cancéreux                  1.147933e-07
Effets cancéreux                      1.805962e-09
Acidification terrestre/eau douce     2.545040e-03
Eutrophisation eau douce              1.003545e-04
Eutrophisation marine                 5.227861e-03
Eutrophisation terrestre              1.196195e-02
Écotoxicité eau douce                 8.662852e+00
Utilisation sol                       9.541839e+01
Épuisement ressources eau             1.389024e-02
Épuisement ressources énergétiques   -1.899935e+00
Épuisement ressources minéraux        9.207747e-06
dtype: float64

This goes with the narrative that people with wealthy habits, such as drinking champagne, pollute more than people with modest habits, such as drinking beer. Then again, one thing we haven't taken into account is quantity: people usually drink less champagne than beer whenever they drink either of the two.

## How does it work?

A PCA transforms a dataset into a new dataset with fewer columns. Each column in the new dataset is called a principal component. It is a linear combination of the columns in the original dataset. By linear combination, we mean a weighted sum in which each original column contributes to a certain extent -- i.e. they are not weighted equally. Another way of saying this is that the observations are projected onto a new coordinate system.

So how is this new coordinate system determined? Basically, the PCA fits an ellipsoid to the dataset. This ellipsoid is determined by the covariance matrix. The axes of the coordinate system are the eigenvectors of the covariance matrix: these are the axes of the ellipsoid. The corresponding eigenvalues give the variance of the data along each axis.

- This blog post is quite intuitive: http://mengnote.blogspot.com/2013/05/an-intuitive-explanation-of-pca.html
- Some good discussion here too: https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues/140579#140579
- More details are available in this excellent course by Xavier Gendre: https://www.math.univ-toulouse.fr/~xgendre/ens/l3sid/StatExplo.pdf
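The ellipsoid picture can be checked numerically. The following is a minimal sketch on a synthetic 2D cloud (the covariance values, seed and names are all made up): it diagonalizes the sample covariance matrix with `np.linalg.eigh` and verifies that the variance along each eigenvector equals the matching eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)

# Correlated 2D cloud: an elongated, tilted ellipse.
cov_true = np.array([[3.0, 1.2],
                     [1.2, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=2000)
Xc = X - X.mean(axis=0)

# Eigen-decomposition of the sample covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)   # returned in ascending order
order = np.argsort(eigenvalues)[::-1]             # largest variance first
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Each principal component is a linear combination of the original columns,
# with weights given by one eigenvector.
components = Xc @ eigenvectors

# The variance along each new axis equals the matching eigenvalue,
# and the new axes are uncorrelated (off-diagonal terms are ~0).
print(components.var(axis=0, ddof=1), eigenvalues)
print(np.round(np.cov(components, rowvar=False), 6))
```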

In [20]:
drinks.head()


LCI Name,Changement climatique,Couche d'ozone,Rayonnements,Formation d'ozone,Particules fines,Effets non-cancéreux,Effets cancéreux,Acidification terrestre/eau douce,Eutrophisation eau douce,Eutrophisation marine,Eutrophisation terrestre,Écotoxicité eau douce,Utilisation sol,Épuisement ressources eau,Épuisement ressources énergétiques,Épuisement ressources minéraux
Pure alcohol,1.033117,1.627922e-07,0.291521,0.008376,1.429583e-07,9.173058e-09,4.486892e-10,0.006793,0.000135,0.001184,0.013257,4.524899,20.249974,0.198157,19.43816,6e-06
Wine-based aperitif,1.08897,1.699931e-07,0.34838,0.00548,1.166544e-07,9.734375e-08,1.89087e-09,0.010367,0.000232,0.00621,0.031048,13.326634,94.187577,0.472749,19.441347,1.3e-05
"Beer, regular (4-5° alcohol)",1.119368,2.410478e-07,0.564481,0.003681,9.261719e-08,9.013428e-09,5.413089e-10,0.007554,0.000162,0.002034,0.017878,6.903986,18.736437,0.256583,24.641994,7e-06
"Beer, ""specialty"", from abbey or regional (varying alcohol content)",1.119368,2.410478e-07,0.564481,0.003681,9.261719e-08,9.013428e-09,5.413089e-10,0.007554,0.000162,0.002034,0.017878,6.903986,18.736437,0.256583,24.641994,7e-06
"Beer, special (5-6° alcohol)",1.119368,2.410478e-07,0.564481,0.003681,9.261719e-08,9.013428e-09,5.413089e-10,0.007554,0.000162,0.002034,0.017878,6.903986,18.736437,0.256583,24.641994,7e-06


In [8]:
pca.row_coordinates(drinks).head(5)


LCI Name,component 0,component 1,component 2,component 3,component 4
Pure alcohol,-2.704146,1.408955,-1.402337,-0.897011,-0.299677
Wine-based aperitif,1.800187,1.061591,0.932706,-1.054395,0.311338
"Beer, regular (4-5° alcohol)",-0.860067,-3.310398,-0.938632,0.56561,0.526221
"Beer, ""specialty"", from abbey or regional (varying alcohol content)",-0.860067,-3.310398,-0.938632,0.56561,0.526221
"Beer, special (5-6° alcohol)",-0.860067,-3.310398,-0.938632,0.56561,0.526221


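In a PCA, the row coordinates are obtained by projecting the standardized data onto the eigenvectors of the covariance matrix. These eigenvectors, along with the eigenvalues, can be obtained with an SVD: they are the columns of the matrix $V$ in $X = U \Sigma V^T$. In the code below, `pca.svd_.V` holds one eigenvector per row, which is why it is transposed before the projection.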

In [44]:
from sklearn import preprocessing

X = preprocessing.scale(drinks)
coords = X.dot(pca.svd_.V.T)
coords = pd.DataFrame(coords, index=drinks.index)
coords.head()


LCI Name,0,1,2,3,4
Pure alcohol,-2.704146,1.408955,-1.402337,-0.897011,-0.299677
Wine-based aperitif,1.800187,1.061591,0.932706,-1.054395,0.311338
"Beer, regular (4-5° alcohol)",-0.860067,-3.310398,-0.938632,0.56561,0.526221
"Beer, ""specialty"", from abbey or regional (varying alcohol content)",-0.860067,-3.310398,-0.938632,0.56561,0.526221
"Beer, special (5-6° alcohol)",-0.860067,-3.310398,-0.938632,0.56561,0.526221
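The same computation can be reproduced without `prince`, which makes the link between the SVD and the eigen-decomposition explicit. This is a sketch on random data using `np.linalg.svd`; the data and variable names are made up. It standardizes with the population standard deviation (NumPy's default), as `preprocessing.scale` does.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5)) @ rng.normal(size=(5, 5))   # correlated columns
Z = (X - X.mean(axis=0)) / X.std(axis=0)                  # standardize

# Thin SVD: Z = U @ diag(s) @ Vt, with one eigenvector per row of Vt.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

# Projecting the data onto the right singular vectors...
coords_projection = Z @ Vt.T
# ...gives the same result as scaling the left singular vectors.
coords_from_svd = U * s

# The squared singular values divided by n are the eigenvalues
# of the correlation matrix.
eigenvalues = s ** 2 / len(Z)
corr_eigenvalues = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))[::-1]
print(np.round(eigenvalues, 4))
print(np.round(corr_eigenvalues, 4))
```

Since $X V = U \Sigma V^T V = U \Sigma$, the two ways of computing the coordinates must agree.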


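The original dataset can be reconstructed from the SVD. With all the components the reconstruction is exact; with only the first $k$ components, as in the code below, it is an approximation: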

$$X = U \Sigma V^T$$

In [43]:
import numpy as np

U = pca.svd_.U
s = pca.svd_.s
V = pca.svd_.V

k = 5  # number of components to keep
X_reconstructed = (U[:, :k] * s[:k]) @ V[:k, :]

X_reconstructed = pca.scaler_.inverse_transform(X_reconstructed)
X_reconstructed = pd.DataFrame(X_reconstructed, index=drinks.index, columns=drinks.columns)
X_reconstructed.head()


LCI Name,Changement climatique,Couche d'ozone,Rayonnements,Formation d'ozone,Particules fines,Effets non-cancéreux,Effets cancéreux,Acidification terrestre/eau douce,Eutrophisation eau douce,Eutrophisation marine,Eutrophisation terrestre,Écotoxicité eau douce,Utilisation sol,Épuisement ressources eau,Épuisement ressources énergétiques,Épuisement ressources minéraux
Pure alcohol,1.025348,1.626658e-07,0.286824,0.00816,1.424237e-07,9.042156e-09,3.889513e-10,0.007067,0.000136,0.0013,0.014463,3.128067,20.718178,0.15371,19.228812,6e-06
Wine-based aperitif,1.122023,1.774973e-07,0.339822,0.006038,1.171269e-07,9.160913e-08,1.964081e-09,0.009624,0.000234,0.005575,0.027333,17.918986,87.966189,0.368713,19.638023,1.5e-05
"Beer, regular (4-5° alcohol)",1.124532,2.399617e-07,0.565259,0.003696,9.210668e-08,9.991771e-09,5.523543e-10,0.007539,0.000162,0.00205,0.01777,6.921552,18.947581,0.265279,24.684087,7e-06
"Beer, ""specialty"", from abbey or regional (varying alcohol content)",1.124532,2.399617e-07,0.565259,0.003696,9.210668e-08,9.991771e-09,5.523543e-10,0.007539,0.000162,0.00205,0.01777,6.921552,18.947581,0.265279,24.684087,7e-06
"Beer, special (5-6° alcohol)",1.124532,2.399617e-07,0.565259,0.003696,9.210668e-08,9.991771e-09,5.523543e-10,0.007539,0.000162,0.00205,0.01777,6.921552,18.947581,0.265279,24.684087,7e-06
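The reconstructed values above are close to the original ones but not identical, because only 5 of the 16 components were kept. The sketch below illustrates this trade-off on synthetic data: a made-up low-rank signal plus noise, with NumPy only and a hypothetical `reconstruct` helper. The reconstruction error shrinks as more components are kept and vanishes when all of them are used.

```python
import numpy as np

rng = np.random.default_rng(4)

# Made-up data: a rank-3 signal spread over 8 columns, plus small noise.
X = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 8))
X += rng.normal(scale=0.05, size=X.shape)

mean, std = X.mean(axis=0), X.std(axis=0)
Z = (X - mean) / std
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

def reconstruct(k):
    """Rebuild the data from the first k components, then undo the scaling."""
    Z_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return Z_k * std + mean

# Frobenius norm of the reconstruction error for k = 1, ..., 8.
errors = [np.linalg.norm(X - reconstruct(k)) for k in range(1, 9)]
print(np.round(errors, 4))
```

Dropping the last components discards the directions with the least variance, which are often dominated by noise. This is what makes a truncated reconstruction usable for noise reduction.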
