This notebook requires the `pca` library; install it in your environment first (e.g. with `pip install pca`).

# PCA for Decathlon Athletes


<div class="alert alert-block alert-success">
<b>Goals:</b> 

* Demonstrate PCA.
* Show technical aspects.
* Observe the mathematics.
* See options for interpretation.
* This notebook is a mix of technical demo and analysis and presentation of results.
</div>
<div class="alert alert-block alert-warning">
<b>Content:</b> In this notebook, we mix the demo of a use case (decathlon) with a discussion of the mathematical properties of PCA. The latter is meant for a deeper understanding of the theory. In a real use case for a customer, we would not discuss the math in such detail!
</div>

<div class="alert alert-block alert-info">
<b>Content:</b> In this notebook, we 
    
* demonstrate the use of PCA,
* observe some of the properties,
* use resulting objects (matrices) to understand the transformation process.
</div>



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Load the Data

In [None]:
df=pd.read_csv('data/decathlon_data.csv')
print(df.shape)
df

* athlete names can serve as index
* use only features from the sports competition

In [None]:
df=df.set_index('athlete')
positions = df['position']
df=df.drop(['country', 'position'], axis=1)
df

## Scaling
* Centering and scaling are useful prerequisites for PCA

In [None]:
scaler = StandardScaler()
 
df_sc = scaler.fit_transform(df)
print(type(df_sc))
df_sc = pd.DataFrame(df_sc, index=df.index, columns=df.columns)
df_sc.round(2)
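
To make the scaling step concrete, here is a minimal sketch on hypothetical toy data (the array `X_toy` is made up and is not the decathlon dataset). It checks that `StandardScaler` amounts to subtracting the column mean and dividing by the population standard deviation (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: 5 athletes, 2 events measured in different units
# (a sprint time in seconds and a throwing distance in metres)
X_toy = np.array([
    [11.0, 60.2],
    [10.8, 55.4],
    [11.3, 63.1],
    [10.9, 58.0],
    [11.1, 50.7],
])

# Manual standardization: subtract the column mean, then divide by the
# population standard deviation (ddof=0)
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0, ddof=0)
X_sklearn = StandardScaler().fit_transform(X_toy)

print(np.allclose(X_manual, X_sklearn))
print(X_sklearn.mean(axis=0).round(6), X_sklearn.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no event dominates the PCA just because of its unit.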

## Covariance Analysis of the Features

In [None]:
df_sc.cov().round(2)

* We investigate 10-dimensional data
* We observe non-zero covariance between all features 
* Conclusion: The features are correlated, so the intrinsic dimension of the data is smaller than 10
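
Because the data is standardized, its covariance matrix is essentially the correlation matrix of the original data. The following sketch illustrates this on hypothetical random data (the frame `toy` and its columns are made up). The only difference is a factor $n/(n-1)$, because `StandardScaler` divides by the population standard deviation while pandas computes the sample covariance.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 40
a = rng.normal(size=n)
# Hypothetical features: "b" depends on "a", "c" is independent noise
toy = pd.DataFrame({
    "a": a,
    "b": 0.8 * a + 0.2 * rng.normal(size=n),
    "c": rng.normal(size=n),
})
toy_sc = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

cov_sc = toy_sc.cov()   # sample covariance (ddof=1) of the standardized data
corr = toy.corr()       # correlation of the original data

# Both agree up to the factor n / (n - 1)
print(np.allclose(cov_sc, corr * n / (n - 1)))
```

This is also why the main diagonal of the covariance matrix above is slightly larger than 1.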

### Principal Component Analysis

In [None]:
pca = PCA()  # note: a library named `pca` is used further below; keep this variable from being overwritten by that import
pca_transformed=pca.fit_transform(df_sc)
df_pca=pd.DataFrame(pca_transformed, index=df.index)
df_pca.round(2)

In [None]:
pca.get_covariance()

<div class="alert alert-block alert-warning">
<b>Observation:</b> The features are no longer interpretable in the context of the sports competitions.
</div>

In [None]:
df_pca.cov().round(2)

<div class="alert alert-block alert-success">
<b>Observations:</b> 

* The transformed data shows no covariance, thus no correlation among different components
* The components are ordered by their variance (highest to lowest, see main diagonal)
    </div>
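
Both observations can be verified numerically. The sketch below uses hypothetical correlated data (`X_demo` is generated from two made-up latent factors) and checks that the covariance matrix of the transformed data is diagonal and that its diagonal equals `explained_variance_` in descending order.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical correlated data: 200 samples, 4 features built from 2 latent factors
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 4))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 4))

pca_demo = PCA()
Z = pca_demo.fit_transform(X_demo)

cov_Z = np.cov(Z, rowvar=False)              # sample covariance (ddof=1)
off_diag = cov_Z - np.diag(np.diag(cov_Z))   # everything except the main diagonal

print(np.abs(off_diag).max())                # numerically zero
print(np.diag(cov_Z).round(3))               # variances, highest to lowest
print(pca_demo.explained_variance_.round(3)) # same values
```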


### Digression -- the Components of the Transformation.
Let's take a look at the transformation matrix to understand the transformation process mathematically. 

In [None]:
np.round(pca.components_,2) # the transformation matrix

In [None]:
X = df_sc.to_numpy()  # our data as a NumPy array
W = pca.components_   # rows are the principal axes
# The actual transformation: project X onto the principal axes.
# The result matches df_pca because the standardized data is already centered.
manually_computed_trafo = X @ W.T
np.round(manually_computed_trafo, 2)

In [None]:
# compare with a floating-point tolerance instead of comparing rounded values
np.allclose(df_pca.to_numpy(), manually_computed_trafo)
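
Where does the transformation matrix come from? Its rows are the eigenvectors of the covariance matrix of the centered data, sorted by descending eigenvalue. The sketch below checks this on hypothetical data (`X_demo` is made up). Eigenvectors are only defined up to sign, so the comparison uses absolute values.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Hypothetical data with clearly different variances along three directions
X_demo = rng.normal(size=(100, 3)) @ np.array([[2.0, 0.5, 0.0],
                                               [0.0, 1.0, 0.3],
                                               [0.0, 0.0, 0.2]])
X_centered = X_demo - X_demo.mean(axis=0)

# Eigen-decomposition of the sample covariance matrix;
# eigh returns eigenvalues in ascending order, so we reverse them
eigvals, eigvecs = np.linalg.eigh(np.cov(X_centered, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pca_demo = PCA().fit(X_demo)
W_demo = pca_demo.components_  # rows are the principal axes

same_axes = np.allclose(np.abs(W_demo), np.abs(eigvecs.T))      # equal up to sign
same_vars = np.allclose(pca_demo.explained_variance_, eigvals)  # eigenvalues = variances
same_projection = np.allclose(X_centered @ W_demo.T, pca_demo.transform(X_demo))
print(same_axes, same_vars, same_projection)
```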

# What have we gained so far?
* Our data still has 10 features, i.e. the dimensionality has not been reduced.
* Idea: Use the principal components with the highest variance.

In [None]:
expl_var=pca.explained_variance_ratio_
print(expl_var)
plt.plot(expl_var, label='expl. var.')
plt.plot(np.add.accumulate(expl_var), label='acc. expl. var.')
plt.legend()
print(np.add.accumulate(expl_var))

* The first 4 principal components account for over 80% of the variance
* The first 2 already account for 57% of the variance
* Let's plot the data in a 2-D plot using only the first two components

In [None]:
plt.scatter(df_pca.iloc[:,0],df_pca.iloc[:,1])
for i in range(len(df_pca)):
    plt.annotate(df_pca.index[i], (df_pca.iloc[i,0]+0.1, df_pca.iloc[i,1]+0.1))

### For Specific Tasks, there are Specific Libs

In [None]:
from pca import pca as pca_lib  # alias so the sklearn `pca` object fitted above is not overwritten

In [None]:
#model = pca_lib(n_components=0.8)
model = pca_lib(n_components=4)

# Fit transform
results = model.fit_transform(df_sc)

In [None]:
# explained variance
fig, ax = model.plot()

In [None]:
# Scatter plot of the first two PCs
fig, ax = model.scatter(legend=False)

In [None]:
# biplot with number of original features (plot data and loadings)
fig, ax = model.biplot(n_feat=10, legend=False) #, PC=[2,3]) # use this to display components 3 and 4

* Annotations mark the highest loadings by absolute value, i.e. the weight of each feature on the component it influences most
* Red arrows indicate the features that are most important for a particular component
* Angles between loading vectors indicate correlation: small angles -> high correlation, angles near 90 degrees -> low correlation
* The length of a vector indicates the strength of the feature with respect to the currently chosen components (e.g. highjump with 0.08 and 0.57 (long) vs. shotput with 0.18 and 0.11 (short)); vectors are scaled for better readability
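
The reading rules for angles and lengths can be reproduced with plain scikit-learn. The sketch below is an illustration on hypothetical data (features `f0` and `f1` are made strongly correlated, `f2` is independent). It uses the raw `components_` of scikit-learn, not the scaled vectors drawn by the `pca` library, and the helper `angle_deg` is our own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
base = rng.normal(size=300)
# Hypothetical features: f0 and f1 are strongly correlated, f2 is independent
X_demo = np.column_stack([
    base + 0.1 * rng.normal(size=300),
    base + 0.1 * rng.normal(size=300),
    rng.normal(size=300),
])
X_sc = StandardScaler().fit_transform(X_demo)

pca_demo = PCA(n_components=2).fit(X_sc)
# One row per feature: (weight on PC1, weight on PC2)
loadings_2d = pca_demo.components_.T

def angle_deg(u, v):
    """Angle between two loading vectors in degrees."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

lengths = np.linalg.norm(loadings_2d, axis=1)
angle_01 = angle_deg(loadings_2d[0], loadings_2d[1])  # correlated pair
angle_02 = angle_deg(loadings_2d[0], loadings_2d[2])  # uncorrelated pair

print(lengths.round(2))
print(round(angle_01, 1), round(angle_02, 1))
```

The correlated pair yields a small angle, the uncorrelated pair an angle close to 90 degrees. Keep in mind that a 2-D biplot only shows the part of each loading vector that lies in the chosen two components, so the angle rule is an approximation.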

In [None]:
model.results['loadings'].round(2)

In [None]:
model.results['PC']

In [None]:
model.results['explained_var']

## Clustering / Predictions ...
* use the PCA-transformed dataset as input for further analyses (clustering, classification, regression, ...)
* gain: 
    * lower dimensionality -> faster training
    * uncorrelated features -> fits assumptions of algorithms better
* loss: 
    * interpretability (features are now mixed into the principal components)
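
As a minimal sketch of such a follow-up analysis, scaling, PCA and a clustering step can be chained in a scikit-learn pipeline. The data below is hypothetical (two made-up groups in six dimensions), and the choice of `KMeans` with two clusters is an assumption for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
# Hypothetical data: two well-separated groups in 6 dimensions
group_a = rng.normal(loc=0.0, size=(30, 6))
group_b = rng.normal(loc=4.0, size=(30, 6))
X_demo = np.vstack([group_a, group_b])

pipeline = make_pipeline(
    StandardScaler(),                                 # center and scale
    PCA(n_components=2),                              # reduce 6 -> 2 dimensions
    KMeans(n_clusters=2, n_init=10, random_state=0),  # cluster in PC space
)
labels = pipeline.fit_predict(X_demo)
print(labels)
```

The clustering step only sees two uncorrelated inputs instead of six correlated ones.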

<div class="alert alert-block alert-info">
<b>Takeaways:</b> 

* PCA uses a matrix multiplication to transform the data. This matrix is learned when fitting the data.
* The resulting components are ordered by the amount of variance they explain.
* Biplots show the connections between data and features.
* In a preprocessing step, the data scientist selects a reasonable number of dimensions, e.g. by using all components that exceed a variance threshold individually or by using the first $k$ components such that their accumulated explained variance exceeds a reasonable threshold.
</div>
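
The two selection rules from the last takeaway can be sketched in a few lines. The explained-variance ratios and both thresholds below are hypothetical values chosen for illustration; they are not the decathlon results.

```python
import numpy as np

# Hypothetical explained-variance ratios (sorted descending, summing to 1)
expl_var_demo = np.array([0.42, 0.23, 0.13, 0.09, 0.07, 0.06])
cumulative = np.cumsum(expl_var_demo)

# Rule 1: smallest k whose accumulated explained variance reaches the threshold
threshold = 0.8
k_cumulative = int(np.searchsorted(cumulative, threshold) + 1)

# Rule 2: keep every component that individually exceeds a variance threshold
individual_threshold = 0.10
k_individual = int(np.sum(expl_var_demo > individual_threshold))

print(k_cumulative, k_individual)
```

The two rules do not have to agree; which one is reasonable depends on the downstream task.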

<div class="alert alert-block alert-warning">
<b>Note:</b>
    
* The features are no longer interpretable in the original context of the domain.
* For example, a clustering computed on PCA-transformed data should be interpreted on the original data!
</div>
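
One way to follow this advice is sketched below: the clusters are computed in PC space, but their profiles are read off the original features. The data frame `original`, its column names and the use of `KMeans` are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
# Hypothetical original features in their natural units, two groups of 20 athletes
original = pd.DataFrame({
    "sprint_s": np.r_[rng.normal(10.8, 0.1, 20), rng.normal(11.4, 0.1, 20)],
    "throw_m": np.r_[rng.normal(40.0, 1.0, 20), rng.normal(48.0, 1.0, 20)],
    "jump_m": np.r_[rng.normal(7.6, 0.1, 20), rng.normal(7.0, 0.1, 20)],
})

# Cluster on the PCA-transformed data ...
scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(original))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scores)

# ... but interpret the clusters on the original features
profile = original.groupby(labels).mean().round(2)
print(profile)
```

The table of cluster means is expressed in seconds and metres, which a domain expert can read directly, whereas the means of the principal components would not be.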


<div class="alert alert-block alert-success">
<b>Play with:</b> 

* Choose different numbers of components and plot the biplot.
* Go through the biplot and find the respective components in the loadings.
* Use PCA on the Iris dataset and observe the loadings.
* Combine PCA and clustering by first transforming the data using PCA and then applying the clustering to the transformed data.
</div>