# Principal Component Analysis - Step by Step

* Standardize the data.
* Compute the eigenvalues and eigenvectors of the covariance matrix or the correlation matrix, or use a singular value decomposition (SVD) instead.
* Sort the eigenvalues in descending order and keep the $p$ eigenvectors that correspond to the $p$ largest eigenvalues, reducing the number of variables in the dataset from $m$ to $p$ $(p<m)$.
* Build the projection matrix $W$ from the $p$ selected eigenvectors.
* Transform the original dataset $X$ through $W$ to obtain $Y$, the data expressed in the $p$-dimensional vector subspace.
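
The five steps above can be sketched as a single function. This is a minimal illustration, not the notebook's own code: the name `pca_project` and the random demo data are assumptions, and `np.linalg.eigh` is used in place of `np.linalg.eig` because the covariance matrix is symmetric.

```python
import numpy as np

def pca_project(X, p):
    """Project the rows of X onto the first p principal components (hypothetical helper)."""
    # 1. Standardize each column (zero mean, unit population variance).
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Eigendecomposition of the covariance matrix.
    cov = np.cov(X_std, rowvar=False)
    eig_vals, eig_vecs = np.linalg.eigh(cov)
    # 3. Indices of the p largest eigenvalues, in descending order.
    order = np.argsort(eig_vals)[::-1][:p]
    # 4. Projection matrix W: the selected eigenvectors as columns.
    W = eig_vecs[:, order]
    # 5. Project the standardized data onto the subspace.
    return X_std @ W, eig_vals[order]

# Random data standing in for a real dataset (assumption for the demo).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))
Y_demo, kept_vals = pca_project(X_demo, p=2)
print(Y_demo.shape)
```

The rest of the notebook walks through the same steps one cell at a time on the iris dataset.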

## Importing libraries

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import StandardScaler

## Reading the data

In [2]:
df = pd.read_csv("../datasets/iris/iris.csv")
df.head()

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [3]:
x = df.iloc[:,0:4].values  # the four numeric features
y = df.iloc[:,4].values    # the species labels

In [4]:
x_std = StandardScaler().fit_transform(x)  # zero mean and unit variance per column

## Computing the eigenvalues and eigenvectors

### Using the covariance matrix

Consider the following expressions, where $n$ is the number of observations and $\overline{x}$ is the vector of column means:

$$
\begin{align}
\sigma_{jk} &= \frac{1}{n-1}\sum_{i=1}^n \left(x_{ij} - \overline{x_j}\right) \left(x_{ik}-\overline{x_k}\right)\\
\Sigma &= \frac{1}{n-1} \left( X - \overline{x} \right)^T\left( X - \overline{x} \right) \\
\overline x &= \frac{1}{n}\sum_{i=1}^n x_i
\end{align}
$$

In [5]:
mean_vect = np.mean(x_std, axis = 0)
cov_matrix = (x_std - mean_vect).T.dot((x_std - mean_vect))/(x_std.shape[0]-1)
cov_matrix

array([[ 1.00671141, -0.11835884,  0.87760447,  0.82343066],
       [-0.11835884,  1.00671141, -0.43131554, -0.36858315],
       [ 0.87760447, -0.43131554,  1.00671141,  0.96932762],
       [ 0.82343066, -0.36858315,  0.96932762,  1.00671141]])
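
The diagonal entries are 1.0067 rather than exactly 1 because `StandardScaler` divides by the population standard deviation, while the covariance above divides by $n-1$; with 150 rows this gives $150/149 \approx 1.0067$. As a cross-check, the manual formula should agree with `np.cov`. The sketch below uses random data as a stand-in for `x_std` (an assumption, so that it runs on its own).

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))  # hypothetical stand-in for x_std

# Manual covariance, as in the cell above.
centered = X_demo - X_demo.mean(axis=0)
cov_manual = centered.T @ centered / (X_demo.shape[0] - 1)

# NumPy's estimator; rowvar=False means rows are observations.
cov_numpy = np.cov(X_demo, rowvar=False)

print(np.allclose(cov_manual, cov_numpy))
```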

In [6]:
eig_vals, eig_vect = np.linalg.eig(cov_matrix)
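
Because a covariance matrix is symmetric, `np.linalg.eigh` is a possible alternative to `np.linalg.eig`: it returns real eigenvalues in ascending order. The sketch below applies it to the covariance matrix printed above (copied to eight decimals, so results agree only approximately) and checks the defining property $\Sigma v = \lambda v$. The variable names are illustrative.

```python
import numpy as np

# Covariance matrix copied from the output above.
cov_doc = np.array([
    [ 1.00671141, -0.11835884,  0.87760447,  0.82343066],
    [-0.11835884,  1.00671141, -0.43131554, -0.36858315],
    [ 0.87760447, -0.43131554,  1.00671141,  0.96932762],
    [ 0.82343066, -0.36858315,  0.96932762,  1.00671141],
])

vals_h, vecs_h = np.linalg.eigh(cov_doc)
# eigh sorts ascending; reverse to get the largest eigenvalue first.
vals_h, vecs_h = vals_h[::-1], vecs_h[:, ::-1]

print(vals_h)
# Each column v of vecs_h satisfies cov_doc @ v = lambda * v.
print(np.allclose(cov_doc @ vecs_h, vecs_h * vals_h))
```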

### Using the correlation matrix

In [7]:
cor_mat = np.corrcoef(x_std.T)
cor_mat

array([[ 1.        , -0.11756978,  0.87175378,  0.81794113],
       [-0.11756978,  1.        , -0.4284401 , -0.36612593],
       [ 0.87175378, -0.4284401 ,  1.        ,  0.96286543],
       [ 0.81794113, -0.36612593,  0.96286543,  1.        ]])

In [8]:
eig_val_cor, eig_vec_cor = np.linalg.eig(cor_mat)
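
For data standardized with the population standard deviation, the covariance matrix is the correlation matrix scaled by $n/(n-1)$, so both share the same eigenvectors and their eigenvalues differ only by that factor. A small sketch of this relation, using random demo data (an assumption) instead of the iris features:

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))  # hypothetical data
n = X_demo.shape[0]

# Standardize with ddof=0, which is what StandardScaler does.
X_std_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

cov = np.cov(X_std_demo, rowvar=False)
cor = np.corrcoef(X_std_demo, rowvar=False)

vals_cov = np.linalg.eigvalsh(cov)  # ascending order
vals_cor = np.linalg.eigvalsh(cor)  # ascending order

print(np.allclose(vals_cov, vals_cor * n / (n - 1)))
```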

### Singular Value Decomposition

In [9]:
u,s,v = np.linalg.svd(x_std.T)  # v holds the transposed right singular vectors (V^T)

In [10]:
u

array([[-0.52106591, -0.37741762,  0.71956635,  0.26128628],
       [ 0.26934744, -0.92329566, -0.24438178, -0.12350962],
       [-0.5804131 , -0.02449161, -0.14212637, -0.80144925],
       [-0.56485654, -0.06694199, -0.63427274,  0.52359713]])

In [11]:
s

array([20.92306556, 11.7091661 ,  4.69185798,  1.76273239])

In [12]:
v

array([[ 1.08239531e-01,  9.94577561e-02,  1.12996303e-01, ...,
        -7.27030413e-02, -6.56112167e-02, -4.59137323e-02],
       [-4.09957970e-02,  5.75731483e-02,  2.92000319e-02, ...,
        -2.29793601e-02, -8.63643414e-02,  2.07800179e-03],
       [ 2.72186462e-02,  5.00034005e-02, -9.42089147e-03, ...,
        -3.84023516e-02, -1.98939364e-01, -1.12588405e-01],
       ...,
       [ 5.43380310e-02,  5.12936114e-03,  2.75184277e-02, ...,
         9.89532683e-01, -1.41206665e-02, -8.30595907e-04],
       [ 1.96438400e-03,  8.48544595e-02,  1.78604309e-01, ...,
        -1.25488246e-02,  9.52049996e-01, -2.19201906e-02],
       [ 2.46978090e-03,  5.83496936e-03,  1.49419118e-01, ...,
        -7.17729676e-04, -2.32048811e-02,  9.77300244e-01]])
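
The SVD connects to the eigendecomposition as follows: the columns of `u` match the covariance eigenvectors up to sign, and the squared singular values divided by $n-1$ give the eigenvalues. For the output above, $20.923^2/149 \approx 2.938$. The sketch below checks the relation on random centered data (an assumption, so that it runs on its own); `full_matrices=False` is passed only to avoid building the large square $V^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))      # hypothetical data
X_c = X_demo - X_demo.mean(axis=0)     # center the columns
n = X_c.shape[0]

# SVD of the transposed data, as in the cell above.
u_demo, s_demo, vh_demo = np.linalg.svd(X_c.T, full_matrices=False)

# Covariance eigenvalues, reversed into descending order.
eig_vals_demo = np.linalg.eigvalsh(np.cov(X_c, rowvar=False))[::-1]

print(np.allclose(s_demo**2 / (n - 1), eig_vals_demo))
```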

## Principal components

In [13]:
for ev in eig_vect.T:  # eigenvectors are the columns of eig_vect, so iterate over the transpose
    print('Eigenvector length: {:.4f}'.format(np.linalg.norm(ev)))

Eigenvector length: 1.0000
Eigenvector length: 1.0000
Eigenvector length: 1.0000
Eigenvector length: 1.0000


In [14]:
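eigen_pairs = [(np.abs(eig_vals[i]), eig_vect[:,i]) for i in range(len(eig_vals))]
eigen_pairs.sort(key=lambda pair: pair[0], reverse=True)  # np.linalg.eig does not guarantee an order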

In [15]:
eigen_pairs

[(2.9380850501999953,
  array([ 0.52106591, -0.26934744,  0.5804131 ,  0.56485654])),
 (0.9201649041624852,
  array([-0.37741762, -0.92329566, -0.02449161, -0.06694199])),
 (0.14774182104494754,
  array([-0.71956635,  0.24438178,  0.14212637,  0.63427274])),
 (0.020853862176463244,
  array([ 0.26128628, -0.12350962, -0.80144925,  0.52359713]))]

In [16]:
for ep in eigen_pairs:
    print(ep[0])

2.9380850501999953
0.9201649041624852
0.14774182104494754
0.020853862176463244


In [17]:
total_sum = sum(eig_vals)
var_exp = [(i/total_sum)*100 for i in sorted(eig_vals, reverse = True)]

In [18]:
var_exp

[72.9624454132999, 22.850761786701725, 3.6689218892828652, 0.5178709107155041]

In [19]:
# Projection matrix W (4 x 2): the two leading eigenvectors as columns
W = np.hstack((eigen_pairs[0][1].reshape(4,1),
              eigen_pairs[1][1].reshape(4,1)))

In [20]:
x[0]

array([5.1, 3.5, 1.4, 0.2])

## Projecting the data onto the new vector subspace

$$
Y = XW, \ X \in M(\mathbb R)_{150,4}, \ W \in M(\mathbb R)_{4,2}, \ Y \in M(\mathbb R)_{150,2}
$$

In [21]:
Y = x_std.dot(W)
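
One way to sanity-check the projection is to confirm that the projected columns are uncorrelated and that their variances equal the retained eigenvalues. The sketch below does this on random data shaped like the iris features (an assumption, so that it runs on its own); all names ending in `_demo` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))  # hypothetical stand-in for the iris features
X_std_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Eigendecomposition of the covariance matrix; keep the two leading directions.
vals, vecs = np.linalg.eigh(np.cov(X_std_demo, rowvar=False))
order = np.argsort(vals)[::-1][:2]
W_demo = vecs[:, order]          # shape (4, 2)
Y_demo = X_std_demo @ W_demo     # shape (150, 2)

print(Y_demo.shape)
# The covariance of the projected data is diagonal, with the kept eigenvalues on the diagonal.
print(np.allclose(np.cov(Y_demo, rowvar=False), np.diag(vals[order])))
```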