# PCA Implementation in NumPy

In [1]:
import numpy as np
# import sklearn as skl
from sklearn.decomposition import PCA

## Objectives
* Implement PCA step by step in NumPy
* Compare the results with Scikit-learn

## Implementation

1. Given a dataset $X \in \mathbb{R}^{n \times d}$ with $n$ samples and $d$ features, we want to reduce its dimensionality to $m$. The first step is to center the dataset by subtracting the mean of each feature (hint: use `np.mean`).

In [2]:
n = 10
d = 3
X = np.random.uniform(low=0., high=100., size=(n,d))
print(X)

[[42.2480501  35.29280919 82.73204353]
 [43.85261966 53.97444366 39.33641668]
 [92.05706613 91.62309171 10.6340016 ]
 [ 9.87765015 58.88238508 89.0634533 ]
 [ 5.59363733 41.89388889  3.1244253 ]
 [72.15263047  8.49744244 90.2232056 ]
 [78.25203309 15.29848102 26.03485004]
 [38.45827209 45.34578596 54.63736307]
 [ 7.12248678 94.38150298 25.48554755]
 [41.56659137 59.06929062 42.94906436]]


In [3]:
X = X-np.mean(X,axis=0)
print(X)

[[ -0.87005362 -15.13310296  36.31000643]
 [  0.73451594   3.54853151  -7.08562042]
 [ 48.93896242  41.19717956 -35.7880355 ]
 [-33.24045357   8.45647292  42.6414162 ]
 [-37.52446639  -8.53202327 -43.2976118 ]
 [ 29.03452675 -41.92846972  43.80116849]
 [ 35.13392938 -35.12743113 -20.38718706]
 [ -4.65983163  -5.08012619   8.21532597]
 [-35.99561694  43.95559082 -20.93648956]
 [ -1.55151235   8.64337846  -3.47297274]]
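
As a quick sanity check on the centering step, the sketch below confirms that every column mean is numerically zero after the subtraction. The seeded generator and the names `X_demo` and `X_centered` are assumptions made for this example only; they are not part of the cells above.

```python
import numpy as np

# Assumption: a fixed seed is used only to make this sketch reproducible.
rng = np.random.default_rng(0)
X_demo = rng.uniform(low=0.0, high=100.0, size=(10, 3))

# axis=0 averages over the samples, giving one mean per feature (column).
X_centered = X_demo - np.mean(X_demo, axis=0)

# After centering, each feature should average to (numerically) zero.
col_means = X_centered.mean(axis=0)
np.testing.assert_allclose(col_means, np.zeros(3), atol=1e-12)
```

Using `axis=1` instead would subtract a per-sample mean, which is not what PCA requires.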


2. Compute the covariance matrix of $X^T$. Review the theory to see why the transpose is used: by default `np.cov` treats each row as a variable and each column as an observation. Check the NumPy documentation for the functions that can be used.

In [4]:
cov = np.cov(X.T)
print(cov)

[[ 922.94825309 -216.94460388  -33.85925821]
 [-216.94460388  789.74005209 -381.26398456]
 [ -33.85925821 -381.26398456 1021.60545611]]
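
To make the role of the transpose concrete, the following sketch compares three ways of getting the same $d \times d$ matrix for centered data $X_c$: `np.cov(X_c.T)`, `np.cov(X_c, rowvar=False)`, and the explicit formula $\frac{1}{n-1} X_c^T X_c$. The seeded data and variable names are assumptions for illustration.

```python
import numpy as np

# Assumption: seeded random data, used only to keep the sketch reproducible.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 100.0, size=(10, 3))
X_centered = X_demo - X_demo.mean(axis=0)
n = X_centered.shape[0]

cov_transposed = np.cov(X_centered.T)             # rows are variables (NumPy default)
cov_rowvar = np.cov(X_centered, rowvar=False)     # columns are variables
cov_manual = X_centered.T @ X_centered / (n - 1)  # unbiased estimator, written out

np.testing.assert_allclose(cov_transposed, cov_rowvar)
np.testing.assert_allclose(cov_transposed, cov_manual)
```

Calling `np.cov(X_centered)` without the transpose would instead return an $n \times n$ matrix relating samples rather than features.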


3. Compute the eigenvalues and eigenvectors of the covariance matrix. Check the NumPy documentation.

In [5]:
w, v = np.linalg.eig(cov)

print("Eigenvalues:\n"+str(w))
print("Eigenvectors:\n"+str(v))

Eigenvalues:
[ 428.77084969  972.78170821 1332.7412034 ]
Eigenvectors:
[[ 0.3735812   0.88766881  0.26922326]
 [ 0.77032841 -0.13520131 -0.62314905]
 [ 0.51675064 -0.4401871   0.73430518]]
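
The sketch below checks how `np.linalg.eig` lays out its result, using the covariance matrix printed in step 2 (copied by hand, so rounded to 8 decimals). It verifies that each column of the eigenvector matrix pairs with the eigenvalue at the same index. It also shows `np.linalg.eigh` as a possible alternative for symmetric matrices; that is a suggestion, not something the exercise uses.

```python
import numpy as np

# Covariance matrix copied from the printed output above (rounded values).
cov_demo = np.array([
    [922.94825309, -216.94460388, -33.85925821],
    [-216.94460388, 789.74005209, -381.26398456],
    [-33.85925821, -381.26398456, 1021.60545611],
])

w_demo, v_demo = np.linalg.eig(cov_demo)

# Column v_demo[:, i] pairs with w_demo[i]: cov @ v_i == w_i * v_i.
np.testing.assert_allclose(cov_demo @ v_demo, v_demo * w_demo)

# eigh assumes a symmetric matrix and returns eigenvalues in ascending order.
w_sym, v_sym = np.linalg.eigh(cov_demo)
np.testing.assert_allclose(np.sort(w_demo), w_sym)
```

Because the eigenvectors are stored as columns, any later reordering has to act on the column axis.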


4. Sort the eigenvectors by decreasing eigenvalue; review the theory if necessary.

In [6]:
indx = w.argsort()[::-1]
v = v[:, indx]  # eigenvectors are the columns of v, so reorder the columns
print(v)

[[ 0.26922326  0.88766881  0.3735812 ]
 [-0.62314905 -0.13520131  0.77032841]
 [ 0.73430518 -0.4401871   0.51675064]]


5. Project the centered dataset onto the $m$ most relevant eigenvectors (hint: use `np.dot`).

In [7]:
m = 2
v_reduced = v[:, :m]
X_reduced = np.dot(X, v_reduced)
print(np.round(X_reduced, 2))  # rounded to two decimals for readability

[[ 35.86 -14.71]
 [ -7.22   3.29]
 [-38.78  53.63]
 [ 17.09 -49.42]
 [-36.58 -13.1 ]
 [ 66.11  12.16]
 [ 16.38  44.91]
 [  7.94  -7.07]
 [-52.46 -28.68]
 [ -8.35  -1.02]]
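
One way to check the sorting and projection steps is to confirm that the sample variance of each projected column equals the corresponding eigenvalue, which holds only when the eigenvector columns are reordered together with the eigenvalues. The sketch below does this on seeded random data; the seed and names such as `w_sorted`, `v_sorted`, and `Z` are assumptions for illustration.

```python
import numpy as np

# Assumption: seeded random data, used only to keep the sketch reproducible.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0.0, 100.0, size=(10, 3))
X_centered = X_demo - X_demo.mean(axis=0)

w_demo, v_demo = np.linalg.eig(np.cov(X_centered.T))

# Sort eigenvalues in decreasing order and apply the SAME order to the columns.
order = np.argsort(w_demo)[::-1]
w_sorted = w_demo[order]
v_sorted = v_demo[:, order]

m = 2
Z = X_centered @ v_sorted[:, :m]

# The variance along each principal direction equals its eigenvalue.
np.testing.assert_allclose(Z.var(axis=0, ddof=1), w_sorted[:m])
```

Indexing rows instead (`v_demo[order]`) would permute the coordinates inside every eigenvector, and the variance check above would generally fail.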


6. Consolidate the previous steps into a PCA function or class.

In [8]:
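def npPCA(X, m):
    """Project X (n samples x d features) onto its first m principal components."""
    # Center each feature at zero mean
    X = X - np.mean(X, axis=0)

    # Covariance matrix (d x d) and its eigendecomposition
    cov = np.cov(X.T)
    w, v = np.linalg.eig(cov)

    # Reorder the eigenvector columns by decreasing eigenvalue
    indx = w.argsort()[::-1]
    v = v[:, indx]

    # Keep the first m eigenvectors and project the centered data
    v_reduced = v[:, :m]
    X_reduced = np.dot(X, v_reduced)

    return X_reduced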

In [9]:
n = 10
d = 3
X = np.random.uniform(low=0., high=100., size=(n,d))
print(npPCA(X,2))

[[ 30.64860606 -19.66989077]
 [ -1.52914288  33.8536157 ]
 [-45.35675287   6.72044103]
 [-27.28657471  11.17135254]
 [-54.03793305 -13.02992229]
 [-15.77444765  17.73604996]
 [ 70.03467357  -8.51701226]
 [-19.04491666 -59.93965952]
 [-14.83219328  23.36422694]
 [ 77.17868146   8.31079868]]
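
Since step 6 also allows a class, here is a minimal sketch of what one could look like. The class name `NumpyPCA`, its attribute names, and the use of `np.linalg.eigh` in place of `np.linalg.eig` are all assumptions made for this example. Note that `components_` is stored here with shape `(d, m)`, which differs from Scikit-learn's `(m, d)` layout.

```python
import numpy as np


class NumpyPCA:
    """Hypothetical class wrapper around the PCA steps (illustrative names)."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        # Store the training mean so new data is centered the same way.
        self.mean_ = X.mean(axis=0)
        cov = np.cov((X - self.mean_).T)
        # Assumption: eigh is swapped in for eig because cov is symmetric.
        w, v = np.linalg.eigh(cov)
        order = np.argsort(w)[::-1]
        self.explained_variance_ = w[order][: self.n_components]
        self.components_ = v[:, order][:, : self.n_components]  # shape (d, m)
        return self

    def transform(self, X):
        # Center with the stored mean, then project onto the kept eigenvectors.
        return (X - self.mean_) @ self.components_

    def fit_transform(self, X):
        return self.fit(X).transform(X)


rng = np.random.default_rng(0)  # assumption: seed for reproducibility only
X_demo = rng.uniform(0.0, 100.0, size=(10, 3))
Z = NumpyPCA(n_components=2).fit_transform(X_demo)
print(Z.shape)  # (10, 2)
```

Separating `fit` from `transform` lets the same mean and components be reused on data that was not seen during fitting.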


7. Compare the results with the PCA model implemented in Scikit-learn ([see documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)). Use the following dataset:

$X=\begin{bmatrix}
0.8 & 0.7\\
0.1 & -0.1
\end{bmatrix}$

Reduce it to one component and verify the results with `np.testing.assert_allclose`.

In [10]:
X = np.array([[0.8,0.7],[0.1,-0.1]])

pca = PCA(n_components=1)
X_new = pca.fit_transform(X)
print("skl result:\n"+str(X_new))

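myPCA = npPCA(X,1)
print("npPCA result:\n"+str(myPCA))

# Raises AssertionError if the two projections differ (including in sign)
np.testing.assert_allclose(myPCA, X_new)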

skl result:
[[-0.53150729]
 [ 0.53150729]]
npPCA result:
[[-0.53150729]
 [ 0.53150729]]
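
Eigenvectors are only defined up to sign, so two correct PCA implementations can return projections that differ by a factor of $-1$ per component. The sketch below shows one way to make an `np.testing.assert_allclose` comparison robust to that. To stay dependency-free it compares against an SVD-based reference written in NumPy rather than against Scikit-learn; that substitution, and the helper names `pca_eig`, `pca_svd`, and `align_signs`, are assumptions for illustration.

```python
import numpy as np


def pca_eig(X, m):
    # Same steps as the exercise: center, covariance, eig, sort, project.
    Xc = X - X.mean(axis=0)
    w, v = np.linalg.eig(np.cov(Xc.T))
    order = np.argsort(w)[::-1]
    return Xc @ v[:, order][:, :m]


def pca_svd(X, m):
    # Reference: the right singular vectors of the centered data
    # are the principal directions, already sorted by singular value.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:m].T


def align_signs(Z, Z_ref):
    # Flip each column of Z so it points the same way as the reference column.
    signs = np.sign(np.sum(Z * Z_ref, axis=0))
    return Z * signs


X_small = np.array([[0.8, 0.7], [0.1, -0.1]])
Z_eig = pca_eig(X_small, 1)
Z_ref = pca_svd(X_small, 1)

np.testing.assert_allclose(align_signs(Z_eig, Z_ref), Z_ref, atol=1e-12)
```

The same `align_signs` step could be applied before comparing against the output of `PCA.fit_transform`.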
