# Principal Component Analysis
The goal is to produce a reduced-dimension version of the VGG features.

In [1]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np

In [2]:
infile = '/home/jrm/Martinez/images/features/G15_I1.features.csv'
# np.loadtxt accepts a path directly, so no file handle is left open
raw_data = np.loadtxt(infile, delimiter=",")
raw_data.shape

(40, 25088)

In [3]:
scaler = StandardScaler()
scaler.fit(raw_data)
scaled_data = scaler.transform(raw_data)
scaled_data.shape

(40, 25088)

## Notes on PCA
From https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html:  
"The input data is centered but not scaled for each feature before applying the SVD."  

Open question: do VGG features need to be scaled to unit variance? Here they are standardized with StandardScaler before PCA.  
For now, we are not partitioning the data into train and test sets.
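
To make the "centered but not scaled" note concrete, here is a minimal sketch on synthetic data (the array `X` and its column scales are made up for illustration and are not the VGG features). It suggests that shifting the data leaves the explained-variance ratios unchanged, while standardizing the columns changes them substantially.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for a feature matrix: 40 samples, 5 features,
# with the last feature on a much larger scale than the others.
X = rng.normal(size=(40, 5)) * np.array([1.0, 1.0, 1.0, 1.0, 100.0])

# PCA centers internally, so adding a constant offset changes nothing
# (up to floating-point error).
ratio_raw = PCA().fit(X).explained_variance_ratio_
ratio_shifted = PCA().fit(X + 1000.0).explained_variance_ratio_

# PCA does not rescale, so the large-scale feature dominates
# unless the columns are standardized first.
X_scaled = StandardScaler().fit_transform(X)
ratio_scaled = PCA().fit(X_scaled).explained_variance_ratio_

print(ratio_raw[0], ratio_shifted[0], ratio_scaled[0])
```

Whether standardizing is appropriate for VGG features remains an open question here; the sketch only shows that the choice matters when feature scales differ.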

In [4]:
model = PCA()
model.fit(scaled_data)
pca_data = model.transform(scaled_data)

In [5]:
pca_data.shape

(40, 40)

In [6]:
model.explained_variance_ratio_

array([4.23736733e-02, 3.88529436e-02, 3.73245623e-02, 3.61506704e-02,
       3.46757405e-02, 3.33440294e-02, 3.31600864e-02, 3.28171457e-02,
       3.19821134e-02, 3.08648289e-02, 3.01181019e-02, 2.92716074e-02,
       2.91256308e-02, 2.83477290e-02, 2.76216937e-02, 2.75027228e-02,
       2.66044836e-02, 2.63870156e-02, 2.57683148e-02, 2.53383236e-02,
       2.45337143e-02, 2.36470808e-02, 2.34585390e-02, 2.30035640e-02,
       2.24203457e-02, 2.20806599e-02, 2.17605971e-02, 2.13491166e-02,
       2.10972618e-02, 2.02750482e-02, 1.98264638e-02, 1.90943506e-02,
       1.82065833e-02, 1.71271026e-02, 1.67413494e-02, 1.51994239e-02,
       1.46185525e-02, 1.44386287e-02, 1.34902004e-02, 8.35673096e-33])

The first PC explains only about 4% of the variance.  
The scree plot is flat, so there is no natural cutoff (except at the 39th PC, after which the explained variance drops to essentially zero).  
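
One way to pick a cutoff when the scree plot is flat is to look at the cumulative explained variance. The following is a rough sketch on synthetic data (the matrix `X` and the 90% `threshold` are hypothetical choices, not values from this notebook). It also illustrates why the last component is near zero: centering 40 samples removes one degree of freedom, so at most 39 components can carry variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 200))  # synthetic: 40 samples, 200 features

model = PCA().fit(X)
ratios = model.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.90  # hypothetical cutoff
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

# 40 components are returned, but the last one carries essentially no variance.
print(len(ratios), ratios[-1], n_keep)
```

With a flat spectrum like the one above, a cumulative threshold tends to keep most of the components, so the dimensionality reduction is modest.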

In [7]:
sum(model.explained_variance_ratio_)  # 100% because all 40 PCs are kept; the first few PCs leave most variance unexplained

1.0000000000000002

In [8]:
ofile = '/home/jrm/Martinez/images/features/G15_I1.pca.csv'
np.savetxt(ofile, pca_data, delimiter=",")

## K-means

In [10]:
from sklearn.cluster import KMeans

In [11]:
km = KMeans(n_clusters=2)
km.fit(pca_data)
km_data = km.transform(pca_data)  # distance from each sample to each cluster center
km_data.shape

(40, 2)

In [15]:
km.n_iter_

2

In [16]:
km.inertia_

692512.0266990949

In [14]:
km.cluster_centers_

array([[ 2.70455612e+01,  3.44012091e+00, -1.21510146e+01,
        -3.52729189e+00,  5.95358564e-01,  1.75037618e+00,
        -3.83699890e+00, -5.75177010e+00, -1.51744167e+00,
        -2.78328682e+00,  3.17348137e+00, -1.16922559e+00,
         4.57167910e+00,  9.06653869e-01, -2.27556488e+00,
         5.27426127e+00,  3.36851877e+00,  3.06208071e+00,
         2.40790268e+00,  4.79051960e+00, -2.32042624e+00,
        -1.79774635e+00, -6.25807943e-01, -2.48043901e-01,
         1.33286352e+00,  5.72501228e-01, -1.48182423e-01,
         1.34000235e-01, -1.42171587e+00,  9.39441105e-01,
         2.76874243e+00, -1.03863886e+00, -6.23574366e-01,
         3.31978274e+00,  7.48499978e-01, -1.57717131e+00,
         5.48697956e-01, -7.12970258e-01, -5.01014864e-01,
         4.94995393e-15],
       [-1.62273367e+01, -2.06407254e+00,  7.29060876e+00,
         2.11637513e+00, -3.57215139e-01, -1.05022571e+00,
         2.30219934e+00,  3.45106206e+00,  9.10465002e-01,
         1.66997209e+00, -1.90

In [17]:
km.labels_

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0], dtype=int32)

In [12]:
km_data

array([[140.87358084, 130.3836165 ],
       [114.13239938, 103.72965974],
       [134.96171657, 123.68116144],
       [134.86111748, 123.39476012],
       [138.06813548, 127.18582978],
       [151.69573303, 141.10926647],
       [123.68157445, 112.75590263],
       [153.04411251, 140.72638012],
       [155.52547652, 143.22747908],
       [115.25576088, 103.20968297],
       [129.58022647, 118.56915449],
       [144.89036375, 132.51389567],
       [133.24957777, 122.57331958],
       [153.71996886, 141.37371834],
       [110.5433334 ,  98.34719018],
       [128.61280731, 117.03814415],
       [168.12698188, 156.66885503],
       [137.96850703, 126.99946254],
       [133.05865701, 122.73049738],
       [142.39117467, 130.61935901],
       [161.73715647, 154.39125473],
       [147.65733148, 160.74187528],
       [105.40991368, 115.96615715],
       [103.14922427, 112.7318557 ],
       [115.54916471, 124.83692401],
       [133.48404277, 144.99974983],
       [142.21396221, 151.02338172],
 