# PCA (Principal Component Analysis)

PCA stands for Principal Component Analysis. It is a commonly used technique in machine learning, statistics, and data science for reducing the dimensionality of a dataset.

PCA works by identifying the directions in which the data varies the most, which are called the principal components. Each principal component is a linear combination of the original features of the dataset, chosen to maximize the variance of the data along its direction while staying orthogonal to the components that come before it.

By retaining only the top k principal components, PCA reduces the dimensionality of the dataset to k, while still retaining as much of the original variance as possible. This can be useful for visualizing high-dimensional data or for reducing the computational complexity of a model that relies on that data.

PCA can also be used for feature extraction and data compression. It is a powerful tool that has many applications in fields like finance, image processing, and natural language processing.

<img src = "https://learnopencv.com/wp-content/uploads/2018/01/principal-component-analysis.png">

Here we will build our own Principal Component Analysis from scratch.

**Modules Used**
* Pandas
* NumPy

**Note**
* This is the [dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/magic/)

In [1]:
import pandas as pd 
import numpy as np 

Let's import our dataset.

In [21]:
pd.read_csv("/content/magic04.data")

Unnamed: 0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g
0,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.2610,g
1,162.0520,136.0310,4.0612,0.0374,0.0187,116.7410,-64.8580,-45.2160,76.9600,256.7880,g
2,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.4490,116.7370,g
3,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.6480,356.4620,g
4,51.6240,21.1502,2.9085,0.2420,0.1340,50.8761,43.1887,9.8145,3.6130,238.0980,g
...,...,...,...,...,...,...,...,...,...,...,...
19014,21.3846,10.9170,2.6161,0.5857,0.3934,15.2618,11.5245,2.8766,2.4229,106.8258,h
19015,28.9452,6.7020,2.2672,0.5351,0.2784,37.0816,13.1853,-2.9632,86.7975,247.4560,h
19016,75.4455,47.5305,3.4483,0.1417,0.0549,-9.3561,41.0562,-9.4662,30.2987,256.5166,h
19017,120.5135,76.9018,3.9939,0.0944,0.0683,5.8043,-93.5224,-63.8389,84.6874,408.3166,h
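
Notice that the first data row has ended up as the column names, because the file has no header row. Below is a minimal sketch of one way to avoid that. The column names are an assumption (quoted from memory of the dataset's description file; any labels would do), and the two sample rows are the first two rows shown above, read from a string instead of the real file.

```python
import io
import pandas as pd

# Assumed column names (from the dataset description); any labels would work.
column_names = ["fLength", "fWidth", "fSize", "fConc", "fConc1",
                "fAsym", "fM3Long", "fM3Trans", "fAlpha", "fDist", "class"]

# Stand-in for /content/magic04.data: the first two rows shown above.
sample_file = io.StringIO(
    "28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828,g\n"
    "31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.2610,g\n"
)

# header=None tells pandas that the first line is data, not column names.
df = pd.read_csv(sample_file, header=None, names=column_names)
features = df.drop("class", axis=1)
```

With the real file, the same `header=None` and `names=` arguments would keep the first row as data.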


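We don't want the target column, so we drop it. (The file has no header row, so pandas used the first data row as column names. That is why the target column is called `g` here, and why that first row is no longer part of the data.)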

In [22]:
pd.read_csv("/content/magic04.data").drop("g" , axis = 1)

Unnamed: 0,28.7967,16.0021,2.6449,0.3918,0.1982,27.7004,22.011,-8.2027,40.092,81.8828
0,31.6036,11.7235,2.5185,0.5303,0.3773,26.2722,23.8238,-9.9574,6.3609,205.2610
1,162.0520,136.0310,4.0612,0.0374,0.0187,116.7410,-64.8580,-45.2160,76.9600,256.7880
2,23.8172,9.5728,2.3385,0.6147,0.3922,27.2107,-6.4633,-7.1513,10.4490,116.7370
3,75.1362,30.9205,3.1611,0.3168,0.1832,-5.5277,28.5525,21.8393,4.6480,356.4620
4,51.6240,21.1502,2.9085,0.2420,0.1340,50.8761,43.1887,9.8145,3.6130,238.0980
...,...,...,...,...,...,...,...,...,...,...
19014,21.3846,10.9170,2.6161,0.5857,0.3934,15.2618,11.5245,2.8766,2.4229,106.8258
19015,28.9452,6.7020,2.2672,0.5351,0.2784,37.0816,13.1853,-2.9632,86.7975,247.4560
19016,75.4455,47.5305,3.4483,0.1417,0.0549,-9.3561,41.0562,-9.4662,30.2987,256.5166
19017,120.5135,76.9018,3.9939,0.0944,0.0683,5.8043,-93.5224,-63.8389,84.6874,408.3166


We also need this as a NumPy array.

In [23]:
np.array(pd.read_csv("/content/magic04.data").drop("g" , axis = 1))

array([[ 31.6036,  11.7235,   2.5185, ...,  -9.9574,   6.3609, 205.261 ],
       [162.052 , 136.031 ,   4.0612, ..., -45.216 ,  76.96  , 256.788 ],
       [ 23.8172,   9.5728,   2.3385, ...,  -7.1513,  10.449 , 116.737 ],
       ...,
       [ 75.4455,  47.5305,   3.4483, ...,  -9.4662,  30.2987, 256.5166],
       [120.5135,  76.9018,   3.9939, ..., -63.8389,  84.6874, 408.3166],
       [187.1814,  53.0014,   3.2093, ...,  31.4755,  52.731 , 272.3174]])

Or, storing the result in a variable:

In [5]:
data = np.array(pd.read_csv("/content/magic04.data").drop("g" , axis = 1))

To find the values of the transformed data, we use the eigenvalues and eigenvectors of the covariance matrix of the data. So now we need to calculate the eigenvalues and eigenvectors of that covariance matrix.

But first we need to find the covariance of the values.

To decide what to keep and what to discard, we first need to know how the features covary. For simplicity, we will mostly compare two features at a time.

Let's say we have two arrays, `array_1` and `array_2`. Their sample covariance is $$\operatorname{cov}(x , y) = \frac {\sum \limits _{i = 1} ^ {n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$ where $\bar{x}$ and $\bar{y}$ are the means of the two arrays and $n$ is the number of values.

In [24]:
array_1 = np.array([x for x in range(0 , 40)])
array_2 = np.array([x for x in range(40 , 120 , 2)])

Let's find the covariance between these two arrays.

In [25]:
num = 0
for i , j in zip(array_1 , array_2):
    num += ((i - np.mean(array_1)) * (j - np.mean(array_2)))

In [26]:
num

10660.0

In [27]:
co = num / (39)

In [28]:
co

273.3333333333333

Let's find the covariance with NumPy.

In [29]:
np.cov(array_1 , array_2)

array([[136.66666667, 273.33333333],
       [273.33333333, 546.66666667]])
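
The loop above can also be written without an explicit loop. This is a small sketch, not the only way to do it; `sample_covariance` is a hypothetical helper name, and the result is compared with the off-diagonal entry of `np.cov`.

```python
import numpy as np

def sample_covariance(x, y):
    """Sample covariance of two equal-length 1D arrays (divides by n - 1)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (len(x) - 1)

array_1 = np.arange(0, 40)
array_2 = np.arange(40, 120, 2)

manual = sample_covariance(array_1, array_2)
from_numpy = np.cov(array_1, array_2)[0, 1]  # off-diagonal entry

print(manual, from_numpy)
```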

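You might be wondering why we got a matrix. This is because `np.cov()` returns the full covariance matrix: the diagonal holds the variance of each array, and the off-diagonal entries hold the covariance between the two arrays.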

<img src = "https://www.statlect.com/images/covariance-matrix__32.png">

We can also pass both arrays as the rows of a single 2D array:

In [31]:
sample_array = np.array([[x for x in range(0 , 40)] , [x for x in range(40 , 120 , 2)]])
np.cov(sample_array)

array([[136.66666667, 273.33333333],
       [273.33333333, 546.66666667]])

Let's try with $3$ arrays.

In [32]:
sample_array = np.array([[x for x in range(0 , 40)] , [x for x in range(40 , 120 , 2)] , [x for x in range(120 , 240 , 3)]])
np.cov(sample_array)

array([[ 136.66666667,  273.33333333,  410.        ],
       [ 273.33333333,  546.66666667,  820.        ],
       [ 410.        ,  820.        , 1230.        ]])
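
The three rows used here are exact linear functions of one another (the second is 40 + 2x and the third is 120 + 3x of the first), which is why every entry of the matrix is a multiple of 136.67. A short sketch of what that means in practice: `np.corrcoef` rescales the covariance to correlations, and `np.linalg.eigvalsh` gives the eigenvalues of the symmetric covariance matrix.

```python
import numpy as np

base = np.arange(0, 40)
sample_array = np.array([base, 40 + 2 * base, 120 + 3 * base])

cov_matrix = np.cov(sample_array)
corr_matrix = np.corrcoef(sample_array)   # covariance rescaled to [-1, 1]

# Eigenvalues of the covariance matrix, in ascending order.
eigenvalues = np.linalg.eigvalsh(cov_matrix)
share_of_largest = eigenvalues[-1] / eigenvalues.sum()

print(corr_matrix)
print(share_of_largest)
```

Every correlation comes out as 1 (up to rounding), and a single direction carries essentially all of the variance, so one principal component would be enough for this toy data.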

So now we can calculate the covariance of our whole dataset.

In [33]:
np.cov(data)

array([[ 3973.49360177,  4745.96701765,  2248.03224613, ...,
         4846.35035653,  7935.63989139,  5128.45896296],
       [ 4745.96701765, 10580.97364716,  3162.64152849, ...,
         6176.00353497, 12901.07140578,  9389.93265368],
       [ 2248.03224613,  3162.64152849,  1342.63308859, ...,
         2703.94298205,  4860.64839684,  3214.20348159],
       ...,
       [ 4846.35035653,  6176.00353497,  2703.94298205, ...,
         6395.97517425, 10346.63306902,  7880.87014123],
       [ 7935.63989139, 12901.07140578,  4860.64839684, ...,
        10346.63306902, 19746.25985233, 15634.46822326],
       [ 5128.45896296,  9389.93265368,  3214.20348159, ...,
         7880.87014123, 15634.46822326, 18350.5052724 ]])

Let's store this in a variable.

In [34]:
cov = np.cov(data)

In [35]:
cov

array([[ 3973.49360177,  4745.96701765,  2248.03224613, ...,
         4846.35035653,  7935.63989139,  5128.45896296],
       [ 4745.96701765, 10580.97364716,  3162.64152849, ...,
         6176.00353497, 12901.07140578,  9389.93265368],
       [ 2248.03224613,  3162.64152849,  1342.63308859, ...,
         2703.94298205,  4860.64839684,  3214.20348159],
       ...,
       [ 4846.35035653,  6176.00353497,  2703.94298205, ...,
         6395.97517425, 10346.63306902,  7880.87014123],
       [ 7935.63989139, 12901.07140578,  4860.64839684, ...,
        10346.63306902, 19746.25985233, 15634.46822326],
       [ 5128.45896296,  9389.93265368,  3214.20348159, ...,
         7880.87014123, 15634.46822326, 18350.5052724 ]])

But there is one thing we are missing: we want the covariance of the mean-centered data, that is, the data with its mean subtracted.

In [38]:
mean = data - np.mean(data)
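
A point worth checking in this centering step: `np.mean(data)` without an `axis` argument returns a single number, the mean over every entry of the array. For PCA, each feature is normally centered on its own mean. The sketch below uses made-up numbers (the `toy` array is hypothetical). The covariance comes out the same either way, because `np.cov` subtracts the per-variable means itself, but the centered values that later get projected do differ.

```python
import numpy as np

toy = np.array([[1.0, 100.0],
                [2.0, 200.0],
                [3.0, 300.0]])

global_centered = toy - np.mean(toy)          # one scalar mean for everything
column_centered = toy - np.mean(toy, axis=0)  # one mean per feature (column)

global_col_means = global_centered.mean(axis=0)
column_col_means = column_centered.mean(axis=0)

print(global_col_means)   # not zero: the columns are still off-center
print(column_col_means)   # zero for every column
```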

In [39]:
cov = np.cov(mean)

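`np.cov` treats each row as a variable and each column as an observation, so up to now we have computed a covariance between samples rather than between features. We need the covariance of the transpose of the data.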

In [40]:
cov = np.cov(mean.T)
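
To see why the transpose matters, here is a small sketch on random data (the array `X` is hypothetical). `np.cov` also accepts `rowvar=False`, which gives the same result as transposing first.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))   # hypothetical data: 100 samples, 4 features

cov_samples = np.cov(X)                  # rows as variables -> (100, 100)
cov_transposed = np.cov(X.T)             # features as variables -> (4, 4)
cov_rowvar = np.cov(X, rowvar=False)     # same result without transposing

print(cov_samples.shape, cov_transposed.shape, cov_rowvar.shape)
```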

In [42]:
cov

array([[ 1.79484386e+03,  5.98887086e+02,  1.40647372e+01,
        -4.88723699e+00, -2.80054871e+00, -9.24441522e+02,
        -2.58726016e+02,  1.18037602e+01, -9.69132366e+00,
         1.32478986e+03],
       [ 5.98887086e+02,  3.36593472e+02,  6.22137035e+00,
        -2.04524066e+00, -1.17829371e+00, -2.89977185e+02,
        -1.64898679e+02,  1.51841391e+01,  3.16424013e+01,
         4.61774113e+02],
       [ 1.40647372e+01,  6.22137035e+00,  2.23359521e-01,
        -7.35148362e-02, -4.22456089e-02, -4.47300996e+00,
         2.29376373e+00,  1.52049280e-01, -2.30292805e+00,
         1.54352188e+01],
       [-4.88723699e+00, -2.04524066e+00, -7.35148362e-02,
         3.34223972e-02,  1.97273268e-02,  1.21523946e+00,
        -1.13659419e+00, -4.29996627e-02,  1.12278904e+00,
        -4.48582887e+00],
       [-2.80054871e+00, -1.17829371e+00, -4.22456089e-02,
         1.97273268e-02,  1.22132646e-02,  6.55394863e-01,
        -6.69414421e-01, -2.52476965e-02,  6.62952807e-01,
        -2.

Now we need to find the eigenvalues and eigenvectors of this matrix.

In [41]:
eigen_values , eigen_vectors = np.linalg.eig(cov)
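
Because a covariance matrix is symmetric, `np.linalg.eigh` is an alternative to `np.linalg.eig` here; it returns real eigenvalues in ascending order. The sketch below, on hypothetical random data, compares the two and checks the defining property that each column `v` of the eigenvector matrix satisfies `cov @ v = value * v`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # hypothetical data
cov = np.cov(X, rowvar=False)

values_eig, vectors_eig = np.linalg.eig(cov)
values_eigh, vectors_eigh = np.linalg.eigh(cov)   # for symmetric matrices

# eigh returns eigenvalues in ascending order; sort eig's output to compare.
same_values = np.allclose(np.sort(values_eig), values_eigh)

# Each column v of the eigenvector matrix satisfies cov @ v == value * v.
residual = cov @ vectors_eigh - vectors_eigh * values_eigh

print(same_values, np.abs(residual).max())
```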

In [43]:
eigen_values

array([6.57940709e+03, 3.85406143e+03, 2.01648128e+03, 1.32681867e+03,
       6.10218593e+02, 4.33618554e+02, 1.17335954e+02, 8.64007193e-02,
       1.07368268e-02, 3.85330787e-04])
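
Each eigenvalue is the variance of the data along its eigenvector, so the eigenvalues tell us how much variance a given number of components retains. A short sketch using the values printed above (copied by hand into the array):

```python
import numpy as np

# Eigenvalues copied from the output above.
eigen_values = np.array([6.57940709e+03, 3.85406143e+03, 2.01648128e+03,
                         1.32681867e+03, 6.10218593e+02, 4.33618554e+02,
                         1.17335954e+02, 8.64007193e-02, 1.07368268e-02,
                         3.85330787e-04])

explained_ratio = eigen_values / eigen_values.sum()
cumulative = np.cumsum(explained_ratio)

for k, share in enumerate(cumulative, start=1):
    print(f"top {k:2d} components: {share:.4f} of the total variance")
```

For these values, the first five components account for roughly 96% of the total variance.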

In [44]:
eigen_vectors

array([[ 3.27831566e-01,  1.34391257e-01, -6.54919335e-02,
        -8.63984924e-01, -9.45486785e-02, -1.98590747e-02,
         3.38166047e-01,  3.99908190e-03, -4.54855335e-04,
         4.57009104e-05],
       [ 1.13818908e-01,  5.38781071e-02,  1.90004002e-02,
        -3.19763260e-01,  1.97550122e-02,  2.32607173e-02,
        -9.38308008e-01,  1.33846960e-02,  6.74914234e-04,
        -5.73082482e-05],
       [ 3.06095229e-03, -6.59527755e-04, -1.05914677e-03,
        -6.93020544e-03, -2.65186570e-03,  1.33537798e-04,
        -1.08822564e-02, -9.27766268e-01, -3.72227209e-01,
         2.26325035e-02],
       [-9.27207133e-04,  2.58262608e-04,  5.10330302e-04,
         2.60551065e-03,  1.53901245e-03, -3.47117394e-05,
         3.06281191e-03,  3.22018741e-01, -7.69225390e-01,
         5.51884502e-01],
       [-5.21959171e-04,  1.52747029e-04,  2.94589142e-04,
         1.51272408e-03,  9.19355464e-04, -2.22106506e-05,
         1.81381210e-03,  1.87999579e-01, -5.19363159e-01,
        -8.

Now we need to sort these values so that the largest eigenvalue comes first.

In [45]:
idx = np.argsort(eigen_values)

In [46]:
idx

array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])

`np.argsort` sorts in ascending order, but we need the eigenvalues in descending order, so we reverse the index.

In [47]:
idx = np.argsort(eigen_values)[::-1]

In [48]:
idx

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
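
Here the eigenvalues already came out in descending order, which `np.linalg.eig` does not guarantee. The sketch below uses a hypothetical matrix whose eigenvalues are not sorted, and shows the reordering. Since NumPy returns the eigenvectors as the columns of the matrix, the columns are what get reordered.

```python
import numpy as np

# Hypothetical symmetric matrix whose eigenvalues do not come out sorted.
matrix = np.diag([2.0, 5.0, 1.0])
values, vectors = np.linalg.eig(matrix)

order = np.argsort(values)[::-1]     # indices from largest to smallest
sorted_values = values[order]
sorted_vectors = vectors[:, order]   # reorder columns, one per eigenvalue

print(sorted_values)
```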

Let's assume we want to reduce the data to $5$ columns.

In [51]:
eigen_values = eigen_values[idx]
eigen_vectors = eigen_vectors[idx]

In [53]:
components = eigen_vectors[0 : 5]

In [54]:
components

array([[ 3.27831566e-01,  1.34391257e-01, -6.54919335e-02,
        -8.63984924e-01, -9.45486785e-02, -1.98590747e-02,
         3.38166047e-01,  3.99908190e-03, -4.54855335e-04,
         4.57009104e-05],
       [ 1.13818908e-01,  5.38781071e-02,  1.90004002e-02,
        -3.19763260e-01,  1.97550122e-02,  2.32607173e-02,
        -9.38308008e-01,  1.33846960e-02,  6.74914234e-04,
        -5.73082482e-05],
       [ 3.06095229e-03, -6.59527755e-04, -1.05914677e-03,
        -6.93020544e-03, -2.65186570e-03,  1.33537798e-04,
        -1.08822564e-02, -9.27766268e-01, -3.72227209e-01,
         2.26325035e-02],
       [-9.27207133e-04,  2.58262608e-04,  5.10330302e-04,
         2.60551065e-03,  1.53901245e-03, -3.47117394e-05,
         3.06281191e-03,  3.22018741e-01, -7.69225390e-01,
         5.51884502e-01],
       [-5.21959171e-04,  1.52747029e-04,  2.94589142e-04,
         1.51272408e-03,  9.19355464e-04, -2.22106506e-05,
         1.81381210e-03,  1.87999579e-01, -5.19363159e-01,
        -8.

Now we just need to take the dot product of the centered data with these components.

In [55]:
x = np.dot(mean , components)

ValueError: ignored

We got a shape error: `mean` has 10 columns, but `components` has shape (5, 10). We just need to transpose `components`, and we will be good to go.

In [56]:
x = np.dot(mean , components.T)

In [57]:
x

array([[  26.13864511,   13.34951551,   51.11177255,  101.80343691,
        -140.64183336],
       [  54.01175843,  119.96228383,   60.00797544,   64.21518584,
        -227.10318683],
       [  12.97905164,   40.8031363 ,   45.29001402,   50.6213786 ,
         -68.49383771],
       ...,
       [  52.15726041,    3.42607804,   42.82665621,  111.85679948,
        -195.69523009],
       [  24.83685928,  136.09296087,   78.04731071,  135.83992826,
        -360.97064514],
       [  21.84456435,  209.86409383,   -0.55371305,  115.76752434,
        -213.25422782]])
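
One caveat about the steps above: `np.linalg.eig` returns the eigenvectors as the columns of `eigen_vectors`, whereas the cells above slice rows (`eigen_vectors[idx]`, `eigen_vectors[0 : 5]`), and `np.mean(data)` centers on a single global mean. The sketch below, on hypothetical random data, selects columns and centers per feature. With columns, no transpose is needed, and the covariance of the projected data is diagonal with the top eigenvalues on the diagonal, which is a handy correctness check.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 500 samples, 4 correlated features.
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

centered = X - X.mean(axis=0)
cov = np.cov(centered, rowvar=False)
values, vectors = np.linalg.eigh(cov)
order = np.argsort(values)[::-1]
values, vectors = values[order], vectors[:, order]

k = 2
components = vectors[:, :k]          # columns, shape (4, k)
scores = centered @ components       # shape (500, k), no transpose needed

# Covariance of the scores should be diagonal with the top-k eigenvalues.
scores_cov = np.cov(scores, rowvar=False)
print(scores_cov)
print(values[:k])
```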

Now that we have built our own PCA, we just need to wrap it all in a function.

In [58]:
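def PCA(array , n_components):
    # Center each feature (column) on its own mean
    mean = array - np.mean(array , axis = 0)
    cov = np.cov(mean.T)
    
    # np.linalg.eig returns the eigenvectors as the columns of eigen_vectors
    eigen_values , eigen_vectors = np.linalg.eig(cov)
    
    # Sort by eigenvalue (not eigenvector), largest first, and reorder the columns
    idx = np.argsort(eigen_values)[::-1]
    eigen_vectors = eigen_vectors[: , idx]
    
    # Keep the first n_components columns, shape (n_features , n_components)
    components = eigen_vectors[: , 0 : n_components]
    
    return np.dot(mean , components)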
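
One way to sanity-check a PCA routine is to compare it with the singular value decomposition of the centered data, which should give the same projections up to the sign of each component. The sketch below is self-contained and uses hypothetical helper names (`pca_eig_sketch`, `pca_svd_sketch`) and random data; it is meant as a cross-check, not a reference implementation.

```python
import numpy as np

def pca_eig_sketch(array, n_components):
    """Hypothetical helper: PCA scores via the covariance eigendecomposition."""
    centered = array - array.mean(axis=0)
    values, vectors = np.linalg.eigh(np.cov(centered, rowvar=False))
    order = np.argsort(values)[::-1]
    return centered @ vectors[:, order[:n_components]]

def pca_svd_sketch(array, n_components):
    """Hypothetical helper: the same scores via SVD of the centered data."""
    centered = array - array.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])

scores_eig = pca_eig_sketch(X, 2)
scores_svd = pca_svd_sketch(X, 2)
# Each component is only defined up to sign, so compare absolute values.
match = np.allclose(np.abs(scores_eig), np.abs(scores_svd))
print(match)
```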