## Understanding PCA with a simple example

Let's say you have a dataset of students' grades on a test. The dataset has five features: math, science, English, history, and social studies. Each student has a grade for each subject.

| Student | Math | Science | English | History | Social Studies |
|---|---|---|---|---|---|
| A | 90 | 73 | 70 | 85 | 86 |
| B | 83 | 85 | 90 | 65 | 75 |
| C | 70 | 75 | 68 | 92 | 65 |
| D | 60 | 65 | 50 | 45 | 59 |
| E | 77 | 65 | 80 | 75 | 90 |

Recreate the table above as a pandas DataFrame (Social Studies is abbreviated to `SS`).

In [48]:
# import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [28]:
df = pd.DataFrame({
    'Student' : ['A','B','C','D','E'],
    'Math' : [90,83,70,60,77],
    'Science' : [73,85,75,65,65],
    'English' : [70,90,68,50,80],
    'History' : [85,65,92,45,75],
    'SS':[86,75,65,59,90]})

df.set_index('Student',inplace = True)

In [29]:
df

| Student | Math | Science | English | History | SS |
|---|---|---|---|---|---|
| A | 90 | 73 | 70 | 85 | 86 |
| B | 83 | 85 | 90 | 65 | 75 |
| C | 70 | 75 | 68 | 92 | 65 |
| D | 60 | 65 | 50 | 45 | 59 |
| E | 77 | 65 | 80 | 75 | 90 |


In [54]:
# Calculate the mean of each column
mean_values = df.mean()

# Center the dataset by subtracting the mean
centered_data = df - mean_values

# Display the centered dataset
print(centered_data)

         Math  Science  English  History    SS
Student                                       
A        14.0      0.4     -1.6     12.6  11.0
B         7.0     12.4     18.4     -7.4   0.0
C        -6.0      2.4     -3.6     19.6 -10.0
D       -16.0     -7.6    -21.6    -27.4 -16.0
E         1.0     -7.6      8.4      2.6  15.0


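With the data centered (each column's mean subtracted), calculate the covariance matrix. The covariance matrix is a square, symmetric matrix: each diagonal entry is the variance of one feature, and each off-diagonal entry is the covariance of a pair of features, which measures how the two vary together.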

In [55]:
covariance_matrix = centered_data.cov()
covariance_matrix

| | Math | Science | English | History | SS |
|---|---|---|---|---|---|
| Math | 134.5 | 48.0 | 120.5 | 112.0 | 121.25 |
| Science | 48.0 | 68.8 | 79.8 | 37.2 | -3.0 |
| English | 120.5 | 79.8 | 222.8 | 96.7 | 122.5 |
| History | 112.0 | 37.2 | 96.7 | 338.8 | 105.0 |
| SS | 121.25 | -3.0 | 122.5 | 105.0 | 175.5 |
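
To make the covariance calculation concrete, here is a minimal sketch that computes the same matrix by hand as $X^\top X / (n - 1)$, where $X$ is the centered data and $n$ is the number of students. It assumes the grades are re-entered as a plain NumPy array; the names `grades`, `centered`, and `manual_cov` are illustrative and do not appear in the notebook.

```python
import numpy as np

# Grades re-entered as a plain array for this sketch.
# Rows: students A-E. Columns: Math, Science, English, History, SS.
grades = np.array([
    [90, 73, 70, 85, 86],
    [83, 85, 90, 65, 75],
    [70, 75, 68, 92, 65],
    [60, 65, 50, 45, 59],
    [77, 65, 80, 75, 90],
], dtype=float)

# Center each column by subtracting its mean
centered = grades - grades.mean(axis=0)
n_samples = centered.shape[0]

# Sample covariance: X^T X / (n - 1)
manual_cov = centered.T @ centered / (n_samples - 1)

# np.cov with rowvar=False treats each column as a variable
assert np.allclose(manual_cov, np.cov(grades, rowvar=False))
print(manual_cov.round(2))
```

Dividing by `n - 1` rather than `n` gives the sample covariance, which is also the default for pandas `DataFrame.cov()`.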


The next step is to find the eigenvalues and eigenvectors of the covariance matrix. The eigenvalues are the variances of the principal components, and the eigenvectors are the directions of the principal components.

In [43]:
# Find the eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)

In [44]:
print("Eigenvalues:")
print(eigenvalues)

print('')

print("Eigenvectors:")
print(eigenvectors)

Eigenvalues:
[ 5.79134426e+02  2.04643774e+02 -3.21331331e-14  4.00099013e+01
  1.16611899e+02]

Eigenvectors:
[[ 0.42121228  0.19411769  0.45667606  0.75070764  0.11306439]
 [ 0.15819477  0.16339966 -0.66465266  0.36530491 -0.61078976]
 [ 0.49140467  0.53090696  0.27626881 -0.52871994 -0.34754858]
 [ 0.60745969 -0.77373914  0.00518723 -0.12438619 -0.12969824]
 [ 0.43251451  0.23470068 -0.52281207 -0.08929541  0.69031924]]
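
One detail that is easy to miss in this output: each eigenvector is a column of the `eigenvectors` array, not a row. The following sketch, which assumes the grades are re-entered as a NumPy array named `grades` (an illustrative name), checks that every column `v` satisfies the defining equation `cov @ v = eigenvalue * v` and has unit length.

```python
import numpy as np

# Grades re-entered for this sketch (rows: students A-E;
# columns: Math, Science, English, History, SS).
grades = np.array([
    [90, 73, 70, 85, 86],
    [83, 85, 90, 65, 75],
    [70, 75, 68, 92, 65],
    [60, 65, 50, 45, 59],
    [77, 65, 80, 75, 90],
], dtype=float)

cov = np.cov(grades, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eig(cov)

for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]  # the i-th eigenvector is the i-th COLUMN
    # Defining property of an eigenpair: cov @ v equals eigenvalue * v
    assert np.allclose(cov @ v, eigenvalues[i] * v)
    # np.linalg.eig returns eigenvectors normalized to unit length
    assert np.isclose(np.linalg.norm(v), 1.0)

print("All eigenpairs satisfy cov @ v = eigenvalue * v")
```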


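The larger an eigenvalue, the more variance the data has along the corresponding eigenvector. Note that `np.linalg.eig` does not return the eigenvalues in any particular order: in the output above the largest (about 579.1) happens to come first, but the third (about -3.2e-14) is smaller than the fourth and fifth. That third value is zero up to floating-point round-off: with only five students, the centered data has rank at most four, so at most four principal components carry any variance. Once the eigenvalues are sorted, the first principal component is the direction in the data that has the most variance.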

Sort the eigenvalues and eigenvectors in descending order.

In [56]:
# Sort the eigenvalues and eigenvectors in descending order
sorting_indices = np.argsort(eigenvalues)[::-1]
sorted_eigenvalues = eigenvalues[sorting_indices]
sorted_eigenvectors = eigenvectors[:, sorting_indices]

# Display the sorted eigenvalues and eigenvectors
print("Sorted Eigenvalues:")
print(sorted_eigenvalues)
print("\nSorted Eigenvectors:")
print(sorted_eigenvectors)

Sorted Eigenvalues:
[ 5.79134426e+02  2.04643774e+02  1.16611899e+02  4.00099013e+01
 -3.21331331e-14]

Sorted Eigenvectors:
[[ 0.42121228  0.19411769  0.11306439  0.75070764  0.45667606]
 [ 0.15819477  0.16339966 -0.61078976  0.36530491 -0.66465266]
 [ 0.49140467  0.53090696 -0.34754858 -0.52871994  0.27626881]
 [ 0.60745969 -0.77373914 -0.12969824 -0.12438619  0.00518723]
 [ 0.43251451  0.23470068  0.69031924 -0.08929541 -0.52281207]]
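
As an alternative to `np.linalg.eig`, the sketch below uses `np.linalg.eigh`, which is designed for symmetric matrices such as a covariance matrix and returns real eigenvalues in ascending order. This is a swapped-in technique, not what the notebook does; the names `grades`, `vals_desc`, and `vecs_desc` are illustrative, and the signs of individual eigenvectors may differ from the output above.

```python
import numpy as np

# Grades re-entered for this sketch (rows: students A-E;
# columns: Math, Science, English, History, SS).
grades = np.array([
    [90, 73, 70, 85, 86],
    [83, 85, 90, 65, 75],
    [70, 75, 68, 92, 65],
    [60, 65, 50, 45, 59],
    [77, 65, 80, 75, 90],
], dtype=float)

cov = np.cov(grades, rowvar=False)

# eigh assumes a symmetric matrix; eigenvalues come back in ascending order
vals, vecs = np.linalg.eigh(cov)

# Reverse both to get descending order, keeping columns aligned with values
vals_desc = vals[::-1]
vecs_desc = vecs[:, ::-1]

# Clip tiny negative values caused by floating-point round-off to zero
vals_desc = np.clip(vals_desc, 0.0, None)

print(vals_desc)
```

Because the order is guaranteed, a simple reversal replaces the `argsort` step, and clipping removes the small negative round-off value seen in the output above.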


#### Choose the number of principal components
You can select the number of principal components based on the explained variance or specific requirements. For simplicity, let's assume we want to keep the top two principal components.
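
As a rough illustration of the explained-variance criterion, the sketch below computes the share of total variance carried by each component. The eigenvalues are copied from the sorted output above; the names `explained_ratio` and `cumulative` are illustrative.

```python
import numpy as np

# Sorted eigenvalues copied from the output above
sorted_eigenvalues = np.array([
    5.79134426e+02, 2.04643774e+02, 1.16611899e+02,
    4.00099013e+01, -3.21331331e-14,
])

# Treat the tiny negative round-off value as zero variance
variances = np.clip(sorted_eigenvalues, 0.0, None)

# Fraction of total variance explained by each component
explained_ratio = variances / variances.sum()
cumulative = np.cumsum(explained_ratio)

for i, (ratio, total) in enumerate(zip(explained_ratio, cumulative), start=1):
    print(f"PC{i}: {ratio:.1%} of variance, {total:.1%} cumulative")
```

For this dataset the first two components together account for roughly 83% of the total variance, which supports keeping two.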

In [60]:
# Select the top two principal components
selected_components = sorted_eigenvectors[:, :2]

# Compute the transformed data
transformed_data = centered_data.dot(selected_components)

# Display the transformed data
print("Transformed Data:")
print(transformed_data)

Transformed Data:
                 0          1
Student                      
A        17.585654  -5.233849
B         9.456745  18.879337
C         3.664402 -20.196106
D       -42.120645   1.629931
E        11.413844   4.920687


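The original dataset has been projected onto a new coordinate system defined by the two selected principal components. Each column of the transformed data is a linear combination of the five centered grades (Math, Science, English, History, and SS), weighted by the entries of the corresponding eigenvector.

This projection can reveal patterns that are hard to see in the original feature space. Every entry of the first eigenvector is positive, so the first component acts as a rough overall-performance score: student D, whose grades are the lowest (or tied for lowest) in every subject, scores -42.1, far from the other four students. The second component weights History at about -0.77 and English at about 0.53, so it mainly contrasts History with the other subjects: student C (History 92) scores -20.2, while student B (English 90, History 65) scores 18.9. Because the sign of an eigenvector is arbitrary, only the relative positions of the students are meaningful, not the signs themselves.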
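
A quick sanity check on the projection, sketched below under the assumption that the grades are re-entered as a NumPy array: the sample variance of each column of the transformed data should equal the corresponding eigenvalue, and the two columns should be uncorrelated. The sketch uses `np.linalg.eigh` rather than `np.linalg.eig`, so signs may differ from the notebook output; the names `grades`, `scores`, and `score_cov` are illustrative.

```python
import numpy as np

# Grades re-entered for this sketch (rows: students A-E;
# columns: Math, Science, English, History, SS).
grades = np.array([
    [90, 73, 70, 85, 86],
    [83, 85, 90, 65, 75],
    [70, 75, 68, 92, 65],
    [60, 65, 50, 45, 59],
    [77, 65, 80, 75, 90],
], dtype=float)

centered = grades - grades.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# Eigendecomposition, sorted so the largest eigenvalue comes first
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
top_vals = vals[order][:2]
top_vecs = vecs[:, order][:, :2]

# Project the centered data onto the top two principal components
scores = centered @ top_vecs

# Covariance of the scores should be diagonal, with the eigenvalues on the diagonal
score_cov = np.cov(scores, rowvar=False)
assert np.allclose(np.diag(score_cov), top_vals)
assert np.isclose(score_cov[0, 1], 0.0, atol=1e-8)

print(score_cov.round(3))
```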