**PCA: Principal Component Analysis**

Here I implement PCA from scratch. The idea is to reduce the dimensionality of the speaker data before applying KNN to it, with the aim of getting better, more refined results. After experimenting, the final results on the speaker dataset will be presented in the report.

**Step 1:**
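The first step is to standardise the incoming data. The goal is for every feature to be on the scale of an N(0,1) distribution, i.e. mean=0 and variance=1. Standardising rescales each feature; it does not change the shape of its distribution.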

In [20]:
import numpy as np

def standardise_data(dataset):
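  #Statistics are computed column-wise, so each feature is scaled independently
  mean=np.mean(dataset,axis=0) #Mean of each feature
  std_dev=np.std(dataset,axis=0) #Standard deviation of each feature (ddof=0); assumes no feature is constant
  std_data=(dataset-mean)/std_dev #Standardised Data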

  return std_data
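
The division above fails (producing NaN or inf) if any feature is constant, because its standard deviation is 0. A minimal sketch of one way to guard against that is shown below; `standardise_data_safe` and its `eps` parameter are hypothetical names introduced only for illustration and are not part of the notebook.

```python
import numpy as np

def standardise_data_safe(dataset, eps=1e-12):
    """Hypothetical variant of standardise_data that tolerates constant columns."""
    dataset = np.asarray(dataset, dtype=float)
    mean = np.mean(dataset, axis=0)
    std_dev = np.std(dataset, axis=0)
    # Replace (near-)zero standard deviations with 1 so constant columns map to 0
    safe_std = np.where(std_dev < eps, 1.0, std_dev)
    return (dataset - mean) / safe_std

# Illustrative data: the middle column is constant
sample = np.array([[1.0, 5.0, 10.0],
                   [2.0, 5.0, 20.0],
                   [3.0, 5.0, 60.0]])
scaled = standardise_data_safe(sample)
print(scaled.mean(axis=0))  # close to 0 for every column
print(scaled.std(axis=0))   # 1 for the varying columns, 0 for the constant one
```

Mapping a constant column to zeros is a design choice; dropping such columns before PCA would be an equally reasonable alternative.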

**Step 2:**

With the data standardised, the next step is to build its covariance matrix.

In [21]:
#Covariance matrix

def covariance_matrix(dataset):
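  m=dataset.shape[0] #Number of samples (rows) in the dataset
  #Assumes the columns of dataset are already mean-centred, as standardise_data ensures
  cov_matrix=np.dot(dataset.T,dataset)/(m-1) #Sample covariance matrix (divides by m-1)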

  return cov_matrix
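
As a sanity check, the manual formula can be compared with `np.cov`. The sketch below uses random data generated only for illustration. It also shows why the diagonal is m/(m-1) rather than exactly 1 when the data is standardised with `np.std` (which divides by m) but the covariance divides by m-1.

```python
import numpy as np

rng = np.random.default_rng(0)      # illustrative data, not the speaker dataset
X = rng.normal(size=(50, 3))
m = X.shape[0]

# Manual covariance on mean-centred data, as in covariance_matrix
X_centered = X - X.mean(axis=0)
manual_cov = X_centered.T @ X_centered / (m - 1)

# np.cov centres the data itself; rowvar=False treats columns as features
reference_cov = np.cov(X, rowvar=False)
matches = np.allclose(manual_cov, reference_cov)
print(matches)

# Standardising with ddof=0 and dividing by m-1 leaves m/(m-1) on the diagonal
X_std = X_centered / X.std(axis=0)
diag = np.diag(X_std.T @ X_std / (m - 1))
print(diag, m / (m - 1))
```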

**Step 3:**

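Here comes the main part: calculating the eigenvalues and eigenvectors of the covariance matrix. The eigenvectors give the directions of the principal components and the eigenvalues give the variance captured along each of them, so together they decide which components to drop when reducing dimensionality.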

In [22]:
#Eigenvalues and Eigenvectors

def eigenvalues_eigenvectors(cov_matrix):
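  #Column eigenvectors[:,i] is the eigenvector belonging to eigenvalues[i]; the order is not guaranteed to be sorted
  eigenvalues,eigenvectors=np.linalg.eig(cov_matrix)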

  return eigenvalues,eigenvectors
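
Because a covariance matrix is symmetric, `np.linalg.eigh` is a possible alternative to `np.linalg.eig`: it always returns real eigenvalues, in ascending order. The sketch below is not the notebook's method; it only illustrates the alternative, using the 2x2 covariance matrix printed in the test further down.

```python
import numpy as np

# Covariance matrix taken from the test output in this notebook
cov = np.array([[1.11111111, 1.0288103],
                [1.0288103, 1.11111111]])

# eigh is intended for symmetric (Hermitian) matrices; eigenvalues come back ascending
vals, vecs = np.linalg.eigh(cov)

# Reverse both so the largest eigenvalue and its eigenvector come first
vals_desc = vals[::-1]
vecs_desc = vecs[:, ::-1]
print(vals_desc)
print(vecs_desc)
```

The signs of the eigenvectors may differ from those returned by `np.linalg.eig`; an eigenvector is only defined up to sign, so both are valid.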

**Step 4:**

Now the eigenvalues and eigenvectors are sorted in descending order of eigenvalue, so that the components explaining the most variance come first.

In [23]:
def sort_eigenpairs(eigenvalues,eigenvectors):
  sorted_indices=np.argsort(eigenvalues)[::-1] #Sorting in descending order
  sorted_eigenvalues=eigenvalues[sorted_indices]
  sorted_eigenvectors=eigenvectors[:,sorted_indices]

  return sorted_eigenvalues,sorted_eigenvectors
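
The important detail in the sort is that the same index array reorders both the eigenvalues and the columns of the eigenvector matrix, so each pair stays together. A small check of that property, using an illustrative symmetric matrix in place of a real covariance matrix, could look like this:

```python
import numpy as np

# Small symmetric matrix standing in for a covariance matrix (illustrative values)
cov = np.array([[4.0, 1.0, 0.0],
                [1.0, 3.0, 1.0],
                [0.0, 1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(cov)

# Same indexing as sort_eigenpairs: values by index, vectors by column
order = np.argsort(eigenvalues)[::-1]
sorted_vals = eigenvalues[order]
sorted_vecs = eigenvectors[:, order]

# Every column must still satisfy cov @ v = lambda * v after sorting
pairs_intact = np.allclose(cov @ sorted_vecs, sorted_vecs * sorted_vals)
print(sorted_vals)
print(pairs_intact)
```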

**Step 5:**

Finally, the dataset is projected onto the new coordinate system spanned by the top n components. On the actual dataset, the value of n will be varied and the experimental results will be compiled and presented.

In [24]:
def project_dataset_to_new_plane(X, eigenvectors, top_n=1):
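  W=eigenvectors[:,:top_n] #Projection matrix: first top_n columns (expects eigenvectors sorted in descending order)

  projected_data=np.dot(X,W) #Shape: (n_samples, top_n)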

  return projected_data
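
One way to see what varying n does is to project, map back with the transpose of the projection matrix, and measure how much of the standardised data is lost. The sketch below assumes random illustrative data and inlines the earlier steps so it runs on its own; the names `errors` and `X_hat` are introduced here only for the example.

```python
import numpy as np

rng = np.random.default_rng(1)                        # illustrative data only
X = rng.normal(size=(200, 3))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)        # make two features correlated

# Standardise, build the covariance matrix, and sort the eigenpairs
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
cov = X_std.T @ X_std / (X_std.shape[0] - 1)
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

errors = {}
for top_n in (1, 2, 3):
    W = vecs[:, :top_n]          # projection matrix, shape (3, top_n)
    Z = X_std @ W                # projected data, shape (200, top_n)
    X_hat = Z @ W.T              # back-projection into the original feature space
    errors[top_n] = np.mean((X_std - X_hat) ** 2)
    print(top_n, Z.shape, errors[top_n])
```

The reconstruction error shrinks as n grows and is numerically zero once all components are kept.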

Testing with a small sample dataset

In [25]:
#Testing
data = np.array([[2.5, 2.4],
                 [0.5, 0.7],
                 [2.2, 2.9],
                 [1.9, 2.2],
                 [3.1, 3.0],
                 [2.3, 2.7],
                 [2.0, 1.6],
                 [1.0, 1.1],
                 [1.5, 1.6],
                 [1.1, 0.9]])

standardized_data = standardise_data(data)
print("Standardized Data:\n", standardized_data)

cov_matrix_result = covariance_matrix(standardized_data)
print("Covariance Matrix:\n", cov_matrix_result)

eigenvalues, eigenvectors = eigenvalues_eigenvectors(cov_matrix_result)
print("Eigenvalues:\n", eigenvalues)
print("Eigenvectors:\n", eigenvectors)

sorted_eigenvalues, sorted_eigenvectors = sort_eigenpairs(eigenvalues, eigenvectors)
print("Sorted Eigenvalues:\n", sorted_eigenvalues)
print("Sorted Eigenvectors:\n", sorted_eigenvectors)

projected_data = project_dataset_to_new_plane(standardized_data, sorted_eigenvectors)
print("Projected Data:\n", projected_data)



Standardized Data:
 [[ 0.92627881  0.61016865]
 [-1.7585873  -1.506743  ]
 [ 0.52354889  1.23278973]
 [ 0.12081898  0.36112022]
 [ 1.73173864  1.35731394]
 [ 0.6577922   0.9837413 ]
 [ 0.25506228 -0.38602507]
 [-1.08737078 -1.00864614]
 [-0.41615425 -0.38602507]
 [-0.95312747 -1.25769457]]
Covariance Matrix:
 [[1.11111111 1.0288103 ]
 [1.0288103  1.11111111]]
Eigenvalues:
 [0.08230081 2.13992141]
Eigenvectors:
 [[-0.70710678 -0.70710678]
 [ 0.70710678 -0.70710678]]
Sorted Eigenvalues:
 [2.13992141 0.08230081]
Sorted Eigenvectors:
 [[-0.70710678 -0.70710678]
 [-0.70710678  0.70710678]]
Projected Data:
 [[-1.08643242]
 [ 2.3089372 ]
 [-1.24191895]
 [-0.34078247]
 [-2.18429003]
 [-1.16073946]
 [ 0.09260467]
 [ 1.48210777]
 [ 0.56722643]
 [ 1.56328726]]
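
As a quick check on the output above, the eigenvalues should sum to the trace of the covariance matrix (2 × 1.11111111), and the share of each eigenvalue in that sum is the variance explained by its component. The sketch below only reuses the printed numbers; the first component accounts for roughly 96% of the variance, which is why a single component is enough for this toy dataset.

```python
import numpy as np

# Values copied from the printed output of the test above
sorted_eigenvalues = np.array([2.13992141, 0.08230081])
cov_matrix = np.array([[1.11111111, 1.0288103],
                       [1.0288103, 1.11111111]])

total = sorted_eigenvalues.sum()
trace = np.trace(cov_matrix)
print(total, trace)                      # both about 2.2222

explained = sorted_eigenvalues / total   # fraction of variance per component
print(explained)
```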


Testing by creating a synthetic dataset

In [26]:
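def generate_dataset(n_samples=100, n_features=5):
    # Note: n_features is currently unused; the function always builds the 5 features below
    np.random.seed(42) # Fixed seed so the dataset is reproducible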
    feature_1 = np.random.normal(10, 2, n_samples)
    feature_2 = feature_1 + np.random.normal(0, 1, n_samples)
    feature_3 = np.random.normal(20, 5, n_samples)
    feature_4 = feature_3 * 0.5 + np.random.normal(0, 2, n_samples)
    feature_5 = np.random.normal(50, 10, n_samples)
    return np.column_stack((feature_1, feature_2, feature_3, feature_4, feature_5))
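
The dataset is built so that feature_2 depends on feature_1 and feature_4 depends on feature_3, while feature_5 is independent, which gives PCA some redundancy to remove. The sketch below rebuilds the same construction and inspects the correlation matrix. As an assumption for the example, it uses a local `np.random.default_rng` generator rather than `np.random.seed`, so the numbers will differ from those of `generate_dataset`.

```python
import numpy as np

rng = np.random.default_rng(42)   # local generator; an assumption, not the notebook's seeding
n_samples = 100

feature_1 = rng.normal(10, 2, n_samples)
feature_2 = feature_1 + rng.normal(0, 1, n_samples)        # correlated with feature_1
feature_3 = rng.normal(20, 5, n_samples)
feature_4 = feature_3 * 0.5 + rng.normal(0, 2, n_samples)  # correlated with feature_3
feature_5 = rng.normal(50, 10, n_samples)                  # independent of the others
X = np.column_stack((feature_1, feature_2, feature_3, feature_4, feature_5))

# rowvar=False: columns are features, so the result is a 5x5 correlation matrix
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```

Strong off-diagonal entries for the pairs (1, 2) and (3, 4) suggest that noticeably fewer than five components should capture most of the variance.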


In [None]:
#Trial run: explained variance of each component on the synthetic dataset
def main():
  X=generate_dataset()
  #print("Dataset:\n",dataset)

  X_std=standardise_data(X)
  #print("Standardised Dataset:\n",X_std)

  cov_matrix_result=covariance_matrix(X_std)
  #print("Covariance Matrix:\n",cov_matrix_result)

  eigenvalues,eigenvectors=eigenvalues_eigenvectors(cov_matrix_result)
  #print("Eigenvalues:\n",eigenvalues)
  #print("Eigenvectors:\n",eigenvectors)

  sorted_eigenvalues,sorted_eigenvectors=sort_eigenpairs(eigenvalues,eigenvectors)
  #print("Sorted Eigenvalues:\n",sorted_eigenvalues)
  #print("Sorted Eigenvectors:\n",sorted_eigenvectors)

  explained_variance=sorted_eigenvalues/np.sum(sorted_eigenvalues)
  print("Explained Variance:\n",explained_variance)

  cumulative_variance=np.cumsum(explained_variance)
  print("Cumulative Variance:\n",cumulative_variance)

main()
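
The cumulative variance can be turned into a rule for picking n: keep the smallest number of components whose cumulative share reaches a chosen threshold. This is a minimal sketch of that idea; `choose_n_components` and the `threshold` parameter are hypothetical names, and the eigenvalues are illustrative rather than taken from the speaker data.

```python
import numpy as np

def choose_n_components(sorted_eigenvalues, threshold=0.95):
    """Hypothetical helper: smallest n whose cumulative explained variance >= threshold."""
    sorted_eigenvalues = np.asarray(sorted_eigenvalues, dtype=float)
    explained = sorted_eigenvalues / np.sum(sorted_eigenvalues)
    cumulative = np.cumsum(explained)
    # searchsorted returns the first index where cumulative >= threshold
    n = int(np.searchsorted(cumulative, threshold)) + 1
    # Clamp in case rounding leaves the final cumulative value just below the threshold
    return min(n, len(sorted_eigenvalues))

# Illustrative eigenvalues, already sorted in descending order
example_eigenvalues = [4.0, 3.0, 2.0, 1.0]   # cumulative shares: 0.4, 0.7, 0.9, 1.0
n_for_80 = choose_n_components(example_eigenvalues, threshold=0.8)
n_for_50 = choose_n_components(example_eigenvalues, threshold=0.5)
print(n_for_80, n_for_50)
```

The returned value could then be passed as `top_n` to `project_dataset_to_new_plane`. The threshold itself is a judgment call and is worth varying alongside n in the experiments.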

