# Principal Component Analysis

Principal Component Analysis (PCA) is a dimensionality-reduction method that transforms a large set of variables into a smaller one that still contains most of the information in the original set. The main idea is to reduce the number of dimensions so the data is easier to interpret, at the cost of a small loss of accuracy.

PCA is commonly used to visualize a high-dimensional data set by reducing its dimensions. For example, the algorithm can convert a 3-D dataset,

<img src="./imgs/before_pca.jpg" alt="Before PCA"/>

to a more interpretable 2-D form, as shown below.

<img src="./imgs/after_pca.jpg" alt="After PCA" />

In this notebook, we will go through how PCA works, step by step. Let's get started.

## Import Libraries
As always, we need a few libraries:
- **numpy** : for mathematical computation
- **matplotlib** : for plotting

In [1]:
import numpy as np
import matplotlib.pyplot as plt

## Loading Dataset
For this tutorial, we will be using *GloVe word embeddings*. **Word embeddings** are a way of representing words in a computer as fixed-length vectors that capture much of each word's semantic meaning. Word embeddings are widely used in Natural Language Processing (NLP) systems and were a major breakthrough in the field.

The problem with word embeddings is that they are high-dimensional vectors, which cannot be visualized easily. We are going to transform the word embeddings for the 1000 most commonly used words in the English language and visualize them on a 2-D plane.

### Loading Word list
We take the 1000 most commonly used words. The link for the list is in the references.

In [2]:
with open("Word List.txt", "r", encoding='ascii', errors='ignore') as f:
    data = f.readlines()

word_list = [d.strip() for d in data]

print(f"First five words : {word_list[:5]}")

First five words : ['be', 'and', 'of', 'a', 'in']


### Loading pretrained vectors
Now we are going to load the pretrained GloVe vectors and create a dictionary of the form <key, vector>, where *key* is the word and *vector* is its pretrained fixed-dimension vector. The link for the pretrained vector file is in the references.

In [3]:
with open("glove.6B.50d.txt", "r", encoding='utf-8') as f:
    emb = f.readlines()

emb_dict = {}

for word in emb:
    word = word.strip().split()
    vec = np.array(list(map(float, word[1:])))
    emb_dict[word[0]] = vec
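
As a rough illustration of the file layout the loop above assumes, each line holds a word followed by its vector components, separated by spaces. The sketch below parses two made-up lines; the numbers are invented for illustration and are not real GloVe values, and `parse_line` is a hypothetical helper name.

```python
import numpy as np

# Two made-up lines in the "word v1 v2 ... vN" layout (values are illustrative only).
sample_lines = [
    "cat 0.1 -0.2 0.3\n",
    "dog 0.4 0.5 -0.6\n",
]

def parse_line(line):
    """Split one line into (word, vector). Hypothetical helper for illustration."""
    parts = line.strip().split()
    return parts[0], np.array([float(p) for p in parts[1:]])

toy_dict = dict(parse_line(line) for line in sample_lines)
print(toy_dict["cat"])        # a length-3 float array
print(toy_dict["dog"].shape)  # (3,)
```

With the real file, every vector would have 50 entries instead of 3.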

### Extracting useful vectors
Now that we have both our word list and the embedding vectors, we will extract the vectors for our words and keep them in a separate dictionary.

In [4]:
word_used = {}

for word in word_list:
    vec = emb_dict.get(word, None)
    if vec is not None:
        word_used[word] = vec

print(f"Number of words with their vectors out of 1000 : {len(word_used)}")

Number of words with their vectors out of 1000 : 893


## Implementing PCA
Now it's time to implement the PCA algorithm. The steps are as follows:
1. **Standardization**
The first step is to bring the variables onto a comparable scale. Full standardization subtracts each variable's mean and divides by its standard deviation, so that every variable contributes equally to the analysis; without scaling, variables with larger variance dominate those with smaller variance. This notebook applies only the centering part (mean normalization), which looks like this mathematically:


$$
z\_std = z - \operatorname{mean}(z) \tag{Mean Normalization}
$$

2. **Compute covariance matrix**
The aim of this step is to understand how the variables of the input data set vary from the mean with respect to each other, or in other words, to see whether there is any relationship between them. Variables are sometimes so highly correlated that they contain redundant information, and the covariance matrix lets us identify these correlations. With one sample per row of z\_std and $n$ samples in total, the result is a square matrix with one row and one column per variable. Mathematically,


$$
\operatorname{Cov}(z\_std) = \frac{1}{n - 1} \, (z\_std)^{T} \, z\_std
$$

3. **Compute Eigenvalues and Eigenvectors**
Eigenvalues and eigenvectors are concepts from linear algebra. The eigenvectors of the covariance matrix define the principal components: new variables constructed as linear combinations of the initial variables. These combinations are built so that the new variables are uncorrelated and most of the information in the initial variables is compressed into the first components. For example, our dataset consists of 50-dimensional vectors; PCA packs as much information as possible into the first component, then as much of the remainder as possible into the second, and so on, allowing you to retain most of the information by keeping only the first few components.

Each eigenvalue gives the amount of variance (information) captured along its eigenvector. We therefore sort the eigenvectors in decreasing order of eigenvalue, so the components run from the most informative to the least. The variance retained by the principal components looks like this:

<img src="./imgs/var.png" alt="variance retained" width = "512" height="750" />

4. **Transforming Data**
Once we have our principal components, we can simply take the dot product of the normalized data with the component vectors to get the reduced-dimensional data.


$$
reduced\_z = z\_std \cdot component\_matrix
$$
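
The four steps above can be sketched on a small random matrix. This is a minimal sketch, assuming rows are samples and columns are variables; the data, the seed, and names such as `X_toy` are made up for illustration. It also checks that the manual covariance matches `np.cov` and that the projected columns come out uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(0)           # fixed seed, arbitrary choice
X_toy = rng.normal(size=(100, 5))        # 100 samples, 5 variables (made-up data)

# Step 1: mean normalization (center each column)
X_centered = X_toy - X_toy.mean(axis=0)

# Step 2: covariance matrix of the variables, shape (5, 5)
n = X_centered.shape[0]
cov_manual = X_centered.T @ X_centered / (n - 1)
assert np.allclose(cov_manual, np.cov(X_toy, rowvar=False))

# Step 3: eigen-decomposition; eigh returns eigenvalues in ascending order
eig_vals, eig_vecs = np.linalg.eigh(cov_manual)
order = np.argsort(eig_vals)[::-1]
eig_vals, eig_vecs = eig_vals[order], eig_vecs[:, order]

# Step 4: project onto the first two components
reduced = X_centered @ eig_vecs[:, :2]
print(reduced.shape)                     # (100, 2)

# Covariance of the projected data: off-diagonal entries should be close to zero,
# and the diagonal should match the two largest eigenvalues
print(np.round(np.cov(reduced, rowvar=False), 6))
print(np.round(eig_vals[:2], 6))
```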

All the steps explained above make up the PCA algorithm. I have implemented the algorithm as a class as shown below.

In [5]:
class PCA:
    def __init__(self, n_components = 2):
        self.n_components = n_components
        self.evecs = None
        self.evals = None
        
    def fit_transform(self, X):
        # Mean normalization
        mean = np.mean(X, axis = 0)
        X = X - mean
        # Covariance Matrix
        cov = np.cov(X, rowvar=False)
        # Eigen values and vectors
        eigen_vals, eigen_vecs = np.linalg.eigh(cov)
        
        # Sorting in decreasing order of Eigen values
        idxs = np.argsort(eigen_vals)[::-1]
        
        self.evecs = eigen_vecs[:, idxs]
        self.evals = eigen_vals[idxs]
        
        # Extracting components
        comp = self.evecs[:, :self.n_components]
        # Transforming and returning
        return np.dot(X, comp)
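
One way to sanity-check an eigen-decomposition approach like the one above is to compare it with a singular value decomposition (SVD) of the centered data, which is a different route to the same components. The sketch below is self-contained and uses made-up random data; `pca_eigh` and `pca_svd` are hypothetical helper names, not part of the class. The sign of each component is arbitrary and may differ between the two routes, so the comparison uses absolute values.

```python
import numpy as np

def pca_eigh(X, k):
    """Project X onto its top-k components via the covariance matrix."""
    Xc = X - X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    top = np.argsort(vals)[::-1][:k]
    return Xc @ vecs[:, top]

def pca_svd(X, k):
    """Project X onto its top-k components via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(42)
# Made-up data; columns are scaled differently so the components are well separated
X_demo = rng.normal(size=(200, 6)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5, 0.1])

a = pca_eigh(X_demo, 2)
b = pca_svd(X_demo, 2)
print(np.allclose(np.abs(a), np.abs(b)))
```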

## Running PCA

### Get data
We get the vector data as a matrix.

In [6]:
emb_arr = np.array(list(word_used.values()))

print(f"Vector matrix shape : {emb_arr.shape}")

Vector matrix shape : (893, 50)


### Run PCA
Next, we create an instance of the PCA class and run the transform.

In [7]:
# PCA instance
pca = PCA()
# getting transformed data
result = pca.fit_transform(emb_arr)

In [8]:
print(f"Shape of transformed Data : {result.shape}")

Shape of transformed Data : (893, 2)
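
To judge how much information a given number of components keeps, the sorted eigenvalues can be turned into fractions of the total variance. A minimal sketch on made-up data follows; with the class above, the same calculation could presumably be applied to `pca.evals` after `fit_transform`. The names `explained` and `cumulative` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up data whose columns have decreasing spread
X_demo = rng.normal(size=(500, 4)) * np.array([4.0, 2.0, 1.0, 0.5])

Xc = X_demo - X_demo.mean(axis=0)
# eigvalsh returns eigenvalues in ascending order, so reverse them
eig_vals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

explained = eig_vals / eig_vals.sum()    # fraction of variance per component
cumulative = np.cumsum(explained)        # fraction kept by the first k components

for k, (e, c) in enumerate(zip(explained, cumulative), start=1):
    print(f"component {k}: {e:.3f} of variance, {c:.3f} cumulative")
```

A low cumulative value for the first two components would mean the 2-D plot shows only part of the structure in the original vectors.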


## Plotting results
Now we can plot the vectors and see how the word vectors represent the words in semantically meaningful ways.

In [9]:
%matplotlib notebook
fig = plt.figure(figsize=(8, 8))

plt.scatter(result[:, 0], result[:, 1])

for i, word in enumerate(word_used.keys()):
    plt.annotate(word, [result[i, 0], result[i, 1]])



As we can see, with the help of PCA we are able to visualize the word embeddings and see how similar words are grouped together in the data space.

Words like *knowledge*, *ability*, *theory*, and *apply* tend to occur together in sentences, so they are grouped together in the data space.

<img src="./imgs/plot.PNG" alt="Word embeddings" width = "512" height = "512" />

## References
- *PCA* : https://en.wikipedia.org/wiki/Principal_component_analysis
- *Covariance Matrix* : https://datascienceplus.com/understanding-the-covariance-matrix/#:~:text=where%20our%20data%20set%20is,XT%20X%20X%20T%20.
- *Word List* : https://www.gonaturalenglish.com/1000-most-common-words-in-the-english-language/
- *Glove Word Embeddings* : https://nlp.stanford.edu/projects/glove/