## t-SNE

This notebook implements t-SNE, a mapping for visualising high-dimensional datasets in a low-dimensional space.

### Importing the Libraries

In [None]:
# import libraries 

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

### Dimensionality Reduction

Reduce the high-dimensional data to its 30 leading principal components using PCA.

In [None]:

def reduceDimensionality(data, n_components = 30):
    
    """
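    Standardize the features and reduce them to n_components dimensions [ default = 30 ] using PCA
    
    Parameters
    ----------
    data : NumPy array of shape (n_samples, n_features + 1), with the last column being the label of each sample
    n_components : number of principal components to keep

    Returns
    -------
    NumPy array of shape (n_samples, n_components + 1), with the labels kept in the last column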

    """
    # standardize the data
    scaler = StandardScaler()
    x = data[:,:-1]
    y = data[:,-1]
    x = scaler.fit_transform(x)
    # apply PCA
    pca = PCA(n_components = n_components)
    x = pca.fit_transform(x)
    # concatenate the data
    data = np.concatenate((x, y.reshape(-1,1)), axis = 1)
    return data
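
A quick way to judge whether 30 components is a reasonable cut-off is to check how much variance they retain. The sketch below uses the same `StandardScaler` and `PCA` steps as above; the random correlated data is a hypothetical stand-in, not the notebook's dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 200 samples, 64 correlated features
latent = rng.normal(size=(200, 10))
features = latent @ rng.normal(size=(10, 64)) + 0.1 * rng.normal(size=(200, 64))

# Same preprocessing as reduceDimensionality: standardize, then project
scaled = StandardScaler().fit_transform(features)
pca = PCA(n_components=30)
reduced = pca.fit_transform(scaled)

# Fraction of the total variance kept by the 30 retained components
retained = pca.explained_variance_ratio_.sum()
print(reduced.shape)
print(f"variance retained by 30 components: {retained:.3f}")
```

If the retained fraction is low on real data, a larger `n_components` may be worth trying before running t-SNE.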


In [None]:
def plotPCA(data,title):

    """
    Scatter plot of the first two principal components, coloured by label
    
    Parameters:
    -----------
    data : NumPy array of PCA-reduced samples, with the last column being the label of each sample
    title : title of the plot
    
    """
    plt.figure(figsize=(10, 10))
    plt.scatter(data[:, 0], data[:, 1], alpha=0.5,c=data[:,-1])
    plt.colorbar()
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    plt.title(title)
    plt.show()

### Probability Matrix Q in the lower dimensional space

Compute the similarity matrix between the points in the lower dimensional space, where each element of the matrix is given by:

$ q_{i, j} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum_{k\neq l}(1+||y_k-y_l||^2)^{-1}} $

In [None]:
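def computeQMatrix(Y):
    """
    Function to compute the Q matrix for the given data points, using a Student-t kernel
    
    Parameters: 
    -----------

    Y : data points in the lower dimensional space, array of shape (n, d)

    Returns the (n, n) matrix Q, with a zero diagonal and entries summing to 1
    """
    # pairwise squared Euclidean distances ||y_i - y_j||^2
    difference = Y[:, np.newaxis, :] - Y[np.newaxis, :, :]
    squaredDistance = np.sum(difference ** 2, axis=-1)
    # unnormalised similarities; self-pairs are excluded (k != l)
    numerator = 1.0 / (1.0 + squaredDistance)
    np.fill_diagonal(numerator, 0.0)
    return numerator / np.sum(numerator)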

### Binary Search for Optimal Parameter


In [None]:
# code 
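
The cell above is a placeholder. As a minimal sketch, assuming the "optimal parameter" is the per-point Gaussian bandwidth $\sigma_i$ that makes the conditional distribution $p_{j|i}$ match a target perplexity (the usual choice in t-SNE), the search could look as follows. All function names here are hypothetical, and the search runs over `beta = 1 / (2 * sigma_i ** 2)` rather than over $\sigma_i$ directly.

```python
import numpy as np

def conditional_probabilities(sq_dists, beta):
    """p_{j|i} for one point i, given squared distances to every other point."""
    # Shifting by the minimum distance avoids underflow and leaves the ratios unchanged
    weights = np.exp(-beta * (sq_dists - sq_dists.min()))
    return weights / weights.sum()

def perplexity(p):
    """2 ** (Shannon entropy in bits) of a discrete distribution."""
    p = p[p > 0]
    return 2.0 ** (-np.sum(p * np.log2(p)))

def search_beta(sq_dists, target_perplexity=30.0, tol=1e-5, max_steps=100):
    """Bisection on beta until the perplexity of p_{j|i} matches the target."""
    low, high = 0.0, np.inf
    beta = 1.0
    for _ in range(max_steps):
        current = perplexity(conditional_probabilities(sq_dists, beta))
        if abs(current - target_perplexity) < tol:
            break
        if current > target_perplexity:
            # Distribution too flat: narrow the Gaussian by increasing beta
            low = beta
            beta = beta * 2.0 if np.isinf(high) else 0.5 * (low + high)
        else:
            # Distribution too peaked: widen the Gaussian by decreasing beta
            high = beta
            beta = 0.5 * (low + high)
    return beta

def joint_probabilities(X, target_perplexity=30.0):
    """Symmetrised joint distribution p_ij = (p_{j|i} + p_{i|j}) / (2n)."""
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    conditional = np.zeros((n, n))
    for i in range(n):
        others = np.arange(n) != i
        beta = search_beta(sq[i, others], target_perplexity)
        conditional[i, others] = conditional_probabilities(sq[i, others], beta)
    return (conditional + conditional.T) / (2.0 * n)

# Small demonstration on random data (not the notebook's dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 5))
P_demo = joint_probabilities(X_demo, target_perplexity=10.0)
print(P_demo.shape, P_demo.sum())
```

This version builds the full pairwise distance matrix, so it is only practical on a subset of a few thousand points. The target perplexity must be smaller than the number of other points.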

### Gradient Descent

The gradient of the t-SNE cost function reduces to a simple form:

$ \frac{\partial C}{\partial y_i} = 4\sum_{j} (p_{i, j} - q_{i, j})(y_i - y_j)(1 + ||y_i-y_j||^2)^{-1} $
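
One way to gain confidence in this formula is to compare it with a finite-difference gradient of the cost. The sketch below assumes the cost $C$ is the KL divergence $\sum_{i \neq j} p_{i,j} \log (p_{i,j} / q_{i,j})$ with a symmetric $P$ that sums to 1. The helper names and the random test data are hypothetical.

```python
import numpy as np

def q_matrix(Y):
    """Return Q and the unnormalised Student-t kernel (1 + ||y_i - y_j||^2)^-1."""
    sq = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    kernel = 1.0 / (1.0 + sq)
    np.fill_diagonal(kernel, 0.0)
    return kernel / kernel.sum(), kernel

def kl_cost(Y, P):
    """KL(P || Q) summed over all pairs with p_ij > 0 (the diagonal is excluded)."""
    Q, _ = q_matrix(Y)
    mask = P > 0
    return np.sum(P[mask] * np.log(P[mask] / Q[mask]))

def analytic_gradient(Y, P):
    """4 * sum_j (p_ij - q_ij) (y_i - y_j) (1 + ||y_i - y_j||^2)^-1 for every i."""
    Q, kernel = q_matrix(Y)
    M = (P - Q) * kernel
    # sum_j M_ij (y_i - y_j) = (row sum of M)_i * y_i - (M @ Y)_i
    return 4.0 * (M.sum(axis=1, keepdims=True) * Y - M @ Y)

def numeric_gradient(Y, P, eps=1e-6):
    """Central finite differences of the cost, one coordinate at a time."""
    grad = np.zeros_like(Y)
    for idx in np.ndindex(*Y.shape):
        Y_plus, Y_minus = Y.copy(), Y.copy()
        Y_plus[idx] += eps
        Y_minus[idx] -= eps
        grad[idx] = (kl_cost(Y_plus, P) - kl_cost(Y_minus, P)) / (2.0 * eps)
    return grad

# Random symmetric P with zero diagonal, normalised to sum to 1
rng = np.random.default_rng(0)
n = 6
A = rng.random((n, n))
P_check = A + A.T
np.fill_diagonal(P_check, 0.0)
P_check /= P_check.sum()
Y_check = rng.normal(size=(n, 2))

max_abs_diff = np.max(np.abs(analytic_gradient(Y_check, P_check)
                             - numeric_gradient(Y_check, P_check)))
print("largest disagreement:", max_abs_diff)
```

The two gradients should agree up to finite-difference error, which is a useful sanity check before running the full optimisation.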

In [1]:
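def objectiveFunction(Y,P,Q):
    """
    Function to calculate the gradient of the main objective function of t-SNE with respect to Y
    Parameters:
    -----------
    Y : The data points in the lower dimensional space, array of shape (n, d)
    P : The joint probability distribution of the data points in the high dimensional space
    Q : The joint probability distribution of the data points in the low dimensional space

    Returns the gradient, an array with the same shape as Y
    """
    # difference is a nxn matrix
    difference = P - Q
    # weight each pair by the kernel (1 + ||y_i - y_j||^2)^-1, also nxn
    squaredDistance = np.sum((Y[:, np.newaxis, :] - Y[np.newaxis, :, :]) ** 2, axis=-1)
    weights = difference / (1.0 + squaredDistance)
    # 4 * sum_j weights_ij * (y_i - y_j), written with matrix products
    return 4.0 * (np.sum(weights, axis=1, keepdims=True) * Y - weights @ Y)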



In [None]:
def gradientDescent(Y,P,learningRate= 200.0 , momentum = 0.9, maxIterations = 1000):
   
    """
    Performs gradient descent with momentum for the t-SNE algorithm 

    Parameters:
    -----------

    Y : initial data points in the lower dimensional space

    P : the joint probability distribution of the data points in the high dimensional space
    
    learningRate : learning rate for the gradient descent step

    momentum : momentum for the gradient descent step

    maxIterations : maximum number of iterations for the gradient descent step

    Returns the final points Y and the list of points recorded at each iteration
    
    """

    # store the variation of the points in gradient descent steps
    Y_iterations = []

    # points from the previous two iterations, used for the momentum term
    Y_t1 = Y.copy()
    Y_t2 = Y.copy()
    for iteration in range(0,maxIterations):

        Y_iterations.append(Y.copy().reshape(-1,2))
        # the similarity matrix in the low dimensional space
        Q = computeQMatrix(Y)
        
        # point-wise gradients of the cost with respect to Y
        gradient = objectiveFunction(Y,P,Q) 

        # gradient norm, useful for monitoring convergence
        modGradient = np.linalg.norm(gradient)

        # update the points in the low dimensional space
        Y = Y - learningRate * gradient + momentum * (Y_t1 - Y_t2) 
        # update the vector terms
        Y_t2 = Y_t1.copy()
        Y_t1 = Y.copy()

    return Y, Y_iterations


## Load MNIST Dataset

- Load the MNIST dataset and reduce it to 30 dimensions using PCA.

In [None]:
import seaborn as sns
from scipy.linalg import eigh

In [None]:
df = pd.read_csv('./data/mnist_train.csv')
print(df.shape)
df.head()

In [None]:
labels = df['label']
data = df.drop('label', axis = 1)
data = data.values
data = np.concatenate((data, labels.values.reshape(-1,1)), axis = 1)
data = reduceDimensionality(data)

In [None]:
#  plot PCA
plotPCA(data, 'PCA')