## Using TruncatedSVD and t-SNE to perform dimensionality reduction on the Amazon Reviews Dataset


In [1]:
!pip install ipython-autotime
%load_ext autotime



In [2]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
time: 3.06 ms


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import bz2
import os

time: 2.36 ms


In [0]:
# The objects loaded below already have stop words removed

import pickle
# Getting back the objects:
with open('/content/drive/My Drive/Colab Notebooks/objs2.pkl', 'rb') as f:  # Python 3: open(..., 'rb')
    train_texts, train_labels, test_texts, test_labels, train_texts_vec, test_texts_vec = pickle.load(f)

time: 27.9 s


### We use TruncatedSVD (instead of PCA) for dimensionality reduction, because the vectorized text is a sparse matrix and TruncatedSVD works on sparse input directly.

Also, t-SNE is expensive on high-dimensional input because it computes pairwise distances between samples. The scikit-learn documentation therefore recommends first using another dimensionality reduction technique to bring the number of features down to a reasonable amount (e.g. 50) and then applying t-SNE.
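
As a minimal sketch of why TruncatedSVD fits this setting, the example below runs it on a small random sparse matrix. The matrix `X_sparse` and its size are hypothetical stand-ins for the vectorized reviews, not the actual data.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical stand-in for a vectorized text matrix:
# 200 documents, 1,000 features, about 1% non-zero entries.
X_sparse = sparse.random(200, 1000, density=0.01, format="csr", random_state=0)

# Same settings as the SVD cell in this notebook, applied to the toy matrix.
svd_demo = TruncatedSVD(n_components=50, n_iter=7, random_state=42)
X_reduced = svd_demo.fit_transform(X_sparse)  # sparse input is accepted directly

print(sparse.issparse(X_sparse), X_sparse.shape)
print(type(X_reduced).__name__, X_reduced.shape)
```

The result is an ordinary dense NumPy array with 50 columns, which is the form t-SNE and k-means expect.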

Thus, we will be applying two approaches:
1. Using TruncatedSVD (closely related to PCA, but without centering the data) to reduce the dimensions to 50, and then using k-means on this dataset.

2. Using TruncatedSVD to reduce the dimensions to 50, then using t-SNE to reduce them further to 2, and then applying k-means to the dataset.

This notebook focuses only on the dimensionality reduction; the k-means step will be done in another notebook.
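
The two reduction stages can be sketched end to end on a toy problem. The matrix `X_demo` is a hypothetical stand-in for `train_texts_vec`, the t-SNE settings are library defaults rather than the ones used in this notebook, and 2 output dimensions are used to match the t-SNE cell further down.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# Hypothetical small sparse matrix standing in for train_texts_vec.
X_demo = sparse.random(300, 2000, density=0.01, format="csr", random_state=0)

# Stage 1: TruncatedSVD down to 50 dimensions (input for approach 1).
X_50 = TruncatedSVD(n_components=50, random_state=42).fit_transform(X_demo)

# Stage 2: t-SNE on the 50-dimensional output (input for approach 2).
X_2 = TSNE(n_components=2, random_state=42).fit_transform(X_50)

print(X_50.shape, X_2.shape)
```

On a few hundred rows this runs in seconds; the cost of the t-SNE stage is what changes at the scale of the full dataset.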



In [0]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=50, n_iter=7, random_state=42)
train_texts_50dim = svd.fit_transform(train_texts_vec)

## Saving the new objects
import pickle

# Saving the objects:
with open('/content/drive/My Drive/Colab Notebooks/objs_svd1.pkl', 'wb') as f:  # Python 3: open(..., 'wb')
    pickle.dump([train_texts_50dim], f)

time: 6min 12s
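
One way to judge whether 50 components are enough is to look at the fitted model's `explained_variance_ratio_` attribute. The sketch below does this on a hypothetical random matrix, so the printed number says nothing about the Amazon reviews data; on the real data the same attribute could be read from the fitted `svd` object.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Hypothetical sparse matrix; replace with the real vectorized texts in practice.
X_demo = sparse.random(500, 1000, density=0.02, format="csr", random_state=1)

svd_check = TruncatedSVD(n_components=50, n_iter=7, random_state=42).fit(X_demo)

# Fraction of the total variance captured by each component, and the running total.
ratios = svd_check.explained_variance_ratio_
cumulative = np.cumsum(ratios)
retained = ratios.sum()

print(f"variance retained by 50 components: {retained:.3f}")
```

If the retained fraction is low, increasing `n_components` is the obvious knob, at the cost of a longer fit.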


For the step from SVD to t-SNE, I chose 2 output dimensions arbitrarily; this number can be varied to produce different outputs.

In [5]:
# Convert the TruncatedSVD output into a DataFrame with one row per review
#    and 50 columns. If the output was reloaded from the pickle as a
#    one-element list, np.array adds a leading axis, which reshape removes.

mat = np.array(train_texts_50dim)
mat2 = mat.reshape(-1, 50)  # -1 infers the row count (3,600,000 here)
mat2 = pd.DataFrame(mat2)

time: 475 ms


### TSNE to reduce dimensionality further

Here I **tried** to reduce the number of dimensions to two; however, the code didn't finish running because of the time complexity of the t-SNE algorithm on a dataset of this size. I will try a different implementation of t-SNE to resolve this issue.

In [0]:
from sklearn.manifold import TSNE

train_texts_tsne = TSNE(n_components=2, learning_rate=1000).fit_transform(mat2)

## Saving the newest objects
import pickle

# Saving the objects: 
with open('/content/drive/My Drive/Colab Notebooks/objs_tsne1.pkl', 'wb') as f:  # Python 3: open(..., 'wb')
    pickle.dump([train_texts_50dim, train_texts_tsne], f)
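
Until a faster t-SNE implementation is in place, one possible workaround (not the method used in this notebook) is to run t-SNE on a random subset of rows. The sketch below assumes a hypothetical array `svd_output` in place of the real 50-dimensional data, and `sample_size` is an illustrative choice.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)

# Hypothetical stand-in for the 50-dimensional SVD output
# (the real one has 3,600,000 rows).
svd_output = rng.normal(size=(5000, 50))

# Draw row indices without replacement; sample_size is an assumption.
sample_size = 500
sample_idx = rng.choice(svd_output.shape[0], size=sample_size, replace=False)
subset = svd_output[sample_idx]

# Embed only the sampled rows in 2 dimensions.
subset_tsne = TSNE(n_components=2, random_state=42).fit_transform(subset)

print(subset_tsne.shape)
```

Note that scikit-learn's `TSNE` has no `transform` method, so rows outside the sample do not get an embedding this way; keeping `sample_idx` makes it possible to match the embedded points back to their labels.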


Thus, we have reduced the training data to 50 dimensions using the TruncatedSVD algorithm. The further reduction to 2 dimensions with t-SNE is set up above but, as noted, did not finish on the full dataset.

The dimensionality-reduced data can now be passed to k-means or another clustering technique to find clusters.

### To try a different number of dimensions for t-SNE, you can load the saved SVD output back using the following code

In [4]:
import pickle
# Getting back the objects:
with open('/content/drive/My Drive/Colab Notebooks/objs_svd1.pkl', 'rb') as f:  # Python 3: open(..., 'rb')
    # The file stores a one-element list, so take the first item to get the array
    train_texts_50dim = pickle.load(f)[0]

time: 11.1 s
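
Because the SVD cell saves `[train_texts_50dim]`, a one-element list, `pickle.load` returns a list rather than the array itself. The sketch below reproduces this with a small hypothetical array and an in-memory buffer in place of the Drive file.

```python
import io
import pickle

import numpy as np

arr = np.arange(12, dtype=float).reshape(4, 3)  # small stand-in for the SVD output

buffer = io.BytesIO()        # in-memory stand-in for the .pkl file
pickle.dump([arr], buffer)   # saved as a one-element list, as in the SVD cell
buffer.seek(0)

loaded = pickle.load(buffer)  # a list, not an array
(restored,) = loaded          # unpack the single element

print(type(loaded).__name__, np.array(loaded).shape, restored.shape)
# → list (1, 4, 3) (4, 3)
```

Wrapping the list in `np.array` adds a leading axis of length 1, which is why the array has to be unpacked or reshaped before it is turned into a DataFrame.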
