# Dimensionality Reduction for Unsupervised Learning

In [1]:
#Libraries
import pandas as pd
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [2]:
#Import to dataframe
df_features = pd.read_csv('processed_hotel_bookings.csv')

## Final Dimensionality Reduction Configuration

#### Notes on Methodology

The dataset is quite large, with nearly 40,000 rows. t-SNE runs quite slowly on it, taking over six minutes if nothing is done to speed it up.

I am only using eight features for the models. When I ran PCA first and fed its output into t-SNE to speed things up, it produced either very "stringy" clusters or clusters inferior to those from plain t-SNE. This can be seen in the alternate dimensionality reduction notebook.

I feel the PCA step would work better if there were a lot of features, instead of just eight.
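
For reference, a PCA-then-t-SNE pipeline of the kind described above can be sketched as follows. This is only an illustration on synthetic data, not the code from the alternate notebook: `X_demo`, the choice of four principal components, and the perplexity of 30 are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the eight standardized features (assumption:
# the real notebook would pass X_std here instead).
rng = np.random.default_rng(0)
X_demo = StandardScaler().fit_transform(rng.normal(size=(200, 8)))

# Step 1: PCA compresses the features. With only eight inputs there is
# not much to compress (four components is an arbitrary choice here).
X_pca = PCA(n_components=4, random_state=0).fit_transform(X_demo)

# Step 2: t-SNE runs on the PCA output instead of the original features.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_pca)

print(embedding.shape)  # (200, 2)
```

With hundreds of features the PCA step removes a lot of work from t-SNE; with eight, the saving is small and some structure may be discarded.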

I want to run the dimensionality reduction and the clustering algorithms on as much of the data as I can, but I encountered memory errors when trying to do too much in one notebook.

Later on, I found that even when running the clustering algorithms on a heavily reduced number of rows (often less than 20% of the original data), and further reducing that data via t-SNE dimensionality reduction, memory errors persisted.

Since it would not be proper to compare the performance of models run on different portions and slices of the data, I decided to reduce the data down to a random slice of 5% of the rows, to be used for all of the models.
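
One way to draw such a shared 5% slice is sketched below. It uses a synthetic frame, and the names `df_demo` and `df_sample`, the `feature_*` columns, and the seed of 42 are assumptions for illustration; in this notebook the call would be made on `df_features`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the processed bookings (assumption: the real
# notebook would sample df_features instead).
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(rng.normal(size=(1000, 8)),
                       columns=[f"feature_{i}" for i in range(8)])
df_demo["is_canceled"] = rng.integers(0, 2, size=1000)

# A fixed random_state makes the 5% slice reproducible, so every model
# can be run on exactly the same rows.
df_sample = df_demo.sample(frac=0.05, random_state=42)

# Reset the index so positional lookups such as y[i] keep working.
df_sample = df_sample.reset_index(drop=True)

print(len(df_sample))  # 50
```

Saving the sampled frame once and loading it in each model notebook would give the same guarantee without relying on the seed.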

I will then scale up the best-performing model to as many rows as memory permits.

In [3]:
#Libraries
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize

In [4]:
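# Loading the data
# Keep only numeric columns: StandardScaler cannot convert strings such as
# 'Resort Hotel' to float. Assumption: non-numeric columns are left out
# here rather than encoded.
X = df_features.select_dtypes(include='number')
y = df_features.is_canceled

# Standardizing the features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Scaling each row to unit L2 norm
X_norm = normalize(X_std)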
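
To make the two preprocessing steps concrete, here is a small sketch on synthetic numeric data (`X_demo` and the derived names are assumptions for illustration). `StandardScaler` works column by column and expects numeric input, while `normalize` works row by row. Note that the t-SNE cells below use `X_std`, not `X_norm`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Synthetic numeric features with a non-zero mean and non-unit spread.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 8))

X_std_demo = StandardScaler().fit_transform(X_demo)
X_norm_demo = normalize(X_std_demo)

# After standardizing, every column has mean 0 and standard deviation 1.
print(np.allclose(X_std_demo.mean(axis=0), 0.0))  # True
print(np.allclose(X_std_demo.std(axis=0), 1.0))   # True

# After normalizing, every row has an L2 norm of 1.
print(np.allclose(np.linalg.norm(X_norm_demo, axis=1), 1.0))  # True
```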

### t-SNE

In [None]:
from sklearn.manifold import TSNE
import time

In [None]:
time_start = time.time()
tsne = TSNE(n_components=2, verbose=1, perplexity=100, n_iter=300)
tsne_results = tsne.fit_transform(X_std)

print('t-SNE done! Time elapsed: {} seconds'.format(time.time()-time_start))

In the figure below, red shows the bookings that were not canceled, and blue shows the bookings that were.

In [None]:
plt.figure(figsize=(10,5))
colours = ["r","b"]
for i in range(tsne_results.shape[0]):
    plt.text(tsne_results[i, 0], tsne_results[i, 1], str(y[i]),
             color=colours[int(y[i])],
             fontdict={'weight': 'bold', 'size': 50}
        )

plt.xticks([])
plt.yticks([])
plt.axis('off')
plt.show()
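
Drawing one text label per row can be slow with tens of thousands of points. A possible alternative, sketched here on synthetic data, is one `scatter` call per class. The arrays `points` and `labels` are stand-ins for `tsne_results` and `y`, and the marker size and transparency are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins (assumption): a 2-D embedding and binary labels.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 2))
labels = rng.integers(0, 2, size=500)

fig, ax = plt.subplots(figsize=(10, 5))

# Same colour scheme as above: red = not canceled (0), blue = canceled (1).
for value, colour, name in [(0, "r", "not canceled"), (1, "b", "canceled")]:
    mask = labels == value
    ax.scatter(points[mask, 0], points[mask, 1],
               c=colour, s=5, alpha=0.5, label=name)

ax.legend()
ax.axis("off")
# In a notebook, finish with plt.show() to display the figure.
```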

### Elaboration of t-SNE configuration decision

In the plots below, the clusters steadily become more distinct as the perplexity increases, and the KL divergence value steadily goes up. I selected a perplexity of 100 because it performed best in terms of the KL divergence and my visual judgement of the clusters.
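
The KL divergence can also be read from the fitted model instead of the verbose log: scikit-learn stores the final value in the `kl_divergence_` attribute. The sketch below records it for a few perplexities on synthetic data; `X_demo`, the three perplexity values, and the dictionary name are assumptions for illustration.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for X_std (assumption); kept small so it runs quickly.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 8))

kl_by_perplexity = {}
for perplexity in (5, 20, 40):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    tsne.fit_transform(X_demo)
    # Final value of the objective that t-SNE minimises for this run.
    kl_by_perplexity[perplexity] = tsne.kl_divergence_

for perplexity, kl in kl_by_perplexity.items():
    print(f"perplexity={perplexity}: KL divergence={kl:.3f}")
```

Because the perplexity changes the high-dimensional distribution being matched, values from different perplexities are best treated as a rough guide alongside the plots.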

In [None]:
for i in range(10,110,10):
    tsne = TSNE(n_components=2, verbose=1, perplexity=i, n_iter=300)
    tsne_results = tsne.fit_transform(X_std)
    plt.figure(figsize=(10,5))
    plt.title("t-SNE with perplexity={}".format(i))
    plt.scatter(tsne_results[:, 0], tsne_results[:, 1])
    plt.xticks([])
    plt.yticks([])
    plt.axis('off')
    plt.show()

In [None]:
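# Assumption: the reduced frame is the random 5% slice described above (the seed is arbitrary)
df_features_reduced = df_features.sample(frac=0.05, random_state=42)
df_features_reduced.to_csv('reduced_processed_hotel_bookings.csv')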

### Notebook Summary 

In this notebook I show how I arrived at the final parameter decision for my dimensionality reduction algorithm.

I will show how I apply that dimensionality reduction to the various algorithms in the notebooks titled for them.