# UMAP

I noticed @tunguz's notebook didn't include UMAP, which I use a lot at work for quickly visualising datasets, so I thought I would expand on his work.


Another small thing I've added is to colour the train and test data differently (train in blue, test in orange). It's hardly scientific, but my first impression is that the train and test sets look fairly similar, and if there is a difference, there doesn't seem to be much in the test set that isn't in the train set. Obviously that's assuming the FFT doesn't remove weird things :)
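
To put a rough number on that visual impression, one option is to measure, for each test point, the distance to its nearest train point in the 2D embedding. This is only a minimal NumPy sketch and not part of the original notebook: the function name `nearest_train_distance` and its `chunk` parameter are hypothetical, and it expects the `train_2D` and `test_2D` arrays computed further down. The toy arrays are there just so the block runs on its own.

```python
import numpy as np

def nearest_train_distance(train_2d, test_2d, chunk=256):
    """Distance from each test point to its closest train point (Euclidean, in the embedding)."""
    train_2d = np.asarray(train_2d, dtype=float)
    test_2d = np.asarray(test_2d, dtype=float)
    nearest = np.empty(len(test_2d))
    # Work through the test points in chunks so the pairwise distance matrix stays small
    for start in range(0, len(test_2d), chunk):
        block = test_2d[start:start + chunk]
        diff = block[:, None, :] - train_2d[None, :, :]
        nearest[start:start + chunk] = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
    return nearest

# Toy stand-ins for train_2D / test_2D (hypothetical data)
toy_train = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
toy_test = np.array([[0.0, 0.1], [5.0, 5.0]])
toy_dist = nearest_train_distance(toy_train, toy_test)
```

On the real data you would call `nearest_train_distance(train_2D, test_2D)` and look at the upper tail: test points that sit far from every train point are the candidates for "in test but not in train". Distances in a UMAP embedding are only loosely related to distances in the original feature space, so treat this as a rough check, much like the plot itself. For a very large train set, lower `chunk` to keep memory use down.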

There's nothing stopping you from also doing this to the labels ;)
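
Colouring by label could look something like the sketch below. It assumes a 1-D array of training labels aligned row-for-row with the train embedding. This notebook never loads the labels, so `labels`, `scatter_by_label`, and the toy data are all hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def scatter_by_label(points_2d, labels, alpha=0.3):
    """Draw one scatter layer per distinct label so each gets its own colour."""
    points_2d = np.asarray(points_2d)
    labels = np.asarray(labels)
    fig, ax = plt.subplots(figsize=(10, 10))
    for value in np.unique(labels):
        mask = labels == value
        ax.scatter(points_2d[mask, 0], points_2d[mask, 1],
                   alpha=alpha, label=f"label {value}")
    ax.legend()
    return ax

# Toy stand-ins for the train embedding and its labels (hypothetical data)
toy_points = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1], [0.9, 1.2]])
toy_labels = np.array([0, 1, 0, 1])
toy_ax = scatter_by_label(toy_points, toy_labels)
```

With real labels this would be called on the train part of the embedding only, since the test set has no labels to colour by.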

----------------

In this notebook we'll do dimensionality reduction and visualization of the FFT features that were first used in this competition in [Giba's notebook](https://www.kaggle.com/titericz/0-309-baseline-logisticregression-using-fft). I've created a stand-alone notebook that extracts those features, and it can be found [here](https://www.kaggle.com/tunguz/giba-s-fft-features-only).

We will make this visualization notebook with the Rapids library. [Rapids](https://rapids.ai) is an open-source, GPU-accelerated data science and machine learning library, developed and maintained by [Nvidia](https://www.nvidia.com). It is designed to be compatible with many existing CPU tools, such as Pandas, scikit-learn, and NumPy. It enables **massive** acceleration of many data science and machine learning tasks, oftentimes by a factor of 100X or even more.

Rapids is still undergoing development, and only recently has it become possible to use Rapids natively in the Kaggle Docker environment. If you are interested in installing and running Rapids locally on your own machine, you should [refer to the following instructions](https://rapids.ai/start.html).

In [None]:
import cupy as cp
import cudf, cuml
import pandas as pd
import numpy as np
from cuml.manifold import TSNE, UMAP
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
%matplotlib inline

In [None]:
train = cp.load('../input/giba-s-fft-features-only/TRAIN.npy')
test = cp.load("../input/giba-s-fft-features-only/TEST.npy")

In [None]:
train_test = cp.vstack([train, test])

# UMAP

The default number of neighbors (`n_neighbors`) used in UMAP is 15; playing with it will give you different results. I've typically found that reducing it gives interesting results, while increasing it gives you a blob with less structure. Perhaps that's just a reflection of the size of the datasets I'm playing with.
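
The cells below repeat the same steps for each `n_neighbors` value. As a rough sketch, the sweep could also be written as one loop. The helper `sweep_n_neighbors`, its `make_reducer` and `to_host` parameters, and the `FirstTwoColumns` stand-in are hypothetical names, not part of cuML; the stand-in only exists so the block runs without a GPU.

```python
import numpy as np

def sweep_n_neighbors(data, make_reducer, neighbor_values=(15, 10, 7, 5),
                      to_host=np.asarray):
    """Fit one reducer per n_neighbors value and return {n_neighbors: embedding}."""
    embeddings = {}
    for k in neighbor_values:
        reducer = make_reducer(k)  # any object with a fit_transform method
        embeddings[k] = to_host(reducer.fit_transform(data))
    return embeddings

class FirstTwoColumns:
    """Stand-in reducer that keeps the first two columns; it is not UMAP."""
    def __init__(self, n_neighbors):
        self.n_neighbors = n_neighbors

    def fit_transform(self, data):
        return np.asarray(data)[:, :2]

# Toy stand-in for train_test (hypothetical data)
toy_data = np.arange(40, dtype=float).reshape(10, 4)
toy_embeddings = sweep_n_neighbors(toy_data, FirstTwoColumns)
```

In this notebook the call would be along the lines of `sweep_n_neighbors(train_test, lambda k: UMAP(n_components=2, n_neighbors=k, random_state=42), to_host=cp.asnumpy)`, giving one embedding per setting that can then be split into train and test and plotted as in the cells below.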

In [None]:
%%time
umap = UMAP(n_components=2, n_neighbors=15, random_state=42)
train_test_2D = umap.fit_transform(train_test)
train_test_2D = cp.asnumpy(train_test_2D)

train_2D = train_test_2D[:train.shape[0], :]
test_2D = train_test_2D[train.shape[0]:, :]

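# Train is drawn first (blue) and test second (orange) in matplotlib's default colour cycle
plt.figure(figsize=(10,10))
plt.scatter(train_2D[:,0], train_2D[:,1], alpha=0.3, label='train')
plt.scatter(test_2D[:,0], test_2D[:,1], alpha=0.3, label='test')
plt.legend()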

## With n_neighbors=10

In [None]:
%%time
umap = UMAP(n_components=2, n_neighbors=10, random_state=42)
train_test_2D = umap.fit_transform(train_test)
train_test_2D = cp.asnumpy(train_test_2D)

train_2D = train_test_2D[:train.shape[0], :]
test_2D = train_test_2D[train.shape[0]:, :]

plt.figure(figsize=(10,10))
plt.scatter(train_2D[:,0], train_2D[:,1], alpha=0.3, label='train')
plt.scatter(test_2D[:,0], test_2D[:,1], alpha=0.3, label='test')
plt.legend()

## With n_neighbors=7

In [None]:
%%time
umap = UMAP(n_components=2, n_neighbors=7, random_state=42)
train_test_2D = umap.fit_transform(train_test)
train_test_2D = cp.asnumpy(train_test_2D)

train_2D = train_test_2D[:train.shape[0], :]
test_2D = train_test_2D[train.shape[0]:, :]

plt.figure(figsize=(10,10))
plt.scatter(train_2D[:,0], train_2D[:,1], alpha=0.3, label='train')
plt.scatter(test_2D[:,0], test_2D[:,1], alpha=0.3, label='test')
plt.legend()

## With n_neighbors=5

With fewer than 5 neighbors you basically lose all structure again.

In [None]:
%%time
umap = UMAP(n_components=2, n_neighbors=5, random_state=42)
train_test_2D = umap.fit_transform(train_test)
train_test_2D = cp.asnumpy(train_test_2D)

train_2D = train_test_2D[:train.shape[0], :]
test_2D = train_test_2D[train.shape[0]:, :]

plt.figure(figsize=(10,10))
plt.scatter(train_2D[:,0], train_2D[:,1], alpha=0.3, label='train')
plt.scatter(test_2D[:,0], test_2D[:,1], alpha=0.3, label='test')
plt.legend()