# Uniform Manifold Approximation and Projection (UMAP)
UMAP is a non-linear dimensionality reduction algorithm that can also be used to visualize a dataset. The UMAP model implemented in cuml lets the user set the following parameters:
1.	n_neighbors: number of neighboring samples used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved.
2.	n_components: the dimension of the space to embed into
3.	n_epochs: number of training epochs to be used in optimizing the low dimensional embedding
4.	learning_rate: initial learning rate for the embedding optimization.
5.	init: how to initialize the low dimensional embedding. Options are 'spectral' (use a spectral embedding of the fuzzy 1-skeleton) and 'random' (assign initial embedding positions at random).
6.	min_dist: the effective minimum distance between embedded points. Smaller values result in a more tightly clustered embedding, while larger values spread the points more evenly.
7.	spread: the effective scale of the embedded points. Together with min_dist, it determines how clustered the embedded points will be.
8.	set_op_mix_ratio: the ratio of pure fuzzy union to intersection used when combining local fuzzy simplicial sets. A value of 1.0 gives a pure fuzzy union and a value of 0.0 gives a pure fuzzy intersection.
9.	local_connectivity: number of nearest neighbors that should be assumed to be connected at a local level. It should not exceed the local intrinsic dimension of the manifold.
10.	repulsion_strength: weighting applied to negative samples in the low dimensional embedding optimization. Values greater than 1 give more weight to negative samples.
11.	negative_sample_rate: the number of negative samples to select per positive sample during the optimization process.
12.	transform_queue_size: for transform operations (embedding new points using a trained model), controls how aggressively to search for nearest neighbors.
13.	verbose: whether to print additional logging output (bool, optional, default False).
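
The effect of set_op_mix_ratio can be illustrated with a small numpy sketch. It assumes the formulation described for the reference UMAP algorithm, in which a directed neighbor-membership matrix is combined with its transpose using the product t-norm. The function name `combine_fuzzy_sets` and the toy matrix are hypothetical, and cuml's internal implementation may differ.

```python
import numpy as np

def combine_fuzzy_sets(memberships, set_op_mix_ratio=1.0):
    """Blend fuzzy union and fuzzy intersection of a matrix with its transpose."""
    transpose = memberships.T
    # Product t-norm: fuzzy intersection of the edges i->j and j->i.
    intersection = memberships * transpose
    # Matching t-conorm: fuzzy union of the edges i->j and j->i.
    union = memberships + transpose - intersection
    return set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * intersection

# Toy directed membership strengths between three points (illustrative values).
A = np.array([
    [0.0, 0.8, 0.0],
    [0.5, 0.0, 0.2],
    [0.0, 0.0, 0.0],
])

pure_union = combine_fuzzy_sets(A, set_op_mix_ratio=1.0)
pure_intersection = combine_fuzzy_sets(A, set_op_mix_ratio=0.0)
print(pure_union)
print(pure_intersection)
```

With a pure union, the edge between points 1 and 2 survives even though only one direction has non-zero membership; with a pure intersection, it is dropped.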

The cuml implementation of the UMAP model provides the following methods:
1.	fit: it fits the dataset into an embedded space
2.	fit_transform: it fits the dataset into an embedded space and returns the transformed output
3.	transform: it transforms the dataset into an existing embedded space and returns the low dimensional output

The model accepts only numpy arrays or cudf dataframes as input. To convert your dataset to cudf format, please read the cudf documentation at https://rapidsai.github.io/projects/cudf/en/latest/. For additional information on the UMAP model, please refer to the cuml documentation at https://rapidsai.github.io/projects/cuml/en/latest/index.html

In [1]:
import numpy as np
import pandas as pd

import cudf
import os

from sklearn import datasets
from sklearn.metrics import adjusted_rand_score
from sklearn.cluster import KMeans
from sklearn.manifold import trustworthiness

from cuml.manifold.umap import UMAP

# Running UMAP model on blobs dataset

In [2]:
# create a blobs dataset with 500 samples and 10 features each
data, labels = datasets.make_blobs(
    n_samples=500, n_features=10, centers=5)

In [3]:
# use the cuml UMAP algorithm to reduce the features of the dataset and store the result in embedding
embedding = UMAP().fit_transform(data)

In [4]:
# calculate the adjusted Rand score between the true labels and the clusters
# that sklearn KMeans finds in the cuml UMAP embedding
score = adjusted_rand_score(labels, KMeans(5).fit_predict(embedding))

assert score == 1.0
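
The score above compares KMeans cluster ids with the true labels even though cluster ids are arbitrary. This works because the adjusted Rand index depends only on which samples are grouped together, not on the names of the groups. The following is a minimal numpy sketch of the index, not sklearn's implementation; the helper name `adjusted_rand_sketch` and the toy labels are hypothetical, and degenerate cases such as a single cluster are not handled.

```python
import numpy as np

def adjusted_rand_sketch(labels_a, labels_b):
    """Adjusted Rand index computed from a contingency table of two labelings."""
    a = np.unique(np.asarray(labels_a), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b), return_inverse=True)[1]
    # contingency[i, j] = number of samples with label i in a and label j in b
    contingency = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(contingency, (a, b), 1)

    def pairs(x):
        # number of unordered pairs that can be drawn from x items
        return x * (x - 1) / 2.0

    sum_cells = pairs(contingency).sum()
    sum_rows = pairs(contingency.sum(axis=1)).sum()
    sum_cols = pairs(contingency.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(len(a))
    maximum = (sum_rows + sum_cols) / 2.0
    return (sum_cells - expected) / (maximum - expected)

truth = [0, 0, 1, 1, 2, 2]
renamed = [2, 2, 0, 0, 1, 1]   # same grouping, different cluster ids
merged = [0, 0, 0, 0, 1, 1]    # two true clusters merged into one

renamed_score = adjusted_rand_sketch(truth, renamed)
merged_score = adjusted_rand_sketch(truth, merged)
print(renamed_score, merged_score)
```

A relabeled but otherwise identical grouping scores 1.0, while merging two true clusters lowers the score.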

# Running UMAP model on iris dataset

In [5]:
# load the iris dataset from sklearn and extract the required information
iris = datasets.load_iris()
data = iris.data

In [6]:
# define the cuml UMAP model and use fit_transform function to obtain the low dimensional output of the input dataset
embedding = UMAP(
    n_neighbors=10, min_dist=0.01, init="random"
).fit_transform(data)

In [7]:
# calculate the trustworthiness of the results obtained from the cuml UMAP
trust = trustworthiness(iris.data, embedding, n_neighbors=10)
assert trust >= 0.95
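
Trustworthiness measures how well local neighborhoods are preserved: a point is penalized when one of its k nearest neighbors in the embedding is not among its k nearest neighbors in the original space. The sketch below follows the commonly cited definition of the metric in plain numpy. It is an illustration and not sklearn's implementation; the helper name `trustworthiness_sketch` and the random toy data are assumptions.

```python
import numpy as np

def trustworthiness_sketch(X, X_embedded, k):
    """Trustworthiness in [0, 1]; assumes k < n / 2 and Euclidean distances."""
    n = X.shape[0]
    d_orig = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_emb = np.linalg.norm(X_embedded[:, None, :] - X_embedded[None, :, :], axis=-1)
    # Push each point's distance to itself to the end of its neighbor ordering.
    np.fill_diagonal(d_orig, np.inf)
    np.fill_diagonal(d_emb, np.inf)

    # ranks[i, j] = position of j among i's neighbors in the original space (1 = nearest)
    order_orig = np.argsort(d_orig, axis=1)
    rows = np.arange(n)[:, None]
    ranks = np.empty_like(order_orig)
    ranks[rows, order_orig] = np.arange(1, n + 1)[None, :]

    # k nearest neighbors of each point in the embedding
    nn_emb = np.argsort(d_emb, axis=1)[:, :k]
    # Penalize embedded neighbors that rank beyond k in the original space.
    penalty = np.maximum(ranks[rows, nn_emb] - k, 0).sum()
    return 1.0 - penalty * 2.0 / (n * k * (2.0 * n - 3.0 * k - 1.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
identity_score = trustworthiness_sketch(X, X.copy(), 10)
# An unrelated "embedding": rows shuffled and truncated to two columns.
shuffled_score = trustworthiness_sketch(X, X[rng.permutation(60)][:, :2], 10)
print(identity_score, shuffled_score)
```

An embedding that keeps every neighborhood intact scores 1.0, and an embedding unrelated to the input scores lower.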

In [8]:
# create a boolean selection mask of size 150 whose entries are True with probability 0.75 and False with probability 0.25
iris_selection = np.random.choice(
    [True, False], 150, replace=True, p=[0.75, 0.25])
# create an iris dataset using the selection variable
data = iris.data[iris_selection]
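
The boolean mask splits the 150 rows into two disjoint subsets: the rows where the mask is True, and the rows where it is False (selected with `~`). Because each entry is drawn independently, the split is only approximately 75/25. A small sketch with a stand-in array (the seeded generator and the `rows` array are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
rows = np.arange(150).reshape(150, 1)   # stand-in for a 150-row dataset

# Each entry is True with probability 0.75, independently of the others.
mask = rng.choice([True, False], size=150, replace=True, p=[0.75, 0.25])

selected_rows = rows[mask]      # rows where the mask is True
held_out_rows = rows[~mask]     # the remaining rows

print(len(selected_rows), len(held_out_rows))
```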

In [9]:
# create a cuml UMAP model
fitter = UMAP(n_neighbors=10, min_dist=0.01, verbose=True)
# fit the cuml UMAP model (fitter) on the data selected with the selection variable
fitter.fit(data)
# create a new iris dataset by inverting the selection variable (i.e. the rows not used for fitting)
new_data = iris.data[~iris_selection]
# transform the new data using the previously created embedded space
embedding = fitter.transform(new_data)

In [10]:
# calculate the trustworthiness score for the held-out data (new_data)
trust = trustworthiness(new_data, embedding, n_neighbors=10)
assert trust >= 0.90