# Beginner's Guide to K-Nearest Neighbors with cuML

<iframe width="560" height="315" src="https://youtu.be/HVXime0nQeI" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>





In [None]:
import cudf

df = cudf.read_csv('https://github.com/gumdropsteve/datasets/raw/master/iris.csv')

In [None]:
df.tail()

In [None]:
df.to_pandas().sample(50).plot(kind='scatter', x='sepal_length', y='petal_width', c='target', cmap=('spring'), sharex=False)

In [None]:
df.species.unique()

## Nearest Neighbors
`NearestNeighbors` is an unsupervised model: for each query point, it returns the k closest samples from the data it was fit on.

In [None]:
from cuml.neighbors import NearestNeighbors

knn = NearestNeighbors(n_neighbors=3)

In [None]:
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

X.tail(3)

Fit the model on the features only (`NearestNeighbors` is unsupervised, so no labels are needed), then query the nearest neighbors at k=3.

In [None]:
knn.fit(X)

cuML's `.kneighbors()` method returns a tuple of two objects holding the *distances* and *indices* of the k-nearest neighbors for each row (sample) in X. When X is a cudf.DataFrame, as it is here, both are cudf.DataFrames.

In [None]:
distances, indices = knn.kneighbors(X, n_neighbors=3)

In [None]:
distances.tail(3)

In [None]:
indices.tail(3)
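
To make the two outputs concrete, here is a minimal brute-force sketch of the same kind of query in plain Python. It assumes Euclidean distance and uses a handful of made-up 2-D points rather than the iris data; the helper name `brute_force_kneighbors` is hypothetical, and this is not how cuML implements the search on the GPU.

```python
import math

def brute_force_kneighbors(points, queries, k):
    """Return (distances, indices) of the k nearest points for each query.

    Hypothetical helper for illustration; assumes Euclidean distance.
    """
    all_distances, all_indices = [], []
    for q in queries:
        # Distance from this query to every stored point, paired with the point's index.
        scored = sorted((math.dist(q, p), i) for i, p in enumerate(points))
        nearest = scored[:k]
        all_distances.append([d for d, _ in nearest])
        all_indices.append([i for _, i in nearest])
    return all_distances, all_indices

# Made-up points for illustration (not the iris data).
points = [(0.0, 0.0), (3.0, 4.0), (1.0, 0.0), (0.0, 2.0)]

# Querying with the same points the "model" holds, as the cell above does with X.
distances, indices = brute_force_kneighbors(points, points, k=3)

print(distances[0])  # [0.0, 1.0, 2.0]
print(indices[0])    # [0, 2, 3]
```

Because the query points are the same points that were stored, each row's nearest neighbor is the row itself at distance 0. The same pattern should be expected in the first column of the cuML output above.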

### Data Prep

Before we get too far ahead of ourselves, we should split our data into training and testing datasets. This allows us to test our model with actual data that the model has never seen.

In [None]:
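# Recent cuML releases expose train_test_split under cuml.model_selection;
# older releases exposed it under cuml.preprocessing.
from cuml.model_selection import train_test_split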

In [None]:
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

y = df.target

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
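
Conceptually, a split like the one above shuffles the rows and cuts them at the requested fraction, keeping each feature row paired with its label. The following is a minimal sketch in plain Python under that assumption; `simple_train_test_split` and its `seed` parameter are hypothetical names for illustration, not cuML's implementation.

```python
import random

def simple_train_test_split(X, y, train_size=0.8, seed=0):
    """Shuffle row indices, then cut them into a training part and a test part.

    Hypothetical helper for illustration only.
    """
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_train = round(len(X) * train_size)
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    # Index X and y with the same positions so rows and labels stay aligned.
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Made-up data: 10 rows with 2 features each, label = parity of the first feature.
X_demo = [[float(i), float(i) * 2] for i in range(10)]
y_demo = [i % 2 for i in range(10)]

Xd_train, Xd_test, yd_train, yd_test = simple_train_test_split(X_demo, y_demo, train_size=0.8)
print(len(Xd_train), len(Xd_test))  # 8 2
```

The important property is that no test row also appears in the training set, so predictions on the test set measure how the model handles unseen samples.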

## K-Nearest Neighbors Classification vs Regression

#### **Classifier**

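K-Nearest Neighbors Classifier is an instance-based learning technique:
it keeps the training samples around for prediction rather than trying
to learn a generalizable set of model parameters. To classify a new
sample, it finds the k closest training samples and assigns the most
common label among them.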
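
As a rough illustration of that idea, the sketch below classifies a point by majority vote among its k nearest training samples. It is plain Python on made-up data, assumes Euclidean distance and unweighted votes, and the function name `knn_classify` is hypothetical; it is not cuML's implementation.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest samples.

    Hypothetical helper for illustration; assumes Euclidean distance.
    """
    # "Training" is just storing the samples; all the work happens at query time.
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    # The most frequent label among the k neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up data: two well-separated clusters labeled 0 and 1.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_classify(train_X, train_y, (1.1, 1.0)))  # 0
print(knn_classify(train_X, train_y, (5.0, 5.1)))  # 1
```
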

In [None]:
from cuml.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)

In [None]:
knn.fit(X_train, y_train)

In [None]:
results = knn.predict(X_test)

results.tail(3)

In [None]:
df = X_test.copy()

df['actual'] = y_test.values
df['predicted'] = results.values

In [None]:
df
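
With actual and predicted labels side by side, one simple summary is the fraction of rows where the two agree. The sketch below computes that in plain Python on made-up label lists; the `accuracy` helper is hypothetical and the numbers are not results from the iris model.

```python
def accuracy(actual, predicted):
    """Fraction of positions where the predicted label equals the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    matches = sum(a == p for a, p in zip(actual, predicted))
    return matches / len(actual)

# Made-up labels for illustration: 4 of the 5 predictions are correct.
actual_labels = [0, 1, 2, 2, 1]
predicted_labels = [0, 1, 2, 1, 1]

print(accuracy(actual_labels, predicted_labels))  # 0.8
```
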

#### **Regressor**

Like the classifier, the K-Nearest Neighbors Regressor is an
instance-based learning technique: it keeps the training samples around
for prediction rather than trying to learn a generalizable set of model
parameters.

Instead of taking a vote, the regressor computes the average of the
labels of the k closest neighbors and uses that average as the
predicted value.
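
The averaging step can be sketched in a few lines of plain Python. This assumes Euclidean distance and an unweighted mean, uses made-up 1-D data, and the function name `knn_regress` is hypothetical; it is an illustration rather than cuML's implementation.

```python
import math

def knn_regress(train_X, train_y, query, k=3):
    """Predict a value for `query` as the mean target of its k nearest samples.

    Hypothetical helper for illustration; assumes Euclidean distance.
    """
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda pair: math.dist(pair[0], query))
    nearest_targets = [target for _, target in by_distance[:k]]
    return sum(nearest_targets) / len(nearest_targets)

# Made-up 1-D data where the target is 10 times the feature.
reg_X = [(1.0,), (2.0,), (3.0,), (10.0,)]
reg_y = [10.0, 20.0, 30.0, 100.0]

# The three nearest samples to 2.1 are 2.0, 3.0 and 1.0, so the
# prediction is the mean of 20, 30 and 10.
print(knn_regress(reg_X, reg_y, (2.1,)))  # 20.0
```

One consequence worth keeping in mind: when the target is an integer class id, as `target` is in this notebook, an averaged prediction can land between classes (for example, neighbors labeled 1, 1 and 2 average to about 1.33).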

In [None]:
from cuml.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=3)

In [None]:
knn.fit(X_train, y_train)

In [None]:
results = knn.predict(X_test)

results.tail(3)

In [None]:
df = X_test.copy()

df['actual'] = y_test.values
df['predicted'] = results.values

In [None]:
df

# Continued Learning 
Here are some resources I recommend to help fill in any gaps and provide a more complete picture.

#### **StatQuest: K-nearest neighbors, Clearly Explained**
- Watch on YouTube: [https://youtu.be/HVXime0nQeI](https://youtu.be/HVXime0nQeI)
- Channel: StatQuest with Josh Starmer ([Subscribe](https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw?sub_confirmation=1))

In [None]:
from IPython.display import YouTubeVideo

# Scale a 1280x720 frame down to 69% and pass whole-pixel dimensions.
YouTubeVideo('HVXime0nQeI', width=int(1280 * 0.69), height=int(720 * 0.69))

#### **_k_-nearest neighbors algorithm**
Wikipedia: [https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)