# Beginner's Guide to K-Nearest Neighbors with cuML

In [1]:
import cudf

df = cudf.read_csv('https://github.com/gumdropsteve/datasets/raw/master/iris.csv')

In [2]:
# df.tail()

In [3]:
# df.to_pandas().plot(kind='scatter', x='sepal_length', y='petal_width', c='target', cmap=('spring'), sharex=False)

In [4]:
# df.species.unique()

## Nearest Neighbors
NearestNeighbors is an unsupervised model that lets you query the k nearest neighbors of each sample in a set of input samples.

In [5]:
from cuml.neighbors import NearestNeighbors

knn = NearestNeighbors(n_neighbors=3)

Calling `kneighbors` on a fitted NearestNeighbors model returns a tuple of distances and indices.

distances: cuDF DataFrame or numpy ndarray
    The distances of the k-nearest neighbors for each row vector
    in X

indices: cuDF DataFrame or numpy ndarray
    The indices of the k-nearest neighbors for each row vector in X
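
To make the shape of that return value concrete, here is a minimal pure-Python sketch of a brute-force Euclidean neighbor search. It is an illustration only, not cuML's implementation: the helper name `k_nearest` and the toy `points` data are hypothetical.

```python
import math

def k_nearest(train, queries, k):
    """Brute-force search: (distances, indices) of the k closest training rows per query row."""
    all_distances, all_indices = [], []
    for q in queries:
        # Euclidean distance from this query row to every training row, smallest first
        scored = sorted((math.dist(q, row), i) for i, row in enumerate(train))
        nearest = scored[:k]
        all_distances.append([d for d, _ in nearest])
        all_indices.append([i for _, i in nearest])
    return all_distances, all_indices

# Hypothetical toy data (two features per row), not the iris values
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
distances, indices = k_nearest(points, points, k=2)

print(indices[0])    # [0, 1]
print(distances[0])  # [0.0, 1.0]
```

When the query set is the same as the fitted set, each row's closest neighbor is itself at distance 0, which is why index 0 appears first for row 0.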

In [6]:
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

X.tail(3)

| | sepal_length | sepal_width | petal_length | petal_width |
|---|---|---|---|---|
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 |


In [7]:
# fit the model on the features only (NearestNeighbors is unsupervised, so no labels are passed)
knn.fit(X)

NearestNeighbors(n_neighbors=3, verbose=False, handle=<cuml.common.handle.Handle object at 0x7f9f3d6c5bf0>, algorithm='brute', metric='euclidean')

In [8]:
distances, indices = knn.kneighbors(X, n_neighbors=3)

In [9]:
# distances.tail(3)

In [10]:
# indices.tail(3)

### Data Prep

Before going any further, we should split our data into training and testing datasets. This lets us evaluate the model on data it has never seen.

In [11]:
from cuml.preprocessing import train_test_split

In [12]:
X = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]

y = df.target

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
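
What a call like this does can be pictured with a small pure-Python sketch. It assumes the rows are shuffled before the cut; the helper `split_rows`, its `seed` parameter, and the stand-in data are hypothetical and are not part of cuML.

```python
import random

def split_rows(features, labels, train_size=0.8, seed=0):
    """Shuffle row positions, then cut them into a training part and a testing part."""
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    cut = int(len(order) * train_size)
    train_idx, test_idx = order[:cut], order[cut:]

    def pick(data, idx):
        return [data[i] for i in idx]

    # Features and labels are indexed with the same positions so they stay aligned
    return (pick(features, train_idx), pick(features, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Hypothetical stand-in data with 150 rows, the same row count as the iris table
features = [[float(i)] for i in range(150)]
labels = [i % 3 for i in range(150)]
X_tr, X_te, y_tr, y_te = split_rows(features, labels, train_size=0.8)

print(len(X_tr), len(X_te))  # 120 30
```

With 150 rows and `train_size=0.8`, 120 rows go to training and 30 to testing, which matches the 0 to 29 index range in the prediction outputs.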

## K-Nearest Neighbors Classification vs Regression

#### **Classifier**

K-Nearest Neighbors Classifier is an instance-based learning technique
that keeps training samples around for prediction, rather than trying
to learn a generalizable set of model parameters. To classify a sample,
it takes a vote among the labels of the k closest training samples.
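
As a rough illustration of instance-based classification, the sketch below keeps the training rows as-is and predicts by majority vote among the k closest ones. It is plain Python rather than cuML, and `knn_classify` and the toy data are hypothetical names chosen for this example.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """Predict the label held by most of the k training rows closest to `query`."""
    # Rank training row positions by Euclidean distance to the query
    ranked = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    votes = Counter(train_y[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy training set: two well-separated groups labelled 0 and 1
train_X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_classify(train_X, train_y, [1.1, 1.0]))  # 0
print(knn_classify(train_X, train_y, [4.9, 5.0]))  # 1
```

No parameters are learned here: "fitting" amounts to storing `train_X` and `train_y`, and all the work happens at prediction time.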

In [14]:
from cuml.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)

In [15]:
knn.fit(X_train, y_train)

In [16]:
results_df = knn.predict(X_test)

results_df.tail(3)

| | 0 |
|---|---|
| 27 | 2 |
| 28 | 0 |
| 29 | 0 |


In [17]:
# rename the prediction column explicitly by passing the mapping to `columns`
results_df = results_df.rename(columns={0: 'predicted'})

results_df['actual'] = y_test.values

results_df.tail(3)

| | predicted | actual |
|---|---|---|
| 27 | 2 | 2 |
| 28 | 0 | 0 |
| 29 | 0 | 0 |
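
Comparing the two columns row by row gives a simple accuracy score. The sketch below is a plain-Python illustration; the `accuracy` helper is hypothetical, and the fourth data row is an invented mismatch added only so the result is not trivially 1.0.

```python
def accuracy(predicted, actual):
    """Fraction of rows where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# The three rows shown above, plus one hypothetical mismatch for illustration
predicted = [2, 0, 0, 1]
actual = [2, 0, 0, 2]

print(accuracy(predicted, actual))  # 0.75
```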


#### **Regressor**

K-Nearest Neighbors Regressor is an instance-based learning technique
that keeps training samples around for prediction, rather than trying
to learn a generalizable set of model parameters.

The K-Nearest Neighbors Regressor computes the average of the
labels of the k closest neighbors and uses it as the predicted value.
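
The averaging step can be sketched in a few lines of plain Python. This is an illustration that assumes an unweighted mean, not cuML's implementation; `knn_regress` and the toy data are hypothetical.

```python
import math

def knn_regress(train_X, train_y, query, k=3):
    """Predict the mean target of the k training rows closest to `query`."""
    # Rank training row positions by Euclidean distance to the query
    ranked = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], query))
    nearest = ranked[:k]
    return sum(train_y[i] for i in nearest) / len(nearest)

# Hypothetical one-feature training set
train_X = [[1.0], [2.0], [3.0], [10.0]]
train_y = [10.0, 20.0, 30.0, 100.0]

# The three closest rows to 2.1 are 2.0, 3.0 and 1.0, so the targets 20, 30 and 10 are averaged
print(knn_regress(train_X, train_y, [2.1]))  # 20.0
```

Because the output is an average, integer targets produce floating-point predictions: neighbors labelled 2, 2 and 1 would give about 1.67 rather than a class label.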

In [18]:
from cuml.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=3)

In [19]:
knn.fit(X_train, y_train)

In [20]:
results_df = knn.predict(X_test)

results_df.tail(3)

| | 0 |
|---|---|
| 27 | 2.0 |
| 28 | 0.0 |
| 29 | 0.0 |


In [21]:
# rename the prediction column explicitly by passing the mapping to `columns`
results_df = results_df.rename(columns={0: 'predicted'})

results_df['actual'] = y_test.values

results_df.tail(3)

| | predicted | actual |
|---|---|---|
| 27 | 2.0 | 2 |
| 28 | 0.0 | 0 |
| 29 | 0.0 | 0 |
