Practical: Classification with *K Nearest Neighbours* using *scikit-learn*
======================

This brief practical exercise introduces classification in Python.

We're going to create some data using synthetic-data functions that we will use again, in a different context, in the clustering practical. The difference here is that we also use the information the synthetic data sets provide about which class (which "set") each data point belongs to. In other words, we're working here with "labelled data".

We'll start by loading in some libraries again. We're going to use **scikit-learn** both to create the synthetic data, and to do the classification. We'll also use **matplotlib** for plotting.

In [None]:
import sklearn as sk
import sklearn.datasets as skd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

In [None]:
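# 500 points in two interleaved half-moons; y holds the class (0 or 1) of each point
X, y = skd.make_moons(n_samples=500, noise=0.3, random_state=1)
# Colour each point by its class label
plt.scatter(X[:, 0], X[:, 1], c=y)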

Let's create a classifier with default values. (You can see these default values by calling `classifier.get_params()`.)

In [None]:
classifier = KNeighborsClassifier()
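
As a quick check of the defaults mentioned above, the sketch below builds a fresh classifier and looks up a few of its parameters. The name `knn_defaults` is purely illustrative and is not part of the original exercise.

```python
from sklearn.neighbors import KNeighborsClassifier

# A classifier created with no arguments uses scikit-learn's defaults.
knn_defaults = KNeighborsClassifier()
params = knn_defaults.get_params()

# n_neighbors is the "k" in k nearest neighbours.
print(params["n_neighbors"])
# weights controls whether closer neighbours count for more in the vote.
print(params["weights"])
# metric names the distance measure used to find neighbours.
print(params["metric"])
```

Any of these can be overridden when creating the classifier, for example `KNeighborsClassifier(n_neighbors=3)`.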

In [None]:
# Hold back 40% of the points for testing; the remainder is used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)

In [None]:
X_train

**Try this**: How many points will X_train contain? Look at the parameters for `train_test_split()` above and see if you can guess this before using `X_train.shape` to get the answer.

In [None]:
#X_train.shape

**Try this**: Confirm that the sizes of `X_test`, `y_train` and `y_test` are as you expect.
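
One way to check your reasoning: with `test_size=.4`, the test set receives 40% of the points and the training set receives the rest. The sketch below regenerates the same synthetic data and prints all four shapes. The `random_state=0` argument to `train_test_split` is an assumption added here only to make the split repeatable, and the short variable names are illustrative.

```python
import sklearn.datasets as skd
from sklearn.model_selection import train_test_split

# Same synthetic data as above: 500 points with 2 features each.
X_demo, y_demo = skd.make_moons(n_samples=500, noise=0.3, random_state=1)

# random_state is an assumption for reproducibility; the exercise omits it.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0
)

# Feature arrays have one row per point and one column per feature.
print(Xtr.shape, Xte.shape)
# Label arrays are one-dimensional: one label per point.
print(ytr.shape, yte.shape)
```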

In [None]:
plt.scatter(X_train[:,0], X_train[:,1],c=y_train)

In [None]:
plt.scatter(X_test[:,0], X_test[:,1],c=y_test)

And now we train the classifier...

In [None]:
classifier.fit(X_train, y_train)

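In older versions of scikit-learn, the output above lists the classifier's parameters, so you can see which value of _k_ (`n_neighbors`) is chosen by default. Newer versions print only the parameters you have changed; in that case, inspect `classifier.n_neighbors` or `classifier.get_params()` instead.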
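
To make the idea behind the classifier concrete, here is a minimal sketch of k nearest neighbours done by hand with NumPy: find the _k_ closest training points and take a majority vote among their labels. It is an illustration under simple assumptions (Euclidean distance, unweighted votes, a hypothetical helper called `knn_predict_one`, and a tiny made-up data set), not scikit-learn's actual implementation.

```python
import numpy as np

def knn_predict_one(X_known, y_known, x, k=5):
    """Predict the label of one point x by majority vote (hypothetical helper)."""
    # Euclidean distance from x to every labelled point.
    dists = np.sqrt(((X_known - x) ** 2).sum(axis=1))
    # Indices of the k closest labelled points.
    nearest = np.argsort(dists)[:k]
    # Count how often each label occurs among those neighbours.
    votes = np.bincount(y_known[nearest])
    # The most common label wins.
    return int(votes.argmax())

# Tiny made-up data set: class 0 near the origin, class 1 near (5, 5).
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

label_a = knn_predict_one(X_toy, y_toy, np.array([0.5, 0.5]), k=3)
label_b = knn_predict_one(X_toy, y_toy, np.array([5.5, 5.5]), k=3)
print(label_a, label_b)
```

Note that "training" such a classifier amounts to little more than storing the labelled points; the real work happens at prediction time.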

In [None]:
pred = classifier.predict(X_test)
pred

In [None]:
pred.shape

In [None]:
plt.scatter(X_test[:,0], X_test[:,1],c=pred)

print(accuracy_score(y_test, pred))

The graph above shows the classes predicted by the classifier. Compare it with the graph below, which shows the true classes of the same test points (information we didn't use when creating the plot above). The prediction is quite good: the classifier has opted for smoother boundaries than the real data, but you'll probably see an accuracy score of more than 85%.

In [None]:
plt.scatter(X_test[:,0], X_test[:,1],c=y_test)
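
The accuracy score printed earlier is simply the fraction of test points whose predicted label matches the true label. The sketch below illustrates this on small made-up arrays (`y_true_demo` and `y_pred_demo` are illustrative names, not from the exercise).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for 8 points; the predictions differ in two places.
y_true_demo = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1, 0, 1, 1])

# Fraction of positions where prediction and truth agree.
manual_accuracy = (y_true_demo == y_pred_demo).mean()

print(manual_accuracy)
print(accuracy_score(y_true_demo, y_pred_demo))
```

Both lines print the same number, which is why `(pred == y_test).mean()` is a handy shortcut when exploring.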

**Try this**: Take the Iris dataset*. Use a similar `train_test_split`, holding back 40% of the data for testing, to see if you can predict the iris type. What accuracy score do you get in this case? (See hints below.)

\* The Iris dataset was used in R.A. Fisher's classic 1936 paper, *The Use of Multiple Measurements in Taxonomic Problems*. It is built into R (just type `iris`!) and is included as one of the example datasets in `sklearn.datasets`. It can also be found on the UCI Machine Learning Repository.

In [None]:
iris=skd.load_iris()

In [None]:
iris

In [None]:
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=.4)
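
One possible way to finish the exercise is sketched below: fit a fresh default classifier on the training portion and score it on the held-back 40%. The `random_state=0` argument and the names `iris_classifier`, `iris_pred`, `X_tr`, `X_te`, `y_tr` and `y_te` are assumptions added here for repeatability and clarity; without a fixed seed, your score will vary from run to run.

```python
import sklearn.datasets as skd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = skd.load_iris()

# iris.data holds 4 measurements per flower; iris.target holds the species (0, 1 or 2).
# random_state is an assumption for reproducibility; the exercise omits it.
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.4, random_state=0
)

# Train a default k nearest neighbours classifier on the training portion.
iris_classifier = KNeighborsClassifier()
iris_classifier.fit(X_tr, y_tr)

# Predict the species of the held-back flowers and compare with the truth.
iris_pred = iris_classifier.predict(X_te)
iris_accuracy = accuracy_score(y_te, iris_pred)
print(iris_accuracy)
```

Unlike the moons data, this problem has three classes and four features, but the classifier is used in exactly the same way.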

Acknowledgements
----------------
This exercise was compiled by Adam Carter, EPCC, for Practical Introduction to Data Science. It is based on content originally created by Magnus Morton for a custom EPCC course.