## A first application: classifying iris species

In this section, we will work through a simple classification problem and create our first machine learning model.

Let's assume a hobby botanist, Alice, wants to identify the species of some iris flowers that she has found. She has collected some measurements for each iris: the length and width of the petals, and the length and width of the sepals, all measured in centimeters.

<img src="images/iris.png" />

Alice also has the measurements of some irises that have been previously identified by an expert botanist as belonging to the species Setosa, Versicolor or Virginica. 

For these measurements, she can be certain of which species each iris belongs to. Let’s assume that these are the only species our hobby botanist will encounter in the wild.

Our goal? Build a machine learning model that can learn from the measurements of these irises whose species is known, so that we can predict the species for a new or unseen iris.

### Working towards a solution

- We have measurements for which we know the correct species of iris - supervised learning problem 
- Goal is to predict one of several options (the species of iris) - classification problem
- The possible outputs (different species of irises) are called classes

The desired output for a single data point (an iris) is the species of the flower. For a particular data point, the species it belongs to is called its label.

### Again...let's meet the data!

This time we use a different dataset: the iris dataset, a classic dataset in machine learning and statistics. It is included in scikit-learn in the `datasets` module. We can load it by calling the `load_iris` function.

In [17]:
from sklearn.datasets import load_iris
iris = load_iris()

The `iris` object that is returned by `load_iris` is a **`Bunch`** object (scikit-learn's own `sklearn.utils.Bunch` class, and not a DataFrame!), which is very similar to a dictionary. It contains keys and values:

In [20]:
iris.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
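Because a `Bunch` behaves like a dictionary, one quick way to explore it is to loop over its keys. The following is a minimal sketch of that idea; the variable names `same_object` and `value_types` are just illustrative. It also checks that key-style access (`iris['data']`) and attribute-style access (`iris.data`) refer to the same object.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Key-style and attribute-style access refer to the same underlying array
same_object = iris['data'] is iris.data
print(same_object)

# List every key together with the type of the value it holds
value_types = {key: type(iris[key]).__name__ for key in iris.keys()}
for key, type_name in value_types.items():
    print(key, '->', type_name)
```

The exact set of keys can differ between scikit-learn versions, so the loop is a safer way to inspect the object than relying on a fixed list.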

The value of the key `DESCR` is a short description of the dataset. The cell below prints only the beginning of it; you can read the rest on your own.

In [26]:
print(iris['DESCR'][:200] + "\n...")

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive
...


The value of the key `target_names` is an array of strings, containing the species of flower that we want to predict:

In [28]:
iris['target_names']

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

The value of `feature_names` is a list of strings, giving the description of each feature:

In [29]:
iris['feature_names']

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

The data itself is contained in the `target` and `data` fields. The `data` field holds the numeric measurements of sepal length, sepal width, petal length, and petal width in a NumPy array:

In [31]:
type(iris['data'])

numpy.ndarray

The rows in the data array correspond to flowers, while the columns represent the four measurements that were taken for each flower:

In [33]:
iris['data'].shape

(150, 4)

`shape` shows us that the iris dataset contains measurements for 150 different flowers.

The individual items are called <em>samples</em> in machine learning, and their
properties are called <em>features</em>.

The shape of the data array is (number of samples, number of features). This is a convention in scikit-learn, and your data will always be assumed to be in this format.
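One practical consequence of this convention is that even a single flower has to be presented as a two-dimensional array with one row. The sketch below illustrates this with the first iris sample; the variable names are my own and not part of the dataset.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris['data']

# Rows are samples (flowers), columns are features (measurements)
n_samples, n_features = X.shape
print(n_samples, n_features)

# A single flower taken from the array is 1-D: just 4 values
one_flower = X[0]
print(one_flower.shape)

# Reshape it to 1 sample x 4 features so it follows the same convention
one_flower_2d = one_flower.reshape(1, -1)
print(one_flower_2d.shape)
```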

The feature values for the first five samples can be accessed as:

In [36]:
iris['data'][:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])


The target array contains the species of each of the flowers that were measured, also
as a numpy array:

In [37]:
type(iris['target'])

numpy.ndarray

The target is a one-dimensional array, with one entry per flower:

In [38]:
iris['target'].shape

(150,)

The species are encoded as integers from 0 to 2:

In [39]:
iris['target']

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

The integer codes correspond to positions in the `iris['target_names']` array, such that 0 represents Setosa, 1 represents Versicolor and 2 represents Virginica.
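Since the integer codes are positions in `target_names`, the names can be recovered by indexing one array with the other. Here is a small sketch of that lookup, together with a count of flowers per class (the names `species` and `counts` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Index the array of names with the array of integer codes
species = iris['target_names'][iris['target']]
print(species[0], species[50], species[100])

# Count how many flowers belong to each class
counts = np.bincount(iris['target'])
print(dict(zip(iris['target_names'], counts)))
```

The counts agree with the dataset description, which reports 50 instances in each of the three classes.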

### Training and testing data

Our goal is to build an ML model from the iris dataset that can predict the species of iris for new sets of measurements. In other words, our model should be able to generalise to unseen instances.

However, before we can apply our ML model to new instances or measurements, we need to know whether the model actually works, that is, whether we should trust its predictions.

To assess the performance of a model, we show the model new data (that it hasn’t seen before) for which we have labels. This is usually done by splitting the labeled data we have collected (150 flower measurements in this instance) into two parts.

One part of our dataset will be used to build or train our ML model. We call this the **training data** or **training set**. The rest of the data will be used to assess how well the model works. We call this the **test data**, **test set** or **hold-out set**.

Scikit-learn provides the `train_test_split` function that shuffles and splits the dataset.

By default, `train_test_split` extracts 75% of the rows in the data as the training set, together with the corresponding labels for this data. The remaining 25% of the data, together with the remaining labels, are declared as the test set.

> How much data you put into the training set and the test set is somewhat arbitrary, but using a test set containing 25% of the data is a good rule of thumb.

In [40]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'],
                                                        random_state=0)

The `train_test_split` function shuffles the dataset using a pseudorandom number generator before making the split. If we simply took the last 25% of the data as a test set, all the data points would have the label 2, because the data points are sorted by label (see the output for `iris['target']` above).

Using a test set containing only one of the three classes would not tell us much about how well the model generalizes, so we shuffle our data to make sure the test data contains data from all classes.

To make sure that we get the same output if we run the same function several times, we provide the pseudorandom number generator with a fixed seed using the `random_state` parameter. This makes the outcome deterministic: the same call always produces the same split.
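The points above can be checked directly. The following sketch repeats the split with `test_size=0.25` written out explicitly (my addition, to make the 75%/25% proportion visible) and then looks at the shapes, the classes in an unshuffled tail of the data, the classes in the shuffled test set, and the effect of reusing the same seed.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris['data'], iris['target']

# test_size=0.25 is written out here to make the 75%/25% split explicit
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(X_train.shape, X_test.shape)

# Without shuffling, the last quarter of the rows holds a single class
print(np.unique(y[-len(y_test):]))

# After shuffling, the test set covers these classes
print(np.unique(y_test))

# Reusing the same seed reproduces the same split
_, X_test_again, _, y_test_again = train_test_split(
    X, y, test_size=0.25, random_state=0)
print(np.array_equal(X_test, X_test_again))
```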

# k-nearest neighbors (KNN)

K-nearest neighbors, KNN for short, is a supervised learning algorithm that is most often used for classification, although it can also be used for regression.

It is a simple algorithm that stores all available cases and classifies new cases by a majority vote of its K neighbors. 

A case is assigned to the class that is most common among its K nearest neighbors, where nearness is measured by a distance function. The distance function can be Euclidean, Manhattan, Minkowski or Hamming. The first three distance functions are used for continuous variables, and the fourth (Hamming) is suitable for categorical variables.

If K = 1, then the case is simply assigned to the class of its nearest neighbor. 
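To make the distance functions mentioned above concrete, here is a minimal NumPy sketch. The function names and example vectors are my own. Manhattan and Euclidean distance are written as the special cases of Minkowski distance with order 1 and 2, and Hamming distance is implemented as a count of differing positions (some libraries report it as a fraction instead).

```python
import numpy as np

def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two numeric vectors."""
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

def manhattan_distance(a, b):
    """Minkowski distance with p = 1."""
    return minkowski_distance(a, b, p=1)

def euclidean_distance(a, b):
    """Minkowski distance with p = 2."""
    return minkowski_distance(a, b, p=2)

def hamming_distance(a, b):
    """Number of positions at which two categorical vectors differ."""
    return int(np.sum(a != b))

# Continuous features: a 3-4-5 right triangle
a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(manhattan_distance(a, b))
print(euclidean_distance(a, b))

# Categorical features: two of the three positions differ
u = np.array(['red', 'round', 'small'])
v = np.array(['red', 'square', 'large'])
print(hamming_distance(u, v))
```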

## How the KNN algorithm works

Let’s take a simple case to understand the KNN algorithm. Consider a spread of red circles (RC) and green squares (GS) as illustrated in the figure below.

<img src="images/knn1.png" />

The task at hand is to find out the class of the blue star (BS). BS can either be RC or GS and nothing else. The K in the KNN algorithm is the number of nearest neighbors we wish to take a vote from. Let’s say K = 3. Hence, we draw a circle with BS as its center, just big enough to enclose only three data points on the plane.

<img src="images/knn2.png" />

The three closest points to BS are all RC. Hence, with a good level of confidence we can say that BS should belong to the class RC. Here, the choice was obvious, as all three votes from the closest neighbors went to RC. The choice of the parameter K is crucial in this algorithm.
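The voting procedure can be sketched in a few lines of NumPy. The coordinates below are made up for illustration and are not taken from the figures; `knn_vote` is a hypothetical helper, not a scikit-learn function.

```python
from collections import Counter

import numpy as np

def knn_vote(points, labels, query, k):
    """Return the majority label among the k points closest to query."""
    # Euclidean distance from the query to every stored point
    distances = np.linalg.norm(points - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbors
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up coordinates: three red circles (RC) and three green squares (GS)
points = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
                   [5.0, 5.0], [6.0, 5.5], [5.5, 6.5]])
labels = ['RC', 'RC', 'RC', 'GS', 'GS', 'GS']

# The blue star (BS) whose class we want to determine
blue_star = np.array([2.0, 2.0])
print(knn_vote(points, labels, blue_star, k=3))
```

With these coordinates the three nearest neighbors of the blue star are all red circles, so the vote is unanimous. This sketch does not handle tied votes in any principled way.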

The KNN algorithm is based on feature similarity. Choosing the right value of K is a non-trivial task, and the process of choosing it is known as parameter tuning. This is important to ensure better classification accuracy.

Commonly used heuristics for choosing the value of K are:

- sqrt(n), n being the number of data points
- an odd number if the number of classes is 2, which avoids tied votes
- larger values of K make the vote less sensitive to noisy points, but a K that is too large blurs the boundaries between classes
- K-Fold Cross Validation (KFCV) to decide the value of K, i.e. KFCV for testing the performance of KNN with different values of K
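As a rough illustration of the first and last heuristics, the sketch below computes the square-root rule of thumb and then compares a handful of K values by 5-fold cross-validation. The candidate list and the choice of 5 folds are arbitrary assumptions, and for brevity the sketch uses the full iris data; in practice you would tune K on the training portion only.

```python
import math

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris['data'], iris['target']

# Rule-of-thumb starting point: square root of the number of data points
k_rule_of_thumb = int(math.sqrt(len(X)))
print(k_rule_of_thumb)

# Mean 5-fold cross-validated accuracy for each candidate K
candidate_ks = [1, 3, 5, 7, 9, 11, 13, 15]
mean_scores = {}
for k in candidate_ks:
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_scores[k] = cross_val_score(knn, X, y, cv=5).mean()

# Keep the K with the highest mean accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, mean_scores[best_k])
```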


All machine learning models in scikit-learn are implemented in their own classes, which are called `Estimator` classes. The K nearest neighbors classification algorithm is implemented in the `KNeighborsClassifier` class in the `neighbors` module.

We consider the following:

- KNN: looks at the K closest labeled data points
- it is a classification method
- first, we need to train the model on our data (train = fit)
- fit(): fits (trains) the model on the data
- predict(): predicts labels for the data
- x: features
- y: target variable (here, the iris species)
- n_neighbors: i.e. K. In the example below it is set to 3, which means the 3 closest labeled data points are considered

In [14]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 3)

The knn object encapsulates the algorithm to build the model from the training data, as well as the algorithm to make predictions on new data points.

It will also hold the information the algorithm has extracted from the training data. In the case of KNeighborsClassifier, it will just store the training set.

To build the model on the training set, we call the `fit` method of the `knn` object, which takes as arguments `x_train`, containing the training data, and `y_train`, containing the corresponding training labels:

In [15]:
# Reuse the training split created by train_test_split above
x_train = X_train
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=3, p=2,
           weights='uniform')

### Making predictions

In [None]:
# Predict the species for the held-out test set, which the model has not seen
prediction = knn.predict(X_test)
print('Prediction: {}'.format(prediction))
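To judge whether the predictions can be trusted, they can be compared with the known labels of the held-out test set. The sketch below repeats the split and the fit so that it runs on its own, computes the fraction of correct test predictions, and then classifies one new flower. The measurements in `X_new` are hypothetical values chosen for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], random_state=0)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Fraction of test flowers whose species is predicted correctly
y_pred = knn.predict(X_test)
accuracy = np.mean(y_pred == y_test)
print('Test set accuracy: {:.2f}'.format(accuracy))

# Hypothetical measurements for one new flower (1 sample x 4 features)
X_new = np.array([[5.0, 2.9, 1.0, 0.2]])
new_prediction = knn.predict(X_new)
print('Predicted species: {}'.format(iris['target_names'][new_prediction]))
```

Note that the new flower is passed as a two-dimensional array with a single row, following the samples-by-features convention described earlier.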