In [1]:
%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt

# Machine Learning

Python is an excellent choice for machine learning. The following two libraries are where most people start:

_Scikit Learn_

[Scikit learn](http://scikit-learn.org/stable/) is the best place to get started with ML in Python. There are loads of good tutorials around and it's very easy to prototype and play around with.

_TensorFlow_

[TensorFlow](https://www.tensorflow.org/) is the "serious" machine learning library used by many professional machine learning engineers and researchers. It's very versatile but can be intimidating for the beginner.

## Basics

The aim of machine learning is to "learn" from some sample dataset, then predict or classify new, previously unseen data.

There are two broad categories:

- **Supervised** - the training data comes with some additional attributes that we want to predict. This will normally be classification (i.e. labelling unlabelled data) or regression (predict some continuous variable given some input).

- **Unsupervised** - the training data is not annotated and we want the system to "discover" some properties. This will usually be some sort of clustering or grouping.
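
In scikit-learn the difference between the two categories shows up in what you pass to `fit`. The following is a minimal sketch, assuming scikit-learn's bundled iris dataset (introduced later in this notebook); the choice of `KMeans` with three clusters and the `random_state` value are illustrative assumptions, not something this notebook depends on.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Supervised: the labels y are passed to fit() alongside the features
classifier = KNeighborsClassifier()
classifier.fit(X, y)
supervised_predictions = classifier.predict(X[:5])

# Unsupervised: only the features are passed; the model groups the rows itself
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0)
clusterer.fit(X)
cluster_ids = clusterer.labels_

print(supervised_predictions)
print(cluster_ids[:5])
```

Note that the cluster ids found by `KMeans` are arbitrary group numbers and need not match the class labels in `y`.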


## Scikit Learn

Scikit learn has a good set of [examples](http://scikit-learn.org/stable/auto_examples/index.html) and [tutorials](http://scikit-learn.org/stable/tutorial/index.html).

### Nearest Neighbours

The k-nearest neighbours algorithm is a simple supervised classifier: it assigns a new data point the class that is most common among its closest neighbours in the training data.
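
To make the idea concrete, here is a minimal from-scratch sketch of a nearest-neighbours classifier in `numpy`. The function name `knn_predict` and the toy data are hypothetical and only for illustration; scikit-learn's own implementation is more sophisticated.

```python
from collections import Counter

import numpy as np


def knn_predict(train_points, train_labels, query_point, k=3):
    """Classify query_point by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = np.linalg.norm(train_points - query_point, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbours
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups labelled 0 and 1
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(points, labels, np.array([1.1, 1.0])))  # 0
print(knn_predict(points, labels, np.array([5.1, 5.0])))  # 1
```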

We can start by loading the iris dataset. This is a set of measurements of iris flowers (sepal length, sepal width, petal length and petal width), plus the class of each flower (Setosa, Versicolour and Virginica).

We can load the iris dataset with scikit learn like this:

In [9]:
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

iris = datasets.load_iris()

We can see the values listed for the type of flower in the `target` property. This will be `0`, `1` or `2`:

In [11]:
iris.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
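
The integer labels can be translated back into species names with the dataset's `target_names` attribute, and the column order of the measurements is listed in `feature_names`. A short sketch (the three sample indices are picked arbitrarily, one from each class):

```python
from sklearn import datasets

iris = datasets.load_iris()

# target_names[i] is the species name for integer label i
print(iris.target_names)

# Translate the labels of three arbitrary samples into species names
for label in iris.target[[0, 75, 149]]:
    print(label, iris.target_names[label])

# Column names for iris.data, in order
print(iris.feature_names)
```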

So that we can plot the result, we only want to take the first two columns of data (sepal length and sepal width). The data is found in the `data` attribute of the dataset.

In [38]:
cut_down_data = iris.data[:, :2]
cut_down_data

array([[5.1, 3.5],
       [4.9, 3. ],
       [4.7, 3.2],
       [4.6, 3.1],
       [5. , 3.6],
       [5.4, 3.9],
       [4.6, 3.4],
       [5. , 3.4],
       [4.4, 2.9],
       [4.9, 3.1],
       [5.4, 3.7],
       [4.8, 3.4],
       [4.8, 3. ],
       [4.3, 3. ],
       [5.8, 4. ],
       [5.7, 4.4],
       [5.4, 3.9],
       [5.1, 3.5],
       [5.7, 3.8],
       [5.1, 3.8],
       [5.4, 3.4],
       [5.1, 3.7],
       [4.6, 3.6],
       [5.1, 3.3],
       [4.8, 3.4],
       [5. , 3. ],
       [5. , 3.4],
       [5.2, 3.5],
       [5.2, 3.4],
       [4.7, 3.2],
       [4.8, 3.1],
       [5.4, 3.4],
       [5.2, 4.1],
       [5.5, 4.2],
       [4.9, 3.1],
       [5. , 3.2],
       [5.5, 3.5],
       [4.9, 3.1],
       [4.4, 3. ],
       [5.1, 3.4],
       [5. , 3.5],
       [4.5, 2.3],
       [4.4, 3.2],
       [5. , 3.5],
       [5.1, 3.8],
       [4.8, 3. ],
       [5.1, 3.8],
       [4.6, 3.2],
       [5.3, 3.7],
       [5. , 3.3],
       [7. , 3.2],
       [6.4, 3.2],
       [6.9,

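Before we split up the data we need to shuffle it. The measurements and the labels must be shuffled in the same order, otherwise each row loses its matching label. We can do this with `numpy` by generating one random ordering of row indices and applying it to both arrays:

In [39]:
# One random ordering of row indices, applied to both arrays so rows and labels stay aligned
shuffled_indices = np.random.permutation(len(cut_down_data))
shuffled_data = cut_down_data[shuffled_indices]
shuffled_target = iris.target[shuffled_indices]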

We can then take most of the dataset as training data and hold back the rest for testing:

In [40]:
# Training data - take all but the last 10
train_data = shuffled_data[:-10]
train_target = shuffled_target[:-10]
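
As an alternative to shuffling and slicing by hand, scikit-learn provides `train_test_split`, which shuffles the measurements and labels together and splits them in one call. A minimal sketch, assuming the same two-column data as above; the `random_state` value is an arbitrary choice made here so the split is repeatable:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
cut_down_data = iris.data[:, :2]

# test_size=10 holds out 10 samples; random_state fixes the shuffle (arbitrary value)
train_data, test_data, train_target, test_target = train_test_split(
    cut_down_data, iris.target, test_size=10, random_state=0
)

print(train_data.shape, test_data.shape)  # (140, 2) (10, 2)
```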

Now that we have our training data we can train our model:

In [44]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()

# Here is where we do the actual training
knn.fit(train_data, train_target) 

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

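We can now use our `knn` object to predict the class of the held-back test data and compare it with the actual labels. The shuffle is random, so the exact values will differ from run to run: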

In [43]:
test_data = shuffled_data[-10:]
test_target = shuffled_target[-10:]

predicted_target = knn.predict(test_data)

print('Predicted: {}'.format(predicted_target))
print('Actual:    {}'.format(test_target))

Predicted: [0 1 0 0 1 1 1 0 0 1]
Actual:    [1 2 0 1 2 0 1 1 2 2]
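
Comparing the two arrays by eye gets tedious, so it can help to reduce them to a single accuracy figure. The sketch below repeats the whole pipeline in one place so that it runs on its own; it shuffles the measurements and labels with a single shared index order, and the seed passed to `RandomState` is an arbitrary assumption made for repeatability.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
cut_down_data = iris.data[:, :2]

# Shuffle measurements and labels with the same index order
rng = np.random.RandomState(0)  # seed chosen arbitrarily for repeatability
order = rng.permutation(len(cut_down_data))
shuffled_data = cut_down_data[order]
shuffled_target = iris.target[order]

# Train on all but the last 10 samples
knn = KNeighborsClassifier()
knn.fit(shuffled_data[:-10], shuffled_target[:-10])

# Predict the 10 held-back samples
predicted_target = knn.predict(shuffled_data[-10:])
test_target = shuffled_target[-10:]

# Fraction of test samples whose predicted class matches the true class
accuracy = np.mean(predicted_target == test_target)
print('Accuracy: {:.2f}'.format(accuracy))
```

The classifier's own `knn.score(test_data, test_target)` method reports the same mean accuracy. With only 10 test samples the figure is coarse, so expect it to move around between different shuffles.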


## Exercise

Have a look at the Scikit learn [linear regression example](http://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html). 

Copy the code into this notebook and see if you can work out what it's doing.

What happens if you reduce the size of the training data?