# Machine Learning with Python

In [None]:
import matplotlib.pyplot as plt
import numpy as np

## 2.1 Classification

Now we will move on to supervised methods, starting with some different approaches to classification. Let's use the `moons` data from our clustering examples but increase the noise.

In [None]:
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
X, y = make_moons(n_samples=100, noise=0.25, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)


plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train)
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
plt.show()

### [k-Nearest Neighbours](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html#sklearn.neighbors.KNeighborsClassifier)

In many respects this is the simplest supervised algorithm:

* Choose an integer value for *k*.
* Choose a [distance metric](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html#sklearn.neighbors.DistanceMetric). The default is `minkowski`, which corresponds to Euclidean distance when the power parameter `p=2` (also the default setting).
* Given an input feature vector **x**, rank the training data by increasing distance from **x**.
* Predict the class label by majority vote from the *k* nearest instances to **x**.

This simplicity makes k-NN a good baseline for comparing with the performance of other methods.
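
The steps above can be sketched directly in NumPy. This is only an illustrative sketch, not how scikit-learn implements it: the function `knn_predict` and the tiny arrays `X_demo` and `y_demo` are hypothetical names made up for this example, and ties in the vote are resolved in a simplistic way.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=1, p=2):
    """Predict the label of one point x by majority vote among its k nearest neighbours."""
    # Minkowski distance of order p from x to every training point (p=2 is Euclidean)
    dists = np.sum(np.abs(X_train - x) ** p, axis=1) ** (1.0 / p)
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Tiny hypothetical data set: two points per class
X_demo = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_demo = np.array([0, 0, 1, 1])

# Two of the three nearest neighbours of (0.2, 0.1) belong to class 0
print(knn_predict(X_demo, y_demo, np.array([0.2, 0.1]), k=3))
```

Because every prediction ranks the whole training set, this naive version does work proportional to the number of training points for each query.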

"Training the model" consists only of storing the training data:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
k = 1
knn = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
knn.fit(X_train,y_train)

Now we can make a prediction for any new data point. By considering a grid of points, we can construct the *decision boundary*:

In [None]:
def plot_boundary(model,X_p,y_p,title):
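    # Plot the decision boundary of a fitted model together with the points (X_p, y_p).
    # For that, we will assign a color to each point in the mesh
    # [x_min, x_max]x[y_min, y_max]. The mesh extent is taken from the full
    # data set X (a global), so training and test plots share the same axes.
    h = 0.05  # step size of the mesh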
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure(figsize=(6, 6))
    plt.contourf(xx, yy, Z, alpha=0.2)

    # Overlay the supplied points (training or test set)
    plt.scatter(X_p[:, 0], X_p[:, 1], c=y_p)

    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title(title)
    plt.xlabel("Feature 0")
    plt.ylabel("Feature 1")
    plt.show()

In [None]:
plot_boundary(knn,X_train,y_train,"k-NN(k=" + str(k) + ") + training set")

Despite its simplicity, k-NN performs very well on our test data:

In [None]:
plot_boundary(knn,X_test,y_test,"k-NN(k=" + str(k) + ") + test set")

In [None]:
y_pred = knn.predict(X_test)
y_pred

In [None]:
y_test

We can calculate the *accuracy* as a basic performance metric:

In [None]:
y_pred == y_test

In [None]:
print( "accuracy =", np.mean(y_pred == y_test))
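
The same number can also be obtained from scikit-learn's own helpers. The sketch below rebuilds the data and the k=1 model so that it runs on its own, then compares the manual calculation with `accuracy_score` and with the classifier's `score` method; the names `acc_manual`, `acc_metric` and `acc_score` are introduced here purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and split as above
X, y = make_moons(n_samples=100, noise=0.25, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
y_pred = knn.predict(X_test)

acc_manual = np.mean(y_pred == y_test)       # fraction of correct predictions
acc_metric = accuracy_score(y_test, y_pred)  # same quantity via sklearn.metrics
acc_score = knn.score(X_test, y_test)        # classifiers report accuracy from score()
print(acc_manual, acc_metric, acc_score)
```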

We can investigate what happens to accuracy as we increase the value of *k*:

In [None]:
acc = np.zeros(50)
for k in range(1,51):
    knn = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    knn.fit(X_train,y_train)
    y_pred = knn.predict(X_test)
    acc[k-1] = np.mean(y_pred == y_test)
    
plt.scatter(np.arange(1,51),acc, c='c')
plt.title( "Test accuracy")
plt.xlabel("k")
plt.ylabel("accuracy")
plt.show()

When *k=1*, the model is *overfitting* to the training data. However, if we set *k* too high then the local information will be too diluted, resulting in *underfitting*.

Looking at the plot of accuracy calculated for the training data helps to illustrate this point:

In [None]:
acc = np.zeros(50)
for k in range(1,51):
    knn = KNeighborsClassifier(n_neighbors=k, metric='minkowski', p=2)
    knn.fit(X_train,y_train)
    y_pred = knn.predict(X_train)
    acc[k-1] = np.mean(y_pred == y_train)
    
plt.scatter(np.arange(1,51),acc, c='m')
plt.title( "Training accuracy")
plt.xlabel("k")
plt.ylabel("accuracy")
plt.show()
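
To compare the two curves side by side, both accuracies can be collected in a single loop. The sketch below is one possible way to do this; `train_acc` and `test_acc` are names chosen for this example. Note that picking *k* from the test accuracy, as the last line does, is only for illustration: in practice a separate validation set or cross-validation would normally be used, so that the test set stays untouched.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

ks = list(range(1, 51))
train_acc = np.zeros(len(ks))
test_acc = np.zeros(len(ks))
for i, n in enumerate(ks):
    model = KNeighborsClassifier(n_neighbors=n).fit(X_train, y_train)
    train_acc[i] = model.score(X_train, y_train)  # accuracy on data the model has stored
    test_acc[i] = model.score(X_test, y_test)     # accuracy on unseen data

# With k=1 every training point is its own nearest neighbour
print("training accuracy at k=1:", train_acc[0])
print("k with the highest test accuracy:", ks[int(np.argmax(test_acc))])
```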

### [Logistic regression](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)

Logistic regression is a simple method for linear classification. Because it is a linear method, the decision boundary will be a *hyperplane* in the feature space.

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty=None)  # no regularisation (older scikit-learn releases spelled this penalty='none')
lr.fit(X_train,y_train)
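
In two dimensions the hyperplane is a straight line, which can be read off from the fitted coefficients. The sketch below is an illustration under a few assumptions: it fits a separate model `lr_demo` with scikit-learn's default settings (which include L2 regularisation) simply to avoid version-specific arguments, and the names `w`, `b` and `manual_pred` are introduced here.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr_demo = LogisticRegression().fit(X_train, y_train)
w = lr_demo.coef_[0]       # one weight per feature
b = lr_demo.intercept_[0]  # bias term

# The model predicts class 1 wherever w . x + b > 0, so the boundary is w . x + b = 0
scores = X_test @ w + b
manual_pred = (scores > 0).astype(int)
print("matches lr_demo.predict:", np.array_equal(manual_pred, lr_demo.predict(X_test)))

# Written as a line in the plane: Feature 1 = slope * Feature 0 + offset
print("slope =", -w[0] / w[1], "offset =", -b / w[1])
```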


In [None]:
plot_boundary(lr,X_train,y_train,"Logistic regression + training set")

This is a nonlinear classification task, so logistic regression is not able to capture the detailed shape of the training data. However, performance on this particular test data set is still reasonably good:

In [None]:
plot_boundary(lr,X_test,y_test,"Logistic regression + test set")

In [None]:
y_pred = lr.predict(X_test)
print( "accuracy =", np.mean(y_pred == y_test))


### [Decision Tree](https://scikit-learn.org/stable/modules/tree.html)

A decision tree is one way to implement a nonlinear decision boundary:


In [None]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train,y_train)

In [None]:
plot_boundary(tree,X_train,y_train,"Decision Tree + training set")

Because the decision tree's decision boundary has no underlying parametrisation, it may overfit when the training data is sparse (i.e. does not have good coverage of the feature space).
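
One way to see this is to compare a fully grown tree with a depth-limited one. The sketch below is illustrative only: `max_depth=2` is an arbitrary choice for this example, and `full_tree` and `small_tree` are names introduced here. `export_text` prints the axis-aligned threshold rules that make up the boundary.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_moons(n_samples=100, noise=0.25, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# With default settings the tree keeps splitting until every leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
# Limiting the depth gives a coarser boundary
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

for name, model in [("full", full_tree), ("max_depth=2", small_tree)]:
    print(name,
          "| depth:", model.get_depth(),
          "| training accuracy:", model.score(X_train, y_train),
          "| test accuracy:", model.score(X_test, y_test))

# Each rule is a threshold on a single feature
print(export_text(small_tree, feature_names=["Feature 0", "Feature 1"]))
```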

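The ensemble method [*random forest*](https://scikit-learn.org/stable/modules/ensemble.html#random-forests) mitigates this problem by training many trees, each on a bootstrap resample of the training data and with a random subset of features considered at each split, and then combining their predictions: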

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100,random_state=0)
rf.fit(X_train,y_train)

In [None]:
plot_boundary(rf,X_train,y_train,"Random Forest + training set")

In general, the random forest is less prone to overfitting than the single tree:

In [None]:
y_pred = tree.predict(X_test)
print( "accuracy =", np.mean(y_pred == y_test))

In [None]:
y_pred = rf.predict(X_test)
print( "accuracy =", np.mean(y_pred == y_test))
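
The fitted forest exposes its individual trees through `estimators_`, which makes it possible to check how the ensemble prediction is formed. The sketch below refits a forest under the name `rf_demo` so that it runs on its own; it shows that the class probabilities of the forest are the average of the probabilities from its trees, and how much the individual trees vary.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Average the class probabilities predicted by each tree in the ensemble
tree_probs = np.mean([t.predict_proba(X_test) for t in rf_demo.estimators_], axis=0)
forest_probs = rf_demo.predict_proba(X_test)
print("forest = mean of trees:", np.allclose(tree_probs, forest_probs))

# Accuracy of the individual trees compared with the whole forest
tree_acc = [np.mean(t.predict(X_test) == y_test) for t in rf_demo.estimators_]
print("single trees: min", min(tree_acc), "max", max(tree_acc))
print("forest      :", rf_demo.score(X_test, y_test))
```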

Both the tree and the forest can report *feature importances*, which can be helpful to gain insight into the model.

In [None]:
importances = rf.feature_importances_
print(importances)

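The importances are indexed like the columns of `X`, so the first value belongs to Feature 0 and the second to Feature 1, and together they sum to 1. A larger value means the forest relied more on that feature when choosing its splits, which suggests (but does not directly measure) that performance would suffer more *if it were removed from the model*.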
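
The values in `feature_importances_` are computed from the impurity reduction of the splits made during training. A more direct, though still approximate, way to estimate how much the predictions depend on a feature is *permutation importance*: shuffle one column of the test data and record how far the accuracy drops. The sketch below uses `sklearn.inspection.permutation_importance`; the choice of `n_repeats=20` and the name `rf_demo` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn on the test set and measure the mean drop in accuracy
result = permutation_importance(rf_demo, X_test, y_test, n_repeats=20, random_state=0)

print("impurity-based importances:", rf_demo.feature_importances_)
print("permutation importances   :", result.importances_mean)
```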

### [Support Vector Machines](https://scikit-learn.org/stable/modules/svm.html)

The support vector machine (SVM) is another example of a nonlinear classification algorithm.

Conceptually, the SVM deals with nonlinearity by projecting into a *higher*-dimensional space in which the training data are linearly separable. The particular form of the transformation is called the [*kernel function*](https://scikit-learn.org/stable/modules/svm.html#kernel-functions). In practice, the so-called [*kernel trick*](https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f) means that we can compute the separating hyperplane without actually needing to perform any expensive high-dimensional transformations.

Different tasks will require different choices of kernel function.
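
To make the kernel trick more concrete, the decision function of a fitted `SVC` can be reproduced using only kernel evaluations between the support vectors and the new points, without ever constructing the high-dimensional features. The sketch below is an illustration under stated assumptions: `gamma` is fixed to 1.0 (an arbitrary value chosen so that the kernel can be recomputed explicitly), and `svc_demo`, `K` and `manual` are names introduced here.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.25, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gamma = 1.0  # assumed value, fixed so the kernel below matches the one used in training
svc_demo = SVC(kernel='rbf', gamma=gamma).fit(X_train, y_train)

# Kernel values between every support vector and every test point
K = rbf_kernel(svc_demo.support_vectors_, X_test, gamma=gamma)

# Weighted sum of kernel values plus the intercept
manual = (svc_demo.dual_coef_ @ K + svc_demo.intercept_).ravel()

print("support vectors used:", len(svc_demo.support_vectors_))
print("matches decision_function:", np.allclose(manual, svc_demo.decision_function(X_test)))
```

Only the support vectors enter the sum, so the remaining training points could be discarded after fitting without changing the predictions.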


In [None]:
from sklearn.svm import SVC
svc = SVC(kernel='rbf')
svc.fit(X_train,y_train)

In [None]:
plot_boundary(svc,X_train,y_train,"Support Vector Classifier + training set")

In [None]:
y_pred = svc.predict(X_test)
print( "accuracy =", np.mean(y_pred == y_test))

### Neural network

A neural network can be an extremely flexible way to learn a nonlinear decision boundary.

More expressive power is gained from multiple hidden layers, at the expense of adding many parameters to the model.


In [None]:
from sklearn.neural_network import MLPClassifier
nn = MLPClassifier(hidden_layer_sizes=(10,50),max_iter=10000,random_state=0)
nn.fit(X_train,y_train)

In [None]:
nn.coefs_

In [None]:
nn.intercepts_
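
The shapes of these arrays show how quickly the parameter count grows with the layer sizes. The sketch below refits the same architecture under the name `nn_demo` and counts the weights and biases; for two input features, hidden layers of 10 and 50 units, and a single output unit (which scikit-learn uses for a binary problem), this comes to 2·10 + 10·50 + 50·1 = 570 weights and 10 + 50 + 1 = 61 biases.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=100, noise=0.25, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nn_demo = MLPClassifier(hidden_layer_sizes=(10, 50), max_iter=10000, random_state=0)
nn_demo.fit(X_train, y_train)

# One weight matrix and one bias vector per layer transition
weight_shapes = [w.shape for w in nn_demo.coefs_]
bias_shapes = [b.shape for b in nn_demo.intercepts_]
n_params = sum(w.size for w in nn_demo.coefs_) + sum(b.size for b in nn_demo.intercepts_)

print("weight shapes:", weight_shapes)
print("bias shapes  :", bias_shapes)
print("total parameters:", n_params)
```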

In [None]:
plot_boundary(nn,X_train,y_train,"Neural Network + training set")

In [None]:
y_pred = nn.predict(X_test)
print( "accuracy =", np.mean(y_pred == y_test))

For difficult unstructured inputs such as image data, a neural network may be the best option as it has the potential to extract meaningful features despite rotations, translations and scaling. However, tuning the model hyperparameters (such as the number and size of the hidden layers) to the specific problem can be challenging.

## Exercise

Train a classifier of your choice on the `digits` dataset.

What is the accuracy of your model, evaluated on the test data?

Does your model do better than random guessing?
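
As a possible starting point, the sketch below loads the data with `load_digits`, makes a train/test split, and measures a random-guessing baseline with `DummyClassifier`. The variable names and the choice of `strategy="uniform"` are assumptions made for this example; with ten classes, uniform guessing should land somewhere around 10% accuracy.

```python
from sklearn.datasets import load_digits
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
print("data shape:", digits.data.shape)  # each 8x8 image is flattened to 64 features

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    digits.data, digits.target, random_state=0)

# Baseline that ignores the features and picks one of the classes uniformly at random
baseline = DummyClassifier(strategy="uniform", random_state=0).fit(Xd_train, yd_train)
baseline_acc = baseline.score(Xd_test, yd_test)
print("random-guess accuracy =", baseline_acc)
```

Any classifier from this section can then be fitted on `Xd_train`, `yd_train` and compared against this baseline.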