# Flower Classification using Scikit-Learn

In this notebook, we will use __Scikit-Learn__ to __classify__ different species of __iris__ flowers.

The iris dataset was used in Fisher's classic 1936 paper: *The Use of Multiple Measurements in Taxonomic Problems*, and is a popular toy dataset used in the field of machine learning.

### The Dataset

The Iris dataset consists of 50 samples from each of three species of iris, 150 samples in total. Each iris plant in the dataset has four different __features__ or attributes:
* Petal Length
* Petal Width
* Sepal Length
* Sepal Width
![alt-text](http://terpconnect.umd.edu/~petersd/666/html/iris_with_labels.jpg)

The aim of this notebook is to __classify__ the iris plants into 3 species:
* Iris Versicolor
* Iris Setosa
* Iris Virginica

![alt-text](http://dataaspirant.com/wp-content/uploads/2017/01/irises.png)


To do this, we will use two popular classification algorithms: __Logistic Regression__, and the __K-Nearest Neighbors__ classifier.

## Logistic Regression

Unlike linear regression, which predicts continuous variables, logistic regression predicts __categorical dependent variables__. In its basic form the outcome is binary, and the method extends to more than two __classes__. In this case, the classes are the different species of iris flowers.

Logistic regression learns decision boundaries between the different classes. The figure below shows three classes together with their decision boundaries.

In most cases, decision boundaries are harder to visualize, because most datasets have more than three dimensions. The iris dataset, for example, has four features and therefore four dimensions, which makes its decision boundaries more difficult to plot.

![alt-text](http://scikit-learn.org/stable/_images/sphx_glr_plot_logistic_multinomial_thumb.png)
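
As a rough illustration of how a fitted logistic regression model turns features into a class decision, the sketch below trains on only two of the four iris features so the problem stays two-dimensional. The names `X_petal`, `clf_demo` and `sample`, the choice of petal columns, the measurements of the sample flower, and the `max_iter=1000` setting are all assumptions made for this example, not part of the notebook's own workflow.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris_demo = load_iris()

# Keep only petal length and petal width (columns 2 and 3) so the data is 2-D
X_petal = iris_demo.data[:, 2:4]
y_demo = iris_demo.target

# max_iter is raised as a precaution against convergence warnings (an assumption)
clf_demo = LogisticRegression(max_iter=1000)
clf_demo.fit(X_petal, y_demo)

# One hypothetical flower: petal length 1.5 cm, petal width 0.3 cm
sample = np.array([[1.5, 0.3]])

# predict_proba returns one probability per class; each row sums to 1
probs = clf_demo.predict_proba(sample)
# predict returns the class with the highest probability
pred = clf_demo.predict(sample)

print(dict(zip(iris_demo.target_names, probs[0].round(3))))
print("Predicted species:", iris_demo.target_names[pred[0]])
```

A decision boundary is the set of points where two classes have equal probability, so moving `sample` across a boundary changes the predicted species.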

## Loading the Iris Dataset

Scikit-learn offers a convenient way of loading toy datasets to experiment with. Here we will use the scikit-learn API to easily load the iris dataset.

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()

### Set Feature and Target Variables


In [None]:
X = iris.data
y = iris.target
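
Before going further, it can help to check what `X` and `y` actually contain. The short sketch below reloads the dataset so that it runs on its own, and inspects the shapes and names through attributes of the object returned by `load_iris`.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # feature matrix: one row per flower, one column per feature
y = iris.target    # integer label (0, 1 or 2) for each flower

print(X.shape)              # (150, 4)
print(y.shape)              # (150,)
print(iris.feature_names)   # names of the four measurement columns
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']
print(np.bincount(y))       # [50 50 50]: samples per species
```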

### Separate data into Training and Testing Data

When creating machine learning algorithms, we often want to evaluate the performance of our models. To do this, we need to separate our data into a __training set__ and a __testing set__.

We train our model using the training set, and withhold the rest of the data, the testing set, to check later how well our model can predict it.

This is done so that we can evaluate how well our machine learning algorithm can make predictions for new, unseen data points.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=0.33,
                                                   random_state=69)
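
To see what the split produces, the sketch below repeats it on a fresh copy of the data and prints the resulting shapes. It also shows an optional variant using the `stratify` parameter of `train_test_split`, which keeps the species proportions roughly equal in both sets. The stratified variant and the `_s` variable names are assumptions added for illustration and are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same split as above: 33% of the 150 samples are held out for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=69
)
print(X_train.shape, X_test.shape)   # (100, 4) (50, 4)
print(np.bincount(y_test))           # samples per species in the test set

# Optional variant (an assumption, not part of the original notebook):
# stratify=y keeps the class proportions similar in both sets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.33, random_state=69, stratify=y
)
print(np.bincount(y_test_s))
```

Fixing `random_state` makes the split reproducible, so the same rows land in the testing set on every run.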

## Training our Logistic Regression Model

Now that we have loaded our dataset and separated it into training and testing sets, we can begin training our logistic regression model.

In [None]:
from sklearn.linear_model import LogisticRegression

# Create the model with scikit-learn's default settings
lr = LogisticRegression()
# Learn the model parameters from the training data only
lr.fit(X_train, y_train)

## Evaluating Performance

Now that we have trained our logistic regression model, we score how well it predicts the labels of the testing set from the features of the testing set. For a classifier, the `score` method returns the mean accuracy, that is, the fraction of test samples that are classified correctly.



In [None]:
lr.score(X_test, y_test)
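
To make the meaning of the score concrete, the sketch below recomputes the accuracy by hand and compares it with `score`. It refits a model so that it runs on its own. The `max_iter=1000` setting and the use of `confusion_matrix` are assumptions added for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=69
)

# max_iter is raised as a precaution against convergence warnings (an assumption)
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# Predict a label for every test sample
y_pred = lr.predict(X_test)

# Accuracy by hand: the fraction of predictions that match the true labels
manual_accuracy = np.mean(y_pred == y_test)
print(manual_accuracy, lr.score(X_test, y_test))   # the two values agree

# Rows are true species, columns are predicted species
print(confusion_matrix(y_test, y_pred))
```

The confusion matrix shows which species are mistaken for one another, which a single accuracy number hides.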

# K-Nearest Neighbours

K-Nearest Neighbours is a popular classification algorithm known for its simplicity. Training only stores the data, although prediction can become slow on large datasets because every prediction compares the new point against the stored points.

To predict the class of a new data point, the algorithm finds its *k* closest neighbours among the training points and takes a vote among their classes, where *k* is the number of neighbours that the algorithm searches for.

Using *k*-NN, the majority vote wins: the prediction takes on the most common class among those *k* neighbours.
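
The voting procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements it: the function `knn_predict`, the use of Euclidean distance, and the tiny toy dataset are all assumptions made for this example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_known, y_known, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest points."""
    # Euclidean distance from x_new to every known point
    distances = np.linalg.norm(X_known - x_new, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbours and return the most common one
    votes = Counter(y_known[nearest].tolist())
    return votes.most_common(1)[0][0]


# Made-up data: three points near the origin (class 0), three near (5, 5) (class 1)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3))   # 0
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3))   # 1
```

Choosing an odd *k* avoids tied votes when there are only two classes.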

## Let's look at an example

Consider the image below. 

![alt-text](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/220px-KnnClassification.svg.png)

There are two different classes. 

We want to predict the class of the green data point.

To do this we use the *k*-Nearest Neighbours algorithm.

The inner circle encloses the nearest neighbours for *k*=3.

The dotted circle encloses the nearest neighbours for *k*=5.

If *k*=3, what class would our prediction be? What if *k*=5?

# K-Nearest Neighbours Applied to the Iris Dataset



In [None]:
from sklearn.neighbors import KNeighborsClassifier

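# With default settings, the classifier uses k=5 neighbours (n_neighbors=5)
knn = KNeighborsClassifier()
# "Training" stores the training data for later neighbour lookups
knn.fit(X_train, y_train)
# Mean accuracy on the held-out testing set
knn.score(X_test, y_test)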
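
The number of neighbours is set with the `n_neighbors` parameter of `KNeighborsClassifier`. The sketch below, which reloads and splits the data so that it runs on its own, tries a few values of *k* to show how the choice can affect test accuracy. The particular values tried and the names `scores` and `knn_k` are assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=69
)

# Fit one classifier per value of k and record its test accuracy
scores = {}
for k in (1, 3, 5, 7, 9):
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train, y_train)
    scores[k] = knn_k.score(X_test, y_test)

for k, acc in scores.items():
    print(f"k={k}: test accuracy = {acc:.3f}")
```

On a dataset this small the differences may be slight, and picking *k* by looking at the testing set risks an over-optimistic estimate of performance.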

# Exercise

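Apply __logistic regression__ and __*k*-nearest neighbors__ to scikit-learn's handwritten digits dataset, loaded with `load_digits`. This is a small set of 8×8-pixel digit images, similar in spirit to MNIST but not the full MNIST dataset. Which model performs better?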

Tips
1. Load the data
2. Assign features and targets
3. Separate data into training set and testing set
4. Train your models
5. Score their accuracy

In [None]:
from sklearn.datasets import load_digits



## Solutions

To load these solutions, uncomment the code block below and run it. 

I highly recommend looking at them only after you've attempted the problem. If you are stuck, feel free to ask me for help, or work together with others.

In [None]:
# %load solutions/02_digits_knn_lr.py