## Week 9: Pair Programming - The Iris Dataset

### Task A: Data Analysis

The Iris flower dataset was introduced by Ronald Fisher in 1936. It consists of 150 data points, each corresponding to an individual iris flower.

Each data point consists of 4 measurements (in cm) and a label giving the species of that iris: setosa, versicolor, or virginica.

We can load this dataset in Python through `sklearn` using the code below.

- `iris.data` gives you a $150\times 4$ array where each row `iris.data[i,:]` contains the 4 measurements for data point `i`

- `iris.feature_names` tells you what each of the 4 measurements is

- `iris.target` gives you an array of size $150$ where `iris.target[i]` is a numbered label for data point `i`

- `iris.target_names` tells you which species each numbered label corresponds to

```python
import sklearn.datasets  # If you are running this locally, first run `pip install scikit-learn` in your Python environment.
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize  # Provides `scipy.optimize.minimize`, used in Task B.

iris = sklearn.datasets.load_iris()
```
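Before starting the tasks, it can help to check that the four attributes described above look the way you expect. The short sketch below only prints shapes and names; it makes no assumptions beyond `scikit-learn` being installed.

```python
import sklearn.datasets

iris = sklearn.datasets.load_iris()

# Shape of the measurement array: one row per flower, one column per measurement.
print(iris.data.shape)

# Names of the four measurements, in column order.
print(iris.feature_names)

# One numbered label per flower, and the species each number stands for.
print(iris.target.shape)
print(iris.target_names)

# The measurements and numbered label of the first data point.
print(iris.data[0, :], iris.target[0])
```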

**A1.** Write a function that takes in an index `i` and prints out a verbose description of the species and measurements for data point `i`. For example:

```
Data point 5 is of the species setosa
Its sepal length (cm) is 5.4
Its sepal width (cm) is 3.9
Its petal length (cm) is 1.7
Its petal width (cm) is 0.4
```
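One possible shape for such a function is sketched below. The name `describe` is our own choice, and returning the text as well as printing it is an extra convenience (it makes the function easier to check), not something the task requires.

```python
import sklearn.datasets

iris = sklearn.datasets.load_iris()


def describe(i):
    """Print (and return) a verbose description of data point i."""
    # Look up the species name through the numbered label.
    species = iris.target_names[iris.target[i]]
    lines = [f"Data point {i} is of the species {species}"]
    # Pair each measurement name with its value for this data point.
    for name, value in zip(iris.feature_names, iris.data[i, :]):
        lines.append(f"Its {name} is {value}")
    text = "\n".join(lines)
    print(text)
    return text


description = describe(5)
```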

**A2.** Compute the mean and standard deviation of each measurement for each of the three species.
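A minimal sketch of one approach, using a boolean mask to select the rows belonging to each species. The dictionary `stats` is a hypothetical container of our own. Note that `np.std` computes the population standard deviation by default; pass `ddof=1` if you want the sample standard deviation instead.

```python
import numpy as np
import sklearn.datasets

iris = sklearn.datasets.load_iris()

stats = {}  # species name -> (mean per measurement, std per measurement)
for label, species in enumerate(iris.target_names):
    rows = iris.data[iris.target == label]  # all data points of this species
    stats[str(species)] = (rows.mean(axis=0), rows.std(axis=0))

for species, (mean, std) in stats.items():
    print(species)
    for name, m, s in zip(iris.feature_names, mean, std):
        print(f"  {name}: mean = {m:.3f}, std = {s:.3f}")
```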



**A3.** Produce a scatter plot of petal width vs. petal length for each iris. Use a different colour for each of the three species and label these using `plt.legend`.
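One way to get a different colour per species is to call `plt.scatter` once per species, since each call picks the next colour from Matplotlib's default cycle. The sketch below assumes petal length and petal width are columns 2 and 3, which matches `iris.feature_names`.

```python
import matplotlib.pyplot as plt
import sklearn.datasets

iris = sklearn.datasets.load_iris()
petal_length = iris.data[:, 2]
petal_width = iris.data[:, 3]

plt.figure()
for label, species in enumerate(iris.target_names):
    mask = iris.target == label  # rows belonging to this species
    plt.scatter(petal_length[mask], petal_width[mask], label=species)

plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.legend()
# In a notebook the figure is drawn at the end of the cell;
# in a plain script, call plt.show() here.
```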

**A4.** Produce a 3D scatter plot of petal width vs. petal length vs. sepal length for each iris. Again, use a different colour for each species. Which species is easiest to identify?
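A 3D plot needs an axes object created with `projection="3d"`. The sketch below is one way to set that up; which measurement goes on which axis is our own choice.

```python
import matplotlib.pyplot as plt
import sklearn.datasets

iris = sklearn.datasets.load_iris()

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
for label, species in enumerate(iris.target_names):
    mask = iris.target == label  # rows belonging to this species
    # Columns: 2 = petal length, 3 = petal width, 0 = sepal length.
    ax.scatter(iris.data[mask, 2], iris.data[mask, 3], iris.data[mask, 0], label=species)

ax.set_xlabel(iris.feature_names[2])
ax.set_ylabel(iris.feature_names[3])
ax.set_zlabel(iris.feature_names[0])
ax.legend()
```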

### Task B: Learning a Classifier 

The four measurements for a data point may be used as a feature vector $\mathbf{x_i}$. Each measurement of the iris is a feature.

$$\mathbf{x_i} = [\text{feature}_0, \text{feature}_1, \text{feature}_2, \text{feature}_3]^{T}$$

In this task we will learn a classifier that determines, from a feature vector, whether an iris is (1) a setosa or (2) not a setosa (non-setosa).

Our classifier is a model that outputs the probability of a feature vector being a setosa. It takes the following form:

$$ \text{p}( \text{setosa} | \mathbf{x_i}) = \sigma(\mathbf{w^T}\mathbf{x_i})$$

where $\mathbf{w}$ is a vector of learnable parameters $\mathbf{w} = [\text{weight}_0, \text{weight}_1, \text{weight}_2, \text{weight}_3]^{T}$.

Note that our model is outputting a probability, which is a single number between 0 and 1. $\sigma$ is the sigmoid function that squishes $\mathbf{w^T}\mathbf{x_i}$ between these values.

$$\sigma(\mathbf{w^T}\mathbf{x_i}) = \frac{1}{1+e^{-\mathbf{w^T}\mathbf{x_i}}}$$
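The formula above translates almost directly into NumPy. The sketch below is a plain implementation (the name `sigmoid` is our own) together with a few numerical sanity checks; for very negative inputs `np.exp` may emit an overflow warning, and `scipy.special.expit` is a ready-made alternative that avoids it.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid function, applied elementwise to a scalar or array."""
    return 1.0 / (1.0 + np.exp(-z))


z = np.linspace(-10, 10, 201)
s = sigmoid(z)

print(sigmoid(0.0))      # the midpoint of the curve
print(s.min(), s.max())  # values stay strictly between 0 and 1
```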

**B1.** Plot the sigmoid function and confirm it behaves in a suitable manner.

**B2.** When we learn a classifier, we create a training and validation set. The training set is used to learn the actual classifier, and the validation set is used to verify the performance of the classifier. Create a training set that consists of 25 setosa data points and 25 non-setosa data points.

**B3.** Create a validation set that consists of 25 setosas, and 25 non-setosas. The data points in the validation set must **not** overlap with those used to create the training set.
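One way to guarantee that the two sets do not overlap is to shuffle the indices of each class once and then slice disjoint ranges from the shuffled order. The sketch below does this with a fixed random seed; the seed, the shuffling, and all variable names are our own assumptions, and other splits are equally valid.

```python
import numpy as np
import sklearn.datasets

iris = sklearn.datasets.load_iris()
rng = np.random.default_rng(0)  # fixed seed so the split is reproducible

# Shuffle the indices of each class separately (label 0 is setosa).
setosa_idx = rng.permutation(np.flatnonzero(iris.target == 0))
other_idx = rng.permutation(np.flatnonzero(iris.target != 0))

# Disjoint slices: the first 25 of each class for training, the next 25 for validation.
train_idx = np.concatenate([setosa_idx[:25], other_idx[:25]])
val_idx = np.concatenate([setosa_idx[25:50], other_idx[25:50]])

X_train = iris.data[train_idx]
y_train = (iris.target[train_idx] == 0).astype(float)  # 1 = setosa, 0 = non-setosa
X_val = iris.data[val_idx]
y_val = (iris.target[val_idx] == 0).astype(float)

print(X_train.shape, X_val.shape)
```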

**B4.** Learn a classifier using your training data that finds the $\mathbf{w}$ that minimises the loss function below, where $y_i=1$ for a data point that corresponds to a setosa, and $y_i=0$ for a data point that is non-setosa. Use the CG method (https://docs.scipy.org/doc/scipy/reference/optimize.minimize-cg.html).

$$L(\mathbf{w}) = \frac{1}{50}\sum_{i=0}^{49} -y_i \log \text{p}( \text{setosa} | \mathbf{x_i}) - (1-y_i) \log   \text{p}   ( \text{non-setosa} | \mathbf{x_i})$$

*This loss function may seem complicated, but the important thing to know is that it is low when we classify setosas as setosas, and non-setosas as non-setosas!*
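A sketch of how the loss and the optimisation might be set up is given below. It assumes $\text{p}(\text{non-setosa} | \mathbf{x_i}) = 1 - \sigma(\mathbf{w^T}\mathbf{x_i})$, since there are only two classes, and it assumes a seeded random training split of our own choosing. The loss is written with `np.logaddexp`, which is mathematically the same expression but avoids taking the log of a probability that has rounded to 0. Supplying the gradient through `jac` is optional; without it, SciPy estimates the gradient numerically.

```python
import numpy as np
import scipy.optimize
import scipy.special
import sklearn.datasets

iris = sklearn.datasets.load_iris()

# Assumed training split: 25 setosas and 25 non-setosas, shuffled with a fixed seed.
rng = np.random.default_rng(0)
setosa_idx = rng.permutation(np.flatnonzero(iris.target == 0))
other_idx = rng.permutation(np.flatnonzero(iris.target != 0))
train_idx = np.concatenate([setosa_idx[:25], other_idx[:25]])
X_train = iris.data[train_idx]
y_train = (iris.target[train_idx] == 0).astype(float)  # 1 = setosa, 0 = non-setosa


def loss(w, X, y):
    """Mean cross-entropy loss for the model p(setosa | x) = sigmoid(w^T x)."""
    z = X @ w
    # -log(sigmoid(z)) = logaddexp(0, -z) and -log(1 - sigmoid(z)) = logaddexp(0, z)
    return np.mean(y * np.logaddexp(0.0, -z) + (1.0 - y) * np.logaddexp(0.0, z))


def loss_gradient(w, X, y):
    """Gradient of the loss with respect to w."""
    return X.T @ (scipy.special.expit(X @ w) - y) / len(y)


w0 = np.zeros(4)  # starting guess: every probability is 0.5
result = scipy.optimize.minimize(
    loss, w0, args=(X_train, y_train), method="CG", jac=loss_gradient
)
w = result.x

print("learnt weights:", w)
print("final loss:", result.fun)
print("optimiser message:", result.message)
```

It is worth reading `result.message` rather than assuming convergence: if the two classes are cleanly separated, the loss can keep shrinking as the weights grow, and the optimiser may stop on its iteration limit instead.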



**B5.** Apply your learnt classifier to the validation set. If the classifier outputs a probability lower than 0.5, it classifies a data point as **non-setosa**. If it outputs a probability higher than 0.5 it classifies a data point as **setosa**. Report the accuracy of the classifier. This is the percentage of correct classifications.
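The thresholding rule and the accuracy calculation can be sketched as follows. The function name `accuracy` and the seeded validation split are our own assumptions. The weight vector `w_placeholder` is hand-picked purely so that the example runs on its own; it is not a learnt classifier, and you should pass in the $\mathbf{w}$ found in B4 instead.

```python
import numpy as np
import scipy.special
import sklearn.datasets

iris = sklearn.datasets.load_iris()

# Assumed validation split: the second 25 of each shuffled class (fixed seed).
rng = np.random.default_rng(0)
setosa_idx = rng.permutation(np.flatnonzero(iris.target == 0))
other_idx = rng.permutation(np.flatnonzero(iris.target != 0))
val_idx = np.concatenate([setosa_idx[25:50], other_idx[25:50]])
X_val = iris.data[val_idx]
y_val = (iris.target[val_idx] == 0).astype(int)  # 1 = setosa, 0 = non-setosa


def accuracy(w, X, y):
    """Percentage of data points whose thresholded prediction matches the label."""
    prob_setosa = scipy.special.expit(X @ w)     # sigmoid(w^T x) for every row
    predicted = (prob_setosa > 0.5).astype(int)  # 1 = setosa, 0 = non-setosa
    return 100.0 * np.mean(predicted == y)


# Hypothetical stand-in for the learnt weights: sepal width minus petal length.
w_placeholder = np.array([0.0, 1.0, -1.0, 0.0])
acc = accuracy(w_placeholder, X_val, y_val)
print(f"Validation accuracy: {acc:.1f}%")
```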


