# Classification

* You're probably aware of some parametric prediction methods, e.g., linear regression.  
* Let's study a non-parametric prediction method. 
* The goal of this method: classify something into one of a discrete number of types. 
* Classification is a form of 'supervised learning': the method learns from examples whose correct types are already known. 

# Scikit-Learn

* Scikit-Learn is a major machine learning library that includes many reference data sets. 
* Initial release: June 2007, which predates `pandas`, but not by much! (Scikit-Learn and `pandas` solved different sets of problems, so they simply coexisted for a long time.)
* It has its own formats. 
* It's important to know how to translate to other formats to accomplish tasks. 
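
As one illustration of translating between formats, here is a minimal sketch that copies the iris data into a `pandas` `DataFrame`. The column names `target` and `species` are hypothetical choices made for this example; they are not part of Scikit-Learn.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Build a DataFrame from the numpy feature array, using the dataset's own column names.
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Add the class label, both as an integer code and as a readable name.
# ("target" and "species" are column names invented for this sketch.)
df["target"] = iris.target
df["species"] = iris.target_names[iris.target]

print(df.head())
```

Once the data is in a `DataFrame`, the usual `pandas` tools (grouping, filtering, plotting) apply to it.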

<img align="right" style="padding-left:10px; height: 24%; width: 24%;" src="figures/iris_with_labels.jpg">

# The Iris Dataset

* There is one dataset that is so well-known that it bears mentioning in any context. 
* The *iris dataset* consists of a multidimensional array of iris characteristics used in determining species. 
* Let's explore this dataset and see if we can understand it. 
* More information on the iris dataset is available at the [Scikit-Learn website](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html).
* Scikit-Learn also provides a [Jupyter notebook](plot_iris_dataset.ipynb) included here.

In [2]:
from sklearn import datasets
iris = datasets.load_iris()
iris


{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

# This is a special-purpose format. 
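* A class (`sklearn.utils.Bunch`).
* Implemented as a dictionary whose keys can also be read as attributes, so `iris['data']` and `iris.data` are the same thing. 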
* Intended for testing machine-learning algorithms. 
* With fields that make sense for that task.
* Most entries are arrays in `numpy` format. 
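
A short sketch of how one might inspect this container. The exact set of keys can vary between Scikit-Learn versions, so the code prints them rather than assuming a fixed list.

```python
from sklearn import datasets

iris = datasets.load_iris()

# The container is a dictionary-like class; list its type and keys.
print(type(iris).__name__)
print(sorted(iris.keys()))

# Key access and attribute access return the same object.
same = iris["data"] is iris.data
print(same)

# Most entries are numpy arrays.
print(type(iris.data).__name__, iris.data.shape)
```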

Let's find out a bit about it. 

In [3]:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

# Important fields in the iris dataset
* `iris.data`: a set of feature vectors describing different plants. 
* `iris.target`: the kind of plant, coded as an integer (0, 1, or 2)
* `iris.feature_names`: the names of columns
* `iris.target_names`: the English names of the kinds. 
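
To see how these four fields fit together, the following sketch prints the first few plants with their named measurements and species. It uses only the fields listed above.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Pair each of the first three feature vectors with its column names,
# and translate the integer target code into its English name.
for row, label in zip(iris.data[:3], iris.target[:3]):
    measurements = dict(zip(iris.feature_names, row.tolist()))
    print(measurements, "->", iris.target_names[label])

first_species = str(iris.target_names[iris.target[0]])
```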

# The classification problem
* Given what we know about a thing (`iris.data`) 
* What species is it (`iris.target`)? 

# How we approach classification: 
* Take all data into account. 
* Think of the data as a function from `data` to `target`.
* Approximate that function. 

# Then, if there is a new iris specimen, 
* Use the function to predict what species it is. 

# Let's run the demo provided by scikit-learn: 

In [4]:
from sklearn.neighbors import KNeighborsClassifier

# Declare a KNN classifier. n_neighbors is the number of nearest training points that vote on each prediction.
knn = KNeighborsClassifier(n_neighbors=6)

# create a map between data and target. 
knn.fit(iris['data'], iris['target'])

# Provide data whose class labels are to be predicted
X = [
    [5.9, 1.0, 5.1, 1.8],
    [3.4, 2.0, 1.1, 4.8],
]

# Prints the data provided
print(X)

# Store predicted class labels of X
prediction = knn.predict(X)

# Prints the predicted class labels of X
print(prediction)

[[5.9, 1.0, 5.1, 1.8], [3.4, 2.0, 1.1, 4.8]]
[1 1]


Thus, according to the predictor, both samples belong to species 1 (of 0-2), which `iris.target_names` lists as versicolor. 

* Writing such a predictor is a complex task that we study in COMP 135. 
* You can read up on it here: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

For now, suffice it to say that from enough measurements, one can form a prediction 
from the instances that have been observed so far. This prediction can be accurate or inaccurate 
based upon the prediction method. 
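
To take some of the mystery out of the predictor, here is a rough sketch of the idea behind k-nearest neighbors: find the `k` observed instances closest to the new measurement and let them vote. The function `knn_predict` is a hypothetical helper written for this illustration; it is not Scikit-Learn's implementation, and its tie-breaking may differ from the library's.

```python
from collections import Counter

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()


def knn_predict(train_X, train_y, sample, k=6):
    """Predict a label by majority vote among the k nearest training rows."""
    # Euclidean distance from the sample to every training row.
    distances = np.sqrt(((train_X - np.asarray(sample)) ** 2).sum(axis=1))
    # Indices of the k closest rows.
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    # (on a tie, the label encountered first wins).
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


label = knn_predict(iris.data, iris.target, [5.1, 3.5, 1.4, 0.2])
print(iris.target_names[label])
```

A production predictor adds faster neighbor search, weighting options, and careful tie handling, which is part of why the real thing is a larger topic.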

# From whence comes accuracy
* You would be right to be suspicious of what I just did. 
* I didn't tell you anything at all about the prediction method. It is an "oracle". 
* How do we know that this worked? 

# Cross-validation
* Cross-validation is a standard technique in machine learning for testing classifiers. 
* Separate all feature data into 'training' and 'testing' subsets. 
* Train on the training subset. 
* Test on the testing subset. 
* See if you get the correct answers.
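
For comparison, Scikit-Learn ships helpers that carry out these steps. The sketch below uses `train_test_split` and `score`, which is a different route from the manual recipe in this notebook; the `random_state=0` seed and the name `model` are arbitrary choices for the example.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Hold out 20 of the 150 rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=20, random_state=0
)

model = KNeighborsClassifier(n_neighbors=6)
model.fit(X_train, y_train)              # train on the training subset only
accuracy = model.score(X_test, y_test)   # fraction of correct test predictions
print(len(X_train), len(X_test), accuracy)
```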

# Let's do this. I'll help.
* This is a different kind of exercise. 
* This is a real cross-validation using a random split of the data. 
* There is no one "correct" answer. 
* I can check your answers for sanity but not for correctness. 

First let's select rows of the data to use as training and testing data. This recipe selects them randomly. 

In [5]:
import random
selections = list(range(len(iris.data)))
random.shuffle(selections)
training_selections = selections[:130]
testing_selections = selections[130:]

# What this does
* `random.shuffle` scrambles, in place, the numbers 0 through 149, which are the row offsets of `iris.data`. 
* `training_selections` is a list of the array offsets for a training set. 
* `testing_selections` is a list of the array offsets for a testing set. 
* These are disjoint lists with no elements in common. 
* These represent a random sampling of the data in the iris dataset. 
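
The properties listed above can be checked directly. This sketch repeats the recipe with a seeded generator so that the split is repeatable; the seed value 42 is an arbitrary assumption, and the row count 150 is hard-coded to keep the example self-contained.

```python
import random

rng = random.Random(42)            # hypothetical seed, chosen only for repeatability
selections = list(range(150))      # the iris dataset has 150 rows
rng.shuffle(selections)
training_selections = selections[:130]
testing_selections = selections[130:]

# The two lists are disjoint and together cover every row exactly once.
overlap = set(training_selections) & set(testing_selections)
covered = sorted(training_selections + testing_selections) == list(range(150))
print(len(training_selections), len(testing_selections), overlap, covered)
```

Without a seed, as in the cell above, every run produces a different split, which is why your lists will not match the ones printed here.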

In [6]:
print(training_selections)

[66, 104, 51, 111, 113, 123, 141, 108, 71, 74, 70, 115, 48, 129, 12, 4, 106, 11, 84, 121, 46, 56, 9, 118, 34, 79, 142, 134, 53, 145, 78, 147, 13, 132, 10, 57, 125, 47, 37, 31, 65, 63, 23, 131, 58, 24, 26, 60, 92, 32, 33, 52, 83, 88, 91, 117, 128, 30, 39, 44, 8, 97, 28, 122, 67, 110, 85, 27, 102, 130, 146, 94, 41, 42, 103, 105, 136, 76, 50, 149, 139, 36, 140, 68, 100, 40, 62, 20, 17, 59, 0, 86, 29, 133, 135, 116, 112, 18, 72, 5, 124, 35, 143, 54, 69, 43, 109, 16, 19, 99, 73, 77, 107, 75, 96, 3, 25, 61, 64, 55, 138, 98, 144, 6, 95, 114, 89, 15, 126, 93]


In [7]:
print(testing_selections)

[38, 1, 7, 21, 45, 22, 120, 81, 101, 49, 119, 127, 90, 137, 2, 87, 148, 14, 80, 82]


In [9]:
# Don't change this cell; just run it. 
from client.api.notebook import Notebook
ok = Notebook('04-02-classification.ok')

Assignment: 04-02-classification
OK, version v1.14.15



1. Create an `array` `training_features` that consists of the rows that match `training_selections`. Look up how to do it. Hint: `iris.data` is an `array`. Use row selection for `np.array`. 

In [24]:
# Your answer:
training_features = iris.data[training_selections]
training_features

array([[5.6, 3. , 4.5, 1.5],
       [6.5, 3. , 5.8, 2.2],
       [6.4, 3.2, 4.5, 1.5],
       [6.4, 2.7, 5.3, 1.9],
       [5.7, 2.5, 5. , 2. ],
       [6.3, 2.7, 4.9, 1.8],
       [6.9, 3.1, 5.1, 2.3],
       [6.7, 2.5, 5.8, 1.8],
       [6.1, 2.8, 4. , 1.3],
       [6.4, 2.9, 4.3, 1.3],
       [5.9, 3.2, 4.8, 1.8],
       [6.4, 3.2, 5.3, 2.3],
       [5.3, 3.7, 1.5, 0.2],
       [7.2, 3. , 5.8, 1.6],
       [4.8, 3. , 1.4, 0.1],
       [5. , 3.6, 1.4, 0.2],
       [4.9, 2.5, 4.5, 1.7],
       [4.8, 3.4, 1.6, 0.2],
       [5.4, 3. , 4.5, 1.5],
       [5.6, 2.8, 4.9, 2. ],
       [5.1, 3.8, 1.6, 0.2],
       [6.3, 3.3, 4.7, 1.6],
       [4.9, 3.1, 1.5, 0.1],
       [7.7, 2.6, 6.9, 2.3],
       [4.9, 3.1, 1.5, 0.2],
       [5.7, 2.6, 3.5, 1. ],
       [5.8, 2.7, 5.1, 1.9],
       [6.1, 2.6, 5.6, 1.4],
       [5.5, 2.3, 4. , 1.3],
       [6.7, 3. , 5.2, 2.3],
       [6. , 2.9, 4.5, 1.5],
       [6.5, 3. , 5.2, 2. ],
       [4.3, 3. , 1.1, 0.1],
       [6.4, 2.8, 5.6, 2.2],
       [5.4, 3

In [25]:
_ = ok.grade('q01')  # check answer for sanity

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 2
    Failed: 0
[ooooooooook] 100.0% passed



2. Create an `array` `training_targets` that consists of the targets corresponding to the selected training rows. 

In [26]:
# Your answer
training_targets = iris.target[training_selections]
training_targets

array([1, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 2, 0, 2, 0, 0, 2, 0, 1, 2, 0, 1,
       0, 2, 0, 1, 2, 2, 1, 2, 1, 2, 0, 2, 0, 1, 2, 0, 0, 0, 1, 1, 0, 2,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 2, 2, 0, 0, 0, 0, 1, 0, 2, 1, 2,
       1, 0, 2, 2, 2, 1, 0, 0, 2, 2, 2, 1, 1, 2, 2, 0, 2, 1, 2, 0, 1, 0,
       0, 1, 0, 1, 0, 2, 2, 2, 2, 0, 1, 0, 2, 0, 2, 1, 1, 0, 2, 0, 0, 1,
       1, 1, 2, 1, 1, 0, 0, 1, 1, 1, 2, 1, 2, 0, 1, 2, 1, 0, 2, 1])

In [27]:
_ = ok.grade('q02')  # check answer for sanity

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 2
    Failed: 0
[ooooooooook] 100.0% passed



3. Using the pattern above, train a kNN classifier on the training data. Start with a new classifier `knn2` and train it only on the training subset. Hint: You need the data from parts 1 and 2. 

In [28]:
# Your answer: 
# Declare a KNN classifier. n_neighbors is the number of nearest training points that vote on each prediction.
import numpy as np

knn2 = KNeighborsClassifier(n_neighbors=6)

# create a map between data and target. 
knn2.fit(training_features, training_targets)
knn2

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=6, p=2,
           weights='uniform')

4. Put the test data into an `array` `testing_features`, repeating what you did for training data. 

In [29]:
# Your answer: 
testing_features = iris.data[testing_selections]
testing_features

array([[4.4, 3. , 1.3, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [5. , 3.4, 1.5, 0.2],
       [5.1, 3.7, 1.5, 0.4],
       [4.8, 3. , 1.4, 0.3],
       [4.6, 3.6, 1. , 0.2],
       [6.9, 3.2, 5.7, 2.3],
       [5.5, 2.4, 3.7, 1. ],
       [5.8, 2.7, 5.1, 1.9],
       [5. , 3.3, 1.4, 0.2],
       [6. , 2.2, 5. , 1.5],
       [6.1, 3. , 4.9, 1.8],
       [5.5, 2.6, 4.4, 1.2],
       [6.4, 3.1, 5.5, 1.8],
       [4.7, 3.2, 1.3, 0.2],
       [6.3, 2.3, 4.4, 1.3],
       [6.2, 3.4, 5.4, 2.3],
       [5.8, 4. , 1.2, 0.2],
       [5.5, 2.4, 3.8, 1.1],
       [5.8, 2.7, 3.9, 1.2]])

In [30]:
_ = ok.grade('q04')  # check answer for sanity

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 2
    Failed: 0
[ooooooooook] 100.0% passed



5. Run the predictor as above, but on the array `testing_features`. Put the result into `test_results`

In [31]:
# Your answer (knn2 was trained only on the training rows; knn saw all 150 rows):
test_results = knn2.predict(testing_features)
test_results

array([0, 0, 0, 0, 0, 0, 2, 1, 2, 0, 1, 2, 1, 2, 0, 1, 2, 0, 1, 1])

In [32]:
_ = ok.grade('q05')  # check answer for sanity

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 2
    Failed: 0
[ooooooooook] 100.0% passed



6. Compute the expected outcomes and put them into the `array` `expected_results`. 

In [33]:
# Your answer: 
expected_results = iris.target[testing_selections]
expected_results

array([0, 0, 0, 0, 0, 0, 2, 1, 2, 0, 2, 2, 1, 2, 0, 1, 2, 0, 1, 1])

In [34]:
_ = ok.grade('q06')  # check answer for sanity

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Running tests

---------------------------------------------------------------------
Test summary
    Passed: 2
    Failed: 0
[ooooooooook] 100.0% passed



7. Count the number of identical answers between `test_results` and `expected_results` and place the result into `correct_answers`.

In [35]:
# Your answer: 
correct_answers = (test_results == expected_results).sum()
correct_answers

19
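
The count can also be expressed as an accuracy, and the misclassified rows can be located. This sketch copies the two arrays from the outputs of steps 5 and 6 shown above, so it describes that one random trial only; the names `matches`, `accuracy`, and `missed` are introduced here for illustration.

```python
import numpy as np

# Arrays copied from the outputs of steps 5 and 6 in this notebook run.
test_results = np.array([0, 0, 0, 0, 0, 0, 2, 1, 2, 0, 1, 2, 1, 2, 0, 1, 2, 0, 1, 1])
expected_results = np.array([0, 0, 0, 0, 0, 0, 2, 1, 2, 0, 2, 2, 1, 2, 0, 1, 2, 0, 1, 1])

matches = test_results == expected_results          # elementwise comparison
correct_answers = int(matches.sum())                # True counts as 1
accuracy = correct_answers / len(expected_results)  # fraction correct
missed = np.flatnonzero(~matches)                   # positions of wrong predictions
print(correct_answers, accuracy, missed)
```

For this trial, 19 of 20 predictions match (accuracy 0.95), and the single miss is at position 10 of the test set, where a plant of species 2 was predicted as species 1.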

# An afterword on cross-validation
* If you got a perfect result, you're lucky. 
* Classification algorithms aren't perfect. 
* Run the notebook again with a fresh random shuffle and you may get a different, possibly imperfect, result. 
* Running the cross-validation multiple times gives one an idea of how accurate the classifier will be. 
* There are no "correct" answers to this. You just ran a random trial. 
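
One way to run the trial many times is to loop over fresh random splits and collect the accuracy of each. This is a sketch under assumptions: the seed, the choice of 10 trials, and the names `accuracies` and `model` are arbitrary and can be changed freely.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
rng = np.random.default_rng(0)   # hypothetical seed for repeatability

accuracies = []
for trial in range(10):
    # A fresh random split: 130 training rows, 20 testing rows.
    order = rng.permutation(len(iris.data))
    train, test = order[:130], order[130:]

    # A new classifier each trial, trained only on the training rows.
    model = KNeighborsClassifier(n_neighbors=6)
    model.fit(iris.data[train], iris.target[train])

    predicted = model.predict(iris.data[test])
    accuracies.append(float((predicted == iris.target[test]).mean()))

print(np.round(accuracies, 2), "mean:", np.mean(accuracies))
```

The spread of the individual accuracies gives a feel for how much a single trial can vary.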

# When you are done with this notebook, 

* Save and checkpoint. 
* Ensure that the name of this file is precisely `04-02-classification.ipynb`. 
* Save and checkpoint the notebook. 


* If your Jupyter installation can download the notebook as a PDF,
    * (File >> Download as >> PDF via LaTeX (.pdf)), 
    * Rename the downloaded file to `<loginid>-04-02-classification.pdf`. In other words, my filename would be `jsingh11-04-02-classification.pdf`.
    * Submit the file `<loginid>-04-02-classification.pdf` to Canvas.
* Otherwise 
    * (File >> Download as >> Notebook (.ipynb)), 
    * Rename the downloaded file to `<loginid>-04-02-classification.ipynb`. In other words, my filename would be `jsingh11-04-02-classification.ipynb`.
    * Submit the file `<loginid>-04-02-classification.ipynb` to Canvas.