# Practice 4: Traditional Classification Methods

Use this notebook as the starting point for the Practice activities.

Student Name:    **[  Put your Name Here ]**

[Video Walkthrough by Tom](https://www.youtube.com/watch?v=8MfDdwiWuco)


## Revisit the Iris classification problem

*Note: This description comes from [Google's Custom Training Walkthroughs](https://www.tensorflow.org/tutorials/eager/custom_training_walkthrough).*

Imagine you are a botanist seeking an automated way to categorize each Iris flower you find. Machine learning provides many algorithms to classify flowers statistically. For instance, a sophisticated machine learning program could classify flowers based on photographs. Our ambitions are more modest—we're going to classify Iris flowers based on the length and width measurements of their [sepals](https://en.wikipedia.org/wiki/Sepal) and [petals](https://en.wikipedia.org/wiki/Petal).

The Iris genus entails about 300 species, but our program will only classify the following three:

* Iris setosa
* Iris virginica
* Iris versicolor

<table>
  <tr><td>
    <img src="https://www.tensorflow.org/images/iris_three_species.jpg"
         alt="Petal geometry compared for three iris species: Iris setosa, Iris virginica, and Iris versicolor">
  </td></tr>
  <tr><td align="center">
    <b>Figure 1.</b> <a href="https://commons.wikimedia.org/w/index.php?curid=170298">Iris setosa</a> (by <a href="https://commons.wikimedia.org/wiki/User:Radomil">Radomil</a>, CC BY-SA 3.0), <a href="https://commons.wikimedia.org/w/index.php?curid=248095">Iris versicolor</a>, (by <a href="https://commons.wikimedia.org/wiki/User:Dlanglois">Dlanglois</a>, CC BY-SA 3.0), and <a href="https://www.flickr.com/photos/33397993@N05/3352169862">Iris virginica</a> (by <a href="https://www.flickr.com/photos/33397993@N05">Frank Mayfield</a>, CC BY-SA 2.0).<br/>&nbsp;
  </td></tr>
</table>

Fortunately, someone has already created a [data set of 150 Iris flowers](https://en.wikipedia.org/wiki/Iris_flower_data_set) with the sepal and petal measurements. This is a classic dataset that is popular for beginner machine learning classification problems.



## Setting up Python tools



We'll use four libraries for this tutorial:
- [pandas](http://pandas.pydata.org/): dataframes for spreadsheet-like data analysis, reading CSV files, time series
- [numpy](http://www.numpy.org/): multidimensional data and linear algebra tools
- [matplotlib](http://matplotlib.org/): simple plotting and graphing
- [seaborn](http://stanford.edu/~mwaskom/software/seaborn/): more advanced graphing




In [0]:
# First, we'll import pandas and numpy, two data processing libraries
import pandas as pd
import numpy as np

# We'll also import seaborn and matplotlib, two Python graphing libraries
import seaborn as sns
import matplotlib.pyplot as plt
#sns.set(style="white", color_codes=True)

# We will turn off some warnings in this notebook to make it easier to read for new students
import warnings
warnings.filterwarnings('ignore')

## Read in the Iris flower data
The Iris flower data is read in from a file stored on the internet.

It is stored in a pandas DataFrame, which is similar to a spreadsheet in that the data is held in rows and columns.

In [0]:
# Read in the data from a raw CSV file stored on GitHub
url = 'https://raw.githubusercontent.com/CIS3115-Machine-Learning-Scholastica/CIS3115ML-Units3and4/master/Iris.csv'
iris = pd.read_csv(url)
# Set the Id column as the index since it is unique for each flower
iris.set_index('Id', inplace=True)

In [14]:
# Display the first 5 flowers to make sure the data was read in
iris.head(5)

| Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|----|---------------|--------------|---------------|--------------|---------|
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


When we use machine learning models, we will generally use these variables:
- X will be the input data, in this case the sizes of the flower's sepals and petals
- y will be the output data, or what we want to predict, in this case the species of iris

One way to think of this is a graph of house prices, where the x-axis is the house size (the input) and the y-axis is the price (the output we want to predict).

We will also split the data, using 80% of the flower samples for training the model and keeping 20% for testing the model:

- X_train : The iris sizes used for training the model
- X_test : The iris sizes used for testing the model
- y_train : The species used for training the model
- y_test : The species used for testing the model

We will only use the training data for fitting the model.
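The 80/20 split described above can be sketched by hand. This is only an illustration in plain Python: the function name `split_train_test` and the toy measurements are made up for this example, and the notebook itself relies on scikit-learn's `train_test_split` instead.

```python
import random

def split_train_test(samples, labels, test_fraction=0.2, seed=0):
    """Shuffle the row order, then hold out `test_fraction` of the rows for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)          # shuffle so the test rows are a random sample
    n_test = round(len(samples) * test_fraction)  # e.g. 2 of 10 rows
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    toy_X_train = [samples[i] for i in train_idx]
    toy_X_test = [samples[i] for i in test_idx]
    toy_y_train = [labels[i] for i in train_idx]
    toy_y_test = [labels[i] for i in test_idx]
    return toy_X_train, toy_X_test, toy_y_train, toy_y_test

# Hypothetical toy data: ten flowers with two measurements each
toy_X = [[5.1, 1.4], [4.9, 1.4], [4.7, 1.3], [7.0, 4.7], [6.4, 4.5],
         [6.9, 4.9], [6.3, 6.0], [5.8, 5.1], [7.1, 5.9], [5.0, 1.4]]
toy_y = ["setosa", "setosa", "setosa", "versicolor", "versicolor",
         "versicolor", "virginica", "virginica", "virginica", "setosa"]

toy_X_train, toy_X_test, toy_y_train, toy_y_test = split_train_test(toy_X, toy_y)
print(len(toy_X_train), "training rows and", len(toy_X_test), "test rows")
```

The key point is that the test rows never appear in the training rows, so the test score measures how the model does on flowers it has not seen.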


In [0]:
# Use the first four columns for input into the model and the final iris label as output for training
X = iris.iloc[:, 0:4]
y = iris.iloc[:, 4]

# feature_columns = ['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm','PetalWidthCm']
# X = iris[feature_columns].values
# y = iris['Species'].values

from sklearn.model_selection import train_test_split
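# Split the data into 80% for training and 20% for testing out the models
# (y.to_numpy() gives a plain 1-D array of labels; Series.ravel() is deprecated in newer pandas)
X_train, X_test, y_train, y_test = train_test_split(X, y.to_numpy(), test_size=0.2)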

# Classifiers
We will try out a few traditional classifiers and compare them.

For a good comparison of classifiers, see http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html#sphx-glr-auto-examples-classification-plot-classifier-comparison-py

We will focus on:
- K-Nearest Neighbors: `from sklearn.neighbors import KNeighborsClassifier`
- Support Vector Machines (SVM): `from sklearn.svm import SVC`
- Decision Trees: `from sklearn.tree import DecisionTreeClassifier`

Note: in the next unit we will revisit these, adding dimensionality reduction methods like PCA or LDA.

## K-Nearest Neighbors (KNN)

K-Nearest Neighbors is a relatively simple classification algorithm. Given a set of labeled data, when a new unknown data point needs to be classified, we find the known points that are closest to it. 

Let's say we are classifying fruit as either apples or oranges. Assume we measure two parameters for each fruit.
- The smoothness as a number between 1 and 10 
- Amount of red color as a number between 0 and 25

We have a set of fruit we already know are either apples or oranges. When a new fruit arrives we measure its smoothness and red color and compare these values to the known fruit as follows:
1. We use the Euclidean distance to calculate how near the new fruit is to every known fruit.
1. We find the 3 nearest known fruits. Let us assume they are an Orange, an Apple and another Apple.
1. We classify the new fruit based on what is the most common type of neighbor, in this case, it is labeled as an Apple since two of the three neighbors are apples.

In the example above we looked at the 3 nearest neighbors, but we could have just as well looked at the 5 nearest neighbors or the 10 nearest neighbors. In general, we use the term K Nearest Neighbors and assume K can be any whole number.
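The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the scikit-learn implementation: the fruit measurements and the function name `knn_classify` are made up for this example.

```python
import math
from collections import Counter

# Hypothetical known fruit: ((smoothness, red color), label)
known_fruit = [
    ((8.0, 20.0), "Apple"),
    ((7.5, 18.0), "Apple"),
    ((9.0, 22.0), "Apple"),
    ((3.0, 5.0), "Orange"),
    ((2.5, 4.0), "Orange"),
    ((4.0, 12.0), "Orange"),
]

def knn_classify(new_point, known, k=3):
    """Label new_point with the most common label among its k nearest known points."""
    # Step 1: sort the known fruit by Euclidean distance to the new fruit
    by_distance = sorted(known, key=lambda item: math.dist(new_point, item[0]))
    # Step 2: keep the labels of the k nearest fruit
    nearest_labels = [label for _, label in by_distance[:k]]
    # Step 3: return the most common label among those neighbors
    return Counter(nearest_labels).most_common(1)[0][0]

new_fruit = (6.0, 15.0)
print(knn_classify(new_fruit, known_fruit, k=3))
```

With this toy data the three nearest neighbors are an Apple, an Orange, and another Apple, so the new fruit is labeled an Apple. Changing `k` changes how many neighbors get a vote.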

The KNN algorithm works well for small datasets but does not work well on large datasets since it can take a lot of time and memory to compare the new item with every known item in the data set.

For another overview, see [KNN Classification using Scikit-learn](https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn) by Avinash Navlani.


---

*I am still looking for a good video introduction to KNN. If anyone finds one, please post it to the discussion area.*

## Task 1: Applying K-Nearest Neighbors

Using the graph below, which plots data for Apples and Oranges, determine the following by k-nearest neighbors, using either visual inspection or simple measuring:
1. The classification of the x at (18.8, 4.9) when k = 1: looking at the 1 nearest neighbor, should the x be classified as an Apple, an Orange, or Neither?
1. The classification of the x at (18.8, 4.9) when k = 3: looking at the 3 nearest neighbors, should the x be classified as an Apple, an Orange, or Neither?
1. The classification of the x at (18.8, 4.9) when k = 5: looking at the 5 nearest neighbors, should the x be classified as an Apple, an Orange, or Neither?

![KNN Image]( https://raw.githubusercontent.com/CIS3115-Machine-Learning-Scholastica/CIS3115ML-Units3and4/master/K_nearest_neighbors.jpg )
  
---
*Double-click on this cell to put your answer here...*

## Task 2: Test out different values of K

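The code below runs the KNN algorithm with `n_neighbors=1`. Try this code out with different numbers of neighbors and record the results here.

The score is the fraction of test flowers classified correctly, so the closer to 1.0 (100%), the better. Because the train/test split is random, your scores may differ slightly from run to run.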

  
---
- Score for 1 Neighbor: 
- Score for 3 Neighbors:
- Score for 5 Neighbors: 0.96
- Score for 20 Neighbors:
- Score for 50 Neighbors:
- Score for 100 Neighbors:
- Score of your choice:
- Score of your choice:
- Score of your choice:

In [16]:
from sklearn.neighbors import KNeighborsClassifier
# Set up the K-Nearest Neighbors model. Change the value of n_neighbors to try different values of k
knn_model = KNeighborsClassifier(n_neighbors=1)
# Train the model on the training portion of the iris data
knn_model.fit(X_train, y_train)
# Score the model: the fraction of test flowers it classifies correctly
score = knn_model.score(X_test, y_test)
print("The score for this model is ", score)

The score for this model is  0.9333333333333333
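To make the score concrete, here is a small sketch of how an accuracy value like this comes about. It assumes a test set of 30 flowers (20% of a 150-flower data set) and made-up predictions; the helper name `accuracy` is hypothetical and is not part of scikit-learn.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical test set: 30 flowers, 10 of each species
true_species = ["Iris-setosa"] * 10 + ["Iris-versicolor"] * 10 + ["Iris-virginica"] * 10

# Pretend the model got everything right except two flowers
predicted_species = list(true_species)
predicted_species[12] = "Iris-virginica"    # a versicolor mistaken for virginica
predicted_species[25] = "Iris-versicolor"   # a virginica mistaken for versicolor

example_score = accuracy(true_species, predicted_species)
print("28 of 30 correct gives a score of", example_score)
```

Under that assumption, each misclassified test flower lowers the score by 1/30, which is why the scores in this notebook move in steps of about 0.033.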


## Task 3: Prediction
Once we have trained or fit the model to the data, we can use it to make predictions.

Here we will predict the iris species for a new flower with 
- SepalLengthCm = 5.2	
- SepalWidthCm = 3.3
- PetalLengthCm = 1.4
- PetalWidthCm	= 0.2

This should be an Iris-setosa.

Change the code below to make a prediction for a new flower with the following measurements.
- SepalLengthCm = 4.2	
- SepalWidthCm = 3.0
- PetalLengthCm = 3.4
- PetalWidthCm	= 1.2

With petals this long and wide, this should be an Iris-versicolor rather than an Iris-setosa.

In [17]:
# The parameter order is SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm

prediction = knn_model.predict([[5.2, 3.3, 1.4, 0.2]])

print ("Predicted iris species is ", prediction)


Predicted iris species is  ['Iris-setosa']


## Support Vector Machine (SVM)

SVM is another algorithm for classifying data. It tries to divide the data up using lines, sometimes straight lines and sometimes curved lines.

SVM tries to find the best lines (in more than two dimensions, planes or hyperplanes) to divide the data up into the known categories.
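Once a straight dividing line has been found, classifying a new point only requires checking which side of the line it falls on. The sketch below illustrates that idea in plain Python. The weights and bias are made-up values chosen by hand for this example, not values learned by an SVM, and the function names are hypothetical.

```python
def linear_decision(point, weights, bias):
    """Return w . x + b: positive on one side of the line, negative on the other."""
    return sum(w * x for w, x in zip(weights, point)) + bias

def side_of_line(point, weights=(1.0, 1.0), bias=-3.0):
    """Classify a (petal length, petal width) pair using a hand-picked line."""
    # Hypothetical line: petal_length + petal_width = 3.0
    if linear_decision(point, weights, bias) < 0:
        return "Iris-setosa"
    return "not setosa"

print(side_of_line((1.4, 0.2)))   # small petals fall on the setosa side
print(side_of_line((4.5, 1.5)))   # larger petals fall on the other side
```

An SVM's job during training is to choose the weights and bias so that the line sits as far as possible from the nearest points of each class.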

For a good introduction, see the first part of [Support Vector Machine — Introduction to Machine Learning Algorithms](https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47). *(I suggest you stop at the Cost Function and Gradient Updates because the math and coding get too complex for this course.)*

A good video tutorial that explains SVM using baking cupcakes and muffins as an example is Alice Zhao's [Support Vector Machines: A Visual Explanation with Sample Python Code](https://www.youtube.com/watch?v=N1vOgolbjSc).




## Task 4: SVM Kernels and Parameters

### kernels
Support Vector Machines have different ways of defining the lines or hyperplanes separating the data into classes. These are called kernels in our software. You will try out two:
- linear - uses only straight lines
- rbf - [Radial Basis Function](http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex8/ex8.html)

### C penalty parameter
Besides the kernel, there are a number of other parameters you can set on the SVM algorithm. We will look at only one, called "C", which is the penalty the algorithm pays for misclassifying a training point. The larger C is, the harder the algorithm tries not to misclassify any training points, which can come at the cost of a narrower margin between the classes. For a good overview, see the [second answer to this Cross Validated question](https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel).
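One way to see the effect of C is to compute the quantity a soft-margin linear SVM tries to make small: a margin term plus C times the total penalty for points that are misclassified or too close to the line. The sketch below uses a hand-picked line and three made-up points. The function name `soft_margin_objective` is hypothetical, and this is an illustration of the idea rather than scikit-learn's solver.

```python
def soft_margin_objective(weights, bias, points, labels, C):
    """Return 0.5 * ||w||^2 + C * (sum of hinge penalties) for labels in {+1, -1}."""
    margin_term = 0.5 * sum(w * w for w in weights)
    penalty = 0.0
    for point, label in zip(points, labels):
        decision = sum(w * x for w, x in zip(weights, point)) + bias
        # Hinge penalty: zero for points safely on the correct side, growing otherwise
        penalty += max(0.0, 1.0 - label * decision)
    return margin_term + C * penalty

# Hypothetical data: the third point sits on the wrong side of the line x = 0
toy_points = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
toy_labels = [+1, -1, -1]
toy_weights, toy_bias = (1.0, 0.0), 0.0

objective_small_c = soft_margin_objective(toy_weights, toy_bias, toy_points, toy_labels, C=1.0)
objective_large_c = soft_margin_objective(toy_weights, toy_bias, toy_points, toy_labels, C=10.0)
print("C=1.0 :", objective_small_c)
print("C=10.0:", objective_large_c)
```

With a larger C the same misclassified point costs much more, so the algorithm is pushed toward a line that classifies every training point correctly, even if that line has a narrower margin.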
  
---
Run the code below for both the linear and rbf kernels and different values of the C penalty parameter. Note the resulting score here. A score of 100% or 1.0 is the best.

- linear kernel with C=1.0 gave a score of 0.96
- rbf kernel with C=1.0 gave a score of ???
- 
- record other values here
- 

In [0]:
from sklearn.svm import SVC

# Set up the SVM model with a given kernel and C parameter
svm_model = SVC(C=1.0, kernel='linear')          # linear SVM
#svm_model = SVC(C=10.0, kernel='rbf')           # non-linear SVM

# Train the model on the iris data
svm_model.fit(X_train, y_train)
score = svm_model.score(X_test, y_test)
print ("The score for this model is ", score)

The score for this model is  0.9666666666666667


# Decision Trees 
This classification method tries to break the classification task into a series of decisions structured as a tree.

Here is a sample from the [SKlearn documentation for decision trees](https://scikit-learn.org/stable/auto_examples/tree/plot_unveil_tree_structure.html).

![Sample decision tree](https://scikit-learn.org/stable/_images/iris.png)
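A decision tree can be read as a chain of if/else questions. The sketch below writes a tiny tree by hand for the iris problem. The two thresholds are illustrative values picked for this example, not the output of a trained model, and the function name `classify_iris` is hypothetical.

```python
def classify_iris(sepal_length, sepal_width, petal_length, petal_width):
    """Classify a flower with a hand-written two-question decision tree."""
    # First decision: very short petals suggest Iris-setosa
    if petal_length <= 2.45:
        return "Iris-setosa"
    # Second decision: among the rest, narrower petals suggest Iris-versicolor
    if petal_width <= 1.75:
        return "Iris-versicolor"
    return "Iris-virginica"

print(classify_iris(5.2, 3.3, 1.4, 0.2))
print(classify_iris(4.2, 3.0, 3.4, 1.2))
print(classify_iris(6.3, 3.3, 6.0, 2.5))
```

Training a decision tree amounts to choosing which measurement to ask about at each step and which threshold to use, based on the training data.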

For more information on decision trees see:
*  An OK tutorial from medium.com [Decision Tree Classifier — Theory](https://medium.com/machine-learning-101/chapter-3-decision-trees-theory-e7398adac567)
* Use https://en.wikipedia.org/wiki/Decision_tree as a reference

## Running Decision Tree model

The code below will set up a simple decision tree model and run it. Decision trees do have parameters you can tune, such as the maximum depth of the tree (`max_depth`), but here we will simply use the defaults.

In [0]:
from sklearn.tree import DecisionTreeClassifier
DT_model = DecisionTreeClassifier()

# Train the model on the iris data
DT_model.fit(X_train, y_train)
score = DT_model.score(X_test, y_test)
print ("The score for this model is ", score)

The score for this model is  0.9666666666666667


## Task 5: When to select SVM versus Decision Trees

Do some research comparing Support Vector Machines and Decision Trees and list two reasons for selecting one model over the other.

### Reason 1

### Reason 2




## Wrapping Up

Remember to **share this sheet with your instructor** and submit a link to it in Blackboard.