# Machine Learning Workshop Activity

Activity for the SUNY Oswego Computer Science Association’s Git Workshop. Written by Christopher Wells, and released under the CC0 license.

## Getting the Necessary Python Libraries
For this workshop we will be using a few Python libraries that are useful for applying machine learning to problems. Specifically, we will be using the following libraries:

* numpy
* pandas
* scikit-learn
* matplotlib
* seaborn
* scikitplot

You can install these libraries using `pip`. Be sure to install them for Python 3, as this notebook uses Python 3.

## Check that the Libraries are Installed
Before we move on to the machine learning, let's first check that we have all the needed libraries installed. Try running the cell below to see if they are all installed.

You can run the cell by clicking on it and pressing `Ctrl` + `Enter`.

In [None]:
import numpy as np
import pandas as pd
import sklearn.datasets

import matplotlib.pyplot as plt
import seaborn as sns
import scikitplot.plotters as scplt
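
If one of the imports above fails, the following sketch may help narrow down which library is missing. It only uses the Python standard library. The names in `required` are the import names used in this notebook; note that the corresponding `pip` package names can differ (for example, `sklearn` is installed as `scikit-learn` and `scikitplot` as `scikit-plot`).

```python
import importlib.util
import sys

# Import names (not pip package names) used by this notebook.
required = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn", "scikitplot"]

# find_spec returns None when a top-level module cannot be located.
missing = [name for name in required if importlib.util.find_spec(name) is None]

print("Python major version:", sys.version_info.major)
print("Missing libraries:", missing if missing else "none")
```

An empty list means every library was found; anything listed still needs to be installed.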

## Getting the Data
The first thing we will need to do is to get the data that we want to try applying machine learning to.

For this workshop we will be using the classic `iris` dataset, which is a dataset of information on different types of iris flowers.

Run the cell below to download the `.csv` file for the dataset.

In [None]:
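# Import the submodule explicitly; a bare `import urllib` does not guarantee that `urllib.request` is loaded.
import urllib.request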

url = "https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv"
file_name = "iris.csv"

urllib.request.urlretrieve(url, file_name)

Now that we have downloaded the data file, we can use the `pandas` library to load it into Python.

We can use the `read_csv` function in `pandas` to read the data file into a structure called a dataframe that we can use to work with the data.

In [None]:
data = pd.read_csv(file_name)
data

We can see here that the dataframe has 5 columns: `SepalLength`, `SepalWidth`, `PetalLength`, `PetalWidth`, and `Name`.

The first 4 columns are qualities of the flower, and the last column is the type of iris that the flower is.
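
As a rough sketch of how to check this structure in code, the example below inspects the shape, column types, and category counts of the dataframe. To stay runnable without the download, it builds a stand-in dataframe from `sklearn.datasets.load_iris` (an assumption: the bundled copy should closely match `iris.csv`, though small differences in values are possible, and the `Iris-` prefix on the labels is added here by hand).

```python
import pandas as pd
from sklearn.datasets import load_iris

# Offline stand-in for iris.csv (assumption: same flowers, columns renamed to match).
iris = load_iris()
columns = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]
data = pd.DataFrame(iris.data, columns=columns)
data["Name"] = ["Iris-" + iris.target_names[i] for i in iris.target]

print(data.shape)                   # (number of rows, number of columns)
print(data.dtypes)                  # four numeric columns and one text column
print(data["Name"].value_counts())  # number of flowers of each type
```

The same three calls work on the dataframe loaded with `read_csv`.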

## Visualizing the Data
Before we start applying machine learning to the data, let's try plotting some of the data to see if there is a relationship between the different qualities of the flowers and their type (`Name`).

By visualizing the data, we can get an idea of what patterns the machine learning algorithms can pick up on to determine the type of an unknown iris.

We can use the `matplotlib` and `seaborn` libraries to make plots of the data.

Let's try making a scatter plot of the `SepalLength` and `SepalWidth` and color the points by `Name`.

In [None]:
x = "SepalLength"
y = "SepalWidth"
hue = "Name"

# `height` replaces the `size` argument used by older seaborn releases.
sns.lmplot(x=x, y=y, hue=hue, truncate=True, height=6, data=data)
plt.show()

We can see that by looking at just the `SepalWidth` and `SepalLength` we can visually distinguish between the `iris-setosa` and the other categories of irises.

Try making another plot of the data showing the `PetalLength` and `PetalWidth` with the points colored by `Name`.

Now that we have plotted the `PetalLength` and `PetalWidth` we can see that each of the different categories of irises are fairly easy to tell apart based on these two variables.
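
The visual impression can also be checked numerically. The sketch below prints the smallest and largest petal measurements for each type of iris. It uses `sklearn.datasets.load_iris` as an offline stand-in for `iris.csv` (an assumption; the labels there are `setosa`, `versicolor`, and `virginica` without a prefix).

```python
import pandas as pd
from sklearn.datasets import load_iris

# Offline stand-in for iris.csv (assumption: same flowers, columns renamed to match).
iris = load_iris()
data = pd.DataFrame(
    iris.data, columns=["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"])
data["Name"] = [iris.target_names[i] for i in iris.target]

# Smallest and largest petal measurements for each type of iris.
petal_ranges = data.groupby("Name")[["PetalLength", "PetalWidth"]].agg(["min", "max"])
print(petal_ranges)
```

Ranges that do not overlap mean that a single variable is enough to separate two categories; overlapping ranges show where the categories are harder to tell apart.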

So some sets of variables may make it easier to distinguish between the different categories than others. Let's try making plots of all the different variable combinations.

Seaborn provides a function called `pairplot` that automates this process.

In [None]:
hue = "Name"

sns.pairplot(data, hue=hue)
plt.show()

Looking at the different plots, we can see that the blue points (`iris-setosa`) are very separable from the red (`iris-virginica`) and green (`iris-versicolor`). Additionally, the red and green points look mostly separable, but with some overlap in each of the plots.

## Applying ML for Classification
Now that we have an idea of what the data looks like, we can try applying machine learning algorithms to it.

We will try to create a classifier which, given information about an iris of an unknown type, will try to guess the type of the iris. Because we train it on examples whose types are already known, this is known as supervised learning.

In order to create the classifier and see how well it works, we will need to split our dataset into two pieces: a training set and a test set. We will train the classifier on the training set, and test how well it works on the test set.

In [None]:
from sklearn.model_selection import train_test_split

X_columns = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]
y_column = "Name"

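# DataFrame.as_matrix was removed from pandas; select the columns and take .values instead.
X = data[X_columns].values
y = data[y_column]

# Fraction of the rows held out for testing (the rest is used for training).
test_percentage = 0.66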

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=test_percentage, random_state=42)
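
To get a feel for what the split produces, the sketch below prints the sizes of the two sets and how many flowers of each type ended up in the training set. It loads the data with `sklearn.datasets.load_iris` so that it runs on its own (an assumption; the labels there are the integers 0, 1, and 2). The `stratify=y` variant is an optional extra that is not part of the original activity.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Same settings as the activity: 66% of the rows are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.66, random_state=42)
print(len(X_train), len(X_test))
print(Counter(y_train))  # class counts in the training set may be uneven

# Assumption: stratify=y is an optional extra, not used in the original activity.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.66, random_state=42, stratify=y)
print(Counter(y_train_s))  # class counts are kept as even as possible
```

Stratifying can matter on small datasets like this one, where a random split may leave one category underrepresented in the training set.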

Now that we have our training and test sets, we can create and train a classifier on them.

For now we will be using a `RandomForestClassifier`, a classifier model in the `scikit-learn` library that is simple to use and works well on many datasets.

We will need to:

* Create the classifier
* Train it on the training data
* Test/score it on the test data

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42)

rf.fit(X_train, y_train)

rf.score(X_test, y_test)

We can see that the classifier got a score of about `0.93`, which means that it correctly identified about 93% of the irises in the test set.
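
For a classifier, `score` reports accuracy: the fraction of test examples whose predicted category matches the actual one. The sketch below computes that fraction by hand and compares it with `score` and with `sklearn.metrics.accuracy_score`. It loads the data with `sklearn.datasets.load_iris` so that it runs on its own (an assumption), and the exact number printed may differ between `scikit-learn` versions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.66, random_state=42)

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_test_pred = rf.predict(X_test)

score = rf.score(X_test, y_test)
manual = np.mean(y_test_pred == y_test)  # fraction of correct guesses
print(score, manual, accuracy_score(y_test, y_test_pred))
```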

To get more information on which irises it is misclassifying we can create a confusion matrix, which plots the predicted categories versus the actual categories.

Luckily the `scikitplot` library provides a handy function `plot_confusion_matrix` for this purpose.

In [None]:
y_test_pred = rf.predict(X_test)

scplt.plot_confusion_matrix(y_test, y_test_pred)
plt.show()

In this plot, the diagonal indicates irises that were correctly classified, and the boxes outside of the diagonal indicate those that were incorrectly classified.
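
The same information is available as plain numbers through `sklearn.metrics.confusion_matrix`. The sketch below prints the matrix and counts the correct and incorrect predictions from it. It loads the data with `sklearn.datasets.load_iris` so that it runs on its own (an assumption; the categories there are the integers 0, 1, and 2, with names in `target_names`).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.66, random_state=42)
rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_test_pred = rf.predict(X_test)

# Rows are the actual categories, columns are the predicted categories.
cm = confusion_matrix(y_test, y_test_pred, labels=[0, 1, 2])
print(iris.target_names)
print(cm)

correct = np.trace(cm)          # sum of the diagonal
incorrect = cm.sum() - correct  # everything off the diagonal
print(correct, incorrect)
```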

We can see that some `iris-versicolor` were confused with `iris-virginica` and vice-versa. This matches up with what we saw in the visualizations, that those two categories are more difficult to distinguish from one another than the `iris-setosa` category.

Try creating another classifier using a Support Vector Classifier (`sklearn.svm.SVC`) instead and see if it performs any better on the data. ([link](http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html))
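
One possible starting point for this exercise is sketched below. It follows the same create, train, and score steps as before, uses `SVC` with its default settings (an assumption; no tuning is attempted), and loads the data with `sklearn.datasets.load_iris` so that it runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.66, random_state=42)

svc = SVC()  # default settings; assumption, not tuned for this dataset
svc.fit(X_train, y_train)

svc_score = svc.score(X_test, y_test)
print(svc_score)
```

Comparing `svc_score` with the random forest's score on the same test set gives a first impression of which model suits the data better.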

## Clustering
In addition to classifying the data, we can also try applying clustering to it.

In clustering, you take data without any category labels and you try to figure out what different categories exist in the data. This is also known as unsupervised learning.

Let's try applying `KMeans` clustering to the data. KMeans is a simple clustering algorithm that tries to create `K` clusters from the data, where you specify the number `K`.

Before we apply clustering to the data, we will need to scale the data to account for the columns having different ranges of values. We can do this by using a `StandardScaler`.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_scaled = scaler.fit_transform(X)
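
To illustrate what the scaler does, the sketch below compares its output with the same calculation done by hand: each column has its mean subtracted and is then divided by its standard deviation. It loads the data with `sklearn.datasets.load_iris` so that it runs on its own (an assumption).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Standardization by hand: subtract each column's mean, divide by its standard deviation.
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X.mean(axis=0), X.std(axis=0))  # original columns have different ranges
print(np.allclose(X_scaled, X_manual))
```

After scaling, every column has a mean of about 0 and a standard deviation of about 1, so no single measurement dominates the distances that KMeans relies on.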

Now that we have scaled the data, let's try creating 2 clusters from it.

In [None]:
from sklearn.cluster import KMeans

n_clusters = 2
kmeans = KMeans(n_clusters, random_state=42)

clustering = kmeans.fit_predict(X_scaled)

data["cluster"] = clustering

Now that we have our clusters, let's try plotting the data to see how the clusters look.

In [None]:
hue = "cluster"

sns.pairplot(data, hue=hue)
plt.show()

We can see that the two clusters appear very separable in most of the plots. Additionally, the two clusters seem to coincide with the `iris-setosa` and `iris-versicolor` + `iris-virginica` categories.
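
A cross-tabulation is one way to check this impression without a plot. The sketch below counts how many flowers of each type fall into each cluster. It loads the data with `sklearn.datasets.load_iris` so that it runs on its own, and it sets `n_init` explicitly because the default differs between `scikit-learn` versions (both are assumptions, not part of the original activity).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
names = pd.Series([iris.target_names[i] for i in iris.target], name="Name")
X_scaled = StandardScaler().fit_transform(iris.data)

# n_init is set explicitly because its default differs between scikit-learn versions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
clusters = pd.Series(kmeans.fit_predict(X_scaled), name="cluster")

# Rows are the known types, columns are the cluster numbers.
table = pd.crosstab(names, clusters)
print(table)
```

A row with all of its flowers in a single column means that type sits entirely inside one cluster.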

Try applying KMeans clustering to the data again, but this time with 3 clusters, and put the clusters in a column called `cluster_2`.

We can see that the 3 clusters we created are actually quite similar to the pre-existing categories of irises. Let's try creating a confusion matrix to see how well the clusters line up with the known categories.

In [None]:
clustering = [str(i) for i in data["cluster_2"]]

true_labels = data["Name"].unique()
cluster_labels = np.unique(clustering)

scplt.plot_confusion_matrix(y, clustering, true_labels=true_labels, pred_labels=cluster_labels)
plt.show()

We can see that each of the clusters lines up fairly well with a known category. This suggests that the known categories reflect natural groupings in the iris measurements.
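
The agreement between clusters and known categories can also be summarized as a single number. The sketch below uses `sklearn.metrics.adjusted_rand_score`, which is not part of the original activity and is offered only as one possible measure. It loads the data with `sklearn.datasets.load_iris` and sets `n_init` explicitly so that it runs on its own (both are assumptions).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

clusters = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X_scaled)

# 1.0 means the clusters match the known categories exactly;
# values near 0 mean the agreement is no better than chance.
ari = adjusted_rand_score(y, clusters)
print(ari)
```

The score ignores how the clusters are numbered, so it does not matter which cluster number ends up paired with which type of iris.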

## Resources

* [Scikit-learn User Guide](http://scikit-learn.org/stable/user_guide.html)
* [Seaborn Tutorials](http://seaborn.pydata.org/tutorial.html)
* [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/raw/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)