# Tutorial Objectives

* Implement a simple machine learning classifier with **Python** and **scikit-learn**

* Learn a *straightforward and reusable template* for machine learning

* Complete a **fun machine learning project!**

# Setup

This tutorial is **very interactive**, so to follow along, please make sure you have *numpy*, *pandas*, and *scikit-learn* installed.

If you don't have these installed, you can get these packages (and many more useful packages for machine learning) by downloading **Anaconda**. You can do this by visiting https://www.anaconda.com/download/.

Once you've finished the installation, the following packages should import successfully in a Python interpreter.

In [19]:
import numpy as np
import pandas as pd

# Classification

There are *many* different types of machine learning problems, but the one that we're going to focus on in this tutorial is **classification**.

In a classification problem, we are given observations/details about an object and we want to **assign a category or class** to the object.

* Length and width of Sepal/Petal --> Type of flower
* Image of digit --> Digit
* Information about a passenger on the Titanic --> Whether or not the passenger survived

# Problem

The problem that we're going to explore in this tutorial is that of predicting types of irises. For this purpose, we use the **Iris** dataset.

The Iris dataset consists of 3 different types of irises (*Setosa, Versicolour, and Virginica*), which we must predict given their petal/sepal length/width. If, like me, you have no idea what the difference between a sepal and a petal is, the image below should help!

![test](https://upload.wikimedia.org/wikipedia/commons/thumb/7/78/Petal-sepal.jpg/220px-Petal-sepal.jpg)

The code below loads this data from the scikit-learn package.

In [26]:
from sklearn import datasets

iris = datasets.load_iris()
print(iris.DESCR)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

In [25]:
X = iris.data
y = iris.target

print("X:", X)
print("y:", y)

X: [[ 5.1  3.5  1.4  0.2]
 [ 4.9  3.   1.4  0.2]
 [ 4.7  3.2  1.3  0.2]
 [ 4.6  3.1  1.5  0.2]
 [ 5.   3.6  1.4  0.2]
 [ 5.4  3.9  1.7  0.4]
 [ 4.6  3.4  1.4  0.3]
 [ 5.   3.4  1.5  0.2]
 [ 4.4  2.9  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.4  3.7  1.5  0.2]
 [ 4.8  3.4  1.6  0.2]
 [ 4.8  3.   1.4  0.1]
 [ 4.3  3.   1.1  0.1]
 [ 5.8  4.   1.2  0.2]
 [ 5.7  4.4  1.5  0.4]
 [ 5.4  3.9  1.3  0.4]
 [ 5.1  3.5  1.4  0.3]
 [ 5.7  3.8  1.7  0.3]
 [ 5.1  3.8  1.5  0.3]
 [ 5.4  3.4  1.7  0.2]
 [ 5.1  3.7  1.5  0.4]
 [ 4.6  3.6  1.   0.2]
 [ 5.1  3.3  1.7  0.5]
 [ 4.8  3.4  1.9  0.2]
 [ 5.   3.   1.6  0.2]
 [ 5.   3.4  1.6  0.4]
 [ 5.2  3.5  1.5  0.2]
 [ 5.2  3.4  1.4  0.2]
 [ 4.7  3.2  1.6  0.2]
 [ 4.8  3.1  1.6  0.2]
 [ 5.4  3.4  1.5  0.4]
 [ 5.2  4.1  1.5  0.1]
 [ 5.5  4.2  1.4  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 5.   3.2  1.2  0.2]
 [ 5.5  3.5  1.3  0.2]
 [ 4.9  3.1  1.5  0.1]
 [ 4.4  3.   1.3  0.2]
 [ 5.1  3.4  1.5  0.2]
 [ 5.   3.5  1.3  0.3]
 [ 4.5  2.3  1.3  0.3]
 [ 4.4  3.2  1.3  0.2]
 [ 5.   
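The raw array printed above is hard to read without column names. As a minimal sketch (the `iris_df` name and the `species` column are illustrative choices, not part of the original notebook), the same data can be wrapped in a pandas DataFrame using the `feature_names` and `target_names` attributes that `load_iris()` returns:

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Wrap the raw array in a DataFrame so each column carries its feature name
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Map the integer labels (0, 1, 2) to their species names
iris_df["species"] = [iris.target_names[label] for label in iris.target]

print(iris_df.head())
print(iris_df["species"].value_counts())
```

Each row is one flower, the four numeric columns are the features in `X`, and the `species` column is a readable version of `y`.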

# Preparing the Data

Generally, we want to **train** our classifiers and then **evaluate** their performance. To do this, we need to split our dataset.

The larger the size of your training set, the more data your model has to learn from. The larger the size of your testing set, the more confident you can be in your evaluation. For our purposes, we will do an 80/20 split with 80% of the data being used for training and the remaining 20% being used for evaluation.

In [24]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print("Training shape:", X_train.shape)
print("Testing shape:", X_test.shape)

Training shape: (120, 4)
Testing shape: (30, 4)
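Because `train_test_split` shuffles the data randomly, the exact samples in each set (and the class balance of the test set) change from run to run. The sketch below shows one possible way to make the split repeatable and balanced. The `random_state=0` value is an arbitrary choice, and `stratify=y` is an optional addition that the original cell does not use.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# random_state fixes the shuffle; stratify=y keeps the class proportions
# the same in the training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("Training shape:", X_train.shape)
print("Testing shape:", X_test.shape)
print("Test samples per class:", np.bincount(y_test))
```

With stratification, each of the three classes contributes 10 of the 30 testing samples.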


# Classification Models

scikit-learn implements a large number of machine learning models. Some commonly used ones are listed below:

* Support Vector Machine (*sklearn.svm.SVC*)
* Logistic Regression (*sklearn.linear_model.LogisticRegression*)
* Random Forest (*sklearn.ensemble.RandomForestClassifier*)
* Multi-layer Perceptron (*sklearn.neural_network.MLPClassifier*)

We're going to use a **random forest** consisting of 20 decision trees. A **decision tree** essentially consists of a sequence of questions about the features that ultimately produces a decision (sort of like a flow chart).

![Example of a decision tree](http://dataaspirant.com/wp-content/uploads/2017/01/B03905_05_01-compressor.png)

A **random forest** trains multiple decision trees, each on a different random sample of the training data and with *different subsets of the features* considered at each split, and averages the predictions of these trees at prediction time. The intuition behind random forests is that the decision trees learn to identify *different* patterns and relationships in the data.

![Example of a random forest](https://d2wh20haedxe3f.cloudfront.net/sites/default/files/random_forest_diagram_complete.png)

In [23]:
from sklearn.ensemble import RandomForestClassifier

# Define and train the random forest classifier
clf = RandomForestClassifier(n_estimators=20)
clf.fit(X_train, y_train)

# Make a prediction on the first testing sample
print("Testing sample:", X_test[0])
print("Prediction:", clf.predict([X_test[0]]))
print("Ground truth:", y_test[0])

Testing sample: [ 6.4  2.8  5.6  2.2]
Prediction: [2]
Ground truth: 2
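To make the "averaging" idea concrete, here is a small sketch that looks inside a fitted forest. It assumes the same 20-tree setup as above, with `random_state=0` added only so the run is repeatable. The names `forest`, `tree_probs`, and `mean_probs` are illustrative. scikit-learn exposes the individual trees through the `estimators_` attribute, and the forest's `predict_proba` is the mean of the trees' class probabilities.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X_train, y_train)

# Take the first testing sample, keeping it 2-D as scikit-learn expects
sample = X_test[:1]

# Ask every individual tree for its class probabilities, then average them
tree_probs = np.array(
    [tree.predict_proba(sample)[0] for tree in forest.estimators_]
)
mean_probs = tree_probs.mean(axis=0)

print("Number of trees:", len(forest.estimators_))
print("Averaged tree probabilities:", mean_probs)
print("Forest probabilities:       ", forest.predict_proba(sample)[0])
print("Predicted class:", forest.classes_[np.argmax(mean_probs)])
```

The two probability rows should agree, which shows that the forest's prediction is simply the class with the highest average probability across its trees.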


# Evaluation

We now need to evaluate the quality of our model. In order to do this, we use a couple of metrics that are commonly used for classification problems.

![Metric definition](https://i.stack.imgur.com/z5WJHm.jpg)

**Precision**: *of the samples we predict as a particular label, what fraction actually have that label*

**Recall**: *of the samples that actually have a particular label, what fraction do we correctly identify*

In [21]:
# Run the classifier on all of our testing samples
y_pred = clf.predict(X_test)
print(y_pred)

[2 2 1 2 2 2 0 0 2 1 0 1 1 1 2 0 0 1 2 1 2 2 2 2 2 1 1 0 1 1]


In [22]:
# Evaluate our predictions
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       1.00      1.00      1.00         6
          1       0.91      1.00      0.95        10
          2       1.00      0.93      0.96        14

avg / total       0.97      0.97      0.97        30
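The numbers in a report like this can be reproduced by hand. The sketch below uses hypothetical labels (not the actual `y_test` from this run) chosen to have the same class sizes as the report, 6, 10, and 14, with a single class-2 sample mistaken for class 1. The helper `precision_recall` is an illustrative function, not part of scikit-learn.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth: 6, 10, and 14 samples of classes 0, 1, and 2
y_true = np.array([0] * 6 + [1] * 10 + [2] * 14)

# Hypothetical predictions: all correct except one class-2 sample called class 1
y_hat = y_true.copy()
y_hat[-1] = 1

def precision_recall(y_true, y_hat, label):
    """Compute precision and recall for one label from raw counts."""
    true_pos = np.sum((y_hat == label) & (y_true == label))
    predicted = np.sum(y_hat == label)   # everything we called `label`
    actual = np.sum(y_true == label)     # everything that really is `label`
    return true_pos / predicted, true_pos / actual

for label in (0, 1, 2):
    p, r = precision_recall(y_true, y_hat, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f}")

# Cross-check against scikit-learn's own per-class metrics
print(precision_score(y_true, y_hat, average=None))
print(recall_score(y_true, y_hat, average=None))
```

Here class 1 has precision 10/11 (about 0.91) because one of its 11 predictions is wrong, and class 2 has recall 13/14 (about 0.93) because one of its 14 samples was missed.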



# Conclusion

Great work! You have now trained a model to predict iris types, and you'll never go another day without knowing the type of iris you're looking at.

# Your Turn!

Now it's your turn to train a machine learning model to predict who survives on the Titanic! Visit http://ubcml.com/challenges/titanic to download the starter code.

The starter file already goes through the process of training a model and dumping the generated predictions to a file for you. Nonetheless, we'll go through some data processing and visualization to get you familiarized with the data.

In [20]:
df = pd.read_csv('train.csv', header=0) 
df

FileNotFoundError: File b'train.csv' does not exist

In [None]:
df_X = df.drop(['Survived'], axis=1)
df_X

In [None]:
df_Y = df[['Survived']]
df_Y

In [None]:
df_features = df_X[['Pclass', 'Sex', 'Age', 'Fare']]
df_features

In [None]:
# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
features = df_features.to_numpy()
features

In [None]:
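# Encode the 'Sex' column (index 1) as 1 for male and 0 for female
features[:,1] = [int(e == "male") for e in features[:,1]]
# Convert to floats and replace missing values (NaN) with 0
X = np.nan_to_num(features.astype("float"))
X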

In [None]:
# as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
y = df_Y.to_numpy()[:,0]
y
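If you don't have `train.csv` yet, the preprocessing steps above can be tried on a tiny stand-in table. The sketch below uses made-up rows (the values are purely illustrative, not real passenger records) with the same column names as the cells above, and condenses the steps into a few pandas operations using `to_numpy`, which current pandas versions provide.

```python
import numpy as np
import pandas as pd

# Hypothetical rows standing in for train.csv (values are made up)
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 13.0],
})

# Keep only the feature columns used in this tutorial
features = df[["Pclass", "Sex", "Age", "Fare"]].copy()

# Encode 'Sex' as 1 for male and 0 for female
features["Sex"] = (features["Sex"] == "male").astype(int)

# Convert to a float array and replace missing values (NaN) with 0
X = np.nan_to_num(features.to_numpy(dtype=float))
y = df["Survived"].to_numpy()

print(X)
print(y)
```

Replacing a missing age with 0 mirrors what the cells above do. It is the simplest option, not necessarily the best one, so treat it as a starting point to improve on.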