# Introduction to Machine Learning

In this example, we'll tackle the classic iris plant classification problem using Scikit-learn. Think of this as your 'Hello World!' to machine learning. :D

The Iris dataset was used in Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems. The data set consists of 50 samples from each of three species of Iris. Each iris plant listed in the dataset has four different features or attributes:

- Sepal Length
- Sepal Width
- Petal Length
- Petal Width

Our task is to classify the iris plants into 3 species:
- Iris Setosa
- Iris Versicolor
- Iris Virginica 

![alt](images/iris_.png)

## 1. Importing libraries
We'll start by importing some important libraries for data analysis and scientific computing:

In [1]:
import pandas as pd
import numpy as np

[Pandas](http://pandas.pydata.org/) provides easy-to-use data structures and data analysis tools for the Python programming language and allows for high-level data manipulation. Meanwhile [NumPy](http://www.numpy.org/), which stands for Numerical Python, is a scientific computing package.

Next, we'll import important tools and models from the Scikit-learn library.

In [2]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

In these four lines, we import the four machine learning algorithms we'll be using in this example. In future sessions, we'll get to learn how some of these algorithms work, but for now, you can view these models as a 'black box' which accepts some input (features/attributes) and produces some output (predictions), without any knowledge of its internal workings.

<b>FAQ</b>: How do you know which algorithm to choose and which will work well for your problem?

<b>Answer</b>: Generally, there are a lot of algorithms to choose from. [See this guide for a tour of the different ML algorithms](https://www.quora.com/How-do-you-choose-a-machine-learning-algorithm). 

To choose an appropriate algorithm, you first need to understand and categorize the problem you are trying to solve. [See this post for a detailed explanation](https://www.quora.com/How-do-you-choose-a-machine-learning-algorithm). 
The short version: to choose the right algorithm, you need to:
1. Categorize your problem (supervised, unsupervised, reinforcement, classification, regression, sequential/temporal, etc.)
2. Find available algorithms applicable to your problem (e.g. SVM, ANN work for classification problems; CRF, HMM for sequential data, etc.)
3. Implement them by setting up a machine learning pipeline that compares the performances of the different algorithms
4. Optimize hyperparameters (optional)

In fact, you can look at this [scikit-learn cheat sheet](https://i.stack.imgur.com/BZJiN.png) for a very rough guide to choosing an algorithm for your problem (just don't limit the models you use to the algorithms in the cheat sheet).

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

These two lines import tools for automatically splitting the dataset and computing the accuracy of the resulting predictions, respectively. We'll get to see these two tools in action later.

## 2. Loading and Inspecting the Dataset

In [4]:
filename = '../datasets/iris.csv' 
dataframe = pd.read_csv(filename, header=0) 

Alright. So here, we've loaded the dataset. This piece of code reads the `iris.csv` file from the `datasets` folder. We set `header=0` to specify that the first row (row 0 in Python) contains the headers, i.e. the names of each column (e.g. Sepal Length, Sepal Width, ... , Species). Some datasets might not contain headers, so be careful with this!
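
To see why the header setting matters, here is a minimal sketch that reads two tiny in-memory CSV snippets, one with a header row and one without. The CSV strings and variable names are made up for illustration; `header=None` and `names` are standard `read_csv` parameters.

```python
import io

import pandas as pd

# Made-up CSV snippets for illustration (not the real iris.csv file)
csv_with_header = "SepalLengthCm,SepalWidthCm\n5.1,3.5\n4.9,3.0\n"
csv_without_header = "5.1,3.5\n4.9,3.0\n"

# header=0: the first row is used for the column names
with_header = pd.read_csv(io.StringIO(csv_with_header), header=0)

# header=None: no row is treated as a header, so we supply the names ourselves
without_header = pd.read_csv(
    io.StringIO(csv_without_header),
    header=None,
    names=["SepalLengthCm", "SepalWidthCm"],
)

print(with_header.shape)
print(without_header.shape)
```

If you used `header=0` on a file without headers, the first data row would be swallowed as column names and you would silently lose one sample.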

Afterwards, we store the dataset in the variable `dataframe` as a Pandas data structure called the [DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). You can think of a DataFrame as a two-dimensional array or matrix containing our data. [You can also look at this cheat sheet for a visualization]().

Now, let's inspect our data. It's always a good idea to know what your data looks like. 

### Dimensions

In [5]:
print(dataframe.shape)

(150, 5)


Here, we check the dimensions of the data. There are:
- 150 rows (50 samples for each of the 3 species)
- 5 columns (Sepal Length, Sepal Width, Petal Length, Petal Width, and Species).

### Peeking at your data

`head(10)` prints the first 10 rows or instances of your data. 

In [6]:
print(dataframe.head(10))

   SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm      Species
0            5.1           3.5            1.4           0.2  Iris-setosa
1            4.9           3.0            1.4           0.2  Iris-setosa
2            4.7           3.2            1.3           0.2  Iris-setosa
3            4.6           3.1            1.5           0.2  Iris-setosa
4            5.0           3.6            1.4           0.2  Iris-setosa
5            5.4           3.9            1.7           0.4  Iris-setosa
6            4.6           3.4            1.4           0.3  Iris-setosa
7            5.0           3.4            1.5           0.2  Iris-setosa
8            4.4           2.9            1.4           0.2  Iris-setosa
9            4.9           3.1            1.5           0.1  Iris-setosa


`tail(10)` prints the last 10 rows or instances of your data.

In [7]:
print(dataframe.tail(10))

     SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm         Species
140            6.7           3.1            5.6           2.4  Iris-virginica
141            6.9           3.1            5.1           2.3  Iris-virginica
142            5.8           2.7            5.1           1.9  Iris-virginica
143            6.8           3.2            5.9           2.3  Iris-virginica
144            6.7           3.3            5.7           2.5  Iris-virginica
145            6.7           3.0            5.2           2.3  Iris-virginica
146            6.3           2.5            5.0           1.9  Iris-virginica
147            6.5           3.0            5.2           2.0  Iris-virginica
148            6.2           3.4            5.4           2.3  Iris-virginica
149            5.9           3.0            5.1           1.8  Iris-virginica


### Data Type for each Attribute

In [8]:
print(dataframe.dtypes)

SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object


You can get an idea of the types of attributes by peeking at the raw data, as above. You can also list the data types used by the DataFrame to characterize each attribute using the dtypes property.
- `float64` refers to numerical data (particularly decimal numbers) 
- `object` in this dataset pertains to data with string values

[Check out the other types of data here.](https://docs.scipy.org/doc/numpy-1.12.0/user/basics.types.html)

### Descriptive Statistics

Descriptive statistics can give you great insight into the shape of each attribute.

Often you can create more summaries than you have time to review. The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute:
- Count
- Mean
- Standard Deviation
- Minimum Value
- 25th Percentile
- 50th Percentile (Median)
- 75th Percentile
- Maximum Value
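
To make these eight numbers concrete, here is a small sketch that computes them by hand for the ten sepal lengths shown in the `head(10)` output above, using only Python's standard `statistics` module. The variable names are illustrative, and the choice of linearly interpolated quantiles is an assumption about what best mirrors the Pandas summary.

```python
import statistics

# First ten sepal lengths (cm), copied from the head(10) output above
sepal_lengths = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9]

# The 'inclusive' method interpolates linearly between neighboring data points
q25, q50, q75 = statistics.quantiles(sepal_lengths, n=4, method="inclusive")

summary = {
    "count": len(sepal_lengths),
    "mean": statistics.mean(sepal_lengths),
    "std": statistics.stdev(sepal_lengths),  # sample standard deviation (divides by n - 1)
    "min": min(sepal_lengths),
    "25%": q25,
    "50%": q50,  # the median
    "75%": q75,
    "max": max(sepal_lengths),
}

for name, value in summary.items():
    print(f"{name:>5}: {value:.4f}")
```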

In [9]:
print(dataframe.describe())

       SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
count     150.000000    150.000000     150.000000    150.000000
mean        5.843333      3.054000       3.758667      1.198667
std         0.828066      0.433594       1.764420      0.763161
min         4.300000      2.000000       1.000000      0.100000
25%         5.100000      2.800000       1.600000      0.300000
50%         5.800000      3.000000       4.350000      1.300000
75%         6.400000      3.300000       5.100000      1.800000
max         7.900000      4.400000       6.900000      2.500000


### Class Distribution (Classes only)

On classification problems you need to know how balanced the class values are.

In [10]:
print(dataframe.groupby('Species').size())

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64


A balanced dataset is one where the number of observations/instances for each class is more or less the same. The Iris dataset is an example of a balanced dataset since each class has exactly 50 observations.
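
As a rough sketch of how you might check balance yourself, the snippet below counts labels with `collections.Counter` and compares the largest class to the smallest. The `labels` list is a hypothetical stand-in for the `Species` column, and `imbalance_ratio` is just an illustrative name, not a standard metric.

```python
from collections import Counter

# Hypothetical stand-in for the 'Species' column of the Iris dataset
labels = (
    ["Iris-setosa"] * 50
    + ["Iris-versicolor"] * 50
    + ["Iris-virginica"] * 50
)

counts = Counter(labels)
total = sum(counts.values())

# Share of the dataset that belongs to each class
proportions = {species: count / total for species, count in counts.items()}

# Largest class divided by smallest class; 1.0 means perfectly balanced
imbalance_ratio = max(counts.values()) / min(counts.values())

print(counts)
print(imbalance_ratio)
```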

Highly imbalanced problems are problems wherein there are a lot more observations for one class than another. These are common and may need special handling in the data preparation stage of your project. We'll discuss ways to handle this in future sessions.

### Correlation between Attributes
Correlation refers to the relationship between two variables and how they may or may not change together.

The most common measure is Pearson's correlation coefficient, which captures how strongly two attributes are linearly related and is most informative when the attributes are roughly normally distributed. A correlation of -1 or 1 indicates a perfect negative or positive correlation, respectively, whereas a value of 0 indicates no linear correlation at all.

Some machine learning algorithms like linear and logistic regression can suffer poor performance if there are highly correlated attributes in your dataset. As such, it is a good idea to review all of the pair-wise correlations of the attributes in your dataset. You can use the corr() function on the Pandas DataFrame to calculate a correlation matrix.

In [11]:
print(dataframe.corr(method='pearson'))

               SepalLengthCm  SepalWidthCm  PetalLengthCm  PetalWidthCm
SepalLengthCm       1.000000     -0.109369       0.871754      0.817954
SepalWidthCm       -0.109369      1.000000      -0.420516     -0.356544
PetalLengthCm       0.871754     -0.420516       1.000000      0.962757
PetalWidthCm        0.817954     -0.356544       0.962757      1.000000


The matrix lists all attributes across the top and down the side, to give correlation between all pairs of attributes (twice, because the matrix is symmetrical). You can see the diagonal line through the matrix from the top left to bottom right corners of the matrix shows perfect correlation of each attribute with itself.
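
To see what a single entry of such a matrix represents, here is a from-scratch sketch of Pearson's coefficient for two lists of numbers. The `pearson` helper and the toy data are made up for illustration; in practice you would let `corr()` do this for you.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two variables vary together around their means
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable varies on its own
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(spread_x * spread_y)


xs = [1.0, 2.0, 3.0, 4.0]

same = pearson(xs, xs)                       # an attribute with itself
opposite = pearson(xs, [8.0, 6.0, 4.0, 2.0])  # goes down as xs goes up
partial = pearson(xs, [2.0, 1.0, 4.0, 3.0])   # loosely increasing

print(same, opposite, partial)
```

The first value corresponds to the diagonal of the matrix (each attribute against itself), and swapping the two arguments gives the same result, which is why the matrix is symmetrical.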

## 3. Extracting Features and Labels
Next, we need to extract the features/attributes as well as our class labels.

In [12]:
X = dataframe.iloc[0:150, 0:4] 
y = dataframe.iloc[0:150, 4] 

Remember that we have four features (Sepal Length, Sepal Width, Petal Length, and Petal Width) in the first four columns and the class label ('Species') in the fifth (last) column.

<b>FAQ</b>: What's `iloc`?

<b>Answer</b>: We use `iloc` (which stands for '<b>i</b>nteger <b>loc</b>ation') to select certain parts of the data based on their position or index.

<b>Recap on Python indexing</b>: Remember that in Python, indexing starts at zero (0) and ends at n-1, where n is the number of elements. Thus, for the Iris dataset containing 150 rows, the first row is considered to be at index 0 and the last row is at index 149. Similarly, for the columns, we have 5 columns with indices 0, 1, 2, 3, and 4.

<b>Recap on Python Slicing</b>: Remember that in Python, when slicing a list or matrix, we need to specify the starting index and the ending index in square brackets separated by a colon (e.g. `[starting_index : ending_index]`). Python, however, returns the elements from the starting index up to the ending index - 1. (Yeah, Python's a little quirky like that.)

Also, if you don't specify the starting index, it automatically starts at the very beginning which is at index `0` (e.g. `[:5]` is the same as `[0:5]`). Likewise, if you don't specify the ending index, it gets everything from the starting index up to the very end (e.g. if you have 4 elements, then `[1:]` is the same as `[1:4]`). Meaning, if you have something like `[:]`, this basically gets ALL the elements. Capisce?
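
Here's a quick sketch of these slicing rules on a plain Python list. The list contents and variable names are just for illustration.

```python
petal_lengths = [1.4, 1.4, 1.3, 1.5]  # 4 elements, at indices 0 to 3

first_two = petal_lengths[0:2]    # indices 0 and 1; index 2 is excluded
from_start = petal_lengths[:2]    # omitted start defaults to index 0
to_end = petal_lengths[1:]        # omitted end runs to the last element
same_to_end = petal_lengths[1:4]  # explicit end index, same result
everything = petal_lengths[:]     # a copy of all the elements

print(first_two)
print(to_end)
print(everything)
```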

[For more information on basic Python indexing and slicing for lists, see this very helpful tutorial!](https://www.tutorialspoint.com/python/python_lists.htm)

<b> Slicing for DataFrames</b>

In the first line, 


In [13]:
X = dataframe.iloc[0:150, 0:4] #or dataframe.iloc[:, :4]

we create a variable `X` containing a matrix of the features of all the rows. That is, we slice the dataframe using two sets of colons: the first set is for the <b>rows</b>, the second for the <b>columns</b>. In other words, `0:150` indicates that we want to get all the rows from index 0 until index 149. That's basically all the rows in our dataset! Then, `0:4` indicates we want to get all the columns from index 0 up to index 3 (excluding index 4). Thus, `dataframe.iloc[0:150, 0:4]` is the same as `dataframe.iloc[:, :4]`.

![](images/slice1.png)

In the second line, 

In [14]:
y = dataframe.iloc[0:150, 4] #or dataframe.iloc[:, 4]

we create a variable `y` containing an array of the class labels of all the rows. Since the labels are located in the column at index 4, we slice the DataFrame such that we get all the rows of that column.
![](images/slice2.png)

(<b>Note</b>: Error in the figure - 151 should be 149.)

Lastly, note that we can also access the column values by their header names (labels) using `loc` instead of `iloc`.
For more info, [try reading up on it here to get a better understanding](https://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-position).
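
As a small sketch of the difference, the snippet below selects the same columns by position with `iloc` and by name with `loc`. The three-row `toy` DataFrame is a made-up stand-in for the full dataset.

```python
import pandas as pd

# Made-up miniature version of the Iris data, for illustration only
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7],
    "PetalLengthCm": [1.4, 1.4, 1.3],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

# Features: by position (end index excluded) and by column name
X_by_position = toy.iloc[:, 0:2]
X_by_label = toy.loc[:, ["SepalLengthCm", "PetalLengthCm"]]

# Labels: by position and by column name
y_by_position = toy.iloc[:, 2]
y_by_label = toy.loc[:, "Species"]

# Unlike iloc, a label slice with loc INCLUDES the end label
X_by_label_slice = toy.loc[:, "SepalLengthCm":"PetalLengthCm"]

print(X_by_position.equals(X_by_label))
print(list(X_by_label_slice.columns))
```

Selecting by name is often easier to read and keeps working if the column order changes, while `iloc` is handy when you only know the positions.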

## 4. Splitting the dataset
Some models may perform better than others on certain datasets. By 'perform', we mean how well a model can predict new data that it has not seen before. How then do we quantify the performance of a model?

A general practice is to split your data into a training and test set. 
- The training set is used to train/tune your model in order for it to learn your data.
- The test set is used to evaluate how well it generalizes to data it has never seen before. We can use a certain metric (e.g. accuracy, error measures) to quantify the performance of the model on the test set.
(Source: [Why split into training and test set?](https://www.quora.com/What-is-a-training-data-set-test-data-set-in-machine-learning-What-are-the-rules-for-selecting-them))

In essence, rather than using the entire dataset to train your model, you <i>hold back</i> a part of your dataset with the goal of eventually coming up with some measurement (e.g. accuracy, error) of how well your model performs on never-before-seen data (the test set). For example, on a two-class problem you wouldn't want to use a model with only 51% accuracy. That's hardly better than random guessing!

Note that in splitting the dataset, the training set should be much larger than the test set. This is because machine learning models generally depend on a lot of data to be able to generalize well. By holding back data for testing, we're essentially reducing the amount of valuable information for the model to learn from. Therefore, we usually allot more data to the training set; typical splits are 60/40, 70/30, or 80/20, with the larger portion allotted to the training set.

Alright. So let's start splitting the data.

In [15]:
train_size = 0.8
seed = 7 
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size, random_state=seed)

Luckily, there's a method called `train_test_split` that automatically splits the dataset. Here, we set the `train_size` to 80%, which automatically allots the remaining 20% to the test set. 

<b>FAQ:</b> What's the <i>seed</i> or <i>random state</i>?

<b>Answer</b>: [This answer sums it up pretty nicely :)](https://stackoverflow.com/questions/42191717/python-random-state-in-splitting-dataset)
Short answer: It's used for reproducibility and debugging. It can be set to any integer. 
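
To see reproducibility in action, here is a short sketch that splits the same data twice with the same `random_state` and once with a different one. The `samples` list is a stand-in for the 150 rows of the dataset, and the seed values are arbitrary.

```python
from sklearn.model_selection import train_test_split

samples = list(range(150))  # stand-in for the 150 rows of the dataset

# Same seed twice: the shuffle, and therefore the split, is identical
train_a, test_a = train_test_split(samples, train_size=0.8, random_state=7)
train_b, test_b = train_test_split(samples, train_size=0.8, random_state=7)

# Different seed: the rows are shuffled differently
train_c, test_c = train_test_split(samples, train_size=0.8, random_state=8)

print(len(train_a), len(test_a))  # sizes of the two parts
print(test_a == test_b)           # same seed, same test set
print(test_a == test_c)           # different seed, very likely a different test set
```

Without a fixed seed, every run would give you a slightly different test set and therefore a slightly different accuracy, which makes results hard to compare.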

## 5. Instantiate the Model

In [16]:
# Instantiate learning model
clf = GaussianNB()

# Fit model to training set
clf.fit(X_train, y_train)

# Predict labels of the test set
y_pred = clf.predict(X_test)

We use the Gaussian Naive Bayes Model denoted as `GaussianNB()`. We then fit the classifier model to the training set using the `fit()` method. Finally, we use `predict()` to produce predicted labels for the test set. 

We can compare the predicted labels `y_pred` and true test labels `y_test` manually as follows:

In [17]:
results = pd.DataFrame({'Predicted label': y_pred, 'True label': y_test})
print(results)

     Predicted label       True label
149   Iris-virginica   Iris-virginica
84   Iris-versicolor  Iris-versicolor
40       Iris-setosa      Iris-setosa
66   Iris-versicolor  Iris-versicolor
106  Iris-versicolor   Iris-virginica
41       Iris-setosa      Iris-setosa
52    Iris-virginica  Iris-versicolor
94   Iris-versicolor  Iris-versicolor
11       Iris-setosa      Iris-setosa
51   Iris-versicolor  Iris-versicolor
77    Iris-virginica  Iris-versicolor
85   Iris-versicolor  Iris-versicolor
32       Iris-setosa      Iris-setosa
109   Iris-virginica   Iris-virginica
28       Iris-setosa      Iris-setosa
70    Iris-virginica  Iris-versicolor
108   Iris-virginica   Iris-virginica
137   Iris-virginica   Iris-virginica
46       Iris-setosa      Iris-setosa
37       Iris-setosa      Iris-setosa
82   Iris-versicolor  Iris-versicolor
120   Iris-virginica   Iris-virginica
63   Iris-versicolor  Iris-versicolor
119  Iris-versicolor   Iris-virginica
129   Iris-virginica   Iris-virginica
138   Iris-v

You can check that our model has not predicted the test set perfectly and commits some mistakes. 

We want to get the fraction of the test set for which the predicted label is the same as the true label:
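
Conceptually, this is just counting matches. Here is a plain-Python sketch using the first five rows of the comparison table above; the `fraction_correct` helper is a made-up name for illustration.

```python
# First five (predicted, true) label pairs from the comparison table above
y_pred_sample = [
    "Iris-virginica", "Iris-versicolor", "Iris-setosa",
    "Iris-versicolor", "Iris-versicolor",
]
y_true_sample = [
    "Iris-virginica", "Iris-versicolor", "Iris-setosa",
    "Iris-versicolor", "Iris-virginica",
]


def fraction_correct(predicted, actual):
    """Share of positions where the predicted label equals the true label."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


sample_accuracy = fraction_correct(y_pred_sample, y_true_sample)
print(sample_accuracy)  # 4 of the 5 labels match
```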

In [18]:
# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred) # or accuracy = np.mean(y_pred == y_test) 
print("Gaussian Naive Bayes: " + str(accuracy))

Gaussian Naive Bayes: 0.833333333333


In [19]:
# Or, more concisely (predicts y_pred and evaluates accuracy in a single line of code)
accuracy = clf.score(X_test, y_test)
print("Gaussian Naive Bayes: " + str(accuracy))

Gaussian Naive Bayes: 0.833333333333


And voila! Using the Gaussian Naive Bayes classifier we get a whopping 83% accuracy. Can we do better though?

Let's try comparing the other different classifiers!

In [20]:
# To evaluate multiple classifiers, store classifiers into a dictionary:
classifiers = dict() 
classifiers['Gaussian Naive Bayes'] = GaussianNB()
classifiers['Decision Tree Classifier'] = DecisionTreeClassifier(random_state=seed)
classifiers['Support Vector Machines'] = SVC()

# Iterate over dictionary
for clf_name, clf in classifiers.items(): #clf_name is the key, clf is the value
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    print(clf_name + ': ' + str(accuracy))

Decision Tree Classifier: 0.9
Support Vector Machines: 0.933333333333
Gaussian Naive Bayes: 0.833333333333


Here, we see that the Support Vector Machine (SVM) produces the best results on this test set, with about 93% accuracy!

In the next session, we'll learn about and implement our first machine learning algorithm, K-Nearest Neighbors (KNN).