I am going to follow this [tutorial](https://machinelearningmastery.com/machine-learning-in-python-step-by-step/) to create my first machine learning project. To actually learn the steps in creating a machine learning project, read the tutorial. In this notebook I will only be explaining the code from the tutorial, not why we are using it.

## Version Checking

In [30]:
# Check the versions of libraries

# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

Python: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0]
scipy: 0.19.1
numpy: 1.12.1
matplotlib: 2.1.0
pandas: 0.22.0
sklearn: 0.18.2


Let's break down what the above code means. 

In [31]:
import sys

sys stands for system. I think it's loosely similar to pulling in part of the standard library in C/C++, although the comparison is only approximate.
This line imports the module (Python's name for a library) called sys. This module provides a number of functions and variables that can be used to inspect and manipulate different parts of the Python runtime environment.
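
To make that a little more concrete, here is a small sketch of a few things the sys module exposes. It only uses standard-library names (sys.version, sys.version_info, sys.platform); the variable names are my own and are not from the tutorial.

```python
import sys

# sys.version is a plain string describing the interpreter build.
version_text = sys.version

# sys.version_info carries the same information as numbers,
# which is easier to compare in code than the string.
major, minor = sys.version_info.major, sys.version_info.minor

# sys.platform is a short string naming the operating system family.
platform_name = sys.platform

print(type(version_text).__name__)  # str
print('Python {}.{} on {}'.format(major, minor, platform_name))
```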

Side note -- here's an excellent [resource](http://effbot.org/librarybook/sys.htm) for understanding modules (including sys) as well as other basics. 

In [32]:
print('Python: {}'.format(sys.version))

Python: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0]


This is a basic call to the print function. This [link](https://www.python-course.eu/python3_print.php) shows the basics of the print function. print writes out the text form of whatever arguments it receives. The text inside the quotation marks is a string literal, and here it is passed through format before being printed.

'Python: {}' ----> This is a string object. Remember that objects have member functions (methods) that can be called using the dot operator.

.format ----> This is a string method. It is used to format the string. The empty braces in the string act as a placeholder for an argument. (We could have also written 'Python: {0}'.format(sys.version) and it would have been equivalent.) To understand this further, visit this [website](https://www.python.org/dev/peps/pep-3101/). Also, understanding the basics of what classes are, and what an object-oriented programming language is (Python is one), would be helpful.

sys.version ----> Again, sys is a module. It has functions and variables (attributes), and version is one of its variables: a string describing the Python interpreter. Notice that there are no parentheses after it, because it is not being called like a function.
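
The placeholder behaviour described above can be checked with a short sketch. The variable names below are mine, chosen only for illustration.

```python
import sys

# The empty braces are a placeholder that format() fills in.
auto_numbered = 'Python: {}'.format(sys.version)

# An explicit index refers to the same (first) argument.
explicit_index = 'Python: {0}'.format(sys.version)

# Several placeholders are filled from left to right.
pair = '{} and {}'.format('scipy', 'numpy')

# With explicit indexes the arguments can be reordered.
swapped = '{1} before {0}'.format('scipy', 'numpy')

print(auto_numbered == explicit_index)  # True
print(pair)                             # scipy and numpy
print(swapped)                          # numpy before scipy
```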

Alright, let's move on. 


## Building the Actual Project

In [33]:
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

In [34]:
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)

The code above simply loads the dataset into a [dataframe](https://chrisalbon.com/python/data_wrangling/pandas_dataframe_importing_csv/) called "dataset" and gives each column in the data a name. A dataframe (I will abbreviate it as dt) is like a table or matrix: each column (called an attribute) has one of the names we specified, and each row (called an instance) contains one record of the data that we loaded. Dataframes have certain attributes, like shape and values (values returns the data in the dt, not the row or column labels). They also have certain methods, like head and describe.

Another excellent resource for understanding [dataframes](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python).

The pandas module is the one that provides the dataframe data structure.
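
Here is a minimal sketch of the same read_csv call that runs without a network connection. As an assumption for illustration, it reads three rows of the iris data from an in-memory string (via io.StringIO) instead of the URL; the names csv_text and sample are hypothetical.

```python
import io
import pandas

# A tiny stand-in for the downloaded file: three rows, no header line.
csv_text = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "4.7,3.2,1.3,0.2,Iris-setosa\n"
)

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# names= supplies the column labels because the data has no header row.
sample = pandas.read_csv(io.StringIO(csv_text), names=names)

print(sample.shape)          # (rows, columns)
print(list(sample.columns))  # the labels we passed in
print(sample.values[0])      # first row as raw values, without labels
print(sample.head(2))        # first two rows, with labels
```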

In [35]:
# shape
print(dataset.shape)

(150, 5)


150 instances (rows), 5 attributes (columns)

In [36]:
# head
print(dataset.head(20))

    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3    

In [37]:
#descriptions
print(dataset.describe())

       sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000


In [38]:
#class distribution
print(dataset.groupby('class').size())

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64


I skipped a few steps after this, as there is no point in copying and pasting code from the tutorial. It pretty much makes sense when you read it.

Now let's get to actually evaluating some algorithms. 

In [39]:
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

What we are trying to do now is to separate the dataset into two parts: 80% to train the algorithms, and 20% held back to test their accuracy.

Let's understand the above code before we move on.


dataset.values --> Like I said before, dataset is a dataframe. A dataframe comes with certain attributes, "values" is one of them. To understand the dataframe attribute values further, here's a [resource](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.values.html#pandas.DataFrame.values). This resource also explains many other important details about dataframes. 

The result from dataset.values is stored in a variable called array. This specific array is a NumPy array (different from a Python list or the array type in Python's standard library). It is an instance of the class called ndarray.

An ndarray has a rank (number of dimensions) and a shape (a tuple stating the length of each dimension of the array).
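
A quick sketch of rank and shape on a small made-up array (grid is a hypothetical name, not part of the tutorial). In NumPy the rank is available as the ndim attribute.

```python
import numpy

# A 2-dimensional array: 2 rows, 3 columns.
grid = numpy.array([[1, 2, 3],
                    [4, 5, 6]])

print(type(grid).__name__)  # ndarray
print(grid.ndim)            # 2
print(grid.shape)           # (2, 3)
```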



Now let's move on to the next line of code.

X = array[:,0:4] ---> What's happening here is called slicing. A certain subset of the array is being stored in X.

Slicing is a form of indexing. It allows you to grab a certain portion of a container and assign it to another variable.

array[:,0:4] ---> This means take all the rows, and from each row take columns 0 up to 4 (exclusive), that is, columns 0, 1, 2, and 3.

Y = array[:,4] ---> This means take all the rows, and from each row take column 4, and store the result in Y.
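
The two slices can be tried on a small array. This is only a sketch: table is a hypothetical stand-in for dataset.values, built from the first three rows of the iris data.

```python
import numpy

# Stand-in for dataset.values: four numeric columns plus a class label.
table = numpy.array([
    [5.1, 3.5, 1.4, 0.2, 'Iris-setosa'],
    [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'],
    [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'],
], dtype=object)

# All rows, columns 0 through 3 (the stop index 4 is excluded).
features = table[:, 0:4]

# All rows, column 4 only. A single index drops that dimension.
labels = table[:, 4]

print(features.shape)  # (3, 4)
print(labels.shape)    # (3,)
print(labels[0])       # Iris-setosa
```

One detail worth knowing: basic slicing like this gives a view onto the same underlying data rather than a separate copy.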


To understand numpy a bit better, here's a [link](http://cs231n.github.io/python-numpy-tutorial/). I highly recommend going through that link as it does a good job of explaining things such as functions, dictionaries, classes, and tuples in Python. It also goes into numpy and thoroughly explains the concepts I have mentioned (like ndarrays, indexing, slicing).

validation_size = 0.20 ---> validation_size is a variable with the value 0.20

seed = 7 ---> seed is a variable with the value 7

Alright, the last line in that chunk of code is very long, so let me first copy it down here and then we can go through it:

X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

model_selection ---> This is a submodule of the module sklearn. If you scroll up you can see that we imported this at the beginning

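model_selection.train_test_split ---> This is one of the functions from the model_selection submodule. It splits arrays into random train and test subsets. Here test_size is the fraction of the data held back, and random_state seeds the shuffling so the split is the same every time. To read the actual function definition, its purpose, and its implementation, have a look at [this](https://github.com/scikit-learn/scikit-learn/blob/ed5e127b/sklearn/model_selection/_split.py#L1920).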
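
To see what the split produces, here is a small sketch on made-up data. X_demo and Y_demo are hypothetical arrays of ten samples (not the iris data), so 20% comes out to exactly two samples.

```python
import numpy
from sklearn import model_selection

# Ten made-up samples with two features each, plus ten made-up labels.
X_demo = numpy.arange(20).reshape(10, 2)
Y_demo = numpy.array([0, 1] * 5)

X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X_demo, Y_demo, test_size=0.20, random_state=7)

print(X_train.shape, X_validation.shape)  # (8, 2) (2, 2)
print(Y_train.shape, Y_validation.shape)  # (8,) (2,)

# Using the same random_state again reproduces the same split.
repeat = model_selection.train_test_split(
    X_demo, Y_demo, test_size=0.20, random_state=7)
print(numpy.array_equal(X_validation, repeat[1]))  # True
```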

In [40]:
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'

In [42]:
# Spot Check Algorithms
models = [] #Create an empty list

In [None]:
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
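
Each entry in models is a (name, model) tuple, so a for loop can unpack both parts at once. Below is a rough sketch of how such a list might be scored with cross-validation. It is not necessarily how the tutorial does it. As assumptions for illustration, it uses the copy of the iris data bundled with scikit-learn (datasets.load_iris) instead of the downloaded CSV, only three of the six models, and a shuffled 10-fold split.

```python
from sklearn import datasets, model_selection
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Bundled copy of the iris data (an assumption, to avoid the download).
X, Y = datasets.load_iris(return_X_y=True)

seed = 7
scoring = 'accuracy'

models = []  # list of (name, model) tuples
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier(random_state=seed)))
models.append(('NB', GaussianNB()))

results = {}
for name, model in models:  # unpack each tuple into its two parts
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    scores = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results[name] = scores.mean()
    print('{}: {:.3f}'.format(name, results[name]))
```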