# Hello World

> The Iris dataset is the "hello world" of machine learning. The task is to classify three species of Iris plant from four features: sepal length, sepal width, petal length, and petal width.

> We start by importing the dataset from the UCI Machine Learning Repository.


In [56]:
import pandas as pd
import numpy
import sklearn
import scipy

pd.set_option('display.max_colwidth', None)  # display full content of cells (None replaces the deprecated -1)

In [57]:

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/bezdekIris.data"
names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
iris_df = pd.read_csv(url, names=names)
print(iris_df.shape)

(150, 5)


> The dataset has no missing values and all measurements are in centimeters, so little preprocessing is needed.

In [58]:
print(iris_df.describe())

       sepal_length  sepal_width  petal_length  petal_width
count  150.000000    150.000000   150.000000    150.000000 
mean   5.843333      3.057333     3.758000      1.199333   
std    0.828066      0.435866     1.765298      0.762238   
min    4.300000      2.000000     1.000000      0.100000   
25%    5.100000      2.800000     1.600000      0.300000   
50%    5.800000      3.000000     4.350000      1.300000   
75%    6.400000      3.300000     5.100000      1.800000   
max    7.900000      4.400000     6.900000      2.500000   


In [59]:
print(iris_df.groupby('class').size())

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64


In [60]:
iris_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal_length    150 non-null float64
sepal_width     150 non-null float64
petal_length    150 non-null float64
petal_width     150 non-null float64
class           150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB


## Split the data for model validation
> Hold out 20% of the data to test the model afterwards. The function `model_selection.train_test_split()` automates this and shuffles the data before splitting.


In [61]:
from sklearn import model_selection 
array = iris_df.values
X = array[:,0:4]
y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, y_train, y_validation = model_selection.train_test_split(X, y, test_size=validation_size, random_state=seed)
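
> To make the split concrete, here is a minimal pure-Python sketch of a shuffled hold-out split. The helper `holdout_split` and the stand-in `rows` list are hypothetical names for illustration only. The sketch mimics the idea behind `train_test_split` but will not reproduce its exact shuffle order.

```python
import random


def holdout_split(rows, test_size=0.20, seed=7):
    """Shuffle row indices with a fixed seed, then hold out a fraction for testing.

    Hypothetical helper for illustration; not scikit-learn's implementation.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = round(len(rows) * test_size)
    cut = len(rows) - n_test
    train_idx, test_idx = indices[:cut], indices[cut:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(150))  # stand-in for the 150 rows of the dataset
train_rows, test_rows = holdout_split(rows)
print(len(train_rows), len(test_rows))  # 120 30
```

> With 150 rows and a 20% hold-out, 120 rows remain for training and 30 are reserved for validation.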

## Test models

- Two algorithms are tested, one linear and one nonlinear. Linear models make more rigid assumptions about the data. The linear algorithm is Logistic Regression (LR) and the nonlinear one is the Decision Tree Classifier (CART).
- Accuracy is the evaluation metric: the number of correctly predicted instances divided by the total number of instances, reported as a fraction or a percentage.
- The training data, X_train, is split into 10 parts. The model is trained on 9 parts and tested on the remaining one, once for each of the 10 possible choices of test part. This process is cross-validation. The accuracy of each model is the mean accuracy over the 10 runs.
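
> The 10-part split described above can be sketched in pure Python. The generator `kfold_indices` is a hypothetical helper written for illustration. It assumes contiguous, unshuffled folds, and says nothing about how the models themselves are trained.

```python
def kfold_indices(n_samples, n_splits=10):
    """Yield (train_idx, test_idx) pairs for contiguous, unshuffled folds.

    Hypothetical helper for illustration: each fold is used once as the
    test part while the other folds form the training part.
    """
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # spread any remainder over the first folds
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(kfold_indices(120, n_splits=10))  # 120 = size of the training set
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 108 12
```

> With 120 training rows, each of the 10 runs trains on 108 rows and tests on 12, and every row is tested exactly once.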
   

In [72]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

seed = 3
scoring = 'accuracy'
models = []
models.append(('LR', LogisticRegression()))
models.append(('CART', DecisionTreeClassifier()))

results = []
names = []
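for name, model in models:
    # random_state is omitted: KFold only uses it when shuffle=True,
    # and recent scikit-learn versions raise an error otherwise
    kfold = model_selection.KFold(n_splits=10)
    # y_train (lowercase) is the name returned by train_test_split above
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f " % (name, cv_results.mean())
    print(msg)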

LR: 0.966667 
CART: 0.983333 


> The Decision Tree Classifier reached the higher cross-validated accuracy (98.33%), so it is used to predict the held-out 20% of the data.

In [74]:
DTC = DecisionTreeClassifier()
DTC.fit(X_train, y_train)
predictions = DTC.predict(X_validation)
print(accuracy_score(y_validation, predictions))  # y_validation (lowercase) as defined by the split

0.8666666666666667


> The model reaches 86.7% accuracy on the validation data, noticeably below the cross-validated estimate of 98.33%.
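
> As a sanity check on the score, accuracy can be computed by hand. The sketch below assumes the 30 validation samples implied by the 20% hold-out. The labels, the misclassified positions, and the helper name `accuracy` are hypothetical, chosen only so that 26 of 30 predictions are correct, which is the count that matches the printed 0.8667.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical validation labels: 30 samples, 10 per class (an assumption,
# the real validation set need not be balanced).
y_true = ["Iris-setosa"] * 10 + ["Iris-versicolor"] * 10 + ["Iris-virginica"] * 10
y_pred = list(y_true)

# Flip four hypothetical predictions between versicolor and virginica.
for i in (12, 15, 22, 27):
    y_pred[i] = "Iris-virginica" if y_true[i] == "Iris-versicolor" else "Iris-versicolor"

score = accuracy(y_true, y_pred)
print(round(score, 4))  # 0.8667, i.e. 26 correct out of 30
```

> On a validation set this small, each misclassified sample moves the accuracy by about 3.3 percentage points.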