# Simple Classification With SciKit-Learn

We will be classifying the IRIS dataset with SciKit-Learn, using a few of its built-in classifiers.

We'll start by importing the libraries we need.

In [1]:
from __future__ import print_function
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier)
from sklearn.tree import DecisionTreeClassifier
# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection
from sklearn.model_selection import train_test_split

Next, we'll load the IRIS dataset into memory and define the *n_estimators* variable, which sets how many trees each ensemble builds.

Afterwards, we'll split our data into training and testing sets.

In [2]:
n_estimators = 10
iris = load_iris()


X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
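As a quick sanity check on the split, the sketch below prints the array shapes and the label names. It repeats the loading and splitting steps only so that it runs on its own. With 150 samples and `test_size=0.5`, each half holds 75 rows.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, y = iris.data, iris.target

# Same split as above: half for training, half for testing, fixed seed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

print(X.shape)                       # 150 samples, 4 features
print(X_train.shape, X_test.shape)   # 75 rows each
print(iris.feature_names)            # names of the 4 feature columns
print(iris.target_names)             # names of the 3 classes
```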

Most classifiers in SciKit-Learn follow the same pattern: construct the classifier, then fit it to the training data.

```python
CLASSIFIER = CLASSIFIER_CLASS(...)
CLASSIFIER = CLASSIFIER.fit(...)
```

There are many ways to score your classifier, and SciKit-Learn offers a *score* function for this, which for classifiers returns the mean accuracy on the data you pass in.  For models that support feature importances, they can be read from the classifier's public *feature_importances_* attribute after fitting.
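Because the pattern is the same for every classifier, it can be wrapped in a small function. The sketch below is one possible way to do that; `fit_and_score` is a hypothetical helper name, not part of SciKit-Learn, and `random_state=0` is added only to make the run repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

def fit_and_score(clf):
    """Hypothetical helper: fit on the training half, score on the test half."""
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    # Not every estimator exposes feature importances, so fall back to None
    importances = getattr(clf, "feature_importances_", None)
    return accuracy, importances

models = {
    "Decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=10, random_state=0),
}

results = {}
for name, clf in models.items():
    results[name] = fit_and_score(clf)
    print(name, results[name][0])
```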

We'll run five models below:
- Decision tree
- Random forest
- Extremely randomized trees (ExtraTrees)
- AdaBoost
- Gradient boosting

We'll print their scores afterwards.

In [4]:
DT = DecisionTreeClassifier(max_depth=3)
DT = DT.fit(X_train, y_train)
DTscore = DT.score(X_test, y_test)                           
DTimportance = DT.feature_importances_

RF = RandomForestClassifier(n_estimators=n_estimators)
RF = RF.fit(X_train, y_train)
RFscore = RF.score(X_test, y_test)                           
RFimportance = RF.feature_importances_

ET = ExtraTreesClassifier(n_estimators=n_estimators)
ET = ET.fit(X_train, y_train)
ETscore = ET.score(X_test, y_test)                           
ETimportance = ET.feature_importances_

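# Note: algorithm='SAMME.R' was deprecated in scikit-learn 1.4; on newer versions, drop the argument
AB = AdaBoostClassifier(algorithm='SAMME.R', n_estimators=n_estimators)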
AB = AB.fit(X_train, y_train)
ABscore = AB.score(X_test, y_test)                           
ABimportance = AB.feature_importances_

GB = GradientBoostingClassifier(n_estimators=n_estimators)
GB = GB.fit(X_train, y_train)
GBscore = GB.score(X_test, y_test)                           
GBimportance = GB.feature_importances_

print("Decision Tree:", DTscore)
print("Random Forest:", RFscore)
print("Extra Trees:", ETscore)
print("AdaBoost:", ABscore)
print("Gradient Boost:", GBscore)
print()

Decision Tree: 0.96
Random Forest: 0.92
Extra Trees: 0.946666666667
AdaBoost: 0.92
Gradient Boost: 0.946666666667
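These scores come from a single 75-sample test set, so the gap between 0.92 and 0.96 amounts to only three flowers. Cross-validation is another way to score a classifier that averages over several splits. The sketch below is not part of the original run; `cv=5` and `random_state=0` are arbitrary choices made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)

# Fit and score the classifier on 5 different train/test partitions
scores = cross_val_score(clf, X, y, cv=5)

print(scores)
print("mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

A spread across folds that is as large as the difference between two models suggests the ranking above should not be read too strictly.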



We can parse the *feature_importances_* variables for each model to see which features the models found most important.  The following code does this for you.

In [5]:
# pair each model's display name with its feature importances
importances = [
    ("Decision Tree", DTimportance),
    ("Random Forest", RFimportance),
    ("Extra Trees", ETimportance),
    ("AdaBoost", ABimportance),
    ("Gradient Boost", GBimportance),
]

for name, importance in importances:
    print("%s Importance" % name)
    # indices of the features, sorted from most to least important
    indices = np.argsort(importance)[::-1]
    # one line per feature (column of X): rank, feature index, importance
    for f in range(X.shape[1]):
        print("%d. feature %d (%f)" % (f + 1, indices[f], importance[indices[f]]))
    print()

Decision Tree Importance
1. feature 3 (0.605619)
2. feature 2 (0.394381)
3. feature 1 (0.000000)
4. feature 0 (0.000000)

Random Forest Importance
1. feature 3 (0.449457)
2. feature 2 (0.396069)
3. feature 0 (0.108318)
4. feature 1 (0.046156)

Extra Trees Importance
1. feature 3 (0.478541)
2. feature 2 (0.332334)
3. feature 0 (0.132471)
4. feature 1 (0.056654)

AdaBoost Importance
1. feature 3 (0.500000)
2. feature 2 (0.400000)
3. feature 0 (0.100000)
4. feature 1 (0.000000)

Gradient Boost Importance
1. feature 3 (0.542569)
2. feature 2 (0.311392)
3. feature 1 (0.119088)
4. feature 0 (0.026951)
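The feature numbers in this output are column indices into `iris.feature_names`, so features 2 and 3 are petal length and petal width. A minimal sketch of printing names instead of indices follows; `ranked_importances` is a hypothetical helper, and the tree is refit here (with `random_state=0`) only so the example is self-contained, so its numbers may differ from those above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5, random_state=0
)
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

def ranked_importances(importances, names):
    """Hypothetical helper: (name, importance) pairs, most important first."""
    order = np.argsort(importances)[::-1]
    return [(names[i], float(importances[i])) for i in order]

ranked = ranked_importances(clf.feature_importances_, iris.feature_names)
for rank, (name, value) in enumerate(ranked, start=1):
    print("%d. %s (%f)" % (rank, name, value))
```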



Based on the scores above, the *decision tree classifier* is the most accurate on this test split (0.96), although the margin over the other models is small.  We can save this model to a *pkl* (pronounced "pickle") file using the *pickle* library.

In [6]:
import pickle

We can save the model using the *dump* function after the model has been trained.  This requires us to open a file and dump the model into it.  Make sure you close your files!  Pickle writes bytes, so the file must be opened in mode *wb*: *w* for *write* and *b* for *binary*.

In [9]:
fileName = "Saved_DT.pkl"
f = open(fileName, 'wb')
pickle.dump(DT, f)
f.close()
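The same save step can also be written with a `with` block, which closes the file automatically even if the dump raises an error. The sketch below is one way to do it; it trains its own small tree and writes into a temporary directory (both are assumptions made so the example runs on its own), then reads the file back only to confirm the round trip.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Temporary directory used only so this sketch does not overwrite other files
file_name = os.path.join(tempfile.mkdtemp(), "Saved_DT.pkl")

# The file is closed automatically when the with block ends
with open(file_name, "wb") as f:
    pickle.dump(model, f)

# Read the model back ("rb" = read, binary) and compare predictions
with open(file_name, "rb") as f:
    restored = pickle.load(f)

same = bool((restored.predict(X) == model.predict(X)).all())
print(f.closed, same)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.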

After it is saved, we can load it once again using the following code.  Once again, always close your files in Python!   

In [11]:
f = open(fileName, 'rb')
loaded_DT = pickle.load(f)  # this is the decision tree saved above
f.close()

We can then predict some values from the model.

In [14]:
X_test_spl = X_test[0:10]
y_test_spl = y_test[0:10]

predictions = loaded_DT.predict(X_test_spl)

# print each prediction next to the true label
for predicted, actual in zip(predictions, y_test_spl):
    print(predicted, actual)

2 2
1 1
0 0
2 2
0 0
2 2
0 0
1 1
1 1
1 1


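And we quickly see, the model is perfect.

OK, not really. These are only the first ten test samples; on the full test set the decision tree scored 0.96, which means it misclassified 3 of the 75 held-out flowers.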
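To see where a model goes wrong, it helps to look at the whole test set rather than ten rows. The sketch below uses `accuracy_score` and `confusion_matrix` from `sklearn.metrics`; it refits a tree with `random_state=0` so that it runs on its own, which is an assumption and may not reproduce the exact model saved above.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

predictions = model.predict(X_test)

# Fraction of test samples predicted correctly (same value as model.score)
accuracy = accuracy_score(y_test, predictions)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, predictions)

# Count of misclassified test samples
n_wrong = int((predictions != y_test).sum())

print(accuracy)
print(cm)
print(n_wrong)
```

Off-diagonal entries in the matrix show which classes get confused with each other.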