# Random Forests Solution
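A Random Forest is an ensemble of decision trees: each tree is trained on a random sample of the data, and the forest combines the trees' predictions into a single answer.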

There are a few libraries we'll need: numpy and pandas for handling the data, and a few modules from scikit-learn:
* Datasets - has the Iris dataset ready for us to use
* Ensemble - holds the ensemble models, including Random Forests
* Model Selection - helps us split our dataset into training and test sets
* Metrics - has the functionality to measure how well our model performs

In [58]:
import numpy as np
import pandas as pd
from sklearn import datasets, ensemble, model_selection, metrics

We'll need to load the data from the sklearn `datasets` module and put it into a dataframe for readability.

In [3]:
iris_dataset = datasets.load_iris()

iris_dataframe = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)

iris_dataframe["label"] = pd.Categorical.from_codes(iris_dataset.target, iris_dataset.target_names)

In [15]:
iris_dataframe.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),label
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


In [18]:
iris_dataframe.shape

(150, 5)
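
Before splitting the data, it can help to check how many examples each species has and whether any values are missing. The following is a minimal sketch that rebuilds the dataframe from the cells above; the names `class_counts` and `missing_values` are introduced here for illustration only.

```python
import pandas as pd
from sklearn import datasets

# Rebuild the dataframe exactly as in the cells above.
iris_dataset = datasets.load_iris()
iris_dataframe = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
iris_dataframe["label"] = pd.Categorical.from_codes(iris_dataset.target, iris_dataset.target_names)

# Count the rows per species; a balanced dataset has equal counts.
class_counts = iris_dataframe["label"].value_counts()
print(class_counts)

# Count missing values across all columns; zero means nothing to clean up.
missing_values = int(iris_dataframe.isna().sum().sum())
print("Missing values:", missing_values)
```

The Iris dataset has 50 examples of each of its three species, so plain accuracy is a reasonable first metric here.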

# Prepare training and test datasets
Let's look at the attributes - the "x" data - separate from the labels ("y").

This comes from the idea that any model can be viewed as a simple function:
$$\text{answers} = \text{model}(\text{attributes})$$
or more simply
$$y = f(x)$$

We need to do a few things:
1. Create an X and a y set
2. Factorise y - at the moment the labels are strings, which are hard for the algorithm to understand. We need to change them to integers (e.g. 0, 1, 2 for the three different species of iris flower)
3. Split the dataset into training and test sets so we don't test the model on examples it has already seen
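
Step 2 can be sketched in isolation. `pd.factorize` returns two things: the integer codes, assigned in order of first appearance, and the unique values those codes stand for. The `labels` series below is a made-up stand-in for the real label column, and `decoded` is a name used only for this illustration.

```python
import pandas as pd

# A tiny stand-in for the label column (made-up rows, not the real data).
labels = pd.Series(["setosa", "versicolor", "setosa", "virginica"])

# factorize returns integer codes plus the unique values they stand for.
codes, uniques = pd.factorize(labels)
print(codes)
print(list(uniques))

# Indexing the unique values with the codes recovers the original strings.
decoded = list(uniques[codes])
print(decoded)
```

Keeping `uniques` around is useful later, when you want to turn the model's integer predictions back into species names.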

In [29]:
iris_dataframe.iloc[:, 0:4]

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [34]:
X = iris_dataframe.iloc[:, 0:4]  # Features
y = iris_dataframe['label']  # Labels
y = pd.factorize(y)[0]

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3) # 70% training and 30% test
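
The split above is random, so the results will vary from run to run. As a variation (not what the cell above does), `train_test_split` also accepts `random_state` to make the split repeatable and `stratify` to keep the class proportions equal in both sets. The seed value 0 below is an arbitrary assumption for this sketch.

```python
from collections import Counter
from sklearn import datasets, model_selection

X, y = datasets.load_iris(return_X_y=True)

# random_state=0 is an arbitrary seed chosen for this sketch; stratify=y keeps
# the class proportions the same in the training and test sets.
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)

# Count how many test examples fall into each class.
test_class_counts = Counter(y_test.tolist())
print(test_class_counts)
```

With 150 balanced rows and a 30% test size, stratifying gives a test set with the same number of flowers from each species.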

# Random Forests
Now that the data is ready, we need a model to train. We are going to create an "instance" of sklearn's `RandomForestClassifier` class, which may be a little different from the basic Python you have used so far. Effectively, you are making a ready-to-use copy of a random forest model from a blueprint that sklearn provides, complete with its own functions (methods).

The `n_jobs` argument tells `RandomForestClassifier` how many cores it may use in parallel. A value of -1 means use all available cores.

We then train the model on the training data, both x and y. In sklearn, training is called "fitting". Once the model has been fitted, we ask it to predict the answers for the test dataset, given only x. Importantly, the model does not change when predicting; it only tries to label the examples it is given.

In [51]:
classifier = ensemble.RandomForestClassifier(n_jobs=-1, verbose=True)

In [52]:
classifier.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    0.1s finished


RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=None,
                       verbose=True, warm_start=False)

In [53]:
classifier.predict(X_test)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.0s finished


array([1, 2, 1, 0, 2, 1, 0, 1, 0, 2, 1, 2, 0, 2, 0, 0, 1, 2, 1, 1, 1, 0,
       1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 2, 1, 2, 1, 0, 2, 1, 0, 1, 1, 0, 1,
       1])
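
To check the claim that predicting does not change the model, we can predict twice and compare both the answers and something the model learned during fitting. This is a minimal sketch that rebuilds the data from scratch; the `random_state=0` values are assumptions added for repeatability and are not part of the cells above.

```python
import numpy as np
from sklearn import datasets, ensemble, model_selection

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=0
)

# random_state=0 is an assumption added here so the run is repeatable.
classifier = ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
classifier.fit(X_train, y_train)

# Snapshot something the model learned during fitting.
importances_before = classifier.feature_importances_.copy()

first_predictions = classifier.predict(X_test)
second_predictions = classifier.predict(X_test)

# Predicting is read-only: same answers both times, same learned state.
print(np.array_equal(first_predictions, second_predictions))
print(np.allclose(importances_before, classifier.feature_importances_))
```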

In [54]:
classifier.predict_proba(X_test)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.0s finished


array([[0.  , 1.  , 0.  ],
       [0.  , 0.  , 1.  ],
       [0.  , 0.99, 0.01],
       [1.  , 0.  , 0.  ],
       [0.  , 0.  , 1.  ],
       [0.  , 0.89, 0.11],
       [1.  , 0.  , 0.  ],
       [0.  , 1.  , 0.  ],
       [1.  , 0.  , 0.  ],
       [0.  , 0.07, 0.93],
       [0.  , 1.  , 0.  ],
       [0.  , 0.  , 1.  ],
       [0.94, 0.06, 0.  ],
       [0.  , 0.  , 1.  ],
       [1.  , 0.  , 0.  ],
       [1.  , 0.  , 0.  ],
       [0.01, 0.99, 0.  ],
       [0.  , 0.04, 0.96],
       [0.  , 1.  , 0.  ],
       [0.  , 1.  , 0.  ],
       [0.  , 1.  , 0.  ],
       [1.  , 0.  , 0.  ],
       [0.  , 1.  , 0.  ],
       [0.  , 1.  , 0.  ],
       [1.  , 0.  , 0.  ],
       [1.  , 0.  , 0.  ],
       [1.  , 0.  , 0.  ],
       [0.01, 0.99, 0.  ],
       [0.  , 1.  , 0.  ],
       [0.98, 0.02, 0.  ],
       [1.  , 0.  , 0.  ],
       [0.08, 0.92, 0.  ],
       [0.  , 0.02, 0.98],
       [0.  , 0.54, 0.46],
       [0.  , 0.11, 0.89],
       [0.  , 1.  , 0.  ],
       [1.  , 0.  , 0.  ],
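
Each row of `predict_proba` is where the "ensemble" idea becomes visible: scikit-learn's forest averages the class probabilities predicted by its individual trees, and `predict` picks the class with the highest average. The sketch below checks this on a freshly fitted model; `random_state=0` and the names `tree_probas`, `mean_tree_proba` and `predicted_from_proba` are assumptions for illustration.

```python
import numpy as np
from sklearn import datasets, ensemble, model_selection

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=0
)

classifier = ensemble.RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(X_train, y_train)

forest_proba = classifier.predict_proba(X_test)

# Ask every fitted tree for its own class probabilities, then average them.
tree_probas = np.stack([tree.predict_proba(X_test) for tree in classifier.estimators_])
mean_tree_proba = tree_probas.mean(axis=0)

# The forest's probabilities match the per-tree average, and each row sums to 1.
print(np.allclose(forest_proba, mean_tree_proba))
print(np.allclose(forest_proba.sum(axis=1), 1.0))

# The predicted class is the one with the highest averaged probability.
predicted_from_proba = classifier.classes_[forest_proba.argmax(axis=1)]
print(np.array_equal(predicted_from_proba, classifier.predict(X_test)))
```

A row such as `[0., 0.54, 0.46]` in the output above is therefore a flower on which the trees nearly split their vote between classes 1 and 2.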
 

# Measuring Performance
When we run our model, it is helpful to capture the output in a variable. We can then pass it on to our metrics, which the sklearn `metrics` module makes easy.

In [55]:
predictions = classifier.predict(X_test)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.0s finished


In [62]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:", metrics.accuracy_score(y_test, predictions))

print("F1:", metrics.f1_score(y_test, predictions, average="weighted"))

Accuracy: 0.9333333333333333
F1: 0.931135531135531


We can also look at two more exotic ways of measuring performance. A confusion matrix shows how often each class is mistaken for another. Sklearn can also estimate how important each feature (attribute) was for the classification, which is a great way to understand your model and to check for overdependence on a single feature.

In [56]:
pd.crosstab(y_test, predictions, rownames=['Actual Species'], colnames=['Predicted Species'])

Actual Species \ Predicted Species,0,1,2
0,15,0,0
1,0,18,0
2,0,3,9
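
The confusion table and the earlier scores describe the same predictions, and we can check that by hand. The sketch below copies the counts from the table, expands them back into label lists, and recomputes accuracy and weighted F1. The variable names are introduced here for illustration, and `metrics.confusion_matrix` is sklearn's own way to build the same table.

```python
import numpy as np
from sklearn import metrics

# Counts copied from the table above: rows are actual classes, columns predicted.
confusion = np.array([[15, 0, 0],
                      [0, 18, 0],
                      [0, 3, 9]])

# Expand the counts back into one (actual, predicted) pair per test flower.
y_true, y_pred = [], []
for actual in range(3):
    for predicted in range(3):
        count = int(confusion[actual, predicted])
        y_true += [actual] * count
        y_pred += [predicted] * count

# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy = np.trace(confusion) / confusion.sum()
weighted_f1 = metrics.f1_score(y_true, y_pred, average="weighted")
print("Accuracy:", accuracy)
print("F1:", weighted_f1)

# sklearn builds the same table directly from the label lists.
rebuilt = metrics.confusion_matrix(y_true, y_pred)
print(rebuilt)
```

The only mistakes are three class 2 flowers labelled as class 1, which is why accuracy is 42 out of 45.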


In [57]:
list(zip(X_train.columns, classifier.feature_importances_))

[('sepal length (cm)', 0.12649500140778575),
 ('sepal width (cm)', 0.02234380200311226),
 ('petal length (cm)', 0.38445390454459),
 ('petal width (cm)', 0.46670729204451206)]
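
For a quicker read, the importances can be put into a pandas Series and sorted. This is a minimal sketch that fits a fresh model on the full dataset, with `random_state=0` as an assumption for repeatability, so the exact values will differ from the run shown above.

```python
import pandas as pd
from sklearn import datasets, ensemble

iris_dataset = datasets.load_iris()

# random_state=0 is an assumption for repeatability; the values will differ
# from the run shown above, which used a random split and no fixed seed.
classifier = ensemble.RandomForestClassifier(n_estimators=100, random_state=0)
classifier.fit(iris_dataset.data, iris_dataset.target)

# Label each importance with its feature name and sort, largest first.
importances = pd.Series(
    classifier.feature_importances_, index=iris_dataset.feature_names
).sort_values(ascending=False)

print(importances)
print("Total:", importances.sum())
```

The importances are normalised to sum to 1, so each value can be read as that feature's share of the total.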