# Random Forests Solution
A Random Forest is an ensemble of decision trees: each tree is trained on a random sample of the data, and the forest combines the trees' predictions.

There are a few libraries we'll need - numpy and pandas for handling the data, and a few modules from scikit-learn:
* Datasets - has the Iris dataset for us to use, ready for consumption
* Ensemble - this holds the functionality for ensemble models, including Random Forests
* Model Selection - this helps us split our dataset into training and validation sets
* Metrics - this has the functionality for us to monitor our model and see how it behaves

In [58]:
import numpy as np
import pandas as pd
from sklearn import datasets, ensemble, model_selection, metrics

We'll need to load the data from the sklearn `datasets` module and put it into a dataframe for readability.

In [3]:
iris_dataset = datasets.load_iris()

iris_dataframe=pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)

iris_dataframe["label"] = pd.Categorical.from_codes(iris_dataset.target, iris_dataset.target_names)

In [15]:
iris_dataframe.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | label |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |


In [18]:
iris_dataframe.shape

(150, 5)

# Prepare training and test datasets
Let's look at the attributes - the "x" data - separate from the labels ("y").

This comes from the idea that any model is a simple function where 
$$answers = AI\ model(attributes)$$
or more simply
$$y = f(x)$$

We need to do a few things:
1. Create an X and Y set
2. Factorise Y - at the moment the labels are strings, which are hard to understand for the algorithm. We need to change them to integers (e.g. 0, 1, 2 for the three different species of iris flower)
3. Split the dataset into training and validation sets so we don't test the model on examples it has already seen

#### Hints
```python
pd.factorize(labels)[0]  # factorises classes into integers (note the pandas spelling: factorize)
model_selection.train_test_split  # returns x train, x test, y train and y test, in that order. Needs x, y and a decimal test size; hold out 30% (test_size=0.3)
```
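
One way to put the three steps and the hints together is sketched below. It rebuilds the dataframe from the earlier cells so that it runs on its own. The names `X`, `y` and `label_names`, and the fixed `random_state=0` (used only to make the split repeatable), are assumptions for illustration and not part of the original exercise.

```python
import pandas as pd
from sklearn import datasets, model_selection

# Rebuild the dataframe from the earlier cells so this sketch is self-contained
iris_dataset = datasets.load_iris()
iris_dataframe = pd.DataFrame(iris_dataset.data, columns=iris_dataset.feature_names)
iris_dataframe["label"] = pd.Categorical.from_codes(
    iris_dataset.target, iris_dataset.target_names
)

# 1. Separate the attributes (X) from the labels (y)
X = iris_dataframe.drop(columns="label")

# 2. Factorise the string labels into integer codes;
#    label_names keeps the mapping from code back to species name
y, label_names = pd.factorize(iris_dataframe["label"])

# 3. Hold back 30% of the rows for validation
#    (random_state=0 is an assumption, chosen only for repeatability)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```

With 150 rows and a 30% test size, 105 rows end up in the training set and 45 in the validation set.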

# Random Forests
Once the data is ready, we need to create a model to train. We are going to create an "instance" of sklearn's Random Forest class - this may be a little different to the basic Python you are used to so far. Effectively, you are creating a ready-made copy of a random forest model based on a blueprint that sklearn provides, complete with its own special functions.

The `RandomForestClassifier` can be told how many cores it is allowed to run on through the `n_jobs` argument. `n_jobs=-1` means use all available cores.

We then need to train the model on the training data, both x and y. In this case, the term for training is "fitting". Once it has been trained/fitted, we ask it to predict the answers for the validation dataset, given only x. Importantly, the model does not change itself when predicting; it just tries to label each example.

In [51]:
classifier = ensemble.RandomForestClassifier(n_jobs=-1, verbose=True)

#### Hints
```python
classifier.fit #takes X and y
classifier.predict # takes X
```
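
A minimal sketch of fitting and predicting is shown below. To stay self-contained it reloads the Iris data directly as arrays with `load_iris(return_X_y=True)` instead of reusing the dataframe, and it fixes `random_state=0` for repeatability; both choices are assumptions and not part of the original notebook.

```python
from sklearn import datasets, ensemble, model_selection

# Load the attributes and the integer labels directly as arrays
X, y = datasets.load_iris(return_X_y=True)

# Same 70/30 split as described earlier (random_state is an assumption)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Create the model instance; n_jobs=-1 uses all available cores
classifier = ensemble.RandomForestClassifier(n_jobs=-1, random_state=0)

# "Fitting" trains the forest on the training attributes and labels
classifier.fit(X_train, y_train)

# Predicting only needs the attributes; the fitted model is not modified
predictions = classifier.predict(X_test)

print(predictions[:10])
```

`predictions` holds one integer label per validation row, so it can be compared directly against `y_test`.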

# Measuring Performance
When we run our model, it is helpful to capture the output as a variable. We can then pass that on to our metrics, which are made easy by the sklearn metrics module.

#### Hints
```python
predictions = classifier.predict(...)  # keep the predictions in a variable for the metrics
metrics.accuracy_score(...) # takes validation y and the predictions
metrics.f1_score(...) # same again. also needs a specified average - use "weighted"
```

In [62]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, predictions))

print("F1:",metrics.f1_score(y_test, predictions, average="weighted"))

Accuracy: 0.9333333333333333
F1: 0.931135531135531


We can also look at two more exotic ways of measuring performance. A confusion matrix shows how often the classes are mistaken for each other. Sklearn can also estimate how important each feature or attribute was for the classification - a great way to understand your model and to look for overdependence on a single feature.

#### Hints
Check out pandas' `crosstab` function, and use `y_test` and `predictions` against each other.

Our classifier instance has a `.feature_importances_` attribute.
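
Both hints can be sketched as follows. The block retrains a small forest so that it runs on its own; the `random_state=0` values, the `"actual"`/`"predicted"` axis names and the variable names `confusion` and `importances` are assumptions made for illustration.

```python
import pandas as pd
from sklearn import datasets, ensemble, model_selection

# Retrain a forest so this sketch is self-contained
iris_dataset = datasets.load_iris()
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    iris_dataset.data, iris_dataset.target, test_size=0.3, random_state=0
)
classifier = ensemble.RandomForestClassifier(n_jobs=-1, random_state=0)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

# Confusion matrix: rows are the true classes, columns the predicted ones
confusion = pd.crosstab(
    y_test, predictions, rownames=["actual"], colnames=["predicted"]
)
print(confusion)

# Feature importances: one value per attribute, summing to 1
importances = pd.Series(
    classifier.feature_importances_, index=iris_dataset.feature_names
).sort_values(ascending=False)
print(importances)
```

Off-diagonal counts in the crosstab are the misclassified examples. An importance close to 1 for a single attribute would suggest that the model leans almost entirely on that one feature.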