# Getting started with `RandomForests`

The `scikit-learn` library provides simple and efficient tools for predictive data analysis through supervised and unsupervised machine learning algorithms. One of the core features of the library is its collection of classifiers and other estimators. The random forest classifier is one of these: it is powerful and easily accessible to the average user.

This series of notebooks guides you through an end-to-end walkthrough of applying random forests to real-world data, with an exercise at the end for you to try out. It is purely a practical guide and only lightly touches on the theory. Some basic knowledge of decision trees and machine learning is assumed.

For documentation on functions, or for theory on decision trees and random forests, you can check out these links:
- [https://scikit-learn.org/stable/modules/ensemble.html#forest](https://scikit-learn.org/stable/modules/ensemble.html#forest)
- [https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf](https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf)


# How random forests work

In the most basic terms, random forests work in four steps:
1. Several random samples are drawn, with replacement, from the given dataset.
2. A decision tree is constructed for each sample, and each tree makes a prediction.
3. A vote is performed across the trees' predictions.
4. The prediction with the most votes is picked as the final classification.
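
The four steps above can be sketched by hand with plain decision trees. This is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`), the number of trees is arbitrary, and the hard majority vote mirrors the description above rather than the exact internals of scikit-learn's own random forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, used purely for illustration
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a two-class vote can never tie
trees = []

for i in range(n_trees):
    # Step 1: draw a bootstrap sample (rows sampled with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit one decision tree on that sample
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Step 3: every tree votes on every row; shape is (n_trees, n_samples)
votes = np.array([tree.predict(X) for tree in trees])

# Step 4: the class with the most votes wins (the labels here are 0 and 1)
forest_prediction = (votes.mean(axis=0) > 0.5).astype(int)

print("Agreement with the true labels:", (forest_prediction == y).mean())
```

In practice you would not write this loop yourself; it is here only to make the steps concrete.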

The advantage of random forests is that they are much less prone to the overfitting that a single decision tree may suffer from, because bootstrap aggregation ('bagging') trains many decision trees and combines their predictions. The disadvantage is that training and prediction can be time-consuming, because of the large number of trees and predictions involved.
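
One way to see this trade-off is to compare training and test accuracy for a single tree and for a forest. The sketch below assumes synthetic, deliberately noisy data (`flip_y=0.1`) and default model settings; the exact scores will differ on real data, so treat it as a pattern to try rather than a benchmark.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise, used purely for illustration
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # A large gap between training and test accuracy suggests overfitting
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train={scores[name][0]:.3f}, test={scores[name][1]:.3f}")
```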

A final and very important capability of random forests is ranking features by importance. In scikit-learn, a fitted random forest exposes a `feature_importances_` attribute that holds the importance of each variable in the model. This is useful because it helps you decide which features can be excluded to improve the performance of the random forest classifier you are trying to build.
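
As a minimal sketch, the importances can be read from a fitted `RandomForestClassifier` through its `feature_importances_` attribute. The data is synthetic and the feature names (`feature_0`, `feature_1`, ...) are hypothetical placeholders, not names from any dataset used later in this series.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the first 3 columns are the informative ones
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# One importance value per feature; the values are normalised to sum to 1
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

Features that end up near the bottom of this ranking are candidates for removal, although it is worth re-evaluating the model after dropping them.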

# Importing the necessary packages
The scikit-learn library is not part of the Python standard library, so we will first learn how to install it into our Python environment, and then load any supporting packages that may be required.

First, if you are using a Jupyter notebook with a pip environment, go to your terminal and run `pip install -U scikit-learn`. The output should look like the following:

![Pip_install.png](attachment:50e146ee-b883-4634-b36d-bc2be6938458.png)
We can check that the installation worked from the terminal as well. Running `python -m pip show scikit-learn` shows which version is currently installed. If you do not get something like this:

![Pip_confirm.png](attachment:05b6b876-822b-4cd9-beb9-b2a3af07144d.png)
you may need to try the installation again.
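
The same check can also be done from inside the notebook. This small sketch assumes the notebook kernel uses the same Python environment that pip installed into; if the import fails here but `pip show` succeeded, the two are probably different environments.

```python
import sklearn

# The installed version is exposed as a plain string
installed_version = sklearn.__version__
print("scikit-learn version:", installed_version)
```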

## Importing the library
Once we have installed the library we can load some of its functions into our notebook.
The first one we can try is the `train_test_split` function. Run the code:
```python 
from sklearn.model_selection import train_test_split
```

In [1]:
#try it here
from sklearn.model_selection import train_test_split
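
To give a feel for what `train_test_split` does once imported, here is a small sketch on a made-up array. The values, the 25% test fraction and the `random_state` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 20 made-up rows with 2 features each, plus a made-up label per row
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# Hold back 25% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

print("Training rows:", len(X_train))
print("Testing rows:", len(X_test))
```

The features and labels are shuffled together, so each row of `X_train` still lines up with its label in `y_train`.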

## Pandas
The pandas module is a Python library that offers data structures and operations for manipulating numerical tables and time series. We need it because it is the easiest way to store and manipulate the real-world data that our random forests will use.
To load this package run the cell:
```python 
import pandas as pd
```

We can create a simple pandas object by creating a `DataFrame` and then assigning column labels and row names:

```python
df = pd.DataFrame(...)
df.columns=['Label_1', 'Label_2',...,'Label_N']
df.index = ['Row_1','Row_2',...,'Row_M']
```
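
As a concrete instance of the template above, the following sketch builds a three-row table. The numbers and the column names (`Length`, `Width`, `Label`) are made up for illustration only.

```python
import pandas as pd

# Made-up measurements: two numeric features and a class label per row
df = pd.DataFrame([[5.1, 3.5, 0],
                   [4.9, 3.0, 0],
                   [6.3, 3.3, 1]])
df.columns = ["Length", "Width", "Label"]
df.index = ["Row_1", "Row_2", "Row_3"]

print(df)

# Rows and columns can then be looked up by name
print(df.loc["Row_2", "Width"])
```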


In [2]:
# Try it here
import pandas as pd

## Check our imports
To check that we have imported the correct libraries and functions, we can run the test code:

In [3]:
from test_r import rftest_1 as tt
tt.test_imports()

## Roadmap for Random Forests

Before we get started with random forests it is important that we set up a simple workflow to keep us on track. You can follow this roadmap when you are tackling your own research after this tutorial.

1. State your research question and work out what data you will need to answer it.
    - This is often an iterative process: you may realise you need more data as your project goes on (so revisit often!)
2. Acquire the data, pre-process it and get it into a usable format.
3. Prepare the data and remove anomalies or missing data.
4. Prepare a base model with random parameter selection.
    - This is the model you will compare against after tuning.
5. Subset the data into training and testing samples.
6. Train the model using the training data.
7. Use the trained model to make predictions on the test samples.
8. Evaluate model performance.
    - Adjust and retune, or find new data that can solve the research question.
9. Interpret the results and present them in a friendly way.
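
Steps 4 to 8 of the roadmap can be sketched in a few lines. This is a minimal outline under stated assumptions: synthetic data stands in for the acquired and cleaned data of steps 1 to 3, the base model simply uses scikit-learn's default parameters, and accuracy is only one of several possible evaluation metrics.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for steps 1-3: synthetic data instead of real, cleaned data
X, y = make_classification(n_samples=500, n_features=10, n_informative=4,
                           random_state=0)

# Step 5: subset the data into training and testing samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Step 4: a base model (default parameters here) to compare against later
baseline = RandomForestClassifier(random_state=0)

# Step 6: train the model using the training data
baseline.fit(X_train, y_train)

# Step 7: use the trained model to make predictions on the test samples
y_pred = baseline.predict(X_test)

# Step 8: evaluate model performance
baseline_accuracy = accuracy_score(y_test, y_pred)
print(f"Baseline test accuracy: {baseline_accuracy:.3f}")
```

Any tuned model you build afterwards can be scored on the same test split and compared against `baseline_accuracy`.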
    