# Modeling - Random Forest



What is it?
- a machine learning algorithm that combines many decision trees, used here for predicting categorical target variables
- Pipeline: Plan - Acquire - Prepare - Explore - **Model** - Deliver

Why do we care?
- we can predict the target variable for future observations based on the model we build!

How does it work?
- [slides we already saw](https://docs.google.com/presentation/d/14alN-7mOuKKUEjbPxdUDRWXI3cfI51_T/edit?usp=sharing&ouid=110448495992573862737&rtpof=true&sd=true)

How do we use it?
- acquire, prepare, explore our data
- split data for modeling
- build models on train
    - create rules based on our input data
- evaluate models on train & validate
    - see how our rules work on unseen data
- pick the best model, and evaluate only that model on test
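
The steps above can be sketched end to end. This is a minimal sketch, not the lesson's own code: it assumes scikit-learn's bundled iris data in place of the `acquire` and `prepare` modules, a 60/20/20 split, and three arbitrary `max_depth` values as the candidate models.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in data (assumption): scikit-learn's bundled iris instead of acquire/prepare
X, y = load_iris(return_X_y=True, as_frame=True)

# split data for modeling: 60% train, 20% validate, 20% test (assumed proportions)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=42)
X_validate, X_test, y_validate, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

# build models on train only
models = {
    depth: RandomForestClassifier(max_depth=depth, random_state=42).fit(X_train, y_train)
    for depth in (1, 3, None)
}

# evaluate models on train & validate
for depth, model in models.items():
    print(depth, model.score(X_train, y_train), model.score(X_validate, y_validate))

# pick the best model on validate, then evaluate that one model on test
best_depth = max(models, key=lambda d: models[d].score(X_validate, y_validate))
test_accuracy = models[best_depth].score(X_test, y_test)
print('best max_depth:', best_depth, 'test accuracy:', test_accuracy)
```

Test is only touched once, at the very end, so it stays a fair estimate of how the chosen model handles unseen data.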

## Show us!

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import classification_report, confusion_matrix

import acquire
import prepare

## Example - Iris Dataset

See it in the data science pipeline!

### Acquire

In [None]:
#get my iris data
df = acquire.get_iris_data()

In [None]:
#look at it
df.head()

### Prepare

In [None]:
#clean my data
df = prepare.prep_iris(df)
df.head()

In [None]:
#split my data
train, validate, test = prepare.my_train_test_split(df, 'species')
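
The body of `prepare.my_train_test_split` isn't shown in this notebook. A minimal sketch of what a helper like it might do, assuming a 60/20/20 split stratified on the target; the name `three_way_split` and the demo frame are made up for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, target, seed=42):
    """Hypothetical stand-in for prepare.my_train_test_split.

    Splits df into 60% train, 20% validate, 20% test, stratified on target.
    """
    train, rest = train_test_split(
        df, train_size=0.6, stratify=df[target], random_state=seed)
    validate, test = train_test_split(
        rest, test_size=0.5, stratify=rest[target], random_state=seed)
    return train, validate, test


# tiny made-up frame just to show the resulting shapes
demo = pd.DataFrame({'x': range(20), 'species': ['a', 'b'] * 10})
demo_train, demo_validate, demo_test = three_way_split(demo, 'species')
print(demo_train.shape, demo_validate.shape, demo_test.shape)
```

Stratifying on the target keeps the class proportions the same in all three pieces.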

In [None]:
#look at my train
train.head()

### Explore

ONLY USING TRAIN!

completed the following steps on my features and target variable
1. hypothesize
2. visualize
3. analyze
4. summarize

these steps aren't written out here; however, I found that petal width and petal length identified species the most

### Model

Before we put anything into our machine learning model, we will want to establish a baseline prediction

#### Baseline

In [None]:
train.head()

In [None]:
#find most common species
train.species.value_counts()

Since no class is more common than the others, it doesn't matter which one we choose as our baseline.

In [None]:
#calculate baseline accuracy
baseline_accuracy = 30 / (30+30+30)
baseline_accuracy
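
The same baseline can be computed instead of typed in by hand. A minimal sketch, assuming a made-up target column with the same 30/30/30 balance as the counts above:

```python
import pandas as pd

# made-up target column with 30 rows per class (assumption, mirrors the counts above)
species = pd.Series(
    ['setosa'] * 30 + ['versicolor'] * 30 + ['virginica'] * 30, name='species')

# baseline prediction: the most common class (on a tie, mode() lists every tied value)
baseline_prediction = species.mode()[0]

# baseline accuracy: share of rows that match the baseline prediction
baseline_accuracy = (species == baseline_prediction).mean()
print(baseline_prediction, round(baseline_accuracy, 3))
```

Computing it this way keeps the baseline correct if the split sizes ever change.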

#### 0. split into features and target variable
- need to do this on my train, validate, and test dataframe
- will end up with the following variables:
    - X_train, X_validate, X_test: all the features we plan to put into our model
    - y_train, y_validate, y_test: the target variable

In [None]:
#look at train
train.head()

For my first iteration, I'm going to send all possible features into my model

In [None]:
#set all my features as my X_train (assumes the target is the last column)
X_train = train.iloc[:,:-1]
X_train.head()

In [None]:
#repeat for validate and test
X_validate = validate.iloc[:,:-1]
X_test = test.iloc[:,:-1]

In [None]:
#set target
target = 'species'

In [None]:
#notice I'm sending in a single column name
y_train = train[target]
y_train.head()

In [None]:
#repeat for validate and test
y_validate = validate[target]
y_test = test[target]

Note: our X variables are dataframes, our y variables are series
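
Slicing with `iloc[:,:-1]` only works when the target is the last column. A sketch of an alternative that selects by name instead; `split_xy` and the demo frame are hypothetical, not part of the lesson's modules.

```python
import pandas as pd


def split_xy(df, target):
    """Return (X, y): every column except target, and the target column itself."""
    return df.drop(columns=target), df[target]


# made-up two-row frame for illustration
demo = pd.DataFrame({
    'petal_length': [1.4, 4.7],
    'petal_width': [0.2, 1.4],
    'species': ['setosa', 'versicolor'],
})
X_demo, y_demo = split_xy(demo, 'species')
print(type(X_demo).__name__, type(y_demo).__name__)
```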

#### 1. make the object

In [None]:
#new import!
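
A minimal sketch of this step, assuming scikit-learn's `RandomForestClassifier` is the new import; the variable name `rf` is made up for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# make the object with default hyperparameters; nothing is learned yet
rf = RandomForestClassifier()

# peek at a few of the defaults
print(rf.n_estimators, rf.max_depth, rf.bootstrap)
```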


#### 2. fit the object

In [None]:
#building our model on our train values
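
A sketch of fitting, assuming scikit-learn's bundled iris data and a 60% stratified train split as stand-ins for the `X_train` and `y_train` built above; `rf` is a made-up name.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in data (assumption): bundled iris, 60% stratified train split
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=42)

# fit on train only: this is where the trees get built
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)

# after fitting, the forest holds one fitted tree per estimator
print(len(rf.estimators_), rf.classes_)
```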


#### 3. use the object

In [None]:
#score on my train data


#### how does our model work on unseen data?

In [None]:
#score on validate
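
A sketch of scoring on train and validate side by side, again assuming the bundled iris data and a 60/20/20 stratified split in place of the lesson's own dataframes:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in data and split (assumptions)
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=42)
X_validate, X_test, y_validate, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# .score returns accuracy for classifiers
train_accuracy = rf.score(X_train, y_train)
validate_accuracy = rf.score(X_validate, y_validate)
print(f'train: {train_accuracy:.2f}, validate: {validate_accuracy:.2f}')
```

A large drop from train to validate suggests the model is overfit to train.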


#### feature importance
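
A fitted forest exposes `feature_importances_`, one value per feature, which sum to 1. A minimal sketch, assuming the bundled iris data as a stand-in (its column names differ from the prepared dataframe's):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in data and split (assumptions)
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=42)

rf = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# pair each importance with its column name and sort, biggest first
importances = pd.Series(
    rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
print(importances)
```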

### change a hyperparameter

random forest hyperparameters
- n_estimators: The number of trees in the forest (default=100)
- bootstrap: whether bootstrap samples are used when building trees (default=True)
- random_state: controls the randomness of the bootstrap samples and of the features considered at each split (default=None)

seen before
- criterion (default="gini")
- max_depth (default=None)
- min_samples_split (default=2)
- min_samples_leaf (default=1)
- max_leaf_nodes (default=None)

#### 1. create the object

this is when you set your hyperparameter

In [None]:
#set max depth & random_state
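
A sketch of this cell. The name `trees1` matches the one the later evaluation cells use; the values `max_depth=3` and `random_state=123` are assumptions picked for illustration.

```python
from sklearn.ensemble import RandomForestClassifier

# hyperparameters are set when the object is created, before any fitting
# max_depth=3 and random_state=123 are assumed values
trees1 = RandomForestClassifier(max_depth=3, random_state=123)

params = trees1.get_params()
print(params['max_depth'], params['random_state'])
```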


#### 2. fit the object

In [None]:
#still using train data


#### 3. use the object

In [None]:
#evaluate on train
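
A sketch of fitting and evaluating the depth-limited forest on train, assuming the bundled iris data and the same illustrative hyperparameter values (`max_depth=3`, `random_state=123`):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in data and split (assumptions)
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=42)

trees1 = RandomForestClassifier(max_depth=3, random_state=123)
trees1.fit(X_train, y_train)

# no tree in the forest is allowed to grow deeper than max_depth
deepest_tree = max(tree.get_depth() for tree in trees1.estimators_)
train_accuracy = trees1.score(X_train, y_train)
print(deepest_tree, round(train_accuracy, 2))
```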


In [None]:
#see predictions


In [None]:
#see probability of predictions 
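
A sketch of both cells together, assuming the bundled iris data and the illustrative `trees1` settings. `predict` returns one class per row; `predict_proba` returns one probability per class per row.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in data and split (assumptions)
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=42)

trees1 = RandomForestClassifier(max_depth=3, random_state=123).fit(X_train, y_train)

# see predictions
y_pred = trees1.predict(X_train)
print(y_pred[:5])

# see probability of predictions: columns follow the order of trees1.classes_
y_proba = trees1.predict_proba(X_train)
print(y_proba[:5].round(2))

# the predicted class is the one with the highest probability
most_likely = trees1.classes_[np.argmax(y_proba, axis=1)]
print((most_likely == y_pred).all())
```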


#### more evaluation

In [None]:
#y_pred
y_pred = trees1.predict(X_train)
y_pred[:5]

In [None]:
#generate confusion matrix!
conf = confusion_matrix(y_train, y_pred)
conf

In [None]:
#find labels in our dataset & sort
labels = sorted(y_train.unique())
labels

In [None]:
#make pretty with df
pd.DataFrame(conf,
            index=[label + '_actual' for label in labels],
            columns=[label + '_predict' for label in labels])

In [None]:
#generate classification report
print(classification_report(y_train, y_pred))

#### evaluate on unseen data

In [None]:
#score our validate
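
A sketch of scoring on validate for several candidate models at once, so train and validate accuracy can be compared in one table. It assumes the bundled iris data, a 60/20/20 stratified split, and an arbitrary list of `max_depth` values.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# stand-in data and split (assumptions)
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.6, stratify=y, random_state=42)
X_validate, X_test, y_validate, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)

rows = []
for depth in (1, 2, 3, 5, None):
    model = RandomForestClassifier(max_depth=depth, random_state=123)
    model.fit(X_train, y_train)
    rows.append({
        'max_depth': str(depth),
        'train': model.score(X_train, y_train),
        'validate': model.score(X_validate, y_validate),
    })

# a big train-minus-validate difference points to overfitting
scores = pd.DataFrame(rows)
scores['difference'] = scores['train'] - scores['validate']
print(scores)
```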
