In this notebook we will train a [Random Forest model](https://en.wikipedia.org/wiki/Random_forest) to distinguish between feature vectors corresponding to legitimate transactions and fraudulent transactions. 


Random Forest models are made up of multiple decision trees. Each decision tree can be thought of as a real tree, with the data flowing upwards from the stump to the leaves: 
- Initially, all of the data sits in the stump at the base of the tree.
- Every time the tree 'splits' (i.e. the data reaches the next branch up), each data point must either move onto the new branch or carry on along its existing path. The decision on which route to take is made by an 'if-else' statement. For example: "if the third component of your feature vector is less than 7, take the branch on the left; else take the branch on the right."
- Data continues to flow through the tree, choosing its path based on these 'if-else' statements, until all of the data is sitting at the tree's leaves.
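
The 'if-else' picture above can be sketched in plain Python. This is only an illustration: the three trees, their thresholds, and the sample vector are made up, and a trained model learns its splits from the data. The trees are combined here by simple majority vote, which is one common way to describe a forest; scikit-learn's `RandomForestClassifier` averages the per-tree class probabilities instead.

```python
from collections import Counter

# Three hand-written "trees". Every threshold here is hypothetical.
def tree_a(x):
    # Split on the third component (index 2), as in the example above.
    if x[2] < 7:
        # Left branch: split again on the first component.
        if x[0] < 0.5:
            return "legitimate"
        else:
            return "fraud"
    else:
        # Right branch is already a leaf.
        return "fraud"

def tree_b(x):
    if x[1] < 100:
        return "legitimate"
    else:
        return "fraud"

def tree_c(x):
    if x[2] < 3:
        return "legitimate"
    elif x[0] < 0.2:
        return "legitimate"
    else:
        return "fraud"

def forest_predict(trees, x):
    """Return the label that most trees vote for."""
    votes = Counter(tree(x) for tree in trees)
    return votes.most_common(1)[0][0]

trees = [tree_a, tree_b, tree_c]
sample = [0.1, 250.0, 5.0]  # hypothetical feature vector

print([tree(sample) for tree in trees])  # one label per tree
print(forest_predict(trees, sample))     # the forest's combined label
```

Each tree looks at different components, so the trees can disagree on a single point; combining them is what makes the forest more robust than any one tree.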
 
 
Training the random forest 


In [None]:
import numpy as np
import pandas as pd
df = pd.read_parquet("fraud-cleaned-sample.parquet")

# Train/test split

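The data is a time series, so we split on time rather than at random: transactions in the first 70% of the time range go to the training set and the rest go to the test set.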

In [None]:
first = df['timestamp'].min()
last = df['timestamp'].max()
cutoff = first + ((last - first) * 0.7)

df = df.copy()

train = df[df['timestamp'] <= cutoff]
test = df[df['timestamp'] > cutoff]
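
One thing to keep in mind with this kind of split is that the cutoff sits at 70% of the *time range*, which need not correspond to 70% of the rows. The plain-Python sketch below uses made-up timestamps (not the fraud data) and the same cutoff formula as the cell above to show how the two can differ when activity is unevenly spread over time.

```python
# Hypothetical timestamps: most of the activity happens late in the window.
timestamps = [0, 10, 20, 60, 80, 85, 90, 95, 98, 100]

first = min(timestamps)
last = max(timestamps)
cutoff = first + (last - first) * 0.7  # same formula as the cell above

train_ts = [t for t in timestamps if t <= cutoff]
test_ts = [t for t in timestamps if t > cutoff]

# Fraction of rows (not of time) that ended up in the training set.
train_fraction = len(train_ts) / len(timestamps)
print(len(train_ts), len(test_ts), train_fraction)
```

Here only 4 of the 10 points fall before the cutoff. On the real data, comparing `len(train)` and `len(test)` is a quick way to confirm that the split is reasonable.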

In [None]:
import cloudpickle as cp

# Load the saved feature-extraction pipeline; the context manager closes the file
with open('feature_pipeline.sav', 'rb') as f:
    feature_pipeline = cp.load(f)

# Train the model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection

rfc = RandomForestClassifier(n_estimators=16, max_depth=8, random_state=404, class_weight="balanced_subsample")

svecs = feature_pipeline.fit_transform(train)
rfc.fit(svecs, train["label"])


In [None]:
from sklearn.metrics import classification_report

# Use transform, not fit_transform: the pipeline must not be refit on the test data
predictions = rfc.predict(feature_pipeline.transform(test))
print(classification_report(test.label.values, predictions))


In [None]:
from mlworkflows import plot
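# Use a separate name so the confusion-matrix table does not overwrite the transactions in df
cm_df, chart = plot.binary_confusion_matrix(test["label"], predictions)
chart

In [None]:
cm_df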

# Save the trained model as a pipeline stage

In [None]:
from mlworkflows import util
util.serialize_to(rfc, "rfc.sav")