In [2]:
import os
os.chdir('../')

In [5]:
from data_pipeline import ETL_Pipeline
from dataset import Fraud_Dataset
from metrics import Metrics 
import sklearn
print(sklearn.__version__)

0.24.2


In [3]:
pipeline = ETL_Pipeline()
metrics = Metrics()

raw_data = pipeline.extract('transactions.csv')

transformed_data = pipeline.transform(raw_data)

pipeline.load(transformed_data, 'transformed_transactions.csv')

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import pandas as pd

fraud_dataset = Fraud_Dataset(transformed_data)

model = LogisticRegression(max_iter=1000)

for fold, (train_data, validation_data) in enumerate(fraud_dataset.get_kfold_datasets()):
    X_train, y_train = train_data.drop('is_fraud', axis=1), train_data['is_fraud']
    X_val, y_val = validation_data.drop('is_fraud', axis=1), validation_data['is_fraud']
    
    model.fit(X_train, y_train)
    
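    y_pred = model.predict(X_val)
    # Column 1 is the probability of the second class in model.classes_
    # (the fraud class, assuming is_fraud is encoded as 0/1)
    y_pred_proba = model.predict_proba(X_val)[:, 1]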

    metrics = Metrics()
    metrics.generate_report(y_val, y_pred, y_pred_proba, 'analysis/')

NameError: name 'transformed_data' is not defined

In [4]:
from sklearn.ensemble import RandomForestClassifier

fraud_dataset = Fraud_Dataset(transformed_data)

random_forest_model = RandomForestClassifier(n_estimators=100, random_state=42)

for fold, (train_data, validation_data) in enumerate(fraud_dataset.get_kfold_datasets()):
    X_train, y_train = train_data.drop('is_fraud', axis=1), train_data['is_fraud']
    X_val, y_val = validation_data.drop('is_fraud', axis=1), validation_data['is_fraud']
    
    random_forest_model.fit(X_train, y_train)
    
    y_pred_rf = random_forest_model.predict(X_val)
    y_pred_proba_rf = random_forest_model.predict_proba(X_val)[:, 1]
    
    metrics.generate_report(y_val, y_pred_rf, y_pred_proba_rf, 'analysis/results')

Report successfully written to analysis/results/metrics_report.txt
Report successfully written to analysis/results/metrics_report.txt
Report successfully written to analysis/results/metrics_report.txt
Report successfully written to analysis/results/metrics_report.txt
Report successfully written to analysis/results/metrics_report.txt


In [5]:
from sklearn.ensemble import GradientBoostingClassifier

fraud_dataset = Fraud_Dataset(transformed_data)

gradient_boosting_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

for fold, (train_data, validation_data) in enumerate(fraud_dataset.get_kfold_datasets()):
    X_train, y_train = train_data.drop('is_fraud', axis=1), train_data['is_fraud']
    X_val, y_val = validation_data.drop('is_fraud', axis=1), validation_data['is_fraud']
    
    gradient_boosting_model.fit(X_train, y_train)
    
    y_pred_gb = gradient_boosting_model.predict(X_val)
    y_pred_proba_gb = gradient_boosting_model.predict_proba(X_val)[:, 1]
    
    metrics.generate_report(y_val, y_pred_gb, y_pred_proba_gb, 'analysis/results')

Report successfully written to analysis/results/metrics_report.txt
Report successfully written to analysis/results/metrics_report.txt
Report successfully written to analysis/results/metrics_report.txt
Report successfully written to analysis/results/metrics_report.txt
Report successfully written to analysis/results/metrics_report.txt


## Conclusions

I tested three models on my data: logistic regression, a Random Forest Classifier, and a Gradient Boosting Classifier. Logistic regression is more or less my "staple" model type and gives me a good baseline. I chose the other two because they both handle skewed datasets quite well. For each model, 5 metrics reports were generated, one per k-fold, and I kept the ones I judged to be the best performing. Most metrics reports were fairly consistent across folds.
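The cell outputs above all name the same file, `analysis/results/metrics_report.txt`, which suggests each fold's report may replace the previous one. A minimal sketch of one way to keep every fold's report, assuming `generate_report` accepts any output directory (as its use in the cells suggests). The helper `report_dir` and the `model_name/fold_N` layout are hypothetical, not part of the original code:

```python
import tempfile
from pathlib import Path


def report_dir(base_dir, model_name, fold):
    """Build (and create) a separate output directory per model and fold.

    Hypothetical helper: the directory layout is an assumption for illustration.
    """
    path = Path(base_dir) / model_name / f"fold_{fold}"
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demonstrate inside a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    dirs = [report_dir(tmp, "random_forest", fold) for fold in range(5)]
    names = [d.name for d in dirs]
    all_exist = all(d.is_dir() for d in dirs)

print(names)

# Inside a training loop this might be used as (assumed call shape):
# metrics.generate_report(y_val, y_pred, y_pred_proba,
#                         str(report_dir('analysis/results', 'random_forest', fold)))
```

With one directory per fold, all five reports for a model can be compared side by side rather than only the last one written.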

### Logistic Regression

This model performed horribly. At first I assumed it had performed well, since the accuracy in earlier testing was 99%. The metrics report `LogReg_metrics_report.txt` quickly proved that wrong: Precision, Recall, and Sensitivity were all at 0%. More than likely the model labeled every transaction as not fraud, which scores high accuracy on a skewed dataset but is not what we want.
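To illustrate why high accuracy can coexist with 0% Precision and Recall, here is a small sketch on synthetic labels. The 1% fraud rate is an assumption chosen for illustration, not the actual class balance of `transactions.csv`, and the helper names are hypothetical:

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def basic_metrics(y_true, y_pred):
    """Compute accuracy, precision, and recall, treating 0/0 as 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Synthetic data: 10 fraud cases among 1000 transactions (assumed 1% fraud rate).
y_true = [1] * 10 + [0] * 990
# A model that predicts "not fraud" for every transaction.
y_pred_all_negative = [0] * 1000

result = basic_metrics(y_true, y_pred_all_negative)
print(result)
```

The all-negative predictor is right on every legitimate transaction, so accuracy is high, but it never flags a fraud case, so precision and recall are both zero.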

### Random Forest Classifier

This model performed the best of the three on the metrics I gave it. Since this model handles skewed datasets well, that is not too much of a surprise. Unfortunately, Recall and Sensitivity were at 74.11%, which isn't much better than the initial baseline model given to us, but Precision was at 95.74%, which is far ahead of the original model.

### Gradient Boosting Classifier

This model also performed quite well. However, its metrics were inconsistent across folds, with Recall as low as 50% and as high as 71%. While this model still performs alright, overall it does not do much better than the baseline model, except in Precision, where it received a score of 89.90%.

### Which Model Did I Choose?

I decided to go with the Random Forest Classifier, as it performed the best of the three models while also being the most consistent. The Gradient Boosting Classifier varied too much between k-folds for me to feel comfortable using it.
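Consistency between k-folds can also be put into numbers rather than judged by eye. The sketch below summarizes per-fold Recall by its mean, standard deviation, and range. The per-fold values and the model labels are placeholders for illustration only, not the results from the reports above, and `fold_spread` is a hypothetical helper:

```python
from statistics import mean, pstdev


def fold_spread(scores):
    """Summarize how much a metric varies across folds."""
    return {
        "mean": mean(scores),
        "std": pstdev(scores),  # population std over the folds
        "range": max(scores) - min(scores),
    }


# Placeholder per-fold recall values (NOT measured results).
example_recall = {
    "steady_model": [0.74, 0.73, 0.75, 0.74, 0.74],
    "variable_model": [0.50, 0.71, 0.62, 0.55, 0.68],
}

spread = {name: fold_spread(scores) for name, scores in example_recall.items()}
most_consistent = min(spread, key=lambda name: spread[name]["std"])

for name, stats in spread.items():
    print(name, {k: round(v, 3) for k, v in stats.items()})
print("most consistent:", most_consistent)
```

A model with a similar mean but a much smaller standard deviation or range across folds is the safer choice, which is the reasoning behind preferring the Random Forest Classifier here.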