## Big Data: Checkpoint 5

In the following tasks, you'll continue working with the Credit Card Fraud Detection dataset from Kaggle. Before moving on to the tasks, load the dataset using Dask.

Please submit your solutions as a link to your Jupyter notebook on GitHub.

In [1]:
#!pip install dask_ml --quiet

In [2]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate, GridSearchCV
from sklearn.metrics import roc_auc_score
import joblib
from dask.distributed import Client, progress
from dask_ml.model_selection import train_test_split
import dask.dataframe as dd
import pandas as pd
import warnings

warnings.filterwarnings("ignore")

In [3]:
client = Client(n_workers=4, threads_per_worker=2, memory_limit='2GB')
client

Client
  Scheduler: tcp://127.0.0.1:64121
  Dashboard: http://127.0.0.1:8787/status
Cluster
  Workers: 4
  Cores: 8
  Memory: 8.00 GB


In [4]:
path = 'data/creditcard.csv'
df = dd.read_csv(path, dtype={'Time':'float'})

## 1. In this task, you'll train several machine learning models from scikit-learn using Dask as the backend of joblib. This time, use all the variables except `Class` as your feature set. The `Class` variable will be your target variable.

In [5]:
x = df.drop('Class', axis=1)
y = df.Class
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=1234)

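# The dataset is small enough to hold in cluster memory.
# persist() returns a new collection, so the results must be reassigned.
X_train, X_test = X_train.persist(), X_test.persist()
y_train, y_test = y_train.persist(), y_test.persist()
y_test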

Dask Series Structure:
npartitions=3
    int64
      ...
      ...
      ...
Name: Class, dtype: int64
Dask Name: split, 3 tasks

In [6]:
lr = LogisticRegression()

# scikit-learn needs in-memory data, so the Dask collections are computed first
with joblib.parallel_backend('dask'):
    lr.fit(X_train.compute(), y_train.compute())

preds_train = lr.predict(X_train.values.compute())
preds_test = lr.predict(X_test.values.compute())

# Caution: roc_auc_score's signature is (y_true, y_score). Here the hard class
# predictions are passed first, so the value is not the conventional ROC AUC.
print('LogReg training score is {}'.format(roc_auc_score(preds_train, y_train.values.compute())))
print('LogReg testing score is {}'.format(roc_auc_score(preds_test, y_test.values.compute())))

LogReg training score is 0.8599285843693881
LogReg testing score is 0.8190234345359043
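
For reference, `roc_auc_score` takes the true labels first and a score second, and it is usually fed the predicted probability of the positive class rather than hard labels. The following is a minimal sketch of that call pattern, not a rerun of the cell above: it uses a synthetic, imbalanced dataset from `make_classification` as a stand-in for the fraud data, and the names `X_tr`, `X_te`, `auc_proba`, and `auc_labels` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the fraud data (an assumption for illustration)
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Column 1 of predict_proba is the probability of the positive class (label 1)
proba_te = model.predict_proba(X_te)[:, 1]
auc_proba = roc_auc_score(y_te, proba_te)  # true labels first, scores second

# The same metric computed from hard 0/1 predictions, with the true labels first
auc_labels = roc_auc_score(y_te, model.predict(X_te))

print('AUC from probabilities: {}'.format(auc_proba))
print('AUC from hard labels:   {}'.format(auc_labels))
```

Scores computed this way are not directly comparable to the numbers recorded in this notebook, which were produced with the predictions in the first argument.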


In [8]:
gbc = GradientBoostingClassifier()

# GradientBoostingClassifier builds its trees sequentially and has no n_jobs
# parameter, so the joblib backend likely has little effect on this fit.
with joblib.parallel_backend('dask'):
    gbc.fit(X_train.compute(), y_train.compute())

preds_train = gbc.predict(X_train.values.compute())
preds_test = gbc.predict(X_test.values.compute())

# Note: predictions are passed as the first argument (the y_true slot of roc_auc_score)
print('GradBoostTree training score is {}'.format(roc_auc_score(preds_train, y_train.values.compute())))
print('GradBoostTree testing score is {}'.format(roc_auc_score(preds_test, y_test.values.compute())))

GradBoostTree training score is 0.9550773417579038
GradBoostTree testing score is 0.8648340797046601


In [10]:
rfc = RandomForestClassifier()

# The forest's trees are fit through joblib, so they can be dispatched to the Dask workers
with joblib.parallel_backend('dask'):
    rfc.fit(X_train.compute(), y_train.compute())

preds_train = rfc.predict(X_train.values.compute())
preds_test = rfc.predict(X_test.values.compute())

# Note: predictions are passed as the first argument (the y_true slot of roc_auc_score)
print("RanForest training score is: {}".format(roc_auc_score(preds_train, y_train.values.compute())))
print("RanForest testing score is: {}".format(roc_auc_score(preds_test, y_test.values.compute())))

RanForest training score is: 0.9999977988598094
RanForest testing score is: 0.9680528489814954


## 2. Compare the results of your models.

The Random Forest model outperformed the other models on both the training and test scores (test score of about 0.968, versus 0.865 for gradient boosting and 0.819 for logistic regression). Its near-perfect training score against a lower test score suggests some overfitting. None of the models were tuned, and tuning could change this ranking. Tuning can be performed with grid search cross-validation; the best hyperparameters found for each algorithm can then be applied before comparing the models and selecting one to move forward with.
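
The tuning step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the notebook's method: the data is synthetic (`make_classification`), the parameter grid is hypothetical, and the `'threading'` joblib backend is used so the block runs without a cluster. With a running `dask.distributed` `Client`, `'dask'` can be passed to `joblib.parallel_backend` instead, as in the cells above.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the fraud data (an assumption for illustration)
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Hypothetical grid; real candidate values depend on the data and time budget
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # ranks candidates by cross-validated ROC AUC
    cv=3,
    n_jobs=-1,
)

# Swap "threading" for "dask" when a dask.distributed Client is running
with joblib.parallel_backend("threading"):
    search.fit(X, y)

print("Best parameters: {}".format(search.best_params_))
print("Best cross-validated ROC AUC: {}".format(search.best_score_))
```

Because `scoring="roc_auc"` uses the estimator's probability output internally, the scores from this search are on the conventional ROC AUC scale, and `search.best_estimator_` is refit on the full data passed to `fit`.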