## 3. ML

In [344]:
# Getting model evalution statistics
model_evalution_df = pd.DataFrame(columns=['Model Name', 'Training Score', 'Testing Score', 'Accuracy', 'F1 Score', 'Precision', 'Recall'])

def add_model_evalution_stat(model_name, model, X_train, X_test, y_train, y_test, y_train_pred, y_test_pred, model_evalution_df):
    """Function for adding model evalution statistics to DataFrame

    Args:
        model_name (str): Model name
        model (object): Object of fitted model
        X_train (DataFrame): Train dataset
        X_test (DataFrame): Test dataset
        y_train (DataFrame): Train labels
        y_test (DataFrame): Test labels
        y_train_pred (DataFrame): Predicted train labels
        y_test_pred (DataFrame): Predicted test labels
        model_evalution_df (DataFrame): DataFrame with evalution statistics

    Returns:
        DataFrame: DataFrame with added statistics
    """
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    accuracy = metrics.accuracy_score(y_test, y_test_pred)
    f1_score = metrics.f1_score(y_test, y_test_pred, average='weighted')
    precision = metrics.precision_score(y_test, y_test_pred)
    recall = metrics.recall_score(y_test, y_test_pred)
    model_evalutions_stats = [model_name, train_score, test_score, accuracy, f1_score, precision, recall]
    model_evalution_dict = {model_evalution_df.columns[i]:model_evalutions_stats[i] for i in range(len(model_evalutions_stats))}
    model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)
    return model_evalution_df

In [217]:
X = df_processed.drop(['fraud'], axis=1)
y = df_processed['fraud']
 
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state = 42, test_size = 0.30)

In [218]:
print(X_test.shape)
print(X_train.shape)

(44137, 75)
(102986, 75)


### Baseline

__Let's work on creating a baseline__

Let's take a not so simple model and experiment with a random forest

In [224]:
model_rf = ensemble.RandomForestClassifier(
    n_estimators=50,
    criterion='gini',
    max_depth=10,
    min_samples_split=2,
    random_state=13
)

model_rf.fit(X_train, y_train)
y_train_pred = model_rf.predict(X_train)
print('Train recall: {:.2f}'.format(metrics.recall_score(y_train, y_train_pred)))
print('Train f1: {:.2f}'.format(metrics.f1_score(y_train, y_train_pred)))
y_test_pred = model_rf.predict(X_test)
print('Test recall: {:.2f}'.format(metrics.recall_score(y_test, y_test_pred)))
print()
print('Test f1: {:.2f}'.format(metrics.f1_score(y_test, y_test_pred)))

Train recall: 0.35
Train f1: 0.52
Test recall: 0.27

Test f1: 0.43


Let's make another basic model using df_processed_1 data

Let us recall that at the stage of reconnaissance analysis we identified two correlated features and the df_processed_1 data contains one of them - days_since_last_logon, the df_processed data contains the other - group_days_since_last_logon

In [237]:
X_1 = df_processed_1.drop(['fraud'], axis=1)
y_1 = df_processed_1['fraud']
 
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X_1, y_1, stratify=y, random_state = 42, test_size = 0.30)

In [239]:
model_rf_1 = ensemble.RandomForestClassifier(
    n_estimators=50,
    criterion='gini',
    max_depth=10,
    min_samples_split=2,
    random_state=13
)

model_rf_1.fit(X_train_1, y_train_1)
y_train_pred_1 = model_rf_1.predict(X_train_1)
print('Train recall: {:.2f}'.format(metrics.recall_score(y_train_1, y_train_pred_1)))
print('Train f1: {:.2f}'.format(metrics.f1_score(y_train_1, y_train_pred_1)))
y_test_pred_1 = model_rf_1.predict(X_test_1)
print('Test recall: {:.2f}'.format(metrics.recall_score(y_test_1, y_test_pred_1)))
print()
print('Test f1: {:.2f}'.format(metrics.f1_score(y_test_1, y_test_pred_1)))

Train recall: 0.40
Train f1: 0.57
Test recall: 0.34

Test f1: 0.50


#### Select the most suitable features

In [245]:
# using SelectKBest we will select the most suitable features
best_f = SelectKBest(k='all')
best_f.fit(X_train, y_train)
X_train_fs = best_f.transform(X_train)
X_test_fs = best_f.transform(X_test)

# display selected features and their scores
selected_features = best_f.get_support(indices=True)
feature_scores = best_f.scores_[selected_features]
for i, (feature, score) in enumerate(zip(X_train.columns[selected_features], feature_scores), 1):
    print(f"{i}. Feature: {feature}, Score: {score}")

  f = msb / msw


1. Feature: account_age_days, Score: 1815.3937850020484
2. Feature: transaction_amt, Score: 1580.3951123068073
3. Feature: transaction_adj_amt, Score: 15841.400495512637
4. Feature: historic_velocity, Score: 0.11054434179965776
5. Feature: days_since_last_logon, Score: 0.011331690301879013
6. Feature: inital_amount, Score: 0.07478969604828632
7. Feature: year, Score: 1.68893844353431
8. Feature: month, Score: 6.420333424866
9. Feature: hour, Score: 0.46716840820262706
10. Feature: day_of_week, Score: 2.062373288970358
11. Feature: billing_country_USA, Score: nan
12. Feature: billing_country, Score: nan
13. Feature: country_match, Score: 0.2409949941203818
14. Feature: black_list, Score: 7139.093656679107
15. Feature: currency_cad, Score: 73.15444956692386
16. Feature: currency_eur, Score: 124.0974935911433
17. Feature: currency_usd, Score: 18.91011953887548
18. Feature: browser_0, Score: 0.012919131174316766
19. Feature: browser_1, Score: 0.012919131124584749
20. Feature: billing_city_

Using SelectKBest, we will select 25 features that are best for predicting the target variable. We implement selection using a training sample using the chi2 parameter

We indicate the characteristics that are included in the selected list

In [248]:
from sklearn.feature_selection import SelectKBest, chi2

# Create a SelectKBest object with score_func=chi2 and k=25
best_features_selector = SelectKBest(score_func=chi2, k=25)

# We train SelectKBest using training data
X_train_selected = best_features_selector.fit_transform(X_train, y_train)

# Displaying selected features
selected_feature_indices = best_features_selector.get_support(indices=True)
selected_features = X.columns[selected_feature_indices]
print(selected_features)

Index(['account_age_days', 'transaction_amt', 'transaction_adj_amt',
       'historic_velocity', 'inital_amount', 'month', 'day_of_week',
       'black_list', 'currency_cad', 'currency_eur', 'currency_usd',
       'billing_city_1', 'billing_state_0', 'signature_image_0',
       'signature_image_1', 'signature_image_2', 'signature_image_4',
       'transaction_type_1', 'transaction_type_2', 'transaction_type_3',
       'transaction_env_0', 'transaction_env_1', 'transaction_env_2',
       'transaction_env_3', 'transaction_env_4'],
      dtype='object')


In [254]:
X_train = X_train_selected
X_test = X_test[['account_age_days', 'transaction_amt', 'transaction_adj_amt',
       'historic_velocity', 'inital_amount', 'month', 'day_of_week',
       'black_list', 'currency_cad', 'currency_eur', 'currency_usd',
       'billing_city_1', 'billing_state_0', 'signature_image_0',
       'signature_image_1', 'signature_image_2', 'signature_image_4',
       'transaction_type_1', 'transaction_type_2', 'transaction_type_3',
       'transaction_env_0', 'transaction_env_1', 'transaction_env_2',
       'transaction_env_3', 'transaction_env_4']]

In [255]:
# normalize the data using minmaxscaler
scaler = preprocessing.MinMaxScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)



We have selected the features, let's see how random forest works on them

We will fill out the table for clarity

#### LogisticRegression

In [345]:
# Model with random under sampled data
model_name = 'LogisticRegression / Imbalanced data'
lr = linear_model.LogisticRegression(random_state=42, max_iter=1000)

lr = lr.fit(X_train_scaled, y_train)
y_train_pred = lr.predict(X_train_scaled)
y_test_pred = lr.predict(X_test_scaled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test, y_test_pred)}')
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, lr, X_train_scaled, X_test_scaled, y_train, y_test, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.97      0.99      0.98     41732
           1       0.81      0.55      0.65      2405

    accuracy                           0.97     44137
   macro avg       0.89      0.77      0.82     44137
weighted avg       0.97      0.97      0.97     44137



  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441


#### RandomForest

In [346]:
model_name2 = 'Baseline - RandomForestClassifier / Imbalanced data'

In [322]:
random_forest_base = ensemble.RandomForestClassifier(
    n_estimators=200,
    max_depth=12,
    criterion='gini',
    min_samples_leaf=20,
    random_state=42
)
random_forest_base.fit(X_train_scaled, y_train)
y_train_pred_rf = random_forest_base.predict(X_train_scaled)
y_test_pred_rf = random_forest_base.predict(X_test_scaled)
# Calculation of the RMSLE metric on training and validation sets
print('Recall train', round(np.sqrt(metrics.recall_score(y_train, y_train_pred_rf)), 2))
print('Recall test', round(np.sqrt(metrics.recall_score(y_test, y_test_pred_rf)), 2))

Recall train 0.74
Recall test 0.72


__We will take this model as the base one and compare further variations with the base one__

In [347]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name2, random_forest_base, X_train_scaled, X_test_scaled, y_train, y_test, y_train_pred_rf, y_test_pred_rf, model_evalution_df)

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


In [348]:
model_evalution_df

Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414


#### Stacking

In [349]:
model_name3 = 'Stacking / Imbalanced data'

In [351]:
# We create a list of tuples of the form: (model name, model)
estimators = [
    ('dt', tree.DecisionTreeClassifier(
        criterion='entropy',
        max_depth=6,
        random_state=42
        )),
    ('gb', ensemble.GradientBoostingClassifier(
        min_samples_leaf=5,
        learning_rate=0.05,
        n_estimators=300,
        max_depth=5,
        random_state=42
        )),
    ('rf', ensemble.RandomForestClassifier(
        n_estimators=100,
        max_depth=6,
        random_state=42
        ))
]

# Create an object of the stacking class

clf = ensemble.StackingClassifier(
    estimators=estimators,
    final_estimator=ensemble.RandomForestClassifier(
        n_estimators=100,
        max_depth=6,
        random_state=42
    ),
    n_jobs=-1
)

clf.fit(X_train_scaled, y_train)
y_train_pred_clf = clf.predict(X_train_scaled)
y_test_pred_clf = clf.predict(X_test_scaled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test, y_test_pred_clf)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.98      0.99      0.99     41732
           1       0.88      0.69      0.77      2405

    accuracy                           0.98     44137
   macro avg       0.93      0.84      0.88     44137
weighted avg       0.98      0.98      0.98     44137



In [352]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name3, clf, X_train_scaled, X_test_scaled, y_train, y_test, y_train_pred_rf, y_test_pred_clf, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239


Let's try to train a model on 45 features

In [259]:
X_train_45f, X_test_45f, y_train45f, y_test45f = train_test_split(X, y, stratify=y, random_state=42, test_size=0.3)

In [260]:
# Create a SelectKBest object with score_func=chi2 and k=45
best_features_selector = SelectKBest(score_func=chi2, k=45)

# We train SelectKBest using training data
X_train_45 = best_features_selector.fit_transform(X_train_45f, y_train45f)

# Displaying selected features
selected_feature_indices = best_features_selector.get_support(indices=True)
selected_features = X.columns[selected_feature_indices]
print(selected_features)

Index(['account_age_days', 'transaction_amt', 'transaction_adj_amt',
       'historic_velocity', 'inital_amount', 'month', 'hour', 'day_of_week',
       'black_list', 'currency_cad', 'currency_eur', 'currency_usd',
       'billing_city_1', 'billing_city_6', 'billing_city_10',
       'billing_city_11', 'billing_city_12', 'billing_state_0',
       'billing_state_1', 'billing_state_2', 'billing_state_4',
       'signature_image_0', 'signature_image_1', 'signature_image_2',
       'signature_image_4', 'transaction_type_0', 'transaction_type_1',
       'transaction_type_2', 'transaction_type_3', 'transaction_type_4',
       'transaction_env_0', 'transaction_env_1', 'transaction_env_2',
       'transaction_env_3', 'transaction_env_4', 'tranaction_initiate_1',
       'tranaction_initiate_4', 'ip_country_1', 'ip_country_7',
       'locale_country_0', 'locale_country_1', 'locale_country_2',
       'locale_country_4', 'locale_country_5', 'locale_country_7'],
      dtype='object')


In [355]:
model_name = 'Stacking-45features / Imbalanced data'

In [354]:
X_train45 = X_train_45
X_test45 = X_test_45f[['account_age_days', 'transaction_amt', 'transaction_adj_amt',
       'historic_velocity', 'inital_amount', 'month', 'hour', 'day_of_week',
       'black_list', 'currency_cad', 'currency_eur', 'currency_usd',
       'billing_city_1', 'billing_city_6', 'billing_city_10',
       'billing_city_11', 'billing_city_12', 'billing_state_0',
       'billing_state_1', 'billing_state_2', 'billing_state_4',
       'signature_image_0', 'signature_image_1', 'signature_image_2',
       'signature_image_4', 'transaction_type_0', 'transaction_type_1',
       'transaction_type_2', 'transaction_type_3', 'transaction_type_4',
       'transaction_env_0', 'transaction_env_1', 'transaction_env_2',
       'transaction_env_3', 'transaction_env_4', 'tranaction_initiate_1',
       'tranaction_initiate_4', 'ip_country_1', 'ip_country_7',
       'locale_country_0', 'locale_country_1', 'locale_country_2',
       'locale_country_4', 'locale_country_5', 'locale_country_7']]
# normalize the data using minmaxscaler
scaler = preprocessing.MinMaxScaler()
scaler.fit(X_train45)
X_train_sc45 = scaler.transform(X_train45)
X_test_sc45 = scaler.transform(X_test45)

clf.fit(X_train_sc45, y_train45f)
y_test_pred_clf45 = clf.predict(X_test_sc45)
y_train_pred_clf45 = clf.predict(X_train_sc45)
print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test45f, y_test_pred_clf45)}')




Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.98      0.99      0.99     41732
           1       0.88      0.72      0.79      2405

    accuracy                           0.98     44137
   macro avg       0.93      0.86      0.89     44137
weighted avg       0.98      0.98      0.98     44137



In [356]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, clf, X_train45, X_test45, y_train45f, y_test45f, y_train_pred_clf45, y_test_pred_clf45, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593


Let's try to train the model on all features.

In [357]:
model_name = 'Stacking-all_features / Imbalanced data'

In [358]:
X_train_all, X_test_all, y_train_all, y_test_all = train_test_split(X, y, stratify=y, random_state=42, test_size=0.3)
# normalize the data using minmaxscaler
scaler = preprocessing.MinMaxScaler()
scaler.fit(X_train_all)
X_train_sc_all = scaler.transform(X_train_all)
X_test_sc_all = scaler.transform(X_test_all)

clf.fit(X_train_sc_all, y_train_all)
y_pred_clf_all = clf.predict(X_test_sc_all)
y_train_pred_clf_all = clf.predict(X_train_sc_all)
print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_all, y_pred_clf_all)}')


Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.98      0.99      0.99     41732
           1       0.88      0.73      0.80      2405

    accuracy                           0.98     44137
   macro avg       0.93      0.86      0.89     44137
weighted avg       0.98      0.98      0.98     44137



In [359]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, clf, X_train_all, X_test_all, y_train_all, y_test_all, y_train_pred_clf_all, y_pred_clf_all, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809


#### Random_forest/GridSearch_all_features

In [363]:
model_name = 'Random_forest/GridSearch_all_features / Imbalanced data'

In [338]:

param_grid = {'n_estimators': list(range(80, 200, 30)),
              'min_samples_leaf': [5],
              'max_depth': list(np.linspace(20, 40, 5, dtype=int)),
              'criterion' : ['gini', 'entropy']
              }


grid_search_rf_all = GridSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring='recall'
)

best_params_rf_all = grid_search_rf_all.fit(X_train_all, y_train_all)

y_train_pred_rf_all = best_params_rf_all.best_estimator_.predict(X_train_all)
y_test_pred_rf_all = best_params_rf_all.best_estimator_.predict(X_test_all)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_all, y_test_pred_rf_all)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.97      1.00      0.98     41732
           1       0.95      0.45      0.61      2405

    accuracy                           0.97     44137
   macro avg       0.96      0.73      0.80     44137
weighted avg       0.97      0.97      0.96     44137



In [364]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, grid_search_rf_all, X_train_all, X_test_all, y_train_all, y_test_all, y_train_pred_rf_all, y_test_pred_rf_all, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638


#### Random Under Sampling data

In [269]:
# Random Under Sampling data implimentation
random_under_sampler = RandomUnderSampler(random_state=42)
X_rus, y_rus = random_under_sampler.fit_resample(X_train_scaled, y_train)

y_rus.value_counts()

0    5610
1    5610
Name: fraud, dtype: int64

In [270]:
# Train-Test splitting
X_train_u_sampled, X_test_u_sampled, y_train_u_sampled, y_test_u_sampled = train_test_split(X_rus, y_rus, test_size=0.25, random_state=42, stratify=y_rus)
y_train_u_sampled.value_counts()

1    4208
0    4207
Name: fraud, dtype: int64

In [365]:
# Model with random under sampled data
model_name = 'LogisticRegression / Random Under Sampled data'
lr = linear_model.LogisticRegression(random_state=42, max_iter=1000)

lr_model_u_sampled = lr.fit(X_train_u_sampled, y_train_u_sampled)
y_train_pred = lr_model_u_sampled.predict(X_train_u_sampled)
y_test_pred = lr_model_u_sampled.predict(X_test_u_sampled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_u_sampled, y_test_pred)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.87      0.88      0.87      1403
           1       0.88      0.87      0.87      1402

    accuracy                           0.87      2805
   macro avg       0.87      0.87      0.87      2805
weighted avg       0.87      0.87      0.87      2805



In [366]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, lr, X_train_u_sampled, X_test_u_sampled, y_train_u_sampled, y_test_u_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046


#### Random Over Sampling data

In [280]:
random_over_sampler = RandomOverSampler(random_state=42)
X_ros, y_ros = random_over_sampler.fit_resample(X_train_scaled, y_train)

y_ros.value_counts()

0    97376
1    97376
Name: fraud, dtype: int64

In [281]:
# Train-Test splitting
X_train_o_sampled, X_test_o_sampled, y_train_o_sampled, y_test_o_sampled = train_test_split(X_ros, y_ros, test_size=0.25, random_state=42, stratify=y_ros)
y_train_o_sampled.value_counts()

0    73032
1    73032
Name: fraud, dtype: int64

In [367]:
# Model with random over sampled data
model_name = 'LogisticRegression / Random Over Sampled data'
lr = linear_model.LogisticRegression(random_state=42, max_iter=1000)

lr_model_o_sampled = lr.fit(X_train_o_sampled, y_train_o_sampled)
y_train_pred = lr_model_o_sampled.predict(X_train_o_sampled)
y_test_pred = lr_model_o_sampled.predict(X_test_o_sampled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_o_sampled, y_test_pred)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.88      0.88      0.88     24344
           1       0.88      0.88      0.88     24344

    accuracy                           0.88     48688
   macro avg       0.88      0.88      0.88     48688
weighted avg       0.88      0.88      0.88     48688



In [368]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, lr, X_train_o_sampled, X_test_o_sampled, y_train_o_sampled, y_test_o_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407


#### SMOTE Over Sampling data

In [285]:
smote_sampler = SMOTE(random_state=42)
X_sm, y_sm = smote_sampler.fit_resample(X_train_scaled.astype('float'), y_train)

y_sm.value_counts()

0    97376
1    97376
Name: fraud, dtype: int64

In [287]:
X_train_s_sampled, X_test_s_sampled, y_train_s_sampled, y_test_s_sampled = train_test_split(X_sm, y_sm, test_size=0.25, random_state=42, stratify=y_sm)
y_train_s_sampled.value_counts()

0    73032
1    73032
Name: fraud, dtype: int64

In [369]:
# Model with SMOTE over sampled data
model_name = 'LogisticRegression / SMOTE Over Sampled data'
lr = linear_model.LogisticRegression(random_state=42, max_iter=1000)

lr_model_s_sampled = lr.fit(X_train_s_sampled, y_train_s_sampled)
y_train_pred = lr_model_s_sampled.predict(X_train_s_sampled)
y_test_pred = lr_model_s_sampled.predict(X_test_s_sampled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_s_sampled, y_test_pred)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.90      0.90      0.90     24344
           1       0.90      0.90      0.90     24344

    accuracy                           0.90     48688
   macro avg       0.90      0.90      0.90     48688
weighted avg       0.90      0.90      0.90     48688



In [370]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, lr, X_train_s_sampled, X_test_s_sampled, y_train_s_sampled, y_test_s_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563


#### Random Under Sampling data with optimized parameters

In [371]:
# Preparing Grid Search 
param_grid = [
              {'penalty': ['l2', 'none'], 
              'solver': ['lbfgs', 'sag'],
               'C': [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1]},
              
              {'penalty': ['l1', 'l2'] ,
              'solver': ['liblinear', 'saga'],
               'C': [0.01, 0.1, 0.3, 0.5, 0.7, 0.9, 1]}
]

grid_search_lr = GridSearchCV(
    estimator=linear_model.LogisticRegression(random_state=42, max_iter=1000), 
    param_grid=param_grid, 
    cv=5, 
    n_jobs=-1,
    scoring='recall'
)

model_name = 'Tuned LogisticRegression / Random Under Sampled data'
best_params_lr = grid_search_lr.fit(X_train_u_sampled, y_train_u_sampled)

y_train_pred = best_params_lr.best_estimator_.predict(X_train_u_sampled)
y_test_pred = best_params_lr.best_estimator_.predict(X_test_u_sampled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_u_sampled, y_test_pred)}')



Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      1403
           1       0.87      0.87      0.87      1402

    accuracy                           0.87      2805
   macro avg       0.87      0.87      0.87      2805
weighted avg       0.87      0.87      0.87      2805



In [375]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, best_params_lr.best_estimator_, X_train_u_sampled, X_test_u_sampled, y_train_u_sampled, y_test_u_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563
9,Tuned LogisticRegression / Random Under Sample...,0.883541,0.871658,0.871658,0.871657,0.873745,0.868759


#### Decision Tree - Random Under Sampling data

In [376]:
# Decision Tree model with Random Under Sampled data
model_name = 'DecisionTreeClassifier / Random Under Sampled data'
dt = tree.DecisionTreeClassifier(
    max_depth=10,
    random_state=42
)

dt_model_u_sampled = dt.fit(X_train_u_sampled, y_train_u_sampled)
y_train_pred = dt_model_u_sampled.predict(X_train_u_sampled)
y_test_pred = dt_model_u_sampled.predict(X_test_u_sampled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_u_sampled, y_test_pred)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.85      0.89      0.87      1403
           1       0.89      0.84      0.86      1402

    accuracy                           0.87      2805
   macro avg       0.87      0.87      0.87      2805
weighted avg       0.87      0.87      0.87      2805



In [377]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, dt, X_train_u_sampled, X_test_u_sampled, y_train_u_sampled, y_test_u_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563
9,Tuned LogisticRegression / Random Under Sample...,0.883541,0.871658,0.871658,0.871657,0.873745,0.868759


#### Decision Tree - Random Over Sampling data

In [378]:
# Decision Tree model with Random Over Sampled data
model_name = 'DecisionTreeClassifier / Random Over Sampled data'
dt = tree.DecisionTreeClassifier(
    max_depth=10,
    random_state=42
)

dt_model_o_sampled = dt.fit(X_train_o_sampled, y_train_o_sampled)
y_train_pred = dt_model_o_sampled.predict(X_train_o_sampled)
y_test_pred = dt_model_o_sampled.predict(X_test_o_sampled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_o_sampled, y_test_pred)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.90      0.92      0.91     24344
           1       0.92      0.90      0.91     24344

    accuracy                           0.91     48688
   macro avg       0.91      0.91      0.91     48688
weighted avg       0.91      0.91      0.91     48688



In [379]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, dt, X_train_o_sampled, X_test_o_sampled, y_train_o_sampled, y_test_o_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563
9,Tuned LogisticRegression / Random Under Sample...,0.883541,0.871658,0.871658,0.871657,0.873745,0.868759


#### Decision Tree - SMOTE Over Sampling data

In [382]:
# Decision Tree model with SMOTE Over Sampled data
model_name = 'DecisionTreeClassifier / SMOTE Over Sampled data'
dt = tree.DecisionTreeClassifier(
    max_depth=10,
    random_state=42
)

dt_model_s_sampled = dt.fit(X_train_s_sampled, y_train_s_sampled)
y_train_pred = dt_model_s_sampled.predict(X_train_s_sampled)
y_test_pred = dt_model_s_sampled.predict(X_test_s_sampled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_s_sampled, y_test_pred)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.93      0.95      0.94     24344
           1       0.95      0.93      0.94     24344

    accuracy                           0.94     48688
   macro avg       0.94      0.94      0.94     48688
weighted avg       0.94      0.94      0.94     48688



In [383]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, dt, X_train_s_sampled, X_test_s_sampled, y_train_s_sampled, y_test_s_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563
9,Tuned LogisticRegression / Random Under Sample...,0.883541,0.871658,0.871658,0.871657,0.873745,0.868759


#### Decision Tree - Imbala Random Under sampled data - Optimized Hyperparameters

In [384]:
# Max depth of tree
max_depth = list(np.linspace(start=4, stop=20, num=5, dtype=int))
# Number of samples to split a node
min_samples_split = list(np.linspace(start=2, stop=12, num=6, dtype=int))
# Number of samplet at leaf node
min_samples_leaf = list(np.linspace(start=2, stop=12, num=6, dtype=int))
# Type of criterion
criterion = ['gini', 'entropy']

# Preparing Grid Search
param_grid = {
    'max_depth':max_depth,
    'min_samples_leaf':min_samples_leaf,
    'min_samples_split':min_samples_split,
    'criterion':criterion
}

grid_search_tree = GridSearchCV(
    estimator=tree.DecisionTreeClassifier(random_state=42), 
    param_grid=param_grid, 
    cv=5, 
    n_jobs=-1,
    scoring='recall'
)

model_name = 'Tuned DecisionTreeClassifier / Random Under sampled data'
best_params_dt = grid_search_tree.fit(X_train_u_sampled, y_train_u_sampled)

y_train_pred = best_params_dt.best_estimator_.predict(X_train_u_sampled)
y_test_pred = best_params_dt.best_estimator_.predict(X_test_u_sampled)
best_params_dt.best_estimator_

DecisionTreeClassifier(max_depth=4, min_samples_leaf=2, random_state=42)

In [385]:
print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_u_sampled, y_test_pred)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.86      0.87      0.87      1403
           1       0.87      0.86      0.86      1402

    accuracy                           0.87      2805
   macro avg       0.87      0.87      0.87      2805
weighted avg       0.87      0.87      0.87      2805



In [386]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, best_params_dt.best_estimator_, X_train_u_sampled, X_test_u_sampled, y_train_u_sampled, y_test_u_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563
9,Tuned LogisticRegression / Random Under Sample...,0.883541,0.871658,0.871658,0.871657,0.873745,0.868759


#### Random Forest

In [388]:
# Random Forest model with imbalanced data
model_name = 'RandomForestClassifier / Imbalanced data'
rf = ensemble.RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)

rf_model = rf.fit(X_train_scaled, y_train)
y_train_pred = rf_model.predict(X_train_scaled)
y_test_pred = rf_model.predict(X_test_scaled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test, y_test_pred)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.97      1.00      0.98     41732
           1       0.91      0.51      0.66      2405

    accuracy                           0.97     44137
   macro avg       0.94      0.76      0.82     44137
weighted avg       0.97      0.97      0.97     44137



In [389]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, rf, X_train_scaled, X_test_scaled, y_train, y_test, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563
9,Tuned LogisticRegression / Random Under Sample...,0.883541,0.871658,0.871658,0.871657,0.873745,0.868759


#### Random Forest - Random Under Sampling data

In [390]:
# Random Forest model with Random Under Sampling data
model_name = 'RandomForestClassifier / Random Under Sampling data'
rf = ensemble.RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)

rf_u_model = rf.fit(X_train_u_sampled, y_train_u_sampled)
y_train_pred = rf_u_model.predict(X_train_u_sampled)
y_test_pred = rf_u_model.predict(X_test_u_sampled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_u_sampled, y_test_pred)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.87      0.91      0.89      1403
           1       0.91      0.86      0.89      1402

    accuracy                           0.89      2805
   macro avg       0.89      0.89      0.89      2805
weighted avg       0.89      0.89      0.89      2805



In [391]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, rf, X_train_u_sampled, X_test_u_sampled, y_train_u_sampled, y_test_u_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563
9,Tuned LogisticRegression / Random Under Sample...,0.883541,0.871658,0.871658,0.871657,0.873745,0.868759


#### Random Forest - Random Over Sampling data

In [393]:
# Random Forest model with Random Over Sampling data
model_name = 'RandomForestClassifier / Random Over Sampling data'
rf = ensemble.RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)

rf_o_model = rf.fit(X_train_o_sampled, y_train_o_sampled)
y_train_pred = rf_o_model.predict(X_train_o_sampled)
y_test_pred = rf_o_model.predict(X_test_o_sampled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_o_sampled, y_test_pred)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.91      0.93      0.92     24344
           1       0.93      0.90      0.92     24344

    accuracy                           0.92     48688
   macro avg       0.92      0.92      0.92     48688
weighted avg       0.92      0.92      0.92     48688



In [394]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, rf, X_train_o_sampled, X_test_o_sampled, y_train_o_sampled, y_test_o_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563
9,Tuned LogisticRegression / Random Under Sample...,0.883541,0.871658,0.871658,0.871657,0.873745,0.868759


#### Random Forest - SMOTE Over Sampling data

In [395]:
# Random Forest model with Random Over Sampling data
model_name = 'RandomForestClassifier / SMOTE Over Sampling data'
rf = ensemble.RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42
)

rf_s_model = rf.fit(X_train_s_sampled, y_train_s_sampled)
y_train_pred = rf_s_model.predict(X_train_s_sampled)
y_test_pred = rf_s_model.predict(X_test_s_sampled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_s_sampled, y_test_pred)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.94      0.94      0.94     24344
           1       0.94      0.94      0.94     24344

    accuracy                           0.94     48688
   macro avg       0.94      0.94      0.94     48688
weighted avg       0.94      0.94      0.94     48688



In [None]:
X_test_s_sampled[-5]

array([0.7570133 , 0.52519528, 0.56393208, 0.55820649, 0.06329663,
       0.50592782, 1.        , 1.        , 0.        , 0.        ,
       1.        , 0.        , 0.        , 0.        , 1.        ,
       0.        , 0.        , 0.        , 1.        , 0.        ,
       0.85869849, 0.        , 1.        , 1.        , 1.        ])

In [396]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, rf, X_train_s_sampled, X_test_s_sampled, y_train_s_sampled, y_test_s_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563
9,Tuned LogisticRegression / Random Under Sample...,0.883541,0.871658,0.871658,0.871657,0.873745,0.868759


#### Random Forest - Random Under Sampling data - Tuned hyperparameters

In [398]:
# Number of trees in Random Forest
n_estimators = list(np.linspace(start=100, stop=300, num=5, dtype=int))
# Max depth of trees in Random Forest
max_depth = list(np.linspace(start=4, stop=20, num=5, dtype=int))
# Number of samples to split a node
min_samples_split = list(np.linspace(start=2, stop=12, num=6, dtype=int))
# Number of samplet at leaf node
min_samples_leaf = list(np.linspace(start=2, stop=12, num=6, dtype=int))
# Type of criterion
criterion = ['gini', 'entropy']

param_grid = {
    'n_estimators':n_estimators,
    'max_depth':max_depth,
    'min_samples_leaf':min_samples_leaf,
    'min_samples_split':min_samples_split,
    'criterion':criterion
}

In [399]:
grid_search_rf = GridSearchCV(
    estimator=ensemble.RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
    scoring='recall'
)


model_name = 'Tuned RandomForestClassifier / Random Under Sampled data'
best_params_rf = grid_search_rf.fit(X_train_u_sampled, y_train_u_sampled)

y_train_pred = best_params_rf.best_estimator_.predict(X_train_u_sampled)
y_test_pred = best_params_rf.best_estimator_.predict(X_test_u_sampled)

print(f'Metrics for the test sample: \n \n{metrics.classification_report(y_test_u_sampled, y_test_pred)}')

Metrics for the test sample: 
 
              precision    recall  f1-score   support

           0       0.87      0.91      0.89      1403
           1       0.91      0.86      0.89      1402

    accuracy                           0.89      2805
   macro avg       0.89      0.89      0.89      2805
weighted avg       0.89      0.89      0.89      2805



In [400]:
# Adding baseline model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, best_params_rf.best_estimator_, X_train_u_sampled, X_test_u_sampled, y_train_u_sampled, y_test_u_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563
9,Tuned LogisticRegression / Random Under Sample...,0.883541,0.871658,0.871658,0.871657,0.873745,0.868759


__The first is the processing of addresses; you also need to add signs of correspondence between the country in which the payment was made and the country of account registration
it is necessary to work with the ip addresses attribute
Also a sign of user_agent - you need to drag in both the domain and the browser and see from which browser transactions are usually made.__

Then try using gridsearch to sort through the parameters and make a complex model. But all this will give results if you follow step 1

Перше це обробка адрес, ще треба додати ознаки відповідності країни, за якою оплачували зі строною реєстрації рахунку.
необхідно попрацювати з ознакою ip adres
Також ознака user_agent - необхідно вітянути і домен і браузер і подивитися з якого браузера зазвичай здійснюються транзакції.

Потім спробувати за допомогою gridsearch перебрати параметри і створити комплексну модель. Але це все дасть результат, якщо виконати крок 1

#### CatBoost

#### CatBoost - Random Under Sampling data

In [410]:
# Preparing dataset
pool = Pool(data=X_train_u_sampled, label=y_train_u_sampled)

In [411]:
# cv
params = {
    'loss_function': 'Logloss',
    'iterations': 300,
    'custom_loss': 'Recall',
    'random_seed': 42,
    'learning_rate': 0.15
}

cv_data = cv(
    params=params,
    pool=Pool(data=X_train_u_sampled, label=y_train_u_sampled),
    fold_count=5, # separating to 5 folds
    shuffle=True,
    partition_random_seed=42,
    stratified=True, 
    verbose=False
)

# Best score printing
best_value = np.min(cv_data['test-Logloss-mean'])
best_iter = np.argmin(cv_data['test-Logloss-mean'])
print("Best validation Logloss score, stratified: {:.4f}+/-{:.3f} on step {}".format(best_value, cv_data['test-Logloss-std'][best_iter], best_iter))

Best validation Logloss score, stratified: 0.2286+/-0.014 on step 94


In [403]:
# Creating model object
cb_model = CatBoostClassifier(
    iterations=300,
    learning_rate=0.15,
    eval_metric='Recall'
)

model_name = 'CatBoostClassifier / Random Under Sampled data'
cb_model.fit(X_train_u_sampled, y_train_u_sampled,
         eval_set=(X_test_scaled, y_test),
         verbose=50,
         early_stopping_rounds=20,
)

y_train_pred = cb_model.predict(X_train_u_sampled)
y_test_pred = cb_model.predict(X_test_u_sampled)

0:	learn: 0.7799430	test: 0.7646570	best: 0.7646570 (0)	total: 10.4ms	remaining: 3.11s
50:	learn: 0.9073194	test: 0.8918919	best: 0.8918919 (50)	total: 1.04s	remaining: 5.07s
100:	learn: 0.9230038	test: 0.8923077	best: 0.8948025 (88)	total: 2.29s	remaining: 4.51s
Stopped by overfitting detector  (20 iterations wait)

bestTest = 0.8948024948
bestIteration = 88

Shrink model to first 89 iterations.


In [404]:
# Adding model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, cb_model, X_train_u_sampled, X_test_u_sampled, y_train_u_sampled, y_test_u_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563
9,Tuned LogisticRegression / Random Under Sample...,0.883541,0.871658,0.871658,0.871657,0.873745,0.868759


#### CatBoost - Random Over Sampling data

In [414]:
# Preparing dataset
pool = Pool(data=X_train_o_sampled, label=y_train_o_sampled)

In [415]:
# cv
params = {
    'loss_function': 'Logloss',
    'iterations': 300,
    'custom_loss': 'Recall',
    'random_seed': 42,
    'learning_rate': 0.15
}

cv_data = cv(
    params=params,
    pool=Pool(data=X_train_o_sampled, label=y_train_o_sampled),
    fold_count=5, # separating to 5 folds
    shuffle=True,
    partition_random_seed=42,
    stratified=True, 
    verbose=False
)

# Best score printing
best_value = np.min(cv_data['test-Logloss-mean'])
best_iter = np.argmin(cv_data['test-Logloss-mean'])
print("Best validation Logloss score, stratified: {:.4f}+/-{:.3f} on step {}".format(best_value, cv_data['test-Logloss-std'][best_iter], best_iter))

Best validation Logloss score, stratified: 0.1320+/-0.002 on step 299


In [422]:
# Creating model object
cb_o_model = CatBoostClassifier(
    iterations=300,
    learning_rate=0.15,
    eval_metric='Recall'
)

model_name = 'CatBoostClassifier / Random Over Sampled data'
cb_o_model.fit(X_train_o_sampled, y_train_o_sampled,
         eval_set=(X_test_scaled, y_test),
         verbose=50,
         early_stopping_rounds=20,
)

y_train_pred = cb_o_model.predict(X_train_o_sampled)
y_test_pred = cb_o_model.predict(X_test_o_sampled)

0:	learn: 0.8607734	test: 0.8665281	best: 0.8665281 (0)	total: 156ms	remaining: 46.5s
50:	learn: 0.8990716	test: 0.8881497	best: 0.8910603 (39)	total: 8.44s	remaining: 41.2s
Stopped by overfitting detector  (20 iterations wait)

bestTest = 0.8910602911
bestIteration = 39

Shrink model to first 40 iterations.


In [423]:
# Adding model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, cb_o_model, X_train_o_sampled, X_test_o_sampled, y_train_o_sampled, y_test_o_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563
9,Tuned LogisticRegression / Random Under Sample...,0.883541,0.871658,0.871658,0.871657,0.873745,0.868759


#### CatBoost - SMOTE Over Sampling data

In [425]:
# Preparing dataset
pool = Pool(data=X_train_s_sampled, label=y_train_s_sampled)

In [426]:
# cv
params = {
    'loss_function': 'Logloss',
    'iterations': 300,
    'custom_loss': 'Recall',
    'random_seed': 42,
    'learning_rate': 0.15
}

cv_data = cv(
    params=params,
    pool=Pool(data=X_train_s_sampled, label=y_train_s_sampled),
    fold_count=5, # separating to 5 folds
    shuffle=True,
    partition_random_seed=42,
    stratified=True, 
    verbose=False
)

# Best score printing
best_value = np.min(cv_data['test-Logloss-mean'])
best_iter = np.argmin(cv_data['test-Logloss-mean'])
print("Best validation Logloss score, stratified: {:.4f}+/-{:.3f} on step {}".format(best_value, cv_data['test-Logloss-std'][best_iter], best_iter))

Best validation Logloss score, stratified: 0.0409+/-0.001 on step 299


In [427]:
# Creating model object
cb_s_model = CatBoostClassifier(
    iterations=300,
    learning_rate=0.15,
    eval_metric='Recall'
)

model_name = 'CatBoostClassifier / SMOTE Over Sampled data'
cb_s_model.fit(X_train_s_sampled, y_train_s_sampled,
         eval_set=(X_test_scaled, y_test),
         verbose=50,
         early_stopping_rounds=20,
)

y_train_pred = cb_s_model.predict(X_train_s_sampled)
y_test_pred = cb_s_model.predict(X_test_s_sampled)

0:	learn: 0.9092316	test: 0.8049896	best: 0.8049896 (0)	total: 95.2ms	remaining: 28.5s
Stopped by overfitting detector  (20 iterations wait)

bestTest = 0.8486486486
bestIteration = 3

Shrink model to first 4 iterations.


In [428]:
# Adding model evalution statistics
model_evalution_df = add_model_evalution_stat(model_name, cb_s_model, X_train_s_sampled, X_test_s_sampled, y_train_s_sampled, y_test_s_sampled, y_train_pred, y_test_pred, model_evalution_df)
model_evalution_df

  model_evalution_df = model_evalution_df.append(model_evalution_dict, ignore_index=True)


Unnamed: 0,Model Name,Training Score,Testing Score,Accuracy,F1 Score,Precision,Recall
0,LogisticRegression / Imbalanced data,0.969423,0.968484,0.968484,0.965576,0.812192,0.548441
1,Baseline - RandomForestClassifier / Imbalanced...,0.973288,0.97075,0.97075,0.967033,0.89957,0.521414
2,Stacking / Imbalanced data,0.980939,0.977592,0.977592,0.97629,0.876596,0.685239
3,Stacking-45features / Imbalanced data,0.508322,0.504883,0.979246,0.978226,0.881208,0.715593
4,Stacking-all_features / Imbalanced data,0.525547,0.521173,0.9797,0.978842,0.875186,0.731809
5,Random_forest/GridSearch_all_features / Imbala...,0.664349,0.453638,0.969028,0.963756,0.953671,0.453638
6,LogisticRegression / Random Under Sampled data,0.884492,0.872727,0.872727,0.872724,0.87617,0.868046
7,LogisticRegression / Random Over Sampled data,0.883325,0.884345,0.884345,0.884345,0.884298,0.884407
8,LogisticRegression / SMOTE Over Sampled data,0.901201,0.900058,0.900058,0.900057,0.898063,0.902563
9,Tuned LogisticRegression / Random Under Sample...,0.883541,0.871658,0.871658,0.871657,0.873745,0.868759


In [430]:
# Saving fitted model (RandomForestClassifier / SMOTE Over Sampling data)
with open('models/model.pkl', 'wb') as output:
    pickle.dump(rf_s_model, output)

### Conclusions

#### Many models have been built through machine learning. In order to carry out the experiment, logistic regression on different data sets, decision trees, random forest, and CatBoostClassifier were implemented. 

#### The problem with the training dataset is class imbalance. Three methods were implemented to combat imbalance, which clearly improved the quality of models for all algorithms.

#### Thus, the most important aspects were feature selection and combating class imbalance.

#### RandomForestClassifier / SMOTE Over Sampling data was chosen as the final model: recall 0.936822, f1 score 0.938219.

### __We have quite complex, up-to-date transaction data. The task was to build a machine learning model that predicts fraudulent transactions. Let me summarize the work done, a fairly broad reconnaissance analysis, and we can conclude that it is very difficult to determine dependencies in the data by eye. At the same time, the presence of a large number of observations, data processing and working with class imbalance helped solve the problem with a fairly good result.__