# Welcome to this lovely notebook. It is an extension of the modeling notebooks. Here we'll try a voting classifier

## In this notebook we are going to implement the following:

1. We'll use voting to try to get better results by combining three models: XGBoost, LightGBM, and CatBoost

## Importing all required libraries

In [1]:
#importing all the required libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import csv
from datetime import datetime

# Set the option to display all columns
pd.set_option('display.max_columns', None)

# metrics
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score,\
classification_report, precision_recall_curve, auc, make_scorer, fbeta_score

# Encoding
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Machine Learning - Preparation
from sklearn.model_selection import train_test_split

# Machine Learning - Algorithm
from sklearn.ensemble import VotingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier

## 1 Let's get the data

In [2]:
train = pd.read_csv('fraudTrain.csv', index_col=0)
test = pd.read_csv('fraudTest.csv', index_col=0)

## 2 Let's split the data

In [3]:
X_train = train.drop(columns=['is_fraud'])
y_train = train['is_fraud']

X_test = test.drop(columns=['is_fraud'])
y_test = test['is_fraud']

## 3 Preparing the 50% downsampled data

In [4]:
%%time

X_train_encoded_3 = pd.read_parquet('X_train_encoded.csv')
X_test_encoded_3 = pd.read_parquet('X_test_encoded.csv')

# Reset the indices to align them
X_train_encoded_3 = X_train_encoded_3.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

# Step 1: Separate majority class (0s) and minority class (1s)
X_train_encoded_0 = X_train_encoded_3[y_train == 0]
X_train_encoded_1 = X_train_encoded_3[y_train == 1]

y_train_0 = y_train[y_train == 0]
y_train_1 = y_train[y_train == 1]

# Step 2: Downsample the majority class (0s) to 50% and use the sampled indices to select the matching rows from y_train_0
X_train_0_downsampled = X_train_encoded_0.sample(frac=0.5, random_state=42)
y_train_0_downsampled = y_train_0.loc[X_train_0_downsampled.index]  # Use the same indices for y_train

# Step 3: Concatenate the downsampled majority class with the minority class
X_train_downsampled = pd.concat([X_train_0_downsampled, X_train_encoded_1])
y_train_downsampled = pd.concat([y_train_0_downsampled, y_train_1])

# Step 4: Shuffle the dataset to mix the downsampled rows
X_train_encoded_3 = X_train_downsampled.sample(frac=1, random_state=42)
y_train = y_train_downsampled.loc[X_train_encoded_3.index]  # Align y_train with the shuffled rows

CPU times: total: 3min 41s
Wall time: 1min 53s
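
The downsampling steps above can be wrapped in a small helper. This is a minimal sketch on toy data, not the notebook's actual pipeline: the function name `downsample_majority`, its parameters, and the `amt` column are assumptions made for illustration. The key point is that the labels are selected with the index of the shuffled features, so rows and labels stay aligned.

```python
import pandas as pd

def downsample_majority(X, y, frac=0.5, majority_label=0, random_state=42):
    """Keep `frac` of the majority-class rows plus all other rows, then shuffle."""
    X = X.reset_index(drop=True)
    y = y.reset_index(drop=True)
    majority_mask = y == majority_label
    # Sample only the majority class; the minority class is kept whole
    kept_majority = X[majority_mask].sample(frac=frac, random_state=random_state)
    X_out = pd.concat([kept_majority, X[~majority_mask]])
    X_out = X_out.sample(frac=1, random_state=random_state)
    # Index y by the shuffled index so features and labels stay aligned
    y_out = y.loc[X_out.index]
    return X_out, y_out

# Toy data (hypothetical): 8 legitimate rows and 2 fraud rows
X_toy = pd.DataFrame({"amt": range(10)})
y_toy = pd.Series([0] * 8 + [1] * 2)
X_small, y_small = downsample_majority(X_toy, y_toy)
```

With `frac=0.5`, the toy example keeps 4 of the 8 majority rows and both minority rows.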


In [5]:
X_train_encoded_3.shape

(652090, 2151)

In [6]:
# List all DataFrames in memory
dfs_in_memory = {name: obj for name, obj in globals().items() if isinstance(obj, pd.DataFrame)}

# Display the DataFrame names
for name in dfs_in_memory:
    print(name)

train
test
X_train
X_test
X_train_encoded_3
X_test_encoded_3
X_train_encoded_0
X_train_encoded_1
X_train_0_downsampled
X_train_downsampled


### Removing unused dfs from memory

In [7]:
del train, test, X_train, X_test, X_train_encoded_0, X_train_encoded_1, X_train_0_downsampled, X_train_downsampled
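
To judge whether deleting a DataFrame is worthwhile, its size can be checked first. The following is a small sketch under assumed toy data (the name `big` and its shape are made up); `gc.collect()` is an optional extra step that asks Python to reclaim unreachable objects right away.

```python
import gc
import numpy as np
import pandas as pd

# Hypothetical DataFrame standing in for one of the large intermediate frames
big = pd.DataFrame(np.zeros((1000, 10)))

# Memory footprint in megabytes, including the index
size_mb = big.memory_usage(deep=True).sum() / 1e6

# Drop the reference and ask the garbage collector to run
del big
gc.collect()
```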

### Define the models

In [4]:
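# NOTE: XGBoost's row-sampling parameter is spelled `subsample`; `sub_sample` is ignored
# (see the warning in the training output), so the results below were produced without row subsampling.
xgb = XGBClassifier(max_depth=7, learning_rate=0.05, sub_sample=0.9, scale_pos_weight=0.8)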
lgbm = LGBMClassifier(max_depth=7, learning_rate=0.05, n_estimators=100, scale_pos_weight=0.8)
catboost = CatBoostClassifier(verbose=0, depth=7, learning_rate=0.2, n_estimators=100, scale_pos_weight=0.6)

### Define the classifiers

In [5]:
voting_clf_hard = VotingClassifier(estimators=[
    ('xgb', xgb), 
    ('lgbm', lgbm), 
    ('catboost', catboost)],
    voting='hard'
)

In [6]:
voting_clf_soft = VotingClassifier(estimators=[
    ('xgb', xgb), 
    ('lgbm', lgbm), 
    ('catboost', catboost)],
    voting='soft'
)
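
The difference between the two voting modes can be illustrated without any trained models. This is a simplified sketch for the binary case, using made-up probabilities for three models; the functions `hard_vote` and `soft_vote` are illustrative and are not scikit-learn's implementation. Hard voting takes the majority of each model's predicted label, while soft voting averages the predicted probabilities before deciding.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class (fraud)
# from three models for four transactions; the values are made up.
proba = np.array([
    [0.90, 0.40, 0.45],  # one confident model, two slightly below 0.5
    [0.55, 0.60, 0.10],
    [0.20, 0.30, 0.10],
    [0.51, 0.52, 0.05],
])

def hard_vote(proba, threshold=0.5):
    """Majority vote over each model's thresholded class label."""
    labels = (proba > threshold).astype(int)
    return (labels.sum(axis=1) > proba.shape[1] / 2).astype(int)

def soft_vote(proba, threshold=0.5):
    """Average the probabilities across models, then threshold the mean."""
    return (proba.mean(axis=1) > threshold).astype(int)

hard = hard_vote(proba)
soft = soft_vote(proba)
print("hard:", hard)
print("soft:", soft)
```

In the first row a single confident model wins under soft voting but is outvoted under hard voting, and in the second and fourth rows two marginal votes win under hard voting while the low third probability pulls the average below 0.5.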

In [9]:
%%time

X_train_encoded = pd.read_parquet('X_train_encoded.csv')
X_test_encoded = pd.read_parquet('X_test_encoded.csv')

# Reset the indices to align them
X_train_encoded = X_train_encoded.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)

# Step 1: Separate majority class (0s) and minority class (1s)
X_train_encoded_0 = X_train_encoded[y_train == 0]
X_train_encoded_1 = X_train_encoded[y_train == 1]

y_train_0 = y_train[y_train == 0]
y_train_1 = y_train[y_train == 1]


for model in [voting_clf_hard, voting_clf_soft]:

    # Step 2: Downsample the majority class (0s) by frac
    # Downsample X_train_encoded_0 and use its indices to select the corresponding rows from y_train_0
    X_train_0_downsampled = X_train_encoded_0.sample(frac=0.5, random_state=42)
    y_train_0_downsampled = y_train_0.loc[X_train_0_downsampled.index]  # Use the same indices for y_train

    # Step 3: Concatenate the downsampled majority class with the minority class
    X_train_downsampled = pd.concat([X_train_0_downsampled, X_train_encoded_1])
    y_train_downsampled = pd.concat([y_train_0_downsampled, y_train_1])

    # Step 4: Shuffle the dataset to mix the downsampled rows
    X_train_downsampled_3 = X_train_downsampled.sample(frac=1, random_state=42)
    y_train_downsampled_3 = y_train_downsampled.loc[X_train_downsampled_3.index]  # Align y_train after shuffling

    # Step 5: Select the voting classifier for this iteration
    model_3 = model
    
    # Step 6: Train the model on the training data
    model_3.fit(X_train_downsampled_3, y_train_downsampled_3)

    # Step 7: Predict on the training data
    y_train_pred = model_3.predict(X_train_downsampled_3)

    # Step 8: Predict on the test data
    y_test_pred = model_3.predict(X_test_encoded)

    print('\n')
    print('This is a set of results for 50% downsample')
    print('\n')

    # Step 9: Generate the classification report for training data
    print("Classification Report for Training Data:")
    print(classification_report(y_train_downsampled_3, y_train_pred))

    # Step 10: Generate the classification report for test data
    print("\nClassification Report for Test Data:")
    print(classification_report(y_test, y_test_pred))

Parameters: { "sub_sample" } are not used.



[LightGBM] [Info] Number of positive: 7506, number of negative: 644584
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004017 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4748
[LightGBM] [Info] Number of data points in the train set: 652090, number of used features: 2084
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.011511 -> initscore=-4.452902
[LightGBM] [Info] Start training from score -4.452902


This is a set of results for 50% downsample


Classification Report for Training Data:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    644584
           1       0.89      0.69      0.78      7506

    accuracy                           1.00    652090
   macro avg       0.94      0.84      0.89    652090
weighted avg       1.00      1.00      1.00    652090


Classification Report 

Parameters: { "sub_sample" } are not used.



[LightGBM] [Info] Number of positive: 7506, number of negative: 644584
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003626 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 4748
[LightGBM] [Info] Number of data points in the train set: 652090, number of used features: 2084
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.011511 -> initscore=-4.452902
[LightGBM] [Info] Start training from score -4.452902


This is a set of results for 50% downsample


Classification Report for Training Data:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    644584
           1       0.88      0.69      0.77      7506

    accuracy                           1.00    652090
   macro avg       0.94      0.84      0.89    652090
weighted avg       1.00      1.00      1.00    652090


Classification Report 

#### Conclusions
1. The best voting-classifier results match those of LightGBM alone. They are fairly balanced: 66% recall and 71% precision
2. LightGBM appears to strike the best balance between precision and recall
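
As a quick worked check, the precision and recall quoted in the conclusions can be combined into a single F1 score, the harmonic mean of the two. The helper name below is an assumption for illustration; the inputs are the figures stated above.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures quoted in the conclusions: 71% precision, 66% recall
f1 = f1_from_precision_recall(0.71, 0.66)
print(f"F1 = {f1:.2f}")
```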