In [2]:
!pip install --upgrade scikit-learn imbalanced-learn



# Import required libraries

In [14]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE

# Loading the data

In [4]:
data = pd.read_csv('bs140513_032310.csv')
data.head()

   step       customer  age  gender  zipcodeOri       merchant  zipMerchant             category  amount  fraud
0     0  'C1093826151'  '4'     'M'     '28007'   'M348934600'      '28007'  'es_transportation'    4.55      0
1     0   'C352968107'  '2'     'M'     '28007'   'M348934600'      '28007'  'es_transportation'   39.68      0
2     0  'C2054744914'  '4'     'F'     '28007'  'M1823072687'      '28007'  'es_transportation'   26.89      0
3     0  'C1760612790'  '3'     'M'     '28007'   'M348934600'      '28007'  'es_transportation'   17.25      0
4     0   'C757503768'  '5'     'M'     '28007'   'M348934600'      '28007'  'es_transportation'   35.72      0


In [5]:
data.info() #checking for any missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 594643 entries, 0 to 594642
Data columns (total 10 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   step         594643 non-null  int64  
 1   customer     594643 non-null  object 
 2   age          594643 non-null  object 
 3   gender       594643 non-null  object 
 4   zipcodeOri   594643 non-null  object 
 5   merchant     594643 non-null  object 
 6   zipMerchant  594643 non-null  object 
 7   category     594643 non-null  object 
 8   amount       594643 non-null  float64
 9   fraud        594643 non-null  int64  
dtypes: float64(1), int64(2), object(7)
memory usage: 45.4+ MB


I counted the values in the fraud column to check whether the dataset is imbalanced, and it is: non-fraud transactions are approximately 82 times more numerous than fraudulent ones.

In [6]:
count_fraud = data.groupby('fraud').size().reset_index(name='count')
print(count_fraud)

   fraud   count
0      0  587443
1      1    7200
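
As a quick check of the "82 times" figure, the short sketch below recomputes the ratio and the fraud share from the two counts printed above. The counts are copied by hand from that output; nothing else is assumed.

```python
# Class counts copied from the output above.
counts = {0: 587443, 1: 7200}

total = sum(counts.values())
fraud_share = counts[1] / total          # fraction of all rows that are fraud
imbalance_ratio = counts[0] / counts[1]  # non-fraud rows per fraud row

print(f"total transactions: {total}")
print(f"fraud share: {fraud_share:.2%}")
print(f"non-fraud to fraud ratio: {imbalance_ratio:.1f} to 1")
```

The total matches the 594643 rows reported by `data.info()`, and fraud makes up only a little over 1% of the data.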


# Split the data into features and target variable

In [7]:
X = data.drop('fraud', axis=1)
y = data['fraud']   

# Split the data into training and testing sets

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [9]:
y_train.head()

37107     0
163300    0
108691    0
429389    0
222059    0
Name: fraud, dtype: int64

# Preprocessing the different types of data

In [10]:
# Preprocessing for numerical data (scaling)
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
numerical_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

# Preprocessing for categorical data (one-hot encoding)
categorical_features = X.select_dtypes(include=['object']).columns
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Apply preprocessing to training data
X_train_preprocessed = preprocessor.fit_transform(X_train)

# Apply SMOTE to the imbalanced dataset

Because the dataset is imbalanced, it has to be rebalanced before training. The two main techniques for this are oversampling and undersampling.

Ramentol et al. (2012) describe undersampling as deleting some of the data points belonging to the overrepresented (majority) class in order to shrink the original dataset. Undersampling techniques include EasyEnsemble, BalanceCascade and MLPUS.
They describe oversampling as expanding the original dataset by replicating or generating data points from the minority class. Oversampling techniques include ADASYN, MWMOTE, RAMOBoost and SMOTE.

In this project, we chose oversampling because it preserves all of the original information and tends to give better generalization performance than undersampling. In particular, we use SMOTE, which Chawla et al. (2002) define as a technique that increases the representation of the minority class in order to alleviate class imbalance. SMOTE augments the minority class by producing "synthetic" samples rather than just copying existing data points.
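
To make the idea of "synthetic" samples concrete, here is a minimal sketch of the interpolation step behind SMOTE, written with the standard library only. It is not the imbalanced-learn implementation used in the next cell: the function name `smote_sketch`, the toy points and the choice of `k=2` are all assumptions made for illustration (imbalanced-learn's `SMOTE` uses `k_neighbors=5` by default).

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical 2-D minority samples (made-up values, not taken from the dataset)
minority_points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority_points, n_new=5)
for point in new_points:
    print(point)
```

Each new point lies on the line segment between two existing minority points, so it stays inside the region the minority class already occupies instead of duplicating a row exactly.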


In [11]:
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train_preprocessed, y_train)

# Perform classification, train, test and evaluate the model

In addition to SMOTE, I decided to try out two ensemble methods for classification: Random Forest Classifier and Gradient Boosting Classifier.

Dong et al. (2020) state that ensemble learning techniques use multiple machine learning algorithms to generate weak predictive results, based on features extracted through a diversity of projections on the data, and combine those results through voting mechanisms to achieve better performance than any single constituent algorithm.

According to www.turing.com (n.d.), Random Forest is a classifier that trains many decision trees on different subsets of the dataset and combines them to improve predictive accuracy. Rather than depending on a single decision tree, a random forest gathers the output of multiple trees and makes the final prediction based on the majority vote.
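
The majority-vote idea can be sketched in a few lines. The votes below are invented for illustration, and `majority_vote` is a hypothetical helper, not part of scikit-learn. As far as I know, scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so treat this as a simplified picture of the description above.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common label among the individual tree predictions."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes from five trees for three transactions (1 = fraud)
votes_per_transaction = [
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1],
]
forest_predictions = [majority_vote(v) for v in votes_per_transaction]
print(forest_predictions)
```

A single tree that gets a transaction wrong is outvoted by the others, which is why the forest is usually more stable than any one tree.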

PANCHOTIA (2021) states that Gradient Boosting is the combination of gradient descent and boosting. Each new model in the sequence uses gradient descent to minimize the loss left by its predecessors, and the process is repeated until a more accurate estimate of the target variable is obtained. In contrast to other ensemble techniques, gradient boosting builds a succession of trees in which each successive tree attempts to fix the errors of its predecessor.
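
The "each tree fixes the errors of its predecessor" idea can be illustrated with a toy regression example. This is only a sketch under simplifying assumptions: it uses squared-error loss (where the negative gradient is simply the residual), one-split "stumps" instead of full trees, and made-up 1-D data. `GradientBoostingClassifier` works on a classification loss, so this shows the mechanism, not the exact model trained below.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on xs that best fits residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical 1-D toy data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

learning_rate = 0.5
preds = [sum(ys) / len(ys)] * len(ys)  # stage 0: predict the mean
history = [mse(ys, preds)]

for stage in range(5):
    # For squared-error loss the negative gradient is the residual y - prediction
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    # Add a damped correction from the new stump to the running prediction
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    history.append(mse(ys, preds))

print([round(h, 3) for h in history])
```

The printed training error shrinks at every stage, because each stump is fitted to whatever the previous stages still got wrong.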

To decide between Random Forest (RF) and Gradient Boosting (GB), I ran the code below for both and compared their classification reports. There was a major difference in precision for class 1 (fraud): 77% for RF against 26% for GB. Gradient Boosting flagged many legitimate transactions as fraud (5,968 false positives against 561 for RF), which also pulled its F1-score down to 0.41, although its recall was higher (99% against 87%). This might be because it is more sensitive to outliers or to the imbalanced data, which could lead to overfitting on the resampled training data. Random Forest might have performed better because it handles outliers better and because of its ensemble diversity: it builds multiple decision trees independently and combines them through voting, which makes it more robust to overfitting and improves generalization. Note that the two runs are not configured identically, since the Random Forest also uses class weights. The accuracies were quite close, RF at about 100% (99.5% before rounding) and GB at 97%, with Random Forest ahead, though accuracy is not our main focus.

NOTE: The Gradient Boosting run is shown first below; the Random Forest Classifier, which is the main model, follows right after.

In [15]:
# Train Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier(random_state=42)
gb_classifier.fit(X_resampled, y_resampled)

# Preprocess Testing Data
X_test_preprocessed = preprocessor.transform(X_test)

# Make predictions on testing data
y_pred = gb_classifier.predict(X_test_preprocessed)

# Evaluate the performance
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Confusion Matrix:
[[170309   5968]
 [    27   2089]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      0.97      0.98    176277
           1       0.26      0.99      0.41      2116

    accuracy                           0.97    178393
   macro avg       0.63      0.98      0.70    178393
weighted avg       0.99      0.97      0.98    178393



In [16]:
# Define the Random Forest Classifier with class weights
class_weights = {0: 1, 1: 5}
model = RandomForestClassifier(class_weight=class_weights)

# Train the model on the resampled data
model.fit(X_resampled, y_resampled)

# Apply preprocessing to testing data
X_test_preprocessed = preprocessor.transform(X_test)

# Predictions on the testing data
y_pred = model.predict(X_test_preprocessed)

# Evaluate the model
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Confusion Matrix:
[[175716    561]
 [   279   1837]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    176277
           1       0.77      0.87      0.81      2116

    accuracy                           1.00    178393
   macro avg       0.88      0.93      0.91    178393
weighted avg       1.00      1.00      1.00    178393



# Conclusion

Judged on accuracy alone, the system performed excellently, at about 100% (99.5% before rounding). However, accuracy by itself is misleading on a dataset this imbalanced, so we also have to consider precision, recall and F1-score for class 1 (fraud). Precision is the share of transactions flagged as fraud that really were fraudulent; at 77% it is 23 points short of 100%, which is reasonable given how imbalanced the test data is. Recall, the share of actual fraud cases that the model caught, was higher at 87%, meaning the model's sensitivity is quite high. The F1-score, the harmonic mean of precision and recall, was 0.81.
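
The metrics discussed here can be recomputed directly from the confusion matrices printed earlier. The sketch below assumes the `[[TN, FP], [FN, TP]]` layout that scikit-learn's `confusion_matrix` produces for labels 0 and 1; `fraud_metrics` is a hypothetical helper written for this check, and the numbers are copied by hand from the two outputs above.

```python
def fraud_metrics(confusion):
    """Compute metrics for the positive (fraud) class from a 2x2 matrix
    laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Confusion matrices copied from the outputs above
gradient_boosting = [[170309, 5968], [27, 2089]]
random_forest = [[175716, 561], [279, 1837]]

gb_metrics = fraud_metrics(gradient_boosting)
rf_metrics = fraud_metrics(random_forest)

# Baseline: label every test transaction as non-fraud
n_test = sum(map(sum, random_forest))
baseline_accuracy = sum(random_forest[0]) / n_test

for name, metrics in [("GB", gb_metrics), ("RF", rf_metrics)]:
    print(name, {k: round(v, 4) for k, v in metrics.items()})
print("always non-fraud accuracy:", round(baseline_accuracy, 4))
```

The baseline line is the reason accuracy is not the main focus: a model that never flags fraud would still score close to 99% accuracy on this test set while catching no fraud at all.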

Finally, I can say that this proof-of-concept system performed well and could be integrated into the bank's system.

# References

Ramentol, E., Caballero, Y., Bello, R. and Herrera, F., 2012. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowledge and Information Systems, 33, pp.245-265.

Chawla, N.V., Bowyer, K.W., Hall, L.O. and Kegelmeyer, W.P., 2002. SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, pp.321-357.

Dong, X., Yu, Z., Cao, W., Shi, Y. and Ma, Q., 2020. A survey on ensemble learning. Frontiers of Computer Science, 14, pp.241-258.

www.turing.com. (n.d.). Random Forest Algorithm - How It Works and Why It Is So Effective. [online] Available at: https://www.turing.com/kb/random-forest-algorithm#what-is-random-forest-algorithm?.

PANCHOTIA, R. (2021). Introduction To Gradient Boosting Classification. [online] Analytics Vidhya. Available at: https://medium.com/analytics-vidhya/introduction-to-gradient-boosting-classification-da4e81f54d3.

