# Fraud Detection in Online Transactions

This notebook works through the Kaggle competition IEEE-CIS Fraud Detection, hosted by the IEEE Computational Intelligence Society (IEEE-CIS):

Addison Howard, Bernadette Bouchon-Meunier, IEEE CIS, inversion, John Lei, Lynn@Vesta, Marcus2010, Prof. Hussein Abbass. (2019). IEEE-CIS Fraud Detection. Kaggle. https://kaggle.com/competitions/ieee-fraud-detection

# Methodology 
I will use the following **Machine Learning Models** in this project:
1. **Logistic Regression**: This is a basic model that can be used as a starting point. It's simple and interpretable.

2. **Decision Trees and Random Forests**: These models can capture complex patterns in the data. Random Forests, being an ensemble of decision trees, can provide a more robust solution against overfitting.

3. **Neural Networks**: Deep learning models, especially autoencoders, can be used for anomaly detection. An autoencoder tries to reconstruct its input data, and a high reconstruction error can indicate a fraudulent transaction.

4. **Support Vector Machines (SVM)**: With a nonlinear kernel, SVMs can be effective when the classes are not linearly separable.

5. **K-Nearest Neighbors (KNN)**: This is an instance-based learning algorithm that is simple and often effective.

    Related GPU notebook (note that it covers K-means clustering, not KNN): K-means via RapidsML cuDF: https://www.kaggle.com/code/suraj520/k-means-via-rapidsml-cudf-know-fit-infer
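
As a rough illustration of how the first two model families might be compared, the sketch below fits a logistic regression and a random forest on synthetic, imbalanced data from `make_classification` and scores both with ROC AUC, the competition's evaluation metric. The synthetic data, the roughly 3.5% positive rate, and every hyperparameter are assumptions made for illustration, not the settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the transaction table (assumption: ~3.5% positives)
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=8,
    weights=[0.965], random_state=0,
)

# Stratify so the rare class appears in both splits at the same rate
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "random_forest": RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
}

auc_scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # ROC AUC needs a score per row, so use the positive-class probability
    auc_scores[name] = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
    print(name, round(auc_scores[name], 3))
```

Accuracy would be misleading here because a model that predicts "not fraud" for every row is already right most of the time, which is why the comparison uses a ranking metric.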
    

The following **Techniques** will be involved:

1. **Principal Component Analysis (PCA)**: Because credit card records are personal information and cannot be shared publicly, PCA can be used to mask the original variables: the transformed components have different names and different numeric values. This gives dimensionality reduction and makes it possible to work with real data without revealing personal information.
2. **Handling Imbalanced Data**: Fraudulent transactions are typically far fewer than legitimate ones, which leads to class imbalance. Techniques such as oversampling, undersampling, and the Synthetic Minority Over-sampling Technique (SMOTE) can help balance the classes.
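
The two techniques above can be sketched on random data as follows. This is a minimal sketch under stated assumptions: the data is synthetic, the 95% explained-variance target is arbitrary, and plain random oversampling via `sklearn.utils.resample` stands in for SMOTE (SMOTE itself lives in the separate `imbalanced-learn` package and synthesizes new minority points rather than duplicating existing ones).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))             # synthetic features (assumption)
y = (rng.random(1000) < 0.05).astype(int)   # about 5% positives (assumption)

# PCA is scale-sensitive, so standardize first, then keep enough
# components to explain at least 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95, svd_solver="full")
X_pca = pca.fit_transform(X_scaled)

# Random oversampling: draw minority rows with replacement until the
# minority class is as large as the majority class.
X_min, X_maj = X_pca[y == 1], X_pca[y == 0]
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([
    np.zeros(len(X_maj), dtype=int),
    np.ones(len(X_min_up), dtype=int),
])
print(X_pca.shape, X_bal.shape, y_bal.mean())
```

Resampling should be applied to the training split only; a resampled validation set no longer reflects the real class ratio.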


The data comes from Vesta's real-world e-commerce transactions and contains a wide range of features from device type to product features.


# References

[1] Data Science in Banking: Fraud Detection. https://www.datacamp.com/blog/data-science-in-banking

[2] Modelling Credit Card Frauds. https://towardsdatascience.com/modelling-credit-card-fraud-detection-e3006dd212ab

[3] 1st Place Solution. https://www.kaggle.com/c/ieee-fraud-detection/discussion/111284



In [1]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import gc # Garbage Collector interface
import scipy as sp
# from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
# from sklearn.neighbors import KNeighborsClassifier
from sklearn import preprocessing 
from sklearn.impute import SimpleImputer
import seaborn as sns


In [2]:
Test_ID = pd.read_csv('./RawData/test_identity.csv')
Test_Trans = pd.read_csv('./RawData/test_transaction.csv')
Train_ID = pd.read_csv('./RawData/train_identity.csv')
Train_Trans = pd.read_csv('./RawData/train_transaction.csv')
Train_Trans.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 590540 entries, 0 to 590539
Columns: 394 entries, TransactionID to V339
dtypes: float64(376), int64(4), object(14)
memory usage: 1.7+ GB


In [None]:
Train_df = Train_Trans.merge(Train_ID, on = 'TransactionID', how='left')
Test_df = Test_Trans.merge(Test_ID, on = 'TransactionID', how='left')
Test_df.head()
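
To make the effect of the left merge concrete, here is a toy example with made-up rows. The `validate` and `indicator` arguments are optional additions that are not used in the cell above; they are shown only as one way to check that `TransactionID` is unique on both sides and to see which transactions have no identity record.

```python
import pandas as pd

# Made-up miniature tables; only the column names mirror the real data
trans = pd.DataFrame({
    "TransactionID": [1, 2, 3],
    "TransactionAmt": [10.0, 25.5, 7.0],
})
ident = pd.DataFrame({
    "TransactionID": [1, 3],
    "DeviceType": ["mobile", "desktop"],
})

merged = trans.merge(
    ident,
    on="TransactionID",
    how="left",             # keep every transaction, matched or not
    validate="one_to_one",  # raise if TransactionID repeats on either side
    indicator=True,         # add a _merge column: "both" or "left_only"
)
print(merged)
```

Transaction 2 has no identity row, so its `DeviceType` is NaN and its `_merge` value is `left_only`. A left merge therefore keeps the row count of the transaction table but introduces missing values in the identity columns.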


In [None]:
# The test identity columns are named "id-01" ... "id-38"; rename them to the
# "id_01" ... "id_38" form used in new_features.
Test_df = Test_df.rename(columns={f"id-{i:02d}": f"id_{i:02d}" for i in range(1, 39)})
Test_df.head()

In [None]:
new_features = ['TransactionAmt', 'ProductCD',
                'card1', 'card2', 'card3', 'card5','card6', 'addr1', 'addr2', 
                'dist1', 'dist2', 'P_emaildomain', 'R_emaildomain', 
                'C1', 'C2', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9','C10', 'C11', 'C12', 'C13', 'C14', 
                'D1', 'D2', 'D3', 'D4', 'D5','D10', 'D11', 'D15', 
                'M1', 'M2', 'M3', 'M4', 'M6', 'M7', 'M8','M9', 
                'V1', 'V3', 'V4', 'V6', 'V8', 'V11', 'V13', 'V14', 'V17','V20', 'V23', 'V26',
                'V27', 'V30', 'V36', 'V37', 'V40', 'V41','V44', 'V47', 'V48', 'V54', 'V56', 
                'V59', 'V62', 'V65', 'V67','V68', 'V70', 'V76', 'V78', 'V80', 'V82', 'V86', 
                'V88', 'V89','V91', 'V107', 'V108', 'V111', 'V115', 'V117', 'V120', 'V121',
                'V123', 'V124', 'V127', 'V129', 'V130', 'V136', 'V138', 'V139','V142', 'V147', 
                'V156', 'V160', 'V162', 'V165', 'V166', 'V169','V171', 'V173', 'V175', 'V176', 
                'V178', 'V180', 'V182', 'V185','V187', 'V188', 'V198', 'V203', 'V205', 'V207', 
                'V209', 'V210','V215', 'V218', 'V220', 'V221', 'V223', 'V224', 'V226', 'V228',
                'V229', 'V234', 'V235', 'V238', 'V240', 'V250', 'V252', 'V253','V257', 'V258', 
                'V260', 'V261', 'V264', 'V266', 'V267', 'V271','V274', 'V277', 'V281', 'V283', 
                'V284', 'V285', 'V286', 'V289','V291', 'V294', 'V296', 'V297', 'V301', 'V303', 
                'V305', 'V307','V309', 'V310', 'V314', 'V320', 
                'id_01', 'id_02', 'id_03', 'id_04','id_05', 'id_06', 'id_09', 'id_10', 
                'id_11', 'id_12', 'id_13','id_15', 'id_16', 'id_17', 'id_18', 'id_19', 
                'id_20', 'id_28','id_29', 'id_31', 'id_35', 'id_36', 'id_37', 'id_38', 
                'DeviceType','DeviceInfo']
len(new_features)
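
A long hand-written feature list is easy to get subtly wrong: in Python, two adjacent string literals with no comma between them are concatenated into one string without any error. The check below is a small sketch of how such a list could be validated against the DataFrame's columns. The miniature list and the empty frame are hypothetical stand-ins for `new_features` and `Train_df`.

```python
import pandas as pd

# Hypothetical miniature of the feature list; the missing comma is deliberate
features = ["dist2", "P_emaildomain" "R_emaildomain", "C1"]

# Empty frame standing in for Train_df; only its column names matter here
df = pd.DataFrame(columns=["dist2", "P_emaildomain", "R_emaildomain", "C1", "isFraud"])

# Names in the list that do not exist as columns
missing = [name for name in features if name not in df.columns]

# Names that appear more than once in the list
duplicates = sorted({name for name in features if features.count(name) > 1})

print(len(features))  # 3 entries, not the 4 that were intended
print(missing)        # the fused name shows up here
print(duplicates)
```

Running a check like this right after defining the list catches fused or misspelled names before they silently drop columns in the filtering step.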

In [None]:
# isFraud is not in new_features, so keep a copy of the target before filtering.
# (y_train is a name introduced here; the original cells do not define it.)
y_train = Train_df['isFraud'].copy()

# Drop every column that is not a selected feature, in a single call
Train_df = Train_df.drop(columns=[col for col in Train_df.columns if col not in new_features])

# gc.collect() with no arguments runs a full collection and returns the number
# of unreachable objects found. The optional generation argument (0 to 2) selects
# which generation to collect; an invalid generation raises a ValueError.
gc.collect()
print(Train_df.shape)

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
Train_df.head()
# Training process

In [None]:
numerical_features = [feature for feature in Train_df.columns if Train_df[feature].dtypes != 'O']
print('Number of numerical variables: ', len(numerical_features))

# Impute missing values in the numerical variables
for feature in numerical_features:
    # Mean of the observed (non-missing) values
    mean_value = Train_df[feature].mean()
    # Indicator column: 1 where the original value was missing, else 0
    Train_df[feature + 'nan'] = np.where(Train_df[feature].isnull(), 1, 0)
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    Train_df[feature] = Train_df[feature].fillna(mean_value)
df_num_train = Train_df[numerical_features]
gc.collect()
Train_df[numerical_features].head()

In [None]:
categorical_features = [feature for feature in Train_df.columns if Train_df[feature].dtypes == 'O']

print('Number of categorical variables: ', len(categorical_features)) 

# Subset the categorical variables
df_cat_train = Train_df[categorical_features]
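
The numerical/categorical split in the two cells above is done by comparing each column's dtype with `'O'`. As a hedged alternative, `DataFrame.select_dtypes` performs the same kind of split in one call per group. The frame below is made up for illustration.

```python
import pandas as pd

# Made-up frame mixing numeric and string columns
df = pd.DataFrame({
    "TransactionAmt": [10.0, 25.5],
    "card1": [1000, 2000],
    "ProductCD": ["W", "C"],
    "DeviceType": ["mobile", None],
})

# Numeric columns (ints and floats)
num_cols = df.select_dtypes(include="number").columns.tolist()

# Everything else is treated as categorical in this sketch
cat_cols = df.select_dtypes(exclude="number").columns.tolist()

print(num_cols)
print(cat_cols)
```

Note that a column of integer codes (for example a card or address identifier) is numeric by dtype even if it is categorical in meaning, so a dtype-based split is only a first pass.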