<a href="https://colab.research.google.com/github/JapiKredi/EDA_extensive_library/blob/main/Credit_Card_Fraud_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Credit-Card Fraud Detection**



1. Introduction
2. Preliminaries - load packages
3. Import dataset
4. Exploratory data analysis
5. Predictive modeling
6. Results and conclusion
7. References

## **1. Introduction**


- Credit card fraud is when someone uses our credit card or credit account to make a purchase we didn't authorize.

- Fraudsters stole ₹615.39 crore in more than 1.17 lakh cases of credit and debit card fraud over 10 years (April 2009 to September 2019), according to Reserve Bank of India (RBI) data.

- So, in this project, we attempt to detect credit-card frauds.

- So, let's get started. First, we take a look at the dataset.

- The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

- It contains only numerical input variables which are the result of a PCA transformation.

- Due to confidentiality issues, the original features and more background information about the data are not provided.

  - Features V1, V2, ... V28 are the principal components obtained with PCA;
  - The only features which have not been transformed with PCA are Time and Amount. Feature Time contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature Amount is the transaction amount; it can be used, for instance, for example-dependent cost-sensitive learning.
  - Feature Class is the response variable and it takes value 1 in case of fraud and 0 otherwise.
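To make the imbalance concrete, the short sketch below recomputes the fraud share from the two counts quoted above. The `always_legit_accuracy` name is only an illustration (it is not part of the notebook): it shows how high plain accuracy would be for a model that never flags fraud, which is why accuracy alone is a poor yardstick here.

```python
# Figures quoted in the dataset description above
n_transactions = 284_807
n_frauds = 492

# Fraction of transactions that belong to the positive (fraud) class
fraud_share = n_frauds / n_transactions

# Accuracy of a hypothetical model that labels every transaction as legitimate
always_legit_accuracy = 1 - fraud_share

print(f"Fraud share: {fraud_share:.4%}")
print(f"Accuracy of an 'always legitimate' baseline: {always_legit_accuracy:.4%}")
```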

## **2. Preliminaries - load packages**

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

In [None]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)

In [None]:
import gc
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from catboost import CatBoostClassifier
from sklearn import svm
import lightgbm as lgb
from lightgbm import LGBMClassifier
import xgboost as xgb

In [None]:
pd.set_option('display.max_columns', 100)

In [None]:
RFC_METRIC = 'gini'  # split criterion (Gini impurity) used by RandomForestClassifier
NUM_ESTIMATORS = 100 # number of estimators used for RandomForestClassifier
NO_JOBS = 4 # number of parallel jobs used for RandomForestClassifier

In [None]:
#TRAIN/VALIDATION/TEST SPLIT
#VALIDATION
VALID_SIZE = 0.20 # simple validation using train_test_split
TEST_SIZE = 0.20 # test size using train_test_split

In [None]:
#CROSS-VALIDATION
NUMBER_KFOLDS = 5 #number of KFolds for cross-validation

In [None]:
RANDOM_STATE = 2024

MAX_ROUNDS = 1000 #lgb iterations
EARLY_STOP = 50 #lgb early stop
OPT_ROUNDS = 1000  #To be adjusted based on best validation rounds
VERBOSE_EVAL = 50 #Print out metric result

## **3. Import dataset**

In [None]:
from google.colab import drive
drive.mount('/content/drive')


In [None]:
path = '/content/drive/MyDrive/CreditCard_Fraud_Detection/creditcard.csv'

df = pd.read_csv(path)

## **4. Exploratory data analysis**

Let's first check the shape of the dataset.

In [None]:
print("Credit Card Fraud Detection data -  rows:", df.shape[0]," columns:", df.shape[1])

We can see that the dataset contains 284807 rows and 31 columns.

Now, we will take a look at the dataset.

In [None]:
df.head()

Now, let's take a closer look at the dataset.

In [None]:
df.info()

- We can see that all the 31 features are of numerical type - 30 are of float data type and 1 is of integer data type.

- Now, let's take a more in-depth look at the data.

In [None]:
df.describe()

- If we look at the time feature, we can confirm that the data contains 284,807 transactions, during 2 consecutive days (or 172792 seconds).
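As a quick check of the two-day claim, the sketch below converts the largest **Time** value into hours. The `hour_of_day` helper is a hypothetical addition, not something the notebook defines; it buckets a transaction into an hour slot relative to the first transaction, since the actual clock time of that first transaction is not given.

```python
SECONDS_PER_HOUR = 3600

max_time_seconds = 172_792  # largest value of Time reported above
span_hours = max_time_seconds / SECONDS_PER_HOUR

def hour_of_day(seconds_since_first):
    """Hour bucket (0-23) counted from the first transaction, not from midnight."""
    return int(seconds_since_first // SECONDS_PER_HOUR) % 24

print(f"Span covered: {span_hours:.2f} hours")
print("Bucket for t=0 s:", hour_of_day(0))
print("Bucket for t=90000 s:", hour_of_day(90_000))
```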

#### **Check for missing values**

- Now, let's check for missing values in the dataset.

In [None]:
df.isnull().any()

- We can see that the dataset does not contain any missing values. We can confirm this further.

In [None]:
df.isnull().sum().sum()

- We can see that there are no missing values in the entire dataset.

#### **Visualize distribution of time**

In [None]:
plt.figure(figsize=(8,6))
plt.title('Distribution of Time')
sns.histplot(df.Time)

#### **Visualize fraudulent Vs normal transactions**

In [None]:
#fraud vs. normal transactions
counts = df.Class.value_counts()
normal = counts[0]
fraudulent = counts[1]
perc_normal = (normal/(normal+fraudulent))*100
perc_fraudulent = (fraudulent/(normal+fraudulent))*100
print('There were {} non-fraudulent transactions ({:.3f}%) and {} fraudulent transactions ({:.3f}%).'.format(normal, perc_normal, fraudulent, perc_fraudulent))

In [None]:
plt.figure(figsize=(8,6))
sns.barplot(x=counts.index, y=counts)
plt.title('Count of Fraudulent vs. Non-Fraudulent Transactions')
plt.ylabel('Count')
plt.xlabel('Class (0:Non-Fraudulent, 1:Fraudulent)')

#### **Features Correlation**

In [None]:
plt.figure(figsize = (12,10))
plt.title('Credit card transactions features correlation plot')
corr = df.corr()
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,linewidths=.1,cmap="Reds")
plt.show()

- As expected, there is no notable correlation between features **V1-V28**.
- There are certain correlations between some of these features and **Time** (inverse correlation with **V3**) and **Amount** (direct correlation with **V7** and **V20**, inverse correlation with **V2** and **V5**).

- Let's plot the correlated and inversely correlated values on the same graph.

- Let's start with the directly correlated values: {**V20;Amount**} and {**V7;Amount**}.

In [None]:
s = sns.lmplot(x = 'V20', y = 'Amount',data = df, hue = 'Class', fit_reg = True, scatter_kws = {'s':2})
plt.show()

In [None]:
s = sns.lmplot(x = 'V7', y = 'Amount',data = df, hue = 'Class', fit_reg = True, scatter_kws = {'s':2})
plt.show()

We can confirm that the two pairs of features are correlated (the regression lines for **Class = 0** have a positive slope, whilst the regression lines for **Class = 1** have a smaller positive slope).

Let's now plot the inversely correlated values.

In [None]:
s = sns.lmplot(x = 'V2', y = 'Amount', data = df, hue = 'Class', fit_reg = True, scatter_kws = {'s':2})
plt.show()

In [None]:
s = sns.lmplot(x = 'V5', y = 'Amount', data = df, hue = 'Class', fit_reg = True, scatter_kws = {'s':2})
plt.show()

- We can confirm that the two pairs of features are inversely correlated (the regression lines for **Class = 0** have a negative slope while the regression lines for **Class = 1** have a very small negative slope).
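The values shown in the heatmap come from `df.corr()`, whose default method is the Pearson coefficient. As a minimal sketch of what that number measures, the function below computes it by hand on toy lists; the data are made up for illustration and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values, not taken from the dataset
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
direct = [2.0, 4.1, 5.9, 8.2, 9.8]    # grows with the feature
inverse = [9.8, 8.2, 5.9, 4.1, 2.0]   # shrinks as the feature grows

print("direct :", round(pearson(feature, direct), 3))
print("inverse:", round(pearson(feature, inverse), 3))
```

A coefficient close to +1 corresponds to the positive regression slopes seen above, and a coefficient close to -1 to the negative ones.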

#### **Features density plot**

In [None]:
# Plot every column except the target itself
var = df.columns.drop('Class')

t0 = df.loc[df['Class'] == 0]
t1 = df.loc[df['Class'] == 1]

sns.set_style('whitegrid')
# One grid of axes; the two unused panels at the end stay empty
fig, axes = plt.subplots(8, 4, figsize=(16, 28))

for ax, feature in zip(axes.flat, var):
    sns.kdeplot(t0[feature], label="Class = 0", ax=ax)
    sns.kdeplot(t1[feature], label="Class = 1", ax=ax)
    ax.set_xlabel(feature, fontsize=12)
    ax.tick_params(axis='both', which='major', labelsize=12)
plt.show()


- For some of the features we can observe a good selectivity in terms of distribution for the two values of **Class**: **V4** and **V11** have clearly separated distributions for **Class** values 0 and 1, **V12**, **V14**, **V18** are partially separated, **V1**, **V2**, **V3**, **V10** have a quite distinct profile, whilst **V25**, **V26**, **V28** have similar profiles for the two values of **Class**.

- In general, with just a few exceptions (**Time** and **Amount**), the feature distributions for legitimate transactions (**Class = 0**) are centered around 0, sometimes with a long tail at one of the extremities. At the same time, the fraudulent transactions (**Class = 1**) have a skewed (asymmetric) distribution.

## **5. Predictive Modelling**

- Now, we will move on to predictive modelling. We will define the predictors and the target and evaluate the performance of several models on them. So, let's do it.

#### **Define predictors and target values**

- Now, let's define the predictor features and target values.

In [None]:
target = 'Class'
predictors = ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',\
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',\
       'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28',\
       'Amount']

#### **Split data in train, test and validation set**

- Now, let's define train, validation and test sets.

- First, we will split the dataset into train and test sets as follows:

In [None]:
train_df, test_df = train_test_split(df, test_size=TEST_SIZE, random_state=RANDOM_STATE, shuffle=True )


- Now, we will split the training set into train and validation set.

In [None]:
train_df, valid_df = train_test_split(train_df, test_size=VALID_SIZE, random_state=RANDOM_STATE, shuffle=True )
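Because the validation set is carved out of the training portion, the two 20% settings do not give three equal-looking parts. The sketch below works out the resulting shares from `TEST_SIZE` and `VALID_SIZE`. The `expected_frauds` figures are only what a perfectly proportional split would give; the splits above are not stratified, so the actual counts will differ somewhat.

```python
TEST_SIZE = 0.20
VALID_SIZE = 0.20
N_FRAUDS = 492  # from the dataset description

# Share of the full dataset that ends up in each part
test_share = TEST_SIZE
valid_share = (1 - TEST_SIZE) * VALID_SIZE
train_share = (1 - TEST_SIZE) * (1 - VALID_SIZE)

# Fraud counts under an idealized proportional split (an assumption, see above)
expected_frauds = {
    "train": N_FRAUDS * train_share,
    "valid": N_FRAUDS * valid_share,
    "test": N_FRAUDS * test_share,
}

for name, share in [("train", train_share), ("valid", valid_share), ("test", test_share)]:
    print(f"{name}: {share:.0%} of rows, about {expected_frauds[name]:.0f} frauds")
```

With so few positives in the validation and test parts, a handful of misclassified frauds can move the scores noticeably.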

### **Random Forest Classifier**

- Now, we will first start with the Random Forest Classifier model.

- Define model parameters - Let's set the parameters for the model. Let's run a model using the training set for training. Then, we will use the validation set for validation.

- The forest uses **GINI** (Gini impurity) as its split criterion. For validation we use the **Receiver Operating Characteristic - Area Under Curve (ROC-AUC)** score, which is related to the Gini coefficient by **GINI = 2 * (AUC) - 1**. Number of estimators is set to **100** and number of parallel jobs is set to **4**.

- We start by initializing the **RandomForestClassifier**.

In [None]:
clf = RandomForestClassifier(n_jobs=NO_JOBS,
                             random_state=RANDOM_STATE,
                             criterion=RFC_METRIC,
                             n_estimators=NUM_ESTIMATORS,
                             verbose=False)
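Two different quantities carry the name Gini here, so a small sketch may help keep them apart. In scikit-learn, `criterion='gini'` selects the Gini impurity used to score candidate splits inside each tree, while the formula GINI = 2 * AUC - 1 quoted above is the Gini coefficient derived from a ROC-AUC score. Both helper functions below are illustrative and not part of the notebook.

```python
def gini_impurity(class_probs):
    """Gini impurity of a tree node, given the class proportions in that node."""
    return 1.0 - sum(p * p for p in class_probs)

def gini_from_auc(auc):
    """Gini coefficient of a classifier, derived from its ROC-AUC score."""
    return 2.0 * auc - 1.0

# A node with an even class mix is maximally impure for two classes
print("Impurity of a 50/50 node:", gini_impurity([0.5, 0.5]))
# A pure node has zero impurity
print("Impurity of a pure node :", gini_impurity([1.0, 0.0]))
# A random classifier (AUC = 0.5) has a Gini coefficient of zero
print("Gini at AUC 0.5         :", gini_from_auc(0.5))
```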

- Now, let's train the Random Forest Classifier.

In [None]:
clf.fit(train_df[predictors], train_df[target].values)

- Now, let's predict the target values for the valid_df data, using predict function.

In [None]:
preds = clf.predict(valid_df[predictors])

#### **Features importance**

- Now, let's visualize the features importance.

In [None]:
tmp = pd.DataFrame({'Feature': predictors, 'Feature importance': clf.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (7,4))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show()

- The most important features are **V17**, **V12**, **V14**, **V16**, **V11**, **V10**.

#### **Confusion matrix**

- Now, let's plot the confusion matrix for the results we obtained.

In [None]:
cm = pd.crosstab(valid_df[target].values, preds, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm,
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()

#### **Type I error** and **Type II error**

- Now, the confusion matrix is not a very good tool to represent the results in the case of highly unbalanced data, like in this case. We actually need a different metric that accounts at the same time for the sensitivity and specificity of the method we are using, so that we minimize both Type I errors and Type II errors.

- **Null Hypothesis (H0)** - The transaction is not a fraud.
- **Alternative Hypothesis (H1)** - The transaction is a fraud.

- **Type I error** - We reject the null hypothesis when the null hypothesis is actually true.
- **Type II error** - We fail to reject the null hypothesis when the alternative hypothesis is true.

- **Cost of Type I error** - We erroneously presume that the transaction is a fraud, and a true transaction is rejected.
- **Cost of Type II error** - We erroneously presume that the transaction is not a fraud, and a fraudulent transaction is accepted.


- So, **Type II error** is more dangerous than a **Type I error**.
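The definitions above map directly onto the four cells of a confusion matrix. The sketch below, using a hypothetical `error_rates` helper and made-up counts, shows how the Type I rate (false positives among legitimate transactions) and the Type II rate (missed frauds among actual frauds) would be derived.

```python
def error_rates(tn, fp, fn, tp):
    """Error rates from confusion-matrix counts (fraud is the positive class)."""
    return {
        "type_1_rate": fp / (fp + tn),   # legitimate transactions flagged as fraud
        "type_2_rate": fn / (fn + tp),   # frauds accepted as legitimate
        "sensitivity": tp / (tp + fn),   # share of frauds that are caught
        "specificity": tn / (tn + fp),   # share of legitimate transactions accepted
    }

# Hypothetical counts for illustration only, not results from this notebook
rates = error_rates(tn=9_980, fp=20, fn=5, tp=15)

for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the Type I rate looks negligible while a quarter of the frauds slip through, which is the kind of asymmetry the raw matrix tends to hide on unbalanced data.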

#### **ROC-AUC Score**

- Now, let's calculate the ROC-AUC Score of the Random Forest Classifier model.

In [None]:
roc_auc_score(valid_df[target].values, preds)

- So, the **ROC-AUC score** obtained with Random Forest Classifier is 0.85.
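One thing worth keeping in mind when reading this score: the cell above passes the hard 0/1 output of `predict` to `roc_auc_score`. With hard labels the ROC curve has a single operating point, and the resulting value equals the average of sensitivity and specificity. The sketch below illustrates the difference on toy data with a hand-written pairwise AUC (an illustrative helper, not scikit-learn's implementation). In scikit-learn, the usual way to score probabilities instead would be to pass `clf.predict_proba(X)[:, 1]`.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy labels and scores, made up for illustration
y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]
labels = [1 if s >= 0.5 else 0 for s in scores]   # hard predictions at a 0.5 threshold

auc_scores = roc_auc(y_true, scores)
auc_labels = roc_auc(y_true, labels)

# Balanced accuracy of the hard predictions
tpr = sum(1 for y, l in zip(y_true, labels) if y == 1 and l == 1) / y_true.count(1)
tnr = sum(1 for y, l in zip(y_true, labels) if y == 0 and l == 0) / y_true.count(0)
balanced_accuracy = (tpr + tnr) / 2

print("AUC from scores     :", auc_scores)
print("AUC from hard labels:", auc_labels)
print("Balanced accuracy   :", balanced_accuracy)
```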

### **AdaBoost Classifier**

- **AdaBoost Classifier** stands for Adaptive Boosting Classifier.

#### **Initialize the model**

- Let's set the parameters for the model and initialize the model.



In [None]:
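# Note: 'SAMME.R' is deprecated in recent scikit-learn releases; drop the argument if it is rejected
clf = AdaBoostClassifier(random_state=RANDOM_STATE,
                         algorithm='SAMME.R',
                         learning_rate=0.8,
                         n_estimators=NUM_ESTIMATORS)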


#### **Fit the model**

- Now, let's fit the model.

In [None]:
clf.fit(train_df[predictors], train_df[target].values)

#### **Predict the target values**

- Let's now predict the target values for the valid_df data, using predict function.

In [None]:
preds = clf.predict(valid_df[predictors])

#### **Features importance**

- Let's see the features importance.

In [None]:
tmp = pd.DataFrame({'Feature': predictors, 'Feature importance': clf.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (8,6))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show()

#### **Confusion matrix**

- Let's visualize the confusion matrix.

In [None]:
cm = pd.crosstab(valid_df[target].values, preds, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm,
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()

#### **ROC-AUC Score**

Let's also calculate the ROC-AUC.

In [None]:
roc_auc_score(valid_df[target].values, preds)

The ROC-AUC score obtained with AdaBoostClassifier is 0.83.

### **CatBoost Classifier**

- **CatBoost Classifier** is a gradient boosting algorithm on decision trees with built-in support for categorical data.

#### **Initialize the model**

- Let's set the parameters for the model and initialize the model.

In [None]:
clf = CatBoostClassifier(iterations=500,
                             learning_rate=0.02,
                             depth=12,
                             eval_metric='AUC',
                             random_seed = RANDOM_STATE,
                             bagging_temperature = 0.2,
                             od_type='Iter',
                             metric_period = VERBOSE_EVAL,
                             od_wait=100)

In [None]:
clf.fit(train_df[predictors], train_df[target].values,verbose=True)


#### **Predict the target values**

- Let's now predict the target values for the **valid_df** data, using the predict function.

In [None]:
preds = clf.predict(valid_df[predictors])

#### **Features importance**

- Let's also look at the features importance.

In [None]:
tmp = pd.DataFrame({'Feature': predictors, 'Feature importance': clf.feature_importances_})
tmp = tmp.sort_values(by='Feature importance',ascending=False)
plt.figure(figsize = (8,6))
plt.title('Features importance',fontsize=14)
s = sns.barplot(x='Feature',y='Feature importance',data=tmp)
s.set_xticklabels(s.get_xticklabels(),rotation=90)
plt.show()

#### **Confusion matrix**

- Let's visualize the confusion matrix.

In [None]:
cm = pd.crosstab(valid_df[target].values, preds, rownames=['Actual'], colnames=['Predicted'])
fig, (ax1) = plt.subplots(ncols=1, figsize=(5,5))
sns.heatmap(cm,
            xticklabels=['Not Fraud', 'Fraud'],
            yticklabels=['Not Fraud', 'Fraud'],
            annot=True,ax=ax1,
            linewidths=.2,linecolor="Darkblue", cmap="Blues")
plt.title('Confusion Matrix', fontsize=14)
plt.show()

#### **ROC-AUC Score**

- Now, let's also calculate the ROC-AUC Score.


In [None]:
roc_auc_score(valid_df[target].values, preds)

- The ROC-AUC score obtained with CatBoostClassifier is 0.86.



### **XGBoost Classifier**


- **XGBoost** is a gradient boosting algorithm.

- Let's initialize the model.

- We initialize the DMatrix objects for training and validation, starting from the datasets. We also set some of the parameters used for the model tuning.

In [None]:
# Prepare the train and valid datasets
dtrain = xgb.DMatrix(train_df[predictors], train_df[target].values)
dvalid = xgb.DMatrix(valid_df[predictors], valid_df[target].values)
dtest = xgb.DMatrix(test_df[predictors], test_df[target].values)


In [None]:
# What to monitor (in this case, train and valid)
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]


In [None]:
# Set xgboost parameters
params = {}
params['objective'] = 'binary:logistic'
params['eta'] = 0.039
params['verbosity'] = 0  # replaces the 'silent' flag, which newer XGBoost versions no longer accept
params['max_depth'] = 2
params['subsample'] = 0.8
params['colsample_bytree'] = 0.9
params['eval_metric'] = 'auc'
params['random_state'] = RANDOM_STATE


#### **Train the model**

- Now, let's train the model.

In [None]:
model = xgb.train(params,
                dtrain,
                num_boost_round=MAX_ROUNDS,
                evals=watchlist,
                early_stopping_rounds=EARLY_STOP,
                maximize=True,
                verbose_eval=VERBOSE_EVAL)

- The best validation score (ROC-AUC) was 0.986, for round 258.
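The best round reported here comes from early stopping: training halts once the validation metric has not improved for `EARLY_STOP` consecutive rounds, and the best round seen so far is kept. The plain-Python sketch below mimics that rule on a toy score sequence; `early_stopping_round` is a hypothetical helper, and the exact bookkeeping inside XGBoost may differ in details.

```python
def early_stopping_round(valid_scores, patience):
    """Return (best_round, best_score, last_round_run) for a metric to maximize."""
    best_round, best_score = 0, float("-inf")
    for rnd, score in enumerate(valid_scores):
        if score > best_score:
            # New best: remember it and keep training
            best_round, best_score = rnd, score
        elif rnd - best_round >= patience:
            # No improvement for `patience` rounds: stop here
            return best_round, best_score, rnd
    return best_round, best_score, len(valid_scores) - 1

# Toy validation AUC per boosting round, made up for illustration
toy_auc = [0.90, 0.95, 0.97, 0.96, 0.96, 0.96]
best_round, best_score, last_round = early_stopping_round(toy_auc, patience=2)

print("best round:", best_round, "score:", best_score, "stopped at:", last_round)
```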

#### **Plot variable importance**

In [None]:
fig, (ax) = plt.subplots(ncols=1, figsize=(12,8))
xgb.plot_importance(model, height=0.8, title="Features importance (XGBoost)", ax=ax, color="green")
plt.show()

#### **Predict test set**

- We used the train and validation sets for training and validation. We will use the trained model now to predict the target value for the test set.

In [None]:
preds = model.predict(dtest)

#### **ROC-AUC Score**

- Now, let's calculate the ROC-AUC Score.

In [None]:
roc_auc_score(test_df[target].values, preds)

- The ROC-AUC score for the prediction of fresh data (test set) is 0.977.

### **LightGBM Classifier**

- Now, we will predict with another gradient boosting algorithm - LightGBM Classifier model.

#### **Define model parameters**

- Now, let's set the parameters for the model. We will use these parameters for the lgb model.

In [None]:
params = {
          'boosting_type': 'gbdt',
          'objective': 'binary',
          'metric':'auc',
          'learning_rate': 0.05,
          'num_leaves': 7,  # we should let it be smaller than 2^(max_depth)
          'max_depth': 4,  # -1 means no limit
          'min_child_samples': 100,  # Minimum number of data need in a child(min_data_in_leaf)
          'max_bin': 100,  # Number of bucketed bin for feature values
          'subsample': 0.9,  # Subsample ratio of the training instance.
          'subsample_freq': 1,  # frequence of subsample, <=0 means no enable
          'colsample_bytree': 0.7,  # Subsample ratio of columns when constructing each tree.
          'min_child_weight': 0,  # Minimum sum of instance weight(hessian) needed in a child(leaf)
          'min_split_gain': 0,  # lambda_l1, lambda_l2 and min_gain_to_split to regularization
          'nthread': 8,
          'verbose': 0,
          'scale_pos_weight':150, # because training data is extremely unbalanced
         }
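For context on the `scale_pos_weight` setting, a commonly cited starting point (for instance in the XGBoost parameter documentation) is the ratio of negative to positive examples. The sketch below computes that ratio from the full-dataset counts quoted in the introduction; it is only a reference point, since the notebook uses 150 and the ratio in the training split will differ slightly. It also checks the constraint noted in the comment on `num_leaves`.

```python
# Counts from the dataset description (full dataset, not the training split)
n_transactions = 284_807
n_frauds = 492
n_legit = n_transactions - n_frauds

# Negatives per positive: a heuristic upper reference for scale_pos_weight
balance_ratio = n_legit / n_frauds

# num_leaves should not exceed what a tree of depth max_depth can hold
num_leaves, max_depth = 7, 4
leaves_fit_depth = num_leaves <= 2 ** max_depth

print(f"negatives per positive: {balance_ratio:.1f}")
print("num_leaves fits max_depth:", leaves_fit_depth)
```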

#### **Initialize the model**

- Now, let's initialize the model, creating the Datasets data structures from the train and validation sets.

In [None]:
dtrain = lgb.Dataset(train_df[predictors].values,
                     label=train_df[target].values,
                     feature_name=predictors)

dvalid = lgb.Dataset(valid_df[predictors].values,
                     label=valid_df[target].values,
                     feature_name=predictors)

#### **Run the model**

- Now, let's run the model, using the **train** function.

In [None]:
evals_results = {}

# LightGBM 4.x takes early stopping, logging and metric recording as callbacks
model = lgb.train(params,
                  dtrain,
                  valid_sets=[dtrain, dvalid],
                  valid_names=['train','valid'],
                  num_boost_round=MAX_ROUNDS,
                  callbacks=[lgb.early_stopping(2*EARLY_STOP),
                             lgb.log_evaluation(VERBOSE_EVAL),
                             lgb.record_evaluation(evals_results)])

- We can see that the best validation score was obtained for round 85, for which AUC ~= 0.974.

#### **Plot variable importance**

- Now, let's plot variable importance

In [None]:
fig, (ax) = plt.subplots(ncols=1, figsize=(10,8))
lgb.plot_importance(model, height=0.8, title="Features importance (LightGBM)", ax=ax, color="green")
plt.show()

#### **Predict test data**


- Now, let's predict the target for the test data.

In [None]:
preds = model.predict(test_df[predictors])


#### **ROC-AUC Score**

- Now, let's calculate the ROC-AUC score for the prediction.

In [None]:
roc_auc_score(test_df[target].values, preds)

- The ROC-AUC score obtained for the test set is 0.946.

#### **Training and validation using cross-validation**

- We will now use cross-validation (KFolds) with 5 folds. The data is divided into 5 folds and, by rotation, we train on 4 folds (n-1) and validate on the 5th (nth) fold.

- The test set prediction is calculated as the average of the predictions of the 5 fold models.
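The rotation and the averaging can be sketched without any library. The toy example below uses a hypothetical `kfold_indices` helper with contiguous, unshuffled folds (the notebook itself uses scikit-learn's `KFold` with `shuffle=True`), and made-up per-fold predictions for a single test row.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, valid_idx) pairs over contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size

n_samples, n_splits = 10, 5
times_validated = [0] * n_samples

# Made-up predictions of the 5 fold models for one test row
toy_fold_test_preds = [0.2, 0.4, 0.3, 0.5, 0.1]
test_pred = 0.0

for fold, (train_idx, valid_idx) in enumerate(kfold_indices(n_samples, n_splits)):
    for i in valid_idx:
        times_validated[i] += 1          # each row is held out exactly once
    test_pred += toy_fold_test_preds[fold] / n_splits   # running average over folds

print("times each row was held out:", times_validated)
print("averaged test prediction:", round(test_pred, 3))
```

Every training row receives exactly one out-of-fold prediction, which is what makes the overall out-of-fold AUC a fair estimate on the training data.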

In [None]:
kf = KFold(n_splits = NUMBER_KFOLDS, random_state = RANDOM_STATE, shuffle = True)

# Create arrays and dataframes to store results
oof_preds = np.zeros(train_df.shape[0])
test_preds = np.zeros(test_df.shape[0])
feature_importance_df = pd.DataFrame()
n_fold = 0
for train_idx, valid_idx in kf.split(train_df):
    train_x, train_y = train_df[predictors].iloc[train_idx],train_df[target].iloc[train_idx]
    valid_x, valid_y = train_df[predictors].iloc[valid_idx],train_df[target].iloc[valid_idx]

    evals_results = {}
    model =  LGBMClassifier(
                  nthread=-1,
                  n_estimators=2000,
                  learning_rate=0.01,
                  num_leaves=80,
                  colsample_bytree=0.98,
                  subsample=0.78,
                  reg_alpha=0.04,
                  reg_lambda=0.073,
                  subsample_for_bin=50,
                  boosting_type='gbdt',
                  is_unbalance=False,
                  min_split_gain=0.025,
                  min_child_weight=40,
                  min_child_samples=510,
                  objective='binary',
                  metric='auc',
                  verbose=-1)
    # LightGBM 4.x takes early stopping, logging and metric recording as callbacks
    model.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)],
                eval_metric= 'auc',
                callbacks=[lgb.early_stopping(EARLY_STOP),
                           lgb.log_evaluation(VERBOSE_EVAL),
                           lgb.record_evaluation(evals_results)])

    oof_preds[valid_idx] = model.predict_proba(valid_x, num_iteration=model.best_iteration_)[:, 1]
    test_preds += model.predict_proba(test_df[predictors], num_iteration=model.best_iteration_)[:, 1] / kf.n_splits

    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = predictors
    fold_importance_df["importance"] = model.feature_importances_  # use this fold's LightGBM model, not the earlier clf
    fold_importance_df["fold"] = n_fold + 1

    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(valid_y, oof_preds[valid_idx])))
    del model, train_x, train_y, valid_x, valid_y
    gc.collect()
    n_fold = n_fold + 1
train_auc_score = roc_auc_score(train_df[target], oof_preds)
print('Full AUC score %.6f' % train_auc_score)

- The overall out-of-fold AUC score, computed on the training data, was 0.931823.

- The test prediction is prepared as the average of the predictions for the test set over the 5 folds.

## **6. Results and conclusion**


- We investigated the data, checking for data unbalancing, visualizing the features and understanding the relationship between different features. We then investigated five predictive models. The data was split in 3 parts, a train set, a validation set and a test set. For the first three models, we only used the train and validation sets.

- We started with RandomForestClassifier, for which we obtained an AUC score of 0.85 when predicting the target for the validation set.

- We followed with an AdaBoostClassifier model, with a lower AUC score (0.83) for prediction of the validation set target values.

- We then followed with a CatBoostClassifier, with an AUC score of 0.86 after training for 500 iterations.

- We then experimented with an XGBoost model. In this case, we used the validation set for validation of the training model. The best validation score obtained was 0.986. Then we used the model with the best training step to predict the target value from the test data; the AUC score obtained was 0.977.

- We then presented the data to a LightGBM model. We used both train-validation split and cross-validation to evaluate the model effectiveness to predict the 'Class' value, i.e. detecting if a transaction was fraudulent. With the first method we obtained values of AUC for the validation set around 0.974. For the test set, the score obtained was 0.946. With the cross-validation, we obtained an out-of-fold AUC score of 0.93.






## **7. References**


The concepts and ideas in this project are taken from the following websites -

1. [Credit-Card Fraud Detection Dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud)
2. [Random Forest Classifier Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
3. [AdaBoost Classifier Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
4. [CatBoost Classifier Documentation](https://catboost.ai/docs/concepts/python-reference_catboostclassifier.html)
5. [XGBoost Python API Reference](https://xgboost.readthedocs.io/en/latest/python/python_api.html)
6. [LightGBM Python Implementation](https://github.com/Microsoft/LightGBM/tree/master/python-package)