### About

**NOTE: This notebook is a continuation of my previous notebook which contained EDA.**   

You can refer to the following links:  
- Part-1 EDA: LINK  
- End-to-End implementation with deployment on AWS: https://github.com/Sharma-Ayush/Credit-Card-Fraud-Detection.git

### Objective of this notebook

The objectives of this notebook are:  
- Feature Engineering.  
- Handling Imbalance.  
- Feature importance.  
- Cross-validation and hyperparameter tuning of classification models.  
- Testing on the test set.

<u>PS:</u>  
- Feel free to reach out through the comment section or my socials if you have any questions or feedback.
- Please upvote the notebook if you like it, as it would motivate me to develop more projects like these.

### My Socials

Follow me on these platforms for more such content:  

LinkedIn: https://www.linkedin.com/in/ayush-sharma-660831125/  
Github: https://github.com/Sharma-Ayush  
Kaggle: https://www.kaggle.com/ayushsharma0812

### Import Required Libraries

In [1]:
# Data Manipulation
import numpy as np
import pandas as pd

# Machine Learning
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import ADASYN
from imblearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.compose import ColumnTransformer

# Extra modules
from time import time


### Custom classes & functions

In [2]:
def double_log_transform(x):
    '''Transform x by taking the log of the data after shifting by 1. This operation is done two times iteratively.'''
    return np.log10(np.log10(x + 1) + 1)

def cube_root_transform(x):
    '''Transform x by taking the cube root of the data.'''
    return np.cbrt(x)
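
As a quick sanity check of the two helpers above, the sketch below applies them to a few hand-picked values. The sample values are illustrative assumptions and are not taken from the dataset.

```python
import numpy as np

def double_log_transform(x):
    '''log10 of (x + 1), applied two times iteratively.'''
    return np.log10(np.log10(x + 1) + 1)

def cube_root_transform(x):
    '''Cube root of the data.'''
    return np.cbrt(x)

# Illustrative values only (assumed, not taken from the dataset)
amounts = np.array([0.0, 9.0, 99.0, 9999.0])
pca_values = np.array([-8.0, -1.0, 0.0, 27.0])

# The double log compresses a range spanning four orders of magnitude
compressed_amounts = double_log_transform(amounts)

# The cube root keeps the sign of each value
rooted_values = cube_root_transform(pca_values)

print(compressed_amounts)
print(rooted_values)
```

The double log is only safe for non-negative inputs such as a transaction amount, while the cube root is defined for negative values as well, which is why it suits the PCA encoded columns.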

### Loading the dataset

In [3]:
df_train = pd.read_csv('Data/creditcard_train.csv')

### Handling imbalance

We have already seen the huge imbalance that exists within our dataset. We need to mitigate that imbalance; otherwise our model will focus mostly on the majority class (genuine transactions). There are several techniques that we can use:  

1. <u>Undersampling of majority class:</u> We take a random sample of the majority class so that its size matches that of the minority class. This shrinks the dataset, which speeds up computation, but it also discards information.  

2. <u>Oversampling of minority class:</u> We randomly sample the minority class with replacement so that its size matches that of the majority class. This enlarges the dataset, which increases computational effort, but unlike undersampling it preserves information. However, it creates duplicates within the minority class, which can lead to overfitting.  

3. <u>Synthetic data generation techniques:</u> We use techniques like SMOTE, its variants or ADASYN to oversample the minority class by interpolating new points between existing minority class data points. This is usually a better way to upsample because it avoids exact duplicates, but since the new points lie on straight lines between existing points, it can introduce artificial patterns. 

4. <u>Weight based sensitivity:</u> Many algorithms allow us to specify a weight for each record of our dataset. We can assign higher weights to minority class records so that the algorithm focuses more on them.
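
To make the weight based option more concrete, here is a minimal sketch that computes class weights inversely proportional to class frequency, which is the same formula scikit-learn documents for `class_weight='balanced'`. The label counts are made up for illustration.

```python
import numpy as np

# Toy labels: 990 genuine (0) and 10 fraudulent (1) records (made-up counts)
y = np.array([0] * 990 + [1] * 10)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
class_counts = np.bincount(y)
class_weights = len(y) / (len(class_counts) * class_counts)

# One weight per record, usable wherever an estimator accepts sample weights
sample_weights = class_weights[y]

print(class_weights)
```

With these weights, each class contributes the same total weight to the loss, so the 10 fraudulent records count as much as the 990 genuine ones.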

I would like to try both synthetic generation and weight based sensitivity and then compare them. However, the dataset is fairly large and I also want to perform cross-validation for hyperparameter tuning, which creates a problem.  

When we generate cross-validation sets, the data should first be split, and only then should each training fold be passed through the synthetic generator and any other transformations before the model is evaluated on the untouched validation fold. Resampling after splitting ensures that information doesn't leak between the train and validation sets. Since hyperparameter tuning is itself time consuming, I want a synthetic generation approach where the resulting dataset is not too big, because I don't have the time and computation to spare on both multiple synthetic generations and cross-validation.  

So, what should I do? I can definitely try weight based sensitivity as is, because it does not involve any synthetic generation. The other option I am considering is a combination of undersampling and ADASYN. If I were to undersample the majority class all the way down to the size of the minority class, too much information would be lost. Given my time and computation constraints, I want to reduce the size of the majority class just enough, and at the same time oversample the minority class with a synthetic generator, such that the time taken for this process is manageable.

I will use ADASYN. ADASYN generates more synthetic data around minority class points that lie in low density regions of the minority class, i.e. points whose neighbours mostly belong to the majority class. This means that even noisy points of the fraudulent class will get attention, and the model might overfit to the minority class, i.e. it might perform very well on the minority class but worse on the majority class. In a fraud detection setting like this one, we primarily want to predict fraud cases as well as possible. Even if some genuine cases are predicted as fraud (as long as their number stays below some acceptable threshold), we should be okay with it, as the credit card issuer can apply tighter verification to the user in these cases. This way we keep a tight leash on frauds, which saves the company money, at the cost of tighter verification for genuine cases predicted as fraud. If the overfitting is too severe, we can try other techniques like SMOTE.
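
To illustrate how ADASYN decides where to add points, the sketch below reimplements only its weighting step: each minority point gets a share of the synthetic budget proportional to the fraction of majority class points among its k nearest neighbours. This is a simplified illustration and not imbalanced-learn's implementation. The function name `adasyn_allocation`, the toy coordinates and the budget of 10 are hypothetical, and the sketch assumes no duplicate points and at least one minority point with a majority neighbour.

```python
import numpy as np

def adasyn_allocation(X, y, minority_label=1, k=3, n_to_generate=10):
    '''Return how many synthetic points each minority sample would receive.'''
    X_min = X[y == minority_label]
    # Euclidean distance from every minority point to every point in X
    dists = np.linalg.norm(X_min[:, None, :] - X[None, :, :], axis=2)
    # k nearest neighbours, skipping position 0 (the point itself, distance 0)
    neighbour_idx = np.argsort(dists, axis=1)[:, 1:k + 1]
    # r_i: fraction of majority class points among the k neighbours
    r = (y[neighbour_idx] != minority_label).mean(axis=1)
    # Normalise the ratios to sum to 1, then scale by the generation budget
    r_hat = r / r.sum()
    return np.rint(r_hat * n_to_generate).astype(int)

# Toy data: one isolated fraud point surrounded by genuine points,
# plus a tight fraud cluster with a single genuine point nearby
X = np.array([[0.0, 0.0],                              # isolated minority point
              [10.0, 10.0], [10.0, 11.0],              # minority cluster
              [11.0, 10.0], [11.0, 11.0],
              [0.0, 1.0], [1.0, 0.0], [0.0, -1.0],     # majority points around
              [-1.0, 0.0],                             # the isolated point
              [10.0, 9.5]])                            # majority point near cluster
y = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

allocation = adasyn_allocation(X, y)
print(allocation)
```

The isolated minority point receives most of the budget, while cluster points with no majority neighbours receive none. This is the behaviour that gives noisy fraud points extra attention.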

Undersampling will be done before training. I will undersample the majority class to 10k records and then upsample the minority class to the same size using ADASYN. Anyone who can spare the time and computational effort can instead keep the whole dataset and only upsample the minority class, which preserves information.

In [4]:
minority_class_size = df_train.Class.value_counts().min()

# Random under sampling the majority class to a size of 10000
X_resampled, Y_resampled = RandomUnderSampler(sampling_strategy = minority_class_size/10000, random_state = 42).fit_resample(df_train.drop(columns = 'Class'), df_train['Class'])

# ADASYN resampler for synthetic generation of minority class
ADASYN_resampler = ADASYN(sampling_strategy = 'auto', random_state = 42)

### Feature Engineering

Based on the EDA notebook, I have to perform the following transformations and manipulations on my dataset:  
- Take log10 of the Amount column after shifting the data by 1. This is done two times iteratively.  
- Drop 3 columns: V13, V15, and V23.  
- Take the cube root of all the remaining PCA encoded columns.  

The order of the transformations will be as follows:  
- Transform columns based on EDA.  
- Standardize columns so that the scales of the columns are similar.  
- Finally, use ADASYN to upsample the minority class.

In [5]:
# Define the columns to be transformed
double_log_transform_columns = ['Amount']
cube_root_transform_columns = ['V' + str(i) for i in range(1, 29) if i not in [13, 15, 23]]
passthrough_columns = ['Time']

# Define the functional transformers for the group of columns
double_log_transformer = FunctionTransformer(func = double_log_transform, feature_names_out = 'one-to-one')
cube_root_transformer = FunctionTransformer(func = cube_root_transform, feature_names_out = 'one-to-one')
pass_through_transformer = 'passthrough'

# Create transformers
column_transformer = ColumnTransformer([('double_log_transform', double_log_transformer, double_log_transform_columns),
                                        ('cube_root_transformer', cube_root_transformer, cube_root_transform_columns),
                                        ('passthrough', pass_through_transformer, passthrough_columns)],
                                        remainder = 'drop')
standard_scaler = StandardScaler()

### References

- https://www.kdnuggets.com/2023/01/7-smote-variations-oversampling.html
- https://www.analyticsvidhya.com/blog/2020/10/overcoming-class-imbalance-using-smote-techniques/#h-dealing-with-imbalanced-data
- https://www.youtube.com/@krishnaik06  
- Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurélien Géron (O'Reilly). Copyright 2017 Aurélien Géron