## **Data Description**

This dataset contains information on default payments, demographic factors, credit data, payment history, and bill statements of credit card clients in Taiwan from **April 2005** to **September 2005**.


### **Attribute Information:**

There are 25 variables:

*   ID: ID of each client
*   LIMIT_BAL: Amount of given credit in NT dollars (includes individual and family/supplementary credit)
*   SEX: Gender (1=male, 2=female)
*   EDUCATION: (1=graduate school, 2=university, 3=high school, 4=others, 5=unknown, 6=unknown)
*   MARRIAGE: Marital status (1=married, 2=single, 3=others)
*   AGE: Age in years
*   PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment delay for one month, 2=payment delay for two months, ... 8=payment delay for eight months, 9=payment delay for nine months and above)
*   PAY_2: Repayment status in August, 2005 (scale same as above)
*   PAY_3: Repayment status in July, 2005 (scale same as above)
*   PAY_4: Repayment status in June, 2005 (scale same as above)
*   PAY_5: Repayment status in May, 2005 (scale same as above)
*   PAY_6: Repayment status in April, 2005 (scale same as above)
*   BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
*   BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
*   BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
*   BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
*   BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
*   BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
*   PAY_AMT1: Amount of previous payment in September, 2005 (NT dollar)
*   PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
*   PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
*   PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
*   PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
*   PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
*   default.payment.next.month: Default payment (1=yes, 0=no)
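
The documented codes above can be compared with the values that actually occur in the data before any cleaning decisions are made. The following is a minimal sketch on a small made-up frame (the values are illustrative, not taken from the dataset); `DOCUMENTED_CODES` and `undocumented_codes` are hypothetical names introduced only for this example.

```python
import pandas as pd

# Documented category codes, copied from the attribute list above.
DOCUMENTED_CODES = {
    "SEX": {1, 2},
    "EDUCATION": {1, 2, 3, 4, 5, 6},
    "MARRIAGE": {1, 2, 3},
}

def undocumented_codes(frame, documented):
    """Return, per column, the sorted values that the documentation does not list."""
    found = {}
    for column, allowed in documented.items():
        extra = set(frame[column].unique()) - allowed
        if extra:
            found[column] = sorted(int(v) for v in extra)
    return found

# Toy data (made up for illustration): EDUCATION and MARRIAGE each contain a 0.
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "EDUCATION": [1, 0, 6, 2],
    "MARRIAGE": [1, 2, 0, 3],
})

print(undocumented_codes(toy, DOCUMENTED_CODES))
```

Running the same helper on the real frame would show which columns need a decision about unlabeled values.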

# **Project Title:**

# A Supervised Approach to Credit Card Fraud Detection Using Regression and Classification ML Models


# **Introduction**

It is important that credit card companies can recognize fraudulent credit card transactions so that customers are not charged for items they did not purchase.

The credit card fraud detection problem involves modelling past credit card transactions using the knowledge of which ones turned out to be fraudulent. The resulting model is then used to decide whether a new transaction is fraudulent or not. In this notebook, the target is the `default payment next month` column of the dataset described above, which the code renames to `Isfraud`.

**Our aim here is to detect 100% of the fraudulent transactions while minimizing the incorrect fraud classifications.**

## **Objective:**

The notebook is structured as follows:

*   First exploration: just to see what we have.  
*   Cleaning: time to make choices about undocumented labels
*   Feature engineering: time to be creative
*   Final result and lessons learned

In [None]:
# import basic libraries
import pandas as pd
pd.set_option("display.max_columns", 100)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings("ignore")

In [None]:
# Load the data
df = pd.read_excel("default of credit card clients.xlsx")
nRow, nCol = df.shape
print(f'There are {nRow} rows and {nCol} columns')

In [None]:
df.head()

In [None]:
df.describe()


In [None]:
print("Shape of the Dataset : {}".format(df.shape))
print("Number of Columns in the Dataset : {}".format(df.shape[1]))
print("Number of Rows in the Dataset : {}".format(df.shape[0]))
print("-"*40)

In [None]:
numeric_features = df.select_dtypes(include = [np.number])
categoric_features = df.select_dtypes(exclude = [np.number])
print("Number of Numerical Features : {}".format(numeric_features.shape[1]))
print("Number of Categorical Features : {}".format(categoric_features.shape[1]))
print("-"*40)

In [None]:
df.info()

In [None]:
df.isnull().sum().max()

In [None]:
print("No Fraud", round(df['default payment next month'].value_counts()[0]/len(df) * 100,2), "% of the dataset")
print("Fraud", round(df['default payment next month'].value_counts()[1]/len(df) * 100,2), "% of the dataset")

# **Data Cleaning**


In [None]:
fil = (df.EDUCATION == 5) | (df.EDUCATION == 6) | (df.EDUCATION == 0)
df.loc[fil, 'EDUCATION'] = 4
df.EDUCATION.value_counts()

In [None]:
df.loc[df.MARRIAGE == 0, 'MARRIAGE'] = 3
df.MARRIAGE.value_counts()

In [None]:
# Rename columns for convenience
df.rename(columns = {'default payment next month': 'Isfraud'}, inplace = True)
df.rename(columns = {'PAY_0': 'PAY_1'}, inplace = True)
df.head()

# Data Transformation

## Feature Engineering

1. Payment Status Aggregation: Instead of considering repayment status for each month separately, you could aggregate them to create new features such as:

    * Average repayment status over the past few months.

    * Maximum delay in repayment over the past few months.
    
    * Number of months with delayed payments.

2. Bill Amount Difference: Calculate the difference between the bill amounts for consecutive months. This could indicate the trend in spending behavior over time.

3. Bill Amount to Credit Limit Ratio: Calculate the ratio of bill amount to the credit limit for each month. This could provide insights into the credit utilization behavior of the clients.

4. Age Binning: Instead of using age as a continuous variable, you could create age bins or categories to capture different age groups' behavior more effectively.

5. Payment Amount Ratios: Calculate ratios such as the percentage of the bill amount paid each month compared to the total bill amount or the credit limit.

6. Payment Amount Difference: Calculate the difference between the previous payment amount and the current bill amount. This could indicate how much of the outstanding balance is being paid off each month.

7. Marriage Status Encoding: Encode the marriage status variable as binary indicators (e.g., married or not married) or group categories with fewer samples into an "others" category.

8. Education Encoding: Encode the education variable into fewer categories by grouping similar levels together (e.g., graduate school and university into one category).

In [None]:
import pandas as pd


# Feature engineering

# Aggregating payment status over the past few months
df['Avg_PAY'] = df[['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']].mean(axis=1)
df['Max_Delay'] = df[['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']].max(axis=1)
df['Num_Delay_Months'] = (df[['PAY_1', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']] > 0).sum(axis=1)

# Calculating bill amount differences
df['Bill_Amt_Diff_1'] = df['BILL_AMT1'] - df['BILL_AMT2']
df['Bill_Amt_Diff_2'] = df['BILL_AMT2'] - df['BILL_AMT3']
df['Bill_Amt_Diff_3'] = df['BILL_AMT3'] - df['BILL_AMT4']
df['Bill_Amt_Diff_4'] = df['BILL_AMT4'] - df['BILL_AMT5']
df['Bill_Amt_Diff_5'] = df['BILL_AMT5'] - df['BILL_AMT6']

# Calculating bill amount to credit limit ratio
df['Bill_Amt_to_Limit_Ratio'] = df['BILL_AMT1'] / df['LIMIT_BAL']

# Binning age into categories
bins = [20, 30, 40, 50, 60, 70, 80]
labels = ['20-30', '30-40', '40-50', '50-60', '60-70', '70-80']
df['Age_Group'] = pd.cut(df['AGE'], bins=bins, labels=labels, right=False)

# Encoding marriage status into binary indicators
df['Married'] = df['MARRIAGE'].apply(lambda x: 1 if x == 1 else 0)

# Encoding education into fewer categories
df['Education'] = df['EDUCATION'].replace({4: 5, 5: 5, 6: 5})  # Grouping others and unknown categories into one

# Calculating payment amount ratios
df['Payment_Ratio_1'] = df['PAY_AMT1'] / df['BILL_AMT1']
df['Payment_Ratio_2'] = df['PAY_AMT2'] / df['BILL_AMT2']
df['Payment_Ratio_3'] = df['PAY_AMT3'] / df['BILL_AMT3']
df['Payment_Ratio_4'] = df['PAY_AMT4'] / df['BILL_AMT4']
df['Payment_Ratio_5'] = df['PAY_AMT5'] / df['BILL_AMT5']
df['Payment_Ratio_6'] = df['PAY_AMT6'] / df['BILL_AMT6']

# Calculating payment amount differences
df['Payment_Amt_Diff_1'] = df['PAY_AMT1'] - df['BILL_AMT2']
df['Payment_Amt_Diff_2'] = df['PAY_AMT2'] - df['BILL_AMT3']
df['Payment_Amt_Diff_3'] = df['PAY_AMT3'] - df['BILL_AMT4']
df['Payment_Amt_Diff_4'] = df['PAY_AMT4'] - df['BILL_AMT5']
df['Payment_Amt_Diff_5'] = df['PAY_AMT5'] - df['BILL_AMT6']



# Display the updated dataframe
print(df.head())


### One-hot encoding

In [None]:
# Perform one-hot encoding for 'Age_Group'
df = pd.get_dummies(df, columns=['Age_Group'], drop_first=True)


In [None]:
df.head()

### Data Cleaning

* Handling missing values: This involves identifying columns with missing values and deciding how to handle them. Common strategies include imputing missing values (replacing them with a statistical measure like the median, mean, or mode) or removing rows or columns with missing values.

* Dealing with infinite values: Infinite values may arise during calculations or transformations. Converting infinite values to NaNs is a common practice, as NaNs can be easily handled using imputation or removal strategies.
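
The two steps above can be seen on a tiny example. This is a sketch on made-up numbers, not on the dataset: a ratio feature is built with a zero denominator in two rows, the resulting infinite value is converted to NaN, and both gaps are then filled with the median of the valid rows.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the second and third rows have a zero bill amount.
toy = pd.DataFrame({
    "PAY_AMT1": [500.0, 200.0, 0.0, 300.0],
    "BILL_AMT1": [1000.0, 0.0, 0.0, 1200.0],
})

# Dividing by zero gives inf (200 / 0) or NaN (0 / 0) rather than raising.
toy["Payment_Ratio_1"] = toy["PAY_AMT1"] / toy["BILL_AMT1"]

# Convert +/-inf to NaN so that a single imputation step covers both cases.
ratio = toy["Payment_Ratio_1"].replace([np.inf, -np.inf], np.nan)

# The median skips NaN by default, so it is computed from the valid rows only.
toy["Payment_Ratio_1"] = ratio.fillna(ratio.median())

print(toy)
```

Median imputation is only one option; whether a zero bill should instead map to a ratio of 0 or to a separate indicator column is a modelling choice.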

In [None]:
import numpy as np

# Replace infinite values with NaN
df.replace([np.inf, -np.inf], np.nan, inplace=True)

# Impute NaN values with the median of each numeric feature
df.fillna(df.median(numeric_only=True), inplace=True)

### Feature Scaling: 

* Scaling numerical features ensures that all features have a similar scale. MinMaxScaler is one of the scaling techniques that scales features to a specified range (commonly [0, 1]).
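
Min-max scaling maps each value to `(x - min) / (max - min)` per column. The sketch below writes this out in plain NumPy on made-up numbers. It assumes the minimum and range are learned from the training rows only and then reused for the test rows, which is a common way to keep test information out of preprocessing; `fit_min_max` and `apply_min_max` are hypothetical helper names, not scikit-learn functions.

```python
import numpy as np

def fit_min_max(train):
    """Learn the per-column minimum and range from the training rows only."""
    col_min = train.min(axis=0)
    col_range = train.max(axis=0) - col_min
    # Avoid division by zero for constant columns.
    col_range[col_range == 0] = 1.0
    return col_min, col_range

def apply_min_max(values, col_min, col_range):
    """Scale values with previously learned statistics: (x - min) / (max - min)."""
    return (values - col_min) / col_range

# Made-up numbers: two columns on very different scales.
train = np.array([[20000.0, 21.0], [50000.0, 35.0], [120000.0, 61.0]])
test = np.array([[70000.0, 41.0]])

col_min, col_range = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_range)
test_scaled = apply_min_max(test, col_min, col_range)

print(train_scaled)
print(test_scaled)
```

With this approach, test values can fall slightly outside [0, 1] when they exceed the range seen in training.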

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Initialize the MinMaxScaler
scaler = MinMaxScaler()

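# Fit the scaler to the data and transform it.
# Note: this scales every column of df (including ID and the Isfraud target) and returns a NumPy array.
# The models below are trained on the unscaled features in X, not on df_scaled.
df_scaled = scaler.fit_transform(df)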

# Model Building:

* After preprocessing the data, you can proceed with training your machine learning model using the preprocessed features and target variables.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

## Finalizing Features

In [None]:
features = ['LIMIT_BAL', 'EDUCATION', 'MARRIAGE', 'PAY_1', 'PAY_2', 'PAY_3',
            'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
            'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
            'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6', 'Avg_PAY', 'Max_Delay',
            'Num_Delay_Months', 'Bill_Amt_Diff_1', 'Bill_Amt_Diff_2', 'Bill_Amt_Diff_3',
            'Bill_Amt_Diff_4', 'Bill_Amt_Diff_5', 'Bill_Amt_to_Limit_Ratio',
            'Married', 'Education', 'Payment_Ratio_1', 'Payment_Ratio_2', 'Payment_Ratio_3',
            'Payment_Ratio_4', 'Payment_Ratio_5', 'Payment_Ratio_6', 'Payment_Amt_Diff_1',
            'Payment_Amt_Diff_2', 'Payment_Amt_Diff_3', 'Payment_Amt_Diff_4', 'Payment_Amt_Diff_5']
X = df[features].copy()

## Train-Test Split

In [None]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, df['Isfraud'], test_size=0.2, random_state=42)

In [None]:
df.isnull().sum().max()

In [None]:
df.head(10)

## Balancing Data with SMOTE

In [None]:
from imblearn.over_sampling import SMOTE

# Apply SMOTE to balance the dataset
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
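
SMOTE balances the classes by creating synthetic minority rows, each placed on the line segment between a minority sample and one of its nearest minority neighbours. The block below is a simplified NumPy sketch of that interpolation idea on made-up points. It is not the `imblearn` implementation, and `smote_like_samples` is a hypothetical name used only here.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority rows.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic

# Made-up minority-class points with two features.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [3.0, 2.5]])
new_rows = smote_like_samples(minority, n_new=5)
print(new_rows)
```

Because every synthetic row is derived from existing training rows, resampling only `X_train` (as the cell above does) keeps the test set free of synthetic data.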

## Model Training:

In [None]:
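# Logistic Regression
# max_iter is raised from scikit-learn's default of 100 to give the solver more iterations on these unscaled features
logistic_model = LogisticRegression(max_iter=1000)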
logistic_model.fit(X_train_balanced, y_train_balanced)
logistic_predictions = logistic_model.predict(X_test)

# Decision Tree
decision_tree_model = DecisionTreeClassifier()
decision_tree_model.fit(X_train_balanced, y_train_balanced)
decision_tree_predictions = decision_tree_model.predict(X_test)

# Random Forest
random_forest_model = RandomForestClassifier()
random_forest_model.fit(X_train_balanced, y_train_balanced)
random_forest_predictions = random_forest_model.predict(X_test)


## Model Evaluation on Test Data:

In [None]:
# Evaluate the models
models = {
    'Logistic Regression': logistic_predictions,
    'Decision Tree': decision_tree_predictions,
    'Random Forest': random_forest_predictions
}

for model_name, predictions in models.items():
    print(f"Confusion Matrix and Classification Report for {model_name}:")
    print(confusion_matrix(y_test, predictions))
    print(classification_report(y_test, predictions))
    print("="*60)

## Hyperparameter Tuning

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize the Random Forest classifier with the best hyperparameters
best_rf_classifier = RandomForestClassifier(n_estimators=300, max_depth=None, min_samples_split=2, min_samples_leaf=1)

# Train the Random Forest classifier on the balanced training data
best_rf_classifier.fit(X_train_balanced, y_train_balanced)

# Now you can use the trained Random Forest classifier (best_rf_classifier) for making predictions


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Make predictions on the test set
y_pred = best_rf_classifier.predict(X_test)

# Calculate evaluation metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

# Print the evaluation metrics
print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
print("Confusion Matrix:")
print(conf_matrix)
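
For reference, the four scores printed above can all be derived from the confusion matrix. The sketch below uses a made-up matrix (not a result from this notebook) in scikit-learn's layout for binary 0/1 labels, where rows are true classes and columns are predicted classes.

```python
import numpy as np

# Made-up confusion matrix in scikit-learn's layout:
# rows are true classes, columns are predicted classes.
conf = np.array([[4000, 600],
                 [500, 900]])
tn, fp, fn, tp = conf.ravel()

accuracy = (tp + tn) / conf.sum()
precision = tp / (tp + fp)  # share of predicted positives that are real
recall = tp / (tp + fn)     # share of real positives that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Given the aim stated in the introduction, recall on the positive class is the score that tracks how many positive cases are detected, while precision tracks the incorrect positive classifications.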


## Plot Learning Curve:

In [None]:
from sklearn.model_selection import learning_curve
from sklearn.ensemble import RandomForestClassifier


# Define your model and parameters
model = RandomForestClassifier(n_estimators=300, max_depth=None, min_samples_leaf=1, min_samples_split=2)

# Compute training and validation scores with 5-fold cross-validation
train_sizes, train_scores, test_scores = learning_curve(model, X_train_balanced, y_train_balanced, cv=5, scoring='accuracy', n_jobs=-1)

# Calculate mean and standard deviation of training and validation scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Plot the learning curves
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, label='Training score', color='blue')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.1, color='blue')
plt.plot(train_sizes, test_mean, label='Validation score', color='red')
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.1, color='red')
plt.xlabel('Number of training examples')
plt.ylabel('Accuracy')
plt.title('Learning curves')
plt.legend()
plt.show()


## CAP Analysis:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def plot_cap_curve(y_true, y_pred_proba, title=''):
    total = len(y_true)
    class_1_count = np.sum(y_true)
    
    sorted_proba_indices = np.argsort(y_pred_proba)[::-1]
    y_sorted = y_true[sorted_proba_indices]
    x = np.arange(1, total + 1)
    y = np.cumsum(y_sorted) / class_1_count
    
    plt.figure(figsize=(8, 6))
    plt.plot(x, y, marker='o', linestyle='-', color='b')
    plt.plot([0, total], [0, 1], linestyle='--', color='r')
    plt.xlabel('Total Observations')
    plt.ylabel('Cumulative True Positive Rate')
    plt.title(title)
    plt.grid(True)
    plt.show()

# Assuming best_rf_classifier is already trained and X_test, y_test are available
y_pred_proba = best_rf_classifier.predict_proba(X_test)[:, 1]
plot_cap_curve(y_test.values, y_pred_proba, title='CAP Curve for Random Forest Classifier')
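
A CAP curve is often summarized by a single number, the accuracy ratio: the area between the model's curve and the random diagonal, divided by the same area for a perfect ranking. The following is a minimal sketch under that definition, using made-up labels and scores; `cap_accuracy_ratio` is a hypothetical helper, not part of any library.

```python
import numpy as np

def trapezoid_area(y, x):
    """Area under a piecewise-linear curve through the points (x, y)."""
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

def cap_accuracy_ratio(y_true, y_score):
    """Area between model CAP and diagonal, relative to a perfect model."""
    y_true = np.asarray(y_true)
    order = np.argsort(y_score)[::-1]  # highest predicted probability first
    total = len(y_true)
    positives = y_true.sum()

    # CAP points, starting at the origin.
    x = np.arange(total + 1) / total
    y_model = np.concatenate([[0.0], np.cumsum(y_true[order]) / positives])
    # A perfect model ranks every positive before any negative.
    y_perfect = np.minimum(np.arange(total + 1) / positives, 1.0)

    area_model = trapezoid_area(y_model, x) - 0.5
    area_perfect = trapezoid_area(y_perfect, x) - 0.5
    return area_model / area_perfect

# Made-up labels and scores for illustration.
y_true_toy = [1, 0, 1, 0, 0, 1, 0, 0]
y_score_toy = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(round(cap_accuracy_ratio(y_true_toy, y_score_toy), 3))
```

A value near 1 means the ranking is close to perfect, a value near 0 means it is no better than random ordering.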


## Hyperparameter Tuning using Grid Search Cross-Validation for Random Forest Classifier

In [None]:
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.ensemble import RandomForestClassifier

# Define the parameter grid to search
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a Random Forest classifier
rf_classifier = RandomForestClassifier()

# Define the GridSearchCV with k-fold cross-validation
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=kfold, scoring='accuracy', n_jobs=-1)

# Perform the grid search
grid_search.fit(X_train_balanced, y_train_balanced)

# Get the best hyperparameters and the best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

print("Best Hyperparameters:", best_params)
print("Best Model:", best_model)
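
Grid search trains one model per hyperparameter combination and per fold, so it helps to estimate the cost before running it. The sketch below enumerates the grid from the cell above with the standard library only; `n_splits` mirrors the 5-fold setting used there.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
n_splits = 5

# Every combination of one value per hyperparameter.
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
n_fits = len(combinations) * n_splits

print(len(combinations), "candidates,", n_fits, "cross-validation fits")
```

With `GridSearchCV`'s default `refit=True`, one additional fit of the best candidate on the full training data follows the cross-validation fits.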


## Model Evaluation on Training Data

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Predictions on the training data
y_train_pred = best_model.predict(X_train_balanced)

# Calculate evaluation metrics
accuracy_train = accuracy_score(y_train_balanced, y_train_pred)
precision_train = precision_score(y_train_balanced, y_train_pred)
recall_train = recall_score(y_train_balanced, y_train_pred)
f1_score_train = f1_score(y_train_balanced, y_train_pred)
confusion_matrix_train = confusion_matrix(y_train_balanced, y_train_pred)

# Print the evaluation metrics
print("Performance Metrics on Training Data:")
print("Accuracy:", accuracy_train)
print("Precision:", precision_train)
print("Recall:", recall_train)
print("F1-score:", f1_score_train)
print("Confusion Matrix:")
print(confusion_matrix_train)
