# <a id="1">Introduction</a>

The dataset contains transactions made by credit cards. It covers transactions that occurred in a week, with **7200 frauds** out of **594,643 transactions**. The dataset is **highly unbalanced**: the **positive class (frauds)** accounts for **1.21%** of all transactions.
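
The imbalance figure can be checked directly from the two counts quoted above. The short sketch below also computes the accuracy of a trivial model that never predicts fraud, which is one reason accuracy alone is a weak yardstick here; the variable names are illustrative only.

```python
# Counts quoted in the introduction.
n_fraud = 7200
n_total = 594_643

fraud_rate = n_fraud / n_total
# Accuracy of a trivial model that labels every transaction as non-fraud.
baseline_accuracy = 1 - fraud_rate

print(f"Fraud rate: {fraud_rate:.2%}")
print(f"All-negative baseline accuracy: {baseline_accuracy:.2%}")
```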

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as ImbPipeline
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from sklearn.metrics import (recall_score, precision_score, f1_score, accuracy_score,
                             roc_auc_score, classification_report)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
from imblearn.over_sampling import SMOTE

import warnings
warnings.filterwarnings("ignore")

In [None]:
# Load the dataset
fraud_data = pd.read_csv('../input/fraud/fraud.csv')

# Display the first few rows of the dataset.
fraud_data.head()

In [None]:
print(fraud_data.info())

In [None]:
print(fraud_data.describe())

In [None]:
print(fraud_data.nunique())

### **Observations:**
* **Step:** Represents the unit of time (in hours).
* **Customer & Merchant IDs:** Unique identifiers.
* **Age & Gender:** Categorical demographic information.
* **ZipcodeOri & zipMerchant:** All transactions seem to originate and occur in the same zip code (28007).
* **Category:** The type of purchase.
* **Amount:** The transaction amount.
* **Fraud:** Binary target variable indicating whether the transaction is fraudulent.

### Data Cleaning and Preprocessing:

In [None]:
fraud_data['customer'] = fraud_data['customer'].str.strip("'")
fraud_data['age'] = fraud_data['age'].str.strip("'")
fraud_data['gender'] = fraud_data['gender'].str.strip("'")
fraud_data['zipcodeOri'] = fraud_data['zipcodeOri'].str.strip("'")
fraud_data['merchant'] = fraud_data['merchant'].str.strip("'")
fraud_data['zipMerchant'] = fraud_data['zipMerchant'].str.strip("'")
fraud_data['category'] = fraud_data['category'].str.strip("'")

fraud_data.dtypes

In [None]:
# Check for missing values
missing_values = fraud_data.isnull().sum()

missing_values

### Exploratory Data Analysis (EDA)

I will:

**1. Analyze some key variables distributions:**

* amount
* age
* fraud

**2. Explore Relationships:**

* Between fraud and gender.
* Between fraud and category.
* Between fraud and amount.

**3. Check Patterns or Anomalies:**

* Time-based patterns in step.
* Geographic patterns based on zipcodeOri and zipMerchant.

In [None]:
# Set the style for the plots
sns.set(style="whitegrid")

# Plot the distribution of 'amount'
plt.figure(figsize=(14, 6))
plt.subplot(1, 3, 1)
sns.histplot(fraud_data['amount'], bins=50, kde=True)
plt.title('Distribution of Amount')

# Plot the distribution of 'age'
plt.subplot(1, 3, 2)
sns.countplot(x='age', data=fraud_data)
plt.title('Distribution of Age')

# Plot the distribution of 'fraud'
plt.subplot(1, 3, 3)
sns.countplot(x='fraud', data=fraud_data)
plt.title('Distribution of Fraud')

plt.tight_layout()
plt.show()


**1. Amount:**

The distribution is right-skewed with most transactions having smaller amounts.
There are some higher amounts, but they are less frequent.

**2. Age:**

The age distribution shows various categories, with categories 2 (26-35) and 3 (36-45) being more frequent.
The category `U` represents unknown ages, which are present but not predominant.

**3. Fraud:**

There are far more benign transactions (0) than fraudulent ones (1), indicating class imbalance.

#### **Fraud and Categorical Variables**
**Fraud by Gender:**

* The majority of transactions are by females (F), followed by males (M).
* Fraudulent transactions occur across both genders, but with no clear dominance in any particular gender.

**Fraud by Category:**

* Certain categories have higher instances of fraud.
* Categories like es_transportation and es_food have more transactions, while categories like es_travel and es_leisure show fewer transactions.
* The distribution of fraud varies by category, indicating that some categories, such as es_sportsandtoys, may be more susceptible to fraud.

#### **Amount Differences for Fraud vs. Non-Fraud**
**Distribution of Amount by Fraud:**

* Fraudulent transactions tend to have a wider range of amounts compared to non-fraudulent ones.
* Non-fraudulent transactions are clustered around smaller amounts, whereas fraudulent ones include higher amounts more frequently.
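
The observations above rest on group-wise comparisons that the notebook does not show as code. A minimal sketch of how such comparisons could be computed follows. The toy rows are hypothetical and only mirror the column names of the dataset; in the notebook the same calls would be applied to `fraud_data`.

```python
import pandas as pd

# Hypothetical toy rows; column names mirror the dataset described above.
toy = pd.DataFrame({
    "category": ["es_transportation", "es_transportation", "es_food",
                 "es_sportsandtoys", "es_sportsandtoys", "es_travel"],
    "gender": ["F", "M", "F", "M", "F", "F"],
    "amount": [12.5, 30.0, 18.2, 250.0, 40.0, 900.0],
    "fraud": [0, 0, 0, 1, 0, 1],
})

# The mean of a 0/1 target within a group is that group's fraud rate.
fraud_rate_by_category = (
    toy.groupby("category")["fraud"].agg(["count", "mean"])
    .rename(columns={"count": "n_transactions", "mean": "fraud_rate"})
    .sort_values("fraud_rate", ascending=False)
)
fraud_rate_by_gender = toy.groupby("gender")["fraud"].mean()

# Summary statistics of the amount for non-fraud (0) vs. fraud (1).
amount_by_fraud = toy.groupby("fraud")["amount"].describe()

print(fraud_rate_by_category)
print(fraud_rate_by_gender)
print(amount_by_fraud[["count", "mean", "50%", "max"]])
```

Reporting the rate next to the transaction count helps separate categories that are merely busy from categories that are actually risky.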

### Time-Based Analysis

In [None]:
# Analyze transaction counts and fraud counts over time
transactions_over_time = fraud_data.groupby('step').size()
fraud_over_time = fraud_data[fraud_data['fraud'] == 1].groupby('step').size()

# Analyze fraud rate over time
fraud_rate_over_time = fraud_over_time / transactions_over_time

# Set up the figure
plt.figure(figsize=(14, 10))

# Plot transaction volume over time
plt.subplot(2, 1, 1)
plt.plot(transactions_over_time, label='Total Transactions')
plt.plot(fraud_over_time, label='Fraudulent Transactions', color='red')
plt.title('Transaction Volume Over Time')
plt.xlabel('Step (Hours)')
plt.ylabel('Number of Transactions')
plt.legend()

# Plot fraud rate over time
plt.subplot(2, 1, 2)
plt.plot(fraud_rate_over_time, label='Fraud Rate', color='orange')
plt.title('Fraud Rate Over Time')
plt.xlabel('Step (Hours)')
plt.ylabel('Fraud Rate')
plt.legend()

plt.tight_layout()
plt.show()


**Transaction Volume Over Time:**

* The total volume of transactions fluctuates and increases over time, but the number of fraudulent transactions remains roughly constant.

**Fraud Rate Over Time:**

* The fraud rate decreases over time, because total transactions increase while the number of fraudulent transactions remains roughly constant.
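
One detail of the rate calculation is worth spelling out: a step with no fraud at all is missing from the fraud-only count, so dividing the two series gives `NaN` for that step rather than 0. The sketch below, on hypothetical toy data, shows two ways to get a zero instead; whether this situation actually occurs in the real data is not verified here.

```python
import pandas as pd

# Hypothetical toy data: step 1 contains no fraudulent transaction.
toy = pd.DataFrame({
    "step": [0, 0, 0, 1, 1, 2, 2, 2, 2],
    "fraud": [0, 1, 0, 0, 0, 1, 0, 0, 0],
})

transactions_per_step = toy.groupby("step").size()

# Option 1: align the fraud counts to every step, filling gaps with 0.
fraud_per_step = (
    toy[toy["fraud"] == 1].groupby("step").size()
    .reindex(transactions_per_step.index, fill_value=0)
)
fraud_rate_per_step = fraud_per_step / transactions_per_step

# Option 2: the mean of the 0/1 target per step is the same rate.
fraud_rate_alt = toy.groupby("step")["fraud"].mean()

print(fraud_rate_per_step)
```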

#### Geographic Analysis

In [None]:
# Check for unique values in categorical columns to understand encoding needs
unique_zipcode = fraud_data['zipcodeOri'].unique()
unique_zipmerchant = fraud_data['zipMerchant'].unique()

In [None]:
print(unique_zipcode)

In [None]:
print(unique_zipmerchant)

**zipcodeOri and zipMerchant:** Both were the same for all transactions and thus not informative.

### Feature Engineering

In [None]:
# 1. Customer Transaction Frequency
fraud_data['customer_transaction_count'] = fraud_data.groupby('customer')['customer'].transform('count')

# 2. Average Transaction Amount by Customer
fraud_data['customer_avg_amount'] = fraud_data.groupby('customer')['amount'].transform('mean')

# 3. Average Transaction Amount by Merchant
fraud_data['merchant_avg_amount'] = fraud_data.groupby('merchant')['amount'].transform('mean')

# 4. 'day_of_week' feature
# Position within a 7-day cycle (0 to 6); which weekday 0 corresponds to is an assumption
fraud_data['day_of_week'] = (fraud_data['step'] // 24) % 7

# 5. 'hour_of_day' feature
fraud_data['hour_of_day'] = fraud_data['step'] % 24  # Hours of the day (0 to 23)

In [None]:
fraud_data.head()
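
The per-customer and per-merchant aggregates above are computed over the whole dataset before any split, so rows that later land in a test fold also contribute to the features. A possible alternative, sketched here on hypothetical toy frames, is to learn the aggregate on the training rows only and map it onto the test rows, falling back to the training mean for unseen customers. The names `train`, `test`, `customer_mean` and `global_mean` are illustrative, and the notebook does not measure how much this changes the results.

```python
import pandas as pd

# Hypothetical train/test frames with the same column names as the dataset.
train = pd.DataFrame({
    "customer": ["C1", "C1", "C2", "C2", "C2"],
    "amount": [10.0, 30.0, 5.0, 15.0, 70.0],
})
test = pd.DataFrame({
    "customer": ["C1", "C3"],   # C3 never appears in the training rows
    "amount": [500.0, 20.0],
})

# Learn the aggregate from training rows only.
customer_mean = train.groupby("customer")["amount"].mean()
global_mean = train["amount"].mean()

# Apply it to both frames; unseen customers fall back to the training mean.
train["customer_avg_amount"] = train["customer"].map(customer_mean)
test["customer_avg_amount"] = test["customer"].map(customer_mean).fillna(global_mean)

print(test)
```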

### Encoding and Scaling

In [None]:
# Check for unique values in categorical columns to understand encoding needs
unique_ages = fraud_data['age'].unique()
unique_genders = fraud_data['gender'].unique()
unique_categories = fraud_data['category'].unique()

In [None]:
print(unique_ages)

In [None]:
print(unique_genders)

In [None]:
print(unique_categories)

**Categorical Variables:**

* Age: 8 unique values including 'U' for Unknown.
* Gender: 4 unique values including 'E' for Enterprise and 'U' for Unknown.
* Category: 15 distinct transaction categories.

In [None]:
# Drop columns that may not provide meaningful information for modeling
fraud_data_cleaned = fraud_data.drop(columns=['zipcodeOri', 'zipMerchant', 'customer', 'merchant'])

In [None]:
# Instantiate the encoders with adjustments to handle the sparse matrix issue
column_transformer = ColumnTransformer(
    [
        ('ohe_age', OneHotEncoder(), ['age']),
        ('ohe_gender', OneHotEncoder(), ['gender']),
        ('ohe_category', OneHotEncoder(), ['category']),
        ('scale_amount', StandardScaler(), ['amount']),
        ('scale_CusTranCount', StandardScaler(), ['customer_transaction_count']),
        ('scale_CusAvgAmount', StandardScaler(), ['customer_avg_amount']),
        ('scale_MerAvgAmount', StandardScaler(), ['merchant_avg_amount']),
        ('scale_DayOfWeek', StandardScaler(), ['day_of_week']),
        ('scale_HourOfDay', StandardScaler(), ['hour_of_day'])
    ],
    remainder='passthrough'
)

# Apply the transformations to the data
fraud_data_transformed = column_transformer.fit_transform(fraud_data_cleaned)

# Ensuring the transformation result is a dense format if it's sparse
if hasattr(fraud_data_transformed, 'toarray'):
    fraud_data_transformed = fraud_data_transformed.toarray()

# Generate the correct column names based on transformers
transformed_columns = (
    [f"age_{cat}" for cat in column_transformer.named_transformers_['ohe_age'].categories_[0]] +
    [f"gender_{cat}" for cat in column_transformer.named_transformers_['ohe_gender'].categories_[0]] +
    [f"category_{cat}" for cat in column_transformer.named_transformers_['ohe_category'].categories_[0]] +
    ['scaled_amount'] +
    ['scale_CusTranCount'] +
    ['scale_CusAvgAmount'] +
    ['scale_MerAvgAmount'] +
    ['scale_DayOfWeek'] +
    ['scale_HourOfDay'] +
    ['step', 'fraud']  # Adding passthrough columns
)

# Convert transformed data back to a DataFrame with correct column names
fraud_data_encoded = pd.DataFrame(fraud_data_transformed, columns=transformed_columns)

fraud_data_encoded.head()


#### Handling Class Imbalance & Model Selection

In [None]:
print('No Frauds', round(fraud_data_encoded['fraud'].value_counts()[0]/len(fraud_data_encoded) * 100,2), '% of the dataset')
print('Frauds', round(fraud_data_encoded['fraud'].value_counts()[1]/len(fraud_data_encoded) * 100,2), '% of the dataset')

X = fraud_data_encoded.drop('fraud', axis=1)
y = fraud_data_encoded['fraud']

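# Stratified 5-fold split; the loop overwrites its variables on every pass,
# so the last fold ends up as the hold-out test set used later in the notebook
sss = StratifiedKFold(n_splits=5, random_state=None, shuffle=False)

for train_index, test_index in sss.split(X, y):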
    original_Xtrain, original_Xtest = X.iloc[train_index], X.iloc[test_index]
    original_ytrain, original_ytest = y.iloc[train_index], y.iloc[test_index]


# Check the Distribution of the labels
# Turn into an array
original_Xtrain = original_Xtrain.values
original_Xtest = original_Xtest.values
original_ytrain = original_ytrain.values
original_ytest = original_ytest.values

# See if both the train and test label distribution are similarly distributed
train_unique_label, train_counts_label = np.unique(original_ytrain, return_counts=True)
test_unique_label, test_counts_label = np.unique(original_ytest, return_counts=True)
print('-' * 100)

print('Label Distributions: \n')
print(train_counts_label/ len(original_ytrain) * 100)
print(test_counts_label/ len(original_ytest) * 100)
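
The point of the stratified split is that every fold keeps roughly the same share of fraud as the full dataset. The sketch below illustrates this on hypothetical labels with 2% positives; it uses `shuffle=True` with a fixed seed, which differs from the unshuffled split in the cell above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 1,000 samples, 2% positives.
y_toy = np.array([1] * 20 + [0] * 980)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_rates = []
for train_idx, test_idx in skf.split(X_toy, y_toy):
    # Share of positives in the training part and in the held-out part.
    fold_rates.append((y_toy[train_idx].mean(), y_toy[test_idx].mean()))

for i, (train_rate, test_rate) in enumerate(fold_rates):
    print(f"fold {i}: train positives {train_rate:.3f}, test positives {test_rate:.3f}")
```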

In [None]:
# Parameters
log_reg_params = {"penalty": ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}

# List to append the score and then find the average
accuracy_lst = []
precision_lst = []
recall_lst = []
f1_lst = []
auc_lst = []

# Classifier and randomized search over log_reg_params
# The 'liblinear' solver supports both the 'l1' and 'l2' penalties in the search space
log_reg_sm = LogisticRegression(solver='liblinear')
rand_log_reg = RandomizedSearchCV(LogisticRegression(solver='liblinear'), log_reg_params, n_iter=4)


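# Implementing the SMOTE technique
# SMOTE sits inside the pipeline, so only the training part of each fold is oversampled
# and the validation part keeps its original class distribution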

for train, test in sss.split(original_Xtrain, original_ytrain):
    pipeline = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority'), rand_log_reg)
    model = pipeline.fit(original_Xtrain[train], original_ytrain[train])
    best_est = rand_log_reg.best_estimator_
    prediction = best_est.predict(original_Xtrain[test])
    
    accuracy_lst.append(pipeline.score(original_Xtrain[test], original_ytrain[test]))
    precision_lst.append(precision_score(original_ytrain[test], prediction))
    recall_lst.append(recall_score(original_ytrain[test], prediction))
    f1_lst.append(f1_score(original_ytrain[test], prediction))
    auc_lst.append(roc_auc_score(original_ytrain[test], prediction))
    
print('---' * 45)
print('')
print("accuracy: {}".format(np.mean(accuracy_lst)))
print("precision: {}".format(np.mean(precision_lst)))
print("recall: {}".format(np.mean(recall_lst)))
print("f1: {}".format(np.mean(f1_lst)))
print("auc: {}".format(np.mean(auc_lst)))
print('---' * 45)

In [None]:
labels = ['No Fraud', 'Fraud']
smote_prediction = best_est.predict(original_Xtest)
print(classification_report(original_ytest, smote_prediction, target_names=labels))

* **Precision:** How many of the predicted fraud cases were actual fraud.
* **Recall:** How many of the actual fraud cases were detected.
* **F1 Score:** Harmonic mean of precision and recall, balancing both.
* **Accuracy:** Overall correctness of the model, but can be misleading in imbalanced datasets.
* **Macro Average:** Averaged metrics considering each class equally.
* **Weighted Average:** Averaged metrics weighted by the number of instances in each class.
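
The definitions above can be made concrete with a small worked example. The labels below are hypothetical (5 frauds among 100 transactions); the metrics are computed from the confusion counts by hand and compared with scikit-learn's functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 95 non-fraud followed by 5 fraud.
y_true = np.array([0] * 95 + [1] * 5)
# Hypothetical predictions: 2 false alarms, 3 frauds caught, 2 frauds missed.
y_pred = np.array([0] * 93 + [1] * 2 + [1] * 3 + [0] * 2)

tp = int(((y_true == 1) & (y_pred == 1)).sum())
fp = int(((y_true == 0) & (y_pred == 1)).sum())
fn = int(((y_true == 1) & (y_pred == 0)).sum())
tn = int(((y_true == 0) & (y_pred == 0)).sum())

precision = tp / (tp + fp)                           # predicted frauds that are real
recall = tp / (tp + fn)                              # real frauds that were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / len(y_true)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

In this toy case accuracy is high even though two of the five frauds are missed, which is the sense in which accuracy can mislead on imbalanced data.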

#### Random Forest

In [None]:
# Parameters for Random Forest
rf_params = {
    'n_estimators': [5, 10]
}

# Initialize RandomizedSearchCV
rand_rf = RandomizedSearchCV(RandomForestClassifier(random_state=42), rf_params, random_state=42)

# Lists to store metrics
rf_accuracy_lst = []
rf_precision_lst = []
rf_recall_lst = []
rf_f1_lst = []
rf_auc_lst = []


# Cross-validation loop for Random Forest
for train_idx, test_idx in sss.split(X, y):
    pipeline_rf = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority', random_state=42), rand_rf)
    pipeline_rf.fit(X.iloc[train_idx], y.iloc[train_idx])
    best_rf = rand_rf.best_estimator_
    predictions_rf = best_rf.predict(X.iloc[test_idx])
    
    rf_accuracy_lst.append(accuracy_score(y.iloc[test_idx], predictions_rf))
    rf_precision_lst.append(precision_score(y.iloc[test_idx], predictions_rf))
    rf_recall_lst.append(recall_score(y.iloc[test_idx], predictions_rf))
    rf_f1_lst.append(f1_score(y.iloc[test_idx], predictions_rf))
    rf_auc_lst.append(roc_auc_score(y.iloc[test_idx], predictions_rf))


# Average metrics for Random Forest
print('Random Forest Results:')
print('---' * 15)
print(f"Accuracy: {np.mean(rf_accuracy_lst)}")
print(f"Precision: {np.mean(rf_precision_lst)}")
print(f"Recall: {np.mean(rf_recall_lst)}")
print(f"F1 Score: {np.mean(rf_f1_lst)}")
print(f"AUC: {np.mean(rf_auc_lst)}")
print('---' * 15)

In [None]:
labels = ['No Fraud', 'Fraud']
rf_prediction = best_rf.predict(original_Xtest)
print(classification_report(original_ytest, rf_prediction, target_names=labels))

#### XGBoost

In [None]:
# Parameters for XGBoost
xgb_params = {
    'n_estimators': [5, 10],
}

rand_xgb = RandomizedSearchCV(xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss', random_state=42), xgb_params, random_state=42)


xgb_accuracy_lst = []
xgb_precision_lst = []
xgb_recall_lst = []
xgb_f1_lst = []
xgb_auc_lst = []


# Cross-validation loop for XGBoost
for train_idx, test_idx in sss.split(X, y):
    pipeline_xgb = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority', random_state=42), rand_xgb)
    pipeline_xgb.fit(X.iloc[train_idx], y.iloc[train_idx])
    best_xgb = rand_xgb.best_estimator_
    predictions_xgb = best_xgb.predict(X.iloc[test_idx])
    
    xgb_accuracy_lst.append(accuracy_score(y.iloc[test_idx], predictions_xgb))
    xgb_precision_lst.append(precision_score(y.iloc[test_idx], predictions_xgb))
    xgb_recall_lst.append(recall_score(y.iloc[test_idx], predictions_xgb))
    xgb_f1_lst.append(f1_score(y.iloc[test_idx], predictions_xgb))
    xgb_auc_lst.append(roc_auc_score(y.iloc[test_idx], predictions_xgb))


# Average metrics for XGBoost
print('XGBoost Results:')
print('---' * 15)
print(f"Accuracy: {np.mean(xgb_accuracy_lst)}")
print(f"Precision: {np.mean(xgb_precision_lst)}")
print(f"Recall: {np.mean(xgb_recall_lst)}")
print(f"F1 Score: {np.mean(xgb_f1_lst)}")
print(f"AUC: {np.mean(xgb_auc_lst)}")
print('---' * 15)


In [None]:
labels = ['No Fraud', 'Fraud']
xgb_prediction = best_xgb.predict(original_Xtest)
print(classification_report(original_ytest, xgb_prediction, target_names=labels))

XGBoost:

* **High Precision and Recall for Fraud:** Indicates that XGBoost is good at identifying fraud with relatively low false positives and misses.
* **Balanced Performance:** The model achieves a good balance, making it suitable for detecting fraud in this context.
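
A note on the AUC values reported by the loops above: they pass hard 0/1 predictions to `roc_auc_score`, which evaluates a single threshold only. ROC AUC is usually computed from scores or probabilities, for example `predict_proba(...)[:, 1]`. The sketch below uses hypothetical scores to show how the two numbers can differ.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and model scores (probability of fraud).
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.70, 0.90])

# Hard labels at the default 0.5 threshold, as predict() would return.
hard_labels = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, scores)
auc_from_labels = roc_auc_score(y_true, hard_labels)

print(f"AUC from scores: {auc_from_scores:.3f}")
print(f"AUC from hard labels: {auc_from_labels:.3f}")
```

The score-based value reflects how well the model ranks fraud above non-fraud across all thresholds, while the label-based value depends on the one threshold that was applied.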

### Model Interpretation

#### Feature Importance

In [None]:
# Extract feature importance from XGBoost
xgb_feature_importance = rand_xgb.best_estimator_.feature_importances_
xgb_features = X.columns  # feature names in the order the model was trained on
xgb_feature_importance_df = pd.DataFrame({'Feature': xgb_features, 'Importance': xgb_feature_importance})
xgb_feature_importance_df = xgb_feature_importance_df.sort_values(by='Importance', ascending=False)

# Plot feature importance for XGBoost
plt.figure(figsize=(12, 6))
plt.barh(xgb_feature_importance_df['Feature'], xgb_feature_importance_df['Importance'])
plt.title('XGBoost Feature Importance')
plt.xlabel('Importance')
plt.gca().invert_yaxis()
plt.show()
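
Built-in tree importances are computed from the training process and can be cross-checked with permutation importance, which measures how much a held-out score drops when one feature is shuffled. The sketch below is not part of the original analysis: it uses a random forest on hypothetical synthetic data in which only the `signal` column carries information, and all names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical data: the label depends on 'signal' only.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y_toy = (X_toy["signal"] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out data and record the drop in accuracy.
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=0)
importance = pd.Series(result.importances_mean, index=X_toy.columns)
importance = importance.sort_values(ascending=False)

print(importance)
```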

#### SHAP Values

In [None]:
import shap

# Fit SHAP explainer on the final model
explainer_xgb = shap.TreeExplainer(rand_xgb.best_estimator_)
shap_values_xgb = explainer_xgb.shap_values(X)

# Plot summary plot for XGBoost
shap.summary_plot(shap_values_xgb, X, plot_type="bar")
