## 🧪 EDA - Fraud Detection Dataset

import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import seaborn as sns  # needed for the boxplot further down
# from ydata_profiling import ProfileReport

from sklearn.preprocessing import RobustScaler


# Profiling a sample can give a clear picture of the overall dataset, which is especially useful when the dataset is very large. We therefore profile a sample first to understand the data.

# %%
# Load the full dataset (the profiling further down was run on a sample of it, e.g., 1% of the data)

df = pd.read_csv(r'D:\Data science\Projects\Fraud_detection\fraud_detection_app\data\Fraud.csv', low_memory=False)
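# %% [markdown]
# The commented-out profiling call refers to a `sample_df` that this notebook never creates. A minimal sketch of how such a 1% sample could be drawn, using a synthetic stand-in table rather than Fraud.csv (`demo_df` and `demo_sample` are hypothetical names, and the 1% fraction is taken from the comment above):

```python
# %%
import numpy as np
import pandas as pd

# Synthetic stand-in for the transactions table (hypothetical data, not Fraud.csv)
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "amount": rng.exponential(scale=1000.0, size=10_000),
    "isFraud": rng.binomial(1, 0.01, size=10_000),
})

# Draw a reproducible 1% random sample (without replacement) for profiling
demo_sample = demo_df.sample(frac=0.01, random_state=42)
print(demo_sample.shape)
```

# %% [markdown]
# Fixing `random_state` keeps the sample, and therefore the profiling numbers, reproducible between runs.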


# Generate the Pandas Profiling report
# profile = ProfileReport(sample_df, title="Pandas Profiling Report", explorative=True)

# # Save the report as an HTML file
# profile.to_file("profiling_report.html")

# # %%
# profile

# # %%
# Insights from the profiling report (computed on the sample):

# 1.isFlaggedFraud:
# Constant Value: This column has a constant value of "0".

# 2.nameOrig:
# High Cardinality: Contains 63,625 distinct values.
# Uniform Distribution: The values are uniformly distributed.
# Unique Values: Each value is unique.

# 3.nameDest:
# High Cardinality: Contains 60,682 distinct values.
# Uniform Distribution: The values are uniformly distributed.

# 4.isFraud:
# Highly Imbalanced: The dataset is highly imbalanced with only 1.4% of the transactions being fraudulent (98.6% non-fraudulent).

# 5.amount:
# Highly Skewed: Skewness of 30.165, indicating a highly skewed distribution.

# 6.oldbalanceOrg:
# Zeros: Contains 20,917 zeros (32.9% of the values).

# 7.newbalanceOrig:
# Zeros: Contains 35,926 zeros (56.5% of the values).

# 8.oldbalanceDest:
# Highly Skewed: Skewness of 26.748, indicating a highly skewed distribution.
# Zeros: Contains 27,057 zeros (42.5% of the values).

# 9.newbalanceDest:
# Highly Skewed: Skewness of 24.522, indicating a highly skewed distribution.
# Zeros: Contains 24,403 zeros (38.4% of the values).

# %%
# Check for missing values for the whole data
print(df.isnull().sum())

# %% [markdown]
# ## No null values found

# %%
# With the null check done, we can explore the whole dataset in more depth through EDA

# %% [markdown]
# # EDA

# %%
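# Correlation matrix (numeric columns only; string columns such as type, nameOrig, and nameDest cannot be correlated)
corr_matrix = df.corr(numeric_only=True)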

# Heatmap of the correlation matrix
# plt.figure(figsize=(12, 8))
# sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
# plt.show()

# The correlation matrix shows strong collinearity within two pairs of features: oldbalanceDest with newbalanceDest, and oldbalanceOrg with newbalanceOrig (correlation coefficients of 0.98 and 1). Features this strongly correlated are largely redundant, so keeping both members of a pair adds little value to the model.
# Highly correlated features also cause multicollinearity in regression models, which makes the coefficients unstable and harder to interpret.

df.head()


# Drop one feature from each highly correlated pair to improve model performance
df.drop(columns=['oldbalanceDest', 'oldbalanceOrg'], inplace=True)

# %%
df.head()
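# %% [markdown]
# The highly correlated pairs discussed above were read off the heatmap by eye. A minimal sketch of listing such pairs programmatically, on synthetic data (`demo_num`, `demo_corr`, `high_pairs`, and the 0.95 threshold are assumptions for illustration, not values from the original analysis):

```python
# %%
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)
demo_num = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.01, size=500),  # nearly a copy of "a"
    "c": rng.normal(size=500),                      # independent of "a" and "b"
})

demo_corr = demo_num.corr().abs()

# Keep only the strict upper triangle so each pair appears once
upper_mask = np.triu(np.ones(demo_corr.shape, dtype=bool), k=1)
high_pairs = demo_corr.where(upper_mask).stack()

# Filter to pairs above the (assumed) threshold; NaN entries fail the comparison
high_pairs = high_pairs[high_pairs > 0.95]
print(high_pairs)
```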

# %%
# Data Distribution Visualization
df.hist(figsize=(20, 15))
plt.show()

# %% [markdown]
# STEP: The data is spread across the entire range of steps, with more transactions occurring around the middle steps (100 to 300 hours). This indicates a relatively even distribution of transactions over time, with some variation.
#
# AMOUNT: The majority of transaction amounts are clustered towards the lower end of the scale, with a long tail extending towards very high amounts. This indicates a positively skewed distribution with a few large transactions.
#
# newbalanceOrig: Most balances are very low, with a few accounts having significantly higher balances. This shows a skewed distribution with many small values and a long tail of larger values. The same holds for newbalanceDest.
#
# isFraud: The histogram shows a highly imbalanced dataset in which the majority of transactions are non-fraudulent and very few are marked as fraudulent. The same holds for isFlaggedFraud.

# %%
plt.figure(figsize=(20, 15))
sns.boxplot(data=df.select_dtypes(include='number'))  # boxplots only make sense for numeric columns
plt.xticks(rotation=90)
plt.show()

# %% [markdown]
# Here we can see two problems:
# 1. The data is not scaled. Options include min-max scaling, normalization, clipping, log transformation, or binning.
# 2. There are outliers. Removing them is risky because the extreme values may be exactly the fraudulent transactions.
#
# I do not want to remove the outliers, but scaling still needs to be done, so clipping looks preferable.
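# %% [markdown]
# Log transformation is one of the options listed above. A minimal sketch of its effect on a long-tailed variable, using synthetic log-normal data rather than the real `amount` column (`demo_amount` and `demo_log_amount` are hypothetical names):

```python
# %%
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic, strongly right-skewed "amounts" (hypothetical data)
demo_amount = pd.Series(rng.lognormal(mean=8.0, sigma=2.0, size=5000))

# log1p(x) = log(1 + x): defined at zero and compresses the long right tail
demo_log_amount = np.log1p(demo_amount)

skew_before = demo_amount.skew()
skew_after = demo_log_amount.skew()
print(skew_before, skew_after)
```

# %% [markdown]
# Unlike clipping, the log transform keeps the ordering of extreme values, so unusually large transactions remain distinguishable.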

# %%
df.head()

# %%
df['Amount_Clip'] = df['amount'].clip(lower=0, upper=1000000)
df['NewBalOrig_Clip'] = df['newbalanceOrig'].clip(lower=0, upper=1000000)
df['BalDest_Clip'] = df['newbalanceDest'].clip(lower=0, upper=1000000)

# %%
upper_threshold1 = df['amount'].quantile(0.99)
upper_threshold2= df['newbalanceOrig'].quantile(0.99)
upper_threshold3= df['newbalanceDest'].quantile(0.99)
upper_threshold1,upper_threshold2,upper_threshold3

# %% [markdown]
# Clipping is not applied for now, because suitable thresholds are unclear and should be agreed with the client. RobustScaler is used instead.
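# %% [markdown]
# To make the choice of RobustScaler concrete: with its default settings it subtracts the median and divides by the interquartile range, so a few extreme values barely influence the centre and scale. A small sketch on toy numbers (`demo_x` is hypothetical data) comparing it with the manual formula:

```python
# %%
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy column with one extreme value (hypothetical data)
demo_x = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

demo_scaled = RobustScaler().fit_transform(demo_x)

# Manual equivalent of the default settings: (x - median) / (Q3 - Q1)
q1, med, q3 = np.percentile(demo_x, [25, 50, 75])
demo_manual = (demo_x - med) / (q3 - q1)

print(np.allclose(demo_scaled, demo_manual))
```

# %% [markdown]
# The extreme value is still extreme after scaling. RobustScaler does not cap or remove outliers, which matches the aim of keeping potentially fraudulent transactions visible.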

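# %%
# Drop the exploratory clipped columns so they do not end up in the feature set
df.drop(columns=['Amount_Clip', 'NewBalOrig_Clip', 'BalDest_Clip'], inplace=True, errors='ignore')
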
# %%
features_to_scale = ['amount', 'newbalanceOrig', 'newbalanceDest']
data_to_scale = df[features_to_scale].values
scaler = RobustScaler()
scaled_data = scaler.fit_transform(data_to_scale)


# %%
scaled_data

# %%
# Replace original columns with scaled data
df[features_to_scale] = scaled_data


# %%
df.head()

# %% [markdown]
# Profiling also showed that nameOrig and nameDest have very high cardinality (every nameOrig value is unique). Identifier-like columns of this kind carry little generalizable signal and can hurt model performance, so both features are dropped.

# %%
df.drop(['nameOrig','nameDest'],inplace=True,axis=1)

# %%
df['step'].unique()

# %%
df['step'].unique().size

# %% [markdown]
# This analysis does not require understanding the timing of transactions (e.g., when they occur within the 30-day period), so the step column is unlikely to add value and is dropped.

# %%
df.drop('step',axis=1,inplace=True)

# %%
df.head()

# %%
df['type'].unique()

# %%
# Feature encoding

# %%

from sklearn.preprocessing import LabelEncoder

# Initialize LabelEncoder
label_encoder = LabelEncoder()

# Encode the 'type' column as integers in a new 'transaction_type' column
df['transaction_type'] = label_encoder.fit_transform(df['type'])
df.head()

# %%

df.drop('type',axis=1,inplace=True)

# %%
df.head()
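# %% [markdown]
# LabelEncoder maps categories to integers, which some models may read as an ordering. As a possible alternative (a swapped-in technique, not what this notebook uses), one-hot encoding gives each category its own indicator column. A sketch with illustrative category values (`demo_types` and its contents are assumptions, not read from the dataset):

```python
# %%
import pandas as pd

# Illustrative transaction types (hypothetical values)
demo_types = pd.DataFrame({"type": ["PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"]})

# One indicator column per category; dtype=int yields 0/1 instead of booleans
demo_onehot = pd.get_dummies(demo_types, columns=["type"], prefix="type", dtype=int)
print(demo_onehot.columns.tolist())
```

# %% [markdown]
# Tree-based models usually cope well with integer codes, so the label-encoded column may be adequate there. The one-hot form tends to matter more for linear or distance-based models.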

# %%

# Create count plot
# plt.figure(figsize=(6, 4))
# sns.countplot(x='isFlaggedFraud', data=df, palette='viridis')

# # Add titles and labels
# plt.title('Count of Fraud Flagged Transactions')
# plt.xlabel('Is Flagged Fraud')
# plt.ylabel('Count')
# plt.xticks(ticks=[0, 1], labels=['Not Flagged', 'Flagged'])

# # %%
# cor=df['isFlaggedFraud'].corr(df['isFraud'])
# cor

# %% [markdown]
# isFlaggedFraud is almost entirely a single class, and its relationship with the target variable is very weak, so we may prefer to remove it.

# %%

# # Create count plot
# plt.figure(figsize=(6, 4))
# sns.countplot(x='isFraud', data=df, palette='viridis')

# %% [markdown]
# The target variable is imbalanced. I would use stratified splitting during model training and, depending on the performance, may additionally opt for SMOTE.
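# %% [markdown]
# A minimal sketch of the stratified split mentioned above, on synthetic data with a 2% positive rate (`demo_X`, `demo_y`, the 80/20 split, and the class ratio are assumptions for illustration). Stratifying keeps the fraud rate the same in the training and test sets:

```python
# %%
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic features and an imbalanced target (hypothetical data)
demo_X = pd.DataFrame({"amount": rng.exponential(1000.0, size=1000)})
demo_y = pd.Series([1] * 20 + [0] * 980, name="isFraud")

# stratify=demo_y preserves the class proportions in both splits
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=demo_y, random_state=42
)
print(demo_y_train.mean(), demo_y_test.mean())
```

# %% [markdown]
# If oversampling such as SMOTE is added later, it would normally be fitted on the training split only, so that the test set keeps the real class balance.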
