<a href="https://colab.research.google.com/github/SriRamK345/Enhancing-Financial-Security-A-Predictive-Model-for-Fraud-Detection/blob/main/Predicting_fraudulent_transactions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Key Features:**

* **step:** Represents a unit of time, where 1 step equals 1 hour. The simulation spans 744 steps (31 days × 24 hours/day).
* **type:** Categorical variable indicating the transaction type:
    * CASH-IN
    * CASH-OUT
    * DEBIT
    * PAYMENT
    * TRANSFER
* **amount:** Numerical value representing the transaction amount in the local currency.
* **nameOrig:** Customer who initiated the transaction.
* **oldbalanceOrg:** Initial balance of the originator's account before the transaction.
* **newbalanceOrig:** Updated balance of the originator's account after the transaction.
* **nameDest:** Recipient of the transaction. Names starting with "M" denote merchants.
* **oldbalanceDest:** Initial balance of the recipient's account before the transaction. No balance information is available for merchants (names starting with "M").
* **newbalanceDest:** Updated balance of the recipient's account after the transaction. No balance information is available for merchants (names starting with "M").
* **isFraud:** Marks transactions made by the fraudulent agents inside the simulation. In this dataset, the fraudulent agents aim to profit by taking control of customers' accounts and emptying them: they transfer the funds to another account and then cash out of the system.
* **isFlaggedFraud:** The business model aims to control massive transfers from one account to another and flags illegal attempts. In this dataset, an illegal attempt is an attempt to transfer more than 200,000 in a single transaction.
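
The `isFlaggedFraud` rule described above can be sketched as a simple check. The rows below are made up for illustration and only reuse the dataset's column names; whether the real data follows the rule exactly is not confirmed here, which is what the cross-tabulation is meant to reveal.

```python
import pandas as pd

# Toy rows using the column names described above (values are made up).
toy = pd.DataFrame({
    "type": ["TRANSFER", "TRANSFER", "DEBIT", "PAYMENT"],
    "amount": [250_000.0, 150_000.0, 300_000.0, 50.0],
    "isFlaggedFraud": [1, 0, 0, 0],
})

# Rule as described: a single transfer of more than 200,000 is an illegal attempt.
toy["rule_flag"] = ((toy["type"] == "TRANSFER") & (toy["amount"] > 200_000)).astype(int)

# Cross-tabulate the rule against the recorded flag to see where they disagree.
agreement = pd.crosstab(toy["rule_flag"], toy["isFlaggedFraud"])
print(agreement)
```

Running the same two lines on `df` would show how many rows satisfy the rule without being flagged, and vice versa.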

In [None]:
# Data cleaning
import pandas as pd
import numpy as np
# Visualization / EDA
import matplotlib.pyplot as plt
import seaborn as sns
# remove warnings
import warnings
warnings.filterwarnings("ignore")
import joblib

# Loading datasets

In [None]:
df = pd.read_csv("/content/drive/MyDrive/Predicting fraudulent transactions/Fraud.csv")
df.head()

# Analysing Datasets

In [None]:
df.info()

In [None]:
print("number of rows :",len(df))
print("number of columns :",len(df.columns))

In [None]:
num_duplicates = df.duplicated().sum()
total_rows = len(df)
percentage_duplicates = (num_duplicates / total_rows) * 100
print(f"Percentage of duplicate values: {percentage_duplicates:.2f}%")


# Checking Null Values

In [None]:
df.isna().sum()

In [None]:
per_null = df.isna().sum()/len(df)*100
print(f"percentage of missing data {per_null}")

In [None]:
df["isFraud"].value_counts()

In [None]:
df["isFlaggedFraud"].value_counts()

In [None]:
df.describe().T

# Unique Values

In [None]:
# Count the distinct (non-null) values in each column
df.nunique().to_frame("Total Unique Values")

# Exploratory Data Analysis (EDA)

In [None]:
sns.set_style("darkgrid")
plt.figure(figsize=(10, 6))

isFraud_counts = df["isFraud"].value_counts()
labels = isFraud_counts.index
sizes = isFraud_counts.values

# Create pie chart
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)
plt.title("Distribution of Fraud")
plt.show()

In [None]:
sns.countplot(data=df, x='type',palette = "Set2")
plt.xlabel("Transaction Type")
plt.ylabel("Count")
plt.title("Distribution of Transaction Types")
plt.show()

In [None]:
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='type', hue='isFraud', palette="Set2")
plt.xlabel("Transaction Type")
plt.ylabel("Count")
plt.title("Distribution of Transaction Types by Fraud")
plt.legend(title="Is Fraud", loc="upper right")
plt.show()

# Feature Engineering

In [None]:
df_=df.copy()

In [None]:
# check object datatypes
obj = df_.select_dtypes(include = "object").columns
print(obj)

## Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
# encode the objects
le = LabelEncoder()

for i in obj:
    df_[i] = le.fit_transform(df_[i].astype(str))

print(df_.info())

In [None]:
# Checking for correlation
corr = df_.corr(numeric_only=True)
plt.figure(figsize=(10,5))
sns.heatmap(corr , annot =True)

# Variance inflation factor

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [None]:
# function to compute the variance inflation factor (VIF) of each column
def cal_vif(df):
    vif = pd.DataFrame()
    vif['variables'] = df.columns
    vif['VIF'] = [variance_inflation_factor(df.values,i) for i in range(df.shape[1])]
    return vif

cal_vif(df_)
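
For reference, the VIF of a column is 1 / (1 - R²), where R² comes from regressing that column on all the others. The sketch below computes it that way on synthetic data; `vif_from_r2` is a hypothetical helper, not part of statsmodels. It fits an intercept in each regression, so its values may differ from `cal_vif` above when the columns are not centered.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def vif_from_r2(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)   # nearly collinear with a
c = rng.normal(size=300)                  # independent of a and b
vifs = vif_from_r2(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly collinear columns get large values, while the independent column stays close to 1.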

# Sampling

In [None]:
df["isFraud"].value_counts()

In [None]:
# Separate majority and minority classes
df_majority = df_[df_['isFraud'] == 0]
df_minority = df_[df_['isFraud'] == 1]

# Downsample the majority class to 25000 samples
df_majority_downsampled = df_majority.sample(n=25000, random_state=42)

# Oversample the minority class (with replacement) to 20000 samples
df_minority_oversample = df_minority.sample(n=20000, random_state=42, replace=True)

# Combine
df_balanced = pd.concat([df_majority_downsampled, df_minority_oversample])

# Shuffle the balanced dataset
df_balanced = df_balanced.sample(frac=1, random_state=42).reset_index(drop=True)

print("Balanced Class Distribution:")
print(df_balanced['isFraud'].value_counts())
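
Because the minority class is sampled with replacement before the train/test split further down, copies of the same fraud row can land in both sets. A common alternative, sketched here on synthetic data, is to hold out the test set first and resample only the training portion. The frame `demo` and the sample sizes are assumptions for illustration, not the notebook's values.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_ (950 legitimate rows, 50 fraud rows).
demo = pd.DataFrame({
    "amount": range(1000),
    "isFraud": [0] * 950 + [1] * 50,
})

# 1) Hold out the test set first, stratified so it keeps the original fraud rate.
train, test = train_test_split(
    demo, test_size=0.2, stratify=demo["isFraud"], random_state=42
)

# 2) Resample the training portion only.
train_major = train[train["isFraud"] == 0].sample(n=200, random_state=42)
train_minor = train[train["isFraud"] == 1].sample(n=160, replace=True, random_state=42)
train_balanced = pd.concat([train_major, train_minor]).sample(frac=1, random_state=42)

# No original row appears in both sets.
overlap = set(train_balanced.index) & set(test.index)
print(train_balanced["isFraud"].value_counts())
print("rows shared between train and test:", len(overlap))
```

With this ordering the test set keeps the original class imbalance, so its precision and recall are closer to what the model would see in production.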

# PCA

PCA creates uncorrelated principal components, which can help machine learning algorithms that assume feature independence. The correlation heatmap above shows high correlation among `oldbalanceOrg`, `newbalanceOrig`, `oldbalanceDest`, and `newbalanceDest`, so PCA is applied to these four columns and they are replaced by two principal components.

In [None]:
from sklearn.decomposition import PCA

# Selecting the desired columns using square brackets
high = df_balanced[['oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']]
pca = PCA(n_components=2) # number of components 2
principal_components = pca.fit_transform(high)
principal_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])

In [None]:
# concat
df_balanced.drop(['oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest'], axis=1, inplace=True)
df_balanced = pd.concat([df_balanced, principal_df], axis=1)
df_balanced.head()

In [None]:
loadings = pca.components_.T  # Transpose to get variables as rows
loading_df = pd.DataFrame(loadings, index=high.columns, columns=['PC1', 'PC2'])
print(loading_df)
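
To judge how much information two components keep, it can help to look at `explained_variance_ratio_`. The sketch below uses synthetic columns in place of the real balances and standardizes them first. Standardizing is an assumption here, since the notebook applies PCA to the raw values; without it, the columns with the largest scale tend to dominate the components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for four balance columns (not the real data):
# the first two are strongly correlated, the last two are independent.
base = rng.normal(size=(500, 1))
balances = np.hstack([
    base * 1000 + rng.normal(scale=10, size=(500, 1)),
    base * 1000 + rng.normal(scale=10, size=(500, 1)),
    rng.normal(size=(500, 1)),
    rng.normal(size=(500, 1)),
])

# Standardize first so the large-scale columns do not dominate the components.
standardized = StandardScaler().fit_transform(balances)
pca_demo = PCA(n_components=2).fit(standardized)

# Fraction of total variance retained by each component and in total.
print(pca_demo.explained_variance_ratio_)
print("retained:", pca_demo.explained_variance_ratio_.sum())
```

The same attribute is available on the `pca` object fitted above, so `pca.explained_variance_ratio_` shows what the two components retain for the real balance columns.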

In [None]:
# Checking for correlation again
corr = df_balanced.corr(numeric_only=True)
plt.figure(figsize=(10,5))
sns.heatmap(corr , annot =True)

# Scaling

In [None]:
from sklearn.preprocessing import MinMaxScaler

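scaler = MinMaxScaler()

# Note: the scaler is applied to every column, including the isFraud target.
# The scaled values are only displayed here; the models below are fit on the unscaled df_balanced.
scaled_data = scaler.fit_transform(df_balanced)
pd.DataFrame(scaled_data, columns=df_balanced.columns)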

# Train-Test Split, Modelling, and Evaluation Metrics

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df_balanced.drop("isFraud", axis=1)
Y = df_balanced["isFraud"]

# split the dataset for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size= 0.2, random_state= 42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)
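
If scaled features are wanted for the models, one common pattern is to fit the scaler on the training rows only and reuse its statistics on the test rows. The sketch below shows this on random data; `X_demo`, `y_demo`, and the other names are hypothetical and stand in for `X` and `Y` above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X_demo = rng.uniform(0, 1_000, size=(200, 3))   # hypothetical features
y_demo = rng.integers(0, 2, size=200)           # hypothetical labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

scaler_demo = MinMaxScaler()
X_tr_scaled = scaler_demo.fit_transform(X_tr)   # learn min/max from training rows only
X_te_scaled = scaler_demo.transform(X_te)       # reuse those statistics on the test rows

print(X_tr_scaled.min(axis=0), X_tr_scaled.max(axis=0))
```

Test values can fall slightly outside [0, 1] with this approach, which is expected: the test rows were not used to compute the minimum and maximum.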

## LogisticRegression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
log_r = LogisticRegression()
log_r.fit(X_train, y_train)
y_pred_lg = log_r.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred_lg))

In [None]:
# plot confusion matrix
sns.heatmap(confusion_matrix(y_test, y_pred_lg), annot=True, fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

## XGBClassifier

In [None]:
from xgboost import XGBClassifier

# XGBoost
xgb = XGBClassifier()
xgb.fit(X_train, y_train)

y_pred_xgb = xgb.predict(X_test)
xgb_score = xgb.score(X_test, y_test) * 100
print("XGBoost Classifier Accuracy:", xgb_score)

In [None]:
print(classification_report(y_test, y_pred_xgb))

In [None]:
# plot confusion matrix
sns.heatmap(confusion_matrix(y_test, y_pred_xgb), annot=True, fmt='g')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

## Inference

* The XGBoost Classifier demonstrates high accuracy and a good balance between precision and recall in predicting fraudulent transactions.
* It is effective in identifying a significant portion of actual fraud cases (high recall), while also minimizing false positives (reasonable precision).
* The model may need further tuning to reduce false positives, depending on the business context and the cost of false alarms.
* Ongoing monitoring and evaluation are crucial to ensure the model's continued effectiveness in fraud detection.
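
One way to trade false positives against missed fraud is to move the decision threshold applied to the predicted probabilities. The sketch below uses made-up labels and scores rather than model output; with the fitted model, the scores would come from something like `xgb.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted fraud probabilities (not model output).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.05, 0.20, 0.45, 0.60, 0.55, 0.80, 0.95, 0.30, 0.40, 0.10])

results = {}
for threshold in (0.3, 0.5, 0.7):
    # Predict fraud when the probability reaches the threshold.
    y_hat = (y_prob >= threshold).astype(int)
    p = precision_score(y_true, y_hat)
    r = recall_score(y_true, y_hat)
    results[threshold] = (p, r)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Raising the threshold reduces false alarms at the price of missing more fraud cases, so the right value depends on the relative cost of each error.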


# Feature Importance

In [None]:
from sklearn.inspection import permutation_importance

result = permutation_importance(xgb, X_test, y_test, n_repeats=10, random_state=42)

# Plotting permutation importance
perm_sorted_idx = result.importances_mean.argsort()
plt.barh(range(X_test.shape[1]), result.importances_mean[perm_sorted_idx])
plt.yticks(range(X_test.shape[1]), X_test.columns[perm_sorted_idx])
plt.xlabel("Mean Importance (Permuted)")
plt.title("Permutation Feature Importance")
plt.show()

# Key Findings

1. `oldbalanceOrg` and `newbalanceOrig`: These features, representing the originator's balance before and after the transaction, have the highest permutation importance. This suggests that the model relies heavily on changes in the originator's account balance to identify fraud, which is logical because fraudulent transactions often involve large withdrawals or transfers that change the balance noticeably. Note that the PCA step above replaces the four balance columns with `PC1` and `PC2`, so in this pipeline the balance information enters the model through those components.
2. `amount`: The transaction amount is another crucial feature. Large or unusual transaction amounts can be indicative of fraudulent behavior, and the model seems to have learned this pattern.
3. `nameDest` and `nameOrig`: While not as important as the balance and amount features, these features (encoded representations of the recipient and originator's names) also contribute to the model's predictions. This might reflect patterns where certain accounts or individuals are more frequently involved in fraudulent transactions.
4. Other Features: The remaining features, such as `step`, `type`, `oldbalanceDest`, and `newbalanceDest`, have relatively lower importance. This doesn't mean they are completely irrelevant, but their impact on the model's predictions is less significant compared to the top features.

---
# Implications for Fraud Detection

- **Focus on Balance and Amount:** The results emphasize the importance of monitoring changes in account balances and transaction amounts for fraud detection. Real-time systems that flag unusual patterns in these features could be valuable in preventing fraudulent activities.

- **Investigate Suspicious Accounts:** The importance of `nameDest` and `nameOrig` suggests that identifying and monitoring accounts frequently associated with fraud could be an effective strategy.

- **Contextual Information:** While `step` and `type` have lower importance, they still provide valuable contextual information. Combining these features with the more important ones could improve the accuracy and interpretability of fraud detection models.

- **Further Feature Engineering:** It might be beneficial to explore new features derived from existing data, such as transaction velocity, daily spending limits, or time-based features, to enhance the model's performance.

Overall, the permutation importance analysis for this dataset highlights the key features that drive the model's predictions and provides valuable insights for improving fraud detection strategies. By focusing on these important features and incorporating contextual information, we can develop more robust and effective systems to combat fraud.
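
The feature ideas listed above can be sketched as follows on a toy frame that reuses the dataset's column names. The values are made up, and the mapping from `step` to hour of day (`step % 24`) is an assumption about where the simulated clock starts.

```python
import pandas as pd

# Toy transactions with the dataset's column names (values are made up).
tx = pd.DataFrame({
    "step": [1, 1, 2, 25, 26],
    "nameOrig": ["C1", "C1", "C2", "C1", "C2"],
    "amount": [100.0, 50.0, 500.0, 75.0, 20.0],
    "oldbalanceOrg": [1000.0, 900.0, 500.0, 850.0, 0.0],
    "newbalanceOrig": [900.0, 850.0, 0.0, 775.0, 0.0],
})

# Time-based features: 1 step = 1 hour, so derive hour of day and day index.
tx["hour_of_day"] = tx["step"] % 24
tx["day"] = tx["step"] // 24

# Velocity: number of transactions per originator per day.
tx["tx_per_day"] = tx.groupby(["nameOrig", "day"])["amount"].transform("count")

# Balance discrepancy: gap between the recorded new balance and old balance minus amount.
tx["orig_balance_error"] = tx["oldbalanceOrg"] - tx["amount"] - tx["newbalanceOrig"]

print(tx)
```

Features like these would need to be computed before the encoding and sampling steps, and evaluated with the same held-out split as the rest of the model.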