# Fraud Detection Project
**Prepared by: Ayaan Shaheere**

## Introduction
This project aims to detect fraudulent transactions using machine learning techniques. 
Below is the description and approach taken to achieve this goal.



In [1]:
# task_1) Data cleaning, including missing values, outliers and multicollinearity.
# Importing necessary libraries
import pandas as pd
from sklearn.ensemble import IsolationForest
import seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, ConfusionMatrixDisplay

In [2]:
# Loading the dataset 
file_path = r'C:\Users\Ayaan\Downloads\Fraud.csv'  # Ensure the correct file path
df = pd.read_csv(file_path)

#Checking if the dataset was loaded correctly
print("Dataset shape:", df.shape)  
print(df.head())  # Previewing the first few rows of the dataset

Dataset shape: (6362620, 11)
   step      type    amount     nameOrig  oldbalanceOrg  newbalanceOrig  \
0     1   PAYMENT   9839.64  C1231006815       170136.0       160296.36   
1     1   PAYMENT   1864.28  C1666544295        21249.0        19384.72   
2     1  TRANSFER    181.00  C1305486145          181.0            0.00   
3     1  CASH_OUT    181.00   C840083671          181.0            0.00   
4     1   PAYMENT  11668.14  C2048537720        41554.0        29885.86   

      nameDest  oldbalanceDest  newbalanceDest  isFraud  isFlaggedFraud  
0  M1979787155             0.0             0.0        0               0  
1  M2044282225             0.0             0.0        0               0  
2   C553264065             0.0             0.0        1               0  
3    C38997010         21182.0             0.0        1               0  
4  M1230701703             0.0             0.0        0               0  


In [4]:
#Using Isolation Forest to detect outliers
iso_forest = IsolationForest(contamination=0.01)  # Initialize Isolation Forest model
outliers = iso_forest.fit_predict(df[['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']])

#Adding outliers column to the dataset
df['outliers'] = outliers

#Displaying how many outliers were detected
print(f"Number of outliers detected: {(df['outliers'] == -1).sum()}")
print(f"Number of normal points: {(df['outliers'] == 1).sum()}")

#Checking for multicollinearity using VIF
# Selecting numeric columns for VIF calculation
numeric_cols = df[['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']]




Number of outliers detected: 63606
Number of normal points: 6299014


In [None]:
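# Calculating VIF
# variance_inflation_factor does not add an intercept itself, so include a constant column and skip it (index 0)
from statsmodels.tools.tools import add_constant
X_vif = add_constant(numeric_cols)
vif_data = pd.DataFrame()
vif_data["feature"] = numeric_cols.columns
vif_data["VIF"] = [variance_inflation_factor(X_vif.values, i) for i in range(1, X_vif.shape[1])]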

# Displaying VIF results
print("\nVariance Inflation Factor (VIF):")
print(vif_data)


In [None]:
# Visualize the outliers using a boxplot
sns.boxplot(x=df['amount'])
plt.title('Transaction Amount Boxplot')
plt.show()


## task_2). Fraud Detection Model

To detect fraudulent transactions, we used a **Random Forest Classifier**, an ensemble method that scales to large datasets and can be adapted to imbalanced classes, for example through class weighting. A Random Forest trains many decision trees on bootstrapped samples of the data and combines their votes, which typically improves accuracy and reduces overfitting compared with a single tree.

The key features used in the model are the transaction type, the amount, and discrepancies between account balances before and after each transaction. We tuned the model's hyperparameters to improve performance and reduce false positives. The model suits this task because it can learn complex decision boundaries and exposes feature importances that help interpret its predictions.
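
The setup described above could be sketched roughly as follows. This is an illustration, not the notebook's training code: it uses a small synthetic frame in place of `Fraud.csv`, and the toy labelling rule, the `class_weight="balanced"` setting, and the hyperparameter values are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Fraud.csv (assumption: the real data would come from pd.read_csv)
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "type": rng.choice(["PAYMENT", "TRANSFER", "CASH_OUT"], size=n),
    "amount": rng.uniform(10, 10_000, size=n),
    "oldbalanceOrg": rng.uniform(0, 50_000, size=n),
})
# Toy labelling rule so the example has something to learn (not a property of the real data)
demo["isFraud"] = (
    demo["type"].isin(["TRANSFER", "CASH_OUT"]) & (demo["amount"] > 9_000)
).astype(int)

# One-hot encode the transaction type; numeric columns are used as they are
X = pd.get_dummies(demo[["type", "amount", "oldbalanceOrg"]], columns=["type"])
y = demo["isFraud"]

# Stratify so the rare positive class appears in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" weights classes inversely to their frequency
rf_model = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=0)
rf_model.fit(X_train, y_train)

test_accuracy = rf_model.score(X_test, y_test)
print(f"Held-out accuracy on the synthetic data: {test_accuracy:.3f}")
```

On imbalanced data, accuracy alone can look good even when most fraud is missed, so precision and recall on the held-out split are usually the more informative numbers.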


## task_3). Variable Selection

The variables were selected for their relevance to detecting fraudulent activity. The key features are `amount`, `type`, and the balance columns (`oldbalanceOrg`, `newbalanceOrig`, `oldbalanceDest`, `newbalanceDest`), from which balance differences are derived; together they describe how much money moved and what it did to each account. Correlation analysis and the model's feature importances helped identify which variables contributed most to the prediction.

We excluded `nameOrig` and `nameDest` because they are customer identifiers rather than descriptions of transactional behavior, which makes them less relevant to fraud detection.
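
As a sketch of what this selection might look like in code, the snippet below rebuilds the five rows from the `df.head()` output shown earlier, derives two balance-difference columns, drops the identifiers, and one-hot encodes `type`. The names `errorOrig` and `errorDest` are hypothetical and do not appear in the dataset.

```python
import pandas as pd

# First five rows of the dataset, copied from the df.head() output above
sample = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "amount": [9839.64, 1864.28, 181.00, 181.00, 11668.14],
    "nameOrig": ["C1231006815", "C1666544295", "C1305486145", "C840083671", "C2048537720"],
    "oldbalanceOrg": [170136.0, 21249.0, 181.0, 181.0, 41554.0],
    "newbalanceOrig": [160296.36, 19384.72, 0.0, 0.0, 29885.86],
    "nameDest": ["M1979787155", "M2044282225", "C553264065", "C38997010", "M1230701703"],
    "oldbalanceDest": [0.0, 0.0, 0.0, 21182.0, 0.0],
    "newbalanceDest": [0.0, 0.0, 0.0, 0.0, 0.0],
})

# Hypothetical derived features: how far each balance change is from the transaction amount
sample["errorOrig"] = sample["oldbalanceOrg"] - sample["amount"] - sample["newbalanceOrig"]
sample["errorDest"] = sample["oldbalanceDest"] + sample["amount"] - sample["newbalanceDest"]

# Drop the account identifiers and one-hot encode the transaction type
features = pd.get_dummies(sample.drop(columns=["nameOrig", "nameDest"]), columns=["type"])
print(features.columns.tolist())
```

On the full dataset the same steps would be applied to `df` instead of `sample`.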


In [None]:
#task_4). Demonstrate the performance of the model by using best set of tools

# Training the Isolation Forest model
isolation_forest = IsolationForest(contamination=0.01)
outliers = isolation_forest.fit_predict(df[['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest', 'newbalanceDest']])
# fit_predict returns -1 for outliers and 1 for inliers; map to 1 (fraud) / 0 (not fraud) to match isFraud
df['is_fraud'] = (outliers == -1).astype(int)

# Performance evaluation
y_true = df['isFraud']  # Actual label from the dataset
y_pred = df['is_fraud']  # Predicted label: outliers are treated as fraud

# Calculate metrics
cm = confusion_matrix(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

# Display confusion matrix
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
plt.title('Confusion Matrix')
plt.show()

# Print metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")



## task_5). What are the Key Factors that Predict Fraudulent Customer?

The key factors that predict fraudulent transactions include:
- **Transaction Type**: Fraudulent activity is concentrated in the `CASH_OUT` and `TRANSFER` types.
- **Transaction Amount**: Larger amounts are more likely to be fraudulent, especially those large enough to be flagged (see the `isFlaggedFraud` column).
- **Balance Discrepancies**: Balance changes that do not match the transaction amount, whether between `oldbalanceOrg` and `newbalanceOrig` or between `oldbalanceDest` and `newbalanceDest`, often indicate fraudulent behavior.
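
One way to check the first factor is to compute the fraud rate per transaction type. The sketch below uses only the `type` and `isFraud` values of the five rows shown in the `df.head()` output, which is far too few to draw conclusions from; it only shows the shape of the calculation, and the name `fraud_by_type` is an assumption.

```python
import pandas as pd

# type and isFraud for the five rows shown in the df.head() output above
head_rows = pd.DataFrame({
    "type": ["PAYMENT", "PAYMENT", "TRANSFER", "CASH_OUT", "PAYMENT"],
    "isFraud": [0, 0, 1, 1, 0],
})

# Share of fraudulent rows and row count per transaction type
fraud_by_type = head_rows.groupby("type")["isFraud"].agg(fraud_rate="mean", n="count")
print(fraud_by_type)
```

Running the same `groupby` on the full `df` would show whether fraud really is concentrated in `CASH_OUT` and `TRANSFER`.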

## task_6). Do These Factors Make Sense? If Yes, How? If Not, How Not?

Yes, these factors make sense because they match typical fraud patterns. After taking control of an account, fraudsters typically try to move or withdraw large sums, most often via `CASH_OUT` or `TRANSFER`. Balance discrepancies are also a key indicator, since legitimate transactions usually change balances by a predictable amount.

## task_7). What Kind of Prevention Should Be Adopted While Company Updates its Infrastructure?

To prevent fraud, the company should:
- Implement **real-time fraud detection** systems using machine learning models.
- Adopt **multi-factor authentication (MFA)** to secure customer accounts.
- Use **transaction limits** and flagging systems for unusually large or suspicious transactions.
- Regularly **monitor system performance** for vulnerabilities and use **encryption** to protect sensitive data.
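
The transaction-limit and flagging idea from the list above could start as a simple rule layer that runs alongside any model. The sketch below is a hypothetical illustration: the `Transaction` class, the `review_reasons` function, and the `AMOUNT_LIMIT` value are assumptions, and a real limit would be derived from business rules and historical data.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical limit, chosen only for illustration
AMOUNT_LIMIT = 100_000.0
HIGH_RISK_TYPES = {"TRANSFER", "CASH_OUT"}


@dataclass
class Transaction:
    type: str
    amount: float
    oldbalanceOrg: float


def review_reasons(tx: Transaction) -> List[str]:
    """Return the reasons a transaction should be held for review (empty list means allow)."""
    reasons = []
    if tx.type in HIGH_RISK_TYPES and tx.amount > AMOUNT_LIMIT:
        reasons.append("amount above limit")
    # A transfer or cash-out that empties the originating account is treated as suspicious
    if tx.type in HIGH_RISK_TYPES and tx.oldbalanceOrg > 0 and tx.amount >= tx.oldbalanceOrg:
        reasons.append("account emptied")
    return reasons


print(review_reasons(Transaction("TRANSFER", 181.0, 181.0)))      # ['account emptied']
print(review_reasons(Transaction("PAYMENT", 9839.64, 170136.0)))  # []
```

Returning the reasons rather than a bare flag makes it easier for a reviewer to see why a transaction was held.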

## task_8). Assuming These Actions Have Been Implemented, How Would You Determine if They Work?

To assess if the preventive measures are effective, the company can track:
- **Reduction in fraud incidents** over time.
- **Improvement in detection quality**, measured as precision and recall (fewer false positives and false negatives).
- **Customer satisfaction** metrics, especially in relation to reduced fraud complaints.
- **Cost savings** from reduced fraud-related losses.

Regular audits and system evaluations should be conducted to ensure the infrastructure remains robust against new threats.
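
A before-and-after comparison of these indicators can be computed directly from confusion-matrix counts. The sketch below uses made-up counts for two periods of equal size, purely to show the calculation; the function name `detection_metrics` is an assumption.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall and fraud rate from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fraud_rate = (tp + fn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "fraud_rate": fraud_rate}


# Made-up counts for the period before and after the measures were introduced
before = detection_metrics(tp=40, fp=160, fn=60, tn=99_740)
after = detection_metrics(tp=45, fp=55, fn=15, tn=99_885)

for name in before:
    print(f"{name}: {before[name]:.4f} -> {after[name]:.4f}")
```

A lower fraud rate together with stable or higher precision and recall would suggest the measures are working; a drop in fraud rate alone could also reflect seasonal changes in transaction volume.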