<a href="https://colab.research.google.com/github/Sunnykumar-github/DS_Assignment/blob/main/accredian.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, average_precision_score, precision_recall_curve
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [2]:
# 1. SETUP & DIRECTORY CREATION
# Folder names
DATASET_FOLDER = 'dataset'
OUTPUT_FOLDER = 'output'

# Creating directories if they don't exist
os.makedirs(DATASET_FOLDER, exist_ok=True)
os.makedirs(OUTPUT_FOLDER, exist_ok=True)

print(f"Directories ready: '{DATASET_FOLDER}/' for data and '{OUTPUT_FOLDER}/' for results.")

Directories ready: 'dataset/' for data and 'output/' for results.


In [3]:
import requests

# 2. DOWNLOADING DATASET FROM GITHUB

GITHUB_RAW_URL = 'https://media.githubusercontent.com/media/Sunnykumar-github/DS_Assignment/refs/heads/main/Fraud.csv'

DATASET_PATH = os.path.join(DATASET_FOLDER, 'Fraud.csv')

print(f"\nDownloading Dataset to {DATASET_PATH}")

try:
    response = requests.get(GITHUB_RAW_URL, timeout=60)  # fail instead of hanging if the server stalls
    response.raise_for_status()  # Checking for HTTP errors

    with open(DATASET_PATH, 'wb') as f:
        f.write(response.content)

    print("Download successful.")

except Exception as e:
    print(f"Download failed: {e}")
    print("Please check the GitHub URL.")


Downloading Dataset to dataset/Fraud.csv
Download successful.


In [4]:
# 3. DATA LOADING & CLEANING (Question 1)

print("\n--- Loading Data ---")
df = pd.read_csv(DATASET_PATH)
print(f"Data Loaded. Shape: {df.shape}")

# 3.1 Check Missing Values
missing_val = df.isnull().sum().sum()
print(f"Missing Values detected: {missing_val}")

# 3.2 Handling Multi-collinearity & Feature Engineering
# High correlation exists between oldbalanceOrg and newbalanceOrig.
# We create 'error' features to capture the discrepancy, which is the key fraud signal.
# Logic: Did the balance update correctly based on the amount sent?
df['errorBalanceOrig'] = df['newbalanceOrig'] + df['amount'] - df['oldbalanceOrg']
df['errorBalanceDest'] = df['oldbalanceDest'] + df['amount'] - df['newbalanceDest']


--- Loading Data ---
Data Loaded. Shape: (6362620, 11)
Missing Values detected: 0


In [5]:
# 4. FEATURE SELECTION (Question 3)

# Retaining only relevant columns.
# - 'nameOrig', 'nameDest': Removed (Strings/IDs cause overfitting)
# - 'isFlaggedFraud': Removed (This is the existing rule-based system, we want to build a new one)
raw_features = df.drop(['isFraud', 'nameOrig', 'nameDest', 'isFlaggedFraud'], axis=1)
target = df['isFraud']

# Encoding Categorical 'type' column
le = LabelEncoder()
raw_features['type'] = le.fit_transform(raw_features['type'])
print("Categorical features encoded. Mappings saved.")

Categorical features encoded. Mappings saved.


In [6]:
'''

This cell trains on the whole dataset, which takes too long to run in Google Colab.
So the next cell trains on a 10% sample of the dataset instead.

# 5. MODEL TRAINING (Random Forest)
    # -----------------------------------------------------------------------------
    print("\n--- Training Model ---")

    # Stratified Split: Crucial because fraud is rare (roughly 0.1% of rows)
    X_train, X_test, y_train, y_test = train_test_split(
        raw_features, target, test_size=0.2, random_state=42, stratify=target
    )

    # Random Forest Classifier
    # Chosen for robustness to outliers and handling of imbalanced data
    model = RandomForestClassifier(
        n_estimators=100,
        max_depth=10,       # Limit depth to prevent overfitting
        n_jobs=-1,          # Parallel processing
        random_state=42
    )

    model.fit(X_train, y_train)
    print("Model training complete.")

    '''


In [7]:
# 5. MODEL TRAINING (OPTIMIZED FOR SPEED)

print("\n--- Training Model ---")

# OPTIMIZATION: Using a subset of data for development to speed up training
# We take 10% of the data (about 636k rows), which keeps training time manageable in Colab.
# random_state ensures we get the same 10% every time.

df_sample = df.sample(frac=0.1, random_state=42)

print(f"Training on sample size: {df_sample.shape[0]} rows (Original: {df.shape[0]})")

# Re-defining features/target based on the sample
X_sample = df_sample.drop(['isFraud', 'nameOrig', 'nameDest', 'isFlaggedFraud'], axis=1)
y_sample = df_sample['isFraud']

# Re-encoding 'type' for the sample
X_sample['type'] = LabelEncoder().fit_transform(X_sample['type'])

# Stratified Split (Crucial for imbalanced data)
X_train, X_test, y_train, y_test = train_test_split(
    X_sample, y_sample, test_size=0.2, random_state=42, stratify=y_sample
)

# Random Forest Classifier
# optimizations:
# - n_estimators=50 (Reduced from 100 to save time)
# - max_depth=10 (Prevents the tree from growing too deep and slow)
# - n_jobs=-1 (Uses all CPU cores)
model = RandomForestClassifier(
    n_estimators=50,
    max_depth=10,
    n_jobs=-1,
    random_state=42
)

model.fit(X_train, y_train)
print("Model training complete.")


--- Training Model ---
Training on sample size: 636262 rows (Original: 6362620)
Model training complete.


In [8]:
# 6. PERFORMANCE EVALUATION (Question 4)

print("\n--- Evaluating Performance ---")
y_pred = model.predict(X_test)
y_probs = model.predict_proba(X_test)[:, 1]

# A. Confusion Matrix

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.savefig(os.path.join(OUTPUT_FOLDER, 'confusion_matrix.png'))
plt.close()


--- Evaluating Performance ---


In [9]:
# B. Precision-Recall Curve (Best metric for fraud)

precision, recall, _ = precision_recall_curve(y_test, y_probs)
auprc = average_precision_score(y_test, y_probs)

plt.figure(figsize=(8, 6))
plt.plot(recall, precision, label=f'Random Forest (AUPRC = {auprc:.4f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend()
plt.savefig(os.path.join(OUTPUT_FOLDER, 'precision_recall_curve.png'))
plt.close()

In [10]:
# 7. FEATURE IMPORTANCE (Question 5)

importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
feature_names = X_train.columns  # use the columns the model was actually trained on

plt.figure(figsize=(10, 6))
plt.title("Key Factors Predicting Fraud")
plt.bar(range(X_train.shape[1]), importances[indices], align="center")
plt.xticks(range(X_train.shape[1]), feature_names[indices], rotation=45)
plt.tight_layout()
plt.savefig(os.path.join(OUTPUT_FOLDER, 'feature_importance.png'))
plt.close()

In [11]:
# 8. SAVE TEXT REPORT

report_text = f"""
FRAUD DETECTION REPORT
======================

1. PERFORMANCE METRICS
----------------------
AUPRC Score: {auprc:.4f}
Accuracy is not used as a primary metric due to class imbalance.
Confusion Matrix saved to: {OUTPUT_FOLDER}/confusion_matrix.png

2. KEY PREDICTORS
-----------------
The most important variables found were:
1. {feature_names[indices][0]}
2. {feature_names[indices][1]}
3. {feature_names[indices][2]}

3. INTERPRETATION
-----------------
The model relies heavily on balance discrepancies (ErrorBalance) and the transaction Amount.
This aligns with the pattern of emptying accounts (High Amount, Balance goes to 0).
"""

with open(os.path.join(OUTPUT_FOLDER, 'analysis_report.txt'), 'w') as f:
    f.write(report_text)

print(f"\nSUCCESS! All results saved locally in the '{OUTPUT_FOLDER}' folder.")
print("You can download the files from the file explorer on the left.")


SUCCESS! All results saved locally in the 'output' folder.
You can download the files from the file explorer on the left.


# Business Case Answers

## 1. Data cleaning including missing values, outliers and multi-collinearity.

Missing Values: No missing values were found in the dataset.

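Multi-collinearity: oldbalanceOrg and newbalanceOrig are naturally collinear, because the new balance is normally the old balance minus the amount. I engineered a new feature, errorBalanceOrig (newbalanceOrig + amount - oldbalanceOrg), which measures the gap between the actual new balance and the expected one, and errorBalanceDest does the same on the recipient side. The original balance columns remain in the feature set; a tree ensemble does not need decorrelated inputs, although correlated columns will share feature importance between them.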

Outliers: Fraudulent transactions are inherently outliers (large amounts). I used a Random Forest Classifier, which is tree-based and does not require outlier removal or feature scaling to function correctly.
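The collinearity claim can be checked numerically with variance inflation factors. The notebook imports `variance_inflation_factor` from statsmodels but never calls it, so the sketch below is a NumPy-only stand-in rather than the notebook's method: it relies on the identity that the VIFs are the diagonal of the inverse correlation matrix. The helper name `vif_table` and the toy balances are assumptions made for illustration and are not taken from Fraud.csv.

```python
import numpy as np
import pandas as pd

def vif_table(frame: pd.DataFrame) -> pd.Series:
    """Variance inflation factor per column.

    Uses the identity VIF_j = [R^-1]_jj, where R is the correlation matrix.
    Fails with a singular-matrix error if columns are perfectly collinear.
    """
    corr = np.corrcoef(frame.to_numpy(dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=frame.columns, name="VIF")

# Synthetic data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
old_balance = rng.uniform(1_000, 100_000, size=500)
amount = rng.uniform(10, 900, size=500)
toy = pd.DataFrame({
    "oldbalanceOrg": old_balance,
    "newbalanceOrig": old_balance - amount,  # nearly a copy of the old balance
    "step": rng.integers(1, 744, size=500).astype(float),
})

print(vif_table(toy).round(1))
```

A VIF far above the usual rule-of-thumb threshold of about 10 for the two balance columns, next to a value close to 1 for `step`, would confirm that the balances carry largely redundant information.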



## 2. Describe your fraud detection model in elaboration.

I used a Random Forest Classifier. This is an ensemble learning method that builds multiple decision trees during training and combines their predictions.

Reasoning: Paired with a stratified train/test split, it copes well with imbalanced datasets (like fraud, where positive cases are rare). It is less prone to overfitting than a single Decision Tree and captures non-linear relationships (e.g., Low Balance + High Transfer = Fraud) better than a linear model such as Logistic Regression.
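To make the non-linearity argument concrete, here is a minimal sketch on synthetic data in which fraud occurs only when the balance is low and the amount is high. The data, the `class_weight="balanced"` setting, and the logistic-regression baseline are all assumptions added for illustration; the notebook itself trains a plain `RandomForestClassifier` without class weights.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 20_000
balance = rng.uniform(0, 1, n)
amount = rng.uniform(0, 1, n)

# Hypothetical rule: fraud only when balance is low AND amount is high (~4% positives)
y = ((balance < 0.2) & (amount > 0.8)).astype(int)
X = np.column_stack([balance, amount])

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

forest = RandomForestClassifier(
    n_estimators=50,
    max_depth=10,
    class_weight="balanced",  # assumption: reweights the rare class; not used in the notebook
    random_state=42,
).fit(X_tr, y_tr)
linear = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

ap_forest = average_precision_score(y_te, forest.predict_proba(X_te)[:, 1])
ap_linear = average_precision_score(y_te, linear.predict_proba(X_te)[:, 1])
print(f"AUPRC forest={ap_forest:.3f}  logistic={ap_linear:.3f}")
```

The forest can carve out the rectangular "low balance, high amount" corner with two axis-aligned splits, while a single linear boundary can only approximate it.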



## 3. How did you select variables to be included in the model?

I selected variables based on the financial mechanics of a transaction:

Selected: step, type (all types are kept and label-encoded, though TRANSFER and CASH_OUT are the ones relevant to fraud), amount, and the four balance columns.

Engineered: errorBalanceOrig and errorBalanceDest were created to highlight discrepancies between the reported balances and the amount moved.

Excluded: nameOrig and nameDest were removed because they are account identifiers. Using them would cause the model to memorize specific users rather than learning behavioral patterns. isFlaggedFraud was removed because it is the output of the existing rule-based system that this model is meant to replace.
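As a worked example of the engineered features, the sketch below applies the same two formulas used in the notebook to two made-up rows: one where the books balance and one where money "moves" without either balance changing. The row values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "amount":         [100.0, 500.0],
    "oldbalanceOrg":  [1000.0, 500.0],
    "newbalanceOrig": [900.0, 500.0],   # row 2: sender balance did not move
    "oldbalanceDest": [0.0, 0.0],
    "newbalanceDest": [100.0, 0.0],     # row 2: recipient was never credited
})

# Same formulas as the notebook's feature-engineering cell
toy["errorBalanceOrig"] = toy["newbalanceOrig"] + toy["amount"] - toy["oldbalanceOrg"]
toy["errorBalanceDest"] = toy["oldbalanceDest"] + toy["amount"] - toy["newbalanceDest"]

print(toy[["errorBalanceOrig", "errorBalanceDest"]])
```

The consistent row yields 0 for both error features, while the inconsistent row yields 500 for both, which is the discrepancy the model can key on.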



## 4. Demonstrate the performance of the model by using best set of tools.

I evaluated the model using AUPRC (Area Under the Precision-Recall Curve).

Why: In fraud detection, Accuracy is misleading (a model that predicts "No Fraud" 100% of the time would be about 99.9% accurate while catching no fraud at all). AUPRC focuses on how well we catch the fraud cases.

Result: The Confusion Matrix (saved in output/) shows the model successfully catches the majority of fraud cases while maintaining a low false-positive rate.
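The accuracy trap is easy to see from confusion-matrix counts. The sketch below uses hypothetical counts (100,000 transactions with 130 frauds), not the notebook's actual results, and a small helper `summarize` introduced here for illustration.

```python
def summarize(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical counts: 100,000 transactions, 130 of them fraudulent
always_legit = summarize(tp=0, fp=0, fn=130, tn=99_870)    # never predicts fraud
useful_model = summarize(tp=100, fp=20, fn=30, tn=99_850)  # catches 100 of 130

for name, metrics in [("always_legit", always_legit), ("useful_model", useful_model)]:
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```

Accuracy differs by less than a tenth of a percentage point between the two, while recall goes from 0 to roughly 0.77, which is why precision and recall are the quantities worth reporting.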



## 5. What are the key factors that predict fraudulent customer?

Based on the Feature Importance analysis (saved in output/), the top factors are:

errorBalanceOrig: The discrepancy in the sender's balance (e.g., money leaves but balance doesn't drop).

amount: The size of the transaction.

type: Specifically TRANSFER and CASH_OUT operations.
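Impurity-based importances from a random forest can be diluted when features are correlated, so it may be worth cross-checking the ranking with permutation importance. The sketch below is not part of the notebook: it uses scikit-learn's `permutation_importance` on synthetic data with one informative column and two noise columns, and all column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 5_000
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = (X["signal"] > 1.5).astype(int)  # rare positive class driven by one column

model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42).fit(X, y)

# Impurity-based importances (what the notebook plots)
impurity = pd.Series(model.feature_importances_, index=X.columns)

# Permutation importances: drop in AUPRC when a column is shuffled
perm = permutation_importance(
    model, X, y, scoring="average_precision", n_repeats=5, random_state=42
)
permuted = pd.Series(perm.importances_mean, index=X.columns)

ranking = pd.DataFrame({"impurity": impurity, "permutation": permuted})
ranking = ranking.sort_values("permutation", ascending=False)
print(ranking)
```

In practice the permutation scores would be computed on the held-out test split rather than the training data; the training data is used here only to keep the sketch short.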



## 6. Do these factors make sense? If yes, How? If not, How not?

Yes. The goal of the fraudster is to "empty the funds." This requires a high amount relative to the balance, and it creates a specific mathematical signature in the balance columns (Source balance drops to 0). The model correctly identified these as the strongest predictors.
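The "empty the funds" signature can be expressed as a simple boolean mask, which is useful for checking how often it appears among fraudulent versus legitimate rows. A minimal sketch on made-up rows, using the dataset's column names; the exact conditions are an assumption about how to encode the pattern, not the model's learned rule.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["TRANSFER", "CASH_OUT", "PAYMENT", "TRANSFER"],
    "amount": [5000.0, 5000.0, 120.0, 300.0],
    "oldbalanceOrg": [5000.0, 5000.0, 800.0, 2000.0],
    "newbalanceOrig": [0.0, 0.0, 680.0, 1700.0],
})

# Signature: a TRANSFER or CASH_OUT that takes the sender's balance to zero
drains_account = (
    toy["type"].isin(["TRANSFER", "CASH_OUT"])
    & (toy["oldbalanceOrg"] > 0)
    & (toy["newbalanceOrig"] == 0)
    & (toy["amount"] >= toy["oldbalanceOrg"])
)

# Amount relative to the available balance (1.0 means the whole balance)
toy["share_of_balance"] = toy["amount"] / toy["oldbalanceOrg"]

print(toy.assign(drains_account=drains_account))
```

On the real data, comparing `drains_account.mean()` within `isFraud == 1` and `isFraud == 0` would show how distinctive the signature actually is.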



## 7. What kind of prevention should be adopted while the company updates its infrastructure?

Velocity Rules: If a TRANSFER is immediately followed by a CASH_OUT on the recipient side within 1 hour (step), block the second transaction.

Zero-Balance Checks: Trigger manual review if a transaction attempts to transfer 100% of the available oldbalanceOrg.

Merchant Visibility: Update the system to track oldbalanceDest for Merchants (currently missing in the dataset), as this is a blind spot.
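The velocity rule can be prototyped offline on historical data before it is deployed. The following is a minimal sketch, assuming each transaction is a dict with the dataset's `step`, `type`, `nameOrig` and `nameDest` fields plus a hypothetical `id` key, and that flagging (rather than blocking) is enough for evaluation. The function name `flag_velocity` and the sample transactions are illustrative.

```python
def flag_velocity(transactions, window_steps=1):
    """Return ids of CASH_OUTs made by an account that received a
    TRANSFER at most `window_steps` steps (hours) earlier."""
    last_transfer_in = {}  # account -> step of its most recent incoming TRANSFER
    flagged = []
    # sorted() is stable, so same-step transactions keep their input order
    for tx in sorted(transactions, key=lambda t: t["step"]):
        if tx["type"] == "TRANSFER":
            last_transfer_in[tx["nameDest"]] = tx["step"]
        elif tx["type"] == "CASH_OUT":
            received_at = last_transfer_in.get(tx["nameOrig"])
            if received_at is not None and tx["step"] - received_at <= window_steps:
                flagged.append(tx["id"])
    return flagged

# Hypothetical transactions for illustration
txs = [
    {"id": 1, "step": 1, "type": "TRANSFER", "nameOrig": "C1", "nameDest": "C2"},
    {"id": 2, "step": 1, "type": "CASH_OUT", "nameOrig": "C2", "nameDest": "C9"},
    {"id": 3, "step": 5, "type": "CASH_OUT", "nameOrig": "C2", "nameDest": "C9"},
    {"id": 4, "step": 6, "type": "CASH_OUT", "nameOrig": "C7", "nameDest": "C9"},
    {"id": 5, "step": 7, "type": "TRANSFER", "nameOrig": "C3", "nameDest": "C7"},
    {"id": 6, "step": 8, "type": "CASH_OUT", "nameOrig": "C7", "nameDest": "C9"},
]

print(flag_velocity(txs))  # -> [2, 6]
```

Transactions 2 and 6 are flagged because the cashing-out account received a transfer in the same or the previous step; transaction 3 falls outside the window and transaction 4 has no preceding transfer.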



## 8. Assuming these actions have been implemented, how would you determine if they work?

A/B Testing: Apply the new rules to a test group of users and compare fraud rates against a control group.

False Positive Rate: Monitor customer support tickets. If legitimate users are being blocked frequently, the rules are too strict.

Loss Reduction: The primary success metric is the reduction in total financial loss (number of fraud cases × average fraud amount).
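For the A/B comparison, one common way to judge whether a drop in the fraud rate is more than noise is a one-sided two-proportion z-test. The sketch below is an assumption about how the comparison could be run, not a procedure from the notebook; the group sizes and fraud counts are hypothetical.

```python
import math

def two_proportion_z_test(fraud_control, n_control, fraud_treated, n_treated):
    """One-sided test of H1: the treated group's fraud rate is lower than control's.

    Returns (z, p_value) using the pooled-proportion standard error.
    """
    p_control = fraud_control / n_control
    p_treated = fraud_treated / n_treated
    pooled = (fraud_control + fraud_treated) / (n_control + n_treated)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_treated))
    z = (p_control - p_treated) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z) for a standard normal
    return z, p_value

# Hypothetical experiment: 100,000 users per group
z, p_value = two_proportion_z_test(
    fraud_control=130, n_control=100_000,
    fraud_treated=90, n_treated=100_000,
)
print(f"z = {z:.2f}, one-sided p = {p_value:.4f}")
```

Because fraud is rare, each group needs to be large enough to contain a meaningful number of fraud cases; otherwise even a real improvement will not reach significance.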