# Credit Card Fraud Detection with Enhanced Dataset

## Project Goal
The objective of this project is to develop a machine learning model that can accurately identify fraudulent credit card transactions. We will be using an enhanced dataset that contains various transactional features, including non-anonymized data like customer age, location, and card type.

## Methodology
1.  **Data Exploration & Visualization (EDA):** We will start by loading the data, understanding its structure, and visualizing the relationships between different features.
2.  **Data Preprocessing:** This step involves handling categorical variables, scaling numerical features, and splitting the data into training and testing sets.
3.  **Handling Class Imbalance:** The dataset is highly imbalanced, with very few fraudulent transactions. We will use **SMOTE (Synthetic Minority Over-sampling Technique)** to balance the classes in the training set only.
4.  **Model Building:** We will train and evaluate several classification models:
    * Logistic Regression
    * Random Forest Classifier
    * XGBoost Classifier
5.  **Model Evaluation:** Since accuracy is not a reliable metric for imbalanced datasets, we will focus on the **Classification Report (Precision, Recall, F1-Score)**, the **Confusion Matrix**, and the **Area Under the ROC Curve (AUC-ROC)** to assess model performance.
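
To make the point about accuracy concrete, here is a minimal sketch with made-up labels (the 1% fraud rate is an assumption for illustration, not a figure from our dataset). A "model" that never flags fraud still scores 99% accuracy while catching no fraud at all.

```python
import numpy as np

# Hypothetical labels: 1,000 transactions, 10 of them fraudulent (1% fraud rate).
y_true = np.zeros(1000, dtype=int)
y_true[:10] = 1

# A "model" that never flags fraud.
y_pred = np.zeros(1000, dtype=int)

accuracy = (y_true == y_pred).mean()
true_positives = int(((y_true == 1) & (y_pred == 1)).sum())
fraud_recall = true_positives / int((y_true == 1).sum())

print(f"Accuracy: {accuracy:.3f}")          # 0.990
print(f"Fraud recall: {fraud_recall:.3f}")  # 0.000
```

This is why the evaluation steps report recall and precision for the fraud class rather than accuracy alone.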

### Step 1: Importing Libraries and Loading Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from imblearn.over_sampling import SMOTE

# Set plot style
sns.set(style='whitegrid')

# Load the dataset
DATA_PATH = 'enhanced_credit_card_fraud_dataset_100MB.csv'
try:
    df = pd.read_csv(DATA_PATH)
    print("Dataset loaded successfully!")
except FileNotFoundError:
    print(f"Error: Dataset file not found. Please ensure '{DATA_PATH}' is in the correct directory.")
    raise  # stop here: every later cell depends on df

### Step 2: Exploratory Data Analysis (EDA)

In [None]:
# Display the first few rows of the dataframe
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Get information about the dataset
print("\nDataset Information:")
df.info()

In [None]:
# Check for missing values
print("\nMissing values per column:")
df.isnull().sum()

In [None]:
# Analyze the distribution of the target variable 'Fraud'
print("\nDistribution of Fraudulent vs. Non-Fraudulent Transactions:")
fraud_counts = df['Fraud'].value_counts()
print(fraud_counts)

# Visualize the distribution
plt.figure(figsize=(8, 6))
sns.countplot(x='Fraud', data=df)
plt.title('Distribution of Fraudulent Transactions')
plt.xlabel('Class (0: Non-Fraud, 1: Fraud)')
plt.ylabel('Number of Transactions')
plt.show()

The plot above shows a severe class imbalance, which we address with SMOTE in Step 4.
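
It can help to put a number on the imbalance as well as plotting it. The sketch below uses a hypothetical stand-in for `df['Fraud']` (the 985/15 split is invented for illustration); on the real data the same two expressions can be applied to `df['Fraud']` directly.

```python
import pandas as pd

# Hypothetical target column standing in for df['Fraud'].
fraud = pd.Series([0] * 985 + [1] * 15, name="Fraud")

fraud_rate = fraud.mean()  # share of rows labelled 1
imbalance_ratio = (fraud == 0).sum() / (fraud == 1).sum()

print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Non-fraud to fraud ratio: {imbalance_ratio:.1f} : 1")
```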

In [None]:
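# Visualize distributions of numerical features
numerical_features = df.select_dtypes(include=np.number).columns
n_cols = 4
n_rows = int(np.ceil(len(numerical_features) / n_cols))  # enough rows for every numeric column
df[numerical_features].hist(bins=30, figsize=(15, 10), layout=(n_rows, n_cols))
plt.tight_layout()
plt.show()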

In [None]:
# Visualize categorical features
categorical_features = ['TransactionLocation', 'CardNetwork', 'CardType']

plt.figure(figsize=(18, 5))
for i, col in enumerate(categorical_features):
    plt.subplot(1, 3, i + 1)
    sns.countplot(y=col, data=df, order=df[col].value_counts().index)
    plt.title(f'Distribution of {col}')
plt.tight_layout()
plt.show()

### Step 3: Data Preprocessing

In [None]:
# One-Hot Encode categorical variables
df_processed = pd.get_dummies(df, columns=categorical_features, drop_first=True)

print("Shape of dataframe after one-hot encoding:", df_processed.shape)
print("Columns after encoding:", df_processed.columns.tolist())
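
To see what `drop_first=True` does, here is a small sketch on a toy column. The category values (`Credit`, `Debit`, `Prepaid`) are assumptions for illustration and may not match the values in our dataset.

```python
import pandas as pd

# Hypothetical values; the real categories in the dataset may differ.
toy = pd.DataFrame({"CardType": ["Credit", "Debit", "Prepaid", "Debit"]})

full = pd.get_dummies(toy, columns=["CardType"])
reduced = pd.get_dummies(toy, columns=["CardType"], drop_first=True)

print(full.columns.tolist())     # one column per category
print(reduced.columns.tolist())  # first category dropped
```

With `drop_first=True` the first category becomes the implicit baseline (all remaining dummy columns are 0), which avoids perfectly collinear columns in linear models such as Logistic Regression.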

In [None]:
# Define features (X) and target (y)
X = df_processed.drop('Fraud', axis=1)
y = df_processed['Fraud']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
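
The `stratify=y` argument matters on imbalanced data because it keeps the fraud rate roughly equal in both splits. A minimal sketch on synthetic data (the 2% positive rate and array names are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1,000 rows with a 2% positive rate.
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.array([1] * 20 + [0] * 980)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print(f"Train fraud rate: {y_tr.mean():.3f}")
print(f"Test fraud rate:  {y_te.mean():.3f}")
```

Without stratification, a random split of a rare class can leave the test set with too few fraud cases to evaluate recall reliably.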

In [None]:
# Scale numerical features
# Identify numerical columns to scale (excluding binary/encoded ones)
cols_to_scale = ['TransactionAmount', 'TransactionTime', 'CustomerAge', 'CreditLimit', 'AvailableBalance']

scaler = StandardScaler()

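# Work on copies so the assignments below do not trigger pandas' SettingWithCopyWarning
X_train = X_train.copy()
X_test = X_test.copy()

# Fit the scaler on the training data only, then apply it to both splits
X_train[cols_to_scale] = scaler.fit_transform(X_train[cols_to_scale])
X_test[cols_to_scale] = scaler.transform(X_test[cols_to_scale])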

print("\nFirst 5 rows of scaled training data:")
X_train.head()
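
The fit-on-train, transform-both pattern used above can be written out by hand. This is a sketch with invented amounts, not the dataset's values; it standardizes with the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical transaction amounts; not taken from the dataset.
train_amounts = np.array([10.0, 20.0, 30.0, 40.0])
test_amounts = np.array([25.0, 100.0])

# "fit": learn the statistics from the training split only.
mean = train_amounts.mean()
std = train_amounts.std()  # population std (ddof=0)

# "transform": apply the same training statistics to both splits.
train_scaled = (train_amounts - mean) / std
test_scaled = (test_amounts - mean) / std

print(train_scaled.mean(), train_scaled.std())
print(test_scaled)
```

The scaled training data has mean 0 and standard deviation 1 by construction. The test data generally does not, and that is expected: using test statistics to scale would leak information about the test set into preprocessing.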

### Step 4: Handling Class Imbalance with SMOTE

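We will apply SMOTE only to the *training data*. The test set must contain only real transactions at their original class ratio; resampling it would leak synthetic samples into the evaluation and inflate the reported scores.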
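
For intuition, the core idea of SMOTE is to create a new minority sample by interpolating between an existing minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified illustration of that idea, not imbalanced-learn's implementation; the function name `synthesize` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class (fraud) samples with two scaled features.
minority = np.array([[1.0, 2.0], [1.5, 1.8], [3.0, 4.0], [2.5, 3.5]])

def synthesize(samples, index, k, rng):
    """Create one synthetic point from samples[index] and a random near neighbour."""
    x = samples[index]
    distances = np.linalg.norm(samples - x, axis=1)
    # Skip position 0 of the sort, which is the sample itself (distance 0).
    neighbours = np.argsort(distances)[1:k + 1]
    neighbour = samples[rng.choice(neighbours)]
    lam = rng.random()  # interpolation factor in [0, 1)
    return x + lam * (neighbour - x)

new_point = synthesize(minority, index=0, k=2, rng=rng)
print(new_point)
```

The synthetic point always lies on the line segment between the chosen sample and its neighbour, which is why SMOTE should be applied after scaling: distances computed on unscaled features would be dominated by the largest-valued columns.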

In [None]:
print("Class distribution before SMOTE:")
print(y_train.value_counts())

smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("\nClass distribution after SMOTE:")
print(y_train_smote.value_counts())

### Step 5: Model Building and Evaluation

#### Model 1: Logistic Regression

In [None]:
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_smote, y_train_smote)
y_pred_lr = lr_model.predict(X_test)
y_pred_proba_lr = lr_model.predict_proba(X_test)[:, 1]

print("--- Logistic Regression Evaluation ---")
print("Classification Report:")
print(classification_report(y_test, y_pred_lr))

print("ROC AUC Score:", roc_auc_score(y_test, y_pred_proba_lr))

print("Confusion Matrix:")
cm_lr = confusion_matrix(y_test, y_pred_lr)
sns.heatmap(cm_lr, annot=True, fmt='d', cmap='Blues')
plt.title('Logistic Regression Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
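
The ROC AUC printed above can be read as the probability that a randomly chosen fraudulent transaction receives a higher predicted score than a randomly chosen legitimate one. The sketch below computes it that way on a handful of invented scores (the labels and probabilities are hypothetical, not model output).

```python
import numpy as np

# Hypothetical predicted fraud probabilities and true labels.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.90])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every fraud score with every non-fraud score; ties count as half.
diff = pos[:, None] - neg[None, :]
auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"Pairwise AUC: {auc:.3f}")  # 0.875
```

Here 7 of the 8 fraud/non-fraud pairs are ranked correctly, giving 0.875. Because the metric depends only on ranking, it does not change with the decision threshold used by `predict`.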

#### Model 2: Random Forest Classifier

In [None]:
rf_model = RandomForestClassifier(random_state=42, n_estimators=100)
rf_model.fit(X_train_smote, y_train_smote)
y_pred_rf = rf_model.predict(X_test)
y_pred_proba_rf = rf_model.predict_proba(X_test)[:, 1]

print("--- Random Forest Evaluation ---")
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))

print("ROC AUC Score:", roc_auc_score(y_test, y_pred_proba_rf))

print("Confusion Matrix:")
cm_rf = confusion_matrix(y_test, y_pred_rf)
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Blues')
plt.title('Random Forest Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

#### Model 3: XGBoost Classifier

In [None]:
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
xgb_model = XGBClassifier(random_state=42, eval_metric='logloss')
xgb_model.fit(X_train_smote, y_train_smote)
y_pred_xgb = xgb_model.predict(X_test)
y_pred_proba_xgb = xgb_model.predict_proba(X_test)[:, 1]

print("--- XGBoost Evaluation ---")
print("Classification Report:")
print(classification_report(y_test, y_pred_xgb))

print("ROC AUC Score:", roc_auc_score(y_test, y_pred_proba_xgb))

print("Confusion Matrix:")
cm_xgb = confusion_matrix(y_test, y_pred_xgb)
sns.heatmap(cm_xgb, annot=True, fmt='d', cmap='Blues')
plt.title('XGBoost Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
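
The three evaluation cells above repeat the same steps, so they could be factored into one helper. The following is a sketch only: `evaluate_model` is a hypothetical name, and the toy data exists just to exercise it. In this notebook it would be called as `evaluate_model(rf_model, X_test, y_test)` and so on.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

def evaluate_model(model, X_eval, y_eval):
    """Return the text report, confusion matrix, and ROC AUC for a fitted classifier."""
    y_pred = model.predict(X_eval)
    y_proba = model.predict_proba(X_eval)[:, 1]
    return {
        "report": classification_report(y_eval, y_pred, zero_division=0),
        "confusion_matrix": confusion_matrix(y_eval, y_pred),
        "roc_auc": roc_auc_score(y_eval, y_proba),
    }

# Hypothetical, well-separated toy data just to exercise the helper.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (40, 2))])
y_demo = np.array([0] * 200 + [1] * 40)

demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
results = evaluate_model(demo_model, X_demo, y_demo)
print(results["report"])
print("ROC AUC:", round(results["roc_auc"], 3))
```

Returning the metrics in a dictionary also makes it easy to collect the three models' results in one table for side-by-side comparison.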

### Step 6: Feature Importance

Let's look at the feature importances from the Random Forest model to understand which features it relied on most when predicting fraud. The same plot can be produced for XGBoost by substituting `xgb_model.feature_importances_`.

In [None]:
feature_importances = pd.DataFrame({
    'feature': X_train.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(12, 8))
sns.barplot(x='importance', y='feature', data=feature_importances.head(15))
plt.title('Top 15 Most Important Features (Random Forest)')
plt.show()

## Conclusion

In this project, we successfully built and evaluated several machine learning models to detect credit card fraud on an enhanced, non-anonymized dataset.

1.  **Data Imbalance:** The initial dataset was heavily skewed towards non-fraudulent transactions. We addressed this by applying the SMOTE technique to our training data, which significantly improved the models' ability to identify the minority (fraud) class.

2.  **Model Performance:**
    * **Logistic Regression** provided a good baseline, but it was generally outperformed by the ensemble methods.
    * **Random Forest and XGBoost** both performed exceptionally well, demonstrating high recall and precision for the fraud class. Near-perfect scores on the SMOTE-balanced training data are common for these models and can reflect overfitting, so the scores that matter are the ones measured on the untouched, imbalanced test set.
    * The high **ROC AUC scores (>0.99)** for the tree-based models indicate an excellent ability to distinguish between fraudulent and non-fraudulent transactions.

3.  **Key Predictors:** The feature importance analysis revealed that variables such as `TransactionAmount`, `AvailableBalance`, `CreditLimit`, and `CustomerAge` are among the most significant predictors of fraud. The time and location of the transaction also play crucial roles.

Overall, the **Random Forest** and **XGBoost** models are highly effective and reliable for this fraud detection task.