# Credit Card Fraud Detection - Machine Learning Project

## Project Overview
This notebook trains and compares Logistic Regression, Random Forest, and SVM classifiers for detecting fraudulent credit card transactions, with and without SMOTE oversampling of the training set.

### Dataset Information
- **Features**: V1-V28 (PCA transformed), Time, Amount
- **Target**: Class (0=Normal, 1=Fraud)
- **Challenge**: Highly imbalanced dataset (~0.17% fraud cases)
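
The imbalance noted above is why accuracy is not used as a metric later in the notebook. As a rough illustration, the sketch below uses hypothetical round-number counts chosen to match the ~0.17% fraud rate (they are not the real dataset counts) and scores a "model" that labels every transaction as normal.

```python
# Hypothetical counts matching a ~0.17% fraud rate; not the real dataset.
n_total = 100_000
n_fraud = 170
n_normal = n_total - n_fraud

# A trivial "model" that predicts class 0 (normal) for every transaction.
true_negatives = n_normal   # every normal transaction is labelled correctly
true_positives = 0          # no fraud case is ever flagged

accuracy = (true_negatives + true_positives) / n_total
recall = true_positives / n_fraud

print(f"Accuracy: {accuracy:.4f}")
print(f"Fraud recall: {recall:.4f}")
```

Accuracy stays very high even though no fraud is caught, which is why the evaluation relies on precision, recall, F1, and ROC-AUC.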

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.metrics import precision_recall_curve, f1_score, precision_score, recall_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import warnings
warnings.filterwarnings('ignore')

# Set style for plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")

## 1. Data Loading and Initial Exploration

In [None]:
# Load the dataset
# Update the path to your actual file location
data_path = r"C:\Users\dasar\OneDrive\Desktop\ds project\major\creditcard.csv"
df = pd.read_csv(data_path)

print(f"Dataset shape: {df.shape}")
print(f"\nDataset info:")
df.info()

In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

# Basic statistics
print("\nBasic statistics:")
df.describe()

## 2. Class Distribution Analysis

In [None]:
# Analyze class distribution
class_counts = df['Class'].value_counts()
class_percentages = df['Class'].value_counts(normalize=True) * 100

print("Class Distribution:")
print(f"Normal transactions (0): {class_counts[0]:,} ({class_percentages[0]:.2f}%)")
print(f"Fraudulent transactions (1): {class_counts[1]:,} ({class_percentages[1]:.2f}%)")

# Visualize class distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Bar plot
class_counts.plot(kind='bar', ax=ax1, color=['skyblue', 'salmon'])
ax1.set_title('Class Distribution (Count)')
ax1.set_xlabel('Class')
ax1.set_ylabel('Count')
ax1.tick_params(axis='x', rotation=0)

# Pie chart
ax2.pie(class_counts.values, labels=['Normal', 'Fraud'], autopct='%1.2f%%', 
        colors=['skyblue', 'salmon'], startangle=90)
ax2.set_title('Class Distribution (Percentage)')

plt.tight_layout()
plt.show()

## 3. Feature Analysis

In [None]:
# Analyze Time and Amount features
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Time distribution
axes[0,0].hist(df['Time'], bins=50, alpha=0.7, color='blue')
axes[0,0].set_title('Distribution of Time')
axes[0,0].set_xlabel('Time')
axes[0,0].set_ylabel('Frequency')

# Amount distribution
axes[0,1].hist(df['Amount'], bins=50, alpha=0.7, color='green')
axes[0,1].set_title('Distribution of Amount')
axes[0,1].set_xlabel('Amount')
axes[0,1].set_ylabel('Frequency')

# Amount by class
df[df['Class'] == 0]['Amount'].hist(bins=50, alpha=0.7, label='Normal', ax=axes[1,0])
df[df['Class'] == 1]['Amount'].hist(bins=50, alpha=0.7, label='Fraud', ax=axes[1,0])
axes[1,0].set_title('Amount Distribution by Class')
axes[1,0].set_xlabel('Amount')
axes[1,0].set_ylabel('Frequency')
axes[1,0].legend()

# Box plot for Amount by Class
df.boxplot(column='Amount', by='Class', ax=axes[1,1])
fig.suptitle('')  # remove the automatic "Boxplot grouped by Class" figure title
axes[1,1].set_title('Amount Distribution by Class (Box Plot)')
axes[1,1].set_xlabel('Class')
axes[1,1].set_ylabel('Amount')

plt.tight_layout()
plt.show()

## 4. Data Preprocessing

In [None]:
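# Feature scaling for Amount and Time
# Use one scaler per column so each keeps its own fitted median and IQR
# (re-fitting a single scaler on Time would discard the Amount statistics).
# Note: fitting on the full dataset before the train/test split lets test-set
# statistics influence preprocessing; fit on the training split only for a strict evaluation.
amount_scaler = RobustScaler()
time_scaler = RobustScaler()
df['Amount_scaled'] = amount_scaler.fit_transform(df[['Amount']])
df['Time_scaled'] = time_scaler.fit_transform(df[['Time']])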

# Drop original Time and Amount columns
df_processed = df.drop(['Time', 'Amount'], axis=1)

print("Data preprocessing completed!")
print(f"Processed dataset shape: {df_processed.shape}")
df_processed.head()

In [None]:
# Prepare features and target
X = df_processed.drop('Class', axis=1)
y = df_processed['Class']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Feature columns: {list(X.columns)}")

In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"Training set class distribution:")
print(y_train.value_counts())
print(f"Test set class distribution:")
print(y_test.value_counts())
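
The preprocessing cell fits `RobustScaler` on the full dataset before this split, so the test rows contribute to the scaling statistics. A minimal sketch of the stricter order (split first, then fit on the training rows only) is shown below. It uses a synthetic log-normal column as a stand-in for `Amount`; the names `amount`, `labels`, `amt_train`, and `amt_test` are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the Amount column (skewed, positive values).
amount = rng.lognormal(mean=3.0, sigma=1.5, size=(1000, 1))
labels = (rng.random(1000) < 0.05).astype(int)

amt_train, amt_test, lab_train, lab_test = train_test_split(
    amount, labels, test_size=0.2, random_state=42, stratify=labels
)

scaler = RobustScaler()
amt_train_scaled = scaler.fit_transform(amt_train)  # statistics come from the training rows only
amt_test_scaled = scaler.transform(amt_test)        # test rows reuse the training median and IQR

print(f"Train median after scaling: {np.median(amt_train_scaled):.4f}")
print(f"Test median after scaling:  {np.median(amt_test_scaled):.4f}")
```

The training median is centered at zero by construction, while the test median usually is not, which is the expected behavior when the test set is treated as unseen data.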

## 5. Handling Class Imbalance

In [None]:
# Apply SMOTE for oversampling
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(f"Original training set shape: {X_train.shape}")
print(f"SMOTE training set shape: {X_train_smote.shape}")
print(f"\nOriginal class distribution:")
print(y_train.value_counts())
print(f"\nSMOTE class distribution:")
print(pd.Series(y_train_smote).value_counts())
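
For intuition about what `SMOTE` adds to the training set, the sketch below reproduces its core idea: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbors. This is a simplified illustration and not the `imblearn` implementation; the function name `smote_like_samples` and the toy data are assumptions made for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    # Pick random minority points, then one of each point's k true neighbors (columns 1..k).
    base = rng.integers(0, len(X_minority), size=n_new)
    picked = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]
    # Interpolation factor in [0, 1) places the new point between the pair.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])

# Toy minority class: 20 points with 3 features.
rng = np.random.default_rng(0)
X_minority = rng.normal(size=(20, 3))
X_new = smote_like_samples(X_minority, n_new=50)
print(X_new.shape)
```

Because the new points are interpolations, they never leave the region already covered by the minority class, which is also why SMOTE should be applied to the training split only, as done above.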

## 6. Model Training and Evaluation

In [None]:
# Define models
models = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=100),
    'SVM': SVC(random_state=42, probability=True)  # kernel SVC with probability=True can be very slow on a dataset of this size
}

# Function to evaluate models
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    y_pred_proba = model.predict_proba(X_test)[:, 1]
    
    # Calculate metrics
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    print(f"\n{model_name} Results:")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-Score: {f1:.4f}")
    print(f"ROC-AUC: {roc_auc:.4f}")
    
    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=['Normal', 'Fraud'], yticklabels=['Normal', 'Fraud'])
    plt.title(f'{model_name} - Confusion Matrix')
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.show()
    
    return {
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'roc_auc': roc_auc,
        'y_pred_proba': y_pred_proba
    }

In [None]:
# Train and evaluate models on original imbalanced data
print("=" * 50)
print("RESULTS ON ORIGINAL IMBALANCED DATA")
print("=" * 50)

results_original = {}
for name, model in models.items():
    results_original[name] = evaluate_model(model, X_train, X_test, y_train, y_test, name)

In [None]:
# Train and evaluate models on SMOTE balanced data
print("\n" + "=" * 50)
print("RESULTS ON SMOTE BALANCED DATA")
print("=" * 50)

results_smote = {}
for name, model in models.items():
    results_smote[name] = evaluate_model(model, X_train_smote, X_test, y_train_smote, y_test, f"{name} (SMOTE)")

## 7. Model Comparison and ROC Curves

In [None]:
# Compare models performance
comparison_df = pd.DataFrame({
    'Model': list(results_original.keys()) + [f"{k} (SMOTE)" for k in results_smote.keys()],
    'Precision': [results_original[k]['precision'] for k in results_original.keys()] + 
                 [results_smote[k]['precision'] for k in results_smote.keys()],
    'Recall': [results_original[k]['recall'] for k in results_original.keys()] + 
              [results_smote[k]['recall'] for k in results_smote.keys()],
    'F1-Score': [results_original[k]['f1'] for k in results_original.keys()] + 
                [results_smote[k]['f1'] for k in results_smote.keys()],
    'ROC-AUC': [results_original[k]['roc_auc'] for k in results_original.keys()] + 
               [results_smote[k]['roc_auc'] for k in results_smote.keys()]
})

print("\nModel Comparison:")
print(comparison_df.round(4))

In [None]:
# Plot ROC curves
plt.figure(figsize=(12, 8))

# Original models
for name, result in results_original.items():
    fpr, tpr, _ = roc_curve(y_test, result['y_pred_proba'])
    plt.plot(fpr, tpr, label=f"{name} (AUC = {result['roc_auc']:.3f})")

# SMOTE models
for name, result in results_smote.items():
    fpr, tpr, _ = roc_curve(y_test, result['y_pred_proba'])
    plt.plot(fpr, tpr, linestyle='--', label=f"{name} SMOTE (AUC = {result['roc_auc']:.3f})")

plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curves Comparison')
plt.legend()
plt.grid(True)
plt.show()
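
With so few positives, ROC curves can look strong even when precision is poor, so a precision-recall view is a useful complement. The sketch below uses `precision_recall_curve` (already imported at the top of the notebook) and `average_precision_score` on synthetic imbalanced data from `make_classification`. The data, the ~2% positive rate, and the names `X_demo`, `y_demo`, and `clf` are assumptions for illustration, not results from the credit card dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the real dataset.
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=10, weights=[0.98, 0.02], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

# One (precision, recall) pair per candidate threshold, plus a final (1, 0) point.
precisions, recalls, thresholds = precision_recall_curve(y_te, scores)
avg_precision = average_precision_score(y_te, scores)
baseline = y_te.mean()  # precision obtained by flagging every transaction

print(f"Average precision: {avg_precision:.3f} (baseline {baseline:.3f})")
# To plot the curve: plt.plot(recalls, precisions), with recall on the x-axis.
```

In the notebook itself, the same two calls could be applied to `y_test` and each stored `y_pred_proba` array.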

## 8. Feature Importance Analysis

In [None]:
# Feature importance from Random Forest
rf_model = RandomForestClassifier(random_state=42, n_estimators=100)
rf_model.fit(X_train_smote, y_train_smote)

# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot top 15 features
plt.figure(figsize=(10, 8))
sns.barplot(data=feature_importance.head(15), x='importance', y='feature')
plt.title('Top 15 Feature Importance (Random Forest)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

print("Top 10 Most Important Features:")
print(feature_importance.head(10))
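
Impurity-based importances such as `feature_importances_` are computed on the training data, which here has been resampled with SMOTE. A common cross-check is permutation importance measured on held-out data. The sketch below shows the call on synthetic data; the dataset, the choice of `average_precision` as the scoring metric, and the names `forest` and `ranking` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 8 features, 3 of them informative.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, n_informative=3,
    weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out set and record the drop in score.
result = permutation_importance(
    forest, X_te, y_te, scoring="average_precision", n_repeats=5, random_state=42
)

ranking = result.importances_mean.argsort()[::-1]  # most important feature first
for idx in ranking[:3]:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"feature {idx}: {mean:.4f} +/- {std:.4f}")
```

Features that rank highly under both methods are more likely to carry real signal than features that rank highly under only one.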

## 9. Conclusions and Recommendations

### Key Findings:
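1. **Class Imbalance**: The dataset is highly imbalanced, with only ~0.17% fraud cases, so accuracy alone is not a meaningful metric
2. **SMOTE Impact**: Training on SMOTE-resampled data generally improves recall but may reduce precision, because the model flags more transactions as fraud
3. **Model Performance**: The models trade precision against recall differently, so the comparison table in Section 7 should be read across all metrics rather than ROC-AUC alone
4. **Feature Importance**: The Random Forest ranks a subset of the V features above the rest; because they are anonymized PCA components, the ranking indicates predictive signal but cannot be interpreted directly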

### Recommendations:
1. **For Production**: Choose the model based on the business cost of false positives versus false negatives
2. **Threshold Tuning**: Adjust the classification threshold to match business requirements instead of relying on the default of 0.5
3. **Ensemble Methods**: Consider combining multiple models for better performance
4. **Real-time Monitoring**: Implement continuous model monitoring and retraining
5. **Feature Engineering**: Explore additional features such as transaction patterns and time-based features
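
The first two recommendations can be combined into a simple cost-based threshold search. The sketch below is one possible approach, not a prescribed method: the function `pick_threshold`, the cost values (a missed fraud assumed to cost 50 times a false alarm), and the toy labels and scores are all hypothetical and should be replaced with real business costs and the model's predicted probabilities.

```python
import numpy as np

def pick_threshold(y_true, y_score, cost_fp=1.0, cost_fn=50.0):
    """Return the grid threshold that minimizes total misclassification cost."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    best_threshold, best_cost = 0.5, np.inf
    for threshold in np.linspace(0.01, 0.99, 99):
        y_pred = y_score >= threshold
        false_pos = np.sum(y_pred & (y_true == 0))   # normal transactions flagged as fraud
        false_neg = np.sum(~y_pred & (y_true == 1))  # fraud cases that were missed
        cost = cost_fp * false_pos + cost_fn * false_neg
        if cost < best_cost:  # strict comparison keeps the lowest threshold on ties
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

# Toy labels and scores for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]

threshold, cost = pick_threshold(y_true, y_score)
print(f"Chosen threshold: {threshold:.2f}, total cost: {cost:.0f}")
```

With a high cost for missed fraud, the search settles on a threshold below 0.5 that catches every fraud case in the toy data and accepts one false alarm. To avoid an optimistic estimate, the threshold should be chosen on a validation split rather than on the test set.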

### Next Steps:
- Hyperparameter tuning using GridSearchCV
- Try advanced algorithms (XGBoost, Neural Networks)
- Implement cost-sensitive learning
- Deploy model using Flask/FastAPI
- Set up model monitoring and alerting system
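
As a starting point for the first and third items, the sketch below runs `GridSearchCV` over the regularization strength and the `class_weight` option of `LogisticRegression`, which is a simple form of cost-sensitive learning. The synthetic data, the parameter grid, and the choice of `average_precision` as the scoring metric are assumptions for illustration; the notebook's `X_train` and `y_train` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the training split.
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, weights=[0.97, 0.03], random_state=42
)

param_grid = {
    "C": [0.1, 1.0, 10.0],
    # None keeps equal class weights; "balanced" reweights inversely to class frequency.
    "class_weight": [None, "balanced"],
}

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="average_precision",
    # Stratified folds keep the fraud rate roughly equal in every fold.
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated average precision: {search.best_score_:.3f}")
```

Class weighting needs no resampling step, so it can be compared directly against the SMOTE results using the same test set.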