Fraud Detection - Data Loading and Initial Exploration

Import Libraries

In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Load Dataset

In [2]:
print("="*60)
print("LOADING CREDIT CARD FRAUD DETECTION DATASET")
print("="*60)

# Load the dataset
# Make sure you have downloaded creditcard.csv from Kaggle and placed it in the data/ folder
df = pd.read_csv('../data/creditcard.csv')

print("\n✅ Dataset loaded successfully!")
print(f"📊 Dataset shape: {df.shape}")
print(f"   - Total transactions: {df.shape[0]:,}")
print(f"   - Total columns: {df.shape[1]} ({df.shape[1] - 1} features + 1 target)")

LOADING CREDIT CARD FRAUD DETECTION DATASET

✅ Dataset loaded successfully!
📊 Dataset shape: (284807, 31)
   - Total transactions: 284,807
   - Total columns: 31 (30 features + 1 target)
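
A slightly more defensive version of the load step is sketched below. It assumes the column layout described in this notebook (Time, V1 to V28, Amount, Class); `load_transactions` and `EXPECTED_COLUMNS` are illustrative names, not part of the original code.

```python
from pathlib import Path

import pandas as pd

# Assumed column layout: Time, V1..V28, Amount, Class (31 columns in total).
EXPECTED_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def load_transactions(csv_path):
    """Read the CSV and fail early with a readable message if something is off."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(f"{path} not found; download creditcard.csv from Kaggle first")

    df = pd.read_csv(path)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"Unexpected file layout, missing columns: {missing}")
    return df
```

It would be called as `df = load_transactions('../data/creditcard.csv')`, so a wrong path or a different file surfaces immediately rather than in a later cell.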


Basic Information

In [3]:
print("\n" + "="*60)
print("BASIC DATASET INFORMATION")
print("="*60)

# Display first few rows
print("\n📋 First 5 rows of the dataset:")
print(df.head())

# Display column names and data types
print("\n📝 Column Names and Data Types:")
print(df.dtypes)

# Get dataset info
print("\n🔍 Detailed Dataset Info:")
df.info()



BASIC DATASET INFORMATION

📋 First 5 rows of the dataset:
   Time        V1        V2        V3        V4        V5        V6        V7  \
0   0.0 -1.359807 -0.072781  2.536347  1.378155 -0.338321  0.462388  0.239599   
1   0.0  1.191857  0.266151  0.166480  0.448154  0.060018 -0.082361 -0.078803   
2   1.0 -1.358354 -1.340163  1.773209  0.379780 -0.503198  1.800499  0.791461   
3   1.0 -0.966272 -0.185226  1.792993 -0.863291 -0.010309  1.247203  0.237609   
4   2.0 -1.158233  0.877737  1.548718  0.403034 -0.407193  0.095921  0.592941   

         V8        V9  ...       V21       V22       V23       V24       V25  \
0  0.098698  0.363787  ... -0.018307  0.277838 -0.110474  0.066928  0.128539   
1  0.085102 -0.255425  ... -0.225775 -0.638672  0.101288 -0.339846  0.167170   
2  0.247676 -1.514654  ...  0.247998  0.771679  0.909412 -0.689281 -0.327642   
3  0.377436 -1.387024  ... -0.108300  0.005274 -0.190321 -1.175575  0.647376   
4 -0.270533  0.817739  ... -0.009431  0.798278 -0.1

Check for Missing Values

In [4]:
print("\n" + "="*60)
print("MISSING VALUES ANALYSIS")
print("="*60)

missing_values = df.isnull().sum()
missing_percentage = (df.isnull().sum() / len(df)) * 100

missing_df = pd.DataFrame({
    'Column': df.columns,
    'Missing Values': missing_values.values,
    'Percentage': missing_percentage.values
})

print(missing_df)

# Reuse the per-column counts computed above instead of rescanning the frame
total_missing = missing_values.sum()
if total_missing == 0:
    print("\n✅ No missing values found.")
else:
    print(f"\n⚠️ Total missing values: {total_missing}")


MISSING VALUES ANALYSIS
    Column  Missing Values  Percentage
0     Time               0         0.0
1       V1               0         0.0
2       V2               0         0.0
3       V3               0         0.0
4       V4               0         0.0
5       V5               0         0.0
6       V6               0         0.0
7       V7               0         0.0
8       V8               0         0.0
9       V9               0         0.0
10     V10               0         0.0
11     V11               0         0.0
12     V12               0         0.0
13     V13               0         0.0
14     V14               0         0.0
15     V15               0         0.0
16     V16               0         0.0
17     V17               0         0.0
18     V18               0         0.0
19     V19               0         0.0
20     V20               0         0.0
21     V21               0         0.0
22     V22               0         0.0
23     V23               0         0.0
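
With 31 columns, the full table is mostly zeros. A compact alternative, sketched here on a small made-up frame, reports only the columns that actually contain gaps. `missing_report` is a hypothetical helper name, and the toy values are not taken from creditcard.csv.

```python
import numpy as np
import pandas as pd


def missing_report(frame):
    """Return count and percentage of missing values, only for columns that have any."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "Missing Values": counts,
        "Percentage": counts / len(frame) * 100,
    })
    return report[report["Missing Values"] > 0].sort_values("Missing Values", ascending=False)


# Toy data for illustration only
toy = pd.DataFrame({"Time": [0.0, 1.0, 2.0, 3.0], "Amount": [9.99, np.nan, 5.00, np.nan]})
print(missing_report(toy))
```

On a frame without gaps the function returns an empty table, which is a quick visual confirmation that nothing is missing.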


Statistical Summary

In [5]:
print("\n" + "="*60)
print("STATISTICAL SUMMARY")
print("="*60)

print("\n📊 Statistical summary of all features:")
print(df.describe())



STATISTICAL SUMMARY

📊 Statistical summary of all features:
                Time            V1            V2            V3            V4  \
count  284807.000000  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean    94813.859575  1.175161e-15  3.384974e-16 -1.379537e-15  2.094852e-15   
std     47488.145955  1.958696e+00  1.651309e+00  1.516255e+00  1.415869e+00   
min         0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00   
25%     54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01   
50%     84692.000000  1.810880e-02  6.548556e-02  1.798463e-01 -1.984653e-02   
75%    139320.500000  1.315642e+00  8.037239e-01  1.027196e+00  7.433413e-01   
max    172792.000000  2.454930e+00  2.205773e+01  9.382558e+00  1.687534e+01   

                 V5            V6            V7            V8            V9  \
count  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean   1.021879e-15  1.494498e-15 -5.620335e-16  1.149614

Target Variable Analysis (Class Distribution)

In [6]:
print("\n" + "="*60)
print("TARGET VARIABLE ANALYSIS (CLASS DISTRIBUTION)")
print("="*60)

# Count of each class
class_distribution = df['Class'].value_counts()
print("\n🎯 Class Distribution:")
print(class_distribution)

# Percentage distribution
class_percentage = df['Class'].value_counts(normalize=True) * 100
print("\n📈 Class Distribution (Percentage):")
print(class_percentage)

# Calculate fraud percentage
fraud_percentage = (class_distribution[1] / len(df)) * 100
print(f"\n⚠️ Fraud transactions: {class_distribution[1]:,} ({fraud_percentage:.3f}%)")
print(f"✅ Legitimate transactions: {class_distribution[0]:,} ({100-fraud_percentage:.3f}%)")

# Imbalance ratio
imbalance_ratio = class_distribution[0] / class_distribution[1]
print(f"\n⚖️ Imbalance Ratio: 1:{imbalance_ratio:.2f}")
print(f"   This means for every 1 fraud transaction, there are ~{imbalance_ratio:.0f} legitimate ones")



TARGET VARIABLE ANALYSIS (CLASS DISTRIBUTION)

🎯 Class Distribution:
Class
0    284315
1       492
Name: count, dtype: int64

📈 Class Distribution (Percentage):
Class
0    99.827251
1     0.172749
Name: proportion, dtype: float64

⚠️ Fraud transactions: 492 (0.173%)
✅ Legitimate transactions: 284,315 (99.827%)

⚖️ Imbalance Ratio: 1:577.88
   This means for every 1 fraud transaction, there are ~578 legitimate ones
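
The imbalance has a practical consequence for evaluation, which a few lines of arithmetic on the counts above make concrete: a model that labels every transaction as legitimate is almost always right and still catches no fraud. The variable names below are illustrative.

```python
# Class counts taken from the value_counts() output above
n_legit = 284_315
n_fraud = 492
n_total = n_legit + n_fraud

# Baseline that always predicts class 0 (legitimate)
baseline_accuracy = n_legit / n_total * 100   # share of correct predictions
baseline_recall = 0 / n_fraud * 100           # share of fraud cases caught

print(f"Always-legitimate accuracy: {baseline_accuracy:.3f}%")
print(f"Always-legitimate fraud recall: {baseline_recall:.1f}%")
```

For that reason accuracy alone says little on this dataset; metrics focused on the fraud class, such as recall and precision, are likely to be more informative.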


Feature Understanding

In [7]:
print("\n" + "="*60)
print("FEATURE UNDERSTANDING")
print("="*60)

print("\n📌 Feature Breakdown:")
print("   - Time: Seconds elapsed between first transaction and this one")
print("   - V1-V28: PCA transformed features (anonymized for privacy)")
print("   - Amount: Transaction amount")
print("   - Class: Target variable (0 = Legitimate, 1 = Fraud)")

# Amount statistics
print("\n💰 Amount Statistics:")
print(f"   - Min amount: ${df['Amount'].min():.2f}")
print(f"   - Max amount: ${df['Amount'].max():.2f}")
print(f"   - Mean amount: ${df['Amount'].mean():.2f}")
print(f"   - Median amount: ${df['Amount'].median():.2f}")

# Time statistics
print("\n⏰ Time Statistics:")
print(f"   - Min time: {df['Time'].min():.0f} seconds")
print(f"   - Max time: {df['Time'].max():.0f} seconds")
print(f"   - Duration: {df['Time'].max()/(3600*24):.2f} days")


FEATURE UNDERSTANDING

📌 Feature Breakdown:
   - Time: Seconds elapsed between first transaction and this one
   - V1-V28: PCA transformed features (anonymized for privacy)
   - Amount: Transaction amount
   - Class: Target variable (0 = Legitimate, 1 = Fraud)

💰 Amount Statistics:
   - Min amount: $0.00
   - Max amount: $25691.16
   - Mean amount: $88.35
   - Median amount: $22.00

⏰ Time Statistics:
   - Min time: 0 seconds
   - Max time: 172792 seconds
   - Duration: 2.00 days
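
Two common follow-ups to these statistics are sketched below, as assumptions about later preprocessing rather than steps this notebook has taken. Time is an offset from the first transaction, so any hour derived from it is relative to that unknown start. The mean amount ($88.35) sits well above the median ($22.00), which points to a right-skewed distribution where a log transform is one option for compressing the tail. `relative_hour` and `log_amount` are hypothetical helpers.

```python
import math


def relative_hour(seconds):
    """Hour of day (0-23) counted from the first transaction, not wall-clock time."""
    return int(seconds // 3600) % 24


def log_amount(amount):
    """log(1 + amount): defined for the $0.00 minimum and compresses large values."""
    return math.log1p(amount)


print(relative_hour(0))        # first transaction
print(relative_hour(172792))   # last transaction, late in the second day
print(round(log_amount(25691.16), 2))
```

With pandas the same ideas would typically be applied column-wise, for example with `np.log1p(df['Amount'])`.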


Check for Duplicates

In [8]:
print("\n" + "="*60)
print("DUPLICATE RECORDS CHECK")
print("="*60)

duplicates = df.duplicated().sum()
print(f"\n🔄 Number of duplicate rows: {duplicates}")

if duplicates > 0:
    print("⚠️ Duplicate rows found. Consider removing them in preprocessing.")
else:
    print("✅ No duplicate rows found!")


DUPLICATE RECORDS CHECK

🔄 Number of duplicate rows: 1081
⚠️ Duplicate rows found. Consider removing them in preprocessing.
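
Before dropping the 1,081 duplicates it may be worth checking how many of them are fraud cases, since removing rows from a 492-row minority class shifts the class balance. A minimal sketch on toy data, assuming exact row matches are the right definition of a duplicate (`split_duplicates` is a hypothetical helper, and the toy rows are not taken from creditcard.csv):

```python
import pandas as pd


def split_duplicates(frame, target="Class"):
    """Return (deduplicated frame, duplicate count per class)."""
    dup_mask = frame.duplicated(keep="first")   # True for repeats after the first occurrence
    per_class = frame.loc[dup_mask, target].value_counts()
    deduped = frame.drop_duplicates(keep="first").reset_index(drop=True)
    return deduped, per_class


# Toy rows for illustration only
toy = pd.DataFrame({
    "Time":   [0.0, 0.0, 1.0, 1.0, 2.0],
    "Amount": [9.99, 9.99, 5.00, 5.00, 1.00],
    "Class":  [0, 0, 1, 1, 0],
})
clean, dup_counts = split_duplicates(toy)
print(f"Rows before: {len(toy)}, after: {len(clean)}")
print(dup_counts)
```

On the real data the call would be `split_duplicates(df)`; whether identical rows are true repeats or distinct transactions that happen to coincide is a judgment the PCA-anonymized features cannot settle on their own.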


Data Types Verification

In [9]:
print("\n" + "="*60)
print("DATA TYPES VERIFICATION")
print("="*60)

numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

print(f"\n🔢 Numeric columns: {len(numeric_cols)}")
print(f"📝 Categorical columns: {len(categorical_cols)}")

if len(categorical_cols) == 0:
    print("✅ All features are numeric - perfect for ML models!")


DATA TYPES VERIFICATION

🔢 Numeric columns: 31
📝 Categorical columns: 0
✅ All features are numeric - perfect for ML models!


Final Summary

In [10]:
print("\n" + "="*60)
print("EXPLORATION SUMMARY")
print("="*60)

# Report the duplicate count measured in the duplicates check instead of assuming it is zero
duplicate_status = (
    "✅ No duplicate records" if duplicates == 0
    else f"⚠️ {duplicates:,} duplicate rows found (handle in preprocessing)"
)

print(f"""
✅ Dataset loaded successfully
✅ Shape: {df.shape[0]:,} transactions × {df.shape[1]} columns
✅ No missing values
{duplicate_status}
✅ All features are numeric
✅ Highly imbalanced dataset detected (~{fraud_percentage:.2f}% fraud)

⚠️ KEY INSIGHTS:
   1. Dataset is highly imbalanced - needs sampling techniques
   2. Features V1-V28 are already PCA transformed
   3. Amount and Time need scaling/normalization
   4. Ready for EDA and preprocessing!
""")

# Save basic stats for reference (create reports/ first if it is missing)
import os
os.makedirs('../reports', exist_ok=True)
print("\n💾 Saving basic statistics to reports/...")
basic_stats = df.describe()
basic_stats.to_csv('../reports/basic_statistics.csv')
print("✅ Saved to reports/basic_statistics.csv")

print("\n🎉 Task 1 Complete! Ready for EDA.")


EXPLORATION SUMMARY

✅ Dataset loaded successfully
✅ Shape: 284,807 transactions × 31 columns
✅ No missing values
⚠️ 1,081 duplicate rows found (handle in preprocessing)
✅ All features are numeric
✅ Highly imbalanced dataset detected (~0.17% fraud)

⚠️ KEY INSIGHTS:
   1. Dataset is highly imbalanced - needs sampling techniques
   2. Features V1-V28 are already PCA transformed
   3. Amount and Time need scaling/normalization
   4. Ready for EDA and preprocessing!


💾 Saving basic statistics to reports/...
✅ Saved to reports/basic_statistics.csv

🎉 Task 1 Complete! Ready for EDA.

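The first key insight calls for handling the imbalance. Besides resampling, many classifiers accept per-class weights, and the sketch below computes "balanced" weights with the common n_samples / (n_classes * class_count) heuristic, the same formula scikit-learn documents for `class_weight='balanced'`. It uses the class counts reported earlier in this notebook; `balanced_class_weights` is a hypothetical helper, and whether weighting or resampling works better here is left open.

```python
def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: n_samples / (n_classes * count) for label, count in class_counts.items()}


# Counts from the class distribution output: 0 = legitimate, 1 = fraud
weights = balanced_class_weights({0: 284_315, 1: 492})
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

The ratio between the two weights equals the imbalance ratio computed earlier, so each fraud example counts roughly as much as 578 legitimate ones during training.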