# Credit Card Fraud Data Analysis: Step-by-Step Workflow

## 1. Data Preprocessing in Python (Stratified Sampling)

### **Step 1: Install Packages**

In [1]:
%pip install pandas

### **Step 2: Filtering, Selecting Columns, and Stratified Sample**

In [13]:
import pandas as pd

# Load the dataset
df = pd.read_csv('cc_data.csv')  # Replace with your actual file path

# Select only required columns for downstream analysis
cols_needed = ['amt', 'city_pop', 'is_fraud', 'gender', 'category', 'state', 'job']
df = df[cols_needed]

# Apply basic filters for clean analysis
df = df[df['amt'] > 0]  # Only positive transactions
df = df[df['gender'].notnull()]  # Drop rows with a missing gender value

# Remove duplicates
df = df.drop_duplicates()

# Stratified sampling: take 5% of each 'category' group;
# groups with 20 rows or fewer are kept whole so they are not sampled away
optimum_frac = 0.05
stratified_sample = df.groupby('category', group_keys=False).apply(
    lambda x: x.sample(frac=optimum_frac, random_state=42) if len(x) > 20 else x
).reset_index(drop=True)

# Export to CSV for Excel Online analysis
stratified_sample.to_csv('stratified_sample.csv', index=False)
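
The per-category sampling rule in Step 2 can also be written as an explicit loop over the groups, which makes it easy to check on a small example. The sketch below is a minimal illustration, not the notebook's own code. The helper name `stratified_sample_by`, its `min_size` and `seed` parameters, and the synthetic `demo` data are all assumptions introduced here. It avoids `groupby(...).apply(...)`, which recent pandas releases may warn about when the function operates on the grouping column.

```python
import pandas as pd


def stratified_sample_by(df, column, frac, min_size=20, seed=42):
    """Sample `frac` of each group in `column`.

    Groups with `min_size` rows or fewer are kept whole.
    (Hypothetical helper, written for illustration only.)
    """
    parts = []
    for _, group in df.groupby(column):
        if len(group) > min_size:
            parts.append(group.sample(frac=frac, random_state=seed))
        else:
            parts.append(group)
    return pd.concat(parts).reset_index(drop=True)


# Synthetic stand-in data (made-up values, not taken from cc_data.csv)
demo = pd.DataFrame({
    'category': ['grocery'] * 200 + ['travel'] * 100 + ['misc'] * 10,
    'amt': [float(i + 1) for i in range(310)],
})

sample = stratified_sample_by(demo, 'category', frac=0.05)

# Rows per category in the sample
print(sample['category'].value_counts())
```

Here the two large groups are reduced to 5% of their rows, while the 10-row `misc` group passes through unchanged, matching the `len(x) > 20` rule in Step 2. Comparing `value_counts()` on the full data and on the sample is a quick way to see how retaining small groups shifts the category mix.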