# 01 - Data Acquisition

Download the LendingClub dataset from Kaggle and perform initial exploration.

**Dataset**: [LendingClub Loan Data](https://www.kaggle.com/datasets/wordsforthewise/lending-club)

**Contents**:
- Download data using kagglehub
- Initial exploration
- Save as parquet for faster reloads

In [None]:
import os
import warnings
from pathlib import Path

import kagglehub
import pandas as pd

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [None]:
# Configure paths
DATA_DIR = Path("../data/raw")
DATA_DIR.mkdir(parents=True, exist_ok=True)

print(f"Data directory: {DATA_DIR.resolve()}")

## Download Dataset

The dataset is downloaded programmatically with kagglehub, which caches the files locally and returns the path to the cached copy. Requires Kaggle API credentials.

**Setup**: Create `~/.kaggle/kaggle.json` with your API key from https://www.kaggle.com/settings
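
Before starting the download, it can help to confirm that credentials are actually in place. The check below is a minimal sketch, not part of kagglehub: `kaggle_credentials_available` is a hypothetical helper, and it assumes the usual `kaggle.json` layout with `username` and `key` fields, or the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables as an alternative.

```python
import json
import os
from pathlib import Path


def kaggle_credentials_available(config_path=None, env=None):
    """Return True if Kaggle credentials appear to be configured.

    Hypothetical helper: it only checks that credentials exist,
    not that Kaggle accepts them.
    """
    env = os.environ if env is None else env

    # Assumption: environment variables are an accepted alternative to the file
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return True

    # Default location described in the setup note above
    if config_path is None:
        config_path = Path.home() / ".kaggle" / "kaggle.json"
    config_path = Path(config_path)
    if not config_path.is_file():
        return False

    try:
        config = json.loads(config_path.read_text())
    except (OSError, json.JSONDecodeError):
        return False

    # Assumption: the file holds a JSON object with "username" and "key"
    return (
        isinstance(config, dict)
        and bool(config.get("username"))
        and bool(config.get("key"))
    )


print(f"Kaggle credentials found: {kaggle_credentials_available()}")
```

If this prints `False`, the download cell may prompt for credentials or fail, depending on the kagglehub version and the dataset's access settings.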

In [None]:
# Download LendingClub dataset
path = kagglehub.dataset_download("wordsforthewise/lending-club")
print(f"Dataset downloaded to: {path}")

In [None]:
# List downloaded files
import os
files = os.listdir(path)
print("Downloaded files:")
for f in files:
    size_mb = os.path.getsize(os.path.join(path, f)) / (1024 * 1024)
    print(f"  - {f} ({size_mb:.1f} MB)")

## Load and Explore Data

In [None]:
# Load the accepted loans dataset (the main one)
# Try different possible filenames
possible_files = [
    "accepted_2007_to_2018Q4.csv.gz",
    "accepted_2007_to_2018Q4.csv",
    "lending_club_loan_two.csv"
]

df = None
for filename in possible_files:
    filepath = os.path.join(path, filename)
    if os.path.exists(filepath):
        print(f"Loading {filename}...")
        df = pd.read_csv(filepath, low_memory=False)
        break

if df is None:
    # Fallback: load the first CSV file found (sorted for a deterministic choice)
    csv_files = sorted(f for f in files if f.endswith(('.csv', '.csv.gz')))
    if not csv_files:
        # Fail early instead of reporting success with no data loaded
        raise FileNotFoundError(f"No CSV files found in {path}")
    filepath = os.path.join(path, csv_files[0])
    print(f"Loading {csv_files[0]}...")
    df = pd.read_csv(filepath, low_memory=False)

print("\nDataset loaded successfully!")
print(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns")
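
The filename lookup in the cell above can also be written as a small reusable helper. The sketch below is an illustration rather than part of this notebook: `find_dataset_file` and its parameters are hypothetical names. It tries the preferred filenames first, then falls back to any CSV in the directory, sorted by name so the choice does not depend on directory listing order.

```python
from pathlib import Path


def find_dataset_file(base_dir, preferred_names, suffixes=(".csv", ".csv.gz")):
    """Return the first preferred file that exists, else the first CSV by name.

    Hypothetical helper; raises FileNotFoundError when nothing matches.
    """
    base = Path(base_dir)

    # First pass: exact filenames, in priority order
    for name in preferred_names:
        candidate = base / name
        if candidate.is_file():
            return candidate

    # Second pass: any file with a CSV-like suffix, sorted for determinism
    matches = sorted(
        p for p in base.iterdir() if p.is_file() and p.name.endswith(suffixes)
    )
    if not matches:
        raise FileNotFoundError(f"No CSV files found in {base}")
    return matches[0]
```

In this notebook it could be called as `find_dataset_file(path, possible_files)`, with the result passed to `pd.read_csv`.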

In [None]:
# Basic info
print("Column names:")
print(df.columns.tolist())

In [None]:
# Data types and memory usage
df.info(show_counts=True)

In [None]:
# Statistical summary
df.describe()

## Target Variable: Loan Status

In [None]:
# Check target variable distribution
print("Loan Status Distribution:")
print("=" * 50)
status_counts = df['loan_status'].value_counts()
status_pct = df['loan_status'].value_counts(normalize=True) * 100

for status in status_counts.index:
    print(f"{status:30} {status_counts[status]:>10,} ({status_pct[status]:>5.1f}%)")

In [None]:
# For binary classification, we'll use:
# - Fully Paid = 0 (good loan)
# - Charged Off = 1 (default)
# We'll filter out ambiguous statuses in the cleaning notebook

binary_statuses = ['Fully Paid', 'Charged Off']
binary_count = int(df['loan_status'].isin(binary_statuses).sum())  # count matches without copying the frame
print(f"\nRows with clear outcomes (Fully Paid/Charged Off): {binary_count:,}")
print(f"Percentage of total: {binary_count/len(df)*100:.1f}%")
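
The encoding described in the comments above can be sketched on a toy frame. This is only an illustration under stated assumptions: the rows are made up, and the names `toy`, `status_to_target`, `resolved`, and `target` are hypothetical, not taken from the cleaning notebook.

```python
import pandas as pd

# Toy data for illustration only (not drawn from the real dataset)
toy = pd.DataFrame(
    {
        "loan_status": [
            "Fully Paid",
            "Charged Off",
            "Current",
            "Fully Paid",
            "Late (31-120 days)",
        ]
    }
)

# Fully Paid = 0 (good loan), Charged Off = 1 (default)
status_to_target = {"Fully Paid": 0, "Charged Off": 1}

# Keep only rows with a clear outcome, then map the status to 0/1
resolved = toy[toy["loan_status"].isin(list(status_to_target))].copy()
resolved["target"] = resolved["loan_status"].map(status_to_target).astype("int8")

print(resolved)
print(f"Default rate among resolved loans: {resolved['target'].mean():.1%}")
```

Filtering before mapping avoids missing values in `target`, since statuses outside the mapping would otherwise map to `NaN`.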

## Key Features Preview

In [None]:
# Key features we'll use
key_features = [
    'loan_amnt', 'term', 'int_rate', 'installment', 'grade', 'sub_grade',
    'emp_length', 'home_ownership', 'annual_inc', 'verification_status',
    'purpose', 'dti', 'open_acc', 'revol_bal', 'revol_util', 'total_acc'
]

existing_features = [f for f in key_features if f in df.columns]
print(f"Key features available: {len(existing_features)}/{len(key_features)}")
print("\nPreview of key features:")
df[existing_features].head()

In [None]:
# Missing values in key features
print("Missing values in key features:")
print("=" * 50)
missing = df[existing_features].isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing': missing, 'Pct': missing_pct})
print(missing_df[missing_df['Missing'] > 0].sort_values('Pct', ascending=False))

## Save Raw Data

Save as parquet for faster loading in subsequent notebooks.

In [None]:
# Save as parquet
output_path = DATA_DIR / "lending_club_raw.parquet"
df.to_parquet(output_path, index=False)

file_size_mb = os.path.getsize(output_path) / (1024 * 1024)
print(f"Saved to: {output_path}")
print(f"File size: {file_size_mb:.1f} MB")

In [None]:
# Verify saved data
df_verify = pd.read_parquet(output_path)
print(f"Verification: {df_verify.shape} == {df.shape} ? {df_verify.shape == df.shape}")
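
Matching shapes would not reveal a renamed column or a dtype that changed during the round trip. A slightly stricter check could look like the sketch below; `describe_mismatch` is a hypothetical helper, not part of pandas, and it deliberately does not compare cell values, which would be slow on a frame of this size.

```python
import pandas as pd


def describe_mismatch(original, reloaded):
    """Return a list of differences between two frames; empty if none found.

    Hypothetical helper: compares shape, column names/order, and dtypes only.
    """
    problems = []

    if original.shape != reloaded.shape:
        problems.append(f"shape {original.shape} != {reloaded.shape}")

    if list(original.columns) != list(reloaded.columns):
        problems.append("column names or order differ")
    else:
        # Columns line up, so dtypes can be compared position by position
        changed = [
            col
            for col, before, after in zip(
                original.columns, original.dtypes, reloaded.dtypes
            )
            if before != after
        ]
        if changed:
            problems.append(f"dtype changed for: {changed}")

    return problems
```

In this notebook it could be called as `describe_mismatch(df, df_verify)`; an empty list means none of the three checks found a difference.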

## Next Steps

Proceed to `02_data_cleaning.ipynb` to:
- Filter to binary outcomes
- Handle missing values
- Remove data leakage columns