# 01 - Data Acquisition

Download the LendingClub dataset from Kaggle and perform initial exploration.

**Dataset**: [LendingClub Loan Data](https://www.kaggle.com/datasets/wordsforthewise/lending-club)

**Contents**:

- Download data using kagglehub
- Initial exploration
- Save as parquet for faster reloads


In [1]:
import kagglehub
import pandas as pd
import numpy as np
from pathlib import Path
import warnings

warnings.filterwarnings("ignore")

In [2]:
# Configure paths
DATA_DIR = Path("../data/raw")
DATA_DIR.mkdir(parents=True, exist_ok=True)

print(f"Data directory: {DATA_DIR.resolve()}")

Data directory: /Users/hussain/data/raw


## Download Dataset

The dataset is downloaded programmatically with `kagglehub`, which requires Kaggle API credentials.

**Setup**: Create `~/.kaggle/kaggle.json` with your API key from https://www.kaggle.com/settings
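Before starting a 1.26 GB download, it can help to confirm that credentials are actually in place. The sketch below is a minimal check, assuming the standard `kaggle.json` layout (`username` and `key` fields) or the `KAGGLE_USERNAME` / `KAGGLE_KEY` environment variables. The helper name `find_kaggle_credentials` is hypothetical and not part of `kagglehub`.

```python
import json
import os
from pathlib import Path


def find_kaggle_credentials(home=None, env=None):
    """Return where Kaggle credentials were found ("environment" or "file"), else None."""
    home = Path(home) if home is not None else Path.home()
    env = os.environ if env is None else env

    # Environment variables take the place of the credentials file
    if env.get("KAGGLE_USERNAME") and env.get("KAGGLE_KEY"):
        return "environment"

    # Credentials file described in the setup note above
    cred_file = home / ".kaggle" / "kaggle.json"
    if cred_file.is_file():
        try:
            creds = json.loads(cred_file.read_text())
        except json.JSONDecodeError:
            return None
        if isinstance(creds, dict) and creds.get("username") and creds.get("key"):
            return "file"
    return None


source = find_kaggle_credentials()
print("Kaggle credentials:", source or "not found")
```

The check only confirms that credentials are present; it does not validate them against the Kaggle API.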


In [3]:
# Download LendingClub dataset
path = kagglehub.dataset_download("wordsforthewise/lending-club")
print(f"Dataset downloaded to: {path}")

Downloading from https://www.kaggle.com/api/v1/datasets/download/wordsforthewise/lending-club?dataset_version_number=3...


100%|██████████| 1.26G/1.26G [00:29<00:00, 46.1MB/s]

Extracting files...

Dataset downloaded to: /Users/hussain/.cache/kagglehub/datasets/wordsforthewise/lending-club/versions/3


In [4]:
# List downloaded files
import os

files = os.listdir(path)
print("Downloaded files:")
for f in files:
    size_mb = os.path.getsize(os.path.join(path, f)) / (1024 * 1024)
    print(f"  - {f} ({size_mb:.1f} MB)")

Downloaded files:
  - rejected_2007_to_2018q4.csv (0.0 MB)
  - accepted_2007_to_2018q4.csv (0.0 MB)
  - rejected_2007_to_2018Q4.csv.gz (243.6 MB)
  - accepted_2007_to_2018Q4.csv.gz (374.4 MB)
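The two `.csv` entries reported as 0.0 MB may be directories rather than files, in which case `os.path.getsize` measures the directory entry and not its contents. That is an assumption; the listing above does not confirm it. A minimal sketch that walks the download folder recursively and reports only regular files (the helper name `list_files_with_sizes` is hypothetical):

```python
from pathlib import Path


def list_files_with_sizes(root):
    """Return (relative path, size in MB) for every regular file under root."""
    root = Path(root)
    results = []
    for p in sorted(root.rglob("*")):
        if p.is_file():  # skip directories, whose own size is not meaningful
            size_mb = p.stat().st_size / (1024 * 1024)
            results.append((p.relative_to(root).as_posix(), size_mb))
    return results
```

Calling `list_files_with_sizes(path)` with the download path from the cell above would then show any files nested inside those entries.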


## Load and Explore Data

Loading the full dataset is slow: expect roughly 5-10 minutes on a MacBook Air, depending on available RAM.
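Because the full load takes minutes, a quick look at the header can catch a wrong file before committing to it. The sketch below reads only the first few rows of a gzipped CSV with the standard library, assuming UTF-8 encoding. The helper name `peek_csv_gz` is hypothetical.

```python
import csv
import gzip
from itertools import islice


def peek_csv_gz(filepath, n_rows=3):
    """Read the header and the first n_rows data rows of a gzipped CSV."""
    with gzip.open(filepath, mode="rt", newline="", encoding="utf-8") as fh:
        reader = csv.reader(fh)
        header = next(reader)  # first line holds the column names
        rows = list(islice(reader, n_rows))  # stop after n_rows; the rest is never read
    return header, rows
```

Only the start of the archive is decompressed, so the call returns almost immediately even on the 374.4 MB file.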


In [5]:
# Load the accepted loans dataset (the main one)
# Try different possible filenames
possible_files = [
    "accepted_2007_to_2018Q4.csv.gz",
    "accepted_2007_to_2018Q4.csv",
    "lending_club_loan_two.csv",
]

df = None
for filename in possible_files:
    filepath = os.path.join(path, filename)
    # isfile (not exists) so that a directory with a matching name is skipped
    if os.path.isfile(filepath):
        print(f"Loading {filename}...")
        df = pd.read_csv(filepath, low_memory=False)
        break

if df is None:
    # Fallback: load the first CSV file found in the download directory
    csv_files = [f for f in files if f.endswith((".csv", ".csv.gz"))]
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {path}")
    filepath = os.path.join(path, csv_files[0])
    print(f"Loading {csv_files[0]}...")
    df = pd.read_csv(filepath, low_memory=False)

print("\nDataset loaded successfully!")
print(f"Shape: {df.shape[0]:,} rows x {df.shape[1]} columns")

Loading accepted_2007_to_2018Q4.csv.gz...

Dataset loaded successfully!
Shape: 2,260,701 rows x 151 columns


In [6]:
# Basic info
print("Column names:")
print(df.columns.tolist())

Column names:
['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 'sub_grade', 'emp_title', 'emp_length', 'home_ownership', 'annual_inc', 'verification_status', 'issue_d', 'loan_status', 'pymnt_plan', 'url', 'desc', 'purpose', 'title', 'zip_code', 'addr_state', 'dti', 'delinq_2yrs', 'earliest_cr_line', 'fico_range_low', 'fico_range_high', 'inq_last_6mths', 'mths_since_last_delinq', 'mths_since_last_record', 'open_acc', 'pub_rec', 'revol_bal', 'revol_util', 'total_acc', 'initial_list_status', 'out_prncp', 'out_prncp_inv', 'total_pymnt', 'total_pymnt_inv', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee', 'last_pymnt_d', 'last_pymnt_amnt', 'next_pymnt_d', 'last_credit_pull_d', 'last_fico_range_high', 'last_fico_range_low', 'collections_12_mths_ex_med', 'mths_since_last_major_derog', 'policy_code', 'application_type', 'annual_inc_joint', 'dti_joint', 'verification_status_joint', 'ac

In [7]:
# Data types and memory usage
df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2260701 entries, 0 to 2260700
Columns: 151 entries, id to settlement_term
dtypes: float64(113), object(38)
memory usage: 2.5+ GB
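With 38 text columns and over 2.5 GB in memory, one common way to shrink the frame is to store repetitive text columns (such as `grade` or `term`) as `category`. This is a minimal sketch on a small made-up frame, not something the notebook itself does. The helper name `categorize_low_cardinality` and the `max_unique=50` threshold are assumptions.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def categorize_low_cardinality(frame, max_unique=50):
    """Return a copy with low-cardinality text columns stored as category."""
    out = frame.copy()
    for col in out.columns:
        is_text = is_object_dtype(out[col]) or is_string_dtype(out[col])
        if is_text and out[col].nunique(dropna=True) <= max_unique:
            out[col] = out[col].astype("category")
    return out


# Made-up demo data: one repetitive column, one near-unique column, one numeric
demo = pd.DataFrame({
    "grade": ["A", "B", "C", "A"] * 250,
    "emp_title": [f"title_{i}" for i in range(1000)],
    "loan_amnt": [1000.0] * 1000,
})
slim = categorize_low_cardinality(demo)
before = demo.memory_usage(deep=True).sum()
after = slim.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```

Near-unique columns such as `emp_title` are left alone because a category with one entry per row saves nothing. Note that the helper copies the frame, which temporarily doubles memory use on the full dataset.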


In [8]:
# Statistical summary
df.describe()

| | member_id | loan_amnt | funded_amnt | funded_amnt_inv | int_rate | installment | annual_inc | dti | delinq_2yrs | fico_range_low | ... | deferral_term | hardship_amount | hardship_length | hardship_dpd | orig_projected_additional_accrued_interest | hardship_payoff_balance_amount | hardship_last_payment_amount | settlement_amount | settlement_percentage | settlement_term |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 0.0 | 2260668.0 | 2260668.0 | 2260668.0 | 2260668.0 | 2260668.0 | 2260664.0 | 2258957.0 | 2260639.0 | 2260668.0 | ... | 10917.0 | 10917.0 | 10917.0 | 10917.0 | 8651.0 | 10917.0 | 10917.0 | 34246.0 | 34246.0 | 34246.0 |
| mean | NaN | 15046.93 | 15041.66 | 15023.44 | 13.09283 | 445.8068 | 77992.43 | 18.8242 | 0.3068792 | 698.5882 | ... | 3.0 | 155.045981 | 3.0 | 13.743886 | 454.798089 | 11636.883942 | 193.994321 | 5010.664267 | 47.780365 | 13.191322 |
| std | NaN | 9190.245 | 9188.413 | 9192.332 | 4.832138 | 267.1735 | 112696.2 | 14.18333 | 0.8672303 | 33.01038 | ... | 0.0 | 129.040594 | 0.0 | 9.671178 | 375.3855 | 7625.988281 | 198.629496 | 3693.12259 | 7.311822 | 8.15998 |
| min | NaN | 500.0 | 500.0 | 0.0 | 5.31 | 4.93 | 0.0 | -1.0 | 0.0 | 610.0 | ... | 3.0 | 0.64 | 3.0 | 0.0 | 1.92 | 55.73 | 0.01 | 44.21 | 0.2 | 0.0 |
| 25% | NaN | 8000.0 | 8000.0 | 8000.0 | 9.49 | 251.65 | 46000.0 | 11.89 | 0.0 | 675.0 | ... | 3.0 | 59.44 | 3.0 | 5.0 | 175.23 | 5627.0 | 44.44 | 2208.0 | 45.0 | 6.0 |
| 50% | NaN | 12900.0 | 12875.0 | 12800.0 | 12.62 | 377.99 | 65000.0 | 17.84 | 0.0 | 690.0 | ... | 3.0 | 119.14 | 3.0 | 15.0 | 352.77 | 10028.39 | 133.16 | 4146.11 | 45.0 | 14.0 |
| 75% | NaN | 20000.0 | 20000.0 | 20000.0 | 15.99 | 593.32 | 93000.0 | 24.49 | 0.0 | 715.0 | ... | 3.0 | 213.26 | 3.0 | 22.0 | 620.175 | 16151.89 | 284.19 | 6850.1725 | 50.0 | 18.0 |
| max | NaN | 40000.0 | 40000.0 | 40000.0 | 30.99 | 1719.83 | 110000000.0 | 999.0 | 58.0 | 845.0 | ... | 3.0 | 943.94 | 3.0 | 37.0 | 2680.89 | 40306.41 | 1407.86 | 33601.0 | 521.35 | 181.0 |


## Target Variable: Loan Status


In [9]:
# Check target variable distribution
print("Loan Status Distribution:")
print("=" * 50)
status_counts = df["loan_status"].value_counts()
status_pct = df["loan_status"].value_counts(normalize=True) * 100

for status in status_counts.index:
    print(f"{status:30} {status_counts[status]:>10,} ({status_pct[status]:>5.1f}%)")

Loan Status Distribution:
Fully Paid                      1,076,751 ( 47.6%)
Current                           878,317 ( 38.9%)
Charged Off                       268,559 ( 11.9%)
Late (31-120 days)                 21,467 (  0.9%)
In Grace Period                     8,436 (  0.4%)
Late (16-30 days)                   4,349 (  0.2%)
Does not meet the credit policy. Status:Fully Paid      1,988 (  0.1%)
Does not meet the credit policy. Status:Charged Off        761 (  0.0%)
Default                                40 (  0.0%)


In [10]:
# For binary classification, we'll use:
# - Fully Paid = 0 (good loan)
# - Charged Off = 1 (default)
# We'll filter out ambiguous statuses in the cleaning notebook

binary_statuses = ["Fully Paid", "Charged Off"]
# Count matching rows from the boolean mask instead of copying the filtered frame
binary_count = int(df["loan_status"].isin(binary_statuses).sum())
print(f"\nRows with clear outcomes (Fully Paid/Charged Off): {binary_count:,}")
print(f"Percentage of total: {binary_count / len(df) * 100:.1f}%")


Rows with clear outcomes (Fully Paid/Charged Off): 1,345,310
Percentage of total: 59.5%
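From the counts above, charged-off loans make up 268,559 of the 1,345,310 clear-outcome rows, or about 20%, so the binary target is moderately imbalanced. The mapping described in the cell comments could be sketched as follows. This is one possible approach, not necessarily what the cleaning notebook does; the helper name `add_binary_target` and the column name `target` are hypothetical.

```python
import pandas as pd

# Mapping taken from the cell comments: Fully Paid = 0, Charged Off = 1
STATUS_TO_TARGET = {"Fully Paid": 0, "Charged Off": 1}


def add_binary_target(frame, status_col="loan_status", target_col="target"):
    """Keep only rows with a clear outcome and add a 0/1 target column."""
    mask = frame[status_col].isin(list(STATUS_TO_TARGET))
    out = frame.loc[mask].copy()
    out[target_col] = out[status_col].map(STATUS_TO_TARGET).astype("int8")
    return out


# Made-up demo rows covering kept, dropped, and missing statuses
demo = pd.DataFrame(
    {"loan_status": ["Fully Paid", "Current", "Charged Off", None, "Default"]}
)
labeled = add_binary_target(demo)
print(labeled["target"].tolist())  # [0, 1]
```

Rows with any other status, including missing values, are dropped rather than mapped, which matches the plan to filter out ambiguous statuses.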


## Key Features Preview


In [11]:
# Key features we'll use
key_features = [
    "loan_amnt",
    "term",
    "int_rate",
    "installment",
    "grade",
    "sub_grade",
    "emp_length",
    "home_ownership",
    "annual_inc",
    "verification_status",
    "purpose",
    "dti",
    "open_acc",
    "revol_bal",
    "revol_util",
    "total_acc",
]

existing_features = [f for f in key_features if f in df.columns]
print(f"Key features available: {len(existing_features)}/{len(key_features)}")
print("\nPreview of key features:")
df[existing_features].head()

Key features available: 16/16

Preview of key features:


| | loan_amnt | term | int_rate | installment | grade | sub_grade | emp_length | home_ownership | annual_inc | verification_status | purpose | dti | open_acc | revol_bal | revol_util | total_acc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3600.0 | 36 months | 13.99 | 123.03 | C | C4 | 10+ years | MORTGAGE | 55000.0 | Not Verified | debt_consolidation | 5.91 | 7.0 | 2765.0 | 29.7 | 13.0 |
| 1 | 24700.0 | 36 months | 11.99 | 820.28 | C | C1 | 10+ years | MORTGAGE | 65000.0 | Not Verified | small_business | 16.06 | 22.0 | 21470.0 | 19.2 | 38.0 |
| 2 | 20000.0 | 60 months | 10.78 | 432.66 | B | B4 | 10+ years | MORTGAGE | 63000.0 | Not Verified | home_improvement | 10.78 | 6.0 | 7869.0 | 56.2 | 18.0 |
| 3 | 35000.0 | 60 months | 14.85 | 829.9 | C | C5 | 10+ years | MORTGAGE | 110000.0 | Source Verified | debt_consolidation | 17.06 | 13.0 | 7802.0 | 11.6 | 17.0 |
| 4 | 10400.0 | 60 months | 22.45 | 289.91 | F | F1 | 3 years | MORTGAGE | 104433.0 | Source Verified | major_purchase | 25.37 | 12.0 | 21929.0 | 64.5 | 35.0 |


In [12]:
# Missing values in key features
print("Missing values in key features:")
print("=" * 50)
missing = df[existing_features].isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({"Missing": missing, "Pct": missing_pct})
print(missing_df[missing_df["Missing"] > 0].sort_values("Pct", ascending=False))

Missing values in key features:
                     Missing   Pct
emp_length            146940  6.50
dti                     1744  0.08
revol_util              1835  0.08
loan_amnt                 33  0.00
term                      33  0.00
int_rate                  33  0.00
installment               33  0.00
grade                     33  0.00
sub_grade                 33  0.00
home_ownership            33  0.00
annual_inc                37  0.00
verification_status       33  0.00
purpose                   33  0.00
open_acc                  62  0.00
revol_bal                 33  0.00
total_acc                 62  0.00
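Ten of the sixteen features are missing exactly 33 values, which hints that the same 33 rows may be empty across all columns (for example, footer lines in the CSV). That is an inference from the counts, not something this notebook verifies. A minimal sketch of how it could be checked, using a hypothetical helper `count_empty_rows`:

```python
import pandas as pd


def count_empty_rows(frame, columns):
    """Count rows in which every one of the given columns is null."""
    return int(frame[columns].isna().all(axis=1).sum())


# Made-up demo: only the second row is null in both columns
demo = pd.DataFrame({
    "loan_amnt": [1000.0, None, 2500.0, None],
    "grade": ["A", None, "B", "C"],
})
print(count_empty_rows(demo, ["loan_amnt", "grade"]))  # 1
```

On the real data, `count_empty_rows(df, existing_features)` returning 33 would support dropping those rows outright during cleaning.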


## Save Raw Data

Save as parquet for faster loading in subsequent notebooks.


In [13]:
# Save as parquet
output_path = DATA_DIR / "lending_club_raw.parquet"
df.to_parquet(output_path, index=False)

file_size_mb = os.path.getsize(output_path) / (1024 * 1024)
print(f"Saved to: {output_path}")
print(f"File size: {file_size_mb:.1f} MB")

Saved to: ../data/raw/lending_club_raw.parquet
File size: 335.7 MB


In [14]:
# Verify saved data
df_verify = pd.read_parquet(output_path)
print(f"Verification: {df_verify.shape} == {df.shape} ? {df_verify.shape == df.shape}")

Verification: (2260701, 151) == (2260701, 151) ? True
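A matching shape confirms the row and column counts but not the contents. A slightly stronger check, sketched below under the assumption that column order and per-column null counts are enough of a sanity test, compares those as well. The helper name `frames_match` is hypothetical.

```python
import pandas as pd


def frames_match(original, reloaded):
    """Check shape, column order, and per-column null counts of two frames."""
    if original.shape != reloaded.shape:
        return False
    if list(original.columns) != list(reloaded.columns):
        return False
    # Null counts per column should survive a save/load round trip unchanged
    return bool((original.isna().sum() == reloaded.isna().sum()).all())


# Made-up demo frames
a = pd.DataFrame({"x": [1.0, None], "y": ["a", "b"]})
print(frames_match(a, a.copy()))                  # True
print(frames_match(a, a.assign(x=[1.0, 2.0])))    # False: null count in x differs
```

For a strict value-by-value comparison, `pd.testing.assert_frame_equal` is the more thorough option, at the cost of extra time and memory on a frame this large.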


## Next Steps

Proceed to `02_data_cleaning.ipynb` to:

- Filter to binary outcomes
- Handle missing values
- Remove data leakage columns
