# Bronze Layer - Data Ingestion

## Lending Club Loan Data Pipeline

**Use Case:** Predict loan default risk and analyze factors affecting loan approval

This notebook handles the first layer of the Medallion Architecture:
- Load raw CSV data
- Minimal transformation (preserve raw state)
- Store in efficient Parquet format

**Dataset:** Lending Club Loan Data (2007-2018)
- `accepted_2007_to_2018Q4.csv` - Approved loans
- `rejected_2007_to_2018Q4.csv` - Rejected loan applications

## 1. Setup and Configuration

In [1]:
import time
import json
from collections import defaultdict
from functools import reduce
from typing import List, Tuple, Any
import builtins
import findspark
import os

findspark.init()

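# Import PySpark; if it is missing, print an install hint instead of failing
try:
    from pyspark.sql import SparkSession
    from pyspark.sql.window import Window
    # NOTE: the wildcard import shadows Python builtins such as sum, min, max and round;
    # use builtins.<name> (imported above) when the plain Python version is needed
    from pyspark.sql.functions import *
    import pyspark.sql.functions as F
    from pyspark.sql.types import *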
    pyspark_available = True
except ImportError:
    print("PySpark not available. Install with: pip install pyspark")
    pyspark_available = False

print("Setup complete!")

if pyspark_available:
    # Initialize Spark Session
    spark = SparkSession.builder \
        .appName("LendingClub-Bronze-Layer") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
        .getOrCreate()

    # Set log level to reduce noise
    spark.sparkContext.setLogLevel("ERROR")

    print(f"Spark Version: {spark.version}")

Setup complete!


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/25 15:44:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Spark Version: 3.5.0


In [2]:
# Define paths
# Update these paths according to your setup
RAW_DATA_PATH = "../data/lendingclub/"
BRONZE_PATH = "../data/medallion/bronze/"

# Create bronze directory if it doesn't exist
os.makedirs(BRONZE_PATH, exist_ok=True)

# Input files
ACCEPTED_LOANS_FILE = os.path.join(RAW_DATA_PATH, "accepted_2007_to_2018Q4.csv")
REJECTED_LOANS_FILE = os.path.join(RAW_DATA_PATH, "rejected_2007_to_2018Q4.csv")

print(f"Accepted loans file exists: {os.path.exists(ACCEPTED_LOANS_FILE)}")
print(f"Rejected loans file exists: {os.path.exists(REJECTED_LOANS_FILE)}")

Accepted loans file exists: True
Rejected loans file exists: True
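
The check above only prints a flag, so a missing file would surface later as a Spark read error. A minimal sketch of a fail-fast guard follows; `require_files` is a hypothetical helper that is not part of the original notebook.

```python
import os
from typing import Iterable, List


def require_files(paths: Iterable[str]) -> List[str]:
    """Return the paths unchanged, or raise if any of them is missing."""
    checked = list(paths)
    # Collect every missing path so the error message reports all of them at once
    missing = [p for p in checked if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"Missing input file(s): {', '.join(missing)}")
    return checked
```

In this notebook it could be called as `require_files([ACCEPTED_LOANS_FILE, REJECTED_LOANS_FILE])` right after the paths are defined.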


## 2. Data Exploration (Quick Look at Raw Files)

In [3]:
# Quick peek at the raw files using shell commands
# This helps understand the structure before loading into Spark
!head -5 {ACCEPTED_LOANS_FILE}

id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,to

In [4]:
# Check file sizes
!ls -lh {RAW_DATA_PATH}

total 3.3G
-rw-r--r-- 1 ubuntu ubuntu 1.6G Nov 25 09:17 accepted_2007_to_2018Q4.csv
-rw-r--r-- 1 ubuntu ubuntu 1.7G Nov 25 09:21 rejected_2007_to_2018Q4.csv


In [5]:
# Count lines in files (to know what we're dealing with)
!wc -l {ACCEPTED_LOANS_FILE}
!wc -l {REJECTED_LOANS_FILE}

2260702 ../data/lendingclub/accepted_2007_to_2018Q4.csv
27648742 ../data/lendingclub/rejected_2007_to_2018Q4.csv
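
`wc -l` counts newline characters, so the totals above include the header line and would over-count any record whose quoted field contains a line break. A minimal standard-library sketch of counting logical CSV records instead is shown here; `count_csv_records` is a hypothetical helper, and it assumes the file is UTF-8 encoded.

```python
import csv


def count_csv_records(path: str, has_header: bool = True) -> int:
    """Count logical CSV records, treating a quoted multi-line field as one record."""
    count = 0
    # newline="" lets the csv module handle line breaks inside quoted fields
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # skip the header row if there is one
        for _ in reader:
            count += 1
    return count
```

The function streams the file row by row, so memory use stays flat, but it still has to scan the whole file once.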


## 3. Ingest Accepted Loans Data

In [6]:
# Read accepted loans CSV
# inferSchema=False keeps every column as a string, so the raw values are preserved
# This is important for the Bronze layer - minimal transformation

accepted_raw = spark.read.csv(
    ACCEPTED_LOANS_FILE,
    header=True,
    inferSchema=False,  # Keep all as strings - Bronze layer principle
    multiLine=True,     # Allow quoted fields to span lines (the file is then not split, so it loads as one partition)
    escape='"'          # Quotes inside quoted fields are escaped by doubling ("")
)

print(f"Accepted loans - Row count: {accepted_raw.count()}")
print(f"Accepted loans - Column count: {len(accepted_raw.columns)}")

Accepted loans - Row count: 2260701
Accepted loans - Column count: 151

In [7]:
# Show schema - all columns should be StringType
accepted_raw.printSchema()

root
 |-- id: string (nullable = true)
 |-- member_id: string (nullable = true)
 |-- loan_amnt: string (nullable = true)
 |-- funded_amnt: string (nullable = true)
 |-- funded_amnt_inv: string (nullable = true)
 |-- term: string (nullable = true)
 |-- int_rate: string (nullable = true)
 |-- installment: string (nullable = true)
 |-- grade: string (nullable = true)
 |-- sub_grade: string (nullable = true)
 |-- emp_title: string (nullable = true)
 |-- emp_length: string (nullable = true)
 |-- home_ownership: string (nullable = true)
 |-- annual_inc: string (nullable = true)
 |-- verification_status: string (nullable = true)
 |-- issue_d: string (nullable = true)
 |-- loan_status: string (nullable = true)
 |-- pymnt_plan: string (nullable = true)
 |-- url: string (nullable = true)
 |-- desc: string (nullable = true)
 |-- purpose: string (nullable = true)
 |-- title: string (nullable = true)
 |-- zip_code: string (nullable = true)
 |-- addr_state: string (nullable = true)
 |-- dti: string 

In [8]:
# Preview the data
accepted_raw.show(5, truncate=50)


In [9]:
# List all columns
print("Columns in accepted loans dataset:")
for i, col_name in enumerate(accepted_raw.columns):
    print(f"{i+1}. {col_name}")

Columns in accepted loans dataset:
1. id
2. member_id
3. loan_amnt
4. funded_amnt
5. funded_amnt_inv
6. term
7. int_rate
8. installment
9. grade
10. sub_grade
11. emp_title
12. emp_length
13. home_ownership
14. annual_inc
15. verification_status
16. issue_d
17. loan_status
18. pymnt_plan
19. url
20. desc
21. purpose
22. title
23. zip_code
24. addr_state
25. dti
26. delinq_2yrs
27. earliest_cr_line
28. fico_range_low
29. fico_range_high
30. inq_last_6mths
31. mths_since_last_delinq
32. mths_since_last_record
33. open_acc
34. pub_rec
35. revol_bal
36. revol_util
37. total_acc
38. initial_list_status
39. out_prncp
40. out_prncp_inv
41. total_pymnt
42. total_pymnt_inv
43. total_rec_prncp
44. total_rec_int
45. total_rec_late_fee
46. recoveries
47. collection_recovery_fee
48. last_pymnt_d
49. last_pymnt_amnt
50. next_pymnt_d
51. last_credit_pull_d
52. last_fico_range_high
53. last_fico_range_low
54. collections_12_mths_ex_med
55. mths_since_last_major_derog
56. policy_code
57. application_ty

In [10]:
# Add metadata columns (good practice for Bronze layer)
accepted_bronze = accepted_raw \
    .withColumn("_ingestion_timestamp", current_timestamp()) \
    .withColumn("_source_file", lit("accepted_2007_to_2018Q4.csv")) \
    .withColumn("_data_source", lit("lending_club"))

accepted_bronze.select("_ingestion_timestamp", "_source_file", "_data_source").show(3)

+--------------------+--------------------+------------+
|_ingestion_timestamp|        _source_file|_data_source|
+--------------------+--------------------+------------+
|2025-11-25 15:44:...|accepted_2007_to_...|lending_club|
|2025-11-25 15:44:...|accepted_2007_to_...|lending_club|
|2025-11-25 15:44:...|accepted_2007_to_...|lending_club|
+--------------------+--------------------+------------+
only showing top 3 rows
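
Note that `current_timestamp()` is evaluated when a query runs, not when the column is defined: the preview above shows 15:44, while the rows read back from Parquet in section 7 show 15:46, the time of the write job. If one fixed timestamp per run is wanted, a minimal sketch is to compute the metadata values once in plain Python; `build_bronze_metadata` is a hypothetical helper, and using UTC is an assumption.

```python
import os
from datetime import datetime, timezone
from typing import Dict


def build_bronze_metadata(source_path: str, data_source: str) -> Dict[str, object]:
    """Compute the Bronze metadata values once, before any Spark job runs."""
    return {
        # Captured eagerly, so every later query sees the same value
        "_ingestion_timestamp": datetime.now(timezone.utc),
        # Derived from the path instead of being typed a second time
        "_source_file": os.path.basename(source_path),
        "_data_source": data_source,
    }
```

Each entry could then be attached with `withColumn(name, lit(value))`, in the same way as the literals in the cell above.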



## 4. Ingest Rejected Loans Data

In [11]:
# Read rejected loans CSV
rejected_raw = spark.read.csv(
    REJECTED_LOANS_FILE,
    header=True,
    inferSchema=False,
    multiLine=True,
    escape='"'
)

print(f"Rejected loans - Row count: {rejected_raw.count()}")
print(f"Rejected loans - Column count: {len(rejected_raw.columns)}")

Rejected loans - Row count: 27648741
Rejected loans - Column count: 9

In [12]:
# Show schema
rejected_raw.printSchema()

root
 |-- Amount Requested: string (nullable = true)
 |-- Application Date: string (nullable = true)
 |-- Loan Title: string (nullable = true)
 |-- Risk_Score: string (nullable = true)
 |-- Debt-To-Income Ratio: string (nullable = true)
 |-- Zip Code: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Employment Length: string (nullable = true)
 |-- Policy Code: string (nullable = true)



In [13]:
# Preview rejected loans data
rejected_raw.show(5, truncate=50)

+----------------+----------------+--------------------------------+----------+--------------------+--------+-----+-----------------+-----------+
|Amount Requested|Application Date|                      Loan Title|Risk_Score|Debt-To-Income Ratio|Zip Code|State|Employment Length|Policy Code|
+----------------+----------------+--------------------------------+----------+--------------------+--------+-----+-----------------+-----------+
|          1000.0|      2007-05-26|Wedding Covered but No Honeymoon|     693.0|                 10%|   481xx|   NM|          4 years|        0.0|
|          1000.0|      2007-05-26|              Consolidating Debt|     703.0|                 10%|   010xx|   MA|         < 1 year|        0.0|
|         11000.0|      2007-05-27|     Want to consolidate my debt|     715.0|                 10%|   212xx|   MD|           1 year|        0.0|
|          6000.0|      2007-05-27|                         waksman|     698.0|              38.64%|   017xx|   MA|         

In [14]:
# Add metadata columns
rejected_bronze = rejected_raw \
    .withColumn("_ingestion_timestamp", current_timestamp()) \
    .withColumn("_source_file", lit("rejected_2007_to_2018Q4.csv")) \
    .withColumn("_data_source", lit("lending_club"))

## 5. Data Quality Checks (Bronze Level)

In [15]:
# Basic data quality metrics for accepted loans
print("=== Accepted Loans - Bronze Quality Report ===")
print(f"Total rows: {accepted_bronze.count()}")
print(f"Total columns: {len(accepted_bronze.columns)}")
print(f"Partitions: {accepted_bronze.rdd.getNumPartitions()}")

=== Accepted Loans - Bronze Quality Report ===
Total rows: 2260701
Total columns: 154
Partitions: 1

In [16]:
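# Check for null values in key columns
key_columns = ['loan_amnt', 'term', 'int_rate', 'grade', 'loan_status']
present_columns = [c for c in key_columns if c in accepted_bronze.columns]

# Count nulls for all key columns in one pass instead of one full scan per column
null_counts = accepted_bronze.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in present_columns]
).first()

print("\nNull counts in key columns (Accepted Loans):")
for col_name in present_columns:
    print(f"  {col_name}: {null_counts[col_name]}")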


Null counts in key columns (Accepted Loans):
  loan_amnt: 33
  term: 33
  int_rate: 33
  grade: 33
  loan_status: 33

In [17]:
# Sample values from important columns
print("\nSample values from key columns:")
accepted_bronze.select('loan_amnt', 'term', 'int_rate', 'grade', 'loan_status').show(10, truncate=False)


Sample values from key columns:
+---------+----------+--------+-----+-----------+
|loan_amnt|term      |int_rate|grade|loan_status|
+---------+----------+--------+-----+-----------+
|3600.0   | 36 months|13.99   |C    |Fully Paid |
|24700.0  | 36 months|11.99   |C    |Fully Paid |
|20000.0  | 60 months|10.78   |B    |Fully Paid |
|35000.0  | 60 months|14.85   |C    |Current    |
|10400.0  | 60 months|22.45   |F    |Fully Paid |
|11950.0  | 36 months|13.44   |C    |Fully Paid |
|20000.0  | 36 months|9.17    |B    |Fully Paid |
|20000.0  | 36 months|8.49    |B    |Fully Paid |
|10000.0  | 36 months|6.49    |A    |Fully Paid |
|8000.0   | 36 months|11.48   |B    |Fully Paid |
+---------+----------+--------+-----+-----------+
only showing top 10 rows
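
Every key column reports the same null count (33), which suggests, but does not prove, that the same 33 rows are empty in all of them. A minimal plain-Python sketch of that check on an in-memory sample of rows follows; `count_fully_null_rows` is a hypothetical helper, and in Spark the same predicate could be built by combining `col(c).isNull()` conditions with `&`.

```python
from typing import Dict, Iterable, Optional, Sequence


def count_fully_null_rows(
    rows: Iterable[Dict[str, Optional[str]]], key_columns: Sequence[str]
) -> int:
    """Count rows in which every key column is None."""
    count = 0
    for row in rows:
        # A row qualifies only if none of the key columns holds a value
        if all(row.get(name) is None for name in key_columns):
            count += 1
    return count
```

If the result equals the per-column null count, the nulls come from the same rows, which can then be inspected before the Silver layer decides whether to drop them.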



## 6. Save to Bronze Layer (Parquet Format)

In [18]:
# Save accepted loans to Bronze layer
# Parquet is columnar and efficient for analytical workloads

BRONZE_ACCEPTED_PATH = os.path.join(BRONZE_PATH, "accepted_loans")

accepted_bronze.write \
    .mode("overwrite") \
    .parquet(BRONZE_ACCEPTED_PATH)

print(f"Accepted loans saved to: {BRONZE_ACCEPTED_PATH}")

Accepted loans saved to: ../data/medallion/bronze/accepted_loans

In [19]:
# Save rejected loans to Bronze layer
BRONZE_REJECTED_PATH = os.path.join(BRONZE_PATH, "rejected_loans")

rejected_bronze.write \
    .mode("overwrite") \
    .parquet(BRONZE_REJECTED_PATH)

print(f"Rejected loans saved to: {BRONZE_REJECTED_PATH}")

Rejected loans saved to: ../data/medallion/bronze/rejected_loans

In [20]:
# Verify the saved files
!ls -lh {BRONZE_PATH}

total 8.0K
drwxr-xr-x 2 ubuntu ubuntu 4.0K Nov 25 15:47 accepted_loans
drwxr-xr-x 2 ubuntu ubuntu 4.0K Nov 25 15:49 rejected_loans


In [21]:
# Check parquet file sizes (should be smaller than CSV due to compression)
!du -sh {BRONZE_ACCEPTED_PATH}
!du -sh {BRONZE_REJECTED_PATH}

377M	../data/medallion/bronze/accepted_loans
214M	../data/medallion/bronze/rejected_loans
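
To put the sizes in perspective, the rounded figures reported by `ls -lh` and `du -sh` can be turned into an approximate size ratio. The sketch below does only that arithmetic; `parse_size` and `compression_ratio` are hypothetical helpers, and the result is approximate because the inputs are rounded and the Parquet output also carries three extra metadata columns.

```python
UNIT_FACTORS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> float:
    """Convert a human-readable size such as '1.6G' or '377M' to bytes."""
    text = text.strip()
    unit = text[-1].upper()
    if unit in UNIT_FACTORS:
        return float(text[:-1]) * UNIT_FACTORS[unit]
    return float(text)  # no suffix: the value is already in bytes


def compression_ratio(raw: str, compressed: str) -> float:
    """Return how many times smaller the compressed size is than the raw size."""
    return parse_size(raw) / parse_size(compressed)


accepted_ratio = compression_ratio("1.6G", "377M")
rejected_ratio = compression_ratio("1.7G", "214M")
print(f"accepted: {accepted_ratio:.1f}x smaller")  # accepted: 4.3x smaller
print(f"rejected: {rejected_ratio:.1f}x smaller")  # rejected: 8.1x smaller
```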


## 7. Verification - Read Back from Bronze

In [22]:
# Verify we can read the data back
accepted_verify = spark.read.parquet(BRONZE_ACCEPTED_PATH)
rejected_verify = spark.read.parquet(BRONZE_REJECTED_PATH)

print(f"Accepted loans (verified): {accepted_verify.count()} rows")
print(f"Rejected loans (verified): {rejected_verify.count()} rows")

Accepted loans (verified): 2260701 rows
Rejected loans (verified): 27648741 rows
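
The read-back counts match the counts taken at ingestion, and each is exactly one less than the corresponding `wc -l` total, the difference being the header line. A minimal sketch of that reconciliation as an explicit check follows; `reconcile_row_count` is a hypothetical helper, and the comparison only holds when no record spans more than one physical line.

```python
def reconcile_row_count(
    csv_line_count: int, parquet_row_count: int, header_lines: int = 1
) -> bool:
    """Return True when the CSV line count minus the header equals the Parquet row count."""
    return csv_line_count - header_lines == parquet_row_count


# Line counts come from `wc -l`, row counts from the Parquet read-back above
checks = {
    "accepted": reconcile_row_count(2_260_702, 2_260_701),
    "rejected": reconcile_row_count(27_648_742, 27_648_741),
}
print(checks)
```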


In [23]:
# Show sample with metadata columns
accepted_verify.select(
    'loan_amnt', 'grade', 'loan_status', 
    '_ingestion_timestamp', '_source_file'
).show(5)

+---------+-----+-----------+--------------------+--------------------+
|loan_amnt|grade|loan_status|_ingestion_timestamp|        _source_file|
+---------+-----+-----------+--------------------+--------------------+
|   3600.0|    C| Fully Paid|2025-11-25 15:46:...|accepted_2007_to_...|
|  24700.0|    C| Fully Paid|2025-11-25 15:46:...|accepted_2007_to_...|
|  20000.0|    B| Fully Paid|2025-11-25 15:46:...|accepted_2007_to_...|
|  35000.0|    C|    Current|2025-11-25 15:46:...|accepted_2007_to_...|
|  10400.0|    F| Fully Paid|2025-11-25 15:46:...|accepted_2007_to_...|
+---------+-----+-----------+--------------------+--------------------+
only showing top 5 rows



## 8. Summary Statistics

In [24]:
# Generate summary for the report
print("=" * 60)
print("BRONZE LAYER INGESTION SUMMARY")
print("=" * 60)
print(f"\nData Source: Lending Club (2007-2018)")
print(f"\nAccepted Loans:")
print(f"  - Rows: {accepted_verify.count():,}")
print(f"  - Columns: {len(accepted_verify.columns)}")
print(f"  - Output: {BRONZE_ACCEPTED_PATH}")
print(f"\nRejected Loans:")
print(f"  - Rows: {rejected_verify.count():,}")
print(f"  - Columns: {len(rejected_verify.columns)}")
print(f"  - Output: {BRONZE_REJECTED_PATH}")
print(f"\nFormat: Parquet (columnar, compressed)")
print(f"Metadata added: _ingestion_timestamp, _source_file, _data_source")
print("=" * 60)

BRONZE LAYER INGESTION SUMMARY

Data Source: Lending Club (2007-2018)

Accepted Loans:
  - Rows: 2,260,701
  - Columns: 154
  - Output: ../data/medallion/bronze/accepted_loans

Rejected Loans:
  - Rows: 27,648,741
  - Columns: 12
  - Output: ../data/medallion/bronze/rejected_loans

Format: Parquet (columnar, compressed)
Metadata added: _ingestion_timestamp, _source_file, _data_source
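
The summary above is only printed, so it is lost once the notebook output is cleared. As a minimal sketch, the same figures could be written to a small JSON file next to the Bronze data; `write_manifest` and the file name `_bronze_manifest.json` are hypothetical and not part of the original pipeline.

```python
import json
import os
from typing import Any, Dict


def write_manifest(
    directory: str, summary: Dict[str, Any], filename: str = "_bronze_manifest.json"
) -> str:
    """Write the ingestion summary as JSON and return the path of the file."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as handle:
        # sort_keys keeps the file stable between runs, which makes diffs readable
        json.dump(summary, handle, indent=2, sort_keys=True)
    return path
```

A call such as `write_manifest(BRONZE_PATH, {"accepted_loans": {"rows": 2260701, "columns": 154}})` would record the figures from the summary.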


In [None]:
# Stop Spark session (optional - keep running if continuing to Silver)
spark.stop()

## Next Steps

The Bronze layer is complete. The data is now stored in Parquet format with minimal transformation.

**Continue to:** `02_silver_cleaning.ipynb` for data cleaning using MapReduce operations.