# Bronze Layer - Data Ingestion

## Lending Club Loan Data Pipeline

**Use Case:** Predict loan default risk and analyze factors affecting loan approval

This notebook handles the first layer of the Medallion Architecture:
- Load raw CSV data
- Minimal transformation (preserve raw state)
- Store in efficient Parquet format

**Dataset:** Lending Club Loan Data (2007-2018)
- `accepted_2007_to_2018Q4.csv` - Approved loans
- `rejected_2007_to_2018Q4.csv` - Rejected loan applications

## 1. Setup and Configuration

In [1]:
import time
import json
from collections import defaultdict
from functools import reduce
from typing import List, Tuple, Any
import builtins
import findspark
import os

findspark.init()

# Import PySpark; if it is missing, print an install hint instead of failing
try:
    from pyspark.sql import SparkSession
    from pyspark.sql.window import Window
    from pyspark.sql.functions import *
    import pyspark.sql.functions as F
    from pyspark.sql.types import *
    pyspark_available = True
except ImportError:
    print("PySpark not available. Install with: pip install pyspark")
    pyspark_available = False

print("Setup complete!")

Setup complete!


In [2]:
if pyspark_available:
    # Initialize Spark Session
    spark = SparkSession.builder \
        .appName("LendingClub-Bronze-Layer") \
        .master("spark://spark-master:7077") \
        .config("spark.sql.adaptive.enabled", "true") \
        .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
        .config("spark.executor.memory", "4g") \
        .config("spark.driver.memory", "4g") \
        .config("spark.executor.cores", "4") \
        .getOrCreate()


    # Set log level to reduce noise
    sc = spark.sparkContext
    sc.setLogLevel("ERROR")


    print(f"Spark Version: {spark.version}")
    print(f"Spark UI available at: {sc.uiWebUrl}")

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/11/27 13:23:40 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Spark Version: 3.5.0
Spark UI available at: http://spark-master:4040


In [3]:
# Define paths
RAW_DATA_PATH = "../data/lendingclub/"
BRONZE_PATH = "../data/medallion/bronze/"

# Create bronze directory if it doesn't exist
os.makedirs(BRONZE_PATH, exist_ok=True)

# Input files
ACCEPTED_LOANS_FILE = os.path.join(RAW_DATA_PATH, "accepted_2007_to_2018Q4.csv")
REJECTED_LOANS_FILE = os.path.join(RAW_DATA_PATH, "rejected_2007_to_2018Q4.csv")

print(f"Accepted loans file exists: {os.path.exists(ACCEPTED_LOANS_FILE)}")
print(f"Rejected loans file exists: {os.path.exists(REJECTED_LOANS_FILE)}")

Accepted loans file exists: True
Rejected loans file exists: True
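
The cell above only prints whether each input file exists, so later cells would still run if one were missing. A minimal sketch of a fail-fast check follows. `require_files` is a hypothetical helper, not part of the original notebook, and it only checks the driver's file system; with a `spark://` master the executors must be able to read the same path as well.

```python
import os
from typing import Iterable, List


def require_files(paths: Iterable[str]) -> List[str]:
    """Return the paths unchanged, or raise if any of them is missing.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    paths = list(paths)
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"Missing input files: {missing}")
    return paths
```

In this notebook it could be called as `require_files([ACCEPTED_LOANS_FILE, REJECTED_LOANS_FILE])` before any RDD is created.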


## 2. Data Exploration (Quick Look at Raw Files)

In [4]:
# Quick peek at the raw files using shell commands
# This helps understand the structure before loading into Spark
!head -5 {ACCEPTED_LOANS_FILE}

id,member_id,loan_amnt,funded_amnt,funded_amnt_inv,term,int_rate,installment,grade,sub_grade,emp_title,emp_length,home_ownership,annual_inc,verification_status,issue_d,loan_status,pymnt_plan,url,desc,purpose,title,zip_code,addr_state,dti,delinq_2yrs,earliest_cr_line,fico_range_low,fico_range_high,inq_last_6mths,mths_since_last_delinq,mths_since_last_record,open_acc,pub_rec,revol_bal,revol_util,total_acc,initial_list_status,out_prncp,out_prncp_inv,total_pymnt,total_pymnt_inv,total_rec_prncp,total_rec_int,total_rec_late_fee,recoveries,collection_recovery_fee,last_pymnt_d,last_pymnt_amnt,next_pymnt_d,last_credit_pull_d,last_fico_range_high,last_fico_range_low,collections_12_mths_ex_med,mths_since_last_major_derog,policy_code,application_type,annual_inc_joint,dti_joint,verification_status_joint,acc_now_delinq,tot_coll_amt,tot_cur_bal,open_acc_6m,open_act_il,open_il_12m,open_il_24m,mths_since_rcnt_il,total_bal_il,il_util,open_rv_12m,open_rv_24m,max_bal_bc,all_util,total_rev_hi_lim,inq_fi,to

In [5]:
# Check file sizes
!ls -lh {RAW_DATA_PATH}

total 3.3G
-rw-r--r-- 1 ubuntu ubuntu 1.6G Nov 26 13:25 accepted_2007_to_2018Q4.csv
-rw-r--r-- 1 ubuntu ubuntu 1.7G Nov 26 13:28 rejected_2007_to_2018Q4.csv


In [6]:
# Count lines in files (to know what we're dealing with)
!wc -l {ACCEPTED_LOANS_FILE}
!wc -l {REJECTED_LOANS_FILE}

2260702 ../data/lendingclub/accepted_2007_to_2018Q4.csv
27648742 ../data/lendingclub/rejected_2007_to_2018Q4.csv


## 3. Ingest Accepted Loans Data (Using RDDs)

**Note:** The project requires RDDs for the Bronze layer.
Using them here demonstrates low-level Spark operations and MapReduce concepts.

In [7]:
# RDD-based ingestion for Bronze layer
# Read raw CSV file as text lines using RDD

if pyspark_available:
    print("=== Loading Accepted Loans with RDD ===")
    
    # Load file as RDD of text lines
    raw_rdd = spark.sparkContext.textFile(ACCEPTED_LOANS_FILE)
    
    # Extract header (first line)
    header = raw_rdd.first()
    header_cols = header.split(",")
    print(f"Number of columns: {len(header_cols)}")
    print(f"First 5 columns: {header_cols[:5]}")
    
    # Filter out header row
    data_rdd = raw_rdd.filter(lambda row: row != header)
    
    print(f"Raw data rows (excluding header): {data_rdd.count()}")

=== Loading Accepted Loans with RDD ===
Number of columns: 151
First 5 columns: ['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv']
Raw data rows (excluding header): 2260701

In [8]:
# Define parsing function for Bronze layer
# This applies the map step of MapReduce: each CSV line becomes a dictionary

def parse_accepted_loan(line):
    """
    Parse a CSV line into a dictionary with metadata.
    Bronze layer: minimal transformation, preserve raw data.
    """
    try:
        # Naive CSV parsing: split(",") also splits on commas inside quoted fields,
        # so such rows end up with shifted columns that a later layer must detect
        values = line.split(",")
        
        # Create record as dictionary
        record = {}
        for i, col_name in enumerate(header_cols):
            record[col_name] = values[i] if i < len(values) else None
        
        # Add Bronze layer metadata
        record['_ingestion_timestamp'] = time.time()
        record['_source_file'] = 'accepted_2007_to_2018Q4.csv'
        record['_data_source'] = 'lending_club'
        record['_status'] = 'valid'
        
        return record
        
    except Exception as e:
        # Error handling: preserve raw data for debugging
        return {
            '_raw_data': line,
            '_ingestion_timestamp': time.time(),
            '_source_file': 'accepted_2007_to_2018Q4.csv',
            '_data_source': 'lending_club',
            '_status': 'parse_error',
            '_error_message': str(e)
        }

# Apply parsing using map() - this is the MapReduce pattern
accepted_bronze_rdd = data_rdd.map(parse_accepted_loan)

print(f"Bronze RDD created: {accepted_bronze_rdd.count()} records")

# Show sample record
print("\nSample record:")
sample = accepted_bronze_rdd.take(1)[0]
print(f"Keys: {list(sample.keys())[:10]}...")  # Show first 10 keys

                                                                                

Bronze RDD created: 2260701 records

Sample record:
Keys: ['id', 'member_id', 'loan_amnt', 'funded_amnt', 'funded_amnt_inv', 'term', 'int_rate', 'installment', 'grade', 'sub_grade']...
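
Because `line.split(",")` cannot raise on a string and the column loop guards against short rows, the `parse_error` branch above is effectively never taken; rows with quoted commas are silently misaligned instead. A minimal sketch of a quote-aware alternative using the standard `csv` module is shown below. `parse_csv_line` and the sample line are hypothetical, and the approach still assumes one record per text line (no newlines inside quoted fields).

```python
import csv
from typing import Dict, List, Optional


def parse_csv_line(line: str, columns: List[str]) -> Dict[str, Optional[str]]:
    """Parse one CSV line with the stdlib csv module, honouring quoted fields.

    Hypothetical alternative to line.split(","); fields missing at the end
    of a short row are filled with None, as in the original parser.
    """
    values = next(csv.reader([line]), [])
    return {
        name: (values[i] if i < len(values) else None)
        for i, name in enumerate(columns)
    }


columns = ["id", "emp_title", "loan_amnt"]
line = '1,"Manager, Sales",3600.0'

naive = line.split(",")
parsed = parse_csv_line(line, columns)
print(len(naive))           # 4 fields: the quoted comma split emp_title in two
print(parsed["emp_title"])  # Manager, Sales
print(parsed["loan_amnt"])  # 3600.0
```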


In [9]:
# Data quality check: count valid vs error records
valid_count = accepted_bronze_rdd.filter(lambda r: r.get('_status') == 'valid').count()
error_count = accepted_bronze_rdd.filter(lambda r: r.get('_status') == 'parse_error').count()

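total_count = valid_count + error_count
print("\nData Quality Check:")
print(f"  Valid records: {valid_count:,}")
print(f"  Parse errors: {error_count:,}")
# Reuse the two counts instead of triggering a third full pass over the RDD
print(f"  Error rate: {error_count / total_count * 100:.2f}%")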

                                                                                


Data Quality Check:
  Valid records: 2,260,701
  Parse errors: 0
  Error rate: 0.00%

In [10]:
# Show sample records using RDD operations
print("Sample valid records (first 3):")
for i, record in enumerate(accepted_bronze_rdd.filter(lambda r: r.get('_status') == 'valid').take(3), 1):
    print(f"\nRecord {i}:")
    # Show subset of fields for readability
    sample_fields = ['loan_amnt', 'term', 'int_rate', 'grade', 'loan_status', '_ingestion_timestamp', '_status']
    for field in sample_fields:
        if field in record:
            print(f"  {field}: {record[field]}")

Sample valid records (first 3):

Record 1:
  loan_amnt: 3600.0
  term:  36 months
  int_rate: 13.99
  grade: C
  loan_status: Fully Paid
  _ingestion_timestamp: 1764249851.0851123
  _status: valid

Record 2:
  loan_amnt: 24700.0
  term:  36 months
  int_rate: 11.99
  grade: C
  loan_status: Fully Paid
  _ingestion_timestamp: 1764249851.0851736
  _status: valid

Record 3:
  loan_amnt: 20000.0
  term:  60 months
  int_rate: 10.78
  grade: B
  loan_status: Fully Paid
  _ingestion_timestamp: 1764249851.085191
  _status: valid
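
The timestamps above differ from record to record, and the values read back from Parquet later in the notebook differ again, because `time.time()` runs inside the map function every time the uncached RDD is evaluated. If a single timestamp per ingestion run is wanted, one option is to capture it once on the driver and close over it. The sketch below assumes a hypothetical `make_parser` factory and reuses the notebook's naive split.

```python
import time
from typing import Callable, Dict, List


def make_parser(
    columns: List[str], source_file: str, ingested_at: float
) -> Callable[[str], Dict[str, object]]:
    """Build a line parser that stamps every record with the same timestamp.

    Hypothetical factory for illustration; `ingested_at` is captured once by
    the caller instead of calling time.time() per record.
    """

    def parse(line: str) -> Dict[str, object]:
        values = line.split(",")  # same naive split as the original parser
        record: Dict[str, object] = {
            name: (values[i] if i < len(values) else None)
            for i, name in enumerate(columns)
        }
        record["_ingestion_timestamp"] = ingested_at
        record["_source_file"] = source_file
        return record

    return parse


batch_ts = time.time()  # one value for the whole ingestion run
parse = make_parser(["loan_amnt", "term"], "accepted_2007_to_2018Q4.csv", batch_ts)
records = [parse(line) for line in ["3600.0, 36 months", "24700.0, 36 months"]]
print({r["_ingestion_timestamp"] for r in records} == {batch_ts})  # True
```

With Spark, the returned function could be passed to `data_rdd.map(...)` in place of `parse_accepted_loan`.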


## 4. Ingest Rejected Loans Data (Using RDDs)

In [11]:
# RDD-based ingestion for rejected loans
if pyspark_available:
    print("=== Loading Rejected Loans with RDD ===")
    
    # Load file as RDD
    rejected_raw_rdd = spark.sparkContext.textFile(REJECTED_LOANS_FILE)
    
    # Extract header
    rejected_header = rejected_raw_rdd.first()
    rejected_header_cols = rejected_header.split(",")
    print(f"Number of columns: {len(rejected_header_cols)}")
    print(f"Columns: {rejected_header_cols}")
    
    # Filter out header
    rejected_data_rdd = rejected_raw_rdd.filter(lambda row: row != rejected_header)
    print(f"Raw data rows: {rejected_data_rdd.count()}")

=== Loading Rejected Loans with RDD ===
Number of columns: 9
Columns: ['Amount Requested', 'Application Date', 'Loan Title', 'Risk_Score', 'Debt-To-Income Ratio', 'Zip Code', 'State', 'Employment Length', 'Policy Code']
Raw data rows: 27648741

In [12]:
# Define parsing function for rejected loans
def parse_rejected_loan(line):
    """Parse rejected loan CSV line into dictionary."""
    try:
        values = line.split(",")
        
        record = {}
        for i, col_name in enumerate(rejected_header_cols):
            record[col_name] = values[i] if i < len(values) else None
        
        # Add metadata
        record['_ingestion_timestamp'] = time.time()
        record['_source_file'] = 'rejected_2007_to_2018Q4.csv'
        record['_data_source'] = 'lending_club'
        record['_status'] = 'valid'
        
        return record
        
    except Exception as e:
        return {
            '_raw_data': line,
            '_ingestion_timestamp': time.time(),
            '_source_file': 'rejected_2007_to_2018Q4.csv',
            '_data_source': 'lending_club',
            '_status': 'parse_error',
            '_error_message': str(e)
        }

# Apply parsing using map()
rejected_bronze_rdd = rejected_data_rdd.map(parse_rejected_loan)

print(f"Bronze RDD created: {rejected_bronze_rdd.count()} records")

Bronze RDD created: 27648741 records

In [13]:
# Data quality check for rejected loans
rejected_valid = rejected_bronze_rdd.filter(lambda r: r.get('_status') == 'valid').count()
rejected_errors = rejected_bronze_rdd.filter(lambda r: r.get('_status') == 'parse_error').count()

print("\nRejected Loans Quality Check:")
print(f"  Valid records: {rejected_valid:,}")
print(f"  Parse errors: {rejected_errors:,}")


Rejected Loans Quality Check:
  Valid records: 27,648,741
  Parse errors: 0

## 5. Data Quality Checks (Bronze Level - Using RDD Operations)

In [14]:
# MapReduce pattern: check for null/empty values in key columns using RDDs

# Quality checks for ACCEPTED loans
key_columns_accepted = ['loan_amnt', 'term', 'int_rate', 'grade', 'loan_status']

print("=== Accepted Loans - Bronze Quality Report ===")
print(f"Total records: {accepted_bronze_rdd.count():,}")
print(f"Partitions: {accepted_bronze_rdd.getNumPartitions()}")

print("\nNull/Empty counts in key columns:")
for col_name in key_columns_accepted:
    # Use filter and count - MapReduce pattern
    null_count = accepted_bronze_rdd.filter(
        lambda r: r.get('_status') == 'valid' and (r.get(col_name) is None or r.get(col_name) == '')
    ).count()
    print(f"  {col_name}: {null_count:,}")

=== Accepted Loans - Bronze Quality Report ===
Total records: 2,260,701
Partitions: 50

Null/Empty counts in key columns:
  loan_amnt: 33
  term: 33
  int_rate: 33
  grade: 33
  loan_status: 33
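
The loop above runs one `filter(...).count()` job per column, so the data is scanned once for each of the five columns. The same counts can be gathered in a single pass by mapping each record to a small per-column counter and merging counters pairwise. The sketch below shows the idea in plain Python with `functools.reduce`; `null_flags` and `merge_counts` are hypothetical names, the `_status` filter is omitted for brevity, and on an RDD the two functions could be passed to `map` and `reduce` respectively.

```python
from functools import reduce
from typing import Dict, Optional

KEY_COLUMNS = ["loan_amnt", "term", "int_rate", "grade", "loan_status"]


def null_flags(record: Dict[str, Optional[str]]) -> Dict[str, int]:
    """Map step: 1 for each key column that is None or empty, else 0."""
    return {c: int(record.get(c) in (None, "")) for c in KEY_COLUMNS}


def merge_counts(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Reduce step: add two per-column counters into a new dict."""
    return {c: a[c] + b[c] for c in KEY_COLUMNS}


# Tiny in-memory stand-in for the parsed records
records = [
    {"loan_amnt": "3600.0", "term": " 36 months", "int_rate": "13.99",
     "grade": "C", "loan_status": "Fully Paid"},
    {"loan_amnt": "", "term": None, "int_rate": "11.99",
     "grade": "C", "loan_status": "Current"},
    {"loan_amnt": "Total amount funded"},  # short row: other columns absent
]

null_counts = reduce(merge_counts, map(null_flags, records))
print(null_counts)
```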


                                                                                

In [15]:
# Quality checks for REJECTED loans
# Rejected loans have different columns than accepted loans
key_columns_rejected = ['Amount Requested', 'Application Date', 'Loan Title', 'Risk_Score', 'Debt-To-Income Ratio']

print("\n=== Rejected Loans - Bronze Quality Report ===")
print(f"Total records: {rejected_bronze_rdd.count():,}")
print(f"Partitions: {rejected_bronze_rdd.getNumPartitions()}")

print("\nNull/Empty counts in key columns:")
for col_name in key_columns_rejected:
    # Use filter and count - MapReduce pattern
    null_count = rejected_bronze_rdd.filter(
        lambda r: r.get('_status') == 'valid' and (r.get(col_name) is None or r.get(col_name) == '')
    ).count()
    print(f"  {col_name}: {null_count:,}")

# Additional check: sample rejected loan records
print("\nSample rejected loan records (first 2):")
for i, record in enumerate(rejected_bronze_rdd.filter(lambda r: r.get('_status') == 'valid').take(2), 1):
    print(f"\nRejected Loan {i}:")
    for field in key_columns_rejected:
        if field in record:
            print(f"  {field}: {record[field]}")


=== Rejected Loans - Bronze Quality Report ===
Total records: 27,648,741
Partitions: 54

Null/Empty counts in key columns:
  Amount Requested: 0
  Application Date: 0
  Loan Title: 1,303
  Risk_Score: 18,497,546
  Debt-To-Income Ratio: 74

Sample rejected loan records (first 2):

Rejected Loan 1:
  Amount Requested: 1000.0
  Application Date: 2007-05-26
  Loan Title: Wedding Covered but No Honeymoon
  Risk_Score: 693.0
  Debt-To-Income Ratio: 10%

Rejected Loan 2:
  Amount Requested: 1000.0
  Application Date: 2007-05-26
  Loan Title: Consolidating Debt
  Risk_Score: 703.0
  Debt-To-Income Ratio: 10%
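
Raw counts are hard to judge against 27.6 million records, so it can help to express them as rates. The sketch below recomputes percentages from the counts reported above; `null_rate` is a hypothetical helper. It shows, for example, that roughly two thirds of the `Risk_Score` values are empty.

```python
from typing import Dict

TOTAL_REJECTED = 27_648_741  # total records reported above

# Null/empty counts reported above for the rejected-loans key columns
null_counts: Dict[str, int] = {
    "Amount Requested": 0,
    "Application Date": 0,
    "Loan Title": 1_303,
    "Risk_Score": 18_497_546,
    "Debt-To-Income Ratio": 74,
}


def null_rate(count: int, total: int) -> float:
    """Percentage of records in which the column is None or empty."""
    return 100.0 * count / total if total else 0.0


for column, count in null_counts.items():
    print(f"{column}: {null_rate(count, TOTAL_REJECTED):.2f}%")
```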


## 6. Save to Bronze Layer (Parquet Format)

**Note:** We convert RDD to DataFrame only for efficient storage in Parquet format.
This is acceptable as it's just for persistence, not for processing logic.
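
Because the parsed records are plain dictionaries, the DataFrame schema is inferred from the data. Records from the error branch, which carry `_raw_data` and `_error_message` instead of the loan columns, may then not line up with the columns of valid records. One way to make the layout explicit is to project every record onto a single fixed, ordered column list before conversion. The sketch below does this in plain Python, assuming an all-string Bronze schema; `bronze_columns` and `to_row` are hypothetical helpers.

```python
from typing import Dict, List, Optional, Tuple

METADATA_COLUMNS = [
    "_ingestion_timestamp", "_source_file", "_data_source", "_status",
    "_raw_data", "_error_message",
]


def bronze_columns(header_cols: List[str]) -> List[str]:
    """Source columns followed by every metadata column the parser can emit."""
    return list(header_cols) + METADATA_COLUMNS


def to_row(record: Dict[str, object], columns: List[str]) -> Tuple[Optional[str], ...]:
    """Project a record onto `columns`; absent keys become None, values become str."""
    return tuple(
        None if record.get(c) is None else str(record[c]) for c in columns
    )


columns = bronze_columns(["loan_amnt", "grade"])
valid = {"loan_amnt": "3600.0", "grade": "C", "_status": "valid"}
error = {"_raw_data": "garbled,line", "_status": "parse_error", "_error_message": "boom"}

print(to_row(valid, columns))
print(to_row(error, columns))
```

The resulting tuples could then be passed to `spark.createDataFrame(rdd, schema)` together with a `StructType` of nullable `StringType` fields built from the same column list.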

In [None]:
import shutil

# Helper function to clean directory before saving.
# mode("overwrite") below also replaces existing output, so this is an extra safeguard.
def clean_output_directory(path):
    """Remove an existing output directory so no stale part files remain."""
    if os.path.exists(path):
        print(f"Removing existing directory: {path}")
        shutil.rmtree(path)
        print("Directory cleaned.")

# Save accepted loans to Bronze layer
BRONZE_ACCEPTED_PATH = os.path.join(BRONZE_PATH, "accepted_loans")

print("=== Saving Accepted Loans to Bronze ===")
clean_output_directory(BRONZE_ACCEPTED_PATH)

# Convert RDD to DataFrame for Parquet storage
accepted_bronze_df = spark.createDataFrame(accepted_bronze_rdd)

accepted_bronze_df.write \
    .mode("overwrite") \
    .parquet(BRONZE_ACCEPTED_PATH)

print(f"Accepted loans saved to: {BRONZE_ACCEPTED_PATH}")

=== Saving Accepted Loans to Bronze ===
Removing existing directory: ../data/medallion/bronze/accepted_loans
Directory cleaned.
Accepted loans saved to: ../data/medallion/bronze/accepted_loans


In [None]:
# Save rejected loans to Bronze layer
BRONZE_REJECTED_PATH = os.path.join(BRONZE_PATH, "rejected_loans")

print("\n=== Saving Rejected Loans to Bronze ===")
clean_output_directory(BRONZE_REJECTED_PATH)

# Convert RDD to DataFrame for Parquet storage
rejected_bronze_df = spark.createDataFrame(rejected_bronze_rdd)

rejected_bronze_df.write \
    .mode("overwrite") \
    .parquet(BRONZE_REJECTED_PATH)

print(f"Rejected loans saved to: {BRONZE_REJECTED_PATH}")


=== Saving Rejected Loans to Bronze ===
Removing existing directory: ../data/medallion/bronze/rejected_loans
Directory cleaned.
Rejected loans saved to: ../data/medallion/bronze/rejected_loans

In [18]:
# Verify the saved files
!ls -lh {BRONZE_PATH}

total 28K
drwxr-xr-x 2 ubuntu ubuntu 12K Nov 27 13:25 accepted_loans
drwxr-xr-x 2 ubuntu ubuntu 16K Nov 27 13:25 rejected_loans


In [19]:
# Check parquet file sizes (should be smaller than CSV due to compression)
!du -sh {BRONZE_ACCEPTED_PATH}
!du -sh {BRONZE_REJECTED_PATH}

421M	../data/medallion/bronze/accepted_loans
295M	../data/medallion/bronze/rejected_loans


## 7. Verification - Read Back from Bronze

In [20]:
# Verify we can read the data back
accepted_verify = spark.read.parquet(BRONZE_ACCEPTED_PATH)
rejected_verify = spark.read.parquet(BRONZE_REJECTED_PATH)

print(f"Accepted loans (verified): {accepted_verify.count()} rows")
print(f"Rejected loans (verified): {rejected_verify.count()} rows")

Accepted loans (verified): 2260701 rows
Rejected loans (verified): 27648741 rows
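
The read-back counts can be reconciled against the `wc -l` line counts from Section 2: each file has one header line, so the expected row count is the line count minus one. The sketch below assumes one record per text line; `expected_rows` and `reconcile` are hypothetical helpers. A match only shows that no lines were dropped, not that the fields were parsed correctly.

```python
def expected_rows(line_count: int, header_lines: int = 1) -> int:
    """Data rows implied by a `wc -l` line count, assuming one record per line."""
    return line_count - header_lines


def reconcile(name: str, line_count: int, bronze_rows: int) -> None:
    """Raise if the Bronze row count does not match the source line count."""
    expected = expected_rows(line_count)
    if bronze_rows != expected:
        raise ValueError(f"{name}: expected {expected:,} rows, found {bronze_rows:,}")
    print(f"{name}: OK ({bronze_rows:,} rows)")


# Line counts from `wc -l` and row counts from the read-back above
reconcile("accepted_loans", 2_260_702, 2_260_701)
reconcile("rejected_loans", 27_648_742, 27_648_741)
```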


In [21]:
# Show sample with metadata columns
accepted_verify.select(
    'loan_amnt', 'grade', 'loan_status', 
    '_ingestion_timestamp', '_source_file'
).show(5)

+---------+-----+-----------+--------------------+--------------------+
|loan_amnt|grade|loan_status|_ingestion_timestamp|        _source_file|
+---------+-----+-----------+--------------------+--------------------+
|  11200.0|    C|    Current|1.7642499284482965E9|accepted_2007_to_...|
|  12000.0|    C|    Current|1.7642499284483583E9|accepted_2007_to_...|
|  11000.0|    C| Fully Paid|1.7642499284483843E9|accepted_2007_to_...|
|  15000.0|    B|    Current|1.7642499284484448E9|accepted_2007_to_...|
|   7000.0|    D|    Current|1.7642499284484687E9|accepted_2007_to_...|
+---------+-----+-----------+--------------------+--------------------+
only showing top 5 rows



## 8. Summary Statistics

In [22]:
# Generate summary for the report
print("=" * 60)
print("BRONZE LAYER INGESTION SUMMARY")
print("=" * 60)
print(f"\nData Source: Lending Club (2007-2018)")
print(f"\nAccepted Loans:")
print(f"  - Rows: {accepted_verify.count():,}")
print(f"  - Columns: {len(accepted_verify.columns)}")
print(f"  - Output: {BRONZE_ACCEPTED_PATH}")
print(f"\nRejected Loans:")
print(f"  - Rows: {rejected_verify.count():,}")
print(f"  - Columns: {len(rejected_verify.columns)}")
print(f"  - Output: {BRONZE_REJECTED_PATH}")
print(f"\nFormat: Parquet (columnar, compressed)")
print("Metadata added: _ingestion_timestamp, _source_file, _data_source, _status")
print("=" * 60)

BRONZE LAYER INGESTION SUMMARY

Data Source: Lending Club (2007-2018)

Accepted Loans:
  - Rows: 2,260,701
  - Columns: 155
  - Output: ../data/medallion/bronze/accepted_loans

Rejected Loans:
  - Rows: 27,648,741
  - Columns: 13
  - Output: ../data/medallion/bronze/rejected_loans

Format: Parquet (columnar, compressed)
Metadata added: _ingestion_timestamp, _source_file, _data_source, _status


In [23]:
# Stop Spark session (optional - keep running if continuing to Silver)
spark.stop()

## Next Steps

The Bronze layer is complete. The data is now stored in Parquet format with minimal transformation.

**Continue to:** `02_silver_cleaning.ipynb` for data cleaning using MapReduce operations.