# Gold Layer - Data Serving with SQL & MLlib

## Lending Club Loan Data Pipeline

**Use Case:** Predict loan default risk and analyze factors affecting loan approval

This notebook implements the Gold (Serving) layer of the Medallion Architecture:
- Business analytics using **Spark SQL**
- Machine Learning using **MLlib** (Loan Default Prediction)
- Aggregated tables ready for dashboards and applications

**Note:** Unlike the Silver layer, this notebook uses high-level APIs (DataFrames, SQL, MLlib) as permitted by the project requirements.

## 1. Setup and Configuration

In [47]:
import os
import time
import numpy as np
from datetime import datetime

import findspark
findspark.init()

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import *
from pyspark.sql.window import Window

# ML imports
from pyspark.ml import Pipeline
from pyspark.ml.feature import (
    VectorAssembler, 
    StringIndexer, 
    OneHotEncoder,
    StandardScaler,
    Imputer
)
from pyspark.ml.classification import (
    LogisticRegression, 
    RandomForestClassifier,
    GBTClassifier
)
from pyspark.ml.evaluation import (
    BinaryClassificationEvaluator,
    MulticlassClassificationEvaluator
)
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

print("Imports complete!")

Imports complete!


In [48]:
# Initialize Spark Session
spark = SparkSession.builder \
    .appName("LendingClub-Gold-Layer") \
    .master("local[*]") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.shuffle.partitions", "8") \
    .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

print(f"Spark Version: {spark.version}")

Spark Version: 3.5.0


In [49]:
# Define paths
SILVER_PATH = "../data/medallion/silver/"
GOLD_PATH = "../data/medallion/gold/"

SILVER_ACCEPTED_PATH = os.path.join(SILVER_PATH, "accepted_loans")
SILVER_REJECTED_PATH = os.path.join(SILVER_PATH, "rejected_loans")

# Create gold directory
os.makedirs(GOLD_PATH, exist_ok=True)

print(f"Silver input path: {SILVER_PATH}")
print(f"Gold output path: {GOLD_PATH}")

Silver input path: ../data/medallion/silver/
Gold output path: ../data/medallion/gold/


## 2. Load Silver Data

In [50]:
# Load cleaned data from Silver layer
loans_df = spark.read.parquet(SILVER_ACCEPTED_PATH)
rejected_df = spark.read.parquet(SILVER_REJECTED_PATH)

print(f"Accepted loans: {loans_df.count():,} rows")
print(f"Rejected loans: {rejected_df.count():,} rows")
print(f"\nAccepted loans columns: {len(loans_df.columns)}")

Accepted loans: 2,258,994 rows
Rejected loans: 27,647,453 rows

Accepted loans columns: 30


In [51]:
# Show schema
loans_df.printSchema()

root
 |-- loan_amnt: float (nullable = true)
 |-- term: integer (nullable = true)
 |-- int_rate: float (nullable = true)
 |-- installment: float (nullable = true)
 |-- grade: string (nullable = true)
 |-- sub_grade: string (nullable = true)
 |-- emp_length: integer (nullable = true)
 |-- home_ownership: string (nullable = true)
 |-- annual_inc: float (nullable = true)
 |-- verification_status: string (nullable = true)
 |-- purpose: string (nullable = true)
 |-- loan_status: string (nullable = true)
 |-- loan_status_binary: integer (nullable = true)
 |-- issue_d: string (nullable = true)
 |-- dti: float (nullable = true)
 |-- earliest_cr_line: string (nullable = true)
 |-- open_acc: float (nullable = true)
 |-- pub_rec: float (nullable = true)
 |-- revol_bal: float (nullable = true)
 |-- revol_util: float (nullable = true)
 |-- total_acc: float (nullable = true)
 |-- fico_range_low: float (nullable = true)
 |-- fico_range_high: float (nullable = true)
 |-- addr_state: string (nullable =

In [52]:
# Preview data
loans_df.select(
    'loan_amnt', 'int_rate', 'grade', 'annual_inc', 
    'loan_status', 'loan_status_binary', 'fico_avg'
).show(10)

+---------+--------+-----+----------+-----------+------------------+--------+
|loan_amnt|int_rate|grade|annual_inc|loan_status|loan_status_binary|fico_avg|
+---------+--------+-----+----------+-----------+------------------+--------+
|   3600.0|   13.99|    C|   55000.0| Fully Paid|                 0|   677.0|
|  24700.0|   11.99|    C|   65000.0| Fully Paid|                 0|   717.0|
|  20000.0|   10.78|    B|   63000.0| Fully Paid|                 0|   697.0|
|  35000.0|   14.85|    C|  110000.0|    Current|                 0|   787.0|
|  10400.0|   22.45|    F|  104433.0| Fully Paid|                 0|   697.0|
|  11950.0|   13.44|    C|   34000.0| Fully Paid|                 0|   692.0|
|  20000.0|    9.17|    B|  180000.0| Fully Paid|                 0|   682.0|
|  20000.0|    8.49|    B|   85000.0| Fully Paid|                 0|   707.0|
|  10000.0|    6.49|    A|   85000.0| Fully Paid|                 0|   687.0|
|   8000.0|   11.48|    B|   42000.0| Fully Paid|               

In [53]:
# Register as SQL temp view for queries
loans_df.createOrReplaceTempView("loans")
rejected_df.createOrReplaceTempView("rejected")

print("Temp views created: 'loans', 'rejected'")

Temp views created: 'loans', 'rejected'


---
# Part A: Business Analytics with Spark SQL

This section demonstrates SQL capabilities for business intelligence and reporting.

## 3. Exploratory Analytics

In [54]:
# Basic statistics
spark.sql("""
    SELECT 
        COUNT(*) as total_loans,
        ROUND(SUM(loan_amnt), 2) as total_funded,
        ROUND(AVG(loan_amnt), 2) as avg_loan_amount,
        ROUND(AVG(int_rate), 2) as avg_interest_rate,
        ROUND(AVG(annual_inc), 2) as avg_annual_income,
        ROUND(AVG(fico_avg), 0) as avg_fico_score
    FROM loans
""").show()

+-----------+-------------+---------------+-----------------+-----------------+--------------+
|total_loans| total_funded|avg_loan_amount|avg_interest_rate|avg_annual_income|avg_fico_score|
+-----------+-------------+---------------+-----------------+-----------------+--------------+
|    2258994|3.39840788E10|        15043.9|            13.09|         77969.52|         701.0|
+-----------+-------------+---------------+-----------------+-----------------+--------------+
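The `total_funded` value above is printed in scientific notation, which is hard to read at a glance. The short sketch below restates it in plain Python using the two figures copied from the output; inside Spark, `F.format_number` could serve the same purpose. The variable names are illustrative only.

```python
# Values copied from the summary output above.
total_funded = 3.39840788e10
total_loans = 2258994

# Thousands separators make the magnitude obvious.
print(f"Total funded: ${total_funded:,.0f}")
print(f"In billions:  ${total_funded / 1e9:.2f}B")

# Cross-check: total / count should agree with avg_loan_amount (15043.9).
print(f"Average loan: ${total_funded / total_loans:,.2f}")
```

The cross-check is a cheap way to confirm that the aggregate columns are mutually consistent before they are published to a dashboard.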



In [55]:
# Loan status distribution
spark.sql("""
    SELECT 
        loan_status,
        loan_status_binary,
        COUNT(*) as count,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) as percentage
    FROM loans
    GROUP BY loan_status, loan_status_binary
    ORDER BY count DESC
""").show(truncate=False)

+---------------------------------------------------+------------------+-------+----------+
|loan_status                                        |loan_status_binary|count  |percentage|
+---------------------------------------------------+------------------+-------+----------+
|Fully Paid                                         |0                 |1076457|47.65     |
|Current                                            |0                 |877044 |38.82     |
|Charged Off                                        |1                 |268491 |11.89     |
|Late (31-120 days)                                 |1                 |21444  |0.95      |
|In Grace Period                                    |0                 |8427   |0.37      |
|Late (16-30 days)                                  |1                 |4346   |0.19      |
|Does not meet the credit policy. Status:Fully Paid |0                 |1984   |0.09      |
|Does not meet the credit policy. Status:Charged Off|1                 |761    |

## 4. Default Rate Analysis by Grade

In [56]:
# Default rate by grade - KEY BUSINESS METRIC
default_by_grade = spark.sql("""
    SELECT 
        grade,
        COUNT(*) as total_loans,
        SUM(loan_status_binary) as defaults,
        ROUND(SUM(loan_status_binary) * 100.0 / COUNT(*), 2) as default_rate,
        ROUND(AVG(int_rate), 2) as avg_interest_rate,
        ROUND(AVG(loan_amnt), 2) as avg_loan_amount,
        ROUND(SUM(loan_amnt), 2) as total_funded
    FROM loans
    GROUP BY grade
    ORDER BY grade
""")

default_by_grade.show()

+-----+-----------+--------+------------+-----------------+---------------+-------------+
|grade|total_loans|defaults|default_rate|avg_interest_rate|avg_loan_amount| total_funded|
+-----+-----------+--------+------------+-----------------+---------------+-------------+
|    A|     432747|   15876|        3.67|             7.08|       14599.73| 6.31798985E9|
|    B|     663130|   58429|        8.81|            10.68|       14170.49| 9.39687965E9|
|    C|     649552|   94795|       14.59|            14.14|       15034.86|   9.765923E9|
|    D|     324099|   66970|       20.66|            18.14|       15709.13|  5.0913131E9|
|    E|     135542|   38760|       28.60|            21.83|       17452.09|2.365490575E9|
|    F|      41773|   15349|       36.74|            25.45|       19123.56| 7.98848475E8|
|    G|      12151|    4903|       40.35|            28.07|       20379.73|  2.4763415E8|
+-----+-----------+--------+------------+-----------------+---------------+-------------+



In [57]:
# Save as Gold table
# GOLD_PATH already ends with a slash, so join the path instead of concatenating
out_path = os.path.join(GOLD_PATH, "default_rate_by_grade")
default_by_grade.write.mode("overwrite").parquet(out_path)
print(f"Saved: {out_path}")

Saved: ../data/medallion/gold/default_rate_by_grade


## 5. Default Rate Analysis by Sub-Grade

In [58]:
# Default rate by sub-grade - more granular view
default_by_subgrade = spark.sql("""
    SELECT 
        grade,
        sub_grade,
        COUNT(*) as total_loans,
        SUM(loan_status_binary) as defaults,
        ROUND(SUM(loan_status_binary) * 100.0 / COUNT(*), 2) as default_rate,
        ROUND(AVG(int_rate), 2) as avg_interest_rate,
        ROUND(AVG(fico_avg), 0) as avg_fico
    FROM loans
    GROUP BY grade, sub_grade
    ORDER BY grade, sub_grade
""")

default_by_subgrade.show(25)

+-----+---------+-----------+--------+------------+-----------------+--------+
|grade|sub_grade|total_loans|defaults|default_rate|avg_interest_rate|avg_fico|
+-----+---------+-----------+--------+------------+-----------------+--------+
|    A|       A1|      86720|    1596|        1.84|              5.6|   748.0|
|    A|       A2|      69523|    1977|        2.84|             6.55|   741.0|
|    A|       A3|      73130|    2394|        3.27|             7.09|   730.0|
|    A|       A4|      95809|    4017|        4.19|             7.56|   725.0|
|    A|       A5|     107565|    5892|        5.48|             8.19|   717.0|
|    B|       B1|     125256|    8297|        6.62|             9.08|   709.0|
|    B|       B2|     126536|    9353|        7.39|             9.97|   705.0|
|    B|       B3|     131439|   11681|        8.89|            10.71|   701.0|
|    B|       B4|     139704|   13690|        9.80|            11.37|   699.0|
|    B|       B5|     140195|   15408|       10.99| 

In [59]:
# Save
# GOLD_PATH already ends with a slash, so join the path instead of concatenating
out_path = os.path.join(GOLD_PATH, "default_rate_by_subgrade")
default_by_subgrade.write.mode("overwrite").parquet(out_path)
print(f"Saved: {out_path}")

Saved: ../data/medallion/gold/default_rate_by_subgrade


## 6. Geographic Analysis

In [60]:
# Default rate by state - top 15 states by loan volume
state_analysis = spark.sql("""
    SELECT 
        addr_state as state,
        COUNT(*) as total_loans,
        ROUND(SUM(loan_amnt), 2) as total_funded,
        SUM(loan_status_binary) as defaults,
        ROUND(SUM(loan_status_binary) * 100.0 / COUNT(*), 2) as default_rate,
        ROUND(AVG(annual_inc), 2) as avg_income,
        ROUND(AVG(fico_avg), 0) as avg_fico
    FROM loans
    WHERE addr_state IS NOT NULL
    GROUP BY addr_state
    ORDER BY total_loans DESC
    LIMIT 15
""")

state_analysis.show(15)

+-----+-----------+-------------+--------+------------+----------+--------+
|state|total_loans| total_funded|defaults|default_rate|avg_income|avg_fico|
+-----+-----------+-------------+--------+------------+----------+--------+
|   CA|     314298|4.803618425E9|   42244|       13.44|  83875.68|   700.0|
|   NY|     186324|2.766027925E9|   26742|       14.35|  81083.59|   701.0|
|   TX|     186144|  2.9275107E9|   24092|       12.94|   82813.7|   701.0|
|   FL|     161893|2.331124025E9|   22779|       14.07|  73215.91|   699.0|
|   IL|      91113|1.409321675E9|   10242|       11.24|  79937.53|   701.0|
|   NJ|      83091|1.315369975E9|   11198|       13.48|  88641.03|   701.0|
|   PA|      76882|1.130945525E9|   10380|       13.50|  73995.17|   701.0|
|   OH|      75073|1.076166525E9|    9676|       12.89|  69377.09|   701.0|
|   GA|      74153|  1.1359375E9|    8833|       11.91|  77902.72|   700.0|
|   VA|      62907|1.011978375E9|    8340|       13.26|  85166.59|   701.0|
|   NC|     

In [61]:
# Full state analysis for Gold layer
full_state_analysis = spark.sql("""
    SELECT 
        addr_state as state,
        COUNT(*) as total_loans,
        ROUND(SUM(loan_amnt), 2) as total_funded,
        SUM(loan_status_binary) as defaults,
        ROUND(SUM(loan_status_binary) * 100.0 / COUNT(*), 2) as default_rate,
        ROUND(AVG(annual_inc), 2) as avg_income,
        ROUND(AVG(int_rate), 2) as avg_interest_rate,
        ROUND(AVG(fico_avg), 0) as avg_fico
    FROM loans
    WHERE addr_state IS NOT NULL
    GROUP BY addr_state
    ORDER BY total_loans DESC
""")

# GOLD_PATH already ends with a slash, so join the path instead of concatenating
out_path = os.path.join(GOLD_PATH, "loan_analysis_by_state")
full_state_analysis.write.mode("overwrite").parquet(out_path)
print(f"Saved: {out_path}")

Saved: ../data/medallion/gold/loan_analysis_by_state


## 7. Loan Purpose Analysis

In [62]:
# Default rate by loan purpose
purpose_analysis = spark.sql("""
    SELECT 
        purpose,
        COUNT(*) as total_loans,
        ROUND(SUM(loan_amnt), 2) as total_funded,
        SUM(loan_status_binary) as defaults,
        ROUND(SUM(loan_status_binary) * 100.0 / COUNT(*), 2) as default_rate,
        ROUND(AVG(loan_amnt), 2) as avg_loan_amount,
        ROUND(AVG(int_rate), 2) as avg_interest_rate
    FROM loans
    WHERE purpose IS NOT NULL
    GROUP BY purpose
    ORDER BY total_loans DESC
""")

purpose_analysis.show(15, truncate=False)

+------------------+-----------+---------------+--------+------------+---------------+-----------------+
|purpose           |total_loans|total_funded   |defaults|default_rate|avg_loan_amount|avg_interest_rate|
+------------------+-----------+---------------+--------+------------+---------------+-----------------+
|debt_consolidation|1276906    |2.0384028775E10|180261  |14.12       |15963.61       |13.52            |
|credit_card       |516646     |7.913234825E9  |54878   |10.62       |15316.55       |11.7             |
|home_improvement  |150320     |2.203918675E9  |17338   |11.53       |14661.51       |12.62            |
|other             |139330     |1.460089175E9  |18417   |13.22       |10479.36       |14.24            |
|major_purchase    |50415      |6.393208E8     |6173    |12.24       |12681.16       |12.76            |
|medical           |27460      |2.601001E8     |3753    |13.67       |9471.96        |13.63            |
|small_business    |24673      |4.05632325E8   |5059   

In [63]:
# GOLD_PATH already ends with a slash, so join the path instead of concatenating
out_path = os.path.join(GOLD_PATH, "loan_analysis_by_purpose")
purpose_analysis.write.mode("overwrite").parquet(out_path)
print(f"Saved: {out_path}")

Saved: ../data/medallion/gold/loan_analysis_by_purpose


## 8. Time Series Analysis

In [64]:
# Loan trends over time (by issue year)
time_analysis = spark.sql("""
    SELECT 
        SUBSTRING(issue_d, 1, 4) as year,
        COUNT(*) as total_loans,
        ROUND(SUM(loan_amnt) / 1000000, 2) as total_funded_millions,
        SUM(loan_status_binary) as defaults,
        ROUND(SUM(loan_status_binary) * 100.0 / COUNT(*), 2) as default_rate,
        ROUND(AVG(loan_amnt), 2) as avg_loan_amount,
        ROUND(AVG(int_rate), 2) as avg_interest_rate
    FROM loans
    WHERE issue_d IS NOT NULL
    GROUP BY SUBSTRING(issue_d, 1, 4)
    ORDER BY year
""")

time_analysis.show(15)

+----+-----------+---------------------+--------+------------+---------------+-----------------+
|year|total_loans|total_funded_millions|defaults|default_rate|avg_loan_amount|avg_interest_rate|
+----+-----------+---------------------+--------+------------+---------------+-----------------+
|2007|        599|                 4.95|     158|       26.38|        8267.57|            11.85|
|2008|       2393|                21.12|     496|       20.73|        8825.43|            12.06|
|2009|       5281|                51.93|     723|       13.69|        9833.03|            12.44|
|2010|      12537|               131.99|    1757|       14.01|       10528.24|            11.99|
|2011|      21721|               261.68|    3297|       15.18|        12047.5|            12.22|
|2012|      53367|               718.41|    8644|       16.20|       13461.71|            13.64|
|2013|     134814|              1982.77|   21027|       15.60|       14707.41|            14.53|
|2014|     235629|            

In [65]:
# GOLD_PATH already ends with a slash, so join the path instead of concatenating
out_path = os.path.join(GOLD_PATH, "loan_trends_by_year")
time_analysis.write.mode("overwrite").parquet(out_path)
print(f"Saved: {out_path}")

Saved: ../data/medallion/gold/loan_trends_by_year


## 9. Risk Segmentation Analysis

In [66]:
# Create risk segments based on FICO score
risk_segments = spark.sql("""
    SELECT 
        CASE 
            WHEN fico_avg >= 750 THEN 'Excellent (750+)'
            WHEN fico_avg >= 700 THEN 'Good (700-749)'
            WHEN fico_avg >= 650 THEN 'Fair (650-699)'
            WHEN fico_avg >= 600 THEN 'Poor (600-649)'
            ELSE 'Very Poor (<600)'
        END as fico_segment,
        COUNT(*) as total_loans,
        SUM(loan_status_binary) as defaults,
        ROUND(SUM(loan_status_binary) * 100.0 / COUNT(*), 2) as default_rate,
        ROUND(AVG(loan_amnt), 2) as avg_loan_amount,
        ROUND(AVG(int_rate), 2) as avg_interest_rate,
        ROUND(AVG(annual_inc), 2) as avg_income,
        ROUND(AVG(dti), 2) as avg_dti
    FROM loans
    WHERE fico_avg IS NOT NULL
    GROUP BY 
        CASE 
            WHEN fico_avg >= 750 THEN 'Excellent (750+)'
            WHEN fico_avg >= 700 THEN 'Good (700-749)'
            WHEN fico_avg >= 650 THEN 'Fair (650-699)'
            WHEN fico_avg >= 600 THEN 'Poor (600-649)'
            ELSE 'Very Poor (<600)'
        END
    ORDER BY default_rate
""")

risk_segments.show(truncate=False)

+----------------+-----------+--------+------------+---------------+-----------------+----------+-------+
|fico_segment    |total_loans|defaults|default_rate|avg_loan_amount|avg_interest_rate|avg_income|avg_dti|
+----------------+-----------+--------+------------+---------------+-----------------+----------+-------+
|Excellent (750+)|207174     |10960   |5.29        |16067.47       |8.83             |84230.11  |16.61  |
|Good (700-749)  |747729     |73072   |9.77        |16328.71       |11.72            |82026.92  |19.37  |
|Fair (650-699)  |1303860    |210976  |16.18       |14145.98       |14.55            |74652.42  |18.87  |
|Poor (600-649)  |231        |74      |32.03       |6454.22        |15.32            |52731.55  |13.3   |
+----------------+-----------+--------+------------+---------------+-----------------+----------+-------+

In [67]:
# GOLD_PATH already ends with a slash, so join the path instead of concatenating
out_path = os.path.join(GOLD_PATH, "risk_segments_by_fico")
risk_segments.write.mode("overwrite").parquet(out_path)
print(f"Saved: {out_path}")

Saved: ../data/medallion/gold/risk_segments_by_fico


## 10. Advanced SQL: Window Functions

In [68]:
# Rank states by default rate within each grade
spark.sql("""
    WITH state_grade_stats AS (
        SELECT 
            grade,
            addr_state as state,
            COUNT(*) as loan_count,
            ROUND(SUM(loan_status_binary) * 100.0 / COUNT(*), 2) as default_rate
        FROM loans
        WHERE addr_state IS NOT NULL
        GROUP BY grade, addr_state
        HAVING COUNT(*) >= 100  -- Require at least 100 loans per (grade, state) so rates are not driven by tiny samples
    )
    SELECT 
        grade,
        state,
        loan_count,
        default_rate,
        RANK() OVER (PARTITION BY grade ORDER BY default_rate DESC) as risk_rank
    FROM state_grade_stats
    ORDER BY grade, risk_rank
""").show(30)

+-----+-----+----------+------------+---------+
|grade|state|loan_count|default_rate|risk_rank|
+-----+-----+----------+------------+---------+
|    A|   NM|      2238|        5.05|        1|
|    A|   MS|      2081|        4.71|        2|
|    A|   OK|      3718|        4.57|        3|
|    A|   SD|       780|        4.49|        4|
|    A|   LA|      4848|        4.35|        5|
|    A|   AR|      2924|        4.34|        6|
|    A|   AL|      4483|        4.33|        7|
|    A|   NY|     34038|        4.31|        8|
|    A|   NJ|     16455|        4.23|        9|
|    A|   NV|      5811|        4.15|       10|
|    A|   FL|     29527|        4.12|       11|
|    A|   NC|     11767|        4.02|       12|
|    A|   MT|      1297|        4.01|       13|
|    A|   AK|       975|        4.00|       14|
|    A|   AZ|     10623|        3.99|       15|
|    A|   DE|      1220|        3.85|       16|
|    A|   CA|     62412|        3.82|       17|
|    A|   MN|      7867|        3.79|   

---
# Part B: Machine Learning with MLlib

This section builds a loan default prediction model using Spark MLlib.

## 11. Prepare Data for ML

In [69]:
# Select features for the model
# Numeric features
numeric_features = [
    'loan_amnt',
    'int_rate',
    'installment',
    'annual_inc',
    'dti',
    'open_acc',
    'pub_rec',
    'revol_bal',
    'revol_util',
    'total_acc',
    'fico_avg',
    'loan_to_income',
    'delinq_2yrs',
    'inq_last_6mths'
]

# Categorical features
categorical_features = [
    'term',
    'grade',
    'home_ownership',
    'verification_status',
    'purpose'
]

# Target variable
target = 'loan_status_binary'

print(f"Numeric features: {len(numeric_features)}")
print(f"Categorical features: {len(categorical_features)}")
print(f"Target: {target}")

Numeric features: 14
Categorical features: 5
Target: loan_status_binary


In [70]:
# Select columns and filter nulls in target
all_features = numeric_features + categorical_features + [target]

ml_df = loans_df.select(all_features).filter(F.col(target).isNotNull())

print(f"ML Dataset: {ml_df.count():,} rows")
ml_df.show(5)

ML Dataset: 2,258,994 rows
+---------+--------+-----------+----------+-----+--------+-------+---------+----------+---------+--------+--------------+-----------+--------------+----+-----+--------------+-------------------+------------------+------------------+
|loan_amnt|int_rate|installment|annual_inc|  dti|open_acc|pub_rec|revol_bal|revol_util|total_acc|fico_avg|loan_to_income|delinq_2yrs|inq_last_6mths|term|grade|home_ownership|verification_status|           purpose|loan_status_binary|
+---------+--------+-----------+----------+-----+--------+-------+---------+----------+---------+--------+--------------+-----------+--------------+----+-----+--------------+-------------------+------------------+------------------+
|   3600.0|   13.99|     123.03|   55000.0| 5.91|     7.0|    0.0|   2765.0|      29.7|     13.0|   677.0|    0.06545454|        0.0|           1.0|  36|    C|      MORTGAGE|       Not Verified|debt_consolidation|                 0|
|  24700.0|   11.99|     820.28|   65000.

In [71]:
# Check class distribution (important for imbalanced data)
ml_df.groupBy(target).count().show()

+------------------+-------+
|loan_status_binary|  count|
+------------------+-------+
|                 1| 295082|
|                 0|1963912|
+------------------+-------+



In [72]:
# Check for nulls in features
print("Null counts per column:")
null_counts = ml_df.select([F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in ml_df.columns])
null_counts.show(truncate=False)

Null counts per column:
+---------+--------+-----------+----------+---+--------+-------+---------+----------+---------+--------+--------------+-----------+--------------+----+-----+--------------+-------------------+-------+------------------+
|loan_amnt|int_rate|installment|annual_inc|dti|open_acc|pub_rec|revol_bal|revol_util|total_acc|fico_avg|loan_to_income|delinq_2yrs|inq_last_6mths|term|grade|home_ownership|verification_status|purpose|loan_status_binary|
+---------+--------+-----------+----------+---+--------+-------+---------+----------+---------+--------+--------------+-----------+--------------+----+-----+--------------+-------------------+-------+------------------+
|0        |0       |0          |0         |52 |25      |25     |0        |1793      |25       |0       |0             |25         |26            |0   |0    |50            |0                  |0      |0                 |
+---------+--------+-----------+----------+---+--------+-------+---------+----------+---------+--------+--------------+-

                                                                                

## 12. Feature Engineering Pipeline

In [73]:
# Handle missing values in numeric columns
# Fill with median using Imputer
imputer = Imputer(
    inputCols=numeric_features,
    outputCols=[f"{c}_imputed" for c in numeric_features],
    strategy="median"
)

imputed_numeric_features = [f"{c}_imputed" for c in numeric_features]
print(f"Imputer configured for {len(numeric_features)} numeric features")

Imputer configured for 14 numeric features
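To make the `strategy="median"` setting concrete, here is a minimal pure-Python sketch of median imputation on a single column. The sample values and the helper name `impute_median` are hypothetical. Spark's `Imputer` estimates the median approximately on large data, so its fill value may differ slightly from the exact median computed here.

```python
from statistics import median

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical revol_util values with two missing entries.
revol_util = [29.7, None, 56.2, 11.6, None, 64.5]
imputed = impute_median(revol_util)
print(imputed)
```

The median is preferred over the mean here because columns such as `annual_inc` and `revol_bal` are heavily skewed, and a few extreme values would otherwise pull the fill value upward.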


In [74]:
# String Indexers for categorical features
# This converts strings to numeric indices

indexers = []
indexed_cat_features = []

for cat_col in categorical_features:
    indexer = StringIndexer(
        inputCol=cat_col, 
        outputCol=f"{cat_col}_indexed",
        handleInvalid="keep"  # Handle unseen labels
    )
    indexers.append(indexer)
    indexed_cat_features.append(f"{cat_col}_indexed")

print(f"Created {len(indexers)} StringIndexers")

Created 5 StringIndexers
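As a rough illustration of what each `StringIndexer` does with its default ordering, the sketch below assigns index 0 to the most frequent label and, mirroring `handleInvalid="keep"`, sends any label not seen during fitting to one extra index. The function names and sample labels are hypothetical, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

def fit_string_index(labels):
    """Map each label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; break ties alphabetically.
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: idx for idx, lab in enumerate(ordered)}

def transform_keep(mapping, label):
    """Unseen labels share one extra index, similar to handleInvalid='keep'."""
    return mapping.get(label, len(mapping))

# Hypothetical training labels for home_ownership.
train_labels = ["MORTGAGE", "RENT", "MORTGAGE", "OWN", "RENT", "MORTGAGE"]
mapping = fit_string_index(train_labels)
print(mapping)
print(transform_keep(mapping, "OTHER"))
```

Keeping unseen labels in their own bucket means the pipeline does not fail at scoring time when a new category appears in production data.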


In [75]:
# One-Hot Encoders for categorical features
encoders = []
encoded_cat_features = []

for cat_col in categorical_features:
    encoder = OneHotEncoder(
        inputCol=f"{cat_col}_indexed",
        outputCol=f"{cat_col}_encoded",
        handleInvalid="keep"
    )
    encoders.append(encoder)
    encoded_cat_features.append(f"{cat_col}_encoded")

print(f"Created {len(encoders)} OneHotEncoders")

Created 5 OneHotEncoders


In [76]:
# Assemble all features into a single vector
all_feature_cols = imputed_numeric_features + encoded_cat_features

assembler = VectorAssembler(
    inputCols=all_feature_cols,
    outputCol="features_unscaled",
    handleInvalid="skip"  # Skip rows with nulls
)

print(f"VectorAssembler will combine {len(all_feature_cols)} feature columns")

VectorAssembler will combine 19 feature columns


In [77]:
# Scale features (important for Logistic Regression)
scaler = StandardScaler(
    inputCol="features_unscaled",
    outputCol="features",
    withStd=True,
    withMean=False  # Don't center for sparse data
)

print("StandardScaler configured")

StandardScaler configured
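The choice of `withStd=True, withMean=False` can be illustrated on a single column: dividing by the standard deviation rescales the values, while skipping the mean subtraction leaves zeros as zeros, which keeps one-hot vectors sparse. This is a small sketch with a hypothetical column and helper name, not the scaler's actual implementation.

```python
from statistics import stdev

def scale_without_centering(column):
    """Divide each value by the sample standard deviation; no mean shift."""
    sd = stdev(column)
    return [x / sd for x in column]

# Hypothetical one-hot column: mostly zeros, as produced by OneHotEncoder.
one_hot = [0.0, 0.0, 1.0, 0.0, 1.0, 0.0]
scaled = scale_without_centering(one_hot)
print(scaled)
```

If the mean were subtracted as well, every zero would become a non-zero value and the sparse feature vector would turn dense, which costs memory on a dataset of this size.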


## 13. Train/Test Split

In [78]:
# Split data into training and test sets
train_df, test_df = ml_df.randomSplit([0.8, 0.2], seed=42)

# Cache for performance
train_df.cache()
test_df.cache()

train_count = train_df.count()
test_count = test_df.count()

print(f"Training set: {train_count:,} rows ({train_count/(train_count+test_count)*100:.1f}%)")
print(f"Test set: {test_count:,} rows ({test_count/(train_count+test_count)*100:.1f}%)")

Training set: 1,807,149 rows (80.0%)
Test set: 451,845 rows (20.0%)


In [79]:
# Verify class distribution in splits
print("Training set class distribution:")
train_df.groupBy(target).count().show()

print("Test set class distribution:")
test_df.groupBy(target).count().show()

Training set class distribution:
+------------------+-------+
|loan_status_binary|  count|
+------------------+-------+
|                 1| 235812|
|                 0|1571337|
+------------------+-------+

Test set class distribution:
+------------------+------+
|loan_status_binary| count|
+------------------+------+
|                 1| 59270|
|                 0|392575|
+------------------+------+



## 14. Model 1: Logistic Regression

In [80]:
# Create Logistic Regression model
lr = LogisticRegression(
    featuresCol="features",
    labelCol=target,
    maxIter=100,
    regParam=0.01,
    elasticNetParam=0.8  # Elastic-net mix: 0 = pure L2 (ridge), 1 = pure L1 (lasso)
)

# Build pipeline
lr_pipeline = Pipeline(stages=[
    imputer,
    *indexers,
    *encoders,
    assembler,
    scaler,
    lr
])

print(f"Logistic Regression Pipeline: {len(lr_pipeline.getStages())} stages")

Logistic Regression Pipeline: 14 stages


In [81]:
%%time
# Train the model
print("Training Logistic Regression model...")
lr_model = lr_pipeline.fit(train_df)
print("Training complete!")

Training Logistic Regression model...
Training complete!
CPU times: user 100 ms, sys: 36.9 ms, total: 137 ms
Wall time: 22.3 s


In [82]:
# Make predictions
lr_predictions = lr_model.transform(test_df)

# Show sample predictions
lr_predictions.select(
    'loan_amnt', 'int_rate', 'grade', 'fico_avg',
    target, 'prediction', 'probability'
).show(10, truncate=False)

+---------+--------+-----+--------+------------------+----------+-----------------------------------------+
|loan_amnt|int_rate|grade|fico_avg|loan_status_binary|prediction|probability                              |
+---------+--------+-----+--------+------------------+----------+-----------------------------------------+
|600.0    |13.24   |D    |752.0   |0                 |0.0       |[0.8832807611679021,0.1167192388320979]  |
|900.0    |12.92   |D    |717.0   |0                 |0.0       |[0.8911017040293526,0.10889829597064737] |
|1000.0   |5.31    |A    |827.0   |0                 |0.0       |[0.9687015311729185,0.03129846882708154] |
|1000.0   |5.32    |A    |772.0   |0                 |0.0       |[0.9642980348211304,0.03570196517886959] |
|1000.0   |5.32    |A    |772.0   |1                 |0.0       |[0.9625904754212486,0.03740952457875135] |
|1000.0   |5.32    |A    |687.0   |0                 |0.0       |[0.9483839573031319,0.05161604269686815] |
|1000.0   |6.03    |A    |75

In [83]:
# Evaluate Logistic Regression
evaluator_auc = BinaryClassificationEvaluator(
    labelCol=target,
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"
)

evaluator_pr = BinaryClassificationEvaluator(
    labelCol=target,
    rawPredictionCol="rawPrediction",
    metricName="areaUnderPR"
)

evaluator_accuracy = MulticlassClassificationEvaluator(
    labelCol=target,
    predictionCol="prediction",
    metricName="accuracy"
)

lr_auc = evaluator_auc.evaluate(lr_predictions)
lr_pr = evaluator_pr.evaluate(lr_predictions)
lr_accuracy = evaluator_accuracy.evaluate(lr_predictions)

print("=" * 50)
print("LOGISTIC REGRESSION RESULTS")
print("=" * 50)
print(f"AUC-ROC: {lr_auc:.4f}")
print(f"AUC-PR:  {lr_pr:.4f}")
print(f"Accuracy: {lr_accuracy:.4f}")



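==================================================
LOGISTIC REGRESSION RESULTS
==================================================
AUC-ROC: 0.6893
AUC-PR:  0.2359
Accuracy: 0.8688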

In [84]:
# Confusion Matrix
print("\nConfusion Matrix:")
lr_predictions.groupBy(target, 'prediction').count().orderBy(target, 'prediction').show()


Confusion Matrix:
+------------------+----------+------+
|loan_status_binary|prediction| count|
+------------------+----------+------+
|                 0|       0.0|392575|
|                 1|       0.0| 59269|
|                 1|       1.0|     1|
+------------------+----------+------+
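The confusion matrix puts the 0.8688 accuracy in context: the model flags exactly one loan as a default. The sketch below derives accuracy, the majority-class baseline, recall and precision from the counts above. It assumes that the absence of a `(0, 1.0)` row means there were no false positives.

```python
# Counts copied from the confusion matrix above; no (0, 1.0) row was printed,
# so false positives are taken to be zero.
tn, fp = 392575, 0
fn, tp = 59269, 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
baseline = (tn + fp) / total          # accuracy of always predicting 0
recall = tp / (tp + fn)               # share of actual defaults caught
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy:  {accuracy:.4f}")
print(f"baseline:  {baseline:.4f}")
print(f"recall:    {recall:.6f}")
print(f"precision: {precision:.4f}")
```

Accuracy and the always-predict-0 baseline agree to four decimals, which is why AUC-ROC and AUC-PR are the more informative metrics for this imbalanced problem.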



                                                                                

## 15. Model 2: Random Forest

In [85]:
# Create Random Forest model (tree splits are insensitive to feature scale, so no scaling is needed)
rf = RandomForestClassifier(
    featuresCol="features_unscaled",  # Use unscaled features
    labelCol=target,
    numTrees=20,
    maxDepth=5,
    minInstancesPerNode=100,
    seed=42
)

# Build pipeline (without scaler)
rf_pipeline = Pipeline(stages=[
    imputer,
    *indexers,
    *encoders,
    assembler,
    rf
])

print(f"Random Forest Pipeline: {len(rf_pipeline.getStages())} stages")

Random Forest Pipeline: 13 stages


In [86]:
%%time
# Train the model
print("Training Random Forest model...")
rf_model = rf_pipeline.fit(train_df)
print("Training complete!")

Training Random Forest model...
Training complete!
CPU times: user 92.5 ms, sys: 35.6 ms, total: 128 ms
Wall time: 25.5 s


In [87]:
# Make predictions
rf_predictions = rf_model.transform(test_df)

# Evaluate Random Forest
rf_auc = evaluator_auc.evaluate(rf_predictions)
rf_pr = evaluator_pr.evaluate(rf_predictions)
rf_accuracy = evaluator_accuracy.evaluate(rf_predictions)

print("=" * 50)
print("RANDOM FOREST RESULTS")
print("=" * 50)
print(f"AUC-ROC: {rf_auc:.4f}")
print(f"AUC-PR:  {rf_pr:.4f}")
print(f"Accuracy: {rf_accuracy:.4f}")



RANDOM FOREST RESULTS
AUC-ROC: 0.5633
AUC-PR:  0.2061
Accuracy: 0.8688

In [88]:
# Confusion Matrix
print("\nConfusion Matrix:")
rf_predictions.groupBy(target, 'prediction').count().orderBy(target, 'prediction').show()


Confusion Matrix:




+------------------+----------+------+
|loan_status_binary|prediction| count|
+------------------+----------+------+
|                 0|       0.0|392575|
|                 1|       0.0| 59270|
+------------------+----------+------+
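
The Random Forest predicts class 0 for every test row, which is typical of an imbalanced target. One common mitigation, not used in this notebook, is to weight each class by its inverse frequency and pass the weights through the classifier's `weightCol` parameter. The sketch below only computes such weights in plain Python; `balanced_class_weights` is a hypothetical helper, and the counts are taken from the test-set confusion matrix purely for illustration.

```python
def balanced_class_weights(class_counts):
    """Weight each class by total / (n_classes * class_count)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * count)
            for label, count in class_counts.items()}


# Illustrative counts taken from the test-set confusion matrix above;
# a real run would count labels in the training split instead
weights = balanced_class_weights({0: 392575, 1: 59270})
for label, weight in sorted(weights.items()):
    print(f"class {label}: weight {weight:.4f}")
```

In the notebook these values could be attached as a column with `F.when(F.col(target) == 1, ...).otherwise(...)` and referenced via `weightCol`. Whether this improves the metrics on this dataset is untested here.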




In [89]:
# Feature Importance (Random Forest)
rf_model_final = rf_model.stages[-1]
feature_importance = rf_model_final.featureImportances

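# Rank features by importance.
# Note: indices refer to slots in the assembled feature vector. One-hot encoding
# expands each categorical column into several slots, so an index does not map
# one-to-one to an input column.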
print("\nTop Feature Importances (by index):")
importance_list = [(i, float(imp)) for i, imp in enumerate(feature_importance)]
sorted_importance = sorted(importance_list, key=lambda x: x[1], reverse=True)[:15]

print(f"{'Index':<8} {'Importance':<12}")
print("-" * 20)
for idx, imp in sorted_importance:
    print(f"{idx:<8} {imp:.4f}")


Top Feature Importances (by index):
Index    Importance  
--------------------
23       0.4730
33       0.2802
22       0.2159
11       0.0104
7        0.0100
3        0.0040
26       0.0024
35       0.0022
10       0.0015
12       0.0005
0        0.0000
1        0.0000
2        0.0000
4        0.0000
5        0.0000
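
The indices above refer to slots in the assembled feature vector. Spark usually records slot names in the vector column's metadata under `ml_attr`; the sketch below, in plain Python, flattens that metadata into an index-to-name mapping. The sample dictionary and its feature names are hypothetical stand-ins. In the notebook the real one would come from `rf_predictions.schema["features_unscaled"].metadata["ml_attr"]`, assuming the assembler populated it.

```python
def index_to_name(ml_attr):
    """Flatten Spark ML attribute metadata into {slot index: feature name}."""
    mapping = {}
    # "attrs" groups slots by type (e.g. "numeric", "binary", "nominal")
    for attr_group in ml_attr.get("attrs", {}).values():
        for attr in attr_group:
            mapping[attr["idx"]] = attr["name"]
    return mapping


# Hypothetical metadata shaped like a VectorAssembler output column's "ml_attr"
sample_ml_attr = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "loan_amnt"}, {"idx": 1, "name": "int_rate"}],
        "binary": [{"idx": 2, "name": "grade_vec_A"}, {"idx": 3, "name": "grade_vec_B"}],
    },
    "num_attrs": 4,
}

names = index_to_name(sample_ml_attr)
for idx in sorted(names):
    print(idx, names[idx])
```

With such a mapping, the importance table could print a feature name next to each index instead of the index alone.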


## 16. Model Comparison

In [90]:
# Compare models
print("=" * 60)
print("MODEL COMPARISON")
print("=" * 60)
print(f"{'Metric':<15} {'Logistic Regression':<20} {'Random Forest':<20}")
print("-" * 60)
print(f"{'AUC-ROC':<15} {lr_auc:<20.4f} {rf_auc:<20.4f}")
print(f"{'AUC-PR':<15} {lr_pr:<20.4f} {rf_pr:<20.4f}")
print(f"{'Accuracy':<15} {lr_accuracy:<20.4f} {rf_accuracy:<20.4f}")
print("=" * 60)

# Determine best model by test-set AUC-ROC
best_model_name = "Random Forest" if rf_auc > lr_auc else "Logistic Regression"
best_model = rf_model if rf_auc > lr_auc else lr_model
best_auc = max(rf_auc, lr_auc)

print(f"\nBest Model: {best_model_name} (AUC: {best_auc:.4f})")

MODEL COMPARISON
Metric          Logistic Regression  Random Forest       
------------------------------------------------------------
AUC-ROC         0.6893               0.5633              
AUC-PR          0.2359               0.2061              
Accuracy        0.8688               0.8688              

Best Model: Logistic Regression (AUC: 0.6893)


## 17. Hyperparameter Tuning with Cross-Validation (Optional)

In [91]:
# Hyperparameter tuning for Random Forest
# Note: This can be time-consuming, so we use a small grid

# Create a new RF for tuning
rf_tune = RandomForestClassifier(
    featuresCol="features_unscaled",
    labelCol=target,
    seed=42
)

# Pipeline for tuning
rf_tune_pipeline = Pipeline(stages=[
    imputer,
    *indexers,
    *encoders,
    assembler,
    rf_tune
])

# Parameter grid
paramGrid = ParamGridBuilder() \
    .addGrid(rf_tune.numTrees, [50, 100]) \
    .addGrid(rf_tune.maxDepth, [5, 10]) \
    .build()

print(f"Parameter grid size: {len(paramGrid)} combinations")

Parameter grid size: 4 combinations


In [92]:
%%time
# Cross-validator
cv = CrossValidator(
    estimator=rf_tune_pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator_auc,
    numFolds=3,
    seed=42
)

# Sample data for faster tuning (optional - remove for full tuning)
train_sample = train_df.sample(0.2, seed=42)
print(f"Training CV on sample of {train_sample.count():,} rows...")

# Fit cross-validator
cv_model = cv.fit(train_sample)
print("Cross-validation complete!")

Training CV on sample of 362,229 rows...
Cross-validation complete!
CPU times: user 2.1 s, sys: 1.32 s, total: 3.42 s
Wall time: 4min 21s


In [93]:
# Best model from CV
print("Cross-validation AUC scores:")
for i, score in enumerate(cv_model.avgMetrics):
    print(f"  Config {i+1}: {score:.4f}")

print(f"\nBest CV AUC: {max(cv_model.avgMetrics):.4f}")

Cross-validation AUC scores:
  Config 1: 0.6915
  Config 2: 0.7031
  Config 3: 0.6915
  Config 4: 0.7036

Best CV AUC: 0.7036
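
The "Config N" labels do not say which hyperparameters produced each score. The sketch below pairs scores with parameter combinations, assuming the grid is enumerated as the Cartesian product in the order the parameters were added (`numTrees`, then `maxDepth`). In the notebook, `zip(cv_model.getEstimatorParamMaps(), cv_model.avgMetrics)` would give the pairing directly without relying on that assumption.

```python
from itertools import product

# Grid values from the ParamGridBuilder cell; the ordering is an assumption
num_trees_values = [50, 100]
max_depth_values = [5, 10]
param_combos = list(product(num_trees_values, max_depth_values))

# Average AUC-ROC per configuration, as printed above
avg_metrics = [0.6915, 0.7031, 0.6915, 0.7036]

# Sort configurations from best to worst average AUC-ROC
scored = sorted(zip(param_combos, avg_metrics), key=lambda x: x[1], reverse=True)
for (num_trees, max_depth), auc in scored:
    print(f"numTrees={num_trees:<4} maxDepth={max_depth:<3} AUC={auc:.4f}")

best_params, best_score = scored[0]
```

Under that ordering assumption, tree depth accounts for almost all of the difference: both `maxDepth=10` configurations score about 0.70, while both `maxDepth=5` configurations score 0.6915.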


In [94]:
# Evaluate best CV model on test set
cv_predictions = cv_model.transform(test_df)
cv_auc = evaluator_auc.evaluate(cv_predictions)
print(f"Best CV Model Test AUC: {cv_auc:.4f}")

Best CV Model Test AUC: 0.7026
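
The tuned Random Forest scores higher on the test set than either model compared in Section 16, but `best_model` was chosen before tuning and does not consider it. Below is a small sketch of a selection step that includes all three candidates, using the AUC-ROC values reported in this notebook. The dictionary and its labels are illustrative.

```python
# Test-set AUC-ROC values as reported in the cells above
candidate_auc = {
    "Logistic Regression": 0.6893,
    "Random Forest": 0.5633,
    "Random Forest (CV-tuned)": 0.7026,
}

# Pick the candidate with the highest AUC-ROC
best_name = max(candidate_auc, key=candidate_auc.get)
print(f"Best candidate: {best_name} (AUC: {candidate_auc[best_name]:.4f})")
```

Note that the tuned model was fit on a 20% sample of the training data, and choosing a model by its test-set score makes that score an optimistic estimate; a separate validation split would avoid this.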


## 18. Save Models and Predictions

In [95]:
# Save best model
MODEL_PATH = os.path.join(GOLD_PATH, "models", "default_prediction_model")
best_model.write().overwrite().save(MODEL_PATH)
print(f"Model saved to: {MODEL_PATH}")

Model saved to: ../data/medallion/gold/models/default_prediction_model


In [97]:
# Save predictions
# Use the predictions from the best model
best_predictions = rf_predictions if rf_auc > lr_auc else lr_predictions

predictions_to_save = best_predictions.select(
    'loan_amnt', 'int_rate', 'term', 'grade',
    'annual_inc', 'dti', 'fico_avg', 'purpose', 'home_ownership',
    target, 'prediction', 'probability'
)

PREDICTIONS_PATH = os.path.join(GOLD_PATH, "predictions")
predictions_to_save.write.mode("overwrite").parquet(PREDICTIONS_PATH)
print(f"Predictions saved to: {PREDICTIONS_PATH}")

Predictions saved to: ../data/medallion/gold/predictions

## 19. Create Risk Scoring Table

In [103]:
# Create a risk scoring summary for business use
# Extract probability of default from the probability vector

from pyspark.ml.functions import vector_to_array

# Extract the probability of default (class 1) with the built-in
# vector_to_array, which avoids the serialization overhead of a Python UDF
risk_scores = best_predictions.withColumn(
    "default_probability",
    vector_to_array(F.col("probability"))[1].cast("float")
)

# Create risk categories
risk_scores = risk_scores.withColumn(
    "risk_category",
    F.when(F.col("default_probability") < 0.1, "Low Risk")
    .when(F.col("default_probability") < 0.25, "Medium Risk")
    .when(F.col("default_probability") < 0.5, "High Risk")
    .otherwise("Very High Risk")
)

# Show distribution
risk_scores.groupBy("risk_category").agg(
    F.count("*").alias("count"),
    F.round(F.avg("default_probability"), 4).alias("avg_prob"),
    F.round(F.avg(target), 4).alias("actual_default_rate")
).orderBy("avg_prob").show()



+--------------+------+--------+-------------------+
| risk_category| count|avg_prob|actual_default_rate|
+--------------+------+--------+-------------------+
|      Low Risk|155016|  0.0685|             0.0535|
|   Medium Risk|271977|  0.1502|              0.159|
|     High Risk| 24851|  0.3008|             0.3108|
|Very High Risk|     1|  0.7076|                1.0|
+--------------+------+--------+-------------------+
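
The thresholds in the `F.when` chain can be mirrored in plain Python, which is handy for unit-testing the boundaries or for categorizing a single probability outside Spark. A minimal sketch follows; `risk_category` is an illustrative helper, not part of the notebook.

```python
def risk_category(default_probability):
    """Map a default probability to the same buckets as the F.when chain."""
    if default_probability < 0.1:
        return "Low Risk"
    if default_probability < 0.25:
        return "Medium Risk"
    if default_probability < 0.5:
        return "High Risk"
    return "Very High Risk"


# Average probabilities per bucket from the table above, plus two boundary values
for p in [0.0685, 0.1502, 0.3008, 0.7076, 0.1, 0.5]:
    print(f"{p:.4f} -> {risk_category(p)}")
```

Because each comparison is strict, a probability exactly on a threshold falls into the higher-risk bucket.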




In [104]:
# Save risk scores
risk_score_table = risk_scores.select(
    'loan_amnt', 'int_rate', 'grade', 'fico_avg', 'annual_inc',
    'purpose', target, 'default_probability', 'risk_category'
)

RISK_SCORES_PATH = os.path.join(GOLD_PATH, "risk_scores")
risk_score_table.write.mode("overwrite").parquet(RISK_SCORES_PATH)
print(f"Risk scores saved to: {RISK_SCORES_PATH}")

Risk scores saved to: ../data/medallion/gold/risk_scores

## 20. Gold Layer Summary

In [105]:
# List all Gold layer outputs
!ls -la {GOLD_PATH}

total 44
drwxrwxr-x 11 ubuntu ubuntu 4096 Nov 25 20:13 .
drwxrwxr-x  5 ubuntu ubuntu 4096 Nov 25 19:59 ..
drwxr-xr-x  2 ubuntu ubuntu 4096 Nov 25 20:03 default_rate_by_grade
drwxr-xr-x  2 ubuntu ubuntu 4096 Nov 25 20:03 default_rate_by_subgrade
drwxr-xr-x  2 ubuntu ubuntu 4096 Nov 25 20:03 loan_analysis_by_purpose
drwxr-xr-x  2 ubuntu ubuntu 4096 Nov 25 20:03 loan_analysis_by_state
drwxr-xr-x  2 ubuntu ubuntu 4096 Nov 25 20:04 loan_trends_by_year
drwxr-xr-x  3 ubuntu ubuntu 4096 Nov 25 20:09 models
drwxr-xr-x  2 ubuntu ubuntu 4096 Nov 25 20:11 predictions
drwxr-xr-x  2 ubuntu ubuntu 4096 Nov 25 20:13 risk_scores
drwxr-xr-x  2 ubuntu ubuntu 4096 Nov 25 20:04 risk_segments_by_fico


In [106]:
# Final Summary
print("=" * 70)
print("GOLD LAYER SUMMARY")
print("=" * 70)

print("\n--- Part A: SQL Analytics ---")
print("Tables created for business intelligence:")
print(f"  1. default_rate_by_grade     - Default rates by loan grade")
print(f"  2. default_rate_by_subgrade  - Default rates by sub-grade")
print(f"  3. loan_analysis_by_state    - Geographic analysis")
print(f"  4. loan_analysis_by_purpose  - Analysis by loan purpose")
print(f"  5. loan_trends_by_year       - Time series analysis")
print(f"  6. risk_segments_by_fico     - FICO-based risk segments")

print("\n--- Part B: Machine Learning ---")
print("Models trained:")
print(f"  1. Logistic Regression - AUC: {lr_auc:.4f}")
print(f"  2. Random Forest       - AUC: {rf_auc:.4f}")
print(f"\nBest model: {best_model_name}")

print("\nML outputs:")
print(f"  - models/default_prediction_model - Trained ML pipeline")
print(f"  - predictions                     - Test set predictions")
print(f"  - risk_scores                     - Risk scoring table")

print("\n--- Technologies Used ---")
print("  - Spark SQL: Complex queries, aggregations, window functions")
print("  - MLlib: Pipeline, VectorAssembler, StringIndexer, OneHotEncoder")
print("  - MLlib: LogisticRegression, RandomForestClassifier")
print("  - MLlib: CrossValidator, BinaryClassificationEvaluator")

print("\n" + "=" * 70)

GOLD LAYER SUMMARY

--- Part A: SQL Analytics ---
Tables created for business intelligence:
  1. default_rate_by_grade     - Default rates by loan grade
  2. default_rate_by_subgrade  - Default rates by sub-grade
  3. loan_analysis_by_state    - Geographic analysis
  4. loan_analysis_by_purpose  - Analysis by loan purpose
  5. loan_trends_by_year       - Time series analysis
  6. risk_segments_by_fico     - FICO-based risk segments

--- Part B: Machine Learning ---
Models trained:
  1. Logistic Regression - AUC: 0.6893
  2. Random Forest       - AUC: 0.5633

Best model: Logistic Regression

ML outputs:
  - models/default_prediction_model - Trained ML pipeline
  - predictions                     - Test set predictions
  - risk_scores                     - Risk scoring table

--- Technologies Used ---
  - Spark SQL: Complex queries, aggregations, window functions
  - MLlib: Pipeline, VectorAssembler, StringIndexer, OneHotEncoder
  - MLlib: LogisticRegression, RandomForestClassifier
  - MLlib: CrossValidator, BinaryClassificationEvaluator

In [107]:
# Clean up
train_df.unpersist()
test_df.unpersist()

print("Cached DataFrames unpersisted.")
print("\nGold layer complete!")

Cached DataFrames unpersisted.

Gold layer complete!


---
## Conclusion

The Gold layer provides:

1. **Business Analytics** - Six aggregated tables ready for dashboards
2. **ML Model** - A trained loan default prediction pipeline (Logistic Regression, test AUC-ROC 0.6893)
3. **Risk Scoring** - Probability-based risk categories (Low, Medium, High, Very High)

These outputs can be:
- Connected to BI tools (Tableau, Power BI)
- Served via REST API for applications
- Used for real-time scoring of new loan applications