#### Data cleaning plan for loan-defaulters dataset

1. **Initialize SparkSession**

2. **Define defaulters schema**

3. **Load raw loan-defaulters CSV**
   - Set `header=True`
   - Apply predefined schema

4. **Inspect raw DataFrame**
   - Show sample rows
   - Print schema

5. **Create temporary view** `"loan_defaulters"`

6. **Explore `delinq_2yrs`**
   - List distinct values
   - Count by value

7. **Clean `delinq_2yrs`**
   - Cast to `Integer`
   - Fill nulls with `0`
   - Re-register the temp view

8. **Validate that no nulls remain in `delinq_2yrs`**

9. **Filter primary defaulters**
   - Keep rows where `delinq_2yrs > 0` OR `mths_since_last_delinq > 0`
   - Count the resulting records

10. **Filter all records with credit issues**
    - Condition:
      ```
      pub_rec > 0
      OR pub_rec_bankruptcies > 0
      OR inq_last_6mths > 0
      ```
    - Count the resulting records

11. **Save both DataFrames as Parquet**
    - Primary defaulters → `/user/itv017499/lendingclubproject/cleaned/loans_defaulters_parquet`
    - Credit-issue records → `/user/itv017499/lendingclubproject/cleaned/loans_defaulters_records_parquet`


### 1. Initialize SparkSession

In [1]:
from pyspark.sql import SparkSession
import getpass

# Use OS username for a user-specific warehouse directory
username = getpass.getuser()

spark = (
    SparkSession.builder
      .config('spark.ui.port', '0')
      .config('spark.sql.warehouse.dir', f"/user/{username}/warehouse")
      .config('spark.shuffle.useOldFetchProtocol', 'true')
      .enableHiveSupport()
      .master('yarn')
      .getOrCreate()
)

### 2. Define defaulters schema

In [2]:
# Exact column names & types for defaulters dataset
loan_defaulters_schema = """
    member_id string,
    delinq_2yrs float,
    delinq_amnt float,
    pub_rec float,
    pub_rec_bankruptcies float,
    inq_last_6mths float,
    total_rec_late_fee float,
    mths_since_last_delinq float,
    mths_since_last_record float
"""


### 3. Load raw loan‑defaulters CSV

In [3]:
loans_def_raw_df = (
    spark.read
      .format("csv")
      .option("header", True)                    
      .schema(loan_defaulters_schema)           
      .load("/public/trendytech/lendingclubproject/raw/loans_defaulters_csv")
)

# Quick look at data
loans_def_raw_df.show(5)


+--------------------+-----------+-----------+-------+--------------------+--------------+------------------+----------------------+----------------------+
|           member_id|delinq_2yrs|delinq_amnt|pub_rec|pub_rec_bankruptcies|inq_last_6mths|total_rec_late_fee|mths_since_last_delinq|mths_since_last_record|
+--------------------+-----------+-----------+-------+--------------------+--------------+------------------+----------------------+----------------------+
|9cb79aa7323e81be1...|        2.0|        0.0|    0.0|                 0.0|           0.0|               0.0|                  11.0|                  null|
|0dd2bbc517e3c8f9e...|        0.0|        0.0|    1.0|                 1.0|           3.0|               0.0|                  null|                 115.0|
|458458599d3df3bfc...|        0.0|        0.0|    1.0|                 1.0|           1.0|               0.0|                  null|                  76.0|
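+--------------------+-----------+-----------+-------+--------------------+--------------+------------------+----------------------+----------------------+
(output truncated)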
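
When an explicit schema is supplied together with `header=True`, Spark's CSV reader maps schema fields to file columns by position and, with the default `enforceSchema` setting, ignores the header names. Checking the header line against the schema beforehand can catch a reordered file before it silently mislabels columns. The sketch below is plain Python and not part of the original notebook; `schema_column_names`, `header_mismatches`, and both sample header strings are hypothetical.

```python
import csv
import io
from typing import List, Tuple

# Same DDL string as in step 2.
loan_defaulters_schema = """
    member_id string,
    delinq_2yrs float,
    delinq_amnt float,
    pub_rec float,
    pub_rec_bankruptcies float,
    inq_last_6mths float,
    total_rec_late_fee float,
    mths_since_last_delinq float,
    mths_since_last_record float
"""


def schema_column_names(ddl: str) -> List[str]:
    """Column names from a simple 'name type, name type' DDL string.

    Assumes no type contains a comma (e.g. decimal(10,2) would break this).
    """
    return [field.split()[0] for field in ddl.split(",") if field.strip()]


def header_mismatches(header_line: str, ddl: str) -> List[Tuple[int, str, str]]:
    """Return (position, expected, actual) for every column that differs."""
    actual = [name.strip() for name in next(csv.reader(io.StringIO(header_line)))]
    expected = schema_column_names(ddl)
    mismatches = []
    for i in range(max(len(actual), len(expected))):
        want = expected[i] if i < len(expected) else "<missing>"
        got = actual[i] if i < len(actual) else "<missing>"
        if want != got:
            mismatches.append((i, want, got))
    return mismatches


# Hypothetical header lines, for illustration only.
good_header = ",".join(schema_column_names(loan_defaulters_schema))
swapped_header = good_header.replace(
    "delinq_2yrs,delinq_amnt", "delinq_amnt,delinq_2yrs"
)

print(header_mismatches(good_header, loan_defaulters_schema))     # []
print(header_mismatches(swapped_header, loan_defaulters_schema))  # two mismatches
```

Within Spark itself, setting `.option("enforceSchema", False)` on the reader asks it to validate the header against the schema instead of ignoring it, which may be a simpler route when the file is already on the cluster.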

### 4. Inspect raw DataFrame


In [4]:
# Verify the column types and nullability
loans_def_raw_df.printSchema()


root
 |-- member_id: string (nullable = true)
 |-- delinq_2yrs: float (nullable = true)
 |-- delinq_amnt: float (nullable = true)
 |-- pub_rec: float (nullable = true)
 |-- pub_rec_bankruptcies: float (nullable = true)
 |-- inq_last_6mths: float (nullable = true)
 |-- total_rec_late_fee: float (nullable = true)
 |-- mths_since_last_delinq: float (nullable = true)
 |-- mths_since_last_record: float (nullable = true)



### 5. Create temporary view "loan_defaulters"


In [5]:
loans_def_raw_df.createOrReplaceTempView("loan_defaulters")

### 6. Explore `delinq_2yrs`

In [6]:
# Distinct values
spark.sql("SELECT DISTINCT(delinq_2yrs) FROM loan_defaulters").show()

# Count by each distinct value
spark.sql("""
    SELECT delinq_2yrs, COUNT(*) AS total
      FROM loan_defaulters
  GROUP BY delinq_2yrs
  ORDER BY total DESC
""").show(40)


+-----------+
|delinq_2yrs|
+-----------+
|      20.04|
|      18.53|
|       18.0|
|      26.24|
|       6.52|
|        9.0|
|      21.72|
|      17.17|
|       58.0|
|        5.0|
|       39.0|
|       9.44|
|       17.0|
|       30.0|
|       26.0|
|       29.0|
|       9.56|
|       23.0|
|       1.41|
|      17.18|
+-----------+
only showing top 20 rows

+-----------+-------+
|delinq_2yrs|  total|
+-----------+-------+
|        0.0|1838878|
|        1.0| 281335|
|        2.0|  81285|
|        3.0|  29539|
|        4.0|  13179|
|        5.0|   6599|
|        6.0|   3717|
|        7.0|   2062|
|        8.0|   1223|
|        9.0|    818|
|       10.0|    556|
|       11.0|    363|
|       12.0|    264|
|       null|    261|
|       13.0|    165|
|       14.0|    120|
|       15.0|     87|
|       16.0|     55|
|       18.0|     30|
|       17.0|     30|
|       19.0|     23|
|       20.0|     17|
|       21.0|     12|
|       22.0|      5|
|       24.0|      4|
|       26.0|      3|


### 7. Clean `delinq_2yrs`


In [7]:
from pyspark.sql.functions import col

# Cast to integer (this truncates fractional values such as 20.04 to 20),
# then fill any remaining nulls with 0
loans_def_processed_df = (
    loans_def_raw_df
      .withColumn("delinq_2yrs", col("delinq_2yrs").cast("integer"))
      .fillna(0, subset=["delinq_2yrs"])
)

# Update the view
loans_def_processed_df.createOrReplaceTempView("loan_defaulters")


### 8. Re‑validate `delinq_2yrs` nulls

In [8]:
# Should be zero now
spark.sql("SELECT COUNT(*) FROM loan_defaulters WHERE delinq_2yrs IS NULL").show()

# Re‑check distribution
spark.sql("""
    SELECT delinq_2yrs, COUNT(*) AS total
      FROM loan_defaulters
  GROUP BY delinq_2yrs
  ORDER BY total DESC
""").show(40)


+--------+
|count(1)|
+--------+
|       0|
+--------+

+-----------+-------+
|delinq_2yrs|  total|
+-----------+-------+
|          0|1839141|
|          1| 281337|
|          2|  81285|
|          3|  29545|
|          4|  13180|
|          5|   6601|
|          6|   3719|
|          7|   2063|
|          8|   1226|
|          9|    821|
|         10|    558|
|         11|    363|
|         12|    266|
|         13|    167|
|         14|    123|
|         15|     90|
|         16|     56|
|         17|     33|
|         18|     32|
|         19|     24|
|         20|     19|
|         21|     16|
|         22|      7|
|         24|      6|
|         23|      5|
|         26|      4|
|         29|      2|
|         25|      2|
|         30|      2|
|         27|      1|
|         28|      1|
|         58|      1|
|         35|      1|
|         39|      1|
|         32|      1|
|         42|      1|
|         36|      1|
+-----------+-------+
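
Comparing the two distributions, the count for `0` rises from 1838878 to 1839141, which is two more than the 261 nulls alone would add, and several other buckets grow slightly as well. A plausible explanation is that the integer cast drops the fractional part of values like 20.04, moving them into whole-number buckets. The sketch below models that cast-then-fill behaviour in plain Python; it is not Spark code, `cast_to_int` and `fill_null` are hypothetical helpers, and the value 0.5 in the sample is invented for illustration.

```python
import math
from collections import Counter
from typing import List, Optional


def cast_to_int(value: Optional[float]) -> Optional[int]:
    """Model a float-to-integer cast: drop the fractional part, keep null as null."""
    return None if value is None else math.trunc(value)


def fill_null(value: Optional[int], default: int = 0) -> int:
    """Model fillna: replace null with the default, leave other values alone."""
    return default if value is None else value


# 20.04, 18.53 and 9.56 appear in the distinct-value output; 0.5 is invented.
sample: List[Optional[float]] = [0.0, 1.0, None, 20.04, 18.53, 0.5, 9.56]

cleaned = [fill_null(cast_to_int(v)) for v in sample]
print(cleaned)              # [0, 1, 0, 20, 18, 0, 9]
print(Counter(cleaned)[0])  # 3: one real zero, one null, one truncated 0.5
```

If the fractional values are considered invalid rather than approximately right, filtering them out before the cast would be an alternative to truncating them.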



### 9. Filter primary defaulters

In [9]:
# Keep those with any recent delinquencies
loans_def_delinq_df = spark.sql("""
    SELECT member_id, delinq_2yrs, delinq_amnt,
           CAST(mths_since_last_delinq AS INT) AS mths_since_last_delinq
      FROM loan_defaulters
     WHERE delinq_2yrs > 0
        OR mths_since_last_delinq > 0
""")
loans_def_delinq_df

| member_id | delinq_2yrs | delinq_amnt | mths_since_last_delinq |
|---|---|---|---|
| 9cb79aa7323e81be1... | 2 | 0.0 | 11 |
| aac68850fdac09fd0... | 1 | 0.0 | 21 |
| c89986155a070db2e... | 1 | 0.0 | 5 |
| 6e8d94bf446e97025... | 0 | 0.0 | 36 |
| 42f73fd8a01f1c475... | 0 | 0.0 | 46 |
| 1eef79a0e79b72c7a... | 1 | 0.0 | 21 |
| 1dd1d1b51473d4993... | 0 | 0.0 | 44 |
| ec1953dba2cfb89ad... | 2 | 0.0 | 13 |
| 8241a6bb3a9350fb8... | 0 | 0.0 | 57 |
| cdc94fa1c29a6a70a... | 0 | 0.0 | 44 |


In [10]:
print("Primary defaulters count:", loans_def_delinq_df.count())

Primary defaulters count: 1106163
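
Because `mths_since_last_delinq` is null for many rows, the `WHERE` clause relies on SQL's three-valued logic: a comparison against null yields null, `TRUE OR NULL` is true, `FALSE OR NULL` is null, and only rows whose condition is true are kept. The sketch below models this in plain Python, with `None` standing in for SQL null; `sql_gt`, `sql_or`, and `keep_row` are hypothetical names, not Spark functions.

```python
from typing import Optional


def sql_gt(value: Optional[float], threshold: float) -> Optional[bool]:
    """value > threshold, returning None (SQL NULL) when value is null."""
    return None if value is None else value > threshold


def sql_or(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """SQL OR: true if either side is true, null if undecided, else false."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def keep_row(delinq_2yrs: Optional[int], mths_since_last_delinq: Optional[float]) -> bool:
    """A WHERE clause keeps a row only when the condition is exactly true."""
    condition = sql_or(sql_gt(delinq_2yrs, 0), sql_gt(mths_since_last_delinq, 0))
    return condition is True


print(keep_row(2, 11))    # True: both sides true
print(keep_row(0, 36))    # True: only the second side is true
print(keep_row(1, None))  # True: TRUE OR NULL is true
print(keep_row(0, None))  # False: FALSE OR NULL is NULL, so the row is dropped
```

A member with no delinquencies in the last two years and no recorded `mths_since_last_delinq` is therefore excluded, which matches the intent of the filter.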


### 10. Filter all records with credit issues

In [11]:
# Broader credit-issue filter: public records, bankruptcies, or recent enquiries
loans_def_records_enq_df = spark.sql("""
    SELECT *
      FROM loan_defaulters
     WHERE pub_rec                   > 0
        OR pub_rec_bankruptcies      > 0
        OR inq_last_6mths            > 0
""")
loans_def_records_enq_df.show(5)


+--------------------+-----------+-----------+-------+--------------------+--------------+------------------+----------------------+----------------------+
|           member_id|delinq_2yrs|delinq_amnt|pub_rec|pub_rec_bankruptcies|inq_last_6mths|total_rec_late_fee|mths_since_last_delinq|mths_since_last_record|
+--------------------+-----------+-----------+-------+--------------------+--------------+------------------+----------------------+----------------------+
|0dd2bbc517e3c8f9e...|          0|        0.0|    1.0|                 1.0|           3.0|               0.0|                  null|                 115.0|
|458458599d3df3bfc...|          0|        0.0|    1.0|                 1.0|           1.0|               0.0|                  null|                  76.0|
|f1efcf7dfbfef21be...|          0|        0.0|    0.0|                 0.0|           1.0|               0.0|                  null|                  null|
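+--------------------+-----------+-----------+-------+--------------------+--------------+------------------+----------------------+----------------------+
(output truncated)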

In [12]:
print("Records with any credit issues:", loans_def_records_enq_df.count())

Records with any credit issues: 1070125


### 11. Save cleaned DataFrames

In [13]:
# 11a. Save primary defaulters
(
    loans_def_delinq_df.write
      .format("parquet")
      .mode("overwrite")
      .option("path", "/user/itv017499/lendingclubproject/cleaned/loans_defaulters_parquet")
      .save()
)

# 11b. Save broader credit-issue records
(
    loans_def_records_enq_df.write
      .format("parquet")
      .mode("overwrite")
      .option("path", "/user/itv017499/lendingclubproject/cleaned/loans_defaulters_records_parquet")
      .save()
)
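
The output paths above hard-code a user ID, while step 1 already derives the warehouse directory from `getpass.getuser()`. One possible way to keep the two consistent is to build the output paths from the same username. This is a sketch only: `cleaned_output_path` is a hypothetical helper, and the directory layout is assumed to follow the paths used in this step.

```python
import getpass
from typing import Optional


def cleaned_output_path(dataset: str, username: Optional[str] = None) -> str:
    """Build the cleaned-data path for a dataset under the given user's home.

    Falls back to the OS username, as step 1 does for the warehouse directory.
    """
    user = username or getpass.getuser()
    return f"/user/{user}/lendingclubproject/cleaned/{dataset}"


# Passing the user ID explicitly reproduces the paths used above.
print(cleaned_output_path("loans_defaulters_parquet", "itv017499"))
print(cleaned_output_path("loans_defaulters_records_parquet", "itv017499"))
```

The returned string could then be passed to `.option("path", ...)` in place of the literal.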