#### Data cleaning plan for loans dataset

1. **Initialize SparkSession**

2. **Define loans schema**

3. **Load raw loans CSV**
   - Set `header=True`
   - Apply predefined schema

4. **Inspect raw DataFrame**
   - Display first few rows
   - Print schema

5. **Add ingestion timestamp** (`ingest_date`)

6. **Create temporary view**

7. **Quick data quality checks**
   - Total record count
   - Count of `loan_amount` nulls

8. **Drop rows with nulls in critical columns**
   - Re-check the record count after the drop

9. **Clean `loan_term_months`**
   - Strip non-digit characters (such as the `" months"` suffix) via `regexp_replace`
   - Cast to integer and convert to years as `loan_term_years`

10. **Inspect and clean `loan_purpose`**
    - Show distinct values & counts
    - Define a lookup list of main purposes
    - Map all others to `"other"`

11. **Save cleaned loans data**
    - Parquet to `/cleaned/loans_parquet`
    - CSV to `/cleaned/loans_csv`


### 1. Initialize SparkSession

In [1]:
from pyspark.sql import SparkSession
import getpass

# Use the current OS user to build a per-user warehouse directory
username = getpass.getuser()

spark = (
    SparkSession.builder
      .config("spark.ui.port", "0")  # port 0 lets Spark pick a free UI port
      .config("spark.sql.warehouse.dir", f"/user/{username}/warehouse")
      .config("spark.shuffle.useOldFetchProtocol", "true")
      .enableHiveSupport()
      .master("yarn")
      .getOrCreate()
)

### 2. Define loans schema

In [2]:
# Defining the exact types and names of each loan column
loans_schema = """
    loan_id string,
    member_id string,
    loan_amount float,
    funded_amount float,
    loan_term_months string,
    interest_rate string,
    monthly_installment float,
    issue_date string,
    loan_status string,
    loan_purpose string,
    loan_title string
"""

### 3. Load raw loans CSV

In [3]:
loans_raw_df = (
    spark.read
      .format("csv")
      .option("header", True)         
      .schema(loans_schema) 
      .load("/public/trendytech/lendingclubproject/raw/loans_data_csv")
)

In [4]:
# Create a temp view for SQL queries
loans_raw_df.createOrReplaceTempView('loans')

In [5]:
# Quick peek at the first five rows
spark.sql("SELECT * FROM loans LIMIT 5")

loan_id,member_id,loan_amount,funded_amount,loan_term_months,interest_rate,monthly_installment,issue_date,loan_status,loan_purpose,loan_title
145499677,a703357afc7be3fe3...,10000.0,10000.0,36 months,8.19,314.25,Dec-2018,Fully Paid,debt_consolidation,Debt consolidation
144538467,a0c637c3df6764663...,5000.0,5000.0,36 months,15.02,173.38,Dec-2018,Current,other,Other
145515405,63571114d3a96e5bc...,7500.0,7500.0,36 months,10.33,243.17,Dec-2018,Current,debt_consolidation,Debt consolidation
145207340,4db14234c3f2f87c1...,20400.0,20400.0,60 months,16.14,497.61,Dec-2018,Current,home_improvement,Home improvement
145467050,88a6f97ff3afc51b6...,12000.0,12000.0,36 months,7.02,370.64,Dec-2018,Current,credit_card,Credit card refin...


### 4. Inspect raw DataFrame

In [6]:
# Verify column types and nullability
loans_raw_df.printSchema()

root
 |-- loan_id: string (nullable = true)
 |-- member_id: string (nullable = true)
 |-- loan_amount: float (nullable = true)
 |-- funded_amount: float (nullable = true)
 |-- loan_term_months: string (nullable = true)
 |-- interest_rate: string (nullable = true)
 |-- monthly_installment: float (nullable = true)
 |-- issue_date: string (nullable = true)
 |-- loan_status: string (nullable = true)
 |-- loan_purpose: string (nullable = true)
 |-- loan_title: string (nullable = true)



### 5. Add ingestion timestamp (`ingest_date`)

In [7]:
from pyspark.sql.functions import current_timestamp

# Record when each row was ingested
loans_df_ingested = loans_raw_df.withColumn("ingest_date", current_timestamp())

### 6. Create temporary view

In [8]:
# Re-register the temp view so SQL queries see the new ingest_date column
loans_df_ingested.createOrReplaceTempView('loans')

In [9]:
# Quick peek at the first five rows
spark.sql("SELECT * FROM loans LIMIT 5")

loan_id,member_id,loan_amount,funded_amount,loan_term_months,interest_rate,monthly_installment,issue_date,loan_status,loan_purpose,loan_title,ingest_date
145499677,a703357afc7be3fe3...,10000.0,10000.0,36 months,8.19,314.25,Dec-2018,Fully Paid,debt_consolidation,Debt consolidation,2025-04-29 06:02:...
144538467,a0c637c3df6764663...,5000.0,5000.0,36 months,15.02,173.38,Dec-2018,Current,other,Other,2025-04-29 06:02:...
145515405,63571114d3a96e5bc...,7500.0,7500.0,36 months,10.33,243.17,Dec-2018,Current,debt_consolidation,Debt consolidation,2025-04-29 06:02:...
145207340,4db14234c3f2f87c1...,20400.0,20400.0,60 months,16.14,497.61,Dec-2018,Current,home_improvement,Home improvement,2025-04-29 06:02:...
145467050,88a6f97ff3afc51b6...,12000.0,12000.0,36 months,7.02,370.64,Dec-2018,Current,credit_card,Credit card refin...,2025-04-29 06:02:...


### 7. Quick data quality checks

In [10]:
# Total number of records
spark.sql("SELECT COUNT(*) FROM loans").show()

+--------+
|count(1)|
+--------+
| 2260701|
+--------+



In [11]:

# How many rows are missing loan_amount?
spark.sql("SELECT COUNT(*) FROM loans WHERE loan_amount IS NULL").show()

+--------+
|count(1)|
+--------+
|      33|
+--------+



In [12]:
# Inspect a few of the rows where loan_amount is NULL
spark.sql("SELECT * FROM loans WHERE loan_amount IS NULL LIMIT 5")

loan_id,member_id,loan_amount,funded_amount,loan_term_months,interest_rate,monthly_installment,issue_date,loan_status,loan_purpose,loan_title,ingest_date
Total amount fund...,e3b0c44298fc1c149...,,,,,,,,,,2025-04-29 06:02:...
Total amount fund...,e3b0c44298fc1c149...,,,,,,,,,,2025-04-29 06:02:...
Total amount fund...,e3b0c44298fc1c149...,,,,,,,,,,2025-04-29 06:02:...
Total amount fund...,e3b0c44298fc1c149...,,,,,,,,,,2025-04-29 06:02:...
Total amount fund...,e3b0c44298fc1c149...,,,,,,,,,,2025-04-29 06:02:...
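
The `member_id` prefix in these rows is worth a second look. It matches the start of the SHA-256 digest of an empty input, which would be consistent with these being summary or footer lines with no real member behind them. That `member_id` was produced by SHA-256 hashing upstream is an assumption here; the notebook does not say so. A quick check in plain Python:

```python
import hashlib

# SHA-256 digest of an empty byte string
empty_digest = hashlib.sha256(b"").hexdigest()

# Truncated member_id prefix displayed for the NULL rows above
observed_prefix = "e3b0c44298fc1c149"

print(empty_digest.startswith(observed_prefix))  # True
```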


### 8. Drop rows with nulls in critical columns

In [13]:
# Define which columns must not be null
columns_to_check = [
    "loan_amount", "funded_amount", "loan_term_months",
    "monthly_installment", "issue_date", "loan_status", "loan_purpose"
]

In [14]:
# Remove any rows missing those
loans_filtered_df = loans_df_ingested.na.drop(subset=columns_to_check)

In [15]:
print("After drop:", loans_filtered_df.count())

After drop: 2260667
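
The counts reported so far can be reconciled with a little arithmetic. Since `loan_amount` is in `columns_to_check`, all 33 rows with a NULL `loan_amount` are dropped, so any additional dropped rows must be NULL in one of the other critical columns. A small sketch using only the numbers printed above:

```python
total_rows = 2_260_701        # COUNT(*) before the drop, In [10]
null_loan_amount = 33         # rows with NULL loan_amount, In [11]
rows_after_drop = 2_260_667   # count after na.drop, In [15]

dropped = total_rows - rows_after_drop
dropped_for_other_columns = dropped - null_loan_amount

print(dropped)                    # 34
print(dropped_for_other_columns)  # 1
```

So one row had a `loan_amount` but was missing another critical column.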


In [16]:
loans_filtered_df.createOrReplaceTempView("loans")

### 9. Clean `loan_term_months`

In [17]:
from pyspark.sql.functions import regexp_replace, col

# Strip non-digit characters (e.g. "36 months" → "36"), cast to int,
# and convert months to years
loans_term_modified_df = (
    loans_filtered_df
      .withColumn(
         "loan_term_months",
         regexp_replace(col("loan_term_months"), r"\D", "")
      )
      .withColumn("loan_term_years", col("loan_term_months").cast("int") / 12)
      .drop("loan_term_months")
)
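
The regex step can be sanity-checked without a cluster. The sketch below mirrors the same logic in plain Python with `re.sub`; `term_to_years` is a hypothetical helper, and the sample inputs (including one with a leading space) are illustrative rather than taken from the dataset. For ASCII input such as these, `\D` behaves the same in Python and in Spark's Java regex engine.

```python
import re


def term_to_years(raw_term: str) -> float:
    """Remove every non-digit character, then convert months to years."""
    digits = re.sub(r"\D", "", raw_term)
    # Note: int("") raises ValueError here, whereas Spark's cast("int")
    # would produce NULL for an empty string.
    return int(digits) / 12


for raw in ["36 months", " 60 months"]:
    print(repr(raw), "->", term_to_years(raw))
```

Dividing by 12 yields a floating-point value, which is why `loan_term_years` appears as `double` in the schema.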


In [18]:
loans_term_modified_df.printSchema()

root
 |-- loan_id: string (nullable = true)
 |-- member_id: string (nullable = true)
 |-- loan_amount: float (nullable = true)
 |-- funded_amount: float (nullable = true)
 |-- interest_rate: string (nullable = true)
 |-- monthly_installment: float (nullable = true)
 |-- issue_date: string (nullable = true)
 |-- loan_status: string (nullable = true)
 |-- loan_purpose: string (nullable = true)
 |-- loan_title: string (nullable = true)
 |-- ingest_date: timestamp (nullable = false)
 |-- loan_term_years: double (nullable = true)



In [19]:
loans_term_modified_df.createOrReplaceTempView("loans")

In [20]:
spark.sql("SELECT * FROM loans limit 10")

loan_id,member_id,loan_amount,funded_amount,interest_rate,monthly_installment,issue_date,loan_status,loan_purpose,loan_title,ingest_date,loan_term_years
91609139,f4d5593aeb85f0302...,2800.0,2800.0,10.49,91.0,Oct-2016,Fully Paid,car,Car financing,2025-04-29 06:02:...,3.0
91302887,2a6b46e98f2f63710...,8000.0,8000.0,13.49,271.45,Oct-2016,Fully Paid,credit_card,Credit card refin...,2025-04-29 06:02:...,3.0
91313928,2aaeecbb0cb90f9a5...,8000.0,8000.0,16.99,285.19,Oct-2016,Charged Off,moving,Moving and reloca...,2025-04-29 06:02:...,3.0
91343295,064393f5934a2eca2...,10125.0,10125.0,11.49,333.84,Oct-2016,Charged Off,credit_card,Credit card refin...,2025-04-29 06:02:...,3.0
91503678,5461aa5fb52d7380e...,18000.0,18000.0,6.99,555.71,Oct-2016,Fully Paid,debt_consolidation,Debt consolidation,2025-04-29 06:02:...,3.0
91473933,5efe3afbace24e65a...,6000.0,6000.0,6.99,185.24,Oct-2016,Fully Paid,debt_consolidation,Debt consolidation,2025-04-29 06:02:...,3.0
91301376,98f50c55db8d927c5...,15000.0,15000.0,17.99,542.22,Oct-2016,Fully Paid,debt_consolidation,Debt consolidation,2025-04-29 06:02:...,3.0
91609173,e11da03be1a015884...,13000.0,13000.0,5.32,391.5,Oct-2016,Current,credit_card,Credit card refin...,2025-04-29 06:02:...,3.0
91021798,d15f8e7c07f44fc53...,5000.0,5000.0,5.32,150.58,Oct-2016,Fully Paid,debt_consolidation,Debt consolidation,2025-04-29 06:02:...,3.0
91052844,eb0132cfaebe240cd...,16000.0,16000.0,14.49,550.66,Oct-2016,Charged Off,debt_consolidation,Debt consolidation,2025-04-29 06:02:...,3.0


### 10. Inspect and clean `loan_purpose`

In [21]:
# Check what purposes we have
spark.sql("SELECT loan_purpose, COUNT(*) AS cnt FROM loans GROUP BY loan_purpose ORDER BY cnt DESC").show(20)

+--------------------+-------+
|        loan_purpose|    cnt|
+--------------------+-------+
|  debt_consolidation|1277790|
|         credit_card| 516926|
|    home_improvement| 150440|
|               other| 139413|
|      major_purchase|  50429|
|             medical|  27481|
|      small_business|  24659|
|                 car|  24009|
|            vacation|  15525|
|              moving|  15402|
|               house|  14131|
|             wedding|   2351|
|    renewable_energy|   1445|
|         educational|    412|
|but we cant all b...|      1|
|but not much info...|      1|
|and if they are a...|      1|
|putting together ...|      1|
|on one of the bus...|      1|
|I became his prim...|      1|
+--------------------+-------+
only showing top 20 rows



In [22]:
# Define the main-purpose list; everything else → "other"
main_purposes = [
    "debt_consolidation", "credit_card", "home_improvement",
    "other", "major_purchase", "medical", "small_business",
    "car", "vacation", "moving", "house", "wedding", "renewable_energy",
    "educational"
]

In [23]:
from pyspark.sql.functions import when

# Map any non‑standard purpose to "other"
loans_purpose_modified = (
    loans_term_modified_df.withColumn(
      "loan_purpose",
      when(col("loan_purpose").isin(main_purposes), col("loan_purpose"))
        .otherwise("other")
    )
)
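
The `when(...).otherwise(...)` expression is a membership test with a default. A minimal plain-Python equivalent is sketched below; `normalize_purpose` is a hypothetical helper, and the free-text sample is made up for illustration, not copied from the data.

```python
MAIN_PURPOSES = {
    "debt_consolidation", "credit_card", "home_improvement",
    "other", "major_purchase", "medical", "small_business",
    "car", "vacation", "moving", "house", "wedding", "renewable_energy",
    "educational",
}


def normalize_purpose(purpose):
    """Keep a recognised purpose as-is; map anything else to 'other'."""
    return purpose if purpose in MAIN_PURPOSES else "other"


samples = ["credit_card", "wedding", "some stray free-text fragment", None]
print([normalize_purpose(p) for p in samples])
# ['credit_card', 'wedding', 'other', 'other']
```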

In [24]:
loans_purpose_modified.createOrReplaceTempView("loans")

In [25]:
# Verify that the stray free-text values have been folded into "other"
spark.sql("""
    SELECT loan_purpose, COUNT(*) AS cnt 
    FROM loans 
    GROUP BY loan_purpose 
    ORDER BY cnt DESC
""")

loan_purpose,cnt
debt_consolidation,1277790
credit_card,516926
home_improvement,150440
other,139667
major_purchase,50429
medical,27481
small_business,24659
car,24009
vacation,15525
moving,15402
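
The jump in the `other` count can be cross-checked against the earlier numbers. The sketch below uses only counts printed in this notebook: the 14 standard purposes from In [21], the row count after the null drop, and the new `other` count.

```python
# Counts of the 14 standard purposes before remapping (In [21])
counts_before = {
    "debt_consolidation": 1_277_790,
    "credit_card": 516_926,
    "home_improvement": 150_440,
    "other": 139_413,
    "major_purchase": 50_429,
    "medical": 27_481,
    "small_business": 24_659,
    "car": 24_009,
    "vacation": 15_525,
    "moving": 15_402,
    "house": 14_131,
    "wedding": 2_351,
    "renewable_energy": 1_445,
    "educational": 412,
}
rows_after_drop = 2_260_667   # In [15]
other_after = 139_667         # In [25]

# Rows whose purpose was not one of the 14 standard values
non_standard = rows_after_drop - sum(counts_before.values())

print(non_standard)                                          # 254
print(counts_before["other"] + non_standard == other_after)  # True
```

The 254 rows with non-standard purposes account exactly for the increase in `other`.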


### 11. Save cleaned loans data

In [26]:
# Parquet for efficient downstream queries
loans_purpose_modified.write \
    .format("parquet") \
    .mode("overwrite") \
    .option("path", "/user/itv017499/lendingclubproject/cleaned/loans_parquet") \
    .save()
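
The output path above hardcodes a user directory, while the session setup already looks up the current user with `getpass.getuser()`. One possible way to keep the two consistent is to build the path from the username. This is only a sketch: `cleaned_path` is a hypothetical helper, and the directory layout is copied from the path used above.

```python
import posixpath


def cleaned_path(user: str, dataset: str) -> str:
    """Build the per-user output path for a cleaned dataset."""
    return posixpath.join("/user", user, "lendingclubproject", "cleaned", dataset)


# In the notebook, `username` from getpass.getuser() could be passed instead
print(cleaned_path("itv017499", "loans_parquet"))
# /user/itv017499/lendingclubproject/cleaned/loans_parquet
```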