Load the Data in Azure Databricks: Read the loan CSV file stored in Azure Data Lake Storage (ADLS), mounted here under /mnt/dataset.

In [0]:
file_path = "/mnt/dataset/loan.csv"
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(file_path)
df.show()

+-----------+---+------+-------------------+--------------+-----------+------+-----------+-------------+------------------+-----------+-------+------------+----------------+------------------+
|Customer_ID|Age|Gender|         Occupation|Marital Status|Family Size|Income|Expenditure|Use Frequency|     Loan Category|Loan Amount|Overdue| Debt Record| Returned Cheque| Dishonour of Bill|
+-----------+---+------+-------------------+--------------+-----------+------+-----------+-------------+------------------+-----------+-------+------------+----------------+------------------+
|    IB14001| 30|  MALE|       BANK MANAGER|        SINGLE|          4| 50000|      22199|            6|           HOUSING| 10,00,000 |      5|      42,898|               6|                 9|
|    IB14008| 44|  MALE|          PROFESSOR|       MARRIED|          6| 51000|      19999|            4|          SHOPPING|     50,000|      3|      33,999|               1|                 5|
|    IB14012| 30|FEMALE|           

Clean the Data:

Remove leading/trailing spaces from column names.
Check for null or inconsistent values using summary statistics.

In [0]:
from pyspark.sql.functions import col

# Strip stray whitespace from the header names (e.g. " Debt Record" becomes "Debt Record").
df = df.select([col(c).alias(c.strip()) for c in df.columns])
df.show()

+-----------+---+------+-------------------+--------------+-----------+------+-----------+-------------+------------------+-----------+-------+-----------+---------------+-----------------+
|Customer_ID|Age|Gender|         Occupation|Marital Status|Family Size|Income|Expenditure|Use Frequency|     Loan Category|Loan Amount|Overdue|Debt Record|Returned Cheque|Dishonour of Bill|
+-----------+---+------+-------------------+--------------+-----------+------+-----------+-------------+------------------+-----------+-------+-----------+---------------+-----------------+
|    IB14001| 30|  MALE|       BANK MANAGER|        SINGLE|          4| 50000|      22199|            6|           HOUSING| 10,00,000 |      5|     42,898|              6|                9|
|    IB14008| 44|  MALE|          PROFESSOR|       MARRIED|          6| 51000|      19999|            4|          SHOPPING|     50,000|      3|     33,999|              1|                5|
|    IB14012| 30|FEMALE|            DENTIST|      

In [0]:
df.describe().show()

+-------+-----------+-----------------+------+---------------+--------------+----------------+-----------------+------------------+------------------+-------------+-----------+------------------+-----------------+-----------------+------------------+
|summary|Customer_ID|              Age|Gender|     Occupation|Marital Status|     Family Size|           Income|       Expenditure|     Use Frequency|Loan Category|Loan Amount|           Overdue|      Debt Record|  Returned Cheque| Dishonour of Bill|
+-------+-----------+-----------------+------+---------------+--------------+----------------+-----------------+------------------+------------------+-------------+-----------+------------------+-----------------+-----------------+------------------+
|  count|        500|              500|   500|            500|           500|             500|              468|               481|               500|          500|        500|               500|              500|              500|               5
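
The `count` row doubles as a missing-value check: each fully visible column reports 500 non-null values except Income (468) and Expenditure (481). The sketch below works through that arithmetic. `null_counts` is a hypothetical helper, not part of the original notebook, that would compute the same figures directly in Spark. Its PySpark import is deferred so the arithmetic also runs without a cluster.

```python
# Non-null counts taken from the describe() output above.
total_rows = 500
non_null = {"Income": 468, "Expenditure": 481}

# Missing values per column = total rows - non-null count.
missing = {name: total_rows - n for name, n in non_null.items()}
print(missing)  # {'Income': 32, 'Expenditure': 19}


def null_counts(df):
    """Hypothetical helper: one-row DataFrame holding the null count of every column."""
    # Imported here so the arithmetic above runs even without PySpark installed.
    from pyspark.sql.functions import col, count, when

    # when() yields a non-null value only for null cells, and count() counts those.
    return df.select(
        [count(when(col(c).isNull(), c)).alias(c) for c in df.columns]
    )
```

In the notebook this would be called as `null_counts(df).show()`.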

### Income Distribution
### Analyze the range and distribution of customer income. You can also categorize customers into income brackets.

In [0]:
from pyspark.sql.functions import when

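# Caveat: for a null Income both comparisons evaluate to null, so those rows
# fall through to otherwise() and are counted as "High Income".
income_bracket_df = df.withColumn(
    "Income_Bracket",
    when(df["Income"] < 30000, "Low Income")
    .when((df["Income"] >= 30000) & (df["Income"] <= 60000), "Middle Income")
    .otherwise("High Income")
)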

income_bracket_df.groupBy("Income_Bracket").count().show()

+--------------+-----+
|Income_Bracket|count|
+--------------+-----+
| Middle Income|  268|
|   High Income|  230|
|    Low Income|    2|
+--------------+-----+



In [0]:
display(income_bracket_df.groupBy("Income_Bracket").count())

Income_Bracket,count
Middle Income,268
High Income,230
Low Income,2


Databricks visualization. Run in Databricks to view.
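
The summary statistics show only 468 non-null Income values out of 500 rows, so the bracket counts above presumably include rows with no income at all. The sketch below gives missing income its own bucket. The label "Unknown" and both helper names are assumptions introduced here for illustration. The plain-Python function pins down the rules, and the second helper builds what should be the equivalent Spark expression.

```python
def income_bracket(income):
    """Plain-Python mirror of the bracket rules, with a bucket for missing income."""
    if income is None:
        return "Unknown"  # assumed label, not in the original notebook
    if income < 30000:
        return "Low Income"
    if income <= 60000:
        return "Middle Income"
    return "High Income"


def income_bracket_column(income_col="Income"):
    """Hypothetical helper returning the equivalent Spark Column expression."""
    from pyspark.sql.functions import col, when  # deferred: needs PySpark

    income = col(income_col)
    return (
        when(income.isNull(), "Unknown")
        .when(income < 30000, "Low Income")
        .when(income <= 60000, "Middle Income")
        .otherwise("High Income")
    )


print([income_bracket(v) for v in (None, 28366, 50000, 60001)])
# ['Unknown', 'Low Income', 'Middle Income', 'High Income']
```

In the notebook, the expression would be applied with `df.withColumn("Income_Bracket", income_bracket_column())`.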

### Loan Categories and Frequency
### Find the most common loan categories and their frequencies.

In [0]:
loan_category_df = df.groupBy("Loan Category").count().orderBy("count", ascending=False)
loan_category_df.show()

+------------------+-----+
|     Loan Category|count|
+------------------+-----+
|         GOLD LOAN|   77|
|           HOUSING|   67|
|        AUTOMOBILE|   60|
|        TRAVELLING|   53|
|       RESTAURANTS|   41|
|COMPUTER SOFTWARES|   35|
|          SHOPPING|   35|
|          BUSINESS|   24|
|  EDUCATIONAL LOAN|   20|
|        RESTAURANT|   20|
|           DINNING|   14|
|       ELECTRONICS|   14|
|   HOME APPLIANCES|   14|
|       AGRICULTURE|   12|
|       BOOK STORES|    7|
|          BUILDING|    7|
+------------------+-----+



In [0]:
display(loan_category_df)

Loan Category,count
GOLD LOAN,77
HOUSING,67
AUTOMOBILE,60
TRAVELLING,53
RESTAURANTS,41
COMPUTER SOFTWARES,35
SHOPPING,35
BUSINESS,24
EDUCATIONAL LOAN,20
RESTAURANT,20


Databricks visualization. Run in Databricks to view.
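
The frequency table lists both RESTAURANTS (41) and RESTAURANT (20), and DINNING looks like a misspelling, so some categories may be split across inconsistent labels. The sketch below shows one way to normalize them before counting. The mapping is an assumption: whether these labels really denote the same category is a business decision, and the helper names are hypothetical.

```python
# Assumed mapping; confirm with the data owner before applying it.
CATEGORY_FIXES = {
    "RESTAURANT": "RESTAURANTS",  # singular/plural variants assumed to be one category
    "DINNING": "DINING",          # apparent misspelling
}


def normalize_category(label):
    """Trim, upper-case, and apply the assumed fixes to one category label."""
    if label is None:
        return None
    cleaned = label.strip().upper()
    return CATEGORY_FIXES.get(cleaned, cleaned)


def normalize_category_column(df, column="Loan Category"):
    """Hypothetical Spark version: the same rules applied to a DataFrame column."""
    from pyspark.sql.functions import col, trim, upper  # deferred: needs PySpark

    cleaned = df.withColumn(column, upper(trim(col(column))))
    # DataFrame.replace accepts a dict of old -> new values for the given columns.
    return cleaned.replace(CATEGORY_FIXES, subset=[column])


print(normalize_category(" restaurant "))  # RESTAURANTS
```

Under that assumption, the merged RESTAURANTS category would hold 41 + 20 = 61 loans.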

### Overdue Analysis
### Identify customers with high overdue amounts and compare them with their income and expenditure.

In [0]:
from pyspark.sql.functions import expr

overdue_analysis_df = df.withColumn(
    "Overdue_to_Income", expr("Overdue / Income")
).withColumn(
    "Overdue_to_Expenditure", expr("Overdue / Expenditure")
)
overdue_analysis_df.orderBy("Overdue_to_Income", ascending=False).show()

+-----------+---+------+-----------------+--------------+-----------+------+-----------+-------------+------------------+-----------+-------+-----------+---------------+-----------------+--------------------+----------------------+
|Customer_ID|Age|Gender|       Occupation|Marital Status|Family Size|Income|Expenditure|Use Frequency|     Loan Category|Loan Amount|Overdue|Debt Record|Returned Cheque|Dishonour of Bill|   Overdue_to_Income|Overdue_to_Expenditure|
+-----------+---+------+-----------------+--------------+-----------+------+-----------+-------------+------------------+-----------+-------+-----------+---------------+-----------------+--------------------+----------------------+
|    IB14693| 27|  MALE|           DOCTOR|        SINGLE|          7| 28366|      29258|            2|         GOLD LOAN|  1,659,986|      9|      20611|              5|                5| 3.17281252203342E-4|  3.076081755417321...|
|    IB14685| 26|  MALE|     BANK MANAGER|        SINGLE|          5| 29

In [0]:
display(overdue_analysis_df.orderBy("Overdue_to_Income", ascending=False))

Customer_ID,Age,Gender,Occupation,Marital Status,Family Size,Income,Expenditure,Use Frequency,Loan Category,Loan Amount,Overdue,Debt Record,Returned Cheque,Dishonour of Bill,Overdue_to_Income,Overdue_to_Expenditure
IB14693,27,MALE,DOCTOR,SINGLE,7,28366.0,29258.0,2,GOLD LOAN,1659986,9,20611,5,5,0.000317281252203342,0.0003076081755417321
IB14685,26,MALE,BANK MANAGER,SINGLE,5,29565.0,19490.0,9,HOUSING,1767908,9,72653,4,6,0.00030441400304414006,0.0004617752693689071
IB14619,43,MALE,TEACHER,SINGLE,4,35020.0,26286.0,8,SHOPPING,162449,9,22862,3,3,0.0002569960022844089,0.0003423875827436658
IB14459,32,MALE,DOCTOR,MARRIED,4,35472.0,18340.0,6,SHOPPING,571540,9,52731,0,0,0.0002537212449255751,0.0004907306434023992
IB14321,28,MALE,TEACHER,MARRIED,7,31747.0,17995.0,8,COMPUTER SOFTWARES,720712,8,66706,1,7,0.00025199231423441587,0.0004445679355376493
IB14380,41,FEMALE,NURSE,MARRIED,5,36784.0,,4,HOUSING,1232534,9,60243,3,1,0.0002446715963462375,
IB14070,40,MALE,PUBLIC WORKS,MARRIED,4,38000.0,20000.0,3,GOLD LOAN,400000,9,19954,3,2,0.00023684210526315788,0.00045
IB14210,40,MALE,PUBLIC WORKS,MARRIED,4,38000.0,20000.0,3,GOLD LOAN,400000,9,19954,3,2,0.00023684210526315788,0.00045
IB14285,40,MALE,PUBLIC WORKS,MARRIED,4,38000.0,20000.0,3,GOLD LOAN,400000,9,19954,3,2,0.00023684210526315788,0.00045
IB14086,51,FEMALE,TECHNICIAN,MARRIED,5,30000.0,,5,RESTAURANTS,125463,7,52634,4,10,0.00023333333333333333,


Databricks visualization. Run in Databricks to view.
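
In the output above, customers IB14070, IB14210, and IB14285 share every value except the ID. Whether these are three real customers or repeated entries cannot be decided from the data alone, but such groups are worth flagging before ranking. The sketch below detects them on a few fields copied from the rows above. `repeated_profiles` is a hypothetical helper showing how the same check might look in Spark.

```python
from collections import Counter

# Rows copied from the output above: Customer_ID first, then a few other fields.
rows = [
    ("IB14070", 40, "MALE", "PUBLIC WORKS", "MARRIED", 4, 38000.0, 20000.0),
    ("IB14210", 40, "MALE", "PUBLIC WORKS", "MARRIED", 4, 38000.0, 20000.0),
    ("IB14285", 40, "MALE", "PUBLIC WORKS", "MARRIED", 4, 38000.0, 20000.0),
    ("IB14086", 51, "FEMALE", "TECHNICIAN", "MARRIED", 5, 30000.0, None),
]

# Count how often each record occurs once the ID is ignored.
profile_counts = Counter(row[1:] for row in rows)
repeated = {profile: n for profile, n in profile_counts.items() if n > 1}
print(len(repeated), max(repeated.values()))  # 1 3


def repeated_profiles(df, id_column="Customer_ID"):
    """Hypothetical Spark version: groups of rows identical except for the ID."""
    other_columns = [c for c in df.columns if c != id_column]
    return df.groupBy(other_columns).count().filter("count > 1")
```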

### Expenditure Patterns
### Analyze expenditure patterns across family sizes or marital statuses.

In [0]:
expenditure_by_family_size = df.groupBy("Family Size").avg("Expenditure").orderBy("Family Size")
expenditure_by_family_size.show()

+-----------+------------------+
|Family Size|  avg(Expenditure)|
+-----------+------------------+
|          2| 30383.32786885246|
|          3|25790.014705882353|
|          4|25633.577981651375|
|          5|25842.114583333332|
|          6| 30232.19540229885|
|          7|           28854.2|
+-----------+------------------+



In [0]:
display(expenditure_by_family_size)

Family Size,avg(Expenditure)
2,30383.32786885246
3,25790.014705882357
4,25633.57798165137
5,25842.11458333333
6,30232.19540229885
7,28854.2


Databricks visualization. Run in Databricks to view.

### Debt and Payment Issues
### Identify customers with high counts of "Returned Cheque" or "Dishonour of Bill."

In [0]:
payment_issues_df = df.filter((df["Returned Cheque"] > 3) | (df["Dishonour of Bill"] > 3))
payment_issues_df.show()

+-----------+---+------+-------------------+--------------+-----------+------+-----------+-------------+------------------+-----------+-------+-----------+---------------+-----------------+
|Customer_ID|Age|Gender|         Occupation|Marital Status|Family Size|Income|Expenditure|Use Frequency|     Loan Category|Loan Amount|Overdue|Debt Record|Returned Cheque|Dishonour of Bill|
+-----------+---+------+-------------------+--------------+-----------+------+-----------+-------------+------------------+-----------+-------+-----------+---------------+-----------------+
|    IB14001| 30|  MALE|       BANK MANAGER|        SINGLE|          4| 50000|      22199|            6|           HOUSING| 10,00,000 |      5|     42,898|              6|                9|
|    IB14008| 44|  MALE|          PROFESSOR|       MARRIED|          6| 51000|      19999|            4|          SHOPPING|     50,000|      3|     33,999|              1|                5|
|    IB14018| 29|  MALE|            TEACHER|      

In [0]:
display(payment_issues_df)

Customer_ID,Age,Gender,Occupation,Marital Status,Family Size,Income,Expenditure,Use Frequency,Loan Category,Loan Amount,Overdue,Debt Record,Returned Cheque,Dishonour of Bill
IB14001,30,MALE,BANK MANAGER,SINGLE,4,50000.0,22199.0,6,HOUSING,1000000,5,42898,6,9
IB14008,44,MALE,PROFESSOR,MARRIED,6,51000.0,19999.0,4,SHOPPING,50000,3,33999,1,5
IB14018,29,MALE,TEACHER,MARRIED,5,45767.0,12787.0,3,GOLD LOAN,600000,7,11000,0,4
IB14025,39,FEMALE,TEACHER,MARRIED,6,46619.0,18675.0,4,HOUSING,1209867,8,29999,6,8
IB14027,51,MALE,SYSTEM MANAGER,MARRIED,3,49999.0,19111.0,5,RESTAURANTS,60676,8,13000,2,5
IB14029,24,FEMALE,TEACHER,SINGLE,3,45008.0,17454.0,4,AUTOMOBILE,399435,9,51987,4,7
IB14031,37,FEMALE,SOFTWARE ENGINEER,MARRIED,5,55999.0,23999.0,5,AUTOMOBILE,60999,2,0,5,3
IB14034,32,MALE,PRODUCT ENGINEER,MARRIED,6,,29000.0,7,COMPUTER SOFTWARES,80660,6,4500,5,4
IB14037,54,FEMALE,TEACHER,MARRIED,5,48099.0,19999.0,4,RESTAURANTS,30999,1,12000,7,5
IB14039,45,MALE,ACCOUNT MANAGER,MARRIED,7,45777.0,18452.0,4,GOLD LOAN,987611,7,39999,8,1


# Transform Data for Further Analysis

### Add Derived Columns
### Add meaningful columns such as total liabilities (sum of loan amount, overdue, and debt record).

In [0]:
from pyspark.sql.functions import col

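# Note: "Loan Amount" and "Debt Record" are read as strings with thousands
# separators (e.g. "10,00,000 "), so this sum evaluates to NULL for those rows.
df = df.withColumn(
    "Total_Liabilities",
    col("Loan Amount") + col("Overdue") + col("Debt Record")
)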
df.select("Customer_ID", "Total_Liabilities").show()

+-----------+-----------------+
|Customer_ID|Total_Liabilities|
+-----------+-----------------+
|    IB14001|             NULL|
|    IB14008|             NULL|
|    IB14012|             NULL|
|    IB14018|             NULL|
|    IB14022|             NULL|
|    IB14024|             NULL|
|    IB14025|             NULL|
|    IB14027|             NULL|
|    IB14029|             NULL|
|    IB14031|             NULL|
|    IB14032|             NULL|
|    IB14034|             NULL|
|    IB14037|             NULL|
|    IB14039|             NULL|
|    IB14041|             NULL|
|    IB14042|             NULL|
|    IB14045|             NULL|
|    IB14049|             NULL|
|    IB14050|             NULL|
|    IB14054|             NULL|
+-----------+-----------------+
only showing top 20 rows
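
The NULL results are most likely caused by "Loan Amount" and "Debt Record" being inferred as strings, since values such as "10,00,000 " and "42,898" contain commas and trailing spaces. The sketch below strips the separators and casts to a number before summing. All three helper names are hypothetical. The plain-Python function documents the parsing rule, and the Spark helpers are an untested outline of the same idea.

```python
def parse_amount(text):
    """Plain-Python reference: '10,00,000 ' -> 1000000, None or empty -> None."""
    if text is None:
        return None
    digits = str(text).replace(",", "").strip()
    return int(digits) if digits else None


def amount_column(name):
    """Hypothetical helper: Spark Column with separators removed, cast to long."""
    from pyspark.sql.functions import col, regexp_replace, trim  # deferred: needs PySpark

    return trim(regexp_replace(col(name).cast("string"), ",", "")).cast("long")


def add_total_liabilities(df):
    """Sketch of the derived column built from the cleaned amounts."""
    from pyspark.sql.functions import col  # deferred: needs PySpark

    return df.withColumn(
        "Total_Liabilities",
        amount_column("Loan Amount") + col("Overdue") + amount_column("Debt Record"),
    )


print(parse_amount("10,00,000 "), parse_amount("42,898"))  # 1000000 42898
```

A row still produces NULL if any of the three inputs is missing. Whether to treat a missing amount as zero is a modelling choice left open here.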



### Customer Segmentation
### Segment customers based on usage frequency (low, medium, high).

In [0]:
df = df.withColumn(
    "Use_Frequency_Segment",
    when(df["Use Frequency"] <= 3, "Low")
    .when((df["Use Frequency"] > 3) & (df["Use Frequency"] <= 6), "Medium")
    .otherwise("High")
)

df.groupBy("Use_Frequency_Segment").count().show()

+---------------------+-----+
|Use_Frequency_Segment|count|
+---------------------+-----+
|                 High|  147|
|                  Low|  104|
|               Medium|  249|
+---------------------+-----+



# Save Processed Data for Reporting


In [0]:
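# Parquet stores the schema itself, so header/inferSchema options are not needed.
# The "parquet" sub-folder is an assumed location that keeps the two outputs apart.
df.write.format("parquet").mode("overwrite").save("/mnt/project5output/parquet")

Save as CSV

In [0]:
# Writing to the same path with mode("overwrite") would replace the Parquet files,
# so the CSV copy goes to its own (assumed) sub-folder. inferSchema is a read-only option.
df.write.format("csv").mode("overwrite").option("header", "true").save("/mnt/project5output/csv")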