<a href="https://colab.research.google.com/github/BenjaminUy/Predicting-Loan-User-Default-Risk/blob/main/notebooks/Cleaning_%26_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **India Loan Users - Data Cleaning & Analysis**
Notebook creator: Benjamin Uy

Date created: 6/28/2025

---
Introduction: This is my Jupyter notebook for cleaning and analyzing a Kaggle dataset of loan customers from India.

The dataset I will use is from Kaggle user Subham Surana's "Loan Prediction Based on Customer Behavior" (link below). The original dataset has 13 columns and more than 250,000 rows. Each row represents one loan user, with details such as age, income, geography, and whether or not they were flagged as a default risk. Note that this dataset was organized by Univ.AI.

Link to dataset: https://www.kaggle.com/datasets/subhamjain/loan-prediction-based-on-customer-behavior?select=Training+Data.csv


### Data Dive and Data Cleaning

In [1]:
# Import required modules
import pandas as pd
import numpy as np

In [4]:
# URL to dataset from project repo
url = 'https://raw.githubusercontent.com/BenjaminUy/Predicting-Loan-User-Default-Risk/refs/heads/main/datasets/loan_users.csv'

df = pd.read_csv(url)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252000 entries, 0 to 251999
Data columns (total 13 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Id                 252000 non-null  int64 
 1   Income             252000 non-null  int64 
 2   Age                252000 non-null  int64 
 3   Experience         252000 non-null  int64 
 4   Married/Single     252000 non-null  object
 5   House_Ownership    252000 non-null  object
 6   Car_Ownership      252000 non-null  object
 7   Profession         252000 non-null  object
 8   CITY               252000 non-null  object
 9   STATE              252000 non-null  object
 10  CURRENT_JOB_YRS    252000 non-null  int64 
 11  CURRENT_HOUSE_YRS  252000 non-null  int64 
 12  Risk_Flag          252000 non-null  int64 
dtypes: int64(7), object(6)
memory usage: 25.0+ MB


In [5]:
df.describe()

Statistic,Id,Income,Age,Experience,CURRENT_JOB_YRS,CURRENT_HOUSE_YRS,Risk_Flag
count,252000.0,252000.0,252000.0,252000.0,252000.0,252000.0,252000.0
mean,126000.5,4997117.0,49.954071,10.084437,6.333877,11.997794,0.123
std,72746.278255,2878311.0,17.063855,6.00259,3.647053,1.399037,0.328438
min,1.0,10310.0,21.0,0.0,0.0,10.0,0.0
25%,63000.75,2503015.0,35.0,5.0,3.0,11.0,0.0
50%,126000.5,5000694.0,50.0,10.0,6.0,12.0,0.0
75%,189000.25,7477502.0,65.0,15.0,9.0,13.0,0.0
max,252000.0,9999938.0,79.0,20.0,14.0,14.0,1.0


In [6]:
df['Risk_Flag'].value_counts(normalize=True)

Risk_Flag,proportion
0,0.877
1,0.123


About 12.3% of loan users in this dataset were flagged as a potential default risk (Risk_Flag = 1), so the two classes are imbalanced.
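With roughly 12% of rows flagged, a model that always predicts "not flagged" would already look accurate, which is worth keeping in mind for later evaluation. The sketch below illustrates that arithmetic on a small made-up Series with the same split as Risk_Flag; `majority_baseline_accuracy` and `toy_flags` are hypothetical names, not part of the notebook.

```python
import pandas as pd

def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(labels.value_counts(normalize=True).max())

# Toy labels with the same 87.7% / 12.3% split as Risk_Flag (illustrative only)
toy_flags = pd.Series([0] * 877 + [1] * 123)

baseline = majority_baseline_accuracy(toy_flags)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

On the real data, `majority_baseline_accuracy(df['Risk_Flag'])` should give the same figure as the larger proportion shown above.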

In [7]:
# I will drop Id as this likely won't be useful in future analysis
df = df.drop(columns=['Id'])

In [8]:
# Remove rows with null values (df.info() above reports none, so this is a safeguard)
df = df.dropna(axis=0)
df = df.reset_index(drop=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252000 entries, 0 to 251999
Data columns (total 12 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Income             252000 non-null  int64 
 1   Age                252000 non-null  int64 
 2   Experience         252000 non-null  int64 
 3   Married/Single     252000 non-null  object
 4   House_Ownership    252000 non-null  object
 5   Car_Ownership      252000 non-null  object
 6   Profession         252000 non-null  object
 7   CITY               252000 non-null  object
 8   STATE              252000 non-null  object
 9   CURRENT_JOB_YRS    252000 non-null  int64 
 10  CURRENT_HOUSE_YRS  252000 non-null  int64 
 11  Risk_Flag          252000 non-null  int64 
dtypes: int64(6), object(6)
memory usage: 23.1+ MB


In [9]:
df.head()

Index,Income,Age,Experience,Married/Single,House_Ownership,Car_Ownership,Profession,CITY,STATE,CURRENT_JOB_YRS,CURRENT_HOUSE_YRS,Risk_Flag
0,1303834,23,3,single,rented,no,Mechanical_engineer,Rewa,Madhya_Pradesh,3,13,0
1,7574516,40,10,single,rented,no,Software_Developer,Parbhani,Maharashtra,9,13,0
2,3991815,66,4,married,rented,no,Technical_writer,Alappuzha,Kerala,4,10,0
3,6256451,41,2,single,rented,yes,Software_Developer,Bhubaneswar,Odisha,2,12,1
4,5768871,47,11,single,rented,no,Civil_servant,Tiruchirappalli[10],Tamil_Nadu,3,14,1


In [10]:
df['Profession'].value_counts().head(10)

Profession,count
Physician,5957
Statistician,5806
Web_designer,5397
Psychologist,5390
Computer_hardware_engineer,5372
Drafter,5359
Magistrate,5357
Fashion_Designer,5304
Air_traffic_controller,5281
Comedian,5259


In [11]:
df['CITY'].value_counts().head(10)

CITY,count
Vijayanagaram,1259
Bhopal,1208
Bulandshahr,1185
Saharsa[29],1180
Vijayawada,1172
Srinagar,1136
Indore,1130
New_Delhi,1098
Hajipur[31],1098
Satara,1096


In [12]:
df['STATE'].value_counts().head(10)

STATE,count
Uttar_Pradesh,28400
Maharashtra,25562
Andhra_Pradesh,25297
West_Bengal,23483
Bihar,19780
Tamil_Nadu,16537
Madhya_Pradesh,14122
Karnataka,11855
Gujarat,11408
Rajasthan,9174


There seem to be some formatting inconsistencies, such as mixed letter case in Profession and bracketed numbers (e.g., "Saharsa[29]") in CITY and possibly in STATE. Let's fix them.

In [14]:
import re

# Formatting STATE to proper case
df['STATE'] = df['STATE'].str.title()

# Removing bracketed numbers (square brackets and the digits inside them)
pattern = r"\[(\d+)\]"
repl = ''
df['STATE'] = df['STATE'].apply(lambda x : re.sub(pattern, repl, x))

df['STATE'].value_counts()

STATE,count
Uttar_Pradesh,29143
Maharashtra,25562
Andhra_Pradesh,25297
West_Bengal,23483
Bihar,19780
Tamil_Nadu,16537
Madhya_Pradesh,14122
Karnataka,11855
Gujarat,11408
Rajasthan,9174


In [15]:
# Formatting CITY to proper case
df['CITY'] = df['CITY'].str.title()

# Removing bracketed numbers (square brackets and the digits inside them)
pattern = r"\[(\d+)\]"
repl = ''
df['CITY'] = df['CITY'].apply(lambda x : re.sub(pattern, repl, x))

df['CITY'].value_counts()

CITY,count
Aurangabad,1543
Vijayanagaram,1259
Bhopal,1208
Bulandshahr,1185
Saharsa,1180
...,...
Ujjain,486
Warangal,459
Bettiah,457
Katni,448


In [16]:
# Formatting Profession to proper case
df['Profession'] = df['Profession'].str.title()

# Just in case, removing bracketed numbers (square brackets and the digits inside them)
pattern = r"\[(\d+)\]"
repl = ''
df['Profession'] = df['Profession'].apply(lambda x : re.sub(pattern, repl, x))

df['Profession'].value_counts()

Profession,count
Physician,5957
Statistician,5806
Web_Designer,5397
Psychologist,5390
Computer_Hardware_Engineer,5372
Drafter,5359
Magistrate,5357
Fashion_Designer,5304
Air_Traffic_Controller,5281
Comedian,5259
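
The three cleaning cells above repeat the same two steps (title-casing, then stripping bracketed numbers). As a minimal sketch, those steps could be wrapped in one helper that uses pandas' vectorized string methods instead of `apply` with `re.sub`. The name `clean_label_column` is hypothetical, and the toy frame stands in for `df`; its bracketed STATE value is made up for illustration.

```python
import pandas as pd

def clean_label_column(col: pd.Series) -> pd.Series:
    """Title-case each value, then strip bracketed numbers such as '[29]'."""
    return col.str.title().str.replace(r"\[\d+\]", "", regex=True)

# Toy frame standing in for df (illustrative values only)
toy = pd.DataFrame({
    "Profession": ["Web_designer", "Physician"],
    "CITY": ["Saharsa[29]", "Bhopal"],
    "STATE": ["Tamil_Nadu", "Uttar_Pradesh[5]"],
})

# Apply the same cleaning to every text column that needs it
for name in ["Profession", "CITY", "STATE"]:
    toy[name] = clean_label_column(toy[name])

print(toy)
```

The loop keeps the cleaning rules in one place, so a change to the pattern would apply to all three columns at once.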


In [17]:
df.nunique()

Column,Unique values
Income,41920
Age,59
Experience,21
Married/Single,2
House_Ownership,3
Car_Ownership,2
Profession,51
CITY,316
STATE,28
CURRENT_JOB_YRS,15


Things to note:
- Among the non-numeric columns, Profession, CITY, and STATE have the most unique values (51, 316, and 28, respectively).
- I'll need to find ways to reduce these counts if I am to continue using these features for further analysis.
- Married/Single, House_Ownership, Car_Ownership, CURRENT_HOUSE_YRS, and Risk_Flag could be treated as categorical.
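
One possible way to act on the last note is to cast the low-cardinality columns to pandas' `category` dtype. This is only a sketch on a toy frame with illustrative values; the column list `categorical_cols` is my assumption about which columns to convert, and House_Ownership is left out of the toy data only to keep it short.

```python
import pandas as pd

# Toy rows standing in for df; values are illustrative only
toy = pd.DataFrame({
    "Income": [1303834, 7574516, 3991815, 6256451],
    "Married/Single": ["single", "single", "married", "single"],
    "Car_Ownership": ["no", "no", "no", "yes"],
    "CURRENT_HOUSE_YRS": [13, 13, 10, 12],
    "Risk_Flag": [0, 0, 0, 1],
})

# Columns assumed to be categorical (hypothetical selection)
categorical_cols = ["Married/Single", "Car_Ownership", "CURRENT_HOUSE_YRS", "Risk_Flag"]

# astype with a dict converts only the listed columns
toy = toy.astype({name: "category" for name in categorical_cols})

print(toy.dtypes)
```

Income stays numeric here, while the listed columns become categoricals that store each distinct value once.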

Future steps:
- Since CITY may have too many values to work with as a categorical variable, I could discard this feature, since STATE may already capture much of the same geographic information.
- Find ways to group Professions by sector (e.g., military, politics, engineering).
- Alternative approach to grouping Professions: create a variable that indicates whether the user's income is an outlier for their profession.
- Could create a variable that indicates whether there is an incongruity between the STATE and CITY variables (i.e., a CITY paired with a STATE it should not be associated with).
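
The last two ideas could be sketched as follows. Both definitions are assumptions on my part rather than settled choices: an income counts as an "outlier" if it falls outside 1.5 × IQR of its Profession group, and a row counts as "incongruent" if its STATE differs from the most common STATE recorded for its CITY. The helper names and the toy data (including its small income values) are hypothetical.

```python
import pandas as pd

def flag_income_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """True where Income lies outside [Q1 - k*IQR, Q3 + k*IQR] for its Profession."""
    grouped = frame.groupby("Profession")["Income"]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    return (frame["Income"] < q1 - k * iqr) | (frame["Income"] > q3 + k * iqr)

def flag_state_mismatch(frame: pd.DataFrame) -> pd.Series:
    """True where STATE differs from the most frequent STATE seen for that CITY."""
    modal_state = frame.groupby("CITY")["STATE"].transform(lambda s: s.mode().iloc[0])
    return frame["STATE"] != modal_state

# Toy data: one extreme income and one deliberately mismatched CITY/STATE pair
toy = pd.DataFrame({
    "Profession": ["Physician"] * 5 + ["Drafter"] * 5,
    "Income": [100, 110, 120, 130, 1000, 500, 510, 520, 530, 540],
    "CITY": ["Bhopal", "Bhopal", "Indore", "Indore", "Satara",
             "Satara", "Bhopal", "Indore", "Satara", "Bhopal"],
    "STATE": ["Madhya_Pradesh"] * 4 + ["Maharashtra"] * 2
             + ["Madhya_Pradesh"] * 2 + ["Maharashtra", "Bihar"],
})

toy["Income_Outlier"] = flag_income_outliers(toy)
toy["State_Mismatch"] = flag_state_mismatch(toy)

print(toy[["Profession", "Income", "CITY", "STATE", "Income_Outlier", "State_Mismatch"]])
```

A mismatch flag built this way would still need manual review, since cities in different states can legitimately share a name.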


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252000 entries, 0 to 251999
Data columns (total 12 columns):
 #   Column             Non-Null Count   Dtype 
---  ------             --------------   ----- 
 0   Income             252000 non-null  int64 
 1   Age                252000 non-null  int64 
 2   Experience         252000 non-null  int64 
 3   Married/Single     252000 non-null  object
 4   House_Ownership    252000 non-null  object
 5   Car_Ownership      252000 non-null  object
 6   Profession         252000 non-null  object
 7   CITY               252000 non-null  object
 8   STATE              252000 non-null  object
 9   CURRENT_JOB_YRS    252000 non-null  int64 
 10  CURRENT_HOUSE_YRS  252000 non-null  int64 
 11  Risk_Flag          252000 non-null  int64 
dtypes: int64(6), object(6)
memory usage: 23.1+ MB
