# Stage 06: Data Preprocessing
Document the cleaning assumptions, run the preprocessing pipeline, and save the processed output.


In [1]:
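import os
import sys

# Add the parent directory to the import path so the `src` package resolves from this notebook.
sys.path.append(os.path.abspath('..'))

from src.cleaning import load_raw_data, clean_loans, save_processed_data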

In [2]:
df_raw = load_raw_data('../data/raw/raw_loan_data.csv')

In [3]:
df_clean = clean_loans(df_raw)

In [4]:
save_processed_data(df_clean, '../data/processed/cleaned_loan_data.csv')

**Data Cleaning & Preprocessing Assumptions**

During the preprocessing stage, several assumptions were made to handle missing and inconsistent values in the loan dataset. These choices ensure the dataset remains usable while minimizing the loss of critical information.

Tenure

Assumption: Missing Tenure values represent either new or incomplete loan records.

Rationale: These were filled with 0 to indicate "no tenure yet," rather than dropping the records.

Date Columns (LoanDate, DisbursementDate, LastPaymentDate, RetirementDate)

Assumption: Missing dates mean the event has not occurred or was not recorded.

Rationale: Converted all dates to datetime format. Missing values were left as NaT (null datetime), preserving the information without arbitrary assumptions.
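The date handling described above can be sketched as follows. This is a minimal illustration, not the actual `clean_loans` implementation: the helper name `parse_date_columns` is hypothetical, and `errors="coerce"` is an assumption that also turns unparseable strings (not only missing values) into NaT.

```python
import pandas as pd

# Column names taken from the heading above.
DATE_COLUMNS = ["LoanDate", "DisbursementDate", "LastPaymentDate", "RetirementDate"]

def parse_date_columns(df: pd.DataFrame, columns=DATE_COLUMNS) -> pd.DataFrame:
    """Return a copy with the listed columns converted to datetime.

    Missing values stay missing (NaT). With errors="coerce" (an assumption
    here), values that cannot be parsed also become NaT instead of raising.
    """
    out = df.copy()
    for col in columns:
        if col in out.columns:  # skip columns absent from this frame
            out[col] = pd.to_datetime(out[col], errors="coerce")
    return out

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "LoanDate": ["2021-01-15", None, "not a date"],
    "DisbursementDate": ["2021-01-20", "2021-02-01", None],
})
parsed = parse_date_columns(sample)
```

Leaving NaT in place means later steps can still tell "no payment yet" apart from a real date, which a sentinel date would hide.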

Loan Amount, Disbursement Amount, Instalment, Principal Balance

Assumption: Missing amounts likely indicate loans that were not disbursed or not fully recorded.

Rationale: Filled with 0 instead of dropping rows, so incomplete loans remain in the dataset for analysis.
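A zero-fill of this kind might look like the sketch below. The column names (`LoanAmount`, `DisbursementAmount`, `Instalment`, `PrincipalBalance`) and the helper `fill_amounts_with_zero` are assumptions for illustration; the real dataset's names may differ. The same pattern would apply to Tenure.

```python
import pandas as pd

# Assumed column names; adjust to match the actual dataset.
AMOUNT_COLUMNS = ["LoanAmount", "DisbursementAmount", "Instalment", "PrincipalBalance"]

def fill_amounts_with_zero(df: pd.DataFrame, columns=AMOUNT_COLUMNS) -> pd.DataFrame:
    """Return a copy where missing values in the amount columns are replaced by 0."""
    out = df.copy()
    present = [c for c in columns if c in out.columns]  # ignore absent columns
    out[present] = out[present].fillna(0)
    return out

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "LoanAmount": [1000.0, None],
    "Instalment": [None, 50.0],
})
filled = fill_amounts_with_zero(sample)
```

The function works on a copy, so the raw frame keeps its missing values and can still be inspected to see how many amounts were imputed.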

Loan Status & Loan Purpose

Assumption: Missing entries mean information was not recorded.

Rationale: Filled with "Unknown" to preserve rows while marking incomplete data explicitly.
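As a rough sketch of the placeholder approach, assuming the columns are named `LoanStatus` and `LoanPurpose` (hypothetical names, as is the helper `fill_unknown`):

```python
import pandas as pd

# Assumed column names; adjust to match the actual dataset.
CATEGORICAL_COLUMNS = ["LoanStatus", "LoanPurpose"]

def fill_unknown(df: pd.DataFrame, columns=CATEGORICAL_COLUMNS,
                 placeholder: str = "Unknown") -> pd.DataFrame:
    """Return a copy where missing categorical values carry an explicit placeholder."""
    out = df.copy()
    for col in columns:
        if col in out.columns:  # skip columns absent from this frame
            out[col] = out[col].fillna(placeholder)
    return out

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "LoanStatus": ["Active", None],
    "LoanPurpose": [None, "Education"],
})
labelled = fill_unknown(sample)
```

Because "Unknown" becomes its own category, a downstream model can learn whether missingness itself is informative.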

Interest Rate

Assumption: Missing rates are due to incomplete loan documentation.

Rationale: Filled with the dataset median interest rate, which is less sensitive to extreme values than the mean.
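A minimal sketch of median imputation, assuming the column is named `InterestRate` (a hypothetical name, as is the helper `fill_interest_rate_median`). The sample values are invented to show why the median is preferred when a few rates are extreme.

```python
import pandas as pd

def fill_interest_rate_median(df: pd.DataFrame, column: str = "InterestRate") -> pd.DataFrame:
    """Return a copy where missing rates are replaced by the column median."""
    out = df.copy()
    median_rate = out[column].median()  # NaN values are skipped by default
    out[column] = out[column].fillna(median_rate)
    return out

# Made-up rates: one extreme value (40.0) pulls the mean to 17.0,
# while the median of the observed values stays at 6.0.
sample = pd.DataFrame({"InterestRate": [5.0, 6.0, None, 40.0]})
imputed = fill_interest_rate_median(sample)
```

In this toy frame the missing rate is filled with 6.0, whereas a mean fill would have inserted 17.0, a rate that none of the typical loans carry.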

IsNPL (Target Variable)

Assumption: A missing IsNPL label means we cannot use the record for supervised learning.

Rationale: Rows with missing target values were dropped, since model training requires labels.
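Dropping unlabelled rows could be done along these lines. The helper name `drop_unlabelled` is hypothetical, and resetting the index is a design choice of this sketch rather than something the pipeline is known to do.

```python
import pandas as pd

def drop_unlabelled(df: pd.DataFrame, target: str = "IsNPL") -> pd.DataFrame:
    """Return only the rows whose target label is present."""
    # subset=[target] restricts the check to the label column, so rows with
    # other missing fields are kept for the imputation steps.
    return df.dropna(subset=[target]).reset_index(drop=True)

# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "LoanAmount": [100.0, 200.0, 300.0],
    "IsNPL": [0, None, 1],
})
labelled = drop_unlabelled(sample)
```

Comparing `len(sample)` with `len(labelled)` gives a quick count of how many records were lost to missing labels.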

General Rule

Assumption: Retaining as much data as possible is preferred to maximize sample size.

Rationale: Instead of aggressive dropping, imputation and "Unknown" placeholders were applied unless the target variable was missing.

📌 Notes and Risks

Some imputations (like filling missing interest rates with the median) reduce the variability of the affected feature and may affect model accuracy.

Records with extensive missing financial details (loan amounts, balances) may still not reflect real-world loans accurately.

If time allows, future iterations should explore more sophisticated imputations (e.g., predictive models for missing values).
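The first risk above, that median imputation reduces variability, can be seen in a small toy example. The values are invented for illustration and do not come from the loan dataset.

```python
import pandas as pd

# Made-up interest rates with two missing entries.
rates = pd.Series([4.0, 6.0, None, None, 8.0, 10.0])

# Fill the gaps with the median of the observed values.
imputed = rates.fillna(rates.median())

# Standard deviation before (NaN skipped) and after imputation.
std_before = rates.std()
std_after = imputed.std()

print(f"std before imputation: {std_before:.3f}")
print(f"std after imputation:  {std_after:.3f}")
```

Every imputed value sits exactly at the centre of the distribution, so the spread shrinks as the share of missing values grows. Checking the standard deviation before and after is a cheap way to judge how much a given column was affected.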