# Kaggle: Predict Loan Payback - Data Cleaning

**Notebook:** `02_data_cleaning.ipynb`
**Author:** Brice Nelson
**Organization:** Kaggle Series | Brice Machine Learning Projects
**Date Created:** November 2, 2025
**Last Updated:** November 2, 2025

---

## 🧭 Purpose

This notebook performs **data cleaning and validation** for the Kaggle *Predict Loan Payback* dataset.
The focus is on ensuring the **train** and **test** datasets are structurally aligned and free of inconsistencies prior to feature engineering and model training.

### **Objectives**
1. Load and inspect both train and test datasets.
2. Validate schema consistency (columns, dtypes, and shapes).
3. Identify and address any missing, duplicated, or outlier values.
4. Standardize formatting across categorical and numeric fields.

---

## 🔍 Dataset Comparison Overview

Before applying cleaning operations, it is essential to verify that both datasets share compatible structures.
The following checks confirm that column names, data types, and row counts align as expected.


## Import Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


In [2]:
# ============================================================
# 🔍 Train vs Test Structure Validation
# ============================================================

# Load datasets
train_path = "../data/raw/train.csv"
test_path = "../data/raw/test.csv"

loan_train_df = pd.read_csv(train_path)
loan_test_df = pd.read_csv(test_path)

# Basic shape and info comparison
print(f"Train Shape: {loan_train_df.shape}")
print(f"Test Shape:  {loan_test_df.shape}\n")

print("Train Columns:", loan_train_df.columns.tolist())
print("Test Columns:", loan_test_df.columns.tolist())

# Define known target column(s)
target_cols = {"loan_paid_back"}

# Check for any column mismatches
train_only_cols = set(loan_train_df.columns) - set(loan_test_df.columns) - target_cols
test_only_cols = set(loan_test_df.columns) - set(loan_train_df.columns)

if train_only_cols or test_only_cols:
    print("\n⚠️ Column mismatches detected:")
    if train_only_cols:
        print("Columns only in train (excluding target):", train_only_cols)
    if test_only_cols:
        print("Columns only in test:", test_only_cols)
else:
    print("\n✅ Train and test datasets have matching columns (except for target variable).")

# Quick dtype consistency check (only for common columns)
common_cols = loan_train_df.columns.intersection(loan_test_df.columns)
dtype_diff = loan_train_df[common_cols].dtypes != loan_test_df[common_cols].dtypes

if dtype_diff.any():
    print("\n⚠️ Data type mismatches found in the following columns:")
    print(loan_train_df[common_cols].dtypes[dtype_diff])
else:
    print("\n✅ Data types are consistent across train and test datasets.")



Train Shape: (593994, 13)
Test Shape:  (254569, 12)

Train Columns: ['id', 'annual_income', 'debt_to_income_ratio', 'credit_score', 'loan_amount', 'interest_rate', 'gender', 'marital_status', 'education_level', 'employment_status', 'loan_purpose', 'grade_subgrade', 'loan_paid_back']
Test Columns: ['id', 'annual_income', 'debt_to_income_ratio', 'credit_score', 'loan_amount', 'interest_rate', 'gender', 'marital_status', 'education_level', 'employment_status', 'loan_purpose', 'grade_subgrade']

✅ Train and test datasets have matching columns (except for target variable).

✅ Data types are consistent across train and test datasets.


In [3]:
loan_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 593994 entries, 0 to 593993
Data columns (total 13 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    593994 non-null  int64  
 1   annual_income         593994 non-null  float64
 2   debt_to_income_ratio  593994 non-null  float64
 3   credit_score          593994 non-null  int64  
 4   loan_amount           593994 non-null  float64
 5   interest_rate         593994 non-null  float64
 6   gender                593994 non-null  object 
 7   marital_status        593994 non-null  object 
 8   education_level       593994 non-null  object 
 9   employment_status     593994 non-null  object 
 10  loan_purpose          593994 non-null  object 
 11  grade_subgrade        593994 non-null  object 
 12  loan_paid_back        593994 non-null  float64
dtypes: float64(5), int64(2), object(6)
memory usage: 58.9+ MB
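
The `info()` output above reports six `object` columns and roughly 58.9 MB of memory. One optional follow-up, which is not part of the original notebook, is to store low-cardinality text columns as the pandas `category` dtype. The sketch below is only an illustration: `compact_dtypes`, its `max_unique` threshold, and the toy frame standing in for `loan_train_df` are all assumptions.

```python
import pandas as pd


def compact_dtypes(df: pd.DataFrame, max_unique: int = 50) -> pd.DataFrame:
    """Return a copy in which low-cardinality text columns use 'category' dtype.

    Hypothetical helper; `max_unique` is an arbitrary illustrative threshold.
    """
    out = df.copy()
    for col in out.columns:
        is_text = pd.api.types.is_object_dtype(out[col]) or isinstance(
            out[col].dtype, pd.StringDtype
        )
        if is_text and out[col].nunique(dropna=False) <= max_unique:
            out[col] = out[col].astype("category")
    return out


# Toy frame standing in for loan_train_df (values are made up)
toy_df = pd.DataFrame(
    {
        "id": [0, 1, 2, 3],
        "gender": ["male", "female", "female", "other"],
        "loan_paid_back": [1.0, 0.0, 1.0, 1.0],
    }
)

compact_df = compact_dtypes(toy_df)
print(compact_df.dtypes)
```

In the notebook this would be called as `compact_dtypes(loan_train_df)`. Comparing `memory_usage(deep=True).sum()` before and after shows whether the conversion is worthwhile for a given dataset.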


## Categorical Columns

| Column            | # of Categories | 1           | 2                  | 3             | 4         | 5        | 6       | 7        | 8     |
|-------------------|-----------------|-------------|--------------------|---------------|-----------|----------|---------|----------|-------|
| gender            | 3               | male        | female             | other         |           |          |         |          |       |
| marital_status    | 4               | single      | married            | divorced      | widowed   |          |         |          |       |
| education_level   | 5               | high_school | bachelor           | master        | phd       | other    |         |          |       |
| employment_status | 5               | employed    | unemployed         | self_employed | retired   | student  |         |          |       |
| loan_purpose      | 8               | home        | debt_consolidation | car           | education | business | medical | vacation | other |
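
A table like the one above can be derived from the data rather than typed by hand, and the same pass can check that the test set contains no labels missing from train. The following is a minimal sketch under stated assumptions: the helper names (`normalize_labels`, `summarize_categories`, `unseen_levels`) are hypothetical, the toy frames stand in for `loan_train_df` and `loan_test_df`, and normalizing labels by trimming whitespace and lowercasing is one possible reading of the "standardize formatting" objective.

```python
import pandas as pd


def normalize_labels(series: pd.Series) -> pd.Series:
    """Trim whitespace and lowercase text labels (missing values stay missing)."""
    return series.str.strip().str.lower()


def summarize_categories(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """Build one row per column with its distinct normalized labels."""
    rows = []
    for col in columns:
        levels = sorted(normalize_labels(df[col]).dropna().unique())
        rows.append(
            {
                "column": col,
                "n_categories": len(levels),
                "categories": ", ".join(levels),
            }
        )
    return pd.DataFrame(rows)


def unseen_levels(train: pd.DataFrame, test: pd.DataFrame, columns: list) -> dict:
    """Map each column to the normalized labels found in test but not in train."""
    result = {}
    for col in columns:
        train_levels = set(normalize_labels(train[col]).dropna().unique())
        test_levels = set(normalize_labels(test[col]).dropna().unique())
        result[col] = test_levels - train_levels
    return result


# Toy frames standing in for loan_train_df / loan_test_df (values are made up)
toy_train = pd.DataFrame(
    {
        "gender": ["male", "female", "other", "female"],
        "marital_status": ["single", "married", "divorced", "widowed"],
    }
)
toy_test = pd.DataFrame(
    {
        "gender": ["Female ", "male"],
        "marital_status": ["single", "separated"],
    }
)

cat_cols = ["gender", "marital_status"]
summary = summarize_categories(toy_train, cat_cols)
unseen = unseen_levels(toy_train, toy_test, cat_cols)
print(summary)
print(unseen)
```

On the real data, `cat_cols` could be taken from the `object` columns listed by `info()`. Any non-empty set in `unseen` would need handling before encoding.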



## Confirm Missing Data

In [7]:
missing_data = pd.DataFrame({'Total Missing':loan_train_df.isnull().sum()})
if missing_data['Total Missing'].any():
    print(missing_data)
else:
    print('Data contains no null values')

Data contains no null values
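
The objectives also list duplicated and outlier values, which the null check does not cover. The sketch below shows one way those checks might look; it is not taken from the notebook. `duplicate_report` and `iqr_outlier_counts` are hypothetical helpers, the 1.5 × IQR rule is a common convention rather than the author's stated method, and the toy frame stands in for `loan_train_df`.

```python
import pandas as pd


def duplicate_report(df: pd.DataFrame, id_col: str = "id") -> dict:
    """Count repeated ids and rows whose non-id values repeat an earlier row."""
    feature_cols = [c for c in df.columns if c != id_col]
    return {
        "duplicate_ids": int(df[id_col].duplicated().sum()),
        "duplicate_feature_rows": int(df.duplicated(subset=feature_cols).sum()),
    }


def iqr_outlier_counts(df: pd.DataFrame, columns: list, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    counts = {}
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        counts[col] = int(((df[col] < lower) | (df[col] > upper)).sum())
    return pd.Series(counts, name="n_outliers")


# Toy frame standing in for loan_train_df (values are made up)
toy_df = pd.DataFrame(
    {
        "id": [0, 1, 2, 3, 4, 5],
        "annual_income": [40000.0, 42000.0, 41000.0, 43000.0, 42000.0, 900000.0],
        "credit_score": [700, 710, 705, 715, 710, 712],
    }
)

dup = duplicate_report(toy_df)
outliers = iqr_outlier_counts(toy_df, ["annual_income", "credit_score"])
print(dup)
print(outliers)
```

Flagged values are candidates for review, not automatic removals. Skewed fields such as income often exceed the IQR fences legitimately, so capping or a log transform may be preferable to dropping rows.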


## One-Hot Encoding Categorical Data