# Loan Approval Data Cleansing

This notebook prepares a loan approval dataset for machine learning. It produces two cleansed versions of the data: one with missing values imputed and one with incomplete rows removed.

In [35]:
import pandas as pd
import numpy as np
from pathlib import Path
from sklearn.preprocessing import LabelEncoder
import os

## Setup

Importing the necessary libraries for data manipulation, preprocessing, and file handling.

In [36]:
# Grab the directory of the current file
# Should be ./Project/src/Program
try:
    current_dir = Path(__file__).parent.absolute()
except NameError:
    # __file__ is not defined in a Jupyter notebook
    try:
        current_dir = Path.cwd()
    except OSError:
        # If the working directory cannot be resolved, set the path manually
        current_dir = Path("D:/School/CS 434/Project/src/Program")

# Go to the Data directory 
# Should be ./Project/src/Data
data_dir = current_dir.parent / "Data"

# Get the Datasets path
dataset_path = data_dir / "Dataset.csv"

# Verify the file exists before loading it; fail early so `df` is never left undefined
if not dataset_path.exists():
    raise FileNotFoundError(f"Dataset not found at expected path: {dataset_path}")

print(f"Dataset found at: {dataset_path}")
df = pd.read_csv(dataset_path)

Dataset found at: d:\School\CS 434\Project\src\Data\Dataset.csv


## Data Loading
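Locating and loading the dataset from the expected path. The path is built relative to the script's location (or, in a notebook, the current working directory), so the data is found as long as the project layout is kept intact.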

**Importance**: Proper file path handling is crucial for reproducible data science workflows.
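
A minimal sketch of the same lookup, wrapped in a function that fails loudly when the file is missing. The helper name `find_dataset` and its `start` parameter are hypothetical and not part of the notebook; only the `Program` / `Data` / `Dataset.csv` layout comes from the cell above.

```python
from pathlib import Path


def find_dataset(start: Path, filename: str = "Dataset.csv") -> Path:
    """Return <parent of start>/Data/<filename>, or raise if it is missing.

    Hypothetical helper: it assumes the notebook's layout, where the code
    lives in ./Project/src/Program and the data in ./Project/src/Data.
    """
    dataset_path = start.resolve().parent / "Data" / filename
    if not dataset_path.is_file():
        # Raising here avoids a later NameError from an undefined DataFrame.
        raise FileNotFoundError(f"Dataset not found at expected path: {dataset_path}")
    return dataset_path
```

With this in place, loading could look like `df = pd.read_csv(find_dataset(Path.cwd()))`, assuming the notebook is started from the `Program` directory.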

In [37]:
# Check for any duplicate rows and remove them
df = df.drop_duplicates()

# Convert numeric columns to numeric types; unparseable values become NaN
numeric_columns = ['ApplicantIncome', 'CoapplicantIncome', 'LoanAmount', 'Loan_Amount_Term']
for col in numeric_columns:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Categorical columns
categorical_columns = ['Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Credit_History', 'Property_Area']

# Binary columns (Education is included because the encoding map below covers Graduate / Not Graduate)
binary_columns = ['Gender', 'Married', 'Education', 'Self_Employed', 'Loan_Status']

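# Dependents - convert to numeric where possible
def clean_dependents(value):
    """Map '3+' to 3, parse other values as int, and return NaN otherwise."""
    if value == '3+':
        return 3
    try:
        return int(value)
    except (ValueError, TypeError):
        # int(float('nan')) raises ValueError; int(None) raises TypeError
        return np.nan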

## Data Preparation
Initial data cleaning steps:
- Removing duplicate rows so that repeated records are not over-weighted during training
- Coercing the numeric columns to numeric types, with unparseable values becoming NaN
- Defining categorical and binary columns for proper encoding
- Creating a function to handle the special case of '3+' in the Dependents column

**Importance**: These steps ensure data integrity and prepare the data for proper encoding.
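
To make the type coercion and the `'3+'` special case concrete, here is a small sketch on made-up values (they are not taken from `Dataset.csv`). It uses a version of `clean_dependents` that also catches `TypeError`, which is an assumption about how missing values might arrive (for example as `None`).

```python
import numpy as np
import pandas as pd


def clean_dependents(value):
    """Map '3+' to 3, parse other values as int, and return NaN otherwise."""
    if value == "3+":
        return 3
    try:
        return int(value)
    except (ValueError, TypeError):
        # int(float('nan')) raises ValueError; int(None) raises TypeError.
        return np.nan


# Made-up values for illustration only.
toy = pd.DataFrame({
    "LoanAmount": ["120", "n/a", "75.5"],
    "Dependents": ["0", "3+", None],
})

# errors="coerce" turns unparseable entries into NaN instead of raising.
toy["LoanAmount"] = pd.to_numeric(toy["LoanAmount"], errors="coerce")
toy["Dependents"] = toy["Dependents"].apply(clean_dependents)
```

After this runs, the `"n/a"` loan amount and the missing dependents value are both NaN, ready to be imputed or dropped in a later step.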

In [45]:
# Create first cleansed dataset - using median/mode imputation
df_imputed = df.copy()

# Impute missing numeric values with the median
for col in numeric_columns:
    median_value = df_imputed[col].median()
    # Assign the result back; chained fillna(..., inplace=True) acts on a copy
    df_imputed[col] = df_imputed[col].fillna(median_value)

# Impute missing categorical values with the mode
for col in categorical_columns:
    mode_value = df_imputed[col].mode()[0]
    df_imputed[col] = df_imputed[col].fillna(mode_value)

# Binary encoding where applicable
for col in binary_columns:
    df_imputed[col] = df_imputed[col].map({'Male': 1, 'Female': 0,
                                           'Yes': 1, 'No': 0,
                                           'Graduate': 1, 'Not Graduate': 0,
                                           'Y': 1, 'N': 0}).astype('int')

# For non-binary categorical columns, use one-hot encoding
# Property_Area
df_imputed = pd.get_dummies(df_imputed, columns=['Property_Area'], drop_first=True)

# Dependents
df_imputed['Dependents'] = df_imputed['Dependents'].apply(clean_dependents)
df_imputed['Dependents'] = df_imputed['Dependents'].fillna(df_imputed['Dependents'].median())
df_imputed['Dependents'] = df_imputed['Dependents'].astype('int')

# Credit_History is already binary, just need to convert to int
df_imputed['Credit_History'] = df_imputed['Credit_History'].astype('int')

# Save the cleansed dataset to CSV file
df_imputed.to_csv(os.path.join(data_dir, 'loan_data_imputed.csv'), index=False)
print("Cleansed dataset saved to 'loan_data_imputed.csv'")

Cleansed dataset saved to 'loan_data_imputed.csv'

## Imputation Method
Creating the first cleansed dataset using imputation:
- Missing numeric values filled with median values
- Missing categorical values filled with mode (most frequent value)
- Binary encoding for binary columns
- One-hot encoding for non-binary categorical columns
- Special handling for the Dependents column: '3+' is mapped to 3 and the column is cast to integer

**Importance**: Imputation preserves all data points while providing reasonable estimates for missing values, maintaining the original dataset size.
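
The imputation and one-hot steps listed above can be sketched on a toy frame as follows. The rows are made up for illustration. The sketch assigns the result of `fillna` back to the column rather than using `inplace=True` on a selected column, which recent pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 140.0, 120.0],
    "Gender": ["Male", None, "Male", "Female"],
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban"],
})

imputed = toy.copy()

# Numeric column: fill with the median of the observed values.
imputed["LoanAmount"] = imputed["LoanAmount"].fillna(imputed["LoanAmount"].median())

# Categorical column: fill with the most frequent value.
# mode() returns a Series (ties are possible), so take the first entry.
imputed["Gender"] = imputed["Gender"].fillna(imputed["Gender"].mode()[0])

# One-hot encode a three-level column; drop_first=True drops the first
# level ("Rural" here), so one column is replaced by two.
imputed = pd.get_dummies(imputed, columns=["Property_Area"], drop_first=True)
```

The original `toy` frame still contains its missing values, while `imputed` has none and keeps all four rows.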

In [46]:
# Create second cleansed dataset - removing rows with missing values
# .copy() gives an independent DataFrame, so the column assignments below
# do not raise SettingWithCopyWarning
df_removed = df.dropna().copy()

# Encode categorical variables to binary/numeric
# Binary encoding where applicable
for col in binary_columns:
    df_removed[col] = df_removed[col].map({'Male': 1, 'Female': 0,
                                           'Yes': 1, 'No': 0,
                                           'Graduate': 1, 'Not Graduate': 0,
                                           'Y': 1, 'N': 0}).astype('int')

# For non-binary categorical columns, use one-hot encoding
# Property_Area
df_removed = pd.get_dummies(df_removed, columns=['Property_Area'], drop_first=True)

# Dependents
df_removed['Dependents'] = df_removed['Dependents'].apply(clean_dependents)

# Credit_History is already binary, just need to convert to int
df_removed['Credit_History'] = df_removed['Credit_History'].astype('int')

# Save the cleansed dataset to CSV file
df_removed.to_csv(os.path.join(data_dir, 'loan_data_removed.csv'), index=False)
print("Cleansed dataset with removed rows saved to 'loan_data_removed.csv'")

Cleansed dataset with removed rows saved to 'loan_data_removed.csv'

## Row Removal Method
Creating the second cleansed dataset by removing rows with missing values:
- Dropping all rows with any missing values
- Applying the same encoding transformations as the imputed dataset

**Importance**: This approach ensures we only work with complete data points, potentially providing more reliable but fewer training examples.
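
A minimal sketch of the row-removal approach on made-up data. Taking an explicit `.copy()` of the `dropna()` result is one way to keep later column assignments from triggering pandas' `SettingWithCopyWarning`; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "Married": ["Yes", "No", None, "Yes"],
    "LoanAmount": [100.0, np.nan, 140.0, 120.0],
})

# Keep only rows with no missing values; .copy() makes the result an
# independent DataFrame that is safe to modify.
complete = toy.dropna().copy()
complete["Married"] = complete["Married"].map({"Yes": 1, "No": 0}).astype("int")

# Number of rows lost to missing values.
rows_dropped = len(toy) - len(complete)
```

Here two of the four rows contain a missing value, so only half of the toy data survives, which illustrates the trade-off between completeness and sample size.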

In [47]:
# Print summary of the cleaning
print(f"Original dataset shape: {df.shape}")
print(f"Imputed dataset shape: {df_imputed.shape}")
print(f"Removed dataset shape: {df_removed.shape}")

Original dataset shape: (367, 13)
Imputed dataset shape: (367, 14)
Removed dataset shape: (289, 14)


## Results Summary
Comparing the shapes of the original and cleansed datasets:
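- Original dataset: 367 rows, 13 columns
- Imputed dataset: 367 rows, 14 columns (all rows preserved; one-hot encoding `Property_Area` with `drop_first=True` replaces one column with two)
- Removed dataset: 289 rows, 14 columns (78 rows removed due to missing values)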

**Importance**: This comparison helps us understand how much data was affected by each cleaning method, which can influence model training decisions.
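
The cost of row removal can be expressed as a share of the original data. This short calculation uses the row counts printed by the summary cell; the variable names are illustrative.

```python
# Row counts as printed by the summary cell above.
original_rows, removed_rows = 367, 289

rows_dropped = original_rows - removed_rows
share_dropped = rows_dropped / original_rows

print(f"Rows dropped: {rows_dropped} ({share_dropped:.1%})")
# prints: Rows dropped: 78 (21.3%)
```

Roughly one row in five is lost when incomplete rows are removed, which is worth keeping in mind when comparing models trained on the two datasets.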

## Next Steps
The cleansed datasets are now ready for:
1. Exploratory data analysis
2. Feature selection
3. Model training and evaluation

Both datasets (imputed and row-removed) can be used to compare model performance with different data cleaning approaches.
