# **Preprocessing Component Test Notebook**

### **Summary of Preprocessing Results**
- ✅ **Loaded Dataset**: Verified dataset name and shape.
- ✅ **Handled Missing Values**: Checked how many rows were dropped.
- ✅ **Encoded Categorical Features**: Ensured categorical variables are transformed properly.

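On the loan dataset, each preprocessing function ran without errors and produced output of the expected shape: 614 rows were loaded, 134 rows with missing values were dropped (480 remaining), and binary encoding expanded the 13 original columns to 29.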


In [1]:
import sys
import os

# Resolve the project root, assuming the notebook runs from notebooks/tests (two levels below the root)
project_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))

# Add `src/` directory explicitly to Python path
src_path = os.path.join(project_root, "src")
if src_path not in sys.path:
    sys.path.insert(0, src_path)
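
The path setup above depends on the current working directory being `notebooks/tests`. A minimal sketch of a more defensive variant is shown below. The helper name `add_src_to_path` and its parameters are hypothetical and not part of the project. It fails with a clear error if the expected `src` directory is missing, instead of silently adding a wrong path.

```python
import sys
from pathlib import Path


def add_src_to_path(start: Path, levels_up: int = 2, src_dirname: str = "src") -> Path:
    """Prepend <project_root>/<src_dirname> to sys.path and return that path.

    `start` is the directory the notebook runs from; `levels_up` is how many
    parent directories separate it from the project root (2 for notebooks/tests).
    """
    if levels_up < 1:
        raise ValueError("levels_up must be at least 1")
    project_root = start.resolve().parents[levels_up - 1]
    src_path = project_root / src_dirname
    if not src_path.is_dir():
        raise FileNotFoundError(f"Expected a source directory at {src_path}")
    if str(src_path) not in sys.path:
        sys.path.insert(0, str(src_path))
    return src_path


# Usage from a notebook located in notebooks/tests:
# add_src_to_path(Path.cwd())
```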


In [2]:
# Verify the path
print(sys.path)

['c:\\Users\\delea\\OneDrive\\Documents\\Desktop\\Master Thesis\\MasterThesisCode\\src', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\python312.zip', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\DLLs', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\Lib', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312', '', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\win32', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\win32\\lib', 'c:\\Users\\delea\\AppData\\Local\\Programs\\Python\\Python312\\Lib\\site-packages\\Pythonwin']


In [3]:
import pandas as pd
from preprocessing.data_loader import load_dataset
from preprocessing.missing_value_handler import handle_missing_values
from preprocessing.encoding import encode_categorical_features

# Define dataset path
original_dataset_path = "../../datasets/original/loan.csv"
separator = ","  # Adjust based on dataset format
target_column = "LoanAmount"  # Adjust based on dataset


In [4]:
# Load dataset
original_data, dataset_name = load_dataset(original_dataset_path, separator)
original_data.head()


📂 Loading dataset from: ../../datasets/original/loan.csv...

Processing dataset: loan
Original dataset size: 614 rows


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
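
The log line `Processing dataset: loan` suggests that the dataset name is the file name without its extension. How `load_dataset` derives it is not shown here, so the sketch below is an assumption, and `dataset_name_from_path` is a hypothetical helper.

```python
from pathlib import Path


def dataset_name_from_path(path: str) -> str:
    """Return the file name without its extension, e.g. 'loan' for '.../loan.csv'."""
    return Path(path).stem


print(dataset_name_from_path("../../datasets/original/loan.csv"))  # loan
```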


In [5]:
# Handle missing values
cleaned_data = handle_missing_values(original_data, strategy="drop")
cleaned_data.head()


Dropped 134 rows due to missing values


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
5,LP001011,Male,Yes,2,Graduate,Yes,5417,4196.0,267.0,360.0,1.0,Urban,Y
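
The reported count is consistent with the row totals: 614 rows loaded minus 134 dropped leaves the 480 rows seen in the encoding step. A minimal sketch of what a `"drop"` strategy could look like is given below, assuming it removes every row that has at least one missing value. The function `drop_missing_with_report` is hypothetical and is not the project's `handle_missing_values`.

```python
import pandas as pd


def drop_missing_with_report(data: pd.DataFrame) -> tuple[pd.DataFrame, int]:
    """Drop rows containing any missing value; return (cleaned frame, rows dropped)."""
    cleaned = data.dropna(axis=0, how="any")
    return cleaned, len(data) - len(cleaned)


# Toy frame mirroring a few loan.csv columns (values are illustrative only).
toy = pd.DataFrame(
    {
        "Gender": ["Male", "Male", None, "Female"],
        "ApplicantIncome": [5849, 4583, 3000, 2583],
        "LoanAmount": [None, 128.0, 66.0, 120.0],
    }
)

cleaned, n_dropped = drop_missing_with_report(toy)
print(f"Dropped {n_dropped} rows due to missing values")
```

As in the output above, `dropna` keeps the original index labels, so the cleaned frame starts at index 1 rather than 0.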


In [6]:
# Encode categorical features using Binary Encoding
encoded_data = encode_categorical_features(cleaned_data, target_column)
encoded_data.head()


🔹 Identified Categorical Columns: ['Loan_ID', 'Gender', 'Married', 'Dependents', 'Education', 'Self_Employed', 'Property_Area', 'Loan_Status']
Original Data Shape: (480, 13)
✅ Binary Encoding applied. New Features Added: ['Loan_ID_1', 'Property_Area_0', 'Loan_ID_7', 'Gender_0', 'Education_1', 'Dependents_0', 'Loan_Status_1', 'Loan_ID_3', 'Loan_ID_4', 'Dependents_2', 'Loan_ID_8', 'Loan_ID_5', 'Self_Employed_1', 'Self_Employed_0', 'Education_0', 'Property_Area_1', 'Loan_ID_6', 'Married_1', 'Loan_ID_0', 'Gender_1', 'Loan_ID_2', 'Married_0', 'Dependents_1', 'Loan_Status_0']
New Data Shape after Encoding: (480, 29)


Unnamed: 0,Loan_ID_0,Loan_ID_1,Loan_ID_2,Loan_ID_3,Loan_ID_4,Loan_ID_5,Loan_ID_6,Loan_ID_7,Loan_ID_8,Gender_0,...,Self_Employed_1,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area_0,Property_Area_1,Loan_Status_0,Loan_Status_1
1,0,0,0,0,0,0,0,0,1,0,...,1,4583,1508.0,128.0,360.0,1.0,0,1,0,1
2,0,0,0,0,0,0,0,1,0,0,...,0,3000,0.0,66.0,360.0,1.0,1,0,1,0
3,0,0,0,0,0,0,0,1,1,0,...,1,2583,2358.0,120.0,360.0,1.0,1,0,1,0
4,0,0,0,0,0,0,1,0,0,0,...,1,6000,0.0,141.0,360.0,1.0,1,0,1,0
5,0,0,0,0,0,0,1,0,1,0,...,0,5417,4196.0,267.0,360.0,1.0,1,0,1,0
