<h2 style="text-align:center;">Data Preprocessing</h2>

<h2>🔃 Loading Dataset</h2>

In [33]:
import os
import sys

# Detect project root by going up until we find the 'src' directory
current_dir = os.getcwd()
while not os.path.isdir(os.path.join(current_dir, 'src')):
    current_dir = os.path.dirname(current_dir)
    if current_dir == os.path.dirname(current_dir):  # Reached filesystem root
        raise FileNotFoundError("Could not find 'src' directory in any parent folders.")

# Set project root and add it to sys.path
PROJECT_ROOT = current_dir
print(f"Setting project root: {PROJECT_ROOT}")
os.chdir(PROJECT_ROOT)
sys.path.insert(0, PROJECT_ROOT)


from src.data import loader, preprocessor
from src.visualization import exploration_visualized


Setting project root: c:\Users\HP\Desktop\Healthcare_test_results_classification-


In [34]:

# Reuse the project root detected in the previous cell instead of hard-coding a machine-specific path
project_root = PROJECT_ROOT
data_path = os.path.join(project_root, 'data', 'raw')

train_df, test_df = loader.load_data(
    train_path=os.path.join(data_path, 'train data.csv'),
    test_path=os.path.join(data_path, 'test data.csv')
)

train_df.head()


Unnamed: 0,ID,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,1,Bobby JacksOn,27,Female,O-,Asthma,06/06/2022,Mark Hartman Jr.,Sons and Miller,Cigna,2625.980554,379,Elective,18/08/2022,Ibuprofen,Normal
1,2,LesLie TErRy,68,Female,O-,Cancer,19/11/2021,Angela Contreras,White-White,Cigna,1471.387317,113,Elective,20/11/2021,Ibuprofen,Inconclusive
2,3,DaNnY sMitH,21,Female,A+,Hypertension,05/03/2022,David Ruiz,Group Middleton,Medicare,5131.488104,154,Emergency,16/05/2022,Paracetamol,Normal
3,4,andrEw waTtS,91,Male,AB-,Diabetes,06/04/2020,Jenny Griffith,Morris-Arellano,Blue Cross,8972.793157,293,Urgent,26/04/2020,Ibuprofen,Abnormal
4,5,adrIENNE bEll,52,Female,A+,Diabetes,31/12/2022,Cynthia Scott,Williams-Davis,Blue Cross,2015.522684,265,Emergency,11/02/2023,Penicillin,Abnormal


<h2>Handling DateTime Datatype</h2>

In [35]:

# Parse the date columns into datetime values, then persist the intermediate result
fixed_datatypes = preprocessor.handle_date_features(train_df)
preprocessor.save_processed_df(fixed_datatypes, "processed_train_data.csv", output_dir="data/processed")
fixed_datatypes.head()



📅 Detected and converted date columns: ['Date of Admission', 'Discharge Date']

📄 Preview of dataset after date conversion:
✅ Saved processed DataFrame to:
c:\Users\HP\Desktop\Healthcare_test_results_classification-\data\processed\processed_train_data.csv


Unnamed: 0,ID,Name,Age,Gender,Blood Type,Medical Condition,Date of Admission,Doctor,Hospital,Insurance Provider,Billing Amount,Room Number,Admission Type,Discharge Date,Medication,Test Results
0,1,Bobby JacksOn,27,Female,O-,Asthma,2022-06-06,Mark Hartman Jr.,Sons and Miller,Cigna,2625.980554,379,Elective,2022-08-18,Ibuprofen,Normal
1,2,LesLie TErRy,68,Female,O-,Cancer,2021-11-19,Angela Contreras,White-White,Cigna,1471.387317,113,Elective,2021-11-20,Ibuprofen,Inconclusive
2,3,DaNnY sMitH,21,Female,A+,Hypertension,2022-03-05,David Ruiz,Group Middleton,Medicare,5131.488104,154,Emergency,2022-05-16,Paracetamol,Normal
3,4,andrEw waTtS,91,Male,AB-,Diabetes,2020-04-06,Jenny Griffith,Morris-Arellano,Blue Cross,8972.793157,293,Urgent,2020-04-26,Ibuprofen,Abnormal
4,5,adrIENNE bEll,52,Female,A+,Diabetes,2022-12-31,Cynthia Scott,Williams-Davis,Blue Cross,2015.522684,265,Emergency,2023-02-11,Penicillin,Abnormal
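
The internals of `preprocessor.handle_date_features` are not shown in this notebook. As a rough sketch of what such a step could look like, the snippet below parses day-first strings such as `18/08/2022` with `pandas.to_datetime`. The helper name `convert_date_columns` and the rule of picking columns whose name contains "date" are assumptions made for illustration, not the project's actual logic.

```python
import pandas as pd


def convert_date_columns(df, date_format="%d/%m/%Y"):
    """Hypothetical helper: parse date-like string columns into datetime64."""
    out = df.copy()
    converted = []
    for col in out.columns:
        # Assumption: a column is a date column if its name contains "date".
        if "date" in col.lower():
            # errors="coerce" turns unparseable strings into NaT instead of raising.
            out[col] = pd.to_datetime(out[col], format=date_format, errors="coerce")
            converted.append(col)
    return out, converted


sample = pd.DataFrame({
    "Date of Admission": ["06/06/2022", "19/11/2021"],
    "Discharge Date": ["18/08/2022", "20/11/2021"],
    "Age": [27, 68],
})

converted_df, date_cols = convert_date_columns(sample)

# Once the columns are real datetimes, date arithmetic works directly.
length_of_stay = (converted_df["Discharge Date"] - converted_df["Date of Admission"]).dt.days
print(date_cols)
print(length_of_stay.tolist())
```

Passing an explicit `format` avoids the ambiguity of values like `06/04/2020`, which could otherwise be read as either 6 April or 4 June.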


<h2>Handling Missing Values</h2>

In [36]:
# Fill missing values (mode for categorical columns, median for numerical ones), then save
handled_missing = preprocessor.handle_missing_values(fixed_datatypes)
preprocessor.save_processed_df(handled_missing, "processed_train_data.csv", output_dir="data/processed")


✅Missing values handled:
Filled categorical column 'Blood Type' with mode: B-
Filled categorical column 'Doctor' with mode: Angela Contreras
Filled categorical column 'Hospital' with mode: Houston PLC
Filled categorical column 'Insurance Provider' with mode: Blue Cross
Filled numerical column 'Billing Amount' with median: 5313.5078885
Filled categorical column 'Admission Type' with mode: Urgent
✅ Saved processed DataFrame to:
c:\Users\HP\Desktop\Healthcare_test_results_classification-\data\processed\processed_train_data.csv
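
The log above reports mode imputation for categorical columns and median imputation for `Billing Amount`. A minimal sketch of that strategy in plain pandas follows. The function name `fill_missing` and the returned report are assumptions for illustration, and the real `preprocessor.handle_missing_values` may differ.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def fill_missing(df):
    """Hypothetical helper: median for numeric columns, mode for all others."""
    out = df.copy()
    report = {}
    for col in out.columns:
        if not out[col].isna().any():
            continue  # nothing to fill in this column
        if is_numeric_dtype(out[col]):
            strategy, fill_value = "median", out[col].median()
        else:
            # mode() can return several values on ties; take the first one.
            strategy, fill_value = "mode", out[col].mode(dropna=True).iloc[0]
        out[col] = out[col].fillna(fill_value)
        report[col] = (strategy, fill_value)
    return out, report


sample = pd.DataFrame({
    "Blood Type": ["O-", None, "O-", "A+"],
    "Billing Amount": [100.0, None, 300.0, 500.0],
})

filled, report = fill_missing(sample)
print(report)
```

The median is less affected by extreme bills than the mean, which is a common reason to prefer it for skewed amounts.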


<h2>Encoding</h2>

In [37]:

# Encode categorical features (label, one-hot, and frequency encoding), then save
encoded_df = preprocessor.encoding_features(handled_missing)

preprocessor.save_processed_df(encoded_df, "processed_train_data.csv", output_dir="data/processed")
encoded_df.head()

✅ Label encoded 'Gender'.
🎯 Label encoded target column 'Test Results'.
✅ One-hot encoded 'Blood Type'.
✅ One-hot encoded 'Medical Condition'.
✅ One-hot encoded 'Insurance Provider'.
✅ One-hot encoded 'Admission Type'.
✅ One-hot encoded 'Medication'.
✅ Frequency encoded 'Doctor'.
✅ Frequency encoded 'Hospital'.

📐 Encoded shape: (50000, 30)

📄 Preview of encoded dataset:
✅ Saved processed DataFrame to:
c:\Users\HP\Desktop\Healthcare_test_results_classification-\data\processed\processed_train_data.csv


Unnamed: 0,Age,Gender,Date of Admission,Doctor,Hospital,Billing Amount,Discharge Date,Test Results,Blood Type_A-,Blood Type_AB+,...,Insurance Provider_Blue Cross,Insurance Provider_Cigna,Insurance Provider_Medicare,Insurance Provider_UnitedHealthcare,Admission Type_Emergency,Admission Type_Urgent,Medication_Ibuprofen,Medication_Lipitor,Medication_Paracetamol,Medication_Penicillin
0,27,1,2022-06-06,528,1350,2625.980554,2022-08-18,0,False,False,...,False,True,False,False,False,False,True,False,False,False
1,68,1,2021-11-19,1389,2108,1471.387317,2021-11-20,2,False,False,...,False,True,False,False,False,False,True,False,False,False
2,21,1,2022-03-05,349,1561,5131.488104,2022-05-16,0,False,False,...,False,False,True,False,True,False,False,False,True,False
3,91,0,2020-04-06,66,1423,8972.793157,2020-04-26,1,False,False,...,True,False,False,False,False,True,True,False,False,False
4,52,1,2022-12-31,360,1350,2015.522684,2023-02-11,1,False,False,...,True,False,False,False,True,False,False,False,False,True
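
The three encoding schemes named in the log can be sketched with plain pandas as follows. This is an illustration, not the body of `preprocessor.encoding_features`. The integer mappings are inferred from the preview above (Female → 1, Male → 0; Normal → 0, Abnormal → 1, Inconclusive → 2), and `drop_first=True` is an assumption that matches the absence of a `Blood Type_A+` column in that preview.

```python
import pandas as pd

sample = pd.DataFrame({
    "Gender": ["Female", "Female", "Male"],
    "Blood Type": ["O-", "A+", "O-"],
    "Doctor": ["Angela Contreras", "David Ruiz", "Angela Contreras"],
    "Test Results": ["Normal", "Inconclusive", "Normal"],
})

encoded = sample.copy()

# Label encoding: one integer per category (mappings inferred from the preview).
gender_map = {"Male": 0, "Female": 1}
target_map = {"Normal": 0, "Abnormal": 1, "Inconclusive": 2}
encoded["Gender"] = encoded["Gender"].map(gender_map)
encoded["Test Results"] = encoded["Test Results"].map(target_map)

# Frequency encoding: replace each high-cardinality value with its row count.
doctor_counts = encoded["Doctor"].value_counts()
encoded["Doctor"] = encoded["Doctor"].map(doctor_counts)

# One-hot encoding: one indicator column per category, dropping the first
# category to avoid a redundant column (assumed, see the note above).
encoded = pd.get_dummies(encoded, columns=["Blood Type"], drop_first=True)

print(encoded)
```

Frequency encoding keeps `Doctor` and `Hospital` as single numeric columns, whereas one-hot encoding them would add one column per distinct name.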


<h2>Scaling (Standardization)</h2>


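PCA, Logistic Regression, SVM, and MLP are all sensitive to feature scale, so the numerical features are standardized with StandardScaler (zero mean, unit variance).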

In [38]:
# Standardize the numerical columns, then save
scaled_df = preprocessor.scale_numerical_features(encoded_df)

preprocessor.save_processed_df(scaled_df, "processed_train_data.csv", output_dir="data/processed")
scaled_df.head()

✅ Scaled numerical columns: ['Age', 'Billing Amount']

📐 Scaled shape: (50000, 30)

📄 Preview of scaled dataset:
✅ Saved processed DataFrame to:
c:\Users\HP\Desktop\Healthcare_test_results_classification-\data\processed\processed_train_data.csv


Unnamed: 0,Age,Gender,Date of Admission,Doctor,Hospital,Billing Amount,Discharge Date,Test Results,Blood Type_A-,Blood Type_AB+,...,Insurance Provider_Blue Cross,Insurance Provider_Cigna,Insurance Provider_Medicare,Insurance Provider_UnitedHealthcare,Admission Type_Emergency,Admission Type_Urgent,Medication_Ibuprofen,Medication_Lipitor,Medication_Paracetamol,Medication_Penicillin
0,-0.7726,1,2022-06-06,528,1350,-0.861078,2022-08-18,0,False,False,...,False,True,False,False,False,False,True,False,False,False
1,0.906636,1,2021-11-19,1389,2108,-1.219978,2021-11-20,2,False,False,...,False,True,False,False,False,False,True,False,False,False
2,-1.018342,1,2022-03-05,349,1561,-0.082254,2022-05-16,0,False,False,...,False,False,True,False,True,False,False,False,True,False
3,1.848646,0,2020-04-06,66,1423,1.111797,2020-04-26,1,False,False,...,True,False,False,False,False,True,True,False,False,False
4,0.251324,1,2022-12-31,360,1350,-1.050836,2023-02-11,1,False,False,...,True,False,False,False,True,False,False,False,False,True
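
For reference, standardization of `Age` and `Billing Amount` could be done with scikit-learn's `StandardScaler` roughly as below. This is a sketch that assumes `preprocessor.scale_numerical_features` wraps something similar. The toy values and the separate `test` frame are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data standing in for the encoded train and test sets (values are illustrative).
train = pd.DataFrame({
    "Age": [20.0, 40.0, 60.0],
    "Billing Amount": [1000.0, 2000.0, 6000.0],
    "Gender": [1, 0, 1],
})
test = pd.DataFrame({"Age": [40.0], "Billing Amount": [3000.0], "Gender": [0]})

numeric_cols = ["Age", "Billing Amount"]  # encoded flags such as Gender are left untouched

scaler = StandardScaler()
train_scaled = train.copy()
test_scaled = test.copy()

# Learn mean and standard deviation from the training data only...
train_scaled[numeric_cols] = scaler.fit_transform(train[numeric_cols])
# ...and reuse those statistics for the test data.
test_scaled[numeric_cols] = scaler.transform(test[numeric_cols])

print(train_scaled)
print(test_scaled)
```

Fitting the scaler on the training set and only transforming the test set keeps test statistics out of the preprocessing step.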
