# Data Transformation and Exploration


- Data mapping
- Replacing columns from raw data
- Imputation for missing values
- One-hot encoding
- Feature Selection (Chi-square test)

--- 

Before running any code, we import the libraries used in this notebook:

In [14]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import pickle


---

## Loading data files

In [15]:
df_train = pd.read_csv("Dataset/diabetic_data_training.csv")
df_test = pd.read_csv('Dataset/diabetic_data_test.csv')

display(df_train.shape, df_test.shape)

(91589, 50)

(10177, 50)

---

### Data Mapping and Dictionary Creation

- The next cell performs **data mapping** (a preprocessing step): it reads a CSV file (`IDS_mapping.csv`) that maps numerical IDs to their corresponding descriptions.
- The goal is to build one dictionary per ID type, which can then be used to translate the numerical codes in the main dataset into human-readable labels.
- We then replace the codes in `admission_type_id`, `discharge_disposition_id`, and `admission_source_id` in the training (`df_train`) and testing (`df_test`) datasets with these labels. This makes the data easier to interpret and prepares it for further analysis or modeling.
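
To make the idea concrete, here is a minimal sketch on toy data. The in-memory CSV is a hypothetical stand-in that assumes `IDS_mapping.csv` stacks several ID tables in one file, each introduced by its own header row and separated by an empty row; the names `toy_csv`, `toy_dicts`, and `codes` are illustrative only.

```python
import io
import pandas as pd

# Toy stand-in for the mapping file: two ID tables stacked in one CSV.
toy_csv = io.StringIO(
    "admission_type_id,description\n"
    "1,Emergency\n"
    "2,Urgent\n"
    ",\n"
    "admission_source_id,description\n"
    "1,Physician Referral\n"
    "7,Emergency Room\n"
)
toy_mapping = pd.read_csv(toy_csv, dtype=str).dropna(how="all")

# Build one dictionary per ID type. A non-numeric first cell marks the
# header row of the next table.
toy_dicts = {"admission_type_id": {}}
current = "admission_type_id"
for key, description in toy_mapping.itertuples(index=False, name=None):
    if str(key).isdigit():
        toy_dicts[current][key] = description
    else:
        current = key
        toy_dicts[current] = {}

# Translate numeric codes into labels; codes without an entry become NaN.
codes = pd.Series([1, 7, 99])
labels = codes.astype(str).map(toy_dicts["admission_source_id"])
print(labels)
```

Codes that have no entry in the dictionary come back as missing values, so it can be worth checking `labels.isna().sum()` after mapping.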

In [16]:
mapping_csv = pd.read_csv("Dataset/IDS_mapping.csv")
mapping_csv = mapping_csv.dropna(how='all')
dictionaries = {}
current_dict = 'admission_type_id'
dictionaries[current_dict] = {}


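for _, row in mapping_csv.iterrows():
   key, description = row.to_list()[:2]
   if str(key).isdigit():
       # Data row: store the ID (as a string) with its description
       dictionaries[current_dict][str(key)] = str(description)
   else:
       # Header row of the next table: start a new dictionary
       current_dict = key
       dictionaries[current_dict] = {}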

In [17]:
# Map the ID codes in the dataset to their textual descriptions
id_columns = ['admission_type_id', 'discharge_disposition_id', 'admission_source_id']

# Apply the same mapping to the train and test data
for df in (df_train, df_test):
   for col in id_columns:
       if col in df.columns:
           df[col] = df[col].astype(str).map(dictionaries[col])

display(df_train.head(10), df_test.head(10))

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,149190,55629189,Caucasian,Female,[10-20),?,Emergency,Discharged to home,Emergency Room,3,...,No,Up,No,No,No,No,No,Ch,Yes,>30
1,64410,86047875,AfricanAmerican,Female,[20-30),?,Emergency,Discharged to home,Emergency Room,2,...,No,No,No,No,No,No,No,No,Yes,NO
2,500364,82442376,Caucasian,Male,[30-40),?,Emergency,Discharged to home,Emergency Room,2,...,No,Up,No,No,No,No,No,Ch,Yes,NO
3,16680,42519267,Caucasian,Male,[40-50),?,Emergency,Discharged to home,Emergency Room,1,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
4,35754,82637451,Caucasian,Male,[50-60),?,Urgent,Discharged to home,Clinic Referral,3,...,No,Steady,No,No,No,No,No,No,Yes,>30
5,55842,84259809,Caucasian,Male,[60-70),?,Elective,Discharged to home,Clinic Referral,4,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
6,63768,114882984,Caucasian,Male,[70-80),?,Emergency,Discharged to home,Emergency Room,5,...,No,No,No,No,No,No,No,No,Yes,>30
7,12522,48330783,Caucasian,Female,[80-90),?,Urgent,Discharged to home,Transfer from a hospital,13,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
8,15738,63555939,Caucasian,Female,[90-100),?,Elective,Discharged/transferred to SNF,Transfer from a hospital,12,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
9,36900,77391171,AfricanAmerican,Male,[60-70),?,Urgent,Discharged to home,Transfer from a hospital,7,...,No,Steady,No,No,No,No,No,Ch,Yes,<30


Unnamed: 0,encounter_id,patient_nbr,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),?,,Not Mapped,Physician Referral,1,...,No,No,No,No,No,No,No,No,No,NO
1,28236,89869032,AfricanAmerican,Female,[40-50),?,Emergency,Discharged to home,Emergency Room,9,...,No,Steady,No,No,No,No,No,No,Yes,>30
2,150006,22864131,?,Female,[50-60),?,Urgent,Discharged to home,Transfer from a hospital,2,...,No,Down,No,No,No,No,No,Ch,Yes,NO
3,253380,56480238,AfricanAmerican,Female,[60-70),?,Emergency,Discharged to home,Emergency Room,6,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
4,383430,80588529,Caucasian,Female,[70-80),?,Emergency,Discharged/transferred to another short term h...,Emergency Room,1,...,No,Down,No,No,No,No,No,Ch,Yes,>30
5,550098,21820806,AfricanAmerican,Male,[50-60),?,Urgent,Discharged to home,Clinic Referral,4,...,No,No,No,No,No,No,No,No,No,<30
6,676422,63754317,AfricanAmerican,Female,[70-80),?,Emergency,Discharged to home,Emergency Room,4,...,No,Steady,No,No,No,No,No,No,Yes,>30
7,870294,95075649,Caucasian,Female,[70-80),?,Emergency,Discharged/transferred to home with home healt...,Emergency Room,7,...,No,No,No,No,No,No,No,No,No,<30
8,1072554,114039603,Caucasian,Male,[70-80),?,Elective,Discharged to home,Clinic Referral,3,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
9,1161024,20830941,Caucasian,Female,[70-80),?,Emergency,Discharged/transferred to another short term h...,Emergency Room,5,...,No,Steady,No,No,No,No,No,Ch,Yes,>30


The output shows that the ID codes have been replaced by their descriptions.

---

### Dropping Unnecessary Columns from Train and Test Data

- We do some **data cleaning** and remove specific columns manually from the training (`df_train`) and testing (`df_test`) datasets to streamline the data and eliminate some features that are not useful for modeling or analysis.

In [18]:
# Drop the columns listed below from df_train and df_test
features_to_drop = ['payer_code', 'medical_specialty', 'weight']

print("Number of columns in df_train_ before dropping:", len(df_train.columns))
print("Number of columns in df_test_ before dropping:", len(df_test.columns))
print("\n")

# Drop the columns from both dataframes
df_train = df_train.drop(columns=features_to_drop)
df_test = df_test.drop(columns=features_to_drop)

# Print the number of columns remaining
print("Number of columns in df_train_ after dropping:", len(df_train.columns))
print("Number of columns in df_test_ after dropping:", len(df_test.columns))


Number of columns in df_train_ before dropping: 50
Number of columns in df_test_ before dropping: 50


Number of columns in df_train_ after dropping: 47
Number of columns in df_test_ after dropping: 47


---

### Replacing Symbols

- Replaces every instance of the placeholder value `'?'` in the training (`df_train`) and testing (`df_test`) datasets with `pd.NA`, which pandas treats as a missing value.
- Also replaces `None` or `'none'` entries in two specific columns (`max_glu_serum` and `A1Cresult`), in both datasets, with a descriptive string so that they are not mistaken for missing data: **"The test was not conducted on this patient"**.
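
As a small illustration on toy data (the column values below are made up), the placeholder can be turned into a missing value either after loading, as this notebook does, or at load time through the `na_values` parameter of `pd.read_csv`. This is only a sketch of the two options, not a change to the pipeline:

```python
import io
import pandas as pd

raw = "race,weight,A1Cresult\nCaucasian,?,>7\n?,?,Norm\nAsian,?,>8\n"

# Option 1: read the file as-is, then replace the placeholder afterwards.
after_replace = pd.read_csv(io.StringIO(raw)).replace("?", pd.NA)

# Option 2: declare the placeholder as a missing value while loading.
at_load = pd.read_csv(io.StringIO(raw), na_values="?")

missing_after_replace = after_replace.isnull().sum()
missing_at_load = at_load.isnull().sum()
print(missing_after_replace)
print(missing_at_load)
```

Both options report the same number of missing entries per column, so the choice is mostly a matter of where in the pipeline the cleaning should live.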


In [19]:
# Replace '?' with pd.NA (missing value)
df_train = df_train.replace('?', pd.NA)
df_test = df_test.replace('?', pd.NA)

# Replace None or 'none' values in the specified columns with the desired text (train data)
df_train[['max_glu_serum', 'A1Cresult']] = df_train[['max_glu_serum', 'A1Cresult']].replace(
    [None, 'none'], "The test was not conducted on this patient"
)
# Same replacement for the test data
df_test[['max_glu_serum', 'A1Cresult']] = df_test[['max_glu_serum', 'A1Cresult']].replace(
    [None, 'none'], "The test was not conducted on this patient"
)

display(df_train.head(10), df_test.head(10))

Unnamed: 0,encounter_id,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,num_lab_procedures,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,149190,55629189,Caucasian,Female,[10-20),Emergency,Discharged to home,Emergency Room,3,59,...,No,Up,No,No,No,No,No,Ch,Yes,>30
1,64410,86047875,AfricanAmerican,Female,[20-30),Emergency,Discharged to home,Emergency Room,2,11,...,No,No,No,No,No,No,No,No,Yes,NO
2,500364,82442376,Caucasian,Male,[30-40),Emergency,Discharged to home,Emergency Room,2,44,...,No,Up,No,No,No,No,No,Ch,Yes,NO
3,16680,42519267,Caucasian,Male,[40-50),Emergency,Discharged to home,Emergency Room,1,51,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
4,35754,82637451,Caucasian,Male,[50-60),Urgent,Discharged to home,Clinic Referral,3,31,...,No,Steady,No,No,No,No,No,No,Yes,>30
5,55842,84259809,Caucasian,Male,[60-70),Elective,Discharged to home,Clinic Referral,4,70,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
6,63768,114882984,Caucasian,Male,[70-80),Emergency,Discharged to home,Emergency Room,5,73,...,No,No,No,No,No,No,No,No,Yes,>30
7,12522,48330783,Caucasian,Female,[80-90),Urgent,Discharged to home,Transfer from a hospital,13,68,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
8,15738,63555939,Caucasian,Female,[90-100),Elective,Discharged/transferred to SNF,Transfer from a hospital,12,33,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
9,36900,77391171,AfricanAmerican,Male,[60-70),Urgent,Discharged to home,Transfer from a hospital,7,62,...,No,Steady,No,No,No,No,No,Ch,Yes,<30


Unnamed: 0,encounter_id,patient_nbr,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,num_lab_procedures,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmitted
0,2278392,8222157,Caucasian,Female,[0-10),,Not Mapped,Physician Referral,1,41,...,No,No,No,No,No,No,No,No,No,NO
1,28236,89869032,AfricanAmerican,Female,[40-50),Emergency,Discharged to home,Emergency Room,9,47,...,No,Steady,No,No,No,No,No,No,Yes,>30
2,150006,22864131,,Female,[50-60),Urgent,Discharged to home,Transfer from a hospital,2,66,...,No,Down,No,No,No,No,No,Ch,Yes,NO
3,253380,56480238,AfricanAmerican,Female,[60-70),Emergency,Discharged to home,Emergency Room,6,87,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
4,383430,80588529,Caucasian,Female,[70-80),Emergency,Discharged/transferred to another short term h...,Emergency Room,1,28,...,No,Down,No,No,No,No,No,Ch,Yes,>30
5,550098,21820806,AfricanAmerican,Male,[50-60),Urgent,Discharged to home,Clinic Referral,4,40,...,No,No,No,No,No,No,No,No,No,<30
6,676422,63754317,AfricanAmerican,Female,[70-80),Emergency,Discharged to home,Emergency Room,4,48,...,No,Steady,No,No,No,No,No,No,Yes,>30
7,870294,95075649,Caucasian,Female,[70-80),Emergency,Discharged/transferred to home with home healt...,Emergency Room,7,75,...,No,No,No,No,No,No,No,No,No,<30
8,1072554,114039603,Caucasian,Male,[70-80),Elective,Discharged to home,Clinic Referral,3,29,...,No,Steady,No,No,No,No,No,Ch,Yes,NO
9,1161024,20830941,Caucasian,Female,[70-80),Emergency,Discharged/transferred to another short term h...,Emergency Room,5,36,...,No,Steady,No,No,No,No,No,Ch,Yes,>30


The output shows that the columns have been dropped and the placeholders replaced.

---

### Converting Age from Categorical to Numerical Values

This code transforms the `age` column in the training (`df_train`) and testing (`df_test`) datasets from categorical ten-year bins into representative numerical values, namely the midpoint of each bin (for example, `[10-20)` becomes 15). This simplifies analysis and allows age to be treated as a numerical variable in modeling.
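
The midpoints do not have to be typed by hand. As a sketch, the hypothetical helper `bin_midpoint` below (not part of the notebook) derives them from the bin labels, assuming every label has the form `[low-high)`:

```python
import re

def bin_midpoint(label):
    """Return the midpoint of a half-open interval label such as '[10-20)'."""
    match = re.fullmatch(r"\[(\d+)-(\d+)\)", label)
    if match is None:
        raise ValueError(f"unexpected age bin: {label!r}")
    low, high = map(int, match.groups())
    return (low + high) // 2

# Rebuild the ten age bins and derive the mapping from them.
age_bins = [f"[{low}-{low + 10})" for low in range(0, 100, 10)]
derived_mapping = {label: bin_midpoint(label) for label in age_bins}
print(derived_mapping)
```

Raising an error on an unexpected label makes malformed bins visible immediately instead of silently turning them into missing values.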

In [20]:
# convert age from categorical to numerical variable
age_mapping = {
       '[0-10)' : 5,
       '[10-20)' : 15,
       '[20-30)' : 25,
       '[30-40)' : 35,
       '[40-50)' : 45,
       '[50-60)' : 55,
       '[60-70)' : 65,
       '[70-80)' : 75,
       '[80-90)' : 85,
       '[90-100)' : 95,
}

df_train['age'] = df_train['age'].map(age_mapping)
df_test['age'] = df_test['age'].map(age_mapping)



---
### Imputing Missing Values in the Training Data

- Checks for any missing values in the `df_train` dataset
- Then imputes the missing values
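
One detail worth keeping in mind when reading the imputation code: `KNNImputer` only fills entries that are marked as missing (`np.nan` by default). The toy arrays below are made up for illustration and show both cases, an array with one missing entry and an array with none:

```python
import numpy as np
from sklearn.impute import KNNImputer

# One missing entry: it is filled with the mean of its 2 nearest neighbours
# (rows 0 and 2, the only rows where the second feature is observed).
X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0]])
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)

# No missing entries: the imputer has nothing to do and returns the data unchanged.
X_complete = np.array([[1.0, 2.0],
                       [3.0, 9.0],
                       [5.0, 6.0]])
X_unchanged = KNNImputer(n_neighbors=2).fit_transform(X_complete)

print(X_filled)
print(X_unchanged)
```

In other words, if missing values are replaced by a placeholder category before the imputer runs, the imputer sees complete data and leaves the placeholder in place.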

In [21]:
# Checking for Missing Values
missing_values_df_train = df_train.isnull().sum()
missing_values_df_train = missing_values_df_train[missing_values_df_train > 0]

# Print the features with missing values
print("\nFeatures with Missing Values in df_train:")
print(missing_values_df_train)


Features with Missing Values in df_train:
race      2059
diag_1      21
diag_2     320
diag_3    1266
dtype: int64


In [22]:
# Create label encoders
label_encoders = {}
other_cols = []
columns_to_impute = ['race', 'diag_1', 'diag_2', 'diag_3']

# Function to encode and decode 
def encode_categorical(df, columns):
    df_encoded = df.copy()
    for col in columns:
        if df[col].dtype == 'object':  # If column is categorical
            if col not in label_encoders:
                label_encoders[col] = LabelEncoder()
                # Fill NaN with a placeholder first
                df_encoded[col] = df_encoded[col].fillna('Missing')
                # Fit and transform
                df_encoded[col] = label_encoders[col].fit_transform(df_encoded[col])
    return df_encoded

def decode_categorical(df):
    df_decoded = df.copy()
    for col in columns_to_impute:
        if col in label_encoders:
            df_decoded[col] = label_encoders[col].inverse_transform(df_decoded[col].astype(int))
    return df_decoded

# Encode the training data
df_train_encoded = encode_categorical(df_train, columns_to_impute)

# Now apply KNN imputation
imputer = KNNImputer(n_neighbors=5)

# Fit and transform training data
df_train_imputed = pd.DataFrame(
    imputer.fit_transform(df_train_encoded[columns_to_impute]),
    columns=columns_to_impute,
    index=df_train_encoded.index
)

# Round the values since we're working with categorical data
for col in columns_to_impute:
    df_train_imputed[col] = df_train_imputed[col].round()

# Decode the imputed values back to categories
df_train_imputed_decoded = decode_categorical(df_train_imputed)

for col in df_train:
    if col not in columns_to_impute:
        other_cols.append(col)
if other_cols:
    df_train = pd.concat([df_train_imputed_decoded, df_train[other_cols]], axis=1)


In [23]:
# Checks again to see if imputation is successful or not
missing_values_df_train = df_train.isnull().sum()
missing_values_df_train = missing_values_df_train[missing_values_df_train > 0]

# Print the features with missing values
print("\nFeatures with Missing Values in df_train after imputation:")
print(missing_values_df_train)


Features with Missing Values in df_train after imputation:
Series([], dtype: int64)


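No missing values remain in `df_train`. Note that `encode_categorical` fills missing entries with the placeholder `'Missing'` before label encoding, so `KNNImputer` receives no `NaN` values here, and the formerly missing entries are kept as a separate `'Missing'` category (visible later in features such as `race_Missing`).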

---
### One-Hot Encoding of Categorical Variables and Separation of Numerical and Categorical Columns

- Splits the data into numerical and categorical columns, which facilitates the **one-hot encoding** step.
- Performs **one-hot encoding** on the categorical features in the training (`df_train`) and testing (`df_test`) datasets.
- Transforms each categorical variable into binary columns, so that the dataset is in a format that machine learning models can process.
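
A point to watch with `pd.get_dummies` is that it creates one column per category it actually sees, so encoding train and test separately can yield different columns. The sketch below uses made-up toy frames to show the mismatch and one possible way to align the test columns to the training columns with `DataFrame.reindex`:

```python
import pandas as pd

toy_train = pd.DataFrame({"insulin": ["No", "Up", "Steady"]})
toy_test = pd.DataFrame({"insulin": ["No", "Down"]})

train_dummies = pd.get_dummies(toy_train, drop_first=False)
test_dummies = pd.get_dummies(toy_test, drop_first=False)

# Categories seen in only one of the two datasets produce unmatched columns.
only_in_train = sorted(set(train_dummies.columns) - set(test_dummies.columns))
only_in_test = sorted(set(test_dummies.columns) - set(train_dummies.columns))
print(only_in_train, only_in_test)

# Align the test columns to the training columns: missing ones are added
# and filled with 0, extra ones are dropped, and the order follows train.
aligned_test = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(aligned_test)
```

After alignment both frames share the same columns in the same order, which is what scikit-learn transformers and estimators expect.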

In [24]:
# Separate the target variable
readmitted_col_df_train_ = df_train['readmitted']
df_train = df_train.drop(columns=['readmitted'])

readmitted_col_df_test_ = df_test['readmitted']
df_test = df_test.drop(columns=['readmitted'])


# Encode the target as 0, 1, and 2 for the three classes: <30, >30, and NO
label_encoder = LabelEncoder()

# Fit on the training labels only, then reuse the same encoding for the test labels
df_train_target = label_encoder.fit_transform(readmitted_col_df_train_)
df_test_target = label_encoder.transform(readmitted_col_df_test_)

# Create a list with the categorical columns and one with the numerical columns
categorical_columns = df_train.select_dtypes(include=['object']).columns.tolist()
numerical_columns = df_train.select_dtypes(include=['int64', 'float64']).columns.tolist()

print("\nCategorical Columns in df_train_:\n", categorical_columns)
print("\nNumerical Columns in df_train_:\n", numerical_columns)

# One-hot encoding of all our categorical variables.
df_train_cat = pd.get_dummies(df_train[categorical_columns],drop_first=False) 
df_test_cat = pd.get_dummies(df_test[categorical_columns],drop_first=False)

# Gets the numerical variables
df_train_num = df_train[numerical_columns]
df_test_num = df_test[numerical_columns]



Categorical Columns in df_train_:
 ['race', 'diag_1', 'diag_2', 'diag_3', 'gender', 'admission_type_id', 'discharge_disposition_id', 'admission_source_id', 'max_glu_serum', 'A1Cresult', 'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone', 'tolazamide', 'examide', 'citoglipton', 'insulin', 'glyburide-metformin', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'change', 'diabetesMed']

Numerical Columns in df_train_:
 ['encounter_id', 'patient_nbr', 'age', 'time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses']


--- 
### Concatenating Numerical and Categorical Features

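- Combines the numerical features and the one-hot encoded categorical features of the training and testing datasets into single DataFrames (reusing the names `df_train` and `df_test`). The target variable is kept separately in `df_train_target` and `df_test_target`.
- Prepares the data for modeling by placing all features in a single DataFrame. At this stage the two datasets have different numbers of columns, because some categories appear in only one of them; the columns are aligned in the Feature Selection section.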

In [25]:
# Now, we concatenate all types of variables together
df_train = pd.concat([df_train_num, df_train_cat], axis = 1)
df_test = pd.concat([df_test_num, df_test_cat], axis = 1)

# Displays the number of columns in both datasets now
print(df_train.shape)
print(df_test.shape)

(91589, 2369)
(10177, 1538)


---
### Converting Binary Categorical Values to Numerical Format

- Replaces binary categorical values (`Yes`, `No`, `True`, `False`) in the concatenated train and test datasets with numerical values (`1` and `0`).
- Ensures that all features are numerical and ready for model training.
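
If every remaining non-numeric column is boolean, as `pd.get_dummies` produces in recent pandas versions, a type cast is one possible alternative to value replacement. This is a sketch on toy data under that assumption, not the notebook's own method:

```python
import pandas as pd

# One-hot encode a toy column, then cast the indicator columns to integers.
toy = pd.get_dummies(pd.DataFrame({"change": ["Ch", "No", "Ch"]}))
toy_int = toy.astype(int)
print(toy_int)
```

The cast only works when all columns are boolean or numeric; any leftover string column would raise an error, which doubles as a check that the encoding step was complete.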


In [26]:
# Replace any remaining non-numerical binary values with 0 and 1:
# Yes and True become 1, No and False become 0.

df_train_ = df_train.replace({'Yes': 1, 'No': 0, True: 1, False: 0})
df_test_ = df_test.replace({'Yes': 1, 'No': 0, True: 1, False: 0})

display(df_train_.head(10), df_test_.head(10))




Unnamed: 0,encounter_id,patient_nbr,age,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,...,glipizide-metformin_Steady,glimepiride-pioglitazone_No,metformin-rosiglitazone_No,metformin-rosiglitazone_Steady,metformin-pioglitazone_No,metformin-pioglitazone_Steady,change_Ch,change_No,diabetesMed_No,diabetesMed_Yes
0,149190,55629189,15,3,59,0,18,0,0,0,...,0,1,1,0,1,0,1,0,0,1
1,64410,86047875,25,2,11,5,13,2,0,1,...,0,1,1,0,1,0,0,1,0,1
2,500364,82442376,35,2,44,1,16,0,0,0,...,0,1,1,0,1,0,1,0,0,1
3,16680,42519267,45,1,51,0,8,0,0,0,...,0,1,1,0,1,0,1,0,0,1
4,35754,82637451,55,3,31,6,16,0,0,0,...,0,1,1,0,1,0,0,1,0,1
5,55842,84259809,65,4,70,1,21,0,0,0,...,0,1,1,0,1,0,1,0,0,1
6,63768,114882984,75,5,73,0,12,0,0,0,...,0,1,1,0,1,0,0,1,0,1
7,12522,48330783,85,13,68,2,28,0,0,0,...,0,1,1,0,1,0,1,0,0,1
8,15738,63555939,95,12,33,3,18,0,0,0,...,0,1,1,0,1,0,1,0,0,1
9,36900,77391171,65,7,62,0,11,0,0,0,...,0,1,1,0,1,0,1,0,0,1


Unnamed: 0,encounter_id,patient_nbr,age,time_in_hospital,num_lab_procedures,num_procedures,num_medications,number_outpatient,number_emergency,number_inpatient,...,glipizide-metformin_No,glimepiride-pioglitazone_No,glimepiride-pioglitazone_Steady,metformin-rosiglitazone_No,metformin-rosiglitazone_Steady,metformin-pioglitazone_No,change_Ch,change_No,diabetesMed_No,diabetesMed_Yes
0,2278392,8222157,5,1,41,0,1,0,0,0,...,1,1,0,1,0,1,0,1,1,0
1,28236,89869032,45,9,47,2,17,0,0,0,...,1,1,0,1,0,1,0,1,0,1
2,150006,22864131,55,2,66,1,19,0,0,0,...,1,1,0,1,0,1,1,0,0,1
3,253380,56480238,65,6,87,0,18,0,0,0,...,1,1,0,1,0,1,1,0,0,1
4,383430,80588529,75,1,28,0,15,0,0,0,...,1,1,0,1,0,1,1,0,0,1
5,550098,21820806,55,4,40,1,14,0,0,0,...,1,1,0,1,0,1,0,1,1,0
6,676422,63754317,75,4,48,2,15,0,1,0,...,1,1,0,1,0,1,0,1,0,1
7,870294,95075649,75,7,75,2,22,0,0,0,...,1,1,0,1,0,1,0,1,1,0
8,1072554,114039603,75,3,29,1,7,0,0,0,...,1,1,0,1,0,1,1,0,0,1
9,1161024,20830941,75,5,36,0,14,0,0,0,...,1,1,0,1,0,1,1,0,0,1


---
### Feature Selection

- Selects the most informative features with a chi-square test, so that the final dataset retains only the features that contribute most to the total chi-square score.
- Reducing the number of features in this way can improve model performance, reduce overfitting, and speed up training.
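
The selection rule used in this section (keep the highest-scoring features until they account for 90% of the total chi-square score) can be sketched on toy data as follows. The data and the names `signal`, `noise_a`, and `noise_b` are made up; note that the names of the selected columns are read from `get_support()`, since `SelectKBest` returns the kept columns in their original order rather than sorted by score.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
y = np.array([0, 1] * 50)
X = pd.DataFrame({
    "signal": y,                              # perfectly tracks the label
    "noise_a": rng.integers(0, 2, size=100),  # unrelated to the label
    "noise_b": rng.integers(0, 2, size=100),  # unrelated to the label
})

# Score every feature, sort by score, and accumulate each feature's share.
scores = pd.Series(chi2(X, y)[0], index=X.columns).sort_values(ascending=False)
cumulative = scores.cumsum() / scores.sum()

# Smallest k whose top-k features reach 90% of the total score.
k = int(np.argmax(cumulative.to_numpy() >= 0.9)) + 1

selector = SelectKBest(score_func=chi2, k=k).fit(X, y)
selected = X.columns[selector.get_support()].tolist()
print(k, selected)
```

The chi-square score requires non-negative feature values, which is why the features are min-max scaled before the test is applied to the real data.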

In [27]:
# Ensure the test dataset has the same columns as the training dataset:
# columns missing from the test data are added and filled with zeros,
# columns that exist only in the test data are dropped, and the
# training column order is applied.
df_test_ = df_test_.reindex(columns=df_train_.columns, fill_value=0)

# Verify shapes and column alignment
print("Training shape:", df_train_.shape)
print("Test shape:", df_test_.shape)
print("Columns identical:", all(df_train_.columns == df_test_.columns))

Training shape: (91589, 2369)
Test shape: (10177, 2369)
Columns identical: True




This check confirms that the training and test sets have identical columns before the chi-square test is performed.

In [28]:
# Min-max scaling: the chi-square test requires non-negative feature values
minmax_scaler = MinMaxScaler()
X_train_minmax_scaled = minmax_scaler.fit_transform(df_train_)
X_test_minmax_scaled = minmax_scaler.transform(df_test_) 

print("Applying Chi-Square Test...")
# Initially fit with 'all' features to get scores
chi2_selector = SelectKBest(score_func=chi2, k='all')
chi2_selector.fit(X_train_minmax_scaled, df_train_target)

# Get and sort Chi-Square scores
chi2_scores = pd.Series(chi2_selector.scores_, index=df_train_.columns).sort_values(ascending=False)

# Calculate cumulative explained contribution
cumulative_scores_chi2 = np.cumsum(chi2_scores) / chi2_scores.sum()

# Find optimal number of features
desired_contribution = 0.9
optimal_k_chi2 = np.argmax(cumulative_scores_chi2 >= desired_contribution) + 1
print(f"Optimal number of features to retain from Chi-Square: {optimal_k_chi2}")

# Get top features
top_features_chi2 = chi2_scores.head(optimal_k_chi2).index.tolist()
print(f"Top {optimal_k_chi2} features selected by Chi-Square:\n", top_features_chi2)

# Now refit the selector with the optimal k
chi2_selector = SelectKBest(score_func=chi2, k=optimal_k_chi2)
X_train_chi2_selected = chi2_selector.fit_transform(X_train_minmax_scaled, df_train_target)
X_test_chi2_selected = chi2_selector.transform(X_test_minmax_scaled) 

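# Convert to DataFrames. SelectKBest keeps the selected columns in their
# original order (not sorted by score), so take the names from get_support().
selected_features_chi2 = df_train_.columns[chi2_selector.get_support()].tolist()
X_train_final = pd.DataFrame(X_train_chi2_selected, columns=selected_features_chi2)
X_test_final = pd.DataFrame(X_test_chi2_selected, columns=selected_features_chi2)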

Applying Chi-Square Test...
Optimal number of features to retain from Chi-Square: 838
Top 838 features selected by Chi-Square:
 ['discharge_disposition_id_Expired', 'number_inpatient', 'discharge_disposition_id_Discharged/transferred to another rehab fac including rehab units of a hospital .', 'diag_1_428', 'discharge_disposition_id_Discharged/transferred to home with home health service', 'admission_source_id_ Transfer from another health care facility', 'admission_source_id_Transfer from a hospital', 'diag_2_250', 'diabetesMed_No', 'diag_2_403', 'admission_source_id_ Emergency Room', 'admission_type_id_Elective', 'diag_3_401', 'insulin_Down', 'discharge_disposition_id_Hospice / medical facility', 'diag_3_250', 'diag_1_V58', 'discharge_disposition_id_Discharged to home', 'diag_2_401', 'diag_3_585', 'diag_1_491', 'discharge_disposition_id_Discharged/transferred to SNF', 'race_Missing', 'discharge_disposition_id_Hospice / home', 'diag_3_Missing', 'admission_type_id_nan', 'diag_3_403', '

In [29]:
print("Training shape:", X_train_final.shape)
print("Test shape:", X_test_final.shape)
print("Columns identical:", all(X_train_final.columns == X_test_final.columns))

Training shape: (91589, 838)
Test shape: (10177, 838)
Columns identical: True


This confirms that the training and test sets have the same columns before any machine learning model is trained.

---
# Data Storing
- Stores the selected training and test feature arrays, together with the encoded targets, in a pickle file.
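
A pickle round trip can be sketched as follows. The arrays and the file name `toy_train_test_data.pkl` are placeholders chosen for illustration; the point is that the tuple is unpacked on loading in the same order in which it was dumped.

```python
import os
import pickle
import tempfile

import numpy as np

# Toy stand-ins for the selected features and encoded targets.
X_train = np.array([[0.0, 1.0], [1.0, 0.0]])
X_test = np.array([[1.0, 1.0]])
y_train = np.array([0, 2])
y_test = np.array([1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_train_test_data.pkl")

    # Store the four objects as a single tuple.
    with open(path, "wb") as f:
        pickle.dump((X_train, X_test, y_train, y_test), f)

    # Load them back, unpacking in the same order.
    with open(path, "rb") as f:
        X_train_loaded, X_test_loaded, y_train_loaded, y_test_loaded = pickle.load(f)

print(X_train_loaded.shape, X_test_loaded.shape)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.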

In [30]:
with open("train_test_data.pkl", "wb") as f:
    pickle.dump((X_train_chi2_selected, X_test_chi2_selected, df_train_target, df_test_target), f)