## Section 1: Introduction

This notebook analyzes the Labor Force Survey (LFS) dataset from April 2016, provided by the Philippine Statistics Authority (PSA). The primary objective is to preprocess and clean the data, perform exploratory data analysis (EDA) to uncover insights, and build several machine learning models to predict whether an individual has worked in the past week. We will explore Logistic Regression, Decision Tree, and k-Nearest Neighbors models, tune their hyperparameters, and compare their performance to identify the most effective approach for this classification task.

### Library Imports and Setup

First, we import the necessary Python libraries for data manipulation, visualization, and machine learning. We also configure default plotting styles and sizes for consistency.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.neighbors import KNeighborsClassifier
%matplotlib inline

# Set default size of plots
plt.rcParams['figure.figsize'] = (6.0, 6.0)
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

plt.style.use('ggplot')

# Autoreload external python modules
%load_ext autoreload
%autoreload 2

## Section 2: Data Loading and Initial Inspection

In [None]:
try:
    lfs_data = pd.read_csv("src/data/LFS PUF April 2016.CSV")
except FileNotFoundError:
    print("Error: CSV file not found. Please make sure the file exists in the correct directory or provide the correct path.")
    raise  # Re-raise instead of exit(): exit() would shut down the notebook kernel

### Data Information, Pre-Processing, and Cleaning

Let's start by getting a high-level overview of our dataset's structure, including the number of entries, column names, and data types.

In [None]:
lfs_data.info()

The initial output shows a mix of data types:
<ul>
    <li>1 column with float values.</li>
    <li>14 columns with integer values.</li>
    <li><b>35 columns with object (string) values</b>, which will likely require cleaning and conversion.</li>
</ul>

Next, we'll check for any duplicate rows in the dataset.

In [None]:
lfs_data.duplicated().sum()

No duplicates were found, so no action is needed in this regard.

### Handling Whitespace and Null Values

A common issue in datasets is the use of whitespace to represent missing or null values. Let's count the number of whitespace entries in our object-type columns.

In [None]:
has_null = lfs_data.apply(lambda col: col.str.isspace().sum() if col.dtype == 'object' else 0)

print("Columns with Empty Cells:")
print(has_null[has_null > 0])

Many columns contain a significant number of whitespace entries. Our strategy will be to handle these on a case-by-case basis. Instead of a blanket conversion to `NaN`, we will analyze each column to determine the most appropriate way to impute or re-categorize these missing values based on the survey's structure.

We will now go through several key columns, cleaning them and converting their data types to be suitable for analysis and modeling.
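The per-column steps that follow repeat one pattern: replace whitespace-only cells with a placeholder code, then cast to `int64`. As a minimal sketch of that pattern, the helper below (the name `clean_coded_column` is hypothetical and not part of this notebook) applies it to a toy column. It assumes every non-blank cell holds an integer code stored as text.

```python
import pandas as pd

def clean_coded_column(series: pd.Series, placeholder: int = -1) -> pd.Series:
    """Replace whitespace-only (or empty) cells with `placeholder`, then cast to int64."""
    as_text = series.astype(str).str.strip()
    # Whitespace-only cells are empty strings after strip()
    filled = as_text.where(as_text != "", str(placeholder))
    return pd.to_numeric(filled).astype("int64")

# Toy column imitating a PUF field: codes stored as text, blanks for skipped questions
toy = pd.Series(["1", " ", "2", "   ", "6"])
cleaned = clean_coded_column(toy)
print(cleaned.tolist())  # [1, -1, 2, -1, 6]
```

Because the helper strips before comparing, it handles blanks of any width, whereas the exact-match comparisons used in the cells below (`== " "` or `== "   "`) depend on the field width of each column.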

### `PUFC06_MSTAT`: Marital Status

This column represents the marital status of the respondent. Let's examine its current data type and unique values.

In [None]:
print("Data Type of PUFC06_MSTAT: ", lfs_data['PUFC06_MSTAT'].dtype)
print("Values Used in PUFC06_MSTAT: ", sorted(lfs_data['PUFC06_MSTAT'].unique()))

The column is of type `object` and contains whitespace values. We will first replace the whitespace with an arbitrary integer (`-1`) to facilitate type conversion, and then cast the entire column to `int64`.

In [None]:
lfs_data.loc[lfs_data['PUFC06_MSTAT'] == " ", 'PUFC06_MSTAT'] = -1
lfs_data['PUFC06_MSTAT'] = lfs_data['PUFC06_MSTAT'].astype('int64')

Upon reviewing the LFS questionnaire documentation, it's clear that there is a specific code for "unknown" marital status. We can logically impute our missing values to this category, which corresponds to the integer `6`.

In [None]:
lfs_data.loc[lfs_data['PUFC06_MSTAT'] == -1, 'PUFC06_MSTAT'] = 6

Let's double-check our work to ensure the column is now clean and has the correct data type.

In [None]:
print("Data Type of PUFC06_MSTAT: ", lfs_data['PUFC06_MSTAT'].dtype)
print("Values Used in PUFC06_MSTAT: ", np.sort(lfs_data['PUFC06_MSTAT'].unique().tolist()))

### `PUFC07_GRADE`: Highest Grade Completed

This column indicates the highest level of education attained by the respondent. We'll apply a similar cleaning process.

In [None]:
print("Data Type of PUFC07_GRADE: ", lfs_data['PUFC07_GRADE'].dtype)
print("Values Used in PUFC07_GRADE: \n", np.sort(lfs_data['PUFC07_GRADE'].unique().tolist()))

This column also contains whitespace and consists of numeric codes stored as strings. We will convert the whitespace to `-1` and cast the column to `int64`.

In [None]:
lfs_data.loc[lfs_data['PUFC07_GRADE'] == "   ", 'PUFC07_GRADE'] = -1
lfs_data['PUFC07_GRADE'] = lfs_data['PUFC07_GRADE'].astype('int64')

The `-1` value now represents missing information about the highest grade completed. Let's verify the result.

In [None]:
print("Data Type of PUFC07_GRADE: ", lfs_data['PUFC07_GRADE'].dtype)
print("Values Used in PUFC07_GRADE: \n", np.sort(lfs_data['PUFC07_GRADE'].unique().tolist()))

After cleaning, we can analyze the codes more deeply. By cross-referencing with the LFS questionnaire, we find that some numeric codes in our dataset are not explicitly defined in the manual. These likely represent specific courses for post-secondary and college graduates, which the survey allows respondents to specify.

First, let's define the list of codes that *are* explicitly mentioned in the manual.

In [None]:
# These are the codes that were explicitly defined in the manual
valid_codes = [ 
    0,                                                  # No Grade
    10,                                                 # Preschool
    210, 220, 230, 240, 250, 260, 280,                  # Elementary (Grade 1 to Elementary Graduate)
    310, 320, 330, 340, 350,                            # High School (First Year to High School Graduate)
    410, 420,                                           # Post Secondary; If Graduate Specify
    810, 820, 830, 840,                                 # College; If Graduate Specify
    900                                                 # Post Baccalaureate
]

Now, we can identify the values present in our data that are not in this list of `valid_codes`. We hypothesize that these are the user-specified courses.

In [None]:
invalid_values = np.sort(lfs_data[~(lfs_data['PUFC07_GRADE'].isin(valid_codes))]['PUFC07_GRADE'].unique())

Let's separate our previously assigned missing value code (`-1`) from these other "invalid" codes, which we will now refer to as graduate-specified courses.

In [None]:
missing_values = [-1]
graduate_specified_courses = [int(value) for value in invalid_values if value not in missing_values]

To simplify this feature for modeling, we will group all of these specific graduate course codes into a single, new category. We will use the arbitrary code `700` for this purpose.

In [None]:
# Vectorized recode: every graduate-specified course code becomes 700
lfs_data.loc[lfs_data["PUFC07_GRADE"].isin(graduate_specified_courses), "PUFC07_GRADE"] = 700

Let's print a summary of this preprocessing step to confirm our logic.

In [None]:
print("\nPUFC07_GRADE Data Type: ", lfs_data['PUFC07_GRADE'].dtype)
print("\nUnique Codes in Column: ", np.sort(lfs_data['PUFC07_GRADE'].unique().tolist()))
print("\nValid Codes (Original): ", valid_codes)
print("\nInvalid Codes Found (incl. missing): ", invalid_values)
print("\nAssigned Missing Code: ", missing_values)
print("\nCodes for Hypothesized Graduate Specified Courses: ", graduate_specified_courses)
print("\nArbitrary Code for Graduate Specified Courses: ", 700)

### `PUFC08_CURSCH`: Currently Attending School

This column is a binary indicator of whether a person is currently in school.

In [None]:
print("Data Type of PUFC08_CURSCH: ", lfs_data['PUFC08_CURSCH'].dtype)
print("Values Used in PUFC08_CURSCH: ", sorted(lfs_data['PUFC08_CURSCH'].unique()))

The column has a large number of whitespace values. The questionnaire specifies that this question is only asked of respondents aged 5 to 24. We can therefore infer that individuals outside this age range did not answer, and it is logical to assume they are not currently attending school. We will map the whitespace values to `2` (No) and then convert the column to an integer type.

In [None]:
lfs_data.loc[lfs_data['PUFC08_CURSCH'] == " ", 'PUFC08_CURSCH'] = 2
lfs_data['PUFC08_CURSCH'] = lfs_data['PUFC08_CURSCH'].astype('int64')

Let's check the result of our conversion:

In [None]:
print("Data Type of PUFC08_CURSCH: ", lfs_data['PUFC08_CURSCH'].dtype)
print("Values Used in PUFC08_CURSCH: \n", np.sort(lfs_data['PUFC08_CURSCH'].unique().tolist()))
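The recode above rests on the assumption that blanks occur only for respondents outside the 5 to 24 age range. A quick way to check that kind of skip-pattern assumption is to cross-tabulate blanks against the age range before recoding. The sketch below does this on a made-up toy frame; the column names `age` and `cursch` are illustrative stand-ins, not the PUF names.

```python
import pandas as pd

# Toy data: age plus a school-attendance field that is blank when the question was skipped
toy = pd.DataFrame({
    "age": [4, 7, 15, 24, 25, 40],
    "cursch": [" ", "1", "2", "1", " ", " "],
})

is_blank = toy["cursch"].str.strip() == ""
in_asked_range = toy["age"].between(5, 24)  # inclusive on both ends

# Rows where a blank appears although the respondent was in the asked age range
unexpected_blanks = toy[is_blank & in_asked_range]
print(pd.crosstab(in_asked_range, is_blank))
print("Unexpected blanks:", len(unexpected_blanks))
```

On the real data, the same check would have to run on the raw `PUFC08_CURSCH` strings together with `PUFC05_AGE`, before the whitespace is mapped to `2`. A non-zero count of unexpected blanks would suggest that some blanks are true non-responses rather than skips.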

### `PUFC11_WORK`: Worked in the Past Week

This is our target variable. It indicates if the person did any work for at least one hour during the past week. We'll clean it by mapping whitespace to `2` (No) and ensuring the data type is integer.

In [None]:
lfs_data.loc[lfs_data['PUFC11_WORK'] == " ", 'PUFC11_WORK'] = 2
lfs_data['PUFC11_WORK'] = lfs_data['PUFC11_WORK'].astype('int64')

### `PUFC30_LOOKW`: Looked for Work

This column asks if the person looked for work in the past week. According to the survey's flow, this question is skipped (resulting in whitespace) if the respondent worked 48 hours or less. Therefore, a whitespace entry implies the person is not overworked and already has a job. We will handle this logic during preprocessing.

First, we convert the column to a numeric format, using `-1` for the skipped entries.

In [None]:
# 1. Convert whitespace to -1 and other values to integers.
lfs_data.loc[lfs_data['PUFC30_LOOKW'] == " ", 'PUFC30_LOOKW'] = -1
lfs_data.loc[lfs_data['PUFC30_LOOKW'] == "1", 'PUFC30_LOOKW'] = 1
lfs_data.loc[lfs_data['PUFC30_LOOKW'] == "2", 'PUFC30_LOOKW'] = 2
lfs_data['PUFC30_LOOKW'] = lfs_data['PUFC30_LOOKW'].astype('int64')

# 2. For modeling purposes, create a second version of the column where skipped (-1) is treated as 'No' (2).
# This simplifies the feature into a simple binary 'Yes'/'No' for looking for work.
lfs_data['PUFC30_LOOKW_version2'] = lfs_data['PUFC30_LOOKW'].copy()
lfs_data.loc[lfs_data['PUFC30_LOOKW_version2'] == -1, 'PUFC30_LOOKW_version2'] = 2

The original `PUFC30_LOOKW` column (with -1 for skipped) can be used to analyze overwork. A value of 1 or 2 indicates the person answered, which implies they worked more than 48 hours. A value of -1 (skipped) implies they worked 48 hours or less.

In [None]:
not_overworked_people = (lfs_data['PUFC30_LOOKW'] == -1).sum()
overworked_people = (lfs_data['PUFC30_LOOKW'] == 1).sum() + (lfs_data['PUFC30_LOOKW'] == 2).sum()
total_population = overworked_people + not_overworked_people

print(f"People who are NOT overworked (worked <= 48 hours): {not_overworked_people}")
print(f"Percentage: {100 * not_overworked_people / total_population:.2f}%")
print(f"\nPeople who ARE overworked (worked > 48 hours): {overworked_people}")
print(f"Percentage: {100 * overworked_people / total_population:.2f}%")

### `PUFC31_FLWRK`: First Time Looking for Work

This column is also part of a sequence of questions and will be cleaned similarly to the previous one.

In [None]:
# 1. Convert whitespace to -1 and other values to integers.
lfs_data.loc[lfs_data['PUFC31_FLWRK'] == " ", 'PUFC31_FLWRK'] = -1
lfs_data.loc[lfs_data['PUFC31_FLWRK'] == "1", 'PUFC31_FLWRK'] = 1
lfs_data.loc[lfs_data['PUFC31_FLWRK'] == "2", 'PUFC31_FLWRK'] = 2
lfs_data['PUFC31_FLWRK'] = lfs_data['PUFC31_FLWRK'].astype('int64')

# 2. Create a simplified binary version for modeling.
lfs_data['PUFC31_FLWRK_version2'] = lfs_data['PUFC31_FLWRK'].copy()
lfs_data.loc[lfs_data['PUFC31_FLWRK_version2'] == -1, 'PUFC31_FLWRK_version2'] = 2 # Treat skipped as 'No'

print("Cleaned Data Type:", lfs_data['PUFC31_FLWRK_version2'].dtype)
print("Unique Values:", lfs_data['PUFC31_FLWRK_version2'].unique())

### `PUFC34_WYNOT`: Reason for Not Looking for Work

This column provides reasons why a person might not be looking for work. We will clean it and create a version where skipped entries are mapped to a specific category.

In [None]:
# 1. Convert all values to a consistent integer format.
lfs_data.loc[lfs_data['PUFC34_WYNOT'] == " ", 'PUFC34_WYNOT'] = -1
lfs_data['PUFC34_WYNOT'] = pd.to_numeric(lfs_data['PUFC34_WYNOT'])
lfs_data['PUFC34_WYNOT'] = lfs_data['PUFC34_WYNOT'].astype('int64')

# 2. Create a version for modeling where skipped entries (-1) are mapped to category 9 ('Others specify').
# This assumes that if a reason wasn't given, it falls into a general 'other' category or the question was not applicable.
lfs_data['PUFC34_WYNOT_version3'] = lfs_data['PUFC34_WYNOT'].copy()
lfs_data.loc[lfs_data['PUFC34_WYNOT_version3'] == -1, 'PUFC34_WYNOT_version3'] = 9

### `PUFC38_PREVJOB`: Ever Worked Before

This column asks if the respondent has any prior work experience. We will clean it using the same methodology.

In [None]:
# 1. Convert to integer format, with -1 for whitespace.
lfs_data.loc[lfs_data['PUFC38_PREVJOB'] == " ", 'PUFC38_PREVJOB'] = -1
lfs_data.loc[lfs_data['PUFC38_PREVJOB'] == "1", 'PUFC38_PREVJOB'] = 1
lfs_data.loc[lfs_data['PUFC38_PREVJOB'] == "2", 'PUFC38_PREVJOB'] = 2
lfs_data['PUFC38_PREVJOB'] = lfs_data['PUFC38_PREVJOB'].astype('int64')

# 2. Create a simplified binary version where skipped is treated as 'No'.
lfs_data['PUFC38_PREVJOB_version2'] = lfs_data['PUFC38_PREVJOB'].copy()
lfs_data.loc[lfs_data['PUFC38_PREVJOB_version2'] == -1, 'PUFC38_PREVJOB_version2'] = 2

print(lfs_data['PUFC38_PREVJOB_version2'].dtype)
print(lfs_data['PUFC38_PREVJOB_version2'].unique())

### Finalizing Data Type Conversion

Now that we've handled the most complex columns, we can perform a broader conversion. We will replace all remaining whitespace across the dataframe with our placeholder `-1` and then attempt to convert all remaining object columns to integer type.

In [None]:
lfs_data.replace(r"^\s+$", -1, regex=True, inplace=True)

In [None]:
columns_to_convert = [
    'PUFC09_GRADTECH', 'PUFC10_CONWR', 
    'PUFC12_JOB', 'PUFC14_PROCC', 'PUFC16_PKB', 'PUFC17_NATEM', 'PUFC18_PNWHRS', 
    'PUFC19_PHOURS', 'PUFC20_PWMORE', 'PUFC21_PLADDW', 'PUFC22_PFWRK', 'PUFC23_PCLASS', 
    'PUFC24_PBASIS', 'PUFC25_PBASIC', 'PUFC26_OJOB', 'PUFC27_NJOBS', 'PUFC28_THOURS', 
    'PUFC29_WWM48H', 'PUFC32_JOBSM', 'PUFC33_WEEKS', 
    'PUFC35_LTLOOKW', 'PUFC36_AVAIL', 'PUFC37_WILLING', 
    'PUFC40_POCC', 'PUFC41_WQTR', 'PUFC43_QKB', 'PUFNEWEMPSTAT'
]

for col in columns_to_convert:
    if col in lfs_data.columns:
        lfs_data[col] = lfs_data[col].astype('int64')  # explicit width, consistent with earlier casts

Let's get a final look at the number of unique values in each column after cleaning.

In [None]:
lfs_data.nunique()

## Section 3: Exploratory Data Analysis (EDA)

With the data cleaned and preprocessed, we can now explore it to find patterns, correlations, and insights.

### Correlation Analysis

We'll start by computing a correlation matrix to identify strong linear relationships between our numeric variables. For this calculation, we'll temporarily replace our `-1` placeholder with `NaN` so that these values are ignored.

In [None]:
lfs_data_with_nan = lfs_data.copy()
lfs_data_with_nan.replace(-1, np.nan, inplace=True)
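# numeric_only=True skips any column that is still of object dtype (pandas 2.x raises otherwise)
corr_matrix = lfs_data_with_nan.corr(numeric_only=True)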

strong_correlations = []
for i in range(len(corr_matrix.columns)):
    for j in range(i + 1, len(corr_matrix.columns)): 
        corr_value = corr_matrix.iloc[i, j]
        if (0.5 < corr_value < 1) or (-1 < corr_value < -0.5):
            strong_correlations.append((
                corr_matrix.index[i], 
                corr_matrix.columns[j], 
                corr_value
            ))

strong_correlations.sort(key=lambda x: abs(x[2]), reverse=True)

print("Strong correlations (|corr| > 0.5):")
for var1, var2, corr in strong_correlations:
    print(f"{var1} — {var2}: {corr:.3f}")

**Placeholder for Interpretation:** The output above lists pairs of variables with a correlation coefficient greater than 0.5 or less than -0.5. This helps us understand which variables move together. For example, a strong positive correlation between 'total hours worked' and 'income' would be expected. High correlations between predictor variables can also indicate multicollinearity, which might be a concern for some modeling techniques.
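As a compact alternative to the nested loop, the strong pairs can also be pulled from the upper triangle of the correlation matrix in one pass. The sketch below uses a made-up toy frame with `-1` as the skipped-value placeholder. Unlike the loop above, it also keeps pairs whose correlation is exactly 1 or -1.

```python
import numpy as np
import pandas as pd

# Toy frame with -1 as the "skipped" placeholder, as in the cleaning steps
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, -1],
    "c": [5, 3, 6, 1, 4, 2],
})

# Treat the placeholder as missing so it is ignored pairwise by corr()
corr = toy.replace(-1, np.nan).corr()

# Indices of the strict upper triangle: each pair once, diagonal excluded
rows, cols = np.triu_indices(len(corr), k=1)
pairs = pd.DataFrame({
    "var1": corr.index[rows],
    "var2": corr.columns[cols],
    "corr": corr.to_numpy()[rows, cols],
})

strong = pairs[pairs["corr"].abs() > 0.5]
strong = strong.sort_values("corr", key=np.abs, ascending=False)
print(strong)
```

In this toy frame only the pair (`a`, `b`) passes the threshold, because `b` is exactly twice `a` on the rows where both are present.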

### Visualizations

Now, we will create a series of plots to visualize the relationships between different demographic variables and occupation or work status.

#### Occupation Distribution by Sex

In [None]:
occupation_sex = lfs_data.groupby(['PUFC04_SEX', 'PUFC14_PROCC']).size().unstack(fill_value=0)
occupation_sex.plot(kind='bar', stacked=True, figsize=(12, 6))
plt.title('Occupation Distribution by Sex')
plt.xlabel('Sex')
plt.ylabel('Count')
plt.xticks([0, 1], ['Male', 'Female'], rotation=0) # Assumes code 1 = Male and 2 = Female (groupby sorts codes ascending); confirm against the PUF data dictionary
plt.legend(title='Occupation', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

**Placeholder for Interpretation:** This stacked bar chart shows the distribution of various occupations for males and females. It allows us to see which occupations are more prevalent for each gender and observe the overall workforce composition.

#### Occupation Distribution by Marital Status

In [None]:
occupation_marriageStatus = lfs_data.groupby(['PUFC06_MSTAT', 'PUFC14_PROCC']).size().unstack(fill_value=0)
occupation_marriageStatus.plot(kind='bar', stacked=True, figsize=(20, 10))
plt.title('Occupation Distribution by Marital Status')
plt.xlabel('Marital Status')
plt.ylabel('Count')
plt.legend(title='Occupation', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

**Placeholder for Interpretation:** This chart breaks down occupation types by the marital status of the respondents. We can analyze if certain occupations are more common among single, married, or widowed individuals, which might reflect different life stages and career paths.

#### Occupation Distribution by Highest Grade Completed

In [None]:
occupation_highestGrade = lfs_data.groupby(['PUFC07_GRADE', 'PUFC14_PROCC']).size().unstack(fill_value=0)
occupation_highestGrade.plot(kind='bar', stacked=True, figsize=(20, 10))
plt.title('Occupation Distribution by Highest Grade Completed')
plt.xlabel('Highest Grade')
plt.ylabel('Count')
plt.legend(title='Occupation', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

**Placeholder for Interpretation:** This visualization explores the relationship between education level and occupation. It helps to illustrate how higher levels of education may lead to different types of employment, highlighting the value of education in the labor market.

#### Relationship Between Looking for Work and Current Work Status

In [None]:
occupation_looking_for_work_ver2 = lfs_data.groupby(['PUFC30_LOOKW_version2', 'PUFC11_WORK']).size().unstack(fill_value=0)
occupation_looking_for_work_ver2.plot(kind='bar', stacked=True, figsize=(12, 7))
plt.title('Work Status vs. Looking for Work')
plt.xlabel('Did the person look for work in the past week?')
plt.ylabel('Count')
plt.xticks([0, 1], ['Yes', 'No'], rotation=0) # groupby sorts the codes ascending: 1 = Yes comes first, 2 = No second
plt.legend(title='Worked last week?', labels=['Yes', 'No'], bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

occupation_looking_for_work = lfs_data.groupby(['PUFC30_LOOKW', 'PUFC11_WORK']).size().unstack(fill_value=0)
occupation_looking_for_work.plot(kind='bar', stacked=True, figsize=(12, 7))
plt.title('Work Status vs. Looking for Work (with Skipped)')
plt.xlabel('Did the person look for work in the past week?')
plt.ylabel('Count')
plt.xticks([0, 1, 2], ['Skipped', 'Yes', 'No'], rotation=0)
plt.legend(title='Worked last week?', labels=['Yes', 'No'], bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

#### Reasons for Not Looking for Work vs. Work Status

In [None]:
occupation_reason_not_looking_for_work_version3 = lfs_data.groupby(['PUFC34_WYNOT_version3', 'PUFC11_WORK']).size().unstack(fill_value=0)
occupation_reason_not_looking_for_work_version3.plot(kind='bar', stacked=True, figsize=(20, 10))
plt.title('Reasons for Not Looking for Work vs. Work Status')
plt.xlabel("Reason for not looking for work")
plt.ylabel('Count')
plt.xticks(ticks=range(9), labels=[
    'Tired/believed no work available', 
    'Awaiting results of previous job application', 
    'Temporary illness/disability', 
    'Bad weather', 
    'Waiting for rehire/job recall', 
    'Too young/old or retired/permanent disability', 
    'Household, family duties', 
    'Schooling', 
    'Others/Not Applicable'
], rotation=45, ha='right')
plt.legend(title='Worked last week?', labels=['Yes', 'No'], bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()

## Section 4: Predictive Modeling

We will now build machine learning models to predict our target variable, `PUFC11_WORK`, which records whether a person worked in the past week. We will test three different classification algorithms: Logistic Regression, Decision Tree, and k-Nearest Neighbors.

### Feature Selection and Preprocessing for Modeling

We select a set of demographic and economic features as our predictors. The custom `prepare_data_kfold` function handles the final steps of creating training and testing sets: it creates five different 80/20 train-test splits (folds) to allow for robust evaluation, and it one-hot encodes the categorical features so that the models receive numeric input.

In [None]:
from src.preprocessing import prepare_data_kfold

target_col = 'PUFC11_WORK'
feature_cols = [
    'PUFC05_AGE', 'PUFC06_MSTAT', 'PUFC04_SEX', 
    'PUFC07_GRADE', 'PUFC08_CURSCH', 
    'PUFC38_PREVJOB', 'PUFC31_FLWRK',
    'PUFC30_LOOKW', 'PUFC34_WYNOT'
]

categorical_cols = feature_cols
n_splits = 5
missing_value = -1
seed = 45


folds_data = prepare_data_kfold(lfs_data, target_col=target_col,
                         missing_value=missing_value,
                         feature_cols=feature_cols,
                         seed=seed)
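The source of `prepare_data_kfold` is not shown in this notebook. As a rough sketch of what such a helper might do, the function below (hypothetical name `make_folds_sketch`) one-hot encodes the features with `pd.get_dummies` and builds five 80/20 splits with scikit-learn's `KFold`. The dictionary keys mirror the ones used later in this notebook. The target mapping (1 = worked) and the toy data are assumptions made only for illustration.

```python
import pandas as pd
from sklearn.model_selection import KFold

def make_folds_sketch(df, target_col, feature_cols, n_splits=5, seed=45):
    """Hypothetical stand-in for the project's helper: one-hot encode, then build k folds."""
    # Treat every feature as categorical; a code such as -1 gets its own indicator column
    X = pd.get_dummies(df[feature_cols].astype("category"), dtype=float)
    y = (df[target_col] == 1).astype(int)  # assumption: 1 = worked, 2 = did not work

    folds = []
    splitter = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in splitter.split(X):
        folds.append({
            "X_train": X.iloc[train_idx].to_numpy(),
            "X_test": X.iloc[test_idx].to_numpy(),
            "y_train": y.iloc[train_idx].to_numpy(),
            "y_test": y.iloc[test_idx].to_numpy(),
        })
    return folds

# Toy frame with made-up codes, only to show the shapes involved
toy = pd.DataFrame({
    "sex": [1, 2] * 10,
    "mstat": [1, 2, 3, 6, -1] * 4,
    "work": [1, 1, 2, 1, 2] * 4,
})
folds = make_folds_sketch(toy, target_col="work", feature_cols=["sex", "mstat"])
print(len(folds), folds[0]["X_train"].shape, folds[0]["X_test"].shape)
```

With five folds, each test split holds one fifth of the rows, which is where the 80/20 ratio comes from.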

### Model 1: Logistic Regression (LR)

Logistic Regression is an excellent baseline model for binary classification. It is efficient, interpretable, and provides a good starting point for evaluating predictability.

#### Training the Baseline LR Model
We will train the model across our five folds using a set of initial hyperparameters. We'll use Stochastic Gradient Descent as the optimizer and monitor performance on both training and test sets.

In [None]:
from src.trainEval import train_model

result_dict = train_model(
    folds_data,
    scheduler_step_size=5,
    learning_rate=0.01,
    scheduler_gamma=0.5,
    convergence_threshold=1e-4, 
    num_epochs=50,
    patience=3, # Stop at 3 epochs with no improvement
    weight_decay=0, # No regularization for the baseline
    seed=seed
)

lr_accuracies_test = result_dict["all_final_test_accuracies"]
print("Test Accuracies per Fold:", lr_accuracies_test)
print("\nModel training complete!")

# Averages across the five folds
final_metrics = result_dict["aggregated_final_metrics"]
print(f"\n  Average Final Train Loss:     {final_metrics['avg_final_train_loss']:.6f}")
print(f"  Average Final Test Loss:      {final_metrics['avg_final_test_loss']:.6f}")
print(f"  Average Final Train Accuracy: {final_metrics['avg_final_train_accuracy']:.6f}")
print(f"  Average Final Test Accuracy:  {final_metrics['avg_final_test_accuracy']:.6f}")

# Confusion matrix summed over all folds
aggregate_cm = result_dict["aggregate_confusion_matrix"]
sns.heatmap(aggregate_cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Did not Work", "Worked"],
            yticklabels=["Did not Work", "Worked"])
plt.title("Aggregate Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()
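The internals of `train_model` are not shown in this notebook. For intuition about what gradient-based logistic regression training involves, here is a minimal NumPy sketch. It is not the project's implementation: it uses full-batch gradient descent rather than stochastic mini-batches, synthetic data, and a step learning-rate schedule whose `step_size` and `gamma` arguments only mirror the spirit of `scheduler_step_size` and `scheduler_gamma`.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, learning_rate=0.1, num_epochs=200, step_size=50, gamma=0.5):
    """Full-batch gradient descent on mean binary cross-entropy with step decay."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for epoch in range(num_epochs):
        # Multiply the learning rate by `gamma` once every `step_size` epochs
        lr = learning_rate * gamma ** (epoch // step_size)
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / n_samples  # gradient of the mean loss w.r.t. the weights
        grad_b = np.mean(p - y)             # gradient w.r.t. the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Synthetic, linearly separable data (not LFS data)
rng = np.random.default_rng(45)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w, b = fit_logistic_gd(X, y)
accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
print(f"weights={w}, bias={b:.3f}, train accuracy={accuracy:.3f}")
```

A stochastic variant would compute the same gradients on a random mini-batch at each step instead of on the full data.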

#### Baseline LR Evaluation
**Placeholder for Interpretation:** The training output reports the average accuracy and loss across all five folds, and the aggregate confusion matrix shows the counts of true positives, true negatives, false positives, and false negatives. A good baseline has high accuracy and few misclassifications (the off-diagonal cells of the matrix). If the training and test metrics are close, the model is generalizing without significant overfitting; a large gap would point to overfitting.

#### Hyperparameter Tuning: LR

To improve our baseline, we will perform hyperparameter tuning using Randomized Search. This method efficiently samples a wide range of parameter combinations to find a more optimal set. We will search across learning rates, batch sizes, optimizers, and regularization strengths.

In [None]:
param_distributions = {
    'learning_rate': np.logspace(-4, -1, 20),
    'batch_size': [32, 64, 128, 256],
    'optimizer': ['sgd', 'adam', 'rmsprop'],
    'weight_decay': np.logspace(-5, -2, 10),
    'num_epochs': [30, 50, 75, 100],
    'scheduler_step_size': [5, 10, 15],
    'scheduler_gamma': [0.5, 0.7, 0.9],
    'patience': [3, 5, 7]
}

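# NOTE: `hyperparameter_random_search` is not imported anywhere in this notebook;
# import it from the project module that defines it before running this cell.
hyperparameter_results = hyperparameter_random_search(
    folds_data=folds_data,
    param_distributions=param_distributions,
    n_iter_search=50
)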
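How `hyperparameter_random_search` draws its candidates is not shown here. One common way to sample from a dictionary like `param_distributions` is scikit-learn's `ParameterSampler`, sketched below on a reduced version of the search space. Whether the project's helper uses it is an assumption.

```python
import numpy as np
from sklearn.model_selection import ParameterSampler

# Reduced search space for illustration; lists are sampled uniformly
search_space = {
    "learning_rate": np.logspace(-4, -1, 20).tolist(),
    "batch_size": [32, 64, 128, 256],
    "optimizer": ["sgd", "adam", "rmsprop"],
}

# Each draw is one dictionary of hyperparameters; the seed makes the draws repeatable
sampler = ParameterSampler(search_space, n_iter=5, random_state=45)
candidates = list(sampler)
for params in candidates:
    print(params)
```

Each sampled dictionary could then be passed to a training routine, keeping the configuration with the best cross-validated score.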

### Model 2: Decision Tree (DT)

Decision Trees are powerful models that can capture non-linear relationships in the data. They work by recursively partitioning the data based on feature values, creating a tree-like structure of decision rules.

#### Training the Baseline DT Model
We will use the same one-hot encoded data from our folds to train a Decision Tree classifier. We'll set a `min_impurity_decrease` to provide some basic pre-pruning and prevent the tree from growing too complex.

In [None]:
from sklearn.metrics import log_loss  # the other metrics and the classifier are imported at the top

accuracies_train = []
losses_train = []
dt_accuracies_test = []
losses_test = []
confusion_matrices = []

for fold in folds_data:
    X_train, X_test = fold['X_train'], fold['X_test']
    y_train, y_test = fold['y_train'], fold['y_test']

    model = DecisionTreeClassifier(random_state=seed, min_impurity_decrease=0.001)
    model.fit(X_train, y_train)

    # Training metrics
    accuracies_train.append(accuracy_score(y_train, model.predict(X_train)))
    losses_train.append(log_loss(y_train, model.predict_proba(X_train)))

    # Test metrics (predict once and reuse the result)
    y_test_pred = model.predict(X_test)
    dt_accuracies_test.append(accuracy_score(y_test, y_test_pred))
    losses_test.append(log_loss(y_test, model.predict_proba(X_test)))

    confusion_matrices.append(confusion_matrix(y_test, y_test_pred))

print(f"Average Train Accuracy: {np.mean(accuracies_train):.4f}")
print(f"Average Train Loss: {np.mean(losses_train):.4f}")
print(f"Average Test Accuracy: {np.mean(dt_accuracies_test):.4f}")
print(f"Average Test Loss: {np.mean(losses_test):.4f}")
print("\nOverall Confusion Matrix:")
print(sum(confusion_matrices))

#### Baseline DT Evaluation
**Placeholder for Interpretation:** Compare the average training accuracy (`accuracies_train`) with the average test accuracy. A small gap between them would indicate that the initial pre-pruning (`min_impurity_decrease=0.001`) kept the tree from overfitting, and the overall confusion matrix shows where the remaining errors fall.
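To see how `min_impurity_decrease` acts as pre-pruning, it helps to compare tree size and the train/test gap at a few threshold values. The sketch below uses synthetic data from `make_classification` (not the LFS data), so the numbers it prints say nothing about this survey. It only illustrates the mechanism.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise
X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           flip_y=0.1, random_state=45)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)

results = {}
for threshold in [0.0, 0.001, 0.01]:
    tree = DecisionTreeClassifier(random_state=45, min_impurity_decrease=threshold)
    tree.fit(X_train, y_train)
    results[threshold] = {
        "leaves": tree.get_n_leaves(),
        "train_acc": tree.score(X_train, y_train),
        "test_acc": tree.score(X_test, y_test),
    }
    print(threshold, results[threshold])
```

A larger threshold blocks splits that reduce impurity only slightly, so the tree ends up with fewer leaves and typically a smaller gap between training and test accuracy.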

#### Hyperparameter Tuning: DT
We will again use Randomized Search to find better hyperparameters for our Decision Tree. We will explore various criteria (`gini`, `entropy`), tree depth, sample requirements for splitting nodes, and both pre-pruning (`min_impurity_decrease`) and post-pruning (`ccp_alpha`) techniques.

In [None]:
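from sklearn.model_selection import RandomizedSearchCV

p_grid = {
    "criterion": ["gini", "entropy"],
    "max_depth": [30, 45, 50, 60],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "min_impurity_decrease": [0.001, 0.005, 0.01], # pre-pruning
    "ccp_alpha": np.logspace(-4, 0, 10) # post-pruning
}

# Assumption: the search call is missing from this cell, so it is run here on the first fold's
# training split with an arbitrary budget of 50 draws; `best_model` is used in the next cell.
search = RandomizedSearchCV(DecisionTreeClassifier(random_state=seed), p_grid, n_iter=50,
                            cv=5, scoring="accuracy", random_state=seed, n_jobs=-1)
search.fit(folds_data[0]['X_train'], folds_data[0]['y_train'])
best_model = search.best_estimator_
print("Best DT parameters:", search.best_params_)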
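For the post-pruning parameter, scikit-learn can report the `ccp_alpha` values at which a fully grown tree actually changes, through `cost_complexity_pruning_path`. The sketch below runs it on synthetic data (an assumption made for illustration, not the LFS data) and shows the leaf count shrinking as `ccp_alpha` grows. Such a path could inform the range passed to a search, as an alternative to a fixed log-spaced grid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           flip_y=0.1, random_state=45)

# Effective alphas of the fully grown tree, in increasing order
path = DecisionTreeClassifier(random_state=45).cost_complexity_pruning_path(X, y)
ccp_alphas = np.clip(path.ccp_alphas, 0.0, None)  # guard against tiny negative rounding errors

# Refit at a handful of alphas along the path and record the tree size
leaf_counts = []
for alpha in ccp_alphas[::max(1, len(ccp_alphas) // 5)]:
    tree = DecisionTreeClassifier(random_state=45, ccp_alpha=alpha).fit(X, y)
    leaf_counts.append(tree.get_n_leaves())
    print(f"ccp_alpha={alpha:.5f} -> {tree.get_n_leaves()} leaves")
```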

#### Final DT Evaluation


In [None]:
X_test_sample = folds_data[0]['X_test'] # Using one fold's test set for final evaluation example
y_test_sample = folds_data[0]['y_test']

test_preds = best_model.predict(X_test_sample)
test_acc = accuracy_score(y_test_sample, test_preds)
test_loss = log_loss(y_test_sample, best_model.predict_proba(X_test_sample))

print(f"Final Test Accuracy: {test_acc:.4f}")
print(f"Final Test Log Loss: {test_loss:.4f}")

cm = confusion_matrix(y_test_sample, test_preds)  
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix for Final DT Model")
plt.show()

### Model 3: k-Nearest Neighbors (kNN)

kNN is a simple, instance-based learning algorithm that classifies a data point based on the majority class of its 'k' nearest neighbors. It makes no assumptions about the underlying data distribution.

#### Training the Baseline kNN Model
We will start with a default `k` value of 5 and use distance weighting, meaning closer neighbors have more influence on the prediction.

In [None]:
# Initial k value; an optimal value will be determined later.
k = 5
knn_accuracies_test = []

for i, fold in enumerate(folds_data):
    X_train, y_train = fold['X_train'], fold['y_train']
    X_test, y_test = fold['X_test'], fold['y_test']

    knn_model = KNeighborsClassifier(n_neighbors=k, weights='distance')
    knn_model.fit(X_train, y_train)

    y_pred = knn_model.predict(X_test)
    accuracy_test = accuracy_score(y_test, y_pred)
    knn_accuracies_test.append(accuracy_test)
    print(f"Fold {i+1} - Test Accuracy: {accuracy_test:.4f}")
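To make the effect of distance weighting concrete, the sketch below works through one prediction by hand on a made-up one-dimensional dataset and compares it with scikit-learn's `weights='uniform'` and `weights='distance'` options. The data points are invented for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[0.0], [1.0], [2.9], [6.0], [7.0]])
y_train = np.array([0, 0, 1, 1, 1])
query = np.array([[2.5]])
k = 3

# Manual distance-weighted vote: each of the k nearest neighbors adds 1/distance to its class
distances = np.abs(X_train - query).ravel()
nearest = np.argsort(distances)[:k]
weights = 1.0 / distances[nearest]
votes = np.bincount(y_train[nearest], weights=weights, minlength=2)
manual_pred = int(np.argmax(votes))

uniform_pred = KNeighborsClassifier(n_neighbors=k, weights="uniform").fit(X_train, y_train).predict(query)[0]
distance_pred = KNeighborsClassifier(n_neighbors=k, weights="distance").fit(X_train, y_train).predict(query)[0]
print(manual_pred, uniform_pred, distance_pred)  # 1 0 1
```

Two of the three nearest neighbors belong to class 0, so the uniform vote picks class 0. The single class 1 neighbor is much closer (distance 0.4 versus 1.5 and 2.5), so its weight of 2.5 outweighs the combined weight of about 1.07 for class 0.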

#### Hyperparameter Tuning: Finding the Optimal k
The most critical hyperparameter for kNN is `k` itself. A small `k` can be sensitive to noise, while a large `k` can be computationally expensive and may oversmooth the decision boundary. We will use `GridSearchCV` to systematically test a range of `k` values and find the one that yields the best cross-validated accuracy.

In [None]:
param_grid = {'n_neighbors': range(1, 31)} 

grid_search = GridSearchCV(KNeighborsClassifier(weights='distance'), param_grid, cv=5, scoring='accuracy', n_jobs=-1)

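# Assumption: `X_train_full` and `y_train_full` are not defined earlier in this notebook,
# so the first fold's training split is used here.
X_train_full, y_train_full = folds_data[0]['X_train'], folds_data[0]['y_train']
grid_search.fit(X_train_full, y_train_full)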

print("Best k:", grid_search.best_params_['n_neighbors'])

## Section 5: Model Comparison and Conclusion

Finally, we compare the three models: Logistic Regression, Decision Tree, and k-Nearest Neighbors. The arrays used below hold the per-fold test accuracies recorded during the baseline training runs above, so the comparison reflects the baseline configurations across the five folds rather than the tuned ones.

In [None]:
lr_accuracies_test = np.array(lr_accuracies_test)
dt_accuracies_test = np.array(dt_accuracies_test)
knn_accuracies_test = np.array(knn_accuracies_test)

print(f"Logistic Regression Average Test Accuracy: {np.mean(lr_accuracies_test):.4f} (+/- {np.std(lr_accuracies_test):.4f})")
print(f"Decision Tree Average Test Accuracy:       {np.mean(dt_accuracies_test):.4f} (+/- {np.std(dt_accuracies_test):.4f})")
print(f"k-Nearest Neighbors Average Test Accuracy: {np.mean(knn_accuracies_test):.4f} (+/- {np.std(knn_accuracies_test):.4f})")