# Feature Engineering

In this notebook, we continue the data preparation process by leveraging the preprocessed dataset, `feature_engineering.csv`. Our focus is on two key tasks:

#### **Scaling and Encoding**
- Categorical columns are encoded with label encoding or one-hot encoding, and numerical columns are scaled with `MinMaxScaler` or `RobustScaler`, so that the data is model-ready.
- The choice of each technique was informed by the references cited in the relevant sections, with the aim of keeping preprocessing consistent and supporting model performance.

#### **Creating New Features**
- To further enrich the dataset, we identified and created potential interaction features. These new features aim to capture additional relationships and patterns in the data that could improve predictive power.
- External resources were consulted to guide the identification and creation of interaction terms, ensuring they align with the dataset and project objectives.

This notebook builds on the foundation established in the EDA, with a clear focus on enhancing the dataset for the next step: model training.

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler
import sweetviz as sv
from sweetviz import FeatureConfig
import matplotlib.pyplot as plt

print("All libraries loaded!")

All libraries loaded!


## Data Loading

In [2]:
df = pd.read_csv("feature_engineering.csv")

In [3]:
df.columns

Index(['Age', 'Sex', 'Race', 'Education_Level', 'Income_Level',
       'Marital_Status', 'Employment_Status', 'Number_of_Children', 'Weight',
       'Height', 'Housing_Status', 'Smoking_Status', 'Alcohol_Consumption',
       'Alcohol_Frequency', 'Exercise_Status', 'Sleep_Duration',
       'Asthma_Status', 'General_Health', 'Physical_Health_Poor_Days',
       'Mental_Health_Poor_Days', 'Difficulty_Walking', 'BMI_Category',
       'Arthritis_Status', 'Coronary_Heart_Disease_Status', 'Stroke_Status',
       'COPD_Status', 'Kidney_Disease_Status', 'Total_Physical_Inactivity',
       'Depression_Status', 'Primary_Health_Insurance_Source',
       'Has_Personal_Doctor', 'Could_Not_See_Doctor_Due_To_Cost',
       'Last_Routine_Checkup', 'Tetanus_Shot_Status', 'Colonoscopy_Status',
       'Mammogram_Status', 'Flu_Shot_Status', 'Last_Dental_Visit',
       'Veteran_Status', 'Chronic_Condition'],
      dtype='object')

In [4]:
df_test = pd.read_csv("Post_EDA_test.csv")

In [5]:
df_test.columns

Index(['Age', 'Sex', 'Race', 'Education_Level', 'Income_Level',
       'Marital_Status', 'Employment_Status', 'Number_of_Children', 'Weight',
       'Height', 'Housing_Status', 'Smoking_Status', 'Alcohol_Consumption',
       'Alcohol_Frequency', 'Exercise_Status', 'Sleep_Duration',
       'Asthma_Status', 'Pneumonia_Vaccination_Status', 'General_Health',
       'Physical_Health_Poor_Days', 'Mental_Health_Poor_Days',
       'Difficulty_Walking', 'BMI_Category', 'Arthritis_Status',
       'Coronary_Heart_Disease_Status', 'Stroke_Status', 'COPD_Status',
       'Kidney_Disease_Status', 'Caregiver_Major_Health_Problem',
       'Total_Physical_Inactivity', 'Depression_Status',
       'Primary_Health_Insurance_Source', 'Has_Personal_Doctor',
       'Could_Not_See_Doctor_Due_To_Cost', 'Last_Routine_Checkup',
       'Tetanus_Shot_Status', 'Colonoscopy_Status', 'Mammogram_Status',
       'PSA_Test_Status', 'Flu_Shot_Status', 'Eye_Exam_Status',
       'Last_Dental_Visit', 'Veteran_Status'],
     

In [6]:
df_test.shape

(20915, 43)

## Encoding Categorical Variables

#### **1. Label Encoding**
- **Definition**: Label encoding converts categorical labels into integer representations, mapping each unique category to a corresponding integer.
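- **Use Case**: Suited to ordinal data where the order of categories has significance (e.g., "Low," "Medium," "High"), provided the assigned integers follow that order. scikit-learn's `LabelEncoder` assigns integers by the sorted order of the unique values and is documented primarily as an encoder for target labels.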
- **Citation**: [Label Encoding Guide - Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

#### **2. One-Hot Encoding**
- **Definition**: One-hot encoding transforms categorical variables into binary vectors, creating a new column for each category and assigning a 1 or 0 based on category presence.
- **Use Case**: Ideal for nominal data where categories have no intrinsic order (e.g., "Red," "Green," "Blue").
- **Citation**: [One-Hot Encoding Guide - Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)

#### **Why Label Encoding was Used for General_Health, Has_Personal_Doctor, Last_Routine_Checkup, Last_Dental_Visit, Marital_Status, Education_Level, Physical_Health_Poor_Days, Mental_Health_Poor_Days, Employment_Status, Income_Level, BMI_Category, Smoking_Status**

Label encoding was chosen for features treated as ordinal, such as education level or income categories, so that the encoded integers can reflect the order of the categories while keeping each feature in a single column. Note that `LabelEncoder` assigns integers according to the sorted order of the unique values, so the encoding preserves the natural order only when the sorted values already follow it (as with ordered numeric codes).
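As a minimal sketch of how the integer codes are assigned (the toy `level` column and its category order are assumptions for illustration, not columns from this dataset): `LabelEncoder` numbers categories by their sorted order, while `OrdinalEncoder` accepts an explicit order through its `categories` parameter.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature, used only for illustration
levels = pd.DataFrame({"level": ["Low", "High", "Medium", "Low"]})

# LabelEncoder sorts the unique values, so string categories get alphabetical codes
label_encoder_demo = LabelEncoder()
label_codes = label_encoder_demo.fit_transform(levels["level"])
print(list(label_encoder_demo.classes_))  # ['High', 'Low', 'Medium']
print(label_codes.tolist())               # [1, 0, 2, 1]

# OrdinalEncoder lets us state the intended order explicitly
ordinal_encoder_demo = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
ordinal_codes = ordinal_encoder_demo.fit_transform(levels[["level"]]).ravel()
print(ordinal_codes.tolist())             # [0.0, 2.0, 1.0, 0.0]
```

When the categories are already stored as ordered numeric codes, sorted order and numeric order coincide, so both encoders produce the same ranking.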

In [7]:
# List of columns to apply label encoding for ordinal categorical features.
label_encoding_columns = [
    "General_Health", "Has_Personal_Doctor", "Last_Routine_Checkup",
    "Last_Dental_Visit", "Marital_Status", "Education_Level", 
    "Physical_Health_Poor_Days", "Mental_Health_Poor_Days", "Employment_Status",
    "Income_Level", "BMI_Category", "Smoking_Status"
]

In [8]:
# Initialize a standalone LabelEncoder instance; it is reused later to encode the target variable.
label_encoder = LabelEncoder()

In [9]:
label_encoders = {}

In [10]:
# Apply label encoding to each column in the training data
for col in label_encoding_columns:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])


In [11]:
# Apply the same encoding to the test data using the trained label encoders
for col in label_encoding_columns:
    df_test[col] = label_encoders[col].transform(df_test[col])

#### **Why One-Hot Encoding was Used for Primary_Health_Insurance_Source, Housing_Status, Veteran_Status, Mammogram_Status, Flu_Shot_Status, Tetanus_Shot_Status, Difficulty_Walking, Race, Sex, Coronary_Heart_Disease_Status, Asthma_Status**

One-hot encoding was selected for nominal categorical features, such as "Race" or "Sex," because these features do not have a natural order. This method creates binary columns for each category, ensuring that the model does not infer incorrect ordinal relationships between categories. One-hot encoding is particularly effective for nominal data, allowing the model to treat each category independently and equitably, improving interpretability and performance.
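A small sketch of what `pd.get_dummies` with `drop_first=True` produces may help here. The toy values below are assumptions for illustration; only the column names `Sex` and `Race` come from the dataset.

```python
import pandas as pd

# Toy frame with two nominal columns stored as integer codes (illustrative values)
toy = pd.DataFrame({"Sex": [1, 2, 2], "Race": [1, 3, 2]})

# drop_first=True removes the first category of each column, which then acts
# as the reference level and avoids perfectly redundant dummy columns
toy_encoded = pd.get_dummies(toy, columns=["Sex", "Race"], drop_first=True)

print(list(toy_encoded.columns))  # ['Sex_2', 'Race_2', 'Race_3']
print(toy_encoded.astype(int))
```

A row with zeros in every dummy column for a feature belongs to that feature's dropped reference category.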


In [12]:
# List of columns to apply one-hot encoding for nominal categorical features.
one_hot_encoding_columns = [
    "Primary_Health_Insurance_Source", "Housing_Status", "Veteran_Status",
    "Mammogram_Status", "Flu_Shot_Status", "Tetanus_Shot_Status",
    "Difficulty_Walking", "Race", "Sex", "Coronary_Heart_Disease_Status", "Asthma_Status", 
]

# One-Hot Encoding
df = pd.get_dummies(df, columns=one_hot_encoding_columns, drop_first=True)

In [13]:
# Get the final list of columns after encoding df
encoded_columns = df.columns

In [14]:
# One-Hot Encoding for the test data
df_test = pd.get_dummies(df_test, columns=one_hot_encoding_columns, drop_first=True)

In [15]:
# Add missing columns in df_test that exist in df
missing_columns = set(encoded_columns) - set(df_test.columns)
for col in missing_columns:
    df_test[col] = 0

In [16]:
# Ensure column order matches between df and df_test
df_test = df_test[encoded_columns]
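The two alignment steps above (adding missing columns, then reordering) can also be sketched as a single `DataFrame.reindex` call. This is an equivalent alternative under the assumption that missing dummy columns should be filled with 0; the toy frames and their values are hypothetical.

```python
import pandas as pd

# Toy train/test frames; category 3 appears only in train and category 7 only in test
train = pd.DataFrame({"Age": [30, 45, 60], "Housing_Status": [1, 2, 3]})
test = pd.DataFrame({"Age": [25, 50, 70], "Housing_Status": [1, 2, 7]})

train_encoded = pd.get_dummies(train, columns=["Housing_Status"], drop_first=True)
test_encoded = pd.get_dummies(test, columns=["Housing_Status"], drop_first=True)

# reindex adds columns missing from the test frame (filled with 0), drops
# columns that exist only in the test frame, and matches the training order
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(test_aligned.columns))  # ['Age', 'Housing_Status_2', 'Housing_Status_3']
```

Note that test-only categories (here `Housing_Status_7`) are silently dropped, so those rows become indistinguishable from the reference category.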

In [17]:
df_test.columns

Index(['Age', 'Education_Level', 'Income_Level', 'Marital_Status',
       'Employment_Status', 'Number_of_Children', 'Weight', 'Height',
       'Smoking_Status', 'Alcohol_Consumption', 'Alcohol_Frequency',
       'Exercise_Status', 'Sleep_Duration', 'General_Health',
       'Physical_Health_Poor_Days', 'Mental_Health_Poor_Days', 'BMI_Category',
       'Arthritis_Status', 'Stroke_Status', 'COPD_Status',
       'Kidney_Disease_Status', 'Total_Physical_Inactivity',
       'Depression_Status', 'Has_Personal_Doctor',
       'Could_Not_See_Doctor_Due_To_Cost', 'Last_Routine_Checkup',
       'Colonoscopy_Status', 'Last_Dental_Visit', 'Chronic_Condition',
       'Primary_Health_Insurance_Source_2.0',
       'Primary_Health_Insurance_Source_3.0',
       'Primary_Health_Insurance_Source_4.0',
       'Primary_Health_Insurance_Source_5.0',
       'Primary_Health_Insurance_Source_6.0',
       'Primary_Health_Insurance_Source_7.0',
       'Primary_Health_Insurance_Source_8.0',
       'Primary_Health

## Encoding the Target Variable - Chronic_Condition

#### **Why Label Encoding was Used for the Target Variable 'Chronic_Condition'**

Label encoding was chosen for the target variable 'Chronic_Condition' because it converts categorical classes into numerical labels, a format required by most machine learning algorithms. This approach ensures that the model can interpret the target values effectively without introducing unnecessary complexity. Additionally, label encoding is particularly suitable for the target variable as it maintains the integrity of the class distinctions without implying any ordinal relationship between the categories.

In [18]:
# Apply label encoding to the target variable 'Chronic_Condition'.
# This step transforms the categorical classes (e.g., 'Yes', 'No') into numerical labels (e.g., 0, 1),
# making the target variable suitable for machine learning algorithms that require numerical input.
df['Chronic_Condition'] = label_encoder.fit_transform(df['Chronic_Condition'])

In [19]:
# Create a mapping of the original class labels to their corresponding encoded numerical values.
# This dictionary helps to interpret the label-encoded target variable, allowing for clarity when analyzing model predictions.
target_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
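For illustration, here is a minimal sketch of how such a mapping can be read and reversed. The class names `'No'` and `'Yes'` are assumptions; the actual classes of `Chronic_Condition` are not shown in this notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values, used only for illustration
target_values = ["No", "Yes", "Yes", "No"]

target_encoder_demo = LabelEncoder()
encoded_target = target_encoder_demo.fit_transform(target_values)

# classes_ is sorted, so the position of each class is its integer code
demo_mapping = {label: code for code, label in enumerate(target_encoder_demo.classes_)}
print(demo_mapping)  # {'No': 0, 'Yes': 1}

# inverse_transform converts predicted codes back to the original labels
decoded = target_encoder_demo.inverse_transform([1, 0])
print(list(decoded))  # ['Yes', 'No']
```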

In [20]:
# Define a feature configuration to explicitly treat the target variable 'Chronic_Condition' as a numerical feature.
# This is necessary for certain libraries or tools (e.g., Sweetviz) that may not automatically recognize it as numeric.
feature_config = FeatureConfig(force_num=['Chronic_Condition'])

## Scaling Numerical Features

#### **1. MinMaxScaler**
- **Definition**: MinMaxScaler scales features to a fixed range, typically [0, 1], by transforming data linearly based on the feature’s minimum and maximum values.
- **Use Case**: Ideal for datasets where the distribution of features is not Gaussian or when a fixed range is required (e.g., neural networks).
- **Citation**: [MinMaxScaler Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)

#### **2. RobustScaler**
- **Definition**: RobustScaler scales features by removing the median and scaling according to the interquartile range (IQR), making it robust to outliers.
- **Use Case**: Suitable for datasets with significant outliers that could distort other scaling methods (e.g., datasets with heavy-tailed distributions).
- **Citation**: [RobustScaler Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html)
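To make the difference between the two scalers concrete, the sketch below applies both to a small made-up column containing one outlier (the values are assumptions for illustration). With default settings, `MinMaxScaler` computes `(x - min) / (max - min)` and `RobustScaler` computes `(x - median) / IQR`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Made-up values with a single large outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax_scaled = MinMaxScaler().fit_transform(values).ravel()
robust_scaled = RobustScaler().fit_transform(values).ravel()

# The outlier squeezes the other MinMax-scaled values close to 0 ...
print(np.round(minmax_scaled, 3))
# ... while RobustScaler keeps them spread around 0 (median 3, IQR 4 - 2 = 2)
print(robust_scaled)  # [-1.  -0.5  0.   0.5 48.5]
```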

In [21]:
# Define a dictionary of scalers for different scaling techniques
scalers = {
    "MinMaxScaler" : MinMaxScaler(),      # Scales features to a fixed range, typically [0, 1]
    "RobustScaler" : RobustScaler()       # Scales features using the median and IQR, making it robust to outliers
}

### Why RobustScaler was Used for Height and Weight

The **RobustScaler** was chosen for the Height and Weight columns because these features have extreme outliers that could distort scaling. Unlike other scalers, RobustScaler scales the data based on the median and interquartile range, making it resilient to the influence of outliers. This ensures that the scaled values accurately reflect the majority of the data distribution while minimizing the impact of outliers.


In [22]:
scalers["RobustScaler"] = scalers["RobustScaler"].fit(df[["Weight", "Height"]])

In [23]:
# Scale the "Weight" and "Height" columns using the RobustScaler fitted above (robust to outliers).
# Both columns are transformed together so that each keeps its own training median and IQR.
df[["Weight", "Height"]] = scalers["RobustScaler"].transform(df[["Weight", "Height"]])

In [24]:
# Scale the same columns in the test df using the scaler fitted on the training data
df_test[["Weight", "Height"]] = scalers["RobustScaler"].transform(df_test[["Weight", "Height"]])



### Why MinMaxScaler Was Used for Age, Sleep Duration, Number of Children, and Income Level

The MinMaxScaler was chosen because it scales features to a fixed range (e.g., 0 to 1), preserving their relationships and bounded distributions. These columns, with naturally limited ranges and no significant outliers, align well with this scaler, ensuring proportionality and consistency for model inputs. This method is particularly suitable for algorithms sensitive to feature magnitudes, such as logistic regression.
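As a sketch of the fit-on-train, transform-on-test pattern for several columns at once (toy values are assumptions; only the column names come from the dataset): a scaler fitted on multiple columns stores one minimum and maximum per column, and calling `fit` or `fit_transform` again would overwrite those statistics.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy train/test frames with illustrative values
train = pd.DataFrame({"Age": [20.0, 40.0, 80.0], "Sleep_Duration": [4.0, 8.0, 12.0]})
test = pd.DataFrame({"Age": [50.0], "Sleep_Duration": [6.0]})
columns_to_scale = ["Age", "Sleep_Duration"]

# Fit once on the training columns; per-column statistics are kept in data_min_ / data_max_
demo_scaler = MinMaxScaler().fit(train[columns_to_scale])
print(demo_scaler.data_min_, demo_scaler.data_max_)

# Reuse the fitted scaler (transform only) for both train and test
train_scaled = train.copy()
train_scaled[columns_to_scale] = demo_scaler.transform(train[columns_to_scale])
test_scaled = test.copy()
test_scaled[columns_to_scale] = demo_scaler.transform(test[columns_to_scale])

print(test_scaled)  # Age: (50 - 20) / 60 = 0.5, Sleep_Duration: (6 - 4) / 8 = 0.25
```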

In [26]:
scalers["MinMaxScaler"] = scalers["MinMaxScaler"].fit(df[["Income_Level", "Number_of_Children",  
                                                          "Sleep_Duration", "Age"]])

In [27]:
# Scale "Income_Level", "Number_of_Children", "Sleep_Duration", and "Age" to the range [0, 1]
# with the scaler fitted above. The four columns are transformed together, in the same order
# used for fitting, so that each column keeps its own training minimum and maximum.
minmax_columns = ["Income_Level", "Number_of_Children", "Sleep_Duration", "Age"]
df[minmax_columns] = scalers["MinMaxScaler"].transform(df[minmax_columns])

In [28]:
# Scale the same columns in df_test using the scaler fitted on the training data (no refitting)
df_test[minmax_columns] = scalers["MinMaxScaler"].transform(df_test[minmax_columns])



### Why Log Transformation and MinMaxScaler Were Used for Alcohol_Frequency

The `np.log1p` function was applied to the Alcohol_Frequency column to reduce the skewness caused by extreme values or highly uneven distributions. The log transformation compresses the range of larger values while leaving smaller values nearly unchanged, producing a less skewed distribution. The transformed values are then scaled to the range [0, 1] with MinMaxScaler.
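A minimal sketch of the log-then-scale step, using made-up frequency values (assumptions for illustration): `np.log1p(x)` computes `log(1 + x)`, so zeros stay at zero, and the scaler is fitted on the transformed training values only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up, right-skewed frequency values
train_freq = np.array([[0.0], [1.0], [4.0], [30.0]])
test_freq = np.array([[2.0], [30.0]])

# Fit on the log-transformed training values, then reuse the scaler for the test values
log_scaler = MinMaxScaler().fit(np.log1p(train_freq))
train_freq_scaled = log_scaler.transform(np.log1p(train_freq)).ravel()
test_freq_scaled = log_scaler.transform(np.log1p(test_freq)).ravel()

print(np.round(train_freq_scaled, 3))
print(np.round(test_freq_scaled, 3))
```

Without the log step the value 4 would map to 4/30 (about 0.13) on the [0, 1] scale; after `log1p` it maps to `log(5) / log(31)` (about 0.47), so moderate values are no longer squeezed toward zero by the largest one.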

In [29]:
# Fit a separate MinMaxScaler on the log-transformed Alcohol_Frequency column in the training data,
# so that the scaler fitted on the other columns is not overwritten
scalers["AlcoholMinMaxScaler"] = MinMaxScaler().fit(np.log1p(df[["Alcohol_Frequency"]]))

In [30]:
# Apply the fitted scaler to the logarithmic transformation of "Alcohol_Frequency" to reduce skewness and scale the feature.
df["Alcohol_Frequency"] = scalers["AlcoholMinMaxScaler"].transform(np.log1p(df[["Alcohol_Frequency"]]))

In [31]:
# Transform the test data using the already fitted scaler
df_test["Alcohol_Frequency"] = scalers["AlcoholMinMaxScaler"].transform(np.log1p(df_test[["Alcohol_Frequency"]]))

### Target Variable Sweetviz Report

In [None]:
sweetviz_report = sv.analyze(df, target_feat='Chronic_Condition', feat_cfg=feature_config)

In [None]:
sweetviz_report.show_html("Sweetviz_Target_Report.html")

### Creating Interaction Terms

Creating interaction terms is a feature engineering technique that can expose relationships between features that are not evident from each feature on its own. By combining features, we can capture dependencies such as one feature's effect varying with the level of another, which can improve predictive accuracy, particularly for models that do not learn such non-linear relationships by themselves.

**Citation**: [Feature Engineering and Interaction Terms - Towards Data Science](https://towardsdatascience.com/feature-engineering-for-machine-learning-3a5e293a5114)


The creation of interaction terms was guided by insights obtained from the Sweetviz report, which highlighted significant associations between features and the target variable, `Chronic_Condition`. For instance, `General_Health`, `Difficulty_Walking_2.0`, and `Employment_Status` showed moderate correlations with the target variable, suggesting their potential combined influence on chronic conditions. Similarly, numerical features like `Age` and `Alcohol_Frequency` were moderately associated with `Chronic_Condition`, motivating the creation of interactions such as `Age_General_Health` to capture nuanced relationships. 

These interaction terms aim to enhance the dataset's predictive power by capturing complex, non-linear relationships between variables, thereby aligning with the project’s objective to improve chronic disease detection and cost reduction strategies. By integrating information from both numerical and categorical features, we ensure a comprehensive representation of the behavioral and systemic factors influencing chronic health outcomes.
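One property of product interactions is worth illustrating: when both inputs are non-negative, their raw product tends to correlate strongly with each input. The sketch below uses synthetic uniform data (an assumption, not this dataset) and also shows mean-centering before multiplication, a common variant that this notebook does not use.

```python
import numpy as np
import pandas as pd

# Synthetic, independent, non-negative features (illustrative only)
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "a": rng.uniform(0, 1, size=5000),
    "b": rng.uniform(0, 1, size=5000),
})

# Raw product interaction, the same form as the interaction columns in this notebook
toy["a_times_b"] = toy["a"] * toy["b"]

# Variant: center each feature on its mean before multiplying
toy["a_times_b_centered"] = (toy["a"] - toy["a"].mean()) * (toy["b"] - toy["b"].mean())

raw_corr = toy["a"].corr(toy["a_times_b"])
centered_corr = toy["a"].corr(toy["a_times_b_centered"])
print(f"corr(a, a*b) raw:      {raw_corr:.2f}")
print(f"corr(a, a*b) centered: {centered_corr:.2f}")
```

On data like this, the raw product is clearly correlated with its parent feature while the centered product is close to uncorrelated, which is relevant when interaction features are later screened for collinearity.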

In [32]:
# Create a "Health Risk Index" as a weighted combination of BMI_Category, Alcohol_Frequency, and Exercise_Status.
# The weights (0.4, 0.3, 0.3) represent the relative importance of each feature in assessing health risk.
# This feature aims to capture an individual's overall health risk based on lifestyle factors.
df["Health_Risk_index"] = (df["BMI_Category"] * 0.4) + (df["Alcohol_Frequency"] * 0.3) + (df["Exercise_Status"] * 0.3)
df_test["Health_Risk_index"] = (df_test["BMI_Category"] * 0.4) + (df_test["Alcohol_Frequency"] * 0.3) + (df_test["Exercise_Status"] * 0.3)

In [33]:
# Define a list of columns related to comorbidities (e.g., Arthritis, Stroke, COPD).
# These features represent the presence or absence of various chronic diseases.
comorbidity_columns = [
    'Arthritis_Status', 'Stroke_Status', 'COPD_Status',
    'Kidney_Disease_Status', 'Depression_Status',
    'Coronary_Heart_Disease_Status_2.0', 'Asthma_Status_2',
    'Asthma_Status_3', 'Asthma_Status_9'
]

In [34]:
# Calculate a "Comorbidity Score" by summing the values across all comorbidity-related columns.
# This score reflects the total number of comorbid conditions an individual has, providing a measure of overall disease burden.
df['Comorbidity_Score'] = df[comorbidity_columns].sum(axis=1)
df_test['Comorbidity_Score'] = df_test[comorbidity_columns].sum(axis=1)

In [35]:
# Define a list of columns related to preventive health behaviors (e.g., Colonoscopy, Mammogram, Flu Shot, Tetanus Shot).
# These features represent the individual's engagement in preventive care activities.
preventive_behavior_columns = [
    'Colonoscopy_Status', 'Mammogram_Status_2.0', 'Mammogram_Status_7.0',
    'Mammogram_Status_9.0', 'Mammogram_Status_99.0', 'Flu_Shot_Status_2.0',
    'Flu_Shot_Status_7.0', 'Flu_Shot_Status_9.0', 'Tetanus_Shot_Status_2.0',
    'Tetanus_Shot_Status_3.0', 'Tetanus_Shot_Status_4.0',
    'Tetanus_Shot_Status_7.0', 'Tetanus_Shot_Status_9.0'
]

In [36]:
# Calculate a "Preventive Behavior Index" by summing the values across all preventive behavior-related columns.
# This index measures an individual's level of engagement in preventive care activities, which may influence health outcomes.
df['Preventive_Behavior_Index'] = df[preventive_behavior_columns].sum(axis=1)
df_test['Preventive_Behavior_Index'] = df_test[preventive_behavior_columns].sum(axis=1)

In [37]:
# Create an interaction term between Alcohol Consumption and Smoking Status.
# This feature captures the combined effect of these two lifestyle factors on health outcomes.
df['Alcohol_Smoking_Interaction'] = df['Alcohol_Consumption'] * df['Smoking_Status']
df_test['Alcohol_Smoking_Interaction'] = df_test['Alcohol_Consumption'] * df_test['Smoking_Status']

In [38]:
# Create an interaction term between BMI Category and Total Physical Inactivity.
# This feature reflects the relationship between an individual's body mass index and their level of physical inactivity.
df['BMI_Physical_Inactivity'] = df['BMI_Category'] * df['Total_Physical_Inactivity']
df_test['BMI_Physical_Inactivity'] = df_test['BMI_Category'] * df_test['Total_Physical_Inactivity']

In [40]:
# Create an interaction term between Preventive Behavior Index and Comorbidity Score.
# This feature measures how an individual's engagement in preventive behaviors interacts with their total disease burden.
df['Preventive_Comorbidity_Interaction'] = df['Preventive_Behavior_Index'] * df['Comorbidity_Score']
df_test['Preventive_Comorbidity_Interaction'] = df_test['Preventive_Behavior_Index'] * df_test['Comorbidity_Score']

In [39]:
# Create a Housing-Chronic Interaction feature by summing specific housing-related status columns.
# This feature represents how various housing statuses might contribute to chronic condition risk factors.
df['Housing_Chronic_Interaction'] = df['Housing_Status_2.0'] + df['Housing_Status_3.0'] + df['Housing_Status_7.0']
df_test['Housing_Chronic_Interaction'] = df_test['Housing_Status_2.0'] + df_test['Housing_Status_3.0'] + df_test['Housing_Status_7.0']

In [41]:
# Create an interaction term between Income Level and Employment Status.
# This feature reflects the relationship between an individual's income and employment status, which may influence health access and outcomes.
df['Income_Employment'] = df['Income_Level'] * df['Employment_Status']
df_test['Income_Employment'] = df_test['Income_Level'] * df_test['Employment_Status']

In [42]:
# Create an interaction term between Age and General Health.
# This feature captures how an individual's age interacts with their self-assessed general health status.
df['Age_General_Health'] = df['Age'] * df['General_Health']
df_test['Age_General_Health'] = df_test['Age'] * df_test['General_Health']

In [43]:
# Create a Race-Chronic Condition feature by summing specific race-related columns.
# This feature aggregates racial factors that may be associated with chronic disease risks.
df['Race_Chronic_Condition'] = df['Race_2'] + df['Race_3'] + df['Race_4']
df_test['Race_Chronic_Condition'] = df_test['Race_2'] + df_test['Race_3'] + df_test['Race_4']

In [44]:
# Create an interaction term between Sex and Preventive Behavior Index.
# This feature reflects how gender interacts with an individual's engagement in preventive care behaviors.
df['Sex_Preventive_Behavior'] = df['Sex_2'] * df['Preventive_Behavior_Index']
df_test['Sex_Preventive_Behavior'] = df_test['Sex_2'] * df_test['Preventive_Behavior_Index']

In [45]:
print(df.columns)

Index(['Age', 'Education_Level', 'Income_Level', 'Marital_Status',
       'Employment_Status', 'Number_of_Children', 'Weight', 'Height',
       'Smoking_Status', 'Alcohol_Consumption', 'Alcohol_Frequency',
       'Exercise_Status', 'Sleep_Duration', 'General_Health',
       'Physical_Health_Poor_Days', 'Mental_Health_Poor_Days', 'BMI_Category',
       'Arthritis_Status', 'Stroke_Status', 'COPD_Status',
       'Kidney_Disease_Status', 'Total_Physical_Inactivity',
       'Depression_Status', 'Has_Personal_Doctor',
       'Could_Not_See_Doctor_Due_To_Cost', 'Last_Routine_Checkup',
       'Colonoscopy_Status', 'Last_Dental_Visit', 'Chronic_Condition',
       'Primary_Health_Insurance_Source_2.0',
       'Primary_Health_Insurance_Source_3.0',
       'Primary_Health_Insurance_Source_4.0',
       'Primary_Health_Insurance_Source_5.0',
       'Primary_Health_Insurance_Source_6.0',
       'Primary_Health_Insurance_Source_7.0',
       'Primary_Health_Insurance_Source_8.0',
       'Primary_Health

In [46]:
print(df_test.columns)

Index(['Age', 'Education_Level', 'Income_Level', 'Marital_Status',
       'Employment_Status', 'Number_of_Children', 'Weight', 'Height',
       'Smoking_Status', 'Alcohol_Consumption', 'Alcohol_Frequency',
       'Exercise_Status', 'Sleep_Duration', 'General_Health',
       'Physical_Health_Poor_Days', 'Mental_Health_Poor_Days', 'BMI_Category',
       'Arthritis_Status', 'Stroke_Status', 'COPD_Status',
       'Kidney_Disease_Status', 'Total_Physical_Inactivity',
       'Depression_Status', 'Has_Personal_Doctor',
       'Could_Not_See_Doctor_Due_To_Cost', 'Last_Routine_Checkup',
       'Colonoscopy_Status', 'Last_Dental_Visit', 'Chronic_Condition',
       'Primary_Health_Insurance_Source_2.0',
       'Primary_Health_Insurance_Source_3.0',
       'Primary_Health_Insurance_Source_4.0',
       'Primary_Health_Insurance_Source_5.0',
       'Primary_Health_Insurance_Source_6.0',
       'Primary_Health_Insurance_Source_7.0',
       'Primary_Health_Insurance_Source_8.0',
       'Primary_Health

In [47]:
df_test.shape

(20915, 85)

## Feature Selection

#### Threshold for Collinearity

- **|r| > 0.8**: Indicates strong collinearity and potential redundancy between features. Features with such high correlation should be carefully reviewed and possibly removed or combined.
- **|r| > 0.9**: Almost perfect collinearity, which is highly problematic and can significantly affect model performance, especially in linear models.

#### Why Collinearity is a Concern

##### **1. Multicollinearity in Linear Models**
- In linear models (e.g., linear regression), high collinearity can lead to unstable coefficients, making the model sensitive to small changes in data.

##### **2. Redundancy in Features**
- Highly correlated features provide overlapping information, which increases computational costs without adding significant predictive power.

##### **3. Model Overfitting**
- Including multiple collinear features may lead to overfitting, reducing the model’s generalization capability.


In [48]:
# Calculate the correlation matrix for all numerical features in the DataFrame.
# The correlation matrix provides pairwise correlation coefficients, indicating the strength and direction of the relationships between features.
# This is useful for identifying highly correlated features, which may indicate redundancy or multicollinearity.
correlation_matrix = df.corr()

In [49]:
# Print the computed correlation matrix to visually inspect the relationships between features.
# High positive or negative correlations (close to 1 or -1) can guide feature selection or modification.
print(correlation_matrix)

                                         Age  Education_Level  Income_Level  \
Age                                 1.000000         0.008154     -0.030120   
Education_Level                     0.008154         1.000000      0.235755   
Income_Level                       -0.030120         0.235755      1.000000   
Marital_Status                     -0.333512        -0.119968     -0.159365   
Employment_Status                   0.540928        -0.128083     -0.153822   
...                                      ...              ...           ...   
Preventive_Comorbidity_Interaction -0.805305        -0.018973      0.022417   
Income_Employment                   0.447958        -0.013062      0.363556   
Age_General_Health                  0.594951        -0.177134     -0.171793   
Race_Chronic_Condition             -0.084230        -0.030239     -0.059590   
Sex_Preventive_Behavior            -0.475931         0.020758     -0.022554   

                                    Marital_Status 

#### Adjusting Correlation Threshold in the Context of Health Insurance

In the life and health insurance industry, the accuracy and reliability of predictive models are crucial for making informed decisions that impact costs and member health outcomes. Given this context:
- A **stricter correlation threshold** (e.g., |r| > 0.7 or even |r| > 0.6) is recommended to mitigate overfitting and improve generalization.
- Lowering the threshold helps reduce redundancy in features, ensuring the model focuses on distinct, meaningful relationships.

This adjustment aligns with best practices for creating interpretable, robust models in the insurance domain, where reliability and fairness are paramount.

In [50]:
# Set a threshold for high correlation
threshold = 0.60

# Compute the correlation matrix
correlation_matrix = df.corr().abs()  # Take the absolute value of correlations

# Create a mask that keeps only the upper triangle of the matrix (excluding the diagonal)
mask = np.triu(np.ones(correlation_matrix.shape), k=1).astype(bool)

# Filter correlations above the threshold
high_corr = correlation_matrix.where(mask).stack()

# Filter only those that exceed the threshold
high_corr_dict = high_corr[high_corr > threshold].to_dict()

# Print the dictionary
print("Highly Correlated Feature Pairs:")
for key, value in high_corr_dict.items():
    print(f"{key}: {value:.2f}")


Highly Correlated Feature Pairs:
('Age', 'Colonoscopy_Status'): 0.80
('Age', 'Preventive_Behavior_Index'): 0.81
('Age', 'Preventive_Comorbidity_Interaction'): 0.81
('Employment_Status', 'Income_Employment'): 0.79
('Alcohol_Consumption', 'Alcohol_Smoking_Interaction'): 0.77
('Exercise_Status', 'Total_Physical_Inactivity'): 0.99
('Exercise_Status', 'BMI_Physical_Inactivity'): 0.74
('General_Health', 'Age_General_Health'): 0.82
('BMI_Category', 'Health_Risk_index'): 0.90
('BMI_Category', 'BMI_Physical_Inactivity'): 0.69
('Arthritis_Status', 'Comorbidity_Score'): 0.61
('Total_Physical_Inactivity', 'BMI_Physical_Inactivity'): 0.73
('Colonoscopy_Status', 'Preventive_Behavior_Index'): 1.00
('Colonoscopy_Status', 'Preventive_Comorbidity_Interaction'): 0.99
('Colonoscopy_Status', 'Sex_Preventive_Behavior'): 0.61
('Housing_Status_2.0', 'Housing_Chronic_Interaction'): 0.88
('Mammogram_Status_2.0', 'Sex_Preventive_Behavior'): 0.71
('Mammogram_Status_99.0', 'Sex_2'): 1.00
('Race_2', 'Race_Chronic_C

#### List of Columns to Drop

Based on the high correlation analysis, the following columns have been dropped to address multicollinearity and redundancy:

1. **`Colonoscopy_Status`**: Perfectly correlated with `Preventive_Behavior_Index` and redundant.
2. **`Preventive_Behavior_Index`** and **`Preventive_Comorbidity_Interaction`**: Highly correlated and redundant with `Colonoscopy_Status`.
3. **`Age_General_Health`**: Strong correlation with `Age` and `General_Health`.
4. **`Exercise_Status`**: Redundant due to its high correlation with `Total_Physical_Inactivity`.
5. **`BMI_Category`** and **`BMI_Physical_Inactivity`**: Highly correlated with `Health_Risk_index`.
6. **`Housing_Status_2.0`**: Redundant with `Housing_Chronic_Interaction`.
7. **`Mammogram_Status_99.0`**: Perfectly correlated with `Sex_2`.
8. **`Alcohol_Smoking_Interaction`**: High correlation with `Alcohol_Consumption`.
9. **`Health_Risk_index`**: Redundant due to its high correlation with `BMI_Category`.
10. **`Race_Chronic_Condition`**: Highly correlated with `Race_2`.

This step ensures that our dataset retains only unique, non-redundant features, improving model efficiency and interpretability.
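For comparison, a common alternative to a hand-picked drop list is to scan the upper triangle of the absolute correlation matrix and drop only the later column of each highly correlated pair, so that one member of every pair is retained. The sketch below is an assumption-laden illustration on toy data, not the procedure used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data: "b" is an exact multiple of "a"; "c" is only weakly related to both
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0, 12.0],
    "c": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
})
threshold = 0.60

# Keep only the upper triangle (k=1 excludes the diagonal) of |r|
abs_corr = toy.corr().abs()
upper = abs_corr.where(np.triu(np.ones(abs_corr.shape, dtype=bool), k=1))

# A column is dropped if it is highly correlated with any earlier column
to_drop = [column for column in upper.columns if (upper[column] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)                # ['b']
print(list(reduced.columns))  # ['a', 'c']
```

Which member of a pair is kept depends on column order, so in practice the order (or an explicit preference for original over derived features) should be chosen deliberately.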

In [51]:
columns_to_drop = [
    'Colonoscopy_Status', 'Preventive_Behavior_Index', 
    'Preventive_Comorbidity_Interaction', 'Age_General_Health',
    'Exercise_Status', 'BMI_Category', 'Housing_Status_2.0', 
    'Mammogram_Status_99.0', 'BMI_Physical_Inactivity', 
    'Alcohol_Smoking_Interaction', 'Health_Risk_index', 
    'Race_Chronic_Condition'
]

df = df.drop(columns=columns_to_drop)

In [52]:
df.shape

(227525, 73)

In [53]:
df_test = df_test.drop(columns=columns_to_drop)

In [54]:
df_test.shape

(20915, 73)

In [None]:
# Check if all columns in df_test are in df
test_columns = set(df_test.columns)
train_columns = set(df.columns)

# Find columns in df_test that are not in df
extra_columns_in_test = test_columns - train_columns

# Find columns in df that are not in df_test
missing_columns_in_test = train_columns - test_columns

# Print the results
if not extra_columns_in_test and not missing_columns_in_test:
    print("All column names in df_test match those in df!")
else:
    if extra_columns_in_test:
        print(f"Columns in df_test but not in df: {extra_columns_in_test}")
    if missing_columns_in_test:
        print(f"Columns in df but not in df_test: {missing_columns_in_test}")

All column names in df_test match those in df!


### Exporting the Feature Engineering Data Set

The final step in this notebook saves the cleaned and preprocessed dataset to a new CSV file named `alexis_train.csv`. This file will be used in subsequent steps for training our model.

In [55]:
df.to_csv("alexis_train.csv", index=False)

print("The file 'alexis_train.csv' has been created!")

The file 'alexis_train.csv' has been created!


### Exporting the Test Data to csv

In [56]:
df_test.to_csv("Final_test.csv", index=False)

# Confirm that the file has been successfully created
print("The file 'Final_test.csv' has been updated!")

The file 'Final_test.csv' has been updated!
