# Exploratory Data Analysis (EDA)

#### Hi team,

In this notebook, I have conducted an extensive exploratory data analysis (EDA) using the provided `train.csv` file. EDA is a critical step in any data science project as it allows us to explore the dataset, scrutinize its characteristics, and uncover insights that guide data preparation and model development.

To build a strong foundation, I also relied on external sources to understand key concepts related to **preventive care** and **chronic disease detection**, aligning the analysis with the project's objective of identifying patterns and actionable insights.

#### Key Steps in the Analysis:
1. **Dataset Loading and Overview**:
   - I used the pandas library to import and explore the dataset, examining its dimensions and structure. 
   - Initial summaries and descriptive statistics provided an overview of numerical and categorical features.

2. **Automated EDA with Sweetviz**:
   - Leveraged the Sweetviz library to automate EDA and generate an interactive HTML report. This report quickly highlighted key dataset characteristics, such as feature distributions, correlations, and missing values.
   - The Sweetviz report played a pivotal role in identifying potential preprocessing needs, such as handling missing data and feature engineering.

3. **Handling Missing Values**:
   - Identified columns with high percentages of missing values and made data-driven decisions to either drop or impute missing entries.

4. **Insights for Model Preparation**:
   - Key observations from this notebook will inform the next steps, such as feature engineering and model training.


In [1]:
import numpy as np
import pandas as pd
import neptune
import sweetviz as sv
import seaborn as sns
import matplotlib.pyplot as plt

print("All libraries loaded!")


  from .autonotebook import tqdm as notebook_tqdm


All libraries loaded!


## Data Loading and Overview

In [2]:
df = pd.read_csv("train.csv")

### Initial Exploration

The following lines of code are used to:
1. **Check the dataset dimensions** (`df.shape`).
2. **Preview the dataset** to understand its structure and content (`df.head()`).
3. **Generate descriptive statistics** for numerical columns (`df.describe()`), which helps understand the distribution of the data.

In [3]:
# Check the dimensions of the DataFrame (number of rows and columns)
df.shape

(275236, 44)

In [4]:
# Display the first 5 rows of the DataFrame to understand its structure and contents
df.head()

Unnamed: 0,Age,Sex,Race,Education_Level,Income_Level,Marital_Status,Employment_Status,Number_of_Children,Weight,Height,...,Last_Routine_Checkup,Tetanus_Shot_Status,Colonoscopy_Status,Mammogram_Status,PSA_Test_Status,Flu_Shot_Status,Eye_Exam_Status,Last_Dental_Visit,Veteran_Status,Chronic_Condition
0,66,1,1,4.0,99.0,1.0,1.0,88.0,220.0,603.0,...,1.0,4.0,1.0,,,2.0,,2.0,2.0,3.0
1,74,1,1,6.0,7.0,1.0,7.0,88.0,170.0,511.0,...,1.0,3.0,1.0,,,1.0,,1.0,1.0,3.0
2,61,1,2,4.0,5.0,5.0,8.0,88.0,170.0,511.0,...,7.0,4.0,2.0,,,2.0,,7.0,1.0,3.0
3,44,2,1,6.0,3.0,5.0,1.0,88.0,150.0,502.0,...,2.0,4.0,,2.0,,1.0,,4.0,2.0,3.0
4,79,1,7,2.0,5.0,6.0,5.0,10.0,275.0,507.0,...,3.0,,1.0,,,,,3.0,2.0,1.0


In [5]:
# Generate descriptive statistics for the DataFrame, including count, mean, and standard deviation
# Useful for understanding the distribution of numerical features
df.describe()

Unnamed: 0,Age,Sex,Race,Education_Level,Income_Level,Marital_Status,Employment_Status,Number_of_Children,Weight,Height,...,Last_Routine_Checkup,Tetanus_Shot_Status,Colonoscopy_Status,Mammogram_Status,PSA_Test_Status,Flu_Shot_Status,Eye_Exam_Status,Last_Dental_Visit,Veteran_Status,Chronic_Condition
count,275236.0,275236.0,275236.0,275235.0,267285.0,275234.0,271388.0,269475.0,265485.0,264803.0,...,275236.0,247391.0,179415.0,137219.0,4610.0,248494.0,10051.0,274382.0,272652.0,275236.0
mean,55.344355,1.529934,1.91895,5.047697,22.549402,2.405448,3.916393,67.001974,780.393687,823.406389,...,1.438198,3.113351,1.269883,1.248661,1.786551,1.519618,2.411203,1.76825,1.897276,2.633329
std,17.592387,0.499104,1.924496,1.045934,32.867302,1.813631,2.900939,37.132558,2268.496758,1600.589501,...,1.113077,1.695932,0.4439,0.633706,1.508633,0.749518,1.57118,1.325203,0.499729,0.766646
min,18.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,32.0,209.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,41.0,1.0,1.0,4.0,6.0,1.0,1.0,88.0,150.0,504.0,...,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,3.0
50%,58.0,2.0,1.0,5.0,8.0,1.0,2.0,88.0,180.0,507.0,...,1.0,3.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,3.0
75%,70.0,2.0,2.0,6.0,10.0,3.0,7.0,88.0,220.0,511.0,...,1.0,4.0,2.0,1.0,2.0,2.0,3.0,2.0,2.0,3.0
max,80.0,2.0,7.0,9.0,99.0,9.0,9.0,99.0,9999.0,9999.0,...,9.0,9.0,2.0,9.0,9.0,9.0,9.0,9.0,9.0,3.0


### Automated EDA with Sweetviz

The `Sweetviz` library's `analyze` function was used to perform an automated exploratory data analysis (EDA) on the dataset. This function provides a comprehensive overview of the data, including:

1. **Dataset Summary**:
   - The shape of the dataset (number of rows and columns).
   - Types of features (numerical or categorical).
   - Missing values and unique value counts.

2. **Feature Analysis**:
   - For **numerical features**, it provides:
     - Distributions (via histograms).
     - Key statistics like mean, median, variance, and outlier counts.
   - For **categorical features**, it includes:
     - Frequency distributions and proportions.
     - Bar charts for visualization.

3. **Target Variable Insights**:
   - If a target variable is provided, the function evaluates the relationship between each feature and the target.
   - It highlights influential features and calculates correlations.

4. **Comparison (if applicable)**:
   - If multiple datasets (e.g., training and testing sets) are provided, the function compares them to identify differences in feature distributions.

The resulting report is saved as an interactive HTML file, enabling a quick and intuitive review of the dataset's key properties. This tool significantly accelerates the initial data exploration process, allowing us to focus on deeper analysis and feature engineering.


### Sources for Sweetviz Analysis

1. **[Official Sweetviz Documentation](https://github.com/fbdesignpro/sweetviz)**  
   Provides comprehensive details on the library's capabilities, including dataset summaries, feature analysis, target variable insights, and dataset comparisons.

2. **["SweetViz Library – EDA in Seconds" by Analytics Vidhya](https://www.analyticsvidhya.com/blog/2021/05/sweetviz-library-eda-in-seconds/)**  
   Offers an in-depth tutorial on using Sweetviz for automated exploratory data analysis, highlighting its features and benefits.

3. **["SweetViz: Streamlining EDA with Elegant Visualizations" by Statistics Canada](https://statcan.github.io/aaw/en/1-Experiments/Notebooks/SweetViz_EN.html)**  
   Discusses how Sweetviz simplifies the exploratory data analysis process through automated and interactive reports.


In [None]:
# Generate a Sweetviz analysis report for the dataset
# This will include insights such as feature distributions, correlations, and relationships with the target variable
report = sv.analyze(df)

In [None]:
# Save and open the Sweetviz report as an interactive HTML file
# The report provides a user-friendly way to explore the dataset visually
report.show_html("Sweetviz_Report.html")

## Swetviz Report Analysis

The Sweetviz report provides a high-level overview of the dataset, summarizing its key characteristics. The dataset consists of **275,326 rows**, with **8 duplicate rows**, requiring approximately **96.9 MB of RAM**. It contains **44 features**, of which **9 are numerical** and **35 are categorical**.

The automated report offers detailed insights into each feature:
1. **Numerical Features**:
   - Includes a comprehensive statistical summary, such as mean, median, standard deviation, and range.
2. **Categorical Features**:
   - Displays frequency distribution reports to highlight the prevalence of each category.

Additionally, the report highlights missing values within each feature:
- **Color Coding**: 
   - Low percentages of missing values are displayed in green for easy identification.
   - High percentages are flagged in red to prioritize handling during data preprocessing.

This analysis provides a strong foundation for identifying patterns, missing data, and potential areas for feature engineering.

### Identifying Columns with High Missing Values

As the next step, we analyze the columns with the highest percentages of missing values. Based on the Sweetviz report, the following features were identified as having significant missing data:

- **Pneumonia_Vaccination_Status**: 61%
- **Caregiver_Major_Health_Problem**: 96%
- **PSA_Test_Status**: 98%
- **Mammogram_Status**: 50%
- **Eye_Exam_Status**: 96%

These columns will require careful handling during preprocessing, as their high missing percentages may impact model performance. Potential strategies include:


### Target Variable: Chronic_Condition

#### Distribution of Classes:
The dataset reveals a significant class imbalance:
- **81%** of individuals fall under `3.0` (no chronic condition).
- **18%** fall under `1.0` (chronic condition present).
- **1%** fall under `2.0` (special case, Yes, but female told only during pregnancy).

This imbalance will require adjustments during modeling to avoid bias toward the majority class.

#### Key Feature Associations:
1. **Categorical Features**:
   - The strongest association is observed with **Eye_Exam_Status** (Uncertainty Coefficient = **0.32**).
   - Other notable features include:
     - **General_Health**: **0.08**
     - **Employment_Status**: **0.05**
     - **Colonoscopy_Status**: **0.04**
     - These features may represent preventive care behaviors and health indicators critical for predicting chronic conditions.

2. **Numerical Features**:
   - **Age** has the highest correlation with `Chronic_Condition` (Correlation Ratio = **0.24**).
   - **Number_of_Children**: **0.13**
   - **Primary_Health_Insurance_Source**: **0.03**
   - Age’s correlation suggests a potential risk increase with age, consistent with prior medical literature.


#### Key Insights:
The analysis highlights the importance of preventive behaviors and demographic factors in predicting chronic conditions. These insights will guide the feature engineering and model-building phases, ensuring the inclusion of relevant predictors.

#### Transition:
After analyzing the target variable and identifying key feature associations, we now address missing values in the dataset. Proper handling of missing data is crucial, especially for features like `Eye_Exam_Status` and `Colonoscopy_Status`, which are strongly associated with `Chronic_Condition`.


## Missing Value Analysis

In [8]:
# Calculate the percentage of missing values for each column
missing_percent = df.isnull().sum() / len(df) * 100
print(missing_percent)

Age                                  0.000000
Sex                                  0.000000
Race                                 0.000000
Education_Level                      0.000363
Income_Level                         2.888794
Marital_Status                       0.000727
Employment_Status                    1.398073
Number_of_Children                   2.093113
Weight                               3.542778
Height                               3.790565
Housing_Status                       0.001453
Smoking_Status                       0.000000
Alcohol_Consumption                  0.000000
Alcohol_Frequency                    0.000000
Exercise_Status                      0.000000
Sleep_Duration                       0.000363
Asthma_Status                        0.000000
Pneumonia_Vaccination_Status        61.279774
General_Health                       0.000363
Physical_Health_Poor_Days            0.000000
Mental_Health_Poor_Days              0.000000
Difficulty_Walking                

In this step, we focus on identifying columns with **less than 10% missing values** and removing rows where these columns still have missing data. The reasoning behind this decision is rooted in the following principles:

1. **Data Integrity**:
   - Columns with minimal missing data are typically critical to preserve because they provide valuable information. Dropping rows with missing values ensures the data remains complete for these columns without significantly reducing the overall dataset size.

2. **Minimal Information Loss**:
   - Since less than 10% of values are missing, the number of rows affected by this operation is relatively small. This approach minimizes the impact on dataset size while maintaining high data quality.

3. **Avoiding Imputation Bias**:
   - Imputing values (e.g., with the mean, median, or mode) for these columns may introduce bias, especially when the missingness is not random (e.g., related to certain groups). By removing rows with missing data, we ensure the integrity of the remaining data.

#### Supporting References:
- *Kang, H. (2013)*: "How to understand and utilize missing data" explains that when the percentage of missing data is small, removing affected rows is often the simplest and most effective approach. ([Source](https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-13-1))
- *Schafer, J. L., & Graham, J. W. (2002)*: Discuss best practices for handling missing data, noting that listwise deletion (row removal) is appropriate for small percentages of missingness. ([Source](https://doi.org/10.1037/1082-989X.7.2.147))


#### Implementation:
The following code identifies columns with less than 10% missing values, then drops rows with missing values in those columns. This ensures data quality and consistency moving forward.

In [9]:
# Identify columns with less than 10% missing values
columns_with_few_missing = missing_percent[missing_percent < 10].index

In [10]:
print(columns_with_few_missing)

Index(['Age', 'Sex', 'Race', 'Education_Level', 'Income_Level',
       'Marital_Status', 'Employment_Status', 'Number_of_Children', 'Weight',
       'Height', 'Housing_Status', 'Smoking_Status', 'Alcohol_Consumption',
       'Alcohol_Frequency', 'Exercise_Status', 'Sleep_Duration',
       'Asthma_Status', 'General_Health', 'Physical_Health_Poor_Days',
       'Mental_Health_Poor_Days', 'Difficulty_Walking', 'Arthritis_Status',
       'Coronary_Heart_Disease_Status', 'Stroke_Status', 'COPD_Status',
       'Kidney_Disease_Status', 'Total_Physical_Inactivity',
       'Depression_Status', 'Primary_Health_Insurance_Source',
       'Has_Personal_Doctor', 'Could_Not_See_Doctor_Due_To_Cost',
       'Last_Routine_Checkup', 'Flu_Shot_Status', 'Last_Dental_Visit',
       'Veteran_Status', 'Chronic_Condition'],
      dtype='object')


In [11]:
# Drop rows where columns with less than 10% missing values still have missing data
df = df.dropna(subset=columns_with_few_missing)

Our next step focuses on programmatically identifying columns with **over 10% missing values** and evaluating their impact on the dataset. Among these, columns with **more than 60% missing values** will be dropped entirely. The reasoning behind this approach is as follows:

#### Impact of High Missing Values:
- Columns with a high percentage of missing values (>60%) provide limited information and can introduce noise into the model.
- Retaining such columns may lead to unreliable results and increased computational overhead.

#### Defining Thresholds:
- The **10% threshold** is used to identify and evaluate features with moderate missingness. These columns may require specific handling (e.g., imputation or encoding).
- The **60% threshold** is chosen as a cutoff to eliminate features that are too incomplete to contribute meaningful insights.

#### Balancing Data Retention and Quality:
- Dropping columns with excessive missing data prevents bias and maintains the dataset's overall integrity while ensuring critical features are preserved.

#### Supporting References:
- **Van Buuren, S. (2018)**: Notes that features with high percentages of missing data (>50%) are often too incomplete to be useful and should be removed. ([Source](https://stefvanbuuren.name/fimd/))
- **Little & Rubin (2019)**: Discuss strategies for handling missing data and emphasize that a balance must be struck between data retention and quality. ([Source](https://doi.org/10.1007/978-1-4899-7274-2))

In [12]:
# Identify columns with more than 10% missing values
columns_with_high_missing = missing_percent[missing_percent > 10].index

In [13]:
# Print columns with more than 10% missing values for further analysis or decision-making
print(missing_percent[missing_percent > 10])

Pneumonia_Vaccination_Status      61.279774
BMI_Category                      10.858681
Caregiver_Major_Health_Problem    95.596870
Tetanus_Shot_Status               10.116773
Colonoscopy_Status                34.814123
Mammogram_Status                  50.144967
PSA_Test_Status                   98.325074
Eye_Exam_Status                   96.348225
dtype: float64


In [14]:
# Identify columns with more than 60% missing values
columns_with_very_high_missing = missing_percent[missing_percent > 60].index

In [15]:
# Drop columns with more than 60% missing values as they are unlikely to provide useful information
df = df.drop(columns=columns_with_very_high_missing)

Our next step is to identify the columns that still have missing values and evaluate their importance in achieving the project's goal. Specifically, this involves determining how these features contribute to identify early patients with chronic diseases to reduce healthcare costs and implement effective preventive care strategies.

By focusing on the remaining missing data, we ensure that all critical features are appropriately handled, maintaining the dataset's integrity and relevance for predictive modeling.

In [16]:
# Identify columns that still have missing values after dropping high-missing-value columns
remaining_missing_columns = df.columns[df.isnull().any()]


In [17]:
# Analyze the percentage of missing values in remaining columns by the Chronic_Condition categories
for column in remaining_missing_columns:
    print(f"Missingness in {column} by Chronic_Condition:")
    print(df.groupby("Chronic_Condition")[column].apply(lambda x: x.isnull().mean() * 100))

Missingness in BMI_Category by Chronic_Condition:
Chronic_Condition
1.0     6.524040
2.0    11.629603
3.0     6.716666
Name: BMI_Category, dtype: float64
Missingness in Tetanus_Shot_Status by Chronic_Condition:
Chronic_Condition
1.0    0.432482
2.0    0.437477
3.0    0.446671
Name: Tetanus_Shot_Status, dtype: float64
Missingness in Colonoscopy_Status by Chronic_Condition:
Chronic_Condition
1.0     9.799862
2.0    50.054685
3.0    34.737732
Name: Colonoscopy_Status, dtype: float64
Missingness in Mammogram_Status by Chronic_Condition:
Chronic_Condition
1.0    49.767656
2.0     0.437477
3.0    47.212833
Name: Mammogram_Status, dtype: float64


From the breakdown above, we derive the following insights regarding missing values in key columns:

#### **Tetanus_Shot_Status**:
- Rows with missing values in this column can be safely dropped as they constitute nearly **0% across all target variable categories**, making their impact negligible.

#### **BMI_Category**:
- Similarly, rows with missing values in this column can also be dropped, as the distribution of missingness is minimal and well-balanced across the target categories. This ensures no significant bias is introduced by removing these rows.

#### **Colonoscopy_Status**:
- This column reveals an **interesting pattern**:
  - High percentages of missing values are observed for patients with `Yes` (**9.8%**) and `No` (**34.7%**) responses in the `Chronic_Condition` target variable.
  - However, the group with `Yes (but during pregnancy)` has an extremely high percentage (**50%**).
  - These insights are critical as they reflect differences in **preventive care practices** among patients and highlight gaps that could guide targeted interventions.

#### **Mammogram_Status**:
- A similar pattern is observed:
  - For patients with `No` chronic condition, **47%** of values are missing, while missingness is almost negligible for patients who responded `Yes during pregnancy`.
  - This column also reflects differences in **gender preventive care**, and should be highlighted when using over or under sampling.


In [18]:
# Drop rows with missing values in the BMI_Category column
# This column is considered critical, and any missing values are not acceptable
df = df.dropna(subset=['BMI_Category'])

In [19]:
# Drop rows with missing values in the Tetanus_Shot_Status column
df = df.dropna(subset=['Tetanus_Shot_Status'])

Given the project's objective to **identify patients with chronic diseases**, reduce costs, and implement effective **preventive care strategies**, it is essential to understand our patients' behaviors. 

The current implementation of the survey allows patients to skip certain answers, which introduces missing values in key columns like `Colonoscopy_Status` and `Mammogram_Status`. Rather than dropping these columns due to missing values, we will account for this behavior by imputing the value **99**. This ensures:
- The imputed value does not introduce a false relationship with existing categories in the column.
- The behavior of skipped responses is retained for future analysis, preserving potential insights into **preventive care practices**.

This approach ensures we do not lose valuable context while maintaining the integrity of the dataset for predictive modeling.

In [20]:
# Fill missing values in the Colonoscopy_Status column with 99 as a placeholder
# This ensures the model treats missing values explicitly
df["Colonoscopy_Status"] = df["Colonoscopy_Status"].fillna(99)

In [21]:
# Fill missing values in the Mammogram_Status column with 99 as a placeholder
# This ensures consistency and avoids dropping rows unnecessarily
df["Mammogram_Status"] = df["Mammogram_Status"].fillna(99)


To track the changes to our dataset we will be creating a new Sweetviz report.

In [None]:
# Generate a Sweetviz analysis report for the dataset
# This will include insights such as feature distributions, correlations, and relationships with the target variabler
report = sv.analyze(df)

In [None]:
# Save and open the Sweetviz report as an interactive HTML file
# The report provides a user-friendly way to explore the dataset visually
report.show_html("Sweetviz_Report_After_EDA.html")

Check

In [24]:
remaining_missing_columns = df.columns[df.isnull().any()]

In [25]:
print(remaining_missing_columns)

Index([], dtype='object')


### Exporting the Preprocessed Data
The final step in this notebook saves the cleaned and preprocessed dataset to a new CSV file named `feature_engineering.csv`. This file will be used in subsequent steps for feature engineering and model development

In [26]:
# Save the processed DataFrame to a new CSV file for the feature engineering step
# This ensures the cleaned and preprocessed data is ready for further analysis or modeling
df.to_csv("feature_engineering.csv", index=False)

# Confirm that the file has been successfully created
print("The file 'feature_engineering.csv' has been created!")

The file 'feature_engineering.csv' has been created!


### Final Step: Ensuring Consistent Preprocessing for the Test Dataset


In [27]:
df_test = pd.read_csv("test_original.csv")

In [None]:
missing_percent = df_test.isnull().sum() / len(df) * 100
print(missing_percent)

Age                                  0.000000
Sex                                  0.000000
Race                                 0.000000
Education_Level                      0.000440
Income_Level                         0.840347
Marital_Status                       0.000440
Employment_Status                    0.409186
Number_of_Children                   0.606087
Weight                               1.064498
Height                               1.144929
Housing_Status                       0.000440
Smoking_Status                       0.000000
Alcohol_Consumption                  0.000000
Alcohol_Frequency                    0.000000
Exercise_Status                      0.000000
Sleep_Duration                       0.000000
Asthma_Status                        0.000000
Pneumonia_Vaccination_Status        18.628722
General_Health                       0.000000
Physical_Health_Poor_Days            0.000000
Mental_Health_Poor_Days              0.000000
Difficulty_Walking                

In [29]:
df_test.shape

(68809, 43)

In [30]:
df_test.columns

Index(['Age', 'Sex', 'Race', 'Education_Level', 'Income_Level',
       'Marital_Status', 'Employment_Status', 'Number_of_Children', 'Weight',
       'Height', 'Housing_Status', 'Smoking_Status', 'Alcohol_Consumption',
       'Alcohol_Frequency', 'Exercise_Status', 'Sleep_Duration',
       'Asthma_Status', 'Pneumonia_Vaccination_Status', 'General_Health',
       'Physical_Health_Poor_Days', 'Mental_Health_Poor_Days',
       'Difficulty_Walking', 'BMI_Category', 'Arthritis_Status',
       'Coronary_Heart_Disease_Status', 'Stroke_Status', 'COPD_Status',
       'Kidney_Disease_Status', 'Caregiver_Major_Health_Problem',
       'Total_Physical_Inactivity', 'Depression_Status',
       'Primary_Health_Insurance_Source', 'Has_Personal_Doctor',
       'Could_Not_See_Doctor_Due_To_Cost', 'Last_Routine_Checkup',
       'Tetanus_Shot_Status', 'Colonoscopy_Status', 'Mammogram_Status',
       'PSA_Test_Status', 'Flu_Shot_Status', 'Eye_Exam_Status',
       'Last_Dental_Visit', 'Veteran_Status'],
     

In [31]:
# Define columns with missing values <10% in the training dataset
columns_with_few_missing_training = [col for col in df.columns if df[col].isnull().sum() < len(df) * 0.1]

# Ensure the same columns exist in the test dataset
columns_with_few_missing_test = [col for col in columns_with_few_missing_training if col in df_test.columns]

# Drop rows in df_test with missing values in these columns
df_test = df_test.dropna(subset=columns_with_few_missing_test)

In [32]:
print(f"Rows remaining after dropping rows with missing values in <10% columns: {df_test.shape[0]}")

Rows remaining after dropping rows with missing values in <10% columns: 20915


In [33]:
df_test["Mammogram_Status"] = df_test["Mammogram_Status"].fillna(99)
df_test["Colonoscopy_Status"] = df_test["Colonoscopy_Status"].fillna(99)

In [34]:
df_test.shape

(20915, 43)

In [36]:
# Drop rows with missing values in the Tetanus_Shot_Status column
df_test = df_test.dropna(subset=['Tetanus_Shot_Status'])
df_test = df_test.dropna(subset=['BMI_Category'])

print(f"Rows remaining after dropping columns: {df_test.shape[0]}")  # After final dropna

Rows remaining after dropping columns: 20915


### Exporting the Test Data to csv

In [38]:
df_test.to_csv("Post_EDA_test.csv", index=False)

# Confirm that the file has been successfully created
print("The file 'Post_EDA_test.csv' has been updated!")

The file 'Post_EDA_test.csv' has been updated!


In [39]:
df_test.shape

(20915, 43)

In [40]:
df_test.columns

Index(['Age', 'Sex', 'Race', 'Education_Level', 'Income_Level',
       'Marital_Status', 'Employment_Status', 'Number_of_Children', 'Weight',
       'Height', 'Housing_Status', 'Smoking_Status', 'Alcohol_Consumption',
       'Alcohol_Frequency', 'Exercise_Status', 'Sleep_Duration',
       'Asthma_Status', 'Pneumonia_Vaccination_Status', 'General_Health',
       'Physical_Health_Poor_Days', 'Mental_Health_Poor_Days',
       'Difficulty_Walking', 'BMI_Category', 'Arthritis_Status',
       'Coronary_Heart_Disease_Status', 'Stroke_Status', 'COPD_Status',
       'Kidney_Disease_Status', 'Caregiver_Major_Health_Problem',
       'Total_Physical_Inactivity', 'Depression_Status',
       'Primary_Health_Insurance_Source', 'Has_Personal_Doctor',
       'Could_Not_See_Doctor_Due_To_Cost', 'Last_Routine_Checkup',
       'Tetanus_Shot_Status', 'Colonoscopy_Status', 'Mammogram_Status',
       'PSA_Test_Status', 'Flu_Shot_Status', 'Eye_Exam_Status',
       'Last_Dental_Visit', 'Veteran_Status'],
     