# Predictive Health Assessment: Leveraging DHS Data for Targeted Interventions in Kenya


**Authors**: [Alpha Guya](mailto:alpha.guya@student.moringaschool.com), [Ben Ochoro](mailto:ben.ochoro@student.moringaschool.com), [Caleb Ochieng](mailto:caleb.ochieng@student.moringaschool.com), [Christine Mukiri](mailto:christine.mukiri@student.moringaschool.com), [Dominic Muli](mailto:dominic.muli@student.moringaschool.com), [Frank Mandele](mailto:frank.mandele@student.moringaschool.com), [Jacquiline Tulinye](mailto:jacquiline.tulinye@student.moringaschool.com) and [Lesley Wanjiku](mailto:lesley.wanjiku@student.moringaschool.com)

## 1.0) Project Overview

Our project uses machine learning techniques and data from the Demographic and Health Surveys (DHS) program to build predictive models that evaluate individual and household health risks in Kenya. By analyzing a varied set of demographic, socio-economic, and health-related indicators, we aim to develop reliable models that estimate the likelihood of malnutrition, disease prevalence, and other health risks within specific communities. The goal is to give users such as public health officials targeted insights that enable more effective allocation of resources and interventions. This proactive approach is meant to optimize the impact of health initiatives by prioritizing and customizing interventions for at-risk populations, ultimately contributing to better health outcomes in Kenya.

## 1.1) Business Problem

Despite existing health interventions, Kenya has difficulty targeting resources and interventions effectively enough to address individual and household health risks, including malnutrition, disease, and other health concerns. This gap highlights the need for a predictive, targeted approach to allocating resources and interventions. Using machine learning models built on Demographic and Health Surveys (DHS) data, the project aims to assess the likelihood of malnutrition, disease prevalence, and other health risks based on individual and household characteristics. By accurately identifying at-risk populations, this solution seeks to empower decision-makers and public health officials to allocate resources on a needs basis, ultimately increasing the impact of health interventions and improving overall health outcomes in Kenya.

## 1.2) Objectives


Based on the data provided by the Demographic and Health Surveys (DHS) program, our objectives include:

* To analyze trends in health indicators over time.

* To predict health risks based on individual and household characteristics.

* To find the relationship between the most common diseases and demographic characteristics.

* To build predictive models that estimate health outcomes based on various demographic and socio-economic factors.

* To identify regional variations in health indicators.

* To identify factors contributing to changes in health outcomes.

* To conduct comprehensive feature engineering to extract relevant features from DHS data, considering demographic, socio-economic, and health-related variables.


### API Deployment and Usability

Deploy an accessible API interface for stakeholders to input data and receive health risk predictions based on the developed models.

### Recommendations and Conclusion
* Targeted Intervention Recommendations:

Utilize model predictions to generate targeted recommendations for health interventions and resource allocation in specific Kenyan communities.

* Impact Assessment and Validation:

Assess the real-world impact of model-guided interventions by monitoring and evaluating changes in health outcomes in targeted Kenyan populations.

## 1.3) Metric of Success

* Achieve a predictive accuracy of at least 90% on unseen validation data.
* Identify and utilize the top 10 most influential features contributing to the models' predictive power.
* Generate clear and interpretable explanations for at least 70% of model predictions.
* Create a prioritized list of actionable recommendations based on identified health risks for at least 100 communities.
* Ensure an API uptime of at least 90% and gather feedback on usability for further improvements.
* Measure the effectiveness of interventions by observing changes in health indicators, aiming for improvements in at least 80% of targeted communities.
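
The first two targets can be checked mechanically once a model exists. The following is a minimal sketch, not the project's evaluation code: the labels, predictions, feature names, and importance scores are made up for illustration, and `accuracy` and `top_k_features` are hypothetical helper names.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())

def top_k_features(names, importances, k=10):
    """Return the k feature names with the largest importance scores."""
    order = np.argsort(importances)[::-1][:k]
    return [names[i] for i in order]

# Made-up validation labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0, 1, 1]
acc = accuracy(y_true, y_pred)

# Compare against the 90% accuracy target listed above
meets_target = acc >= 0.90

# Made-up importance scores for three hypothetical features
names = ["wealth_index", "water_source", "household_size"]
scores = [0.2, 0.5, 0.3]
ranked = top_k_features(names, scores, k=2)
```

Accuracy alone may not be the most informative measure if the health outcomes turn out to be rare, so it is worth checking the class balance before relying on it.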

## 1.4) Data Relevance and Validation

The available DHS data is relevant to the intended analysis and predictions.

## 2.0) Understanding the Data

The data for this project is obtained from the [DHS Program website](https://dhsprogram.com/data/dataset/Kenya_Standard-DHS_2022.cfm?flag=0).
The encoding for this dataset is explained [here](./Recode7_DHS_10Sep2018_DHSG4.pdf).

## 2.1) Reading the Data

### 2.1.1) Installations

In [1]:
# installations
# %pip install requests
# %pip install pyreadstat
# %pip install --upgrade openpyxl

### 2.1.2) Importing Relevant Libraries

In [2]:
# importing necessary libraries
import requests, json
import urllib
import urllib.request
import urllib.error
import pandas as pd
import numpy as np
import pyreadstat
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings("ignore")

### 2.1.3) Reading the Data

Observation: we will work with

In [3]:
# Reading downloaded relevant data
df, meta = pyreadstat.read_sav("KEHR8BFL.SAV")
df_2, meta_2 = pyreadstat.read_sav("KEHR81FL - NEW.SAV")

In [4]:
df

Unnamed: 0,HHID,HV000,HV001,HV002,HV003,HV004,HV005,HV006,HV007,HV008,...,SH305E$02,SH305E$03,SH305E$04,SH305E$05,SH305E$06,SH305E$07,SH305E$08,SH305E$09,SH305E$10,SH305E$11
0,1 4,KE8,1.0,4.0,2.0,1.0,1306431.0,4.0,2022.0,1468.0,...,,,,,,,,,,
1,1 7,KE8,1.0,7.0,2.0,1.0,1306431.0,4.0,2022.0,1468.0,...,,,,,,,,,,
2,1 10,KE8,1.0,10.0,1.0,1.0,1306431.0,4.0,2022.0,1468.0,...,,,,,,,,,,
3,1 13,KE8,1.0,13.0,4.0,1.0,1306431.0,4.0,2022.0,1468.0,...,,,,,,,,,,
4,1 17,KE8,1.0,17.0,1.0,1.0,1306431.0,4.0,2022.0,1468.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37906,1692 77,KE8,1692.0,77.0,2.0,1692.0,7456997.0,5.0,2022.0,1469.0,...,,,,,,,,,,
37907,1692 80,KE8,1692.0,80.0,2.0,1692.0,7456997.0,5.0,2022.0,1469.0,...,,,,,,,,,,
37908,1692 84,KE8,1692.0,84.0,2.0,1692.0,7456997.0,5.0,2022.0,1469.0,...,,,,,,,,,,
37909,1692 88,KE8,1692.0,88.0,2.0,1692.0,7456997.0,5.0,2022.0,1469.0,...,,,,,,,,,,
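
The `HV005` column in the output above is the household sample weight, which DHS stores as an integer with six implied decimal places. A minimal sketch of rescaling it before any weighted analysis, using a toy frame that copies a few values from the output; the column name `hh_weight` is a hypothetical choice.

```python
import pandas as pd

# Toy rows copying HV001/HV005 values shown in the output above
toy = pd.DataFrame({
    "HV001": [1.0, 1.0, 1692.0],
    "HV005": [1306431.0, 1306431.0, 7456997.0],
})

# Divide by 1,000,000 to recover the actual weight
# (hh_weight is a hypothetical column name)
toy["hh_weight"] = toy["HV005"] / 1_000_000
```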


In [5]:
df_2

Unnamed: 0,HHID,HV000,HV001,HV002,HV003,HV004,HV005,HV006,HV007,HV008,...,SML16A$15,SML16A$16,SML16A$17,SML16A$18,SML16A$19,SML16A$20,SML16A$21,SML16A$22,SML16A$23,SML16A$24
0,2 1,KE7,2.0,1.0,8.0,2.0,588318.0,11.0,2020.0,1451.0,...,,,,,,,,,,
1,2 6,KE7,2.0,6.0,2.0,2.0,588318.0,11.0,2020.0,1451.0,...,,,,,,,,,,
2,2 10,KE7,2.0,10.0,2.0,2.0,588318.0,11.0,2020.0,1451.0,...,,,,,,,,,,
3,2 13,KE7,2.0,13.0,1.0,2.0,588318.0,11.0,2020.0,1451.0,...,,,,,,,,,,
4,2 16,KE7,2.0,16.0,3.0,2.0,588318.0,11.0,2020.0,1451.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7947,9186 155,KE7,9186.0,155.0,1.0,9186.0,454518.0,11.0,2020.0,1451.0,...,,,,,,,,,,
7948,9186 161,KE7,9186.0,161.0,1.0,9186.0,454518.0,11.0,2020.0,1451.0,...,,,,,,,,,,
7949,9186 166,KE7,9186.0,166.0,1.0,9186.0,454518.0,11.0,2020.0,1451.0,...,,,,,,,,,,
7950,9186 170,KE7,9186.0,170.0,1.0,9186.0,454518.0,11.0,2020.0,1451.0,...,,,,,,,,,,


In [6]:
def collapse_columns(df, prefixes, suffixes, combined_column_prefix):
    """Collapse repeated columns (e.g. HML32$01..HML32$24) into one column per prefix.

    Note: df is modified in place and is also returned.
    """
    for prefix in prefixes:
        # Columns that start with the prefix and end with one of the suffixes.
        # Suffixes are matched literally, so a single-digit suffix such as '$1' does not match '01'.
        relevant_columns = [col for col in df.columns
                            if col.startswith(prefix) and col.endswith(tuple(suffixes))]

        # Creating the combined column with the highest value in each row (NaN values are skipped)
        combined_column_name = f'{combined_column_prefix}_{prefix}'
        df[combined_column_name] = df[relevant_columns].max(axis=1)

        # Dropping the original columns
        df.drop(relevant_columns, axis=1, inplace=True)

    return df

# Specify the prefixes and suffixes for the columns you want to collapse
prefixes_to_collapse = ['HML32$', 'SH305E$', 'HML35$', 'SB115B$', 'SB115F$', 'HML8$', 'SH130$', 'SB115A$', 'SB115E$', 'SB119$',
                       'HML33$', 'SB115C$', 'SB115G$', 'SB122$', 'HC57$', 'SB115D$', 'SB115H$', 'HML5$']
suffixes = ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '14',
            '15', '16', '17', '18', '19', '20', '21', '22', '23', '24']

# Call the function for collapsing the specified columns
# (collapsed_df_1 refers to the same object as df_2, which is modified in place)
collapsed_df_1 = collapse_columns(df_2, prefixes_to_collapse, suffixes, 'collapsed')

# Display the resulting DataFrame
collapsed_df_1

Unnamed: 0,HHID,HV000,HV001,HV002,HV003,HV004,HV005,HV006,HV007,HV008,...,collapsed_SB115E$,collapsed_SB119$,collapsed_HML33$,collapsed_SB115C$,collapsed_SB115G$,collapsed_SB122$,collapsed_HC57$,collapsed_SB115D$,collapsed_SB115H$,collapsed_HML5$
0,2 1,KE7,2.0,1.0,8.0,2.0,588318.0,11.0,2020.0,1451.0,...,,,0.0,,,,4.0,,,
1,2 6,KE7,2.0,6.0,2.0,2.0,588318.0,11.0,2020.0,1451.0,...,,,0.0,,,,4.0,,,
2,2 10,KE7,2.0,10.0,2.0,2.0,588318.0,11.0,2020.0,1451.0,...,0.0,1.0,0.0,0.0,0.0,,3.0,0.0,0.0,
3,2 13,KE7,2.0,13.0,1.0,2.0,588318.0,11.0,2020.0,1451.0,...,,,,,,,,,,
4,2 16,KE7,2.0,16.0,3.0,2.0,588318.0,11.0,2020.0,1451.0,...,,,0.0,,,,3.0,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7947,9186 155,KE7,9186.0,155.0,1.0,9186.0,454518.0,11.0,2020.0,1451.0,...,0.0,0.0,0.0,0.0,0.0,1.0,4.0,0.0,0.0,
7948,9186 161,KE7,9186.0,161.0,1.0,9186.0,454518.0,11.0,2020.0,1451.0,...,,,0.0,,,,4.0,,,
7949,9186 166,KE7,9186.0,166.0,1.0,9186.0,454518.0,11.0,2020.0,1451.0,...,,,0.0,,,,4.0,,,
7950,9186 170,KE7,9186.0,170.0,1.0,9186.0,454518.0,11.0,2020.0,1451.0,...,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,0.0,
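
To make the collapsing step easier to follow, here is a small sketch of the same row-wise maximum on a toy frame. The values are invented; only the column naming pattern is taken from the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "HML32$01": [0.0, 1.0, np.nan],
    "HML32$02": [1.0, np.nan, np.nan],
    "HV001": [2.0, 2.0, 2.0],
})

# Columns that belong to the repeated HML32 group
member_cols = [c for c in toy.columns if c.startswith("HML32$")]

# max(axis=1) skips NaN by default, so a row stays NaN only when every slot is NaN
toy["collapsed_HML32$"] = toy[member_cols].max(axis=1)
toy = toy.drop(columns=member_cols)
```

This is why many collapsed columns in the output above are still `NaN`: those households have no value in any of the repeated slots. Whether the maximum is the right summary depends on how each variable is coded in the recode manual.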


In [30]:
# Checking percentage of missing values
def missing_values_summary(df):
    """
    Generate a summary of missing values for each column in a DataFrame.

    Parameters:
    - df: pandas DataFrame

    Returns:
    - DataFrame containing columns with NaN values and their percentages
    """

    # Counting missing values per column and converting to a percentage of rows
    nan_info = df.isna().sum()
    nan_percentage = (nan_info / len(df)) * 100

    # Creating a DataFrame with columns and their NaN percentages
    nan_df = pd.DataFrame({'Column': nan_info.index, 'NaN Count': nan_info.values, 'NaN Percentage': nan_percentage.values})

    # Filtering columns with NaN values
    columns_with_nan = nan_df[nan_df['NaN Count'] > 0]

    return columns_with_nan


# Calling the function on collapsed_df_1
result = missing_values_summary(collapsed_df_1)
result.tail(50)

Unnamed: 0,Column,NaN Count,NaN Percentage
1631,HML36$17,7947,99.937123
1632,HML36$18,7949,99.962274
1633,HML36$19,7950,99.974849
1634,HML36$20,7951,99.987425
1635,HML36$21,7952,100.0
1636,HML36$22,7952,100.0
1637,HML36$23,7952,100.0
1638,HML36$24,7951,99.987425
1639,SML16A$01,7952,100.0
1640,SML16A$02,7124,89.587525
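
A more compact way to get the same percentages is `isna().mean()`, which returns the fraction of missing rows per column. The sketch below uses a toy frame with invented values and sorts the result so the emptiest columns come first.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],
    "b": [1.0, 2.0, 3.0, 4.0],
    "c": [np.nan, 2.0, 3.0, 4.0],
})

# Fraction of missing rows per column, expressed as a percentage
nan_pct = toy.isna().mean().mul(100)

# Keep only columns with missing values, emptiest first
summary = (
    nan_pct[nan_pct > 0]
    .sort_values(ascending=False)
    .rename("NaN Percentage")
    .reset_index()
    .rename(columns={"index": "Column"})
)
```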


In [8]:
# collapsed_df_1['collapsed_HML32$'].isna().sum()

## 2.2) Data Cleaning

In [9]:
# # Replacing empty(missing values) with NaN
# df.replace(' ',np.nan, inplace=True)
# df.replace("",np.nan, inplace=True)
# df_2.replace(' ',np.nan, inplace=True)
# df_2.replace("",np.nan, inplace=True)

In [10]:
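# # dropna(thresh=...) counts NON-missing values, so dropping columns that are
# # more than 80% empty means requiring at least 20% non-missing values
# threshold = int(np.ceil(0.2 * len(df)))

# # Dropping columns with more than 80% empty values
# df_cleaned = df.dropna(axis=1, thresh=threshold)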

In [11]:
# # Same rule for df_2: keep columns with at least 20% non-missing values
# threshold_1 = int(np.ceil(0.2 * len(df_2)))

# # Dropping columns with more than 80% empty values
# df_2_cleaned = df_2.dropna(axis=1, thresh=threshold_1)
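
`DataFrame.dropna(axis=1, thresh=n)` keeps a column only if it has at least `n` non-missing values, so the multiplier used to compute `thresh` is the required share of present values, not the share of empty ones. A small sketch on invented data shows the difference between the two readings:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "mostly_empty": [1.0] + [np.nan] * 9,    # 90% missing
    "half_empty": [1.0] * 5 + [np.nan] * 5,  # 50% missing
    "complete": [1.0] * 10,                  # 0% missing
})

# Requires 8 of 10 values present: drops every column with more than 20% missing
keep_if_80pct_present = toy.dropna(axis=1, thresh=int(0.8 * len(toy)))

# Requires 2 of 10 values present: drops only columns with more than 80% missing
keep_if_20pct_present = toy.dropna(axis=1, thresh=int(0.2 * len(toy)))
```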

In [12]:
# df_cleaned

In [13]:
# df_2_cleaned

In [14]:
# # Index.equals returns a single True/False and does not fail when the lengths differ
# df_cleaned.columns.equals(df_2_cleaned.columns)

In [15]:
# column_list = ['SH119', 'SH130$1', 'SH130$2', 'SH130$3', 'SH130$4', 'SH130$5', 'SH130$6', 'SH130$7', 'HML32$01', 'HML32$02', 'HML32$03', 
#               'HML32$04', 'HML32$05', 'HML32$06', 'HML32$07', 'HML32$08', 'HML32$09', 'HML32$10', 'HML32$11', 'HML32$12', 'HML32$13', 
#               'HML32$14', 'HML32$15', 'HML32$16', 'HML32$17', 'HML32$18', 'HML32$19', 'HML32$20', 'HML32$21', 'HML32$22', 'HML32$23', 
#               'HML32$24', 'HML35$01', 'HML35$02', 'HML35$03', 'HML35$04', 'HML35$05', 'HML35$06', 'HML35$07', 'HML35$08', 'HML35$09', 
#               'HML35$10', 'HML35$11', 'HML35$12', 'HML35$13', 'HML35$14', 'HML35$15', 'HML35$16', 'HML35$17', 'HML35$18', 'HML35$19', 
#               'HML35$20', 'HML35$21', 'HML35$22', 'HML35$23', 'HML35$24', 'HML33$01', 'HML33$02', 'HML33$03', 'HML33$04', 'HML33$05', 
#               'HML33$06', 'HML33$07', 'HML33$08', 'HML33$09', 'HML33$10', 'HML33$11', 'HML33$12', 'HML33$13', 'HML33$14', 'HML33$15', 
#               'HML33$16', 'HML33$17', 'HML33$18', 'HML33$19', 'HML33$20', 'HML33$21', 'HML33$22', 'HML33$23', 'HML33$24', 'HC57$01', 
#               'HC57$02', 'HC57$03', 'HC57$04', 'HC57$05', 'HC57$06', 'HC57$07', 'HC57$08', 'HC57$09', 'HC57$10', 'HC57$11', 'HC57$12', 
#               'HC57$13', 'HC57$14', 'HC57$15', 'HC57$16', 'HC57$17', 'HC57$18', 'HC57$19', 'HC57$20', 'SB115A$01', 'SB115A$02', 
#               'SB115A$03', 'SB115A$04', 'SB115A$05', 'SB115A$06', 'SB115A$07', 'SB115A$08', 'SB115A$09', 'SB115A$10', 'SB115A$11', 
#               'SB115A$12', 'SB115A$13', 'SB115A$14', 'SB115A$15', 'SB115A$16', 'SB115A$17', 'SB115A$18', 'SB115A$19', 'SB115A$20', 
#               'SB115B$01', 'SB115B$02', 'SB115B$03', 'SB115B$04', 'SB115B$05', 'SB115B$06', 'SB115B$07', 'SB115B$08', 'SB115B$09', 
#               'SB115B$10', 'SB115B$11', 'SB115B$12', 'SB115B$13', 'SB115B$14', 'SB115B$15', 'SB115B$16', 'SB115B$17', 'SB115B$18', 
#               'SB115B$19', 'SB115B$20', 'SB115C$01', 'SB115C$02', 'SB115C$03', 'SB115C$04', 'SB115C$05', 'SB115C$06', 'SB115C$07', 
#               'SB115C$08', 'SB115C$09', 'SB115C$10', 'SB115C$11', 'SB115C$12', 'SB115C$13', 'SB115C$14', 'SB115C$15', 'SB115C$16', 
#               'SB115C$17', 'SB115C$18', 'SB115C$19', 'SB115C$20', 'SB115D$01', 'SB115D$02', 'SB115D$03', 'SB115D$04', 'SB115D$05', 
#               'SB115D$06', 'SB115D$07', 'SB115D$08', 'SB115D$09', 'SB115D$10', 'SB115D$11', 'SB115D$12', 'SB115D$13', 'SB115D$14', 
#               'SB115D$15', 'SB115D$16', 'SB115D$17', 'SB115D$18', 'SB115D$19', 'SB115D$20', 'SB115E$01', 'SB115E$02', 'SB115E$03', 
#               'SB115E$04', 'SB115E$05', 'SB115E$06', 'SB115E$07', 'SB115E$08', 'SB115E$09', 'SB115E$10', 'SB115E$11', 'SB115E$12', 
#               'SB115E$13', 'SB115E$14', 'SB115E$15', 'SB115E$16', 'SB115E$17', 'SB115E$18', 'SB115E$19', 'SB115E$20', 'SB115F$01', 
#               'SB115F$02', 'SB115F$03', 'SB115F$04', 'SB115F$05', 'SB115F$06','SB115F$07', 'SB115F$08', 'SB115F$09', 'SB115F$10', 
#               'SB115F$11', 'SB115F$12', 'SB115F$13', 'SB115F$14', 'SB115F$15', 'SB115F$16', 'SB115F$17', 'SB115F$18', 'SB115F$19', 
#               'SB115F$20', 'SB115G$01', 'SB115G$02', 'SB115G$03', 'SB115G$04', 'SB115G$05', 'SB115G$06', 'SB115G$07', 'SB115G$08', 
#               'SB115G$09', 'SB115G$10', 'SB115G$11', 'SB115G$12', 'SB115G$13', 'SB115G$14', 'SB115G$15', 'SB115G$16', 'SB115G$17', 
#               'SB115G$18', 'SB115G$19', 'SB115G$20', 'SB115H$01', 'SB115H$02', 'SB115H$03', 'SB115H$04', 'SB115H$05', 'SB115H$06', 
#               'SB115H$07', 'SB115H$08', 'SB115H$09', 'SB115H$10', 'SB115H$11', 'SB115H$12', 'SB115H$13', 'SB115H$14', 'SB115H$15', 
#               'SB115H$16', 'SB115H$17', 'SB115H$18', 'SB115H$19', 'SB115H$20', 'SB119$01', 'SB119$02', 'SB119$03', 'SB119$04', 'SB119$05', 
#               'SB119$06', 'SB119$07', 'SB119$08', 'SB119$09', 'SB119$10', 'SB119$11', 'SB119$12', 'SB119$13', 'SB119$14', 'SB119$15', 
#               'SB119$16', 'SB119$17', 'SB119$18', 'SB119$19', 'SB119$20', 'SB122$01', 'SB122$02', 'SB122$03', 'SB122$04', 'SB122$05', 
#               'SB122$06', 'SB122$07', 'SB122$08', 'SB122$09', 'SB122$10', 'SB122$11', 'SB122$12', 'SB122$13', 'SB122$14', 'SB122$15', 
#               'SB122$16', 'SB122$17', 'SB122$18', 'SB122$19', 'SB122$20', 'HML5$1', 'HML5$2', 'HML5$3', 'HML5$4', 'HML5$5', 'HML5$6', 
#               'HML5$7', 'HML8$1']

# new_df_2 = df_2[column_list].copy()

# new_df_2

In [16]:
# new_df_2.isna().sum()

In [17]:
# new_df_2['HML35$01']

In [18]:
# import pandas as pd

# # Assuming 'df' is your DataFrame containing health test data
# # Replace 'your_prefix' with the common prefix of the columns you want to collapse
# prefix = 'your_prefix'

# # Get a list of columns that have the specified prefix
# columns_to_collapse = [col for col in df.columns if col.startswith(prefix)]

# # Create a new column by concatenating values from columns with the specified prefix
# df['collapsed_column'] = df[columns_to_collapse].astype(str).agg(','.join, axis=1)

# # Drop the original columns if needed
# df.drop(columns=columns_to_collapse, inplace=True)

# # Display the resulting DataFrame
# df

In [19]:
# import pandas as pd

# def collapse_columns(df, prefix, suffixes, combined_column):
#     # Extract columns with the specified prefix and suffixes
#     relevant_columns = [col for col in df.columns if col.startswith(prefix) and any(col.endswith(suffix) for suffix in suffixes)]

#     # Create a new column 'combined_column' with the highest value for each row
#     df[combined_column] = df[relevant_columns].max(axis=1)

#     # Replace NaN values in the 'combined_column' with 0
#     df[combined_column].fillna(0, inplace=True)

#     # Drop the original columns
#     df.drop(relevant_columns, axis=1, inplace=True)

#     return df

# # Example usage:
# # Assuming your DataFrame is named 'your_dataframe'
# your_dataframe = pd.DataFrame({
#     'HML32$01': [0, 1, 0],
#     'HML32$02': [1, 0, 1],
#     'OtherColumn': ['A', 'B', 'C']
# })

# collapsed_df = collapse_columns(your_dataframe, 'HML32$', ['01', '02'], 'HML32_combined')

# collapsed_df

In [20]:
# # Checking percentage of missing values
# def missing_values_summary(df):
#     """
#     Generate a summary of missing values for each column in a DataFrame.

#     Parameters:
#     - df: pandas DataFrame

#     Returns:
#     - DataFrame containing columns with NaN values and their percentages
#     """

#     # Checking percentage of missing values
#     nan_info = df.isna().sum()
#     nan_percentage = (nan_info / len(df)) * 100

#     # Creating a DataFrame with columns and their NaN percentages
#     nan_df = pd.DataFrame({'Column': nan_info.index, 'NaN Count': nan_info.values, 'NaN Percentage': nan_percentage.values})

#     # Filtering columns with NaN values
#     columns_with_nan = nan_df[nan_df['NaN Count'] > 0]

#     return columns_with_nan


# # Calling the function on df_cleaned
# result = missing_values_summary(df_cleaned)
# result


In [21]:
# # Checking percentage of missing values on df_2_cleaned
# result_2 = missing_values_summary(df_2_cleaned)
# result_2

In [22]:
# # Saving column names into an Excel file

# # Getting the column names
# column_names = df_cleaned.columns

# # Creating a DataFrame with a single column containing the column names
# column_names_df = pd.DataFrame(column_names, columns=["Column Names"])

# # Specifying the Excel file path
# excel_file_path = 'column_names.xlsx'

# # Writing the DataFrame to the Excel file
# column_names_df.to_excel(excel_file_path, index=False)

In [23]:
# # Converting column names to labels dictionary to a DataFrame
# labels_df = pd.DataFrame(list(meta.column_names_to_labels.items()), columns=['Column Name', 'Label'])

# # Saving the DataFrame to an Excel file
# excel_file_path = 'column_names_to_labels.xlsx'
# labels_df.to_excel(excel_file_path, index=False)

In [24]:
# # Converting Coded column names into readable column names

# # Loading the Excel file with the column names into a Pandas DataFrame
# excel_file_path = 'column_names_dictionary.xlsx'
# df_excel = pd.read_excel(excel_file_path, sheet_name='Sheet1')

# # Displaying the original DataFrame with the column headers
# print("Original Excel DataFrame:")
# print(df_excel)

# # Replacing the column headers using a for loop
# for old_header, new_header in zip(df_cleaned.columns, df_excel['Label Names']):
#     df_cleaned.rename(columns={old_header: new_header}, inplace=True)

# # Displaying the DataFrame with the updated column headers
# df_cleaned
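
The loop above pairs old and new headers by position, so it depends on the spreadsheet rows being in exactly the same order as the DataFrame columns. A hedged alternative is to rename by name with a dictionary. In the sketch below the `labels` dictionary is a hypothetical stand-in for a mapping such as `meta.column_names_to_labels`, and the label text is illustrative.

```python
import pandas as pd

toy = pd.DataFrame({"HV001": [1.0], "HV002": [4.0], "HV009": [3.0]})

# Hypothetical code-to-label mapping (stand-in for meta.column_names_to_labels)
labels = {"HV001": "Cluster number", "HV002": "Household number"}

# rename() matches by column name; columns missing from the dict keep their code
readable = toy.rename(columns=labels)
```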

In [25]:
# # Imputing the DataFrame 

# # Initializing  SimpleImputer
# imputer = SimpleImputer(strategy='mean')

# # Identifying numeric and non-numeric columns
# numeric_cols = df_cleaned.select_dtypes(include=['float64', 'int64']).columns
# non_numeric_cols = df_cleaned.select_dtypes(exclude=['float64', 'int64']).columns

# # Imputing numeric columns
# numeric_imputer = SimpleImputer(strategy='mean')
# df_cleaned[numeric_cols] = numeric_imputer.fit_transform(df_cleaned[numeric_cols])

# # Imputing non-numeric columns
# non_numeric_imputer = SimpleImputer(strategy='most_frequent')
# df_cleaned[non_numeric_cols] = non_numeric_imputer.fit_transform(df_cleaned[non_numeric_cols])
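
For readers who want to see what the two strategies do, the sketch below reproduces mean and most-frequent imputation with pandas alone on a toy frame. It is not the notebook's `SimpleImputer` code, and the column names and values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "household_size": [2.0, 4.0, np.nan, 6.0],
    "region": ["Nairobi", np.nan, "Nairobi", "Kisumu"],
})

numeric_cols = toy.select_dtypes(include="number").columns
other_cols = toy.columns.difference(numeric_cols)

filled = toy.copy()

# Mean imputation: each numeric column is filled with its own mean
filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())

# Most-frequent imputation: each remaining column is filled with its mode
for col in other_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])
```

If the imputed data later feeds a model, computing the fill values on the training split only avoids leaking information from the validation data.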


In [26]:
# # Checking percentage of missing values

# result = missing_values_summary(df_cleaned)
# result

## 2.3) EDA

In [27]:
# # Visualizing distributions of numerical features
# for col in numeric_cols:
#     plt.figure(figsize=(8, 6))
#     sns.histplot(df_cleaned[col].dropna(), kde=True)
#     plt.title(f'Distribution of {col}')
#     plt.show()


In [28]:
# # Visualize relationships between variables
# sns.pairplot(df_cleaned)
# plt.show()

In [29]:
# # Correlation heatmap
# correlation_matrix = df_cleaned.corr(numeric_only=True)
# plt.figure(figsize=(10, 8))
# sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
# plt.title('Correlation Heatmap')
# plt.show()
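
With many columns, an annotated heatmap can become hard to read. One option is to list the most strongly correlated pairs instead. This is a minimal sketch on invented data, assuming a pandas version that supports `numeric_only` in `DataFrame.corr`:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, 4.0, 6.0, 8.0],
    "z": [4.0, 1.0, 3.0, 2.0],
    "label": ["a", "b", "c", "d"],
})

# Non-numeric columns are excluded from the correlation matrix
corr = toy.corr(numeric_only=True)

# Keep the upper triangle only so each pair appears once and the diagonal is dropped
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Long format: one row per pair, sorted by absolute correlation
pairs = corr.where(mask).stack().dropna().sort_values(key=abs, ascending=False)
```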

## 2.4) Building Model

* Model Interpretability and Explainability:

Enhance model interpretability to provide actionable insights for decision-makers by employing techniques such as SHAP values or feature importance analysis.
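
As one possible form of feature importance analysis, the sketch below uses scikit-learn's permutation importance on synthetic data. The random forest, the synthetic features, and the feature names are assumptions made for illustration; the notebook has not yet chosen a model, and SHAP values would need a separate library.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature carries signal
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] > 0).astype(int)
feature_names = ["signal", "noise_a", "noise_b"]  # hypothetical names

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0
)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle each feature on held-out data and measure the drop in score
result = permutation_importance(model, X_val, y_val, n_repeats=5, random_state=0)

# Feature names ordered from most to least important
ranking = [feature_names[i] for i in np.argsort(result.importances_mean)[::-1]]
```

Measuring importance on held-out data rather than training data gives a picture of what the model relies on when it generalizes.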


## 2.5) Conclusion

## 2.6) Recommendation

## 2.7) Model Deployment