<h1 align="center">
    Run this notebook in Kaggle notebooks<br />
    (Language - Python)
</h1>

# [Link to Kaggle Notebook](https://www.kaggle.com/code/adamprzychodni/2v0-2-eda-kaggle-python)

# Install dependencies

In [1]:
%%capture
# %%capture is a cell magic, so it must be the first line of the cell
# ydata-profiling installation for valuable automatic statistics on data distribution, central tendencies, and categorical variable frequencies
import sys
!{sys.executable} -m pip install -U ydata-profiling[notebook]
!jupyter nbextension enable --py widgetsnbextension


# Import libraries

In [None]:
import logging  # used by create_profile() in the data profiling section
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport
from typing import Union, List
from sklearn.impute import KNNImputer

# Load data

Expand the sidebar, click Add Data, and add the dataset "ML Prediction of Poverty and Malnutrition Dataset".

In [None]:
df = pd.read_csv("/kaggle/input/ml-prediction-of-poverty-and-malnutrition-dataset/data.csv")

In [None]:
df

<h2 align="center">
    &#x1F4A1; <strong>Conclusion_1</strong> - The 'Unnamed: 0' column is just a row index, so we don't need it. It is deleted here in the EDA and excluded in the pipeline via parameters_data_processing.yml - <span style="color:green;">Done &#x2705;</span>
</h2>


In [None]:
# Conclusion-1 
# delete 'Unnamed: 0' column
df = df.drop('Unnamed: 0', axis=1)

In [None]:
df
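A column named `Unnamed: 0` typically appears when a DataFrame was saved together with its index and then read back without declaring that index. As a minimal sketch (the inline CSV text below is made-up illustration data, not the Kaggle file), the column could also be absorbed at load time with `index_col=0` instead of being dropped afterwards:

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file that was saved with its index
csv_text = ",country,year\n0,Bangladesh,2004\n1,Nepal,2006\n"

# Default read: the unnamed first column shows up as 'Unnamed: 0'
raw = pd.read_csv(io.StringIO(csv_text))

# Declaring the first column as the index avoids the extra column
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.columns.tolist())
print(clean.columns.tolist())
```

Either approach gives the same feature columns; dropping the column explicitly, as done above, keeps the step visible in the notebook.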

# EDA

## Data description

### Target Variables

- **Stunted (Child Stunting):** Measures children under five years of age whose height-for-age z-score is below -2.0, that is, more than 2 standard deviations below the median of the WHO Child Growth Standards. Stunting is a critical indicator of chronic malnutrition and reflects long-term nutritional status.

- **Wasted (Child Wasting):** Refers to children under five whose weight-for-height z-score is below -2.0, that is, more than 2 standard deviations below the median of the WHO Child Growth Standards. This measure is essential for identifying acute malnutrition, reflecting recent and severe nutritional deficits.

- **Healthy (Healthy Weight):** Identifies children under five whose weight-for-height falls within the normal range, typically within [-2.0, 2.0] standard deviations of the median, as per WHO Child Growth Standards. This category aims to pinpoint children who maintain an adequate nutritional status, free from stunting or wasting.

- **Poorest (Asset Poverty):** Indicates households in the poorest quintile based on an asset-based comparative wealth index. This socioeconomic marker is crucial in understanding how material deprivation correlates with other forms of poverty.

- **Underweight_bmi (Underweight Women):** Represents women aged 15 to 49 with a body mass index (BMI) below 18.5. This health indicator is vital for assessing maternal and child health, as undernutrition in women can have significant health impacts on both mothers and their children.

In [None]:
# Let's see them :)
df_with_target_variables = df[['stunted', 'wasted', 'healthy', 'poorest', 'underweight_bmi']]

In [None]:
df_with_target_variables
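The z-score thresholds in the target definitions above can be made concrete with a small sketch. The dataset already ships the target columns, so this is only an illustration of the cut-offs: the child-level frame and the column names `haz` and `whz` are hypothetical and do not exist in the Kaggle data.

```python
import pandas as pd

# Hypothetical child-level z-scores; 'haz' and 'whz' are illustrative names only
children = pd.DataFrame({
    "haz": [-2.5, -1.0, 0.3],   # height-for-age z-score
    "whz": [-0.5, -2.2, 1.9],   # weight-for-height z-score
})

# Stunted / wasted: z-score strictly below -2.0
children["stunted"] = (children["haz"] < -2.0).astype(int)
children["wasted"] = (children["whz"] < -2.0).astype(int)

# Healthy weight: weight-for-height within [-2.0, 2.0]; between() is inclusive by default
children["healthy"] = children["whz"].between(-2.0, 2.0).astype(int)

print(children)
```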

### Features

- **URBAN_RURA:** Denotes whether a location is urban or rural, a key factor in understanding geographical and socioeconomic disparities.

- **alt (Altitude):** Represents the altitude of a location, influencing climate and environmental conditions.

- **chrps (Climate Hazards Group InfraRed Precipitation with Station data):** Provides rainfall estimates vital for agricultural and climate studies.

- **country:** Specifies the country of the data point, important for regional analysis.

- **deathcount:** Counts the number of deaths in an area or period, significant for public health and safety studies.

- **latnum (Latitude) and longnum (Longitude):** Geographic coordinates, essential for location-specific analysis.

- **lst (Land Surface Temperature):** Key environmental and climate variable.

- **markets0 to markets47:** Series of variables related to market food prices, critical for understanding economic conditions and food security.

- **numevents (Number of Significant Events):** Reflects the frequency of violent or significant events, indicating conflict and political instability's impact on food security and poverty.

- **pasture:** Indicates pasture coverage, relevant in agricultural land use and environmental studies.

- **sif (Solar-Induced Chlorophyll Fluorescence):** A measure of plant photosynthetic activity, crucial in agricultural and ecological research.

- **slope:** Measures land steepness or gradient, relevant in geographical and environmental studies.

- **tree:** Indicates tree coverage or density, significant in environmental, ecological, and climate studies.

- **tt00_500k (Travel Time to Urban Centers):** Represents accessibility to urban centers, vital for understanding remoteness and socio-economic impacts.

- **year:** Year of data collection, crucial for temporal analysis and trend identification.

## Data profiling

In [None]:
def data_profiling(df: pd.DataFrame, name: str = "data_profiling_report",
                   interface: str = "html", num_columns: int = None, chunk_size: int = 30) -> None:
    """
    This function generates a data profiling report using the ydata-profiling package.

    Args:
        df (pd.DataFrame): The DataFrame to profile.
        name (str, optional): The title of the profile report. Defaults to "data_profiling_report".
        interface (str, optional): The format of the report. Defaults to "html".
                                    Choose between 'html' or 'widget'.
        num_columns (int, optional): Number of columns to include in the report. If None, profiles in chunks of 30 columns.
        chunk_size (int, optional): Size of each chunk for profiling when num_columns is None. Defaults to 30.

    Raises:
        ValueError: If df is not a pandas DataFrame or name is not a string or
                    if interface is not 'html' or 'widget'.
    """

    # Check if df is a pandas DataFrame
    if not isinstance(df, pd.DataFrame):
        raise ValueError("df should be a pandas DataFrame")

    # Check if name is a string
    if not isinstance(name, str):
        raise ValueError("name should be a string")

    # Check if interface is a string and a valid option
    if not isinstance(interface, str) or interface not in ['html', 'widget']:
        raise ValueError("interface should be a string, either 'html' or 'widget'")

    # If num_columns is specified, profile only that many columns
    if num_columns is not None:
        if not isinstance(num_columns, int) or num_columns <= 0:
            raise ValueError("num_columns should be a positive integer")
        columns_to_profile = df.columns[:num_columns]
        df = df[columns_to_profile]
        create_profile(df, name, interface)
    else:
        # If num_columns is not specified, profile in chunks
        for start_col in range(0, len(df.columns), chunk_size):
            end_col = min(start_col + chunk_size, len(df.columns))
            chunk_columns = df.columns[start_col:end_col]
            chunk_df = df[chunk_columns]
            chunk_name = f"{name}_{start_col + 1}-{end_col}"
            create_profile(chunk_df, chunk_name, interface)

def create_profile(df, report_name, interface):
    """ Helper function to create and save the profile report. """
    profile = ProfileReport(df, title=report_name, explorative=True)
    if interface == "html":
        profile.to_file(f"{report_name}.html")
        logging.info(f"Report {report_name} generated in html format, check files.")
    elif interface == "widget":
        logging.info(f"Report {report_name} will be generated as a widget, it might take a while.")
        profile.to_widgets()

# Example usage
# data_profiling(df)
# data_profiling(df, num_columns=5)

In [None]:
# data_profiling(df)
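When `num_columns` is not given, `data_profiling` splits the columns into chunks and names each report after the 1-based column range it covers. The sketch below reproduces only that slicing and naming logic, without calling ydata-profiling; the helper `chunk_ranges` and the column count of 65 are made up for illustration.

```python
def chunk_ranges(n_columns, chunk_size=30):
    """Return (start, end) column slices, mirroring the loop in data_profiling."""
    return [(start, min(start + chunk_size, n_columns))
            for start in range(0, n_columns, chunk_size)]

# Hypothetical column count, chosen only to show a partial last chunk
ranges = chunk_ranges(65)

# Report names use 1-based, inclusive column positions
names = [f"data_profiling_report_{start + 1}-{end}" for start, end in ranges]

print(ranges)
print(names)
```

With the default `chunk_size` of 30, a 65-column frame would therefore produce three reports, the last one covering only five columns.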

## Cleaning data

### Data types

In [None]:
def categorize_column_types(df, unique_threshold=10):
    """
    Counts and categorizes the columns in a Pandas DataFrame into numerical, categorical,
    and potentially categorical (numeric but with low unique value count) types.

    Parameters:
    df (pd.DataFrame): The DataFrame to analyze.
    unique_threshold (int): The maximum number of unique values for a numeric column
                            to be considered potentially categorical.

    Returns:
    dict: A dictionary with counts of numerical, categorical, and potentially categorical columns.
    """
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")

    categorical = df.select_dtypes(include=['object', 'category']).shape[1]
    numerical = df.select_dtypes(include=['int64', 'float64']).shape[1]

    # Check for numeric columns with low unique value counts
    potentially_categorical = 0
    for col in df.select_dtypes(include=['int64', 'float64']):
        if df[col].nunique() <= unique_threshold:
            potentially_categorical += 1

    return {
        "categorical": categorical,
        "numerical": numerical,
        "potentially_categorical": potentially_categorical
    }

# Example usage
# categorize_column_types(df)

def list_columns_by_type(df, column_type, unique_threshold=10):
    """
    Lists the column names in a DataFrame based on their categorization as
    'categorical', 'numerical', or 'potentially_categorical'.

    Parameters:
    df (pd.DataFrame): The DataFrame to analyze.
    column_type (str): Type of columns to list ('categorical', 'numerical', 'potentially_categorical').
    unique_threshold (int): The maximum number of unique values for a numeric column
                            to be considered potentially categorical.

    Returns:
    list: A list of column names that fall into the specified category.
    """
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input must be a pandas DataFrame")

    if column_type not in ['categorical', 'numerical', 'potentially_categorical']:
        raise ValueError("column_type must be 'categorical', 'numerical', or 'potentially_categorical'")

    if column_type == 'categorical':
        return df.select_dtypes(include=['object', 'category']).columns.tolist()

    if column_type == 'numerical':
        return df.select_dtypes(include=['int64', 'float64']).columns.tolist()

    if column_type == 'potentially_categorical':
        return [col for col in df.select_dtypes(include=['int64', 'float64']).columns
                if df[col].nunique() <= unique_threshold]

# Example usage
# list_columns_by_type(df, 'potentially_categorical')

In [None]:
categorize_column_types(df)

In [None]:
list_columns_by_type(df, 'potentially_categorical')

The 'URBAN_RURA' column is actually categorical, so we will convert it. The remaining columns are flagged as potentially categorical only because they are almost empty and therefore contain very few unique values; we will get rid of them later ;)
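Why almost empty numeric columns end up flagged can be seen on a toy frame: `nunique()` ignores missing values by default, so a column that is mostly NaN has a very low unique count. This is a sketch on made-up data, using the same threshold of 10 as the default `unique_threshold`.

```python
import numpy as np
import pandas as pd

# Toy data: 'dense' has many distinct values, 'sparse' is mostly NaN
toy = pd.DataFrame({
    "dense": np.arange(100, dtype="float64"),
    "sparse": [1.5, 2.5] + [np.nan] * 98,
})

# nunique() skips NaN, so the nearly empty column looks low-cardinality
unique_counts = toy.nunique()
flagged = [col for col in toy.columns if unique_counts[col] <= 10]

print(unique_counts.to_dict())
print(flagged)
```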

<h2 align="center">
    &#x1F4A1; <strong>Conclusion_2</strong> - Change the data type of the 'URBAN_RURA' column to categorical, and do the same in preprocessing :)
</h2>

In [None]:
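# Convert 'URBAN_RURA' to a categorical column.
# Plain column assignment is used: df.loc[:, col] = ... may write in place and keep the old dtype in recent pandas.
df['URBAN_RURA'] = df['URBAN_RURA'].astype('category')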
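A quick way to confirm that the conversion took effect is to inspect the dtype afterwards. The following is a minimal sketch on made-up values (the labels 'U' and 'R' are assumptions, not taken from the dataset), using plain column assignment:

```python
import pandas as pd

# Made-up values; the real labels in 'URBAN_RURA' may differ
toy = pd.DataFrame({
    "URBAN_RURA": ["U", "R", "R", "U"],
    "alt": [10.0, 250.0, 300.0, 15.0],
})

# Plain column assignment replaces the column, so the new dtype is kept
toy["URBAN_RURA"] = toy["URBAN_RURA"].astype("category")

print(toy.dtypes)
print(toy["URBAN_RURA"].cat.categories.tolist())
```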


In [None]:
categorize_column_types(df)

In [None]:
list_columns_by_type(df, 'categorical')

Nice, it works :) Our data types are now correct, so let's analyze the missing values. There will be some, or rather a looot ;)

### Missing data

In [None]:
def summarize_missing_values(df, display_all_rows=False):
    """
    Summarize missing values in the DataFrame by printing the columns that contain missing values,
    sorted in descending order of their missing count, together with the percentage of missing values.
    Optionally displays all rows of the summary DataFrame.

    Args:
        df (pd.DataFrame): The DataFrame to analyze.
        display_all_rows (bool): If True, display all rows of the summary. Default is False.

    Returns:
        None: The summary DataFrame is printed and not returned.
    """
    # Option to display all rows
    if display_all_rows:
        pd.set_option('display.max_rows', None)

    # Count missing values for each column
    missing_values = df.isnull().sum()

    # Calculate the percentage of missing values
    missing_percentage = (missing_values / len(df)) * 100

    # Filter out columns with no missing values and sort in descending order
    missing_summary = pd.DataFrame({'Missing Values': missing_values[missing_values > 0],
                                    'Percentage': missing_percentage[missing_values > 0]})
    missing_summary = missing_summary.sort_values(by='Missing Values', ascending=False)

    # Display the summary DataFrame
    print(missing_summary)

    # Reset display option to default
    if display_all_rows:
        pd.reset_option('display.max_rows')

# Example usage
# summarize_missing_values(df)
# summarize_missing_values(df, display_all_rows=True)

In [None]:
summarize_missing_values(df, display_all_rows=True)

<h2 align="center">
    &#x1F4A1; <strong>Conclusion_3</strong> - For the second experiment we drop columns that have more than 64% missing data (the first version is in 2v0.2_EDA_kaggle_python). The threshold is defined in the parameters in the preprocessing.yaml file - <span style="color:green;">Done &#x2705;</span>
</h2>


In the previous experiment we were left with just 2797 rows, so I will also perform the experiment using the features that have less than 64% missing data.

In [None]:
# Features with less than 64% missing data
features = [
    'year',
    'URBAN_RURA',
    'country',
    'alt',
    'chrps',
    'deathcount',
    'latnum',
    'longnum',
    'lst',
    'marketm0',
    'marketm1',
    'marketm2',
    'marketm3',
    'marketm4',
    'marketm5',
    'marketm6',  
    'marketm7',
    'marketm8',
    'marketm9',
    'markets0',
    'markets1',
    'markets2',
    'markets3',
    'markets4',
    'markets5',
    'markets6',  
    'markets7',
    'markets8',
    'markets9',
    'numevents',
    'pasture',
    'sif',
    'slope',
    'tree',
    'tt00_500k'
]


In [None]:
# Select only the columns specified in 'features' from the DataFrame
df = df[features]
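The feature list above is written out by hand. As a hedged alternative, the same kind of selection could be derived from the share of missing values per column. The sketch below uses a made-up toy frame and the 64% threshold from Conclusion_3; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with 0%, 50% and 75% missing values per column
toy = pd.DataFrame({
    "full": [1.0, 2.0, 3.0, 4.0],
    "half": [1.0, np.nan, 3.0, np.nan],
    "mostly_empty": [np.nan, np.nan, np.nan, 4.0],
})

max_missing_share = 0.64  # threshold taken from Conclusion_3

# isnull().mean() gives the fraction of missing values in each column
missing_share = toy.isnull().mean()
kept_columns = missing_share[missing_share <= max_missing_share].index.tolist()

print(missing_share.to_dict())
print(kept_columns)
```

A hard-coded list has the advantage of being stable when the data changes, which may be why it was preferred here.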

In [None]:
df

In [None]:
summarize_missing_values(df, display_all_rows=True)

In [None]:
def explore_missing_values(df):
    """
    Visualizes the pattern of missing values in a pandas DataFrame using a heatmap.

    The function creates a heatmap where missing values are marked, helping to identify
    the pattern of missingness across different columns.

    Parameters:
    - df (pandas.DataFrame): The DataFrame to analyze for missing values.

    Returns:
    - None: The function outputs a visualization and does not return any value.

    Usage:
    - explore_missing_values(df)

    Note: The function requires pandas, matplotlib, and seaborn libraries. Ensure
    these are installed and imported before using the function.
    """

    # Missing data heatmap
    plt.figure(figsize=(12, 8))
    # Using a custom color map: red for missing, green for not missing
    sns.heatmap(df.isnull(), cbar=False, cmap=sns.color_palette(["green", "red"]))
    plt.title("Heatmap of Missing Values")
    plt.show()

# Example usage:
# explore_missing_values(df)


In [None]:
explore_missing_values(df)

A different approach: we will first delete rows that have 20 missing values in a single row (note that the call below passes a threshold of 4).

In [None]:
def drop_rows_with_missing_values(dataframe: pd.DataFrame, threshold: int) -> pd.DataFrame:
    """
    Drop rows from a DataFrame based on a specified threshold of missing values.

    Parameters:
    - dataframe (pd.DataFrame): The DataFrame from which rows are to be removed.
    - threshold (int): The number of missing values at which a row is dropped.
                       For example, if threshold is 3, rows with 3 or more
                       missing values will be dropped.

    Returns:
    - pd.DataFrame: A new DataFrame with rows dropped based on the specified threshold.

    Raises:
    - ValueError: If the threshold is negative or not an integer.
    - TypeError: If the provided dataframe is not a pandas DataFrame.
    """

    if not isinstance(threshold, int) or threshold < 0:
        raise ValueError("Threshold must be a non-negative integer")
    if not isinstance(dataframe, pd.DataFrame):
        raise TypeError("The first argument must be a pandas DataFrame")

    return dataframe.dropna(thresh=(dataframe.shape[1] - threshold + 1), axis=0)

# Example usage:
# df = drop_rows_with_missing_values(df, 2)


In [None]:
df = drop_rows_with_missing_values(df, 4)
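The function passes `thresh = number_of_columns - threshold + 1` to `dropna`, and `thresh` is the minimum number of non-missing values a row needs in order to be kept. A small worked example on made-up data (four columns, threshold of 2) shows which rows survive:

```python
import numpy as np
import pandas as pd

# Toy frame: rows have 0, 1, 3 and 2 missing values respectively
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, 4.0],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [1.0, 2.0, np.nan, np.nan],
    "d": [1.0, 2.0, 3.0, np.nan],
})

threshold = 2
min_non_missing = toy.shape[1] - threshold + 1  # 4 - 2 + 1 = 3

# Keep only rows with at least `min_non_missing` non-missing values
kept = toy.dropna(thresh=min_non_missing, axis=0)

print(toy.isnull().sum(axis=1).tolist())
print(kept.index.tolist())
```

Rows with `threshold` or more missing values are removed, so with a threshold of 4 a row may contain at most three missing values.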

In [None]:
df

In [None]:
explore_missing_values(df)

In [None]:
summarize_missing_values(df, display_all_rows=True)

In [None]:
def visualize_missing_values_by_group(df, group_columns, value_column):
    """
    Visualizes missing values in the DataFrame grouped by specified columns using a heatmap,
    showing both counts and percentages of missing values. If no missing values are found,
    prints a message indicating this.

    Parameters:
    - df (pd.DataFrame): The DataFrame to analyze.
    - group_columns (list of str): Exactly two column names to group the data by; the first becomes the heatmap rows, the second the heatmap columns.
    - value_column (str): The name of the column with missing values to analyze.

    Returns:
    - None: Displays seaborn heatmap visualizations of missing values by the specified groups or a message if no missing values are present.
    """

    # Create a DataFrame that counts missing values for each group
    missing_counts = df.groupby(group_columns)[value_column].apply(lambda x: x.isnull().sum()).reset_index(name='missing_count')
    total_counts = df.groupby(group_columns)[value_column].apply(lambda x: len(x)).reset_index(name='total_count')

    # Merge the missing counts with total counts to calculate percentages
    missing_data = pd.merge(missing_counts, total_counts, on=group_columns)
    missing_data['missing_percentage'] = (missing_data['missing_count'] / missing_data['total_count']) * 100

    if missing_data['missing_count'].sum() == 0:
        # If there are no missing values, print the information and return
        print(f"No missing values found in column '{value_column}'.")
        return

    # Pivot the DataFrame to prepare for heatmap visualization
    missing_counts_pivot = missing_data.pivot(index=group_columns[0], columns=group_columns[1], values='missing_count')
    missing_percentage_pivot = missing_data.pivot(index=group_columns[0], columns=group_columns[1], values='missing_percentage')

    # Plot the heatmap for missing counts
    plt.figure(figsize=(15, 8))
    sns.heatmap(missing_counts_pivot, annot=True, fmt=".0f", cmap='RdYlGn_r', center=0)
    plt.title(f'Missing Values Count in {value_column} by {" and ".join(group_columns)}')
    plt.show()

    # Plot the heatmap for missing percentages
    plt.figure(figsize=(15, 8))
    sns.heatmap(missing_percentage_pivot, annot=True, fmt=".1f", cmap='RdYlGn_r', center=0)
    plt.title(f'Missing Values Percentage in {value_column} by {" and ".join(group_columns)}')
    plt.show()

# Example usage:
# visualize_missing_values_by_group(df, ['country', 'year'], 'marketm0')

In [None]:
visualize_missing_values_by_group(df, ['country', 'year'], 'marketm4')

In [None]:
visualize_missing_values_by_group(df, ['country', 'year'], 'marketm6')
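The table behind these heatmaps can also be inspected directly, without plotting. The following sketch computes the percentage of missing values per country and year on a made-up toy frame; it is a compact variant of the grouping done in the function, not a copy of it.

```python
import numpy as np
import pandas as pd

# Made-up data; country labels and values are illustrative only
toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2004, 2004, 2007, 2004, 2007, 2007],
    "marketm4": [1.0, np.nan, 2.0, np.nan, np.nan, 3.0],
})

# Mean of the boolean missing mask per group = share of missing values
missing_share = (
    toy["marketm4"].isnull()
    .groupby([toy["country"], toy["year"]])
    .mean()
    .mul(100)
    .unstack("year")
)

print(missing_share)
```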

In [None]:
def drop_rows_with_values(df, *args):
    """
    Drops rows based on specified conditions. Conditions can be a single value for a column,
    or multiple values for a column using a tuple with the column name followed by the values.

    Parameters:
    - df (pd.DataFrame): The DataFrame from which to drop rows.
    - args: A sequence of arguments where each argument can be:
            - A tuple of (column_name, value), to drop rows where column_name is value, or
            - A tuple of (column_name, value1, value2, ...), to drop rows where column_name is any of value1, value2, ...

    Returns:
    - pd.DataFrame: A DataFrame with the specified rows dropped.
    """

    # Process each argument provided to the function
    for arg in args:
        if not isinstance(arg, tuple):
            raise TypeError("Each argument must be a tuple containing the column name and value(s) to drop.")

        column_name = arg[0]
        values_to_drop = arg[1:]

        # Ensure the column exists in the DataFrame
        if column_name not in df.columns:
            raise ValueError(f"The column '{column_name}' is not in the DataFrame")

        # If only one value is provided, use equality
        if len(values_to_drop) == 1:
            df = df[df[column_name] != values_to_drop[0]]
        else:
            # If multiple values are provided, use isin to check for membership
            df = df[~df[column_name].isin(values_to_drop)]

    return df

# Example usage:
# To drop all rows with country '8'
# df = drop_rows_with_values(df, ('country', '8'))

# To drop rows with 'Bangladesh' in 'country' as well as rows with year 2004, 2007 or 2001 (each condition is applied independently)
# df = drop_rows_with_values(df, ('country', 'Bangladesh'), ('year', 2004, 2007, 2001))
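Because `drop_rows_with_values` applies each tuple in turn, a row is removed as soon as it matches any one of the conditions. The sketch below expresses the same effect with explicit boolean masks on made-up data:

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame({
    "country": ["Bangladesh", "Nepal", "Nepal", "Bangladesh"],
    "year": [2004, 2004, 2011, 2011],
})

# A row is dropped if it matches ANY of the conditions
mask_country = toy["country"].isin(["Bangladesh"])
mask_year = toy["year"].isin([2004, 2007, 2001])
remaining = toy[~(mask_country | mask_year)]

print(remaining)
```

Only the Nepal row from 2011 remains. Dropping a specific country and year combination would need a combined mask with `&` instead.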

In [None]:
def remove_rows_with_missing_values(df: pd.DataFrame, columns: Union[str, List[str]] = None, verbose: bool = False) -> pd.DataFrame:
    """
    Removes all rows containing missing values either from the whole DataFrame or from specific columns.
    This function adds validation for input types and optional logging for more informative output.

    Parameters:
    - df (pd.DataFrame): The DataFrame from which to remove rows.
    - columns (Union[str, List[str]], optional): Column or list of columns to consider for row removal.
                                                 If None, consider all columns. Default is None.
    - verbose (bool, optional): If True, prints the number of rows removed. Default is False.

    Returns:
    - pd.DataFrame: A DataFrame with rows containing missing values removed.

    Raises:
    - ValueError: If `df` is not a DataFrame or `columns` are not found in `df`.
    """

    # Validation
    if not isinstance(df, pd.DataFrame):
        raise ValueError("Input 'df' must be a pandas DataFrame.")
    
    if columns is not None:
        if isinstance(columns, list):
            missing_cols = [col for col in columns if col not in df.columns]
            if missing_cols:
                raise ValueError(f"Columns not found in DataFrame: {missing_cols}")
        elif isinstance(columns, str):
            if columns not in df.columns:
                raise ValueError(f"Column '{columns}' not found in DataFrame.")
        else:
            raise ValueError("'columns' must be a string or a list of strings.")

    # DataFrame shape before removal
    initial_shape = df.shape

    # Removing rows with missing values
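    if columns is not None:
        # Wrap a single column name in a list so dropna always receives a list of labels
        subset = [columns] if isinstance(columns, str) else columns
        df_cleaned = df.dropna(subset=subset)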
    else:
        df_cleaned = df.dropna()

    # Logging
    if verbose:
        removed_rows = initial_shape[0] - df_cleaned.shape[0]
        print(f"Removed {removed_rows} rows with missing values.")

    return df_cleaned

# Example usage:

# Example 1: Remove missing values from entire DataFrame
# df = remove_rows_with_missing_values(sample_df, verbose=True)

# Example 2: Remove missing values from specific column 'A'
# df = remove_rows_with_missing_values(sample_df, columns='A', verbose=True)
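The commented examples refer to a `sample_df` that is not defined in the notebook. A minimal sketch with a made-up `sample_df` shows the difference between removing rows based on one column and based on all columns, using the same `dropna` calls the function relies on:

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for the undefined sample_df
sample_df = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 2.0, 3.0]})

# Only rows with a missing value in 'A' are removed
only_a = sample_df.dropna(subset=["A"])

# Rows with a missing value in any column are removed
all_cols = sample_df.dropna()

print(len(sample_df) - len(only_a))
print(len(sample_df) - len(all_cols))
```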

### Experiment with features that have less than 64% missing data: let's see how many records remain