## Importing the necessary Libraries

In [None]:
import pandas as pd
import numpy as np
import math
import datetime
import json
import warnings
import re
warnings.filterwarnings('ignore')

## Brand Health Measures Pre Processing

### health_measure_pre_processing:

- **Function Purpose:** Handles missing data for brand health measures in a DataFrame by filling gaps with two methods, applied in order.

- **Backward Fill (bfill):** 'Positive impact on community' and 'Positive impact on well being' have missing values filled from the next valid observation in the column.

- **Forward Fill (ffill):** A wider set of columns (e.g., 'Net Favorability', 'Likelihood to recommend', 'NPS') then has missing values filled from the previous valid observation. The two backward-filled columns belong to this set as well, so any trailing gaps that backward fill could not reach are filled here.

In [None]:
def health_measure_pre_processing(df_):
    """
    Pre-processes brand health measures by filling missing values 
    using backward and forward filling methods.

    Parameters:
    - df_: DataFrame, brand health data

    Returns:
    - df_: DataFrame, processed brand health data
    """

    # Columns to fill missing values using backward fill
    hm_back_fill_var = ['Positive impact on community', 'Positive impact on well being']

    # Columns to fill missing values using forward fill
    hm_forward_fill_var = [
        'Net Favorability', 'Likelihood to recommend', 'Net Trust', 'Realibality', 
        'Accuracy', 'NPS', 'Usage', 'Preference', 'Seen as experts', 
        'Positive impact on community', 'Positive impact on well being'
    ]

    # Apply backward fill (next valid observation) to specified columns
    df_[hm_back_fill_var] = df_[hm_back_fill_var].bfill()

    # Apply forward fill (previous valid observation) to specified columns
    df_[hm_forward_fill_var] = df_[hm_forward_fill_var].ffill()

    return df_
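
The order of the two fills matters for columns that receive both. A minimal sketch on toy data (the column names `both_fills` and `forward_only` are illustrative, not columns from the brand health data):

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "both_fills": [np.nan, 2.0, np.nan, np.nan],   # backward fill, then forward fill
    "forward_only": [np.nan, 5.0, np.nan, 7.0],    # forward fill only
})

# Backward fill pulls the next valid value up into earlier gaps.
toy["both_fills"] = toy["both_fills"].bfill()

# Forward fill carries the last valid value down into later gaps.
fill_cols = ["both_fills", "forward_only"]
toy[fill_cols] = toy[fill_cols].ffill()

print(toy)
```

The column that gets both fills ends up complete, while the forward-only column keeps its leading gap because there is no earlier observation to carry forward.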


## Organic Search Pre Processing

### organic_search_pre_processing:

- **Date Conversion:** The 'Date' column is converted to datetime format using pd.to_datetime() so that the data can be filtered and processed as a time series.

- **CTR Calculation for Imputation:** It calculates a single overall Click-Through Rate (CTR) by dividing total clicks by total impressions, each summed over that column's non-missing values, and rounds the result to four decimal places. This CTR is used later to estimate missing impressions.

- **Imputation of Missing Clicks:** Missing values in the 'OrganicSearch_Google_Clicks' column are filled using data from another column ('SEO_Clicks_OrganicSearch_Desktop_MobileWeb(Combined)').

- **Imputation of Missing Impressions:** Missing values in the 'OrganicSearch_Google_Impressions' column are estimated by dividing the already filled 'OrganicSearch_Google_Clicks' by the previously calculated CTR.

- **Imputation of Missing Positions:** Missing values in 'OrganicSearch_Google_Position' are filled using the median of the non-missing values in that column.
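
The steps above can be sketched on a small toy frame. The shortened column names (`clicks`, `impressions`, `fallback_clicks`, `position`) are hypothetical stand-ins for the real organic search columns, and the values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Toy data with shortened, hypothetical column names.
toy = pd.DataFrame({
    "clicks": [10.0, 30.0, np.nan],
    "impressions": [100.0, 300.0, np.nan],
    "fallback_clicks": [9.0, 31.0, 20.0],
    "position": [4.0, 6.0, np.nan],
})

# Overall CTR from the observed totals (sum() skips NaN by default).
ctr = round(toy["clicks"].sum() / toy["impressions"].sum(), 4)

# Missing clicks are taken from the fallback column.
toy["clicks"] = toy["clicks"].fillna(toy["fallback_clicks"])

# Missing impressions are back-calculated as clicks / CTR.
toy["impressions"] = toy["impressions"].fillna((toy["clicks"] / ctr).round())

# Missing positions take the median of the observed positions.
toy["position"] = toy["position"].fillna(toy["position"].median())

print(toy)
```

Clicks have to be filled before impressions, because the impression estimate depends on the filled click values.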

In [None]:
def organic_search_pre_processing(df_):
    """
    Pre-processes organic search data by handling missing values 
    and converting the 'Date' column to datetime format.

    Parameters:
    - df_: DataFrame, organic search data

    Returns:
    - df_: DataFrame, processed organic search data
    """
    
    # Convert 'Date' column to datetime
    df_['Date'] = pd.to_datetime(df_['Date'])

    # Calculate a consistent CTR for imputation: total clicks / total impressions
    # (sum() skips missing values, so each total covers that column's non-missing rows)
    consistent_ctr_for_imputation = round(
        df_['OrganicSearch_Google_Clicks'].sum() /
        df_['OrganicSearch_Google_Impressions'].sum(),
        4
    )

    # Fill missing 'OrganicSearch_Google_Clicks' using 'SEO_Clicks_OrganicSearch_Desktop_MobileWeb(Combined)'
    df_['OrganicSearch_Google_Clicks'] = df_['OrganicSearch_Google_Clicks'].fillna(
        df_['SEO_Clicks_OrganicSearch_Desktop_MobileWeb(Combined)']
    )

    # Impute missing 'OrganicSearch_Google_Impressions' as clicks / CTR, rounded to whole impressions
    df_['OrganicSearch_Google_Impressions'] = df_['OrganicSearch_Google_Impressions'].fillna(
        (df_['OrganicSearch_Google_Clicks'] / consistent_ctr_for_imputation).round()
    )

    # Impute missing 'OrganicSearch_Google_Position' with the median of the non-missing values
    # (Series.median() skips NaN by default)
    df_['OrganicSearch_Google_Position'] = df_['OrganicSearch_Google_Position'].fillna(
        df_['OrganicSearch_Google_Position'].median()
    )

    return df_


## Social Media Pre Processing

### social_media_pre_processing:

- **Date Conversion:** It converts the 'Date' column to a datetime format to facilitate time-based filtering and operations.

- **Mean Calculation for Imputation:** A date range (from September 1, 2021, to December 31, 2023) is used to calculate the mean values for three LinkedIn metrics: Impressions, Total Engagements, and Estimated Clicks. These means are calculated only from non-missing values within the specified date range.

- **Date Range for Imputation:** The function defines another date range (April 1, 2023, to December 31, 2023) within which it will impute missing values using the calculated means.

- **Imputation of Missing Values:** Missing values in the 'SocialEng_LinkedIn_Impressions', 'SocialEng_LinkedIn_Total_Engagements', and 'SocialEng_LinkedIn_Estimated_Clicks' columns during the imputation date range are filled with the respective mean values calculated earlier.
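
The two date windows play different roles: the wider one supplies the mean, the narrower one limits where it is applied. A minimal sketch with a single hypothetical `metric` column and made-up values:

```python
import numpy as np
import pandas as pd

# Toy data; 'metric' is a hypothetical stand-in for one LinkedIn column.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2021-10-01", "2022-06-01", "2023-02-01", "2023-05-01"]),
    "metric": [100.0, 200.0, np.nan, np.nan],
})

# Window used to compute the mean (inclusive on both ends).
mean_mask = toy["Date"].between("2021-09-01", "2023-12-31")
# Narrower window in which missing values are actually imputed.
impute_mask = toy["Date"].between("2023-04-01", "2023-12-31")

# mean() skips NaN, so only observed values inside the window contribute.
window_mean = toy.loc[mean_mask, "metric"].mean()

# Fill only the rows that are both missing and inside the imputation window.
toy.loc[impute_mask & toy["metric"].isna(), "metric"] = window_mean

print(toy)
```

The gap on 2023-02-01 falls outside the imputation window, so it stays missing, while the gap on 2023-05-01 receives the window mean.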

In [None]:
def social_media_pre_processing(df_):
    """
    Pre-processes social media data by handling missing values for LinkedIn-related metrics 
    (Impressions, Total Engagements, and Estimated Clicks) using mean imputation for a specific date range.

    Parameters:
    - df_: DataFrame, social media engagement data

    Returns:
    - df_: DataFrame, processed social media engagement data
    """
    
    # Convert 'Date' column to datetime
    df_['Date'] = pd.to_datetime(df_['Date'])

    # Define the date range for calculating mean values
    start_date = pd.to_datetime('2021-09-01')
    end_date = pd.to_datetime('2023-12-31')
    date_mask = (df_['Date'] >= start_date) & (df_['Date'] <= end_date)

    # Calculate mean values for LinkedIn-related metrics
    mean_impressions = df_[date_mask & df_['SocialEng_LinkedIn_Impressions'].notna()]['SocialEng_LinkedIn_Impressions'].mean()
    mean_total_engagements = df_[date_mask & df_['SocialEng_LinkedIn_Total_Engagements'].notna()]['SocialEng_LinkedIn_Total_Engagements'].mean()
    mean_estimated_clicks = df_[date_mask & df_['SocialEng_LinkedIn_Estimated_Clicks'].notna()]['SocialEng_LinkedIn_Estimated_Clicks'].mean()

    # Define the date range for imputing missing values
    impute_start_date = pd.to_datetime('2023-04-01')
    impute_end_date = pd.to_datetime('2023-12-31')
    impute_mask = (df_['Date'] >= impute_start_date) & (df_['Date'] <= impute_end_date)

    # Impute missing values for LinkedIn metrics using the calculated mean values
    df_['SocialEng_LinkedIn_Impressions'] = np.where(
        impute_mask & df_['SocialEng_LinkedIn_Impressions'].isna(), 
        mean_impressions, 
        df_['SocialEng_LinkedIn_Impressions']
    )

    df_['SocialEng_LinkedIn_Total_Engagements'] = np.where(
        impute_mask & df_['SocialEng_LinkedIn_Total_Engagements'].isna(), 
        mean_total_engagements, 
        df_['SocialEng_LinkedIn_Total_Engagements']
    )

    df_['SocialEng_LinkedIn_Estimated_Clicks'] = np.where(
        impute_mask & df_['SocialEng_LinkedIn_Estimated_Clicks'].isna(), 
        mean_estimated_clicks, 
        df_['SocialEng_LinkedIn_Estimated_Clicks']
    )

    return df_


## Dropping Empty Fields

### drop_empty_fields:

- **Identify Columns with All Missing Values:** It checks each column in the DataFrame to see if all its values are NaN (missing). Columns that meet this condition are identified and stored in cols_v1.

- **Identify Columns with a Sum of Zero:** The function calculates the sum of each column. If a column's sum is zero (typically because every value is zero, although positive and negative values that cancel out would also qualify), the column is identified and stored in cols_v2.

- **Combine and Drop Columns:** The function combines the columns identified in cols_v1 and cols_v2 into a single list, cols, and removes these columns from the DataFrame.

- **Output:** If any columns are dropped, it prints the names of the dropped columns for reference.

In [None]:
def drop_empty_fields(df_):
    """
    Drops columns from the DataFrame that are entirely empty or have a sum of zero.

    Parameters:
    - df_: DataFrame, the input data

    Returns:
    - df_: DataFrame, with columns dropped if they contain only NaN values or sum to zero
    """
    
    # Identify columns with all NaN values
    cols_v1 = list(df_.columns[df_.isna().sum() == df_.shape[0]])
    
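    # Identify numeric columns where the sum is zero
    # (numeric_only=True keeps datetime and text columns such as 'Date' out of the sum)
    column_sums = df_.sum(numeric_only=True)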
    cols_v2 = column_sums[column_sums == 0].index.tolist()
    
    # Combine both lists of columns to be dropped
    cols = list(set(cols_v1 + cols_v2))
    
    # Drop columns if any are identified
    if cols:
        df_ = df_.drop(cols, axis=1)
        print('Columns dropped due to null/zero fields:', cols)
    
    return df_
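
A small sketch of the two checks on toy data shows which columns each one catches. The column names are illustrative, and restricting the sum to numeric columns with `numeric_only=True` is an assumption made here so that the datetime column is left out of the sum:

```python
import numpy as np
import pandas as pd

# Toy data; column names are illustrative only.
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "all_nan": [np.nan, np.nan],
    "all_zero": [0, 0],
    "cancels_out": [-1, 1],
    "kept": [3, 4],
})

# Check 1: columns where every value is missing.
all_nan_cols = toy.columns[toy.isna().all()].tolist()

# Check 2: numeric columns whose sum is zero (assumption: numeric_only=True).
sums = toy.sum(numeric_only=True)
zero_sum_cols = sums[sums == 0].index.tolist()

to_drop = sorted(set(all_nan_cols + zero_sum_cols))
result = toy.drop(columns=to_drop)

print(to_drop)
print(result.columns.tolist())
```

Note that `cancels_out` is dropped even though it holds real values, because its entries sum to zero. If that matters for a given dataset, a stricter test such as checking that every value equals zero may be a better fit.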


## Dropping Unnecessary Columns

### drop_unnecessary_columns:

- **Print Statements:** The function includes print statements to explain why certain columns are being dropped, providing context for the decisions.

- **Define Columns to Drop:**
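    - Date Variables: The 'Year' and 'Quarter' columns.
    - Organic Search Variables: 'OrganicSearch_Google_CTR' and the SEO clicks column, which are no longer needed once organic search imputation is complete.
    - Platform Installs Variables: The Android and iOS install columns ('Pricing_Android_Installs' and 'Pricing_iOS_Installs').
    - Goal Visits Variables: The per-platform and overall goal visit columns, where data is sparse.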

- **Combine and Drop Columns:** All specified columns are combined into a single list and removed from the DataFrame using the drop method.

- **Output:** The function returns the DataFrame with the unnecessary columns removed.
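
One caveat when dropping a fixed list of names: by default `DataFrame.drop` raises a `KeyError` if any listed column is absent, which can happen when an earlier step has already removed it. A minimal sketch, with hypothetical column lists, of checking for absent names and dropping tolerantly with `errors='ignore'`:

```python
import pandas as pd

# Toy frame; 'already_dropped' stands in for a column removed by an earlier step.
toy = pd.DataFrame({"Year": [2023], "Quarter": [1], "visits": [120]})

date_var = ["Year", "Quarter"]
extra_var = ["already_dropped"]  # hypothetical name, not present in the frame
drop_var = date_var + extra_var

# Report which requested columns are absent before dropping.
missing = [col for col in drop_var if col not in toy.columns]
print("Not present, skipped:", missing)

# errors='ignore' drops the columns that exist and skips the rest.
result = toy.drop(columns=drop_var, errors="ignore")
print(result.columns.tolist())
```

Whether silent skipping is desirable depends on the pipeline; printing the missing names keeps the behavior visible.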

In [None]:
def drop_unnecessary_columns(df_):
    """
    Drops unnecessary columns from the DataFrame based on predefined categories.

    Parameters:
    - df_: DataFrame, the input data

    Returns:
    - df_: DataFrame, with unnecessary columns dropped
    """
    
    # Define columns related to dates
    date_var = ['Year', 'Quarter']
    
    # Print statements explaining the dropped variables
    print("Dropping CTR and SEO clicks as organic search imputation is complete")
    print("Dropping Platform/Product visits as overall visits are considered, being the target variable")
    print("Dropping Overall/Platform/Product *Goal* visits, the variable is the total goal visits for that date; data is sparse")
    
    # Define various categories of columns to drop
    organic_search_var = ['OrganicSearch_Google_CTR', 'SEO_Clicks_OrganicSearch_Desktop_MobileWeb(Combined)']
    platform_installs_var = [
        'Pricing_Android_Installs', 
        'Pricing_iOS_Installs'
    ]
    goal_visits_var = [
        'Apps_TWC Universal Android 4G+_Goal', 
        'Apps_TWC Universal iOS 4G+_Goal', 
        'MobileWeb_TWC Mobile Web_Goal', 
        'Web_TWC Web_Goal', 
        'Overall_Product_Goal'
    ]
    
    # Combine all columns to drop into one list
    drop_var = date_var + organic_search_var + platform_installs_var + goal_visits_var
    
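    # Drop the specified columns from the DataFrame; errors='ignore' skips any
    # that drop_empty_fields has already removed instead of raising a KeyError
    df_ = df_.drop(drop_var, axis=1, errors='ignore')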
    
    return df_


## Perform Pre Processing

### perform_pre_processing:

- **Function Purpose:** The function orchestrates multiple preprocessing steps to clean and prepare the data for further analysis or modeling.

- **Brand Health Imputation:** Calls **health_measure_pre_processing** to handle missing values in brand health metrics.

- **Organic Search Imputation:** Calls **organic_search_pre_processing** to impute missing values in organic search data based on calculated metrics.

- **Social Media LinkedIn Imputation:** Calls **social_media_pre_processing** to impute missing LinkedIn engagement metrics using mean values.

- **Drop Empty Fields:** Calls **drop_empty_fields** to remove columns that are entirely empty or whose values sum to zero.

- **Drop Unnecessary Columns:** Calls **drop_unnecessary_columns** to remove columns that are no longer relevant based on predefined criteria.

- **Returns:** The function returns the DataFrame with all preprocessing steps applied, resulting in cleaner and more relevant data.
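
Because the imputation steps assign into the DataFrame they receive, the caller's original frame is modified as well. A minimal sketch of chaining steps on a copy, using two hypothetical stand-in functions (`fill_gaps` and `drop_all_nan`) rather than the real preprocessing steps:

```python
import numpy as np
import pandas as pd

def fill_gaps(df_):
    # Stand-in imputation step: assigns into the frame it receives.
    df_["a"] = df_["a"].ffill()
    return df_

def drop_all_nan(df_):
    # Stand-in cleanup step: removes columns with no data at all.
    return df_.dropna(axis=1, how="all")

raw = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})

# Work on a copy so the raw data stays available for comparison.
clean = raw.copy().pipe(fill_gaps).pipe(drop_all_nan)

print(raw)
print(clean)
```

Passing a copy keeps the raw frame untouched, which makes it easier to compare values before and after preprocessing.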

In [None]:
def perform_pre_processing(df_):
    """
    Performs a series of preprocessing steps on the input DataFrame.

    Parameters:
    - df_: DataFrame, the input data

    Returns:
    - df_: DataFrame, with all preprocessing steps applied
    """
    
    # Print and apply Brand Health imputation
    print('***** Brand Health - Imputation *****')
    df_ = health_measure_pre_processing(df_)
    
    # Print and apply Organic Search imputation
    print('***** Organic Search - Imputation *****')
    df_ = organic_search_pre_processing(df_)
    
    # Print and apply Social Media LinkedIn imputation
    print('***** Social Media - LinkedIn - Imputation *****')
    df_ = social_media_pre_processing(df_)
    
    # Print and check for variables with no data, then drop empty fields
    print('***** Checking variables which do not have any data *****')
    df_ = drop_empty_fields(df_)
    
    # Print and drop unnecessary columns
    print('***** Dropping unnecessary columns *****')
    df_ = drop_unnecessary_columns(df_)
    
    return df_
