Note: See the original Git repository for documentation, references, and the initial data ingestion pipeline for the source data: https://github.com/ShawnKyzer/who-ears-social-listening

# Step 1 - Loading the data from our pipeline outputs for exploration
The code below installs the dependencies needed to explore the full dataset produced by the data ingestion pipeline. Run it first, before loading the data into the notebook.

In [None]:
!pip install gdown==4.6.0
!pip install pyarrow==10.0.1
!pip install mlflow==2.1.1
!pip install pyngrok==5.2.1

In [None]:
import gdown

# Location of publicly available combined set for analysis post data ingestion pipeline run
url = "https://drive.google.com/drive/folders/1tUcqzoeM9AaDaUexcHg-OqXJeL6f5k6U?usp=share_link"
gdown.download_folder(url, quiet=True, use_cookies=False)

In [None]:
import pyarrow.parquet as pq
import pandas as pd

combined_data = pq.read_table(source="who_ears_social_listening_public/merged_owid_who_ears.parquet").to_pandas()
# We will need to reformat the date column for analysis
combined_data['date'] = pd.to_datetime(combined_data['date'], format = '%Y-%m-%d')

In [None]:
combined_data.columns

In [None]:
# Show the first 10 rows to confirm that the data loaded correctly.
combined_data.head(10)

# Step 2 - Creating a Space for Data Exploration 
We want our domain expert to be able to explore all of the variables and decide which are most relevant to the research question. Tools like dtale make it easy to examine things such as missing values and correlations.


In [None]:
# Install the dtale package
# Note: You may have to restart the runtime, in which case be sure to load the dataframe again using the code above.
!pip install -U dtale
!pip install statsmodels --upgrade

In [None]:
# Let's examine our dataset
import pandas as pd

import dtale
import dtale.app as dtale_app

dtale_app.USE_COLAB = True # Comment this out if you are using another environment

dtale.show(combined_data)

Follow the link above, and let's analyze our data and select the right features! You might also want to get acquainted with the dtale documentation: https://github.com/man-group/dtale
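
For a quick, non-interactive look at the same two questions (missing values and correlations), plain pandas can be used as well. The following is a minimal sketch on a toy frame; the frame `toy` and its values are made up for illustration, and only the column names are taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for combined_data; the values are made up for illustration only.
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=6, freq="D"),
    "myths": [10.0, 12.0, np.nan, 15.0, 18.0, 20.0],
    "mis_and_disinformation": [5.0, 6.0, 7.0, np.nan, 9.0, 10.0],
    "new_vaccinations_smoothed": [100.0, 120.0, 130.0, 150.0, np.nan, 200.0],
})

# Share of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)

# Pairwise Pearson correlation between the numeric columns.
# pandas skips missing values pair by pair when computing this.
correlations = toy.select_dtypes("number").corr(method="pearson")

print(missing_share)
print(correlations.round(2))
```

Replacing `toy` with `combined_data` should give the same kind of summary for the full dataset.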

# Step 3 - Feature Engineering - Extraction and Transformation
Now that we know which columns we want, we can extract them. Our domain expert has shared some clever insights and instructed us to perform the following transformations on the variables in preparation for the time series model analysis. Once we have experimented with this and feel comfortable that we are on the right track, we can build it into the pipeline in preparation for production.

Transform Step 1 - Create a dataframe with the following columns: 

* date
* mis_and_disinformation
* mis_and_disinformation_male
* mis_and_disinformation_female
* myths
* myths_female
* myths_male
* new_vaccinations_smoothed

Transform Step 2 - Merge the myths and mis_and_disinformation columns by summing each matching pair (overall, male and female). The new column names will be prefixed with 'mis_myths' (i.e. 'mis_myths', 'mis_myths_male' and 'mis_myths_female').

In [None]:
# Next, let's create a sensible feature set for training and testing
features = combined_data[['date', 'mis_and_disinformation', 'mis_and_disinformation_male',  
                          'mis_and_disinformation_female',
                         'myths','myths_female', 'myths_male', 'new_vaccinations_smoothed']]

# Remove rows that contain any NaN
features = features.dropna()

# Sum all the variables and rename
features = features.eval("mis_myths = myths + mis_and_disinformation")
features = features.eval("mis_myths_male = myths_male + mis_and_disinformation_male")
features = features.eval("mis_myths_female = myths_female + mis_and_disinformation_female")

# Let's drop the old columns now that we have merged them into our new columns

features = features.drop(columns=['mis_and_disinformation', 'mis_and_disinformation_male',  
                          'mis_and_disinformation_female',
                         'myths','myths_female', 'myths_male'])

features.head(10)

# Step 4 - Selecting the optimal model for prediction 
Since this is time series data, we can reasonably start by analyzing how linear models perform. As we don't want to spend time writing boilerplate code, let's start with a library that runs a multitude of regression models with generic hyperparameters and displays the results for us.


In [None]:
# Sadly there are a lot of dependencies that conflict with dtale so we need to remove them and install pycaret
!pip uninstall -y dtale
!pip uninstall -y statsmodels
# install pycaret as a way to do some quick analysis of all regression model types to know which performs the best 
!pip install --pre pycaret


In [None]:
from pycaret.regression import *
# Let's set up our first session for evaluation.
# We chose total cases since it's a complete set and it has a high correlation with the other features (Pearson).
# Our X is all the data from the WHO EARS social media listening set, and our y
# (the target to predict) is new_vaccinations_smoothed.

session_1 = setup(features, target = 'new_vaccinations_smoothed', 
                  session_id=1, 
                  log_experiment=False, 
                  experiment_name='new_vaccinations_smoothed_1')

In [None]:
# We can now compare all the regression models available 
best_model = compare_models()
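
To make the idea behind this comparison concrete, here is a minimal sketch of the general technique: fit several regressors with their default hyperparameters, cross-validate each one, and tabulate the R2 scores. It uses scikit-learn on synthetic data and is not PyCaret's implementation; the shortlist of models and the names `candidates` and `leaderboard` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data, used here only so the sketch runs on its own.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=1)

# A few regressors with their default hyperparameters (a hypothetical shortlist).
candidates = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "RandomForestRegressor": RandomForestRegressor(random_state=1),
}

rows = []
for name, model in candidates.items():
    # 5-fold cross-validated R2 for each candidate.
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    rows.append({"model": name, "mean_r2": scores.mean(), "std_r2": scores.std()})

# Best mean R2 first, similar in spirit to a model leaderboard.
leaderboard = (
    pd.DataFrame(rows)
    .sort_values("mean_r2", ascending=False)
    .reset_index(drop=True)
)
print(leaderboard)
```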

# Step 5 - Post Initial Model Analysis and Experimentation
We weren't able to get a very high R2 with our features, but after some discussion we think we may be able to improve the initial numbers before entering the hyperparameter tuning phase. We have now been instructed to consider some additional OWID columns (hosp_patients, new_cases_smoothed and new_deaths_smoothed) alongside new_vaccinations_smoothed, and to analyze all four variables shifted by 0, 5, 10 and 20 days. The shift accounts for the lag in COVID-19 data: someone can see the misinformation, decide not to vaccinate, and become infected days later. Again, this is just an experiment, so we don't know whether it will yield better outcomes.
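
Before building the shift helper, it can help to see what a shift does to a column. The sketch below uses a toy frame with made-up values and assumes one row per day, sorted by date (an assumption about the data, not something established above). Note that `Series.shift` moves values by a number of rows, so it only corresponds to a number of days under that assumption.

```python
import pandas as pd

# Toy daily series; the values are made up for illustration only.
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=5, freq="D"),
    "new_cases_smoothed": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Positive shift: each row receives the value from 2 rows earlier.
toy["cases_shift_plus_2"] = toy["new_cases_smoothed"].shift(2)

# Negative shift: each row receives the value from 2 rows later.
toy["cases_shift_minus_2"] = toy["new_cases_smoothed"].shift(-2)

print(toy)
```

In both directions the rows that have no counterpart are filled with NaN, which is why the shifted rows need to be dropped before training.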


In [None]:
# Since we know that we will have to keep repeating this experiment in 
# many forms with many variables, let's be smart about it and build some 
# helper functions to make our life easier. 

# Let's define our target and our target shift in days 
target_name = 'new_cases_smoothed'
shift_in_days = 0

# First, let's create a new feature set with our desired training and target base
# features 
def create_feature_set(df, columns):
    # Create a feature set by selecting specified columns from the dataframe
    features = df[columns]
    
    # Remove any NaN values
    features = features.dropna()
    
    # Sum all the variables and rename
    features = features.eval("mis_myths = myths + mis_and_disinformation")
    features = features.eval("mis_myths_male = myths_male + mis_and_disinformation_male")
    features = features.eval("mis_myths_female = myths_female + mis_and_disinformation_female")
    # Drop the old columns now that we have merged the two into our new columns
    features = features.drop(columns=['mis_and_disinformation', 'mis_and_disinformation_male',  
                          'mis_and_disinformation_female',
                         'myths','myths_female', 'myths_male'])
    return features

# now we can run these accordingly 
columns = ['date', 'mis_and_disinformation', 'mis_and_disinformation_male',  
                          'mis_and_disinformation_female',
                         'myths','myths_female', 'myths_male', target_name]
feature_set = create_feature_set(combined_data, columns)

# Let's construct a function that takes the feature set built from the master combined
# dataset, a target column name (such as new_vaccinations_smoothed) and a shift
# value such as 5, 10 or 20 days. Since we cannot use rows containing NaN, let's drop
# the rows left incomplete by the shift

def shift_merge_and_dropna(df, target, shift_value):
    # Shift the target column by the shift value
    df_shifted = pd.Series(df[target].shift(shift_value).values, index=df["date"], copy=False)
    
    # Merge the original dataframe with the shifted dataframe on the index
    df_merged = pd.merge(df, df_shifted.rename(target+'_'+str(shift_value)), how='right', on=df.index)
    
    # Drop the NaN created from the shift 
    df_merged = df_merged.dropna()

    # Drop the original field and the key index which we used to join so that we 
    # don't use them as part of our training set. 
    df_merged = df_merged.drop(columns=[target, 'key_0'])
    return df_merged

# lets give it a little test
df_output = shift_merge_and_dropna(feature_set, target_name, shift_in_days)

df_output.head(20)
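
As a cross-check, the same shift-and-drop step can be written without a merge by assigning the shifted column directly. This is a sketch of a possible variant, not the method used above: the function name `add_shifted_target` is hypothetical, the toy values are made up, and it assumes the rows are already in date order. It should produce the same columns and values as the helper above, apart from the reset row index.

```python
import pandas as pd

def add_shifted_target(df, target, shift_value):
    """Hypothetical helper: add `<target>_<shift_value>` and drop the original target."""
    out = df.copy()
    out[f"{target}_{shift_value}"] = out[target].shift(shift_value)
    # Rows without a shifted value (or with any other NaN) cannot be used for training.
    out = out.dropna()
    return out.drop(columns=[target]).reset_index(drop=True)

# Toy feature set; the values are made up for illustration only.
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=5, freq="D"),
    "mis_myths": [1.0, 2.0, 3.0, 4.0, 5.0],
    "new_cases_smoothed": [10.0, 20.0, 30.0, 40.0, 50.0],
})

shifted = add_shifted_target(toy, "new_cases_smoothed", 2)
print(shifted)
```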


In [None]:
from pycaret.regression import *
# Let's set up a session for evaluation as we did before

session_1 = setup(df_output, target = target_name+'_'+str(shift_in_days), 
                  session_id=shift_in_days, 
                  log_experiment=False, 
                  experiment_name=target_name+'_'+str(shift_in_days))

# We can now compare all the regression models available 
best_model = compare_models()

# Step 6 - Experimentation Tracking and Analysis
Those functions are really helpful, and we are able to run one experiment and one shift at a time, but that's pretty daunting to do manually. If only there were a way to run all the experiments and then put them in a table so we can understand which approach is best going forward. Let's meet our new best friend, MLflow!

In [None]:
# Set up ngrok so we can share this with others and view the MLflow UI 
import mlflow

# Start the MLflow tracking UI in the background
get_ipython().system_raw("mlflow ui --port 5000 &") 

# Set up a remote tunnel using ngrok.com to allow access to the local port
# borrowed from https://colab.research.google.com/github/alfozan/MLflow-GBRT-demo/blob/master/MLflow-GBRT-demo.ipynb#scrollTo=4h3bKHMYUIG6

# Import the required libraries 
from pyngrok import ngrok

# Terminate any open tunnels
ngrok.kill()

# Setting the authtoken
# Get your authtoken from https://dashboard.ngrok.com/auth
NGROK_AUTH_TOKEN = "{Insert your auth token here once you signup from the link above}"
ngrok.set_auth_token(NGROK_AUTH_TOKEN)

# Open an HTTPS tunnel to http://localhost:5000
ngrok_tunnel = ngrok.connect(addr="5000", proto="http", bind_tls=True)
print("MLflow Tracking UI:", ngrok_tunnel.public_url)

In [None]:
target_names = ['new_cases_smoothed', 'hosp_patients', 'new_vaccinations_smoothed', 'new_deaths_smoothed']
shift_in_days = [0,5,10,20]

# Define columns once before the loop
columns = ['date', 'mis_and_disinformation', 'mis_and_disinformation_male',  
           'mis_and_disinformation_female', 'myths','myths_female', 'myths_male']

# Loop through the target names and shift values
for target_name in target_names:
  # Append target name to columns list
  columns.append(target_name)
  for shift in shift_in_days:

    # Create feature set using the updated columns list
    feature_set = create_feature_set(combined_data, columns)

    # do the shift and merge to create the new feature
    feature_set = shift_merge_and_dropna(feature_set, target_name, shift)

    # Setup session with target name appended with shift value and experiment name appended with shift value
    session_1 = setup(feature_set, target = target_name + '_' + str(shift), 
                      session_id=shift, 
                      log_experiment=True, 
                      experiment_name=target_name + '_' + str(shift))

    # Compare all the regression models
    best_model = compare_models()
    
  # Remove target name from columns list for next iteration
  columns.remove(target_name)
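
Once the runs are logged, they can also be pulled into a single table instead of being browsed one experiment at a time in the UI. The sketch below ranks runs by R2 using plain pandas on a toy frame shaped like the output of `mlflow.search_runs()`. The helper name `rank_runs` is hypothetical, the toy values are made up, and the metric column name `metrics.R2` is an assumption about how the runs were logged that should be checked against the real column names.

```python
import pandas as pd

def rank_runs(runs, metric_col="metrics.R2", top_n=5):
    """Hypothetical helper: keep runs that have the metric and sort best-first."""
    ranked = runs.dropna(subset=[metric_col]).sort_values(metric_col, ascending=False)
    return ranked.head(top_n).reset_index(drop=True)

# Toy frame shaped like the output of mlflow.search_runs(); values are made up.
toy_runs = pd.DataFrame({
    "experiment_id": ["1", "1", "2", "2"],
    "tags.mlflow.runName": [
        "Linear Regression", "Ridge Regression",
        "Linear Regression", "Ridge Regression",
    ],
    "metrics.R2": [0.41, 0.39, None, 0.57],
})

best_runs = rank_runs(toy_runs)
print(best_runs)

# With MLflow available, the real table could be fetched like this:
# import mlflow
# runs = mlflow.search_runs(search_all_experiments=True)
# print(rank_runs(runs))
```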


