Note: For documentation, the initial data ingestion pipeline, the source data and references, please see the original git repo: https://github.com/ShawnKyzer/who-ears-social-listening

# Step 1 - Loading the data from our pipeline outputs for exploration
The cells below install the dependencies needed to explore the full dataset produced by the data ingestion pipeline. Run the install cell first, before the data is loaded into the notebook.

In [None]:
!pip install gdown
!pip install pyarrow

In [None]:
import gdown

# Location of publicly available combined set for analysis post data ingestion pipeline run
url = "https://drive.google.com/drive/folders/1tUcqzoeM9AaDaUexcHg-OqXJeL6f5k6U?usp=share_link"
gdown.download_folder(url, quiet=True, use_cookies=False)

In [None]:
import pyarrow.parquet as pq
import pandas as pd

combined_data = pq.read_table(source="who_ears_social_listening_public/merged_owid_who_ears.parquet").to_pandas()
# We will need to reformat the date column for analysis
combined_data['date'] = pd.to_datetime(combined_data['date'], format = '%Y-%m-%d')

In [None]:
combined_data.columns

In [None]:
# This will show the first 10 rows; we just want to make sure we loaded the data.
combined_data.head(10)

# Step 2 - Creating a Space for Data Exploration 
We want to give our domain expert the ability to explore all of the variables in order to determine which are the most interesting for answering the research question. Tools like dtale make it easy to examine things such as missing values and correlations.
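For readers who cannot run dtale, the same two checks (missing values and correlations) can be approximated in plain pandas. This is only a minimal sketch: the `toy` dataframe is hypothetical stand-in data, not the real `combined_data`, and it reuses a few column names from the notebook purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for combined_data (values are made up)
toy = pd.DataFrame({
    "myths": [1.0, 2.0, np.nan, 4.0, 5.0],
    "mis_and_disinformation": [2.0, 4.0, 6.0, np.nan, 10.0],
    "new_vaccinations_smoothed": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Pairwise Pearson correlations; pandas skips rows where either value is missing
corr = toy.corr(method="pearson")

print(missing_share)
print(corr.round(3))
```

On the real data you would replace `toy` with `combined_data` (selecting numeric columns first if needed).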


In [None]:
# Install the dtale package
# Note: You may have to restart the runtime, in which case be sure to load the dataframe again using the code above.
!pip install -U dtale
!pip install statsmodels --upgrade

In [None]:
# Let's examine our dataset
import pandas as pd

import dtale
import dtale.app as dtale_app

dtale_app.USE_COLAB = True # Comment this out if you are using another environment

dtale.show(combined_data)

Follow the link above, and let's analyze our data and select the right features! You might also want to get acquainted with the dtale documentation here: https://github.com/man-group/dtale.

# Step 3 - Feature Engineering - Extraction and Transformation
Now that we know which columns we want to extract, we can pull these out. We have some clever insights from our domain expert, who has instructed us to perform the following transformations on the variables in preparation for the time series model analysis. Once we have experimented with this and feel comfortable that we are on the right track, we can build it into the pipeline in preparation for production.

Transform Step 1 - Create a dataframe with the following columns: 

* date
* mis_and_disinformation
* mis_and_disinformation_male
* mis_and_disinformation_female
* myths
* myths_female
* myths_male
* new_vaccinations_smoothed

Transform Step 2 - Merge each myths column with its matching mis_and_disinformation column by summing the two. The new column names will be prefixed with 'mis_myths' (e.g. 'mis_myths', 'mis_myths_male', 'mis_myths_female').
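To make Transform Step 2 concrete, here is a small worked example on made-up numbers. It is a sketch only: the `toy` values are hypothetical, and the loop over suffixes is just one way to express the pairwise sum.

```python
import pandas as pd

# Hypothetical two-row example using the notebook's column names
toy = pd.DataFrame({
    "myths": [1, 0],
    "myths_male": [1, 0],
    "myths_female": [0, 0],
    "mis_and_disinformation": [2, 1],
    "mis_and_disinformation_male": [0, 1],
    "mis_and_disinformation_female": [1, 0],
})

# Sum each myths column with the mis_and_disinformation column sharing its suffix
merged = pd.DataFrame(index=toy.index)
for suffix in ["", "_male", "_female"]:
    merged["mis_myths" + suffix] = (
        toy["myths" + suffix] + toy["mis_and_disinformation" + suffix]
    )

print(merged)
```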

In [None]:
# Next, let's create a sensible feature set for training and testing
features = combined_data[['date', 'mis_and_disinformation', 'mis_and_disinformation_male',  
                          'mis_and_disinformation_female',
                         'myths','myths_female', 'myths_male', 'new_vaccinations_smoothed']]

# remove any NaN
features = features.dropna()

# Sum all the variables and rename
features = features.eval("mis_myths = myths + mis_and_disinformation")
features = features.eval("mis_myths_male = myths_male + mis_and_disinformation_male")
features = features.eval("mis_myths_female = myths_female + mis_and_disinformation_female")

# Let's drop the old columns now that we have merged the two into our new columns

features = features.drop(columns=['mis_and_disinformation', 'mis_and_disinformation_male',  
                          'mis_and_disinformation_female',
                         'myths','myths_female', 'myths_male'])

features.head(10)

# Step 4 - Selecting the optimal model for prediction 
Since this is time series data, we can reasonably start by analyzing how linear models perform. As we don't want to spend time writing boilerplate code, let's start with a library that fits a multitude of regression models with generic hyperparameters and displays the results for us.
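Conceptually, this kind of comparison amounts to fitting several regressors with their default settings and ranking them by a cross-validated score. The sketch below illustrates the idea with scikit-learn on synthetic data. It is not how pycaret is implemented, and the data, the model choices and the 5-fold setup are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data (an assumption): a linear signal plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

# A few regressors with default hyperparameters
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(),
    "random_forest": RandomForestRegressor(random_state=0),
}

# Mean R2 over 5 cross-validation folds for each model
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}

# Print the models from best to worst
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean R2 = {score:.3f}")
```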


In [None]:
# Sadly there are a lot of dependencies that conflict with dtale so we need to remove them and install pycaret
!pip uninstall -y dtale
!pip uninstall -y statsmodels
# install pycaret as a way to do some quick analysis of all regression model types to know which performs the best 
!pip install --pre pycaret


In [None]:
from pycaret.regression import *
# Let's set up our first session for evaluation.
# Our X is all the data from the who-ears social media listening set and our y,
# or target, is new_vaccinations_smoothed, which we want to predict.

session_1 = setup(features, target = 'new_vaccinations_smoothed', 
                  session_id=1, 
                  log_experiment=False, 
                  experiment_name='new_vaccinations_smoothed_1')

In [None]:
# We can now compare all the regression models available 
best_model = compare_models()

# Step 5 - Post Model Selection Analysis
We weren't able to get a very high R2 with our features, but after some discussion we think we may be able to improve the initial numbers before entering the hyperparameter tuning phase. We have now been instructed to add some additional OWID columns (hosp_patients, new_cases_smoothed and new_deaths_smoothed) and then, for each of the four variables, create three new columns representing a 5, 10 and 20 day shift. The shifts are meant to account for the lag in COVID-19 appearance. Again, this is just an experiment, so we don't know if it will yield better outcomes.
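The shifted columns described above can be sketched with `DataFrame.shift`. This is a minimal illustration on made-up data: the helper name `add_shifted_columns` is hypothetical, the toy frame has exactly one row per day, and the direction of the shift (a positive value, so each row sees the value from that many days earlier) is an assumption.

```python
import pandas as pd

# Hypothetical daily data: one row per date, two of the OWID-style columns
toy = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=30, freq="D"),
    "new_cases_smoothed": list(range(30)),
    "hosp_patients": list(range(100, 130)),
})

def add_shifted_columns(df, columns, shifts=(5, 10, 20)):
    """Add <column>_<shift> columns holding the value from `shift` rows earlier."""
    out = df.sort_values("date").copy()
    for col in columns:
        for s in shifts:
            out[f"{col}_{s}"] = out[col].shift(s)
    # The first max(shifts) rows have no history, so they contain NaN
    return out.dropna().reset_index(drop=True)

lagged = add_shifted_columns(toy, ["new_cases_smoothed", "hosp_patients"])
print(lagged.head())
```

If the real dataset stacks several countries in one frame (an assumption we have not verified here), the shift would need to be applied within each country, for example via `groupby` on a country column, so that values do not leak across country boundaries.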


In [27]:
# Since we know that we are going to have to keep repeating this experiment in 
# many forms with many variables, let's be smart about it and build some methods 
# to make our life easier. 

# Let's define our target and our target shift in days 
target_name = 'new_cases_smoothed'
shift_in_days = 0

# First, let's create a new feature set with our desired training and target base
# features 
def create_feature_set(df, columns):
    # Create a feature set by selecting specified columns from the dataframe
    features = df[columns]
    
    # Remove any NaN values
    features = features.dropna()
    
    # Sum all the variables and rename
    features = features.eval("mis_myths = myths + mis_and_disinformation")
    features = features.eval("mis_myths_male = myths_male + mis_and_disinformation_male")
    features = features.eval("mis_myths_female = myths_female + mis_and_disinformation_female")
    # Drop the old columns now that we have merged the two into our new columns
    features = features.drop(columns=['mis_and_disinformation', 'mis_and_disinformation_male',  
                          'mis_and_disinformation_female',
                         'myths','myths_female', 'myths_male'])
    return features

# now we can run these accordingly 
columns = ['date', 'mis_and_disinformation', 'mis_and_disinformation_male',  
                          'mis_and_disinformation_female',
                         'myths','myths_female', 'myths_male', target_name]
feature_set = create_feature_set(combined_data, columns)

# Let's construct a method that takes a list of columns from the master combined
# dataset, a target column name (such as new_vaccinations_smoothed) and a shift
# value such as 5, 10 or 20 days. Since we cannot use rows containing NaN, let's
# drop the rows created by the shift

def shift_merge_and_dropna(df, target, shift_value):
    # Shift the target column by the shift value
    df_shifted = pd.Series(df[target].shift(shift_value).values, index=df["date"], copy=False)
    
    # Merge the original dataframe with the shifted dataframe on the index
    df_merged = pd.merge(df, df_shifted.rename(target+'_'+str(shift_value)), how='right', on=df.index)
    
    # Drop the NaN created from the shift 
    df_merged = df_merged.dropna()

    # Drop the original field and the key index which we used to join so that we 
    # don't use them as part of our training set. 
    df_merged = df_merged.drop(columns=[target, 'key_0'])
    return df_merged

# Let's give it a little test
df_output = shift_merge_and_dropna(feature_set, target_name, shift_in_days)

df_output.head(20)


Unnamed: 0,date,mis_myths,mis_myths_male,mis_myths_female,new_cases_smoothed_0
0,2020-12-15,1,1,0,90.429
1,2020-12-16,2,1,0,86.143
2,2020-12-17,2,0,0,79.857
3,2020-12-18,2,0,0,71.571
4,2020-12-19,1,0,0,66.429
5,2020-12-20,2,0,0,65.143
6,2020-12-21,1,0,1,58.429
7,2020-12-22,0,0,0,62.857
8,2020-12-23,1,1,0,74.857
9,2020-12-24,0,0,0,77.857


In [28]:
from pycaret.regression import *
# Let's set up our first session for evaluation, as we did before

session_1 = setup(df_output, target = target_name+'_'+str(shift_in_days), 
                  session_id=shift_in_days, 
                  log_experiment=False, 
                  experiment_name=target_name+'_'+str(shift_in_days))

# We can now compare all the regression models available 
best_model = compare_models()

Unnamed: 0,Description,Value
0,Session id,0
1,Target,new_cases_smoothed_0
2,Target type,Regression
3,Original data shape,"(22545, 5)"
4,Transformed data shape,"(22545, 7)"
5,Transformed train set shape,"(15781, 7)"
6,Transformed test set shape,"(6764, 7)"
7,Numeric features,3
8,Date features,1
9,Preprocess,True


Unnamed: 0,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
et,Extra Trees Regressor,5005.3201,180053684.8096,13146.7427,0.8788,2.5079,57.961,2.351
rf,Random Forest Regressor,5467.6868,234417971.0395,15106.6103,0.8443,2.5026,64.1462,2.784
lightgbm,Light Gradient Boosting Machine,6543.5893,319620356.5664,17734.5861,0.7874,2.7226,63.8292,0.225
dt,Decision Tree Regressor,6034.9727,386115290.4852,19490.8935,0.7487,2.716,69.4031,0.076
gbr,Gradient Boosting Regressor,7957.3299,425385162.5878,20506.7735,0.7164,2.7732,68.1676,0.753
knn,K Neighbors Regressor,9281.5796,833028220.8,28705.9504,0.4581,2.5538,61.594,0.058
llar,Lasso Least Angle Regression,11863.5365,1020581740.0207,31845.8784,0.3372,3.396,274.4437,0.038
en,Elastic Net,11666.1518,1025260334.2774,31927.3119,0.3323,3.3107,240.8864,0.074
br,Bayesian Ridge,11900.3498,1025741022.5486,31937.5193,0.3312,3.3753,267.408,0.038
lar,Least Angle Regression,11943.0503,1025901637.805,31940.3289,0.331,3.3885,272.0031,0.038
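For reference, the error columns in the table above can be computed from predictions as in the sketch below. The helper `regression_metrics` and its inputs are hypothetical. Note also that `compare_models` reports values averaged over cross-validation folds, as far as we understand it, so the RMSE column need not equal the square root of the MSE column.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R2 for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    # R2 = 1 - (residual sum of squares / total sum of squares)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Made-up values, only to exercise the function
m = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print(m)
```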

