# COVID19 ANALYSIS USING SOCIAL MEDIA DATA

After collecting tweets from users, we need to clean the data, and visualize it. The aim of this notebook is to understand how to clean and filter textual data, to then build interactive visualizations.

We will keep only the tweets mentionning COVID19 symptoms, and compare the evolution of these tweets with the evolution of hospitalizations over time.

In [2]:
import pandas as pd
import numpy as np
import os

import warnings
warnings.filterwarnings('ignore')

# 1. Import data

We import the tweets containing symptoms. These tweets have already been filtered on the whole dataset, which contained around 55M of tweets : we only kept those in French who mention a possible COVID19 symptom and tweets that were not retweets. We also anonymized the text, removing url and usernames.

In [150]:
# Paths to data
path_to_data = "../data/"
tweets_symptoms = pd.read_csv(os.path.join(path_to_data,'list_tweets_symptoms.csv'), sep=';')
tweets_symptoms['day']=pd.to_datetime(tweets_symptoms['day'])

In [151]:
tweets_symptoms.head()

Unnamed: 0,id_str,day,anonymized_text
0,1206157336534016001,2019-12-15,Symptomatique [url]
1,1206550539477159941,2019-12-16,#ReformeRetraite #Delevoye @mention appelle au...
2,1216811655738413056,2020-01-13,@mention @mention Rhume toux fièvre
3,1218113362917236737,2020-01-17,Il serait temps de condamner toutes ces feigna...
4,1219872603554361344,2020-01-22,@mention Parce que ça veut dire que je serais ...


# 2. Data preparation

We want to find tweets that mention specific symptoms.

## 2.1. Text cleaning

To do so, we will first normalize the texts, by removing capital letters, and accents, since Twitter users often don't use accents. We also want to remove punctuation - to recognize hastags (eg. "#fever") as a symptom.

In [152]:
from unidecode import unidecode    # remove punctuation
import re                          # regex library

In [153]:
def clean_text(text):
    
    # remove accents
    text=unidecode(text)
    
    # lowercase
    text=text.lower()
    
    # remove punctuation, except @ and [url]
    text=re.sub(r'[^\sa-zA-Z0-9@\[\]]',' ',text)
    
    return text

In [154]:
tweets_symptoms['clean_text']=tweets_symptoms['anonymized_text'].apply(lambda x: clean_text(x))
tweets_symptoms.head()

Unnamed: 0,id_str,day,anonymized_text,clean_text
0,1206157336534016001,2019-12-15,Symptomatique [url],symptomatique [url]
1,1206550539477159941,2019-12-16,#ReformeRetraite #Delevoye @mention appelle au...,reformeretraite delevoye @mention appelle au...
2,1216811655738413056,2020-01-13,@mention @mention Rhume toux fièvre,@mention @mention rhume toux fievre
3,1218113362917236737,2020-01-17,Il serait temps de condamner toutes ces feigna...,il serait temps de condamner toutes ces feigna...
4,1219872603554361344,2020-01-22,@mention Parce que ça veut dire que je serais ...,@mention parce que ca veut dire que je serais ...


## 2.2. Symptoms dictionary

Next step: find the different symptoms in the tweets.

We will build a dictionary to find the different symptoms. We want this dictionary to contain familiar ways of expressing symptoms.

In [155]:
symptoms_dict_fr = {'cough' : ['toux','touss'],
                   'sore_throat' : ['maux de gorge','mal de gorge','mal a la gorge'],
                   'fever' : ['fievre'],
                   'loss_taste' : ['perte du gout','perte de l odorat','perte de lodorat','perdu le gout','perdu l odorat',
                                   'perdu lodorat','plus de gout','plus d odeur', 'plus dodeur'],
                    'breathing_difficulties' : ['difficultes a respirer','difficulte a respirer','difficultes respiratoires',
                                                'mal a respirer'],
                   'symptoms' : ['symptom']}

In [156]:
# We want to capture the tweets in which a word starts with the symptom expression
for symptom in symptoms_dict_fr.keys():
    tweets_symptoms[symptom] = (tweets_symptoms['clean_text'].str.contains\
                                    ('|'.join(['^'+x for x in symptoms_dict_fr.get(symptom)]+
                                            [' '+x for x in symptoms_dict_fr.get(symptom)])))
                                


In [157]:
#Examples?
tweets_symptoms.loc[tweets_symptoms['fever']==True, 'anonymized_text'].tolist()[:5]

['@mention @mention Rhume toux fièvre',
 '@mention Parce que ça veut dire que je serais encore en train de dormir cassé par la fièvre. Donc je refuse.',
 '«\xa0j’ai de la fièvre je tousse je préfère pas venir\xa0» alors que ça a juste une gueule de bois',
 '⚠️ En cas de fièvre, privilégiez le paracétamol. En effet, les anti-inflammatoires comme l’ibuprofène, de part leur mécanisme d’action, peuvent aggraver l’infection ! 🦠  En cas de doute, demandez conseil à un professionnel de #santé 👨\u200d⚕️👩\u200d🔬 #COVIDー19 [url]',
 'Merci. Mais je suis inquiète sa fièvre ne descend et sa toux est persistante. Un simple rhume le coronavirus ?? Je ne sais pas mais Je soigne mon bébé malgré tout et nous restons chez nous  tous les 5 . Dieu veille.']

In [158]:
print('Number of tweets mentionning each symptom:')
tweets_symptoms[symptoms_dict_fr.keys()].sum()

Number of tweets mentionning each symptom:


cough                     4266
sore_throat                577
fever                     2551
loss_taste                 485
breathing_difficulties     436
symptoms                  5583
dtype: int64

# 3. Visualization

Next step: we will visualize the data, using plotly for interactive visualizations!

[Plotly](https://plotly.com/python/) allows you to build interactive graphs in Python Jupyter Notebooks. You can then easily convert these visualizations to .html, to build a dashboard presenting you results. It can also easily be combined with [ipywidgets](https://ipywidgets.readthedocs.io/en/latest/), a tool to add widgets and make your plots even more interactive -- this option cannot be converted to .html though.

In [159]:
from matplotlib import pyplot as plt
import plotly.figure_factory as ff
import plotly.offline as py
import plotly.graph_objs as go
from plotly import tools
py.init_notebook_mode(connected = True)

from ipywidgets import interact, interactive, fixed, interact_manual
import ipywidgets as widgets

## 3.1. Evolution of symptoms over time

First, we want to see how each symptom evolves over time.

We first group the data by day.

In [160]:
tweets_symptoms['n_symptom']=1
tweets_day=tweets_symptoms.groupby('day').agg('sum').reset_index().drop(columns=['id_str'])
tweets_day.head()

Unnamed: 0,day,cough,sore_throat,fever,loss_taste,breathing_difficulties,symptoms,n_symptom
0,2019-12-02,1.0,1.0,0.0,1.0,0.0,1.0,4.0
1,2019-12-03,1.0,1.0,2.0,0.0,0.0,5.0,9.0
2,2019-12-04,1.0,0.0,2.0,0.0,0.0,2.0,5.0
3,2019-12-05,1.0,0.0,2.0,0.0,0.0,2.0,5.0
4,2019-12-06,1.0,1.0,2.0,0.0,1.0,1.0,6.0


In [161]:
def plot_symptoms_evolution(average_week=False) :
    
    
    traces=[]
    
    for value in symptoms_dict_fr.keys():
        y_value = tweets_day[value].values
        if average_week==True:
            y_value = tweets_day[value].rolling(window=7).mean()
        traces.append(go.Scatter(x = tweets_day['day'], 
                                 y = y_value,
                                mode = 'lines',
                                name = value))
    
    layout = go.Layout(title="Evolution of tweets mentionning COVID19 symptoms in Ile-de-France")
    if average_week==True:
        layout=go.Layout(title='Evolution of tweets mentionning COVID19 symptoms in Ile-de-France - average week')
    fig = go.Figure(traces, layout)
    
    fig.add_shape(dict(type="rect",
                       yref='paper',
                       x0='2020-03-17',
                       y0=0,
                       x1='2020-05-11',
                       y1=1,
                       fillcolor="LightSalmon",
                       opacity=0.2,
                       layer='below',
                       line_width=0))
    
    fig.update_layout(annotations=[dict(
        x='2020-04-15',
        y=0.9,
        yref="paper",
        text="Lockdown (France)", showarrow=False)])
    
    py.iplot(fig)

In [162]:
plot_symptoms_evolution()

In [163]:
plot_symptoms_evolution(average_week=True)

Note : The plots are interactive, you can select only a few symptoms, hover to see the value, or zoom.

## 3.2. Comparision with hospital data

We would like now to compare all these tweets mentionning symptoms, with medical data about the evolution of the pandemic. We will take data from Santé Publique France about [emergencies and SOS médecins data related to COVID19](https://www.data.gouv.fr/fr/datasets/donnees-des-urgences-hospitalieres-et-de-sos-medecins-relatives-a-lepidemie-de-covid-19/).

In [164]:
emergencies=pd.read_csv('https://www.data.gouv.fr/fr/datasets/r/eceb9fb4-3ebc-4da3-828d-f5939712600a', sep=';')

# date to datetime
emergencies['date_de_passage'] = pd.to_datetime(emergencies['date_de_passage'])
emergencies['dep']=emergencies['dep'].astype(str)
# Keep only Île-De-France
emergencies = emergencies.loc[emergencies['dep'].isin(['75','77','78','91','93','94','95'])]
emergencies = emergencies.groupby(['date_de_passage']).agg('sum').reset_index()

emergencies.head()

Unnamed: 0,date_de_passage,nbre_pass_corona,nbre_pass_tot,nbre_hospit_corona,nbre_pass_corona_h,nbre_pass_corona_f,nbre_pass_tot_h,nbre_pass_tot_f,nbre_hospit_corona_h,nbre_hospit_corona_f,nbre_acte_corona,nbre_acte_tot,nbre_acte_corona_h,nbre_acte_corona_f,nbre_acte_tot_h,nbre_acte_tot_f
0,2020-02-24,0.0,15406.0,0.0,0.0,0.0,3917.0,3785.0,0.0,0.0,0.0,4761.0,0.0,0.0,1055.0,1333.0
1,2020-02-25,2.0,13530.0,0.0,0.0,1.0,3411.0,3353.0,0.0,0.0,0.0,4197.0,0.0,0.0,908.0,1197.0
2,2020-02-26,0.0,13176.0,0.0,0.0,0.0,3342.0,3246.0,0.0,0.0,0.0,3994.0,0.0,0.0,848.0,1150.0
3,2020-02-27,4.0,13538.0,0.0,2.0,0.0,3399.0,3370.0,0.0,0.0,0.0,4093.0,0.0,0.0,881.0,1167.0
4,2020-02-28,0.0,12952.0,0.0,0.0,0.0,3345.0,3130.0,0.0,0.0,0.0,3474.0,0.0,0.0,739.0,1003.0


In this dataset, we have the :
- number of passages to emergencies for suspicion of COVID19, total number of passages to emergencies ;
- number of hospitalizations COVID19, total number of hospitalizations ;
- number of medical acts from SOS médecins for suspicion of COVID19, number of medical acts.


For each variable, we have the total, as well as the type of emergencies for men(h) and women(f).

We will focus on the total numbers of emergencies for COVID19 (nbre_pass_corona, nbre_hospit_corona, nbre_acte_corona). We first define a dictionary of these emergencies.

In [165]:
emergencies_dict={'passages to emergencies':'nbre_pass_corona',
                  'hospitalizations':'nbre_hospit_corona',
                  'acts':'nbre_acte_corona'}

This time, we will build an interactive visualization with widgets, to choose which variable to display.

In [166]:
def plot_symptoms_emergencies(emergency) :
    
    emergency_type = emergencies_dict.get(emergency)
    
    
    traces=[]
    traces.append(go.Scatter(x = tweets_day['day'],
                            y = tweets_day['n_symptom'].values,
                            mode = 'lines',
                            name = 'Tweets symptoms',
                            opacity=0.3,
                            line=dict(color='red'),
                            yaxis="y1"))
    traces.append(go.Scatter(x = tweets_day['day'],
                            y = tweets_day['n_symptom'].rolling(7).mean(),
                            mode = 'lines',
                             line=dict(color='red'),
                            name = 'Tweets symptoms (avg 7d)',
                             yaxis='y1'))
    
    traces.append(go.Scatter(x = emergencies['date_de_passage'], 
                             y = emergencies[emergency_type],
                             mode = 'lines',
                             name = 'Passages to emergencies',
                             opacity=0.3,
                             line=dict(color='green'),
                             yaxis="y2"))
    traces.append(go.Scatter(x = emergencies['date_de_passage'], 
                             y = emergencies[emergency_type].rolling(7).mean(),
                             mode = 'lines',
                             name =  'Passages to emergencies (avg 7d)',
                             line=dict(color='green'),
                             yaxis="y2"))
    
    
    layout = go.Layout(title="Evolution of mentions of symptoms and emergencies related to COVID in Ile-de-France ",
                       legend={"x" : 1.1, "y" : 1},
                       yaxis=dict(title='Number of tweets'),
                       yaxis2=dict(title='Number of emergencies related to COVID',
                                   overlaying='y',
                                   side='right'))
    
    
    fig = go.Figure(traces, layout)
    
    
    # Add rectangle for lockdown
    fig.add_shape(dict(type="rect",
                       yref='paper',
                       x0='2020-03-17',
                       y0=0,
                       x1='2020-05-11',
                       y1=1,
                       fillcolor="LightSalmon",
                       opacity=0.2,
                       layer='below',
                       line_width=0))
    
    fig.update_layout(annotations=[dict(
        x='2020-04-15',
        y=0.95,
        yref="paper",
        text="Lockdown (France)", showarrow=False)])
    
    py.iplot(fig)

In [167]:
interact(plot_symptoms_emergencies, 
         emergency=widgets.Dropdown(options=emergencies_dict.keys(),
                                        value='passages to emergencies'))

interactive(children=(Dropdown(description='emergency', options=('passages to emergencies', 'hospitalizations'…

<function __main__.plot_symptoms_emergencies(emergency)>

Focussing on the first wave, we notive that the two curves have the same evolution, but with a lag of a few days.

# To be continued

## 3.3. Statistical metrics to measure the correlation

First measure: Pearson correlation.

In [171]:
def pearson_correlation(average_week=False):
    df_tweets=tweets_day.loc[tweets_day['day'].isin(emergencies['date_de_passage'].tolist())]
    df_emergencies=emergencies.loc[emergencies['date_de_passage'].isin(tweets_day['day'].tolist())]
    
    if average_week==True:
        df_tweets=tweets_day.loc[tweets_day['day'].isin(emergencies['date_de_passage'].tolist())]
        df_emergencies=emergencies.loc[emergencies['date_de_passage'].isin(tweets_day['day'].tolist())]

    r, p = stats.pearsonr(df_tweets['n_symptom'], df_emergencies['nbre_pass_corona'])
    print(f"Scipy computed Pearson r: {r} and p-value: {p}")
pearson_correlation()

Scipy computed Pearson r: 0.6091414935880327 and p-value: 1.65501898981073e-28


We shift the tweets curve by 11 days and find that the two curves superpose.

In [172]:
# def plot_shifted_emergencies(lag) :
    
#     # Shift : 
#     tweets_symptoms['symptom_shift'] = tweets_symptoms['has_symptom'].shift(lag)
#     tweets_symptoms['symptom_shift_3'] = tweets_symptoms['has_symptom_mean_week'].shift(lag)

    
#     symptoms = 'symptom_shift'
#     #label_emergency = emergencies_dict.get(type_emergency)
    
#     traces=[]
#     traces.append(go.Scatter(x = tweets_symptoms['day'],
#                             y = tweets_symptoms['symptom_shift_3'].values,
#                             mode = 'lines',
#                              line=dict(color='red'),
#                             name = 'Tweets symptoms (avg 7d)',
#                             yaxis="y1"))

#     traces.append(go.Scatter(x = urgences.date_de_passage, 
#                              y = urgences['nbre_pass_corona_mean_week'],
#                              mode = 'lines',
#                              name = 'Passages to emergencies (avg 7d)',
#                              line=dict(color='green'),
#                              yaxis="y2"))
# #     traces.append(go.Scatter(x = urgences.date_de_passage, 
# #                              y = urgences['nbre_hospit_corona_mean_3'],
# #                              mode = 'lines',
# #                              name = 'Hospitalizations (avg 3d)',
# #                              line=dict(color='grey'),
# #                              yaxis="y2"))
    
#     layout = go.Layout(title="Evolution of mentions of symptoms (shifted 10 days) and emergencies related to COVID in Ile-de-France",
#                         legend={"x" : 1.08, "y" : 1},
#                        yaxis=dict(title='Number of tweets'),
#                        yaxis2=dict(title='Number of passages in emergencies',
#                                    overlaying='y',
#                                    side='right'))
    
#     fig = go.Figure(traces, layout)
#     py.iplot(fig)
    
# plot_shifted_emergencies(lag=10)

Cross correlation

In [173]:
# import statsmodels
# import statsmodels.tsa.stattools as ts
# from statsmodels.tsa.stattools import acf, pacf
# import matplotlib as mpl
# import matplotlib.pyplot as plt
# import quandl
# import scipy.stats as stats

# #Variables
# plt.xcorr(tweets_symptoms.loc[tweets_symptoms['day']>'2020-02-25','has_symptom_mean_3'], 
#           urgences.loc[(urgences['date_de_passage']<'2020-09-03')&(urgences['date_de_passage']>'2020-02-25'),'nbre_pass_corona_mean_3'],
#           normed=True, usevlines=True, maxlags=30, lw=2)
# plt.grid(True)
# # plt.title("Cross-correlation between tweets refering to COVID symptoms and passages to urgences/SOS Medecin for COVID")
# plt.show()

## Comparison with the number of deaths in Île-de-France

We also analyzed the evolution of the number of deaths due to COVID, based on the [Data from OpenCOVID19-fr](https://www.data.gouv.fr/en/datasets/chiffres-cles-concernant-lepidemie-de-covid19-en-france/). We also notice a similar trends between the number of tweets mentioning symptoms, and the number of deaths in Île-de-France, but with a lag of around 20 days.

In [174]:
# def plot_symptoms_deaths_with_ma() :
    
#     #['nbre_pass_corona','nbre_hospit_corona','nbre_acte_corona']
#     #label_emergency = emergencies_dict.get(type_emergency)
#     traces=[]
#     traces.append(go.Scatter(x = tweets_symptoms['day'],
#                             y = tweets_symptoms['has_symptom'].values,
#                             mode = 'lines',
#                             name = 'Tweets symptoms',
#                             opacity=0.2,
#                             line=dict(color='red'),
#                             yaxis="y1"))
#     traces.append(go.Scatter(x = tweets_symptoms['day'],
#                             y = tweets_symptoms['has_symptom_mean_3'].values,
#                             mode = 'lines',
#                              line=dict(color='red'),
#                              opacity=0.5,
#                             name = 'Tweets symptoms (avg 3days)',
#                             yaxis="y1"))
#     traces.append(go.Scatter(x = tweets_symptoms['day'],
#                             y = tweets_symptoms['has_symptom_mean_week'].values,
#                             mode = 'lines',
#                              line=dict(color='red'),
#                             name = 'Tweets symptoms (avg 7days)',
#                             yaxis="y1"))
    
#     traces.append(go.Scatter(x = open_covid.date, 
#                              y = open_covid['deaths_freq'],
#                              mode = 'lines',
#                              name = 'Deaths due to COVID',
#                              opacity=0.2,
#                              line=dict(color='green'),
#                              yaxis="y2"))
#     traces.append(go.Scatter(x = open_covid.date, 
#                              y = open_covid['deaths_3'],
#                              mode = 'lines',
#                              name =  'Deaths due to COVID (avg 3days)',
#                              line=dict(color='green'),
#                              opacity=0.5,
#                              yaxis="y2"))
#     traces.append(go.Scatter(x = open_covid.date, 
#                              y = open_covid['deaths_week'],
#                              mode = 'lines',
#                              name =  'Deaths due to COVID (avg 7days)',
#                              line=dict(color='green'),
#                              yaxis="y2"))
    
    
#     layout = go.Layout(title="Evolution of mentions of symptoms and deaths due to COVID in Ile-de-France ",
#                        legend={"x" : 1.08, "y" : 1},
#                        yaxis=dict(title='Number of tweets'),
#                        yaxis2=dict(title='Number of deaths due to COVID',
#                                    overlaying='y',
#                                    side='right'))
    
    
#     fig = go.Figure(traces, layout)
    
    
#     fig.add_shape(dict(type="rect",
#                        yref='paper',
#                        x0='2020-03-17',
#                        y0=0,
#                        x1='2020-05-11',
#                        y1=1,
#                        fillcolor="LightSalmon",
#                        opacity=0.2,
#                        layer='below',
#                        line_width=0))
    
#     fig.update_layout(annotations=[dict(
#         x='2020-04-15',
#         y=0.95,
#         yref="paper",
#         text="Shutdown (France)", showarrow=False)])
    
#     py.iplot(fig)
    
# plot_symptoms_deaths_with_ma()

In [175]:
# def plot_shifted_deaths(lag) :
    
#     # Shift : 
#     tweets_symptoms['symptom_shift'] = tweets_symptoms['has_symptom'].shift(lag)
#     tweets_symptoms['symptom_shift_3'] = tweets_symptoms['has_symptom_mean_3'].shift(lag)
#     tweets_symptoms['symptom_shift_7'] = tweets_symptoms['has_symptom_mean_week'].shift(lag)

    
#     symptoms = 'symptom_shift'
#     #label_emergency = emergencies_dict.get(type_emergency)
    
#     traces=[]
#     traces.append(go.Scatter(x = tweets_symptoms['day'],
#                             y = tweets_symptoms['symptom_shift_7'].values,
#                             mode = 'lines',
#                              line=dict(color='red'),
#                             name = 'Tweets symptoms (avg 7d)',
#                             yaxis="y1"))

#     traces.append(go.Scatter(x = open_covid.date, 
#                              y = open_covid['deaths_week'],
#                              mode = 'lines',
#                              name = 'Deaths due to COVID (avg 7d)',
#                              line=dict(color='green'),
#                              yaxis="y2"))
# #     traces.append(go.Scatter(x = urgences.date_de_passage, 
# #                              y = urgences['nbre_hospit_corona_mean_3'],
# #                              mode = 'lines',
# #                              name = 'Hospitalizations (avg 3d)',
# #                              line=dict(color='grey'),
# #                              yaxis="y2"))
    
#     layout = go.Layout(title="Evolution of mentions of symptoms (shifted 20 days) and deaths due to COVID in Ile-de-France",
#                         legend={"x" : 1.08, "y" : 1},
#                        yaxis=dict(title='Number of tweets'),
#                        yaxis2=dict(title='Number of deaths due to COVID',
#                                    overlaying='y',
#                                    side='right'))
    
#     fig = go.Figure(traces, layout)
#     py.iplot(fig)
    
# plot_shifted_deaths(lag=20)