## PART 1

In this first part, we randomly retrieve tweets posted in the early stage of the pandemic in order to analyse how COVID-19 affected people's interests and the role of social media as communication platforms. Our retrieval procedure aims at obtaining a sample that is as unbiased as possible. After retrieving data for each country (in this first analysis we start with Serbia and Italy, our homelands, to test the scalability and feasibility of our ideas), we classify the tweets according to their topic. We are interested in which topics were discussed the most, to see whether we get results similar to those shown in the CoronaWiki dataset. We think this is a good way to start analysing how people's interests shifted, how people reacted to the situation and how communication was affected by COVID-19.

Note that we focus on the first period of the pandemic also because, in Task3, we want to test whether a higher or lower interest in COVID in the period preceding the lockdown might have been a crucial factor in slowing the infection rate after the lockdown was imposed. For this analysis, we need to collect as much data as possible about the early stage of the pandemic, and we cannot use data from later stages.

In [43]:
# Useful libraries
import pandas as pd
import json
import re
import pickle
import time
from datetime import datetime, timedelta

# Twitter library
import tweepy

# Math libraries
import numpy as np

# Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Natural language processing libraries
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from textblob import TextBlob # to compute sentiment analysis on each tweet
import translators as ts

nltk.download('stopwords')

# Library to infer the topics discussed in each tweet
from empath import Empath
lexicon = Empath()
stemmer = PorterStemmer()
# Helpers file
from helpers import *

%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ricca\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### TWEETS DATASET CREATION - ITALY AND SERBIA

We start by creating lists containing the names of the analysed countries and the spoken languages.

In [2]:
# We define lists containing the country codes and the language spoken in each country
# France, Denmark, Germany, Italy, Netherlands, Norway, Sweden, Serbia, Finland, United Kingdom
total_countries = ['FR','DK','DE','IT','NL','NO','SE','RS','FI','GB']
total_languages = ['fr','da','de','it','nl','no','sv','sr','fi','en']
analysed_countries = ['IT','RS']
analysed_languages = ['it','sr']
period_per_countries = {}

In order to carry out our analysis, we need to define a period of time during the pandemic. We use pagelogs and intervention data to define this period of interest for each country. We retrieve data from the 3 weeks preceding the lockdown, since we are interested in analysing human reactions and behaviour during the first stage of the pandemic.

In [3]:
# Importing pagelogs time series
data_path = './data/'
with open(data_path+'aggregated_timeseries.json','r') as file:
    pagelogs_time_series = json.load(file)

In [4]:
# Importing intervention dates for each country
interventions = pd.read_csv(data_path + 'interventions.csv', delimiter = ',',parse_dates = ['1st case','1st death','School closure',
                                                                                            'Public events banned','Lockdown','Mobility','Normalcy'])
interventions.set_index('lang',inplace = True)
interventions

| lang | 1st case | 1st death | School closure | Public events banned | Lockdown | Mobility | Normalcy |
|---|---|---|---|---|---|---|---|
| fr | 2020-01-24 | 2020-02-14 | 2020-03-14 | 2020-03-13 | 2020-03-17 | 2020-03-16 | 2020-07-02 |
| da | 2020-02-27 | 2020-03-12 | 2020-03-13 | 2020-03-12 | 2020-03-18 | 2020-03-11 | 2020-06-05 |
| de | 2020-01-27 | 2020-03-09 | 2020-03-14 | 2020-03-22 | 2020-03-22 | 2020-03-16 | 2020-07-10 |
| it | 2020-01-31 | 2020-02-22 | 2020-03-05 | 2020-03-09 | 2020-03-11 | 2020-03-11 | 2020-06-26 |
| nl | 2020-02-27 | 2020-03-06 | 2020-03-11 | 2020-03-24 | NaT | 2020-03-16 | 2020-05-29 |
| no | 2020-02-26 | 2020-02-26 | 2020-03-13 | 2020-03-12 | 2020-03-24 | 2020-03-11 | 2020-06-04 |
| sr | 2020-03-06 | 2020-03-20 | 2020-03-15 | 2020-03-21 | 2020-03-21 | 2020-03-16 | 2020-05-02 |
| sv | 2020-01-31 | 2020-03-11 | 2020-03-18 | 2020-03-12 | NaT | 2020-03-11 | 2020-06-05 |
| ko | 2020-01-20 | 2020-02-20 | 2020-02-23 | NaT | NaT | 2020-02-25 | 2020-04-15 |
| ca | 2020-01-31 | 2020-02-13 | 2020-03-12 | 2020-03-08 | 2020-03-14 | 2020-03-16 | NaT |


In [49]:
# Defining the period of interest for each country: the dates in the 3 weeks before the lockdown
lockdown_dates = [interventions.loc['it','Lockdown'],interventions.loc['sr','Lockdown']]

for idx,country in enumerate(analysed_countries):
    lang = analysed_languages[idx]
    dates = [datetime.strptime(date.split()[0], '%Y-%m-%d')  for date in list(pagelogs_time_series[lang]['sum'].keys())]
    dates = [date for date in dates if (lockdown_dates[idx] - date < timedelta(21)) and 
             (lockdown_dates[idx] - date >  timedelta(0))]
    period_per_countries[country] = dates
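
The window selection above can be checked in isolation. The following sketch uses only the standard library and a hypothetical helper `days_before` (not part of the original notebook). It mirrors the strict inequalities of the cell above, so with a 21-day threshold it keeps the 20 days immediately preceding the lockdown. The candidate dates are made up for illustration and stand in for the keys of the pagelogs time series.

```python
from datetime import datetime, timedelta

def days_before(lockdown, candidates, n_days=21):
    """Keep candidate dates that fall strictly within n_days before lockdown.

    Hypothetical helper mirroring the filter used above:
    0 < lockdown - date < n_days.
    """
    return [d for d in candidates
            if timedelta(0) < lockdown - d < timedelta(n_days)]

# Italian lockdown date as listed in the interventions table
lockdown_it = datetime(2020, 3, 11)
# Illustrative candidate dates: every day of February and March 2020
candidates = [datetime(2020, 2, 1) + timedelta(days=i) for i in range(60)]

window = days_before(lockdown_it, candidates)
print(window[0].date(), window[-1].date(), len(window))
```

If a full 21-day window is desired, the upper comparison would need to be non-strict; whether that matters for the analysis is a design choice left to the authors.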

We want to retrieve tweets on a daily basis. To reduce the bias in our data, we retrieve tweets at randomly chosen moments of the day. Since most of the activity took place during the afternoon, we give more weight to these hours.

In [6]:
# Defining the hours of the day from which we retrieve data
hours = [11,12,13,14,15,16,17,18,19,20]
# We give more weight to the part of the day closer to dinner / late afternoon:
# the last five hours are twice as likely as the first five (the weights sum to 1)
weights = np.ones(len(hours)) / 15
weights[-5:] = weights[-5:]*2
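
As a quick sanity check, the weights above should form a valid probability distribution, with each of the five later hours twice as likely as each of the five earlier ones. The sketch below reproduces the arithmetic in plain Python and draws hours with the standard library's `random.choices` rather than `np.random.choice`; this substitution and the fixed seed are assumptions made to keep the example dependency-free. Both functions sample with replacement by default, so the two drawn hours can coincide.

```python
import math
import random

hours = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

# Base weight 1/15 for every hour, doubled for the last five hours
weights = [1 / 15] * len(hours)
weights[-5:] = [w * 2 for w in weights[-5:]]

# 5 * (1/15) + 5 * (2/15) = 1, so the weights are valid probabilities
total = sum(weights)
print(math.isclose(total, 1.0))

# Draw two start hours for one day (with replacement, so they may be equal)
random.seed(0)  # fixed seed only for reproducibility of this sketch
drawn = random.choices(hours, weights=weights, k=2)
print(drawn)
```

Because the two hours may be equal, the two retrieval windows of a day can overlap and return the same tweets, which is one reason to remove duplicates afterwards.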

We now define two helper functions to create and import the needed dataframes. As said before, we start with tweets posted by Italian and Serbian users to verify the soundness of our approach. For Milestone3, we will extend the analysis to more countries.
Note that, in order to filter the tweets we retrieve, we use a very specific query.
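
To illustrate what such a query looks like, here is a minimal sketch that assembles it from its individual operators. The helper name `build_query` is hypothetical; the operators themselves are the ones used in `create_dataframe`. Joining the parts with single spaces avoids the run of whitespace that a backslash line continuation inside a string literal would embed in the query.

```python
def build_query(country, language):
    """Assemble a Twitter search query string (hypothetical helper)."""
    filters = [
        f"place_country:{country}",  # country code of the tagged place
        f"lang:{language}",          # language of the tweet
        "-is:retweet",               # exclude retweets
        "-has:links",                # exclude tweets containing links
        "-has:media",                # exclude tweets containing media
        "-has:images",
        "-has:video_link",
        "-has:mentions",             # exclude tweets mentioning other users
    ]
    return " ".join(filters)

query = build_query("IT", "it")
print(query)
```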

In [52]:
def create_dataframe(name_country, language, period_of_interest, time_window, prob, skip_day=1, subsample=None):
    
    """
    Create a dataframe of tweets retrieved through the Twitter API and save it as a pickle file.
    
    Arguments:
        name_country: code of the country from which we retrieve tweets
        language: language spoken in the analysed country
        period_of_interest: dictionary mapping each country code to the dates from which we retrieve tweets
        time_window: list of hours from which we retrieve tweets
        prob: list of weights assigned to each hour in time_window
        skip_day: step used when iterating over the period of interest
        subsample: index of the sample of data retrieved
    """
    # Defining a list of topics based on Coronawiki Dataset
    topics = []
    # Defining support structure
    new_data = []
    output_path = './output/'+name_country
    if subsample is not None:
        # subsample is an index, so we convert it before appending it to the path
        output_path += str(subsample)
    output_path+='_tweets.pkl'
    
    # Importing Twitter API keys
    with open('./Data/BearerTokens.json', 'r') as f:
        bearer_tokens = json.load(f)

    # Define more than one client
    bearer_token1 = bearer_tokens['balsa']
    bearer_token2 = bearer_tokens['federico']
    # We initialize tweepy 
    client1 = tweepy.Client(bearer_token=bearer_token1, wait_on_rate_limit=True)
    client2 = tweepy.Client(bearer_token=bearer_token2, wait_on_rate_limit=True)
    
    for idx in range(0,len(period_of_interest[name_country]), skip_day):
        # We randomly choose the time of the day to retrieve tweets. We draw two hours in order to retrieve more data
        # (we use the time_window and prob arguments rather than the global hours and weights)
        random_hour = np.random.choice(time_window, size=2, p=prob)
        date = period_of_interest[name_country][idx]
        
        # We define start and end time to retrieve (then passed as inputs for twitter.API)
        start_time1 = datetime(date.year,date.month,date.day,random_hour[0])
        end_time1 = datetime(date.year,date.month,date.day,random_hour[0]+2)
        start_time2 = datetime(date.year,date.month,date.day,random_hour[1])
        end_time2 = datetime(date.year,date.month,date.day,random_hour[1]+2)
        
        # We define a proper query to get tweets from the country we're interested in
        query = " place_country:{} lang:{} -is:retweet -has:links -has:media -has:images \
                                    -has:video_link -has:mentions".format(name_country,language)
        
        while True:
            tweets1 = client1.search_all_tweets( query, max_results = 30, 
                                         start_time = start_time1, end_time = end_time1,
                                              tweet_fields  = ['text','context_annotations','id'])
            tweets2 = client2.search_all_tweets( query, max_results = 30, 
                                         start_time = start_time2, end_time = end_time2,
                                              tweet_fields  = ['text','context_annotations','id'])
            
            time.sleep(1)
            if tweets1.data is not None or tweets2.data is not None:
                break
            # If we do not have data, we draw new hours and retrieve once again
            random_hour = np.random.choice(time_window, size=2, p=prob)
            date = period_of_interest[name_country][idx]
            # We define start and end time to retrieve (then passed as inputs for twitter.API)
            start_time1 = datetime(date.year,date.month,date.day,random_hour[0])
            end_time1 = datetime(date.year,date.month,date.day,random_hour[0]+2)
            start_time2 = datetime(date.year,date.month,date.day,random_hour[1])
            end_time2 = datetime(date.year,date.month,date.day,random_hour[1]+2)
        
        # We perform basic preprocessing operations on both groups of retrieved tweets
        # (translation, removal of punctuation and numbers). A response without results
        # has its data set to None, so we fall back to an empty list in that case.
        for tweet in list(tweets1.data or []) + list(tweets2.data or []):
            if language != 'en':
                text = ts.google(tweet.text)
            else:
                text = tweet.text
            # We remove punctuation
            text = ("".join([ch for ch in text if ch not in string.punctuation])).lower()
            # We remove numbers
            text = re.sub(r"\d+", "", text).strip()
            # We compute sentiment analysis on the given text
            text_sentiment = TextBlob(text).sentiment
            text_polarity, text_subjectivity = text_sentiment.polarity, text_sentiment.subjectivity
            # We tokenize and stem the tweet to make later processing easier
            tokenized_stemmed_version = nltk.word_tokenize(text)
            tokenized_stemmed_version = [stemmer.stem(word) for word in tokenized_stemmed_version]
            # Saving new datapoint in new_data list
            if len(text) > 0:
                new_data.append([date,tweet.id, language,text,tokenized_stemmed_version,
                                 tweet.context_annotations,text_polarity,text_subjectivity])
            
        # We create the dataframe
        df = pd.DataFrame(new_data, columns = ['date','id','language','tweet','tokenized_tweet_list',
                                               'context_from_Twitter','polarity','subjectivity'])
        df.to_pickle(output_path)
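
The text-cleaning steps inside `create_dataframe` (punctuation removal, lowercasing, digit removal) can be tried on their own. The sketch below wraps them in a hypothetical `clean_text` helper that uses only the standard library; translation, sentiment analysis and stemming are deliberately left out, and the sample sentence is invented for illustration.

```python
import re
import string

def clean_text(text):
    """Remove punctuation, lowercase, and strip digits (hypothetical helper)."""
    # Drop ASCII punctuation characters and lowercase the result
    text = "".join(ch for ch in text if ch not in string.punctuation).lower()
    # Remove runs of digits, then trim surrounding whitespace
    return re.sub(r"\d+", "", text).strip()

cleaned = clean_text("Lockdown starts on 11 March, stay home!")
print(cleaned.split())

# A tweet made only of digits and punctuation becomes an empty string,
# which is why the function above stores a tweet only if len(text) > 0
print(repr(clean_text("123!!!")))
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic quotes or other non-ASCII symbols would pass through this step unchanged.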
        

### PIPELINE TO RETRIEVE DATA

In [None]:
# Choose the right language and country
create_dataframe('IT', 'it', period_per_countries, hours, weights, skip_day=1, subsample=None)
# Overlapping time windows can return the same tweet twice, so we drop duplicates
data = get_dataframe('IT').drop_duplicates(subset = ['date','id'])
data.shape