## PART1

In this first part, we randomly retrieve tweets posted during the pandemic. The retrieval procedure is designed to yield a sample
that is as unbiased as possible. After retrieving data for each country (in this first analysis
we start with Serbia and Italy, our homelands, to test the scalability of the code), we classify the tweets according to
the topic they discuss. We want to find out which topics were discussed most and whether the results
resemble those shown in the CoronaWiki dataset. We think this is a good starting point for analysing how people's interests shifted, how people reacted to the situation, and how communication was affected by COVID-19.

In [61]:
# Useful libraries
import pandas as pd
import json
import re
import pickle
from datetime import datetime
import tweepy

# Math libraries
import numpy as np

# Data visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

# Natural language processing libraries
import nltk
import string
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from textblob import TextBlob # to compute sentiment analysis on each tweet

nltk.download('stopwords')

# Library to infer the topics discussed in each tweet
from empath import Empath 


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ricca\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### TWEETS DATASET CREATION

We start by creating a dataset of retrieved tweets for each analysed country. We save each dataframe in pickle format
for efficiency.
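
One practical consequence of the pickle format can be sketched on a made-up sample: list-valued columns, such as the tokenized tweets built later, come back as real Python lists, whereas a CSV round trip would return them as strings. The sample frame and the file names below are illustrative assumptions, not data from this project.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample mimicking two of the columns built later in this notebook
sample = pd.DataFrame({
    "tweet": ["so excited", "treated myself to apex legends"],
    "tokenized_tweet_list": [["so", "excit"], ["treat", "myself", "to", "apex", "legend"]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, "sample_tweets.pkl")
    csv_path = os.path.join(tmp_dir, "sample_tweets.csv")

    # Write the same frame in both formats
    sample.to_pickle(pkl_path)
    sample.to_csv(csv_path, index=False)

    # Read both files back before the temporary directory is removed
    from_pickle = pd.read_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)

# Pickle restores Python lists, CSV returns their string representation
print(type(from_pickle.loc[0, "tokenized_tweet_list"]))
print(type(from_csv.loc[0, "tokenized_tweet_list"]))
```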

In [62]:
# We define lists containing the country codes and the main language spoken in each country
# France, Denmark, Germany, Italy, Netherlands, Norway, Sweden, Serbia, Finland, United Kingdom
countries = ['FR','DK','DE','IT','NL','NO','SE','RS','FI','GB']
languages = ['fr','da','de','it','nl','no','sv','sr','fi','en']
period_per_countries = {}

We use the pagelogs data to define the period of interest for each country. We retrieve data from 01/12/2019 onwards, since this date is reported as the official start of the pandemic.

In [63]:
# Importing time series
data_path = './data/'
with open(data_path+'aggregated_timeseries.json','r') as file:
    pagelogs_time_series = json.load(file)

In [64]:
# Defining period of interest for each country. Dates are retrieved starting from 01/12/2019
for idx,country in enumerate(countries):
    lang = languages[idx]
    dates = [datetime.strptime(date.split()[0], '%Y-%m-%d')  for date in list(pagelogs_time_series[lang]['sum'].keys())]
    dates = [date for date in dates if (date.year >= 2020 or (date.year == 2019 and date.month == 12))]
    period_per_countries[country] = dates
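
The year/month condition used above is equivalent to a single comparison against a cutoff date. A minimal sketch on made-up keys follows; the `'YYYY-MM-DD HH:MM:SS'` key shape and the names `sample_keys` and `start_of_period` are assumptions for illustration, since the actual content of `aggregated_timeseries.json` is not shown here.

```python
from datetime import datetime

# Hypothetical keys, assumed to start with a 'YYYY-MM-DD' date followed by a time
sample_keys = ["2019-11-30 00:00:00", "2019-12-01 00:00:00", "2020-01-15 00:00:00"]

# First day of the period of interest
start_of_period = datetime(2019, 12, 1)

# Keep only the date part of each key, then drop everything before the cutoff
parsed = [datetime.strptime(key.split()[0], "%Y-%m-%d") for key in sample_keys]
kept = [date for date in parsed if date >= start_of_period]

print(kept)
```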

We want to retrieve tweets on a daily basis. To reduce bias in our data, we retrieve tweets at randomly chosen times of the day. Since most activity took place in the afternoon, we give more weight to these hours.

In [65]:
# Defining the hours of the day at which a retrieval window can start
hours = [11,12,13,14,15,16,17,18,19,20]
# We want to give more weight to the late afternoon / dinner time, so the last five hours get twice the weight of the others
weights = np.ones(len(hours)) / 15
weights[-5:] = weights[-5:]*2
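
As a quick sanity check of this weighting scheme: five hours at 1/15 plus five hours at 2/15 sum to 1, as `np.random.choice` requires, and roughly two thirds of the draws should fall in the later half of the window. The seeded generator and the sample size below are assumptions made for reproducibility; the notebook itself does not fix a seed.

```python
import numpy as np

hours = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]

# Five early hours get weight 1/15 each, five late hours get 2/15 each
weights = np.ones(len(hours)) / 15
weights[-5:] *= 2

# The probabilities passed to a weighted choice must sum to 1
assert np.isclose(weights.sum(), 1.0)

# Seeded generator (an assumption) so that the check is reproducible
rng = np.random.default_rng(0)
draws = rng.choice(hours, size=10_000, p=weights)

# Share of draws that fall on the five late hours (16 to 20)
late_share = np.mean(draws >= 16)
print(f"share of draws in the late window: {late_share:.3f}")
```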

We now define a helper function that creates the dataframes. As said before, we start with tweets posted by Italian and Serbian users to verify the soundness of our approach. For Milestone3, we will extend the analysis to a larger number of countries.

In [72]:
def create_dataframe(name_country,language,period_of_interest,time_window = hours, prob = weights, skip_day = 1):
    
    # Initialize the stemmer
    stemmer = PorterStemmer()
    lexicon = Empath()
    # Defining a list of topics based on Coronawiki Dataset
    topics = []
    # Defining support structure
    new_data = []
    output_path = './output/'+name_country+'_tweets.pkl'
    
    # We initialize tweepy. The bearer token is redacted here: supply your own and never commit real credentials
    bearer_token_balsa = "<YOUR_BEARER_TOKEN>"
    client = tweepy.Client(bearer_token=bearer_token_balsa, wait_on_rate_limit=True)
    
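    # period_of_interest is a dict mapping each country code to its list of dates
    country_dates = period_of_interest[name_country]
    for idx in range(0, len(country_dates), skip_day):
        # We randomly choose the time of the day to retrieve tweets, using the function arguments
        random_hour = int(np.random.choice(time_window, p = prob))
        date = country_dates[idx]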
        
        # We define start and end time to retrieve (then passed as inputs for twitter.API)
        start_time = datetime(date.year,date.month,date.day,random_hour)
        end_time = datetime(date.year,date.month,date.day,random_hour+2)
        
        # We define a proper query to get tweets from the country we're interested in
        query = ("place_country:{} lang:{} -is:retweet -has:links -has:media -has:images "
                 "-has:video_link -has:mentions").format(name_country,language)
        tweets = client.search_all_tweets( query, max_results = 10, 
                                     start_time = start_time, end_time = end_time,
                                          tweet_fields  = ['text','context_annotations'])
        
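        # We perform basic preprocessing operations on the text (removal of punctuation and numbers)
        # tweets.data is None when the search returns no results, so we fall back to an empty list
        for tweet in tweets.data or []:
            # TODO: translate non-English tweets before further processing
            # We remove punctuation and lowercase the text
            text = ("".join([ch for ch in tweet.text if ch not in string.punctuation])).lower()
            # We remove numbers
            text = re.sub(r"\d+", "",text).strip()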
            # We compute sentiment analysis on the given text
            text_sentiment = TextBlob(text).sentiment
            text_polarity, text_subjectivity = text_sentiment.polarity, text_sentiment.subjectivity
            # We infer the discussed topic using Empath()
            # discussed_topic = lexicon.analyze(text, categories = topics, normalize = True)
            # We tokenize the tweet and stem each token
            tokenized_stemmed_version = nltk.word_tokenize(text)
            tokenized_stemmed_version = [stemmer.stem(word) for word in tokenized_stemmed_version]
            # Saving new datapoint in new_data list
            if len(text) > 0:
                new_data.append([date,language,text,tokenized_stemmed_version,
                             tweet.context_annotations,text_polarity,text_subjectivity])
            
        # We rebuild the dataframe and overwrite the pickle after each processed day
        df = pd.DataFrame(new_data, columns = ['date','language','tweet','tokenized_tweet_list',
                                               'context_from_Twitter','polarity','subjectivity'])
        df.to_pickle(output_path)
        
        
def get_dataframe(name_country):
    get_path = './output/'+name_country+'_tweets.pkl'
    return pd.read_pickle(get_path)
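
Two pieces of `create_dataframe` can be looked at in isolation: the search query and the two-hour retrieval window. The sketch below is a minimal illustration under stated assumptions. `build_query` and `build_window` are hypothetical helper names that do not appear in the notebook, and `timedelta` replaces `random_hour+2` so that the end time stays valid even for a start hour late in the evening (this is not needed for the 11 to 20 range used above).

```python
from datetime import datetime, timedelta


def build_query(country_code, language):
    """Join the search operators used in create_dataframe into one query string."""
    operators = [
        f"place_country:{country_code}",
        f"lang:{language}",
        "-is:retweet",
        "-has:links",
        "-has:media",
        "-has:images",
        "-has:video_link",
        "-has:mentions",
    ]
    return " ".join(operators)


def build_window(day, start_hour, length_hours=2):
    """Return (start_time, end_time) for a window starting at start_hour on the given day."""
    start_time = datetime(day.year, day.month, day.day, start_hour)
    return start_time, start_time + timedelta(hours=length_hours)


print(build_query("GB", "en"))
print(build_window(datetime(2020, 4, 29), 20))
```

Building the query from a list avoids the stray whitespace that a backslash continuation inside a string literal introduces.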
        

In [78]:
# Quick test on a single day for GB (index 150 of its period of interest)
periodo = {}
periodo['GB'] = period_per_countries['GB'][150:151]
create_dataframe('GB','en',periodo)
df = get_dataframe('GB')
df.head()

Unnamed: 0,date,language,tweet,tokenized_tweet_list,context_from_Twitter,polarity,subjectivity
0,2020-04-29,en,so excited,"[so, excit]",[],0.375,0.75
1,2020-04-29,en,projects include learning about and educating ...,"[project, includ, learn, about, and, educ, oth...","[{'domain': {'id': '65', 'name': 'Interests an...",-0.133333,0.533333
2,2020-04-29,en,is it just me or anyone else feeling a bit lik...,"[is, it, just, me, or, anyon, els, feel, a, bi...","[{'domain': {'id': '123', 'name': 'Ongoing New...",0.266667,0.35
3,2020-04-29,en,this march and april ive ordered more stuff on...,"[thi, march, and, april, ive, order, more, stu...","[{'domain': {'id': '45', 'name': 'Brand Vertic...",0.325,0.575
4,2020-04-29,en,love catching luke up on my fave tiktoks of th...,"[love, catch, luke, up, on, my, fave, tiktok, ...",[],0.55,0.75


In [79]:
df

Unnamed: 0,date,language,tweet,tokenized_tweet_list,context_from_Twitter,polarity,subjectivity
0,2020-04-29,en,so excited,"[so, excit]",[],0.375,0.75
1,2020-04-29,en,projects include learning about and educating ...,"[project, includ, learn, about, and, educ, oth...","[{'domain': {'id': '65', 'name': 'Interests an...",-0.133333,0.533333
2,2020-04-29,en,is it just me or anyone else feeling a bit lik...,"[is, it, just, me, or, anyon, els, feel, a, bi...","[{'domain': {'id': '123', 'name': 'Ongoing New...",0.266667,0.35
3,2020-04-29,en,this march and april ive ordered more stuff on...,"[thi, march, and, april, ive, order, more, stu...","[{'domain': {'id': '45', 'name': 'Brand Vertic...",0.325,0.575
4,2020-04-29,en,love catching luke up on my fave tiktoks of th...,"[love, catch, luke, up, on, my, fave, tiktok, ...",[],0.55,0.75
5,2020-04-29,en,mcu loki but its matt damon in dogma instead,"[mcu, loki, but, it, matt, damon, in, dogma, i...","[{'domain': {'id': '10', 'name': 'Person', 'de...",0.0,0.0
6,2020-04-29,en,just finished recording episode and legit wow...,"[just, finish, record, episod, and, legit, wow...",[],0.3,0.8
7,2020-04-29,en,oh and skin care apparently we should all have...,"[oh, and, skin, care, appar, we, should, all, ...","[{'domain': {'id': '65', 'name': 'Interests an...",0.155556,0.402381
8,2020-04-29,en,treated myself to apex legends 😊😊,"[treat, myself, to, apex, legend, 😊😊]","[{'domain': {'id': '71', 'name': 'Video Game',...",0.0,0.0
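
The `tweet` column above is the result of the cleaning steps inside `create_dataframe`. In isolation, these steps can be sketched as follows; `clean_text` is a hypothetical helper name and the input strings are made up for illustration.

```python
import re
import string


def clean_text(raw_text):
    """Strip ASCII punctuation, lowercase the text, and remove digits."""
    no_punctuation = "".join(ch for ch in raw_text if ch not in string.punctuation)
    return re.sub(r"\d+", "", no_punctuation.lower()).strip()


# Emoji are not part of string.punctuation, so they survive the cleaning
print(clean_text("Week 3 of lockdown... so excited!"))
print(clean_text("Apex Legends 😊😊"))
```

Note that removing digits can leave a double space behind, and that only ASCII punctuation is stripped, which is why emoji remain in the last row of the output above.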
