### Social Media Data Collection (Twitter) 

The collected data consist mainly of the data gathered by Christian Lopez, Malolan Vasu, and Caleb Gallemore (2020) (Understanding the perception of COVID-19 policies by mining a multilanguage Twitter dataset. arXiv:cs.SI/2003.10359, 2020. https://arxiv.org/abs/2003.10359). <br /><br />
However, technical issues prevented them from collecting data for some dates.
<br /><br /> Therefore, data from the IEEE DataPort (Rabindra Lamsal, 2020. Coronavirus (COVID-19) Tweets Dataset. Available at: http://dx.doi.org/10.21227/781w-ef42) and from Chen E, Lerman K, Ferrara E (Tracking Social Media Discourse About the COVID-19 Pandemic: Development of a Public Coronavirus Twitter Data Set. JMIR Public Health Surveill 2020;6(2):e19273. DOI: 10.2196/19273. PMID: 32427106) were also used to form the social media dataset.

- Under Twitter's terms of use, a dataset may be shared only as a list of tweet IDs.
- To retrieve each tweet's text and date, the IDs must be hydrated, that is, looked up through the Twitter API.
- The hydrated tweets are saved to a CSV file so that they can be used for the analysis.
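
As a rough illustration of what hydration involves, the sketch below groups tweet IDs into batches before they are looked up. The batch size of 100 reflects the per-request limit of Twitter's v1.1 `statuses/lookup` endpoint, and the helper name `batch_ids` is hypothetical. twarc's `hydrate` method handles this batching internally, so the notebook never needs to call anything like it directly.

```python
from typing import Iterable, Iterator, List


def batch_ids(ids: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Yield lists of at most `size` tweet IDs, skipping blank entries.

    `batch_ids` is a hypothetical helper for illustration only.
    """
    batch: List[str] = []
    for tweet_id in ids:
        tweet_id = tweet_id.strip()
        if not tweet_id:
            continue  # ignore empty lines in an ID file
        batch.append(tweet_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


if __name__ == "__main__":
    sample_ids = [str(n) for n in range(250)]
    print([len(b) for b in batch_ids(sample_ids)])  # [100, 100, 50]
```

Each batch would correspond to one lookup request, which is why hydrating tens of thousands of IDs takes a noticeable amount of time.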

#### Pip install required libraries for tweet hydration

In [28]:
# !pip install twarc
# !pip install jsonlines
# !pip install pandas
import pandas as pd
import datetime as dt

#### Setting up Directory and Twarc authentication keys

In [29]:
import os
from IPython.display import clear_output

dirpath = os.getcwd()
print("current directory is : " + dirpath)

current directory is : C:\Users\D.Petkidis\Desktop\TweetsRS


In [30]:
from twarc import Twarc

# These keys are received after applying for a twitter developer account
consumer_key = "" 
consumer_secret = "" 
access_token = "" 
access_token_secret = ""

t = Twarc(consumer_key, consumer_secret, access_token, access_token_secret)
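
The keys above are left blank and must be filled in before running the notebook. One possible way to avoid typing them into a shared notebook is to read them from environment variables, as in the sketch below. The variable names in `KEY_NAMES` and the helper `load_twitter_keys` are assumptions made for this example, not something twarc or the original notebook defines.

```python
import os
from typing import Dict, Mapping

# Hypothetical environment variable names; use whatever names your setup defines.
KEY_NAMES = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_twitter_keys(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the four credentials, failing early if any is missing or empty."""
    missing = [name for name in KEY_NAMES if not env.get(name)]
    if missing:
        raise KeyError("Missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in KEY_NAMES}


# Possible use with the client created above (not executed here):
# keys = load_twitter_keys()
# t = Twarc(*(keys[name] for name in KEY_NAMES))
```

Failing early on a missing key is a design choice: it surfaces a configuration problem before any hydration request is sent.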

#### Choosing keywords to hydrate

In [31]:
coronavirus = True 
virus = True 
covid = True 
ncov19 = True 
ncov2019 = True 
keyword_dict = {"coronavirus": coronavirus, "virus": virus, "covid": covid, "ncov19": ncov19, "ncov2019": ncov2019}

#### Choosing start and end date (a file list with all the files corresponding to the selected dates is created)

- The start and end date of the desired hydration timeframe are selected.
- The files containing the tweet IDs for the selected keywords and dates are then picked from the corresponding folders and appended to a list.
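
The date check described above can be isolated in two small helpers, sketched here under the assumption that every file follows the `[keyword]_yyyy_mm_dd.txt` naming scheme. The names `file_date` and `in_range` are hypothetical. Unlike the notebook cell, this version returns `None` for names that do not match the pattern instead of raising an error.

```python
import datetime as dt
from typing import Optional


def file_date(filename: str) -> Optional[dt.date]:
    """Return the date encoded in '[keyword]_yyyy_mm_dd.txt', or None if absent."""
    stem = filename.rsplit(".", 1)[0]   # drop the extension
    parts = stem.split("_", 1)          # keyword, then 'yyyy_mm_dd'
    if len(parts) != 2:
        return None
    try:
        return dt.datetime.strptime(parts[1], "%Y_%m_%d").date()
    except ValueError:
        return None


def in_range(filename: str, start_date: str, end_date: str) -> bool:
    """True if the file's date lies between start_date and end_date (inclusive)."""
    day = file_date(filename)
    if day is None:
        return False
    start = dt.date.fromisoformat(start_date)
    end = dt.date.fromisoformat(end_date)
    return start <= day <= end


if __name__ == "__main__":
    print(in_range("covid_2020_03_31.txt", "2020-03-31", "2020-03-31"))
```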

In [32]:
# Insert the start and end date
start_date = '2020-03-31'
end_date = '2020-03-31'

# Parse the range limits once instead of once per file
start = dt.datetime.strptime(start_date, "%Y-%m-%d").date()
end = dt.datetime.strptime(end_date, "%Y-%m-%d").date()

files = []
covid_loc = dirpath

# Looks at each entry in the working directory
for folder in os.listdir(covid_loc):
    foldername = os.fsdecode(folder)
    folderpath = os.path.join(covid_loc, foldername)

    # The folder name starts with a keyword. We continue only for the keywords selected above
    # and skip plain files that sit next to the keyword folders
    if keyword_dict.get(foldername.split()[0].lower()) and os.path.isdir(folderpath):
        # Each file is of the format [keyword]_yyyy_mm_dd.txt
        for file in os.listdir(folderpath):
            filename = os.fsdecode(file)
            date = filename[filename.index("_")+1:filename.index(".")]

            # If the date is within the required range, it is added to the list of files to read.
            if start <= dt.datetime.strptime(date, '%Y_%m_%d').date() <= end:
                # print(dt.datetime.strptime(date, '%Y_%m_%d').date())
                # print(filename.split('_', 1)[0])
                files.append(os.path.join(folderpath, filename))

#### A sample of the IDs for each day is added to a dataframe

- A set of the selected dates is created.
- Each file is opened and the tweet IDs it contains are placed in a dataframe.
- The dataframe is then randomly sampled to reduce the volume of tweets: millions of tweet IDs are collected for each day, but the available computational resources cannot support retrieving all of them, so only a part of them is selected.
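
The parse, deduplicate, and sample steps can be sketched with the standard library alone. This is an illustrative stand-in, not the notebook's method: the notebook samples with pandas (`df.sample(n=30000, random_state=1)`), so the IDs drawn here will differ even with the same seed. The helper names `parse_id_line` and `sample_ids` are hypothetical.

```python
import random
from typing import List, Set


def parse_id_line(line: str) -> Set[str]:
    """Parse a line of the form '[id1, id2, ..., idn]' into a set of ID strings."""
    # Strip the trailing newline first so the closing bracket is removed too.
    cleaned = line.strip().strip("][").replace(" ", "")
    return {tweet_id for tweet_id in cleaned.split(",") if tweet_id}


def sample_ids(ids: Set[str], limit: int = 30000, seed: int = 1) -> List[str]:
    """Return at most `limit` IDs, drawn reproducibly for a given seed."""
    ordered = sorted(ids)  # sets have no stable order, so sort before sampling
    if len(ordered) <= limit:
        return ordered
    return random.Random(seed).sample(ordered, limit)


if __name__ == "__main__":
    day_ids = parse_id_line("[11, 12, 12, 13]\n") | parse_id_line("[13, 14]")
    print(len(day_ids), len(sample_ids(day_ids, limit=3)))
```

Using a set for the per-day IDs means that a tweet matching several keywords is counted once before sampling, which mirrors the deduplication in the cell that follows.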

In [33]:
# Gathering the set of the different dates that were selected
dates = set()

for filename in files:
    dates.add(filename.split('_', 1)[1].split('.',1)[0])
# print(dates)

# The final list is read, and each of the individual IDs is stored in a collective
# set of IDs for each date. Duplicates are removed.
for date in dates:
    ids = set()
    for filename in files:
        if date in filename:
            # print(filename.split('_', 1)[1].split('.',1)[0]) --> prints file date
            with open(filename) as f:
                # The files are of the format: [id1,id2,id3,...,idn]
                # Remove the brackets and split on commas
                for i in f.readline().strip('][').replace(" ", "").split(","):
                    ids.add(i) 
                # print(filename, len(ids))
    # Append the ids in a pandas dataframe            
    file_ids = list(ids)
    df = pd.DataFrame(file_ids)
    print(date, len(df))
    # print(filename.split('_', 1)[0].split('\\')[-1]) --> prints file keyword

    # Randomly sample 30000 tweets to hydrate for each day
    if len(df) >= 30000:
        sampledf = df.sample(n=30000, random_state=1)
    else:
        sampledf = df
    print(date)
    # Append the sampled ids for each day in the txt file that contains all the selected tweets from the desired timeframe
    sampledf.to_csv(r'final_ids_{}.txt'.format(date), header=None, mode='a+', index=False)            

2020_03_31 1100934
2020_03_31


#### Tweet Hydration

- A dataframe containing the hydrated tweets for each date and all the chosen keywords is created.
- The hydrated tweets are then converted to the necessary format. 
- For retweets, the text of the original tweet is retrieved and stored in the `RT_text` column. Duplicates are then dropped based on that text.
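
The retweet handling and deduplication can be sketched on plain dictionaries shaped like the hydrated tweet objects used in this notebook (fields `lang`, `full_text`, `retweeted_status`, and so on). The helper names are hypothetical, and one detail differs from the notebook: this sketch keeps the first occurrence in input order, whereas the notebook sorts by `RT_text` before dropping duplicates.

```python
from typing import Dict, Iterable, List


def original_text(tweet: Dict) -> str:
    """Return the retweeted tweet's text for retweets, else the tweet's own text."""
    retweeted = tweet.get("retweeted_status")
    if retweeted:
        return retweeted["full_text"]
    return tweet["full_text"]


def unique_english_tweets(tweets: Iterable[Dict]) -> List[Dict]:
    """Keep English tweets only and drop rows whose original text was already seen."""
    seen = set()
    rows = []
    for tweet in tweets:
        if tweet.get("lang") != "en":
            continue
        text = original_text(tweet)
        if text in seen:
            continue  # same original text as an earlier tweet or retweet
        seen.add(text)
        rows.append({
            "created_at": tweet["created_at"],
            "id_str": tweet["id_str"],
            "RT_text": text,
            "retweet_count": tweet["retweet_count"],
        })
    return rows
```

Deduplicating on the original text rather than on the tweet ID collapses many retweets of the same message into a single row, which is the behaviour the bullet points above describe.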

In [34]:
cols = ["created_at", "id_str", "full_text", "RT_text", "retweeted_status", "retweet_count", "place"] 
count = 0
num_save  = 1000
length = 0

for date in dates:
    # creates an empty dataframe for each date
    date_tweets = pd.DataFrame(columns = cols)
    # Hydrates the tweets that were sampled for each date
    for tweet in t.hydrate(open('final_ids_{}.txt'.format(date))):
        # if (tweet['place'] is not None) and (tweet['place']['country'] == 'United Kingdom'):
        if (tweet['lang'] == "en"):
            date_tweets.at[count, "created_at"] = tweet["created_at"]
            date_tweets.at[count, "id_str"] = tweet["id_str"]
            date_tweets.at[count, 'full_text'] = tweet["full_text"]
            date_tweets.at[count, "retweet_count"] = tweet["retweet_count"]
            if "retweeted_status" in tweet:
                date_tweets.at[count, "RT_text"] = tweet["retweeted_status"]['full_text']
            else:
                date_tweets.at[count, "RT_text"] = tweet["full_text"]
                
            count = count + 1  
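        # Progress indicator: prints whenever the number of tweets kept so far is a multiple of num_save.
        # The check runs for every tweet, so the same message repeats while non-English tweets are skipped.
        # Nothing is written to disk at this point; the CSV file is written after the loop.
        if (count % num_save) == 0:
            print("Saved " + str(count) + " hydrated tweets.")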
    # Converts created_at to datetime
    date_tweets["created_at"] = date_tweets["created_at"].astype('datetime64[ns]') 
    date_tweets["created_at"] = date_tweets.created_at.dt.to_pydatetime()
    # Drops the duplicates
    date_tweets_final = date_tweets.sort_values("RT_text") 
    date_tweets_final = date_tweets_final.drop_duplicates(subset='RT_text', ignore_index=True)
    date_tweets_final = date_tweets_final.drop(columns=['full_text', 'retweeted_status'], axis=1)
    # Calculate total number of hydrated tweets
    length = length + len(date_tweets_final)
    
    # Append hydrated tweets for each date to a file containing all the hydrated tweets for that period
    # (Change file name based on chosen dates)
    date_tweets_final.to_csv('ids_Mar31.csv', index=False, mode='a+')   # header=None

Saved 1000 hydrated tweets.
Saved 2000 hydrated tweets.
Saved 3000 hydrated tweets.
Saved 4000 hydrated tweets.
Saved 5000 hydrated tweets.
Saved 5000 hydrated tweets.
Saved 5000 hydrated tweets.
Saved 5000 hydrated tweets.
Saved 6000 hydrated tweets.
Saved 7000 hydrated tweets.
Saved 7000 hydrated tweets.
Saved 7000 hydrated tweets.
Saved 8000 hydrated tweets.
Saved 9000 hydrated tweets.
Saved 10000 hydrated tweets.
Saved 11000 hydrated tweets.
Saved 12000 hydrated tweets.
Saved 12000 hydrated tweets.
Saved 13000 hydrated tweets.
Saved 13000 hydrated tweets.
Saved 14000 hydrated tweets.
Saved 14000 hydrated tweets.
Saved 15000 hydrated tweets.
Saved 15000 hydrated tweets.
Saved 15000 hydrated tweets.


#### File transformation

- The file that contains the tweets for the chosen dates is loaded and the `created_at` column is reduced to a calendar date.
- The total number of hydrated tweets for the selected time period is printed.
- The number of hydrated tweets for each day is printed.
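
The clean-up described above can be sketched without pandas. The example assumes the CSV stores `created_at` as `YYYY-MM-DD HH:MM:SS` and that repeated header rows may appear wherever a later append wrote the column names again. The function name `daily_counts` and the sample data are made up for illustration.

```python
import csv
import io
from collections import Counter
from datetime import datetime


def daily_counts(csv_text: str) -> Counter:
    """Count tweets per calendar day, skipping header rows repeated by appends."""
    counts: Counter = Counter()
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        if row["created_at"] == "created_at":
            continue  # a header line written again by a later append
        day = datetime.fromisoformat(row["created_at"]).date()
        counts[day.isoformat()] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "created_at,id_str,RT_text\n"
        "2020-03-31 10:00:00,1,a\n"
        "created_at,id_str,RT_text\n"
        "2020-03-31 23:59:59,2,b\n"
        "2020-04-01 00:00:01,3,c\n"
    )
    print(daily_counts(sample))
```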

In [35]:
# Reads the file created from the hydration of the tweets
test_ids = pd.read_csv(r'ids_Mar31.csv', sep=",", parse_dates=False)

# The hydration step appends to this file and writes the header each time, so rows that
# only repeat the column names must be removed.
# print(test_ids[test_ids['created_at'].str.len() != 19])
test_ids = test_ids[test_ids.created_at != 'created_at'].reset_index().drop(columns=['index'])
test_ids["created_at"] = test_ids["created_at"].astype('datetime64[ns]') 
test_ids["created_at"] = test_ids.created_at.dt.to_pydatetime()
test_ids['created_at'] = test_ids['created_at'].dt.date

print('Total tweets:', len(test_ids))
# Writes the tweets in a file
test_ids.to_csv(r'ids_final_Mar31.csv', index=False)

Total tweets: 12016


In [36]:
# Prints the number of tweets for each date
test_ids.created_at.value_counts()

2020-03-31    12016
Name: created_at, dtype: int64