## Tweet Reduction Pipeline

This is a data reduction pipeline for tweets generated from **tweet Ids**.
You can obtain millions of 2020 Covid-19 tweet Ids from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LW0BTB. You are free to use other websites that provide tweet Ids on your topic of interest. Once obtained, the tweet Ids can be used to retrieve the corresponding tweets, along with additional information, using the **twarc library**. Documentation on how to install, configure and use twarc can be found at https://scholarslab.github.io/learn-twarc/06-twarc-command-basics.

<div class="alert alert-block alert-warning">

<b>!! Attention !!</b> A large number of tweet Ids can result in tens of GB of data. Ensure you have enough storage space and a fast Internet connection.

</div>

First, we begin by importing necessary libraries.

In [2]:
import pandas as pd
import sys
sys.setrecursionlimit(10000)
import os
import warnings
warnings.filterwarnings('ignore')

We obtained our tweet Ids from https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/LW0BTB. These are only Ids; the corresponding tweets are retrieved using the **twarc** library's `hydrate` command. The process of retrieving the tweets from their Ids is known as **hydrating**.

<div class="alert alert-block alert-warning">

<b>!! Attention !!</b> This is a memory-intensive task. For instance, it took 13 hours on a computer with a 6GB GPU and a 10Mbps Internet connection. The faster the Internet connection, the faster the process.

</div>

This can be done from the **terminal** on Linux-based systems or the **Command Prompt** on Windows.

In [None]:
%%time
!twarc hydrate tweets_id2.txt > full_tweets2.json

The resulting file can be very large (20GB), depending on how many tweet Ids you hydrated, and may not fit in memory when read in one go. We therefore split it in the terminal into files of 100000 lines each using the following command.

In [92]:
#!split -l 100000 full_tweets1.json
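
Where the `split` command is not available (for example on Windows), the same chunking can be sketched in plain Python. This is a minimal sketch, not part of the original pipeline: the function name `split_lines`, the demo file name and the temporary folder are hypothetical. It writes the chunks directly as `0.json`, `1.json`, ..., which is the naming the later loops expect.

```python
import os
import tempfile


def split_lines(src_path, out_dir, lines_per_file=100_000):
    """Split a line-delimited file into numbered chunks: 0.json, 1.json, ..."""
    os.makedirs(out_dir, exist_ok=True)
    chunk_index = 0
    out = None
    with open(src_path, "r", encoding="utf-8") as src:
        for line_number, line in enumerate(src):
            # Start a new chunk file every `lines_per_file` lines.
            if line_number % lines_per_file == 0:
                if out is not None:
                    out.close()
                chunk_path = os.path.join(out_dir, f"{chunk_index}.json")
                out = open(chunk_path, "w", encoding="utf-8")
                chunk_index += 1
            out.write(line)
    if out is not None:
        out.close()
    return chunk_index  # number of chunk files written


# Demonstration on a throwaway 25-line file, split every 10 lines.
demo_dir = tempfile.mkdtemp()
demo_src = os.path.join(demo_dir, "full_tweets_demo.json")
with open(demo_src, "w", encoding="utf-8") as f:
    for n in range(25):
        f.write('{"id": %d}\n' % n)

demo_out = os.path.join(demo_dir, "files")
n_chunks = split_lines(demo_src, demo_out, lines_per_file=10)
print(n_chunks, sorted(os.listdir(demo_out)))
```

Because the file is streamed line by line, only one line is held in memory at a time, so the size of the hydrated file should not matter.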

The following code renames the individual files produced by the split.

In [23]:
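path = '/home/makavelli/Documents/TUK/DARA Big Data/files/'  # folder holding the split files
files = sorted(os.listdir(path))  # sort so the numbering follows the split order (xaa, xab, ...)

for index, file in enumerate(files):
    # rename each chunk to 0.json, 1.json, ... so the next step can open it by index
    os.rename(os.path.join(path, file), os.path.join(path, ''.join([str(index), '.json'])))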

We now read each of the JSON files, create new dataframes with the selected columns of interest, and save them as CSV files in a folder called CSV.

In [3]:
path = '/home/makavelli/Documents/TUK/DARA Big Data/files/'

<div class="alert alert-block alert-info">
    
<b>Note:</b> You need to manually create the **'CSV'** folder in your working directory.

</div>
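
As an alternative to creating the folders by hand, they can be created from Python. The sketch below is only an illustration: `base` is a temporary stand-in for your working directory (an assumption), and the folder names are the ones used in this notebook.

```python
import os
import tempfile

# Stand-in for the working directory; replace with your own path (assumption).
base = tempfile.mkdtemp()

# Output folders used by the steps of this pipeline.
for name in ["CSV", "langreduced", "retweetFree", "retweetsNo"]:
    # exist_ok=True makes the call safe to repeat if the folder already exists.
    os.makedirs(os.path.join(base, name), exist_ok=True)

print(sorted(os.listdir(base)))
```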

In [4]:
%%time
lists = os.listdir('/home/makavelli/Documents/TUK/DARA Big Data/files/')  # Where to read files from
files = len(lists)  # Number of files, used as the loop bound
path1 = '/home/makavelli/Documents/TUK/DARA Big Data/CSV/'  # Where to store the resulting files
for i in range(files):
    try:
        df = pd.read_json(path+str(i)+".json", lines=True)  # one tweet (JSON object) per line
        df1 = df[['created_at', 'full_text', 'user', 'lang']]  # Twitter data columns to keep
        df1.to_csv(path1+str(i)+'.csv')
    except Exception as e:
        print(f'Skipping {i}.json: {e}')  # report the failure instead of silently ignoring it

CPU times: user 12min 15s, sys: 38.2 s, total: 12min 53s
Wall time: 13min 1s
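
If splitting the file beforehand is not convenient, pandas can also read a line-delimited JSON file in pieces through the `chunksize` argument of `pd.read_json`. The following is a minimal sketch on three made-up tweets held in memory; the sample records and the chunk size of 2 are assumptions for illustration only.

```python
import io

import pandas as pd

# Three made-up tweets, one JSON object per line (illustrative data only).
sample = io.StringIO(
    '{"id": 1, "created_at": "Tue Mar 03 17:47:49 +0000 2020", '
    '"full_text": "first tweet", "user": {"id": 10}, "lang": "en"}\n'
    '{"id": 2, "created_at": "Tue Mar 03 17:47:50 +0000 2020", '
    '"full_text": "deuxieme tweet", "user": {"id": 11}, "lang": "fr"}\n'
    '{"id": 3, "created_at": "Tue Mar 03 17:47:51 +0000 2020", '
    '"full_text": "third tweet", "user": {"id": 12}, "lang": "en"}\n'
)

columns = ['created_at', 'full_text', 'user', 'lang']

# With lines=True and chunksize set, read_json returns an iterator of dataframes.
reader = pd.read_json(sample, lines=True, chunksize=2)
parts = [chunk[columns] for chunk in reader]  # keep only the columns of interest

selected = pd.concat(parts, ignore_index=True)
print(len(parts), list(selected.columns))
```

In a real run each chunk would be written to its own CSV file rather than concatenated, so that memory use stays bounded by the chunk size.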


We now read the saved CSV files, remove the tweets that are not in English, and save the result in a folder called "langreduced".

<div class="alert alert-block alert-info">
    
<b>Note:</b> You need to manually create the **'langreduced'** folder in your working directory.

</div>

In [5]:
%%time
lists = os.listdir('/home/makavelli/Documents/TUK/DARA Big Data/CSV/')  # Where to read files from
files = len(lists)  # Number of files, used as the loop bound
path2 = '/home/makavelli/Documents/TUK/DARA Big Data/langreduced/'  # Where to store the resulting files
for i in range(files):
    try:
        df2 = pd.read_csv(path1+str(i)+'.csv')
        df3 = df2[df2['lang'] == 'en']  # keep only the tweets tagged as English
        df3.to_csv(path2+str(i)+'.csv', index=False)
    except Exception as e:
        print(f'Skipping {i}.csv: {e}')  # report the failure instead of silently ignoring it

CPU times: user 1min 19s, sys: 3.77 s, total: 1min 22s
Wall time: 1min 23s


To ensure that our data is free of redundancy, we need to remove any retweets, as they might compromise its quality.
We first create a new, empty column labelled "Retweet" in each of the CSV files and store the files in a new directory called "retweetFree".

<div class="alert alert-block alert-info">
    
<b>Note:</b> You need to manually create the **'retweetFree'** folder in your working directory.

</div>

In [6]:
%%time
lists = os.listdir('/home/makavelli/Documents/TUK/DARA Big Data/langreduced/')  # Where to read files from
files = len(lists)  # Number of files, used as the loop bound
path3 = '/home/makavelli/Documents/TUK/DARA Big Data/retweetFree/'  # Where to store the resulting files
for i in range(files):
    try:
        df4 = pd.read_csv(path2+str(i)+'.csv')
        df4['Retweet'] = ""  # empty column, filled in during the next step
        df4.to_csv(path3+str(i)+'.csv', index=False)  # to_csv returns None when given a path, so nothing is assigned
    except Exception as e:
        print(f'Skipping {i}.csv: {e}')  # report the failure instead of silently ignoring it

CPU times: user 1min 6s, sys: 2.94 s, total: 1min 9s
Wall time: 1min 10s


We now create a loop that reads all the CSV files in path3, goes through the rows of each file, and assigns *'Yes'* to the newly created **'Retweet'** column if the **full_text** column contains the string *'RT @'*, and *'No'* otherwise. We then select only the rows marked *'No'* and save the corresponding dataframe to the folder called *'retweetsNo'*.

<div class="alert alert-block alert-info">
    
<b>Note:</b> You need to manually create the **'retweetsNo'** folder in your working directory.

</div>

In [16]:
%%time
path4 = '/home/makavelli/Documents/TUK/DARA Big Data/retweetsNo/'  # Where to store the results
lists = os.listdir(path3)  # Files produced by the previous step
files = len(lists)  # Number of files, used as the loop bound
for i in range(files):
    try:
        df6 = pd.read_csv(path3+str(i)+'.csv')  # read the files in path3
        rows = df6.shape[0]  # number of rows in this dataframe
        flags = []  # one 'Yes'/'No' label per row
        for k in range(rows):
            if 'RT @' in str(df6['full_text'].loc[k]):  # check for tweets with RT @
                flags.append('Yes')  # tweet contains RT @
            else:
                flags.append('No')  # tweet does not contain RT @
        df6['Retweet'] = flags  # assign the whole column at once, avoiding chained assignment
        df6 = df6[df6.Retweet == 'No']  # keep only the tweets that are not retweets
        df6.to_csv(path4+str(i)+'.csv', index=False)  # save the dataframe as a CSV file in path4
    except Exception as e:
        print(f'Skipping {i}.csv: {e}')  # report the failure instead of silently ignoring it

CPU times: user 15min 48s, sys: 6.34 s, total: 15min 55s
Wall time: 15min 45s
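
Most of the time in the step above is spent in the row-by-row Python loop. The same *'RT @'* check can be expressed with pandas string methods, which usually run much faster. The snippet below is a sketch on a small made-up dataframe, not the code that produced the timings above; the sample tweets are assumptions.

```python
import pandas as pd

# Small made-up sample standing in for one of the CSV files (illustrative data only).
df = pd.DataFrame({
    "full_text": [
        "RT @who: wash your hands",
        "Stay safe everyone",
        None,  # missing text is treated as not being a retweet
        "so true RT @user: stay home",
    ],
})

# Literal substring search over the whole column at once (regex=False: no pattern matching).
is_retweet = df["full_text"].fillna("").str.contains("RT @", regex=False)

# Same 'Yes'/'No' labels as in the loop-based version.
df["Retweet"] = is_retweet.map({True: "Yes", False: "No"})

# Keep only the tweets that are not retweets.
retweet_free = df[df["Retweet"] == "No"]
print(df["Retweet"].tolist())
```

As in the loop-based version, a tweet is flagged when *'RT @'* appears anywhere in the text, not only at the start.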


We then inspect one of the resulting files to confirm that the retweets have been removed.

In [17]:
df8 = pd.read_csv(path4+'1.csv')
df8

Unnamed: 0.1,Unnamed: 0,created_at,full_text,user,lang,Retweet
0,9,2020-03-03 17:47:49+00:00,CORNAVIRUS FULL STORY : WATCH HERE : https://t...,"{'id': 1167737750075764741, 'id_str': '1167737...",en,No
1,10,2020-03-03 17:47:49+00:00,A bit bungling this from Boris.\nBut far worse...,"{'id': 43730789, 'id_str': '43730789', 'name':...",en,No
2,14,2020-03-03 17:47:49+00:00,"Don’t be panic, just be careful about corona v...","{'id': 372364297, 'id_str': '372364297', 'name...",en,No
3,16,2020-03-03 17:47:50+00:00,"'The pope is dead',\n\n In the HBO tv show a v...","{'id': 1137155780342439936, 'id_str': '1137155...",en,No
4,30,2020-03-03 17:47:50+00:00,Breaking: #CoronaOutbreak has killed 6 more pe...,"{'id': 20420845, 'id_str': '20420845', 'name':...",en,No
...,...,...,...,...,...,...
11803,99924,2020-03-03 19:15:44+00:00,@DrTedros @WHO @Twitter We wish and pray to Go...,"{'id': 1221856187660099588, 'id_str': '1221856...",en,No
11804,99938,2020-03-03 19:15:45+00:00,😔💔💔💔💔💔 Prayers aren't going to help us. Our ad...,"{'id': 2796385418, 'id_str': '2796385418', 'na...",en,No
11805,99961,2020-03-03 19:15:46+00:00,Stephanie Grisham' Mr President '\nDonald J. T...,"{'id': 345676271, 'id_str': '345676271', 'name...",en,No
11806,99985,2020-03-03 19:15:47+00:00,7th person confirmed to have died in relation ...,"{'id': 12137172, 'id_str': '12137172', 'name':...",en,No
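
In the output above, the `user` column appears to hold the text form of a Python dictionary, since the nested user object was written to CSV as a string. If individual user fields are needed later, such a string can be turned back into a dictionary with `ast.literal_eval` from the standard library. The cell value below is a shortened, partly made-up example: the `name`, `verified` and `location` values are hypothetical.

```python
import ast

# Shortened example of one 'user' cell as it is stored in the CSV (values partly made up).
user_cell = (
    "{'id': 43730789, 'id_str': '43730789', 'name': 'Example User', "
    "'verified': False, 'location': None}"
)

# literal_eval parses Python literals only (dicts, strings, numbers, True/False/None)
# and does not execute arbitrary code.
user = ast.literal_eval(user_cell)

print(user["id_str"], user["name"])
```

On a dataframe, the same function could be applied to the whole column, for example with `df8['user'].apply(ast.literal_eval)`.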


In [24]:
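lists = os.listdir(path4)
files = len(lists)  # Number of files, used as the loop bound
tot = 0  # running total of rows across all files
for i in range(files):
    try:
        df8 = pd.read_csv(path4+str(i)+'.csv')
        rows = df8.shape[0]
        tot += rows
    except Exception as e:
        print(f'Skipping {i}.csv: {e}')  # report the failure instead of silently ignoring it
print('The total remaining tweets are '+str(tot)+' tweets')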

The total remaining tweets are 525174 tweets


Your data is now ready to use.

## Appendix

The following lines of code rename the downloaded files numerically.

In [None]:
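path = 'specify path here/'
files = sorted(os.listdir(path))  # sort so the numbering is reproducible

for index, file in enumerate(files):
    # rename each file to 0.csv, 1.csv, ...; change the extension if your files are not CSV
    os.rename(os.path.join(path, file), os.path.join(path, ''.join([str(index), '.csv'])))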