# Corpus Retrieval
Here we collect tweets from two date ranges: one spanning all of 2015, and the other spanning March and April of 2020. Within each range, we query the following hashtags, which have been associated with the anti-vaccination movement:

<table width="700" align="left" style="text-align:left; width:100%;">
   <tr>
      <td>
         <ul>
            <li>#sb277</li>
            <li>#cdcwhistleblower</li>
            <li>#vaccineswork</li>
            <li>#antivaccine</li>
            <li>#antivax</li>
            <li>#vaxxed</li>
            <li>#vaccinedamage</li>
            <li>#Educateb4uVax</li>
            <li>#StopMandatoryVaccination</li>
            <li>#VaççinesKill</li>
         </ul>
      </td>
      <td>
         <ul>
            <li>#vaccines</li>
            <li>#bigpharmakills</li>
            <li>#tannersdad</li>
            <li>#vaccineinjured</li>
            <li>#vaccinecult</li>
            <li>#markofthebeast</li>
            <li>#vaccineinjuryisreal</li>
            <li>#vaccinefailure</li>
            <li>#BigPharmaSins</li>
            <li>#va$$ines</li>
         </ul>
      </td>
   </tr>
</table>



In [1]:
# Import modules
import snscrape.modules.twitter as sntwitter
import json
import os
import datetime

## Scrape & Organize Tweets
The tweets will be categorized by the two aforementioned time frames and by their associated hashtag. Each combination is saved to its own file in a separate directory per time frame, and the files are combined prior to the main analyses.
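
The later combining step could look roughly like the sketch below. It assumes each file holds a JSON list of tweet dictionaries with an `id` key and that the file name begins with the hashtag followed by an underscore. The helper name `combine_tweet_files`, the de-duplication by `id`, and the added `hashtags` field are illustrative assumptions, not part of the original notebook.

```python
import json
from pathlib import Path


def combine_tweet_files(data_dir):
    """Merge every *.json file in data_dir into one list, one record per tweet id.

    Hypothetical helper: assumes each file is a JSON list of dicts with an 'id' key.
    """
    combined = {}
    for path in sorted(Path(data_dir).glob('*.json')):
        with open(path, encoding='utf-8') as f:
            tweets = json.load(f)
        # The file name stem starts with the hashtag, e.g. '#sb277_2015-01-01_to_2016-01-01'
        hashtag = path.stem.split('_')[0]
        for tweet in tweets:
            # Reuse the existing record if this tweet was already seen under another hashtag
            record = combined.setdefault(tweet['id'], {**tweet, 'hashtags': []})
            if hashtag not in record['hashtags']:
                record['hashtags'].append(hashtag)
    return list(combined.values())
```

Keying on the tweet `id` means a tweet that carries several of the queried hashtags is counted once, while the `hashtags` list records every query that returned it.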

In [2]:
# List of hashtags to query
hashtags = [
    '#sb277',
    '#cdcwhistleblower',
    '#vaccineswork',
    '#antivaccine',
    '#antivax',
    '#vaxxed',
    '#vaccinedamage',
    '#educateb4uvax',
    '#stopmandatoryvaccination',
    '#vaççineskill',
    '#vaccines',
    '#bigpharmakills',
    '#tannersdad',
    '#vaccineinjured',
    '#vaccinecult',
    '#markofthebeast',
    '#vaccineinjuryisreal',
    '#vaccinefailure',
    '#bigpharmasins',
    '#va$$ines',
]

# Time points to restrict data
PRE_START = '2015-01-01'
PRE_END = '2016-01-01'
POST_START = '2020-03-01'
POST_END = '2020-05-01'
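
As a quick sanity check, the length of each window can be computed from these constants. The sketch below assumes the end date is exclusive, which is how the `until:` search operator is commonly described; the helper name `window_days` is made up for illustration.

```python
from datetime import date

# Same values as the constants defined in the cell above
PRE_START, PRE_END = '2015-01-01', '2016-01-01'
POST_START, POST_END = '2020-03-01', '2020-05-01'


def window_days(start, end):
    """Number of days covered when `end` is treated as exclusive."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


print(window_days(PRE_START, PRE_END))    # 365
print(window_days(POST_START, POST_END))  # 61
```

The two windows differ in length by a factor of about six, which is worth keeping in mind when comparing raw tweet counts between them.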

In [3]:
# Function to download a set of tweets corresponding to a certain search query
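def download_query_tweets(query, date_since, date_until, max_tweets=1000):
    """Return up to max_tweets tweets matching query between the two dates, as a list of dicts."""
    print(f"Downloading tweets for query: '{query}' from {date_since} to {date_until} (max of {max_tweets})")

    tweet_list = []

    # Restrict the search to the requested window using the since:/until: operators
    query = f'{query} since:{date_since} until:{date_until}'

    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        # Stop once the cap is reached
        if i >= max_tweets:
            break

        # Keep only the fields needed for the analysis
        tweet_dict = {
            'id': tweet.id,
            'created_at': tweet.date.strftime('%Y-%m-%d %H:%M'),
            'text': tweet.content,
            'username': tweet.username,
        }

        tweet_list.append(tweet_dict)

    return tweet_list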
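
The two mechanical pieces of the function above, composing the query string and capping the number of results, can be tried without network access. The sketch below is one possible alternative using `itertools.islice` for the cap; the helper names `build_query` and `take_first` and the stand-in generator are assumptions made for illustration and are not part of the notebook or of snscrape.

```python
from itertools import islice


def build_query(term, date_since, date_until):
    """Compose the search string in the same format the function above uses."""
    return f'{term} since:{date_since} until:{date_until}'


def take_first(items, limit):
    """Return at most `limit` items from an iterable, pulling no more than that."""
    return list(islice(items, limit))


# Stand-in generator so the example runs offline; a real run would pass
# sntwitter.TwitterSearchScraper(query).get_items() instead.
fake_items = (f'tweet {n}' for n in range(5000))

query = build_query('#sb277', '2015-01-01', '2016-01-01')
first = take_first(fake_items, 3)
```

Compared with the `enumerate` and `break` pattern, `islice` stops after exactly `limit` items, so the scraper is never asked for one item beyond the cap.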

In [4]:
PRE_DATA_DIR = '../data/pre_covid'
POST_DATA_DIR = '../data/post_covid'

# Create the output directories if they do not already exist
os.makedirs(PRE_DATA_DIR, exist_ok=True)
os.makedirs(POST_DATA_DIR, exist_ok=True)

for tag in hashtags:
    pre_list = download_query_tweets(tag, PRE_START, PRE_END)
    post_list = download_query_tweets(tag, POST_START, POST_END)
    
    pre_outfilename = "{}/{}_{}_to_{}.json".format(PRE_DATA_DIR, tag.replace(' ','_'), PRE_START, PRE_END)
    post_outfilename = "{}/{}_{}_to_{}.json".format(POST_DATA_DIR, tag.replace(' ','_'), POST_START, POST_END)

    
    print('\t retrieved {} pre-covid tweets...\n'.format(len(pre_list)))
    print('\t retrieved {} post-covid tweets...\n'.format(len(post_list)))

    # Write each list of tweets to its own JSON file
    with open(pre_outfilename, 'w') as out:
        json.dump(pre_list, out)

    with open(post_outfilename, 'w') as out:
        json.dump(post_list, out)

Downloading tweets for query: '#sb277' from 2015-01-01 to 2016-01-01 (max of 1000)
Downloading tweets for query: '#sb277' from 2020-03-01 to 2020-05-01 (max of 1000)
	 retrieved 1000 pre-covid tweets...

	 retrieved 38 post-covid tweets...

Downloading tweets for query: '#cdcwhistleblower' from 2015-01-01 to 2016-01-01 (max of 1000)
Downloading tweets for query: '#cdcwhistleblower' from 2020-03-01 to 2020-05-01 (max of 1000)
	 retrieved 1000 pre-covid tweets...

	 retrieved 314 post-covid tweets...

Downloading tweets for query: '#vaccineswork' from 2015-01-01 to 2016-01-01 (max of 1000)
Downloading tweets for query: '#vaccineswork' from 2020-03-01 to 2020-05-01 (max of 1000)
	 retrieved 1000 pre-covid tweets...

	 retrieved 1000 post-covid tweets...

Downloading tweets for query: '#antivaccine' from 2015-01-01 to 2016-01-01 (max of 1000)
Downloading tweets for query: '#antivaccine' from 2020-03-01 to 2020-05-01 (max of 1000)
	 retrieved 1000 pre-covid tweets...

	 retrieved 140 post-c