## Harvesting tweets using the Twitter API
Welcome to Ben's and Rienje's Jupyter Notebooks! In these notebooks, we show step by step how we arrived at the final maps and visualizations for our project on Dutch political tweets, which combines geolocation and sentiment analysis. An ArcGIS story map describing our results can be found [here](https://storymaps.arcgis.com/stories/c09bda1912464431b43bd8bc5fc7de40). This notebook shows how we harvested our tweets with the Twitter API, using the Twython module.

In [1]:
# Import the needed libraries

from twython import Twython
import json
import datetime
import pandas as pd

#### API keys

In order to use the Twitter API, we need a (free) developer account at [Twitter](https://developer.twitter.com/). Here, we can generate the API keys needed for harvesting the tweets.

In [None]:
# Get access to the Twitter API
# Placeholders: fill in the keys generated in your own developer account,
# and keep real keys out of notebooks that are shared or published

APP_KEY = 'YOUR_APP_KEY'
APP_SECRET = 'YOUR_APP_SECRET'
OAUTH_TOKEN = 'YOUR_OAUTH_TOKEN'
OAUTH_TOKEN_SECRET = 'YOUR_OAUTH_TOKEN_SECRET'
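
One way to keep the keys out of the notebook itself is to read them from environment variables. The sketch below is only an illustration of that idea: the variable names in `CREDENTIAL_VARS` and the helper `load_credentials` are hypothetical and not part of our original notebook.

```python
import os

# Hypothetical variable names; any naming scheme works as long as it
# matches what is set in the shell that starts Jupyter
CREDENTIAL_VARS = (
    "TWITTER_APP_KEY",
    "TWITTER_APP_SECRET",
    "TWITTER_OAUTH_TOKEN",
    "TWITTER_OAUTH_TOKEN_SECRET",
)


def load_credentials(environ=os.environ):
    """Return the four credentials as a tuple, failing early if one is missing."""
    missing = [name for name in CREDENTIAL_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(environ[name] for name in CREDENTIAL_VARS)
```

With the variables set, the four constants could then be filled with `APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET = load_credentials()`.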

#### Setting the parameters for our harvest
In our project, we used the REST API (the standard search endpoint), which made it possible to harvest relatively large quantities of tweets. Because this endpoint only returns tweets from roughly the last 7 days and we needed pre-election tweets, all tweets were harvested in the week leading up to election day (17 March 2021). First, we set the following parameters to narrow down our search:
* Query words
* Filters
* Date window
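
To make the date window concrete, the sketch below estimates which calendar days a single harvest run can cover. It rests on two assumptions: that the search reaches back about 7 days from the day of the request, and that `until` is exclusive (only tweets created before that date are returned). The helper `harvest_window` is hypothetical and not part of our notebook.

```python
from datetime import date, timedelta

# Assumption: approximate reach of the standard search endpoint, in days
SEARCH_WINDOW_DAYS = 7


def harvest_window(run_date, until):
    """Return the (first, last) calendar dates a search on run_date can cover."""
    first = run_date - timedelta(days=SEARCH_WINDOW_DAYS)
    # 'until' is treated as exclusive, and tweets from the future do not exist yet
    last = min(until, run_date + timedelta(days=1)) - timedelta(days=1)
    return first, last


first_day, last_day = harvest_window(date(2021, 3, 17), date(2021, 3, 17))
print(first_day, last_day)
```

Under these assumptions, a run on election day with `until` set to election day covers 10 March through 16 March 2021, which is the pre-election week.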


In [None]:
# Search terms to catch tweets mentioning parties
party_terms = '#vvd OR vvd OR #pvv OR pvv OR #cda OR cda OR #d66 OR d66 OR #groenlinks OR groenlinks OR #sp OR sp OR #pvda OR pvda OR #christenunie OR christenunie OR pvdd OR Partij+Voor+De+Dieren OR #50plus OR 50plus OR #sgp OR sgp OR #denk OR #fvd OR fvd OR Forum+voor+Democratie OR #ja21 OR ja21 OR #volt OR volt OR #bij1 OR bij1 OR #bbb OR bbb'

# Don't harvest retweets, replies or quotes
filters = ' -filter:retweets -filter:replies -filter:quotes'

# General query: the party terms followed by the filters
generalqueries = party_terms + filters

# Set an area to retrieve tweets from and a maximum date threshold (election date)
# Note: geo is not passed to the search calls in the harvest loop below
geo = '52.0907374,5.1214201,250km'
date_until = '2021-03-17'

# Tweet count needed for the harvest-loop
tweet_count = 100
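
Writing the long query string by hand is error-prone, so it could also be assembled from a list of terms. The following is a minimal sketch of that idea, not the code we ran: the helper `build_query` and the short `parties` list are hypothetical and only serve as illustration.

```python
def build_query(terms, filters=("retweets", "replies", "quotes")):
    """Join search terms with OR and append negated filter operators."""
    # Multi-word terms get '+' between words, as in the query string above
    joined = " OR ".join(term.replace(" ", "+") for term in terms)
    excluded = " ".join("-filter:" + name for name in filters)
    return (joined + " " + excluded).strip()


# Shortened example list; the real query contains many more parties
parties = ["#vvd", "vvd", "Partij Voor De Dieren"]
print(build_query(parties))
```

Keeping the terms in a list makes it easier to spot a party that is missing its hashtag or plain-text variant.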


#### Harvesting the tweets with a for loop
Because only a limited number of tweets (at most 100) can be harvested per request, we need a for loop that pages through the results and lets us harvest up to 1000 tweets per run (with thanks to Arend!).

In [None]:
# Initiate the Twython object
twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

# Set parameters for the for loop
tweets = []
MAX_ATTEMPTS = 10
COUNT_OF_TWEETS_TO_BE_FETCHED = 1000

# Harvesting loop: each pass requests one page of up to tweet_count tweets
for i in range(MAX_ATTEMPTS):
    print("=========================: " + str(i))

    if i == 0:
        # First request: start from the newest tweets before date_until
        search_results_twy = twitter.search(q=generalqueries, lang='nl', count=tweet_count, until=date_until)
    else:
        # Later requests: continue from where the previous page ended
        search_results_twy = twitter.search(q=generalqueries, lang='nl', include_entities='true', max_id=next_max_id, count=tweet_count)

    # Parse out the fields we want to keep
    for tweet in search_results_twy["statuses"]:
        # Print progress for tweets with a place or a specific profile location
        if tweet['place'] is not None:
            print("%d: %d" % (len(tweets), tweet['id']))
        elif tweet['user']['location'] is not None and tweet['user']['location'] != 'Nederland':
            print("%d: %d" % (len(tweets), tweet['id']))

        tweet_text = [tweet['id'], tweet['created_at'], tweet['user']['screen_name'], tweet['user']['location'], tweet['text'], tweet['user']['followers_count'], tweet['user']['statuses_count'], tweet['retweet_count'], tweet['favorite_count']]
        tweets.append(tweet_text)

    try:
        # Parse the returned metadata to get the max_id for the next call
        next_results_url_params = search_results_twy['search_metadata']['next_results']
        next_max_id = next_results_url_params.split('max_id=')[1].split('&')[0]
    except (KeyError, IndexError):
        # No next_results entry (or no max_id in it) means there are no more pages
        break
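
The string splitting used to pull `max_id` out of `next_results` works, but it depends on the exact layout of the query string. A more defensive variant, sketched here with the standard library only, parses the string properly. The helper `extract_max_id` is hypothetical and the example value is made up.

```python
from urllib.parse import parse_qs


def extract_max_id(next_results):
    """Return max_id from a next_results query string, or None if it is absent."""
    if not next_results:
        return None
    # Drop the leading '?' and let the standard library split the parameters
    params = parse_qs(next_results.lstrip("?"))
    values = params.get("max_id")
    return values[0] if values else None


# Made-up example in the shape of a next_results value
example = "?max_id=1371000000000000000&q=vvd&count=100&include_entities=1"
print(extract_max_id(example))
```

Returning `None` instead of raising lets the loop stop with a plain `if` check rather than relying on an exception.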

#### Saving tweets to CSV
Lastly, we save the tweets to a CSV file. Since we had many files to save, we wrote a function to automate this process.

In [None]:
# Function that automates saving tweets to CSV
def make_csv(filename, tweetdata):
    # Columns to store, in the same order as the fields collected in the harvest loop
    cols = ['id','created_at','screen_name','location','text','follower_count','statuses_count','retweet_count','favorite_count']
    # Convert to a DataFrame
    tweets_2021 = pd.DataFrame(data=tweetdata, columns=cols)
    # Remove commas and semicolons and replace newlines with spaces to create a clean CSV
    tweets_2021[cols] = tweets_2021[cols].replace({',': '', ';': '', '\\n': ' '}, regex=True)
    tweets_2021.to_csv(filename, header=True, index=False)

# Save tweets
make_csv('output/General_tweets_2021_until17_FNL.csv', tweets)
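
We stripped commas, semicolons and newlines so that any tool could read the files without trouble. As an alternative that we did not use, CSV quoting can preserve the original text instead. The sketch below shows the idea with Python's standard `csv` module and a made-up tweet text; it illustrates quoting and does not replace the function above.

```python
import csv
import io

# Made-up tweet text containing a comma, a semicolon and a newline
rows = [["id", "text"], [1, "Stem, of stem niet;\nhet is bijna zover"]]

# Write to an in-memory buffer; fields with special characters get quoted
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)

# Read the same buffer back to check that the text survived unchanged
buffer.seek(0)
restored = list(csv.reader(buffer))
print(restored[1][1] == rows[1][1])
```

Whether this is preferable depends on the tools used afterwards: quoted newlines are valid CSV, but not every GIS or spreadsheet import handles them well.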


#### Reflection
When starting this project, we initially wanted to analyse geo-tagged tweets. However, we are glad that we switched to the location provided in the users' profiles. We found that the number of tweets with a valid user location is much larger than the number of geo-tagged tweets that can be harvested from Twitter. Although only 6000 of 50 000 tweets (12%) remained for spatial plotting at the end of the analysis, this can still be seen as a relatively 'good' harvest. When the same method is applied to much larger tweet harvests (say, 500 000 or 5 million), the remaining tweets with a user location could be a very useful dataset. In addition, profile locations are arguably more reliable than geotags as an indicator of where a user *lives*. Users can send tweets from any place, which makes geotags unreliable as places of residence, whereas the profile location is more likely to reflect where someone lives.