# Appendix Code: 1


## Twitter Data Collection
### Using the Search API (RESTful endpoint)

This notebook exemplifies the data collection script we used to scrape tweets. We ran the script three times with different specifications for the language setting (``da`` for Danish, ``de`` for German and ``pl`` for Polish).  

We use the Python package ``tweepy`` to collect the tweets.

* _Tweet object:_ The resulting tweets come as tweepy status objects. Each one has a ``_json`` attribute that holds all relevant information about the tweet as a nested ``dict`` of ``dicts``.

* _Saving tweet objects:_ We write the entire ``_json`` attribute to a csv file, one tweet per row, because csv is a compact file format. We do not parse the tweet object yet but keep it whole, so that we do not lose any information that might become relevant later on.

* _Timeframe:_ We can retrieve tweets which were posted during the last seven days.

* _Query:_ A query can be at most 500 characters long. This is a problem because our keywords, combined into a single query, would be considerably longer than that. We therefore loop through the keywords and send a separate query for each one.
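
The query limit can be illustrated with a small sketch. The keyword list below is made up (the real one is loaded from ``final_keywords.csv`` further down), and joining keywords with ``OR`` is only shown as the alternative we did not use:

```python
# Hypothetical keyword list; the real keywords come from 'final_keywords.csv'
keywords = [f"keyword{i:03d}" for i in range(100)]

MAX_QUERY_LENGTH = 500  # character limit of a single search query (see above)

# Alternative A: one combined query, which is far too long for a single request
combined_query = " OR ".join(keywords)
print(f"Combined query length: {len(combined_query)}")

# Alternative B (used in this notebook): one query per keyword
queries = list(keywords)
too_long = [q for q in queries if len(q) > MAX_QUERY_LENGTH]
print(f"{len(queries)} queries, {len(too_long)} of them above the limit")
```

Sending one query per keyword costs more requests, but it keeps every query short and makes it easy to see which keyword returned which tweets.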

In [None]:
# import packages
# note: this notebook was written for tweepy 3.x; tweepy 4 renamed `TweepError` and `api.search`
import tweepy
from tweepy import TweepError
import csv
from datetime import datetime
from contextlib import suppress

In [None]:
# get credentials from local file "AppCred.py" located in the directory you are working in
from AppCred import CONSUMER_KEY, CONSUMER_SECRET
from AppCred import ACCESS_TOKEN, ACCESS_TOKEN_SECRET
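
A minimal sketch of what ``AppCred.py`` could look like is given below. The four variable names are the ones imported above; the values are placeholders, not real credentials:

```python
# AppCred.py (hypothetical contents; replace the placeholders with your own keys)
CONSUMER_KEY = "your-consumer-key"
CONSUMER_SECRET = "your-consumer-secret"
ACCESS_TOKEN = "your-access-token"
ACCESS_TOKEN_SECRET = "your-access-token-secret"
```

Keeping the credentials in a separate file means the notebook itself can be shared without exposing them.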

In [None]:
# authentication
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# create instance of the API class
api = tweepy.API(auth, 
          wait_on_rate_limit=True,
          wait_on_rate_limit_notify=True)

## 1: Retrieving tweets without further query specifications

This will return a mixture of original tweets and retweets. Further down, we run variations of the query to only collect original tweets or to only collect retweets. Moreover, the Search API allows us to specify the ``result_type`` we want to retrieve: We can choose either ``recent``, ``popular`` or ``mixed``. In order to get the most comprehensive sample, we will run the scraper for all three result types.
  
This will most likely result in overlaps between the different variations of the query. Before analysing the data, we therefore have to remove duplicates (we will do that in another notebook).
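
The deduplication itself is done in another notebook. As a rough sketch of the idea, and assuming each tweet is available as a ``dict`` with the ``id_str`` field of the tweet JSON, duplicates could be dropped like this (``drop_duplicates`` and the sample records are made up for illustration):

```python
def drop_duplicates(tweets):
    """Keep the first occurrence of each tweet, identified by its 'id_str'."""
    seen = set()
    unique = []
    for tweet in tweets:
        tweet_id = tweet["id_str"]
        if tweet_id not in seen:
            seen.add(tweet_id)
            unique.append(tweet)
    return unique


# Made-up records standing in for tweet dicts from overlapping queries
sample = [
    {"id_str": "1", "full_text": "first tweet"},
    {"id_str": "2", "full_text": "second tweet"},
    {"id_str": "1", "full_text": "first tweet"},
]

unique_sample = drop_duplicates(sample)
print(f"{len(sample)} collected, {len(unique_sample)} unique")
```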

In [None]:
# loading the keywords

keywords = []

with open('final_keywords.csv', 'r', encoding='utf8') as file:
    reader = csv.reader(file)
    
    for row in reader:
        keywords.append(row[0])

keywords

In [None]:
# DATA COLLECTION

# list to store tweet object in
data = []

# result type: we can choose 'recent', 'popular' or 'mixed' and we want all tweets for all the different types
result_type_list = ['recent', 'popular', 'mixed']

# open the file in 'append' mode to store the tweets
# specify encoding as 'utf8' (because of German letters such as ä, ö, ü)
# specify newline="": if this code is run on a Windows machine and we don't specify this parameter,
# then the csv writer adds an empty line after each entry and we don't want that

with open('tweets2.csv', 'a', encoding='utf8', newline="") as outfile:
    
    # writer instance
    writer = csv.writer(outfile)

    # iterate through the result_type_list:
    for result_type in result_type_list:
        
        # iterate through the keywords
        for word in keywords:

            # suppress TweepError (e.g. a connection error):
            # if the error is raised, the remaining pages for this keyword are skipped and
            # the loop continues with the next keyword instead of the script being interrupted
            with suppress(TweepError):

                # we use the tweepy pagination (i.e. we get pages of tweet results back)

                # Cursor specifications:
                # 1- api.search: use the search endpoint
                # 2- q: use the current iteration through the query list as the query term
                # 3- tweet_mode: "extended", i.e. get tweets up to 280 chars
                # 4- count: return 100 tweets per page (100 is the maximum)
                # 5- result_type: get tweets based on the recent/popular/mixed specification
                # 6- include_entities: include entities part of the tweet (just in case it might be relevant)
                # 7- lang: set language to 'de' (= German), 'da' (= Danish) or 'pl' (= Polish)

                for page in tweepy.Cursor(api.search,
                                          q=word,
                                          tweet_mode="extended", 
                                          count=100,
                                          result_type=result_type,
                                          include_entities=True,
                                          lang='de').pages():

                    # add tweet objects to the 'data' list
                    data.extend(page)

                    # iterate through the objects on the current page
                    for item in page:

                        # write the ._json part of the tweet object (i.e. the part that comes as dict of dicts) to file
                        writer.writerow([str(item._json)])

            # print the cumulative number of tweets we've collected so far
            print(f"Tweets collected so far (after '{word}', result type '{result_type}'): {len(data)}.")
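
Because each row holds ``str(item._json)``, the file contains Python ``dict`` literals (single quotes, ``True``/``False``/``None``) rather than strict JSON, so ``json.loads`` would not accept the cells. A sketch of one way to read the rows back is shown below; it uses ``ast.literal_eval`` and an in-memory buffer with a made-up record instead of the real csv file:

```python
import ast
import csv
import io

# Made-up record standing in for item._json
tweet = {"id_str": "1", "full_text": 'Grüße, "Welt"', "retweeted": False, "place": None}

# Write it the same way as the collection cell: one cell per row holding str(dict)
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow([str(tweet)])

# Read the rows back and turn each cell into a dict again
buffer.seek(0)
restored = [ast.literal_eval(row[0]) for row in csv.reader(buffer)]

print(restored[0]["full_text"])
```

``ast.literal_eval`` only evaluates Python literals, which makes it a safer choice than ``eval`` for text read from a file.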

## 2: Retrieving original tweets (no retweets)

This is where we scrape original tweets, i.e. we filter out all retweets. We will save the results in a different file.
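
The three collection rounds differ only in the search operator appended to each keyword. A small sketch of the three query variants, with ``build_queries`` as a hypothetical helper that is not part of the original script:

```python
def build_queries(keyword):
    """Return the query variants used in the three collection rounds."""
    return {
        "all": keyword,                                  # round 1: tweets and retweets
        "no_retweets": f"{keyword} -filter:retweets",    # round 2: original tweets only
        "only_retweets": f"{keyword} filter:retweets",   # round 3: retweets only
    }


# 'klima' is a made-up example keyword
for name, query in build_queries("klima").items():
    print(f"{name}: {query}")
```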

In [None]:
# loading the keywords
# here, we filter out retweets

keywords_filter_1 = []

with open('final_keywords.csv', 'r', encoding='utf8') as file:
    reader = csv.reader(file)
    
    for row in reader:
        keywords_filter_1.append(row[0] + ' -filter:retweets')

keywords_filter_1

In [None]:
# DATA COLLECTION

# list to store tweet object in
data = []

# result type: we can choose 'recent', 'popular' or 'mixed' and we want all tweets for all the different types
result_type_list = ['recent', 'popular', 'mixed']

# open the file in 'append' mode to store the tweets
# specify encoding as 'utf8' (because of German letters such as ä, ö, ü)
# specify newline="": if this code is run on a Windows machine and we don't specify this parameter,
# then the csv writer adds an empty line after each entry and we don't want that

with open('tweets_no_rt2.csv', 'a', encoding='utf8', newline="") as outfile:
    
    # writer instance
    writer = csv.writer(outfile)

    # iterate through the result_type_list:
    for result_type in result_type_list:
        
        # iterate through the keywords (incl. filter)
        for word in keywords_filter_1:

            # suppress TweepError (e.g. a connection error):
            # if the error is raised, the remaining pages for this keyword are skipped and
            # the loop continues with the next keyword instead of the script being interrupted
            with suppress(TweepError):

                # we use the tweepy pagination (i.e. we get pages of tweet results back)

                # Cursor specifications:
                # 1- api.search: use the search endpoint
                # 2- q: use the current iteration through the query list as the query term
                # 3- tweet_mode: "extended", i.e. get tweets up to 280 chars
                # 4- count: return 100 tweets per page (100 is the maximum)
                # 5- result_type: get tweets based on the recent/popular/mixed specification
                # 6- include_entities: include entities part of the tweet (just in case it might be relevant)
                # 7- lang: set language to 'de' (= German), 'da' (= Danish) or 'pl' (= Polish)

                for page in tweepy.Cursor(api.search,
                                          q=word,
                                          tweet_mode="extended", 
                                          count=100,
                                          result_type=result_type,
                                          include_entities=True,
                                          lang='de').pages():

                    # add tweet objects to the data list
                    data.extend(page)

                    # iterate through the objects on the current page
                    for item in page:

                        # write the ._json part of the tweet object (i.e. the part that comes as dict of dicts) to file
                        writer.writerow([str(item._json)])

            # print the cumulative number of tweets we've collected so far
            print(f"Tweets collected so far (after '{word}', result type '{result_type}'): {len(data)}.")

## 3: Retrieving only retweets
Lastly, we'll scrape _only_ retweets.
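
When working with the collected data later on, retweets can also be recognised from the tweet object itself. The sketch below assumes tweets collected with ``tweet_mode="extended"``, where a retweet carries the original tweet under ``retweeted_status`` and the complete text sits in ``full_text``. The helper names and sample records are made up for illustration:

```python
def is_retweet(tweet):
    """A tweet dict is treated as a retweet if it embeds the original tweet."""
    return "retweeted_status" in tweet


def get_full_text(tweet):
    """Return the complete text, taking it from the original tweet for retweets."""
    if is_retweet(tweet):
        return tweet["retweeted_status"]["full_text"]
    return tweet["full_text"]


# Made-up records standing in for parsed tweet dicts
original = {"id_str": "1", "full_text": "An original tweet"}
retweet = {
    "id_str": "2",
    "full_text": "RT @user: An original tw…",
    "retweeted_status": original,
}

for tweet in (original, retweet):
    print(is_retweet(tweet), get_full_text(tweet))
```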

In [None]:
# loading the keywords
# here, we filter for just retweets

keywords_filter_2 = []

with open('final_keywords.csv', 'r', encoding='utf8') as file:
    reader = csv.reader(file)
    
    for row in reader:
        keywords_filter_2.append(row[0] + ' filter:retweets')

keywords_filter_2

In [None]:
# DATA COLLECTION

# list to store tweet object in
data = []

# result type: we can choose 'recent', 'popular' or 'mixed' and we want all tweets for all the different types
result_type_list = ['recent', 'popular', 'mixed']

# open the file in 'append' mode to store the tweets
# specify encoding as 'utf8' (because of German letters such as ä, ö, ü)
# specify newline="": if this code is run on a Windows machine and we don't specify this parameter,
# then the csv writer adds an empty line after each entry and we don't want that

with open('tweets_only_rt2.csv', 'a', encoding='utf8', newline="") as outfile:
    
    # writer instance
    writer = csv.writer(outfile)

    # iterate through the result_type_list:
    for result_type in result_type_list:
        
        # iterate through the keywords (this time, filtered to just return retweets)
        for word in keywords_filter_2:

            # suppress TweepError (e.g. a connection error):
            # if the error is raised, the remaining pages for this keyword are skipped and
            # the loop continues with the next keyword instead of the script being interrupted
            with suppress(TweepError):

                # we use the tweepy pagination (i.e. we get pages of tweet results back)

                # Cursor specifications:
                # 1- api.search: use the search endpoint
                # 2- q: use the current iteration through the query list as the query term
                # 3- tweet_mode: "extended", i.e. get tweets up to 280 chars
                # 4- count: return 100 tweets per page (100 is the maximum)
                # 5- result_type: get tweets based on the recent/popular/mixed specification
                # 6- include_entities: include entities part of the tweet (just in case it might be relevant)
                # 7- lang: set language to 'de' (= German), 'da' (= Danish) or 'pl' (= Polish)

                for page in tweepy.Cursor(api.search,
                                          q=word,
                                          tweet_mode="extended", 
                                          count=100,
                                          result_type=result_type,
                                          include_entities=True,
                                          lang='de').pages():

                    # add tweet objects to the data list
                    data.extend(page)

                    # iterate through the objects on the current page
                    for item in page:

                        # write the ._json part of the tweet object (i.e. the part that comes as dict of dicts) to file
                        writer.writerow([str(item._json)])

            # print the cumulative number of tweets we've collected so far
            print(f"Tweets collected so far (after '{word}', result type '{result_type}'): {len(data)}.")