Copyright: Vrije Universiteit Amsterdam, Faculty of Humanities, CLTL

# Lab 1.3: Querying tweets

In this notebook, we are going to query Twitter using the library *tweepy*. Take a look at its [documentation](https://github.com/tweepy/tweepy/tree/master/docs).

Tweepy allows you to access Twitter using your credentials. It returns the results as Python objects, and it offers a so-called Cursor object for paging through longer result lists. From these objects, you can access the Twitter data in, for example, JSON format. Documentation on the Twitter data objects can be found [here](https://developer.twitter.com/en/docs).


Make sure you have installed the package and obtained the Twitter credentials before you start using the API.

The following tutorial may also be helpful: https://www.earthdatascience.org/courses/use-data-open-source-python/intro-to-apis/twitter-data-in-python/

## 1. Setting up your Twitter credentials to use the API

First of all, you need a standard Twitter account. It is easy to create a dummy account, and you do not have to use your own name. If you do not want to register for a Twitter account, that is okay, but you will not be able to test or modify the code. In that case, make sure that you still understand how it works.

1. Log in to your Twitter account and go to developer.twitter.com.
2. Click on “Apply” in the top right and then on “Apply for a developer account”.
3. Choose “Academic”, then “Student” and “Get started”, and fill in the required fields.
4. Use the following text block for all text fields and mark questions 1 and 3 “yes” and 2 and 4 “no”:
`Text Mining course at the VU university master program of the faculty of humanities. We analyse tweets for extracting data and information and obtaining statistics on language use. Analyses will be described in a password-protected blog. I am a student in this course.`
5. Read the Developer agreement and policy and agree (if you agree). Confirm the email and obtain the credentials.

Set the constants API_KEY and API_SECRET to your values:

In [39]:
import tweepy
# The API-Key and the API-secret were displayed to you after you registered
API_KEY = 'YOUR_API_KEY'
API_SECRET = 'YOUR_API_SECRET'

6. Go to the developer portal, then Project and Apps, and create a Standalone App. Fill in a name for your app. It can be anything, e.g. “YOURNAME_Lab1”. Copy the access token and the access secret and store them in a file.


In [40]:
# The Access token and the Access secret were displayed when you clicked on "generate"
ACCESS_TOKEN = 'YOUR_ACCESS_TOKEN'
ACCESS_SECRET = 'YOUR_ACCESS_SECRET'
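
Credentials that are pasted into a notebook are easily shared by accident. As an alternative, here is a minimal sketch that reads the four values from environment variables. The variable names (`TWITTER_API_KEY` and so on) and the helper `load_credentials` are assumptions made for this illustration; they are not part of tweepy or of the Twitter developer portal.

```python
import os

# Hypothetical environment variable names; pick any names you like.
CREDENTIAL_VARIABLES = {
    "API_KEY": "TWITTER_API_KEY",
    "API_SECRET": "TWITTER_API_SECRET",
    "ACCESS_TOKEN": "TWITTER_ACCESS_TOKEN",
    "ACCESS_SECRET": "TWITTER_ACCESS_SECRET",
}


def load_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if one is missing."""
    missing = [var for var in CREDENTIAL_VARIABLES.values() if not environ.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[var] for name, var in CREDENTIAL_VARIABLES.items()}
```

After exporting the variables in your shell, `credentials = load_credentials()` gives you a dictionary, and `credentials["API_KEY"]` can take the place of the hard-coded constant.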

## 2. Querying the Twitter API

We are using Tweepy to crawl tweets, but it is important to know that the Twitter API has some limitations that affect reproducibility. The API is not exhaustive: it only provides a sample of the matching tweets, and the documentation does not give much detail on how this sample is determined (see, for example, this [Stack Overflow discussion](https://stackoverflow.com/questions/32445553/tweepy-not-finding-results-that-should-be-there)).

The Twitter API returns the results as a JSON object. You learned how to use JSON objects in [Chapter 17](https://github.com/cltl/python-for-text-analysis/blob/master/Chapters/Chapter%2017%20-%20%20Data%20formats%20II%20(JSON).ipynb) of the Python course. The tweepy library makes it easier to access these JSON objects. 
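
As a reminder of how such a JSON object is handled in plain Python, here is a small sketch. The JSON string is a hand-written, heavily shortened stand-in for a tweet, not real API output; only the nesting of `text` and `user.screen_name` follows the structure of a tweet object.

```python
import json

# Hand-written stand-in for a tweet (not real API output).
raw = '{"id": 1, "text": "Hello world", "user": {"screen_name": "example_user"}}'

# json.loads turns the JSON string into nested Python dictionaries.
tweet = json.loads(raw)

# Nested values are reached by chaining the keys.
text = tweet["text"]
screen_name = tweet["user"]["screen_name"]

print(screen_name + ": " + text)
```

With tweepy, the same values are available as attributes, e.g. `tweet.text` and `tweet.user.screen_name`.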

The code below is used to set up the connection: 

In [41]:
import tweepy

# Set up the authentication with your Twitter credentials:
auth = tweepy.OAuthHandler(API_KEY, API_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

# Create the api to connect to Twitter using your authentication 
api = tweepy.API(auth, wait_on_rate_limit=True)

We set a few variables to limit our search. Note that we can include hashtags and words in our keywords and combine them using Boolean operators such as OR and AND. Check the [Twitter API documentation](https://developer.twitter.com/en/docs/twitter-api/v1/rules-and-filtering/overview/standard-operators) for more details on how to customize queries. 
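
To make the composition of a query more concrete, here is a minimal sketch that joins several keywords with OR and appends the retweet filter. The helper `build_query` is hypothetical and only builds a string; it does not contact Twitter.

```python
def build_query(keywords, exclude_retweets=True):
    """Join the keywords with OR and optionally append the retweet filter."""
    query = "(" + " OR ".join(keywords) + ")"
    if exclude_retweets:
        # Operators are separated from the keywords by a space.
        query += " -filter:retweets"
    return query


print(build_query(["#trump", "eleições"]))
```

The resulting string can be passed as the `q` argument of the search request.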

**Play around with the parameters and understand how the queries are composed.**

In [42]:
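# We define the target language and the keywords
language = "pt"
keywords = "(#trump)"

# Optional: we can define a filter, for example, to ignore retweets
# (the name "filter" is avoided because it would shadow a Python built-in)
retweet_filter = "-filter:retweets"

# The parts of a query are separated by spaces
query = keywords + " " + retweet_filter

# Optional: limit the number of tweets
# (the standard search API returns at most 100 tweets per request)
count = 100

# Request the tweets
tweet_iterator = api.search_tweets(q=query, lang=language, count=count)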

# We save the tweets as a list, so that we can access them later. 
tweets = list(tweet_iterator)   

for i, tweet in enumerate(tweets): 
    print(i)
    print("User:" + tweet.user.screen_name)
    print("Tweet:" + tweet.text)
    print()

0
User:AlexandreCmps70
Tweet:Marcha das mulheres que não aceitam o resultado das eleições 2022.#bbc #londres #EUA #Putim #Trump #TheTimes… https://t.co/sPef5rcnl5

1
User:AlexandreCmps70
Tweet:Brasileiros nas ruas contra resultado das eleições presidenciais 2022.#bbc #londres #EUA #Putim
#Trump… https://t.co/rn9TdMkMAT

2
User:AlexandreCmps70
Tweet:Criancas nas ruas contra resultado das eleições presidenciais 2022.#bbc #londres #EUA #Putim
#Trump… https://t.co/MlTCDv6yvK

3
User:AlexandreCmps70
Tweet:Brasileiros nas ruas contra resultado das eleições presidenciais 2022.#bbc #londres #EUA #Putim
#Trump… https://t.co/a938y5302o

4
User:AlexandreCmps70
Tweet:Brasileiros nas ruas contra resultado das eleições presidenciais 2022.#bbc #londres #EUA #Putim
#Trump… https://t.co/vpuFRrAYZi

5
User:AlexandreCmps70
Tweet:Brasileiros clamam pelas forças armadas e pedem SOS Forças.#bbc #londres #EUA #Putim
#Trump #TheGlobeandMail… https://t.co/w2JWb6bXUI

6
User:AlexandreCmps70
Tweet:Brasileiros na

## 3. Examining the attributes

In the code above, we only printed the user name and the text of each tweet. The result returned by the API contains much more information that might be interesting for your analyses. Let's take a look at the attributes of the first tweet in our result list.

**Discuss which of these properties would be interesting for your analysis.**

In [43]:
# Show all attributes of a tweet that you can access
tweets[0].__dict__

{'_api': <tweepy.api.API at 0x7f13505976a0>,
 '_json': {'created_at': 'Mon Nov 14 08:40:56 +0000 2022',
  'id': 1592075096130199553,
  'id_str': '1592075096130199553',
  'text': 'Marcha das mulheres que não aceitam o resultado das eleições 2022.#bbc #londres #EUA #Putim #Trump #TheTimes… https://t.co/sPef5rcnl5',
  'truncated': True,
  'entities': {'hashtags': [{'text': 'bbc', 'indices': [66, 70]},
    {'text': 'londres', 'indices': [71, 79]},
    {'text': 'EUA', 'indices': [80, 84]},
    {'text': 'Putim', 'indices': [85, 91]},
    {'text': 'Trump', 'indices': [92, 98]},
    {'text': 'TheTimes', 'indices': [99, 108]}],
   'symbols': [],
   'user_mentions': [],
   'urls': [{'url': 'https://t.co/sPef5rcnl5',
     'expanded_url': 'https://twitter.com/i/web/status/1592075096130199553',
     'display_url': 'twitter.com/i/web/status/1…',
     'indices': [110, 133]}]},
  'metadata': {'iso_language_code': 'pt', 'result_type': 'recent'},
  'source': '<a href="http://twitter.com/download/android

In [44]:
# Show all attributes of the user who wrote the tweet
print(tweets[0].user)

User(_api=<tweepy.api.API object at 0x7f13505976a0>, _json={'id': 1397939023545503744, 'id_str': '1397939023545503744', 'name': 'Alexandre Campos', 'screen_name': 'AlexandreCmps70', 'location': 'Duas Barras, Brasil', 'description': 'O desejo cria, o pensamento atraí, a fé realiza!', 'url': None, 'entities': {'description': {'urls': []}}, 'protected': False, 'followers_count': 13, 'friends_count': 128, 'listed_count': 0, 'created_at': 'Thu May 27 15:33:44 +0000 2021', 'favourites_count': 853, 'utc_offset': None, 'time_zone': None, 'geo_enabled': False, 'verified': False, 'statuses_count': 753, 'lang': None, 'contributors_enabled': False, 'is_translator': False, 'is_translation_enabled': False, 'profile_background_color': 'F5F8FA', 'profile_background_image_url': None, 'profile_background_image_url_https': None, 'profile_background_tile': False, 'profile_image_url': 'http://pbs.twimg.com/profile_images/1408010937324482563/6Ez-JJ-k_normal.jpg', 'profile_image_url_https': 'https://pbs.twim
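
Once you know which attributes you need, you can pull them out of the `_json` dictionary. The sketch below works on a fragment copied from the output above (shortened to a few keys and two hashtags); the helper `summarize` and the selection of fields are only an illustration.

```python
from datetime import datetime

# Fragment of the tweet shown above, shortened to a few keys.
tweet_json = {
    "created_at": "Mon Nov 14 08:40:56 +0000 2022",
    "id_str": "1592075096130199553",
    "truncated": True,
    "entities": {"hashtags": [{"text": "bbc", "indices": [66, 70]},
                              {"text": "londres", "indices": [71, 79]}]},
    "metadata": {"iso_language_code": "pt", "result_type": "recent"},
}


def summarize(tweet_json):
    """Collect a few attributes of a tweet in a flat dictionary."""
    # The timestamp has the form "weekday month day time offset year".
    created = datetime.strptime(tweet_json["created_at"], "%a %b %d %H:%M:%S %z %Y")
    return {
        "id": tweet_json["id_str"],
        "created": created.isoformat(),
        "language": tweet_json["metadata"]["iso_language_code"],
        "hashtags": [tag["text"].lower() for tag in tweet_json["entities"]["hashtags"]],
    }


summary = summarize(tweet_json)
print(summary)
```

With real results, the same function could be applied to `tweets[0]._json`.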

## 4. Saving the results

We have two options for saving the results. 
1. We can select specific attributes and save them as a tab-separated (TSV) file.
2. If we do not want to decide yet which attributes we need, we can simply dump the whole JSON result to a file and process it later. 

**Make sure that you understand the code below. Open the result files in an editor and compare the differences.** 

In [45]:
import json
import os

# Collect the results
tweets_as_json = []
tweets_as_text = []

for tweet in tweets:

    # Option 1: only keep selected attributes
    # Newlines and tabs inside the text would break the tab-separated layout
    text = tweet.text.replace("\n", " ").replace("\t", " ")
    keep = str(tweet.created_at) + "\t" + tweet.user.screen_name + "\t" + text
    tweets_as_text.append(keep)

    # Option 2: keep everything and process later
    tweets_as_json.append(tweet._json)

# Write them to a file (the first file is tab-separated, despite its .csv extension)
csv_file = "../results/twitter_search_results/results_veganism.csv"
json_file = "../results/twitter_search_results/results_veganism.json"

# Create the output directory if it does not exist yet
os.makedirs(os.path.dirname(csv_file), exist_ok=True)

with open(csv_file, 'w', encoding="utf-8") as outfile:
    csv_header = "Created at\tUser\tText\n"
    outfile.write(csv_header)
    outfile.write("\n".join(tweets_as_text))

with open(json_file, 'w', encoding="utf-8") as outfile:
    json.dump(tweets_as_json, outfile)
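
To process the saved results later, both files have to be read back in. The following sketch writes two tiny stand-in files to a temporary directory in the same way as the cell above and then loads them again. The rows and records are made-up placeholders, and the file names are chosen only for this illustration.

```python
import csv
import json
import os
import tempfile

# Made-up stand-ins in the same shape as tweets_as_text and tweets_as_json.
rows = ["2022-11-14 08:40:56+00:00\texample_user\tHello world"]
records = [{"id": 1, "text": "Hello world"}]

with tempfile.TemporaryDirectory() as folder:
    tsv_path = os.path.join(folder, "results.tsv")
    json_path = os.path.join(folder, "results.json")

    # Write the files in the same way as the cell above.
    with open(tsv_path, "w", encoding="utf-8") as outfile:
        outfile.write("Created at\tUser\tText\n")
        outfile.write("\n".join(rows))
    with open(json_path, "w", encoding="utf-8") as outfile:
        json.dump(records, outfile)

    # Read the tab-separated file: one dictionary per row, keyed by the header.
    # QUOTE_NONE is used because the file was written without any quoting.
    with open(tsv_path, encoding="utf-8", newline="") as infile:
        reader = csv.DictReader(infile, delimiter="\t", quoting=csv.QUOTE_NONE)
        loaded_rows = list(reader)

    # Read the JSON file: the full list of tweet dictionaries.
    with open(json_path, encoding="utf-8") as infile:
        loaded_records = json.load(infile)

print(loaded_rows[0]["User"])
print(loaded_records[0]["text"])
```

The tab-separated file only gives back the three selected columns, while the JSON file restores every attribute that was saved.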