## Twitter Data Collection Documentation
The goal of this notebook is to document how we currently collect data from the Twitter API. This includes the challenges faced during collection, the rationale behind the current method, and the overall goals of the project.

### What's the project about?
The aim of the project is to analyse daily public social media discussion in Uganda relating to the COVID-19 pandemic to inform decision making by the National Task Force ([reference](https://docs.google.com/document/d/1a8JI6GS67B6ot4fE1mMbIs0jmnagckw8meoATGM6iNI/edit?usp=sharing)).

We also hope to use the data collected from Twitter (along with data from other sources) to build a language model of language use in Uganda.

### Collecting data from Twitter
Twitter provides a number of ways of retrieving tweets based on a variety of criteria such as hashtags, keywords, trending topics, location, user handles, dates, etc.

We've been using the [Standard Search API](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets), which returns a collection of relevant tweets matching a specified query. The query can contain hashtags, user accounts and keywords. An example of a query string is: `q='from:MinofHealthUG OR to:MinofHealthUG OR #covid19'`; this returns tweets that are from the user handle `MinofHealthUG`, are replies to the handle `MinofHealthUG`, or contain the hashtag `#covid19`.

We've also been using [Python Twitter Tools](https://github.com/sixohsix/twitter) to make requests to the API easier.
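As a rough illustration, a query string like the example above can be assembled from lists of handles and hashtags. The helper names `build_query` and `search_tweets` are hypothetical (they are not taken from the project code), and `search_tweets` assumes that `api` is a `twitter.Twitter` instance from Python Twitter Tools.

```python
def build_query(handles, hashtags):
    """Join 'from:', 'to:' and hashtag terms with OR, as in the example query."""
    terms = []
    for handle in handles:
        terms.append("from:{}".format(handle))  # tweets sent by the account
        terms.append("to:{}".format(handle))    # replies to the account
    for tag in hashtags:
        terms.append("#{}".format(tag.lstrip("#")))  # accept tags with or without '#'
    return " OR ".join(terms)


def search_tweets(api, query, count=100):
    """Run one Standard Search request.

    Assumption: `api` is a twitter.Twitter instance (Python Twitter Tools).
    """
    return api.search.tweets(q=query, count=count)


query = build_query(["MinofHealthUG"], ["covid19"])
print(query)
```

Building the string in one place keeps the list of accounts and hashtags easy to change without editing the query by hand.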

We've considered or used the following ways of collecting data:

#### Retrieve tweets based on key user accounts and relevant hashtags
An example of this is when we analysed tweets about masks following the presidential address (code can be found [here](https://github.com/SunbirdAI/covid19-uganda-twitter-data-analysis/blob/master/MaskRelatedData.py)).

Advantages of this approach:
- We can get very specific tweets about the subject/people in question.

Disadvantages:
- Using a predefined list of user accounts and hashtags limits the tweets retrieved to those mentioning the listed hashtags and those from the listed (popular) accounts.
- As time goes on, a particular topic is discussed less and less as people move on to new things, so there are fewer and fewer new tweets about it. This limits the number of tweets we can collect if the goal is to collect tweets daily.

A potential mitigation for the limited number of tweets is to also collect replies to the initial tweets; this was tried during the "masks tweets" analysis, but there were very few replies.

#### Retrieve tweets based on an initial list of users and then dynamically increase the user-pool based on replies to those users.
This is the method currently being used for the continuous data collection. It starts from an initial list of popular users (based on follower count). After collecting some tweets from (and replies to) these users, it adds any new users that appear in the latest tweets to the user list. For each subsequent request, a random sample of 15 users is chosen from the list and the process is repeated.

For details about this, refer to the code, which is as follows:
- [Data collection script](https://github.com/SunbirdAI/covid19-uganda-twitter-data-analysis/blob/master/run_twitter_data_script.py): This script contains the code (with comments for each function) that implements the data collection strategy described above. 
- [Data collection package](https://github.com/SunbirdAI/covid19-uganda-twitter-data-analysis/tree/master/data_collection): Contains the files `db_utils.py` (which is used for interfacing with `mongodb`) and `twitter_api_utils.py` (which is used for interfacing with the Twitter API). All methods in these files have docstrings which explain what each of them does.
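The sample-and-grow loop described above can be sketched roughly as follows. This is an illustration rather than the code in the linked script: the function names are hypothetical, and it assumes each tweet is a dict in the Twitter API v1.1 format, where the author's handle is at `tweet["user"]["screen_name"]`.

```python
import random

SAMPLE_SIZE = 15  # number of users queried per request, as described above


def sample_users(user_pool, k=SAMPLE_SIZE, rng=random):
    """Pick up to k users at random from the pool."""
    users = sorted(user_pool)  # random.sample needs a sequence, not a set
    return rng.sample(users, min(k, len(users)))


def grow_pool(user_pool, tweets):
    """Return the pool extended with the authors of the given tweets."""
    new_users = {t["user"]["screen_name"] for t in tweets if "user" in t}
    return set(user_pool) | new_users


# Toy data standing in for real API results
pool = {"a", "b"}
tweets = [{"user": {"screen_name": "c"}}, {"user": {"screen_name": "a"}}]
pool = grow_pool(pool, tweets)
print(sorted(pool), len(sample_users(pool)))
```

Keeping the pool as a set means a user who appears in many tweets is still only added once.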

Advantages:
- Diverse tweets from many different users. This diversity could aid in building the language model mentioned at the start of this document.
- In addition, when searching the data for specific topics/issues of interest, having a large number of different tweets may give us an idea of how frequently a particular topic is being discussed, along with the context in which it is being discussed (for example, by looking at popular hashtags on a particular day).

Disadvantages:
- Replies to accounts may not be from Ugandan users, so a significant number of tweets may not be relevant to the Ugandan context on which this project is based. (Issues with location are discussed later in this document.)
- Because the users are sampled randomly, there is no guarantee that topics of interest will appear (frequently enough) in the collected tweets.

#### Retrieve tweets based on location (or trending topics in a location)
This would probably be the best solution to our data collection problems: we could use it to keep track of the topics trending each day, we would be sure that the tweets come from people in Uganda, and the tweets would still be sufficiently diverse to aid in building a language model.

The following code illustrates the problems I ran into while pursuing this approach.

In [5]:
# Setup
import twitter
import os
import json
from dotenv import load_dotenv
load_dotenv()

# Authentication
API_KEY = os.getenv("TWITTER_API_KEY")
API_SECRET = os.getenv("TWITTER_API_SECRET")

auth = twitter.oauth.OAuth("", "", API_KEY, API_SECRET)

twitter_api = twitter.Twitter(auth=auth)

- We can get [trends near a location](https://developer.twitter.com/en/docs/trends/trends-for-location/api-reference/get-trends-place) using the location's [WOEID](https://en.wikipedia.org/wiki/WOEID) (currently managed by Yahoo).
- The following WOEIDs were retrieved from this site: https://www.findmecity.com/.
- I also tried [this method](https://stackoverflow.com/questions/22927307/how-to-find-woeid-where-on-earth-id-of-a-country) from Stack Overflow, but it seems that the `yweather` Python package is no longer supported and the Yahoo Weather API has been deprecated.

In [18]:
WORLD_WOEID = 1
US_WOEID = 23424977
UG_WOEID = 23424974
KLA_WOEID = 1451962
KENYA_WOEID = 23424863

In [20]:
# We can get world trends without a problem
world_trends = twitter_api.trends.place(_id=WORLD_WOEID)
print(json.dumps(world_trends, indent=1))

[
 {
  "trends": [
   {
    "name": "#pinargultekin",
    "url": "http://twitter.com/search?q=%23pinargultekin",
    "promoted_content": null,
    "query": "%23pinargultekin",
    "tweet_volume": 676624
   },
   {
    "name": "#P\u0131nar\u0131nkatiliCHPli",
    "url": "http://twitter.com/search?q=%23P%C4%B1nar%C4%B1nkatiliCHPli",
    "promoted_content": null,
    "query": "%23P%C4%B1nar%C4%B1nkatiliCHPli",
    "tweet_volume": 25560
   },
   {
    "name": "#istanbulsozlesmesiyasatir",
    "url": "http://twitter.com/search?q=%23istanbulsozlesmesiyasatir",
    "promoted_content": null,
    "query": "%23istanbulsozlesmesiyasatir",
    "tweet_volume": 16079
   },
   {
    "name": "#\u79c1\u306e\u5bb6\u653f\u592b\u30ca\u30ae\u30b5\u3055\u3093",
    "url": "http://twitter.com/search?q=%23%E7%A7%81%E3%81%AE%E5%AE%B6%E6%94%BF%E5%A4%AB%E3%83%8A%E3%82%AE%E3%82%B5%E3%81%95%E3%82%93",
    "promoted_content": null,
    "query": "%23%E7%A7%81%E3%81%AE%E5%AE%B6%E6%94%BF%E5%A4%AB%E3%83%8A%E3%82%AE%E3%

In [19]:
# We can also get US trends without a problem
us_trends = twitter_api.trends.place(_id=US_WOEID)
print(json.dumps(us_trends, indent=1))

[
 {
  "trends": [
   {
    "name": "#tuesdayvibes",
    "url": "http://twitter.com/search?q=%23tuesdayvibes",
    "promoted_content": null,
    "query": "%23tuesdayvibes",
    "tweet_volume": 30912
   },
   {
    "name": "#BoycottMLB",
    "url": "http://twitter.com/search?q=%23BoycottMLB",
    "promoted_content": null,
    "query": "%23BoycottMLB",
    "tweet_volume": null
   },
   {
    "name": "#RobinWilliams",
    "url": "http://twitter.com/search?q=%23RobinWilliams",
    "promoted_content": null,
    "query": "%23RobinWilliams",
    "tweet_volume": 12127
   },
   {
    "name": "#IfSchoolsReopenNow",
    "url": "http://twitter.com/search?q=%23IfSchoolsReopenNow",
    "promoted_content": null,
    "query": "%23IfSchoolsReopenNow",
    "tweet_volume": null
   },
   {
    "name": "Larry Householder",
    "url": "http://twitter.com/search?q=%22Larry+Householder%22",
    "promoted_content": null,
    "query": "%22Larry+Householder%22",
    "tweet_volume": null
   },
   {
    "name": "#

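The responses above share one shape: a list holding a single object whose `trends` list carries a `name` and a `tweet_volume` (which may be `null`). A small helper for summarising such a response might look like this; `top_trends` is a hypothetical name, and treating a missing volume as 0 when sorting is an assumption made only for this sketch.

```python
def top_trends(response, n=5):
    """Return up to n (name, tweet_volume) pairs, highest volume first."""
    trends = response[0]["trends"]
    # tweet_volume can be None (null in the JSON); count it as 0 for sorting only
    ranked = sorted(trends, key=lambda t: t["tweet_volume"] or 0, reverse=True)
    return [(t["name"], t["tweet_volume"]) for t in ranked[:n]]


# Two entries copied from the US output above
sample = [{"trends": [
    {"name": "#BoycottMLB", "tweet_volume": None},
    {"name": "#tuesdayvibes", "tweet_volume": 30912},
]}]
print(top_trends(sample))
```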
In [25]:
# # We get a 404 error when we try to get UG trends
# ug_trends = twitter_api.trends.place(_id=UG_WOEID)
# print(json.dumps(ug_trends, indent=1))

In [27]:
# # We get the same 404 error for KLA
# kla_trends = twitter_api.trends.place(_id=KLA_WOEID)
# print(json.dumps(kla_trends, indent=1))

We get a 404 error when we use the WOEIDs for Kampala and Uganda. I've commented out the code for this because the response from Twitter includes the API keys. You can clone this project and run this notebook to see the error.

Here's the response:
```
TwitterHTTPError: Twitter sent status 404 for URL: 1.1/trends/place.json using parameters: (id=1451962)
details: {'errors': [{'code': 34, 'message': 'Sorry, that page does not exist.'}]}
```
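One way to keep such calls in the notebook without printing the full error (which, as noted above, includes the API keys) is to catch the exception and report only the status code. This is a sketch, not the project's code: `safe_trends` is a hypothetical helper, and reading the status from `err.e.code` assumes the exception is Python Twitter Tools' `TwitterHTTPError`, which keeps the underlying HTTP error in its `e` attribute.

```python
def safe_trends(api, woeid):
    """Return trends for a WOEID, or None if the request fails."""
    try:
        return api.trends.place(_id=woeid)
    except Exception as err:  # broad on purpose so this sketch needs no imports
        # Assumption: TwitterHTTPError stores the HTTP error in `.e`;
        # for any other exception type the status is reported as None.
        status = getattr(getattr(err, "e", None), "code", None)
        print("Request for WOEID {} failed (HTTP status: {})".format(woeid, status))
        return None
```

With this, a call such as `safe_trends(twitter_api, UG_WOEID)` would print a short message instead of the full error text.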

In [23]:
# However, we can get Kenya trends without a problem
kenya_trends = twitter_api.trends.place(_id=KENYA_WOEID)
print(json.dumps(kenya_trends, indent=1))

[
 {
  "trends": [
   {
    "name": "#njoro",
    "url": "http://twitter.com/search?q=%23njoro",
    "promoted_content": null,
    "query": "%23njoro",
    "tweet_volume": null
   },
   {
    "name": "Churchill",
    "url": "http://twitter.com/search?q=Churchill",
    "promoted_content": null,
    "query": "Churchill",
    "tweet_volume": 13505
   },
   {
    "name": "#testingtuesday",
    "url": "http://twitter.com/search?q=%23testingtuesday",
    "promoted_content": null,
    "query": "%23testingtuesday",
    "tweet_volume": null
   },
   {
    "name": "#KibosLordsOfImpunity",
    "url": "http://twitter.com/search?q=%23KibosLordsOfImpunity",
    "promoted_content": null,
    "query": "%23KibosLordsOfImpunity",
    "tweet_volume": null
   },
   {
    "name": "#SakajaCharged",
    "url": "http://twitter.com/search?q=%23SakajaCharged",
    "promoted_content": null,
    "query": "%23SakajaCharged",
    "tweet_volume": null
   },
   {
    "name": "Watford",
    "url": "http://twitter.com/

We can use `twitter_api.trends.available()` to find the locations for which Twitter supports trends.

In [33]:
# Get available locations for trends
available_locations = twitter_api.trends.available()
print("Number of locations available: {}".format(len(available_locations)))
print(json.dumps(available_locations[0], indent=1))

Number of locations available: 467
{
 "name": "Worldwide",
 "placeType": {
  "code": 19,
  "name": "Supername"
 },
 "url": "http://where.yahooapis.com/v1/place/1",
 "parentid": 0,
 "country": "",
 "woeid": 1,
 "countryCode": null
}


In [38]:
# Trying to find out if Uganda or Kampala is in this list
for location in available_locations:
    if location['name'] == 'Uganda' or location['name'] == 'Kampala':
        print("Found Uganda or Kampala")
        print(json.dumps(location, indent=1))

# Nothing found :(
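The exact-name check above would miss other Ugandan towns or differently spelled names. Each entry also carries `country` and `countryCode` fields (visible in the sample output above), so a broader check could filter on those instead. A minimal sketch, assuming `UG` (the ISO 3166-1 alpha-2 code) is the country code Twitter would use for Uganda; `locations_in_country` is a hypothetical helper name.

```python
def locations_in_country(locations, country_code):
    """Return entries whose countryCode matches, ignoring case and null values."""
    wanted = country_code.upper()
    return [
        loc for loc in locations
        if (loc.get("countryCode") or "").upper() == wanted
    ]


# Small stand-in for `available_locations`. The first entry mirrors the real
# output above; the second is an illustrative entry built from KENYA_WOEID.
sample_locations = [
    {"name": "Worldwide", "country": "", "woeid": 1, "countryCode": None},
    {"name": "Kenya", "country": "Kenya", "woeid": 23424863, "countryCode": "KE"},
]
print(locations_in_country(sample_locations, "UG"))
```

In the notebook this would be called as `locations_in_country(available_locations, "UG")`; an empty list would confirm that no Ugandan location is supported for trends.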