[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1uId6X7aAQupJMnEOt1xhQ80PpGnOx_NQ)

# Notebook #1: tweet extraction from the Twitter API

## Description and requirements: 

In this notebook, you will learn how to extract tweets and information about Twitter users from the Twitter API in Python. 

To run this code, you need Twitter API keys and access tokens. The steps to request them are described in [this tutorial](https://www.slickremix.com/docs/how-to-get-api-keys-and-tokens-for-twitter/). 

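In addition to pandas, you will need to install the [tweepy](http://docs.tweepy.org/en/latest/index.html) package. The code in this notebook uses the Tweepy 3.x interface (for example `api.search` and `tweepy.error.TweepError`); several of these names changed in Tweepy 4.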

### Import modules

In [None]:
import os
import sys
import uuid

import pandas as pd
import tweepy

### Process credentials

Before we start, please replace the placeholders below with your API credentials. Make sure to keep them private and remove them before sharing this notebook with third parties.

In [None]:
api_dict = {"API Key": "Enter your own API Key",
            "API Secret Key": "Enter your own API Secret Key",
            "Bearer token": "Enter your own Bearer token",
            "access_token": "Enter your own access_token",
            "access_token_secret": "Enter your own access_token_secret"}

Now that you have entered your credentials, we will check whether the API recognizes them as valid. The function below does this and exits with an error message if they are not.

In [None]:
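def get_auth(api_dict):
    # OAuth process, using the keys and tokens
    auth = tweepy.OAuthHandler(api_dict['API Key'], api_dict['API Secret Key'])
    auth.set_access_token(api_dict['access_token'], api_dict['access_token_secret'])

    # Create the API interface; wait automatically when a rate limit is reached
    api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

    try:
        user = api.verify_credentials()
    except Exception as e:
        # Report the error without printing the credentials themselves
        print('Error during authentication:', e)
        sys.exit('Exit')
    if not user:
        # Depending on the Tweepy version, rejected credentials may give a falsy result instead of an exception
        sys.exit('Error during authentication: credentials were rejected')
    return api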

In [None]:
get_auth(api_dict)
print('Credentials Checked!')

Credentials Checked!
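The credentials above are typed directly into the notebook. One way to keep them out of a shared file is to read them from environment variables instead. The sketch below is only an illustration: the helper `load_credentials` and the environment variable names are assumptions made for this example, not part of Tweepy or of the Twitter API. It returns a dictionary with the same keys as `api_dict`.

```python
import os

# Hypothetical mapping from the keys used in `api_dict` to environment
# variable names; the variable names are an assumption for illustration.
ENV_NAMES = {
    "API Key": "TWITTER_API_KEY",
    "API Secret Key": "TWITTER_API_SECRET_KEY",
    "Bearer token": "TWITTER_BEARER_TOKEN",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(env=None):
    """Build a dictionary shaped like `api_dict` from environment variables."""
    env = os.environ if env is None else env
    # Fail early, naming the missing variables but never printing any secret.
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[name] for key, name in ENV_NAMES.items()}
```

Once the variables are set in your shell, `api_dict = load_credentials()` could replace the hard-coded dictionary.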


### Downloading tweets

The Twitter API allows developers to download several types of information on the social network and its users. In this section, we will focus on three types of Twitter data:
- tweets by hashtag
- tweets by user
- list of users that a given user follows

Before we go on, please note that each developer is limited in the number of requests they can make to the Twitter API. This is important to take into account if you want to download a large number of tweets. You will find more information on the API rate limits in the [FAQ](https://developer.twitter.com/en/docs/twitter-api/v1/tweets/timelines/faq) of the Twitter Developer documentation. 
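To get a feeling for what a rate limit means in practice, the back-of-the-envelope sketch below estimates how many requests a download needs and how long you would have to wait for new rate-limit windows. The function `estimate_wait_minutes` is hypothetical, and every number is supplied by the caller: the sketch does not know the actual limits of any endpoint, and the 15-minute default window is an assumption you should check against the documentation.

```python
import math


def estimate_wait_minutes(n_items, page_size, requests_per_window, window_minutes=15):
    """Rough lower bound on the waiting time imposed by a rate limit.

    All arguments are supplied by the caller; nothing here queries
    the actual limits of any endpoint.
    """
    n_requests = math.ceil(n_items / page_size)
    n_windows = math.ceil(n_requests / requests_per_window)
    # The first window starts immediately, so only the later ones add a wait.
    return n_requests, max(n_windows - 1, 0) * window_minutes


# Illustrative numbers only: 10,000 items, 100 per request, 30 requests per window.
n_requests, wait = estimate_wait_minutes(10_000, page_size=100, requests_per_window=30)
print(n_requests, wait)
```

With `wait_on_rate_limit=True`, Tweepy sleeps through these windows for you, so the estimate is mainly useful for planning how long a job will run.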

#### By hashtag

First, tweets can be selected on the basis of hashtags. This can be relevant if you want to study the importance of specific topics in the Twitter-verse and what users have to say about these topics.

The function below takes as input:
- the `api_dict` dictionary, containing our credentials and defined earlier
- a list of hashtags `tags_list`
- `language` the language of the tweet
- `count_int` the number of tweets to download per hashtag

It then loops over the list of hashtags `tags_list`, downloads up to `count_int` tweets per hashtag, and returns the tweets and related information as a Pandas dataframe.

In [None]:
def get_tags(api_dict, tags_list, language, count_int):
    # Authenticate once for all hashtags
    api = get_auth(api_dict)

    tweets = []
    for tag in tags_list:
        try:
            # The search endpoint expects the language code in the `lang` parameter
            cursor = tweepy.Cursor(
                api.search,
                q=f'#{tag}',
                lang=language).items(count_int)
            for tweet in cursor:
                tweets.append(tweet)
        except tweepy.error.TweepError as e:
            # Report the error and move on to the next hashtag
            print(e)
            continue
    print(f'Got {len(tweets)} Tweets!!')
    # Keep the raw JSON of each tweet, one dictionary per row
    tweets = [tweet._json for tweet in tweets]
    return pd.DataFrame(data=tweets)
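The query string passed as `q` is built by prefixing each tag with `#`. The sketch below shows one way to make that step a little more forgiving; `build_hashtag_query` is a hypothetical helper, not part of the function above. It accepts tags with or without a leading `#`, and can optionally append the `-filter:retweets` search operator. Whether that operator behaves as expected for your access level is an assumption to verify.

```python
def build_hashtag_query(tag, exclude_retweets=False):
    """Return a search query string for a single hashtag."""
    # Accept both 'COVID19' and '#COVID19' and drop surrounding whitespace.
    query = "#" + tag.strip().lstrip("#")
    if exclude_retweets:
        # Search operator asking the API to leave retweets out (assumption:
        # supported by the search endpoint you are using).
        query += " -filter:retweets"
    return query


print(build_hashtag_query(" #COVID19 "))
print(build_hashtag_query("COVID19", exclude_retweets=True))
```

The returned string could be passed as `q` in place of `f'#{tag}'`.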

Let's look at an example. Say we want to download 10 tweets with the hashtag #COVID19. We can do this in the following way:

In [None]:
covid19_tweets_df = get_tags(api_dict=api_dict, tags_list=['COVID19'], language='en', count_int=10)

Got 10 Tweets!!


The output dataframe has 10 rows (one tweet per row) and 27 columns. 

In [None]:
covid19_tweets_df.shape

(10, 27)

These 27 columns, listed below, give a lot of detail on the tweets, including the date of creation, the tweet ID and the text.

In [None]:
covid19_tweets_df.columns

Index(['created_at', 'id', 'id_str', 'text', 'truncated', 'entities',
       'metadata', 'source', 'in_reply_to_status_id',
       'in_reply_to_status_id_str', 'in_reply_to_user_id',
       'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo',
       'coordinates', 'place', 'contributors', 'retweeted_status',
       'is_quote_status', 'retweet_count', 'favorite_count', 'favorited',
       'retweeted', 'lang', 'possibly_sensitive', 'extended_entities'],
      dtype='object')

If we focus on the text, here are the first 5 tweets we picked up:

In [None]:
covid19_tweets_df['text'].head()

0    RT @PetroDivisa: Carlos Holmes Trujillo un col...
1    RT @tim_fargo: “When one door of happiness clo...
2    RT @VilledAjaccio: #COVID19\nℹ Vous êtes comme...
3    RT @enricomolinari: How #COVID19 hit Europe #s...
4    RT @FBIDallas: Have you received a call, text,...
Name: text, dtype: object
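All five tweets above are retweets: their text starts with `RT @`, and the dataframe has a `retweeted_status` column. If you want to separate retweets from original tweets, a minimal sketch could look like the following. The helper `is_retweet` is hypothetical and works on raw tweet dictionaries (such as `tweet._json`), not on dataframe rows, where missing values show up as `NaN`.

```python
def is_retweet(tweet_json):
    """Return True if the tweet dictionary looks like a retweet."""
    # A retweet carries the original tweet under 'retweeted_status'.
    if tweet_json.get("retweeted_status"):
        return True
    # Fallback heuristic on the text prefix; not an API guarantee.
    text = tweet_json.get("text") or tweet_json.get("full_text") or ""
    return text.startswith("RT @")


# Toy data for illustration only, not real API output.
sample = [
    {"text": "RT @someone: hello", "retweeted_status": {"id": 1}},
    {"text": "An original tweet about #COVID19"},
]
originals = [tweet for tweet in sample if not is_retweet(tweet)]
print(len(originals))
```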

#### By user

Another option when downloading tweets is to download the tweets from one or several specific users. 

The function below takes as input:
- `screen_name` the screen name of a Twitter user (another name for a Twitter handle)
- `api` the authenticated API credentials

It returns a tuple containing the timeline of that Twitter user as a Pandas dataframe and the error message from Tweepy, if any (`None` otherwise). The `count` argument is the number of tweets requested per page, which the API caps at 200. The API also only serves roughly the 3,200 most recent tweets of a timeline, so that is the upper bound on what can be downloaded per user. Retweets are excluded (`include_rts=False`), and we choose the extended `tweet_mode` to obtain the full text of each tweet rather than the truncated text returned by default.

In [None]:
def get_timeline(screen_name, api):
    timeline = []
    error = None
    # Collect all statuses in the timeline, page by page
    try:
        cursor = tweepy.Cursor(
            api.user_timeline,
            screen_name=screen_name,
            tweet_mode="extended",
            count=200,  # maximum number of tweets per request
            include_rts=False).items()

        for status in cursor:
            timeline.append(status._json)
    except tweepy.error.TweepError as e:
        # Return the error message instead of raising, so callers can skip this user
        error = str(e)
    return pd.DataFrame(timeline), error

##### Example with Joe Biden's timeline

As an example, let's download Joe Biden's timeline.

In [None]:
joe_biden_timeline_df = get_timeline(screen_name='JoeBiden', api=get_auth(api_dict))[0]
joe_biden_timeline_df.head(n=1)

Unnamed: 0,created_at,id,id_str,full_text,truncated,display_text_range,entities,extended_entities,source,in_reply_to_status_id,in_reply_to_status_id_str,in_reply_to_user_id,in_reply_to_user_id_str,in_reply_to_screen_name,user,geo,coordinates,place,contributors,is_quote_status,retweet_count,favorite_count,favorited,retweeted,possibly_sensitive,lang,quoted_status_id,quoted_status_id_str,quoted_status_permalink,quoted_status
0,Wed Jan 20 17:55:22 +0000 2021,1351951465674276869,1351951465674276869,"Now the real work begins, folks. Follow along ...",False,"[0, 80]","{'hashtags': [], 'symbols': [], 'user_mentions...","{'media': [{'id': 1351951461567979525, 'id_str...","<a href=""http://twitter.com/download/iphone"" r...",,,,,,"{'id': 939091, 'id_str': '939091', 'name': 'Jo...",,,,,False,63919,897967,False,False,False,en,,,,


We have successfully downloaded 3003 tweets (one tweet per row) from Joe Biden.

In [None]:
joe_biden_timeline_df.shape

(3003, 30)

The dataframe contains 30 columns with specific information on each tweet.

In [None]:
joe_biden_timeline_df.columns

Index(['created_at', 'id', 'id_str', 'full_text', 'truncated',
       'display_text_range', 'entities', 'extended_entities', 'source',
       'in_reply_to_status_id', 'in_reply_to_status_id_str',
       'in_reply_to_user_id', 'in_reply_to_user_id_str',
       'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place',
       'contributors', 'is_quote_status', 'retweet_count', 'favorite_count',
       'favorited', 'retweeted', 'possibly_sensitive', 'lang',
       'quoted_status_id', 'quoted_status_id_str', 'quoted_status_permalink',
       'quoted_status'],
      dtype='object')
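The `created_at` column holds strings such as `Wed Jan 20 17:55:22 +0000 2021`. For any time-based analysis it helps to turn them into proper datetimes. The sketch below uses only the standard library; the helper name `parse_created_at` is an assumption, and the format string is inferred from the value shown in the output above. Day and month names are parsed according to the current locale, so English names are assumed.

```python
from datetime import datetime, timezone

# Format inferred from values such as 'Wed Jan 20 17:55:22 +0000 2021'.
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def parse_created_at(value):
    """Convert a `created_at` string into a timezone-aware datetime."""
    return datetime.strptime(value, CREATED_AT_FORMAT)


parsed = parse_created_at("Wed Jan 20 17:55:22 +0000 2021")
print(parsed.isoformat())
```

On a dataframe, the same conversion could be applied column-wise, for example with `joe_biden_timeline_df['created_at'].map(parse_created_at)`.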

##### Example with Imran Khan's followers

We will now download tweets from Pakistani Twitter users. To do so, we will first grab a list of 100 followers of Imran Khan, Prime Minister of Pakistan at the time of writing. 

The function below takes as input:
- `user_names` a list of screen names whose followers we want
- `amount` the number of followers to retrieve per user

It then returns a list of `followers`, containing up to `amount` followers for each user in the `user_names` list.

In [None]:
def get_followers(user_names, amount):
    # Uses the `api_dict` defined earlier in the notebook
    api = get_auth(api_dict)
    followers = []
    for user_name in user_names:
        try:
            for follower in tweepy.Cursor(api.followers, screen_name=user_name).items(amount):
                followers.append(follower.screen_name)
        except Exception as e:
            # Report the error and move on to the next user
            print(e)
            continue
    print(f'Got {len(followers)} followers!!')
    return followers

In [None]:
user_names = ['ImranKhanPTI']
followers_screen_names = get_followers(user_names, 100)
# print 10 first users
followers_screen_names[:10]

Got 100 followers!!


['RabbyAh36419387',
 'MHafeezBuzdar3',
 'AmirShe06581205',
 'imrulka86583184',
 'AbubakarRindh',
 'Rijanoor5',
 'Noorfatima9695',
 'Attia97838963',
 'Mohamma85924197',
 'IrumKha08419063']
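When several accounts are passed in `user_names`, the same follower can appear more than once in the returned list. A small sketch of an order-preserving de-duplication is given below; `unique_in_order` is a hypothetical helper and the names in the example are made up.

```python
def unique_in_order(screen_names):
    """Drop duplicate screen names while keeping first-seen order."""
    # dict keys preserve insertion order, so this is an ordered de-duplication.
    return list(dict.fromkeys(screen_names))


# Made-up screen names for illustration.
print(unique_in_order(["user_a", "user_b", "user_a", "user_c", "user_b"]))
```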

To draw policy insights from tweets, one timeline is usually not enough. We will now show how to download and save the timelines of several users. We first define an output path `path_to_timelines` where the timelines will be saved. Please replace it with a path on your computer.

In [None]:
path_to_timelines = './data/timelines'

Next, we would like to save the downloaded timelines. In order to do this, we will create a function that receives timelines and saves them to disk.
The function below takes as input:
- `downloaded_screen_name_list` a list of screen names whose timelines were downloaded
- `output_id` an ID to differentiate different outputs
- `user_index` the rank of the user screen name in the list of screen names to download
- `timelines` the tweet data in pandas dataframe format

It then saves the dataframe in the `path_to_timelines` folder in pickle format. The file name is defined as `timelines-NB_DOWNLOADED_TIMELINES-OUTPUT_ID.pkl`, where `NB_DOWNLOADED_TIMELINES` is the number of downloaded timelines and `OUTPUT_ID` is a randomly generated ID. A `success` text file is also updated: for each user, it records the name of the file in which that user's timeline is saved. 

In [None]:
def save_timelines(downloaded_screen_name_list, output_id, user_index, timelines):
    filename = 'timelines-' + str(len(downloaded_screen_name_list)) + '-' + output_id + '.pkl'

    print('Process', 'processed', str(int(user_index) + 1), 'timelines with latest output file:',
          os.path.join(path_to_timelines, filename))
    # Create the output folder if it does not exist yet
    os.makedirs(path_to_timelines, exist_ok=True)
    # Save the dataframe in pickle format
    timelines.to_pickle(os.path.join(path_to_timelines, filename))

    # Record each screen name and the file in which its timeline was saved
    with open(os.path.join(path_to_timelines, 'success'), 'a', encoding='utf-8') as file:
        for downloaded_screen_name in downloaded_screen_name_list:
            file.write(f'{downloaded_screen_name}\t{filename}\n')
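Because the `success` file has one tab-separated line per user, it can be read back to find out which timelines are already on disk, for instance to skip them when restarting an interrupted download. The sketch below assumes exactly that layout (screen name, tab, file name); `read_success_file` is a hypothetical helper, not part of the notebook.

```python
import os


def read_success_file(path_to_timelines):
    """Map each downloaded screen name to the file holding its timeline."""
    success_path = os.path.join(path_to_timelines, "success")
    saved = {}
    # Nothing has been saved yet if the file does not exist.
    if not os.path.exists(success_path):
        return saved
    with open(success_path, encoding="utf-8") as file:
        for line in file:
            line = line.rstrip("\n")
            if not line:
                continue
            # Each line is '<screen_name>\t<filename>'.
            screen_name, filename = line.split("\t", 1)
            saved[screen_name] = filename
    return saved
```

A list of users still to download could then be obtained with `[name for name in followers_screen_names if name not in read_success_file(path_to_timelines)]`.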

We define a `cutoff` variable which works in the following manner: when the number of downloaded timelines reaches `cutoff`, these timelines are saved and then deleted from memory. The idea is to avoid losing already downloaded data in case of an error from the Tweepy client. Here, we define `cutoff` as equal to 100.

In [None]:
cutoff = 100
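The save-then-reset pattern behind `cutoff` can be illustrated without any Twitter data. In the sketch below, `flush_in_batches` is a hypothetical generator that fills a buffer, hands it over once it holds `cutoff` items, and starts again with an empty one; any leftover items are handed over at the end.

```python
def flush_in_batches(items, cutoff):
    """Yield lists of at most `cutoff` items, mimicking save-then-reset."""
    if cutoff < 1:
        raise ValueError("cutoff must be a positive integer")
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) == cutoff:
            yield buffer  # this is where a batch would be saved to disk
            buffer = []   # start again with an empty buffer
    if buffer:
        yield buffer      # leftover items that never reached the cutoff


# 250 placeholder items stand in for 250 downloaded timelines.
batches = list(flush_in_batches(range(250), cutoff=100))
print([len(batch) for batch in batches])
```

If an error interrupts the loop, at most the items in the current buffer are lost, which is the reason for saving in batches.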

The function below combines the functions `get_timeline` and `save_timelines`. It takes as input:
- `api_dict` the API credentials in a dictionary format
- `screen_name_list` a list of users screen names

It then downloads the timelines of each of the users in `screen_name_list` and saves these timelines in the `path_to_timelines` folder.

In [None]:
def download_timelines(api_dict, screen_name_list):
    # Authenticate once for all users
    api = get_auth(api_dict)
    # Initialize Output File ID
    output_id = str(uuid.uuid4())
    # Initialize DataFrame
    timelines = pd.DataFrame()
    # Initialize Downloaded User List
    downloaded_screen_name_list = []
    for user_index, screen_name in enumerate(screen_name_list):
        # Try Downloading Timeline
        timeline, error = get_timeline(screen_name, api)
        if error is not None:
            # Skip users whose timeline could not be downloaded
            print(screen_name, error)
            continue
        # Append
        timelines = pd.concat([timelines, timeline], sort=False)
        downloaded_screen_name_list.append(screen_name)
        # Save after <cutoff> timelines
        if len(downloaded_screen_name_list) == cutoff:
            save_timelines(downloaded_screen_name_list, output_id, user_index, timelines)
            # Reset Output File ID, Data, and Downloaded Users
            del timelines, downloaded_screen_name_list
            output_id = str(uuid.uuid4())
            timelines = pd.DataFrame()
            downloaded_screen_name_list = []
    # Save the remaining timelines, if any
    if downloaded_screen_name_list:
        save_timelines(downloaded_screen_name_list, output_id, len(screen_name_list) - 1, timelines)

In [None]:
download_timelines(api_dict, screen_name_list = followers_screen_names)

Noorfatima9695 Twitter error response: status code = 401
Process processed 100 timelines with latest output file: ./data/timelines/timelines-99-8e68f085-9682-4f2d-a7af-1b2c90e3b92f.pkl


#### Download social network

One last Tweepy feature we cover in this tutorial is the possibility to download a list of Twitter accounts that are followed by a specific Twitter account. The function below takes as input:
- `api_dict` the API credentials in a dictionary format
- `screen_name_list` a list of user screen names.

This function downloads the list of accounts that each user in `screen_name_list` follows and returns the results as a dictionary, with the input screen names as keys and the list of accounts each one follows as values.

In [None]:
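def get_friends(api_dict, screen_name_list):
    api = get_auth(api_dict)
    friends_dict = dict()
    for screen_name in screen_name_list:
        friends_list = list()
        try:
            # Collect the IDs of all accounts followed by this user
            for friend_ids in tweepy.Cursor(api.friends_ids, screen_name=screen_name).pages():
                friends_list.extend(friend_ids)
            # Resolve IDs to screen names in batches of 100, the maximum per lookup request
            friends_name_list = []
            for start in range(0, len(friends_list), 100):
                batch = friends_list[start:start + 100]
                friends_name_list.extend(user.screen_name for user in api.lookup_users(user_ids=batch))
            friends_dict[screen_name] = friends_name_list
        except Exception as e:
            # Report the error and move on to the next user
            print(e)
            continue
    return friends_dict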


Let's do an example with Joe Biden:

In [None]:
friends_dict = get_friends(api_dict, ['JoeBiden'])

Below, we can find the list of users Joe Biden follows:

In [None]:
print('**************Twitter users Joe Biden follows:**************')
print(friends_dict['JoeBiden'])

**************Twitter users Joe Biden follows:**************
['POTUS', 'teachcardona', 'AliMayorkas', 'ABlinken', 'JanetYellen', 'neeratanden', 'XavierBecerra', 'mlfudge', 'DenisMcDonough', 'PeteButtigieg', 'DebHaalandNM', 'JenGranholm', 'Michael_S_Regan', 'SecDef', 'Mariska', 'BidenInaugural', 'WhiteHouse', 'BlueAmerica22', 'DouglasEmhoff', 'KamalaHarris', 'JoeForNV', 'JoeForSC', 'JoeForNH', 'JoeForIA', 'TeamJoe', 'De11eDonne', 'ladygaga', 'ItsOnUs', 'DrBiden', 'UDBidenInst', 'BidenCancer', 'PennBiden', 'ObamaFoundation', 'livelihood2017', 'bidenfoundation', 'timkaine', 'HillaryClinton', 'DrBiden44', 'ObamaWhiteHouse', 'WhiteHouse45', 'VP44', 'VP45', 'BeauBidenFdn', 'TheDemocrats', 'MichelleObama', 'BarackObama']
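For network analysis, a dictionary of lists is often less convenient than a flat list of (follower, followed) pairs. The sketch below shows one possible conversion; `to_edge_list` is a hypothetical helper, and the toy input reuses a few of the names from the output above.

```python
def to_edge_list(friends_dict):
    """Flatten {user: [friends]} into (follower, followed) pairs."""
    return [
        (screen_name, friend)
        for screen_name, friends in friends_dict.items()
        for friend in friends
    ]


# Toy input using a few of the names from the output above.
example = {"JoeBiden": ["POTUS", "KamalaHarris", "BarackObama"]}
edges = to_edge_list(example)
print(edges)
```

Such a list could be loaded into a dataframe with `pd.DataFrame(edges, columns=['source', 'target'])`, where the column names are again just a suggestion.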
