## Data Gathering - Twitter API

#### Learning Goals
1. Learn how to use the Twitter API and the tweepy library to gather data from Twitter
2. Learn how to work with JSON data
3. Work on a case study that covers gathering data from the Twitter API and web scraping

#### Twitter API and tweepy library


In [1]:
# import the necessary python packages

import json # helps us work with json data
from timeit import default_timer as timer # helps us time our code

import pandas as pd # helps us create dataframes for easy data manipulation
import requests # helps download url contents programmatically
import tweepy # helps work with the twitter API
from bs4 import BeautifulSoup # helps scrape data from websites

In [9]:
from zipfile import ZipFile

# open the archive, extract everything once, then list the extracted files
with ZipFile("tweet-json.zip", 'r') as archive:
    archive.extractall()
    for name in archive.namelist():
        print(name)

tweet-json copy


In [2]:
# connect to the api and initialize API

# for this next block, insert your own key, token and secrets
consumer_key = ''
consumer_secret = ''
access_token = ''
access_secret = ''

# create an authorization using the consumer key and secret
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# set access using your access token and secret
auth.set_access_token(access_token, access_secret)

#call the API
api = tweepy.API(auth, wait_on_rate_limit = True)

##### Understanding the code provided in additional resources of Project 2

In [None]:
# read in archive data provided by udacity
# this is because we need the tweet ids from the original dataset
df_1 = pd.read_csv('twitter-archive-enhanced.csv')
df_1.head()

In [4]:
# this next line extracts the tweet ids from dataframe

# the tweet ids need to be stored in a variable that's iterable
tweet_ids = df_1.tweet_id.values
print(type(tweet_ids))

<class 'numpy.ndarray'>


In [5]:
# you can also read the tweet ids into a list
tweet_ids = df_1.tweet_id.to_list()
print(type(tweet_ids))

# you can also explore the pandas method .iterrows() that's used for iterating over the rows of a dataframe

<class 'list'>


In [6]:
# check the number of tweets we'll be gathering data for
print(len(tweet_ids)) # you can also check the shape of the dataframe to know the number of rows(tweets) in the dataframe

2356


In [8]:
# the next two cells help us understand the nature of data returned by twitter's API

for tweet_id in tweet_ids:
    # the .get_status() method of the API helps get all the information about the tweet specified
    tweet = api.get_status(tweet_id, tweet_mode='extended')
    print(tweet)
    break


Status(_api=<tweepy.api.API object at 0x000001AEF7EC3CC0>, _json={'created_at': 'Tue Aug 01 16:23:56 +0000 2017', 'id': 892420643555336193, 'id_str': '892420643555336193', 'full_text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU", 'truncated': False, 'display_text_range': [0, 85], 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 892420639486877696, 'id_str': '892420639486877696', 'indices': [86, 109], 'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg', 'url': 'https://t.co/MgUWQ76dJU', 'display_url': 'pic.twitter.com/MgUWQ76dJU', 'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 540, 'h': 528, 'resize': 'fit'}, 'small': {'w': 540, 'h': 528, 'resize': 'fit'}, 'large': {'w': 

When the `get_status()` method is called, it returns a tweepy `Status` object (the output above begins with `Status(...)`). Let's list all the attributes of that object.

In [9]:
for tweet_id in tweet_ids:
    tweet = api.get_status(tweet_id, tweet_mode='extended')
    # this next line helps us get the attributes in the python object returned
    print(f'The attributes in this python object are: {dir(tweet)}')
    break

The attributes in this python object are: ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_api', '_json', 'author', 'contributors', 'coordinates', 'created_at', 'destroy', 'display_text_range', 'entities', 'extended_entities', 'favorite', 'favorite_count', 'favorited', 'full_text', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'parse', 'parse_list', 'place', 'possibly_sensitive', 'possibly_sensitive_appealable', 'retweet', 'retweet_count', 'retweeted', 'retweets', 'source', 'source_url', 'truncated', 'user']


A lot of attributes are returned but our focus is on the `._json` attribute since we're instructed in the lesson to write the json data to a text file.

In [10]:
print(type(tweet._json))
print(tweet._json)

<class 'dict'>
{'created_at': 'Tue Aug 01 16:23:56 +0000 2017', 'id': 892420643555336193, 'id_str': '892420643555336193', 'full_text': "This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU", 'truncated': False, 'display_text_range': [0, 85], 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [], 'media': [{'id': 892420639486877696, 'id_str': '892420639486877696', 'indices': [86, 109], 'media_url': 'http://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg', 'media_url_https': 'https://pbs.twimg.com/media/DGKD1-bXoAAIAUK.jpg', 'url': 'https://t.co/MgUWQ76dJU', 'display_url': 'pic.twitter.com/MgUWQ76dJU', 'expanded_url': 'https://twitter.com/dog_rates/status/892420643555336193/photo/1', 'type': 'photo', 'sizes': {'thumb': {'w': 150, 'h': 150, 'resize': 'crop'}, 'medium': {'w': 540, 'h': 528, 'resize': 'fit'}, 'small': {'w': 540, 'h': 528, 'resize': 'fit'}, 'large': {'w': 540, 'h': 528, 'resize': 'fit'}}}]}, 'extended_ent

The `._json` attribute holds a Python dictionary. To write the JSON data to a file, we will use the `json.dump()` function, which takes a Python dictionary, converts it to JSON and writes it directly to an open file.

In [11]:
# open the text file in write mode
with open('sample.txt','w') as outfile:
    # use the json.dump() method to write the json content to the file
    json.dump(tweet._json, outfile)
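
To see what `json.dump()` actually writes, here is a minimal sketch that uses an in-memory buffer instead of a real file. The `tweet_data` dictionary is a hypothetical, trimmed-down stand-in for `tweet._json`, and `io.StringIO` is used only so that the written text can be inspected without touching the disk.

```python
import io
import json

# Hypothetical, trimmed-down stand-in for tweet._json
tweet_data = {"id": 892420643555336193, "full_text": "This is Phineas.", "truncated": False}

# io.StringIO behaves like an open text file but keeps the content in memory
buffer = io.StringIO()
json.dump(tweet_data, buffer)
buffer.write("\n")  # one JSON object per line

json_line = buffer.getvalue()
print(json_line)

# json.loads() turns the JSON string back into a Python dictionary
restored = json.loads(json_line)
print(restored == tweet_data)
```

Note that Python's `False` is written as JSON's `false`, and that reading the line back with `json.loads()` gives a dictionary equal to the one we started with.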

Having understood all of this, let's examine the code provided by Udacity

In [None]:

# Query Twitter's API for JSON data for each tweet ID in the Twitter archive

# initialize a counter to help track progress
count = 0

# create a dictionary to store the tweet ids that cannot be retrieved
fails_dict = {}

# start tracking the time taken to run the next lines of code
start = timer()

# Save each tweet's returned JSON as a new line in a .txt file

# open the text file in write mode
with open('tweet_json.txt', 'w') as outfile:
    # This loop will likely take 20-30 minutes to run because of Twitter's rate limit
    for tweet_id in tweet_ids:
        # increment the counter with each tweet that's being queried
        count += 1
        # print the number and tweet id
        print(str(count) + ": " + str(tweet_id))
        try:
            # get the status of the tweet
            tweet = api.get_status(tweet_id, tweet_mode='extended')
            print("Success")
            # write the json data to the text file opened
            json.dump(tweet._json, outfile)
            # you're required to write each tweet on a new line so go to a new line
            outfile.write('\n')
        except tweepy.TweepyException as e:
            print("Fail")
            # record the tweet id that failed along with the error in fails_dict
            fails_dict[tweet_id] = e
end = timer()
print(end - start)
print(fails_dict)

One thing to note here is the `try` and `except` blocks, which are used for catching errors. The code that might fail goes under `try`; if it raises an error, the code under `except` runs instead of stopping the loop. Here, a failed tweet ID is recorded in `fails_dict` and the loop moves on to the next ID.
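The same pattern can be tried without any API access. The sketch below is only an illustration: the `available` dictionary is a hypothetical stand-in for the API, and a missing key (`KeyError`) plays the role of the error tweepy raises for a tweet that cannot be retrieved.

```python
# Hypothetical lookup table standing in for an API that may fail
available = {1: "first tweet", 3: "third tweet"}

results = {}
failures = {}

for tweet_id in [1, 2, 3]:
    try:
        # A missing ID raises KeyError, which plays the role of an API error
        results[tweet_id] = available[tweet_id]
    except KeyError as error:
        # Record the failure and carry on with the next ID
        failures[tweet_id] = error

print(results)
print(failures)
```

Catching a specific exception type (here `KeyError`) rather than every possible error keeps unrelated bugs from being silently swallowed.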

In [13]:
fails_dict

{888202515573088257: tweepy.errors.NotFound('404 Not Found\n144 - No status found with that ID.')}

#### Reading the data from the txt file into a dataframe

Because each line of the file holds one JSON object, the data can be read into a dataframe directly using `pd.read_json` with `lines=True`.

In [15]:
# read the data into a dataframe using pd.read_json
df = pd.read_json('tweet_json.txt', lines = True)
df.head()

| | created_at | id | id_str | full_text | truncated | display_text_range | entities | extended_entities | source | in_reply_to_status_id | ... | place | contributors | is_quote_status | retweet_count | favorite_count | favorited | retweeted | possibly_sensitive | possibly_sensitive_appealable | lang |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017-08-01 16:23:56+00:00 | 892420643555336193 | 892420643555336192 | This is Phineas. He's a mystical boy. Only eve... | False | [0, 85] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'media': [{'id': 892420639486877696, 'id_str'...` | `<a href="http://twitter.com/download/iphone" r...` | | ... | | | False | 7009 | 33806 | False | False | False | False | en |
| 1 | 2017-08-01 00:17:27+00:00 | 892177421306343426 | 892177421306343424 | This is Tilly. She's just checking pup on you.... | False | [0, 138] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'media': [{'id': 892177413194625024, 'id_str'...` | `<a href="http://twitter.com/download/iphone" r...` | | ... | | | False | 5301 | 29330 | False | False | False | False | en |
| 2 | 2017-07-31 00:18:03+00:00 | 891815181378084864 | 891815181378084864 | This is Archie. He is a rare Norwegian Pouncin... | False | [0, 121] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'media': [{'id': 891815175371796480, 'id_str'...` | `<a href="http://twitter.com/download/iphone" r...` | | ... | | | False | 3481 | 22051 | False | False | False | False | en |
| 3 | 2017-07-30 15:58:51+00:00 | 891689557279858688 | 891689557279858688 | This is Darla. She commenced a snooze mid meal... | False | [0, 79] | `{'hashtags': [], 'symbols': [], 'user_mentions...` | `{'media': [{'id': 891689552724799489, 'id_str'...` | `<a href="http://twitter.com/download/iphone" r...` | | ... | | | False | 7225 | 36938 | False | False | False | False | en |
| 4 | 2017-07-29 16:00:24+00:00 | 891327558926688256 | 891327558926688256 | This is Franklin. He would like you to stop ca... | False | [0, 138] | `{'hashtags': [{'text': 'BarkWeek', 'indices': ...` | `{'media': [{'id': 891327551943041024, 'id_str'...` | `<a href="http://twitter.com/download/iphone" r...` | | ... | | | False | 7760 | 35272 | False | False | False | False | en |


Another way to read the data is to:
1. loop through each line of the file
2. create a dictionary of the data you need from that line
3. append the dictionary to a list, so that you have a list of dictionaries once you've looped through the entire file
4. use the `pd.DataFrame` constructor to create a dataframe from the list of dictionaries

In [19]:
# open the file in read mode
with open('tweet_json.txt','r') as file:
    # loop through each line in the file
    for line in file:
        # use json.loads() method to convert the json string to a python dictionary
        print(type(json.loads(line)))
        break

<class 'dict'>


In [24]:
with open('tweet_json.txt', 'r') as file:
    for line in file:
        tweet = json.loads(line)
        print(f"The value of the full_text key is: {tweet['full_text']}")
        print(f"The other keys in the dictionary are: {tweet.keys()}")
        break

The value of the full_text key is: This is Phineas. He's a mystical boy. Only ever appears in the hole of a donut. 13/10 https://t.co/MgUWQ76dJU
The other keys in the dictionary are: dict_keys(['created_at', 'id', 'id_str', 'full_text', 'truncated', 'display_text_range', 'entities', 'extended_entities', 'source', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'in_reply_to_screen_name', 'user', 'geo', 'coordinates', 'place', 'contributors', 'is_quote_status', 'retweet_count', 'favorite_count', 'favorited', 'retweeted', 'possibly_sensitive', 'possibly_sensitive_appealable', 'lang'])


With this knowledge, you should be able to pick the keys you want, write their values to a dictionary, append each dictionary to a list and then create a dataframe from that list.
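One possible way to put those steps together is sketched below. It is not the project's reference solution: the two records in `sample_tweets` are hypothetical and are written to a temporary file so that the example runs on its own, and the choice of keys (`id`, `retweet_count`, `favorite_count`) is just an assumption about which columns are wanted.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample records standing in for lines of tweet_json.txt
sample_tweets = [
    {"id": 1, "retweet_count": 10, "favorite_count": 25, "lang": "en"},
    {"id": 2, "retweet_count": 3, "favorite_count": 7, "lang": "en"},
]

# Write one JSON object per line, the same layout used for tweet_json.txt
path = os.path.join(tempfile.mkdtemp(), "sample_tweets.txt")
with open(path, "w") as outfile:
    for record in sample_tweets:
        json.dump(record, outfile)
        outfile.write("\n")

# Read the file back line by line, keeping only the keys of interest
rows = []
with open(path, "r") as infile:
    for line in infile:
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })

# Build the dataframe from the list of dictionaries
tweets_df = pd.DataFrame(rows)
print(tweets_df)
```

With the real file, only the path and the selected keys would need to change.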

## Case Study: Punch Newspaper

In this case study, we'll gather data on the most recent tweets of a popular newspaper company in Nigeria. The goal is to analyze the data to determine which kinds of news their audience engages with the most.
<br>
With the data from Twitter alone, it's difficult to determine what kind of news is in a tweet. However, the news tags can be taken from the news article on the newspaper's website. For this reason, this case study is divided into two parts:
1. Gathering data from Twitter
    * Fields such as the retweet count, favorite count and link to the full news article will be gathered and written to a dataframe
2. Scraping data from the news website
    * Using the URLs collected in the first stage, we will download the HTML content of each article page using the requests library. Then we will scrape the tags using the Beautiful Soup library
<br>

While the scope of this case study ends at just these two tasks, students are encouraged to go further by doing the following:
1. Getting more tweets over a longer period (e.g. last one month or last three months)
2. Properly categorizing the tags gotten from the news article as the tags are too specific. You can categorize them manually or apply machine learning methods

In [25]:
# the user_timeline method helps us get the most recent tweets of the user specified
# by default, it returns the 20 most recent tweets
# we can increase the number of tweets returned but this is capped at 200
punch = api.user_timeline(screen_name='MobilePunch', count = 200)
print(punch[0])

Status(_api=<tweepy.api.API object at 0x000001AEF7EC3CC0>, _json={'created_at': 'Mon Jun 27 09:17:14 +0000 2022', 'id': 1541349926877241346, 'id_str': '1541349926877241346', 'text': 'After Awoniyi, Forest eye Aribo https://t.co/AlUArBF9un', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/AlUArBF9un', 'expanded_url': 'https://punchng.com/after-awoniyi-forest-eye-aribo/?utm_term=Autofeed&utm_medium=Social&utm_source=Twitter#Echobox=1656319166', 'display_url': 'punchng.com/after-awoniyi-…', 'indices': [32, 55]}]}, 'source': '<a href="https://www.echobox.com" rel="nofollow">Echobox</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 24291371, 'id_str': '24291371', 'name': 'Punch Newspapers', 'screen_name': 'MobilePunch', 'location': 'Lagos, Nigeria', 'description': 'This is the official Twitter 

The [`get_status()`](https://www.geeksforgeeks.org/python-api-get_status-in-tweepy/) method returns a single Python object for one tweet. The [`user_timeline()`](https://www.geeksforgeeks.org/python-api-user_timeline-in-tweepy/) method returns a list of such objects, one for each tweet.

In [26]:
# get the attributes of the python object returned
print(f'The attributes in this python object are:{dir(punch[0])}')


The attributes in this python object are:['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__slots__', '__str__', '__subclasshook__', '__weakref__', '_api', '_json', 'author', 'contributors', 'coordinates', 'created_at', 'destroy', 'entities', 'favorite', 'favorite_count', 'favorited', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'parse', 'parse_list', 'place', 'possibly_sensitive', 'retweet', 'retweet_count', 'retweeted', 'retweets', 'source', 'source_url', 'text', 'truncated', 'user']


##### <b>Task 1:</b>
Create a dataframe from this data. The dataframe should have the following columns: `tweet_id`, `favorite_count`, `retweet_count`, `article_link`

Just like in the project, you can work with the JSON data or read the attributes you need directly from each tweet object.

Working with the attributes directly...

In [28]:
for tweet in punch:
    print(tweet.entities)
    break

{'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/AlUArBF9un', 'expanded_url': 'https://punchng.com/after-awoniyi-forest-eye-aribo/?utm_term=Autofeed&utm_medium=Social&utm_source=Twitter#Echobox=1656319166', 'display_url': 'punchng.com/after-awoniyi-…', 'indices': [32, 55]}]}


In [29]:
punch_list = []
no_url = []
for tweet in punch:
    try:
        tweet_id = tweet.id
        favorite_count = tweet.favorite_count
        retweet_count = tweet.retweet_count
        link = tweet.entities['urls'][0]['expanded_url']
        punch_list.append({'tweet_id':tweet_id,'favorite_count':favorite_count, 'retweet_count':retweet_count, 'link':link})

    # tweets without a link have an empty 'urls' list, which raises an IndexError
    except IndexError as e:
        no_url.append({tweet.id:e})

punch_list[0]

{'tweet_id': 1541349926877241346,
 'favorite_count': 18,
 'retweet_count': 1,
 'link': 'https://punchng.com/after-awoniyi-forest-eye-aribo/?utm_term=Autofeed&utm_medium=Social&utm_source=Twitter#Echobox=1656319166'}

In [38]:
punch_df = pd.DataFrame(punch_list)
punch_df.head()

| | tweet_id | favorite_count | retweet_count | link |
|---|---|---|---|---|
| 0 | 1541349926877241346 | 18 | 1 | https://punchng.com/after-awoniyi-forest-eye-a... |
| 1 | 1541347086679130112 | 45 | 5 | https://punchng.com/moses-trains-alone-after-s... |
| 2 | 1541344281398546432 | 31 | 2 | https://punchng.com/four-psl-clubs-battle-for-... |
| 3 | 1541342051287142400 | 31 | 2 | https://punchng.com/discospayment-to-gencos-dr... |
| 4 | 1541339143229374464 | 14 | 9 | https://punchng.com/telecom-firms-operating-co... |


In [30]:
no_url

[{1541330746098569217: IndexError('list index out of range')},
 {1541292923458928640: IndexError('list index out of range')},
 {1541284430190452737: IndexError('list index out of range')},
 {1541011103181930496: IndexError('list index out of range')},
 {1541010866941968385: IndexError('list index out of range')},
 {1541010439886209026: IndexError('list index out of range')},
 {1541010072867934209: IndexError('list index out of range')},
 {1540993251548831744: IndexError('list index out of range')},
 {1540993239372668929: IndexError('list index out of range')},
 {1540930651330846721: IndexError('list index out of range')},
 {1540920299889299458: IndexError('list index out of range')},
 {1540754718493769729: IndexError('list index out of range')},
 {1540748601860694017: IndexError('list index out of range')}]

Following the pattern from the project...

In [32]:
# write the json data to a text file
with open('punch_tweets.txt','w') as outfile:
    for tweet in punch:
        json.dump(tweet._json, outfile)
        outfile.write('\n')

In [None]:
# read data from the text file into a dataframe
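
One way this cell could be filled in is sketched below, assuming the file holds one JSON object per line as written in the previous cell. To keep the example runnable on its own, it writes two hypothetical records to a temporary file instead of reading `punch_tweets.txt`; the helper `first_expanded_url` is an illustrative name, not part of pandas or tweepy.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical records that mimic the shape of lines in punch_tweets.txt
sample = [
    {"id": 101, "favorite_count": 18, "retweet_count": 1,
     "entities": {"urls": [{"expanded_url": "https://example.com/story-a"}]}},
    {"id": 102, "favorite_count": 4, "retweet_count": 0,
     "entities": {"urls": []}},  # a tweet without a link
]

path = os.path.join(tempfile.mkdtemp(), "sample_punch_tweets.txt")
with open(path, "w") as outfile:
    for record in sample:
        outfile.write(json.dumps(record) + "\n")

# lines=True tells pandas that each line holds one JSON object
raw_df = pd.read_json(path, lines=True)


def first_expanded_url(entities):
    """Return the first expanded URL, or None when the tweet has no link."""
    urls = entities.get("urls", [])
    return urls[0]["expanded_url"] if urls else None


# Keep only the columns asked for in Task 1
sample_df = pd.DataFrame({
    "tweet_id": raw_df["id"],
    "favorite_count": raw_df["favorite_count"],
    "retweet_count": raw_df["retweet_count"],
    "article_link": raw_df["entities"].apply(first_expanded_url),
})
print(sample_df)
```

Returning `None` for tweets without a link keeps every tweet in the dataframe; dropping those rows afterwards is an equally reasonable choice.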


#### How the URL was extracted from the Twitter data returned by the `user_timeline` method

In [33]:
print(punch[0])

Status(_api=<tweepy.api.API object at 0x000001AEF7EC3CC0>, _json={'created_at': 'Mon Jun 27 09:17:14 +0000 2022', 'id': 1541349926877241346, 'id_str': '1541349926877241346', 'text': 'After Awoniyi, Forest eye Aribo https://t.co/AlUArBF9un', 'truncated': False, 'entities': {'hashtags': [], 'symbols': [], 'user_mentions': [], 'urls': [{'url': 'https://t.co/AlUArBF9un', 'expanded_url': 'https://punchng.com/after-awoniyi-forest-eye-aribo/?utm_term=Autofeed&utm_medium=Social&utm_source=Twitter#Echobox=1656319166', 'display_url': 'punchng.com/after-awoniyi-…', 'indices': [32, 55]}]}, 'source': '<a href="https://www.echobox.com" rel="nofollow">Echobox</a>', 'in_reply_to_status_id': None, 'in_reply_to_status_id_str': None, 'in_reply_to_user_id': None, 'in_reply_to_user_id_str': None, 'in_reply_to_screen_name': None, 'user': {'id': 24291371, 'id_str': '24291371', 'name': 'Punch Newspapers', 'screen_name': 'MobilePunch', 'location': 'Lagos, Nigeria', 'description': 'This is the official Twitter 

The news article link (it usually starts with '`https://punchng.com/...`') is stored under the `expanded_url` key of the first item in the `urls` list, which is nested in the `entities` attribute (or the `entities` key of the JSON data).

In [35]:
# working with the entities attribute directly
for tweet in punch:
    entities_dict = tweet.entities
    print(f"2: {entities_dict['urls']}")
    print(f"3: {entities_dict['urls'][0]}")
    print(f"4: {entities_dict['urls'][0]['expanded_url']}")
    url = entities_dict['urls'][0]['expanded_url']
    print(url)
    break

2: [{'url': 'https://t.co/AlUArBF9un', 'expanded_url': 'https://punchng.com/after-awoniyi-forest-eye-aribo/?utm_term=Autofeed&utm_medium=Social&utm_source=Twitter#Echobox=1656319166', 'display_url': 'punchng.com/after-awoniyi-…', 'indices': [32, 55]}]
3: {'url': 'https://t.co/AlUArBF9un', 'expanded_url': 'https://punchng.com/after-awoniyi-forest-eye-aribo/?utm_term=Autofeed&utm_medium=Social&utm_source=Twitter#Echobox=1656319166', 'display_url': 'punchng.com/after-awoniyi-…', 'indices': [32, 55]}
4: https://punchng.com/after-awoniyi-forest-eye-aribo/?utm_term=Autofeed&utm_medium=Social&utm_source=Twitter#Echobox=1656319166
https://punchng.com/after-awoniyi-forest-eye-aribo/?utm_term=Autofeed&utm_medium=Social&utm_source=Twitter#Echobox=1656319166


#### Next Steps
1. Use requests to get the HTML content of each news article
2. Inspect the HTML to find out which HTML elements contain the news tags
3. Use Beautiful Soup to scrape the news tag data
4. Write the tags to a dataframe
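
The parsing step can be sketched as follows. This is only an illustration under loud assumptions: `sample_html` is a made-up page, and the `tag-link` class is hypothetical, so the real selector has to be found by inspecting an actual article page first. In the notebook, the HTML string would come from something like `requests.get(url).text` rather than from a literal.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded article page.
# The "tag-link" class is an assumption, not the site's real markup.
sample_html = """
<html><body>
  <h1>Sample headline</h1>
  <div class="tags">
    <a class="tag-link" href="/tag/sports">Sports</a>
    <a class="tag-link" href="/tag/transfers">Transfers</a>
  </div>
</body></html>
"""


def extract_tags(html):
    """Return the text of every link that carries the assumed tag class."""
    soup = BeautifulSoup(html, "html.parser")
    return [link.get_text(strip=True) for link in soup.find_all("a", class_="tag-link")]


tags = extract_tags(sample_html)
print(tags)
```

Keeping the parsing in its own function makes it easy to apply to every URL in the dataframe and to store the returned list in a new column.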