# Web Data Scraping

[Spring 2021 ITSS Mini-Course](https://www.colorado.edu/cartss/programs/interdisciplinary-training-social-sciences-itss/mini-course-web-data-scraping) — ARSC 5040  
[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyrighted and distributed under an [MIT License](https://opensource.org/licenses/MIT)

## Class outline

* **Week 1**: Introduction to Jupyter, browser console, structured data, ethical considerations
* **Week 2**: Scraping HTML with `requests` and `BeautifulSoup`
* **Week 3**: Scraping web data with Selenium
* **Week 4**: Scraping an API with `requests` and `json`, Wikipedia and Reddit
* **Week 5**: Scraping data from Twitter

## Acknowledgements

Thank you also to Professor [Terra McKinnish](https://www.colorado.edu/economics/people/faculty/terra-mckinnish) for coordinating the ITSS seminars.

## Class 5 goals

* Sharing accomplishments and challenges with last week's material
* Using the `twitter` wrapper library to handle authentication
* Retrieving and parsing a single tweet
* Rehydrating a list of tweet IDs
* Pulling a user's timeline
* Pulling a user's friend and follower lists
* Using the search endpoint of the API
* Listening to the streaming API
* Detecting bot accounts using IU's Bot-o-Meter

Start with our usual suspect packages.

In [None]:
# Lets Jupyter Notebook display images in-line
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sb

# Import our helper libraries
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import json
import requests
from bs4 import BeautifulSoup
import time
from urllib.parse import quote, unquote

We're going to use a library called VADER to help with sentiment analysis of tweets. We need to do some setup first! You should only need to do this step once.

In [None]:
import nltk
nltk.download('vader_lexicon')

Now try to import.

In [None]:
# Import the VADER sentiment analyzer
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Instantiate the model
sia = SentimentIntensityAnalyzer()

## Installing a Twitter API wrapper

As was the case with Reddit, we will take advantage of a wrapper library to handle the heavy lifting of authenticating, making specific requests, handling rate-limiting, *etc*. There is no shortage of Python wrappers for the Twitter API, but the most popular are:

* [twitter](https://github.com/python-twitter-tools/twitter)
* [python-twitter](https://python-twitter.readthedocs.io/en/latest/)
* [Tweepy](http://docs.tweepy.org/en/latest/)
* [Twython](https://twython.readthedocs.io/en/latest/)

There are other wrapper libraries linked from the [Twitter developer utilities documentation](https://developer.twitter.com/en/docs/twitter-api/tools-and-libraries).

I'm going to use `twitter` just because it is very lightweight and replicates the official Twitter API's design.

You will need to install this since it does not come with conda by default. At the Terminal:

`pip install twitter`

Once you've installed it, you can import the `twitter` wrapper library.

In [None]:
import twitter

## Authenticating
I don't want to share my Twitter credentials with the world, so I load them from a file on my local machine. If you want to do the same, the file should take this format:

```
{"consumer_key":"API key",
 "consumer_secret":"API secret key",
 "access_token_key":"Access token",
 "access_token_secret":"Access token secret"
}
```

In [None]:
# Load my key information from disk
with open('twitter_keys.json','r') as f:
    twitter_keys = json.load(f)

# Authenticate with the Twitter API using the twitter_keys dictionary
# Note that OAuth takes the access token and secret first, then the consumer key and secret
api = twitter.Twitter(auth=twitter.OAuth(twitter_keys['access_token_key'],
                                         twitter_keys['access_token_secret'],
                                         twitter_keys['consumer_key'],
                                         twitter_keys['consumer_secret']),
                     )
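Before authenticating, it can help to confirm the key file actually has all four fields. This is only a minimal sketch: `load_twitter_keys` and `REQUIRED_KEYS` are names I made up for illustration, and the demo writes a throwaway file with placeholder values rather than real credentials.

```python
import json
import os
import tempfile

# The four field names used in the JSON format shown above
REQUIRED_KEYS = ("consumer_key", "consumer_secret",
                 "access_token_key", "access_token_secret")

def load_twitter_keys(path):
    """Load a JSON credentials file and check that all four fields are present."""
    with open(path, "r") as f:
        keys = json.load(f)
    # Treat missing or empty values as a problem
    missing = [k for k in REQUIRED_KEYS if not keys.get(k)]
    if missing:
        raise KeyError("Missing credential fields: " + ", ".join(missing))
    return keys

# Demonstrate with a temporary file containing placeholder values
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "twitter_keys.json")
    with open(demo_path, "w") as f:
        json.dump({k: "placeholder" for k in REQUIRED_KEYS}, f)
    demo_keys = load_twitter_keys(demo_path)

print(sorted(demo_keys))
```

Failing early with a clear message is easier to debug than an authentication error from the API.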

Alternatively, you can just enter your keys directly into the `OAuth` function. The `tweet_mode='extended'` parameter is passed with each request rather than here.

In [None]:
api = twitter.Twitter(auth=twitter.OAuth('Access token',
                                         'Access token secret',
                                         'API key',
                                         'API secret key'))

Test that you can connect to the API. Retrieve *Daily Camera* journalist [@mitchellbyars](https://twitter.com/mitchellbyars)'s account information.

In [None]:
api.users.show(screen_name='mitchellbyars')

We can also retrieve his most recent tweets. Obnoxiously, you also need to add a parameter `tweet_mode='extended'` ([docs](https://developer.twitter.com/en/docs/twitter-api/tweets/timelines/migrate/standard-to-twitter-api-v2)) to get the full 280 characters of text.

In [None]:
api.statuses.user_timeline(screen_name='mitchellbyars',
                           count=5,
                           tweet_mode='extended')
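Because the text lives under `text` in the default mode and under `full_text` in extended mode, a small helper can save some `KeyError`s later. This is a sketch with a hypothetical function name (`get_full_text`) that works on plain dictionaries shaped like the payloads above; the sample tweets are made up.

```python
def get_full_text(status):
    """Return the longest available text for a tweet payload (a plain dict)."""
    # For retweets, the top-level text can be cut off, so prefer the parent tweet
    if "retweeted_status" in status:
        status = status["retweeted_status"]
    # 'full_text' appears with tweet_mode='extended', otherwise fall back to 'text'
    return status.get("full_text", status.get("text", ""))

# Made-up payloads for illustration
default_mode = {"text": "A short tweet"}
extended_mode = {"full_text": "A much longer tweet that uses the full character limit"}
retweet = {"full_text": "RT @someone: the start of a long tw",
           "retweeted_status": {"full_text": "the start of a long tweet, in full"}}

for example in (default_mode, extended_mode, retweet):
    print(get_full_text(example))
```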

## Getting the payload of a single tweet

Wikipedia helpfully maintains a [List of most-retweeted tweets](https://en.wikipedia.org/wiki/List_of_most-retweeted_tweets). Go to one of the tweets and pull out the ID at the end of the URL.

In [None]:
tweet = api.statuses.show(_id='849813577770778624')

In [None]:
tweet

Access the attributes of this dictionary.

In [None]:
# When was the tweet created
tweet['created_at']

In [None]:
# Number of favorites (at the time of the API call)
tweet['favorite_count']

In [None]:
# Number of retweets (at the time of the API call)
tweet['retweet_count']

In [None]:
# Text of the tweet
tweet['text']

In [None]:
# Location (if it is geo-located)
tweet['geo']

In [None]:
# List of hashtags present
tweet['entities']['hashtags']

In [None]:
# Tweet ID
tweet['id']

In [None]:
# A guess at the language of the tweet
tweet['lang']
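The `created_at` field comes back as a string, so it needs parsing before you can do any date arithmetic. Later in this notebook we use `pd.to_datetime` for this, but here is a sketch with only the standard library. The format string is my reading of the timestamp layout in these payloads, and it assumes an English locale for the day and month abbreviations.

```python
from datetime import datetime, timezone

# Layout such as "Wed Oct 10 20:19:24 +0000 2018" (assumes an English locale)
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"

def parse_created_at(created_at):
    """Convert a created_at string into a timezone-aware datetime."""
    return datetime.strptime(created_at, TWITTER_TIME_FORMAT)

example = parse_created_at("Wed Oct 10 20:19:24 +0000 2018")
print(example.isoformat())
```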

These next two attributes return nested `User` and `Media` objects (dictionaries with their own keys) rather than simple strings, ints, *etc*.

In [None]:
tweet['user']

We can access attributes of this `User` object.

In [None]:
# Screen name of the user
tweet['user']['screen_name']

In [None]:
# Displayed name of the user
tweet['user']['name']

In [None]:
# User biography
tweet['user']['description']

In [None]:
# Account creation time
tweet['user']['created_at']

In [None]:
# Self-reported location
tweet['user']['location']

In [None]:
# Number of tweets from the user
tweet['user']['statuses_count']

In [None]:
# Number of followers
tweet['user']['followers_count']

In [None]:
# Number of friends (accounts this account follows, followees, etc.)
tweet['user']['friends_count']

Similarly, each `Media` object inside the `media` list contains information about the type and the URLs of the media attached to this tweet. Twitter's documentation notes that when a tweet has multiple images, the complete list lives under `extended_entities` rather than `entities`.

In [None]:
tweet['entities']['media'][0]['media_url']
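Not every tweet has media, and indexing `['media'][0]` on one that doesn't will raise a `KeyError`. Here is a defensive sketch: `media_urls` is a hypothetical helper, the sample payloads are made up, and preferring `extended_entities` when it exists reflects my understanding of where Twitter lists all attached media.

```python
def media_urls(status):
    """Return a list of media URLs for a tweet payload, or an empty list."""
    # Prefer 'extended_entities' when present, otherwise fall back to 'entities'
    entities = status.get("extended_entities") or status.get("entities") or {}
    return [m["media_url"] for m in entities.get("media", [])]

# Made-up payloads for illustration
with_media = {"entities": {"hashtags": [],
                           "media": [{"type": "photo",
                                      "media_url": "http://example.com/a.jpg"}]}}
without_media = {"entities": {"hashtags": []}}

print(media_urls(with_media))
print(media_urls(without_media))
```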

## Rehydrating a list of tweets

Twitter's Terms of Service do not allow datasets of statuses to be shared, but researchers are permitted to share the identifiers for tweets in their datasets. Researchers then need to "rehydrate" these statuses by requesting the full payloads from Twitter's API. A list of resources with links to tweet IDs used in research:

* [DocNow's Tweet ID Datasets](https://www.docnow.io/catalog/)
* [FollowTheHashtag's Free Twitter Datasets](http://followthehashtag.com/datasets/)
* [AcademicTorrents](http://academictorrents.com/browse.php?search=twitter)
* [FiveThirtyEight's Russian Troll Tweets](https://github.com/fivethirtyeight/russian-troll-tweets/)
* [Harvard Dataverse](https://dataverse.harvard.edu/dataverse/harvard?q=twitter&types=datasets&sort=score&order=desc&page=1)

This has some privacy benefits: Twitter's [compliance statement](https://developer.twitter.com/en/docs/twitter-api/enterprise/compliance-firehose-api/overview) states that users should retain the option to delete tweets or their accounts, and this rehydration arrangement (theoretically) prevents their tweet content from circulating without their consent. In practice, many of the largest Twitter corpora come from the streaming API (more on that later in this notebook) and Twitter has a "[compliance stream](https://developer.twitter.com/en/docs/tweets/compliance/api-reference/compliance-firehose)" that indicates when a user has deleted a tweet or protected their account, or when Twitter has suspended an account or withheld a status, *etc*., in which case the tweet should be removed from your streaming dataset as well. The Sunlight Foundation and ProPublica maintain a list of deleted tweets from politicians called [Politwoops](https://projects.propublica.org/politwoops/).

I am going to use a [list of tweets made by Senators in the 115th Congress](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/UIVHQR) collected by Justin Littman in 2017. Load the text file:

In [None]:
with open('senators-115.txt','r') as f:
    senators_tweet_ids = [tweet_id.strip() for tweet_id in f.readlines()]
    
"There are {0:,} tweet IDs in the file.".format(len(senators_tweet_ids))

Look at the first 10 tweet IDs.

In [None]:
senators_tweet_ids[:10]

Use the `statuses.lookup` API endpoint, which accepts a string containing up to 100 comma-separated tweet IDs. We will use the "`map=True`" parameter to keep track of any tweets that were not returned (which should be `None`s rather than `Status`es).

In [None]:
senators_10_tweets = api.statuses.lookup(_id=','.join(senators_tweet_ids[:10]),
                                         map=True,
                                         tweet_mode='extended'
                                        )
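To rehydrate more than one request's worth of IDs, you need to split the list into batches of at most 100 and join each batch with commas. A minimal sketch, where `chunk_ids` is a made-up helper name and the IDs are fake:

```python
def chunk_ids(tweet_ids, size=100):
    """Yield comma-separated strings of at most `size` tweet IDs each."""
    for start in range(0, len(tweet_ids), size):
        yield ",".join(tweet_ids[start:start + size])

# 250 fake IDs stand in for the contents of the text file
fake_ids = [str(1000 + i) for i in range(250)]
batches = list(chunk_ids(fake_ids))

# Number of batches and the number of IDs in each
print(len(batches), [b.count(",") + 1 for b in batches])
```

Each string in `batches` could then be passed as the `_id` argument to one lookup request.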

Inspect.

In [None]:
list(senators_10_tweets['id'].values())[0]

Now for a bit of accounting on rate limits. According to the API documentation for [get statuses/lookup](https://developer.twitter.com/en/docs/tweets/post-and-engage/api-reference/get-statuses-lookup), you can ask for up to 100 tweets per request and you can make 900 requests per 15-minute window. This means you can theoretically rehydrate 90,000 tweets per 15 minutes, or 360,000 tweets per hour. So it would take approximately 90 minutes to rehydrate all 500,000 of these senators' tweets.
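The same accounting can be written as a short function. This is a rough sketch: `rehydration_minutes` is a hypothetical helper, the defaults are the limits quoted above, and it rounds up to whole 15-minute windows, so it is a conservative estimate.

```python
import math

def rehydration_minutes(n_tweets, per_request=100,
                        requests_per_window=900, window_minutes=15):
    """Estimate the minutes needed to rehydrate n_tweets, in whole windows."""
    requests_needed = math.ceil(n_tweets / per_request)
    windows_needed = math.ceil(requests_needed / requests_per_window)
    return windows_needed * window_minutes

print(rehydration_minutes(500_000))
```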

Access the field with the rate limit status for looking up statuses.

In [None]:
api.application.rate_limit_status()['resources']['statuses']['/statuses/lookup']

Parse out when the API limit will reset.

In [None]:
# Access the reset field
reset_datetime = api.application.rate_limit_status()['resources']['statuses']['/statuses/lookup']['reset']

# Convert from UNIX time to something interpretable
print(datetime.fromtimestamp(reset_datetime))
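If you run out of requests, the `reset` field tells you how long to wait. Here is a sketch of that calculation on a dictionary shaped like the rate limit status above (with `remaining` and `reset` keys). The `wait_time` name and the sample numbers are made up, and `now` is passed in explicitly so the example doesn't depend on the clock.

```python
import time

def wait_time(limit_info, now=None):
    """Seconds to pause before the next call, given 'remaining' and 'reset'."""
    # No need to wait while requests remain in this window
    if limit_info["remaining"] > 0:
        return 0.0
    now = time.time() if now is None else now
    # Never return a negative wait if the reset time has already passed
    return max(0.0, limit_info["reset"] - now)

exhausted = {"limit": 900, "remaining": 0, "reset": 1_600_000_900}
print(wait_time(exhausted, now=1_600_000_000))
```

In a long-running loop you could pass the result to `time.sleep` before making the next request.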

There should be 10 `Status` objects returned by the API.

In [None]:
len(senators_10_tweets['id'])

Write a loop to go through the ten tweets and print out information.

In [None]:
for status in senators_10_tweets['id'].values():
    # With map=True, tweets that could not be retrieved come back as None
    if status is None:
        continue
    screen_name = status['user']['screen_name']
    created = pd.to_datetime(status['created_at'])
    text = status['full_text']
    retweets = status['retweet_count']
    formatted_str = '{0} on {1} said {2}, which received {3} retweets.\n'
    print(formatted_str.format(screen_name,created,text,retweets))

Alternatively, extract the relevant fields, save these as a list of dictionaries, and convert the list of dictionaries to a DataFrame. Note that when an account retweets another account, a second `Status` object is embedded under the "`.retweeted_status`" attribute that contains the parent tweet's information. In these cases, the "`.created_at`" from the Senator's account is when s/he retweeted the status and the "`.created_at`" for the "`.retweeted_status`" is when the parent tweet was first posted.

In [None]:
list(senators_10_tweets['id'].values())[1]

In [None]:
# Make an empty list to store the data after we process it below
statuses_list = []

# Loop through each of the status objects
for status in senators_10_tweets['id'].values():
    
    # Check to make sure the status is not empty/None
    if status is not None:
        
        # Create an empty dictionary to store relevant fields
        payload = {}
        payload['id'] = status['id_str']
        payload['screen_name'] = status['user']['screen_name']
        payload['created'] = pd.to_datetime(status['created_at'])
        payload['retweets'] = status['retweet_count']
        payload['favorites'] = status['favorite_count']

        # If an account retweets another account, we should store that information
        if 'retweeted_status' in status:
            payload['text'] = status['retweeted_status']['full_text']
            payload['retweeted'] = True
            payload['retweeted_screen_name'] = status['retweeted_status']['user']['screen_name']
            payload['retweeted_created'] = status['retweeted_status']['created_at']
            
        # If there is no retweeted_status then it's not a retweet
        else:
            payload['text'] = status['full_text']
            payload['retweeted'] = False
            payload['retweeted_screen_name'] = False
            payload['retweeted_created'] = False

        # Store the payload dictionary in our list
        statuses_list.append(payload)
        
# Convert to a DataFrame
df = pd.DataFrame(statuses_list)

# Inspect
df.head()

## Pulling a user's timeline

In general, if you want to retrieve the tweets from a user's timeline, you can use Twitter's API to get the 3,200 most recent tweets from the [get statuses/user_timeline](https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html) API endpoint. This will include retweets of other statuses. We can retrieve up to 200 tweets per request and can make 900 requests per 15-minute window, so we can get 180,000 tweets per window or 720,000 tweets per hour. This means we could theoretically get up to 225 users' most recent 3,200 tweets per hour.

Alexandria Ocasio-Cortez has written 6,975 tweets on her personal account, "[AOC](https://twitter.com/aoc)". She also has an official account "[RepAOC](https://twitter.com/repaoc)", but this only has 14 tweets. Let's get her 3,200 most-recent tweets from the API. Disappointingly, `twitter` does not handle the "pagination" for us, so we can only ask for 200 tweets at a time and have to keep track of where the next request should start.

In [None]:
aoc_tweets = api.statuses.user_timeline(screen_name='aoc',
                                        count=200,
                                        tweet_mode='extended')

In [None]:
print("{0:,} tweets were returned.".format(len(aoc_tweets)))

The first tweet returned is the most recent tweet.

In [None]:
aoc_tweets[0]['created_at']

The last tweet returned (the 200th).

In [None]:
aoc_tweets[-1]['created_at']

We really care about the final tweet's ID so we can make an API query that asks for the next 200 statuses before the last tweet returned.

In [None]:
aoc_tweets[-1]['id']

Now we make the next query.

In [None]:
aoc_tweets2 = api.statuses.user_timeline(screen_name='aoc',
                                         count=200,
                                         tweet_mode='extended',
                                         include_rts=True,
                                         max_id=aoc_tweets[-1]['id'])

Check the first tweet in here.

In [None]:
aoc_tweets2[0]['id']

Compare to the last tweet in the first set (`aoc_tweets`).

In [None]:
aoc_tweets[-1]['id']

Let's also check in on how much of our API rate limit we've used.

In [None]:
api.application.rate_limit_status()['resources']['statuses']['/statuses/user_timeline']

Now we'll write a loop to get all 3,200 tweets.

In [None]:
# Start with the list of the 200 most-recent tweets
aoc_timeline_tweets = api.statuses.user_timeline(screen_name='aoc',
                                         count=200,
                                         tweet_mode='extended',
                                         include_rts=True)

# Initialize a counter so we don't go overboard with our requests
request_counter = 1

# 3,200 tweets at 200 per request should take 16 requests, but try for up to 20 :)
while request_counter < 20:
    # Get the oldest tweet id
    final_status_id = aoc_timeline_tweets[-1]['id']
    
    # Pass this tweet ID into the max_id parameter, minus 1 so we don't duplicate it
    aoc_timeline_tweets += api.statuses.user_timeline(screen_name='aoc',
                                                  count=200,
                                                  tweet_mode='extended',
                                                  include_rts=True,
                                                  max_id=final_status_id-1)
    
    # Increment our request_counter
    request_counter += 1

I just used a chunk of my API rate limit.

In [None]:
api.application.rate_limit_status()['resources']['statuses']['/statuses/user_timeline']

Somehow a few more tweets snuck in there.

In [None]:
len(aoc_timeline_tweets)

We can also abstract this into a function we could use for anyone's timeline.

In [None]:
def get_all_user_timeline(screen_name,count=200,include_rts=True,exclude_replies=False):
    
    # Start with the list of the 200 most-recent tweets
    timeline_tweets = api.statuses.user_timeline(screen_name = screen_name,
                                                 count = count,
                                                 tweet_mode = 'extended',
                                                 include_rts = include_rts,
                                                 exclude_replies = exclude_replies
                                                )

    # Initialize a counter so we don't go overboard with our requests
    request_counter = 1

    # 3,200 tweets at 200 per request should take 16 requests, but try for up to 20
    while request_counter < 20:
        # Get the oldest tweet id
        final_status_id = timeline_tweets[-1]['id']

        # Pass this tweet ID into the max_id parameter, minus 1 so we don't duplicate it
        timeline_tweets += api.statuses.user_timeline(screen_name = screen_name,
                                                      count = count,
                                                      tweet_mode = 'extended',
                                                      include_rts = include_rts,
                                                      exclude_replies = exclude_replies,
                                                      max_id = final_status_id-1)

        # Increment our request_counter
        request_counter += 1
        
    return timeline_tweets

Test this on [@joebiden](https://twitter.com/joebiden).

In [None]:
biden_tweets = get_all_user_timeline('joebiden')
len(biden_tweets)
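One limitation of the fixed-count loop is that it keeps making requests even after the timeline runs out. Here is a sketch of the same `max_id` pattern that stops on the first empty page. Everything in it is a stand-in so it runs offline: `paginate_by_max_id`, `fake_timeline_page`, and the 450-tweet fake timeline are made up for illustration, and the fake function plays the role of `api.statuses.user_timeline`.

```python
def paginate_by_max_id(fetch_page, max_requests=20):
    """Collect tweets page by page until a page comes back empty."""
    tweets = []
    max_id = None  # None means "start from the most recent tweet"
    for _ in range(max_requests):
        page = fetch_page(max_id)
        if not page:
            break  # nothing older is available, so stop requesting
        tweets.extend(page)
        # Ask next for tweets strictly older than the oldest one we have
        max_id = page[-1]["id"] - 1
    return tweets

# Stand-in for the API: a fake timeline of 450 tweets, newest first
FAKE_TIMELINE = [{"id": i} for i in range(450, 0, -1)]

def fake_timeline_page(max_id, count=200):
    """Return up to `count` fake tweets with IDs at or below max_id."""
    eligible = [t for t in FAKE_TIMELINE if max_id is None or t["id"] <= max_id]
    return eligible[:count]

collected = paginate_by_max_id(fake_timeline_page)
print(len(collected))
```

With the real API, `fetch_page` would be a small function that calls the timeline endpoint and only includes `max_id` when it is not `None`.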

I've added a lot more sugar into our loop to grab information about replies, user mentions, and hashtags.

In [None]:
# Make an empty list to store the data after we process it below
statuses_list = []

# Loop through each of the status objects
for status in biden_tweets:
    
    # Check to make sure the status is not empty/None
    if status is not None:
        
        # Create an empty dictionary to store relevant fields
        payload = {}
        payload['id'] = status['id_str']
        payload['screen_name'] = status['user']['screen_name']
        payload['created'] = pd.to_datetime(status['created_at'])
        payload['retweets'] = status['retweet_count']
        payload['favorites'] = status['favorite_count']
        payload['reply_screen_name'] = status['in_reply_to_screen_name']
        payload['reply_id'] = status['in_reply_to_status_id']
        payload['source'] = BeautifulSoup(status['source'], 'html.parser').text

        if len(status['entities']['user_mentions']) > 0:
            payload['user_mentions'] = '; '.join([m['screen_name'] for m in status['entities']['user_mentions']])
        else:
            payload['user_mentions'] = None
            
        if len(status['entities']['hashtags']) > 0:
            payload['hashtags'] = '; '.join([h['text'] for h in status['entities']['hashtags']])
        else:
            payload['hashtags'] = None
        
        # If an account retweets another account, we should store that information
        if 'retweeted_status' in status:
            payload['text'] = status['retweeted_status']['full_text']
            payload['retweeted'] = True
            payload['retweeted_screen_name'] = status['retweeted_status']['user']['screen_name']
            payload['retweeted_created'] = status['retweeted_status']['created_at']
            payload['retweeted_source'] = BeautifulSoup(status['retweeted_status']['source'], 'html.parser').text
            if len(status['retweeted_status']['entities']['hashtags']) > 0:
                payload['hashtags'] = '; '.join([h['text'] for h in status['retweeted_status']['entities']['hashtags']])
            else:
                payload['hashtags'] = None
        # If there is no retweeted_status then it's not a retweet
        else:
            payload['text'] = status['full_text']
            payload['retweeted'] = False
            payload['retweeted_screen_name'] = False
            payload['retweeted_created'] = False

        # Store the payload dictionary in our list
        statuses_list.append(payload)
        
# Convert to a DataFrame
df = pd.DataFrame(statuses_list)

# Inspect
df.head()

Convert the "created" column into a proper `datetime` object and extract the dates as another column.

In [None]:
df['timestamp'] = pd.to_datetime(df['created'])
df['date'] = df['timestamp'].apply(lambda x:x.date())
df['weekday'] = df['timestamp'].apply(lambda x:x.weekday())
df['hour'] = df['timestamp'].apply(lambda x:x.hour)

Make a plot of the number of tweets by date.

In [None]:
# Group by date and aggregate by number of tweets on that date
_s = df.groupby(pd.Grouper(key='timestamp',freq='1D')).agg({'id':len,'retweeted':'sum','reply_id':lambda x:sum(x.notnull())})

# Rename the columns and compute retweets and replies as a fraction of each day's tweets
_s.columns = ['Tweets','Retweets','Replies']
_s_frac = _s[['Retweets','Replies']].div(_s['Tweets'],axis=0).fillna(0)

# Make the plot
f,axs = plt.subplots(2,1,figsize=(8,6),sharex=True)
_s['Tweets'].rolling(3).mean().plot(ax=axs[0])
_s_frac.rolling(3).mean().plot(ax=axs[1],legend=False)

axs[0].legend(loc='center left',bbox_to_anchor=(1,.5))
axs[1].legend(loc='center left',bbox_to_anchor=(1,.5))
axs[0].set_ylabel('Count')
axs[1].set_ylabel('Fraction of tweets')

# Annotate the plot with lines corresponding to major events
for ax in axs:
    ax.axvline(pd.Timestamp('2020-04-08'),lw=3,c='k',alpha=.25) # Biden cinches
    ax.axvline(pd.Timestamp('2020-08-20'),lw=3,c='k',alpha=.25) # DNC speech
    ax.axvline(pd.Timestamp('2020-11-03'),lw=3,c='k',alpha=.25) # Election day
    ax.axvline(pd.Timestamp('2021-01-21'),lw=3,c='k',alpha=.25) # Swearing in
    
axs[0].text(x=pd.Timestamp('2020-04-08')+pd.Timedelta(3,'d'),y=47.5,s='Biden\ncinches',va='center')
axs[0].text(x=pd.Timestamp('2020-08-20')+pd.Timedelta(3,'d'),y=47.5,s='DNC\nspeech',va='center')
axs[0].text(x=pd.Timestamp('2020-11-03')+pd.Timedelta(3,'d'),y=47.5,s='Election',va='center')
axs[0].text(x=pd.Timestamp('2021-01-21')+pd.Timedelta(3,'d'),y=47.5,s='Sworn\nin',va='center')

f.tight_layout()
# f.savefig('joebiden_activity.png',dpi=300,bbox_inches='tight')

In [None]:
# Group by the date and aggregate by the sum of retweets and favorites for all tweets on that date
_s = df.groupby(pd.Grouper(key='timestamp',freq='1D')).agg({'retweets':'sum','favorites':'sum'})

# Make the plot
f,ax = plt.subplots(1,1,figsize=(8,4))
ax = _s.rolling(3).mean().plot(legend=False,lw=2,ax=ax)
ax.set_ylim((0,7000000))
ax.legend(loc='center left',bbox_to_anchor=(1,.5))
ax.set_title('Daily engagement with @joebiden tweets')

# Annotate the plot with lines corresponding to major events
ax.axvline(pd.Timestamp('2020-04-08'),lw=3,c='k',alpha=.25) # Biden cinches
ax.axvline(pd.Timestamp('2020-08-20'),lw=3,c='k',alpha=.25) # DNC speech
ax.axvline(pd.Timestamp('2020-11-03'),lw=3,c='k',alpha=.25) # Election day
ax.axvline(pd.Timestamp('2021-01-21'),lw=3,c='k',alpha=.25) # Swearing in
    
_y = 6.5e6
ax.text(x=pd.Timestamp('2020-04-08')+pd.Timedelta(3,'d'),y=_y,s='Biden\ncinches',va='center')
ax.text(x=pd.Timestamp('2020-08-20')+pd.Timedelta(3,'d'),y=_y,s='DNC\nspeech',va='center')
ax.text(x=pd.Timestamp('2020-11-03')+pd.Timedelta(3,'d'),y=_y,s='Election',va='center')
ax.text(x=pd.Timestamp('2021-01-21')+pd.Timedelta(3,'d'),y=_y,s='Sworn\nin',va='center')

f.tight_layout()
# f.savefig('joebiden_engagement.png',dpi=300,bbox_inches='tight')

We can also do a bit of sentiment analysis. You'll likely need to [install the NLTK data](https://www.nltk.org/data.html) for this to work. We are going to use the [VADER sentiment analysis tool](https://github.com/cjhutto/vaderSentiment) that was specifically trained for social media text: [see paper here](http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf).

In [None]:
# Get sentiment scores for each tweet's text
df['sentiment'] = df['text'].apply(lambda x:sia.polarity_scores(x)['compound'])
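The compound score runs from -1 (most negative) to +1 (most positive). If you want categories instead of a continuous score, the VADER authors suggest cutoffs of plus and minus 0.05. A sketch of that binning, where `label_sentiment` is a hypothetical helper and the scores are made-up numbers rather than output from the analyzer:

```python
def label_sentiment(compound, threshold=0.05):
    """Bin a compound score into positive, negative, or neutral."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Made-up compound scores for illustration
for score in (0.6, 0.0, -0.3):
    print(score, label_sentiment(score))
```

On the DataFrame this could be applied with `df['sentiment'].apply(label_sentiment)`.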

Plot out the daily sentiment of tweets with major events annotated.

In [None]:
# Group by the date and aggregate by the average sentiment for all tweets on that date
_s = df.groupby(pd.Grouper(key='timestamp',freq='1D')).agg({'sentiment':'mean'})

# Make the plot with a 3-day rolling average
f,ax = plt.subplots(1,1,figsize=(8,4))
ax = _s.rolling(3).mean().ffill().plot(legend=False,ax=ax)
ax.set_ylim((-.3,.8))
ax.axhline(0,ls='--',c='k',lw=1)
ax.set_title('Sentiment of @joebiden tweets')
ax.set_ylabel('Compound VADER score')

# Annotate the plot with lines corresponding to major events
ax.axvline(pd.Timestamp('2020-04-08'),lw=3,c='k',alpha=.25) # Biden cinches
ax.axvline(pd.Timestamp('2020-08-20'),lw=3,c='k',alpha=.25) # DNC speech
ax.axvline(pd.Timestamp('2020-11-03'),lw=3,c='k',alpha=.25) # Election day
ax.axvline(pd.Timestamp('2021-01-21'),lw=3,c='k',alpha=.25) # Swearing in
    
_y = .7
ax.text(x=pd.Timestamp('2020-04-08')+pd.Timedelta(3,'d'),y=_y,s='Biden\ncinches',va='center')
ax.text(x=pd.Timestamp('2020-08-20')+pd.Timedelta(3,'d'),y=_y,s='DNC\nspeech',va='center')
ax.text(x=pd.Timestamp('2020-11-03')+pd.Timedelta(3,'d'),y=_y,s='Election',va='center')
ax.text(x=pd.Timestamp('2021-01-21')+pd.Timedelta(3,'d'),y=_y,s='Sworn\nin',va='center')

f.tight_layout()
# f.savefig('joebiden_sentiment.png',dpi=300,bbox_inches='tight')

Compute the engagement for @joebiden tweets, ignoring retweets and replies, and normalizing for total tweet activity on that day.

In [None]:
c1 = ~df['retweeted']
c2 = df['reply_id'].isnull()

pure_tweets_df = df[c1 & c2]
print("There are {0:,} tweets that are not retweets or replies.".format(len(pure_tweets_df)))

_s = pure_tweets_df.groupby(pd.Grouper(key='timestamp',freq='1D')).agg({'retweets':'sum','favorites':'sum','id':len})
_s = _s[['retweets','favorites']].div(_s['id'],axis=0)
_s.columns = ['Retweets','Favorites']

# Make the plot
f,ax = plt.subplots(1,1,figsize=(8,4))
ax = _s.rolling(3).mean().plot(legend=False,lw=2,ax=ax)
# ax.set_yscale('symlog')
# ax.set_ylim((1e1,1e6))

ax.legend(loc='center left',bbox_to_anchor=(1,.5))
ax.set_title('Daily engagement with @joebiden tweets')
ax.set_ylabel('Engagement per tweet')

# Annotate the plot with lines corresponding to major events
ax.axvline(pd.Timestamp('2020-04-08'),lw=3,c='k',alpha=.25) # Biden cinches
ax.axvline(pd.Timestamp('2020-08-20'),lw=3,c='k',alpha=.25) # DNC speech
ax.axvline(pd.Timestamp('2020-11-03'),lw=3,c='k',alpha=.25) # Election day
ax.axvline(pd.Timestamp('2021-01-21'),lw=3,c='k',alpha=.25) # Swearing in
    
_y = 5.5e5
ax.text(x=pd.Timestamp('2020-04-08')+pd.Timedelta(3,'d'),y=_y,s='Biden\ncinches',va='center')
ax.text(x=pd.Timestamp('2020-08-20')+pd.Timedelta(3,'d'),y=_y,s='DNC\nspeech',va='center')
ax.text(x=pd.Timestamp('2020-11-03')+pd.Timedelta(3,'d'),y=_y,s='Election',va='center')
ax.text(x=pd.Timestamp('2021-01-21')+pd.Timedelta(3,'d'),y=_y,s='Sworn\nin',va='center')

f.tight_layout()
# f.savefig('joebiden_engagement_no_rt_reply.png',dpi=300,bbox_inches='tight')

Plot retweets per favorite.

In [None]:
f,ax = plt.subplots(1,1,figsize=(8,4))

(_s['Retweets']/_s['Favorites']).fillna(0).rolling(3).mean().plot(ax=ax)

ax.set_ylim((0,.5))

# Annotate the plot with lines corresponding to major events
ax.axvline(pd.Timestamp('2020-04-08'),lw=3,c='k',alpha=.25) # Biden cinches
ax.axvline(pd.Timestamp('2020-08-20'),lw=3,c='k',alpha=.25) # DNC speech
ax.axvline(pd.Timestamp('2020-11-03'),lw=3,c='k',alpha=.25) # Election day
ax.axvline(pd.Timestamp('2021-01-21'),lw=3,c='k',alpha=.25) # Swearing in
    
_y = .45
ax.text(x=pd.Timestamp('2020-04-08')+pd.Timedelta(3,'d'),y=_y,s='Biden\ncinches',va='center')
ax.text(x=pd.Timestamp('2020-08-20')+pd.Timedelta(3,'d'),y=_y,s='DNC\nspeech',va='center')
ax.text(x=pd.Timestamp('2020-11-03')+pd.Timedelta(3,'d'),y=_y,s='Election',va='center')
ax.text(x=pd.Timestamp('2021-01-21')+pd.Timedelta(3,'d'),y=_y,s='Sworn\nin',va='center')

What are the top tweets by retweets per favorite? They're primarily from before her primary win.

In [None]:
df['rt_fav_ratio'] = (df['retweets']/df['favorites']).replace({np.inf:np.nan})
top_retweets = df['rt_fav_ratio'].dropna().sort_values(ascending=False).head(10)
df.loc[top_retweets.index,['created','text','retweets','favorites']]
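The `replace({np.inf:np.nan})` step is there because a tweet with zero favorites would otherwise produce an infinite ratio. The same idea in plain Python, as a sketch with a made-up helper name and made-up counts:

```python
def retweets_per_favorite(retweets, favorites):
    """Return the retweet-to-favorite ratio, or None when there are no favorites."""
    if favorites == 0:
        return None  # undefined ratio, so leave it out rather than divide by zero
    return retweets / favorites

# Made-up (retweets, favorites) pairs for illustration
sample_counts = [(50, 200), (10, 0), (30, 40)]
ratios = [retweets_per_favorite(r, f) for r, f in sample_counts]
print(ratios)
```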

Is there an interesting relationship between sentiment and retweet/favorite ratio? We can specify a simple univariate LOESS regression for the relationship between sentiment and the retweet-per-favorite ratio. It appears that extremely negative and positive tweets have higher ratios than neutral tweets.

In [None]:
g = sb.lmplot(x='sentiment',y='rt_fav_ratio',data=df,lowess=True,aspect=2,
              line_kws={'color':'red','linewidth':10,'alpha':.5})
ax = g.axes[0,0]
ax.set_ylim((0,.6))

## Pulling a user's friends

In the parlance of the Twitter API, the people who follow an account are "followers" and the people followed by an account are "friends". There's unfortunately no timestamp meta-data about when friend and follower relationships were created. The API limits on this are much more stringent than other API calls: only 200 accounts per request and only 15 requests per 15-minute window: basically 200 accounts per minute or 3,000 accounts before you hit the rate limit. AOC has 1,417 friends, so it takes 8 API requests to get them all, leaving me with 7 requests in this 15-minute window.

In [None]:
friends = api.friends.list(screen_name='joebiden',count=200,skip_status=True)
print("There are {0:,} friends.".format(len(friends['users'])))

In [None]:
friends['users'][0]

We can check my API rate limit status too.

In [None]:
api.application.rate_limit_status()['resources']['friends']['/friends/list']

In [None]:
datetime.fromtimestamp(api.application.rate_limit_status()['resources']['friends']['/friends/list']['reset'])
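A single request only returns one page of up to 200 accounts. Getting the rest means following the `next_cursor` value in each response until it comes back as 0. Here is an offline sketch of that loop. `collect_all_users` and `fake_friends_page` are made-up names, the fake function stands in for `api.friends.list`, and it uses simple offsets as cursors, whereas real cursors are opaque numbers you should just pass back unchanged.

```python
def collect_all_users(fetch_page, max_requests=15):
    """Follow next_cursor values until the API signals the last page."""
    users = []
    cursor = -1  # cursored endpoints use -1 to request the first page
    for _ in range(max_requests):
        response = fetch_page(cursor)
        users.extend(response["users"])
        cursor = response["next_cursor"]
        if cursor == 0:  # a next_cursor of 0 means there are no more pages
            break
    return users

# Stand-in for the API: 1,417 fake friends, served 200 at a time
FAKE_FRIENDS = [{"id": i, "screen_name": "user{}".format(i)} for i in range(1417)]
requests_made = []

def fake_friends_page(cursor, count=200):
    """Return one page of fake friends plus the cursor for the next page."""
    requests_made.append(cursor)
    start = 0 if cursor == -1 else cursor
    end = start + count
    next_cursor = end if end < len(FAKE_FRIENDS) else 0
    return {"users": FAKE_FRIENDS[start:end], "next_cursor": next_cursor}

all_friends = collect_all_users(fake_friends_page)
print(len(all_friends), len(requests_made))
```

The `max_requests=15` default mirrors the 15 requests allowed per window, so the loop stops before hitting the rate limit.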

I think "friends" convey much more valuable information about an account than followers, primarily because an account doesn't choose who follows them. However, if you wanted to get the followers of an account, you can use the `followers.list` endpoint. I'm only going to grab 200 so I don't burn more API calls.

In [None]:
followers = api.followers.list(screen_name='joebiden',skip_status=True,count=200)

In [None]:
api.application.rate_limit_status()['resources']['followers']['/followers/list']

We can access these user objects to pull out interesting meta-data.

In [None]:
friends['users'][0]['screen_name']

In [None]:
friends['users'][0]['description']

In [None]:
friends['users'][0]['name']

In [None]:
friends['users'][0]['created_at']

In [None]:
friends['users'][0]['statuses_count']

In [None]:
friends['users'][0]['followers_count']

In [None]:
friends['users'][0]['friends_count']

In [None]:
friends['users'][0]['verified']

In [None]:
friends['users'][0]['id']

Loop through the friends of @joebiden that we retrieved and turn them into a DataFrame.

In [None]:
friends_payloads = []

for friend in friends['users']:
    p = {}
    p['name'] = friend['name']
    p['description'] = friend['description']
    p['screen_name'] = friend['screen_name']
    p['created_at'] = friend['created_at']
    p['statuses_count'] = friend['statuses_count']
    p['followers_count'] = friend['followers_count']
    p['friends_count'] = friend['friends_count']
    p['verified'] = friend['verified']
    p['id'] = friend['id']
    friends_payloads.append(p)
    
friends_df = pd.DataFrame(friends_payloads)
friends_df['created_at'] = pd.to_datetime(friends_df['created_at'])
friends_df['created_at'] = friends_df['created_at'].dt.tz_convert(None)
# created_at is in UTC, so measure account age against the current UTC time
friends_df['account_age'] = (pd.Timestamp.now(tz='UTC').tz_localize(None) - friends_df['created_at'])/pd.Timedelta(1,'d')
friends_df.head()
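
The loop above copies the same nine keys one at a time. A more compact sketch of the same flattening uses a list of wanted keys and a dictionary comprehension. `KEEP_KEYS`, `flatten_user`, and the two sample user dictionaries are made up for illustration and only mimic the shape of the API's user objects.

```python
KEEP_KEYS = ['name', 'description', 'screen_name', 'created_at', 'statuses_count',
             'followers_count', 'friends_count', 'verified', 'id']

def flatten_user(user, keys=KEEP_KEYS):
    """Keep only the fields we care about; missing keys become None."""
    return {k: user.get(k) for k in keys}

sample_users = [
    {'id': 1, 'name': 'Example One', 'screen_name': 'example1',
     'description': 'A test account', 'created_at': 'Mon Jan 04 12:00:00 +0000 2010',
     'statuses_count': 10, 'followers_count': 5, 'friends_count': 7,
     'verified': False, 'lang': None},
    # A deliberately incomplete record to show the None fallback
    {'id': 2, 'name': 'Example Two', 'screen_name': 'example2'},
]

payloads = [flatten_user(u) for u in sample_users]
print(payloads[1]['verified'])
```

Using `user.get(k)` instead of `user[k]` trades a `KeyError` for a missing value, which is usually easier to handle once the records are in a DataFrame.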

In this sample of Twitter accounts, are there any interesting trends in verified accounts?

In [None]:
f,axs = plt.subplots(1,4,figsize=(16,4),sharey=True)

sb.barplot(x='verified',y='followers_count',data=friends_df,ax=axs[0],estimator=np.mean,errwidth=5)
sb.barplot(x='verified',y='friends_count',data=friends_df,ax=axs[1],estimator=np.mean,errwidth=5)
sb.barplot(x='verified',y='statuses_count',data=friends_df,ax=axs[2],estimator=np.mean,errwidth=5)
sb.barplot(x='verified',y='account_age',data=friends_df,ax=axs[3],estimator=np.mean,errwidth=5)

axs[0].set_title('Followers')
axs[1].set_title('Friends')
axs[2].set_title('Statuses')
axs[3].set_title('Account age (days)')

# As we'll see below, having more than 5,000 friends could complicate our sampling
axs[0].axhline(5000,ls='--',c='k')
axs[1].axhline(5000,ls='--',c='k')
axs[2].axhline(3200,ls='--',c='k')

for ax in axs:
    ax.set_ylim((1e0,1e8))
    ax.set_yscale('symlog')
    ax.set_ylabel(None)

f.tight_layout()

Are these differences statistically significant? Let's run some [t-tests](https://en.wikipedia.org/wiki/T-test).

In [None]:
from scipy import stats

for var in ['followers_count','friends_count','statuses_count','account_age']:
    vals1 = friends_df.loc[friends_df['verified'] == True,var]
    vals2 = friends_df.loc[friends_df['verified'] == False,var]
    test,pvalue = stats.ttest_ind(vals1,vals2)
    str_fmt = "The differences in {0}: t = {1:.2f} \t p={2:.3f}"
    print(str_fmt.format(var,test,pvalue))

Or use the non-parametric [Mann-Whitney U-test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test) since our data is so skewed.

In [None]:
for var in ['followers_count','friends_count','statuses_count','account_age']:
    vals1 = friends_df.loc[friends_df['verified'] == True,var]
    vals2 = friends_df.loc[friends_df['verified'] == False,var]
    test,pvalue = stats.mannwhitneyu(vals1,vals2)
    str_fmt = "The differences in {0}: U = {1:,.0f} \t p = {2:,.3f}"
    print(str_fmt.format(var,test,pvalue))

Unsurprisingly, verified accounts have more followers and are older than non-verified accounts. But they also appear to be more active and have more friends.

### Ego-network
We can make a 1.5-step ego-network of the accounts @joebiden follows and the accounts each of them follows. Using the `friends/list` endpoint for this is too "expensive": it returns only 200 accounts at a time, so an account with 1,417 friends costs 8 API calls on its own. Twitter also exposes a [get friends/ids](https://developer.twitter.com/en/docs/accounts-and-users/follow-search-get-users/api-reference/get-friends-ids) end-point that will return up to 5,000 user IDs per request. The limit remains 15 requests per 15-minute window, but we can now get the friend networks for 15 accounts per 15 minutes rather than maybe only 1 or 2. The challenge with this is that we will need to "rehydrate" these user IDs at some point.

Here, we'll use the `count` parameter to limit each request to 5,000 IDs in case one of these accounts follows thousands of accounts. 

Store the data in a dictionary keyed by user ID, with each value holding the API response that contains that account's list of integer friend IDs. Initialize with @joebiden.

In [None]:
api.users.lookup(screen_name='joebiden')[0]['id']

In [None]:
friends_d = {'939091':api.friends.ids(user_id='939091',count=5000)}

In [None]:
api.application.rate_limit_status()['resources']['friends']['/friends/ids']

In [None]:
datetime.fromtimestamp(1616698614)

How many accounts have more than 5,000 friends? About 10% of them, so that part of our network will be incomplete if we only take a single "page" of 5,000 user IDs per account.

In [None]:
gt5000_friends_df = friends_df[friends_df['friends_count'] > 5000]
print("There are {0:,} accounts with more than 5,000 friends.".format(len(gt5000_friends_df)))
gt5000_friends_df.head()

Who are some of these high-friend accounts? Even at 5,000 friends per request, it will still cost you 119 API requests (and thus 119 minutes) to get Barack Obama's 593,000 friends.

In [None]:
# Make a list of the high-friend accounts to skip
gt5000_friends_ids = gt5000_friends_df['id'].values.tolist()

This loop goes through the list of friend IDs stored in `friends_d` and gets up to 5,000 friend IDs for each of them. With the rate limit of 15 requests per 15 minutes (one per minute), an account with 1,417 friends like @aoc would take 1,417 minutes (23.6 hours) to crawl. You can now start to see the appeal of parallelizing requests!

You probably don't want to run this loop.

In [None]:
# Loop through each of @joebiden's friend IDs
for friend_id in friends_d['939091']['ids']:
    
    # Check to make sure the friend ID isn't already in the dictionary and is not a high-friend account
    if friend_id not in friends_d.keys() and friend_id not in gt5000_friends_ids:
        
        # Try to get the account's friends
        try:
            friends_d[friend_id] = api.friends.ids(user_id=friend_id,count=5000)
            
        # If you get a TwitterError, assume it's a rate limit problem
        except twitter.TwitterError:
            
            # Get the current rate limit status
            reset_time = api.application.rate_limit_status()['resources']['friends']['/friends/ids']['reset']
            
            # Wait until the API limit refreshes (never a negative time) and add a second for good measure
            sleep_time = max((datetime.fromtimestamp(reset_time) - datetime.now())/timedelta(seconds=1), 0) + 1
            
            # Print out to make sure
            print("At {0}, sleeping for {1} seconds.".format(datetime.now(),sleep_time))
            
            # Sleep until our API limit refreshes
            time.sleep(sleep_time)
            
            # Try to get the friend ID again
            friends_d[friend_id] = api.friends.ids(user_id=friend_id,count=5000)
            
    # Write the friend IDs out to disk after each friend ID
    with open('joebiden_friends_ids.json','w') as f:
        json.dump(friends_d,f)

Instead, I've done this scraping for you and saved the results in a JSON file.

In [None]:
with open('joebiden_friends_ids.json','r') as f:
    friends_d = json.load(f)

Now we want to make a network of who follows whom.

In [None]:
friends_l = []

# Turn the dictionary into an edgelist
for user_id, friend_ids in friends_d.items():
    for friend_id in friend_ids['ids']:
        friends_l.append((str(user_id),str(friend_id)))
        
# Turn the list of (user_id, friend_id) tuples into a DataFrame
friends_gdf = pd.DataFrame(friends_l,columns=['user_id','friend_id'])

# Get the unique user_ids for joebiden's friends
unique_friend_ids = friends_gdf['user_id'].unique()

# Just keep friends of joebiden in the list
# Throw away friends of friends who aren't direct friends of joebiden
subset_friends_df = friends_gdf[friends_gdf['friend_id'].isin(unique_friend_ids)]

# Print out number of edges remaining
print('Edges before: {:,}'.format(len(friends_gdf)),'\nEdges after: {:,}'.format(len(subset_friends_df)))

# Inspect
subset_friends_df.head()
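
To see what the filtering step keeps, here is a toy version of the same logic in plain Python. The three crawled accounts (`a`, `b`, `c`) and their friend lists are made up; only the shape of the dictionary mirrors `friends_d`.

```python
toy_friends = {
    'a': {'ids': ['b', 'c', 'x']},
    'b': {'ids': ['a', 'y']},
    'c': {'ids': ['z']},
}

# Every (follower, followed) pair in the crawl
edges = [(user, friend)
         for user, payload in toy_friends.items()
         for friend in payload['ids']]

# Keep only edges whose target is also one of the crawled accounts
crawled = set(toy_friends)
subset = [(u, f) for u, f in edges if f in crawled]

print(len(edges), len(subset))
print(subset)
```

Edges pointing at `x`, `y`, and `z` are dropped because those accounts were never crawled themselves, which is what keeps the network at 1.5 steps rather than 2.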

Map the numeric user_id back to screen_name.

In [None]:
ids_to_screen_name_map = {str(user['id']):user['screen_name'] for user in friends['users']}
ids_to_screen_name_map['939091'] = 'joebiden'

Building on the [shared audience measure](http://faculty.washington.edu/kstarbi/Stewart_Starbird_Drawing_the_Lines_of_Contention-final.pdf) used by Stewart, *et al.* (2017), I computed [Jaccard coefficients](https://en.wikipedia.org/wiki/Jaccard_index) for the friend sets of each account. The intuition here is that if two accounts are friends with all the same accounts, their score would be 1, while if two accounts had no friends in common, their score would be 0. This has the benefit of giving us a numerical weight for otherwise binary friend relationships: friend relations are "stronger" if they are more strongly embedded in a network with other overlapping friend relations and "weaker" if there is less overlap. This requires pair-wise evaluations of $1420 \times 1419 = 2{,}014{,}980$ combinations, which takes about 20 minutes on my computer.
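
As a worked example of the coefficient itself, the Jaccard score of two friend sets $A$ and $B$ is $|A \cap B| / |A \cup B|$. The sketch below computes it for three made-up accounts; the `jaccard` helper and the toy data are illustrative, and returning 0 for two empty sets is a convention chosen here rather than anything dictated by the definition.

```python
from itertools import combinations

def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)

toy = {'u1': [1, 2, 3, 4], 'u2': [3, 4, 5], 'u3': [9]}

# Jaccard is symmetric, so each unordered pair only needs to be scored once
scores = {(x, y): jaccard(toy[x], toy[y]) for x, y in combinations(sorted(toy), 2)}
print(scores)
```

Here `u1` and `u2` share 2 of 5 distinct friends, giving 0.4, while `u3` overlaps with neither. Scoring unordered pairs halves the work (1,007,490 pairs instead of 2,014,980 for 1,420 accounts), at the cost of having to mirror each pair if you later need both directions.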

In [None]:
jaccard_l = []

# Use a set for fast membership checks on the high-friend accounts we skip
gt5000_friends_set = set(gt5000_friends_ids)

# Build each account's friend set once instead of inside the double loop
# (the keys of friends_d are strings after loading from JSON)
friend_sets = {f:set(friends_d[f]['ids']) for f in unique_friend_ids if int(f) not in gt5000_friends_set}

for f1,s1 in friend_sets.items():
    for f2,s2 in friend_sets.items():
        if f1 != f2:
            union = s1 | s2
            # Guard against dividing by zero if both friend sets are empty
            if len(union) > 0:
                jaccard = len(s1 & s2)/len(union)
                jaccard_l.append({'user':f1,'friend':f2,'jaccard':jaccard})
            
friend_jaccard_df = pd.DataFrame(jaccard_l)[['user','friend','jaccard']]

friend_jaccard_df.to_csv('all_friend_jaccard.csv')

friend_jaccard_df.head()

We can combine the `subset_friends_df` with `friend_jaccard_df` using pandas's [`merge`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) command.

In [None]:
# Left-join the subset_friends and friend_jaccard DataFrames
friend_el = pd.merge(subset_friends_df,
                     friend_jaccard_df,
                     left_on=['user_id','friend_id'],
                     right_on=['user','friend'],how='left')

# Keep a few columns
friend_el = friend_el[['user','friend','jaccard']]

# Map the user_ids back to screen_names
friend_el['user'] = friend_el['user'].apply(str).map(ids_to_screen_name_map)
friend_el['friend'] = friend_el['friend'].apply(str).map(ids_to_screen_name_map)

# Save to disk
# friend_el.to_csv('aoc_friends_edgelist.csv')

# Inspect
friend_el.tail()

We are going to use the [`networkx`](https://networkx.github.io/documentation/stable/) library (which should come with Anaconda by default) to convert this edgelist into a Graph object.

In [None]:
# Import networkx
import networkx as nx

This raw network is very dense: there are about 100 times more edges than nodes. A general heuristic for graph visualization is you want the number of nodes and edges to be about the same order of magnitude to prevent [overplotting](https://www.displayr.com/what-is-overplotting/).

In [None]:
g_dense = nx.from_pandas_edgelist(friend_el,source='user',target='friend',edge_attr='jaccard',create_using=nx.Graph)
print("There are {0:,} nodes and {1:,} edges.".format(g_dense.number_of_nodes(),g_dense.number_of_edges()))
nx.write_gexf(g_dense,'all_friends.gexf')
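
One common way to thin out a dense weighted network before plotting is to drop edges below a weight threshold. The sketch below shows the idea on a made-up edge list of `(user, friend, jaccard)` tuples. The 0.1 cutoff and the `keep_strong_edges` helper are illustrative assumptions, not values used elsewhere in this notebook.

```python
def keep_strong_edges(edges, threshold):
    """Return only the edges whose weight is at least the threshold."""
    return [(u, v, w) for u, v, w in edges if w >= threshold]

toy_edges = [('a', 'b', 0.40), ('a', 'c', 0.02), ('b', 'c', 0.15), ('c', 'd', 0.01)]
strong = keep_strong_edges(toy_edges, threshold=0.1)

# Nodes that still have at least one edge after filtering
nodes_left = {n for u, v, _ in strong for n in (u, v)}
print("{0} of {1} edges kept across {2} nodes".format(len(strong), len(toy_edges), len(nodes_left)))
```

On the real data the same filter could be applied to the edgelist before building the graph, for example `friend_el[friend_el['jaccard'] >= threshold]`. Note that nodes whose edges all fall below the cutoff disappear from the picture entirely.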

Visualize. There are better packages for doing this (like Gephi), but let's do something easy.

In [None]:
f,ax = plt.subplots(1,1,figsize=(12,12))

pos = nx.layout.rescale_layout_dict(nx.layout.spring_layout(g_dense,iterations=100),1.2)

nx.draw_networkx_nodes(g_dense,pos,node_size=[i*1e3 for i in nx.degree_centrality(g_dense).values()])
nx.draw_networkx_edges(g_dense,pos,width=[d['jaccard']*33 for i,j,d in g_dense.edges(data=True)],alpha=.2)
nx.draw_networkx_labels(g_dense,pos,font_size=8);

## Using the streaming API

We can also sit on Twitter's [Streaming API](https://developer.twitter.com/en/docs/tweets/sample-realtime/api-reference/get-statuses-sample) and get a sample of tweets that are produced in real time. The `statuses.sample()` method returns a [generator](https://wiki.python.org/moin/Generators), an object that doesn't store all of its data up front but instead produces the next item each time you ask for one. In this case, each item is the next tweet in the sample. For 10,000 tweets on a stream sampling approximately 1% of live tweets, this may take 2–3 minutes; the loop below stops after 1,000.
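
If generators are new to you, the toy example below may help. `fake_stream` is a made-up stand-in that yields dictionaries forever, mixing in "delete" notices the way a live stream can. Nothing is produced until the loop (here `islice`) asks for the next item, which is why an endless stream does not exhaust memory.

```python
from itertools import islice

def fake_stream():
    """Yield made-up statuses forever, one at a time."""
    i = 0
    while True:
        i += 1
        # Every third message is a delete notice rather than a tweet
        if i % 3 == 0:
            yield {'delete': {'status': {'id': i}}}
        else:
            yield {'id': i, 'text': 'tweet number {0}'.format(i)}

# Lazily skip delete notices, then take only the first 5 tweets
tweets_only = (s for s in fake_stream() if 'delete' not in s)
first_five = list(islice(tweets_only, 5))
print([s['id'] for s in first_five])
```

`islice` is an alternative to the explicit counter-and-`break` pattern: it stops pulling from the generator as soon as it has the requested number of items.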

In [None]:
# Make the generator
stream = twitter.TwitterStream(auth=twitter.OAuth(twitter_keys['access_token_key'],
                                         twitter_keys['access_token_secret'],
                                         twitter_keys['consumer_key'],
                                         twitter_keys['consumer_secret']))

tweet_stream = stream.statuses.sample()

# Make an empty list to store the tweet statuses
stream_list = []

# Start iterating through the stream
for status in tweet_stream:
    
    # As long as we have fewer than this many tweets
    if len(stream_list) < 1000:
        
        # And if it's not a delete status request
        if 'delete' not in status:
        
            # Add another tweet to our list
            stream_list.append(status)
        
    # Otherwise stop
    else:
        break
        
"There are {0:,} tweets from the stream.".format(len(stream_list))

Look at one of our statuses.

In [None]:
stream_list[0]

We can adapt our previous `tweet_cleaner` code to turn this JSON data into a DataFrame.

In [None]:
def tweet_cleaner(status):
    payload = {}
    payload['screen_name'] = status['user']['screen_name']
    payload['created'] = pd.to_datetime(status['created_at'])
    payload['retweets'] = status['retweet_count']
    payload['favorites'] = status['favorite_count']
    payload['id'] = status['id']
    payload['reply_screen_name'] = status['in_reply_to_screen_name']
    payload['reply_id'] = status['in_reply_to_status_id']
    payload['source'] = BeautifulSoup(status['source'],'html.parser').text
    payload['lang'] = status['lang']
    
    if status['place']:
        payload['place'] = status['place']['country']
    else:
        payload['place'] = None

    if len(status['entities']['user_mentions']) > 0:
        payload['user_mentions'] = '; '.join([m['screen_name'] for m in status['entities']['user_mentions']])
    else:
        payload['user_mentions'] = None

    if len(status['entities']['hashtags']) > 0:
        payload['hashtags'] = '; '.join([h['text'] for h in status['entities']['hashtags']])
    else:
        payload['hashtags'] = None

    # If an account retweets another account, we should store that information
    if 'retweeted_status' in status:
        rt_status = status['retweeted_status']
        if 'extended_tweet' in rt_status:
            payload['text'] = rt_status['extended_tweet']['full_text']
            if len(rt_status['extended_tweet']['entities']['hashtags']) > 0:
                payload['hashtags'] = '; '.join([h['text'] for h in rt_status['extended_tweet']['entities']['hashtags']])
            else:
                payload['hashtags'] = None
        else:
            try:
                payload['text'] = rt_status['text']
            except:
                payload['text'] = rt_status['full_text']
            if len(rt_status['entities']['hashtags']) > 0:
                payload['hashtags'] = '; '.join([h['text'] for h in rt_status['entities']['hashtags']])
            else:
                payload['hashtags'] = None
        payload['is_retweet'] = True
        payload['retweeted_screen_name'] = rt_status['user']['screen_name']
        payload['retweeted_created'] = rt_status['created_at']
        payload['retweeted_source'] = BeautifulSoup(rt_status['source'],'html.parser').text
        
    else:
        if status['truncated']:
            payload['text'] = status['extended_tweet']['full_text']
        else:
            try:
                payload['text'] = status['text']
            except:
                payload['text'] = status['full_text']
        payload['is_retweet'] = False
        # Use None (not False) so these rows count as missing in value_counts()
        payload['retweeted_screen_name'] = None
        payload['retweeted_created'] = None
        payload['retweeted_source'] = None

    return payload
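
The text lookup is the fiddliest part of `tweet_cleaner` because the full text can live under three different keys (`extended_tweet.full_text`, `full_text`, or `text`) depending on the endpoint and the tweet's length. As a sketch of the same fallback idea without nested `try`/`except` blocks, the hypothetical `extract_text` helper below uses `dict.get`. The three sample statuses are made up, and unlike the function above this version prefers `full_text` over `text` when both are present.

```python
def extract_text(status):
    """Return the longest available text for a status dictionary."""
    if 'extended_tweet' in status:
        return status['extended_tweet']['full_text']
    # dict.get returns None for a missing key, so `or` falls through to 'text'
    return status.get('full_text') or status.get('text')

streamed_long = {'text': 'Truncated...', 'truncated': True,
                 'extended_tweet': {'full_text': 'The whole tweet'}}
streamed_short = {'text': 'Short tweet', 'truncated': False}
searched = {'full_text': 'Tweet from an extended-mode search'}

for s in (streamed_long, streamed_short, searched):
    print(extract_text(s))
```

For a retweet, the same helper could be called on the nested `retweeted_status` dictionary instead of the outer status.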

Loop through our list of status dictionaries and flatten each one into something we can read into a DataFrame. Include some exception handling that keeps track of which tweets throw errors and prints out the index positions in `stream_list` of the first 50 of them for us to diagnose.

In [None]:
stream_statuses_flat = []
errors = []

for i,status in enumerate(stream_list):
    try:
        payload = tweet_cleaner(status)
        stream_statuses_flat.append(payload)
    except:
        errors.append(str(i))

if len(errors) == 0:
    print("There were no errors!")
else:
    print("There were errors at the following indices:", '; '.join(errors[:50]))

Make our DataFrame, clean up some columns, and make some new ones.

In [None]:
stream_df = pd.DataFrame(stream_statuses_flat)
stream_df['created'] = pd.to_datetime(stream_df['created'])
stream_df['created'] = stream_df['created'].dt.tz_convert(None)
stream_df.tail()

Which apps are people using to write their tweets in this sample?

In [None]:
stream_df['source'].value_counts().head(20)

What languages are these tweets in?

In [None]:
stream_df['lang'].value_counts().head(10)

If a tweet is geolocated, where is it?

In [None]:
stream_df['place'].value_counts()

How many tweets are retweets?

In [None]:
stream_df['is_retweet'].value_counts()

How many tweets are replies?

In [None]:
stream_df['reply_id'].notnull().value_counts()

Which users are getting a lot of retweets right now?

In [None]:
stream_df['retweeted_screen_name'].value_counts().head(20)

### Filtered streams

We can also filter the tweets in the stream. Here we only get tweets mentioning "Biden" and that have been auto-classified as written in English.

In [None]:
# Make the generator
filtered_stream = stream.statuses.filter(track='Biden',language='en')

# Make an empty list to store the tweet statuses
filtered_stream_list = []

# What time did we start?
start = time.time()

# Start iterating through the stream
for status in filtered_stream:
    
    # As long as we have fewer than this many tweets
    if len(filtered_stream_list) < 1000:
        
        # And if it's not a delete status request
        if 'delete' not in status:
        
            # Add another tweet to our list
            filtered_stream_list.append(status)
        
    # Otherwise stop
    else:
        break

# What time did we stop?
stop = time.time()
elapsed = stop - start

"There are {0:,} tweets from the stream after {1:.0f} seconds.".format(len(filtered_stream_list),elapsed)

Clean this up into a DataFrame.

In [None]:
filtered_stream_statuses_flat = []
filtered_errors = []

for i,status in enumerate(filtered_stream_list):
    try:
        payload = tweet_cleaner(status)
        filtered_stream_statuses_flat.append(payload)
    except:
        filtered_errors.append(str(i))

if len(filtered_errors) == 0:
    print("There were no errors!")
else:
    print("There were errors at the following indices:", '; '.join(filtered_errors[:50]))
    
filtered_stream_df = pd.DataFrame(filtered_stream_statuses_flat)
filtered_stream_df['created'] = pd.to_datetime(filtered_stream_df['created'])
filtered_stream_df['created'] = filtered_stream_df['created'].dt.tz_convert(None)
filtered_stream_df.tail()

Let's measure the sentiment of the tweets in this filtered sample and plot the distribution of their sentiment values.

In [None]:
# Compute the sentiment scores
filtered_stream_df['sentiment'] = filtered_stream_df['text'].apply(lambda x:sia.polarity_scores(x)['compound'])

# Plot the distribution
filtered_stream_df['sentiment'].plot(kind='hist',bins=20)
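
VADER's compound score runs from -1 (most negative) to +1 (most positive). A common convention from the VADER documentation treats scores of 0.05 or higher as positive, scores of -0.05 or lower as negative, and everything in between as neutral. The sketch below applies those cutoffs to a few made-up scores; `label_sentiment` is a hypothetical helper, not part of `nltk`.

```python
def label_sentiment(compound, cutoff=0.05):
    """Bucket a VADER compound score into positive / neutral / negative."""
    if compound >= cutoff:
        return 'positive'
    if compound <= -cutoff:
        return 'negative'
    return 'neutral'

example_scores = [0.82, 0.0, -0.31, 0.04, -0.05]
print([label_sentiment(s) for s in example_scores])
```

Applied to the data here, something like `filtered_stream_df['sentiment'].apply(label_sentiment).value_counts()` would give a quick tally to read alongside the histogram.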

How many retweets in this sample?

In [None]:
filtered_stream_df['is_retweet'].value_counts()

Given the higher fraction of retweets, who is being retweeted?

In [None]:
filtered_stream_df['retweeted_screen_name'].value_counts().head(10)

## Search API

Twitter's [search API](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets) provides an endpoint to search for tweets matching a query for terms, accounts, hashtags, language, locations, and date ranges. This API endpoint has a rate limit of 180 requests per 15-minute window with 100 statuses per request, which works out to 18,000 statuses per window or 72,000 statuses per hour.

You can explore some of the search functionality through Twitter's [advanced search interface](https://twitter.com/search-advanced). Note that the [standard search API](https://developer.twitter.com/en/docs/tweets/search/overview/standard) only provides limited access to a sample of tweets from the past 7 days; you'll need to pay more to access [historical APIs](https://developer.twitter.com/en/docs/tutorials/choosing-historical-api.html).

In [None]:
query = api.search.tweets(q='boulder',
                          count=100,
                          lang='en',
                          result_type='recent',
                          tweet_mode='extended')

Loop through these 100 tweets.

In [None]:
search_statuses_flat = []
search_errors = []

for i,status in enumerate(query['statuses']):
    try:
        payload = tweet_cleaner(status)
        search_statuses_flat.append(payload)
    except:
        search_errors.append(str(i))

if len(search_errors) == 0:
    print("There were no errors!")
else:
    print("There were errors at the following indices:", '; '.join(search_errors[:50]))
    
search_df = pd.DataFrame(search_statuses_flat)
search_df['created'] = pd.to_datetime(search_df['created'])
search_df['created'] = search_df['created'].dt.tz_convert(None)
search_df.tail()

Write a loop to try to get more tweets. The `query` dictionary includes a sub-dictionary under the "search_metadata" key that includes information about paginating to find the next set of results.
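
Another way to page backwards through results is with the `max_id` parameter: ask for tweets whose IDs are no larger than one less than the oldest ID collected so far. The sketch below shows that pattern against a made-up `fake_search` function so it runs without credentials. `search_all`, `fake_search`, and the ID range are all hypothetical.

```python
def fake_search(max_id=None, count=100):
    """Stand-in for a search endpoint: newest-first tweets with IDs <= max_id."""
    all_ids = range(1000, 750, -1)  # 250 made-up tweet IDs, newest first
    matching = [i for i in all_ids if max_id is None or i <= max_id]
    return {'statuses': [{'id': i} for i in matching[:count]]}

def search_all(search, target=2500):
    """Keep requesting older pages until the target is met or results run out."""
    tweets = []
    max_id = None
    while len(tweets) < target:
        page = search(max_id=max_id)['statuses']
        if not page:  # an empty page means there is nothing older to fetch
            break
        tweets.extend(page)
        # Ask next for tweets strictly older than the oldest one seen so far
        max_id = page[-1]['id'] - 1
    return tweets

collected = search_all(fake_search)
print(len(collected), collected[0]['id'], collected[-1]['id'])
```

Two details matter for a real crawl: the loop condition uses `<` rather than testing for an exact count, and it stops on an empty page, since a query with fewer matches than the target would otherwise never finish.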

In [None]:
search_tweets = []

while True:
    # When to stop? Use >= in case we never land on exactly this number
    if len(search_tweets) >= 2500:
        break
    
    # Get the first set of tweets
    if len(search_tweets) == 0:
        query = api.search.tweets(q='boulder',
                          count=100,
                          lang='en',
                          result_type='recent',
                          tweet_mode='extended')
        
    # Keep getting tweets
    else:
        # Find the last tweet to use as a max_id
        max_id = search_tweets[-1]['id']
        
        # Get the next set of tweets
        query = api.search.tweets(q='boulder',
                                  count=100,
                                  lang='en',
                                  result_type='recent',
                                  tweet_mode='extended',
                                  max_id = max_id - 1)
    
    # Stop if the search has run out of results, otherwise we would loop forever
    if len(query['statuses']) == 0:
        break
        
    # Add them to the list of tweets
    search_tweets += query['statuses']
        
print("There are {0:,} tweets in the collection.".format(len(search_tweets)))

In [None]:
search_statuses_flat = []
search_errors = []

for i,status in enumerate(search_tweets):
    try:
        payload = tweet_cleaner(status)
        search_statuses_flat.append(payload)
    except:
        search_errors.append(str(i))

if len(search_errors) == 0:
    print("There were no errors!")
else:
    print("There were errors at the following indices:", '; '.join(search_errors[:50]))
    
search_df = pd.DataFrame(search_statuses_flat)
search_df['created'] = pd.to_datetime(search_df['created'])
search_df['created'] = search_df['created'].dt.tz_convert(None)
search_df.tail()

In [None]:
# Compute the sentiment scores
search_df['sentiment'] = search_df['text'].apply(lambda x:sia.polarity_scores(x)['compound'])

# Plot the distribution
search_df['sentiment'].plot(kind='hist',bins=20)