# Twitter Data Collection & Analysis

In this lesson, we're going to learn how to analyze and explore Twitter data with the Python/command line tool [twarc](https://twarc-project.readthedocs.io/en/latest/). We're specifically going to work with [twarc2](https://twarc-project.readthedocs.io/en/latest/twarc2/), which is designed for version 2 of the Twitter API (released in 2020) and the Academic Research track of the Twitter API (released in 2021), which enables researchers to collect tweets from the entire Twitter archive for free.

Twarc was developed by a project called [Documenting the Now](https://www.docnow.io/). The DocNow team develops tools and ethical frameworks for social media research.

## Dataset

<blockquote class="epigraph" style=" padding: 10px">
    
[David Foster Wallace]...has become lit-bro shorthand...Make a passing reference to the “David Foster Wallace fanboy” and you can assume the reader knows whom you’re talking about.<p class ="attribution">—Molly Fischer,
<a href="https://slate.com/culture/2015/08/men-who-love-david-foster-wallace-what-s-wrong-with-bros-obsessing-over-infinite-jest.html">"David Foster Wallace, Beloved Author of Bros"</a>
    </p>
    
</blockquote>

![](https://static01.nyt.com/images/2014/12/18/arts/18book-sub/BOOK-1418838418938-superJumbo.jpg?quality=90&auto=webp)
*Source: [Giovanni Giovanetti, NYT](https://www.nytimes.com/2014/12/18/books/the-david-foster-wallace-reader-a-compilation.html)*

The Twitter conversation that we're going to explore in this lesson is related to "Wallace bros" — fans of the author David Foster Wallace who are often described as "bros" or, more pointedly, "David Foster Wallace bros."

For example, in *Slate* in 2015, Molly Fischer argued that David Foster Wallace's writing — most famously his novel *Infinite Jest* — tended to attract  [a fan base of chauvinistic and misogynistic young men](https://slate.com/culture/2015/08/men-who-love-david-foster-wallace-what-s-wrong-with-bros-obsessing-over-infinite-jest.html). But other people  have defended Wallace's fans and the author against such charges. What is a "David Foster Wallace bro"? Was DFW himself a "bro"? Who is using this phrase, how often are they using it, and why? We're going to track this phrase and explore the varied viewpoints in this cultural conversation by analyzing tweets that mention "David Foster Wallace bro."

## Search Queries & Privacy Concerns

To collect tweets from the Twitter API, we need to make queries, or requests for specific kinds of tweets — e.g., `twarc2 search *query*`. The simplest kind of query is a keyword search, such as the phrase "David Foster Wallace bro," which should return any tweet that contains all of these words in any order — `twarc2 search "David Foster Wallace bro"`.

There are many other operators that we can add to a query, which would allow us to collect tweets only from specific Twitter users or locations, or to only collect tweets that meet certain conditions, such as containing an image or being authored by a verified Twitter user. Here's an excerpted table of search operators taken from [Twitter's documentation](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#list) about how to build a search query. There are many other operators beyond those included in this table, and I recommend reading through [Twitter's entire web page on this subject](https://developer.twitter.com/en/docs/twitter-api/tweets/search/integrate/build-a-query#list).


| Search Operator | Explanation |
|:--------------------:|:----------------------------------------------------------------------------------------------:|
| keyword | Matches a keyword within the body of a Tweet. `so sweet and so cold` |
| "exact phrase match" | Matches the exact phrase within the body of a Tweet. `"so sweet and so cold" OR "plums in the icebox"` |
| - | Does NOT match a keyword or operator. `baldwin -alec`, `walt whitman -bridge` |
| # | Matches any Tweet containing a recognized hashtag. `#arthistory` |
| from:, to: | Matches any Tweet from or to a specific user. `from:KingJames` `to:KingJames` |
| place: | Matches Tweets tagged with the specified location or Twitter place ID. `place:"new york city" OR place:seattle` |
| is:reply, is:quote | Returns only replies or quote tweets. `DFW bro is:reply` `David Foster Wallace bro is:quote` |
| is:verified | Returns only Tweets whose authors are verified by Twitter. `DFW bro is:verified` |
| has:media | Matches Tweets that contain a media object, such as a photo, GIF, or video, as determined by Twitter. `I Think You Should Leave has:media` |
| has:images, has:videos | Matches Tweets that contain a recognized URL to an image or video. `i'm gonna tell my kids that this was has:images` |
| has:geo | Matches Tweets that have Tweet-specific geolocation data provided by the Twitter user. `pyramids has:geo` |

In this lesson, we will only be collecting tweets that were tweeted by verified users: `"David Foster Wallace bro is:verified"`.
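
As queries pick up more operators, it can help to assemble the query string in Python before handing it to twarc2. The sketch below uses a hypothetical helper, `build_query`, which is not part of twarc. It simply joins keywords and operators with spaces, which the Twitter API treats as a logical AND.

```python
def build_query(keywords, exact=False, operators=()):
    """Join keywords and operators into one search query string.

    Hypothetical helper for illustration; not part of twarc.
    """
    # Wrap the keywords in double quotes to request an exact phrase match
    phrase = f'"{keywords}"' if exact else keywords
    # A space between terms acts as a logical AND
    return " ".join([phrase, *operators])


# The query used in this lesson
query = build_query("David Foster Wallace bro", operators=["is:verified"])
print(query)

# An exact-phrase variant with an extra operator from the table above
exact_query = build_query("David Foster Wallace", exact=True,
                          operators=["is:verified", "has:images"])
print(exact_query)
```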

As I discussed in ["Users’ Data: Legal & Ethical Considerations,"](01-User-Ethics-Legal-Concerns) collecting publicly available tweets is legal, but it still raises a lot of privacy concerns and ethical quandaries, particularly when you re-publish users' data, as I am in this lesson. To reduce potential harm to Twitter users when re-publishing or citing tweets, it can be helpful to ask for explicit permission from the authors or to focus on tweets that have already been reasonably exposed to the public (e.g., tweets with many retweets or tweets from verified users), such that re-publishing the content will not unduly increase risk to the user.
               

## Install and Import Libraries

Because twarc relies on Twitter's API, we need to apply for a Twitter developer account and create a Twitter application before we can use it. You can find instructions for the application process in ["Twitter API Set Up."](11-Twitter-API-Setup)

If you haven't done so already, you need to install twarc and configure twarc with your bearer token and/or API keys.

In [1]:
#!pip install twarc
#!twarc2 configure

To make an interactive plot, we're also going to install the package plotly.

In [2]:
# !pip install plotly



Then we're going to import plotly as well as pandas, and set two pandas display options so that long tweet text and wide tables are easier to read.

In [4]:
import plotly.express as px

import pandas as pd
pd.options.display.max_colwidth = 400
pd.options.display.max_columns = 90

## Get Tweet Counts

The first thing we're going to do is retrieve "tweet counts" — that is, retrieve the number of tweets that included the phrase "David Foster Wallace bro" each day in Twitter's history.

The [tweet counts API endpoint](https://twittercommunity.com/t/introducing-new-tweet-counts-endpoints-to-the-twitter-api-v2/155997) is a convenient feature of the v2 API (first introduced in 2021) that allows us to get a sense of how many tweets will be returned for a given query before we actually collect all the tweets that match the query. We won't get the text of the tweets or the users who tweeted the tweets or any other relevant data. We will simply get the number of tweets that match the query. This is helpful because we might be able to see that the search query "Wallace" matches too many tweets, which would encourage us to narrow our search by modifying the query. 

The tweet counts API endpoint is perhaps even more useful for research projects that are primarily interested in tracking the volume of a Twitter conversation over time. In this case, tweet counts enable a researcher to retrieve this information in a way that's faster and easier than retrieving all tweets and relevant metadata.

To get tweet counts from Twitter's entire history with twarc2, we will use [`twarc2 counts`](https://twarc-project.readthedocs.io/en/latest/twarc2/#counts) followed by a search query.

We will also use the flag `--csv` because we want to output the data as a CSV, the flag `--archive` because we're working with the Academic Research track of the Twitter API and want access to the full archive, and the flag `--granularity day` to get tweet counts per day (other options include `hour` and `minute` — you can see more in [twarc's documentation](https://twarc-project.readthedocs.io/en/latest/twarc2/#counts)).  Finally, we write the data to a CSV file.

In [5]:
# !twarc2 counts "David Foster Wallace bro is:verified" --csv --archive --granularity day > twitter-data/tweet-counts.csv

We can read in this CSV file with pandas, parse the date columns, and sort from earliest to latest. The code below is largely [borrowed from Ed Summers](https://github.com/edsu/notebooks/blob/master/Black%20Lives%20Matter%20Counts.ipynb). Thanks, Ed!

<div class="admonition pandasreview" name="html-admonition" style="background: black; color: white; padding: 10px">
<p class="title">Pandas</p>
 Do you need a refresher or introduction to the Python data analysis library Pandas? Be sure to check out <a href="https://melaniewalsh.github.io/Intro-Cultural-Analytics/Data-Analysis/Pandas-Basics-Part1.html"> Pandas Basics (1-3) </a> in this textbook!
    
</div>

In [6]:
# Code borrowed from Ed Summers
# https://github.com/edsu/notebooks/blob/master/Black%20Lives%20Matter%20Counts.ipynb

# Read in CSV as DataFrame
# tweet_counts_df = pd.read_csv('twitter-data/tweet-counts.csv', parse_dates=['start', 'end'])
# Sort values by earliest date
# tweet_counts_df = tweet_counts_df.sort_values('start')
# tweet_counts_df
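
Before plotting, it can be useful to sanity-check the totals. Here is a minimal sketch, assuming the counts file has the `start` and `day_count` columns used in this lesson. It relies only on the standard library so that it runs without the collected file, and the three rows are invented sample data, not real results.

```python
import csv
import io

# Invented sample rows that mimic the layout of tweet-counts.csv
sample_csv = """start,end,day_count
2020-08-23T00:00:00.000Z,2020-08-24T00:00:00.000Z,1
2020-08-24T00:00:00.000Z,2020-08-25T00:00:00.000Z,4
2020-08-25T00:00:00.000Z,2020-08-26T00:00:00.000Z,0
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Total number of matching tweets across all days
total_tweets = sum(int(row["day_count"]) for row in rows)

# The row (day) with the highest count
peak_day = max(rows, key=lambda row: int(row["day_count"]))

print(total_tweets)
print(peak_day["start"][:10], peak_day["day_count"])
```

With real data, you could swap `io.StringIO(sample_csv)` for `open('twitter-data/tweet-counts.csv')`.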

Then we can make a quick plot of tweets per day with [plotly](https://plotly.com/python/line-charts/).

In [7]:
# Code borrowed from Ed Summers
# https://github.com/edsu/notebooks/blob/master/Black%20Lives%20Matter%20Counts.ipynb
# Make a line plot from the DataFrame and specify x and y axes, axes titles, and plot title
# figure = px.line(tweet_counts_df, x='start', y='day_count',
#     labels={'start': 'Time', 'day_count': 'Tweets per Day'},
#     title= 'DFW Bro Tweets'
# )

# figure.show()

With a plotly line chart, we can hover over points to see more information, and we can use the tool bar in the upper right corner to zoom or pan on different parts of the graph. We can also press the camera button to download an image of the graph at any pan or zoom level.

To return to the original view, double-click on the plot.

## Get Tweets (Standard Track)

To actually collect tweets and their associated metadata, we can use the command `twarc2 search` and insert a query.

Here we're going to search for any tweets that mention the words "David Foster Wallace" and were tweeted by verified accounts *in the past week*. By default, `twarc2 search` will use the standard track of the Twitter API, which only collects tweets from the past week.

In [8]:
# !twarc2 search "David Foster Wallace is:verified"

The tweets and tweet metadata above are being printed to the notebook. But we want to save this information to a file so we can work with it.

To output Twitter data to a file, we can also include a filename with the ".jsonl" file extension, which stands for JSON lines, a special kind of JSON file.

In [9]:
# !twarc2 search "David Foster Wallace is:verified" twitter-data/dfw_last_week.jsonl
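
"JSON lines" means that each line of the file is its own complete JSON document. Below is a minimal sketch of reading such a file, assuming that each line is one API response with a `data` list of tweets, which is how twarc2 output is typically laid out. The sample lines, IDs, and text are invented.

```python
import io
import json

# Two invented lines standing in for a .jsonl file; each line is a full JSON object
sample_jsonl = io.StringIO(
    '{"data": [{"id": "1", "text": "first tweet"}, {"id": "2", "text": "second tweet"}]}\n'
    '{"data": [{"id": "3", "text": "third tweet"}]}\n'
)

tweets = []
for line in sample_jsonl:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    response = json.loads(line)
    # Assumption: tweets sit under the "data" key of each response
    tweets.extend(response.get("data", []))

print(len(tweets))
print([tweet["id"] for tweet in tweets])
```

To read a real file, you could replace `sample_jsonl` with `open('twitter-data/dfw_last_week.jsonl')`.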

Theoretically, a tweet with "David," "Foster," and "Wallace" in different places would be matched by the more general search above. If we wanted to match the words "David Foster Wallace" exactly, we would need to put "David Foster Wallace" in quotation marks *and* "escape" those quotation marks, so that `twarc2` will know that our query shouldn't end at the next quotation mark.

In [11]:
# !twarc2 search "\"David Foster Wallace\" is:verified" twitter-data/dfw_exact.jsonl

If you're working on a Mac, you should be able to escape the quotation marks with backslashes `\` before the characters, as shown in the example above. But if you're working on a Windows computer, you may need to use triple quotations instead, for example:

`twarc2 search """ "David Foster Wallace" is:verified""" twitter-data/dfw_exact.jsonl`
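
One way to sidestep shell-specific escaping is to call twarc2 from Python with a list of arguments, so that the whole query travels as a single argument and its inner quotation marks need no escaping. This is only a sketch: `build_search_command` is a hypothetical helper, and the `subprocess.run` call is left commented out because it requires an installed and configured twarc2.

```python
import shlex
import subprocess  # only needed if you uncomment the run call below


def build_search_command(query, outfile, archive=False):
    """Return the twarc2 search command as a list of arguments (hypothetical helper)."""
    command = ["twarc2", "search", query]
    if archive:
        # Same flag used in this lesson for full-archive search
        command.append("--archive")
    command.append(outfile)
    return command


command = build_search_command('"David Foster Wallace" is:verified',
                               "twitter-data/dfw_exact.jsonl")

# Show how the command would be quoted in a POSIX shell
print(shlex.join(command))

# subprocess.run(command, check=True)
```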

## Get Tweets (Academic Track, Full Twitter Archive)


<div class="admonition attention" name="html-admonition" style="background: lightyellow; padding: 10px">
<p class="title">Attention</p>
Remember that this functionality is only available to those who have an <a href="https://developer.twitter.com/en/products/twitter-api/academic-research">Academic Research account</a>.

</div>

To collect tweets from Twitter's entire historical archive, we need to add the `--archive` flag.

In [12]:
# !twarc2 search "David Foster Wallace bro is:verified" --archive twitter-data/dfw_bro.jsonl

## Convert JSONL to CSV

To make our Twitter data easier to work with, we can convert our JSONL file to a CSV file with the [`twarc-csv`](https://pypi.org/project/twarc-csv/) plugin, which needs to be installed separately.

In [15]:
# !pip install twarc-csv

Once installed, we can use the plug-in from twarc2 with the input filename for the JSONL and a desired output filename for the CSV file.

In [16]:
# !twarc2 csv twitter-data/dfw_bro.jsonl twitter-data/dfw_bro.csv

By default, when converting from the JSONL file, `twarc-csv` will only include tweets that were directly returned from the search.

If you want, you can also use the `--inline-referenced-tweets` option to make "referenced" tweets into their own rows in the CSV file. For example, if a quote tweet matched our query, the tweet being quoted would also be included in the CSV file as its own row, even if it didn't match our query. But as of v0.5.0 of twarc-csv, this is no longer the default behavior.

## Read in CSV

Now we're ready to explore the data!

To work with our tweet data, we can read in our CSV file with pandas and again parse the date column.

In [17]:
# tweets_df = pd.read_csv('twitter-data/dfw_bro.csv',
#                         parse_dates = ['created_at'])

If we scroll through this dataset, we can see that there are only 29 tweets that matched our search query, but there is a *lot* of metadata associated with each tweet. Scroll to the right to see all the information. What category surprises you the most? (For me, it's the tweet author's pinned tweet from their own timeline. Your pinned tweet gets attached to everything else you tweet!)

In [18]:
# tweets_df

If we ask for a list of all the columns in the DataFrame, we can see that there are more than 90 columns here!

In [20]:
# tweets_df.columns

As you experiment with the query syntax provided by the Twitter API, you should make a habit of scanning through your collected Twitter data to ensure that your API query and subsequent manipulations are returning the data that you expect and want. If you notice tweets you don't expect, return to your query and see if you can explain why those tweets are turning up. If you can't find an adequate explanation, you might want to ask in the [Twitter Community Forum](https://twittercommunity.com).

## Extract Tweet and Media URLs

We can make some Python functions that will create a tweet URL based on each tweet's unique ID as well as extract an image URL if one exists.

In [21]:
from ast import literal_eval

# Make Tweet URL
def make_tweet_url(tweets):
    # Get username (first value in the row)
    username = tweets.iloc[0]
    # Get tweet ID (second value in the row)
    tweet_id = tweets.iloc[1]
    # Make tweet URL
    tweet_url = f"https://twitter.com/{username}/status/{tweet_id}"
    return tweet_url

# Extract Image URL
def get_image_url(media):
    # Only parse the cell if it is not NaN (a float) or an empty '{}'
    if not isinstance(media, float) and media != '{}':
        # Convert to an actual Python list, not just a string
        media = literal_eval(media)
        # Look only at the first media item
        media = media[0]
        # Extract media url if it exists
        if 'url' in media:
            return media['url']
    # Fall through for missing media or media without a 'url' key
    return "No Image URL"

Here we apply the above Python functions to the relevant columns to create new columns.

In [22]:
# tweets_df['tweet_url'] = tweets_df[['author.username', 'id']].apply(make_tweet_url, axis='columns')
# tweets_df['media'] = tweets_df['attachments.media'].apply(get_image_url)
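
The `attachments.media` column arrives as a string that merely looks like a list, which is why the conversion with `literal_eval` is needed. Here is a small sketch of that step, using an invented cell value (the media key and URL are placeholders, not real data).

```python
from ast import literal_eval

# Invented example of how a media cell might look after reading the CSV
media_cell = '[{"media_key": "3_123", "type": "photo", "url": "https://example.com/image.jpg"}]'

# Before conversion, the cell is just a string
print(type(media_cell).__name__)

# literal_eval safely parses the string into real Python objects
media = literal_eval(media_cell)
first_item = media[0]

# .get() supplies a fallback when an item has no 'url' key
image_url = first_item.get("url", "No Image URL")
print(image_url)
```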

## Rename and Select Columns

To make the data more readable, we're going to rename a number of columns.

In [23]:
# tweets_df.rename(columns={'created_at': 'date',
#                           'public_metrics.retweet_count': 'retweets', 
#                           'author.username': 'username', 
#                           'author.name': 'name',
#                           'author.verified': 'verified', 
#                           'public_metrics.like_count': 'likes', 
#                           'public_metrics.quote_count': 'quotes', 
#                           'public_metrics.reply_count': 'replies',
#                            'author.description': 'user_bio'},
#                             inplace=True)

Then we're only going to select the columns that we're interested in. Depending on your project and research question, you should change and customize these categories.

In [24]:
# tweets_df = tweets_df[['date', 'username', 'name', 'verified', 'text', 'retweets',
#            'likes', 'replies',  'quotes', 'tweet_url', 'media', 'user_bio']]

Now we can view our more focused DataFrame!

In [25]:
# tweets_df

## Sort By Top Retweets

We can sort by number of retweets to see the most circulated tweets. Let's examine the top 5.

In [26]:
# tweets_df.sort_values(by='retweets', ascending=False)[:5]

Here is the most retweeted tweet in this dataset:

<blockquote class="twitter-tweet"><p lang="en" dir="ltr">david foster wallace bro conversation again? i feel like by this point in the discourse i need footnotes! (just a little &quot;insider&quot; dfw humor for ya)</p>&mdash; David Grossman (@davidgross_man) <a href="https://twitter.com/davidgross_man/status/1297936487909130240?ref_src=twsrc%5Etfw">August 24, 2020</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

## Sort By Date

We can sort from the earliest tweets to the latest tweets. Let's examine the earliest 5 tweets.

In [27]:
# tweets_df.sort_values(by='date', ascending=True)[:5]

The earliest tweet in this dataset was posted by Drowned in Sound and addressed to the music label Melodic Records:

<blockquote class="twitter-tweet"><p lang="en" dir="ltr"><a href="https://twitter.com/MelodicRecords?ref_src=twsrc%5Etfw">@MelodicRecords</a> chillax brother, we all gud. Put another blunt on the barbie &#39;n&#39; go wid th&#39; flo bro. Dude man... <a href="http://t.co/q0RUmv12">http://t.co/q0RUmv12</a></p>&mdash; Drowned in Sound ⚓️ (@DrownedinSound) <a href="https://twitter.com/DrownedinSound/status/146330510422048768?ref_src=twsrc%5Etfw">December 12, 2011</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> 

## Plot Tweets Over Time

An easy way to create a plot of tweets over time is to add a column with a 1 for every row, which we can use to count how many tweets were published per day, week, month, or year.

In [28]:
# tweets_df = tweets_df.assign(count=1)

We also need to set the date column to the index so we can do some special date manipulations.

In [29]:
# tweets_df = tweets_df.set_index('date')
# tweets_df

Because our index is a datetime value, we can use the special Pandas method [`.resample()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.resample.html) to group the tweets by month, add them up, and plot them over time.

In [30]:
# tweets_df['count'].resample('M').sum()\
# .plot(title='"David Foster Wallace Bro"\n Tweets from Verified Accounts');
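
To see what the monthly resampling computes, here is a rough standard-library equivalent that tallies timestamps by year and month. The timestamps are invented. One difference worth noting: `.resample()` also produces a zero for months with no tweets, whereas this tally simply omits them.

```python
from collections import Counter
from datetime import datetime

# Invented ISO 8601 timestamps standing in for the date column
timestamps = [
    "2020-08-24T15:02:11+00:00",
    "2020-08-25T09:30:00+00:00",
    "2020-10-01T12:00:00+00:00",
]

# Count one tweet per timestamp, keyed by (year, month)
monthly = Counter()
for stamp in timestamps:
    created = datetime.fromisoformat(stamp)
    monthly[(created.year, created.month)] += 1

# Print the tallies in chronological order
for (year, month), total in sorted(monthly.items()):
    print(f"{year}-{month:02d}: {total}")
```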

## Display Links and Images in Twitter Data

To display links and images in our DataFrame, we can convert the image URL into an HTML image tag, and we can display our DataFrame as an HTML object with the `HTML` class from IPython.

In [31]:
from IPython.display import HTML

def get_image_html(link):
    # check to see if the media category has an image URL
    if isinstance(link, str) and link != "No Image URL":
        # format the image url as a clickable HTML image
        image_html = f"<a href='{link}'><img src='{link}' width='500px'></a>"
        return image_html
    else:
        return "No Image URL"
# Apply the above function to the media column
# tweets_df['media']= tweets_df['media'].apply(get_image_html)

View images

In [32]:
# HTML(tweets_df[['media', 'text']].sort_values(by='media').to_html(render_links=True, escape=False))

View tweet links

In [33]:
# HTML(tweets_df[['tweet_url', 'text', 'retweets']].sort_values(by='retweets', ascending=False).to_html(render_links=True, escape=False))

## Top Hashtags

To analyze hashtags in a tweet dataset, we can use the plugin [`twarc2 hashtags`](https://pypi.org/project/twarc-hashtags/), which requires a separate installation.

In [34]:
# !pip install twarc-hashtags



Then we can create a CSV digest of the top hashtags from our JSONL data with `twarc2 hashtags`. (To get more hashtag data, we are using a JSONL file that contains a full archive search of "David Foster Wallace" rather than "David Foster Wallace bro.")

In [35]:
# !twarc2 hashtags twitter-data/dfw.jsonl twitter-data/dfw_hashtags.csv

In [36]:
# pd.read_csv('twitter-data/dfw_hashtags.csv')
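
Conceptually, a hashtag digest is a frequency count. The sketch below tallies hashtags from tweet objects, assuming the v2 API layout in which hashtags appear under `entities.hashtags` with a `tag` field. The tweets are invented, the lowercasing is an arbitrary choice, and this is not the plugin's actual implementation.

```python
from collections import Counter

# Invented tweet objects; only the fields needed for counting are included
tweets = [
    {"id": "1", "entities": {"hashtags": [{"tag": "InfiniteJest"}, {"tag": "dfw"}]}},
    {"id": "2", "entities": {"hashtags": [{"tag": "dfw"}]}},
    {"id": "3"},  # this tweet has no hashtags, so no "entities" key here
]

hashtag_counts = Counter()
for tweet in tweets:
    # Fall back to empty containers when a tweet has no hashtags
    for hashtag in tweet.get("entities", {}).get("hashtags", []):
        # Lowercase so that #DFW and #dfw are counted together (an assumption)
        hashtag_counts[hashtag["tag"].lower()] += 1

print(hashtag_counts.most_common(2))
```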

We can also use the flag `--group` to group the hashtags by their frequency per time period and the flag `--limit` to limit the hashtags to only the top *n* number of hashtags per grouping.

In [37]:
# !twarc2 hashtags --group year --limit 10 twitter-data/dfw.jsonl twitter-data/dfw_hashtags_year.csv

In [38]:
# hashtags_df = pd.read_csv('twitter-data/dfw_hashtags_year.csv')
# hashtags_df 

To plot the frequency of hashtags over time, we can set the DataFrame index to the "time" column.

In [39]:
# hashtags_df = hashtags_df.set_index('time')

Then we can filter for a specific hashtag and plot its frequency.

In [40]:
# hashtags_df[hashtags_df['hashtag'] == 'writing'].plot(y='tweets', label='#writing', title='DFW Hashtags');

We can also plot multiple hashtags on the same plot by assigning the first plot to the variable `main_axis` and then directing the next plots to be plotted on the same axis `ax=main_axis`.

In [41]:
# main_axis = hashtags_df[hashtags_df['hashtag'] == 'writing'].plot(y='tweets', label='#writing', title='DFW Hashtags')
# hashtags_df[hashtags_df['hashtag'] == 'dfw'].plot(ax=main_axis, y='tweets', label='#dfw')
# hashtags_df[hashtags_df['hashtag'] == 'infinitejest'].plot(ax=main_axis, y='tweets', label='#infinitejest');

## Sentiment Analysis 

See an example of running the English-language [sentiment analysis tool VADER on Donald Trump's tweets](../05-Text-Analysis/04-Sentiment-Analysis).

## Topic Modeling Tweets

See an example of using [topic modeling on Donald Trump's tweets](../05-Text-Analysis/11-Topic-Modeling-Time-Series).