# Twitter data

## Copyright and Licensing

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.

### Please Note:

This notebook may look a little different from the video content in the course. It has been rewritten to use Tweepy, which is generally an easier package to work with. The old notebooks remain available on the course downloads page, but they will not be regularly updated.

# Twitter API Access

To make requests to Twitter's API, you'll need to go to https://dev.twitter.com/apps and create a sample application.

Choose any name for your application, write a description and use `http://google.com` for the website. Further instructions can be found in week 6 of the course.

Under **Key and Access Tokens**, there are four primary identifiers you'll need to note: 
* consumer key, 
* consumer secret, 
* access token, and 
* access token secret (Click on Create Access Token to create those).

Note that you will need an ordinary Twitter account in order to log in, create an app, and get these credentials.
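Once you have the four credentials, one option is to keep them out of the notebook itself and read them from environment variables. The sketch below is only an illustration: the variable names in `REQUIRED` and the helper `load_credentials` are hypothetical, not something Twitter or Tweepy defines.

```python
import os

# Hypothetical variable names; use whatever names you export in your shell.
REQUIRED = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)

def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED}

# Demonstration with a stand-in mapping instead of the real environment
demo_env = {name: "placeholder-" + str(i) for i, name in enumerate(REQUIRED)}
creds = load_credentials(demo_env)
print(sorted(creds))
```

Calling `load_credentials()` with no argument reads from the real environment; the returned values can then be used wherever the keys and tokens are needed.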

Install the `tweepy` package to interface with the Twitter API:

In [None]:
# Install the package we will be using (uncomment the next line to run it)
# !pip install tweepy

## Example 1. Authorizing an application to access Twitter account data

In [1]:
import tweepy

# Set up the keys and tokens: consumer key/secret, then access token/secret
c_k = "JTbow39cknsEnz90n9qtQP6Yd"
c_s = "CBrW95sUS4pChbTwglLKCeUlV9ykmi1LNjLfQtdqfCMbzEM8C4"

a_t = "1960748732-SZU3yFV37gFS6glcKtKeJWpWlRmPi6Lbcgm1d30"
a_s = "Fop8tHsjPa65GJkYtQf0FXRlM4CYiRKkkKgfyOTJ6vxn7"

auth = tweepy.OAuthHandler(c_k, c_s)
auth.set_access_token(a_t, a_s)
api = tweepy.API(auth)

# Nothing to see by displaying api except that it's now a
# defined variable

print(api)

<tweepy.api.API object at 0x000001F885E83A08>


## Example 2. Retrieving trends

Twitter identifies locations using Yahoo! Where On Earth IDs (WOE IDs).

The WOE ID for the entire world is 1.
See https://dev.twitter.com/docs/api/1.1/get/trends/place and
http://developer.yahoo.com/geo/geoplanet/

Also look at the BOSS PlaceFinder here: https://developer.yahoo.com/boss/placefinder/

To look up the WOE ID for an area, use:
https://www.findmecity.com/

In [2]:
WORLD_WOE_ID = 1
US_WOE_ID = 23424977

Look up the WOE ID for "San Diego" and you should find the ID defined below as `LOCAL_WOE_ID`.

You can change this if you would like.

In [3]:
LOCAL_WOE_ID = 2487889

# trends_place() takes a WOE ID and returns a list whose first element
# holds the trend data for that location.
# (This is the Tweepy 3.x name; Tweepy 4 renamed it to get_place_trends().)

world_trends = api.trends_place(WORLD_WOE_ID)
us_trends = api.trends_place(US_WOE_ID)
local_trends = api.trends_place(LOCAL_WOE_ID)

In [4]:
world_trends[:2]

[{'trends': [{'name': '#dogecoin',
    'url': 'http://twitter.com/search?q=%23dogecoin',
    'promoted_content': None,
    'query': '%23dogecoin',
    'tweet_volume': 689139},
   {'name': 'Durk',
    'url': 'http://twitter.com/search?q=Durk',
    'promoted_content': None,
    'query': 'Durk',
    'tweet_volume': 28101},
   {'name': 'Lucas',
    'url': 'http://twitter.com/search?q=Lucas',
    'promoted_content': None,
    'query': 'Lucas',
    'tweet_volume': 578471},
   {'name': 'Cicely Tyson',
    'url': 'http://twitter.com/search?q=%22Cicely+Tyson%22',
    'promoted_content': None,
    'query': '%22Cicely+Tyson%22',
    'tweet_volume': 559271},
   {'name': '#bbb21',
    'url': 'http://twitter.com/search?q=%23bbb21',
    'promoted_content': None,
    'query': '%23bbb21',
    'tweet_volume': 1033912},
   {'name': 'GitHub',
    'url': 'http://twitter.com/search?q=GitHub',
    'promoted_content': None,
    'query': 'GitHub',
    'tweet_volume': 209609},
   {'name': 'Selena',
    'url': '

In [5]:
trends=local_trends
print(type(trends))
print(list(trends[0].keys()))
print(trends[0]['trends'])

<class 'list'>
['trends', 'as_of', 'created_at', 'locations']
[{'name': 'Kiké', 'url': 'http://twitter.com/search?q=Kik%C3%A9', 'promoted_content': None, 'query': 'Kik%C3%A9', 'tweet_volume': None}, {'name': 'Lakers', 'url': 'http://twitter.com/search?q=Lakers', 'promoted_content': None, 'query': 'Lakers', 'tweet_volume': 64792}, {'name': 'Suns', 'url': 'http://twitter.com/search?q=Suns', 'promoted_content': None, 'query': 'Suns', 'tweet_volume': 16218}, {'name': 'Weezer', 'url': 'http://twitter.com/search?q=Weezer', 'promoted_content': None, 'query': 'Weezer', 'tweet_volume': None}, {'name': 'Blake Griffin', 'url': 'http://twitter.com/search?q=%22Blake+Griffin%22', 'promoted_content': None, 'query': '%22Blake+Griffin%22', 'tweet_volume': None}, {'name': 'Gibson', 'url': 'http://twitter.com/search?q=Gibson', 'promoted_content': None, 'query': 'Gibson', 'tweet_volume': None}, {'name': '#Walker', 'url': 'http://twitter.com/search?q=%23Walker', 'promoted_content': None, 'query': '%23Walke
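The output above shows that each response is a list whose first element has a `trends` key, and that `tweet_volume` can be `None`. As a minimal sketch of working with that structure, the hypothetical helper `top_trends` below ranks trends by volume and skips the ones with no reported volume. The sample data is a hand-copied subset of the output above, not a live API response.

```python
# Sample shaped like the response above (values copied from the output)
sample_response = [{
    "trends": [
        {"name": "Kiké", "tweet_volume": None},
        {"name": "Lakers", "tweet_volume": 64792},
        {"name": "Suns", "tweet_volume": 16218},
        {"name": "Weezer", "tweet_volume": None},
    ]
}]

def top_trends(response, n=3):
    """Return up to n (name, volume) pairs, highest volume first.

    Trends with no reported volume (None) are skipped.
    """
    trends = response[0]["trends"]
    with_volume = [(t["name"], t["tweet_volume"])
                   for t in trends if t["tweet_volume"] is not None]
    return sorted(with_volume, key=lambda pair: pair[1], reverse=True)[:n]

print(top_trends(sample_response))
# → [('Lakers', 64792), ('Suns', 16218)]
```

Filtering out `None` first matters because Python 3 cannot compare `None` with an integer when sorting.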

## Example 3. Displaying API responses as pretty-printed JSON

In [6]:
import json

print(json.dumps(us_trends[:2], indent=1))

[
 {
  "trends": [
   {
    "name": "#dogecoin",
    "url": "http://twitter.com/search?q=%23dogecoin",
    "promoted_content": null,
    "query": "%23dogecoin",
    "tweet_volume": 689139
   },
   {
    "name": "Durk",
    "url": "http://twitter.com/search?q=Durk",
    "promoted_content": null,
    "query": "Durk",
    "tweet_volume": 28101
   },
   {
    "name": "Cicely Tyson",
    "url": "http://twitter.com/search?q=%22Cicely+Tyson%22",
    "promoted_content": null,
    "query": "%22Cicely+Tyson%22",
    "tweet_volume": 558893
   },
   {
    "name": "Jewish Space Laser",
    "url": "http://twitter.com/search?q=%22Jewish+Space+Laser%22",
    "promoted_content": null,
    "query": "%22Jewish+Space+Laser%22",
    "tweet_volume": 47688
   },
   {
    "name": "#CriticalRoleSpoilers",
    "url": "http://twitter.com/search?q=%23CriticalRoleSpoilers",
    "promoted_content": null,
    "query": "%23CriticalRoleSpoilers",
    "tweet_volume": null
   },
   {
    "name": "Brent",
    "url": "htt

## Example 4. Computing the intersection of two sets of trends

In [7]:
# Collect the trend names for each location into a set
trends_set = {}
trends_set['world'] = {trend['name']
                       for trend in world_trends[0]['trends']}

trends_set['us'] = {trend['name']
                    for trend in us_trends[0]['trends']}

trends_set['san diego'] = {trend['name']
                           for trend in local_trends[0]['trends']}

In [8]:
for loc in ['world','us','san diego']:
    print(('-'*10,loc))
    print((','.join(trends_set[loc])))

('----------', 'world')
Lumena,見た目のおじさん,TI and Tiny,Durk,新発売のチーク,#bbb21,情報漏洩,GitHub,Whindersson,Cicely Tyson,Kerline,年収300万,Brent,Jewish Space Laser,팬레터 세계관,高校トイレ,Pistons,ウルトラマン,DALE AGACHADITA REMIX,Selena,シカ立てこもり中,セキュリティー,Lucas,メイク悩み,rauw,Oubre,楽天モバイル,下半身露出,Flamengo,コロナ抑止,SMBC,心のおじさん,ソースコード,#MACLOVESLISA,#dogecoin,#FarmerTikaitVsModiDakait,年収1500万,葛葉ダブハン,さんの推定年収,Binance,見た目のJKさ,党本部対象,Gabigol,コフレドール,#BailaConmigo,心のJKさ,juliette,本当の息子,三井住友銀,スパクロ
('----------', 'us')
PARTYNEXTDOOR,rip queen,Durk,Oladipo,Cicely Tyson,Nestor,Brent,#CriticalRoleSpoilers,Vlad,Jewish Space Laser,Coinbase,Pistons,Rockies,Andre 3000,colours,Kadri,Bebo,SUMMER OF SOUL,ONE FOR THE ROAD,Agholor,GARLIC KNOT,Should've Ducked,#SilhoutteChallenge,Wyoming,Robinhood CEO,Matt Gaetz,OK Human,Rockets,Method Man,Sounder,Oubre,#TheVoiceDeluxe,lil kennedy,Eddie Munster,Kenya Barris,#dogecoin,Arenado,Jon Stewart,Butthead,Binance,Jane Pittman,Lucien,Chris Cuomo,Beavis,#BailaConmigo,Belmont,Nader,Madlib,Frank Kaminsky,Wendy Will

In [9]:
print(( '='*10,'intersection of world and us'))
print((trends_set['world'].intersection(trends_set['us'])))

print(('='*10,'intersection of us and san-diego'))
print((trends_set['san diego'].intersection(trends_set['us'])))

{'Jewish Space Laser', '#dogecoin', 'Pistons', 'Binance', 'Durk', '#BailaConmigo', 'Cicely Tyson', 'Oubre', 'Brent'}
{'PARTYNEXTDOOR', 'Jewish Space Laser', 'Coinbase', 'Pistons', 'Arenado', 'Andre 3000', 'Jon Stewart', 'Butthead', 'Durk', 'Binance', 'Jane Pittman', 'Lucien', 'Cicely Tyson', 'Oubre', 'Nader', "Should've Ducked", 'Brent', 'Frank Kaminsky'}
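Intersection is only one of the set operations that can be useful here. The sketch below uses two small hand-picked sets (a few names taken from the output above, not the full results) to show difference and symmetric difference alongside intersection.

```python
# Abbreviated stand-ins for trends_set['world'] and trends_set['us']
world = {"#dogecoin", "Durk", "GitHub", "Cicely Tyson"}
us = {"#dogecoin", "Durk", "Cicely Tyson", "Coinbase"}

common = world & us            # same as world.intersection(us)
only_world = world - us        # trending worldwide but not in this US sample
only_us = us - world           # trending in the US sample but not worldwide
either_not_both = world ^ us   # symmetric difference

print(sorted(common))
print(sorted(only_world))
print(sorted(only_us))
print(sorted(either_not_both))
```

Sorting before printing gives a stable order, since sets themselves are unordered.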


## Example 5. Collecting search results

Set the variable `q` to a trending topic,
or anything else for that matter. The example query below
was a trending topic when this content was being developed
and is used throughout the remainder of this chapter.

In [10]:
# You can change this to whatever hashtag you want, but if the tag isn't
# popular enough you might not get back a lot of results
q = "Keanu"

number = 100

search_results = tweepy.Cursor(api.search, q=q, lang="en").items(number)

# This gives us an iterator
print(search_results)

# We will be looking at the retweet count, a derived "retweeted" flag,
# and the text of each tweet
tweets = []
retweeted = []
retweet_count = []

for tweet in search_results:
    tweets.append(tweet.text)
    retweet_count.append(tweet.retweet_count)
    # Define "retweeted" as True when the tweet has at least one retweet
    retweeted.append(tweet.retweet_count > 0)


#tweets

<tweepy.cursor.ItemIterator object at 0x000001F885F8BB08>


In [11]:
# Not necessary, but this does make the data look pretty
import pandas as pd

df = pd.DataFrame({'Tweet':tweets, 'Retweeted':retweeted, "Retweet Count":retweet_count})

df

Unnamed: 0,Tweet,Retweeted,Retweet Count
0,RT @News18Tech: #Cyberpunk2077 mod that allowe...,True,1
1,RT @my2k: roflmao elliot page is pointing and ...,True,26
2,#Cyberpunk2077 mod that allowed users to have ...,True,1
3,RT @my2k: roflmao elliot page is pointing and ...,True,26
4,RT @my2k: roflmao elliot page is pointing and ...,True,26
...,...,...,...
95,@STFUandStayAway That's Keanu Reaves.... John ...,False,0
96,RT @partygirlu2: thinking about keanu reeves a...,True,115
97,This kid cudi album driving through the rain&g...,False,0
98,This fight this war in me,False,0
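Note that the `Retweeted` column means "has been retweeted at least once", which is why row 2 is `True` even though its text does not start with `RT`. A minimal sketch of telling the two cases apart follows. It assumes that status objects which are themselves retweets carry a `retweeted_status` attribute, mirroring the v1.1 API payload; the objects below are stand-ins built with `SimpleNamespace`, not real API results, and `classify` is a hypothetical helper.

```python
from types import SimpleNamespace

# Stand-in objects shaped like the rows above (texts as displayed there)
sample = [
    SimpleNamespace(text="RT @News18Tech: #Cyberpunk2077 mod that allowe...",
                    retweet_count=1,
                    retweeted_status=SimpleNamespace()),  # marks a retweet
    SimpleNamespace(text="#Cyberpunk2077 mod that allowed users to have ...",
                    retweet_count=1),
    SimpleNamespace(text="This fight this war in me", retweet_count=0),
]

def classify(tweet):
    """Return (is_retweet, has_retweets) for a status-like object."""
    is_retweet = hasattr(tweet, "retweeted_status")  # assumption, see above
    has_retweets = tweet.retweet_count > 0
    return is_retweet, has_retweets

for t in sample:
    print(classify(t), t.text)
```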


Twitter often returns duplicate results. We can filter them out by checking for duplicate texts:

In [12]:
# Keep the first occurrence of each tweet text, preserving order
filtered_tweets = []
for t in tweets:
    if t not in filtered_tweets:
        filtered_tweets.append(t)
#filtered_tweets
filtered_tweets[0]

"RT @News18Tech: #Cyberpunk2077 mod that allowed users to have sex with Keanu Reeves has been taken down by @CDPROJEKTRED. Here's the full s…"

In [13]:
# Number of unique tweets in our search results
print(len(filtered_tweets))
if len(filtered_tweets) < len(tweets):
    print("There were duplicates in our search results!")

67
There were duplicates in our search results!
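Checking membership in a list scans the whole list each time, which is fine for 100 tweets but slows down on larger collections. As an alternative sketch, `dict.fromkeys` removes duplicates while keeping first-seen order (dictionaries preserve insertion order in Python 3.7+). The sample list below is made up for illustration.

```python
# Made-up sample with repeated entries
sample_tweets = ["first", "second", "first", "third", "second"]

# dict keys are unique and keep insertion order, so this drops later repeats
unique_tweets = list(dict.fromkeys(sample_tweets))

print(unique_tweets)
# → ['first', 'second', 'third']
print(len(sample_tweets) - len(unique_tweets), "duplicates removed")
```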


## Example 6. Creating a basic frequency distribution from the words in tweets

In [14]:
from collections import Counter

words = []

for t in tweets:
    for word in t.split():
        words.append(word)
        
c = Counter(words)
c.most_common(10)

[('RT', 52),
 ("it's", 48),
 ('a', 43),
 ('and', 39),
 ('Keanu', 38),
 ('keanu', 37),
 ('is', 36),
 ('but', 36),
 ('you', 36),
 ('not', 33)]
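In the counts above, `Keanu` and `keanu` are tallied separately, and common words such as `a` and `and` dominate. One possible refinement, sketched here with made-up sample texts, is to lowercase each word, strip surrounding punctuation, and skip a small stop-word list. The `STOP_WORDS` set and the `word_counts` helper are illustrative choices, not part of the original notebook.

```python
from collections import Counter
import string

# Illustrative stop-word list; extend it to suit your data
STOP_WORDS = {"rt", "a", "and", "is", "but", "you", "not", "it's"}

def word_counts(texts, stop_words=STOP_WORDS):
    """Count words case-insensitively, ignoring stop words."""
    counts = Counter()
    for text in texts:
        for raw in text.split():
            # Strips leading/trailing punctuation, including '#' and '@'
            word = raw.strip(string.punctuation).lower()
            if word and word not in stop_words:
                counts[word] += 1
    return counts

sample_texts = ["RT @my2k: Keanu is great",
                "keanu and KEANU!",
                "This fight this war in me"]
print(word_counts(sample_texts).most_common(2))
# → [('keanu', 3), ('this', 2)]
```

Because the punctuation strip also removes `#` and `@`, hashtags and mentions are merged with plain words; drop those characters from the strip set if you want to keep them distinct.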

## Example 7. Create a prettyprint function to display tuples in a nice tabular format

In [15]:
def prettyprint_counts(label, list_of_tuples):
    print("\n{:^20} | {:^6}".format(label, "Count"))
    print("*"*40)
    for k,v in list_of_tuples:
        print("{:20} | {:>6}".format(k,v))

In [16]:
for label, data in (('Word', words), 
                    ('Retweet_count', retweet_count)):
    
    c = Counter(data)
    prettyprint_counts(label, c.most_common(10))


        Word         | Count 
****************************************
RT                   |     52
it's                 |     48
a                    |     43
and                  |     39
Keanu                |     38
keanu                |     37
is                   |     36
but                  |     36
you                  |     36
not                  |     33

   Retweet_count     | Count 
****************************************
                   0 |     45
                  26 |     24
                 115 |      6
                   1 |      5
                   6 |      4
                 176 |      3
                  22 |      3
                  56 |      2
                  12 |      1
                1158 |      1


## Example 8. Finding the most popular retweets

In [17]:
# Build a boolean filter that is True only for rows where Retweeted is True
filter1 = df['Retweeted'] == True

# where() keeps the rows that match the filter and fills the rest with NaN
rt_df = df.where(filter1)

# Drop the NaN rows, leaving only the retweeted tweets
rt_df = rt_df.dropna()

# The index has gaps because the remaining rows keep their original labels
rt_df.head(10)

Unnamed: 0,Tweet,Retweeted,Retweet Count
0,RT @News18Tech: #Cyberpunk2077 mod that allowe...,1.0,1.0
1,RT @my2k: roflmao elliot page is pointing and ...,1.0,26.0
2,#Cyberpunk2077 mod that allowed users to have ...,1.0,1.0
3,RT @my2k: roflmao elliot page is pointing and ...,1.0,26.0
4,RT @my2k: roflmao elliot page is pointing and ...,1.0,26.0
5,RT @CBR: BOOM! Studios' #BRZRKR 1 Gets Keanu R...,1.0,12.0
8,RT @my2k: roflmao elliot page is pointing and ...,1.0,26.0
10,RT @my2k: roflmao elliot page is pointing and ...,1.0,26.0
11,RT @my2k: roflmao elliot page is pointing and ...,1.0,26.0
12,RT @garrynewman: How can you call it modding i...,1.0,56.0


We can sort this DataFrame in descending order of the number of retweets using `df.sort_values()`:

In [18]:
rt_df_sorted = rt_df.sort_values(by="Retweet Count", ascending=False)

rt_df_sorted.head(5)

Unnamed: 0,Tweet,Retweeted,Retweet Count
62,RT @AnetaMolenda: This isn’t an accident. http...,1.0,11494.0
35,RT @mitchellorval: Keanu reeves running off wi...,1.0,1505.0
20,RT @UtadaHikaruVEVO: Me including this isn't r...,1.0,1158.0
43,RT @IGN: CD Projekt Red has removed a Cyberpun...,1.0,176.0
61,RT @IGN: CD Projekt Red has removed a Cyberpun...,1.0,176.0


We can build another `prettyprint` function to print entire tweets with their retweet count.

We also want to split the text of each tweet into up to 3 lines, if needed.

In [19]:
### Remember our prettyprint_counts function from above.
### This version starts from the same code (the body is currently unchanged).
def prettyprint_counts_modified(label, list_of_tuples):
    print("\n{:^20} | {:^6}".format(label, "Count"))
    print("*"*40)
    for k,v in list_of_tuples:
        print("{:20} | {:>6}".format(k,v))
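The function above does not yet split long tweet text across lines. One way to do that, offered here as a sketch rather than the notebook's own method, is the standard-library `textwrap.wrap` with its `max_lines` option. The helper names `format_tweet_row` and `prettyprint_tweets`, the 40-character width, and the `" ..."` placeholder are all assumptions made for this example.

```python
import textwrap

def format_tweet_row(text, count, width=40, max_lines=3):
    """Return the display lines for one tweet and its count."""
    # wrap() also turns newlines inside the tweet into spaces
    lines = textwrap.wrap(text, width=width, max_lines=max_lines,
                          placeholder=" ...")
    if not lines:
        lines = [""]
    out = []
    for i, line in enumerate(lines):
        # Show the count on the first line only
        suffix = "{:>6}".format(count) if i == 0 else ""
        out.append("{:{w}} | {}".format(line, suffix, w=width).rstrip())
    return out

def prettyprint_tweets(label, list_of_tuples, width=40):
    """Print (text, count) pairs with the text wrapped to the given width."""
    print("\n{:^{w}} | {:^6}".format(label, "Count", w=width))
    print("*" * (width + 9))
    for text, count in list_of_tuples:
        for line in format_tweet_row(text, count, width=width):
            print(line)

# Made-up sample row; real input would come from c2.most_common()
prettyprint_tweets("Tweet", [("word " * 30, 5), ("short tweet", 2)])
```

Text that does not fit in three lines is cut off and ends with the placeholder, so the count column stays aligned.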

In [20]:
rt_tweets = rt_df_sorted["Tweet"]
rt_re_count = rt_df_sorted["Retweet Count"]

for label, data in (('Tweet', rt_tweets), 
                    ('Retweet_count', rt_re_count)):
    
    c2 = Counter(data)
    prettyprint_counts_modified(label, c2.most_common()[:5])


       Tweet         | Count 
****************************************
RT @my2k: roflmao elliot page is pointing and laughing

suddenly it's a problem when it's keanu your bro idol but not someone who you depic… |     23
RT @partygirlu2: thinking about keanu reeves as john constantine https://t.co/Des27gVaxT |      4
RT @IGN: CD Projekt Red has removed a Cyberpunk 2077 mod that  would  let players swap models and have sex with characters such as Keanu Re… |      3
RT @merrittk: I have decided to become 1991 Keanu Reeves at this time. It may be difficult to accept but this is the right thing for me |      3
RT @HYPEBEAST: The mod has since been taken down. https://t.co/XCtRzMkTyS |      3

   Retweet_count     | Count 
****************************************
                26.0 |     24
               115.0 |      6
                 1.0 |      5
                 6.0 |      4
               176.0 |      3
