# Twitter data

## Copyright and Licensing

You are free to use or adapt this notebook for any purpose you'd like. However, please respect the [Simplified BSD License](https://github.com/ptwobrussell/Mining-the-Social-Web-2nd-Edition/blob/master/LICENSE.txt) that governs its use.

### Please Note:

This notebook may look a little different from the video content in the course. It has been rewritten to use Tweepy, which is generally an easier package to work with. The old notebooks remain available on the course downloads page, but they will not be regularly updated.

# Twitter API Access

To make requests to Twitter's API, you'll need to go to https://dev.twitter.com/apps and create a sample application.

Choose any name for your application, write a description and use `http://google.com` for the website. Further instructions can be found in week 6 of the course.

Under **Key and Access Tokens**, there are four primary identifiers you'll need to note: 
* consumer key, 
* consumer secret, 
* access token, and 
* access token secret (click on Create Access Token to generate the access token and its secret).

Note that you will need an ordinary Twitter account in order to log in, create an app, and get these credentials.

Install the `tweepy` package to interface with the Twitter API:

In [None]:
#pip install for the package we will be using
!pip install tweepy

## Example 1. Authorizing an application to access Twitter account data

In [None]:
import tweepy

# Set up the keys and tokens (replace the placeholder strings with your own credentials)
c_k = "Consumer_Key"
c_s = "Consumer_Secret"

a_t = "Access_Token"
a_s = "Access_Token_Secret"

auth = tweepy.OAuthHandler(c_k, c_s)
auth.set_access_token(a_t, a_s)
api = tweepy.API(auth)

# Printing api shows little beyond confirming that it is now a
# defined variable

print(api)
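Hardcoding credentials in a notebook makes them easy to leak when the notebook is shared. One possible alternative, sketched below, reads the four values from environment variables. The variable names (`TWITTER_CONSUMER_KEY` and so on) and the `load_credentials` helper are hypothetical choices made for this illustration, not something Tweepy or Twitter requires.

```python
import os

# Hypothetical environment variable names; pick whatever suits your setup.
ENV_NAMES = {
    "consumer_key": "TWITTER_CONSUMER_KEY",
    "consumer_secret": "TWITTER_CONSUMER_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_credentials(env=None):
    """Return the four credentials as a dict, or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_NAMES.values() if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_NAMES.items()}


# Demonstration with a plain dict standing in for os.environ (fake values).
fake_env = {name: "placeholder-" + key for key, name in ENV_NAMES.items()}
creds = load_credentials(fake_env)
print(sorted(creds))
```

The resulting values could then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of the literal strings used in Example 1.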

## Example 2. Retrieving trends

Twitter identifies locations using Yahoo! Where On Earth (WOE) IDs. The WOE ID for the entire world is 1.

See https://dev.twitter.com/docs/api/1.1/get/trends/place and http://developer.yahoo.com/geo/geoplanet/ for details, and look at the BOSS PlaceFinder here: https://developer.yahoo.com/boss/placefinder/

To look up the WOE ID for an area, use https://www.findmecity.com/

In [None]:
WORLD_WOE_ID = 1
US_WOE_ID = 23424977

Look up the WOE ID for "San Diego"; it should match the value assigned to `LOCAL_WOE_ID` below.

You can change this to another location if you would like.

In [None]:
LOCAL_WOE_ID = 2487889

# Each call returns a list holding a single dictionary; its 'trends' key
# contains the trending topics for that location.
# Note: newer Tweepy releases (4.x) rename trends_place to get_place_trends.

world_trends = api.trends_place(WORLD_WOE_ID)
us_trends = api.trends_place(US_WOE_ID)
local_trends = api.trends_place(LOCAL_WOE_ID)

In [None]:
world_trends[:2]

In [None]:
trends=local_trends
print(type(trends))
print(list(trends[0].keys()))
print(trends[0]['trends'])
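To make the shape of the response easier to see without calling the API, here is a small sketch that works on a hand-written sample. The sample values are made up; only the layout (a list holding one dictionary with a `'trends'` list) mirrors what the cells above access. The `trend_names_by_volume` helper is hypothetical, and it assumes each trend may carry a `tweet_volume` field that can be missing or `None`.

```python
# Made-up sample in the same layout as the response used above.
sample_trends = [
    {
        "as_of": "2020-01-01T00:00:00Z",
        "created_at": "2020-01-01T00:00:00Z",
        "locations": [{"name": "San Diego", "woeid": 2487889}],
        "trends": [
            {"name": "#ExampleOne", "tweet_volume": 12000},
            {"name": "Example Two", "tweet_volume": None},
            {"name": "#ExampleThree", "tweet_volume": 45000},
        ],
    }
]


def trend_names_by_volume(response):
    """Return trend names ordered by tweet volume, largest first."""
    trends = response[0]["trends"]
    # Treat a missing or None volume as 0 so the sort never compares None.
    ranked = sorted(trends, key=lambda t: t.get("tweet_volume") or 0, reverse=True)
    return [t["name"] for t in ranked]


print(list(sample_trends[0].keys()))
print(trend_names_by_volume(sample_trends))
```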

## Example 3. Displaying API responses as pretty-printed JSON

In [None]:
import json

print(json.dumps(us_trends[:2], indent=1))
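A couple of optional `json.dumps` arguments can make the output easier to scan. The sketch below uses a made-up sample rather than a live response: `sort_keys=True` prints keys alphabetically, and `ensure_ascii=False` keeps non-ASCII characters readable instead of escaping them.

```python
import json

# Made-up sample standing in for an API response.
sample_response = [
    {
        "trends": [{"name": "café", "tweet_volume": None}],
        "as_of": "2020-01-01T00:00:00Z",
    }
]

pretty = json.dumps(sample_response, indent=2, sort_keys=True, ensure_ascii=False)
print(pretty)
```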

## Example 4. Computing the intersection of two sets of trends

In [None]:
trends_set = {}
trends_set['world'] = set([trend['name'] 
                        for trend in world_trends[0]['trends']])

trends_set['us'] = set([trend['name'] 
                     for trend in us_trends[0]['trends']]) 

trends_set['san diego'] = set([trend['name'] 
                     for trend in local_trends[0]['trends']]) 

In [None]:
for loc in ['world', 'us', 'san diego']:
    # Print a separator with the location name, then its trends
    print('-'*10, loc)
    print(','.join(trends_set[loc]))

In [None]:
print('='*10, 'intersection of world and us')
print(trends_set['world'].intersection(trends_set['us']))

print('='*10, 'intersection of us and san diego')
print(trends_set['san diego'].intersection(trends_set['us']))
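Sets support more than pairwise intersection. As a rough illustration on made-up trend names, the sketch below finds the trends common to all three locations and the trends that appear in the US set but not in the world set.

```python
# Made-up trend names, organized like trends_set above.
sample_sets = {
    "world": {"#A", "#B", "#C"},
    "us": {"#B", "#C", "#D"},
    "san diego": {"#C", "#E"},
}

# Trends shared by every location.
common_to_all = set.intersection(*sample_sets.values())

# Trends in the US set that are not trending worldwide.
us_only = sample_sets["us"] - sample_sets["world"]

print(common_to_all)
print(us_only)
```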

## Example 5. Collecting search results

Set the variable `q` to a trending topic, or anything else for that matter. The example query below was a trending topic when this content was being developed and is used throughout the remainder of this chapter.

In [None]:
# You can change this to whatever search term or hashtag you want, but if it
# isn't popular enough you might not get back a lot of results
q = "Keanu"

number = 100

# Note: newer Tweepy releases (4.x) rename api.search to api.search_tweets
search_results = tweepy.Cursor(api.search, q=q, lang="en").items(number)

# This gives us an iterator over the matching tweets
print(search_results)

# We will collect each tweet's text, its retweet count, and a
# "retweeted" flag derived from that count
tweets = []
retweeted = []
retweet_count = []

for tweet in search_results:
    tweets.append(tweet.text)
    retweet_count.append(tweet.retweet_count)
    # Mark the tweet as retweeted if it has at least one retweet
    retweeted.append(tweet.retweet_count > 0)


#tweets

In [None]:
# Not necessary, but this does make the data look pretty
import pandas as pd

df = pd.DataFrame({'Tweet':tweets, 'Retweeted':retweeted, "Retweet Count":retweet_count})

df

Twitter often returns duplicate results. We can filter them out by checking for duplicate texts:

In [None]:
all_text = []
filtered_tweets = []
for t in tweets:
    # Keep a tweet only the first time its text appears
    if t not in all_text:
        filtered_tweets.append(t)
        all_text.append(t)

# filtered_tweets
filtered_tweets[0]

In [None]:
# This gives us the number of unique tweets in our search results
print(len(filtered_tweets))
if len(filtered_tweets) < len(tweets):
    print("There were duplicates in our search results!")
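Checking membership in a list gets slower as the list grows. A common alternative, sketched here on made-up texts, tracks the texts already seen in a set while still preserving the original order. The `unique_in_order` helper name is hypothetical.

```python
def unique_in_order(texts):
    """Return texts with duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for text in texts:
        if text not in seen:
            seen.add(text)
            unique.append(text)
    return unique


# Made-up sample texts with one duplicate.
sample_texts = ["RT hello", "world", "RT hello", "again"]
print(unique_in_order(sample_texts))

# dict.fromkeys gives the same result in one line, since dicts keep insertion order.
print(list(dict.fromkeys(sample_texts)))
```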

## Example 6. Creating a basic frequency distribution from the words in tweets

In [None]:
from collections import Counter

words = []

for t in tweets:
    for word in t.split():
        words.append(word)
        
c = Counter(words)
c.most_common(10)
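Splitting on whitespace treats `Keanu`, `keanu,` and `keanu...` as different words. One possible refinement, shown here on made-up texts, lowercases each token and strips surrounding punctuation while keeping the `#` and `@` prefixes that mark hashtags and mentions. The `normalized_words` helper is a hypothetical name, and this is only one of many reasonable normalization choices.

```python
import string
from collections import Counter

# Punctuation to strip from token edges, keeping '#' and '@' intact.
STRIP_CHARS = string.punctuation.replace("#", "").replace("@", "")


def normalized_words(texts):
    """Lowercase each whitespace-separated token and trim edge punctuation."""
    words = []
    for text in texts:
        for token in text.lower().split():
            cleaned = token.strip(STRIP_CHARS)
            if cleaned:  # skip tokens that were only punctuation
                words.append(cleaned)
    return words


# Made-up sample texts.
sample_texts = ["Keanu is great!", "keanu, Keanu... #Keanu"]
counts = Counter(normalized_words(sample_texts))
print(counts.most_common(3))
```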

## Example 7. Create a prettyprint function to display tuples in a nice tabular format

In [None]:
def prettyprint_counts(label, list_of_tuples):
    print("\n{:^20} | {:^6}".format(label, "Count"))
    print("*"*40)
    for k,v in list_of_tuples:
        print("{:20} | {:>6}".format(k,v))

In [None]:
for label, data in (('Word', words), 
                    ('Retweet_count', retweet_count)):
    # Count occurrences and show the ten most common values
    c = Counter(data)
    prettyprint_counts(label, c.most_common(10))

## Example 8. Finding the most popular retweets

In [None]:
# Build a Boolean filter that is True only for rows marked as retweeted
filter1 = df['Retweeted'] == True

# DataFrame.where keeps rows matching the filter and fills the rest with NaN
rt_df = df.where(filter1)

# Dropping the NaN rows leaves only the retweeted tweets
rt_df = rt_df.dropna()

# The indices look odd because the original row indices are kept
rt_df.head(10)

We can sort this dataframe in descending order of the number of retweets using `df.sort_values()`:

In [None]:
rt_df_sorted = rt_df.sort_values(by="Retweet Count", ascending=False)

rt_df_sorted.head(5)

We can build another `prettyprint` function to print entire tweets with their retweet count.

We also want to split the text of the tweet into up to 3 lines, if needed.

In [None]:
### Remember our prettyprint_counts function from above
### This version keeps the same layout under a new name
def prettyprint_counts_modified(label, list_of_tuples):
    print("\n{:^20} | {:^6}".format(label, "Count"))
    print("*"*40)
    for k,v in list_of_tuples:
        print("{:20} | {:>6}".format(k,v))
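The function above prints each key on a single line. One way to split a tweet over up to 3 lines is the standard library's `textwrap.wrap`, which accepts `max_lines` and a `placeholder` for truncated text. The sketch below is only an illustration under that approach; the `wrap_tweet` and `prettyprint_tweets` names, the 40-character width, and the sample tweet are all assumptions made for this example.

```python
import textwrap


def wrap_tweet(text, width=40, max_lines=3):
    """Split text into at most max_lines lines of at most width characters."""
    lines = textwrap.wrap(text, width=width, max_lines=max_lines, placeholder=" ...")
    return lines or [""]  # wrap returns [] for empty text


def prettyprint_tweets(list_of_tuples, width=40, max_lines=3):
    """Print (tweet, count) pairs, wrapping each tweet over several lines."""
    print("\n{:^{w}} | {:^6}".format("Tweet", "Count", w=width))
    print("*" * (width + 9))
    for text, count in list_of_tuples:
        lines = wrap_tweet(text, width, max_lines)
        # The count goes on the first line; continuation lines leave it blank.
        print("{:{w}} | {:>6}".format(lines[0], count, w=width))
        for line in lines[1:]:
            print("{:{w}} |".format(line, w=width))


# Made-up sample data.
long_tweet = "word " * 50
prettyprint_tweets([(long_tweet, 12), ("short tweet", 3)])
```

Text that does not fit in the allowed lines is cut off and marked with the placeholder, so very long tweets stay readable in the table.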

In [None]:
rt_tweets = rt_df_sorted["Tweet"]
rt_re_count = rt_df_sorted["Retweet Count"]

for label, data in (('Tweet', rt_tweets), 
                    ('Retweet_count', rt_re_count)):
    # Count occurrences and show the five most common values
    c2 = Counter(data)
    prettyprint_counts_modified(label, c2.most_common(5))