# Using APIs to Get Data From the Internet


**API** stands for Application Programming Interface.

An API is a set of instructions that describe how computers can interact with each other to request and receive information.

Some important questions that help us discover and use an API are listed below.

|Question | In technical terms |
|:---------|:--------------------|
|Where is my data? | What is the domain? |
|How do I learn what data is available?| Where is the documentation? |
|How do I request specific data?| How do I formulate a URL for a specific purpose? |
|How do I interpret the data?| What is the structure and format of the output?|



**Let's walk through an example in the browser**

PlaceKitten!

In a browser, go to http://www.placekitten.com

|In technical terms | PlaceKitten |
|:---------|:--------------------|
|What is the domain? | http://www.placekitten.com |
|Where is the documentation?| The documentation is on the home page. |
|How do I formulate a URL for a specific purpose? | Put the size in the URL, like http://www.placekitten.com/width/height |
|What is the structure and format of the output?| It's an image! |
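
The URL pattern in the table can be sketched as a small helper. The function name `kitten_url` is made up for illustration and is not part of any library; it only builds the string and does not contact the site.

```python
BASE_URL = "http://www.placekitten.com"  # domain from the table above


def kitten_url(width, height):
    """Build a PlaceKitten URL for an image of the given size in pixels."""
    # int() accepts both numbers and numeric strings such as "200"
    return f"{BASE_URL}/{int(width)}/{int(height)}"


print(kitten_url(200, 300))
```

Passing the result to `requests.get` would then request an image of that size.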

# Accessing PlaceKitten in Python

We're going to use a special library called <code>requests</code>.

In [None]:
from IPython.display import display, Image  # This line lets you display images. We'll use that in a bit.

# This line lets you use python to download data from the web.
import requests

In [None]:
# Get a 200 by 300 image from placekitten.
r = requests.get('http://www.placekitten.com/200/300')

In [None]:
# Look at the status code
r.status_code
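
The status code is a number defined by the HTTP standard: 200 means the request succeeded, and codes of 400 and above signal an error. As a rough sketch, Python's standard library can translate a code into its standard phrase. The helper name `describe_status` is hypothetical.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code together with its standard HTTP reason phrase."""
    return f"{code} {HTTPStatus(code).phrase}"


for code in (200, 404, 429):
    print(describe_status(code))
```

In the notebook, `describe_status(r.status_code)` would label the response you just fetched.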

In [None]:
# print the content
r.content

In [None]:
# Use the Image function to display the image
display(Image(r.content))

### Exercise 1

Write a function that takes in the width and height and prints an image

In [73]:
# Use input to request width/height and print an image

from IPython.display import display, Image
import requests
def p_image(width, height):
    # An f-string works whether width and height are strings or numbers
    URL = f"http://www.placekitten.com/{width}/{height}"
    r = requests.get(URL)
    display(Image(r.content))
    
width = input("Enter width: ")
height = input("Enter height: ")
p_image(width, height)

Enter width: 500
Enter height: 500


<IPython.core.display.Image object>

### Exercise 2

Can you write a loop to show several images?


In [None]:
from IPython.display import display, Image
import requests
import random
images = input("How many images do you want? ")
for image in range(int(images)):
    URL = "http://www.placekitten.com/" + str(random.randrange(100, 600)) + "/" + str(random.randrange(100, 600))
    r = requests.get(URL)
    display(Image(r.content))

# Example 2: Getting World Times

This example introduces a slightly more complicated API. It also introduces **JSON**, which is a very common data format.

Our API is at http://worldtimeapi.org/

In [None]:
# Download list of time zones
r = requests.get("http://worldtimeapi.org/api/timezone")
print(r.content)

### Exercise 3

Use the .json() function to get the response converted to a dictionary or list

In [None]:
# Use the .json() function to get the response converted to a dictionary or list
# What did it return?
type(r.json())
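
As an illustration of what `.json()` does, the sketch below parses JSON text with Python's built-in `json` module. Both sample strings are made-up, shortened stand-ins for real responses, which are much longer.

```python
import json

# Hypothetical, shortened sample of a JSON array of time zone names
sample_text = '["Africa/Abidjan", "America/Indiana/Indianapolis", "Europe/London"]'

zones = json.loads(sample_text)  # a JSON array becomes a Python list
print(type(zones).__name__, len(zones))

# A JSON object becomes a dictionary instead (made-up sample)
record = json.loads('{"timezone": "Europe/London", "utc_offset": "+01:00"}')
print(record["timezone"])
```

Whether you get a list or a dictionary depends on what the API sends back, which is why checking the type is a useful first step.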

### Exercise 4

Get the time for your time zone

In [88]:
# Your code here
r = requests.get("http://worldtimeapi.org/api/timezone/America/Indianapolis")
r.json()["utc_datetime"]

'2021-05-26T18:02:29.816902+00:00'
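
The value returned above is a string in ISO 8601 format. A minimal sketch of turning it into a `datetime` object follows. The fixed UTC-4 offset is an assumption standing in for Indianapolis daylight time, not something the API call above returned.

```python
from datetime import datetime, timedelta, timezone

utc_text = "2021-05-26T18:02:29.816902+00:00"  # output copied from the cell above

utc_time = datetime.fromisoformat(utc_text)  # timezone-aware datetime in UTC

# Assumed offset for illustration: Eastern Daylight Time is 4 hours behind UTC
local_zone = timezone(timedelta(hours=-4))
local_time = utc_time.astimezone(local_zone)

print(local_time.strftime("%Y-%m-%d %H:%M"))
```

Once the text is a `datetime`, you can compare times or do arithmetic on them instead of working with strings.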

### Exercise 5

Get the time for your IP address

In [94]:
# Get the time for your IP address
r = requests.get("http://worldtimeapi.org/api/ip")
r.json()["utc_datetime"]

'2021-05-26T18:07:38.592773+00:00'

# Example 3: Getting Wikipedia pages

Wikipedia also has an open API, and I want to use it to show one other tip for using the `requests` library: many APIs take a set of parameters, which you can pass as a dictionary.

The documentation for the very extensive API is [here](https://www.mediawiki.org/wiki/API:Main_page). Many of the operations require you to authenticate (which we will cover next), but some things, like getting the content of a page, do not.

For example, the following code gets the recent changes to Wikipedia.

In [None]:
import requests

endpt = 'https://en.wikipedia.org/w/api.php'


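def get_last_pages_changed(n):
    """Return the titles of the n most recently changed articles."""
    params = {'action': 'query',
              'format': 'json',
              'list': 'recentchanges',
              'rcnamespace': '0',  # namespace 0 holds the articles
              'rclimit': n}
    r = requests.get(endpt, params=params)
    #print(r.json()['query']['recentchanges'])
    result = []
    content = r.json()['query']['recentchanges']
    for change in content:  # separate name so the response r is not overwritten
        result.append(change['title'])
    return result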

In [None]:
get_last_pages_changed(20)
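
Behind the scenes, the parameter dictionary is encoded into a query string and appended to the endpoint URL. The sketch below reproduces that step with the standard library so you can see the resulting URL. It illustrates the idea and is not how `requests` is implemented internally.

```python
from urllib.parse import urlencode

endpt = "https://en.wikipedia.org/w/api.php"
params = {"action": "query",
          "format": "json",
          "list": "recentchanges",
          "rcnamespace": "0",
          "rclimit": 20}

# Each key/value pair becomes key=value, and the pairs are joined with "&"
full_url = endpt + "?" + urlencode(params)
print(full_url)
```

Pasting the printed URL into a browser should show the same JSON that `r.json()` parses.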

## Exercise 6

Review the documentation (and Google) to see if you can figure out how to get a list of all of the people who have ever edited the most recently edited Wikipedia page.

In [99]:
import requests

endpt = 'https://en.wikipedia.org/w/api.php'

def get_last_users_changed(n):
    """Return the users behind the n most recent article changes."""
    params = {'action': 'query',
              'format': 'json',
              'list': 'recentchanges',
              'rcnamespace': '0',
              'rclimit': n,
              'rcprop': 'user'}
    r = requests.get(endpt, params=params)
    result = []
    content = r.json()['query']['recentchanges']
    for change in content:  # separate name so the response r is not overwritten
        result.append(change['user'])
    return result

In [103]:
get_last_users_changed(7)

['EN-Jungwon',
 'Zinnober9',
 'Tkbrett',
 'WikiCleanerBot',
 '2409:4040:E9A:1F31:31EB:960C:62E6:AED',
 '159.118.137.49',
 'Saadrafiq4']
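
The function above lists who made the latest changes across Wikipedia. For the question as worded (everyone who edited one particular page), one possible approach is to request that page's revision history with `prop=revisions` and `rvprop=user`, then collect the unique user names. The sketch below only shows the parameters and the parsing step, run on a small made-up response shaped like the API's JSON. The helper names and the sample data are hypothetical, and long histories would also need the API's continuation mechanism, which is not shown.

```python
def revision_params(title):
    """Parameters asking for the users behind a page's revisions."""
    return {"action": "query",
            "format": "json",
            "prop": "revisions",
            "titles": title,
            "rvprop": "user",
            "rvlimit": "max"}


def editors_from_revisions(data):
    """Collect the unique user names found in a revisions response."""
    editors = set()
    for page in data["query"]["pages"].values():
        for revision in page.get("revisions", []):
            if "user" in revision:  # skip revisions without a visible user
                editors.add(revision["user"])
    return sorted(editors)


# Made-up response for illustration only
sample = {"query": {"pages": {"123": {"title": "Example", "revisions": [
    {"user": "Alice"}, {"user": "Bob"}, {"user": "Alice"}]}}}}

print(editors_from_revisions(sample))
```

With network access, `requests.get(endpt, params=revision_params(title)).json()` would supply the real data in place of `sample`.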

# Example 4: Intro to Twitter API

In order to use the Twitter API, you need to do two things:

1. Install tweepy. This is a Python library designed to make it easier to use the API (rather than using `requests` directly). I made [this video](https://www.youtube.com/watch?v=TASX3evcgG4) to walk you through how to install tweepy in Anaconda.

2. To use the Twitter API, you need to be authenticated, and so you need a developer account. [This page](https://wiki.communitydata.science/Intro_to_Programming_and_Data_Science_(Summer_2020)/Twitter_authentication_setup) explains how to get a developer account.

Once you have your keys, you should create a file called `twitter_authentication.py` in the same directory as this file. It should contain the following four lines (replace the fake strings below with the corresponding keys from your twitter account):

```
CONSUMER_KEY = 'zFxMGdKmbo4e72X8Fi2FYr54v'
CONSUMER_SECRET = 'SetuIC9x6zPQXPZrc9cKTph7AMSngUZSf745GXT0QZTrnWeELQ'
ACCESS_TOKEN = '16614440-V09URsqNfP0V0JYZCD65NhpJAcPZ6Wb9A5ar9JrUT'
ACCESS_TOKEN_SECRET = 'oxVSzC1OjXOVVYrBvGyy6XKKe772Jdvvw6Opb3bSLdIb'
```

In general, it is a good practice to keep your keys (which should be secret) separate from your code, which you can share. In this case, we put them in a different file and then import them.
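
Another common way to keep keys out of shared code, different from the file-based approach used in this notebook, is to read them from environment variables. The sketch below shows that alternative. The helper name `load_key` and the variable name `TWITTER_CONSUMER_KEY` are assumptions for illustration, and the value is a placeholder.

```python
import os


def load_key(name):
    """Read one key from an environment variable, failing loudly if absent."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo with a placeholder value; real keys would be set outside the notebook
os.environ["TWITTER_CONSUMER_KEY"] = "placeholder-not-a-real-key"
print(load_key("TWITTER_CONSUMER_KEY"))
```

Either way, the goal is the same: the notebook can be shared without sharing the secrets.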

The following code loads the tweepy library and imports these keys from the `twitter_authentication.py` file, and then prepares to "log in" to your account for the Twitter API.

In [1]:
import tweepy

from twitter_authentication import CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# We then create an api object, based on the auth object created with your credentials
api = tweepy.API(auth, wait_on_rate_limit=True)

## Rate Limiting

You will quickly learn that the Twitter API is "rate limited". This means that they will only let each account make a certain number of calls to their API in a given time period. The default rate is quite low - many endpoints allow only 15 calls per 15 minutes.

You may notice above that we had the code:
```
api = tweepy.API(auth, wait_on_rate_limit=True)
```
The `wait_on_rate_limit=True` argument tells your code to wait until the limit resets (which can take up to 15 minutes) if it gets back a message that you've exceeded a rate limit. This can get annoying when debugging, so be careful with how often you try things. It often makes sense, for example, to request a small amount of data that takes only one call and make sure that your code works before trying to get all of the data.
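
To get a feel for the numbers, the sketch below works out how far apart calls must be spaced to stay under a limit, and how long a loop would have to pause before its next call. The helper names are hypothetical and this is plain arithmetic, not part of tweepy.

```python
def min_interval(max_calls, window_seconds):
    """Seconds between calls that keeps usage within the limit."""
    return window_seconds / max_calls


def seconds_to_wait(last_call, now, interval):
    """How long to pause before the next call (0 if enough time has passed)."""
    return max(0.0, interval - (now - last_call))


interval = min_interval(15, 15 * 60)  # 15 calls per 15 minutes
print(interval)
print(seconds_to_wait(last_call=100.0, now=130.0, interval=interval))
```

In a real loop, `time.sleep(seconds_to_wait(...))` would do the pausing.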

## Timeline

This first example is just to make sure it's working. It should print out the last 10 tweets from your timeline.

In [None]:
# Grab the last 10 tweets
public_tweets = api.home_timeline(count=10)

# And print the text from them
for tweet in public_tweets:
    print(tweet.text)

Each of these `tweet` objects contains lots of additional information. This shows all of the metadata available for the last one we looked at.

In [None]:
tweet._json

You can try to change the `count` argument above, and you'll quickly learn that if you raise it over 200, you will still only get 200 tweets. If you want to print more than 200 tweets, you may need to use a [cursor](http://docs.tweepy.org/en/v3.5.0/cursor_tutorial.html).

This is basically tweepy's clever way of breaking what you want to do into multiple calls to the API.

For example, this call will get 350 tweets. The `count` argument (optional) says how many tweets to get per call, and the argument in `.items()` is how many to get in total.

In [None]:
for tweet in tweepy.Cursor(api.home_timeline, count = 175).items(350):
    print(tweet.text)
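
Conceptually, a cursor keeps requesting pages until it has handed back enough items. The sketch below imitates that idea with a fake page-fetching function so it runs without Twitter access. `fetch_page` and `iterate_items` are hypothetical names, and this is not tweepy's actual implementation.

```python
calls_made = 0


def fetch_page(page_number, count):
    """Stand-in for one API call: returns `count` fake tweet texts."""
    global calls_made
    calls_made += 1
    start = page_number * count
    return [f"tweet {i}" for i in range(start, start + count)]


def iterate_items(count, total):
    """Yield up to `total` items, fetching `count` of them per call."""
    yielded = 0
    page_number = 0
    while yielded < total:
        for item in fetch_page(page_number, count):
            if yielded == total:
                return
            yield item
            yielded += 1
        page_number += 1


tweets = list(iterate_items(count=175, total=350))
print(len(tweets), "tweets from", calls_made, "calls")
```

Each fake call here corresponds to one real API call, which is what counts against the rate limit.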

## Followers

You can also get information about a user, such as who their followers are.

Here's information about me and some of my followers.

In [None]:
user = api.get_user('JuanPab37747124')

print(user.screen_name + " has " + str(user.followers_count) + " followers.")

print("They include these 100 people:")

for follower in user.followers(count=100):
    print(follower.screen_name)

Here is what that user object looks like for my user

In [None]:
user._json

And here's the user object for one of my followers, which is nearly identical.

In [None]:
follower._json

Note that 200 is the maximum number of followers that you can get at one time. If you want to get information about all of a user's followers, you will need to use a cursor. If you are getting many followers, you will almost certainly hit rate limits.

In [None]:
f = []
for follower in tweepy.Cursor(api.followers, screen_name='JuanPab37747124', count=200).items():
    #print(follower.screen_name)
    f.append(follower.screen_name)

In [None]:
print(f)

## Searching

For most of your research, you may be interested in how people are talking about a given topic. There are two main ways to do this.

The first is the search API ([Official Twitter info on the Search API](https://developer.twitter.com/en/docs/tweets/search/overview)). We only have access to "[Standard Search](https://developer.twitter.com/en/docs/tweets/search/overview/standard)", the most limited of Twitter Search API options, which is limited to the last 7 days.


[This page](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets) is the documentation for Standard Search and has some helpful intel about modifying the parameters.

Below is a simple example that gets the last 20 tweets about data science.

In [None]:
public_tweets = api.search('"data science"', count=20)

for tweet in public_tweets:
    print(tweet.user.screen_name + "\t" + str(tweet.created_at) + "\t" + tweet.text)

Note that many of these results are truncated. If you want the full tweet, you actually have to modify the call a little bit, like so.

In [None]:
public_tweets = api.search('"data science"', count=20, tweet_mode='extended')

for tweet in public_tweets:
    print(tweet.user.screen_name + "\t" + str(tweet.created_at) + "\t" + tweet.full_text)

### Additional Search resources

* [Tweepy extended tweets documentation](http://docs.tweepy.org/en/latest/extended_tweets.html)
* [Twitter documentation for crafting queries](https://developer.twitter.com/en/docs/tweets/search/guides/standard-operators). This includes things like how to search by geography or remove retweets.

## Streaming

The other option is to "stream" tweets. Instead of looking backward, this just keeps you connected to Twitter and whenever new tweets come in, they are sent to your program. You would typically just keep the program running and keep writing the data that you want to an external file.

As with the search API, there are some caveats. One is that (I believe) there is no guarantee that this is all of the tweets that match. If you try to filter by very popular terms, then Twitter may give you only a sample of them.

In [None]:
class StreamListener(tweepy.StreamListener):
    def on_status(self, tweet):
        print(tweet.author.screen_name + "\t" + tweet.text)

    def on_error(self, status_code):
        print( 'Error: ' + repr(status_code))
        return False

l = StreamListener()
streamer = tweepy.Stream(auth=auth, listener=l, tweet_mode='extended')

keywords = ['Purdue', '"data science"']
streamer.filter(track = keywords)
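
Since a stream can run for a long time, it helps to write each tweet to a file as it arrives. One common layout is a text file with one JSON object per line. The sketch below shows that idea on made-up dictionaries. The helper names, the file name, and the sample data are all hypothetical. Inside `on_status`, the dictionary could be `tweet._json`.

```python
import json
import os
import tempfile


def append_record(path, record):
    """Append one dictionary to the file as a single line of JSON."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def read_records(path):
    """Read the file back into a list of dictionaries."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Demo in a temporary directory with made-up data
out_path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl")
append_record(out_path, {"user": "example_user", "text": "Hello from Purdue"})
append_record(out_path, {"user": "another_user", "text": "I like data science"})

saved = read_records(out_path)
print(len(saved), saved[0]["user"])
```

Appending line by line means that data already written survives if the stream is interrupted.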

# Exercises


7. Use the streaming API to produce a list of 1000 tweets about a topic.
8. From that list of 1000 tweets, eliminate retweets.
9. For each original tweet, create a dictionary with the number of times you see it retweeted in your dataset.
10. Get a list of the URLs in your dataset.
11. Now, see if you can figure out how to eliminate retweets in the query instead.
12. Get the last 50 tweets from West Lafayette, using the search API. (Hint - look up the geocode information [here](https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets)).
13. Alter the streaming algorithm to include a "locations" filter to get tweets from New York City. You need to use the order sw_lng, sw_lat, ne_lng, ne_lat for the four coordinates instead of a radius as in the search API.

### BONUS Questions
1. For each of your followers, get *their* followers (investigate time.sleep to throttle your computation)
2. Identify the follower you have that also follows the most of your followers.
3. How many users follow you but none of your followers?

In [23]:
# Exercise 7 (search API version). Produce a list of 1000 tweets about a topic.
# Standard search returns at most 100 tweets per call, so the cursor pages through them.
public_tweets = tweepy.Cursor(api.search, '"climate change"', count = 100, tweet_mode = 'extended').items(1000)

list_climate = list()

for tweet in public_tweets:
    list_climate.append((tweet.user.screen_name, str(tweet.created_at), tweet.full_text))

In [105]:
# Exercise 7. Using streaming
class StreamListener(tweepy.StreamListener):
    counter = 0
    num_tweets = 5
    def on_status(self, tweet):
        print(tweet.author.screen_name + "\t" + tweet.text)
        self.counter += 1
        if self.counter == self.num_tweets:
            return False

        return True

    def on_error(self, status_code):
        print( 'Error: ' + repr(status_code))
        return False

l = StreamListener()
streamer = tweepy.Stream(auth=auth, listener=l, tweet_mode='extended')

keywords = ['climate change']
streamer.filter(track = keywords)

Zidouta	RT @JavierBlas: TURNING POINT: Big Oil has suffered a huge defeat today on its climate change strategy, with Exxon, Chevron and Shell (by f…
femmedomtop	RT @LRoordaLaw: Court notes Shell group has enormous impact on climate change through its emissions, larger than most individual countries.…
StarlettDetrick	RT @JuddLegum: BREAKING

In a major victory for climate change activists, "Engine No. 1 has won at least two board seats at Exxon following…
KeneshaBolton	Experts warn of mental health issues triggered by climate change crisis https://t.co/cnt6xU6Y9c
PaulEaton15	RT @Isabella_Kam: BREAKING: Shell's become the 1st company in history held legally liable for contributing to climate change. A Dutch court…


In [26]:
# Exercise 8: From that list of 1000 tweets, eliminate retweets.
list_climate_nort = list()

for tweet in list_climate:
    if not tweet[2].startswith("RT @"):
        list_climate_nort.append(tweet)

len(list_climate_nort)
#list_climate_nort

259

In [None]:
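# Exercise 9: For each original tweet, create a dictionary with
#             the number of times you see it retweeted in your dataset.
# A retweet's text looks like "RT @user: start of the original…" and is
# usually cut short, so a retweet is matched to an original when the
# original starts with the retweet's (truncated) body.
list_climate_dict = {}
for tweet in list_climate_nort:
    list_climate_dict[tweet[2]] = 0

for tweet in list_climate:
    text = tweet[2]
    if text.startswith("RT @") and ": " in text:
        # Drop the "RT @user: " prefix and the trailing ellipsis
        body = text.split(": ", 1)[1].rstrip("…").strip()
        for tweet_dic in list_climate_dict.keys():
            if body and tweet_dic.startswith(body):
                list_climate_dict[tweet_dic] += 1

list_climate_dict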

In [49]:
# Exercise 10. Get a list of the URLs in your dataset
URLs = []
for tweet in list_climate:
    # Split the text on whitespace and keep the words that look like links
    for word in tweet[2].split():
        if word.startswith("http"):
            URLs.append(word)

len(URLs)

320

In [58]:
# Exercise 11. Now, see if you can figure out how to
#              eliminate retweets in the query instead.
public_tweets = tweepy.Cursor(api.search,
                              '"climate change" -filter:retweets',
                              count = 100,
                              tweet_mode = 'extended').items(1000)

list_climate_no_rt = list()

for tweet in public_tweets:
    list_climate_no_rt.append((tweet.user.screen_name, str(tweet.created_at), tweet.full_text))

In [65]:
# Exercise 12. Get the last 50 tweets from West
#              Lafayette, using the search API.
public_tweets = tweepy.Cursor(api.search,
                              geocode="40.452505,-86.914437,4km",
                              count = 200,
                              tweet_mode = 'extended').items(50)

list_climate_wl = list()

for tweet in public_tweets:
    list_climate_wl.append((tweet.user.screen_name, str(tweet.created_at), tweet.full_text))

In [71]:
# Exercise 13: Alter the streaming algorithm to include a
#              "locations" filter to get tweets from New York City.
class StreamListener(tweepy.StreamListener):
    def on_status(self, tweet):
        print(tweet.author.screen_name + "\t" + tweet.text)

    def on_error(self, status_code):
        print( 'Error: ' + repr(status_code))
        return False

l = StreamListener()
streamer = tweepy.Stream(auth=auth, listener=l, tweet_mode='extended')

geobox_ny = [-79.842750, 40.495969, -71.783881, 45.072033]
# Coordinates of NY (state, not city) taken from:
# https://ceb.wikipedia.org/wiki/Plantilya:Geobox_locator_New_York/doc

streamer.filter(locations = geobox_ny)

robotcabeza	My secret wish is to own my own dance company. I was a pro Irish dancer at age 10. Competing in nationals in philly… https://t.co/6wtPaFCZiz
davereachill	Indoor kite flying never took off...
suburbanrealist	@NotebookMUBI @Festival_Cannes gonna take a wild guess that the leading cast is entirely white

give or take a nonBlack POC
DelhiCapitalsOm	@nishantchat You are doing good things consistently!
JimLexa	@CarmenDeFalco @Jurko64 How about @Yankees center fielders? Joe DiMaggio from 1936-51, and Mickey Mantle from 1951-… https://t.co/z41ddSzavK
SadBitch404	@ianthaaaa Girl my lease up end of august im READY
behzadsahifi	@NFTBoT3 https://t.co/ICQojAY0T3
BigJay430	https://t.co/g72wZfISoo
SirWoodsington	🤦🏾‍♂️🤦🏾‍♂️🤦🏾‍♂️
karmascruizin	Just posted a photo @ Karma’s Cruizin’ Cafe https://t.co/YTFqDAnZkQ
GrooveSafe	Extremely disappointed in @GoDaddy our website is down on their end. Today is an important day for us. Lame.
ryanorilio	@smilelearning @JCasaTodd Laurie, I love this!


KeyboardInterrupt: 
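
The bounding box used above covers New York State, as its comment notes. A `locations` box is always four numbers in the order sw_lng, sw_lat, ne_lng, ne_lat. The sketch below builds and checks a box from two corners. The helper name `bounding_box` is hypothetical, and the New York City coordinates are rough approximations chosen for illustration, so check them against a map before relying on them.

```python
def bounding_box(sw_lat, sw_lng, ne_lat, ne_lng):
    """Return [sw_lng, sw_lat, ne_lng, ne_lat] after basic sanity checks."""
    if not (-90 <= sw_lat <= ne_lat <= 90):
        raise ValueError("latitudes must satisfy -90 <= sw_lat <= ne_lat <= 90")
    if not (-180 <= sw_lng <= ne_lng <= 180):
        raise ValueError("longitudes must satisfy -180 <= sw_lng <= ne_lng <= 180")
    return [sw_lng, sw_lat, ne_lng, ne_lat]


# Approximate corners of New York City (assumed values for illustration)
geobox_nyc = bounding_box(sw_lat=40.48, sw_lng=-74.26, ne_lat=40.92, ne_lng=-73.70)
print(geobox_nyc)
```

The resulting list could be passed as `streamer.filter(locations = geobox_nyc)`.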