# DSCI 511: Data acquisition and pre-processing <br>Chapter 3: Acquiring Data from the Internet

## 3.1 APIs
The entire infrastructure of the internet uses three basic ideas as building blocks: clients, servers, and requests. When you visit a website, your web browser is the _client_. In order to display the web page to you, your browser makes a _request_ to a remote _server_. The server then processes the request and sends back the page, which is essentially an HTML document. Your browser can interpret this document, which contains all kinds of information (including styling and presentation-related information and hyperlinks), and display it on your screen.

### 3.1.1 What is an API?
An _Application Programming Interface_ (API) allows you to make a request to a remote server to obtain data instead of a web page. So instead of an HTML document, an API request will usually return data in a format like JSON, CSV, or XML. To review how to load data from these formats, see Chapter 1.

### 3.1.2 Accessing APIs
When you access a web page, your browser makes a request to the remote server. The browser uses a URL address to send the request. Similarly, URLs are also used to send requests to APIs. Usually, constructing the URL needed to send the request is an important early step in working with APIs.
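
As a minimal sketch of URL construction, the helper below joins path segments and query parameters onto a base address using only the standard library. The `build_url` name and the `api.example.com` endpoint are hypothetical and stand in for whatever a real API documents.

```python
from urllib.parse import quote, urlencode


def build_url(base, *path_parts, **params):
    """Join path segments and query parameters onto a base URL (illustrative helper)."""
    # Percent-encode each path segment so that spaces and slashes are safe.
    path = "/".join(quote(str(part), safe="") for part in path_parts)
    url = base.rstrip("/") + ("/" + path if path else "")
    # urlencode turns the keyword arguments into "key=value&key=value".
    if params:
        url += "?" + urlencode(params)
    return url


# Hypothetical endpoint, used only to show the encoding.
url = build_url("https://api.example.com", "stations", "30th Street Station",
                limit=5, format="json")
print(url)
```

When using the `requests` module, passing a dictionary through the `params` argument of `requests.get()` performs the same query-string encoding, so a helper like this is mostly useful for seeing what the final address looks like.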
#### 3.1.2.1 Making Requests and handling JSON responses
When writing Python code to get data from APIs, we'll use the `requests` module to make these requests. The `requests.get()` function can be supplied with a URL and returns a "response" object, which has a very convenient `.json()` method that processes a JSON response and returns the corresponding Python object, here a dictionary (so we don't necessarily have to use the `json` module to de-serialize the response's text). For example, we'll use the GitHub API to grab some data about a user.

In [8]:
import requests
import json
# format: "https://api.github.com/users/USERNAME"
response = requests.get("https://api.github.com/users/jakerylandwilliams")

# equivalent, using the json module: json.loads(response.text)
response.json()

{'login': 'jakerylandwilliams',
 'id': 4721029,
 'node_id': 'MDQ6VXNlcjQ3MjEwMjk=',
 'avatar_url': 'https://avatars.githubusercontent.com/u/4721029?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/jakerylandwilliams',
 'html_url': 'https://github.com/jakerylandwilliams',
 'followers_url': 'https://api.github.com/users/jakerylandwilliams/followers',
 'following_url': 'https://api.github.com/users/jakerylandwilliams/following{/other_user}',
 'gists_url': 'https://api.github.com/users/jakerylandwilliams/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/jakerylandwilliams/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/jakerylandwilliams/subscriptions',
 'organizations_url': 'https://api.github.com/users/jakerylandwilliams/orgs',
 'repos_url': 'https://api.github.com/users/jakerylandwilliams/repos',
 'events_url': 'https://api.github.com/users/jakerylandwilliams/events{/privacy}',
 'received_events_url': 'https://api.github.com/user

GitHub's API doesn't require an "access token" to return this information about a user. There are thousands of public APIs available on the web that can help you get useful data, and many of them can be used as simply as this GitHub example.
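
To make the de-serialization step concrete without a network call, here is a small sketch that parses a shortened copy of the response text shown above. Only the `login` and `id` fields are kept, and the `nickname` field is a made-up name used to show what happens when a key is absent.

```python
import json

# A shortened copy of the response text shown above (two fields only).
sample_text = '{"login": "jakerylandwilliams", "id": 4721029}'

# json.loads turns JSON text into native Python objects;
# response.json() does the same job for a response body.
user = json.loads(sample_text)
print(type(user).__name__)

# Pull out a few fields. dict.get() returns None for a missing key
# instead of raising a KeyError ("nickname" is a hypothetical field).
wanted = ["login", "id", "nickname"]
summary = {key: user.get(key) for key in wanted}
print(summary)
```

In real code it is also worth checking `response.status_code` (or calling `response.raise_for_status()`) before parsing, since an error response may not contain the fields you expect.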
#### 3.1.2.2 A more local example of an API
The Southeastern Pennsylvania Transportation Authority (SEPTA) [makes a few APIs available](http://www3.septa.org/hackathon/). Some of these APIs can be used to access real-time data about SEPTA transit (trains, buses, trolleys). For example, we can request data about the next trains to arrive at a given station.

In [9]:
# format: "http://www3.septa.org/hackathon/Arrivals/*STATION_NAME*/*NUMBER_OF_TRAINS*"
arrivals_response = requests.get("http://www3.septa.org/hackathon/Arrivals/30th Street Station/5")

arrivals_dict = arrivals_response.json()
arrivals_dict

{'Gray 30th Street Departures: October 31, 2023, 9:55 am': [{'Northbound': [{'direction': 'N',
     'path': 'R4N',
     'train_id': '420',
     'origin': 'Airport Terminal E-F',
     'destination': 'Warminster',
     'line': 'Warminster',
     'status': 'On Time',
     'service_type': 'LOCAL',
     'next_station': 'Penn Medical Station',
     'sched_time': '2023-10-31 09:59:01.000',
     'depart_time': '2023-10-31 10:00:00.000',
     'track': '5',
     'track_change': None,
     'platform': '',
     'platform_change': None},
    {'direction': 'N',
     'path': 'R5N',
     'train_id': '6508',
     'origin': 'Gray 30th Street',
     'destination': 'Doylestown',
     'line': 'Lansdale/Doylestown',
     'status': 'On Time',
     'service_type': 'LOCAL',
     'next_station': None,
     'sched_time': '2023-10-31 10:11:01.000',
     'depart_time': '2023-10-31 10:12:00.000',
     'track': '1',
     'track_change': None,
     'platform': '',
     'platform_change': None},
    {'direction': 'N',

#### 3.1.2.3 Exercise: processing a JSON response
Make a request to the SEPTA Arrivals API to get data on the next 10 trains to arrive at Suburban Station. Store this JSON-format data into a dictionary. Inspect the dictionary structure. Then, write code to create a list containing 10 dictionaries, one for each train. These new dictionaries should look like this:

In [10]:
from pprint import pprint
# example of train dictionary format
train_dict = {
    'direction': 'S',
     'line': 'Media/Elwyn',
     'sched_time': '2018-08-22 17:31:01.000',
     'status': 'On Time',
     'track': '6'
}

pprint(train_dict)

{'direction': 'S',
 'line': 'Media/Elwyn',
 'sched_time': '2018-08-22 17:31:01.000',
 'status': 'On Time',
 'track': '6'}


__Luke's Notes:__
When iterating through a nested structure, get the top keys using the following code:

In [11]:
# Iterating through a nested structure.
arrivals_response = requests.get("http://www3.septa.org/hackathon/Arrivals/Suburban Station/10")
arrivals_dict = arrivals_response.json()
list(arrivals_dict.keys())
# In this case we only have one key, which labels the most recent departures.

# Listing the values
list(arrivals_dict.values())
# We have another nested structure. Here we have a secondary key called 'Northbound' which indicates the direction of the trains.
# Since we have a series of nests here, we'll convert to a list to get the underlying values for the trains
list(arrivals_dict.values())[0][0]['Northbound']


# From here, just loop over the trains and use the keys to build the new dictionaries!

[{'direction': 'N',
  'path': 'R3N',
  'train_id': '9312',
  'origin': 'Wawa',
  'destination': 'Temple U',
  'line': 'West Trenton',
  'status': 'On Time',
  'service_type': 'LOCAL',
  'next_station': 'Suburban Station',
  'sched_time': '2023-10-31 09:56:00.000',
  'depart_time': '2023-10-31 09:57:00.000',
  'track': '2',
  'track_change': None,
  'platform': 'A',
  'platform_change': None},
 {'direction': 'N',
  'path': 'R4N',
  'train_id': '420',
  'origin': 'Airport Terminal E-F',
  'destination': 'Warminster',
  'line': 'Warminster',
  'status': 'On Time',
  'service_type': 'LOCAL',
  'next_station': 'Penn Medical Station',
  'sched_time': '2023-10-31 10:04:00.000',
  'depart_time': '2023-10-31 10:05:00.000',
  'track': '2',
  'track_change': None,
  'platform': 'A',
  'platform_change': None},
 {'direction': 'N',
  'path': 'R5N',
  'train_id': '6508',
  'origin': 'Gray 30th Street',
  'destination': 'Doylestown',
  'line': 'Lansdale/Doylestown',
  'status': 'On Time',
  'service_

In [12]:
# One possible solution (Northbound trains only)

arrivals_response = requests.get("http://www3.septa.org/hackathon/Arrivals/Suburban Station/10")

arrivals_dict = arrivals_response.json()
# Drill down: first (only) top-level value -> first direction group -> its list of trains
northbound_trains = list(arrivals_dict.values())[0][0]['Northbound']
trains = []

# Keep only the five fields of interest for each train
for train in northbound_trains:
    trains_dict = {'direction': train['direction'],
                   'line': train['line'],
                   'sched_time': train['sched_time'],
                   'status': train['status'],
                   'track': train['track']}
    trains.append(trains_dict)

# Show the last dictionary built; the full list is stored in `trains`
pprint(trains_dict)

{'direction': 'N',
 'line': 'Media/Wawa',
 'sched_time': '2023-10-31 10:56:00.000',
 'status': 'On Time',
 'track': '1'}
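
The solution above reads only the `'Northbound'` group. As a hedged sketch, the function below walks every direction group instead of naming one. It assumes the response keeps the shape seen above (one title key whose value is a list of single-key dictionaries) and that a `'Southbound'` group, if present, has the same layout. The sample data is assembled by hand from values printed earlier in this chapter and is not a real response.

```python
KEEP = ("direction", "line", "sched_time", "status", "track")


def flatten_arrivals(arrivals):
    """Collect the fields in KEEP for every train in every direction group."""
    trains = []
    for groups in arrivals.values():           # one value per title key
        for group in groups:                    # e.g. {"Northbound": [...]}
            for train_list in group.values():   # the list of train dictionaries
                for train in train_list:
                    trains.append({key: train.get(key) for key in KEEP})
    return trains


# Hand-built sample in the assumed shape (not live data).
sample = {
    "Suburban Station Departures: example": [
        {"Northbound": [{"direction": "N", "train_id": "420", "line": "Warminster",
                         "sched_time": "2023-10-31 10:04:00.000",
                         "status": "On Time", "track": "2"}]},
        {"Southbound": [{"direction": "S", "line": "Media/Elwyn",
                         "sched_time": "2018-08-22 17:31:01.000",
                         "status": "On Time", "track": "6"}]},
    ]
}

trains = flatten_arrivals(sample)
for train in trains:
    print(train)
```

Iterating over `.values()` at each level avoids hard-coding the title key, which changes with every request because it contains a timestamp.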


#### 3.1.2.4 Open geo-location data
Geolocation data can also be found in JSON format. The [OpenStreetMap (OSM) API](https://wiki.openstreetmap.org/wiki/Main_Page) can be used to request geographic data. Usually, map data is stored in the form of polygons, shapes with vertices consisting of latitude-longitude points. For example, we can obtain the polygon for Philadelphia from OSM like this:

In [13]:
response = requests.get("https://nominatim.openstreetmap.org/search.php?q=Philadelphia+Pennsylvania&polygon_geojson=1&format=json")

pprint(response.json())

[{'addresstype': 'city',
  'boundingbox': ['39.8670050', '40.1379593', '-75.2802977', '-74.9558314'],
  'class': 'boundary',
  'display_name': 'Philadelphia, Pennsylvania, United States',
  'geojson': {'coordinates': [[[-75.2802977, 39.9750019],
                               [-75.2802246, 39.9748885],
                               [-75.280192, 39.974835],
                               [-75.28013, 39.974735],
                               [-75.280085, 39.974624],
                               [-75.280044, 39.97454],
                               [-75.280027, 39.974502],
                               [-75.279955, 39.974424],
                               [-75.279805, 39.97432],
                               [-75.2796, 39.974185],
                               [-75.279465, 39.974122],
                               [-75.279359, 39.974072],
                               [-75.279242, 39.974031],
                               [-75.27917, 39.973986],
                              

### 3.1.3 Working with CSV responses
Sometimes API responses can be in CSV format, too. For example, the schedule data API for Center City regional arrivals by SEPTA returns CSVs. Since a CSV file is really just a text file, we can read the text from the response using a CSV reader. In the output below, which shows the schedule data displayed [here](http://www3.septa.org/ccstations/30th/) in CSV format, notice that the first line and the last two lines are not part of the table, but rather messages and timestamps. Further, the CSV does not appear to have been written in the usual one-entry-per-line format. Instead, entries for trains on the same line are joined together. Irregularities like this are easy to miss, but can end up breaking your code.

In [14]:
import csv
# format: "http://www3.septa.org/ccstations/STATION/sched_data.csv", acceptable values for STATION are "me", "ss", and "30th"
schedule_response = requests.get("http://www3.septa.org/ccstations/30th/sched_data.csv")
schedule_text = schedule_response.text.strip().split("\n") # removing leading and trailing spaces and splitting lines
schedule_reader = csv.reader(schedule_text)
schedule = list(schedule_reader)
pprint(schedule)

[["EMG=' No Emg Message"],
 ['R4S=09:59',
  'Airport',
  '6',
  'ON TIME',
  'LOCAL                    ',
  '8423  ',
  '<_NEXT_MSG>10:07',
  '30th St Gray',
  '',
  'ON TIME',
  'LOCAL                    ',
  '6477  ',
  '<_NEXT_MSG>10:29',
  'Airport',
  '6',
  'ON TIME',
  'LOCAL                    ',
  '425   ',
  '<_NEXT_MSG>10:59',
  'Airport',
  '6',
  'ON TIME',
  'LOCAL                    ',
  '8427  ',
  ''],
 ['R4N=10:00',
  'Warminster',
  '5',
  'ON TIME',
  'LOCAL                    ',
  '420   ',
  '<_NEXT_MSG>11:00',
  'Warminster',
  '5',
  'ON TIME',
  'LOCAL                    ',
  '424   ',
  '<_NEXT_MSG>12:00',
  'Warminster',
  '5',
  'ON TIME',
  'LOCAL                    ',
  '428   ',
  '<_NEXT_MSG>01:00',
  'Warminster',
  '5',
  'ON TIME',
  'LOCAL                    ',
  '432   ',
  ''],
 ['R2S=10:32',
  'Wilmington',
  '6',
  ' 4 LATE',
  'LOCAL                    ',
  '3209  ',
  '<_NEXT_MSG>12:39',
  'Wilmington',
  '6',
  'ON TIME',
  'LOCAL             
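
One way to untangle the joined entries is to treat each `<_NEXT_MSG>` cell as the start of a new train. The sketch below does this for a single parsed row copied (and shortened to two trains) from the output above. The field names in `FIELDS` are guesses based on the values, not names documented by SEPTA.

```python
MARKER = "<_NEXT_MSG>"
# Assumed meaning of each cell, inferred from the values in the output.
FIELDS = ("time", "destination", "track", "status", "service", "train_id")


def split_trains(row):
    """Split one parsed CSV row into a list of per-train dictionaries."""
    # The first cell looks like "R4S=09:59": a path code and the first time.
    path, _, first_time = row[0].partition("=")
    cells = [first_time] + [cell.strip() for cell in row[1:]]

    trains, current = [], []
    for cell in cells:
        if cell.startswith(MARKER):
            # A marker cell closes the current train and opens the next one.
            trains.append(current)
            current = [cell[len(MARKER):]]
        else:
            current.append(cell)
    trains.append(current)

    # zip() stops at the shorter input, which drops the trailing empty cell.
    return [dict(zip(FIELDS, train), path=path) for train in trains]


row = ['R4S=09:59', 'Airport', '6', 'ON TIME', 'LOCAL                    ', '8423  ',
       '<_NEXT_MSG>10:07', '30th St Gray', '', 'ON TIME',
       'LOCAL                    ', '6477  ', '']

records = split_trains(row)
for record in records:
    print(record)
```

Rows that are not train entries, such as the leading `EMG=` message line, would need to be filtered out before calling a function like this.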

### 3.1.4 Working with XML Responses

The Wikipedia API ([docs](https://en.wikipedia.org/api/rest_v1/)) returns results for various types of calls, notably articles, in HTML format. Processing an HTML response into a usable form is more complex than processing JSON: like XML, the content won't necessarily translate into a native Python data type. To work with HTML, we'll want to use a module called `BeautifulSoup` (install using `pip3 install bs4`), which we'll discuss in detail when we get to web scraping in __Chapter 5__. But to explore, we can request the article for Philadelphia by constructing a search query and attempt to process it with `xmltodict`, as introduced in __Chapter 1__.

In [15]:
response = requests.get("https://en.wikipedia.org/api/rest_v1/page/html/Philadelphia Pennsylvania")
    
pprint(response.text[:10000])

('<!DOCTYPE html>\n'
 '<html prefix="dc: http://purl.org/dc/terms/ mw: http://mediawiki.org/rdf/" '
 'about="https://en.wikipedia.org/wiki/Special:Redirect/revision/1182157691"><head '
 'prefix="mwr: https://en.wikipedia.org/wiki/Special:Redirect/"><meta '
 'property="mw:TimeUuid" content="2653fa40-75cc-11ee-b712-cdba7e320d1d"/><meta '
 'charset="utf-8"/><meta property="mw:pageId" content="50585"/><meta '
 'property="mw:pageNamespace" content="0"/><link rel="dc:replaces" '
 'resource="mwr:revision/1181424676"/><meta property="mw:revisionSHA1" '
 'content="79df3813a6c6f25b4d30ca9fa3e8134076a190ae"/><meta '
 'property="dc:modified" content="2023-10-27T14:09:43.000Z"/><meta '
 'property="mw:htmlVersion" content="2.8.0"/><meta property="mw:html:version" '
 'content="2.8.0"/><link rel="dc:isVersionOf" '
 'href="//en.wikipedia.org/wiki/Philadelphia"/><base '
 'href="//en.wikipedia.org/wiki/"/><title>Philadelphia</title><meta '
 'property="mw:jsConfigVars" '
 'content=\'{"wgKartographerLiveDa

Since HTML closely resembles XML (both are markup languages built from nested tags), we can try to convert the response into a dictionary using `xmltodict`. As you can see, in order to get the desired data out of the HTML, some studying of its structure is necessary. And since the structure of an HTML document is so particular, it will turn out to be a bit easier to work with the HTML-specific parser in `BeautifulSoup` (__Chapter 5__). However, as we can see, the content is there and navigable!

In [18]:
import xmltodict

parsed = xmltodict.parse(response.text)
pprint(parsed['html']['body']['section'][0])

{'@data-mw-section-id': '0',
 '@id': 'mwAQ',
 'div': [{'#text': 'Largest city in Pennsylvania, United States',
          '@about': '#mwt1',
          '@class': 'shortdescription nomobile noexcerpt noprint searchaux',
          '@data-mw': '{"parts":[{"template":{"target":{"wt":"Short '
                      'description","href":"./Template:Short_description"},"params":{"1":{"wt":"Largest '
                      'city in Pennsylvania, United States"}},"i":0}}]}',
          '@id': 'mwAg',
          '@style': 'display:none',
          '@typeof': 'mw:Transclusion'},
         {'#text': '"Philly" redirects here. For other uses, see  and .',
          '@about': '#mwt2',
          '@class': 'hatnote navigation-not-searchable',
          '@id': 'mwBQ',
          '@role': 'note',
          'a': [{'#text': 'Philly (disambiguation)',
                 '@class': 'mw-disambig',
                 '@href': './Philly_(disambiguation)',
                 '@rel': 'mw:WikiLink',
                 '@title': 'P
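
XML parsing works here because this particular page happens to be well formed. As a rough illustration of why an HTML-specific parser is the safer choice in general, the sketch below uses the standard library's `html.parser` (as a stand-in until `BeautifulSoup` is introduced) to collect link targets. The snippet is adapted from the output above, with an unclosed `<br>` added on purpose, and `LinkCollector` is a name made up for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Adapted from the hatnote above; the unclosed <br> is added for illustration.
snippet = (
    '<div role="note">"Philly" redirects here. For other uses, see '
    '<a rel="mw:WikiLink" href="./Philly_(disambiguation)" class="mw-disambig">'
    'Philly (disambiguation)</a>.<br></div>'
)

collector = LinkCollector()
collector.feed(snippet)
print(collector.links)

# A strict XML parser rejects the same snippet because <br> is never closed.
try:
    ET.fromstring(snippet)
    xml_ok = True
except ET.ParseError:
    xml_ok = False
print(xml_ok)
```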

## 3.2 API authentication
The GitHub and SEPTA examples we've looked at so far are APIs that don't require any authentication to access. Anyone can send a request to these APIs and receive a response. There are, however, quite a few APIs that require the user to have some authentication. This authentication usually takes the form of an access token that needs to be obtained from the API provider before making requests.
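
As a small sketch of how a token typically ends up in a request, the helper below appends it as a query parameter and reads it from an environment variable when it is not passed in directly, which keeps the secret out of the notebook. The `with_api_key` name, the `MY_API_KEY` variable, and the `api.example.com` address are all hypothetical, and the `api_key` parameter name is only one common convention.

```python
import os
from urllib.parse import urlencode


def with_api_key(url, key=None, env_var="MY_API_KEY", param="api_key"):
    """Return url with an access token appended as a query parameter."""
    # Prefer an explicit key; otherwise fall back to the environment.
    key = key or os.environ.get(env_var)
    if not key:
        raise ValueError(f"No API key given and {env_var} is not set")
    # Use "&" if the URL already has a query string, "?" otherwise.
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode({param: key})


address = with_api_key("https://api.example.com/v1/schedule.json", key="YOUR_KEY")
print(address)
```

Some providers expect the token in a request header instead. With `requests`, that is done through the `headers` argument of `requests.get()`, and the provider's documentation states which header to use.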

### 3.2.1 Example: Sportradar
As an example, we'll take a look at one of the [Sportradar APIs](https://developer.sportradar.com). Sportradar has APIs for a number of different sports.

In order to use any of their APIs, Sportradar requires you to open a developer account and register an app. Only then are you granted an access token or "API key", which you must plug in to any requests you make.

The steps to obtain an API key from Sportradar are:
1. Sign up as a developer [here](https://developer.sportradar.com/member/register)
2. Sign in to your account and go to your [account page](https://developer.sportradar.com/member/my-account)
3. Go to your [applications page](https://developer.sportradar.com/apps/myapps)
4. Register a new application and select the API keys you need

We'll use the Sportradar Soccer API to obtain the match schedule for an English soccer team, Liverpool FC.

First, we'll construct the request address using the API key.

In [None]:
soccer_key = "xy5kwnrtm4mm6vru9w9rr7r6"

In [None]:
# format: "https://api.sportradar.com/soccer-advanced-analytics/trial/v1/en/competitors/COMPETITOR_ID/schedules.json?api_key=API_KEY"
address = "https://api.sportradar.com/soccer-advanced-analytics/trial/v1/en/competitors/sr:competitor:44/schedules.json?api_key=" + soccer_key

In [None]:
resp = requests.get(address)

In [None]:
result = resp.json()

In [None]:
pprint(result)

{'generated_at': '2023-10-24T00:24:32+00:00',
 'schedules': [{'sport_event': {'channels': [{'country': 'Italy',
                                              'country_code': 'ITA',
                                              'name': 'Sky Sport Uno HD - Hot '
                                                      'Bird 1/2/3/4/6 (13.0E)',
                                              'url': 'https://guidatv.sky.it/sport-e-calcio?vista=griglia'},
                                             {'country': 'Romania',
                                              'country_code': 'ROU',
                                              'name': 'Digi Sport 1 RO - Thor '
                                                      '2/3 (1.0W)',
                                              'url': 'http://www.digisport.ro/Program/'},
                                             {'country': 'Great Britain',
                                              'country_code': 'GBR',
                                

Inspecting the results, we can see that the schedule can be found under the "schedules" key of the top-level dictionary. It is a list of matches, with each match encoded in a dictionary.

In [None]:
schedule = result["schedules"]
print(type(schedule))
print(len(schedule))

<class 'list'>
59


The competitors in each match look like this:

In [None]:
pprint(schedule[0]['sport_event']['competitors'])

[{'abbreviation': 'LIV',
  'country': 'England',
  'country_code': 'ENG',
  'gender': 'male',
  'id': 'sr:competitor:44',
  'name': 'Liverpool FC',
  'qualifier': 'home'},
 {'abbreviation': 'EVE',
  'country': 'England',
  'country_code': 'ENG',
  'gender': 'male',
  'id': 'sr:competitor:48',
  'name': 'Everton FC',
  'qualifier': 'away'}]


#### 3.2.1.1 Exercise: accessing a soccer schedule

Make a request to the Sportradar Soccer schedule API to obtain the match schedule for Liverpool FC (team_id = sr:competitor:44). Then, from the obtained schedule, make a simple list of fixtures. Your output should be a list with strings as elements. The strings should be of the format "HOME_TEAM vs AWAY_TEAM".

In [None]:
schedule[0]['sport_event']['competitors']

[{'id': 'sr:competitor:44',
  'name': 'Liverpool FC',
  'country': 'England',
  'country_code': 'ENG',
  'abbreviation': 'LIV',
  'qualifier': 'home',
  'gender': 'male'},
 {'id': 'sr:competitor:48',
  'name': 'Everton FC',
  'country': 'England',
  'country_code': 'ENG',
  'abbreviation': 'EVE',
  'qualifier': 'away',
  'gender': 'male'}]

In [None]:

match=schedule[0]['sport_event']['competitors']

match[0]['qualifier']

'home'

In [None]:
# Build the fixture string for the first match only
match = schedule[0]['sport_event']['competitors']
fixtures = []
home_team = ''
away_team = ''
for team in match:
    # Each competitor is tagged as either the home or the away side
    if team['qualifier'] == 'home':
        home_team = team['name']
    elif team['qualifier'] == 'away':
        away_team = team['name']

# Join the names once both sides have been identified
fixtures.append(home_team + ' v.s ' + away_team)
print(fixtures)

['Liverpool FC v.s Everton FC']
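
The cell above handles only the first match. A sketch of the full exercise, which loops over every match and uses the requested "HOME_TEAM vs AWAY_TEAM" format, could look like the following. It runs on a tiny hand-made `sample_schedule` rather than the live response. The first entry mirrors the competitors shown above, and the second entry is invented so the list has more than one element.

```python
def fixture_string(match):
    """Return 'HOME_TEAM vs AWAY_TEAM' for one match dictionary."""
    competitors = match["sport_event"]["competitors"]
    # Map each qualifier ('home' or 'away') to the team name.
    names = {team["qualifier"]: team["name"] for team in competitors}
    return names["home"] + " vs " + names["away"]


# Hand-made stand-in for result["schedules"]; the second match is invented.
sample_schedule = [
    {"sport_event": {"competitors": [
        {"name": "Liverpool FC", "qualifier": "home"},
        {"name": "Everton FC", "qualifier": "away"},
    ]}},
    {"sport_event": {"competitors": [
        {"name": "Everton FC", "qualifier": "home"},
        {"name": "Liverpool FC", "qualifier": "away"},
    ]}},
]

fixtures = [fixture_string(match) for match in sample_schedule]
print(fixtures)
```

With the real data, the same list comprehension would run over `schedule` instead of `sample_schedule`.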


## 3.3 Big Tech APIs

Large tech companies make a variety of APIs available for use. Most of these require authentication and can be used to access some very useful data. 

### 3.3.1 Facebook

Facebook has an API called the Graph API that allows a developer to access data about posts, comments, users and more. However, one of the major sticking points of working with APIs from companies like Facebook is that these APIs change very frequently, and sometimes the changes can break a developer's code. Facebook changed their policy towards applications some time ago, and as a result, a developer simply working on a research project is not allowed access to any data from the Graph API. In order to gain data access, all developers must register their application and go through a review process with Facebook. Only then is data collection allowed through the API. These barriers make it hard to discuss and work with the Graph API. 

### 3.3.2 Google

Similarly, Google Maps has a very powerful API that can be used to obtain a large variety of data. However, it is meant to be used as a licensed resource for which developers pay Google a fee. Some of Google's tools may be used for free by obtaining a \$200 API credit, but this still requires setting up billing. If you are interested in these tools, take a look [here](https://cloud.google.com/maps-platform/pricing/).

An important consideration when working with these APIs is that companies usually enforce a "rate limit". This means a developer is only allowed to make calls under a certain frequency. When you are working with an API from a commercial entity, make sure to check on their rate limit.
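
One simple way to stay under a rate limit is to enforce a minimum pause between calls. The sketch below is an illustration only: the `Throttle` class is a made-up helper, and the 0.05-second interval is chosen so the demo runs quickly rather than to match any provider's actual limit.

```python
import time


class Throttle:
    """Enforce a minimum delay between successive calls (illustrative)."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._last_call = None

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the call."""
        now = time.monotonic()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()


throttle = Throttle(min_interval=0.05)
start = time.monotonic()
for _ in range(3):
    throttle.wait()
    # a request to the API would go here
elapsed = time.monotonic() - start
print(round(elapsed, 2))
```

The first call goes through immediately and each later call waits for the remainder of the interval, so three calls take at least two intervals in total.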

### 3.3.3 Twitter

A big-tech API we can demonstrate and play around with is the Twitter API. While the API can be accessed in a barebones way using tools such as the `requests` module, we can also use an API client, which is a third-party library that allows us to easily work with an API by automating and simplifying low-level tasks.

For Twitter, we'll use a client called Twython (`pip3 install Twython` to install). First, we'll need to follow these steps to obtain API access and authentication: 

1. Sign up for a Twitter account
2. Sign in to [https://apps.twitter.com](https://apps.twitter.com)
3. Create an app
4. Go to the API Keys section and click "Generate ACCESS TOKEN".

The resulting keys are:
- "oauth_access_token"
- "oauth_access_token_secret"
- "consumer_key"
- "consumer_secret"

We'll have to save these values in some variables that we'll need:

In [None]:
access_token = ''
access_token_secret = ''
consumer_key = ''
consumer_secret = ''

Now, we'll create a `twitter` object using our consumer key and secret and use this object to download a [list of tweets](https://www.buzzfeed.com/danieldalton/epic-tweet-bro?utm_term=.shgZJEe8V#.lywBJEpjG) using their unique IDs.

In [None]:
from twython import Twython

twitter = Twython(consumer_key, consumer_secret)

IDlist = [
    "1121915133", 
    "64780730286358528", 
    "64877790624886784", 
    "20", 
    "467192528878329856", 
    "474971393852182528",
    "475071400466972672",
    "475121451511844864",
    "440322224407314432",
    "266031293945503744",
    "3109544383",
    "1895942068",
    "839088619",
    "8062317551",
    "232348380431544320",
    "286910551899127808",
    "286948264236945408",
    "27418932143",
    "786571964",
    "467896522714017792",
    "290892494152028160",
    "470571408896962560"
]

for ID in IDlist:
    status = twitter.show_status(id = ID)
    print(status["text"])

http://twitpic.com/135xa - There's a plane in the Hudson. I'm on the ferry going to pick up the people. Crazy.
Helicopter hovering above Abbottabad at 1AM (is a rare event).
So I'm told by a reputable person they have killed Osama Bin Laden. Hot damn.
just setting up my twttr
India has won! भारत की विजय। अच्छे दिन आने वाले हैं।
We can neither confirm nor deny that this is our first tweet.
Thank you for the @Twitter welcome! We look forward to sharing great #unclassified content with you.
@CIA We look forward to sharing great classified info about you http://t.co/QcdVxJfU4X https://t.co/kcEwpcitHo More: https://t.co/PEeUpPAt7F
If only Bradley's arm was longer. Best photo ever. #oscars http://t.co/C9U5NOtGap
Four more years. http://t.co/bAJE6Vom
Facebook turned me down. It was a great opportunity to connect with some fantastic people. Looking forward to life's next adventure.
Got denied by Twitter HQ. That's ok. Would have been a long commute.
Are you ready to celebrate?  Well, get ready

Next, we'll grab a user's timeline (Drexel University in this case) and print out the last 10 tweets by them:

In [None]:
drexel_twitter = twitter.get_user_timeline(screen_name = "drexeluniv")

for tweet in drexel_twitter[:10]:
    print(tweet["text"])
    print()

Congratulations to Johnny Zhu, who won first place in the “@ Play” category with his photo “Standing on the Shoulde… https://t.co/19TpUmJGZy

Congratulations to Michelle Kim, who won first place in the “@ Work” category with her photo “Go Metro to the LA Op… https://t.co/k2r0VZtu2V

Thank you to all who participated in the @Steinbright Co-op Photo Contest and sharing your incredible #Drexel co-op… https://t.co/rAkzaAzr6d

Happy Tuesday! Don't forget to swing by EarthFest today between 11:30 a.m. and 1:30 p.m. on Lancaster Walk! 🌎 https://t.co/6oFjgkqdM7

RT @DrexelNow: It's #EarthDay today. Here's how @DrexelUniv is dedicated to transforming its campus into a sustainability leader: https://t…

It’s World Dragon Week at Drexel! 🌎🐉 Join us in celebrating the cultural diversity represented on our campus with e… https://t.co/ImXKWwcmGG

Happy #EarthDay Dragons! Each year, Drexel celebrates its commitment to environmental sustainability with our annua… https://t.co/zMsRNmzIzc

Don't forget 

#### 3.3.3.1 Exercise: access some accidental haikus from Twitter's REST API
Create your Twitter API keys and download the last 15 tweets by @accidental575 (the hilarious Accidental Haiku Bot).

In [None]:
# code goes here

#### 3.3.3.2 Complex queries with filtering
We can use Twitter's Search API to grab some tweets about a particular topic:

In [None]:
results = twitter.search(q = '"data science"')
for tweet in results["statuses"]:
    print(tweet["text"])
    print()

https://t.co/br4slRoeTj https://t.co/NSZNAhE9zP

RT @DD_Serena_: Visual Data Storytelling with Tableau (Addison-Wesley Data &amp; Analytics Series) https://t.co/dfjkWwVP8k #Tableau

RT @DD_FaFa_: Visual Data Storytelling with Tableau (Addison-Wesley Data &amp; Analytics Series) https://t.co/w6gvd9Zu5f #Tableau

RT @dataquestio: Dive into our new Data Analyst in R #rstats path:

- Interactive coding in your browser
- No videos to sit through
- No pr…

RT @KirkDBorne: New #DataScience 10-page (PDF) Cheat Sheet covers basic concepts in probability, #statistics, statistical learning, #Machin…

RT @DataRubix: Such valid insights by @schmarzo !! The role of critical thinking in #DataScience https://t.co/2YosCuhSZy #DataDriven #DataA…

An interesting list of #DataScience and #ML resources
https://t.co/YMbiC7CUER

RT @schmarzo: 3 #DesignCanvases that link your business model to #DataScience and #ML initiatives and give #DataScientists direct line of s…

Enterprise Products is hiring in #Housto

#### 3.3.3.3 Twitter's streaming APIs
The Streaming API lets us collect a portion of all the tweets currently being posted, in real time. This is more complicated than collecting old tweets, because it means essentially downloading an endless stream of data. Let's say we want to collect 30 tweets and stop. We need to write a class based on `TwythonStreamer` that has this failsafe built in, while also only collecting English tweets:

In [None]:
from twython import TwythonStreamer

tweets = []

class Streamer(TwythonStreamer):
    
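    def on_success(self, data):
        
        # Not every stream message is a tweet, so use .get() to avoid a KeyError
        if data.get("lang") == "en":
            tweets.append(data)
            print("Received tweet #" + str(len(tweets)))
            
        # Failsafe: stop streaming once 30 tweets have been collected
        if len(tweets) >= 30:
            self.disconnect()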
            
    def on_error(self, status_code, data):
        print(status_code, data)
        self.disconnect()

Next, we need to create a `Streamer` object and use it to collect tweets. Let's say we want to collect tweets with the keyword "science":

In [None]:
stream = Streamer(consumer_key, consumer_secret, access_token, access_token_secret)

stream.statuses.filter(track = "science")

Received tweet #1
Received tweet #2
Received tweet #3
Received tweet #4
Received tweet #5
Received tweet #6
Received tweet #7
Received tweet #8
Received tweet #9
Received tweet #10
Received tweet #11
Received tweet #12
Received tweet #13
Received tweet #14
Received tweet #15
Received tweet #16
Received tweet #17
Received tweet #18
Received tweet #19
Received tweet #20
Received tweet #21
Received tweet #22
Received tweet #23
Received tweet #24
Received tweet #25
Received tweet #26
Received tweet #27
Received tweet #28
Received tweet #29
Received tweet #30


Let's take a look at these 30 tweets, collected in real time:

In [None]:
for tweet in tweets:
    print(tweet["text"])
    print()

All Science: Kuwaiti TV Promotes Homosexuality "Cure" With Suppository That Kills Semen-Eating Anal Worms -… https://t.co/W11KOP8l88

@BobStein_FT This program isn't just for teachers, it is also military and federal employees as an incentive to sta… https://t.co/YI9h7z49cu

Hello @Space_Station from Morrison High School Science Class 285.2 mi away @NASA_Johnson #issabove https://t.co/WMp8SwaoS4

RT @baaaadscience: 90% of what you consist of is undetectable by science.

I've entered to #win two Beaker Creature Reactor Pods from @LRUK with @anywaytostay #competition #win #comp #toys… https://t.co/ISohhwJGnS

LabAnimal: A Nature Research journal covering in vivo science and technology using...... 
model organisms of human… https://t.co/nSTEoGy9R5

RT @arstechnica: Listen up: We’ve detected our first marsquake https://t.co/2TocxFQYNU by @SciGuySpace

I redid an ENTIRE science project on my own in elementary bc I hated how my partner and I did it. 💀😂

RT @msalli_: literally in year 9, I wo

The Streaming API is a very powerful data source, but it can also generate quite a bit of data in a short period of time. So, whenever you're collecting tweets from the stream, make sure to design the collection in such a way that you don't run out of storage!
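
One possible design, sketched below under stated assumptions, is to append each tweet to a JSON Lines file (one JSON object per line) and refuse further writes once a size cap is reached. The `JsonlWriter` class, the file name, and the tiny 60-byte cap are all made up for the demonstration, and a real collection would use a far larger limit.

```python
import json
import os
import tempfile


class JsonlWriter:
    """Append records to a JSON Lines file until a byte cap is reached."""

    def __init__(self, path, max_bytes):
        self.path = path
        self.max_bytes = max_bytes

    def write(self, record):
        """Append one record; return False (and write nothing) if it would exceed the cap."""
        size = os.path.getsize(self.path) if os.path.exists(self.path) else 0
        line = json.dumps(record) + "\n"
        if size + len(line.encode("utf-8")) > self.max_bytes:
            return False
        # newline="\n" keeps the byte count identical on every platform.
        with open(self.path, "a", encoding="utf-8", newline="\n") as handle:
            handle.write(line)
        return True


# Demo with stand-in records instead of real tweets.
path = os.path.join(tempfile.mkdtemp(), "tweets.jsonl")
writer = JsonlWriter(path, max_bytes=60)
results = [writer.write({"text": "tweet number %d" % i}) for i in range(5)]
print(results)
print(os.path.getsize(path))
```

Inside a streamer, `on_success` could call `writer.write(data)` and disconnect as soon as it returns `False`, so the tweets never have to accumulate in memory.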