# **Chapter 9. Getting Data**

## Reading Files

* The easiest way to handle a CSV file is to use Pandas (not covered in our textbook).
![picture](https://drive.google.com/uc?id=1REB_9aobuG1ZeLtXkT6gFWII2Rz8CWBX)
* In addition to CSV files, Pandas provides functions for loading input files in various other formats (e.g., Excel).

In [None]:
import os
import tarfile
import urllib.request  # standard library in Python 3; no need for six

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    # Create the target directory (no error if it already exists)
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # Download the archive, then unpack it next to the download
    urllib.request.urlretrieve(housing_url, tgz_path)
    with tarfile.open(tgz_path) as housing_tgz:  # closed automatically
        housing_tgz.extractall(path=housing_path)

fetch_housing_data()

In [None]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
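
* For comparison, here is a minimal sketch of reading the same kind of data with only the standard library's `csv` module. The inlined string is a two-row excerpt of the table above with a subset of its columns (chosen only to keep the example short), so it runs without the download:

```python
import csv
import io

# Two rows and five columns taken from the housing table above,
# inlined so the sketch does not depend on the downloaded file.
sample_csv = """longitude,latitude,housing_median_age,median_house_value,ocean_proximity
-122.23,37.88,41.0,452600.0,NEAR BAY
-122.22,37.86,21.0,358500.0,NEAR BAY
"""

# DictReader uses the header row as keys; every value comes back as a string.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Numeric columns must be converted explicitly (pandas infers these types).
median_values = [float(row["median_house_value"]) for row in rows]

print(rows[0]["ocean_proximity"])  # NEAR BAY
print(max(median_values))          # 452600.0
```

* The explicit `float(...)` conversion is the main practical difference: `pd.read_csv` guesses column types for you, while `csv` leaves everything as text.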


## Scraping the Web

* Another way to get data is by scraping it from web pages.

### HTML Parsing

* To get data out of HTML, we will use the **Beautiful Soup** library, which builds a tree out of the various elements on a web page and provides a simple interface for accessing them.

![picture](https://drive.google.com/uc?id=1hCVNBHa05qbQyT87QlabhtH9let1c9pq)

In [None]:
from bs4 import BeautifulSoup
import requests

# I put the relevant HTML file on GitHub. In order to fit
# the URL in the book I had to split it across two lines.
# Recall that whitespace-separated strings get concatenated.
url = ("https://raw.githubusercontent.com/"
       "joelgrus/data/master/getting-data.html")
html = requests.get(url).text
soup = BeautifulSoup(html, 'html5lib')

* We'll typically work with Tag objects, which correspond to the tags representing the structure of an HTML page.

In [None]:
first_paragraph = soup.find('p')        # or just soup.p

assert str(soup.find('p')) == '<p id="p1">This is the first paragraph.</p>'

first_paragraph_text = soup.p.text
first_paragraph_words = soup.p.text.split()

assert first_paragraph_words == ['This', 'is', 'the', 'first', 'paragraph.']

first_paragraph_id = soup.p['id']       # raises KeyError if no 'id'
first_paragraph_id2 = soup.p.get('id')  # returns None if no 'id'

assert first_paragraph_id == first_paragraph_id2 == 'p1'

all_paragraphs = soup.find_all('p')  # or just soup('p')
paragraphs_with_ids = [p for p in soup('p') if p.get('id')]

assert len(all_paragraphs) == 2
assert len(paragraphs_with_ids) == 1

important_paragraphs = soup('p', {'class' : 'important'})
important_paragraphs2 = soup('p', 'important')
important_paragraphs3 = [p for p in soup('p')
                         if 'important' in p.get('class', [])]

assert important_paragraphs == important_paragraphs2 == important_paragraphs3
assert len(important_paragraphs) == 1                         

* You can combine these methods to implement more elaborate logic.
>* For example, if you want to find every `<span>` element that is contained inside a `<div>` element, you could do this:

In [None]:
# warning, will return the same span multiple times
# if it sits inside multiple divs
# be more clever if that's the case
spans_inside_divs = [span
                     for div in soup('div')     # for each <div> on the page
                     for span in div('span')]   # find each <span> inside it

assert len(spans_inside_divs) == 3

### Example: Keeping Tabs on Congress

* The VP of Policy at DataSciencester is worried about potential regulation of the data science industry and asks you to quantify what Congress is saying on the topic. In particular, he wants you to find all the representatives who have press releases about "data."

In [None]:
from bs4 import BeautifulSoup
import requests
    
url = "https://www.house.gov/representatives"
text = requests.get(url).text
soup = BeautifulSoup(text, "html5lib")
    
all_urls = [a['href']
            for a in soup('a')
            if a.has_attr('href')]
    
print(len(all_urls))  # 965 for me, way too many 

967


* If you look at them, the ones we want start with
either *http://* or *https://*, have some kind of name, and end with either *.house.gov* or *.house.gov/*.

In [None]:
import re
    
# Must start with http:// or https://
# Must end with .house.gov or .house.gov/
regex = r"^https?://.*\.house\.gov/?$"
    
# Let's write some tests!
assert re.match(regex, "http://joel.house.gov")
assert re.match(regex, "https://joel.house.gov")
assert re.match(regex, "http://joel.house.gov/")
assert re.match(regex, "https://joel.house.gov/")
assert not re.match(regex, "joel.house.gov")
assert not re.match(regex, "http://joel.house.com")
assert not re.match(regex, "https://joel.house.gov/biography")

# And now apply
good_urls = [url for url in all_urls if re.match(regex, url)]
print(len(good_urls))  # still 862 for me

870
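
* The regex only constrains how the string starts and ends, so a URL on another host whose *path* happens to end in `.house.gov` would also pass. If that matters, one possible alternative is to parse the URL and check its pieces. The helper name `is_house_homepage` below is hypothetical, and this is a sketch rather than the book's approach:

```python
import re
from urllib.parse import urlparse

regex = r"^https?://.*\.house\.gov/?$"

def is_house_homepage(url: str) -> bool:
    """Hypothetical helper: True for http(s)://<name>.house.gov with no path."""
    parts = urlparse(url)
    return (parts.scheme in ("http", "https")
            and parts.hostname is not None
            and parts.hostname.endswith(".house.gov")
            and parts.path in ("", "/")      # homepage only
            and not parts.query
            and not parts.fragment)

# Same expectations as the regex tests above
assert is_house_homepage("http://joel.house.gov")
assert is_house_homepage("https://joel.house.gov/")
assert not is_house_homepage("joel.house.gov")
assert not is_house_homepage("http://joel.house.com")
assert not is_house_homepage("https://joel.house.gov/biography")

# A case where the two checks disagree
tricky = "http://example.com/x.house.gov"
print(bool(re.match(regex, tricky)))  # True: the regex only looks at the ending
print(is_house_homepage(tricky))      # False: the host is example.com
```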


* If you look at the list, there are a lot of duplicates. Let's use `set` to get rid of them:

In [None]:
num_original_good_urls = len(good_urls)
good_urls = list(set(good_urls))
print(len(good_urls))  # only 431 for me
      
assert len(good_urls) < num_original_good_urls

435
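
* Two caveats about `set`: it does not preserve the original order, and it treats `https://x.house.gov` and `https://x.house.gov/` as different strings (the regex accepts both forms). A minimal sketch of one way to handle both, using a short hand-written list in place of `good_urls`:

```python
# Stand-in for good_urls; the real list comes from the scrape above.
urls = ["https://upton.house.gov",
        "https://baird.house.gov/",
        "https://upton.house.gov/",   # same site, trailing slash
        "https://baird.house.gov/"]   # exact duplicate

# Plain set: removes exact duplicates only, and the order is arbitrary.
print(len(set(urls)))  # 3

# Strip trailing slashes first, then de-duplicate while keeping first-seen order.
# dict preserves insertion order, so dict.fromkeys acts as an ordered set.
normalized = [u.rstrip("/") for u in urls]
unique_urls = list(dict.fromkeys(normalized))

print(unique_urls)  # ['https://upton.house.gov', 'https://baird.house.gov']
```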


* When we look at the sites, most of them have a link to press releases.

In [None]:
html = requests.get('https://jayapal.house.gov').text
soup = BeautifulSoup(html, 'html5lib')
    
# Use a set because the links might appear multiple times.
links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}

print(links) # {'/media/press-releases'}

{'https://jayapal.house.gov/category/press-releases/', 'https://jayapal.house.gov/category/news/'}


* We'll write a slightly more general function that checks whether a page
of press releases mentions any given term.

In [None]:
def paragraph_mentions(text: str, keyword: str) -> bool:
    """
    Returns True if a <p> inside the text mentions {keyword}
    """
    soup = BeautifulSoup(text, 'html5lib')
    paragraphs = [p.get_text() for p in soup('p')]

    return any(keyword.lower() in paragraph.lower()
               for paragraph in paragraphs)

text = """<body><h1>Facebook</h1><p>Twitter</p>"""
assert paragraph_mentions(text, "twitter")       # is inside a <p>
assert not paragraph_mentions(text, "facebook")  # not inside a <p>

In [None]:
# I don't want this file to scrape all 400+ websites every time it runs.
# So I'm going to randomly throw out most of the urls.
# The code in the book doesn't do this.
import random
good_urls = random.sample(good_urls, 5)
print(f"after sampling, left with {good_urls}")

after sampling, left with ['https://upton.house.gov', 'https://baird.house.gov/', 'https://hartzler.house.gov/', 'https://gaetz.house.gov', 'https://lee.house.gov/']


* At last we are ready to find the relevant congresspeople.

In [None]:
from typing import Dict, Set

press_releases: Dict[str, Set[str]] = {}
    
for house_url in good_urls:
    html = requests.get(house_url).text
    soup = BeautifulSoup(html, 'html5lib')
    pr_links = {a['href'] for a in soup('a') if 'press releases' in a.text.lower()}
    print(f"{house_url}: {pr_links}")
    press_releases[house_url] = pr_links
    
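from urllib.parse import urljoin

for house_url, pr_links in press_releases.items():
    for pr_link in pr_links:
        # urljoin copes with relative links, absolute links, and
        # base URLs that may or may not end in a slash
        url = urljoin(house_url, pr_link)
        text = requests.get(url).text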
   
        if paragraph_mentions(text, 'data'):
            print(f"{house_url}")
            break  # done with this house_url

https://upton.house.gov: {'/News/DocumentQuery.aspx?DocumentTypeID=1828'}
https://baird.house.gov/: {'/news/documentquery.aspx?DocumentTypeID=27'}
https://hartzler.house.gov/: {'/media-center/press-releases'}
https://gaetz.house.gov: {'/media/press-releases'}
https://lee.house.gov/: set()
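
* The links printed above are site-relative paths, the earlier Jayapal example returned absolute URLs, and some base URLs end in a slash while others do not. A minimal sketch of how the standard library's `urllib.parse.urljoin` resolves each of these cases, using values copied from the outputs above:

```python
from urllib.parse import urljoin

# Base ends with "/", link starts with "/": no doubled slashes in the result.
a = urljoin("https://baird.house.gov/",
            "/news/documentquery.aspx?DocumentTypeID=27")

# Base has no trailing slash: the path is still attached correctly.
b = urljoin("https://gaetz.house.gov", "/media/press-releases")

# Link is already absolute: it is returned unchanged.
c = urljoin("https://jayapal.house.gov",
            "https://jayapal.house.gov/category/press-releases/")

# Plain string formatting, for contrast, yields a malformed path here.
naive = f"{'https://baird.house.gov/'}/{'/news/documentquery.aspx?DocumentTypeID=27'}"

print(a)      # https://baird.house.gov/news/documentquery.aspx?DocumentTypeID=27
print(b)      # https://gaetz.house.gov/media/press-releases
print(c)      # https://jayapal.house.gov/category/press-releases/
print(naive)  # https://baird.house.gov///news/documentquery.aspx?DocumentTypeID=27
```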


## Using APIs

* Many websites and web services provide application programming interfaces (APIs), which allow you to explicitly request data in a structured format. This saves you the trouble of having to scrape them!

### JSON and XML

* Because HTTP is a protocol for transferring *text*, the data you request through a web API needs to be **serialized** into a string format. Often this serialization uses **JavaScript Object Notation (JSON)**.
* JavaScript objects look quite similar to Python `dict`s, which makes their string representations easy to interpret:

In [None]:
{ "title" : "Data Science Book",
  "author" : "Joel Grus",
  "publicationYear" : 2019,
  "topics" : [ "data", "science", "data science"] }

{'author': 'Joel Grus',
 'publicationYear': 2019,
 'title': 'Data Science Book',
 'topics': ['data', 'science', 'data science']}

* We can parse JSON using Python's `json` module. In particular, we will use its `loads` function, which deserializes a string representing a JSON object into a Python object:

In [None]:
import json
serialized = """{ "title" : "Data Science Book",
                  "author" : "Joel Grus",
                  "publicationYear" : 2019,
                  "topics" : [ "data", "science", "data science"] }"""

# parse the JSON to create a Python dict
deserialized = json.loads(serialized)
assert deserialized["publicationYear"] == 2019
assert "data science" in deserialized["topics"]

### Using an Unauthenticated API

* Most APIs these days require that you first authenticate yourself before you can use them. Accordingly, we'll start by taking a look at [GitHub's API](https://docs.github.com/en/rest), with which you can do some simple things unauthenticated:

In [None]:
import requests, json
    
github_user = "joelgrus"
endpoint = f"https://api.github.com/users/{github_user}/repos"
    
repos = json.loads(requests.get(endpoint).text)
    
from collections import Counter
from dateutil.parser import parse
    
dates = [parse(repo["created_at"]) for repo in repos]
month_counts = Counter(date.month for date in dates)
weekday_counts = Counter(date.weekday() for date in dates)
    
last_5_repositories = sorted(repos,
                             key=lambda r: r["pushed_at"],
                             reverse=True)[:5]
print(last_5_repositories)
    
last_5_languages = [repo["language"]
                    for repo in last_5_repositories]
print(last_5_languages)

[{'id': 26382146, 'node_id': 'MDEwOlJlcG9zaXRvcnkyNjM4MjE0Ng==', 'name': 'data-science-from-scratch', 'full_name': 'joelgrus/data-science-from-scratch', 'private': False, 'owner': {'login': 'joelgrus', 'id': 1308313, 'node_id': 'MDQ6VXNlcjEzMDgzMTM=', 'avatar_url': 'https://avatars.githubusercontent.com/u/1308313?v=4', 'gravatar_id': '', 'url': 'https://api.github.com/users/joelgrus', 'html_url': 'https://github.com/joelgrus', 'followers_url': 'https://api.github.com/users/joelgrus/followers', 'following_url': 'https://api.github.com/users/joelgrus/following{/other_user}', 'gists_url': 'https://api.github.com/users/joelgrus/gists{/gist_id}', 'starred_url': 'https://api.github.com/users/joelgrus/starred{/owner}{/repo}', 'subscriptions_url': 'https://api.github.com/users/joelgrus/subscriptions', 'organizations_url': 'https://api.github.com/users/joelgrus/orgs', 'repos_url': 'https://api.github.com/users/joelgrus/repos', 'events_url': 'https://api.github.com/users/joelgrus/events{/privacy
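
* To see what the date counting does without calling the API, here is a small offline sketch. The three records are hypothetical (not real API output) and contain only the field being used. It assumes timestamps of the form `2019-01-01T12:00:00Z` and parses them with the standard library instead of `dateutil`:

```python
from collections import Counter
from datetime import datetime

# Hypothetical records for illustration only
sample_repos = [
    {"name": "repo-a", "created_at": "2019-01-01T12:00:00Z"},
    {"name": "repo-b", "created_at": "2019-01-07T08:30:00Z"},
    {"name": "repo-c", "created_at": "2019-02-04T17:45:00Z"},
]

# Assumed timestamp layout: YYYY-MM-DDTHH:MM:SSZ
sample_dates = [datetime.strptime(repo["created_at"], "%Y-%m-%dT%H:%M:%SZ")
                for repo in sample_repos]

# month is 1-12; weekday() is 0 for Monday through 6 for Sunday
sample_month_counts = Counter(date.month for date in sample_dates)
sample_weekday_counts = Counter(date.weekday() for date in sample_dates)

print(sample_month_counts)    # Counter({1: 2, 2: 1})
print(sample_weekday_counts)  # Counter({0: 2, 1: 1})
```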

### Finding APIs

* There are libraries for the Yelp API, for the Instagram API, for the Spotify API, and so on.
* If you're looking for a list of APIs that have Python wrappers, there's a nice one from [Real Python on GitHub](https://github.com/realpython/list-of-python-api-wrappers).

## Example: Using the Twitter APIs

* **Twitter** is a fantastic source of data to work with.
* You can use it to get real-time news. You can use it to measure reactions to current events. You can use it to find links related to specific topics. You can use it for pretty much anything you can imagine, just as long as you can get access to its data. And you can get access to its data through its APIs.

### Getting Credentials

Here are the steps:
1. Go to https://developer.twitter.com/.
2. If you are not signed in, click "Sign in" and enter your Twitter username and password.
3. Click `Apply` to apply for a developer account.
4. Request access for your own personal use.
5. Fill out the application. It requires 300 words (really) on why you need access, so to get over the limit you could tell them about this book and how much you're enjoying it.
6. Wait some indefinite amount of time.
7. Once you get approved, go back to developer.twitter.com, find the "Apps" section, and click "Create an app."
8. Fill out all the required fields (again, if you need extra characters for the description, you could talk about this book and how edifying you're finding it).
9. Click `Create`.

![picture](https://drive.google.com/uc?id=1wTREQhuNaSUhZMuk8rs76nVkL0NKl9Yt)

* Now your app should have a "Keys and tokens" tab with a "Consumer API keys" section that lists an "API key" and an "API secret key." Take note of those keys; you'll need them. (Also, keep them secret! They're like passwords.)

![picture](https://drive.google.com/uc?id=1T1v6Y7MBrcf2VFm95-Zg5FrUbq9dftzl)
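
* Because the keys behave like passwords, one common option is to keep them out of the notebook and read them from environment variables. A minimal sketch, in which the helper name `load_twitter_credentials` and the variable names `TWITTER_CONSUMER_KEY` and `TWITTER_CONSUMER_SECRET` are assumptions you can rename freely:

```python
import os

def load_twitter_credentials(env=os.environ):
    """Hypothetical helper: read the API key and secret from a mapping
    (by default, the process environment)."""
    try:
        return {"CONSUMER_KEY": env["TWITTER_CONSUMER_KEY"],
                "CONSUMER_SECRET": env["TWITTER_CONSUMER_SECRET"]}
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError
        raise RuntimeError(f"missing environment variable {missing}") from None

# Demonstration with a fake environment so no real secrets are needed
demo_env = {"TWITTER_CONSUMER_KEY": "demo-key",
            "TWITTER_CONSUMER_SECRET": "demo-secret"}
demo_credentials = load_twitter_credentials(demo_env)
print(sorted(demo_credentials))  # ['CONSUMER_KEY', 'CONSUMER_SECRET']
```

* Calling `load_twitter_credentials()` with no argument reads the real environment, so the secrets never appear in a saved or shared notebook.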

### Using Twython

* The trickiest part of using the Twitter API is authenticating yourself. API providers want to make sure that you're authorized to access their data and that you don't exceed their usage limits. They also want to know who's accessing their data.
* There is a simple way, OAuth 2, that suffices when you just want to do simple searches. And there is a complex way, OAuth 1, that's required when you want to perform actions (e.g., tweeting) or (in particular for us) connect to the Twitter stream.
* So we're stuck with the more complicated way, which we'll try to automate as much as we can.

* First, you need your API key and API secret key (sometimes known as the consumer key and consumer secret, respectively).

In [None]:
# Feel free to plug your key and secret in directly.
# The strings below are placeholders; replace them with your own values.
credentials = {}
credentials['CONSUMER_KEY'] = "YOUR_API_KEY"
credentials['CONSUMER_SECRET'] = "YOUR_API_SECRET_KEY"
# credentials['ACCESS_TOKEN'] = "YOUR_ACCESS_TOKEN"
# credentials['ACCESS_SECRET'] = "YOUR_ACCESS_TOKEN_SECRET"

CONSUMER_KEY = credentials['CONSUMER_KEY']
CONSUMER_SECRET = credentials['CONSUMER_SECRET']

* Now we can instantiate the client:

In [None]:
import webbrowser
!pip install twython
# Import the Twython class
from twython import Twython
    
# Get a temporary client to retrieve an authentication url
temp_client = Twython(CONSUMER_KEY, CONSUMER_SECRET)
temp_creds = temp_client.get_authentication_tokens()
url = temp_creds['auth_url']
    
# Now visit that URL to authorize the application and get a PIN
print(f"go visit {url} and get the PIN code and paste it below")
webbrowser.open(url)
PIN_CODE = input("please enter the PIN code: ")
    
# Now we use that PIN_CODE to get the actual tokens
auth_client = Twython(CONSUMER_KEY,
                      CONSUMER_SECRET,
                      temp_creds['oauth_token'],
                      temp_creds['oauth_token_secret'])
final_step = auth_client.get_authorized_tokens(PIN_CODE)
ACCESS_TOKEN = final_step['oauth_token']
ACCESS_TOKEN_SECRET = final_step['oauth_token_secret']
    
# And get a new Twython instance using them.
twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

* The [Streaming API](https://developer.twitter.com/en/docs/tutorials/consuming-streaming-data) allows you to connect to (a sample of) the great Twitter firehose. 
* To use it, you'll need to authenticate using your access tokens. 
* In order to access the Streaming API with `Twython`, we need to define a class that inherits from `TwythonStreamer` and that overrides its `on_success` method, and possibly its `on_error` method:

* `MyStreamer` will connect to the Twitter stream and wait for Twitter to feed it data. Each time it receives some data (here, a tweet represented as a Python object), it passes it to the `on_success` method, which appends it to our `tweets` list if its language is English, and then disconnects the streamer after it's collected 100 tweets.

In [None]:
from twython import TwythonStreamer
    
# Appending data to a global variable is pretty poor form
# but it makes the example much simpler
tweets = []
    
class MyStreamer(TwythonStreamer):
    def on_success(self, data):
        """
        What do we do when twitter sends us data?
        Here data will be a Python dict representing a tweet
        """
        # We only want to collect English-language tweets
        if data.get('lang') == 'en':
            tweets.append(data)
            print(f"received tweet #{len(tweets)}")

        # Stop when we've collected enough
        if len(tweets) >= 100:
            self.disconnect()
    
    def on_error(self, status_code, data):
        print(status_code, data)
        self.disconnect()

In [None]:
stream = MyStreamer(CONSUMER_KEY, CONSUMER_SECRET,
                    ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    
# starts consuming public statuses that contain the keyword 'data'
stream.statuses.filter(track='data')
    
# if instead we wanted to start consuming a sample of *all* public statuses
# stream.statuses.sample()

* This will run until it collects 100 tweets (or until it encounters an error) and stop, at which point you can start analyzing those tweets.
* For instance, you could find the most common hashtags with:

In [None]:
top_hashtags = Counter(hashtag['text'].lower()
                       for tweet in tweets
                       for hashtag in tweet["entities"]["hashtags"])

print(top_hashtags.most_common(5))
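
* To make the nested lookup concrete without a live stream, here is the same counting applied to a few hand-written dicts. These are hypothetical and heavily trimmed; they keep only the `entities` → `hashtags` → `text` structure that the code above reads:

```python
from collections import Counter

# Hypothetical stand-ins for streamed tweets; real tweets carry many more fields
sample_tweets = [
    {"text": "Learning #Data science with #Python",
     "entities": {"hashtags": [{"text": "Data"}, {"text": "Python"}]}},
    {"text": "More #data please",
     "entities": {"hashtags": [{"text": "data"}]}},
    {"text": "No hashtags here",
     "entities": {"hashtags": []}},
]

# Lowercasing merges "Data" and "data" into a single key
sample_top_hashtags = Counter(hashtag["text"].lower()
                              for tweet in sample_tweets
                              for hashtag in tweet["entities"]["hashtags"])

print(sample_top_hashtags.most_common(2))  # [('data', 2), ('python', 1)]
```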