Logs   
- [2023/03/08]   
  Restart this notebook if you change the scratch library

- [2024/02/16]   
  You do not need to restart if you change the scratch library.  I added
  two lines in the third cell

In [1]:
import re
import csv
import numpy as np
import matplotlib.pyplot as plt
import requests
import json

from collections import Counter
from bs4 import BeautifulSoup
from typing import Dict, Set
from dateutil.parser import parse

In [3]:
%load_ext autoreload
%autoreload 2 


In [4]:
plt.rcParams.update(plt.rcParamsDefault)
plt.rcParams.update({
  'font.size': 16,
  'grid.alpha': 0.25})

## stdin and stdout

All the files are in `ch-09`. There are three files that you have to write
or download:
- `egrep.py` (to capture all lines that contain numbers)
- `line_count.py` (to count the number of lines)
- `text_examples.txt` (an example text that contains two lines with numbers)

This demo shows that you can chain two Python programs: the first captures
all the lines that contain a specific pattern, and the second counts them.

You can run those two files with the following command (type it in
Command Prompt or PowerShell if you are using Windows)
```bat
type text_examples.txt | python egrep.py "[0-9]" | python line_count.py
```

If you get an error, make sure that your current directory is `ch-09`
and that you have activated the conda environment in which Python is
installed.
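
The two scripts themselves are not shown in these notes. A minimal sketch of what they might do, assuming `egrep.py` takes the regex as its first command-line argument and both scripts read from `sys.stdin`. The function names `grep_lines` and `count_lines` and the sample lines are made up for illustration, and the demo feeds an in-memory list instead of a real pipe:

```python
import re
from typing import Iterable, Iterator

def grep_lines(pattern: str, lines: Iterable[str]) -> Iterator[str]:
  """Yield only the lines that match the regex (the job of egrep.py)"""
  for line in lines:
    if re.search(pattern, line):
      yield line

def count_lines(lines: Iterable[str]) -> int:
  """Count the lines received (the job of line_count.py)"""
  return sum(1 for _ in lines)

# In the real scripts, `lines` would be sys.stdin and the pattern sys.argv[1].
sample = ["no digits here\n", "room 101\n", "still none\n", "call 555-0100\n"]
matching = list(grep_lines("[0-9]", sample))
num_matching = count_lines(matching)
print(num_matching)
```

Chaining the two functions mirrors the pipe: the output of the first stage becomes the input of the second.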

Fortunately, Windows provides us with a lot of tools, so you can get
the same output with the built-in commands in Command Prompt or PowerShell:

- Command Prompt:
  ```bat
  type text_examples.txt | findstr /r "[0-9]" | find /c /v ""
  ```

- PowerShell:  
  ```powershell
  get-content text_examples.txt | select-string -pattern "[0-9]" | measure-object -line
  ```

The other data-processing pipeline that we want to learn is a script
that counts the words in its input and writes out the most common ones

Run the following command in PowerShell
```bat
type web.txt | python most_common_words.py 10
```
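
A sketch of what `most_common_words.py` might do, assuming words are separated by whitespace and compared in lowercase. The helper name `top_words` and the sample lines are hypothetical; the real script would read `sys.stdin` and take the number of words from `sys.argv[1]`:

```python
from collections import Counter
from typing import Iterable, List, Tuple

def top_words(lines: Iterable[str], num_words: int) -> List[Tuple[str, int]]:
  """Count lowercased, whitespace-separated words and return the most common"""
  counter = Counter(word.lower()
                    for line in lines
                    for word in line.strip().split())
  return counter.most_common(num_words)

# Made-up input standing in for the piped file
sample = ["the quick brown fox", "The lazy dog", "the end"]
top_one = top_words(sample, 1)

for word, count in top_one:
  print(f"{count}\t{word}")
```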

## Reading Files

### The Basics of Text Files

The old way to open a file (you have to close it yourself):
```py
# 'r' means read-only, it's assumed if you leave it out
file_for_reading = open("reading_file.txt", 'r')
file_for_reading2 = open("reading_file.txt")   # by default it sets 'r'

# 'w' is write -- will destroy the file if it already exists!
file_for_writing = open("writing_file.txt", 'w')

# 'a' is append -- for adding to the end of the file
file_for_appending = open("appending_file.txt", 'a')

# don't forget to close your files when you're done
file_for_writing.close()
```

The newer way uses a `with` block, which closes the file automatically
when the block ends:
```py
with open(filename) as f:
  data = function_that_gets_data_from(f)
```
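
A small runnable sketch of the `with` pattern, assuming a throwaway file created with `tempfile` so that it does not depend on the files of this notebook (the file name and its contents are made up):

```python
import os
import tempfile

# Create a throwaway file so the example does not touch the notebook's data.
path = os.path.join(tempfile.mkdtemp(), "reading_file.txt")

with open(path, 'w') as f:
  f.write("first line\nsecond line\n")

with open(path) as f:
  lines = [line.strip() for line in f]

# Outside the block the file object still exists, but it is already closed.
print(lines)
print(f.closed)
```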



An example that counts the number of lines that start with `#`

In [5]:
starts_with_hash = 0

with open("./ch-09/input.txt") as f:
  for line in f:                # look at each line in the file
    if re.match("^#", line):    # use a regex to see if it starts with '#'
      starts_with_hash += 1     # if it does, add 1 to the count

assert starts_with_hash == 4

An example script to extract domains given a list of email addresses

In [6]:
def get_domain(email_address: str) -> str:
  """Split on '@' and return the last piece"""
  return email_address.lower().split("@")[-1]

# a couple of tests
assert get_domain("joelgrus@gmail.com") == "gmail.com"
assert get_domain("joel@m.datasciencester.com") == "m.datasciencester.com"

In [7]:
with open("./ch-09/email_addresses.txt", 'r') as f:
  domain_counts = Counter(get_domain(line.strip())
                          for line in f
                          if '@' in line)

domain_counts

Counter({'techno.org': 21,
         'globex.co.uk': 21,
         'megacorp.edu': 20,
         'widgetcorp.net': 19,
         'infinite.io': 19,
         'acme.com': 18})

### Delimited Files

[Note: Never parse a comma-separated file yourself. You will screw up the edge cases!]

The following stock-price data has its columns separated by a tab character

In [8]:
def process(date, symbol, closing_price):
  """A simple function to process date, symbol, and closing price.
  This is only used to show that opening a file is successful"""
  print(f"{date} {symbol} {closing_price}")

In [9]:
with open("./datasets/tab_delimited_stock_prices.tsv") as f:
  tab_reader = csv.reader(f, delimiter="\t")
  for row in tab_reader:
    date = row[0]
    symbol = row[1]
    closing_price = float(row[2])
    process(date, symbol, closing_price)

3/20/2023 AAPL 157.4
3/20/2023 MSFT 272.23
3/20/2023 FB 196.64
3/17/2023 AAPL 155.0
3/17/2023 MSFT 279.43
3/17/2023 FB 196.64


If your file has a header row (here the columns are also separated by
colons), the `csv` module provides `csv.DictReader`, which uses the header
to turn each row into a `dict`.

In [10]:
with open("./datasets/colon_delimited_stock_prices.dsv") as f:
  colon_reader = csv.DictReader(f, delimiter=":")
  for dict_row in colon_reader:
    date = dict_row["date"]
    symbol = dict_row["symbol"]
    closing_price = float(dict_row["closing_price"])
    process(date, symbol, closing_price)

3/20/2023 AAPL 157.4
3/20/2023 MSFT 272.23
3/20/2023 FB 196.64
3/17/2023 AAPL 155.0
3/17/2023 MSFT 279.43
3/17/2023 FB 196.64


When you want to write out data that you get from the internet (the
next section shows how to get the data), never format the lines by hand.
Instead, use `csv.writer` to write your data as comma-separated values.

In [11]:
results = [["test1", "success", "Monday"],
           ["test2", "success, kind of", "Tuesday"],
           ["test3", "failure, kind of", "Wednesday"],
           ["test4", "failure, utter", "Thursday"]]

# don't do this
with open("./ch-09/bad_csv.txt", 'w') as f:
  for row in results:
    f.write(",".join(map(str, row)))    # might have too many commas in it!
    f.write("\n")                       # row might have newlines as well!

In [12]:
# -- windows
!type ".\ch-09\bad_csv.txt"    

# -- linux/mac
#!cat "./ch-09/bad_csv.txt"

test1,success,Monday
test2,success, kind of,Tuesday
test3,failure, kind of,Wednesday
test4,failure, utter,Thursday


In [13]:
# the correct way
with open("./ch-09/good_csv.txt", 'w', newline="") as f:
  csv_writer = csv.writer(f, delimiter=',')
  for row in results:
    csv_writer.writerow(row)

In [14]:
!type ".\ch-09\good_csv.txt"

test1,success,Monday
test2,"success, kind of",Tuesday
test3,"failure, kind of",Wednesday
test4,"failure, utter",Thursday
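
To see why the quoting matters, here is a sketch of a round trip done entirely in memory, assuming `io.StringIO` as a stand-in for the file. It reuses one row from `results`:

```python
import csv
import io

row = ["test2", "success, kind of", "Tuesday"]

# Manual writing: the comma inside the second field is indistinguishable
# from a delimiter, so the row comes back with four fields.
manual_line = ",".join(row)
manual_fields = manual_line.split(",")

# csv.writer quotes the field, and csv.reader undoes the quoting.
buffer = io.StringIO(newline="")
csv.writer(buffer).writerow(row)
buffer.seek(0)
round_trip = next(csv.reader(buffer))

print(manual_fields)
print(round_trip)
```

Only the `csv` round trip returns the original three fields.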


## Scraping the Web

### HTML and the Parsing Thereof

A simple HTML structure (in general it is more complicated than this!)

```html
<html>
  <head>
    <title>A web page</title>
  </head>
  <body>
    <p id="author">Joel Grus</p>
    <p id="subject">Data Science</p>
  </body>
</html>
```

We need two modules:
- `BeautifulSoup`, which builds a tree out of the various elements on
  the web page and provides a simple interface for accessing them
- `requests` for making HTTP requests (it asks for the content of a web page
  the way your browser does)



To use `BeautifulSoup` properly, you need to install an HTML parser (a mechanism
to read an HTML file) with the following command:
```bash
pip install html5lib
```

After you download and install that module, please restart VS Code.

To use `BeautifulSoup`, we pass a string containing HTML from the results
of `requests.get`

In [15]:
url = ("https://raw.githubusercontent.com/joelgrus/data/master/getting-data.html")
html = requests.get(url).text
soup = BeautifulSoup(html, "html5lib")
soup

<!DOCTYPE html>
<html lang="en-US"><head>
    <title>Getting Data</title>
    <meta charset="utf-8"/>
</head>
<body>
    <h1>Getting Data</h1>
    <div class="explanation">
        This is an explanation.
    </div>
    <div class="comment">
        This is a comment.
    </div>
    <div class="content">
        <p id="p1">This is the first paragraph.</p>
        <p class="important">This is the second paragraph.</p>
    </div>
    <div class="signature">
        <span id="name">Joel</span>
        <span id="twitter">@joelgrus</span>
        <span id="email">joelgrus-at-gmail</span>
    </div>


</body></html>

To find the first `<p>` tag (and its contents)

In [16]:
first_paragraph = soup.find("p")   # or just soup.p
first_paragraph

<p id="p1">This is the first paragraph.</p>

We can get the text contents of a `Tag` using its `text` property

In [17]:
first_paragraph_text = soup.p.text
first_paragraph_words = soup.p.text.split()
print(first_paragraph_text)
print(first_paragraph_words)

This is the first paragraph.
['This', 'is', 'the', 'first', 'paragraph.']


Extract a tag's attribute by treating it like a `dict`

In [18]:
first_paragraph_id = soup.p["id"]         # raise KeyError if no 'id'
first_paragraph_id2 = soup.p.get("id")    # returns None if no 'id'

print(first_paragraph_id)
print(first_paragraph_id2)

p1
p1


Get multiple tags at once

In [19]:
all_paragraphs = soup.find_all('p')     # or just soup('p')
paragraphs_with_ids = [p for p in soup('p') if p.get('id')]

print(all_paragraphs)
print(paragraphs_with_ids)


[<p id="p1">This is the first paragraph.</p>, <p class="important">This is the second paragraph.</p>]
[<p id="p1">This is the first paragraph.</p>]


Find tags with a specific `class`

In [20]:
important_paragraphs = soup('p', {'class': 'important'})
important_paragraphs2 = soup('p', 'important')
important_paragraphs3 = [p for p in soup('p')
                         if 'important' in p.get('class', [])]

print(important_paragraphs)
print(important_paragraphs2)
print(important_paragraphs3)

[<p class="important">This is the second paragraph.</p>]
[<p class="important">This is the second paragraph.</p>]
[<p class="important">This is the second paragraph.</p>]


Find every `<span>` element that is contained inside a `<div>` element

In [21]:
# Warning: will return the same <span> multiple times
# if it sits inside multiple (nested) <div>s.
# Be more clever if that's the case (e.g. add conditions on the <span>'s attributes).
span_inside_divs = [span 
                    for div in soup('div')      # for each <div> on the page
                      for span in div('span')]  # for each <span> inside it
span_inside_divs

[<span id="name">Joel</span>,
 <span id="twitter">@joelgrus</span>,
 <span id="email">joelgrus-at-gmail</span>]

### Example: Keeping Tabs on Congress

You are a data scientist, and you are worried about potential regulation
of the data science industry, so you need to quantify what Congress is saying
about it.

Starting from the page that lists all the representatives, `https://www.house.gov/representatives`,
you need to extract the website of each representative, which appears in a link like:
```html
<td>
  <a href="https://jayapal.house.gov">Jayapal, Pramila</a>
</td>
```

In [22]:
url = "https://www.house.gov/representatives"
text = requests.get(url).text
soup = BeautifulSoup(text, "html5lib")

all_urls = [a['href'] for a in soup('a')
            if a.has_attr("href")]
all_urls

['#main-content',
 '/',
 '/',
 '/representatives',
 '/leadership',
 '/committees',
 '/legislative-activity',
 '/the-house-explained',
 '/visitors',
 '/educators-and-students',
 '/media',
 '/doing-business-with-the-house',
 '/employment',
 '/representatives',
 '/leadership',
 '/committees',
 '/legislative-activity',
 '/the-house-explained',
 '/visitors',
 '/educators-and-students',
 '/media',
 '/doing-business-with-the-house',
 '/employment',
 '/the-house-explained',
 'https://www.aoc.gov/explore-capitol-campus/buildings-grounds/house-office-buildings/cannon',
 'https://www.aoc.gov/explore-capitol-campus/buildings-grounds/house-office-buildings/longworth',
 'https://www.aoc.gov/explore-capitol-campus/buildings-grounds/house-office-buildings/rayburn',
 'https://www.visitthecapitol.gov/visit/maps-and-brochures/us-capitol-map',
 '#room-numbers',
 '#by-state',
 '#by-name',
 '#state-alabama',
 '#state-california',
 '#state-delaware',
 '#state-florida',
 '#state-georgia',
 '#state-hawaii',
 '

There are many URLs that you do not need. This is an opportunity to use
a regular expression to keep only the URLs that follow a specific pattern.

In [23]:
# Must start with http:// or https://
# Must end with .house.gov or .house.gov/
regex = r"^https?://.*\.house\.gov/?$"

# Let's write some tests!
assert re.match(regex, "http://joel.house.gov")
assert re.match(regex, "https://joel.house.gov")
assert re.match(regex, "http://joel.house.gov/")
assert re.match(regex, "https://joel.house.gov/")
assert not re.match(regex, "joel.house.gov")
assert not re.match(regex, "https://joel.house.com")
assert not re.match(regex, "https://joel.house.gov/biography")

# And now apply
good_urls = [url for url in all_urls if re.match(regex, url)]

print(len(good_urls))   # 872 matching links for me in 2024 (the book reports 862)

872
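
A count this far above the 435 House seats suggests that the same site is matched more than once. A minimal sketch of deduplicating with a `set`, assuming that URLs which differ only by a trailing slash point to the same site. `sample_urls` is made-up data in the style of the matched links:

```python
# The last two entries repeat the first site, once verbatim and once
# with a trailing slash added.
sample_urls = ["https://carl.house.gov",
               "https://mikerogers.house.gov/",
               "https://carl.house.gov",
               "https://carl.house.gov/"]

# Strip the trailing slash before deduplicating so both spellings collapse.
unique_urls = sorted({url.rstrip("/") for url in sample_urls})
print(unique_urls)
```

Applied to `good_urls`, the same set comprehension would give a list closer to one entry per representative.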


In [24]:
good_urls

['https://carl.house.gov',
 'https://barrymoore.house.gov',
 'https://mikerogers.house.gov/',
 'https://aderholt.house.gov/',
 'https://strong.house.gov',
 'https://palmer.house.gov/',
 'https://sewell.house.gov/',
 'https://peltola.house.gov',
 'https://radewagen.house.gov',
 'https://schweikert.house.gov/',
 'https://crane.house.gov',
 'https://rubengallego.house.gov/',
 'https://stanton.house.gov/',
 'https://biggs.house.gov',
 'https://ciscomani.house.gov',
 'https://grijalva.house.gov/',
 'https://lesko.house.gov',
 'https://gosar.house.gov/',
 'https://crawford.house.gov/',
 'https://hill.house.gov/',
 'https://womack.house.gov/',
 'https://westerman.house.gov/',
 'https://lamalfa.house.gov',
 'https://huffman.house.gov',
 'https://kiley.house.gov',
 'https://mikethompson.house.gov/',
 'https://mcclintock.house.gov/',
 'https://bera.house.gov',
 'https://matsui.house.gov',
 'https://garamendi.house.gov/',
 'https://harder.house.gov/',
 'https://desaulnier.house.gov/',
 'https://p

It is possible that a couple of House seats are empty, or that there is
a representative without a website

In [25]:
html = requests.get("https://jayapal.house.gov").text
soup = BeautifulSoup(html, "html5lib")

# Use a set because the links might appear multiple times
# print("\n".join([a.text.lower() for a in soup('a')]))  # only to check
links = {a['href'] for a in soup('a') 
         if 'press releases' in a.text.lower() and
            'press-releases' in a['href']}   # there are two links
                                             # press-releases and news

print(links)      # {'/media/press-releases'}

{'https://jayapal.house.gov/category/press-releases/'}


Get all press release links for all representatives

In [26]:
# scraping time: ~60 minutes (for all websites)
# because we send so many requests, we have to set a User-Agent header
num_of_website = 100
press_releases: Dict[str, Set[str]] = {}

user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " \
              + "AppleWebKit/537.36 (KHTML, like Gecko) " \
              + "Chrome/121.0.0.0 Safari/537.36 Edg/121.0.2277.112"
for idx, house_url in enumerate(good_urls[:num_of_website]):
  html = requests.get(house_url, headers={
    "User-Agent": user_agent}).text
  soup = BeautifulSoup(html, "html5lib")
  pr_links = {a["href"] for a in soup('a') if 'press releases' in a.text.lower()}

  print(f"{idx:03d} {house_url}: {pr_links}")
  press_releases[house_url] = pr_links


000 https://carl.house.gov: {'/media/press-releases'}
001 https://barrymoore.house.gov: {'/media/press-releases'}
002 https://mikerogers.house.gov/: {'/news/documentquery.aspx?DocumentTypeID=27'}
003 https://aderholt.house.gov/: {'/media-center/press-releases'}
004 https://strong.house.gov: {'/media/press-releases'}
005 https://palmer.house.gov/: {'/media-center/press-releases'}


006 https://sewell.house.gov/: {'/press-releases'}
007 https://peltola.house.gov: {'/news/documentquery.aspx?DocumentTypeID=27'}
008 https://radewagen.house.gov: {'/media/press-releases'}
009 https://schweikert.house.gov/: set()
010 https://crane.house.gov: {'/media/press-releases'}
011 https://rubengallego.house.gov/: {'https://rubengallego.house.gov/media-center/press-releases'}
012 https://stanton.house.gov/: {'/press-releases'}
013 https://biggs.house.gov: {'/media/press-releases'}
014 https://ciscomani.house.gov: {'/media/press-releases'}
015 https://grijalva.house.gov/: set()
016 https://lesko.house.gov: {'/press-releases'}
017 https://gosar.house.gov/: {'/news/email'}
018 https://crawford.house.gov/: set()
019 https://hill.house.gov/: {'/news/documentquery.aspx?DocumentTypeID=27'}
020 https://womack.house.gov/: {'/News/DocumentQuery.aspx?DocumentTypeID=2067'}
021 https://westerman.house.gov/: {'/media-center/press-releases'}
022 https://lamalfa.house.gov: {'/media-center/press-r

Our goal is to find out which congresspeople have releases mentioning "data".

In [31]:
def paragraph_mentions(text: str, keyword: str) -> bool:
  """Return True if a <p> inside the text mentions {keyword}"""
  soup = BeautifulSoup(text, "html5lib")
  paragraphs = [p.get_text() for p in soup('p')]

  return any(keyword.lower() in paragraph.lower() 
             for paragraph in paragraphs)

Write a quick test 

In [32]:
text = """<body><h1>Facebook</h1><p>Twitter</p>"""
assert paragraph_mentions(text, "twitter")        # is inside a <p>
assert not paragraph_mentions(text, "facebook")   # not inside a <p>

Find the relevant congresspeople and give their names to the VP

In [33]:
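# scraping time: 2 minutes 43 secs (100 websites). found 2 websites
# we use User-Agent to avoid RemoteDisconnected
from urllib.parse import urljoin

for house_url, pr_links in press_releases.items():
  for pr_link in pr_links:
    # urljoin copes with trailing slashes and with links that are already absolute
    url = urljoin(house_url, pr_link)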
    # print(f"{url}: ", end="")
    text = requests.get(url, headers={"User-Agent": user_agent}).text

    if paragraph_mentions(text, 'data'):
      print(f"{house_url}")
      break # done with this house_url

https://duarte.house.gov
https://cammack.house.gov
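
The press-release links collected earlier are a mix of relative paths (`/media/press-releases`) and absolute URLs, and some site URLs end with `/` while others do not. A short sketch of how `urllib.parse.urljoin` from the standard library combines them, using three site and link pairs copied from the scraped output above:

```python
from urllib.parse import urljoin

# (site, link) pairs taken from the scraped output earlier in this section
pairs = [("https://carl.house.gov", "/media/press-releases"),
         ("https://sewell.house.gov/", "/press-releases"),
         ("https://rubengallego.house.gov/",
          "https://rubengallego.house.gov/media-center/press-releases")]

full_urls = [urljoin(site, link) for site, link in pairs]
for full_url in full_urls:
  print(full_url)
```

Joining the same pairs with a plain `/` would give a double slash in the first two cases and an invalid URL in the third.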


## Using APIs

API = Application Programming Interface.  
Two response formats are popular when we use an API:
- JSON
  ```js
  {
    "title": "Data Science Book",
    "author": "Joel Grus", 
    "publicationYear": 2019,
    "topics": ["data", "science", "data science"]
  }
  ```

- XML
  ```xml
  <Book>
    <Title>Data Science Book</Title>
    <Author>Joel Grus</Author>
    <PublicationYear>2014</PublicationYear>
    <Topics>
      <Topic>data</Topic>
      <Topic>science</Topic>
      <Topic>data science</Topic>
    </Topics>
  </Book>
  ```

### JSON and XML

JSON = JavaScript Object Notation. A serialization format to transfer data

In [34]:
serialized = """{
  "title": "Data Science Book",
  "author": "Joel Grus", 
  "publicationYear": 2019,
  "topics": ["data", "science", "data science"]}"""

# parse the JSON to create a Python dict
deserialized = json.loads(serialized)

assert deserialized["publicationYear"] == 2019
assert "data science" in deserialized["topics"]
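
The opposite direction works the same way. A minimal sketch, using the same book record, of serializing a Python `dict` back to a JSON string with `json.dumps`:

```python
import json

book = {"title": "Data Science Book",
        "author": "Joel Grus",
        "publicationYear": 2019,
        "topics": ["data", "science", "data science"]}

serialized_again = json.dumps(book)          # dict -> JSON string
restored = json.loads(serialized_again)      # JSON string -> dict

print(serialized_again)
```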

To parse XML, we can use `BeautifulSoup`, similar to the previous steps
when we parsed HTML
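
The XML case is not worked out in these notes. As a standard-library alternative to `BeautifulSoup` (a swapped-in technique, not the approach described above), `xml.etree.ElementTree` can parse the book example. This sketch assumes the document is well-formed, that is, it ends with a closing `</Book>` tag:

```python
import xml.etree.ElementTree as ET

xml_text = """<Book>
  <Title>Data Science Book</Title>
  <Author>Joel Grus</Author>
  <PublicationYear>2014</PublicationYear>
  <Topics>
    <Topic>data</Topic>
    <Topic>science</Topic>
    <Topic>data science</Topic>
  </Topics>
</Book>"""

root = ET.fromstring(xml_text)                          # root is the <Book> element
title = root.findtext("Title")                          # text of a direct child
year = int(root.findtext("PublicationYear"))            # XML text is always a string
topics = [topic.text for topic in root.iter("Topic")]   # all nested <Topic> elements

print(title, year, topics)
```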

### Using an Unauthenticated API

To see an example response, please visit the REST API documentation   
https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28

In [35]:
github_user = "joelgrus"
endpoint = f"https://api.github.com/users/{github_user}/repos"

# the output is all the repos
repos = json.loads(requests.get(endpoint).text)
repos

[{'id': 112873601,
  'node_id': 'MDEwOlJlcG9zaXRvcnkxMTI4NzM2MDE=',
  'name': 'advent2017',
  'full_name': 'joelgrus/advent2017',
  'private': False,
  'owner': {'login': 'joelgrus',
   'id': 1308313,
   'node_id': 'MDQ6VXNlcjEzMDgzMTM=',
   'avatar_url': 'https://avatars.githubusercontent.com/u/1308313?v=4',
   'gravatar_id': '',
   'url': 'https://api.github.com/users/joelgrus',
   'html_url': 'https://github.com/joelgrus',
   'followers_url': 'https://api.github.com/users/joelgrus/followers',
   'following_url': 'https://api.github.com/users/joelgrus/following{/other_user}',
   'gists_url': 'https://api.github.com/users/joelgrus/gists{/gist_id}',
   'starred_url': 'https://api.github.com/users/joelgrus/starred{/owner}{/repo}',
   'subscriptions_url': 'https://api.github.com/users/joelgrus/subscriptions',
   'organizations_url': 'https://api.github.com/users/joelgrus/orgs',
   'repos_url': 'https://api.github.com/users/joelgrus/repos',
   'events_url': 'https://api.github.com/users/j

How many repos does the account (Grus) have?

In [37]:
len(repos)

30
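
A result of exactly 30 is worth a second look: the GitHub REST API paginates list endpoints and returns 30 items per page by default, so this may be only the first page. A sketch of collecting every page, assuming the endpoint accepts the `page` and `per_page` query parameters. `fetch_all_pages` and `fake_fetch` are hypothetical helpers, and a fake data source stands in for the network call:

```python
from typing import Callable, Dict, List

def fetch_all_pages(fetch_page: Callable[[int], List[Dict]]) -> List[Dict]:
  """Call fetch_page(1), fetch_page(2), ... until a page comes back empty"""
  results: List[Dict] = []
  page = 1
  while True:
    batch = fetch_page(page)
    if not batch:
      break
    results.extend(batch)
    page += 1
  return results

# Against the real API, fetch_page could be something like (assumption):
#   lambda page: requests.get(endpoint,
#                             params={"per_page": 100, "page": page}).json()
# Here a fake with 70 repos, served 30 per page, replaces the network call.
fake_repos = [{"id": i} for i in range(70)]

def fake_fetch(page: int) -> List[Dict]:
  start = (page - 1) * 30
  return fake_repos[start:start + 30]

all_repos = fetch_all_pages(fake_fetch)
print(len(all_repos))
```

Passing the fetcher in as a function keeps the loop testable without hitting the API's rate limit.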

Find out in which months and on which days of the week Grus (the author of
the main reference) is most likely to create a repository

In [36]:
dates = [parse(repo["created_at"]) for repo in repos]
month_counts = Counter(date.month for date in dates)
weekday_counts = Counter(date.weekday() for date in dates)

print(month_counts)
print(weekday_counts)

Counter({11: 8, 12: 5, 9: 5, 7: 3, 2: 2, 5: 2, 8: 2, 1: 1, 3: 1, 6: 1})
Counter({2: 7, 5: 6, 1: 6, 4: 5, 6: 5, 3: 1})
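
The counters use numbers for months and weekdays. A small sketch that turns the most frequent ones into names with the standard `calendar` module, with the counts copied from the output above (the names follow the current locale, English by default):

```python
import calendar
from collections import Counter

# Counts copied from the output above
month_counts = Counter({11: 8, 12: 5, 9: 5, 7: 3, 2: 2, 5: 2, 8: 2, 1: 1, 3: 1, 6: 1})
weekday_counts = Counter({2: 7, 5: 6, 1: 6, 4: 5, 6: 5, 3: 1})

top_month, _ = month_counts.most_common(1)[0]
top_weekday, _ = weekday_counts.most_common(1)[0]

# date.weekday() numbers Monday as 0, which matches calendar.day_name
top_month_name = calendar.month_name[top_month]
top_weekday_name = calendar.day_name[top_weekday]
print(top_month_name, top_weekday_name)
```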


Get the programming languages of the account's last five repositories

In [38]:
last_5_repositories = sorted(repos,
                             key=lambda r: r["pushed_at"],
                             reverse=True)[:5]
last_5_languages = [repo["language"] for repo in last_5_repositories]
last_5_languages

['Python', 'Python', 'Svelte', 'Python', 'Python']

### Finding APIs

If you need data from a specific site, look for a "developers" or "API" section   
of the site for details, and try searching the web for "python <sitename> api"  
to find a library.

There are libraries for the Instagram API, for the Spotify API, and so on.

If you're looking for a list of APIs that have Python wrappers, there's a nice  
one from [Real Python on GitHub](https://github.com/realpython/list-of-python-api-wrappers)

And if you can't find what you need, there's always scraping, the last refuge of the   
data scientist.