In [1]:
import pandas as pd
import numpy as np
import os

import requests
import bs4
import json
from IPython.display import HTML, Image

# Lecture 16 – Parsing, Regular Expressions

## DSC 80, Spring 2022

### Announcements

- Discussion 5 (on web scraping) is **today from 7-8:30PM**, and is due (for extra credit) on **Saturday, May 7th at 11:59PM**.
- Lab 6 is due on **Monday, May 9th at 11:59PM**.
- Project 3 is released, and is due on **Thursday, May 12th at 11:59PM**.
    - See [dsc80.com/project3](https://dsc80.com/project3/) for all the details.
    - **Try and download your data today!** There is no checkpoint, so you will have to hold yourself accountable.
- Midterm Exam grades are released! See [#935](https://campuswire.com/c/G325FA25B/feed/935) for details.
- Later this week, expect to see a "Grade Report" that contains a summary of your scores on all assignments this quarter along with a slip day counter.

### Agenda

- Example: Scraping the HDSI Faculty page.
- Example: Scraping quotes.
- String methods.
- Regular expressions.

## Example: Scraping the HDSI Faculty page

### HDSI Faculty page

Let's try and extract a list of HDSI Faculty from https://datascience.ucsd.edu/about/faculty/faculty/.

A good first step is to use the "inspect element" tool in our web browser.

In [2]:
fac_response = requests.get('https://datascience.ucsd.edu/about/faculty/faculty/')
fac_response

<Response [200]>

In [3]:
soup = bs4.BeautifulSoup(fac_response.text)

It seems like the relevant `<div>`s for faculty are the ones where the `data-entry-type` attribute is equal to `'individual'`. Let's find all of those using `find_all`.

In [4]:
divs = soup.find_all('div', attrs={'data-entry-type': 'individual'})

In [5]:
divs[0]

<div class="cn-list-row cn-list-item vcard individual faculty lecturers" data-entry-id="229" data-entry-slug="rod-albuyeh" data-entry-type="individual" id="rod-albuyeh">
<div class="cn-entry cn-accordion" id="entry-id-2296277514243b34">
<div class="cn-left" style="min-width: 215px;">
<span class="cn-image-style"><span style="display: block; max-width: 100%; width: 215px"><img alt="Photo of Rod Albuyeh" class="cn-image photo" height="215" lazyload="1" loading="lazy" sizes="100vw" srcset="//datascience.ucsd.edu/wp-content/uploads/connections-images/rod-albuyeh/Rod-Albuyeh-Web-07dd8c651b197a11107f1c858ce1e390.jpg 1x" title="Photo of Rod Albuyeh" width="215"/></span></span>
</div> <!-- end cn-left-->
<div class="cn-right">
<h3 style="border-bottom: #182A48 1px solid; color:#182A48;"><a href="https://datascience.ucsd.edu/about/faculty/faculty/name/rod-albuyeh/" title="Rod Albuyeh"><span class="fn n notranslate"><span class="given-name">Rod</span> <span class="family-name">Albuyeh</span></sp

Within here, we need to extract each faculty member's name. It seems like names are stored in the `title` attribute within an `<a>` tag.

In [6]:
divs[0].find('a')

<a href="https://datascience.ucsd.edu/about/faculty/faculty/name/rod-albuyeh/" title="Rod Albuyeh"><span class="fn n notranslate"><span class="given-name">Rod</span> <span class="family-name">Albuyeh</span></span></a>

In [7]:
divs[0].find('a').get('title')

'Rod Albuyeh'

We can also extract job titles:

In [8]:
divs[0].find('h4')

<h4 class="title">Lecturer</h4>

In [9]:
divs[0].find('h4').text

'Lecturer'

And bios:

In [None]:
divs[0].find('div', attrs={'class': 'cn-bio'})

In [None]:
divs[0].find('div', attrs={'class': 'cn-bio'}).text.strip()

Let's create a DataFrame consisting of names, titles, and bios for each faculty member.

In [None]:
names = [div.find('a').get('title') for div in divs]
names[:5]

In [None]:
titles = [div.find('h4').text if div.find('h4') else '' for div in divs]

In [None]:
bios = [div.find('div', attrs={'class': 'cn-bio'}).text.strip() for div in divs]

In [None]:
faculty = pd.DataFrame().assign(name=names, title=titles, bio=bios)
faculty.head()

Now we have a DataFrame!

In [None]:
faculty[faculty['title'] == 'Lecturer']

What if we want to get faculty members' pictures? It seems like we should look at the attributes of an `<img>` tag.

In [None]:
divs[0].find('img')

In [None]:
divs[0].find('img').get('srcset')

In [None]:
def show_picture(name):
    idx = names.index(name)
    url = divs[idx].find('img').get('srcset')
    # srcset looks like '//host/path.jpg 1x'; keep only the URL part before the space.
    url = 'https:' + url.split()[0]
    display(Image(url))

In [None]:
show_picture('Suraj Rampure')

## Example: Scraping quotes

### Example: Scraping quotes

Let's scrape quotes from https://quotes.toscrape.com/.

<center><img src="imgs/quotes2scrape.png" width=60%></center>

Specifically, let's try to make a DataFrame that looks like the one below:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>quote</th>
      <th>author</th>
      <th>author_url</th>
      <th>tags</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>change,deep-thoughts,thinking,world</td>
    </tr>
    <tr>
      <th>1</th>
      <td>“It is our choices, Harry, that show what we truly are, far more than our abilities.”</td>
      <td>J.K. Rowling</td>
      <td>https://quotes.toscrape.com/author/J-K-Rowling</td>
      <td>abilities,choices</td>
    </tr>
    <tr>
      <th>2</th>
      <td>“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>inspirational,life,live,miracle,miracles</td>
    </tr>
  </tbody>
</table>

### The plan

Eventually, we will create a single function – `quote_df` – which takes in an integer `n` and returns a **DataFrame** with the quotes on the **first `n` pages** of https://quotes.toscrape.com/.

To do this, we will define several helper functions:
- `download_page(i)`, which downloads a **single page** (page `i`) and returns a `BeautifulSoup` object of the response.
- `process_quote(div)`, which takes in a `<div>` tree corresponding to a **single quote** and returns a Series containing all of the relevant information for that quote.
- `process_page(divs)`, which takes in a list of `<div>` trees corresponding to a **single page** and returns a DataFrame containing all of the relevant information for all quotes on that page.

Key principle: some of our helper functions will make **requests**, and others will **parse**, but none will do both! 
- Easier to debug and catch errors.
- Avoids unnecessary requests.
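
One payoff of this separation is that a parsing function can be tried out on a saved or hand-written HTML string, with no network access at all. The sketch below illustrates that idea under some assumptions: it uses the standard library's `html.parser` as a stand-in for `BeautifulSoup` so that it runs without third-party packages, and `QuoteTextParser`, `parse_quotes`, and `sample_html` are hypothetical names that are not part of the lecture's code.

```python
from html.parser import HTMLParser

class QuoteTextParser(HTMLParser):
    """Collects the text inside every <span class="text"> element."""

    def __init__(self):
        super().__init__()
        self.in_quote = False
        self.quotes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs.
        if tag == 'span' and dict(attrs).get('class') == 'text':
            self.in_quote = True
            self.quotes.append('')

    def handle_endtag(self, tag):
        if tag == 'span':
            self.in_quote = False

    def handle_data(self, data):
        if self.in_quote:
            self.quotes[-1] += data

def parse_quotes(html):
    """Parses only; it never makes a request."""
    parser = QuoteTextParser()
    parser.feed(html)
    return parser.quotes

# A hand-written page fragment (an assumption, loosely modeled on the site's markup).
sample_html = '''
<div class="quote"><span class="text">First quote.</span><small class="author">A. Author</small></div>
<div class="quote"><span class="text">Second quote.</span><small class="author">B. Author</small></div>
'''

quotes = parse_quotes(sample_html)
```

Because `parse_quotes` takes a string rather than a URL, a bug in the parsing logic can be reproduced and fixed without re-downloading anything.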

### Aside: f-strings in Python

- f-strings in Python provide a convenient way to format strings.
- To create an f-string, create a string with the character `f` **right before** the opening quote. Then, anything in the subsequent string that is inside `{curly brackets}` will be evaluated. 

In [None]:
f'2 + 3 = {2 + 3}'

In [None]:
def make_greeting(name):
    return f"Hi {name}! 👋 Your name has {len(name)} characters, the first of which is {name[0]}."

In [None]:
make_greeting('Billy')

### Downloading a single page

In [None]:
def download_page(i):
    url = f'https://quotes.toscrape.com/page/{i}'
    response = requests.get(url)
    return bs4.BeautifulSoup(response.text)

In `quote_df`, we will call `download_page` repeatedly – once for `i=1`, once for `i=2`, ..., and once for `i=n`. For now, we will work with just page 5 (chosen arbitrarily).

In [None]:
soup = download_page(5)

### Parsing a single page

Let's look at the page's source code (via "inspect element") to find where the quotes in the page are located.

In [None]:
divs = soup.find_all('div', attrs={'class': 'quote'})

In [None]:
divs[0]

From this `<div>`, we can extract the quote, author name, author's URL, and tags.

In [None]:
divs[0].find('span', attrs={'class': 'text'}).text

In [None]:
divs[0].find('small', attrs={'class': 'author'}).text

In [None]:
divs[0].find('a').get('href')

In [None]:
divs[0].find('meta', attrs={'class': 'keywords'}).get('content')

Let's write an intermediate function, `process_quote`, which takes in a `<div>` corresponding to a single quote and returns a **Series** containing the quote's information.

Note that this approach is different from the approach taken in the HDSI Faculty page example – there, we created each column of our final DataFrame separately, while here we are creating one **row** of our final DataFrame at a time.

In [None]:
def process_quote(div):
    quote = div.find('span', attrs={'class': 'text'}).text
    author = div.find('small', attrs={'class': 'author'}).text
    author_url = 'https://quotes.toscrape.com' + div.find('a').get('href')
    tags = div.find('meta', attrs={'class': 'keywords'}).get('content')
    
    return pd.Series({'quote': quote, 'author': author, 'author_url': author_url, 'tags': tags})

In [None]:
process_quote(divs[3])

Next, we can write a function that takes in a list of `<div>`s, calls the above function on each `<div>` in the list, and returns a **DataFrame**.

In [None]:
def process_page(divs):
    return pd.DataFrame([process_quote(div) for div in divs])

In [None]:
process_page(divs)

### Putting it all together

In [None]:
def quote_df(n):
    '''Returns a DataFrame containing the quotes on the first n pages of https://quotes.toscrape.com/.'''
    dfs = []
    for i in range(1, n + 1):
        # Download page i and create a BeautifulSoup object
        soup = download_page(i)
        
        # Create DataFrame using the information in that page
        divs = soup.find_all('div', attrs={'class': 'quote'})
        df = process_page(divs)
        
        # Append DataFrame to dfs
        dfs.append(df)
        
    # Stitch all DataFrames together
    return pd.concat(dfs).reset_index(drop=True)

In [None]:
first_three_pages = quote_df(3)
first_three_pages.head()

The elements in the `'tags'` column are all strings, but they look like lists. This is not ideal, as we will see shortly.

### An extension

We could:
- Request information about each of the **authors** in the DataFrame.
    - See https://quotes.toscrape.com/author/Albert-Einstein/ for an example.
- Create a DataFrame of author information.
- Merge that DataFrame with `first_three_pages`.

In [None]:
np.unique(first_three_pages['author_url'])

In [None]:
einstein = bs4.BeautifulSoup(requests.get('https://quotes.toscrape.com/author/Albert-Einstein').text)

In [None]:
einstein.find('div', attrs={'class': 'author-description'}).text[:1000]

### Key takeaways

* Make as few requests as possible.
* Create a request and parsing plan **beforehand**.
* Create your output schema **beforehand**.
* Make requests and parse in **separate functions**!
* See Lab 6, Question 2 for a related example.
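
One possible way to act on "make as few requests as possible" is to cache downloads, so that asking for the same page twice triggers only one request. The sketch below is an illustration, not the lecture's implementation: `fetch` is a hypothetical stand-in for `requests.get(url).text` that counts calls instead of touching the network, and `download_page_cached` is a made-up name.

```python
from functools import lru_cache

request_count = 0

def fetch(url):
    """Hypothetical stand-in for requests.get(url).text; counts calls instead of using the network."""
    global request_count
    request_count += 1
    return f'<html>contents of {url}</html>'

@lru_cache(maxsize=None)
def download_page_cached(i):
    # The first call for a given i runs fetch; later calls reuse the stored result.
    return fetch(f'https://quotes.toscrape.com/page/{i}')

# Five calls, but only two distinct pages.
for i in [1, 2, 1, 1, 2]:
    download_page_cached(i)
```

After the loop, `request_count` is 2 rather than 5, since pages 1 and 2 were each fetched once.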

## Nested vs. flat data formats

### Nested vs. flat data formats

- **Nested** data formats, like HTML, JSON, and XML, allow us to represent hierarchical relationships between variables.

* **Flat** (i.e. tabular) data formats, like CSV, do not.

<center><img src="imgs/hierarchy.png" width=40%></center>

### Example: Scraping quotes, again

- Suppose we obtained the quotes data via an API and saved it to the file `data/quotes2scrape.json`.
- `quotes2scrape.json` is a **JSON records** file; each line is a valid JSON object, **but the entire document is not**.
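
To make the "each line is valid, but the whole document is not" point concrete, here is a small sketch using a made-up two-record string in place of the real file. The field names and values in `records_text` are assumptions chosen for illustration.

```python
import json

# Hypothetical file contents: one JSON object per line.
records_text = '''{"quote_auth": "Albert Einstein", "tags": ["change", "world"]}
{"quote_auth": "J.K. Rowling", "tags": ["abilities", "choices"]}
'''

# Each line parses on its own...
records = [json.loads(line) for line in records_text.splitlines()]

# ...but the document as a whole does not, because two top-level
# objects in a row are not a single valid JSON value.
try:
    json.loads(records_text)
    whole_document_is_valid = True
except json.JSONDecodeError:
    whole_document_is_valid = False
```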

In [None]:
f = open(os.path.join('data', 'quotes2scrape.json'))

In [None]:
json.loads(f.readline())

Note that for a single quote, we have keys for `'auth_url'`, `'quote_auth'`, `'quote_text'`, `'bio'`, `'dob'`, and `'tags'`.

Since each line is a separate JSON object, let's read in each line one at a time.

In [None]:
L = [json.loads(x) for x in open(os.path.join('data', 'quotes2scrape.json'))]

Let's convert the result to a DataFrame.

In [None]:
df = pd.DataFrame(L)
df.head()

What data type is the `'tags'` column?

In [None]:
df['tags'].iloc[0]

Let's save `df` to a CSV and read it back in.

In [None]:
df.to_csv('out.csv')

In [None]:
df_again = pd.read_csv('out.csv')
df_again.head()

What data type is the `'tags'` column now?

In [None]:
df_again['tags'].iloc[0]
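
CSV is a flat text format, so it has no way to store a list as a list. The sketch below shows the same kind of round trip using the standard library's `csv` module as a stand-in for `to_csv` and `read_csv`; the row contents are made up, and `ast.literal_eval` is one possible way to recover the list, not necessarily the approach used in this course.

```python
import ast
import csv
import io

row = {'quote_auth': 'Albert Einstein', 'tags': ['change', 'world']}

# Write one row to an in-memory CSV "file". The list is stored as its string form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['quote_auth', 'tags'])
writer.writeheader()
writer.writerow(row)

# Read it back: every field is now a string.
buffer.seek(0)
row_again = next(csv.DictReader(buffer))
tags_as_string = row_again['tags']

# ast.literal_eval evaluates a string containing a Python literal, without running arbitrary code.
tags_as_list = ast.literal_eval(tags_as_string)
```

Here `tags_as_string` is the string `"['change', 'world']"`, while `tags_as_list` is a real list again.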

### One-hot encoding

- So that we don't have to deal with lists within Series, we can **flatten** lists of tags so that there is **one column per tag**.
    - For example, consider the tag `'inspirational'`.
    - If a quote has a 1 in the `'inspirational'` column, it **was** tagged `'inspirational'`.
    - If a quote has a 0 in the `'inspirational'` column, it **was not** tagged `'inspirational'`.
- This process – of converting categorical variables into columns of 1s and 0s – is called **one-hot encoding**. We will revisit it in a few weeks.
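
Before doing this with pandas, it may help to see the idea in plain Python. This is a minimal sketch on made-up tag lists; `tag_lists`, `distinct`, and `one_hot` are illustrative names only.

```python
# Hypothetical tags for three quotes.
tag_lists = [
    ['inspirational', 'life'],
    ['life'],
    ['humor', 'inspirational'],
]

# Every distinct tag becomes one column, in sorted order.
distinct = sorted({tag for tags in tag_lists for tag in tags})

# One row of 1s and 0s per quote: 1 if the quote has the tag, 0 otherwise.
one_hot = [{tag: int(tag in tags) for tag in distinct} for tags in tag_lists]
```

The first quote becomes `{'humor': 0, 'inspirational': 1, 'life': 1}`.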

In [None]:
distinct_tags = np.unique(df['tags'].sum())
distinct_tags

Let's write a function that takes in the list of tags (`taglist`) for a given quote and returns the one-hot-encoded sequence of 1s and 0s for that quote.

In [None]:
def flatten_tags(taglist):
    return pd.Series({k:1 for k in taglist}, dtype=float)

tags = df['tags'].apply(flatten_tags).fillna(0).astype(int)
tags.head()

Let's combine this one-hot-encoded DataFrame with `df`.

In [None]:
df_full = pd.concat([df, tags], axis=1).drop(columns='tags')
df_full.head()

If we want all quotes tagged `'inspirational'`, we can simply query:

In [None]:
df_full[df_full['inspirational'] == 1].head()

Note that this DataFrame representation of the response JSON takes up much more space than the original JSON. Why is that?

## String methods, revisited

### Transitioning

<center><img src="imgs/DSLC.png" width="30%"></center>

- We've spent a lot of time at the "Find and Clean Data" stage.
- Today, we're going to start working with text data.
    - First, it will be in the context of cleaning data.
    - Starting next week, it will be in the context of modeling and prediction.

### Joining on text

Consider the following two DataFrames (see [this presentation](https://docs.google.com/presentation/d/1xQsqa7e3xDZ9nBiekbSBOecwvQm8pSVGa-FBoV6aJ7E/edit#slide=id.g11197671c7e_0_813) for inspiration).

In [None]:
codes = pd.read_csv(os.path.join('data', 'codes.csv'))
programs = pd.read_csv(os.path.join('data', 'programs.csv'))

display(codes)
display(programs)

What would happen if we try to merge the two DataFrames on `'department'`?

In [None]:
codes.merge(programs, on='department')

### String canonicalization

- One solution is to **canonicalize** both `'department'` columns, so that there is just a single way to format each department's name **in both DataFrames**. 
- We can do this by implementing a `canonicalize_department` function, which takes in a department's name as a string and reformats it.
- `canonicalize_department` should:
    - Fix cases (upper vs. lower).
    - Standardize variants of words – e.g. `'eng.'` vs `'engineering'`.
    - Fix punctuation – e.g. `'&'` vs. `'and'`.

In [None]:
display(codes)
display(programs)

In [None]:
def canonicalize_department(d):
    return (d
           .lower()
           .replace('sci.', 'science')
           .replace('stud.', 'studies')
           .replace('eng.', 'engineering')
           .replace('&', 'and')
           .replace('(', '- ')
           .replace(')', '')
           )

In [None]:
codes['department_clean'] = codes['department'].apply(canonicalize_department)
programs['department_clean'] = programs['department'].apply(canonicalize_department)

display(codes)
display(programs)

Now, we can join `codes` with `programs` on `'department_clean'`.

In [None]:
codes.merge(programs, on='department_clean')

### Reflection

The process of **string canonicalization** is very brittle. 
- `canonicalize_department` was hyper-specific to the four department names we had access to. 
- We don't know if it'll work for other departments.
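
To see the brittleness concretely, the sketch below repeats the same replacements and applies them to two spellings of a department name that are assumed here for illustration and do not come from `codes` or `programs`.

```python
def canonicalize_department(d):
    # Same replacements as the function defined earlier in the lecture.
    return (d
            .lower()
            .replace('sci.', 'science')
            .replace('stud.', 'studies')
            .replace('eng.', 'engineering')
            .replace('&', 'and')
            .replace('(', '- ')
            .replace(')', '')
            )

# Hypothetical inputs: two ways of writing the same department.
a = canonicalize_department('Comp. Sci. & Eng.')
b = canonicalize_department('Computer Science and Engineering')
```

Since there is no rule for `'comp.'`, `a` is `'comp. science and engineering'` while `b` is `'computer science and engineering'`, so a merge on the cleaned column would still miss this pair.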

### The limitations of string methods

How can we extract the date and time from the following **log** string, using just Python string methods?

```
132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585
```



### Parsing log strings

In [None]:
s = '''132.249.20.188 - - [05/May/2022:14:26:15 -0800] "GET /my/home/ HTTP/1.1" 200 2585'''

In [None]:
full_date = s.split('[')[1].split(']')[0]
full_date

In [None]:
day, month, rest = full_date.split('/')
day, month, rest

In [None]:
year, hour, minute, second = rest.split(':')
second = second[:2]
year, hour, minute, second

In [None]:
year, day, month, hour, minute, second

Alternatively:

In [None]:
pd.to_datetime(full_date[:-6], format='%d/%b/%Y:%H:%M:%S')
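
Another option, shown here as a sketch, is the standard library's `datetime.strptime`, which can also keep the UTC offset via the `%z` directive instead of slicing it off. This assumes an English locale, since `%b` matches month abbreviations such as `'May'` according to the current locale.

```python
from datetime import datetime, timedelta

full_date = '05/May/2022:14:26:15 -0800'

# %d day, %b abbreviated month name, %Y year, %H:%M:%S time, %z UTC offset.
parsed = datetime.strptime(full_date, '%d/%b/%Y:%H:%M:%S %z')

# The offset is kept as timezone information on the result.
offset_hours = parsed.utcoffset() / timedelta(hours=1)
```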

That was annoying! Let's see if there's a better way to extract the same information.

## Regular expressions

### This works...?

In [None]:
s

In [None]:
import re
re.findall(r'\[(\d+)\/(\w+)\/(\d+):(\d+):(\d+):(\d+)\s.*\]', s)[0]

<center><h2>🤔🤯</h2></center>

### Regular expressions

- A regular expression, or **regex** for short, is a sequence of characters used to **match patterns in strings**.
    - For example, `[1-9][0-9]{2}-[0-9]{3}-[0-9]{4}` matches US phone numbers of the form `'XXX-XXX-XXXX'`.
- They are very powerful and widely used.
- However, they are quite difficult to read.
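
As a quick preview, the phone number pattern above can be tried in Python. This sketch uses `re.fullmatch`, which succeeds only if the entire string matches the pattern; the phone numbers are made up for illustration.

```python
import re

phone = r'[1-9][0-9]{2}-[0-9]{3}-[0-9]{4}'

# Hypothetical candidates: only the first has the form XXX-XXX-XXXX with a nonzero first digit.
candidates = ['858-555-0123', '058-555-0123', '858-5550-123', '8585550123']

results = {c: re.fullmatch(phone, c) is not None for c in candidates}
```

Only `'858-555-0123'` maps to `True`: the second candidate starts with `0`, the third has its hyphen in the wrong place, and the fourth has no hyphens.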

### [regex101.com](https://regex101.com)

- Next class, we will learn how to use regular expressions in Python using the `re` module.
- However, when crafting regular expressions, it is helpful to work in an environment that provides syntax highlighting and details.
- **[regex101.com](https://regex101.com) does exactly that – use it!**
    - [This link](https://regex101.com/r/ESor65/1) will bring you to the phone number example.

### Regex building blocks 🧱

The four main building blocks for all regexes are shown below ([table source](https://www.cs.princeton.edu/courses/archive/spring17/cos226/lectures/54RegularExpressions.pdf), [inspiration](https://docs.google.com/presentation/d/1xQsqa7e3xDZ9nBiekbSBOecwvQm8pSVGa-FBoV6aJ7E/edit#slide=id.g11197671c7e_0_919)).

| operation | order of op. | example | matches ✅ | does not match ❌ |
|:--- |:---|:---|:---|:---|
| <span style='color:purple'><b>concatenation</b></span> | 3 | `AABAAB` | AABAAB | every other string |
| <span style='color:purple'><b>or</b></span> | 4 | `AA\|BAAB` | AA, BAAB | every other string |
| <span style='color:purple'><b>closure</b><br>(zero or more)</span> | 2 | `AB*A` | AA, ABBBBBBA | AB, ABABA |
| <span style='color:purple'><b>parentheses</b></span> | 1 | `A(A\|B)AAB` <hr style="height:1px"> `(AB)*A` | AAAAB, ABAAB<hr style="height:1px">A, ABABABABA | every other string<hr style="height:1px">AA, ABBA |

Note that `|`, `(`, `)`, and `*` are **special characters**, not literals. They manipulate the characters around them.

`AB*A` matches strings with an `'A'`, followed by zero or more `'B'`s, and then an `'A'`. 

✅ `'AA'`, `'ABA'`, `'ABBBBBBBBBBBBBBA'`<br>
❌ `'AB'`, `'ABAB'`

`(AB)*A` matches strings with zero or more `'AB'`s, followed by an `'A'`.

✅ `'A'`, `'ABA'`, `'ABABABABA'`<br>
❌ `'AA'`, `'ABBBBBBBA'`, `'ABAB'`
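
These examples can be checked in Python. The sketch below uses a small hypothetical helper, `matches`, built on `re.fullmatch` so that a string counts only if the pattern matches it from start to end.

```python
import re

def matches(pattern, strings):
    """Returns the strings that the pattern matches in their entirety."""
    return [s for s in strings if re.fullmatch(pattern, s)]

# Closure applies only to the 'B' right before the *.
closure = matches(r'AB*A', ['AA', 'ABA', 'ABBBBBBA', 'AB', 'ABAB'])

# Parentheses make the * apply to the whole group 'AB'.
grouped = matches(r'(AB)*A', ['A', 'ABA', 'ABABABABA', 'AA', 'ABBBBBBBA', 'ABAB'])

# "Or" has the lowest precedence: this is 'AA' or 'BAAB', nothing else.
either = matches(r'AA|BAAB', ['AA', 'BAAB', 'AAB', 'AABAAB'])
```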

### Example 1

Write a regular expression that matches `'billy'`, `'billlly'`, `'billlllly'`, etc.
- First, think about how to match strings with any even number of `'l'`s, including zero `'l'`s (i.e. `'biy'`).
- Then, think about how to match only strings with a **positive even** number of `'l'`s.

<br><br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>
<code>bi(ll)*y</code> will match any even number of <code>'l'</code>s, including 0.
    
To match only a positive even number of <code>'l'</code>s, we'd need to first "fix into place" two <code>'l'</code>s, and then follow that up with zero or more pairs of <code>'l'</code>s. This specifies the regular expression <code>bill(ll)*y</code>.
    </details>

### Example 2

Write a regular expression that matches `'billy'`, `'billlly'`, `'biggy'`, `'biggggy'`, etc.

Specifically, it should match any string with a **positive even** number of `'l'`s in the middle, or a **positive even** number of `'g'`s in the middle.

<br><br>

<details>
<summary>
    ✅ Click here to see the answer <b>after</b> you've tried it yourself at <a href='https://regex101.com'>regex101.com</a>.
</summary>

Possible answers: `bi(ll(ll)*|gg(gg)*)y` or `bill(ll)*y|bigg(gg)*y`.
 
<br>

Note, `bill(ll)*|gg(gg)*y` is <b>not</b> a valid answer! This is because "concatenation" comes before "or" in the order of operations. This regular expression would match strings that match `bill(ll)*`, like `'billll'`, OR strings that match `gg(gg)*y`, like `'ggy'`.

    
</details>
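
Once you have tried Example 2 yourself, the order-of-operations pitfall can be checked directly. This sketch compares one of the possible answers with the invalid one on a few test strings, again using `re.fullmatch` so that the whole string must match.

```python
import re

correct = r'bi(ll(ll)*|gg(gg)*)y'
incorrect = r'bill(ll)*|gg(gg)*y'

tests = ['billy', 'biggggy', 'billll', 'ggy', 'bily']

# Strings fully matched by each pattern.
correct_hits = [s for s in tests if re.fullmatch(correct, s)]
incorrect_hits = [s for s in tests if re.fullmatch(incorrect, s)]
```

The correct pattern accepts `'billy'` and `'biggggy'`, while the invalid one accepts `'billll'` and `'ggy'` instead, because `|` splits the entire expression into two alternatives.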

## Summary, next time

### Summary

- When writing scraping code:
     - Use "inspect element" to identify the names of tags and attributes that are relevant to the information you want to extract.
     - Separate your logic for making requests and for parsing.
- Regular expressions allow us to match patterns in strings.
- **Next time:** More regex syntax. Using regex in Python.
    - You **don't** need to memorize syntax; you just need to know what is possible.
    - We will look at some "cheat sheets" next class.
- **For fun (and practice):** Play [Regex Golf](https://alf.nu/RegexGolf?world=regex&level=r00)!