In [6]:
import pandas as pd
import numpy as np
import os
import requests
import json
import bs4

from IPython.display import display, Image

## Parsing HTML using Beautiful Soup

### Example: Scraping quotes

Let's scrape quotes from https://quotes.toscrape.com/.


Specifically, let's try to make a DataFrame that looks like the one below:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>quote</th>
      <th>author</th>
      <th>author_url</th>
      <th>tags</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>change,deep-thoughts,thinking,world</td>
    </tr>
    <tr>
      <th>1</th>
      <td>“It is our choices, Harry, that show what we truly are, far more than our abilities.”</td>
      <td>J.K. Rowling</td>
      <td>https://quotes.toscrape.com/author/J-K-Rowling</td>
      <td>abilities,choices</td>
    </tr>
    <tr>
      <th>2</th>
      <td>“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>inspirational,life,live,miracle,miracles</td>
    </tr>
  </tbody>
</table>

### The plan

Eventually, we will create a single function – `quote_df` – which takes in an integer `n` and returns a **DataFrame** with the quotes on the **first `n` pages** of https://quotes.toscrape.com/.

To do this, we will define several helper functions:

- `download_page(i)`, which downloads a **single page** (page `i`) and returns a `BeautifulSoup` object of the response.

- `process_quote(div)`, which takes in a `<div>` tree corresponding to a **single quote** and returns a Series containing all of the relevant information for that quote.

- `process_page(divs)`, which takes in a list of `<div>` trees corresponding to a **single page** and returns a DataFrame containing all of the relevant information for all quotes on that page.

Key principle: some of our helper functions will make **requests**, and others will **parse**, but none will do both! 
- Easier to debug and catch errors.
- Avoids unnecessary requests.

### Downloading a single page

In [7]:
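def download_page(i):
    url = f'https://quotes.toscrape.com/page/{i}'
    # requests.get returns a Response object, so name it accordingly.
    response = requests.get(url)
    # Name the parser explicitly so that bs4 does not have to guess one.
    return bs4.BeautifulSoup(response.text, 'html.parser')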

In `quote_df`, we will call `download_page` repeatedly, once for each of `i = 1`, `i = 2`, ..., `i = n`. For now, we will work with just page 5 (chosen arbitrarily).

In [8]:
soup = download_page(5)

### Parsing a single page

Let's look at the page's source code (via "inspect element") to find where the quotes in the page are located.

In [9]:
divs = soup.find_all('div', attrs={'class': 'quote'})

In [10]:
divs[0]

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”</span>
<span>by <small class="author" itemprop="author">George R.R. Martin</small>
<a href="/author/George-R-R-Martin">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="read,readers,reading,reading-books" itemprop="keywords"/>
<a class="tag" href="/tag/read/page/1/">read</a>
<a class="tag" href="/tag/readers/page/1/">readers</a>
<a class="tag" href="/tag/reading/page/1/">reading</a>
<a class="tag" href="/tag/reading-books/page/1/">reading-books</a>
</div>
</div>

From this `<div>`, we can extract the quote, author name, author's URL, and tags.

In [11]:
divs[0].find('span', attrs={'class': 'text'}).text

'“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”'

In [12]:
divs[0].find('small', attrs={'class': 'author'}).text

'George R.R. Martin'

In [13]:
divs[0].find('a').get('href')

'/author/George-R-R-Martin'

In [14]:
divs[0].find('meta', attrs={'class': 'keywords'}).get('content')

'read,readers,reading,reading-books'
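The same four extractions can be tried offline on a hard-coded snippet, which is handy for testing parsing code without making a request. The sketch below is only an illustration: the `html` string is a trimmed copy of the `<div>` shown above, and it uses CSS selectors (`select_one`) as an alternative to `find`. Selecting the author link by its `href` prefix assumes that author links are the only ones starting with `/author/`; it avoids relying on the author link being the first `<a>` in the `<div>`.

```python
import bs4

# Trimmed copy of one quote <div>, hard-coded so that no request is needed.
html = '''
<div class="quote">
<span class="text">“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”</span>
<span>by <small class="author">George R.R. Martin</small>
<a href="/author/George-R-R-Martin">(about)</a>
</span>
<div class="tags">
<meta class="keywords" content="read,readers,reading,reading-books"/>
<a class="tag" href="/tag/read/page/1/">read</a>
</div>
</div>
'''

div = bs4.BeautifulSoup(html, 'html.parser').find('div', attrs={'class': 'quote'})

quote = div.select_one('span.text').text
author = div.select_one('small.author').text
# Assumption: only author links have an href that starts with '/author/'.
author_href = div.select_one('a[href^="/author/"]').get('href')
tags = div.select_one('meta.keywords').get('content')
```

Because the input is a plain string, this kind of check can run as often as needed without sending any traffic to the site.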

Let's implement our next function, `process_quote`, which takes in a `<div>` corresponding to a single quote and returns a **Series** containing the quote's information.

Note that this approach is different from the approach taken in the HDSI Faculty page example – there, we created each column of our final DataFrame separately, while here we are creating one **row** of our final DataFrame at a time.

In [15]:
def process_quote(div):
    quote = div.find('span', attrs={'class': 'text'}).text
    author = div.find('small', attrs={'class': 'author'}).text
    author_url = 'https://quotes.toscrape.com' + div.find('a').get('href')
    tags = div.find('meta', attrs={'class': 'keywords'}).get('content')
    
    return pd.Series({'quote': quote, 'author': author, 'author_url': author_url, 'tags': tags})

In [16]:
process_quote(divs[3])

quote         “If you can make a woman laugh, you can make h...
author                                           Marilyn Monroe
author_url    https://quotes.toscrape.com/author/Marilyn-Monroe
tags                                                 girls,love
dtype: object

Our last helper function will take in a **list** of `<div>`s, call `process_quote` on each `<div>` in the list, and return a **DataFrame**.

In [17]:
def process_page(divs):
    return pd.DataFrame([process_quote(div) for div in divs])

In [18]:
process_page(divs)

Unnamed: 0,quote,author,author_url,tags
0,“A reader lives a thousand lives before he die...,George R.R. Martin,https://quotes.toscrape.com/author/George-R-R-...,"read,readers,reading,reading-books"
1,“You can never get a cup of tea large enough o...,C.S. Lewis,https://quotes.toscrape.com/author/C-S-Lewis,"books,inspirational,reading,tea"
2,“You believe lies so you eventually learn to t...,Marilyn Monroe,https://quotes.toscrape.com/author/Marilyn-Monroe,
3,"“If you can make a woman laugh, you can make h...",Marilyn Monroe,https://quotes.toscrape.com/author/Marilyn-Monroe,"girls,love"
4,“Life is like riding a bicycle. To keep your b...,Albert Einstein,https://quotes.toscrape.com/author/Albert-Eins...,"life,simile"
5,“The real lover is the man who can thrill you ...,Marilyn Monroe,https://quotes.toscrape.com/author/Marilyn-Monroe,love
6,"“A wise girl kisses but doesn't love, listens ...",Marilyn Monroe,https://quotes.toscrape.com/author/Marilyn-Monroe,attributed-no-source
7,“Only in the darkness can you see the stars.”,Martin Luther King Jr.,https://quotes.toscrape.com/author/Martin-Luth...,"hope,inspirational"
8,"“It matters not what someone is born, but what...",J.K. Rowling,https://quotes.toscrape.com/author/J-K-Rowling,dumbledore
9,“Love does not begin and end the way we seem t...,James Baldwin,https://quotes.toscrape.com/author/James-Baldwin,love


### Putting it all together

In [19]:
def quote_df(n):
    '''Returns a DataFrame containing the quotes on the first n pages of https://quotes.toscrape.com/.'''
    dfs = []
    for i in range(1, n + 1):
        # Download page i and create a BeautifulSoup object.
        soup = download_page(i)
        
        # Create DataFrame using the information in that page.
        divs = soup.find_all('div', attrs={'class': 'quote'})
        df = process_page(divs)
        
        # Append DataFrame to dfs.
        dfs.append(df)
        
    # Stitch all DataFrames together.
    return pd.concat(dfs).reset_index(drop=True)
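
A side note on the stitching step: `pd.concat` keeps each DataFrame's original index, so without a reset the same row labels would repeat once per page. The toy frames below (made-up values, not scraped data) show this, along with `ignore_index=True`, which should give the same result as `.reset_index(drop=True)`.

```python
import pandas as pd

# Two tiny stand-ins for per-page DataFrames.
page1 = pd.DataFrame({'quote': ['a', 'b']})
page2 = pd.DataFrame({'quote': ['c', 'd']})

# Without a reset, the index labels 0 and 1 appear twice.
raw = pd.concat([page1, page2])

# Two ways to get a fresh 0, 1, 2, 3 index.
stitched = pd.concat([page1, page2], ignore_index=True)
same = pd.concat([page1, page2]).reset_index(drop=True)
```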

In [20]:
first_three_pages = quote_df(3)
first_three_pages.head()

Unnamed: 0,quote,author,author_url,tags
0,“The world as we have created it is a process ...,Albert Einstein,https://quotes.toscrape.com/author/Albert-Eins...,"change,deep-thoughts,thinking,world"
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,https://quotes.toscrape.com/author/J-K-Rowling,"abilities,choices"
2,“There are only two ways to live your life. On...,Albert Einstein,https://quotes.toscrape.com/author/Albert-Eins...,"inspirational,life,live,miracle,miracles"
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,https://quotes.toscrape.com/author/Jane-Austen,"aliteracy,books,classic,humor"
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,https://quotes.toscrape.com/author/Marilyn-Monroe,"be-yourself,inspirational"


The elements in the `'tags'` column are all strings, but they look like lists. This is not ideal, as we will see shortly.
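
One possible way to turn these comma-separated strings into actual lists is sketched below. The Series holds made-up values in the same shape as the scraped `'tags'` column, and `split_tags` is a hypothetical helper, not part of the functions above. The empty-string case matters because a quote can have no tags at all.

```python
import pandas as pd

# Made-up rows in the same shape as the scraped 'tags' column.
tags = pd.Series(['girls,love', 'love', ''])

def split_tags(s):
    # ''.split(',') returns [''], so handle the empty string separately.
    return s.split(',') if s else []

tag_lists = tags.apply(split_tags)
```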

### Key takeaways

* Make as few requests as possible.
* Create a request and parsing plan **beforehand**.
* Create your output schema **beforehand**.
* Make requests and parse in **separate functions**!
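
As one way to act on the first takeaway, a page that has already been downloaded can be kept in memory and reused. The sketch below is an assumption-laden illustration: `make_cached` is a hypothetical helper, and `fake_download_page` stands in for `download_page` so that the example runs without touching the network.

```python
def make_cached(fetch):
    '''Wraps fetch(i) so that each page number is fetched at most once.'''
    cache = {}

    def cached_fetch(i):
        if i not in cache:
            cache[i] = fetch(i)
        return cache[i]

    return cached_fetch

# Stand-in for download_page that records calls instead of making requests.
calls = []

def fake_download_page(i):
    calls.append(i)
    return f'<html>page {i}</html>'

cached_download = make_cached(fake_download_page)
first = cached_download(5)
second = cached_download(5)  # Served from the cache; no second "request".
```

In a real session, the same wrapper could be applied as `make_cached(download_page)`, so that re-running a parsing cell does not trigger new requests.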

## Nested vs. flat data formats

- **Nested** data formats, like HTML, JSON, and XML, allow us to represent hierarchical relationships between variables.

* **Flat** (i.e. tabular) data formats, like CSV, do not.

<center><img src="imgs/hierarchy.png" width=40%></center>

### Aside: JSON Crack

The site [jsoncrack.com](https://jsoncrack.com/editor) allows you to upload a JSON file and visualizes it. Let's try it with `data/family.json`!

### Example: Scraping quotes, again

- Suppose we obtained the quotes data via an API and saved it to the file `data/quotes2scrape.json`.
- `quotes2scrape.json` is a **JSON records** file; each line is a valid JSON object, **but the entire document is not**.
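
To make the "each line is valid, the whole document is not" point concrete, here is a small sketch that uses an in-memory string instead of the real file. The two records are made up; they only borrow the key names `'quote_auth'`, `'quote_text'`, and `'tags'`.

```python
import io
import json

# Two made-up records, one JSON object per line.
text = (
    '{"quote_auth": "Author A", "quote_text": "First quote.", "tags": ["life", "love"]}\n'
    '{"quote_auth": "Author B", "quote_text": "Second quote.", "tags": []}\n'
)

# Each line parses on its own...
records = [json.loads(line) for line in io.StringIO(text)]

# ...but the document as a whole is not valid JSON.
try:
    json.loads(text)
    whole_document_parses = True
except json.JSONDecodeError:
    whole_document_parses = False
```

pandas can also read this format directly with `pd.read_json(path, lines=True)`.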

In [21]:
f = open(os.path.join('data', 'quotes2scrape.json'))

In [None]:
json.loads(f.readline())

Note that for a single quote, we have keys for `'auth_url'`, `'quote_auth'`, `'quote_text'`, `'bio'`, `'dob'`, and `'tags'`.

Since each line is a separate JSON object, let's read in each line one at a time.

In [None]:
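# Use a context manager so that the file is closed once every line has been parsed.
with open(os.path.join('data', 'quotes2scrape.json')) as f:
    L = [json.loads(line) for line in f]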

Let's convert the result to a DataFrame.

In [None]:
df = pd.DataFrame(L)
df.head()

What data type is the `'tags'` column?

In [None]:
df['tags'].iloc[0]

Let's save `df` to a CSV and read it back in.

In [None]:
# index=False avoids writing the row labels as an extra unnamed column.
df.to_csv('out.csv', index=False)

In [None]:
df_again = pd.read_csv('out.csv')
df_again.head()

What data type is the `'tags'` column now?

In [None]:
df_again['tags'].iloc[0]
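
The round trip can be reproduced without the data file. The sketch below uses a made-up two-row DataFrame and an in-memory buffer in place of `out.csv`; it also shows one possible way to recover the lists, `ast.literal_eval`, which is an addition here rather than something the steps above rely on.

```python
import ast
import io
import pandas as pd

# Made-up data with a column of lists, like 'tags'.
df_small = pd.DataFrame({'quote_auth': ['Author A', 'Author B'],
                         'tags': [['life', 'love'], []]})

# Write to an in-memory CSV and read it back.
buffer = io.StringIO()
df_small.to_csv(buffer, index=False)
buffer.seek(0)
df_back = pd.read_csv(buffer)

before = type(df_small['tags'].iloc[0])  # list
after = type(df_back['tags'].iloc[0])    # str: CSV cells are plain text

# ast.literal_eval parses the text "['life', 'love']" back into a list.
recovered = df_back['tags'].apply(ast.literal_eval)
```

CSV has no notion of a list, so each list is written as its string representation and comes back as a string.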

### One-hot encoding

- So that we don't have to deal with lists within Series, we can **flatten** lists of tags so that there is **one column per unique tag**.
    - For example, consider the tag `'inspirational'`.
    - If a quote has a 1 in the `'inspirational'` column, it **was** tagged `'inspirational'`.
    - If a quote has a 0 in the `'inspirational'` column, it **was not** tagged `'inspirational'`.

- This process – of converting categorical variables into columns of 1s and 0s – is called **one-hot encoding**. We will revisit it in a few weeks.

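Let's write a function that takes in the list of tags (`taglist`) for a given quote and returns a Series with a 1 for each tag that quote has. When we apply it to every quote, tags that a quote lacks show up as missing values, which we then fill with 0s.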

In [None]:
def flatten_tags(taglist):
    return pd.Series({k:1 for k in taglist}, dtype=float)

tags = df['tags'].apply(flatten_tags).fillna(0).astype(int)
tags.head()

Let's combine this one-hot-encoded DataFrame with `df`.

In [None]:
df_full = pd.concat([df, tags], axis=1).drop(columns='tags')
df_full.head()
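
For comparison, here is a sketch of the same encoding done with pandas string methods instead of `flatten_tags`. It is an alternative under stated assumptions, not the approach used above: the DataFrame is made up, and the `'|'` separator assumes that no tag contains that character.

```python
import pandas as pd

# Made-up quotes with a column of tag lists.
df_toy = pd.DataFrame({'quote_text': ['q0', 'q1', 'q2'],
                       'tags': [['life', 'love'], ['love'], []]})

# Join each list into one string, then split it back into indicator columns.
# Assumption: no tag contains the '|' separator.
onehot = df_toy['tags'].str.join('|').str.get_dummies(sep='|')

df_toy_full = pd.concat([df_toy.drop(columns='tags'), onehot], axis=1)
```

Here the indicator columns come out sorted alphabetically, and a quote with no tags gets a row of all 0s.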

If we want all quotes tagged `'inspirational'`, we can simply query:

In [None]:
df_full[df_full['inspirational'] == 1].head()

Note that this DataFrame representation of the response JSON takes up much more space than the original JSON. Why is that?