In [1]:
import pandas as pd
import numpy as np
import os
import requests
import json
import bs4

from IPython.display import display, Image

In [3]:
exam_string = '''
<html>
<college>Your score is ready!</college>

<sat verbal="ready" math="ready">
    Your percentiles are as follows:
    <scorelist listtype="percentiles">
        <scorerow kind="verbal" subkind="per">
            Verbal: <scorenum>84</scorenum>
        </scorerow>
        <scorerow kind="math" subkind="per">
            Math: <scorenum>99</scorenum>
        </scorerow>
    </scorelist>
    And your actual scores are as follows:
    <scorelist listtype="scores">
        <scorerow kind="verbal">
            Verbal: <scorenum>680</scorenum>
        </scorerow>
        <scorerow kind="math">
            Math: <scorenum>800</scorenum>
        </scorerow>
    </scorelist>
</sat>
</html>
'''.strip()

In [6]:
soup = bs4.BeautifulSoup(exam_string)
soup.find('scorerow',attrs={"kind":"verbal","subkind":"per"})

<scorerow kind="verbal" subkind="per">
            Verbal: <scorenum>84</scorenum>
</scorerow>

# Lecture 10 – More on Parsing HTML



### Agenda

- Parsing HTML using Beautiful Soup.
    - Example: Scraping the Jiaotong Global Classroom website.
    - Example: Scraping quotes.
- Nested vs. flat data structures.

## Parsing HTML using Beautiful Soup

### `BeautifulSoup` objects

- `bs4.BeautifulSoup` takes in a string or file-like object representing HTML (markup) and returns a **parsed** document.
- Remember, HTML documents are represented as **trees**, under the "Document Object Model."

In [2]:
html_string = '''
<html>
    <body>
      <div id="content">
        <h1>Heading here</h1>
        <p>My First paragraph</p>
        <p>My <em>second</em> paragraph</p>
        <hr>
      </div>
      <div id="nav">
        <ul>
          <li>item 1</li>
          <li>item 2</li>
          <li>item 3</li>
        </ul>
      </div>
    </body>
</html>
'''.strip()

In [3]:
soup = bs4.BeautifulSoup(html_string)

In [4]:
type(soup)

bs4.BeautifulSoup

### Finding elements in a tree

The most common methods you'll use to find _tags_ in a `soup` object are:
- `soup.find(tag)`, which finds the **first** instance of a tag (the first one on the page, i.e. the first one encountered in a depth-first traversal of the tree).
    - More general: `soup.find(name=None, attrs={}, recursive=True, text=None, **kwargs)`.
- `soup.find_all(tag)`, which finds **all** instances of a tag.
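
As a rough illustration of the difference between the two methods, the sketch below parses a small made-up HTML string (not one of the lecture's documents). It passes `'html.parser'` explicitly as the parser, which is an assumption on our part; the cells in this lecture let Beautiful Soup pick one.

```python
import bs4

# A tiny, made-up document used only for this illustration.
demo_html = '''
<div id="a"><p class="x">one</p><p class="y">two</p></div>
<div id="b"><p class="x">three</p></div>
'''
demo_soup = bs4.BeautifulSoup(demo_html, 'html.parser')

# find returns the first match, or None when nothing matches.
first_p = demo_soup.find('p')
missing = demo_soup.find('table')

# find_all returns every match, optionally filtered by attributes.
all_p = demo_soup.find_all('p')
x_only = demo_soup.find_all('p', attrs={'class': 'x'})

print(first_p.text)              # one
print(missing)                   # None
print([p.text for p in x_only])  # ['one', 'three']
```

Because `find` can return `None`, chaining `.text` or `.get(...)` onto a failed search raises an `AttributeError`.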


### Using `find_all`

`find_all` returns a list of all matches.

In [5]:
soup.find_all('div')

[<div id="content">
 <h1>Heading here</h1>
 <p>My First paragraph</p>
 <p>My <em>second</em> paragraph</p>
 <hr/>
 </div>,
 <div id="nav">
 <ul>
 <li>item 1</li>
 <li>item 2</li>
 <li>item 3</li>
 </ul>
 </div>]

In [6]:
soup.find_all('li')

[<li>item 1</li>, <li>item 2</li>, <li>item 3</li>]

In [7]:
[x.text for x in soup.find_all('li')]

['item 1', 'item 2', 'item 3']

### Node attributes
* The `text` attribute of a tag element gets the text between the opening and closing tags.
* The `attrs` attribute lists all attributes of a tag.
* The `get(key)` method gets the value of a tag attribute.

In [8]:
soup.find('p')

<p>My First paragraph</p>

In [9]:
soup.find('p').text

'My First paragraph'

In [10]:
soup.find('div')

<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>

In [11]:
soup.find('div').attrs

{'id': 'content'}

In [12]:
soup.find('div').get('id')

'content'

The `get` method must be called directly on the node that contains the attribute you're looking for.

In [13]:
soup

<html>
<body>
<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr/>
</div>
<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>
</body>
</html>

In [14]:
# While there are multiple 'id' attributes, none of them are in the <html> tag at the top.
soup.get('id')

In [15]:
soup.find('div').get('id')

'content'

## Example: Scraping the Jiaotong Global Classroom page

### Example

Let's try to extract a list of courses from https://global.sjtu.edu.cn/en/cooperation/globalclass.

A good first step is to use the "inspect element" tool in our web browser.

In [16]:
fac_response = requests.get('https://global.sjtu.edu.cn/en/cooperation/globalclass')
fac_response

<Response [200]>

In [57]:
soup = bs4.BeautifulSoup(fac_response.text)

It seems like the relevant `<div>`s for all courses are the ones where the `class` attribute is equal to `'layui-col-lg3 layui-col-md4 layui-col-sm6 layui-col-xsm6 layui-col-xs12'`. Let's find all of those.

In [66]:
divs = soup.find_all('div', attrs={'class': 'layui-col-lg3 layui-col-md4 layui-col-sm6 layui-col-xsm6 layui-col-xs12'})
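
Matching on the full `class` string works here, but it depends on the exact order and spelling of all five class names. As a hedged alternative, Beautiful Soup also lets us match on a single class with the `class_` keyword, or on several classes with a CSS selector via `select`. The sketch below uses a made-up snippet with shortened class names, so the markup is an assumption and not the real page.

```python
import bs4

# Made-up markup that imitates the structure of the course tiles.
demo_html = '''
<div class="col-lg3 col-md4"><div class="large-title">Course A</div></div>
<div class="col-md4 col-lg3"><div class="large-title">Course B</div></div>
<div class="footer">Not a course</div>
'''
demo_soup = bs4.BeautifulSoup(demo_html, 'html.parser')

# Exact string match: only tags whose class attribute is exactly this string.
exact = demo_soup.find_all('div', attrs={'class': 'col-lg3 col-md4'})

# Single-class match: any <div> that has 'col-lg3' among its classes.
by_class = demo_soup.find_all('div', class_='col-lg3')

# CSS selector: <div>s that have both classes, in any order.
by_selector = demo_soup.select('div.col-lg3.col-md4')

print(len(exact), len(by_class), len(by_selector))  # 1 2 2
```

Which form is most appropriate depends on how stable the page's class names are; the exact-string form is the strictest of the three.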

Within each course `<div>`, the title seems to be stored in a nested `<div>` whose `class` attribute is equal to `'large-title'`. Let's extract the title of the first course.

In [83]:
divs[0].find('div',attrs={'class':'large-title'}).text.strip()

'LAW6852 Global Governance, Conflict and China'

Within each course `<div>`, we also need to extract the link to the course. It seems like links are stored in the `href` attribute within an `<a>` tag.

In [76]:
divs[0].find('a').get('href')

'/en/page/sub/397'

We can also extract course credit:

In [81]:
divs[0].find_all('p')[1].text

'Credit：1'

We can also extract course department:

In [84]:
divs[0].find_all('p')[0].text

'KoGuan School of Law'

Let's create a DataFrame consisting of the name, credit, department, and link for each course.

In [85]:
names = [div.find('div',attrs={'class':'large-title'}).text.strip() for div in divs]
names[:5]

['LAW6852 Global Governance, Conflict and China',
 'JC8809 Management Practices in Cultural and Creative Industry',
 'PUM6004 Social Science Methodology',
 'HIS8706 Wars and Revolutions in 20th Century China',
 'PJ187 Net Zero-Carbon Fuels']

In [89]:
credits = [div.find_all('p')[1].text for div in divs]
credits[:5]

['Credit：1', 'Credit：1', 'Credits：3', 'Credits：3', 'Credits：2']

In [90]:
departments = [div.find_all('p')[0].text for div in divs]
departments[:5]

['KoGuan School of Law',
 'USC-SJTU Institute of Cultural and Creative Industry',
 'School of International and Public Affairs',
 'School of Humanities',
 'China-UK Low Carbon College']

In [93]:
links = [ 'https://global.sjtu.edu.cn'+div.find('a').get('href') for div in divs]
links[:5]


['https://global.sjtu.edu.cn/en/page/sub/397',
 'https://global.sjtu.edu.cn/en/page/sub/396',
 'https://global.sjtu.edu.cn/en/page/sub/395',
 'https://global.sjtu.edu.cn/en/page/sub/394',
 'https://global.sjtu.edu.cn/en/page/sub/393']
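
String concatenation works because every `href` on this page starts with `/`. As a sketch of a more general approach, the standard library's `urllib.parse.urljoin` resolves an `href` against the URL of the page it came from. The second `href` below is a made-up example of a link that is already absolute.

```python
from urllib.parse import urljoin

base_url = 'https://global.sjtu.edu.cn/en/cooperation/globalclass'

# A relative href like the one extracted above.
relative_link = urljoin(base_url, '/en/page/sub/397')

# An href that is already absolute is returned unchanged (hypothetical URL).
absolute_link = urljoin(base_url, 'https://example.com/other')

print(relative_link)  # https://global.sjtu.edu.cn/en/page/sub/397
print(absolute_link)  # https://example.com/other
```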

In [94]:
courses = pd.DataFrame().assign(name=names, credit=credits, department=departments ,link=links)
courses.head()

Unnamed: 0,name,credit,department,link
0,"LAW6852 Global Governance, Conflict and China",Credit：1,KoGuan School of Law,https://global.sjtu.edu.cn/en/page/sub/397
1,JC8809 Management Practices in Cultural and Cr...,Credit：1,USC-SJTU Institute of Cultural and Creative In...,https://global.sjtu.edu.cn/en/page/sub/396
2,PUM6004 Social Science Methodology,Credits：3,School of International and Public Affairs,https://global.sjtu.edu.cn/en/page/sub/395
3,HIS8706 Wars and Revolutions in 20th Century C...,Credits：3,School of Humanities,https://global.sjtu.edu.cn/en/page/sub/394
4,PJ187 Net Zero-Carbon Fuels,Credits：2,China-UK Low Carbon College,https://global.sjtu.edu.cn/en/page/sub/393


Now we have a DataFrame!
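
The `'credit'` column still holds strings such as `'Credits：3'`. If numeric credits are needed, one possible cleaning step is to pull out the digits with a regular expression. The helper name `parse_credit` is hypothetical, and the sample values are copied from the `credits` output above.

```python
import re

# Sample values copied from the `credits` list shown earlier.
raw_credits = ['Credit：1', 'Credit：1', 'Credits：3', 'Credits：3', 'Credits：2']

def parse_credit(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r'\d+', text)
    return int(match.group()) if match else None

credit_values = [parse_credit(c) for c in raw_credits]
print(credit_values)  # [1, 1, 3, 3, 2]
```

Searching for digits avoids having to handle both the `Credit`/`Credits` spellings and the full-width colon. On the DataFrame, something like `courses['credit'].apply(parse_credit)` would apply the same helper to the whole column.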

In [95]:
courses[courses['department']=='UM-SJTU Joint Institute']

Unnamed: 0,name,credit,department,link
10,ECE2810J Data Structures and Algorithms,Credits：4,UM-SJTU Joint Institute,https://global.sjtu.edu.cn/en/page/sub/351
11,ECE4270J VLSI Design I,Credits：4,UM-SJTU Joint Institute,https://global.sjtu.edu.cn/en/page/sub/352
12,VV556/MATH6001J Methods of Applied Mathematics,Credits：3,UM-SJTU Joint Institute,https://global.sjtu.edu.cn/en/page/sub/353
13,ECE4710J Introduction to Data Science,Credits：4,UM-SJTU Joint Institute,https://global.sjtu.edu.cn/en/page/sub/354


What if we want to get each course's picture? It seems like we should look at the attributes of an `<img>` tag.

In [97]:
def show_picture(name):
    idx = names.index(name)
    url = divs[idx].find('img').get('src')
    display(Image(url))

In [98]:
show_picture('ECE2810J Data Structures and Algorithms')

<IPython.core.display.Image object>

## Example: Scraping quotes

### Example: Scraping quotes

Let's scrape quotes from https://quotes.toscrape.com/.

<center><img src="imgs/quotes2scrape.png" width=60%></center>

Specifically, let's try to make a DataFrame that looks like the one below:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>quote</th>
      <th>author</th>
      <th>author_url</th>
      <th>tags</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>change,deep-thoughts,thinking,world</td>
    </tr>
    <tr>
      <th>1</th>
      <td>“It is our choices, Harry, that show what we truly are, far more than our abilities.”</td>
      <td>J.K. Rowling</td>
      <td>https://quotes.toscrape.com/author/J-K-Rowling</td>
      <td>abilities,choices</td>
    </tr>
    <tr>
      <th>2</th>
      <td>“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</td>
      <td>Albert Einstein</td>
      <td>https://quotes.toscrape.com/author/Albert-Einstein</td>
      <td>inspirational,life,live,miracle,miracles</td>
    </tr>
  </tbody>
</table>

### The plan

Eventually, we will create a single function – `quote_df` – which takes in an integer `n` and returns a **DataFrame** with the quotes on the **first `n` pages** of https://quotes.toscrape.com/.

To do this, we will define several helper functions:

- `download_page(i)`, which downloads a **single page** (page `i`) and returns a `BeautifulSoup` object of the response.

- `process_quote(div)`, which takes in a `<div>` tree corresponding to a **single quote** and returns a Series containing all of the relevant information for that quote.

- `process_page(divs)`, which takes in a list of `<div>` trees corresponding to a **single page** and returns a DataFrame containing all of the relevant information for all quotes on that page.

Key principle: some of our helper functions will make **requests**, and others will **parse**, but none will do both! 
- Easier to debug and catch errors.
- Avoids unnecessary requests.
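
One way to see why this separation avoids unnecessary requests is the sketch below, which keeps downloaded HTML in a dictionary so that parsing code can be re-run without hitting the site again. All names here (`make_cached_fetcher`, `count_quotes`, `fake_fetch`) are hypothetical, and `fake_fetch` stands in for a real downloader so that the example runs offline.

```python
def make_cached_fetcher(fetch):
    """Wrap a fetch(url) -> str function so each URL is requested at most once."""
    cache = {}

    def cached_fetch(url):
        if url not in cache:
            cache[url] = fetch(url)  # The only place a request is made.
        return cache[url]

    return cached_fetch


def count_quotes(html):
    """Parsing step: works purely on a string and never touches the network."""
    return html.count('class="quote"')


# Stand-in for a real downloader such as `lambda url: requests.get(url).text`.
calls = []
def fake_fetch(url):
    calls.append(url)
    return '<div class="quote">a</div><div class="quote">b</div>'

fetch = make_cached_fetcher(fake_fetch)
html = fetch('https://quotes.toscrape.com/page/1')
html_again = fetch('https://quotes.toscrape.com/page/1')  # Served from the cache.

print(count_quotes(html), len(calls))  # 2 1
```

If the parsing function has a bug, only `count_quotes` needs to be fixed and re-run; the downloaded text is already in hand.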

### Downloading a single page

In [29]:
def download_page(i):
    # Request page i and return the parsed HTML of the response.
    url = f'https://quotes.toscrape.com/page/{i}'
    response = requests.get(url)
    return bs4.BeautifulSoup(response.text)

In `quote_df`, we will call `download_page` repeatedly: once for `i=1`, once for `i=2`, and so on up to `i=n`. For now, we will work with just page 5 (chosen arbitrarily).

In [32]:
soup = download_page(5)


### Parsing a single page

Let's look at the page's source code (via "inspect element") to find where the quotes in the page are located.

In [33]:
divs = soup.find_all('div', attrs={'class': 'quote'})

In [34]:
divs[0]

<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”</span>
<span>by <small class="author" itemprop="author">George R.R. Martin</small>
<a href="/author/George-R-R-Martin">(about)</a>
</span>
<div class="tags">
            Tags:
            <meta class="keywords" content="read,readers,reading,reading-books" itemprop="keywords"/>
<a class="tag" href="/tag/read/page/1/">read</a>
<a class="tag" href="/tag/readers/page/1/">readers</a>
<a class="tag" href="/tag/reading/page/1/">reading</a>
<a class="tag" href="/tag/reading-books/page/1/">reading-books</a>
</div>
</div>

From this `<div>`, we can extract the quote, author name, author's URL, and tags.

In [35]:
divs[0].find('span', attrs={'class': 'text'}).text

'“A reader lives a thousand lives before he dies, said Jojen. The man who never reads lives only one.”'

In [36]:
divs[0].find('small', attrs={'class': 'author'}).text

'George R.R. Martin'

In [37]:
divs[0].find('a').get('href')

'/author/George-R-R-Martin'

In [38]:
divs[0].find('meta', attrs={'class': 'keywords'}).get('content')

'read,readers,reading,reading-books'

Let's implement our next function, `process_quote`, which takes in a `<div>` corresponding to a single quote and returns a **Series** containing the quote's information.

Note that this approach is different from the approach taken in the Jiaotong Global Classroom example – there, we created each column of our final DataFrame separately, while here we are creating one **row** of our final DataFrame at a time.

In [39]:
def process_quote(div):
    quote = div.find('span', attrs={'class': 'text'}).text
    author = div.find('small', attrs={'class': 'author'}).text
    author_url = 'https://quotes.toscrape.com' + div.find('a').get('href')
    tags = div.find('meta', attrs={'class': 'keywords'}).get('content')
    
    return pd.Series({'quote': quote, 'author': author, 'author_url': author_url, 'tags': tags})

In [40]:
process_quote(divs[3])

quote         “If you can make a woman laugh, you can make h...
author                                           Marilyn Monroe
author_url    https://quotes.toscrape.com/author/Marilyn-Monroe
tags                                                 girls,love
dtype: object
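
`process_quote` assumes every quote `<div>` contains all four pieces of information. Since `find` returns `None` when nothing matches, a `<div>` missing one of them would raise an `AttributeError`. A defensive variant might look like the sketch below, under the assumption that a missing field should become `None`. The helper name `text_or_none` and the sample markup are hypothetical.

```python
import bs4

def text_or_none(node, name, class_name):
    """Return the text of the first matching descendant, or None if absent."""
    found = node.find(name, attrs={'class': class_name})
    return found.text if found is not None else None

# Made-up quote <div> with no author element.
demo_div = bs4.BeautifulSoup(
    '<div class="quote"><span class="text">Hello</span></div>', 'html.parser'
).find('div')

quote_text = text_or_none(demo_div, 'span', 'text')
author_text = text_or_none(demo_div, 'small', 'author')

print(quote_text)   # Hello
print(author_text)  # None
```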

Our last helper function will take in a **list** of `<div>`s, call `process_quote` on each `<div>` in the list, and return a **DataFrame**.

In [41]:
def process_page(divs):
    return pd.DataFrame([process_quote(div) for div in divs])

In [42]:
process_page(divs)

Unnamed: 0,quote,author,author_url,tags
0,“A reader lives a thousand lives before he die...,George R.R. Martin,https://quotes.toscrape.com/author/George-R-R-...,"read,readers,reading,reading-books"
1,“You can never get a cup of tea large enough o...,C.S. Lewis,https://quotes.toscrape.com/author/C-S-Lewis,"books,inspirational,reading,tea"
2,“You believe lies so you eventually learn to t...,Marilyn Monroe,https://quotes.toscrape.com/author/Marilyn-Monroe,
3,"“If you can make a woman laugh, you can make h...",Marilyn Monroe,https://quotes.toscrape.com/author/Marilyn-Monroe,"girls,love"
4,“Life is like riding a bicycle. To keep your b...,Albert Einstein,https://quotes.toscrape.com/author/Albert-Eins...,"life,simile"
5,“The real lover is the man who can thrill you ...,Marilyn Monroe,https://quotes.toscrape.com/author/Marilyn-Monroe,love
6,"“A wise girl kisses but doesn't love, listens ...",Marilyn Monroe,https://quotes.toscrape.com/author/Marilyn-Monroe,attributed-no-source
7,“Only in the darkness can you see the stars.”,Martin Luther King Jr.,https://quotes.toscrape.com/author/Martin-Luth...,"hope,inspirational"
8,"“It matters not what someone is born, but what...",J.K. Rowling,https://quotes.toscrape.com/author/J-K-Rowling,dumbledore
9,“Love does not begin and end the way we seem t...,James Baldwin,https://quotes.toscrape.com/author/James-Baldwin,love


### Putting it all together

In [43]:
def quote_df(n):
    '''Returns a DataFrame containing the quotes on the first n pages of https://quotes.toscrape.com/.'''
    dfs = []
    for i in range(1, n + 1):
        # Download page i and create a BeautifulSoup object.
        soup = download_page(i)
        
        # Create DataFrame using the information in that page.
        divs = soup.find_all('div', attrs={'class': 'quote'})
        df = process_page(divs)
        
        # Append DataFrame to dfs.
        dfs.append(df)
        
    # Stitch all DataFrames together.
    return pd.concat(dfs).reset_index(drop=True)

In [44]:
first_three_pages = quote_df(3)
first_three_pages.head()

Unnamed: 0,quote,author,author_url,tags
0,“The world as we have created it is a process ...,Albert Einstein,https://quotes.toscrape.com/author/Albert-Eins...,"change,deep-thoughts,thinking,world"
1,"“It is our choices, Harry, that show what we t...",J.K. Rowling,https://quotes.toscrape.com/author/J-K-Rowling,"abilities,choices"
2,“There are only two ways to live your life. On...,Albert Einstein,https://quotes.toscrape.com/author/Albert-Eins...,"inspirational,life,live,miracle,miracles"
3,"“The person, be it gentleman or lady, who has ...",Jane Austen,https://quotes.toscrape.com/author/Jane-Austen,"aliteracy,books,classic,humor"
4,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe,https://quotes.toscrape.com/author/Marilyn-Monroe,"be-yourself,inspirational"


The elements in the `'tags'` column are all strings, but they look like lists. This is not ideal, as we will see shortly.
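
To make the distinction concrete, the sketch below converts a comma-separated tag string into an actual list. The helper name `split_tags` is hypothetical. It treats the empty string specially because some quotes have no tags at all.

```python
def split_tags(tag_string):
    """Turn 'a,b,c' into ['a', 'b', 'c'], and '' into an empty list."""
    return tag_string.split(',') if tag_string else []

print(split_tags('girls,love'))  # ['girls', 'love']
print(split_tags(''))            # []
print(''.split(','))             # [''] (the pitfall that split_tags avoids)
```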

### Key takeaways

* Make as few requests as possible.
* Create a request and parsing plan **beforehand**.
* Create your output schema **beforehand**.
* Make requests and parse in **separate functions**!

## Nested vs. flat data formats

### Nested vs. flat data formats

- **Nested** data formats, like HTML, JSON, and XML, allow us to represent hierarchical relationships between variables.

* **Flat** (i.e. tabular) data formats, like CSV, do not.

<center><img src="imgs/hierarchy.png" width=40%></center>

### Aside: JSON Crack

The site [jsoncrack.com](https://jsoncrack.com/editor) allows you to upload a JSON file and visualizes it. Let's try it with `data/family.json`!

### Example: Scraping quotes, again

- Suppose we obtained the quotes data via an API and saved it to the file `data/quotes2scrape.json`.
- `quotes2scrape.json` is a **JSON records** file; each line is a valid JSON object, **but the entire document is not**.

In [45]:
f = open(os.path.join('data', 'quotes2scrape.json'))

In [46]:
json.loads(f.readline())

{'auth_url': 'http://quotes.toscrape.com/author/Albert-Einstein',
 'quote_auth': 'Albert Einstein',
 'quote_text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 'bio': 'In 1879, Albert Einstein was born in Ulm, Germany. He completed his Ph.D. at the University of Zurich by 1909. His 1905 paper explaining the photoelectric effect, the basis of electronics, earned him the Nobel Prize in 1921. His first paper on Special Relativity Theory, also published in 1905, changed the world. After the rise of the Nazi party, Einstein made Princeton his permanent home, becoming a U.S. citizen in 1940. Einstein, a pacifist during World War I, stayed a firm proponent of social justice and responsibility. He chaired the Emergency Committee of Atomic Scientists, which organized to alert the public to the dangers of atomic warfare.At a symposium, he advised: "In their struggle for the ethical good, teachers of religion must have the

Note that for a single quote, we have keys for `'auth_url'`, `'quote_auth'`, `'quote_text'`, `'bio'`, `'dob'`, and `'tags'`.

Since each line is a separate JSON object, let's read in each line one at a time.

In [47]:
L = [json.loads(x) for x in open(os.path.join('data', 'quotes2scrape.json'))]
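
The claim that each line is valid JSON while the whole document is not can be checked on a small scale. The sketch below uses two made-up records in an in-memory file (`io.StringIO`) in place of `data/quotes2scrape.json`, so it runs without the data file.

```python
import io
import json

# Two made-up records standing in for the lines of the real file.
fake_file = io.StringIO(
    '{"quote_auth": "A", "tags": ["x", "y"]}\n'
    '{"quote_auth": "B", "tags": []}\n'
)

# Each line parses on its own...
records = [json.loads(line) for line in fake_file]

# ...but the document as a whole is not valid JSON.
try:
    json.loads(fake_file.getvalue())
    whole_document_ok = True
except json.JSONDecodeError:
    whole_document_ok = False

print(len(records), whole_document_ok)  # 2 False
```

With a real path, wrapping the read in `with open(path) as f:` closes the file once the lines are consumed. pandas can also read this layout directly via `pd.read_json(path, lines=True)`.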

Let's convert the result to a DataFrame.

In [48]:
df = pd.DataFrame(L)
df.head()

Unnamed: 0,auth_url,quote_auth,quote_text,bio,dob,tags
0,http://quotes.toscrape.com/author/Albert-Einstein,Albert Einstein,“The world as we have created it is a process ...,"In 1879, Albert Einstein was born in Ulm, Germ...","March 14, 1879","[change, deep-thoughts, thinking, world]"
1,http://quotes.toscrape.com/author/J-K-Rowling,J.K. Rowling,"“It is our choices, Harry, that show what we t...",See also: Robert GalbraithAlthough she writes ...,"July 31, 1965","[abilities, choices]"
2,http://quotes.toscrape.com/author/Albert-Einstein,Albert Einstein,“There are only two ways to live your life. On...,"In 1879, Albert Einstein was born in Ulm, Germ...","March 14, 1879","[inspirational, life, live, miracle, miracles]"
3,http://quotes.toscrape.com/author/Jane-Austen,Jane Austen,"“The person, be it gentleman or lady, who has ...",Jane Austen was an English novelist whose work...,"December 16, 1775","[aliteracy, books, classic, humor]"
4,http://quotes.toscrape.com/author/Marilyn-Monroe,Marilyn Monroe,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe (born Norma Jeane Mortenson; Ju...,"June 01, 1926","[be-yourself, inspirational]"


What data type is the `'tags'` column?

In [49]:
df['tags'].iloc[0]

['change', 'deep-thoughts', 'thinking', 'world']

Let's save `df` to a CSV and read it back in.

In [50]:
df.to_csv('out.csv')

In [51]:
df_again = pd.read_csv('out.csv')
df_again.head()

Unnamed: 0.1,Unnamed: 0,auth_url,quote_auth,quote_text,bio,dob,tags
0,0,http://quotes.toscrape.com/author/Albert-Einstein,Albert Einstein,“The world as we have created it is a process ...,"In 1879, Albert Einstein was born in Ulm, Germ...","March 14, 1879","['change', 'deep-thoughts', 'thinking', 'world']"
1,1,http://quotes.toscrape.com/author/J-K-Rowling,J.K. Rowling,"“It is our choices, Harry, that show what we t...",See also: Robert GalbraithAlthough she writes ...,"July 31, 1965","['abilities', 'choices']"
2,2,http://quotes.toscrape.com/author/Albert-Einstein,Albert Einstein,“There are only two ways to live your life. On...,"In 1879, Albert Einstein was born in Ulm, Germ...","March 14, 1879","['inspirational', 'life', 'live', 'miracle', '..."
3,3,http://quotes.toscrape.com/author/Jane-Austen,Jane Austen,"“The person, be it gentleman or lady, who has ...",Jane Austen was an English novelist whose work...,"December 16, 1775","['aliteracy', 'books', 'classic', 'humor']"
4,4,http://quotes.toscrape.com/author/Marilyn-Monroe,Marilyn Monroe,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe (born Norma Jeane Mortenson; Ju...,"June 01, 1926","['be-yourself', 'inspirational']"


What data type is the `'tags'` column now?

In [52]:
df_again['tags'].iloc[0]

"['change', 'deep-thoughts', 'thinking', 'world']"

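After the round trip through CSV, each element is a string that merely looks like a list. If the original lists are needed again, one option is the standard library's `ast.literal_eval`, sketched here on the value shown above.

```python
import ast

tags_as_string = "['change', 'deep-thoughts', 'thinking', 'world']"

# literal_eval only evaluates Python literals, so it is safer than eval here.
tags_as_list = ast.literal_eval(tags_as_string)

print(type(tags_as_string).__name__, type(tags_as_list).__name__)  # str list
print(tags_as_list[0])  # change
```

Applied to the whole column, this would be something like `df_again['tags'].apply(ast.literal_eval)`.
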
### One-hot encoding

- So that we don't have to deal with lists within Series, we can **flatten** lists of tags so that there is **one column per unique tag**.
    - For example, consider the tag `'inspirational'`.
    - If a quote has a 1 in the `'inspirational'` column, it **was** tagged `'inspirational'`.
    - If a quote has a 0 in the `'inspirational'` column, it **was not** tagged `'inspirational'`.

- This process – of converting categorical variables into columns of 1s and 0s – is called **one-hot encoding**. We will revisit it in a few weeks.

Let's write a function that takes in the list of tags (`taglist`) for a given quote and returns the one-hot-encoded sequence of 1s and 0s for that quote.

In [53]:
def flatten_tags(taglist):
    return pd.Series({k:1 for k in taglist}, dtype=float)

tags = df['tags'].apply(flatten_tags).fillna(0).astype(int)
tags.head()

Unnamed: 0,change,deep-thoughts,thinking,world,abilities,choices,inspirational,life,live,miracle,...,christianity,faith,sun,adventure,better-life-empathy,difficult,grown-ups,write,writers,mind
0,1,1,1,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,1,1,1,1,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
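
To see what the encoding computes, the same idea can be sketched in plain Python on three made-up tag lists. This is an illustration of the concept, not a replacement for the pandas version above.

```python
# Made-up tag lists for three quotes.
tag_lists = [['love', 'life'], ['life'], []]

# One column per unique tag, in order of first appearance.
columns = []
for tags in tag_lists:
    for tag in tags:
        if tag not in columns:
            columns.append(tag)

# One row of 0s and 1s per quote.
rows = [[1 if col in tags else 0 for col in columns] for tags in tag_lists]

print(columns)  # ['love', 'life']
print(rows)     # [[1, 1], [0, 1], [0, 0]]
```

Every row has one entry per unique tag, even for a quote with no tags, which is why the encoded table grows wide as the number of distinct tags increases.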


Let's combine this one-hot-encoded DataFrame with `df`.

In [54]:
df_full = pd.concat([df, tags], axis=1).drop(columns='tags')
df_full.head()

Unnamed: 0,auth_url,quote_auth,quote_text,bio,dob,change,deep-thoughts,thinking,world,abilities,...,christianity,faith,sun,adventure,better-life-empathy,difficult,grown-ups,write,writers,mind
0,http://quotes.toscrape.com/author/Albert-Einstein,Albert Einstein,“The world as we have created it is a process ...,"In 1879, Albert Einstein was born in Ulm, Germ...","March 14, 1879",1,1,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1,http://quotes.toscrape.com/author/J-K-Rowling,J.K. Rowling,"“It is our choices, Harry, that show what we t...",See also: Robert GalbraithAlthough she writes ...,"July 31, 1965",0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,http://quotes.toscrape.com/author/Albert-Einstein,Albert Einstein,“There are only two ways to live your life. On...,"In 1879, Albert Einstein was born in Ulm, Germ...","March 14, 1879",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,http://quotes.toscrape.com/author/Jane-Austen,Jane Austen,"“The person, be it gentleman or lady, who has ...",Jane Austen was an English novelist whose work...,"December 16, 1775",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,http://quotes.toscrape.com/author/Marilyn-Monroe,Marilyn Monroe,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe (born Norma Jeane Mortenson; Ju...,"June 01, 1926",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


If we want all quotes tagged `'inspirational'`, we can simply query:

In [55]:
df_full[df_full['inspirational'] == 1].head()

Unnamed: 0,auth_url,quote_auth,quote_text,bio,dob,change,deep-thoughts,thinking,world,abilities,...,christianity,faith,sun,adventure,better-life-empathy,difficult,grown-ups,write,writers,mind
2,http://quotes.toscrape.com/author/Albert-Einstein,Albert Einstein,“There are only two ways to live your life. On...,"In 1879, Albert Einstein was born in Ulm, Germ...","March 14, 1879",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,http://quotes.toscrape.com/author/Marilyn-Monroe,Marilyn Monroe,"“Imperfection is beauty, madness is genius and...",Marilyn Monroe (born Norma Jeane Mortenson; Ju...,"June 01, 1926",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,http://quotes.toscrape.com/author/Thomas-A-Edison,Thomas A. Edison,"“I have not failed. I've just found 10,000 way...","Thomas Alva Edison was an American inventor, s...","February 11, 1847",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,http://quotes.toscrape.com/author/Marilyn-Monroe,Marilyn Monroe,“This life is what you make it. No matter what...,Marilyn Monroe (born Norma Jeane Mortenson; Ju...,"June 01, 1926",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16,http://quotes.toscrape.com/author/Elie-Wiesel,Elie Wiesel,"“The opposite of love is not hate, it's indiff...",Eliezer Wiesel was a Romania-born American nov...,"September 30, 1928",0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Note that this DataFrame representation of the response JSON takes up much more space than the original JSON. Why is that?

## Summary

### Summary

- Beautiful Soup is an HTML parser that allows us to (somewhat) easily extract information from HTML documents.
    - `soup.find` and `soup.find_all` are the functions you will use most often.
- When writing scraping code:
    - Use "inspect element" to identify the names of tags and attributes that are relevant to the information you want to extract.
    - Separate your logic for making requests and for parsing.