## Meeting 1: Introduction to BeautifulSoup

Welcome to web scraping! We will start by introducing BeautifulSoup, a Python package that makes web scraping easy.


We will introduce some basic concepts by scraping https://quotes.toscrape.com

Link to BeautifulSoup documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/

In [4]:
from bs4 import BeautifulSoup
import requests


def getSoup(url: str) -> BeautifulSoup:
    # Download the page and parse its HTML with Python's built-in parser
    page = requests.get(url)
    bs = BeautifulSoup(page.content, "html.parser")
    page.close()

    return bs


page = "https://quotes.toscrape.com"

soup = getSoup(page)
print(soup.prettify())



<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

## Navigating to a specific Element
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree

A BeautifulSoup object contains the entire HTML tree you see when you use Inspect Element in the browser (and maybe some extra stuff I don't even know about).

#### Exercise 1: Grab the title 
Grab and print the website title "Quotes to Scrape" by navigating the HTML tree.


The answer for this simple exercise should be an ugly, hardcoded chain of HTML tag names.

In [5]:
soup.title.text
#soup.some_tags.some_more_tags.text

'Quotes to Scrape'
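
To make the "hardcoded chain" idea concrete, here is a minimal sketch that runs on a small inline HTML string instead of the live site. The `html` snippet and the `demo` variable are assumptions made up for this illustration; they only mimic the top of the real page.

```python
from bs4 import BeautifulSoup

# Hand-trimmed stand-in for the real page (an assumption, so this runs offline).
html = """
<html>
 <head><title>Quotes to Scrape</title></head>
 <body><h1><a href="/">Quotes to Scrape</a></h1></body>
</html>
"""
demo = BeautifulSoup(html, "html.parser")

# Fully spelled-out path: each attribute steps down to the first tag with that name.
long_path = demo.html.head.title.text

# Shortcut: attribute access searches all descendants, so this reaches the same tag.
short_path = demo.title.text

print(long_path)
print(short_path)
```

Both lines print the same title. The long form shows exactly where the tag lives in the tree, which is also why it breaks as soon as the page layout changes.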

## Searching the Tree
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree

As you saw, that was pretty ugly. Typing those long statements to get an element won't scale when we deal with bigger, more complicated websites.

#### Exercise 2: Search for the title
Do the same as above, but use a filter to grab that specific element (remove the .text from your solution to see what type it is).

In [6]:
# find() searches the whole tree and returns the first tag matching the filter
soup.find("title")

<title>Quotes to Scrape</title>
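
As a rough sketch of how the search methods behave, the example below uses `find` and `find_all` on a tiny inline HTML string. The `html` snippet and its two placeholder quotes are invented for this illustration and are not taken from the site.

```python
from bs4 import BeautifulSoup

# Invented sample markup that mimics the structure of the real page.
html = """
<html><head><title>Quotes to Scrape</title></head>
<body>
 <div class="quote"><span class="text">“First quote.”</span></div>
 <div class="quote"><span class="text">“Second quote.”</span></div>
</body></html>
"""
demo = BeautifulSoup(html, "html.parser")

# find() returns the first matching tag, or None when nothing matches.
title_tag = demo.find("title")
missing = demo.find("table")

# find_all() returns every match; class_ filters on the CSS class.
spans = demo.find_all("span", class_="text")

print(type(title_tag).__name__)
print(missing)
print(len(spans))
```

Because `find` can return `None`, calling `.text` on its result fails with an `AttributeError` when the element is absent, so it is worth checking the result on pages you do not control.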

## Scraping similar elements

There is only one title on this page, but in practice we usually target data that repeats in some pattern.

#### Exercise 3: Print all the quotes  
Scrape all the quotes on the first page of this website and print them without the quotation marks at the beginning and end.

In [7]:
#the following finds the first element whose class is "text"
soup.find(class_ = "text").text

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [8]:
#the following iterates through all span tags with the text class
#and strips surrounding whitespace from each one's text
for quote in soup.select("span.text"):
    quote = quote.text.strip()
    print(quote)


“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”
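
The output above still carries the curly quotation marks, because `strip()` with no arguments only removes whitespace. One possible way to drop them, sketched on an inline snippet (the `html` string is a stand-in built from the last quote above), is to pass the two quote characters to a second `strip` call:

```python
from bs4 import BeautifulSoup

# Stand-in markup for a single quote span (an assumption, so this runs offline).
html = (
    '<span class="text" itemprop="text">\n'
    " “A day without sunshine is like, you know, night.”\n"
    "</span>"
)
demo = BeautifulSoup(html, "html.parser")

cleaned = []
for span in demo.select("span.text"):
    # First strip whitespace, then the curly quotes U+201C and U+201D.
    cleaned.append(span.text.strip().strip("“”"))

print(cleaned)
```

Only the characters at the two ends are removed, so apostrophes or quotation marks inside a quote are left alone.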


## Python: Creating Better Functions  

We found a pattern! Each quote has a class="text" attribute. But is scraping the quotes directly with this pattern the best approach? Well, maybe.  

It might not be if we also want to scrape other data from each tile on the page.

#### Exercise 4: Write a function to scrape a tile  
Each quote is contained in a tile which holds information about the author and some tags that relate to the quote.  
Create a function that takes one tile and returns its quote, author, and tags.

In [15]:
from bs4 import Tag


def scrape_tile(tile: Tag) -> tuple[str, str, list[str]]:
    # Uncomment to inspect the HTML of a single tile:
    # print(tile.prettify())

    quote = tile.find('span', class_="text").text
    author = tile.find('small', class_="author").text
    # Each tag is an <a class="tag"> link inside the tile
    tags = [tag.text.strip() for tag in tile.find_all('a', class_="tag")]

    return (quote, author, tags)


# Each tile is a <div class="quote">; soup.select("div.quote") would match the same elements
all_div = soup.find_all('div', class_='quote')
for div in all_div:
    data = scrape_tile(div)
    print(data)


('“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'Albert Einstein', ['change', 'deep-thoughts', 'thinking', 'world'])
('“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'J.K. Rowling', ['abilities', 'choices'])
('“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'Albert Einstein', ['inspirational', 'life', 'live', 'miracle', 'miracles'])
('“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'Jane Austen', ['aliteracy', 'books', 'classic', 'humor'])
("“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'Marilyn Monroe', ['be-yourself', 'inspirational'])
('“Try not to become a man of success. Rather become a man of value.”', 'Albert Einstein', ['adulthood', 'success', 'value'])
('“I
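
On messier sites, a tile might be missing one of these elements, and `tile.find(...)` would then return `None`. Below is a hedged sketch of a more defensive variant. The helper `text_or_none` and the sample `html` (including its placeholder author) are hypothetical and not part of the exercise solution.

```python
from typing import Optional

from bs4 import BeautifulSoup, Tag


def text_or_none(tile: Tag, name: str, css_class: str) -> Optional[str]:
    """Return the stripped text of the first match, or None if there is none."""
    found = tile.find(name, class_=css_class)
    return found.get_text(strip=True) if found is not None else None


# Invented sample: the second tile deliberately has no author or tags.
html = """
<div class="quote">
 <span class="text">“Complete tile.”</span>
 <small class="author">Some Author</small>
 <a class="tag" href="/tag/demo/">demo</a>
</div>
<div class="quote">
 <span class="text">“Tile with no author.”</span>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

rows = []
for tile in demo.find_all("div", class_="quote"):
    rows.append({
        "quote": text_or_none(tile, "span", "text"),
        "author": text_or_none(tile, "small", "author"),
        # find_all returns an empty result when there are no tags, so no check is needed.
        "tags": [a.get_text(strip=True) for a in tile.find_all("a", class_="tag")],
    })

print(rows)
```

Returning dictionaries instead of tuples is a design choice here: the keys document what each value means.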

## You did it.

Now write the code for combining the data into a pandas DataFrame or a JSON object.

After you're done, you've completed a baby version of what we will be doing in this project.

In [18]:
import pandas as pd
all_tiles = [scrape_tile(div) for div in all_div]
df = pd.DataFrame(all_tiles, columns=['quotes', 'author', 'tags'])
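
For the JSON route, one possible sketch uses only the standard library. The `all_tiles` list below is a one-row sample in the same `(quote, author, tags)` shape that `scrape_tile` returns, copied from the output above. The key names `quote`, `author`, and `tags` are a choice made for this illustration.

```python
import json

# One sample row in the (quote, author, tags) shape returned by scrape_tile.
all_tiles = [
    (
        "“Try not to become a man of success. Rather become a man of value.”",
        "Albert Einstein",
        ["adulthood", "success", "value"],
    ),
]

# Turn each tuple into a dictionary so the JSON is self-describing.
records = [{"quote": q, "author": a, "tags": t} for q, a, t in all_tiles]

# ensure_ascii=False keeps the curly quotes readable instead of \u escapes.
json_text = json.dumps(records, ensure_ascii=False, indent=2)
print(json_text)
```

To save the result, the same string can be written to a file opened with `encoding="utf-8"`.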