# Intro to Web Scraping

## Process
- Use `requests` to download the page's HTML.
- Read the `text` property of the resulting `response` object to get the HTML as a string.

In [1]:
from requests import get
from bs4 import BeautifulSoup
import os

In [2]:
url = "https://site-to-scrape.glitch.me"

In [3]:
# Some websites reject the default python-requests user agent, so send a custom one
headers = {'User-Agent': 'Codeup Data Science'}
response = get(url, headers=headers)

In [4]:
response

<Response [200]>

In [5]:
response.text

'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <title>Site to Scrape!</title>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    \n    <!-- import the webpage\'s stylesheet -->\n    <link rel="stylesheet" href="/style.css">\n    \n    <!-- import the webpage\'s javascript file -->\n    <script src="/script.js" defer></script>\n  </head>  \n  <body>\n    <header>\n      <h1>This is the header!</h1>\n      <hr>\n    </header>\n    \n    <main>\n      <div>\n        <h1 class="first">\n        This is the main\n        </h1>\n        <h2>\n          This is an h2 of main\n        </h2>\n        <h3>\n          H3 inside of first div inside of main\n        </h3>\n      </div>\n      <div>\n        <h3 class="first">\n          H3 inside of second div inside of main.\n        </h3>\n        <p>\n          Here\'s some text content for us to scrape! 👽\n        </p>\n        

In [6]:
# Parse the response content into a BeautifulSoup object named soup
soup = BeautifulSoup(response.content, 'html.parser')

In [7]:
soup.title

<title>Site to Scrape!</title>

In [8]:
soup.h1

<h1>This is the header!</h1>

In [9]:
# Dot syntax returns only the first matching element
soup.h2

<h2>
          This is an h2 of main
        </h2>

In [11]:
soup.h2.text

'\n          This is an h2 of main\n        '

In [12]:
soup.h2.text.strip()

'This is an h2 of main'
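
Chaining `.text.strip()` works well for a single tag. As a minimal sketch of an alternative, `get_text()` accepts `separator` and `strip` arguments that clean up whitespace across several tags at once. The inline `html` string and the variable names below are made up for illustration and only imitate the structure of the scraped page.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating the indentation of the scraped page
html = (
    "<main><h2>\n   This is an h2 of main\n  </h2>"
    "<p>\n   Here's some text content for us to scrape!\n  </p></main>"
)
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text fragment and drops whitespace-only ones;
# separator=" " joins the remaining fragments with a single space
h2_text = soup.h2.get_text(strip=True)
clean_text = soup.get_text(separator=" ", strip=True)

print(h2_text)
print(clean_text)
```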

In [14]:
soup.h2.text.strip()[-5:]

' main'

In [15]:
type(soup.find_all('h3')[0])

bs4.element.Tag

In [16]:
soup.find_all("h3")[0]

<h3>
          H3 inside of first div inside of main
        </h3>

In [17]:
soup.find_all("h3")[0].text

'\n          H3 inside of first div inside of main\n        '

In [18]:
soup.text

"\n\n\nSite to Scrape!\n\n\n\n\n\n\n\n\n\n\nThis is the header!\n\n\n\n\n\n        This is the main\n        \n\n          This is an h2 of main\n        \n\n          H3 inside of first div inside of main\n        \n\n\n\n          H3 inside of second div inside of main.\n        \n\n          Here's some text content for us to scrape! 👽\n        \n\n          Here's another paragraph of content! ☠️\n        \nClick here to visit my homepage\n\n\n\nThis is the footer\n\n\n\n\n"

In [19]:
list(soup.children)

['html',
 '\n',
 <html lang="en">
 <head>
 <title>Site to Scrape!</title>
 <meta charset="utf-8"/>
 <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
 <meta content="width=device-width, initial-scale=1" name="viewport"/>
 <!-- import the webpage's stylesheet -->
 <link href="/style.css" rel="stylesheet"/>
 <!-- import the webpage's javascript file -->
 <script defer="" src="/script.js"></script>
 </head>
 <body>
 <header>
 <h1>This is the header!</h1>
 <hr/>
 </header>
 <main>
 <div>
 <h1 class="first">
         This is the main
         </h1>
 <h2>
           This is an h2 of main
         </h2>
 <h3>
           H3 inside of first div inside of main
         </h3>
 </div>
 <div>
 <h3 class="first">
           H3 inside of second div inside of main.
         </h3>
 <p>
           Here's some text content for us to scrape! 👽
         </p>
 <p>
           Here's another paragraph of content! ☠️
         </p>
 <a href="https://ryanorsinger.com">Click here to visit my homepage</a>
 </

## Beautiful Soup Methods and Properties

- `soup.title.string` gets the page's title (the text shown in the browser tab), taken from the `<title>` element.
- `soup.prettify()` returns an indented version of the HTML, which is useful to print when you want to inspect the markup.
- `soup.find_all("a")` finds all anchor tags, or all elements matching whatever argument is specified.
- `soup.find("h1")` finds the first matching element.
- `soup.get_text()` gets the text from within a matching piece of soup/HTML, with the tags removed.
- `soup.select()` takes a CSS selector as a string and returns all matching elements, which makes it especially useful.
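
The methods listed above can be tried together without a network request. The following is a small sketch that parses an inline HTML string; the `html` string is a trimmed, hypothetical stand-in for the page scraped earlier, and the variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the scraped page
html = """
<html><head><title>Site to Scrape!</title></head>
<body>
  <h1>This is the header!</h1>
  <main>
    <h3 class="first">
      H3 inside of second div inside of main.
    </h3>
    <p>Here's some text content for us to scrape!</p>
    <a href="https://ryanorsinger.com">Click here to visit my homepage</a>
  </main>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Text of the <title> element
page_title = soup.title.string

# First matching element only
first_h1 = soup.find("h1")

# Every matching element; collect the href attribute of each anchor
links = [a["href"] for a in soup.find_all("a")]

# CSS selector: all <h3> elements with class "first"
first_class_text = [tag.get_text(strip=True) for tag in soup.select("h3.first")]

print(page_title)
print(first_h1.get_text())
print(links)
print(first_class_text)
```

`find` returns a single tag (or `None` when nothing matches), while `find_all` and `select` return list-like collections, so their results are usually looped over or indexed.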