# BeautifulSoup + Requests | Web Scraping in Python

**url:** https://youtu.be/bargNl2WeN4?si=E0WL5UAiwuSFEKb4 **(the lesson)**

**url:** https://youtu.be/q-kbzWjyPak?si=iyAZvlRDGkIyO9SD **(Important! Inspecting Web Pages with HTML)**

**Web scraping for beginners:**
- import the packages
- get all the HTML from the website
- make sure that it's in a usable state

In [3]:
#importing the necessary libraries for web scraping
from bs4 import BeautifulSoup   # For web scraping and analysis of HTML and XML documents
import requests                 # To make HTTP requests to websites

In [4]:
#assign the url we're going to use to a variable
url = 'https://www.scrapethissite.com/pages/forms/'

In [5]:
#send a GET request to the URL
#requests.get() sends the request to that url and returns a Response object
requests.get(url)

<Response [200]>

In [6]:
#we got a response of 200, which means the request succeeded
#other codes such as 204, 400, 401 or 404 usually mean we won't get the content we want
#204: No Content, the request succeeded but the response has no body
#400: Bad Request, the request was invalid, so the server couldn't process it
#401: Unauthorized, the request needs valid authentication
#404: Not Found, the server was reached but the requested page doesn't exist
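As a rough illustration of the status codes mentioned above, here is a small sketch that looks up the official reason phrase for each code with Python's standard `http.HTTPStatus` enum. The helper name `describe_status` is made up for this example; it is not part of requests or of the lesson.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)  # raises ValueError if the code is not a known status
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    else:
        kind = "other"
    return f"{code} {status.phrase} ({kind})"


# the codes discussed in the notes above
for code in (200, 204, 400, 401, 404):
    print(describe_status(code))
```

Note that 204 is technically a success code: the request worked, there is just no body to scrape.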

In [7]:
#assign the response to a variable
page = requests.get(url)

In [8]:
#parse the HTML content of the page using BeautifulSoup
#the 'html' argument tells BeautifulSoup to use an HTML parser ('html.parser' would name Python's built-in parser explicitly)
BeautifulSoup(page.text, 'html')

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>
<link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="description"/>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
<link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
<meta content="noindex" name="robot

In [9]:
#assign it to a variable
soup = BeautifulSoup(page.text, 'html')

In [10]:
#print the BeautifulSoup object, which contains the parsed HTML 
print(soup)

#notice that there is no indentation showing the hierarchy here, compared to when we use prettify() in the next cell

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>
<link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="description"/>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
<link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
<meta content="noindex" name="robot

In [11]:
#print a prettified (formatted) version of the parsed HTML for better readability
#prettify(): indents each tag by its nesting depth, so the hierarchy of the html is easier to see than in the previous output
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping
  </title>
  <link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="description"/>
  <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
  <link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
  <link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
  <meta con

**Detailed Explanation of the code:**

**1. Import Statements:**

- **import requests:** This library is used to send HTTP requests. In this case, it makes a GET request to retrieve the content of a webpage.

- **from bs4 import BeautifulSoup:** This imports the BeautifulSoup class from the bs4 module, which is used to parse HTML and XML documents.

**2. GET Request:**

- **page = requests.get(url):** This line sends a GET request to the URL specified in the variable url and stores the server's response in the variable page.
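The GET request above can be wrapped in a small function that also checks the result. This is only a sketch: the name `fetch_html` and the 10-second timeout are assumptions made for illustration, not something the lesson uses.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML as text (hypothetical helper)."""
    # timeout is in seconds; 10 is an arbitrary choice for this sketch
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError if the server answered with a 4xx or 5xx code
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(url)` with the lesson's url should give the same text as `page.text`, but it fails loudly instead of silently returning an error page.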

**3. Parsing HTML:**

- **soup = BeautifulSoup(page.text, 'html'):** The HTML content of the page (stored in page.text) is parsed into a BeautifulSoup object. This makes it easy to navigate and search through the HTML tree.
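To try the parsing step without hitting a website, here is a minimal sketch that parses a tiny, made-up HTML string instead of `page.text`. It passes `'html.parser'`, which names Python's built-in parser explicitly; that is a swap from the `'html'` argument used in the lesson.

```python
from bs4 import BeautifulSoup

# a tiny, made-up HTML document standing in for page.text
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"

# 'html.parser' is Python's built-in parser, so no extra install is needed
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)  # the parsed document is a BeautifulSoup object
print(soup.title)           # tags in the tree can be reached as attributes
```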

**4. Printing Responses:**

- **print(page):** This prints the response object, which is displayed as <Response [200]> and so shows the status code.
- **print(soup):** This prints the parsed HTML without indentation, so it is harder to read.
- **print(soup.prettify()):** This generates a nicely formatted string of the parsed HTML, making it easier to read.
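The difference between the two ways of printing the soup is easier to see on a small example. The sketch below uses a made-up one-line HTML string (an assumption for illustration, not the lesson's page) and compares how many lines each version produces.

```python
from bs4 import BeautifulSoup

# made-up HTML written on a single line, standing in for page.text
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

flat = str(soup)          # the text that print(soup) displays
pretty = soup.prettify()  # one tag per line, indented by nesting depth

print(len(flat.splitlines()), "line(s) without prettify()")
print(len(pretty.splitlines()), "line(s) with prettify()")
print(pretty)
```

prettify() is meant for reading the HTML; it adds whitespace, so it is better to keep working with the soup object itself rather than the prettified string.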

**In the next lesson:**

We're going to learn how to query this html to pull out specific information, and how to read its structure so that we get exactly the information we need.