# Find and Find_All | Web Scraping in Python

**url :** https://youtu.be/xjA1HjvmoMY?si=uDg_g6quTSL1WBYc

**Web scraping for beginners: how to use Find and Find_All:**

We are going to use BeautifulSoup, a Python library for web scraping, to find various HTML elements and extract text from them. The .find() and .find_all() methods retrieve elements selectively by tag name (such as div, p, or th) and by attributes (such as class names).

In [4]:
#importing the necessary libraries for web scraping
from bs4 import BeautifulSoup   # For web scraping and analysis of HTML and XML documents
import requests                 # To make HTTP requests to websites

In [5]:
#store the url we're going to use in a variable
url = 'https://www.scrapethissite.com/pages/forms/'

In [6]:
#sending a GET request to the URL
#requests.get() sends a GET request to that URL and returns a Response object
requests.get(url)

<Response [200]>

In [7]:
#we got a response of 200, which means the request succeeded
#other codes such as 204, 400, 401 or 404 usually need attention:
#204: the request succeeded, but the response has no content
#400: bad request; the server could not process it because the request was invalid
#401: unauthorized; the request needs valid authentication
#404: not found; the server could not find the requested page
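
The status-code notes above can be turned into a small check. The sketch below uses only the standard library's `http.HTTPStatus`; the helper name `describe_status` is hypothetical and is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return a short label for an HTTP status code (hypothetical helper)."""
    # HTTPStatus(code) raises ValueError for codes it does not know
    phrase = HTTPStatus(code).phrase
    if 200 <= code < 300:
        kind = "success"
    elif 400 <= code < 500:
        kind = "client error"
    elif 500 <= code < 600:
        kind = "server error"
    else:
        kind = "other"
    return f"{code} {phrase} ({kind})"

for code in (200, 204, 400, 401, 404):
    print(describe_status(code))
```

With `requests`, `page.status_code` holds the integer to pass in, and `page.raise_for_status()` raises an exception for 4xx and 5xx responses.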

In [8]:
#store the response in a variable
page = requests.get(url)

In [9]:
#parse the HTML content of the page using BeautifulSoup
#the second argument names the parser to use; 'html' asks BeautifulSoup for an HTML parser
BeautifulSoup(page.text, 'html')

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>
<link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="description"/>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
<link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
<meta content="noindex" name="robot

In [10]:
#store it in a variable
soup = BeautifulSoup(page.text, 'html')

In [11]:
#print the BeautifulSoup object, which contains the parsed HTML 
print(soup)

#notice that the output is not indented to show the hierarchy; soup.prettify() would format it that way

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>
<link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="description"/>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
<link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
<meta content="noindex" name="robot

**Detailed Explanation of the code:**

**1. Import Statements:**

- **import requests:** This library is used to send HTTP requests. In this case, it makes a GET request to retrieve the content of a webpage.

- **from bs4 import BeautifulSoup:** This imports the BeautifulSoup class from the bs4 module, which is used to parse HTML and XML documents.

**2. GET Request:**

- **page = requests.get(url):** This line sends a GET request to the URL specified in the variable url and stores the server's response in the variable page.

**3. Parsing HTML:**

- **soup = BeautifulSoup(page.text, 'html'):** The HTML content of the page (stored in page.text) is parsed into a BeautifulSoup object. This lets us navigate and search through the HTML tree easily.

**4. Printing Responses:**

- **print(page):** This prints the response object, which displays as <Response [200]> and so shows the status code.
- **print(soup):** Prints the parsed HTML.
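
To try the parsing step without a network request, here is a minimal sketch that parses a small hand-written HTML string (made up for illustration, not the real page) and contrasts `print(soup)` with `soup.prettify()`. It names the parser explicitly as `'html.parser'`, Python's built-in parser, which is an assumption that differs from the `'html'` argument used in the cells above.

```python
from bs4 import BeautifulSoup

# Made-up HTML snippet standing in for page.text
html = "<html><body><div class='container'><p class='lead'>Hello</p></div></body></html>"

soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)   # BeautifulSoup
print(soup)                  # markup on a single line, like the source string
print(soup.prettify())       # same markup, one tag per line, indented by nesting depth
```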

In [13]:
#find the first <div> element in the BeautifulSoup object (the parsed HTML)
soup.find('div')

<div class="container">
<div class="col-md-12">
<ul class="nav nav-tabs">
<li id="nav-homepage">
<a class="nav-link hidden-sm hidden-xs" href="/">
<img id="nav-logo" src="/static/images/scraper-icon.png"/>
                                Scrape This Site
                            </a>
</li>
<li id="nav-sandbox">
<a class="nav-link" href="/pages/">
<i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
                                Sandbox
                            </a>
</li>
<li id="nav-lessons">
<a class="nav-link" href="/lessons/">
<i class="glyphicon glyphicon-education hidden-sm hidden-xs"></i>
                                Lessons
                            </a>
</li>
<li id="nav-faq">
<a class="nav-link" href="/faq/">
<i class="glyphicon glyphicon-flag hidden-sm hidden-xs"></i>
                                FAQ
                            </a>
</li>
<li class="pull-right" id="nav-login">
<a class="nav-link" href="/login/">
                                Login

In [14]:
#find all <div> elements in the BeautifulSoup object and return them as a list
soup.find_all('div')

[<div class="container">
 <div class="col-md-12">
 <ul class="nav nav-tabs">
 <li id="nav-homepage">
 <a class="nav-link hidden-sm hidden-xs" href="/">
 <img id="nav-logo" src="/static/images/scraper-icon.png"/>
                                 Scrape This Site
                             </a>
 </li>
 <li id="nav-sandbox">
 <a class="nav-link" href="/pages/">
 <i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
                                 Sandbox
                             </a>
 </li>
 <li id="nav-lessons">
 <a class="nav-link" href="/lessons/">
 <i class="glyphicon glyphicon-education hidden-sm hidden-xs"></i>
                                 Lessons
                             </a>
 </li>
 <li id="nav-faq">
 <a class="nav-link" href="/faq/">
 <i class="glyphicon glyphicon-flag hidden-sm hidden-xs"></i>
                                 FAQ
                             </a>
 </li>
 <li class="pull-right" id="nav-login">
 <a class="nav-link" href="/login/">
        

In [15]:
#find all <div> elements in the BeautifulSoup object with the class "col-md-12" and return them as a ResultSet
soup.find_all('div', class_="col-md-12")

[<div class="col-md-12">
 <ul class="nav nav-tabs">
 <li id="nav-homepage">
 <a class="nav-link hidden-sm hidden-xs" href="/">
 <img id="nav-logo" src="/static/images/scraper-icon.png"/>
                                 Scrape This Site
                             </a>
 </li>
 <li id="nav-sandbox">
 <a class="nav-link" href="/pages/">
 <i class="glyphicon glyphicon-console hidden-sm hidden-xs"></i>
                                 Sandbox
                             </a>
 </li>
 <li id="nav-lessons">
 <a class="nav-link" href="/lessons/">
 <i class="glyphicon glyphicon-education hidden-sm hidden-xs"></i>
                                 Lessons
                             </a>
 </li>
 <li id="nav-faq">
 <a class="nav-link" href="/faq/">
 <i class="glyphicon glyphicon-flag hidden-sm hidden-xs"></i>
                                 FAQ
                             </a>
 </li>
 <li class="pull-right" id="nav-login">
 <a class="nav-link" href="/login/">
                                 

In [16]:
#find all <p> elements in the BeautifulSoup object (the parsed HTML)
soup.find_all('p')

[<p class="lead">
                             Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.
                             Take a look at how pagination and search elements change the URL as your browse. Build a web scraper that can conduct searches and paginate through the results.
                         </p>,
 <p>
 <i class="glyphicon glyphicon-education"></i> There are <a href="/lessons/">8 video lessons</a> that show you how to scrape this page.
                         </p>,
 <p>
                             
                                 Data via
                                 <a class="data-attribution" href="http://www.opensourcesports.com/hockey/" target="_blank">http://www.opensourcesports.com/hockey/</a>
 </p>]

In [17]:
# Find all <p> elements with the class 'lead' in the BeautifulSoup object (the parsed HTML) and return them as a ResultSet  
soup.find_all('p', class_='lead')

[<p class="lead">
                             Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.
                             Take a look at how pagination and search elements change the URL as your browse. Build a web scraper that can conduct searches and paginate through the results.
                         </p>]

In [18]:
# Get the text content of all <p> elements with class "lead"  
# Attempt to get the .text attribute of all <p> elements found with class 'lead', which will cause an error (ResultSet object has no attribute 'text')
soup.find_all('p', class_='lead').text

AttributeError: ResultSet object has no attribute 'text'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?

In [28]:
'''
The line soup.find_all('p', class_='lead').text will throw an AttributeError because:
find_all() returns a list rather than a single element.
To get text from all "lead" paragraphs, you would need to loop through them or join the text properly.
'''


In [29]:
# Find the first <p> element with the class 'lead', and retrieve its text content
soup.find('p', class_='lead').text

'\n                            Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.\n                            Take a look at how pagination and search elements change the URL as your browse. Build a web scraper that can conduct searches and paginate through the results.\n                        '

In [30]:
# Find the first <p> element with the class 'lead', retrieve its text content, and remove leading/trailing whitespace  
soup.find('p', class_='lead').text.strip()

'Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.\n                            Take a look at how pagination and search elements change the URL as your browse. Build a web scraper that can conduct searches and paginate through the results.'

In [31]:
# Find all <th> (table header) elements in the BeautifulSoup object (the parsed HTML)
soup.find_all('th')

[<th>
                             Team Name
                         </th>,
 <th>
                             Year
                         </th>,
 <th>
                             Wins
                         </th>,
 <th>
                             Losses
                         </th>,
 <th>
                             OT Losses
                         </th>,
 <th>
                             Win %
                         </th>,
 <th>
                             Goals For (GF)
                         </th>,
 <th>
                             Goals Against (GA)
                         </th>,
 <th>
                             + / -
                         </th>]

In [32]:
# Find the first <th> element in the BeautifulSoup object (the parsed HTML)
soup.find('th')

<th>
                            Team Name
                        </th>

In [33]:
# Find the first <th> element in the BeautifulSoup object (the parsed HTML), and retrieve its text content
soup.find('th').text

'\n                            Team Name\n                        '

In [34]:
# Find the first <th> element in the BeautifulSoup object (the parsed HTML), retrieve its text content,
# and remove leading/trailing whitespace
soup.find('th').text.strip()

'Team Name'

In [35]:
# Find all <td> (table data cell) elements in the BeautifulSoup object (the parsed HTML)
soup.find_all('td')

[<td class="name">
                             Boston Bruins
                         </td>,
 <td class="year">
                             1990
                         </td>,
 <td class="wins">
                             44
                         </td>,
 <td class="losses">
                             24
                         </td>,
 <td class="ot-losses">
 </td>,
 <td class="pct text-success">
                             0.55
                         </td>,
 <td class="gf">
                             299
                         </td>,
 <td class="ga">
                             264
                         </td>,
 <td class="diff text-success">
                             35
                         </td>,
 <td class="name">
                             Buffalo Sabres
                         </td>,
 <td class="year">
                             1990
                         </td>,
 <td class="wins">
                             31
                         </td>,
 

**Detailed Explanation of the Code:**

**1. Finding Elements:**

- **soup.find('div'):** This line retrieves the first **div** element found in the parsed HTML. **If there are multiple div elements, only the first one will be returned**. It's useful when you only need one occurrence.

- **soup.find_all('div'):** This retrieves all div elements and returns them as a list. We use this when we need to work with multiple elements of the same type.
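
The difference in return types can be checked on a small made-up snippet (the `id` values are invented for illustration). This is a sketch, assuming the built-in `'html.parser'`:

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration
html = """
<div id="first"><div id="inner">nested</div></div>
<div id="second">sibling</div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div")          # a single Tag: the first match in document order
all_divs = soup.find_all("div")   # a list-like ResultSet of every match, nested ones included

print(first["id"])                    # first
print([d["id"] for d in all_divs])    # ['first', 'inner', 'second']

# When nothing matches, find() returns None and find_all() returns an empty ResultSet
print(soup.find("table"))             # None
print(len(soup.find_all("table")))    # 0
```

Because `find()` can return `None`, it is worth checking the result before reading `.text` from it.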

**2. Filtering by Class:**

- **soup.find_all('div', class_="col-md-12"):** This works like soup.find_all('div'), but it only returns div elements that have the specified class. Since classes usually mark sections of a page that are styled a certain way, filtering by class is a common way to target one kind of section.
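
As a hedged illustration of class filtering, the sketch below uses a made-up table row; the class names `pct` and `text-success` appear in the page output above, while `text-danger` and the value `0.425` are invented. An element with several classes matches if any one of them equals the value passed to `class_`.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration
html = """
<table>
  <tr>
    <td class="name">Boston Bruins</td>
    <td class="pct text-success">0.55</td>
    <td class="pct text-danger">0.425</td>
  </tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ has a trailing underscore because `class` is a reserved word in Python
pct_cells = soup.find_all("td", class_="pct")
print([td.text for td in pct_cells])                   # ['0.55', '0.425']

# The same filter written as an attribute dictionary
same_cells = soup.find_all("td", attrs={"class": "pct"})
print([td.text for td in same_cells])                  # ['0.55', '0.425']

# Filtering on the second class of a multi-class element also works
print(len(soup.find_all("td", class_="text-success"))) # 1
```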

**3. Working with Paragraphs:**

- **soup.find_all('p'):** Retrieves all p elements, which lets us gather every paragraph in the HTML.

- **soup.find_all('p', class_='lead'):** This line searches for all p (paragraph) elements that have a class of "lead", which is often used for prominent introductory text. It returns them as a ResultSet, a list-like object containing all matching paragraphs.

**4. Text Retrieval:**

- **soup.find_all('p', class_='lead').text:** This line attempts to access the .text attribute of the ResultSet returned by find_all. However, this will raise an AttributeError because a ResultSet does not have a .text attribute directly. Instead, you would need to iterate through the ResultSet or access individual elements first.

- **soup.find('p', class_='lead').text:** Retrieves the text from the first paragraph with the "lead" class.

- **soup.find('p', class_='lead').text.strip():** Here, the code finds the first p element with the class 'lead', retrieves its text using the .text attribute, and then applies .strip() to remove any leading or trailing whitespace. This is useful for cleaning up the text content before further processing or displaying it.
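
One way to get text from every match, rather than only the first, is to loop over the ResultSet. The sketch below uses made-up paragraphs and also shows that `.strip()` trims only the ends of the string, so inner line breaks and indentation remain unless they are collapsed separately.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration
html = """
<p class="lead">
    First line of the lead.
    Second line of the lead.
</p>
<p class="lead">  Another lead paragraph.  </p>
"""
soup = BeautifulSoup(html, "html.parser")

leads = soup.find_all("p", class_="lead")

# Loop over the ResultSet and read .text from each element
stripped = [p.text.strip() for p in leads]

# split() + join() additionally collapses inner runs of whitespace into single spaces
normalized = [" ".join(p.text.split()) for p in leads]

print(stripped[1])      # Another lead paragraph.
print(normalized[0])    # First line of the lead. Second line of the lead.
```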

**5. Table Data:**

- **soup.find_all('td'):** Retrieves all td elements, which are the data cells in tables. This helps extract all the data content in a structured way.

**6. Table Header Elements:**

- **soup.find_all('th'):** This line retrieves all th (table header) elements from the soup object, similar to how p elements were retrieved. This will return all headers in any tables present in the HTML.

- **soup.find('th').text.strip():** This searches for the first th element in the document, retrieves its text, and then strips leading and trailing whitespace from it. This is commonly used to access and clean up header information in a table.
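
Putting the header and cell lookups together, a minimal sketch might collect clean header names and group the cells by row. The snippet below is a hand-written table that reuses a few values from the output above; the assumption that each team sits in its own `<tr>` row has not been checked against the real page.

```python
from bs4 import BeautifulSoup

# Hand-written table for illustration; row structure is an assumption
html = """
<table>
  <tr><th> Team Name </th><th> Year </th><th> Wins </th></tr>
  <tr><td class="name"> Boston Bruins </td><td class="year"> 1990 </td><td class="wins"> 44 </td></tr>
  <tr><td class="name"> Buffalo Sabres </td><td class="year"> 1990 </td><td class="wins"> 31 </td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# Clean header names from every <th>
headers = [th.text.strip() for th in soup.find_all("th")]

# Group <td> cells by their parent <tr>
rows = []
for tr in soup.find_all("tr"):
    cells = [td.text.strip() for td in tr.find_all("td")]
    if cells:   # the header row has no <td>, so it yields an empty list and is skipped
        rows.append(cells)

print(headers)   # ['Team Name', 'Year', 'Wins']
print(rows)      # [['Boston Bruins', '1990', '44'], ['Buffalo Sabres', '1990', '31']]
```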

**In the next lesson:**

We'll try to pull all the information in the table into a dataframe and then use pandas to search and manipulate that data.