# Web Scraping

![legtsgo](https://media.giphy.com/media/dwmNhd5H7YAz6/giphy.gif)

By the end of this lesson, you will be able to scrape data from a static web page using the **requests** and **Beautiful Soup** libraries, and export that data into a structured text file using the **pandas** library.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Web-Scraping" data-toc-modified-id="Web-Scraping-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Web Scraping</a></span><ul class="toc-item"><li><span><a href="#What-is-Web-Scraping" data-toc-modified-id="What-is-Web-Scraping-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>What is Web Scraping</a></span></li><li><span><a href="#Web-structure" data-toc-modified-id="Web-structure-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Web structure</a></span></li><li><span><a href="#HTML" data-toc-modified-id="HTML-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>HTML</a></span><ul class="toc-item"><li><span><a href="#Exploring-Web-Page-Structures" data-toc-modified-id="Exploring-Web-Page-Structures-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Exploring Web Page Structures</a></span></li><li><span><a href="#Fact-1:-HTML-is-Built-on-Tags" data-toc-modified-id="Fact-1:-HTML-is-Built-on-Tags-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Fact 1: HTML is Built on Tags</a></span></li><li><span><a href="#Fact-2:-Tags-Can-Have-Attributes" data-toc-modified-id="Fact-2:-Tags-Can-Have-Attributes-1.3.3"><span class="toc-item-num">1.3.3&nbsp;&nbsp;</span>Fact 2: Tags Can Have Attributes</a></span></li><li><span><a href="#Fact-3:-Tags-Can-Be-Nested" data-toc-modified-id="Fact-3:-Tags-Can-Be-Nested-1.3.4"><span class="toc-item-num">1.3.4&nbsp;&nbsp;</span>Fact 3: Tags Can Be Nested</a></span></li><li><span><a href="#Selecting-Specific-Elements-in-Web-Scraping" data-toc-modified-id="Selecting-Specific-Elements-in-Web-Scraping-1.3.5"><span class="toc-item-num">1.3.5&nbsp;&nbsp;</span>Selecting Specific Elements in Web Scraping</a></span></li></ul></li><li><span><a href="#Web-Scraping-with-Python" data-toc-modified-id="Web-Scraping-with-Python-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Web Scraping with Python</a></span><ul class="toc-item"><li><span><a 
href="#Requests:-Fetching-a-Web-Page" data-toc-modified-id="Requests:-Fetching-a-Web-Page-1.4.1"><span class="toc-item-num">1.4.1&nbsp;&nbsp;</span>Requests: Fetching a Web Page</a></span></li><li><span><a href="#Parsing-HTML-with-Beautiful-Soup" data-toc-modified-id="Parsing-HTML-with-Beautiful-Soup-1.4.2"><span class="toc-item-num">1.4.2&nbsp;&nbsp;</span>Parsing HTML with Beautiful Soup</a></span><ul class="toc-item"><li><span><a href="#Extracting-Data" data-toc-modified-id="Extracting-Data-1.4.2.1"><span class="toc-item-num">1.4.2.1&nbsp;&nbsp;</span>Extracting Data</a></span></li><li><span><a href="#More-filtering-options" data-toc-modified-id="More-filtering-options-1.4.2.2"><span class="toc-item-num">1.4.2.2&nbsp;&nbsp;</span>More filtering options</a></span></li><li><span><a href="#Creating-a-DataFrame-with-the-data" data-toc-modified-id="Creating-a-DataFrame-with-the-data-1.4.2.3"><span class="toc-item-num">1.4.2.3&nbsp;&nbsp;</span>Creating a DataFrame with the data</a></span></li><li><span><a href="#💡-Check-for-understanding" data-toc-modified-id="💡-Check-for-understanding-1.4.2.4"><span class="toc-item-num">1.4.2.4&nbsp;&nbsp;</span>💡 Check for understanding</a></span></li><li><span><a href="#Scraping-many-pages" data-toc-modified-id="Scraping-many-pages-1.4.2.5"><span class="toc-item-num">1.4.2.5&nbsp;&nbsp;</span>Scraping many pages</a></span></li><li><span><a href="#CSS-selectors" data-toc-modified-id="CSS-selectors-1.4.2.6"><span class="toc-item-num">1.4.2.6&nbsp;&nbsp;</span>CSS selectors</a></span></li></ul></li><li><span><a href="#More-examples" data-toc-modified-id="More-examples-1.4.3"><span class="toc-item-num">1.4.3&nbsp;&nbsp;</span>More examples</a></span><ul class="toc-item"><li><span><a href="#BBC" data-toc-modified-id="BBC-1.4.3.1"><span class="toc-item-num">1.4.3.1&nbsp;&nbsp;</span>BBC</a></span></li></ul></li></ul></li><li><span><a href="#Comments" data-toc-modified-id="Comments-1.5"><span 
class="toc-item-num">1.5&nbsp;&nbsp;</span>Comments</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Summary</a></span></li><li><span><a href="#Further-materials" data-toc-modified-id="Further-materials-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Further materials</a></span><ul class="toc-item"><li><span><a href="#How-to-Solve-a-403-Error" data-toc-modified-id="How-to-Solve-a-403-Error-1.7.1"><span class="toc-item-num">1.7.1&nbsp;&nbsp;</span>How to Solve a 403 Error</a></span></li></ul></li></ul></li></ul></div>

## What is Web Scraping

Web scraping is a method employed by data analysts and developers to retrieve information from web pages. It involves fetching a web page and then parsing that page to obtain desired information. This technique is especially useful when the desired data isn't available through APIs. The extracted data can then be cleaned, analyzed, or stored in databases for further data analytics tasks. 

## Web structure

The fundamental web technologies that form the structure of the websites we aim to scrape are:

- **HTML**: The core markup language of the web. It holds the content and structure of a page, and it is where almost all the data we scrape lives.

- **CSS**: The stylesheet language that works alongside HTML and controls how HTML elements are displayed (layout, colors, fonts).

- **JavaScript**: The programming language that makes pages interactive. It can change a page's content after the page has loaded, so what you see in the browser may differ from the HTML first sent by the server.

In this lesson, we will work with the HTML from the websites.

## HTML

In the realm of web scraping, understanding HTML (Hypertext Markup Language) is crucial.

HTML is the standard markup language used to create web pages. Think of it as the skeleton or blueprint of a website. It structures content on the web, defining elements like paragraphs, headings, links, lists, and images. These elements are represented by "tags", which enclose content to give it meaning and context.

When web scraping, you'll often navigate through this HTML structure to pinpoint and extract the exact data you need. Tools like web browsers' "Inspect" or "View Source" features allow you to see the underlying HTML of a page, which is invaluable when determining how to access specific pieces of content programmatically.

![image.png](attachment:e4229c35-3852-4b9c-95c2-ac4ecbfdd581.png)

### Exploring Web Page Structures

To inspect the underlying HTML of a web page, right-click anywhere on the page. Choose "View Page Source" in browsers like Chrome or Firefox. For Internet Explorer, it's "View Source," and for Safari, select "Show Page Source." (In Safari, if this option isn't visible, navigate to Safari Preferences, click on the Advanced tab, and enable "Show Develop menu in menu bar.")

To embark on your web scraping journey, you just need to grasp **three foundational aspects** of HTML.


### Fact 1: HTML is Built on Tags

At its core, HTML is text content wrapped in tags, which are delimited by angle brackets (`<tag>`). The text between the tags is usually what we want to scrape, while the tags themselves provide structure and meaning and tell the browser how to display the content.

HTML follows a tree-like structure, encompassing parent tags, child tags, and sibling tags:
```
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
```

For instance, consider the `<strong>` tag, signaling bold formatting. If "Jan. 21" is encapsulated between an opening `<strong>` tag and its corresponding closing `</strong>` tag, it denotes where the bold styling begins and ends. This pair of tags instructs the browser to render the enclosed text, "Jan. 21", in bold.

Tags come in various types, each suited to encapsulate specific content:
 * **Headings**: `<h1>`, `<h2>`, `<h3>`, `<hgroup>`...
 * **Phrasing**: `<b>`, `<img>`, `<sub>`...
 * **Embedded Content**: `<audio>`, `<img>`, `<video>`...
 * **Tabulated Data**: `<table>`, `<tr>`, `<tbody>`...
 * **Page Sections**: `<header>`, `<section>`, `<article>`...
 * **Metadata and Scripts**: `<meta>`, `<title>`, `<script>`...
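
The parent, child, and sibling relationships mentioned above can be checked in code. The sketch below parses the same small page with Beautiful Soup, the parsing library this lesson introduces later, and is only meant to make the tree structure concrete.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <head>
        <title>Page Title</title>
    </head>
    <body>
        <h1>My First Heading</h1>
        <p>My first paragraph.</p>
    </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
heading = soup.find("h1")

# <body> is the parent of <h1>
parent_name = heading.parent.name

# <p> is a sibling of <h1>: both sit directly inside <body>
sibling_name = heading.find_next_sibling("p").name

# The direct child tags of <body>, in document order
child_names = [child.name for child in soup.body.find_all(True, recursive=False)]

print(parent_name)   # body
print(sibling_name)  # p
print(child_names)   # ['h1', 'p']
```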


### Fact 2: Tags Can Have Attributes

HTML tags can possess "attributes," which are defined within the opening tag itself. 

Examine the following examples:

- `<span class="short-desc">`: Here, the `<span>` tag has a `class` attribute with the value "short-desc".
- `<div> Zapas Marca Joma X54 </div>`: This tag doesn't contain any attributes.
- `<div class="price-item" id="offer"> Zapas Marca Joma X54 </div>`: The `div` tag here has two attributes - `class` with the value "price-item" and `id` with the value "offer".
- `<div class="text-monospace" id="name_132" href="www.example.com"> Page Content </div>`: This `div` tag encompasses the following attributes:
    + **class**: With the value "text-monospace". Remember, the class isn't unique across the page.
    + **id**: With the value "name_132". IDs are meant to be unique identifiers for tags on the page.
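    + **href**: With the value "www.example.com". The `href` attribute holds a link to another section of the page or to an external website. In real pages it normally appears on `<a>` tags rather than on a `<div>`; it is used here only to show a tag with several attributes.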

**Key Notes**:
- The `id` attribute should be unique for a tag; no two tags should share the same `id`.
- The `class` attribute isn't meant to be unique. Instead, it often groups tags exhibiting similar behavior or styles.

For web scraping purposes, **understanding the semantics** behind terms like `<span>`, `class`, or `short-desc` **isn't crucial**. The key takeaway is recognizing that tags can possess attributes and understanding their structural representation. When extracting content, our goal is to pinpoint the right tags within a webpage's HTML.
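
As a small illustration of how attributes look from the scraping side, the sketch below parses one of the example tags above with Beautiful Soup (introduced later in this lesson) and reads its attributes. It is a minimal sketch, not the only way to do this.

```python
from bs4 import BeautifulSoup

html = '<div class="price-item" id="offer"> Zapas Marca Joma X54 </div>'
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

# All attributes of the tag, as a dictionary.
# class comes back as a list because a tag can carry several classes at once.
attributes = div.attrs

# A single attribute can be read with dictionary-style access
tag_id = div["id"]
classes = div["class"]

print(attributes)  # {'class': ['price-item'], 'id': 'offer'}
print(tag_id)      # offer
print(classes)     # ['price-item']
```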

**Other commonly used attributes in HTML**

Several attributes in HTML provide additional information or modify elements. Some of these frequently used attributes include:

 * **`dir`**: Determines the text direction within an element: left-to-right (`ltr`) or right-to-left (`rtl`).
 * **`lang`**: Designates the language of the content within an element.
 * **`style`**: Applies inline styling to an element (Note: This shouldn't be mixed up with the `<style>` tag).
 * **`title`**: Offers supplementary details about an element, often displayed as a tooltip (Important: This is distinct from the `<title>` tag).

...and many more.




### Fact 3: Tags Can Be Nested

Imagine the following segment of HTML code:

`Hello <strong><em>Ironhack</em> students</strong>`

Here, the phrase **Ironhack students** would be displayed in bold since it resides between the `<strong>` and `</strong>` tags. Additionally, the word ***Ironhack*** would be italicized due to the `<em>` tag, which signifies italic formatting. However, the word "Hello" remains unaffected by any formatting, as it lies outside both the `<strong>` and `<em>` tags. This results in the display:

Hello ***Ironhack* students**

This example illustrates a key principle: **tags influence the text from their opening to their closing points,** even if they are nested within other tags.
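
The same principle shows up when extracting text. In the sketch below (using Beautiful Soup, which is introduced later in this lesson), asking an outer tag for its text also returns the text of the tags nested inside it.

```python
from bs4 import BeautifulSoup

html = "Hello <strong><em>Ironhack</em> students</strong>"
soup = BeautifulSoup(html, "html.parser")

strong_text = soup.find("strong").get_text()  # text inside <strong>, nested <em> included
em_text = soup.find("em").get_text()          # only the text inside <em>
full_text = soup.get_text()                   # all text in the snippet

print(strong_text)  # Ironhack students
print(em_text)      # Ironhack
print(full_text)    # Hello Ironhack students
```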

### Selecting Specific Elements in Web Scraping

When web scraping, it's essential to target specific elements precisely. To home in on the exact content you need, consider filtering tags based on:
 
 * **Tag Name**: The main type of the element (e.g., `<div>`, `<a>`, `<p>`).
 * **Class**: A descriptor that groups multiple elements with similar characteristics.
 * **ID**: A unique identifier assigned to a particular element.
 * **Other Attributes**: Additional properties like `href`, `title`, or `lang` that can further specify the elements of interest.
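
As a preview, here is roughly how each of these four filters looks with Beautiful Soup, the library covered in the next section. The HTML snippet is made up for this example and does not come from a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for illustration
html = """
<div id="offer" class="price-item">Zapas Marca Joma X54</div>
<div class="price-item">Another product</div>
<a href="https://www.example.com" title="Example link">Example</a>
<p lang="es">Hola</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.find_all("div")                  # filter by tag name
by_class = soup.find_all(class_="price-item")  # filter by class
by_id = soup.find(id="offer")                  # filter by id
by_attr = soup.find_all(attrs={"lang": "es"})  # filter by any other attribute

print(len(by_tag), len(by_class))  # 2 2
print(by_id.get_text())            # Zapas Marca Joma X54
print(by_attr[0].get_text())       # Hola
```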


## Web Scraping with Python

In this lesson, we'll use the `requests` library to fetch web pages and `BeautifulSoup` from the `bs4` package to parse these pages and extract information.

Ensure you've installed the required packages:

In [None]:
#!pip install requests beautifulsoup4

### Requests: Fetching a Web Page


First, we use the `requests` library to fetch the content of a webpage.

In [37]:
import requests

url = "https://www.decathlon.com/collections/mountain-bikes"
response = requests.get(url)
response

<Response [200]>

The code above retrieves the web page at the given URL and stores the result in a `response` object. The object exposes the page body in two forms: `response.content` holds the raw bytes, and `response.text` holds the same data decoded into a string. Both contain the HTML we see when viewing the page source in a browser.
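
To see why the body is available both as bytes and as a string, the sketch below mimics the decoding step by hand, without any network request. The byte string is a made-up stand-in for `response.content`, and UTF-8 is assumed as the encoding.

```python
# Made-up stand-in for response.content: the raw bytes a server might send
raw_bytes = "<p>Café y bicicletas</p>".encode("utf-8")

# Roughly what response.text provides: the same bytes decoded into a string
decoded_text = raw_bytes.decode("utf-8")

print(type(raw_bytes))     # <class 'bytes'>
print(type(decoded_text))  # <class 'str'>

# The byte count is larger than the character count because "é" takes two bytes in UTF-8
print(len(raw_bytes) > len(decoded_text))  # True
```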

In [38]:
response.content



In [39]:
response.headers  # Response headers (a dictionary-like object with case-insensitive keys)

{'Date': 'Wed, 06 Sep 2023 09:34:28 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'X-Sorting-Hat-PodId': '61', 'X-Sorting-Hat-ShopId': '13306287', 'X-Storefront-Renderer-Rendered': '1', 'Set-Cookie': 'secure_customer_sig=; path=/; expires=Fri, 06 Sep 2024 09:34:28 GMT; secure; HttpOnly; SameSite=Lax, localization=US; path=/; expires=Fri, 06 Sep 2024 09:34:28 GMT, cart_currency=USD; path=/; expires=Wed, 20 Sep 2023 09:34:28 GMT, _cmp_a=%7B%22purposes%22%3A%7B%22a%22%3Atrue%2C%22p%22%3Atrue%2C%22m%22%3Atrue%2C%22t%22%3Atrue%7D%2C%22display_banner%22%3Afalse%2C%22merchant_geo%22%3A%22US%22%2C%22sale_of_data_region%22%3Afalse%7D; domain=decathlon.com; path=/; expires=Thu, 07 Sep 2023 09:34:28 GMT; SameSite=Lax, _y=b2f54edd-3dfe-4791-97fc-bc450f8f7b22; Expires=Thu, 05-Sep-24 09:34:28 GMT; Domain=decathlon.com; Path=/; SameSite=Lax, _s=4d125c0f-802f-458b-a2e5-a89ec0559f9e; Expires=Wed, 06-Sep-23 10:04:28 GMT; Domain=decathlon.co

In [40]:
print(response.headers['Content-Type'])

text/html; charset=utf-8


When interacting with APIs, we typically receive data in JSON format. However, web scraping provides us with HTML, which can be challenging to navigate. Fortunately, Beautiful Soup simplifies this process, making our work more manageable!

### Parsing HTML with Beautiful Soup

To parse the HTML, we'll employ the renowned Python library, [Beautiful Soup 4](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). For a deeper understanding of its functionalities, explore the [official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/).


In [41]:
from bs4 import BeautifulSoup

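# Name the parser explicitly; otherwise Beautiful Soup guesses one and emits a warning
soup = BeautifulSoup(response.content, "html.parser")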

The code above parses the HTML (stored in `response.content`) into a special object called `soup` that the Beautiful Soup library understands. In other words, Beautiful Soup is **reading the HTML and making sense of its structure.**

In [42]:
type(soup)

bs4.BeautifulSoup

In [None]:
print(soup.prettify())  # This formats the HTML in a readable way

With the parsed HTML, we can now extract specific elements.

#### Extracting Data

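`find` and `find_all` are methods that search the soup tree for tags matching given criteria. `findAll` is an older camelCase alias of `find_all`; both behave the same, and `find_all` is the preferred spelling in current Beautiful Soup code.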

1. **find**:
    - Returns only the **first** tag that matches a given set of criteria.
    - Useful when you know there's only one tag of interest or you only want the first occurrence.
    - Example: If you have multiple `<p>` tags on a page and you use `soup.find('p')`, you'll get only the first `<p>` tag.

2. **findAll (or find_all)**:
    - Returns a **list** of tags that match the given criteria.
    - Useful when you want to capture all occurrences of a particular tag or set of tags.
    - Example: Using `soup.find_all('p')` will give you a list containing all `<p>` tags on the page.

Here's a simple illustration:

```html
<html>
    <body>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
        <div>Some div.</div>
    </body>
</html>
```

Using `soup.find('p')` would return the first tag, `<p>First paragraph.</p>`, while `soup.find_all('p')` would return a list containing both `<p>` tags.
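
The illustration above can be run as a self-contained sketch. It also shows what the two methods return when nothing matches, which is worth knowing before scraping a real page.

```python
from bs4 import BeautifulSoup

html = """
<html>
    <body>
        <p>First paragraph.</p>
        <p>Second paragraph.</p>
        <div>Some div.</div>
    </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")             # a single tag
all_p = soup.find_all("p")           # a list-like collection of tags
no_match = soup.find("table")        # None when nothing matches
no_matches = soup.find_all("table")  # empty collection when nothing matches

print(first_p)                        # <p>First paragraph.</p>
print([p.get_text() for p in all_p])  # ['First paragraph.', 'Second paragraph.']
print(no_match)                       # None
print(len(no_matches))                # 0
```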


Let's look at different ways of extracting data.

##### **By Tag**

Let's start with a popular tag: `title`.

In [43]:
soup.find("title") # Find the first <title> tag on the page

<title>
    Mountain Bikes | Decathlon
    </title>

In [49]:
soup.findAll("title") # Find all <title> tags on the page

[<title>
     Mountain Bikes | Decathlon
     </title>,
 <title id="logo-title">Decathlon</title>,
 <title id="logo-title">Decathlon</title>,
 <title id="logo-title">Decathlon</title>,
 <title>Decathlon Btwin</title>,
 <title>Decathlon Btwin</title>,
 <title>Decathlon Rockrider</title>,
 <title>Decathlon Rockrider</title>,
 <title>Decathlon Rockrider</title>,
 <title>Decathlon Rockrider</title>,
 <title>Decathlon Rockrider</title>,
 <title>Decathlon Rockrider</title>,
 <title>Decathlon Rockrider</title>,
 <title>product-research</title>,
 <title>laptop-checkout</title>]

##### **By Class**

To search for HTML elements by class in a webpage using BeautifulSoup, you can also use the `find` and `find_all` methods. 

1. **Using `find` method to get the first matching element**:
   
   ```python
   result = soup.find(class_='your-class-name')
   ```

2. **Using `find_all` method to get a list of all matching elements**:

   ```python
   results = soup.find_all(class_='your-class-name')
   ```
   
Note that we use the `class_` parameter (with a trailing underscore) because `class` is a reserved keyword in Python.
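
Here is a minimal sketch of class-based search on a made-up snippet (the class names are hypothetical). It also shows that `class_` matches a tag when any one of its classes equals the given value, and that the `attrs` dictionary is an equivalent way to write the same filter.

```python
from bs4 import BeautifulSoup

# Hypothetical product markup, invented for illustration
html = """
<h4 class="product-title featured">Bike A</h4>
<h4 class="product-title">Bike B</h4>
<span class="product-price">$100.00</span>
"""
soup = BeautifulSoup(html, "html.parser")

first_title = soup.find(class_="product-title")
all_titles = soup.find_all(class_="product-title")

# Same filter written with attrs, which avoids the trailing underscore
same_titles = soup.find_all(attrs={"class": "product-title"})

print(first_title.get_text())                # Bike A
print([t.get_text() for t in all_titles])    # ['Bike A', 'Bike B']
print([t.get_text() for t in same_titles])   # ['Bike A', 'Bike B']
```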

Let's dive into our target URL and explore its structure. Our objective is to craft a dataframe populated with bicycle names and their corresponding prices. 

To pinpoint the exact tags housing this information, follow these steps:
1. Navigate to the website in your browser.
2. Locate a bicycle name, right-click on it, and choose 'Inspect'. This action will direct you to the element within the site's HTML. Identify the tags so we can extract our desired data.
3. Repeat the same process for the price.

Note: the bicycle names and prices will change depending on the newest bikes in the shop.

Let's filter all elements whose `class` is `de-ProductTile-title`.

In [None]:
soup.findAll(class_='de-ProductTile-title')

In this case, every element with the class `de-ProductTile-title` is an `h4` tag, so we got exactly the information we wanted. But what if the same class were also used on other tags and we only wanted the `h4` results?

##### **By Tag and Class**

BeautifulSoup allows filtering results using combinations, such as filtering by tag and class. 

```python
tags = soup.find_all(name=tag_name, class_=class_name)
```

We can use a for loop to iterate over the results and do whatever we need to do.

To extract the names from the provided HTML content, you can:

1. Use the `findAll` method to locate the `<h4>` tags with the specific class (`de-ProductTile-title` in this case).
2. Extract the text from the found tag.

In [None]:
# Find all <h4> tags with class "de-ProductTile-title"
bike_names_tags = soup.findAll('h4', class_='de-ProductTile-title')

In [55]:
# findAll returns a ResultSet, which behaves like a list
bike_names_tags[:3]

[<h4 class="de-ProductTile-title de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-lineHeight2" data-gtm-actions="click" data-gtm-category="Tiles" data-gtm-delegate="child" data-gtm-event="customer-interaction" data-gtm-id="Product Title | /products/mountain-bike-275-rockrider-st-100-196952-192872" itemprop="name">
           Mountain Bike 27.5" Rockrider ST 100
         </h4>,
 <h4 class="de-ProductTile-title de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-lineHeight2" data-gtm-actions="click" data-gtm-category="Tiles" data-gtm-delegate="child" data-gtm-event="customer-interaction" data-gtm-id="Product Title | /products/mountain-bike-rockrider-520" itemprop="name">
           Rockrider ST520 Mountain Bike 27.5"
         </h4>,
 <h4 class="de-ProductTile-title de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-lineHeight2" data-gtm-actions="click" data-gtm-category="Tiles" data-gtm-delegate="child" data-gtm-event="customer-interaction" data-gtm-id="Product Title | /p

In [56]:
type(bike_names_tags) # What type is it?

bs4.element.ResultSet

In [57]:
# Let's look at how many elements we retrieved
len(bike_names_tags)

9

In [180]:
# Let's look at the first element
bike_names_tags[0]

<h4 class="de-ProductTile-title de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-lineHeight2" data-gtm-actions="click" data-gtm-category="Tiles" data-gtm-delegate="child" data-gtm-event="customer-interaction" data-gtm-id="Product Title | /products/mountain-bike-275-rockrider-st-100-196952-192872" itemprop="name">
          Mountain Bike 27.5" Rockrider ST 100
        </h4>

In [66]:
# We can get the actual text using .text or .getText()
print(bike_names_tags[0].text)
print(bike_names_tags[0].getText())


          Mountain Bike 27.5" Rockrider ST 100
        

          Mountain Bike 27.5" Rockrider ST 100
        


In [63]:
# Let's get rid of the surrounding whitespace
bike_names_tags[0].getText().strip()

'Mountain Bike 27.5" Rockrider ST 100'

In [67]:
# Create a new list only with bicycle names
names = [bike.text.strip() for bike in bike_names_tags]
names
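
As an alternative to calling `.strip()` on the result, `get_text` accepts a `strip=True` argument. The sketch below compares the two on a simplified copy of the first product title (most of the attributes are left out for brevity).

```python
from bs4 import BeautifulSoup

# Simplified version of the first title tag shown above
html = """
<h4 class="de-ProductTile-title">
          Mountain Bike 27.5" Rockrider ST 100
        </h4>
"""
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("h4", class_="de-ProductTile-title")

with_strip_method = tag.text.strip()            # strip the string after extracting it
with_strip_argument = tag.get_text(strip=True)  # let Beautiful Soup strip it

print(with_strip_method)    # Mountain Bike 27.5" Rockrider ST 100
print(with_strip_argument)  # Mountain Bike 27.5" Rockrider ST 100
```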

To extract the price from the provided HTML content, you can:

1. Use the `findAll` method to locate the `<span>` tags with the specific class (`js-de-ProductTile-currentPrice` in this case).
2. Extract the text from the found tag.


In [76]:
# Find all <span> tags with the price class
prices = soup.findAll('span', class_='js-de-ProductTile-currentPrice')
prices

[<span class="js-de-ProductTile-currentPrice">$99.00 — $399.00</span>,
 <span class="js-de-ProductTile-currentPrice">$150.00</span>,
 <span class="js-de-ProductTile-currentPrice">$3,999.00</span>,
 <span class="js-de-ProductTile-currentPrice">$2,499.00</span>,
 <span class="js-de-ProductTile-currentPrice">$400.00</span>,
 <span class="js-de-ProductTile-currentPrice">$350.00</span>,
 <span class="js-de-ProductTile-currentPrice">$1,200.00</span>,
 <span class="js-de-ProductTile-currentPrice">$300.00</span>,
 <span class="js-de-ProductTile-currentPrice">$1,500.00</span>]

In [77]:
prices[0].text

'$99.00 — $399.00'

In [78]:
for price_tag in prices:
    print(price_tag.text)

$99.00 — $399.00
$150.00
$3,999.00
$2,499.00
$400.00
$350.00
$1,200.00
$300.00
$1,500.00


##### **Getting other attributes**

To access other attribute values such as hyperlinks (which are usually contained in the `href` attribute of `a` tags), you first locate the element using BeautifulSoup methods such as `find` or `find_all`, and then use the `.get()` method to retrieve the value of the attribute you're interested in. Here is a step-by-step explanation:

1. **Locate the element**: Use `find` or `find_all` to locate the element(s) that contain the attribute you want to access.

    ```python
    link_element = soup.find('a', class_='link-class')
    ```

2. **Access the attribute**: Once you have the element, use the `.get()` method to access the attribute value.

    ```python
    link_url = link_element.get('href')
    ```

In the above snippet:
- We first find the `a` element with the class `'link-class'`.
- We then get the value of the `href` attribute which contains the hyperlink.
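
One practical difference is worth a short sketch: `.get()` returns `None` when the attribute is missing, while dictionary-style access raises a `KeyError`. The HTML and the `link-class` name below are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second link has no href attribute
html = """
<a class="link-class" href="/products/bike-1">Bike 1</a>
<a class="link-class">Bike without a link</a>
"""
soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", class_="link-class")

first_href = links[0].get("href")    # the attribute value
missing_href = links[1].get("href")  # None, because the attribute is absent
fallback = links[1].get("href", "")  # a default value, as with dict.get

# Dictionary-style access fails loudly on a missing attribute
try:
    links[1]["href"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(first_href)        # /products/bike-1
print(missing_href)      # None
print(raised_key_error)  # True
```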


When inspecting the website, we saw that the bicycle title was a link. How can we get that link?
Let's inspect the whole element containing the bicycle name instead of just the name.

We can see that we have:
    
    <a class="de-u-linkClean js-de-ProductTile-link" href="/collections/mountain-bikes/products/mountain-bike-275-rockrider-st-100-196952-192872">
        
Note: this link will change depending on the newest bike in the shop.

In [170]:
# Find the <a> tag and get its href attribute
soup.find('a', class_='de-u-linkClean js-de-ProductTile-link').get('href')

'/collections/mountain-bikes/products/mountain-bike-275-rockrider-st-100-196952-192872'

In [154]:
# Extracting all links within <a> tags and class 'de-u-linkClean js-de-ProductTile-link'
for link in soup.find_all('a', class_='de-u-linkClean js-de-ProductTile-link'):
    print(link.get('href'))

/collections/mountain-bikes/products/mountain-bike-275-rockrider-st-100-196952-192872
/collections/mountain-bikes/products/mountain-bike-rockrider-520
/collections/mountain-bikes/products/rockrider-race-900s-gx-eagle-xc-mountain-bike-350356
/collections/mountain-bikes/products/rockrider-am-fifty_s-12-speed-full-suspension-all-mountain-bike-275-29-331990
/collections/mountain-bikes/products/rockrider-st530s-full-suspension-mountain-bike-27-5-311716
/collections/mountain-bikes/products/rockrider-st-540-mountain-bike-27-5
/collections/mountain-bikes/products/mountain-bike-xc-500-29-rr
/collections/mountain-bikes/products/rr-cn-fr-chrome-st-530-325125
/collections/mountain-bikes/products/xc-mountain-bike-carbon-29-rockrider-900
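
These links are relative: they start with `/` and omit the domain. To request them later, they need to be combined with the site's address. One way to do that, sketched below, is `urljoin` from Python's standard library; the base URL is the page we fetched above and two of the scraped links are used as sample input.

```python
from urllib.parse import urljoin

base_url = "https://www.decathlon.com/collections/mountain-bikes"

# Two of the relative links printed above
relative_links = [
    "/collections/mountain-bikes/products/mountain-bike-rockrider-520",
    "/collections/mountain-bikes/products/rockrider-st-540-mountain-bike-27-5",
]

# urljoin resolves each relative link against the base URL
absolute_links = [urljoin(base_url, link) for link in relative_links]

for link in absolute_links:
    print(link)
# https://www.decathlon.com/collections/mountain-bikes/products/mountain-bike-rockrider-520
# https://www.decathlon.com/collections/mountain-bikes/products/rockrider-st-540-mountain-bike-27-5
```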


#### More filtering options

##### Filtering by Multiple Tags

To find elements with multiple possible tags, you can pass a list of tag names to `find_all`.

In [None]:
# Find all <div> and <span> tags
soup.find_all(['div', 'span'])

##### Filtering by Multiple Classes

To find elements with multiple possible classes, you can pass a list of class names.


In [81]:
# Find all elements with class 'js-de-ProductTile-currentPrice' or 'de-ProductTile-title'
soup.find_all(class_=['js-de-ProductTile-currentPrice', 'de-ProductTile-title'])

[<h4 class="de-ProductTile-title de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-lineHeight2" data-gtm-actions="click" data-gtm-category="Tiles" data-gtm-delegate="child" data-gtm-event="customer-interaction" data-gtm-id="Product Title | /products/mountain-bike-275-rockrider-st-100-196952-192872" itemprop="name">
           Mountain Bike 27.5" Rockrider ST 100
         </h4>,
 <span class="js-de-ProductTile-currentPrice">$99.00 — $399.00</span>,
 <h4 class="de-ProductTile-title de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-lineHeight2" data-gtm-actions="click" data-gtm-category="Tiles" data-gtm-delegate="child" data-gtm-event="customer-interaction" data-gtm-id="Product Title | /products/mountain-bike-rockrider-520" itemprop="name">
           Rockrider ST520 Mountain Bike 27.5"
         </h4>,
 <span class="js-de-ProductTile-currentPrice">$150.00</span>,
 <h4 class="de-ProductTile-title de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-lineHeight2" data-gtm-acti

##### Combining Multiple Criteria

You can combine multiple criteria by using the `attrs` argument.

In [None]:
# Find all <h4> or <span> tags whose class is 'js-de-ProductTile-currentPrice' or 'de-ProductTile-title'
soup.find_all(['h4', 'span'], attrs={"class": ['js-de-ProductTile-currentPrice', 'de-ProductTile-title']})

##### Limiting the Results

You can limit the number of results returned by `find_all` using the `limit` parameter.

In [None]:
# Only get the first 5 matches
soup.find_all('h4', limit=5)

##### Navigating through the "Tree" of HTML Elements

Beautiful Soup provides a robust set of tools that allow you to traverse and explore the hierarchical structure of an HTML document, often referred to as the "tree". 

To search only inside a particular element, call `find_all` on that element:

In [99]:
soup.div.find_all("h4")

[<h4 class="message-title">
                                     Our Summer Sale is ON!
                                 </h4>,
 <h4 class="message-title">
                                         Journey On - Packs for Mountain and City Adventures
                                     </h4>,
 <h4 class="message-title">
                                         Free Shipping | Click &amp; Collect is back!
                                     </h4>]

The code above will first locate the initial `div` element present in the Beautiful Soup object. Subsequently, it will fetch all `h4` elements contained within that `div`.
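
Note that "contained within" includes tags nested at any depth, not only direct children. The sketch below, on a made-up snippet, contrasts the default search with `recursive=False`, which restricts `find_all` to direct children, and with the `.children` iterator.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup with one direct and one deeper <h4>
html = """
<div id="outer">
    <h4>Direct child</h4>
    <section>
        <h4>Nested deeper</h4>
    </section>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
outer = soup.div

all_h4 = outer.find_all("h4")                      # every <h4> below the div
direct_h4 = outer.find_all("h4", recursive=False)  # only <h4> tags directly inside the div

# .children also yields the whitespace between tags, so keep only Tag objects
child_tags = [child.name for child in outer.children if isinstance(child, Tag)]

print(len(all_h4))                 # 2
print(direct_h4[0].get_text())     # Direct child
print(child_tags)                  # ['h4', 'section']
```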

But what if you need to retrieve a specific match by its position, say the second one?

In [100]:
soup.div.find_all("h4")[1]

<h4 class="message-title">
                                        Journey On - Packs for Mountain and City Adventures
                                    </h4>

#### Creating a DataFrame with the data

Instead of getting names and prices separately, we can target the whole component, and extract the name and price from each bicycle component in a more structured manner. By targeting this whole component tag, we can ensure that we are extracting information for the same product (i.e., the name and price correspond to the same bicycle).

Here's how we can achieve this:

1. **Targeting the Whole Component**:
   - Instead of targeting individual tags for names and prices, we target the main component that houses both the name and price.
   - By visually inspecting the HTML, we can see that:
       - The information for each bicycle (name, price, etc.) is grouped together under a `<section>` tag. 
       - The `class` attribute of this `<section>` tag is `de-ProductTile-info`. This class seemed specific to the product tile and thus, a good candidate to use for extraction.
   
2. **Iterating through Components**:
   - For each such component, extract the name and the price.
   
3. **Storing Data**:
   - Store the extracted data in lists, which can then be used to create a DataFrame.

In [109]:
import pandas as pd

# Lists to store extracted data
bicycle_names = []
prices = []

# Find all components
components = soup.find_all('section', class_='de-ProductTile-info')

for component in components:
    # Extract bicycle name
    bike_name = component.find('h4', class_='de-ProductTile-title').text.strip()
    bicycle_names.append(bike_name)
    
    # Extract price
    price = component.find('span', class_='js-de-ProductTile-currentPrice').text.strip().replace("$", "")
    prices.append(price)

# Create DataFrame
df = pd.DataFrame({
    'Bicycle_Name': bicycle_names,
    'Price': prices
})

df

| | Bicycle_Name | Price |
|---|---|---|
| 0 | Mountain Bike 27.5" Rockrider ST 100 | 99.00 — 399.00 |
| 1 | Rockrider ST520 Mountain Bike 27.5" | 150.00 |
| 2 | Rockrider RACE 900S GX Eagle XC Mountain Bike | 3999.00 |
| 3 | Rockrider AM Fifty_S 12-Speed Full Suspension ... | 2499.00 |
| 4 | Rockrider ST530S Full Suspension Mountain Bike... | 400.00 |
| 5 | Rockrider ST540 Mountain Bike 27.5" | 350.00 |
| 6 | Rockrider XC500 Mountain Bike 29'' | 1200.00 |
| 7 | Rockrider ST530 Mountain Bike 27.5" | 300.00 |
| 8 | Rockrider XC900 Carbon Mountain Bike 29" | 1500.00 |
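
One caveat with the loop above: `find` returns `None` when a component lacks the tag, and calling `.text` on `None` raises an `AttributeError`. The sketch below shows one possible way to guard against that. The helper `text_or_none` is a hypothetical name introduced here, and the HTML is a made-up example in which the second product has no price.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second tile is missing its price
html = """
<section class="de-ProductTile-info">
    <h4 class="de-ProductTile-title">Bike A</h4>
    <span class="js-de-ProductTile-currentPrice">$150.00</span>
</section>
<section class="de-ProductTile-info">
    <h4 class="de-ProductTile-title">Bike B</h4>
</section>
"""
soup = BeautifulSoup(html, "html.parser")


def text_or_none(parent, tag_name, class_name):
    """Return the stripped text of the first matching tag, or None if it is absent."""
    tag = parent.find(tag_name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else None


rows = []
for component in soup.find_all("section", class_="de-ProductTile-info"):
    rows.append({
        "Bicycle_Name": text_or_none(component, "h4", "de-ProductTile-title"),
        "Price": text_or_none(component, "span", "js-de-ProductTile-currentPrice"),
    })

print(rows)
# [{'Bicycle_Name': 'Bike A', 'Price': '$150.00'}, {'Bicycle_Name': 'Bike B', 'Price': None}]
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`, where the missing price appears as an empty value instead of stopping the loop.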


We could clean the dataset further so that the price is a float and we can easily do calculations with it. That means a price range, like the one in the first row, has to be split into separate numeric values.
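
One possible approach, sketched below, is to turn every price string into a (minimum, maximum) pair of floats. The function name `parse_price_range` is made up for this example, and it assumes prices look like the ones scraped above: a dollar sign, optional thousands commas, and a long dash between the two ends of a range.

```python
def parse_price_range(price_text):
    """Convert a price string into a (minimum, maximum) tuple of floats.

    A single price such as '$150.00' gives the same value twice.
    """
    # Ranges on the page are written with a long dash between the two prices
    parts = price_text.split("—")
    values = [float(part.strip().replace("$", "").replace(",", "")) for part in parts]
    return min(values), max(values)


print(parse_price_range("$99.00 — $399.00"))  # (99.0, 399.0)
print(parse_price_range("$3,999.00"))         # (3999.0, 3999.0)
```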

#### 💡 Check for understanding

You are given the raw HTML content of a product list from an online store. Your task is to extract the following details for each product:

- Bicycle Name
- Bicycle Price
- URL for the product details
- URL for the product image

Write a function `extract_bike_info` that takes in the parsed HTML content (the `soup` object) and returns a pandas DataFrame with the above columns.

**Hint:**

To get the product image, it might be a good idea to start from the `article` tag with the class `de-ProductTile`. Based on the HTML structure, this `article` tag encapsulates the entire product, including both the image and the product details, so we can access all the relevant details for each product without jumping between different sections.

If we were to only use `soup.find_all('section', class_='de-ProductTile-info')`, we'd be focusing solely on the product details section and would then need a separate approach to extract the image URL. By starting with the `article` tag, we're able to extract all the needed data in a more cohesive and streamlined manner.

**Bonus:** clean the price column so you can perform numerical operations on it.

In [127]:
def extract_bike_info(soup):
    # Extract the main product sections
    product_sections = soup.find_all('article', class_='de-ProductTile')
    
    data = []

    for article in product_sections:
        # Extract the bike name
        name = article.find('h4', class_='de-ProductTile-title').text.strip()
        
        # Extract the price information
        price_text = article.find('span', class_='js-de-ProductTile-currentPrice').text.strip()
        prices = [float(p.replace('$', '').replace(',', '')) for p in price_text.split('—')]
        
        if len(prices) == 1:
            min_price, max_price = prices[0], prices[0]
        else:
            min_price, max_price = prices

        # Extract the product detail URL
        product_detail_url = article.find('a', class_='js-de-ProductTile-link')['href']

        # Extract the bicycle image URL. We'll take the first available image for simplicity.
        img_tag = article.find('img', class_='de-ProductTile-showcaseImage')
        image_url = img_tag['data-src'] if img_tag and 'data-src' in img_tag.attrs else None

        data.append([name, min_price, max_price, product_detail_url, image_url])

    df = pd.DataFrame(data, columns=['Bicycle_Name', 'Minimum_Price', 'Maximum_Price', 'Product_Detail_URL', 'Image_URL'])
    return df


In [128]:
df = extract_bike_info(soup)
df

Unnamed: 0,Bicycle_Name,Minimum_Price,Maximum_Price,Product_Detail_URL,Image_URL
0,"Mountain Bike 27.5"" Rockrider ST 100",99.0,399.0,/collections/mountain-bikes/products/mountain-...,//www.decathlon.com/cdn/shop/files/fa98636a-d1...
1,"Rockrider ST520 Mountain Bike 27.5""",150.0,150.0,/collections/mountain-bikes/products/mountain-...,//www.decathlon.com/cdn/shop/products/ROCKRIDE...
2,Rockrider RACE 900S GX Eagle XC Mountain Bike,3999.0,3999.0,/collections/mountain-bikes/products/rockrider...,//www.decathlon.com/cdn/shop/products/8832747-...
3,Rockrider AM Fifty_S 12-Speed Full Suspension ...,2499.0,2499.0,/collections/mountain-bikes/products/rockrider...,//www.decathlon.com/cdn/shop/products/8641873-...
4,Rockrider ST530S Full Suspension Mountain Bike...,400.0,400.0,/collections/mountain-bikes/products/rockrider...,//www.decathlon.com/cdn/shop/files/8555876-pro...
5,"Rockrider ST540 Mountain Bike 27.5""",350.0,350.0,/collections/mountain-bikes/products/rockrider...,//www.decathlon.com/cdn/shop/files/B_TWIN_20ST...
6,Rockrider XC500 Mountain Bike 29'',1200.0,1200.0,/collections/mountain-bikes/products/mountain-...,//www.decathlon.com/cdn/shop/products/8551327-...
7,"Rockrider ST530 Mountain Bike 27.5""",300.0,300.0,/collections/mountain-bikes/products/rr-cn-fr-...,//www.decathlon.com/cdn/shop/products/8629973-...
8,"Rockrider XC900 Carbon Mountain Bike 29""",1500.0,1500.0,/collections/mountain-bikes/products/xc-mounta...,//www.decathlon.com/cdn/shop/products/ROCKRIDE...
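
The URLs in the table above are relative (`/collections/...`) or protocol-relative (`//www.decathlon.com/...`), so they cannot be requested as they are. A minimal sketch for turning them into absolute URLs with the standard library follows; the helper name `to_absolute` and the `example.jpg` path are hypothetical, and the base URL is assumed to be the site's home page.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.decathlon.com"  # assumed base of the scraped site


def to_absolute(url, base=BASE_URL):
    """Resolve a relative or protocol-relative URL against the base URL."""
    # Keep missing values (None or empty string) as None
    return urljoin(base, url) if url else None


print(to_absolute("/collections/mountain-bikes/products/mountain-bike-rockrider-520"))
# https://www.decathlon.com/collections/mountain-bikes/products/mountain-bike-rockrider-520
print(to_absolute("//www.decathlon.com/cdn/shop/example.jpg"))
# https://www.decathlon.com/cdn/shop/example.jpg
```

On the DataFrame, this could be applied with something like `df['Product_Detail_URL'].apply(to_absolute)`.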


#### Scraping many pages

When dealing with a limited number of bicycles, all products are conveniently displayed on a single page. But what if there were numerous products necessitating pagination across multiple pages?

Consider the 'deals' collection. By navigating to the end of its first page on the website, we can observe pagination links. Transitioning to the second page results in a change in the URL:

From: 
"https://www.decathlon.com/collections/deals"
To: 
"https://www.decathlon.com/collections/deals?page=2"

This pattern in the URL can be leveraged to generate a series of URLs for web scraping.

Please note: Depending on the current offers available at the time of this lesson, pagination might not be present. If that's the case, explore other product categories that have a substantial number of items, resulting in multiple pages.

In [137]:
pages = [f"https://www.decathlon.com/collections/deals?page={pag}" for pag in range(1,5)]

In [138]:
pages

['https://www.decathlon.com/collections/deals?page=1',
 'https://www.decathlon.com/collections/deals?page=2',
 'https://www.decathlon.com/collections/deals?page=3',
 'https://www.decathlon.com/collections/deals?page=4']
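
The `range(1,5)` above hardcodes four pages. When the number of pages is unknown, one option is to keep requesting pages until one comes back without products. The sketch below assumes that a page past the end contains no `article` tags with the class `de-ProductTile`, which may not hold for every site. The names `collect_pages` and `fetch_html` are hypothetical, and a small fake site stands in for real requests so the example runs offline.

```python
from bs4 import BeautifulSoup


def collect_pages(fetch_html, base_url, max_pages=20):
    """Return one soup per page, stopping at the first page with no product tiles."""
    soups = []
    for page in range(1, max_pages + 1):
        html = fetch_html(f"{base_url}?page={page}")
        soup = BeautifulSoup(html, "html.parser")
        # Assumption: a page past the end has no <article class="de-ProductTile">
        if not soup.find_all("article", class_="de-ProductTile"):
            break
        soups.append(soup)
    return soups


# Offline demonstration: a dictionary plays the role of the website
fake_site = {
    "https://example.com/deals?page=1": '<article class="de-ProductTile">A</article>',
    "https://example.com/deals?page=2": '<article class="de-ProductTile">B</article>',
}
soups = collect_pages(
    lambda url: fake_site.get(url, "<html></html>"),
    "https://example.com/deals",
)
print(len(soups))  # 2
```

Against the real site, `fetch_html` could be `lambda url: requests.get(url).text`. The `max_pages` cap is a safety net in case the stopping condition never triggers.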

Now let's build a DataFrame for each URL:

In [139]:
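def get_df_from_url(url):
    # Download the page
    response = requests.get(url)
    # Parse the HTML, naming the parser explicitly to avoid a parser-guessing warning
    soup = BeautifulSoup(response.content, "html.parser")
    # Reuse the extraction function defined above
    df = extract_bike_info(soup)
    return df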

In [140]:
dfs = [get_df_from_url(p) for p in pages]

In [142]:
dfs[1]

Unnamed: 0,Bicycle_Name,Minimum_Price,Maximum_Price,Product_Detail_URL,Image_URL
0,Oxelo Quad Artistic Roller Skate 54mm 85A Adult,40.0,40.0,/collections/deals/products/decathlon-quad-art...,//www.decathlon.com/cdn/shop/products/8494830-...
1,Oxelo FIT500 Protective Gear Set w Knee Elbow ...,15.0,15.0,/collections/deals/products/oxelo-fit500-prote...,//www.decathlon.com/cdn/shop/products/8494805-...
2,"Quechua 2"" Pop-up Extra Large Camping Beach Sh...",28.0,28.0,/collections/deals/products/2-seconds-0-xl-fre...,//www.decathlon.com/cdn/shop/products/8581549-...
3,Ski 500 Thermal Underwear Base Layer Top Men's,9.0,9.0,/collections/deals/products/wedze-500-ski-top-...,//www.decathlon.com/cdn/shop/products/8576241-...
4,ADULT SKI GLOVES 100 - LIGHT BLACK,10.0,10.0,/collections/deals/products/ski-100-waterproof...,//www.decathlon.com/cdn/shop/products/8602262-...
5,Oxelo B1 100 Adjustable Scooter 3 Wheel Deck O...,3.0,3.0,/collections/deals/products/decathlon-b1-100-a...,//www.decathlon.com/cdn/shop/products/8338355-...
6,Quechua SH500 X-Warm Waterproof Lace-Up Snow B...,40.0,40.0,/collections/deals/products/sh500-x-warm-water...,//www.decathlon.com/cdn/shop/products/8556332-...
7,Btwin Bowl 500 City Bike Helmet,20.0,34.99,/collections/deals/products/cycling-bowl-helme...,//www.decathlon.com/cdn/shop/products/BTWIN_20...
8,"Rockrider ST520 Mountain Bike 27.5""",150.0,150.0,/collections/deals/products/mountain-bike-rock...,//www.decathlon.com/cdn/shop/products/ROCKRIDE...


In [None]:
# Concatenate all DataFrames in the list
result_df = pd.concat(dfs, ignore_index=True)
result_df

If you look at our results and compare them with the website, you'll see that not all the products are returned. Each page on the site shows more than 9 products, but we only get 9 per page.

This could be because the content is dynamic. 

**Dynamic Content**: Many modern websites use JavaScript to load content dynamically. When you make a request using libraries like `requests`, you're only getting the initial HTML content. Any content loaded dynamically via JavaScript after the initial page load won't be captured. In such cases, tools like Selenium are used because they drive a real browser, which runs the page's JavaScript before you read the content.


#### CSS selectors

CSS selectors are patterns used to select and manipulate one or more elements in an HTML or XML document. When web scraping with Python, CSS selectors can be used to target specific elements of interest within the page's content. 

The `select` method in BeautifulSoup allows you to pass a CSS selector and returns a list of elements matching that selector.

1. **Tag Selector**: Targets elements by their tag name.
   - `p`: selects all `<p>` elements.
   - `soup.select("p")`

2. **Class Selector**: Targets elements by their class attribute.
   - `.classname`: selects all elements with `class="classname"`.
   - If the `class` attribute lists several space-separated classes, chain them with `.` instead of spaces (for example, `class="a b"` becomes `.a.b`).
   - `soup.select(".classname")`
   - To combine both, we can have `soup.select("tagname.classname")`

3. **Descendant Selector**: Targets an element that is a descendant of another element.
   - `div p`: selects all `<p>` elements inside a `<div>` element.
   - `.class1 .class2`: selects all elements with class `class2` that are descendants of an element with class `class1`.
   
4. **Attribute Selector**: Targets elements based on their attributes and values.
   - `a[href]`: selects all `<a>` elements with an `href` attribute.
   - `a[href="https://www.example.com"]`: selects all `<a>` elements with an `href` value of "https://www.example.com".

And more...


1. **Tag Selector**:
   - **`article`**: This would select all `<article>` elements on the page.
  

In [None]:
soup.select("article")

2. **Class Selector**:
   - **`.de-ProductTile`**: This would select all elements with the class `de-ProductTile`.

In [None]:
soup.select(".de-ProductTile")

To combine both, we can have `soup.select("tagname.classname")`

Without CSS selectors we did:

In [155]:
# Extracting all links within <a> tags and class 'de-u-linkClean js-de-ProductTile-link'
for link in soup.find_all('a', class_='de-u-linkClean js-de-ProductTile-link'):
    print(link.get('href'))

/collections/mountain-bikes/products/mountain-bike-275-rockrider-st-100-196952-192872
/collections/mountain-bikes/products/mountain-bike-rockrider-520
/collections/mountain-bikes/products/rockrider-race-900s-gx-eagle-xc-mountain-bike-350356
/collections/mountain-bikes/products/rockrider-am-fifty_s-12-speed-full-suspension-all-mountain-bike-275-29-331990
/collections/mountain-bikes/products/rockrider-st530s-full-suspension-mountain-bike-27-5-311716
/collections/mountain-bikes/products/rockrider-st-540-mountain-bike-27-5
/collections/mountain-bikes/products/mountain-bike-xc-500-29-rr
/collections/mountain-bikes/products/rr-cn-fr-chrome-st-530-325125
/collections/mountain-bikes/products/xc-mountain-bike-carbon-29-rockrider-900


Equivalently, using CSS selectors, which are a universal syntax, you can search for `tag_name.class_name`. If the `class` attribute lists several space-separated classes, chain them with `.` instead of spaces.

In [156]:
# Extracting all links within <a> tags and class 'de-u-linkClean js-de-ProductTile-link'
for link in soup.select('a.de-u-linkClean.js-de-ProductTile-link'):
    print(link.get('href'))

/collections/mountain-bikes/products/mountain-bike-275-rockrider-st-100-196952-192872
/collections/mountain-bikes/products/mountain-bike-rockrider-520
/collections/mountain-bikes/products/rockrider-race-900s-gx-eagle-xc-mountain-bike-350356
/collections/mountain-bikes/products/rockrider-am-fifty_s-12-speed-full-suspension-all-mountain-bike-275-29-331990
/collections/mountain-bikes/products/rockrider-st530s-full-suspension-mountain-bike-27-5-311716
/collections/mountain-bikes/products/rockrider-st-540-mountain-bike-27-5
/collections/mountain-bikes/products/mountain-bike-xc-500-29-rr
/collections/mountain-bikes/products/rr-cn-fr-chrome-st-530-325125
/collections/mountain-bikes/products/xc-mountain-bike-carbon-29-rockrider-900


3. **Descendant Selector**:
   - **`.de-ProductTile .de-ProductTile-title`**: This would select all elements with the class `de-ProductTile-title` that are descendants of elements with the class `de-ProductTile`.
   - **`article h4`**: This would select all `<h4>` elements that are descendants of `<article>` elements.

In [None]:
soup.select(".de-ProductTile .de-ProductTile-title")

In [None]:
soup.select("article h4")

In [158]:
# how many spans?
len(soup.select("span"))

461

In [159]:
# how many spans inside spans?
len(soup.select("span span"))

40

In [160]:
# how many spans inside spans inside spans?
len(soup.select("span span span"))

0

In [161]:
# how many span inside div inside div inside div ...
len(soup.select("div div div div div div div div span"))

315
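
The attribute selectors listed earlier are not exercised on the live page, so here is a small sketch on an inline HTML snippet (the markup is made up for illustration). It also shows two standard CSS features beyond the list above: the `^=` prefix match and the `>` child combinator.

```python
from bs4 import BeautifulSoup

# Made-up markup, parsed into a separate variable so `soup` is left untouched
html = """
<div class="menu">
  <a href="https://www.example.com">Home</a>
  <a href="/products/bike-1" class="link product">Bike 1</a>
  <a>No link</a>
  <p><a href="/products/bike-2" class="link product">Bike 2</a></p>
</div>
"""
snippet = BeautifulSoup(html, "html.parser")

# a[href]: every <a> that has an href attribute
print(len(snippet.select("a[href]")))  # 3

# a[href="..."]: exact match on the attribute value
print(snippet.select('a[href="https://www.example.com"]')[0].text)  # Home

# a[href^="/products/"]: href values that start with a prefix
print([a["href"] for a in snippet.select('a[href^="/products/"]')])
# ['/products/bike-1', '/products/bike-2']

# div > a: direct children only, so the link nested inside <p> is skipped
print(len(snippet.select("div > a")))  # 3
```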

[Beautiful Soup selectors](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

You can also use a combination of `find`, `find_all`, and `select` methods to navigate and locate the elements you're interested in more efficiently. Here's how you can use them together:

1. **Using `find` or `find_all` to Narrow Down the Search Scope**:
   
   Initially, you can use `find` or `find_all` to narrow down your search to a specific section of the HTML document.

   ```python
   section = soup.find('div', class_='product-section')
   ```

2. **Using `select` to Further Locate Elements**:

   After narrowing down the section, you can use the `select` method to locate elements using CSS selectors, which allow for more complex queries. The `select` method can be used on a BeautifulSoup object or on a Tag object (like the one retrieved in step 1).

   ```python
   product_links = section.select('a.product-link')
   ```

In this snippet:
- First, we locate a section of the webpage using `find`.
- Then, within that section, we locate all `a` elements with the class `'product-link'` using `select`.

In [177]:
first_product = soup.find('article', class_='de-ProductTile')
first_product.select('h4.de-ProductTile-title')

[<h4 class="de-ProductTile-title de-u-textMedium de-u-textShrink1 de-u-lg-textGrow1 de-u-lineHeight2" data-gtm-actions="click" data-gtm-category="Tiles" data-gtm-delegate="child" data-gtm-event="customer-interaction" data-gtm-id="Product Title | /products/mountain-bike-275-rockrider-st-100-196952-192872" itemprop="name">
           Mountain Bike 27.5" Rockrider ST 100
         </h4>]

In [178]:
# Since it's a list, we need to index into it to get the text
first_product.select('h4.de-ProductTile-title')[0].text

'\n          Mountain Bike 27.5" Rockrider ST 100\n        '
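
When only the first match is needed, `select_one` avoids the indexing step, and `get_text(strip=True)` removes the surrounding whitespace seen above. The following is a minimal sketch on a simplified copy of the tile markup (most attributes omitted for brevity).

```python
from bs4 import BeautifulSoup

# Simplified version of one product tile
html = """
<article class="de-ProductTile">
  <h4 class="de-ProductTile-title">
      Mountain Bike 27.5" Rockrider ST 100
  </h4>
</article>
"""
tile = BeautifulSoup(html, "html.parser")

# select_one returns the first matching element (or None), so no [0] is needed
title_tag = tile.select_one("h4.de-ProductTile-title")

# get_text(strip=True) drops the leading and trailing whitespace
print(title_tag.get_text(strip=True))  # Mountain Bike 27.5" Rockrider ST 100
```

Because `select_one` returns `None` when nothing matches, it is worth checking the result before calling `get_text` on pages where an element may be missing.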

### More examples

#### BBC

Let's scrape the BBC to gather some information.

We'll get the hyperlinks to images from the BBC website.

In [181]:
response = requests.get("https://www.bbc.com/")
response

<Response [200]>

In [182]:
soup = BeautifulSoup(response.content, "html.parser")

In [183]:
img_tags = soup.find_all("img") # We get all the image elements

In [184]:
len(img_tags) # Let's see how many we got

65

In [185]:
img_tags[1] # For example, let's look at the second one to see how we can get the actual URL

<img alt="" height="1" src="https://a1.api.bbc.co.uk/hit.xiti?&amp;col=1&amp;from=p&amp;ptag=js&amp;s=598253&amp;p=home.page&amp;x2=[responsive]&amp;x3=[bbc_website]&amp;x4=[en]&amp;x7=[Index-home]&amp;x8=[reverb-3.8.0-nojs]&amp;x11=[HOMEPAGE_GNL]&amp;x12=[GNL_HOMEPAGE]" width="1"/>

In [188]:
# We can use the get method to get the src attribute which contains the URL to the image
img_tags[1].get("src")

'https://ichef.bbc.co.uk/wwhp/144/cpsprodpb/10B8/production/_131008240_mediaitem131007991.jpg'

If we inspect the top menu of the BBC, we see that we have the `nav` element with the class `orbit-header-links international`. To get the names from the menu, we need to locate all the span elements inside this nav element.

In [190]:
# Find the nav element with the specified class
nav_element = soup.find('nav', class_='orbit-header-links international')

# Find all span elements inside the nav element
menu_names = [span.get_text() for span in nav_element.find_all('span')]

# Print the menu names
for name in menu_names:
    print(name)

Home
News
Sport
Reel
Worklife
Travel
Future
Culture
TV
Weather
Sounds


## Comments

It's always recommended to check for the availability of an **API** before resorting to web scraping, for the following reasons:
 * APIs are generally much easier to use
 * APIs are usually well-documented
 * Utilizing APIs is often preferred by server administrators

Refer to the `robots.txt` file on a website (by visiting `www.example.com/robots.txt`) to understand the server's guidelines and limitations regarding web scraping.
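
The check can also be done programmatically with Python's standard `urllib.robotparser` module. The sketch below parses a made-up `robots.txt` so that it runs offline; the rules and URLs are hypothetical and do not describe any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline
robots_lines = """
User-agent: *
Disallow: /checkout
Allow: /
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch(user_agent, url) reports whether the rules allow that URL
print(parser.can_fetch("*", "https://www.example.com/collections/deals"))  # True
print(parser.can_fetch("*", "https://www.example.com/checkout/cart"))      # False
```

For a live site, `parser.set_url("https://www.example.com/robots.txt")` followed by `parser.read()` downloads and parses the real file instead.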

## Summary

1. **Web Technologies**:
   - **HTML**: This is the markup language that holds the content of the webpage. It is the primary target when we engage in web scraping.
   - **CSS**: Cascading Style Sheets are used to describe the look and formatting of a document written in HTML. 
   - **JavaScript**: This is a scripting language used to create and control dynamic website content.

2. **HTML Structure**:
   - **Hierarchical**: HTML documents are structured hierarchically, meaning elements are nested within other elements, forming a tree-like structure.
   - **Tags**: These are the building blocks of HTML, defining elements that hold different types of content.
   - **Attributes**: HTML tags can have attributes, which define properties of an element and are used to set various characteristics such as class, ID, and style.

3. **Web Scraping Tools**:
   - **Requests**: A Python library that allows you to send HTTP requests to get the HTML content of a webpage.
   - **Beautiful Soup**: A Python library that facilitates the programmatic analysis of HTML, helping in parsing the HTML and navigating the parse tree.
   - **Selenium**: In cases where the webpage content is dynamic, generated using JavaScript, a tool like Selenium becomes necessary. Selenium can interact with JavaScript to load dynamic content, making it accessible for scraping.
   
4. **Finding and Selecting Elements**:
   - **Selection by Tag, Class, and ID**: We can find elements using various attributes such as their tag name, class name, or ID.
   - **CSS Selectors**: These are patterns that support more complex selections, leveraging the relationships between elements (such as nesting) and their attributes.


## Further materials

[Web archive](http://web.archive.org/): find snapshots of how web pages looked in the past.

### How to Solve a 403 Error

When you get a `403` status code in response to a web request, it means "Forbidden." The server understands your request, but it refuses to fulfill it. This is often a measure by websites to prevent web scraping or automated access.

Here's why you might get a `403 Forbidden` error:

1. **User-Agent**: Many websites block requests that don't have a standard web browser User-Agent. The default User-Agent of the `requests` library often gets blocked.
2. **Robots.txt**: This is a file websites use to guide web crawlers about which pages or sections of the site shouldn't be processed or scanned. Respect it.
3. **Rate Limiting**: Websites might block you if you make too many requests in a short period.

And more...

To solve it, try the following, starting from the user-agent:

1. **Change the User-Agent**:
   You can mimic a request from a web browser by setting a User-Agent header.
   ```python
   headers = {
       "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
   }
   response = requests.get(url, headers=headers)
   ```

2. **Use a Web Scraper Library**:
   Libraries like Scrapy or Selenium can help bypass restrictions, especially when JavaScript rendering is involved.

3. **Respect `robots.txt`**:
   Always check `https://www.example.com/robots.txt` (replace `example.com` with the website's domain) to see which URLs you're allowed to access.

4. **Rate Limiting**:
   Implement delays in your requests using `time.sleep(seconds)` to avoid hitting rate limits.

5. **Use Proxies or VPN**:
   Rotate IP addresses or use a VPN service if the server has blocked your IP.

6. **Sessions & Cookies**:
   Some websites might require maintaining sessions or handling cookies.
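
Several of these suggestions (a browser-like User-Agent, delays between requests, and a session that keeps cookies) can be combined. The sketch below is one possible arrangement, not a guaranteed fix for every `403`; the function name `polite_get_all` and its parameters are hypothetical, and the User-Agent string is the one from the example above.

```python
import time

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}


def polite_get_all(urls, delay_seconds=1.0, session=None):
    """Fetch each URL in turn with a shared session, pausing between requests."""
    # A Session keeps cookies between requests and reuses the connection
    session = session or requests.Session()
    session.headers.update(HEADERS)
    responses = {}
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request except the first
        response = session.get(url, timeout=10)
        if response.status_code == 403:
            print(f"Forbidden: {url}")
        responses[url] = response
    return responses
```

With the list of URLs built earlier, this could be called as `responses = polite_get_all(pages, delay_seconds=2)`.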
