In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **Web Scraping with BeautifulSoup (Basic to Advanced)**

#### **Course Outline**

##### **1. Introduction to Web Scraping**
   - **What is Web Scraping?**
   - **Why Use Web Scraping?**
   - **Legal Considerations and Best Practices**
   - **Setting Up the Environment**
     - Installing Python
     - Installing BeautifulSoup
     - Installing Required Libraries (Requests, lxml, etc.)

##### **2. Getting Started with BeautifulSoup**
   - **Introduction to BeautifulSoup**
   - **Basic Syntax and Setup**
     - Creating a Soup Object
     - Using HTML/XML Parsers (`html.parser`, `lxml`)
   - **Parsing a Simple HTML Document**
   - **Navigating the Parse Tree**
     - Accessing Tags, Attributes, and Text
     - Using `.tagName`, `.attrs`, `.text`

##### **3. Searching and Navigating with BeautifulSoup**
   - **Finding Elements by Tag Name**
     - `find()`
     - `find_all()`
   - **CSS Selectors**
     - Using `select()`
   - **Using Filters and Regular Expressions**
   - **Searching with Attributes**
     - `class_`, `id`, `href`, etc.
   - **Navigating the Tree**
     - `parent`, `children`, `next_sibling`, `previous_sibling`
   - **Getting All Links and Images from a Page**

##### **4. Extracting Data with BeautifulSoup**
   - **Extracting Text from Tags**
   - **Using `get()` Method for Attributes**
   - **Extracting Tables**
     - Finding and Extracting Table Data
     - Parsing HTML Tables
   - **Extracting Data from Nested Tags**
   - **Handling Missing or Incomplete Data**

##### **5. Advanced HTML Parsing Techniques**
   - **Working with Complex HTML Structures**
   - **Handling Pagination in Web Scraping**
   - **Dealing with JavaScript-Rendered Content**
     - Using `Selenium` for Dynamic Content
     - Combining BeautifulSoup with Selenium
   - **Handling AJAX Requests**
     - Using Network Tab to Identify Requests

##### **6. Data Cleaning and Manipulation**
   - **Cleaning Extracted Data**
   - **Removing HTML Tags and Special Characters**
   - **Using Python's `re` Library for Text Processing**
   - **Handling Encoding Issues**
   - **Using Pandas to Organize Scraped Data**

##### **7. Exporting and Storing Scraped Data**
   - **Saving Data to CSV Files**
   - **Saving Data to Excel Files**
   - **Saving Data to JSON Files**
   - **Storing Data in Databases (SQLite, PostgreSQL)**
   - **Creating a Simple API for Scraped Data with Flask**

##### **8. Handling Errors and Exceptions in Web Scraping**
   - **Common Errors in BeautifulSoup**
   - **Handling HTTP Errors with `requests` Library**
   - **Using `try-except` Blocks for Error Handling**
   - **Setting Up Retry Logic for Failed Requests**

##### **9. Web Scraping Projects**
   - **Project 1: Scraping E-commerce Product Data**
   - **Project 2: Scraping News Headlines and Summaries**
   - **Project 3: Scraping Weather Data from a Weather Website**
   - **Project 4: Scraping Wikipedia Tables and Articles**

##### **10. Advanced BeautifulSoup Techniques**
   - **Using Proxies to Avoid IP Blocking**
   - **Rotating User Agents with `fake_useragent` Library**
   - **Building a Web Scraping Pipeline with Python**
   - **Scraping Data from Websites with Infinite Scroll**
   - **Using `concurrent.futures` for Parallel Scraping**

##### **11. Web Scraping Best Practices and Optimization**
   - **Respecting `robots.txt` and Website Terms of Service**
   - **Using Request Headers to Avoid Blocking**
   - **Optimizing the Scraping Process for Speed**
   - **Data Throttling and Sleep Intervals**
   - **Avoiding Captchas with Anti-Captcha Services**

##### **12. Deploying a Web Scraping Script**
   - **Using Cron Jobs for Scheduling**
   - **Deploying on Cloud Services (AWS Lambda, Heroku)**
   - **Creating a Web Scraping API with FastAPI**
   - **Sending Notifications (Email/Slack) After Scraping**

##### **13. Web Scraping Alternatives**
   - **Comparison with Scrapy and Selenium**
   - **When to Use BeautifulSoup, Scrapy, or Selenium**
   - **Exploring Other Libraries (Puppeteer, Playwright)**

##### **14. Course Project: Building a Complete Web Scraper**
   - **Project Setup and Requirements**
   - **Building the Scraper with BeautifulSoup**
   - **Data Cleaning and Exporting**
   - **Deploying and Automating the Web Scraper**
   - **Project Showcase and Review**

##### **15. Conclusion and Next Steps**
   - **Recap of Key Learnings**
   - **Resources for Further Learning**
   - **Tips for Building Your Own Web Scrapers**
   - **Q&A and Troubleshooting Common Issues**


## **1. Introduction to Web Scraping**

### **1.1 What is Web Scraping?**

**Web Scraping** is the automated process of extracting data from websites. Instead of manually copying and pasting information, web scraping tools and libraries allow you to gather data programmatically. 

- **Example**: If you want to collect product prices from an e-commerce website, you can use web scraping to fetch this data instead of manually looking up each product.

**How Web Scraping Works**:
1. **Send a Request**: The scraper sends a request to the target website to fetch the HTML content.
2. **Parse HTML**: The HTML content is parsed using tools like BeautifulSoup.
3. **Extract Data**: The desired data is extracted from the HTML tags.
4. **Store Data**: The extracted data is then stored in a structured format (e.g., CSV, Excel, JSON, database).
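
The four steps above can be sketched in a few lines. To keep the sketch runnable offline, step 1 is replaced by an inline HTML string; with a live site you would typically obtain it from `requests.get(url).text`. The HTML, the class names (`product`, `name`, `price`), and the variable names are illustrative assumptions, not taken from any real website.

```python
import csv
import io

from bs4 import BeautifulSoup

# Step 1 (stand-in): HTML that a request would normally return.
html = """
<ul>
  <li class="product"><span class="name">Pen</span><span class="price">1.50</span></li>
  <li class="product"><span class="name">Notebook</span><span class="price">3.20</span></li>
</ul>
"""

# Step 2: parse the HTML.
soup = BeautifulSoup(html, "html.parser")

# Step 3: extract the desired data from the tags.
rows = [
    {
        "name": item.find("span", class_="name").get_text(strip=True),
        "price": float(item.find("span", class_="price").get_text(strip=True)),
    }
    for item in soup.find_all("li", class_="product")
]

# Step 4: store the data in a structured format (CSV, written to memory here).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the example free of side effects; swapping `io.StringIO()` for `open("products.csv", "w", newline="")` would write a file instead.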

#### **1.2 Why Use Web Scraping?**

Web scraping is useful for various purposes, such as:

- **Data Collection**: Gathering large datasets for analysis (e.g., stock prices, news articles, reviews).
- **Price Monitoring**: Tracking product prices for e-commerce websites.
- **Competitor Analysis**: Monitoring competitors' websites to analyze their products and pricing strategies.
- **Content Aggregation**: Collecting data from multiple sources to provide a unified view (e.g., news aggregators).

#### **1.3 Legal Considerations and Best Practices**

Before starting web scraping, it is essential to understand the legal and ethical considerations:

1. **Respect `robots.txt`**: 
   - Websites often provide a `robots.txt` file that specifies which pages can or cannot be scraped.
   - Always check and respect the rules defined in this file.
   - Example: `https://example.com/robots.txt`

2. **Avoid Overloading the Server**:
   - Sending too many requests in a short time can overload the server, causing it to block your IP.
   - Use time intervals (e.g., `time.sleep()`) between requests to avoid being flagged as a bot.

3. **Do Not Scrape Personal Information**:
   - Avoid scraping sensitive or personal data to comply with privacy regulations like GDPR.

4. **Check the Website's Terms of Service**:
   - Always review a website’s terms of service to ensure scraping is allowed.
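
The first two practices can be checked in code with Python's standard `urllib.robotparser` module. The sketch below parses a hypothetical `robots.txt` (the rules and URLs are made up for illustration) and pauses between requests; a real scraper would load the file with `rp.set_url(...)` followed by `rp.read()`.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the sketch runs offline.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 1",
]

rp = RobotFileParser()
rp.parse(robots_lines)

urls = [
    "https://example.com/products",
    "https://example.com/private/accounts",
]

# Keep only the URLs that the rules allow for any user agent ("*").
allowed = [url for url in urls if rp.can_fetch("*", url)]

# Use the site's Crawl-delay if it declares one, otherwise fall back to 1 second.
delay = rp.crawl_delay("*") or 1

for url in allowed:
    print("Would fetch:", url)
    time.sleep(delay)  # pause between requests to avoid overloading the server
```

Passing `robots.txt` does not replace reading the terms of service; it only covers the machine-readable part of a site's rules.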

#### **1.4 Setting Up the Environment**

To get started with web scraping using BeautifulSoup, you need to set up your environment.

**Step 1: Installing Python**
- BeautifulSoup is a Python library, so you need Python installed on your system.
- **Download Python**:
  - Go to [Python's official website](https://www.python.org/downloads/) and download the latest version.
- **Check Python Installation**:
  ```bash
  python --version
  ```

**Step 2: Installing BeautifulSoup**
- BeautifulSoup can be installed using Python's package manager `pip`.
- **Install BeautifulSoup**:
  ```bash
  pip install beautifulsoup4
  ```

**Step 3: Installing Required Libraries**
1. **Requests Library**:
   - Used for sending HTTP requests to fetch web pages.
   ```bash
   pip install requests
   ```
2. **lxml Parser**:
   - An optional parser that can be used with BeautifulSoup for faster parsing.
   ```bash
   pip install lxml
   ```

**Step 4: Setting Up a Basic Project Structure**
- **Create a Project Folder**:
  ```bash
  mkdir web_scraping_project
  cd web_scraping_project
  ```

## **2. Getting Started with BeautifulSoup**

In this section, we will cover the basics of **BeautifulSoup**, a Python library used for parsing HTML and XML documents. You'll learn how to set it up, create a soup object, use different parsers, and navigate through the parse tree.

### **Introduction to BeautifulSoup**

**BeautifulSoup** is a popular Python library for web scraping, which allows us to extract data from HTML and XML documents. It helps to parse and navigate through the HTML content, making it easier to extract the required information.

**Key Features of BeautifulSoup:**
- Provides tools for parsing HTML/XML documents
- Easy to navigate through the document tree
- Works well with `requests` for making HTTP requests
- Supports multiple parsers (like `html.parser`, `lxml`, etc.)

### **Basic Syntax and Setup**

#### **Installation**
To get started, you need to install the `beautifulsoup4` library and the `lxml` parser, along with `requests`. Use the following commands:

```bash
pip install beautifulsoup4 lxml
pip install requests
```

#### **Importing Libraries**

```python
from bs4 import BeautifulSoup
import requests
```

### **Creating a Soup Object**

A **Soup Object** is the main object in BeautifulSoup. It represents the parsed document as a tree that you can search and navigate to extract specific data.


In [None]:
from bs4 import BeautifulSoup
import requests


html_content = """
<!DOCTYPE html>
<html>
<head>
    <title>Test Page</title>
</head>
<body>
    <h1>Welcome to BeautifulSoup Tutorial</h1>
    <p class="description">This is a sample paragraph.</p>
    <a href="https://example.com" id="link1">Example Link</a>
</body>
</html>
"""

# Create a BeautifulSoup object
soup = BeautifulSoup(html_content, "html.parser")

# Output the parsed HTML
print(soup.prettify())


### **Using HTML/XML Parsers**

BeautifulSoup supports multiple parsers:
1. **`html.parser`** - Built-in Python HTML parser.
2. **`lxml`** - Faster, more lenient parser. Requires `lxml` to be installed.
3. **`html5lib`** - Parses HTML similar to web browsers.

#### **Example: Using Different Parsers**








In [None]:
# Using html.parser
soup_html = BeautifulSoup(html_content, "html.parser")

# Using lxml parser
soup_lxml = BeautifulSoup(html_content, "lxml")

print(soup_html.title)
print(soup_lxml.title)


### **Parsing a Simple HTML Document**

Let's parse a basic HTML document to extract elements like the title, header, and paragraph.




In [None]:
# Extract the title of the page
page_title = soup.title.text
print("Page Title:", page_title)

# Extract the first header
header = soup.h1.text
print("Header:", header)

# Extract the paragraph text
paragraph = soup.find('p').text
print("Paragraph:", paragraph)



### **Navigating the Parse Tree**

Navigating the parse tree involves moving through different tags to access specific data. We use the following attributes:

1. **`.tagName`** - Access the first tag with that name, where `tagName` stands for an actual tag (e.g., `soup.title`, `soup.a`).
2. **`.attrs`** - Access the attributes of a tag as a dictionary.
3. **`.text`** - Extract the text content of a tag.

#### **Example: Accessing Tags, Attributes, and Text**


In [None]:
# Access the <a> tag
link_tag = soup.a
print("Link Tag:", link_tag)

# Accessing attributes of the <a> tag
link_href = link_tag['href']
print("Link Href:", link_href)

# Accessing the text inside the <a> tag
link_text = link_tag.text
print("Link Text:", link_text)

# Access all attributes of the <a> tag
link_attributes = link_tag.attrs
print("Link Attributes:", link_attributes)

## **3. Searching and Navigating with BeautifulSoup**

Now that you’ve learned the basics of creating a BeautifulSoup object and parsing HTML, let’s dive into searching and navigating the document tree. BeautifulSoup provides several methods to search for elements within the HTML and navigate the parse tree efficiently.

---

#### **3.1. Finding Elements by Tag Name**

The most fundamental way to find elements in BeautifulSoup is by searching for them by their tag name. You can use the following methods:

##### **find()**

The `find()` method is used to search for the first occurrence of a tag in the document. It returns a **single element** (the first match) or `None` if no element is found.

**Syntax:**
```python
soup.find('tag_name')
```

**Example:**

In [None]:
from bs4 import BeautifulSoup

html_doc = "<html><head><title>Test Page</title></head><body><h1>Welcome to the test page</h1></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first <h1> tag
h1_tag = soup.find('h1')
print(h1_tag.text)  # Output: Welcome to the test page

#### **find_all()**

The `find_all()` method is used to find all occurrences of a tag. It returns a **list of matching elements** (could be empty if no match is found).

**Syntax:**
```python
soup.find_all('tag_name')
```

**Example:**

In [None]:
html_doc = "<html><body><h1>Title 1</h1><h1>Title 2</h1><h1>Title 3</h1></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Find all <h1> tags
h1_tags = soup.find_all('h1')
for tag in h1_tags:
    print(tag.text)



### **3.2. CSS Selectors**

BeautifulSoup allows you to select elements using CSS selectors with the `select()` method. This method gives you more flexibility, especially when working with classes, ids, or nested elements.

#### **Using `select()`**

The `select()` method allows you to use CSS selectors to find elements. It returns a **list** of elements matching the selector.

**Syntax:**
```python
soup.select('selector')
```

**Example:**

In [None]:
html_doc = '''
<html>
    <body>
        <div class="content">
            <h1 class="header">Header 1</h1>
            <p class="paragraph">This is the first paragraph.</p>
        </div>
        <div class="content">
            <h1 class="header">Header 2</h1>
            <p class="paragraph">This is the second paragraph.</p>
        </div>
    </body>
</html>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

# Select all elements with the class 'header'
headers = soup.select('.header')
for header in headers:
    print(header.text)
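
Beyond a single class name, `select()` accepts combined selectors, and `select_one()` returns only the first match. The following is a small sketch on made-up HTML (the ids, classes, and URLs are assumptions for illustration):

```python
from bs4 import BeautifulSoup

html_doc = '''
<div id="main" class="content">
    <h1 class="header">Header 1</h1>
    <p class="paragraph">First <a href="https://example.com/a">link</a></p>
    <p class="note">Second <a href="/local">link</a></p>
</div>
'''
soup = BeautifulSoup(html_doc, "html.parser")

# Child combinator: <p> elements that are immediate children of div.content
direct_paragraphs = soup.select("div.content > p")

# Select by id; select_one() returns the first match (or None) instead of a list
main_div = soup.select_one("#main")

# Attribute selector: links whose href starts with "https://"
external_links = [a["href"] for a in soup.select('a[href^="https://"]')]

print(len(direct_paragraphs), main_div["id"], external_links)
```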



#### **Using Filters and Regular Expressions**

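You can pass filters such as regular expressions, lists of names, or functions to `find_all()` to make more specific searches. Note that `select()` only takes CSS selector strings, not compiled regular expressions.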

**Example using regular expressions:**

In [None]:
import re  # needed for re.compile below

html_doc = "<html><body><h1>Title 1</h1><h2>Title 2</h2><h3>Title 3</h3></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Find all header tags whose name is exactly h1, h2, or h3
headers = soup.find_all(re.compile('^h[1-3]$'))
for header in headers:
    print(header.text)
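
Besides regular expressions, `find_all()` also accepts a list of tag names or a function as its filter. Here is a short sketch on made-up HTML; the helper name `is_paragraph_without_id` is hypothetical.

```python
from bs4 import BeautifulSoup

html_doc = """
<html><body>
  <h1>Title</h1>
  <p id="intro">Intro</p>
  <p>Body text</p>
  <a href="https://example.com">Link</a>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# A list matches any of the given tag names
names = [tag.name for tag in soup.find_all(["h1", "a"])]

# A function receives each tag and returns True for the ones to keep
def is_paragraph_without_id(tag):
    return tag.name == "p" and not tag.has_attr("id")

plain_paragraphs = [tag.get_text() for tag in soup.find_all(is_paragraph_without_id)]

print(names)
print(plain_paragraphs)
```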


### **3.3. Searching with Attributes**

You can filter elements based on their attributes like `id`, `class`, `href`, etc. These attributes can be passed directly into `find()` or `find_all()`.

#### **Using `class_` (to avoid conflict with Python's `class` keyword)**
You can search for elements based on their `class` attribute by passing `class_` to the method.

**Example:**

In [None]:
html_doc = "<html><body><div class='content'>Some content</div><div class='footer'>Footer content</div></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first div with class 'content'
content_div = soup.find('div', class_='content')
print(content_div.text)


#### **Using `id`**
You can search by `id` attribute similarly.

**Example:**

In [None]:
html_doc = "<html><body><div id='main'>Main content</div></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Find the div with the id 'main'
main_div = soup.find('div', id='main')
print(main_div.text)


#### **Using other attributes**
You can also search by other attributes like `href`, `src`, etc.


**Example:**

In [None]:
html_doc = "<html><body><a href='https://example.com'>Visit Example</a></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Find the anchor tag with a specific href
link = soup.find('a', href='https://example.com')
print(link.text)


### **3.4. Navigating the Tree**

Once you have found elements, you can navigate the document tree to explore relationships between elements.

#### **Using `parent`**
This allows you to access the parent element of the current tag.

**Example:**

In [None]:
html_doc = "<html><body><h1>Welcome</h1></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Find the <h1> tag and get its parent
h1_tag = soup.find('h1')
parent = h1_tag.parent
print(parent.name)  


#### **Using `children`**
The `children` attribute allows you to access all child elements of the current tag.

**Example:**


In [None]:
html_doc = "<html><body><h1>Welcome</h1><p>Content</p></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Get all children of the <body> tag
body_tag = soup.find('body')
for child in body_tag.children:
    print(child)

#### **Using `next_sibling` and `previous_sibling`**
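These attributes let you move between siblings of an element. The `.next_sibling` and `.previous_sibling` attributes return the adjacent node, which can be a text node; the `find_next_sibling()` and `find_previous_sibling()` methods return the nearest sibling tag instead.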

**Example:**

In [None]:
html_doc = "<html><body><h1>Header 1</h1><h2>Header 2</h2></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Get the next sibling of <h1>
header1 = soup.find('h1')
next_header = header1.find_next_sibling()
print(next_header.text)  # Output: Header 2
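
The difference between the `.next_sibling` attribute and the `find_next_sibling()` method shows up as soon as the HTML contains line breaks between tags, which is common on real pages. A small sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

html_doc = """<body>
<h1>Header 1</h1>
<h2>Header 2</h2>
</body>"""
soup = BeautifulSoup(html_doc, "html.parser")
h1 = soup.find("h1")
h2 = soup.find("h2")

# The attribute returns the very next node, here the newline between the tags
print(repr(h1.next_sibling))

# The methods skip text nodes and return the nearest sibling tag
print(h1.find_next_sibling().name)
print(h2.find_previous_sibling().name)
```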


### **3.5. Getting All Links and Images from a Page**

BeautifulSoup makes it easy to get all links (`<a>` tags) and images (`<img>` tags) from a page.

#### **Extracting All Links**
To get all links from a page, you can search for all anchor (`<a>`) tags and extract the `href` attribute.

**Example:**


In [None]:
html_doc = "<html><body><a href='https://example1.com'>Link 1</a><a href='https://example2.com'>Link 2</a></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Find all links
links = soup.find_all('a')
for link in links:
    print(link['href'])

#### **Extracting All Images**
Similarly, you can extract all images by looking for `<img>` tags and accessing the `src` attribute.

**Example:**


In [None]:
html_doc = "<html><body><img src='image1.jpg' alt='Image 1'><img src='image2.jpg' alt='Image 2'></body></html>"
soup = BeautifulSoup(html_doc, 'html.parser')

# Find all images
images = soup.find_all('img')
for image in images:
    print(image['src'])


## **4. Extracting Data with BeautifulSoup**

In this section, we will cover how to extract data from HTML documents using BeautifulSoup. You'll learn how to extract text from tags, get attributes from tags, and deal with HTML tables. We'll also dive into handling nested tags and missing data.


#### **4.1 Extracting Text from Tags**

The primary purpose of web scraping is often to extract the data from the HTML tags. BeautifulSoup makes it very easy to retrieve the text content from tags.

**Example:**


In [None]:
from bs4 import BeautifulSoup

html_doc = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Welcome to Web Scraping</h1>
    <p>This is a simple HTML page for testing.</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Extracting the title of the page
title = soup.title.string
print("Title:", title)  # Output: Sample Page

# Extracting text from an h1 tag
header = soup.h1.text
print("Header:", header)  # Output: Welcome to Web Scraping


In the above code, we used `.string` to get the contents of the `title` tag and `.text` to get the text inside the `h1` tag.

### **4.2 Using the `get()` Method for Attributes**

Sometimes you need to extract attributes (e.g., `href`, `src`, `alt`) from tags. This can be done using the `.get()` method.

**Example:**

In [None]:
html_doc = """
<html>
  <body>
    <a href="https://www.example.com" target="_blank">Click here</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Extracting the href attribute from the <a> tag
link = soup.a.get('href')
print("Link:", link)  # Output: https://www.example.com
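
One reason to prefer `.get()` over square-bracket access is its behavior when an attribute is missing. The sketch below uses a made-up `<a>` tag to show this, along with the list that BeautifulSoup returns for the multi-valued `class` attribute.

```python
from bs4 import BeautifulSoup

html_doc = '<a href="https://www.example.com" class="btn primary">Click here</a>'
soup = BeautifulSoup(html_doc, "html.parser")
link = soup.a

title = link.get("title")                    # attribute is absent, so this is None
title_or_default = link.get("title", "n/a")  # a fallback value can be supplied
classes = link.get("class")                  # multi-valued attribute, returned as a list

try:
    link["title"]
except KeyError:
    print("link['title'] raises KeyError when the attribute is missing")

print(title, title_or_default, classes)
```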


### **4.3 Extracting Tables**

Websites often present data in HTML tables. Extracting table data with BeautifulSoup involves finding the `table` tag and extracting its content row by row.

##### **Finding and Extracting Table Data**

You can find a table using `soup.find()` or `soup.find_all()`. After finding the table, you can extract rows and columns to get the actual data.

**Example:**



In [None]:
html_doc = """
<html>
  <body>
    <table>
      <tr><th>Name</th><th>Age</th></tr>
      <tr><td>Alice</td><td>24</td></tr>
      <tr><td>Bob</td><td>30</td></tr>
    </table>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Extracting the table
table = soup.find('table')

# Extracting rows
rows = table.find_all('tr')

for row in rows:
    columns = row.find_all('td')
    if columns:  # Skip headers
        name = columns[0].text
        age = columns[1].text
        print(f"Name: {name}, Age: {age}")



In this example, we use `find_all('tr')` to get all the rows, and within each row, we extract the `td` elements for data.

##### **Parsing HTML Tables**

In more complex tables, you may have to work with nested tables or mixed tag structures. You can still parse them similarly by navigating through nested tags and applying filters as needed.
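
For a table with a header row, one possible approach is to read the `<th>` cells first and pair them with each data row, which gives a list of dictionaries. This is a sketch under the assumption that the first row holds the headers and that there are no merged cells:

```python
from bs4 import BeautifulSoup

html_doc = """
<table>
  <tr><th>Name</th><th>Age</th></tr>
  <tr><td>Alice</td><td>24</td></tr>
  <tr><td>Bob</td><td>30</td></tr>
</table>
"""
soup = BeautifulSoup(html_doc, "html.parser")
table = soup.find("table")
rows = table.find_all("tr")

# Column names come from the <th> cells of the first row
headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]

records = []
for row in rows[1:]:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == len(headers):  # skip rows that do not line up with the header
        records.append(dict(zip(headers, cells)))

print(records)
```

A list of dictionaries like `records` can be passed directly to `pd.DataFrame(records)` if you want to continue in pandas.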

---

#### **4.4 Extracting Data from Nested Tags**

HTML structures can be complex, with tags nested inside other tags. BeautifulSoup allows you to access nested tags easily.

**Example:**


In [None]:
html_doc = """
<html>
  <body>
    <div class="content">
      <h2>Article Title</h2>
      <p>Some content here...</p>
      <a href="https://example.com">Read more</a>
    </div>
    <div class="content">
      <h2>Another Article</h2>
      <p>Different content here...</p>
      <a href="https://another.com">Read more</a>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Extracting all divs with the class "content"
contents = soup.find_all('div', class_='content')

for content in contents:
    # Extracting the article title and link
    title = content.h2.text
    link = content.a['href']
    print(f"Title: {title}, Link: {link}")



Here, the `.find_all()` method is used to find all divs with the class `content`. Then, we extract the nested `h2` (title) and `a` (link) tags from each div.

---
#### **4.5 Handling Missing or Incomplete Data**

When scraping data, sometimes you may encounter missing or incomplete values, such as empty cells or broken links. You can handle this by checking if the tag exists before extracting its content.

**Example:**


In [None]:
html_doc = """
<html>
  <body>
    <ul>
      <li>Item 1</li>
      <li></li>  <!-- Empty item -->
      <li>Item 3</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Extracting all items, skipping empty ones
items = soup.find_all('li')

for item in items:
    text = item.text.strip()
    if text:  # Checking if the item is not empty
        print(f"Item: {text}")



In the `<li>` example, we use `.strip()` to remove any leading or trailing whitespace and then check whether the item has any text. If it’s empty, we skip it.
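
The same idea applies when an entire tag may be absent: `find()` returns `None` in that case, so calling `.text` on the result would raise an `AttributeError`. The sketch below guards against that; the product markup and field names are made up for illustration.

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="product"><h2>Lamp</h2><span class="price">19.99</span><a href="/lamp">View</a></div>
<div class="product"><h2>Chair</h2></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

products = []
for card in soup.find_all("div", class_="product"):
    price_tag = card.find("span", class_="price")  # None if the tag is missing
    link_tag = card.find("a")
    products.append({
        "name": card.h2.get_text(strip=True),
        "price": price_tag.get_text(strip=True) if price_tag else None,
        "link": link_tag.get("href") if link_tag else None,
    })

print(products)
```

Recording `None` for missing fields, rather than skipping the whole record, keeps the rows aligned when the data is later loaded into a table.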

---

### **Summary of Section 4**
- **Extracting Text**: `.text` or `.string` is used to extract the text content from tags.
- **Extracting Attributes**: `.get('attribute')` helps to fetch tag attributes like `href`, `src`, etc.
- **Extracting Tables**: Use `find_all('tr')` and `find_all('td')` to extract table rows and columns.
- **Handling Nested Tags**: You can navigate through deeply nested tags with `find()` and `find_all()`.
- **Handling Missing Data**: Ensure robustness by checking if the tag contains data before accessing it.



## **5. Advanced HTML Parsing Techniques**

#### **5.1 Working with Complex HTML Structures**
   - When scraping complex HTML structures, BeautifulSoup provides a way to navigate nested tags and extract relevant data.
   - Complex HTML often includes nested elements like `<div>`, `<span>`, `<ul>`, `<li>`, etc. It's essential to identify the tags that encapsulate the data you need.
   - Example: Scraping blog post content or product listings with multiple nested levels.
   
   **Steps:**
   - Identify parent elements (e.g., `<div class="post">`) and drill down to child elements.
   - Use `.find()` and `.find_all()` to search for specific tags inside these parent elements.

   **Tip:** When working with deeply nested structures, consider using CSS selectors or XPath for easier targeting. Note that BeautifulSoup itself does not support XPath; for XPath queries you would use `lxml` directly.

   **Example:**


In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://quotes.toscrape.com/'  

response = requests.get(url)
html_content = response.text

soup = BeautifulSoup(html_content, 'html.parser')

post = soup.find('div', class_='quote')  
quote = post.find('span', class_='text').text 

print(f"Quote: {quote}")


### **5.2 Handling Pagination in Web Scraping**
   - Pagination occurs when a website splits content across multiple pages. You need to extract data from all pages to get the full set of information.
   - Example: Scraping product listings from an e-commerce website that has multiple pages.
   
   **Steps:**
   - Identify the pattern in the URL for the pages (e.g., `page=1`, `page=2`).
   - Loop through each page by modifying the URL and scraping the data.

   **Tip:** Check whether the website uses a “next” button, and make sure to handle edge cases (e.g., the last page).

   **Example of pagination:**


In [None]:
base_url = "https://dummyjson.com/products/"
for page_num in range(1, 6):  # Scraping first 5 pages
    url = base_url + str(page_num)
    print(url)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Extract data here
    print(soup.body)
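
When a site exposes a “next” link instead of a predictable page number, a possible approach is to follow that link until it disappears. The sketch below simulates two pages with an in-memory dictionary so it runs offline; `PAGES`, `fetch`, `scrape_all`, and the `li.next` / `div.quote` selectors are assumptions for illustration, and in practice `fetch` would wrap `requests.get(url).text`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Simulated site: two pages, the first one linking to the second.
PAGES = {
    "https://example.com/page/1/": (
        '<div class="quote">A</div>'
        '<ul><li class="next"><a href="/page/2/">Next</a></li></ul>'
    ),
    "https://example.com/page/2/": '<div class="quote">B</div>',
}

def fetch(url):
    """Stand-in for an HTTP request; returns the HTML for a URL."""
    return PAGES[url]

def scrape_all(start_url, max_pages=50):
    """Follow 'next' links from start_url and collect the text of every quote."""
    results = []
    url = start_url
    pages_seen = 0
    while url and pages_seen < max_pages:  # max_pages guards against endless loops
        soup = BeautifulSoup(fetch(url), "html.parser")
        results.extend(q.get_text(strip=True) for q in soup.select("div.quote"))
        next_link = soup.select_one("li.next > a")
        # On the last page there is no 'next' link, which ends the loop
        url = urljoin(url, next_link["href"]) if next_link else None
        pages_seen += 1
    return results

all_quotes = scrape_all("https://example.com/page/1/")
print(all_quotes)
```

`urljoin` resolves relative links such as `/page/2/` against the current page URL, so the loop works whether the site uses relative or absolute `href` values.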

### **5.3 Dealing with JavaScript-Rendered Content**
   - Many modern websites use JavaScript to load content dynamically after the page is loaded. BeautifulSoup alone won't be able to scrape data from these pages because it only works with static HTML.
   - To scrape JavaScript-generated content, you need to either:
     - Use a tool like **Selenium** or **Playwright** to render JavaScript and extract the final HTML.
     - Look for an API call in the network traffic that provides the data in JSON or XML format.

#### **5.4 Using Selenium for Dynamic Content**
   - **Selenium** automates a real browser (e.g., Chrome or Firefox) to render pages, execute JavaScript, and allow you to scrape the content after the page loads fully.
   
   **Steps:**
   1. Install Selenium and a WebDriver (e.g., ChromeDriver for Chrome).
   2. Set up Selenium to navigate to a webpage and extract data.
   
   **Example:**


In [None]:
pip install selenium

In [None]:
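from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Set up Chrome WebDriver. Selenium 4 takes the driver path through a Service
# object; the older executable_path argument is no longer accepted.
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))
driver.get("https://example.com")

# Wait up to 10 seconds for the element to appear, then scrape its text
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'content-class'))
)
print(element.text)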

driver.quit()


   - **Tip:** Use `WebDriverWait` to wait for elements to load dynamically.

### **5.5 Combining BeautifulSoup with Selenium**
   - Sometimes, you may need the power of both tools: **Selenium** for rendering dynamic pages and **BeautifulSoup** for parsing the HTML after it's fully rendered.
   
   **Steps:**
   - Use Selenium to navigate and render the page.
   - Use BeautifulSoup to parse the rendered HTML.

   **Example:**


In [None]:
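from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service('/path/to/chromedriver'))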
driver.get("https://example.com")

# Get page source after rendering JavaScript
page_source = driver.page_source

# Parse with BeautifulSoup
soup = BeautifulSoup(page_source, 'html.parser')
data = soup.find('div', class_='product').text

driver.quit()


   - **Tip:** This method is particularly useful for scraping data from websites that load content dynamically via JavaScript.

#### **5.6 Handling AJAX Requests**
   - Some websites load content using **AJAX** (Asynchronous JavaScript and XML) requests. These requests fetch data from the server without refreshing the page.
   - BeautifulSoup cannot directly handle AJAX requests because it only parses the static HTML, so you need to inspect network traffic and make similar requests to extract the data.
   
   **Steps:**
   1. Open the website in a browser and use the **Network** tab in the developer tools to identify AJAX requests.
   2. Extract the URL of the AJAX request (usually a **JSON** or **XML** response).
   3. Use Python’s `requests` library to make a similar request.
   
   **Example (Extracting JSON via AJAX):**


In [None]:
import requests

url = 'https://example.com/ajax/data'
response = requests.get(url)
data = response.json()  # Assuming the response is JSON
print(data)


   - **Tip:** Make sure to pass any necessary headers (like `User-Agent`) to mimic a browser request.

### **5.7 Using Network Tab to Identify Requests**
   - The **Network Tab** in the browser's developer tools allows you to view network requests made by the webpage, including AJAX calls, image requests, and scripts.
   - You can inspect these requests to find the ones that contain the data you're interested in.
   
   **Steps:**
   1. Open the website in a browser (e.g., Chrome).
   2. Open the **Developer Tools** (F12 or right-click and select "Inspect").
   3. Go to the **Network** tab and reload the page.
   4. Look for XHR (XMLHttpRequest) or Fetch requests that contain the data in JSON or XML format.
   5. Copy the request URL or observe the parameters (e.g., pagination or filters) and replicate the request using Python.

   - **Tip:** You can often find APIs that provide structured data directly, avoiding the need for scraping HTML.

   **Example:**


In [None]:
import requests

# Replace with the actual AJAX URL you found
url = 'https://jsonplaceholder.typicode.com/posts'  # Example URL

# Optional: Add headers to mimic a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}

response = requests.get(url, headers=headers)
data = response.json()  # Assuming the response is JSON
print(data)
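
Requests found in the Network tab often carry query parameters for pagination or filters. As a sketch, the `params` argument of `requests` can reproduce them; the endpoint, the parameter names (`page`, `limit`), and the payload shape below are hypothetical. The request is prepared but not sent, so the final URL can be inspected offline.

```python
import json

import requests

# Hypothetical endpoint and parameter names, as they might appear in the Network tab
api_url = "https://example.com/api/items"
params = {"page": 2, "limit": 10}

# Build the request without sending it, to inspect the final URL
prepared = requests.Request("GET", api_url, params=params).prepare()
print(prepared.url)

# Stand-in for response.json(): a sample payload of the assumed shape
payload = json.loads(
    '{"items": [{"id": 11, "title": "First"}, {"id": 12, "title": "Second"}], "total": 42}'
)
titles = [item["title"] for item in payload["items"]]
print(titles)
```

To actually send the request, `requests.get(api_url, params=params, headers=headers)` builds the same URL.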

## **6. Data Cleaning and Manipulation**

When working with raw data extracted through web scraping, it often contains noise like HTML tags, special characters, extra spaces, etc. We need to clean and process this data to make it suitable for analysis. Let's cover the main techniques for data cleaning and manipulation.

#### **6.1 Cleaning Extracted Data**

After scraping data, it’s crucial to ensure that your dataset is free of HTML tags and any unwanted characters. Here’s an example to demonstrate cleaning scraped data.

**Example Code:**


In [None]:
from bs4 import BeautifulSoup
import requests
import re

# Sample URL
url = 'https://quotes.toscrape.com/'

# Make a request
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Extract all quotes
quotes = soup.find_all('span', class_='text')

# Store quotes in a list
cleaned_quotes = []

for quote in quotes:
    raw_text = quote.get_text()  # Extract text
    cleaned_text = re.sub(r'[^\w\s]', '', raw_text)  # Remove punctuation
    cleaned_quotes.append(cleaned_text)

print(cleaned_quotes)


**Explanation:**
- `re.sub(r'[^\w\s]', '', raw_text)` is used to remove punctuation.
- `.get_text()` extracts the raw text content, stripping out HTML tags.
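
One side effect worth knowing: the punctuation pattern above also removes apostrophes and the curly quotes around each quote, which changes words such as "It’s". The sketch below uses a made-up sample string to compare that pattern with a gentler alternative that strips only the wrapping quotation marks.

```python
import re

raw_text = "“It’s not a bug, it’s a feature.”"  # made-up sample quote

# The pattern from the example above: drops every character that is
# neither a word character nor whitespace, apostrophes included.
no_punctuation = re.sub(r"[^\w\s]", "", raw_text)

# A gentler alternative: remove only the curly quotes that wrap the quote.
unwrapped = raw_text.strip("“”")

print(no_punctuation)
print(unwrapped)
```

Which variant is appropriate depends on the analysis: word counting may prefer the first, while displaying quotes usually calls for the second.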

#### **6.2 Removing HTML Tags and Special Characters**

We can also use BeautifulSoup's built-in methods or regex to clean text further.

**Using BeautifulSoup's Built-in Method:**


In [None]:
for quote in quotes:
    text = quote.text.strip()  # Removes leading/trailing whitespace
    print(text)


**Using Regex to Collapse Extra Whitespace:**

In [None]:
clean_text = re.sub(r'\s+', ' ', text)  # Replaces multiple spaces with a single space
print(clean_text)
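
Scraped strings sometimes still carry HTML entities such as `&amp;` or `&nbsp;`, for example when text is pulled from attributes or raw HTML. A minimal sketch, using only the standard library and a hypothetical helper name `normalize_text`, that decodes entities and then collapses whitespace:

```python
import html
import re

def normalize_text(raw):
    """Decode HTML entities, then collapse whitespace runs to single spaces."""
    decoded = html.unescape(raw)               # "&amp;" -> "&", "&nbsp;" -> "\xa0"
    collapsed = re.sub(r"\s+", " ", decoded)   # \s also matches non-breaking spaces
    return collapsed.strip()

print(normalize_text("  Fish &amp; chips&nbsp;\n   to   go "))
```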


### **6.3 Using Python's `re` Library for Text Processing**

The `re` (regular expression) library is powerful for pattern matching and text processing.

**Example - Removing Digits:**


In [None]:
text_with_numbers = "Quote123 with 456 some 789 numbers"
cleaned_text = re.sub(r'\d+', '', text_with_numbers)  # Removes digits
print(cleaned_text) 


**Example - Extracting Specific Patterns:**

In [None]:
emails = "Contact us at info@example.com or support@example.org"
found_emails = re.findall(r'\S+@\S+', emails)
print(found_emails) 
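
The `\S+@\S+` pattern is deliberately loose: it also captures punctuation that happens to touch the address. The sketch below compares it with a somewhat stricter pattern on a made-up sentence. The stricter pattern is still a simplification and not a full email validator.

```python
import re

text = "Contact us at info@example.com, or write to support@example.org."

# Loose: anything without spaces around an "@", trailing punctuation included.
loose = re.findall(r"\S+@\S+", text)

# Stricter: local part, "@", then dot-separated domain labels.
strict = re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)

print(loose)
print(strict)
```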


### **6.4 Handling Encoding Issues**

When scraping data from various websites, encoding issues may arise. Handling them effectively is crucial for data quality.

**Handling Common Encoding Issues:**


In [None]:
response = requests.get(url)
response.encoding = 'utf-8'  # Ensure correct encoding
soup = BeautifulSoup(response.text, 'html.parser')  # response.text is decoded with the encoding set above

# Example of replacing encoding issues like smart quotes
cleaned_text = response.text.replace("\u201c", '"').replace("\u201d", '"')
print(cleaned_text)
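
To see what an encoding mismatch looks like in practice, the sketch below decodes the same UTF-8 bytes with the wrong and the right codec. The price string is a made-up sample; the garbled result is the kind of output that appears when a page is UTF-8 but gets decoded as Latin-1.

```python
raw_bytes = "£51.77".encode("utf-8")      # bytes as a server might send them

garbled = raw_bytes.decode("latin-1")     # wrong codec produces a stray "Â"
correct = raw_bytes.decode("utf-8")       # right codec restores the original

# If only the garbled string is available, reversing the wrong step can recover it.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled, correct, repaired)
```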


#### **6.5 Using Pandas to Organize Scraped Data**

The **Pandas** library can help in organizing and analyzing data effectively.

**Installing Pandas:**
```bash
pip install pandas
```

**Example - Organizing Data in a DataFrame:**


In [None]:
import pandas as pd

# Example scraped data
data = {
    'Quote': cleaned_quotes,
    'Author': [author.get_text() for author in soup.find_all('small', class_='author')]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Display the DataFrame
print(df.head())

# Save to CSV
df.to_csv('quotes.csv', index=False)


## **7. Exporting and Storing Scraped Data**

When scraping data, it is important to store the extracted data in a usable format for further analysis or reporting. Here’s how you can do it:
#### **7.1 Saving Data to CSV Files**

CSV (Comma Separated Values) is a popular format for data storage because it is simple and widely supported.

**Example Code: Saving to CSV**


In [None]:
import requests
from bs4 import BeautifulSoup
import csv

# URL for example scraping
url = "https://quotes.toscrape.com/page/1/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Extract quotes and authors
quotes = soup.find_all("span", class_="text")
authors = soup.find_all("small", class_="author")

# Prepare CSV file
with open("quotes.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Quote", "Author"])  # Header
    for quote, author in zip(quotes, authors):
        writer.writerow([quote.text, author.text])

print("Data saved to quotes.csv")
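
Quotes often contain commas, so it is worth confirming that a saved file reads back unchanged. A minimal round-trip sketch with a made-up row, using an in-memory buffer in place of the file on disk:

```python
import csv
import io

rows = [
    {"Quote": "“Simple, but not easy.”", "Author": "Example Author"},  # made-up row
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
writer = csv.DictWriter(buffer, fieldnames=["Quote", "Author"])
writer.writeheader()
writer.writerows(rows)  # fields containing commas are quoted automatically

buffer.seek(0)
loaded = [dict(row) for row in csv.DictReader(buffer)]
print(loaded[0]["Quote"])
```

The same `csv.DictReader` call works on `quotes.csv` when it is opened with `newline=""` and `encoding="utf-8"`.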


### **7.2 Saving Data to Excel Files**

For saving data to Excel files, we use the **Pandas** library, which provides easy methods to export DataFrames.

**Example Code: Saving to Excel**


In [None]:
import pandas as pd

# Create a DataFrame
data = {
    "Quote": [quote.text for quote in quotes],
    "Author": [author.text for author in authors]
}
df = pd.DataFrame(data)

# Save to Excel
df.to_excel("quotes.xlsx", index=False)
print("Data saved to quotes.xlsx")


### **7.3 Saving Data to JSON Files**

JSON is a preferred format for storing hierarchical or nested data.

**Example Code: Saving to JSON**


In [None]:
import json

# Prepare data
quotes_data = [{"quote": quote.text, "author": author.text} for quote, author in zip(quotes, authors)]

# Save to JSON
with open("quotes.json", "w", encoding="utf-8") as file:
    json.dump(quotes_data, file, ensure_ascii=False, indent=4)

print("Data saved to quotes.json")
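
The `ensure_ascii=False` argument used above controls how non-ASCII characters such as curly quotes are written. The sketch below, with a made-up record, shows that both settings load back to the same data and differ only in how readable the file is.

```python
import json

record = {"quote": "“Be yourself.”", "author": "Example Author"}  # made-up record

escaped = json.dumps(record)                        # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)   # keeps the curly quotes as-is

print(escaped)
print(readable)

# Both forms decode to the same Python object.
same = json.loads(escaped) == json.loads(readable) == record
print(same)
```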


### **7.4 Storing Data in Databases (SQLite, PostgreSQL)**

For persistent and large-scale data storage, using a database is a better approach. Here’s an example using SQLite.

**Setting Up SQLite:**

The `sqlite3` module is part of Python's standard library, so no separate installation is needed.

**Example Code: Saving to SQLite Database**


In [None]:
import sqlite3

# Connect to SQLite Database (or create it)
conn = sqlite3.connect("quotes.db")
cursor = conn.cursor()

# Create table
cursor.execute('''
CREATE TABLE IF NOT EXISTS quotes (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    quote TEXT NOT NULL,
    author TEXT NOT NULL
)
''')

# Insert data into table
for quote, author in zip(quotes, authors):
    cursor.execute("INSERT INTO quotes (quote, author) VALUES (?, ?)", (quote.text, author.text))

# Commit changes and close connection
conn.commit()
conn.close()

print("Data saved to SQLite database quotes.db")
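
Running the insert loop above a second time stores every quote again. One possible way to avoid duplicates, sketched here with made-up rows and an in-memory database, is a `UNIQUE` constraint combined with `INSERT OR IGNORE`:

```python
import sqlite3

# Made-up rows; the third one repeats the first.
rows = [("Quote one", "Author A"), ("Quote two", "Author B"), ("Quote one", "Author A")]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing is written to disk
conn.execute("""
    CREATE TABLE quotes (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        quote TEXT NOT NULL,
        author TEXT NOT NULL,
        UNIQUE (quote, author)
    )
""")

# executemany inserts all rows; rows violating the UNIQUE constraint are skipped.
conn.executemany("INSERT OR IGNORE INTO quotes (quote, author) VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute("SELECT quote, author FROM quotes ORDER BY id").fetchall()
print(stored)
conn.close()
```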


**Using PostgreSQL:**

For PostgreSQL, you need to set up the database first and use the `psycopg2` library:

```bash
pip install psycopg2
```

**Example Code: Saving to PostgreSQL Database**

In [None]:
import psycopg2

# Connect to PostgreSQL
conn = psycopg2.connect(
    database="your_db",
    user="your_user",
    password="your_password",
    host="localhost",
    port="5432"
)
cursor = conn.cursor()

# Create table
cursor.execute('''
CREATE TABLE IF NOT EXISTS quotes (
    id SERIAL PRIMARY KEY,
    quote TEXT NOT NULL,
    author TEXT NOT NULL
)
''')

# Insert data
for quote, author in zip(quotes, authors):
    cursor.execute("INSERT INTO quotes (quote, author) VALUES (%s, %s)", (quote.text, author.text))

conn.commit()
conn.close()
print("Data saved to PostgreSQL database")


#### **7.5 Creating a Simple API for Scraped Data with Flask**

Creating an API endpoint can be useful to serve your scraped data.

**Install Flask:**

```bash
pip install flask
```

**Flask API Example:**


In [None]:
from flask import Flask, jsonify
import json

app = Flask(__name__)

# Load scraped data
with open("quotes.json", "r", encoding="utf-8") as file:
    data = json.load(file)

@app.route("/api/quotes", methods=["GET"])
def get_quotes():
    return jsonify(data)

if __name__ == "__main__":
    app.run(debug=True)


**Run the Flask API:**

```bash
python app.py
```

**Access the API:**
Open `http://127.0.0.1:5000/api/quotes` in your browser to see the data in JSON format.

---

### **Additional Practice: Using Other URLs for Data Extraction**

Let's extend our example to handle multiple pages from another site. Here’s an adjusted approach using an e-commerce website example:

**Example Code for Handling Multiple Pages:**


In [None]:
import requests
from bs4 import BeautifulSoup
import csv

# Define base URL and page range
base_url = "https://books.toscrape.com/catalogue/page-{}.html"
page_range = 5  # Number of pages to scrape

# Prepare CSV file
with open("books.csv", "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Price"])

    for page in range(1, page_range + 1):
        url = base_url.format(page)
        response = requests.get(url)
        soup = BeautifulSoup(response.content, "html.parser")
        
        # Scrape data
        books = soup.find_all("article", class_="product_pod")
        for book in books:
            title = book.h3.a["title"]
            price = book.find("p", class_="price_color").text
            writer.writerow([title, price])
        
        print(f"Page {page} scraped successfully!")

print("Data saved to books.csv")


## **8. Handling Errors and Exceptions in Web Scraping**

### **8.1 Common Errors in BeautifulSoup**

While working with BeautifulSoup, you might encounter some common errors such as:

1. **AttributeError**: Occurs when you try to access an attribute or method that doesn't exist on a BeautifulSoup object. For example, trying to call `.text` on a `None` object.


In [None]:
title = soup.find("h1").text  # If `h1` tag is not found, this raises an AttributeError


2. **TypeError**: Happens when a value is used in a way its type does not support, such as subscripting the `None` that `find()` returns when nothing matches.

In [None]:
link = soup.find("a", id="missing")["href"]  # If no tag matches, find() returns None and subscripting None raises a TypeError


3. **Parser Errors**: Malformed HTML or invalid tags rarely raise an exception, but different parsers repair the markup differently, so the resulting tree can vary. If elements seem to be missing, try another parser such as `html.parser` or `lxml`.

### **8.2 Handling HTTP Errors with `requests` Library**

Sometimes, requests to websites may fail due to various reasons, such as:

- **404 Not Found**: The requested resource is not available.
- **500 Internal Server Error**: The server encountered an unexpected condition.
- **403 Forbidden**: Access to the resource is denied.

To handle these errors, we can use the `requests` library's `raise_for_status()` method, which raises an HTTPError for bad responses.


In [None]:
import requests
from bs4 import BeautifulSoup

url = "https://example.com/nonexistent-page"
try:
    response = requests.get(url)
    response.raise_for_status()  # Raises HTTPError for 4xx/5xx errors
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup.title.text)
except requests.exceptions.HTTPError as e:
    print(f"HTTP Error: {e}")
except requests.exceptions.ConnectionError as e:
    print(f"Connection Error: {e}")
except requests.exceptions.Timeout as e:
    print(f"Timeout Error: {e}")
except Exception as e:
    print(f"An error occurred: {e}")


### **8.3 Using `try-except` Blocks for Error Handling**

Using `try-except` blocks is an effective way to catch and handle exceptions in your scraping script.

**Example: Handling Missing Tags**


In [None]:
html_content = "<html><body><p>Hello World</p></body></html>"
soup = BeautifulSoup(html_content, 'html.parser')

try:
    title = soup.find("h1").text  # This will raise an AttributeError
    print(title)
except AttributeError:
    print("The 'h1' tag is not found in the HTML content.")


### **8.4 Setting Up Retry Logic for Failed Requests**

When making requests, especially to unreliable servers, your scraper might occasionally fail. Setting up a retry mechanism can help handle transient errors.

**Using `time.sleep` and `retry` logic:**


In [None]:
import requests
from bs4 import BeautifulSoup
import time

def fetch_data(url, retries=3):
    headers = {"User-Agent": "Mozilla/5.0"}
    attempt = 0
    while attempt < retries:
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            return BeautifulSoup(response.text, 'html.parser')
        except requests.exceptions.RequestException as e:
            print(f"Request failed: {e}. Retrying in 5 seconds...")
            attempt += 1
            time.sleep(5)  # Wait before retrying
    print("Failed to fetch data after multiple attempts.")
    return None

url = "https://httpbin.org/status/500"  # Example of an endpoint that may return errors
soup = fetch_data(url)
if soup:
    print(soup.title)


**Key Points:**
- We use the `User-Agent` header to avoid getting blocked.
- `time.sleep(5)` introduces a delay between retries.
- `timeout=10` prevents hanging if the server takes too long to respond.
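
A common refinement of the fixed 5-second wait is exponential backoff, where the delay doubles after each failure. The sketch below is one way to structure it; the `retry` helper and the `flaky` stand-in function are hypothetical, and the `sleep` parameter exists only so the demo can skip real waiting.

```python
import time

def retry(operation, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return operation()
        except Exception as error:  # a real scraper would catch requests.exceptions.RequestException
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the last error
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)

# Stand-in for a flaky request: fails twice, then succeeds.
calls = {"count": 0}

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "page content"

result = retry(flaky, sleep=lambda seconds: None)  # skip real waiting in this demo
print(result)
```

In a scraper, `operation` would wrap the `requests.get(...)` call together with `raise_for_status()`.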


### **8.5 Handling Multipage Data with Error Handling**

If you are scraping multiple pages, error handling becomes more critical. Let's consider an example where we scrape paginated data with proper error handling.



In [None]:
import requests
from bs4 import BeautifulSoup
import time

def scrape_multipage(base_url, num_pages):
    headers = {"User-Agent": "Mozilla/5.0"}
    all_data = []

    for page in range(1, num_pages + 1):
        url = f"{base_url}/{page}/"  # e.g. https://quotes.toscrape.com/page/1/
        print(f"Scraping {url}")

        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')

            items = soup.find_all('h2')  # Example: Extracting all 'h2' tags from each page
            for item in items:
                all_data.append(item.text.strip())

        except requests.exceptions.RequestException as e:
            print(f"Error fetching {url}: {e}")
            continue

        time.sleep(2)  # Polite scraping delay

    return all_data

base_url = "https://quotes.toscrape.com/page"
data = scrape_multipage(base_url, num_pages=5)

print("Scraped Data:", data)


**Explanation:**
- We handle errors on each page using `try-except`, so a failure on one page does not stop the entire script.
- We use `time.sleep(2)` between page requests to avoid being blocked.
- **Retry Logic** can be combined here to enhance robustness further.

## **9. Web Scraping Projects**

### **9.1 Project 1: Scraping E-commerce Product Data**

**Objective**: Scrape product names, prices, and ratings from an e-commerce website.

**URL**: We will use **"Books to Scrape"**, a test e-commerce website available at [https://books.toscrape.com/](https://books.toscrape.com/).

#### **Steps**:
1. Extract product names, prices, and ratings from the website.
2. Handle pagination to scrape data from multiple pages.
3. Save the data to a CSV file.


In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Base URL of the website
base_url = "https://books.toscrape.com/catalogue/page-{}.html"

# Lists to store the data
book_titles = []
book_prices = []
book_ratings = []

# Loop through multiple pages
for page in range(1, 6):  # Scraping first 5 pages
    print(f"Scraping page {page}...")
    
    # Get the page content
    url = base_url.format(page)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')  # raw bytes let the parser detect the page encoding (keeps "£" intact)
    
    # Find all book containers
    books = soup.find_all('article', class_='product_pod')
    
    # Extract data from each book
    for book in books:
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text
        rating = book.p['class'][1]
        
        # Append to lists
        book_titles.append(title)
        book_prices.append(price)
        book_ratings.append(rating)

# Save the data to a CSV file
data = pd.DataFrame({
    "Title": book_titles,
    "Price": book_prices,
    "Rating": book_ratings
})

data.to_csv("ecommerce_books.csv", index=False)
print("Data saved to ecommerce_books.csv")
data.head()

#### **Key Points**:
- **Pagination**: We used the format `page-{}.html` to scrape multiple pages.
- **Rating Extraction**: The rating is stored as a CSS class (`star-rating X`), so we extract the second class name.
- **Data Storage**: The scraped data is saved in a CSV file.
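
The scraped ratings and prices are still strings (for example `Three` and `£51.77`). For sorting or averaging they need to be numeric. A minimal sketch with two hypothetical helpers, assuming the rating words run from `One` to `Five` and that each price contains a single decimal number:

```python
import re

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_rating(word):
    """Map the rating class name to an integer, or None if it is not recognized."""
    return RATING_WORDS.get(word)

def parse_price(text):
    """Extract the first decimal number from a price string, ignoring currency symbols."""
    match = re.search(r"\d+(?:\.\d+)?", text)
    return float(match.group()) if match else None

print(parse_rating("Three"), parse_price("£51.77"))
```

These helpers could be applied to the `Rating` and `Price` columns before saving, for example with `data["Price"].map(parse_price)`.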

### **9.2 Project 2: Scraping News Headlines and Summaries**

**Objective**: Scrape news headlines, summaries, and URLs from a news website.

**URL**: We will use **"The Hacker News"**, a popular tech news site, at [https://thehackernews.com/](https://thehackernews.com/).

#### **Steps**:
1. Extract headlines, summaries, and article URLs.
2. Handle pagination for news articles.
3. Save the data to a CSV file.



In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL for scraping
base_url = "https://thehackernews.com/search/label/Cyber%20Attack?max-results=10&start={}"

# Lists to store the data
headlines = []
summaries = []
article_links = []

# Loop through the first 3 pages
for start in range(0, 30, 10):  # 0, 10, 20
    print(f"Scraping page starting at {start}...")
    
    url = base_url.format(start)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    
    # Find articles
    articles = soup.find_all('div', class_='body-post')

    for article in articles:
        headline = article.find('h2', class_='home-title').text.strip()
        summary = article.find('div', class_='home-desc').text.strip()
        link = article.find('a')['href']
        
        # Append to lists
        headlines.append(headline)
        summaries.append(summary)
        article_links.append(link)

# Save the data to a CSV file
data = pd.DataFrame({
    "Headline": headlines,
    "Summary": summaries,
    "URL": article_links
})

data.to_csv("news_data.csv", index=False)
print("Data saved to news_data.csv")
data.head()


#### **Key Points**:
- **Pagination Handling**: The URL pattern uses a `start` parameter to handle pagination.
- **Data Collection**: We collect headlines, summaries, and URLs.

### **9.3 Project 3: Scraping Weather Data from a Weather Website**

**Objective**: Extract the current weather information for multiple cities.

**URL**: We will use **"Time and Date Weather"**, available at [https://www.timeanddate.com/weather/](https://www.timeanddate.com/weather/).

#### **Steps**:
1. Extract city names, temperatures, and weather descriptions.
2. Scrape weather data for multiple cities.
3. Save the data to a CSV file.



In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Base URL of the website
cities = ["karachi", "islamabad", "multan", "quetta", "lahore","peshawar","hyderabad","abbottabad","bahawalpur","sialkot","sukkur","murree","gujranwala"]
base_url = "https://www.timeanddate.com/weather/pakistan/"

# Lists to store data
city_names = []
temperatures = []
humidities = []
weatherConditions=[]
pressures = []
dew_points = []

# Scrape data for each city
for city in cities:
    print(f"Scraping weather data for {city}...")
    
    # Constructing the URL correctly
    url = f"{base_url}{city}"  # Ensure proper URL formatting
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        
        # Extract data
        city_name = soup.find('h1').text.split("Weather")[0].strip()
        temperature = soup.find('div', class_='h2').text.strip()
        pressure = soup.find('table',class_="table table--left table--inner-borders-rows").find_all("tr")[4].td.text
        humidity = soup.find('table',class_="table table--left table--inner-borders-rows").find_all("tr")[5].td.text
        dewPoint = soup.find('table',class_="table table--left table--inner-borders-rows").find_all("tr")[6].td.text

        weatherCondition = soup.find('div',class_="bk-focus__qlook").p.text.split(".")[0]
        
        
        # Append to lists
        city_names.append(city)
        temperatures.append(temperature)
        humidities.append(humidity)
        pressures.append(pressure)
        dew_points.append(dewPoint)
        weatherConditions.append(weatherCondition)
    else:
        print(f"Failed to retrieve data for {city}. Status code: {response.status_code}")

# Save the data to a CSV file if there's any data collected
if city_names:
    data = pd.DataFrame({
        "City": city_names,
        "Temperature": temperatures,
        "Humidity": humidities,
        "Weather Condition": weatherConditions,
        "Pressure": pressures,
        "Dew Point": dew_points
    })

    data.to_csv("weather_data.csv", index=False)
    print("Data saved to weather_data.csv")
    print(data.head())
else:
    print("No data collected.")

#### **Key Points**:
- **Dynamic URL Creation**: Using city names to dynamically build the URL for scraping.
- **Extracting Weather Data**: We use specific classes for temperature and description.

### **9.4 Project 4: Scraping Wikipedia Tables and Articles**

**Objective**: Scrape data from a Wikipedia table.

**URL**: We will use the **"List of Countries by GDP"** page at [https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)](https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)).

#### **Steps**:
1. Extract country names, GDP values, and rankings.
2. Parse the HTML table and extract data.
3. Save the data to a CSV file.



In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# URL of the Wikipedia page
url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the table
table = soup.find('table', class_='wikitable')

# Lists to store the data
countries = []
gdps = []
rankings = []

# Extract data from table rows
for row in table.find_all('tr')[1:]:
    cols = row.find_all('td')
    if len(cols) > 1:
        country = cols[1].text.strip()
        gdp = cols[2].text.strip()
        rank = cols[0].text.strip()
        
        countries.append(country)
        gdps.append(gdp)
        rankings.append(rank)

# Save to CSV
data = pd.DataFrame({
    "Rank": rankings,
    "Country": countries,
    "GDP (Nominal)": gdps
})

data.to_csv("gdp_data.csv", index=False)
print("Data saved to gdp_data.csv")
print(data.head())

#### **Key Points**:
- **Table Parsing**: We identify and extract data from specific columns in the table.
- **Data Cleaning**: We strip whitespace and handle text extraction properly.
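
Table cells on Wikipedia often carry thousands separators and footnote markers, so the extracted GDP values are not yet numbers. The sketch below shows one possible cleanup; the helper name `parse_number` and the sample cell values are illustrative and not taken from the page.

```python
import re

def parse_number(cell_text):
    """Convert a cell such as '1,234,567[n 1]' to an int, or None if it is not numeric."""
    text = re.sub(r"\[.*?\]", "", cell_text)   # drop footnote markers like [n 1]
    digits = text.replace(",", "").strip()     # remove thousands separators
    return int(digits) if digits.isdigit() else None

for sample in ["1,234,567", "89,012[n 1]", "N/A"]:
    print(sample, "->", parse_number(sample))
```
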
## **10. Advanced BeautifulSoup Techniques**
#### **10.1 Using Proxies to Avoid IP Blocking**
When web scraping, websites may block your IP address if you send too many requests. Using proxies allows you to mask your real IP and rotate between multiple IPs to avoid detection.

**Steps:**
1. **Choose a Proxy Provider:** Use free proxy lists (e.g., [free-proxy-list.net](https://free-proxy-list.net)) or paid services.
2. **Set Up a Proxy with Requests:**




In [None]:
import requests
from bs4 import BeautifulSoup

proxies = {
    "http": "http://proxy_ip:proxy_port",
    "https": "https://proxy_ip:proxy_port"
}
url = "http://example.com"
response = requests.get(url, proxies=proxies)
soup = BeautifulSoup(response.content, "html.parser")
print(soup.prettify())


3. **Test Proxy Health:** Ensure proxies are functional by testing them with a simple request.
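
Rotating between several proxies, as described at the start of this section, can be sketched with `itertools.cycle`. The proxy addresses below are placeholders from a documentation-only IP range, and `proxy_rotation` is a hypothetical helper name.

```python
from itertools import cycle

# Placeholder addresses (203.0.113.0/24 is reserved for documentation); replace with real proxies.
PROXY_POOL = ["http://203.0.113.10:8080", "http://203.0.113.11:8080"]

def proxy_rotation(pool):
    """Yield proxies mappings in the shape requests expects, cycling through the pool forever."""
    for proxy in cycle(pool):
        yield {"http": proxy, "https": proxy}

rotation = proxy_rotation(PROXY_POOL)
for _ in range(3):
    print(next(rotation))  # would be passed as requests.get(url, proxies=...)
```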


### **10.2 Rotating User Agents with `fake_useragent` Library**
Websites often block automated scripts by detecting default user agents. Rotating user agents makes your scraper appear more human-like.

**Steps:**
1. **Install `fake_useragent`:**
   ```bash
   pip install fake-useragent
   ```
2. **Use Rotating User Agents:**


In [None]:
from fake_useragent import UserAgent
import requests
from bs4 import BeautifulSoup

ua = UserAgent()
headers = {"User-Agent": ua.random}

url = "http://example.com"
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, "html.parser")
print(soup.prettify())


3. **Integrate with Multiple Requests:**
   Rotate user agents dynamically for each request in your script.

---

#### **10.3 Building a Web Scraping Pipeline with Python**
A web scraping pipeline ensures data is scraped, processed, and stored efficiently.

**Steps:**
1. **Define the Scraping Workflow:**
   - Fetch the webpage.
   - Parse HTML with BeautifulSoup.
   - Extract and clean data.
   - Save data to a file or database.
2. **Example Pipeline:**


In [None]:
import requests
from bs4 import BeautifulSoup
import csv

def fetch_page(url):
    response = requests.get(url)
    return response.content

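def parse_html(html):
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # "example-class", "name", and "price" are placeholder class names; adjust them to the target page
    for item in soup.find_all("div", class_="example-class"):
        name = item.find(class_="name")
        price = item.find(class_="price")
        if name and price:
            # Return plain text rows so csv.writer can write them directly
            rows.append([name.get_text(strip=True), price.get_text(strip=True)])
    return rows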

def save_to_csv(data, filename="output.csv"):
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Name", "Price"])
        writer.writerows(data)

def main():
    url = "http://example.com"
    html = fetch_page(url)
    data = parse_html(html)
    save_to_csv(data)

if __name__ == "__main__":
    main()


### **10.4 Scraping Data from Websites with Infinite Scroll**
Websites with infinite scrolling load data dynamically via AJAX. This requires fetching additional pages programmatically.

**Steps:**
1. **Inspect Network Activity:**
   - Use browser developer tools to find the AJAX URL pattern.
2. **Make Multiple Requests:**


In [None]:
import requests
from bs4 import BeautifulSoup

base_url = "https://jsonplaceholder.typicode.com/posts?_page="  # this demo API paginates with `_page`
all_data = []

for page in range(1, 6):  # Adjust page range as needed
    url = f"{base_url}{page}"
    response = requests.get(url)
    
    # Check if the request was successful
    if response.status_code == 200:
        data = response.json()  # Assuming JSON response
        all_data.extend(data)  # Adjust key to your data structure
    else:
        print(f"Failed to retrieve data from {url}")

print(all_data)

3. **Combine AJAX Data with BeautifulSoup:**
   Process fetched data with BeautifulSoup or directly store it.
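
With infinite scroll the total number of pages is usually unknown, so a fixed range can be replaced by a loop that stops at the first empty page. A minimal sketch, in which `fetch_page` is a hypothetical callable (in practice a function wrapping `requests.get(...).json()`) and `FAKE_PAGES` stands in for server responses:

```python
def collect_all(fetch_page, max_pages=50):
    """Request page 1, 2, 3, ... until a page comes back empty or max_pages is reached."""
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:          # an empty page signals the end of the data
            break
        items.extend(batch)
    return items

# Stand-in for the server: two pages of data, then nothing.
FAKE_PAGES = {1: ["a", "b"], 2: ["c"]}
all_items = collect_all(lambda page: FAKE_PAGES.get(page, []))
print(all_items)
```

The `max_pages` cap is a safety net in case the endpoint never returns an empty page.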


### **10.5 Using `concurrent.futures` for Parallel Scraping**
Parallel scraping speeds up data collection by making requests concurrently.

**Steps:**
1. **Import Necessary Modules:**
2. **Set Up a Function for Scraping:**
3. **Use a Thread Pool for Parallel Execution:**

In [None]:
# 1. **Import Necessary Modules:**

from concurrent.futures import ThreadPoolExecutor
import requests
from bs4 import BeautifulSoup


In [None]:
# 2. **Set Up a Function for Scraping:**
def scrape_page(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    return soup.title.string

In [None]:
# 3. **Use a Thread Pool for Parallel Execution:**

urls = [
    "http://example.com/page1",
    "http://example.com/page2",
    "http://example.com/page3",
]

with ThreadPoolExecutor(max_workers=5) as executor:
    results = executor.map(scrape_page, urls)

for title in results:
    print(title)
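
With `executor.map`, an exception in any one task is re-raised while iterating over the results, which ends the loop early. One alternative, sketched below, uses `submit` and `as_completed` so each URL succeeds or fails independently. The `scrape_many` helper and the `fake_scrape` stand-in are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def scrape_many(urls, scrape, max_workers=5):
    """Run scrape(url) concurrently; return (results, errors) keyed by URL."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(scrape, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as error:  # keep going when a single page fails
                errors[url] = error
    return results, errors

# Stand-in for a real scraping function: the second URL fails.
def fake_scrape(url):
    if url.endswith("page2"):
        raise ValueError("simulated failure")
    return f"title of {url}"

urls = ["http://example.com/page1", "http://example.com/page2", "http://example.com/page3"]
results, errors = scrape_many(urls, fake_scrape)
print(results)
print(errors)
```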


## **11. Web Scraping Best Practices and Optimization**

### **11.1 Respecting `robots.txt` and Website Terms of Service**
- **What is `robots.txt`?**
  - A `robots.txt` file provides guidelines for web crawlers on what parts of a website they can or cannot access.
- **How to Check `robots.txt`?**
  - Example: Access `https://example.com/robots.txt` in your browser.
- **Respect the Guidelines:**
  - Before scraping, ensure you are allowed to scrape the website sections you target.
- **Ethical Considerations:**
  - Always follow the site's terms of service and avoid scraping sensitive data or overloading servers.
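
The `robots.txt` check can also be done in code with the standard library's `urllib.robotparser`. The sketch below parses a made-up `robots.txt` from a string so it runs without network access; for a live site, `set_url()` followed by `read()` would load the real file instead.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("*", "https://example.com/catalogue/page-1.html")
blocked = parser.can_fetch("*", "https://example.com/private/report.html")
delay = parser.crawl_delay("*")

print(allowed, blocked, delay)
```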

#### **11.2 Using Request Headers to Avoid Blocking**
- **Why Use Headers?**
  - Many websites block requests without proper headers, such as a User-Agent.
- **Adding Headers with the `requests` Library:**


In [None]:
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

# Replace with a valid URL
response = requests.get('https://jsonplaceholder.typicode.com/posts', headers=headers)

# Check if the request was successful
if response.status_code == 200:
    print(response.json())  # Print the JSON response
else:
    print(f"Failed to retrieve data: {response.status_code}")

- **Rotating User Agents:**
  - Use the `fake_useragent` library to change User-Agent frequently.


In [None]:
!pip install fake_useragent

In [None]:
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-Agent': ua.random}


### **11.3 Optimizing the Scraping Process for Speed**
- **Avoiding Unnecessary Requests:**
  - Minimize the number of requests by targeting only essential data.
- **Use Efficient Parsing:**
  - Use a fast parser such as `lxml` for better performance on large pages.


In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')


- **Concurrent Scraping:**
  - Use `concurrent.futures` for parallel requests.


In [None]:
import concurrent.futures
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

urls = ['https://example1.com', 'https://example2.com']

def fetch_url(url):
    response = requests.get(url, headers=headers)
    return response.content

with concurrent.futures.ThreadPoolExecutor() as executor:
    results = executor.map(fetch_url, urls)

for result in results:
    print(result)  # Print the content of each fetched URL

### **11.4 Data Throttling and Sleep Intervals**
- **Why Throttle Requests?**
  - To avoid overwhelming the server and being flagged as a bot.
- **Using Random Sleep Intervals:**


In [None]:
import time
import random

for url in urls:
    response = requests.get(url, headers=headers)
    time.sleep(random.uniform(1, 3))  # Random delay between 1 and 3 seconds


### **11.5 Avoiding Captchas with Anti-Captcha Services**
- **Detecting Captchas:**
  - Check if the response contains captcha-related content (e.g., `Recaptcha` tags).
- **Using Anti-Captcha Tools:**
  - Tools like **2Captcha**, **Anti-Captcha**, or **DeathByCaptcha** can solve captchas programmatically.


In [None]:
!pip install anticaptchaofficial

In [None]:
from anticaptchaofficial.recaptchav2proxyless import *

solver = recaptchaV2Proxyless()
solver.set_verbose(1)
solver.set_key("YOUR_ANTICAPTCHA_API_KEY")  # Replace with your Anti-Captcha API key
solver.set_website_url("https://valid-website.com")  # Replace with the actual website URL
solver.set_website_key("YOUR_SITE_KEY")  # Replace with the actual site key for reCAPTCHA

token = solver.solve_and_return_solution()
if token != 0:
    print("Captcha Solved: " + token)
else:
    print("Captcha Error: " + solver.error_code)
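
The detection step mentioned above can be approximated with a simple text check before any solver is involved. The markers in this sketch are assumptions based on the class names commonly used by reCAPTCHA and hCaptcha widgets; other captcha systems need their own markers.

```python
# Assumed markers: class names commonly used by reCAPTCHA and hCaptcha widgets.
CAPTCHA_MARKERS = ("g-recaptcha", "h-captcha")

def looks_like_captcha(html_text):
    """Return True if the page source contains a known captcha marker."""
    lowered = html_text.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)

print(looks_like_captcha('<div class="g-recaptcha" data-sitekey="..."></div>'))
print(looks_like_captcha("<p>Regular page content</p>"))
```

A scraper could call this on `response.text` and pause or switch strategy when it returns `True`.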

## **12. Deploying a Web Scraping Script**

In this section, we'll explore how to automate, deploy, and make your web scraping script efficient and accessible.

#### **12.1 Using Cron Jobs for Scheduling**
Cron jobs are used to schedule scripts to run periodically on UNIX-based systems (Linux, macOS).

1. **Set Up a Cron Job**
   - Open your terminal and edit the crontab:
     ```bash
     crontab -e
     ```
   - Add a line specifying when to run your script. For example, to run it daily at midnight:
     ```bash
     0 0 * * * /usr/bin/python3 /path/to/your_script.py
     ```
2. **Verify Cron Job Execution**
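   - Check the system's cron log to confirm the script runs correctly. Its location varies by system (for example `/var/log/syslog` on Debian/Ubuntu or `/var/log/cron` on RHEL-based distributions); alternatively, redirect the script's output to a log file of your own.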
   - Debug any errors that arise by running the script manually.

3. **Windows Alternative**
   - Use Task Scheduler for periodic execution on Windows.

---

#### **12.2 Deploying on Cloud Services (AWS Lambda, Heroku)**

1. **Deploying on AWS Lambda**
   - Package your Python script along with dependencies:
     ```bash
     pip install -r requirements.txt -t .
     zip -r script_package.zip .
     ```
   - Upload the `script_package.zip` to AWS Lambda.
   - Set the handler function in AWS Lambda (e.g., `lambda_function.lambda_handler`).
   - Schedule the script using AWS EventBridge.

2. **Deploying on Heroku**
   - Create a `Procfile` for Heroku:
     ```plaintext
     web: python your_script.py
     ```
   - Push your code to a Heroku repository:
     ```bash
     heroku create
     git push heroku main
     ```
   - Use Heroku Scheduler for periodic tasks.

---

#### **12.3 Creating a Web Scraping API with FastAPI**

FastAPI can be used to serve scraped data via an API:

1. **Install FastAPI and Uvicorn**:
   ```bash
   pip install fastapi uvicorn
   ```

2. **Create a Simple API**:


In [None]:
from fastapi import FastAPI
import requests
from bs4 import BeautifulSoup

app = FastAPI()

@app.get("/scrape")
def scrape_data():
    url = "https://example.com"
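    response = requests.get(url, timeout=10)  # avoid hanging if the site is slow
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page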
    soup = BeautifulSoup(response.content, "html.parser")
    data = [item.text for item in soup.find_all("h2")]
    return {"scraped_data": data}


3. **Run the API Locally** (this assumes the code above is saved as `app.py`):
   ```bash
   uvicorn app:app --reload
   ```

4. **Deploy FastAPI on a Server**
   - Use services like AWS, Azure, or Heroku for deployment.


#### **12.4 Sending Notifications (Email/Slack) After Scraping**

1. **Sending Email Notifications**
   - Use Python's `smtplib`:


In [None]:
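import os
import smtplib
from email.mime.text import MIMEText

def send_email(subject, message, recipient):
    # Prefer environment variables over hardcoded credentials.
    # SMTP_SENDER and SMTP_PASSWORD are illustrative variable names.
    sender = os.environ.get("SMTP_SENDER", "your_email@example.com")
    password = os.environ.get("SMTP_PASSWORD", "your_password")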
    msg = MIMEText(message)
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient

    with smtplib.SMTP("smtp.gmail.com", 587) as server:
        server.starttls()
        server.login(sender, password)
        server.sendmail(sender, recipient, msg.as_string())

send_email("Scraping Completed", "Your scraping task is done.", "recipient@example.com")


2. **Sending Slack Notifications**
   - Use Slack Webhooks:


In [None]:
import requests

def send_slack_message(webhook_url, message):
    payload = {"text": message}
    response = requests.post(webhook_url, json=payload, timeout=10)
    response.raise_for_status()  # raise an error if Slack rejects the request

webhook_url = "https://hooks.slack.com/services/your/webhook/url"
send_slack_message(webhook_url, "Scraping task completed successfully!")


3. **Automating Notifications**
   - Integrate the email/Slack function in your scraping script to trigger notifications after data is processed.
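
One way to wire notifications into a scraper is a small wrapper that reports both success and failure. This is a sketch under assumptions: `run_with_notification` is a hypothetical helper, `scrape` is any function returning a list of results, and `notify` is any callable that accepts a message string.

```python
def run_with_notification(scrape, notify):
    """Run `scrape` and report the outcome through `notify`."""
    try:
        results = scrape()
    except Exception as exc:
        # Report the failure, then re-raise so the caller still sees the error
        notify(f"Scraping failed: {exc}")
        raise
    notify(f"Scraping completed: {len(results)} items")
    return results
```

For example, `run_with_notification(my_scraper, lambda msg: send_slack_message(webhook_url, msg))` would post the outcome to Slack, where `my_scraper` stands for your own scraping function.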


#### **12.5 Using Multiple URLs for Consistent Data**
When scraping, if a single URL doesn’t have pagination or becomes unavailable, you can:
   
1. **Use Alternative URLs**:
   - Maintain a list of fallback URLs:


In [None]:
urls = [
    "https://example1.com",
    "https://example2.com",
    "https://example3.com"
]
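for url in urls:
    try:
        response = requests.get(url, timeout=10)
        if response.status_code == 200:
            soup = BeautifulSoup(response.content, "html.parser")
            # Process data
            break  # stop at the first URL that responds successfully
    except requests.RequestException as e:
        # Network-level failure (timeout, DNS, connection error): try the next URL
        print(f"Error with {url}: {e}")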


2. **Aggregate Data from Multiple URLs**:
   - Combine data from several URLs into a single dataset:


In [None]:
combined_data = []
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    data = [item.text for item in soup.find_all("div", class_="data-class")]
    combined_data.extend(data)

By combining multiple URLs and a robust fallback mechanism, you ensure your scraper remains functional and avoids errors due to unavailable pages.
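
When aggregating, a single failing URL should not discard the data already collected. The sketch below assumes a caller-supplied `fetch_items(url)` function (hypothetical) that returns a list of items or raises on failure; it keeps going and reports which URLs failed.

```python
def aggregate(urls, fetch_items):
    """Collect items from every URL, skipping URLs whose fetch raises."""
    combined, failed = [], []
    for url in urls:
        try:
            combined.extend(fetch_items(url))
        except Exception as exc:
            # Record the failure and continue with the remaining URLs
            failed.append((url, str(exc)))
    return combined, failed
```

Returning the failures alongside the data lets you decide later whether to retry them or log them.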

## **13. Web Scraping Alternatives**

### **13.1 Comparison with Scrapy and Selenium**
BeautifulSoup is excellent for parsing static HTML content, but there are alternatives that may be more efficient depending on your needs:

1. **Scrapy**  
   - **Advantages**:
     - Built-in support for asynchronous requests, making it faster for large-scale scraping.
     - Framework-like structure with built-in tools for parsing, data storage, and middleware.
     - Automatic handling of crawling through multiple pages.
   - **Disadvantages**:
     - More complex setup than BeautifulSoup.
     - Not ideal for scraping JavaScript-heavy websites.
   
2. **Selenium**  
   - **Advantages**:
     - Can interact with JavaScript-rendered pages and simulate browser behavior.
     - Useful for scraping interactive elements (e.g., dropdown menus, buttons).
   - **Disadvantages**:
     - Slower than BeautifulSoup and Scrapy because it uses a full browser.
     - Requires significant system resources for large-scale scraping.

3. **BeautifulSoup**  
   - **Advantages**:
     - Simple to use and lightweight.
     - Great for small to medium-sized projects with static content.
   - **Disadvantages**:
     - Lacks asynchronous support.
     - Requires additional libraries (e.g., `requests`) for fetching data.

#### **13.2 When to Use BeautifulSoup, Scrapy, or Selenium**
| **Scenario**                               | **Tool**          |
|--------------------------------------------|-------------------|
| Simple static HTML scraping                | BeautifulSoup     |
| Large-scale scraping with many pages       | Scrapy            |
| JavaScript-heavy or dynamic web content    | Selenium          |
| Need both static and dynamic scraping     | Combine Tools     |

#### **13.3 Exploring Other Libraries**
1. **Puppeteer**  
   - A Node.js library for controlling headless Chrome browsers.
   - Ideal for interacting with dynamic websites.
   - Examples include scraping sites requiring login or handling infinite scrolls.

2. **Playwright**  
   - Similar to Puppeteer but supports multiple browsers (Chromium, Firefox, WebKit).
   - Allows faster and more reliable scraping for complex websites.
   - Offers built-in support for handling iframes and multi-tab scenarios.

## **14. Course Project: Building a Complete Web Scraper**

### **14.1 Project Setup and Requirements**

For this project, we will build a complete web scraper using **BeautifulSoup**. The goal is to scrape **product data** (e.g., name, price, rating, availability) from a website. If the chosen website doesn’t support pagination, we will supplement it by scraping additional URLs to ensure our scraper handles diverse cases.

---

### **14.2 Building the Scraper with BeautifulSoup**

1. **Set Up the Environment**  
   - Install required libraries:
     ```bash
     pip install requests beautifulsoup4 pandas
     ```
   - Import libraries in your Python script:




In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd


2. **Target Website**  
   Use the **Books to Scrape** website, which is public and legal for scraping:  
   **URL**: [http://books.toscrape.com/](http://books.toscrape.com/)  
   This website contains multiple pages with books, making it ideal for demonstrating both single and multipage scraping.

3. **Fetch the HTML Content**  
   Start with fetching HTML for the first page:


In [None]:
BASE_URL = "http://books.toscrape.com/"

response = requests.get(BASE_URL)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    print("Page fetched successfully!")
else:
    print(f"Failed to fetch page: {response.status_code}")


4. **Parse the Data**
   Extract specific data such as book title, price, and stock availability:


In [None]:
# Function to parse a single page
def parse_page(soup):
    books = []
    for book in soup.find_all('article', class_='product_pod'):
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text
        availability = book.find('p', class_='instock availability').text.strip()
        books.append({'Title': title, 'Price': price, 'Availability': availability})
    return books


5. **Handle Pagination**
   Create logic to navigate through multiple pages:


In [None]:
from urllib.parse import urljoin

def scrape_books(base_url):
    all_books = []
    url = urljoin(base_url, "catalogue/page-1.html")

    while url:
        print(f"Scraping {url}")
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        # Parse the current page
        all_books.extend(parse_page(soup))

        # The 'next' href is relative to the current page, so resolve it
        # against the current URL rather than the site root
        next_button = soup.find('li', class_='next')
        url = urljoin(url, next_button.a['href']) if next_button else None
    return all_books

# Scrape all pages
books_data = scrape_books(BASE_URL)
print(f"Scraped {len(books_data)} books.")
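
Relative `next` links are resolved against the URL of the page that contains them, not against the site root. A small sketch with `urllib.parse.urljoin` shows the difference; the `href` value is an assumption about what the link on a catalogue page looks like.

```python
from urllib.parse import urljoin

current_url = "http://books.toscrape.com/catalogue/page-1.html"
next_href = "page-2.html"  # assumed value of the 'next' link's href on that page

# urljoin resolves the relative href against the page that contains it
next_url = urljoin(current_url, next_href)
print(next_url)  # http://books.toscrape.com/catalogue/page-2.html
```

Plain string concatenation of the site root and `page-2.html` would drop the `catalogue/` segment and point at a different path.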


### **14.3 Data Cleaning and Exporting**

1. **Cleaning Data**
   Remove unwanted characters and standardize formatting:


In [None]:
for book in books_data:
    book['Price'] = book['Price'].replace('£', '').strip()


2. **Exporting to CSV**
   Save the cleaned data for further analysis:


In [None]:
df = pd.DataFrame(books_data)
df.to_csv('books_data.csv', index=False)
print("Data exported to books_data.csv")
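
If you plan to sort or aggregate by price, it helps to store prices as numbers rather than strings. A minimal sketch, assuming prices look like `£51.77`; `parse_price` is a hypothetical helper, not part of any library.

```python
def parse_price(raw):
    """Convert a price string such as '£51.77' to a float; return None if it cannot be parsed."""
    cleaned = raw.replace("£", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None
```

Applying it before export, for example `book['Price'] = parse_price(book['Price'])`, gives the CSV a numeric price column.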


### **14.4 Deploying and Automating the Web Scraper**

1. **Using Cron Jobs (Linux/Mac) or Task Scheduler (Windows)**
   Automate the script to run periodically:
   - For Linux/Mac, schedule the script using a cron job:
     ```bash
     crontab -e
     # Add a line to run the script every day at 9 AM
     0 9 * * * /usr/bin/python3 /path/to/your/script.py
     ```
   - For Windows, use Task Scheduler to set up periodic execution.

2. **Deploy on a Cloud Service**
   Use a cloud platform like AWS Lambda or Heroku for deployment:
   - **Heroku Deployment**:
     - Create a `requirements.txt` file with all dependencies.
     - Use `git` to push the project to Heroku.


### **14.5 Project Showcase and Review**

- **Summary of Features:**
  - Scrapes book data (title, price, availability).
  - Handles multipage scraping dynamically.
  - Cleans and exports data to CSV.
  - Automates scraping for regular updates.

- **Key Challenges and Solutions:**
  - **Challenge**: Missing pages or invalid URLs.  
    **Solution**: Use try-except blocks to handle errors gracefully.
  - **Challenge**: Dynamic content (if present).  
    **Solution**: Integrate Selenium when necessary.

- **Code Overview**:
  The final script handles real-world challenges like pagination, data cleaning, and exporting while maintaining simplicity.
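
The try-except approach mentioned above can be extended with retries for transient failures. This is a sketch under assumptions: `fetch_with_retries` is a hypothetical helper, `fetch` is any function that takes a URL and raises on failure, and `retries` is expected to be at least 1.

```python
import time


def fetch_with_retries(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call `fetch(url)` up to `retries` times, pausing between attempts."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            last_error = exc
            print(f"Attempt {attempt} for {url} failed: {exc}")
            if attempt < retries:
                sleep(delay * attempt)  # wait a little longer after each failure
    # Every attempt failed: re-raise the most recent error
    raise last_error
```

With `requests`, a call might look like `fetch_with_retries(lambda u: requests.get(u, timeout=10), url)`.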


## **15. Conclusion and Next Steps**

### **15.1 Recap of Key Learnings**
In this course, we covered **BeautifulSoup** from basic concepts to advanced techniques. Here’s a summary of what you’ve learned:
- **Basics of Web Scraping**: Understanding the ethical and legal considerations, setting up your environment, and understanding HTML structure.
- **Navigating HTML with BeautifulSoup**: Using tags, attributes, and advanced CSS selectors to extract data efficiently.
- **Handling Complex Scenarios**: Parsing dynamic content using Selenium and handling AJAX requests for JSON data.
- **Data Cleaning and Exporting**: Cleaning the extracted data and saving it in formats like CSV, JSON, or databases.
- **Error Handling and Optimization**: Managing HTTP errors, retries, and optimizing scrapers to avoid blocking.
- **Real-World Projects**: Implementing scraping pipelines for e-commerce, news, and weather data.
- **Deployment and Automation**: Using cloud services like AWS or scheduling tasks with cron jobs.

### **15.2 Resources for Further Learning**
To continue building your skills, consider exploring these resources:
1. **Books**:
   - *"Automate the Boring Stuff with Python"* by Al Sweigart.
   - *"Web Scraping with Python"* by Ryan Mitchell.
2. **Online Courses**:
   - Python-focused courses on platforms like Coursera, Udemy, and edX.
   - Free tutorials on [Real Python](https://realpython.com/).
3. **Official Documentation**:
   - [BeautifulSoup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)

### **15.3 Tips for Building Your Own Web Scrapers**
1. **Start Small**: Test your scraper on a small, static site before scaling to more complex or dynamic sites.
2. **Understand Robots.txt**: Always check the `robots.txt` file of a website to respect its scraping policies.
3. **Use Headers and Proxies**: Include appropriate headers like user-agent and rotate proxies to prevent IP bans.
4. **Handle Pagination Gracefully**: Ensure your scraper can dynamically move through pages, checking for `next` buttons or unique page URLs.
5. **Test Often**: Websites frequently change their HTML structure, so test and update your scraper regularly.
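
The `robots.txt` check in tip 2 can be automated with Python's standard library. The sketch below parses an inline, hypothetical `robots.txt` so it runs without network access; a real scraper would typically load the file with `set_url()` and `read()` instead. The user agent name `my-scraper` is also an assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content used only for illustration
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def is_allowed(url, user_agent="my-scraper"):
    """Return True if the parsed rules allow `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/catalogue/page-1.html"))  # True
print(is_allowed("https://example.com/private/data.html"))      # False
```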


### **15.4 Addressing Pagination Issues with a New URL**

If your original URL doesn’t contain multiple pages, try using a website that supports pagination. Let’s scrape data from a valid multi-page example.

**Example URL**:  
We will use the jobs listing site *Remotive.io*. Here’s the base URL:  
`https://remotive.io/remote-jobs/software-dev`

This site contains job listings across multiple pages.

**Updated Steps for Pagination:**
1. **Identify the Pagination Logic**:
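   - Check the URL structure for pagination. If the site loads further pages dynamically, a plain `requests` call will not see them, and you will need to call the underlying AJAX endpoint or simulate clicks using Selenium. The snippet below assumes the site accepts a `?page=N` query parameter and uses an illustrative CSS selector; confirm both in your browser before relying on them.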

2. **Code Implementation:**

Here’s a snippet for handling pagination dynamically:



In [None]:
import requests
from bs4 import BeautifulSoup
import time

base_url = "https://remotive.io/remote-jobs/software-dev"

def scrape_jobs(base_url, max_pages=5):
    current_page = 1
    all_jobs = []

    while current_page <= max_pages:
        print(f"Scraping page {current_page}...")
        # Request the page by number (assumes the site supports a ?page=N parameter)
        response = requests.get(f"{base_url}?page={current_page}")
        
        if response.status_code != 200:
            print(f"Error fetching page {current_page}: {response.status_code}")
            break

        soup = BeautifulSoup(response.text, "html.parser")
        
        # Extract job details
        jobs = soup.select("div.job-tile-title")
        for job in jobs:
            title = job.text.strip()
            all_jobs.append(title)
        
        # Pause between requests to avoid overloading the server
        time.sleep(2)
        
        current_page += 1

    return all_jobs

# Fetch and print job titles
job_list = scrape_jobs(base_url)
for idx, job in enumerate(job_list, 1):
    print(f"{idx}. {job}")


**Challenges Resolved**:
- **Dynamic Pagination**: The script now dynamically scrapes up to 5 pages, as specified by `max_pages`.
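- **Error Handling**: Non-200 responses stop the loop with a message. Network exceptions such as timeouts are not caught and would need a `try`/`except` around `requests.get`.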
- **Flexible Base URL**: You can replace the `base_url` with other paginated websites.

### **15.5 Q&A and Troubleshooting Common Issues**

**Q1. My scraper suddenly stops working. What should I do?**  
- Inspect the website's structure to see if it has changed. Use browser developer tools to verify.
- If the website blocks your IP, try using proxies or VPNs.

**Q2. How can I scrape JavaScript-rendered content?**  
- Use Selenium or Playwright to interact with the DOM rendered by JavaScript.
- Alternatively, inspect the network tab for AJAX requests that return raw data.
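
When the network tab reveals a JSON endpoint, the data can be read directly without parsing HTML. A minimal sketch, assuming a hypothetical response body; the `jobs` and `title` keys are illustrative and will differ per site.

```python
import json

# Hypothetical response body, shaped like what a JSON endpoint might return
raw_body = '{"jobs": [{"title": "Backend Developer"}, {"title": "Data Engineer"}]}'

payload = json.loads(raw_body)
# .get with a default avoids a KeyError if the key is missing
titles = [job["title"] for job in payload.get("jobs", [])]
print(titles)  # ['Backend Developer', 'Data Engineer']
```

With `requests`, `response.json()` returns the same parsed structure from a live response.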

**Q3. What to do if the site has anti-scraping mechanisms like CAPTCHAs?**  
- Use services like 2Captcha to bypass CAPTCHAs.
- Minimize frequent requests by adding random delays.
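
Random delays can be added with a few lines of standard-library code. In this sketch, `polite_sleep` is a hypothetical helper and the default bounds are arbitrary; the `sleep` parameter exists only so the function can be tested without waiting.

```python
import random
import time


def polite_sleep(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Pause for a random duration between the two bounds and return the delay used."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay
```

Calling `polite_sleep()` between requests spaces them out unevenly, which is gentler on the server than a fixed interval.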

**Q4. How do I export large datasets?**  
- Write data incrementally to CSV or database to avoid memory overflow.
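
Incremental writing can be sketched with the standard `csv` module. The helper name `write_rows_incrementally` and the idea of passing rows in per-page batches are assumptions for illustration; any iterable of row lists works.

```python
import csv


def write_rows_incrementally(path, batches, fieldnames):
    """Write each batch of dict rows to `path` as it arrives instead of holding everything in memory."""
    total = 0
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for batch in batches:  # `batches` can be a generator yielding one page of rows at a time
            writer.writerows(batch)
            total += len(batch)
    return total
```

Passing a generator that yields one page of results at a time keeps memory use roughly constant regardless of dataset size.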

With this comprehensive guide, you now have all the tools and knowledge needed to build robust scrapers for various applications. Best of luck in your web scraping journey!
