# Web Scraping Basics with Python

## HTML Structure Basics
Before scraping data from a webpage, it's important to understand how HTML works, as it is the main language used to structure websites.

### Elements of a Webpage:

HTML consists of elements, written with tags, that define the structure and content of a webpage.
Here's a basic example of HTML:

```html
<html> <!-- This is the root element of the HTML document. All content is enclosed within this tag. -->
  <head> <!-- The <head> section contains metadata about the webpage, such as the title or linked resources. -->
    <title>Web Scraping Example</title> <!-- The <title> element defines the title of the webpage, which appears on the browser tab. -->
  </head>
  <body> <!-- The <body> section contains all the visible content of the webpage. -->
    <h1>Welcome to Web Scraping</h1> <!-- The <h1> tag represents the main heading of the webpage. -->
    <p>This is a paragraph.</p> <!-- The <p> tag defines a paragraph of text. -->
    <div class="content"> <!-- The <div> element is a container used to group content, in this case, it has a class "content" for styling or identification. -->
      <span>Data we want to extract</span> <!-- The <span> tag is an inline element used to define a portion of text. Here, it contains the data we want to scrape. -->
    </div>
  </body>
</html> <!-- The closing tag for the root HTML element, marking the end of the document. -->
```

In [2]:
from IPython.display import HTML

html_code = """
<html>
  <head>
    <title>Web Scraping Example</title>
  </head>
  <body>
    <h1>Welcome to Web Scraping</h1>
    <p>This is a paragraph.</p>
    <div class="content">
      <span>Data we want to extract. </span>
    </div>
  </body>
</html>
"""

# Render the HTML
HTML(html_code)

### Common HTML Tags

- `<h1>, <h2>, <h3>, <h4>, <h5>, <h6>`:  
  Headings, where `<h1>` is the largest and `<h6>` is the smallest. These tags are used to define headings or titles on a webpage.

- `<p>`:  
  Paragraph. Used to define blocks of text content.

- `<div>`:  
  Division. It groups together sections of content. Commonly used to structure and organize the layout of the webpage.

- `<span>`:  
  Inline container for text. It's used for styling a part of the text or grouping smaller chunks of content.

- `<a>`:  
  Anchor tag or hyperlink. It creates links to other pages or sections within the same page using the `href` attribute.

- `<img>`:  
  Image. Used to display an image on the webpage. It requires the `src` attribute, which defines the path to the image.

- `<ul>`:  
  Unordered list. Creates a list of items with bullet points.

- `<ol>`:  
  Ordered list. Creates a list of items with numbers.

- `<li>`:  
  List item. Used inside `<ul>` or `<ol>` to define each item in the list.

- `<table>`:  
  Table. Used to display data in a tabular format.

- `<tr>`:  
  Table row. Defines a row in a table.

- `<td>`:  
  Table data. Represents a cell within a row in the table.

- `<th>`:  
  Table header. Defines a header cell in a table, often bold and centered.

- `<form>`:  
  Form. Used to collect user input. It often contains elements like text fields, radio buttons, and submit buttons.

- `<input>`:  
  Input field. Used in forms to collect data from users (e.g., text fields, checkboxes, radio buttons).

- `<button>`:  
  Button. Creates a clickable button, often used to submit forms or trigger events.

- `<textarea>`:  
  Multi-line input field. Allows users to enter larger blocks of text.

- `<select>`:  
  Dropdown menu. Allows users to select one option from a list.

- `<option>`:  
  Defines each option inside a `<select>` dropdown menu.

- `<strong>`:  
  Marks text as strongly important; browsers usually render it in bold.

- `<em>`:  
  Emphasizes text, usually italicized.

- `<br>`:  
  Line break. Moves content to a new line.

- `<hr>`:  
  Horizontal rule. Creates a horizontal line across the webpage, often used to separate sections.

### Attributes

HTML tags can have attributes that provide additional information about the element. These attributes are placed inside the opening tag and follow the pattern: `attribute="value"`. Attributes help define the behavior, appearance, or content of the elements on a webpage.

#### Example:
```html
<div class="content">This is a content section.</div>
```

#### Common HTML Attributes:

- **`id`**:  
  The `id` attribute is like giving a name tag to an HTML element. This name must be unique on the page, meaning no two elements can have the same `id`. It's particularly useful when you want to apply specific styles with CSS or perform actions on that element using JavaScript.

  - **Usage**:
    ```html
    <div id="header">This is the header.</div>
    ```
    Here, the `id="header"` uniquely identifies the `<div>` tag. This makes it easy for developers to style the header in CSS or manipulate it with JavaScript, ensuring that only this element is affected.

  - **Context**:  
    IDs are commonly used for important sections of a webpage like headers, footers, or navigation menus, where specific and unique behavior or styling is required. The unique nature of `id` makes it an essential tool for targeting single elements.
 

- **`class`**:  
  The `class` attribute is like assigning a label to an HTML element, and you can give the same label to multiple elements on the page. This is especially useful when you want to apply the same styles to a group of elements or perform similar actions on them using JavaScript. 

  - **Usage**:
    ```html
    <div class="content">This is the first content section.</div>
    <div class="content">This is the second content section.</div>
    ```
    In this example, both `<div>` elements share the class `"content"`, which means they can both be styled the same way in CSS or have the same functionality applied using JavaScript.

  - **Context**:  
    Classes are heavily used in web development because they allow multiple elements to share the same styling or behavior. For example, you might use the same class for all buttons or navigation links on your page to give them a uniform look. Unlike `id`, which is unique, a `class` can be shared by many elements, making it a powerful tool for grouping similar content.

- **`href`**:  
  The `href` (Hypertext REFerence) attribute is used in anchor (`<a>`) tags to specify the URL (web address) of the page you want to link to. This attribute turns the text or image inside the anchor tag into a clickable link.

  - **Usage**:
    ```html
    <a href="https://www.example.com">Visit Example</a>
    ```
    In this example, the `href="https://www.example.com"` tells the browser where the link should take the user. When someone clicks on the text "Visit Example," they will be taken to `https://www.example.com`.

  - **Context**:  
    The `href` attribute is crucial for creating hyperlinks, which are the core of how we navigate the web. Without the `href` attribute, a link would just be regular text without any functionality. Links can point to other websites, sections within the same page, or even files for download.

- **`src`**:  
  The `src` (source) attribute is used in tags like `<img>` for images and `<script>` for JavaScript files to tell the browser where the file is located. It's like providing the address or path for the file that the browser needs to display or execute.

  - **Usage**:
    ```html
    <img src="image.jpg" alt="Sample image">
    ```
    In this example, the `src="image.jpg"` tells the browser where to find the image file named "image.jpg". The image will be displayed at this location on the webpage.

  - **Context**:  
    The `src` attribute is critical for embedding visual content like images, videos, or audio files, and also for linking external resources like JavaScript files. It ensures that the correct resource is fetched and displayed or executed by the browser, making it essential for delivering multimedia and interactive content on web pages.

- **`style`**:  
  The `style` attribute allows you to add CSS (Cascading Style Sheets) rules directly to an HTML element. This is called inline styling. While it’s often better to keep your styles in a separate CSS file for better organization, `style` can be useful for quick or one-time styling adjustments.

  - **Usage**:
    ```html
    <p style="color: red;">This text will be red.</p>
    ```
    In this example, the `style="color: red;"` applies a red color to the text within the `<p>` tag. It changes the text color for that specific paragraph.

  - **Context**:  
    The `style` attribute is convenient when you need to apply unique styling to an element without creating an entire CSS rule in a separate stylesheet. However, it’s generally considered better practice to keep styles in a separate CSS file for easier management and cleaner code. Using inline styles too often can make your HTML harder to maintain.

- **`target`**:  
  The `target` attribute is used in links to specify where the linked page should open. For instance, if you use `target="_blank"`, the link will open in a new tab or window, allowing the user to visit the new page without leaving the current one.

  - **Usage**:
    ```html
    <a href="https://www.example.com" target="_blank">Visit Example</a>
    ```
    In this example, the `target="_blank"` ensures that when someone clicks the "Visit Example" link, it will open in a new browser tab instead of replacing the current page.

  - **Context**:  
    The `target` attribute is commonly used when you want to direct users to an external site or a different page without taking them away from the current one. This is particularly useful for external links or opening additional resources while keeping the user’s current browsing session intact.

### Why Attributes Matter:
Attributes are essential for making HTML elements more interactive and meaningful. They provide additional information to browsers and developers about how the elements should behave or look on the webpage. When scraping a webpage, attributes such as `id`, `class`, or `src` help target and identify the data you want to extract.
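
As a rough illustration of how these attributes become search handles in a scraper, the sketch below parses a small, made-up HTML snippet with BeautifulSoup (the library introduced later in this guide). The snippet, its `id` and `class` values, and the variable names are assumptions chosen for demonstration only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
sample_html = """
<div id="header">Site header</div>
<div class="content">First section</div>
<div class="content">Second section</div>
<a href="https://www.example.com" target="_blank">Visit Example</a>
<img src="image.jpg" alt="Sample image">
"""

soup = BeautifulSoup(sample_html, "html.parser")

# `id` is unique, so find() with id= returns a single element (or None)
header_text = soup.find(id="header").get_text()

# `class` can be shared, so find_all() returns every matching element
section_texts = [div.get_text() for div in soup.find_all("div", class_="content")]

# Attribute values are read like dictionary entries on a tag
link_url = soup.find("a")["href"]
image_src = soup.find("img").get("src")  # .get() returns None if the attribute is missing

print(header_text)
print(section_texts)
print(link_url, image_src)
```

Square-bracket access raises a `KeyError` when the attribute is absent, so `.get()` tends to be the safer choice on pages where not every element carries the attribute.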

### Using Browser Developer Tools

When scraping data from a website, it's important to first understand the HTML structure of the page. Modern browsers like Chrome, Firefox, and Edge come with built-in developer tools that allow you to inspect and explore the webpage’s code. This helps you find the specific tags, attributes, and data you need to scrape.

#### Steps to Use Developer Tools:

1. **Open any webpage** (e.g., `https://quotes.toscrape.com`):
   - In this example, let's say we want to scrape quotes and authors from the webpage.

2. **Right-click on the element you want to inspect**:
   - Find a part of the webpage that contains the data you're interested in. For example, right-click on a quote or an author’s name.
   - A context menu will appear. Select **Inspect** (in Chrome) or **Inspect Element** (in Firefox).

3. **Explore the HTML structure**:
   - When you select **Inspect**, a panel will appear, usually at the bottom or side of the screen. This panel shows the HTML code of the webpage.
   - As you hover over different HTML elements in the panel, the corresponding part of the webpage will be highlighted, allowing you to see exactly which HTML tag controls that section.
   
4. **Identify key tags and attributes**:
   - In the developer tools, you can view the HTML tags (like `<div>`, `<span>`, or `<a>`), attributes (such as `class` or `id`), and content inside those tags.
   - For example, if you are trying to scrape a quote from the page, you might notice something like:
     ```html
     <div class="quote">
       <span class="text">"The world is full of magical things patiently waiting for our wits to grow sharper."</span>
       <span> - Author Name</span>
     </div>
     ```
   - In this case, the quote text is inside the `<span>` tag with the class `text`. You can use this information in your scraper to locate and extract the quote.

5. **Copy the exact HTML path**:
   - Right-click on the selected element in the developer tools and choose **Copy > Copy XPath** or **Copy Selector**. This provides you with the precise path or CSS selector that can be used to extract the data programmatically.

6. **Check dynamic content**:
   - Some websites load content dynamically using JavaScript. In the developer tools, you can observe the **Network** tab to see if additional requests are being made by the browser to fetch more data (e.g., when you scroll down a webpage).
   - If you notice network requests fetching data in formats like JSON, it might be more efficient to scrape that data directly from the request rather than from the HTML.

7. **Edit HTML and Test**:
   - You can also use developer tools to **temporarily edit** the HTML or CSS to see how changes affect the webpage. For example, you can modify the style of elements or content without actually changing the website itself. It’s great for testing visual adjustments before implementing them in your code.
   - This is useful when testing how data is structured or confirming which elements to scrape.
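
A selector found through the developer tools can be tried out in Python before building a full scraper. The sketch below assumes BeautifulSoup (covered later in this guide) and uses invented markup modeled on the quote example in step 4; the selector strings are illustrative, not copied from a real page. BeautifulSoup's `select()` accepts CSS selectors only, so a copied XPath would need a different library such as `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modeled on the quote example above
sample_html = """
<div class="quote">
  <span class="text">"An example quote."</span>
  <span> - Author Name</span>
</div>
<div class="quote">
  <span class="text">"Another example quote."</span>
  <span> - Someone Else</span>
</div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# select() takes a CSS selector and returns a list of all matches
quote_texts = [span.get_text() for span in soup.select("div.quote > span.text")]

# select_one() returns only the first match (or None if nothing matches)
first_quote = soup.select_one("div.quote span.text").get_text()

print(quote_texts)
print(first_quote)
```

Selectors copied straight from the browser are often long and tied to the exact page layout. A shorter selector built from a class name, as here, usually survives small layout changes better.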

#### Why Using Developer Tools is Important:
- **Technical Understanding**: It helps you understand the structure and layout of a webpage, making it easier to locate the exact tags and attributes for scraping.
- **Efficiency**: By using developer tools, you can quickly find the parts of the page you want to scrape, without having to sift through the entire HTML document.
- **Dynamic Content**: It allows you to check if content is dynamically loaded, which can change your scraping strategy.

Using these tools is the first step before building your web scraper, ensuring you know exactly how to target and extract the data you need.
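
When the **Network** tab shows that a page loads its data as JSON, that data can be handled as ordinary Python dictionaries and lists. The following is a minimal sketch under stated assumptions: the response body and its key names (`quotes`, `text`, `author`, `has_next`) are invented for illustration, and a real site will define its own structure.

```python
import json

# Hypothetical response body, as it might appear in the Network tab.
# The key names here are invented for illustration; a real site defines its own.
raw_body = """
{
  "has_next": true,
  "quotes": [
    {"text": "An example quote.", "author": {"name": "Author Name"}, "tags": ["example"]},
    {"text": "Another example quote.", "author": {"name": "Someone Else"}, "tags": []}
  ]
}
"""

payload = json.loads(raw_body)  # with requests, response.json() does this step

# Nested values are reached with ordinary dictionary and list access
rows = [
    {"quote": item["text"], "author": item["author"]["name"], "tags": item["tags"]}
    for item in payload["quotes"]
]

print(rows)
print("More pages available:", payload["has_next"])
```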

## Python Libraries for Web Scraping

When scraping web content using Python, there are two key libraries that help us do the job:

### 1. **`requests`**
- **What it does**:  
  The `requests` library is like a tool that allows your Python program to talk to websites. When you type a web address into a browser, the browser sends a request to the website’s server, asking for the webpage. The server responds by sending back the HTML code, which the browser then displays for you.  
  Similarly, the `requests` library allows Python to send the same kind of request to a website’s server to fetch the webpage content.
  
- **Simple analogy**:  
  Think of it as sending a letter (the request) to a website, asking for a copy of its page. The website sends back the webpage content as a response, just like receiving a letter in return. Now, instead of a browser displaying that page, you can use Python to do something with the content (like scraping the data).
  
- **Common use in scraping**:  
  In web scraping, `requests` is used to download the HTML of a webpage so that we can process it and extract the information we need.


In [1]:
import requests

# Send a request to a webpage
response = requests.get('https://quotes.toscrape.com/')

# Print the content of the webpage
print(response.text)

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itempr

### 2. **`BeautifulSoup`**

- **What it does**:  
  BeautifulSoup is a library that helps you take the messy HTML code from a webpage and break it down into an organized structure. Once you have this structure, you can easily search for and extract specific pieces of data (like text, links, or images).

- **Simple analogy**:  
  Imagine you get a big, jumbled box of puzzle pieces (the raw HTML). BeautifulSoup helps you sort and organize those pieces so you can pick out the ones you need.

- **Common use in scraping**:  
  After `requests` fetches the raw HTML, BeautifulSoup helps you parse it (break it down) and navigate through the HTML tags (like `<div>`, `<p>`, `<a>`) to find specific data you want to scrape.

In [6]:
import requests
from bs4 import BeautifulSoup

# Send a request to the webpage
url = 'https://quotes.toscrape.com/'
response = requests.get(url)  # Fetch the content of the page

# Check if the request was successful (status code 200 indicates success)
if response.status_code == 200:
    print("Successfully fetched the webpage content!")
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")

# The content of the webpage is stored in `response.text`. This is the raw HTML of the webpage.
# Let's print a small portion of it to see what it looks like.

print(response.text[:500])  # Print first 500 characters of the HTML content

Successfully fetched the webpage content!
<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md


In [8]:
# Now that we have the raw HTML, we can use BeautifulSoup to parse it and navigate through the structure to extract data.
# Parse the HTML with BeautifulSoup
soup = BeautifulSoup(response.text, 'html.parser')  # Parse the HTML and create a BeautifulSoup object

# Print the page title
print("Page Title:", soup.title.string)  # Extract and print the title of the webpage

# Extract all quotes on the page
# The quotes are inside <span> tags with the class "text"
quotes = soup.find_all('span', class_='text')
print("\nQuotes on the Page:")
for quote in quotes:
    print(quote.text)  # Extract and print the text inside each quote

# Extract all authors on the page
# The authors' names are inside <small> tags with the class "author"
authors = soup.find_all('small', class_='author')
print("\nAuthors of the Quotes:")
for author in authors:
    print(author.text)  # Extract and print the author's name

# Extract all tags for each quote
# Each quote has associated tags that are inside <a> tags with the class "tag" within a <div> with the class "quote"
quote_divs = soup.find_all('div', class_='quote')  # Find all quote <div> elements
print("\nTags for Each Quote:")
for div in quote_divs:
    tags = div.find_all('a', class_='tag')  # Extract all <a> tags with the class "tag" within each quote
    tag_list = [tag.text for tag in tags]  # Create a list of the tag texts
    print("Tags:", ', '.join(tag_list))  # Print all tags for each quote, separated by commas

Page Title: Quotes to Scrape

Quotes on the Page:
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
“It is our choices, Harry, that show what we truly are, far more than our abilities.”
“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
“Try not to become a man of success. Rather become a man of value.”
“It is better to be hated for what you are than to be loved for what you are not.”
“I have not failed. I've just found 10,000 ways that won't work.”
“A woman is like a tea bag; you never know how strong it is until it's in hot water.”
“A day without sunshine is like, you know, night.”

Authors of the Quotes:
Albert Einstein
J.K. 

### Why Use These Libraries Together?

- **Requests**:  
  The `requests` library is responsible for sending an HTTP request to the server and downloading the content of the webpage. It retrieves the raw HTML, which contains all the data on the page, but it’s just a jumble of code at this point.

- **BeautifulSoup**:  
  Once you have the raw HTML from `requests`, `BeautifulSoup` steps in to help. It parses (or breaks down) the HTML into a structured format, making it easy to navigate through the tags and extract the specific pieces of information you’re looking for (such as text, links, or images).

Together, these two libraries work hand in hand to make web scraping much easier:
1. **First**, you use `requests` to download the webpage content.
2. **Then**, you use `BeautifulSoup` to pick out the specific pieces of information you’re interested in.

By combining these libraries, you can automate the process of collecting data from websites, whether it's for analysis, creating datasets, or other projects.

## Data Handling and Storage

After extracting the data using BeautifulSoup, it's important to store it in a structured format so it can be easily accessed, analyzed, or shared. Two of the most common formats for storing scraped data are **CSV** and **JSON**:

- **CSV** (Comma-Separated Values):  
  A simple text format where each row represents an entry, and columns are separated by commas. It's widely used because it's easy to import into spreadsheet programs like Excel or Google Sheets, and it’s also efficient for large datasets.

- **JSON** (JavaScript Object Notation):  
  A lightweight format that represents data as key-value pairs. It's ideal for hierarchical or nested data structures, and is often used when exchanging data between web applications and servers.
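
The practical difference between the two formats shows up with nested data. The sketch below uses one made-up record and an in-memory buffer instead of real files. It is meant only to show that JSON returns a list as a list, while CSV needs the list flattened into a string first.

```python
import csv
import io
import json

# One hypothetical record with a nested field (a list of tags)
record = {"quote": "An example quote.", "author": "Author Name", "tags": ["life", "humor"]}

# JSON keeps the list as a list
json_text = json.dumps(record)
restored_from_json = json.loads(json_text)

# CSV has no notion of nesting, so the list is flattened into one string first
buffer = io.StringIO(newline="")  # in-memory stand-in for a file
writer = csv.DictWriter(buffer, fieldnames=["quote", "author", "tags"])
writer.writeheader()
writer.writerow({**record, "tags": ", ".join(record["tags"])})

buffer.seek(0)
restored_from_csv = next(csv.DictReader(buffer))

print(type(restored_from_json["tags"]).__name__, restored_from_json["tags"])
print(type(restored_from_csv["tags"]).__name__, restored_from_csv["tags"])
```

After reading the CSV back, the original list can be recovered with `restored_from_csv["tags"].split(", ")`, provided no individual tag contains that separator.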

### Steps for Data Handling and Storage:

First, we'll organize the extracted data (quotes, authors, and tags) into lists or dictionaries and then store it into CSV and JSON formats for future use.

In [9]:
import csv  # Import the csv module to handle writing data to CSV files
import json  # Import the json module to handle writing data to JSON files

# List to store the extracted data in dictionaries
data = []

# Loop through each quote and extract the quote text, author's name, and tags
for div in quote_divs:
    # Find and extract the quote text within <span class="text">
    quote_text = div.find('span', class_='text').text  
    
    # Find and extract the author's name within <small class="author">
    author_name = div.find('small', class_='author').text  
    
    # Find all tags associated with the quote within <a class="tag">, and store them in a list
    tags = [tag.text for tag in div.find_all('a', class_='tag')]  
    
    # Append the extracted data as a dictionary to the list `data`
    data.append({
        'quote': quote_text,  # Add quote text to the dictionary
        'author': author_name,  # Add author's name to the dictionary
        'tags': tags  # Add list of tags to the dictionary
    })

# --- Storing Data in CSV Format ---
# Open a file named 'quotes.csv' in write mode with UTF-8 encoding to handle special characters
with open('quotes.csv', mode='w', newline='', encoding='utf-8') as file:
    # Create a DictWriter object to write dictionaries into the CSV file
    writer = csv.DictWriter(file, fieldnames=['quote', 'author', 'tags'])  # Define column headers
    
    # Write the header row (quote, author, tags) to the CSV file
    writer.writeheader()  
    
    # Loop through each row (dictionary) in the `data` list
    for row in data:
        # Convert the list of tags into a single comma-separated string.
        # Note: this modifies the dictionaries in `data` in place, so the JSON file
        # written below also stores the tags as a string rather than a list.
        row['tags'] = ', '.join(row['tags'])
        
        # Write each dictionary as a row in the CSV file
        writer.writerow(row)  

print("Data has been stored in quotes.csv")  # Confirmation message for CSV storage

# --- Storing Data in JSON Format ---
# Open a file named 'quotes.json' in write mode with UTF-8 encoding
with open('quotes.json', mode='w', encoding='utf-8') as file:
    # Dump (write) the list of dictionaries `data` to the JSON file
    json.dump(data, file, indent=4)  # `indent=4` makes the JSON file more readable by adding indentation

print("Data has been stored in quotes.json")  # Confirmation message for JSON storage

Data has been stored in quotes.csv
Data has been stored in quotes.json


In [15]:
## Loading and Viewing Data from quotes.csv as a Pandas DataFrame

import pandas as pd  # Import pandas for handling CSV files

# --- Load Data from CSV File into a Pandas DataFrame ---
df_csv = pd.read_csv('quotes.csv')  # Load the CSV file into a DataFrame

In [16]:
# View the first few rows of the DataFrame
print("Data from CSV file (as DataFrame):")
df_csv.head()  # Display the first 5 rows of the DataFrame

Data from CSV file (as DataFrame):


|   | quote | author | tags |
|---|-------|--------|------|
| 0 | “The world as we have created it is a process ... | Albert Einstein | change, deep-thoughts, thinking, world |
| 1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | abilities, choices |
| 2 | “There are only two ways to live your life. On... | Albert Einstein | inspirational, life, live, miracle, miracles |
| 3 | “The person, be it gentleman or lady, who has ... | Jane Austen | aliteracy, books, classic, humor |
| 4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | be-yourself, inspirational |


In [17]:
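# Additional view options
print("\nData Overview:")
# info() prints its summary directly and returns None, so wrapping it in print()
# is what adds the trailing "None" to the output
print(df_csv.info())  # Get a summary of the DataFrame, including data types and non-null counts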


Data Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   quote   10 non-null     object
 1   author  10 non-null     object
 2   tags    10 non-null     object
dtypes: object(3)
memory usage: 372.0+ bytes
None


In [23]:
## Loading and Viewing Data from quotes.json as a JSON Object

import json  # Import json module to handle JSON files

# --- Load Data from JSON File ---
# Open the quotes.json file in read mode with UTF-8 encoding
with open('quotes.json', mode='r', encoding='utf-8') as file:
    data_from_json = json.load(file)  # Load the JSON file into a Python list of dictionaries

# --- View the JSON Data ---
# Pretty-print the first 5 entries from the JSON file for better readability
print("\nData from JSON file (first 5 entries):")
print(json.dumps(data_from_json[:5], indent=4, ensure_ascii=False))  # ensure_ascii=False to display non-ASCII characters properly

# --- Additional Data Insights ---
# Print the total number of quotes in the JSON file
print(f"\nTotal number of quotes in the JSON file: {len(data_from_json)}")


Data from JSON file (first 5 entries):
[
    {
        "quote": "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”",
        "author": "Albert Einstein",
        "tags": "change, deep-thoughts, thinking, world"
    },
    {
        "quote": "“It is our choices, Harry, that show what we truly are, far more than our abilities.”",
        "author": "J.K. Rowling",
        "tags": "abilities, choices"
    },
    {
        "quote": "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”",
        "author": "Albert Einstein",
        "tags": "inspirational, life, live, miracle, miracles"
    },
    {
        "quote": "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”",
        "author": "Jane Austen",
        "tags": "aliteracy, books, classic, humor"
    },
    {
        "quote": "“Imperfection 

## Error Handling and Debugging in Web Scraping

### 1. Common Issues in Scraping

#### 1.1 Handling HTTP Errors

When scraping websites, you may encounter various **HTTP status codes** that indicate problems with your request. Understanding these error codes is critical for building robust web scrapers.

#### 403 (Forbidden)

The **403 Forbidden** status code means that the server is denying your access to the requested resource. This often happens when the server detects that you are using a bot or scraper, instead of a normal browser.

**Solution**:  
To avoid the 403 error, you can simulate a browser request by adding HTTP headers, especially the `User-Agent` header, which tells the server what type of browser is making the request. Many websites block requests without a valid `User-Agent`.

#### What is a User-Agent?

The **User-Agent** is a string that your browser or application sends to the server to identify itself. It typically includes information about the browser type, operating system, and device. When a website receives a request, it checks the `User-Agent` to understand which browser and platform are making the request, and may tailor the response or restrict access based on that information.

For example, a typical `User-Agent` for Google Chrome on a Windows machine might look like this:

```plaintext
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3
```

The `User-Agent` tells the website:

- **Mozilla/5.0**: A historical compatibility token. Nearly every modern browser sends it, so it does not mean the browser is actually a Mozilla browser.
- **Windows NT 10.0**: The operating system (Windows 10 in this case).
- **AppleWebKit/537.36**: The layout engine token. Safari uses WebKit; Chrome uses Blink, a fork of WebKit, and keeps this token for compatibility.
- **Chrome/58.0.3029.110**: The browser and its version (Chrome version 58).
- **Safari/537.3**: Also shows compatibility with Safari.

#### Why does `User-Agent` Fix the 403 Error?

Many servers block non-browser requests, as these could be from bots or scrapers. When you make a request without a `User-Agent` or with a generic one (e.g., the default `User-Agent` from libraries like `requests`), the server may flag your request as suspicious and deny it (returning a **403 Forbidden** error).

By adding a valid `User-Agent`, you trick the server into thinking the request is coming from a regular browser, which is generally allowed. The server will respond as if it’s interacting with a human using a browser, and it will likely return the requested data.

In [30]:
import requests

url = 'http://example.com'

# Add a User-Agent to simulate a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

# Send the request with the User-Agent header
response = requests.get(url, headers=headers)
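
When a scraper makes many requests, passing `headers=` each time becomes repetitive. One possible alternative, sketched below, is a `requests.Session`, which stores headers once and sends them with every request made through it. The sketch also prints the default `User-Agent` that `requests` sends when none is set. The network call is left commented out so the example runs offline.

```python
import requests

# The User-Agent that requests sends when none is specified
default_agent = requests.utils.default_user_agent()
print(default_agent)  # a string beginning with "python-requests/"

# A Session remembers headers, so every request made through it sends them
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
})

print(session.headers["User-Agent"])

# Usage (performs a real network request, so it is left commented out here):
# response = session.get("http://example.com")
```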

#### 404 (Not Found)

The **404 Not Found** status code means that the requested URL does not exist. This could happen if the URL is incorrect or the page has been removed.

#### **Solution**:
1. **Double-check the URL**: Make sure the URL you're trying to scrape is correct and valid. This is the first step to avoid the error.
   
2. **Handle this gracefully in code**: 
   - Handling an error gracefully means that instead of letting the program crash or return an uninformative error message, your code should catch the error, log it if necessary, and allow the program to continue running or exit in a controlled way.
   - For example, you might log a helpful message like "Page not found" and skip the invalid URL, rather than having the entire scraping process fail because of one bad URL.

In [33]:
import requests

url = 'https://example.com/nonexistentpage'

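try:
    response = requests.get(url, timeout=10)
    
    # Check if the status code is 404
    if response.status_code == 404:
        print(f"Error 404: The page {url} was not found.")
    elif response.ok:
        # Any status code below 400 counts as a successful fetch
        print(f"Page {url} fetched successfully.")
    else:
        # Other client or server errors (e.g., 403 or 500)
        print(f"Request for {url} failed with status code {response.status_code}.")
except requests.exceptions.RequestException as e:
    # Network-level problems such as timeouts or connection errors
    print(f"An error occurred: {e}")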

Error 404: The page https://example.com/nonexistentpage was not found.


Graceful handling means we check if the status code is 404, and instead of crashing, we display a useful message and move on. This helps the script handle missing pages or incorrect URLs without interrupting the overall scraping process.
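
The "log it and skip the bad URL" idea can be sketched for a whole list of URLs. This is a minimal sketch, not a definitive implementation: `fetch_all` and the `fetch` parameter are hypothetical names, and `fetch` stands for any callable that takes a URL and returns an object with a `status_code` attribute.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")


def fetch_all(urls, fetch):
    """Fetch each URL with `fetch`, logging and skipping the ones that fail.

    Returns (pages, skipped): successful responses keyed by URL, and the
    list of URLs that were skipped.
    """
    pages, skipped = {}, []
    for url in urls:
        try:
            response = fetch(url)
        except Exception as exc:  # e.g., connection errors or timeouts
            logger.warning("Request failed for %s: %s", url, exc)
            skipped.append(url)
            continue

        if response.status_code == 404:
            logger.warning("Page not found, skipping: %s", url)
            skipped.append(url)
        elif response.status_code != 200:
            logger.warning("Unexpected status %s for %s", response.status_code, url)
            skipped.append(url)
        else:
            pages[url] = response
    return pages, skipped
```

With `requests` imported, calling `fetch_all(urls, requests.get)` would process every URL in `urls` and leave the failed ones in `skipped` for later inspection, so one bad URL does not stop the run.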

#### 429 (Too Many Requests)

The **429 Too Many Requests** status code means that the server is limiting your requests because you're sending too many requests in a short period. This is known as **rate limiting**.

#### **Solution**:
To avoid the **429 Too Many Requests** error, add a **delay** (a **sleep**) between requests. A short pause after each request keeps you from sending many requests in a short period, which reduces the likelihood of being rate-limited.

#### What are `delay` and `sleep`?

- **Delay**: This simply means pausing your code's execution for a certain period before sending the next request.
- **Sleep**: In Python, you can use the `time.sleep()` function to add a delay in seconds between requests. This function temporarily halts the program's execution for the specified amount of time.

#### How Does It Fix the 429 Error?

By introducing a delay between your requests, you're effectively **throttling** them: they are sent more slowly and are more spaced out. Servers often impose rate limits (e.g., 100 requests per minute) so they are not overwhelmed by too many requests at once. If your delay keeps you under that limit, the server has no reason to answer with a 429. For example, a 3-second pause allows at most 20 requests per minute.

In [34]:
import time
import requests

# Define the URL
url = 'https://example.com/api/resource'

# Loop through multiple requests with a delay in between
for i in range(5):
    response = requests.get(url)
    
    # Check the response status
    if response.status_code == 200:
        print(f"Request {i+1}: Success")
    elif response.status_code == 429:
        print(f"Request {i+1}: Error 429 (too many requests). Waiting before the next request...")
    
    # Introduce a delay of 3 seconds between requests
    time.sleep(3)  # Pause for 3 seconds

Using `time.sleep()` lets you control the rate at which you send requests, so your scraper or bot stays within the server's rate limits and is far less likely to receive a **429 Too Many Requests** error.
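
A fixed delay does not help once a 429 has already been returned. A common complement is to retry the same request after waiting, using the server's `Retry-After` header when it is present and an exponentially growing pause otherwise. The sketch below assumes `fetch` is any callable returning an object with `status_code` and `headers`; `get_with_retry`, `max_retries`, `base_delay`, and `sleep` are hypothetical names chosen for illustration.

```python
import time


def get_with_retry(fetch, url, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `fetch(url)` and retry while the server answers 429.

    `sleep` is injectable so the waiting can be replaced in tests.
    """
    response = fetch(url)
    for attempt in range(max_retries):
        if response.status_code != 429:
            break
        # Prefer the server's hint when it is given as a number of seconds
        retry_after = response.headers.get("Retry-After", "")
        if retry_after.isdigit():
            delay = float(retry_after)
        else:
            # Otherwise back off exponentially: base_delay * 1, 2, 4, ...
            delay = base_delay * (2 ** attempt)
        sleep(delay)
        response = fetch(url)
    return response
```

With `requests` imported, `get_with_retry(requests.get, url)` would return the first non-429 response, or the last response if every retry was also rate-limited.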

### 1.2 Dealing with Incomplete or Corrupt Data

Scraping often involves dealing with messy HTML or incomplete data on web pages. It's common to encounter missing elements or poorly structured HTML that can cause issues during scraping.

#### **Incomplete Data**

Incomplete data refers to situations where certain expected elements (like specific tags) are missing from a webpage. For example, some pages might not have the same structure or tags as others, leading to errors in your script when it tries to extract data that doesn’t exist.

- **Solution**: Use `try-except` blocks to handle missing elements. This prevents your script from crashing when it encounters missing data. Additionally, logging errors will help track which pages are causing issues, allowing you to handle them later without stopping the entire scraping process.

#### **Example**: Handling Missing Elements with `try-except`

In [36]:
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/nonexistent-tag'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

try:
    # Try to find an element that might be missing
    title = soup.find('h1').text
    print(f"Page title: {title}")
except AttributeError:
    # Handle the case where the element is not found
    print("Error: Page title is missing!")

Page title: 404 - Not Found
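
Because `soup.find()` returns `None` when nothing matches, a missing element can also be handled with an explicit check instead of catching `AttributeError`. A minimal sketch, where `text_or_default` is a hypothetical helper name:

```python
def text_or_default(element, default=None):
    """Return the stripped text of a parsed element, or `default` if it is missing.

    `element` is whatever `soup.find(...)` returned: a tag object,
    or None when nothing matched.
    """
    if element is None:
        return default
    return element.get_text(strip=True)
```

For instance, `text_or_default(soup.find('h1'), default='(no title)')` gives the heading text when an `<h1>` exists and the placeholder otherwise, which keeps the extraction code free of nested `try-except` blocks when many optional fields are involved.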


#### **Corrupt Data**

Corrupt data refers to **malformed HTML**, where the tags or structure of the webpage are broken or not well-formed. This can cause problems when scraping because the parser may not be able to interpret the HTML correctly.

Malformed HTML can include:
- Unclosed or mismatched tags.
- Missing or duplicated elements.
- Incorrectly nested elements.

Such issues can lead to errors or incomplete data extraction.

#### **Solution**:
To handle corrupt data, parse the page with a lenient parser, one that repairs broken markup as well as it can instead of failing. Different parsers may repair the same broken markup in different ways. Two common choices with BeautifulSoup are:

1. **Using the `lxml` Parser**:
   - `lxml` is a third-party library (installed separately) that is very fast and lenient with malformed or complex HTML.

2. **Using BeautifulSoup's `"html.parser"`**:
   - `"html.parser"` is the parser from Python's standard library, so it needs no extra installation. It is also lenient, and BeautifulSoup uses it to build a usable tree even when tags are unclosed or improperly nested.

In [40]:
import requests
from bs4 import BeautifulSoup

url = 'https://example.com/malformed-html'

response = requests.get(url)

# Use BeautifulSoup's html.parser to handle messy HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Use the lxml parser, which is robust for handling malformed HTML
# soup = BeautifulSoup(response.text, 'lxml')

# Extract data from malformed HTML
print(soup.prettify())  # Prints a cleaned-up version of the HTML

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
         "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>
   404 - Not Found
  </title>
 </head>
 <body>
  <h1>
   404 - Not Found
  </h1>
  <script src="//obj.ac.bcon.ecdns.net/ec_tpm_bcon.js" type="text/javascript">
  </script>
 </body>
</html>



## Advanced Scraping Techniques: 
### Scraping Paginated Content

We will scrape paginated content from a website, meaning we will gather data spread across multiple pages. Many websites display large sets of data across several pages (e.g., product listings or blog posts), and to get all the data, we need to navigate through those pages programmatically.

We will scrape the same website `https://quotes.toscrape.com`. The quotes are spread across multiple pages, and we'll write a script to scrape quotes from all pages.

If you look at the URL of the second page, you'll notice it follows a pattern:

- Page 1 URL: `https://quotes.toscrape.com/page/1/`
- Page 2 URL: `https://quotes.toscrape.com/page/2/`

The page number is included at the end of the URL, which means we can loop through the pages by changing the number programmatically. We'll loop through pages by incrementing the page number in the URL and stopping when no more data is available. 

In [49]:
def scrape_quotes_from_page(page_url):
    """
    Scrapes quotes and authors from a single page of the website.
    
    Args:
        page_url (str): The URL of the page to scrape.
        
    Returns:
        quotes_list (list): A list of dictionaries containing quotes and authors.
    """
    # Send a GET request to the page URL
    response = requests.get(page_url)
    
    # Parse the HTML content of the page
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find all the quote elements on the page
    quotes = soup.find_all('div', class_='quote')
    
    # Extract the text and author for each quote
    quotes_list = []
    for quote in quotes:
        text = quote.find('span', class_='text').get_text()
        author = quote.find('small', class_='author').get_text()
        quotes_list.append({"quote": text, "author": author})
    
    return quotes_list

In [50]:
def scrape_multiple_pages(base_url, total_pages):
    """
    Scrapes quotes from multiple pages of the website.
    
    Args:
        base_url (str): The base URL for the website.
        total_pages (int): The number of pages to scrape.
        
    Returns:
        all_quotes (list): A list of all quotes across multiple pages.
    """
    all_quotes = []
    
    # Loop through each page number
    for page_num in range(1, total_pages + 1):
        # Construct the URL for the current page
        page_url = f"{base_url}/page/{page_num}/"
        print(f"Scraping page {page_num}: {page_url}")
        
        # Scrape the quotes from the current page
        quotes = scrape_quotes_from_page(page_url)
        
        # If no quotes are found, break the loop (this assumes that an empty page means the end of content)
        if not quotes:
            print(f"No quotes found on page {page_num}. Stopping.")
            break
        
        # Add the scraped quotes to the list
        all_quotes.extend(quotes)
        
        # Add a delay to avoid overwhelming the server
        time.sleep(2)
    
    return all_quotes
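
When the number of pages is not known in advance, the same "stop at the first empty page" rule can drive the loop on its own. The following is a sketch under stated assumptions: `scrape_until_empty`, `scrape_page`, `max_pages`, `delay`, and `sleep` are hypothetical names, and `scrape_page` stands for any callable that takes a page URL and returns a list, such as `scrape_quotes_from_page`.

```python
import time


def scrape_until_empty(base_url, scrape_page, max_pages=100, delay=2.0, sleep=time.sleep):
    """Scrape page 1, 2, 3, ... until a page yields no items.

    `max_pages` is a safety cap so the loop always terminates.
    """
    all_items = []
    for page_num in range(1, max_pages + 1):
        items = scrape_page(f"{base_url}/page/{page_num}/")
        if not items:
            break  # an empty page is taken to mean the end of the content
        all_items.extend(items)
        sleep(delay)  # pause between requests to avoid overwhelming the server
    return all_items
```

Calling `scrape_until_empty(base_url, scrape_quotes_from_page)` would then collect quotes without a hard-coded page count, at the cost of one extra request for the first empty page.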

In [51]:
base_url = "https://quotes.toscrape.com"
total_pages_to_scrape = 10

In [52]:
# Run the scraping process
all_quotes = scrape_multiple_pages(base_url, total_pages_to_scrape)

Scraping page 1: https://quotes.toscrape.com/page/1/
Scraping page 2: https://quotes.toscrape.com/page/2/
Scraping page 3: https://quotes.toscrape.com/page/3/
Scraping page 4: https://quotes.toscrape.com/page/4/
Scraping page 5: https://quotes.toscrape.com/page/5/
Scraping page 6: https://quotes.toscrape.com/page/6/
Scraping page 7: https://quotes.toscrape.com/page/7/
Scraping page 8: https://quotes.toscrape.com/page/8/
Scraping page 9: https://quotes.toscrape.com/page/9/
Scraping page 10: https://quotes.toscrape.com/page/10/


In [53]:
# Print the scraped quotes
for i, quote in enumerate(all_quotes, start=1):
    print(f"{i}. {quote['quote']} - {quote['author']}")

1. “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” - Albert Einstein
2. “It is our choices, Harry, that show what we truly are, far more than our abilities.” - J.K. Rowling
3. “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” - Albert Einstein
4. “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.” - Jane Austen
5. “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.” - Marilyn Monroe
6. “Try not to become a man of success. Rather become a man of value.” - Albert Einstein
7. “It is better to be hated for what you are than to be loved for what you are not.” - André Gide
8. “I have not failed. I've just found 10,000 ways that won't work.” - Thomas A. Edison
9. “A woman is like a tea bag; you never know how strong it is until it's in

### Extracting Data from Dynamic Content (JavaScript-rendered Pages) using Selenium

Many modern websites load their content dynamically using JavaScript. This means that the data might not be available in the raw HTML that you get through a normal HTTP request. To handle these cases, we can use **Selenium**, which simulates how a real user interacts with a browser.

We will:
- Use Selenium to load a JavaScript-rendered page.
- Extract data that only appears after the page is fully loaded by the browser.

#### What is Selenium?
**Selenium** is a browser automation tool. It allows you to open a real browser (like Chrome or Firefox), load a website, and interact with the webpage—just like a human would. Selenium is useful when dealing with websites that load data dynamically via JavaScript, which normal web scraping tools (like `requests` or `BeautifulSoup`) can't easily access.

#### What is a WebDriver?
A **WebDriver** is a browser-specific driver that allows Selenium to control a browser. For example:
- If you use **Chrome**, you’ll need **ChromeDriver**.
- If you use **Firefox**, you’ll need **GeckoDriver**.

The WebDriver makes it possible for Selenium to open a browser, interact with the page, and extract content that may be dynamically rendered by JavaScript.

In [7]:
# Installing Required Libraries and WebDriver
!pip install selenium
!pip install chromedriver-autoinstaller



In [8]:
import sys
# Inserts a chromedriver location (a Linux path) at the beginning of Python's module search path
# Note: sys.path only affects Python imports, not where executables are found;
# chromedriver_autoinstaller.install() below takes care of locating ChromeDriver
sys.path.insert(0, '/usr/lib/chromium-browser/chromedriver')

import time  # Import the time module for adding delays
import pandas as pd  # Import pandas for handling tabular data (dataframes)
from bs4 import BeautifulSoup  # Import BeautifulSoup for parsing HTML content
from selenium import webdriver  # Import webdriver from Selenium for automating the browser
import chromedriver_autoinstaller  # Import chromedriver_autoinstaller to automatically install ChromeDriver

# Automatically install the appropriate version of ChromeDriver that matches your installed version of Chrome.
# This removes the need for manually downloading or specifying the path to ChromeDriver.
chromedriver_autoinstaller.install()

'/Users/tanyakhanna/mambaforge/envs/myenv/lib/python3.12/site-packages/chromedriver_autoinstaller/129/chromedriver'

In [9]:
from selenium.webdriver.chrome.options import Options
# Import the Options class from selenium.webdriver.chrome.options
# This class allows you to set various options for ChromeDriver, like running Chrome in headless mode, enabling logs, or setting browser preferences.

# Create a new instance of the Chrome Options class
chrome_options = Options()

# Add an argument that reduces Chrome's console logging
# "--log-level=3" restricts the log output to fatal errors only
# Chrome's log levels are:
# 0 = INFO (most verbose),
# 1 = WARNING,
# 2 = ERROR,
# 3 = FATAL (only fatal errors).
chrome_options.add_argument("--log-level=3")

# Create a new instance of the Chrome driver with the options
# Pass the chrome_options object to the Chrome WebDriver instance.
# The WebDriver will launch Chrome with the options specified (in this case, minimal logging).
driver = webdriver.Chrome(options=chrome_options)

In [10]:
# Define the URL for SpotCrime's Rutgers University daily archive page
url = "https://spotcrime.com/NJ/Rutgers%20University/daily-archive"

# Instruct the Selenium WebDriver to navigate to the specified URL
# This command will open the page in the Chrome browser that is controlled by Selenium
driver.get(url)

In [11]:
# Pause the execution of the script for 5 seconds
# This is useful to ensure that the page has fully loaded or to wait for elements to be available before interacting with them
# In this case, we are waiting for 5 seconds before proceeding with the next commands
time.sleep(5)
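
A fixed pause is either longer than needed or too short on a slow connection. An alternative is to poll until the content actually appears. The sketch below shows that idea with only the standard library; `wait_until` and its parameters are hypothetical names. Selenium also ships its own `WebDriverWait` helper built around the same idea.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without success.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

Since `driver.find_elements(...)` returns an empty list while nothing matches, a call such as `wait_until(lambda: driver.find_elements(By.XPATH, "//a[@class='crime-records__daily-blotter-option']"))` would return as soon as the links are present instead of always waiting the full 5 seconds.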

In [12]:
from selenium.webdriver.common.by import By  # Import the By module

# Find all the date elements on the page using an XPATH that targets the anchor tags with class 'crime-records__daily-blotter-option'
date_elements = driver.find_elements(By.XPATH, "//a[@class='crime-records__daily-blotter-option']")

# Loop through each date element that was found
for date_element in date_elements:
    
    # Extract the 'href' attribute of the anchor tag, which contains the link for that specific date's crime data
    date_link = date_element.get_attribute('href')
    
    # The last part of the URL (after the last '/') contains the date in 'YYYY-MM-DD' format
    # Replace the '-' with '/' to convert the date into 'YYYY/MM/DD' format
    date = date_link.split('/')[-1].replace('-', '/')
    
    # Find the 'span' tag within the current anchor tag element, which contains the crime count
    crime_count_element = date_element.find_element(By.TAG_NAME, 'span')
    
    # Extract the text inside the 'span' tag (which looks like '(XX crimes)')
    # Split the text using '(' and take the part after '(',
    # then remove the 'crimes)' part and strip whitespace to isolate the number
    crime_count = crime_count_element.text.split('(')[-1].replace('crimes)', '').strip()
    
    # Print the date and the crime count in the format: 'YYYY/MM/DD: X crimes'
    print(f"{date}: {crime_count} crimes")


2024/08/28: 2 crimes
2024/08/27: 2 crimes
2024/08/25: 3 crimes
2024/08/21: 3 crimes
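
The string handling in the loop can be pulled into a small pure function, which makes it easy to test without a browser. This is a sketch under the same assumptions as the loop above (the link ends in a `YYYY-MM-DD` segment and the span text contains the count); `parse_blotter_entry` is a hypothetical name, and the sample inputs are made up.

```python
import re


def parse_blotter_entry(href, span_text):
    """Turn one archive link into a (date, crime_count) pair.

    Returns None for the count if no number is found in `span_text`.
    """
    # Last URL segment 'YYYY-MM-DD' becomes 'YYYY/MM/DD'
    date = href.rstrip('/').split('/')[-1].replace('-', '/')
    # Take the first run of digits as the crime count
    match = re.search(r'\d+', span_text)
    crime_count = int(match.group()) if match else None
    return date, crime_count


# Hypothetical inputs, shaped like the values used in the loop above
print(parse_blotter_entry("https://example.com/archive/2024-08-28", "(2 crimes)"))
# ('2024/08/28', 2)
```

Returning the count as an integer (or `None`) rather than a string also makes the values ready for later numeric work, for example in a pandas DataFrame.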


In [13]:
driver.quit()

### Introduction to APIs as an Alternative to Scraping

#### What is an API?
An API (Application Programming Interface) is a set of rules and protocols that allow one software application to communicate with another. In the context of data retrieval, APIs provide a structured way for developers to access data from a service (like Twitter, Reddit, or OpenWeather) without having to manually scrape web pages. APIs typically return data in structured formats such as JSON or XML.

#### Examples of Services That Provide APIs:
- **Twitter API**: Allows developers to access tweets, trends, user information, and more.
- **OpenWeather API**: Provides weather data like current conditions, forecasts, and historical data.
- **Reddit API**: Enables access to Reddit posts, comments, user profiles, and more.

### Differences Between API and Web Scraping

| Feature                    | API                                                      | Web Scraping                                            |
|----------------------------|----------------------------------------------------------|---------------------------------------------------------|
| **Data Access**             | Direct, structured access to data provided by the service | Extracting data from the visible HTML structure of a page|
| **Data Format**             | Structured (e.g., JSON, XML)                             | Unstructured (HTML), needs parsing                      |
| **Speed**                   | Fast, as it returns only the data needed                 | Slower, as it fetches the entire webpage                |
| **Legal Concerns**          | No legal concerns if usage follows the API’s terms       | Legal issues may arise depending on the website's ToS    |
| **Rate Limiting**           | Often strict rate limits imposed by the service          | Limited by server response times or anti-scraping measures|
| **Handling Dynamic Content**| Returns dynamic data directly                            | Requires handling JavaScript-rendered content            |
| **Complexity**              | Easier and cleaner to implement                          | More complex; requires parsing HTML and handling dynamic content |
| **Cost**                    | Often has usage fees after a certain number of requests  | Usually free unless blocked by the server               |

### Benefits of API

- **Faster Access**: APIs provide direct access to the data you need without the overhead of downloading and parsing entire web pages.
- **Structured Data**: APIs return data in well-structured formats like JSON or XML, making it easier to work with and integrate into applications.
- **No Legal Concerns**: If you follow the terms of service of an API, there are usually no legal concerns about accessing the data. Unlike scraping, which can sometimes violate a site's terms of service, APIs are designed to be used for accessing data.
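
To illustrate the structured-data point: a JSON response can be turned into Python dictionaries and lists with a single call, with no HTML parsing involved. The payload below is a made-up example, not the output of any real API; real services define their own field names.

```python
import json

# Hypothetical API response body (field names are invented for illustration)
payload = (
    '{"city": "Moscow", "forecast": ['
    '{"time": "09:00", "temp_c": 21.8}, '
    '{"time": "12:00", "temp_c": 21.82}]}'
)

# One call converts the text into nested dicts and lists
data = json.loads(payload)

# Fields are then reached by key instead of by searching for HTML tags
temperatures = [entry["temp_c"] for entry in data["forecast"]]
print(data["city"], max(temperatures))
```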

---

### Why Use APIs Instead of Scraping?

APIs solve several problems that are common in web scraping:
- **Efficiency**: APIs provide structured data without the need to parse HTML, making them much faster and easier to work with.
- **Dynamic Content**: APIs provide access to dynamically generated content directly, whereas scraping requires extra effort to handle JavaScript or AJAX-based content.
- **Rate Limiting and Blocking**: Scrapers can often be blocked by websites or face rate limits. APIs provide clear documentation on limits and usage fees.

---

### When to Choose APIs Over Web Scraping?

| Situation                      | Choose API                                        | Choose Web Scraping                                  |
|---------------------------------|--------------------------------------------------|-----------------------------------------------------|
| **Data is structured and available via an API** | Always choose the API for better speed and structure | Use scraping only if API does not exist or is restricted |
| **You need dynamic data**       | API provides it in real-time                     | Scraping may require handling dynamic content manually |
| **You are working on a large-scale project** | APIs handle large amounts of data with rate limits | Scraping can be blocked or throttled on large-scale use |
| **Cost/Fees**                   | APIs often have cost or rate limits              | Web scraping is usually free, but slower and riskier |
| **Legal Issues**                | APIs are legal if used correctly                 | Web scraping may violate a website's terms of service |

---

Here’s an example of how to fetch weather data using the OpenWeather API. We'll use Python's `requests` library to make an API call and get the 5-day forecast, in 3-hour intervals, for a city.

In [45]:
import requests

# API endpoint for 5-day weather forecast
api_url = "https://api.openweathermap.org/data/2.5/forecast"

# Parameters for the API request
params = {
    "id": 524901,  # City ID for Moscow. You can change this to any other city ID.
    "appid": "YOUR_API_KEY",  # Replace with your own API key; never publish a real key
    "units": "metric"  # Get temperature in Celsius
}

# Make the API request
response = requests.get(api_url, params=params)

# Check if the request was successful (status code 200 means success)
if response.status_code == 200:
    # Parse the response as JSON
    data = response.json()

    # Loop through each forecast entry
    for forecast in data["list"]:
        forecast_time = forecast["dt_txt"]
        temperature = forecast["main"]["temp"]
        weather_description = forecast["weather"][0]["description"]

        # Print the forecast for each 3-hour interval
        print(f"Forecast for {forecast_time}:")
        print(f"Temperature: {temperature}°C")
        print(f"Weather: {weather_description}")
        print("-" * 30)  # Separator for readability

else:
    # If the request failed, print the status code
    print(f"Failed to retrieve data. Status code: {response.status_code}")

Forecast for 2024-09-19 09:00:00:
Temperature: 21.8°C
Weather: overcast clouds
------------------------------
Forecast for 2024-09-19 12:00:00:
Temperature: 21.82°C
Weather: overcast clouds
------------------------------
Forecast for 2024-09-19 15:00:00:
Temperature: 20.39°C
Weather: overcast clouds
------------------------------
Forecast for 2024-09-19 18:00:00:
Temperature: 16.89°C
Weather: overcast clouds
------------------------------
Forecast for 2024-09-19 21:00:00:
Temperature: 14.89°C
Weather: broken clouds
------------------------------
Forecast for 2024-09-20 00:00:00:
Temperature: 13.85°C
Weather: broken clouds
------------------------------
Forecast for 2024-09-20 03:00:00:
Temperature: 13.17°C
Weather: few clouds
------------------------------
Forecast for 2024-09-20 06:00:00:
Temperature: 15.68°C
Weather: scattered clouds
------------------------------
Forecast for 2024-09-20 09:00:00:
Temperature: 19.77°C
Weather: broken clouds
------------------------------
Forecast for