## <span style="color:#3366ff; font-weight:bold;">The Art of Data Scraping - Thu, Jan 2, 2025</span>
### Instructor: <span style="font-weight:normal;">Sudip Parajuli</span>


### GitHub Repo: https://github.com/CS50xNepalOfficial/DataScraping

## **Module 1: Introduction to Web Scraping**


## Introduction to Data Scraping
- 📊 **Data Scraping**: The process of automatically extracting information from websites, enabling users to gather large amounts of data efficiently.

## What is Web Scraping?
- 🕸️ **Web Scraping**: A specific type of data scraping focused on extracting data from web pages, transforming unstructured data into structured formats.

## Why is Web Scraping Valuable?
- 🌟 It provides access to vast amounts of information that can be used for analysis, research, and decision-making.

## Real World Applications of Web Scraping
- 🏛️ **Academic research**: Gathering data for scholarly studies and literature reviews.
- 📈 **Business insights**: Analyzing market trends and consumer behavior.
- 🛒 **Price monitoring**: Tracking product prices across different retailers to find the best deals.
- 🔍 **Competitive analysis**: Monitoring competitors' offerings and strategies.
- 🗞️ **News aggregation**: Collecting and summarizing news articles from various sources.


## **Module 2: HTML Basics and Inspecting Web Pages**



In [None]:
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>HTML Basics and Inspecting Web Pages</title>
</head>
<body>
    <h1>Welcome to HTML Basics!</h1>
    <p>HTML (HyperText Markup Language) is the standard language for creating web pages. It provides the structure for content on the web.</p>
    
    <h2>Basic HTML Structure</h2>
    <p>HTML documents consist of elements. Each element begins with an opening tag and ends with a closing tag.</p>
    <p>&lt;element&gt;Content&lt;/element&gt;</p>
    
    <h2>Inspecting Web Pages</h2>
    <p>Inspecting web pages allows us to view and understand the structure, styles, and content of a webpage using browser developer tools.</p>
    <p>To inspect a web page:</p>
    <ol>
        <li>Right-click on an element.</li>
        <li>Select "Inspect" or "Inspect Element".</li>
        <li>The developer tools panel will open, displaying the HTML and CSS of the element.</li>
    </ol>
    
    <h2>Example Element</h2>
    <p>This is an example paragraph. You can inspect this paragraph to see its HTML structure.</p>
    
    <h2>Example Table</h2>
    <table border="1">
        <caption>Sample Table</caption>
        <tr>
            <th>Header 1</th>
            <th>Header 2</th>
            <th>Header 3</th>
        </tr>
        <tr>
            <td>Row 1, Cell 1</td>
            <td>Row 1, Cell 2</td>
            <td>Row 1, Cell 3</td>
        </tr>
        <tr>
            <td>Row 2, Cell 1</td>
            <td>Row 2, Cell 2</td>
            <td>Row 2, Cell 3</td>
        </tr>
    </table>
    
    <footer>
        <p>You can put your footer contents here!</p>
    </footer>
</body>
</html>


### **<span style="color: lightgray;">Output: This is what the rendered page should look like</span>**

![Rendered Image](page.png)


### **Intro to HTML Tags and Attributes**

HTML (Hypertext Markup Language) utilizes tags and attributes to structure and define content within a web page. Tags are used to mark the beginning and end of elements, while attributes provide additional information about the elements.

**HTML Tags**

Tags are enclosed in angle brackets <>, and most come in pairs—an opening tag and a closing tag.


**Example:**


Paragraph Tag
- Defines a paragraph of text.
```html
<p>This is a paragraph. It has an opening &lt;p&gt; tag and a closing &lt;/p&gt; tag.</p>
```

Heading Tags
- `<h1>` to `<h6>` define headings of varying sizes, where `<h1>` is the largest and `<h6>` is the smallest.

Anchor Tag
- Creates hyperlinks to other web pages or resources.
```html
<a href="https://example.com">Visit our website</a>
```

Image
- Embeds images into a web page.
```html
<img src="image.jpg" alt="Description of the image">
```

Unordered List Tag
- Creates an unordered (bulleted) list.
- For an ordered (numbered) list, replace `ul` with `ol`.
```html
<ul>
    <li>Item 1</li>
    <li>Item 2</li>
</ul>
```

Table Tag
- Creates a table structure with rows and columns.

```html
<table>
    <tr>
        <th>Header 1</th>
        <th>Header 2</th>
        <th>Header 3</th>
    </tr>
    <tr>
        <td>Row 1, Cell 1</td>
        <td>Row 1, Cell 2</td>
        <td>Row 1, Cell 3</td>
    </tr>
    <tr>
        <td>Row 2, Cell 1</td>
        <td>Row 2, Cell 2</td>
        <td>Row 2, Cell 3</td>
    </tr>
</table>
```
Form Tag
- Defines a form for user input with input fields and a submit button.
```html
<form action="/submit-form" method="post">
    <label for="username">Username:</label>
    <input type="text" id="username" name="username"><br><br>
    <label for="password">Password:</label>
    <input type="password" id="password" name="password"><br><br>
    <input type="submit" value="Submit">
</form>
```

Div Tag
- Creates a division or a container that can be styled using CSS.
```html
<div>
    <!-- Content to be enclosed within the div -->
</div>
```




### **Inspecting Web Pages**

Inspecting web pages allows us to view and understand the structure, styles, and content of a webpage using browser developer tools.

**To inspect a web page:**
1. Right-click on an element.
2. Select "Inspect" or "Inspect Element".
3. The developer tools panel will open, displaying the HTML and CSS of the element.


## **Module 3: Extracting Data with CSS Selectors**

### **Overview of CSS selectors for targeting elements**

CSS selectors are vital tools in web scraping, allowing precise targeting and extraction of data from HTML documents. Here's a concise breakdown:

1. **Understanding Selectors**: CSS selectors identify elements based on attributes, structure, and relationships. For instance:
   - `p` targets all paragraphs.
   - `.class` selects elements with a specific class.
   - `#id` targets elements by their unique ID.
   - `a[href="https://example.com"]` selects links with the specified URL.

2. **Types of Selectors**: They include:
   - **Element**: Selects based on tag names.
   - **Class**: Identifies elements by class names.
   - **ID**: Targets elements with unique identifiers.
   - **Attribute**: Selects elements by specific attributes.
   - **Combination**: Uses multiple selectors for precision.

3. **Application in Web Scraping**: CSS selectors aid in pinpointing and extracting desired data efficiently. For example:
   - `div#content` targets a `<div>` element with the ID of 'content'.
   - `.price` selects all elements with the class 'price'.
   - `table td` targets all table cells.
   - `ul > li` selects list items that are direct children of an unordered list.

4. **Challenges**: Complex structures or dynamic elements require adaptable selectors. Examples include:
   - Selecting elements nested within others.
   - Handling dynamically generated content or changing classes.

CSS selectors play a pivotal role in effective data extraction during web scraping, offering accuracy and efficiency.
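The selectors listed above can be tried out in Python with Beautiful Soup's `select` and `select_one` methods. The following is a minimal sketch, not a definitive recipe: the HTML snippet, its class names, and its prices are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this illustration.
html = """
<div id="content">
  <p class="price">Rs. 100</p>
  <p class="price">Rs. 250</p>
  <ul>
    <li><a href="https://example.com">Example</a></li>
    <li><a href="/about">About</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Class selector: every element with the class "price".
prices = [p.get_text() for p in soup.select(".price")]

# Tag name combined with an ID: the <div> whose id is "content".
content_div = soup.select_one("div#content")

# Attribute selector: links whose href is exactly the given URL.
exact_links = soup.select('a[href="https://example.com"]')

# Child combinator: <li> elements that are direct children of a <ul>.
items = soup.select("ul > li")

print(prices)
print(content_div["id"])
print([a.get_text() for a in exact_links])
print(len(items))
```

`select` always returns a list (possibly empty), while `select_one` returns the first match or `None`, so it is worth checking for `None` before indexing into the result on a real page.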


### **Locating and targeting elements using CSS Selectors**

In web scraping, efficiently locating and targeting specific elements within the HTML structure is crucial for accurate data extraction. CSS Selectors are instrumental in this process, offering various techniques:

#### 1. **Locating Elements**

CSS Selectors allow precise identification of elements based on attributes, structure, and relationships in the HTML document:

- **Tag Names**: Target elements by their tag names like `p`, `div`, `a`.
- **Class and ID**: Use `.class` and `#id` selectors to locate elements by their class and unique ID.
- **Attribute Selectors**: Locate elements based on attributes, e.g., `[data-type="header"]`.

#### 2. **Targeting Elements**

After locating elements, further target specific content or data within them using CSS Selectors:

- **Text Content**: Access and extract text content using selectors like `p`, `h1`.
- **Links**: Target hyperlinks with `a` or specific attributes like `a[href^="https://"]`.
- **Images**: Access images using `img` or selectors based on image attributes.

#### 3. **Specificity and Precision**

CSS Selectors offer varying levels of specificity for precise targeting:

- **Combination Selectors**: Combine classes, tags, and attributes for targeted extraction, e.g., `.class > p[data-type="info"]`.
- **Nested Elements**: Select elements nested within others, accessing specific data layers deep within the structure.

#### 4. **Extensive Examples**

```css
/* Examples showcasing CSS Selectors for data extraction */

/* Selecting specific class */
.info {
    /* Styles or extraction */
}

/* Targeting elements by ID */
#main-content {
    /* Styles or extraction */
}

/* Accessing nested elements */
.container > div:nth-child(2) > ul > li a {
    /* Styles or extraction */
}
```

### **Extracting data from HTML elements (text, attributes, etc.)**

Web scraping involves precise extraction of data from HTML elements. Here's a detailed breakdown:

#### 1. **Text Content Extraction**

- **Using Element Tags**: Extract text within specific elements using their tags:

  ```html
  <p>This is a paragraph text.</p>
  <h1>Heading text</h1>
  ```
- **Class or ID Targeting**: Access text within specific classes or IDs:

  ```html
  <p class="description">Product description text.</p>
  <div id="main-content">Main content text.</div>
  ```
    
#### 2. **Attributes Extraction**

- **Hyperlinks (href)**: Extract URLs from anchor tags:

  ```html
  <a href="https://example.com">Visit our website</a>
  ```
- **Image URLs (src)**: Access image URLs from image tags:

  ```html
  <img src="image.jpg" alt="Image description">
  ```
    
#### 3. **Attribute Values**

- **Extracting Specific Attributes**: Retrieve values of specific attributes:

  ```html
  <div data-category="tech">Technology category</div>
  <img src="logo.png" alt="Company Logo" title="Company XYZ">
  ```

#### 4. **Combining Selectors for Targeted Extraction**

- **Nested Elements**: Combine selectors for extraction within nested structures:

  ```html
  <div class="parent">
      <div class="child">Nested content</div>
  </div>
  ```
    
#### 5. **Advanced Techniques**

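- **Regular Expressions and Conditional Extraction**: Filter elements by a condition (for example, only `div` elements with the class `featured`) or apply a regular expression to the extracted text (for example, to pull `20` out of `20% off!`):

  ```html
  <div class="featured">
      <h2>Featured Product</h2>
      <p class="discount">20% off!</p>
  </div>
  ```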
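The HTML snippets in this section can be processed in Python roughly as follows. This is a sketch that assumes Beautiful Soup as the parser. The combined HTML string is assembled from the snippets above purely for illustration, and the regular expression is one possible pattern rather than the only way to read the discount.

```python
import re
from bs4 import BeautifulSoup

# Illustrative HTML assembled from the snippets in this section.
html = """
<div class="featured">
    <h2>Featured Product</h2>
    <p class="discount">20% off!</p>
    <a href="https://example.com">Visit our website</a>
    <img src="logo.png" alt="Company Logo" title="Company XYZ">
    <div data-category="tech">Technology category</div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. Text content: the heading nested directly inside div.featured.
heading = soup.select_one("div.featured > h2").get_text(strip=True)

# 2. Attributes: index a tag like a dictionary, or use .get() to
#    receive None instead of a KeyError when the attribute is missing.
link_url = soup.find("a")["href"]
image_src = soup.find("img").get("src")

# 3. Attribute values: find the first <div> that has a data-category attribute.
category = soup.find("div", attrs={"data-category": True})["data-category"]

# 4. Regular expression: pull the number out of the discount text.
discount_text = soup.select_one("p.discount").get_text()
match = re.search(r"(\d+)%", discount_text)
discount = int(match.group(1)) if match else None

print(heading, link_url, image_src, category, discount)
```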
 

## **Module 4: Handling Dynamic Web Content**

### **Understanding dynamic web content**

Dynamic web content refers to elements on a webpage that change or update without requiring a full page reload. Understanding these dynamics is crucial in web scraping due to the complexity introduced by their ever-changing nature. Here's an in-depth exploration:

#### 1. **Characteristics of Dynamic Web Pages**

- **JavaScript Interactions**: Elements or content updated via JavaScript interactions or events.
- **Asynchronous Loading**: Content loaded independently, often asynchronously without a page refresh.
- **DOM Manipulation**: Alteration of the Document Object Model (DOM) structure based on user actions or server responses.

#### 2. **Behavioral Aspects**

- **Real-time Updates**: Data or content that refreshes automatically or periodically.
- **Interactive Elements**: Features that respond to user inputs or triggers.

#### 3. **Rendering Mechanisms**

- **Client-Side Rendering**: Content rendered on the client-side using JavaScript frameworks like React, Angular, or Vue.js.
- **Server-Side Rendering**: Pages generated dynamically on the server and sent to the client.

#### 4. **Impact on Web Scraping**

- **Complexity in Data Retrieval**: Challenges in accessing dynamically loaded content through traditional scraping methods.
- **Need for Adaptability**: Requires adaptable scraping techniques to capture data rendered after page load.
- **Handling Asynchronous Requests**: Dealing with content loaded via AJAX or other asynchronous mechanisms.

#### 5. **Tools and Strategies for Understanding Dynamics**

- **Developer Tools**: Leveraging browser developer tools to inspect dynamic elements and interactions.
- **Network Analysis**: Understanding network requests to identify dynamically loaded data.
- **Testing Environments**: Using testing frameworks to simulate dynamic content for scraping exploration.

Understanding the behavior and mechanisms of dynamic web content is fundamental in effectively extracting data during web scraping. Mastery of these dynamics empowers scraping strategies to adapt and capture real-time or constantly changing information from modern web applications.

Example of a page with dynamically loaded content: https://today.yougov.com/ratings/entertainment/fame/people/all


### **Dealing with JavaScript-driven Websites**

JavaScript-driven websites present unique challenges in web scraping due to their reliance on client-side scripting for content rendering and interactions. Effectively scraping data from such sites involves understanding and addressing these complexities:

#### 1. **JavaScript Rendered Content**

- **Dynamic Element Creation**: Elements generated dynamically via JavaScript after page load.
- **Delayed Data Loading**: Content loaded asynchronously after initial page load events.

#### 2. **Handling Asynchronous Requests**

- **AJAX Calls**: Data loaded through asynchronous requests, requiring specialized handling.
- **Wait Time Consideration**: Dealing with delays as content loads progressively.

#### 3. **Approaches for Scraping JavaScript-Driven Sites**

- **Headless Browsers**: Leveraging tools like Puppeteer or Selenium to mimic browser behavior and access dynamically rendered content.
- **JavaScript Execution**: Executing JavaScript code within scraping scripts to access dynamically generated elements.
- **API Interactions**: Utilizing exposed APIs or endpoints to retrieve data directly.

#### 4. **Challenges and Solutions**

- **Dynamic Element Identification**: Difficulty in targeting elements loaded dynamically, requiring adaptive selectors or techniques.
- **Handling JavaScript Events**: Capturing data triggered by user interactions or JavaScript events.

#### 5. **Adaptation and Tool Selection**

- **Adapting Scraping Strategies**: Modifying scraping strategies to handle dynamic content updates or DOM manipulations.
- **Tool Selection**: Choosing appropriate scraping libraries or frameworks that support JavaScript rendering and dynamic content extraction.

Dealing with JavaScript-driven websites demands specialized approaches and tools to effectively scrape data. Mastering these techniques enables access to the rich and interactive content offered by modern web applications.


### **Introduction to headless browsers and their role in web scraping**

Headless browsers simulate the behavior of a regular web browser without a graphical user interface (GUI). They play a vital role in web scraping by providing a programmatic way to interact with and extract data from web pages. Here's an overview:

#### 1. **What are Headless Browsers?**

- **GUI-Less Browsing**: Operate as browsers without a visual interface, running in the background.
- **Automated Web Interaction**: Enable automated browsing, navigation, and content retrieval.

#### 2. **Role in Web Scraping**

- **Automated Interaction**: Allows scripting interactions with web pages, enabling data extraction without human intervention.
- **Rendering and JavaScript Support**: Offers rendering capabilities and supports JavaScript execution for accessing dynamically rendered content.

#### 3. **Popular Headless Browser Tools**

- **Puppeteer**: Developed by Google, provides a high-level API to control headless Chrome or Chromium browsers.
- **Selenium WebDriver**: Supports headless mode across various browsers, allowing automation and scraping tasks.

#### 4. **Advantages in Web Scraping**

- **JavaScript Execution**: Executes JavaScript and renders dynamic content, crucial for scraping modern web applications.
- **Automation Capabilities**: Enables scripted interactions like form submissions, clicking elements, and navigating through pages.
- **Efficiency and Scalability**: Automates scraping tasks, enhancing efficiency and scalability in data extraction processes.

#### 5. **Use Cases in Web Scraping**

- **Data Collection**: Extracting structured data from websites with dynamic content or complex interactions.
- **Testing and Automation**: Automating website testing, user interaction simulations, and regression testing.

#### 6. **Role in Handling Dynamic Content**

- **Addressing JavaScript Rendered Content**: Essential for scraping data from websites relying heavily on client-side scripting.

Headless browsers serve as powerful tools in web scraping, offering the capability to automate interactions, access dynamic content, and retrieve data from modern and dynamic web pages without human intervention.
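As a rough illustration of the role described above, the sketch below loads a page in headless Chrome with Selenium and waits until a chosen element has been rendered before returning the HTML. It assumes Selenium 4 and a locally installed Chrome. The function name `fetch_rendered_html` and its parameters are hypothetical, and the `--headless=new` flag assumes a recent Chrome version.

```python
def fetch_rendered_html(url, css_selector, timeout=10):
    """Load `url` in headless Chrome and return the HTML once `css_selector` appears."""
    # Selenium is imported inside the function so that merely defining it
    # does not require Selenium or a browser to be available.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless=new")  # run Chrome without a visible window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Explicit wait: block until JavaScript has inserted the element,
        # or raise TimeoutException after `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

A call such as `fetch_rendered_html("https://example.com", "h1")` would return the rendered HTML as a string, which can then be passed to Beautiful Soup for parsing.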


## **Module 5: Web Scraping Tools and Libraries**

### **Introduction to popular web scraping tools and libraries (e.g., Beautiful Soup, Scrapy, Selenium, MechanicalSoup)**

Web scraping tools and libraries simplify the process of extracting data from websites, offering various functionalities and approaches to facilitate scraping tasks. Here's an overview of some widely-used tools:

#### 1. **Beautiful Soup**

- **Parsing HTML and XML**: Provides a simple API for navigating and searching parsed data from HTML or XML documents.
- **Ease of Use**: Allows quick extraction of data by handling complexities in HTML structure.
- **Integration with Other Libraries**: Often used alongside `requests` for web page retrieval, and works on top of parsers such as `html.parser` or `lxml`.

#### 2. **Scrapy**

- **Full-Fledged Framework**: Offers a complete framework for scraping, handling requests, and managing spiders.
- **Asynchronous Requests**: Supports concurrent requests, enhancing scraping speed and efficiency.
- **Modularity and Extensibility**: Highly customizable and allows integration with middleware and extensions.

#### 3. **Selenium**

- **Browser Automation**: Provides automated testing and web scraping by simulating browser interactions.
- **Dynamic Content Handling**: Capable of handling JavaScript-rendered content and interactions.
- **Cross-Browser Support**: Works with multiple browsers and supports headless modes for automated scraping.

#### 4. **MechanicalSoup**

- **Simplified Interface**: Wrapper around `requests` and `Beautiful Soup`, simplifying interaction with web pages.
- **Form Submission and Navigation**: Streamlines form submission and navigation tasks during scraping.
- **Ease of Use for Simple Tasks**: Ideal for simpler scraping tasks or when combining `requests` and `Beautiful Soup`.

#### 5. **Use Cases and Selection Considerations**

- **Project Complexity**: Selection based on the complexity of scraping tasks and the structure of target websites.
- **Handling Dynamic Content**: Consideration for tools capable of handling JavaScript-rendered content.
- **Community Support and Documentation**: Importance of active communities and comprehensive documentation for support.

#### 6. **Combining Tools for Enhanced Functionality**

- **Integration Potential**: Utilizing combinations of tools for enhanced scraping capabilities, e.g., using Selenium for dynamic content handling alongside Beautiful Soup for parsing.

These tools and libraries serve as valuable resources for web scraping, each offering distinct features, functionalities, and approaches catering to different scraping requirements and complexities.


### **Pros and Cons of different tools**

When selecting a web scraping tool, it's essential to consider the strengths and limitations each tool offers:

#### **Beautiful Soup**

**Pros:**
- Easy to Learn: Simple API for parsing HTML and XML, suitable for beginners.
- Pythonic Approach: Integrates seamlessly with Python for data extraction tasks.
- Parsing Complexities: Handles intricate HTML structures and navigates through nested elements.

**Cons:**
- Limited Fetching Abilities: Requires additional libraries like `requests` for fetching web pages.
- Lacks Automation: Not suitable for handling complex automation or dynamic content.

#### **Scrapy**

**Pros:**
- Full-Fledged Framework: Offers a complete scraping framework with features for managing spiders, pipelines, and middleware.
- Asynchronous Processing: Supports concurrent requests, enhancing scraping speed.
- Scalability: Suitable for large-scale scraping projects due to its modularity and extensibility.

**Cons:**
- Learning Curve: Steeper learning curve compared to simpler libraries due to its comprehensive nature.
- Complexity Overhead: Might be overkill for smaller or less complex scraping tasks.

#### **Selenium**

**Pros:**
- Browser Automation: Simulates browser interactions, suitable for handling JavaScript-rendered content.
- Dynamic Content Handling: Capable of handling dynamic content and interactions.
- Cross-Browser Compatibility: Works across multiple browsers for testing and scraping.

**Cons:**
- Slower Execution: Running a full browser can be slower compared to parsing HTML directly.
- Setup Overhead: Requires browser drivers and setup configurations for different browsers.

#### **MechanicalSoup**

**Pros:**
- Simplified Interface: Combines `requests` and `Beautiful Soup`, streamlining basic scraping tasks.
- Ease of Use: User-friendly for simpler scraping tasks or when combining `requests` and `Beautiful Soup`.

**Cons:**
- Limited Functionality: Not as feature-rich or capable as more comprehensive frameworks like Scrapy.
- Handling Dynamic Content: Limited capability in handling JavaScript-rendered content or complex interactions.

Selecting the right tool depends on the specific requirements of the scraping task, considering factors like project complexity, handling of dynamic content, scalability, and ease of use.


### **Selecting the right tool for your scraping needs**

Choosing the most suitable web scraping tool requires an understanding of project requirements and the capabilities of available tools. Consider the following factors:

#### 1. **Project Complexity**

- **Simple Tasks**: For basic data extraction from static web pages, lightweight libraries like Beautiful Soup or MechanicalSoup suffice.
- **Complex Tasks**: Projects involving large-scale scraping, handling dynamic content, or complex structures may benefit from comprehensive frameworks like Scrapy or Selenium.

#### 2. **Handling Dynamic Content**

- **JavaScript-Heavy Sites**: Websites with heavy reliance on JavaScript-rendered content necessitate tools capable of handling dynamic elements, like Selenium or headless browsers.

#### 3. **Scalability**

- **Large-Scale Scraping**: Projects requiring scalability, concurrent requests, and modular design favor frameworks like Scrapy due to their scalability and extensibility.

#### 4. **Learning Curve**

- **Ease of Use**: Consider the learning curve associated with each tool. Simple tasks might benefit from straightforward libraries like Beautiful Soup or MechanicalSoup, while complex projects might justify the learning curve of Scrapy or Selenium.

#### 5. **Performance and Speed**

- **Execution Speed**: Evaluate the performance impact of each tool. Browser automation tools like Selenium might be slower because they run a full browser, while direct parsers like Beautiful Soup offer faster parsing.

#### 6. **Community Support and Documentation**

- **Active Community**: Tools with an active community and comprehensive documentation provide valuable resources for troubleshooting and support.

#### 7. **Customization and Flexibility**

- **Modularity and Customization**: Frameworks like Scrapy offer high customization and modularity for tailored scraping needs, while simpler libraries might lack this flexibility.

#### 8. **Testing and Experimentation**

- **Trial and Error**: Consider experimenting with different tools to find the best fit for your specific project needs. Conduct small-scale tests to assess compatibility and efficiency.

Selecting the right tool involves aligning project requirements with the strengths and limitations of available scraping tools, aiming for an optimal balance between functionality, ease of use, performance, and scalability.


## **Module 6: Best Practices and Ethical Considerations**

Module 6 focuses on ethical conduct and best practices when engaging in web scraping activities. Understanding and adhering to ethical guidelines are essential in maintaining the integrity of data extraction and respecting the rights and policies of websites.

### **Key Topics Covered:**

#### 1. **Respect for Website Guidelines**

- **Robots.txt Compliance**: Understanding and adhering to the guidelines outlined in the robots.txt file, respecting site-specific rules and directives.
- **Terms of Service and Legal Boundaries**: Awareness of website terms of service and legal boundaries governing data collection activities.
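A robots.txt check can be sketched with Python's standard `urllib.robotparser` module. The robots.txt content and the `my-scraper` user agent below are made up for illustration. For a live site, one would typically call `set_url(...)` and `read()` instead of parsing a string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a given user agent may fetch a given URL.
allowed_public = parser.can_fetch("my-scraper", "https://example.com/products")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data")

# Crawl-delay, if the site declares one (None otherwise).
delay = parser.crawl_delay("my-scraper")

print(allowed_public, allowed_private, delay)
```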

#### 2. **Data Privacy and Security**

- **Sensitive Data Handling**: Exercising caution when scraping and handling personally identifiable or sensitive information.
- **Storage and Security Practices**: Implementing secure storage methods and encryption for scraped data to maintain confidentiality.

#### 3. **Rate Limiting and Politeness**

- **Responsible Scraping Rates**: Implementing rate limiting to prevent overwhelming servers and showing politeness towards server resources.
- **Monitoring and Adjustment**: Strategies for monitoring server responses and adjusting scraping rates accordingly.
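One simple way to implement rate limiting is a fixed pause between consecutive requests. The sketch below assumes a caller-supplied `fetch` function. Both `polite_fetch_all` and `fake_fetch` are hypothetical names, and the one-second default is an arbitrary starting point rather than a recommended value.

```python
import time

def polite_fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results

# Stand-in for a real download function, used only for this demonstration.
def fake_fetch(url):
    return f"<html>{url}</html>"

pages = polite_fetch_all(
    ["https://example.com/1", "https://example.com/2"],
    fake_fetch,
    delay_seconds=0.1,
)
print(len(pages))
```

In a real scraper, `fetch` could wrap `requests.get`, and the delay could be increased when the server starts answering with errors such as HTTP 429 (Too Many Requests).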

#### 4. **Code Quality and Maintenance**

- **Robust and Maintainable Code**: Writing clean, well-documented, and maintainable code for efficient and scalable scraping.
- **Error Handling and Logging**: Implementing robust error handling and logging mechanisms to ensure data integrity and traceability.
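Error handling and logging might look like the sketch below, which wraps `requests.get` so that a failed request is logged and reported as `None` instead of crashing the script. The helper name `safe_get` and the logger name `scraper` are hypothetical.

```python
import logging
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")

def safe_get(url, timeout=5):
    """Return the page text, or None if the request fails for any reason."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return response.text
    except requests.exceptions.RequestException as error:
        # RequestException is the base class for timeouts, connection
        # errors, invalid URLs, and HTTP error statuses.
        logger.warning("Request to %s failed: %s", url, error)
        return None

# An invalid URL fails before any network traffic and is logged.
result = safe_get("not-a-valid-url")
print(result)
```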

### **Importance of Best Practices**

- **Legal Compliance**: Adherence to legal frameworks, copyright laws, and data protection regulations to prevent legal complications.
- **Maintaining Reputability**: Upholding ethical scraping practices fosters trust and credibility within the data science community and prevents backlash from website owners.

### **Ethical Considerations**

- **Transparency**: Being transparent about scraping activities and purposes, avoiding misleading or harmful practices.
- **Respect for Website Resources**: Ensuring scraping activities do not unduly burden website servers or cause disruptions.


## **Module 7: Let's Code**

### Installing Necessary Libraries for this workshop

Note: In a Jupyter Notebook or a similar environment, an exclamation mark (!) before a command runs it as a shell command (as if typed in a terminal or command prompt) instead of as Python code.

In [None]:
!pip install requests
!pip install beautifulsoup4
!pip install selenium

### Starting With Requests

Requests Short Overview
[More in Requests documentation](https://requests.readthedocs.io/en/latest/)

```python
import requests
# Basic GET request
res = requests.get('https://example.com')
# Query parameters: sent as ?key1=value1&key2=value2
payload = {'key1': 'value1', 'key2': 'value2'}
res = requests.get('https://example.com/get', params=payload)
# Custom request headers, such as a User-Agent
headers = {'User-Agent': 'Mozilla/5.0'}
res = requests.get('https://example.com', headers=headers)
# Timeout in seconds, so a slow server cannot block the script forever
res = requests.get('https://example.com', timeout=5)
# HTTP Basic authentication
res = requests.get('https://example.com', auth=('username', 'password'))
# Proxies (placeholder addresses)
proxies = {'http': 'http://proxy.example.com:8080', 'https': 'https://proxy.example.com:8080'}
res = requests.get('https://example.com', proxies=proxies)
```
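To see how `params` ends up in the URL without sending anything over the network, a request can be prepared and inspected first. This is a small sketch using `requests.Request(...).prepare()`, with the same illustrative URL and payload as above.

```python
import requests

payload = {"key1": "value1", "key2": "value2"}

# Build the request object without sending it.
prepared = requests.Request("GET", "https://example.com/get", params=payload).prepare()

print(prepared.method)  # GET
print(prepared.url)     # https://example.com/get?key1=value1&key2=value2
```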



### Scraping Enroz

In [None]:
import requests
res = requests.get('https://www.enroz.com/product-category/smart-phones?sort=popularity&page=1')
res.text

In [5]:
# Write with an explicit encoding so non-ASCII characters are saved correctly.
with open('enroz.html', 'w', encoding='utf-8') as f:
    f.write(res.text)

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'html.parser')
soup.title

In [None]:
h2s= soup.find_all('h2')
h2s

In [None]:
h3s = soup.find_all('h3')
h3s

In [None]:
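import csv

for h3 in h3s:
    print(h3.text)

# Use the csv module so that product names containing commas are quoted correctly.
with open('enroz.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Product Name'])  # only names are collected at this step
    for h3 in h3s:
        writer.writerow([h3.text.strip()])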

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.text, 'html.parser')

row = soup.find('div', {'class': 'row'})

def parse_product_card(html):
    # Search inside the element passed in, not the global soup.
    product_div = html.find('div', class_='p-2')
    if not product_div:
        print("Product div not found")
        return None
        
    try:
        name_elem = product_div.find('h3')
        name = name_elem.find('a').text if name_elem and name_elem.find('a') else ''
        url = name_elem.find('a')['href'] if name_elem and name_elem.find('a') else ''
        
        img_area = product_div.find('div', class_='img-area')
        image_url = img_area.find('img')['src'] if img_area and img_area.find('img') else ''
        
        price_div = product_div.find('div', class_='product-price')
        price = price_div.text.strip() if price_div else ''
        
        rate_ul = product_div.find('ul', class_='rate')
        rating = len(rate_ul.find_all('i', class_='far fa-star')) if rate_ul else 0
        
    
        
        return {
            'name': name,
            'url': url,
            'image_url': image_url,
            'price': price,
            'rating': rating,
        }
        
    except Exception as e:
        print(f"Error parsing product card: {e}")
        return None


In [None]:
parse_product_card(row)

### Scraping the Books to Scrape Website

In [13]:
from bs4 import BeautifulSoup
import requests


In [None]:
res = requests.get("https://books.toscrape.com/")
res

In [None]:
soup = BeautifulSoup(res.content, "html.parser")
soup

In [None]:
ols = soup.find('ol', class_="row")
ols

In [53]:
anchors = ols.find_all('a')
anchors

[<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>,
 <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>,
 <a href="catalogue/tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thumbnail" src="media/cache/26/0c/260c6ae16bce31c8f8c95daddd9f4a1c.jpg"/></a>,
 <a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a>,
 <a href="catalogue/soumission_998/index.html"><img alt="Soumission" class="thumbnail" src="media/cache/3e/ef/3eef99c9d9adef34639f510662022830.jpg"/></a>,
 <a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a>,
 <a href="catalogue/sharp-objects_997/index.html"><img alt="Sharp Objects" class="thumbnail" src="media/cache/32/51/3251cf3a3412f53f339e42cac2134093.jpg"/></a>,
 <a href="catalogue/sharp-objects_997/

In [None]:
for i in anchors:
    print(i['href'])
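The printed `href` values are relative paths, and each book appears twice (once for the thumbnail link and once for the title link). A possible way to turn them into unique absolute URLs is sketched below with the standard library's `urljoin`. The `hrefs` list is a hand-copied sample of the output above, not a live result.

```python
from urllib.parse import urljoin

base_url = "https://books.toscrape.com/"

# Sample of the relative hrefs printed above; each book appears twice.
hrefs = [
    "catalogue/a-light-in-the-attic_1000/index.html",
    "catalogue/a-light-in-the-attic_1000/index.html",
    "catalogue/tipping-the-velvet_999/index.html",
    "catalogue/tipping-the-velvet_999/index.html",
]

# dict.fromkeys keeps the first occurrence of each href and preserves order.
unique_hrefs = list(dict.fromkeys(hrefs))

# urljoin resolves each relative path against the base URL.
book_urls = [urljoin(base_url, href) for href in unique_hrefs]

for url in book_urls:
    print(url)
```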

In [None]:
home_link="https://books.toscrape.com/"
newres = requests.get(home_link+"catalogue/a-light-in-the-attic_1000/index.html")
newres

In [None]:
soup_ind = BeautifulSoup(newres.content, "html.parser")
soup_ind

#### Getting Book Details

In [None]:
book_name= soup_ind.title.get_text()
book_name

In [None]:
book_name = book_name.strip()

In [None]:
book_name.split("|")[0].replace("\n", "").strip()

In [None]:
product_main_div = soup_ind.find('div', class_="product_main")
h1 = product_main_div.h1
h1_text = h1.get_text()
h1_text

In [None]:
soup_ind.find('p', class_="price_color").get_text()

In [None]:
price_p = product_main_div.find('p', class_="price_color")
price = price_p.get_text()
price
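The price comes back as text with a currency symbol, so it usually needs converting before any arithmetic. The helper below is a sketch: `parse_price` is a hypothetical name, and the sample string `£51.77` is an assumed example of what the price text looks like.

```python
import re

def parse_price(price_text):
    """Return the numeric part of a price string as a float, or None if absent."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    return float(match.group()) if match else None

print(parse_price("£51.77"))
print(parse_price("n/a"))
```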

In [39]:
rating = soup_ind.find('p', class_="star-rating")

In [41]:
rating["class"][1]

'Three'

In [42]:
star_rating = product_main_div.find('p', class_="star-rating")
star_rating_class = star_rating["class"]
rating = star_rating_class[-1]
rating


'Three'
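Because the rating is stored as a word in the class name, a small lookup table can convert it to a number for sorting or averaging. This is a sketch that assumes the site uses the words `One` to `Five`. The names `RATING_WORDS` and `rating_to_number` are hypothetical.

```python
# Assumed mapping from the class-name word to a numeric rating.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def rating_to_number(rating_word):
    """Return the numeric rating, or None for an unrecognised word."""
    return RATING_WORDS.get(rating_word)

print(rating_to_number("Three"))
```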

In [52]:
description_ind= soup_ind.find('div', id='product_description').next_sibling.next_sibling
description_ind.get_text()


"It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounde

In [44]:
description = soup_ind.find('div', class_="sub-header").next_sibling.next_sibling.get_text().strip()
description

"It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love th It's hard to imagine a world without A Light in the Attic. This now-classic collection of poetry and drawings from Shel Silverstein celebrates its 20th anniversary with this special edition. Silverstein's humorous and creative verse can amuse the dowdiest of readers. Lemon-faced adults and fidgety kids sit still and read these rhythmic words and laugh and smile and love that Silverstein. Need proof of his genius? RockabyeRockabye baby, in the treetopDon't you know a treetopIs no safe place to rock?And who put you up there,And your cradle, too?Baby, I think someone down here'sGot it in for you. Shel, you never sounde
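
Why `.next_sibling` is needed twice: the whitespace between two tags is itself a text node, so the first `.next_sibling` lands on a newline rather than on the `<p>`. The sketch below uses a small made-up HTML snippet (not the real page) to show this, and that `find_next_sibling()` skips text nodes.

```python
from bs4 import BeautifulSoup

# Made-up snippet that imitates the layout around the description
html = """
<div id="product_description" class="sub-header"><h2>Product Description</h2></div>
<p>Sample description text.</p>
"""
soup = BeautifulSoup(html, "html.parser")
header = soup.find("div", id="product_description")

first = header.next_sibling                 # the newline between the two tags
second = header.next_sibling.next_sibling   # the <p> element
same = header.find_next_sibling()           # skips text nodes, returns the <p>

print(repr(first))
print(second.get_text())
print(same is second)
```

`find_next_sibling()` is usually the safer choice, because it keeps working if the whitespace in the page changes.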

In [54]:
import csv
home_link = "https://books.toscrape.com/"
data_rows = []

total_books = len(anchors)
books_processed = 0

for anchor in anchors:
    try:
        new_res = requests.get(home_link + anchor["href"])
        soup_ind = BeautifulSoup(new_res.content, "html.parser")
        product_main_div = soup_ind.find('div', class_="product_main")
        h1_text = product_main_div.h1.get_text()
        price = product_main_div.find('p', class_="price_color").get_text()
        star_rating_class = product_main_div.find('p', class_="star-rating")["class"]
        rating = star_rating_class[-1]
        description = soup_ind.find('div', class_="sub-header").find_next_sibling().get_text().strip()
        
        # Append data to data_rows
        data_rows.append([h1_text, price, rating, description])
        
        books_processed += 1
        print(f"Processed {books_processed}/{total_books} books.")
    except Exception as e:
        print("Error:", e)

# Write data to CSV file
with open("books_data.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    # Write header
    writer.writerow(["Title", "Price", "Rating", "Description"])
    # Write data rows
    writer.writerows(data_rows)

print("Data saved to books_data.csv")
    

Processed 1/40 books.
Processed 2/40 books.
Processed 3/40 books.
Processed 4/40 books.
Processed 5/40 books.
Processed 6/40 books.
Processed 7/40 books.
Processed 8/40 books.
Processed 9/40 books.
Processed 10/40 books.
Processed 11/40 books.
Processed 12/40 books.
Processed 13/40 books.
Processed 14/40 books.
Processed 15/40 books.
Processed 16/40 books.
Processed 17/40 books.
Processed 18/40 books.
Processed 19/40 books.
Processed 20/40 books.
Processed 21/40 books.
Processed 22/40 books.
Processed 23/40 books.
Processed 24/40 books.
Processed 25/40 books.
Processed 26/40 books.
Processed 27/40 books.
Processed 28/40 books.
Processed 29/40 books.
Processed 30/40 books.
Processed 31/40 books.
Processed 32/40 books.
Processed 33/40 books.
Processed 34/40 books.
Processed 35/40 books.
Processed 36/40 books.
Processed 37/40 books.
Processed 38/40 books.
Processed 39/40 books.
Processed 40/40 books.
Data saved to books_data.csv
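
The `anchors` list printed earlier shows each book twice: once for the cover image link and once for the title link. The 40 links processed above therefore most likely cover 20 distinct books, each fetched and written twice. One way to avoid that, sketched here with a hypothetical helper `unique_in_order` and a shortened sample list, is to drop duplicate `href` values before looping.

```python
def unique_in_order(items):
    """Remove duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))

# Shortened sample of the hrefs seen in the anchors output
hrefs = [
    "catalogue/a-light-in-the-attic_1000/index.html",
    "catalogue/a-light-in-the-attic_1000/index.html",
    "catalogue/tipping-the-velvet_999/index.html",
    "catalogue/tipping-the-velvet_999/index.html",
]
unique_hrefs = unique_in_order(hrefs)
print(len(hrefs), "->", len(unique_hrefs))
```

In the notebook this could be applied as `unique_in_order(a["href"] for a in anchors)`, after which the loop would iterate over the resulting strings.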


### Scraping Celeb Height Data

In [None]:
import requests

res = requests.get("https://www.celebheights.com/s/allA.html")
if res.ok:
    print(res.content)
    

In [None]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.content, "html.parser")
soup

### Getting the names of all the celebrities and their heights

Possible soup methods

```python
soup.find_all('a')
soup.find('div', class_='captionBottom')

# select every element with a given class ...
class_select = soup.select(".class_name")
# ... or, for div elements only:
soup.find_all('div', class_='class_name')

id_select = soup.select("#id_name")
# get the first div in the HTML and the text within it
soup.find('div').get_text()
# pass a separator such as "|" to keep the texts of nested elements apart
# <div>
# <p>Hello<span>World</span></p>
# </div>

soup.find('div').get_text("|")

```
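
To see what the separator argument does, here is a small sketch on two made-up HTML strings. It also shows that whitespace between tags counts as text, which can shift the positions you get after splitting on `|`.

```python
from bs4 import BeautifulSoup

# Two made-up inputs: the same markup with and without line breaks
compact = "<div><p>Hello<span>World</span></p></div>"
spaced = "<div>\n<p>Hello<span>World</span></p>\n</div>"

compact_text = BeautifulSoup(compact, "html.parser").find("div").get_text("|")
spaced_text = BeautifulSoup(spaced, "html.parser").find("div").get_text("|")
# strip=True trims each piece and drops the ones that are only whitespace
stripped_text = BeautifulSoup(spaced, "html.parser").find("div").get_text("|", strip=True)

print(repr(compact_text))
print(repr(spaced_text))
print(repr(stripped_text))
```

When the split positions look off by one, printing the `repr` of the joined text as done here is a quick way to find out why.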

In [None]:
#your code here 
names_div = soup.find_all('div', class_="sAZ2")
names_div

In [9]:
names_strings = []
heights_of_celebs = []
for name in names_div:
    # Join the text pieces of each entry with "|" so they can be split apart again
    name_temp = name.get_text("|")
    splitted_name = name_temp.split("|")
    # Skip entries that do not contain both a name and a height
    if len(splitted_name) > 2:
        names_strings.append(splitted_name[1])      # index 1: name
        heights_of_celebs.append(splitted_name[2])  # index 2: height


### Exporting Data to CSV format

Two ways to write data from lists to a CSV file:
- 1. Open the file, write the rows, and close it yourself

```python
import csv

names_strings = ['Alice', 'Bob', 'Charlie']
heights_of_celebs = [165, 180, 175]

file = open('heights.csv', mode='w', newline='', encoding='utf-8')
writer = csv.writer(file)
writer.writerow(['Name', 'Height'])
for name, height in zip(names_strings, heights_of_celebs):
    writer.writerow([name, height])

file.close()  # Remember to close the file after writing

```

- 2. Use a `with` block, which closes the file automatically

```python
import csv

names_strings = ['Alice', 'Bob', 'Charlie', 'Dōmo']
heights_of_celebs = [165, 180, 175, 170]

with open('heights.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Height'])
    for name, height in zip(names_strings, heights_of_celebs):
        writer.writerow([name, height])

```

In [11]:
#write the code similar to above to save that to a csv file
#your code here ...
import csv

with open('heightdata.csv', mode='w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Name', 'Height'])
    for name, height in zip(names_strings, heights_of_celebs):
        writer.writerow([name, height])
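
A quick way to check an export is to read the file back. The sketch below writes a couple of made-up rows to a temporary file and loads them again with `csv.DictReader`. The file name and the data are placeholders, not the scraped values.

```python
import csv
import os
import tempfile

# Placeholder rows; in the notebook these would come from the scraped lists
rows = [("Alice", 165), ("Bob", 180)]
path = os.path.join(tempfile.mkdtemp(), "heights_check.csv")

# newline="" is what the csv module documentation recommends when opening files
with open(path, mode="w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Name", "Height"])
    writer.writerows(rows)

# Read the file back; every value comes back as a string
with open(path, newline="", encoding="utf-8") as file:
    records = list(csv.DictReader(file))

print(records)
```

Note that the heights are read back as strings, so convert them with `int()` or `float()` before doing arithmetic.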

### **Using Selenium**

##### Note:- 
Selenium needs Chrome (or the Chrome binary) installed on your machine in order to work.

Your task: try running Selenium on your machine. A setup guide:
https://tecadmin.net/setup-selenium-with-python-on-ubuntu-debian/

In [55]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
import time

def google_search(query):
    driver = None  # defined up front so the finally block is safe if Chrome fails to start
    try:
        # Setup Chrome options
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument('--headless')  # Run in headless mode
        
        # Initialize the Chrome driver; options are passed to webdriver.Chrome, not to Service
        driver = webdriver.Chrome(service=Service(), options=chrome_options)
        
        # Navigate to Google
        driver.get("https://www.google.com")
        
        # Wait for search box to be present
        search_box = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.NAME, "q"))
        )
        
        # Enter search query
        search_box.send_keys(query)
        search_box.send_keys(Keys.RETURN)
        
        # Wait for results to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "search"))
        )
        
        print(f"Search completed for: {query}")
        
    except Exception as e:
        print(f"An error occurred: {str(e)}")
    
    finally:
        # Only quit if the driver was actually created
        if driver is not None:
            time.sleep(5)
            driver.quit()

if __name__ == "__main__":
    google_search("BeautifulSoup in python")

Search completed for: BeautifulSoup in python


In [None]:
import json
import requests
from concurrent.futures import ThreadPoolExecutor
from requests.exceptions import Timeout

# Create an empty list to store the responses
responses = []

def fetch_data(name):
    try:
        api_url = 'https://www.famousbirthdays.com/api/autocomplete'
        # Let requests URL-encode the search term
        response = requests.get(api_url, params={'term': name}, timeout=10)
        
        if response.status_code == requests.codes.ok:
            response_data = response.json()
            # Guard against a missing or empty suggestions list
            suggestions = response_data.get("suggestions") if response_data else None
            if suggestions:
                newDict = {}
                newDict["data"] = suggestions[0]['data']
                newDict["value"] = suggestions[0]['value']
                responses.append(newDict)
                print(newDict)
        else:
            print("Error:", response.status_code, response.text)
    
    except Timeout:
        print("Connection timeout occurred for", name)

# executor.map iterates over its argument, so pass a list of names, not a bare string
names = ["Ram"]

# Create a ThreadPoolExecutor with up to 32 worker threads
with ThreadPoolExecutor(max_workers=32) as executor:
    # Submit one task per name to the executor
    executor.map(fetch_data, names)

# Save the responses to a JSON file
with open('categorical.json', 'w') as file:
    json.dump(responses, file)
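
Two details of `executor.map` are easy to miss: it iterates over whatever it is given (so a bare string is split into characters), and it yields the return values in input order. The sketch below shows both using a stand-in function instead of a real network call. `fake_fetch` and the sample names are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(name):
    """Stand-in for a network call; returns a dict, or None to signal a failure."""
    if not name:
        return None
    return {"value": name.title(), "length": len(name)}

names = ["ram", "sita", "", "hari"]

with ThreadPoolExecutor(max_workers=4) as executor:
    # Results come back in the same order as the inputs
    results = [r for r in executor.map(fake_fetch, names) if r is not None]
    # A bare string is iterated character by character
    per_char = list(executor.map(str.upper, "Ram"))

print(results)
print(per_char)
```

Returning values from the worker and collecting them from `map` avoids relying on a shared list, and any exception raised in a worker surfaces when its result is read.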


In [56]:
import requests
# Add headers and proxies to try to avoid blocking
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept-Language': 'en-US,en;q=0.5'
}
# Proxy address used in the session; it may no longer be reachable
proxies = {
    'http': 'http://8.219.97.248:80',
    'https': 'http://8.219.97.248:80'
}
# A timeout keeps the request from hanging if the proxy does not respond
res = requests.get('https://www.daraz.com.np/catalog/?spm=a2a0e.searchlist.search.d_go.1888513fWE3iuM&q=mobile%20phones', headers=headers, proxies=proxies, timeout=10)

In [2]:
# Write with an explicit encoding so non-ASCII characters are saved correctly
with open('daraz.html', 'w', encoding='utf-8') as file:
    file.write(res.text)

In [57]:
res.headers

{'Date': 'Thu, 02 Jan 2025 09:22:58 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'Vary': 'Accept-Encoding', 'x-server-id': '528d98c43d575f61a16f40479a3b1ce12a8e037a70e5d0a6e3c3c5e54418c6ec9b107f6a3fc947730bf3cb262721f0df', 'set-cookie': 'hng=NP|en-NP|NPR|524; path=/; max-age=2592000; expires=Sat, 01 Feb 2025 09:22:58 GMT; domain=daraz.com.np, hng.sig=CpTPf0oy-Ji7yHXlTqu1ZBr2yy4DEw3ATHLSHmIgtyk; path=/; max-age=2592000; expires=Sat, 01 Feb 2025 09:22:58 GMT; domain=daraz.com.np', 'x-frame-options': 'SAMEORIGIN', 'x-xss-protection': '1; mode=block', 'x-content-type-options': 'nosniff', 'x-download-options': 'noopen', 'strict-transport-security': 'max-age=31536000, max-age=31536000', 'x-readtime': '3', 'Cache-Control': 'max-age=0, s-maxage=120', 'Content-Encoding': 'gzip', 'Server': 'Tengine/Aserver', 'EagleEye-TraceId': '210104e217358097782795054e531d', 'Timing-Allow-Origin': '*'}

### That's all for today. Hope you enjoyed it!
#### If you did and want to learn more about this, you can visit my blog on Medium: https://medium.com/@sudipnext