![image.png](../background_photos/)
[Photo link](https://unsplash.com/photos/a-large-mountain-with-a-very-tall-cliff-UiP9KfVe3aQ), Author: []()


# 📌 Description

[📚 Full material]()

#### 📺 Videos
#### 🏡 Homework

# 📚 Material

## 🌐 HTML Basics - Understanding Web Structure

Before diving into web scraping, you need to understand how web pages are structured. HTML (HyperText Markup Language) defines that structure.

### What is HTML?

HTML uses **tags** to define elements. Tags are enclosed in angle brackets `< >` and usually come in pairs:

```html
<tagname>Content goes here</tagname>
```

### Basic HTML Document Structure:

```html
<!DOCTYPE html>
<html>
<head>
    <title>Page Title</title>
</head>
<body>
    <h1>Main Heading</h1>
    <p>This is a paragraph.</p>
</body>
</html>
```
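
To see this nesting programmatically, the sketch below walks the same document with Python's built-in `html.parser` module and records each opening tag with its depth. The `OutlineParser` class is a made-up name for illustration, and the depth counter assumes well-formed HTML in which every opened tag is closed.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect opening tags together with their nesting depth (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) tuples in document order

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


html_doc = """<!DOCTYPE html>
<html>
<head>
    <title>Page Title</title>
</head>
<body>
    <h1>Main Heading</h1>
    <p>This is a paragraph.</p>
</body>
</html>"""

parser = OutlineParser()
parser.feed(html_doc)

# Print each tag indented by its depth in the tree
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The indentation of the printed outline mirrors the parent/child relationships that scraping tools navigate.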

### Common HTML Tags for Scraping:

#### Document Structure:
- `<html>` - Root element
- `<head>` - Contains metadata
- `<title>` - Page title
- `<body>` - Visible page content

#### Text Content:
- `<h1>`, `<h2>`, `<h3>` - Headings (most important to least)
- `<p>` - Paragraphs
- `<span>` - Inline text container

#### Containers:
- `<div>` - Block-level container (most common)
- `<section>` - Semantic section
- `<article>` - Independent content

#### Lists and Links:
- `<ul>`, `<ol>`, `<li>` - Unordered/ordered lists and list items
- `<a>` - Links
- `<img>` - Images

#### Data Tables:
- `<table>`, `<tr>`, `<td>`, `<th>` - Tables, rows, cells, headers

### HTML Attributes - The Key to Scraping

Attributes provide additional information about elements and are **crucial** for web scraping:

```html
<div id="content" class="main-section">...</div>
<a href="https://example.com" target="_blank">Link</a>
<img src="image.jpg" alt="Description">
<div data-price="29.99" data-category="electronics">Product</div>
```

**Most Important Attributes for Scraping:**
- `id` - Unique identifier (use with `#` in CSS selectors)
- `class` - CSS class name(s) (use with `.` in CSS selectors)
- `href` - Link destination
- `src` - Source for images/scripts
- `data-*` - Custom data attributes (very common in modern websites)

**Why Attributes Matter:**
- They help us target specific elements
- They often contain valuable data
- They make our scrapers more precise
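
As a rough illustration of attributes being plain name/value pairs, the sketch below collects them with the standard library's `html.parser`. The `AttributeCollector` class and the sample snippet are assumptions made for this example; Beautiful Soup exposes the same information more conveniently.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Store every opening tag with its attributes as a dict (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.elements = []  # list of (tag, attributes) tuples

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.elements.append((tag, dict(attrs)))


snippet = """
<a href="https://example.com" target="_blank">Link</a>
<img src="image.jpg" alt="Description">
<div data-price="29.99" data-category="electronics">Product</div>
"""

collector = AttributeCollector()
collector.feed(snippet)

# Attributes both identify elements and carry data worth extracting
links = [attrs["href"] for tag, attrs in collector.elements if tag == "a" and "href" in attrs]
prices = [float(attrs["data-price"]) for _, attrs in collector.elements if "data-price" in attrs]

print(links)
print(prices)
```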

## 🎯 CSS Selectors - Your Scraping Toolkit

CSS selectors are one of the most important concepts in web scraping: they tell your scraper exactly which elements to extract.

### Basic Selectors:

#### 1. Element Selector:
```css
p          /* Selects all <p> elements */
div        /* Selects all <div> elements */
h1         /* Selects all <h1> elements */
```

#### 2. Class Selector (starts with .):
```css
.classname     /* Selects elements with class="classname" */
.post-title    /* Selects elements with class="post-title" */
.btn-primary   /* Selects elements with class="btn-primary" */
```

#### 3. ID Selector (starts with #):
```css
#idname        /* Selects element with id="idname" */
#main-content  /* Selects element with id="main-content" */
#header        /* Selects element with id="header" */
```

#### 4. Attribute Selector:
```css
[href]                    /* Elements with href attribute */
[class="post"]           /* Elements whose class attribute is exactly "post" */
[data-price="29.99"]     /* Elements with data-price="29.99" */
```

### Advanced CSS Selectors:

#### Combination Selectors:
```css
div p              /* All <p> inside <div> (descendant) */
div > p            /* Direct <p> children of <div> */
h1 + p             /* A <p> that immediately follows an <h1> (adjacent sibling) */
.post .title       /* Elements with class "title" inside elements with class "post" */
```

#### Multiple Classes:
```css
.post.featured     /* Elements with BOTH classes "post" AND "featured" */
.btn.btn-primary   /* Elements with BOTH classes "btn" AND "btn-primary" */
```

#### Pseudo-selectors:
```css
p:first-child      /* <p> that is the first child of its parent */
p:last-child       /* <p> that is the last child of its parent */
p:nth-child(2)     /* <p> that is the second child of its parent */
a:-soup-contains("Next") /* Links containing the text "Next" (Beautiful Soup extension, not standard CSS) */
```
```

#### Complex Examples:
```css
div.post-content p.highlight    /* <p> with class "highlight" inside <div> with class "post-content" */
#main-content .sidebar a[href]  /* Links inside sidebar inside main content */
table tr:nth-child(odd)         /* Odd rows in a table */
```
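
The combinators and pseudo-selectors above are easy to mix up, so here is a small sketch that runs them against a made-up fragment. The HTML and the `classes` helper are assumptions for illustration only. Note that `p:first-child` matches nothing here, because the first child of the container is an `<h2>`, not a `<p>`.

```python
from bs4 import BeautifulSoup

html = """
<div id="box">
  <h2>Title</h2>
  <p class="a">first paragraph</p>
  <section><p class="b">nested paragraph</p></section>
  <p class="c">last paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def classes(selector):
    """Return the first class name of every <p> matched by the selector."""
    return [p["class"][0] for p in soup.select(selector)]


descendants = classes("#box p")                     # any depth below #box
children = classes("#box > p")                      # direct children only
first_child = classes("#box > p:first-child")       # <p> that is the first child
first_of_type = classes("#box > p:first-of-type")   # first <p> among its siblings
last_child = classes("#box > p:last-child")         # <p> that is the last child

print(descendants, children, first_child, first_of_type, last_child)
```

When you want "the first paragraph" regardless of what precedes it, `:first-of-type` is usually the selector you need.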

In [None]:
# CSS Selector Practice with Sample HTML
from bs4 import BeautifulSoup

# Sample HTML for practicing CSS selectors
practice_html = """
<html>
<body>
    <div id="header" class="top-section">
        <h1 class="main-title">Welcome to Our Store</h1>
        <nav class="navigation">
            <a href="/home">Home</a>
            <a href="/products">Products</a>
            <a href="/contact">Contact</a>
        </nav>
    </div>
    
    <div id="main-content">
        <div class="product featured" data-price="199.99">
            <h2 class="product-title">iPhone 15</h2>
            <p class="description">Latest smartphone with amazing features</p>
            <span class="price">$199.99</span>
        </div>
        
        <div class="product" data-price="89.99">
            <h2 class="product-title">Headphones</h2>
            <p class="description">High-quality wireless headphones</p>
            <span class="price">$89.99</span>
        </div>
    </div>
</body>
</html>
"""

soup = BeautifulSoup(practice_html, 'html.parser')

print("🎯 CSS Selector Practice:")
print("=" * 40)

# 1. Basic selectors
print("\n1️⃣ Basic Selectors:")
print(f"All h2 elements: {len(soup.select('h2'))} found")
print(f"Elements with class 'product': {len(soup.select('.product'))} found")
print(f"Element with id 'header': {len(soup.select('#header'))} found")

# 2. Find specific content
print("\n2️⃣ Finding Specific Content:")
main_title = soup.select_one('h1.main-title')
if main_title:
    print(f"Main title: {main_title.text}")

navigation_links = soup.select('nav.navigation a')
print(f"Navigation links: {[link.text for link in navigation_links]}")

# 3. Product information
print("\n3️⃣ Extract Product Information:")
products = soup.select('div.product')
for i, product in enumerate(products, 1):
    title = product.select_one('.product-title').text
    price = product.select_one('.price').text
    is_featured = 'featured' in product.get('class', [])
    
    print(f"Product {i}: {title}")
    print(f"  Price: {price}")
    print(f"  Featured: {'✅' if is_featured else '❌'}")

# 4. Advanced selectors
print("\n4️⃣ Advanced Selectors:")
featured_product = soup.select_one('.product.featured .product-title')
if featured_product:
    print(f"Featured product: {featured_product.text}")

# Selects every element that has a data-price attribute, whatever its value
priced_products = soup.select('[data-price]')
print(f"Products with price data: {len(priced_products)}")

first_product = soup.select_one('.product:first-child .product-title')
if first_product:
    print(f"First product: {first_product.text}")

## 🥄 Beautiful Soup - Your HTML Parsing Companion

Beautiful Soup is perfect for beginners and handles most scraping tasks effectively. It makes parsing HTML as easy as navigating a family tree!

### Why Beautiful Soup?
- **Easy to learn**: Intuitive syntax
- **Powerful**: Handles messy HTML gracefully
- **Flexible**: Multiple ways to find elements
- **Robust**: Usually detects and converts document encodings automatically

### Core Concepts:
1. **Parsing**: Convert HTML text into a navigable object
2. **Searching**: Find specific elements using tags, attributes, or CSS selectors
3. **Extracting**: Get text, attributes, or sub-elements
4. **Navigating**: Move between parent, children, and sibling elements

### Installation and Setup:

Beautiful Soup doesn't work alone - it needs a parser. Here are the most common combinations:

```bash
# Basic installation
pip install beautifulsoup4 requests

# Optional third-party parsers (lxml is faster; html5lib is slower but more lenient)
pip install lxml html5lib
```

**Parser Comparison:**
- `html.parser` - Built-in Python, decent speed, good enough for most tasks
- `lxml` - Very fast; a third-party package built on C libraries that must be installed separately
- `html5lib` - Most accurate, handles broken HTML best, but slower
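
Because the optional parsers may or may not be installed, one possible approach is to pick the parser name at runtime. The `pick_parser` helper below is a hypothetical convenience function, not part of Beautiful Soup; it only checks whether the package can be imported and falls back to the built-in `html.parser`.

```python
from importlib.util import find_spec


def pick_parser(preferred=("lxml", "html5lib")):
    """Return the first installed parser name, else the built-in 'html.parser'."""
    for name in preferred:
        # find_spec returns None when the package is not importable
        if find_spec(name) is not None:
            return name
    return "html.parser"


parser_name = pick_parser()
print(f"Using parser: {parser_name}")
```

The returned string can be passed as the second argument to `BeautifulSoup(...)`. Keep in mind that different parsers may build slightly different trees from broken HTML, so fixing one parser per project keeps results reproducible.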

In [None]:
# Install required packages for web scraping
!pip install beautifulsoup4 requests lxml html5lib pandas

Collecting html5lib
  Using cached html5lib-1.1-py2.py3-none-any.whl.metadata (16 kB)
Collecting webencodings (from html5lib)
  Using cached webencodings-0.5.1-py2.py3-none-any.whl.metadata (2.1 kB)
Using cached html5lib-1.1-py2.py3-none-any.whl (112 kB)
Using cached webencodings-0.5.1-py2.py3-none-any.whl (11 kB)
Installing collected packages: webencodings, html5lib

Successfully installed html5lib-1.1 webencodings-0.5.1


In [None]:
# Beautiful Soup Basic Usage - Step by Step
from bs4 import BeautifulSoup

# Step 1: Create a BeautifulSoup object
html_content = """
<html>
<head>
    <title>My First Scraping Example</title>
</head>
<body>
    <h1 id="main-title">Welcome to Web Scraping!</h1>
    <p class="intro">This is a sample paragraph.</p>
    <p class="content">This is another paragraph with <a href="/link">a link</a>.</p>
</body>
</html>
"""

# Parse the HTML
soup = BeautifulSoup(html_content, 'html.parser')

print("✅ BeautifulSoup object created successfully!")
print(f"Type: {type(soup)}")
print("Parser used: html.parser")

# Step 2: Basic navigation
print("\n🔍 Basic Element Access:")
print(f"Page title: {soup.title.text}")
print(f"First h1: {soup.h1.text}")
print(f"First paragraph: {soup.p.text}")

# Step 3: Pretty print the parsed HTML
print("\n📋 Formatted HTML structure:")
print(soup.prettify())

In [None]:
# Beautiful Soup Finding Methods - The Core of Scraping
from bs4 import BeautifulSoup

# More complex HTML for demonstration
complex_html = """
<div class="container">
    <article class="post" id="post-1" data-category="tech">
        <h2 class="post-title">First Tech Article</h2>
        <p class="post-content">Content about technology...</p>
        <div class="post-meta">
            <span class="author">John Doe</span>
            <span class="date">2025-01-15</span>
            <a href="/tech/article-1" class="read-more">Read More</a>
        </div>
    </article>
    
    <article class="post" id="post-2" data-category="science">
        <h2 class="post-title">Science Discovery</h2>
        <p class="post-content">Amazing scientific breakthrough...</p>
        <div class="post-meta">
            <span class="author">Jane Smith</span>
            <span class="date">2025-01-16</span>
            <a href="/science/article-2" class="read-more">Read More</a>
        </div>
    </article>
</div>
"""

soup = BeautifulSoup(complex_html, 'html.parser')

print("🔍 Beautiful Soup Finding Methods:")
print("=" * 45)

# Method 1: find() - Returns first match
print("\n1️⃣ find() - Get the FIRST matching element:")
first_article = soup.find('article')
if first_article:
    title = first_article.find('h2').text
    print(f"   First article title: {title}")

# Method 2: find_all() - Returns all matches
print("\n2️⃣ find_all() - Get ALL matching elements:")
all_articles = soup.find_all('article')
print(f"   Found {len(all_articles)} articles")
for i, article in enumerate(all_articles, 1):
    title = article.find('h2').text
    print(f"   Article {i}: {title}")

# Method 3: Finding by attributes
print("\n3️⃣ Finding by attributes:")
tech_article = soup.find('article', {'data-category': 'tech'})
if tech_article:
    print(f"   Tech article: {tech_article.find('h2').text}")

science_article = soup.find('article', attrs={'data-category': 'science'})
if science_article:
    print(f"   Science article: {science_article.find('h2').text}")

# Method 4: Finding by class (note: class_ because class is a Python keyword)
print("\n4️⃣ Finding by class:")
authors = soup.find_all('span', class_='author')
print(f"   Authors found: {[author.text for author in authors]}")

# Method 5: Finding by id
print("\n5️⃣ Finding by id:")
post1 = soup.find('article', id='post-1')
if post1:
    print(f"   Post 1 title: {post1.find('h2').text}")

print("\n💡 Key Takeaway: find() vs find_all()")
print("   find() → Returns first match or None")
print("   find_all() → Returns list of all matches (can be empty)")
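
Since `find()` returns `None` when nothing matches, chaining `.text` onto a missing element raises `AttributeError`. One way to guard against that is a small wrapper; `text_or_default` below is a hypothetical helper written for this example, not a Beautiful Soup function.

```python
from bs4 import BeautifulSoup


def text_or_default(parent, name, default="N/A", **attrs):
    """Return the stripped text of the first matching element, or `default` if absent."""
    element = parent.find(name, **attrs)
    return element.get_text(strip=True) if element is not None else default


html = '<article><h2 class="post-title"> Hello </h2></article>'
soup = BeautifulSoup(html, "html.parser")
article = soup.find("article")

title = text_or_default(article, "h2", class_="post-title")   # element exists
author = text_or_default(article, "span", class_="author")    # element is missing

print(title)
print(author)
```

On real pages, where some records lack a field, this keeps one incomplete item from stopping the whole loop.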

In [None]:
# Beautiful Soup with CSS Selectors - The Modern Way
# Reuses the `soup` object created in the previous cell

print("🎯 CSS Selectors with Beautiful Soup:")
print("=" * 42)

# Method 1: select() - CSS selectors (returns list)
print("\n1️⃣ select() - CSS selectors (returns list):")
post_titles = soup.select('.post-title')
print(f"   Found {len(post_titles)} titles using '.post-title'")
for title in post_titles:
    print(f"   • {title.text}")

# Method 2: select_one() - First match only
print("\n2️⃣ select_one() - First match only:")
first_author = soup.select_one('.author')
if first_author:
    print(f"   First author: {first_author.text}")

# Method 3: Complex CSS selectors
print("\n3️⃣ Complex CSS selectors:")
tech_title = soup.select_one('article[data-category="tech"] .post-title')
if tech_title:
    print(f"   Tech article title: {tech_title.text}")

read_more_links = soup.select('.post-meta .read-more')
print(f"   Read more links: {[link.get('href') for link in read_more_links]}")

# Method 4: Combining selectors
print("\n4️⃣ Advanced combinations:")
all_meta_spans = soup.select('.post-meta span')
print(f"   Meta spans: {[span.text for span in all_meta_spans]}")

# Method 5: Pseudo-selectors
print("\n5️⃣ Pseudo-selectors:")
first_post = soup.select_one('article:first-child .post-title')
if first_post:
    print(f"   First post: {first_post.text}")

last_post = soup.select_one('article:last-child .post-title')
if last_post:
    print(f"   Last post: {last_post.text}")

print("\n📝 Summary of Selector Methods:")
print("   find/find_all + attributes → Traditional Beautiful Soup")
print("   select/select_one + CSS     → Modern, CSS-like approach")
print("   💡 Use CSS selectors for complex queries!")

In [None]:
# Beautiful Soup Complete Example - Part 1: Setup and Parsing
from bs4 import BeautifulSoup
import requests

# Sample HTML for demonstration
sample_html = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample Blog</title>
</head>
<body>
    <header id="main-header">
        <h1>My Amazing Blog</h1>
        <nav class="navigation">
            <ul>
                <li><a href="#home">Home</a></li>
                <li><a href="#about">About</a></li>
                <li><a href="#contact">Contact</a></li>
            </ul>
        </nav>
    </header>
    
    <main class="content">
        <article class="post featured" data-id="123">
            <h2 class="post-title">How to Learn Web Scraping</h2>
            <p class="post-content">Web scraping is an essential skill for data scientists...</p>
            <div class="post-meta">
                <span class="author">Alice Johnson</span>
                <span class="date">2025-01-15</span>
                <span class="category">Technology</span>
            </div>
        </article>
        
        <article class="post" data-id="124">
            <h2 class="post-title">Python Data Analysis Tips</h2>
            <p class="post-content">Here are some advanced tips for analyzing data with Python...</p>
            <div class="post-meta">
                <span class="author">Bob Smith</span>
                <span class="date">2025-01-16</span>
                <span class="category">Data Science</span>
            </div>
        </article>
    </main>
    
    <footer>
        <p>&copy; 2025 My Blog. All rights reserved.</p>
    </footer>
</body>
</html>
"""

# Create BeautifulSoup object
soup = BeautifulSoup(sample_html, 'html.parser')

print("🥄 Beautiful Soup Complete Example - Part 1")
print("=" * 50)

print("\n✅ Step 1: Parse HTML into BeautifulSoup object")
print(f"   Object type: {type(soup)}")
print("   Parser used: html.parser")

print("\n📋 Step 2: Examine the structure")
print(f"   Page title: {soup.title.text}")
print(f"   Number of articles: {len(soup.find_all('article'))}")
print(f"   Number of links: {len(soup.find_all('a'))}")

print("\nStep 3: Basic element access")
# Direct access to first element of each type
print(f"   First h1: {soup.h1.text}")
print(f"   First article title: {soup.find('h2').text}")
print(f"   Footer text: {soup.footer.p.text}")

In [None]:
# Beautiful Soup Complete Example - Part 2: Finding Elements
# Using the same soup object from Part 1

print("🥄 Beautiful Soup Complete Example - Part 2")
print("=" * 50)

print("\n🔍 Different Ways to Find Elements:")

# Method 1: By tag name
print("\n1️⃣ Find by tag name:")
titles = soup.find_all('h2')
print(f"   Found {len(titles)} h2 elements:")
for i, title in enumerate(titles, 1):
    print(f"     {i}. {title.text}")

# Method 2: By class
print("\n2️⃣ Find by class:")
posts = soup.find_all('article', class_='post')
print(f"   Found {len(posts)} articles with class 'post'")

# Method 3: By ID
print("\n3️⃣ Find by ID:")
header = soup.find('header', id='main-header')
if header:
    print(f"   Header found: {header.h1.text}")

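# Method 4: By multiple classes
# A CSS selector matches both classes regardless of their order in the attribute,
# whereas {'class': 'post featured'} only matches that exact string
print("\n4️⃣ Find by multiple classes:")
featured_post = soup.select_one('article.post.featured')
if featured_post:
    print(f"   Featured post: {featured_post.find('h2').text}")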

# Method 5: By custom attributes
print("\n5️⃣ Find by custom attributes:")
post_123 = soup.find('article', {'data-id': '123'})
if post_123:
    author = post_123.find('span', class_='author').text
    print(f"   Post 123 author: {author}")

print("\n💡 Key Takeaway:")
print("   find() → First match or None")
print("   find_all() → List of all matches")

In [None]:
# Beautiful Soup Complete Example - Part 3: Data Extraction & Navigation
# Using the same soup object from previous parts

print("🥄 Beautiful Soup Complete Example - Part 3")
print("=" * 50)

print("\n📊 Extract Structured Data from All Articles:")

articles = soup.find_all('article', class_='post')
blog_data = []

for i, article in enumerate(articles, 1):
    # Extract data from each article
    title = article.find('h2', class_='post-title').text
    content = article.find('p', class_='post-content').text
    author = article.find('span', class_='author').text
    date = article.find('span', class_='date').text
    category = article.find('span', class_='category').text
    data_id = article.get('data-id')  # Get attribute value
    is_featured = 'featured' in article.get('class', [])
    
    # Store in dictionary
    article_data = {
        'id': data_id,
        'title': title,
        'author': author,
        'date': date,
        'category': category,
        'content_preview': content[:50] + "...",
        'is_featured': is_featured
    }
    
    blog_data.append(article_data)
    
    print(f"\n📄 Article {i}:")
    print(f"   ID: {article_data['id']}")
    print(f"   Title: {article_data['title']}")
    print(f"   Author: {article_data['author']}")
    print(f"   Category: {article_data['category']}")
    print(f"   Featured: {'✅' if article_data['is_featured'] else '❌'}")

print("\n🧭 Navigation Examples:")

# Parent-child navigation
first_article = articles[0]
post_meta = first_article.find('div', class_='post-meta')
print(f"\n   Parent of author span: {post_meta.name}")
print(f"   Children of post-meta: {[child.name for child in post_meta.children if child.name]}")

# Sibling navigation  
author_span = first_article.find('span', class_='author')
next_sibling = author_span.find_next_sibling('span')
if next_sibling:
    print(f"   Next sibling of author: {next_sibling.text} ({next_sibling.get('class')})")

print("\n📈 Summary Statistics:")
print(f"   Total articles processed: {len(blog_data)}")
print(f"   Featured articles: {sum(1 for article in blog_data if article['is_featured'])}")
print(f"   Unique authors: {len(set(article['author'] for article in blog_data))}")
print(f"   Categories: {list(set(article['category'] for article in blog_data))}")

### 🌐 Real Website Scraping Example

Let's scrape some real data from a website. We'll use `quotes.toscrape.com`, a site built for practicing web scraping:

In [None]:
# Step 1: Import required libraries for web scraping
import requests
from bs4 import BeautifulSoup
import time
import json

print("✅ Libraries imported successfully!")

In [None]:
# Step 2: Define the scraping function
def scrape_quotes():
    """
    Scrape quotes from quotes.toscrape.com
    Returns a list of dictionaries containing quote data
    """
    url = "http://quotes.toscrape.com/"
    
    try:
        print(f"🌐 Sending request to: {url}")
        
        # Send GET request to the website (the timeout avoids hanging on a stalled connection)
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise exception for bad status codes
        
        print(f"✅ Request successful! Status code: {response.status_code}")
        
        # Parse HTML content with Beautiful Soup
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find all quote containers
        quotes = soup.find_all('div', class_='quote')
        print(f"📊 Found {len(quotes)} quotes on the page")
        
        scraped_data = []
        
        # Extract data from each quote
        for i, quote in enumerate(quotes, 1):
            # Extract quote text (whitespace stripped; the surrounding quotation marks are kept)
            text = quote.find('span', class_='text').get_text(strip=True)
            
            # Extract author name
            author = quote.find('small', class_='author').text
            
            # Extract tags (multiple tags per quote)
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]
            
            # Store data in dictionary
            quote_data = {
                'text': text,
                'author': author,
                'tags': tags
            }
            
            scraped_data.append(quote_data)
            print(f"  📝 Processed quote {i}: {author}")
        
        return scraped_data
    
    except requests.RequestException as e:
        print(f"❌ Error fetching the webpage: {e}")
        return []
    except Exception as e:
        print(f"❌ Error processing data: {e}")
        return []

print("✅ Function defined successfully!")

In [None]:
# Step 3: Run the scraper and collect data
print("🕷️ Starting the scraping process...")
quotes_data = scrape_quotes()

print("\n📊 Scraping completed!")
print(f"Total quotes collected: {len(quotes_data)}")

if quotes_data:
    print("\n🎯 Sample of scraped data:")
    print("=" * 60)
else:
    print("❌ No quotes were scraped. Check your internet connection.")

In [None]:
# Step 4: Display the first few quotes to see our results
if quotes_data:
    print("📝 First 3 quotes from our scraping:")
    print("=" * 80)
    
    for i, quote in enumerate(quotes_data[:3], 1):
        print(f"\n💬 Quote {i}:")
        print(f"   Text: {quote['text']}")
        print(f"   Author: {quote['author']}")
        print(f"   Tags: {', '.join(quote['tags'])}")
        print("-" * 60)
    
    # Show some statistics
    print("\n📈 Quick Statistics:")
    print(f"   Total quotes: {len(quotes_data)}")
    
    # Find unique authors
    authors = set(quote['author'] for quote in quotes_data)
    print(f"   Unique authors: {len(authors)}")
    
    # Find all unique tags
    all_tags = set()
    for quote in quotes_data:
        all_tags.update(quote['tags'])
    print(f"   Unique tags: {len(all_tags)}")
    print(f"   Some tags: {', '.join(list(all_tags)[:5])}...")
else:
    print("❌ No data to display")
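
The unique-author and unique-tag counts above can be extended to frequencies with `collections.Counter`. The sketch below uses a small made-up `sample_quotes` list with the same dictionary shape as `quotes_data`, so it runs without a network connection; the author names and tags are placeholders.

```python
from collections import Counter

# Placeholder records shaped like the items returned by scrape_quotes()
sample_quotes = [
    {"author": "Author A", "tags": ["life", "humor"]},
    {"author": "Author B", "tags": ["life"]},
    {"author": "Author A", "tags": ["books", "life"]},
]

# Count how often each tag and each author appears
tag_counts = Counter(tag for quote in sample_quotes for tag in quote["tags"])
author_counts = Counter(quote["author"] for quote in sample_quotes)

print("Most common tag:", tag_counts.most_common(1))
print("Most quoted author:", author_counts.most_common(1))
```

Swapping `sample_quotes` for `quotes_data` gives the same summary for the scraped page.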

In [None]:
# Step 5: Save the scraped data to a file
if quotes_data:
    # Save to JSON file with proper formatting
    filename = 'quotes_scraped.json'
    
    try:
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(quotes_data, f, indent=2, ensure_ascii=False)
        
        print(f"💾 Data saved successfully to '{filename}'")
        print(f"📁 File contains {len(quotes_data)} quotes")
        
        # Show file size
        import os
        file_size = os.path.getsize(filename)
        print(f"📊 File size: {file_size:,} bytes")
        
    except Exception as e:
        print(f"❌ Error saving file: {e}")
        
    # Also demonstrate accessing specific fields of a record
    print("\n🔍 Example of accessing specific quote data:")
    if len(quotes_data) > 0:
        first_quote = quotes_data[0]
        print(f"   First quote text: {first_quote['text'][:50]}...")
        print(f"   First quote author: {first_quote['author']}")
        print(f"   First quote tags: {first_quote['tags']}")
else:
    print("❌ No data to save")

#### 🎓 What We Just Did - Step by Step:

1. **📦 Imported Libraries**: We imported the essential tools:
   - `requests` for making HTTP requests
   - `BeautifulSoup` for parsing HTML
   - `json` for saving data
   - `time` for adding delays (unused in this single-page example, but good practice once you request several pages)

2. **🔧 Created Function**: We defined `scrape_quotes()` that:
   - Sends a GET request to the website
   - Handles errors gracefully
   - Parses HTML with Beautiful Soup
   - Extracts specific data using `find()` and `find_all()`

3. **🚀 Executed Scraper**: We ran the function and collected data

4. **👀 Viewed Results**: We displayed the scraped quotes to verify success

5. **💾 Saved Data**: We saved the results to a JSON file for future use

**Key Learning Points:**
- Check `response.status_code`, or call `response.raise_for_status()` as we did, to confirm the request succeeded
- Use `.find()` for single elements and `.find_all()` for multiple elements
- Handle exceptions to make your scraper robust
- Save data in structured formats like JSON or CSV
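
The example above saved JSON; a CSV export is a reasonable alternative when the data will be opened in a spreadsheet. The sketch below is one possible approach using only the standard library. The `quotes_to_csv` function, the `|` separator for tags, and the sample record are assumptions made for illustration.

```python
import csv
import io


def quotes_to_csv(quotes, file_obj):
    """Write quote dictionaries as CSV rows, joining the tag list into one cell."""
    writer = csv.DictWriter(file_obj, fieldnames=["text", "author", "tags"])
    writer.writeheader()
    for quote in quotes:
        # Lists do not fit in a single CSV cell, so join the tags with "|"
        writer.writerow(dict(quote, tags="|".join(quote["tags"])))


# Placeholder record shaped like the items returned by scrape_quotes()
sample = [
    {"text": "An example, with a comma", "author": "Author A", "tags": ["life", "humor"]},
]

buffer = io.StringIO()  # stands in for a real file in this demo
quotes_to_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file, pass `open("quotes_scraped.csv", "w", newline="", encoding="utf-8")` instead of the in-memory buffer. The `csv` module quotes fields that contain commas, so the text survives a round trip.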

## 🏠 Real Website Scraping Example

Let's create a more complex scraper that demonstrates real-world techniques. This example shows how to:
- Handle multiple pages
- Extract structured data
- Process and analyze results
- Implement basic error handling

**Key Learning Points:**
- Always check a website's robots.txt before scraping
- Add appropriate delays between requests
- Handle errors gracefully
- Structure your extracted data properly
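
The robots.txt check in the first point can be automated with the standard library's `urllib.robotparser`. The sketch below parses an inline, made-up robots.txt so that it runs offline. The rules, the `my-scraper` user agent, and the `example.com` URLs are placeholders; for a real site you would call `set_url()` and `read()` on the parser instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real file lives at https://<site>/robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())

# Ask whether our (hypothetical) user agent may fetch each URL
allowed = rules.can_fetch("my-scraper", "https://example.com/category/62")
blocked = rules.can_fetch("my-scraper", "https://example.com/private/data")

# Crawl-delay, if present, suggests how long to wait between requests
delay = rules.crawl_delay("my-scraper")

print(f"Category page allowed: {allowed}")
print(f"Private page allowed: {blocked}")
print(f"Suggested delay: {delay} seconds")
```

When `crawl_delay()` returns a number, passing it to `time.sleep()` between requests is a simple way to honour it.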

In [None]:
# List.am Real Estate Scraper
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
from urllib.parse import urljoin, urlparse
import json

def scrape_listam_listings(base_url="https://www.list.am/category/62", max_pages=2, delay=2):
    """
    Scrape real estate listings from list.am
    
    Args:
        base_url (str): Base URL for the category
        max_pages (int): Maximum number of pages to scrape
        delay (int): Delay between requests in seconds
    
    Returns:
        list: List of dictionaries containing listing data
    """
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
    }
    
    all_listings = []
    
    for page in range(1, max_pages + 1):
        try:
            # Construct page URL
            if page == 1:
                page_url = base_url
            else:
                page_url = f"{base_url}/{page}"
            
            print(f"🔍 Scraping page {page}: {page_url}")
            
            # Add delay to be respectful
            if page > 1:
                time.sleep(delay)
            
            # Send request
            response = requests.get(page_url, headers=headers, timeout=10)
            response.raise_for_status()
            
            # Parse HTML
            soup = BeautifulSoup(response.content, 'html.parser')
            
            # Find listing containers (adjust selectors based on actual HTML structure)
            listings = soup.find_all('a', href=True)
            
            page_listings = []
            
            for listing in listings:
                href = listing.get('href', '')
                
                # Filter for item links
                if href.startswith('/item/'):
                    # Extract item ID
                    item_match = re.search(r'/item/(\d+)', href)
                    if not item_match:
                        continue
                    
                    item_id = item_match.group(1)
                    full_url = urljoin(base_url, href)
                    
                    # Extract text content from the link
                    text_content = listing.get_text(strip=True)
                    
                    # Parse listing information from text
                    listing_data = parse_listing_text(text_content, item_id, full_url)
                    
                    if listing_data:
                        page_listings.append(listing_data)
            
            print(f"   ✅ Found {len(page_listings)} listings on page {page}")
            all_listings.extend(page_listings)
            
            # Stop early if the page has no "next" link
            # (the link text is assumed; adjust it to the site's actual markup)
            next_link = soup.find('a', string='Հաջորդը >')
            if not next_link:
                print("📄 No next-page link found; stopping")
                break
                
        except requests.RequestException as e:
            print(f"❌ Error fetching page {page}: {e}")
            break
        except Exception as e:
            print(f"❌ Error parsing page {page}: {e}")
            continue
    
    return all_listings

def parse_listing_text(text, item_id, url):
    """
    Parse listing information from text content
    
    Args:
        text (str): Text content of the listing
        item_id (str): Item ID
        url (str): Full URL to the listing
    
    Returns:
        dict: Parsed listing data
    """
    
    if not text or len(text.strip()) < 10:
        return None
    
    # Initialize listing data
    listing = {
        'id': item_id,
        'url': url,
        'raw_text': text.strip(),
        'price': None,
        'price_currency': None,
        'location': None,
        'property_type': None,
        'area_sqm': None,
        'rooms': None,
        'floor': None,
        'description': None
    }
    
    # Extract price (handles both USD and AMD)
    price_usd_match = re.search(r'\$([0-9,]+(?:\.[0-9]+)?)', text)
    price_amd_match = re.search(r'([0-9,]+(?:\.[0-9]+)?)\s*֏', text)
    
    if price_usd_match:
        listing['price'] = price_usd_match.group(1).replace(',', '')
        listing['price_currency'] = 'USD'
    elif price_amd_match:
        listing['price'] = price_amd_match.group(1).replace(',', '')
        listing['price_currency'] = 'AMD'
    
    # Extract area (square meters)
    area_match = re.search(r'(\d+)\s*քմ', text)
    if area_match:
        listing['area_sqm'] = area_match.group(1)
    
    # Extract number of rooms
    rooms_match = re.search(r'(\d+)\s*սեն', text)
    if rooms_match:
        listing['rooms'] = rooms_match.group(1)
    
    # Extract floor information
    floor_match = re.search(r'(\d+)/(\d+)\s*հարկ', text)
    if floor_match:
        listing['floor'] = f"{floor_match.group(1)}/{floor_match.group(2)}"
    
    # Extract location (Yerevan districts and other Armenian cities)
    locations = [
        'Կենտրոն', 'Արաբկիր', 'Դավթաշեն', 'Մալաթիա-Սեբաստիա',
        'Շենգավիթ', 'Նոր Նորք', 'Աջափնյակ', 'Ավան', 'Էրեբունի',
        'Գյումրի', 'Վանաձոր', 'Աբովյան', 'Արտաշատ', 'Գևարք',
        'Ծաղկաձոր', 'Դիլիջան', 'Իջևան', 'Գորիս', 'Կապան'
    ]
    
    for location in locations:
        if location in text:
            listing['location'] = location
            break
    
    # Determine property type based on keywords
    if 'բնակարան' in text:  # apartment
        listing['property_type'] = 'Apartment'
    elif 'տուն' in text or 'թաունհաուզ' in text:  # house or townhouse
        listing['property_type'] = 'House'
    elif 'հողատարածք' in text:  # land plot
        listing['property_type'] = 'Land'
    elif 'ավտոտնակ' in text:  # garage
        listing['property_type'] = 'Garage'
    elif 'գրասենյակ' in text:  # office
        listing['property_type'] = 'Office'
    else:
        listing['property_type'] = 'Other'
    
    # Clean description (remove price and location)
    description = text
    price_match = price_usd_match or price_amd_match
    if price_match:
        # Remove the exact matched text; listing['price'] has its commas stripped,
        # so rebuilding a pattern from it would miss prices such as "$120,000"
        description = description.replace(price_match.group(0), '')
    
    if listing['location']:
        description = description.replace(listing['location'], '')
    
    listing['description'] = description.strip()
    
    return listing

def analyze_listings(listings):
    """
    Analyze scraped listings and provide statistics
    
    Args:
        listings (list): List of listing dictionaries
    
    Returns:
        dict: Analysis results
    """
    
    if not listings:
        return {}
    
    df = pd.DataFrame(listings)
    
    # Convert price to numeric for analysis
    df['price_numeric'] = pd.to_numeric(df['price'], errors='coerce')  # commas were already removed during parsing
    df['area_numeric'] = pd.to_numeric(df['area_sqm'], errors='coerce')
    df['rooms_numeric'] = pd.to_numeric(df['rooms'], errors='coerce')
    
    analysis = {
        'total_listings': len(listings),
        'unique_locations': df['location'].nunique(),
        'property_types': df['property_type'].value_counts().to_dict(),
        'currency_distribution': df['price_currency'].value_counts().to_dict(),
        'price_stats': {},
        'area_stats': {},
        'location_stats': df['location'].value_counts().head(10).to_dict()
    }
    
    # Price statistics (for USD listings)
    usd_prices = df[df['price_currency'] == 'USD']['price_numeric'].dropna()
    if len(usd_prices) > 0:
        analysis['price_stats']['USD'] = {
            'count': len(usd_prices),
            'mean': round(usd_prices.mean(), 2),
            'median': round(usd_prices.median(), 2),
            'min': usd_prices.min(),
            'max': usd_prices.max()
        }
    
    # Area statistics
    areas = df['area_numeric'].dropna()
    if len(areas) > 0:
        analysis['area_stats'] = {
            'count': len(areas),
            'mean': round(areas.mean(), 2),
            'median': round(areas.median(), 2),
            'min': areas.min(),
            'max': areas.max()
        }
    
    return analysis

# Example usage
print("🏠 Starting List.am Real Estate Scraper...")
print("⚠️  Remember: This is for educational purposes only!")
print("🕐 Adding delays between requests to be respectful...")

# Scrape listings (limiting to 2 pages for demo)
listings = scrape_listam_listings(max_pages=2, delay=3)

print(f"\n📊 Scraping completed! Total listings found: {len(listings)}")

if listings:
    print("\n🏠 Sample listings:")
    print("=" * 80)
    
    for i, listing in enumerate(listings[:5], 1):
        print(f"\n{i}. ID: {listing['id']}")
        print(f"   Type: {listing['property_type']}")
        print(f"   Price: {listing['price']} {listing['price_currency'] or 'N/A'}")
        print(f"   Location: {listing['location'] or 'N/A'}")
        print(f"   Area: {listing['area_sqm']} sqm" if listing['area_sqm'] else "   Area: N/A")
        print(f"   Rooms: {listing['rooms']}" if listing['rooms'] else "   Rooms: N/A")
        print(f"   Description: {listing['description'][:60]}...")
        print(f"   URL: {listing['url']}")
else:
    print("❌ No listings found")
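
Before trusting `parse_listing_text` on live pages, it can help to exercise the regular expressions on a few hand-written strings. The sketch below re-implements only the price step as a hypothetical helper, `extract_price`; the sample strings are invented for illustration and are not real listings.

```python
import re

# Mirrors the price patterns in parse_listing_text: a dollar amount,
# or a number followed by the Armenian dram sign
USD_PATTERN = re.compile(r'\$([0-9,]+(?:\.[0-9]+)?)')
AMD_PATTERN = re.compile(r'([0-9,]+(?:\.[0-9]+)?)\s*֏')

def extract_price(text):
    """Return (amount, currency), or (None, None) if no price is found."""
    for pattern, currency in ((USD_PATTERN, 'USD'), (AMD_PATTERN, 'AMD')):
        match = pattern.search(text)
        if match:
            # Strip thousands separators before converting to a number
            return float(match.group(1).replace(',', '')), currency
    return None, None

# Invented sample strings (assumption: real listings look roughly like this)
samples = [
    "3 սեն. բնակարան, 85 քմ, $120,000",
    "տուն Ավան, 45,000,000 ֏",
    "no price here",
]
for sample in samples:
    print(extract_price(sample))
```

Returning a number rather than a string at this step would also remove the need for later `pd.to_numeric` conversions, at the cost of changing the shape of the `listing` dictionary.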

In [None]:
# Data Analysis and Visualization
if listings:
    print("\n📈 Analyzing scraped data...")
    
    # Perform analysis
    analysis = analyze_listings(listings)
    
    print(f"\n📊 Analysis Results:")
    print("=" * 60)
    print(f"📋 Total listings: {analysis['total_listings']}")
    print(f"🏙️ Unique locations: {analysis['unique_locations']}")
    
    print(f"\n🏠 Property types:")
    for prop_type, count in analysis['property_types'].items():
        print(f"   {prop_type}: {count}")
    
    print(f"\n💰 Currency distribution:")
    for currency, count in analysis['currency_distribution'].items():
        if currency:  # Skip None values
            print(f"   {currency}: {count}")
    
    if 'USD' in analysis['price_stats']:
        usd_stats = analysis['price_stats']['USD']
        print(f"\n💵 USD Price statistics:")
        print(f"   Count: {usd_stats['count']}")
        print(f"   Average: ${usd_stats['mean']:,.2f}")
        print(f"   Median: ${usd_stats['median']:,.2f}")
        print(f"   Range: ${usd_stats['min']:,.0f} - ${usd_stats['max']:,.0f}")
    
    if analysis['area_stats']:
        area_stats = analysis['area_stats']
        print(f"\n📐 Area statistics (sqm):")
        print(f"   Count: {area_stats['count']}")
        print(f"   Average: {area_stats['mean']:.1f} sqm")
        print(f"   Median: {area_stats['median']:.1f} sqm")
        print(f"   Range: {area_stats['min']} - {area_stats['max']} sqm")
    
    print(f"\n🗺️ Top locations:")
    for location, count in list(analysis['location_stats'].items())[:5]:
        if location:  # Skip None values
            print(f"   {location}: {count}")
    
    # Save data to CSV
    df = pd.DataFrame(listings)
    filename = f'listam_listings_{pd.Timestamp.now().strftime("%Y%m%d_%H%M%S")}.csv'
    df.to_csv(filename, index=False, encoding='utf-8')
    print(f"\n💾 Data saved to: {filename}")
    
    # Save analysis to JSON
    analysis_filename = f'listam_analysis_{pd.Timestamp.now().strftime("%Y%m%d_%H%M%S")}.json'
    with open(analysis_filename, 'w', encoding='utf-8') as f:
        json.dump(analysis, f, ensure_ascii=False, indent=2, default=str)
    print(f"📊 Analysis saved to: {analysis_filename}")
else:
    print("❌ No data to analyze")

In [None]:
# Advanced List.am Scraping Techniques

def scrape_detailed_listing(listing_url, headers=None):
    """
    Scrape detailed information from a single listing page
    
    Args:
        listing_url (str): URL of the specific listing
        headers (dict): HTTP headers to use
    
    Returns:
        dict: Detailed listing information
    """
    
    if headers is None:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
    
    try:
        response = requests.get(listing_url, headers=headers, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Extract detailed information (adjust selectors based on actual page structure)
        details = {
            'url': listing_url,
            'title': None,
            'price': None,
            'description': None,
            'contact_info': None,
            'images': [],
            'features': [],
            'posted_date': None
        }
        
        # Extract title
        title_element = soup.find('h1') or soup.find('title')
        if title_element:
            details['title'] = title_element.get_text(strip=True)
        
        # Extract description
        desc_selectors = [
            'div.description', 
            'div.content', 
            '.item-description',
            'p'
        ]
        
        for selector in desc_selectors:
            desc_element = soup.select_one(selector)
            if desc_element and len(desc_element.get_text(strip=True)) > 50:
                details['description'] = desc_element.get_text(strip=True)
                break
        
        # Extract images
        img_elements = soup.find_all('img', src=True)
        for img in img_elements:
            src = img.get('src')
            if src and ('jpg' in src or 'jpeg' in src or 'png' in src):
                full_img_url = urljoin(listing_url, src)
                details['images'].append(full_img_url)
        
        # Extract contact information (phone numbers)
        text_content = soup.get_text()
        phone_patterns = [
            r'\+374\s?\d{2}\s?\d{3}\s?\d{3}',  # Armenian format
            r'0\d{2}\s?\d{3}\s?\d{3}',        # Local format
            r'\d{2}-\d{2}-\d{2}'              # Alternative format
        ]
        
        for pattern in phone_patterns:
            phones = re.findall(pattern, text_content)
            if phones:
                details['contact_info'] = phones[0]
                break
        
        return details
        
    except Exception as e:
        print(f"❌ Error scraping detailed listing {listing_url}: {e}")
        return {}

def create_price_monitor(target_criteria, check_interval=3600):
    """
    Create a price monitoring system for specific criteria
    
    Args:
        target_criteria (dict): Criteria to monitor (location, max_price, min_area, etc.)
        check_interval (int): Intended interval between checks in seconds (currently unused; the caller decides when to call the monitor)
    
    Returns:
        function: Monitoring function
    """
    
    def monitor():
        print(f"🔍 Monitoring for: {target_criteria}")
        
        # Get current listings
        current_listings = scrape_listam_listings(max_pages=1, delay=2)
        
        matching_listings = []
        
        for listing in current_listings:
            matches = True
            
            # Check location
            if 'location' in target_criteria:
                if listing['location'] != target_criteria['location']:
                    matches = False
            
            # Check max price
            if 'max_price_usd' in target_criteria and listing['price'] and listing['price_currency'] == 'USD':
                try:
                    price = float(listing['price'].replace(',', ''))
                    if price > target_criteria['max_price_usd']:
                        matches = False
                except ValueError:
                    pass
            
            # Check minimum area
            if 'min_area' in target_criteria and listing['area_sqm']:
                try:
                    area = int(listing['area_sqm'])
                    if area < target_criteria['min_area']:
                        matches = False
                except ValueError:
                    pass
            
            # Check property type
            if 'property_type' in target_criteria:
                if listing['property_type'] != target_criteria['property_type']:
                    matches = False
            
            if matches:
                matching_listings.append(listing)
        
        if matching_listings:
            print(f"🎯 Found {len(matching_listings)} matching listings:")
            for listing in matching_listings:
                print(f"   - {listing['property_type']} in {listing['location']}: {listing['price']} {listing['price_currency']}")
                print(f"     URL: {listing['url']}")
        else:
            print("❌ No matching listings found")
        
        return matching_listings
    
    return monitor

# Example: Monitor for apartments in Kentron under $200,000
print("\n🎯 Setting up price monitoring example...")
monitor_criteria = {
    'location': 'Կենտրոն',
    'max_price_usd': 200000,
    'min_area': 50,
    'property_type': 'Apartment'
}

price_monitor = create_price_monitor(monitor_criteria)

print("\n💡 Price monitor created! You can run price_monitor() to check for matching listings.")
print("🔄 In a real application, you would schedule this to run periodically.")

# Example of running the monitor once (left commented out so the cell does not hit the site on every run)
print("\n🏃‍♂️ To run the price monitor once, uncomment the line below.")
# matching = price_monitor()  # Uncomment to run the monitor
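
The `check_interval` argument hints at periodic checks, but the monitor above only runs when it is called. A minimal way to repeat it is a loop with a pause between runs. In the sketch below, `run_periodically` is a hypothetical helper introduced for illustration; it accepts any zero-argument callable, and a stand-in function replaces the real `price_monitor` so that no requests are sent.

```python
import time

def run_periodically(task, interval_seconds, max_runs):
    """Call task() max_runs times, pausing interval_seconds between calls.

    Returns a list with one result per run.
    """
    results = []
    for run in range(max_runs):
        results.append(task())
        # Pause between runs, but not after the final one
        if run < max_runs - 1:
            time.sleep(interval_seconds)
    return results

# Stand-in for price_monitor so the sketch runs without network access
def fake_monitor():
    return ["listing-1"]

history = run_periodically(fake_monitor, interval_seconds=0.1, max_runs=3)
print(f"Ran {len(history)} checks")
```

With the real monitor the call might look like `run_periodically(price_monitor, 3600, max_runs=24)`. For long-running jobs, an external scheduler such as cron is usually a better fit than a sleeping loop.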

### 🎯 Advanced Beautiful Soup Techniques

#### 1. Different Parsing Methods:

In [None]:
# Advanced Beautiful Soup techniques
from bs4 import BeautifulSoup
import re

sample_html = """
<div class="container">
    <div class="product" data-price="29.99" data-category="electronics">
        <h3>Smartphone</h3>
        <p class="description">Latest smartphone with amazing features</p>
        <span class="price">$29.99</span>
        <div class="reviews">
            <span class="rating">4.5</span>
            <span class="review-count">(150 reviews)</span>
        </div>
    </div>
    
    <div class="product" data-price="599.99" data-category="electronics">
        <h3>Laptop</h3>
        <p class="description">High-performance laptop for professionals</p>
        <span class="price">$599.99</span>
        <div class="reviews">
            <span class="rating">4.8</span>
            <span class="review-count">(89 reviews)</span>
        </div>
    </div>
    
    <article class="blog-post">
        <h2>Tech News</h2>
        <p>Latest technology trends and updates...</p>
        <time datetime="2025-01-15">January 15, 2025</time>
    </article>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

print("🔧 Advanced Beautiful Soup Techniques:")
print("=" * 50)

# 1. Find with attributes
print("\n1️⃣ Finding by attributes:")
expensive_products = soup.find_all('div', {'data-price': lambda x: x and float(x) > 100})
for product in expensive_products:
    name = product.h3.text
    price = product.get('data-price')
    print(f"   {name}: ${price}")

# 2. Using regular expressions
print("\n2️⃣ Using regex patterns:")
price_spans = soup.find_all('span', string=re.compile(r'\$\d+\.\d+'))
for span in price_spans:
    print(f"   Found price: {span.text}")

# 3. CSS selectors advanced
print("\n3️⃣ Advanced CSS selectors:")
# Select products that contain a rating, then keep those rated above 4.5
high_rated = soup.select('div.product:has(.rating)')
for product in high_rated:
    name = product.h3.text
    rating = product.select_one('.rating').text
    if float(rating) > 4.5:
        print(f"   High-rated: {name} ({rating}⭐)")

# 4. Parent and sibling navigation
print("\n4️⃣ Navigation between elements:")
rating_element = soup.find('span', class_='rating')
if rating_element:
    # Get parent
    reviews_div = rating_element.parent
    print(f"   Parent element: {reviews_div.name}")
    
    # Get sibling
    review_count = rating_element.find_next_sibling('span')
    print(f"   Review count: {review_count.text}")

# 5. Extracting numbers from text
print("\n5️⃣ Extracting numbers from text:")
review_texts = soup.find_all('span', class_='review-count')
for review in review_texts:
    # Extract number using regex
    numbers = re.findall(r'\d+', review.text)
    if numbers:
        print(f"   Reviews: {numbers[0]}")

# 6. Custom filters
print("\n6️⃣ Custom filters:")
def has_class_and_data_price(tag):
    return tag.has_attr('class') and tag.has_attr('data-price')

products_with_price = soup.find_all(has_class_and_data_price)
for product in products_with_price:
    print(f"   Product: {product.h3.text}, Price: ${product['data-price']}")

## 🚗 Selenium - For Dynamic and JavaScript-Heavy Websites

### 🤔 When Do You Need Selenium?

**Beautiful Soup + Requests** works great for static HTML, but many modern websites use JavaScript to load content dynamically. This is where Selenium comes in.

#### Signs You Need Selenium:
- Content loads after the page loads (AJAX)
- You need to click buttons or fill forms
- The data you want appears only after user interaction
- The website is a Single Page Application (SPA)
- You see "Loading..." messages or spinners
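
One quick way to check the first sign is to look for the target element in the raw HTML that the server returns: if it is missing there but visible in the browser, JavaScript is probably adding it later. The sketch below uses only the standard library; `ClassFinder` and `appears_in_static_html` are hypothetical names, and the two HTML strings are made-up examples.

```python
from html.parser import HTMLParser

class ClassFinder(HTMLParser):
    """Records whether any tag in the document carries the given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None
        classes = (dict(attrs).get('class') or '').split()
        if self.css_class in classes:
            self.found = True

def appears_in_static_html(html, css_class):
    """Return True if css_class is present in the HTML as delivered by the server."""
    finder = ClassFinder(css_class)
    finder.feed(html)
    return finder.found

# Made-up pages: one ships the content, the other expects a script to render it
static_page = '<div class="quote">Hello</div>'
js_page = '<div id="app"></div><script>/* renders quotes later */</script>'

print(appears_in_static_html(static_page, 'quote'))  # True
print(appears_in_static_html(js_page, 'quote'))      # False
```

In practice the `html` argument would be the text of a `requests.get(...)` response. A `False` result for an element you can see in the browser suggests Selenium (or the site's underlying API) is needed.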

#### What Selenium Does:
- **Controls a real browser** (Chrome, Firefox, Safari)
- **Executes JavaScript** the way a regular browser session does
- **Waits for content** to load dynamically
- **Simulates user actions** (clicks, typing, scrolling)

### ⚡ Selenium vs Beautiful Soup Comparison:

| Feature | Beautiful Soup | Selenium |
|---------|----------------|----------|
| **Speed** | ⚡ Very fast | 🐌 Slower (launches browser) |
| **JavaScript** | ❌ No support | ✅ Full support |
| **User Interaction** | ❌ Cannot click/type | ✅ Can simulate user actions |
| **Memory Usage** | 💚 Low | 🔴 High (browser overhead) |
| **Complexity** | 💚 Simple | 🟡 More complex setup |
| **Best For** | Static websites | Dynamic/interactive websites |

### 🛠️ Selenium Installation & Setup

#### Step 1: Install Selenium
```bash
pip install selenium webdriver-manager
```

#### Step 2: Understanding WebDrivers
Selenium needs a **WebDriver** to control browsers:
- **ChromeDriver** - For Google Chrome
- **GeckoDriver** - For Firefox  
- **EdgeDriver** - For Microsoft Edge

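**Good News**: `webdriver-manager` automatically downloads the correct driver! (Selenium 4.6 and later also ships with Selenium Manager, which can locate or download a driver on its own.)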

#### Step 3: Basic Setup Options

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Option 1: Visible browser (for development/debugging)
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Option 2: Headless browser (for production)
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
```

#### Step 4: Common Chrome Options
```python
options = webdriver.ChromeOptions()
options.add_argument("--headless")          # Run without GUI
options.add_argument("--no-sandbox")        # Required for some environments
options.add_argument("--disable-dev-shm-usage")  # Overcome limited resource problems
options.add_argument("--window-size=1920,1080")  # Set window size
options.add_argument("--user-agent=Custom User Agent")  # Custom user agent
```

In [None]:
# Install Selenium and WebDriver
!pip install selenium webdriver-manager

In [None]:
# Selenium Basic Example - Part 1: Setup and Navigation
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options

def selenium_basic_demo():
    """Demonstrate Selenium basic usage"""
    
    print("🚗 Selenium Basic Demo - Part 1: Setup")
    print("=" * 45)
    
    # Setup Chrome options for demo
    options = Options()
    options.add_argument("--headless")  # Run without GUI for demo
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    
    driver = None
    try:
        print("\n📋 Step 1: Initialize WebDriver")
        # This automatically downloads ChromeDriver if needed
        service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=options)
        print("   ✅ Chrome WebDriver initialized successfully")
        
        print("\n🌐 Step 2: Navigate to website")
        url = "https://quotes.toscrape.com/js/"  # JavaScript version
        driver.get(url)
        print(f"   📍 Navigated to: {url}")
        print(f"   📄 Page title: {driver.title}")
        
        print("\n⏳ Step 3: Wait for content to load")
        # Wait up to 10 seconds for quotes to appear
        wait = WebDriverWait(driver, 10)
        quotes = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "quote")))
        print(f"   ✅ Found {len(quotes)} quotes after waiting for JavaScript")
        
        print(f"\n📊 Page Information:")
        print(f"   Current URL: {driver.current_url}")
        print(f"   Page source length: {len(driver.page_source):,} characters")
        
        # The caller is responsible for calling driver.quit() when finished
        return driver, quotes
        
    except Exception as e:
        print(f"❌ Error in Selenium demo: {e}")
        # Close the browser if it was started so a failed run does not leak a Chrome process
        if driver is not None:
            driver.quit()
        return None, []

# Note: This demo shows setup - actual scraping in next cell
print("Note: This example shows Selenium setup and navigation.")
print("For full functionality, Chrome browser and ChromeDriver are required.")
print("In Colab/Jupyter environments, additional setup might be needed.")

# Uncomment the line below to run the demo (if Chrome is available)
# driver, quotes = selenium_basic_demo()
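
Because `selenium_basic_demo` returns a live driver, the caller has to remember to call `driver.quit()`. A context manager makes that automatic, even when the scraping code raises. This is a sketch under assumptions: `managed_driver` is a hypothetical helper, and `FakeDriver` stands in for `webdriver.Chrome` so the example runs without a browser.

```python
from contextlib import contextmanager

@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and guarantee quit() is called afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and when the with-block raises
        driver.quit()

class FakeDriver:
    """Stand-in for a real WebDriver; it only tracks whether quit() was called."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True

try:
    with managed_driver(FakeDriver) as driver:
        raise RuntimeError("simulated scraping failure")
except RuntimeError:
    pass

print(f"Driver closed after error: {driver.closed}")
```

With Selenium installed, `factory` could be something like `lambda: webdriver.Chrome(service=service, options=options)`.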

In [None]:
# Selenium Basic Example - Part 2: Data Extraction and Interaction
# This continues from Part 1

def selenium_scraping_demo():
    """Demonstrate Selenium data extraction and interaction"""
    
    print("🚗 Selenium Basic Demo - Part 2: Data Extraction")
    print("=" * 50)
    
    print("\n📝 Common Selenium Element Location Methods:")
    print("   By.CLASS_NAME    → find_element(By.CLASS_NAME, 'quote')")
    print("   By.ID            → find_element(By.ID, 'main-content')")
    print("   By.TAG_NAME      → find_element(By.TAG_NAME, 'h1')")
    print("   By.CSS_SELECTOR  → find_element(By.CSS_SELECTOR, '.quote .text')")
    print("   By.XPATH         → find_element(By.XPATH, '//div[@class=\"quote\"]')")
    
    # Simulated data extraction (would work with real driver)
    simulated_quotes = [
        {
            'text': '"The world as we have created it is a process of our thinking."',
            'author': 'Albert Einstein',
            'tags': ['change', 'deep-thoughts', 'thinking', 'world']
        },
        {
            'text': '"It is our choices, Harry, that show what we truly are."',
            'author': 'J.K. Rowling',
            'tags': ['abilities', 'choices']
        },
        {
            'text': '"There are only two ways to live your life."',
            'author': 'Albert Einstein',
            'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']
        }
    ]
    
    print(f"\n🔍 Extracting Data with Selenium:")
    print("   (Simulated - shows the process)")
    
    for i, quote_data in enumerate(simulated_quotes, 1):
        print(f"\n💬 Quote {i}:")
        print(f"   Text: {quote_data['text']}")
        print(f"   Author: {quote_data['author']}")
        print(f"   Tags: {', '.join(quote_data['tags'])}")
    
    print(f"\n🎯 Real Selenium Code Pattern:")
    selenium_code = '''
# Real Selenium extraction code:
quotes = driver.find_elements(By.CLASS_NAME, "quote")

for quote in quotes:
    text = quote.find_element(By.CLASS_NAME, "text").text
    author = quote.find_element(By.CLASS_NAME, "author").text
    tags = [tag.text for tag in quote.find_elements(By.CLASS_NAME, "tag")]
    
    quote_data = {
        'text': text,
        'author': author,
        'tags': tags
    }
'''
    
    print(selenium_code)
    
    print(f"\n🖱️ Selenium Interaction Examples:")
    interaction_code = '''
# Click elements
button = driver.find_element(By.ID, "load-more-btn")
button.click()

# Fill forms
search_box = driver.find_element(By.NAME, "search")
search_box.send_keys("python")
search_box.submit()

# Scroll page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Wait for specific conditions
wait = WebDriverWait(driver, 10)
element = wait.until(EC.element_to_be_clickable((By.ID, "submit-btn")))
'''
    
    print(interaction_code)
    
    print(f"\n⚠️ Important Selenium Concepts:")
    print("   🕐 Explicit Waits: Wait for specific conditions")
    print("   🕑 Implicit Waits: Global wait time for all elements")
    print("   🎭 Headless Mode: Run without visible browser")
    print("   🔒 Always Close: driver.quit() to free resources")

# Run the demo
selenium_scraping_demo()
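
An explicit wait is essentially a polling loop: check a condition repeatedly until it returns something truthy or a timeout expires. The sketch below shows that idea in plain Python. `wait_until` is a hypothetical helper written for illustration, not Selenium's implementation, and the "page" is simulated with a timer.

```python
import time

def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Simulated page: the "element" becomes available 0.3 seconds after start
start = time.monotonic()

def element_loaded():
    return "quote element" if time.monotonic() - start > 0.3 else None

element = wait_until(element_loaded, timeout=2.0, poll_interval=0.1)
print(f"Found: {element}")
```

`WebDriverWait(driver, 10).until(...)` plays the same role in real Selenium code, with the `EC.*` helpers supplying the condition.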

# 🚀 Parallel Web Scraping & Multiprocessing

When scraping large amounts of data, performance becomes crucial. Python's multiprocessing and libraries like `joblib` allow us to speed up scraping by processing multiple URLs simultaneously.

## 🧠 Why Use Parallel Processing?

**Sequential Processing:**
- Scrapes one URL at a time
- Total time = (number of URLs) × (average time per URL)
- CPU cores and network capacity sit idle while each request waits for its response

**Parallel Processing:**
- Scrapes multiple URLs simultaneously
- Total time ≈ (number of URLs) ÷ (number of workers) × (average time per URL)
- Better resource utilization
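
Scraping is usually I/O-bound: most of the time goes to waiting for responses, so a thread pool is often enough to approach the estimate above. The sketch below simulates this with `time.sleep` in place of a real request; `fake_fetch` and the URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(url):
    """Pretend to download a page; the sleep stands in for network latency."""
    time.sleep(0.2)
    return f"content of {url}"

urls = [f"https://example.com/page/{i}" for i in range(8)]

# Sequential: one request after another
start = time.perf_counter()
sequential = [fake_fetch(url) for url in urls]
sequential_time = time.perf_counter() - start

# Threaded: up to 4 requests in flight at once
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map returns results in the same order as the input URLs
    threaded = list(executor.map(fake_fetch, urls))
threaded_time = time.perf_counter() - start

print(f"Sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s")
print(f"Same results: {sequential == threaded}")
```

With 8 URLs, 4 workers, and 0.2 s per request, the estimate gives 8 ÷ 4 × 0.2 = 0.4 s against 1.6 s sequentially; measured times will be slightly higher because of thread overhead.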

⚠️ **Important**: Always respect websites' rate limits and robots.txt when using parallel processing!
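
Python's standard library can answer the robots.txt question before any worker starts. The sketch below parses an inline, made-up robots.txt so that it runs offline; for a real site you would call `set_url(...)` and `read()` on the parser instead. The user agent name `my-scraper` is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real site publishes these at /robots.txt
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

candidate_urls = [
    "https://example.com/listings/1",
    "https://example.com/private/admin",
]

# Keep only the URLs that the rules allow for our user agent
allowed = [url for url in candidate_urls if parser.can_fetch("my-scraper", url)]
# crawl_delay returns None when the file does not specify one
delay = parser.crawl_delay("my-scraper")

print(f"Allowed: {allowed}")
print(f"Requested delay between requests: {delay} seconds")
```

Filtering the URL list first, and using the crawl delay as a lower bound for per-request pauses, keeps a parallel scraper within what the site has asked for.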

## 🔧 Basic Multiprocessing Concepts

Before applying multiprocessing to web scraping, let's understand the basics with simple examples.

In [None]:
import time
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import requests
from joblib import Parallel, delayed

# Example 1: A slow task (Sequential vs Parallel)
def square_number(n):
    """Square a number; the sleep stands in for real computation time"""
    time.sleep(0.1)  # Simulate computation time
    return n ** 2

def demonstrate_multiprocessing():
    numbers = list(range(1, 21))  # 1 to 20
    
    # Sequential processing
    print("🐌 Sequential Processing:")
    start_time = time.time()
    sequential_results = [square_number(n) for n in numbers]
    sequential_time = time.time() - start_time
    print(f"   Time taken: {sequential_time:.2f} seconds")
    print(f"   Results: {sequential_results[:5]}... (showing first 5)")
    
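    # Parallel processing with multiprocessing
    # Note: on platforms that spawn worker processes (Windows, macOS), functions defined
    # in a notebook cell may fail to pickle; moving them into an importable .py module avoids this
    print("\n⚡ Parallel Processing (multiprocessing):")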
    start_time = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        parallel_results = list(executor.map(square_number, numbers))
    parallel_time = time.time() - start_time
    print(f"   Time taken: {parallel_time:.2f} seconds")
    print(f"   Results: {parallel_results[:5]}... (showing first 5)")
    print(f"   Speedup: {sequential_time/parallel_time:.2f}x faster")

# Run the demonstration
demonstrate_multiprocessing()

## 📦 Introduction to Joblib

`joblib` is a powerful library that makes parallel computing easy and efficient. It's particularly great for:
- CPU-bound tasks
- Machine learning workloads
- Data processing pipelines

**Key advantages:**
- Simple API: `Parallel(n_jobs=-1)(delayed(function)(args) for args in data)`
- Automatic memory optimization
- Built-in progress reporting through the `verbose` parameter
- Works well with NumPy arrays

In [None]:
# Install joblib if not already installed
# !pip install joblib

import time  # used by process_data and the timing code below
from joblib import Parallel, delayed

def process_data(x):
    """Simulate data processing"""
    time.sleep(0.05)
    return x ** 3 + 2 * x ** 2 + x + 1

def demonstrate_joblib():
    data = list(range(1, 51))  # 1 to 50
    
    print("🔧 Joblib Examples:")
    
    # Sequential processing
    print("\n🐌 Sequential Processing:")
    start_time = time.time()
    sequential_results = [process_data(x) for x in data]
    sequential_time = time.time() - start_time
    print(f"   Time taken: {sequential_time:.2f} seconds")
    
    # Parallel processing with joblib (all CPU cores)
    print("\n⚡ Joblib Parallel (all cores):")
    start_time = time.time()
    parallel_results = Parallel(n_jobs=-1)(delayed(process_data)(x) for x in data)
    parallel_time = time.time() - start_time
    print(f"   Time taken: {parallel_time:.2f} seconds")
    print(f"   Speedup: {sequential_time/parallel_time:.2f}x faster")
    
    # Parallel processing with specific number of workers
    print("\n⚡ Joblib Parallel (4 workers):")
    start_time = time.time()
    parallel_results_4 = Parallel(n_jobs=4)(delayed(process_data)(x) for x in data)
    parallel_time_4 = time.time() - start_time
    print(f"   Time taken: {parallel_time_4:.2f} seconds")
    
    # With verbose progress tracking
    print("\n📊 Joblib with Progress Tracking:")
    start_time = time.time()
    parallel_results_verbose = Parallel(n_jobs=4, verbose=1)(
        delayed(process_data)(x) for x in data
    )
    verbose_time = time.time() - start_time
    print(f"   Time taken: {verbose_time:.2f} seconds")
    
    # Verify results are the same
    print(f"\n✅ Results match: {sequential_results == parallel_results}")

# Run joblib demonstration
demonstrate_joblib()

## 🌐 Parallel Web Scraping Examples

Now let's apply these concepts to web scraping. We'll compare sequential vs parallel approaches for scraping multiple URLs.

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time
from joblib import Parallel, delayed

def scrape_single_url(url, timeout=10):
    """Scrape a single URL and extract basic information"""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Extract basic information
        title = soup.find('title')
        title_text = title.get_text(strip=True) if title else "No title"
        
        # Count paragraphs
        paragraphs = soup.find_all('p')
        paragraph_count = len(paragraphs)
        
        # Count links
        links = soup.find_all('a', href=True)
        link_count = len(links)
        
        # Get first paragraph text (if available)
        first_paragraph = ""
        if paragraphs:
            first_paragraph = paragraphs[0].get_text(strip=True)[:200] + "..."
        
        return {
            'url': url,
            'title': title_text[:100],  # Limit title length
            'status': 'success',
            'paragraph_count': paragraph_count,
            'link_count': link_count,
            'first_paragraph': first_paragraph,
            'response_time': response.elapsed.total_seconds()
        }
        
    except requests.exceptions.RequestException as e:
        return {
            'url': url,
            'title': None,
            'status': 'error',
            'error': str(e),
            'paragraph_count': 0,
            'link_count': 0,
            'first_paragraph': '',
            'response_time': None
        }
    except Exception as e:
        return {
            'url': url,
            'title': None,
            'status': 'error',
            'error': f"Parsing error: {str(e)}",
            'paragraph_count': 0,
            'link_count': 0,
            'first_paragraph': '',
            'response_time': None
        }

def scrape_urls_sequential(urls):
    """Scrape URLs one by one (sequential)"""
    print("🐌 Sequential scraping...")
    start_time = time.time()
    
    results = []
    for i, url in enumerate(urls, 1):
        print(f"   Scraping {i}/{len(urls)}: {url[:50]}...")
        result = scrape_single_url(url)
        results.append(result)
        time.sleep(1)  # Be respectful - add delay
    
    total_time = time.time() - start_time
    print(f"   Sequential time: {total_time:.2f} seconds")
    return results, total_time

def scrape_urls_parallel_joblib(urls, n_jobs=4):
    """Scrape URLs in parallel using joblib"""
    print(f"⚡ Parallel scraping with joblib ({n_jobs} workers)...")
    start_time = time.time()
    
    # Add delays in parallel execution too (but spread out)
    def scrape_with_delay(url, delay_factor):
        time.sleep(delay_factor * 0.5)  # Staggered delays
        return scrape_single_url(url)
    
    # Create delay factors for staggered requests
    delay_factors = [i % 4 for i in range(len(urls))]
    
    results = Parallel(n_jobs=n_jobs, verbose=1)(
        delayed(scrape_with_delay)(url, delay) 
        for url, delay in zip(urls, delay_factors)
    )
    
    total_time = time.time() - start_time
    print(f"   Parallel time: {total_time:.2f} seconds")
    return results, total_time

# Test URLs (using public APIs and websites that allow scraping)
test_urls = [
    'https://httpbin.org/html',
    'https://httpbin.org/json',
    'https://jsonplaceholder.typicode.com/posts/1',
    'https://jsonplaceholder.typicode.com/posts/2',
    'https://httpbin.org/xml',
    'https://httpbin.org/robots.txt',
    'https://jsonplaceholder.typicode.com/users/1',
    'https://jsonplaceholder.typicode.com/users/2'
]

print("🌐 Web Scraping Performance Comparison")
print("=" * 50)

# Sequential scraping
sequential_results, seq_time = scrape_urls_sequential(test_urls)

print("\n" + "=" * 50)

# Parallel scraping
parallel_results, par_time = scrape_urls_parallel_joblib(test_urls, n_jobs=4)

# Compare results
print(f"\n📊 Performance Summary:")
print(f"   URLs scraped: {len(test_urls)}")
print(f"   Sequential time: {seq_time:.2f} seconds")
print(f"   Parallel time: {par_time:.2f} seconds")
print(f"   Speedup: {seq_time/par_time:.2f}x faster")

# Show success rates
seq_success = sum(1 for r in sequential_results if r['status'] == 'success')
par_success = sum(1 for r in parallel_results if r['status'] == 'success')

print(f"\n✅ Success Rates:")
print(f"   Sequential: {seq_success}/{len(test_urls)} ({seq_success/len(test_urls)*100:.1f}%)")
print(f"   Parallel: {par_success}/{len(test_urls)} ({par_success/len(test_urls)*100:.1f}%)")

# Show sample results
print(f"\n📄 Sample Results (first 3):")
for i, result in enumerate(parallel_results[:3]):
    print(f"   {i+1}. {result['url']}")
    print(f"      Title: {result['title']}")
    print(f"      Status: {result['status']}")
    if result['status'] == 'success':
        print(f"      Paragraphs: {result['paragraph_count']}, Links: {result['link_count']}")
    print()

## 🛡️ Advanced Parallel Scraping with Rate Limiting

When scraping real websites, we need to be more careful about rate limiting, error handling, and respecting server resources.

In [None]:
import random
from threading import Lock
import threading
from datetime import datetime, timedelta

class RateLimitedScraper:
    """A rate-limited web scraper with parallel processing capabilities"""
    
    def __init__(self, requests_per_second=2, max_retries=3):
        self.requests_per_second = requests_per_second
        self.max_retries = max_retries
        self.last_request_time = {}
        self.lock = Lock()
        
    def wait_if_needed(self, domain):
        """Implement rate limiting per domain"""
        with self.lock:
            now = datetime.now()
            if domain in self.last_request_time:
                time_since_last = (now - self.last_request_time[domain]).total_seconds()
                min_interval = 1.0 / self.requests_per_second
                
                if time_since_last < min_interval:
                    sleep_time = min_interval - time_since_last
                    time.sleep(sleep_time)
            
            self.last_request_time[domain] = datetime.now()
    
    def extract_domain(self, url):
        """Extract domain from URL"""
        from urllib.parse import urlparse
        return urlparse(url).netloc
    
    def scrape_with_retries(self, url):
        """Scrape URL with retry logic and rate limiting"""
        domain = self.extract_domain(url)
        
        for attempt in range(self.max_retries):
            try:
                # Implement rate limiting
                self.wait_if_needed(domain)
                
                headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                }
                
                response = requests.get(url, headers=headers, timeout=15)
                response.raise_for_status()
                
                soup = BeautifulSoup(response.content, 'html.parser')
                
                # Extract comprehensive data
                result = {
                    'url': url,
                    'status': 'success',
                    'attempt': attempt + 1,
                    'timestamp': datetime.now().isoformat(),
                    'response_code': response.status_code,
                    'content_length': len(response.content),
                    'title': '',
                    'meta_description': '',
                    'headings': {},
                    'link_count': 0,
                    'image_count': 0,
                    'form_count': 0,
                    'text_content_length': 0
                }
                
                # Extract title
                title_tag = soup.find('title')
                if title_tag:
                    result['title'] = title_tag.get_text(strip=True)
                
                # Extract meta description
                meta_desc = soup.find('meta', attrs={'name': 'description'})
                if meta_desc:
                    result['meta_description'] = meta_desc.get('content', '')
                
                # Count different elements
                result['link_count'] = len(soup.find_all('a', href=True))
                result['image_count'] = len(soup.find_all('img'))
                result['form_count'] = len(soup.find_all('form'))
                
                # Count headings
                for i in range(1, 7):
                    headings = soup.find_all(f'h{i}')
                    if headings:
                        result['headings'][f'h{i}'] = len(headings)
                
                # Get text content length
                text_content = soup.get_text(strip=True)
                result['text_content_length'] = len(text_content)
                
                return result
                
            except requests.exceptions.RequestException as e:
                if attempt == self.max_retries - 1:  # Last attempt
                    return {
                        'url': url,
                        'status': 'error',
                        'error': str(e),
                        'attempt': attempt + 1,
                        'timestamp': datetime.now().isoformat()
                    }
                else:
                    # Wait before retry (exponential backoff)
                    wait_time = (2 ** attempt) + random.uniform(0, 1)
                    time.sleep(wait_time)
            
            except Exception as e:
                return {
                    'url': url,
                    'status': 'error',
                    'error': f"Unexpected error: {str(e)}",
                    'attempt': attempt + 1,
                    'timestamp': datetime.now().isoformat()
                }

def parallel_scrape_with_rate_limiting(urls, n_jobs=3, requests_per_second=2):
    """Scrape URLs in parallel with rate limiting"""
    scraper = RateLimitedScraper(requests_per_second=requests_per_second)
    
    print(f"🚀 Advanced Parallel Scraping:")
    print(f"   URLs: {len(urls)}")
    print(f"   Workers: {n_jobs}")
    print(f"   Rate limit: {requests_per_second} requests/second per domain")
    
    start_time = time.time()
    
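    # Use threads so all workers share one scraper: the lock and the per-domain
    # timestamps then apply across workers. The default process-based backend
    # would have to pickle the scraper, and Lock objects cannot be pickled.
    results = Parallel(n_jobs=n_jobs, backend="threading", verbose=1)(
        delayed(scraper.scrape_with_retries)(url) for url in urls
    )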
    
    total_time = time.time() - start_time
    
    # Analyze results
    successful = [r for r in results if r['status'] == 'success']
    failed = [r for r in results if r['status'] == 'error']
    
    print(f"\n📊 Scraping Summary:")
    print(f"   Total time: {total_time:.2f} seconds")
    print(f"   Average time per URL: {total_time/len(urls):.2f} seconds")
    print(f"   Successful: {len(successful)}/{len(urls)} ({len(successful)/len(urls)*100:.1f}%)")
    print(f"   Failed: {len(failed)}/{len(urls)} ({len(failed)/len(urls)*100:.1f}%)")
    
    if successful:
        avg_content_length = sum(r['content_length'] for r in successful) / len(successful)
        total_links = sum(r['link_count'] for r in successful)
        total_images = sum(r['image_count'] for r in successful)
        
        print(f"\n📄 Content Analysis:")
        print(f"   Average content length: {avg_content_length:.0f} bytes")
        print(f"   Total links found: {total_links}")
        print(f"   Total images found: {total_images}")
    
    if failed:
        print(f"\n❌ Failed URLs:")
        for fail in failed[:3]:  # Show first 3 failures
            print(f"   {fail['url']}: {fail.get('error', 'Unknown error')}")
    
    return results

# Example with mixed domains (rate limiting will be applied per domain)
mixed_urls = [
    'https://httpbin.org/html',
    'https://httpbin.org/json',
    'https://httpbin.org/xml',
    'https://jsonplaceholder.typicode.com/posts/1',
    'https://jsonplaceholder.typicode.com/posts/2',
    'https://jsonplaceholder.typicode.com/users/1',
    'https://httpbin.org/robots.txt',
    'https://httpbin.org/user-agent',
    'https://jsonplaceholder.typicode.com/comments/1',
    'https://httpbin.org/headers'
]

# Run advanced parallel scraping
results = parallel_scrape_with_rate_limiting(
    mixed_urls, 
    n_jobs=3, 
    requests_per_second=2
)

# Show detailed results for successful scrapes
print(f"\n📋 Detailed Results (first 3 successful):")
successful_results = [r for r in results if r['status'] == 'success']
for i, result in enumerate(successful_results[:3]):
    print(f"\n{i+1}. {result['url']}")
    print(f"   Title: {result['title'][:60]}...")
    print(f"   Response Code: {result['response_code']}")
    print(f"   Content Length: {result['content_length']:,} bytes")
    print(f"   Links: {result['link_count']}, Images: {result['image_count']}")
    if result['headings']:
        print(f"   Headings: {result['headings']}")
    print(f"   Attempt: {result['attempt']}")

## 🛒 Real-World Example: Parallel E-commerce Data Scraping

Let's create a practical example that simulates scraping product data from multiple pages, using parallel processing to handle large datasets efficiently.

In [None]:
import pandas as pd
import json
from pathlib import Path

class EcommerceScraper:
    """Simulate e-commerce product scraping with parallel processing"""
    
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
    
    def simulate_product_page(self, product_id):
        """Simulate scraping a product page"""
        # In real scraping, this would fetch from actual URLs
        # For demo purposes, we'll simulate data
        
        time.sleep(random.uniform(0.5, 2.0))  # Simulate network delay
        
        # Simulate occasional failures
        if random.random() < 0.1:  # 10% failure rate
            raise requests.exceptions.RequestException(f"Failed to load product {product_id}")
        
        # Generate simulated product data
        categories = ['Electronics', 'Clothing', 'Books', 'Home & Garden', 'Sports']
        brands = ['BrandA', 'BrandB', 'BrandC', 'BrandD', 'BrandE']
        
        product = {
            'product_id': product_id,
            'name': f'Product {product_id}',
            'price': round(random.uniform(10, 500), 2),
            'category': random.choice(categories),
            'brand': random.choice(brands),
            'rating': round(random.uniform(1, 5), 1),
            'review_count': random.randint(0, 1000),
            'in_stock': random.choice([True, False]),
            'description_length': random.randint(100, 1000),
            'image_count': random.randint(1, 10),
            'scrape_timestamp': datetime.now().isoformat()
        }
        
        return product
    
    def scrape_product_batch(self, product_ids):
        """Scrape a batch of product IDs"""
        results = []
        batch_start = time.time()
        
        for product_id in product_ids:
            try:
                product = self.simulate_product_page(product_id)
                product['status'] = 'success'
                results.append(product)
            except Exception as e:
                results.append({
                    'product_id': product_id,
                    'status': 'error',
                    'error': str(e),
                    'scrape_timestamp': datetime.now().isoformat()
                })
        
        batch_time = time.time() - batch_start
        return results, batch_time

def parallel_ecommerce_scraping(product_ids, batch_size=50, n_jobs=4):
    """Scrape e-commerce products in parallel batches"""
    
    # Split product IDs into batches
    batches = [product_ids[i:i + batch_size] for i in range(0, len(product_ids), batch_size)]
    
    print(f"🛒 E-commerce Parallel Scraping:")
    print(f"   Total products: {len(product_ids)}")
    print(f"   Batch size: {batch_size}")
    print(f"   Number of batches: {len(batches)}")
    print(f"   Parallel workers: {n_jobs}")
    
    scraper = EcommerceScraper()
    
    start_time = time.time()
    
    # Process batches in parallel
    batch_results = Parallel(n_jobs=n_jobs, verbose=1)(
        delayed(scraper.scrape_product_batch)(batch) for batch in batches
    )
    
    total_time = time.time() - start_time
    
    # Flatten results
    all_products = []
    total_batch_time = 0
    
    for results, batch_time in batch_results:
        all_products.extend(results)
        total_batch_time += batch_time
    
    # Analyze results
    successful_products = [p for p in all_products if p['status'] == 'success']
    failed_products = [p for p in all_products if p['status'] == 'error']
    
    print(f"\n📊 Scraping Results:")
    print(f"   Total time: {total_time:.2f} seconds")
    print(f"   Products/second: {len(product_ids)/total_time:.2f}")
    print(f"   Successful: {len(successful_products)}/{len(product_ids)} ({len(successful_products)/len(product_ids)*100:.1f}%)")
    print(f"   Failed: {len(failed_products)}/{len(product_ids)} ({len(failed_products)/len(product_ids)*100:.1f}%)")
    
    return successful_products, failed_products

def analyze_scraped_products(products):
    """Analyze the scraped product data"""
    if not products:
        print("❌ No products to analyze")
        return
    
    df = pd.DataFrame(products)
    
    print(f"\n📈 Product Data Analysis:")
    print(f"   Dataset shape: {df.shape}")
    
    # Price analysis
    if 'price' in df.columns:
        print(f"\n💰 Price Statistics:")
        print(f"   Average price: ${df['price'].mean():.2f}")
        print(f"   Median price: ${df['price'].median():.2f}")
        print(f"   Price range: ${df['price'].min():.2f} - ${df['price'].max():.2f}")
        
    # Category distribution
    if 'category' in df.columns:
        print(f"\n📂 Category Distribution:")
        category_counts = df['category'].value_counts()
        for category, count in category_counts.items():
            print(f"   {category}: {count} products ({count/len(df)*100:.1f}%)")
    
    # Brand analysis
    if 'brand' in df.columns:
        print(f"\n🏷️ Top Brands:")
        brand_counts = df['brand'].value_counts().head(5)
        for brand, count in brand_counts.items():
            print(f"   {brand}: {count} products")
    
    # Stock status
    if 'in_stock' in df.columns:
        in_stock_count = df['in_stock'].sum()
        print(f"\n📦 Stock Status:")
        print(f"   In stock: {in_stock_count}/{len(df)} ({in_stock_count/len(df)*100:.1f}%)")
        print(f"   Out of stock: {len(df)-in_stock_count}/{len(df)} ({(len(df)-in_stock_count)/len(df)*100:.1f}%)")
    
    # Rating analysis
    if 'rating' in df.columns:
        print(f"\n⭐ Rating Statistics:")
        print(f"   Average rating: {df['rating'].mean():.2f}/5.0")
        print(f"   Ratings >= 4.0: {(df['rating'] >= 4.0).sum()}/{len(df)} ({(df['rating'] >= 4.0).sum()/len(df)*100:.1f}%)")
    
    return df

# Generate sample product IDs (simulating large dataset)
product_ids = [f"PROD_{i:06d}" for i in range(1, 501)]  # 500 products

print("🚀 Starting E-commerce Parallel Scraping Demo...")

# Run parallel scraping
successful_products, failed_products = parallel_ecommerce_scraping(
    product_ids, 
    batch_size=50, 
    n_jobs=4
)

# Analyze the results
df_products = analyze_scraped_products(successful_products)

# Save results
if successful_products:
    # Save to JSON
    output_file = 'scraped_products.json'
    with open(output_file, 'w') as f:
        json.dump(successful_products, f, indent=2)
    
    # Save to CSV
    csv_file = 'scraped_products.csv'
    df_products.to_csv(csv_file, index=False)
    
    print(f"\n💾 Data saved:")
    print(f"   JSON: {output_file}")
    print(f"   CSV: {csv_file}")

# Show sample products
if successful_products:
    print(f"\n🛍️ Sample Products:")
    for i, product in enumerate(successful_products[:3]):
        print(f"\n{i+1}. {product['name']} (ID: {product['product_id']})")
        print(f"   Price: ${product['price']}")
        print(f"   Category: {product['category']}")
        print(f"   Brand: {product['brand']}")
        print(f"   Rating: {product['rating']}/5.0 ({product['review_count']} reviews)")
        print(f"   In Stock: {'✅' if product['in_stock'] else '❌'}")

In [None]:
# Example robots.txt to inspect before scraping a site: https://www.ysu.am/robots.txt

## ⚡ Performance Optimization & Best Practices

### 🎯 Choosing the Right Approach

| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Sequential** | Small datasets, strict rate limits | Simple, predictable | Slow for large datasets |
| **Threading** | I/O-bound tasks, many small requests | Good for network-bound tasks | GIL limits CPU-bound parsing (network waits are unaffected) |
| **Multiprocessing** | CPU-intensive parsing | True parallelism | Higher memory usage |
| **Joblib** | Balanced approach, data science tasks | Easy to use, optimized | Extra dependency |

### 🛡️ Rate Limiting Strategies

```python
# 1. Fixed delay between requests
time.sleep(1)

# 2. Random delay (more human-like)
time.sleep(random.uniform(0.5, 2.0))

# 3. Exponential backoff on errors
wait_time = (2 ** attempt) + random.uniform(0, 1)

# 4. Domain-specific rate limiting
# Different limits for different websites
```
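
Strategies 3 and 4 above can be combined in a small helper. The following is a minimal sketch, not a prescribed implementation: `DOMAIN_DELAYS`, `DEFAULT_DELAY`, `delay_for`, and `backoff_delay` are hypothetical names, and the delay values are placeholders to adjust for each site.

```python
import random
from urllib.parse import urlparse

# Hypothetical per-domain delays in seconds; tune these to each site's policy
DOMAIN_DELAYS = {
    "httpbin.org": 1.0,
    "jsonplaceholder.typicode.com": 0.5,
}
DEFAULT_DELAY = 2.0  # Conservative fallback for domains not listed above


def delay_for(url):
    """Return the delay (seconds) to wait before requesting this URL."""
    return DOMAIN_DELAYS.get(urlparse(url).netloc, DEFAULT_DELAY)


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential backoff with random jitter, capped at `cap` seconds."""
    return min(cap, base * (2 ** attempt) + random.uniform(0, 1))


print(delay_for("https://httpbin.org/html"))   # 1.0
print(delay_for("https://example.com/page"))   # 2.0 (fallback)
print(backoff_delay(10))                       # 30.0 (capped)
```

The cap keeps a long run of failures from producing waits of many minutes, and the jitter spreads out retries from workers that failed at the same moment.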

### 📊 Monitoring & Logging

```python
# Track success rates, response times, errors
# Use logging instead of print for production
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

logger.info(f"Scraped {url} successfully")
logger.error(f"Failed to scrape {url}: {error}")
```
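
To track success rates as well as individual failures, one option is a small summary helper. This is a sketch under assumptions: `log_summary` and the logger name `"scraper"` are hypothetical, and the result dictionaries are assumed to carry the `url`, `status`, and optional `error` keys used elsewhere in this notebook.

```python
import logging
from collections import Counter

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("scraper")  # Hypothetical logger name


def log_summary(results):
    """Log success/error counts and return the success rate in percent."""
    counts = Counter(r["status"] for r in results)
    total = len(results)
    success_rate = counts["success"] / total * 100 if total else 0.0
    logger.info(
        "Scraped %d URLs: %d succeeded, %d failed (%.1f%% success)",
        total, counts["success"], counts["error"], success_rate,
    )
    # Log each failure on its own line so it is easy to find later
    for r in results:
        if r["status"] == "error":
            logger.error("Failed to scrape %s: %s", r["url"], r.get("error", "unknown error"))
    return success_rate


sample = [
    {"url": "https://httpbin.org/html", "status": "success"},
    {"url": "https://httpbin.org/status/500", "status": "error", "error": "500 Server Error"},
]
log_summary(sample)
```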

### 🔧 Performance Tips

1. **Use connection pooling** with `requests.Session()`
2. **Implement caching** to avoid re-scraping
3. **Batch processing** for large datasets
4. **Memory management** - process in chunks
5. **Error handling** - implement retries and fallbacks
6. **Respect robots.txt** and rate limits
7. **Use appropriate timeouts**
8. **Monitor resource usage** (CPU, memory, network)
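
Tip 6 can be checked in code with the standard library's `urllib.robotparser`. The sketch below parses an inlined, hypothetical robots.txt so that it runs offline; for a real site you would point the parser at the site's robots.txt URL with `set_url()` and `read()` instead. The `is_allowed` helper and the `my-scraper` user agent are illustrative names.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs without network access
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="my-scraper"):
    """Return True if the parsed robots.txt rules allow fetching this URL."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/public/page.html"))   # True
print(is_allowed("https://example.com/private/data.html"))  # False
print(parser.crawl_delay("my-scraper"))                      # 2
```

When a site declares a `Crawl-delay`, it is a reasonable lower bound for the delay between your own requests.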

In [None]:
import concurrent.futures
import threading
from collections import defaultdict

def compare_parallel_approaches(urls, max_workers=4):
    """Compare different parallel processing approaches"""
    
    results = {}
    
    # 1. Sequential baseline
    print("🐌 Sequential Processing:")
    start_time = time.time()
    sequential_results = [scrape_single_url(url) for url in urls]
    sequential_time = time.time() - start_time
    results['Sequential'] = {
        'time': sequential_time,
        'results': sequential_results
    }
    print(f"   Time: {sequential_time:.2f}s")
    
    # 2. ThreadPoolExecutor
    print("\n🧵 ThreadPoolExecutor:")
    start_time = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        thread_results = list(executor.map(scrape_single_url, urls))
    thread_time = time.time() - start_time
    results['ThreadPool'] = {
        'time': thread_time,
        'results': thread_results
    }
    print(f"   Time: {thread_time:.2f}s")
    print(f"   Speedup: {sequential_time/thread_time:.2f}x")
    
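    # 3. ProcessPoolExecutor
    # Note: on platforms that spawn worker processes (Windows, macOS), a function
    # defined in a notebook cell may not be importable by the workers. If this step
    # fails, move scrape_single_url into a separate module and import it.
    print("\n⚙️ ProcessPoolExecutor:")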
    start_time = time.time()
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        process_results = list(executor.map(scrape_single_url, urls))
    process_time = time.time() - start_time
    results['ProcessPool'] = {
        'time': process_time,
        'results': process_results
    }
    print(f"   Time: {process_time:.2f}s")
    print(f"   Speedup: {sequential_time/process_time:.2f}x")
    
    # 4. Joblib
    print("\n📦 Joblib Parallel:")
    start_time = time.time()
    joblib_results = Parallel(n_jobs=max_workers)(
        delayed(scrape_single_url)(url) for url in urls
    )
    joblib_time = time.time() - start_time
    results['Joblib'] = {
        'time': joblib_time,
        'results': joblib_results
    }
    print(f"   Time: {joblib_time:.2f}s")
    print(f"   Speedup: {sequential_time/joblib_time:.2f}x")
    
    # Summary comparison
    print(f"\n📊 Performance Summary:")
    print(f"{'Method':<15} {'Time (s)':<10} {'Speedup':<10} {'Success Rate'}")
    print("-" * 50)
    
    for method, data in results.items():
        success_count = sum(1 for r in data['results'] if r['status'] == 'success')
        success_rate = success_count / len(urls) * 100
        speedup = sequential_time / data['time'] if data['time'] > 0 else 0
        
        print(f"{method:<15} {data['time']:<10.2f} {speedup:<10.2f} {success_rate:.1f}%")
    
    return results

# Test with a smaller set for comparison
test_urls_small = [
    'https://httpbin.org/delay/1',  # 1 second delay
    'https://httpbin.org/delay/1',
    'https://httpbin.org/delay/1',
    'https://httpbin.org/delay/1',
    'https://httpbin.org/json',
    'https://httpbin.org/html',
    'https://httpbin.org/xml',
    'https://httpbin.org/user-agent'
]

print("🔬 Comparing Parallel Processing Approaches")
print("=" * 60)

comparison_results = compare_parallel_approaches(test_urls_small, max_workers=4)

# Memory usage comparison (simplified)
print(f"\n💾 Memory Usage Notes:")
print("   Sequential: Low memory, single process")
print("   ThreadPool: Medium memory, shared memory space")
print("   ProcessPool: High memory, separate processes")
print("   Joblib: Optimized memory usage, especially for NumPy arrays")

print(f"\n🎯 Recommendations:")
print("   • Use ThreadPool for I/O-bound web scraping")
print("   • Use ProcessPool for CPU-intensive data processing")
print("   • Use Joblib for data science and ML workloads")
print("   • Always implement rate limiting and error handling")
print("   • Monitor resource usage in production")

## 🎯 Key Takeaways: Parallel Web Scraping

### ✅ What We Learned

1. **Multiprocessing Basics**: Understanding CPU cores and parallel execution
2. **Joblib Library**: Simple and efficient parallel processing with `Parallel()` and `delayed()`
3. **Rate Limiting**: Implementing respectful scraping with proper delays
4. **Error Handling**: Robust retry mechanisms and failure recovery
5. **Performance Comparison**: Different approaches for different use cases
6. **Real-world Application**: E-commerce data scraping with batch processing

### 🚀 When to Use Parallel Scraping

**✅ Good candidates:**
- Large datasets (100s-1000s of URLs)
- I/O-bound operations (network requests)
- Independent scraping tasks
- Time-sensitive data collection

**❌ Avoid when:**
- Small datasets (< 50 URLs)
- Strict rate limits (< 1 req/sec)
- Complex interdependent scraping
- Server explicitly prohibits parallel access
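
The "strict rate limits" case can be made concrete with a rough back-of-the-envelope model. This is only an approximation under simplifying assumptions (constant latency, perfectly balanced workers), and `estimated_seconds` is a hypothetical helper: the total time is taken as the slower of two bounds, the rate limit and the parallel work.

```python
def estimated_seconds(n_urls, latency, workers, requests_per_second):
    """Rough wall-clock estimate: the slower of the rate limit and the parallel work."""
    # Rate limit: the first request goes out at once, each later one waits one interval
    rate_bound = max(n_urls - 1, 0) / requests_per_second
    # Parallel work: total request time divided evenly across workers
    work_bound = n_urls * latency / workers
    return max(rate_bound, work_bound)


# Strict limit (0.5 req/s): extra workers do not help
print(estimated_seconds(100, latency=0.5, workers=1, requests_per_second=0.5))  # 198.0
print(estimated_seconds(100, latency=0.5, workers=8, requests_per_second=0.5))  # 198.0

# Generous limit (10 req/s): workers cut the time until the limit takes over
print(estimated_seconds(100, latency=0.5, workers=1, requests_per_second=10))   # 50.0
print(estimated_seconds(100, latency=0.5, workers=8, requests_per_second=10))   # 9.9
```

Under a strict limit the rate bound dominates, so parallelism adds complexity without saving time.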

### 📋 Production Checklist

- [ ] Implement proper rate limiting
- [ ] Add comprehensive error handling
- [ ] Monitor resource usage (CPU, memory, network)
- [ ] Respect robots.txt and terms of service
- [ ] Implement logging and monitoring
- [ ] Test with small datasets first
- [ ] Use appropriate number of workers
- [ ] Handle failures gracefully

### 🔗 Next Steps

1. Practice with the provided examples
2. Implement rate limiting in your projects
3. Experiment with different worker counts
4. Monitor performance and optimize
5. Always prioritize ethical scraping practices

Remember: **With great power comes great responsibility!** Use parallel scraping responsibly and always respect website terms of service.

## 🕸️ Scrapy Framework - Industrial-Strength Web Scraping

Scrapy is not just a library - it's a **complete framework** for building web scrapers. Think of it as the difference between a hammer (Beautiful Soup) and a complete construction toolkit (Scrapy).

### 🤔 When to Choose Scrapy vs Beautiful Soup?

| Use Case | Beautiful Soup | Scrapy |
|----------|----------------|---------|
| **Simple, one-time scraping** | ✅ Perfect | ❌ Overkill |
| **Large-scale projects** | ❌ Limited | ✅ Excellent |
| **Multiple websites** | ❌ Manual work | ✅ Built-in support |
| **Following links automatically** | ❌ Manual coding | ✅ Built-in |
| **Data export (CSV, JSON)** | ❌ Manual coding | ✅ Built-in |
| **Handling cookies/sessions** | ❌ Manual coding | ✅ Automatic |
| **Concurrent requests** | ❌ Manual threading | ✅ Built-in |
| **Respecting robots.txt** | ❌ Manual checking | ✅ Built-in via the `ROBOTSTXT_OBEY` setting |

### 🏗️ Scrapy Architecture

Scrapy follows a **component-based architecture**:

1. **Engine** - Controls data flow between components
2. **Scheduler** - Manages which URLs to scrape next
3. **Downloader** - Fetches web pages
4. **Spiders** - Your custom logic for extracting data
5. **Item Pipeline** - Processes extracted data
6. **Middlewares** - Hooks for customizing requests/responses
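
The way these components hand work to each other can be imitated in a few lines of plain Python. This is a toy model only, not Scrapy's actual code: `FAKE_SITE`, `download`, `parse`, and `run_engine` are hypothetical stand-ins, and the "site" is an in-memory dictionary so that nothing touches the network.

```python
from collections import deque

# Hypothetical in-memory "site": URL -> (items on the page, links on the page)
FAKE_SITE = {
    "/page/1": (["quote A", "quote B"], ["/page/2"]),
    "/page/2": (["quote C"], ["/page/1", "/page/3"]),
    "/page/3": (["quote D"], []),
}


def download(url):
    """Stand-in for the Downloader: look the page up instead of using the network."""
    return FAKE_SITE[url]


def parse(page):
    """Stand-in for a Spider callback: yield items (str) and follow-up requests (dict)."""
    items, links = page
    for item in items:
        yield item
    for link in links:
        yield {"follow": link}


def run_engine(start_urls):
    """Stand-in for the Engine: move requests and items between the components."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending URLs
    seen = set(start_urls)         # Duplicate filter so each page is fetched once
    collected = []                 # Where an Item Pipeline would receive items
    while scheduler:
        url = scheduler.popleft()
        for result in parse(download(url)):
            if isinstance(result, dict):
                next_url = result["follow"]
                if next_url not in seen:
                    seen.add(next_url)
                    scheduler.append(next_url)
            else:
                collected.append(result)
    return collected


print(run_engine(["/page/1"]))  # ['quote A', 'quote B', 'quote C', 'quote D']
```

The loop ends when the scheduler is empty, which mirrors how a crawl finishes once no requests remain.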

### 🚀 Getting Started with Scrapy

#### Installation:
```bash
pip install scrapy
```

#### Creating a Scrapy Project:
```bash
# Create new project
scrapy startproject myproject

# Project structure created:
myproject/
    scrapy.cfg            # deploy configuration file
    myproject/            # project's Python module
        __init__.py
        items.py          # project items definition file
        middlewares.py    # project middlewares file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # directory for spiders
            __init__.py
```

#### Key Files Explained:
- **spiders/** - Where you write your scraping logic
- **items.py** - Define what data you want to extract
- **pipelines.py** - Process the extracted data
- **settings.py** - Configure how Scrapy behaves

### 🕷️ Understanding Scrapy Spiders

A **Spider** is a class that defines how to scrape a website. Every spider must:
1. **Have a unique name**
2. **Define starting URLs**
3. **Implement a parse method**

#### Basic Spider Structure:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'                    # Unique identifier
    allowed_domains = ['quotes.toscrape.com']  # Optional: restrict domains
    start_urls = ['http://quotes.toscrape.com/']  # Starting URLs
    
    def parse(self, response):
        # This method is called for each start_url
        # Extract data and/or follow links
        pass
```

#### The `response` Object:
- `response.css()` - Use CSS selectors
- `response.xpath()` - Use XPath selectors  
- `response.url` - Current URL
- `response.status` - HTTP status code
- `response.follow()` - Follow links

In [None]:
# Complete Scrapy Spider Example (Simulated)
# Note: This shows what a Scrapy spider looks like. A real spider subclasses
# scrapy.Spider and runs inside the Scrapy framework.

class QuotesSpider:
    """
    Example Scrapy Spider for quotes.toscrape.com
    This shows the structure and logic of a real Scrapy spider
    """
    
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']
    
    def parse(self, response):
        """
        Main parsing method - called for each response
        
        Args:
            response: Scrapy response object with methods:
                - response.css('selector') - CSS selectors
                - response.xpath('xpath') - XPath selectors  
                - response.follow(link) - Follow links
        """
        
        # Extract all quotes on the current page
        quotes = response.css('div.quote')
        
        for quote in quotes:
            # Extract individual fields using CSS selectors
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        
        # Follow the "Next" page link automatically
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            # This tells Scrapy to follow the link and call parse() again
            yield response.follow(next_page, self.parse)

# Let's simulate what Scrapy does behind the scenes
print("🕷️ Scrapy Spider Analysis:")
print("=" * 35)

print("\n1️⃣ Spider Attributes:")
spider = QuotesSpider()
print(f"   Name: {spider.name}")
print(f"   Allowed domains: {spider.allowed_domains}")
print(f"   Starting URLs: {spider.start_urls}")

print("\n2️⃣ How Scrapy Works:")
print("   Step 1: Scrapy sends requests to start_urls")
print("   Step 2: Calls parse() method with each response")
print("   Step 3: Spider yields data items and/or new requests")
print("   Step 4: Scrapy schedules new requests and processes items")
print("   Step 5: Repeats until no more requests")

print("\n3️⃣ Key Scrapy Concepts:")
print("   📥 yield items → Data extraction")
print("   📤 yield requests → Following links")
print("   🔄 response.follow() → Automatic link following")
print("   🎯 CSS/XPath selectors → Element selection")

print("\n4️⃣ Scrapy Selectors:")
print("   .get() → Get first match (like select_one)")
print("   .getall() → Get all matches (like select)")
print("   ::text → Extract text content")
print("   ::attr(name) → Extract attribute value")

print("\n5️⃣ Running the Spider:")
print("   Command: scrapy crawl quotes -o quotes.json")
print("   Output: Saves all extracted data to quotes.json")
print("   Automatically: Handles requests, follows links, exports data")

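### 📦 Scrapy Items - Structured Data Definition

**Items** define the structure of the data you want to extract. Think of them as dict-like containers that accept only the fields you declare: assigning an undeclared field raises a `KeyError`. They do not validate the values themselves.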

```python
# items.py
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    url = scrapy.Field()
    scraped_at = scrapy.Field()
```

**Benefits of Items:**
- **Consistent structure** - Only declared fields are accepted, so a typo in a field name fails early
- **IDE support** - Better autocomplete and error checking
- **Documentation** - Clear data schema
- **Pipeline compatibility** - Works seamlessly with pipelines

### 🔧 Scrapy Pipelines - Data Processing

**Pipelines** process items after extraction. Common uses:
- Cleaning and validation
- Duplicate removal
- Database storage
- File export

```python
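# pipelines.py
from datetime import datetime

from scrapy.exceptions import DropItem  # Raised below to discard duplicate items


class QuotesPipeline:
    def process_item(self, item, spider):
        # Clean the quote text (remove straight or curly quotation marks at the ends)
        item['text'] = item['text'].strip('"“”')

        # Add timestamp
        item['scraped_at'] = datetime.now()

        return item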

class DuplicatesPipeline:
    def __init__(self):
        self.seen_quotes = set()
    
    def process_item(self, item, spider):
        if item['text'] in self.seen_quotes:
            raise DropItem(f"Duplicate quote: {item['text']}")
        
        self.seen_quotes.add(item['text'])
        return item
```

### ⚡ Scrapy Advanced Features

#### 🔧 Built-in Settings (settings.py):
```python
# Respect robots.txt
ROBOTSTXT_OBEY = True

# Configure delays between requests
DOWNLOAD_DELAY = 1  # 1 second delay
RANDOMIZE_DOWNLOAD_DELAY = True  # Wait a random 0.5x to 1.5x of DOWNLOAD_DELAY

# Configure concurrent requests
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# User agent
USER_AGENT = 'myproject (+http://www.yourdomain.com)'

# Enable pipelines
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 300,
    'myproject.pipelines.QuotesPipeline': 800,
}
```
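
The numbers in `ITEM_PIPELINES` set the execution order: pipelines with lower values run first, so with the settings above `DuplicatesPipeline` (300) sees each item before `QuotesPipeline` (800) cleans it. The plain-Python sketch below imitates that ordering without Scrapy. `run_pipelines`, the two stand-in functions, and the sample items are hypothetical and only illustrate why the order can matter.

```python
def clean_text(item):
    """Stand-in for a cleaning step: strip surrounding quote marks."""
    item = dict(item)
    item["text"] = item["text"].strip('"“”')
    return item


def make_deduplicator():
    """Stand-in for a duplicate filter: returns None for repeated text."""
    seen = set()

    def deduplicate(item):
        if item["text"] in seen:
            return None  # plays the role of raising DropItem
        seen.add(item["text"])
        return item

    return deduplicate


def run_pipelines(items, pipelines):
    """Pass each item through the steps, lowest priority number first."""
    ordered = [step for step, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]
    kept = []
    for item in items:
        for step in ordered:
            item = step(item)
            if item is None:
                break
        else:
            kept.append(item)
    return kept


items = [{"text": '"Hello"'}, {"text": "Hello"}]

# Deduplicate first (300), clean second (800): both items survive,
# because the raw strings differ before cleaning.
dedupe_first = run_pipelines(items, {make_deduplicator(): 300, clean_text: 800})

# Clean first (300), deduplicate second (800): the second item is dropped.
clean_first = run_pipelines(items, {clean_text: 300, make_deduplicator(): 800})

print(len(dedupe_first), len(clean_first))
```

If quotes that differ only in their surrounding quotation marks should count as duplicates, giving the cleaning pipeline the lower number is one option.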

#### 🎯 Scrapy Shell - Interactive Testing:
```bash
# Start interactive shell for testing selectors
scrapy shell "http://quotes.toscrape.com/"

# In shell:
>>> response.css('div.quote').getall()
>>> response.css('span.text::text').get()
>>> view(response)  # Opens in browser
```

#### 🚀 Running Scrapy Spiders:
```bash
# Basic run
scrapy crawl quotes

# Export to different formats (-o appends to an existing file; use -O to overwrite it)
scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml

# Custom settings
scrapy crawl quotes -s USER_AGENT='Custom Bot'
scrapy crawl quotes -s DOWNLOAD_DELAY=2

# Multiple settings
scrapy crawl quotes -s DOWNLOAD_DELAY=1 -s CONCURRENT_REQUESTS=1
```

#### 🔍 Monitoring and Debugging:
```bash
# Less noisy output: INFO level and above
scrapy crawl quotes -L INFO

# Full debugging output (DEBUG is Scrapy's default log level)
scrapy crawl quotes -L DEBUG

# Save log to file
scrapy crawl quotes --logfile=scrapy.log
```

## 📚 Resources & Documentation - Organized by Library

### 🥄 Beautiful Soup Resources

#### Official Documentation:
- **[Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)** - Complete official documentation
- **[Beautiful Soup Quick Start](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)** - Getting started guide
- **[CSS Selectors Reference](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors)** - CSS selector syntax

#### Video Tutorials:
- **[Beautiful Soup Tutorial](https://www.youtube.com/watch?v=87Gx3U0BDlo)** - Corey Schafer
- **[Web Scraping with Beautiful Soup](https://www.youtube.com/watch?v=ng2o98k983k)** - Tech With Tim
- **[Beautiful Soup Complete Guide](https://www.youtube.com/watch?v=XVv6mJpFOb0)** - freeCodeCamp

#### Articles & Tutorials:
- **[Real Python - Beautiful Soup Guide](https://realpython.com/beautiful-soup-web-scraper-python/)** - Comprehensive tutorial
- **[GeeksforGeeks - Beautiful Soup](https://www.geeksforgeeks.org/implementing-web-scraping-python-beautiful-soup/)** - Step-by-step examples

### 🌐 Requests Library Resources

#### Official Documentation:
- **[Requests Documentation](https://requests.readthedocs.io/)** - HTTP library documentation
- **[Requests Quickstart](https://requests.readthedocs.io/en/latest/user/quickstart/)** - Basic usage examples
- **[Advanced Usage](https://requests.readthedocs.io/en/latest/user/advanced/)** - Sessions, cookies, SSL

#### Video Tutorials:
- **[Requests Library Tutorial](https://www.youtube.com/watch?v=tb8gHvYlCFs)** - Corey Schafer
- **[HTTP Requests in Python](https://www.youtube.com/watch?v=9N6a-VLBa2I)** - Programming with Mosh

#### Articles:
- **[Requests vs urllib](https://realpython.com/python-requests/)** - When to use what
- **[Session Objects in Requests](https://requests.readthedocs.io/en/latest/user/advanced/#session-objects)** - Persistent sessions

### 🕸️ Scrapy Framework Resources

#### Official Documentation:
- **[Scrapy Documentation](https://docs.scrapy.org/)** - Complete framework guide
- **[Scrapy Tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html)** - Step-by-step tutorial
- **[Scrapy Best Practices](https://docs.scrapy.org/en/latest/topics/practices.html)** - Production tips
- **[Scrapy Shell](https://docs.scrapy.org/en/latest/topics/shell.html)** - Interactive testing

#### Video Tutorials:
- **[Scrapy Framework Tutorial](https://www.youtube.com/watch?v=s4jtkzHhLzY)** - Traversy Media
- **[Complete Scrapy Course](https://www.youtube.com/watch?v=mBoX_JCKZTE)** - Coding Entrepreneurs
- **[Scrapy vs Beautiful Soup](https://www.youtube.com/watch?v=52rxmBEmeKQ)** - When to use what

#### Books & Courses:
- **"Learning Scrapy"** by Dimitris Kouzis-Loukas - Packt
- **"Web Scraping with Python and Scrapy"** - Udemy courses
- **[Scrapy GitHub Examples](https://github.com/scrapy/scrapy/tree/master/docs/topics/examples)** - Official examples

### 🚗 Selenium Resources

#### Documentation:
- **[Selenium with Python](https://selenium-python.readthedocs.io/)** - Community-maintained (unofficial) guide to the Python bindings
- **[WebDriver API](https://selenium-python.readthedocs.io/api.html)** - Complete API reference
- **[Waits in Selenium](https://selenium-python.readthedocs.io/waits.html)** - Handling dynamic content
- **[Selenium Grid](https://selenium.dev/documentation/grid/)** - Distributed testing

#### Video Tutorials:
- **[Selenium WebDriver with Python](https://www.youtube.com/watch?v=Xjv1sY630Uc)** - Programming with Mosh
- **[Selenium Complete Course](https://www.youtube.com/watch?v=j7VZsCCnptM)** - Edureka
- **[Selenium with Python Tutorial](https://www.youtube.com/watch?v=zjo9yFHoUl8)** - Telusko

#### Articles & Guides:
- **[Real Python - Selenium Guide](https://realpython.com/modern-web-automation-with-python-and-selenium/)** - Modern web automation
- **[Selenium Best Practices](https://selenium.dev/documentation/test_practices/)** - Official best practices
- **[Handling Dynamic Content](https://selenium-python.readthedocs.io/waits.html)** - WebDriverWait examples

### ⚡ Parallel Processing & Advanced Topics

#### Joblib Resources:
- **[Joblib Documentation](https://joblib.readthedocs.io/)** - Official documentation
- **[Parallel Computing with Joblib](https://joblib.readthedocs.io/en/latest/parallel.html)** - Parallel processing guide
- **[Joblib vs Multiprocessing](https://stackoverflow.com/questions/20776189/joblib-vs-multiprocessing)** - When to use what

#### Multiprocessing & Threading:
- **[Python Multiprocessing](https://docs.python.org/3/library/multiprocessing.html)** - Official documentation
- **[Concurrent.futures](https://docs.python.org/3/library/concurrent.futures.html)** - High-level interface
- **[Threading vs Multiprocessing](https://realpython.com/python-concurrency/)** - Real Python guide

#### Performance & Optimization:
- **[Web Scraping Performance](https://blog.apify.com/web-scraping-performance-optimization/)** - Apify blog
- **[Async Web Scraping](https://www.scrapehero.com/async-web-scraping-with-aiohttp/)** - ScrapeHero guide

### 🛠️ General Web Scraping Resources

#### Practice Websites:
- **[Quotes to Scrape](http://quotes.toscrape.com/)** - Perfect for beginners
- **[Books to Scrape](http://books.toscrape.com/)** - E-commerce practice
- **[Scrape This Site](https://scrapethissite.com/)** - Various challenges
- **[HTTP Bin](https://httpbin.org/)** - HTTP testing service

#### Alternative Libraries:
- **[requests-html](https://github.com/psf/requests-html)** - JavaScript support for requests
- **[playwright-python](https://playwright.dev/python/)** - Modern browser automation
- **[httpx](https://www.python-httpx.org/)** - Next-generation HTTP client
- **[pyppeteer](https://github.com/pyppeteer/pyppeteer)** - Puppeteer port for Python

#### Data Processing:
- **[Pandas Documentation](https://pandas.pydata.org/docs/)** - Data manipulation and analysis
- **[NumPy User Guide](https://numpy.org/doc/stable/user/)** - Numerical computing
- **[Matplotlib Tutorials](https://matplotlib.org/stable/tutorials/index.html)** - Data visualization

### 📖 Books & Comprehensive Courses

#### Recommended Books:
- **"Web Scraping with Python"** by Ryan Mitchell - O'Reilly Media (Classic)
- **"Python Web Scraping Cookbook"** by Michael Heydt - Packt
- **"Mastering Python Web Scraping"** - Advanced techniques

#### Online Courses:
- **[freeCodeCamp Web Scraping](https://www.youtube.com/watch?v=XVv6mJpFOb0)** - 3+ hour complete course
- **[Udemy Web Scraping Courses](https://www.udemy.com/topic/web-scraping/)** - Various paid courses
- **[Coursera Web Scraping](https://www.coursera.org/search?query=web%20scraping)** - University courses

### 🎯 Learning Path Recommendations

#### Beginner (1-2 weeks):
1. HTML/CSS basics
2. Beautiful Soup fundamentals
3. Simple scraping projects
4. Practice websites

#### Intermediate (2-4 weeks):
5. Selenium for dynamic content
6. Error handling and robustness
7. Data processing with pandas
8. Multiple page scraping

#### Advanced (4+ weeks):
9. Scrapy framework
10. Parallel processing
11. Large-scale projects
12. Production deployment

# 🛠️ Practice

Let's put our knowledge into practice with hands-on exercises!

## 🎯 Exercise 1: News Headlines Scraper

Create a scraper that extracts news headlines from a news site's RSS feed.

In [None]:
# Exercise 1: News Headlines Scraper
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime

def scrape_news_headlines():
    """
    Scrape news headlines from a sample news site
    Note: In real projects, always check robots.txt and terms of service
    """
    
    # Using BBC RSS feed as an example (more reliable than scraping HTML)
    url = "http://feeds.bbci.co.uk/news/rss.xml"
    
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        # Parse XML (RSS feeds are XML); the 'xml' parser requires the lxml package
        soup = BeautifulSoup(response.content, 'xml')
        
        # Find all items (news articles)
        items = soup.find_all('item')
        
        news_data = []
        
        for item in items[:10]:  # Get first 10 articles
            title = item.find('title')
            link = item.find('link')
            description = item.find('description')
            pub_date = item.find('pubDate')
            
            news_data.append({
                'title': title.text if title else 'N/A',
                'link': link.text if link else 'N/A',
                'description': description.text if description else 'N/A',
                'published': pub_date.text if pub_date else 'N/A',
                'scraped_at': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            })
        
        return news_data
    
    except Exception as e:
        print(f"Error scraping news: {e}")
        return []

# Run the scraper
print("📰 Scraping BBC News Headlines...")
news_headlines = scrape_news_headlines()

if news_headlines:
    print(f"✅ Successfully scraped {len(news_headlines)} headlines")
    
    # Display first 3 headlines
    for i, article in enumerate(news_headlines[:3], 1):
        print(f"\n{i}. {article['title']}")
        print(f"   Published: {article['published']}")
        print(f"   Description: {article['description'][:100]}...")
    
    # Save to CSV
    df = pd.DataFrame(news_headlines)
    df.to_csv('news_headlines.csv', index=False, encoding='utf-8')
    print("\n💾 Data saved to 'news_headlines.csv'")
    
    # Show basic statistics
    print("\n📊 Statistics:")
    print(f"   Total articles: {len(news_headlines)}")
    print(f"   Average title length: {df['title'].str.len().mean():.1f} characters")
else:
    print("❌ No headlines found")
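
Beautiful Soup's `'xml'` parser depends on the `lxml` package. Where that is not installed, the same fields can be read with the standard library. The sketch below parses an inline sample feed instead of making a network request; `parse_rss_items` and the sample XML are illustrative and not part of the exercise above.

```python
import xml.etree.ElementTree as ET

SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample Feed</title>
    <item>
      <title>First headline</title>
      <link>https://example.com/1</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second headline</title>
      <link>https://example.com/2</link>
    </item>
  </channel>
</rss>
"""


def parse_rss_items(xml_text, limit=10):
    """Return a list of dicts with title, link, and published date."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.iter("item"):
        if len(rows) >= limit:
            break
        rows.append({
            # findtext returns the default when the child element is missing
            "title": item.findtext("title", default="N/A"),
            "link": item.findtext("link", default="N/A"),
            "published": item.findtext("pubDate", default="N/A"),
        })
    return rows


headlines = parse_rss_items(SAMPLE_RSS)
for row in headlines:
    print(row["title"], "|", row["published"])
```

With a live feed, the bytes in `response.content` can be passed in place of `SAMPLE_RSS`.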

## 🎯 Exercise 2: Table Data Scraper

Extract tabular data from a web page and convert it to a pandas DataFrame.

In [None]:
# Exercise 2: Table Data Scraper
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Create sample HTML table for demonstration
sample_table_html = """
<html>
<body>
    <h2>Cryptocurrency Prices</h2>
    <table id="crypto-table" class="data-table">
        <thead>
            <tr>
                <th>Rank</th>
                <th>Name</th>
                <th>Symbol</th>
                <th>Price (USD)</th>
                <th>24h Change</th>
                <th>Market Cap</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>1</td>
                <td>Bitcoin</td>
                <td>BTC</td>
                <td>$43,250.00</td>
                <td class="positive">+2.34%</td>
                <td>$847.5B</td>
            </tr>
            <tr>
                <td>2</td>
                <td>Ethereum</td>
                <td>ETH</td>
                <td>$2,580.50</td>
                <td class="negative">-1.25%</td>
                <td>$310.2B</td>
            </tr>
            <tr>
                <td>3</td>
                <td>Cardano</td>
                <td>ADA</td>
                <td>$0.45</td>
                <td class="positive">+5.67%</td>
                <td>$15.2B</td>
            </tr>
            <tr>
                <td>4</td>
                <td>Solana</td>
                <td>SOL</td>
                <td>$98.75</td>
                <td class="positive">+3.21%</td>
                <td>$42.8B</td>
            </tr>
        </tbody>
    </table>
</body>
</html>
"""

def scrape_table_data(html_content):
    """Extract table data and convert to pandas DataFrame"""
    
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Find the table
    table = soup.find('table', {'id': 'crypto-table'})
    
    if not table:
        print("❌ Table not found")
        return None
    
    # Extract headers
    headers = []
    header_row = table.find('thead').find('tr')
    for th in header_row.find_all('th'):
        headers.append(th.text.strip())
    
    print(f"📋 Table headers: {headers}")
    
    # Extract data rows
    data_rows = []
    tbody = table.find('tbody')
    
    for row in tbody.find_all('tr'):
        row_data = []
        for td in row.find_all('td'):
            # Strip surrounding whitespace; symbols are removed later in clean_financial_data
            cell_text = td.text.strip()
            row_data.append(cell_text)
        data_rows.append(row_data)
    
    # Create DataFrame
    df = pd.DataFrame(data_rows, columns=headers)
    
    return df

def clean_financial_data(df):
    """Clean and process financial data"""
    
    df_clean = df.copy()
    
    # Clean price column (remove $ and thousands separators, convert to float).
    # regex=False treats '$' and '+' literally; older pandas versions default to regex matching.
    if 'Price (USD)' in df_clean.columns:
        df_clean['Price_Numeric'] = (
            df_clean['Price (USD)']
            .str.replace('$', '', regex=False)
            .str.replace(',', '', regex=False)
            .astype(float)
        )

    # Clean percentage change (remove % and +, convert to float)
    if '24h Change' in df_clean.columns:
        df_clean['Change_Numeric'] = (
            df_clean['24h Change']
            .str.replace('%', '', regex=False)
            .str.replace('+', '', regex=False)
            .astype(float)
        )

    # Clean market cap (expand B/M suffixes to a plain dollar amount)
    if 'Market Cap' in df_clean.columns:
        def parse_market_cap(cap_str):
            cap_str = cap_str.replace('$', '').replace(',', '')
            if 'B' in cap_str:
                return float(cap_str.replace('B', '')) * 1e9
            elif 'M' in cap_str:
                return float(cap_str.replace('M', '')) * 1e6
            return float(cap_str)
        
        df_clean['Market_Cap_Numeric'] = df_clean['Market Cap'].apply(parse_market_cap)
    
    return df_clean

# Scrape the table
print("📊 Scraping table data...")
crypto_df = scrape_table_data(sample_table_html)

if crypto_df is not None:
    print("\n✅ Raw table data:")
    print(crypto_df.to_string(index=False))
    
    # Clean the data
    crypto_clean = clean_financial_data(crypto_df)
    
    print("\n🧹 Cleaned data with numeric columns:")
    print(crypto_clean[['Name', 'Symbol', 'Price_Numeric', 'Change_Numeric']].to_string(index=False))
    
    # Basic analysis
    print("\n📈 Quick Analysis:")
    print(f"   Average price: ${crypto_clean['Price_Numeric'].mean():,.2f}")
    print(f"   Highest price: {crypto_clean.loc[crypto_clean['Price_Numeric'].idxmax(), 'Name']} (${crypto_clean['Price_Numeric'].max():,.2f})")
    print(f"   Best performer: {crypto_clean.loc[crypto_clean['Change_Numeric'].idxmax(), 'Name']} ({crypto_clean['Change_Numeric'].max()}%)")
    print(f"   Worst performer: {crypto_clean.loc[crypto_clean['Change_Numeric'].idxmin(), 'Name']} ({crypto_clean['Change_Numeric'].min()}%)")
    
    # Save to CSV
    crypto_clean.to_csv('crypto_data.csv', index=False)
    print("\n💾 Data saved to 'crypto_data.csv'")
else:
    print("❌ Failed to scrape table data")
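
`parse_market_cap` above handles only the `B` and `M` suffixes. A table-driven variant makes it easier to add more. The sketch below is one way to do it; the helper name `parse_abbreviated_number` and the extra `K` and `T` suffixes are assumptions rather than part of the exercise.

```python
# Multipliers for common magnitude suffixes (K and T are illustrative additions)
SUFFIX_MULTIPLIERS = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_abbreviated_number(value):
    """Convert strings such as '$847.5B' or '1,200' to a float."""
    cleaned = value.strip().replace("$", "").replace(",", "")
    if not cleaned:
        raise ValueError("empty value")
    suffix = cleaned[-1].upper()
    if suffix in SUFFIX_MULTIPLIERS:
        return float(cleaned[:-1]) * SUFFIX_MULTIPLIERS[suffix]
    return float(cleaned)


for raw in ["$847.5B", "$15.2B", "$3.4M", "1,200"]:
    print(raw, "->", parse_abbreviated_number(raw))
```

It could be applied to the DataFrame in the same way as the original helper, for example `df_clean['Market Cap'].apply(parse_abbreviated_number)`.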

# Summary & Next Steps

## 📚 What We've Learned

### 🌐 HTML & CSS Fundamentals
- **HTML structure**: Tags, attributes, and document hierarchy
- **CSS selectors**: The main tool for locating elements in every library covered here
- **Key concepts**: Classes, IDs, and element relationships

### 🥄 Beautiful Soup Mastery
- **Parsing HTML**: Converting text to navigable objects
- **Finding elements**: Multiple methods for element selection
- **Data extraction**: Getting text, attributes, and structured data
- **Navigation**: Moving between parents, children, and siblings

### 🕸️ Scrapy Framework
- **When to use**: Large-scale, production scraping projects
- **Architecture**: Spiders, Items, Pipelines, and Settings
- **Advanced features**: Automatic link following, data export, rate limiting
- **Commands**: Creating projects and running spiders

### 🚗 Selenium for Dynamic Content
- **JavaScript handling**: Scraping interactive websites
- **WebDriver control**: Automating browser actions
- **Waiting strategies**: Handling dynamic content loading
- **User simulation**: Clicks, form filling, and scrolling

### ⚡ Parallel Processing
- **Performance optimization**: Multiprocessing and joblib
- **Rate limiting**: Respectful scraping practices
- **Error handling**: Robust scraping systems
- **Real-world applications**: E-commerce and large datasets

## 🚀 Your Learning Path Forward

### 📖 Practice Projects (Beginner)
1. **News Headlines Scraper**: Start with RSS feeds or simple news sites
2. **Product Price Monitor**: Track prices on e-commerce sites
3. **Weather Data Collector**: Scrape weather information
4. **Social Media Posts**: Extract public posts (check terms of service!)

### 🔧 Intermediate Projects
5. **Multi-page Scraper**: Follow pagination automatically
6. **Data Analysis Pipeline**: Scrape → Clean → Analyze → Visualize
7. **API vs Scraping**: Compare when to use APIs vs scraping
8. **Error Recovery System**: Build robust scrapers with retry logic

### 🏆 Advanced Projects
9. **Distributed Scraping**: Use Scrapy with multiple machines
10. **Anti-bot Evasion**: Handle CAPTCHAs and bot detection
11. **Real-time Monitoring**: Build alerting systems
12. **Machine Learning Integration**: Use scraped data for ML projects

## 💡 Best Practices to Remember

### ✅ Do:
- Start with simple static websites
- Always check robots.txt
- Add delays between requests
- Handle errors gracefully
- Save data in structured formats
- Test selectors thoroughly
- Use version control for your scrapers

### ❌ Don't:
- Scrape faster than necessary
- Ignore HTTP status codes
- Scrape without permission for commercial use
- Store sensitive data insecurely
- Forget to close browser instances (Selenium)
- Hardcode values that might change
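
Several of the points above (adding delays, handling errors gracefully, not ignoring failures) can be combined in a small retry helper. The sketch below is a minimal illustration, not a production implementation: `fetch_with_retry`, its parameters, and the `flaky_fetch` stand-in are hypothetical, and the `fetch` argument can be any callable, such as a wrapper around `requests.get`.

```python
import time


def fetch_with_retry(fetch, url, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on failure.

    Waits base_delay, then 2 * base_delay, then 4 * base_delay, ...
    between attempts and re-raises the last error if all attempts fail.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == retries:
                raise
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay}s")
            sleep(delay)


# Demonstration with a stand-in fetcher that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return f"content of {url}"


# Record the waits instead of actually sleeping, so the demo runs instantly.
waits = []
result = fetch_with_retry(flaky_fetch, "https://example.com", sleep=waits.append)
print(result, waits)
```

Passing `sleep` as a parameter keeps the helper easy to test; in real use the default `time.sleep` applies the delays.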

## 🔗 Keep Learning

- **Join communities**: Reddit r/webscraping, Stack Overflow
- **Read documentation**: Stay updated with library changes
- **Build projects**: Apply what you've learned
- **Share knowledge**: Help others learn web scraping
- **Stay ethical**: Always respect website terms and rate limits

### 🎓 Recommended Next Topics
- **Data Analysis**: pandas, matplotlib, seaborn
- **Databases**: SQLite, PostgreSQL, MongoDB
- **APIs**: requests, authentication, REST APIs
- **Deployment**: Docker, cloud services, scheduling
- **Monitoring**: Logging, error tracking, performance metrics

Remember: **The best way to learn web scraping is by doing!** Start with simple projects and gradually increase complexity as you gain confidence.



