![image.png](../background_photos/)
[Photo link](https://unsplash.com/photos/a-large-mountain-with-a-very-tall-cliff-UiP9KfVe3aQ), Author: []()


> Song reference - ToDo

# 📌 Description

[📚 Full material]()

#### 📺 Videos
#### 🏡 Homework

# 📚 Material

## 🌐 HTML & CSS Basics

Before diving into web scraping, it's essential to understand the structure of web pages. HTML (HyperText Markup Language) provides the structure, while CSS (Cascading Style Sheets) handles the styling.

### HTML Structure

HTML uses **tags** to define elements. Tags are enclosed in angle brackets `< >` and usually come in pairs:

```html
<tagname>Content goes here</tagname>
```

#### Common HTML Tags:

- `<html>` - Root element
- `<head>` - Contains metadata
- `<title>` - Page title
- `<body>` - Visible page content
- `<h1>`, `<h2>`, `<h3>` - Headings
- `<p>` - Paragraphs
- `<div>` - Generic container
- `<span>` - Inline container
- `<a>` - Links
- `<img>` - Images
- `<ul>`, `<ol>`, `<li>` - Lists
- `<table>`, `<tr>`, `<td>` - Tables

#### HTML Attributes:

Attributes provide additional information about elements:

```html
<div id="content" class="main-section">
    <a href="https://example.com" target="_blank">Link</a>
    <img src="image.jpg" alt="Description">
</div>
```

**Important attributes for scraping:**
- `id` - Unique identifier
- `class` - CSS class name(s)
- `href` - Link destination
- `src` - Source for images/scripts
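
To see how these attributes surface in Beautiful Soup, here is a minimal sketch. The HTML fragment is made up for illustration, and the example assumes `beautifulsoup4` is installed. A tag behaves much like a dictionary: `tag['id']` raises `KeyError` when the attribute is absent, while `tag.get(...)` returns `None`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only to demonstrate attribute access
snippet = """
<div id="content" class="main-section highlighted">
    <a href="https://example.com" target="_blank">Link</a>
    <img src="image.jpg" alt="Description">
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")
div = soup.find("div")
link = soup.find("a")
img = soup.find("img")

div_id = div["id"]               # dictionary-style access
div_classes = div["class"]       # class is multi-valued, so this is a list
link_href = link.get("href")     # .get() is safe when the attribute may be absent
img_src = img.get("src")
missing_title = img.get("title") # attribute not present, so None

print(div_id, div_classes, link_href, img_src, missing_title)
```

Note that `class` comes back as a list because an element can carry several class names at once.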

### CSS Selectors

CSS selectors are crucial for web scraping as they help us target specific elements:

#### Basic Selectors:
- **Element**: `p` (selects all `<p>` elements)
- **Class**: `.classname` (selects elements with `class="classname"`)
- **ID**: `#idname` (selects element with `id="idname"`)
- **Attribute**: `[attribute="value"]`

#### Combination Selectors:
- **Descendant**: `div p` (all `<p>` inside `<div>`)
- **Child**: `div > p` (direct `<p>` children of `<div>`)
- **Adjacent sibling**: `h1 + p` (a `<p>` that comes immediately after an `<h1>` with the same parent)
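
The selectors above can be tried out directly with Beautiful Soup's `select()` method. This is a small sketch on a made-up fragment (the element names and the `texts` helper are illustrative, not part of any library), assuming `beautifulsoup4` is installed:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment chosen so that each selector matches a different set
snippet = """
<div id="main" class="box">
  <h1>Title</h1>
  <p class="intro">First</p>
  <p>Second</p>
  <section><p>Nested</p></section>
</div>
<p data-id="123">Outside</p>
"""

soup = BeautifulSoup(snippet, "html.parser")

def texts(selector):
    """Return the text of every element matched by a CSS selector."""
    return [el.get_text() for el in soup.select(selector)]

print(texts("p"))                # ['First', 'Second', 'Nested', 'Outside']
print(texts(".intro"))           # ['First']
print(texts('[data-id="123"]'))  # ['Outside']
print(texts("div p"))            # ['First', 'Second', 'Nested']
print(texts("div > p"))          # ['First', 'Second']
print(texts("h1 + p"))           # ['First']
```

The difference between `div p` and `div > p` shows up in the `Nested` paragraph: it is a descendant of the `<div>` but not a direct child.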

### Sample HTML Document:

```html
<!DOCTYPE html>
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <header id="main-header">
        <h1>Welcome to My Site</h1>
        <nav class="navigation">
            <ul>
                <li><a href="#home">Home</a></li>
                <li><a href="#about">About</a></li>
                <li><a href="#contact">Contact</a></li>
            </ul>
        </nav>
    </header>
    
    <main class="content">
        <article class="post" data-id="123">
            <h2 class="post-title">Article Title</h2>
            <p class="post-content">This is the article content...</p>
            <div class="post-meta">
                <span class="author">John Doe</span>
                <span class="date">2025-01-15</span>
            </div>
        </article>
        
        <article class="post" data-id="124">
            <h2 class="post-title">Another Article</h2>
            <p class="post-content">More content here...</p>
            <div class="post-meta">
                <span class="author">Jane Smith</span>
                <span class="date">2025-01-16</span>
            </div>
        </article>
    </main>
    
    <footer>
        <p>&copy; 2025 My Website</p>
    </footer>
</body>
</html>
```

## 🕷️ Web Scraping Fundamentals

Web scraping is the process of extracting data from websites programmatically. Python offers several powerful libraries for this purpose:

1. **Beautiful Soup** - For parsing HTML/XML
2. **Requests** - For making HTTP requests
3. **Scrapy** - Full-featured scraping framework
4. **Selenium** - For JavaScript-heavy sites

### 🥄 Beautiful Soup

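Beautiful Soup is a good starting point for beginners: it parses HTML you already have (a string or a downloaded page) and covers most static-page scraping tasks. It does not download pages or run JavaScript by itself, which is why it is usually paired with Requests.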

#### Installation:

In [None]:
# Install required packages
!pip install beautifulsoup4 requests lxml html5lib

In [None]:
# Basic Beautiful Soup example
from bs4 import BeautifulSoup
import requests

# Sample HTML for demonstration
html_content = """
<!DOCTYPE html>
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <header id="main-header">
        <h1>Welcome to My Site</h1>
        <nav class="navigation">
            <ul>
                <li><a href="#home">Home</a></li>
                <li><a href="#about">About</a></li>
                <li><a href="#contact">Contact</a></li>
            </ul>
        </nav>
    </header>
    
    <main class="content">
        <article class="post" data-id="123">
            <h2 class="post-title">Article Title</h2>
            <p class="post-content">This is the article content...</p>
            <div class="post-meta">
                <span class="author">John Doe</span>
                <span class="date">2025-01-15</span>
            </div>
        </article>
        
        <article class="post" data-id="124">
            <h2 class="post-title">Another Article</h2>
            <p class="post-content">More content here...</p>
            <div class="post-meta">
                <span class="author">Jane Smith</span>
                <span class="date">2025-01-16</span>
            </div>
        </article>
    </main>
    
    <footer>
        <p>&copy; 2025 My Website</p>
    </footer>
</body>
</html>
"""

# Create BeautifulSoup object
soup = BeautifulSoup(html_content, 'html.parser')

print("🔍 Basic Beautiful Soup Operations:")
print("=" * 50)

# 1. Find by tag
title = soup.find('title')
print(f"Page title: {title.text}")

# 2. Find by class
articles = soup.find_all('article', class_='post')
print(f"Number of articles: {len(articles)}")

# 3. Find by ID
header = soup.find('header', id='main-header')
print(f"Header text: {header.h1.text}")

# 4. CSS selectors
nav_links = soup.select('nav.navigation a')
print(f"Navigation links: {[link.text for link in nav_links]}")

# 5. Extract data from each article
print("\n📰 Article Information:")
for i, article in enumerate(articles, 1):
    title = article.find('h2', class_='post-title').text
    content = article.find('p', class_='post-content').text
    author = article.find('span', class_='author').text
    date = article.find('span', class_='date').text
    data_id = article.get('data-id')
    
    print(f"\nArticle {i}:")
    print(f"  ID: {data_id}")
    print(f"  Title: {title}")
    print(f"  Author: {author}")
    print(f"  Date: {date}")
    print(f"  Content: {content[:50]}...")

### 🌐 Real Website Scraping Example

Let's scrape some real data from a website. We'll use `quotes.toscrape.com`, a site that exists for practicing web scraping:

In [None]:
# Step 1: Import required libraries for web scraping
import requests
from bs4 import BeautifulSoup
import time
import json

print("✅ Libraries imported successfully!")

In [None]:
# Step 2: Define the scraping function
def scrape_quotes():
    """
    Scrape quotes from quotes.toscrape.com
    Returns a list of dictionaries containing quote data
    """
    url = "http://quotes.toscrape.com/"
    
    try:
        print(f"🌐 Sending request to: {url}")
        
        # Send GET request to the website
        response = requests.get(url, timeout=10)  # timeout avoids hanging forever
        response.raise_for_status()  # Raise exception for bad status codes
        
        print(f"✅ Request successful! Status code: {response.status_code}")
        
        # Parse HTML content with Beautiful Soup
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Find all quote containers
        quotes = soup.find_all('div', class_='quote')
        print(f"📊 Found {len(quotes)} quotes on the page")
        
        scraped_data = []
        
        # Extract data from each quote
        for i, quote in enumerate(quotes, 1):
            # Extract quote text (strip whitespace and the surrounding curly quotes)
            text = quote.find('span', class_='text').get_text(strip=True).strip('“”')
            
            # Extract author name
            author = quote.find('small', class_='author').text
            
            # Extract tags (multiple tags per quote)
            tags = [tag.text for tag in quote.find_all('a', class_='tag')]
            
            # Store data in dictionary
            quote_data = {
                'text': text,
                'author': author,
                'tags': tags
            }
            
            scraped_data.append(quote_data)
            print(f"  📝 Processed quote {i}: {author}")
        
        return scraped_data
    
    except requests.RequestException as e:
        print(f"❌ Error fetching the webpage: {e}")
        return []
    except Exception as e:
        print(f"❌ Error processing data: {e}")
        return []

print("✅ Function defined successfully!")

In [None]:
# Step 3: Run the scraper and collect data
print("🕷️ Starting the scraping process...")
quotes_data = scrape_quotes()

print(f"\n📊 Scraping completed!")
print(f"Total quotes collected: {len(quotes_data)}")

if quotes_data:
    print("\n🎯 Sample of scraped data:")
    print("=" * 60)
else:
    print("❌ No quotes were scraped. Check your internet connection.")

In [None]:
# Step 4: Display the first few quotes to see our results
if quotes_data:
    print("📝 First 3 quotes from our scraping:")
    print("=" * 80)
    
    for i, quote in enumerate(quotes_data[:3], 1):
        print(f"\n💬 Quote {i}:")
        print(f"   Text: {quote['text']}")
        print(f"   Author: {quote['author']}")
        print(f"   Tags: {', '.join(quote['tags'])}")
        print("-" * 60)
    
    # Show some statistics
    print(f"\n📈 Quick Statistics:")
    print(f"   Total quotes: {len(quotes_data)}")
    
    # Find unique authors
    authors = set(quote['author'] for quote in quotes_data)
    print(f"   Unique authors: {len(authors)}")
    
    # Find all unique tags
    all_tags = set()
    for quote in quotes_data:
        all_tags.update(quote['tags'])
    print(f"   Unique tags: {len(all_tags)}")
    print(f"   Some tags: {', '.join(sorted(all_tags)[:5])}...")
else:
    print("❌ No data to display")

In [None]:
# Step 5: Save the scraped data to a file
if quotes_data:
    # Save to JSON file with proper formatting
    filename = 'quotes_scraped.json'
    
    try:
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(quotes_data, f, indent=2, ensure_ascii=False)
        
        print(f"💾 Data saved successfully to '{filename}'")
        print(f"📁 File contains {len(quotes_data)} quotes")
        
        # Show file size
        import os
        file_size = os.path.getsize(filename)
        print(f"📊 File size: {file_size:,} bytes")
        
    except Exception as e:
        print(f"❌ Error saving file: {e}")
        
    # Also demonstrate accessing specific fields of a scraped record
    print(f"\n🔍 Example of accessing specific quote data:")
    if len(quotes_data) > 0:
        first_quote = quotes_data[0]
        print(f"   First quote text: {first_quote['text'][:50]}...")
        print(f"   First quote author: {first_quote['author']}")
        print(f"   First quote tags: {first_quote['tags']}")
else:
    print("❌ No data to save")

#### 🎓 What We Just Did - Step by Step:

1. **📦 Imported Libraries**: We imported the essential tools:
   - `requests` for making HTTP requests
   - `BeautifulSoup` for parsing HTML
   - `json` for saving data
   - `time` for adding delays between requests (good practice, although this single-page example never needs to pause)

2. **🔧 Created Function**: We defined `scrape_quotes()` that:
   - Sends a GET request to the website
   - Handles errors gracefully
   - Parses HTML with Beautiful Soup
   - Extracts specific data with `find()` and `find_all()`, filtering by tag name and class

3. **🚀 Executed Scraper**: We ran the function and collected data

4. **👀 Viewed Results**: We displayed the scraped quotes to verify success

5. **💾 Saved Data**: We saved the results to a JSON file for future use
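
The same extraction can also be written with CSS selectors through `select()` and `select_one()`. The sketch below runs on a hand-written fragment that mimics the structure the scraper above expects (`div.quote`, `span.text`, `small.author`, `a.tag`); the sample quote and author are invented, so no network request is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the markup the scraper above looks for
sample = """
<div class="quote">
    <span class="text">“A sample quote.”</span>
    <span>by <small class="author">Jane Doe</small></span>
    <div class="tags">
        <a class="tag" href="/tag/life/">life</a>
        <a class="tag" href="/tag/humor/">humor</a>
    </div>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

records = []
for quote in soup.select("div.quote"):          # like find_all('div', class_='quote')
    records.append({
        "text": quote.select_one("span.text").get_text(strip=True),
        "author": quote.select_one("small.author").get_text(strip=True),
        "tags": [a.get_text(strip=True) for a in quote.select("a.tag")],
    })

print(records)
```

Which style to use is largely a matter of taste: `find()`/`find_all()` read well for simple tag-and-class lookups, while selectors are more compact for nested conditions.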

**Key Learning Points:**
- Always check `response.status_code` (or call `response.raise_for_status()`, as we did) to make sure the request succeeded
- Use `.find()` for single elements and `.find_all()` for multiple elements
- Handle exceptions to make your scraper robust
- Save data in structured formats like JSON or CSV
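
The JSON route was shown in Step 5; for CSV, a minimal sketch with the standard `csv` module could look like this. The function name `save_quotes_csv`, the `|` separator for tags, and the sample records are all assumptions made for illustration; the records only imitate the shape of `quotes_data`.

```python
import csv

def save_quotes_csv(records, path):
    """Write quote dictionaries to a CSV file, joining each tag list into one cell."""
    fieldnames = ["text", "author", "tags"]
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for record in records:
            writer.writerow({
                "text": record["text"],
                "author": record["author"],
                # A CSV cell holds one value, so flatten the list of tags
                "tags": "|".join(record["tags"]),
            })

# Hypothetical records with the same keys as the scraped quotes
sample_quotes = [
    {"text": "A sample quote.", "author": "Jane Doe", "tags": ["life", "humor"]},
    {"text": "Another one, with a comma.", "author": "John Roe", "tags": []},
]

save_quotes_csv(sample_quotes, "quotes_sample.csv")
```

Opening the file with `newline=""` lets the `csv` module control line endings, and the writer quotes any field that contains a comma, so the second record survives a round trip intact.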

## 🏠 Real Estate Scraping: List.am Example

List.am is a popular Armenian classifieds website. Let's build a scraper for its real estate listings. This example demonstrates scraping a real Armenian website with error handling and basic data processing.

**⚠️ Important Notes:**
- Always check robots.txt: https://www.list.am/robots.txt
- Respect the website's terms of service
- Add appropriate delays between requests
- This is for educational purposes only

In [None]:
# List.am Real Estate Scraper
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import re
from urllib.parse import urljoin, urlparse
import json

def scrape_listam_listings(base_url="https://www.list.am/category/62", max_pages=2, delay=2):
    """
    Scrape real estate listings from list.am
    
    Args:
        base_url (str): Base URL for the category
        max_pages (int): Maximum number of pages to scrape
        delay (int): Delay between requests in seconds
    
    Returns:
        list: List of dictionaries containing listing data
    """
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
    }
    
    all_listings = []
    
    for page in range(1, max_pages + 1):
        try:
            # Construct page URL
            if page == 1:
                page_url = base_url
            else:
                page_url = f"{base_url}/{page}"
            
            print(f"🔍 Scraping page {page}: {page_url}")
            
            # Add delay to be respectful
            if page > 1:
                time.sleep(delay)
            
            # Send request
            response = requests.get(page_url, headers=headers, timeout=10)
            response.raise_for_status()
            
            # Parse HTML
            soup = BeautifulSoup(response.content, 'html.parser')
            
            # Find listing containers (adjust selectors based on actual HTML structure)
            listings = soup.find_all('a', href=True)
            
            page_listings = []
            
            for listing in listings:
                href = listing.get('href', '')
                
                # Filter for item links
                if href.startswith('/item/'):
                    # Extract item ID
                    item_match = re.search(r'/item/(\d+)', href)
                    if not item_match:
                        continue
                    
                    item_id = item_match.group(1)
                    full_url = urljoin(base_url, href)
                    
                    # Extract text content from the link
                    text_content = listing.get_text(strip=True)
                    
                    # Parse listing information from text
                    listing_data = parse_listing_text(text_content, item_id, full_url)
                    
                    if listing_data:
                        page_listings.append(listing_data)
            
            print(f"   ✅ Found {len(page_listings)} listings on page {page}")
            all_listings.extend(page_listings)
            
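            # Stop when there is no "next page" link or the page limit is reached
            # (the link text is an assumption about the site's current markup)
            next_link = soup.find('a', string='Հաջորդը >')
            if not next_link or page == max_pages:
                print("📄 Reached last page or max pages limit")
                break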
                
        except requests.RequestException as e:
            print(f"❌ Error fetching page {page}: {e}")
            break
        except Exception as e:
            print(f"❌ Error parsing page {page}: {e}")
            continue
    
    return all_listings

def parse_listing_text(text, item_id, url):
    """
    Parse listing information from text content
    
    Args:
        text (str): Text content of the listing
        item_id (str): Item ID
        url (str): Full URL to the listing
    
    Returns:
        dict: Parsed listing data
    """
    
    if not text or len(text.strip()) < 10:
        return None
    
    # Initialize listing data
    listing = {
        'id': item_id,
        'url': url,
        'raw_text': text.strip(),
        'price': None,
        'price_currency': None,
        'location': None,
        'property_type': None,
        'area_sqm': None,
        'rooms': None,
        'floor': None,
        'description': None
    }
    
    # Extract price (handles both USD and AMD)
    price_usd_match = re.search(r'\$([0-9,]+(?:\.[0-9]+)?)', text)
    price_amd_match = re.search(r'([0-9,]+(?:\.[0-9]+)?)\s*֏', text)
    
    if price_usd_match:
        listing['price'] = price_usd_match.group(1).replace(',', '')
        listing['price_currency'] = 'USD'
    elif price_amd_match:
        listing['price'] = price_amd_match.group(1).replace(',', '')
        listing['price_currency'] = 'AMD'
    
    # Extract area (square meters)
    area_match = re.search(r'(\d+)\s*քմ', text)
    if area_match:
        listing['area_sqm'] = area_match.group(1)
    
    # Extract number of rooms
    rooms_match = re.search(r'(\d+)\s*սեն', text)
    if rooms_match:
        listing['rooms'] = rooms_match.group(1)
    
    # Extract floor information
    floor_match = re.search(r'(\d+)/(\d+)\s*հարկ', text)
    if floor_match:
        listing['floor'] = f"{floor_match.group(1)}/{floor_match.group(2)}"
    
    # Extract location (Yerevan districts and other Armenian towns)
    locations = [
        'Կենտրոն', 'Արաբկիր', 'Դավթաշեն', 'Մալաթիա-Սեբաստիա', 
        'Շենգավիթ', 'Նոր Նորք', 'Աջափնյակ', 'Ավան', 'Էրեբունի',
        'Գյումրի', 'Վանաձոր', 'Աբովյան', 'Արտաշատ', 'Գևարք',
        'Ծաղկաձոր', 'Դիլիջան', 'Իջևան', 'Գորիս', 'Կապան'
    ]
    
    for location in locations:
        if location in text:
            listing['location'] = location
            break
    
    # Determine property type based on keywords
    if 'բնակարան' in text:
        listing['property_type'] = 'Apartment'
    elif 'տուն' in text or 'թաունհաուզ' in text:
        listing['property_type'] = 'House'
    elif 'հողատարածք' in text:
        listing['property_type'] = 'Land'
    elif 'ավտոտնակ' in text:
        listing['property_type'] = 'Garage'
    elif 'գրասենյակ' in text:
        listing['property_type'] = 'Office'
    else:
        listing['property_type'] = 'Other'
    
    # Clean description (remove price and location)
    description = text
    price_match = price_usd_match or price_amd_match
    if price_match:
        # Remove the price exactly as it appeared in the text (commas included);
        # listing['price'] has its commas stripped, so it would not match
        description = description.replace(price_match.group(0), '')
    
    if listing['location']:
        description = description.replace(listing['location'], '')
    
    listing['description'] = description.strip()
    
    return listing

def analyze_listings(listings):
    """
    Analyze scraped listings and provide statistics
    
    Args:
        listings (list): List of listing dictionaries
    
    Returns:
        dict: Analysis results
    """
    
    if not listings:
        return {}
    
    df = pd.DataFrame(listings)
    
    # Convert price to numeric for analysis
    df['price_numeric'] = pd.to_numeric(df['price'], errors='coerce')  # commas were already removed
    df['area_numeric'] = pd.to_numeric(df['area_sqm'], errors='coerce')
    df['rooms_numeric'] = pd.to_numeric(df['rooms'], errors='coerce')
    
    analysis = {
        'total_listings': len(listings),
        'unique_locations': df['location'].nunique(),
        'property_types': df['property_type'].value_counts().to_dict(),
        'currency_distribution': df['price_currency'].value_counts().to_dict(),
        'price_stats': {},
        'area_stats': {},
        'location_stats': df['location'].value_counts().head(10).to_dict()
    }
    
    # Price statistics (for USD listings)
    usd_prices = df[df['price_currency'] == 'USD']['price_numeric'].dropna()
    if len(usd_prices) > 0:
        analysis['price_stats']['USD'] = {
            'count': len(usd_prices),
            'mean': round(usd_prices.mean(), 2),
            'median': round(usd_prices.median(), 2),
            'min': usd_prices.min(),
            'max': usd_prices.max()
        }
    
    # Area statistics
    areas = df['area_numeric'].dropna()
    if len(areas) > 0:
        analysis['area_stats'] = {
            'count': len(areas),
            'mean': round(areas.mean(), 2),
            'median': round(areas.median(), 2),
            'min': areas.min(),
            'max': areas.max()
        }
    
    return analysis

# Example usage
print("🏠 Starting List.am Real Estate Scraper...")
print("⚠️  Remember: This is for educational purposes only!")
print("🕐 Adding delays between requests to be respectful...")

# Scrape listings (limiting to 2 pages for demo)
listings = scrape_listam_listings(max_pages=2, delay=3)

print(f"\n📊 Scraping completed! Total listings found: {len(listings)}")

if listings:
    print("\n🏠 Sample listings:")
    print("=" * 80)
    
    for i, listing in enumerate(listings[:5], 1):
        print(f"\n{i}. ID: {listing['id']}")
        print(f"   Type: {listing['property_type']}")
        print(f"   Price: {listing['price'] or 'N/A'} {listing['price_currency'] or ''}")
        print(f"   Location: {listing['location'] or 'N/A'}")
        print(f"   Area: {listing['area_sqm']} sqm" if listing['area_sqm'] else "   Area: N/A")
        print(f"   Rooms: {listing['rooms']}" if listing['rooms'] else "   Rooms: N/A")
        print(f"   Description: {listing['description'][:60]}...")
        print(f"   URL: {listing['url']}")
else:
    print("❌ No listings found")

In [None]:
# Data Analysis and Export
if listings:
    print("\n📈 Analyzing scraped data...")
    
    # Perform analysis
    analysis = analyze_listings(listings)
    
    print(f"\n📊 Analysis Results:")
    print("=" * 60)
    print(f"📋 Total listings: {analysis['total_listings']}")
    print(f"🏙️ Unique locations: {analysis['unique_locations']}")
    
    print(f"\n🏠 Property types:")
    for prop_type, count in analysis['property_types'].items():
        print(f"   {prop_type}: {count}")
    
    print(f"\n💰 Currency distribution:")
    for currency, count in analysis['currency_distribution'].items():
        if currency:  # Skip None values
            print(f"   {currency}: {count}")
    
    if 'USD' in analysis['price_stats']:
        usd_stats = analysis['price_stats']['USD']
        print(f"\n💵 USD Price statistics:")
        print(f"   Count: {usd_stats['count']}")
        print(f"   Average: ${usd_stats['mean']:,.2f}")
        print(f"   Median: ${usd_stats['median']:,.2f}")
        print(f"   Range: ${usd_stats['min']:,.0f} - ${usd_stats['max']:,.0f}")
    
    if analysis['area_stats']:
        area_stats = analysis['area_stats']
        print(f"\n📐 Area statistics (sqm):")
        print(f"   Count: {area_stats['count']}")
        print(f"   Average: {area_stats['mean']:.1f} sqm")
        print(f"   Median: {area_stats['median']:.1f} sqm")
        print(f"   Range: {area_stats['min']} - {area_stats['max']} sqm")
    
    print(f"\n🗺️ Top locations:")
    for location, count in list(analysis['location_stats'].items())[:5]:
        if location:  # Skip None values
            print(f"   {location}: {count}")
    
    # Save data to CSV
    df = pd.DataFrame(listings)
    filename = f'listam_listings_{pd.Timestamp.now().strftime("%Y%m%d_%H%M%S")}.csv'
    df.to_csv(filename, index=False, encoding='utf-8')
    print(f"\n💾 Data saved to: {filename}")
    
    # Save analysis to JSON
    analysis_filename = f'listam_analysis_{pd.Timestamp.now().strftime("%Y%m%d_%H%M%S")}.json'
    with open(analysis_filename, 'w', encoding='utf-8') as f:
        json.dump(analysis, f, ensure_ascii=False, indent=2, default=str)
    print(f"📊 Analysis saved to: {analysis_filename}")
else:
    print("❌ No data to analyze")

In [None]:
# Advanced List.am Scraping Techniques

def scrape_detailed_listing(listing_url, headers=None):
    """
    Scrape detailed information from a single listing page
    
    Args:
        listing_url (str): URL of the specific listing
        headers (dict): HTTP headers to use
    
    Returns:
        dict: Detailed listing information
    """
    
    if headers is None:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
    
    try:
        response = requests.get(listing_url, headers=headers, timeout=10)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Extract detailed information (adjust selectors based on actual page structure)
        details = {
            'url': listing_url,
            'title': None,
            'price': None,
            'description': None,
            'contact_info': None,
            'images': [],
            'features': [],
            'posted_date': None
        }
        
        # Extract title
        title_element = soup.find('h1') or soup.find('title')
        if title_element:
            details['title'] = title_element.get_text(strip=True)
        
        # Extract description
        desc_selectors = [
            'div.description', 
            'div.content', 
            '.item-description',
            'p'
        ]
        
        for selector in desc_selectors:
            desc_element = soup.select_one(selector)
            if desc_element and len(desc_element.get_text(strip=True)) > 50:
                details['description'] = desc_element.get_text(strip=True)
                break
        
        # Extract images
        img_elements = soup.find_all('img', src=True)
        for img in img_elements:
            src = img.get('src')
            if src and ('jpg' in src or 'jpeg' in src or 'png' in src):
                full_img_url = urljoin(listing_url, src)
                details['images'].append(full_img_url)
        
        # Extract contact information (phone numbers)
        text_content = soup.get_text()
        phone_patterns = [
            r'\+374\s?\d{2}\s?\d{3}\s?\d{3}',  # Armenian format
            r'0\d{2}\s?\d{3}\s?\d{3}',        # Local format
            r'\d{2}-\d{2}-\d{2}'              # Alternative format
        ]
        
        for pattern in phone_patterns:
            phones = re.findall(pattern, text_content)
            if phones:
                details['contact_info'] = phones[0]
                break
        
        return details
        
    except Exception as e:
        print(f"❌ Error scraping detailed listing {listing_url}: {e}")
        return {}

def create_price_monitor(target_criteria, check_interval=3600):
    """
    Create a price monitoring system for specific criteria
    
    Args:
        target_criteria (dict): Criteria to monitor (location, max_price, min_area, etc.)
        check_interval (int): Intended check interval in seconds (not used here; scheduling is left to the caller)
    
    Returns:
        function: Monitoring function
    """
    
    def monitor():
        print(f"🔍 Monitoring for: {target_criteria}")
        
        # Get current listings
        current_listings = scrape_listam_listings(max_pages=1, delay=2)
        
        matching_listings = []
        
        for listing in current_listings:
            matches = True
            
            # Check location
            if 'location' in target_criteria:
                if listing['location'] != target_criteria['location']:
                    matches = False
            
            # Check max price
            if 'max_price_usd' in target_criteria and listing['price'] and listing['price_currency'] == 'USD':
                try:
                    price = float(listing['price'].replace(',', ''))
                    if price > target_criteria['max_price_usd']:
                        matches = False
                except ValueError:
                    pass
            
            # Check minimum area
            if 'min_area' in target_criteria and listing['area_sqm']:
                try:
                    area = int(listing['area_sqm'])
                    if area < target_criteria['min_area']:
                        matches = False
                except ValueError:
                    pass
            
            # Check property type
            if 'property_type' in target_criteria:
                if listing['property_type'] != target_criteria['property_type']:
                    matches = False
            
            if matches:
                matching_listings.append(listing)
        
        if matching_listings:
            print(f"🎯 Found {len(matching_listings)} matching listings:")
            for listing in matching_listings:
                print(f"   - {listing['property_type']} in {listing['location']}: {listing['price']} {listing['price_currency']}")
                print(f"     URL: {listing['url']}")
        else:
            print("❌ No matching listings found")
        
        return matching_listings
    
    return monitor

# Example: Monitor for apartments in Kentron under $200,000
print("\n🎯 Setting up price monitoring example...")
monitor_criteria = {
    'location': 'Կենտրոն',
    'max_price_usd': 200000,
    'min_area': 50,
    'property_type': 'Apartment'
}

price_monitor = create_price_monitor(monitor_criteria)

print("\n💡 Price monitor created! You can run price_monitor() to check for matching listings.")
print("🔄 In a real application, you would schedule this to run periodically.")

# Example of running the monitor once (this sends a real request to list.am)
print("\n🏃‍♂️ To run the price monitor once, uncomment the line below.")
# matching = price_monitor()  # Uncomment to run the monitor

### 🔒 List.am Scraping: Ethical Guidelines & Best Practices

#### ⚖️ Legal and Ethical Considerations for List.am:

1. **robots.txt Compliance**: 
   - Check https://www.list.am/robots.txt before scraping
   - Respect crawl delays and disallowed paths

2. **Rate Limiting**: 
   - Use delays between requests (minimum 2-3 seconds)
   - Don't overwhelm the server with concurrent requests
   - Consider scraping during off-peak hours

3. **Data Usage**:
   - Personal data (phone numbers, emails) should be handled carefully
   - Don't republish copyrighted content without permission
   - Use data for analysis, not commercial republication

4. **Respectful Scraping**:
   - Monitor your requests and stop if blocked
   - Use appropriate User-Agent headers
   - Don't scrape more data than you need
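
The rate-limiting advice above can be enforced in code rather than left to scattered `time.sleep()` calls. Here is a minimal sketch using only the standard library; the `Throttle` class and its parameters are hypothetical names introduced for this illustration, and the `clock` and `sleep` arguments exist only so the logic can be exercised without real waiting.

```python
import time

class Throttle:
    """Keep consecutive calls to wait() at least min_interval seconds apart."""

    def __init__(self, min_interval=2.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call = None  # time of the previous wait(), None before the first

    def wait(self):
        """Sleep only as long as needed; return the number of seconds slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (self._clock() - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last_call = self._clock()
        return slept

# Intended usage: call throttle.wait() immediately before every request
throttle = Throttle(min_interval=3.0)
first_pause = throttle.wait()  # the first call never sleeps
print(f"Paused for {first_pause} seconds before the first request")
```

Because the throttle measures the time since the previous request, slow pages are not penalized twice: if parsing already took three seconds, the next request goes out immediately.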

#### 🛠️ Technical Best Practices:

```python
# Good practices for List.am scraping:

# 1. Session management (reuses the connection and headers across requests)
import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
})

# 2. Error handling with retries
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

retry_strategy = Retry(
    total=3,
    backoff_factor=2,
    status_forcelist=[429, 500, 502, 503, 504],
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)

# 3. Respect for robots.txt
import urllib.robotparser

def can_fetch(url):
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.list.am/robots.txt")
    rp.read()
    return rp.can_fetch("*", url)

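# 4. Caching to reduce requests
import hashlib
import pickle
import os
from datetime import datetime, timedelta

def _cache_path(url):
    # Built-in hash() is salted per process for strings, so it cannot name
    # files that must be found again on a later run; use a stable digest.
    digest = hashlib.sha256(url.encode('utf-8')).hexdigest()[:16]
    return f"cache_{digest}.pkl"

def cache_get(url, cache_hours=24):
    cache_file = _cache_path(url)
    
    if os.path.exists(cache_file):
        with open(cache_file, 'rb') as f:
            cached_data, timestamp = pickle.load(f)
        
        if datetime.now() - timestamp < timedelta(hours=cache_hours):
            return cached_data
    
    return None

def cache_set(url, data):
    cache_file = _cache_path(url)
    with open(cache_file, 'wb') as f:
        pickle.dump((data, datetime.now()), f)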
```
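
The `can_fetch` helper above downloads robots.txt again on every call. A minimal sketch of parsing the rules once and reusing the parser, which also reads the `Crawl-delay` value, could look like the block below. The sample rules and the `example.com` URLs are made up so the example runs offline; for a live site you would call `set_url(...)` and `read()` once instead of `parse(...)`.

```python
import urllib.robotparser

# Hypothetical robots.txt content, used here so the example runs offline
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 3
Disallow: /private/
"""

def build_parser(robots_text):
    """Parse robots.txt text once and return a reusable parser."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_text.splitlines())
    return rp

rp = build_parser(SAMPLE_ROBOTS)

allowed = rp.can_fetch("*", "https://example.com/item/123")
blocked = rp.can_fetch("*", "https://example.com/private/data")
delay = rp.crawl_delay("*")  # None when no Crawl-delay is declared

print(allowed, blocked, delay)
```

The returned `delay` can be fed straight into `time.sleep(...)` between requests when it is not `None`.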

#### 🚨 Red Flags to Avoid:

- **Don't** send requests faster than the delay recommended above (2-3 seconds between requests), and never faster than 1 request per second
- **Don't** ignore HTTP status codes (429, 503, etc.)
- **Don't** scrape personal contact information for spam
- **Don't** bypass anti-bot measures aggressively
- **Don't** scrape without checking terms of service
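
As a rough illustration of not ignoring status codes, the sketch below decides how long to pause after a 429 or 503 response. The function name `seconds_to_wait` and the `max_wait` cap are assumptions made for this example. It only handles a numeric `Retry-After` header; the HTTP-date form is not parsed and falls back to exponential backoff.

```python
import random

def seconds_to_wait(status_code, headers, attempt, max_wait=60.0):
    """Return how long to pause before retrying, or None if no pause is needed.

    status_code: HTTP status of the last response
    headers: mapping of response headers (for requests, response.headers)
    attempt: zero-based retry counter
    """
    if status_code not in (429, 503):
        return None  # not a "slow down" signal

    retry_after = headers.get("Retry-After", "")
    if retry_after.isdigit():
        # The server said how many seconds to wait; cap it at max_wait
        return min(float(retry_after), max_wait)

    # Otherwise fall back to exponential backoff with a little jitter
    return min((2 ** attempt) + random.uniform(0, 1), max_wait)

print(seconds_to_wait(200, {}, attempt=0))
print(seconds_to_wait(429, {"Retry-After": "5"}, attempt=0))
```

A scraping loop would call this after each response and `time.sleep(...)` for the returned value before trying again.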

#### 💡 Alternative Approaches:

1. **Official APIs**: Check if List.am offers an API
2. **RSS Feeds**: Look for RSS/XML feeds for listings
3. **Partnership**: Contact List.am for data partnership
4. **Manual Collection**: For small datasets, manual collection might be appropriate
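
If a site does publish an RSS feed, reading it is usually simpler and lighter than parsing HTML. The sketch below parses a made-up feed with the standard library. Whether List.am actually offers such a feed is not assumed here, and the titles and `example.com` links are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up feed content; a real feed URL would have to be confirmed with the site
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example listings</title>
    <item>
      <title>Apartment for rent</title>
      <link>https://example.com/item/1</link>
    </item>
    <item>
      <title>Used laptop</title>
      <link>https://example.com/item/2</link>
    </item>
  </channel>
</rss>
"""

def parse_rss_items(xml_text):
    """Return a list of {'title', 'link'} dicts, one per <item> element."""
    root = ET.fromstring(xml_text)
    return [
        {
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
        }
        for item in root.iter("item")
    ]

items = parse_rss_items(SAMPLE_RSS)
for entry in items:
    print(entry["title"], "->", entry["link"])
```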

### 🎯 Advanced Beautiful Soup Techniques

#### 1. Different Search and Navigation Methods:

In [None]:
# Advanced Beautiful Soup techniques
from bs4 import BeautifulSoup
import re

sample_html = """
<div class="container">
    <div class="product" data-price="29.99" data-category="electronics">
        <h3>Smartphone</h3>
        <p class="description">Latest smartphone with amazing features</p>
        <span class="price">$29.99</span>
        <div class="reviews">
            <span class="rating">4.5</span>
            <span class="review-count">(150 reviews)</span>
        </div>
    </div>
    
    <div class="product" data-price="599.99" data-category="electronics">
        <h3>Laptop</h3>
        <p class="description">High-performance laptop for professionals</p>
        <span class="price">$599.99</span>
        <div class="reviews">
            <span class="rating">4.8</span>
            <span class="review-count">(89 reviews)</span>
        </div>
    </div>
    
    <article class="blog-post">
        <h2>Tech News</h2>
        <p>Latest technology trends and updates...</p>
        <time datetime="2025-01-15">January 15, 2025</time>
    </article>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

print("🔧 Advanced Beautiful Soup Techniques:")
print("=" * 50)

# 1. Find with attributes
print("\n1️⃣ Finding by attributes:")
expensive_products = soup.find_all('div', {'data-price': lambda x: x and float(x) > 100})
for product in expensive_products:
    name = product.h3.text
    price = product.get('data-price')
    print(f"   {name}: ${price}")

# 2. Using regular expressions
print("\n2️⃣ Using regex patterns:")
price_spans = soup.find_all('span', string=re.compile(r'\$\d+\.\d+'))
for span in price_spans:
    print(f"   Found price: {span.text}")

# 3. CSS selectors advanced
print("\n3️⃣ Advanced CSS selectors:")
# :has(.rating) keeps products that contain a .rating element;
# the "above 4.5" filter itself is applied in Python below
high_rated = soup.select('div.product:has(.rating)')
for product in high_rated:
    name = product.h3.text
    rating = product.select_one('.rating').text
    if float(rating) > 4.5:
        print(f"   High-rated: {name} ({rating}⭐)")

# 4. Parent and sibling navigation
print("\n4️⃣ Navigation between elements:")
rating_element = soup.find('span', class_='rating')
if rating_element:
    # Get parent
    reviews_div = rating_element.parent
    print(f"   Parent element: {reviews_div.name}")
    
    # Get sibling
    review_count = rating_element.find_next_sibling('span')
    print(f"   Review count: {review_count.text}")

# 5. Extracting numbers from text
print("\n5️⃣ Extracting numbers from text:")
review_texts = soup.find_all('span', class_='review-count')
for review in review_texts:
    # Extract number using regex
    numbers = re.findall(r'\d+', review.text)
    if numbers:
        print(f"   Reviews: {numbers[0]}")

# 6. Custom filters
print("\n6️⃣ Custom filters:")
def has_class_and_data_price(tag):
    return tag.has_attr('class') and tag.has_attr('data-price')

products_with_price = soup.find_all(has_class_and_data_price)
for product in products_with_price:
    print(f"   Product: {product.h3.text}, Price: ${product['data-price']}")

## 🚗 Selenium - For Dynamic Content

Selenium is essential when dealing with JavaScript-heavy websites that load content dynamically.

### When to use Selenium:
- Websites with JavaScript-rendered content
- Sites requiring interaction (clicking, scrolling, forms)
- Single Page Applications (SPAs)
- Content loaded via AJAX
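
One rough way to decide whether Selenium is needed is to check if the element you want already appears in the HTML the server sends. The sketch below does this with the standard library on two made-up HTML strings; `needs_browser` and `ClassCollector` are hypothetical helper names, and in practice `raw_html` would be the text of a plain `requests.get(...)` response.

```python
from html.parser import HTMLParser

class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())

def needs_browser(raw_html, expected_class):
    """Return True if expected_class is missing from the server-sent HTML."""
    collector = ClassCollector()
    collector.feed(raw_html)
    return expected_class not in collector.classes

# Made-up pages: one rendered on the server, one filled in by JavaScript
static_page = '<div class="quote"><span class="text">Hi</span></div>'
js_page = '<div id="app"></div><script>var data = [{"text": "Hi"}];</script>'

print(needs_browser(static_page, "quote"))  # False
print(needs_browser(js_page, "quote"))      # True
```

When the check returns `False`, `requests` plus Beautiful Soup is usually enough and much faster than driving a browser.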

### Installation:

In [None]:
# Install Selenium and WebDriver
!pip install selenium webdriver-manager

In [None]:
# Basic Selenium example
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
import time

def selenium_scraping_example():
    """Example of scraping with Selenium"""
    
    # Setup Chrome options
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Run in background
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    
    try:
        # Setup WebDriver
        service = Service(ChromeDriverManager().install())
        driver = webdriver.Chrome(service=service, options=chrome_options)
        
        print("🚗 Starting Selenium WebDriver...")
        
        # Navigate to website
        url = "https://quotes.toscrape.com/js/"  # JavaScript version
        driver.get(url)
        
        # Wait for content to load
        wait = WebDriverWait(driver, 10)
        quotes = wait.until(EC.presence_of_all_elements_located((By.CLASS_NAME, "quote")))
        
        print(f"📄 Found {len(quotes)} quotes on the page")
        
        scraped_quotes = []
        
        # Extract data
        for i, quote in enumerate(quotes[:3], 1):  # First 3 quotes
            text_element = quote.find_element(By.CLASS_NAME, "text")
            author_element = quote.find_element(By.CLASS_NAME, "author")
            tags_elements = quote.find_elements(By.CLASS_NAME, "tag")
            
            quote_data = {
                'text': text_element.text,
                'author': author_element.text,
                'tags': [tag.text for tag in tags_elements]
            }
            
            scraped_quotes.append(quote_data)
            
            print(f"\nQuote {i}:")
            print(f"  Text: {quote_data['text']}")
            print(f"  Author: {quote_data['author']}")
            print(f"  Tags: {', '.join(quote_data['tags'])}")
        
        # Check for a "Next" link; find_elements returns an empty list
        # instead of raising when nothing matches
        next_buttons = driver.find_elements(By.PARTIAL_LINK_TEXT, "Next")
        if next_buttons:
            print("\n🔄 'Next' button found (not clicking in this example)")
        else:
            print("\n❌ No 'Next' button found")
        
        return scraped_quotes
        
    except Exception as e:
        print(f"❌ Error: {e}")
        return []
    
    finally:
        # Always close the driver
        if 'driver' in locals():
            driver.quit()
            print("\n🔒 WebDriver closed")

# Note: This example might not work in all environments due to browser setup
# In Colab/Jupyter, you might need additional setup for Chrome/ChromeDriver
print("🚨 Note: This Selenium example requires proper Chrome/ChromeDriver setup")
print("💡 In production environments, make sure you have the necessary dependencies installed")

# Uncomment the line below to run the example (if Chrome is available)
# selenium_results = selenium_scraping_example()

# 🚀 Parallel Web Scraping & Multiprocessing

When scraping large amounts of data, performance becomes crucial. Python's multiprocessing and libraries like `joblib` allow us to speed up scraping by processing multiple URLs simultaneously.

## 🧠 Why Use Parallel Processing?

**Sequential Processing:**
- Scrapes one URL at a time
- Total time = (number of URLs) × (average time per URL)
- Most of the time is spent waiting for network responses, so CPU cores sit idle

**Parallel Processing:**
- Scrapes multiple URLs simultaneously
- Total time ≈ (number of URLs) ÷ (number of workers) × (average time per URL)
- Better resource utilization

⚠️ **Important**: Always respect websites' rate limits and robots.txt when using parallel processing!
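
The two formulas above can be turned into a quick back-of-the-envelope estimate. This is only a sketch: it assumes every URL takes the same time and ignores startup overhead and rate limits, so real speedups will be smaller. The function name and the numbers are illustrative.

```python
import math

def estimate_scrape_time(n_urls, seconds_per_url, workers=1):
    """Rough wall-clock estimate: URLs are processed in 'waves' of `workers`."""
    waves = math.ceil(n_urls / workers)
    return waves * seconds_per_url

sequential = estimate_scrape_time(100, 1.5)            # 100 waves of 1.5 s
parallel = estimate_scrape_time(100, 1.5, workers=4)   # 25 waves of 1.5 s

print(f"Sequential: {sequential:.1f}s, parallel (4 workers): {parallel:.1f}s")
```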

## 🔧 Basic Multiprocessing Concepts

Before applying multiprocessing to web scraping, let's understand the basics with simple examples.

In [None]:
import time
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import requests
from joblib import Parallel, delayed

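# Example 1: CPU-intensive task (Sequential vs Parallel)
# Note: with the "spawn" start method (the default on Windows and macOS), worker
# processes may fail to import functions defined in a notebook cell; if that
# happens, move square_number into a .py module and import it.
def square_number(n):
    """Stand-in for CPU-intensive work"""
    time.sleep(0.1)  # Sleep only mimics the duration; real CPU work would compute here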
    return n ** 2

def demonstrate_multiprocessing():
    numbers = list(range(1, 21))  # 1 to 20
    
    # Sequential processing
    print("🐌 Sequential Processing:")
    start_time = time.time()
    sequential_results = [square_number(n) for n in numbers]
    sequential_time = time.time() - start_time
    print(f"   Time taken: {sequential_time:.2f} seconds")
    print(f"   Results: {sequential_results[:5]}... (showing first 5)")
    
    # Parallel processing with multiprocessing
    print("\n⚡ Parallel Processing (multiprocessing):")
    start_time = time.time()
    with ProcessPoolExecutor(max_workers=4) as executor:
        parallel_results = list(executor.map(square_number, numbers))
    parallel_time = time.time() - start_time
    print(f"   Time taken: {parallel_time:.2f} seconds")
    print(f"   Results: {parallel_results[:5]}... (showing first 5)")
    print(f"   Speedup: {sequential_time/parallel_time:.2f}x faster")

# Run the demonstration
demonstrate_multiprocessing()
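
Web scraping is mostly I/O-bound (waiting on the network), so threads are often a better fit than processes. The sketch below uses `ThreadPoolExecutor`, which the cell above imports but does not use. `fake_download` is a made-up stand-in that sleeps instead of making a real request.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(page_number):
    """Stand-in for a network request: the thread just waits."""
    time.sleep(0.05)
    return f"page-{page_number}"

pages = list(range(1, 9))

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    # map() returns results in the same order as the inputs
    downloaded = list(executor.map(fake_download, pages))
elapsed = time.perf_counter() - start

print(downloaded[:3])
print(f"8 simulated downloads with 4 threads: {elapsed:.2f}s")
```

Threads avoid the pickling and process startup costs of `ProcessPoolExecutor`, and while one thread waits on a response the others can proceed.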

## 📦 Introduction to Joblib

`joblib` is a powerful library that makes parallel computing easy and efficient. It's particularly great for:
- CPU-bound tasks
- Machine learning workloads
- Data processing pipelines

**Key advantages:**
- Simple API: `Parallel(n_jobs=-1)(delayed(function)(args) for args in data)`
- Automatic memory optimization
- Built-in progress tracking
- Works well with NumPy arrays

In [None]:
# Install joblib if not already installed
# !pip install joblib

from joblib import Parallel, delayed
import time

def process_data(x):
    """Simulate data processing"""
    time.sleep(0.05)
    return x ** 3 + 2 * x ** 2 + x + 1

def demonstrate_joblib():
    data = list(range(1, 51))  # 1 to 50
    
    print("🔧 Joblib Examples:")
    
    # Sequential processing
    print("\n🐌 Sequential Processing:")
    start_time = time.time()
    sequential_results = [process_data(x) for x in data]
    sequential_time = time.time() - start_time
    print(f"   Time taken: {sequential_time:.2f} seconds")
    
    # Parallel processing with joblib (all CPU cores)
    print("\n⚡ Joblib Parallel (all cores):")
    start_time = time.time()
    parallel_results = Parallel(n_jobs=-1)(delayed(process_data)(x) for x in data)
    parallel_time = time.time() - start_time
    print(f"   Time taken: {parallel_time:.2f} seconds")
    print(f"   Speedup: {sequential_time/parallel_time:.2f}x faster")
    
    # Parallel processing with specific number of workers
    print("\n⚡ Joblib Parallel (4 workers):")
    start_time = time.time()
    parallel_results_4 = Parallel(n_jobs=4)(delayed(process_data)(x) for x in data)
    parallel_time_4 = time.time() - start_time
    print(f"   Time taken: {parallel_time_4:.2f} seconds")
    
    # With verbose progress tracking
    print("\n📊 Joblib with Progress Tracking:")
    start_time = time.time()
    parallel_results_verbose = Parallel(n_jobs=4, verbose=1)(
        delayed(process_data)(x) for x in data
    )
    verbose_time = time.time() - start_time
    print(f"   Time taken: {verbose_time:.2f} seconds")
    
    # Verify results are the same
    print(f"\n✅ Results match: {sequential_results == parallel_results}")

# Run joblib demonstration
demonstrate_joblib()

## 🌐 Parallel Web Scraping Examples

Now let's apply these concepts to web scraping. We'll compare sequential vs parallel approaches for scraping multiple URLs.

In [None]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import time
from joblib import Parallel, delayed

def scrape_single_url(url, timeout=10):
    """Scrape a single URL and extract basic information"""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'html.parser')
        
        # Extract basic information
        title = soup.find('title')
        title_text = title.get_text(strip=True) if title else "No title"
        
        # Count paragraphs
        paragraphs = soup.find_all('p')
        paragraph_count = len(paragraphs)
        
        # Count links
        links = soup.find_all('a', href=True)
        link_count = len(links)
        
        # Get first paragraph text (if available)
        first_paragraph = ""
        if paragraphs:
            text = paragraphs[0].get_text(strip=True)
            # Append "..." only when the text was actually truncated
            first_paragraph = text[:200] + ("..." if len(text) > 200 else "")
        
        return {
            'url': url,
            'title': title_text[:100],  # Limit title length
            'status': 'success',
            'paragraph_count': paragraph_count,
            'link_count': link_count,
            'first_paragraph': first_paragraph,
            'response_time': response.elapsed.total_seconds()
        }
        
    except requests.exceptions.RequestException as e:
        return {
            'url': url,
            'title': None,
            'status': 'error',
            'error': str(e),
            'paragraph_count': 0,
            'link_count': 0,
            'first_paragraph': '',
            'response_time': None
        }
    except Exception as e:
        return {
            'url': url,
            'title': None,
            'status': 'error',
            'error': f"Parsing error: {str(e)}",
            'paragraph_count': 0,
            'link_count': 0,
            'first_paragraph': '',
            'response_time': None
        }

def scrape_urls_sequential(urls):
    """Scrape URLs one by one (sequential)"""
    print("🐌 Sequential scraping...")
    start_time = time.time()
    
    results = []
    for i, url in enumerate(urls, 1):
        print(f"   Scraping {i}/{len(urls)}: {url[:50]}...")
        result = scrape_single_url(url)
        results.append(result)
        time.sleep(1)  # Be respectful - add delay
    
    total_time = time.time() - start_time
    print(f"   Sequential time: {total_time:.2f} seconds")
    return results, total_time

def scrape_urls_parallel_joblib(urls, n_jobs=4):
    """Scrape URLs in parallel using joblib"""
    print(f"⚡ Parallel scraping with joblib ({n_jobs} workers)...")
    start_time = time.time()
    
    # Add delays in parallel execution too (but spread out)
    def scrape_with_delay(url, delay_factor):
        time.sleep(delay_factor * 0.5)  # Staggered delays
        return scrape_single_url(url)
    
    # Create delay factors for staggered requests (one slot per worker)
    delay_factors = [i % max(n_jobs, 1) for i in range(len(urls))]
    
    results = Parallel(n_jobs=n_jobs, verbose=1)(
        delayed(scrape_with_delay)(url, delay) 
        for url, delay in zip(urls, delay_factors)
    )
    
    total_time = time.time() - start_time
    print(f"   Parallel time: {total_time:.2f} seconds")
    return results, total_time

# Test URLs (using public APIs and websites that allow scraping)
test_urls = [
    'https://httpbin.org/html',
    'https://httpbin.org/json',
    'https://jsonplaceholder.typicode.com/posts/1',
    'https://jsonplaceholder.typicode.com/posts/2',
    'https://httpbin.org/xml',
    'https://httpbin.org/robots.txt',
    'https://jsonplaceholder.typicode.com/users/1',
    'https://jsonplaceholder.typicode.com/users/2'
]

print("🌐 Web Scraping Performance Comparison")
print("=" * 50)

# Sequential scraping
sequential_results, seq_time = scrape_urls_sequential(test_urls)

print("\n" + "=" * 50)

# Parallel scraping
parallel_results, par_time = scrape_urls_parallel_joblib(test_urls, n_jobs=4)

# Compare results
print(f"\n📊 Performance Summary:")
print(f"   URLs scraped: {len(test_urls)}")
print(f"   Sequential time: {seq_time:.2f} seconds")
print(f"   Parallel time: {par_time:.2f} seconds")
print(f"   Speedup: {seq_time/par_time:.2f}x faster")

# Show success rates
seq_success = sum(1 for r in sequential_results if r['status'] == 'success')
par_success = sum(1 for r in parallel_results if r['status'] == 'success')

print(f"\n✅ Success Rates:")
print(f"   Sequential: {seq_success}/{len(test_urls)} ({seq_success/len(test_urls)*100:.1f}%)")
print(f"   Parallel: {par_success}/{len(test_urls)} ({par_success/len(test_urls)*100:.1f}%)")

# Show sample results
print(f"\n📄 Sample Results (first 3):")
for i, result in enumerate(parallel_results[:3]):
    print(f"   {i+1}. {result['url']}")
    print(f"      Title: {result['title']}")
    print(f"      Status: {result['status']}")
    if result['status'] == 'success':
        print(f"      Paragraphs: {result['paragraph_count']}, Links: {result['link_count']}")
    print()

## 🛡️ Advanced Parallel Scraping with Rate Limiting

When scraping real websites, we need to be more careful about rate limiting, error handling, and respecting server resources.

In [None]:
import random
from threading import Lock
import threading
from datetime import datetime, timedelta

class RateLimitedScraper:
    """A rate-limited web scraper with parallel processing capabilities"""
    
    def __init__(self, requests_per_second=2, max_retries=3):
        self.requests_per_second = requests_per_second
        self.max_retries = max_retries
        self.last_request_time = {}
        self.lock = Lock()
        
    def wait_if_needed(self, domain):
        """Implement rate limiting per domain"""
        with self.lock:
            now = datetime.now()
            if domain in self.last_request_time:
                time_since_last = (now - self.last_request_time[domain]).total_seconds()
                min_interval = 1.0 / self.requests_per_second
                
                if time_since_last < min_interval:
                    sleep_time = min_interval - time_since_last
                    time.sleep(sleep_time)
            
            self.last_request_time[domain] = datetime.now()
    
    def extract_domain(self, url):
        """Extract domain from URL"""
        from urllib.parse import urlparse
        return urlparse(url).netloc
    
    def scrape_with_retries(self, url):
        """Scrape URL with retry logic and rate limiting"""
        domain = self.extract_domain(url)
        
        for attempt in range(self.max_retries):
            try:
                # Implement rate limiting
                self.wait_if_needed(domain)
                
                headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
                }
                
                response = requests.get(url, headers=headers, timeout=15)
                response.raise_for_status()
                
                soup = BeautifulSoup(response.content, 'html.parser')
                
                # Extract comprehensive data
                result = {
                    'url': url,
                    'status': 'success',
                    'attempt': attempt + 1,
                    'timestamp': datetime.now().isoformat(),
                    'response_code': response.status_code,
                    'content_length': len(response.content),
                    'title': '',
                    'meta_description': '',
                    'headings': {},
                    'link_count': 0,
                    'image_count': 0,
                    'form_count': 0,
                    'text_content_length': 0
                }
                
                # Extract title
                title_tag = soup.find('title')
                if title_tag:
                    result['title'] = title_tag.get_text(strip=True)
                
                # Extract meta description
                meta_desc = soup.find('meta', attrs={'name': 'description'})
                if meta_desc:
                    result['meta_description'] = meta_desc.get('content', '')
                
                # Count different elements
                result['link_count'] = len(soup.find_all('a', href=True))
                result['image_count'] = len(soup.find_all('img'))
                result['form_count'] = len(soup.find_all('form'))
                
                # Count headings
                for i in range(1, 7):
                    headings = soup.find_all(f'h{i}')
                    if headings:
                        result['headings'][f'h{i}'] = len(headings)
                
                # Get text content length
                text_content = soup.get_text(strip=True)
                result['text_content_length'] = len(text_content)
                
                return result
                
            except requests.exceptions.RequestException as e:
                if attempt == self.max_retries - 1:  # Last attempt
                    return {
                        'url': url,
                        'status': 'error',
                        'error': str(e),
                        'attempt': attempt + 1,
                        'timestamp': datetime.now().isoformat()
                    }
                else:
                    # Wait before retry (exponential backoff)
                    wait_time = (2 ** attempt) + random.uniform(0, 1)
                    time.sleep(wait_time)
            
            except Exception as e:
                return {
                    'url': url,
                    'status': 'error',
                    'error': f"Unexpected error: {str(e)}",
                    'attempt': attempt + 1,
                    'timestamp': datetime.now().isoformat()
                }

def parallel_scrape_with_rate_limiting(urls, n_jobs=3, requests_per_second=2):
    """Scrape URLs in parallel with rate limiting"""
    scraper = RateLimitedScraper(requests_per_second=requests_per_second)
    
    print(f"🚀 Advanced Parallel Scraping:")
    print(f"   URLs: {len(urls)}")
    print(f"   Workers: {n_jobs}")
    print(f"   Rate limit: {requests_per_second} requests/second per domain")
    
    start_time = time.time()
    
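    # Use the threading backend: the scraper holds a threading.Lock, which cannot
    # be pickled for process-based workers, and threads share one rate-limit state
    results = Parallel(n_jobs=n_jobs, backend="threading", verbose=1)(
        delayed(scraper.scrape_with_retries)(url) for url in urls
    )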
    
    total_time = time.time() - start_time
    
    # Analyze results
    successful = [r for r in results if r['status'] == 'success']
    failed = [r for r in results if r['status'] == 'error']
    
    print(f"\n📊 Scraping Summary:")
    print(f"   Total time: {total_time:.2f} seconds")
    print(f"   Average time per URL: {total_time/len(urls):.2f} seconds")
    print(f"   Successful: {len(successful)}/{len(urls)} ({len(successful)/len(urls)*100:.1f}%)")
    print(f"   Failed: {len(failed)}/{len(urls)} ({len(failed)/len(urls)*100:.1f}%)")
    
    if successful:
        avg_content_length = sum(r['content_length'] for r in successful) / len(successful)
        total_links = sum(r['link_count'] for r in successful)
        total_images = sum(r['image_count'] for r in successful)
        
        print(f"\n📄 Content Analysis:")
        print(f"   Average content length: {avg_content_length:.0f} bytes")
        print(f"   Total links found: {total_links}")
        print(f"   Total images found: {total_images}")
    
    if failed:
        print(f"\n❌ Failed URLs:")
        for fail in failed[:3]:  # Show first 3 failures
            print(f"   {fail['url']}: {fail.get('error', 'Unknown error')}")
    
    return results

# Example with mixed domains (rate limiting will be applied per domain)
mixed_urls = [
    'https://httpbin.org/html',
    'https://httpbin.org/json',
    'https://httpbin.org/xml',
    'https://jsonplaceholder.typicode.com/posts/1',
    'https://jsonplaceholder.typicode.com/posts/2',
    'https://jsonplaceholder.typicode.com/users/1',
    'https://httpbin.org/robots.txt',
    'https://httpbin.org/user-agent',
    'https://jsonplaceholder.typicode.com/comments/1',
    'https://httpbin.org/headers'
]

# Run advanced parallel scraping
results = parallel_scrape_with_rate_limiting(
    mixed_urls, 
    n_jobs=3, 
    requests_per_second=2
)

# Show detailed results for successful scrapes
print(f"\n📋 Detailed Results (first 3 successful):")
successful_results = [r for r in results if r['status'] == 'success']
for i, result in enumerate(successful_results[:3]):
    print(f"\n{i+1}. {result['url']}")
    print(f"   Title: {result['title'][:60]}...")
    print(f"   Response Code: {result['response_code']}")
    print(f"   Content Length: {result['content_length']:,} bytes")
    print(f"   Links: {result['link_count']}, Images: {result['image_count']}")
    if result['headings']:
        print(f"   Headings: {result['headings']}")
    print(f"   Attempt: {result['attempt']}")
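
In `wait_if_needed` above, the thread sleeps while holding the lock, so a wait for one domain also blocks requests to every other domain. One possible alternative, sketched below under that observation, is to reserve a time slot while holding the lock and sleep only after releasing it. `SlotRateLimiter` and its `clock` parameter are hypothetical names introduced for this example; the frozen clock and the `example` hostnames exist only to make the demo deterministic.

```python
import threading
import time
from urllib.parse import urlparse

class SlotRateLimiter:
    """Per-domain limiter: reserve a slot under the lock, sleep outside it."""

    def __init__(self, requests_per_second=2.0, clock=time.monotonic):
        self.min_interval = 1.0 / requests_per_second
        self.clock = clock
        self.next_slot = {}  # domain -> earliest start time of the next request
        self.lock = threading.Lock()

    def reserve(self, url):
        """Return how many seconds the caller should wait before requesting url."""
        domain = urlparse(url).netloc
        with self.lock:
            now = self.clock()
            slot = max(now, self.next_slot.get(domain, now))
            self.next_slot[domain] = slot + self.min_interval
        return slot - now

    def wait(self, url):
        delay = self.reserve(url)
        if delay > 0:
            time.sleep(delay)  # the lock is already released here

# Demo with a frozen clock so the printed waits are deterministic
limiter = SlotRateLimiter(requests_per_second=2.0, clock=lambda: 100.0)
print(limiter.reserve("https://a.example/page1"))  # 0.0
print(limiter.reserve("https://a.example/page2"))  # 0.5
print(limiter.reserve("https://b.example/page1"))  # 0.0
```

In a scraper, each worker thread would call `limiter.wait(url)` right before `requests.get(url)`.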

## 🛒 Real-World Example: Parallel E-commerce Data Scraping

Let's create a practical example that simulates scraping product data from multiple pages, using parallel processing to handle large datasets efficiently.

In [None]:
import pandas as pd
import json
from pathlib import Path

class EcommerceScraper:
    """Simulate e-commerce product scraping with parallel processing"""
    
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        })
    
    def simulate_product_page(self, product_id):
        """Simulate scraping a product page"""
        # In real scraping, this would fetch from actual URLs
        # For demo purposes, we'll simulate data
        
        time.sleep(random.uniform(0.5, 2.0))  # Simulate network delay
        
        # Simulate occasional failures
        if random.random() < 0.1:  # 10% failure rate
            raise requests.exceptions.RequestException(f"Failed to load product {product_id}")
        
        # Generate simulated product data
        categories = ['Electronics', 'Clothing', 'Books', 'Home & Garden', 'Sports']
        brands = ['BrandA', 'BrandB', 'BrandC', 'BrandD', 'BrandE']
        
        product = {
            'product_id': product_id,
            'name': f'Product {product_id}',
            'price': round(random.uniform(10, 500), 2),
            'category': random.choice(categories),
            'brand': random.choice(brands),
            'rating': round(random.uniform(1, 5), 1),
            'review_count': random.randint(0, 1000),
            'in_stock': random.choice([True, False]),
            'description_length': random.randint(100, 1000),
            'image_count': random.randint(1, 10),
            'scrape_timestamp': datetime.now().isoformat()
        }
        
        return product
    
    def scrape_product_batch(self, product_ids):
        """Scrape a batch of product IDs"""
        results = []
        batch_start = time.time()
        
        for product_id in product_ids:
            try:
                product = self.simulate_product_page(product_id)
                product['status'] = 'success'
                results.append(product)
            except Exception as e:
                results.append({
                    'product_id': product_id,
                    'status': 'error',
                    'error': str(e),
                    'scrape_timestamp': datetime.now().isoformat()
                })
        
        batch_time = time.time() - batch_start
        return results, batch_time

def parallel_ecommerce_scraping(product_ids, batch_size=50, n_jobs=4):
    """Scrape e-commerce products in parallel batches"""
    
    # Split product IDs into batches
    batches = [product_ids[i:i + batch_size] for i in range(0, len(product_ids), batch_size)]
    
    print(f"🛒 E-commerce Parallel Scraping:")
    print(f"   Total products: {len(product_ids)}")
    print(f"   Batch size: {batch_size}")
    print(f"   Number of batches: {len(batches)}")
    print(f"   Parallel workers: {n_jobs}")
    
    scraper = EcommerceScraper()
    
    start_time = time.time()
    
    # Process batches in parallel
    batch_results = Parallel(n_jobs=n_jobs, verbose=1)(
        delayed(scraper.scrape_product_batch)(batch) for batch in batches
    )
    
    total_time = time.time() - start_time
    
    # Flatten results
    all_products = []
    total_batch_time = 0
    
    for results, batch_time in batch_results:
        all_products.extend(results)
        total_batch_time += batch_time
    
    # Analyze results
    successful_products = [p for p in all_products if p['status'] == 'success']
    failed_products = [p for p in all_products if p['status'] == 'error']
    
    print(f"\n📊 Scraping Results:")
    print(f"   Total time: {total_time:.2f} seconds")
    print(f"   Products/second: {len(product_ids)/total_time:.2f}")
    print(f"   Successful: {len(successful_products)}/{len(product_ids)} ({len(successful_products)/len(product_ids)*100:.1f}%)")
    print(f"   Failed: {len(failed_products)}/{len(product_ids)} ({len(failed_products)/len(product_ids)*100:.1f}%)")
    
    return successful_products, failed_products

def analyze_scraped_products(products):
    """Analyze the scraped product data"""
    if not products:
        print("❌ No products to analyze")
        return
    
    df = pd.DataFrame(products)
    
    print(f"\n📈 Product Data Analysis:")
    print(f"   Dataset shape: {df.shape}")
    
    # Price analysis
    if 'price' in df.columns:
        print(f"\n💰 Price Statistics:")
        print(f"   Average price: ${df['price'].mean():.2f}")
        print(f"   Median price: ${df['price'].median():.2f}")
        print(f"   Price range: ${df['price'].min():.2f} - ${df['price'].max():.2f}")
        
    # Category distribution
    if 'category' in df.columns:
        print(f"\n📂 Category Distribution:")
        category_counts = df['category'].value_counts()
        for category, count in category_counts.items():
            print(f"   {category}: {count} products ({count/len(df)*100:.1f}%)")
    
    # Brand analysis
    if 'brand' in df.columns:
        print(f"\n🏷️ Top Brands:")
        brand_counts = df['brand'].value_counts().head(5)
        for brand, count in brand_counts.items():
            print(f"   {brand}: {count} products")
    
    # Stock status
    if 'in_stock' in df.columns:
        in_stock_count = df['in_stock'].sum()
        print(f"\n📦 Stock Status:")
        print(f"   In stock: {in_stock_count}/{len(df)} ({in_stock_count/len(df)*100:.1f}%)")
        print(f"   Out of stock: {len(df)-in_stock_count}/{len(df)} ({(len(df)-in_stock_count)/len(df)*100:.1f}%)")
    
    # Rating analysis
    if 'rating' in df.columns:
        print(f"\n⭐ Rating Statistics:")
        print(f"   Average rating: {df['rating'].mean():.2f}/5.0")
        print(f"   Ratings >= 4.0: {(df['rating'] >= 4.0).sum()}/{len(df)} ({(df['rating'] >= 4.0).sum()/len(df)*100:.1f}%)")
    
    return df

# Generate sample product IDs (simulating large dataset)
product_ids = [f"PROD_{i:06d}" for i in range(1, 501)]  # 500 products

print("🚀 Starting E-commerce Parallel Scraping Demo...")

# Run parallel scraping
successful_products, failed_products = parallel_ecommerce_scraping(
    product_ids, 
    batch_size=50, 
    n_jobs=4
)

# Analyze the results
df_products = analyze_scraped_products(successful_products)

# Save results
if successful_products:
    # Save to JSON
    output_file = 'scraped_products.json'
    with open(output_file, 'w') as f:
        json.dump(successful_products, f, indent=2)
    
    # Save to CSV
    csv_file = 'scraped_products.csv'
    df_products.to_csv(csv_file, index=False)
    
    print(f"\n💾 Data saved:")
    print(f"   JSON: {output_file}")
    print(f"   CSV: {csv_file}")

# Show sample products
if successful_products:
    print(f"\n🛍️ Sample Products:")
    for i, product in enumerate(successful_products[:3]):
        print(f"\n{i+1}. {product['name']} (ID: {product['product_id']})")
        print(f"   Price: ${product['price']}")
        print(f"   Category: {product['category']}")
        print(f"   Brand: {product['brand']}")
        print(f"   Rating: {product['rating']}/5.0 ({product['review_count']} reviews)")
        print(f"   In Stock: {'✅' if product['in_stock'] else '❌'}")

In [None]:
# Example robots.txt to inspect: https://www.ysu.am/robots.txt

## ⚡ Performance Optimization & Best Practices

### 🎯 Choosing the Right Approach

| Method | Best For | Pros | Cons |
|--------|----------|------|------|
| **Sequential** | Small datasets, strict rate limits | Simple, predictable | Slow for large datasets |
| **Threading** | I/O-bound tasks, many small requests | Good for network-bound tasks | GIL prevents parallel CPU-bound work |
| **Multiprocessing** | CPU-intensive parsing | True parallelism | Higher memory usage |
| **Joblib** | Balanced approach, data science tasks | Easy to use, optimized | Extra dependency |

### 🛡️ Rate Limiting Strategies

```python
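import random
import time

# 1. Fixed delay between requests
time.sleep(1)

# 2. Random delay (more human-like)
time.sleep(random.uniform(0.5, 2.0))

# 3. Exponential backoff with jitter on errors
#    `attempt` is the zero-based retry counter: waits of ~1s, ~2s, ~4s, ...
attempt = 0
wait_time = (2 ** attempt) + random.uniform(0, 1)
time.sleep(wait_time)

# 4. Domain-specific rate limiting
# Different limits for different websites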
```
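Strategy 4 can be sketched as a small helper that remembers when each domain was last contacted. This is a minimal illustration, not a production limiter: the class name `DomainRateLimiter`, its parameters, and the example domains are all hypothetical, and the class is not thread-safe as written.

```python
import time
from urllib.parse import urlparse


class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, default_delay=1.0, per_domain=None,
                 clock=time.monotonic, sleep=time.sleep):
        self.default_delay = default_delay   # seconds, used for unlisted domains
        self.per_domain = per_domain or {}   # e.g. {"slow.example": 2.0}
        self._clock = clock                  # injectable for testing
        self._sleep = sleep
        self._last_request = {}              # domain -> time of last request

    def wait(self, url):
        """Sleep if needed before requesting `url`; return seconds waited."""
        domain = urlparse(url).netloc
        delay = self.per_domain.get(domain, self.default_delay)
        last = self._last_request.get(domain)
        waited = 0.0
        if last is not None:
            remaining = delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        self._last_request[domain] = self._clock()
        return waited
```

Calling `limiter.wait(url)` right before each `requests.get(url)` keeps per-site delays independent, so a slow limit for one site does not hold back requests to another.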

### 📊 Monitoring & Logging

```python
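# Track success rates, response times, errors
# Use logging instead of print for production
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

url = "https://example.com"       # placeholder URL
error = "connection timed out"    # placeholder error message

# Pass values as arguments so formatting only happens if the message is emitted
logger.info("Scraped %s successfully", url)
logger.error("Failed to scrape %s: %s", url, error)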
```
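To track success rates and response times alongside the log messages, a small counter object is often enough. The sketch below is one possible shape; `ScrapeStats` and its method names are hypothetical and not part of any library.

```python
from dataclasses import dataclass, field


@dataclass
class ScrapeStats:
    """Accumulate simple success/failure and timing statistics."""
    successes: int = 0
    failures: int = 0
    durations: list = field(default_factory=list)  # seconds per request

    def record(self, ok, seconds):
        """Register one finished request."""
        if ok:
            self.successes += 1
        else:
            self.failures += 1
        self.durations.append(seconds)

    @property
    def total(self):
        return self.successes + self.failures

    def success_rate(self):
        """Percentage of successful requests (0.0 if nothing was recorded)."""
        return self.successes / self.total * 100 if self.total else 0.0

    def mean_duration(self):
        """Average request time in seconds (0.0 if nothing was recorded)."""
        return sum(self.durations) / len(self.durations) if self.durations else 0.0
```

A scraper would call `stats.record(ok, elapsed)` after each request and log `stats.success_rate()` periodically.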

### 🔧 Performance Tips

1. **Use connection pooling** with `requests.Session()`
2. **Implement caching** to avoid re-scraping
3. **Batch processing** for large datasets
4. **Memory management** - process in chunks
5. **Error handling** - implement retries and fallbacks
6. **Respect robots.txt** and rate limits
7. **Use appropriate timeouts**
8. **Monitor resource usage** (CPU, memory, network)
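Tips 1 and 2 combine naturally: a cache sits in front of whatever function performs the request. The following is a minimal in-memory sketch under stated assumptions; `CachedFetcher` is a hypothetical name, and the `fetch` callable is supplied by the caller.

```python
import time


class CachedFetcher:
    """Return a cached body for a URL until its entry is older than the TTL."""

    def __init__(self, fetch, ttl_seconds=3600, clock=time.monotonic):
        self._fetch = fetch          # callable: url -> response body
        self._ttl = ttl_seconds
        self._clock = clock          # injectable for testing
        self._cache = {}             # url -> (timestamp, body)
        self.hits = 0
        self.misses = 0

    def get(self, url):
        now = self._clock()
        entry = self._cache.get(url)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]          # fresh cache entry: no network call
        self.misses += 1
        body = self._fetch(url)      # cache miss or expired: fetch again
        self._cache[url] = (now, body)
        return body
```

With `requests`, the `fetch` argument could be something like `lambda url: session.get(url, timeout=10).text`, where `session = requests.Session()` provides the connection pooling from tip 1.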

In [None]:
import concurrent.futures
import time

from joblib import Parallel, delayed

def compare_parallel_approaches(urls, max_workers=4):
    """Compare different parallel processing approaches"""
    
    results = {}
    
    # 1. Sequential baseline
    print("🐌 Sequential Processing:")
    start_time = time.time()
    sequential_results = [scrape_single_url(url) for url in urls]
    sequential_time = time.time() - start_time
    results['Sequential'] = {
        'time': sequential_time,
        'results': sequential_results
    }
    print(f"   Time: {sequential_time:.2f}s")
    
    # 2. ThreadPoolExecutor
    print("\n🧵 ThreadPoolExecutor:")
    start_time = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        thread_results = list(executor.map(scrape_single_url, urls))
    thread_time = time.time() - start_time
    results['ThreadPool'] = {
        'time': thread_time,
        'results': thread_results
    }
    print(f"   Time: {thread_time:.2f}s")
    print(f"   Speedup: {sequential_time/thread_time:.2f}x")
    
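    # 3. ProcessPoolExecutor
    # Note: with the "spawn" start method (Windows, macOS), worker processes
    # may fail to load functions that are defined only in a notebook cell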
    print("\n⚙️ ProcessPoolExecutor:")
    start_time = time.time()
    with concurrent.futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
        process_results = list(executor.map(scrape_single_url, urls))
    process_time = time.time() - start_time
    results['ProcessPool'] = {
        'time': process_time,
        'results': process_results
    }
    print(f"   Time: {process_time:.2f}s")
    print(f"   Speedup: {sequential_time/process_time:.2f}x")
    
    # 4. Joblib
    print("\n📦 Joblib Parallel:")
    start_time = time.time()
    joblib_results = Parallel(n_jobs=max_workers)(
        delayed(scrape_single_url)(url) for url in urls
    )
    joblib_time = time.time() - start_time
    results['Joblib'] = {
        'time': joblib_time,
        'results': joblib_results
    }
    print(f"   Time: {joblib_time:.2f}s")
    print(f"   Speedup: {sequential_time/joblib_time:.2f}x")
    
    # Summary comparison
    print(f"\n📊 Performance Summary:")
    print(f"{'Method':<15} {'Time (s)':<10} {'Speedup':<10} {'Success Rate'}")
    print("-" * 50)
    
    for method, data in results.items():
        success_count = sum(1 for r in data['results'] if r['status'] == 'success')
        success_rate = success_count / len(urls) * 100
        speedup = sequential_time / data['time'] if data['time'] > 0 else 0
        
        print(f"{method:<15} {data['time']:<10.2f} {speedup:<10.2f} {success_rate:.1f}%")
    
    return results

# Test with a smaller set for comparison
test_urls_small = [
    'https://httpbin.org/delay/1',  # 1 second delay
    'https://httpbin.org/delay/1',
    'https://httpbin.org/delay/1',
    'https://httpbin.org/delay/1',
    'https://httpbin.org/json',
    'https://httpbin.org/html',
    'https://httpbin.org/xml',
    'https://httpbin.org/user-agent'
]

print("🔬 Comparing Parallel Processing Approaches")
print("=" * 60)

comparison_results = compare_parallel_approaches(test_urls_small, max_workers=4)

# Memory usage comparison (simplified)
print(f"\n💾 Memory Usage Notes:")
print("   Sequential: Low memory, single process")
print("   ThreadPool: Medium memory, shared memory space")
print("   ProcessPool: High memory, separate processes")
print("   Joblib: Optimized memory usage, especially for NumPy arrays")

print(f"\n🎯 Recommendations:")
print("   • Use ThreadPool for I/O-bound web scraping")
print("   • Use ProcessPool for CPU-intensive data processing")
print("   • Use Joblib for data science and ML workloads")
print("   • Always implement rate limiting and error handling")
print("   • Monitor resource usage in production")

## 🎯 Key Takeaways: Parallel Web Scraping

### ✅ What We Learned

1. **Multiprocessing Basics**: Understanding CPU cores and parallel execution
2. **Joblib Library**: Simple and efficient parallel processing with `Parallel()` and `delayed()`
3. **Rate Limiting**: Implementing respectful scraping with proper delays
4. **Error Handling**: Robust retry mechanisms and failure recovery
5. **Performance Comparison**: Different approaches for different use cases
6. **Real-world Application**: E-commerce data scraping with batch processing

### 🚀 When to Use Parallel Scraping

**✅ Good candidates:**
- Large datasets (100s-1000s of URLs)
- I/O-bound operations (network requests)
- Independent scraping tasks
- Time-sensitive data collection

**❌ Avoid when:**
- Small datasets (< 50 URLs)
- Strict rate limits (< 1 req/sec)
- Complex interdependent scraping
- Server explicitly prohibits parallel access

### 📋 Production Checklist

- [ ] Implement proper rate limiting
- [ ] Add comprehensive error handling
- [ ] Monitor resource usage (CPU, memory, network)
- [ ] Respect robots.txt and terms of service
- [ ] Implement logging and monitoring
- [ ] Test with small datasets first
- [ ] Use appropriate number of workers
- [ ] Handle failures gracefully

### 🔗 Next Steps

1. Practice with the provided examples
2. Implement rate limiting in your projects
3. Experiment with different worker counts
4. Monitor performance and optimize
5. Always prioritize ethical scraping practices

Remember: **With great power comes great responsibility!** Use parallel scraping responsibly and always respect website terms of service.

## 🕸️ Scrapy Framework

Scrapy is a powerful, production-ready framework for large-scale scraping projects.

### Scrapy Features:
- Built-in support for handling requests, following links, exporting data
- Automatic throttling and concurrent requests
- Built-in support for handling cookies, sessions, HTTP authentication
- Robust handling of common scraping challenges

### Basic Scrapy Spider Example:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']
    
    def parse(self, response):
        # Extract quotes
        quotes = response.css('div.quote')
        
        for quote in quotes:
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        
        # Follow next page
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```

### Running Scrapy:
```bash
# Create new Scrapy project
scrapy startproject myproject

# Run spider
scrapy crawl quotes -o quotes.json
```
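The throttling and politeness features listed above are controlled from the project's `settings.py`. The fragment below is a sketch: the setting names are standard Scrapy settings, but the values are illustrative assumptions rather than recommendations for any particular site.

```python
# settings.py (fragment): values are illustrative assumptions

ROBOTSTXT_OBEY = True                  # skip URLs disallowed by robots.txt
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # cap on parallel requests per site
DOWNLOAD_DELAY = 1.0                   # base delay (seconds) between requests

# AutoThrottle adjusts the delay based on observed server latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0          # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests to aim for
```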

## 🛡️ Web Scraping Best Practices & Ethics

### 📋 Technical Best Practices:

1. **Respect robots.txt**
   ```python
   # Check robots.txt before scraping
   # Example: https://example.com/robots.txt
   ```

2. **Add delays between requests**
   ```python
   import time
   time.sleep(1)  # Wait 1 second between requests
   ```

3. **Use proper headers**
   ```python
   import requests

   url = 'https://example.com'  # placeholder URL
   headers = {
       'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
   }
   response = requests.get(url, headers=headers, timeout=10)
   ```

4. **Handle errors gracefully**
   ```python
   import requests

   try:
       response = requests.get('https://example.com', timeout=10)  # placeholder URL
       response.raise_for_status()
   except requests.RequestException as e:
       print(f"Error: {e}")
   ```

5. **Implement retry logic**
   ```python
   import requests
   from requests.adapters import HTTPAdapter
   from urllib3.util.retry import Retry
   
   session = requests.Session()
   retry_strategy = Retry(total=3, backoff_factor=1)
   adapter = HTTPAdapter(max_retries=retry_strategy)
   session.mount("http://", adapter)
   session.mount("https://", adapter)
   ```
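The robots.txt check from practice 1 can be automated with the standard library's `urllib.robotparser`. The sketch below parses an inline sample file so it runs without network access; the sample rules and the `my-scraper` user agent are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt content (hypothetical rules for illustration)
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Create a parser from robots.txt text instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Allowed path: not covered by any Disallow rule
print(parser.can_fetch("my-scraper", "https://example.com/products/1"))

# Disallowed path: matches "Disallow: /private/"
print(parser.can_fetch("my-scraper", "https://example.com/private/data"))

# Requested delay between requests, in seconds (None if not specified)
print(parser.crawl_delay("my-scraper"))
```

For a live site, `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` downloads and parses the real file instead.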

### ⚖️ Legal and Ethical Considerations:

1. **Check Terms of Service** - Always read the website's ToS
2. **Respect Rate Limits** - Don't overwhelm servers
3. **Data Privacy** - Be mindful of personal data
4. **Copyright** - Respect intellectual property rights
5. **Commercial Use** - Understand licensing for commercial projects

### 🚫 Common Challenges & Solutions:

| Challenge | Solution |
|-----------|----------|
| **JavaScript content** | Use Selenium or requests-html |
| **CAPTCHAs** | Slow down requests, use proxy rotation |
| **IP blocking** | Use proxy servers, VPNs |
| **Dynamic content** | Wait for elements, use WebDriverWait |
| **Large datasets** | Implement pagination, use Scrapy |

## 📚 Additional Resources & Documentation

### 📖 Official Documentation

#### Beautiful Soup
- **[Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)** - Complete official documentation
- **[Beautiful Soup Quick Start](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start)** - Getting started guide
- **[CSS Selectors Reference](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors)** - CSS selector syntax

#### Requests Library
- **[Requests Documentation](https://requests.readthedocs.io/)** - HTTP library documentation
- **[Requests Quickstart](https://requests.readthedocs.io/en/latest/user/quickstart/)** - Basic usage examples
- **[Advanced Usage](https://requests.readthedocs.io/en/latest/user/advanced/)** - Sessions, cookies, SSL

#### Selenium
- **[Selenium with Python](https://selenium-python.readthedocs.io/)** - Unofficial community guide to the Python bindings (official docs live at selenium.dev)
- **[WebDriver API](https://selenium-python.readthedocs.io/api.html)** - Complete API reference
- **[Selenium Grid](https://selenium.dev/documentation/grid/)** - Distributed testing

#### Scrapy Framework
- **[Scrapy Documentation](https://docs.scrapy.org/)** - Complete framework guide
- **[Scrapy Tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html)** - Step-by-step tutorial
- **[Scrapy Best Practices](https://docs.scrapy.org/en/latest/topics/practices.html)** - Production tips

### 🎥 YouTube Videos & Tutorials

#### Beginner Friendly
- **[Web Scraping with Python - Complete Course](https://www.youtube.com/watch?v=XVv6mJpFOb0)** - freeCodeCamp (3+ hours)
- **[Beautiful Soup Tutorial](https://www.youtube.com/watch?v=87Gx3U0BDlo)** - Corey Schafer
- **[Python Web Scraping Basics](https://www.youtube.com/watch?v=ng2o98k983k)** - Tech With Tim
- **[Requests Library Tutorial](https://www.youtube.com/watch?v=tb8gHvYlCFs)** - Corey Schafer

#### Advanced Topics
- **[Selenium WebDriver with Python](https://www.youtube.com/watch?v=Xjv1sY630Uc)** - Programming with Mosh
- **[Scrapy Framework Tutorial](https://www.youtube.com/watch?v=s4jtkzHhLzY)** - Traversy Media
- **[Web Scraping JavaScript Sites](https://www.youtube.com/watch?v=MeBU-4Xs2RU)** - John Watson Rooney
- **[Handling CAPTCHAs and Bot Detection](https://www.youtube.com/watch?v=HeYvNR1r6Js)** - Kalle Hallden

#### Real-World Projects
- **[Building a Price Monitor](https://www.youtube.com/watch?v=Bg9r_yLk7VY)** - Tech With Tim
- **[News Aggregator Project](https://www.youtube.com/watch?v=R9Dc6cCLPCc)** - Sentdex
- **[Social Media Scraper](https://www.youtube.com/watch?v=HiOtQMcI5wg)** - John Watson Rooney

### 📝 Excellent Articles & Blogs

#### Getting Started
- **[Real Python - Web Scraping Guide](https://realpython.com/python-web-scraping-practical-introduction/)** - Comprehensive guide
- **[GeeksforGeeks - Web Scraping](https://www.geeksforgeeks.org/python-web-scraping-tutorial/)** - Step-by-step tutorial
- **[Towards Data Science - Web Scraping 101](https://towardsdatascience.com/web-scraping-101-with-python-e40e233b7f5b)** - Medium article

#### Advanced Techniques
- **[Scraping JavaScript Heavy Sites](https://blog.apify.com/web-scraping-javascript-heavy-sites/)** - Apify Blog
- **[Avoiding Bot Detection](https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/)** - ScrapeHero
- **[Proxy Rotation Strategies](https://oxylabs.io/blog/rotating-proxies-python)** - Oxylabs Blog
- **[Handling Dynamic Content](https://www.zenrows.com/blog/selenium-wait-for-page-to-load)** - ZenRows

#### Legal & Ethical Aspects
- **[Web Scraping Laws](https://blog.apify.com/is-web-scraping-legal/)** - Legal considerations
- **[robots.txt Guide](https://developers.google.com/search/docs/crawling-indexing/robots/intro)** - Google Developers
- **[Ethical Scraping Practices](https://www.scrapehero.com/web-scraping-ethics/)** - ScrapeHero

### 🛠️ Tools & Browser Extensions

#### Browser Developer Tools
- **[Chrome DevTools](https://developer.chrome.com/docs/devtools/)** - Inspect elements, network requests
- **[Firefox Developer Tools](https://firefox-source-docs.mozilla.org/devtools-user/)** - Alternative dev tools
- **[Selector Gadget](https://selectorgadget.com/)** - Chrome extension for CSS selectors

#### Testing & Debugging
- **[Postman](https://www.postman.com/)** - API testing tool
- **[httpbin.org](http://httpbin.org/)** - HTTP testing service
- **[quotes.toscrape.com](http://quotes.toscrape.com/)** - Practice scraping site
- **[scrape.world](https://scrape.world/)** - More practice sites

#### Proxy & VPN Services
- **[ProxyMesh](https://proxymesh.com/)** - Rotating proxy service
- **[Bright Data](https://brightdata.com/)** - Professional proxy network
- **[Oxylabs](https://oxylabs.io/)** - Residential proxies

### 📊 Data Processing & Storage

#### Data Analysis
- **[Pandas Documentation](https://pandas.pydata.org/docs/)** - Data manipulation
- **[NumPy User Guide](https://numpy.org/doc/stable/user/)** - Numerical computing
- **[Matplotlib Tutorials](https://matplotlib.org/stable/tutorials/index.html)** - Data visualization

#### Database Storage
- **[SQLite3 Tutorial](https://docs.python.org/3/library/sqlite3.html)** - Lightweight database
- **[MongoDB with Python](https://pymongo.readthedocs.io/)** - NoSQL database
- **[PostgreSQL with Python](https://www.psycopg.org/docs/)** - Relational database

### 🔧 Advanced Libraries & Frameworks

#### Alternative Scraping Libraries
- **[requests-html](https://github.com/psf/requests-html)** - JavaScript support for requests
- **[pyppeteer](https://github.com/pyppeteer/pyppeteer)** - Puppeteer port for Python
- **[playwright-python](https://playwright.dev/python/)** - Modern browser automation
- **[httpx](https://www.python-httpx.org/)** - Next-generation HTTP client

#### Specialized Tools
- **[Splash](https://splash.readthedocs.io/)** - JavaScript rendering service
- **[AutoScraper](https://github.com/alirezamika/autoscraper)** - Intelligent scraping
- **[newspaper3k](https://github.com/codelucas/newspaper)** - News article extraction
- **[trafilatura](https://github.com/adbar/trafilatura)** - Text extraction from web pages

### 🎓 Online Courses & Learning Platforms

#### Free Courses
- **[freeCodeCamp](https://www.freecodecamp.org/)** - Web scraping with Python
- **[Coursera - Web Scraping](https://www.coursera.org/search?query=web%20scraping)** - University courses
- **[edX - Data Science](https://www.edx.org/search?q=web%20scraping)** - MIT and Harvard courses

#### Paid Courses
- **[Udemy - Web Scraping Courses](https://www.udemy.com/topic/web-scraping/)** - Various instructors
- **[Pluralsight](https://www.pluralsight.com/search?q=web%20scraping)** - Professional development
- **[DataCamp](https://www.datacamp.com/search?q=web%20scraping)** - Interactive learning

### 📚 Books & E-books

#### Beginner Books
- **"Web Scraping with Python"** by Ryan Mitchell - O'Reilly Media
- **"Python Web Scraping Cookbook"** by Michael Heydt - Packt
- **"Learning Scrapy"** by Dimitris Kouzis-Loukas - Packt

#### Advanced Books
- **"Web Scraping with Python: Data Extraction"** by Ryan Mitchell
- **"Mastering Python Web Scraping"** by Various Authors
- **"Python for Data Analysis"** by Wes McKinney - Pandas creator

### 🏆 Practice Websites & Challenges

#### Beginner Practice
- **[Quotes to Scrape](http://quotes.toscrape.com/)** - Basic scraping practice
- **[Books to Scrape](http://books.toscrape.com/)** - E-commerce scraping
- **[Scrape This Site](https://scrapethissite.com/)** - Various challenges

#### Advanced Practice
- **[HackerRank](https://www.hackerrank.com/)** - Coding challenges
- **[Kaggle](https://www.kaggle.com/)** - Data science competitions
- **[GitHub Scraping Projects](https://github.com/topics/web-scraping)** - Open source projects

### 💡 Pro Tips for Learning

1. **Start Small**: Begin with static HTML sites before tackling JavaScript
2. **Practice Regularly**: Scrape different types of websites
3. **Read robots.txt**: Always check before scraping
4. **Join Communities**: Reddit r/webscraping, Stack Overflow
5. **Stay Updated**: Web technologies change frequently
6. **Build Projects**: Create real-world applications
7. **Learn HTML/CSS**: Understanding structure helps with scraping
8. **Respect Websites**: Follow ethical practices

# 🛠️ Practice

Let's put our knowledge into practice with hands-on exercises!

## 🎯 Exercise 1: News Headlines Scraper

Create a scraper that extracts news headlines from a news website.

In [None]:
# Exercise 1: News Headlines Scraper
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import datetime

def scrape_news_headlines():
    """
    Scrape news headlines from a sample news site
    Note: In real projects, always check robots.txt and terms of service
    """
    
    # Using BBC RSS feed as an example (more reliable than scraping HTML)
    url = "http://feeds.bbci.co.uk/news/rss.xml"
    
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
        }
        
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        
        # Parse XML (RSS feeds are XML); the 'xml' parser requires the lxml package
        soup = BeautifulSoup(response.content, 'xml')
        
        # Find all items (news articles)
        items = soup.find_all('item')
        
        news_data = []
        
        for item in items[:10]:  # Get first 10 articles
            title = item.find('title')
            link = item.find('link')
            description = item.find('description')
            pub_date = item.find('pubDate')
            
            news_data.append({
                'title': title.text if title else 'N/A',
                'link': link.text if link else 'N/A',
                'description': description.text if description else 'N/A',
                'published': pub_date.text if pub_date else 'N/A',
                'scraped_at': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
            })
        
        return news_data
    
    except Exception as e:
        print(f"Error scraping news: {e}")
        return []

# Run the scraper
print("📰 Scraping BBC News Headlines...")
news_headlines = scrape_news_headlines()

if news_headlines:
    print(f"✅ Successfully scraped {len(news_headlines)} headlines")
    
    # Display first 3 headlines
    for i, article in enumerate(news_headlines[:3], 1):
        print(f"\n{i}. {article['title']}")
        print(f"   Published: {article['published']}")
        print(f"   Description: {article['description'][:100]}...")
    
    # Save to CSV
    df = pd.DataFrame(news_headlines)
    df.to_csv('news_headlines.csv', index=False, encoding='utf-8')
    print(f"\n💾 Data saved to 'news_headlines.csv'")
    
    # Show basic statistics
    print(f"\n📊 Statistics:")
    print(f"   Total articles: {len(news_headlines)}")
    print(f"   Average title length: {df['title'].str.len().mean():.1f} characters")
else:
    print("❌ No headlines found")

## 🎯 Exercise 2: Table Data Scraper

Extract tabular data from websites and convert it to pandas DataFrame.

In [None]:
# Exercise 2: Table Data Scraper
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Create sample HTML table for demonstration
sample_table_html = """
<html>
<body>
    <h2>Cryptocurrency Prices</h2>
    <table id="crypto-table" class="data-table">
        <thead>
            <tr>
                <th>Rank</th>
                <th>Name</th>
                <th>Symbol</th>
                <th>Price (USD)</th>
                <th>24h Change</th>
                <th>Market Cap</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td>1</td>
                <td>Bitcoin</td>
                <td>BTC</td>
                <td>$43,250.00</td>
                <td class="positive">+2.34%</td>
                <td>$847.5B</td>
            </tr>
            <tr>
                <td>2</td>
                <td>Ethereum</td>
                <td>ETH</td>
                <td>$2,580.50</td>
                <td class="negative">-1.25%</td>
                <td>$310.2B</td>
            </tr>
            <tr>
                <td>3</td>
                <td>Cardano</td>
                <td>ADA</td>
                <td>$0.45</td>
                <td class="positive">+5.67%</td>
                <td>$15.2B</td>
            </tr>
            <tr>
                <td>4</td>
                <td>Solana</td>
                <td>SOL</td>
                <td>$98.75</td>
                <td class="positive">+3.21%</td>
                <td>$42.8B</td>
            </tr>
        </tbody>
    </table>
</body>
</html>
"""

def scrape_table_data(html_content):
    """Extract table data and convert to pandas DataFrame"""
    
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Find the table
    table = soup.find('table', {'id': 'crypto-table'})
    
    if not table:
        print("❌ Table not found")
        return None
    
    # Extract headers
    headers = []
    header_row = table.find('thead').find('tr')
    for th in header_row.find_all('th'):
        headers.append(th.text.strip())
    
    print(f"📋 Table headers: {headers}")
    
    # Extract data rows
    data_rows = []
    tbody = table.find('tbody')
    
    for row in tbody.find_all('tr'):
        row_data = []
        for td in row.find_all('td'):
            # Strip surrounding whitespace; symbols are removed later in clean_financial_data
            cell_text = td.text.strip()
            row_data.append(cell_text)
        data_rows.append(row_data)
    
    # Create DataFrame
    df = pd.DataFrame(data_rows, columns=headers)
    
    return df

def clean_financial_data(df):
    """Clean and process financial data"""
    
    df_clean = df.copy()
    
    # Clean price column (remove $ and commas, convert to float)
    # regex=False treats '$' and '+' as literal characters on every pandas version
    if 'Price (USD)' in df_clean.columns:
        df_clean['Price_Numeric'] = df_clean['Price (USD)'].str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)
    
    # Clean percentage change (remove % and +, convert to float)
    if '24h Change' in df_clean.columns:
        df_clean['Change_Numeric'] = df_clean['24h Change'].str.replace('%', '', regex=False).str.replace('+', '', regex=False).astype(float)
    
    # Clean market cap (expand B/M suffixes into absolute USD values)
    if 'Market Cap' in df_clean.columns:
        def parse_market_cap(cap_str):
            cap_str = cap_str.replace('$', '').replace(',', '')
            if 'B' in cap_str:
                return float(cap_str.replace('B', '')) * 1e9
            elif 'M' in cap_str:
                return float(cap_str.replace('M', '')) * 1e6
            return float(cap_str)
        
        df_clean['Market_Cap_Numeric'] = df_clean['Market Cap'].apply(parse_market_cap)
    
    return df_clean

# Scrape the table
print("📊 Scraping table data...")
crypto_df = scrape_table_data(sample_table_html)

if crypto_df is not None:
    print("\n✅ Raw table data:")
    print(crypto_df.to_string(index=False))
    
    # Clean the data
    crypto_clean = clean_financial_data(crypto_df)
    
    print("\n🧹 Cleaned data with numeric columns:")
    print(crypto_clean[['Name', 'Symbol', 'Price_Numeric', 'Change_Numeric']].to_string(index=False))
    
    # Basic analysis
    print("\n📈 Quick Analysis:")
    print(f"   Average price: ${crypto_clean['Price_Numeric'].mean():,.2f}")
    print(f"   Highest price: {crypto_clean.loc[crypto_clean['Price_Numeric'].idxmax(), 'Name']} (${crypto_clean['Price_Numeric'].max():,.2f})")
    print(f"   Best performer: {crypto_clean.loc[crypto_clean['Change_Numeric'].idxmax(), 'Name']} ({crypto_clean['Change_Numeric'].max()}%)")
    print(f"   Worst performer: {crypto_clean.loc[crypto_clean['Change_Numeric'].idxmin(), 'Name']} ({crypto_clean['Change_Numeric'].min()}%)")
    
    # Save to CSV
    crypto_clean.to_csv('crypto_data.csv', index=False)
    print("\n💾 Data saved to 'crypto_data.csv'")
else:
    print("❌ Failed to scrape table data")

# 🏡 Homework

Practice your web scraping skills with the examples provided in the tutorial!

# 🎲 00
- ▶️[Video]()
- 🔗[Random link]()
- 🇦🇲🎶[]()
- 🌐🎶[]()
- 🤌[Կարգին]()


