## Documentation Scraping Agent

### Objective
This component extracts technical documentation from MoEngage's developer portal, routing requests through a proxy to get past potential anti-scraping measures. The extracted content is converted to clean Markdown for downstream analysis.

### Technical Approach
#### Core Technologies
- **Bright Data Residential Proxies**: Bypass IP restrictions and avoid blocking
- **Requests**: HTTP communication, wrapped in a manual retry loop
- **BeautifulSoup**: HTML parsing and content extraction
- **html2text**: HTML-to-Markdown conversion
- **Regex** (`re`): Content cleanup and filename sanitization
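
To illustrate the regex-based filename sanitization mentioned in the last item, here is a minimal sketch using only the standard library. The helper name `slugify_title`, the `max_len` parameter, and the `untitled` fallback for empty results are assumptions made for illustration, not necessarily the exact rules the scraper applies.

```python
import re

def slugify_title(title: str, max_len: int = 50) -> str:
    """Turn an article title into a filesystem-friendly Markdown filename."""
    # Drop everything except word characters, whitespace, and hyphens
    cleaned = re.sub(r"[^\w\s-]", "", title).strip().lower()
    # Collapse runs of hyphens/whitespace into single underscores, then truncate
    slug = re.sub(r"[-\s]+", "_", cleaned)[:max_len]
    # Fall back to a fixed name if nothing usable is left (assumed behavior)
    return (slug or "untitled") + ".md"

print(slugify_title("Getting Started with React-Native SDK!"))
# prints: getting_started_with_react_native_sdk.md
```

Truncating before appending the extension keeps the `.md` suffix intact even for very long titles.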

#### Key Features
1. **Proxy Integration**:
   - Uses Bright Data residential proxies to mimic real user traffic
   - Proxy endpoint and credentials are read from environment variables rather than hardcoded
   - Retries up to 3 attempts per URL on connection errors and timeouts

2. **Content Extraction**:
   - Targets specific CSS classes (`article__body markdown`) for main content
   - Falls back to `div.article-body`, `<article>`, `<main>`, and finally `<body>` if the primary content container is not found
   - Removes unnecessary elements (scripts, styles, footers, navigation, forms, etc.)

3. **Markdown Conversion**:
   - Preserves links and basic formatting
   - Skips images to reduce token usage in later stages
   - Collapses runs of whitespace, including blank lines, into single spaces

4. **File Management**:
   - Automatic directory creation (`scraped_docs`)
   - Filenames generated from article titles (lowercased, truncated to 50 characters)
   - Special characters stripped, and spaces and hyphens replaced with underscores, for compatibility
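
The retry behaviour listed under Proxy Integration can be sketched independently of any HTTP library. In the sketch below, `fetch_with_retries`, the `fetch` callable, and the `delay` parameter are hypothetical names chosen for illustration; the pause between attempts is an assumption, not something the scraper is stated to do.

```python
import time
from typing import Callable

def fetch_with_retries(fetch: Callable[[], str], attempts: int = 3, delay: float = 0.0) -> str:
    """Call `fetch` until it succeeds or `attempts` calls have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except (ConnectionError, TimeoutError) as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            print(f"Attempt {attempt}/{attempts} failed ({exc}); retrying")
            time.sleep(delay)  # assumed pause; 0 retries immediately
    raise ValueError("attempts must be at least 1")

# Usage: a stand-in fetcher that fails twice, then succeeds
calls = {"n": 0}

def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "<html>ok</html>"

result = fetch_with_retries(flaky_fetch)
print(result)
```

With Requests, the exception types to catch would be `requests.exceptions.ConnectionError` and `requests.exceptions.Timeout` rather than the built-ins used in this stand-in.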

### Why This Approach?
- **Anti-Scraping Mitigation**: Residential proxies help avoid IP bans that are common with technical documentation portals
- **Content Focus**: Targets only relevant technical content, ignoring navigation and site chrome
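- **Format Preservation**: Keeps links, emphasis, and heading markers in the Markdown output. Note that the whitespace cleanup also collapses line breaks, so multi-line code blocks are flattened rather than preserved verbatim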
- **Error Resilience**: Network and parsing failures are caught and re-raised with descriptive messages
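
To make the format-preservation point concrete, here is a minimal sketch of a whitespace cleanup that leaves code untouched. `tidy_markdown` is a hypothetical helper, and it assumes code blocks are delimited by triple-backtick fences, which may not match what the converter actually emits.

```python
import re

FENCE = "`" * 3  # a Markdown code fence delimiter

def tidy_markdown(markdown: str) -> str:
    """Collapse extra whitespace outside fenced code blocks only."""
    out, in_fence = [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # toggle on every fence delimiter
            out.append(line.rstrip())
            continue
        if in_fence:
            out.append(line)  # keep code spacing and blank lines verbatim
            continue
        # Collapse interior runs of spaces/tabs; leading indentation is kept
        line = re.sub(r"(?<=\S)[ \t]{2,}", " ", line).rstrip()
        if line == "" and out and out[-1] == "":
            continue  # drop repeated blank lines outside code
        out.append(line)
    return "\n".join(out).strip()

sample = f"# Title\n\n\n\nSome   text  here\n{FENCE}\nx  =  1\n\n\n\ny = 2\n{FENCE}\n\n\nEnd"
print(tidy_markdown(sample))
```

Working line by line keeps paragraph breaks and headings on their own lines, at the cost of a few more output characters than collapsing all whitespace.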


import requests
from bs4 import BeautifulSoup
import html2text
import re
import os
import urllib3
from urllib.parse import urlparse
import warnings


warnings.filterwarnings('ignore', category=urllib3.exceptions.InsecureRequestWarning)

# Bright Data residential proxy settings, read from the environment at import time.
# The 'Secret...' defaults are placeholders; supply your own credentials.
PROXY_HOST = os.getenv('PROXY_HOST', 'brd.superproxy.io')
PROXY_PORT = os.getenv('PROXY_PORT', '33335')
PROXY_USERNAME = os.getenv('PROXY_USERNAME', 'Secret...')  # your own username
PROXY_PASSWORD = os.getenv('PROXY_PASSWORD', 'Secret...')  # your own password

def fetch_html(url: str) -> str:
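    """
    Fetches HTML content through the Bright Data residential proxy.

    Args:
        url (str): Documentation URL to fetch

    Returns:
        str: Raw HTML content

    Raises:
        ValueError: If the response is not HTML
        ConnectionError: If the request fails (after retries for
            connection errors and timeouts)
    """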
    try:
        # Build the proxy URL as http://username:password@host:port
        proxy_url = f"http://{PROXY_USERNAME}:{PROXY_PASSWORD}@{PROXY_HOST}:{PROXY_PORT}"
        proxies = {
            "http": proxy_url,
            "https": proxy_url
        }
        
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'
        }
        
        # Up to 3 attempts, each with a 30-second timeout
        for attempt in range(3):
            try:
                # SSL verification is disabled for this request (verify=False below)
                response = requests.get(
                    url,
                    headers=headers,
                    proxies=proxies,
                    timeout=30,
                    verify=False  # Disable SSL verification for problematic domains
                )
                response.raise_for_status()
                
                # Validate content type
                if 'text/html' not in response.headers.get('Content-Type', ''):
                    raise ValueError("URL does not return HTML content")
                    
                return response.text
                
            except (requests.exceptions.ConnectionError, requests.exceptions.Timeout) as e:
                if attempt < 2:  # Retry after the first two failures; re-raise on the third
                    print(f"Attempt {attempt+1}/3 failed, retrying: {str(e)}")
                    continue
                raise
                
    except requests.exceptions.RequestException as e:
        raise ConnectionError(f"Proxy fetch failed: {str(e)}") from e

def extract_main_content(html: str) -> tuple:
    """Extracts the title and Markdown body from MoEngage documentation HTML"""
    try:
        soup = BeautifulSoup(html, 'html.parser')
        
        # Extract title from the page
        title_tag = soup.find('h6', class_='article-title')
        title = title_tag.get_text().strip() if title_tag else "Untitled"
        
        # Find the main content container
        article_body = soup.find('div', class_='article__body markdown')
        if not article_body:
            # Fallback to other content containers
            article_body = soup.find('div', class_='article-body') or \
                          soup.find('article') or \
                          soup.find('main') or \
                          soup.body
            
        if not article_body:
            raise ValueError("Could not identify main content area")
        
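        # Remove unnecessary elements.
        # Note: 'div' and 'ol' are in this list, so any nested <div> or ordered
        # list inside the container is dropped together with its text.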
        for element in article_body.find_all(['script', 'style', 'footer', 'nav', 'aside', 'form', 'dialog', 'div', 'ol', 'header']):
            element.decompose()
        
        # Convert to Markdown
        converter = html2text.HTML2Text()
        converter.ignore_links = False
        converter.ignore_images = True  # Skip images to save tokens, since I'm using the free tier
        converter.ignore_emphasis = False
        converter.body_width = 0
        
        markdown = converter.handle(str(article_body))
        
        # Clean up Markdown
        markdown = re.sub(r'\n{3,}', '\n\n', markdown)
        # \s also matches newlines, so paragraph breaks collapse to a single space as well
        markdown = re.sub(r'(\s{2,})', ' ', markdown)
        return title, markdown.strip()
    
    except Exception as e:
        raise RuntimeError(f"Content extraction failed: {str(e)}") from e

def process_documentation_url(url: str) -> tuple:
    """Processes a documentation URL and returns title/content"""
    html = fetch_html(url)
    return extract_main_content(html)

def save_to_file(title: str, content: str, output_dir: str = "scraped_docs"):
    """Saves content to a properly named markdown file"""
    os.makedirs(output_dir, exist_ok=True)
    
    # Create filename from title
    filename = re.sub(r'[^\w\s-]', '', title).strip().lower()
    filename = re.sub(r'[-\s]+', '_', filename)[:50] + '.md'  # Limit filename length
    filepath = os.path.join(output_dir, filename)
    
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(f"# {title}\n\n")
        f.write(content)
    
    return filepath

# Example usage
if __name__ == "__main__":
    # Set PROXY_USERNAME and PROXY_PASSWORD in the environment BEFORE running this code.
    # They are read once into module-level constants above, so assigning
    # os.environ[...] at this point would have no effect on fetch_html().
    
    test_urls = [
        "https://developers.moengage.com/hc/en-us/articles/360061108111-Web-SDK-Overview#h_01H9G1YMFWVN61PKBDN0MGAJWG",
        "https://developers.moengage.com/hc/en-us/articles/22105190881044-Getting-Started-with-React-Native-SDK#h_01HEJAHP5W49AASNSHF5614AHP"
    ]
    
    for url in test_urls:
        print(f"\n{'='*50}\nProcessing: {url}\n{'='*50}")
        try:
            title, content = process_documentation_url(url)
            filepath = save_to_file(title, content)
            print(f"Extracted content saved to: {filepath}")
            print(f"Title: {title}")
            print(f"Content length: {len(content)} characters")
            print(f"Content preview:\n{content[:300]}...\n")
        except Exception as e:
            print(f"Error processing {url}: {str(e)}")


Processing: https://developers.moengage.com/hc/en-us/articles/360061108111-Web-SDK-Overview#h_01H9G1YMFWVN61PKBDN0MGAJWG
Extracted content saved to: scraped_docs\web_sdk_overview.md
Title: Web SDK Overview
Content length: 1679 characters
Content preview:
# Introduction to Web Modules The following modules are supported for Web SDK Integration: ## Web Push Permission based messages are sent regardless of whether the user is on your website or not. ## Analytics Any data collection on your users for your brand use. Typical use cases are to personalize ...


Processing: https://developers.moengage.com/hc/en-us/articles/22105190881044-Getting-Started-with-React-Native-SDK#h_01HEJAHP5W49AASNSHF5614AHP
Extracted content saved to: scraped_docs\getting_started_with_react_native_sdk.md
Title: Getting Started with React Native SDK
Content length: 16102 characters
Content preview:
# Overview MoEngage’s React Native SDK helps you integrate MoEngage into iOS and Android applications built with Reac