# CamHR.com Job Scraper

This notebook contains a comprehensive web scraper for extracting job listings from CamHR.com (Cambodia Human Resources). The scraper systematically collects detailed job information including job titles, company details, requirements, qualifications, and employment terms.

## Features
- **Advanced Web Scraping**: Uses Selenium WebDriver with BeautifulSoup for comprehensive data extraction
- **Robust Data Collection**: Extracts 18 different job attributes
- **Error Handling**: Graceful handling of missing pages and elements
- **CSV Export**: Structured data export with UTF-8 encoding
- **Progress Tracking**: Real-time scraping progress updates
- **Headless Operation**: Efficient background processing

## Data Fields Extracted
- **Basic Info**: Job Title, Company Name, Link URL
- **Job Level**: Level, Years of Experience
- **Employment Terms**: Hiring status, Salary, Employment Term
- **Demographics**: Sex, Age requirements
- **Classification**: Function, Industry
- **Requirements**: Qualification, Language, Location
- **Detailed Info**: Job Requirements
- **Timeline**: Publish Date, Closing Date

## Requirements
- Python 3.x
- Selenium WebDriver
- BeautifulSoup4
- Pandas
- Chrome WebDriver

## 1. Import Required Libraries

First, let's import all the necessary libraries for web scraping, data processing, and file handling.

In [None]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time
import csv
import os
from datetime import datetime
import logging

print("✅ All libraries imported successfully!")
print(f"📅 Scraping session started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 2. Configuration Class

Define a configuration class containing all scraper settings for easy customization and maintenance.

In [None]:
class CamHRConfig:
    """Configuration class for CamHR job scraper"""
    
    # WebDriver settings
    CHROME_DRIVER_PATH = r"D:\DSE_Folder\Year_3\Sem_2\Web Scraping\chromedriver-win64\chromedriver-win64\chromedriver.exe"
    
    # Scraping range
    START_ID = 10611925
    END_ID = 10613636
    
    # URL configuration
    BASE_URL = "https://www.camhr.com/a/job/{}"
    
    # Output settings
    CSV_FILENAME = "New_Data_cam_4.csv"
    
    # Timing settings
    WAIT_TIMEOUT = 5  # seconds to wait for page elements
    DELAY = 0.0000001  # seconds between requests; effectively zero as set, so raise it (e.g. 1-2 s) to limit server load
    
    # CSV column definitions
    COLUMNS = [
        "Job Title", "Company Name", "Level", "Year of Exp.", "Hiring", "Salary", "Sex", "Age",
        "Term", "Function", "Industry", "Qualification", "Language", "Location", "Job Requirements",
        "Publish Date", "Closing Date", "Link URL"
    ]

# Initialize configuration
config = CamHRConfig()

print("🔧 CamHR Scraper Configuration:")
print(f"📊 Job ID range: {config.START_ID} to {config.END_ID}")
print(f"📁 Output file: {config.CSV_FILENAME}")
print(f"⏱️ Wait timeout: {config.WAIT_TIMEOUT} seconds")
print(f"📝 Data fields: {len(config.COLUMNS)} columns")
print(f"🔗 Base URL: {config.BASE_URL}")
print(f"📈 Total jobs to scrape: {config.END_ID - config.START_ID + 1}")
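
The ID range and URL template above fully determine which pages are requested. A minimal sketch of that expansion follows, assuming a hypothetical helper `job_urls` (it is not part of the notebook, which formats `BASE_URL` inline) and a deliberately tiny ID range:

```python
def job_urls(start_id, end_id, base_url):
    """Yield (job_id, url) pairs for an inclusive ID range.

    Hypothetical helper for illustration; the notebook itself builds
    each URL with BASE_URL.format(job_id) inside the scraping loop.
    """
    if end_id < start_id:
        raise ValueError("end_id must be >= start_id")
    for job_id in range(start_id, end_id + 1):
        yield job_id, base_url.format(job_id)


# Illustrative three-ID range only; the real range comes from CamHRConfig.
pairs = list(job_urls(10611925, 10611927, "https://www.camhr.com/a/job/{}"))
print(len(pairs))    # number of pages that would be requested
print(pairs[0][1])   # first URL in the range
```

Validating the range up front avoids a silent empty run when `END_ID` is accidentally smaller than `START_ID`.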

## 3. WebDriver Setup and Initialization

Configure and initialize the Chrome WebDriver with optimized settings for efficient scraping.

In [None]:
def setup_chrome_driver(config):
    """
    Initialize Chrome WebDriver with optimized settings
    
    Args:
        config: CamHRConfig instance
    
    Returns:
        webdriver.Chrome: Configured Chrome WebDriver
    """
    try:
        # Set up Chrome options for optimal performance
        chrome_options = Options()
        chrome_options.add_argument("--headless")              # Run without GUI
        chrome_options.add_argument("--disable-gpu")           # Disable GPU acceleration
        chrome_options.add_argument("--no-sandbox")            # Bypass OS security model
        chrome_options.add_argument("--disable-dev-shm-usage") # Overcome limited resource problems
        chrome_options.add_argument("--window-size=1920,1080")  # Set window size
        chrome_options.add_argument("--disable-blink-features=AutomationControlled") # Avoid detection
        
        # Initialize WebDriver service
        service = Service(config.CHROME_DRIVER_PATH)
        
        # Create WebDriver instance
        driver = webdriver.Chrome(service=service, options=chrome_options)
        
        print("✅ Chrome WebDriver initialized successfully!")
        print(f"🌐 Browser version: {driver.capabilities.get('browserVersion', 'Unknown')}")
        print(f"🔧 ChromeDriver version: {driver.capabilities.get('chrome', {}).get('chromedriverVersion', 'Unknown').split(' ')[0]}")
        
        return driver
        
    except Exception as e:
        print(f"❌ Error initializing WebDriver: {e}")
        return None

# Initialize the WebDriver
driver = setup_chrome_driver(config)

if driver:
    print("🚀 WebDriver ready for scraping!")
else:
    print("❌ Failed to initialize WebDriver. Please check the Chrome driver path.")
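
Because a wrong driver path is the most likely reason for the failure message above, it can help to check the path before launching a browser. The sketch below is one possible pre-flight check; `check_driver_path` is a hypothetical name and the path used is deliberately nonexistent:

```python
import os


def check_driver_path(path):
    """Return (ok, message) describing whether `path` points to a file.

    Hypothetical pre-flight check; it does not start a browser and does
    not verify that the file is a working ChromeDriver binary.
    """
    if not path:
        return False, "No driver path configured"
    if not os.path.isfile(path):
        return False, f"No file found at: {path}"
    return True, f"Driver file found: {path}"


ok, message = check_driver_path("does/not/exist/chromedriver")
print(ok, message)
```

In the notebook this would be called with `config.CHROME_DRIVER_PATH` before `setup_chrome_driver(config)`.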

## 4. Data Extraction Functions

Define specialized functions for extracting different types of job information from the CamHR pages.

In [None]:
def extract_job_title(soup):
    """
    Extract job title from the page
    
    Args:
        soup: BeautifulSoup object
    
    Returns:
        str: Job title or 'Not found'
    """
    job_title_span = soup.find("span", class_="job-name-span")
    return job_title_span.text.strip() if job_title_span else "Not found"

def extract_company_name(soup):
    """
    Extract company name from the page
    
    Args:
        soup: BeautifulSoup object
    
    Returns:
        str: Company name or 'Not found'
    """
    # select_one matches both classes regardless of their order or any extra classes
    company_name_tag = soup.select_one("p.mb-1.company-headbox")
    if company_name_tag:
        company_link = company_name_tag.find("a")
        return company_link.text.strip() if company_link else "Not found"
    return "Not found"

def extract_table_data(soup, columns):
    """
    Extract job details from the information table
    
    Args:
        soup: BeautifulSoup object
        columns: List of column names to match
    
    Returns:
        dict: Dictionary mapping column names to values
    """
    table_data = {}
    
    table = soup.find("table", class_="mailTable")
    if table:
        rows = table.find_all("tr")
        for row in rows:
            headers = row.find_all("th", class_="column")
            data_cells = row.find_all("td")
            
            for header, data in zip(headers, data_cells):
                key = header.text.strip()
                value = data.text.strip()
                
                # Match table headers with CSV columns (skip empty headers,
                # because "" is a substring of every column name)
                for column in columns:
                    if key and key.lower() in column.lower():
                        table_data[column] = value
                        break
    
    return table_data

def extract_job_requirements(soup):
    """
    Extract detailed job requirements
    
    Args:
        soup: BeautifulSoup object
    
    Returns:
        str: Job requirements or 'Not found'
    """
    job_descript_divs = soup.find_all("div", class_="job-descript")
    
    for div in job_descript_divs:
        title_span = div.find("span", class_="descript-title")
        if title_span and "Job Requirements" in title_span.text:
            requirements_div = div.select_one("div.fs-14.descript-list")
            if requirements_div:
                return requirements_div.get_text(separator="\n").strip()
    
    return "Not found"

def extract_dates(soup):
    """
    Extract publish date and closing date
    
    Args:
        soup: BeautifulSoup object
    
    Returns:
        tuple: (publish_date, closing_date)
    """
    send_date_div = soup.find("div", class_="send-date")
    
    if send_date_div:
        date_spans = send_date_div.find_all("span")
        if len(date_spans) >= 2:
            publish_date = date_spans[0].text.split(": ")[-1].strip()
            closing_date = date_spans[1].text.split(": ")[-1].strip()
            return publish_date, closing_date
    
    return "Not found", "Not found"

print("✅ Data extraction functions defined successfully!")
print("🔧 Available functions:")
print("   - extract_job_title(): Job title extraction")
print("   - extract_company_name(): Company name extraction")
print("   - extract_table_data(): Table-based data extraction")
print("   - extract_job_requirements(): Detailed requirements extraction")
print("   - extract_dates(): Publish and closing date extraction")
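
The header matching in `extract_table_data()` relies on substring tests, which depend on column order (for example, "age" is also a substring of "language"). A possible stricter variant is sketched below; `match_column` is a hypothetical helper, and the column list is reordered on purpose to expose the ambiguity:

```python
def match_column(header, columns):
    """Map a table header to a CSV column name, or return None.

    Hypothetical variant of the matching loop in extract_table_data():
    exact (case-insensitive) matches win, substring matches are only a
    fallback, and empty headers never match.
    """
    key = header.strip().lower()
    if not key:
        return None  # "" is a substring of every string
    for column in columns:
        if key == column.lower():
            return column
    for column in columns:
        if key in column.lower():
            return column
    return None


# "Language" is listed before "Age" to show that exact matching still wins.
columns = ["Job Title", "Language", "Age", "Year of Exp."]
print(match_column("age", columns))       # Age
print(match_column("Year of", columns))   # Year of Exp.
print(match_column("  ", columns))        # None
```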

## 5. CSV File Management

Set up CSV file creation and management functions for data storage.

In [None]:
def initialize_csv_file(config):
    """
    Initialize CSV file with headers if it doesn't exist
    
    Args:
        config: CamHRConfig instance
    
    Returns:
        bool: True if file was created/exists, False otherwise
    """
    try:
        # Check if file already exists
        file_exists = os.path.exists(config.CSV_FILENAME)
        
        if not file_exists:
            # Create new CSV file with headers
            with open(config.CSV_FILENAME, mode="w", newline="", encoding="utf-8-sig") as file:
                writer = csv.writer(file)
                writer.writerow(config.COLUMNS)
            print(f"✅ Created new CSV file: {config.CSV_FILENAME}")
        else:
            print(f"📁 Using existing CSV file: {config.CSV_FILENAME}")
        
        print(f"📝 CSV columns ({len(config.COLUMNS)}): {', '.join(config.COLUMNS)}")
        return True
        
    except Exception as e:
        print(f"❌ Error initializing CSV file: {e}")
        return False

def write_job_data(config, job_data):
    """
    Write job data to CSV file
    
    Args:
        config: CamHRConfig instance
        job_data: Dictionary containing job information
    
    Returns:
        bool: True if successful, False otherwise
    """
    try:
        with open(config.CSV_FILENAME, mode="a", newline="", encoding="utf-8-sig") as file:
            writer = csv.writer(file)
            row_data = [job_data.get(col, "Not found") for col in config.COLUMNS]
            writer.writerow(row_data)
        return True
    except Exception as e:
        print(f"❌ Error writing to CSV: {e}")
        return False

# Initialize CSV file
csv_initialized = initialize_csv_file(config)

if csv_initialized:
    print("🎯 CSV file ready for data storage!")
else:
    print("❌ Failed to initialize CSV file.")
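
The two functions above split "write the header once" and "append a row" across separate calls. As a rough alternative, both steps can live in one function built on `csv.DictWriter`. The sketch below uses a shortened, illustrative column list and a temporary file; `append_row` is a hypothetical name:

```python
import csv
import os
import tempfile

COLUMNS = ["Job Title", "Company Name", "Link URL"]  # shortened for the example


def append_row(path, row, columns=COLUMNS):
    """Append one record, writing the header only when the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, mode="a", newline="", encoding="utf-8-sig") as f:
        # restval fills columns missing from `row`; extra keys are ignored.
        writer = csv.DictWriter(f, fieldnames=columns, restval="Not found",
                                extrasaction="ignore")
        if needs_header:
            writer.writeheader()
        writer.writerow(row)


path = os.path.join(tempfile.mkdtemp(), "jobs_demo.csv")
append_row(path, {"Job Title": "Accountant", "Company Name": "Example Co."})
append_row(path, {"Job Title": "Driver", "Link URL": "https://example.com/2"})

with open(path, newline="", encoding="utf-8-sig") as f:
    rows = list(csv.DictReader(f))
print(rows)
```

Writing by field name rather than by position keeps rows aligned with the header even if the column list is later reordered.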

## 6. Main Scraping Function

Define the main function that orchestrates the scraping process for a single job listing.

In [None]:
def scrape_single_job(driver, config, job_id):
    """
    Scrape a single job listing from CamHR
    
    Args:
        driver: WebDriver instance
        config: CamHRConfig instance
        job_id: Job ID to scrape
    
    Returns:
        dict: Job data dictionary or None if failed
    """
    url = config.BASE_URL.format(job_id)
    print(f"🔍 Scraping Job ID {job_id}: {url}")
    
    try:
        # Navigate to job page
        driver.get(url)
        
        # Wait for page to load
        try:
            WebDriverWait(driver, config.WAIT_TIMEOUT).until(
                EC.presence_of_element_located((By.CLASS_NAME, "job-header-content"))
            )
        except Exception:  # not a bare except, so Ctrl+C (KeyboardInterrupt) still propagates
            print(f"⚠️ Page not loaded properly for job ID {job_id}, skipping...")
            return None
        
        # Add small delay
        time.sleep(config.DELAY)
        
        # Parse page with BeautifulSoup
        soup = BeautifulSoup(driver.page_source, "html.parser")
        
        # Initialize job information dictionary
        job_info = {col: "Not found" for col in config.COLUMNS}
        
        # Extract basic information
        job_info["Job Title"] = extract_job_title(soup)
        job_info["Company Name"] = extract_company_name(soup)
        
        # Extract table data
        table_data = extract_table_data(soup, config.COLUMNS)
        job_info.update(table_data)
        
        # Extract job requirements
        job_info["Job Requirements"] = extract_job_requirements(soup)
        
        # Extract dates
        publish_date, closing_date = extract_dates(soup)
        job_info["Publish Date"] = publish_date
        job_info["Closing Date"] = closing_date
        
        # Add URL
        job_info["Link URL"] = url
        
        # Print extracted data summary
        print(f"✅ Successfully extracted: {job_info['Job Title']}")
        print(f"   🏢 Company: {job_info['Company Name']}")
        print(f"   📍 Location: {job_info.get('Location', 'Not found')}")
        print(f"   💰 Salary: {job_info.get('Salary', 'Not found')}")
        print(f"   📅 Closing: {job_info['Closing Date']}")
        
        return job_info
        
    except Exception as e:
        print(f"❌ Error scraping job ID {job_id}: {e}")
        return None

print("✅ Main scraping function defined successfully!")
print("🔧 Function: scrape_single_job() - Handles complete job data extraction")

## 7. Execute the Complete Scraping Process

Run the complete scraping process for all job IDs in the specified range.

In [None]:
def run_camhr_scraper(driver, config):
    """
    Execute the complete CamHR scraping process
    
    Args:
        driver: WebDriver instance
        config: CamHRConfig instance
    
    Returns:
        dict: Scraping statistics
    """
    if not driver:
        print("❌ WebDriver not available. Please initialize the driver first.")
        return None
    
    print("🚀 Starting CamHR job scraping process...")
    print(f"📊 Job ID range: {config.START_ID} to {config.END_ID}")
    print(f"📁 Output file: {config.CSV_FILENAME}")
    print("=" * 70)
    
    # Initialize counters
    successful_scrapes = 0
    failed_scrapes = 0
    total_jobs = config.END_ID - config.START_ID + 1
    start_time = time.time()
    
    try:
        # Process each job ID
        for job_id in range(config.START_ID, config.END_ID + 1):
            current_job = job_id - config.START_ID + 1
            print(f"\n📊 Progress: {current_job}/{total_jobs} ({(current_job/total_jobs)*100:.1f}%)")
            
            # Scrape single job
            job_data = scrape_single_job(driver, config, job_id)
            
            if job_data:
                # Write to CSV
                if write_job_data(config, job_data):
                    successful_scrapes += 1
                    print(f"💾 Data saved to CSV successfully")
                else:
                    failed_scrapes += 1
                    print(f"❌ Failed to save data to CSV")
            else:
                failed_scrapes += 1
                print(f"⏩ Skipped job ID {job_id}")
            
            # Calculate and display time estimates
            if current_job > 0:
                elapsed_time = time.time() - start_time
                avg_time_per_job = elapsed_time / current_job
                remaining_jobs = total_jobs - current_job
                estimated_time_remaining = avg_time_per_job * remaining_jobs
                
                print(f"⏱️ Avg time per job: {avg_time_per_job:.2f}s | ETA: {estimated_time_remaining/60:.1f} min")
    
    except KeyboardInterrupt:
        print("\n⚠️ Scraping interrupted by user")
    except Exception as e:
        print(f"\n❌ Error during scraping process: {e}")
    
    finally:
        # Calculate final statistics over the jobs actually processed,
        # so an interrupted run does not report misleading rates
        end_time = time.time()
        total_time = end_time - start_time
        processed = successful_scrapes + failed_scrapes
        success_rate = (successful_scrapes / processed) * 100 if processed else 0.0
        avg_time = total_time / processed if processed else 0.0
        
        print("\n" + "=" * 70)
        print("🎉 Scraping process completed!")
        print(f"✅ Successful scrapes: {successful_scrapes}")
        print(f"❌ Failed scrapes: {failed_scrapes}")
        print(f"📊 Success rate: {success_rate:.1f}%")
        print(f"⏱️ Total time: {total_time/60:.1f} minutes")
        print(f"⚡ Average time per job: {avg_time:.2f} seconds")
        print(f"💾 Data saved to: {config.CSV_FILENAME}")
        
        return {
            'successful': successful_scrapes,
            'failed': failed_scrapes,
            'total': total_jobs,
            'success_rate': success_rate,
            'total_time': total_time
        }

# Execute the scraping process (uncomment to run)
# scraping_stats = run_camhr_scraper(driver, config)

print("🔄 To start scraping, uncomment the 'run_camhr_scraper()' line above and run this cell.")
print("⚠️ Warning: This process may take considerable time depending on the number of jobs.")
print(f"📊 Estimated time for {config.END_ID - config.START_ID + 1} jobs: ~{((config.END_ID - config.START_ID + 1) * 2)/60:.1f} minutes")
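
The ETA printed during the run is a running average multiplied by the number of jobs left. Isolated from the loop, the arithmetic looks roughly like this (`estimate_remaining` is a hypothetical helper and the numbers are made up):

```python
def estimate_remaining(elapsed_seconds, jobs_done, jobs_total):
    """Estimate seconds remaining from the running average time per job."""
    if jobs_done <= 0:
        return None  # no average is available before the first job finishes
    avg = elapsed_seconds / jobs_done
    return avg * (jobs_total - jobs_done)


# Made-up numbers: 50 of 200 jobs done after 100 seconds.
eta = estimate_remaining(100.0, 50, 200)
print(f"ETA: {eta / 60:.1f} min")   # 2 s/job * 150 jobs = 300 s = 5.0 min
```

Since skipped pages wait for the full `WAIT_TIMEOUT`, the average (and therefore the ETA) rises when many IDs in the range do not exist.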

## 8. Data Analysis and Insights

Analyze the scraped data to gain insights into the job market trends.

In [None]:
def analyze_camhr_data(config):
    """
    Analyze the scraped CamHR job data
    
    Args:
        config: CamHRConfig instance
    
    Returns:
        pandas.DataFrame: Loaded dataset or None if file not found
    """
    try:
        # Load the CSV data
        df = pd.read_csv(config.CSV_FILENAME)
        
        print("📊 CamHR Job Market Analysis")
        print("=" * 50)
        
        # Basic dataset information
        print(f"📈 Dataset Overview:")
        print(f"   Total job listings: {len(df)}")
        print(f"   Data columns: {len(df.columns)}")
        print(f"   Date range: {df['Publish Date'].min()} to {df['Publish Date'].max()}")
        
        # Display first few records
        print(f"\n📋 Sample Data (First 3 Records):")
        display_cols = ['Job Title', 'Company Name', 'Location', 'Salary', 'Level']
        print(df[display_cols].head(3).to_string(index=False))
        
        # Top companies analysis
        print(f"\n🏢 Top Hiring Companies:")
        top_companies = df['Company Name'].value_counts().head(10)
        for company, count in top_companies.items():
            if company != 'Not found':
                print(f"   {company}: {count} jobs")
        
        # Location analysis
        print(f"\n🌍 Top Job Locations:")
        top_locations = df['Location'].value_counts().head(10)
        for location, count in top_locations.items():
            if location != 'Not found':
                print(f"   {location}: {count} jobs")
        
        # Industry analysis
        print(f"\n🏭 Top Industries:")
        top_industries = df['Industry'].value_counts().head(10)
        for industry, count in top_industries.items():
            if industry != 'Not found':
                print(f"   {industry}: {count} jobs")
        
        # Job level distribution
        print(f"\n📊 Job Level Distribution:")
        level_distribution = df['Level'].value_counts()
        for level, count in level_distribution.items():
            if level != 'Not found':
                percentage = (count / len(df)) * 100
                print(f"   {level}: {count} jobs ({percentage:.1f}%)")
        
        # Function/Role analysis
        print(f"\n🎯 Top Job Functions:")
        top_functions = df['Function'].value_counts().head(10)
        for function, count in top_functions.items():
            if function != 'Not found':
                print(f"   {function}: {count} jobs")
        
        # Data quality assessment
        print(f"\n🔍 Data Quality Assessment:")
        for column in config.COLUMNS:
            if column in df.columns:
                missing_count = (df[column] == 'Not found').sum()
                missing_percentage = (missing_count / len(df)) * 100
                completeness = 100 - missing_percentage
                status = "✅" if completeness > 80 else "⚠️" if completeness > 50 else "❌"
                print(f"   {status} {column}: {completeness:.1f}% complete")
        
        # Experience requirements analysis
        print(f"\n💼 Experience Requirements:")
        exp_counts = df['Year of Exp.'].value_counts().head(8)
        for exp, count in exp_counts.items():
            if exp != 'Not found':
                print(f"   {exp}: {count} jobs")
        
        return df
        
    except FileNotFoundError:
        print(f"❌ File {config.CSV_FILENAME} not found. Please run the scraper first.")
        return None
    except Exception as e:
        print(f"❌ Error analyzing data: {e}")
        return None

# Run data analysis
df = analyze_camhr_data(config)
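
The data quality assessment above counts only the literal `'Not found'` placeholder as missing. A small standard-library sketch of the same metric that also treats empty values as missing is shown below; `completeness` is a hypothetical helper and the records are invented stand-ins for CSV rows:

```python
def completeness(rows, column, placeholder="Not found"):
    """Percentage of rows whose `column` holds a real value.

    Empty strings and missing keys count as missing, alongside the placeholder.
    """
    if not rows:
        return 0.0
    filled = sum(1 for row in rows if row.get(column) not in (None, "", placeholder))
    return filled / len(rows) * 100


# Made-up records standing in for rows read from the CSV.
rows = [
    {"Salary": "Negotiable", "Location": "Phnom Penh"},
    {"Salary": "Not found", "Location": "Phnom Penh"},
    {"Salary": "", "Location": "Siem Reap"},
    {"Salary": "$500-$700", "Location": "Not found"},
]
print(f"Salary: {completeness(rows, 'Salary'):.1f}% complete")      # 50.0%
print(f"Location: {completeness(rows, 'Location'):.1f}% complete")  # 75.0%
```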

## 9. Data Export and Advanced Features

Export data to multiple formats and provide advanced search capabilities.

In [None]:
def export_data_multiple_formats(config):
    """
    Export scraped data to multiple formats
    
    Args:
        config: CamHRConfig instance
    """
    try:
        # Load the CSV data
        df = pd.read_csv(config.CSV_FILENAME)
        base_filename = config.CSV_FILENAME.replace('.csv', '')
        
        print("📦 Exporting CamHR data to multiple formats...")
        
        # Export to Excel with formatting
        excel_filename = f"{base_filename}.xlsx"
        with pd.ExcelWriter(excel_filename, engine='openpyxl') as writer:
            df.to_excel(writer, sheet_name='CamHR_Jobs', index=False)
            
            # Create summary sheet
            summary_data = {
                'Metric': ['Total Jobs', 'Unique Companies', 'Unique Locations', 'Date Scraped'],
                'Value': [len(df), df['Company Name'].nunique(), df['Location'].nunique(), datetime.now().strftime('%Y-%m-%d')]
            }
            summary_df = pd.DataFrame(summary_data)
            summary_df.to_excel(writer, sheet_name='Summary', index=False)
        
        print(f"✅ Excel export completed: {excel_filename}")
        
        # Export to JSON
        json_filename = f"{base_filename}.json"
        df.to_json(json_filename, orient='records', indent=2)
        print(f"✅ JSON export completed: {json_filename}")
        
        # Create cleaned dataset (replace 'Not found' placeholders with empty values; rows are kept)
        df_cleaned = df.replace('Not found', '')
        cleaned_filename = f"{base_filename}_cleaned.csv"
        df_cleaned.to_csv(cleaned_filename, index=False)
        print(f"✅ Cleaned dataset created: {cleaned_filename}")
        
        # Create industry-specific datasets
        if 'Industry' in df.columns:
            top_industries = df['Industry'].value_counts().head(5).index
            for industry in top_industries:
                if industry != 'Not found':
                    industry_df = df[df['Industry'] == industry]
                    industry_filename = f"{base_filename}_{industry.replace(' ', '_').replace('/', '_')}.csv"
                    industry_df.to_csv(industry_filename, index=False)
                    print(f"📊 Industry dataset created: {industry_filename} ({len(industry_df)} jobs)")
        
        print(f"\n📁 Export Summary:")
        print(f"   📊 Original CSV: {config.CSV_FILENAME}")
        print(f"   📈 Excel file: {excel_filename}")
        print(f"   🔗 JSON file: {json_filename}")
        print(f"   🧹 Cleaned CSV: {cleaned_filename}")
        
    except FileNotFoundError:
        print(f"❌ File {config.CSV_FILENAME} not found. Please run the scraper first.")
    except Exception as e:
        print(f"❌ Error exporting data: {e}")

def search_camhr_jobs(config, **filters):
    """
    Search and filter CamHR jobs based on criteria
    
    Args:
        config: CamHRConfig instance
        **filters: Keyword arguments for filtering (title, company, location, level, etc.)
    
    Returns:
        pandas.DataFrame: Filtered job data
    """
    try:
        df = pd.read_csv(config.CSV_FILENAME)
        filtered_df = df.copy()
        
        print(f"🔍 Searching CamHR jobs with filters:")
        
        # Apply filters
        for key, value in filters.items():
            if value and key in df.columns:
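                # regex=False treats the filter as literal text, so values like "C++" do not break the search
                filtered_df = filtered_df[filtered_df[key].str.contains(str(value), case=False, na=False, regex=False)]
                print(f"   📝 {key}: '{value}'")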
        
        print(f"\n📊 Search Results: {len(filtered_df)} jobs found")
        
        if len(filtered_df) > 0:
            print(f"\n📋 Results Preview:")
            display_cols = ['Job Title', 'Company Name', 'Location', 'Level', 'Salary']
            print(filtered_df[display_cols].head(5).to_string(index=False))
        
        return filtered_df
        
    except FileNotFoundError:
        print(f"❌ File {config.CSV_FILENAME} not found. Please run the scraper first.")
        return pd.DataFrame()
    except Exception as e:
        print(f"❌ Error searching jobs: {e}")
        return pd.DataFrame()

# Export data to multiple formats
export_data_multiple_formats(config)

print("\n🔍 Search Examples:")
print("# Search for manager positions:")
print("# results = search_camhr_jobs(config, **{'Job Title': 'manager'})")
print("\n# Search for IT jobs in specific location:")
print("# results = search_camhr_jobs(config, **{'Function': 'IT', 'Location': 'Phnom Penh'})")
print("\n# Search for senior level positions:")
print("# results = search_camhr_jobs(config, **{'Level': 'Senior'})")
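
The industry-specific export builds file names by replacing only spaces and slashes, so labels containing characters such as `:` or `?` could still yield invalid file names on some systems. One possible stricter sanitizer is sketched below; `safe_filename_part` is a hypothetical helper and the labels are invented:

```python
import re


def safe_filename_part(label, max_length=60):
    """Turn a free-text label into a filesystem-friendly fragment.

    Note: this keeps ASCII letters and digits only, so non-Latin labels
    (for example Khmer text) collapse to the "unknown" fallback.
    """
    cleaned = re.sub(r"[^A-Za-z0-9]+", "_", label).strip("_")
    return cleaned[:max_length] or "unknown"


print(safe_filename_part("Banking/Finance & Insurance"))  # Banking_Finance_Insurance
print(safe_filename_part("IT: Software?"))                # IT_Software
```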

## 10. Cleanup and Final Steps

Properly close the WebDriver and provide final summary.

In [None]:
def cleanup_resources(driver):
    """
    Clean up resources and close WebDriver
    
    Args:
        driver: WebDriver instance to close
    """
    try:
        if driver:
            driver.quit()
            print("✅ WebDriver closed successfully")
        else:
            print("ℹ️ No WebDriver to close")
    except Exception as e:
        print(f"⚠️ Error closing WebDriver: {e}")

def generate_scraping_report(config):
    """
    Generate a comprehensive scraping report
    
    Args:
        config: CamHRConfig instance
    """
    try:
        report_filename = config.CSV_FILENAME.replace('.csv', '_report.txt')
        
        with open(report_filename, 'w', encoding='utf-8') as f:
            f.write("CamHR Job Scraping Report\n")
            f.write("=" * 40 + "\n\n")
            f.write(f"Scraping Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write(f"Source Website: CamHR.com\n")
            f.write(f"Job ID Range: {config.START_ID} to {config.END_ID}\n")
            f.write(f"Base URL: {config.BASE_URL}\n")
            f.write(f"Output File: {config.CSV_FILENAME}\n\n")
            
            # Add data fields information
            f.write("Data Fields Extracted:\n")
            for i, column in enumerate(config.COLUMNS, 1):
                f.write(f"  {i:2d}. {column}\n")
            
            f.write("\nConfiguration Settings:\n")
            f.write(f"  Wait Timeout: {config.WAIT_TIMEOUT} seconds\n")
            f.write(f"  Request Delay: {config.DELAY} seconds\n")
            f.write(f"  Chrome Driver: {config.CHROME_DRIVER_PATH}\n")
        
        print(f"📋 Scraping report generated: {report_filename}")
        
    except Exception as e:
        print(f"❌ Error generating report: {e}")

# Generate final report
generate_scraping_report(config)

# Clean up resources (uncomment when done with scraping)
# cleanup_resources(driver)

print("\n🎯 CamHR Scraper Setup Complete!")
print("📋 Next Steps:")
print("   1. Run the scraping process by uncommenting 'run_camhr_scraper()'")
print("   2. Analyze results using the analysis functions")
print("   3. Export data to different formats as needed")
print("   4. Clean up resources when finished")
print("\n⚠️ Remember to uncomment 'cleanup_resources(driver)' when done!")
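
Because cleanup depends on remembering to uncomment a line, a browser process can be left running after an error. One way to make cleanup automatic is a context manager; the sketch below is an illustration only, with `managed_driver` and `FakeDriver` as hypothetical names and a stand-in object instead of a real browser:

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with `factory()` and always call quit() afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        if driver is not None:  # the factory may have failed and returned None
            driver.quit()


class FakeDriver:
    """Stand-in for a WebDriver so the sketch runs without a browser."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as d:
        raise RuntimeError("simulated scraping failure")
except RuntimeError:
    pass
print(fake.closed)   # True: quit() ran despite the error
```

In the notebook, the factory could be `lambda: setup_chrome_driver(config)`, with the scraping call placed inside the `with` block.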

## 11. Summary and Best Practices

This comprehensive Jupyter notebook provides a complete solution for scraping job data from CamHR.com.

### 🎯 Key Features Implemented:

1. **🔧 Professional Configuration**: Centralized configuration class for easy customization
2. **🌐 Optimized WebDriver**: Headless Chrome setup with performance optimizations
3. **📊 Comprehensive Data Extraction**: 18 different job attributes extracted
4. **🛡️ Robust Error Handling**: Graceful handling of missing elements and failed requests
5. **💾 Multiple Export Formats**: CSV, Excel, JSON, and industry-specific datasets
6. **📈 Advanced Analytics**: Built-in analysis and market insights
7. **🔍 Search Functionality**: Advanced filtering and search capabilities
8. **📋 Progress Tracking**: Real-time updates and performance metrics
9. **🧹 Resource Management**: Proper cleanup and memory management
10. **📝 Documentation**: Comprehensive reports and documentation

### 📊 Data Fields Extracted:

- **Basic Information**: Job Title, Company Name, Link URL
- **Job Classification**: Level, Function, Industry
- **Requirements**: Years of Experience, Qualification, Language
- **Employment Details**: Hiring status, Salary, Employment Term
- **Demographics**: Sex, Age requirements
- **Location**: Job location information
- **Detailed Requirements**: Comprehensive job requirements
- **Timeline**: Publish Date, Closing Date

### 🚀 Usage Instructions:

1. **Configuration**: Modify `CamHRConfig` class parameters as needed
2. **Execution**: Run the cells in order, then uncomment the `run_camhr_scraper()` call in Section 7 to start scraping
3. **Monitoring**: Track progress through real-time updates
4. **Analysis**: Use built-in analysis tools to examine results
5. **Export**: Generate multiple output formats for further use
6. **Cleanup**: Properly close resources when finished

### ⚡ Performance Features:

- **Headless Mode**: Faster execution without GUI
- **Simple Selectors**: Class-based lookups through BeautifulSoup and `By.CLASS_NAME`
- **Smart Waiting**: Intelligent wait conditions for page loads
- **Memory Management**: Proper resource cleanup
- **Progress Estimation**: ETA calculations for long-running processes

### 🔍 Advanced Analytics:

- **Market Insights**: Top companies, locations, and industries
- **Trend Analysis**: Job level and function distributions
- **Data Quality**: Completeness metrics for each field
- **Search Capabilities**: Multi-criteria filtering and search

### 📁 Export Options:

- **CSV**: Original structured data
- **Excel**: Formatted spreadsheet with summary
- **JSON**: API-friendly format
- **Cleaned Data**: Copy of the CSV with `'Not found'` placeholders replaced by empty values
- **Industry-Specific**: Segmented datasets by industry

### ⚖️ Ethical Considerations:

- ⚠️ **Request Delays**: `DELAY` is applied between requests, but its configured value (0.0000001 s) is effectively zero; raise it before large runs
- ✅ **Error Handling**: Graceful handling of failures
- ⚠️ **Rate Limiting**: Request frequency is controlled only by `DELAY` and page load time
- ✅ **Resource Management**: Proper cleanup of browser resources
- ⚠️ **Terms of Service**: Always check the website's ToS before scraping
- ⚠️ **Server Load**: Monitor response times and adjust delays
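
Alongside the terms of service, a site's `robots.txt` is a common place to look for crawling rules. The sketch below shows how Python's standard `urllib.robotparser` evaluates such rules. The rules and URLs here are invented for illustration and are not the actual rules of CamHR.com or any other site:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only; it is NOT the
# real file served by any site.
example_rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 2
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(example_rules)

print(parser.can_fetch("*", "https://example.com/a/job/123"))     # True
print(parser.can_fetch("*", "https://example.com/private/data"))  # False
print(parser.crawl_delay("*"))                                    # 2
```

For a live site, `set_url()` followed by `read()` would load the real file, and any crawl delay it reports is a reasonable lower bound for `DELAY`.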

### 🔧 Troubleshooting Tips:

1. **WebDriver Issues**: Ensure Chrome WebDriver path is correct
2. **Timeout Errors**: Increase `WAIT_TIMEOUT` for slower connections
3. **Missing Data**: Check website structure changes
4. **Memory Issues**: Process data in smaller batches
5. **Network Issues**: Add retry mechanisms for failed requests
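
For the retry suggestion in the last tip, a minimal sketch with exponential backoff might look like the following. `with_retries` and `flaky` are hypothetical names; note that `scrape_single_job()` returns `None` on failure instead of raising, so it would need to raise (or be wrapped) before a helper like this could retry it:

```python
import time


def with_retries(action, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call `action()` until it succeeds, doubling the wait after each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            wait = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({exc}); retrying in {wait:.1f}s")
            sleep(wait)


calls = {"count": 0}


def flaky():
    """Fails twice, then succeeds; stands in for a page load."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network error")
    return "ok"


waits = []
result = with_retries(flaky, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)   # ok [1.0, 2.0]
```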

---

**Happy Job Market Analysis! 🚀**

*This scraper was designed for educational and research purposes. Please ensure compliance with CamHR.com's terms of service and respect their server resources.*