# Jobify.works Job Scraper

This notebook contains a comprehensive web scraper for extracting job listings from Jobify.works. The scraper systematically collects detailed job information including titles, salaries, requirements, qualifications, and other essential job details.

## Features
- **Automated Web Scraping**: Uses Selenium WebDriver for dynamic content extraction
- **Comprehensive Data Collection**: Extracts 16 different job attributes
- **Error Handling**: Robust error handling for missing elements
- **CSV Export**: Saves collected data in structured CSV format
- **Progress Tracking**: Real-time progress updates during scraping
- **Headless Operation**: Runs in background for better performance

## Data Fields Extracted
- Job Title
- Job Link
- Salary
- Job Type
- Job Level
- Gender Requirements
- Age Requirements
- Years of Experience
- Language Requirements
- Category
- Industry
- Location
- Qualification
- Available Positions
- Required Skills
- Job Requirements

## Requirements
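- Python 3.x
- Selenium (Python package)
- Chrome and a matching ChromeDriver executable
- pandas (used for analysis, export, and search)
- openpyxl (used for the Excel export)
- CSV module (built-in)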

## 1. Import Required Libraries

First, let's import all the necessary libraries for web scraping and data handling.

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import csv
import pandas as pd
import time
from datetime import datetime

print("✅ All libraries imported successfully!")
print(f"📅 Scraping session started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

## 2. Configuration Settings

Configure all the necessary settings for the web scraper including file paths, output settings, and scraping parameters.

In [None]:
# Configuration Settings
class JobifyConfig:
    # Path to Chrome WebDriver executable
    CHROME_DRIVER_PATH = r"D:\DSE_Folder\Year_3\Sem_2\Web Scraping\chromedriver-win64\chromedriver-win64\chromedriver.exe"
    
    # Output CSV filename
    OUTPUT_FILENAME = "job4.csv"
    
    # Scraping range settings
    START_ID = 1086  # Starting job ID
    END_ID = 500     # Ending job ID
    STEP = -1        # Step direction (negative for reverse)
    
    # Base URL template for job pages
    BASE_URL = "https://jobify.works/jobs/{}"
    
    # Timing configurations
    WAIT_TIMEOUT = 10  # seconds to wait for page elements
    
    # CSV column headers
    CSV_HEADERS = [
        "Job Title", "Job Link", "Salary", "Job Type", "Job Level", "Gender", "Age",
        "Years of Experience", "Language", "Category", "Industry", "Location", "Qualification",
        "Available Position", "Required Skills", "Job Requirement"
    ]

# Display configuration
config = JobifyConfig()
print("🔧 Configuration Settings:")
print(f"📊 Scraping range: {config.START_ID} to {config.END_ID} (step: {config.STEP})")
print(f"💾 Output file: {config.OUTPUT_FILENAME}")
print(f"⏱️ Wait timeout: {config.WAIT_TIMEOUT} seconds")
print(f"📝 Total fields to extract: {len(config.CSV_HEADERS)}")
print(f"🔗 Base URL: {config.BASE_URL}")
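
The execution cell later walks the job IDs with `range(START_ID, END_ID + STEP, STEP)`, which includes both endpoints when the step is 1 or -1. A minimal sketch of that arithmetic, using the values from `JobifyConfig` and a hypothetical helper name `job_ids` that is not part of the notebook:

```python
def job_ids(start_id, end_id, step):
    """Return the sequence of job IDs the scraper would visit.

    `job_ids` is a hypothetical helper used only for illustration.
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    # Adding `step` to the end value makes the end ID inclusive for step = +1 or -1.
    return range(start_id, end_id + step, step)


ids = job_ids(1086, 500, -1)
print(ids[0], ids[-1], len(ids))  # 1086 500 587
```

With a step other than 1 or -1, the same expression can overshoot `END_ID`, so the end value would need separate handling in that case.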

## 3. WebDriver Setup

Initialize the Chrome WebDriver with optimized settings for web scraping.

In [None]:
def setup_webdriver(chrome_driver_path):
    """
    Initialize Chrome WebDriver with optimized settings
    
    Args:
        chrome_driver_path (str): Path to Chrome WebDriver executable
    
    Returns:
        webdriver.Chrome: Configured Chrome WebDriver instance
    """
    try:
        # Configure Chrome service
        service = Service(chrome_driver_path)
        
        # Configure Chrome options
        options = webdriver.ChromeOptions()
        options.add_argument("--headless")              # Run in background
        options.add_argument("--disable-gpu")           # Disable GPU acceleration
        options.add_argument("--no-sandbox")            # Bypass OS security model
        options.add_argument("--disable-dev-shm-usage") # Overcome limited resource problems
        options.add_argument("--window-size=1920,1080")  # Set window size
        
        # Initialize WebDriver
        driver = webdriver.Chrome(service=service, options=options)
        
        print("✅ Chrome WebDriver initialized successfully!")
        print(f"🌐 Browser version: {driver.capabilities['browserVersion']}")
        print(f"🔧 Driver version: {driver.capabilities['chrome']['chromedriverVersion'].split(' ')[0]}")
        
        return driver
        
    except Exception as e:
        print(f"❌ Error initializing WebDriver: {e}")
        return None

# Initialize WebDriver
driver = setup_webdriver(config.CHROME_DRIVER_PATH)

## 4. Data Extraction Functions

Define helper functions for extracting specific job details from the web pages.

In [None]:
def extract_job_title(driver):
    """
    Extract job title from the page
    
    Args:
        driver: WebDriver instance
    
    Returns:
        str: Job title or 'N/A' if not found
    """
    try:
        title_element = WebDriverWait(driver, config.WAIT_TIMEOUT).until(
            EC.presence_of_element_located((By.CLASS_NAME, "job-title"))
        )
        return title_element.text.strip() if title_element.text.strip() else "N/A"
    except Exception as e:
        print(f"❌ Error extracting job title: {e}")
        return "N/A"

def get_job_detail(driver, label):
    """
    Extract specific job detail using label-based XPath
    
    Args:
        driver: WebDriver instance
        label (str): Label text to search for (e.g., 'Salary:', 'Job Type:')
    
    Returns:
        str: Job detail value or 'N/A' if not found
    """
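    try:
        element = driver.find_element(By.XPATH, f"//strong[text()='{label}']")
        # Selenium cannot return bare text nodes from XPath, so read the text
        # node that follows the <strong> label via JavaScript instead.
        value = driver.execute_script(
            "return arguments[0].nextSibling ? arguments[0].nextSibling.textContent : '';",
            element,
        )
        value = (value or "").strip()
        return value if value else "N/A"
    except Exception as e:
        print(f"❌ Error extracting {label}: {e}")
        return "N/A"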

def extract_job_requirements(driver, url):
    """
    Extract job requirements from the dedicated section
    
    Args:
        driver: WebDriver instance
        url (str): Current page URL for error reporting
    
    Returns:
        str: Job requirements or 'N/A' if not found
    """
    try:
        # Wait for job requirement section to load
        job_req_section = WebDriverWait(driver, config.WAIT_TIMEOUT).until(
            EC.presence_of_element_located((By.XPATH, "//h5[text()='Job Requirement']/following-sibling::div"))
        )
        
        # Extract all list items from unordered lists
        ul_elements = job_req_section.find_elements(By.TAG_NAME, "ul")
        li_elements = [
            li.text.strip() 
            for ul in ul_elements 
            for li in ul.find_elements(By.TAG_NAME, "li") 
            if li.text.strip()
        ]
        
        return " | ".join(li_elements) if li_elements else "N/A"
        
    except Exception as e:
        print(f"❌ Job Requirement not found for {url}: {e}")
        return "N/A"

print("✅ Data extraction functions defined successfully!")
print("🔧 Functions available:")
print("   - extract_job_title(): Extracts job title")
print("   - get_job_detail(): Extracts labeled job details")
print("   - extract_job_requirements(): Extracts job requirements list")
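
`get_job_detail()` relies on each value being the text that directly follows a `<strong>` label. The page markup is not shown in this notebook, so the sketch below assumes a hypothetical structure such as `<li><strong>Salary:</strong> $300 - $500</li>` and uses only the standard library to illustrate the same label-then-text lookup offline. The class name `LabelValueParser` and the sample HTML are assumptions for illustration.

```python
from html.parser import HTMLParser


class LabelValueParser(HTMLParser):
    """Collect {label: value} pairs where the value is the text after <strong>label</strong>."""

    def __init__(self):
        super().__init__()
        self.details = {}
        self._in_strong = False
        self._label = None  # most recent label still waiting for its value

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self._in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self._in_strong = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_strong:
            self._label = text
        elif self._label is not None:
            # The first non-empty text after a label is treated as its value.
            self.details[self._label] = text
            self._label = None


# Hypothetical markup, assumed for illustration only.
SAMPLE_HTML = """
<ul>
  <li><strong>Salary:</strong> $300 - $500</li>
  <li><strong>Job Type:</strong> Full Time</li>
  <li><strong>Gender:</strong></li>
</ul>
"""

parser = LabelValueParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.details.get("Salary:", "N/A"))
print(parser.details.get("Gender:", "N/A"))
```

A label with no following text, like `Gender:` in the sample, never enters the dictionary, so the `.get(..., "N/A")` fallback mirrors the placeholder the scraper writes.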

## 5. Main Scraping Function

Define the main function that orchestrates the entire scraping process.

In [None]:
def scrape_single_job(driver, job_id):
    """
    Scrape a single job listing
    
    Args:
        driver: WebDriver instance
        job_id (int): Job ID to scrape
    
    Returns:
        list: List of extracted job data or None if failed
    """
    url = config.BASE_URL.format(job_id)
    print(f"🔍 Fetching Job ID {job_id}: {url}")
    
    try:
        # Navigate to job page
        driver.get(url)
        
        # Extract job title first (acts as a page load indicator)
        title = extract_job_title(driver)
        
        if title == "N/A":
            print(f"⚠️ Job title not found for ID {job_id}, skipping...")
            return None
        
        # Extract all labeled job details
        job_details = {
            'salary': get_job_detail(driver, "Salary:"),
            'job_type': get_job_detail(driver, "Job Type:"),
            'job_level': get_job_detail(driver, "Job Level:"),
            'gender': get_job_detail(driver, "Gender:"),
            'age': get_job_detail(driver, "Age:"),
            'experience': get_job_detail(driver, "Years of Experience:"),
            'language': get_job_detail(driver, "Language:"),
            'category': get_job_detail(driver, "Category:"),
            'industry': get_job_detail(driver, "Industry:"),
            'location': get_job_detail(driver, "Location:"),
            'qualification': get_job_detail(driver, "Qualification:"),
            'available_position': get_job_detail(driver, "Available Position:"),
            'required_skills': get_job_detail(driver, "Required Skills:")
        }
        
        # Extract job requirements
        job_requirement = extract_job_requirements(driver, url)
        
        # Compile all data
        job_data = [
            title, url, job_details['salary'], job_details['job_type'], 
            job_details['job_level'], job_details['gender'], job_details['age'],
            job_details['experience'], job_details['language'], job_details['category'],
            job_details['industry'], job_details['location'], job_details['qualification'],
            job_details['available_position'], job_details['required_skills'], job_requirement
        ]
        
        print(f"✅ Successfully extracted: {title}")
        print(f"   📍 Location: {job_details['location']}")
        print(f"   💰 Salary: {job_details['salary']}")
        print(f"   🏢 Job Type: {job_details['job_type']}")
        
        return job_data
        
    except Exception as e:
        print(f"❌ Error scraping job ID {job_id}: {e}")
        return None

print("✅ Main scraping function defined successfully!")
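
`scrape_single_job()` returns `None` on any failure and the caller moves on to the next ID. If transient failures turn out to be common, a retry wrapper with growing delays is one option. The sketch below is an assumption, not part of the notebook: `with_retries` is a hypothetical helper, and `flaky_scrape` stands in for the real scraping call so the example runs without a browser.

```python
import time


def with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it returns a non-None result or attempts run out.

    Hypothetical helper: waits base_delay, then 2 * base_delay, and so on
    between tries. `sleep` is injectable so the waits can be observed.
    """
    for attempt in range(1, attempts + 1):
        result = func()
        if result is not None:
            return result
        if attempt < attempts:
            sleep(base_delay * 2 ** (attempt - 1))
    return None


# Stand-in for scrape_single_job(driver, job_id): fails twice, then succeeds.
calls = {"count": 0}


def flaky_scrape():
    calls["count"] += 1
    return ["Data Analyst", "N/A"] if calls["count"] >= 3 else None


waits = []
row = with_retries(flaky_scrape, attempts=3, base_delay=0.5, sleep=waits.append)
print(row, waits)
```

In the notebook this could wrap the existing call as `with_retries(lambda: scrape_single_job(driver, job_id))`. Because `scrape_single_job()` also returns `None` for job IDs that do not exist, retries would slow those down, so a small `attempts` value is probably sensible.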

## 6. Execute the Scraping Process

Now let's run the complete scraping process. This cell will iterate through all job IDs and save the data to CSV.

In [None]:
def run_jobify_scraper():
    """
    Execute the complete Jobify scraping process
    """
    if driver is None:
        print("❌ WebDriver not initialized. Please run the WebDriver setup cell first.")
        return
    
    print("🚀 Starting Jobify.works scraping process...")
    print(f"📊 Range: {config.START_ID} to {config.END_ID} (step: {config.STEP})")
    print(f"📁 Output file: {config.OUTPUT_FILENAME}")
    print("=" * 60)
    
    # Initialize counters
    successful_scrapes = 0
    failed_scrapes = 0
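    # Count the IDs the loop will actually visit, so the total stays right for any step
    total_jobs = len(range(config.START_ID, config.END_ID + config.STEP, config.STEP))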
    
    # Open CSV file for writing
    try:
        with open(config.OUTPUT_FILENAME, "w", encoding="utf-8", newline='') as file:
            writer = csv.writer(file)
            
            # Write CSV headers
            writer.writerow(config.CSV_HEADERS)
            print(f"📝 CSV file created with headers: {len(config.CSV_HEADERS)} columns")
            
            # Loop through job IDs
            for job_id in range(config.START_ID, config.END_ID + config.STEP, config.STEP):
                print(f"\n📊 Progress: {successful_scrapes + failed_scrapes + 1}/{total_jobs}")
                
                # Scrape single job
                job_data = scrape_single_job(driver, job_id)
                
                if job_data:
                    # Write to CSV
                    writer.writerow(job_data)
                    successful_scrapes += 1
                    print(f"💾 Data saved to CSV")
                else:
                    failed_scrapes += 1
                    print(f"⏩ Skipped job ID {job_id}")
                
                # Add small delay to be respectful to the server
                time.sleep(0.5)
        
        print("\n" + "=" * 60)
        print("🎉 Scraping completed successfully!")
        print(f"✅ Successful scrapes: {successful_scrapes}")
        print(f"❌ Failed scrapes: {failed_scrapes}")
        print(f"📊 Success rate: {(successful_scrapes/total_jobs)*100:.1f}%")
        print(f"💾 Data saved to: {config.OUTPUT_FILENAME}")
        
    except Exception as e:
        print(f"❌ Error during scraping process: {e}")
    
    finally:
        # Close the browser
        if driver:
            driver.quit()
            print("🔒 Browser closed successfully")

# Run the scraper (uncomment the line below to start)
# run_jobify_scraper()

print("🔄 To start scraping, uncomment the 'run_jobify_scraper()' line above and run this cell.")
print("⚠️ Warning: This process may take several minutes depending on the number of jobs to scrape.")
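
Job titles and requirement text often contain commas, which is why the rows are written with `csv.writer` rather than joined by hand. A small in-memory round trip, using hypothetical sample values, shows that such fields survive writing and reading unchanged:

```python
import csv
import io

# Hypothetical sample values; real rows come from scrape_single_job().
headers = ["Job Title", "Salary", "Job Requirement"]
row = ["Accountant, Senior", "$500 - $800", "Bachelor's degree | 2 years' experience, audit"]

# newline="" mirrors the open(..., newline='') call used by the scraper.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(headers)
writer.writerow(row)

# Read the same text back and compare with what was written.
buffer.seek(0)
rows = list(csv.reader(buffer))
print(rows[1] == row)
```

The writer quotes any field that contains a comma, so the ` | ` separator used for the requirement list is not strictly needed for CSV safety. It only keeps the list readable inside one cell.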

## 7. Data Analysis

After scraping is complete, let's analyze the collected data to gain insights.

In [None]:
def analyze_scraped_data():
    """
    Analyze the scraped job data and provide insights
    """
    try:
        # Load the CSV data
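        # keep_default_na=False keeps the literal 'N/A' placeholders as strings;
        # by default pandas would parse them as NaN and the checks below would miss them
        df = pd.read_csv(config.OUTPUT_FILENAME, keep_default_na=False)

        print("📊 Jobify.works Data Analysis")
        print("=" * 50)

        # Basic statistics
        print(f"📈 Dataset Overview:")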
        print(f"   Total jobs scraped: {len(df)}")
        print(f"   Total columns: {len(df.columns)}")
        print(f"   Data types: {df.dtypes.value_counts().to_dict()}")
        
        # Display first few rows
        print(f"\n📋 First 5 Records:")
        print(df.head())

        # Job categories analysis
        print(f"\n🏷️ Top Job Categories:")
        category_counts = df['Category'].value_counts().head(10)
        for category, count in category_counts.items():
            if category != 'N/A':
                print(f"   {category}: {count} jobs")
        
        # Location analysis
        print(f"\n🌍 Top Job Locations:")
        location_counts = df['Location'].value_counts().head(10)
        for location, count in location_counts.items():
            if location != 'N/A':
                print(f"   {location}: {count} jobs")
        
        # Industry analysis
        print(f"\n🏭 Top Industries:")
        industry_counts = df['Industry'].value_counts().head(10)
        for industry, count in industry_counts.items():
            if industry != 'N/A':
                print(f"   {industry}: {count} jobs")
        
        # Job level analysis
        print(f"\n📊 Job Level Distribution:")
        level_counts = df['Job Level'].value_counts()
        for level, count in level_counts.items():
            if level != 'N/A':
                percentage = (count / len(df)) * 100
                print(f"   {level}: {count} jobs ({percentage:.1f}%)")
        
        # Data quality analysis
        print(f"\n🔍 Data Quality Analysis:")
        for column in config.CSV_HEADERS:
            na_count = (df[column] == 'N/A').sum()
            na_percentage = (na_count / len(df)) * 100
            print(f"   {column}: {na_count} missing ({na_percentage:.1f}%)")
        
        return df
        
    except FileNotFoundError:
        print(f"❌ File {config.OUTPUT_FILENAME} not found. Please run the scraper first.")
        return None
    except Exception as e:
        print(f"❌ Error analyzing data: {e}")
        return None

# Run analysis (will only work after scraping is complete)
df = analyze_scraped_data()
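
The data-quality step counts how often the `'N/A'` placeholder appears in each column. Note that `pd.read_csv` treats the literal string `N/A` as a missing value unless `keep_default_na=False` is passed, so counts based on `== 'N/A'` depend on how the file is loaded. The sketch below shows the same count with only the standard library. The sample rows and the helper name `count_placeholders` are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the same shape as the scraper's CSV output.
SAMPLE_CSV = """Job Title,Salary,Location
Accountant,N/A,Phnom Penh
Web Developer,$400 - $700,N/A
Sales Executive,N/A,Siem Reap
"""


def count_placeholders(csv_text, placeholder="N/A"):
    """Return (row_count, Counter of placeholder occurrences per column)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = Counter()
    total = 0
    for record in reader:
        total += 1
        for column, value in record.items():
            if value == placeholder:
                missing[column] += 1
    return total, missing


total, missing = count_placeholders(SAMPLE_CSV)
for column in ("Job Title", "Salary", "Location"):
    share = missing[column] / total * 100
    print(f"{column}: {missing[column]} missing ({share:.1f}%)")
```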

## 8. Data Export and Additional Formats

Export the scraped data to different formats for further analysis.

In [None]:
def export_data_to_formats():
    """
    Export scraped data to multiple formats
    """
    try:
        # Load the CSV data
        # keep_default_na=False keeps 'N/A' placeholders as strings instead of NaN
        df = pd.read_csv(config.OUTPUT_FILENAME, keep_default_na=False)

        print("📦 Exporting data to multiple formats...")
        
        # Export to Excel
        excel_filename = config.OUTPUT_FILENAME.replace('.csv', '.xlsx')
        df.to_excel(excel_filename, index=False, engine='openpyxl')
        print(f"✅ Excel export completed: {excel_filename}")
        
        # Export to JSON
        json_filename = config.OUTPUT_FILENAME.replace('.csv', '.json')
        df.to_json(json_filename, orient='records', indent=2)
        print(f"✅ JSON export completed: {json_filename}")
        
        # Create a cleaned dataset (replace 'N/A' placeholders with empty strings)
        df_cleaned = df.replace('N/A', '')
        cleaned_filename = config.OUTPUT_FILENAME.replace('.csv', '_cleaned.csv')
        df_cleaned.to_csv(cleaned_filename, index=False)
        print(f"✅ Cleaned CSV export completed: {cleaned_filename}")
        
        # Generate summary statistics
        summary_filename = config.OUTPUT_FILENAME.replace('.csv', '_summary.txt')
        with open(summary_filename, 'w', encoding='utf-8') as f:
            f.write("Jobify.works Scraping Summary\n")
            f.write("=" * 40 + "\n")
            f.write(f"Total records: {len(df)}\n")
            f.write(f"Scraping date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
            f.write(f"Source: {config.BASE_URL}\n")
            f.write(f"ID range: {config.START_ID} to {config.END_ID}\n\n")
            
            f.write("Column completeness:\n")
            for column in config.CSV_HEADERS:
                na_count = (df[column] == 'N/A').sum()
                completeness = ((len(df) - na_count) / len(df)) * 100
                f.write(f"  {column}: {completeness:.1f}% complete\n")
        
        print(f"✅ Summary report generated: {summary_filename}")

        print(f"\n📊 Export Summary:")
        print(f"   📁 Original CSV: {config.OUTPUT_FILENAME}")
        print(f"   📊 Excel file: {excel_filename}")
        print(f"   🔗 JSON file: {json_filename}")
        print(f"   🧹 Cleaned CSV: {cleaned_filename}")
        print(f"   📋 Summary report: {summary_filename}")
        
    except FileNotFoundError:
        print(f"❌ File {config.OUTPUT_FILENAME} not found. Please run the scraper first.")
    except Exception as e:
        print(f"❌ Error exporting data: {e}")

# Export to multiple formats
export_data_to_formats()
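
The export filenames are derived from `OUTPUT_FILENAME` with `str.replace('.csv', ...)`, which works for `job4.csv` but would also rewrite a `.csv` that appears earlier in the name. As a sketch of an alternative, `pathlib` changes only the final suffix. The helper name `derived_names` is hypothetical.

```python
from pathlib import Path


def derived_names(csv_name):
    """Build the export filenames from the CSV filename (hypothetical helper)."""
    path = Path(csv_name)
    return {
        "excel": str(path.with_suffix(".xlsx")),
        "json": str(path.with_suffix(".json")),
        "cleaned": str(path.with_name(f"{path.stem}_cleaned.csv")),
        "summary": str(path.with_name(f"{path.stem}_summary.txt")),
    }


print(derived_names("job4.csv"))
```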

## 9. Advanced Data Filtering and Search

Provide tools for filtering and searching through the scraped job data.

In [None]:
def search_jobs(keyword=None, location=None, category=None, job_level=None, min_salary=None):
    """
    Search and filter jobs based on various criteria
    
    Args:
        keyword (str): Keyword to search in job title
        location (str): Job location
        category (str): Job category
        job_level (str): Job level (Entry, Mid, Senior, etc.)
        min_salary (str): Reserved for a minimum salary filter; currently not applied
    
    Returns:
        DataFrame: Filtered job data
    """
    try:
        # keep_default_na=False keeps 'N/A' placeholders as strings instead of NaN
        df = pd.read_csv(config.OUTPUT_FILENAME, keep_default_na=False)
        filtered_df = df.copy()

        print(f"🔍 Searching jobs with criteria:")
        
        # Apply filters
        if keyword:
            filtered_df = filtered_df[filtered_df['Job Title'].str.contains(keyword, case=False, na=False)]
            print(f"   📝 Keyword: '{keyword}'")
        
        if location:
            filtered_df = filtered_df[filtered_df['Location'].str.contains(location, case=False, na=False)]
            print(f"   📍 Location: '{location}'")
        
        if category:
            filtered_df = filtered_df[filtered_df['Category'].str.contains(category, case=False, na=False)]
            print(f"   🏷️ Category: '{category}'")
        
        if job_level:
            filtered_df = filtered_df[filtered_df['Job Level'].str.contains(job_level, case=False, na=False)]
            print(f"   📊 Job Level: '{job_level}'")
        
        print(f"\n📊 Search Results: {len(filtered_df)} jobs found")

        if len(filtered_df) > 0:
            print(f"\n📋 Sample Results:")
            display_columns = ['Job Title', 'Location', 'Category', 'Job Level', 'Salary']
            print(filtered_df[display_columns].head(10).to_string(index=False))
        
        return filtered_df
        
    except FileNotFoundError:
        print(f"❌ File {config.OUTPUT_FILENAME} not found. Please run the scraper first.")
        return pd.DataFrame()
    except Exception as e:
        print(f"❌ Error searching jobs: {e}")
        return pd.DataFrame()

# Example searches (uncomment to try)
print("🔍 Job Search Examples:")
print("Uncomment any of the following lines to search for specific jobs:")
print("# search_jobs(keyword='developer')")
print("# search_jobs(location='Phnom Penh')")
print("# search_jobs(category='IT')")
print("# search_jobs(job_level='Senior')")
print("# search_jobs(keyword='manager', location='Cambodia')")

# Example: Search for developer jobs
# results = search_jobs(keyword='developer')
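
`search_jobs()` accepts a `min_salary` argument, but the salary column holds free text, so a numeric filter first needs a parser. The sketch below assumes salaries look like `$300 - $500` or `$1,200`; the real format on the site may differ. Both helper names are hypothetical.

```python
import re


def parse_salary_range(text):
    """Return (low, high) as floats, or None when the text holds no number.

    Assumes values such as "$300 - $500" or "$1,200" (an assumption about
    the site's format, not something confirmed by the notebook).
    """
    found = re.findall(r"\d[\d,]*(?:\.\d+)?", text or "")
    numbers = [float(n.replace(",", "")) for n in found]
    if not numbers:
        return None
    return min(numbers), max(numbers)


def meets_min_salary(text, min_salary):
    """True when the upper end of the parsed range is at least min_salary."""
    parsed = parse_salary_range(text)
    return parsed is not None and parsed[1] >= min_salary


for sample in ["$300 - $500", "$1,200", "Negotiable", "N/A"]:
    print(sample, parse_salary_range(sample), meets_min_salary(sample, 400))
```

Rows without a number, such as `Negotiable` or `N/A`, are excluded by this filter. Whether that is the right behavior depends on how the results are meant to be used.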

## 10. Summary and Best Practices

This notebook provides a comprehensive solution for scraping job data from Jobify.works.

### Key Features Implemented:

1. **🔧 Robust Configuration**: Centralized configuration class for easy customization
2. **🌐 Optimized WebDriver**: Headless Chrome setup with performance optimizations
3. **📊 Comprehensive Data Extraction**: 16 different job attributes extracted
4. **🛡️ Error Handling**: Graceful handling of missing elements and failed requests
5. **💾 Multiple Export Formats**: CSV, Excel, JSON, and cleaned datasets
6. **📈 Data Analysis**: Built-in summary statistics and data-quality checks
7. **🔍 Search Functionality**: Keyword, location, category, and job-level filters
8. **📋 Progress Tracking**: Real-time updates and success rate monitoring

### Data Fields Extracted:

- **Basic Info**: Job Title, Job Link, Location
- **Employment Details**: Job Type, Job Level, Available Positions
- **Requirements**: Years of Experience, Qualification, Required Skills
- **Compensation**: Salary information
- **Demographics**: Gender, Age requirements
- **Classification**: Category, Industry
- **Other**: Language requirements, Job Requirements

### Usage Instructions:

1. **Setup**: Ensure Chrome WebDriver path is correct
2. **Configure**: Modify `JobifyConfig` class parameters as needed
3. **Execute**: Run the cells in order, then uncomment `run_jobify_scraper()` to start scraping
4. **Analyze**: Use built-in analysis tools to examine results
5. **Export**: Generate multiple output formats for further use

### Ethical Considerations:

- ✅ **Respectful Delays**: A 0.5-second pause between requests
- ✅ **Error Handling**: Graceful handling of failures
- ✅ **Rate Limiting**: One page is requested at a time, in sequence
- ⚠️ **Terms of Service**: Always check the website's ToS before scraping
- ⚠️ **Server Load**: Monitor server response and adjust delays if needed
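
Checking `robots.txt` can be done with the standard library before any page is requested. The sketch below parses a hypothetical `robots.txt` from a string so that it runs offline; the rules shown are invented for illustration and say nothing about the real file on the site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; the real file on the site may differ.
SAMPLE_ROBOTS = """
User-agent: *
Disallow: /admin/
Crawl-delay: 2
""".strip().splitlines()

rules = RobotFileParser()
rules.parse(SAMPLE_ROBOTS)

job_url = "https://jobify.works/jobs/1086"
print(rules.can_fetch("*", job_url))
print(rules.can_fetch("*", "https://jobify.works/admin/login"))
print(rules.crawl_delay("*"))
```

Against the live site, `rules.set_url(...)` followed by `rules.read()` would fetch the actual file. If it declares a crawl delay longer than the scraper's 0.5-second pause, the pause should be raised to match.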

### Performance Optimization:

- **Headless Mode**: Faster execution without GUI
- **Efficient Selectors**: Class-name and label-based XPath lookups
- **Error Recovery**: Failed pages are logged and skipped so one bad listing does not stop the run
- **Memory Management**: Proper WebDriver cleanup

---

**Happy Job Hunting! 🚀**

*This scraper was designed for educational and research purposes. Please ensure compliance with the website's terms of service and robots.txt before large-scale usage.*