# Remote Job Market Intelligence using Ethical Web Scraping

**Internship Mini-Project**

This notebook documents the process of collecting, cleaning, analyzing, and visualizing remote job market data, using RemoteOK as the source. The project is meant to follow ethical scraping guidelines.

## 1. Project Setup and Dependencies
The following libraries are required for this project:
```bash
pip install requests beautifulsoup4 pandas matplotlib seaborn
```

The project structure is organized for clarity and reproducibility:
```
remoteok_project/
├── data/
├── scripts/
├── visualizations/
├── REPORT.md
└── remoteok_scraping_project.ipynb (This file)
```

## 2. Ethical Web Scraping Implementation

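The core of the data collection is the `RemoteOKScraper` class. It uses the `requests` library to fetch the page and `BeautifulSoup` to parse the job listings. It is intended to scrape ethically: it sends an explicit User-Agent header and makes a single request per run. Note that the class as written does not parse the site's `robots.txt` or apply a crawl delay itself (`time` and `random` are imported but unused), so those rules have to be checked separately before running it.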
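
One way to make the `robots.txt` check explicit is the standard library's `urllib.robotparser`. The sketch below is an assumption about how such a check could look, not part of `RemoteOKScraper`: the agent name `remoteok-internship-bot`, the helper names, and the sample rules are all hypothetical, and the rules are parsed from an inline string so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical agent name, used only for this illustration.
USER_AGENT = "remoteok-internship-bot"

# Made-up rules; these are NOT RemoteOK's actual robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, user_agent: str, default: float = 1.0) -> float:
    """Return the declared crawl delay in seconds, or a fallback if none is declared."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


parser = build_parser(SAMPLE_ROBOTS)
allowed = parser.can_fetch(USER_AGENT, "https://example.com/jobs")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data")
delay = polite_delay(parser, USER_AGENT)

print(f"allowed={allowed}, blocked={blocked}, delay={delay}s")
```

For a live check, `RobotFileParser.set_url(...)` followed by `read()` would load the site's real file, and `time.sleep(delay)` between requests would apply the delay.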

In [None]:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time
import random
import os

class RemoteOKScraper:
    def __init__(self):
        self.url = "https://remoteok.com/"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        }
        self.jobs_data = []

    def fetch_page(self):
        """Fetches the main page of RemoteOK."""
        print(f"Connecting to {self.url}...")
        try:
            # In a real scenario, we'd use a session and handle retries
            response = requests.get(self.url, headers=self.headers, timeout=15)
            response.raise_for_status()
            return response.content
        except requests.exceptions.RequestException as e:
            print(f"Error fetching page: {e}")
            return None

    def parse_jobs(self, html_content):
        """Parses job listings from the HTML content."""
        if not html_content:
            return

        soup = BeautifulSoup(html_content, 'html.parser')
        job_rows = soup.find_all('tr', class_='job')
        
        print(f"Found {len(job_rows)} potential job listings.")

        for row in job_rows:
            try:
                # Extracting data based on RemoteOK's structure
                title = row.find('h2', itemprop='title').text.strip() if row.find('h2', itemprop='title') else "N/A"
                company = row.find('h3', itemprop='name').text.strip() if row.find('h3', itemprop='name') else "N/A"
                
                # Tags/Skills
                tags = [tag.text.strip() for tag in row.find_all('div', class_='tag')]
                skills = ", ".join(tags)
                locations = [loc.text.strip() for loc in row.find_all('div', class_='location')]
                
                # Usually, the first location div is the actual location, 
                # and others might be salary or job type
                location = "Remote"
                salary = "N/A"
                job_type = "Full-Time" # Default
                
                for loc in locations:
                    if '$' in loc:
                        salary = loc
                    elif any(t in loc for t in ['Full-time', 'Contract', 'Part-time']):
                        job_type = loc
                    else:
                        location = loc

                # Date posted (RemoteOK uses 'time' tag)
                date_elem = row.find('time')
                date_posted = date_elem['datetime'] if date_elem and date_elem.has_attr('datetime') else "N/A"

                self.jobs_data.append({
                    'title': title,
                    'company': company,
                    'skills': skills,
                    'location': location,
                    'job_type': job_type,
                    'salary': salary,
                    'date_posted': date_posted
                })
                
                
            except Exception as e:
                print(f"Error parsing a job row: {e}")
                continue

    def save_to_csv(self, filename="remoteok_jobs.csv"):
        """Saves the collected data to a CSV file."""
        if not self.jobs_data:
            print("No data to save.")
            return
        
        df = pd.DataFrame(self.jobs_data)
        df.to_csv(filename, index=False)
        print(f"Data saved to {filename}")

    def run(self):
        """Main execution method."""
        content = self.fetch_page()
        if content:
            self.parse_jobs(content)
            self.save_to_csv("C:/Users/ashis/Downloads/Evoastra_MiniProject/remoteok_internship_final/remoteok_project/data/remoteok_jobs_raw.csv")
        else:
            print("Failed to retrieve data. Please check connection or site status.")

if __name__ == "__main__":
    scraper = RemoteOKScraper()
    scraper.run()


Connecting to https://remoteok.com/...
Found 9 potential job listings.
Data saved to C:/Users/ashis/Downloads/Evoastra_MiniProject/remoteok_internship_final/remoteok_project/data/remoteok_jobs_raw.csv
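
The comment in `fetch_page` mentions handling retries. A minimal, library-agnostic sketch of retrying with exponential backoff follows; it is not the notebook's implementation. `fetch_with_retries`, its parameters, and the `flaky_fetch` stand-in are hypothetical names, and the fetch function is injected so the example runs without network access.

```python
import time
from typing import Callable, List, Optional


def fetch_with_retries(
    fetch: Callable[[], bytes],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch() up to `attempts` times, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # with requests, narrow this to RequestException
            print(f"Attempt {attempt + 1} failed: {exc}")
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    return None


# Stand-in for a network call: fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch() -> bytes:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return b"<html></html>"


# Record the requested waits instead of actually sleeping.
waits: List[float] = []
result = fetch_with_retries(flaky_fetch, attempts=3, base_delay=1.0, sleep=waits.append)
```

With `requests`, the injected callable could wrap `requests.get(...)` plus `raise_for_status()`, so that HTTP errors also trigger a retry.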


## 3. Data Analysis and Visualization

The analysis script loads the cleaned data (`remoteok_jobs_cleaned.csv`) and produces four charts plus two comparative skill breakdowns (contract vs. full-time roles, and skills across the top job titles). It uses `pandas` for data manipulation and `matplotlib`/`seaborn` for plotting, and saves the charts to the `visualizations/` directory.
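
The skill counts in this analysis come from splitting each comma-separated `skills` string and tallying the pieces (`str.split`, `explode`, and `value_counts` in pandas). The dependency-free sketch below mirrors that counting logic with `collections.Counter` on made-up rows; `count_skills` and the sample data are hypothetical and only illustrate the step.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def count_skills(skill_strings: Iterable[Optional[str]], n: int = 10) -> List[Tuple[str, int]]:
    """Tally individual skills across comma-separated strings and return the top n."""
    counter: Counter = Counter()
    for raw in skill_strings:
        if not raw:  # skip missing or empty skill fields
            continue
        # Unlike str.split(', '), this also tolerates missing or extra spaces.
        counter.update(part.strip() for part in raw.split(",") if part.strip())
    return counter.most_common(n)


# Made-up rows standing in for the `skills` column.
sample_rows = [
    "python, sql, aws",
    "python, react",
    None,
    "sql, python",
]

top = count_skills(sample_rows, n=2)
print(top)
```

Each posting contributes one count per listed skill, so a skill's total is the number of postings that mention it, provided no posting lists the same tag twice.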

In [None]:

import pandas as pd
import matplotlib.pyplot as plt
# No explicit backend selection here: it is unnecessary in a Jupyter Notebook
import seaborn as sns
import os

# Set the style for visualizations
sns.set_theme(style="whitegrid")

def run_analysis():
    # Load the data
    data_path = './data/remoteok_jobs_cleaned.csv' # Adjusted path for notebook context
    if not os.path.exists(data_path):
        print(f"Error: {data_path} not found.")
        return
    
    df = pd.read_csv(data_path)
    os.makedirs("./visualizations", exist_ok=True)

    # --- Visualization 1: Top 10 Skills Demand (Bar Chart) ---
    print("Generating Visualization 1: Top 10 Skills...")
    df_skills = df.copy()
    df_skills['skills'] = df_skills['skills'].str.split(', ')
    df_skills = df_skills.explode('skills')
    top_skills = df_skills['skills'].value_counts().head(10)

    plt.figure(figsize=(12, 6))
    sns.barplot(x=top_skills.index, y=top_skills.values, palette="viridis")
    plt.title('Top 10 Most Demanded Skills in Remote Jobs', fontsize=14, fontweight='bold')
    plt.xlabel('Skill', fontsize=12)
    plt.ylabel('Number of Job Postings', fontsize=12)
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.savefig('./visualizations/top_skills.png', dpi=300)
    plt.close()

    # --- Visualization 2: Job Type Distribution (Pie Chart) ---
    print("Generating Visualization 2: Job Type Distribution...")
    job_type_counts = df['job_type'].value_counts()
    plt.figure(figsize=(10, 8))
    plt.pie(job_type_counts, labels=job_type_counts.index, autopct='%1.1f%%', startangle=90, colors=sns.color_palette("pastel"))
    plt.title('Distribution of Job Types in Remote Jobs', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.savefig('./visualizations/job_type_distribution.png', dpi=300)
    plt.close()

    # --- Visualization 3: Top 10 Job Titles (Horizontal Bar Chart) ---
    print("Generating Visualization 3: Top 10 Job Titles...")
    top_titles = df['title'].value_counts().head(10)
    plt.figure(figsize=(12, 8))
    sns.barplot(x=top_titles.values, y=top_titles.index, palette="magma")
    plt.title('Top 10 Most Common Remote Job Titles', fontsize=14, fontweight='bold')
    plt.xlabel('Number of Postings', fontsize=12)
    plt.ylabel('Job Title', fontsize=12)
    plt.tight_layout()
    plt.savefig('./visualizations/top_job_titles.png', dpi=300)
    plt.close()

    # --- Visualization 4: Skill Frequency Comparison (Horizontal Bar Chart) ---
    print("Generating Visualization 4: Skill Frequency Comparison...")
    top_skills_extended = df_skills['skills'].value_counts().head(15)
    plt.figure(figsize=(12, 10))
    sns.barplot(x=top_skills_extended.values, y=top_skills_extended.index, palette="coolwarm")
    plt.title('Top 15 Skills Frequency in Remote Job Postings', fontsize=14, fontweight='bold')
    plt.xlabel('Frequency (Count)', fontsize=12)
    plt.ylabel('Skill', fontsize=12)
    plt.tight_layout()
    plt.savefig('./visualizations/skill_frequency_comparison.png', dpi=300)
    plt.close()

    # --- Comparative Analysis 1: Contract vs Full-Time Roles ---
    print("\n--- Comparative Analysis: Contract vs Full-Time ---")
    full_time = df[df['job_type'] == 'Full-Time'].copy()
    contract = df[df['job_type'] == 'Contract'].copy()

    def get_top_skills(subset_df, n=10):
        s = subset_df.copy()
        s['skills'] = s['skills'].str.split(', ')
        s = s.explode('skills')
        return s['skills'].value_counts().head(n)

    top_ft_skills = get_top_skills(full_time)
    top_c_skills = get_top_skills(contract)

    print("Top Skills for Full-Time Jobs:")
    print(top_ft_skills)
    print("\nTop Skills for Contract Jobs:")
    print(top_c_skills)

    # --- Comparative Analysis 2: Skill Demand Across Job Titles ---
    print("\n--- Comparative Analysis: Skill Demand Across Top Titles ---")
    top_3_titles = df['title'].value_counts().head(3).index.tolist()

    for title in top_3_titles:
        title_jobs = df[df['title'] == title].copy()
        title_skills = get_top_skills(title_jobs, 5)
        print(f"\nTop skills for '{title}':")
        print(title_skills)

    print("\nAnalysis complete. Visualizations saved in './visualizations/'.")

if __name__ == "__main__":
    run_analysis()


ImportError: cannot import name 'BackendFilter' from 'matplotlib.backends' (unknown location)

## 4. Project Report and Conclusion
The visualizations generated by the code above are available in the `visualizations/` folder.

