<a href="https://colab.research.google.com/github/AbhinavKumar0000/Data_collection_pipeline/blob/main/Scraper_script.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Job Posting Data Collection from Remotive API

This notebook collects job posting data from the Remotive REST API, processes it, and saves it to a CSV file. The script uses the `requests` library to fetch data, `BeautifulSoup` to clean HTML descriptions, and `pandas` to structure the data into a DataFrame.

## Prerequisites
- Install required libraries: `requests`, `pandas`, `beautifulsoup4`
- Ensure internet connectivity for API access

## Steps
1. Import necessary libraries
2. Define a function to collect job data from the Remotive API
3. Parse HTML job descriptions to extract plain text
4. Save the processed data to a CSV file
5. Display the first few rows and DataFrame information
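
Step 3 can be tried in isolation before running the full pipeline. The following is a minimal sketch, assuming a small made-up HTML fragment; the helper name `html_to_text` is hypothetical and does not appear in the notebook's function, which performs the same two calls inline.

```python
from bs4 import BeautifulSoup


def html_to_text(description_html):
    """Return the visible text of an HTML fragment as a single line.

    Hypothetical helper: it wraps the same BeautifulSoup calls the notebook uses.
    """
    # Treat a missing description (None) as an empty string
    soup = BeautifulSoup(description_html or "", "html.parser")
    # Join every text node with a single space and strip surrounding whitespace
    return soup.get_text(separator=" ", strip=True)


# Made-up sample input, not taken from the API
sample_html = "<p>We are <strong>hiring</strong>!</p><ul><li>Python</li><li>SQL</li></ul>"
plain_text = html_to_text(sample_html)
print(plain_text)  # We are hiring ! Python SQL
```

Because `separator=' '` is inserted between every text node, punctuation that follows an inline tag picks up a leading space (`hiring !`). That is usually acceptable for raw data, but worth knowing before any downstream text matching.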



In [9]:
# Import the necessary libraries
import requests
import pandas as pd
from bs4 import BeautifulSoup

### Ingesting Job Posting Data from the Remotive REST API

In [10]:
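def collect_jobs_from_remotive_api():
    """Fetch job postings from the Remotive API and return them as a DataFrame.

    Returns an empty DataFrame if the request fails or no jobs are returned.
    """
    # API endpoint and query parameters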
    api_url = "https://remotive.com/api/remote-jobs?category=software-development&limit=50"

    print(f"Requesting data from API endpoint: {api_url}")
    job_list = []

    try:
        # Set a timeout so the call cannot hang indefinitely on a stalled connection
        response = requests.get(api_url, timeout=30)
        response.raise_for_status()
        data = response.json()

        if 'jobs' not in data or not data['jobs']:
            print("The API response did not contain a 'jobs' array or it was empty")
            return pd.DataFrame()

        print(f"Successfully fetched {len(data['jobs'])} records from the API")

        # Parse each HTML description to extract plain text
        for job in data['jobs']:
            # Fall back to an empty string if the description is missing or null
            description_html = job.get('description') or ''
            soup = BeautifulSoup(description_html, 'html.parser')
            clean_description = soup.get_text(separator=' ', strip=True)

            # Map the API response fields to our target schema
            job_data = {
                'job_title': job.get('title'),
                'company_name': job.get('company_name'),
                'location': job.get('candidate_required_location'),
                'job_description': clean_description,
                'url': job.get('url')
            }
            job_list.append(job_data)

    except requests.exceptions.RequestException as e:
        print(f"An error occurred during the API request: {e}")
        return pd.DataFrame()

    return pd.DataFrame(job_list)
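
The function above hard-codes the category and limit inside the URL string. As a minimal sketch, assuming you want to vary those values, the query string can be assembled from a dictionary instead. The names `BASE_URL` and `build_api_url` are hypothetical and are not part of the notebook.

```python
from urllib.parse import urlencode

# Endpoint taken from the notebook, without its query string
BASE_URL = "https://remotive.com/api/remote-jobs"


def build_api_url(category, limit):
    """Assemble the request URL from its query parameters (hypothetical helper)."""
    query = urlencode({"category": category, "limit": limit})
    return f"{BASE_URL}?{query}"


api_url = build_api_url("software-development", 50)
print(api_url)
```

With these arguments the result is the same URL the notebook requests. `requests.get` also accepts a `params` dictionary and performs this encoding itself, which avoids building the string by hand.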

### Executing the Data Collection Script

In [11]:
raw_df = collect_jobs_from_remotive_api()

if not raw_df.empty:
    raw_df.to_csv('raw_data.csv', index=False)
    print("Data collection successful. Raw dataset saved to raw_data.csv")
else:
    print("Data collection failed or returned no data")

Requesting data from API endpoint: https://remotive.com/api/remote-jobs?category=software-development&limit=50
Successfully fetched 50 records from the API
Data collection successful. Raw dataset saved to raw_data.csv
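
Job descriptions and company names often contain commas and quotation marks, so it can be worth confirming that the CSV reads back with the same columns and values. The following is a sketch under stated assumptions: the row is made up for illustration (it is not a real posting), and a temporary directory is used so the notebook's `raw_data.csv` is left untouched.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample row standing in for the API results
sample_df = pd.DataFrame(
    [
        {
            "job_title": "Example Engineer",
            "company_name": "Example, Inc.",
            "location": "Worldwide",
            "job_description": 'Build things, "ship" them.',
            "url": "https://example.com/jobs/1",
        }
    ]
)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "raw_data.csv")
    # index=False keeps the row index out of the file, as in the notebook
    sample_df.to_csv(csv_path, index=False)
    reloaded_df = pd.read_csv(csv_path)

# Compare column names and the fields that contain commas or quotes
columns_match = list(reloaded_df.columns) == list(sample_df.columns)
print(columns_match)
print(reloaded_df.loc[0, "company_name"])
```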


In [12]:
if not raw_df.empty:
    print("Raw Data:")
    display(raw_df.head())

Raw Data:


| | job_title | company_name | location | job_description | url |
|---|---|---|---|---|---|
| 0 | QA Test Engineer | PrimeWorks | USA | You will be working for a US-based technology ... | https://remotive.com/remote-jobs/qa/qa-test-en... |
| 1 | Senior Product Manager, Platform | ServiceUp | USA | About the Role: ServiceUp is seeking a Senior ... | https://remotive.com/remote-jobs/product/senio... |
| 2 | Business Development Associate | Forbes Advisor | USA | Company Description At Forbes Advisor, our mis... | https://remotive.com/remote-jobs/sales-busines... |
| 3 | Graphic Designer | DMS International | USA | Data Management Services, Inc. (dba: DMS Inter... | https://remotive.com/remote-jobs/design/graphi... |
| 4 | Client Success Manager | Momentive Software, Inc. | USA | This description is a summary of our understan... | https://remotive.com/remote-jobs/customer-supp... |


In [13]:
raw_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   job_title        50 non-null     object
 1   company_name     50 non-null     object
 2   location         50 non-null     object
 3   job_description  50 non-null     object
 4   url              50 non-null     object
dtypes: object(5)
memory usage: 2.1+ KB
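
The non-null counts reported by `info()` do not catch empty strings, and the collection function stores an empty string when a description is missing. A minimal sketch of two extra checks follows, assuming made-up rows for illustration; `blank_counts` and `duplicate_urls` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows: one blank description and one repeated URL
df = pd.DataFrame(
    {
        "job_title": ["Engineer A", "Engineer B", "Engineer C"],
        "job_description": ["Some text", "", "More text"],
        "url": [
            "https://example.com/jobs/1",
            "https://example.com/jobs/2",
            "https://example.com/jobs/1",
        ],
    }
)

# Count cells per column that are empty or whitespace-only after conversion to text
blank_counts = df.apply(lambda col: col.astype(str).str.strip().eq("").sum())

# Count rows whose URL already appeared in an earlier row
duplicate_urls = int(df["url"].duplicated().sum())

print(blank_counts)
print(duplicate_urls)
```

Running the same two checks on `raw_df` would show whether the 50 rows are as complete as the `info()` summary suggests.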
