Code for scraping job postings from hh.ru

There was too little data on the profession I originally chose, so I decided to look at several different fields and compare them.

I chose these: data scientist, frontend developer, data analyst, ML engineer, Java developer.

In [4]:
import requests
import json
import time
from tqdm import tqdm

def fetch_vacancies(query, area=113, pages=100, per_page=20):
    base_url = "https://api.hh.ru/vacancies"
    headers = {"User-Agent": "JobScraper/1.0"}
    all_vacancies = []

    for page in tqdm(range(pages), desc=f"scrape data: {query}"):
        params = {
            "text": query,
            "area": area,
            "page": page,
            "per_page": per_page
        }
        # A timeout keeps one stalled request from hanging the whole run.
        response = requests.get(base_url, headers=headers, params=params, timeout=10)
        if response.status_code != 200:
            print(f"error on page {page}: {response.status_code}")
            break
        data = response.json()
        vacancies = data.get("items", [])
        if not vacancies:
            break
        all_vacancies.extend(vacancies)
        time.sleep(0.1)

    return all_vacancies

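def fetch_vacancy_details(vacancy_ids):
    # Send the same User-Agent as the search requests and set a timeout.
    headers = {"User-Agent": "JobScraper/1.0"}
    detailed_vacancies = []
    for vid in tqdm(vacancy_ids, desc="get details"):
        url = f"https://api.hh.ru/vacancies/{vid}"
        response = requests.get(url, headers=headers, timeout=10)
        # Non-200 responses are skipped, so the result can be shorter than vacancy_ids.
        if response.status_code == 200:
            detailed_vacancies.append(response.json())
        time.sleep(0.05)
    return detailed_vacancies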

SEARCH_QUERIES = [
    "data scientist",
    "frontend разработчик",
    "data analyst",
    "ML engineer",
    "java разработчик"
]

def main():
    categorized_vacancies = {}

    for query in SEARCH_QUERIES:
        print(f"\nscrape by query: {query}")
        vacancies = fetch_vacancies(query=query, area=113, pages=30, per_page=20)
        vacancy_ids = [v["id"] for v in vacancies]

        details = fetch_vacancy_details(vacancy_ids)
        categorized_vacancies[query] = details

    with open("categorized_vacancies.json", "w", encoding="utf-8") as f:
        json.dump(categorized_vacancies, f, ensure_ascii=False, indent=4)

    total = sum(len(v) for v in categorized_vacancies.values())
    print(f"\nVacancies collected: {total} (divided into categories)")

if __name__ == "__main__":
    main()



scrape by query: data scientist


scrape data: data scientist:  53%|█████▎    | 16/30 [00:08<00:07,  1.87it/s]
get details: 100%|██████████| 302/302 [01:09<00:00,  4.36it/s]



scrape by query: frontend разработчик


scrape data: frontend разработчик: 100%|██████████| 30/30 [00:15<00:00,  1.95it/s]
get details: 100%|██████████| 600/600 [02:29<00:00,  4.02it/s]



scrape by query: data analyst


scrape data: data analyst: 100%|██████████| 30/30 [00:14<00:00,  2.12it/s]
get details: 100%|██████████| 600/600 [02:27<00:00,  4.07it/s]



scrape by query: ML engineer


scrape data: ML engineer: 100%|██████████| 30/30 [00:10<00:00,  2.90it/s]
get details: 100%|██████████| 589/589 [02:44<00:00,  3.59it/s]



scrape by query: java разработчик


scrape data: java разработчик: 100%|██████████| 30/30 [00:10<00:00,  2.83it/s]
get details: 100%|██████████| 600/600 [02:23<00:00,  4.19it/s]



Vacancies collected: 1062 (divided into categories)
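
The progress bars above list 302 + 600 + 600 + 589 + 600 = 2691 ids, while the final count is 1062. Since fetch_vacancy_details skips any response whose status is not 200 without logging it, failed detail requests are one possible cause, although the log does not confirm this. Below is a minimal sketch of a retry wrapper, assuming the failures are transient. The names fetch_with_retry, retries, base_delay and sleep are hypothetical and not part of the original code; the fetch callable stands in for the requests.get call and returns a (status_code, payload) pair.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, Optional[dict]]],
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() until it returns status 200 or the retries run out.

    fetch is a hypothetical zero-argument callable returning
    (status_code, payload); it stands in for a requests.get call.
    """
    for attempt in range(retries):
        status, payload = fetch()
        if status == 200:
            return payload
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        if attempt < retries - 1:
            sleep(base_delay * 2 ** attempt)
    # None lets the caller record the id as failed instead of dropping it silently.
    return None


# Demo with a fake fetch that fails twice, then succeeds (no network needed).
responses = iter([(429, None), (500, None), (200, {"id": "1"})])
result = fetch_with_retry(lambda: next(responses), sleep=lambda seconds: None)
print(result)  # prints {'id': '1'}
```

Inside fetch_vacancy_details, the callable could wrap the existing request and return response.status_code together with response.json() on success; ids for which the helper returns None can then be counted and reported at the end of the run.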


Remove the branded_description field from the data; it is not needed for my visualization.

In [2]:
import json

with open('categorized_vacancies.json', 'r', encoding='utf-8') as file:
    data = json.load(file)

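# data maps each search query to a list of vacancy dicts.
for profession in data.values():
    for vacancy in profession:
        # pop with a default removes the key if present and does nothing otherwise.
        vacancy.pop("branded_description", None)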

with open('clean_categorized_vacancies.json', 'w', encoding='utf-8') as file:
    json.dump(data, file, ensure_ascii=False, indent=2)
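
A quick sanity check of the cleaned data can confirm that the field is gone and show how many vacancies each category holds. This is a minimal sketch: summarize and has_field are hypothetical helper names, and the sample dictionary only mimics the structure of clean_categorized_vacancies.json (each query mapped to a list of vacancy dicts).

```python
def summarize(data):
    """Return a dict mapping each category to its number of vacancies."""
    return {category: len(vacancies) for category, vacancies in data.items()}


def has_field(data, field):
    """Return True if any vacancy in any category still contains the field."""
    return any(
        field in vacancy
        for vacancies in data.values()
        for vacancy in vacancies
    )


# Made-up sample in the same shape as the saved JSON (not real hh.ru data).
sample = {
    "data scientist": [{"id": "1", "name": "DS"}, {"id": "2", "name": "Senior DS"}],
    "data analyst": [{"id": "3", "name": "Analyst", "branded_description": "<p>...</p>"}],
}

print(summarize(sample))
print(has_field(sample, "branded_description"))
```

With the real file, passing the loaded data to both helpers gives the per-category counts and should report False for branded_description after the cleaning step.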