## First Round Country Check

Used `"google_domain": "google.com"`.

[] - iOS, [] - Android
-----------------------
- [-],[-] Austria
- [-],[-] Belgium
- [-],[+] Croatia
- [+],[-] Czechia (Czech Republic)
- [+],[+] Denmark
- [+],[+] Estonia
- [+],[-] Finland
- [-],[-] France
- [-],[-] Germany
- [-],[+] Greece
- [+],[-] Hungary
- [-],[-] Italy
- [+],[-] Latvia
- [-],[+] Lithuania
- [-],[-] Luxembourg
- [+],[-] Malta
- [+],[+] Netherlands
- [-],[+] Poland
- [+],[-] Portugal
- [-],[-] Slovakia
- [+],[+] Slovenia
- [-],[+] Spain
- [-],[-] Sweden
- [-],[+] Bulgaria (from March 2024)
- [+],[-] Romania (from March 2024)

Non-EU Member States:
---
- [-],[-] Iceland
- [+],[-] Liechtenstein
- [-],[-] Norway
- [-],[+] Switzerland

Outside Schengen but with special status:
---
- [+],[+] Ireland 
- [+],[+] United Kingdom
- [-],[-] Cyprus 

North American countries:
---
- [+],[+] Canada
- [+],[+] United States
- [-],[+] Mexico

In [1]:
from serpapi import GoogleSearch
from dotenv import load_dotenv
import pandas as pd
import time
import json
import os
import re

load_dotenv()
api_key = os.getenv("API_KEY")

In [2]:
path = r"../data/csv/location_domain_table.csv"
df = pd.read_csv(path).copy()

df = df.drop(columns=["gl(not needed)"])
df.head()

| Unnamed: 0 | location | google_domain | Region | EU member | Schengen Agreement |
|---|---|---|---|---|---|
| 0 | Austria | google.at | Europe | True | True |
| 1 | Belgium | google.be | Europe | True | True |
| 2 | Bulgaria | google.bg | Europe | True | True |
| 3 | Canada | google.ca | Northern America | False | False |
| 4 | Croatia | google.hr | Europe | True | True |


In [None]:
search = GoogleSearch({
    "engine": "google_jobs",
    "q": "Android developer", 
    "location": "Mexico",
    "api_key": api_key
  })
result = search.get_dict()

file_path = r"./Mexico1.json"  

# Saving JSON to a file
with open(file_path, "w", encoding="utf-8") as file:
    json.dump(result, file, ensure_ascii=False, indent=4)

Version 1:
search = GoogleSearch({
    "engine": "google_jobs",
    "q": "Job title", 
    "location": "Country",
    "api_key": api_key
  })

## New version

### Function parameters

- **`quarry`** (*str*):  
  Job search query (e.g., `"Android developer"`).

- **`location`** (*str*, default=`"all"`):  
  - `"all"`: collect data for every country in the DataFrame.  
  - A country name (e.g., `"Austria"`): collect data for that country only.

- **`domain`** (*str*, default=`"default"`):  
  - `"default"`: use `google.com`.  
  - `"local"`: use the local domain from the DataFrame (e.g., `google.at` for Austria).

- **`number_of_queries`** (*int* or *str*, default=`1`):  
  - `"all"`: keep collecting data for as long as results are returned.  
  - A number (e.g., `2`): limit the number of queries per country.

- **`api_key`** (*str*):  
  Your SerpApi API key.

- **`data_frame`** (*pandas.DataFrame*):  
  DataFrame with country data. Required columns:  
  - `location`: country name (e.g., "Austria").  
  - `google_domain`: local Google domain for that country (e.g., "google.at").

- **`save_path`** (*str*, default=`"."`):  
  Path where the JSON files are saved.

- **`number_of_errors`** (*int*, default=`2`):  
  Maximum number of consecutive empty queries, after which data collection for the country stops.

- **`report`** (*bool*, default=`True`):  
  - `True`: print a report with the number of queries and errors.  
  - `False`: do not print a report.


In [None]:
def collect_jobs_data(
    quarry,
    location="all",
    domain="default",
    number_of_queries=1,
    api_key="",
    data_frame=None,
    save_path=".",
    number_of_errors=2,
    report=True,
):
    """
    Collect job-listing data from Google Jobs via SerpApi and save each raw
    JSON response to disk.

    Parameters
    ----------
    quarry : str
        Search string for the vacancy (e.g., "Android developer").
    location : str, default "all"
        - "all": scrape every country found in `data_frame`.
        - A specific country name (e.g., "Austria"): scrape only that country.
    domain : str, default "default"
        - "default": always use `google.com`.
        - "local"  : use the country-specific Google domain contained in
          `data_frame["google_domain"]` (e.g., "google.at" for Austria).
    number_of_queries : int | str, default 1
        - "all": keep paginating until no more results are returned.
        - int  : maximum number of pages to fetch per country.
    api_key : str
        Your SerpApi API key.
    data_frame : pandas.DataFrame
        Must include at least two columns:
            * 'location'       – country name (e.g., "Austria")
            * 'google_domain'  – local Google domain (e.g., "google.at")
    save_path : str, default "."
        Directory in which JSON files are written.
    number_of_errors : int, default 2
        Stop requesting a country after this many consecutive empty pages.
    report : bool, default True
        If True, print a per-country summary at the end.

    Returns
    -------
    list[dict]
        One dictionary per country, e.g.,
        {"country": "Austria", "queries": 4, "errors": 1}.

    How it works
    ------------
    For each target country the function repeatedly calls the
    `google_jobs` SerpApi engine, paginating with `next_page_token`
    until one of three conditions is met:

    1. `number_of_queries` pages have been fetched.
    2. `number_of_errors` empty pages have occurred in a row.
    3. SerpApi no longer returns a `next_page_token`.

    Each response is saved as
        `<save_path>/<country>_<page_index>.json`
    so that raw data can be inspected or re-processed later. A simple
    safeguard pauses execution for one hour after every 1,000 requests
    made for a single country, to stay within API limits.

    The function logs queries and errors in memory; if `report=True`, it
    prints that log in the familiar “--- Report ---” format before
    returning it.
    """
    # Argument validation
    if data_frame is None:
        raise ValueError("data_frame must be provided.")

    # Prepare list of countries
    if location == "all":
        countries = data_frame["location"].tolist()
    else:
        countries = [location]

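    # Make sure the output directory exists before any request is made
    os.makedirs(save_path, exist_ok=True)

    # Logging container
    report_data = []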

    for country in countries:
        error_count = 0
        query_count = 0
        next_page_token = None  # For pagination

        # Determine domain
        google_domain = (
            "google.com"
            if domain == "default"
            else data_frame.loc[data_frame["location"] == country, "google_domain"].values[0]
        )

        while error_count < number_of_errors:
            if number_of_queries != "all" and query_count >= number_of_queries:
                break

            # Build search parameters
            search_params = {
                "q": quarry,
                "engine": "google_jobs",
                "location": country,
                "google_domain": google_domain,
                "api_key": api_key,
            }
            if next_page_token:
                search_params["next_page_token"] = next_page_token

            search = GoogleSearch(search_params)
            result = search.get_dict()

            # Save data
            file_name = f"{country}_{query_count}.json"
            file_path = os.path.join(save_path, file_name)
            with open(file_path, "w", encoding="utf-8") as file:
                json.dump(result, file, ensure_ascii=False, indent=4)

            # Check result
            if "jobs_results" not in result or not result["jobs_results"]:
                error_count += 1
            else:
                error_count = 0  # Reset on successful request

            query_count += 1

            # Get token for next page
            next_page_token = result.get("serpapi_pagination", {}).get("next_page_token")
            if not next_page_token:  # If no token, stop
                break

            # Rate-limit safeguard
            if query_count % 1000 == 0:
                print("Reached request limit. Pausing for 1 hour...")
                time.sleep(3600)

        report_data.append(
            {"country": country, "queries": query_count, "errors": error_count}
        )

    # Generate report
    if report:
        print("\n--- Report ---")
        for entry in report_data:
            print(f"{entry['country']}: {entry['queries']} queries, {entry['errors']} errors")

    return report_data
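
The stop conditions of the loop above can be checked without spending API credits. The following is a minimal sketch, not part of the original notebook: `paginate` and `make_fake_fetcher` are hypothetical names, and the fetcher is a stub that serves canned responses shaped like the fields the function reads (`jobs_results` and `serpapi_pagination.next_page_token`).

```python
from typing import Callable, Optional, Union


def paginate(
    fetch_page: Callable[[Optional[str]], dict],
    max_pages: Union[int, str] = "all",
    max_empty: int = 2,
) -> dict:
    """Replay the loop's stop conditions against any page fetcher."""
    pages = 0
    empty_streak = 0
    token = None
    while empty_streak < max_empty:
        # Condition 1: the page budget is used up
        if max_pages != "all" and pages >= max_pages:
            break
        result = fetch_page(token)
        pages += 1
        # Condition 2: count consecutive empty pages, reset on success
        if result.get("jobs_results"):
            empty_streak = 0
        else:
            empty_streak += 1
        # Condition 3: no token means there is no next page
        token = result.get("serpapi_pagination", {}).get("next_page_token")
        if not token:
            break
    return {"queries": pages, "errors": empty_streak}


def make_fake_fetcher(pages):
    """Return a stub fetcher that serves canned pages in order (illustration only)."""
    def fetch(token):
        index = 0 if token is None else int(token)
        body = {"jobs_results": pages[index]}
        if index + 1 < len(pages):
            body["serpapi_pagination"] = {"next_page_token": str(index + 1)}
        return body
    return fetch


# Two pages with jobs followed by an empty last page
three_pages = make_fake_fetcher([[{"title": "a"}], [{"title": "b"}], []])
summary_all = paginate(three_pages)
summary_capped = paginate(three_pages, max_pages=2)
print(summary_all)
print(summary_capped)
```

With this stub, an empty final page is counted as one error even though the earlier pages succeeded, which is one way a report line can show several queries and a single error.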


In [None]:
collect_jobs_data(
    quarry="Android developer",       # Job search query
    location="all",                   # Collect data for all countries
    domain="local",                   # Use local domains
    number_of_queries="all",          # Paginate until no more results are returned
    api_key=api_key,                  # Specify your actual API key
    data_frame=df,                    # DataFrame containing country data
    save_path=r"../data/jobs_data/data/local_domain/Android",  # Folder to save JSON files
    number_of_errors=2,               # Stop after 2 empty results
    report=True                       # Display a report
)



--- Report ---
Austria: 4 queries, 1 errors
Belgium: 3 queries, 0 errors
Bulgaria: 3 queries, 0 errors
Canada: 13 queries, 0 errors
Croatia: 2 queries, 0 errors
Cyprus: 3 queries, 0 errors
Czechia: 3 queries, 0 errors
Denmark: 2 queries, 0 errors
Estonia: 2 queries, 0 errors
Finland: 3 queries, 0 errors
France: 6 queries, 0 errors
Germany: 9 queries, 1 errors
Greece: 3 queries, 0 errors
Hungary: 3 queries, 0 errors
Iceland: 1 queries, 0 errors
Ireland: 4 queries, 0 errors
Italy: 6 queries, 0 errors
Latvia: 1 queries, 0 errors
Liechtenstein: 1 queries, 0 errors
Lithuania: 3 queries, 0 errors
Luxembourg: 1 queries, 0 errors
Malta: 1 queries, 0 errors
Mexico: 9 queries, 0 errors
Netherlands: 6 queries, 0 errors
Norway: 2 queries, 0 errors
Poland: 6 queries, 0 errors
Portugal: 5 queries, 0 errors
Romania: 5 queries, 0 errors
Slovakia: 2 queries, 0 errors
Slovenia: 1 queries, 0 errors
Spain: 7 queries, 1 errors
Sweden: 4 queries, 0 errors
Switzerland: 3 queries, 0 errors
United Kingdom: 17

[{'country': 'Austria', 'queries': 4, 'errors': 1},
 {'country': 'Belgium', 'queries': 3, 'errors': 0},
 {'country': 'Bulgaria', 'queries': 3, 'errors': 0},
 {'country': 'Canada', 'queries': 13, 'errors': 0},
 {'country': 'Croatia', 'queries': 2, 'errors': 0},
 {'country': 'Cyprus', 'queries': 3, 'errors': 0},
 {'country': 'Czechia', 'queries': 3, 'errors': 0},
 {'country': 'Denmark', 'queries': 2, 'errors': 0},
 {'country': 'Estonia', 'queries': 2, 'errors': 0},
 {'country': 'Finland', 'queries': 3, 'errors': 0},
 {'country': 'France', 'queries': 6, 'errors': 0},
 {'country': 'Germany', 'queries': 9, 'errors': 1},
 {'country': 'Greece', 'queries': 3, 'errors': 0},
 {'country': 'Hungary', 'queries': 3, 'errors': 0},
 {'country': 'Iceland', 'queries': 1, 'errors': 0},
 {'country': 'Ireland', 'queries': 4, 'errors': 0},
 {'country': 'Italy', 'queries': 6, 'errors': 0},
 {'country': 'Latvia', 'queries': 1, 'errors': 0},
 {'country': 'Liechtenstein', 'queries': 1, 'errors': 0},
 {'country'

## Make markdown file with the results

In [None]:
# Path to the main folder
base_path = r"../data/jobs_data/test_data/local_domain"
android_path = os.path.join(base_path, "Android")
ios_path = os.path.join(base_path, "iOS")
markdown_path = os.path.join(base_path, "results.md")

# Check if the file is "successful" or "empty"
def is_valid_file(file_path):
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            data = json.load(file)
            if data.get("search_information", {}).get("jobs_results_state") == "Fully empty" and \
               data.get("error") == "Google hasn't returned any results for this query.":
                return False
            return True
    except Exception as e:
        print(f"Error processing {file_path}: {e}")
        return False

# Scan folders and collect data
results = {}

for folder, platform in [(ios_path, "iOS"), (android_path, "Android")]:
    for file_name in os.listdir(folder):
        if file_name.endswith(".json"):
            file_path = os.path.join(folder, file_name)
            country = file_name.replace(".json", "").strip()
            is_valid = is_valid_file(file_path)
            
            if country not in results:
                results[country] = {"iOS": "-", "Android": "-"}
            
            results[country][platform] = "+" if is_valid else "-"

# Create Markdown file
with open(markdown_path, "w", encoding="utf-8") as md_file:
    md_file.write("[] - iOS, [] - Android\n")
    md_file.write("-----------------------\n")
    for country, platforms in sorted(results.items()):
        ios_status = platforms["iOS"]
        android_status = platforms["Android"]
        md_file.write(f"- [{ios_status}],[{android_status}] {country}\n")

print(f"Markdown file created: {markdown_path}")

Both domains are used to maximize the chance of collecting data for each country and to increase the chance of finding unique vacancies that are available only on a particular domain.

Going through all countries and collecting one vacancy each for "iOS" and "Android" developers gave the following results:

`.com`:
- All countries (100%) returned data for both iOS and Android.

`Local domain`:<br>
Most countries also returned results for both iOS and Android vacancies.<br>
Exceptions:
- Iceland: no iOS vacancies.
- Romania: no Android vacancies.
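
Whether querying both domains actually surfaces extra vacancies can be checked once the results are in tabular form. The sketch below is illustrative only: the rows are invented, the column names `Job ID` and `Google Domain Type` follow the CSV produced later in this notebook, and it assumes that a job ID stays the same across domains, which has not been verified here.

```python
import pandas as pd

# Toy rows standing in for the real data; all values are invented.
jobs = pd.DataFrame(
    {
        "Job ID": ["a1", "a1", "b2", "c3"],
        "Google Domain Type": ["default", "local", "default", "local"],
        "Location": ["Austria", "Austria", "Austria", "Austria"],
    }
)

# For every job, collect the set of domain types it was seen under.
domain_sets = jobs.groupby("Job ID")["Google Domain Type"].apply(set)

# Split job IDs by where they appeared.
only_default = sorted(j for j, seen in domain_sets.items() if seen == {"default"})
only_local = sorted(j for j, seen in domain_sets.items() if seen == {"local"})
both = sorted(j for j, seen in domain_sets.items() if seen == {"default", "local"})

print("only on google.com:", only_default)
print("only on local domain:", only_local)
print("on both:", both)
```

If the two "only" lists come back empty on real data, the second domain adds no new vacancies for that query and could be dropped to save requests.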


## Data Collection

Logs:

"Android developer" / ".com":

--- Report ---
Austria: 4 queries, 0 errors
Belgium: 3 queries, 0 errors
Bulgaria: 3 queries, 0 errors
Canada: 13 queries, 0 errors
Croatia: 2 queries, 0 errors
Cyprus: 3 queries, 0 errors
Czechia: 3 queries, 0 errors
Denmark: 2 queries, 0 errors
Estonia: 2 queries, 0 errors
Finland: 3 queries, 0 errors
France: 6 queries, 0 errors
Germany: 9 queries, 0 errors
Greece: 3 queries, 0 errors
Hungary: 3 queries, 0 errors
Iceland: 1 queries, 0 errors
Ireland: 2 queries, 1 errors
Italy: 6 queries, 0 errors
Latvia: 1 queries, 0 errors
Liechtenstein: 1 queries, 0 errors
Lithuania: 3 queries, 0 errors
Luxembourg: 1 queries, 0 errors
Malta: 1 queries, 0 errors
Mexico: 9 queries, 0 errors
Netherlands: 6 queries, 0 errors
Norway: 2 queries, 0 errors
Poland: 6 queries, 1 errors
Portugal: 5 queries, 0 errors
Romania: 5 queries, 0 errors
Slovakia: 2 queries, 0 errors
Slovenia: 1 queries, 0 errors
Spain: 7 queries, 0 errors
Sweden: 4 queries, 0 errors
Switzerland: 3 queries, 0 errors
United Kingdom: 17 queries, 0 errors
United States: 18 queries, 0 errors


"iOS developer" / ".com":

--- Report ---
Austria: 6 queries, 0 errors
Belgium: 3 queries, 0 errors
Bulgaria: 3 queries, 0 errors
Canada: 11 queries, 0 errors
Croatia: 3 queries, 0 errors
Cyprus: 3 queries, 0 errors
Czechia: 6 queries, 0 errors
Denmark: 2 queries, 0 errors
Estonia: 3 queries, 0 errors
Finland: 2 queries, 0 errors
France: 4 queries, 0 errors
Germany: 9 queries, 0 errors
Greece: 3 queries, 0 errors
Hungary: 4 queries, 0 errors
Iceland: 1 queries, 0 errors
Ireland: 3 queries, 0 errors
Italy: 7 queries, 0 errors
Latvia: 1 queries, 0 errors
Liechtenstein: 1 queries, 0 errors
Lithuania: 3 queries, 0 errors
Luxembourg: 2 queries, 0 errors
Malta: 1 queries, 0 errors
Mexico: 9 queries, 0 errors
Netherlands: 5 queries, 0 errors
Norway: 2 queries, 0 errors
Poland: 5 queries, 0 errors
Portugal: 8 queries, 0 errors
Romania: 6 queries, 0 errors
Slovakia: 2 queries, 0 errors
Slovenia: 1 queries, 0 errors
Spain: 6 queries, 0 errors
Sweden: 5 queries, 0 errors
Switzerland: 5 queries, 0 errors
United Kingdom: 17 queries, 0 errors
United States: 19 queries, 0 errors



"Android developer" / "local":

--- Report ---
Austria: 4 queries, 1 errors
Belgium: 3 queries, 0 errors
Bulgaria: 3 queries, 0 errors
Canada: 13 queries, 0 errors
Croatia: 2 queries, 0 errors
Cyprus: 3 queries, 0 errors
Czechia: 3 queries, 0 errors
Denmark: 2 queries, 0 errors
Estonia: 2 queries, 0 errors
Finland: 3 queries, 0 errors
France: 6 queries, 0 errors
Germany: 9 queries, 1 errors
Greece: 3 queries, 0 errors
Hungary: 3 queries, 0 errors
Iceland: 1 queries, 0 errors
Ireland: 4 queries, 0 errors
Italy: 6 queries, 0 errors
Latvia: 1 queries, 0 errors
Liechtenstein: 1 queries, 0 errors
Lithuania: 3 queries, 0 errors
Luxembourg: 1 queries, 0 errors
Malta: 1 queries, 0 errors
Mexico: 9 queries, 0 errors
Netherlands: 6 queries, 0 errors
Norway: 2 queries, 0 errors
Poland: 6 queries, 0 errors
Portugal: 5 queries, 0 errors
Romania: 5 queries, 0 errors
Slovakia: 2 queries, 0 errors
Slovenia: 1 queries, 0 errors
Spain: 7 queries, 1 errors
Sweden: 4 queries, 0 errors
Switzerland: 3 queries, 0 errors
United Kingdom: 17 queries, 0 errors
United States: 18 queries, 0 errors



"iOS developer" / "local":

--- Report ---
Austria: 6 queries, 0 errors
Belgium: 3 queries, 0 errors
Bulgaria: 3 queries, 0 errors
Canada: 11 queries, 0 errors
Croatia: 3 queries, 0 errors
Cyprus: 3 queries, 0 errors
Czechia: 6 queries, 0 errors
Denmark: 2 queries, 0 errors
Estonia: 3 queries, 0 errors
Finland: 2 queries, 0 errors
France: 4 queries, 0 errors
Germany: 9 queries, 0 errors
Greece: 3 queries, 0 errors
Hungary: 4 queries, 0 errors
Iceland: 1 queries, 0 errors
Ireland: 4 queries, 0 errors
Italy: 7 queries, 0 errors
Latvia: 1 queries, 0 errors
Liechtenstein: 1 queries, 0 errors
Lithuania: 3 queries, 0 errors
Luxembourg: 2 queries, 0 errors
Malta: 1 queries, 0 errors
Mexico: 9 queries, 0 errors
Netherlands: 5 queries, 0 errors
Norway: 2 queries, 0 errors
Poland: 5 queries, 0 errors
Portugal: 8 queries, 0 errors
Romania: 6 queries, 0 errors
Slovakia: 2 queries, 0 errors
Slovenia: 1 queries, 0 errors
Spain: 6 queries, 0 errors
Sweden: 5 queries, 0 errors
Switzerland: 5 queries, 0 errors
United Kingdom: 17 queries, 0 errors
United States: 19 queries, 0 errors

In [None]:
def clean_text(text):
    """Replace unusual line separators with spaces."""
    if not isinstance(text, str):
        return text
    return re.sub(r"[\u2028\u2029]", " ", text)  # Replace Line Separator (U+2028) and Paragraph Separator (U+2029) with a space


def process_json_to_csv(data_dirs, region_df, output_file=r"../data/csv/jobs_data.csv"):
    final_data = []
    report = {
        "total_files": 0,
        "processed_files": 0,
        "empty_files": 0,
        "corrupted_files": 0
    }

    for data_dir in data_dirs:
        for root, _, files in os.walk(data_dir):
            for file in files:
                if file.endswith(".json"):
                    report["total_files"] += 1
                    file_path = os.path.join(root, file)

                    try:
                        with open(file_path, "r", encoding="utf-8") as f:
                            data = json.load(f)

                        if not data.get("jobs_results"):
                            report["empty_files"] += 1
                            continue

                        # Look up the country's region metadata once per file
                        location_used = data.get("search_parameters", {}).get("location_used")
                        region_row = region_df.loc[region_df["location"] == location_used].iloc[0]

                        for job in data["jobs_results"]:
                            row = {
                                "Location": clean_text(location_used),
                                "Region": region_row["Region"],
                                "EU Member": region_row["EU member"],
                                "Schengen Agreement": region_row["Schengen Agreement"],
                                "Google Domain Type": "local" if "local_domain" in file_path else "default",
                                "Google Domain Used": clean_text(data.get("search_parameters", {}).get("google_domain")),
                                "Job Title": clean_text(job.get("title")),
                                "Company Name": clean_text(job.get("company_name")),
                                "Job Location": clean_text(job.get("location")),
                                "Apply Options": clean_text(", ".join([opt.get("title", "") for opt in job.get("apply_options", [])])),
                                "Job Description": clean_text(job.get("description")),
                                "Work from home": clean_text(job.get("detected_extensions", {}).get("work_from_home")),
                                "Salary": clean_text(job.get("detected_extensions", {}).get("salary")),
                                "Schedule type": clean_text(job.get("detected_extensions", {}).get("schedule_type")),
                                "Qualifications": clean_text(job.get("detected_extensions", {}).get("qualifications")),
                                "Job ID": clean_text(job.get("job_id")),
                                "Search Date": clean_text(data.get("search_metadata", {}).get("created_at")),
                                "Search Query": clean_text(data.get("search_parameters", {}).get("q"))
                            }
                            final_data.append(row)

                        report["processed_files"] += 1

                    except (json.JSONDecodeError, KeyError, IndexError) as e:
                        print(f"Error while processing file {file_path}: {e}")
                        report["corrupted_files"] += 1
                        continue

    # Create DataFrame and save to CSV
    df = pd.DataFrame(final_data)
    df.to_csv(output_file, index=False, encoding="utf-8")

    # Processing summary
    report_text = (
        f"Processing completed:\n"
        f"- Total files: {report['total_files']}\n"
        f"- Successfully processed: {report['processed_files']}\n"
        f"- Empty files: {report['empty_files']}\n"
        f"- Corrupted files: {report['corrupted_files']}\n"
    )
    print(report_text)
    return report


In [None]:
# Paths to the data
data_dirs = [
    r"../data/jobs_data/data/dotcom_domain/Android",
    r"../data/jobs_data/data/dotcom_domain/iOS",
    r"../data/jobs_data/data/local_domain/Android",
    r"../data/jobs_data/data/local_domain/iOS"
]

# Run the function
process_json_to_csv(data_dirs, df)


Processing completed:
- Total files: 665
- Successfully processed: 660
- Empty files: 5
- Corrupted files: 0

