The **Aegean Fake Job Postings Prediction** dataset contains both legitimate and fraudulent job posts. The goal here is to extract the real (human-written) job posts for NLP analysis and to compare them with the human-written fake job posts and the LLM-refined fake job posts, in order to measure how similar or different they are.

## Disclaimer

We cannot be fully certain that the job posts in this dataset were human-written. The data was collected between 2012 and 2014, but the dataset does not state explicitly how the posts were authored. For research purposes, we treat them as human-written.

In [11]:
import pandas as pd

data = pd.read_csv("../1_datasets/aegean_raw_data/all_job_postings.csv")

print("Shape of the data", data.shape)

# extracting all real posts
real_jobs = data[data["fraudulent"] == 0].copy()

# checking the number of real posts
print(f"Extracted {real_jobs.shape[0]} real jobs.")

# checking the percentage of real jobs; normalize=True returns
# proportions instead of absolute counts
real_pct = data["fraudulent"].value_counts(normalize=True).loc[0] * 100
print(f"Percentage of real jobs: {real_pct:.2f}%")

Shape of the data (17880, 18)
Extracted 17014 real jobs.
Percentage of real jobs: 95.16%
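
The filtering step can also be checked on a small scale. The sketch below is illustrative only: `toy_data` is a made-up frame, not the Aegean data, and it assumes the label column uses `0` for real posts as in the cell above. It shows that the mean of the boolean mask gives the same share that `value_counts(normalize=True)` reports.

```python
import pandas as pd

# Hypothetical stand-in for the raw data: four real posts and one fake post.
toy_data = pd.DataFrame(
    {"job_id": [1, 2, 3, 4, 5], "fraudulent": [0, 0, 1, 0, 0]}
)

# Boolean mask selecting the real posts (label 0).
real_mask = toy_data["fraudulent"] == 0
toy_real = toy_data[real_mask].copy()

# The mean of a boolean mask is the share of True values.
real_share = real_mask.mean() * 100

print(f"Extracted {len(toy_real)} real jobs ({real_share:.2f}% of all posts).")
```

Counting rows of the filtered frame, rather than of the raw frame, is what keeps the reported number of real jobs consistent with the percentage.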


In [7]:
# check for null values
print("Missing values in the dataset:")
print(real_jobs.isnull().sum())

Missing values in the dataset:
job_id                     0
title                      0
location                 327
department             11022
salary_range           14369
company_profile         2721
description                0
requirements            2542
benefits                6848
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3230
required_experience     6615
required_education      7654
industry                4628
function                6118
fraudulent                 0
dtype: int64
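
Raw counts are easier to compare once they are expressed as a share of the rows. The following is a minimal sketch on made-up data (`toy_jobs` is hypothetical and only mimics a few of the column names); the same expression could be applied to `real_jobs`.

```python
import pandas as pd

# Hypothetical stand-in for real_jobs; the values are invented for illustration.
toy_jobs = pd.DataFrame(
    {
        "title": ["Engineer", "Analyst", "Designer", "Writer"],
        "department": [None, "Finance", None, None],
        "salary_range": [None, None, "40000-50000", None],
    }
)

# isnull().mean() gives the fraction of missing entries per column.
missing_pct = (toy_jobs.isnull().mean() * 100).round(2)
missing_pct = missing_pct.sort_values(ascending=False)

print(missing_pct)
```

Sorting in descending order puts the sparsest columns first, which makes it easier to see which ones will cost the most rows in a later `dropna` step.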


In [8]:
# excluding columns that are not necessary for NLP analysis
dropped_columns = [
    "has_company_logo",
    "employment_type",
    "fraudulent",
    "telecommuting",
    "has_questions",
    "required_education",
]

real_jobs.drop(columns=dropped_columns, inplace=True, errors="ignore")

print(f"Shape of the dataset after dropping columns: {real_jobs.shape}")

Shape of the dataset after dropping columns: (17014, 12)


In [9]:
# dropping rows that have NaN values
real_jobs.dropna(
    subset=[
        "benefits",
        "location",
        "department",
        "requirements",
        "company_profile",
        "salary_range",
        "required_experience",
        "industry",
        "function",
    ],
    inplace=True,
)

# data shape after cleaning NaN values
print(
    "Shape of the data after excluding rows with NaN values: "
    f"{real_jobs.shape}"
)

Shape of the data after excluding rows with NaN values: (755, 12)
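
This step keeps 755 of the 17014 real posts. According to the missing-value counts above, `salary_range` alone is empty in 14369 rows, so that one column already caps the result at 2645 rows. A per-column breakdown makes this kind of cost visible before dropping anything. The sketch below uses a made-up frame (`toy_jobs` and the `required` list are hypothetical) and is meant as a diagnostic, not as a change to the method used here.

```python
import pandas as pd

# Hypothetical stand-in for real_jobs; the values are invented for illustration.
toy_jobs = pd.DataFrame(
    {
        "description": ["a", "b", "c", "d", "e"],
        "benefits": ["x", None, "y", "z", None],
        "salary_range": [None, None, "10-20", None, "30-40"],
    }
)

required = ["benefits", "salary_range"]

# Rows that would survive if only one column at a time were required.
kept_per_column = toy_jobs[required].notna().sum()

# Rows that survive when every required column must be present,
# which is what dropna(subset=required) keeps.
kept_all = len(toy_jobs.dropna(subset=required))

print(kept_per_column)
print(f"Rows with all required columns present: {kept_all}")
```

The number of rows with every required column present can never exceed the smallest per-column count, so the sparsest column sets the upper bound on the cleaned dataset.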


In [10]:
# saving the file
file_path = "../1_datasets/cleaned_real_jobs/cleaned_real_jobs.csv"

real_jobs.to_csv(file_path, index=False)
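
`to_csv` does not create missing parent folders, so the save fails if the output directory does not exist yet. A minimal sketch of one way to guard against that is shown below. It writes a made-up frame (`toy_jobs`) into a temporary directory that stands in for `../1_datasets`, so it can run anywhere; the folder and file names simply mirror the path used above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned data.
toy_jobs = pd.DataFrame({"job_id": [1, 2], "title": ["Engineer", "Analyst"]})

# A temporary directory stands in for ../1_datasets in this sketch.
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "cleaned_real_jobs" / "cleaned_real_jobs.csv"

    # Create the parent folder first; exist_ok avoids an error on re-runs.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy_jobs.to_csv(out_path, index=False)

    # Read the file back to confirm the round trip preserved the shape.
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```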