This notebook works with the **Aegean Fake Job Postings Prediction** dataset, which contains both legitimate and fraudulent job posts. The goal is to extract the real (human-written) job posts for NLP analysis and to compare them with the human-written fake job posts and the LLM-refined fake job posts, in order to measure how similar or different they are.

## Disclaimer

We cannot be fully certain that the job posts in this dataset were written by humans. The posts were collected between 2012 and 2014, but the dataset does not state how they were authored. For research purposes, we treat them as human-written.

In [32]:
import pandas as pd

# Display settings for better viewing
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 1000)
pd.set_option("display.max_colwidth", 200)

data = pd.read_csv("../1_datasets/aegean_raw_data/all_job_postings.csv")

print("Shape of the data", data.shape, "\n")

print("First 5 rows:", data.head())

Shape of the data (17880, 18) 

First 5 rows:    job_id  title  location  department  salary_range  company_profile  description  requirements  benefits  telecommuting  has_company_logo  has_questions  emplo

In [33]:
# excluding columns that are not necessary for NLP analysis
dropped_columns = [
    "job_id",
    "telecommuting",
    "has_company_logo",
    "has_questions",
    "employment_type",
    "required_experience",
    "required_education",
]

data.drop(columns=dropped_columns, inplace=True, errors="ignore")

print(f"Shape of the dataset after dropping columns: {data.shape}")

Shape of the dataset after dropping columns: (17880, 11)
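
Because the drop call uses `errors="ignore"`, a misspelled column name is skipped silently instead of raising an error. The following is a minimal sketch of how one might check for that, using a toy DataFrame and a deliberately misspelled name; `toy`, `to_drop`, and `not_found` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy frame standing in for the raw data (hypothetical values).
toy = pd.DataFrame({"job_id": [1, 2], "title": ["a", "b"], "fraudulent": [0, 1]})

# "telecomuting" is misspelled on purpose to show what errors="ignore" hides.
to_drop = ["job_id", "telecomuting"]

# List the requested columns that do not exist in the frame.
not_found = [col for col in to_drop if col not in toy.columns]
if not_found:
    print(f"Requested columns not present: {not_found}")

# The drop itself still succeeds and removes only the columns that exist.
trimmed = toy.drop(columns=to_drop, errors="ignore")
print(trimmed.columns.tolist())
```

Here the shape check in the cell above already serves as a sanity test, since 18 original columns minus the 7 listed gives the 11 reported.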


In [36]:
# extracting all real and fake jobs
real_jobs = data[data["fraudulent"] == 0].copy()
fake_jobs = data[data["fraudulent"] == 1].copy()

# checking the number of real posts
print(f"Extracted {real_jobs.shape[0]} real jobs.")

# checking the number of fake posts
print(f"Extracted {fake_jobs.shape[0]} fake jobs.\n")

# checking the percentage of real and fake jobs;
# normalize=True returns proportions instead of absolute counts
class_share = data["fraudulent"].value_counts(normalize=True) * 100

print(f"Percentage of real jobs: {class_share[0]:.2f}%")
print(f"Percentage of fake jobs: {class_share[1]:.2f}%")

Extracted 17014 real jobs.
Extracted 866 fake jobs.

Percentage of real jobs: 95.16%
Percentage of fake jobs: 4.84%


In [38]:
# check for null values
print("Missing values in the dataset of real jobs:")
print(real_jobs.isnull().sum())

print("----")

print("Missing values in the dataset of fake jobs:")
print(fake_jobs.isnull().sum())

Missing values in the dataset of real jobs:
title                  0
location             327
department         11022
salary_range       14369
company_profile     2721
description            0
requirements        2542
benefits            6848
industry            4628
function            6118
fraudulent             0
dtype: int64
----
Missing values in the dataset of fake jobs:
title                0
location            19
department         531
salary_range       643
company_profile    587
description          1
requirements       154
benefits           364
industry           275
function           337
fraudulent           0
dtype: int64
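
Raw counts are hard to compare across two subsets of very different size. In the real-jobs output above, for example, `salary_range` is missing in 14369 of 17014 rows (about 84.5%) and `department` in 11022 (about 64.8%). A minimal sketch of reporting missing values as percentages, shown on a toy DataFrame with made-up values:

```python
import pandas as pd

# Toy frame with hypothetical values; None marks a missing entry.
toy = pd.DataFrame(
    {
        "title": ["a", "b", "c", "d"],
        "department": [None, None, None, "ops"],
        "benefits": ["x", None, "y", "z"],
    }
)

# isnull().mean() gives the fraction of missing rows per column;
# multiplying by 100 turns it into a percentage.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)
print(missing_pct.round(1))
```

The same expression could be applied to `real_jobs` and `fake_jobs` to see which columns account for most of the missing data.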


In [39]:
# dropping rows that have a NaN value in any column
real_jobs.dropna(inplace=True)
fake_jobs.dropna(inplace=True)

# data shape after cleaning NaN values
print(
    "Shape of the real jobs dataset after excluding rows with NaN values: "
    f"{real_jobs.shape}"
)

print(
    "Shape of the fake jobs dataset after excluding rows with NaN values: "
    f"{fake_jobs.shape}"
)

Shape of the real jobs dataset after excluding rows with NaN values: (808, 11)
Shape of the fake jobs dataset after excluding rows with NaN values: (74, 11)
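
Dropping every row with any missing value keeps only 808 of the 17014 real posts and 74 of the 866 fake posts, largely because `department` and `salary_range` are empty in most rows. If the analysis only needs the text fields, one alternative is to restrict `dropna` to those columns with its `subset` parameter. This is a sketch of that option, not what the notebook does; the toy data and the choice of `text_columns` are assumptions.

```python
import pandas as pd

# Toy frame with hypothetical values; None marks a missing entry.
toy = pd.DataFrame(
    {
        "title": ["Engineer", "Analyst", "Clerk"],
        "description": ["Build things", "Analyse data", None],
        "requirements": ["Python", "SQL", "Filing"],
        "salary_range": [None, "40000-50000", None],
    }
)

# Assumed set of columns that matter for the text analysis.
text_columns = ["title", "description", "requirements"]

# Strict: a row is removed if any column is missing.
strict = toy.dropna()

# Text-only: a row is removed only if one of the text columns is missing.
text_only = toy.dropna(subset=text_columns)

print(len(strict), len(text_only))
```

The trade-off is that the text-only variant keeps more rows but leaves NaN values in the other columns, which any later step would have to tolerate.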


In [40]:
# saving the cleaned real and fake job posts to separate CSV files
real_jobs_file_path = "../1_datasets/cleaned_data/real_jobs.csv"
fake_jobs_file_path = "../1_datasets/cleaned_data/fake_jobs.csv"

real_jobs.to_csv(real_jobs_file_path, index=False)
fake_jobs.to_csv(fake_jobs_file_path, index=False)
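
`DataFrame.to_csv` does not create missing parent folders, so the save step fails if the output directory does not exist yet. Below is a minimal sketch of creating the folder first and reading the file back as a quick check. A temporary directory stands in for `../1_datasets/cleaned_data`, and `toy`, `base_dir`, and `out_path` are illustrative names.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame with hypothetical values standing in for the cleaned data.
toy = pd.DataFrame({"title": ["Engineer"], "fraudulent": [0]})

# A temporary directory is used here instead of the notebook's real path.
base_dir = Path(tempfile.mkdtemp())
out_path = base_dir / "cleaned_data" / "real_jobs.csv"

# Create the parent folder (and any missing ancestors) before writing.
out_path.parent.mkdir(parents=True, exist_ok=True)
toy.to_csv(out_path, index=False)

# Read the file back to confirm the rows and columns survived the round trip.
reloaded = pd.read_csv(out_path)
print(reloaded.shape == toy.shape)
```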