The **Aegean Fake Job Postings Prediction** dataset contains legitimate job posts alongside fraudulent ones. The goal here is to randomly extract 30 legitimate posts for NLP analysis and compare them with the refined (AI-generated) fake posts, to measure how similar human-written real posts are to AI-generated fake posts, which are designed to imitate real ones.

## Disclaimer

We cannot be fully certain that the job posts in this dataset were written by humans. The dataset was collected between 2012 and 2014, but it does not state how the posts were authored. For research purposes, we treat them as human-written.

In [None]:
import pandas as pd

data = pd.read_csv("../1_datasets/raw_fake_jobs/fake_job_postings.csv")

print("Shape of the data", data.shape)

# extracting all real posts
real_jobs = data[data["fraudulent"] == 0].copy()

# checking the number of real posts
print(f"Extracted {real_jobs.shape[0]} real jobs.")

# checking the percentage of real jobs; normalize=True returns
# proportions instead of absolute counts
real_share = data["fraudulent"].value_counts(normalize=True)[0] * 100
print(f"Percentage of real jobs: {real_share:.2f}%")

In [None]:
# check for null values
print("Missing values in the dataset:")
print(real_jobs.isnull().sum())
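
Absolute counts of missing values can be hard to judge without the total number of rows. As a minimal sketch, the share of missing values per column can be computed with `isnull().mean()`. The `toy` frame and its values below are hypothetical stand-ins for `real_jobs`, not data from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for real_jobs; the values are made up.
toy = pd.DataFrame(
    {
        "description": ["a", "b", "c", "d"],
        "salary_range": [None, "10-20", None, None],
        "benefits": ["x", None, "y", "z"],
    }
)

# isnull().mean() gives the fraction of missing values in each column;
# multiplying by 100 turns it into a percentage.
missing_share = (toy.isnull().mean() * 100).round(1)

print(missing_share.sort_values(ascending=False))
```

Columns with a high share of missing values are the ones most likely to shrink the dataset if rows with NaN values are dropped later.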

In [None]:
# excluding columns that are not necessary for NLP analysis
dropped_columns = [
    "job_id",
    "title",
    "location",
    "department",
    "has_company_logo",
    "industry",
    "employment_type",
    "fraudulent",
    "telecommuting",
    "has_questions",
    "required_experience",
    "required_education",
    "function",
]

real_jobs.drop(columns=dropped_columns, inplace=True, errors="ignore")

print(f"Shape of the dataset after dropping columns: {real_jobs.shape}")

In [None]:
# dropping rows that have NaN values
real_jobs.dropna(
    subset=[
        "benefits",
        "requirements",
        "company_profile",
        "salary_range",
    ],
    inplace=True,
)

# data shape after cleaning NaN values
print(
    "Shape of the data after excluding rows with NaN values: "
    f"{real_jobs.shape}"
)
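
To illustrate what the `subset` argument does, here is a small sketch on a hypothetical toy frame (the column values are invented for illustration). Only the listed columns are checked for NaN values, so a row with a missing value in an unlisted column is kept.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame(
    {
        "description": ["a", None, "c"],
        "benefits": ["x", "y", None],
        "salary_range": ["10-20", "30-40", "50-60"],
    }
)

# Only "benefits" and "salary_range" are checked. Row 1 has a missing
# "description" but is kept; row 2 has a missing "benefits" and is dropped.
kept = toy.dropna(subset=["benefits", "salary_range"])

print(kept.shape)
```

Every column added to `subset` can only reduce the number of remaining rows, so it is worth checking the resulting shape before sampling from it.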

In [42]:
# randomly select 30 jobs; sample() returns a new DataFrame,
# so the result has to be assigned
sampled_jobs = real_jobs.sample(n=30, random_state=42)

# saving the sampled jobs
file_path = "../1_datasets/cleaned_real_jobs/aegean_cleaned_raw_jobs.csv"

sampled_jobs.to_csv(file_path, index=False)
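
One detail worth keeping in mind: `DataFrame.sample` returns a new DataFrame and does not modify the frame it is called on, so the sampled rows only persist if the result is assigned to a variable. The sketch below shows this on a hypothetical toy frame. The names `toy` and `subset` are illustrative only.

```python
import pandas as pd

# Hypothetical toy frame with 10 made-up posts.
toy = pd.DataFrame({"description": [f"post {i}" for i in range(10)]})

# sample() returns a new DataFrame and leaves `toy` unchanged.
subset = toy.sample(n=3, random_state=42)

# The original still has all 10 rows; only `subset` holds the 3 sampled rows.
print(len(toy), len(subset))

# With the same random_state, the same rows are selected again.
print(subset.equals(toy.sample(n=3, random_state=42)))
```

Fixing `random_state` makes the selection reproducible, which helps when the sampled posts need to be compared against other data across notebook runs.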