**Fake Jobs Extraction**

There are 866 fake jobs in our raw fake_jobs dataset.
Our goal here is to inspect the dataset, drop features that are not needed,
and randomly extract 30 fake jobs for our research.

In [1]:
import pandas as pd

# import random  # For random sampling

# Display settings for better viewing
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 1000)
pd.set_option("display.max_colwidth", 200)

In [2]:
# load the dataset
print("Loading large job dataset...")
try:
    large_df = pd.read_csv("../1_datasets/aegean_raw_data/all_job_postings.csv")  # noqa: E501
    print("Dataset loaded successfully.")
    print(f"Initial shape: {large_df.shape}")
    print("\nFirst 5 rows:")
    print(large_df.head())
    print("\nColumn names and data types:")
    large_df.info()  # info() prints its own report and returns None
except FileNotFoundError:
    print(
        "Error: 'all_job_postings.csv' not found. Please ensure the "
        "file is in the correct directory."
    )

Loading large job dataset...
Dataset loaded successfully.
Initial shape: (17880, 18)

First 5 rows:
   job_id  title  location  department  salary_range  company_profile  description  requirements  benefits

In [3]:
# Inspect the 'fraudulent' column distribution
print("\nDistribution of 'fraudulent' column:")
print(large_df["fraudulent"].value_counts())
print(
    "Percentage of fake jobs: "
    f"{large_df['fraudulent'].value_counts(normalize=True)[1] * 100:.2f}%"
)


Distribution of 'fraudulent' column:
fraudulent
0    17014
1      866
Name: count, dtype: int64
Percentage of fake jobs: 4.84%


In [4]:
# Extract the fake jobs
fake_jobs_df = large_df[
    large_df["fraudulent"] == 1
].copy()  # .copy() to avoid SettingWithCopyWarning
print(f"\nExtracted {fake_jobs_df.shape[0]} fake job postings.")


Extracted 866 fake job postings.


In [6]:
# Check for missing values in the fake jobs DataFrame
print("\nMissing values in fake jobs DataFrame:")
print(fake_jobs_df.isnull().sum())


Missing values in fake jobs DataFrame:
job_id                   0
title                    0
location                19
department             531
salary_range           643
company_profile        587
description              1
requirements           154
benefits               364
telecommuting            0
has_company_logo         0
has_questions            0
employment_type        241
required_experience    435
required_education     451
industry               275
function               337
fraudulent               0
dtype: int64


Question for the team: what features should our job postings have?
Should we go with job title, company name, job description, and
salary range? Answering this question will tell us which features
to drop from this dataset. (Question answered by the team.)

***Note for the team*** (7/15/2025)

After our meeting yesterday, we agreed to drop rows and columns with missing values in the fake jobs. Looking at the missing values above, I am having second thoughts. For example, if we drop every row where company profile is missing, we lose a whopping 587 of 866 rows. We also cannot be sure that these rows coincide with the 643 rows where salary range is missing, so we could end up losing almost all of our dataset.
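
One way to check how far the missing values in two columns overlap is to compare their null masks row by row. The sketch below is illustrative only: `missing_overlap` is a hypothetical helper, and the toy DataFrame is made up rather than taken from the real dataset.

```python
import pandas as pd


def missing_overlap(df: pd.DataFrame, col_a: str, col_b: str) -> dict:
    """Count rows missing col_a, col_b, both, or either (hypothetical helper)."""
    miss_a = df[col_a].isnull()
    miss_b = df[col_b].isnull()
    return {
        f"missing_{col_a}": int(miss_a.sum()),
        f"missing_{col_b}": int(miss_b.sum()),
        "missing_both": int((miss_a & miss_b).sum()),
        "missing_either": int((miss_a | miss_b).sum()),
        # Rows that would survive if both columns had to be present
        "rows_left_if_both_required": int((~miss_a & ~miss_b).sum()),
    }


# Toy data for illustration only; these values are not from the real dataset.
toy_df = pd.DataFrame(
    {
        "company_profile": ["A", None, None, "D", None],
        "salary_range": [None, None, "30-40k", "50-60k", None],
    }
)

overlap = missing_overlap(toy_df, "company_profile", "salary_range")
print(overlap)
```

In this notebook the same helper could be called as `missing_overlap(fake_jobs_df, "company_profile", "salary_range")` before any columns are dropped, which would show how many rows remain if both columns were required.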

The best approach is to start by removing the columns we don't need, in other words, keeping only the columns we really need to answer our research question (Understanding fake job dynamics in the era of AI).

So here are the columns I want to retain and why:

- job id
- job title
- location (often a big factor in the application process; people choose suitable locations)
- benefits (this has fewer missing values than salary range and often details what the company offers in terms of compensation, so it arguably makes up for salary range)
- description (often the first place applicants look to understand the requirements of the job; it also usually mentions the company profile or who they are, so it arguably makes up for company profile)
- fraudulent (marker indicating that the jobs are fraudulent)
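
An alternative to listing every column to drop is to select the retained columns directly, which keeps the code in step with the list above. This is a minimal sketch on a made-up toy DataFrame; the column names follow the dataset's naming as shown in the notebook output, and the `columns_to_keep` name is an assumption for illustration.

```python
import pandas as pd

# Columns from the list above, using the dataset's column names.
columns_to_keep = [
    "job_id",
    "title",
    "location",
    "description",
    "benefits",
    "fraudulent",
]

# Toy data for illustration only; these values are not from the real dataset.
toy_df = pd.DataFrame(
    {
        "job_id": [1, 2],
        "title": ["Clerk", "Analyst"],
        "location": ["US, CA", None],
        "department": [None, "Finance"],
        "salary_range": [None, "40000-50000"],
        "description": ["Data entry work", "Reporting work"],
        "benefits": ["Health cover", None],
        "fraudulent": [1, 1],
    }
)

# Fail early if an expected column is absent, instead of silently skipping it.
absent = [col for col in columns_to_keep if col not in toy_df.columns]
if absent:
    raise KeyError(f"Columns not found in DataFrame: {absent}")

# .copy() so later in-place edits do not touch the original frame
kept_df = toy_df[columns_to_keep].copy()
print(kept_df.columns.tolist())
```

Selecting by an explicit keep list means any new column added to the raw file is excluded by default, whereas a drop list would let it through.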


In [5]:
# drop columns that are not needed for our analysis
columns_to_drop = [
    "department",
    "telecommuting",
    "has_company_logo",
    "has_questions",
    "required_education",
    "employment_type",
    "function",
    "industry",
    "required_experience",
    "salary_range",
    "company_profile",
    "requirements",
]
fake_jobs_df.drop(columns=columns_to_drop, inplace=True, errors="ignore")

# print the shape of the DataFrame after dropping columns
print(
    "\nShape of fake jobs DataFrame after dropping unnecessary columns: "
    f"{fake_jobs_df.shape}"
)

# show the first 5 rows of the DataFrame after dropping columns
print(
    "\nFirst 5 rows of the fake jobs DataFrame after dropping "
    "unnecessary columns:"
)
print(fake_jobs_df.head())


Shape of fake jobs DataFrame after dropping unnecessary columns: (866, 6)

First 5 rows of the fake jobs DataFrame after dropping unnecessary columns:
     job_id                             title                            location                                                                                                                                                                                              description                                                                                                                                                                                                 benefits  fraudulent
98       99                   IC&E Technician                   US, , Stocton, CA  IC&amp;E Technician | Bakersfield, CA Mt. PosoPrincipal Duties and Responsibilities: Calibrates, tests, maintains, troubleshoots, and installs all power plant instrumentation, control systems and ...  BENEFITSWhat is offered:Competitive compensation packa

In [None]:
# Check for missing values in the updated fake jobs
# DataFrame after dropping columns
print("\nMissing values in current fake jobs DataFrame:")
print(fake_jobs_df.isnull().sum())


Missing values in current fake jobs DataFrame:
job_id           0
title            0
location        19
description      1
benefits       364
fraudulent       0
dtype: int64


In [None]:
# remove rows with missing values in the 'benefits' column
fake_jobs_df.dropna(subset=["benefits"], inplace=True)

# print the shape of the DataFrame after dropping rows with
# missing values in 'benefits'
print(f"\nShape of current fake jobs DataFrame: {fake_jobs_df.shape}")

# current missing values in the DataFrame
print("\nCurrent missing values in fake jobs DataFrame:")
print(fake_jobs_df.isnull().sum())


Shape of current fake jobs DataFrame: (502, 6)

Current missing values in fake jobs DataFrame:
job_id         0
title          0
location       2
description    0
benefits       0
fraudulent     0
dtype: int64


In [None]:
# remove rows with missing values in the 'location' column
fake_jobs_df.dropna(subset=["location"], inplace=True)

# print the shape of the DataFrame after dropping rows with
# missing values in 'location'
print(f"\nShape of current fake jobs DataFrame: {fake_jobs_df.shape}")

# current missing values in the DataFrame
print("\nCurrent missing values in fake jobs DataFrame:")
print(fake_jobs_df.isnull().sum())


Shape of current fake jobs DataFrame: (500, 6)

Current missing values in fake jobs DataFrame:
job_id         0
title          0
location       0
description    0
benefits       0
fraudulent     0
dtype: int64


In [None]:
# set the output file path
output_file_path = "../1_datasets/cleaned_aegean_fakejobs/aegean_500_fakejobs.csv"  # noqa: E501

In [12]:
# save the selected fake jobs to a CSV file
fake_jobs_df.to_csv(output_file_path, index=False)
print(f"\nCleaned fake jobs saved to {output_file_path}.")


Cleaned fake jobs saved to ../1_datasets/cleaned_aegean_fakejobs/aegean_500_fakejobs.csv.
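
The goal stated at the top of this notebook mentions randomly extracting 30 fake jobs, a step the cells above do not perform. A minimal sketch of how that could be done with `DataFrame.sample` follows. The seed value and the toy stand-in for the cleaned DataFrame are assumptions for illustration, not part of the original workflow.

```python
import pandas as pd

SAMPLE_SIZE = 30  # sample size from the stated goal
RANDOM_SEED = 42  # assumption: any fixed seed makes the draw reproducible

# Toy stand-in for the cleaned 500-row DataFrame; values are made up.
toy_cleaned_df = pd.DataFrame(
    {
        "job_id": list(range(1, 501)),
        "fraudulent": [1] * 500,
    }
)

# Draw rows without replacement (the default for DataFrame.sample)
sampled_df = toy_cleaned_df.sample(n=SAMPLE_SIZE, random_state=RANDOM_SEED)
print(sampled_df.shape)
```

Fixing `random_state` means the same 30 rows are drawn on every run, which lets teammates reproduce the sample.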
