**Fake Jobs Refinement with LLM**

This notebook uses a large language model (Gemini, accessed through the Gemini API) to rewrite the scam job descriptions in our dataset, simulating ***"Fake Job postings written by AI"*** in line with our research objectives.

In [1]:
# import needed libraries
import pandas as pd

In [None]:
# load the extracted fake jobs dataset
extracted_fake_jobs_df = pd.read_csv('../1_datasets/processed_fake_jobs/originally_selected_30_fake_jobs.csv')

# display the first few rows of the dataset
print("Extracted Fake Jobs Dataset (first 5 rows):")
print(extracted_fake_jobs_df.head())

Extracted Fake Jobs Dataset (first 5 rows):
   job_id                         title              location  salary_range  \
0    5452      Adminstrative/Data Entry      US, VA, Marshall           NaN   
1   17688                    Data Entry      US, IL, ATKINSON           NaN   
2   17603             Network Marketing              US, NH,   7200-1380000   
3    6696  Cruise Staff Wanted *URGENT*  US, PA, philadelphia           NaN   
4     998           EXECUTIVE SOUS CHEF                MY, ,    55000-65000   

                                     company_profile  \
0                                                NaN   
1                                                NaN   
2                                                NaN   
3                                                NaN   
4  Le Meridien is situated in the heart of kuala ...   

                                         description  \
0  Arise Virtual Solutions is a business process ...   
1  We are seeking extremely moti

In [None]:
# Setup for LLM refinement: import the LLM client and configure the API key
import google.generativeai as genai
import os
from dotenv import load_dotenv


# Load environment variables from .env file
load_dotenv()
api_key = os.getenv("API_KEY")

if not api_key:
    raise ValueError("No API key found. Please set your API key in the .env file.")

# Configure the generative AI API
genai.configure(api_key=api_key)
model = genai.GenerativeModel('gemini-2.5-flash')


In [None]:
# Define the LLM Refinement Prompt

# This prompt needs careful engineering.
# It is a balance: the output should sound legitimate, but not *too* perfect.
# Several iterations of prompt engineering may be needed.

LLM_REFINEMENT_PROMPT = """
You are an expert HR professional and a master wordsmith. Your task is to rewrite a given job description
to make it sound highly professional, appealing, and legitimate, while subtly incorporating characteristics
that might be common in sophisticated, but still fraudulent, job postings.

Focus on:
- Improving grammar and vocabulary.
- Making vague tasks sound more professional (e.g., "data entry" -> "information management").
- Removing obvious scam red flags (e.g., "send money," "no experience needed - huge pay").
- Adding appealing but potentially exaggerated benefits or responsibilities.
- Making the application process sound normal.
- Retain the core 'job' type (e.g., if it was 'data entry', keep it as a data-related role).

DO NOT:
- Make it sound *too* perfect if your goal is to make it a *subtle* scam.
- Introduce explicit scam language.
- Mention anything about being a scam or fraudulent.

Here is the job description to refine:
---
{job_description_text}
---
Please provide only the refined job description text, nothing else.
"""

# Loop through the selected fake jobs and refine them

# Work on a copy so that extracted_fake_jobs_df (loaded in an earlier cell) stays unchanged
df_to_refine = extracted_fake_jobs_df.copy()

# Keep the original description in a separate column for later comparison
df_to_refine['original_description_before_llm'] = df_to_refine['description']

refined_descriptions = []
print("Starting LLM refinement process...")
for index, row in df_to_refine.iterrows():
    original_description = row['description']

    # Construct the full prompt for the current job description
    current_prompt = LLM_REFINEMENT_PROMPT.format(job_description_text=original_description)

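    try:
        # Call the LLM API
        response = model.generate_content(current_prompt)
        # Accessing response.text may raise if the response contains no text
        # (for example, when it was blocked); the except below handles that case
        refined_text = response.text

        refined_descriptions.append(refined_text)
        print(f"Refined Job ID: {row['job_id']}")

    except Exception as e:
        print(f"Error refining Job ID {row['job_id']}: {e}")
        # Fall back to the original text so the list stays aligned with the DataFrame rows
        refined_descriptions.append(original_description)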

# Update the 'description' column with the refined text
df_to_refine['description'] = refined_descriptions

print("\nLLM refinement complete.")
print("\nFirst 3 refined job descriptions (showing original vs refined):")
print(df_to_refine[['original_description_before_llm', 'description']].head(3))


Starting LLM refinement process...
Refined Job ID: 5452
Refined Job ID: 17688
Refined Job ID: 17603
Refined Job ID: 6696
Refined Job ID: 998
Refined Job ID: 6576
Refined Job ID: 17561
Refined Job ID: 6575
Refined Job ID: 11711
Refined Job ID: 17593
Refined Job ID: 8483
Refined Job ID: 17747
Refined Job ID: 17599
Refined Job ID: 10954
Refined Job ID: 17650
Refined Job ID: 17713
Refined Job ID: 2267
Refined Job ID: 17818
Refined Job ID: 17684
Refined Job ID: 3180
Refined Job ID: 17632
Refined Job ID: 17627
Refined Job ID: 11768
Refined Job ID: 17549
Refined Job ID: 2922
Refined Job ID: 10408
Refined Job ID: 4267
Refined Job ID: 8691
Refined Job ID: 5870
Refined Job ID: 7655

LLM refinement complete.

First 3 refined job descriptions (showing original vs refined):
                     original_description_before_llm  \
0  Arise Virtual Solutions is a business process ...   
1  We are seeking extremely motivated and experie...   
2  Are you looking to make anywhere from 600-115,...   
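
The refinement loop above falls back to the original description as soon as a single API call fails. As a minimal sketch, assuming transient failures (rate limits, network errors) are worth retrying, a small wrapper could retry a call with exponential backoff before giving up. The helper name `call_with_retry` and its parameters are hypothetical and not part of the original notebook.

```python
import time


def call_with_retry(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn() and retry with exponential backoff if it raises.

    Returns fn()'s result, or None once every attempt has failed.
    `sleep` is injectable so the helper can be exercised without waiting.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:  # broad on purpose, mirroring the loop above
            print(f"Attempt {attempt + 1}/{max_attempts} failed: {exc}")
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)  # waits 1x, 2x, 4x ... base_delay
    return None
```

Inside the loop, the call could then be written as `call_with_retry(lambda: model.generate_content(current_prompt).text)`, appending the original description only when the result is `None`.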

   

In [7]:
# display the first few rows of the refined DataFrame
print("\nRefined Fake Jobs Dataset (first 5 rows):")
print(df_to_refine.head())



Refined Fake Jobs Dataset (first 5 rows):
   job_id                         title              location  salary_range  \
0    5452      Adminstrative/Data Entry      US, VA, Marshall           NaN   
1   17688                    Data Entry      US, IL, ATKINSON           NaN   
2   17603             Network Marketing              US, NH,   7200-1380000   
3    6696  Cruise Staff Wanted *URGENT*  US, PA, philadelphia           NaN   
4     998           EXECUTIVE SOUS CHEF                MY, ,    55000-65000   

                                     company_profile  \
0                                                NaN   
1                                                NaN   
2                                                NaN   
3                                                NaN   
4  Le Meridien is situated in the heart of kuala ...   

                                         description  \
0  Arise Virtual Solutions stands as a pioneering...   
1  **Operations Support Specialis

In [None]:
# display only the original and refined descriptions
print(df_to_refine[['original_description_before_llm', 'description']].head(3))



                     original_description_before_llm  \
0  Arise Virtual Solutions is a business process ...   
1  We are seeking extremely motivated and experie...   
2  Are you looking to make anywhere from 600-115,...   

                                         description  
0  Arise Virtual Solutions stands as a pioneering...  
1  **Operations Support Specialist**\n\nWe are ac...  
2  Are you seeking a distinctive professional tra...  
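
Because the refinement loop stores the original text whenever an API call fails, a failed row looks like any other row in the output. The sketch below, run on toy data rather than the real dataset, shows one way to flag rows whose refined description is identical to the original. The function name `flag_unrefined` is hypothetical; the column names are the ones used in this notebook.

```python
import pandas as pd


def flag_unrefined(df, original_col="original_description_before_llm",
                   refined_col="description"):
    """Return a boolean Series: True where the refined text equals the original."""
    original = df[original_col].fillna("").str.strip()
    refined = df[refined_col].fillna("").str.strip()
    return original.eq(refined)


# Toy data standing in for df_to_refine (values are illustrative only)
toy = pd.DataFrame({
    "job_id": [1, 2],
    "original_description_before_llm": ["earn $$$ fast", "data entry clerk"],
    "description": ["A rewarding opportunity", "data entry clerk"],
})

# job_ids whose description was not changed by the refinement step
unchanged = toy.loc[flag_unrefined(toy), "job_id"].tolist()
print(unchanged)
```

On the real data, `df_to_refine.loc[flag_unrefined(df_to_refine), 'job_id']` would list the postings that may need to be refined again.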


In [None]:
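# write the original vs refined descriptions to a JSON Lines file (one record per line) for later comparison
# note: with the default settings, to_json overwrites an existing file rather than appending to it
df_to_refine[['original_description_before_llm', 'description']].to_json('original_vs_refined_fakeJobs_descriptions.json', orient='records', lines=True)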
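
With `orient='records'` and `lines=True`, the file holds one JSON object per line, so it has to be read back with `lines=True` as well. The following sketch shows the round trip on toy data written to a temporary directory. The file name `comparison.json` and the sample values are placeholders, not the notebook's real data.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the two comparison columns
toy = pd.DataFrame({
    "original_description_before_llm": ["earn money fast", "data entry"],
    "description": ["A rewarding opportunity", "Information management"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "comparison.json")

    # Write one JSON record per line
    toy.to_json(path, orient="records", lines=True)

    # Count the non-empty lines: one per DataFrame row
    with open(path, encoding="utf-8") as fh:
        n_lines = sum(1 for line in fh if line.strip())

    # Read the JSON Lines file back into a DataFrame
    loaded = pd.read_json(path, lines=True)

print(n_lines, loaded.shape)
```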


In [10]:
# current shape of the DataFrame
print("\nCurrent shape of the refined DataFrame:")
print(df_to_refine.shape)


Current shape of the refined DataFrame:
(30, 11)


In [11]:
# Save the LLM-refined fake jobs
output_file_path = '../1_datasets/processed_fake_jobs/llm_refined_30_fake_job_postings.csv'
df_to_refine.to_csv(output_file_path, index=False)
print(f"\nLLM-refined fake jobs saved to '{output_file_path}'")


LLM-refined fake jobs saved to '../1_datasets/processed_fake_jobs/llm_refined_30_fake_job_postings.csv'
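
The refined descriptions contain line breaks and markdown (see the `\n\n` in the output above), so it can be worth confirming that they survive the CSV round trip. The sketch below checks this on toy data in a temporary file. The values and the file name `roundtrip.csv` are illustrative assumptions.

```python
import os
import tempfile

import pandas as pd

# Toy rows with embedded newlines, commas and quotes
toy = pd.DataFrame({
    "job_id": [101, 102],
    "description": [
        "**Operations Support Specialist**\n\nWe are hiring, now.",
        'Plain "quoted" text',
    ],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "roundtrip.csv")
    toy.to_csv(path, index=False)   # pandas quotes fields that need it
    reloaded = pd.read_csv(path)    # quoted newlines are parsed as part of the field

same_shape = reloaded.shape == toy.shape
same_text = reloaded["description"].tolist() == toy["description"].tolist()
print(same_shape, same_text)
```

The same two checks could be applied to the real file by reloading `output_file_path` and comparing against `df_to_refine`.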
