# Detecting Fake Job Postings with Machine Learning

#### Project Storyline: Fighting Fraud in the Job Market

In today’s digital age, online job platforms have become targets for scammers who post fake jobs to steal personal data or charge bogus fees. These fraudulent listings often:

- Collect personal information under false pretenses
- Charge fake application or training fees
- Waste job seekers’ time and damage their trust

To protect users and improve trust in online job boards, we aim to build a machine learning model that flags suspicious job postings.

##### Stakeholder
The key stakeholders are online job platforms (e.g., LinkedIn, Indeed). They aim to protect job seekers from scams, maintain trust, and reduce manual moderation through automated fake-job detection.

##### Objective

Build a classification model that can predict whether a job listing is **fraudulent (1)** or **legitimate (0)** based on its content.

##### Business Questions
1. Can we predict fake job postings from their content, such as the description, location, and telecommuting flag?

2. Which features are most predictive of fraudulent listings?

3. How well can we catch fake listings without misclassifying too many real ones?

4. How can this model be used to assist or automate moderation workflows?
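
Question 3 is a trade-off between recall (the share of fake listings we catch) and precision (the share of flagged listings that really are fake). As a minimal sketch of how the two could be measured, the snippet below computes both from a list of true labels and predictions. The helper `precision_recall` and the toy labels are invented for illustration; they are not results from this dataset.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for the positive class (1 = fraudulent)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fakes correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # real jobs wrongly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fakes that slipped through
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy labels, invented for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Here two of the three fake postings are caught and one real posting is wrongly flagged, so both values come out at about 0.67.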

##### Why Machine Learning?

Rule-based systems can’t keep up with evolving scams. With machine learning, we can:

- Learn from patterns across thousands of job postings
- Automatically detect suspicious listings
- Support human moderators with predictions

##### Dataset

We’ll use the “Fake Job Postings” dataset from [Kaggle](https://www.kaggle.com/shivamb/real-or-fake-fake-jobposting-prediction). The dataset includes real and fake job posts, with features such as:

- Job title, company, location
- Description, requirements, benefits
- Telecommuting, company logo, employment type

Our target variable is:
- `fraudulent`: 0 = Real job, 1 = Fake job
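
Because the target is binary, it can help to check how the two classes are balanced before modeling. The sketch below shows one way to do that with pandas. The helper `class_balance` is a hypothetical name, and the toy series stands in for `df['fraudulent']`; its values are not taken from the dataset.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Count each class and express it as a share of all rows."""
    counts = labels.value_counts().sort_index()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})


# Toy labels, invented for illustration only (0 = real, 1 = fake)
toy_labels = pd.Series([0, 0, 0, 0, 1], name="fraudulent")

balance = class_balance(toy_labels)
print(balance)
```

Once the data is loaded, the same call would be `class_balance(df['fraudulent'])`.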

---

### **Let’s begin by importing the dataset and exploring the data.**


In [27]:
import pandas as pd

df = pd.read_csv("../data/fake_job_postings.csv")
df.head()


Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15185 non-null  object
 8   benefits             10670 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  func

In [29]:
print("Dataset shape (rows, columns):", df.shape)

Dataset shape (rows, columns): (17880, 18)


In [30]:
# Check missing values count per column
print("\nMissing values per column:")
print(df.isnull().sum())


Missing values per column:
job_id                     0
title                      0
location                 346
department             11547
salary_range           15012
company_profile         3308
description                1
requirements            2695
benefits                7210
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3471
required_experience     7050
required_education      8105
industry                4903
function                6455
fraudulent                 0
dtype: int64


##### Handling Missing Values

The dataset has several columns with missing values. Before building our model, it's important to decide how to handle these gaps. 

- Columns with too many missing values might be dropped or carefully imputed.
- Important features with few missing values can be filled with sensible defaults or placeholders.
- The target variable `fraudulent` has no missing data, so we don't need to worry about it.
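
One way to apply these rules is to compute the share of missing values per column and compare it to a cutoff. The sketch below is an illustration under stated assumptions: the helper `missing_summary` and the 0.6 value for `drop_threshold` are hypothetical choices, not part of the original analysis, and the toy frame is invented. For reference, the counts above put `salary_range` at 15012 of 17880 rows missing (about 84%).

```python
import pandas as pd


def missing_summary(df: pd.DataFrame, drop_threshold: float = 0.6) -> pd.DataFrame:
    """Summarize missing values per column and attach a rough suggestion.

    drop_threshold is an assumed cutoff, not a rule from the dataset.
    """
    summary = pd.DataFrame({
        "missing": df.isnull().sum(),   # absolute count of missing cells
        "share": df.isnull().mean(),    # fraction of rows that are missing
    })
    summary["suggestion"] = "keep as is"
    summary.loc[summary["share"] > 0, "suggestion"] = "impute"
    summary.loc[summary["share"] > drop_threshold, "suggestion"] = "consider dropping"
    return summary.sort_values("share", ascending=False)


# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "salary_range": [None, None, None, "10-20"],
    "location": ["US", None, "NZ", "DE"],
})

report = missing_summary(toy)
print(report)
```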
---
##### **Let’s explore and handle missing values step-by-step.**


In [31]:
# Fill missing values in free-text columns with "unknown"
text_columns = ['company_profile', 'benefits', 'requirements', 'description']
for column in text_columns:
    df[column] = df[column].fillna('unknown')


In [32]:
# Fill missing values in categorical columns with the most frequent value (mode)
categorical_columns = ['department', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']
for column in categorical_columns:
    mode = df[column].mode()[0]
    df[column] = df[column].fillna(mode)

In [33]:
# Fill missing or blank values in 'location' with "unknown"
df['location'] = df['location'].fillna("unknown").str.strip()
df.loc[df['location'] == '', 'location'] = "unknown"

# Fill missing values in 'salary_range' with "Not Disclosed"
df['salary_range'] = df['salary_range'].fillna("Not Disclosed").str.strip()

# (Optional) Check some cleaned columns
df[['location', 'salary_range']].head(10)


Unnamed: 0,location,salary_range
0,"US, NY, New York",Not Disclosed
1,"NZ, , Auckland",Not Disclosed
2,"US, IA, Wever",Not Disclosed
3,"US, DC, Washington",Not Disclosed
4,"US, FL, Fort Worth",Not Disclosed
5,"US, MD,",Not Disclosed
6,"DE, BE, Berlin",20000-28000
7,"US, CA, San Francisco",Not Disclosed
8,"US, FL, Pensacola",Not Disclosed
9,"US, AZ, Phoenix",Not Disclosed


In [34]:
df.isnull().sum()

job_id                 0
title                  0
location               0
department             0
salary_range           0
company_profile        0
description            0
requirements           0
benefits               0
telecommuting          0
has_company_logo       0
has_questions          0
employment_type        0
required_experience    0
required_education     0
industry               0
function               0
fraudulent             0
dtype: int64

---

After inspecting the dataset, we cleaned it to prepare it for analysis and modeling:

- **Filled missing text fields** such as `company_profile`, `benefits`, `requirements`, and `description` with `"unknown"` to avoid empty text entries.
- **Imputed missing categorical columns** like `department`, `employment_type`, `required_experience`, `required_education`, `industry`, and `function` using the most frequent value (mode) in each column.
- **Cleaned the `location` column** by stripping whitespace and filling missing or blank values with `"unknown"` to handle incomplete geographic data.
- **Standardized `salary_range`** by filling missing salaries with `"Not Disclosed"`, meaning the posting did not publish a salary for the position. This keeps the column consistent.
- After cleaning, the dataset no longer contains missing values, making it ready for further analysis.

This cleaned dataset ensures better quality inputs for building our machine learning model to detect fake job postings.
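
The cleaning steps summarized above could also be collected into a single reusable function. The sketch below is one possible arrangement, not the notebook's actual structure: the name `clean_job_postings` and the two constants are hypothetical, and the toy frame is invented so the example runs without the CSV. Unlike the cells above, it works on a copy and leaves the input frame untouched.

```python
import pandas as pd

TEXT_COLUMNS = ["company_profile", "benefits", "requirements", "description"]
CATEGORICAL_COLUMNS = ["department", "employment_type", "required_experience",
                       "required_education", "industry", "function"]


def clean_job_postings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy of the postings using the steps described above."""
    cleaned = df.copy()

    # Free-text columns: placeholder for missing text
    for column in TEXT_COLUMNS:
        cleaned[column] = cleaned[column].fillna("unknown")

    # Categorical columns: most frequent value, skipped if a column is all-missing
    for column in CATEGORICAL_COLUMNS:
        modes = cleaned[column].mode()
        if not modes.empty:
            cleaned[column] = cleaned[column].fillna(modes.iloc[0])

    # Location: strip whitespace, then treat blanks like missing values
    cleaned["location"] = cleaned["location"].fillna("unknown").str.strip()
    cleaned.loc[cleaned["location"] == "", "location"] = "unknown"

    # Salary: explicit label when no range was published
    cleaned["salary_range"] = cleaned["salary_range"].fillna("Not Disclosed").str.strip()
    return cleaned


# Toy frame, invented for illustration only
toy = pd.DataFrame({
    "company_profile": ["We build things", None],
    "benefits": [None, "Health cover"],
    "requirements": ["Python", None],
    "description": [None, "Entry-level role"],
    "department": ["Sales", None],
    "employment_type": [None, "Full-time"],
    "required_experience": ["Internship", None],
    "required_education": [None, "Bachelor's Degree"],
    "industry": ["Marketing", None],
    "function": [None, "Sales"],
    "location": ["  ", None],
    "salary_range": [None, " 20000-28000 "],
})

cleaned = clean_job_postings(toy)
print(cleaned.isnull().sum().sum())
```

Keeping the steps in one function makes it easier to apply the same cleaning to new postings later, for example when the model is used in a moderation workflow.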


In [35]:
# Saving the cleaned dataset to a new CSV file
df.to_csv("data/cleaned_fake_job_postings.csv", index=False)
print("Cleaned dataset saved successfully.")

Cleaned dataset saved successfully.
