PURPOSE OF THIS NOTEBOOK:  
This notebook prepares a clean, consistent, model-ready dataset by handling missing values, noisy text fields, and irrelevant features while preserving the information that matters for fraud detection.

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv("../data/raw/fake_job_postings.csv")
df.shape

(17880, 18)

In [3]:
# Calculate the percentage of missing values for each column
missing_percent = df.isnull().mean().sort_values(ascending=False) * 100
missing_percent

salary_range           83.959732
department             64.580537
required_education     45.329978
benefits               40.335570
required_experience    39.429530
function               36.101790
industry               27.421700
employment_type        19.412752
company_profile        18.501119
requirements           15.078300
location                1.935123
description             0.005593
title                   0.000000
job_id                  0.000000
telecommuting           0.000000
has_questions           0.000000
has_company_logo        0.000000
fraudulent              0.000000
dtype: float64

In [4]:
# In fraud detection, inconsistencies or missing data can themselves be indicators of fraud rather than noise

text_columns = [  # fraudulent postings often have thin or missing text
    "title",  # often vague, with overused buzzwords
    "company_profile",  # fake job posts often have incomplete or generic company profiles
    "description",
    "requirements",
    "benefits",  # fake job posts often list no benefits or exaggerate them
]

binary_columns = [
    "telecommuting",
    "has_company_logo",
    "has_questions",
]
# In these columns, a missing value is treated as a signal rather than noise
categorical_columns = [
    "location",  # usually vague or unrealistic in fraudulent jobs
    "department",
    "salary_range",
    "employment_type",
    "required_experience",
    "required_education",
    "industry",
    "function",
]

columns_to_drop = [
    "job_id" #unique identifier, not useful for prediction
]
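
The idea that missingness can itself carry signal can be checked directly, before any values are filled. The following is a minimal sketch on a small made-up DataFrame: the `toy` frame and the helper name `fraud_rate_by_missingness` are illustrative assumptions, not part of this notebook. On the real data, the same helper could be called on `df` before the `fillna` step.

```python
import pandas as pd

# Toy data, invented purely for illustration (not taken from the real dataset)
toy = pd.DataFrame({
    "company_profile": ["We build apps", None, "Global retailer", None, None, "Local bakery"],
    "fraudulent":      [0,               1,    0,                 1,    0,    0],
})

def fraud_rate_by_missingness(frame, column, target="fraudulent"):
    """Return the mean of `target` (in percent) for rows where `column` is missing vs. present."""
    is_missing = frame[column].isnull().rename(f"{column}_missing")
    return frame.groupby(is_missing)[target].mean() * 100

rates = fraud_rate_by_missingness(toy, "company_profile")
print(rates)
```

A large gap between the two groups would support treating missingness in that column as a feature rather than as noise. A small gap would suggest the opposite.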


In [5]:
# To handle missing string values, we replace them with an empty string
df[text_columns] = df[text_columns].fillna("")

# For binary columns, we replace missing values with 0, assuming that missing indicates absence
# (none of these columns has missing values in this dataset, so this is only a safeguard)
df[binary_columns] = df[binary_columns].fillna(0)

# For categorical columns, we replace missing values with 'Unknown'
df[categorical_columns] = df[categorical_columns].fillna("Unknown")

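# Drop the columns listed in columns_to_drop (defined above) instead of hard-coding them again
df.drop(columns=columns_to_drop, inplace=True)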

- For text features, we fill missing values with an empty string, because the absence of text (title, company profile, etc.) is itself information: it represents missing linguistic content, which can signal fraud.  
- For categorical features, a missing value indicates an unspecified state, so we use "Unknown" to preserve interpretability.  

***How does this matter mathematically?***  
When text is vectorized (for example with bag-of-words or TF-IDF), an empty string maps to a zero vector. If empty fields are distributed differently across fraudulent and real postings, that zero vector can be highly informative.  
By contrast, "Unknown" is a token in its own right: it competes with the other categories and is shared by both real and fraudulent jobs.  
Absent text features represent a lack of explanation and are abnormal, often indicating low effort, which correlates strongly with fraud.  
Missing categorical fields are _often_ legitimate, even for real jobs.
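
The difference between the two fill strategies can be made concrete with a tiny hand-rolled encoder. This is a minimal sketch, assuming a count-based bag-of-words for text and a one-hot encoding for categories. The vocabulary, the category list, and the helper names `bag_of_words` and `one_hot` are made up for illustration; the notebook itself does not fix a particular vectorizer.

```python
def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return [tokens.count(word) for word in vocabulary]

def one_hot(value, categories):
    """Encode `value` as a one-hot vector over `categories`."""
    return [1 if value == category else 0 for category in categories]

# Hypothetical vocabulary and category list, chosen only for this example
vocabulary = ["engineer", "remote", "salary", "team"]
categories = ["Full-time", "Part-time", "Unknown"]

filled_text = bag_of_words("", vocabulary)  # missing text filled with ""
present_text = bag_of_words("remote engineer wanted for our team", vocabulary)
filled_category = one_hot("Unknown", categories)  # missing category filled with "Unknown"

print(filled_text)
print(present_text)
print(filled_category)
```

In this sketch the empty string activates no dimension at all, whereas "Unknown" activates its own dimension, just as any other category value would.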

In [None]:
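def clean_text(text):
    # Lowercase the text and trim leading/trailing whitespace
    text = text.lower()
    text = text.strip()
    return text

for col in text_columns:
    df[col] = df[col].apply(clean_text)

# Concatenate all text fields into a single feature, separated by spaces
df["combined_text"] = (
    df["title"] + " "
    + df["company_profile"] + " "
    + df["description"] + " "
    + df["requirements"] + " "
    + df["benefits"]
)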

In [8]:
# Check whether any missing values remain
df.isnull().sum()

title                  0
location               0
department             0
salary_range           0
company_profile        0
description            0
requirements           0
benefits               0
telecommuting          0
has_company_logo       0
has_questions          0
employment_type        0
required_experience    0
required_education     0
industry               0
function               0
fraudulent             0
combined_text          0
dtype: int64

In [None]:
# Percentage of fraudulent vs. non-fraudulent job posts in the dataset
df["fraudulent"].value_counts(normalize=True) * 100

fraudulent
0    95.1566
1     4.8434
Name: proportion, dtype: float64

In [10]:
df.to_csv("../data/processed/cleaned_job_postings.csv", index=False)
# index=False avoids writing the DataFrame index as an extra column in the output CSV file.
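
The effect of `index=False` can be seen with a quick round trip through an in-memory buffer. This is a small sketch on a made-up two-row frame (`sample` is an illustrative stand-in, not the cleaned dataset), using `io.StringIO` in place of a file on disk.

```python
import io
import pandas as pd

# Small made-up frame standing in for the cleaned data
sample = pd.DataFrame({"title": ["data analyst", "sales rep"], "fraudulent": [0, 1]})

# Write once with the default index=True and once with index=False
with_index = io.StringIO()
sample.to_csv(with_index)
without_index = io.StringIO()
sample.to_csv(without_index, index=False)

# Read both back and compare the resulting column names
with_index.seek(0)
without_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```

With the default setting, the index is written as an unnamed first column and comes back as `Unnamed: 0` on reload. With `index=False`, only the original columns are written.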