<a href="https://colab.research.google.com/github/N00B-MA5TER/ML-Projects/blob/main/Fake_Job_Posting_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Importing the dataset from Kaggle

In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("shivamb/real-or-fake-fake-jobposting-prediction")

print("Path to dataset files:", path)

Using Colab cache for faster access to the 'real-or-fake-fake-jobposting-prediction' dataset.
Path to dataset files: /kaggle/input/real-or-fake-fake-jobposting-prediction


In [2]:
import pandas as pd
df = pd.read_csv(path + "/fake_job_postings.csv")
df.head()

,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


**Data Evaluation and Preprocessing**

In [3]:
df.shape

(17880, 18)

Checking for and Handling Missing Values

In [5]:
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values)

Missing values per column:
job_id                     0
title                      0
location                 346
department             11547
salary_range           15012
company_profile         3308
description                1
requirements            2696
benefits                7212
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3471
required_experience     7050
required_education      8105
industry                4903
function                6455
fraudulent                 0
dtype: int64
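
The raw counts above are easier to compare as a share of all rows. The sketch below uses a small invented frame (the `toy` data and the 0.6 cutoff are illustrative assumptions, not values from this notebook) to show one way of listing the columns whose missing fraction exceeds a threshold.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the job-postings frame; values are invented for illustration.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d", "e"],
    "salary_range": [np.nan, np.nan, np.nan, np.nan, "10-20"],
    "location": ["US", np.nan, "NZ", "US", "US"],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_fraction = toy.isnull().mean().sort_values(ascending=False)

# Hypothetical cutoff: flag columns that are mostly empty.
threshold = 0.6
mostly_missing = missing_fraction[missing_fraction > threshold].index.tolist()

print(missing_fraction)
print("Columns above threshold:", mostly_missing)
```

Applied to the counts printed above, `salary_range` is missing in about 84% of the 17880 rows (15012) and `department` in about 65% (11547).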


Dropping `salary_range` and `department`, the two columns with the most missing values, and filling the missing entries in the remaining text columns with an empty string.



In [6]:
# Drop columns with a high percentage of missing values
df = df.drop(columns=['salary_range', 'department'])

# Impute missing values in text columns with an empty string
text_columns = ['location', 'company_profile', 'description', 'requirements', 'benefits', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']
for col in text_columns:
    if col in df.columns:
        df[col] = df[col].fillna('')

# Verify that missing values have been handled
print("\nMissing values after handling:")
print(df.isnull().sum())


Missing values after handling:
job_id                 0
title                  0
location               0
company_profile        0
description            0
requirements           0
benefits               0
telecommuting          0
has_company_logo       0
has_questions          0
employment_type        0
required_experience    0
required_education     0
industry               0
function               0
fraudulent             0
dtype: int64


Using SMOTE to address data imbalance

In [4]:
df.value_counts("fraudulent")

fraudulent,count
0,17014
1,866


In [15]:
from imblearn.over_sampling import SMOTE

X = df.drop('fraudulent', axis=1)
y = df['fraudulent']

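# SMOTE needs numeric input, so keep only the numeric columns
# (job_id, telecommuting, has_company_logo, has_questions)
X_numeric = X.select_dtypes(include=['number'])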

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_numeric, y)

print("Class distribution before SMOTE:")
print(y.value_counts())
print("\nClass distribution after SMOTE:")
print(y_resampled.value_counts())

Class distribution before SMOTE:
fraudulent
0    17014
1      866
Name: count, dtype: int64

Class distribution after SMOTE:
fraudulent
0    17014
1    17014
Name: count, dtype: int64


Dropping the job_id column

In [9]:
X = X.drop(columns=['job_id'])
X_resampled = X_resampled.drop(columns=['job_id'])

print("Shape of X after dropping 'job_id':", X.shape)
print("Shape of X_resampled after dropping 'job_id':", X_resampled.shape)

Shape of X after dropping 'job_id': (17880, 14)
Shape of X_resampled after dropping 'job_id': (34028, 3)


Vectorizing the text columns with TfidfVectorizer

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

string_cols = df.select_dtypes(include='object').columns
X_text = df[string_cols].fillna('')

tfidf_vectorizer = TfidfVectorizer(max_features=5000, min_df=2, max_df=0.85)
X_text_tfidf = tfidf_vectorizer.fit_transform(X_text.agg(' '.join, axis=1))

X_text_df = pd.DataFrame(X_text_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

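# concat aligns on the index: rows of X_resampled that have no matching row
# in X_text_df get NaN in the TF-IDF columns
X_combined = pd.concat([X_resampled.reset_index(drop=True), X_text_df], axis=1)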

print("Shape of combined feature DataFrame:", X_combined.shape)
display(X_combined.head())

Shape of combined feature DataFrame: (34028, 5003)


,telecommuting,has_company_logo,has_questions,00,000,0in,0pt,10,100,1000,...,που,σε,στην,στο,τα,την,της,τις,το,του
0,0,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,1,0,0.0,0.083517,0.0,0.0,0.0,0.033344,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0,1,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
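
`pd.concat(..., axis=1)` matches rows by index label rather than by position, so frames with different row counts are padded with NaN. This matters here because `X_resampled` has 34028 rows, while the TF-IDF matrix is built from the 17880 rows of `df`. The toy frames below are invented for illustration and only demonstrate the alignment behaviour.

```python
import pandas as pd

# Four rows of numeric features, standing in for the resampled frame.
numeric = pd.DataFrame({"has_logo": [1, 0, 1, 0]})

# Only two rows of text features, standing in for the TF-IDF frame.
text = pd.DataFrame({"tfidf_word": [0.5, 0.1]})

# Column-wise concat performs an outer join on the index.
combined = pd.concat([numeric, text], axis=1)

print(combined)
print("Rows with NaN:", int(combined["tfidf_word"].isna().sum()))
```

Comparing `len(X_resampled)` with `len(X_text_df)` before concatenating is a cheap guard against this kind of silent padding.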


**Splitting the data into training data and test data**

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_combined, y_resampled, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (27222, 5003)
Shape of X_test: (6806, 5003)
Shape of y_train: (27222,)
Shape of y_test: (6806,)
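
Oversampling before the split lets synthetic copies of a minority example land on both sides of it, which tends to inflate test scores. A common alternative is to split first (stratified, so both parts keep the class ratio) and oversample only the training part. The sketch below shows that ordering on invented data. Plain random oversampling via `sklearn.utils.resample` stands in for SMOTE to keep the example dependency-free, and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

rng = np.random.default_rng(42)

# Invented imbalanced data: 180 negatives, 20 positives.
X = rng.normal(size=(200, 3))
y = np.array([0] * 180 + [1] * 20)

# 1) Split first; stratify keeps the 9:1 ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 2) Oversample the minority class in the training part only.
minority = X_tr[y_tr == 1]
majority = X_tr[y_tr == 0]
minority_up = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)

X_tr_bal = np.vstack([majority, minority_up])
y_tr_bal = np.array([0] * len(majority) + [1] * len(minority_up))

print("Train class counts:", np.bincount(y_tr_bal))
print("Test class counts:", np.bincount(y_te))
```

With imbalanced-learn, the same ordering would mean calling `fit_resample` on the training part only, leaving the test part at its original class ratio.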


**Training Classification Models**

In [13]:
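import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Replace any NaN entries in the feature matrix with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')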

X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)

log_reg = LogisticRegression(random_state=42)
rf_clf = RandomForestClassifier(random_state=42)
gb_clf = GradientBoostingClassifier(random_state=42)

log_reg.fit(X_train, y_train)
rf_clf.fit(X_train, y_train)
gb_clf.fit(X_train, y_train)

print("Logistic Regression model trained.")
print("Random Forest Classifier model trained.")
print("Gradient Boosting Classifier model trained.")

Logistic Regression model trained.
Random Forest Classifier model trained.
Gradient Boosting Classifier model trained.


**Model Evaluation**

In [14]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

models = {
    "Logistic Regression": log_reg,
    "Random Forest": rf_clf,
    "Gradient Boosting": gb_clf
}

for model_name, model in models.items():
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    print(f"--- {model_name} ---")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1-score: {f1:.4f}")
    print("-" * (len(model_name) + 6))

--- Logistic Regression ---
Accuracy: 0.8444
Precision: 0.8726
Recall: 0.8073
F1-score: 0.8387
-------------------------
--- Random Forest ---
Accuracy: 0.9899
Precision: 0.9997
Recall: 0.9801
F1-score: 0.9898
-------------------
--- Gradient Boosting ---
Accuracy: 0.9882
Precision: 0.9985
Recall: 0.9780
F1-score: 0.9881
-----------------------
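
The four scores reported above all derive from the confusion matrix. As a worked example on invented labels (not this notebook's predictions), the sketch below computes them by hand and compares them with scikit-learn's own functions.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

# Invented labels: 1 = fraudulent, 0 = real.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # share of flagged postings that are fraudulent
recall = tp / (tp + fn)      # share of fraudulent postings that get flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")

# Cross-check against the library implementations.
print(accuracy_score(y_true, y_pred), precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```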


**Identifying features and words in job descriptions that are associated with fraudulent postings**

In [16]:
# Get feature names
feature_names = X_combined.columns

# Get coefficients for Logistic Regression
log_reg_coef = log_reg.coef_[0]

# Get feature importances for Random Forest
rf_importances = rf_clf.feature_importances_

# Get feature importances for Gradient Boosting
gb_importances = gb_clf.feature_importances_

# Create DataFrames for each model
log_reg_importance_df = pd.DataFrame({'feature': feature_names, 'coefficient': log_reg_coef})
rf_importance_df = pd.DataFrame({'feature': feature_names, 'importance': rf_importances})
gb_importance_df = pd.DataFrame({'feature': feature_names, 'importance': gb_importances})

# Sort by absolute coefficient for Logistic Regression and importance for others
log_reg_importance_df['abs_coefficient'] = abs(log_reg_importance_df['coefficient'])
log_reg_importance_df = log_reg_importance_df.sort_values(by='abs_coefficient', ascending=False).drop(columns='abs_coefficient')

rf_importance_df = rf_importance_df.sort_values(by='importance', ascending=False)
gb_importance_df = gb_importance_df.sort_values(by='importance', ascending=False)

# Display the top features for each model
print("Top features for Logistic Regression:")
display(log_reg_importance_df.head(10))

print("\nTop features for Random Forest:")
display(rf_importance_df.head(10))

print("\nTop features for Gradient Boosting:")
display(gb_importance_df.head(10))

Top features for Logistic Regression:


,feature,coefficient
1,has_company_logo,-3.317194
3048,oil,3.288333
1603,entry,3.124298
4647,typing,2.771407
344,apex,-2.650627
2295,industry,2.649482
182,administrative,2.521865
4102,signing,2.471899
2394,internet,2.424788
254,aker,2.404425



Top features for Random Forest:


,feature,importance
2189,hse,0.036487
3860,rho,0.036055
832,church,0.027078
3630,ramberg,0.026961
4360,summaview,0.026864
3712,refined,0.01853
1613,epsilon,0.018062
1790,fbn,0.018049
3067,ookla,0.018048
2751,mateo,0.018044



Top features for Gradient Boosting:


,feature,importance
254,aker,0.678797
3776,represented,0.101782
856,clerk,0.087015
1460,earn,0.033504
1550,encouraged,0.024143
1,has_company_logo,0.008036
3048,oil,0.006408
182,administrative,0.0051
1761,facilitating,0.004514
1603,entry,0.004126
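
For logistic regression the sign of a coefficient carries meaning that tree-based importances do not: a positive weight pushes the prediction toward class 1 (fraudulent), a negative one toward class 0. The sketch below fits a model on invented data with hypothetical feature names to show how such a signed ranking can be read. It makes no claim about the actual dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400

# Invented features: "has_logo" lowers the odds of class 1, "earn" raises them,
# and "noise" is unrelated to the label.
has_logo = rng.integers(0, 2, size=n)
earn = rng.random(n)
noise = rng.random(n)
logit = -2.0 * has_logo + 3.0 * earn - 0.5
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = pd.DataFrame({"has_logo": has_logo, "earn": earn, "noise": noise})
model = LogisticRegression().fit(X, y)

# Pair names with signed weights and rank by magnitude, keeping the sign.
coef = pd.Series(model.coef_[0], index=X.columns)
ranked = coef.reindex(coef.abs().sort_values(ascending=False).index)
print(ranked)
```

Tree-based `feature_importances_` are non-negative, so they indicate how much a feature is used for splitting but not the direction of its effect.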


In [17]:
# Analyze top features from Logistic Regression
print("Analysis of top Logistic Regression features:")
print(log_reg_importance_df.head(10))
print("-" * 30)

# Analyze top features from Random Forest
print("\nAnalysis of top Random Forest features:")
print(rf_importance_df.head(10))
print("-" * 30)

# Analyze top features from Gradient Boosting
print("\nAnalysis of top Gradient Boosting features:")
print(gb_importance_df.head(10))
print("-" * 30)

# Identify common features among the top features of the models (considering the top N features)
# Let's consider the top 20 features for each model to find commonalities
top_n = 20
log_reg_top_features = set(log_reg_importance_df.head(top_n)['feature'])
rf_top_features = set(rf_importance_df.head(top_n)['feature'])
gb_top_features = set(gb_importance_df.head(top_n)['feature'])

common_top_features = log_reg_top_features.intersection(rf_top_features, gb_top_features)

print("\nCommon top features across all three models (top 20):")
print(common_top_features)

# Further analysis of the identified features based on their potential meaning
# Based on the printed top features, manually identify potentially suspicious terms
print("\nPotential meaning/context of some top features:")
print("- 'has_company_logo' (Logistic Regression): Negative coefficient suggests absence of company logo is associated with fraudulent postings.")
print("- 'oil' (Logistic Regression): Positive coefficient suggests the word 'oil' is associated with fraudulent postings. This could be related to specific scams.")
print("- 'aker', 'represented', 'clerk', 'earn', 'encouraged' (Gradient Boosting): These terms might appear in descriptions trying to sound legitimate or promising high earnings/easy work.")
print("- 'hse', 'rho', 'church', 'ramberg', 'summaview' (Random Forest): These might be specific terms used in niche scams or jargon that appears in fake postings.")
print("- 'entry' (Logistic Regression): Positive coefficient could indicate 'entry-level' jobs that are often targeted by fraudsters.")
print("- 'typing' (Logistic Regression): Positive coefficient might relate to data entry or similar scam types.")
print("- 'apex' (Logistic Regression): Negative coefficient could be a legitimate software/term not used in fake postings.")



Analysis of top Logistic Regression features:
               feature  coefficient
1     has_company_logo    -3.317194
3048               oil     3.288333
1603             entry     3.124298
4647            typing     2.771407
344               apex    -2.650627
2295          industry     2.649482
182     administrative     2.521865
4102           signing     2.471899
2394          internet     2.424788
254               aker     2.404425
------------------------------

Analysis of top Random Forest features:
        feature  importance
2189        hse    0.036487
3860        rho    0.036055
832      church    0.027078
3630    ramberg    0.026961
4360  summaview    0.026864
3712    refined    0.018530
1613    epsilon    0.018062
1790        fbn    0.018049
3067      ookla    0.018048
2751      mateo    0.018044
------------------------------

Analysis of top Gradient Boosting features:
               feature  importance
254               aker    0.678797
3776       represented    0.1017