- Overview
- Project Structure and Components
- Data
- Methodology
- Results
- How to Run
- Conclusion & Future Work
## Overview

This project aims to develop a robust machine learning model for detecting fraudulent job postings. Online job platforms are a primary target for scammers, and fraudulent postings can mislead job seekers, waste their time, and expose them to financial risk.
Leveraging a dataset of job advertisements, this solution employs natural language processing (NLP) techniques combined with traditional machine learning algorithms to identify suspicious patterns. The core of the model utilizes TF-IDF for text feature extraction and a Linear Support Vector Classifier (LinearSVC) for classification. The project demonstrates a complete workflow from data understanding and preprocessing to model training, evaluation, and persistence for future use.
## Project Structure and Components

The project is structured into several Python classes within the `classes.ipynb` Jupyter Notebook, each handling a specific part of the data science pipeline:
- **`Data` Class:**
  - Purpose: Handles the initial loading of the `jobs.csv` dataset and provides basic data exploration methods (e.g., `head()`, `shape()`, `info()`, `describe()`).
  - Key Functionality: Provides fundamental insights into the dataset's structure, dimensions, data types, and summary statistics.
- **`DataPreprocessing` Class:**
  - Purpose: Focuses on cleaning and preparing the raw data for model training.
  - Key Functionality:
    - Identifies and handles missing values (e.g., dropping rows with a missing `description`, filling others with placeholders like 'Not Provided' or statistical measures like the median/mode).
    - Performs feature engineering by creating new, informative features from existing text fields (e.g., text lengths, counts of specific characters like '$' or '!', presence of common scam keywords).
    - Saves the cleaned and engineered dataset to a new CSV file (`phase1cleaned.csv`).
- **`Graph`, `UnivariateAnalysis`, `BivariateAnalysis` Classes:**
  - Purpose: Dedicated to exploratory data analysis (EDA) through various visualizations.
  - `Graph`: A base class providing individual plotting methods for different aspects of the data.
  - `UnivariateAnalysis`: Inherits from `Graph` and aggregates methods for plotting distributions of single variables (e.g., the `fraudulent` job distribution, `employment_type` distribution, `required_education` levels).
  - `BivariateAnalysis`: Inherits from `Graph` and provides methods for visualizing relationships between two variables (e.g., average salary vs. location, employment type vs. fraudulent status, fraudulent jobs by industry).
- **`Model` Class:**
  - Purpose: Encapsulates the core machine learning pipeline, including feature encoding, data splitting, model training, and evaluation.
  - Key Functionality:
    - Feature Encoding: Combines relevant text fields into a single 'text' column and applies TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to convert text into numerical features. Integrates other numerical features like `telecommuting`, `has_company_logo`, and `has_questions`.
    - Data Splitting: Divides the processed data into training and testing sets (80/20 split) using stratified sampling to preserve the target variable's class distribution.
    - Model Training: Trains a `LinearSVC` (Linear Support Vector Classifier) from `sklearn.svm`, known for its effectiveness in text classification.
    - Model Evaluation: Assesses the model's performance using standard metrics: a confusion matrix, a classification report (precision, recall, F1-score), and accuracy.
    - Model Persistence: Saves the trained `LinearSVC` model and the `TfidfVectorizer` object using Python's `pickle` module for later use.
- **`PickleModel` Class:**
  - Purpose: Provides a convenient way to load the previously saved machine learning model and vectorizer to make predictions on new, unseen data.
  - Key Functionality: Loads the `svm_fraud_model.pkl` file and offers a `predict()` method that prepares new data (combining text, vectorizing, stacking with other features) and generates predictions (see the persistence sketch after this list).
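Both the `Model` and `PickleModel` classes revolve around `pickle`-based persistence. Below is a minimal, self-contained sketch of that pattern; storing the model and vectorizer together as a tuple in `svm_fraud_model.pkl` is an assumption, and the notebook may lay out the pickle differently.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the project's fitted objects.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(["legit engineering role", "fast cash work from home"])
model = LinearSVC(max_iter=10_000).fit(X, [0, 1])

# Persist both objects in one file; the tuple layout is an assumption.
with open("svm_fraud_model.pkl", "wb") as f:
    pickle.dump((model, vectorizer), f)

# Later, e.g. inside PickleModel, load them back for prediction.
with open("svm_fraud_model.pkl", "rb") as f:
    loaded_model, loaded_vectorizer = pickle.load(f)
```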
## Data

The project utilizes a dataset of job postings, assumed to be named `jobs.csv`. This dataset contains various features describing job advertisements, including:
- Textual Features: `title`, `description`, `requirements`, `benefits`, `company_profile`.
- Categorical Features: `employment_type`, `required_experience`, `required_education`, `industry`, `function`, `location`, `department`.
- Numerical/Binary Features: `salary_range` (converted to `salary_avg`), `telecommuting`, `has_company_logo`, `has_questions`.
- Target Variable: `fraudulent` (a binary variable indicating whether a job posting is legitimate, `0`, or fraudulent, `1`).
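As a quick sanity check before preprocessing, a sketch like the following can verify that this schema is present (the column names come from the description above; the check itself is illustrative and not part of the project's classes):

```python
import pandas as pd

# Column groups as described above.
TEXT_COLS = ["title", "description", "requirements", "benefits", "company_profile"]
CATEGORICAL_COLS = ["employment_type", "required_experience", "required_education",
                    "industry", "function", "location", "department"]
BINARY_COLS = ["telecommuting", "has_company_logo", "has_questions"]
TARGET = "fraudulent"

df = pd.read_csv("jobs.csv")

# Verify the expected columns exist and inspect the class balance.
expected = set(TEXT_COLS + CATEGORICAL_COLS + BINARY_COLS + ["salary_range", TARGET])
missing = expected - set(df.columns)
print("Missing columns:", missing or "none")
print(df[TARGET].value_counts())  # 0 = legitimate, 1 = fraudulent
```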
## Methodology

### Data Understanding

The initial step loads the `jobs.csv` dataset into a Pandas DataFrame. The `Data` class is used to perform preliminary checks, such as:
- Viewing the first few rows (`head()`) to understand the data format.
- Checking the dimensions (`shape`) to learn the number of rows and columns.
- Getting a concise summary of the DataFrame (`info()`) to inspect data types and non-null counts.
- Generating descriptive statistics (`describe()`, `describe_all()`) for numerical and categorical columns.
### Data Preprocessing

The `DataPreprocessing` class implements crucial steps to prepare the data (a condensed sketch follows the lists below):
- Handling Missing Values:
  - Rows with a missing `description` are dropped, as this is a vital text field for fraud detection.
  - Other missing text fields (`company_profile`, `requirements`, `benefits`) are filled with 'Not Provided'.
  - `salary_range` is processed to extract `salary_avg` (the average of the range when available, otherwise the median salary). The original `salary_range` column is then dropped.
  - Other categorical columns (`location`, `department`, `employment_type`, `required_experience`, `required_education`, `industry`, `function`) are filled with their respective modes.
- Feature Extraction:
  - New numerical features are engineered from the text columns:
    - `desc_length`, `req_length`, `benefits_length`, `title_length`, `profile_length`: the character lengths of the respective text fields.
    - `desc_dollar_count`, `desc_exclaim_count`: the number of dollar signs and exclamation marks in the job description, often indicators of suspicious language.
    - `has_scam_words`: a binary indicator (0 or 1) of whether the description contains common scam-related phrases (e.g., 'money', 'investment', 'fast cash', 'work from home', 'no experience', 'quick earn').
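A condensed sketch of these preprocessing steps, using only the columns and rules described above (the exact notebook implementation may differ, and the `low-high` format assumed for `salary_range` is an assumption):

```python
import pandas as pd

df = pd.read_csv("jobs.csv")

# --- Missing values ---
df = df.dropna(subset=["description"])  # description is essential
for col in ["company_profile", "requirements", "benefits"]:
    df[col] = df[col].fillna("Not Provided")

# salary_range like "40000-60000" -> numeric average, median-filled (format assumed).
bounds = df["salary_range"].str.extract(r"(\d+)-(\d+)").astype(float)
df["salary_avg"] = bounds.mean(axis=1)
df["salary_avg"] = df["salary_avg"].fillna(df["salary_avg"].median())
df = df.drop(columns=["salary_range"])

for col in ["location", "department", "employment_type", "required_experience",
            "required_education", "industry", "function"]:
    df[col] = df[col].fillna(df[col].mode()[0])

# --- Engineered features ---
df["desc_length"] = df["description"].str.len()
df["req_length"] = df["requirements"].str.len()
df["benefits_length"] = df["benefits"].str.len()
df["title_length"] = df["title"].str.len()
df["profile_length"] = df["company_profile"].str.len()
df["desc_dollar_count"] = df["description"].str.count(r"\$")
df["desc_exclaim_count"] = df["description"].str.count("!")

scam_words = ["money", "investment", "fast cash", "work from home",
              "no experience", "quick earn"]
df["has_scam_words"] = (df["description"]
                        .str.contains("|".join(scam_words), case=False)
                        .astype(int))

df.to_csv("phase1cleaned.csv", index=False)
```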
### Exploratory Data Analysis

The `Graph`, `UnivariateAnalysis`, and `BivariateAnalysis` classes are used to visualize the data and uncover patterns:
- Univariate Analysis:
  - Fraudulent Job Distribution: Shows a significant class imbalance, with far more legitimate jobs than fraudulent ones. This is a common challenge in fraud detection.
  - Employment Type Distribution: Visualizes the most common employment types.
  - Required Experience/Education: Displays the distribution of required experience levels and education backgrounds.
  - Telecommuting/Company Logo/Questions: Shows the counts of remote jobs, jobs with company logos, and jobs with screening questions (fraudulent postings, for example, often lack company logos or are remote).
  - Top Industries/Job Functions/Locations: Highlights the most frequent industries, job functions, and geographical locations.
- Bivariate Analysis:
  - Average Salary by Location: Illustrates how average salary varies across the top locations.
  - Employment Type vs. Fraudulent: Breaks down fraudulent vs. non-fraudulent counts by employment type, revealing whether certain types are more prone to fraud.
  - Fraud by Industry: Pinpoints which industries have the highest counts of fraudulent job postings.
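As an illustration, here is a minimal matplotlib sketch of the kind of univariate plot these classes produce (the notebook's plotting style may differ):

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("phase1cleaned.csv")

# Class-imbalance plot: counts of legitimate (0) vs. fraudulent (1) postings.
counts = df["fraudulent"].value_counts().sort_index()
counts.plot(kind="bar")
plt.xticks([0, 1], ["Legitimate (0)", "Fraudulent (1)"], rotation=0)
plt.ylabel("Number of postings")
plt.title("Fraudulent Job Distribution")
plt.tight_layout()
plt.show()
```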
### Model Building and Training

The `Model` class orchestrates the machine learning pipeline:
- Combined Text Feature: A new 'text' column is created by concatenating `title`, `description`, `requirements`, `benefits`, and `company_profile`. This comprehensive text field is central to the NLP approach.
- TF-IDF Vectorization: `TfidfVectorizer` is applied to the combined 'text' column. This technique transforms the text into a matrix of numerical TF-IDF features, representing the importance of each word in a document relative to the corpus. `max_features` is set to 50,000 to limit the vocabulary size.
- Feature Stacking: The TF-IDF features are horizontally stacked (`hstack`) with the other numerical/binary features (`telecommuting`, `has_company_logo`, `has_questions`) to create the final feature matrix `X`.
- Data Splitting: The dataset is split into training (80%) and testing (20%) sets using `train_test_split`. Crucially, `stratify=self.y` is used so that the proportion of fraudulent jobs is maintained in both sets, addressing the class imbalance. `random_state=42` ensures reproducibility.
- Model Training: A `LinearSVC` (Linear Support Vector Classifier) is trained on the `X_train` and `y_train` data. `LinearSVC` scales well to large datasets and linear classification tasks, making it a good fit for TF-IDF features. `max_iter` is increased to 10,000 to ensure convergence.
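Putting those steps together, here is a compact sketch of the pipeline under the parameters stated above (`max_features=50000`, an 80/20 stratified split, `random_state=42`, `max_iter=10000`). It assumes the cleaned CSV from the preprocessing step is the input; the notebook's `Model` class may organize this differently:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

df = pd.read_csv("phase1cleaned.csv")

# 1. Combined text feature.
text_cols = ["title", "description", "requirements", "benefits", "company_profile"]
df["text"] = df[text_cols].fillna("").astype(str).agg(" ".join, axis=1)

# 2. TF-IDF vectorization with a capped vocabulary.
vectorizer = TfidfVectorizer(max_features=50_000)
X_text = vectorizer.fit_transform(df["text"])

# 3. Stack TF-IDF features with the binary features.
X = hstack([X_text, df[["telecommuting", "has_company_logo", "has_questions"]].values])
y = df["fraudulent"]

# 4. Stratified 80/20 split, reproducible via the fixed random seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# 5. Train the linear SVM.
model = LinearSVC(max_iter=10_000)
model.fit(X_train, y_train)
```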
### Model Evaluation

After training, the model's performance is evaluated on the unseen `X_test` data:
- Confusion Matrix: Provides a breakdown of correct and incorrect classifications (True Positives, True Negatives, False Positives, False Negatives).
- Classification Report: Offers detailed metrics for each class:
  - Precision: The proportion of positive identifications that were actually correct.
  - Recall (Sensitivity): The proportion of actual positives that were correctly identified.
  - F1-score: The harmonic mean of precision and recall, providing a balanced measure.
- Accuracy Score: The overall proportion of correctly classified instances.
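Continuing the pipeline sketch above (the names `model`, `X_test`, and `y_test` carry over from that snippet), the evaluation step uses the standard scikit-learn metric functions:

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_pred = model.predict(X_test)

# TN/FP/FN/TP breakdown for the two classes.
print(confusion_matrix(y_test, y_pred))

# Per-class precision, recall, and F1-score.
print(classification_report(y_test, y_pred))

# Overall proportion of correct classifications.
print("Accuracy:", accuracy_score(y_test, y_pred))
```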
## Results

**Model Performance Summary:** Based on the notebook's output, the `LinearSVC` model demonstrates strong performance in detecting fraudulent job postings:
```
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      3403
           1       0.99      0.78      0.87       173

    accuracy                           0.99      3576
   macro avg       0.99      0.89      0.93      3576
weighted avg       0.99      0.99      0.99      3576

Accuracy: 0.9890939597315436
```
- Overall Accuracy: Approximately 98.9%. This high accuracy indicates that the model correctly classifies a large majority of job postings.
- Fraudulent Class (Class 1) Performance:
  - Precision (0.99): When the model predicts a job is fraudulent, it is correct 99% of the time. This is excellent for minimizing false alarms.
  - Recall (0.78): The model identifies 78% of all actual fraudulent job postings. While not perfect, this is a respectable recall, indicating it catches a significant portion of fraud.
  - F1-score (0.87): A strong F1-score for the minority class suggests a good balance between precision and recall in identifying fraud.
These results indicate that the model is highly effective in differentiating between legitimate and fraudulent job postings, making it a valuable tool for enhancing job platform security.
## How to Run

To execute this project and train the fraud detection model:
- **Prerequisites:** Ensure you have Python (version 3.x recommended) installed, along with the following libraries:
  - `pandas`
  - `numpy`
  - `matplotlib`
  - `scikit-learn`
  - `scipy`
  - `copy` (built-in)
  - `pickle` (built-in)

  You can install the third-party packages via pip:

  ```
  pip install pandas numpy matplotlib scikit-learn scipy
  ```
- **Dataset:** Make sure the `jobs.csv` file is located in the same directory as the `classes.ipynb` notebook.
- **Execute the Notebook:**
  - Open the `classes.ipynb` file using Jupyter Notebook or JupyterLab.
  - Run all cells sequentially. The notebook will perform data loading, preprocessing, feature engineering, text vectorization, data splitting, model training, and evaluation.
  - Upon successful execution, the trained model and TF-IDF vectorizer will be saved as `svm_fraud_model.pkl` in your project directory.
- **Making New Predictions:** To use the trained model on new data, load the `PickleModel` class and call its `predict()` method. Ensure your new data DataFrame has the required columns (`title`, `description`, `requirements`, `benefits`, `company_profile`, `telecommuting`, `has_company_logo`, `has_questions`). A hedged sketch follows this list.
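For illustration, here is a minimal sketch of what `predict()` is described as doing; the actual `PickleModel` interface may differ, and the pickle layout (a `(model, vectorizer)` tuple) matches the assumption made in the persistence sketch earlier:

```python
import pickle

import pandas as pd
from scipy.sparse import hstack

# Load the persisted model and vectorizer (the tuple layout is an assumption).
with open("svm_fraud_model.pkl", "rb") as f:
    model, vectorizer = pickle.load(f)

# A single new posting with the required columns.
new_jobs = pd.DataFrame([{
    "title": "Data Entry Clerk",
    "description": "Earn fast cash working from home, no experience needed!",
    "requirements": "None",
    "benefits": "Flexible hours",
    "company_profile": "Not Provided",
    "telecommuting": 1,
    "has_company_logo": 0,
    "has_questions": 0,
}])

# Mirror the training-time preparation: combine text, vectorize, stack features.
text_cols = ["title", "description", "requirements", "benefits", "company_profile"]
text = new_jobs[text_cols].astype(str).agg(" ".join, axis=1)
X_new = hstack([
    vectorizer.transform(text),
    new_jobs[["telecommuting", "has_company_logo", "has_questions"]].values,
])

print(model.predict(X_new))  # 1 = predicted fraudulent, 0 = legitimate
```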
## Conclusion & Future Work

This project developed a robust fraud detection model that classifies job postings with high accuracy. The combination of domain-specific feature engineering, TF-IDF text representation, and a LinearSVC classifier proved effective at handling the complexity and class imbalance of the dataset. The model's high precision for the fraudulent class means legitimate postings are rarely misflagged, which is crucial for user experience on job platforms.