Fraud Detection Model

Table of Contents

  1. Overview
  2. Project Structure and Components
  3. Data
  4. Methodology
  5. Results
  6. How to Run
  7. Conclusion & Future Work

Overview

This project aims to develop a robust machine learning model for detecting fraudulent job postings. In today's digital age, online job platforms are a primary target for scammers. These fraudulent postings can mislead job seekers, waste their time, and potentially expose them to financial risks.

Leveraging a dataset of job advertisements, this solution employs natural language processing (NLP) techniques combined with traditional machine learning algorithms to identify suspicious patterns. The core of the model utilizes TF-IDF for text feature extraction and a Linear Support Vector Classifier (LinearSVC) for classification. The project demonstrates a complete workflow from data understanding and preprocessing to model training, evaluation, and persistence for future use.

Project Structure and Components

The project is structured into several Python classes within the classes.ipynb Jupyter Notebook, each handling a specific part of the data science pipeline:

  • Data Class:

    • Purpose: Handles the initial loading of the jobs.csv dataset and provides basic data exploration methods (e.g., head(), shape(), info(), describe()).
    • Key Functionality: Provides fundamental insights into the dataset's structure, dimensions, data types, and summary statistics.
  • DataPreprocessing Class:

    • Purpose: Focuses on cleaning and preparing the raw data for model training.
    • Key Functionality:
      • Identifies and handles missing values (e.g., dropping rows with missing description, filling others with placeholders like 'Not Provided' or statistical measures like median/mode).
      • Performs feature engineering by creating new informative features from existing text fields (e.g., text lengths, count of specific characters like '$' or '!', presence of common scam keywords).
      • Saves the cleaned and engineered dataset to a new CSV file (phase1cleaned.csv).
  • Graph, UnivariateAnalysis, BivariateAnalysis Classes:

    • Purpose: Dedicated to exploratory data analysis (EDA) through various visualizations.
    • Graph: A base class providing individual plotting methods for different aspects of the data.
    • UnivariateAnalysis: Inherits from Graph and aggregates methods for plotting distributions of single variables (e.g., fraudulent job distribution, employment_type distribution, required_education levels).
    • BivariateAnalysis: Inherits from Graph and provides methods for visualizing relationships between two variables (e.g., average salary vs. location, employment type vs. fraudulent status, fraudulent jobs by industry).
  • Model Class:

    • Purpose: Encapsulates the core machine learning pipeline, including feature encoding, data splitting, model training, and evaluation.
    • Key Functionality:
      • Feature Encoding: Combines relevant text fields into a single 'text' column and applies TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to convert text into numerical features. Integrates other numerical features like telecommuting, has_company_logo, and has_questions.
      • Data Splitting: Divides the processed data into training and testing sets (80/20 split) using stratified sampling to ensure the target variable's class distribution is maintained.
      • Model Training: Trains a LinearSVC (Linear Support Vector Classifier) from sklearn.svm, known for its effectiveness in text classification.
      • Model Evaluation: Assesses the model's performance using standard metrics: confusion matrix, classification report (precision, recall, F1-score), and accuracy.
      • Model Persistence: Saves the trained LinearSVC model and the TfidfVectorizer object using Python's pickle module for later use (a brief persistence sketch follows this list).
  • PickleModel Class:

    • Purpose: Provides a convenient way to load the previously saved machine learning model and vectorizer to make predictions on new, unseen data.
    • Key Functionality: Loads the svm_fraud_model.pkl file and offers a predict() method that prepares new data (combining text, vectorizing, stacking with other features) and generates predictions.
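
The persistence step can be illustrated with a minimal sketch. The exact contents of svm_fraud_model.pkl are defined in the notebook, so storing the classifier and vectorizer together as a single tuple here is an assumption made purely for illustration:

    import pickle
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC

    model = LinearSVC(max_iter=10000)                  # in the notebook: the trained classifier
    vectorizer = TfidfVectorizer(max_features=50000)   # in the notebook: the fitted vectorizer

    # Assumed layout: both objects stored together in one pickle file.
    with open('svm_fraud_model.pkl', 'wb') as f:
        pickle.dump((model, vectorizer), f)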

Data

The project utilizes a dataset of job postings, assumed to be named jobs.csv. This dataset contains various features describing job advertisements, including:

  • Textual Features: title, description, requirements, benefits, company_profile.
  • Categorical Features: employment_type, required_experience, required_education, industry, function, location, department.
  • Numerical/Binary Features: salary_range (converted to salary_avg), telecommuting, has_company_logo, has_questions.
  • Target Variable: fraudulent, a binary flag that is 0 for a legitimate posting and 1 for a fraudulent one.

Methodology

Data Understanding & Loading

The initial step involves loading the jobs.csv dataset into a Pandas DataFrame. The Data class is used to perform preliminary checks, such as:

  • Viewing the first few rows (head()) to understand the data format.
  • Checking the dimensions (shape) to know the number of rows and columns.
  • Getting a concise summary of the DataFrame (info()) to inspect data types and non-null counts.
  • Generating descriptive statistics (describe(), describe_all()) for numerical and categorical columns.
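
A minimal pandas equivalent of these preliminary checks, assuming jobs.csv sits in the working directory, looks roughly like this (the Data class wraps the same calls):

    import pandas as pd

    # Load the dataset and run the basic exploration steps described above.
    df = pd.read_csv('jobs.csv')
    print(df.head())                    # first rows, to understand the data format
    print(df.shape)                     # (number of rows, number of columns)
    df.info()                           # data types and non-null counts
    print(df.describe(include='all'))   # summary statistics for numerical and categorical columns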

Data Preprocessing & Feature Engineering

The DataPreprocessing class implements crucial steps to prepare the data:

  • Handling Missing Values:
    • Rows with missing description are dropped, as this is a vital text field for fraud detection.
    • Other missing text fields (company_profile, requirements, benefits) are filled with 'Not Provided'.
    • salary_range is processed to extract salary_avg (by taking the average of the range if available, otherwise filled with the median salary). The original salary_range column is then dropped.
    • Other categorical columns (location, department, employment_type, required_experience, required_education, industry, function) are filled with their respective modes.
  • Feature Extraction:
    • New numerical features are engineered from text columns:
      • desc_length, req_length, benefits_length, title_length, profile_length: The character lengths of the respective text fields.
      • desc_dollar_count, desc_exclaim_count: The number of dollar signs and exclamation marks in the job description, often indicators of suspicious language.
      • has_scam_words: A binary indicator (0 or 1) if the description contains common scam-related phrases (e.g., 'money', 'investment', 'fast cash', 'work from home', 'no experience', 'quick earn').
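
The DataPreprocessing class carries out these steps in the notebook; the snippet below is a condensed sketch of the same ideas in plain pandas. The scam keyword list and the column names follow the description above, while everything else (variable names, exact fill order) is illustrative:

    import pandas as pd

    df = pd.read_csv('jobs.csv')

    # Missing values: drop rows without a description, fill other text fields with a placeholder.
    df = df.dropna(subset=['description'])
    for col in ['company_profile', 'requirements', 'benefits']:
        df[col] = df[col].fillna('Not Provided')

    # Engineered features from the description text.
    df['desc_length'] = df['description'].str.len()
    df['desc_dollar_count'] = df['description'].str.count(r'\$')
    df['desc_exclaim_count'] = df['description'].str.count('!')

    scam_words = ['money', 'investment', 'fast cash', 'work from home',
                  'no experience', 'quick earn']
    df['has_scam_words'] = (df['description'].str.lower()
                              .str.contains('|'.join(scam_words))
                              .astype(int))

    df.to_csv('phase1cleaned.csv', index=False)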

Exploratory Data Analysis (EDA)

The Graph, UnivariateAnalysis, and BivariateAnalysis classes are used to visualize the data and uncover patterns:

  • Univariate Analysis:
    • Fraudulent Job Distribution: Shows a significant class imbalance, with a much larger number of legitimate jobs compared to fraudulent ones. This is a common challenge in fraud detection.
    • Employment Type Distribution: Visualizes the most common employment types.
    • Required Experience/Education: Displays the distribution of required experience levels and education backgrounds.
    • Telecommuting/Company Logo/Questions: Shows the counts of remote jobs, jobs with company logos, and jobs requiring screening questions (fraudulent postings, for example, often lack a company logo or are advertised as remote).
    • Top Industries/Job Functions/Locations: Highlights the most frequent industries, job functions, and geographical locations.
  • Bivariate Analysis:
    • Average Salary by Location: Illustrates how average salary varies across top locations.
    • Employment Type vs. Fraudulent: Breaks down fraudulent vs. non-fraudulent counts by employment type, revealing if certain types are more prone to fraud.
    • Fraud by Industry: Pinpoints which industries have the highest counts of fraudulent job postings.
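
As one example of the univariate plots, the class imbalance can be visualized in a few lines of pandas/matplotlib (a sketch, assuming the cleaned DataFrame from the previous step):

    import matplotlib.pyplot as plt

    # Bar chart of legitimate (0) vs. fraudulent (1) postings, highlighting the imbalance.
    df['fraudulent'].value_counts().plot(kind='bar')
    plt.title('Distribution of Fraudulent Job Postings')
    plt.xlabel('fraudulent (0 = legitimate, 1 = fraudulent)')
    plt.ylabel('count')
    plt.show()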

Model Development

The Model class orchestrates the machine learning pipeline:

  • Combined Text Feature: A new 'text' column is created by concatenating title, description, requirements, benefits, and company_profile. This comprehensive text field is central to the NLP approach.
  • TF-IDF Vectorization: TfidfVectorizer is applied to the combined 'text' column. This technique transforms text into a matrix of numerical TF-IDF features, representing the importance of words in the document relative to the corpus. max_features is set to 50,000 to limit the vocabulary size.
  • Feature Stacking: The TF-IDF features are horizontally stacked (hstack) with other numerical/binary features (telecommuting, has_company_logo, has_questions) to create the final feature matrix X.
  • Data Splitting: The dataset is split into training (80%) and testing (20%) sets using train_test_split. Crucially, stratify=self.y is used to ensure that the proportion of fraudulent jobs is maintained in both training and testing sets, addressing the class imbalance. random_state=42 ensures reproducibility.
  • Model Training: A LinearSVC (Linear Support Vector Classifier) is trained on the X_train and y_train data. LinearSVC is suitable for large datasets and linear classification tasks, making it a good choice for TF-IDF features. max_iter is increased to 10,000 to ensure convergence.
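
Put together, the pipeline in the Model class corresponds roughly to the following sketch. The parameter values (max_features=50000, test_size=0.2, random_state=42, max_iter=10000) match the description above; the variable names are illustrative:

    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC

    # Combined text feature from the main free-text columns.
    text_cols = ['title', 'description', 'requirements', 'benefits', 'company_profile']
    df['text'] = df[text_cols].fillna('').agg(' '.join, axis=1)

    # TF-IDF features, stacked with the binary columns.
    vectorizer = TfidfVectorizer(max_features=50000)
    X_text = vectorizer.fit_transform(df['text'])
    X = hstack([X_text, df[['telecommuting', 'has_company_logo', 'has_questions']].values])
    y = df['fraudulent']

    # Stratified 80/20 split and LinearSVC training.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    model = LinearSVC(max_iter=10000)
    model.fit(X_train, y_train)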

Model Evaluation

After training, the model's performance is evaluated on the unseen X_test data:

  • Confusion Matrix: Provides a breakdown of correct and incorrect classifications (True Positives, True Negatives, False Positives, False Negatives).
  • Classification Report: Offers detailed metrics for each class:
    • Precision: The proportion of positive identifications that were actually correct.
    • Recall (Sensitivity): The proportion of actual positives that were correctly identified.
    • F1-score: The harmonic mean of precision and recall, providing a balanced measure.
  • Accuracy Score: The overall proportion of correctly classified instances.
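
These metrics come straight from scikit-learn; a minimal sketch of the evaluation step, assuming the trained model and the test split from the previous section:

    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    y_pred = model.predict(X_test)
    print(confusion_matrix(y_test, y_pred))           # TP / TN / FP / FN breakdown
    print(classification_report(y_test, y_pred))      # precision, recall, F1-score per class
    print('Accuracy:', accuracy_score(y_test, y_pred))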

Results

Model Performance Summary:

Based on the evaluation output for the held-out test set, the LinearSVC model demonstrates strong performance in detecting fraudulent job postings:

              precision    recall  f1-score   support

           0       0.99      1.00      0.99      3403
           1       0.99      0.78      0.87       173

    accuracy                           0.99      3576
   macro avg       0.99      0.89      0.93      3576
weighted avg       0.99      0.99      0.99      3576

Accuracy: 0.9890939597315436
  • Overall Accuracy: Approximately 98.9%. This high accuracy indicates that the model correctly classifies a large majority of job postings.
  • Fraudulent Class (Class 1) Performance:
    • Precision (0.99): When the model predicts a job is fraudulent, it is correct 99% of the time. This is excellent for minimizing false alarms.
    • Recall (0.78): The model identifies 78% of all actual fraudulent job postings. While not perfect, this is a respectable recall, indicating it catches a significant portion of fraud.
    • F1-score (0.87): A strong F1-score for the minority class suggests a good balance between precision and recall in identifying fraud.

These results indicate that the model is highly effective in differentiating between legitimate and fraudulent job postings, making it a valuable tool for enhancing job platform security.

How to Run

To execute this project and train the fraud detection model:

  1. Prerequisites: Ensure you have Python (version 3.x recommended) installed, along with the following libraries:

    • pandas
    • numpy
    • matplotlib
    • scikit-learn
    • scipy
    • copy (built-in)
    • pickle (built-in)

    You can install them via pip:

    pip install pandas numpy matplotlib scikit-learn scipy
  2. Dataset: Make sure the jobs.csv file is located in the same directory as the classes.ipynb notebook.

  3. Execute the Notebook:

    • Open the classes.ipynb file using Jupyter Notebook or JupyterLab.
    • Run all cells sequentially. The script will perform data loading, preprocessing, feature engineering, text vectorization, data splitting, model training, and evaluation.
    • Upon successful execution, a trained model and TF-IDF vectorizer will be saved as svm_fraud_model.pkl in your project directory.
  4. Making New Predictions:

    • To use the trained model for new predictions, load the PickleModel class and use its predict() method. Ensure your new data DataFrame has the required columns (title, description, requirements, benefits, company_profile, telecommuting, has_company_logo, has_questions).
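
    A rough sketch of that prediction flow is shown below. The tuple layout of svm_fraud_model.pkl and the new_jobs DataFrame are assumptions made for illustration; in practice the PickleModel class handles these details internally:

      import pickle
      import pandas as pd
      from scipy.sparse import hstack

      # Assumed pickle layout: (trained LinearSVC, fitted TfidfVectorizer).
      with open('svm_fraud_model.pkl', 'rb') as f:
          model, vectorizer = pickle.load(f)

      # Hypothetical unseen posting with the required columns.
      new_jobs = pd.DataFrame([{
          'title': 'Data Entry Clerk', 'description': 'Earn fast cash from home!',
          'requirements': 'No experience needed', 'benefits': 'Not Provided',
          'company_profile': 'Not Provided', 'telecommuting': 1,
          'has_company_logo': 0, 'has_questions': 0}])

      # Combine text, vectorize, and stack with the binary features, mirroring training.
      text_cols = ['title', 'description', 'requirements', 'benefits', 'company_profile']
      new_jobs['text'] = new_jobs[text_cols].fillna('').agg(' '.join, axis=1)
      X_new = hstack([vectorizer.transform(new_jobs['text']),
                      new_jobs[['telecommuting', 'has_company_logo', 'has_questions']].values])
      print(model.predict(X_new))   # 1 = predicted fraudulent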

Conclusion & Future Work

This project successfully developed a robust fraud detection model capable of classifying job postings with high accuracy. The combination of domain-specific feature engineering, TF-IDF for text representation, and a powerful LinearSVC classifier proved effective in handling the complexity and imbalanced nature of the dataset. The model's high precision for the fraudulent class ensures that legitimate job postings are rarely misflagged, which is crucial for user experience on job platforms. Future work could focus on improving recall for the fraudulent class (currently 0.78), for example by addressing the class imbalance observed during EDA.

