# Detailed Explanation of Model Training Using Logistic Regression

## Introduction

### This document walks step by step through training a logistic regression model to detect fraudulent job postings. The process covers data preprocessing, feature engineering, handling missing values, managing class imbalance, model training, evaluation, and saving the model.



## Step 1: Importing Required Libraries

### To start, we import the necessary libraries used for data manipulation, preprocessing, machine learning, and saving the model:

In [19]:
# Import necessary libraries
import pandas as pd
import numpy as np
import scipy.sparse as sp
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, f1_score
from imblearn.over_sampling import SMOTE
import joblib  # For saving the model

## Step 2: Loading the Dataset

### The dataset, which contains job posting data, is loaded into a pandas DataFrame:

In [4]:

# Load the dataset
file_path = "fake_job_postings.csv"  # Ensure your file path is correct
df = pd.read_csv(file_path)

## Step 3: Dropping Irrelevant Columns

### Some columns, such as job_id, may not contribute to the prediction, so we remove them:

In [5]:
# Drop irrelevant columns
df = df.drop(columns=['job_id'], errors='ignore')

## Step 4: Handling Categorical Variables

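### Categorical columns must be converted to numbers before modeling. Missing values are filled with "Unknown", and each column is then label-encoded, which maps every category to an integer code. The codes follow the sorted order of the category names rather than any meaningful ranking, although logistic regression treats them as numeric magnitudes: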

In [6]:
# Handle categorical features using Label Encoding
categorical_cols = ['employment_type', 'required_experience', 'required_education', 'industry', 'function']
label_encoders = {}  # Store encoders for future use

for col in categorical_cols:
    df[col] = df[col].fillna('Unknown')  # Fill missing values
    le = LabelEncoder()
    df[col] = df[col].astype(str)  # Ensure all data is string
    le.fit(list(df[col].unique()) + ["Unknown"])  # Add "Unknown" explicitly
    df[col] = le.transform(df[col])
    label_encoders[col] = le  # Save the encoder for later use
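
The encoders are stored so that the same mapping can be applied later, for example to a posting entered in the UI. The sketch below shows one way that reuse might look, assuming categories not seen during fitting should fall back to the "Unknown" class that the loop above adds explicitly. The helper `encode_with_fallback` and the toy category values are hypothetical and not part of the original notebook.

```python
from sklearn.preprocessing import LabelEncoder


def encode_with_fallback(encoder, values, fallback="Unknown"):
    """Encode values, mapping unseen categories to the fallback class (hypothetical helper)."""
    known = set(encoder.classes_)
    cleaned = [v if v in known else fallback for v in values]
    return encoder.transform(cleaned)


# Toy encoder fitted the same way as in the loop above
le_demo = LabelEncoder()
le_demo.fit(["Full-time", "Part-time", "Unknown"])

# "Internship" was never seen during fitting, so it is encoded as "Unknown"
codes = encode_with_fallback(le_demo, ["Part-time", "Internship"])
print(codes)  # [1 2]
```

Without such a fallback, `LabelEncoder.transform` raises a `ValueError` on labels it has not seen.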

## Step 5: Handling Text Features Using TF-IDF

### Text cannot be fed directly into the model, so the five text columns are joined into one string per posting and converted into numerical features with TF-IDF (Term Frequency-Inverse Document Frequency), keeping at most the 5,000 most frequent terms:

In [7]:
# Handle text-based features using TF-IDF vectorization
text_features = ['title', 'company_profile', 'description', 'requirements', 'benefits']
df[text_features] = df[text_features].fillna('')  # Fill missing text with empty strings

tfidf = TfidfVectorizer(max_features=5000)  # Convert text into TF-IDF features
text_vectors = tfidf.fit_transform(df[text_features].apply(lambda x: ' '.join(x), axis=1))

## Step 6: Combining Text and Numerical Features

### We stack TF-IDF vectors (textual data) with numerical features to form the final dataset:

In [8]:
# Binary flag columns plus the label-encoded categorical columns
numerical_cols = ['telecommuting', 'has_company_logo', 'has_questions'] + categorical_cols
numerical_data = df[numerical_cols]

In [9]:
# Convert numerical data to sparse format
numerical_sparse = csr_matrix(numerical_data.values)

In [10]:
# Ensure the dimensions match
assert text_vectors.shape[0] == numerical_sparse.shape[0], "Mismatch in number of rows between text and numerical data."

In [11]:
# Stack text and numerical features
X = sp.hstack([text_vectors, numerical_sparse])
y = df['fraudulent']
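
The stacking step can be illustrated on two tiny matrices. This is only a sketch with invented values. It also passes `format="csr"`, which the notebook does not do, on the assumption that row slicing is wanted later: depending on the SciPy version and input formats, `hstack` may return a COO matrix, which does not support indexing.

```python
import numpy as np
import scipy.sparse as sp

# Two toy blocks with the same number of rows (values are invented)
text_part = sp.csr_matrix(np.array([[0.0, 0.7, 0.7],
                                    [1.0, 0.0, 0.0]]))
numeric_part = sp.csr_matrix(np.array([[1, 0],
                                       [0, 1]]))

# Columns of the second block are appended to the right of the first
combined = sp.hstack([text_part, numeric_part], format="csr")

print(combined.shape)      # (2, 5)
print(combined.toarray())
```

The column order matters: anything that builds features at prediction time has to place the text columns first and the numerical columns after them, exactly as during training.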

## Step 7: Handling Class Imbalance Using SMOTE

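### Fraudulent job postings are less common than legitimate ones, so we use SMOTE (Synthetic Minority Over-sampling Technique), which generates synthetic minority-class samples by interpolating between existing ones and their nearest neighbors, to create a balanced dataset: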

In [12]:
# Handle class imbalance using SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

## Step 8: Splitting Data into Training and Testing Sets

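### We split the resampled dataset into 80% training and 20% test data, stratified by class. Because SMOTE was applied to the full dataset before this split, the test set also contains synthetic samples, some interpolated from postings that end up in the training set, so the scores reported in Step 10 are likely optimistic relative to performance on unseen real postings: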

In [13]:
# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)

## Step 9: Training the Logistic Regression Model

### We initialize and train a Logistic Regression classifier:

In [14]:
# Train Logistic Regression model
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
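
For readers who want to see what the fitted classifier computes, the sketch below reproduces a binary logistic regression prediction by hand on toy data: a linear score per sample, passed through the sigmoid function, then thresholded at 0.5. The data and variable names are invented for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy binary problem (not the job postings data)
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=5000).fit(X_toy, y_toy)

# Linear score: weighted sum of the features plus the intercept
scores = X_toy @ clf.coef_.ravel() + clf.intercept_[0]

# The sigmoid turns the score into the probability of class 1
manual_proba = 1.0 / (1.0 + np.exp(-scores))
sklearn_proba = clf.predict_proba(X_toy)[:, 1]

# Class 1 is predicted when that probability exceeds 0.5
manual_labels = (manual_proba > 0.5).astype(int)

print(np.allclose(manual_proba, sklearn_proba))
print((manual_labels == clf.predict(X_toy)).all())
```

Because the score is a weighted sum, features on very different scales (TF-IDF weights below 1 next to integer category codes) can slow convergence, which is one plausible reason for the large `max_iter` above.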

## Step 10: Model Evaluation

### Once trained, the model is evaluated on the test set using the accuracy score and a classification report with per-class precision, recall, and F1:

In [15]:
# Make predictions
y_pred = model.predict(X_test)

In [16]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)

In [17]:
# Print results
print(f"Accuracy: {accuracy:.4f}")
print("Classification Report:\n", report)

Accuracy: 0.9857
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.98      0.99      3403
           1       0.98      1.00      0.99      3403

    accuracy                           0.99      6806
   macro avg       0.99      0.99      0.99      6806
weighted avg       0.99      0.99      0.99      6806
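
The columns of the classification report follow directly from the confusion matrix. As a worked example on ten invented labels (not the model's actual predictions), the sketch below derives precision, recall, and F1 for the positive class by hand and checks them against scikit-learn.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Invented labels: 1 = fraudulent, 0 = legitimate
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # share of flagged postings that are truly fraudulent
recall = tp / (tp + fn)     # share of fraudulent postings that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)         # 4 1 1 4
print(precision, recall, f1)
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```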



## Step 11: Saving the Model and Preprocessing Objects

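### To use the trained model in a Streamlit UI, we save it along with the TF-IDF vectorizer and the accuracy and F1 scores. The label encoders are bundled with their class labels, but the line that would save them is currently commented out: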

In [22]:
# ---- EXPORTING THE MODEL ----

# Correct way to store LabelEncoders with their class labels
final_label_encoders = {}

for col, encoder in label_encoders.items():
    final_label_encoders[col] = {
        "encoder": encoder,   # Store actual LabelEncoder object
        "classes": list(encoder.classes_)  # Store class labels as list
    }

# Compute the metrics that are stored alongside the model
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, pos_label=1)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
print(f"Model f1 of fraudulent: {f1 * 100:.2f}%")

# Make sure the output directory exists before writing to it
import os
os.makedirs("pickles", exist_ok=True)

# Save the metrics, the trained model, and the TF-IDF vectorizer
joblib.dump(accuracy, "pickles/logistic_accuracy.pkl")
joblib.dump(f1, "pickles/logistic_f1_score.pkl")
joblib.dump(model, "pickles/logistic_model.pkl")  # Save trained model
joblib.dump(tfidf, "pickles/logistic_vectorizer.pkl")  # Save TF-IDF vectorizer
#joblib.dump(final_label_encoders, "laxmi_label_encoders.pkl")  # Save encoders with class labels (UNUSED NOW)

print("Model and vectorizers saved successfully!")


Model Accuracy: 98.57%
Model f1 of fraudulent: 98.59%
Model and vectorizers saved successfully!
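
Loading the saved objects back might look like the sketch below. To stay self-contained it trains a tiny text-only model on invented postings and writes it to a temporary directory; the toy data and the `toy_`/`loaded_` names are hypothetical, and only the file names mirror the ones used above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny invented training set: 1 = fraudulent, 0 = legitimate
texts = [
    "earn money fast no experience",
    "senior data analyst role",
    "quick cash work from home",
    "software engineer position",
]
labels = [1, 0, 1, 0]

toy_vec = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=5000).fit(toy_vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "logistic_model.pkl")
    vec_path = os.path.join(tmp_dir, "logistic_vectorizer.pkl")

    # Save both objects, then load them back as the UI would
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_vec, vec_path)
    loaded_model = joblib.load(model_path)
    loaded_vec = joblib.load(vec_path)

# New text must go through the same fitted vectorizer before prediction
new_posting = ["earn quick cash from home"]
prediction = loaded_model.predict(loaded_vec.transform(new_posting))
print(prediction)
```

The model trained in this document expects more than text: its input is the TF-IDF columns followed by the eight numerical columns from Step 6, in that order, so the UI also needs the same category-to-integer mapping that the label encoders provide.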
