# Model-3 - BERT-Based Modelling

**BERT (Bidirectional Encoder Representations from Transformers)** is a deep learning model developed by Google for natural language processing tasks. It is built from Transformer encoder layers in which every token attends to the tokens on both its left and its right at the same time, rather than reading the text in a single direction. This bidirectional context lets BERT represent a word according to the sentence it appears in, which makes it effective for tasks such as question answering, sentiment analysis, and text classification. BERT-based models have improved the performance of many NLP applications because they capture context more accurately than earlier models.
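One way to picture the bidirectional idea is to compare two attention patterns: one where each token may weigh every other token, and one where tokens to the right are masked out. The following is a toy sketch with uniform scores, not BERT's actual implementation; `attention_weights` is a hypothetical helper written for this illustration.

```python
import math

def attention_weights(scores, causal=False):
    """Row-wise softmax over a square score matrix; optionally mask future positions."""
    n = len(scores)
    weights = []
    for i in range(n):
        # With a causal mask, position i may only look at positions j <= i.
        allowed = [j for j in range(n) if not causal or j <= i]
        exps = {j: math.exp(scores[i][j]) for j in allowed}
        total = sum(exps.values())
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights

tokens = ["the", "bank", "of", "the", "river"]
# Uniform toy scores: every allowed position receives equal weight.
scores = [[0.0] * len(tokens) for _ in tokens]

bidirectional = attention_weights(scores)               # encoder-style pattern
left_to_right = attention_weights(scores, causal=True)  # shown for contrast

bank = tokens.index("bank")
print(bidirectional[bank])  # "bank" can use "river" on its right
print(left_to_right[bank])  # right-hand context is masked out
```

In the unmasked case the row for "bank" spreads weight over all five tokens, including "river"; with the mask it can only use "the" and itself.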


## Data Preprocessing
Data preprocessing transforms raw data into a clean and usable format by handling missing values, outliers, and ensuring consistent data scales through normalization or standardization. It also includes feature extraction and selection to enhance dataset quality. This step is essential for efficient and accurate data analysis or machine learning model performance.

In [78]:
# importing all the necessary libraries
import pandas as pd
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from scipy.sparse import hstack
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, OneHotEncoder


In [21]:
# Reading and displaying the first 10 rows of the CSV file
df=pd.read_csv('fake_job_postings.csv')
df.head(10)

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0
5,6,Accounting Clerk,"US, MD,",,,,Job OverviewApex is an environmental consultin...,,,0,0,0,,,,,,0
6,7,Head of Content (m/f),"DE, BE, Berlin",ANDROIDPIT,20000-28000,"Founded in 2009, the Fonpit AG rose with its i...",Your Responsibilities: Manage the English-spea...,Your Know-How: ...,Your Benefits: Being part of a fast-growing co...,0,1,1,Full-time,Mid-Senior level,Master's Degree,Online Media,Management,0
7,8,Lead Guest Service Specialist,"US, CA, San Francisco",,,Airenvy’s mission is to provide lucrative yet ...,Who is Airenvy?Hey there! We are seasoned entr...,"Experience with CRM software, live chat, and p...",Competitive Pay. You'll be able to eat steak e...,0,1,1,,,,,,0
8,9,HP BSM SME,"US, FL, Pensacola",,,Solutions3 is a woman-owned small business who...,Implementation/Configuration/Testing/Training ...,MUST BE A US CITIZEN.An active TS/SCI clearanc...,,0,1,1,Full-time,Associate,,Information Technology and Services,,0
9,10,Customer Service Associate - Part Time,"US, AZ, Phoenix",,,"Novitex Enterprise Solutions, formerly Pitney ...",The Customer Service Associate will be based i...,Minimum Requirements:Minimum of 6 months custo...,,0,1,0,Part-time,Entry level,High School or equivalent,Financial Services,Customer Service,0


In [23]:
# Fill missing values in text columns with empty strings
text_columns = ['description', 'requirements', 'company_profile', 'title', 'location', 'department', 'salary_range', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']
df[text_columns] = df[text_columns].fillna('')

In [25]:
# Combine text columns into a single column for vectorization
df['combined_text'] = df[text_columns].apply(lambda x: ' '.join(x), axis=1)

In [27]:
# Simplified text preprocessing
df['combined_text'] = df['combined_text'].str.lower()
# Remove URLs
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'http\S+|www\S+', '', x))
df['combined_text'] = df['combined_text'].str.replace('\n', ' ')
# Remove tokens that contain digits, then anything that is not a letter or whitespace
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))
# Collapse repeated whitespace last, so the removals above leave no double spaces
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'\s+', ' ', x))
df['combined_text'] = df['combined_text'].str.strip()
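The cleaning steps above can also be wrapped in one function, which makes them easy to check on a single string. This is a minimal sketch: `clean_text` is a hypothetical helper name, the sample posting is made up, and whitespace is collapsed as the final step so that earlier removals do not leave double spaces.

```python
import re

def clean_text(text: str) -> str:
    """Apply the notebook's cleaning steps to one string (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # drop URLs
    text = text.replace("\n", " ")               # newlines to spaces
    text = re.sub(r"\w*\d\w*", "", text)         # drop tokens containing digits
    text = re.sub(r"[^a-z\s]", "", text)         # keep letters and whitespace only
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace last
    return text

# Made-up sample posting, for illustration only.
sample = "Apply NOW at https://example.com!\nEarn $500/day, 2nd shift."
print(clean_text(sample))  # apply now at earn day shift
```

The same function could be applied to the whole column with `df['combined_text'].apply(clean_text)`.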

In [28]:
# Tokenize the text data
tokenizer = Tokenizer(num_words=5000)  # Limit to the top 5000 words
tokenizer.fit_on_texts(df['combined_text'])
X_text = tokenizer.texts_to_sequences(df['combined_text'])

# Pad sequences to the same length
max_length = 100
X_text_padded = pad_sequences(X_text, maxlen=max_length, padding='post')
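`pad_sequences` is called here with `padding='post'` only. Its documented default for `truncating` is `'pre'`, so postings longer than 100 tokens keep their last 100 tokens and lose their beginning. The pure-Python sketch below mimics that behaviour for one sequence; `pad_or_truncate` is a hypothetical stand-in written for illustration, not the Keras implementation.

```python
def pad_or_truncate(seq, maxlen, padding="post", truncating="pre", value=0):
    """Pure-Python illustration of fixed-length padding/truncation for one sequence."""
    if len(seq) > maxlen:
        # "pre" drops tokens from the start, "post" drops them from the end.
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return seq + pad if padding == "post" else pad + seq

short = [7, 8, 9]
long = [1, 2, 3, 4, 5, 6, 7]

print(pad_or_truncate(short, maxlen=5))                    # [7, 8, 9, 0, 0]
print(pad_or_truncate(long, maxlen=5))                     # [3, 4, 5, 6, 7]
print(pad_or_truncate(long, maxlen=5, truncating="post"))  # [1, 2, 3, 4, 5]
```

If the opening of a posting is considered more informative than its end, passing `truncating='post'` to `pad_sequences` would keep the first 100 tokens instead.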

In [29]:
# List of non-text columns to be used as features
non_text_columns = ['telecommuting', 'has_company_logo', 'has_questions', 'location', 'department', 'salary_range', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']

In [30]:
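# Fill missing values in non-text columns with a placeholder.
# Note: the categorical columns were already filled with '' in the text_columns step,
# so 'missing' can only land in the three binary flag columns.
df[non_text_columns] = df[non_text_columns].fillna('missing')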

In [31]:
# OneHotEncode the categorical non-text columns
categorical_columns = ['location', 'department', 'salary_range', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']
encoder = OneHotEncoder()
X_non_text_encoded = encoder.fit_transform(df[categorical_columns])

In [32]:
# Scale non-categorical non-text columns
scaler = StandardScaler()
X_non_text_scaled = scaler.fit_transform(df[['telecommuting', 'has_company_logo', 'has_questions']])

In [33]:
# Combine non-categorical and categorical features
# (hstack was already imported from scipy.sparse at the top of the notebook)
X_non_text = hstack([X_non_text_encoded, X_non_text_scaled]).toarray()

In [34]:
# Combine the text and non-text features
X_combined = np.hstack((X_text_padded, X_non_text))

## Model Building

Building a model with BERT involves several key steps. First, text data is tokenized using the BERT tokenizer. Then, a pre-trained BERT model, such as `bert-base-uncased`, is selected. To tailor the model for a specific task, a task-specific layer is added on top. The model is fine-tuned on the dataset, typically using a smaller learning rate and fewer epochs compared to training from scratch. During training, appropriate loss functions and optimizers are used. The model’s performance is evaluated using metrics like accuracy or F1 score. Hugging Face's Transformers library makes this process efficient by providing tools for implementation and fine-tuning.

In [36]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

In [37]:
# Split the data into training (70%), test (~20%), and validation (~10%) sets
X_train, X_temp, y_train, y_temp = train_test_split(
    X_combined, df['fraudulent'], test_size=0.3, random_state=42
)
# The "test" portion of this second split becomes the validation set
X_test, X_val, y_test, y_val = train_test_split(
    X_temp, y_temp, test_size=0.33, random_state=42
)  # validation: 0.33 * 0.3 ≈ 0.10; test: 0.67 * 0.3 ≈ 0.20
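The two-stage split can be checked with a little arithmetic. The sketch below computes the share of the full dataset that ends up in each set; `split_fractions` is a hypothetical helper, and real row counts will differ slightly because scikit-learn rounds to whole rows.

```python
def split_fractions(first_test_size, second_test_size):
    """Return (train, test, validation) fractions of the full dataset."""
    train = 1.0 - first_test_size
    # In the cell above the second split returns (X_test, X_val), so the
    # validation set receives `second_test_size` of the held-out part.
    validation = first_test_size * second_test_size
    test = first_test_size * (1.0 - second_test_size)
    return train, test, validation

train_frac, test_frac, val_frac = split_fractions(0.3, 0.33)
print(f"train={train_frac:.3f}, test={test_frac:.3f}, validation={val_frac:.3f}")
# train=0.700, test=0.201, validation=0.099
```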

## Approach
The overall approach for preparing text data for BERT-based modeling can be broken down into several steps:

**Load the Pre-trained BERT Tokenizer**:
Use a pre-trained BERT tokenizer to tokenize and preprocess the text data. The tokenizer converts text into tokens that the BERT model can understand.

**Tokenization Function**:
Define a function to tokenize the text data. This function extracts the text data from the input, tokenizes it, and ensures that all sequences are of uniform length by padding and truncating them as necessary.

**Apply the Tokenization**:
Use the tokenization function to process the training, validation, and test datasets. This converts the raw text data into tokenized format, ready for input into the BERT model.

### Benefits of This Approach

**Efficiency**:
Using a pre-trained tokenizer and model like BERT significantly reduces the computational resources and time required compared to training a model from scratch.

**Consistency**:
Padding and truncation ensure that all input sequences are of uniform length, which is crucial for efficient batch processing.

**Compatibility**:
Converting the tokenized data to PyTorch tensors ensures compatibility with PyTorch-based BERT models, facilitating seamless integration into the training and evaluation pipeline.
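To make the padding step concrete, the sketch below pads a small batch of token-id lists to the length of the longest one and builds the matching attention mask (1 for real tokens, 0 for padding). It is a pure-Python illustration of the idea, not the tokenizer's own code; `pad_batch` is a hypothetical helper and the token ids are made up.

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary

def pad_batch(batch):
    """Pad token-id lists to the longest one; return (input_ids, attention_mask)."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [PAD_ID] * (longest - len(ids)) for ids in batch]
    # The mask tells the model which positions hold real tokens.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return input_ids, attention_mask

# Made-up token ids, for illustration only.
batch = [[101, 2023, 102], [101, 2023, 2003, 1037, 102]]
input_ids, attention_mask = pad_batch(batch)
print(input_ids)       # [[101, 2023, 102, 0, 0], [101, 2023, 2003, 1037, 102]]
print(attention_mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Because every row now has the same length, the batch can be stacked into a single tensor.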

In [38]:
# Load pre-trained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

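# Tokenize the input data
def tokenize_data(texts):
    # NOTE: the rows passed in below come from X_combined, whose first column is the
    # first Keras token id of each posting, so str(row[0]) is a number such as "12.0",
    # not the posting text. To give BERT the real text, pass rows whose first
    # element is the raw string from df['combined_text'].
    return tokenizer([str(row[0]) for row in texts], padding=True, truncation=True, return_tensors="pt")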

train_encodings = tokenize_data(X_train.tolist())
val_encodings = tokenize_data(X_val.tolist())
test_encodings = tokenize_data(X_test.tolist())
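In this notebook `X_train`, `X_val`, and `X_test` are slices of `X_combined`, which holds numeric features only, so `row[0]` is the first padded token id rather than the posting text. One way to keep the raw text aligned with the labels is to split row positions once and use them to select both. The sketch below shows that idea in pure Python under stated assumptions: `split_indices` is a hypothetical helper, the texts and labels are toy stand-ins for `df['combined_text']` and `df['fraudulent']`, and the shuffle is not the one scikit-learn would produce.

```python
import random

def split_indices(n_rows, test_size, seed=42):
    """Shuffle row positions and split them into (train, held_out) lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_held_out = round(n_rows * test_size)
    return indices[n_held_out:], indices[:n_held_out]

# Toy stand-ins for df['combined_text'] and df['fraudulent'].
texts = ["marketing intern", "earn money fast", "account executive", "work from home"]
labels = [0, 1, 0, 1]

train_idx, test_idx = split_indices(len(texts), test_size=0.25)
# Selecting by the same positions keeps each text next to its own label.
train_texts = [texts[i] for i in train_idx]
train_labels = [labels[i] for i in train_idx]
test_texts = [texts[i] for i in test_idx]
test_labels = [labels[i] for i in test_idx]
print(len(train_texts), len(test_texts))  # 3 1
```

With scikit-learn the same effect can be had by passing the text column to `train_test_split` alongside the features and labels, since it accepts any number of arrays; the resulting lists of strings could then be handed to the BERT tokenizer directly.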

In [39]:
# Convert the labels to tensors with Float type
train_labels = torch.tensor(y_train.values, dtype=torch.float32)
val_labels = torch.tensor(y_val.values, dtype=torch.float32)
test_labels = torch.tensor(y_test.values, dtype=torch.float32)

In [40]:
# Define a Dataset class
class FraudDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = self.labels[idx]
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = FraudDataset(train_encodings, train_labels)
val_dataset = FraudDataset(val_encodings, val_labels)
test_dataset = FraudDataset(test_encodings, test_labels)

In [41]:
# Load pre-trained BERT model with a regression head:
# num_labels=1 gives a single output trained with MSE loss, so labels must be floats
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=1)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [42]:
# Define training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="epoch",  # named eval_strategy in recent transformers releases
)



### Benefits of Using Trainer

**Simplifies the Training Process**:
The `Trainer` class abstracts the training loop and many of the details around it, making training easier to implement and manage.

**Integrated Evaluation**:
Given both a training and an evaluation dataset, the `Trainer` evaluates the model during training on the schedule set in the training arguments (here, at the end of each epoch).

**Configuration Management**:
The `training_args` parameter configures the whole training process, including advanced features such as gradient accumulation and mixed precision training.

**Built-in Features**:
Logging and checkpoint saving are built in. Early stopping is available as an opt-in callback (`EarlyStoppingCallback`) rather than being enabled by default.
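To show what early stopping does, the sketch below applies a simple patience rule to the three validation losses from this notebook's training run. It is a simplified illustration of the general technique, not the library's implementation; `early_stop_epoch` is a hypothetical helper.

```python
def early_stop_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Validation losses reported for epochs 1-3 of this notebook's run.
val_losses = [0.055883, 0.056545, 0.053403]
print(early_stop_epoch(val_losses, patience=1))  # 2
print(early_stop_epoch(val_losses, patience=2))  # None
```

With a patience of 1 the run would have stopped after epoch 2 and missed the improvement in epoch 3, which is why the patience value is usually chosen with some slack.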

In [43]:
# Define Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()

Epoch,Training Loss,Validation Loss
1,0.0245,0.055883
2,0.0011,0.056545
3,0.0243,0.053403


TrainOutput(global_step=4695, training_loss=0.047395093910023174, metrics={'train_runtime': 8764.1948, 'train_samples_per_second': 4.284, 'train_steps_per_second': 0.536, 'total_flos': 135067258654776.0, 'train_loss': 0.047395093910023174, 'epoch': 3.0})
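The step count in `TrainOutput` can be related back to the training arguments with a short calculation. The sketch assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; under those assumptions it bounds the training-set size and shows what share of the run the 500 warmup steps cover.

```python
global_step = 4695    # from TrainOutput
num_train_epochs = 3  # from TrainingArguments
batch_size = 8        # per_device_train_batch_size
warmup_steps = 500    # from TrainingArguments

steps_per_epoch = global_step // num_train_epochs
# ceil(n / batch_size) == steps_per_epoch bounds the training-set size n.
min_samples = (steps_per_epoch - 1) * batch_size + 1
max_samples = steps_per_epoch * batch_size
warmup_fraction = warmup_steps / global_step

print(steps_per_epoch)           # 1565
print(min_samples, max_samples)  # 12513 12520
print(f"{warmup_fraction:.1%}")  # 10.6%
```

So roughly the first tenth of training is spent ramping up the learning rate.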

## Model Evaluation
Model evaluation involves assessing a trained model's performance using various metrics to determine its effectiveness and generalization ability. Common metrics for classification tasks include accuracy, precision, recall, F1 score, and AUC-ROC, while regression tasks often use mean squared error (MSE), mean absolute error (MAE), and R-squared. The evaluation process typically includes splitting the data into training and testing sets, training the model on the training set, and evaluating it on the test set to ensure the model performs well on unseen data. Cross-validation can also be used for more robust evaluation.
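Because this notebook trains a single-output model on a binary label, both metric families apply: MSE and MAE on the raw outputs, and precision, recall, and F1 after thresholding them. The sketch below computes both on made-up values; `regression_and_classification_metrics` is a hypothetical helper and the 0.5 threshold is an assumption.

```python
def regression_and_classification_metrics(y_true, y_score, threshold=0.5):
    """Compute MSE/MAE on raw scores and precision/recall/F1 after thresholding."""
    n = len(y_true)
    mse = sum((t - s) ** 2 for t, s in zip(y_true, y_score)) / n
    mae = sum(abs(t - s) for t, s in zip(y_true, y_score)) / n
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"mse": mse, "mae": mae, "precision": precision, "recall": recall, "f1": f1}

# Made-up example: 8 genuine postings, 2 fraudulent ones.
y_true = [0] * 8 + [1, 1]
y_score = [0.05] * 8 + [0.2, 0.9]

metrics = regression_and_classification_metrics(y_true, y_score)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case the MSE is low even though half of the fraudulent postings are missed, which is why the two kinds of metric are worth reading together on an imbalanced target.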

In [65]:
# Evaluate the model
eval_result = trainer.evaluate(eval_dataset=test_dataset)
print(f"Test Loss (MSE): {eval_result['eval_loss']}")

Test Loss (MSE): 0.04579808935523033


In [66]:
# Make predictions on the test set
predictions = trainer.predict(test_dataset)
y_pred = predictions.predictions.flatten()

In [67]:
from sklearn.metrics import mean_absolute_error
# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error (MAE): {mae}')

Mean Absolute Error (MAE): 0.07158705714875933


## Conclusion
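The model reaches a Mean Squared Error (MSE) of 0.0458 and a Mean Absolute Error (MAE) of 0.0716 on the test data. Both errors are low in absolute terms, but because `fraudulent` is a binary label they only measure how far the raw outputs sit from 0 or 1; they do not show how many fraudulent postings are actually caught. On an imbalanced target, low MSE and MAE can coexist with poor recall on the minority class, so classification metrics such as precision, recall, and F1 (computed after thresholding the outputs) are needed before the model can be called accurate.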

## Output Analysis

### BERT-Based Model
- **Test Loss (MSE)**: 0.0458
- **Mean Absolute Error (MAE)**: 0.0716

### Comparison:

3. **BERT-Based Model**:
   - **Plus**: Comparable performance to LSTM with slightly higher MSE and MAE.
   - **Minus**: Similar to LSTM, BERT models are computationally expensive and complex.
   - **Suitable scenario**: Text data or situations where contextual understanding is important.

### Recommendation:

If overall accuracy is the priority, **Logistic Regression** remains a strong contender even with a noticeable class imbalance, although it tends to underperform on the minority class.

In scenarios involving sequential or time-series data, where capturing intricate dependencies is crucial, opting for an **LSTM Model** would be advantageous.

When dealing with textual data or situations requiring nuanced understanding of context, leveraging a **BERT-Based Model** proves highly effective.

When analyzing metrics, the Logistic Regression model consistently demonstrates superior performance across accuracy and weighted averages. However, depending on the complexity and specificity of the task at hand, either the LSTM or BERT model might offer more suitable solutions.
