In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Data Preprocessing
Data preprocessing turns the raw job postings into model-ready features. In this notebook that means filling missing values, merging the text fields into one cleaned string, tokenizing and padding that string, one-hot encoding the categorical fields, and standardizing the binary flags. These steps matter because the model can only be as reliable as the features it is given.

In [None]:
# importing all the necessary libraries
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from scipy.sparse import hstack
import numpy as np

In [None]:
# Reading and displaying the first 10 rows of the CSV file
df=pd.read_csv('/content/drive/MyDrive/fake_job/fake_job_postings.csv')
df.head(10)

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0
5,6,Accounting Clerk,"US, MD,",,,,Job OverviewApex is an environmental consultin...,,,0,0,0,,,,,,0
6,7,Head of Content (m/f),"DE, BE, Berlin",ANDROIDPIT,20000-28000,"Founded in 2009, the Fonpit AG rose with its i...",Your Responsibilities: Manage the English-spea...,Your Know-How: ...,Your Benefits: Being part of a fast-growing co...,0,1,1,Full-time,Mid-Senior level,Master's Degree,Online Media,Management,0
7,8,Lead Guest Service Specialist,"US, CA, San Francisco",,,Airenvy’s mission is to provide lucrative yet ...,Who is Airenvy?Hey there! We are seasoned entr...,"Experience with CRM software, live chat, and p...",Competitive Pay. You'll be able to eat steak e...,0,1,1,,,,,,0
8,9,HP BSM SME,"US, FL, Pensacola",,,Solutions3 is a woman-owned small business who...,Implementation/Configuration/Testing/Training ...,MUST BE A US CITIZEN.An active TS/SCI clearanc...,,0,1,1,Full-time,Associate,,Information Technology and Services,,0
9,10,Customer Service Associate - Part Time,"US, AZ, Phoenix",,,"Novitex Enterprise Solutions, formerly Pitney ...",The Customer Service Associate will be based i...,Minimum Requirements:Minimum of 6 months custo...,,0,1,0,Part-time,Entry level,High School or equivalent,Financial Services,Customer Service,0


In [None]:
# Fill missing values in text columns with empty strings
text_columns = ['description', 'requirements', 'company_profile', 'title', 'location', 'department', 'salary_range', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']
df[text_columns] = df[text_columns].fillna('')

In [None]:
# Combine text columns into a single column for vectorization
df['combined_text'] = df[text_columns].apply(lambda x: ' '.join(x), axis=1)

In [None]:
# Simplified text preprocessing: lowercase, strip URLs, drop tokens containing
# digits, remove non-letters, then collapse whitespace last so no gaps remain
df['combined_text'] = df['combined_text'].str.lower()
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'http\S+|www\S+', '', x))
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'[^a-z\s]', '', x))
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'\s+', ' ', x)).str.strip()
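The cleaning steps can also be wrapped in a single function, which makes them easier to test on one string before applying them to the whole column. This is a minimal sketch: the name `clean_text` is hypothetical, and collapsing whitespace as the final step is an assumption about the intended result rather than something the notebook states.

```python
import re

def clean_text(text):
    """Lowercase, strip URLs, drop digit-bearing tokens, keep letters only."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # remove URLs
    text = re.sub(r'\w*\d\w*', '', text)         # remove tokens containing digits
    text = re.sub(r'[^a-z\s]', '', text)         # remove punctuation and symbols
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    return text.strip()

sample = clean_text("Apply NOW at http://x.co/job123!\nEarn $5000 weekly")
print(sample)
```

With a DataFrame, the same function could be applied as `df['combined_text'].apply(clean_text)`.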

In [None]:
# Tokenize the text data
tokenizer = Tokenizer(num_words=5000)  # Keep only the most frequent words (indices below 5000)
tokenizer.fit_on_texts(df['combined_text'])
X_text = tokenizer.texts_to_sequences(df['combined_text'])

# Pad shorter sequences with trailing zeros; longer ones are truncated to max_length
max_length = 100
X_text_padded = pad_sequences(X_text, maxlen=max_length, padding='post')
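One detail worth knowing: `padding='post'` only controls where zeros are added. Truncation is governed separately by `truncating`, whose default is `'pre'`, so a posting longer than 100 tokens keeps its last 100 tokens. The pure-Python sketch below imitates that behavior on a toy input. The helper `pad_post_truncate_pre` is hypothetical and is not how Keras implements it.

```python
def pad_post_truncate_pre(sequences, maxlen, value=0):
    """Imitate pad_sequences(..., padding='post', truncating='pre')."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                       # drop tokens from the front
        padded.append(kept + [value] * (maxlen - len(kept)))  # zeros at the end
    return padded

example = pad_post_truncate_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(example)
```

If the opening of a posting is more informative than its ending, passing `truncating='post'` to `pad_sequences` would keep the first tokens instead.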

In [None]:
# List of non-text columns to be used as features
non_text_columns = ['telecommuting', 'has_company_logo', 'has_questions', 'location', 'department', 'salary_range', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']

In [None]:
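# Fill missing values in non-text columns with a placeholder.
# The categorical columns were already filled with '' earlier, so map
# empty strings to the placeholder as well.
df[non_text_columns] = df[non_text_columns].replace('', 'missing').fillna('missing')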

In [None]:
# OneHotEncode the categorical non-text columns
categorical_columns = ['location', 'department', 'salary_range', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']
encoder = OneHotEncoder()
X_non_text_encoded = encoder.fit_transform(df[categorical_columns])
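Columns such as `location` and `salary_range` are close to free text, so one-hot encoding them creates one column per distinct value. One option, not used in this notebook, is to merge rare values into a single bucket before encoding. The sketch below shows the idea on a toy list. The helper `group_rare` and the `min_count` threshold are assumptions for illustration.

```python
from collections import Counter

def group_rare(values, min_count=2, other="other"):
    """Replace values seen fewer than min_count times with a shared label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]

locations = ["US, NY, New York", "US, NY, New York", "NZ, , Auckland", "DE, BE, Berlin"]
grouped = group_rare(locations, min_count=2)
print(grouped)
```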

In [None]:
# Scale non-categorical non-text columns
scaler = StandardScaler()
X_non_text_scaled = scaler.fit_transform(df[['telecommuting', 'has_company_logo', 'has_questions']])
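For reference, standardization subtracts the column mean and divides by the population standard deviation. The sketch below reproduces that arithmetic in plain Python on a toy binary flag. The function `standardize` is a hypothetical stand-in, not the scikit-learn implementation.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return (x - mean) / std for each value, using population std."""
    mean = fmean(column)
    std = pstdev(column)  # ddof=0, matching StandardScaler
    if std == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(x - mean) / std for x in column]

flags = [0, 1, 1, 1]
scaled = standardize(flags)
print(scaled)
```

For 0/1 flags the result is still a two-valued column, only shifted and rescaled.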

In [None]:
# Combine categorical and scaled features (hstack is already imported above).
# .toarray() converts the sparse matrix to a dense array, which can be large.
X_non_text = hstack([X_non_text_encoded, X_non_text_scaled]).toarray()

In [None]:
# Combine the text and non-text features
X_combined = np.hstack((X_text_padded, X_non_text))

## Model Building
An LSTM reads its input as a sequence, one value per timestep, and is trained with backpropagation through time. Here two stacked LSTM layers feed a single dense output unit, and the model is trained with mean squared error against the 0/1 `fraudulent` label on a 20% subset of the data for quick testing.
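For readers who want the mechanics, the common formulation of a single LSTM step is given below. The symbols are the usual textbook ones and are not defined elsewhere in this notebook: $x_t$ is the input at timestep $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication.

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

With `LSTM(10)`, each of $h_t$ and $c_t$ is a vector of length 10.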

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

# Use a smaller subset of the data for quick testing
X_subset, _, y_subset, _ = train_test_split(
    X_combined, df['fraudulent'], test_size=0.8, random_state=42
)

# Split the subset into training (70%), test (~20%), and validation (~10%) sets
X_train, X_temp, y_train, y_temp = train_test_split(
    X_subset, y_subset, test_size=0.3, random_state=42
)
X_test, X_val, y_test, y_val = train_test_split(
    X_temp, y_temp, test_size=0.33, random_state=42
)  # validation share: 0.33 * 0.3 ≈ 0.10 of the subset

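# Define the model.
# Note: 'fraudulent' is a 0/1 label, so this setup regresses on the label with MSE;
# a sigmoid output with binary cross-entropy is the usual choice for classification.
model = Sequential()
model.add(LSTM(10, return_sequences=True, input_shape=(X_train.shape[1], 1)))
model.add(LSTM(10))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')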

# Reshape the input data to 3D
X_train = np.reshape(X_train, (X_train.shape[0], X_train.shape[1], 1))
X_test = np.reshape(X_test, (X_test.shape[0], X_test.shape[1], 1))
X_val = np.reshape(X_val, (X_val.shape[0], X_val.shape[1], 1))

# Train the model
model.fit(
    X_train,
    y_train,
    validation_data=(X_val, y_val),
    epochs=10,
    batch_size=16,
    verbose=1,
)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
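Because the splits are nested, it helps to work out what share of the full dataset each set actually receives. The sketch below does that arithmetic with the three `test_size` values used in the training cell. The function name `split_fractions` is hypothetical, and the results are approximate fractions, since `train_test_split` rounds to whole rows.

```python
def split_fractions(subset_test_size=0.8, temp_size=0.3, val_share_of_temp=0.33):
    """Return (train, validation, test) as fractions of the full dataset."""
    subset = 1 - subset_test_size                 # share kept for quick testing
    train = subset * (1 - temp_size)
    val = subset * temp_size * val_share_of_temp
    test = subset * temp_size * (1 - val_share_of_temp)
    return train, val, test

train_frac, val_frac, test_frac = split_fractions()
print(f"train={train_frac:.4f} val={val_frac:.4f} test={test_frac:.4f}")
```

Under these settings the model trains on about 14% of the postings, validates on about 2%, and is tested on about 4%.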


<keras.src.callbacks.History at 0x7a4feb20a3e0>

## Model Evaluation
Model evaluation involves assessing a trained model's performance using various metrics to determine its effectiveness and generalization ability. Common metrics for classification tasks include accuracy, precision, recall, F1 score, and AUC-ROC, while regression tasks often use mean squared error (MSE), mean absolute error (MAE), and R-squared. The evaluation process typically includes splitting the data into training and testing sets, training the model on the training set, and evaluating it on the test set to ensure the model performs well on unseen data. Cross-validation can also be used for more robust evaluation.
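As a concrete illustration of the classification metrics named above, the sketch below thresholds continuous scores and computes precision, recall, and F1 for the positive class in plain Python. The function `binary_report`, the 0.5 threshold, and the toy scores are assumptions for illustration. They are not outputs of this notebook.

```python
def binary_report(y_true, y_score, threshold=0.5):
    """Precision, recall, and F1 for class 1 after thresholding the scores."""
    y_hat = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_hat) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_hat) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

report = binary_report([0, 0, 1, 1, 0, 1], [0.1, 0.6, 0.8, 0.4, 0.2, 0.9])
print(report)
```

The same function could be called on a model's predicted scores, for example `binary_report(list(y_test), y_pred.ravel())`, to report results for the fraudulent class directly.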

In [None]:
from sklearn.metrics import mean_absolute_error

# Evaluate the model on the test set
test_loss = model.evaluate(X_test, y_test)
print(f'Test Loss (MSE): {test_loss}')

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred)
print(f'Mean Absolute Error (MAE): {mae}')

Test Loss (MSE): 0.04109606891870499
Mean Absolute Error (MAE): 0.07985357789084066
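To put these numbers in context, it is useful to compare them with what a constant prediction would score on an imbalanced 0/1 label. The sketch below uses a hypothetical label vector with 5% positives. That rate is an assumption chosen for illustration, not a figure measured in this notebook.

```python
def constant_baseline_errors(y_true, constant):
    """MSE and MAE when every prediction equals the same constant."""
    n = len(y_true)
    mse = sum((y - constant) ** 2 for y in y_true) / n
    mae = sum(abs(y - constant) for y in y_true) / n
    return mse, mae

labels = [1] * 5 + [0] * 95                                # hypothetical 5% positive rate
mse_zero, mae_zero = constant_baseline_errors(labels, 0.0)   # always predict 0
mse_rate, mae_rate = constant_baseline_errors(labels, 0.05)  # always predict the base rate
print(mse_zero, mae_zero, mse_rate, mae_rate)
```

Always predicting 0 gives MSE and MAE equal to the positive rate, so test errors should be read against the actual share of fraudulent postings in `y_test` (available as `y_test.mean()`).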


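## Conclusion
The LSTM reaches a Mean Squared Error (MSE) of 0.0411 and a Mean Absolute Error (MAE) of 0.0799 on the test data. These numbers need care: `fraudulent` is a binary label, so this is a classification task scored with regression metrics. When positives are rare, a model that always predicts 0 has MSE and MAE equal to the fraction of fraudulent postings, so low error by itself does not show that fraud is being detected. Thresholding the predictions and reporting precision, recall, and F1 for the fraudulent class would give a clearer picture.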

# Comparison
Let's compare the three models based on their performance metrics:

### Logistic Regression
- **Accuracy**: 0.9757
- **Precision (Class 0)**: 0.98
- **Recall (Class 0)**: 1.00
- **F1-Score (Class 0)**: 0.99
- **Precision (Class 1)**: 0.94
- **Recall (Class 1)**: 0.61
- **F1-Score (Class 1)**: 0.74
- **Macro Avg Precision**: 0.96
- **Macro Avg Recall**: 0.80
- **Macro Avg F1-Score**: 0.86
- **Weighted Avg Precision**: 0.97
- **Weighted Avg Recall**: 0.98
- **Weighted Avg F1-Score**: 0.97
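The per-class F1 scores and macro averages above can be cross-checked from the listed precision and recall values. The sketch below does so using only the rounded figures from the list, so small differences in the last digit are expected.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

f1_class0 = f1_from(0.98, 1.00)
f1_class1 = f1_from(0.94, 0.61)
macro_precision = (0.98 + 0.94) / 2
macro_recall = (1.00 + 0.61) / 2

print(round(f1_class0, 2), round(f1_class1, 2))
print(macro_precision, macro_recall)
```

The recomputed values agree with the reported ones to within rounding. The macro average weights both classes equally, which is why it sits well below the weighted average here.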

### LSTM Model
- **Test Loss (MSE)**: 0.0411
- **Mean Absolute Error (MAE)**: 0.0799

### BERT-Based Model
- **Test Loss (MSE)**: 0.0456
- **Mean Absolute Error (MAE)**: 0.0805

### Comparison and Recommendation:

1. **Logistic Regression**:
   - **Pros**:
     - Very high overall accuracy (97.57%).
     - Excellent performance on the majority class (Class 0).
   - **Cons**:
     - Lower recall for the minority class (Class 1): at 0.61, about 39% of the positive cases are missed.
   - **Best for**: Situations where overall accuracy matters most and missing some minority-class cases is acceptable.

2. **LSTM Model**:
   - **Pros**:
     - Low Mean Squared Error (MSE) and Mean Absolute Error (MAE), although these regression metrics are not directly comparable with the classification metrics reported for Logistic Regression.
   - **Cons**:
     - LSTM models are typically more complex and resource-intensive to train and deploy.
   - **Best for**: Time series data or sequential data where capturing the temporal dependencies is crucial.

3. **BERT-Based Model**:
   - **Pros**:
     - Error close to the LSTM's (MSE 0.0456 vs. 0.0411, MAE 0.0805 vs. 0.0799).
   - **Cons**:
     - Similar to LSTM, BERT models are computationally expensive and complex.
   - **Best for**: Text data or situations where contextual understanding is important.

### Recommendation:

- If overall accuracy is the priority, **Logistic Regression** is the strongest candidate here. With a significant class imbalance, though, accuracy is dominated by the majority class, so its 0.61 recall on Class 1 means roughly 4 in 10 minority-class cases are missed.
- If the data involves sequences or time-series data, and we need a more nuanced model to capture these dependencies, go for the **LSTM Model**.
- For text data or when we need to leverage contextual information, the **BERT-Based Model** is suitable.

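Considering the metrics, **Logistic Regression** is the only model with reported accuracy, precision, and recall, and it performs well on accuracy and weighted averages. The LSTM and BERT models are scored only with MSE and MAE, so a fair ranking would require evaluating all three with the same classification metrics. For more complex or specific tasks, either the LSTM or BERT model could still be more appropriate.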
