In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## 1. Data Preprocessing
Data preprocessing transforms raw data into a clean and usable format by handling missing values, outliers, and ensuring consistent data scales through normalization or standardization. It also includes feature extraction and selection to enhance dataset quality. This step is essential for efficient and accurate data analysis or machine learning model performance.

In [None]:
# importing all the necessary libraries
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

In [None]:
# Reading and displaying the first 10 rows of the CSV file
df=pd.read_csv('/content/drive/MyDrive/fake_job/fake_job_postings.csv')
df.head(10)

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0
5,6,Accounting Clerk,"US, MD,",,,,Job OverviewApex is an environmental consultin...,,,0,0,0,,,,,,0
6,7,Head of Content (m/f),"DE, BE, Berlin",ANDROIDPIT,20000-28000,"Founded in 2009, the Fonpit AG rose with its i...",Your Responsibilities: Manage the English-spea...,Your Know-How: ...,Your Benefits: Being part of a fast-growing co...,0,1,1,Full-time,Mid-Senior level,Master's Degree,Online Media,Management,0
7,8,Lead Guest Service Specialist,"US, CA, San Francisco",,,Airenvy’s mission is to provide lucrative yet ...,Who is Airenvy?Hey there! We are seasoned entr...,"Experience with CRM software, live chat, and p...",Competitive Pay. You'll be able to eat steak e...,0,1,1,,,,,,0
8,9,HP BSM SME,"US, FL, Pensacola",,,Solutions3 is a woman-owned small business who...,Implementation/Configuration/Testing/Training ...,MUST BE A US CITIZEN.An active TS/SCI clearanc...,,0,1,1,Full-time,Associate,,Information Technology and Services,,0
9,10,Customer Service Associate - Part Time,"US, AZ, Phoenix",,,"Novitex Enterprise Solutions, formerly Pitney ...",The Customer Service Associate will be based i...,Minimum Requirements:Minimum of 6 months custo...,,0,1,0,Part-time,Entry level,High School or equivalent,Financial Services,Customer Service,0


In [None]:
# Check for null values in the dataset
df.isnull().sum()

job_id                     0
title                      0
location                 346
department             11547
salary_range           15012
company_profile         3308
description                1
requirements            2696
benefits                7212
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3471
required_experience     7050
required_education      8105
industry                4903
function                6455
fraudulent                 0
dtype: int64

In [None]:
# Fill missing values in text columns with empty strings
text_columns = ['description', 'requirements', 'company_profile', 'title', 'location', 'department', 'salary_range', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']
df[text_columns] = df[text_columns].fillna('')

In [None]:
# Combine text columns into a single column
df['combined_text'] = df[text_columns].apply(lambda x: ' '.join(x), axis=1)

In [None]:
df['combined_text'].head()

0    Food52, a fast-growing, James Beard Award-winn...
1    Organised - Focused - Vibrant - Awesome!Do you...
2    Our client, located in Houston, is actively se...
3    THE COMPANY: ESRI – Environmental Systems Rese...
4    JOB TITLE: Itemization Review ManagerLOCATION:...
Name: combined_text, dtype: object

In [None]:
# Convert to lower case
df['combined_text'] = df['combined_text'].str.lower()

In [None]:
# Remove links
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'http\S+|www\S+|https\S+', '', x, flags=re.MULTILINE))

In [None]:
# Remove newline characters (\n)
df['combined_text'] = df['combined_text'].str.replace('\n', ' ')

In [None]:
# Remove words containing numbers
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'\w*\d\w*', '', x))

In [None]:
# Remove extra spaces
df['combined_text'] = df['combined_text'].str.strip()
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(' +', ' ', x))

In [None]:
# Remove special characters
df['combined_text'] = df['combined_text'].apply(lambda x: re.sub(r'[^a-zA-Z\s]', '', x))
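The cleaning cells above can also be collected into one function, which makes the order of operations easier to see and test. The following is a minimal sketch, not the notebook's own code: `clean_text` and the sample string are hypothetical, and the whitespace collapse is moved to the end on the assumption that stripping special characters can leave double spaces behind.

```python
import re

def clean_text(text: str) -> str:
    """Apply the cleaning steps to a single string (hypothetical helper)."""
    text = text.lower()                          # convert to lower case
    text = re.sub(r'http\S+|www\S+', '', text)   # remove links
    text = text.replace('\n', ' ')               # remove newlines
    text = re.sub(r'\w*\d\w*', '', text)         # remove words containing digits
    text = re.sub(r'[^a-z\s]', '', text)         # remove special characters
    text = re.sub(r'\s+', ' ', text).strip()     # collapse whitespace last
    return text

# Made-up example string, not taken from the dataset.
sample = "Apply NOW at https://example.com!\nEarn $5000 per week, call 555-1234."
cleaned = clean_text(sample)
print(cleaned)  # apply now at earn per week call
```

Applied to the DataFrame, this would be something like `df['combined_text'].apply(clean_text)`.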


In [None]:
!pip install nltk
import nltk
nltk.download('stopwords')

# Removal of stop words
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set makes each membership check constant-time
df['combined_text'] = df['combined_text'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop))



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
# Stemming
stemmer = PorterStemmer()
df['combined_text'] = df['combined_text'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))

In [None]:
# Lemmatization
nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
df['combined_text'] = df['combined_text'].apply(lambda x: ' '.join([lemmatizer.lemmatize(word) for word in x.split()]))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 2. Feature Engineering
Feature engineering involves creating or modifying features to improve machine learning models. This includes selecting relevant features, transforming data, and creating new features based on insights. Techniques include encoding categorical variables, scaling numerical features, and creating interaction terms. Effective feature engineering enhances model accuracy and robustness by capturing underlying data patterns.

In [None]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

In [None]:
# Vectorize the text data with limited features
vectorizer = TfidfVectorizer(max_features=1000)
X_text = vectorizer.fit_transform(df['combined_text'])

In [None]:
# List of non-text columns to be used as features
non_text_columns = ['telecommuting', 'has_company_logo', 'has_questions', 'location', 'department', 'salary_range', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']

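# Fill missing values in non-text columns with a placeholder.
# The categorical columns were already filled with '' in the preprocessing step,
# so fillna alone leaves them unchanged; map those empty strings to the placeholder too.
df[non_text_columns] = df[non_text_columns].fillna('missing').replace('', 'missing')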

In [None]:
# OneHotEncode the categorical non-text columns
categorical_columns = ['location', 'department', 'salary_range', 'employment_type', 'required_experience', 'required_education', 'industry', 'function']
encoder = OneHotEncoder()
X_non_text_encoded = encoder.fit_transform(df[categorical_columns])

In [None]:
# Scale non-categorical non-text columns
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_non_text_scaled = scaler.fit_transform(df[['telecommuting', 'has_company_logo', 'has_questions']])

In [None]:
# Import hstack from scipy.sparse
from scipy.sparse import hstack

# Combine non-categorical and categorical features
X_non_text = hstack([X_non_text_encoded, X_non_text_scaled])

In [None]:
# Combine the text and non-text features
X = hstack([X_text, X_non_text])
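The cells above fit the vectorizer, encoder, and scaler on the full dataset before it is split, so statistics from the held-out rows feed into the features. One possible alternative, sketched here under stated assumptions, is to wrap the same three transformers in a `ColumnTransformer` inside a `Pipeline` so they are fitted on the training rows only. The `toy` frame and its values are made up for illustration, and `handle_unknown='ignore'` is an added choice so that categories unseen during training do not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the job postings frame (all values are made up).
toy = pd.DataFrame({
    'combined_text': ['market intern new york', 'earn money fast home',
                      'account execut sale', 'work home earn cash',
                      'custom servic associ', 'easi money no experi'],
    'employment_type': ['Other', 'missing', 'Full-time', 'missing',
                        'Part-time', 'missing'],
    'has_company_logo': [1, 0, 1, 0, 1, 0],
    'fraudulent': [0, 1, 0, 1, 0, 1],
})

preprocess = ColumnTransformer([
    # A bare column name (not a list) hands TfidfVectorizer a 1-D series of strings.
    ('text', TfidfVectorizer(max_features=1000), 'combined_text'),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['employment_type']),
    ('num', StandardScaler(), ['has_company_logo']),
])

model = Pipeline([
    ('prep', preprocess),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Fit every transformer and the classifier on the training rows only.
train, held_out = toy.iloc[:4], toy.iloc[4:]
model.fit(train.drop(columns='fraudulent'), train['fraudulent'])
preds = model.predict(held_out.drop(columns='fraudulent'))
print(preds)
```

With this layout, `model.predict` applies the already-fitted transformers to new rows, so no separate `hstack` step is needed.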

In [None]:
# Target variable
y = df['fraudulent']

In [None]:
# Split the data into training (70%), test (about 20%), and validation (about 10%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.33, random_state=42)  # validation share: 0.33 * 0.3 ≈ 0.10
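Because fraudulent postings are a small minority, a plain random split can leave the validation set with slightly more or fewer positives than the full data. A minimal sketch of the same two-step split with `stratify`, which keeps the class ratio roughly equal in each part, follows. The arrays are synthetic stand-ins, not the notebook's `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 5% positives (made-up data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 50 + [0] * 950)

# First cut: 70% train, 30% held back.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42)

# Second cut: one third of the held-back 30% becomes validation (10% overall).
X_te, X_va, y_te, y_va = train_test_split(
    X_tmp, y_tmp, test_size=1/3, stratify=y_tmp, random_state=42)

# Report the size and positive rate of each part.
for name, part in [('train', y_tr), ('test', y_te), ('val', y_va)]:
    print(name, len(part), round(float(part.mean()), 3))
```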

## 3. Model Building
Model building with logistic regression for binary classification starts from the prepared data: missing values handled, categorical variables encoded, and features scaled. The data is then split into training, test, and validation sets. A logistic regression model is trained on the training set, and its performance is evaluated on held-out data (here, the validation set) using metrics such as accuracy, precision, and recall. Hyperparameter tuning can be performed to optimize the model. This process helps create a robust predictive model that can effectively classify binary outcomes.

In [None]:
# Import the necessary class
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_val)

print("Logistic Regression Accuracy:", accuracy_score(y_val, y_pred_lr))
print(classification_report(y_val, y_pred_lr))

Logistic Regression Accuracy: 0.9757199322416714
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1672
           1       0.94      0.61      0.74        99

    accuracy                           0.98      1771
   macro avg       0.96      0.80      0.86      1771
weighted avg       0.97      0.98      0.97      1771



## 4. Conclusion
The logistic regression model achieves high overall accuracy (0.976) on the validation set and performs very well on the majority class (class 0), with precision and recall of 0.98 and 1.00, respectively. However, it struggles with the minority class (class 1): precision is 0.94, but recall is only 0.61 and the F1-score 0.74, meaning it misses roughly 39% of the 99 fraudulent postings. The model is therefore reliable when it predicts the majority class but less effective at detecting the minority class. Improving performance on the minority class could involve techniques like resampling, adjusting class weights, or employing more complex models and ensemble methods.
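As one illustration of the class-weight idea, the sketch below fits `LogisticRegression` with and without `class_weight='balanced'` and compares recall on the minority class. It uses synthetic data from `make_classification`, not the job postings, so the numbers it prints say nothing about how the notebook's model would change.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 5% positives); not the job postings dataset.
X_syn, y_syn = make_classification(
    n_samples=4000, n_features=20, weights=[0.95, 0.05], random_state=42)
X_a, X_b, y_a, y_b = train_test_split(
    X_syn, y_syn, test_size=0.25, stratify=y_syn, random_state=42)

recalls = {}
for label, weights in [('default', None), ('balanced', 'balanced')]:
    # 'balanced' reweights each class inversely to its frequency in y_a.
    clf = LogisticRegression(max_iter=1000, class_weight=weights)
    clf.fit(X_a, y_a)
    recalls[label] = recall_score(y_b, clf.predict(X_b))

print(recalls)
```

Balanced weights typically raise minority-class recall at some cost in precision, so both metrics should be checked on the validation set before adopting the change.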

# Comparison
Let's compare the three models based on their performance metrics:

### Logistic Regression
- **Accuracy**: 0.9757
- **Precision (Class 0)**: 0.98
- **Recall (Class 0)**: 1.00
- **F1-Score (Class 0)**: 0.99
- **Precision (Class 1)**: 0.94
- **Recall (Class 1)**: 0.61
- **F1-Score (Class 1)**: 0.74
- **Macro Avg Precision**: 0.96
- **Macro Avg Recall**: 0.80
- **Macro Avg F1-Score**: 0.86
- **Weighted Avg Precision**: 0.97
- **Weighted Avg Recall**: 0.98
- **Weighted Avg F1-Score**: 0.97

### LSTM Model
- **Test Loss (MSE)**: 0.0411
- **Mean Absolute Error (MAE)**: 0.0799

### BERT-Based Model
- **Test Loss (MSE)**: 0.0456
- **Mean Absolute Error (MAE)**: 0.0805
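The LSTM and BERT figures above are MSE and MAE, while the logistic regression figures are classification metrics, so the two sets are not on the same scale. The following sketch uses made-up labels and predicted probabilities (not outputs of any model in this notebook) to show how a low MSE can coexist with modest minority-class recall once the probabilities are thresholded.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted probabilities, made up for illustration.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.4, 0.9])

# Regression-style errors computed on the probabilities.
mse = np.mean((y_true - y_prob) ** 2)
mae = np.mean(np.abs(y_true - y_prob))

# Threshold at 0.5 to get hard labels, then score them as a classifier.
y_hat = (y_prob >= 0.5).astype(int)
precision = precision_score(y_true, y_hat)
recall = recall_score(y_true, y_hat)

print(f"MSE={mse:.3f} MAE={mae:.3f} precision={precision:.2f} recall={recall:.2f}")
# MSE=0.045 MAE=0.150 precision=1.00 recall=0.50
```

Here the MSE is 0.045, yet half of the positive cases are missed. Running the same thresholding on the LSTM and BERT predictions would yield precision, recall, and F1 values that can be set beside the logistic regression report.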

### Comparison and Recommendation:

1. **Logistic Regression**:
   - **Pros**:
     - Very high overall accuracy (97.57%).
     - Excellent performance on the majority class (Class 0).
   - **Cons**:
     - Lower recall for the minority class (Class 1), which may indicate that it misses a significant portion of the positive cases.
   - **Best for**: Situations where the overall accuracy is crucial, but the model might not be as effective in identifying the minority class.

2. **LSTM Model**:
   - **Pros**:
     - Low Mean Squared Error (MSE) and Mean Absolute Error (MAE), indicating good predictive performance.
   - **Cons**:
     - LSTM models are typically more complex and resource-intensive to train and deploy.
   - **Best for**: Time series data or sequential data where capturing the temporal dependencies is crucial.

3. **BERT-Based Model**:
   - **Pros**:
     - Comparable performance to LSTM with slightly higher MSE and MAE.
   - **Cons**:
     - Similar to LSTM, BERT models are computationally expensive and complex.
   - **Best for**: Text data or situations where contextual understanding is important.

### Recommendation:

- If we prioritize overall accuracy and have a significant class imbalance, **Logistic Regression** might be the best choice despite its lower performance on the minority class.
- If the data involves sequences or time-series data, and we need a more nuanced model to capture these dependencies, go for the **LSTM Model**.
- For text data or when we need to leverage contextual information, the **BERT-Based Model** is suitable.

Considering the metrics, **Logistic Regression** is the only model here with reported accuracy and weighted averages, and both are strong. The LSTM and BERT results are given as MSE and MAE, so they cannot be ranked directly against those figures. For more complex or specific tasks, either the LSTM or BERT model could be more appropriate.
