**job_id**: Unique Job ID  
**title**: The title of the job ad entry  
**location**: Geographical location of the job ad  
**department**: Corporate department (e.g. sales)  
**salary_range**: Indicative salary range (e.g. \$50,000-\$60,000)  
**company_profile**: A brief company description  
**description**: The detailed description of the job ad  
**requirements**: Listed requirements for the job opening  
**benefits**: Listed benefits offered by the employer  
**telecommuting**: True for telecommuting positions  
**has_company_logo**: True if company logo is present  
**has_questions**: True if screening questions are present  
**employment_type**: Full-time, Part-time, Contract, etc  
**required_experience**: Executive, Entry level, Intern, etc  
**required_education**: Doctorate, Master’s Degree, Bachelor, etc  
**industry**: Automotive, IT, Health care, Real estate, etc  
**function**: Consulting, Engineering, Research, Sales etc  
**fraudulent**: Target attribute for classification (1 for a fraudulent posting, 0 otherwise)  

In [288]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [289]:
# Loading data into dataframe
job_data = pd.read_csv('dataset/fake_job_postings.csv')

In [290]:
# First 5 rows
job_data.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


In [291]:
# Number of rows and columns
job_data.shape

(17880, 18)

In [292]:
job_data.describe()

Unnamed: 0,job_id,telecommuting,has_company_logo,has_questions,fraudulent
count,17880.0,17880.0,17880.0,17880.0,17880.0
mean,8940.5,0.042897,0.795302,0.491723,0.048434
std,5161.655742,0.202631,0.403492,0.499945,0.214688
min,1.0,0.0,0.0,0.0,0.0
25%,4470.75,0.0,1.0,0.0,0.0
50%,8940.5,0.0,1.0,0.0,0.0
75%,13410.25,0.0,1.0,1.0,0.0
max,17880.0,1.0,1.0,1.0,1.0


In [293]:
# Checking type of data
job_data.dtypes

job_id                  int64
title                  object
location               object
department             object
salary_range           object
company_profile        object
description            object
requirements           object
benefits               object
telecommuting           int64
has_company_logo        int64
has_questions           int64
employment_type        object
required_experience    object
required_education     object
industry               object
function               object
fraudulent              int64
dtype: object

In [294]:
# Checking missing values
job_data.isna().sum()

job_id                     0
title                      0
location                 346
department             11547
salary_range           15012
company_profile         3308
description                1
requirements            2695
benefits                7210
telecommuting              0
has_company_logo           0
has_questions              0
employment_type         3471
required_experience     7050
required_education      8105
industry                4903
function                6455
fraudulent                 0
dtype: int64
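
The raw counts above are easier to compare as fractions of the 17880 rows; for instance, salary_range is missing in 15012 rows, about 84%. A minimal sketch of that calculation, using a small made-up frame (`toy`) in place of `job_data`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for job_data (values are invented for illustration).
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "salary_range": [np.nan, np.nan, "10-20", np.nan],
    "benefits": ["x", np.nan, "y", np.nan],
})

# isna() yields booleans, so mean() is the fraction of missing values per column.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same expression would be `job_data.isna().mean()`.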

**For simplicity, only the columns that seem most relevant will be kept: title, location, company_profile, description, requirements, benefits, has_questions, employment_type, required_experience and required_education (plus the fraudulent target)**

In [295]:
# Dropping the columns that will not be used (columns= already implies axis=1)
job_data_filtered_columns = job_data.drop(columns=['job_id', 'department', 'salary_range', 'telecommuting', 'has_company_logo', 'industry', 'function'])

In [296]:
job_data_filtered_columns.head()

Unnamed: 0,title,location,company_profile,description,requirements,benefits,has_questions,employment_type,required_experience,required_education,fraudulent
0,Marketing Intern,"US, NY, New York","We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,Other,Internship,,0
1,Customer Service - Cloud Video Production,"NZ, , Auckland","90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,Full-time,Not Applicable,,0
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,,,,0
3,Account Executive - Washington DC,"US, DC, Washington",Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,Full-time,Mid-Senior level,Bachelor's Degree,0
4,Bill Review Manager,"US, FL, Fort Worth",SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,1,Full-time,Mid-Senior level,Bachelor's Degree,0


**NaN values will be replaced with empty strings**

In [297]:
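# Keep non-null values and replace NaN with '' (the mask is built from the filtered frame itself)
clean_job_data = job_data_filtered_columns.where(pd.notnull(job_data_filtered_columns), '')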

In [298]:
clean_job_data.isna().sum()

title                  0
location               0
company_profile        0
description            0
requirements           0
benefits               0
has_questions          0
employment_type        0
required_experience    0
required_education     0
fraudulent             0
dtype: int64

In [299]:
clean_job_data.head()

Unnamed: 0,title,location,company_profile,description,requirements,benefits,has_questions,employment_type,required_experience,required_education,fraudulent
0,Marketing Intern,"US, NY, New York","We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,Other,Internship,,0
1,Customer Service - Cloud Video Production,"NZ, , Auckland","90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,Full-time,Not Applicable,,0
2,Commissioning Machinery Assistant (CMA),"US, IA, Wever",Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,,,,0
3,Account Executive - Washington DC,"US, DC, Washington",Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,Full-time,Mid-Senior level,Bachelor's Degree,0
4,Bill Review Manager,"US, FL, Fort Worth",SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,1,Full-time,Mid-Senior level,Bachelor's Degree,0


**For location, only the country will be considered**

In [300]:
def get_country(location):
    return location.split(',')[0]

In [301]:
clean_job_data['location'] = clean_job_data.location.apply(get_country)
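
`get_country` relies on every location already being a string, which holds here because NaN was replaced with `''` earlier. As a sketch only, a more defensive variant (the name `get_country_safe` is hypothetical, not part of the notebook) could also tolerate non-string input and stray whitespace:

```python
def get_country_safe(location):
    """Hypothetical variant of get_country: returns '' for non-string input
    and strips whitespace around the country code."""
    if not isinstance(location, str):
        return ""
    return location.split(",")[0].strip()


print(get_country_safe("US, NY, New York"))
print(repr(get_country_safe(float("nan"))))
```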

In [302]:
clean_job_data.head()

Unnamed: 0,title,location,company_profile,description,requirements,benefits,has_questions,employment_type,required_experience,required_education,fraudulent
0,Marketing Intern,US,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,Other,Internship,,0
1,Customer Service - Cloud Video Production,NZ,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,Full-time,Not Applicable,,0
2,Commissioning Machinery Assistant (CMA),US,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,,,,0
3,Account Executive - Washington DC,US,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,Full-time,Mid-Senior level,Bachelor's Degree,0
4,Bill Review Manager,US,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,1,Full-time,Mid-Senior level,Bachelor's Degree,0


In [303]:
clean_job_data.dtypes

title                  object
location               object
company_profile        object
description            object
requirements           object
benefits               object
has_questions           int64
employment_type        object
required_experience    object
required_education     object
fraudulent              int64
dtype: object

In [304]:
# has_questions is converted to string so it can be concatenated with the other text columns
clean_job_data['has_questions'] = clean_job_data['has_questions'].astype(str)

In [305]:
clean_job_data.dtypes

title                  object
location               object
company_profile        object
description            object
requirements           object
benefits               object
has_questions          object
employment_type        object
required_experience    object
required_education     object
fraudulent              int64
dtype: object

**A single column concatenating the text of all the features will be created and used as X**

In [306]:
clean_job_data['text'] = clean_job_data['title'] + ' - ' + clean_job_data['location'] + ' - ' + clean_job_data['company_profile'] + ' - ' + clean_job_data['description'] + ' - ' + clean_job_data['requirements'] +  ' - ' + clean_job_data['benefits'] + ' - ' + clean_job_data['has_questions'] + ' - ' + clean_job_data['employment_type'] + ' - ' + clean_job_data['required_experience'] + ' - ' + clean_job_data['required_education']
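
The long chain of `+` can also be written as a join over a list of column names, which is easier to keep in sync when columns change. A minimal sketch on a made-up frame (`toy` and `text_columns` are illustrative names, and only three columns are used):

```python
import pandas as pd

# Toy frame with a subset of the notebook's columns (values are invented).
toy = pd.DataFrame({
    "title": ["Marketing Intern", "Graphic Designer"],
    "location": ["US", "NG"],
    "has_questions": [0, 1],
})

text_columns = ["title", "location", "has_questions"]

# Cast every column to str, then join the values of each row with ' - '.
toy["text"] = toy[text_columns].astype(str).apply(lambda row: " - ".join(row), axis=1)
print(toy["text"].tolist())
```

Because of the `astype(str)` call, this form does not need a separate conversion step for integer columns such as has_questions.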

In [307]:
clean_job_data.head()

Unnamed: 0,title,location,company_profile,description,requirements,benefits,has_questions,employment_type,required_experience,required_education,fraudulent,text
0,Marketing Intern,US,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,Other,Internship,,0,"Marketing Intern - US - We're Food52, and we'v..."
1,Customer Service - Cloud Video Production,NZ,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,Full-time,Not Applicable,,0,Customer Service - Cloud Video Production - NZ...
2,Commissioning Machinery Assistant (CMA),US,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,,,,0,Commissioning Machinery Assistant (CMA) - US -...
3,Account Executive - Washington DC,US,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,Full-time,Mid-Senior level,Bachelor's Degree,0,Account Executive - Washington DC - US - Our p...
4,Bill Review Manager,US,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,1,Full-time,Mid-Senior level,Bachelor's Degree,0,Bill Review Manager - US - SpotSource Solution...


In [308]:
# Checking the class distribution of the fraudulent column
clean_job_data['fraudulent'].value_counts()

0    17014
1      866
Name: fraudulent, dtype: int64

In [309]:
# Separating features and label
X = clean_job_data['text']
Y = clean_job_data['fraudulent']

In [310]:
print(X)

0        Marketing Intern - US - We're Food52, and we'v...
1        Customer Service - Cloud Video Production - NZ...
2        Commissioning Machinery Assistant (CMA) - US -...
3        Account Executive - Washington DC - US - Our p...
4        Bill Review Manager - US - SpotSource Solution...
                               ...                        
17875    Account Director - Distribution  - CA - Vend i...
17876    Payroll Accountant - US - WebLinc is the e-com...
17877    Project Cost Control Staff Engineer - Cost Con...
17878    Graphic Designer - NG -  - Nemsia Studios is l...
17879    Web Application Developers - NZ - Vend is look...
Name: text, Length: 17880, dtype: object


In [311]:
print(Y)

0        0
1        0
2        0
3        0
4        0
        ..
17875    0
17876    0
17877    0
17878    0
17879    0
Name: fraudulent, Length: 17880, dtype: int64


## Splitting into test and train data

In [312]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2, stratify=Y)

In [313]:
print(X.shape, X_train.shape, X_test.shape)

(17880,) (14304,) (3576,)


In [314]:
# Checking the percentage of fraudulent offers in the original, train and test sets. The stratified split worked well.
print(Y.value_counts()[1] / Y.shape[0] * 100, Y_train.value_counts()[1] / Y_train.shape[0] * 100,  Y_test.value_counts()[1] / Y_test.shape[0] * 100)

4.8434004474272925 4.844798657718121 4.837807606263982
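
To see what `stratify` does in isolation, here is a small sketch on synthetic labels with 5% positives (the arrays are made up; only `train_test_split` and its parameters come from the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples, 5 of them positive.
y = np.array([0] * 95 + [1] * 5)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class proportions the same in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=2, stratify=y
)
print(y.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, a rare class can end up over- or under-represented in the test set purely by chance.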


## Feature Extraction

In [315]:
# Transform the text data to feature vectors that can be used as input to the logistic regression model
feature_extraction = TfidfVectorizer(min_df=1, stop_words='english', lowercase=True)

In [316]:
X_train_features = feature_extraction.fit_transform(X_train)

In [317]:
X_test_features = feature_extraction.transform(X_test)
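
The vectorizer is fitted on the training text only and then reused on the test text, so the test set never influences the vocabulary or the IDF weights. A toy sketch of that behaviour (the documents are invented; the vectorizer settings match the ones above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["earn money fast from home", "senior software engineer position"]
test_docs = ["earn money as an engineer", "unseen vocabulary only"]

vectorizer = TfidfVectorizer(min_df=1, stop_words="english", lowercase=True)
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF
test_matrix = vectorizer.transform(test_docs)        # reuses them unchanged

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)
# Words never seen during fitting are ignored, so this row has no non-zero entries.
print(test_matrix[1].nnz)
```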

In [318]:
print(X_train_features)

  (0, 8234)	0.014734757409852492
  (0, 7219)	0.01924431474006488
  (0, 61014)	0.022542582944754877
  (0, 7157)	0.02827633913423761
  (0, 54324)	0.01893656277664071
  (0, 1210)	0.032522105634496326
  (0, 71099)	0.028062161608947343
  (0, 51625)	0.017956469674083493
  (0, 68875)	0.035179037500392626
  (0, 2504)	0.051820605078989605
  (0, 73807)	0.0334801357210345
  (0, 31479)	0.021197803710340038
  (0, 85952)	0.020743958199954574
  (0, 59260)	0.04170679019013523
  (0, 26016)	0.018159703315203855
  (0, 54125)	0.022398486441750135
  (0, 57473)	0.018920308388946566
  (0, 12934)	0.02869244794773342
  (0, 12730)	0.06010212843900789
  (0, 65509)	0.06117662343572307
  (0, 43578)	0.027643128310565937
  (0, 2299)	0.04366479782323314
  (0, 64310)	0.040714060589611634
  (0, 43608)	0.04312137312773858
  (0, 12928)	0.044152438575253285
  :	:
  (14303, 26004)	0.046729657080594844
  (14303, 47005)	0.02314745501558671
  (14303, 50197)	0.01912787590284715
  (14303, 51710)	0.01690150686537308
  (14303, 86

## Training Logistic Regression Model

In [319]:
model = LogisticRegression()

In [320]:
# Training the logistic regression model with training data (features_data)
model.fit(X_train_features, Y_train)
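
An alternative way to organise the same two steps is a scikit-learn pipeline, which bundles the vectorizer and the classifier so that raw text goes in and predictions come out. This is a sketch on four invented documents, not the notebook's actual training run; `text_clf`, `texts` and `labels` are illustrative names:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented mini corpus: 1 = fraudulent, 0 = legitimate.
texts = [
    "earn money fast from home no experience needed",
    "wire transfer fee required to start earning money fast",
    "senior software engineer for backend services team",
    "account executive with enterprise sales experience",
]
labels = [1, 1, 0, 0]

# fit() runs fit_transform on the vectorizer and then fits the classifier;
# predict() runs transform and then predict.
text_clf = make_pipeline(
    TfidfVectorizer(min_df=1, stop_words="english", lowercase=True),
    LogisticRegression(),
)
text_clf.fit(texts, labels)

prediction = text_clf.predict(["earn money fast"])
print(prediction)
```

With a pipeline there is no separate `transform` call to forget at prediction time.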

## Evaluating the model

In [321]:
# Prediction on training data
prediction_on_training_data = model.predict(X_train_features)
accuracy_on_training_data = accuracy_score(Y_train, prediction_on_training_data)

In [322]:
print(f"Accuracy on training data is {accuracy_on_training_data}")

Accuracy on training data is 0.975601230425056


In [323]:
# Prediction on test data
prediction_on_test_data = model.predict(X_test_features)
accuracy_on_test_data = accuracy_score(Y_test, prediction_on_test_data)

In [324]:
print(f"Accuracy on test data is {accuracy_on_test_data}")

Accuracy on test data is 0.9717561521252797


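**Training and test accuracy are very similar (about 0.976 vs 0.972), which suggests the model is not overfitting. Since only about 4.8% of the postings are fraudulent, accuracy alone does not show how many fraudulent postings are actually detected.**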
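
With 17014 legitimate and 866 fraudulent postings, a model that always predicts 0 would already reach 17014 / 17880, about 95.2% accuracy. Precision, recall and the confusion matrix give a clearer picture of the minority class. The sketch below uses made-up labels to show how the numbers can diverge; it is not the result of the model trained above:

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up labels: 95 legitimate postings (0) and 5 fraudulent ones (1).
y_true = [0] * 95 + [1] * 5
# A hypothetical model that flags only one of the five fraudulent postings.
y_pred = [0] * 95 + [1, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
matrix = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted

print(accuracy, precision, recall)
print(matrix)
```

In the notebook, the same functions could be called with `Y_test` and `prediction_on_test_data`.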

## Predictive System

In [358]:
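# Row used to test a posting labelled 0 (legitimate)
input_offer = job_data.iloc[0, :]
# Alternative row to test a posting labelled 1 (fraudulent)
#input_offer = job_data.iloc[98, :]
# Dropping columns (a pandas Series is indexed by labels, so use labels=)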
input_offer_filtered = input_offer.drop(labels=['job_id', 'department', 'salary_range', 'telecommuting', 'has_company_logo', 'industry', 'function'])
# Removing NaN
input_offer_filtered = input_offer_filtered.where((pd.notnull(input_offer_filtered)), '')
# Getting the country
input_offer_filtered['location'] = get_country(input_offer_filtered['location'])
# has_questions to string
input_offer_filtered['has_questions'] = str(input_offer_filtered['has_questions'])
# Creating text column
input_offer_filtered['text'] = input_offer_filtered['title'] + ' - ' + input_offer_filtered['location'] + ' - ' + input_offer_filtered['company_profile'] + ' - ' + input_offer_filtered['description'] + ' - ' + input_offer_filtered['requirements'] +  ' - ' + input_offer_filtered['benefits'] + ' - ' + input_offer_filtered['has_questions'] + ' - ' + input_offer_filtered['employment_type'] + ' - ' + input_offer_filtered['required_experience'] + ' - ' + input_offer_filtered['required_education']


In [359]:
# Prediction features
X_prediction_features = feature_extraction.transform([input_offer_filtered['text']])

In [360]:
prediction = model.predict(X_prediction_features)

In [361]:
if prediction[0] == 0:
    print("True job offer")
else:
    print("Fake job offer")

True job offer
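
The preprocessing for a single posting repeats the steps applied to the training frame, so it may be worth collecting them in one function. The helper below is a hypothetical sketch (`offer_to_text` and `TEXT_COLUMNS` are not part of the notebook) that builds the same `' - '` joined text from one row:

```python
import numpy as np
import pandas as pd

# Columns concatenated into the text feature, in the notebook's order.
TEXT_COLUMNS = [
    "title", "location", "company_profile", "description", "requirements",
    "benefits", "has_questions", "employment_type", "required_experience",
    "required_education",
]


def offer_to_text(offer):
    """Hypothetical helper: turn one posting (a pandas Series) into model input text."""
    parts = []
    for column in TEXT_COLUMNS:
        value = offer.get(column, "")
        if pd.isna(value):
            value = ""  # same rule as the training data: NaN becomes ''
        value = str(value)
        if column == "location":
            value = value.split(",")[0]  # keep only the country
        parts.append(value)
    return " - ".join(parts)


# Invented example row with the same fields as the dataset.
offer = pd.Series({
    "job_id": 1, "title": "Marketing Intern", "location": "US, NY, New York",
    "company_profile": "We're Food52", "description": "desc",
    "requirements": "reqs", "benefits": np.nan, "has_questions": 0,
    "employment_type": "Other", "required_experience": "Internship",
    "required_education": np.nan, "fraudulent": 0,
})
print(offer_to_text(offer))
```

The returned string could then be passed to `feature_extraction.transform([...])` and `model.predict` as in the cells above.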
