# Project Part 3

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/Denfire2/cs39aa_project/blob/main/project-part-3.ipynb)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/Denfire2/cs39aa_project/blob/main/project-part-3.ipynb)

Project Part 3: A Deep Learning Model
In this third and final part of the project, you will train a deep learning model on your dataset. The best way to do this will likely be to fine-tune an existing deep learning model such as GPT-2 or BERT. This is the same as what you will do in Assign 5, except that you will use your own dataset rather than the Airline Tweet dataset. It is also possible to train a deep learning model from scratch with either PyTorch or TensorFlow/Keras, but in the real world you will more likely want to leverage the cutting-edge performance of a pre-trained deep learning model such as those available through Hugging Face.

As with Parts 1 and 2, this should be done in a Jupyter notebook and you should add the notebook to the repository where Assign 1 and 2 are. When you are done, you will then get the URL of your project_part3.ipynb notebook in your GitHub repository, and submit that URL here in Canvas.

In [23]:
# Import statements
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from collections import Counter
from keras.preprocessing.text import Tokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score
from tensorflow import keras
from math import exp

Loading the dataset and making it compatible for tokenizing

In [24]:
df = pd.read_csv('../input/real-or-fake-jobs/fake_job_postings.csv', usecols=['title','description','fraudulent'])
df.dropna(inplace=True)
df


Unnamed: 0,title,description,fraudulent
0,Marketing Intern,"Food52, a fast-growing, James Beard Award-winn...",0
1,Customer Service - Cloud Video Production,Organised - Focused - Vibrant - Awesome!Do you...,0
2,Commissioning Machinery Assistant (CMA),"Our client, located in Houston, is actively se...",0
3,Account Executive - Washington DC,THE COMPANY: ESRI – Environmental Systems Rese...,0
4,Bill Review Manager,JOB TITLE: Itemization Review ManagerLOCATION:...,0
...,...,...,...
17875,Account Director - Distribution,Just in case this is the first time you’ve vis...,0
17876,Payroll Accountant,The Payroll Accountant will focus primarily on...,0
17877,Project Cost Control Staff Engineer - Cost Con...,Experienced Project Cost Control Staff Enginee...,0
17878,Graphic Designer,Nemsia Studios is looking for an experienced v...,0


In [25]:
df.drop(df[df['description'].map(len) < 9].index, inplace=True) # drop rows whose description has fewer than 9 characters (too short to say anything)
df.shape

(17871, 3)

Setup and data split for LSTM

In [26]:
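tokenizer = Tokenizer(num_words=10000) # keep only the 10000 most frequent words
tokenizer.fit_on_texts(df.description) # build the word index from the descriptions

# Note: len() on each description string counts characters, not tokens,
# so the cutoff below is typically far longer than a word-level cutoff would be.
num_tokens = [len(tokens) for tokens in df['description']]
num_tokens = np.array(num_tokens)

# Cutoff length: mean plus two standard deviations
max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens = int(max_tokens)
max_tokens
from keras.preprocessing.sequence import pad_sequences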

X = tokenizer.texts_to_sequences(df['description'])
Y = df['fraudulent'] 
X_pad = pad_sequences(X, maxlen=max_tokens)

print(X_pad.shape, Y.shape)

x_train, x_test, y_train, y_test = train_test_split(X_pad, Y, test_size = 0.25, random_state = 42)

print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

(17871, 3007) (17871,)
(13403, 3007) (13403,)
(4468, 3007) (4468,)
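
The padded length of 3007 above comes from character counts. As a point of comparison, here is a minimal sketch of the same mean-plus-two-standard-deviations cutoff computed on token counts instead. It assumes a plain whitespace split as a stand-in for the real tokenizer, and the helper name `length_cutoff` and the toy texts are hypothetical, not part of the notebook.

```python
from statistics import mean, pstdev

def length_cutoff(lengths, num_std=2.0):
    """Return int(mean + num_std * std) for a list of lengths."""
    # pstdev is the population standard deviation, matching np.std's default
    return int(mean(lengths) + num_std * pstdev(lengths))

# Hypothetical toy descriptions, not rows from the dataset
toy_texts = [
    "Marketing intern wanted",
    "Payroll accountant to manage monthly payroll runs",
    "Graphic designer",
]

# Character-based lengths (what len() on a string measures)
char_lengths = [len(text) for text in toy_texts]
# Token-based lengths, using a whitespace split as a stand-in tokenizer (assumption)
token_lengths = [len(text.split()) for text in toy_texts]

char_cutoff = length_cutoff(char_lengths)
token_cutoff = length_cutoff(token_lengths)
print(char_cutoff, token_cutoff)
```

In the notebook itself, the token lengths could presumably be taken from the sequences returned by `tokenizer.texts_to_sequences`, which would give a much shorter padded input and likely faster training.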


Actual LSTM model tuning. This was tedious: the smaller the batch size, the more steps each epoch takes. I don't have the strongest computer, so fitting the model took a while (2+ hours per run) before I finally found a balance between accuracy and being able to see the effect of changes quickly. Epochs were tested at 50 and 20 before finally settling on 5, as neither of the initial tests triggered early stopping. Batch size also contributed to the long fitting times: sizes of 42, 64, and 128 were tested before I chose 256 to speed up the run time (which is still slow).
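
The link between batch size and steps per epoch can be worked out directly. The sketch below is only an illustration: the helper `steps_per_epoch` is hypothetical, and it assumes the validation fraction is held out before batching and that the remaining row count is floored, which is my understanding of how `validation_split` behaves.

```python
import math

def steps_per_epoch(n_rows, batch_size, validation_split=0.0):
    """Number of weight-update steps in one epoch.

    Assumes the validation fraction is removed first and the remaining
    row count is floored to a whole number.
    """
    n_train = math.floor(n_rows * (1.0 - validation_split))
    return math.ceil(n_train / batch_size)

n_rows = 13403  # training rows reported by x_train.shape
for batch_size in (42, 64, 128, 256):
    steps = steps_per_epoch(n_rows, batch_size, validation_split=0.1)
    print(batch_size, steps)
```

Under these assumptions, a batch size of 256 needs 48 steps per epoch, compared with 288 steps at a batch size of 42.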

In [32]:
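model_ker = keras.Sequential()
# Embedding with 20000 vocabulary slots (the tokenizer keeps only 10000 words) and 100-dimensional vectors
model_ker.add(keras.layers.Embedding(20000, 100, input_length=max_tokens))
# Single LSTM layer with 100 units; dropout on both the inputs and the recurrent connections
model_ker.add(keras.layers.LSTM(100, dropout=0.5, recurrent_dropout=0.5))
# Sigmoid output: one score between 0 and 1 per posting for the fraudulent class
model_ker.add(keras.layers.Dense(1, activation='sigmoid'))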

model_ker.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

epochs = 5
batch_size = 256

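# Stop early if val_loss fails to improve by at least 0.001 for 3 consecutive epochs
callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, min_delta=0.001)

history = model_ker.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, validation_split=0.1, callbacks=[callback])
accr = model_ker.evaluate(x_test, y_test) # [loss, accuracy] on the held-out test set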

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
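
To see how the early-stopping patience interacts with a 5-epoch budget, here is a simplified sketch of the patience rule. It is not Keras's implementation, the function `early_stop_epoch` is hypothetical, and the validation-loss curves are made up for illustration rather than taken from this run.

```python
def early_stop_epoch(val_losses, patience=3, min_delta=0.001):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified rule: an epoch counts as an improvement only if val_loss
    drops below the best value so far by more than min_delta.
    """
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical val_loss curves, not values from the run above
steady = [0.30, 0.25, 0.22, 0.20, 0.19]    # improves every epoch
stalled = [0.30, 0.31, 0.305, 0.30, 0.32]  # never beats epoch 1

print(early_stop_epoch(steady))
print(early_stop_epoch(stalled))
```

Under this rule, with `patience=3` the earliest possible stop is after epoch 4, so with only 5 epochs the callback can save at most one epoch.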


The actual predictions of real vs. fake jobs, condensed into a single readable number

In [33]:
pred_lstm = model_ker.predict(x_test,verbose=1,use_multiprocessing=True)
# Label a posting as fraudulent when its predicted score exceeds 0.1 (note: not the usual 0.5)
pred = pred_lstm > 1.0e-1
pred_lstm_rounded = pred.astype(int)
print('LSTM Model test dataset accuracy: {0:0.4f}'.format(metrics.accuracy_score(y_test, pred_lstm_rounded)))

LSTM Model test dataset accuracy: 0.9573
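
Accuracy alone can hide how the decision threshold affects the fraudulent class, especially if fraudulent postings are rare. The sketch below uses hypothetical toy labels and scores (one fraudulent posting out of ten), not this model's predictions, and the helper `summarize` is an assumption for illustration. It compares thresholds of 0.5 and 0.1 on accuracy, precision, and recall.

```python
def summarize(y_true, probs, threshold):
    """Return (accuracy, precision, recall) for the positive class."""
    tp = fp = tn = fn = 0
    for truth, p in zip(y_true, probs):
        pred = 1 if p > threshold else 0
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical toy data: nine real postings (0) and one fraudulent posting (1)
y_true = [0] * 9 + [1]
probs = [0.02, 0.05, 0.03, 0.15, 0.01, 0.04, 0.08, 0.02, 0.06, 0.30]

# Accuracy of always predicting the most common class
majority_accuracy = max(y_true.count(0), y_true.count(1)) / len(y_true)

print("majority baseline:", majority_accuracy)
print("threshold 0.5:", summarize(y_true, probs, 0.5))
print("threshold 0.1:", summarize(y_true, probs, 0.1))
```

In this toy case both thresholds give the same accuracy as the majority-class baseline, but only the 0.1 threshold catches the fraudulent posting. In the notebook, `metrics.classification_report(y_test, pred_lstm_rounded)` from scikit-learn would report the same kind of per-class breakdown.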


Final analysis: The LSTM reached about the same accuracy as the baseline model used in Part 2. While a higher accuracy might be achievable, the time each training run takes would not be viable in the long run or from a business standpoint.