# SDS Challenge #3 - Job Postings

## Problem Statement

Welcome, Data Scientist, to the 3rd SDS Club Monthly Challenge! In this month's challenge, you are helping your friend search for a job. Your friend has found thousands of job ads online and is trying to pick some to apply to. Your friend has heard that many job ads are fraudulent and are actually scams. Your mission is to help your friend by predicting whether a job is fraudulent based on the data provided.

## Evaluation

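Submissions are scored on accuracy, where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives:

\begin{equation*}
accuracy = \frac{TP + TN}{TP + TN + FP + FN}
\end{equation*}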
<br>
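
As a minimal sketch of the accuracy formula, the score can be computed directly from the four confusion-matrix counts. The function names `confusion_counts` and `accuracy_from_counts` and the example labels are illustrative assumptions, not part of the challenge materials.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 means fraudulent."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy_from_counts(tp, tn, fp, fn):
    """Return (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Made-up labels, for illustration only
y_true_demo = [0, 1, 0, 0, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true_demo, y_pred_demo)
demo_accuracy = accuracy_from_counts(tp, tn, fp, fn)
print(tp, tn, fp, fn, round(demo_accuracy, 4))
```

Here four of the six made-up predictions match the true labels, so the accuracy is 4/6.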

## Understanding the Dataset

Each column in the dataset is labeled and explained in more detail below.

**title** - title of the job in ad <br>
**location** - location of job ad <br>
**department** - corporate department <br>
**salary_range** - salary range of job <br>
**company_profile** - description of company <br>
**description** - description of position <br>
**requirements** - description of job requirements <br>
**benefits** - benefits offered by the employer <br>
**telecommuting** - if telecommuting position <br>
**has_company_logo** - if the company's logo is present in the ad <br>
**has_questions** - if interview questions are present in ad <br>
**employment_type** - type of employment (full-time, part-time, contract, etc.) <br>
**required_experience** - required experience for job (master's degree, bachelor, doctorate, etc.) <br>
**industry** - industry of company (Construction, Health Care, IT, etc.) <br>
**function** - function of company within industry (consulting, sales, research, etc.) <br>
**fraudulent** - whether job is fraudulent or not <br>

## Dataset Files

**public_jobs.csv** - Dataset to train and analyze <br>
**pred_jobs.csv** - Dataset to predict whether or not a job posting is fraudulent

## Submission

All submissions should be sent through email to challenges@superdatascience.com. When submitting, the file should contain predictions made on the pred_jobs.csv file, and it should have the following format:

In [1]:
0
1
0
0
1
0

0

## Acknowledgements

The data was collected and published by The University of the Aegean, Laboratory of Information & Communication Systems Security.

## Importing the libraries

In [2]:
import numpy as np
import pandas as pd

## Importing the dataset

In [3]:
dataset = pd.read_csv('public_jobs.csv')
dataset.drop(['title', 'location', 'department', 'salary_range','company_profile', 'requirements', 'employment_type', 'required_experience',"benefits", "industry", "required_education", 'function' ],
  axis='columns', inplace=True)

dataset_test = pd.read_csv('pred_jobs.csv')
dataset_test.drop(['title', 'location', 'department', 'salary_range','company_profile', 'requirements', 'employment_type', 'required_experience',"benefits", "industry", "required_education", 'function' ],
  axis='columns', inplace=True)

dataset

Unnamed: 0,description,telecommuting,has_company_logo,has_questions,fraudulent
0,"Centra Windows an established, employee-owned ...",0,1,1,0
1,"Londoners, New Yorkers, Parisians, and Berline...",0,1,1,0
2,POS-X is a rapidly growing point-of-sale hardw...,0,1,1,0
3,Data Center Migration Application Lead / Archi...,0,0,0,1
4,"Hayes Corp is looking for a patient, meticulou...",0,1,0,0
...,...,...,...,...,...
14299,"As an Outside Sales Representative, you must h...",0,1,0,0
14300,We are a well-funded financial technology star...,0,1,0,0
14301,Come be one of the charter members of our sale...,0,1,1,0
14302,We’re looking for an enthusiastic and experien...,0,1,0,0


In [4]:
dataset_test

Unnamed: 0,description,telecommuting,has_company_logo,has_questions
0,Papa John’s is one of the world’s biggest and ...,0,1,1
1,J-Curve Technologies is currently in search of...,0,1,1
2,Job Title: Technical Solution ConsultantDepart...,0,0,0
3,"AVAILABLE POSITIONS:Catering staffs, managers,...",0,0,0
4,Full time Sales Manager/Customer Service for a...,0,1,0
...,...,...,...,...
3571,Bowery is an easier way to set up your develop...,0,1,1
3572,"Play with kids, get paid for it :-)Love travel...",0,1,0
3573,Want to build a career in IT? Free training in...,0,0,0
3574,We seek a full time User Experience Designer a...,0,1,1


## Cleaning the texts

In [5]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
removes = ['against','no', 'nor', 'not','don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't",'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
all_stopwords = stopwords.words('english')
new_stop_words = list(set(all_stopwords).difference(removes))
ps = PorterStemmer()  # create the stemmer once instead of once per row
stop_set = set(new_stop_words)  # build the lookup set once instead of once per word
for i in range(len(dataset)):
  review = re.sub('[^a-zA-Z]', ' ', str(dataset['description'][i]))  # replace everything that is not a letter with a space
  review = review.lower()
  review = review.split()
  review = [ps.stem(word) for word in review if word not in stop_set]  # drop stop words and stem the rest
  review = ' '.join(review)
  corpus.append(review)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [6]:
print(corpus[:3])  # preview a few cleaned descriptions; printing the whole corpus exceeded the notebook's IOPub data rate limit


In [7]:
len(dataset_test)

3576

In [8]:
test_corpus = []
ps = PorterStemmer()  # create the stemmer once instead of once per row
stop_set = set(new_stop_words)  # build the lookup set once instead of once per word
for i in range(len(dataset_test)):
  review = re.sub('[^a-zA-Z]', ' ', str(dataset_test['description'][i]))  # replace everything that is not a letter with a space
  review = review.lower()
  review = review.split()
  review = [ps.stem(word) for word in review if word not in stop_set]  # drop stop words and stem the rest
  review = ' '.join(review)
  test_corpus.append(review)
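
The training and prediction descriptions go through the same cleaning steps, so the logic could live in one helper. The sketch below is one possible shape for it, not the notebook's own code: the name `clean_text`, the tiny stop-word set, and the identity default for `stem` are assumptions made so the example runs without NLTK. In the notebook, `new_stop_words` and `PorterStemmer().stem` would be passed in instead.

```python
import re
from typing import Callable, Iterable


def clean_text(text: str,
               stop_words: Iterable[str],
               stem: Callable[[str], str] = lambda word: word) -> str:
    """Keep letters only, lowercase, drop stop words, and stem each token."""
    stop_set = set(stop_words)
    letters_only = re.sub(r"[^a-zA-Z]", " ", str(text))
    tokens = letters_only.lower().split()
    return " ".join(stem(token) for token in tokens if token not in stop_set)


# Hypothetical stop words, for illustration only
demo_stop_words = {"the", "is", "a", "for"}
cleaned_demo = clean_text("The job is NOT a scam_for 2 people!", demo_stop_words)
print(cleaned_demo)
```

Using a single helper for both datasets guarantees that the vocabulary fitted on the training text is applied to prediction text that was cleaned in exactly the same way.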

## Creating the Bag of Words model

In [9]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 40000) # 40000 was chosen from the number of columns of X observed in a previous run
X = cv.fit_transform(corpus).toarray() # learn the vocabulary from the training corpus and build a 2D array of word counts, keeping only the most frequent words
X_test = cv.transform(test_corpus).toarray() # reuse the fitted vocabulary so the columns match those of X
y = dataset.iloc[:, -1].values
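
To make concrete what the count matrix holds, here is a simplified pure-Python illustration of bag-of-words counting on a toy corpus. It is not scikit-learn's implementation: the function `bag_of_words`, the toy documents, and the alphabetical tie-breaking rule are assumptions chosen for readability. The point it shows is that the vocabulary is learned from the training documents only and then reused for new documents.

```python
from collections import Counter


def bag_of_words(train_docs, other_docs, max_features=None):
    """Build count rows for both document lists from a training-only vocabulary."""
    totals = Counter(word for doc in train_docs for word in doc.split())
    # Most frequent words first; ties broken alphabetically (an assumption)
    ranked = sorted(totals, key=lambda word: (-totals[word], word))
    vocab = sorted(ranked[:max_features])

    def to_rows(docs):
        rows = []
        for doc in docs:
            counts = Counter(doc.split())
            rows.append([counts[word] for word in vocab])
        return rows

    return vocab, to_rows(train_docs), to_rows(other_docs)


# Toy, already-cleaned documents (made up for illustration)
train_docs = ["earn money fast", "money money guarante", "softwar engin job"]
new_docs = ["fast money job offer"]

vocab, X_train_demo, X_new_demo = bag_of_words(train_docs, new_docs, max_features=4)
print(vocab)
print(X_train_demo)
print(X_new_demo)
```

Words that fall outside the kept vocabulary, such as `job` and `offer` in the new document, simply contribute nothing to its row.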

In [10]:
len(X_test[0])

40000

In [11]:
len(X[0])

40000

In [12]:
print(X_test)
X_test.shape

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


(3576, 40000)

In [13]:
print(X)
X.shape

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


(14304, 40000)

In [14]:
X = np.concatenate((X, dataset.iloc[:, 1:-1].values), 1)
X_test = np.concatenate((X_test, dataset_test.iloc[:, 1:]), 1)
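
The concatenation above appends the three binary columns (`telecommuting`, `has_company_logo`, `has_questions`) to the right of the word counts, which is why the column count grows from 40000 to 40003. A small sketch with made-up arrays shows the same shape arithmetic; the array names and values are assumptions for illustration.

```python
import numpy as np

# 2 documents x 3 vocabulary words (made-up counts)
text_counts_demo = np.array([[0, 2, 1],
                             [1, 0, 0]])

# 2 documents x 3 binary flags: telecommuting, has_company_logo, has_questions
flags_demo = np.array([[0, 1, 1],
                       [0, 0, 0]])

# axis=1 joins column-wise, so each row keeps its counts followed by its flags
features_demo = np.concatenate((text_counts_demo, flags_demo), axis=1)
print(features_demo.shape)
print(features_demo)
```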

In [15]:
print(X)

[[0 0 0 ... 0 1 1]
 [0 0 0 ... 0 1 1]
 [0 0 0 ... 0 1 1]
 ...
 [0 0 0 ... 0 1 1]
 [0 0 0 ... 0 1 0]
 [0 0 0 ... 0 1 1]]


In [16]:
X.shape

(14304, 40003)

In [17]:
print(X_test)

[[0 0 0 ... 0 1 1]
 [0 0 0 ... 0 1 1]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 1 1]
 [0 0 0 ... 0 1 0]]


In [18]:
X_test.shape

(3576, 40003)

## Training the Random Forest Classification model on the Training set

In [19]:
from sklearn.ensemble import RandomForestClassifier

classifier = RandomForestClassifier(n_estimators = 10, criterion = "entropy", max_depth = 5, random_state = 0)
classifier.fit(X, y)

RandomForestClassifier(criterion='entropy', max_depth=5, n_estimators=10,
                       random_state=0)

## Computing the accuracy with k-Fold Cross Validation

In [20]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator = classifier, X = X, y = y, cv = 10)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 95.10 %
Standard deviation: 0.11 %
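
An accuracy figure is easier to interpret next to a trivial baseline: the score of a model that always predicts the most common class. The sketch below computes that baseline. The function name `majority_baseline_accuracy` and the labels are made up for illustration and say nothing about the actual class balance of the challenge data; in the notebook, `y` could be passed in instead.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Made-up labels: 8 legitimate ads (0) and 2 fraudulent ads (1)
demo_labels = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]
baseline_label, baseline_accuracy = majority_baseline_accuracy(demo_labels)
print(baseline_label, baseline_accuracy)
```

If the cross-validated accuracy is close to this baseline, the classifier may be doing little more than predicting the majority class.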


## Predicting values

In [21]:
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
y_pred

array([0, 0, 0, ..., 0, 0, 0])

## Building the csv file

In [22]:
df = pd.DataFrame(y_pred, columns=["fraudulent"])

df_total = pd.concat([pd.read_csv("pred_jobs.csv"), df], axis=1)

print(df_total)

                                             title  ... fraudulent
0     Part-time Pizza Delivery Drivers - Wallasey   ...          0
1            Director of Contact Center Operations  ...          0
2                    Technical Solution Consultant  ...          0
3         Vacancies At The Cafe Royal Hotel London  ...          0
4        Car Dealer Sales Manager/Customer Service  ...          0
...                                            ...  ...        ...
3571          Community Manager & Customer Support  ...          0
3572            Graduates: English Teacher Abroad   ...          0
3573                Associate Business Development  ...          0
3574           UX Designer / Information Architect  ...          0
3575                     Back Office PHP Developer  ...          0

[3576 rows x 17 columns]


In [24]:
df_total.to_csv("pred_jobs.csv", index=False)  # note: this overwrites the original pred_jobs.csv, now with the fraudulent column appended
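
The submission format shown earlier is a single column of 0/1 values. As a hedged alternative to overwriting the input file, the sketch below writes only the predictions to a separate file. The function name `write_predictions`, the file name `submission.csv`, and the choice to omit a header row are assumptions; the challenge text does not say whether a header is expected.

```python
import csv
import os
import tempfile


def write_predictions(predictions, path):
    """Write one 0/1 prediction per line, without a header row."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for value in predictions:
            writer.writerow([int(value)])


# Made-up predictions, written to a temporary directory for the demo
demo_predictions = [0, 1, 0, 0, 1, 0]
demo_path = os.path.join(tempfile.mkdtemp(), "submission.csv")
write_predictions(demo_predictions, demo_path)

# Read the file back to confirm its contents
with open(demo_path, newline="") as handle:
    written_values = [row[0] for row in csv.reader(handle)]
print(written_values)
```

In the notebook, `y_pred` would take the place of `demo_predictions`, and the original `pred_jobs.csv` would stay unchanged.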