# 2. Classifier: Recruiter Or Not?

After running 01_recruiter_collect_mails, run this notebook to classify the recruiter mails with the pre-trained classifier.
<div class="alert alert-danger"><b>!IMPORTANT!</b> Sklearn version '0.21.3' must be installed, either here in the notebook or in your environment.</div>
<div class="alert alert-danger"><b>!IMPORTANT!</b> Make sure to install ipysheet correctly before running</div>

## 2.1 Load dependencies and dataset

For the spreadsheet magic we will need...

In [None]:
!pip install ipysheet

If you roll with JupyterLab like I do, you also need to:

In [None]:
!jupyter labextension install @jupyter-widgets/jupyterlab-manager
!jupyter labextension install ipysheet

<div class="alert alert-info"><b>For JupyterLab users</b>: horizontal scrolling of ipysheets/ipywidgets doesn't work correctly in the standard view
(anything wider than the output area is truncated and can't be scrolled horizontally).
I recommend that you use "Create New View for Output" when these widgets are needed.</div>

In [None]:
import pandas as pd
import numpy as np
import pickle
import json
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
import ipysheet

In [None]:
# Load your set of emails from 01_recruiter_collect_emails.ipynb here with the correct file name
with open('files/hide/scraped_mails.json') as f:
    test_set = json.load(f)
test_set_df = pd.DataFrame(test_set)

To run the dummy dataset instead uncomment the following two lines:

In [None]:
#test_set_df = pd.read_csv('files/dummy_data/job_email_examples.csv', usecols = ['name', 'email', 'subject', 'domain', 'firstname', 'lastname', 'language', 'date', 'message_cleaned'])
#test_set_df.rename(columns={'message_cleaned': 'message'}, inplace=True)

In [None]:
# Let's get that date data into a nice column
# Skip for job_email_examples as test_set
test_set_df['date'] = test_set_df['date'].apply(lambda x: ','.join(map(str, x[0:3])))
test_set_df['date'] = pd.to_datetime(test_set_df['date'], format='%Y,%m,%d')
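
To make the date step concrete, here is a minimal sketch on made-up data. It assumes each scraped `date` value is a sequence whose first three entries are year, month and day, which is what the slicing above implies; `toy_df` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical sample: each date is a list that starts with year, month, day
toy_df = pd.DataFrame({"date": [[2019, 10, 3, 14, 5], [2020, 1, 15, 9, 30]]})

# Join the first three entries into a "Y,m,d" string ...
toy_df["date"] = toy_df["date"].apply(lambda x: ",".join(map(str, x[0:3])))
# ... and parse it with a format that matches the comma-separated layout
toy_df["date"] = pd.to_datetime(toy_df["date"], format="%Y,%m,%d")

print(toy_df["date"].tolist())
```

Anything after the third entry (hours, minutes and so on) is dropped, so the resulting column only has day resolution.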

In [None]:
# Drop duplicate emails by subject. This works well for sparse datasets or when job titles and locations rarely overlap exactly.
# If there is such an overlap, it is best to hold off and drop the duplicates once we have a structured dataset in 02.1
test_set_df.drop_duplicates(subset='subject', keep='first', inplace=True)
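
A small sketch of what dropping by subject does, using a hypothetical `toy_df`. Note that the surviving rows keep their original index labels, which is worth remembering for the index-based steps later in this notebook.

```python
import pandas as pd

# Hypothetical mails: the first two share a subject but come from different senders
toy_df = pd.DataFrame({
    "subject": ["Data Scientist role", "Data Scientist role", "Invoice 42"],
    "email": ["a@example.com", "b@example.com", "c@example.com"],
})

# Keep only the first mail for each distinct subject
deduped = toy_df.drop_duplicates(subset="subject", keep="first")

# The index is not reset, so the labels of dropped rows leave gaps
print(deduped.index.tolist())
```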

In [None]:
# Optional: check whether any dates seem to be missing
test_set_df['date'].unique()

## 2.2 Classify the data

Ohhhh, machine learning!

In [None]:
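# Reuse the pickled vocabulary so the feature columns line up with the model; fit_transform re-estimates the IDF weights from these subjects
with open("files/vectorizer.pickle", "rb") as f:
    vectorizer = TfidfVectorizer(vocabulary=pickle.load(f))
X = vectorizer.fit_transform(test_set_df.subject)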
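
The sketch below shows what a fixed vocabulary does, using a made-up three-word vocabulary instead of the pickled one (`toy_vocabulary` and `toy_subjects` are hypothetical). The vocabulary pins the number and order of the feature columns, while `fit_transform` still computes the IDF weights from whatever text it is given.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical vocabulary; the real one is loaded from files/vectorizer.pickle
toy_vocabulary = {"data": 0, "engineer": 1, "opportunity": 2}
toy_subjects = ["Data engineer opportunity", "Your invoice", "New opportunity"]

# Only words in the vocabulary become features; everything else is ignored
toy_vectorizer = TfidfVectorizer(vocabulary=toy_vocabulary)
X_toy = toy_vectorizer.fit_transform(toy_subjects)

# One row per subject, one column per vocabulary term
print(X_toy.shape)
# A subject without any vocabulary word ends up as an all-zero row
print(X_toy[1].sum())
```

Because the columns are fixed, the matrix can be fed to a model trained on the same vocabulary even when the new subjects contain unseen words.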

In [None]:
with open("files/SVM_recruiter_model.pickle", 'rb') as f:
    model = pickle.load(f)

In [None]:
def predict_labels(clf, features):
    # Return the classifier's predicted label for each row of the feature matrix
    return clf.predict(features)

In [None]:
y_pred = predict_labels(model, X)
test_set_df['prediction'] = y_pred

## 2.3 Filter results and ground truth

We are not interested in job-related emails from platforms, but the classifier will predict them as recruiter mails, so we need to filter them out.
We also want to ground truth the results by reviewing them manually.

Filter

Feel free to add a domain if this does not cover all of the false positives in your dataset.

In [None]:
test_set_df['class'] = np.nan
is_predicted = (test_set_df['prediction'] == 1)
not_predicted = (test_set_df['prediction'] == 0)
is_empty = (test_set_df['class'].isna())
is_commercial = (test_set_df['domain'].isin(['linkedin.com', 'glassdoor.com', 'medium.com', 'quora.com', 'datacamp.com']))

test_set_df.loc[is_empty & is_commercial, 'class'] = 0
test_set_df.loc[is_predicted & is_empty & ~is_commercial, 'class'] = 1
test_set_df.loc[not_predicted & is_empty, 'class'] = 0
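
Here is the same masking logic on a three-row toy frame, as a rough sketch. The domains and predictions are made up; the point is that a platform domain overrides a positive prediction.

```python
import numpy as np
import pandas as pd

# Hypothetical mails: a platform mail, a real recruiter mail, an unrelated mail
toy_df = pd.DataFrame({
    "domain": ["linkedin.com", "acme-recruiting.example", "example.org"],
    "prediction": [1, 1, 0],
})
toy_df["class"] = np.nan

is_predicted = toy_df["prediction"] == 1
is_commercial = toy_df["domain"].isin(["linkedin.com", "glassdoor.com"])

# Platform mails are never recruiters, whatever the model says
toy_df.loc[is_commercial, "class"] = 0
# Positive predictions from other domains are kept as recruiters
toy_df.loc[is_predicted & ~is_commercial, "class"] = 1
# Negative predictions stay non-recruiter
toy_df.loc[~is_predicted, "class"] = 0

print(toy_df["class"].tolist())
```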

Review the recruiter mails

Tick the change_class checkbox for the ones that should be non-recruiter.

<div class="alert alert-danger"><b>!IMPORTANT!</b> If the mail is a reply or otherwise isn't an initial offer/ad for a job, it shouldn't be in the recruiter_df!</div>

In [None]:
recruiter_df = test_set_df[test_set_df['class'] == 1].copy()
# Truncate long names and subjects so the sheet stays readable
recruiter_df.loc[:, 'name'] = recruiter_df.loc[:, 'name'].str[:15]
recruiter_df.loc[:, 'subject'] = recruiter_df.loc[:, 'subject'].str[:100]
# Add an unticked boolean column that shows up as a checkbox in the sheet
recruiter_df = recruiter_df.assign(change_class=False)
# Keep only the columns needed for the review
recruiter_df.drop(['email', 'message', 'language', 'date', 'firstname', 'lastname', 'domain', 'prediction'], axis=1, inplace=True)
recruiter_sheet = ipysheet.from_dataframe(recruiter_df)
recruiter_sheet.layout.height = '600px'
recruiter_sheet


Turn the sheet back into a df

If a row is marked in the sheet with a checkbox, change the class

In [None]:
recruiter_df = ipysheet.to_dataframe(recruiter_sheet)
recruiter_df.loc[recruiter_df['change_class'] == True, 'class'] = 0
# NOTE: ipysheet messes up the index turning it into strings
recruiter_df.index = pd.to_numeric(recruiter_df.index)
recruiter_incorrect = (recruiter_df['change_class'].sum()/len(recruiter_df))
print("The percentage of false positive classifications is", "{0:.0%}".format(recruiter_incorrect))
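
The bookkeeping after the review can be tried without the widget. This sketch fakes a reviewed sheet with a string index (`reviewed_df` and its values are hypothetical) and repeats the same three steps: flip the ticked rows, restore a numeric index, and compute the share of flipped rows.

```python
import pandas as pd

# Hypothetical reviewed sheet; the string index mimics what comes back from the sheet
reviewed_df = pd.DataFrame(
    {"class": [1, 1, 1, 1], "change_class": [False, True, False, False]},
    index=["0", "3", "7", "9"],
)

# Rows ticked during the review are flipped to non-recruiter
reviewed_df.loc[reviewed_df["change_class"], "class"] = 0
# Convert the index labels back to integers so they match the original frame
reviewed_df.index = pd.to_numeric(reviewed_df.index)

# Share of predicted recruiter mails that the review rejected
false_positive_share = reviewed_df["change_class"].sum() / len(reviewed_df)
print("{0:.0%}".format(false_positive_share))
```

The share is computed over the mails that were predicted as recruiter, so it tells you how often a positive prediction was wrong.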

<div class="alert alert-danger"><b>!IMPORTANT!</b> For job_email_examples as test_set, all of the mails are recruiter mails. Uncomment the code block below, run it and go directly to 2.4</div>

In [None]:
#ground_truth_recruiter_df = test_set_df
#ground_truth_recruiter_df['class'] = ground_truth_recruiter_df['prediction']

Review the non-recruiter mails

Tick the change_class checkbox for the ones that should be recruiter.

<div class="alert alert-danger"><b>!IMPORTANT!</b> If the mail is a reply or otherwise isn't an initial offer/ad for a job, it shouldn't be in the recruiter_df!</div>

In [None]:
non_recruiter_df = test_set_df[test_set_df['class'] == 0].copy()
# Truncate long names and subjects so the sheet stays readable
non_recruiter_df.loc[:, 'name'] = non_recruiter_df.loc[:, 'name'].str[:15]
non_recruiter_df.loc[:, 'subject'] = non_recruiter_df.loc[:, 'subject'].str[:100]
# Add an unticked boolean column that shows up as a checkbox in the sheet
non_recruiter_df = non_recruiter_df.assign(change_class=False)
# Keep only the columns needed for the review
non_recruiter_df.drop(['email', 'message', 'language', 'date', 'firstname', 'lastname', 'domain', 'prediction'], axis=1, inplace=True)
non_recruiter_sheet = ipysheet.from_dataframe(non_recruiter_df)
non_recruiter_sheet.layout.height = '600px'
non_recruiter_sheet

Turn the sheet back into a df

In [None]:
non_recruiter_df = ipysheet.to_dataframe(non_recruiter_sheet)
non_recruiter_df.loc[non_recruiter_df['change_class'] == True, 'class'] = 1
# NOTE: ipysheet messes up the index turning it into strings
non_recruiter_df.index = pd.to_numeric(non_recruiter_df.index)
non_recruiter_incorrect = (non_recruiter_df['change_class'].sum()/len(non_recruiter_df))
print("The percentage of false negative classifications is", "{0:.0%}".format(non_recruiter_incorrect))

In [None]:
changes_df = pd.concat([recruiter_df, non_recruiter_df])
changes_df.drop(['name', 'subject', 'change_class'], axis=1, inplace=True)

In [None]:
test_set_df.drop(['class'], axis=1, inplace=True)
ground_truth_df = pd.concat([test_set_df, changes_df], axis=1)
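
The column-wise concat relies on index alignment, which is why the index had to be turned back into numbers after the review. A minimal sketch with hypothetical frames whose rows are in a different order:

```python
import pandas as pd

# Hypothetical frames sharing the same index labels in a different row order
toy_test_df = pd.DataFrame({"subject": ["a", "b", "c"]}, index=[0, 2, 5])
toy_changes_df = pd.DataFrame({"class": [0, 1, 1]}, index=[5, 0, 2])

# axis=1 places the columns side by side and matches rows by index label
merged = pd.concat([toy_test_df, toy_changes_df], axis=1)
print(merged)
```

Rows are matched by label, not by position, so the reviewed class lands on the right mail even if the review frames were concatenated in a different order.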

In [None]:
ground_truth_recruiter_df = ground_truth_df[ground_truth_df['class'] == 1].copy()

## 2.4 Export data

Optional: Have one last look to make sure it's good.

In [None]:
ground_truth_recruiter_df

In [None]:
ground_truth_recruiter_df.to_csv(r'files/ground_truth_recruiter_df.csv', index=False)

### On to <a href="./03_recruiter_NER.ipynb">03_recruiter_NER…</a>