# 3. Tagging And Sorting

### After running the classifier in 02_recruiter_classifier, run this!
### Tags job titles and classifies them (management, tech, web); a job title can have multiple classes
### Tags location using SpaCy
### Tags job requirements and classifies them
### Tags meta data about the jobs
### Filters job ads to interesting and uninteresting

## 3.1 Load dependencies and mail dataset

Install spaCy and its English model if you have not already.

This notebook uses the English model; you might want to change it for other languages.

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

In [None]:
import re
import pickle
import pandas as pd
import spacy
import numpy as np
from spacy import displacy
import json
import en_core_web_sm
nlp = en_core_web_sm.load()
from bs4 import BeautifulSoup
from IPython.display import display, HTML
import unicodedata
from html.parser import HTMLParser
import ipysheet

Load the ground truth recruiter data

In [None]:
recruiter_df = pd.read_csv('files/hide/ground_truth_recruiter_df.csv')
# To use the dummy dataset job_email_examples as the test set instead:
# recruiter_df = pd.read_csv('files/dummy_data/job_email_examples.csv')

## 3.2 Job type labeling on email subjects

We want to find the job titles in the subject and classify them (this could also be done on the message itself).
To do so, we search the subjects for the keywords in 'job-types.json'
and label each match with the 'type' it represents (e.g. 'head of data science' is a combination of management and AI).

In [None]:
with open('files/job_titles/job-types.json', 'r') as f:
    jobTypes = json.load(f)

Define the function for job title tagging (plain keyword matching on the subject)

In [None]:
def checkJobType(subject, jobTypes):
    foundJobTypes = []
    foundJobs = []
    subject = subject.lower()
    # jobTypes maps a type name to the list of keywords that indicate it
    for name, categories in jobTypes.items():
        for category in categories:
            # Plain substring match against the lowercased subject
            if category in subject:
                foundJobTypes.append(name)
                foundJobs.append(category)
    return list(set(foundJobTypes)), foundJobs
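
A minimal sketch of the matching, assuming 'job-types.json' maps each type name to a list of lowercase keywords. The dictionary and the helper name below are made up for illustration. Unlike a plain substring test, this variant matches whole words only, so a keyword such as 'lead' does not fire on 'misleading':

```python
import re

# Hypothetical stand-in for files/job_titles/job-types.json
example_job_types = {
    "management": ["head of", "lead", "manager"],
    "ai": ["data science", "machine learning"],
}

def check_job_type_sketch(subject, job_types):
    """Return (sorted unique type names, matched keywords) using whole-word matching."""
    subject = subject.lower()
    found_types, found_keywords = set(), []
    for name, keywords in job_types.items():
        for keyword in keywords:
            # \b keeps the keyword from matching inside a longer word
            if re.search(r'\b' + re.escape(keyword) + r'\b', subject):
                found_types.add(name)
                found_keywords.append(keyword)
    return sorted(found_types), found_keywords

types, keywords = check_job_type_sketch("Head of Data Science, Berlin", example_job_types)
print(types, keywords)
```

Sorting the type names also makes the joined 'jobTypes' string stable between runs, which a bare `set` does not guarantee.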

Optional data check: print out the subjects, tags

In [None]:
for subject in recruiter_df['subject']:
    output = checkJobType(subject, jobTypes)
    print(subject, '\t->', output)

Codeblock to rip out job types and append them to new columns in the df

In [None]:
tags = []
jobsPerEntry = []
for subject in recruiter_df['subject']:
    currentJobTypes, jobs = checkJobType(subject, jobTypes)
    tags.append(';'.join(currentJobTypes))
    jobsPerEntry.append(';'.join(jobs))

recruiter_df['jobTypes'] = tags
recruiter_df['jobTags'] = jobsPerEntry

## 3.3 Clean the subjects and extract location labels from subjects

To extract locations and do further processing, we first have to clean up the subjects a bit. Then, to extract the locations from the subject, we will use SpaCy (the locations could also be pulled from the messages).

Note: SpaCy's CNN model seems to work best when you keep capital letters and commas

In [None]:
def cleanText(subject):
    subject = re.sub(r'^(re|fwd):\s*', '', subject, flags=re.I) 
    subject = re.sub(r'[^a-zA-Z0-9,]+', " ", subject)
    subject = re.sub(r'\s*,\s*', ', ', subject)
    return subject.strip()

In [None]:
recruiter_df['subject_cleaned'] = recruiter_df['subject'].apply(cleanText)
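
To see what the cleaning does, here is the same function applied to two invented subjects. The second one shows a side effect worth knowing about: the character class only keeps ASCII letters, digits, and commas, so umlauts and other non-ASCII letters are replaced by a space.

```python
import re

def cleanText(subject):
    # Drop a leading "Re:" or "Fwd:" prefix
    subject = re.sub(r'^(re|fwd):\s*', '', subject, flags=re.I)
    # Replace every run of characters other than ASCII letters, digits, and commas
    subject = re.sub(r'[^a-zA-Z0-9,]+', " ", subject)
    # Normalize spacing around commas
    subject = re.sub(r'\s*,\s*', ', ', subject)
    return subject.strip()

# Invented example subjects
cleaned_plain = cleanText("RE: Senior ML Engineer (m/f) - Berlin,Germany!")
cleaned_umlaut = cleanText("Job in München")
print(cleaned_plain)
print(cleaned_umlaut)
```

If your subjects contain many non-ASCII place names, widening the character class (for example with `\w`) may give the NER step better input; that is a suggestion, not something this notebook does.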

This applies spaCy's NER to the cleaned subjects in our df and keeps the entities labeled as locations (GPE or LOC)

In [None]:
results = []
for subject in recruiter_df['subject_cleaned']:
    output = nlp(subject)
    all_locations = []
    for ent in output.ents:
        if ent.label_ == 'GPE' or ent.label_ == 'LOC':
            all_locations.append(ent.text)
    results.append([subject, ';'.join(all_locations)])

Optional: check your results.
SpaCy NER doesn't work perfectly.
I would recommend training a new model on your own data if you have enough, or applying NER to the message itself.

In [None]:
results

In [None]:
results_df = pd.DataFrame(results, columns=['subject_cleaned', 'location'], index=recruiter_df.index)
results_df['location'] = results_df['location'].replace('', np.nan)
results_df.drop(columns='subject_cleaned', inplace=True)
recruiter_df = pd.concat([recruiter_df, results_df], axis=1)

Optional: How's this shaping up?

In [None]:
recruiter_df.drop(['date','language','message','name', 'email', 'subject', 'domain', 'prediction', 'class', 'firstname', 'lastname'], axis=1).style.set_properties(subset=['subject_cleaned'], **{'width': '300px'})

In [None]:
recruiter_df.drop(['subject_cleaned'], axis=1, inplace=True)

## Optional: Extract and clean requirements from message bullet points

Bullet points are low-hanging fruit for finding requirements. Run this if you want to make your own dictionary file to match custom requirements in emails instead of using the default one (requirement-types.json).

<div class="alert alert-danger"><b>!IMPORTANT!</b> If you are running the dummy dataset, skip this</div>

In [None]:
def select_bullet(item):
    lists = BeautifulSoup(item, "lxml").select('ul')
    if lists:
        return lists[0].get_text()
    return None

In [None]:
recruiter_df['message_bullets'] = recruiter_df['message'].apply(select_bullet)
recruiter_df['message_bullets'] = recruiter_df['message_bullets'].str.replace('\n', ' ')

In [None]:
count_total = len(recruiter_df)
count_na = len(recruiter_df['message_bullets']) - recruiter_df['message_bullets'].count()
count_bullet_char = len(recruiter_df[recruiter_df['message'].str.contains('•', na=False)])
print("From a total of", count_total, "entries, there are", count_na, "without any HTML bullet points", "and", count_bullet_char, "with symbol bullet points")
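
Messages that use a '•' symbol instead of an HTML list are only counted here, not extracted. A rough sketch of how such lines could be collected from plain text follows; the helper name and the sample message are made up, and it assumes one bullet per line.

```python
import re

# One bullet per line: optional indentation, the bullet symbol, then the item text
BULLET_RE = re.compile(r'^\s*•\s*(.+?)\s*$', re.MULTILINE)

def select_symbol_bullets(text):
    """Join all '•' bullet items into one string, or return None if there are none."""
    items = BULLET_RE.findall(text)
    return ' '.join(items) if items else None

sample = "Requirements:\n• Python\n•  5 years experience \nRegards"
bullets = select_symbol_bullets(sample)
no_bullets = select_symbol_bullets("No list in this mail")
print(bullets)
```

The return value has the same shape as the one from `select_bullet` (a single string or `None`), so both could feed the same cleaning steps.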

Ideas for keywords and phrases we will need later.
Take a look at the results to get a feel for important requirements

In [None]:
recruiter_df.dropna(subset=['message_bullets']).drop(['subject', 'location','language', 'message','name', 'email', 'domain', 'prediction', 'class', 'firstname', 'lastname', 'date'], axis=1).style.set_properties(**{'width': '90%'})

We need rough word counts of the requirements

So we will clean the message_bullets

Then tokenize, lemmatize, and remove stop words

Finally we will count the remaining keywords

Keep in mind that lemmatization will sometimes change words we would rather keep intact, for example turning kubernetes into kubernete. Also, this will only count single words, not phrases.

In [None]:
recruiter_df['message_bullets_cleaned'] = recruiter_df['message_bullets'].dropna().apply(cleanText).str.replace(',', '')
recruiter_df['message_bullets_cleaned'] = recruiter_df['message_bullets_cleaned'].dropna().str.lower().apply(lambda text: " ".join(token.lemma_ for token in nlp(text) if not token.is_stop))
requirement_ideas = recruiter_df['message_bullets_cleaned'].str.split(' ', expand=True).stack().value_counts()
requirement_ideas_df = requirement_ideas.to_frame('count')
pd.set_option('display.max_rows', None)
requirement_ideas_df
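
The counting idea can be sketched without spaCy or pandas. The stop-word list and the bullet texts below are invented, and there is no lemmatization; the optional `ngram` argument is an addition that hints at how two-word phrases could be counted as well.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; spaCy ships a much larger one
STOP_WORDS = {"a", "and", "in", "of", "the", "with"}

def count_keywords(bullet_texts, ngram=1):
    """Count n-grams of non-stop-word tokens across all bullet texts."""
    counts = Counter()
    for text in bullet_texts:
        tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
        # Note: n-grams are built after stop-word removal, so they can join
        # words that were not adjacent in the original text
        counts.update(
            " ".join(tokens[i:i + ngram]) for i in range(len(tokens) - ngram + 1)
        )
    return counts

bullets = ["Experience with machine learning and Python", "Machine learning in production"]
word_counts = count_keywords(bullets)
phrase_counts = count_keywords(bullets, ngram=2)
print(word_counts.most_common(3))
print(phrase_counts.most_common(1))
```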

From our dataframe with some ideas of requirement keywords, as well as our results from the 'message_bullets', we can turn that into a dictionary file (requirement-types.json)

In [None]:
# Keep the index: it holds the keywords themselves
requirement_ideas_df.to_csv(r'files/hide/requirement_ideas_df.csv', index_label='keyword')
recruiter_df.drop(['message_bullets', 'message_bullets_cleaned'], axis=1, inplace=True)
pd.set_option('display.max_rows', 10)

## 3.4 Pull requirements from whole message

We will use either the default dictionary (requirement-types.json) or the one you made from the optional step above to extract the requirements line by line from the whole message, not just the bullet points

<div class="alert alert-danger"><b>!IMPORTANT!</b> If you are running the dummy dataset, skip the next two steps; the messages have already been cleaned.</div>

This function strips the HTML from the email messages and cleans up the remaining text

In [None]:
def cleanMessages(html):
    soup = BeautifulSoup(html, "html.parser")
    # Remove script and style elements entirely
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    text = unicodedata.normalize("NFKD", text)
    text = text.replace('\xa0', '\n')
    # Strip the possessive first; after the bare '\x92' is removed it can no longer match
    text = text.replace('\x92s', '')
    text = text.replace('\x92', '')
    text = text.replace('\x96', '')
    text = text.replace('\u200b', '')
    # Remove the mail boundary marker and the text that follows it up to the next line break
    text = re.sub(r'(--- mail_boundary ---)\s*(.+)', '', text)
    # Strip each line, split on double spaces, and drop empty chunks
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
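
For readers without BeautifulSoup at hand, here is a rough standard-library sketch of the same idea: drop script and style content, normalize the text, and keep one stripped line per non-empty line. The class and function names are made up, and it leaves out the character replacements and the mail boundary handling above.

```python
import unicodedata
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def html_to_lines(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # NFKD turns a non-breaking space (\xa0) into a regular space
    text = unicodedata.normalize("NFKD", ''.join(parser.parts))
    lines = (line.strip() for line in text.splitlines())
    return '\n'.join(line for line in lines if line)

sample_html = ('<html><head><style>p {color: red}</style></head><body><p>Hello</p>\n'
               '<p>  Salary:\xa050k </p><script>var x = 1;</script></body></html>')
cleaned = html_to_lines(sample_html)
print(cleaned)
```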

Apply the cleaning function to every message and store the result in a new message_cleaned column

In [None]:
recruiter_df['message_cleaned'] = recruiter_df['message'].dropna().apply(lambda x: cleanMessages(x))

In [None]:
with open('files/requirement-types.json', 'r') as f:
    requirementTypes = json.load(f)

Function for requirement parsing

In [None]:
def checkRequirementType(message, requirementTypes):
    foundRequirementTypes = []
    foundRequirementKeywords = []
    foundRequirements = []
    message = message.lower()
    lines = message.split('\n')
    # requirementTypes maps a type name to the list of keywords that indicate it
    for name, categories in requirementTypes.items():
        for category in categories:
            for line in lines:
                # Plain substring match; keep the type, the keyword, and the whole line
                if category in line:
                    foundRequirementTypes.append(name)
                    foundRequirementKeywords.append(category)
                    foundRequirements.append(line)
    return list(set(foundRequirementTypes)), list(set(foundRequirementKeywords)), list(set(foundRequirements))

Optional data check: print out the found requirement types, keywords, and line it was found in

In [None]:
for message in recruiter_df['message_cleaned']:
    output = checkRequirementType(message, requirementTypes)
    print(output)
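
A small sketch of the same line-by-line matching on an invented message, assuming 'requirement-types.json' maps each type name to a list of lowercase keywords (the dictionary and helper name are hypothetical). It de-duplicates with `dict.fromkeys` instead of `set`, which keeps the first-seen order. Since the joined strings are later compared to find duplicates, a stable order may be worth having.

```python
# Hypothetical stand-in for files/requirement-types.json
example_requirement_types = {
    "programming": ["python", "sql"],
    "cloud": ["aws"],
}

def check_requirement_type_sketch(message, requirement_types):
    """Match keywords line by line; de-duplicate while keeping first-seen order."""
    found_types, found_keywords, found_lines = [], [], []
    for line in message.lower().split('\n'):
        for name, keywords in requirement_types.items():
            for keyword in keywords:
                if keyword in line:
                    found_types.append(name)
                    found_keywords.append(keyword)
                    found_lines.append(line)
    # dict.fromkeys drops repeats but, unlike set(), preserves insertion order
    return (list(dict.fromkeys(found_types)),
            list(dict.fromkeys(found_keywords)),
            list(dict.fromkeys(found_lines)))

message = "We need Python and SQL\nExperience with AWS\nPython is a must"
types, keywords, lines = check_requirement_type_sketch(message, example_requirement_types)
print(types, keywords, lines)
```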

Optional example

This function rewrites requirement lines that ask you to develop/build/design ML algorithms

Developing or building algorithms is not the same as having a theoretical understanding of ML algorithms

You can easily add your own rules

In [None]:
def clean_requirements(requirement):
    # Case-insensitive in both the test and the substitution
    if re.search(r'(build|develop|design)([^ ]* ){1,5}algorithm|algorithm design', requirement, re.I):
        requirement = re.sub(r'algorithm', 'research', requirement, flags=re.I)
    return requirement

Code block to rip out requirement types, the keywords found, and the lines they were found on.

Then appends them to new columns in the df.

In [None]:
tags = []
keywordsPerEntry = []
requirementsPerEntry = []
for message in recruiter_df['message_cleaned']:
    message = clean_requirements(message)
    currentRequirementTypes, keywords, requirements = checkRequirementType(message, requirementTypes)
    tags.append(';'.join(currentRequirementTypes))
    keywordsPerEntry.append(';'.join(keywords))
    requirementsPerEntry.append(';'.join(requirements))

recruiter_df['requirementTypes'] = tags
recruiter_df['requirementKeywords'] = keywordsPerEntry
recruiter_df['requirements'] = requirementsPerEntry

Optional: check the results

In [None]:
recruiter_df.drop(['location', 'jobTypes', 'jobTags','date', 'message','message_cleaned','name', 'email', 'subject', 'domain', 'prediction', 'class', 'firstname', 'lastname', 'language'], axis=1).style.set_properties(**{'width': '1%'})

## 3.5 Extract meta job data (salary, title, etc.) from messages

By looking at the messages themselves, it can be seen that one or more of the following patterns often occurs:

role: (role)
salary: (salary)
title: (title)

Because of this we can extract this job meta data for further analysis

We are going to overwrite message_cleaned with a line split version to work with

In [None]:
recruiter_df['message_cleaned'] = recruiter_df['message_cleaned'].str.split('\n')

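Compile the regular expressions for the tags (notice the German labels; I have also done this whole notebook for German). As written, the salary and education patterns only match when the label is followed by a double colon ('::').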

In [None]:
role_re = re.compile(r'(title|titel|roll?e|position)\s*:\s*(.+)', re.I)
location_re = re.compile(r'(Location|(?:stand)?ort):\s*(.+)', re.I)
duration_re = re.compile(r'(duration|dauer):\s*(.+)', re.I)
salary_re = re.compile(r'(gehalt|salary)::(?!\\r)\s*(.+)*', re.I)
education_re = re.compile(r'(education)::(?!\\r)\s*(.+)', re.I)
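
These patterns are not anchored to the start of a line, so a word that merely ends in a known label can match too. The sketch below compares the location pattern with a hypothetical anchored variant on two invented lines; `anchored_location_re` is not part of the notebook.

```python
import re

# Pattern as used in this notebook
location_re = re.compile(r'(Location|(?:stand)?ort):\s*(.+)', re.I)
# Hypothetical variant: the label must start the line (after optional whitespace)
anchored_location_re = re.compile(r'^\s*(Location|(?:stand)?ort):\s*(.+)', re.I)

lines = ["Standort: Berlin/Hamburg", "Report: weekly"]
unanchored = [location_re.findall(line) for line in lines]
anchored = [anchored_location_re.findall(line) for line in lines]
print(unanchored)
print(anchored)
```

Whether anchoring helps depends on your mails: it avoids false hits such as 'Report:', but it also misses labels that appear mid-line.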

Function for extracting the meta job data

In [None]:
def extractMetaData(clean_message):
    extractRegEx = {'role':role_re, 'location':location_re, 'duration':duration_re, 'salary':salary_re, 'education':education_re}
    metaData = {}
    pendingLabel = None  # label whose value was empty and is expected on the next line
    for line in clean_message:
        matched = False
        for label, regEx in extractRegEx.items():
            extracted = regEx.findall(line)
            if extracted:
                matched = True
                result = extracted[0][1].strip()
                metaData[label] = result
                pendingLabel = None if result else label
        # A tag with an empty value takes the next untagged line as its value
        if not matched and pendingLabel is not None:
            metaData[pendingLabel] = line.strip()
            pendingLabel = None
    return metaData

In [None]:
results = []

for clean_message in recruiter_df['message_cleaned']:
    output = extractMetaData(clean_message) 
    results.append(output)
    
metaData_df = pd.DataFrame.from_dict(results)
metaData_df['location'] = metaData_df['location'].str.replace('/', ';')

In [None]:
metaData_df

Experience is handled slightly differently: it is matched anywhere in a line rather than as a 'tag: value' pair

In [None]:
results = []
# 'ea' is optional so that both 'years' and 'yrs' match
experience_re = re.compile(r'(?:at least |minimum )?([^\s]+ ?y(?:ea)?rs?[\w ]+?experience(?:(?: in| with)(?:.+)?)?)', re.I)
for clean_message in recruiter_df['message_cleaned']:
    result = []
    for line in clean_message:
        output = experience_re.findall(line)
        if output:
            result.extend(output)
    if result:
        results.append(';'.join(set(result)))
    else:
        results.append(np.nan)
experience_df = pd.DataFrame({'experience': results})
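
To get a feel for what the experience pattern captures, here it is on a few invented lines. The pattern below is a variant in which 'ea' is optional, so 'yrs' is matched as well as 'years'; treat that as an assumption about the intent.

```python
import re

# Variant of the experience pattern with an optional 'ea' (matches 'years' and 'yrs')
experience_re = re.compile(
    r'(?:at least |minimum )?([^\s]+ ?y(?:ea)?rs?[\w ]+?experience(?:(?: in| with)(?:.+)?)?)',
    re.I,
)

sample_lines = [
    "At least 5 years of experience in Python",
    "3+ years experience",
    "5 yrs of experience with SQL",
    "We offer a great team",
]
matches = [experience_re.findall(line) for line in sample_lines]
for line, found in zip(sample_lines, matches):
    print(line, '->', found)
```

The leading 'at least' or 'minimum' is consumed but not captured, so equivalent phrasings end up as the same string.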

In [None]:
metaData_df = pd.concat([metaData_df, experience_df], axis=1)

Optional: review the complete job meta data

In [None]:
metaData_df.style.set_properties(**{'width': '200px'})

Let's fill in our NaNs with the non-NaN locations ripped from the subject by SpaCy

In [None]:
metaData_df['location'] = metaData_df['location'].fillna(recruiter_df['location'].dropna())

In [None]:
recruiter_df.drop(['location'], axis=1, inplace=True)
recruiter_df = pd.concat([recruiter_df, metaData_df], axis=1)

## 3.6 Response sorting

Job offer analysis

Here's an example of some of the insights we can glean from our data very easily

I am personally interested in AI jobs, especially those that combine the jobTypes 'ai' and 'management'

In [None]:
ai_recruiter_df = recruiter_df[recruiter_df['jobTypes'].str.contains('ai')].copy()
ai_job_count = len(ai_recruiter_df)
# !WATCH OUT! if the requirementKeywords are empty, they will be seen as duplicates
ai_job_dup_requirements_count = len(ai_recruiter_df[ai_recruiter_df.duplicated(subset='requirements', keep='first')])
ai_job_dup_requirementKeywords_count = len(ai_recruiter_df[ai_recruiter_df.duplicated(subset='requirementKeywords', keep='first')])

print("There are", ai_job_count, "AI jobs", "with", ai_job_dup_requirements_count, "probable duplicates and", ai_job_dup_requirementKeywords_count, "possible duplicates")

Optional: Look at Probable duplicates (duplicates based on requirements)

In [None]:
ai_recruiter_df[ai_recruiter_df.duplicated(subset='requirements', keep=False)].drop(['email', 'firstname', 'lastname','message', 'message_cleaned', 'class', 'prediction', 'language', 'jobTypes', 'requirementTypes'], axis=1).sort_values('requirementKeywords').style.set_properties(**{'width': '200px'})

Optional: it seems like some duplicates might have nan values in some columns so let's fill those missing values and drop the duplicates

In [None]:
ai_recruiter_df = ai_recruiter_df.groupby('requirements', group_keys=False).apply(lambda x: x.ffill().bfill()).drop_duplicates(subset='requirements')

These are possible duplicates based on requirement keywords

Having no or few requirement keywords might wrongly identify duplicates

In [None]:
ai_recruiter_df[ai_recruiter_df.duplicated(subset='requirementKeywords', keep=False)].drop(['email', 'firstname', 'lastname','message', 'message_cleaned', 'class', 'prediction', 'language', 'jobTypes', 'requirementTypes'], axis=1).sort_values('requirementKeywords').style.set_properties(**{'width': '200px'})

Optional: I personally like to filter the above possible duplicates (same requirement keywords) by location as well. Note that this also drops jobs without a location.

In [None]:
ai_recruiter_df = ai_recruiter_df.dropna(subset=['location']).drop_duplicates(subset=['requirementKeywords', 'location'])
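
The same filter, sketched on plain dictionaries with invented records, makes the two side effects easy to see: rows without a location disappear, and rows with empty keywords still collide with each other. The helper name is made up.

```python
def drop_possible_duplicates(records):
    """Keep the first record per (requirementKeywords, location); skip missing locations."""
    seen = set()
    unique = []
    for record in records:
        if record.get('location') is None:
            continue  # mirrors dropna(subset=['location'])
        key = (record['requirementKeywords'], record['location'])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

# Invented example records
jobs = [
    {"id": 1, "requirementKeywords": "python;sql", "location": "Berlin"},
    {"id": 2, "requirementKeywords": "python;sql", "location": "Berlin"},   # duplicate of 1
    {"id": 3, "requirementKeywords": "python;sql", "location": "Hamburg"},  # other city: kept
    {"id": 4, "requirementKeywords": "", "location": None},                 # no location: dropped
    {"id": 5, "requirementKeywords": "", "location": "Berlin"},
    {"id": 6, "requirementKeywords": "", "location": "Berlin"},             # empty keywords collide
]
kept_ids = [job["id"] for job in drop_possible_duplicates(jobs)]
print(kept_ids)
```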

How many of the AI jobs also combine other job types?

In [None]:
ai_recruiter_df['jobTypes'].str.split(';', expand=True).stack().value_counts()

In [None]:
ai_management_recruiter_df = ai_recruiter_df[ai_recruiter_df['jobTypes'].str.contains('management')].copy()
ai_management_job_count = len(ai_management_recruiter_df)
print("There are", ai_management_job_count,  "jobs that also combine keywords with management")

Quick look at the AI-management keywords and their counts

In [None]:
ai_management_recruiter_df['jobTags'].str.split(';', expand=True).stack().value_counts()

This is a bug workaround for ipysheet

If a df with only one row is converted to a sheet, converting it back into a df throws an exception

`Exception: Data must be 1-dimensional`

Note: This will add a row of NaN until removed after converting back into df

In [None]:
if ai_management_job_count == 1:
    ai_management_recruiter_df.loc['temp'] = None

Now let's actually review the results and put a check mark in the roles we want to apply to

This is the part where JupyterLab users should "Create New View for Output"

In [None]:
# A boolean column is shown as a check box in the sheet
ai_management_recruiter_df = ai_management_recruiter_df.assign(reply=False)
ai_management_recruiter_df.drop(['jobTypes', 'subject', 'message_cleaned', 'language', 'message','name', 'email', 'domain', 'prediction', 'class', 'firstname', 'lastname', 'date', 'requirements', 'requirementTypes'], axis=1, inplace=True, errors='ignore')
ai_management_recruiter_sheet = ipysheet.from_dataframe(ai_management_recruiter_df)
ai_management_recruiter_sheet.layout.overflow_y = 'scroll'
ai_management_recruiter_sheet.layout.overflow_x = 'scroll'
ai_management_recruiter_sheet

Turn the sheet back into a df

In [None]:
ai_management_recruiter_df = ipysheet.to_dataframe(ai_management_recruiter_sheet)
ai_management_recruiter_df.drop(index = 'temp', inplace=True, errors='ignore')
# NOTE: ipysheet messes up the index turning it into strings
ai_management_recruiter_df.index = pd.to_numeric(ai_management_recruiter_df.index)
ai_management_recruiter_df['reply'] = ai_management_recruiter_df['reply'].astype(str).replace({'True':'interested', 'False':'uninterested'})
ai_management_recruiter_df.drop(['jobTags', 'location', 'requirementKeywords', 'experience', 'duration', 'role'], inplace=True, axis=1)

In [None]:
ai_management_recruiter_df

Bring the reply information back into the main recruiter df

In [None]:
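processed_recruiter_df = pd.concat([recruiter_df, ai_management_recruiter_df['reply']], axis=1)
# Only the reply column gets a default; other missing values stay missing
processed_recruiter_df['reply'] = processed_recruiter_df['reply'].fillna('uninterested')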
processed_recruiter_df.drop(['prediction', 'class', 'jobTags', 'jobTypes', 'requirementTypes', 'location', 'message_cleaned', 'requirementKeywords', 'requirements', 'experience', 'message', 'role', 'duration'], inplace=True, axis=1)

## 3.7 Export data

Export the data to send responses back to recruiters 

In [None]:
processed_recruiter_mails = processed_recruiter_df.to_json(orient='records')

Optional: have a look to make sure it checks out

In [None]:
processed_recruiter_mails

In [None]:
with open('processed_recruiter_mails.json', 'w') as outfile:
    # processed_recruiter_mails is already a JSON string, so write it as-is
    outfile.write(processed_recruiter_mails)
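
A quick way to convince yourself that an export is usable is to write it and read it straight back. The sketch below does that with invented records and a temporary directory; the record fields are placeholders, not the notebook's actual columns.

```python
import json
import os
import tempfile

# Invented records standing in for the exported rows
records = [
    {"email": "recruiter@example.com", "reply": "interested"},
    {"email": "other@example.com", "reply": "uninterested"},
]

# Write to a temporary directory so the sketch leaves the project files alone
path = os.path.join(tempfile.mkdtemp(), "processed_recruiter_mails.json")
with open(path, "w", encoding="utf-8") as outfile:
    json.dump(records, outfile, ensure_ascii=False, indent=2)

# Read the file back and compare with what was written
with open(path, encoding="utf-8") as infile:
    loaded = json.load(infile)
print(loaded == records)
```

Note the difference between the two approaches: `json.dump` serializes Python objects, whereas a string that is already JSON (such as the output of `to_json`) should be written directly.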

Optional: Save your recruiter_df as a csv

In [None]:
recruiter_df.to_csv(r'files/hide/recruiter_df.csv', index=False)

### Onward to the final step, <a href="./04_recruiter_send_emails.ipynb">04_recruiter_send_emails…</a>