## Job Search and Ranking function based on Semantics and Word Embeddings

This notebook presents a mockup of a search and ranking function based on semantics of 20000 job descriptions (dataset extracted from Monster.com jobs).
This methodology using word embeddings captures the context and the semantics of the analysed text, compared to a classic search function based on words counts per documents (Vector Space Model and Term Frequency-Inverse Document Frequency).

A word embedding Work2Vec model is build from these descriptions to capture the semantics and the context.
This model is then enriched with a generic Word2Vec model based on a Google News corpus, the job descriptions being not sufficient to build a full language model.

The resulted ranking of a search is based on the cosine similarity between the query and the different job descriptions scored with the word embedding model (300 dim vector).

The TSNE dimension reduction method allows to visualise the job descritions in a 3D space. 

Possible improvement: TF-IDF weighting for job descriptions scoring

### Dependencies
- Numpy
- Gensim for text processing and word2vec model
- nltk
- tensorflow for T-SNE visualisation


In [1]:
# libraries

import pandas as pd
import re
import codecs
import multiprocessing
import gensim
from gensim.models import Word2Vec
from gensim.parsing.preprocessing import preprocess_string, remove_stopwords, stem_text
from gensim.sklearn_api import TfIdfTransformer

import nltk
import numpy as np

import os
import tensorflow as tf
from tensorflow.contrib.tensorboard.plugins import projector

### Dataset and Cleaning

In [2]:
# Import the job dataset
jobs = pd.read_csv('../../Data/Jobs/monster_com-job_sample.csv')
jobs.head()

Unnamed: 0,country,country_code,date_added,has_expired,job_board,job_description,job_title,job_type,location,organization,page_url,salary,sector,uniq_id
0,United States of America,US,,No,jobs.monster.com,TeamSoft is seeing an IT Support Specialist to...,IT Support Technician Job in Madison,Full Time Employee,"Madison, WI 53702",,http://jobview.monster.com/it-support-technici...,,IT/Software Development,11d599f229a80023d2f40e7c52cd941e
1,United States of America,US,,No,jobs.monster.com,The Wisconsin State Journal is seeking a flexi...,Business Reporter/Editor Job in Madison,Full Time,"Madison, WI 53708",Printing and Publishing,http://jobview.monster.com/business-reporter-e...,,,e4cbb126dabf22159aff90223243ff2a
2,United States of America,US,,No,jobs.monster.com,Report this job About the Job DePuy Synthes Co...,Johnson & Johnson Family of Companies Job Appl...,"Full Time, Employee",DePuy Synthes Companies is a member of Johnson...,Personal and Household Services,http://jobview.monster.com/senior-training-lea...,,,839106b353877fa3d896ffb9c1fe01c0
3,United States of America,US,,No,jobs.monster.com,Why Join Altec? If you’re considering a career...,Engineer - Quality Job in Dixon,Full Time,"Dixon, CA",Altec Industries,http://jobview.monster.com/engineer-quality-jo...,,Experienced (Non-Manager),58435fcab804439efdcaa7ecca0fd783
4,United States of America,US,,No,jobs.monster.com,Position ID# 76162 # Positions 1 State CT C...,Shift Supervisor - Part-Time Job in Camphill,Full Time Employee,"Camphill, PA",Retail,http://jobview.monster.com/shift-supervisor-pa...,,Project/Program Management,64d0272dc8496abfd9523a8df63c184c


In [3]:
# quick clean up of job titles and sectors   
job_titles = jobs["job_title"]

job_titles = job_titles.str.lower()
job_titles = job_titles.str.split("job in", n = 1, expand = True)[0] 
job_titles2 = job_titles.str.split("job application for", n = 1, expand = True)[1] \
                        .str.split('|', n = 1, expand = True)[0] \
                        .str.split('-', n = 1, expand = True)[0]
job_titles = job_titles2.combine_first(job_titles)

jobs["job_title"] = job_titles
jobs['job_title'] = jobs['job_title'].fillna('no title')
jobs['sector'] = jobs['sector'].fillna('no sector')

jobs.job_description.to_csv('job_descriptions.txt', header=False, index=False, sep=' ')

### Job Descriptions Word2Vec Model

Word2Vec model with 300 dimensions and 10 words window

In [4]:
# Word2Vec process and train functions

def preprocess_text(text):
    text = re.sub('[^a-zA-Zа-яА-Я1-9]+', ' ', text)
    text = re.sub(' +', ' ', text)
    return text.strip()

def prepare_for_w2v(filename_from, filename_to, lang):
    raw_text = codecs.open(filename_from, "r", encoding='utf-8').read()
    with open(filename_to, 'w', encoding='utf-8') as f:
        for sentence in nltk.sent_tokenize(raw_text, lang):
            print(preprocess_text(sentence.lower()), file=f)
            
def train_word2vec(filename):
    data = gensim.models.word2vec.LineSentence(filename)
    return Word2Vec(data, size=300, window=10, min_count=1, workers=multiprocessing.cpu_count())

In [5]:
prepare_for_w2v('./job_descriptions.txt', 'job_descriptions_prep.txt', 'english')

In [6]:
model_jobdesc = train_word2vec('job_descriptions_prep.txt')

In [7]:
model_jobdesc.save('job_desc_model.model')

The word embeddings are extracted to enrich the Google News model

In [8]:
words_jd = []
embeddings_jd = []
for word in list(model_jobdesc.wv.vocab):
    embeddings_jd.append(model_jobdesc[word])
    words_jd.append(word)

  after removing the cwd from sys.path.


In [9]:
len(words_jd)

99372

99372 words in the job descriptions corpus

### Google News Word2Vec model enrichement

Job descriptions word embedding are added to the Google Model

We have now 3071336 words in the model

In [10]:
model_google = gensim.models.KeyedVectors.load_word2vec_format('../../Data/GoogleNews-vectors-negative300.bin', binary=True)

In [11]:
model_goojd = model_google
model_goojd.add(words_jd, embeddings_jd)

In [12]:
len(model_goojd.vocab)

3071336

### Job Description scoring

The 22000 job descriptions are scored against the model, taking the mean of the scores of each word.

A possible improvement is possible here, by weighting each word by the TF-IDF score.

In [13]:
job_desc =[]
#filters = [lambda x: x.lower(), strip_punctuation, stem_text, remove_stopwords]
for i, jd in enumerate(jobs.job_description):
    jdd = ' '
    for sentence in nltk.sent_tokenize(jd, 'english'):
        jdd = jdd + ' ' + preprocess_text(sentence.lower())
        #jdd = jdd + ' ' + ' '.join(preprocess_string((sentence), filters))
        
    job_desc.append(jdd)

In [14]:
# split jos description into words
for i ,_ in enumerate(job_desc):
    job_desc[i] = job_desc[i].split()

In [15]:
# score job descriptions
job_scored = []
for i ,_ in enumerate(job_desc):
    job_scored.append(model_goojd[job_desc[i]].mean(axis=0))

### TSNE visualisation

The t-distributed stochastic neighbor embedding visualisation with Tensorboard allows to see the 22000 jobs represented by a 300 dim vectors in 3D.

Shell command: tensorboard --logdir=.\project-tensorboard\log_desc

In [16]:
# Metadata
jobs[['job_title','sector']].to_csv('./project-tensorboard/log_desc/job_desc_meta.tsv', header=True, index=False, sep='\t')

In [17]:
## Get working directory
PATH = os.getcwd()

## Path to save the embedding and checkpoints generated
LOG_DIR = PATH + './project-tensorboard/log_desc/'

metadata = os.path.join(LOG_DIR, 'job_desc_meta.tsv')

## TensorFlow Variable from data
tf_data = tf.Variable(np.asarray(job_scored))

In [18]:
## Running TensorFlow Session
with tf.Session() as sess:
    saver = tf.train.Saver([tf_data])
    sess.run(tf_data.initializer)
    saver.save(sess, os.path.join(LOG_DIR, 'tf_data.ckpt'))
    config = projector.ProjectorConfig()
    
# One can add multiple embeddings.
    embedding = config.embeddings.add()
    embedding.tensor_name = tf_data.name
    # Link this tensor to its metadata(Labels) file
    embedding.metadata_path = metadata
    # Saves a config file that TensorBoard will read during startup.
    projector.visualize_embeddings(tf.summary.FileWriter(LOG_DIR), config)

![SegmentLocal](tensorboard.gif "segment")

## Search function

A free text query is scored with our model to retrive the 10 closest job descriptions in terms of semantics, and ranked by the cosine similarity.

Here each text is scored averaging the words scores, next improvement would be to weight these scores with the TF-IDF.


We can first try to score a single word to retrieve closest words and synomyms, for example 'scientist':

In [19]:
model_goojd.most_similar("scientist", topn=10)

[('researcher', 0.7906599640846252),
 ('physicist', 0.7408765554428101),
 ('biologist', 0.7058728933334351),
 ('geneticist', 0.6985131502151489),
 ('microbiologist', 0.6837249398231506),
 ('biochemist', 0.679307222366333),
 ('professor', 0.6541188955307007),
 ('molecular_biologist', 0.6405497193336487),
 ('ecologist', 0.6376317739486694),
 ('geoscientist', 0.6362979412078857)]

Or the full query compared against job descriptions:

In [20]:
# Search function
query = "software developer with .net skills"
query = query.lower().split()

In [21]:
similarity_to_query = []
for i,jd in enumerate(job_desc):
    similarity_to_query.append(model_goojd.n_similarity(query, jd))

In [22]:
results = pd.DataFrame({'Job_title': jobs['job_title'], 'Job_Description' : jobs['job_description'],  'Similarity': similarity_to_query})
results.sort_values(['Similarity'], ascending=False).head(10)

Unnamed: 0,Job_title,Job_Description,Similarity
16729,it professionals,IT Professionals (Multiple Positions) SAP Cons...,0.607872
4841,software automation engineer – python,"Software Automation Engineer – Python, Ruby, J...",0.583989
18080,.net developer,"Net Developer TAMKO Building Products, Inc. Ir...",0.574179
21450,junior java / jee developer,Job Title: Junior Java / J2EE Developer Locati...,0.569483
1153,jee developer,Our client is currently seeking a J2EE Develop...,0.549309
177,technical training developer,Technical Training Developer 6 months or longe...,0.54923
17104,software engineer,"Software Engineer Sri Anjaneya Tech Dallas, TX...",0.543767
3056,juniper conrail test engineer,"Dear Associate,We do have an urgent requiremen...",0.541393
6014,software developer iii,Description Sr .Net Developer 8+ years experie...,0.536462
21797,qa engineers,CCC Information Services Inc. seeks QA Enginee...,0.536206
