<a href="https://github.com/AsmaBALAMANE/job-profile-matcher">GitHub repository</a>

# IT job-candidate matching
> *Data-driven solution for automatically matching job offers with IT professionals*


---


<h1>Project scope</h1>

Most job seekers use online platforms to search for new opportunities and advance their careers. During the early, pre-interview stages of recruitment, reviewing many candidates' qualifications and comparing them against the required skills can slow the process down. Finding candidates and matching them with the most suitable jobs is far from easy, and it can take hours of searching to get good matches.


The automatic matching system we propose simulates the human resources process of matching candidates with job offers. When searching for suitable candidates, human recruiters typically rely on their understanding of the meaning of words and select the best match based on that semantic interpretation.
<h1>System design</h1> 

The system takes as input a job offer ‘j’ and a candidate's profile ‘p’. It outputs a matching rate ‘r’ with 0 ≤ r ≤ 1, where 0 means no match and 1 means a perfect match.


A job offer is divided into multiple sections. The meaningful information is extracted from the job title, description, and required skills sections. The same applies to the candidate's profile, using the professional title, experiences, and skills, respectively.
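
The pairing of sections described above can be sketched as a small data structure. The class and field names below (`JobOffer`, `CandidateProfile`, `paired_sections`) are illustrative assumptions and do not appear in the notebook, which passes plain lists instead.

```python
from dataclasses import dataclass


@dataclass
class JobOffer:
    # Hypothetical container for the three job-offer sections used for matching
    title: str
    description: str
    required_skills: str


@dataclass
class CandidateProfile:
    # Hypothetical container for the corresponding candidate-profile sections
    professional_title: str
    experiences: str
    skills: str


def paired_sections(job: JobOffer, profile: CandidateProfile) -> dict:
    # Pair each job section with the profile section it is compared against
    return {
        "title": (job.title, profile.professional_title),
        "description": (job.description, profile.experiences),
        "skills": (job.required_skills, profile.skills),
    }


job = JobOffer("Business Intelligence Analyst", "Extract data using SQL", "SQL, Python")
profile = CandidateProfile("Business Analyst", "Reporting with SQL Server", "python, tensorflow")
pairs = paired_sections(job, profile)
print(pairs["title"])
```

Each of the three pairs would then be handed to its own model, as the next section describes.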
<h1>System process</h1> 

In the first stage, each pair of inputs, i.e. (job title, candidate's professional title), (job description, candidate experiences), and (required job skills, candidate skills), is processed by the corresponding model.


The models output word embeddings. The SIF technique is then applied to generate sentence embedding vectors, and the cosine similarity of these vectors defines the matching rate for each input pair. These partial matching rates (m1, m2, and m3) are weighted by factors (see default values in Table 3). The final matching rate 'r' of the job offer and the candidate profile is the weighted average of the partial matching rates.
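
As a minimal sketch of this final step, the function below combines three partial matching rates into a single rate. The function name `final_matching_rate` is hypothetical and the partial rates are made-up values. The weights 0.5, 0.1, and 0.4 are the ones passed in this notebook's test cell for title, description, and skills; they are not necessarily the Table 3 defaults.

```python
def final_matching_rate(rates, weights):
    # Weighted average of the partial matching rates.
    # Dividing by the weight sum keeps this a true average even if the weights do not sum to 1.
    if len(rates) != len(weights):
        raise ValueError("rates and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not all be zero")
    return sum(r * w for r, w in zip(rates, weights)) / total_weight


# Illustrative partial rates for (title, description, skills)
partial_rates = [0.8, 0.6, 0.5]
weights = [0.5, 0.1, 0.4]
r = final_matching_rate(partial_rates, weights)
print(round(r, 2))
```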

![](https://drive.google.com/uc?export=view&id=1LJ-oelzZ39N4jv35R4ELju8hX9g5FyYO)

---



Requirements and package installation

In [None]:
import pandas as pd  
import numpy as np
from time import time  
import logging  
import multiprocessing
import collections
from collections import Counter
import itertools
import gensim
from gensim.models import Word2Vec, KeyedVectors   
# For preprocessing
import re
from gensim.models.phrases import Phrases, Phraser 
import spacy  
import string
from gensim.parsing.preprocessing import remove_stopwords
# For word frequency 
from collections import defaultdict
from sklearn.metrics.pairwise import cosine_similarity  
# For model serialization
import pickle
from gensim.test.utils import datapath
# Set up logging to monitor gensim
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

In [None]:
!python -m spacy download en_core_web_sm

# Model creation and data processing Functions


<h1>Data Processing</h1>

Data needs to be cleaned before further NLP tasks can be performed. This preprocessing phase includes:

1.   removing incomplete data rows
2.   using regular expressions to remove non-ASCII values and special characters
3.   tokenization, i.e. segmenting the input text into tokens (words) that can be used for further processing
4.   removing stop words
5.   named entity recognition (NER) to extract entities from text streams
6.   lemmatization, i.e. reducing word variations to a simpler form (the root word is called the lemma)
7.   detecting common phrases (bigrams), i.e. sequences of tokens whose meaning as a phrase differs from that of the words treated separately, e.g. “Big Data”
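
The steps above can be illustrated with a simplified, standard-library-only sketch. It is not the notebook's pipeline: the tiny stop-word set, the `KNOWN_BIGRAMS` table, and the function name `simple_preprocess` are assumptions made for illustration, and lemmatization and NER are left out because they need a trained spaCy model.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the notebook uses the spaCy and gensim lists)
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "with", "we", "for"}
# Hypothetical phrase table standing in for a trained bigram detector
KNOWN_BIGRAMS = {("big", "data"), ("machine", "learning")}


def simple_preprocess(text):
    # Step 2: replace everything except letters, '#', '+' and apostrophes with spaces
    cleaned = re.sub(r"[^A-Za-z#+']+", " ", text).lower()
    # Step 3: whitespace tokenization
    tokens = cleaned.split()
    # Step 4: stop-word removal
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Step 7: merge known bigrams into single tokens joined by '_'
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in KNOWN_BIGRAMS:
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = simple_preprocess("We work with Big Data, C++ and C# in the cloud (24/7).")
print(tokens)
```

Keeping `#` and `+` in the allowed character set is what lets skill names such as `c#` and `c++` survive the cleaning step.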

![](https://drive.google.com/uc?export=view&id=1CDdOKglbUBWjoD3lFKEXXFxqKjbHMnOe)








In [None]:
def cleaning(doc):
    # doc is a spaCy Doc object
    # Lemmatize and drop stop words (using the spaCy stop-word list)
    text = [token.lemma_ for token in doc if not token.is_stop]
    txt = ' '.join(text)
    # Remove remaining stop words using the gensim stop-word list
    txt = remove_stopwords(txt)
    # Restore terms that get split during processing (e.g. "c #" -> "c#")
    txt = txt.replace("c #", "c#").replace("phantom js", "phantomjs")
    return txt

def data_processing(column):
    # Drop null rows
    column = column.dropna().reset_index(drop=True)
    # Load the en_core_web_sm spaCy model
    nlp = spacy.load('en_core_web_sm')
    # Keep only letters and the characters #, + and ' (so that terms such as c# and c++ survive)
    brief_cleaning = (re.sub("[^A-Za-z#+']+", ' ', str(row)).lower() for row in column)
    t = time()
    # n_threads is omitted: it has been deprecated since spaCy 2.1
    txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, batch_size=5000)]
    print('Cleaning time: {} mins'.format(round((time() - t) / 60, 2)))
    df_clean = pd.DataFrame({'cleaned': txt})
    df_clean = df_clean.dropna().drop_duplicates()
    sent = [row.split() for row in df_clean['cleaned']]
    # Detect frequent bigrams and merge them into single tokens
    phrases = Phrases(sent, min_count=30, progress_per=10000)
    bigram = Phraser(phrases)
    sentences = bigram[sent]
    return sentences


<h1>Model building and training</h1>

Features Extraction


To represent sentences as embedding vectors, the system uses the Smooth Inverse Frequency (SIF) technique, which takes a weighted average of word embeddings. Each word embedding is weighted by a / (a + p(w)), where a is a parameter typically set to 0.001 and p(w) is the estimated frequency of the word in a corpus. In the code below, p(w) is taken from word counts over the two sentences being compared.
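
To make the weighting concrete, here is a small, dependency-free sketch of a SIF sentence vector. The two-dimensional embeddings and the word frequencies are made-up values, and `sif_weight` and `sif_sentence_vector` are hypothetical helpers, not the notebook's functions.

```python
def sif_weight(word_frequency, a=0.001):
    # SIF weight a / (a + p(w)): frequent words get smaller weights
    return a / (a + word_frequency)


def sif_sentence_vector(tokens, embeddings, frequencies, a=0.001):
    # Weighted average of the word vectors of the tokens
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for token in tokens:
        w = sif_weight(frequencies[token], a)
        for k in range(dim):
            total[k] += w * embeddings[token][k]
    return [value / len(tokens) for value in total]


# Made-up toy data: estimated corpus frequencies p(w) and 2-d word vectors
frequencies = {"the": 0.05, "python": 0.0005}
embeddings = {"the": [1.0, 0.0], "python": [0.0, 1.0]}

vector = sif_sentence_vector(["the", "python"], embeddings, frequencies)
print(sif_weight(0.05), sif_weight(0.0005))
print(vector)
```

In this toy example the rare word `python` contributes far more to the sentence vector than the frequent word `the`, which is the intended effect of the weighting.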

In [None]:
# Feature extraction
def map_word_frequency(document):
    # document is an iterable of token lists; count every token across them
    return Counter(itertools.chain(*document))

def get_sif_feature_vectors(sentence1, sentence2, word_emb_model):
    # Keep only tokens known to the embedding model
    sentence1 = [token for token in sentence1 if token in word_emb_model.wv]
    sentence2 = [token for token in sentence2 if token in word_emb_model.wv]
    # Return 0 when either input has no in-vocabulary token (the caller checks for this)
    if not sentence1 or not sentence2:
        return 0
    # Pass a list of token lists so that words, not characters, are counted
    word_counts = map_word_frequency([sentence1, sentence2])
    embedding_size = word_emb_model.wv.vector_size  # dimensionality of the word vectors
    a = 0.001
    sentence_set = []
    for sentence in [sentence1, sentence2]:
        vs = np.zeros(embedding_size)
        sentence_length = len(sentence)
        for word in sentence:
            a_value = a / (a + word_counts[word])  # smooth inverse frequency, SIF
            vs = np.add(vs, np.multiply(a_value, word_emb_model.wv[word]))  # vs += sif * word_vector
        vs = np.divide(vs, sentence_length)  # weighted average
        sentence_set.append(vs)
    return sentence_set


In [None]:
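# Input preparation
def input_preparation(sentence):
    # Note: the spaCy model is reloaded on every call
    nlp = spacy.load("en_core_web_sm")
    doc1 = nlp(sentence)
    # Clean the text, then lowercase and de-duplicate the tokens (order is not preserved)
    txt = [token.lower() for token in cleaning(doc1).split()]
    txt = list(set(txt))
    # Keep only tokens that start with a letter, #, + or '
    r = re.compile("[A-Za-z#+']+")
    txt = list(filter(r.match, txt))
    # Apply bigram detection to the token list
    phrases = Phrases(txt, min_count=10, progress_per=10000)
    bigram = Phraser(phrases)
    sentences = bigram[txt]
    return sentences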

Word2Vec training

Word2vec uses a fully connected neural network with a single hidden layer whose neurons are all linear. The input layer has as many neurons as there are words in the training vocabulary (V). The hidden layer size is set to the dimensionality (N) of the resulting word vectors. The output layer has the same size as the input layer.
The input-to-hidden connections can be represented by a matrix (Win) of size (V×N), with each row representing a vocabulary word. In the same way, the connections from the hidden layer to the output layer can be described by a matrix (Wout) of size (N×V). In this case, each column of the (Wout) matrix represents a word from the given vocabulary.
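
The shapes described above can be checked with a toy forward pass. This is only a sketch of the architecture with made-up weights (V = 4, N = 2) in plain Python. It is not how gensim implements training, and the softmax output is the textbook formulation, which is an assumption here since this notebook trains with negative sampling.

```python
import math

V, N = 4, 2  # vocabulary size and embedding dimensionality (toy values)

# W_in has shape V x N: row i is the vector of vocabulary word i
W_in = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
# W_out has shape N x V: column j corresponds to vocabulary word j
W_out = [[0.1, 0.0, -0.1, 0.2], [0.0, 0.3, 0.1, -0.2]]


def forward(word_index):
    # One-hot input vector of length V
    x = [1.0 if i == word_index else 0.0 for i in range(V)]
    # Linear hidden layer: h = x . W_in, which simply selects one row of W_in
    h = [sum(x[i] * W_in[i][k] for i in range(V)) for k in range(N)]
    # Output scores, one per vocabulary word: u = h . W_out
    u = [sum(h[k] * W_out[k][j] for k in range(N)) for j in range(V)]
    # Softmax turns the scores into a probability distribution over the vocabulary
    exp_u = [math.exp(s) for s in u]
    z = sum(exp_u)
    return h, [e / z for e in exp_u]


hidden, probs = forward(2)
print(hidden)
print(len(probs), round(sum(probs), 6))
```

Because the input is one-hot, the hidden activation equals row 2 of `W_in`, which is why the rows of (Win) are used as the word vectors after training.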

![](https://drive.google.com/uc?export=view&id=1VKPYt2tl2or2lJBsZ4s1Vkng3CS-VBAE)



In [None]:
# Values used: min_count=5 (or 20 for descriptions), window=2, size=300, sample=6e-5, alpha=0.03, min_alpha=0.0007, negative=20, workers=cores-1
def word2vec_creation(data, min_count, window, size, sample, alpha, min_alpha, negative):
    # Count the number of CPU cores available
    cores = multiprocessing.cpu_count()
    # Note: `size` is the gensim 3.x parameter name; gensim 4.x renamed it to `vector_size`
    w2v_model = Word2Vec(min_count=min_count,
                         window=window,
                         size=size,
                         sample=sample,
                         alpha=alpha,
                         min_alpha=min_alpha,
                         negative=negative,
                         workers=max(1, cores - 1))  # keep at least one worker on single-core machines
    t = time()
    # Build the Word2Vec vocabulary
    w2v_model.build_vocab(data, progress_per=10000)
    print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))
    return w2v_model

def word2vec_training(data, w2v_model):
    t = time()
    # Train for 30 epochs over the corpus seen by build_vocab
    w2v_model.train(data, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)
    print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))
    return w2v_model


Docs Similarity

In [None]:
# calculate similarity 
def get_cosine_similarity(feature_vec_1, feature_vec_2):    
    return cosine_similarity(feature_vec_1.reshape(1, -1), feature_vec_2.reshape(1, -1))[0][0]
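
For reference, the quantity computed by the helper above can be reproduced without scikit-learn. The sketch below is a plain-Python equivalent for two vectors; the name `cosine` and the explicit zero-vector convention (returning 0.0) are assumptions for illustration.

```python
import math


def cosine(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: similarity with a zero vector is 0
        return 0.0
    return dot / (norm_u * norm_v)


same = cosine([1.0, 0.0], [2.0, 0.0])       # same direction
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])  # orthogonal
opposite = cosine([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Cosine similarity ignores vector length and, in general, ranges from -1 to 1, so a partial matching rate can in principle be negative when two sentence vectors point in opposing directions.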

In [None]:
def job_profile_matching(skills_rate, title_rate, description_rate, job, profile, skills_model, titles_model, descriptions_model):
    # job and profile are lists ordered as [skills, title, description or experiences]

    # Skills similarity
    job_skills = input_preparation(job[0])
    profile_skills = input_preparation(profile[0])
    result_skills = get_sif_feature_vectors(job_skills, profile_skills, skills_model)
    # Fall back to 0 if no word from the job or profile input was found in the vocabulary
    if result_skills == 0:
        matching_skills = 0
    else:
        matching_skills = get_cosine_similarity(result_skills[0], result_skills[1])
    print('matching_skills:', matching_skills)

    # Title similarity
    job_title = input_preparation(job[1])
    profile_title = input_preparation(profile[1])
    result_title = get_sif_feature_vectors(job_title, profile_title, titles_model)
    if result_title == 0:
        matching_title = 0
    else:
        matching_title = get_cosine_similarity(result_title[0], result_title[1])
    print('matching_title:', matching_title)

    # Description similarity
    job_description = input_preparation(job[2])
    profile_description = input_preparation(profile[2])
    result_description = get_sif_feature_vectors(job_description, profile_description, descriptions_model)
    if result_description == 0:
        matching_description = 0
    else:
        matching_description = get_cosine_similarity(result_description[0], result_description[1])
    print('matching_description:', matching_description)

    # Weighted combination of the three partial matching rates
    final_matching = matching_skills * skills_rate + matching_title * title_rate + matching_description * description_rate
    # Returned order: [title, description, skills, final]
    return [matching_title, matching_description, matching_skills, final_matching]

# Model building and publishing


Data loading and model training

In [None]:
# Read the data set from data.world
df = pd.read_csv('https://query.data.world/s/mmxnlcbuvbcx5gdrvgllt52zn5wu75')
df.head()

In [None]:
# Extract the relevant columns
data_skills = df['skills'].apply(lambda x: str(x).lower().replace('see below','').replace('(see job description)','').replace('see job description','').replace('full time','').replace('part time',''))
data_jobtitles=df['jobtitle']

# Description processing: keep only the first seventh of the job descriptions
size=len(df['jobdescription'].index) // 7
data_description=df['jobdescription'].iloc[0:size]
#job_description = df[['jobtitle', 'jobdescription']].apply(lambda x: ' '.join(x), axis = 1) 

In [None]:
# Skills Model
#data processing
skills = data_processing(data_skills)
#models vocabs creation
model_skills = word2vec_creation(skills,min_count=20, window=2,size=300,sample=6e-5, alpha=0.03, min_alpha=0.0007, negative=20)
#models training
w2v_skills= word2vec_training(skills,model_skills)

# Job Titles Model
jobtitles = data_processing(data_jobtitles)
model_jobtitles = word2vec_creation(jobtitles,min_count=5, window=2,size=300,sample=6e-5, alpha=0.03, min_alpha=0.0007, negative=20)
w2v_jobtitles= word2vec_training(jobtitles,model_jobtitles)

In [None]:
# Description Model
descriptions = data_processing(data_description)
model_description=word2vec_creation(descriptions,min_count=20, window=2,size=300,sample=6e-5, alpha=0.03, min_alpha=0.0007, negative=20)
w2v_description= word2vec_training(descriptions,model_description)

In [None]:
# Export the models to .p files using pickle
with open('w2v_skills.p', 'wb') as f:
    pickle.dump({'model': w2v_skills}, f)

with open('w2v_jobtitles.p', 'wb') as f:
    pickle.dump({'model': w2v_jobtitles}, f)

with open('w2v_description.p', 'wb') as f:
    pickle.dump({'model': w2v_description}, f)

In [None]:
# Load the model files
file_name = "w2v_skills.p"
with open(file_name, 'rb') as pickled:
    data = pickle.load(pickled)
    model_skills = data['model']

file_name = "w2v_description.p"
with open(file_name, 'rb') as pickled:
    data = pickle.load(pickled)
    model_description = data['model']

file_name = "w2v_jobtitles.p"
with open(file_name, 'rb') as pickled:
    data = pickle.load(pickled)
    model_jobtitles = data['model']
  

# Tests
 

In [None]:
# Test the models loaded from the .p files
# skills, jobTitle, description
job=[' C++, Development, Programming, Python, Shell Script, Software','Business Intelligence Analyst','Junior to Mid-Level Business Intelligence AnalystLocation: onsite Marlborough, MAUS Citizens and Green Card Holders and those authorized to work in the US are encouraged to apply. We are unable to sponsor H1B candidates at this time.Description:We are looking for an energetic go getter who is looking to climb the career ladder!Would you like to be working with one of the worlds largest travel groups, with more than 16,000 staff worldwide. They are the world leader in the enterprise travel management space with an active global network spanning over 90 countries worldwide.  Offering a unique and welcoming culture, this rapidly growing organization works with cutting edge technology, offers a very stable, career growth-oriented environment and has excellent benefits including discounts on accommodationsWe seek a motivated person to work on the Global Technology team to enhance our technical product platform. This position requires technical ingenuity, the ability to work with people and win the hearts of our great clients. The ideal candidate will have experience working with the Microsoft BI platform and knows how to extract data from complex databases using SQL.The position will require the candidate to liaise with our global teams to manage and assist with projects through to completion. An applicant should be able to demonstrate the ability to initiate process improvements with automation to create efficiencies.  If this job is for you and you do not mind getting your hands dirty, we’ll help you grow your development skills with 2 industry leading BI platforms the Microsoft Business Intelligence Stack and the GoodData reporting suite.Responsibilities:Must have experience with: SQL Server solid understanding of relational database. 
Able to write queries with joins (basic to mid-level)Solid understanding of TSQL syntax, stored procedures and user defined functionsStrong problem solving and troubleshooting skillsSoft skills: Smart and quick learner.Respond to internal support queries, such as data and report validation questionsAdhoc analysis of travel data for internal review & external clientsDocumenting processes and proceduresAssist with support of our Web Development team Position Requirements:Hands on experience extracting data with SQL using complex joinsExperience using a business intelligence and analytic platformMust have experience working with end users and/ or technical product ownersMust be able to facilitate meetings and keep participants engagedHighly logical with proven analytical and problem-solving abilitiesExperience of gathering user requirements and translating these into codeAbility to execute tasks in a high-pressure environmentStrong interpersonal, written and oral communication skillsExperience of working both independently and in a team environmentStrong technical knowledge of the Microsoft Office SuiteExperience with SQL Server Reporting Services is a plusWorking knowledge of Sharepoint is a plus']
# skills, jobTitle, experiences
profile=['data science, python, tensorflow','Business  Analyst','Junior to Mid-Level Business Intelligence AnalystLocation: onsite Marlborough, MAUS Citizens and Green Card Holders and those authorized to work in the US are encouraged to apply. We are unable to sponsor H1B candidates at this time.Description:We are looking for an energetic go getter who is looking to climb the career ladder!Would you like to be working with one of the worlds largest travel groups, with more than 16,000 staff worldwide. They are the world leader in the enterprise travel management space with an active global network spanning over 90 countries worldwide.  Offering a unique and welcoming culture, this rapidly growing organization works with cutting edge technology, offers a very stable, career growth-oriented environment and has excellent benefits including discounts on accommodationsWe seek a motivated person to work on the Global Technology team to enhance our technical product platform. This position requires technical ingenuity, the ability to work with people and win the hearts of our great clients. The ideal candidate will have experience working with the Microsoft BI platform and knows how to extract data from complex databases using SQL.The position will require the candidate to liaise with our global teams to manage and assist with projects through to completion. An applicant should be able to demonstrate the ability to initiate process improvements with automation to create efficiencies.  If this job is for you and you do not mind getting your hands dirty, we’ll help you grow your development skills with 2 industry leading BI platforms the Microsoft Business Intelligence Stack and the GoodData reporting suite.Responsibilities:Must have experience with: SQL Server solid understanding of relational database. 
Able to write queries with joins (basic to mid-level)Solid understanding of TSQL syntax, stored procedures and user defined functionsStrong problem solving and troubleshooting skillsSoft skills: Smart and quick learner.Respond to internal support queries, such as data and report validation questionsAdhoc analysis of travel data for internal review & external clientsDocumenting processes and proceduresAssist with support of our Web Development team Position Requirements:Hands on experience extracting data with SQL using complex joinsExperience using a business intelligence and analytic platformMust have experience working with end users and/ or technical product ownersMust be able to facilitate meetings and keep participants engagedHighly logical with proven analytical and problem-solving abilitiesExperience of gathering user requirements and translating these into codeAbility to execute tasks in a high-pressure environmentStrong interpersonal, written and oral communication skillsExperience of working both independently and in a team environmentStrong technical knowledge of the Microsoft Office SuiteExperience with SQL Server Reporting Services is a plusWorking knowledge of Sharepoint is a plus']
# Weights: skills=0.4, title=0.5, description=0.1
matching = job_profile_matching(0.4, 0.5, 0.1, job, profile, model_skills, model_jobtitles, model_description)
matching