Summary: Performed research on various NLP models that we can use in our current models. 

[**Confluence**](https://confluence.dhigroupinc.com/pages/editpage.action?pageId=121563351)

Author: Sfurti Srivastava


In [85]:
# ! pip install "tensorflow>=2.0.0"
# ! pip install --upgrade tensorflow-hub
#!pip install sent2vec
#!pip install -U sentence-transformers

Downloading Glove file

In [24]:

import os
import pandas as pd
import numpy as np
from scipy import spatial
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from IPython.display import HTML
from dsmatch import local_bucket
from dsmatch.analytics.modelevaluation import labeled_xtab, aggregate_stats_from_xtab, print_aggregate_stats

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# !wget http://nlp.stanford.edu/data/glove.42B.300d.zip
# !apt install unzip
# !unzip "glove.42B.300d.zip"

Getting dice test data

In [4]:
#sfurtianalytics-py-candidatematchdataconsolidated_annotations
path = '/home/ec2-user/SageMaker/sfurti/analytics-py-candidatematch/data/consolidated_annotations/bg_test_data.csv'

print('path =',path)
test_data=pd.read_csv(path)
test_data["jobSkills"]=test_data["jobSkills"].str.lower()
test_data["profileSkills"]=test_data["profileSkills"].str.lower()

test_data.head(2)

path = /home/ec2-user/SageMaker/sfurti/analytics-py-candidatematch/data/consolidated_annotations/bg_test_data.csv


Unnamed: 0.1,Unnamed: 0,Application ID,Job ID,Applicant Profile ID,jobTitle,jobSkills,jobDescription,profileCurrentTitle,profileDesiredTitle,profilePreviousTitle,profileSkills,user_score
0,348,243418029,0488010ad26992bdfb587b8834e3ace2,c0781818d9489e0e246ec824cd90a3d4,Senior SQL Server Database Administrator( Need...,"sql server , performance tuning, dba, ssis, ss...",<p>&nbsp;</p><p><strong>Job Title: Senior SQL ...,Full-stack .NET Developer,Full-stack .NET Developer,"Full-stack .NET Developer,Sr.Net Developer,Sr ...","vb . net,microsoft . net,c # . net,asp . net,a...",2.0
1,483,223328274,00305b6e944dc77465e73dba5df5e184,ec3fb8df40b19e582431f4aeca96cdd3,Full Stack Developer,"agile, consulting, database, developer, develo...","<br><span style=""color:#222222""> <span style=""...",Software developer,Software developer,"Altonica. C++ programmer,Software Engineer,Dir...","ajax,asp,photoshop,angularjs,animation,subvers...",4.0


In [5]:
print(test_data.shape)
test_data=test_data.dropna(subset=["jobSkills"])
print(test_data.shape)
try:
    test_data["token_jobSkills"]=test_data["jobSkills"].apply(word_tokenize)
except TypeError as e:
    pass
    


(123, 12)
(120, 12)


Lets check how many words vector of dice test_data is not present in the glove model?


In [6]:
import csv
def glove2dict(glove_filename):
    with open(glove_filename, encoding='utf-8') as f:
        reader = csv.reader(f, delimiter=' ',quoting=csv.QUOTE_NONE)
        embed = {line[0]: np.array(list(map(float, line[1:])))
                for line in reader}
    return embed
glove_path = "glove.42B.300d.txt"
pre_glove = glove2dict(glove_path)
oov=[]
for brown_nonstop in test_data["token_jobSkills"]:
    k = [token for token in brown_nonstop if token not in pre_glove.keys()]
    if k!=[]:
        #print (k)
        oov.append(k)

oov=list(map(lambda x: str(x[0]), oov))
print(oov)

['5+', 'cert/s', 'cocoapods', 'cocoapods', 'agile-waterfall', 'winforms/reporting', '5+', '5+', 'browserstack/sauce', '5+', 'philly/nj', 'asp.net,5+', 'ca/rally', '.net', '.net', 'directory/ad', 'springboot', 'html/html5', 'master/business', 'documentation/', 'pan-os', 'c/c++', '.net', 'cert/s', '.net', '.net', '.net', 'cocoapods', '.net', '5+', 'agile/devops', '.net', '.net', '.net', 'microservices', 'react.js', '.net', 'microservices', '.net', '.net', '.net', 'html/html5', 'module,3-5', '2016/', 'owb/odi']


Word found in out of vocabulary looks like most of them can be used to get the vectors.A little preprocessing is required to fix those words and get the vectors.

In [7]:
embeddings_dict = {}
with open("glove.42B.300d.txt", 'r', encoding="utf-8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        vector = np.asarray(values[1:], "float32")
        embeddings_dict[word] = vector
print(f"length of the model:{len(embeddings_dict)} words")

length of the model:1917494 words


Here we can explore what words surround around a perticular word. We are taking first 5 words.

In [38]:
def find_closest_embeddings(embedding):
    res=sorted(embeddings_dict.keys(), key=lambda word: spatial.distance.euclidean(embeddings_dict[word], embedding))
    res=res[0:5]
    res=[res]
    return res
A=find_closest_embeddings(embeddings_dict["vb.net"])
B=find_closest_embeddings(embeddings_dict["dotnet"])
display(HTML("First 5 closest_embeddings"))
display(HTML(f"<u>{A[0][0]}</u>"))
print(A[0][1:])
display(HTML(f"<u>{B[0][0]}</u>"))
print(B[0][1:])


['vb6', 'vb', 'asp.net', 'vbscript']


['vb.net', 'vb', 'systemidletimer', 'static-detail']


`Toy Problem` to get the similarity scores between 2 sets of skills, lets say col_1 and col_2 are the lists of skills.

In [39]:
df_ = pd.DataFrame(columns=["col_1","col_2"])
df_["col_1"]=A
df_["col_2"]=B
df_

Unnamed: 0,col_1,col_2
0,"[vb.net, vb6, vb, asp.net, vbscript]","[dotnet, vb.net, vb, systemidletimer, static-d..."


In [13]:
df_exp=df_.explode('col_1')
df_exp.reset_index(inplace=True)
df_exp=df_exp.drop(columns="index")
df_exp

Unnamed: 0,col_1,col_2
0,vb.net,"[dotnet, vb.net, vb, systemidletimer, static-d..."
1,vb6,"[dotnet, vb.net, vb, systemidletimer, static-d..."
2,vb,"[dotnet, vb.net, vb, systemidletimer, static-d..."
3,asp.net,"[dotnet, vb.net, vb, systemidletimer, static-d..."
4,vbscript,"[dotnet, vb.net, vb, systemidletimer, static-d..."


Getting its Cosine similarity

In [14]:
 


try:
    k=[]
    for label,row in df_exp.iterrows():
        

        z=[]
        x = row[0]
        for y in row[1]:
            #print(y)
#             if y!=row[0] :
                #print("----------------------")
                #print(f'x : {x} and y : {y}')
            v1= embeddings_dict[x]
            v2=embeddings_dict[y]
            X=np.array([v1,v2])
            val=(1-(spatial.distance.pdist(X,'cosine')[0]))
            z.append(val)
            
                
        k.append(z)
    score=pd.Series(k)
    df_exp["cosine_score"]=pd.DataFrame(score)
    minimum=[]
    maximum=[]
    mean=[]
    for c in df_exp["cosine_score"]:
        minimum.append(min(c))
        maximum.append(max(c))
        mean.append(sum(c) / len(c))
    df_exp["min"]=pd.DataFrame(minimum)
    df_exp["max"]=pd.DataFrame(maximum)
    df_exp["mean"]=pd.DataFrame(mean)

               
except TypeError:
    pass

df_exp           
            

       

       

Unnamed: 0,col_1,col_2,cosine_score,min,max,mean
0,vb.net,"[dotnet, vb.net, vb, systemidletimer, static-d...","[0.6601479389185724, 1.0, 0.719454824288189, 0...",0.168903,1.0,0.591865
1,vb6,"[dotnet, vb.net, vb, systemidletimer, static-d...","[0.4898636955393918, 0.7656895397964274, 0.651...",0.126589,0.76569,0.475317
2,vb,"[dotnet, vb.net, vb, systemidletimer, static-d...","[0.5703884329575297, 0.719454824288189, 1.0, 0...",0.061652,1.0,0.552051
3,asp.net,"[dotnet, vb.net, vb, systemidletimer, static-d...","[0.5637284830193972, 0.748919528091196, 0.5872...",0.147627,0.74892,0.496888
4,vbscript,"[dotnet, vb.net, vb, systemidletimer, static-d...","[0.42715952646160105, 0.6877714786083069, 0.62...",0.132094,0.687771,0.463385


In [58]:
v1= embeddings_dict["c++"]
v2=embeddings_dict["c/c"]
X=np.array([v1,v2])
val=(1-(spatial.distance.pdist(X,'cosine')[0]))
val

0.6886931397646269

**BERT**

In [47]:
from sent2vec.vectorizer import Vectorizer
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModel
from scipy import spatial
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")


HBox(children=(FloatProgress(value=0.0, description='Downloading', max=461.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=231508.0, style=ProgressStyle(descripti…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=2.0, style=ProgressStyle(description_wi…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=112.0, style=ProgressStyle(description_…




HBox(children=(FloatProgress(value=0.0, description='Downloading', max=438006864.0, style=ProgressStyle(descri…




In [48]:
path = '/home/ec2-user/SageMaker/sfurti/analytics-py-candidatematch/data/consolidated_annotations/bg_test_data.csv'

test_data=pd.read_csv(path)
test_data["jobSkills"]=test_data["jobSkills"].str.lower()
test_data["profileSkills"]=test_data["profileSkills"].str.lower()


test_data=test_data.dropna(subset=["jobSkills","profileSkills"])


In [49]:
for x in test_data["jobSkills"]:
    l=len(x)
    if l >= 1700:
        print(l)

Exploring BERT model, here the maximum characters it is taking is less 1727, which is constarin because profile skills length is more. To check the model I am splitting the characters after 1700.But we can get the vector of any thing. 

**Toy problem**

In [55]:

sen=["c++","dotnet"]
vectorizer = Vectorizer()
vectorizer.bert(sen)
vectors_bert = vectorizer.vectors
cosine_bert = (1-spatial.distance.cosine(vectors_bert[0], vectors_bert[1]))
cosine_bert

0.9183284044265747

In [59]:

r=[]
for k in test_data["profileSkills"]:
    k=k[0:1726]
    r.append(k)
test_data["Length_ps"]=pd.DataFrame(r) 
len(test_data["Length_ps"][0])



1726

In [60]:
k=test_data[0:20]
p=[]
for idx, row in k.iterrows():
    a=row["jobSkills"]
    b=row["Length_ps"]
    sentences = [a,b] 
    vectorizer = Vectorizer()
    vectorizer.bert(sentences)
    vectors_bert = vectorizer.vectors
    cosine_bert = (1-spatial.distance.cosine(vectors_bert[0], vectors_bert[1]))
    #print(cosine_bert)
    p.append(cosine_bert)
    
k["cosine_bert"]=pd.DataFrame(p) 

    

    


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Checking the cosine similarirty by sending 1726 characters of profile skills and job skills 

**if score is < 20 its a 1**

**if score is between  20-40 its a 2**

**if score is between  40-60 its a 3**

**if score is between  60-80 its a 4**

**if score is between  80-100 its a 5**

In [73]:
e=[]
for x in k["cosine_bert"]:
    if x<.20:
        d=1
    elif x >.20 and x< .40:
        d=2
    elif x >.40 and x< .60:
        d=3
    elif x >.60 and x< .80:
        d=4
    else :
        d=5
    e.append(d)

k["BERT_score"]=pd.DataFrame(e) 
k[["user_score","cosine_bert","BERT_score"]][0:5]
    

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


Unnamed: 0,user_score,cosine_bert,BERT_score
0,2.0,0.535574,3.0
1,4.0,0.608466,4.0
2,4.0,0.620449,4.0
3,4.0,0.488343,3.0
5,3.0,0.610064,4.0


In [77]:

df_xtab = labeled_xtab(k, pred_col='BERT_score', labeled_col='user_score')
d_stats = aggregate_stats_from_xtab(df_xtab)
print_aggregate_stats(d_stats)
display(HTML(df_xtab.to_html()))

Total number of records: 18
Total exact matches: 1
Percent exact: 5.6%
Percent one-half 1 off: 36.1%
Percent Gaussian rolloff: 45.6%


user_score,1.0,2.0,3.0,4.0
BERT_score,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
3.0,1,4,1,2
4.0,1,0,4,3
5.0,0,2,0,0


In [413]:
k[["user_score","cosine_bert"]][0:5]

Unnamed: 0,user_score,cosine_bert
0,2.0,0.535574
1,4.0,0.608466
2,4.0,0.620449
3,4.0,0.488343
5,3.0,0.610064


In [78]:
k.columns

Index(['Unnamed: 0', 'Application ID', 'Job ID', 'Applicant Profile ID',
       'jobTitle', 'jobSkills', 'jobDescription', 'profileCurrentTitle',
       'profileDesiredTitle', 'profilePreviousTitle', 'profileSkills',
       'user_score', 'Length_ps', 'cosine_bert', 'BERT_score'],
      dtype='object')

let check on job Titles 

In [79]:
k=test_data[0:20]
p=[]
for idx, row in k.iterrows():
    a=row["jobTitle"]
    b=row["profileCurrentTitle"]
    sentences = [a,b] 
    vectorizer = Vectorizer()
    vectorizer.bert(sentences)
    vectors_bert = vectorizer.vectors
    cosine_bert = (1-spatial.distance.cosine(vectors_bert[0], vectors_bert[1]))
    #print(cosine_bert)
    p.append(cosine_bert)
    
k["cosine_bert"]=pd.DataFrame(p) 


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [80]:
k[["jobTitle","profileCurrentTitle","cosine_bert"]][0:5]

Unnamed: 0,jobTitle,profileCurrentTitle,cosine_bert
0,Senior SQL Server Database Administrator( Need...,Full-stack .NET Developer,0.846741
1,Full Stack Developer,Software developer,0.957192
2,Scrum Master,"Business Development Manager,Freelancing finan...",0.826734
3,Sr. Oracle DBA,Senior Software Engineer,0.833804
5,Business Analyst / Project Manager,Sr. Business System Analyst,0.820757


In [None]:
e=[]
for x in k["cosine_bert"]:
    if x<.20:
        d=1
    elif x >.20 and x< .40:
        d=2
    elif x >.40 and x< .60:
        d=3
    elif x >.60 and x< .80:
        d=4
    else :
        d=5
    e.append(d)

k["BERT_score"]=pd.DataFrame(e) 
k[["user_score","jobTitle","profileCurrentTitle","cosine_bert","BERT_score"]][0:5]

In [84]:
k[["user_score","jobTitle","profileCurrentTitle","cosine_bert","BERT_score"]][0:10]

Unnamed: 0,user_score,jobTitle,profileCurrentTitle,cosine_bert,BERT_score
0,2.0,Senior SQL Server Database Administrator( Need...,Full-stack .NET Developer,0.846741,5.0
1,4.0,Full Stack Developer,Software developer,0.957192,5.0
2,4.0,Scrum Master,"Business Development Manager,Freelancing finan...",0.826734,5.0
3,4.0,Sr. Oracle DBA,Senior Software Engineer,0.833804,5.0
5,3.0,Business Analyst / Project Manager,Sr. Business System Analyst,0.820757,5.0
6,3.0,Information Security Assurance Analyst,"IT GOVERNANCE, RISK, and COMPLIANCE AUDITS",0.960751,5.0
7,3.0,Solutions Engineer,"Pro, Software Engineer",0.804113,5.0
8,2.0,Active Directory Engineer,"JWICS(SCI) Technician,JWICS(SCI) Technician,JW...",0.926829,5.0
9,2.0,Developer,Sr. SOA Developer,0.805089,5.0
10,1.0,Developer,Sr. API/IIB Developer,0.96752,5.0


In the Job Titles Bert has scored 5 to all the Job titles. Even if we ignore the user score is the overall score for skills and Job Titles, still it is not making a 5. 

**Recommendation:** Recommending to use the GloVe pre-trained model, we can build a model on the top of the Glove model for classification. It has its inbuilt corpus and we do not require to build our own.  For the missing vocabulary, we can add it as well. It can be used for skill to skill mapping. It is faster than other new NLP models to train, scalable to huge corpora, and easier to implement. Not to use BERT at this time because our document length is bigger than its limit, takes a much longer time to train the model, and the process is slower. 

