## Exploring using pre-trained sentence embeddings to identify lines in resumes that match best lines in job descriptions

a quick look

### IMPORTS

In [99]:
%matplotlib inline

from random import randint
import numpy as np
import torch
import shutil
import string
import nltk.data
import matplotlib

matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)

In [103]:
import urllib

In [332]:
import nltk
#nltk.download('punkt')

##### Referred to this tutorial initially

Working from https://github.com/facebookresearch/InferSent/blob/master/demo.ipynb & https://www.kaggle.com/jacksoncrow/infersent-demo?select=glove.840B.300d.txt

### Getting inferSent pkl file

In [3]:
%%time

# TODO: add encoder to dataset as well
# If this cell freezes, probably you haven't enabled Internet access for the notebook
! mkdir encoder
! curl -Lo encoder/infersent1.pkl https://dl.fbaipublicfiles.com/infersent/infersent1.pkl

mkdir: encoder: File exists
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  146M  100  146M    0     0  12.1M      0  0:00:12  0:00:12 --:--:-- 11.8M
CPU times: user 206 ms, sys: 80.2 ms, total: 286 ms
Wall time: 12.3 s


### Getting glove models

Glove website = https://nlp.stanford.edu/projects/glove/

In [101]:
#### https://nlp.stanford.edu/projects/glove/
url_to_glove_840B = "http://nlp.stanford.edu/data/glove.840B.300d.zip"

In [102]:
url_to_glove_6B = "http://nlp.stanford.edu/data/glove.6B.zip"

In [104]:
from urllib.request import urlopen
from tempfile import NamedTemporaryFile
from shutil import unpack_archive



In [105]:
##### DOWNLOAD
# with urlopen(url_to_glove_6B) as zipresp, NamedTemporaryFile() as tfile:
#     tfile.write(zipresp.read())
#     tfile.seek(0)
#     unpack_archive(tfile.name, 'glove.840B.300d.txt', format = 'zip')

In [106]:
ls

README.md            extract_features.py  playground_1.ipynb
[34m__pycache__[m[m/         [34mglove.6B[m[m/            [34mtokenizers[m[m/
[34mencoder[m[m/             models.py


### model parameters

In [327]:
model_version = 1
MODEL_PATH = "encoder/infersent%s.pkl" % model_version
W2V_PATH = 'glove.840B.300d.txt'
#W2V_PATH = "glove.6B/glove.6B.300d.txt"
VOCAB_SIZE = 1e5  # Load embeddings of VOCAB_SIZE most frequent words
USE_CUDA = False  # Keep it on CPU if False, otherwise will put it on GPU

### loading model with pytorch

In [328]:
from models import InferSent
params_model = {'bsize': 64, 'word_emb_dim': 300, 'enc_lstm_dim': 2048,
                'pool_type': 'max', 'dpout_model': 0.0, 'version': model_version}
model = InferSent(params_model)
model.load_state_dict(torch.load(MODEL_PATH))


<All keys matched successfully>

In [329]:
%%time
model = model.cuda() if USE_CUDA else model

model.set_w2v_path(W2V_PATH)

model.build_vocab_k_words(K=VOCAB_SIZE)

Vocab size : 100000.0
CPU times: user 8.3 s, sys: 1.41 s, total: 9.71 s
Wall time: 10 s


### Loading tokenizer to help prep data

In [333]:
tokenizer = nltk.data.load('./tokenizers/punkt/english.pickle')

### Exploration

In [330]:
sentences = ['Everyone really likes the newest benefits',
 'The Government Executive articles housed on the website are not able to be searched .',
 'I like him for the most part , but would still enjoy seeing someone beat him .',
 'My favorite restaurants are always at least a hundred miles away from my house .',
 'What a day !',
 'What color is it ?',
 'I know exactly .']

You probably want to move the punk files so next line runs!

In [334]:
ls

README.md
[34m__pycache__[m[m/
[34mencoder[m[m/
extract_features.py
[34mglove.6B[m[m/
glove.840B.300d.txt
glove.840B.300d.zip
models.py
playground_seeIfPreTrainedModelFindsBestLinesInResumeForLineInJobDescription.ipynb
[34mtokenizers[m[m/


#### fuction that uses tokenizer to help prep text data


In [335]:
def format_text(text):
    global tokenizer
    padded_text = text.translate(str.maketrans({key: " {0} ".format(key) for key in string.punctuation}))
    return tokenizer.tokenize(padded_text)

In [336]:
text = 'Everyone really likes the newest benefits. The Government Executive articles housed on the website are not able to be searched.'\
'I like him for the most part, but would still enjoy seeing someone beat him. My favorite restaurants are always at least a hundred '\
'miles away from my house. What a day! What color is it? I know exactly.'

sentences = format_text(text)
sentences

['Everyone really likes the newest benefits .',
 'The Government Executive articles housed on the website are not able to be searched .',
 'I like him for the most part ,  but would still enjoy seeing someone beat him .',
 'My favorite restaurants are always at least a hundred miles away from my house .',
 'What a day !',
 'What color is it ?',
 'I know exactly .']

In [337]:
embeddings = model.encode(sentences, bsize=128, tokenize=False, verbose=True)
print('nb sentences encoded : {0}'.format(len(embeddings)))

Nb words kept : 81/81 (100.0%)
Speed : 26.0 sentences/s (cpu mode, bsize=128)
nb sentences encoded : 7


In [338]:
embeddings

array([[ 0.10451623,  0.09512678,  0.        , ...,  0.05427713,
         0.        ,  0.03061952],
       [ 0.07801579,  0.11585563,  0.08730361, ...,  0.00039722,
         0.        ,  0.        ],
       [ 0.06004622,  0.08212356,  0.03758185, ..., -0.02280346,
        -0.03814262, -0.0099379 ],
       ...,
       [ 0.00834052,  0.0226815 ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.0709541 ,  0.13905519,  0.02034344, ...,  0.        ,
         0.        ,  0.01513716],
       [ 0.03649271,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.00840744]], dtype=float32)

In [339]:
len(embeddings[0])

4096

In [340]:
np.linalg.norm(model.encode(['the cat eats.'], bsize=128, tokenize=False, verbose=True))


Nb words kept : 4/5 (80.0%)
Speed : 33.2 sentences/s (cpu mode, bsize=128)


3.017585

### Determining cosine similarity of embedding vectors

In [341]:
def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))



In [342]:
cosine(model.encode(['the cat eats.'], bsize=128, tokenize=False, verbose=True)[0], model.encode(['the cat drinks.'], bsize=128, tokenize=False, verbose=True)[0])

Nb words kept : 4/5 (80.0%)
Speed : 40.5 sentences/s (cpu mode, bsize=128)
Nb words kept : 4/5 (80.0%)
Speed : 48.7 sentences/s (cpu mode, bsize=128)


1.0

In [343]:
cosine(model.encode(['the cat eats.'])[0], model.encode(['the cat drinks.'])[0])


0.8165239

In [344]:
cosine(model.encode(['the cat eats.'])[0], model.encode(['the cat drinks.'])[0])


0.8165239

In [345]:
sentence_one_A = "AS HEAD OF ANALYTICS YOU WILL Lead the end-to-end structuration of Quidnet as a lean digital company."
senence_one_B = "Design the company Big-Data architecture from on-site sensors to cloud processes to laptop solutions, turning data into information."

In [346]:
cosine(model.encode([sentence_one_A])[0], model.encode([senence_one_B])[0])


0.69741255

In [347]:
def compare_two_sentences(sentenceOne,sentenceTwo):
    #### Takes in two sentences as strings and compares their sentence level embedding cosine similarity and 
    #### returns scores as float between 0 and 1
    return cosine(model.encode([sentenceOne])[0], model.encode([sentenceTwo])[0])


In [348]:
test_1 = 'Thirteen years of experience as a data scientist ,  geoscientist ,  and software engineer .'
test_2 = "Research Data Scientist opportunity available to work with JPL ' s Machine Learning  &  Instrument Autonomy Group ."

In [349]:
compare_two_sentences(test_1,test_2)

0.69722444

## Now let's work with the Job Description & Resume Text Data

In [350]:
job_description_test = "Research Data Scientist opportunity available to work with JPL's Machine Learning & Instrument Autonomy Group. \
The team employs rigorous, explainable Machine Learning (ML) methods to support scientific data investigation on the ground,\
inform spacecraft operations teams in a time-critical setting.\
infuse new capabilities into JPL’s space missions, and extend the reach of scientists beyond Earth by creating autonomous “smart” instruments that can recognize and prioritize discoveries. \
We eschew “black box” ML, and often our trained models are created and examined explicitly to shed light on complex data and increase human insight rather than to automate a process. We collaborate with scientists, spacecraft operators, and engineers to identify novel ways that ML and data-driven science can help achieve their goals. \
Finally, we infuse ML solutions throughout JPL and NASA to bring about a richer, deeper, and more complete understanding of our universe from the Earth to the stars. We measure the success of our applied ML systems in terms of human hours saved, additional discoveries made, or new questions that may be addressed. \
We also frequently act as ambassadors to science and engineering areas not yet familiar with data-driven approaches by teaching, explaining, and offering assistance to those ready to receive it in the complex and often risk/change-averse arena of space exploration., \
Senior Research Data Scientist Responsibilities Include Conduct R&D of ML and related technologies for application to science and engineering challenges. \
Develop and lead algorithm and data processing methods for airborne and satellite remote sensing platforms, planetary surface missions, and novel instrument designs. \
Collaborate with domain experts from a variety of natural science and space-related engineering fields. \
Publicize practical and theoretical work through peer-reviewed conference, workshop, and journal papers. \
Facilitate the infusion of high quality, data-driven technologies throughout JPL and NASA. Provide strategic guidance on all aspects of data science at the group, section, division, and institutional level. Drive the infusion of data-driven products into JPL missions, components, instruments, and processes. Obtain competitive grant proposal funding, form multidisciplinary teams, and serve as Principle Investigator. \
The Machine Learning & Insturment Autonomy Group is part of the Instrument Software and Science Data Systems Section which provides expertise across the domains of instrument operations and science data systems with a team of multidisciplinary engineers and technologists. Our engineering teams build and operate high performance data processing, management and analysis systems capable of handling petabyte scale datasets in support of science discovery, research, operations and applications.\
The Section supports JPL and NASA missions, as well as other science-based projects. Our research and technology development teams create new onboard and ground based technologies for data processing, analysis, modeling, reasoning, visualization, management, access and analytics that are infused into our data systems. \
Typically requires degree in Computer Science, Electrical Engineering, or related technical discipline with a minimum of 9 years of experience, or a Master's with 7+ years experience or a Ph.D with 5+ years experience.\
Sustained record of high-quality ML and space data-oriented papers in respected, peer reviewed publications.\
Demonstrated Principle Investigator leadership and independent, novel research capabilities.\
Demonstrated success competitive proposal calls in space-related areas,\
Active reviewer for major data science / remote sensing journals and conferences.\
Broad knowledge of traditional and deep ML for time series, image, and spectral analysis.\
Demonstrated examples of software that are written to be maintainable, reusable, and follows best practices.\
Excellence in extracting meaningful science-relevant conclusions, identifying driving features, and performing relevant tradeoff analyses that inform data exploration, build intuition, and address specific mission needs.\
Deep expertise in and dedication to rigorous validation and explainability for all analyses.\
Strong written and oral communication skills including data visualization, presentation, and explanation of results.\
Team-building mentality, can-do attitude, and relentless commitment to project success.\
Significant programming experience with Python including NumPy / scikit-learn with high productivity.\
Experience with software engineering best practices including unit and regression testing.\
Optimizing and delivering algorithms to onboard, large-scale computing, or real-time critical systems. Working knowledge of Earth Science / Atmospheric Science / Planetary Science / Physics / Astronomy / Geology.\
"

In [351]:
job_description_test_formatted = format_text(job_description_test)
len(job_description_test_formatted)

35

In [352]:
job_description_test_formatted

["Research Data Scientist opportunity available to work with JPL ' s Machine Learning  &  Instrument Autonomy Group .",
 'The team employs rigorous ,  explainable Machine Learning  ( ML )  methods to support scientific data investigation on the ground , inform spacecraft operations teams in a time - critical setting .',
 'infuse new capabilities into JPL’s space missions ,  and extend the reach of scientists beyond Earth by creating autonomous “smart” instruments that can recognize and prioritize discoveries .',
 'We eschew “black box” ML ,  and often our trained models are created and examined explicitly to shed light on complex data and increase human insight rather than to automate a process .',
 'We collaborate with scientists ,  spacecraft operators ,  and engineers to identify novel ways that ML and data - driven science can help achieve their goals .',
 'Finally ,  we infuse ML solutions throughout JPL and NASA to bring about a richer ,  deeper ,  and more complete understanding

In [353]:
resume_text_test = "Thirteen years of experience as a data scientist, geoscientist, and software engineer.\
Has successfully delivered projects in the machine learning, data visualization, and software engineering spaces. \
Extensive record of generating new ideas, formulating strategy, and collaborating across organizational boundaries to deliver on objectives.\
Computer Language, Database, Web-development & Machine-Learning Skills: Language Databases Python, JavaScript, R, Java, Bash, PostgreSQL, Neo4J, & AWS cloud system admin. ML - Scikit-learn, TensorFlow, Keras, Weka, PyTorch, Natural language processing and speech recognition using a variety of python libraries, including CMUSphinx, DeepSpeech, NLTK, Gensim, & spaCy \
Web – Flaskpy, Nodejs, JQueryjs, Angularjs, & Reactjs, HTML, CSS, PHP, & WordPress \
Data visualization & GIS d3js, threejs, Tableau, AR/VR in Unity & JavaScript, & ESRI desktop certification\
Half my time I am a data scientist supporting a NASA data Analytics lab building machine-learning and data\
visualization prototypes for partners across NASA.\
The other half I support the NASA Open-Innovation program, responsible for operations, new features, and compliance for \
I am technical manager for one software engineer, one junior data scientist, and one to two interns.\
Presented on leveraging machine-learning to create new metadata from existing metadata at AGU 2019. This natural language processing work is also described as example for other federal agencies on strategydatagov. \
Rapid response dashboard of NASA’s COVID19 risk exposure & telework capability for senior executives.\
Delivered speech-to-text projects with focus on reusable code & enabling effective vendor/model evaluations.\
Collaborated with intern on NLP model to disambiguate authors with same name across thousands of documents.\
Mentored intern to develop natural language processing model that can be given an acronym and surrounding context words and predict correct expansion of the acronym where multiple possible definitions exist.\
Co-administrator of NASA’s internal & public GitHub instances. Webpage admin. AWS cloud system admin. \
Push adoption of new technologies in machine-learning and data visualization through consulting  proof-of-concept projects as a member of the NASA OCIO Technology & Data Div Data Analytics lab.\
Created interactive visualization to show how NASA strategic goals and objectives map to spending.\
Visualized aggregate device specific patterns in network traffic from an Internet of Things WIFI network\
Worked in multidisciplinary teams to analyze data, make predictions, communicate results, and enable decisions.\
Taught geologists, engineers, and other disciplines geology for 4 years in classroom and field setting.\
Predicted fluid connectivity in gas field development leveraging multiple linear regression and neural networks\
Predicted well-log lithology using machine-learning methods, XGBoost.\
Applied geostatistics in 3D model simulations of fluid flow & resource size used in billion-dollar investment decision.\
Recent Talks, Open-Source Software, and Community Leadership Roles Outside Job Duties \
Co-lead of Houston Data Visualization Meetup.\
Developed Wellioviz, open-source JavaScript library for visualization of well logs, supported by a startup.\
Created Predictatops, open-source Python project for applying machine-learning to chronostratigraphic well log surface correlation, and presented it at the 2019 American Assoc. of Petroleum Geologists Annual Meeting.\
Rice Data Sci Conf Talk - “Practical Considerations for Data Science Consulting in a Large Organization“\
"

In [354]:
resume_text_test_formatted = format_text(resume_text_test)
len(resume_text_test_formatted)

26

In [355]:
resume_text_test_formatted

['Thirteen years of experience as a data scientist ,  geoscientist ,  and software engineer .',
 'Has successfully delivered projects in the machine learning ,  data visualization ,  and software engineering spaces .',
 'Extensive record of generating new ideas ,  formulating strategy ,  and collaborating across organizational boundaries to deliver on objectives .',
 'Computer Language ,  Database ,  Web - development  &  Machine - Learning Skills :  Language Databases Python ,  JavaScript ,  R ,  Java ,  Bash ,  PostgreSQL ,  Neo4J ,   &  AWS cloud system admin .',
 'ML  -  Scikit - learn ,  TensorFlow ,  Keras ,  Weka ,  PyTorch ,  Natural language processing and speech recognition using a variety of python libraries ,  including CMUSphinx ,  DeepSpeech ,  NLTK ,  Gensim ,   &  spaCy Web – Flaskpy ,  Nodejs ,  JQueryjs ,  Angularjs ,   &  Reactjs ,  HTML ,  CSS ,  PHP ,   &  WordPress Data visualization  &  GIS d3js ,  threejs ,  Tableau ,  AR / VR in Unity  &  JavaScript ,   &  ESRI

#### text is prepped

In [356]:
template = {"job_name":"",
  "resume_name":"",
  "details":[{
      "job_line_number":0,
      "job_text":"",
      "scores":[{
          "resume_line_number":0,
          "resume_text":"",
          "line_similarity":0
          }    
      ],
      "top_similar_line":{0:""}
   }
  ]
  }

### Now we'll create a little scorecard function to keep track of how each line is similar

In [357]:
template

{'job_name': '',
 'resume_name': '',
 'details': [{'job_line_number': 0,
   'job_text': '',
   'scores': [{'resume_line_number': 0,
     'resume_text': '',
     'line_similarity': 0}],
   'top_similar_line': {0: ''}}]}

In [382]:
def creates_score_card(job_name,job_description_prepped,resume_name,resume_info_prepped,template_scorecard):
    #### takes in two data structures of each document. Both are array of strings
    #### approximately something like
    ### [{"id_number":0,"string":"Here is where the actual string goes in the job description":["resume_string":"Project description","resume_string_id":"String1","project_number":4,"cosine_sim":0.34]}]
    template = {}
    template['job_name']=job_name
    template['resume_name']=resume_name
    template["details"]=[]
    
    
    resume_line_number = -1
    job_description_line_number = -1
    
    for j_line in job_description_prepped:
        template_obj = {'job_line_number': 0,
                   'job_text': '',
                   'scores': [
                       {'resume_line_number': 0,
                        'resume_text': '',
                        'line_similarity': 0}
                             ],
                    'top_similar_line': {0: ''}
               }
    
        job_description_line_number += 1
        resume_line_number = 0
        template_obj['job_text'] = j_line
        template_obj['job_line_number'] = job_description_line_number
        template_obj['scores'] = []
        top_scoring_resume_line_for_this_job_line = 0
        top_scoring_resume_line_score_for_this_job_line = 0
        #print("did the line number",job_description_line_number," which is ",j_line)
        for r_line in resume_info_prepped:
            resume_line_number += 1
            scores_obj =  {'resume_line_number': 0,'resume_text': '','line_similarity': 0}
            scores_obj['resume_text'] = r_line
            scores_obj['resume_line_number'] = resume_line_number
            #print("got to this line in resume",resume_line_number)
            sim = compare_two_sentences(j_line,r_line)
            if sim > top_scoring_resume_line_score_for_this_job_line:
                top_scoring_resume_line_score_for_this_job_line = sim
                top_scoring_resume_line_for_this_job_line = resume_line_number
                
            scores_obj['line_similarity'] = sim
            #print("sim =",sim)
            template_obj['scores'].append(scores_obj)
        template_obj['top_similar_line'] = top_scoring_resume_line_for_this_job_line
        template["details"].append(template_obj)
    return template
        
            
#     for j_line in job_description_prepped:
        
    return template

In [383]:
scorecard_results = creates_score_card("job_name",job_description_test_formatted,"resume_name",resume_text_test_formatted,template)

In [379]:
#scorecard_results

In [361]:
scorecard_results.keys()

dict_keys(['job_name', 'resume_name', 'details'])

In [378]:
# scorecard_results['details'][1]

In [375]:
# scorecard_results['details'][2]

In [376]:
len(scorecard_results['details'])

35

In [377]:
# scorecard_results['details'][14]

In [366]:
scorecard_results['details'][10].keys()

dict_keys(['job_line_number', 'job_text', 'scores', 'top_similar_line'])

In [367]:
scorecard_results['details'][11]['job_text']

'Publicize practical and theoretical work through peer - reviewed conference ,  workshop ,  and journal papers .'

In [368]:
scorecard_results['details'][11]['top_similar_line']

5

In [369]:
scorecard_results['details'][11]['scores'][16]

{'resume_line_number': 17,
 'resume_text': 'Created interactive visualization to show how NASA strategic goals and objectives map to spending .',
 'line_similarity': 0.7092615}

In [370]:
def get_most_similar_lines_printed(scorecard_results,line):
    print("job description text:")
    print(scorecard_results['details'][line]['job_text'])
    top_similar_line = scorecard_results['details'][line]['top_similar_line']
    print("")
    print("most similar meaning in resume lines:")
    print(scorecard_results['details'][line]['scores'][top_similar_line]['resume_text'])
    print("")
    print("similarity:",scorecard_results['details'][line]['scores'][top_similar_line]['line_similarity'])
    print("")
    print("")

In [371]:
get_most_similar_lines_printed(scorecard_results,29)

job description text:
Strong written and oral communication skills including data visualization ,  presentation ,  and explanation of results .

most similar meaning in resume lines:
Extensive record of generating new ideas ,  formulating strategy ,  and collaborating across organizational boundaries to deliver on objectives .

similarity: 0.78247577




In [372]:
for n in range(0,34):
    get_most_similar_lines_printed(scorecard_results,n)

job description text:
Research Data Scientist opportunity available to work with JPL ' s Machine Learning  &  Instrument Autonomy Group .

most similar meaning in resume lines:
Created interactive visualization to show how NASA strategic goals and objectives map to spending .

similarity: 0.7195837


job description text:
The team employs rigorous ,  explainable Machine Learning  ( ML )  methods to support scientific data investigation on the ground , inform spacecraft operations teams in a time - critical setting .

most similar meaning in resume lines:
The other half I support the NASA Open - Innovation program ,  responsible for operations ,  new features ,  and compliance for I am technical manager for one software engineer ,  one junior data scientist ,  and one to two interns .

similarity: 0.8462393


job description text:
infuse new capabilities into JPL’s space missions ,  and extend the reach of scientists beyond Earth by creating autonomous “smart” instruments that can recog

### Results are....... not great

Some of the poorer performing lines are due to just not having a great match in the limited set of lines I supplied for the resume side. If I was doing this for real, I'd probably want to feed it every resume I've every written and then some. 

Other poorer selections had, in my mind at least, better choice available suggesting the limits of this quick of an approach.