# WMD And Cosine Similarity Applied To CVs and Jobs.

Based on: see example 4 of "1-WMD_And_Cosine_Similarity_General".

### About
We apply two techniques, WMD (Word Mover's Distance) and cosine similarity, to measure distances and similarities between texts.

We run 3 comparisons:
* 1-Between the content of the CVs and the job descriptions.
* 2-Between the content of the CVs and the job keywords obtained through keyword extraction.
* 3-Between the content of the CVs and hand-picked (custom) job keywords.

The datasets are:
* 6 CVs in PDF format. PDF location: "Archivos 2-Especificos/Ejemplo1_Dataset_CVs_And_Job_Desc/EN/PDFs_CVs".
* 3 Data Scientist job descriptions previously scraped from Indeed. CSV location: "Archivos 2-Especificos/Ejemplo1_Dataset_CVs_And_Job_Desc/EN/Job_Descr/Job_Positions_Scrapped/Data_Scientist_indeed.csv".

### Index:


* 1.1-Importing the required libraries.


* 1.2-Cosine Similarity and WMD explained.


* 1.3-Building and reading the 2 datasets.
    * 1.3.1-Jobs dataset.
        * 1.3.1.1-Adding a 'Keywords' column to our DF, obtained through keyword extraction.
    * 1.3.2-CVs dataset.


* 1.4- Running the comparisons.
     * 1.4.1- Using the downloaded Word2vec model.
     * 1.4.2- Functions needed for cosine similarity and WMD.
     * 1.4.3- Comparisons.
         * 1.4.3.1-Between the content of the CVs and the job descriptions.
         * 1.4.3.2-Between the content of the CVs and the job keywords obtained through keyword extraction.
         * 1.4.3.3-Between the content of the CVs and hand-picked (custom) job keywords.


* 1.5- Using GloVe instead of word2vec to apply WMD.


* 1.6- Using spaCy instead of word2vec or GloVe to apply WMD.

## 1.1-Importing the required libraries.

In [1]:
from gensim.models import KeyedVectors
import matplotlib.pyplot as  plt
from collections import Counter
import pandas as pd
import numpy as np
import random
import string
import math
import re
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/fedricio/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 1.2-Cosine Similarity and WMD explained.

#### 1. Cosine Similarity
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.

![Alt Text](https://i.imgur.com/HqKjGoQ.jpg)
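
As a minimal sketch of this definition, the snippet below computes the cosine of the angle between two word-count vectors using only the standard library. The function name `cosine_from_counts` and the toy token lists are illustrative assumptions, not part of the notebook.

```python
import math
from collections import Counter

def cosine_from_counts(words_a, words_b):
    """Cosine similarity between the word-count vectors of two token lists."""
    vec_a, vec_b = Counter(words_a), Counter(words_b)
    # Dot product: only words present in both documents contribute.
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Cosine is undefined for a zero vector; 0.0 is a convention here.
    return dot / (norm_a * norm_b)

same = cosine_from_counts(["python", "sql"], ["python", "sql"])
half = cosine_from_counts(["python", "sql"], ["python", "java"])
none = cosine_from_counts(["python"], ["football"])
print(same, half, none)
```

Identical token lists score (numerically) 1, lists sharing one of two words score about 0.5, and lists with no word in common score 0.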


#### 2. Word Mover's Distance
Word Mover's Distance (WMD) uses the word embeddings of the words in two texts to measure the minimum distance that the words in one text need to travel in semantic space to reach the words in the other text.

Each word is represented by its word2vec vector, and the cost of moving one word onto another is the Euclidean distance between their vectors. WMD is the minimum total cost of moving all the words of one document onto the words of the other. If the distance is small, the words in the two documents are close to each other.

For example, take these two sentences:
- sentence 1: "Obama speaks to the media in Illinois"
- sentence 2: "The president greets the press in Chicago"

After removing stopwords, the Word Mover's Distance between them is small, as shown in the figure.


![Alt Text](https://imgur.com/L1QNfPK.jpg)
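
To make the idea of words "travelling" concrete, here is a rough sketch with made-up 2-D vectors standing in for word2vec embeddings (the vectors, `toy_vectors` and `relaxed_wmd` are assumptions for illustration only). It computes a relaxed variant in which every word simply moves to its nearest counterpart; the full WMD instead solves an optimal-transport problem, for which this relaxed value is a lower bound.

```python
import math

# Hypothetical 2-D "embeddings"; real word2vec vectors have 300 dimensions.
toy_vectors = {
    "obama": (1.0, 1.0), "president": (1.0, 1.5),
    "speaks": (4.0, 1.0), "greets": (4.0, 1.5),
    "media": (1.0, 4.0), "press": (1.0, 4.5),
    "illinois": (4.0, 4.0), "chicago": (4.0, 4.5),
    "banana": (9.0, 9.0),
}

def relaxed_wmd(doc_a, doc_b):
    """Average distance from each word of doc_a to its nearest word in doc_b."""
    total = 0.0
    for word_a in doc_a:
        total += min(math.dist(toy_vectors[word_a], toy_vectors[word_b])
                     for word_b in doc_b)
    return total / len(doc_a)

sentence_1 = ["obama", "speaks", "media", "illinois"]
sentence_2 = ["president", "greets", "press", "chicago"]
close = relaxed_wmd(sentence_1, sentence_2)   # Every word has a near neighbour.
far = relaxed_wmd(sentence_1, ["banana"])     # No word is close to "banana".
print(close, far)
```

Although the two sentences share no words, their relaxed distance is small because each word has a semantically close neighbour in the other sentence.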

## 1.3-Building and reading the 2 datasets.

* 1-Jobs dataset.
* 2-CVs dataset.

### 1.3.1-Jobs dataset.

Built from the CSV "Data_Scientist_indeed.csv", keeping only the job_description and job_title columns. A keywords column is added afterwards.

In [2]:
df_Jobs = pd.read_csv("Archivos 2-Especificos/Ejemplo1_Dataset_CVs_And_Job_Desc/EN/Job_Descr/Job_Positions_Scrapped/Data_Scientist_indeed.csv")
df_Jobs = df_Jobs.drop(['company', 'location'], axis=1) #Drop unneeded columns.
df_Jobs = df_Jobs.rename(columns={'job_title': 'Job_Title', 'job_description':'Job_Description'})
df_Jobs = df_Jobs.iloc[0:3] #Keep only the first 3 rows.

df_Jobs

Unnamed: 0,Job_Title,Job_Description
0,Data Scientist I,"Master or PhD in Computer Science, Machine Lea..."
1,Data Scientist,"Full-timeVancouver, BC\n\nJUNE 9, 2020\nJob ID..."
2,Data Scientist,Minimum Qualifications\nMaster’s degree or abo...


###  1.3.1.1-Adding a 'Keywords' column to our DF, obtained through keyword extraction.

For this, first see the general example (example number 5) of the notebook '1-WMD_And_Cosine_Similarity_General'.

#### Summary:
Document -> Remove stop words -> Find Term Frequency (TF) -> Find Inverse Document Frequency (IDF) -> Find TF*IDF -> Get top N Keywords
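
The pipeline above can be sketched compactly as follows. This is an illustrative, standard-library version in which each sentence plays the role of a document for the IDF term; the tiny stop-word set and the name `top_keywords` are assumptions, and its scores will not necessarily match those of the functions defined in this notebook.

```python
import math
from collections import Counter

STOP_WORDS = {"i", "a", "is", "to", "am", "in"}  # Tiny hypothetical stop-word list.

def top_keywords(doc, n):
    """Rank the words of doc by TF * IDF, treating each sentence as a document."""
    sentences = [s.split() for s in doc.lower().split(".") if s.strip()]
    words = [w for s in sentences for w in s if w not in STOP_WORDS]
    # TF: occurrences of the word divided by the number of kept words.
    tf = {w: c / len(words) for w, c in Counter(words).items()}
    # IDF: log(number of sentences / number of sentences containing the word).
    idf = {w: math.log(len(sentences) / sum(w in s for s in sentences)) for w in tf}
    scores = {w: tf[w] * idf[w] for w in tf}
    return sorted(scores, key=scores.get, reverse=True)[:n]

doc = "Python is easy. Python is fun. Learning Python takes time."
keywords = top_keywords(doc, 3)
print(keywords)
```

Because "python" appears in every sentence, its IDF is log(1) = 0 and it drops out of the top keywords, even though it is the most frequent word.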

We define the main functions to use:

In [3]:
from nltk import tokenize
from operator import itemgetter
import math

def get_top_n(dict_elem, n):
    result = dict(sorted(dict_elem.items(), key = itemgetter(1), reverse = True)[:n]) 
    return result
    
def check_sent(word, sentences): 
    final = [all([w in x for w in word]) for x in sentences] 
    sent_len = [sentences[i] for i in range(0, len(final)) if final[i]]
    return int(len(sent_len))
   
def get_keywords(doc,n):
    from nltk import tokenize
    from operator import itemgetter
    import math

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize 
    stop_words = set(stopwords.words('english'))

    total_words = doc.split()
    total_word_length = len(total_words)
    # print(total_word_length)
    
    total_sentences = tokenize.sent_tokenize(doc)
    total_sent_len = len(total_sentences)
    # print(total_sent_len)
    
    #TF computation:
    tf_score = {}
    for each_word in total_words:
        each_word = each_word.replace('.','')
        if each_word not in stop_words:
            if each_word in tf_score:
                tf_score[each_word] += 1
            else:
                tf_score[each_word] = 1

    # Dividing by total_word_length for each dictionary element
    tf_score.update((x, y/int(total_word_length)) for x, y in tf_score.items())
    #print(tf_score)
    
    #IDF computation:
    idf_score = {}
    for each_word in total_words:
        each_word = each_word.replace('.','')
        if each_word not in stop_words:
            if each_word in idf_score:
                idf_score[each_word] = check_sent(each_word, total_sentences)
            else:
                idf_score[each_word] = 1

    # Performing a log and divide
    idf_score.update((x, math.log(int(total_sent_len)/y)) for x, y in idf_score.items())
    #print(idf_score)

    #TF * IDF computation:
    tf_idf_score = {key: tf_score[key] * idf_score.get(key, 0) for key in tf_score.keys()}
    # print(tf_idf_score)

    #Get the top n words:
    n_palabras_top = get_top_n(tf_idf_score, n)
    # print(n_palabras_top)
    
    #Get the list of keywords:
    keyword_list = list(n_palabras_top.keys())   #n_palabras_top is a dict.
    
    return keyword_list

In [4]:
#Example to check that the functions above work correctly:
doc = 'I am a graduate. I want to learn Python. I like learning Python. Python is easy. Python is interesting. Learning increases thinking. Everyone should invest time in learning'
keyword_list = get_keywords(doc,5)  #We pass the document (a string) and the number of keywords we want.
#dic.keys()
keyword_list

['I', 'learning', 'Python', 'graduate', 'want']

In [5]:
#Now we apply the function get_keywords(doc, n) to our DF to generate the 'Keywords' column:
df_Jobs

Unnamed: 0,Job_Title,Job_Description
0,Data Scientist I,"Master or PhD in Computer Science, Machine Lea..."
1,Data Scientist,"Full-timeVancouver, BC\n\nJUNE 9, 2020\nJob ID..."
2,Data Scientist,Minimum Qualifications\nMaster’s degree or abo...


In [6]:
df_Jobs['Keywords'] = df_Jobs['Job_Description'].apply(get_keywords, args=[10])
df_Jobs

Unnamed: 0,Job_Title,Job_Description,Keywords
0,Data Scientist I,"Master or PhD in Computer Science, Machine Lea...","[Amazon, #0000, NLU, experience, Stock, Machin..."
1,Data Scientist,"Full-timeVancouver, BC\n\nJUNE 9, 2020\nJob ID...","[You, work, We’re, Job, 3022, &, Excellent, 20..."
2,Data Scientist,Minimum Qualifications\nMaster’s degree or abo...,"[Experience, Qualifications, Understanding, pr..."


In [7]:
#Turn the list of strings in the Keywords column into a single string separated by ', ':
df_Jobs['Keywords'] = [', '.join(map(str, l)) for l in df_Jobs['Keywords']]

#Lowercase:
df_Jobs['Keywords'] = df_Jobs['Keywords'].apply(lambda x: x.lower() if isinstance(x,str) else x) #Lowercase all keywords (important for the comparisons later on).
df_Jobs['Job_Description'] = df_Jobs['Job_Description'].apply(lambda x: x.lower() if isinstance(x,str) else x) #Lowercase the Job_Description as well.

df_Jobs

Unnamed: 0,Job_Title,Job_Description,Keywords
0,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine..."
1,Data Scientist,"full-timevancouver, bc\n\njune 9, 2020\njob id...","you, work, we’re, job, 3022, &, excellent, 202..."
2,Data Scientist,minimum qualifications\nmaster’s degree or abo...,"experience, qualifications, understanding, pro..."


### 1.3.2-CVs dataset.

Built by extracting all the text from the English CVs located in "Archivos 2-Especificos/Ejemplo1_Dataset_CVs_And_Job_Desc/EN/PDFs_CVs".

In [8]:
#Read the CVs stored in our folder and extract them one by one with the PyPDF2 library, which gives us
#a sequence of strings.
import PyPDF2
import os
import collections
from os import listdir
from os.path import isfile, join

pathCVs='Archivos 2-Especificos/Ejemplo1_Dataset_CVs_And_Job_Desc/EN/PDFs_CVs' #Relative path to the folder.  #Previously an absolute path: r'C:\Users\calon\Desktop\Notebooks\Resume-Scoring-using-NLP-master\Resume-Scoring-using-NLP-master\Resumes'
onlyfiles = [os.path.join(pathCVs, f) for f in os.listdir(pathCVs) if os.path.isfile(os.path.join(pathCVs, f))]

print("Number of CVs extracted:", len(onlyfiles))

Number of CVs extracted: 6


In [9]:
#Function to extract the text of a CV:
def pdfextract(file):
    #Uses the legacy PyPDF2 API (< 3.0); PyPDF2 >= 3.0 renames these calls to
    #PdfReader, len(reader.pages), reader.pages[i] and extract_text().
    page_content = ""
    with open(file, 'rb') as pdf_file:  #Closes the file once reading is done.
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        for i in range(read_pdf.getNumPages()):  #Visit the pages in order.
            page = read_pdf.getPage(i)
            page_content += page.extractText()
    return (page_content.encode('utf-8'))

In [10]:
def extract_text(file):
    text = pdfextract(file).decode('utf-8')
    text = text.replace("\n", "")
    text = text.lower()
    return text

In [11]:
#Get the full text of every CV, without any preprocessing:
rows = []
for file in onlyfiles:
    base = os.path.basename(file)  #Test_Phoebe Buffay.pdf
    filename = os.path.splitext(base)[0]  #Test_Phoebe Buffay
    rows.append({'Candidate_Name':filename, 'Content_CV':extract_text(file)})
#Build the DF in one step (DataFrame.append was removed in pandas 2.0).
df_Candidates = pd.DataFrame(rows, columns = ['Candidate_Name','Content_CV'])

df_Candidates

Unnamed: 0,Candidate_Name,Content_CV
0,Test_Chandler,chandler bing chandler.bing@vit.ac.in 85111926...
1,Test_MeghnaLohani,meghna lohani campus address meghna.lohan...
2,Test_AmanSharma,aman sharma campus address ...
3,Federico_Calonge_HCM,...
4,Test_VAISHALI BIJOY,vaishali bijoy vaishali.bijoy2016@vitstudent.a...
5,Test_Phoebe Buffay,phoebe buffay phobe.buffaycat@vit.ac.in 85111...


In [12]:
first_content_cv = df_Candidates.iloc[0]['Content_CV']
first_content_cv

'chandler bing chandler.bing@vit.ac.in 8511192673 education  l date of birth 12 may 1998  languages known hindi, english, spanish  coursework machine learning data visualization operating systems database management system discrete maths   skills c++        java python        arduino android       html php        javascript mysql        r studio keras        nltk parser        sql   experience  android club coordinator  front end developer in technovit website team  key entertainer in friends   projects good for females an android application for women and health safety, emergency mode using android modules, safest route using image processing, disease prediction using svm and linear regression.  face detection and recognition using openmp opencv, openmp, cuda  smart farming arduino coding, sensors , thingspeak  nlp resume parsing made resume parser based on nltk, spacy and other nlp tools    co-curricular activities reading novels, painting     '

## 1.4- Running the comparisons.

### 1.4.1- Using the downloaded Word2vec model.

In [13]:
#Link the file was downloaded from: https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
EMBEDDING_FILE = '/home/fedricio/Desktop/Embeddings_Utilizados/Word2vec/GoogleNews-vectors-negative300.bin.gz'
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)

###  1.4.2- Functions needed for cosine similarity and WMD.

In [14]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('stopwords')
#The packages above are downloaded LOCALLY to home/user/nltk_data.

[nltk_data] Downloading package punkt to /home/fedricio/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /home/fedricio/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [15]:
def tokenize(text):
    #Converts each text received as input into a list of tokens (words).
    #Everything is lowercased; punctuation, non-alphabetic tokens and stop words are removed.
        #input  --> text: a LIST of documents, each one a STRING.
        #output --> texts_tokens: a LIST with one list of tokens (strings) per document.
    texts_tokens=[]
    for line in text:
        tokens=word_tokenize(line)
        tok=[w.lower() for w in tokens]
        table=str.maketrans('','',string.punctuation)
        strpp=[w.translate(table) for w in tok]
        words=[word for word in strpp if word.isalpha()]
        stop_words=set(stopwords.words('english'))
        words=[w for w in words if not w in stop_words]
        texts_tokens.append(words)
    return texts_tokens

def text_to_vector(text):
    #Converts the text of the document into a "term matrix" that lists every word together with
    #the number of times it appears in the text.
        #input  --> text: the document as a single STRING inside a LIST.
        #output --> Term matrix: a Counter mapping each word of the document to its frequency.
        
    texts_tokens = tokenize(text)
    #print(texts_tokens)
    return Counter(texts_tokens[0]) #[0] because it is a list inside another list.

def get_cosine(doc1, doc2):
    #Get the cosine similarity between two documents.
    #Depends on the angle between two non zero vectors which are constructed by each word frequency in the two documents.
        #input  --> doc1: the first document as STRING.
                    #doc2: the second document as STRING.
        #output --> cosine similarity score

    vec1 = text_to_vector([doc1])
    vec2 = text_to_vector([doc2])
    intersection = set(vec1.keys()) & set(vec2.keys())
    numerator = sum([vec1[x] * vec2[x] for x in intersection])

    sum1 = sum([vec1[x] ** 2 for x in list(vec1.keys())])
    sum2 = sum([vec2[x] ** 2 for x in list(vec2.keys())])
    denominator = math.sqrt(sum1) * math.sqrt(sum2)

    if not denominator:
        return 0.0
    else:
        return float(numerator) / denominator
        
def wordMdistance(doc1,doc2):
    #Return a relaxed approximation of the word mover distance between two documents:
    #the average, over the words of doc1, of the distance to the closest word of doc2.
     #input  --> doc1: the first document as list of words
                 #doc2: the second document as list of words
     #output --> approximate Word Mover's Distance score
    sum_dist = 0
    i = 0
    for word in doc1:
        mindist = 1000.0   #Cap used when no word of doc2 can be compared with this word.
        for word2 in doc2:
            try:
                j = np.copy(word2vec.get_vector(word))
                t = np.copy(word2vec.get_vector(word2))
                dista = np.sqrt(sum((j-t)**2))
                if(dista < mindist):
                    mindist = dista
            except KeyError:   #Word missing from the word2vec vocabulary.
                continue
        sum_dist+=mindist
        i+=1
    return sum_dist/i

def WMD(doc1,doc2):
      #Preprocess the document first and remove english stopwords then call the function that calculates the word mover distance
      #input --> doc1: the first document as STRING.
                 #doc2: the second document as STRING.
      #output --> Word Mover's Distance score
    #print(doc1)
    first_doc = tokenize([doc1])[0]     #[0] because it is a list inside another list.
    #print(first_doc)
    second_doc = tokenize([doc2])[0]    #[0] because it is a list inside another list.
    return (word2vec.wmdistance(second_doc, first_doc))

#### Test to check that the text_to_vector function works

In [16]:
#As an example we take the description of the first job in the DF "df_Jobs":
text_description_1st_job = df_Jobs.iloc[0]['Job_Description']
text_description_1st_job

"master or phd in computer science, machine learning, statistics or a related quantitative field.\n2+ years of hands-on experience in applied machine learning, and predictive modeling and analysis.\nalgorithm and model development experience for large-scale applications\nexperience using python, or other programming or scripting language, as well as with r, matlab.\nsolid understanding of foundational statistics concepts, nlu and ml algorithms: linear/logistic regression, random forest, boosting, gbm, nns, etc.\nlanguage required for job: english\n#0000\n\nterms of employment: full time, permanent\n\njob location: 510 west georgia street, vancouver, v6b 0m3\n\nas a machine learning scientist, you will work with software developers and other teams to design and implement nlu models for how customers use and interact with smart devices in their homes. you will help lay the foundation to move from directed device interactions to learned behaviors that enable alexa to proactively take acti

In [17]:
#Use the text_to_vector() function on the job description:
vector_test_description = text_to_vector([text_description_1st_job])  #Remember to pass the text inside a LIST.
vector_test_description

Counter({'master': 1,
         'phd': 1,
         'computer': 1,
         'science': 1,
         'machine': 3,
         'learning': 3,
         'statistics': 2,
         'related': 1,
         'quantitative': 1,
         'field': 1,
         'years': 2,
         'handson': 1,
         'experience': 4,
         'applied': 1,
         'predictive': 1,
         'modeling': 1,
         'analysis': 1,
         'algorithm': 1,
         'model': 1,
         'development': 1,
         'largescale': 1,
         'applications': 1,
         'using': 1,
         'python': 1,
         'programming': 1,
         'scripting': 1,
         'language': 2,
         'well': 1,
         'r': 1,
         'matlab': 1,
         'solid': 1,
         'understanding': 1,
         'foundational': 1,
         'concepts': 1,
         'nlu': 2,
         'ml': 1,
         'algorithms': 1,
         'linearlogistic': 1,
         'regression': 1,
         'random': 1,
         'forest': 1,
         'boosting': 1,
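
Tokens such as 'handson', 'largescale' and 'linearlogistic' in the output above come from stripping punctuation inside a token before filtering. The sketch below reproduces that effect with the standard library only; it splits on whitespace instead of calling NLTK's `word_tokenize`, and `simple_tokens` and its small stop-word set are hypothetical stand-ins, so it approximates `tokenize` rather than replacing it.

```python
import string

STOP_WORDS = {"of", "or", "in", "and", "a", "the"}  # Hypothetical subset of NLTK's list.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def simple_tokens(text):
    """Lowercase, strip punctuation, keep alphabetic non-stop-word tokens."""
    tokens = []
    for raw in text.lower().split():
        cleaned = raw.translate(PUNCT_TABLE)   # "hands-on" -> "handson"
        if cleaned.isalpha() and cleaned not in STOP_WORDS:
            tokens.append(cleaned)
    return tokens

sample = "2+ years of hands-on experience, linear/logistic regression."
tokens = simple_tokens(sample)
print(tokens)
```

Note that "2+" is dropped entirely: once the "+" is removed, the remaining "2" is not alphabetic.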
      

###  1.4.3- Comparisons.

The comparisons to run are:

    1-Between the content of the CVs and the job descriptions.
    2-Between the content of the CVs and the job keywords obtained through keyword extraction.
    3-Between the content of the CVs and hand-picked (custom) job keywords.

#### 1.4.3.1-Between the content of the CVs and the job descriptions.

In [18]:
df_Jobs

Unnamed: 0,Job_Title,Job_Description,Keywords
0,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine..."
1,Data Scientist,"full-timevancouver, bc\n\njune 9, 2020\njob id...","you, work, we’re, job, 3022, &, excellent, 202..."
2,Data Scientist,minimum qualifications\nmaster’s degree or abo...,"experience, qualifications, understanding, pro..."


In [19]:
df_Candidates

Unnamed: 0,Candidate_Name,Content_CV
0,Test_Chandler,chandler bing chandler.bing@vit.ac.in 85111926...
1,Test_MeghnaLohani,meghna lohani campus address meghna.lohan...
2,Test_AmanSharma,aman sharma campus address ...
3,Federico_Calonge_HCM,...
4,Test_VAISHALI BIJOY,vaishali bijoy vaishali.bijoy2016@vitstudent.a...
5,Test_Phoebe Buffay,phoebe buffay phobe.buffaycat@vit.ac.in 85111...


In [20]:
#Create a new DF by cross-joining the 2 dataframes so that every CV is compared with every job:
#This gives 6 x 3 = 18 rows.
df_Jobs_and_Candidates = pd.merge(df_Candidates.assign(A=1), df_Jobs.assign(A=1), on='A').drop(columns='A')
df_Jobs_and_Candidates

Unnamed: 0,Candidate_Name,Content_CV,Job_Title,Job_Description,Keywords
0,Test_Chandler,chandler bing chandler.bing@vit.ac.in 85111926...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine..."
1,Test_Chandler,chandler bing chandler.bing@vit.ac.in 85111926...,Data Scientist,"full-timevancouver, bc\n\njune 9, 2020\njob id...","you, work, we’re, job, 3022, &, excellent, 202..."
2,Test_Chandler,chandler bing chandler.bing@vit.ac.in 85111926...,Data Scientist,minimum qualifications\nmaster’s degree or abo...,"experience, qualifications, understanding, pro..."
3,Test_MeghnaLohani,meghna lohani campus address meghna.lohan...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine..."
4,Test_MeghnaLohani,meghna lohani campus address meghna.lohan...,Data Scientist,"full-timevancouver, bc\n\njune 9, 2020\njob id...","you, work, we’re, job, 3022, &, excellent, 202..."
5,Test_MeghnaLohani,meghna lohani campus address meghna.lohan...,Data Scientist,minimum qualifications\nmaster’s degree or abo...,"experience, qualifications, understanding, pro..."
6,Test_AmanSharma,aman sharma campus address ...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine..."
7,Test_AmanSharma,aman sharma campus address ...,Data Scientist,"full-timevancouver, bc\n\njune 9, 2020\njob id...","you, work, we’re, job, 3022, &, excellent, 202..."
8,Test_AmanSharma,aman sharma campus address ...,Data Scientist,minimum qualifications\nmaster’s degree or abo...,"experience, qualifications, understanding, pro..."
9,Federico_Calonge_HCM,...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine..."


In [21]:
#As a test, compute WMD and cosine similarity for the 1st row (between the Content_CV and Job_Description columns)
first_job_descr = df_Jobs_and_Candidates.loc[0,'Job_Description']
first_content_cv = df_Jobs_and_Candidates.loc[0,'Content_CV']

cosine_result = round(get_cosine(first_job_descr,first_content_cv),3)
wmd_result = round(WMD(first_job_descr,first_content_cv),3)
print(cosine_result)
print(wmd_result)

0.116
1.068


In [22]:
#Apply the get_cosine and WMD functions to the WHOLE DF, between the content of the CV and the job
#description, and store the results in the 'Cosine_Job_Desc' and 'WMD_Job_Desc' columns.
df_Jobs_and_Candidates['Cosine_Job_Desc'] = df_Jobs_and_Candidates.apply(lambda row: round(get_cosine(row['Content_CV'],row['Job_Description']),3), axis=1)
df_Jobs_and_Candidates['WMD_Job_Desc'] = df_Jobs_and_Candidates.apply(lambda row: round(WMD(row['Content_CV'],row['Job_Description']),3), axis=1)

df_Jobs_and_Candidates

Unnamed: 0,Candidate_Name,Content_CV,Job_Title,Job_Description,Keywords,Cosine_Job_Desc,WMD_Job_Desc
0,Test_Chandler,chandler bing chandler.bing@vit.ac.in 85111926...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine...",0.116,1.068
1,Test_Chandler,chandler bing chandler.bing@vit.ac.in 85111926...,Data Scientist,"full-timevancouver, bc\n\njune 9, 2020\njob id...","you, work, we’re, job, 3022, &, excellent, 202...",0.119,1.096
2,Test_Chandler,chandler bing chandler.bing@vit.ac.in 85111926...,Data Scientist,minimum qualifications\nmaster’s degree or abo...,"experience, qualifications, understanding, pro...",0.115,1.075
3,Test_MeghnaLohani,meghna lohani campus address meghna.lohan...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine...",0.162,0.992
4,Test_MeghnaLohani,meghna lohani campus address meghna.lohan...,Data Scientist,"full-timevancouver, bc\n\njune 9, 2020\njob id...","you, work, we’re, job, 3022, &, excellent, 202...",0.17,1.055
5,Test_MeghnaLohani,meghna lohani campus address meghna.lohan...,Data Scientist,minimum qualifications\nmaster’s degree or abo...,"experience, qualifications, understanding, pro...",0.189,1.037
6,Test_AmanSharma,aman sharma campus address ...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine...",0.097,1.069
7,Test_AmanSharma,aman sharma campus address ...,Data Scientist,"full-timevancouver, bc\n\njune 9, 2020\njob id...","you, work, we’re, job, 3022, &, excellent, 202...",0.153,1.086
8,Test_AmanSharma,aman sharma campus address ...,Data Scientist,minimum qualifications\nmaster’s degree or abo...,"experience, qualifications, understanding, pro...",0.206,1.032
9,Federico_Calonge_HCM,...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine...",0.089,1.073


#### 1.4.3.2-Between the content of the CVs and the job keywords obtained through keyword extraction.

In [23]:
#Apply the get_cosine and WMD functions between the content of the CV and the job keywords obtained
#earlier, and store the results in the 'Cosine_Job_Keywords' and 'WMD_Job_Keywords' columns.
df_Jobs_and_Candidates['Cosine_Job_Keywords'] = df_Jobs_and_Candidates.apply(lambda row: round(get_cosine(row['Content_CV'],row['Keywords']),3), axis=1)
df_Jobs_and_Candidates['WMD_Job_Keywords'] = df_Jobs_and_Candidates.apply(lambda row: round(WMD(row['Content_CV'],row['Keywords']),3), axis=1)

df_Jobs_and_Candidates

Unnamed: 0,Candidate_Name,Content_CV,Job_Title,Job_Description,Keywords,Cosine_Job_Desc,WMD_Job_Desc,Cosine_Job_Keywords,WMD_Job_Keywords
0,Test_Chandler,chandler bing chandler.bing@vit.ac.in 85111926...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine...",0.116,1.068,0.055,1.211
1,Test_Chandler,chandler bing chandler.bing@vit.ac.in 85111926...,Data Scientist,"full-timevancouver, bc\n\njune 9, 2020\njob id...","you, work, we’re, job, 3022, &, excellent, 202...",0.119,1.096,0.0,1.269
2,Test_Chandler,chandler bing chandler.bing@vit.ac.in 85111926...,Data Scientist,minimum qualifications\nmaster’s degree or abo...,"experience, qualifications, understanding, pro...",0.115,1.075,0.072,1.236
3,Test_MeghnaLohani,meghna lohani campus address meghna.lohan...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine...",0.162,0.992,0.133,1.186
4,Test_MeghnaLohani,meghna lohani campus address meghna.lohan...,Data Scientist,"full-timevancouver, bc\n\njune 9, 2020\njob id...","you, work, we’re, job, 3022, &, excellent, 202...",0.17,1.055,0.0,1.271
5,Test_MeghnaLohani,meghna lohani campus address meghna.lohan...,Data Scientist,minimum qualifications\nmaster’s degree or abo...,"experience, qualifications, understanding, pro...",0.189,1.037,0.101,1.216
6,Test_AmanSharma,aman sharma campus address ...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine...",0.097,1.069,0.089,1.185
7,Test_AmanSharma,aman sharma campus address ...,Data Scientist,"full-timevancouver, bc\n\njune 9, 2020\njob id...","you, work, we’re, job, 3022, &, excellent, 202...",0.153,1.086,0.0,1.276
8,Test_AmanSharma,aman sharma campus address ...,Data Scientist,minimum qualifications\nmaster’s degree or abo...,"experience, qualifications, understanding, pro...",0.206,1.032,0.115,1.211
9,Federico_Calonge_HCM,...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine...",0.089,1.073,0.102,1.207


#### 1.4.3.3-Between the content of the CVs and hand-picked (custom) job keywords.

In [24]:
Content_CV_Phoebe_Buffay = df_Jobs_and_Candidates.loc[17,'Content_CV']
print(Content_CV_Phoebe_Buffay)

phoebe buffay  phobe.buffaycat@vit.ac.in 8511192673 education montessori school,97% l date of birth 12 may 1973  languages known english, spanish,french  coursework machine learning data visualization operating systems database management system discrete maths   skills c++        java c        ruby on rails python        arduino android       html php        javascript mysql        r studio keras        nltk parser        sql cuda        openmp opencv       cnn   experience  android club coordinator  front end developer in technovit website team  key entertainer in friends   projects good for females an android application for women and health safety, emergency mode using android modules, safest route using image processing, disease prediction using svm and linear regression.  face detection and recognition using openmp opencv, openmp, cuda  smart farming arduino coding, sensors , thingspeak  nlp resume parsing made resume parser based on nltk, spacy and other nlp tools  breast cancer 

In [26]:
#Add 2 more rows to the DF with custom keywords (the second set of keywords is unrelated
#to the content of the CV). Then we apply cosine similarity and WMD to these rows and look at the results.

df_Jobs_and_Candidates.loc[18] = ['Test',Content_CV_Phoebe_Buffay,'','','data science, machine learning, pandas, python, sql','','','',''] 
#Writing the keywords with or without commas gives the same result, since the preprocessing strips punctuation.
df_Jobs_and_Candidates.loc[19] = ['Test',Content_CV_Phoebe_Buffay,'','','football, tennis, spanish','','','',''] 

df_Jobs_and_Candidates.tail()

Unnamed: 0,Candidate_Name,Content_CV,Job_Title,Job_Description,Keywords,Cosine_Job_Desc,WMD_Job_Desc,Cosine_Job_Keywords,WMD_Job_Keywords
15,Test_Phoebe Buffay,phoebe buffay phobe.buffaycat@vit.ac.in 85111...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine...",0.116,1.073,0.048,1.219
16,Test_Phoebe Buffay,phoebe buffay phobe.buffaycat@vit.ac.in 85111...,Data Scientist,"full-timevancouver, bc\n\njune 9, 2020\njob id...","you, work, we’re, job, 3022, &, excellent, 202...",0.107,1.102,0.0,1.274
17,Test_Phoebe Buffay,phoebe buffay phobe.buffaycat@vit.ac.in 85111...,Data Scientist,minimum qualifications\nmaster’s degree or abo...,"experience, qualifications, understanding, pro...",0.102,1.086,0.062,1.245
18,Test,phoebe buffay phobe.buffaycat@vit.ac.in 85111...,,,"data science, machine learning, pandas, python...",,,,
19,Test,phoebe buffay phobe.buffaycat@vit.ac.in 85111...,,,"football, tennis, spanish",,,,


In [27]:
#TEST: use the text_to_vector() function on the keywords (passed inside a LIST, as the function expects):
vector_test_description = text_to_vector([df_Jobs_and_Candidates.iloc[18]['Keywords']])
vector_test_description

#Now apply cosine similarity and WMD to these rows and look at the results.
df_Jobs_and_Candidates.loc[18,'Cosine_Job_Keywords'] = round(get_cosine(df_Jobs_and_Candidates.loc[18,'Content_CV'],df_Jobs_and_Candidates.loc[18,'Keywords']),3)
df_Jobs_and_Candidates.loc[18,'WMD_Job_Keywords'] = round(WMD(df_Jobs_and_Candidates.loc[18,'Content_CV'],df_Jobs_and_Candidates.loc[18,'Keywords']),3)
df_Jobs_and_Candidates.loc[19,'Cosine_Job_Keywords'] = round(get_cosine(df_Jobs_and_Candidates.loc[19,'Content_CV'],df_Jobs_and_Candidates.loc[19,'Keywords']),3)
df_Jobs_and_Candidates.loc[19,'WMD_Job_Keywords'] = round(WMD(df_Jobs_and_Candidates.loc[19,'Content_CV'],df_Jobs_and_Candidates.loc[19,'Keywords']),3)

df_Jobs_and_Candidates.tail()

Unnamed: 0,Candidate_Name,Content_CV,Job_Title,Job_Description,Keywords,Cosine_Job_Desc,WMD_Job_Desc,Cosine_Job_Keywords,WMD_Job_Keywords
15,Test_Phoebe Buffay,phoebe buffay phobe.buffaycat@vit.ac.in 85111...,Data Scientist I,"master or phd in computer science, machine lea...","amazon, #0000, nlu, experience, stock, machine...",0.116,1.073,0.048,1.219
16,Test_Phoebe Buffay,phoebe buffay phobe.buffaycat@vit.ac.in 85111...,Data Scientist,"full-timevancouver, bc\n\njune 9, 2020\njob id...","you, work, we’re, job, 3022, &, excellent, 202...",0.107,1.102,0.0,1.274
17,Test_Phoebe Buffay,phoebe buffay phobe.buffaycat@vit.ac.in 85111...,Data Scientist,minimum qualifications\nmaster’s degree or abo...,"experience, qualifications, understanding, pro...",0.102,1.086,0.062,1.245
18,Test,phoebe buffay phobe.buffaycat@vit.ac.in 85111...,,,"data science, machine learning, pandas, python...",,,0.162,1.168
19,Test,phoebe buffay phobe.buffaycat@vit.ac.in 85111...,,,"football, tennis, spanish",,,0.041,1.298
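
The two test rows behave as expected: the data-science keywords get a higher cosine similarity (0.162 vs 0.041) and a lower WMD (1.168 vs 1.298) than the unrelated ones. To illustrate why the cosine values stay small even for a relevant CV, here is a minimal sketch of a bag-of-words cosine. It assumes `get_cosine` follows the usual `Counter`-based approach; `bow_cosine` and the toy strings are hypothetical and are not the notebook's own function or data.

```python
import math
import re
from collections import Counter

WORD = re.compile(r"\w+")

def bow_cosine(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    vec_a = Counter(WORD.findall(text_a.lower()))
    vec_b = Counter(WORD.findall(text_b.lower()))
    # Only words present in both texts contribute to the dot product.
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy inputs, not taken from the dataset.
toy_cv = "python sql machine learning android java html php"
related = "data science, machine learning, pandas, python, sql"
unrelated = "football, tennis, spanish"

print(round(bow_cosine(toy_cv, related), 3))
print(round(bow_cosine(toy_cv, unrelated), 3))
```

Because only exact token matches count, a long CV that shares a handful of words with the keywords still gets a modest score, and keywords with no literal overlap get exactly zero. WMD is meant to soften this by also rewarding words that are merely close in embedding space.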


## 1.5- Using GloVe instead of word2vec to apply WMD.

In [28]:
# Load Word Embedding Model
import gensim
from gensim.models.keyedvectors import KeyedVectors

print('gensim version: %s' % gensim.__version__)
#glove_model = gensim.models.KeyedVectors.load_word2vec_format('../model/text/stanford/glove/glove.6B.50d.vec')
glove_model = KeyedVectors.load_word2vec_format('/home/fedricio/Desktop/Embeddings_Utilizados/Glove/glove.6B.50d.txt', binary=False, no_header=True)

gensim version: 4.0.0
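
The `no_header=True` argument matters because GloVe text files start directly with the vectors, whereas the word2vec text format expects a first line holding the vocabulary size and the vector dimension. The sketch below uses only the standard library and a hypothetical two-word file to show the difference by building that header line by hand. It illustrates the file layout only and is not a replacement for gensim's loader.

```python
import io

def glove_to_word2vec_text(glove_text):
    """Prepend the '<vocab_size> <dim>' header that the word2vec text format expects."""
    lines = [ln for ln in glove_text.splitlines() if ln.strip()]
    dim = len(lines[0].split()) - 1          # the first field is the word itself
    return f"{len(lines)} {dim}\n" + "\n".join(lines) + "\n"

def parse_word2vec_text(handle):
    """Read a word2vec-style text stream into a {word: [floats]} dictionary."""
    vocab_size, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    # Sanity checks against the header values.
    assert len(vectors) == vocab_size
    assert all(len(v) == dim for v in vectors.values())
    return vectors

# Hypothetical GloVe-style content: one word per line, no header.
toy_glove = "python 0.1 0.2 0.3\nsql 0.4 0.5 0.6\n"
converted = glove_to_word2vec_text(toy_glove)
vectors = parse_word2vec_text(io.StringIO(converted))

print(converted.splitlines()[0])   # the header line that GloVe files lack
print(vectors["sql"])
```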


In [53]:
CV_text = df_Jobs_and_Candidates.loc[19,'Content_CV']
CV_text_2 = df_Jobs_and_Candidates.loc[9,'Content_CV']
CV_text_3 = df_Jobs_and_Candidates.loc[12,'Content_CV']
Job_Keyw_text = "data science machine learning pandas python sql"
Job_Keyw_text_2 = "football, tennis, spanish"
texts_to_compare = [Job_Keyw_text,CV_text,CV_text_2,CV_text_3,Job_Keyw_text_2]
texts_tokens = tokenize(texts_to_compare)   #Function defined in the previous sections.
print(texts_tokens)

[['data', 'science', 'machine', 'learning', 'pandas', 'python', 'sql'], ['phoebe', 'buffay', 'phobebuffaycat', 'vitacin', 'education', 'montessori', 'l', 'date', 'birth', 'may', 'languages', 'known', 'english', 'spanish', 'french', 'coursework', 'machine', 'learning', 'data', 'visualization', 'operating', 'systems', 'database', 'management', 'system', 'discrete', 'maths', 'skills', 'c', 'java', 'c', 'ruby', 'rails', 'python', 'arduino', 'android', 'html', 'php', 'javascript', 'mysql', 'r', 'studio', 'keras', 'nltk', 'parser', 'sql', 'cuda', 'openmp', 'opencv', 'cnn', 'experience', 'android', 'club', 'coordinator', 'front', 'end', 'developer', 'technovit', 'website', 'team', 'key', 'entertainer', 'friends', 'projects', 'good', 'females', 'android', 'application', 'women', 'health', 'safety', 'emergency', 'mode', 'using', 'android', 'modules', 'safest', 'route', 'using', 'image', 'processing', 'disease', 'prediction', 'using', 'svm', 'linear', 'regression', 'face', 'detection', 'recognit

In [30]:
texts_tokens[1]   #CV_text

['phoebe',
 'buffay',
 'phobebuffaycat',
 'vitacin',
 'education',
 'montessori',
 'l',
 'date',
 'birth',
 'may',
 'languages',
 'known',
 'english',
 'spanish',
 'french',
 'coursework',
 'machine',
 'learning',
 'data',
 'visualization',
 'operating',
 'systems',
 'database',
 'management',
 'system',
 'discrete',
 'maths',
 'skills',
 'c',
 'java',
 'c',
 'ruby',
 'rails',
 'python',
 'arduino',
 'android',
 'html',
 'php',
 'javascript',
 'mysql',
 'r',
 'studio',
 'keras',
 'nltk',
 'parser',
 'sql',
 'cuda',
 'openmp',
 'opencv',
 'cnn',
 'experience',
 'android',
 'club',
 'coordinator',
 'front',
 'end',
 'developer',
 'technovit',
 'website',
 'team',
 'key',
 'entertainer',
 'friends',
 'projects',
 'good',
 'females',
 'android',
 'application',
 'women',
 'health',
 'safety',
 'emergency',
 'mode',
 'using',
 'android',
 'modules',
 'safest',
 'route',
 'using',
 'image',
 'processing',
 'disease',
 'prediction',
 'using',
 'svm',
 'linear',
 'regression',
 'face',
 'detec
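
Several tokens in this list, such as `'phobebuffaycat'`, `'vitacin'` or `'technovit'`, are artifacts of the CV text and are unlikely to appear in a pretrained vocabulary. gensim's `wmdistance` drops tokens that are missing from the model before computing the distance, so it can be useful to check how much of a document survives. A minimal sketch, using a hypothetical toy vocabulary set in place of the real model's vocabulary:

```python
def split_by_vocabulary(tokens, vocabulary):
    """Separate tokens into those covered by the vocabulary and those that are not."""
    known = [t for t in tokens if t in vocabulary]
    unknown = [t for t in tokens if t not in vocabulary]
    return known, unknown

# Hypothetical vocabulary; with a real model this would be its set of known words.
toy_vocabulary = {"education", "machine", "learning", "python", "sql", "data"}
sample_tokens = ["phoebe", "buffay", "phobebuffaycat", "vitacin",
                 "education", "machine", "learning", "python", "sql"]

known, unknown = split_by_vocabulary(sample_tokens, toy_vocabulary)
coverage = len(known) / len(sample_tokens)

print(known)
print(unknown)
print(f"coverage: {coverage:.2f}")
```

With the loaded model, the membership test would look something like `t in glove_model.key_to_index` (gensim 4.x). A low coverage means the reported distance is based on only a fraction of the CV.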

In [57]:
#We print the distances between Job_Keyw_text and the 3 CVs:

tokens_job_keyw = texts_tokens[0]
tokens_CV_1 = texts_tokens[1]
tokens_CV_2 = texts_tokens[2]
tokens_CV_3 = texts_tokens[3]
tokens_job_keyw_2 = texts_tokens[4]

distance_1 = glove_model.wmdistance(tokens_job_keyw, tokens_CV_1)
print('Distance with CV_1= %.4f' % distance_1)

distance_2 = glove_model.wmdistance(tokens_job_keyw, tokens_CV_2)
print('Distance with CV_2= %.4f' % distance_2)

distance_3 = glove_model.wmdistance(tokens_job_keyw, tokens_CV_3)
print('Distance with CV_3= %.4f' % distance_3)

#This last comparison should not score well (i.e. it should give a larger distance), since the keywords have nothing to do with the CV:
distance_4 = glove_model.wmdistance(tokens_job_keyw_2, tokens_CV_1)
print('Distance keywords_2 with CV_1= %.4f' % distance_4)

Distance with CV_1= 0.9605
Distance with CV_2= 0.9719
Distance with CV_3= 0.9377
Distance keywords_2 with CV_1= 1.2101
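
Lower values mean closer texts, so CV_3 (0.9377) is the closest match for the data-science keywords, while the football/tennis keywords are clearly farther from CV_1 (1.2101). To give some intuition for what the distance measures, the sketch below implements the relaxed lower bound of WMD, in which each word simply travels to its nearest word in the other document. This is not the exact optimal-transport problem that `wmdistance` solves, and the 2-dimensional embeddings are made up for the example.

```python
import math
from collections import Counter

# Hypothetical toy embeddings (real models use 50 or more dimensions).
toy_vectors = {
    "python": (1.0, 0.0),
    "sql": (0.9, 0.1),
    "pandas": (0.8, 0.2),
    "football": (0.0, 1.0),
    "tennis": (0.1, 0.9),
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_sided_cost(source, target, vectors):
    """Each source word moves all of its normalized weight to its nearest target word."""
    weights = Counter(source)
    total = sum(weights.values())
    return sum(
        (count / total) * min(euclidean(vectors[w], vectors[t]) for t in set(target))
        for w, count in weights.items()
    )

def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed WMD: the larger of the two one-sided nearest-neighbor costs."""
    a = [w for w in doc_a if w in vectors]   # drop out-of-vocabulary words
    b = [w for w in doc_b if w in vectors]
    if not a or not b:
        return float("inf")
    return max(one_sided_cost(a, b, vectors), one_sided_cost(b, a, vectors))

cv = ["python", "sql"]
related_distance = relaxed_wmd(["pandas", "python"], cv, toy_vectors)
unrelated_distance = relaxed_wmd(["football", "tennis"], cv, toy_vectors)

print(f"related keywords:   {related_distance:.4f}")
print(f"unrelated keywords: {unrelated_distance:.4f}")
```

As in the GloVe results above, the related keywords end up at a much smaller distance than the unrelated ones, even though "pandas" never appears literally in the toy CV.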


## 1.6- Using spaCy instead of word2vec or GloVe to apply WMD.

In [46]:
#Modified example from the official docs: https://spacy.io/universe/project/wmd-relax
#To run this I installed spaCy: conda install -c conda-forge spacy
#And the model: python -m spacy download en_core_web_lg
import spacy
import wmd

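#Note: create_pipeline is the spaCy 1.x-style hook used in the linked example; newer spaCy versions attach components with nlp.add_pipe, so check that the WMD hook is actually active in your version.
nlp = spacy.load('en_core_web_lg', create_pipeline=wmd.WMD.create_spacy_pipeline)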

In [50]:
#Example:
doc1 = nlp("Politician speaks to the media in Illinois.")
doc2 = nlp("The president greets the press in Chicago.")
doc3 = nlp("The president greets the press in Chicago.")
print(doc1.similarity(doc2))
print(doc2.similarity(doc3))

0.8705018149808114
1.0
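
Note that the two identical sentences score 1.0 here. spaCy's default `Doc.similarity` is the cosine of the averaged token vectors, for which 1.0 means "identical", whereas a WMD distance between identical texts would be 0. The values in this section are therefore best read as similarities (higher is closer), the opposite convention from the distances in section 1.5. The sketch below shows the averaged-vector cosine on made-up 2-dimensional vectors; all names and numbers are hypothetical.

```python
import math

# Hypothetical toy embeddings.
toy_vectors = {
    "president": (0.9, 0.3),
    "politician": (0.7, 0.5),
    "press": (0.2, 0.9),
    "media": (0.3, 0.8),
}

def mean_vector(tokens, vectors):
    """Average the vectors of all tokens, component by component."""
    rows = [vectors[t] for t in tokens]
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

doc_a = mean_vector(["politician", "media"], toy_vectors)
doc_b = mean_vector(["president", "press"], toy_vectors)

similar_score = cosine(doc_a, doc_b)     # different but related documents
identical_score = cosine(doc_b, doc_b)   # a document compared with itself

print(f"{similar_score:.4f}")
print(f"{identical_score:.4f}")
```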


In [47]:
#Our case:

#We preprocess doc_1 before comparing (doc_2 does not need it because it is just keywords):
doc_1 = df_Jobs_and_Candidates.loc[19,'Content_CV']
doc_1_tokenize = tokenize([doc_1])              #Using the function defined in the previous sections.
new_text_doc_1 = ' '.join(doc_1_tokenize[0])    #Convert the token list back into a string.
new_text_doc_1                                  #New text without extra whitespace, stop words, etc. (see everything 'tokenize' does)

doc1 = nlp(new_text_doc_1)
doc2 = nlp('data science machine learning pandas python sql')
print(doc1.similarity(doc2))

0.8392574509904325


In [58]:
#We do the same as above, but now the keywords are different ones that have nothing to do with the CV:

doc_1 = df_Jobs_and_Candidates.loc[19,'Content_CV']
doc_1_tokenize = tokenize([doc_1])              
new_text_doc_1 = ' '.join(doc_1_tokenize[0])    
new_text_doc_1                                 

doc1 = nlp(new_text_doc_1)
doc2 = nlp('football, tennis, spanish')  #THIS IS WHAT WE CHANGED.
print(doc1.similarity(doc2))

0.45593429733006446
