### About
We will apply 'WMD' (Word Mover's Distance) and 'Cosine Similarity' to measure distances and obtain similarities between texts.  
Before applying Cosine Similarity we will use TF-IDF to represent each text as a vector of weighted terms.  
And before applying WMD we will use Word Embedding to represent each word as a dense vector.

The comparisons will be made between the content of the Candidates' CVs and the qualifications required of them, as found in the Job descriptions.
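
As a rough illustration of the TF-IDF plus Cosine Similarity idea, the sketch below scores one toy CV against two toy job descriptions using only the standard library. The function names (`tfidf_vectors`, `cosine_similarity`), the sample texts and the smoothed IDF formula are assumptions made for this example; the notebook itself may rely on a library implementation with different weighting.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical, already-cleaned texts.
cv = "python machine learning pandas".split()
job_ml = "machine learning engineer python".split()
job_web = "html javascript frontend developer".split()

vec_cv, vec_ml, vec_web = tfidf_vectors([cv, job_ml, job_web])
sim_ml = cosine_similarity(vec_cv, vec_ml)    # shares three terms with the CV
sim_web = cosine_similarity(vec_cv, vec_web)  # shares no terms with the CV
print(sim_ml, sim_web)
```

A higher score means the CV and the job description share more heavily weighted terms; texts with no terms in common score exactly zero.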

As datasets we will use:  
* **(FILL IN THE FINAL NUMBER)** CVs in PDF format. PDF location: "Archivos 2-Especificos/Ejemplo1_Dataset_CVs_And_Job_Desc/EN/PDFs_CVs". 
* **(FILL IN THE FINAL NUMBER)** job descriptions, which correspond to:

CV location: **TO BE COMPLETED**.  
Job description location: **TO BE COMPLETED**.

## Index:

* 1-Building the Datasets and Cleaning the Data.
    * 1.1-Importing the required libraries.
    * 1.2-Functions needed for data cleaning. (*)
    * 1.3-Building and Cleaning the Jobs Dataset.
    * 1.4-Building and Cleaning the CVs Dataset.


* 2-Running Comparisons and Obtaining Similarities.
     * 2.1- TF-IDF & Cosine Similarity.
     * 2.2- Word Embedding (Word2vec) & WMD.  


* 3-Exporting the CSV with the results for use in the Notebook "2-KMEANS_&_KNN".  
  
(*) The cleaning procedure for CVs and Jobs is as follows:
1. Convert everything to lowercase.
2. Remove data not relevant to our analysis (e-mail addresses and web pages).
3. Remove punctuation and special characters (including numbers).
4. Remove stop words.
5. Remove common words not relevant to our analysis.
6. Apply Tokenization and Lemmatization.
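
The steps above can be sketched on a single string with the standard library alone. This is only an approximation of the procedure: `STOP_WORDS` and `COMMON_WORDS` are tiny hypothetical stand-ins for the NLTK stop-word list and the `common_words.txt` file, `clean_text` and the sample sentence are made up for the example, and step 6 is reduced to whitespace tokenization with no lemmatization.

```python
import re

# Hypothetical stand-ins for the NLTK stop words and common_words.txt.
STOP_WORDS = {"a", "an", "and", "at", "in", "me", "of", "the", "with"}
COMMON_WORDS = {"education", "experience", "january", "jan"}

def clean_text(text):
    text = text.lower()                                          # 1. lowercase
    text = re.sub(r"(www|http:|https:)+[^\s]+[\w]", " ", text)   # 2. web pages
    text = re.sub(r"\S*@\S*\s?", " ", text)                      # 2. e-mail addresses
    text = re.sub(r"\W|\d+", " ", text)                          # 3. punctuation and numbers
    tokens = text.split()                                        # 6. whitespace tokens only
    tokens = [t for t in tokens if t not in STOP_WORDS]          # 4. stop words
    tokens = [t for t in tokens if t not in COMMON_WORDS]        # 5. common words
    return tokens

sample = "Experience: 5 years of Python at ACME. Contact me: jane.doe@mail.com, www.jane-doe.dev"
tokens = clean_text(sample)
print(tokens)  # ['years', 'python', 'acme', 'contact']
```

The order matters: e-mail addresses and URLs are removed before punctuation, because stripping `@`, `.` and `:` first would leave their fragments behind as ordinary words.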

## 1-Building the Datasets and Cleaning the Data.

### 1.1-Importing the required libraries.

In [1]:
from gensim.models import KeyedVectors
import matplotlib.pyplot as  plt
from collections import Counter
import pandas as pd
import numpy as np
import random
import string
import math
import re
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/fedricio/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 1.2-Functions needed for data cleaning.

In [2]:
import regex as re                        # Used in remove_punctuation_and_special_characters.
from nltk.stem import WordNetLemmatizer   # Used for lemmatization.
nltk.download('wordnet')
from nltk.tokenize import word_tokenize   # Used for tokenization.

# Load the English stop words from nltk:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
            
def lower_text(DF,clean_column):
    # Convert everything to lowercase.
    DF[clean_column] = DF[clean_column].apply(lambda x: x.lower() if isinstance(x,str) else x)

def delete_emails_and_web_pages(DF,clean_column):
    # Match web pages starting with www, http or https:
    DF[clean_column] = DF[clean_column].apply(lambda x: re.sub(r'(www|http:|https:)+[^\s]+[\w]',' ', x) if isinstance(x,str) else x) 
    # Match e-mail addresses:
    DF[clean_column] = DF[clean_column].apply(lambda x: re.sub(r'\S*@\S*\s?',' ', x) if isinstance(x,str) else x)
    # How the e-mail pattern works:
    #   \S* : a run of non-whitespace characters.
    #   @   : the literal @.
    #   \S* : another run of non-whitespace characters.
    #   \s? : optionally one whitespace character. The '?' is needed so that an
    #         address at the end of a line (with no trailing space) still matches.
    # See demo: https://regex101.com/r/J0ohIf/1

def delete_common_words(DF,clean_column):
    # common_words.txt is a file prepared beforehand with common words that are NOT useful
    # for our analysis and matching (months and their abbreviations, CV section names
    # such as education, work experience, etc.). We remove those words from the column.

    with open('common_words.txt') as f2:
        content = f2.read()
        
    tokens_common_words = set(word_tokenize(content))  # Tokenize the common words; a set gives fast lookups.
    DF[clean_column] = DF[clean_column].apply(lambda x: ' '.join([item for item in x.split() if item not in tokens_common_words]))

def delete_candidate_name(DF,clean_column):
    # Same approach as delete_common_words, using the tokens read from candidate_names.txt.
    with open('candidate_names.txt') as f2:
        content = f2.read()
        
    tokens_candidate_names = set(word_tokenize(content))
    DF[clean_column] = DF[clean_column].apply(lambda x: ' '.join([item for item in x.split() if item not in tokens_candidate_names]))

def remove_stop_words(DF,clean_column):
    DF[clean_column] = DF[clean_column].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

def remove_punctuation_and_special_characters(DF,clean_column):
    # Replace every non-word character (\W) and every run of digits (\d+) with a space.
    DF[clean_column] = DF[clean_column].apply(lambda x: re.sub(r'\W|\d+',' ', x) if isinstance(x,str) else x)
    
def tokenize_and_lemmatization(text_column):
    # Tokenizing means splitting a text string into smaller units ("tokens").
    # We use the "word_tokenize" function for this.
    # Lemmatization vs. stemming: https://qastack.mx/programming/1787110/what-is-the-difference-between-lemmatization-vs-stemming
    tokens = word_tokenize(text_column) # Tokenize.
    wordnet_lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [wordnet_lemmatizer.lemmatize(tok).lower() for tok in tokens] # Lemmatize each token (this is not stemming).
    return list(lemmatized_tokens)

def cleaning_DF(DF,column_to_clean,flag_candidate): # For a candidate (flag_candidate=True) we also remove every occurrence of the Name from the CV.
    clean_column='clean_'+column_to_clean
    DF[clean_column]=DF[column_to_clean]            # Copy 'column_to_clean' into 'clean_column' so the functions below work on the copy.
    lower_text(DF,clean_column)
    delete_emails_and_web_pages(DF,clean_column)
    remove_punctuation_and_special_characters(DF,clean_column)
    remove_stop_words(DF,clean_column)
    delete_common_words(DF,clean_column)
    if flag_candidate:                              # If it is a candidate...
        delete_candidate_name(DF,clean_column)
    
def tokenize_and_lemmatize(DF,column_to_clean):
    clean_column='clean_'+column_to_clean
    tokens_column='tokens_'+column_to_clean
    DF[tokens_column] = DF[clean_column].apply(tokenize_and_lemmatization)  # tokenize_and_lemmatization returns tokens.
    DF[clean_column] = DF[tokens_column].apply(' '.join)                    # Join the tokens back into text (now lemmatized) and store it in DF[clean_column].

[nltk_data] Downloading package wordnet to /home/fedricio/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### 1.3-Building and Cleaning the Jobs Dataset.

## 1-Jobs_CSV

In [200]:
df_Jobs = pd.read_csv("../Datasets_CVs_And_Job_Descriptions/EN/Job_Descr/1-Jobs_CSV/Positions_Qualif.csv")
df_Jobs = df_Jobs.rename(columns={'job_title': 'Job_Title', 'job_description':'Job_Description'})
df_Jobs = df_Jobs.sort_values('Job_Title', ascending=True)
df_Jobs = df_Jobs.reset_index(drop=True)
df_Jobs

Unnamed: 0,Job_Title,Job_Description
0,Data Scientist,"Master’s degree or above in a STEM field, incl..."
1,Data Scientist 2,"\nReporting to the Director, Data & Analytics,..."
2,HCM Consultant,\nThe Oracle Cloud HCM Absence Consultant will...
3,HCM Consultant 2,4+ years of experience in PeopleSoft or Oracle...
4,Machine Learning Engineer,Leveraging the latest machine and deep learnin...
5,Machine Learning Engineer 2,Collaborate with a multidisciplinary team to g...
6,Security Specialist,Work in a fast-paced environment that combine ...
7,Security Specialist 2,\n Handling incoming requests for assistanc...
8,Web Developer Full Stack,\n\n Graduate Degree in Information Technol...
9,Web Developer Full Stack 2,\n· Enter existing website codebases and exten...


## 2-csv_dice_com-job_us_sample

In [206]:
path_jobs_2='../Datasets_CVs_And_Job_Descriptions/EN/Job_Descr/'
df_Jobs_2 = pd.read_csv(path_jobs_2+'/2-csv_dice_com-job_us_sample.csv', usecols= ['jobdescription','jobtitle'])
df_Jobs_2 = df_Jobs_2.rename(columns={'jobtitle': 'Job_Title', 'jobdescription':'Job_Description'})
df_Jobs_2 = df_Jobs_2.reset_index(drop=True)
df_Jobs_2

Unnamed: 0,Job_Description,Job_Title
0,Looking for Selenium engineers...must have sol...,AUTOMATION TEST ENGINEER
1,The University of Chicago has a rapidly growin...,Information Security Engineer
2,"GalaxE.SolutionsEvery day, our solutions affec...",Business Solutions Architect
3,Java DeveloperFull-time/direct-hireBolingbrook...,"Java Developer (mid level)- FT- GREAT culture,..."
4,Midtown based high tech firm has an immediate ...,DevOps Engineer
...,...,...
21995,Company Description We are searching for a ta...,Web Designer
21996,CONTACT - priya@omegasolutioninc.com / 408-45...,Senior Front End Web Developer - Full Time at ...
21997,Do you take pride in your work knowing that th...,QA Analyst
21998,Company Description What We Can Offer YouAs th...,Tech Lead-Full Stack


In [207]:
# Remove the duplicates.
df_Jobs_2 = df_Jobs_2.drop_duplicates(subset=None, keep='first', inplace=False)
df_Jobs_2 = df_Jobs_2.reset_index(drop=True)
df_Jobs_2

Unnamed: 0,Job_Description,Job_Title
0,Looking for Selenium engineers...must have sol...,AUTOMATION TEST ENGINEER
1,The University of Chicago has a rapidly growin...,Information Security Engineer
2,"GalaxE.SolutionsEvery day, our solutions affec...",Business Solutions Architect
3,Java DeveloperFull-time/direct-hireBolingbrook...,"Java Developer (mid level)- FT- GREAT culture,..."
4,Midtown based high tech firm has an immediate ...,DevOps Engineer
...,...,...
20578,Company Description We are searching for a ta...,Web Designer
20579,CONTACT - priya@omegasolutioninc.com / 408-45...,Senior Front End Web Developer - Full Time at ...
20580,Do you take pride in your work knowing that th...,QA Analyst
20581,Company Description What We Can Offer YouAs th...,Tech Lead-Full Stack


## Merging the datasets:

In [208]:
df_Jobs = pd.concat([df_Jobs, df_Jobs_2], ignore_index=True)  # DataFrame.append was removed in pandas 2.0.
df_Jobs

Unnamed: 0,Job_Title,Job_Description
0,Data Scientist,"Master’s degree or above in a STEM field, incl..."
1,Data Scientist 2,"\nReporting to the Director, Data & Analytics,..."
2,HCM Consultant,\nThe Oracle Cloud HCM Absence Consultant will...
3,HCM Consultant 2,4+ years of experience in PeopleSoft or Oracle...
4,Machine Learning Engineer,Leveraging the latest machine and deep learnin...
...,...,...
20588,Web Designer,Company Description We are searching for a ta...
20589,Senior Front End Web Developer - Full Time at ...,CONTACT - priya@omegasolutioninc.com / 408-45...
20590,QA Analyst,Do you take pride in your work knowing that th...
20591,Tech Lead-Full Stack,Company Description What We Can Offer YouAs th...


## 3-csv_monster_com-job_sample

#### Cleaning the Jobs DF.

In [209]:
cleaning_DF(df_Jobs,'Job_Description',False)
tokenize_and_lemmatize(df_Jobs,'Job_Description')

In [211]:
# Take the job description at position 5 as an example ('Machine Learning Engineer 2'):

print(df_Jobs['Job_Description'].iloc[5])  
# TODO: "non-expert" should be kept as a single token.
# TODO: "Computer Science" should be joined into one token.
print("#######################################################")
print(df_Jobs['clean_Job_Description'].iloc[5]) 
print("#######################################################")
print(df_Jobs['tokens_Job_Description'].iloc[5]) 
print("#######################################################")

Do you take pride in your work knowing that thousands of lives are positively impacted? Are you interested in being part of an award-winning Health Plan? Do you thrive in a mission-driven and fast-growing environment where the frontier of healthcare is constantly being pushed? Then, let's talk! We're looking for smart thinkers who enjoy challenges and want to help us shape the future of healthcare.The San Francisco Health Plan is an award winning managed care health plan designed by and for the people of San Francisco. Our mission is to provide affordable, high quality health care to San Francisco residents and support our safety net of clinics and providers.We are ambitious in our pursuits, passionate about our mission, and creative in our execution. We thrive on our culture of serving with respect, striving to excel, and teamwork.What makes SFHP different? Our people do. Every day, we are doing things to transform our city and the lives of over 130,000 people in the city! There hasn'

In [212]:
df_Jobs

Unnamed: 0,Job_Title,Job_Description,clean_Job_Description,tokens_Job_Description
0,Data Scientist,"Master’s degree or above in a STEM field, incl...",master degree stem field including limited com...,"[master, degree, stem, field, including, limit..."
1,Data Scientist 2,"\nReporting to the Director, Data & Analytics,...",reporting director data analytics senior data ...,"[reporting, director, data, analytics, senior,..."
2,HCM Consultant,\nThe Oracle Cloud HCM Absence Consultant will...,oracle cloud hcm absence consultant responsibl...,"[oracle, cloud, hcm, absence, consultant, resp..."
3,HCM Consultant 2,4+ years of experience in PeopleSoft or Oracle...,peoplesoft oracle eb implementation support hc...,"[peoplesoft, oracle, eb, implementation, suppo..."
4,Machine Learning Engineer,Leveraging the latest machine and deep learnin...,leveraging latest machine deep learning techni...,"[leveraging, latest, machine, deep, learning, ..."
...,...,...,...,...
20588,Web Designer,Company Description We are searching for a ta...,company description searching talented creativ...,"[company, description, searching, talented, cr..."
20589,Senior Front End Web Developer - Full Time at ...,CONTACT - priya@omegasolutioninc.com / 408-45...,location san francisco caterm full time perman...,"[location, san, francisco, caterm, full, time,..."
20590,QA Analyst,Do you take pride in your work knowing that th...,take pride knowing thousand life positively im...,"[take, pride, knowing, thousand, life, positiv..."
20591,Tech Lead-Full Stack,Company Description What We Can Offer YouAs th...,company description offer youas world leading ...,"[company, description, offer, youas, world, le..."


### Obtaining the bigrams and storing them in the 'tokens_Job_Description' column

In [213]:
lista_vocab_jobs = df_Jobs['tokens_Job_Description'].tolist()
lista_vocab_jobs # List of token lists.
lista_vocab_jobs[9]  # Index 9: 'Web Developer Full Stack 2', the last row of the first dataset.

['enter',
 'existing',
 'codebases',
 'extend',
 'functionality',
 'scoped',
 'feature',
 'maintaining',
 'efficient',
 'quality',
 'driven',
 'timeline',
 'implement',
 'context',
 'switching',
 'technology',
 'platform',
 'server',
 'stack',
 'assist',
 'site',
 'maintenance',
 'development',
 'quickly',
 'learn',
 'understand',
 'implement',
 'new',
 'old',
 'technology',
 'best',
 'leverage',
 'particular',
 'project',
 'effectively',
 'deploy',
 'code',
 'production',
 'environment',
 'ensure',
 'high',
 'quality',
 'service',
 'communicate',
 'technical',
 'issue',
 'risk',
 'clearly',
 'ensure',
 'necessary',
 'action',
 'taken',
 'timely',
 'manner',
 'communicating',
 'technical',
 'accumum',
 'professionally',
 'clearly',
 'non',
 'technical',
 'team',
 'member',
 'client',
 'excellent',
 'english',
 'communication',
 'proficient',
 'php',
 'familiar',
 'basic',
 'computer',
 'science',
 'concept',
 'proficient',
 'frontend',
 'technology',
 'html',
 'javascript',
 'cs',
 'we

In [214]:
len(lista_vocab_jobs)

20593

In [215]:
import gensim
from gensim.models.phrases import Phraser, Phrases
import string
import re

# Here we join words that form bigrams, such as machine_learning, big_data or deep_learning: words that
# were separate tokens but actually belong together. The Phraser object detects them and returns each as a single token.
# Bigrams are built from phrases.

# We create the bigrams as follows:
# Detect relevant phrases in our list of token lists:
phrases_jobs = Phrases(lista_vocab_jobs)
# Wrap them in a Phraser object, which is used to transform the token lists:
bigram_jobs = Phraser(phrases_jobs)
# Apply the Phraser and collect the transformed token lists in a plain list:
all_sentences_jobs = list(bigram_jobs[lista_vocab_jobs])

# Inspect the result:
all_sentences_jobs  # List of lists with our vocabulary updated with bigrams.
all_sentences_jobs[4] # Index 4: 'Machine Learning Engineer'.
#for t in all_sentences:
    #lista_nueva.append(t)

['leveraging_latest',
 'machine',
 'deep_learning',
 'technique',
 'challenge',
 'current',
 'practice',
 'across',
 'business',
 'unit',
 'enhancing',
 'internal',
 'collaboration',
 'business',
 'unit',
 'alignment',
 'people',
 'process',
 'technology',
 'business',
 'goal',
 'proposing',
 'best_practice',
 'optimize',
 'business',
 'unit',
 'cloud',
 'presence',
 'performance',
 'security',
 'cost',
 'develop',
 'new',
 'spec',
 'documentation',
 'partake',
 'development',
 'technical',
 'procedure',
 'user',
 'support',
 'guide',
 'carry',
 'debugging_troubleshooting',
 'modification',
 'unit_testing',
 'custom',
 'solution',
 'built',
 'organization',
 'platform',
 'managing',
 'code',
 'artifact',
 'including',
 'model',
 'algorithm',
 'template',
 'tool',
 'policy',
 'guideline',
 'measuring',
 'performance',
 'comparing',
 'result',
 'similar',
 'effort',
 'across',
 'organization',
 'industry',
 'working',
 'different',
 'business',
 'organization',
 'across',
 'company',
 'u

In [216]:
len(all_sentences_jobs)

20593

In [217]:
# Update the 'tokens_Job_Description' column of our DF with the bigrams:
df_Jobs['tokens_Job_Description'] = all_sentences_jobs
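
To give an intuition for how adjacent tokens end up merged into tokens such as `machine_learning`, here is a simplified, standard-library sketch of count-based bigram detection, loosely following the collocation score from Mikolov et al. (2013). It is not gensim's implementation: `find_bigrams`, `merge_bigrams`, the toy corpus and the `min_count` and `threshold` values are all assumptions chosen for this example, and gensim's `Phrases` has further options that are not reproduced here.

```python
from collections import Counter

def find_bigrams(sentences, min_count=1, threshold=1.0):
    """Score adjacent token pairs and keep those above the threshold."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    scores = {}
    for (a, b), n_ab in bigrams.items():
        # Pairs that co-occur more often than their individual counts suggest score high.
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            scores[(a, b)] = score
    return scores

def merge_bigrams(sentence, phrases):
    """Join detected pairs with '_' while scanning left to right."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in phrases:
            merged.append(sentence[i] + "_" + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged

# Hypothetical toy corpus of already-tokenized texts.
corpus = [
    ["machine", "learning", "engineer"],
    ["deep", "learning", "and", "machine", "learning"],
    ["machine", "learning", "model"],
    ["learning", "new", "technology"],
]
phrases = find_bigrams(corpus)
merged = merge_bigrams(corpus[0], phrases)
print(merged)  # ['machine_learning', 'engineer']
```

In this toy corpus only the pair ("machine", "learning") occurs often enough to pass the threshold; pairs seen once score zero and are left as separate tokens.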

In [218]:
df_Jobs

Unnamed: 0,Job_Title,Job_Description,clean_Job_Description,tokens_Job_Description
0,Data Scientist,"Master’s degree or above in a STEM field, incl...",master degree stem field including limited com...,"[master_degree, stem, field, including_limited..."
1,Data Scientist 2,"\nReporting to the Director, Data & Analytics,...",reporting director data analytics senior data ...,"[reporting, director, data, analytics, senior,..."
2,HCM Consultant,\nThe Oracle Cloud HCM Absence Consultant will...,oracle cloud hcm absence consultant responsibl...,"[oracle, cloud, hcm, absence, consultant, resp..."
3,HCM Consultant 2,4+ years of experience in PeopleSoft or Oracle...,peoplesoft oracle eb implementation support hc...,"[peoplesoft, oracle_eb, implementation, suppor..."
4,Machine Learning Engineer,Leveraging the latest machine and deep learnin...,leveraging latest machine deep learning techni...,"[leveraging_latest, machine, deep_learning, te..."
...,...,...,...,...
20588,Web Designer,Company Description We are searching for a ta...,company description searching talented creativ...,"[company, description, searching, talented, cr..."
20589,Senior Front End Web Developer - Full Time at ...,CONTACT - priya@omegasolutioninc.com / 408-45...,location san francisco caterm full time perman...,"[location_san, francisco, caterm, full_time, p..."
20590,QA Analyst,Do you take pride in your work knowing that th...,take pride knowing thousand life positively im...,"[take_pride, knowing, thousand, life, positive..."
20591,Tech Lead-Full Stack,Company Description What We Can Offer YouAs th...,company description offer youas world leading ...,"[company, description, offer, youas, world_lea..."


### 1.4-Building and Cleaning the CVs Dataset.

## 1-CVs_PDF_1

#### PDF to text using pdfplumber.

In [174]:
import pdfplumber
import os
import collections
from os import listdir
from os.path import isfile, join

# Read the CVs stored in our folder and convert them to text one by one
# with the pdfplumber library:
pathCVs='../Datasets_CVs_And_Job_Descriptions/EN/CVs/1-CVs_PDF_1'
onlyfiles = [os.path.join(pathCVs, f) for f in os.listdir(pathCVs) if os.path.isfile(os.path.join(pathCVs, f))]
print("Number of CVs extracted:", len(onlyfiles))

# Function to extract the text of a CV:

def pdfextract(PDF_file):
    all_text = ""
    with pdfplumber.open(PDF_file) as pdf:    # The context manager closes the file even on errors.
        for pdf_page in pdf.pages:
            # Depending on the pdfplumber version, extract_text() may return None for pages without text.
            single_page_text = pdf_page.extract_text() or ""
            all_text = all_text + '\n' + single_page_text
    return all_text.encode('utf-8')

def extract_text(file):
    text = pdfextract(file).decode('utf-8')
    return text

Number of CVs extracted: 10


In [175]:
# Collect the full text of every CV WITHOUT preprocessing:
rows = []
for file in onlyfiles:
    base = os.path.basename(file)         # e.g. DataScientist_Karla_Lewis.pdf
    filename = os.path.splitext(base)[0]  # e.g. DataScientist_Karla_Lewis
    rows.append({'Candidate_Name': filename, 'Content_CV': extract_text(file)})
# Build the DataFrame in one step (DataFrame.append was removed in pandas 2.0):
df_Candidates = pd.DataFrame(rows, columns=['Candidate_Name','Content_CV'])

df_Candidates = df_Candidates.sort_values('Candidate_Name', ascending=True)
df_Candidates = df_Candidates.reset_index(drop=True)
df_Candidates

Unnamed: 0,Candidate_Name,Content_CV
0,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...
1,DataScientist_Rahul_Malik,\nRAHUL MALIK\nNLP Data Scientist\nCONTACT WOR...
2,HCM_Federico_Calonge,\n ...
3,HCM_Robert_Smith,\nSap Hcm Consultant Phone: (123) 456 78 99\nE...
4,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+..."
5,MLEngineer_Jonathon_Price,"\nJonathon Price \n4587 Terry Groves, Boston\n..."
6,SecuritySpecialist_Ahmed Wayne,"\nAhmed Wayne\nAddress: Abu Dhabi, UAE\nNation..."
7,SecuritySpecialist_Denis Banik,\nDenis Banik\nEmail address: hello@kickresume...
8,WebDev_Alec_Dionisio,"\nChestertown, MD 4107083942\nhi@alecdionis.io..."
9,WebDev_Karen_Higgins,\nKaren Higgins \n We b Developer \n \n...


## 2-CVs_PDF_2

In [136]:
# Read the CVs stored in our folder and convert them to text one by one
# with the pdfplumber library:
pathCVs_2='../Datasets_CVs_And_Job_Descriptions/EN/CVs/2-CVs_PDF_2'
onlyfiles = [os.path.join(pathCVs_2, f) for f in os.listdir(pathCVs_2) if os.path.isfile(os.path.join(pathCVs_2, f))]
print("Number of CVs extracted:", len(onlyfiles))

Number of CVs extracted: 228


In [138]:
# Collect the full text of every CV WITHOUT preprocessing:
rows = []
for file in onlyfiles:
    base = os.path.basename(file)
    filename = os.path.splitext(base)[0]
    rows.append({'Candidate_Name': filename, 'Content_CV': extract_text(file)})
# Build the DataFrame in one step (DataFrame.append was removed in pandas 2.0):
df_Candidates_2 = pd.DataFrame(rows, columns=['Candidate_Name','Content_CV'])

df_Candidates_2 = df_Candidates_2.sort_values('Candidate_Name', ascending=True)
df_Candidates_2 = df_Candidates_2.reset_index(drop=True)
df_Candidates_2

Unnamed: 0,Candidate_Name,Content_CV
0,Abiral_Pandey_Fullstack_Java,\nName: Abiral Pandey \nEmail: abiral.pandey88...
1,Achyuth Resume_8,\nAchyuth \n540-999-8048 \nachyuth.java88@gmai...
2,Adelina_Erimia_PMP1,"\nAdelina Erimia, PMP, Six Sigma Green Belt, S..."
3,Adhi Gopalam - SM,\n \n \nAdhi Gopalam \nadhigopalam@gmail.com \...
4,AjayKumar,\nAjay Kumar (CSM) Email/Skype: ...
...,...,...
223,sanjay kumar,\nSanjay \nEmail: sanjay.j0828@gmail.com \nCon...
224,srinivas b,\n \nSrinivas \nSrinivasjava04@gmail.com \n315...
225,vema reddy,\n ...
226,venu b,\nVENU \nvenu6773@gmail.com \n(414) 436-567...


## 3-CVs_CSV_1

In [142]:
pathCV_CSV='../Datasets_CVs_And_Job_Descriptions/EN/CVs/3-CVs_CSV_1'
df_csvs_2 = pd.read_csv(pathCV_CSV+'/Resume_DataSet.csv', usecols= ['Resume_str','Category'])
df_csvs_2

Unnamed: 0,Resume_str,Category
0,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,HR
1,"HR SPECIALIST, US HR OPERATIONS ...",HR
2,HR DIRECTOR Summary Over 2...,HR
3,HR SPECIALIST Summary Dedica...,HR
4,HR MANAGER Skill Highlights ...,HR
...,...,...
2479,RANK: SGT/E-5 NON- COMMISSIONED OFFIC...,AVIATION
2480,"GOVERNMENT RELATIONS, COMMUNICATIONS ...",AVIATION
2481,GEEK SQUAD AGENT Professional...,AVIATION
2482,PROGRAM DIRECTOR / OFFICE MANAGER ...,AVIATION


In [143]:
# List the distinct categories so we can keep only those in the IT field:
for val in df_csvs_2['Category'].unique():
    print(val)

HR
DESIGNER
INFORMATION-TECHNOLOGY
TEACHER
ADVOCATE
BUSINESS-DEVELOPMENT
HEALTHCARE
FITNESS
AGRICULTURE
BPO
SALES
CONSULTANT
DIGITAL-MEDIA
AUTOMOBILE
CHEF
FINANCE
APPAREL
ENGINEERING
ACCOUNTANT
CONSTRUCTION
PUBLIC-RELATIONS
BANKING
ARTS
AVIATION


In [144]:
# Build 2 lists for our final DF: the first holds the categories we include in full,
# and the second holds CONSULTANT, from which we will keep only the consultants in the IT field.

IT_Jobs_list = ['INFORMATION-TECHNOLOGY','BUSINESS-DEVELOPMENT']
Consultant = ['CONSULTANT']

filtered_df_csvs_2 = df_csvs_2[df_csvs_2['Category'].isin(IT_Jobs_list)]
filtered_df_csvs_2 = filtered_df_csvs_2.reset_index(drop=True)
filtered_df_csvs_2

#filtered_df_csvs_2['Resume_str'].iloc[90]   # To inspect one particular Resume_str.

Unnamed: 0,Resume_str,Category
0,INFORMATION TECHNOLOGY Summar...,INFORMATION-TECHNOLOGY
1,INFORMATION TECHNOLOGY SPECIALIST\tGS...,INFORMATION-TECHNOLOGY
2,INFORMATION TECHNOLOGY SUPERVISOR ...,INFORMATION-TECHNOLOGY
3,INFORMATION TECHNOLOGY INSTRUCTOR ...,INFORMATION-TECHNOLOGY
4,INFORMATION TECHNOLOGY MANAGER/ANALYS...,INFORMATION-TECHNOLOGY
...,...,...
235,BUSINESS DEVELOPMENT INTERN Sum...,BUSINESS-DEVELOPMENT
236,"DIRECTOR, BUSINESS DEVELOPMENT ...",BUSINESS-DEVELOPMENT
237,BUSINESS DEVELOPMENT MANAGER Su...,BUSINESS-DEVELOPMENT
238,BUSINESS DEVELOPMENT REPRESENTATIVE ...,BUSINESS-DEVELOPMENT


In [145]:
df_consultants = df_csvs_2[df_csvs_2['Category'].isin(Consultant)]
df_consultants = df_consultants.reset_index(drop=True)
df_consultants

Unnamed: 0,Resume_str,Category
0,CONSULTANT Summary Human R...,CONSULTANT
1,IT CONSULTANT Summary ...,CONSULTANT
2,CONSULTANT Professional Sum...,CONSULTANT
3,CONSULTANT Professional Summary...,CONSULTANT
4,CONSULTANT Summary \nPC T...,CONSULTANT
...,...,...
110,BUSINESS CONSULTANT Professiona...,CONSULTANT
111,CONSULTANT ACCOUNT Summary T...,CONSULTANT
112,PRINCIPAL CONSULTANT Summary ...,CONSULTANT
113,LEASING CONSULTANT Summary T...,CONSULTANT


In [146]:
# Keep the Consultants whose resume contains the word "it" (lowercased first to avoid
# case-mismatch problems). This drops the Consultants not related to the IT field.
# Caveat: \bit\b also matches the English pronoun "it", so some non-IT resumes may pass the filter.
df_consultants_it = df_consultants[df_consultants['Resume_str'].str.lower().str.contains(r'\bit\b', regex=True)]
df_consultants_it = df_consultants_it.reset_index(drop=True)
df_consultants_it.shape  # Only 50 rows remain.

(50, 2)
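
The word-boundary filter used above can be tried on a few strings with the standard library. The sample resumes and the helper `mentions_it` are hypothetical; the point is only to show what `\bit\b` accepts and rejects after lowercasing.

```python
import re

IT_WORD = re.compile(r"\bit\b")

def mentions_it(resume):
    """True when the lowercased text contains "it" as a standalone word."""
    return bool(IT_WORD.search(resume.lower()))

# Hypothetical resume snippets.
samples = [
    "IT CONSULTANT Summary ...",             # standalone "IT": kept
    "LEASING CONSULTANT with audit skills",  # "it" only inside "with" and "audit": dropped
    "Consultant: I built it from scratch",   # the pronoun "it": kept as well
]
flags = [mentions_it(s) for s in samples]
print(flags)  # [True, False, True]
```

The third sample shows the limitation of this heuristic: the pronoun "it" is indistinguishable from the abbreviation once the text is lowercased, so the filter can let through resumes unrelated to IT.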

In [147]:
# Finally, append "df_consultants_it" to "filtered_df_csvs_2":
filtered_df_csvs_2 = pd.concat([filtered_df_csvs_2, df_consultants_it], ignore_index=True)  # DataFrame.append was removed in pandas 2.0.
filtered_df_csvs_2
#df_consultants_it = df_consultants_it.reset_index(drop=True)

Unnamed: 0,Resume_str,Category
0,INFORMATION TECHNOLOGY Summar...,INFORMATION-TECHNOLOGY
1,INFORMATION TECHNOLOGY SPECIALIST\tGS...,INFORMATION-TECHNOLOGY
2,INFORMATION TECHNOLOGY SUPERVISOR ...,INFORMATION-TECHNOLOGY
3,INFORMATION TECHNOLOGY INSTRUCTOR ...,INFORMATION-TECHNOLOGY
4,INFORMATION TECHNOLOGY MANAGER/ANALYS...,INFORMATION-TECHNOLOGY
...,...,...
285,CONSULTANT TO OWNER Educati...,CONSULTANT
286,Pavithra Shetty Summary ...,CONSULTANT
287,BUSINESS CONSULTANT Professiona...,CONSULTANT
288,PRINCIPAL CONSULTANT Summary ...,CONSULTANT


#### Removing duplicates (we can see there are none anyway):

In [148]:
non_duplicate_csvs_2 = filtered_df_csvs_2.drop_duplicates(subset=None, keep='first', inplace=False)

In [149]:
non_duplicate_csvs_2

Unnamed: 0,Resume_str,Category
0,INFORMATION TECHNOLOGY Summar...,INFORMATION-TECHNOLOGY
1,INFORMATION TECHNOLOGY SPECIALIST\tGS...,INFORMATION-TECHNOLOGY
2,INFORMATION TECHNOLOGY SUPERVISOR ...,INFORMATION-TECHNOLOGY
3,INFORMATION TECHNOLOGY INSTRUCTOR ...,INFORMATION-TECHNOLOGY
4,INFORMATION TECHNOLOGY MANAGER/ANALYS...,INFORMATION-TECHNOLOGY
...,...,...
285,CONSULTANT TO OWNER Educati...,CONSULTANT
286,Pavithra Shetty Summary ...,CONSULTANT
287,BUSINESS CONSULTANT Professiona...,CONSULTANT
288,PRINCIPAL CONSULTANT Summary ...,CONSULTANT


## 4-CVs_CSV_2

In [150]:
pathCV_CSV='../Datasets_CVs_And_Job_Descriptions/EN/CVs/4-CVs_CSV_2'
df_csvs = pd.read_csv(pathCV_CSV+'/UpdatedResumeDataSet.csv', usecols= ['Category','Resume'])
df_csvs

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."
...,...,...
957,Testing,Computer Skills: â¢ Proficient in MS office (...
958,Testing,â Willingness to accept the challenges. â ...
959,Testing,"PERSONAL SKILLS â¢ Quick learner, â¢ Eagerne..."
960,Testing,COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...


In [151]:
# List the distinct categories so we can keep only those in the IT field:
for val in df_csvs['Category'].unique():
    print(val)

Data Science
HR
Advocate
Arts
Web Designing
Mechanical Engineer
Sales
Health and fitness
Civil Engineer
Java Developer
Business Analyst
SAP Developer
Automation Testing
Electrical Engineering
Operations Manager
Python Developer
DevOps Engineer
Network Security Engineer
PMO
Database
Hadoop
ETL Developer
DotNet Developer
Blockchain
Testing


In [152]:
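# Note: 'Blockchain Testing' is a single string that matches neither the 'Blockchain' nor the 'Testing'
# category, so both are excluded from the output below; split it into two entries to include them.
IT_Jobs_list = ['Data Science','Web Designing','Java Developer', 'Business Analyst', 'SAP Developer', 'Automation Testing', 'Python Developer', 'DevOps Engineer', 'Network Security Engineer', 'PMO', 'Database', 'Hadoop', 'ETL Developer', 'DotNet Developer', 'Blockchain Testing']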

filtered_df_csvs = df_csvs[df_csvs['Category'].isin(IT_Jobs_list)]
filtered_df_csvs = filtered_df_csvs.reset_index(drop=True)
filtered_df_csvs

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."
...,...,...
543,DotNet Developer,"Technical Skills â¢ Languages: C#, ASP .NET M..."
544,DotNet Developer,Education Details \r\nJanuary 2014 Education ...
545,DotNet Developer,"Technologies ASP.NET, MVC 3.0/4.0/5.0, Unit Te..."
546,DotNet Developer,"Technical Skills CATEGORY SKILLS Language C, C..."


#### Removing duplicates; only 97 rows remain.

In [154]:
non_duplicate_csvs = filtered_df_csvs.drop_duplicates(subset=None, keep='first', inplace=False)
non_duplicate_csvs = non_duplicate_csvs.reset_index(drop=True)

In [155]:
non_duplicate_csvs

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."
...,...,...
92,DotNet Developer,"Technical Skills â¢ Languages: C#, ASP .NET M..."
93,DotNet Developer,Education Details \r\nJanuary 2014 Education ...
94,DotNet Developer,"Technologies ASP.NET, MVC 3.0/4.0/5.0, Unit Te..."
95,DotNet Developer,"Technical Skills CATEGORY SKILLS Language C, C..."


# Merging the 4 datasets (2 PDF sets and 2 CSVs) to obtain our 'df_candidatos_final'

In [184]:
# Rename the columns of the CSV DFs ('non_duplicate_csvs_2' and 'non_duplicate_csvs')
# so they match the PDF DFs (df_Candidates, df_Candidates_2):
non_duplicate_csvs.rename(columns={'Category': 'Candidate_Name', 'Resume': 'Content_CV'}, inplace=True)
non_duplicate_csvs_2.rename(columns={'Category': 'Candidate_Name', 'Resume_str': 'Content_CV'}, inplace=True)

# Merge non_duplicate_csvs and non_duplicate_csvs_2 (the CSVs).
# pd.concat is used throughout because DataFrame.append was removed in pandas 2.0:
df_csvs = pd.concat([non_duplicate_csvs, non_duplicate_csvs_2], ignore_index=True)

# Prefix 'Candidate_Name' (formerly 'Category') of the CSV rows with the row index, giving "<index>-<Category>":
df_csvs['Candidate_Name'] = df_csvs.index.astype(str)  + "-" + df_csvs['Candidate_Name'].astype(str)   

# Merge df_Candidates and df_Candidates_2 (the PDFs):
df_pdfs = pd.concat([df_Candidates, df_Candidates_2], ignore_index=True)

# Finally, append the CSV rows to the PDF rows:
df_candidatos_final = pd.concat([df_pdfs, df_csvs], ignore_index=True)

In [185]:
df_candidatos_final

Unnamed: 0,Candidate_Name,Content_CV
0,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...
1,DataScientist_Rahul_Malik,\nRAHUL MALIK\nNLP Data Scientist\nCONTACT WOR...
2,HCM_Federico_Calonge,\n ...
3,HCM_Robert_Smith,\nSap Hcm Consultant Phone: (123) 456 78 99\nE...
4,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+..."
...,...,...
620,382-CONSULTANT,CONSULTANT TO OWNER Educati...
621,383-CONSULTANT,Pavithra Shetty Summary ...
622,384-CONSULTANT,BUSINESS CONSULTANT Professiona...
623,385-CONSULTANT,PRINCIPAL CONSULTANT Summary ...


#### We have a total of 625 rows/CVs.

In [190]:
df_candidatos_final['Content_CV'].iloc[624]

"         MARKETING CONSULTANT           Summary     Value Creator, Marketing Executive: \xa0Versatile strategic leader with over 15 years in corporate marketing, business development, and account management for Fortune 1000, niche, and start-up companies. Success in healthcare, technology, automotive, retail, and consumer-packaged goods. Proven ability to grow revenues and brand loyalty in B2C and B2B markets with innovative campaigns and targeted marketing programs.\xa0 Contributor to team leader with experience over million-dollar budgets. Entrepreneur mindset creative and analytical skills for measurable impact. BBA in Marketing and MBA in Management.      Skills          Strategic Planning  Forecasting, Budgets, & P & L  Brand & Product Management  Channel Strategies  Lead Generation  Account Management  Complex Selling  Software & Technology  Manufacturing   Sourcing  Sales Enablement      Business Competitive Analysis  Market Research  New Product Development  Packaging  Creativ

## We export the DF as a CSV

In [258]:
#Export the CSV:
df_candidatos_final.to_csv('DF_Exportado_625_CVs.csv',index = False)

In [3]:
#To import it again:
df_candidatos_f = pd.read_csv('DF_Exportado_625_CVs.csv')
df_candidatos_f.shape

(625, 2)

#### Data cleaning of the Candidates DF.

In [4]:
cleaning_DF(df_candidatos_f,'Content_CV',True)
tokenize_and_lemmatize(df_candidatos_f,'Content_CV')

In [5]:
df_candidatos_f

Unnamed: 0,Candidate_Name,Content_CV,clean_Content_CV,tokens_Content_CV
0,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"[data, scientist, brooklyn, ny, data, scientis..."
1,DataScientist_Rahul_Malik,\nRAHUL MALIK\nNLP Data Scientist\nCONTACT WOR...,nlp data scientist brooklyn ny nlp data scient...,"[nlp, data, scientist, brooklyn, ny, nlp, data..."
2,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"[hcm, technical, consultant, working, oracle, ..."
3,HCM_Robert_Smith,\nSap Hcm Consultant Phone: (123) 456 78 99\nE...,sap hcm consultant com qwikresume marshville r...,"[sap, hcm, consultant, com, qwikresume, marshv..."
4,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"[kasey, vista, detroit, senior, software, engi..."
...,...,...,...,...
620,382-CONSULTANT,CONSULTANT TO OWNER Educati...,consultant owner florida international univers...,"[consultant, owner, florida, international, un..."
621,383-CONSULTANT,Pavithra Shetty Summary ...,pavithra shetty customer oriented principal co...,"[pavithra, shetty, customer, oriented, princip..."
622,384-CONSULTANT,BUSINESS CONSULTANT Professiona...,business consultant business sale operation po...,"[business, consultant, business, sale, operati..."
623,385-CONSULTANT,PRINCIPAL CONSULTANT Summary ...,principal consultant supply chain logistics ma...,"[principal, consultant, supply, chain, logisti..."


In [6]:
#We look at the tokens generated for the candidate HCM_Federico_Calonge:
df_candidatos_f.iloc[2]['tokens_Content_CV']

['hcm',
 'technical',
 'consultant',
 'working',
 'oracle',
 'tool',
 'participated',
 'erp',
 'cloud',
 'project',
 'performing',
 'reporting',
 'working',
 'module',
 'ap',
 'ar',
 'gl',
 'participating',
 'hcm',
 'cloud',
 'project',
 'performing',
 'reporting',
 'extraction',
 'integration',
 'working',
 'module',
 'core',
 'recruitment',
 'member',
 'artificial',
 'intelligence',
 'ai',
 'committee',
 'oracle',
 'strong',
 'interest',
 'data',
 'science',
 'machine',
 'learning',
 'last',
 'computer',
 'engineering',
 'degree',
 'studying',
 'th',
 'english',
 'personal',
 'academic',
 'programming',
 'project',
 'oracle',
 'hcm',
 'cloud',
 'core',
 'recruitment',
 'oracle',
 'erp',
 'cloud',
 'account',
 'receivable',
 'ar',
 'account',
 'payable',
 'ap',
 'general',
 'ledger',
 'gl',
 'main',
 'reporting',
 'sql',
 'bi',
 'publisher',
 'intermediate',
 'advanced',
 'point',
 'technical',
 'documentation',
 'extraction',
 'api',
 'rest',
 'hcm',
 'extract',
 'web',
 'service',
 

### We obtain the bigrams and store them in the 'tokens_Content_CV' column

In [8]:
lista_vocab = df_candidatos_f['tokens_Content_CV'].tolist()
#lista_vocab is a list of token lists.
lista_vocab[624]   #the last one is at index 624.

['marketing',
 'consultant',
 'value',
 'creator',
 'marketing',
 'executive',
 'versatile',
 'strategic',
 'leader',
 'corporate',
 'marketing',
 'business',
 'development',
 'account',
 'management',
 'fortune',
 'niche',
 'start',
 'company',
 'success',
 'healthcare',
 'technology',
 'automotive',
 'retail',
 'consumer',
 'packaged',
 'good',
 'proven',
 'grow',
 'revenue',
 'brand',
 'loyalty',
 'b',
 'c',
 'b',
 'b',
 'market',
 'innovative',
 'campaign',
 'targeted',
 'marketing',
 'program',
 'contributor',
 'team',
 'leader',
 'million',
 'dollar',
 'budget',
 'entrepreneur',
 'mindset',
 'creative',
 'analytical',
 'measurable',
 'impact',
 'bba',
 'marketing',
 'mba',
 'management',
 'strategic',
 'planning',
 'forecasting',
 'budget',
 'p',
 'l',
 'brand',
 'product',
 'management',
 'channel',
 'strategy',
 'lead',
 'generation',
 'account',
 'management',
 'complex',
 'selling',
 'software',
 'technology',
 'manufacturing',
 'sourcing',
 'sale',
 'enablement',
 'business'

In [9]:
len(lista_vocab)

625

In [11]:
import gensim
from gensim.models.phrases import Phraser, Phrases
import string
import re

#Here we join bigrams such as machine_learning, big_data or deep_learning: words that were tokenized
#separately but actually belong together. The Phraser object detects them and returns each pair as a single token.
#Bigrams are built using Phrases.

#We create the bigrams as follows:
# Detect the relevant phrases in our list of token lists:
phrases = Phrases(lista_vocab)
# Wrap them in a Phraser object, which is used to transform the token lists:
bigram = Phraser(phrases)
# Apply the Phraser to every token list and collect the results in a list:
all_sentences = list(bigram[lista_vocab])

#all_sentences is a list of lists with our vocabulary updated with bigrams (625 lists in total).
#We display the last one:
all_sentences[624]   #the last one is at index 624.

['marketing',
 'consultant',
 'value',
 'creator',
 'marketing',
 'executive',
 'versatile',
 'strategic',
 'leader',
 'corporate',
 'marketing',
 'business',
 'development',
 'account',
 'management',
 'fortune',
 'niche',
 'start',
 'company',
 'success',
 'healthcare',
 'technology',
 'automotive',
 'retail',
 'consumer',
 'packaged',
 'good',
 'proven',
 'grow',
 'revenue',
 'brand',
 'loyalty',
 'b_c',
 'b_b',
 'market',
 'innovative',
 'campaign',
 'targeted',
 'marketing',
 'program',
 'contributor',
 'team',
 'leader',
 'million_dollar',
 'budget',
 'entrepreneur',
 'mindset',
 'creative',
 'analytical',
 'measurable',
 'impact',
 'bba',
 'marketing',
 'mba',
 'management',
 'strategic_planning',
 'forecasting',
 'budget',
 'p_l',
 'brand',
 'product',
 'management',
 'channel',
 'strategy',
 'lead_generation',
 'account',
 'management',
 'complex',
 'selling',
 'software',
 'technology',
 'manufacturing',
 'sourcing',
 'sale',
 'enablement',
 'business',
 'competitive_analysis

In [12]:
len(all_sentences)

625
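
The phrase detection above can be pictured with a simplified, pure-Python sketch. It assumes the default scoring formula documented for gensim's `Phrases`, `(count(a, b) - min_count) * N / (count(a) * count(b))`, and takes `N` to be the number of distinct tokens. The function name `merge_bigrams`, the toy corpus and the small `min_count`/`threshold` values are hypothetical and chosen only so that the tiny example produces a merge. gensim's real implementation differs in several details.

```python
from collections import Counter

def merge_bigrams(sentences, min_count=1, threshold=1.0, delimiter="_"):
    """Merge adjacent token pairs whose co-occurrence score exceeds `threshold`."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)

    def score(a, b):
        # Pairs that co-occur often relative to their individual counts score high.
        return (bigrams[(a, b)] - min_count) * vocab_size / (unigrams[a] * unigrams[b])

    merged = []
    for sent in sentences:
        out, i = [], 0
        while i < len(sent):
            if i + 1 < len(sent) and score(sent[i], sent[i + 1]) > threshold:
                out.append(sent[i] + delimiter + sent[i + 1])
                i += 2          # skip the second word of the merged pair
            else:
                out.append(sent[i])
                i += 1
        merged.append(out)
    return merged

toy_corpus = [
    ["machine", "learning", "project"],
    ["machine", "learning", "model"],
    ["machine", "learning", "team"],
    ["project", "team", "model"],
]
toy_merged = merge_bigrams(toy_corpus)
print(toy_merged[0])   # ['machine_learning', 'project']
```

Only "machine learning" is frequent enough to be merged here. Pairs that appear once, such as "project team", stay as separate tokens.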

In [13]:
#Update the 'tokens_Content_CV' column of our DF with the bigrams:
df_candidatos_f['tokens_Content_CV'] = all_sentences

In [14]:
df_candidatos_f

Unnamed: 0,Candidate_Name,Content_CV,clean_Content_CV,tokens_Content_CV
0,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"[data_scientist, brooklyn, ny, data_scientist,..."
1,DataScientist_Rahul_Malik,\nRAHUL MALIK\nNLP Data Scientist\nCONTACT WOR...,nlp data scientist brooklyn ny nlp data scient...,"[nlp, data_scientist, brooklyn, ny, nlp, data_..."
2,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"[hcm, technical, consultant, working, oracle, ..."
3,HCM_Robert_Smith,\nSap Hcm Consultant Phone: (123) 456 78 99\nE...,sap hcm consultant com qwikresume marshville r...,"[sap_hcm, consultant, com, qwikresume, marshvi..."
4,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"[kasey, vista, detroit, senior, software_engin..."
...,...,...,...,...
620,382-CONSULTANT,CONSULTANT TO OWNER Educati...,consultant owner florida international univers...,"[consultant, owner, florida, international, un..."
621,383-CONSULTANT,Pavithra Shetty Summary ...,pavithra shetty customer oriented principal co...,"[pavithra, shetty, customer, oriented, princip..."
622,384-CONSULTANT,BUSINESS CONSULTANT Professiona...,business consultant business sale operation po...,"[business, consultant, business, sale, operati..."
623,385-CONSULTANT,PRINCIPAL CONSULTANT Summary ...,principal consultant supply chain logistics ma...,"[principal_consultant, supply_chain, logistics..."


In [251]:
#Remove the candidate '316-BUSINESS-DEVELOPMENT', since its Content_CV is empty.
df_candidatos_f =  df_candidatos_f[(df_candidatos_f['Candidate_Name']!='316-BUSINESS-DEVELOPMENT')]
df_candidatos_f

Unnamed: 0,Candidate_Name,Content_CV,clean_Content_CV,tokens_Content_CV
0,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"[data_scientist, brooklyn, ny, data_scientist,..."
1,DataScientist_Rahul_Malik,\nRAHUL MALIK\nNLP Data Scientist\nCONTACT WOR...,nlp data scientist brooklyn ny nlp data scient...,"[nlp, data_scientist, brooklyn, ny, nlp, data_..."
2,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"[hcm, technical, consultant, working, oracle, ..."
3,HCM_Robert_Smith,\nSap Hcm Consultant Phone: (123) 456 78 99\nE...,sap hcm consultant com qwikresume marshville r...,"[sap_hcm, consultant, com, qwikresume, marshvi..."
4,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"[kasey, vista, detroit, senior, software_engin..."
...,...,...,...,...
620,382-CONSULTANT,CONSULTANT TO OWNER Educati...,consultant owner florida international univers...,"[consultant, owner, florida, international, un..."
621,383-CONSULTANT,Pavithra Shetty Summary ...,pavithra shetty customer oriented principal co...,"[pavithra, shetty, customer, oriented, princip..."
622,384-CONSULTANT,BUSINESS CONSULTANT Professiona...,business consultant business sale operation po...,"[business, consultant, business, sale, operati..."
623,385-CONSULTANT,PRINCIPAL CONSULTANT Summary ...,principal consultant supply chain logistics ma...,"[principal_consultant, supply_chain, logistics..."


## 2-Performing Comparisons and Obtaining Similarities.

###### We create a new DF by cross-joining the 2 DataFrames so that every CV is compared with every Job:

In [32]:
#We will end up with 625 CVs x 10 jobs = 6250 rows.
df_Jobs_and_Candidates = pd.merge(df_candidatos_f.assign(A=1), df_Jobs.assign(A=1), on='A').drop(columns='A')
df_Jobs_and_Candidates

Unnamed: 0,Candidate_Name,Content_CV,clean_Content_CV,tokens_Content_CV,Job_Title,Job_Description,clean_Job_Description,tokens_Job_Description
0,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"[data_scientist, brooklyn, ny, data_scientist,...",Data Scientist,"Master’s degree or above in a STEM field, incl...",master degree stem field including limited com...,"[master, degree, stem, field, including, limit..."
1,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"[data_scientist, brooklyn, ny, data_scientist,...",Data Scientist 2,"\nReporting to the Director, Data & Analytics,...",reporting director data analytics senior data ...,"[reporting, director, data, analytics, senior,..."
2,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"[data_scientist, brooklyn, ny, data_scientist,...",HCM Consultant,\nThe Oracle Cloud HCM Absence Consultant will...,oracle cloud hcm absence consultant responsibl...,"[oracle, cloud, hcm, absence, consultant, resp..."
3,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"[data_scientist, brooklyn, ny, data_scientist,...",HCM Consultant 2,4+ years of experience in PeopleSoft or Oracle...,peoplesoft oracle eb implementation support hc...,"[peoplesoft, oracle, eb, implementation, suppo..."
4,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"[data_scientist, brooklyn, ny, data_scientist,...",Machine Learning Engineer,Leveraging the latest machine and deep learnin...,leveraging latest machine deep learning techni...,"[leveraging, latest, machine, deep_learning, t..."
...,...,...,...,...,...,...,...,...
6245,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"[marketing, consultant, value, creator, market...",Machine Learning Engineer 2,Collaborate with a multidisciplinary team to g...,collaborate multidisciplinary team gain insigh...,"[collaborate, multidisciplinary, team, gain, i..."
6246,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"[marketing, consultant, value, creator, market...",Security Specialist,Work in a fast-paced environment that combine ...,fast paced environment combine technical secur...,"[fast, paced, environment, combine, technical,..."
6247,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"[marketing, consultant, value, creator, market...",Security Specialist 2,\n Handling incoming requests for assistanc...,handling incoming request assistance business ...,"[handling, incoming, request, assistance, busi..."
6248,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"[marketing, consultant, value, creator, market...",Web Developer Full Stack,\n\n Graduate Degree in Information Technol...,graduate degree information technology similar...,"[graduate, degree, information, technology, si..."
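
The helper-column trick used above (assigning a constant key `A` to both DataFrames and merging on it) builds a Cartesian product. As a minimal sketch, and assuming pandas 1.2 or later, the same pairing can be written with `how="cross"`. The toy DataFrames below are hypothetical.

```python
import pandas as pd

candidates = pd.DataFrame({"Candidate_Name": ["cv_a", "cv_b", "cv_c"]})
jobs = pd.DataFrame({"Job_Title": ["job_x", "job_y"]})

# Every candidate is paired with every job: 3 x 2 = 6 rows.
pairs = candidates.merge(jobs, how="cross")
print(pairs.shape)   # (6, 2)
```

With the real data this would pair 625 CVs with 10 job descriptions, which gives the 6250 rows shown above.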


In [36]:
#Export the CSV:
df_Jobs_and_Candidates.to_csv('DF_Exportado_6250_rows.csv',index = False)

In [37]:
#To import it again:
df_Jobs_and_Candidates = pd.read_csv('DF_Exportado_6250_rows.csv')
df_Jobs_and_Candidates.shape

(6250, 8)
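
One thing to watch after a `to_csv`/`read_csv` round-trip: columns that held Python lists come back as plain strings, which is consistent with the quoted items that appear in the token columns of the outputs that follow. Functions that expect a list of tokens (such as gensim's `wmdistance`) would then iterate over characters rather than words. The sketch below shows one possible way to restore the lists. The helper name `parse_tokens` is hypothetical.

```python
import ast

def parse_tokens(cell):
    """Return a list of tokens whether `cell` is already a list or its string form."""
    if isinstance(cell, list):
        return cell
    return ast.literal_eval(cell)   # safely evaluates the literal "['a', 'b']" into a list

stored = "['data_scientist', 'brooklyn', 'ny']"   # how a token list looks after the round-trip
tokens = parse_tokens(stored)
print(tokens)        # ['data_scientist', 'brooklyn', 'ny']
print(len(tokens))   # 3
```

Under this assumption, it could be applied to the two token columns with `df_Jobs_and_Candidates['tokens_Content_CV'].apply(parse_tokens)` (and likewise for `'tokens_Job_Description'`) before computing any token-based distance.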

In [46]:
array_unicode_CV = df_Jobs_and_Candidates['clean_Content_CV'].astype('U').values
array_unicode_Jobs= df_Jobs_and_Candidates['clean_Job_Description'].astype('U').values

### 2.1- TF-IDF & Cosine Similarity.

In [54]:
#https://www.py4u.net/discuss/188191

#Calculate the cosine similarity, using TF-IDF vectors, between search queries and matched documents.
#Here the search queries are the 'clean_Content_CV' column and the matched documents are the 'clean_Job_Description' column.
#Calling fit learns the vocabulary and the IDF weights from the CVs and the job descriptions;
#calling transform then converts each text into a TF-IDF vector using those learned weights.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import paired_cosine_distances as pcd

# Initialize an instance of tf-idf Vectorizer
tfidf_vectorizer = TfidfVectorizer()

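# Learn the vocabulary and the IDF weights from the corpus (each CV joined to its job description).
# Note: the two texts are joined without a separator, so unless the cleaned CV text ends in whitespace,
# its last word and the first word of the job description fuse into a single token.
#tfidf_vectorizer.fit(df_Jobs_and_Candidates['clean_Content_CV'] + " " + df_Jobs_and_Candidates['clean_Job_Description'])
tfidf_vectorizer.fit(array_unicode_CV+array_unicode_Jobs)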

A = tfidf_vectorizer.transform(array_unicode_CV)
B = tfidf_vectorizer.transform(array_unicode_Jobs)

# Compute the row-wise cosine similarity (1 - paired cosine distance):
cosine = 1 - pcd(A, B)

df_Jobs_and_Candidates['tfidf_cosine'] = cosine

df_Jobs_and_Candidates

Unnamed: 0,Candidate_Name,Content_CV,clean_Content_CV,tokens_Content_CV,Job_Title,Job_Description,clean_Job_Description,tokens_Job_Description,tfidf_cosine
0,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"['data_scientist', 'brooklyn', 'ny', 'data_sci...",Data Scientist,"Master’s degree or above in a STEM field, incl...",master degree stem field including limited com...,"['master', 'degree', 'stem', 'field', 'includi...",0.060513
1,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"['data_scientist', 'brooklyn', 'ny', 'data_sci...",Data Scientist 2,"\nReporting to the Director, Data & Analytics,...",reporting director data analytics senior data ...,"['reporting', 'director', 'data', 'analytics',...",0.115418
2,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"['data_scientist', 'brooklyn', 'ny', 'data_sci...",HCM Consultant,\nThe Oracle Cloud HCM Absence Consultant will...,oracle cloud hcm absence consultant responsibl...,"['oracle', 'cloud', 'hcm', 'absence', 'consult...",0.021795
3,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"['data_scientist', 'brooklyn', 'ny', 'data_sci...",HCM Consultant 2,4+ years of experience in PeopleSoft or Oracle...,peoplesoft oracle eb implementation support hc...,"['peoplesoft', 'oracle', 'eb', 'implementation...",0.019523
4,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"['data_scientist', 'brooklyn', 'ny', 'data_sci...",Machine Learning Engineer,Leveraging the latest machine and deep learnin...,leveraging latest machine deep learning techni...,"['leveraging', 'latest', 'machine', 'deep_lear...",0.051246
...,...,...,...,...,...,...,...,...,...
6245,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"['marketing', 'consultant', 'value', 'creator'...",Machine Learning Engineer 2,Collaborate with a multidisciplinary team to g...,collaborate multidisciplinary team gain insigh...,"['collaborate', 'multidisciplinary', 'team', '...",0.041610
6246,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"['marketing', 'consultant', 'value', 'creator'...",Security Specialist,Work in a fast-paced environment that combine ...,fast paced environment combine technical secur...,"['fast', 'paced', 'environment', 'combine', 't...",0.071639
6247,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"['marketing', 'consultant', 'value', 'creator'...",Security Specialist 2,\n Handling incoming requests for assistanc...,handling incoming request assistance business ...,"['handling', 'incoming', 'request', 'assistanc...",0.041386
6248,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"['marketing', 'consultant', 'value', 'creator'...",Web Developer Full Stack,\n\n Graduate Degree in Information Technol...,graduate degree information technology similar...,"['graduate', 'degree', 'information', 'technol...",0.029146
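
To make the `tfidf_cosine` column less of a black box, here is a minimal pure-Python sketch of the same idea. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what scikit-learn documents as the `TfidfVectorizer` defaults, and it tokenizes by whitespace only. The function names and toy texts are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build L2-normalised TF-IDF vectors (as dicts) using a smoothed IDF."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenised:
        weights = {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

def cosine(u, v):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

cv_texts = ["python data scientist", "oracle hcm consultant"]
job_texts = ["data scientist python sql", "security specialist network"]

vecs = tfidf_vectors(cv_texts + job_texts)   # "fit" on every text (list concatenation)
cv_vecs, job_vecs = vecs[:2], vecs[2:]
# Paired comparison: row i of the CVs against row i of the jobs.
paired = [cosine(c, j) for c, j in zip(cv_vecs, job_vecs)]
print([round(s, 3) for s in paired])
```

The first pair shares three terms and gets a high score, while the second pair shares none and scores exactly 0. This also illustrates why TF-IDF similarity stays low when a CV and a job description use different words for related concepts.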


### 2.2- Word Embedding (Word2vec) & WMD.

#### We use a Word2vec model trained on our own vocabulary.


#### We join the 'tokens_Job_Description' column of 'df_Jobs' and the 'tokens_Content_CV' column of 'df_candidatos_f' into a list of lists, which will be the vocabulary used to train the word embeddings:

In [219]:
list_vocab_Jobs = df_Jobs['tokens_Job_Description']
#list_vocab_Jobs.tolist()

In [220]:
list_vocab_Jobs

0        [master_degree, stem, field, including_limited...
1        [reporting, director, data, analytics, senior,...
2        [oracle, cloud, hcm, absence, consultant, resp...
3        [peoplesoft, oracle_eb, implementation, suppor...
4        [leveraging_latest, machine, deep_learning, te...
                               ...                        
20588    [company, description, searching, talented, cr...
20589    [location_san, francisco, caterm, full_time, p...
20590    [take_pride, knowing, thousand, life, positive...
20591    [company, description, offer, youas, world_lea...
20592    [c_c, programmingdevelopment, win, programming...
Name: tokens_Job_Description, Length: 20593, dtype: object

In [221]:
list_vocab_Candid = df_candidatos_f['tokens_Content_CV']
#list_vocab_Candid.tolist()

In [222]:
list_vocab_Candid

0      [data_scientist, brooklyn, ny, data_scientist,...
1      [nlp, data_scientist, brooklyn, ny, nlp, data_...
2      [hcm, technical, consultant, working, oracle, ...
3      [sap_hcm, consultant, com, qwikresume, marshvi...
4      [kasey, vista, detroit, senior, software_engin...
                             ...                        
620    [consultant, owner, florida, international, un...
621    [pavithra, shetty, customer, oriented, princip...
622    [business, consultant, business, sale, operati...
623    [principal_consultant, supply_chain, logistics...
624    [marketing, consultant, value, creator, market...
Name: tokens_Content_CV, Length: 625, dtype: object

In [223]:
all_vocab = pd.concat([list_vocab_Candid, list_vocab_Jobs]) #Join the two Series (Series.append was removed in pandas 2.0).

In [224]:
all_vocab

0        [data_scientist, brooklyn, ny, data_scientist,...
1        [nlp, data_scientist, brooklyn, ny, nlp, data_...
2        [hcm, technical, consultant, working, oracle, ...
3        [sap_hcm, consultant, com, qwikresume, marshvi...
4        [kasey, vista, detroit, senior, software_engin...
                               ...                        
20588    [company, description, searching, talented, cr...
20589    [location_san, francisco, caterm, full_time, p...
20590    [take_pride, knowing, thousand, life, positive...
20591    [company, description, offer, youas, world_lea...
20592    [c_c, programmingdevelopment, win, programming...
Length: 21218, dtype: object

In [225]:
lista_all_vocab = all_vocab.tolist()
lista_all_vocab[0]

['data_scientist',
 'brooklyn',
 'ny',
 'data_scientist',
 'grubhub',
 'current',
 'new_york',
 'ny',
 'implemented',
 'various',
 'time_series',
 'forecasting',
 'technique',
 'predict',
 'surge',
 'customer',
 'order',
 'lower',
 'average',
 'customer',
 'wait',
 'time',
 'minute',
 'github',
 'deployed',
 'recommendation',
 'engine',
 'production',
 'conditionally',
 'recommend',
 'menu',
 'item',
 'based',
 'past',
 'order',
 'history',
 'increase',
 'average',
 'order',
 'size',
 'designed',
 'model',
 'portland',
 'pilot',
 'program',
 'increase',
 'incentive',
 'driver',
 'peak',
 'ordering',
 'hour',
 'resulting_increase',
 'driver',
 'b',
 'availability',
 'peak',
 'ordering',
 'time',
 'statistic',
 'rutgers',
 'university',
 'data_scientist',
 'new',
 'brunswick_nj',
 'adobe',
 'new_york',
 'ny',
 'worked',
 'product',
 'marketing',
 'team',
 'identify',
 'customer',
 'interaction',
 'free',
 'trial',
 'maximize',
 'likelihood',
 'conversion',
 'resulting',
 'conversion_rate

In [226]:
len(lista_all_vocab)

21218

#### Word2vec model:

In [227]:
from gensim.models import Word2Vec

#The Gensim library provides a simple API to Google's 'word2vec' algorithm, which is used to create word embeddings.
#As we saw previously, word2vec models are shallow, two-layer neural networks trained to
#reconstruct the linguistic contexts of words.

#We could load pre-trained word embeddings, for example the Google News ones, which contain vectors for 3 million words and phrases.
#Instead, here we use our own vocabulary to train and create our own embeddings.

#The input of the neural network is a word represented as a "one-hot" vector, i.e. a vector with as many positions
#as there are words in the vocabulary.
#For example, to represent the word "sol" in a 4-word vocabulary (el sol está cayendo), we use a vector
#of that dimension (4) with zeros in every position except the one corresponding to "sol", which holds
#a one --> [0,1,0,0]
#The output of the network is another vector with the same 4 positions, holding the probability that each
#word is a neighbour of the input word (these values sum to 1).
#The "word embedding" of a word is not that output: it is the dense vector of hidden-layer weights learned for the word.

#Word embeddings are one of the most popular representations of a document's vocabulary. They can capture the
#context of a word in the document, semantic and syntactic similarity, relationships with other words, etc.

#The following line trains the Word2Vec model on the 21218 token lists contained in 'lista_all_vocab':
model=gensim.models.Word2Vec(sentences=lista_all_vocab,min_count=2,workers=4,window=4,sg=0)
#We do not split into train and test: we are not going to predict words from contexts,
#we only want to obtain the most similar words, so we use ALL the token lists for training.
#model = Word2Vec(sentences=lista_all_vocab, size=200, window=4, min_count=1, workers=4)
#Parameters of Word2Vec:
#min_count: ignores all words whose total frequency is lower than min_count (default 5). We use 2, so words that appear only once are dropped.
#size (vector_size in gensim 4): the length of the dense vector that represents each word (default 100).
#window: the maximum distance between the current word and the predicted word within a token list.
#workers: the number of worker threads used for training.
#sg: the training algorithm, 0 for CBOW (the default) or 1 for skip-gram.

model.save("model")
wrds=list(model.wv.vectors)   #One embedding vector per word in the model's vocabulary.
print(len(wrds))

#Prints the size of the vocabulary (60855 words in this run).

60855
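
The one-hot input described in the comments above can be illustrated with a small pure-Python sketch. It also shows why multiplying a one-hot vector by the hidden-layer weight matrix simply selects one row, and that row is the word's embedding. The weight values below are hypothetical and are not taken from the trained model.

```python
def one_hot(word, vocabulary):
    """Vector with a 1 at the position of `word` and 0 elsewhere."""
    return [1 if w == word else 0 for w in vocabulary]

def matvec(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [sum(vector[i] * matrix[i][j] for i in range(len(vector))) for j in range(n_cols)]

vocabulary = ["el", "sol", "está", "cayendo"]
# Hypothetical hidden-layer weights: one 2-dimensional row per vocabulary word.
hidden_weights = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
    [0.7, 0.8],
]

x = one_hot("sol", vocabulary)
embedding = matvec(x, hidden_weights)
print(x)           # [0, 1, 0, 0]
print(embedding)   # [0.3, 0.4]
```

In the trained model each row has as many dimensions as the configured vector size, and it is the kind of vector that `model.wv.get_vector(...)` returns.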


In [181]:
#To load the model saved above with model.save("model"):

#from gensim.models import Word2Vec
#model = Word2Vec.load('model')

In [228]:
model.wv.get_vector('machine_learning')

array([-9.35329318e-01,  3.30134690e-01, -1.52708888e+00, -9.32685360e-02,
       -2.32141781e+00,  1.13510227e+00,  1.92877603e+00, -1.08878505e+00,
       -3.70160900e-02, -1.94043148e+00,  6.59955204e-01, -1.36610961e+00,
       -8.03965092e-01, -1.07488823e+00,  2.45476127e+00, -2.40702701e+00,
        1.28087670e-01,  3.92293036e-01, -8.57815683e-01, -1.33548349e-01,
        7.01223239e-02, -8.53365541e-01,  2.36701274e+00,  9.49352503e-01,
       -7.31995851e-02,  1.82397389e+00, -5.87599277e-01, -7.96092868e-01,
        5.39072193e-02,  6.38300657e-01,  6.50366664e-01,  2.47248158e-01,
        8.03261697e-02, -2.32437849e+00, -1.85387123e+00,  1.27883121e-01,
       -2.56175542e+00,  3.37904096e-02,  7.70332158e-01, -2.53822785e-02,
       -8.98168743e-01,  1.01843905e+00,  2.21773219e+00,  3.25790495e-01,
        5.62601626e-01, -1.13821220e+00, -1.69144189e+00,  1.89986244e-01,
        1.64483571e+00,  8.70197773e-01,  2.62504839e-03,  1.89018309e-01,
        7.53653884e-01, -

In [237]:
#Note: wmdistance expects two lists of tokens; a plain string is iterated character by character.
model.wv.wmdistance('deep_learning', 'machine_learning')

0.3036559729920758
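
For intuition about what a Word Mover's Distance measures, the sketch below implements a relaxed variant in pure Python: every word sends all of its weight to the nearest word of the other document, and the larger of the two directions is kept. This is a known lower bound on WMD, not the optimal-transport computation that gensim's `wmdistance` performs. The 2-D vectors and the function names are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(doc_a, doc_b, vectors):
    """Lower bound on WMD: each word moves all its weight to its nearest word in the other document."""
    def one_way(src, dst):
        weight = 1.0 / len(src)   # uniform weight per token occurrence
        return sum(weight * min(euclidean(vectors[s], vectors[d]) for d in dst) for s in src)
    return max(one_way(doc_a, doc_b), one_way(doc_b, doc_a))

toy_vectors = {   # hypothetical 2-D embeddings
    "machine_learning": (1.0, 0.0),
    "deep_learning": (1.0, 1.0),
    "marketing": (5.0, 0.0),
}
near = relaxed_wmd(["machine_learning"], ["deep_learning"], toy_vectors)
far = relaxed_wmd(["machine_learning"], ["marketing"], toy_vectors)
print(near, far)   # 1.0 4.0
```

Documents whose words lie close together in the embedding space get a small distance even when they share no words, which is what distinguishes this family of measures from TF-IDF cosine similarity.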

In [234]:
#Thanks to the word embeddings obtained by training the word2vec model,
#we can now use functions such as 'most_similar'. We test our model:
ModeloPrueba=model.wv.most_similar("machine_learning")

#The most_similar function finds the words that are semantically
#closest (most similar) to a given word (in this case 'machine_learning').

In [235]:
print(ModeloPrueba)

[('deep_learning', 0.8448491096496582), ('artificial_intelligence', 0.8281981348991394), ('nlp', 0.8264880776405334), ('natural_language', 0.7937481999397278), ('predictive_analytics', 0.7921558618545532), ('algorithm', 0.7918980717658997), ('predictive', 0.7864595055580139), ('predictive_modeling', 0.7853081226348877), ('newsql', 0.7812016010284424), ('data_mining', 0.7807395458221436)]


In [238]:
#WMD function that we will use:
def WMD(tokens_CV_Candidate,tokens_Job_Desc):
    return (model.wv.wmdistance(tokens_CV_Candidate, tokens_Job_Desc))  #'wmdistance' returns the Word Mover's Distance between two documents.

In [239]:
#As a test, we compute the WMD for the 1st row (between the 'tokens_Content_CV' and 'tokens_Job_Description' columns)
first_job_descr = df_Jobs_and_Candidates.loc[0,'tokens_Job_Description']
first_content_cv = df_Jobs_and_Candidates.loc[0,'tokens_Content_CV']

wmd_result = round(WMD(first_job_descr,first_content_cv),3)
print(wmd_result)
 
similarity_wdm = round((1/(1+wmd_result)),3)  
print(similarity_wdm)

0.113
0.898
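
The conversion used above, `similarity = 1 / (1 + distance)`, maps a distance of 0 to a similarity of 1 and larger distances to values approaching 0. A minimal sketch follows, with the hypothetical helper name `wmd_to_similarity`, checked against the two distances printed in this notebook.

```python
def wmd_to_similarity(distance):
    """Map a non-negative distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)

own_model = round(wmd_to_similarity(0.113), 3)
pretrained = round(wmd_to_similarity(0.052), 3)
print(own_model, pretrained)   # 0.898 0.951

# An infinite distance maps to a similarity of 0.
print(wmd_to_similarity(float("inf")))   # 0.0
```

The infinite case matters because gensim's `wmdistance` returns infinity when one of the documents has no words in the model's vocabulary.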


#### We use the downloaded (pre-trained) Word2vec.

In [84]:
#Link from which we downloaded the file: https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
EMBEDDING_FILE = '/home/fedricio/Desktop/Embeddings_Utilizados/Word2vec/GoogleNews-vectors-negative300.bin.gz'
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)
#To check whether a given word or bigram is present: #word2vec["python"]

In [85]:
#WMD function that we will use:
def WMD(tokens_CV_Candidate,tokens_Job_Desc):
    return (word2vec.wmdistance(tokens_CV_Candidate, tokens_Job_Desc))  #'wmdistance' returns the Word Mover's Distance between two documents.

In [193]:
#As a test, we compute the WMD for the 1st row (between the 'tokens_Content_CV' and 'tokens_Job_Description' columns)
first_job_descr = df_Jobs_and_Candidates.loc[0,'tokens_Job_Description']
first_content_cv = df_Jobs_and_Candidates.loc[0,'tokens_Content_CV']

wmd_result = round(WMD(first_job_descr,first_content_cv),3)
print(wmd_result)
 
similarity_wdm = round((1/(1+wmd_result)),3)  
print(similarity_wdm)

0.052
0.951


In [240]:
#We apply WMD to the WHOLE DF, between the CV content and the job description,
#and store the result in the 'WMD_Job_Desc' column.

#df_Jobs_and_Candidates['WMD_Job_Desc'] = df_Jobs_and_Candidates.apply(lambda row: round(WMD(row['tokens_Content_CV'],row['tokens_Job_Description']),3), axis=1)
#Converting the distance to a similarity with similarity = 1 / (1 + distance), so higher values mean more similar documents:
#(source: https://groups.google.com/g/gensim/c/-pRZnsOEaPQ)
df_Jobs_and_Candidates['WMD_Job_Desc'] = df_Jobs_and_Candidates.apply(lambda row: round((1/(1+(WMD(row['tokens_Content_CV'],row['tokens_Job_Description'])))),3), axis=1)

In [241]:
df_Jobs_and_Candidates

Unnamed: 0,Candidate_Name,Content_CV,clean_Content_CV,tokens_Content_CV,Job_Title,Job_Description,clean_Job_Description,tokens_Job_Description,tfidf_cosine,WMD_Job_Desc
0,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"['data_scientist', 'brooklyn', 'ny', 'data_sci...",Data Scientist,"Master’s degree or above in a STEM field, incl...",master degree stem field including limited com...,"['master', 'degree', 'stem', 'field', 'includi...",0.060513,0.898
1,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"['data_scientist', 'brooklyn', 'ny', 'data_sci...",Data Scientist 2,"\nReporting to the Director, Data & Analytics,...",reporting director data analytics senior data ...,"['reporting', 'director', 'data', 'analytics',...",0.115418,0.903
2,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"['data_scientist', 'brooklyn', 'ny', 'data_sci...",HCM Consultant,\nThe Oracle Cloud HCM Absence Consultant will...,oracle cloud hcm absence consultant responsibl...,"['oracle', 'cloud', 'hcm', 'absence', 'consult...",0.021795,0.884
3,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"['data_scientist', 'brooklyn', 'ny', 'data_sci...",HCM Consultant 2,4+ years of experience in PeopleSoft or Oracle...,peoplesoft oracle eb implementation support hc...,"['peoplesoft', 'oracle', 'eb', 'implementation...",0.019523,0.905
4,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"['data_scientist', 'brooklyn', 'ny', 'data_sci...",Machine Learning Engineer,Leveraging the latest machine and deep learnin...,leveraging latest machine deep learning techni...,"['leveraging', 'latest', 'machine', 'deep_lear...",0.051246,0.896
...,...,...,...,...,...,...,...,...,...,...
6245,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"['marketing', 'consultant', 'value', 'creator'...",Machine Learning Engineer 2,Collaborate with a multidisciplinary team to g...,collaborate multidisciplinary team gain insigh...,"['collaborate', 'multidisciplinary', 'team', '...",0.041610,0.927
6246,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"['marketing', 'consultant', 'value', 'creator'...",Security Specialist,Work in a fast-paced environment that combine ...,fast paced environment combine technical secur...,"['fast', 'paced', 'environment', 'combine', 't...",0.071639,0.937
6247,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"['marketing', 'consultant', 'value', 'creator'...",Security Specialist 2,\n Handling incoming requests for assistanc...,handling incoming request assistance business ...,"['handling', 'incoming', 'request', 'assistanc...",0.041386,0.910
6248,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"['marketing', 'consultant', 'value', 'creator'...",Web Developer Full Stack,\n\n Graduate Degree in Information Technol...,graduate degree information technology similar...,"['graduate', 'degree', 'information', 'technol...",0.029146,0.910


### Example: Federico Calonge

In [242]:
is_Federico_Calonge =  df_Jobs_and_Candidates['Candidate_Name']=='HCM_Federico_Calonge'
new_DF = df_Jobs_and_Candidates[is_Federico_Calonge]
print(new_DF.shape)
new_DF

(10, 10)


Unnamed: 0,Candidate_Name,Content_CV,clean_Content_CV,tokens_Content_CV,Job_Title,Job_Description,clean_Job_Description,tokens_Job_Description,tfidf_cosine,WMD_Job_Desc
20,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"['hcm', 'technical', 'consultant', 'working', ...",Data Scientist,"Master’s degree or above in a STEM field, incl...",master degree stem field including limited com...,"['master', 'degree', 'stem', 'field', 'includi...",0.069849,0.904
21,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"['hcm', 'technical', 'consultant', 'working', ...",Data Scientist 2,"\nReporting to the Director, Data & Analytics,...",reporting director data analytics senior data ...,"['reporting', 'director', 'data', 'analytics',...",0.083858,0.906
22,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"['hcm', 'technical', 'consultant', 'working', ...",HCM Consultant,\nThe Oracle Cloud HCM Absence Consultant will...,oracle cloud hcm absence consultant responsibl...,"['oracle', 'cloud', 'hcm', 'absence', 'consult...",0.314486,0.901
23,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"['hcm', 'technical', 'consultant', 'working', ...",HCM Consultant 2,4+ years of experience in PeopleSoft or Oracle...,peoplesoft oracle eb implementation support hc...,"['peoplesoft', 'oracle', 'eb', 'implementation...",0.367655,0.932
24,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"['hcm', 'technical', 'consultant', 'working', ...",Machine Learning Engineer,Leveraging the latest machine and deep learnin...,leveraging latest machine deep learning techni...,"['leveraging', 'latest', 'machine', 'deep_lear...",0.085925,0.904
25,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"['hcm', 'technical', 'consultant', 'working', ...",Machine Learning Engineer 2,Collaborate with a multidisciplinary team to g...,collaborate multidisciplinary team gain insigh...,"['collaborate', 'multidisciplinary', 'team', '...",0.061773,0.93
26,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"['hcm', 'technical', 'consultant', 'working', ...",Security Specialist,Work in a fast-paced environment that combine ...,fast paced environment combine technical secur...,"['fast', 'paced', 'environment', 'combine', 't...",0.040321,0.899
27,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"['hcm', 'technical', 'consultant', 'working', ...",Security Specialist 2,\n Handling incoming requests for assistanc...,handling incoming request assistance business ...,"['handling', 'incoming', 'request', 'assistanc...",0.032068,0.891
28,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"['hcm', 'technical', 'consultant', 'working', ...",Web Developer Full Stack,\n\n Graduate Degree in Information Technol...,graduate degree information technology similar...,"['graduate', 'degree', 'information', 'technol...",0.021652,0.881
29,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"['hcm', 'technical', 'consultant', 'working', ...",Web Developer Full Stack 2,\n· Enter existing website codebases and exten...,enter existing codebases extend functionality ...,"['enter', 'existing', 'codebases', 'extend', '...",0.056635,0.92


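As expected, `tfidf_cosine` is clearly highest for the jobs 'HCM Consultant' (0.314) and 'HCM Consultant 2' (0.368). `WMD_Job_Desc` is also highest for 'HCM Consultant 2' (0.932), but its value for 'HCM Consultant' (0.901) does not stand out from the unrelated jobs, since all WMD similarities fall in a narrow band (0.881 to 0.932).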
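
The observation above can be checked mechanically by ranking the jobs for one candidate by score. A minimal sketch in plain Python, using the `tfidf_cosine` and `WMD_Job_Desc` values copied from the table above. The helper name `top_jobs` is hypothetical.

```python
# (Job_Title, tfidf_cosine, WMD_Job_Desc) for HCM_Federico_Calonge, copied from the table above.
rows = [
    ("Data Scientist", 0.069849, 0.904),
    ("Data Scientist 2", 0.083858, 0.906),
    ("HCM Consultant", 0.314486, 0.901),
    ("HCM Consultant 2", 0.367655, 0.932),
    ("Machine Learning Engineer", 0.085925, 0.904),
    ("Machine Learning Engineer 2", 0.061773, 0.930),
    ("Security Specialist", 0.040321, 0.899),
    ("Security Specialist 2", 0.032068, 0.891),
    ("Web Developer Full Stack", 0.021652, 0.881),
    ("Web Developer Full Stack 2", 0.056635, 0.920),
]

def top_jobs(rows, score_index, k=2):
    """Return the k job titles with the highest score in column score_index."""
    ranked = sorted(rows, key=lambda r: r[score_index], reverse=True)
    return [r[0] for r in ranked[:k]]

print(top_jobs(rows, score_index=1))  # ['HCM Consultant 2', 'HCM Consultant']
print(top_jobs(rows, score_index=2))  # ['HCM Consultant 2', 'Machine Learning Engineer 2']
```

On the DataFrame itself, the equivalent would be `new_DF.sort_values('tfidf_cosine', ascending=False)`.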

In [245]:
is_Bradly_Johnston =  df_Jobs_and_Candidates['Candidate_Name']=='MLEngineer_Bradly_Johnston'
new_DF = df_Jobs_and_Candidates[is_Bradly_Johnston]
print(new_DF.shape)
new_DF

(10, 10)


Unnamed: 0,Candidate_Name,Content_CV,clean_Content_CV,tokens_Content_CV,Job_Title,Job_Description,clean_Job_Description,tokens_Job_Description,tfidf_cosine,WMD_Job_Desc
40,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"['kasey', 'vista', 'detroit', 'senior', 'softw...",Data Scientist,"Master’s degree or above in a STEM field, incl...",master degree stem field including limited com...,"['master', 'degree', 'stem', 'field', 'includi...",0.196025,0.924
41,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"['kasey', 'vista', 'detroit', 'senior', 'softw...",Data Scientist 2,"\nReporting to the Director, Data & Analytics,...",reporting director data analytics senior data ...,"['reporting', 'director', 'data', 'analytics',...",0.346808,0.91
42,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"['kasey', 'vista', 'detroit', 'senior', 'softw...",HCM Consultant,\nThe Oracle Cloud HCM Absence Consultant will...,oracle cloud hcm absence consultant responsibl...,"['oracle', 'cloud', 'hcm', 'absence', 'consult...",0.05927,0.898
43,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"['kasey', 'vista', 'detroit', 'senior', 'softw...",HCM Consultant 2,4+ years of experience in PeopleSoft or Oracle...,peoplesoft oracle eb implementation support hc...,"['peoplesoft', 'oracle', 'eb', 'implementation...",0.058595,0.923
44,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"['kasey', 'vista', 'detroit', 'senior', 'softw...",Machine Learning Engineer,Leveraging the latest machine and deep learnin...,leveraging latest machine deep learning techni...,"['leveraging', 'latest', 'machine', 'deep_lear...",0.286638,0.924
45,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"['kasey', 'vista', 'detroit', 'senior', 'softw...",Machine Learning Engineer 2,Collaborate with a multidisciplinary team to g...,collaborate multidisciplinary team gain insigh...,"['collaborate', 'multidisciplinary', 'team', '...",0.333604,0.941
46,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"['kasey', 'vista', 'detroit', 'senior', 'softw...",Security Specialist,Work in a fast-paced environment that combine ...,fast paced environment combine technical secur...,"['fast', 'paced', 'environment', 'combine', 't...",0.064902,0.918
47,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"['kasey', 'vista', 'detroit', 'senior', 'softw...",Security Specialist 2,\n Handling incoming requests for assistanc...,handling incoming request assistance business ...,"['handling', 'incoming', 'request', 'assistanc...",0.044034,0.898
48,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"['kasey', 'vista', 'detroit', 'senior', 'softw...",Web Developer Full Stack,\n\n Graduate Degree in Information Technol...,graduate degree information technology similar...,"['graduate', 'degree', 'information', 'technol...",0.048537,0.905
49,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"['kasey', 'vista', 'detroit', 'senior', 'softw...",Web Developer Full Stack 2,\n· Enter existing website codebases and exten...,enter existing codebases extend functionality ...,"['enter', 'existing', 'codebases', 'extend', '...",0.066777,0.927


### 3-Exporting the CSV with the results for use in the Notebook "2-KMEANS_&_KNN".

In [246]:
#Export the CSV:
df_Jobs_and_Candidates.to_csv('DF_Exportado_Jobs_And_Candidates_own_model.csv',index = False)
#To import it again:
DF_J_and_C = pd.read_csv('DF_Exportado_Jobs_And_Candidates_own_model.csv')
DF_J_and_C.tail()

Unnamed: 0,Candidate_Name,Content_CV,clean_Content_CV,tokens_Content_CV,Job_Title,Job_Description,clean_Job_Description,tokens_Job_Description,tfidf_cosine,WMD_Job_Desc
6245,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"['marketing', 'consultant', 'value', 'creator'...",Machine Learning Engineer 2,Collaborate with a multidisciplinary team to g...,collaborate multidisciplinary team gain insigh...,"['collaborate', 'multidisciplinary', 'team', '...",0.04161,0.927
6246,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"['marketing', 'consultant', 'value', 'creator'...",Security Specialist,Work in a fast-paced environment that combine ...,fast paced environment combine technical secur...,"['fast', 'paced', 'environment', 'combine', 't...",0.071639,0.937
6247,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"['marketing', 'consultant', 'value', 'creator'...",Security Specialist 2,\n Handling incoming requests for assistanc...,handling incoming request assistance business ...,"['handling', 'incoming', 'request', 'assistanc...",0.041386,0.91
6248,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"['marketing', 'consultant', 'value', 'creator'...",Web Developer Full Stack,\n\n Graduate Degree in Information Technol...,graduate degree information technology similar...,"['graduate', 'degree', 'information', 'technol...",0.029146,0.91
6249,386-CONSULTANT,MARKETING CONSULTANT Summar...,marketing consultant value creator marketing e...,"['marketing', 'consultant', 'value', 'creator'...",Web Developer Full Stack 2,\n· Enter existing website codebases and exten...,enter existing codebases extend functionality ...,"['enter', 'existing', 'codebases', 'extend', '...",0.038411,0.936
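
One caveat when re-importing: a CSV has no list type, so `to_csv` writes each token list as its string representation and `read_csv` returns it as a plain string (for example `"['marketing', 'consultant']"`). If the token columns are needed as lists again, for instance to pass them to `wmdistance`, they can be parsed back. A minimal sketch using the standard library's `ast.literal_eval`. The helper name `parse_tokens` is hypothetical.

```python
import ast

def parse_tokens(value):
    """Return a list of tokens from either a real list or its string representation."""
    if isinstance(value, list):
        return value
    parsed = ast.literal_eval(value)  # evaluates Python literals only, not arbitrary code
    if not isinstance(parsed, list):
        raise ValueError(f"expected a list literal, got {type(parsed).__name__}")
    return parsed

# What ends up in a CSV cell when a list column is exported.
as_written = str(["marketing", "consultant", "value", "creator"])

print(type(as_written).__name__)  # str
print(parse_tokens(as_written))   # ['marketing', 'consultant', 'value', 'creator']
```

Applied to the re-imported frame, this would be `DF_J_and_C['tokens_Content_CV'].apply(parse_tokens)`, and likewise for `tokens_Job_Description`.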
