### Jupyter Notebook "1-Preprocessing_&_Data_Cleaning":
* 1-Building the DataFrames and cleaning the data.
    * 1.1-Importing the required libraries.
    * 1.2-Functions required for data cleaning. ( * )
    * 1.3-Building the Jobs DataFrame from the datasets. ( ** )  
        * 1.3.1- (CSV file) '1-10_examples_job_Desc.csv'    
        * 1.3.2- (CSV file) '2-22000_examples_dice_com-job_us.csv'  
    * 1.4-Cleaning the Jobs DataFrame. 
    * 1.5-Building the CVs DataFrame from the datasets.  ( ** )   
        * 1.5.1- (Folder of PDF files) '1-10_examples_CVs_PDF'  
        * 1.5.2- (Folder of PDF files) '2-228_examples_CVs_PDF'  
        * 1.5.3- (CSV file) '3-2484_examples_CVs.csv'  
        * 1.5.4- (CSV file) '4-962_examples_CVs.csv'  
    * 1.6-Cleaning the CVs DataFrame.  
    
* 2-Exporting the DataFrames for use in the next Jupyter Notebook.

( * ) **The cleaning procedure for the CVs and the job postings is as follows:**

1. Remove duplicate rows.
2. Initial cleaning:
    * 2.1. Encode the text as UTF-8.
    * 2.2. Convert everything to lowercase.
    * 2.3. Remove data that is not relevant to the analysis (e-mail addresses, web pages, and common words).
    * 2.4. Remove punctuation and special characters (including numbers).
    * 2.5. Remove stop words.
3. Tokenize.
4. Lemmatize.
5. Detect and apply bigrams.
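
The steps above can be sketched on a handful of strings using only the standard library. This is a minimal illustration, not the notebook's implementation: `TINY_STOP_WORDS`, `clean_text` and the sample texts are hypothetical, whitespace splitting stands in for NLTK's `word_tokenize`, and lemmatization and bigram detection are left out because they need external models.

```python
import re

# Hypothetical, deliberately tiny stop-word list (the notebook uses NLTK's English list).
TINY_STOP_WORDS = {"a", "an", "and", "at", "for", "in", "me", "of", "the", "to", "with"}

def clean_text(text):
    """Apply steps 2.2-2.5 and a naive whitespace tokenization to one document."""
    text = text.lower()                              # 2.2 lowercase
    text = re.sub(r"(www|https?:)\S+", " ", text)    # 2.3 web pages
    text = re.sub(r"\S*@\S*\s?", " ", text)          # 2.3 e-mail addresses
    text = re.sub(r"\W|\d+", " ", text)              # 2.4 punctuation and numbers
    # 2.5 stop words, then step 3 (tokens obtained by splitting on whitespace)
    return [tok for tok in text.split() if tok not in TINY_STOP_WORDS]

raw_docs = [
    "Contact me at jane.doe@example.com, 5 years of Python!",
    "Contact me at jane.doe@example.com, 5 years of Python!",  # exact duplicate
    "See https://example.com for details.",
]

unique_docs = list(dict.fromkeys(raw_docs))  # step 1: drop duplicates, keep the first
cleaned_docs = [clean_text(doc) for doc in unique_docs]
print(cleaned_docs)
```

Removing duplicates first keeps the later, more expensive steps from running twice on identical text.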

( ** ) **Dataset sources:** 

* 1.3.1- (CSV file) '1-10_examples_job_Desc.csv': Collected by us from Indeed (https://www.indeed.com/q-USA-jobs.html) for IT job postings.  
* 1.3.2- (CSV file) '2-22000_examples_dice_com-job_us.csv': CSV obtained from Kaggle (https://www.kaggle.com/PromptCloudHQ/us-technology-jobs-on-dicecom). It contains job descriptions taken from '**Dice.com**', a US job board for the IT sector.
* 1.5.1- (Folder of PDF files) '1-10_examples_CVs_PDF': Collected by us from various websites with sample CVs of candidates for different positions.    
* 1.5.2- (Folder of PDF files) '2-228_examples_CVs_PDF': .docx documents converted to .pdf, obtained from Kaggle (https://www.kaggle.com/palaksood97/resume-dataset). These PDFs belong to candidates from India with experience in the IT sector.  
* 1.5.3- (CSV file) '3-2484_examples_CVs.csv': CSV obtained from Kaggle (https://www.kaggle.com/snehaanbhawal/resume-dataset). It contains CVs taken from the job-application site '**livecareer.com**'.
* 1.5.4- (CSV file) '4-962_examples_CVs.csv': CSV obtained from Kaggle (https://www.kaggle.com/gauravduttakiit/resume-dataset). It contains CVs spread across different IT categories.

**Final dataset sizes after preprocessing and data cleaning:**

| Dataset                              | Initial count    | Final count    | Final total (Jobs)     | Final total (CVs)  |
|--------------------------------------|------------------|----------------|------------------------|--------------------|
| 1-10_examples_job_Desc.csv           | 10               | 10             | -                      | -                  |
| 2-22000_examples_dice_com-job_us.csv | 22000            | 20583          | -                      | -                  |
| -                                    | -                | -              | 20593                  | -                  |
| 1-10_examples_CVs_PDF                | 10               | 10             | -                      | -                  |
| 2-228_examples_CVs_PDF               | 228              | 228            | -                      | -                  |
| 3-2484_examples_CVs.csv              | 2484             | 289            | -                      | -                  |
| 4-962_examples_CVs.csv               | 962              | 97             | -                      | -                  |
| -                                    | -                | -              | -                      | 624                |
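
As a quick consistency check on the table above, the per-dataset final counts can be summed and compared with the reported totals. The snippet is only an illustrative check: the dictionary names are made up here, and the numbers are copied from the table.

```python
# Final counts copied from the table above (dictionary names are illustrative).
jobs_final = {
    "1-10_examples_job_Desc.csv": 10,
    "2-22000_examples_dice_com-job_us.csv": 20583,
}
cvs_final = {
    "1-10_examples_CVs_PDF": 10,
    "2-228_examples_CVs_PDF": 228,
    "3-2484_examples_CVs.csv": 289,
    "4-962_examples_CVs.csv": 97,
}

total_jobs = sum(jobs_final.values())
total_cvs = sum(cvs_final.values())
print(total_jobs, total_cvs)  # 20593 624
```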

## 1-Building the DataFrames and cleaning the data.

### 1.1-Importing the required libraries.

In [1]:
from gensim.models import KeyedVectors
import matplotlib.pyplot as  plt
from collections import Counter
import pandas as pd
import numpy as np
import random
import string
import math

#Data cleaning:
import regex as re                        #Used in remove_punctuation_and_special_characters.
import nltk
from nltk.stem import WordNetLemmatizer   #Used for lemmatization.
nltk.download("stopwords")
nltk.download('wordnet')
from nltk.tokenize import word_tokenize   #Used for tokenization.
from nltk.corpus import stopwords

#PDF to text with pdfplumber:
import pdfplumber
import os
import collections
from os import listdir
from os.path import isfile, join

#Bigram detection:
import gensim
from gensim.models.phrases import Phraser, Phrases

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/fedricio/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /home/fedricio/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### 1.2-Functions required for data cleaning.

In [2]:
#See imports: 
    #re                 -->   Used in remove_punctuation_and_special_characters.
    #WordNetLemmatizer  -->   Used for lemmatization.
    #word_tokenize      -->   Used for tokenization. 
    #stopwords          -->   Used to load our stop words from nltk.
    
stop_words = stopwords.words('english')
            
def lower_text(DF,clean_column):
    #Convert everything to lowercase.
    DF[clean_column] = DF[clean_column].apply(lambda x: x.lower() if isinstance(x,str) else x)

def delete_emails_and_web_pages(DF,clean_column):
    #Match web pages that start with www, http or https:
    DF[clean_column] = DF[clean_column].apply(lambda x: re.sub(r'(www|http:|https:)+[^\s]+[\w]',' ', x) if isinstance(x,str) else x) 
    #Match e-mail addresses:
    DF[clean_column] = DF[clean_column].apply(lambda x: re.sub(r'\S*@\S*\s?',' ', x) if isinstance(x,str) else x)
        #How the pattern works:
            #\S* : a run of non-whitespace characters. 
            #@   : the literal @.
            #\S* : another run of non-whitespace characters.
            #\s? : an optional trailing whitespace character. The '?' is needed so that an address
            #at the end of a line (with nothing after it) still matches.
            #See demo: https://regex101.com/r/J0ohIf/1

def delete_common_words(DF,clean_column):
    #We previously created a txt file listing common words that are NOT needed 
    #for our analysis and matching (months and their abbreviations, CV section names such as education, work experience, etc.);
    #here we remove those words from the column.

    with open('common_words.txt') as f2:
        content = f2.read()
        
    tokens_common_words = set(word_tokenize(content))  #tokenize the common words (a set gives fast lookups)
    DF[clean_column] = DF[clean_column].apply(lambda x: ' '.join([item for item in x.split() if item not in tokens_common_words]))

def delete_candidate_name(DF,clean_column):
    #Remove every token listed in 'candidate_names.txt' from the column.
    with open('candidate_names.txt') as f2:
        content = f2.read()
        
    tokens_candidate_names = set(word_tokenize(content))
    DF[clean_column] = DF[clean_column].apply(lambda x: ' '.join([item for item in x.split() if item not in tokens_candidate_names]))

def remove_stop_words(DF,clean_column):
    DF[clean_column] = DF[clean_column].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

def remove_punctuation_and_special_characters(DF,clean_column):
    #Replace every non-word character [\W+] and every run of digits (\d+) with a space.
    DF[clean_column] = DF[clean_column].apply(lambda x: re.sub(r'[\W+]|(\d+)',' ', x) if isinstance(x,str) else x)
    
    
def tokenize_and_lemmatization(text_column):
    #Tokenizing means splitting a text string into smaller units ("tokens"). 
    #We use the "word_tokenize" function for this. 
    #Lemmatization vs. stemming: https://qastack.mx/programming/1787110/what-is-the-difference-between-lemmatization-vs-stemming
    tokens = word_tokenize(text_column) #Tokenize.
    wordnet_lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [wordnet_lemmatizer.lemmatize(tok).lower() for tok in tokens] #Lemmatize each token.
    return list(lemmatized_tokens)

def cleaning_DF(DF,column_to_clean,flag_candidate): #For a candidate (flag_candidate=True) we also remove every occurrence of the name from the CV.
    clean_column='clean_'+column_to_clean
    DF[clean_column]=DF[column_to_clean]            #Copy 'column_to_clean' into 'clean_column' so the functions below work on the copy.
    lower_text(DF,clean_column)
    delete_emails_and_web_pages(DF,clean_column)
    remove_punctuation_and_special_characters(DF,clean_column)
    remove_stop_words(DF,clean_column)
    delete_common_words(DF,clean_column)
    if flag_candidate:                              #If it is a candidate...
        delete_candidate_name(DF,clean_column)
    
def tokenize_and_lemmatize(DF,column_to_clean):
    clean_column='clean_'+column_to_clean
    tokens_column='tokens_'+column_to_clean
    DF[tokens_column] = DF[clean_column].apply(tokenize_and_lemmatization)  #tokenize_and_lemmatization returns tokens.
    DF[clean_column] = DF[tokens_column].apply(' '.join)                    #Join the tokens back into text (now lemmatized) and store it in DF[clean_column].
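
To see what the pattern in `remove_punctuation_and_special_characters` does to typical job-posting text, the following sketch applies the same substitution to a made-up sentence. It uses the standard-library `re` module rather than `regex` (for this input the pattern should behave the same way in both), and the sample string is an assumption for illustration.

```python
import re

PATTERN = r"[\W+]|(\d+)"  # same pattern as in remove_punctuation_and_special_characters

sample = "non-expert in c++ with 2+ years, e.g. big_data"  # hypothetical input
cleaned = re.sub(PATTERN, " ", sample)
tokens = cleaned.split()
print(tokens)
# ['non', 'expert', 'in', 'c', 'with', 'years', 'e', 'g', 'big_data']
```

Hyphenated terms are split in two, `c++` collapses to `c`, digits disappear, and underscores survive because `_` counts as a word character.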

### 1.3-Building the Jobs DataFrame from the datasets.

### 1.3.1- (CSV file) '1-10_examples_job_Desc.csv'

In [3]:
df_Jobs = pd.read_csv("../Datasets_CVs_And_Job_Descriptions/EN/Job_Descr/1-10_examples_job_Desc.csv")
df_Jobs = df_Jobs.rename(columns={'job_title': 'Job_Title', 'job_description':'Job_Description'})
df_Jobs = df_Jobs.sort_values('Job_Title', ascending=True)
df_Jobs = df_Jobs.reset_index(drop=True)
df_Jobs

Unnamed: 0,Job_Title,Job_Description
0,Data Scientist,"Master’s degree or above in a STEM field, incl..."
1,Data Scientist 2,"\nReporting to the Director, Data & Analytics,..."
2,HCM Consultant,\nThe Oracle Cloud HCM Absence Consultant will...
3,HCM Consultant 2,4+ years of experience in PeopleSoft or Oracle...
4,Machine Learning Engineer,Leveraging the latest machine and deep learnin...
5,Machine Learning Engineer 2,Collaborate with a multidisciplinary team to g...
6,Security Specialist,Work in a fast-paced environment that combine ...
7,Security Specialist 2,\n Handling incoming requests for assistanc...
8,Web Developer Full Stack,\n\n Graduate Degree in Information Technol...
9,Web Developer Full Stack 2,\n· Enter existing website codebases and exten...


### 1.3.2- (CSV file) '2-22000_examples_dice_com-job_us.csv' 

Source: https://www.kaggle.com/PromptCloudHQ/us-technology-jobs-on-dicecom

In [6]:
path_jobs_2='../Datasets_CVs_And_Job_Descriptions/EN/Job_Descr'
df_Jobs_2 = pd.read_csv(path_jobs_2+'/2-22000_examples_dice_com-job_us.csv', usecols= ['jobdescription','jobtitle'])
df_Jobs_2 = df_Jobs_2.rename(columns={'jobtitle': 'Job_Title', 'jobdescription':'Job_Description'})
df_Jobs_2 = df_Jobs_2.reset_index(drop=True)
df_Jobs_2

Unnamed: 0,Job_Description,Job_Title
0,Looking for Selenium engineers...must have sol...,AUTOMATION TEST ENGINEER
1,The University of Chicago has a rapidly growin...,Information Security Engineer
2,"GalaxE.SolutionsEvery day, our solutions affec...",Business Solutions Architect
3,Java DeveloperFull-time/direct-hireBolingbrook...,"Java Developer (mid level)- FT- GREAT culture,..."
4,Midtown based high tech firm has an immediate ...,DevOps Engineer
...,...,...
21995,Company Description We are searching for a ta...,Web Designer
21996,CONTACT - priya@omegasolutioninc.com / 408-45...,Senior Front End Web Developer - Full Time at ...
21997,Do you take pride in your work knowing that th...,QA Analyst
21998,Company Description What We Can Offer YouAs th...,Tech Lead-Full Stack


#### Removing duplicates.

In [7]:
df_Jobs_2 = df_Jobs_2.drop_duplicates(subset=None, keep='first', inplace=False)
df_Jobs_2 = df_Jobs_2.reset_index(drop=True)
df_Jobs_2

Unnamed: 0,Job_Description,Job_Title
0,Looking for Selenium engineers...must have sol...,AUTOMATION TEST ENGINEER
1,The University of Chicago has a rapidly growin...,Information Security Engineer
2,"GalaxE.SolutionsEvery day, our solutions affec...",Business Solutions Architect
3,Java DeveloperFull-time/direct-hireBolingbrook...,"Java Developer (mid level)- FT- GREAT culture,..."
4,Midtown based high tech firm has an immediate ...,DevOps Engineer
...,...,...
20578,Company Description We are searching for a ta...,Web Designer
20579,CONTACT - priya@omegasolutioninc.com / 408-45...,Senior Front End Web Developer - Full Time at ...
20580,Do you take pride in your work knowing that th...,QA Analyst
20581,Company Description What We Can Offer YouAs th...,Tech Lead-Full Stack


### Now we join the two DataFrames to obtain our 'df_Jobs'

In [8]:
df_Jobs = pd.concat([df_Jobs, df_Jobs_2], ignore_index=True)
df_Jobs


Unnamed: 0,Job_Title,Job_Description
0,Data Scientist,"Master’s degree or above in a STEM field, incl..."
1,Data Scientist 2,"\nReporting to the Director, Data & Analytics,..."
2,HCM Consultant,\nThe Oracle Cloud HCM Absence Consultant will...
3,HCM Consultant 2,4+ years of experience in PeopleSoft or Oracle...
4,Machine Learning Engineer,Leveraging the latest machine and deep learnin...
...,...,...
20588,Web Designer,Company Description We are searching for a ta...
20589,Senior Front End Web Developer - Full Time at ...,CONTACT - priya@omegasolutioninc.com / 408-45...
20590,QA Analyst,Do you take pride in your work knowing that th...
20591,Tech Lead-Full Stack,Company Description What We Can Offer YouAs th...


### 1.4-Cleaning the Jobs DataFrame.

In [7]:
cleaning_DF(df_Jobs,'Job_Description',False)
tokenize_and_lemmatize(df_Jobs,'Job_Description')

In [9]:
#Take the job description at position 5 ('Machine Learning Engineer 2') as an example:

print(df_Jobs['Job_Description'].iloc[5])  
#'non-expert' should be kept as is.
#'Computer Science' should be joined into a single term.
print("#######################################################")
print(df_Jobs['clean_Job_Description'].iloc[5]) 
print("#######################################################")
print(df_Jobs['tokens_Job_Description'].iloc[5]) 
print("#######################################################")

Collaborate with a multidisciplinary team to gain insight into complex biochemical systems.
Design computational models to study various interactions such as interactions between genomes, proteins, and binding sites
Extract various features from the computational models and communicate the results back to the team.
Predict the behaviour of new protein structures on certain binding sites using the computational models.
Work with the team of software engineers to embed your models into production.
Perform other related duties in keeping with the purpose and accountabilities of the job.

M.Sc. or Ph.D. in Engineering/Computer Science, or an equivalent combination of experience and knowledge.
2+ years' experience applying machine learning and deep learning concepts to real-world problems.
Solid programming skills with a focus on writing clean/maintainable code, with 2+ years of experience in Python (preferred), Java, or C++ programming.
Good knowledge of machine learning libraries (Tensorf

KeyError: 'clean_Job_Description'

#### Drop the 'Job_Description' column, which we no longer need.

In [9]:
df_Jobs.drop('Job_Description', axis=1, inplace=True)

### Obtain the bigrams and store them in the 'tokens_Job_Description' column

In [10]:
corpus_token_jobs = df_Jobs['tokens_Job_Description'].tolist()
corpus_token_jobs #list of token lists.
corpus_token_jobs[5]

['collaborate',
 'multidisciplinary',
 'team',
 'gain',
 'insight',
 'complex',
 'biochemical',
 'system',
 'design',
 'computational',
 'model',
 'study',
 'various',
 'interaction',
 'interaction',
 'genome',
 'protein',
 'binding',
 'site',
 'extract',
 'various',
 'feature',
 'computational',
 'model',
 'communicate',
 'result',
 'back',
 'team',
 'predict',
 'behaviour',
 'new',
 'protein',
 'structure',
 'certain',
 'binding',
 'site',
 'using',
 'computational',
 'model',
 'team',
 'software',
 'engineer',
 'embed',
 'model',
 'production',
 'perform',
 'related',
 'duty',
 'keeping',
 'purpose',
 'accountability',
 'sc',
 'ph',
 'engineering',
 'computer',
 'science',
 'equivalent',
 'combination',
 'knowledge',
 'applying',
 'machine',
 'learning',
 'deep',
 'learning',
 'concept',
 'real',
 'world',
 'problem',
 'solid',
 'programming',
 'focus',
 'writing',
 'clean',
 'maintainable',
 'code',
 'python',
 'preferred',
 'java',
 'c',
 'programming',
 'good',
 'knowledge',
 'ma

In [11]:
len(corpus_token_jobs)

20593

In [12]:
#See imports:
    #gensim.
    #Phraser and Phrases.
    #string and re.

#Here we join words that form bigrams, such as machine_learning, big_data or deep_learning (they were separate tokens 
#before but actually belong together; the Phraser object detects this and returns them as a single token).
#The bigrams are built with Phrases.
    
#We create the bigrams as follows:
# Learn the relevant phrases from our list of tokenized documents: 
phrases_jobs = Phrases(corpus_token_jobs)
# Wrap the trained model in a Phraser object, which is used to transform the documents:
bigram_jobs = Phraser(phrases_jobs)
# Apply the Phraser to every document and collect the results in a list:
all_sentences_jobs = list(bigram_jobs[corpus_token_jobs])

#Inspect our whole tokenized corpus:
all_sentences_jobs  #List of lists with our tokenized corpus, updated with bigrams. 20593 lists in total.
all_sentences_jobs[5]

['collaborate',
 'multidisciplinary_team',
 'gain_insight',
 'complex',
 'biochemical',
 'system',
 'design',
 'computational',
 'model',
 'study',
 'various',
 'interaction',
 'interaction',
 'genome',
 'protein',
 'binding',
 'site',
 'extract',
 'various',
 'feature',
 'computational',
 'model',
 'communicate',
 'result',
 'back',
 'team',
 'predict',
 'behaviour',
 'new',
 'protein',
 'structure',
 'certain',
 'binding',
 'site',
 'using',
 'computational',
 'model',
 'team',
 'software',
 'engineer',
 'embed',
 'model',
 'production',
 'perform',
 'related',
 'duty',
 'keeping',
 'purpose',
 'accountability',
 'sc',
 'ph',
 'engineering',
 'computer_science',
 'equivalent_combination',
 'knowledge',
 'applying',
 'machine_learning',
 'deep_learning',
 'concept',
 'real_world',
 'problem',
 'solid',
 'programming',
 'focus',
 'writing_clean',
 'maintainable_code',
 'python',
 'preferred',
 'java',
 'c',
 'programming',
 'good',
 'knowledge',
 'machine_learning',
 'library',
 'tenso

In [13]:
len(all_sentences_jobs)

20593
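
For intuition about how `Phrases` decides which adjacent tokens to merge, here is a simplified pure-Python sketch of the default scoring rule given in the gensim documentation, `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))`, where a pair is merged when its score exceeds a threshold. This is not gensim's implementation: `find_bigrams`, `merge_bigrams`, the toy corpus, and the small `min_count` and `threshold` values are assumptions chosen so that the example works on a few sentences (gensim's defaults are `min_count=5` and `threshold=10.0`).

```python
from collections import Counter

def find_bigrams(corpus, min_count, threshold):
    """Return the set of adjacent token pairs whose score exceeds the threshold."""
    unigrams = Counter(tok for doc in corpus for tok in doc)
    pairs = Counter(pair for doc in corpus for pair in zip(doc, doc[1:]))
    vocab_size = len(unigrams)  # simplification: distinct single tokens only
    selected = set()
    for (a, b), n_ab in pairs.items():
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            selected.add((a, b))
    return selected

def merge_bigrams(doc, selected):
    """Greedy left-to-right merge of the selected pairs into 'a_b' tokens."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in selected:
            merged.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged

# Hypothetical toy corpus: one list of tokens per document.
toy_corpus = [
    ["machine", "learning", "engineer"],
    ["deep", "learning", "and", "machine", "learning"],
    ["machine", "learning", "model"],
    ["learning", "python"],
]

selected = find_bigrams(toy_corpus, min_count=1, threshold=0.5)
merged_corpus = [merge_bigrams(doc, selected) for doc in toy_corpus]
print(merged_corpus)
```

In this toy corpus only `machine learning` occurs often enough to score above the threshold, so `deep learning` stays split. On the real corpus of 20593 job descriptions, pairs such as `deep_learning` and `computer_science` clear gensim's default threshold, as the output above shows.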

In [14]:
#Update the 'tokens_Job_Description' column of our DF with the bigrams:
df_Jobs['tokens_Job_Description'] = all_sentences_jobs

In [15]:
#Convert each list of tokens with bigrams ('tokens_Job_Description' column) to a string and store it in the 
#'clean_Job_Description' column:
df_Jobs['clean_Job_Description'] = df_Jobs['tokens_Job_Description'].apply(lambda x: ' '.join(map(str, x)))

In [16]:
df_Jobs['clean_Job_Description'].iloc[0]

'master_degree stem field including_limited computer_science statistic_applied mathematics operation research engineering economics social science physic chemistry providing advanced analytics within business setting data science implementation within business setting working raw missing data understanding programming fundamental understanding statistic excellent_communication preferred data_visualization tool time series working knowledge relational_database standard sql_query method proficiency_least one general programming_language python java c_c working big_data within hadoop environment working r_sa statistical package advanced statistical econometric data_mining tool method linear model linear_regression generalized linear_regression logistic_regression nonlinear modeling_technique knowledge advanced nonlinear technique including smoothing ensemble_method leverage knowledge programming mathematics computer_science transform_way company business translate business opportunity dat

In [17]:
df_Jobs

Unnamed: 0,Job_Title,clean_Job_Description,tokens_Job_Description
0,Data Scientist,master_degree stem field including_limited com...,"[master_degree, stem, field, including_limited..."
1,Data Scientist 2,reporting director data analytics senior data_...,"[reporting, director, data, analytics, senior,..."
2,HCM Consultant,oracle cloud hcm absence consultant responsibl...,"[oracle, cloud, hcm, absence, consultant, resp..."
3,HCM Consultant 2,peoplesoft oracle_eb implementation support hc...,"[peoplesoft, oracle_eb, implementation, suppor..."
4,Machine Learning Engineer,leveraging_latest machine deep_learning techni...,"[leveraging_latest, machine, deep_learning, te..."
...,...,...,...
20588,Web Designer,company description searching talented creativ...,"[company, description, searching, talented, cr..."
20589,Senior Front End Web Developer - Full Time at ...,location_san francisco caterm full_time perman...,"[location_san, francisco, caterm, full_time, p..."
20590,QA Analyst,take_pride knowing thousand life positively im...,"[take_pride, knowing, thousand, life, positive..."
20591,Tech Lead-Full Stack,company description offer youas world_leading ...,"[company, description, offer, youas, world_lea..."


### 1.5-Building the CVs DataFrame from the datasets. 

### 1.5.1- (Folder of PDF files) '1-10_examples_CVs_PDF'

#### PDF to text using pdfplumber.

In [5]:
#See imports:
    #pdfplumber.
    #os, listdir, isfile, join.
    #collections.

#Read the CVs stored in our folder and convert them to text one by one
#with the pdfplumber library:
pathCVs='../Datasets_CVs_And_Job_Descriptions/EN/CVs/1-10_examples_CVs_PDF'
onlyfiles = [os.path.join(pathCVs, f) for f in os.listdir(pathCVs) if os.path.isfile(os.path.join(pathCVs, f))]
print("Number of CVs extracted:", len(onlyfiles))

#Function that extracts the text of a CV:

def pdfextract(PDF_file):
    all_text = ""
    with pdfplumber.open(PDF_file) as pdf:
        for pdf_page in pdf.pages:
            #extract_text() may return None for a page without text in some pdfplumber versions.
            single_page_text = pdf_page.extract_text() or ""
            all_text = all_text + '\n' + single_page_text
    return all_text.encode('utf-8')

def extract_text(file):
    text = pdfextract(file).decode('utf-8')
    return text

Number of CVs extracted: 10


In [19]:
#Collect the full text of every CV WITHOUT preprocessing:
rows = []
for file in onlyfiles:
    base = os.path.basename(file)         #e.g. DataScientist_Karla_Lewis.pdf
    filename = os.path.splitext(base)[0]  #e.g. DataScientist_Karla_Lewis
    rows.append({'Candidate_Name':filename, 'Content_CV':extract_text(file)})
df_Candidates = pd.DataFrame(rows, columns=['Candidate_Name','Content_CV'])

df_Candidates = df_Candidates.sort_values('Candidate_Name', ascending=True)
df_Candidates = df_Candidates.reset_index(drop=True)
df_Candidates

Unnamed: 0,Candidate_Name,Content_CV
0,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...
1,DataScientist_Rahul_Malik,\nRAHUL MALIK\nNLP Data Scientist\nCONTACT WOR...
2,HCM_Federico_Calonge,\n ...
3,HCM_Robert_Smith,\nSap Hcm Consultant Phone: (123) 456 78 99\nE...
4,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+..."
5,MLEngineer_Jonathon_Price,"\nJonathon Price \n4587 Terry Groves, Boston\n..."
6,SecuritySpecialist_Ahmed Wayne,"\nAhmed Wayne\nAddress: Abu Dhabi, UAE\nNation..."
7,SecuritySpecialist_Denis Banik,\nDenis Banik\nEmail address: hello@kickresume...
8,WebDev_Alec_Dionisio,"\nChestertown, MD 4107083942\nhi@alecdionis.io..."
9,WebDev_Karen_Higgins,\nKaren Higgins \n We b Developer \n \n...
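
The file-listing and name-derivation steps can be tried without any PDFs at hand. The sketch below shows how `Candidate_Name` is derived from each file name. It departs from the notebook in three ways, all assumptions made for illustration: it uses `pathlib` instead of `os.path`, it creates empty placeholder files in a temporary folder, and the hypothetical helper `list_cv_files` keeps only files with a `.pdf` extension, a filter the notebook does not apply.

```python
import tempfile
from pathlib import Path

def list_cv_files(folder):
    """Return the PDF files in `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

with tempfile.TemporaryDirectory() as tmp:
    # Empty placeholder files standing in for real CVs.
    for name in ["WebDev_Karen_Higgins.pdf", "DataScientist_Karla_Lewis.pdf", "notes.txt"]:
        (Path(tmp) / name).touch()

    # Path.stem drops the extension, like os.path.splitext(base)[0].
    candidate_names = [p.stem for p in list_cv_files(tmp)]

print(candidate_names)
# ['DataScientist_Karla_Lewis', 'WebDev_Karen_Higgins']
```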


### 1.5.2- (Folder of PDF files) '2-228_examples_CVs_PDF'

In [20]:
#Read the CVs stored in our folder and convert them to text one by one
#with the pdfplumber library:
pathCVs_2='../Datasets_CVs_And_Job_Descriptions/EN/CVs/2-228_examples_CVs_PDF'
onlyfiles = [os.path.join(pathCVs_2, f) for f in os.listdir(pathCVs_2) if os.path.isfile(os.path.join(pathCVs_2, f))]
print("Number of CVs extracted:", len(onlyfiles))

Number of CVs extracted: 228


In [21]:
#Collect the full text of every CV WITHOUT preprocessing:
rows = []
for file in onlyfiles:
    base = os.path.basename(file)
    filename = os.path.splitext(base)[0]
    rows.append({'Candidate_Name':filename, 'Content_CV':extract_text(file)})
df_Candidates_2 = pd.DataFrame(rows, columns=['Candidate_Name','Content_CV'])

df_Candidates_2 = df_Candidates_2.sort_values('Candidate_Name', ascending=True)
df_Candidates_2 = df_Candidates_2.reset_index(drop=True)
df_Candidates_2

Unnamed: 0,Candidate_Name,Content_CV
0,Abiral_Pandey_Fullstack_Java,\nName: Abiral Pandey \nEmail: abiral.pandey88...
1,Achyuth Resume_8,\nAchyuth \n540-999-8048 \nachyuth.java88@gmai...
2,Adelina_Erimia_PMP1,"\nAdelina Erimia, PMP, Six Sigma Green Belt, S..."
3,Adhi Gopalam - SM,\n \n \nAdhi Gopalam \nadhigopalam@gmail.com \...
4,AjayKumar,\nAjay Kumar (CSM) Email/Skype: ...
...,...,...
223,sanjay kumar,\nSanjay \nEmail: sanjay.j0828@gmail.com \nCon...
224,srinivas b,\n \nSrinivas \nSrinivasjava04@gmail.com \n315...
225,vema reddy,\n ...
226,venu b,\nVENU \nvenu6773@gmail.com \n(414) 436-567...


#### Removing duplicates (as it turns out, there are none):

In [22]:
df_Candidates_2 = df_Candidates_2.drop_duplicates(subset=None, keep='first', inplace=False)
df_Candidates_2

Unnamed: 0,Candidate_Name,Content_CV
0,Abiral_Pandey_Fullstack_Java,\nName: Abiral Pandey \nEmail: abiral.pandey88...
1,Achyuth Resume_8,\nAchyuth \n540-999-8048 \nachyuth.java88@gmai...
2,Adelina_Erimia_PMP1,"\nAdelina Erimia, PMP, Six Sigma Green Belt, S..."
3,Adhi Gopalam - SM,\n \n \nAdhi Gopalam \nadhigopalam@gmail.com \...
4,AjayKumar,\nAjay Kumar (CSM) Email/Skype: ...
...,...,...
223,sanjay kumar,\nSanjay \nEmail: sanjay.j0828@gmail.com \nCon...
224,srinivas b,\n \nSrinivas \nSrinivasjava04@gmail.com \n315...
225,vema reddy,\n ...
226,venu b,\nVENU \nvenu6773@gmail.com \n(414) 436-567...


### 1.5.3- (CSV file) '3-2484_examples_CVs.csv'

In [23]:
pathCV_CSV='../Datasets_CVs_And_Job_Descriptions/EN/CVs'
df_Candidates_3 = pd.read_csv(pathCV_CSV+'/3-2484_examples_CVs.csv', usecols= ['Resume_str','Category'])
df_Candidates_3

Unnamed: 0,Resume_str,Category
0,HR ADMINISTRATOR/MARKETING ASSOCIATE\...,HR
1,"HR SPECIALIST, US HR OPERATIONS ...",HR
2,HR DIRECTOR Summary Over 2...,HR
3,HR SPECIALIST Summary Dedica...,HR
4,HR MANAGER Skill Highlights ...,HR
...,...,...
2479,RANK: SGT/E-5 NON- COMMISSIONED OFFIC...,AVIATION
2480,"GOVERNMENT RELATIONS, COMMUNICATIONS ...",AVIATION
2481,GEEK SQUAD AGENT Professional...,AVIATION
2482,PROGRAM DIRECTOR / OFFICE MANAGER ...,AVIATION


In [24]:
#List the categories in the dataset so that we can keep only the IT-related ones:
for val in df_Candidates_3['Category'].unique():
    print(val)

HR
DESIGNER
INFORMATION-TECHNOLOGY
TEACHER
ADVOCATE
BUSINESS-DEVELOPMENT
HEALTHCARE
FITNESS
AGRICULTURE
BPO
SALES
CONSULTANT
DIGITAL-MEDIA
AUTOMOBILE
CHEF
FINANCE
APPAREL
ENGINEERING
ACCOUNTANT
CONSTRUCTION
PUBLIC-RELATIONS
BANKING
ARTS
AVIATION


#### We build two DataFrames from 'df_Candidates_3' and join them at the end:  
* The first DataFrame ('df_it_and_busdev') will include all CVs in the 'INFORMATION-TECHNOLOGY' or 'BUSINESS-DEVELOPMENT' categories.   
* The second DataFrame ('df_consultants') will include the CVs in the 'CONSULTANT' category, provided they belong to the IT sector (to check this, we look for the word "it" in the CV text).

In [25]:
#Lists with the required categories:
IT_Jobs_list = ['INFORMATION-TECHNOLOGY','BUSINESS-DEVELOPMENT']
Consultant = ['CONSULTANT']

df_it_and_busdev = df_Candidates_3[df_Candidates_3['Category'].isin(IT_Jobs_list)]
df_it_and_busdev = df_it_and_busdev.reset_index(drop=True)
df_it_and_busdev

#df_it_and_busdev['Resume_str'].iloc[90]   #Uncomment to inspect the Resume_str of one particular candidate.

Unnamed: 0,Resume_str,Category
0,INFORMATION TECHNOLOGY Summar...,INFORMATION-TECHNOLOGY
1,INFORMATION TECHNOLOGY SPECIALIST\tGS...,INFORMATION-TECHNOLOGY
2,INFORMATION TECHNOLOGY SUPERVISOR ...,INFORMATION-TECHNOLOGY
3,INFORMATION TECHNOLOGY INSTRUCTOR ...,INFORMATION-TECHNOLOGY
4,INFORMATION TECHNOLOGY MANAGER/ANALYS...,INFORMATION-TECHNOLOGY
...,...,...
235,BUSINESS DEVELOPMENT INTERN Sum...,BUSINESS-DEVELOPMENT
236,"DIRECTOR, BUSINESS DEVELOPMENT ...",BUSINESS-DEVELOPMENT
237,BUSINESS DEVELOPMENT MANAGER Su...,BUSINESS-DEVELOPMENT
238,BUSINESS DEVELOPMENT REPRESENTATIVE ...,BUSINESS-DEVELOPMENT


In [26]:
df_consultants = df_Candidates_3[df_Candidates_3['Category'].isin(Consultant)]
df_consultants = df_consultants.reset_index(drop=True)
df_consultants

Unnamed: 0,Resume_str,Category
0,CONSULTANT Summary Human R...,CONSULTANT
1,IT CONSULTANT Summary ...,CONSULTANT
2,CONSULTANT Professional Sum...,CONSULTANT
3,CONSULTANT Professional Summary...,CONSULTANT
4,CONSULTANT Summary \nPC T...,CONSULTANT
...,...,...
110,BUSINESS CONSULTANT Professiona...,CONSULTANT
111,CONSULTANT ACCOUNT Summary T...,CONSULTANT
112,PRINCIPAL CONSULTANT Summary ...,CONSULTANT
113,LEASING CONSULTANT Summary T...,CONSULTANT


In [27]:
#Keep the consultants whose résumé contains the standalone word "it" (the text is lowercased first so that
#case does not affect the match). This is an approximate way to drop consultants unrelated to the IT sector.
df_consultants_it = df_consultants[df_consultants['Resume_str'].str.lower().str.contains(r'\bit\b', regex=True)]
df_consultants_it = df_consultants_it.reset_index(drop=True)
df_consultants_it.shape  #Only 50 rows remain.

(50, 2)
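
The word-boundary pattern used above can be checked on a few made-up strings with the standard-library `re` module. The sample texts are hypothetical; the last one shows the limit of the heuristic, since the pronoun "it" matches as well as the sector abbreviation.

```python
import re

IT_PATTERN = re.compile(r"\bit\b")

# Hypothetical résumé snippets mapped to the expected match result.
samples = {
    "IT CONSULTANT Summary": True,           # the sector abbreviation
    "Leasing consultant for items": False,   # 'it' inside a longer word does not match
    "Managed the budget and grew it": True,  # the pronoun matches as well
}

results = {text: bool(IT_PATTERN.search(text.lower())) for text in samples}
print(results == samples)  # True
```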

In [28]:
#Finally, we append "df_consultants_it" to "df_it_and_busdev" and give the result
#its original name, "df_Candidates_3":
df_Candidates_3 = pd.concat([df_it_and_busdev, df_consultants_it], ignore_index=True)
df_Candidates_3

Unnamed: 0,Resume_str,Category
0,INFORMATION TECHNOLOGY Summar...,INFORMATION-TECHNOLOGY
1,INFORMATION TECHNOLOGY SPECIALIST\tGS...,INFORMATION-TECHNOLOGY
2,INFORMATION TECHNOLOGY SUPERVISOR ...,INFORMATION-TECHNOLOGY
3,INFORMATION TECHNOLOGY INSTRUCTOR ...,INFORMATION-TECHNOLOGY
4,INFORMATION TECHNOLOGY MANAGER/ANALYS...,INFORMATION-TECHNOLOGY
...,...,...
285,CONSULTANT TO OWNER Educati...,CONSULTANT
286,Pavithra Shetty Summary ...,CONSULTANT
287,BUSINESS CONSULTANT Professiona...,CONSULTANT
288,PRINCIPAL CONSULTANT Summary ...,CONSULTANT


#### Removing duplicates (again, there are none):

In [29]:
df_Candidates_3 = df_Candidates_3.drop_duplicates(subset=None, keep='first', inplace=False)

#### Remove the candidate at index 219 (BUSINESS-DEVELOPMENT category) because its Resume_str is empty. 

In [30]:
df_Candidates_3.drop([219], inplace = True)

In [31]:
df_Candidates_3

Unnamed: 0,Resume_str,Category
0,INFORMATION TECHNOLOGY Summar...,INFORMATION-TECHNOLOGY
1,INFORMATION TECHNOLOGY SPECIALIST\tGS...,INFORMATION-TECHNOLOGY
2,INFORMATION TECHNOLOGY SUPERVISOR ...,INFORMATION-TECHNOLOGY
3,INFORMATION TECHNOLOGY INSTRUCTOR ...,INFORMATION-TECHNOLOGY
4,INFORMATION TECHNOLOGY MANAGER/ANALYS...,INFORMATION-TECHNOLOGY
...,...,...
285,CONSULTANT TO OWNER Educati...,CONSULTANT
286,Pavithra Shetty Summary ...,CONSULTANT
287,BUSINESS CONSULTANT Professiona...,CONSULTANT
288,PRINCIPAL CONSULTANT Summary ...,CONSULTANT


### 1.5.4- (CSV file) '4-962_examples_CVs.csv'

In [32]:
pathCV_CSV='../Datasets_CVs_And_Job_Descriptions/EN/CVs'
df_Candidates_4 = pd.read_csv(pathCV_CSV+'/4-962_examples_CVs.csv', usecols= ['Category','Resume'])
df_Candidates_4

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."
...,...,...
957,Testing,Computer Skills: â¢ Proficient in MS office (...
958,Testing,â Willingness to accept the challenges. â ...
959,Testing,"PERSONAL SKILLS â¢ Quick learner, â¢ Eagerne..."
960,Testing,COMPUTER SKILLS & SOFTWARE KNOWLEDGE MS-Power ...


In [33]:
#List the distinct categories so that we can extract only the IT-related ones:
for val in df_Candidates_4['Category'].unique():
    print(val)

Data Science
HR
Advocate
Arts
Web Designing
Mechanical Engineer
Sales
Health and fitness
Civil Engineer
Java Developer
Business Analyst
SAP Developer
Automation Testing
Electrical Engineering
Operations Manager
Python Developer
DevOps Engineer
Network Security Engineer
PMO
Database
Hadoop
ETL Developer
DotNet Developer
Blockchain
Testing


In [34]:
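#Note: the last entry, 'Blockchain Testing', is a single string, so it matches neither the 'Blockchain' nor the
#'Testing' category listed above; rows from those two categories are therefore excluded by this filter.
#Write 'Blockchain', 'Testing' as two separate entries to keep them.
IT_Jobs_list = ['Data Science','Web Designing','Java Developer', 'Business Analyst', 'SAP Developer', 'Automation Testing', 'Python Developer', 'DevOps Engineer', 'Network Security Engineer', 'PMO', 'Database', 'Hadoop', 'ETL Developer', 'DotNet Developer', 'Blockchain Testing']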

filtered_df_csvs = df_Candidates_4[df_Candidates_4['Category'].isin(IT_Jobs_list)]
filtered_df_csvs = filtered_df_csvs.reset_index(drop=True)
filtered_df_csvs

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."
...,...,...
543,DotNet Developer,"Technical Skills â¢ Languages: C#, ASP .NET M..."
544,DotNet Developer,Education Details \r\nJanuary 2014 Education ...
545,DotNet Developer,"Technologies ASP.NET, MVC 3.0/4.0/5.0, Unit Te..."
546,DotNet Developer,"Technical Skills CATEGORY SKILLS Language C, C..."
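
Because `isin` silently ignores list entries that match nothing, a quick sanity check on the category list can be useful. The following is a minimal sketch with invented data (`categories` and `wanted` are hypothetical names): it reports which requested names do not occur in the column at all.

```python
import pandas as pd

# Hypothetical category column and filter list, for illustration only.
categories = pd.Series(["Data Science", "Blockchain", "Testing", "HR", "Testing"])
wanted = ["Data Science", "Blockchain Testing"]

# Names in the filter list that never occur in the data.
present = set(categories.unique())
unmatched = [name for name in wanted if name not in present]
print(unmatched)

# isin() keeps only exact matches, so the unmatched name selects no rows.
kept = categories[categories.isin(wanted)]
print(kept.tolist())
```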


#### Remove duplicates; only 97 rows remain.

In [35]:
non_duplicate_csvs = filtered_df_csvs.drop_duplicates(subset=None, keep='first', inplace=False)
non_duplicate_csvs = non_duplicate_csvs.reset_index(drop=True)

In [36]:
df_Candidates_4 = non_duplicate_csvs
df_Candidates_4

Unnamed: 0,Category,Resume
0,Data Science,Skills * Programming Languages: Python (pandas...
1,Data Science,Education Details \r\nMay 2013 to May 2017 B.E...
2,Data Science,"Areas of Interest Deep Learning, Control Syste..."
3,Data Science,Skills â¢ R â¢ Python â¢ SAP HANA â¢ Table...
4,Data Science,"Education Details \r\n MCA YMCAUST, Faridab..."
...,...,...
92,DotNet Developer,"Technical Skills â¢ Languages: C#, ASP .NET M..."
93,DotNet Developer,Education Details \r\nJanuary 2014 Education ...
94,DotNet Developer,"Technologies ASP.NET, MVC 3.0/4.0/5.0, Unit Te..."
95,DotNet Developer,"Technical Skills CATEGORY SKILLS Language C, C..."


### Now we combine the 4 dataframes to obtain our 'df_candidatos_final'.

In [37]:
#Rename the columns of the CSV-based DFs ('df_Candidates_3' and 'df_Candidates_4')
#so that they match those of the PDF-based DFs (df_Candidates, df_Candidates_2):
df_Candidates_4.rename(columns={'Category': 'Candidate_Name', 'Resume': 'Content_CV'}, inplace=True)
df_Candidates_3.rename(columns={'Category': 'Candidate_Name', 'Resume_str': 'Content_CV'}, inplace=True)

#Combine df_Candidates_3 and df_Candidates_4 (the CSVs).
#(pd.concat is used because DataFrame.append was removed in pandas 2.0.)
df_csvs = pd.concat([df_Candidates_3, df_Candidates_4], ignore_index=True)

#Rewrite the 'Candidate_Name' column (formerly 'Category') of the CSV-based DF as "<index>-<category>":
df_csvs['Candidate_Name'] = df_csvs.index.astype(str)  + "-" + df_csvs['Candidate_Name'].astype(str)   

#Combine df_Candidates and df_Candidates_2 (the PDFs):
df_pdfs = pd.concat([df_Candidates, df_Candidates_2], ignore_index=True)

#Finally, combine 'df_pdfs' and 'df_csvs' (this joins the 2 CSV sources and the 2 PDF sources):
df_candidatos_final = pd.concat([df_pdfs, df_csvs], ignore_index=True)
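
The combine-and-rename pattern of this cell can be checked on two small invented frames. This is a sketch under the assumption that pandas 2.x is in use, where `pd.concat` is the way to stack frames (`DataFrame.append` no longer exists there); `df_a` and `df_b` are hypothetical. With `ignore_index=True` the result gets a fresh 0..n-1 index, which is what makes the "<index>-<category>" names unique.

```python
import pandas as pd

# Two hypothetical frames that already share the same column names.
df_a = pd.DataFrame({"Candidate_Name": ["CONSULTANT"], "Content_CV": ["text a"]})
df_b = pd.DataFrame({"Candidate_Name": ["Data Science", "Testing"],
                     "Content_CV": ["text b", "text c"]})

# Stack the frames and renumber the rows from 0.
df_demo = pd.concat([df_a, df_b], ignore_index=True)

# Prefix each category with its row index so that every name is unique.
df_demo["Candidate_Name"] = (
    df_demo.index.astype(str) + "-" + df_demo["Candidate_Name"].astype(str)
)
print(df_demo["Candidate_Name"].tolist())
```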

In [38]:
df_candidatos_final

Unnamed: 0,Candidate_Name,Content_CV
0,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...
1,DataScientist_Rahul_Malik,\nRAHUL MALIK\nNLP Data Scientist\nCONTACT WOR...
2,HCM_Federico_Calonge,\n ...
3,HCM_Robert_Smith,\nSap Hcm Consultant Phone: (123) 456 78 99\nE...
4,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+..."
...,...,...
619,381-DotNet Developer,"Technical Skills â¢ Languages: C#, ASP .NET M..."
620,382-DotNet Developer,Education Details \r\nJanuary 2014 Education ...
621,383-DotNet Developer,"Technologies ASP.NET, MVC 3.0/4.0/5.0, Unit Te..."
622,384-DotNet Developer,"Technical Skills CATEGORY SKILLS Language C, C..."


#### We have a total of 625 rows/CVs.

In [39]:
df_candidatos_final['Content_CV'].iloc[300]

"         INFORMATION TECHNOLOGY SPECIALIST           Professional Profile    To continue work in the Information Technology field while developing my skills in Information Systems and Networking.          Experience      Information Technology Specialist    April 2015   to   Current     Company Name          Set up and maintained the network infrastructure both wired and wireless configuration.  Setup and maintained all user's computers including hardware and software.  Set up and assisted users with their e-mail accounts.  I maintained security on our networks in which only company users could access the network.  Setup and configured users android phones so they could access the company's resources.  I maintained security on all companies' machines.          Computer Technical Specialist    September 2007   to   January 2014     Company Name   －   City        Set up and maintain all software on Faculty and Staff computers in a Windows and McIntosh environment.  Troubleshoot all soft

### 1.6-Cleaning the CVs dataframe.

In [40]:
cleaning_DF(df_candidatos_final,'Content_CV',True)
tokenize_and_lemmatize(df_candidatos_final,'Content_CV')

In [41]:
df_candidatos_final.head(10)

Unnamed: 0,Candidate_Name,Content_CV,clean_Content_CV,tokens_Content_CV
0,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data scientist brooklyn ny data scientist grub...,"[data, scientist, brooklyn, ny, data, scientis..."
1,DataScientist_Rahul_Malik,\nRAHUL MALIK\nNLP Data Scientist\nCONTACT WOR...,nlp data scientist brooklyn ny nlp data scient...,"[nlp, data, scientist, brooklyn, ny, nlp, data..."
2,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"[hcm, technical, consultant, working, oracle, ..."
3,HCM_Robert_Smith,\nSap Hcm Consultant Phone: (123) 456 78 99\nE...,sap hcm consultant com qwikresume marshville r...,"[sap, hcm, consultant, com, qwikresume, marshv..."
4,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software engineer m...,"[kasey, vista, detroit, senior, software, engi..."
5,MLEngineer_Jonathon_Price,"\nJonathon Price \n4587 Terry Groves, Boston\n...",terry grove boston los angeles ca principal ma...,"[terry, grove, boston, los, angeles, ca, princ..."
6,SecuritySpecialist_Ahmed Wayne,"\nAhmed Wayne\nAddress: Abu Dhabi, UAE\nNation...",abu dhabi uae egyptian dynamic sr infrastructu...,"[abu, dhabi, uae, egyptian, dynamic, sr, infra..."
7,SecuritySpecialist_Denis Banik,\nDenis Banik\nEmail address: hello@kickresume...,detail oriented result driven security analyst...,"[detail, oriented, result, driven, security, a..."
8,WebDev_Alec_Dionisio,"\nChestertown, MD 4107083942\nhi@alecdionis.io...",chestertown md alecdionis io web design develo...,"[chestertown, md, alecdionis, io, web, design,..."
9,WebDev_Karen_Higgins,\nKaren Higgins \n We b Developer \n \n...,developer area personal ambitious problem solv...,"[developer, area, personal, ambitious, problem..."


In [42]:
df_candidatos_final.shape

(624, 4)

In [43]:
#Tokens generated for the candidate HCM_Federico_Calonge:
df_candidatos_final.iloc[2]['tokens_Content_CV']

['hcm',
 'technical',
 'consultant',
 'working',
 'oracle',
 'tool',
 'participated',
 'erp',
 'cloud',
 'project',
 'performing',
 'reporting',
 'working',
 'module',
 'ap',
 'ar',
 'gl',
 'participating',
 'hcm',
 'cloud',
 'project',
 'performing',
 'reporting',
 'extraction',
 'integration',
 'working',
 'module',
 'core',
 'recruitment',
 'member',
 'artificial',
 'intelligence',
 'ai',
 'committee',
 'oracle',
 'strong',
 'data',
 'science',
 'machine',
 'learning',
 'last',
 'computer',
 'engineering',
 'degree',
 'studying',
 'th',
 'english',
 'personal',
 'academic',
 'programming',
 'project',
 'oracle',
 'hcm',
 'cloud',
 'core',
 'recruitment',
 'oracle',
 'erp',
 'cloud',
 'account',
 'receivable',
 'ar',
 'account',
 'payable',
 'ap',
 'general',
 'ledger',
 'gl',
 'main',
 'reporting',
 'sql',
 'bi',
 'publisher',
 'intermediate',
 'advanced',
 'point',
 'technical',
 'documentation',
 'extraction',
 'api',
 'rest',
 'hcm',
 'extract',
 'web',
 'service',
 'soap',
 'bas

### We obtain the bigrams and store them in the 'tokens_Content_CV' column

In [44]:
corpus_token_cvs = df_candidatos_final['tokens_Content_CV'].tolist()
#corpus_token_cvs     #list of token lists.
corpus_token_cvs[2]   #there are 624 lists in total (positions 0 to 623).

['hcm',
 'technical',
 'consultant',
 'working',
 'oracle',
 'tool',
 'participated',
 'erp',
 'cloud',
 'project',
 'performing',
 'reporting',
 'working',
 'module',
 'ap',
 'ar',
 'gl',
 'participating',
 'hcm',
 'cloud',
 'project',
 'performing',
 'reporting',
 'extraction',
 'integration',
 'working',
 'module',
 'core',
 'recruitment',
 'member',
 'artificial',
 'intelligence',
 'ai',
 'committee',
 'oracle',
 'strong',
 'data',
 'science',
 'machine',
 'learning',
 'last',
 'computer',
 'engineering',
 'degree',
 'studying',
 'th',
 'english',
 'personal',
 'academic',
 'programming',
 'project',
 'oracle',
 'hcm',
 'cloud',
 'core',
 'recruitment',
 'oracle',
 'erp',
 'cloud',
 'account',
 'receivable',
 'ar',
 'account',
 'payable',
 'ap',
 'general',
 'ledger',
 'gl',
 'main',
 'reporting',
 'sql',
 'bi',
 'publisher',
 'intermediate',
 'advanced',
 'point',
 'technical',
 'documentation',
 'extraction',
 'api',
 'rest',
 'hcm',
 'extract',
 'web',
 'service',
 'soap',
 'bas

In [45]:
len(corpus_token_cvs)

624

In [47]:
#Required imports:
    #gensim.
    #Phraser and Phrases.
    #string and re.

#Here we join words that form bigrams, such as machine_learning, big_data, deep_learning (they were separate tokens before
#but actually belong together; the Phraser object detects this and returns them as a single token).
#Bigrams are built using Phrases.
    
#We create the bigrams as follows:
# Detect relevant phrases in our list of token lists:
phrases = Phrases(corpus_token_cvs)
# Build a Phraser object from them to transform the token lists:
bigram = Phraser(phrases)
# Apply the Phraser to every token list and collect the results in a list:
all_sentences = list(bigram[corpus_token_cvs])

#Our whole tokenized corpus:
all_sentences  #List of lists with our tokenized corpus, updated with bigrams. 624 lists in total.
all_sentences[2]   #positions run from 0 to 623.

['hcm',
 'technical',
 'consultant',
 'working',
 'oracle',
 'tool',
 'participated',
 'erp',
 'cloud',
 'project',
 'performing',
 'reporting',
 'working',
 'module',
 'ap_ar',
 'gl',
 'participating',
 'hcm',
 'cloud',
 'project',
 'performing',
 'reporting',
 'extraction',
 'integration',
 'working',
 'module',
 'core',
 'recruitment',
 'member',
 'artificial',
 'intelligence',
 'ai',
 'committee',
 'oracle',
 'strong',
 'data',
 'science',
 'machine_learning',
 'last',
 'computer_engineering',
 'degree',
 'studying',
 'th',
 'english',
 'personal',
 'academic',
 'programming',
 'project',
 'oracle',
 'hcm',
 'cloud',
 'core',
 'recruitment',
 'oracle',
 'erp',
 'cloud',
 'account_receivable',
 'ar',
 'account_payable',
 'ap',
 'general_ledger',
 'gl',
 'main',
 'reporting',
 'sql',
 'bi_publisher',
 'intermediate',
 'advanced',
 'point',
 'technical',
 'documentation',
 'extraction',
 'api',
 'rest',
 'hcm',
 'extract',
 'web_service',
 'soap',
 'basic',
 'intermediate',
 'integrat
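
To give an intuition for why pairs such as `machine_learning` are merged while most neighbouring tokens are not, here is a simplified, pure-Python sketch of collocation scoring. It is loosely modelled on the default scoring described in the gensim documentation (pair count minus `min_count`, scaled by vocabulary size and divided by the two word counts), but it is not gensim's implementation: details such as how the vocabulary size is computed differ, and `find_bigrams`, `merge_bigrams` and the toy corpus are hypothetical.

```python
from collections import Counter

def find_bigrams(sentences, min_count=1, threshold=1.0):
    """Return the (a, b) pairs whose collocation score exceeds the threshold."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    pairs = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    selected = set()
    for (a, b), n_ab in pairs.items():
        # Pairs seen often relative to their individual words score high.
        score = (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            selected.add((a, b))
    return selected

def merge_bigrams(sentence, selected, sep="_"):
    """Join adjacent tokens that form a selected pair, scanning left to right."""
    merged, i = [], 0
    while i < len(sentence):
        if i + 1 < len(sentence) and (sentence[i], sentence[i + 1]) in selected:
            merged.append(sentence[i] + sep + sentence[i + 1])
            i += 2
        else:
            merged.append(sentence[i])
            i += 1
    return merged

# Toy corpus (invented): "machine learning" co-occurs three times.
corpus = [
    ["machine", "learning", "project"],
    ["machine", "learning", "model"],
    ["machine", "learning", "data"],
    ["data", "project", "model"],
]
selected = find_bigrams(corpus, min_count=1, threshold=1.0)
merged_corpus = [merge_bigrams(sent, selected) for sent in corpus]
print(merged_corpus)
```

In this toy corpus only the pair ("machine", "learning") is selected, because every other pair occurs just once and scores zero. The small `min_count` and `threshold` values are chosen for the tiny corpus and are not the values the notebook relies on.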

In [48]:
len(all_sentences)

624

In [49]:
#Update the 'tokens_Content_CV' column of our DF with the bigrams:
df_candidatos_final['tokens_Content_CV'] = all_sentences

In [50]:
#Convert the token lists with bigrams ('tokens_Content_CV' column) to strings and store them in the
#'clean_Content_CV' column:
df_candidatos_final['clean_Content_CV'] = df_candidatos_final['tokens_Content_CV'].apply(lambda x: ' '.join(map(str, x)))

In [51]:
df_candidatos_final.head(10)

Unnamed: 0,Candidate_Name,Content_CV,clean_Content_CV,tokens_Content_CV
0,DataScientist_Karla_Lewis,\nKARLA LEWIS\nData Scientist\nCONTACT WORK EX...,data_scientist brooklyn ny data_scientist grub...,"[data_scientist, brooklyn, ny, data_scientist,..."
1,DataScientist_Rahul_Malik,\nRAHUL MALIK\nNLP Data Scientist\nCONTACT WOR...,nlp data_scientist brooklyn ny nlp data_scient...,"[nlp, data_scientist, brooklyn, ny, nlp, data_..."
2,HCM_Federico_Calonge,\n ...,hcm technical consultant working oracle tool p...,"[hcm, technical, consultant, working, oracle, ..."
3,HCM_Robert_Smith,\nSap Hcm Consultant Phone: (123) 456 78 99\nE...,sap_hcm consultant com qwikresume marshville r...,"[sap_hcm, consultant, com, qwikresume, marshvi..."
4,MLEngineer_Bradly_Johnston,"\nBradly Johnston\n435 Kasey Vista, Detroit\n+...",kasey vista detroit senior software_engineer m...,"[kasey, vista, detroit, senior, software_engin..."
5,MLEngineer_Jonathon_Price,"\nJonathon Price \n4587 Terry Groves, Boston\n...",terry grove boston los_angeles ca principal ma...,"[terry, grove, boston, los_angeles, ca, princi..."
6,SecuritySpecialist_Ahmed Wayne,"\nAhmed Wayne\nAddress: Abu Dhabi, UAE\nNation...",abu_dhabi uae egyptian dynamic sr infrastructu...,"[abu_dhabi, uae, egyptian, dynamic, sr, infras..."
7,SecuritySpecialist_Denis Banik,\nDenis Banik\nEmail address: hello@kickresume...,detail_oriented result_driven security analyst...,"[detail_oriented, result_driven, security, ana..."
8,WebDev_Alec_Dionisio,"\nChestertown, MD 4107083942\nhi@alecdionis.io...",chestertown md alecdionis io web design develo...,"[chestertown, md, alecdionis, io, web, design,..."
9,WebDev_Karen_Higgins,\nKaren Higgins \n We b Developer \n \n...,developer area personal ambitious problem_solv...,"[developer, area, personal, ambitious, problem..."


#### Drop the 'Content_CV' column, which we no longer need.

In [52]:
df_candidatos_final.drop('Content_CV', axis=1, inplace=True)

In [53]:
df_candidatos_final

Unnamed: 0,Candidate_Name,clean_Content_CV,tokens_Content_CV
0,DataScientist_Karla_Lewis,data_scientist brooklyn ny data_scientist grub...,"[data_scientist, brooklyn, ny, data_scientist,..."
1,DataScientist_Rahul_Malik,nlp data_scientist brooklyn ny nlp data_scient...,"[nlp, data_scientist, brooklyn, ny, nlp, data_..."
2,HCM_Federico_Calonge,hcm technical consultant working oracle tool p...,"[hcm, technical, consultant, working, oracle, ..."
3,HCM_Robert_Smith,sap_hcm consultant com qwikresume marshville r...,"[sap_hcm, consultant, com, qwikresume, marshvi..."
4,MLEngineer_Bradly_Johnston,kasey vista detroit senior software_engineer m...,"[kasey, vista, detroit, senior, software_engin..."
...,...,...,...
619,381-DotNet Developer,technical c_asp net mvc html_cs javascript ang...,"[technical, c_asp, net, mvc, html_cs, javascri..."
620,382-DotNet Developer,detail detail pune_maharashtra university pune...,"[detail, detail, pune_maharashtra, university,..."
621,383-DotNet Developer,technology asp_net mvc unit_testing entity fra...,"[technology, asp_net, mvc, unit_testing, entit..."
622,384-DotNet Developer,technical category language c_c oop dot_net te...,"[technical, category, language, c_c, oop, dot_..."


### 2-Exporting the DFs for use in the next Jupyter Notebook.

In [54]:
df_candidatos_final.to_pickle('DF_624_CVs')

In [55]:
df_Jobs.to_pickle('DF_20593_Job_Desc')
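
For reference, a pickled DataFrame can be loaded back with `pd.read_pickle`. The sketch below uses an invented frame and a temporary directory (the file name `DF_demo` is hypothetical) to show that a column of token lists comes back as Python lists, which would not be the case with a plain CSV export, where the lists would be written out as text.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frame with a column of token lists, mirroring 'tokens_Content_CV'.
df_demo = pd.DataFrame({
    "Candidate_Name": ["0-Demo"],
    "tokens_Content_CV": [["machine_learning", "sql"]],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "DF_demo")
    df_demo.to_pickle(path)           # same call as in the cells above
    df_loaded = pd.read_pickle(path)  # how a later notebook could read it back

tokens = df_loaded["tokens_Content_CV"].iloc[0]
print(type(tokens).__name__, tokens)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.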