# Playground

A notebook for experimenting and playing with the data before putting everything together.

This notebook records the successive steps and iterations of this project.

1) [Imports](#Imports)
2) [Collecting Data](#Collecting-data)
3) [Extracting Data](#Data-extraction-and-generation)
4) [BOW](#Bag-Of-Words)
    - [From Scratch](#From-scratch)
    - [With SKLearn](#Using-Scikit-Learn)
    - [With Stemming](#Now-with-stemming)
5) [Tf-Idf](#TF-IDF)
    - [Basic Use](#Basic-use)
    - [With Stemming](#Adding-stemming)
    - [Some Improvements ?](#Some-improvements-?)
6) [Word2Vec](#Word2Vec)
    - [Basic Use](#Basic)
    - [With Stemming](#With-Stemming)
    - [With Tf-Idf](#Adding-Tf-Idf)
7) [Spacy](#Spacy-and-pre-trained-models)
8) [PDF Extraction](#Extracting-skills-from-a-resume)
    - [Best Matches Per Sentence](#Extract-best-skill-per-sentence)
    - [Best Matches Overall](#Extract-best-skills-overall)

<br>

## Imports

In [1]:
import json
from re import split, sub
from pathlib import Path
from sys import maxunicode
from tqdm import tqdm
from itertools import chain

from unicodedata import category
from unidecode import unidecode

import pandas as pd
import numpy as np

from requests import get
from bs4 import BeautifulSoup

from nltk.stem.snowball import FrenchStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from gensim.models import Word2Vec
import spacy

import fitz



<br>

## Collecting data

The list of all job titles and their corresponding data sheets is downloaded from [here](https://www.pole-emploi.fr/files/live/sites/PE/files/ROME_ArboPrincipale.xlsx).

In [2]:
job_data = pd.read_excel(
    "https://www.pole-emploi.fr/files/live/sites/PE/files/ROME_ArboPrincipale.xlsx",
    sheet_name="Arbo Principale 16-12-2019",
    header=0,
    names=["famille", "domaine", "fiche", "titre", "code_ogr"]
)
job_data.head(7)

| | famille | domaine | fiche | titre | code_ogr |
|---|---|---|---|---|---|
| 0 | A | | | Agriculture et Pêche, Espaces naturels et Espa... | |
| 1 | A | 11.0 | | Engins agricoles et forestiers | |
| 2 | A | 11.0 | 1.0 | Conduite d'engins agricoles et forestiers | |
| 3 | A | 11.0 | 1.0 | Chauffeur / Chauffeuse de machines agricoles | 11987.0 |
| 4 | A | 11.0 | 1.0 | Conducteur / Conductrice d'abatteuses | 12862.0 |
| 5 | A | 11.0 | 1.0 | Conducteur / Conductrice d'automoteur de récolte | 38874.0 |
| 6 | A | 11.0 | 1.0 | Conducteur / Conductrice d'engins d'exploitati... | 13254.0 |


The letter in the first column (from A -> N) represents the job family.\
For instance : *A -> Agriculture et Pêche, Espaces naturels et Espaces verts, Soins aux animaux*

This letter and the two-digit number in the second column identify the professional field.\
Here : *A11 -> Engins agricoles et forestiers*

The sequence built from the first three columns refers to a specific job title and is called the ROME code.\
On the third row : *A1101 -> Conduite d'engins agricoles et forestiers*

The other rows prefixed with the same code simply list other job titles that fit the same job description.
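
To make the hierarchy concrete, here is a minimal sketch that splits a complete ROME code into the three levels described above. The helper name `split_rome_code` is hypothetical and is not used elsewhere in this notebook; it only assumes the "one letter + four digits" layout shown in the examples.

```python
def split_rome_code(code: str) -> dict:
    """Split a complete (5-character) ROME code into its three levels."""
    # Assumption: a complete code is one letter followed by four digits, e.g. "A1101"
    if len(code) != 5 or not code[0].isalpha() or not code[1:].isdigit():
        raise ValueError(f"not a complete ROME code: {code!r}")
    return {
        "famille": code[0],   # job family: a single letter
        "domaine": code[:3],  # professional field: letter + two digits
        "fiche": code,        # job sheet: the full five-character code
    }


print(split_rome_code("A1101"))
```

Shorter keys such as `A` or `A11` are rejected on purpose, since they designate a family or a field rather than a job sheet.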

<br>

In [3]:
rome = (job_data
 .query("code_ogr == ' '")
 .assign(code_rome=lambda d: (d.famille + d.domaine + d.fiche).str.strip())
 .drop(columns=["famille", "domaine", "fiche", "code_ogr"])
 .set_index("code_rome")
)
rome.head()

| code_rome | titre |
|---|---|
| A | Agriculture et Pêche, Espaces naturels et Espa... |
| A11 | Engins agricoles et forestiers |
| A1101 | Conduite d'engins agricoles et forestiers |
| A12 | Espaces naturels et espaces verts |
| A1201 | Bûcheronnage et élagage |


In [4]:
# Save the ROME codes and their designation in a JSON file for easier use later
rome.titre.to_json("data/fiches_rome.json", orient="index", force_ascii=False)

In [5]:
with open("data/fiches_rome.json") as f:
    rome_codes = json.load(f)

In [6]:
# How many complete ROME codes have we got ? (with 5 characters)
len(list(filter(lambda txt: len(txt) == 5, rome_codes.keys())))

532

<br>

For each job listed with a ROME code, we will fetch useful information from the website [pôle-emploi.fr](https://www.pole-emploi.fr/)
such as job aliases, job roles as well as expertises and skills required.

We can use the page associated with the first ROME code [here](https://candidat.pole-emploi.fr/marche-du-travail/fichemetierrome?codeRome=A1101) to calibrate the extraction, and then apply the same steps to the others.

In [7]:
base_url = "https://candidat.pole-emploi.fr/marche-du-travail/fichemetierrome"
payload = {"codeRome": "A1101"}

# Query the URL and raise an error if the page couldn't be fetched
response = get(base_url, params=payload)
response.raise_for_status()

# Get html and parse it
soup = BeautifulSoup(response.content, "html.parser")

In [8]:
# Find the different titles for the job and the roles associated
titles, roles = soup.select("#js-tabs-unit1-body > .bd", limit=2)
titles = [tag.text for tag in titles.find_all("li")]
roles = [tag.text for tag in roles.find_all("li")]

In [9]:
# On the second tab, split for base aptitudes and extra aptitudes
base, extra = soup.select("#js-tabs-unit2-body > .bd")

In [10]:
# Then we separate skills from expertises using the columns
base_expertises = [li.text for li in base.select("tr > td:nth-child(odd) li")]
base_skills = [li.text for li in base.select("tr > td:nth-child(even) li")]
extra_expertises = [li.text for li in extra.select("tr > td:nth-child(odd) li")]
extra_skills = [li.text for li in extra.select("tr > td:nth-child(even) li")]

In [11]:
job_sheet = {
    "appelations": titles,
    "description": roles,
    "expertises": base_expertises,
    "competences": base_skills,
    "expertises_extra": extra_expertises,
    "competences_extra": extra_skills
}

In [12]:
job_sheet

{'appelations': ['Chauffeur / Chauffeuse de machines agricoles',
  "Conducteur / Conductrice d'abatteuses",
  "Conducteur / Conductrice d'automoteur de récolte",
  "Conducteur / Conductrice d'engins d'exploitation agricole",
  "Conducteur / Conductrice d'engins d'exploitation forestière",
  "Conducteur / Conductrice d'engins de débardage",
  "Conducteur / Conductrice d'engins forestiers",
  "Conducteur / Conductrice d'épareuse",
  'Conducteur / Conductrice de machines à vendanger',
  'Conducteur / Conductrice de matériels de semis',
  'Conducteur / Conductrice de pulvérisateur',
  "Conducteur / Conductrice de tête d'abattage",
  'Conducteur / Conductrice de tracteur',
  'Conducteur / Conductrice de tracteur enjambeur',
  'Conducteur / Conductrice de tracto-benne',
  'Débardeur / Débardeuse',
  'Débardeur forestier / Débardeuse forestière',
  "Opérateur / Opératrice d'abatteuse",
  "Opérateur / Opératrice d'épandage",
  "Pilote de machines d'abattage",
  'Tractoriste agricole',
  'Tract

<br>

It works, let's automate !

In [13]:
for rome_code in tqdm(filter(lambda txt: len(txt) == 5, rome_codes.keys())):
    # Query the URL and raise an error if the page couldn't be fetched
    response = get(base_url, params={"codeRome": rome_code})
    response.raise_for_status()
    
    # Get HTML and parse it
    soup = BeautifulSoup(response.content, "html.parser")
    
    # Find the different titles for the job and the roles associated from the first tab
    titles, roles = soup.select("#js-tabs-unit1-body > .bd", limit=2)
    titles = [tag.text for tag in titles.find_all("li")]
    roles = [tag.text for tag in roles.find_all("li")]
    
    # On the second tab, split for base aptitudes and extra aptitudes
    base, extra = soup.select("#js-tabs-unit2-body > .bd")
    # Then we separate skills from expertises using the columns
    base_expertises = [li.text for li in base.select("tr > td:nth-child(odd) li")]
    base_skills = [li.text for li in base.select("tr > td:nth-child(even) li")]
    extra_expertises = [li.text for li in extra.select("tr > td:nth-child(odd) li")]
    extra_skills = [li.text for li in extra.select("tr > td:nth-child(even) li")]
    
    # Finally, save the job description as JSON in the appropriate sub-directory
    subdir = Path(f"./data/{rome_code[0]}")
    subdir.mkdir(parents=True, exist_ok=True)
    with subdir.joinpath(f"{rome_code}.json").open("w") as f:
        json.dump(
            {
                "appelations": titles,
                "description": roles,
                "expertises": base_expertises,
                "competences": base_skills,
                "expertises_extra": extra_expertises,
                "competences_extra": extra_skills
            },
            f
        )

532it [02:54,  3.05it/s]


<br>

## Data extraction and generation

In [14]:
class Corpus:
    
    __punctuation_table = {char: " " for char in range(maxunicode) if category(chr(char)).startswith("P")}
    __stopwords = set(Path("stopwords-fr.txt").read_text().split("\n"))
    __stemmer = FrenchStemmer(ignore_stopwords=True)
    
    def __init__(self, sentences: list, remove_accents: bool = True, stemming: bool = False):
        self.sentences = list(set(sentences))
        self.corpus = list()
        self.vocab = set()
        self.remove_accents = remove_accents
        self.stemming = stemming
        
    def __iter__(self):
        return (s for s in self.corpus)
    
    @property
    def tokenized_documents(self):
        return [s.split() for s in self.corpus]
        
    def clean_sentence(self, sentence: str):
        sentence = sentence.lower().translate(self.__punctuation_table)
        words = [word for word in sentence.split() if word not in self.__stopwords]
        if self.stemming:
            words = [self.__stemmer.stem(word) for word in words]
        if self.remove_accents:
            words = [unidecode(word) for word in words]
        return " ".join(words)
        
    def build_corpus(self):
        for sentence in self.sentences:
            clean_sentence = self.clean_sentence(sentence)
            self.corpus.append(clean_sentence)
            self.vocab.update(clean_sentence.split())

In [15]:
path = Path('./data/')
fiches = []
for file in path.glob("?/?????.json"):
    with file.open() as j:
        fiches.append(json.load(j))

In [16]:
# extract skills
skills = list(chain(*[fiche.get("competences", []) for fiche in fiches], *[fiche.get("competences_extra", []) for fiche in fiches]))

In [17]:
# without stemming
corpus = Corpus(skills)
corpus.build_corpus()

In [18]:
# with stemming
corpus_stem = Corpus(skills, stemming=True)
corpus_stem.build_corpus()

<br>

## Bag Of Words 

<br>

### From scratch

We can extract a BOW matrix from our corpus object and use it to compute similarities, because why not.
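
As a small, self-contained illustration of the idea (independent of the `Corpus` object), the sketch below builds bag-of-words vectors for a toy corpus and ranks the documents by cosine similarity. The three documents and the helper names `bow_vector` and `cosine` are made up for the example; they assume sentences that are already cleaned and whitespace-tokenized.

```python
from collections import Counter
from math import sqrt


def bow_vector(tokens, vocab):
    """Count how often each vocabulary word occurs in the token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy corpus (assumed already lowercased, without stopwords or accents)
docs = ["gestion comptes clients", "revision legale comptes", "fabrication fromage"]
vocab = sorted({word for doc in docs for word in doc.split()})
matrix = [bow_vector(doc.split(), vocab) for doc in docs]

# "analyser" is out of vocabulary, so it simply does not contribute to the vector
query = bow_vector("analyser comptes clients".split(), vocab)
scores = [cosine(row, query) for row in matrix]

for doc, score in sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.4f}  {doc}")
```

Words unknown to the vocabulary are ignored by construction, which is why a query can still score highly against a document that shares only some of its words.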

In [19]:
# create empty array
bow = np.empty((len(corpus.corpus), len(corpus.vocab)))

# populate it by counting known words for each document of the corpus
for i, doc in enumerate(corpus):
    words = doc.split()
    bow[i] = [words.count(v) for v in sorted(corpus.vocab)]

In [20]:
# word frequencies
word_counts = sorted(list(zip(sorted(corpus.vocab), np.asarray(bow.sum(axis=0)).flatten())), key=lambda e: e[1], reverse=True)
word_counts[:10]

[('techniques', 657.0),
 ('utilisation', 251.0),
 ('caracteristiques', 207.0),
 ('procedures', 141.0),
 ('pratique', 116.0),
 ('produits', 103.0),
 ('01', 80.0),
 ('gestion', 76.0),
 ('caces', 71.0),
 ('logiciel', 71.0)]

In [21]:
# using a new sentence
skill = "analyser des comptes clients"

# cleaning it to match corpus vocabulary
clean_skill = corpus.clean_sentence(skill)

# vectorizing it
vec = np.array([clean_skill.split().count(v) for v in sorted(corpus.vocab)])

# computing cosine similarity
cosines = (bow @ vec.T) / np.linalg.norm(bow, axis=1) / np.linalg.norm(vec)

In [22]:
# print 10 most relevant skills from the corpus
# (the similarities were computed on `corpus`, so we pair them with its sentences)
print(*list(sorted(zip(corpus.sentences, cosines), key=lambda s: s[1], reverse=True))[:10], sep="\n")

('Gestion des comptes clients', 0.816496580927726)
('Révision légale des comptes', 0.408248290463863)
('Logiciel de gestion clients', 0.408248290463863)
('Gestion commerciale, relation clients', 0.35355339059327373)
('Fabrication de fromage', 0.0)
('Finance', 0.0)
('Langue étrangère - Japonais', 0.0)
('Utilisation de pistolet hydraulique', 0.0)
("Règles d'affranchissement du courrier", 0.0)
('Logiciel Cinéma 4D', 0.0)


<br>

### Using Scikit-Learn

In [23]:
cvec = CountVectorizer(lowercase=False, vocabulary=corpus.vocab)
bow = cvec.fit_transform(corpus)

In [24]:
word_counts = sorted(list(zip(cvec.vocabulary_, np.asarray(bow.sum(axis=0)).flatten())), key=lambda e: e[1], reverse=True)
word_counts[:10]

[('techniques', 657),
 ('utilisation', 251),
 ('caracteristiques', 207),
 ('procedures', 141),
 ('pratique', 116),
 ('produits', 103),
 ('01', 80),
 ('gestion', 76),
 ('caces', 71),
 ('logiciel', 71)]

In [25]:
# using cosine similarity with sklearn 
skill = "tailler et débiter des arbres"
vec = cvec.transform([corpus.clean_sentence(skill)])
cosines = cosine_similarity(bow, vec).flatten()

In [26]:
# print 10 most relevant skills from the corpus
print(*list(sorted(zip(corpus.sentences, cosines), key=lambda s: s[1], reverse=True))[:10], sep="\n")

('Techniques de soins aux arbres ou ceps', 0.5)
('Fabrication de fromage', 0.0)
('Finance', 0.0)
('Langue étrangère - Japonais', 0.0)
('Utilisation de pistolet hydraulique', 0.0)
("Règles d'affranchissement du courrier", 0.0)
('Logiciel Cinéma 4D', 0.0)
('Mécanique des fluides', 0.0)
('Procédures de contrôle des matériels et équipements', 0.0)
('Caractéristiques des Modulateurs Démodulateurs -MODEM-', 0.0)


<br>

### Now with stemming

In [27]:
cvec = CountVectorizer(lowercase=False, vocabulary=corpus_stem.vocab)
bow = cvec.fit_transform(corpus_stem)

In [28]:
# using cosine similarity with sklearn 
skill = "analyser des comptes clients"
vec = cvec.transform([corpus_stem.clean_sentence(skill)])
cosines = cosine_similarity(bow, vec).flatten()

In [29]:
# print 10 most relevant skills from the corpus
print(*list(sorted(zip(corpus_stem.sentences, cosines), key=lambda s: s[1], reverse=True))[:10], sep="\n")

('Gestion des comptes clients', 0.6666666666666669)
('Typologie du client', 0.408248290463863)
('Analyse de la performance', 0.408248290463863)
('Analyse transactionnelle', 0.408248290463863)
('Comptabilité client', 0.408248290463863)
('Analyse spatiale', 0.408248290463863)
('Analyse des coûts', 0.408248290463863)
('Analyse médicale', 0.408248290463863)
("Analyse d'incidents", 0.408248290463863)
('Analyse financière', 0.408248290463863)


<br>

##  TF-IDF

Some words seem to be frequently used in skill descriptions and to convey less meaning than others.

Using Tf-Idf may result in better matching of skills by avoiding giving importance to meaningless words
such as _"réaliser"_, _"identifier"_, _"définir"_, etc.
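
To see why frequent words end up with less weight, here is a minimal sketch of the inverse document frequency term on a toy corpus. It uses the smoothed formula `ln((1 + n) / (1 + df)) + 1`, which is the default documented for scikit-learn's `TfidfVectorizer`; the toy documents and the helper name `smoothed_idf` are assumptions made for the example.

```python
from math import log


def smoothed_idf(docs):
    """Return {word: idf} using idf = ln((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(docs)
    vocab = sorted({word for doc in docs for word in doc})
    return {
        word: log((1 + n_docs) / (1 + sum(word in doc for doc in docs))) + 1
        for word in vocab
    }


# Toy tokenized corpus: "techniques" appears in three documents out of four
docs = [
    ["techniques", "soudure"],
    ["techniques", "vente"],
    ["techniques", "peinture"],
    ["analyse", "financiere"],
]
idf = smoothed_idf(docs)

for word, weight in sorted(idf.items(), key=lambda item: item[1]):
    print(f"{word:<12}{weight:.4f}")
```

A word that occurs in most documents gets an IDF close to 1, while a word that occurs in a single document gets a noticeably larger one, so rare words dominate the similarity.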

### Basic use

In [30]:
tfidf = TfidfVectorizer(lowercase=False, vocabulary=corpus.vocab)
tf_mat = tfidf.fit_transform(corpus)

In [31]:
# using cosine similarity with sklearn 
skill = "analyser des indicateurs financiers"
vec = tfidf.transform([corpus.clean_sentence(skill)])
cosines = cosine_similarity(tf_mat, vec).flatten()

In [32]:
# print 10 most relevant skills from the corpus
print(*list(sorted(zip(corpus.sentences, cosines), key=lambda s: s[1], reverse=True))[:10], sep="\n")

("Analyse d'indicateurs financiers", 0.878542070237204)
('Logiciels financiers', 0.5417875419289122)
('Caractéristiques des produits financiers', 0.5219383668598823)
('Indicateurs statistiques', 0.47898929924549)
('Analyse des risques financiers', 0.4594240849173993)
('Ratios financiers', 0.43587466497709054)
('Calculs financiers', 0.43587466497709054)
('Indicateurs de suivi de production', 0.4296576662039001)
('Indicateurs de couverture de risques', 0.4266050340826602)
('Réglementation des marchés financiers', 0.42142974089207774)


<br>

### Adding stemming

Let's try again with stemming !

In [33]:
tfidf = TfidfVectorizer(lowercase=False, vocabulary=corpus_stem.vocab)
tf_mat = tfidf.fit_transform(corpus_stem)

In [34]:
# same process
skill = "administrer des bases de données relationnelles"
vec = tfidf.transform([corpus_stem.clean_sentence(skill)])
cosines = cosine_similarity(tf_mat, vec).flatten()

In [35]:
# and the top 10 results
print(*list(sorted(zip(corpus_stem.sentences, cosines), key=lambda s: s[1], reverse=True))[:10], sep="\n")

('Gestion de bases de données', 0.5766968689723593)
('Logiciels de gestion de base de données', 0.5304828844131672)
('Marketing relationnel', 0.4622313004659922)
('Système de Gestion de Bases de Données (SGBD)', 0.42104752176086624)
('Gestion administrative', 0.3910602315946875)
('Données de contrôle', 0.3814266823299183)
('Droit administratif', 0.37572217209252623)
('Administration centrale', 0.31730902176957465)
('Procédures et circuits administratifs', 0.30906172994012265)
('Gestion administrative du personnel', 0.2910663892201751)


<br>

### Some improvements ?

To go further, we can compute tf-idf on character n-grams instead of entire words, 
which may perform better on resumes with spelling mistakes.

In [36]:
# set analyzer to 'char_wb' to specify the use of characters instead of words
# and to create only n-grams for chars between the same word boundaries ('wb')
# i.e. chars belonging to the same word
tfidf = TfidfVectorizer(lowercase=False, analyzer="char_wb", ngram_range=(5, 7))
tf_mat = tfidf.fit_transform(corpus_stem)

In [37]:
# same process
skill = "créer des bases de données relationnelles"
vec = tfidf.transform([corpus_stem.clean_sentence(skill)])
cosines = cosine_similarity(tf_mat, vec).flatten()

In [38]:
# and the top 10 results
print(*sorted(zip(corpus_stem.sentences, cosines), key=lambda s: s[1], reverse=True)[:10], sep="\n")

('Marketing relationnel', 0.7241707133205934)
('Progiciels de gestion de la relation client (CRM - Customer Relationship Management)', 0.3348109810201438)
('Veille informationnelle', 0.3068995110913636)
('Exploration fonctionnelle', 0.20860583904152563)
('Pathologies fonctionnelles', 0.19712503021845362)
('Normes rédactionnelles', 0.19564125213863048)
('Techniques de montage traditionnel', 0.18674828945372202)
("Méthodes d'analyse (systémique, fonctionnelle, de risques, ...)", 0.1790635818929684)
('Rééducation nutritionnelle', 0.17435635522005638)
('Méthodes de contrôle dans le domaine fonctionnel', 0.17351227588858983)


<br>

_This did not go well !_

Because we are using n-grams to compare and identify words, partial matches can outweigh full matches.

In the last example, _'relationnelles'_ was stemmed to _'relationnel'_, which partially
matched _'informationnelles'_ and _'fonctionnelles'_ and ultimately led to these results.
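
The overlap can be checked by hand with a simplified re-implementation of word-bounded character n-grams. This is only a sketch: it pads each word with one space on each side, as the `char_wb` analyzer is documented to do, but it ignores words shorter than `n`, and the two input words are assumed stems rather than actual outputs of the stemmer.

```python
def char_wb_ngrams(text, min_n=5, max_n=7):
    """Collect character n-grams inside each space-padded word of the text."""
    grams = set()
    for word in text.split():
        padded = f" {word} "  # one space on each side marks the word boundaries
        for n in range(min_n, max_n + 1):
            for start in range(len(padded) - n + 1):
                grams.add(padded[start:start + n])
    return grams


# Compare the 5-grams of two words that only share a suffix
query = char_wb_ngrams("relationnel", 5, 5)
other = char_wb_ngrams("fonctionnel", 5, 5)
shared = sorted(query & other)

print(f"{len(shared)} of {len(query)} 5-grams are shared: {shared}")
```

Almost half of the 5-grams are shared even though the two words mean unrelated things, which is enough for a common suffix to outweigh an exact match on a shorter word.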

<br>

## Word2Vec

Now, let's use Gensim to create and train a Word2Vec model on our library of skills.

Then, we will use the Word Mover's Distance, which is useful to compare sentences with 
a Word2Vec model, to compute the closest skill in the library.
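
To give an intuition of what such a distance measures, here is a sketch of a simplified variant often called the relaxed Word Mover's Distance: every word travels to its nearest word in the other sentence, and the larger of the two directions is kept. This is a lower bound, not what Gensim's `wmdistance` computes (it solves the full transport problem), and the two-dimensional vectors below are invented for the example.

```python
from math import dist

# Hypothetical 2-D embeddings: accounting words on one axis, painting words on the other
toy_vectors = {
    "bilan": (1.0, 0.0),
    "comptable": (0.9, 0.1),
    "gestion": (0.8, 0.2),
    "peinture": (0.0, 1.0),
    "verre": (0.1, 0.9),
}


def one_sided(source, target, vectors):
    """Mean Euclidean distance from each source word to its nearest target word."""
    return sum(
        min(dist(vectors[s], vectors[t]) for t in target) for s in source
    ) / len(source)


def relaxed_wmd(doc_a, doc_b, vectors):
    """Relaxed WMD: the larger of the two one-sided nearest-word costs."""
    return max(one_sided(doc_a, doc_b, vectors), one_sided(doc_b, doc_a, vectors))


close = relaxed_wmd(["bilan", "comptable"], ["gestion", "comptable"], toy_vectors)
far = relaxed_wmd(["bilan", "comptable"], ["peinture", "verre"], toy_vectors)
print(f"close pair: {close:.4f}, distant pair: {far:.4f}")
```

Sentences do not need to share any word to be close: what matters is how far their words are from each other in the embedding space.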

<br>

### Basic

In [39]:
model = Word2Vec(corpus.tokenized_documents, workers=10)

In [40]:
# compute some distances
skill_1 = corpus.clean_sentence("Faire un bilan comptable").split()
skill_2 = corpus.clean_sentence("Façonner une pâte en matériaux de synthèse").split()
skill_3 = corpus.clean_sentence("Peinture sur glace et supports exotiques").split()

print(f"Distance between skill_1 and skill_2: {model.wv.wmdistance(skill_1, skill_2):.4f}")
print(f"Distance between skill_2 and skill_3: {model.wv.wmdistance(skill_2, skill_3):.4f}")
print(f"Distance between skill_3 and skill_1: {model.wv.wmdistance(skill_3, skill_1):.4f}")

Distance between skill_1 and skill_2: 1.3841
Distance between skill_2 and skill_3: 1.4250
Distance between skill_3 and skill_1: 1.4885


In [41]:
distances = np.array([model.wv.wmdistance(skill_1, skill) for skill in corpus.tokenized_documents])

In [42]:
# and the top 10 results
# we are using distances here so no need to reverse the sort
print(*list(sorted(zip(corpus.sentences, distances), key=lambda s: s[1]))[:10], sep="\n")

('Gestion comptable', 0.6537737846374512)
('Audit comptable', 0.665519118309021)
('Écriture comptable', 0.6878059506416321)
("Règlementation des professionnels de l'expertise comptable", 0.8841224256981766)
('Analyse comptable et financière', 0.8908044811754542)
('Gestion comptable et financière des collectivités locales', 0.9298514443611623)
('Préparations culinaires de base', 1.0910216569900513)
('Plats cuisinés à base de poisson', 1.0910216569900513)
('Produits détaxés', 1.1095918416976929)
('Produits animaliers', 1.1095918416976929)


<br>

### With Stemming

In [43]:
model = Word2Vec(corpus_stem.tokenized_documents, workers=10)

In [44]:
# compute some distances
skill_1 = corpus_stem.clean_sentence("Faire un bilan comptable").split()
skill_2 = corpus_stem.clean_sentence("Façonner une pâte en matériaux de synthèse").split()
skill_3 = corpus_stem.clean_sentence("Peinture sur glace et supports exotiques").split()

print(f"Distance between skill_1 and skill_2: {model.wv.wmdistance(skill_1, skill_2):.4f}")
print(f"Distance between skill_2 and skill_3: {model.wv.wmdistance(skill_2, skill_3):.4f}")
print(f"Distance between skill_3 and skill_1: {model.wv.wmdistance(skill_3, skill_1):.4f}")

Distance between skill_1 and skill_2: 1.1868
Distance between skill_2 and skill_3: 1.3110
Distance between skill_3 and skill_1: 1.3595


In [45]:
# compute distances for skill_3
distances = np.array([model.wv.wmdistance(skill_3, skill) for skill in corpus_stem.tokenized_documents])

In [46]:
# and the top 10 results
# we are using distances here so no need to reverse the sort
print(*sorted(zip(corpus_stem.sentences, distances), key=lambda s: s[1])[:10], sep="\n")

('Caractéristiques des peintures', 0.6085596770170928)
('Techniques de pulvérisation de peinture électrostatique', 0.6142423673368693)
('Techniques de peinture à la taloche', 0.6142423673368693)
('Techniques de peinture', 0.6142423673368693)
("Techniques de peinture à l'essuyé", 0.6142423673368693)
('Techniques de peinture à la brosse', 0.6142423673368693)
('Peinture sur verre', 0.6278082578882576)
('Supports de manutention', 0.6551567396755815)
('Types de support de câbles', 0.6778903522794844)
('Types de supports audio', 0.6778903522794844)


<br>

### Adding Tf-Idf

Let's go a step further.

Word2Vec is a model used for finding embeddings for words according to their context.
Here, we will be dealing with sentences or paragraphs ...

How should we define the embedding of a sentence according to Word2Vec ?

There are different methods to get the sentence vectors :
- Doc2Vec : we can train a Doc2Vec model on our dataset and then use the resulting sentence vectors.
- Average of Word2Vec vectors : we can average all the word vectors in a sentence to represent the sentence.
- Average of Word2Vec vectors with TF-IDF : take the average of the word vectors weighted by their TF-IDF scores.

We will try the last one.
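
Before applying it to the whole corpus, here is a minimal pure-Python sketch of a TF-IDF-weighted average of word vectors. The helper name `weighted_sentence_vector`, the two-dimensional vectors and the weights are all made up for the illustration.

```python
def weighted_sentence_vector(tokens, vectors, weights):
    """Average the word vectors of a sentence, weighted by per-word scores."""
    # Keep only words that have both an embedding and a positive weight
    known = [t for t in tokens if t in vectors and weights.get(t, 0) > 0]
    if not known:
        return None  # nothing in the sentence matches the vocabulary
    total = sum(weights[t] for t in known)
    dim = len(next(iter(vectors.values())))
    return [sum(weights[t] * vectors[t][i] for t in known) / total for i in range(dim)]


# Hypothetical embeddings and TF-IDF weights: the rarer word weighs three times more
toy_vectors = {"techniques": [1.0, 0.0], "peinture": [0.0, 1.0]}
toy_weights = {"techniques": 1.0, "peinture": 3.0}

vec = weighted_sentence_vector(["techniques", "peinture"], toy_vectors, toy_weights)
print(vec)
```

The resulting vector leans towards the word with the larger weight, so a generic word such as "techniques" pulls the sentence embedding less than a specific one.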

In [47]:
# a classic Tf-Idf training and transforming
tfidf = TfidfVectorizer(lowercase=False, vocabulary=corpus_stem.vocab)
tf_mat = tfidf.fit_transform(corpus_stem)

In [48]:
# we use min_count=1 to keep every word from the vocabulary
model = Word2Vec(
    corpus_stem.tokenized_documents,
    min_count=1,
    vector_size=50,
    epochs=100,
    workers=10,
)

In [49]:
# the tf_idf matrix is of shape nb_sentences * len_vocab
tf_mat.shape

(3968, 3259)

In [50]:
# the matrix of vectors of the word2vec model is of shape len_vocab * vector_size
# however, the vocabulary is not ordered as it is in tf-idf
model.wv.vectors.shape

(3259, 50)

In [51]:
# for every word in ordered vocab (tfidf.get_feature_names_out), we stack its vector
# (get_feature_names was removed from recent scikit-learn releases)
vecs = np.array([model.wv[word] for word in tfidf.get_feature_names_out()])

In [52]:
# now the only operation left is to multiply the matrices and divide by the sum 
# of each row of the tf-idf matrix to obtain the vector embeddings of our 
# sentences (of shape n_sentences * vector_size)
# np.asarray turns the resulting np.matrix back into a plain array
embeddings = np.asarray(tf_mat @ vecs / tf_mat.sum(axis=1).reshape(-1, 1))
embeddings.shape

(3968, 50)

In [53]:
skill = "Organiser des évènements officiels et professionnels"

# we convert the CSR matrix to COO format to access column indexes and data efficiently
idfs = tfidf.transform([corpus_stem.clean_sentence(skill)]).tocoo()
# weighted average of the word vectors, using the tf-idf scores as weights
vec = np.sum([idf * vecs[index] for index, idf in zip(idfs.col, idfs.data)], axis=0) / idfs.sum()

In [54]:
# now we compute the cosine similarities with every sentence by using the embedding matrix
cosines = cosine_similarity(embeddings, vec.reshape(1, -1)).flatten()

In [55]:
# and the top 10 results
print(*sorted(zip(corpus_stem.sentences, cosines), key=lambda s: s[1], reverse=True)[:10], sep="\n")

("Organisation d'évènements d'entreprises", 0.9299365646376524)
('Organisation du marché des transports (tarifs, tendances, ...)', 0.9267802473526286)
('Organisation du système sanitaire et social', 0.9194771603562804)
('Organisation de la chaîne de transport national et international', 0.9163438947544709)
('Organisation du système scolaire', 0.9081117240020357)
("Organisation d'évènements culturels", 0.9064348731974154)
("Organisation d'une baie de brassage", 0.9059132872695261)
("Organisation d'évènements", 0.9052304716102239)
('Organisation de soirées', 0.9047534173617487)
('Sociologie des organisations', 0.9037087604696366)


<br>

## Spacy and pre-trained models

Finally, we will test some models from Spacy, because we can and it is easier.

If everything goes well, this could be better than everything we tried so far.

In [56]:
# to download the pre-trained model for French documents, uncomment the following line
# !spacy download fr_core_news_md

In [57]:
nlp = spacy.load("fr_core_news_md")

In [58]:
skill = nlp("Administrer des bases de données")

In [59]:
# build the array of similarities
similarity = np.array([skill.similarity(nlp(doc)) for doc in corpus.sentences])


In [60]:
# see the top 10 skills
print(*sorted(zip(corpus.sentences, similarity), key=lambda s: s[1], reverse=True)[:10], sep="\n")

('Règles de traitement des opérations bancaires', 0.9264853175083053)
('Méthodes de valorisation des stocks', 0.9218128374642741)
('Fonctionnement des matériels de reprographie', 0.9176832957531333)
('Gestion de bases de données', 0.9171529300387925)
('Caractéristiques des dossiers de maintenance', 0.9144608210200109)
('Méthodes de contrôle de structure des matériaux', 0.9140826994088761)
('Procédures de maintenance des équipements de décontamination', 0.9123576073628312)
('Fonctionnement des machines de développement automatisé', 0.910774109408425)
('Méthodes de Gestion des Moyens de Production', 0.9093153749549872)
('Outils de planification des ressources humaines', 0.9086192583382122)


<br>

## Extracting skills from a resume

In [61]:
# using PyMuPDF (imported as fitz) to read PDFs
doc = fitz.open("CV_test/CV_1.pdf")

In [62]:
# we extract each block from the PDF individually
# we split sentences and store them in the following list
sentences = []
for block in doc.load_page(0).get_text("blocks"):
    block_text = sub(r"<[^>]*>|\n", " ", block[4])
    for st in split(r"[.!?-]\s", block_text):
        if cst := st.strip():
            sentences.append(cst)    

In [63]:
# we create a Word2Vec model with the Tf-Idf weighting explained above
tfidf = TfidfVectorizer(lowercase=False, vocabulary=corpus_stem.vocab)
model = Word2Vec(
    corpus_stem.tokenized_documents,
    min_count=1,
    vector_size=50,
    epochs=100,
    workers=10,
)

tf_mat = tfidf.fit_transform(corpus_stem)
vecs = np.array([model.wv[word] for word in tfidf.get_feature_names_out()])
embeddings = np.asarray(tf_mat @ vecs / tf_mat.sum(axis=1).reshape(-1, 1))

<br>

### Extract best skill per sentence

In [64]:
def compare_sentence(sentence: str) -> tuple:
    """Compare a sentence with the corpus and returns the best match"""
    
    # tf-idf scores for the input sentence
    # we use the COO format to access indexes and data efficiently
    tfidfs = tfidf.transform([corpus_stem.clean_sentence(sentence)]).tocoo()
    
    # if no word in the sentence matches the known vocabulary, don't bother
    if tfidfs.getnnz() == 0:
        return ("", 0)
    
    # we compute the average of the word embeddings from the input sentence
    # weighted with their tf-idf scores
    vec = np.sum([weight * vecs[index] for index, weight in zip(tfidfs.col, tfidfs.data)], axis=0) / tfidfs.sum()
    
    # then we compute similarities with the corpus' embeddings
    cosines = cosine_similarity(embeddings, vec.reshape(1, -1)).flatten()
    
    # and return the best match
    return max(zip(corpus_stem.sentences, cosines), key=lambda s: s[1])

In [65]:
compare_sentence('Gestionnaire de réseaux sociaux')

('Marketing des réseaux sociaux', 0.9832577476846847)

In [66]:
# we compare every sentence extracted from the resume with the skills corpus
# and extract the best matching skill
resume_skills = [compare_sentence(st) for st in sentences]

In [67]:
sorted(resume_skills, key=lambda s: s[1], reverse=True)[:10]

[('Marketing des réseaux sociaux', 0.999999999999999),
 ('Insights marketing', 0.9902524637145232),
 ("Équipements d'imprimerie", 0.9890709084053457),
 ('Marketing des réseaux sociaux', 0.9832577476846847),
 ('Marketing des réseaux sociaux', 0.9832577476846847),
 ('Communication interpersonnelle', 0.9789137887522529),
 ('Marketing des réseaux sociaux', 0.9757376356113969),
 ("Gouvernance d'entreprise", 0.9640695292837177),
 ('Actualité quotidienne / informations', 0.9614404753259805),
 ('Hospitalisation à domicile', 0.9612038974944008)]

In [68]:
# unique skills
sorted(dict(sorted(resume_skills, key=lambda s: s[1])).items(), key=lambda s: s[1], reverse=True)[:10]

[('Marketing des réseaux sociaux', 0.999999999999999),
 ('Insights marketing', 0.9902524637145232),
 ("Équipements d'imprimerie", 0.9890709084053457),
 ('Communication interpersonnelle', 0.9789137887522529),
 ("Gouvernance d'entreprise", 0.9640695292837177),
 ('Actualité quotidienne / informations', 0.9614404753259805),
 ('Hospitalisation à domicile', 0.9612038974944008),
 ('Émissions avec débats', 0.9578283264070274),
 ('Communication digitale', 0.9563800261157076),
 ('Ingénierie de la formation', 0.9557226875014917)]

<br>

Not bad, but we need a way to combine and take into account skills that were matched more than once.

<br>

### Extract best skills overall

Instead of extracting the best match for each sentence in the resume,
we could average the similarities of the sentences to each skill in the corpus
and then extract the top 10.

In [69]:
resume_distances = np.zeros((embeddings.shape[0], 1))
sentence_count = 0

for sentence in sentences:
    # tf-idf scores for the input sentence (as previously)
    # we use the COO format to access indexes and data efficiently
    tfidfs = tfidf.transform([corpus_stem.clean_sentence(sentence)]).tocoo()
    
    # this time, if no word in the sentence matches the known vocabulary, just skip
    if tfidfs.getnnz() == 0:
        continue
        
    # compute the weighted average with tf-idf scores
    vec = np.sum([weight * vecs[index] for index, weight in zip(tfidfs.col, tfidfs.data)], axis=0) / tfidfs.sum()
    
    # compute similarities with the corpus
    cosines = cosine_similarity(embeddings, vec.reshape(1, -1))
    
    # and this time, we add the similarities to the resume_distances array
    resume_distances += cosines
    # and add one to the sentence count
    sentence_count += 1
    
    
# finally, we divide resume_distances by the number of matched sentences in the resume
resume_distances /= (sentence_count or 1)

In [70]:
sorted(zip(corpus_stem.sentences, resume_distances[:, 0]), key=lambda s: s[1], reverse=True)[:10]

[('Localisation de panne', 0.7597689023855665),
 ('Diagnostic capillaire', 0.7526106529920145),
 ('Aide au maintien à domicile', 0.7403831347612659),
 ('UX/UI Design', 0.7377252575912373),
 ('Microsoft Office PowerPoint (diaporama)', 0.7363804169962034),
 ('Hospitalisation à domicile', 0.7263621168744556),
 ('Reporting anglo-saxon', 0.7187427003846412),
 ('Nature morte', 0.71769451397857),
 ('Responsabilité Sociétale des Entreprises (RSE)', 0.7173649919877738),
 ('Généalogie', 0.7149018007722692)]

<br>

Obviously, this didn't work.

The reason is that weak matches occurring frequently can outweigh strong matches
occurring infrequently.

Moreover, if the resume presents two orthogonal skills or sets of skills (algebraically speaking),
they can counteract each other when averaging.  
Detecting a skill with 100% confidence in a sentence
will result in a 0% confidence for the orthogonal skills and vice versa, thus averaging to 50%
over the whole resume.

A more robust approach would be a competitive one, as in sporting events where 
teams gather points in each category according to their ranking and the overall winner is the 
team with the most points.
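
The difference between the two strategies shows up on a tiny invented example: two sentences, two specific skills that each match one sentence perfectly, and one generic skill that matches both sentences moderately. All names and similarity values below are hypothetical.

```python
# Similarity of each resume sentence to each skill (made-up values)
similarities = {
    "sentence 1": {"skill A": 1.0, "skill B": 0.0, "generic skill": 0.6},
    "sentence 2": {"skill A": 0.0, "skill B": 1.0, "generic skill": 0.6},
}
skills = ["skill A", "skill B", "generic skill"]

# Strategy 1: average the similarities over all sentences
averages = {
    skill: sum(row[skill] for row in similarities.values()) / len(similarities)
    for skill in skills
}

# Strategy 2: per sentence, the n best skills earn 4, 2, 1 points weighted by similarity
n_best = 3
points = [2 ** k for k in range(n_best - 1, -1, -1)]
scores = dict.fromkeys(skills, 0.0)
for row in similarities.values():
    ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)[:n_best]
    for (skill, similarity), pts in zip(ranked, points):
        scores[skill] += pts * similarity

best_by_average = max(averages, key=averages.get)
worst_by_points = min(scores, key=scores.get)
print("averaging favours:", best_by_average)
print("point scores:", scores)
```

With averaging, the generic skill comes first because the two specific skills cancel each other out, whereas the point-based ranking puts both specific skills ahead of it.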

In [71]:
def extract_best_matches(sentence: str, n_best: int) -> tuple:
    """Compare a sentence with the corpus and returns the n best matches"""
    
    # tf-idf scores for the input sentence
    # we use the COO format to access indexes and data efficiently
    tfidfs = tfidf.transform([corpus_stem.clean_sentence(sentence)]).tocoo()
    
    # if no word in the sentence matches the known vocabulary, don't bother
    if tfidfs.getnnz() == 0:
        return None
    
    # we compute the average of the word embeddings from the input sentence
    # weighted with their tf-idf scores
    vec = np.sum([weight * vecs[index] for index, weight in zip(tfidfs.col, tfidfs.data)], axis=0) / tfidfs.sum()
    
    # then we compute similarities with the corpus' embeddings
    cosines = cosine_similarity(embeddings, vec.reshape(1, -1)).flatten()
    
    # and return the n best matches, best first
    return sorted(zip(corpus_stem.sentences, cosines), key=lambda s: s[1])[:-n_best - 1:-1]

In [72]:
# a dictionary of skills encountered and the points associated
skills_rankings = {}
n_best = 3

for sentence in sentences:
    
    # get best matches
    top = extract_best_matches(sentence, n_best)
    
    # if the sentence could not be matched, continue
    if top is None:
        continue
    
    # we award points on a base-2 scale: 2**(n_best - 1) for the best match, down to 1 for the last
    # these points are weighted with the associated cosine
    # and then added to the current points of the skill in the ranking
    for skill, points in zip(top, np.logspace(0, n_best-1, num=n_best, base=2)[::-1]):
        skills_rankings[skill[0]] = skills_rankings.get(skill[0], 0) + points * skill[1]
        
        
# finally, we sort the skills and normalize each score with the maximum score
max_score = max(skills_rankings.values())
rankings = sorted(skills_rankings.items(), key=lambda s: s[1], reverse=True)[:10]
rankings = list(map(lambda s: (s[0], s[1] / max_score), rankings))

In [73]:
# see results
rankings

[('Marketing des réseaux sociaux', 1.0),
 ('Émissions avec débats', 0.48592938838949495),
 ('Ingénierie de la formation', 0.4848611470383812),
 ('Diagnostic capillaire', 0.4556705090121022),
 ('Insights marketing', 0.25118946724462754),
 ("Équipements d'imprimerie", 0.25088975150595716),
 ('Communication interpersonnelle', 0.2483132757405448),
 ('Réseaux Digital Subscriber Line (DSL)', 0.24454895108055213),
 ("Gouvernance d'entreprise", 0.24454785049377933),
 ('Gestuelle de communication avec le personnel de piste',
  0.24406472333275903)]

<br>

To be continued ...