## Natural Language Processing: Information Retrieval Challenge

In this project, we set out to build an original information retrieval system for the medical corpus NFCorpus.
The corpus consists of excerpts from medical articles taken from PubMed, and its complexity calls for a careful
approach if we want to outperform BM25, which remains one of the strongest baselines according to the BEIR paper.


Our goal is to improve on this baseline by applying various preprocessing techniques, by manipulating the
document vocabulary, and possibly by integrating pre-trained or newly trained Word2Vec models. We evaluate
our models with the nDCG@5 metric, which scores the five top-ranked results returned by the model for each query.
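
To make the metric concrete, here is a minimal sketch of nDCG@k in plain Python. It assumes linear gains (the relevance grade itself is discounted by the log of the rank); the function names are illustrative and are not part of the notebook, which relies on `ndcg_score` from scikit-learn.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances_in_ranked_order, k=5):
    """nDCG@k: DCG of the ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(relevances_in_ranked_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0  # no relevant document at all
    return dcg_at_k(relevances_in_ranked_order, k) / ideal_dcg

# Relevance grades of the documents, listed in the order the system ranked them.
perfect = [2, 1, 1, 0, 0, 0]
swapped = [0, 1, 1, 2, 0, 0]

print(ndcg_at_k(perfect))  # 1.0 by construction
print(ndcg_at_k(swapped))  # strictly below 1.0
```

The second ranking contains the same documents but places the most relevant one fourth, so its score drops even though all relevant documents are still within the top 5.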

### Importing the main libraries

In [1]:
import math
import random
import re
from collections import Counter, defaultdict

import nltk
import numpy as np
import torch
import transformers
from gensim.models import FastText, Word2Vec
from nltk import ne_chunk, pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer
from sklearn.metrics import ndcg_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import ParameterGrid

nltk.download('all')  # fetches every NLTK resource; the cells below only need the stop words, the tokenizer models and WordNet


The baseline code is built around three main functions:


loadNFCorpus loads the NFCorpus data from its files. It reads the medical documents into the dictionary
“dicDoc”, the medical queries into the dictionary “dicReq”, and the query-document relevance judgments
with their scores into the dictionary “dicReqDoc”.


text2TokenList performs a few preprocessing steps: it loads the stop-word list, tokenizes the text,
removes stop words and words shorter than 3 letters, and returns the result.


run_bm25_only runs BM25 on the preprocessed corpus and returns the mean NDCG (Normalized Discounted
Cumulative Gain) over the queries.
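
As a reminder of what BM25 computes, the sketch below scores tokenized documents against a tokenized query using the textbook Okapi formula with a smoothed, non-negative IDF. It is an illustration only: the notebook uses `BM25Okapi` from `rank_bm25`, whose implementation may differ in details such as the IDF variant, and the function name and toy documents here are assumptions.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against a tokenized query with BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents each term occurs
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        # Longer-than-average documents are penalised, controlled by b
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates as it grows, controlled by k1
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores

docs = [
    ["fish", "oil", "capsules", "consumption"],
    ["phosphate", "vascular", "toxin"],
    ["cooking", "oil", "milk"],
]
scores = bm25_scores(["fish", "oil"], docs)
print(scores)
```

The first document matches both query terms and scores highest, the third matches only the more common term "oil", and the second matches nothing and scores zero.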

### Loading NFCorpus

In [2]:
def loadNFCorpus():
    """Load the NFCorpus dev split: documents, queries and relevance judgments."""
    data_dir = "../data/raw/"  # renamed from `dir` to avoid shadowing the built-in

    # Documents: one "<doc id>\t<text>" pair per line
    dicDoc = {}
    with open(data_dir + "dev.docs", 'r', encoding='utf-8') as file:
        for line in file:
            tabLine = line.split('\t')
            dicDoc[tabLine[0]] = tabLine[1]

    # Queries: one "<query id>\t<text>" pair per line
    dicReq = {}
    with open(data_dir + "dev.all.queries", 'r', encoding='utf-8') as file:
        for line in file:
            tabLine = line.split('\t')
            dicReq[tabLine[0]] = tabLine[1]

    # Relevance judgments: query id in column 0, document id in column 2, score in column 3
    dicReqDoc = defaultdict(dict)
    with open(data_dir + "dev.2-1-0.qrel", 'r', encoding='utf-8') as file:
        for line in file:
            tabLine = line.strip().split('\t')
            dicReqDoc[tabLine[0]][tabLine[2]] = int(tabLine[3])

    return dicDoc, dicReq, dicReqDoc
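
The tab-separated layout that loadNFCorpus relies on can be illustrated with in-memory files. The identifiers, texts and the value in the second qrel column below are made up for the example; only the column positions (query in column 0, document in column 2, score in column 3) come from the loader above.

```python
import io
from collections import defaultdict

# Hypothetical two-line excerpts; the real files are much larger.
docs_file = io.StringIO("MED-1\tfirst abstract text\nMED-2\tsecond abstract text\n")
qrel_file = io.StringIO("PLAIN-1\t0\tMED-1\t2\nPLAIN-1\t0\tMED-2\t1\n")

# Documents: id, tab, text
docs = {}
for line in docs_file:
    doc_id, text = line.rstrip("\n").split("\t", 1)
    docs[doc_id] = text

# Judgments: nested dictionary query id -> document id -> relevance score
qrels = defaultdict(dict)
for line in qrel_file:
    fields = line.strip().split("\t")
    qrels[fields[0]][fields[2]] = int(fields[3])

print(docs)
print(dict(qrels))
```

Unlike loadNFCorpus, this sketch strips the trailing newline from each text, which is why the notebook output for a document ends with a line break and this one would not.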

In [3]:
dicDoc, dicReq, dicReqDoc = loadNFCorpus()

print("Contents of dicDoc (first 5 items):")  # each item holds a document ID and its text
for i, (key, value) in enumerate(dicDoc.items()):
    if i >= 5:
        break
    print(f"Document ID: {key}")
    print(f"Content: {value}...")
    print()

Contents of dicDoc (first 5 items):
Document ID: MED-118
Content: alkylphenols human milk relations dietary habits central taiwan pubmed ncbi abstract aims study determine concentrations num nonylphenol np num octylphenol op num human milk samples examine related factors including mothers demographics dietary habits women consumed median amount cooking oil significantly higher op concentrations num ng/g consumed num ng/g num op concentration significantly consumption cooking oil beta num num fish oil capsules beta num num adjustment age body mass index bmi np concentration significantly consumption fish oil capsules beta num num processed fish products beta num num food pattern cooking oil processed meat products factor analysis strongly op concentration human milk num determinations aid suggesting foods consumption nursing mothers order protect infants np/op exposure num elsevier rights reserved 
...

Document ID: MED-329
Content: phosphate vascular toxin pubmed ncbi abstract el

## Baseline BM25

In [4]:
def text2TokenList(text):
    """Lowercase and tokenize a text, dropping stop words and tokens shorter than 3 characters."""
    stopword = set(stopwords.words('english'))
    word_tokens = word_tokenize(text.lower())
    word_tokens_without_stops = [word for word in word_tokens if word not in stopword and len(word) > 2]
    return word_tokens_without_stops

def run_bm25_only(startDoc, endDoc):
    dicDoc, dicReq, dicReqDoc = loadNFCorpus()

    docsToKeep = []
    reqsToKeep = []
    dicReqDocToKeep = defaultdict(dict)

    ndcgTop = 5

    # Keep whole queries, with all their judged documents, until the counter i
    # (initialised to startDoc and incremented once per query-document judgment)
    # exceeds endDoc - startDoc. With startDoc = 0 this caps the number of
    # judgments, not the number of distinct documents.
    i = startDoc
    for reqId in dicReqDoc:
        if i > (endDoc - startDoc):
            break
        for docId in dicReqDoc[reqId]:
            dicReqDocToKeep[reqId][docId] = dicReqDoc[reqId][docId]
            docsToKeep.append(docId)
            i = i + 1
        reqsToKeep.append(reqId)
    docsToKeep = list(set(docsToKeep))

    # Tokenize the kept documents and remember each document's position in the index
    corpusDocTokenList = []
    corpusDicoDocName = {}
    for pos, k in enumerate(docsToKeep):
        corpusDocTokenList.append(text2TokenList(dicDoc[k]))
        corpusDicoDocName[k] = pos

    # Tokenize the kept queries
    corpusReqTokenList = {k: text2TokenList(dicReq[k]) for k in reqsToKeep}

    bm25 = BM25Okapi(corpusDocTokenList)

    ndcgBM25Cumul = 0
    nbReq = 0
    for req in corpusReqTokenList:
        doc_scores = bm25.get_scores(corpusReqTokenList[req])

        # Ground-truth relevance vector, aligned with the BM25 document order
        trueDocs = np.zeros(len(corpusDocTokenList))
        for docId, relevance in dicReqDocToKeep[req].items():
            trueDocs[corpusDicoDocName[docId]] = relevance

        ndcgBM25Cumul = ndcgBM25Cumul + ndcg_score([trueDocs], [doc_scores], k=ndcgTop)
        nbReq = nbReq + 1
    return ndcgBM25Cumul / nbReq

first_doc_key = next(iter(dicDoc))  # look at the first document of the corpus and at the output of our preprocessing on it
first_doc_value = dicDoc[first_doc_key]

print(f"Key of the first document: {first_doc_key}")
print(f"Content of the first document: {first_doc_value}")
print(text2TokenList(first_doc_value))

nb_docs = 150
print(f"Result with only {nb_docs} documents taken into account: {run_bm25_only(0, nb_docs)}")

Key of the first document: MED-118
Content of the first document: alkylphenols human milk relations dietary habits central taiwan pubmed ncbi abstract aims study determine concentrations num nonylphenol np num octylphenol op num human milk samples examine related factors including mothers demographics dietary habits women consumed median amount cooking oil significantly higher op concentrations num ng/g consumed num ng/g num op concentration significantly consumption cooking oil beta num num fish oil capsules beta num num adjustment age body mass index bmi np concentration significantly consumption fish oil capsules beta num num processed fish products beta num num food pattern cooking oil processed meat products factor analysis strongly op concentration human milk num determinations aid suggesting foods consumption nursing mothers order protect infants np/op exposure num elsevier rights reserved 

['alkylphenols', 'human', 'milk', 'relations', 'dietary', 'habits', 'central', 'taiwan', '

In [6]:
nb_docs = len(dicDoc)
print(f"Result with {nb_docs} documents taken into account: {run_bm25_only(0, nb_docs)}")

Result with 3193 documents taken into account: 0.46335714418774215


Using the whole corpus instead of the 150-document subset gives a much lower nDCG score. This does not mean that the model is worse: the score on the larger corpus is simply a more realistic measure of the system's performance. More documents means more noise, and it becomes harder to place the relevant documents in the top 5 positions that nDCG@5 rewards. We will keep running BM25 on the 150-document subset so that the code executes faster, but evaluating on the full corpus would be more accurate.
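
The effect of a larger candidate pool can be reproduced on a toy ranking. The sketch below is an illustration under simplifying assumptions (a single relevant document, linear gains, hypothetical helper names): it shows how nDCG@5 falls as more irrelevant documents outrank the relevant one, and reaches zero once the relevant document leaves the top 5.

```python
import math

def ndcg_at_5(relevances_in_ranked_order):
    """nDCG@5 with linear gains for relevance grades given in ranked order."""
    def dcg(rels):
        return sum(r / math.log2(pos + 2) for pos, r in enumerate(rels[:5]))
    ideal = dcg(sorted(relevances_in_ranked_order, reverse=True))
    return dcg(relevances_in_ranked_order) / ideal if ideal else 0.0

def ranking_with_distractors(n_distractors):
    # One relevant document (grade 1) outranked by n_distractors irrelevant ones
    return [0] * n_distractors + [1]

for n in (0, 2, 4, 5):
    print(n, round(ndcg_at_5(ranking_with_distractors(n)), 3))
```

A small subset simply contains fewer candidates that can end up above the relevant documents, which is one reason scores on the subset look better than scores on the full corpus.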

### Alternative preprocessing

The function in charge of preprocessing in the baseline code is text2TokenList. Let us make a few changes to it.

In [5]:
def text2TokenList2(text, use_bigrams=True, use_trigrams=True):
    stopword = set(stopwords.words('english')) #load the list of english stopwords
    text = re.sub(r'[^a-zA-Z]', ' ', text) # delete characters that are not letters
    text = text.lower() #convert words into lowercase
    word_tokens = word_tokenize(text) # tokenize our sentences into tokens (words)
    word_tokens_filtered = [word for word in word_tokens if word not in stopword and len(word) > 2] #filter stopwords and short words
    lemmatizer = WordNetLemmatizer() #lemmatize our tokens
    word_tokens_lemmatized = [lemmatizer.lemmatize(token) for token in word_tokens_filtered]
    if use_bigrams: #add bigrams
        bigrams = list(nltk.bigrams(word_tokens_filtered))
        word_tokens_lemmatized.extend([' '.join(bigram) for bigram in bigrams])
    if use_trigrams: #add trigrams
        trigrams = list(nltk.trigrams(word_tokens_filtered))
        word_tokens_lemmatized.extend([' '.join(trigram) for trigram in trigrams])
    return word_tokens_lemmatized

def run_bm25_only2(startDoc, endDoc):
    dicDoc, dicReq, dicReqDoc = loadNFCorpus()

    docsToKeep = []
    reqsToKeep = []
    dicReqDocToKeep = defaultdict(dict)

    ndcgTop = 5

    # Same subset selection as run_bm25_only: whole queries are kept until the
    # judgment counter i exceeds endDoc - startDoc.
    i = startDoc
    for reqId in dicReqDoc:
        if i > (endDoc - startDoc):
            break
        for docId in dicReqDoc[reqId]:
            dicReqDocToKeep[reqId][docId] = dicReqDoc[reqId][docId]
            docsToKeep.append(docId)
            i = i + 1
        reqsToKeep.append(reqId)
    docsToKeep = list(set(docsToKeep))

    # Tokenize documents and queries with the new preprocessing
    corpusDocTokenList = []
    corpusDicoDocName = {}
    for pos, k in enumerate(docsToKeep):
        corpusDocTokenList.append(text2TokenList2(dicDoc[k]))
        corpusDicoDocName[k] = pos
    corpusReqTokenList = {k: text2TokenList2(dicReq[k]) for k in reqsToKeep}

    bm25 = BM25Okapi(corpusDocTokenList)

    ndcgBM25Cumul = 0
    nbReq = 0
    for req in corpusReqTokenList:
        doc_scores = bm25.get_scores(corpusReqTokenList[req])

        # Ground-truth relevance vector, aligned with the BM25 document order
        trueDocs = np.zeros(len(corpusDocTokenList))
        for docId, relevance in dicReqDocToKeep[req].items():
            trueDocs[corpusDicoDocName[docId]] = relevance

        ndcgBM25Cumul = ndcgBM25Cumul + ndcg_score([trueDocs], [doc_scores], k=ndcgTop)
        nbReq = nbReq + 1
    return ndcgBM25Cumul / nbReq

first_doc_key = next(iter(dicDoc))  # look at the first document of the corpus and at the output of our preprocessing on it
first_doc_value = dicDoc[first_doc_key]

print(f"Key of the first document: {first_doc_key}")
print(f"Content of the first document: {first_doc_value}")
print(text2TokenList2(first_doc_value, use_bigrams=True, use_trigrams=True))


nb_docs = 150
print(f"Result with only {nb_docs} documents taken into account: {run_bm25_only2(0, nb_docs)}")



Key of the first document: MED-118
Content of the first document: alkylphenols human milk relations dietary habits central taiwan pubmed ncbi abstract aims study determine concentrations num nonylphenol np num octylphenol op num human milk samples examine related factors including mothers demographics dietary habits women consumed median amount cooking oil significantly higher op concentrations num ng/g consumed num ng/g num op concentration significantly consumption cooking oil beta num num fish oil capsules beta num num adjustment age body mass index bmi np concentration significantly consumption fish oil capsules beta num num processed fish products beta num num food pattern cooking oil processed meat products factor analysis strongly op concentration human milk num determinations aid suggesting foods consumption nursing mothers order protect infants np/op exposure num elsevier rights reserved 

['alkylphenols', 'human', 'milk', 'relation', 'dietary', 'habit', 'central', 'taiwan', 'pu

Here we improved the tokenization function by adding lemmatization, the removal of non-alphabetic characters, and options to include bigrams and trigrams in the vocabulary. These changes are meant to capture more of the meaning of words and of the relations between them, which enriches the representation of documents and queries that BM25 works with.

Many short noise tokens such as "num" still get through (two-letter ones such as "op" are already dropped by the length filter), so we filter them out by keeping only words of 4 letters or more.
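
The length filter and the n-gram expansion can be sketched without NLTK. The helper below is a simplified illustration (no stop-word removal, no lemmatization, hypothetical function name): it keeps tokens of at least four characters and appends underscore-joined bigrams and trigrams built from the kept tokens.

```python
def add_ngrams(tokens, min_len=4, sep="_"):
    """Keep tokens of at least min_len characters, then append bigrams and trigrams."""
    kept = [t for t in tokens if len(t) >= min_len]
    bigrams = [sep.join(pair) for pair in zip(kept, kept[1:])]
    trigrams = [sep.join(triple) for triple in zip(kept, kept[1:], kept[2:])]
    return kept + bigrams + trigrams

tokens = ["human", "milk", "num", "samples", "op"]
print(add_ngrams(tokens))
```

Because the n-grams are built after filtering, removing "num" makes "milk" and "samples" adjacent, so the bigram "milk_samples" appears even though the two words were not neighbours in the original text.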

In [6]:
def text2TokenList3(text, use_bigrams=True, use_trigrams=True):
    stopword = set(stopwords.words('english')) #load the list of english stopwords
    text = re.sub(r'[^a-zA-Z]', ' ', text) # delete characters that are not letters
    text = text.lower() #convert words into lowercase
    word_tokens = word_tokenize(text) # tokenize our sentences into tokens (words)
    word_tokens_filtered = [word for word in word_tokens if word not in stopword and len(word) > 3] #filter stopwords and words of 3 letters or less
    lemmatizer = WordNetLemmatizer() #lemmatize our tokens
    word_tokens_lemmatized = [lemmatizer.lemmatize(token) for token in word_tokens_filtered]
    if use_bigrams: #add bigrams
        bigrams = list(nltk.bigrams(word_tokens_filtered))
        word_tokens_lemmatized.extend(['_'.join(bigram) for bigram in bigrams])
    if use_trigrams: #add trigrams
        trigrams = list(nltk.trigrams(word_tokens_filtered))
        word_tokens_lemmatized.extend(['_'.join(trigram) for trigram in trigrams])
    return word_tokens_lemmatized

def run_bm25_only3(startDoc, endDoc):
    dicDoc, dicReq, dicReqDoc = loadNFCorpus()

    docsToKeep = []
    reqsToKeep = []
    dicReqDocToKeep = defaultdict(dict)

    ndcgTop = 5

    # Same subset selection as run_bm25_only: whole queries are kept until the
    # judgment counter i exceeds endDoc - startDoc.
    i = startDoc
    for reqId in dicReqDoc:
        if i > (endDoc - startDoc):
            break
        for docId in dicReqDoc[reqId]:
            dicReqDocToKeep[reqId][docId] = dicReqDoc[reqId][docId]
            docsToKeep.append(docId)
            i = i + 1
        reqsToKeep.append(reqId)
    docsToKeep = list(set(docsToKeep))

    # Tokenize documents and queries with the stricter preprocessing
    corpusDocTokenList = []
    corpusDicoDocName = {}
    for pos, k in enumerate(docsToKeep):
        corpusDocTokenList.append(text2TokenList3(dicDoc[k]))
        corpusDicoDocName[k] = pos
    corpusReqTokenList = {k: text2TokenList3(dicReq[k]) for k in reqsToKeep}

    bm25 = BM25Okapi(corpusDocTokenList)

    ndcgBM25Cumul = 0
    nbReq = 0
    for req in corpusReqTokenList:
        doc_scores = bm25.get_scores(corpusReqTokenList[req])

        # Ground-truth relevance vector, aligned with the BM25 document order
        trueDocs = np.zeros(len(corpusDocTokenList))
        for docId, relevance in dicReqDocToKeep[req].items():
            trueDocs[corpusDicoDocName[docId]] = relevance

        ndcgBM25Cumul = ndcgBM25Cumul + ndcg_score([trueDocs], [doc_scores], k=ndcgTop)
        nbReq = nbReq + 1
    return ndcgBM25Cumul / nbReq

first_doc_key = next(iter(dicDoc))  # look at the first document of the corpus and at the output of our preprocessing on it
first_doc_value = dicDoc[first_doc_key]

print(f"Key of the first document: {first_doc_key}")
print(f"Content of the first document: {first_doc_value}")
print(text2TokenList3(first_doc_value, use_bigrams=True, use_trigrams=True))


nb_docs = 150
print(f"Result with only {nb_docs} documents taken into account: {run_bm25_only3(0, nb_docs)}")

Key of the first document: MED-118
Content of the first document: alkylphenols human milk relations dietary habits central taiwan pubmed ncbi abstract aims study determine concentrations num nonylphenol np num octylphenol op num human milk samples examine related factors including mothers demographics dietary habits women consumed median amount cooking oil significantly higher op concentrations num ng/g consumed num ng/g num op concentration significantly consumption cooking oil beta num num fish oil capsules beta num num adjustment age body mass index bmi np concentration significantly consumption fish oil capsules beta num num processed fish products beta num num food pattern cooking oil processed meat products factor analysis strongly op concentration human milk num determinations aid suggesting foods consumption nursing mothers order protect infants np/op exposure num elsevier rights reserved 

['alkylphenols', 'human', 'milk', 'relation', 'dietary', 'habit', 'central', 'taiwan', 'pu

Grid search over the BM25 hyperparameters k1 and b

In [12]:
def run_bm25_with_params(startDoc, endDoc, k1, b):
    dicDoc, dicReq, dicReqDoc = loadNFCorpus()

    docsToKeep = []
    reqsToKeep = []
    dicReqDocToKeep = defaultdict(dict)

    ndcgTop = 5

    i = startDoc
    for reqId in dicReqDoc:
        if i > (endDoc - startDoc):
            break
        for docId in dicReqDoc[reqId]:
            dicReqDocToKeep[reqId][docId] = dicReqDoc[reqId][docId]
            docsToKeep.append(docId)
            i += 1
        reqsToKeep.append(reqId)
    docsToKeep = list(set(docsToKeep))

    corpusDocTokenList = []
    corpusReqTokenList = {}
    corpusDocName = []
    corpusDicoDocName = {}

    for i, k in enumerate(docsToKeep):
        docTokenList = text2TokenList3(dicDoc[k])
        corpusDocTokenList.append(docTokenList)
        corpusDocName.append(k)
        corpusDicoDocName[k] = i

    for k in reqsToKeep:
        reqTokenList = text2TokenList3(dicReq[k])
        corpusReqTokenList[k] = reqTokenList

    bm25 = BM25Okapi(corpusDocTokenList, k1=k1, b=b)

    ndcgBM25Cumul = 0
    nbReq = 0

    for req in corpusReqTokenList:
        reqTokenList = corpusReqTokenList[req]
        doc_scores = bm25.get_scores(reqTokenList)
        trueDocs = np.zeros(len(corpusDocTokenList))

        if req in dicReqDocToKeep:
            for docId, relevance in dicReqDocToKeep[req].items():
                if docId in corpusDicoDocName:
                    posDocId = corpusDicoDocName[docId]
                    trueDocs[posDocId] = relevance

        ndcgBM25Cumul += ndcg_score([trueDocs], [doc_scores], k=ndcgTop)
        nbReq += 1

    ndcgBM25Cumul /= nbReq
    return ndcgBM25Cumul

# Values to test for k1 and b
k1_values = [0.5, 1.0, 1.2, 1.5, 2.0]
b_values = [0.5, 0.6, 0.75, 0.8, 0.9]

best_score = 0
best_params = None

nb_docs = 150  # number of documents to consider

for k1 in k1_values:
    for b in b_values:
        score = run_bm25_with_params(0, nb_docs, k1, b)
        if score > best_score:
            best_score = score
            best_params = (k1, b)

print(f"Best parameters: k1={best_params[0]}, b={best_params[1]}")

# Run with the best parameters
final_score = run_bm25_with_params(0, nb_docs, best_params[0], best_params[1])
print(f"Final score with the best parameters: {final_score}")

Best parameters: k1=1.5, b=0.75
Final score with the best parameters: 0.868165217456096


The highest nDCG@5 we reach by changing the preprocessing and tuning the BM25 hyperparameters is about 0.87, measured on the 150-document subset.
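
The search pattern used above generalises to any number of hyperparameters. The sketch below is a minimal, hypothetical version built on `itertools.product`; the objective is a toy function that peaks at k1 = 1.5 and b = 0.75 purely for illustration, standing in for a real evaluation such as run_bm25_with_params.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Return (best_score, best_params) over the Cartesian product of the grid."""
    names = list(grid)
    best_score, best_params = float("-inf"), None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params

# Hypothetical stand-in for a real evaluation function
def toy_objective(k1, b):
    return -((k1 - 1.5) ** 2) - (b - 0.75) ** 2

grid = {"k1": [0.5, 1.0, 1.2, 1.5, 2.0], "b": [0.5, 0.6, 0.75, 0.8, 0.9]}
best_score, best_params = grid_search(toy_objective, grid)
print(best_score, best_params)
```

Starting from negative infinity rather than 0 means a best configuration is always returned, even if every score is zero or negative. Note also that selecting hyperparameters on the same queries that are used for reporting tends to give an optimistic score.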

### Building the Word2Vec model

Let us explore the use of a Word2Vec model. Word2Vec is a method that turns tokens into vectors while preserving the semantic relations between words,
which helps NLP models understand and manipulate natural language more precisely.
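
To see how word vectors turn into a retrieval score, the sketch below averages the vectors of a text's tokens and compares texts by cosine similarity. The three-dimensional embeddings are invented for the example; a trained Word2Vec model would supply real ones, and the helper names are hypothetical.

```python
import math

# Hypothetical 3-dimensional embeddings; a trained model would supply real ones.
embeddings = {
    "fish": [0.9, 0.1, 0.0],
    "oil": [0.8, 0.2, 0.1],
    "vascular": [0.0, 0.9, 0.4],
    "toxin": [0.1, 0.8, 0.5],
}

def mean_vector(tokens):
    """Average the vectors of the tokens that are in the vocabulary."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # nothing to average: every token is out of vocabulary
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

query = mean_vector(["fish", "unknown"])
doc_a = mean_vector(["fish", "oil"])
doc_b = mean_vector(["vascular", "toxin"])
print(cosine(query, doc_a), cosine(query, doc_b))
```

Out-of-vocabulary tokens are skipped, and a text with no known token has no vector at all, a case the scoring code has to handle explicitly.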

In [13]:
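# Train on the whole corpus, tokenized with text2TokenList2 (n-grams joined with spaces).
# Note: run_w2v below tokenizes with text2TokenList3 (n-grams joined with "_"),
# so only its unigrams can be found in this model's vocabulary.
tokenized_corpus = [text2TokenList2(doc, use_bigrams=True, use_trigrams=True) for doc in dicDoc.values()]

model = Word2Vec(tokenized_corpus, vector_size=400, window=8, min_count=3, workers=6, sg=0, epochs=100)  # create our word2vec model
bm25 = BM25Okapi(tokenized_corpus)  # BM25 index over the same tokens (not used by run_w2v)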

def run_w2v(startDoc, endDoc, model):   
    dicDoc, dicReq, dicReqDoc = loadNFCorpus()

    docsToKeep = []
    reqsToKeep = []
    dicReqDocToKeep = defaultdict(dict)

    ndcgTop = 5

    i = startDoc
    for reqId in dicReqDoc:
        if i > (endDoc - startDoc):
            break
        for docId in dicReqDoc[reqId]:
            dicReqDocToKeep[reqId][docId] = dicReqDoc[reqId][docId]
            docsToKeep.append(docId)
            i += 1
        reqsToKeep.append(reqId)
    docsToKeep = list(set(docsToKeep))

    corpusDocTokenList = []
    corpusReqTokenList = {}
    corpusDocName = []
    corpusDicoDocName = {}

    for i, k in enumerate(docsToKeep):
        docTokenList = text2TokenList3(dicDoc[k])
        corpusDocTokenList.append(docTokenList)
        corpusDocName.append(k)
        corpusDicoDocName[k] = i
        
    for k in reqsToKeep:
        reqTokenList = text2TokenList3(dicReq[k])
        corpusReqTokenList[k] = reqTokenList

    ndcgCumul = 0
    nbReq = 0
    
    for req in corpusReqTokenList:
        reqTokenList = corpusReqTokenList[req]
        
        # Mean vector of the query
        req_vector = np.mean([model.wv[word] for word in reqTokenList if word in model.wv], axis=0)

        # Similarity score of each document
        w2v_scores = []
        for doc in corpusDocTokenList:
            doc_vector = np.mean([model.wv[word] for word in doc if word in model.wv], axis=0)
            similarity = np.dot(req_vector, doc_vector) / (np.linalg.norm(req_vector) * np.linalg.norm(doc_vector))
            w2v_scores.append(similarity)
        
        w2v_scores = np.array(w2v_scores)
        
        trueDocs = np.zeros(len(corpusDocTokenList))

        if req in dicReqDocToKeep:
            for docId, relevance in dicReqDocToKeep[req].items():
                if docId in corpusDicoDocName:
                    posDocId = corpusDicoDocName[docId]
                    trueDocs[posDocId] = relevance

        # Sort the documents by decreasing score
        sorted_indices = np.argsort(w2v_scores)[::-1]
        sorted_scores = w2v_scores[sorted_indices]
        sorted_true_docs = trueDocs[sorted_indices]

        # NDCG score
        ndcgCumul += ndcg_score([sorted_true_docs], [sorted_scores], k=ndcgTop)
        nbReq += 1
        
    ndcgAverage = ndcgCumul / nbReq
    return ndcgAverage

# Run the function
result = run_w2v(0, 150, model)
print(f"Mean NDCG@5 score: {result}")

Mean NDCG@5 score: 0.4792125082013293


Let us try to improve the model with a grid search over its hyperparameters and several pooling methods.
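
Pooling is the step that turns the list of word vectors of a text into a single vector. The sketch below writes three common strategies in plain Python on toy two-dimensional vectors; the function names are illustrative, and the position-based weights 1 / (i + 1) are one possible choice among many.

```python
def mean_pool(vectors):
    """Element-wise average of the vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def concat_pool(vectors):
    # Mean and element-wise max side by side: the output has twice the dimension
    return mean_pool(vectors) + [max(col) for col in zip(*vectors)]

def weighted_pool(vectors):
    # The token at position i gets weight 1 / (i + 1), so early tokens dominate
    weights = [1 / (i + 1) for i in range(len(vectors))]
    total = sum(weights)
    return [sum(w * x for w, x in zip(weights, col)) / total for col in zip(*vectors)]

vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(mean_pool(vectors))
print(concat_pool(vectors))
print(weighted_pool(vectors))
```

With position-based weights, the first vector counts more than in the plain mean. Query and document vectors must be pooled with the same method, since the concatenated variant changes the dimension.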

In [14]:
def load_and_preprocess_data(startDoc, endDoc):
    dicDoc, dicReq, dicReqDoc = loadNFCorpus()

    docsToKeep = []
    reqsToKeep = []
    dicReqDocToKeep = defaultdict(dict)

    i = startDoc
    for reqId in dicReqDoc:
        if i > (endDoc - startDoc):
            break
        for docId in dicReqDoc[reqId]:
            dicReqDocToKeep[reqId][docId] = dicReqDoc[reqId][docId]
            docsToKeep.append(docId)
            i += 1
        reqsToKeep.append(reqId)
    docsToKeep = list(set(docsToKeep))

    corpusDocTokenList = []
    corpusReqTokenList = {}
    corpusDocName = []
    corpusDicoDocName = {}

    for i, k in enumerate(docsToKeep):
        docTokenList = text2TokenList3(dicDoc[k])
        corpusDocTokenList.append(docTokenList)
        corpusDocName.append(k)
        corpusDicoDocName[k] = i
        
    for k in reqsToKeep:
        reqTokenList = text2TokenList3(dicReq[k])
        corpusReqTokenList[k] = reqTokenList

    return corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName

def advanced_pooling(vectors, method='concat'):
    if method == 'concat':
        return np.concatenate([np.mean(vectors, axis=0), np.max(vectors, axis=0)])
    elif method == 'weighted':
        weights = np.array([1 / (i + 1) for i in range(len(vectors))])
        return np.average(vectors, axis=0, weights=weights)
    else:
        return np.mean(vectors, axis=0)

def run_w2v(model, corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName, pooling_method):
    ndcgTop = 5
    ndcgCumul = 0
    nbReq = 0
    
    for req in corpusReqTokenList:
        reqTokenList = corpusReqTokenList[req]
        
        req_vectors = [model.wv[word] for word in reqTokenList if word in model.wv]
        if not req_vectors:
            continue
        req_vector = advanced_pooling(req_vectors, method=pooling_method)
        
        w2v_scores = []
        for doc in corpusDocTokenList:
            doc_vectors = [model.wv[word] for word in doc if word in model.wv]
            if not doc_vectors:
                w2v_scores.append(0)
                continue
            doc_vector = advanced_pooling(doc_vectors, method=pooling_method)
            similarity = np.dot(req_vector, doc_vector) / (np.linalg.norm(req_vector) * np.linalg.norm(doc_vector))
            w2v_scores.append(similarity)
        
        w2v_scores = np.array(w2v_scores)
        
        trueDocs = np.zeros(len(corpusDocTokenList))

        if req in dicReqDocToKeep:
            for docId, relevance in dicReqDocToKeep[req].items():
                if docId in corpusDicoDocName:
                    posDocId = corpusDicoDocName[docId]
                    trueDocs[posDocId] = relevance

        sorted_indices = np.argsort(w2v_scores)[::-1]
        sorted_scores = w2v_scores[sorted_indices]
        sorted_true_docs = trueDocs[sorted_indices]

        ndcgCumul += ndcg_score([sorted_true_docs], [sorted_scores], k=ndcgTop)
        nbReq += 1
        
    ndcgAverage = ndcgCumul / nbReq
    return ndcgAverage

def grid_search_w2v(corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName):
    param_grid = {
        'vector_size': [200, 300, 400],
        'window': [5, 8, 10],
        'min_count': [2, 3, 5],
        'epochs': [50, 100, 150],
        'sg': [0, 1],  # 0 for CBOW, 1 for Skip-gram
        'pooling_method': ['mean', 'concat', 'weighted']
    }

    best_score = 0
    best_params = None

    for params in ParameterGrid(param_grid):
        print(f"Testing parameters: {params}")
        model = Word2Vec(corpusDocTokenList, 
                         vector_size=params['vector_size'],
                         window=params['window'],
                         min_count=params['min_count'],
                         workers=6,
                         sg=params['sg'],
                         epochs=params['epochs'])
        
        score = run_w2v(model, corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName, params['pooling_method'])
        
        print(f"NDCG@5 score: {score}")
        
        if score > best_score:
            best_score = score
            best_params = params

    return best_score, best_params

corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName = load_and_preprocess_data(0, 150)

best_score, best_params = grid_search_w2v(corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName)

print(f"Best NDCG@5 score: {best_score}")
print(f"Best parameters: {best_params}")

final_model = Word2Vec(corpusDocTokenList, 
                       vector_size=best_params['vector_size'],
                       window=best_params['window'],
                       min_count=best_params['min_count'],
                       sg=best_params['sg'],
                       epochs=best_params['epochs'])

# Run the final model
final_score = run_w2v(final_model, corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName, best_params['pooling_method'])
print(f"Final NDCG@5 score: {final_score}")

Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'sg': 0, 'vector_size': 200, 'window': 5}
NDCG@5 score: 0.5681226486459744
Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'sg': 0, 'vector_size': 200, 'window': 8}
NDCG@5 score: 0.5504401110490339
Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'sg': 0, 'vector_size': 200, 'window': 10}
NDCG@5 score: 0.5533790801482966
Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'sg': 0, 'vector_size': 300, 'window': 5}
NDCG@5 score: 0.6005571642579128
Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'sg': 0, 'vector_size': 300, 'window': 8}
NDCG@5 score: 0.6124880887427078
Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'sg': 0, 'vector_size': 300, 'window': 10}
NDCG@5 score: 0.5709311769478864
Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'sg': 0, 
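The three values of `pooling_method` explored by the grid search can be illustrated on toy vectors. The sketch below is a minimal, standalone illustration: the helper name `pool` and the toy data are assumptions made for this example, not the notebook's own code.

```python
import numpy as np

def pool(vectors, method="mean"):
    """Collapse a list of word vectors into one vector (illustrative helper)."""
    vectors = np.asarray(vectors, dtype=float)
    if method == "concat":
        # Mean and element-wise max side by side: the output is twice as long
        return np.concatenate([vectors.mean(axis=0), vectors.max(axis=0)])
    if method == "weighted":
        # Position-based weights 1, 1/2, 1/3, ...: earlier tokens count more
        weights = 1.0 / np.arange(1, len(vectors) + 1)
        return np.average(vectors, axis=0, weights=weights)
    # Default: plain average of the word vectors
    return vectors.mean(axis=0)

toy = [[1.0, 0.0], [0.0, 1.0]]  # two hypothetical 2-dimensional word vectors
mean_vec = pool(toy, "mean")
weighted_vec = pool(toy, "weighted")
concat_vec = pool(toy, "concat")
print(mean_vec, weighted_vec, concat_vec)
```

With `concat`, query and document vectors both double in size, so the cosine similarity stays well defined as long as the same method is applied on both sides.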

Let us now combine Word2Vec and BM25.

In [29]:
def run_combined_model(corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName):
    vector_size = 300
    window = 8
    min_count = 2
    epochs = 100
    sg = 0
    pooling_method = 'weighted'

    k1 = 1.5
    b = 0.75

    w2v_model = Word2Vec(corpusDocTokenList, 
                         vector_size=vector_size,
                         window=window,
                         min_count=min_count,
                         sg=sg,
                         epochs=epochs)
    
    bm25 = BM25Okapi(corpusDocTokenList, k1=k1, b=b)

    ndcgTop = 5
    ndcgCumul = 0
    nbReq = 0

    for req in corpusReqTokenList:
        reqTokenList = corpusReqTokenList[req]
        
        bm25_scores = np.array(bm25.get_scores(reqTokenList))
        
        req_vectors = [w2v_model.wv[word] for word in reqTokenList if word in w2v_model.wv]
        if not req_vectors:
            continue
        req_vector = advanced_pooling(req_vectors, method=pooling_method)
        
        w2v_scores = []
        for doc in corpusDocTokenList:
            doc_vectors = [w2v_model.wv[word] for word in doc if word in w2v_model.wv]
            if not doc_vectors:
                w2v_scores.append(0)
                continue
            doc_vector = advanced_pooling(doc_vectors, method=pooling_method)
            similarity = np.dot(req_vector, doc_vector) / (np.linalg.norm(req_vector) * np.linalg.norm(doc_vector))
            w2v_scores.append(similarity)
        
        w2v_scores = np.array(w2v_scores)
        
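        # Min-max normalise both score vectors to [0, 1]; the small epsilon avoids a
        # division by zero when all scores for a query are identical
        eps = 1e-12
        bm25_scores = (bm25_scores - np.min(bm25_scores)) / (np.max(bm25_scores) - np.min(bm25_scores) + eps)
        w2v_scores = (w2v_scores - np.min(w2v_scores)) / (np.max(w2v_scores) - np.min(w2v_scores) + eps)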
        
        combined_scores = 0.7 * bm25_scores + 0.3 * w2v_scores
        
        trueDocs = np.zeros(len(corpusDocTokenList))

        if req in dicReqDocToKeep:
            for docId, relevance in dicReqDocToKeep[req].items():
                if docId in corpusDicoDocName:
                    posDocId = corpusDicoDocName[docId]
                    trueDocs[posDocId] = relevance

        sorted_indices = np.argsort(combined_scores)[::-1]
        sorted_scores = combined_scores[sorted_indices]
        sorted_true_docs = trueDocs[sorted_indices]

        ndcgCumul += ndcg_score([sorted_true_docs], [sorted_scores], k=ndcgTop)
        nbReq += 1
        
    ndcgAverage = ndcgCumul / nbReq
    return ndcgAverage

corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName = load_and_preprocess_data(0, 150)

final_score = run_combined_model(corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName)
print(f"Final NDCG@5 score (combined model): {final_score}")

Final NDCG@5 score (combined model): 0.8694526177945482
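The score fusion used in the cell above (min-max normalisation followed by a 0.7 / 0.3 weighted sum) can be sketched in isolation. This is a minimal sketch on made-up scores: the helper names `min_max` and `fuse`, and the toy values, are assumptions for illustration only. It also returns zeros when all scores are equal, a case the normalisation would otherwise turn into a division by zero.

```python
import numpy as np

def min_max(scores):
    """Rescale scores to [0, 1]; return zeros if all scores are identical."""
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def fuse(lexical, semantic, alpha=0.7):
    """Weighted sum of normalised lexical (BM25) and semantic (embedding) scores."""
    return alpha * min_max(lexical) + (1 - alpha) * min_max(semantic)

bm25_toy = [2.0, 0.0, 4.0]   # hypothetical BM25 scores for three documents
w2v_toy = [0.1, 0.9, 0.5]    # hypothetical cosine similarities for the same documents
combined = fuse(bm25_toy, w2v_toy)
ranking = np.argsort(combined)[::-1]  # best document first
print(combined, ranking)
```

Normalising first matters because BM25 scores are unbounded while cosine similarities lie in [-1, 1]; without it the weights 0.7 and 0.3 would not reflect the intended balance.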


Let us now use FastText instead of Word2Vec. FastText is an extension of Word2Vec that handles rare and out-of-vocabulary words better and performs better on some tasks, particularly for morphologically rich languages. Where Word2Vec learns one vector per word, FastText learns vectors for character n-grams and represents a word as the sum of its n-gram vectors.
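To make the character n-gram idea concrete, here is a minimal sketch of how a word is split into n-grams with boundary markers. The function `char_ngrams` is a hypothetical helper written for this illustration, not part of gensim; the 3-to-6 range is assumed to mirror the defaults of gensim's `FastText` (`min_n`, `max_n`).

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, with boundary markers < and >."""
    marked = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of length n over the marked word
        for start in range(len(marked) - n + 1):
            grams.append(marked[start:start + n])
    return grams

trigrams = char_ngrams("where", n_min=3, n_max=3)
print(trigrams)  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because rare or unseen medical terms often share n-grams with known words (common prefixes and suffixes, for instance), they can still receive a meaningful vector.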

In [15]:
def load_and_preprocess_data(startDoc, endDoc):
    dicDoc, dicReq, dicReqDoc = loadNFCorpus()

    docsToKeep = []
    reqsToKeep = []
    dicReqDocToKeep = defaultdict(dict)

    i = startDoc
    for reqId in dicReqDoc:
        if i > (endDoc - startDoc):
            break
        for docId in dicReqDoc[reqId]:
            dicReqDocToKeep[reqId][docId] = dicReqDoc[reqId][docId]
            docsToKeep.append(docId)
            i += 1
        reqsToKeep.append(reqId)
    docsToKeep = list(set(docsToKeep))

    corpusDocTokenList = []
    corpusReqTokenList = {}
    corpusDocName = []
    corpusDicoDocName = {}

    for i, k in enumerate(docsToKeep):
        docTokenList = text2TokenList3(dicDoc[k])
        corpusDocTokenList.append(docTokenList)
        corpusDocName.append(k)
        corpusDicoDocName[k] = i
        
    for k in reqsToKeep:
        reqTokenList = text2TokenList3(dicReq[k])
        corpusReqTokenList[k] = reqTokenList

    return corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName

def advanced_pooling(vectors, method='concat'):
    if method == 'concat':
        return np.concatenate([np.mean(vectors, axis=0), np.max(vectors, axis=0)])
    elif method == 'weighted':
        weights = np.array([1 / (i + 1) for i in range(len(vectors))])
        return np.average(vectors, axis=0, weights=weights)
    else:
        return np.mean(vectors, axis=0)

def run_fasttext(model, corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName, pooling_method):
    ndcgTop = 5
    ndcgCumul = 0
    nbReq = 0
    
    for req in corpusReqTokenList:
        reqTokenList = corpusReqTokenList[req]
        
        req_vectors = [model.wv[word] for word in reqTokenList if word in model.wv]
        if not req_vectors:
            continue
        req_vector = advanced_pooling(req_vectors, method=pooling_method)
        
        fasttext_scores = []
        for doc in corpusDocTokenList:
            doc_vectors = [model.wv[word] for word in doc if word in model.wv]
            if not doc_vectors:
                fasttext_scores.append(0)
                continue
            doc_vector = advanced_pooling(doc_vectors, method=pooling_method)
            similarity = np.dot(req_vector, doc_vector) / (np.linalg.norm(req_vector) * np.linalg.norm(doc_vector))
            fasttext_scores.append(similarity)
        
        fasttext_scores = np.array(fasttext_scores)
        
        trueDocs = np.zeros(len(corpusDocTokenList))

        if req in dicReqDocToKeep:
            for docId, relevance in dicReqDocToKeep[req].items():
                if docId in corpusDicoDocName:
                    posDocId = corpusDicoDocName[docId]
                    trueDocs[posDocId] = relevance

        sorted_indices = np.argsort(fasttext_scores)[::-1]
        sorted_scores = fasttext_scores[sorted_indices]
        sorted_true_docs = trueDocs[sorted_indices]

        ndcgCumul += ndcg_score([sorted_true_docs], [sorted_scores], k=ndcgTop)
        nbReq += 1
        
    ndcgAverage = ndcgCumul / nbReq
    return ndcgAverage

def grid_search_fasttext(corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName):
    param_grid = {
        'vector_size': [200, 300, 400],
        'window': [5, 8, 10],
        'min_count': [2, 3, 5],
        'epochs': [50, 100, 150],
        'pooling_method': ['mean', 'concat', 'weighted']
    }

    best_score = 0
    best_params = None

    for params in ParameterGrid(param_grid):
        print(f"Testing parameters: {params}")
        model = FastText(corpusDocTokenList, 
                         vector_size=params['vector_size'],
                         window=params['window'],
                         min_count=params['min_count'],
                         workers=6,
                         sg=1,
                         epochs=params['epochs'])
        
        score = run_fasttext(model, corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName, params['pooling_method'])
        
        print(f"NDCG@5 score: {score}")
        
        if score > best_score:
            best_score = score
            best_params = params

    return best_score, best_params

corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName = load_and_preprocess_data(0, 150)  # Load and preprocess the data

best_score, best_params = grid_search_fasttext(corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName)  # Run the grid search

print(f"Best NDCG@5 score: {best_score}")
print(f"Best parameters: {best_params}")

final_model = FastText(corpusDocTokenList, 
                       vector_size=best_params['vector_size'],
                       window=best_params['window'],
                       min_count=best_params['min_count'],
                       workers=6,
                       sg=1,
                       epochs=best_params['epochs'])  # Build the final model with the best parameters

final_score = run_fasttext(final_model, corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName, best_params['pooling_method'])
print(f"Final NDCG@5 score: {final_score}")

Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'vector_size': 200, 'window': 5}
NDCG@5 score: 0.44567349969975456
Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'vector_size': 200, 'window': 8}
NDCG@5 score: 0.44567349969975456
Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'vector_size': 200, 'window': 10}
NDCG@5 score: 0.42812140290016465
Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'vector_size': 300, 'window': 5}
NDCG@5 score: 0.4716678906699139
Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'vector_size': 300, 'window': 8}
NDCG@5 score: 0.4486124687990172
Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'vector_size': 300, 'window': 10}
NDCG@5 score: 0.42812140290016465
Testing parameters: {'epochs': 50, 'min_count': 2, 'pooling_method': 'mean', 'vector_size': 400, 'window': 5}
NDCG@5 score: 0.4478327819

Combining FastText and BM25

In [16]:
def load_and_preprocess_data(startDoc, endDoc):
    dicDoc, dicReq, dicReqDoc = loadNFCorpus()

    docsToKeep = []
    reqsToKeep = []
    dicReqDocToKeep = defaultdict(dict)

    i = startDoc
    for reqId in dicReqDoc:
        if i > (endDoc - startDoc):
            break
        for docId in dicReqDoc[reqId]:
            dicReqDocToKeep[reqId][docId] = dicReqDoc[reqId][docId]
            docsToKeep.append(docId)
            i += 1
        reqsToKeep.append(reqId)
    docsToKeep = list(set(docsToKeep))

    corpusDocTokenList = []
    corpusReqTokenList = {}
    corpusDocName = []
    corpusDicoDocName = {}

    for i, k in enumerate(docsToKeep):
        docTokenList = text2TokenList3(dicDoc[k])
        corpusDocTokenList.append(docTokenList)
        corpusDocName.append(k)
        corpusDicoDocName[k] = i
        
    for k in reqsToKeep:
        reqTokenList = text2TokenList3(dicReq[k])
        corpusReqTokenList[k] = reqTokenList

    return corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName

def advanced_pooling(vectors, method='concat'):
    if method == 'concat':
        return np.concatenate([np.mean(vectors, axis=0), np.max(vectors, axis=0)])
    elif method == 'weighted':
        weights = np.array([1 / (i + 1) for i in range(len(vectors))])
        return np.average(vectors, axis=0, weights=weights)
    else:
        return np.mean(vectors, axis=0)

def run_combined_model(corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName):
    
    k1 = 1.5
    b = 0.75
    
    vector_size = 200
    window = 8
    min_count = 2
    epochs = 100
    pooling_method = 'weighted'


    fasttext_model = FastText(corpusDocTokenList, 
                              vector_size=vector_size,
                              window=window,
                              min_count=min_count,
                              workers=6,
                              sg=1,
                              epochs=epochs)

    bm25 = BM25Okapi(corpusDocTokenList, k1=k1, b=b)

    ndcgTop = 5
    ndcgCumul = 0
    nbReq = 0

    for req in corpusReqTokenList:
        reqTokenList = corpusReqTokenList[req]
        
        bm25_scores = np.array(bm25.get_scores(reqTokenList))
        
        req_vectors = [fasttext_model.wv[word] for word in reqTokenList if word in fasttext_model.wv]
        if not req_vectors:
            continue
        req_vector = advanced_pooling(req_vectors, method=pooling_method)
        
        fasttext_scores = []
        for doc in corpusDocTokenList:
            doc_vectors = [fasttext_model.wv[word] for word in doc if word in fasttext_model.wv]
            if not doc_vectors:
                fasttext_scores.append(0)
                continue
            doc_vector = advanced_pooling(doc_vectors, method=pooling_method)
            similarity = np.dot(req_vector, doc_vector) / (np.linalg.norm(req_vector) * np.linalg.norm(doc_vector))
            fasttext_scores.append(similarity)
        
        fasttext_scores = np.array(fasttext_scores)
        
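        # Min-max normalise both score vectors to [0, 1]; the small epsilon avoids a
        # division by zero when all scores for a query are identical
        eps = 1e-12
        bm25_scores = (bm25_scores - np.min(bm25_scores)) / (np.max(bm25_scores) - np.min(bm25_scores) + eps)
        fasttext_scores = (fasttext_scores - np.min(fasttext_scores)) / (np.max(fasttext_scores) - np.min(fasttext_scores) + eps)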
        
        combined_scores = 0.7 * bm25_scores + 0.3 * fasttext_scores
        
        trueDocs = np.zeros(len(corpusDocTokenList))

        if req in dicReqDocToKeep:
            for docId, relevance in dicReqDocToKeep[req].items():
                if docId in corpusDicoDocName:
                    posDocId = corpusDicoDocName[docId]
                    trueDocs[posDocId] = relevance

        sorted_indices = np.argsort(combined_scores)[::-1]
        sorted_scores = combined_scores[sorted_indices]
        sorted_true_docs = trueDocs[sorted_indices]

        ndcgCumul += ndcg_score([sorted_true_docs], [sorted_scores], k=ndcgTop)
        nbReq += 1
        
    ndcgAverage = ndcgCumul / nbReq
    return ndcgAverage

corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName = load_and_preprocess_data(0, 150)

final_score = run_combined_model(corpusDocTokenList, corpusReqTokenList, dicReqDocToKeep, corpusDicoDocName)
print(f"Final NDCG@5 score (combined model): {final_score}")

Final NDCG@5 score (combined model): 0.8499066737257577


Embeddings are vector representations of text that capture the meaning of words or sentences and the semantic relations between them. In information retrieval, we use embeddings to turn documents and queries into numeric vectors, so that their similarity can be computed efficiently and in a semantically meaningful way. This improves the quality of search results, because relevant documents can be found even when they do not contain exactly the same words as the query.
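The ranking step this paragraph describes can be sketched without any model, using small hand-made vectors in place of real embeddings. The function `cosine_rank` and the toy vectors are assumptions made for this illustration; in the notebook the vectors come from a sentence-embedding model.

```python
import numpy as np

def cosine_rank(query_vec, doc_matrix):
    """Rank documents by cosine similarity to the query, most similar first."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = d @ q  # dot product of unit vectors equals cosine similarity
    return np.argsort(sims)[::-1], sims

# Three hypothetical 2-dimensional document embeddings and one query embedding
docs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.2])
order, sims = cosine_rank(query, docs)
print(order, sims)
```

The query points mostly along the first axis, so the first document ranks highest, the mixed document second, and the orthogonal one last, even though no exact word matching is involved.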

In [10]:
model = SentenceTransformer('thenlper/gte-base')


def get_doc_vector(model, doc):
    """Encode a list of tokens as one sentence embedding."""
    doc_text = ' '.join(doc)
    return model.encode(doc_text)

dicDoc, dicReq, dicReqDoc = loadNFCorpus()
            
preprocessed_docs = [text2TokenList3(doc) for doc in dicDoc.values()]
preprocessed_queries = [text2TokenList3(query) for query in dicReq.values()]

doc_vectors = [get_doc_vector(model, doc) for doc in preprocessed_docs]

def search(query, doc_vectors, dicDoc):
    query_vector = get_doc_vector(model, text2TokenList3(query))
    similarities = cosine_similarity([query_vector], doc_vectors)[0]
    
    sorted_indexes = np.argsort(similarities)[::-1]
    
    results = []
    doc_ids = list(dicDoc.keys())
    for idx in sorted_indexes:
        doc_id = doc_ids[idx]
        results.append((doc_id, dicDoc[doc_id], similarities[idx]))
    
    return results

def evaluate_search(dicReq, dicReqDoc, doc_vectors, dicDoc, top_k=5):
    ndcg_scores = []
    for query_id, query in dicReq.items():
        results = search(query, doc_vectors, dicDoc)
        
        dcg = 0
        idcg = 0
        relevant_docs = sorted([(doc_id, score) for doc_id, score in dicReqDoc[query_id].items()], key=lambda x: x[1], reverse=True)
        
        for i, (doc_id, _, similarity) in enumerate(results[:top_k]):
            if doc_id in dicReqDoc[query_id]:
                dcg += (2 ** dicReqDoc[query_id][doc_id] - 1) / np.log2(i + 2)
        
        for i, (doc_id, relevance) in enumerate(relevant_docs[:top_k]):
            idcg += (2 ** relevance - 1) / np.log2(i + 2)
        
        if idcg > 0:
            ndcg_scores.append(dcg / idcg)
    
    return np.mean(ndcg_scores)

# Use a distinct name so that sklearn's ndcg_score, imported above, is not overwritten
mean_ndcg = evaluate_search(dicReq, dicReqDoc, doc_vectors, dicDoc)
print(f"Mean NDCG@5: {mean_ndcg:.4f}")

Mean NDCG@5: 0.4489
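The hand-written metric in `evaluate_search` uses the exponential gain 2^rel - 1, whereas scikit-learn's `ndcg_score` uses the relevance value itself as the gain, so the two are not strictly on the same scale. The following minimal sketch isolates the exponential-gain variant; the helper names `dcg_at_k` and `ndcg_at_k` and the toy relevance lists are assumptions for illustration.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k items, with gain 2^rel - 1."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(2), log2(3), ...
    return float(np.sum((2 ** rel - 1) / discounts))

def ndcg_at_k(ranked_relevances, all_relevances, k=5):
    """DCG of the returned ranking divided by the DCG of the ideal ranking."""
    ideal = sorted(all_relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0

perfect = ndcg_at_k([2, 1, 0], [2, 1, 0], k=3)  # ranking already ideal
swapped = ndcg_at_k([0, 1], [1, 0], k=2)        # relevant document ranked second
print(perfect, swapped)
```

Here `ranked_relevances` holds the relevance labels of the documents in the order the system returned them, and `all_relevances` holds every known label for the query.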


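The best method for this retrieval system therefore remains the combination of BM25, with the alternative preprocessing, and a Word2Vec model: it reaches an NDCG@5 of 0.869, against 0.850 for BM25 combined with FastText and 0.449 for the sentence-embedding model.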