# Recuperación Ranqueada y Vectorización de Documentos (RRDV)

Este notebook implementa un sistema de recuperación ranqueada mediante la vectorización de documentos utilizando la técnica **TF-IDF** y la similitud del coseno. Se hacen los siguientes pasos:
1. Creación de una representación vectorial ponderada **TF-IDF** para cada documento.
2. Cálculo de la **similitud del coseno** entre documentos y consultas.
3. Recuperación de documentos ordenados por el puntaje de similitud del coseno.
4. Evaluación de los resultados utilizando métricas como **Precisión** (P@M), **Recall** (R@M) y **NDCG** (NDCG@M).


## 1. Preparación del Entorno

Se importan las bibliotecas necesarias para la creación del índice invertido, el procesamiento de las consultas y el cálculo de las métricas de evaluación.
## Importaciones y instalaciones
En este paso tambien hacemos uso de algunas celdas del anterior notebook que usaremos en este proceso

In [54]:
!pip install KafNafParserPy



In [55]:
import os
import sys
import numpy as np
import pandas as pd
import nltk
import math
import json
from google.colab import files

In [56]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [57]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [58]:
os.chdir('/content/drive/MyDrive/ColabNotebooks/NLP-main/HW01/')
sys.path.append('algorithms')

In [59]:
from binary_search.inverted_index import InvertedIndex

## 2. Creación del Índice Invertido

Se utiliza el índice invertido generado previamente para construir una representación vectorial de los documentos.
Este índice es la base para calcular las frecuencias y los valores de **TF-IDF**.


In [60]:
inverted_index = InvertedIndex()

In [61]:
index = inverted_index.inverted_index_complete_pipeline()
index

2024-09-04 21:06:32,069 - InvertedIndex - INFO - Starting complete inverted index pipeline
2024-09-04 21:06:32,069 - InvertedIndex - INFO - Starting complete inverted index pipeline
INFO:InvertedIndex:Starting complete inverted index pipeline
2024-09-04 21:06:32,078 - InvertedIndex - INFO - Step 1: Processing texts
2024-09-04 21:06:32,078 - InvertedIndex - INFO - Step 1: Processing texts
INFO:InvertedIndex:Step 1: Processing texts
2024-09-04 21:06:32,084 - TextProcessor - INFO - Starting text processing pipeline
2024-09-04 21:06:32,084 - TextProcessor - INFO - Starting text processing pipeline
2024-09-04 21:06:32,084 - TextProcessor - INFO - Starting text processing pipeline
INFO:TextProcessor:Starting text processing pipeline
2024-09-04 21:06:32,092 - TextProcessor - INFO - Reading files from directory: data/docs-raw-texts/
2024-09-04 21:06:32,092 - TextProcessor - INFO - Reading files from directory: data/docs-raw-texts/
2024-09-04 21:06:32,092 - TextProcessor - INFO - Reading files 

{'001': ['218'],
 '01': ['305'],
 '0314': ['098'],
 '044': ['099'],
 '0511': ['125'],
 '06': ['167'],
 '085': ['118'],
 '10': ['012',
  '023',
  '037',
  '038',
  '043',
  '048',
  '052',
  '080',
  '084',
  '091',
  '098',
  '109',
  '112',
  '118',
  '128',
  '130',
  '139',
  '170',
  '199',
  '200',
  '208',
  '209',
  '214',
  '231',
  '234',
  '239',
  '240',
  '242',
  '245',
  '269',
  '293',
  '302',
  '307',
  '309',
  '310',
  '313',
  '324'],
 '100': ['005',
  '070',
  '080',
  '085',
  '090',
  '150',
  '159',
  '167',
  '175',
  '176',
  '177',
  '179',
  '180',
  '194',
  '199',
  '223',
  '230',
  '244',
  '257',
  '268',
  '288'],
 '1000': ['021',
  '023',
  '024',
  '141',
  '145',
  '157',
  '169',
  '180',
  '184',
  '253',
  '269',
  '282'],
 '10000': ['040', '116', '150'],
 '100000': ['115', '116', '204', '240', '258'],
 '1000volt': ['076'],
 '100yen': ['204'],
 '101': ['090', '255'],
 '102': ['145'],
 '1024': ['303'],
 '103': ['098'],
 '1030': ['199'],
 '104': ['

In [62]:
from algorithms.binary_search.query_processor import QueryProcessor
from algorithms.binary_search.text_processor import TextProcessor

texts = TextProcessor()
queries = QueryProcessor()

text_df = texts.process_texts()
queries_df = queries.process_queries()


2024-09-04 21:07:40,266 - TextProcessor - INFO - Starting text processing pipeline
2024-09-04 21:07:40,266 - TextProcessor - INFO - Starting text processing pipeline
2024-09-04 21:07:40,266 - TextProcessor - INFO - Starting text processing pipeline
2024-09-04 21:07:40,266 - TextProcessor - INFO - Starting text processing pipeline
INFO:TextProcessor:Starting text processing pipeline
2024-09-04 21:07:40,276 - TextProcessor - INFO - Reading files from directory: data/docs-raw-texts/
2024-09-04 21:07:40,276 - TextProcessor - INFO - Reading files from directory: data/docs-raw-texts/
2024-09-04 21:07:40,276 - TextProcessor - INFO - Reading files from directory: data/docs-raw-texts/
2024-09-04 21:07:40,276 - TextProcessor - INFO - Reading files from directory: data/docs-raw-texts/
INFO:TextProcessor:Reading files from directory: data/docs-raw-texts/
2024-09-04 21:07:41,491 - TextProcessor - INFO - Finished reading files. Time taken: 1.21 seconds
2024-09-04 21:07:41,491 - TextProcessor - INFO 

In [63]:
file_path = '/content/drive/MyDrive/ColabNotebooks/NLP-main/HW01/inverted_index.json'

with open(file_path, 'r') as f:
    data = json.load(f)

## 3. Vectorización de Documentos usando TF-IDF

Se crea una función que, a partir del índice invertido, calcula la representación TF-IDF de cada documento. El valor de TF-IDF se calcula de la siguiente manera:
- **TF (Term Frequency)**: Frecuencia de aparición de un término en el documento.
- **IDF (Inverse Document Frequency)**: Penalización basada en el número de documentos en los que aparece un término.

La fórmula es la siguiente:

$
\text{TF-IDF}(t, d) = \text{TF}(t, d) \times \log\left(\frac{N}{\text{DF}(t)}\right)
$

Donde:
- $ t $ es el término.
- $ d $ es el documento.
- $ N $ es el número total de documentos.
- $ \text{DF}(t) $ es el número de documentos que contienen el término $ t $ .


### 3.1 Eficiencia de la Representación TF-IDF

La representación TF-IDF genera un vector para cada documento con tantos elementos como palabras haya en el índice invertido. Sin embargo, este enfoque no es muy eficiente porque la mayoría de los vectores resultantes contienen muchos ceros (vectores dispersos), debido a que los términos no aparecen en todos los documentos.

Una mejor alernativa es usar un diccionario para almacenar solo los términos relevantes de cada documento o consulta, en lugar de un vector lleno de ceros. De esta manera, solo se añaden los términos que están presentes en el documento, lo que reduce el espacio de almacenamiento y mejora la eficiencia


In [64]:
def tfIdf(invertedIndex, N, doc):
    vectTfIdf =[0]*len(invertedIndex)
    for word in doc:
        if word in invertedIndex:
            tf = np.log10(1 + doc.count(word))
            idf = np.log10(N/len(invertedIndex[word]))
            tfIdf = tf*idf
            vectTfIdf[list(invertedIndex.keys()).index(word)] = tfIdf
    return vectTfIdf

In [65]:
tfIdf(data, text_df.shape[0], ['001', '115', '001'])

[1.2022634940680008,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0.758543810040283,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,

### 3.2 Representación TF-IDF sin Zeros

En la siguiente celda se muestra  una implementación de TF-IDF utilizando un diccionario en lugar de un vector lleno de ceros. Asi se almacena únicamente los términos relevantes, lo que reduce el espacio necesario y aumenta la eficiencia


In [66]:
def tfIdfSinZeros(invertedIndex, N, doc):
    vectTfIdf = {}
    for word in doc:
        if word in invertedIndex:
            tf = np.log10(1 + doc.count(word))
            idf = np.log10(N / len(invertedIndex[word]))
            tfIdf = tf * idf
            vectTfIdf[word] = tfIdf
    return vectTfIdf

In [67]:
tfIdfSinZeros(data, text_df.shape[0], ['war', 'moon', 'suggest'])

{'war': 0.1731961996742312,
 'moon': 0.3276603823219233,
 'suggest': 0.25800382717892806}

## 4. Cálculo de la Similitud del Coseno

La **similitud del coseno** entre dos vectores $ A $ y $ B $ se define como:

$
\cos(\theta) = \frac{A \cdot B}{||A|| \cdot ||B||}
$

Se implementa una función para calcular la similitud del coseno entre los vectores TF-IDF de los documentos y las consultas.


In [68]:
def cosineSimilarity (tfIdf1, tfIdf2):
    producto_punto = np.dot(tfIdf1, tfIdf2)
    normas = (np.linalg.norm(tfIdf1) * np.linalg.norm(tfIdf2))
    return producto_punto / normas

In [69]:
tfIdf1 = tfIdf(data, text_df.shape[0], ['moon', 'war'])
tfIdf2 = tfIdf(data, text_df.shape[0], ['war', 'minist'])

cosineSimilarity(tfIdf1, tfIdf2)

0.1935294655121384

## 5. Recuperación de Documentos Basados en Similitud del Coseno

Para cada consulta, se calculan las similitudes del coseno entre la consulta y cada documento del corpus. Los documentos se ordenan de mayor a menor similitud y se almacenan aquellos cuya similitud sea mayor que 0.


In [70]:
corpus = text_df['text_list'].values

In [71]:
from IPython.display import FileLink

In [72]:
def cosineSimilarityDocQuery(inverseIndex, corpus):
    N = text_df.shape[0]
    queriesProcessed = queries_df['query_list'].values
    results_file = open("RRDV-consultas_results.txt", "w")
    tfIdfDoc = {}
    i=1
    for doc in corpus:
        tfIdfDoc[i] = tfIdf(inverseIndex, N, doc)
        i+=1
    all_similarities = {}
    index_query = 1
    index_doc = 1
    for query in queriesProcessed:
        tfIdfQuery = tfIdf(inverseIndex, N, query)
        results_line = f"q{index_query:02} "
        similarities = []
        index_doc = 1
        for doc in corpus:
            similarity = cosineSimilarity(tfIdfDoc[index_doc], tfIdfQuery)
            if similarity != 0:
                similarities.append([index_doc,similarity])
            index_doc += 1

        similarities.sort(key=lambda x: x[1], reverse=True)
        all_similarities[index_query] = similarities
        index_query += 1
        results_line += ", ".join([f"d{doc}:{sim:.4f}" for doc, sim in similarities])
        results_file.write(results_line + "\n")
    results_file.close()
    #display(FileLink(results_file))


    return all_similarities

In [73]:
similarities = cosineSimilarityDocQuery(data, corpus)

In [74]:
print(similarities)

{1: [[42, 0.18687040526703813], [315, 0.049406724731041504], [178, 0.04620249390547406], [252, 0.04146947948961399], [71, 0.03492377568050464], [271, 0.03389594735051589], [70, 0.029634548584189922], [8, 0.026486367098009656], [12, 0.025116458580400083], [35, 0.024857148764448087], [239, 0.024640569971924074], [175, 0.023361745683237116], [156, 0.023206842991617082], [45, 0.022980984461714867], [231, 0.02292908961170689], [128, 0.021703327534822164], [303, 0.02125067958085792], [247, 0.020775755792396138], [286, 0.020571821787847316], [59, 0.020523276479545327], [185, 0.018765342578171924], [32, 0.018740562355552826], [104, 0.018412148037735766], [207, 0.0180584940662059], [100, 0.017650556568659964], [186, 0.017225790174987113], [243, 0.016593109032120287], [299, 0.015228792979081817], [95, 0.015061134815166239], [36, 0.014705127528649171], [267, 0.014066843467061715], [177, 0.013991477445736053], [162, 0.012534703946179827], [328, 0.011882157411052283], [34, 0.011663122769904477], [3

In [75]:
FileLink(r'RRDV-consultas_results.txt')

 ## 6. Generación del Archivo de Resultados y Generación del Archivo de Resultados



Finalmente, se calculan las métricas P@M, R@M, y NDCG@M para evaluar la efectividad de la recuperación.

Empezamos por exportar el archivo de juicios de relevancia, y guardamos la informacion en un diccionario El primer diccionario contiene una relevancia por niveles (cada documento con su nivel de relevancia), mientras que el segundo diccionario contiene una relevancia binaria (solo guardamos los documentos relevantes)

In [76]:
queriesRelevance = {}
i=1
with open("data/relevance-judgments/relevance-judgments.tsv", "r") as file:
    for line in file:
        parts = line.strip().split("\t")
        #key = parts[0]
        key = f"q{i}"
        values = parts[1].split(",")
        escala = {}
        for value in values:
          s = value.strip().split(":")
          escala[int(s[0].replace('d',''))] = int(s[1])
        queriesRelevance[key] = escala
        i+=1

queriesRelevanceBi = {}
for query in queriesRelevance :
        queriesRelevanceBi[query]=[]
        for doc in queriesRelevance[query]:
            docNb = doc
            queriesRelevanceBi[query].append(docNb)

In [77]:
print(queriesRelevance)

{'q1': {186: 4, 254: 5, 16: 5}, 'q2': {136: 2, 139: 2, 143: 4, 283: 4, 228: 4, 164: 4, 318: 2, 291: 4, 293: 4, 147: 2, 149: 2}, 'q3': {152: 3, 291: 4, 283: 4, 147: 3, 318: 2, 105: 2}, 'q4': {275: 3, 10: 3, 286: 2, 19: 2, 49: 2, 330: 2, 270: 3}, 'q5': {69: 2, 233: 3, 257: 2, 297: 3, 26: 4, 329: 5}, 'q6': {4: 3, 77: 3, 266: 2, 179: 3}, 'q7': {205: 2, 5: 4, 110: 4, 108: 3, 117: 3, 81: 2, 292: 2, 251: 5, 28: 3, 271: 3, 121: 2, 180: 2}, 'q8': {205: 3, 199: 5, 198: 3, 223: 2, 217: 2, 177: 2}, 'q9': {68: 2, 100: 2, 65: 3, 76: 3, 231: 4, 199: 4, 52: 2, 215: 2}, 'q10': {239: 4, 277: 4, 258: 3, 250: 4}, 'q11': {239: 2, 277: 2, 258: 2, 49: 4, 56: 4}, 'q12': {2: 2, 5: 3, 142: 2, 314: 3, 280: 3, 130: 3, 41: 3, 117: 2, 81: 4, 93: 3, 91: 4, 180: 3}, 'q13': {229: 2, 132: 3}, 'q14': {280: 2, 271: 4, 121: 4, 91: 2}, 'q15': {207: 2, 201: 3, 192: 4, 194: 3, 222: 2, 216: 2, 210: 3}, 'q16': {77: 2, 179: 5}, 'q17': {277: 2, 11: 3, 132: 2, 258: 2, 49: 2, 250: 2, 331: 2}, 'q18': {205: 2, 202: 2, 276: 4, 194: 2

In [78]:
print(queriesRelevanceBi)

{'q1': [186, 254, 16], 'q2': [136, 139, 143, 283, 228, 164, 318, 291, 293, 147, 149], 'q3': [152, 291, 283, 147, 318, 105], 'q4': [275, 10, 286, 19, 49, 330, 270], 'q5': [69, 233, 257, 297, 26, 329], 'q6': [4, 77, 266, 179], 'q7': [205, 5, 110, 108, 117, 81, 292, 251, 28, 271, 121, 180], 'q8': [205, 199, 198, 223, 217, 177], 'q9': [68, 100, 65, 76, 231, 199, 52, 215], 'q10': [239, 277, 258, 250], 'q11': [239, 277, 258, 49, 56], 'q12': [2, 5, 142, 314, 280, 130, 41, 117, 81, 93, 91, 180], 'q13': [229, 132], 'q14': [280, 271, 121, 91], 'q15': [207, 201, 192, 194, 222, 216, 210], 'q16': [77, 179], 'q17': [277, 11, 132, 258, 49, 250, 331], 'q18': [205, 202, 276, 194, 216, 219, 215, 211], 'q19': [98, 129, 196, 221, 60], 'q20': [167, 166, 20, 23], 'q21': [152], 'q22': [103, 143, 107, 51, 17, 54, 293, 158], 'q23': [136, 316, 94], 'q24': [1, 37, 130, 314, 46, 133, 113, 294, 261, 93, 62, 120], 'q25': [139, 67, 25, 31, 90], 'q26': [248], 'q27': [277, 167, 257, 20, 23, 321, 247, 265, 150, 328], '

Creamos una funcion que calcula la similitud entre las consultas y los documentos, pero esta vez sin borrar los documentos que no se parecen nada a las consultas (los documentos que tienen una similitud nula con las consultas).

Esto nos va a permitir de crear una diccionario que nos permite saber los documentos recuperados en orden.

In [79]:
def cosineSimilarityDocQueryWithZeros(inverseIndex, corpus):
    N = text_df.shape[0]
    queriesProcessed = queries_df['query_list'].values
    results_file = open("RRDV-consultas_results_2.txt", "w")
    tfIdfDoc = {}
    i=1
    for doc in corpus:
        tfIdfDoc[i] = tfIdf(inverseIndex, N, doc)
        i+=1
    all_similarities = {}
    index_query = 1
    index_doc = 1
    for query in queriesProcessed:
        tfIdfQuery = tfIdf(inverseIndex, N, query)
        results_line = f"q{index_query:02} "
        similarities = []
        index_doc = 1
        for doc in corpus:
            similarity = cosineSimilarity(tfIdfDoc[index_doc], tfIdfQuery)
            similarities.append([index_doc,similarity])
            index_doc += 1

        similarities.sort(key=lambda x: x[1], reverse=True)
        all_similarities[index_query] = similarities
        index_query += 1
        results_line += ", ".join([f"d{doc}:{sim:.4f}" for doc, sim in similarities])
        results_file.write(results_line + "\n")
    results_file.close()

    return all_similarities

In [80]:
fullSimilarities = cosineSimilarityDocQueryWithZeros(data, corpus)

In [81]:
retrieved_queries = {}
for query in fullSimilarities:
    retrieved_queries['q'+str(query)]=[]
    for doc in fullSimilarities[query]:
        retrieved_queries['q'+str(query)].append(int(doc[0]))

In [82]:
print(retrieved_queries)

{'q1': [42, 315, 178, 252, 71, 271, 70, 8, 12, 35, 239, 175, 156, 45, 231, 128, 303, 247, 286, 59, 185, 32, 104, 207, 100, 186, 243, 299, 95, 36, 267, 177, 162, 328, 34, 306, 118, 67, 72, 283, 98, 189, 1, 318, 258, 157, 324, 161, 136, 43, 122, 183, 169, 257, 96, 261, 244, 305, 168, 209, 273, 82, 18, 309, 2, 3, 4, 5, 6, 7, 9, 10, 11, 13, 14, 15, 16, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 33, 37, 38, 39, 40, 41, 44, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60, 61, 62, 63, 64, 65, 66, 68, 69, 73, 74, 75, 76, 77, 78, 79, 80, 81, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 97, 99, 101, 102, 103, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 119, 120, 121, 123, 124, 125, 126, 127, 129, 130, 131, 132, 133, 134, 135, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 158, 159, 160, 163, 164, 165, 166, 167, 170, 171, 172, 173, 174, 176, 179, 180, 181, 182, 184, 187, 188, 190, 191, 192, 193, 194, 195, 

Las funciones que permiten calcular P@M, R@M y NDCG@M de cada consulta y MAP basandose en los diccionarios creados anteriormente.

Las funciones calculan con la logica que hemos visto en las clases.

Tambien otras funciones (DCG@M, p@k y average_precision) que se necesitan para el calculo de las funciones que se piden.

In [83]:
def precision_at_m(query):
  m = len(queriesRelevanceBi[query])
  relevant_retrieved = [doc for doc in retrieved_queries[query][:m] if doc in queriesRelevanceBi[query]]
  return len(relevant_retrieved)/m

In [84]:
def recall_at_m(query):
    m = len(queriesRelevanceBi[query])
    relevant_retrieved = [doc for doc in retrieved_queries[query][:m] if doc in queriesRelevanceBi[query]]
    return len(relevant_retrieved) / len(queriesRelevanceBi[query])

In [85]:
print(precision_at_m("q11"))
print(recall_at_m("q11"))

0.2
0.2


In [86]:
def dcg_at_m(query):
    m = len(queriesRelevanceBi[query])
    dcg = 0.0
    for i in range(m):
        rel_score = queriesRelevance[query].get(retrieved_queries[query][i], 0)
        dcg += rel_score / np.log2(max(i+1, 2))
    return dcg

In [87]:
def ndcg_at_m(query):
    m = len(queriesRelevanceBi[query])
    dcg = dcg_at_m(query)
    ideal_order = sorted(queriesRelevance[query].values(), reverse = True)
    idcg = 0.0
    for i in range(m):
        rel_score = ideal_order[i]
        idcg += rel_score / np.log2(max(i+1, 2))
    if idcg == 0:
      return 0
    return dcg/idcg

In [88]:
print(dcg_at_m("q11"))
print(ndcg_at_m("q11"))

4.0
0.3596083375790938


In [89]:
def precision_at_k(query, k):
  relevant_retrieved = [doc for doc in retrieved_queries[query][:k] if doc in queriesRelevanceBi[query]]
  return len(relevant_retrieved)/k

In [90]:
def average_precision(query):
    m = len(queriesRelevanceBi[query])
    nb_relevant_found = 0
    sum_precisions = 0
    k=0
    while m!=nb_relevant_found:
        if retrieved_queries[query][k] in queriesRelevanceBi[query]:
            nb_relevant_found+=1
            sum_precisions += precision_at_k(query, k+1)
        k+=1

    if m==0:
        return 0
    return (sum_precisions/m)

In [91]:
average_precision("q11")

0.21313256639832798

Creamos un df con las metricas de cada consulta.

In [96]:
df = pd.DataFrame(columns=['Query', 'Precision', 'Recall', 'NDCG'])

rows = []

for i in range(1, queries_df.shape[0]):
    query = "q" + str(i)
    row = {
        'Query': query,
        'Precision': precision_at_m(query),
        'Recall': recall_at_m(query),
        'NDCG': ndcg_at_m(query)
    }
    rows.append(row)

df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)

  df = pd.concat([df, pd.DataFrame(rows)], ignore_index=True)


In [98]:
df

Unnamed: 0,Query,Precision,Recall,NDCG
0,q1,0.0,0.0,0.0
1,q2,0.181818,0.181818,0.08439
2,q3,0.0,0.0,0.0
3,q4,0.0,0.0,0.0
4,q5,0.0,0.0,0.0
5,q6,0.0,0.0,0.0
6,q7,0.0,0.0,0.0
7,q8,0.0,0.0,0.0
8,q9,0.0,0.0,0.0
9,q10,0.0,0.0,0.0


In [99]:
def MAP ():
    map_value = 0
    for i in range (1, queries_df.shape[0]):
        map_value+=average_precision("q"+str(i))
    return (map_value/queries_df.shape[0])

In [100]:
MAP()

0.03267551549007947