# Text evaluation
Strategy used:
```
for question in ground truth data:
    run the question through a search technique (similarity, vector, etc.)
    check whether the correct document (id) is in the results
```
- **Metrics**: two evaluation metrics are used
    - Hit Rate
    - Mean Reciprocal Rank

- **Documents**
    - Used: `documentos_idhash.json` and `ground-truth-data.csv`
    - Result: 

## Loading the documents

### Documents with id

In [8]:
import json

with open('documentos_idhash.json', 'rt') as file:
    documentos = json.load(file)

In [2]:
documentos[15]

{'text': 'No, late submissions are not allowed. But if the form is still not closed and it’s after the due date, you can still submit the homework. confirm your submission by the date-timestamp on the Course page.y\nOlder news:[source1] [source2]',
 'section': 'General course-related questions',
 'question': 'Homework - Are late submissions of homework allowed?',
 'course': 'data-engineering-zoomcamp',
 'id': 'be5bfee4'}

### Loading the Ground Truth Data

In [9]:
import pandas as pd

df= pd.read_csv('ground-truth-data.csv')
df.head()

Unnamed: 0,question,course,document
0,When does the course begin?,data-engineering-zoomcamp,c02e79ef
1,How can I get the course schedule?,data-engineering-zoomcamp,c02e79ef
2,What is the link for course registration?,data-engineering-zoomcamp,c02e79ef
3,How can I receive course announcements?,data-engineering-zoomcamp,c02e79ef
4,Where do I join the Slack channel?,data-engineering-zoomcamp,c02e79ef


However, it is more convenient to work with it as a list of dictionaries, so that each record can be accessed by question, course, or document id.

In [10]:
ground_truthdata= df.to_dict('records')
ground_truthdata[0:3]

[{'question': 'When does the course begin?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'How can I get the course schedule?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'What is the link for course registration?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'}]
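Because every ground-truth record points to a document through its `document` field, it can help to confirm that each referenced id actually exists in the document list before evaluating. The following is a minimal sketch under that assumption; the helper name `missing_document_ids` and the tiny sample data are hypothetical and only illustrate the check.

```python
def missing_document_ids(documents, ground_truth):
    """Return the ground-truth document ids that are absent from the document list."""
    known_ids = {doc["id"] for doc in documents}
    referenced_ids = {row["document"] for row in ground_truth}
    return sorted(referenced_ids - known_ids)


# Tiny illustrative sample (not the real files).
sample_documents = [{"id": "c02e79ef"}, {"id": "be5bfee4"}]
sample_ground_truth = [
    {"question": "When does the course begin?",
     "course": "data-engineering-zoomcamp",
     "document": "c02e79ef"},
    {"question": "Hypothetical question",
     "course": "data-engineering-zoomcamp",
     "document": "00000000"},
]

print(missing_document_ids(sample_documents, sample_ground_truth))
# ['00000000']
```

With the real data, the call would be `missing_document_ids(documentos, ground_truthdata)`; an empty list means every query has a matching document.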

## Similarity search
We generate the search results with a similarity search using Elasticsearch and Minsearch.
The queries are taken from `ground-truth-data`.

### Elasticsearch
Steps to recall:

1. Import the library and create the client
2. Index with Elasticsearch (configuration, index creation, and indexing of the documents)
3. Search function using Elasticsearch (a `multi_match` text query). It returns the best-matching documents.

**Note**: remember to run Elasticsearch with Docker:
docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3

*More details*: [module 1](https://github.com/Halsey26/llm-zoomcamp/blob/main/01_Mod_IntroRag/rag_intro.ipynb) 

In [5]:
from elasticsearch import Elasticsearch

# client connected to the local Elasticsearch instance
es_cliente = Elasticsearch('http://127.0.0.1:9200/')
es_cliente.info()

ObjectApiResponse({'name': '0a7d49c11c01', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'Lh37s7P3TomzJ-REo3jkSA', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [6]:
# index configuration
index_config= {
    "settings":{
        "number_of_shards": 1,
        "number_of_replicas": 0
    }, 
    "mappings": {
        "properties":{
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
        }
    }
}

# create the index
nombre_index = "course-questions"
es_cliente.indices.create(index=nombre_index, body=index_config)



ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [11]:
from tqdm.auto import tqdm

for doc in tqdm(documentos):
    es_cliente.index(index= nombre_index, document=doc)

100%|██████████| 948/948 [00:04<00:00, 218.76it/s]


In [16]:
# search function
def search_elastic(query, course):
    search_query= {
        "size": 5,
        
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": course
                    }
                }
            }
        }
    }

    # run the search
    respuesta_search= es_cliente.search(index=nombre_index, body= search_query)
    
    resultados_search =[]

    for resultado in respuesta_search['hits']['hits']:
        respuesta = resultado['_source']
        resultados_search.append(respuesta)

    return resultados_search
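Some 8.x releases of the Python client emit a `DeprecationWarning` when `body=` is passed to `search`. A possible alternative, sketched below, passes `size` and `query` as keyword arguments instead. The names `build_search_params` and `search_elastic_kwargs` are hypothetical; the query itself is the same one used in the function above.

```python
def build_search_params(query, course, size=5):
    """Build the keyword arguments for Elasticsearch.search() without using body=."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                "filter": {"term": {"course": course}},
            }
        },
    }


def search_elastic_kwargs(client, index_name, query, course):
    """Run the search and return only the stored documents (_source)."""
    response = client.search(index=index_name, **build_search_params(query, course))
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

With the client and index from this notebook, the call would be `search_elastic_kwargs(es_cliente, nombre_index, query, course)`.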

In [23]:
query= 'I just discovered the course. Can I still join?'
course = 'data-engineering-zoomcamp'

resultados_elastic= search_elastic(query, course)
resultados_elastic


[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp',
  'id': '63394d91'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it fin

### Minsearch
Steps to recall:
1. Import the library
2. Index the documents
3. Create the search function, using the field weights as configuration


In [20]:
documentos[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp',
 'id': 'c02e79ef'}

In [21]:
import minsearch

# indexing
index= minsearch.Index(
    text_fields= ['text', 'section', 'question'], 
    keyword_fields=['course', 'id']
)

index.fit(documentos)

<minsearch.minsearch.Index at 0x72d0dc18a960>

In [None]:
# search function
def search_minsearch(query, course):
    resultados= index.search(
        query= query, 
        boost_dict= {"question": 3, "text": 0.2, "section": 0.5} , 
        filter_dict= {"course": course} ,
        num_results= 5
    )

    return resultados

query= 'I just discovered the course. Can I still join?'
course = 'data-engineering-zoomcamp'

resultados_minsearch= search_minsearch(query, course)
resultados_minsearch

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp',
  'id': '7842b56a'},
 {'text': 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Cloud SDK\nPython 3 (installed with Anaconda)\nTerraform\nGit\nLook over the prerequisites and syllabus to see if you are comfortable with these subjects.',
  'section': 'General course-related questions',
  'question': 'Course - What can I do before the course starts?',
  'course': 'data-engineering-zoomcamp',
  'id': '63394d91'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it fin

## Evaluation comparison: Elasticsearch vs. Minsearch

### Obtaining the relevant documents

Results: relevant documents using Elasticsearch

In [50]:
prueba_groundt= ground_truthdata[:5]
relevantes= []

for doc_groundt in tqdm(prueba_groundt):
    id_groundt= doc_groundt['document']
    resultados_search= search_elastic(doc_groundt['question'], doc_groundt['course'] )
    
    relevante_registo = []

    for result in resultados_search:
        relevante_registo.append(id_groundt == result['id']  )
    # equivalent: relevante_registo = [result['id'] == id_groundt for result in resultados_search]
    
    relevantes.append(relevante_registo)

100%|██████████| 5/5 [00:00<00:00, 65.73it/s]


In [51]:
# whether each returned document is relevant, for each query (the first 5)
relevantes

[[True, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False]]

We can wrap this logic in a function and run it for both search functions.

In [53]:
def obtencion_relevantes(ground_truth,funcion_busqueda ): 
    relevantes= []

    for doc_groundt in tqdm(ground_truth):
        id_groundt= doc_groundt['document']
        resultados_search= funcion_busqueda(doc_groundt['question'], doc_groundt['course'] )
        
        relevante_registo = []

        for result in resultados_search:
            relevante_registo.append(id_groundt == result['id']  )
        # equivalent: relevante_registo = [result['id'] == id_groundt for result in resultados_search]
        
        relevantes.append(relevante_registo)

    return relevantes
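One way to check this logic without a running search backend is to pass a stub search function with a known output. The sketch below restates the function with a list comprehension and no progress bar; `obtencion_relevantes_simple`, `stub_search`, and the sample records are hypothetical names and data used only for illustration.

```python
def obtencion_relevantes_simple(ground_truth, funcion_busqueda):
    """Same logic as obtencion_relevantes, without the progress bar."""
    relevantes = []
    for doc_groundt in ground_truth:
        resultados = funcion_busqueda(doc_groundt["question"], doc_groundt["course"])
        # One boolean per returned document: is it the expected one?
        relevantes.append([r["id"] == doc_groundt["document"] for r in resultados])
    return relevantes


def stub_search(query, course):
    """Hypothetical search function that always returns the same two documents."""
    return [{"id": "aaa"}, {"id": "bbb"}]


sample = [
    {"question": "q1", "course": "c", "document": "bbb"},  # found at position 2
    {"question": "q2", "course": "c", "document": "zzz"},  # never found
]

print(obtencion_relevantes_simple(sample, stub_search))
# [[False, True], [False, False]]
```

Any function with the signature `(query, course)` that returns a list of dictionaries with an `id` key can be plugged in the same way.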

We obtain the results for every query.

In [54]:
relevantes_elastic= obtencion_relevantes(ground_truthdata, search_elastic)


100%|██████████| 4627/4627 [00:57<00:00, 80.29it/s]


In [55]:
relevantes_minsearch = obtencion_relevantes(ground_truthdata, search_minsearch)

100%|██████████| 4627/4627 [00:17<00:00, 260.95it/s]


In [62]:
relevantes_elastic[:3]

[[True, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False]]

In [63]:
relevantes_minsearch[:3]

[[True, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False]]

### Evaluation metrics

In [64]:
# work through an example
relevantes

[[True, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False],
 [False, False, False, False, False]]
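Assuming the standard definitions of Hit Rate and Mean Reciprocal Rank, the five-query example above works out as follows. The symbols are introduced here for illustration: $Q$ is the set of queries and $\text{rank}_i$ is the position of the correct document in the results of query $i$, with $1/\text{rank}_i$ taken as $0$ when the document is missing.

```latex
\text{Hit Rate}
  = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \mathbb{1}\left[\text{correct document of query } i \text{ is in its results}\right]
  = \frac{1}{5}\,(1 + 0 + 0 + 0 + 0) = 0.2

\text{MRR}
  = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}
  = \frac{1}{5}\left(\frac{1}{1} + 0 + 0 + 0 + 0\right) = 0.2
```

Both values coincide here only because the single hit is at position 1; a hit at position 2 would still count as 1 for Hit Rate but only as 1/2 for MRR.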

In [66]:
for i in range(len(relevantes)):
    print(i)

0
1
2
3
4
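The loop over `relevantes` can be extended into the two metrics named at the start of this notebook. The following is a minimal sketch assuming the standard definitions of Hit Rate and Mean Reciprocal Rank; the function names `hit_rate` and `mrr` are hypothetical, and `relevantes` is redefined with the values shown above so that the block runs on its own.

```python
def hit_rate(relevance):
    """Fraction of queries whose correct document appears anywhere in the results."""
    if not relevance:
        return 0.0
    hits = sum(1 for row in relevance if any(row))
    return hits / len(relevance)


def mrr(relevance):
    """Mean of 1/rank of the first correct document (0 when it is missing)."""
    if not relevance:
        return 0.0
    total = 0.0
    for row in relevance:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1 / rank
                break  # only the first correct document counts
    return total / len(relevance)


relevantes = [
    [True, False, False, False, False],
    [False, False, False, False, False],
    [False, False, False, False, False],
    [False, False, False, False, False],
    [False, False, False, False, False],
]

print(hit_rate(relevantes), mrr(relevantes))
# 0.2 0.2
```

The same functions can be applied to `relevantes_elastic` and `relevantes_minsearch` to compare the two search engines on the full ground-truth set.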
