# Text evaluation
Strategy used:
```
for question in ground truth data:
    run the question through a search technique (text similarity, vector search, etc.)
    check whether the correct document (id) is among the results
```
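
The loop above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the notebook's final implementation: `evaluate_relevance` and `search_fn` are hypothetical names, and `search_fn` is assumed to take a ground-truth record and return a ranked list of documents that each carry an `id` field. The toy data is made up for illustration.

```python
def evaluate_relevance(ground_truth, search_fn):
    """Return one list of booleans per question: True where the retrieved
    document is the expected one, in ranking order."""
    relevance_total = []
    for record in ground_truth:
        expected_id = record['document']
        results = search_fn(record)  # hypothetical: ranked list of dicts with an 'id'
        relevance_total.append([doc['id'] == expected_id for doc in results])
    return relevance_total


# Toy ground truth (made up, not taken from the course files)
toy_ground_truth = [
    {'question': 'q1', 'course': 'c', 'document': 'a'},
    {'question': 'q2', 'course': 'c', 'document': 'z'},
]


def toy_search(record):
    # Always returns the same ranking, regardless of the question
    return [{'id': 'b'}, {'id': 'a'}]


relevance = evaluate_relevance(toy_ground_truth, toy_search)
print(relevance)  # [[False, True], [False, False]]
```

Keeping the search function as a parameter lets the same loop be reused for each search backend that is compared.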
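- **Metrics**
Two evaluation metrics are used:
    - Hit Rate: fraction of questions for which the correct document appears in the results
    - Mean Reciprocal Rank: average of 1/rank of the correct document across questions

- **Documents**
    - Inputs: `documentos_idhash.json` and `ground-truth-data.csv`
    - Result: 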
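
As an illustration of the two metrics, here is a minimal sketch. It assumes each question has exactly one relevant document (as in `ground-truth-data.csv`) and that the search results have already been turned into one list of booleans per question, in ranking order. The function names `hit_rate` and `mrr` are chosen here for illustration.

```python
def hit_rate(relevance_total):
    """Fraction of questions whose expected document appears anywhere in the results."""
    if not relevance_total:
        return 0.0
    hits = sum(1 for row in relevance_total if any(row))
    return hits / len(relevance_total)


def mrr(relevance_total):
    """Mean of 1/rank of the expected document (0 when it is not retrieved)."""
    if not relevance_total:
        return 0.0
    total = 0.0
    for row in relevance_total:
        for position, is_relevant in enumerate(row):
            if is_relevant:
                total += 1 / (position + 1)
                break  # only the first (and only) relevant document counts
    return total / len(relevance_total)


example = [
    [False, True, False],   # expected document at rank 2
    [True, False, False],   # expected document at rank 1
    [False, False, False],  # expected document not retrieved
]
print(hit_rate(example))  # 2 of 3 questions are hits
print(mrr(example))       # (1/2 + 1 + 0) / 3 = 0.5
```

Hit Rate ignores where the document lands in the ranking, while Mean Reciprocal Rank rewards results that place it near the top.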

## Loading the documents

### Documents with id

In [1]:
import json

with open('documentos_idhash.json', 'rt') as file:
    documentos = json.load(file)

In [2]:
documentos[15]

{'text': 'No, late submissions are not allowed. But if the form is still not closed and it’s after the due date, you can still submit the homework. confirm your submission by the date-timestamp on the Course page.y\nOlder news:[source1] [source2]',
 'section': 'General course-related questions',
 'question': 'Homework - Are late submissions of homework allowed?',
 'course': 'data-engineering-zoomcamp',
 'id': 'be5bfee4'}

### Loading the ground truth data

In [3]:
import pandas as pd

df = pd.read_csv('ground-truth-data.csv')
df.head()

|   | question | course | document |
|---|---|---|---|
| 0 | When does the course begin? | data-engineering-zoomcamp | c02e79ef |
| 1 | How can I get the course schedule? | data-engineering-zoomcamp | c02e79ef |
| 2 | What is the link for course registration? | data-engineering-zoomcamp | c02e79ef |
| 3 | How can I receive course announcements? | data-engineering-zoomcamp | c02e79ef |
| 4 | Where do I join the Slack channel? | data-engineering-zoomcamp | c02e79ef |


It is more convenient to work with it as a list of dictionaries, so that each record can be iterated over and read by question, course, or document id.

In [None]:
ground_truthdata = df.to_dict('records')
ground_truthdata[0:3]

[{'question': 'When does the course begin?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'How can I get the course schedule?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'},
 {'question': 'What is the link for course registration?',
  'course': 'data-engineering-zoomcamp',
  'document': 'c02e79ef'}]

## Similarity search
We generate the results with similarity search using Elasticsearch and Minsearch.
The queries come from `ground-truth-data`.

### Elasticsearch
Steps to remember:

1. Import the library and create the client
2. Index with Elasticsearch (configuration, index creation, and indexing of the documents)
3. Search function using Elasticsearch (a `multi_match` text query in this notebook). It returns the best-matching documents.
4.  
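
Step 3 can be wrapped in a reusable function, sketched below. The names `build_search_query`, `elastic_search`, and `FakeClient` are hypothetical. The query body mirrors the one used later in this section, and the request is passed through `body=` as the notebook does. `FakeClient` is only a stand-in so the sketch runs without a live Elasticsearch instance; with a real client, pass `es_cliente` instead.

```python
def build_search_query(query, course, size=5):
    """Build the request body for a multi_match search filtered by course."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                "filter": {"term": {"course": course}},
            }
        },
    }


def elastic_search(client, index_name, query, course, size=5):
    """Run the search and return the stored document (`_source`) of each hit."""
    body = build_search_query(query, course, size)
    response = client.search(index=index_name, body=body)
    return [hit["_source"] for hit in response["hits"]["hits"]]


class FakeClient:
    """Stand-in for an Elasticsearch client, used only to run this sketch offline."""

    def search(self, index, body):
        return {"hits": {"hits": [{"_source": {"id": "7842b56a"}}]}}


docs = elastic_search(
    FakeClient(),
    "course-questions",
    "if the course already started, can i still join it?",
    "data-engineering-zoomcamp",
)
print(docs)  # [{'id': '7842b56a'}]
```

Returning the whole `_source` rather than only the text keeps the `id` field available, which the evaluation needs.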

**Note**: remember to start Elasticsearch with Docker:
docker run -it \
    --rm \
    --name elasticsearch \
    -m 4GB \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3

- More details: [module 1](https://github.com/Halsey26/llm-zoomcamp/blob/main/01_Mod_IntroRag/rag_intro.ipynb) 

In [17]:
import elasticsearch
from elasticsearch import Elasticsearch
# client
es_cliente = Elasticsearch('http://127.0.0.1:9200/')
es_cliente.info()

ObjectApiResponse({'name': '25a54b12f136', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'jJHZMmwnTg6JCSb6Y35YYw', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [18]:
# index configuration
index_config= {
    "settings":{
        "number_of_shards": 1,
        "number_of_replicas": 0
    }, 
    "mappings": {
        "properties":{
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
        }
    }
}

# create the index
nombre_index = "course-questions"
es_cliente.indices.create(index=nombre_index, body=index_config)



ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})

In [23]:
from tqdm.auto import tqdm

for doc in tqdm(documentos):
    es_cliente.index(index=nombre_index, document=doc)

100%|██████████| 948/948 [00:04<00:00, 208.68it/s]
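
Indexing one document per request works for 948 documents. For larger collections, a common alternative is the bulk helper shipped with the Python client. The sketch below is an optional variation, not what this notebook runs: `generate_actions` and `bulk_index` are hypothetical names, and the sample document is trimmed to two fields for brevity.

```python
def generate_actions(docs, index_name):
    """Yield one bulk action per document, in the shape the bulk helper accepts."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def bulk_index(client, docs, index_name):
    """Index all documents in batches; returns (number of successes, errors)."""
    # Imported here so generate_actions can be used without the package installed
    from elasticsearch import helpers
    return helpers.bulk(client, generate_actions(docs, index_name))


# Offline check of the action shape (no Elasticsearch instance needed)
sample_docs = [{"id": "be5bfee4", "course": "data-engineering-zoomcamp"}]
actions = list(generate_actions(sample_docs, "course-questions"))
print(actions)
```

With a running instance, the call would be `bulk_index(es_cliente, documentos, nombre_index)`.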


In [36]:
# search query

query = 'if the course already started, can i still join it?'
course = 'data-engineering-zoomcamp'

search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": query,
                    "fields": ["question^3", "text", "section"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": course
                }
            }
        }
    }
}

# run the search
respuesta_search = es_cliente.search(index=nombre_index, body=search_query)
respuesta_search


ObjectApiResponse({'took': 20, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 407, 'relation': 'eq'}, 'max_score': 66.088936, 'hits': [{'_index': 'course-questions', '_id': 'wEX3Q5gBgJml6WbXHxQN', '_score': 66.088936, '_source': {'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.", 'section': 'General course-related questions', 'question': 'Course - Can I still join the course after the start date?', 'course': 'data-engineering-zoomcamp', 'id': '7842b56a'}}, {'_index': 'course-questions', '_id': 'xUX3Q5gBgJml6WbXHxSA', '_score': 46.730072, '_source': {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue prepa

In [35]:
respuesta_search['hits']['hits'][0]['_source']['text']

"Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute."

In [43]:
# collect the answer text of each hit, in ranking order
resultados_search = []

for resultado in respuesta_search['hits']['hits']:
    respuesta = resultado['_source']['text']
    resultados_search.append(respuesta)

resultados_search

["Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
 'Yes, the slack channel remains open and you can ask questions there. But always sDocker containers exit code w search the channel first and second, check the FAQ (this document), most likely all your questions are already answered here.\nYou can also tag the bot @ZoomcampQABot to help you conduct the search, but don’t rely on its answers 100%, it is pretty good though.',
 'You can start by installing and setting up all the dependencies and requirements:\nGoogle cloud account\nGoogle Clou
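
For the evaluation, the `id` of each hit matters more than its text, because the ground truth stores document ids. The sketch below shows one way to turn a response into a relevance list. `relevance_from_response` is a hypothetical helper, and `sample_response` is a trimmed-down stand-in with the same nesting as a real response; its second id is a made-up placeholder.

```python
def relevance_from_response(response, expected_id):
    """Mark, hit by hit and in ranking order, whether it is the expected document."""
    return [
        hit["_source"]["id"] == expected_id
        for hit in response["hits"]["hits"]
    ]


# Stand-in for a search response; only the fields used here are included
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"id": "7842b56a"}},
            {"_source": {"id": "placeholder-id"}},
        ]
    }
}

print(relevance_from_response(sample_response, "7842b56a"))  # [True, False]
print(relevance_from_response(sample_response, "c02e79ef"))  # [False, False]
```

One such list per ground-truth question is the input that Hit Rate and Mean Reciprocal Rank are computed from.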

In [7]:
# es_cliente.info()
import elasticsearch

elasticsearch.__version__

(8, 4, 3)

### Minsearch

## Evaluation comparison: Elasticsearch vs. Minsearch