### Q2. Reading the documents
Now we can ingest the documents. Create a custom code ingestion block.

Let's read the documents. We will use the same code we used for parsing the FAQ: [parse-faq-llm.ipynb](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/cohorts/2024/05-orchestration/parse-faq-llm.ipynb)

Use the following document ID: ```1qZjwHkvP0lXHiE4zdbWyUXSVfmVGzougDD6N37bat3E```.

This is the document ID of [LLM FAQ version 1](https://docs.google.com/document/d/1qZjwHkvP0lXHiE4zdbWyUXSVfmVGzougDD6N37bat3E/edit)

Copy the code to the editor. How many FAQ documents did we process?


* 1
* 2
* 3
* 4

In [1]:
import io

import requests
import docx
import hashlib

from datetime import datetime
from elasticsearch import Elasticsearch
from tqdm.auto import tqdm

In [2]:
def clean_line(line):
    line = line.strip()
    line = line.strip('\uFEFF')
    return line

def read_faq(file_id):
    url = f'https://docs.google.com/document/d/{file_id}/export?format=docx'
    
    response = requests.get(url)
    response.raise_for_status()
    
    with io.BytesIO(response.content) as f_in:
        doc = docx.Document(f_in)

    questions = []

    question_heading_style = 'heading 2'
    section_heading_style = 'heading 1'
    
    heading_id = ''
    section_title = ''
    question_title = ''
    answer_text_so_far = ''
     
    for p in doc.paragraphs:
        style = p.style.name.lower()
        p_text = clean_line(p.text)
    
        if len(p_text) == 0:
            continue
    
        if style == section_heading_style:
            section_title = p_text
            continue
    
        if style == question_heading_style:
            answer_text_so_far = answer_text_so_far.strip()
            if answer_text_so_far != '' and section_title != '' and question_title != '':
                questions.append({
                    'text': answer_text_so_far,
                    'section': section_title,
                    'question': question_title,
                })
                answer_text_so_far = ''
    
            question_title = p_text
            continue
        
        answer_text_so_far += '\n' + p_text
    
    answer_text_so_far = answer_text_so_far.strip()
    if answer_text_so_far != '' and section_title != '' and question_title != '':
        questions.append({
            'text': answer_text_so_far,
            'section': section_title,
            'question': question_title,
        })

    return questions
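
The grouping logic in `read_faq` can be tried without downloading anything. The following is a minimal sketch, not the course code: `group_paragraphs` is a hypothetical helper that takes plain `(style, text)` tuples in place of `doc.paragraphs`, and the sample entries are made up. As a design choice that differs from the cell above, it also closes the pending question when a new section heading appears, so the last question of a section keeps its own section title.

```python
def group_paragraphs(paragraphs):
    """Group (style_name, text) tuples into question records.

    `paragraphs` is a stand-in for python-docx paragraphs (an assumption
    made so the sketch runs without a .docx file).
    """
    questions = []
    section_title = ''
    question_title = ''
    answer_lines = []

    def flush():
        # Emit a record only when section, question and answer are all present.
        answer = '\n'.join(answer_lines).strip()
        if answer and section_title and question_title:
            questions.append({
                'text': answer,
                'section': section_title,
                'question': question_title,
            })

    for style, text in paragraphs:
        text = text.strip().strip('\ufeff')
        if not text:
            continue
        style = style.lower()
        if style == 'heading 1':
            # New section: close the pending question first.
            flush()
            answer_lines = []
            question_title = ''
            section_title = text
        elif style == 'heading 2':
            # New question: close the previous one.
            flush()
            answer_lines = []
            question_title = text
        else:
            answer_lines.append(text)

    flush()
    return questions


# Made-up sample, not taken from the FAQ.
sample = [
    ('Heading 1', 'General'),
    ('Heading 2', 'When does it start?'),
    ('Normal', 'In June.'),
    ('Normal', 'Check the calendar.'),
    ('Heading 1', 'Module 1'),
    ('Heading 2', 'What is RAG?'),
    ('Normal', 'Retrieval-augmented generation.'),
]
records = group_paragraphs(sample)
print(len(records))
```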

In [3]:
faq_documents = {
    # LLM Version 1
    'llm-zoomcamp': '1qZjwHkvP0lXHiE4zdbWyUXSVfmVGzougDD6N37bat3E',
    # LLM Version 2
    # 'llm-zoomcamp': '1m2KexowAXTmexfC5rVTCSnaShvdUQ8Ag2IEiwBDHxN0',
}

In [4]:
documents = []

for course, file_id in faq_documents.items():
    print(course)
    course_documents = read_faq(file_id)
    documents.append({'course': course, 'documents': course_documents})

llm-zoomcamp


In [5]:
len(documents)

1

### Q3. Chunking
We don't really need to do any chunking because our documents already have well-specified boundaries. So we just need to return the documents without any changes.

So let's go to the transformation part and add a custom code chunking block:
```python
documents = []

for doc in data['documents']:
    doc['course'] = data['course']
    # previously we used just "id" for document ID
    doc['document_id'] = generate_document_id(doc)
    documents.append(doc)

print(len(documents))

return documents
```
Here ```data``` is the input parameter to the transformer.

The function ```generate_document_id``` is defined in the same way as in module 4:
```python
import hashlib

def generate_document_id(doc):
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    hash_object = hashlib.md5(combined.encode())
    hash_hex = hash_object.hexdigest()
    document_id = hash_hex[:8]
    return document_id
```
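Because the ID is a hash of the course, the question, and only the first 10 characters of the answer, it stays stable across pipeline runs as long as those parts don't change. The sketch below illustrates this with made-up entries (not taken from the FAQ): extending the end of an answer keeps the ID, while rewriting its beginning produces a new one.

```python
import hashlib


def generate_document_id(doc):
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    return hashlib.md5(combined.encode()).hexdigest()[:8]


# Made-up sample entry.
original = {
    'course': 'llm-zoomcamp',
    'question': 'Can I join late?',
    'text': 'Yes, you can join late.',
}
# Same entry with text appended after the first 10 characters.
extended = {**original, 'text': original['text'] + ' See the syllabus.'}
# Same entry with the beginning of the answer rewritten.
rewritten = {**original, 'text': 'No, ' + original['text']}

id_original = generate_document_id(original)
id_extended = generate_document_id(extended)
id_rewritten = generate_document_id(rewritten)

print(id_original == id_extended)   # the hashed prefix is unchanged
print(id_original == id_rewritten)  # the hashed prefix differs
```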
Note: if instead of a single dictionary you get a list, add a for loop:
```
for course_dict in data:
    ...
```
You can check the type of data with this code:

```print(type(data))```
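
If you would rather not branch on the type by hand, one option is to normalize the input first. This is only a sketch: `iter_course_dicts` and `flatten` are hypothetical helper names, the sample data is made up, and the ```document_id``` step is left out for brevity.

```python
def iter_course_dicts(data):
    """Yield course dictionaries whether `data` is one dict or a list of them."""
    if isinstance(data, dict):
        yield data
    else:
        yield from data


def flatten(data):
    documents = []
    for course_dict in iter_course_dicts(data):
        for doc in course_dict['documents']:
            # Copy the record so the input is not mutated, and attach the course.
            documents.append({**doc, 'course': course_dict['course']})
    return documents


# Made-up input in the same shape the ingestion block returns.
single = {
    'course': 'llm-zoomcamp',
    'documents': [
        {'text': 'A1', 'section': 'S', 'question': 'Q1'},
        {'text': 'A2', 'section': 'S', 'question': 'Q2'},
    ],
}

from_dict = flatten(single)
from_list = flatten([single])
print(len(from_dict), len(from_list))
```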

How many documents (chunks) do we have in the output?

* 66
* 76
* 86
* 96

In [6]:
type(documents[0])

dict

In [7]:
data = documents[0]

In [8]:
def generate_document_id(doc):
    combined = f"{doc['course']}-{doc['question']}-{doc['text'][:10]}"
    hash_object = hashlib.md5(combined.encode())
    hash_hex = hash_object.hexdigest()
    document_id = hash_hex[:8]
    return document_id

In [9]:
def chunking():
    documents = []
    
    for doc in data['documents']:
        doc['course'] = data['course']
        # previously we used just "id" for document ID
        doc['document_id'] = generate_document_id(doc)
        documents.append(doc)
    
    print(len(documents))
    
    return documents

In [10]:
documents = chunking()

86


### Tokenization and embeddings
We don't need any tokenization, so we skip it.

Because it is currently required in Mage, we can create a dummy code block:

* Create a custom code block
* Don't change it

Because we will use text search, we also don't need embeddings, so skip them too.

If you want to use sentence transformers (the ones from module 3), you don't need tokenization but do need embeddings. They are not needed for this homework.

### Q4. Export
Now we're ready to index the data with Elasticsearch. For that, we use the Export part of the pipeline:

* Go to the Export part
* Select vector databases -> Elasticsearch
* Open the code for editing

Because we will use regular text search instead of vector search, we need to adjust the code.

First, let's change the line where we read the index name:

```index_name = kwargs.get('index_name', 'documents')```

Rename the variable to ```index_name_prefix```; we will append the date and time of the pipeline run to it:
```python
from datetime import datetime

index_name_prefix = kwargs.get('index_name', 'documents')
current_time = datetime.now().strftime("%Y%m%d_%M%S")
index_name = f"{index_name_prefix}_{current_time}"
print("index name:", index_name)
```
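Note that the format string uses ```%M%S```, which is minutes and seconds, so the hour is not part of the index name. The sketch below makes that visible with a fixed, made-up timestamp; `build_index_name` is a hypothetical helper, not part of the Mage template.

```python
from datetime import datetime


def build_index_name(prefix, now):
    # %Y%m%d is the date; %M%S is minutes and seconds (no hour).
    return f"{prefix}_{now.strftime('%Y%m%d_%M%S')}"


# Fixed timestamps so the result is reproducible.
example = build_index_name('documents', datetime(2024, 8, 20, 13, 49, 46))
one_hour_later = build_index_name('documents', datetime(2024, 8, 20, 14, 49, 46))

print(example)                    # documents_20240820_4946
print(example == one_hour_later)  # True
```

Two runs on the same day that start at the same minute and second of different hours would therefore get the same index name. For this homework that is unlikely to matter.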
We will need to save the name in a global variable, so it can be accessible in other code blocks
```python
from mage_ai.data_preparation.variable_manager import set_global_variable

set_global_variable('YOUR_PIPELINE_NAME', 'index_name', index_name)
```
Here ```YOUR_PIPELINE_NAME``` is the name of your pipeline, e.g. ```transcendent_nexus``` (replace spaces with underscores).

Replace index settings with the settings we used previously:
```python
index_settings = {
    "settings": {
        "number_of_shards": number_of_shards,
        "number_of_replicas": number_of_replicas
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "document_id": {"type": "keyword"}
        }
    }
}
```
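With default dynamic mapping, Elasticsearch should still accept a field that is missing from ```properties```, but it infers the type itself instead of using the ```keyword``` type declared here. A quick way to catch a mismatch between the mapping and the documents is sketched below; `unmapped_fields` is a hypothetical helper, and the shard and replica values and the sample document are assumptions for illustration.

```python
def unmapped_fields(index_settings, doc):
    """Return the document keys that have no explicit mapping."""
    mapped = set(index_settings["mappings"]["properties"])
    return sorted(set(doc) - mapped)


index_settings = {
    # Assumed values; in the Mage block these come from the block's variables.
    "settings": {"number_of_shards": 1, "number_of_replicas": 0},
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "document_id": {"type": "keyword"},
        }
    },
}

# Made-up document with the fields produced by the chunking block.
sample_doc = {
    "text": "Answer",
    "section": "Section",
    "question": "Question",
    "course": "llm-zoomcamp",
    "document_id": "abc12345",
}

missing = unmapped_fields(index_settings, sample_doc)
stray = unmapped_fields(index_settings, {**sample_doc, "id": "abc12345"})
print(missing, stray)
```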
Remove the embeddings line:
```python
if isinstance(document[vector_column_name], np.ndarray):
    document[vector_column_name] = document[vector_column_name].tolist()
```
At the end (outside of the indexing for loop), print the last document:
```python
print(document)
```
Now execute the block.

What's the last document id?

Also note the index name.

In [11]:
es_client = Elasticsearch('http://localhost:9200') 

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "document_id": {"type": "keyword"},
        }
    }
}

index_name_prefix = "documents"
current_time = datetime.now().strftime("%Y%m%d_%M%S")
index_name = f"{index_name_prefix}_{current_time}"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'documents_20240820_4946'})

In [12]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/86 [00:00<?, ?it/s]

In [13]:
doc

{'text': 'Answer',
 'section': 'Workshops: X',
 'question': 'Question',
 'course': 'llm-zoomcamp',
 'document_id': 'd8c4c7bb'}

### Q5. Testing the retrieval
Now let's test the retrieval. Use Mage or a Jupyter notebook to test it.

Let's use the following query: "When is the next cohort?"

What's the ID of the top matching result?
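
One way to make this step easier to check is to separate building the query from sending it. The sketch below assumes the same ```multi_match``` query used earlier in the course; `build_search_query` and `top_document_id` are hypothetical helper names, and `fake_response` is a made-up stand-in that only mimics the ```hits``` structure of a search response.

```python
def build_search_query(query, course=None, size=5):
    """Build the request body; the course filter is added only when given."""
    bool_query = {
        "must": {
            "multi_match": {
                "query": query,
                # The question field is boosted 3x relative to text and section.
                "fields": ["question^3", "text", "section"],
                "type": "best_fields",
            }
        }
    }
    if course is not None:
        bool_query["filter"] = {"term": {"course": course}}
    return {"size": size, "query": {"bool": bool_query}}


def top_document_id(response):
    """Return the document_id of the best hit, or None when there are no hits."""
    hits = response["hits"]["hits"]
    return hits[0]["_source"]["document_id"] if hits else None


body = build_search_query("When is the next cohort?", course="llm-zoomcamp")

# Made-up response with placeholder IDs, for illustration only.
fake_response = {
    "hits": {
        "hits": [
            {"_source": {"document_id": "aaaa1111", "question": "Q1"}},
            {"_source": {"document_id": "bbbb2222", "question": "Q2"}},
        ]
    }
}
print(top_document_id(fake_response))
```

With a real client, the body would be passed to the search call and the response handed to `top_document_id` in the same way.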

In [14]:
def elastic_search(query, course='llm-zoomcamp'):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                # "filter": {
                #     "term": {
                #         "course": course
                #     }
                # }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)
    
    result_docs = []
    
    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])
    
    return result_docs

In [15]:
results = elastic_search(
    query="When is the next cohort?"
)
results


[{'text': 'Summer 2025 (via Alexey).',
  'section': 'General course-related questions',
  'question': 'When will the course be offered next?',
  'course': 'llm-zoomcamp',
  'document_id': 'bf024675'},
 {'text': 'Cosine similarity is a measure used to calculate the similarity between two non-zero vectors, often used in text analysis to determine how similar two documents are based on their content. This metric computes the cosine of the angle between two vectors, which are typically word counts or TF-IDF values of the documents. The cosine similarity value ranges from -1 to 1, where 1 indicates that the vectors are identical, 0 indicates that the vectors are orthogonal (no similarity), and -1 represents completely opposite vectors.',
  'section': 'Module 3: X',
  'question': 'What is the cosine similarity?',
  'course': 'llm-zoomcamp',
  'document_id': 'ee355823'},
 {'text': 'The error indicates that you have not changed all instances of “employee_handbook” to “homework” in your pipelin

### Q6. Reindexing
Our FAQ document changes: every day course participants add new records or improve existing ones.

Imagine that some time has passed and the document has changed. For that we have another version of the FAQ document: version 2.

The ID of this document is ```1T3MdwUvqCL3jrh3d3VCXQ8xE0UqRzI3bfgpfBq3ZWG0```.

Let's re-execute the entire pipeline with the updated data.

For the same query "When is the next cohort?", what's the ID of the top matching result?