## Retrieval

Retrieve the data relevant to a question using a search engine, in this case Elasticsearch.


In [None]:
# Download the documents:
!wget https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json

In [None]:
!head documents.json

In [None]:
import json
with open('./documents.json','rt') as f_in:
    document_file = json.load(f_in)

Combine each course name with its contents (question, answer text, and section) so that every document becomes a single dictionary.


In [None]:
documents = []
for course in document_file:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)

In [None]:
documents[0]
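To make the flattening step concrete, here is a minimal sketch on toy data. The `sample_file` records and the `flatten_documents` helper are hypothetical and only mirror the structure assumed for `documents.json`. Unlike the loop above, this variant copies each document instead of modifying it in place.

```python
# Toy data mirroring the structure assumed for documents.json (hypothetical content).
sample_file = [
    {
        "course": "demo-course",
        "documents": [
            {"text": "Yes, you can.", "section": "General", "question": "Can I join late?"},
            {"text": "Use pip.", "section": "Setup", "question": "How do I install?"},
        ],
    },
    {
        "course": "other-course",
        "documents": [
            {"text": "See the README.", "section": "General", "question": "Where do I start?"},
        ],
    },
]


def flatten_documents(document_file):
    """Return one flat list of documents, each tagged with its course name."""
    flat = []
    for course in document_file:
        for doc in course["documents"]:
            # Copy the document so the input data is left unchanged.
            flat.append({**doc, "course": course["course"]})
    return flat


flat_docs = flatten_documents(sample_file)
print(flat_docs[0])
```

Each resulting dictionary carries the `text`, `section`, `question`, and `course` keys, which are the four fields the index mapping expects.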

In [8]:
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
es.info()

ObjectApiResponse({'name': 'e4c9697379e4', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'CzLUDNkMQ3-b-L8NjdyUZw', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

In [None]:
!curl http://localhost:9200

### Index all the documents


In [None]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} 
        }
    }
}

index_name = "course-questions"
response = es.indices.create(index=index_name, body=index_settings)

response

In [None]:
from tqdm.auto import tqdm

for doc in tqdm(documents):
    es.index(index=index_name, document=doc)
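Indexing one document per request works, but it makes a round trip for every document. As a possible alternative, the sketch below prepares the documents as bulk actions. The `to_bulk_actions` helper and the sample documents are hypothetical. The commented-out call assumes the `bulk` helper from `elasticsearch.helpers` and the `es` client created earlier.

```python
def to_bulk_actions(docs, index_name):
    """Yield one bulk action per document, targeting the given index."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


# Hypothetical sample documents with the same fields as the real ones.
sample_docs = [
    {"text": "Yes, you can.", "section": "General",
     "question": "Can I join late?", "course": "demo-course"},
    {"text": "Use pip.", "section": "Setup",
     "question": "How do I install?", "course": "demo-course"},
]

actions = list(to_bulk_actions(sample_docs, "course-questions"))
print(actions[0])

# With a running cluster, the actions could be sent in a single call:
# from elasticsearch.helpers import bulk
# bulk(es, actions)
```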

### After indexing, we can retrieve the documents


In [5]:
user_question = "How do I join the course after it has started?"
search_query = {
    "size": 5,
    "query": {
        "bool": {
            "must": {
                "multi_match": {
                    "query": user_question,
                    "fields": ["question^3", "text", "section"],
                    "type": "best_fields"
                }
            },
            "filter": {
                "term": {
                    "course": "data-engineering-zoomcamp"
                }
            }
        }
    }
}

This query:

- Retrieves top 5 matching documents.
- Searches in the "question", "text", "section" fields, prioritizing "question" using `multi_match` query with type `best_fields` (see [here](https://github.com/DataTalksClub/llm-zoomcamp/blob/main/01-intro/elastic-search.md) for more information)
- Matches user query "How do I join the course after it has started?".
- Shows results only for the "data-engineering-zoomcamp" course.
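The four points above can be sketched as a small query builder. `build_search_query` is a hypothetical helper, not part of the Elasticsearch client. It only assembles the request body, so it runs without a cluster, and it exposes the course filter and result count as parameters.

```python
def build_search_query(question, course, size=5):
    """Assemble the request body for a boosted multi-field search."""
    return {
        "size": size,  # number of top hits to return
        "query": {
            "bool": {
                # Full-text match; "question^3" weights that field three times higher.
                "must": {
                    "multi_match": {
                        "query": question,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # Exact match on the keyword field; restricts hits to one course.
                "filter": {"term": {"course": course}},
            }
        },
    }


query = build_search_query(
    "How do I join the course after it has started?",
    course="data-engineering-zoomcamp",
)
print(query["query"]["bool"]["filter"])
```

The resulting dictionary has the same shape as `search_query` above and could be passed to `es.search` in the same way.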


In [None]:
response = es.search(index=index_name, body=search_query)
response['hits']['hits'][0]

In [None]:
for hit in response['hits']['hits']:
    doc = hit['_source']
    print(f"Section: {doc['section'].upper()}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text']}\n")
    

### Wrap the query in a function to keep things tidy


In [6]:
def retrieve_information(question, index_name='course-questions', max_results=5):
    # Connect to the local Elasticsearch instance
    es = Elasticsearch("http://localhost:9200")
    search_query = {
        "size": max_results,  # honor the max_results argument
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": question,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }
    response = es.search(index=index_name, body=search_query)
    # Keep only the stored document of each hit
    documents = [hit['_source'] for hit in response['hits']['hits']]
    return documents

In [9]:
user_question = "How do I join the course after it has started?"

response = retrieve_information(user_question)

In [10]:
response[0]

{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
 'section': 'General course-related questions',
 'question': 'Course - Can I still join the course after the start date?',
 'course': 'data-engineering-zoomcamp'}

In [11]:
for doc in response:
    print(f"Section: {doc['section']}")
    print(f"Question: {doc['question']}")
    print(f"Answer: {doc['text'][:60]}...\n")

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to su...

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishe...

Section: General course-related questions
Question: Course - What can I do before the course starts?
Answer: You can start by installing and setting up all the dependenc...

Section: General course-related questions
Question: How do I use Git / GitHub for this course?
Answer: After you create a GitHub account, you should clone the cour...

Section: Workshop 1 - dlthub
Question: How do I install the necessary dependencies to run the code?
Answer: Answer: To run the provided code, ensure that the 'dlt[duckd...



## Generation - Answering questions


In [26]:
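from openai import OpenAI
from dotenv import load_dotenv
import os

load_dotenv()  # load variables from a local .env file, if present

# Assumption: port 11434 and the "llama3" model point to a local Ollama server,
# which serves its OpenAI-compatible API under the /v1 path.
# The key is read from the environment rather than hard-coded in the notebook;
# the variable name OPENAI_API_KEY and the "ollama" fallback are assumptions.
client = OpenAI(base_url="http://localhost:11434/v1/",
                api_key=os.getenv("OPENAI_API_KEY", "ollama"))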

response = client.chat.completions.create(
    model = "llama3",
    messages = [{"role": "user",  "content": "The course already started. Can I still join?"}]
)
response.choices[0].message.content

APIConnectionError: Connection error.