# Introduction to LLMs and RAG 

**LLMs** = Large Language Models

**RAG** = Retrieval Augmented Generation

Using RAG with an LLM to work with private data takes two steps. First, a search step retrieves the documents that best match a specific query. Then the LLM uses those documents as context to generate the most appropriate response.

In [1]:
!wget https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py

--2024-07-22 14:38:56--  https://raw.githubusercontent.com/alexeygrigorev/minsearch/main/minsearch.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3832 (3.7K) [text/plain]
Saving to: ‘minsearch.py’


2024-07-22 14:38:56 (2.61 MB/s) - ‘minsearch.py’ saved [3832/3832]



In [2]:
import minsearch

In [3]:
import json

In [4]:
import requests

docs_url = 'https://github.com/alexeygrigorev/llm-rag-workshop/raw/main/notebooks/documents.json'
docs_response = requests.get(docs_url)
documents_raw = docs_response.json()
# documents_raw
documents = []

for course in documents_raw:
    course_name = course['course']

    for doc in course['documents']:
        doc['course'] = course_name
        documents.append(doc)


In [5]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}
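
To make the flattening step concrete, here is a small sketch that applies the same idea to a made-up two-course sample. The sample data and the `flatten_documents` helper name are illustrative assumptions; only the nested shape (a list of courses, each with a `documents` list) is taken from the notebook.

```python
def flatten_documents(documents_raw):
    """Copy each course name onto its documents and return one flat list."""
    flat = []
    for course in documents_raw:
        for doc in course["documents"]:
            # Build a new dict so the input structure is left untouched.
            flat.append({**doc, "course": course["course"]})
    return flat


# Hypothetical sample with the same shape as documents.json.
sample_raw = [
    {
        "course": "course-a",
        "documents": [
            {"text": "Yes.", "section": "General", "question": "Can I join late?"},
            {"text": "Use Docker.", "section": "Setup", "question": "How do I install?"},
        ],
    },
    {
        "course": "course-b",
        "documents": [
            {"text": "Python 3.", "section": "Setup", "question": "Which language?"},
        ],
    },
]

flat_docs = flatten_documents(sample_raw)
print(len(flat_docs))
print(flat_docs[2]["course"])
```

Unlike the loop in the notebook, this variant does not modify the input dictionaries, which can be convenient if `documents_raw` is reused later.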

In [6]:
index = minsearch.Index(
    text_fields=["question", "text", "section"],
    keyword_fields=["course"]
)

In [7]:
q = 'the course has already started, can I still enroll?'

In [8]:
index.fit(documents)

<minsearch.Index at 0x7fb5bad46020>

In [9]:
def search(query):
    boost = {'question': 3.0, 'section': 0.5}

    search_results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )

    return search_results
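
The roles of `filter_dict` and `boost_dict` can be illustrated with a toy ranking function. This is only a sketch under simplifying assumptions: it scores by plain word overlap, whereas minsearch relies on TF-IDF based scoring, so the numbers will not match the library. The `toy_search` and `tokenize` names and the sample documents are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def toy_search(docs, query, text_fields, filter_dict=None, boost_dict=None, num_results=5):
    """Rank docs by boosted word overlap; a stand-in, not minsearch's algorithm."""
    filter_dict = filter_dict or {}
    boost_dict = boost_dict or {}
    query_words = tokenize(query)

    scored = []
    for doc in docs:
        # Keyword filter: every filter field must match exactly.
        if any(doc.get(field) != value for field, value in filter_dict.items()):
            continue
        # Text score: overlap per field, multiplied by that field's boost (default 1.0).
        score = sum(
            boost_dict.get(field, 1.0) * len(query_words & tokenize(doc.get(field, "")))
            for field in text_fields
        )
        if score > 0:
            scored.append((score, doc))

    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:num_results]]


# Hypothetical documents, shaped like the FAQ entries.
toy_docs = [
    {"question": "Can I still join the course?", "text": "Yes, you can.", "section": "General", "course": "de"},
    {"question": "Where are the videos?", "text": "You can join the course channel.", "section": "General", "course": "de"},
    {"question": "Can I still join the course?", "text": "Yes.", "section": "General", "course": "ml"},
]

hits = toy_search(
    toy_docs,
    "can I join the course",
    text_fields=["question", "text", "section"],
    filter_dict={"course": "de"},
    boost_dict={"question": 3.0, "section": 0.5},
)
print([hit["question"] for hit in hits])
```

The filter removes the `ml` document before any scoring happens, and the boost on `question` lets a match in the question outrank a longer match in the answer text.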

In [80]:
results = search(q)
results

[{'text': "Yes, even if you don't register, you're still eligible to submit the homeworks.\nBe aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.",
  'section': 'General course-related questions',
  'question': 'Course - Can I still join the course after the start date?',
  'course': 'data-engineering-zoomcamp'},
 {'text': 'Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.\nYou can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.',
  'section': 'General course-related questions',
  'question': 'Course - Can I follow the course after it finishes?',
  'course': 'data-engineering-zoomcamp'},
 {'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 202

In [None]:
!pip install groq

In [10]:
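import os

# `!export` runs in a temporary subshell and does not change the notebook kernel's
# environment, so set the variable from Python instead. Use your own key here and
# keep real keys out of the notebook.
os.environ["GROQ_API_KEY"] = "<your-groq-api-key>"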


In [58]:
def build_prompt(query):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    context = ""
    search_results = search(query)
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

In [18]:
context_template = """
Section: {section}
Question: {question}
Answer: {text}
""".strip()

context_docs = search(q)

context_result = ""

for doc in context_docs:
    doc_str = context_template.format(**doc)
    context_result += ("\n\n" + doc_str)

context = context_result.strip()
print(context)

Section: General course-related questions
Question: Course - Can I still join the course after the start date?
Answer: Yes, even if you don't register, you're still eligible to submit the homeworks.
Be aware, however, that there will be deadlines for turning in the final projects. So don't leave everything for the last minute.

Section: General course-related questions
Question: Course - Can I follow the course after it finishes?
Answer: Yes, we will keep all the materials after the course finishes, so you can follow the course at your own pace after it finishes.
You can also continue looking at the homeworks and continue preparing for the next cohort. I guess you can also start working on your final capstone project.

Section: General course-related questions
Question: Course - When will the course start?
Answer: The purpose of this document is to capture frequently asked technical questions
The exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start wit

In [14]:
# def llm2(prompt):
#     response = client.chat.completions.create(
#         model='gpt-4o',
#         messages=[{"role": "user", "content": prompt}]
#     )

#     return response.choices[0].message.content

In [28]:
# import libraries
from groq import Groq
#from dotenv import load_dotenv
import os

# load environment variables
# load_dotenv()

# create the Groq client; the key below is a placeholder, load the real one
# from an environment variable or a secrets store instead of hardcoding it
client = Groq(api_key="<your-groq-api-key>")


In [59]:
# the question to answer (it is otherwise first assigned in a later cell)
query = 'how do I run kafka?'

prompt = build_prompt(query)
response = client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="llama3-8b-8192"
)

# print the response
print(response.choices[0].message.content)

Based on the provided FAQs from Module 6: Streaming with Kafka, the question is how to run Kafka. 

Since the question is related to running a producer/consumer/kstreams, it seems to be referring to running Kafka in the terminal. 

From the context, we can see that to run a Java Kafka producer in the terminal, we would run:

java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java

If you are looking to run a Python Kafka code, make sure you have the necessary packages installed and create a virtual environment. Then, you can run your Python code by activating the virtual environment and using the following command: 

python your_python_file.py


In [35]:
def llm(prompt):
    response = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content":prompt,
            }
        ],
        model="llama3-8b-8192",
    )

    # return the generated text so that callers such as rag() receive it
    return response.choices[0].message.content

In [69]:
query = 'how do I run kafka?'

def rag(query):
    # build_prompt runs the search itself, so no separate search call is needed here
    prompt = build_prompt(query)
    answer = llm(prompt)
    return answer

In [70]:
rag(query)

Based on the context, I'll answer your question:

How do I run Kafka?

From the context, it seems that Kafka can be run in different ways depending on whether you're using Java or Python.

For Java, you can run a Kafka producer, consumer, or kstreams by running a Java program. For example, to run a Java Kafka producer, you would run:

```
java -cp build/libs/<jar_name>-1.0-SNAPSHOT.jar:out src/main/java/org/example/JsonProducer.java
```

in the project directory.

For Python, you need to use a virtual environment. To create a virtual environment and install the necessary packages, you would run:

```
python -m venv env
source env/bin/activate
pip install -r ../requirements.txt
```

Then, to run a Python Kafka program, you would activate the virtual environment and run your Python script.


In [38]:
rag('When will the course start?')

According to the CONTEXT, the course will start on the 15th of January 2024 at 17h00 with the first "Office Hours" live session.


## Implementing Elasticsearch

In [42]:
from elasticsearch import Elasticsearch

In [43]:
es_client = Elasticsearch('http://localhost:9200')

In [45]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"}
        }
    }
}

index_name = "course-questions"

es_client.indices.create(index=index_name, body=index_settings)

ObjectApiResponse({'acknowledged': True, 'shards_acknowledged': True, 'index': 'course-questions'})
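
The mapping above uses two field types: `text` fields are analyzed into tokens for full-text matching, while `keyword` fields are stored and compared as one exact string. The following toy sketch illustrates that difference in plain Python. It is an approximation for intuition only; Elasticsearch's real analyzers are more elaborate, and the `analyze`, `text_match`, and `keyword_match` helpers are hypothetical names.

```python
import re


def analyze(value):
    """Rough stand-in for a text analyzer: lowercase and split into word tokens."""
    return re.findall(r"[a-z0-9]+", value.lower())


def text_match(field_value, query):
    """A 'text' field matches when any analyzed query token appears in the field."""
    field_tokens = set(analyze(field_value))
    return any(token in field_tokens for token in analyze(query))


def keyword_match(field_value, term):
    """A 'keyword' field is compared as one exact, unanalyzed string."""
    return field_value == term


print(text_match("Course - When will the course start?", "START date"))
print(keyword_match("data-engineering-zoomcamp", "data-engineering-zoomcamp"))
print(keyword_match("data-engineering-zoomcamp", "Data-Engineering-Zoomcamp"))
```

This is why `course` is a good fit for `keyword`: it is only ever used for exact filtering, never for free-text search.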

In [46]:
documents[0]

{'text': "The purpose of this document is to capture frequently asked technical questions\nThe exact day and hour of the course will be 15th Jan 2024 at 17h00. The course will start with the first  “Office Hours'' live.1\nSubscribe to course public Google Calendar (it works from Desktop only).\nRegister before the course starts using this link.\nJoin the course Telegram channel with announcements.\nDon’t forget to register in DataTalks.Club's Slack and join the channel.",
 'section': 'General course-related questions',
 'question': 'Course - When will the course start?',
 'course': 'data-engineering-zoomcamp'}

In [47]:
from tqdm.auto import tqdm

In [48]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)

  0%|          | 0/948 [00:00<?, ?it/s]
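
Indexing one document per request works for the 948 documents here. For larger collections, the Python client provides `elasticsearch.helpers.bulk`, which accepts an iterable of actions. The sketch below only builds those actions so that it runs without a cluster; the `generate_actions` name and the sample documents are assumptions, and the actual bulk call is shown as a comment.

```python
def generate_actions(docs, index_name):
    """Yield one bulk action per document in the shape helpers.bulk accepts."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


# Hypothetical sample documents.
sample_docs = [
    {"text": "Yes.", "section": "General", "question": "Can I join late?", "course": "course-a"},
    {"text": "Use Docker.", "section": "Setup", "question": "How do I install?", "course": "course-a"},
]

actions = list(generate_actions(sample_docs, "course-questions"))
print(actions[0])

# With a running cluster and the `es_client` from the notebook, the actions
# could be sent in batches roughly like this (not executed here):
#
#   from elasticsearch.helpers import bulk
#   bulk(es_client, generate_actions(documents, index_name))
```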

In [93]:
query = 'I just discovered the course. Can I still join it?'

In [66]:
def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)

    result_docs = []

    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs
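
The request body in `elastic_search` hardcodes the course and the result count. A possible variation, sketched here under the assumption that you want to reuse the same query for other courses, is to build the body in a small helper. `build_search_query` and the `example-course` value are hypothetical; the query structure itself is copied from the cell above.

```python
def build_search_query(query, course, size=5):
    """Return the Elasticsearch request body for a boosted, course-filtered search."""
    return {
        "size": size,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        # ^3 weights matches in the question field three times higher.
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields",
                    }
                },
                # The term filter restricts hits without affecting the score.
                "filter": {"term": {"course": course}},
            }
        },
    }


body = build_search_query("Can I still join?", course="example-course", size=3)
print(body["query"]["bool"]["filter"])
```

The returned dictionary can be passed to `es_client.search` in place of the inline `search_query`.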

In [71]:
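def rag(query):
    search_results = elastic_search(query)
    # Note: the one-argument build_prompt defined earlier still calls the minsearch-based
    # search() internally, so search_results is not used yet. The two-argument
    # build_prompt in "Putting everything together" passes these results into the prompt.
    prompt = build_prompt(query)
    answer = llm(prompt)
    return answer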

In [74]:
user_question = "How do I join the course after it has started?"

In [75]:
rag(user_question)

Based on the provided CONTEXT, I can answer your question:

How do I join the course after it has started?

According to the FAQ, even if you don't register, you're still eligible to submit the homeworks. However, be aware that there will be deadlines for turning in the final projects. So, don't leave everything for the last minute.


In [73]:

es_client.info()

ObjectApiResponse({'name': '6cf67c209e00', 'cluster_name': 'docker-cluster', 'cluster_uuid': 'PoA3K2AMT2KfyWZ8HvLwCQ', 'version': {'number': '8.4.3', 'build_flavor': 'default', 'build_type': 'docker', 'build_hash': '42f05b9372a9a4a470db3b52817899b99a76ee73', 'build_date': '2022-10-04T07:17:24.662462378Z', 'build_snapshot': False, 'lucene_version': '9.3.0', 'minimum_wire_compatibility_version': '7.17.0', 'minimum_index_compatibility_version': '7.0.0'}, 'tagline': 'You Know, for Search'})

## Putting everything together

In [None]:
# import libraries
from groq import Groq
import os

# the key below is a placeholder; load the real one from an environment
# variable or a secrets store instead of hardcoding it
client = Groq(api_key="<your-groq-api-key>")

In [77]:
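# named minsearch_search because a function called `minsearch` would shadow the imported minsearch module
def minsearch_search(query):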
    boost = {'question': 3.0, 'section': 0.5}

    search_results = index.search(
        query=query,
        filter_dict={'course': 'data-engineering-zoomcamp'},
        boost_dict=boost,
        num_results=5
    )

    return search_results

In [78]:
def elastic_search(query):
    search_query = {
        "size": 5,
        "query": {
            "bool": {
                "must": {
                    "multi_match": {
                        "query": query,
                        "fields": ["question^3", "text", "section"],
                        "type": "best_fields"
                    }
                },
                "filter": {
                    "term": {
                        "course": "data-engineering-zoomcamp"
                    }
                }
            }
        }
    }

    response = es_client.search(index=index_name, body=search_query)

    result_docs = []

    for hit in response['hits']['hits']:
        result_docs.append(hit['_source'])

    return result_docs

In [79]:
def build_prompt(query,search_results):
    prompt_template = """
You're a course teaching assistant. Answer the QUESTION based on the CONTEXT from the FAQ database.
Use only the facts from the CONTEXT when answering the QUESTION.

QUESTION: {question}

CONTEXT:
{context}
""".strip()

    context = ""
    #search_results = search(query)
    for doc in search_results:
        context = context + f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}\n\n"

    prompt = prompt_template.format(question=query, context=context).strip()
    return prompt

In [80]:
def llm(prompt):
    response = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content":prompt,
            }
        ],
        model="llama3-8b-8192",
    )

    # return the generated text so that callers such as rag() receive it
    return response.choices[0].message.content

In [96]:
query = 'how do I run kafka?'

def rag(query):
    search_results = elastic_search(query)
    prompt = build_prompt(query, search_results)
    response = llm(prompt)
    return response


In [97]:
rag('When will the course start?')

Based on the provided context, the course will start on the 15th of January 2024 at 17h00.
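
Because the notebook ends up with two retrieval backends, one way to switch between them is to pass the search and LLM functions into `rag` as arguments. The sketch below shows that design with stand-in functions so it runs without Elasticsearch or an API key. `fake_search` and `echo_llm` are hypothetical, the prompt wording is shortened from the notebook's template, and this is one possible arrangement rather than the workshop's own code.

```python
def build_prompt(query, search_results):
    """Format the retrieved FAQ entries into a single prompt string."""
    context = "\n\n".join(
        f"section: {doc['section']}\nquestion: {doc['question']}\nanswer: {doc['text']}"
        for doc in search_results
    )
    return (
        "You're a course teaching assistant. Answer the QUESTION based on the "
        "CONTEXT from the FAQ database.\n\n"
        f"QUESTION: {query}\n\nCONTEXT:\n{context}"
    )


def rag(query, search_fn, llm_fn):
    """Run retrieval, prompt building, and generation with injected backends."""
    search_results = search_fn(query)
    prompt = build_prompt(query, search_results)
    return llm_fn(prompt)


# Hypothetical stand-ins so the sketch runs offline.
def fake_search(query):
    return [
        {
            "section": "General",
            "question": "When will the course start?",
            "text": "15th Jan 2024 at 17h00.",
        }
    ]


def echo_llm(prompt):
    # A real llm_fn would send the prompt to a model and return its reply.
    return f"[{len(prompt)} characters of prompt received]"


answer = rag("When will the course start?", search_fn=fake_search, llm_fn=echo_llm)
print(answer)
```

With the notebook's own functions, a call could look like `rag(query, elastic_search, llm)`, provided `llm` returns the generated text rather than only printing it.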

