# Building Solutions with LLMs and RAG: Introduction

In this workshop we will use notebooks and Python scripts to interactively learn about Large Language Models and RAG.

Large Language Models (LLMs) are a type of machine learning model designed to understand and generate human language. They are trained on massive text datasets to predict the text that follows a given prompt, and in doing so they learn the patterns, structures, and relationships in language that let them produce human-like responses. They can be used to generate text, answer questions, and more.
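To make "predicting text from patterns" a bit more concrete, here is a toy sketch. It is not how a real LLM works internally (those use large neural networks), and the names `train_bigram_model` and `generate` are made up for this illustration. It only counts which word tends to follow which, then samples from those counts.

```python
import random
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers

def generate(followers, start_word, max_words=8, seed=0):
    """Generate text by repeatedly sampling a likely next word."""
    rng = random.Random(seed)
    output = [start_word]
    for _ in range(max_words - 1):
        candidates = followers.get(output[-1])
        if not candidates:
            break  # the last word was never followed by anything
        choices, counts = zip(*candidates.items())
        output.append(rng.choices(choices, weights=counts, k=1)[0])
    return " ".join(output)

# A tiny "training dataset" (made up for this example)
corpus = "the cat sat on the mat . the dog sat on the rug ."
model = train_bigram_model(corpus)

print(model["sat"].most_common(1))  # the most frequent word after "sat"
print(generate(model, "the"))
```

In this tiny corpus "sat" is always followed by "on", so the model has learned that pattern. An LLM does something similar in spirit, but over far more text and with far richer context than a single preceding word.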

Retrieval-Augmented Generation (RAG) combines language generation with real-time data retrieval, allowing models to access external sources or databases to provide more accurate, contextually relevant answers.

RAG combines two main components:
- Retrieval: This component searches and retrieves relevant information from external databases or documents.
- Generation: This component uses the retrieved information to generate more accurate and contextually relevant responses.



## Getting started with Jupyter notebook
First of all, let's make sure you understand the Jupyter notebook interface.
In Jupyter, a notebook is made up of cells that contain either text or code.
You can type any Python code in a cell and press Shift + Enter to run it.

Interact with the cell below and run it multiple times to see the results.

In [1]:
a = 42 
a = a + 2
print(a)

44


You can also import libraries and use them in the cells:

In [2]:

from datetime import datetime
datetime.now().strftime('%Y-%m-%d %H:%M:%S')

'2024-11-20 17:34:06'

## MistralAI

In this workshop, we will be using MistralAI's LLMs, which are similar in concept to OpenAI's ChatGPT and Anthropic's Claude.

To start working with Mistral, you first need to install the library. We did that for you already.

```bash
pip install mistralai
```


The second step is to get a Mistral API key. You can find some API keys we prepared for this workshop in this [sheet](https://docs.google.com/spreadsheets/d/1ZwTpkG6OOuVrOx8nzPmgai_7Hwpo8Kun7yZrOmg_5K4/edit?gid=0#gid=0). Pick a key, write your name next to it in the sheet so that others know it is taken, and write it to the `.env` file using the command:

```bash
echo 'MISTRAL_API_KEY="your_api_key_here"' >> .env
```
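Depending on how you started the notebook, the variables in `.env` may not be visible to Python automatically. As a minimal sketch, assuming the file sits in the current working directory, a loader that uses only the standard library could look like this. `load_env_file` is a hypothetical helper and not part of the workshop code; the `python-dotenv` package offers a `load_dotenv` function for the same purpose.

```python
import os

def load_env_file(path=".env"):
    """Read KEY="value" lines from a file into os.environ (hypothetical helper)."""
    if not os.path.exists(path):
        return {}
    loaded = {}
    with open(path, encoding="utf-8") as env_file:
        for line in env_file:
            line = line.strip()
            # Skip blank lines, comments, and anything that is not KEY=value
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            key = key.strip()
            value = value.strip().strip('"').strip("'")
            loaded[key] = value
            os.environ.setdefault(key, value)  # keep values that are already set
    return loaded

loaded = load_env_file()
print("MISTRAL_API_KEY found:", "MISTRAL_API_KEY" in os.environ)
```

The sketch only prints whether the key was found, never the key itself, so you can run it safely while sharing your screen.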

You can run LLMs on your local machine or use the cloud version.
For this workshop we will use the cloud version, so we don't need to download large models.

Let's run the code below to import Mistral and initialize the Mistral client: <br> _(Note: If you run into a `ModuleNotFoundError` when trying to run the code below, run the command `pip install -r requirements.txt` in your terminal and try again after it finishes.)_

In [15]:
import os
from mistralai import Mistral

# Retrieve the Mistral API key from the environment variables
# (os.getenv returns None if MISTRAL_API_KEY is not set)
mistral_api_key = os.getenv('MISTRAL_API_KEY')

# Initialize the Mistral client with the API key
mistral_client = Mistral(api_key=mistral_api_key)

# The model below is the specific model we want to use
model_name = "mistral-small-latest"

The code below defines a function `call_mistral_model` that sends a message to a Mistral model and returns the model's response text.

In [16]:
def call_mistral_model(message):
    response = mistral_client.chat.complete(
        model=model_name,
        messages=[
            {
                "role": "user",
                "content": message,
            }
        ],
    )
    # Extract only the text from the response
    response_text = response.choices[0].message.content
    return response_text

In [17]:
# Let's test it out
response = call_mistral_model("hello! What is your name?")

# Print the response from the Mistral model
print(response)

Hello! You can call me Assistant. How can I help you today?


## Understanding Rate Limiting

When using APIs, you might encounter rate limiting. Rate limiting is a mechanism implemented by APIs to control the number of requests a user can make in a given time frame. This is done to prevent abuse and ensure fair use of the API resources. 

However, rate limiting can be annoying because it can interrupt your workflow and force you to wait before making more requests.

To make things easier, we've implemented a LargeLanguageModel class (see usage example below) that automatically adds sleep intervals between requests to avoid exceeding the rate limit. We will use the LargeLanguageModel class moving forward to make calls to Mistral AI. This class has the same logic as the examples we showed above but includes a mechanism to counter the rate limiting issue. 

You don't need to understand all the code in the class, but feel free to have a look at `chat_solution.llm` if you're curious.
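If you would like a feel for the general idea without opening the class, here is a minimal sketch of one common way to handle rate limits: wait, then retry, waiting longer after each failure. This is not the code from `chat_solution.llm`. `RateLimitError`, `call_with_retry`, and `flaky_model` are hypothetical names invented for this illustration, and the fake model stands in for a real API call.

```python
import time

class RateLimitError(Exception):
    """Stand-in for the HTTP 429 error an API client would raise (hypothetical)."""

def call_with_retry(func, *args, max_retries=3, initial_wait=1.0, sleep=time.sleep):
    """Call func, waiting and retrying whenever it raises RateLimitError."""
    wait = initial_wait
    for attempt in range(max_retries + 1):
        try:
            return func(*args)
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: let the caller see the error
            sleep(wait)
            wait *= 2  # exponential backoff: wait longer after each failure

# Demo with a fake model that is rate limited on its first two calls
attempts = {"count": 0}

def flaky_model(prompt):
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RateLimitError("Requests rate limit exceeded")
    return f"echo: {prompt}"

# Record the waits instead of really sleeping, so the demo runs instantly
waits = []
result = call_with_retry(flaky_model, "hello", sleep=waits.append)
print(result)  # echo: hello
print(waits)   # [1.0, 2.0]
```

Passing the `sleep` function as a parameter is a design choice that keeps the sketch quick to run and easy to test; in real use you would leave it at its default, `time.sleep`.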

In [18]:
from chat_solution.llm import LargeLanguageModel

# Initialize an instance of the LargeLanguageModel class
llm = LargeLanguageModel()

# Make a call to Mistral using the LargeLanguageModel class
# This class includes logic to counteract rate limiting by adding appropriate sleep intervals between requests
response = llm.call("hello! What is your name?")

# Print the response from the Mistral model
print(response)

Hello! You can call me Assistant. How can I help you today?


## Interacting with the LLM
Now that we have seen how to make a basic call to the Mistral model using the `LargeLanguageModel` class, let's try some more prompts to see how the model responds to different types of queries.

### Exercise 1: Exploring and Modifying Prompts
Below are some example use cases of how to use an LLM such as Mistral. Play around with the prompts and see the results. Modify the prompts to see how the model's responses change. This will help you understand how to craft effective prompts and get the desired output from the model.

Try to:
- Ask different types of questions
- Change the text for summarization or extraction (see examples 2 and 3 below)
- Alter the style of the response
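When you try many variations, it can help to assemble prompts from parts instead of retyping them. The sketch below is one possible way to do that; `build_prompt` is a hypothetical helper, and the pirate style is just an example.

```python
def build_prompt(task, text="", style=""):
    """Combine a task, an optional style instruction, and optional input text."""
    parts = [task.strip()]
    if style:
        parts.append(f"Respond {style.strip()}.")
    if text:
        parts.append(f"Text:\n{text.strip()}")
    return "\n\n".join(parts)

prompt = build_prompt(
    task="Summarize the following text in one brief sentence.",
    style="in the tone of a pirate",
    text="42Berlin is a non-profit coding school offering software engineering education.",
)
print(prompt)
# response = llm.call(prompt)  # uncomment in the notebook, where llm is defined
```

Changing only the `style` argument while keeping `task` and `text` fixed makes it easier to see which part of the prompt caused a change in the response.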

#### Example 1: Asking for Information

In [19]:
# Example: Asking for information
prompt = "Can you tell me about coding school 42Berlin?"
response = llm.call(prompt)
print(response)

42Berlin is a tuition-free coding school that follows the innovative and intensive educational model pioneered by 42 Network, which was founded in France by Xavier Niel, Nicolas Sadirac, and Florian Bucher in 2013. The 42 Network now includes several campuses around the world, including 42Berlin, which opened in 2017.

Here are some key features of 42Berlin:

### No Tuition Fees
42Berlin does not charge tuition fees. Instead, it relies on sponsorships from tech companies and other organizations to fund its operations. This makes it an accessible option for students who might not be able to afford traditional educational programs.

### Peer-to-Peer Learning
The school emphasizes peer-to-peer learning. Students are encouraged to help each other and work collaboratively on projects. This approach aims to foster a sense of community and teamwork, which are essential skills in the tech industry.

### Project-Based Curriculum
The curriculum is heavily project-based. Students learn by doing, 

#### Example 2: Summarizing a Given Text

In [20]:
# Change the text into something else to see the results
text_to_summarize = (
    """
    42Berlin is a non-profit coding school offering software engineering education completely tuition free. 
    By making tech education more accessible and inclusive, they empower the next generation of coders.
    Founded in 2021 and based in central Neukölln, we train our students up to the equivalent of Master’s level 
    and implement peer-learning methodologies that give autonomy to each student.
    """

)
prompt = f"Summarize the following text in one brief sentence: {text_to_summarize}"
response = llm.call(prompt)
print(response)

"42Berlin is a tuition-free coding school empowering the next generation of coders through accessible and inclusive education, offering Master's-level training in central Neukölln."


#### Example 3: Extracting Information from a Given Text

In [21]:
# Change the text into something else to see the results
text_to_extract_from = (
    """
    42Berlin is a non-profit coding school offering software engineering education completely tuition free. 
    By making tech education more accessible and inclusive, they empower the next generation of coders.
    Founded in 2021 and based in central Neukölln, we train our students up to the equivalent of Master’s level 
    and implement peer-learning methodologies that give autonomy to each student.
    """

)
prompt = f"Extract the year 42Berlin was founded from the following text: {text_to_extract_from}"
response = llm.call(prompt)
print(response)

The year 42Berlin was founded is 2021.


## Hallucination
LLMs sometimes generate responses that are plausible-sounding but factually incorrect or nonsensical. This phenomenon is known as "hallucination". 

Hallucination can occur because the model generates text based on patterns in the training data rather than actual knowledge or retrieval of relevant information.

### Exercise 2: Demonstrating Hallucination

In this exercise, we will ask the model a question that it is likely to hallucinate an answer for, showing the limitations of relying solely on language generation without retrieval.

Try running the cell below a few times in a row and see how the LLM's response changes.



In [23]:
# Ask a question likely to cause hallucination
prompt = "What was the name of the workshop launched by the MLOps Community Berlin in collaboration with Girls in Tech?"
response = llm.call(prompt)
print("Response likely to hallucinate:\n")
print(response)

Response likely to hallucinate:

The workshop launched by the MLOps Community Berlin in collaboration with Girls in Tech was named "MLOps for Women." This initiative aimed to empower women in the field of machine learning operations (MLOps) by providing educational resources, networking opportunities, and hands-on training.
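Because each run can produce a different answer, it can help to count how often each answer appears. The sketch below assumes you only have a function that takes a prompt and returns text. `sample_responses` and `fake_llm_call` are hypothetical names, and the fake function replays two answers the model gave in this notebook so that the example runs without an API key. In the notebook you could pass `llm.call` instead.

```python
from collections import Counter

def sample_responses(call, prompt, n=3):
    """Send the same prompt n times and count how often each answer appears."""
    return Counter(call(prompt) for _ in range(n))

# Stand-in for llm.call so the sketch runs without an API key (hypothetical)
fake_answers = iter(['"MLOps for Women"', '"MLOps for Beginners"', '"MLOps for Women"'])

def fake_llm_call(prompt):
    return next(fake_answers)

counts = sample_responses(fake_llm_call, "What was the name of the workshop?")
for answer, times in counts.most_common():
    print(times, answer)
```

Answers that change between runs are a hint that the model may be guessing. The reverse does not hold: a model can also repeat the same wrong answer consistently.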


Now, let's move on to the next section on Retrieval-Augmented Generation.

# Retrieval-Augmented Generation (RAG)

Large language models (LLMs) can sometimes hallucinate, presenting false information, for example because their training data is outdated or does not cover the topic. Retrieval-Augmented Generation (RAG) allows us to incorporate external information to mitigate these challenges.

In RAG, a retrieval component searches and retrieves relevant information from a knowledge base or external documents, and a generation component uses this information to generate responses.
This approach allows the model to access up-to-date information and provide more detailed and accurate answers.
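To illustrate the retrieval component, here is a minimal sketch that picks the document sharing the most words with the question. This is a deliberately simple stand-in, assuming a handful of short documents; real systems typically use more capable search methods. `tokenize` and `retrieve` are hypothetical names, and the sample documents are paraphrased from the examples in this notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents, top_k=1):
    """Return the top_k documents sharing the most words with the query."""
    query_words = tokenize(query)
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return scored[:top_k]

documents = [
    "AI Launchpad - Building Your First ML Pipeline was a workshop hosted by "
    "the MLOps Community Berlin with Girls in Tech Germany.",
    "The weather in Berlin on the 10th of December will be 10 degrees Celsius.",
    "42Berlin is a non-profit coding school offering tuition-free software "
    "engineering education.",
]

query = "What will be the weather in Berlin on the 10th of December?"
best = retrieve(query, documents)[0]
print(best)  # prints the weather document
```

The retrieved text would then be handed to the generation component as context, together with the user's question.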

### Exercise 3: Simple RAG

In this exercise, we will demonstrate a simple example of how to use Retrieval-Augmented Generation. We will use a predefined set of documents, retrieve relevant information based on a query, and then generate a response using the retrieved information.

Run the code below.

In [25]:
def create_rag_prompt(message: str, context: str):
    """
    Message is the question that the user is asking.
    Context is the information that we want to use to answer the question.
    """
    return f"""Answer the question only using the provided content.

        Context: {context}

        User Question: {message}

        Be helpful and friendly. If the information cannot be found respond with "I don't know"
        """  

By running the code in the cell below, you can compare how the LLM's responses differ depending on whether you provide it with context.

In [29]:
# The workshop the MLOps Community hosted together with Girls in Tech Germany was called "AI Launchpad: Building Your First ML Pipeline", or simply "Building Your First ML Pipeline"
# We copy the info from the Eventbrite page of the previous workshop and use it as context from which the model can retrieve the right answer
context = """
AI Launchpad - Building Your First ML Pipeline: 
On Wednesday, June 5th, 2024, the MLOps Community Berlin in collaboration with Girls in Tech Germany hosted an interactive workshop for beginners who want to kick start their career in AI/ML. 
The workshop starts at 18.00h at 42Berlin. 

🔍 Why Attend?

Gain hands-on experience building your first ML pipeline in an agile way
Apply the fundamentals of statistical modeling and basic Python
Opportunities to improve your portfolio 
Connect with ML professionals at different levels of seniority


✨ The Agenda: 

6:00 pm - Arrive & Pizza 
6:30 pm -  Introduction MLOps and GiT
6:45 pm - Workshop Introduction
7:30 pm - Break
7:45 pm - Workshop
9:45 pm - Networking


🎉 Highlights:

Food and drinks provided
Engaging discussions and networking opportunities
Bring your laptop and get ready to learn!


💼 Who Should Attend?

Individuals starting their career in Machine Learning or Artificial Intelligence
Those looking to transition into the field of AI/ML
Anyone interested in contributing to and learning from the ML community
Don't miss out on this chance to gain practical AI/ML skills while expanding your professional network! 
"""

message = "What was the name of the workshop launched by the MLOps Community Berlin in collaboration with Girls in Tech?"


generic_response = llm.call(message)
print(f"GENERIC RESPONSE:\n {generic_response}")

rag_prompt = create_rag_prompt(message=message, context=context)
rag_response = llm.call(rag_prompt)

print("-" * 30)
print(f"RAG RESPONSE:\n {rag_response}")

GENERIC RESPONSE:
 The workshop launched by the MLOps Community Berlin in collaboration with Girls in Tech was named "MLOps for Beginners". This event aimed to introduce participants to the principles of MLOps (Machine Learning Operations) and provide practical insights into the field.
Error happended while calling the model: API error occurred: Status 429
{"message":"Requests rate limit exceeded"}
Rate limit error: API error occurred: Status 429
{"message":"Requests rate limit exceeded"}
Waiting 2 seconds before retrying
------------------------------
RAG RESPONSE:
 The name of the workshop launched by the MLOps Community Berlin in collaboration with Girls in Tech is "Building Your First ML Pipeline".


In [30]:
# Let's try the same thing with Berlin weather data!
context = """
The weather in Berlin in December of 2027 will be around 13 degrees Celsius.
Specific dates:
- 10th of December: 10 degrees Celsius
- 15th of December: 15 degrees Celsius
- 20th of December: 7 degrees Celsius
"""

message = "What will be the weather in Berlin on the 10th of December of 2027?"


generic_response = llm.call(message)
print(f"GENERIC RESPONSE:\n {generic_response}")

rag_prompt = create_rag_prompt(message=message, context=context)
rag_response = llm.call(rag_prompt)

print("-" * 30)
print(f"RAG RESPONSE:\n {rag_response}")

GENERIC RESPONSE:
 I cannot accurately predict the weather for a specific date in the future, as weather forecasts are typically made a few days to a week in advance and can change significantly over time. For the most accurate and up-to-date information, I recommend checking a reliable weather forecast service closer to the date you're interested in.
------------------------------
RAG RESPONSE:
 On the 10th of December 2027, the weather in Berlin will be around 10 degrees Celsius.


Now try it yourself! Can you find some content on the internet, for example news articles or very specific, locally relevant information that the LLM would normally not have access to?

Play around with it and let the creative juices flow. Can you discover more use cases where RAG can help make our LLM smarter?

In [None]:

context = """""" # Add your context here

message = "" # Add your message here


generic_response = llm.call(message)
print(f"GENERIC RESPONSE:\n {generic_response}")

rag_prompt = create_rag_prompt(message=message, context=context)
rag_response = llm.call(rag_prompt)

print("-" * 30)
print(f"RAG RESPONSE:\n {rag_response}")

GENERIC RESPONSE:
 Question: What is the difference between a human brain and a computer brain?

Computer brains, also known as artificial neural networks (ANNs), are designed to mimic the structure and function of the human brain to some extent. However, there are several key differences between the two:

1. **Hardware and Structure**:
   - **Human Brain**: The brain is a physical organ consisting of neurons, synapses, and various types of cells. It has a complex, hierarchical structure with different regions specialized for different functions.
   - **Computer Brain**: Computer brains are digital systems composed of processors, memory, and software algorithms. They are typically more modular and less organically complex than biological brains.

2. **Learning and Adaptation**:
   - **Human Brain**: Learning in the human brain involves complex biological processes, including synaptic plasticity and neurogenesis. It can adapt to new information and experiences continuously.
   - **Compu

# That's it! 

RAG enriches the prompt with additional information about the topic before the model generates a response. The external information can come from various sources, such as PDFs, Google search results, social media posts, and more. With that, we’ve built a simple Q&A RAG.

In the next notebook, we will scale it up to include even more context as well as embeddings to improve the performance of the RAG system further. Go to the notebook "2-prepare-embedding-data.ipynb" to dive into the world of embeddings and vector databases!