# **All You Need to Know to Build Your First LLM App**
[A Step-by-Step Tutorial to Document Loaders, Embeddings, Vector Stores and Prompt Templates](https://towardsdatascience.com/all-you-need-to-know-to-build-your-first-llm-app-eb982c78ffac)

## **Fine-Tuning vs. Context Injection**
In general, there are two fundamentally different approaches to enabling a large language model to answer questions about things it cannot know: **model fine-tuning** and **context injection**.

### **Fine-Tuning**
Fine-tuning refers to training an existing language model with additional data to optimise it for a specific task.

Fine-tuning of **PLLMs (Pre-trained Language Models)** is a way to adjust the model for a specific task, but it *doesn’t really allow you to inject your own domain knowledge into the model*. This is because the model has already been trained on a massive amount of general language data, and your specific domain data is usually not enough to override what the model has already learned.

So, when you fine-tune the model, it might occasionally provide correct answers, but it will often fail because it heavily relies on the information it learned during pre-training, which might not be accurate or relevant to your specific task. In other words, **fine-tuning helps the model adapt to `HOW` it communicates, but not necessarily `WHAT` it communicates. (Porsche AG, 2023)**

### **Context Injection/In-context learning**
When using context injection, we are **not modifying the LLM**. Instead, we focus on **the prompt itself** and inject relevant context into it.

So we need to think about how to provide the prompt with the right information. In the figure below, you can see schematically how the whole thing works. We need a process that is able to identify the most relevant data. To do this, we need to enable our computer to compare text snippets with each other.

![context_injection_process](resources/context_injection_process.png)

 
This can be done with `embeddings`. With embeddings, we translate **text into vectors**, allowing us to represent text in a multidimensional embedding space. Texts whose vectors lie close to each other in this space tend to be used in similar contexts. To keep the similarity search fast, we store our vectors in a vector database and index them.
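
To make the idea of comparing text snippets concrete, here is a minimal sketch. It does not use a real embedding model: `toy_embed` is a hypothetical stand-in that simply counts words, and the three snippets are made up for illustration. The ranking step, however, is the same one a real pipeline would perform on learned embeddings.

```python
import math
from collections import Counter
from typing import List


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_relevant(snippets: List[str], question: str, k: int = 1) -> List[str]:
    """Return the k snippets whose vectors are closest to the question vector."""
    question_vector = toy_embed(question)
    ranked = sorted(
        snippets,
        key=lambda s: cosine_similarity(toy_embed(s), question_vector),
        reverse=True,
    )
    return ranked[:k]


# made-up snippets, only for illustration
snippets = [
    "the cat sat on the mat",
    "gpt-4 was released in march 2023",
    "bread is baked in an oven",
]
best = most_relevant(snippets, "when was gpt-4 released", k=1)
print(best)
```

A word-count vector only matches exact words, whereas a learned embedding can also place paraphrases close together. That difference is the reason the rest of this tutorial uses a real embedding model.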

## **1. Load documents**
For our simple example, we use data that was probably not included in the training data of GPT-3.5. I use the Wikipedia article about GPT-4 because I assume that GPT-3.5 has limited knowledge about GPT-4.

For this minimal example, I’m not using any of the LangChain loaders. Instead, I’m scraping the text directly from Wikipedia using BeautifulSoup.

In [1]:
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/GPT-4"
response = requests.get(url)

soup = BeautifulSoup(response.content, 'html.parser')

# find the div that holds the article content
content_div = soup.find('div', {'class': 'mw-parser-output'})

# remove unwanted elements (footnote markers, tables, lists) from the div
unwanted_tags = ['sup', 'span', 'table', 'ul', 'ol']
for tag in unwanted_tags:
    for match in content_div.find_all(tag):
        match.extract()

print(content_div.get_text())


2023 text-generating language model
"ChatGPT-4" redirects here. For other uses, see GPT.




Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model created by OpenAI, and the fourth in its numbered "GPT-n" series of GPT foundation models. It was released on March 14, 2023, and has been made publicly available in a limited form via the chatbot product ChatGPT Plus (a premium version of ChatGPT), and with access to the GPT-4 based version of OpenAI's API being provided via a waitlist. As a transformer based model, GPT-4 was pretrained to predict the next token (using both public data and "data licensed from third-party providers"), and was then fine-tuned with reinforcement learning from human and AI feedback for human alignment and policy compliance.
Observers reported the GPT-4 based version of ChatGPT to be an improvement on the previous (GPT-3.5 based) ChatGPT, with the caveat that GPT-4 retains some of the same problems. Unlike the predecessors, GPT-4 can 

## **2. Split our document into text fragments**
Next, we must divide the text into smaller sections called `text chunks`. Each text chunk represents a data point in the embedding space, allowing the computer to determine the similarity between these chunks.

The following snippet uses the text splitter module from LangChain. In this particular case, we specify a *chunk size of 100 and a chunk overlap of 20*, both measured in characters because we pass `length_function=len`. It’s common to use larger text chunks, but you can experiment a bit to find the optimal size for your use case.

Keep in mind that every LLM has a token limit (about 4,000 tokens for GPT-3.5). Since we are inserting the text chunks into our prompt, we need to make sure that the entire prompt stays within that limit.

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter


article_text = content_div.get_text()


text_splitter = RecursiveCharacterTextSplitter(
    # Set a really small chunk size, just to show.
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
)


texts = text_splitter.create_documents([article_text])
print(texts[0])
print(texts[1])
print(texts[2])
print(texts[3])


page_content='2023 text-generating language model\n"ChatGPT-4" redirects here. For other uses, see GPT.' metadata={}
page_content='Generative Pre-trained Transformer 4 (GPT-4) is a multimodal large language model created by' metadata={}
page_content='model created by OpenAI, and the fourth in its numbered "GPT-n" series of GPT foundation models. It' metadata={}
page_content='models. It was released on March 14, 2023, and has been made publicly available in a limited form' metadata={}
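
The overlap between neighbouring chunks (for example, "model created by" appears at the end of one chunk and the start of the next) can be illustrated with a plain sliding window. This is a simplified sketch, not LangChain's algorithm: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraphs and spaces, which this hypothetical `split_with_overlap` helper ignores.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears in full in at least one chunk, at the cost of storing some text twice.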


## **3. From Text Chunks to Embeddings**
> There are several ways to convert text into embeddings, e.g. `Word2Vec`, `GloVe`, `fastText` or `ELMo`.

To capture similarities between words in **embeddings**, `Word2Vec` uses a simple neural network. We train this model on large amounts of text so that it learns to assign each word a point in the n-dimensional embedding space and thus describes its meaning as a vector.
For the training, we assign a neuron in the input layer to each unique word in our data set. In the image below, you can see a simple example.

In this case, the hidden layer contains only two neurons because we want to map the words into a two-dimensional embedding space. (Real models are much larger and represent words in higher-dimensional spaces: **OpenAI’s Ada Embedding Model**, for example, uses *1536 dimensions*.) After the training process, the individual weights describe the position in the embedding space.

![embeddings_process](resources/embeddings_process.png)
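
The statement that the weights describe the position in the embedding space can be sketched in a few lines. The vocabulary and the weight values below are invented for illustration and are not the result of any training run. The point is only that multiplying a one-hot input vector by the hidden-layer weight matrix selects one row, and that row is the word's embedding.

```python
import numpy as np

# toy vocabulary (assumption): one input neuron per unique word
vocabulary = ["king", "queen", "apple"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# hypothetical "trained" weights: one row per word, one column per hidden neuron
hidden_weights = np.array([
    [0.9, 0.8],    # king
    [0.8, 0.9],    # queen
    [-0.7, 0.1],   # apple
])


def one_hot(word: str) -> np.ndarray:
    """Input-layer activation: 1 for the word's neuron, 0 everywhere else."""
    vec = np.zeros(len(vocabulary))
    vec[word_to_index[word]] = 1.0
    return vec


def embed(word: str) -> np.ndarray:
    """A one-hot vector times the weight matrix selects a single row."""
    return one_hot(word) @ hidden_weights


print(embed("queen"))  # same values as hidden_weights[1]
```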

In [11]:
import openai
from dotenv import load_dotenv
import os

# load the OPENAI_API_KEY
load_dotenv()
openai.api_key = os.environ.get('OPENAI_API_KEY')

# we can see an example
print(texts[0])

embedding = openai.Embedding.create(
    input=texts[0].page_content, model="text-embedding-ada-002"
)["data"][0]["embedding"]


len(embedding)


page_content='2023 text-generating language model\n"ChatGPT-4" redirects here. For other uses, see GPT.' metadata={}


1536

In [10]:
print(embedding)


[-0.034656159579753876, -0.0013680505799129605, 0.007769383955746889, -0.0005045054713264108, 0.024579504504799843, 0.010641701519489288, -0.016049999743700027, 0.012027409859001637, -0.02938239648938179, -0.01808147504925728, 0.02597866766154766, 0.030297232791781425, -0.039795391261577606, -0.002233277540653944, 0.0007891305722296238, 0.018727242946624756, 0.01755679026246071, -0.01188614871352911, 0.009222359396517277, -0.007870284840464592, 0.007776110433042049, 0.011475817300379276, -0.005384754855185747, -0.023153437301516533, 0.014381768181920052, 0.0021979620214551687, 0.022655658423900604, -0.033902764320373535, -0.002018021885305643, -0.00891965627670288, -0.0015757386572659016, -0.0098075857385993, -0.017516428604722023, -0.004654903430491686, -0.011596898548305035, -0.012935519218444824, 0.0038409680128097534, -0.02069145068526268, 0.0095183365046978, 0.001955799525603652, 0.014691198244690895, 0.021700460463762283, 0.024754401296377182, -0.009269447065889835, -0.0056168274

When we represent the text chunks and the user’s question as vectors, we can compare them mathematically.

In order to determine the similarity between two data points, we need to measure how close they are in the multidimensional space. There are several ways to do this.
A commonly used measure is `cosine similarity`, the cosine of the angle between two vectors. So let’s calculate the cosine similarity between our question and the text chunks:

In [41]:
import numpy as np
from numpy.linalg import norm
import pandas as pd
import time


text_chunks = [text.page_content for text in texts]

df = pd.DataFrame({'text_chunks': text_chunks})

# keep only the first 40 chunks because of the rate limits for OpenAI API free-trial users
df = df.head(40)

####################################################################
# get embeddings from text-embedding-ada model
####################################################################
embeddings = []
for i, text in enumerate(df.text_chunks):
    text = text.replace("\n", " ")
    emb = openai.Embedding.create(
        input=[text],
        model='text-embedding-ada-002')['data'][0]['embedding']

    embeddings.append(emb)
    # pause after every third request to stay within the free-trial rate limit
    if i % 3 == 2:
        time.sleep(61)


In [46]:

df['ada_embedding'] = embeddings

####################################################################
# calculate the cosine similarity
####################################################################
users_question = "when gpt-4 became available?"

question_embedding = openai.Embedding.create(
    input=[users_question],
    model='text-embedding-ada-002')['data'][0]['embedding']

# create a list to store the calculated cosine similarity
cos_sim = []

for index, row in df.iterrows():
    A = row.ada_embedding
    B = question_embedding

    # calculate the cosine similarity
    cosine = np.dot(A, B)/(norm(A)*norm(B))

    cos_sim.append(cosine)

df["cos_sim"] = cos_sim
df.sort_values(by=["cos_sim"], ascending=False).head(10)


| | text_chunks | ada_embedding | cos_sim |
|---|---|---|---|
| 11 | previous (GPT-3.5 based) ChatGPT, with the cav... | [0.0010135102784261107, -0.0003050075902137905... | 0.843552 |
| 10 | Observers reported the GPT-4 based version of ... | [-0.0067684403620660305, 0.008493323810398579,... | 0.842315 |
| 5 | ChatGPT), and with access to the GPT-4 based v... | [-0.015336214564740658, -0.03029106743633747, ... | 0.834064 |
| 20 | as GPT-2, that could perform various tasks wit... | [-0.012413136661052704, 0.004828100558370352, ... | 0.82515 |
| 0 | 2023 text-generating language model\n"ChatGPT-... | [-0.029398135840892792, -0.006758533883839846,... | 0.823632 |
| 21 | improved into GPT-3.5, which was used to creat... | [-0.021026605740189552, -0.002719306154176593,... | 0.817297 |
| 14 | Further information: GPT-3 § Background, and G... | [0.0011423703981563449, -0.0064566610381007195... | 0.812135 |
| 25 | much more nuanced instructions than GPT-3.5." ... | [-0.02909061498939991, 0.0035216996911913157, ... | 0.808014 |
| 12 | the same problems. Unlike the predecessors, GP... | [0.001035009860061109, -0.018538333475589752, ... | 0.807083 |
| 18 | The next year, they introduced GPT-2, a larger... | [-0.012202690355479717, -0.027422528713941574,... | 0.79749 |
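
The row-by-row loop above computes `dot(A, B) / (norm(A) * norm(B))` once per chunk. As a sketch of an alternative, the same formula can be applied to all chunks at once with a single matrix-vector product. The function name `cosine_similarities` and the toy vectors are illustrative assumptions, not part of the original notebook.

```python
import numpy as np


def cosine_similarities(chunk_embeddings: np.ndarray, question_embedding: np.ndarray) -> np.ndarray:
    """Cosine similarity between every row of chunk_embeddings and the question."""
    chunk_norms = np.linalg.norm(chunk_embeddings, axis=1)   # one norm per chunk
    question_norm = np.linalg.norm(question_embedding)
    dots = chunk_embeddings @ question_embedding             # one dot product per chunk
    return dots / (chunk_norms * question_norm)


# toy 2-dimensional "embeddings" so the result is easy to check by hand
toy_chunks = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_question = np.array([1.0, 0.0])

scores = cosine_similarities(toy_chunks, toy_question)
ranking = np.argsort(scores)[::-1]  # indices of the chunks, most similar first
print(scores)   # approximately [1.0, 0.0, 0.7071]
print(ranking)  # [0 2 1]
```

With the DataFrame from the cell above, the matrix could be built with `np.array(df["ada_embedding"].tolist())`.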


## **4. Define the model you want to use**
The next step is to decide which LLM we would like to use and how many text chunks we want to provide to it in order to answer the question.

If we use GPT to answer short questions similar to how we would use Google, the costs remain relatively low. However, if we use GPT to answer questions that require providing **extensive context**, such as personal data, the query can quickly accumulate thousands of tokens. That increases the cost significantly. But don’t worry, you can set a cost limit.
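
Because both the cost and the token limit depend on prompt length, it can help to estimate the size of a prompt before sending it. The sketch below assumes roughly four characters per token, a common rule of thumb for English text and not an exact count. A tokenizer library would give precise numbers. The helper names and the `reserved_for_answer` parameter are hypothetical.

```python
from typing import List

CHARS_PER_TOKEN = 4  # rough rule of thumb for English text (assumption)


def estimate_tokens(text: str) -> int:
    """Approximate token count, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def select_chunks_within_budget(
    chunks: List[str],
    question: str,
    token_limit: int = 4000,
    reserved_for_answer: int = 500,
) -> List[str]:
    """Keep chunks (assumed sorted by relevance) until the budget is used up."""
    budget = token_limit - reserved_for_answer - estimate_tokens(question)
    selected = []
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if cost > budget:
            break
        selected.append(chunk)
        budget -= cost
    return selected


# three made-up chunks of about 100 tokens each
ranked_chunks = ["a" * 400, "b" * 400, "c" * 400]
kept = select_chunks_within_budget(
    ranked_chunks,
    "when did gpt-4 become available?",
    token_limit=250,
    reserved_for_answer=30,
)
print(len(kept))  # 2
```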

In [47]:
from langchain.llms import OpenAI

# defaults to the "text-davinci-003" model
llm = OpenAI(temperature=0.7)
llm('when gpt-4 became available?')


'\n\nOpenAI released GPT-4 on July 6, 2020.'

Without any additional context, the model answers confidently but incorrectly: according to the Wikipedia article we loaded above, GPT-4 was released on March 14, 2023.

## **5. Define our Prompt Template**
Now that we have the text snippets that contain the information we are looking for, we need to build a prompt. Within the prompt, we also specify the desired mode for the model to answer questions. When we define the mode, we specify the style in which we want the LLM to generate answers.
 
For our example, we want to implement a solution that extracts data from Wikipedia and interacts with the user like a chatbot. We want it to answer questions like a motivated, helpful help desk expert.
To guide the LLM in the right direction, I am adding the following instruction to the prompt:

***You are a chatbot that loves to help people! Answer the following question using only the context provided. If you’re unsure and the answer isn’t explicitly in the context, say “Sorry, I don’t know how to help you.”***

By doing this, I set a limitation that only allows GPT to utilize the information stored in our database. This restriction enables us to provide the sources our chatbot relied upon to generate the response, which is crucial for traceability and establishing trust. Additionally, it helps us address the issue of generating unreliable information and allows us to provide answers that can be utilized in a corporate setting for decision-making purposes.
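
One way to make the traceability idea tangible is to number the context sections so that an answer can point back to them. The following is only a sketch under that assumption: the helper names, the citation wording and the two example chunks are hypothetical and do not appear in the original notebook.

```python
from typing import List


def build_context_with_sources(chunks: List[str]) -> str:
    """Prefix every chunk with a number such as [1] so it can be cited."""
    return "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))


def build_prompt(chunks: List[str], question: str) -> str:
    """Assemble a prompt that restricts the model to the numbered context."""
    context = build_context_with_sources(chunks)
    return (
        "Answer the question using only the context sections below. "
        "Cite the section numbers you relied on, for example [1].\n\n"
        f"Context sections:\n{context}\n\n"
        f"Question:\n{question}\n\n"
        "Answer:\n"
    )


example_prompt = build_prompt(
    ["first retrieved chunk", "second retrieved chunk"],
    "when did gpt-4 become available?",
)
print(example_prompt)
```

Whether the model actually cites the numbers reliably depends on the model. The numbering only makes it possible to map a cited section back to its chunk.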

As the context, I am simply using the 10 text chunks with the highest similarity to the question.

In [55]:
from langchain import PromptTemplate

# define the LLM you want to use
llm = OpenAI(temperature=1, model_name='gpt-3.5-turbo')

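# define the context for the prompt by joining the most relevant text chunks
# (sort by similarity first; df[0:10] alone would just take the first 10 chunks of the article)
top_chunks = df.sort_values(by="cos_sim", ascending=False).head(10)
context = " ".join(top_chunks.text_chunks)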

# define the prompt template
template = """
You are a chat bot who loves to help people! Given the following context sections, answer the
question using only the given context. If you are unsure and the answer is not
explicitly written in the documentation, say "Sorry, I don't know how to help with that."

Context sections:
{context}

Question:
{users_question}

Answer:
"""

prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])

# fill the prompt template
prompt_text = prompt.format(context=context, users_question=users_question)
llm(prompt_text)


'GPT-4 became available on March 14, 2023.'

It appears to be functioning to some extent. However, our objective now is to transform this slow process into a robust and efficient one. To achieve this, we introduce an indexing step where **we store our embeddings and indexes in a vector store**. This will enhance the overall performance and decrease the response time.

## **6. Creating a vector store (vector database)**
A `vector store` is a type of data store that is optimized for storing and retrieving large quantities of data that can be represented as vectors. These types of databases allow for efficient querying and retrieval of subsets of the data based on various criteria, such as similarity measures or other mathematical operations.


In order to efficiently search our embeddings, we need to index them. `Indexing` is the second important component of a vector database. The index provides a way to map queries to the most relevant documents or items in the vector store without having to compute similarities between every query and every document.
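
To illustrate how an index can avoid comparing a query with every stored vector, here is a small sketch of one family of techniques, locality-sensitive hashing with random hyperplanes. It is an illustration only and not a description of how any particular vector store works. The class name, parameters and toy vectors are all assumptions. Vectors that fall on the same side of every random hyperplane share a bucket, and a query is compared only with the vectors in its own bucket.

```python
import math
import random
from collections import defaultdict
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class RandomHyperplaneIndex:
    """Toy approximate index: buckets vectors by the sign of random projections."""

    def __init__(self, dim: int, n_planes: int = 4, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
        self.buckets = defaultdict(list)

    def _key(self, vector: Sequence[float]) -> Tuple[bool, ...]:
        # one boolean per hyperplane: which side of the plane the vector lies on
        return tuple(
            sum(p * v for p, v in zip(plane, vector)) >= 0 for plane in self.planes
        )

    def add(self, item_id: str, vector: Sequence[float]) -> None:
        self.buckets[self._key(vector)].append((item_id, list(vector)))

    def query(self, vector: Sequence[float], k: int = 1) -> List[str]:
        # only the query's own bucket is searched, not the whole collection
        candidates = self.buckets.get(self._key(vector), [])
        ranked = sorted(candidates, key=lambda item: cosine(item[1], vector), reverse=True)
        return [item_id for item_id, _ in ranked[:k]]


index = RandomHyperplaneIndex(dim=2)
index.add("a", [1.0, 0.1])
index.add("b", [-1.0, 0.2])
index.add("c", [0.1, -1.0])
print(index.query([1.0, 0.1], k=1))  # ['a']
```

The trade-off is that the search is approximate: a close neighbour that happens to land in a different bucket is missed, which is why practical indexes combine several tables or use other structures.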

In recent years, a number of vector stores have been released. Especially in the field of LLMs, the attention around vector stores has exploded:

![Vector Store in the past years](resources/VectorStore%20in%20the%20past%20years.png)

Similar to what we did in the previous sections, we again calculate the embeddings and store them in a vector store. To do this, we use suitable modules from `LangChain` and `Chroma` as the vector store.

In [1]:
######################################################################
# 1. Collect data that we want to use to answer the users’ questions:
######################################################################


######################################################################
# 2. Load the data and define how you want to split into text chunks
######################################################################

from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader, PyMuPDFLoader
import chromadb

# load the document
loader1 = PyMuPDFLoader("resources/cv_david_smith.pdf")
loader2 = PyMuPDFLoader("resources/cv_Jo Brown.pdf")


# merge each page in each document and to put in a list
# documents = ['', '']
# for page in loader1.load():
#     documents[0] = documents[0] + page.page_content

# for page in loader2.load():
#     documents[1] = documents[1] + page.page_content


# define the text splitter (optional)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    # length_function=len,
)
texts = text_splitter.split_documents(loader1.load()+loader2.load())


texts[0]
# db = Chroma.from_documents(
#     texts, embeddings, persist_directory='resources\\db\\',
#     client_settings=PersistentClient(path='resources\\db\\').get_settings())


Document(page_content='Personal\nAddress\n71 Cherry Court, Cox Row \nSouthampton SO53 5PD\nPhone number\n0100 234 5000\nEmail\nexample@cvmaker.uk\nSkills\nMicrosoft Word\n \n \n \n \nCRM software\n \n \n \n \nMicrosoft Excel\n \n \n \n \nSelf control\n \n \n \n \nPatience\n \n \n \n \nEffective listening\n \n \n \n \nClear communication\n \n \n \n \nAdaptability\n \n \n \n \nInterests\nElectronics and computers\nDAVID SMITH\nKeen customer service representative with over 10 years of experience in the short-term', metadata={'source': 'resources/cv_david_smith.pdf', 'file_path': 'resources/cv_david_smith.pdf', 'page': 0, 'total_pages': 2, 'format': 'PDF 1.4', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': 'wkhtmltopdf 0.12.4', 'producer': 'Qt 4.8.7', 'creationDate': "D:20210527130839+02'00'", 'modDate': '', 'trapped': ''})

In [2]:
from langchain.embeddings import HuggingFaceEmbeddings,OpenAIEmbeddings
from langchain.vectorstores import Chroma,FAISS
from chromadb import PersistentClient

# define the embeddings model
embedding = HuggingFaceEmbeddings()

# embedding = OpenAIEmbeddings(model='text-embedding-ada-002')
## when using the OpenAI embeddings, creating the vector store raises the following error:
# ValueError: not enough values to unpack (expected 2, got 1)

db = Chroma.from_documents(
            texts, embedding, persist_directory='resources\\db\\',
            client_settings=PersistentClient(path='resources\\db\\').get_settings())
# equivalent
# db = VectorstoreIndexCreator(embedding=embedding).from_documents(texts)


In [7]:
######################################################################
# 4. Calculate the embeddings for the user’s question, find similar
# text chunks in our vector store and use them to build our prompt
######################################################################

from langchain.llms import OpenAI
from langchain import PromptTemplate

# users_question = "Which of the available candidates is best for project manager? "
# users_question = "Give me the names of the candidates you have "
users_question = "Summarize what is the experience of Jo Brown "

# use our vector store to find the 5 most similar text chunks
# (similarity_search takes the number of results as `k`)
results = db.similarity_search(
    query=users_question,
    k=5
)

# define the prompt template
template = """
You are a chat bot who loves to help people! Given the following context sections, answer the
question using the given context. If you are unsure and the answer is not
explicitly written in the documentation, say "Sorry, I don't know how to help with that."

Context sections:
{context}

Question:
{users_question}

Answer:
"""

prompt = PromptTemplate(template=template, input_variables=["context", "users_question"])

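# fill the prompt template with the text of the retrieved chunks
# (passing `results` directly would insert the repr of the Document objects, metadata included)
context = "\n\n".join(doc.page_content for doc in results)
prompt_text = prompt.format(context=context, users_question=users_question)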

# ask the defined LLM
llm = OpenAI(temperature=0.7)

llm(prompt_text)

'Jo Brown has experience in patient assessment, patient care, infection control, catheterisation, drawing blood/collecting samples, developing patient care plans, patient nutrition, patient and family education, basic life support, cardiopulmonary resuscitation, medical software, care, compassion, courage, communication, and commitment.'

### Another way to find similar text chunks in our vector store and use them to get the answer

In [6]:
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

users_question = "Summarize what is the experience of Jo Brown "

retriever = db.as_retriever(search_type="similarity", search_kwargs={"k": 5})
# create a chain to answer questions
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(), chain_type='stuff', retriever=retriever, return_source_documents=True)

result = qa({"query": users_question})
# print(result)


In [8]:
print(result.get('result'))

 Jo Brown has experience in patient assessment, patient care, infection control, catheterisation, drawing blood/collecting samples, developing patient care plans, patient nutrition, patient and family education, basic life support, cardiopulmonary resuscitation, medical software, care, compassion, courage, and communication. They also have interests in volunteering and road running.


In [9]:
qa({"query":  "Give me the names of the candidates you have "})

{'query': 'Give me the names of the candidates you have ',
 'result': " I don't know, I don't have any candidates to provide names for.",
 'source_documents': [Document(page_content='DAVID SMITH\nKeen customer service representative with over 10 years of experience in the short-term\ninsurance industry servicing both private and business clients. I am a highly skilled, effective\nlistener and clear communicator focused on defusing conflicts and resolving client queries as a\nmatter of urgency. Outstanding organisational skills allows quality service delivery, and I\nmaintain the highest level of integrity to ensure the confidence and security of both client and', metadata={'author': '', 'creationDate': "D:20210527130839+02'00'", 'creator': 'wkhtmltopdf 0.12.4', 'file_path': 'resources/cv_david_smith.pdf', 'format': 'PDF 1.4', 'keywords': '', 'modDate': '', 'page': 0, 'producer': 'Qt 4.8.7', 'source': 'resources/cv_david_smith.pdf', 'subject': '', 'title': '', 'total_pages': 2, 'trapped

## **Error when using OpenAIEmbeddings**

In [15]:
from langchain.embeddings.openai import OpenAIEmbeddings

embedd = OpenAIEmbeddings(model='text-embedding-ada-002')

embedd.embed_documents([texts[0]])


ValueError: not enough values to unpack (expected 2, got 1)