1. Load the PDF
2. Chunk the PDF (split it into pieces)
3. Embed each piece
4. Create the vector database (index)
5. Query (retrieve from the vector database and answer using a llama2 model)

In [3]:
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import os

pdf_path = "./pdfs/instruction-tune-llama2-extended-guide.pdf"

In [6]:
# load_and_split() loads the PDF and splits it with LangChain's default text splitter
pdf_docs = PyPDFLoader(pdf_path).load_and_split()
pdf_docs[0]

Document(page_content="07/04/2024, 12:41 Extended Guide: Instruction-tune Llama 2\nhttps://www.philschmid.de/instruction-tune-llama-2 1/17philschmidBlogNewsletterTagsProjectsAbout MeContact\nExtended Guide: Instruction-tune Llama\n2\n#GENERATIVEAI#HUGGINGFACE#LLM#LLAMA\nJuly 26, 2023\n13 min read\nView CodeThis blog post is an extended guide on instruction-tuning Llama 2 from\nMeta AI. The idea of the blog post is to focus on creating the\ninstruction dataset, which we can then use to fine-tune the base model\nof Llama 2 to follow our instructions.\nThe goal is to create a model which can create instructions based on\ninput. The idea behind this is that this can then be used for others to\ncreate instruction data from inputs. That's especially helpful if you\nwant to personalize models for, e.g., tweeting, email writing, etc,\nwhich means that you would be able to generate an instruction dataset\nfrom your emails to then train a model to mimic your email writing.\nOkay, so can we get s

In [5]:
len(pdf_docs)

17
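
The `RecursiveCharacterTextSplitter` imported above is never configured explicitly; `load_and_split()` falls back to its default splitter, so the 17 documents here most likely correspond to the 17 PDF pages. To make the idea of chunk size and overlap concrete, here is a minimal dependency-free sketch. `chunk_text` is a hypothetical helper written for illustration, not part of LangChain, and it splits on raw character positions rather than on separators as the real splitter does.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of chunk_size that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters of toy input
pieces = chunk_text(sample, chunk_size=20, overlap=5)
print(len(pieces), [len(p) for p in pieces])
```

The overlap means the tail of one chunk is repeated at the head of the next, which helps when a relevant sentence straddles a chunk boundary.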

Embedding and Vector Database

In [7]:
from langchain.embeddings import OllamaEmbeddings

In [8]:
embedding = OllamaEmbeddings(model="llama2")

In [9]:
from langchain.vectorstores import Chroma

In [10]:
vector_db = Chroma.from_documents(pdf_docs, embedding=embedding, persist_directory=".")
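
Conceptually, the vector database stores one embedding per chunk and, at query time, returns the chunks whose embeddings are closest to the query embedding. The sketch below illustrates that idea with toy 3-dimensional vectors and cosine similarity. It is an assumption-laden illustration: the vectors are made up, `cosine_similarity` and `top_k` are hypothetical helpers, and Chroma's actual distance metric and index structure may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k indexed chunks most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy index: (chunk text, made-up embedding)
toy_index = [
    ("chunk about datasets", [1.0, 0.0, 0.0]),
    ("chunk about training", [0.0, 1.0, 0.0]),
    ("chunk about both", [0.7, 0.7, 0.0]),
]
hits = top_k([0.9, 0.1, 0.0], toy_index, k=2)
print(hits)
```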

In [11]:
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOllama

In [12]:
# chat_format is not a ChatOllama parameter, so it is dropped here
llm = ChatOllama(model="llama2")

In [13]:
llm.invoke("Hi")

AIMessage(content="Hello! It's nice to meet you. Is there something I can help you with or would you like to chat?", response_metadata={'model': 'llama2', 'created_at': '2024-04-09T18:52:59.617943Z', 'message': {'role': 'assistant', 'content': ''}, 'done': True, 'total_duration': 2315489250, 'load_duration': 2863625, 'prompt_eval_count': 20, 'prompt_eval_duration': 1633189000, 'eval_count': 26, 'eval_duration': 676573000}, id='run-532abd7a-4afb-4966-9ee6-30f340c092f1-0')

In [14]:
qa = RetrievalQA.from_chain_type(llm=llm, retriever=vector_db.as_retriever(), return_source_documents=True)
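
`RetrievalQA.from_chain_type` defaults to the "stuff" chain type, which places all retrieved chunks into a single prompt together with the question. The sketch below shows roughly what that amounts to. `build_stuff_prompt` and the prompt wording are hypothetical and written for illustration; LangChain's real template differs.

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """Concatenate retrieved chunks into one context block followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "What is an instruction?",
    ["First retrieved chunk.", "Second retrieved chunk."],
)
print(prompt)
```

Because every retrieved chunk goes into one prompt, the number and size of chunks are bounded by the model's context window.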

In [15]:
# Sources are already returned because the chain was built with return_source_documents=True
output = qa.invoke("What is an instruction in the context of LLMs? Cite a passage from the uploaded document related to instructions.")

In [16]:
output

{'query': 'What is an instruction in the context of LLMs? Cite a passage from the uploaded document related to instructions.',
 'result': 'According to the provided document, an instruction in the context of LLMs refers to a set of parameters or configuration options that are used to train and fine-tune a language model (LLM). The document provides examples of instructions for various tasks, including creating an instruction dataset and preparing a model for kbit training.\n\nOne passage from the document that relates to instructions is:\n\n"The SFTTrainer supports a native integration with peft, which makes it super easy to efficiently instruction tune LLMs."\n\nThis passage highlights the ability of the SFTTrainer to integrate with peft (Packing and Efficient Training) to perform efficient instruction tuning of LLMs. The term "instruction" in this context refers to the configuration options or parameters that are used to train and fine-tune the LLM using the SFTTrainer.',
 'source_do

Now, let's run a simple semi-manual benchmark of this RAG setup: measure latency automatically, then inspect answer quality by hand.

In [55]:
import time
import numpy as np

queries = [
"What is the definition of an instruction for LLM models?",
"How can instructions guide and constrain the output of LLM models?",
"What are some examples of instructions for different capabilities of LLMs, such as Brainstorming,",
"What is the goal of fine-tuning a model to generate instructions based on input?",
"How can synthetic instruction datasets be created for personalizing LLMs and agents?",
"How to define a use case and create a prompt template for instructions?",
"What are the steps to create an instruction dataset, including defining the input and output formats?",
"What research suggests about creating high-quality instruction datasets?",
"What are some methods to create an instruction dataset, such as using existing datasets or synthetically generating new ones?",
"How to use existing LLMs to create synthetic instruction datasets?"
]

outputs = []
latencies = []
for query in queries:
    # Time only the end-to-end QA call (retrieval + generation)
    start = time.perf_counter()
    result = qa.invoke(query)
    latencies.append(time.perf_counter() - start)
    outputs.append(result)

mean_latency = np.mean(latencies)
print(f"Mean latency in seconds: {mean_latency}")

Mean latency in seconds: 11.501888925378973
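
A single mean can hide outliers, for example a slow first call. A slightly richer summary can be sketched as follows. `summarize_latencies` is a hypothetical helper, and the example values are made up for illustration; they are not measurements from this notebook.

```python
import statistics


def summarize_latencies(latencies: list[float]) -> dict[str, float]:
    """Return mean, median, min and max of a non-empty list of latencies in seconds."""
    if not latencies:
        raise ValueError("latencies must not be empty")
    ordered = sorted(latencies)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "min": ordered[0],
        "max": ordered[-1],
    }


example_latencies = [14.2, 10.1, 11.0, 10.7, 11.5]  # made-up values, in seconds
summary = summarize_latencies(example_latencies)
for name, value in summary.items():
    print(f"{name}: {value:.2f} s")
```

In the notebook, the same helper could be called on the `latencies` list collected in the loop above.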


In [58]:
i=0

In [69]:
# Re-run this cell to step through the results one at a time
print(outputs[i]["query"])
print(outputs[i]["result"])
print(outputs[i]["source_documents"])
i += 1

How to use existing LLMs to create synthetic instruction datasets?
 To create synthetic instruction datasets using existing LLMs, you can follow these general steps:

1. Define the use case and create a prompt template for instructions.
2. Create an instruction dataset by providing inputs and corresponding desired outputs in the form of instructions. For instance, you can use the email request example provided in the guide as a template and modify it to generate different types of instructions such as writing a letter of recommendation or composing a business proposal.
3. Instantiate the LLM and fine-tune it using the instruction dataset, creating an adapter for the model with the help of libraries like Triggers (trl) or the SFTTrainer.
4. Test the model's performance on generating relevant instructions based on new inputs.
5. Merge the adapter weights into the base model and save the merged model for further use.
6. Optionally, push the merged model to a shared hub for public access.
