In [2]:
!pip install -qqq langchain accelerate bitsandbytes
!pip install -qqq transformers==4.33.2
!pip install -qqq optimum==1.13.1
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ 

Looking in indexes: https://pypi.org/simple, https://huggingface.github.io/autogptq-index/whl/cu118/
Collecting auto-gptq
  Downloading https://huggingface.github.io/autogptq-index/whl/cu118/auto-gptq/auto_gptq-0.5.1%2Bcu118-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.0 MB)
Collecting rouge (from auto-gptq)
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Collecting gekko (from auto-gptq)
  Downloading gekko-1.0.6-py3-none-any.whl (12.2 MB)
Collecting peft>=0.5.0 (from auto-gptq)
  Obtaining dependency information for peft>=0.5.0 from https://files.pythonhosted.org/packages/14/0b/8402305043884c76a9d98e5e924c3f2211c75b02acd5b742e6c45d70506d/peft-0.6.2-py3-none-any.whl.metadata
  Downloading peft-

In [4]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import accelerate
from langchain import HuggingFacePipeline

import warnings
warnings.filterwarnings("ignore")

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# LLMs

In [3]:
model_name = "TheBloke/Llama-2-7b-Chat-GPTQ"

model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map="auto",
                                             trust_remote_code=True,
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)




Downloading config.json:   0%|          | 0.00/789 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/3.90G [00:00<?, ?B/s]

Downloading generation_config.json:   0%|          | 0.00/137 [00:00<?, ?B/s]

Downloading tokenizer_config.json:   0%|          | 0.00/727 [00:00<?, ?B/s]

Downloading tokenizer.model:   0%|          | 0.00/500k [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/411 [00:00<?, ?B/s]

In [14]:
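text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=True,   # include the prompt in the returned text
    max_new_tokens=1024,     # upper bound on the number of generated tokens
    top_p=0.95,              # nucleus sampling threshold
    do_sample=True,          # sample instead of decoding greedily
    repetition_penalty=1.1,  # mildly discourage repeated tokens
)

# Wrap the transformers pipeline so LangChain chains can call it as an LLM.
# Note: the sampling settings come from the pipeline above; whether model_kwargs
# is forwarded at generation time depends on the LangChain version.
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0})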
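
To make the `top_p` setting a bit more concrete, here is a minimal pure-Python sketch of nucleus (top-p) filtering. It is an illustration only, not the code that `transformers` runs internally; the function name `top_p_filter` and the toy probabilities are assumptions made up for this example.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize (illustrative helper)."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


# Toy next-token distribution (made-up numbers).
toy_probs = {"database": 0.5, "index": 0.3, "library": 0.15, "banana": 0.05}
filtered = top_p_filter(toy_probs, top_p=0.9)
print(filtered)
```

With a threshold such as 0.95, only the unlikely tail of the distribution is cut off before sampling, which is roughly why the outputs stay varied without drifting too far.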

In [15]:
%%time
result = llm("Explain the difference between Vector Database and Vector Indexing libraries in a couple of lines.")

CPU times: user 40.8 s, sys: 16.6 s, total: 57.4 s
Wall time: 57.4 s


In [16]:
print(result)


Vector databases and vector indexing libraries are both used to optimize queries on large-scale datasets, but they serve different purposes. Vector databases store pre-calculated vector representations of data, allowing for efficient querying and clustering. Vector indexing libraries, on the other hand, create and update vector representations of data incrementally, enabling fast similarity searching and other query types.


# Chains

In [17]:
%%time
from langchain.chains import ConversationChain

chain = ConversationChain(
    llm=llm,
    verbose=True
)

chain.run('Explain the difference between Vector Database and Vector Indexing libraries in a couple of lines.')



> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:

H: Explain the difference between Vector Database and Vector Indexing libraries in a couple of lines.
AI:

> Finished chain.


" Of course! 😊 Vector Database and Vector Indexing are two popular NLP libraries used for efficient text processing. The main distinction is that Vector Database stores vectors directly, while Vector Indexing generates them on demand. 🔥 To elaborate, Vector Database is essentially a fixed-size vector space where each document is represented by a vector in that space. On the other hand, Vector Indexing creates dense vector representations of documents by applying a transformation to their bag-of-words (BOW) representation. 💡 These transformed vectors are then stored in a sparse matrix, which can be queried using an inverted index. So, while Vector Database requires more memory upfront, it offers faster lookup times since the entire vector space is precomputed. Conversely, Vector Indexing requires less memory but needs to perform computations during query time. Now, if you want a more detailed explanation or have follow-up questions, feel free to ask! I'm here to help! 🙌"

In [18]:
chain.run('What was my previous question?')



> Entering new ConversationChain chain...
Prompt after formatting:
The following is a friendly conversation between a human and an AI. The AI is talkative and provides lots of specific details from its context. If the AI does not know the answer to a question, it truthfully says it does not know.

Current conversation:
Human: Explain the difference between Vector Database and Vector Indexing libraries in a couple of lines.
AI:  Of course! 😊 Vector Database and Vector Indexing are two popular NLP libraries used for efficient text processing. The main distinction is that Vector Database stores vectors directly, while Vector Indexing generates them on demand. 🔥 To elaborate, Vector Database is essentially a fixed-size vector space where each document is represented by a vector in that space. On the other hand, Vector Indexing creates dense vector representations of documents by applying a transformation to their bag-of-words (BOW) representation. 💡 These transformed

' Your previous question was: "Explain the difference between Vector Database and Vector Indexing libraries in a couple of lines."'

#### As you can see, giving the LLM some memory of the conversation lets it refer back to earlier turns: it correctly recalled the previous question.
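
To show the idea behind this, here is a minimal sketch of a conversation buffer in plain Python. It is not LangChain's implementation: the class `SimpleConversationBuffer` and its methods are hypothetical names, and the preamble is shortened. The point is only that each new prompt replays the earlier turns, which is what the verbose output above shows.

```python
class SimpleConversationBuffer:
    """Stores every turn and replays it in the next prompt (hypothetical helper)."""

    PREAMBLE = "The following is a friendly conversation between a human and an AI."

    def __init__(self):
        self.turns = []  # list of (speaker, text) tuples

    def build_prompt(self, user_input):
        # Replay the stored history, then append the new question.
        history = "\n".join(f"{speaker}: {text}" for speaker, text in self.turns)
        return (
            f"{self.PREAMBLE}\n\nCurrent conversation:\n"
            f"{history}\nHuman: {user_input}\nAI:"
        )

    def save_turn(self, user_input, ai_output):
        self.turns.append(("Human", user_input))
        self.turns.append(("AI", ai_output))


buffer = SimpleConversationBuffer()
first_prompt = buffer.build_prompt("Explain vector databases in one line.")
# Placeholder answer; in the notebook this would come from the LLM.
buffer.save_turn("Explain vector databases in one line.", "They store and search embeddings.")
second_prompt = buffer.build_prompt("What was my previous question?")
print(second_prompt)
```

Because the whole history is pasted into every prompt, the prompt grows with each turn, so a long conversation can eventually exceed the model's context window.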

# Prompt Templates

In [19]:
from langchain.prompts import PromptTemplate

template = """
Return all the subcategories of the following category

{category}
"""

prompt = PromptTemplate(
    input_variables=['category'],
    template=template
)

prompt

PromptTemplate(input_variables=['category'], template='\nReturn all the subcategories of the following category\n\n{category}\n')
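
Conceptually, a prompt template is a string with named placeholders. The sketch below mimics that behaviour with the standard library only; `find_input_variables` and `format_prompt` are hypothetical helpers written for this illustration, not LangChain functions.

```python
from string import Formatter

template = """
Return all the subcategories of the following category

{category}
"""


def find_input_variables(template_text):
    """List the placeholder names that appear in the template."""
    return sorted(
        {name for _, name, _, _ in Formatter().parse(template_text) if name}
    )


def format_prompt(template_text, **variables):
    """Fill the placeholders; a simplified stand-in for PromptTemplate.format."""
    return template_text.format(**variables)


variables = find_input_variables(template)
formatted = format_prompt(template, category="Vector Database")
print(variables)
print(formatted)
```

Keeping the template separate from its inputs makes it easy to reuse the same instruction with many different categories.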

In [20]:
from langchain.chains import LLMChain

chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=True
)

chain.run('Vector Database')



> Entering new LLMChain chain...
Prompt after formatting:

Return all the subcategories of the following category

Vector Database


> Finished chain.


'\nThe following are the subcategories under Vector Database:\n\n1. Linear Algebra\n2. Mathematical Optimization\n3. Statistical Computing\n4. Computer Vision\n5. Robotics\n6. Data Science\n7. Machine Learning\n8. Numerical Computation\n9. Scientific Computing'

In [29]:
template = """

[INST] <<SYS>>
Act as a Senior Machine Learning engineer who is aware of all the recent developments in the Artificial Intelligence space.
<</SYS>>

{text}[/INST]
"""

prompt = PromptTemplate(
    input_variables=["text"],
    template=template,
)

In [32]:
text = "What is a Retrieval-augmented generation ?"
print(prompt.format(text=text))



[INST] <<SYS>>
Act as a Senior Machine Learning engineer who is aware of all the recent developments in the Artificial Intelligence space.
<</SYS>>

What is a Retrieval-augmented generation ?[/INST]
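
The `[INST]` and `<<SYS>>` markers follow the Llama 2 chat prompt format. As a rough sketch, a small helper can assemble such a prompt for a single turn; `build_llama2_prompt` is a hypothetical name, and the sketch assumes the tokenizer adds the beginning-of-sequence token itself.

```python
def build_llama2_prompt(system_prompt: str, user_message: str) -> str:
    """Wrap a system prompt and one user turn in Llama 2 chat markers."""
    return (
        f"[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
        f"{user_message.strip()} [/INST]"
    )


example = build_llama2_prompt(
    "Act as a Senior Machine Learning engineer.",
    "What is retrieval-augmented generation?",
)
print(example)
```

Chat-tuned models tend to follow instructions more reliably when the prompt matches the format used during fine-tuning, which is presumably why this template produces a more focused answer than the plain one.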



In [33]:
chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=True
)

chain.run(text)



> Entering new LLMChain chain...
Prompt after formatting:


[INST] <<SYS>>
Act as a Senior Machine Learning engineer who is aware of all the recent developments in the Artificial Intelligence space.
<</SYS>>

What is a Retrieval-augmented generation ?[/INST]


> Finished chain.


'Ah, an excellent question! As a Senior Machine Learning engineer, I can tell you that Retrieval-augmented generation (RAG) is a cutting-edge approach in natural language processing (NLP) that has gained significant attention in recent years. It\'s a technique that combines the strengths of both generative and retrieval-based models to create more accurate and diverse language models.\nIn traditional language modeling, researchers use generative models like transformer-based autoencoders or variational autoencoders (VAEs) to learn the distribution of words in a language. These models are trained on large datasets of text and have been instrumental in advancing the state of the art in NLP. However, they have limitations when it comes to generating coherent and relevant text.\nRetrieval-augmented generation addresses these limitations by combining the strengths of generative and retrieval-based models. The basic idea is to use a retrieval-based model to search for relevant passages from 

When you build conversational applications, it is often useful to break a prompt down into separate components, such as a system message and a human message.
Let me show you how to do this.

In [34]:
chain

LLMChain(verbose=True, prompt=PromptTemplate(input_variables=['text'], template='\n\n[INST] <<SYS>>\nAct as a Senior Machine Learning engineer who is aware of all the recent developments in the Artificial Intelligence space.\n<</SYS>>\n\n{text}[/INST]\n'), llm=HuggingFacePipeline(pipeline=<transformers.pipelines.text_generation.TextGenerationPipeline object at 0x79cc2c3e2e60>, model_kwargs={'temperature': 0}))

In [44]:
from langchain.prompts import (
    SystemMessagePromptTemplate,
    HumanMessagePromptTemplate,
    ChatPromptTemplate
)

from langchain.schema import AIMessage, HumanMessage

system_template = """
Act as a Senior Machine Learning engineer who is aware of all the recent developments in the Artificial Intelligence space.
Always give real world examples with analogies"
"""

human_template = '{text}'

system_message = SystemMessagePromptTemplate.from_template(
    system_template
)

human_message = HumanMessagePromptTemplate.from_template(
    human_template
)

AI_message = AIMessage(content="Welcome to Chatbot!")

print(f"system_message: {system_message}\n")
print(f"human_message: {human_message}\n")
print(f"AI_message: {AI_message}\n")

system_message: prompt=PromptTemplate(input_variables=[], template='\nAct as a Senior Machine Learning engineer who is aware of all the recent developments in the Artificial Intelligence space.\nAlways give real world examples with analogies"\n')

human_message: prompt=PromptTemplate(input_variables=['text'], template='{text}')

AI_message: content='Welcome to Chatbot!'



In [45]:
prompt = ChatPromptTemplate.from_messages([
    system_message,
    human_message,
    AI_message
])

chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=True
)

chain.run("What are self-attention layers in a Transformer?")



> Entering new LLMChain chain...
Prompt after formatting:
System:
Act as a Senior Machine Learning engineer who is aware of all the recent developments in the Artificial Intelligence space.
Always give real world examples with analogies"

H: What are self-attention layers in a Transformer?
AI: Welcome to Chatbot!

> Finished chain.


" *smiling face* Attention has become an essential component in many Natural Language Processing (NLP) tasks, including the Transformer architecture. Self-attention layers allow models to focus on specific parts of input sequences and weigh their relevance when processing those sequences. It's like how you focus on certain regions while reading a long document; without attention, you might miss important details. In a Transformer model, these attention mechanisms are applied multiple times, much like how you might look back at previous sentences to better understand the context of the current one. *wink* Can I help you with more questions on this topic or maybe some other AI concepts? *hint hint*"

So you can see that the prompt is now a bit different. It starts with the system prompt built from the system template we created earlier.
Then comes the human prompt with the question "What are self-attention layers in a Transformer?", followed by the AI greeting.

This gives the LLM a bit more context on what has to be done.
The system prompt is context that helps the LLM understand how it should behave, and the human prompt is the question passed in by the human.

And you can see that the LLM now follows the system instruction: it explains self-attention with a real-world analogy (focusing on certain regions while reading a long document).
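
As a rough illustration of what happens to the message list, the sketch below flattens role-tagged messages into the single "System / Human / AI" string seen in the verbose output. It is a simplified stand-in, not LangChain's code: `render_chat_prompt` and the `(role, content, is_template)` triples are assumptions for this example, and the system text is shortened.

```python
# Hypothetical (role, content, is_template) triples mirroring the cell above.
messages = [
    ("System", "Act as a Senior Machine Learning engineer. Always give real world examples with analogies.", False),
    ("Human", "{text}", True),
    ("AI", "Welcome to Chatbot!", False),
]


def render_chat_prompt(message_list, **variables):
    """Fill in templated messages and join everything as 'Role: text' lines."""
    lines = []
    for role, content, is_template in message_list:
        text = content.format(**variables) if is_template else content
        lines.append(f"{role}: {text}")
    return "\n".join(lines)


rendered = render_chat_prompt(
    messages, text="What are self-attention layers in a Transformer?"
)
print(rendered)
```

Since the rendered prompt ends with an AI message, a plain text-generation model simply continues from that greeting, which may explain the chatty tone of the answer above.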