## Installation

In [1]:
! pip install haystack-ai "transformers>=4.43.1" sentence-transformers accelerate bitsandbytes


## Authorization

- You need a Hugging Face account.
- You need to accept Meta's conditions at https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct and wait for access to be granted.

In [2]:
import getpass, os


os.environ["HF_API_TOKEN"] = getpass.getpass("Your Hugging Face token")

Your Hugging Face token··········
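
A minimal sketch of one way to keep the token out of the notebook source: read it from the environment first and prompt only when it is missing. The helper name `get_hf_token` and its parameters are hypothetical; they are not part of Haystack or any Hugging Face library.

```python
import getpass
import os


def get_hf_token(env=os.environ, prompt=getpass.getpass):
    """Return the token stored under HF_API_TOKEN, prompting only if it is missing."""
    token = env.get("HF_API_TOKEN")
    if not token:
        # Fall back to a non-echoing prompt and cache the value for later cells.
        token = prompt("Your Hugging Face token: ")
        env["HF_API_TOKEN"] = token
    return token
```

Because the value is never written into a cell, it does not end up in the saved notebook.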


## RAG with a Llama-3.1-8B model (about property listings)

In [4]:
from IPython.display import Image
from pprint import pprint
import rich
import random

### Load data from Wikipedia

In [5]:
# ! pip install wikipedia
# import wikipedia
# from haystack.dataclasses import Document

# title = "96th_Academy_Awards"
# page = wikipedia.page(title=title, auto_suggest=False)
# raw_docs = [Document(content=page.content, meta={"title": page.title, "url":page.url})]

# print(raw_docs)

### Load data from TXT


In [7]:
from haystack.dataclasses import Document

def load_data_from_txt(file_path):
    """Load a text file and return its content as a single-item list of Documents."""
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()
    # Keep the file path in the metadata to record where the content came from
    return [Document(content=content, meta={"source": file_path})]

# Path to the text file with the property descriptions
file_path = 'propertyDetailsForEmbedding.txt'
raw_docs = load_data_from_txt(file_path)


print(raw_docs)

[Document(id=dcc2db916091c52a9f434705731b8fd6fa9a976ea33e5c73d68b8e44ce9047b8, content: 'This property falls under the category of Houses for sale and is titled "Reserved • Deleted • Well-...', meta: {'source': 'propertyDetailsForEmbedding.txt'})]


### Indexing Pipeline

In [8]:
from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack import Document
from haystack.components.embedders import SentenceTransformersTextEmbedder, SentenceTransformersDocumentEmbedder
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter
from haystack.components.preprocessors import DocumentSplitter
from haystack.utils import ComponentDevice

In [9]:
document_store = InMemoryDocumentStore()

In [10]:
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))

indexing_pipeline.add_component(
    "embedder",
    SentenceTransformersDocumentEmbedder(
        model="Snowflake/snowflake-arctic-embed-l",  # good embedding model: https://huggingface.co/Snowflake/snowflake-arctic-embed-l
        device=ComponentDevice.from_str("cuda:0"),    # load the model on GPU
    ))
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# connect the components
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7e56c9484ac0>
🚅 Components
  - splitter: DocumentSplitter
  - embedder: SentenceTransformersDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - splitter.documents -> embedder.documents (List[Document])
  - embedder.documents -> writer.documents (List[Document])
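
To make the splitting step more concrete, here is a simplified sketch of word-based chunking with a chunk length of 200, written in plain Python. It is an illustration only, not Haystack's implementation: the function name `split_by_words` is hypothetical, and the real `DocumentSplitter` also handles metadata and other details that are omitted here.

```python
def split_by_words(text, split_length=200):
    """Split text into chunks of at most `split_length` whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + split_length])
        for i in range(0, len(words), split_length)
    ]


# Tiny example with a chunk length of 2 instead of 200
chunks = split_by_words("one two three four five", split_length=2)
print(chunks)  # ['one two', 'three four', 'five']
```

Each chunk is then embedded and stored separately, which is why a single input file can produce many documents in the store.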

In [11]:
indexing_pipeline.run({"splitter":{"documents":raw_docs}})




{'writer': {'documents_written': 324}}

### RAG Pipeline

In [12]:
from haystack.components.builders import PromptBuilder

prompt_template = """
<|begin_of_text|><|start_header_id|>user<|end_header_id|>


Using the information contained in the context, give a comprehensive answer to the question.
If the answer cannot be deduced from the context, do not give an answer.

Context:
  {% for doc in documents %}
  {{ doc.content }} Source: {{ doc.meta['source'] }}
  {% endfor %}
  Question: {{query}}<|eot_id|>

<|start_header_id|>assistant<|end_header_id|>


"""
prompt_builder = PromptBuilder(template=prompt_template)
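
As a rough illustration of what the template loop produces, the sketch below builds the context part of the prompt in plain Python: one line per retrieved document, followed by the question. The function `render_prompt` and the dictionary-based documents are hypothetical stand-ins, not the Jinja rendering that `PromptBuilder` performs. It assumes each document carries a `source` key in `meta`, as the documents loaded from the TXT file do.

```python
def render_prompt(documents, query, meta_key="source"):
    """Mimic the template's loop: one line per document, then the question."""
    context_lines = [
        f"{doc['content']} {meta_key}:{doc['meta'].get(meta_key, '')}"
        for doc in documents
    ]
    return "Context:\n" + "\n".join(context_lines) + f"\nQuestion: {query}"


docs = [
    {"content": "House with garden.", "meta": {"source": "listings.txt"}},
    {"content": "Flat near the river.", "meta": {}},  # metadata key missing
]
print(render_prompt(docs, "Which property has a garden?"))
```

In this sketch a missing metadata key is rendered as an empty string, so it is worth checking that the key referenced in the template actually exists on the indexed documents.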

Here, we use the [`HuggingFaceLocalGenerator`](https://docs.haystack.deepset.ai/docs/huggingfacelocalgenerator), loading the model in Colab with 4-bit quantization.

In [13]:
import torch
from haystack.components.generators import HuggingFaceLocalGenerator

generator = HuggingFaceLocalGenerator(
    # model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    # model="microsoft/Phi-3.5-mini-instruct",
    model="akjindal53244/Llama-3.1-Storm-8B",
    # model="mattshumer/Reflection-Llama-3.1-70B",
    device=ComponentDevice.from_str("cuda:0"),
    huggingface_pipeline_kwargs={"device_map":"auto",
                                  "model_kwargs":{"load_in_4bit":True,
                                                  "bnb_4bit_use_double_quant":True,
                                                  "bnb_4bit_quant_type":"nf4",
                                                  "bnb_4bit_compute_dtype":torch.bfloat16}},
    generation_kwargs={"max_new_tokens": 500})

generator.warm_up()


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
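
A back-of-the-envelope calculation shows why 4-bit quantization matters on a Colab GPU. The sketch assumes roughly 8 billion parameters and counts only the weights; quantization constants, activations, and the KV cache add to the real footprint, so treat the numbers as lower bounds.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params = 8e9  # assumption: about 8 billion parameters
print(weight_memory_gb(params, 16))  # bfloat16 weights: 16.0
print(weight_memory_gb(params, 4))   # 4-bit weights: 4.0
```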



In [14]:
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever

query_pipeline = Pipeline()

query_pipeline.add_component(
    "text_embedder",
    SentenceTransformersTextEmbedder(
        model="Snowflake/snowflake-arctic-embed-l",  # good embedding model: https://huggingface.co/Snowflake/snowflake-arctic-embed-l
        device=ComponentDevice.from_str("cuda:0"),  # load the model on GPU
        prefix="Represent this sentence for searching relevant passages: ",  # as explained in the model card (https://huggingface.co/Snowflake/snowflake-arctic-embed-l#using-huggingface-transformers), queries should be prefixed
    ))
query_pipeline.add_component("retriever", InMemoryEmbeddingRetriever(document_store=document_store, top_k=5))
query_pipeline.add_component("prompt_builder", PromptBuilder(template=prompt_template))
query_pipeline.add_component("generator", generator)

# connect the components
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder", "generator")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7e56b3cf39a0>
🚅 Components
  - text_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - generator: HuggingFaceLocalGenerator
🛤️ Connections
  - text_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> generator.prompt (str)
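
The retrieval step can be pictured as scoring every stored embedding against the query embedding and keeping the `top_k` best matches. The sketch below does this in plain Python with cosine similarity on toy two-dimensional vectors. The function names are hypothetical, and the choice of cosine similarity is an assumption made for illustration; the document store has its own configurable scoring.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_embedding, docs, top_k=5):
    """Return the `top_k` (text, score) pairs with the highest similarity."""
    scored = [
        (text, cosine_similarity(query_embedding, embedding))
        for text, embedding in docs
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy "document store": (text, embedding) pairs
toy_docs = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
best = retrieve_top_k([1.0, 0.1], toy_docs, top_k=2)
print([text for text, _ in best])  # ['a', 'c']
```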

### Let's ask some questions!

In [15]:
def get_generative_answer(query):
    """Run the query pipeline and print the first generated reply."""
    results = query_pipeline.run({
        "text_embedder": {"text": query},
        "prompt_builder": {"query": query},
    })

    answer = results["generator"]["replies"][0]
    rich.print(answer)

In [16]:
get_generative_answer("What is the most expensive house in Berlin?")


In [17]:
get_generative_answer("What is the house heated with?")


---
This is a simple demo.
The RAG pipeline can be improved in several ways, for example by preprocessing the input more carefully.

To use Llama 3 models in Haystack, you also have **other options**:
- [LlamaCppGenerator](https://docs.haystack.deepset.ai/docs/llamacppgenerator) and [OllamaGenerator](https://docs.haystack.deepset.ai/docs/ollamagenerator): these use the GGUF quantized format and are well suited to running LLMs on standard machines (even without GPUs).
- [HuggingFaceAPIGenerator](https://docs.haystack.deepset.ai/docs/huggingfaceapigenerator), which allows you to query a local TGI container or a (paid) HF Inference Endpoint. TGI is a toolkit for efficiently deploying and serving LLMs in production.
- [vLLM via OpenAIGenerator](https://haystack.deepset.ai/integrations/vllm): high-throughput and memory-efficient inference and serving engine for LLMs.

