## Building LLM-powered pipelines to perform indexing with Haystack

In [the previous notebook](./extraction-and-processing-pipelines.ipynb), we learned how to initialize components that convert files of different formats (PDF, Word, HTML, etc.) into a format that can be cleaned by Haystack components. 

In this notebook, we will add components that convert the text into vectors, using Haystack's integrations with embedding model providers.

In [4]:
from haystack import Document, Pipeline
from haystack.utils import Secret
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import MarkdownToDocument
from haystack.components.embedders import OpenAIDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from pathlib import Path
from haystack.document_stores.types import DuplicatePolicy


### Working with embedding models from OpenAI

In this section, we will build an indexing pipeline that converts Markdown files into documents, cleans and splits them, and uses an embedding model from OpenAI to turn each chunk into a vector before writing it to the document store.

In [2]:
from dotenv import load_dotenv
import os

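# Load the variables defined in the .env file into the process environment
load_dotenv("./../../.env")

# OpenAIDocumentEmbedder reads OPENAI_API_KEY from the environment by default,
# so this variable is only used to confirm that the key was loaded
open_ai_key = os.getenv("OPENAI_API_KEY")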
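
If the `.env` file is missing or its path is wrong, `os.getenv` silently returns `None`, and the problem only surfaces later, when the embedder is created or called. A minimal sketch of a fail-fast check follows. The helper name `require_env` is hypothetical and is not part of Haystack or python-dotenv.

```python
import os


def require_env(name: str, environ=os.environ) -> str:
    """Return the value of an environment variable or raise a clear error.

    `require_env` is an illustrative helper, not a library function.
    """
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check the path passed to load_dotenv")
    return value


# Possible usage in this notebook, after load_dotenv(...):
# open_ai_key = require_env("OPENAI_API_KEY")
```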


In [5]:
document_store = InMemoryDocumentStore(embedding_similarity_function="cosine")

# Initialize components
markdown_converter = MarkdownToDocument(table_to_single_line=True)
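document_cleaner = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    remove_repeated_substrings=False,
)
# split_length=5 produces very short chunks of five words each
document_splitter = DocumentSplitter(split_by="word", split_length=5)
# OVERWRITE replaces any stored document that has the same id
document_writer = DocumentWriter(
    document_store=document_store,
    policy=DuplicatePolicy.OVERWRITE,
)
# batch_size is the number of documents sent to the embedding API per request
embedding = OpenAIDocumentEmbedder(
    model="text-embedding-ada-002",
    batch_size=24,
)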

# Initialize pipeline
indexing_pipeline = Pipeline()

# Add components
indexing_pipeline.add_component("converter", markdown_converter)
indexing_pipeline.add_component("cleaner", document_cleaner)
indexing_pipeline.add_component("splitter", document_splitter)
indexing_pipeline.add_component("embedder", embedding)
indexing_pipeline.add_component("writer", document_writer)

# Connect components to one another
indexing_pipeline.connect("converter.documents", "cleaner.documents")
indexing_pipeline.connect("cleaner.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")

# Execute pipeline
file_names = [str(f) for f in Path("./markdown_pages").rglob("*.md")]
indexing_pipeline.run({"converter": {"sources": file_names}})

Converting markdown files to Documents: 100%|██████████| 3/3 [00:00<00:00, 15.19it/s]
Calculating embeddings: 100%|██████████| 4/4 [00:02<00:00,  1.56it/s]


{'embedder': {'meta': {'model': 'text-embedding-ada-002',
   'usage': {'prompt_tokens': 825, 'total_tokens': 825}}},
 'writer': {'documents_written': 93}}
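
The numbers in this output are consistent with the pipeline configuration: 93 documents embedded in batches of 24 need four requests, which matches the `4/4` in the embedding progress bar. A quick check, using only the values printed above:

```python
import math

documents_written = 93  # from the 'writer' output above
batch_size = 24         # batch_size passed to OpenAIDocumentEmbedder
prompt_tokens = 825     # from the 'embedder' usage metadata above

# Number of embedding requests needed to cover all documents
num_batches = math.ceil(documents_written / batch_size)

# Average number of tokens per five-word chunk
avg_tokens_per_document = prompt_tokens / documents_written

print(num_batches)                        # 4
print(round(avg_tokens_per_document, 2))  # 8.87
```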

In [None]:
indexing_pipeline.draw(Path("./images/indexing_pipeline.png"))

In [8]:
document_store.filter_documents()[10]

Document(id=ec917c19b99ca38d0bef64aa62d389315af136b55470a40a88d6d9812589201d, content: 'templates. Handlebars is the default.extextension ', meta: {'file_path': 'markdown_pages/page3.md', 'source_id': '09730ab93795029b54abdb66b59d722d53971f4de54e66eebf0e0b1385a439ea'}, embedding: vector of size 1536)
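
The short `content` above follows from `split_length=5`: each chunk holds five whitespace-separated words. The sketch below is a simplified stand-in for word-based splitting, not Haystack's actual implementation. The function name `split_by_word` and the sample sentence are made up for illustration.

```python
def split_by_word(text: str, split_length: int) -> list[str]:
    """Group whitespace-separated words into chunks of `split_length` words."""
    words = text.split()
    return [
        " ".join(words[i:i + split_length])
        for i in range(0, len(words), split_length)
    ]


# Hypothetical sample text, not taken from the Markdown pages
sample = "Haystack pipelines connect components that convert, clean, split, and embed documents"
chunks = split_by_word(sample, split_length=5)
for chunk in chunks:
    print(repr(chunk))
```

With chunks this small, each embedding captures very little context, so a larger `split_length` is usually a better fit outside of a demonstration.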

We can also access the embedding values directly. The cell below prints the first ten of the 1536 dimensions.


In [9]:
document_store.filter_documents()[10].embedding[0:10]

[-0.01753813587129116,
 0.015475629828870296,
 -0.010899869725108147,
 -0.007409999147057533,
 0.010271556675434113,
 0.023493453860282898,
 -0.005897266790270805,
 -0.002530326833948493,
 0.012142837978899479,
 -0.014847316779196262]
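
The document store was created with `embedding_similarity_function="cosine"`, so an embedding retriever built on it would compare vectors by the cosine of the angle between them. Here is a minimal pure-Python sketch of that measure, using the first three values printed above and a made-up query vector. It is an illustration of the formula, not the store's internal code.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# First three values of the embedding shown above (truncated for brevity)
stored = [-0.01753813587129116, 0.015475629828870296, -0.010899869725108147]
# Hypothetical query vector, chosen only for illustration
query = [0.02, -0.01, 0.005]

same = cosine_similarity(stored, stored)    # a vector compared with itself
other = cosine_similarity(stored, query)    # vectors pointing in different directions
print(same, other)
```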