In [2]:
%%capture
!pip install llama-index
!pip install llama-index-llms-groq
!pip install llama-index-embeddings-huggingface
!pip install llama-parse

In [3]:
import os

os.environ["GROQ_API_KEY"] = "<YOUR_GROQ_API_KEY>"  # placeholder: supply your own key, never commit a real one

In [4]:
from llama_index.llms.groq import Groq

llm = Groq(model="llama3-8b-8192")
llm_70b = Groq(model="llama3-70b-8192")

# LlamaIndex Bottom-Up Development - Documents and Nodes
In order to answer questions about the LlamaIndex docs, we first need to load them!

A majority of our documentation is in markdown format. For the sake of scope, we will ONLY worry about markdown files for now.

When parsing these files, there are a few things we might want to keep track of:

- Current header (and header hierarchy!)
- Code blocks
- Text
- Source file names
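
To make the list above concrete, here is a minimal sketch of how a parser might track the header hierarchy while leaving code blocks intact. It is not the loader shipped in the repository; the function name `split_markdown_sections` and the `/`-joined header path are illustrative assumptions.

```python
import re

def split_markdown_sections(text):
    """Split markdown into (header_path, body) pairs.

    Hypothetical helper for illustration only, not the repository's loader.
    """
    sections = []
    header_stack = []      # (level, title) pairs for the current hierarchy
    body_lines = []
    in_code_block = False

    def flush():
        # Emit the accumulated body under the current header path
        body = "\n".join(body_lines).strip()
        if body:
            path = "/".join(title for _, title in header_stack)
            sections.append((path, body))
        body_lines.clear()

    for line in text.splitlines():
        # A fence toggles code-block state; '#' inside code is not a header
        if line.lstrip().startswith("```"):
            in_code_block = not in_code_block
            body_lines.append(line)
            continue
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match and not in_code_block:
            flush()
            level = len(match.group(1))
            # Drop headers at the same or a deeper level before pushing
            while header_stack and header_stack[-1][0] >= level:
                header_stack.pop()
            header_stack.append((level, match.group(2).strip()))
        else:
            body_lines.append(line)
    flush()
    return sections

sample = (
    "# Agents\nIntro text.\n## Usage\n"
    "```python\n# not a header\nprint('hi')\n```\n"
    "# Tools\nTool text."
)
sections = split_markdown_sections(sample)
for path, body in sections:
    print(path, "->", repr(body))
```

A source file name could then be attached to each section as metadata by whatever code reads the files from disk.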

While LlamaIndex does have a built-in markdown loader, we can write our own to fit our requirements exactly! Loaders are not magic -- they just read files and create documents. So building our own is easy!

We have provided an implementation of a custom markdown loader in the source code. Let's test it out to see how it works!

In [5]:
import os
import sys

# Make modules in the parent directory importable
sys.path.append(os.path.join(os.getcwd(), '..'))

In [6]:
# Step 1: Clone the repository into the Colab environment (skipped if it already exists)
![ -d /content/llama_docs_bot_groq ] || git clone https://github.com/BoxOfCereal/llama_docs_bot_groq.git /content/llama_docs_bot_groq

import sys

from llama_index.core import SimpleDirectoryReader

# Assumption: the cloned repository exposes the custom loader at
# llama_docs_bot/markdown_docs_reader.py; adjust the import if the layout differs.
sys.path.append("/content/llama_docs_bot_groq")
from llama_docs_bot.markdown_docs_reader import MarkdownDocsReader

# Step 2: Define `load_markdown_docs` to read markdown files from a directory in the cloned repository
def load_markdown_docs(filepath):
    """Load markdown docs from a directory, excluding all other file types."""
    loader = SimpleDirectoryReader(
        input_dir=filepath,
        exclude=["*.rst", "*.ipynb", "*.py", "*.bat", "*.txt", "*.png", "*.jpg", "*.jpeg", "*.csv", "*.html", "*.js", "*.css", "*.pdf", "*.json"],
        file_extractor={".md": MarkdownDocsReader()},
        recursive=True
    )
    return loader.load_data()

# Step 3: Load documents from specific folders within the cloned repository
# Adjust file paths to point to the cloned repository's structure in Colab
getting_started_docs = load_markdown_docs("/content/llama_docs_bot_groq/docs/getting_started")
community_docs = load_markdown_docs("/content/llama_docs_bot_groq/docs/community")
data_docs = load_markdown_docs("/content/llama_docs_bot_groq/docs/core_modules/data_modules")
agent_docs = load_markdown_docs("/content/llama_docs_bot_groq/docs/core_modules/agent_modules")
model_docs = load_markdown_docs("/content/llama_docs_bot_groq/docs/core_modules/model_modules")
query_docs = load_markdown_docs("/content/llama_docs_bot_groq/docs/core_modules/query_modules")
supporting_docs = load_markdown_docs("/content/llama_docs_bot_groq/docs/core_modules/supporting_modules")
tutorials_docs = load_markdown_docs("/content/llama_docs_bot_groq/docs/end_to_end_tutorials")
contributing_docs = load_markdown_docs("/content/llama_docs_bot_groq/docs/development")

In [None]:
# Make our printing look nice
from llama_index.core.schema import MetadataMode

In [None]:
print(agent_docs[5].get_content(metadata_mode=MetadataMode.ALL))

In [None]:
print(agent_docs[0].metadata)

Not bad! We can see that we have metadata, as well as nicely formatted content.

But we can improve the formatting even further by providing better templates, so that the LLM and embedding models get a clearer idea of what they are reading.

In [None]:
text_template = "Content Metadata:\n{metadata_str}\n\nContent:\n{content}"

metadata_template = "{key}: {value},"
metadata_seperator = " "  # spelling (sic) matches the LlamaIndex attribute name

for doc in agent_docs:
    doc.text_template = text_template
    doc.metadata_template = metadata_template
    doc.metadata_seperator = metadata_seperator

In [None]:
print(agent_docs[0].get_content(metadata_mode=MetadataMode.ALL))
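
To see how the three settings combine, the following sketch approximates the rendering in plain Python. It is a simplified stand-in rather than LlamaIndex's actual implementation; the function `render_content` and the sample metadata values are hypothetical.

```python
def render_content(text, metadata, text_template, metadata_template, metadata_seperator):
    """Approximate how a document's text and metadata are merged (illustrative only)."""
    # Each metadata entry is formatted individually, then joined by the separator
    metadata_str = metadata_seperator.join(
        metadata_template.format(key=key, value=value)
        for key, value in metadata.items()
    )
    # The joined metadata string and the raw text fill the outer template
    return text_template.format(metadata_str=metadata_str, content=text)

# Hypothetical sample values, not taken from the real docs
sample_metadata = {"File Name": "agents.md", "Header Path": "Agents/Usage"}

rendered = render_content(
    text="Some text",
    metadata=sample_metadata,
    text_template="Content Metadata:\n{metadata_str}\n\nContent:\n{content}",
    metadata_template="{key}: {value},",
    metadata_seperator=" ",
)
print(rendered)
```

With these templates the metadata ends up on a single line, each entry terminated by a comma, followed by the content under its own label.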

### Advanced Customization
Going even further with metadata, we can also customize which metadata fields are visible to the embedding model and which are visible to the LLM, independently of each other.

In [None]:
# Hide the File Name from the LLM
agent_docs[0].excluded_llm_metadata_keys = ["File Name"]
print(agent_docs[0].get_content(metadata_mode=MetadataMode.LLM))

In [None]:
# Hide the File Name from the embedding model
agent_docs[0].excluded_embed_metadata_keys = ["File Name"]
print(agent_docs[0].get_content(metadata_mode=MetadataMode.EMBED))
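
The effect of the two exclusion lists can be pictured as a per-mode filter over the metadata dictionary. The sketch below is a simplified model of that idea, not LlamaIndex code; `visible_metadata`, the string mode names, and the sample values are assumptions made for illustration.

```python
def visible_metadata(metadata, mode, excluded_llm_keys=(), excluded_embed_keys=()):
    """Return the metadata entries visible in a given mode (illustrative stand-in)."""
    if mode == "all":
        excluded = set()                      # nothing is hidden
    elif mode == "llm":
        excluded = set(excluded_llm_keys)     # hidden from the LLM only
    elif mode == "embed":
        excluded = set(excluded_embed_keys)   # hidden from the embedding model only
    elif mode == "none":
        return {}                             # no metadata at all
    else:
        raise ValueError(f"Unknown mode: {mode}")
    return {key: value for key, value in metadata.items() if key not in excluded}

# Hypothetical sample values
meta = {"File Name": "agents.md", "Header Path": "Agents/Usage"}

llm_view = visible_metadata(meta, "llm", excluded_llm_keys=["File Name"])
embed_view = visible_metadata(meta, "embed", excluded_llm_keys=["File Name"])
print(llm_view)
print(embed_view)
```

Because each list applies to one mode only, hiding a key from the LLM leaves it available to the embedding model unless it is also added to the embedding exclusion list.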

# Conclusion
In this notebook, we covered how to use a custom data loader, as well as how to customize the text representations of your data when including metadata for both LLMs and embedding models.