<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/CC_BY.png"><br />

Created by [Nathan Kelber](http://nkelber.com) under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/)<br />
For questions/comments/improvements, email nathan.kelber@ithaka.org.<br />
___

# Language Models 3: 🤗 Hugging Face with RAG

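**Description:** This notebook builds a small retrieval-augmented generation (RAG) pipeline. It downloads a few documentation files, indexes them with LlamaIndex and a Hugging Face embedding model, retrieves the passages most relevant to a question, and passes them as context to a large language model.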


**Use Case:** For Learners (Detailed explanation, not ideal for researchers)

**Difficulty:** Intermediate

**Completion Time:** 75 minutes

**Knowledge Required:** 
* Python Basics
* Pandas Basics

**Knowledge Recommended:** 
* Python Intermediate
* Pandas Intermediate

**Data Format:** None

**Libraries Used:** 
* [🤗 Transformers](https://huggingface.co/docs/transformers/index) - provides APIs and tools to easily download and train pretrained models
* [PyTorch](https://pytorch.org/) - a popular machine learning framework
* [LlamaIndex](https://docs.llamaindex.ai/en/stable/) - helps index our documents

**Research Pipeline:** None
___

# Installations

In [None]:
# Install transformers and llama-index libraries
!pip install transformers
!pip install llama-index
!pip install llama-index-embeddings-huggingface

# Import Libraries

In [None]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.postprocessor import SimilarityPostprocessor
from transformers import pipeline
from huggingface_hub import login
from huggingface_hub import InferenceClient
import urllib.request
from pathlib import Path

In [None]:
# We can grab a particular dataset builder
# without downloading the whole dataset
# That allows us to preview its description and features first
# Dataset https://huggingface.co/datasets/wikitext
# Requires the datasets library (pip install datasets)
from datasets import load_dataset_builder

ds_builder = load_dataset_builder("wikitext", "wikitext-103-raw-v1")

In [None]:
# Use .info.description to retrieve the description
ds_builder.info.description

In [None]:
# Use .info.features to retrieve the features
ds_builder.info.features

# Download Documents



In [None]:
# Create a "documents" folder in the current working directory
dir_path = Path.cwd() / "documents"
dir_path.mkdir(exist_ok=True)

# Map each local file name to the URL it is downloaded from
files = {
    "jupyter-ai-documentation.txt" : 'https://jupyter-ai.readthedocs.io/en/latest/_sources/users/index.md.txt',
    "llama-3.1b-405.txt" : 'https://raw.githubusercontent.com/meta-llama/llama-models/main/models/llama3_1/MODEL_CARD.md',
    "mistral-large-instruct-2407.txt" : 'https://huggingface.co/mistralai/Mistral-Large-Instruct-2407/resolve/main/README.md'
}

# Download each file into the documents folder
for file_name, url in files.items():
    urllib.request.urlretrieve(url, dir_path / file_name)

# Simple Directory Reader

The simple directory reader gathers all the files in a directory and turns them into a list of document objects. It can parse many kinds of files, including PDFs, text files, and markdown files. It selects a reader based on each file's type, and different readers split files differently. For example, a text file is treated as a single document, whereas a markdown file is broken down by headings. Because the files above were saved with a `.txt` extension, each one should load as a single document.

In [None]:
# Collect documents into a list
docs = SimpleDirectoryReader("documents").load_data()

In [None]:
print(len(docs))
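
To make the difference between text and markdown files concrete, here is a minimal sketch that uses only the Python standard library. The function `toy_read_directory` is a hypothetical stand-in written for this illustration. It is not how LlamaIndex implements `SimpleDirectoryReader`; it only mimics the behavior described above, where a `.txt` file becomes one document and a `.md` file is split at its headings.

```python
import re
import tempfile
from pathlib import Path


def toy_read_directory(directory):
    """Toy stand-in for a directory reader: returns a list of text 'documents'."""
    documents = []
    for path in sorted(Path(directory).iterdir()):
        text = path.read_text(encoding="utf-8")
        if path.suffix == ".md":
            # Split just before each markdown heading so every section is a document
            sections = re.split(r"(?m)^(?=#{1,6} )", text)
            documents.extend(s.strip() for s in sections if s.strip())
        elif path.suffix == ".txt":
            # A plain text file is kept whole, even if it contains '#' lines
            documents.append(text.strip())
    return documents


# Build a throwaway folder with one markdown file and one text file
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "guide.md").write_text("# Intro\nhello\n## Setup\nsteps\n", encoding="utf-8")
    Path(tmp, "notes.txt").write_text("# Not split\nplain text\n## Still one document\n", encoding="utf-8")
    demo_docs = toy_read_directory(tmp)

print(len(demo_docs))  # 3: two markdown sections plus one text file
```

The same heading-filled content yields one document as `.txt` and several as `.md`, which is why the file extension chosen at download time matters.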

# Embedding Settings

In [None]:
# Set a Hugging Face embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

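# Disable LlamaIndex's default LLM; this notebook calls the model separately below
Settings.llm = None
# Chunk size and overlap, measured in tokens
Settings.chunk_size = 256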
Settings.chunk_overlap = 25
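
The two chunk settings can be pictured as a sliding window. The sketch below is a simplified illustration, not LlamaIndex's splitter: the function `chunk_tokens` is hypothetical, and it treats whitespace-separated words as "tokens", whereas the real splitter counts tokens with a tokenizer and tries to respect sentence boundaries. It only shows how a chunk size of 256 and an overlap of 25 interact.

```python
def chunk_tokens(tokens, chunk_size=256, chunk_overlap=25):
    """Split a list of tokens into windows that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Assumption for illustration: 600 made-up "tokens" named w0 ... w599
words = [f"w{i}" for i in range(600)]
chunks = chunk_tokens(words)

print([len(c) for c in chunks])          # [256, 256, 138]
print(chunks[0][-25:] == chunks[1][:25])  # True: neighbors share 25 tokens
```

Smaller chunks make each retrieved passage more focused, while the overlap reduces the chance that a sentence relevant to the query is cut in half at a chunk boundary.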

In [None]:
# Create a vector index from the documents
index = VectorStoreIndex.from_documents(docs)

# Search Function


In [None]:
# Documents to retrieve
top_k = 3

# Retriever configuration
retriever = VectorIndexRetriever(
    index = index,
    similarity_top_k=top_k
)

In [None]:
# Query Engine
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.5)],
)
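
The retriever and the postprocessor apply two separate filters: `similarity_top_k` keeps the best-scoring chunks, and `similarity_cutoff` then drops any of those that score below the threshold. The following sketch imitates that two-step filtering with plain Python. The vectors, chunk names, and the `retrieve` function are invented for illustration, and cosine similarity is assumed as the scoring function; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, chunk_vecs, top_k=3, similarity_cutoff=0.5):
    """Keep the top_k most similar chunks, then drop those under the cutoff."""
    scored = [(cosine_similarity(query_vec, vec), name) for name, vec in chunk_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    top = scored[:top_k]
    return [(name, round(score, 3)) for score, name in top if score >= similarity_cutoff]


# Hypothetical 3-dimensional "embeddings" for four chunks
toy_chunks = {
    "providers": [0.9, 0.1, 0.0],
    "install": [0.6, 0.6, 0.2],
    "license": [0.0, 0.2, 0.9],
    "changelog": [0.1, 0.9, 0.1],
}

results = retrieve([1.0, 0.0, 0.0], toy_chunks)
print(results)
```

In this toy example three chunks make it into the top 3, but only two clear the 0.5 cutoff. The same thing can happen in the notebook, so a query may return fewer than `top_k` source nodes.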

# Retrieval


In [None]:
# Query
query = 'What providers does Jupyter AI support?'
response = query_engine.query(query)

In [None]:
# Create a context string from response
# Loop over the nodes actually returned; the similarity cutoff
# may leave fewer than top_k nodes
context = "Context:\n"
for node in response.source_nodes:
    context = context + node.text + "\n\n"

print(context)

In [None]:
ragless_prompt = f"""
[INST] ResearchBuddy, a virtual consultant for research tasks, communicates in clear, accessible language, helping answer technical questions on documentation.

Please respond to the following comment.
{query}

[/INST]
"""

In [None]:
# Create a RAG prompt with the context
ragful_prompt = ragless_prompt + context
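
The two prompts differ only in whether the retrieved context is appended. One way to keep them in sync is to build both from a single helper. The sketch below is an optional refactoring, not part of the original workflow: `build_prompt` and `PERSONA` are hypothetical names, and the layout simply mirrors the cells above (instructions, query, then context).

```python
PERSONA = (
    "ResearchBuddy, a virtual consultant for research tasks, communicates in clear, "
    "accessible language, helping answer technical questions on documentation."
)


def build_prompt(query, context_chunks=None):
    """Return an instruction prompt, optionally followed by retrieved context."""
    prompt = (
        f"[INST] {PERSONA}\n\n"
        "Please respond to the following comment.\n"
        f"{query}\n\n"
        "[/INST]\n"
    )
    if context_chunks:
        # Same layout as the notebook: a "Context:" header, then chunks separated by blank lines
        prompt += "Context:\n" + "\n\n".join(context_chunks) + "\n\n"
    return prompt


question = "What providers does Jupyter AI support?"
plain = build_prompt(question)
# Placeholder strings stand in for the text of real retrieved nodes
with_context = build_prompt(question, ["First retrieved chunk.", "Second retrieved chunk."])
print(with_context)
```

With the real pipeline, the list passed as `context_chunks` would be the text of each node in `response.source_nodes`.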

### Pass the Prompt to the LLM

In [None]:
# Log in using an access token
login()

In [None]:
# Choose the model
client = InferenceClient("meta-llama/Meta-Llama-3.1-8B-Instruct")

In [None]:
# Ask the model without context

for message in client.chat_completion(
    messages=[{"role": "user", "content": ragless_prompt}],
    max_tokens=500,
    stream=True,
):
    # Some streamed chunks may carry no text, so fall back to an empty string
    print(message.choices[0].delta.content or "", end="")

In [None]:
# Ask the model with RAG context

for message in client.chat_completion(
    messages=[{"role": "user", "content": ragful_prompt}],
    max_tokens=500,
    stream=True,
):
    # Some streamed chunks may carry no text, so fall back to an empty string
    print(message.choices[0].delta.content or "", end="")
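
Printing each streamed piece is fine for reading along, but comparing the two answers is easier if each one is collected into a string. The sketch below shows one way to do that. It does not call the Inference API: `fake_chunk` is a hypothetical helper that builds objects with the same `choices[0].delta.content` shape used in the loops above, and the assumption that some chunks carry `None` instead of text is made for illustration.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text pieces of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        piece = chunk.choices[0].delta.content
        if piece:  # skip chunks whose content is None or empty
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text):
    """Hypothetical stand-in for one streamed chunk (illustration only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [
    fake_chunk("Jupyter AI "),
    fake_chunk(None),
    fake_chunk("supports several providers."),
]

answer = collect_stream(fake_stream)
print(answer)
```

In the notebook, the iterable returned by `client.chat_completion(..., stream=True)` could be passed to `collect_stream` in place of `fake_stream`, once for each prompt.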