# `ContextCite` with a RAG LangChain example

In this notebook, we'll show a quick example of how to use `ContextCite` with a RAG chain using the `langchain` library.

In [None]:
!pip install -qU langchain-community langchain-openai context_cite faiss-cpu huggingface_hub

In [1]:
from context_cite import ContextCiter
from langchain_community.document_loaders import TextLoader
from langchain_community.llms import HuggingFaceHub
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough, chain
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter



Let's start with a LangChain RAG chain that does not involve `ContextCite`.

# A simple RAG chain (without ContextCite)

We'll use a local `txt` file as our "database" to keep things simple.

In [2]:
llm = HuggingFaceHub(
    repo_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    model_kwargs={"max_length": 180, "max_new_tokens": 180},
)


loader = TextLoader("./assets/transformer.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
db = FAISS.from_documents(texts, embeddings)
retriever = db.as_retriever()

system = """Answer the user question based only on the following context:

{context}
"""
prompt = ChatPromptTemplate.from_messages([("system", system), ("user", "{question}")])


def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])


chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Created a chunk of size 1778, which is longer than the specified 1000
Created a chunk of size 5233, which is longer than the specified 1000
Created a chunk of size 1110, which is longer than the specified 1000
Created a chunk of size 1799, which is longer than the specified 1000
Created a chunk of size 1051, which is longer than the specified 1000
Created a chunk of size 1118, which is longer than the specified 1000
Created a chunk of size 1204, which is longer than the specified 1000
Created a chunk of size 1897, which is longer than the specified 1000
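
The "Created a chunk of size ..." messages above appear because `CharacterTextSplitter` only splits on its separator (a blank line by default) and then merges the resulting pieces. A piece that is already longer than `chunk_size` is kept whole. The following is a simplified, standard-library sketch of that merge behavior, not LangChain's actual implementation; the function name `split_on_separator` is made up for illustration.

```python
def split_on_separator(text: str, chunk_size: int, separator: str = "\n\n") -> list[str]:
    """Greedily merge separator-delimited pieces into chunks of at most `chunk_size`.

    A piece that is already longer than `chunk_size` is kept whole, which is
    why a chunk can exceed the requested size.
    """
    chunks: list[str] = []
    current = ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece  # may itself be longer than chunk_size
    if current:
        chunks.append(current)
    return chunks


# Four paragraphs; the third is longer than the requested chunk size.
text = "\n\n".join(["a" * 40, "b" * 40, "c" * 250, "d" * 30])
chunks = split_on_separator(text, chunk_size=100)
print([len(c) for c in chunks])
```

If every chunk must respect the size limit, a splitter that falls back to finer separators (such as `RecursiveCharacterTextSplitter`) is a common alternative.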


In [3]:
question = "What is self-attention?"
qa = chain.invoke(question)
# Keep only the text after the echoed question, if the model output repeats it.
a = qa.split(f"Question: {question}")[-1]
print(a)



MaskedAttention is a variant of the attention mechanism that uses a mask to selectively attend to specific tokens. This can be useful for tasks where the input is noisy or incomplete, such as natural language processing. In 2019, the masked attention mechanism was used in a neural machine translation system that achieved state-of-the-art results on the WMT19 translation task.[4]

DeepMind's AlphaGo
AlphaGo is a game-playing system developed by DeepMind, a British artificial intelligence company. It uses a deep neural network with attention mechanisms to play Go, a board game. In 2016, AlphaGo defeated the world champion, Lee Sedol, in a series of games. The system used a combination of deep neural networks, recurrent neural networks, and attention mechan


# Adding in `ContextCite`

In short, we'll replace the `prompt` and `llm` steps with a `ContextCiter` wrapped in a Runnable. It takes care of formatting the context and query, as well as running generation with our LLM.

In [5]:
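from langchain_core.runnables import chain as chain_decorator

CC_MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"


# The name `chain` was rebound to the RAG pipeline above, so the decorator
# is re-imported under an alias to avoid the clash.
@chain_decorator
def context_cite(context_and_query: dict):
    # The decorated function receives the whole input dict as its only argument.
    context = context_and_query["context"]
    query = context_and_query["question"]
    cc = ContextCiter.from_pretrained(CC_MODEL_NAME, context, query)
    return cc.get_attributions(as_dataframe=True, top_k=8)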

In [7]:
new_chain = {
    "context": retriever | format_docs,
    "question": RunnablePassthrough(),
} | context_cite
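
The dict at the head of `new_chain` computes each of its values from the same input (the question string) and hands the resulting dict to the next step. The sketch below mimics that data flow with plain Python functions. Everything in it (`CORPUS`, `toy_retriever`, `build_inputs`, `toy_step`) is a hypothetical stand-in, not part of LangChain or `ContextCite`.

```python
# Hypothetical stand-ins: none of these names come from LangChain or ContextCite.
CORPUS = [
    "Self-attention computes soft weights between all tokens.",
    "Go players place stones on a board.",
]


def toy_retriever(question: str) -> list[str]:
    """Return passages that share at least one lowercase word with the question."""
    words = set(question.lower().rstrip("?").split())
    return [p for p in CORPUS if words & set(p.lower().rstrip(".").split())]


def format_docs(passages: list[str]) -> str:
    """Join passages with blank lines, as the notebook's `format_docs` does."""
    return "\n\n".join(passages)


def build_inputs(question: str) -> dict:
    """Mimic the dict step: every value is computed from the same input."""
    return {"context": format_docs(toy_retriever(question)), "question": question}


def toy_step(inputs: dict) -> str:
    """Stand-in for the decorated function: it receives the whole dict."""
    return f"{len(inputs['context'])} context chars for: {inputs['question']}"


inputs = build_inputs("What is self-attention?")
print(inputs["question"])
print(toy_step(inputs))
```

This is why `context_cite` can read both `"context"` and `"question"` from its single argument.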

In [8]:
question = "What is self-attention?"
new_chain.invoke(question)

Attributed: Self-attention is a mechanism in neural networks that allows for the computation of "soft" weights between tokens in parallel, which leads to improved training speed. It is a type of attention mechanism that processes all tokens simultaneously, rather than sequentially, as in recurrent neural networks. The attention mechanism only uses information about other tokens from lower layers, which leads to improved training speed. Self-attention is used in transformer models, which are a type of neural network that has multiple layers and multiple attention heads. Each layer in a transformer model has multiple attention heads, which encode relevance relations that are meaningful to humans. The outputs for the attention layer are concatenated to pass into the feed-forward neural network layers. The attention mechanism is directly plugged into the attention mechanism, allowing for pretraining on short context windows and finetuning on longer context windows.



| | Score | Source |
|---|---|---|
| 0 | 69.257 | Since the attention mechanism only uses information about other tokens from lower layers, it can be computed for all tokens in parallel, which leads to improved training speed. |
| 1 | 50.328 | The outputs for the attention layer are concatenated to pass into the feed-forward neural network layers. |
| 2 | 46.142 | Many transformer attention heads encode relevance relations that are meaningful to humans. |
| 3 | 33.876 | ALiBi allows pretraining on short context windows, then finetuning on longer context windows. |
| 4 | 26.981 | [1] Transformers, using an attention mechanism, processing all tokens simultaneously, calculated "soft" weights between them in successive layers. |
| 5 | 25.682 | Since it is directly plugged into the attention mechanism, it can be combined with any positional encoder that is plugged into the "bottom" of the entire network (which is where the sinusoidal encoder on the original transformer, as well as RoPE and many others, are located). |
| 6 | 18.897 | {\displaystyle \left(W_{Q},W_{K},W_{V}\right)} matrices is called an attention head, and each layer in a transformer model has multiple attention heads. |
| 7 | 0.601 | {\displaystyle {\begin{aligned}{\text{MaskedAttention}}(Q,K,V)={\text{softmax}}\left(M+{\frac {QK^{\mathrm {T} }}{\sqrt {d_{k}}}}\right)V\end{aligned}}} |
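
One possible way to use these scores downstream is to keep only the sources that clear a cutoff. The sketch below copies the row indices and scores from the output above. The cutoff of 20.0 and the helper `select_sources` are arbitrary assumptions made for illustration, not something `ContextCite` prescribes.

```python
# Row index -> score, copied from the attribution output above.
scores = {
    0: 69.257,
    1: 50.328,
    2: 46.142,
    3: 33.876,
    4: 26.981,
    5: 25.682,
    6: 18.897,
    7: 0.601,
}

THRESHOLD = 20.0  # arbitrary cutoff, chosen for illustration only


def select_sources(scores: dict[int, float], threshold: float) -> list[int]:
    """Return row indices with score >= threshold, highest score first."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [row for row, score in ranked if score >= threshold]


selected = select_sources(scores, THRESHOLD)
print(selected)
```

With this cutoff, the two lowest-scoring rows (the fragments containing raw LaTeX markup) are dropped.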
