# Retrieval Augmented Generation: The one about metadata
In this notebook, we'll learn how to do Retrieval Augmented Generation with user-provided metadata.

In [3]:
!pip install langchain jsonlines qdrant-client gradio


## AI articles 🤖📰🧠✨
I've collected a bunch of AI articles from the BBC website into a file. Let's take a look at them.

In [280]:
import jsonlines
with jsonlines.open("documents.json", "r") as documents:
    print(next(iter(documents)))  

{'url': 'https://www.bbc.com/news/entertainment-arts-67134595', 'title': 'John Grisham: Threat from AI cannot be truly appreciated - BBC News', 'time': '2023-10-17T14:00:32.000Z', 'tags': ['Artificial intelligence', 'Books', 'Tom Cruise'], 'authors': ['Emma Saunders'], 'body': ['Bestselling thriller writer John Grisham says the "threat" to his profession from AI cannot be "truly appreciated... explained or predicted".', 'He is among a group of writers who have accused OpenAI of unlawfully training its artificial-intelligence-based chatbot ChatGPT on their work.', 'Jonathan Franzen, Jodi Picoult and George RR Martin are among those joining the recent group legal action.', 'Grisham told BBC One\'s Breakfast programme: "It\'s my turn to file suit."', 'He said: "For 30 years, I\'ve been sued by everyone else - for slander, defamation, copyright, whatever - so it\'s my turn."', 'OpenAI said last month it respected the rights of authors, "they should benefit from AI technology" and the compa

## Storing embeddings in Qdrant 💾🧭🤖
Now, we're going to create LangChain documents based on the chunks of text in each article. We'll then store the chunks and their embeddings in Qdrant.
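To make the shape of that transformation concrete, here is a minimal plain-Python sketch with no LangChain involved. The helper name `article_to_chunks` is made up for illustration, and the sample article is abridged to its first two paragraphs. The point it shows is that every paragraph becomes its own chunk and each chunk carries a copy of the whole article's metadata.

```python
def article_to_chunks(article: dict) -> list[dict]:
    """Turn one article into one chunk per body paragraph.

    Hypothetical helper for illustration: each chunk gets a copy of the
    article-level metadata, so a chunk can later be filtered by author or tag.
    """
    metadata = {key: article[key] for key in ("url", "tags", "time", "title", "authors")}
    return [
        {"page_content": paragraph, "metadata": dict(metadata)}
        for paragraph in article["body"]
    ]


# Sample article taken from documents.json, abridged to two paragraphs.
article = {
    "url": "https://www.bbc.com/news/entertainment-arts-67134595",
    "title": "John Grisham: Threat from AI cannot be truly appreciated - BBC News",
    "time": "2023-10-17T14:00:32.000Z",
    "tags": ["Artificial intelligence", "Books", "Tom Cruise"],
    "authors": ["Emma Saunders"],
    "body": [
        'Bestselling thriller writer John Grisham says the "threat" to his profession from AI cannot be "truly appreciated... explained or predicted".',
        "He is among a group of writers who have accused OpenAI of unlawfully training its artificial-intelligence-based chatbot ChatGPT on their work.",
    ],
}

chunks = article_to_chunks(article)
print(len(chunks))
print(chunks[1]["metadata"]["authors"])
```

Duplicating the metadata on every chunk costs some storage, but it means a filter on `authors` or `tags` can be applied to individual chunks without looking the parent article up again.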

In [281]:
from langchain_core.documents import Document

In [282]:
langchain_documents = []
with jsonlines.open("documents.json", "r") as documents:
    for doc in documents:
        for line in doc['body']:
            langchain_documents.append(
                Document(
                    page_content=line,
                    metadata={
                        "url": doc['url'],
                        "tags": doc['tags'],
                        "time": doc['time'],
                        "title": doc['title'],
                        "authors": doc["authors"]
                    }
                )
            )
len(langchain_documents), langchain_documents[:1]

(
    1225,
    [
        Document(
            page_content='Bestselling thriller writer John Grisham says the "threat" to his profession from AI cannot be "truly appreciated... explained or predicted".',
            metadata={
                'url': 'https://www.bbc.com/news/entertainment-arts-67134595',
                'tags': ['Artificial intelligence', 'Books', 'Tom Cruise'],
                'time': '2023-10-17T14:00:32.000Z',
                'title': 'John Grisham: Threat from AI cannot be truly appreciated - BBC News',
                'authors': ['Emma Saunders']
            }
        )
    ]
)

In [283]:
from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings

In [284]:
embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-base-en-v1.5")

In [285]:
store = Qdrant.from_documents(
    langchain_documents,
    embeddings,
    path="/tmp/ai_qdrant",
    collection_name="AI-Embeddings",
)  

## Querying the vector store 🔍🗄️👨‍💻
Now we're going to query those documents for various search terms.
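Conceptually, a similarity search ranks the stored vectors by how close they are to the query vector, and a metadata filter narrows the set of candidates that get ranked. The sketch below is a brute-force illustration of that idea, not how Qdrant works internally. The names `cosine` and `search` and the toy 2-D vectors are invented for this example, and real embeddings have hundreds of dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(points, query_vector, k=1, author=None):
    """Rank points by cosine similarity, optionally keeping one author's points only."""
    candidates = [
        p for p in points
        if author is None or author in p["metadata"]["authors"]
    ]
    scored = [(p, cosine(p["vector"], query_vector)) for p in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors, made up for illustration.
points = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"authors": ["Chris Vallance"]}},
    {"id": "b", "vector": [0.6, 0.8], "metadata": {"authors": ["Emma Saunders"]}},
]
query_vector = [1.0, 0.1]

best_overall = search(points, query_vector)
best_for_author = search(points, query_vector, author="Emma Saunders")
print(best_overall[0][0]["id"], best_for_author[0][0]["id"])
```

Because the filter removes candidates before ranking, the best filtered hit can never score higher than the best unfiltered hit. That is worth keeping in mind when comparing the scores of filtered and unfiltered queries.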

In [286]:
store.similarity_search_with_score(
    query="AI and authors",
    k=1
)

[
    (
        Document(
            page_content='But to pull off this feat AI is trained on huge amounts of copyrighted material. Many authors, actors, artists and musicians argue that AI should not be trained on their works without permission and compensation.',
            metadata={
                'url': 'https://www.bbc.com/news/technology-66661815',
                'tags': ['Europe', 'Artificial intelligence'],
                'time': '2023-08-31T00:26:50.000Z',
                'title': 'Pass AI law soon or risk falling behind, MPs warn - BBC News',
                'authors': ['Chris Vallance']
            }
        ),
        0.7217354717599976
    )
]

In [287]:
store.similarity_search_with_score(
    query="AI and authors",
    filter={"authors": "Emma Saunders"},
    k=1
)

[
    (
        Document(
            page_content='OpenAI said last month it respected the rights of authors, "they should benefit from AI technology" and the company was "optimistic we will continue to find mutually beneficial ways to work together".',
            metadata={
                'url': 'https://www.bbc.com/news/entertainment-arts-67134595',
                'tags': ['Artificial intelligence', 'Books', 'Tom Cruise'],
                'time': '2023-10-17T14:00:32.000Z',
                'title': 'John Grisham: Threat from AI cannot be truly appreciated - BBC News',
                'authors': ['Emma Saunders']
            }
        ),
        0.6064101699793749
    )
]

In [288]:
from qdrant_client.http import models
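The dict filter above used the bare key `authors`, while the native Qdrant filter in the next query uses the dotted key `metadata.authors`. As far as I can tell, that's because LangChain stores each chunk's payload as `page_content` plus a nested `metadata` object, so a native filter has to spell out the full path. The sketch below is a simplified plain-Python model of that lookup, with the made-up helper names `resolve_path` and `matches_value`. It also models the fact that a list-valued field such as `authors` matches when any one of its elements equals the value.

```python
from typing import Any


def resolve_path(payload: dict, dotted_key: str) -> Any:
    """Walk a dotted key such as 'metadata.authors' through nested dicts."""
    current: Any = payload
    for part in dotted_key.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def matches_value(payload: dict, dotted_key: str, value: Any) -> bool:
    """True if the field equals `value`, or is a list that contains `value`."""
    field = resolve_path(payload, dotted_key)
    if isinstance(field, list):
        return value in field
    return field == value


# Payload layout assumed here: text under 'page_content', everything else under 'metadata'.
payload = {
    "page_content": "He is among a group of writers who have accused OpenAI of unlawfully training its artificial-intelligence-based chatbot ChatGPT on their work.",
    "metadata": {
        "authors": ["Emma Saunders"],
        "tags": ["Artificial intelligence", "Books", "Tom Cruise"],
    },
}

print(matches_value(payload, "metadata.authors", "Emma Saunders"))
print(matches_value(payload, "authors", "Emma Saunders"))
```

The second call fails to match because there is no top-level `authors` key in the payload, which is the mistake to watch for when moving from the dict form to `models.Filter`.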

In [289]:
store.similarity_search_with_score(
    query="AI and authors",
    filter=models.Filter(
        must=[
            models.FieldCondition(key="metadata.authors", match=models.MatchValue(value="Emma Saunders"))
        ]
    ),
    k=1
)

[
    (
        Document(
            page_content='OpenAI said last month it respected the rights of authors, "they should benefit from AI technology" and the company was "optimistic we will continue to find mutually beneficial ways to work together".',
            metadata={
                'url': 'https://www.bbc.com/news/entertainment-arts-67134595',
                'tags': ['Artificial intelligence', 'Books', 'Tom Cruise'],
                'time': '2023-10-17T14:00:32.000Z',
                'title': 'John Grisham: Threat from AI cannot be truly appreciated - BBC News',
                'authors': ['Emma Saunders']
            }
        ),
        0.6064101699793749
    )
]

## Querying an LLM with and without metadata 🤔🤖🔍❓
Now let's see how we can have an LLM answer questions with and without a metadata filter.

In [290]:
from langchain_community.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm = Ollama(
    model="mixtral",
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
)  

In [291]:
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain.prompts import PromptTemplate

In [292]:
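def retrieval_chain_with_filter(llm, filter=None):
    # `filter` is passed straight to the Qdrant retriever; None means no metadata restriction.
    template = """You are a bot that answers user questions using only the context provided.
    If you don't know the answer, simply state that you don't know.
    {context}
    Question: {input}"""

    prompt = PromptTemplate(template=template, input_variables=["context", "input"])
    # Only chunks whose metadata matches the filter are candidates for retrieval.
    retriever = store.as_retriever(search_kwargs={'filter': filter})
    # Stuff the retrieved chunks into {context}, then wire the retriever and the LLM together.
    llm_with_prompt = create_stuff_documents_chain(llm, prompt)
    return create_retrieval_chain(retriever, llm_with_prompt)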
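It helps to picture what the "stuff documents" step hands to the LLM. The sketch below is a plain-Python approximation, assuming the default behaviour where only each chunk's `page_content` is joined with blank lines and substituted for `{context}`. The function name `build_prompt` is hypothetical and the template is the one from the cell above with its indentation dropped.

```python
TEMPLATE = (
    "You are a bot that answers user questions using only the context provided.\n"
    "If you don't know the answer, simply state that you don't know.\n"
    "{context}\n"
    "Question: {input}"
)


def build_prompt(chunks: list[dict], question: str, separator: str = "\n\n") -> str:
    """Join the text of the retrieved chunks and fill in the template.

    Hypothetical helper: approximates the default stuffing behaviour, where
    chunk metadata is NOT copied into the prompt.
    """
    context = separator.join(chunk["page_content"] for chunk in chunks)
    return TEMPLATE.format(context=context, input=question)


# Two chunks from the Grisham article, standing in for retriever output.
retrieved = [
    {
        "page_content": 'OpenAI said last month it respected the rights of authors, "they should benefit from AI technology" and the company was "optimistic we will continue to find mutually beneficial ways to work together".',
        "metadata": {"authors": ["Emma Saunders"]},
    },
    {
        "page_content": "He is among a group of writers who have accused OpenAI of unlawfully training its artificial-intelligence-based chatbot ChatGPT on their work.",
        "metadata": {"authors": ["Emma Saunders"]},
    },
]

prompt = build_prompt(retrieved, "What was said about AI safety and copyright?")
print(prompt)
```

Under that assumption the metadata influences which chunks are retrieved, but the LLM never sees the author or tags themselves. If the answer should mention them, they would need to be added to the prompt explicitly.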

In [293]:
result = retrieval_chain_with_filter(llm).invoke({
    "input": "What was said about AI safety and copyright?"
})  

 The text you provided discusses two main topics: concerns about AI safety and the use of copyrighted materials to train AI systems.

On the topic of AI safety, it is mentioned that there are worries about malicious AI being deployed by "bad actors" with the intention to cause harm, or when AI might make decisions that could inadvertently lead to harm. The Canadian government has also signed statements expressing caution regarding future risks of AI.

Regarding copyright, the text highlights the concern that AI systems like ChatGPT are trained on large amounts of copyrighted material without permission from authors, artists, musicians, and other creators. This use of copyrighted works could lead to debates about ownership, compensation, and ethical usage of such materials in AI training datasets.

In summary, it has been pointed out that there are concerns regarding AI safety and the unauthorized use of copyrighted materials for AI training purposes.

In [294]:
result['context']

[
    Document(
        page_content='But to pull off this feat AI is trained on huge amounts of copyrighted material. Many authors, actors, artists and musicians argue that AI should not be trained on their works without permission and compensation.',
        metadata={
            'url': 'https://www.bbc.com/news/technology-66661815',
            'tags': ['Europe', 'Artificial intelligence'],
            'time': '2023-08-31T00:26:50.000Z',
            'title': 'Pass AI law soon or risk falling behind, MPs warn - BBC News',
            'authors': ['Chris Vallance']
        }
    ),
    Document(
        page_content="We also know AI is already causing disruption to jobs. A friend of mine worked in a small marketing company. There were fi

In [295]:
filter = models.Filter(
    must = [
        models.FieldCondition(key="metadata.authors", match=models.MatchValue(value="Emma Saunders"))
    ]
)
result = retrieval_chain_with_filter(llm, filter).invoke({
    "input": "What was said about AI safety and copyright?"
})  

 OpenAI has stated that they respect the rights of authors and believe they should benefit from AI technology. They have also expressed optimism about finding mutually beneficial ways to work together. However, a group of writers, including bestselling thriller writer John Grisham, have accused OpenAI of unlawfully training its artificial-intelligence-based chatbot ChatGPT on their work without permission or compensation. Grisham has also expressed concerns about the potential threat of AI to his profession, stating that it cannot be truly appreciated, explained, or predicted.

In [296]:
result['context']

[
    Document(
        page_content='OpenAI said last month it respected the rights of authors, "they should benefit from AI technology" and the company was "optimistic we will continue to find mutually beneficial ways to work together".',
        metadata={
            'url': 'https://www.bbc.com/news/entertainment-arts-67134595',
            'tags': ['Artificial intelligence', 'Books', 'Tom Cruise'],
            'time': '2023-10-17T14:00:32.000Z',
            'title': 'John Grisham: Threat from AI cannot be truly appreciated - BBC News',
            'authors': ['Emma Saunders']
        }
    ),
    Document(
        page_content='He is among a group of writers who have accused OpenAI of unlawfully training its artificial-intelligence-b

## Gradio 📚📊📱🔧
Finally, let's build a UI using Gradio so we can see how the metadata might be provided by the end user.

In [297]:
import gradio as gr

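def respond_to_input(text, tag):
    # Only retrieve chunks from articles carrying the tag chosen in the dropdown.
    tag_filter = models.Filter(
        must=[
            models.FieldCondition(key="metadata.tags", match=models.MatchValue(value=tag))
        ]
    )
    result = retrieval_chain_with_filter(llm, tag_filter).invoke({"input": text})
    return result['answer']

# Collect the distinct tags across all chunks to populate the dropdown.
tags = set()
for doc in langchain_documents:
    for tag in doc.metadata['tags']:
        tags.add(tag)

# Sort the tags so the dropdown receives a list in a stable, alphabetical order.
interface = gr.Interface(fn=respond_to_input, inputs=["text", gr.Dropdown(choices=sorted(tags))], outputs="text")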
interface.launch()

Running on local URL:  http://127.0.0.1:7888

To create a public link, set `share=True` in `launch()`.
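One thing the UI above doesn't handle is a user who leaves the dropdown empty. Here is a minimal plain-Python sketch of one way to deal with that, assuming an empty selection arrives as `None` or an empty string. The helper names `collect_tags` and `tag_to_filter` are made up, and the filter is written in the dict form used earlier with `authors`.

```python
from typing import Optional


def collect_tags(chunks: list[dict]) -> list[str]:
    """Gather the distinct tags across all chunks, sorted for a stable dropdown order."""
    return sorted({tag for chunk in chunks for tag in chunk["metadata"]["tags"]})


def tag_to_filter(tag: Optional[str]) -> Optional[dict]:
    """Map the dropdown value to a dict-style filter, or None when nothing is selected."""
    if not tag:
        return None
    return {"tags": tag}


# Tag lists copied from the two articles shown earlier in the notebook.
chunks = [
    {"metadata": {"tags": ["Artificial intelligence", "Books", "Tom Cruise"]}},
    {"metadata": {"tags": ["Europe", "Artificial intelligence"]}},
]

print(collect_tags(chunks))
print(tag_to_filter("Books"))
print(tag_to_filter(None))
```

Returning `None` for an empty selection would let the same chain fall back to an unfiltered search instead of building a filter around a missing value.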



 Rishi Sunak, the UK's Chancellor of the Exchequer, has expressed optimism about the potential benefits of artificial intelligence (AI) for the economy and workforce, while acknowledging concerns about its impact on jobs. Meanwhile, Elon Musk, CEO of SpaceX and Tesla, has been a vocal advocate for the development of AI but has also expressed concern about its potential risks and dangers.


Sunak, on the other hand, has emphasized the potential for AI to boost productivity and create new opportunities for workers, while acknowledging the need for education reforms to help people acquire the skills needed to thrive in an increasingly automated world. He has also announced measures to support the development of AI and other emerging technologies in the UK, including funding for research and innovation, and efforts to attract talent and investment from around the world.

Overall, while both Sunak and Musk recognize the potential benefits and risks of AI, they have different perspectives on