# **RAG ChatBot**

### Components:

### Indexing
1. **Load:** First we need to load our data.
2. **Split:** Text splitters break large Documents into smaller chunks. This is useful both for indexing data and for passing it to a model, since large chunks are harder to search over and may not fit in a model's finite context window.
3. **Store:** We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and Embeddings model.

### Retrieval and generation
4. **Retrieve:** Given a user input, relevant splits are retrieved from storage using a Retriever.
5. **Generate:** A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.

## **Step 1: Environment Setup and Initialization**

In [2]:
import os
import re
from getpass import getpass
import time
from tqdm import tqdm
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

from langchain import hub
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.schema import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

In [2]:
# OpenAI API key
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

# Pinecone API key
os.environ["PINECONE_API_KEY"] = getpass("Enter your Pinecone API key: ")
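Re-running the cell above prompts for both keys every time. A minimal sketch of one way to avoid that, assuming a hypothetical helper `ensure_secret` (it is not part of the notebook or of any library) that only prompts when the variable is not already set:

```python
import os
from getpass import getpass


def ensure_secret(name, prompt_fn=getpass):
    """Return the value of env var `name`, prompting only if it is unset or empty.

    Hypothetical helper. `prompt_fn` is injectable so the function can be
    exercised without an interactive terminal.
    """
    value = os.environ.get(name)
    if not value:
        value = prompt_fn(f"Enter your {name}: ")
        os.environ[name] = value
    return value
```

With this in place, `ensure_secret("OPENAI_API_KEY")` and `ensure_secret("PINECONE_API_KEY")` could replace the two assignments.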

## **Step 2: Data Preparation**

In [3]:
# Loads the data
df = pd.read_csv('../DataFiles/web_scraped_data.csv')
df.head()

Unnamed: 0,page_content,metadata
0,"<!DOCTYPE html><html lang=""en""><head><title>On...",{'source': 'https://www.visitlisboa.com/en/p/o...
1,"<!DOCTYPE html><html lang=""en""><head><title>Wh...",{'source': 'https://www.visitlisboa.com/en/p/w...
2,"<!DOCTYPE html><html lang=""en""><head><title>Ge...",{'source': 'https://www.visitlisboa.com/en/p/g...
3,"<!DOCTYPE html><html lang=""en""><head><title>My...",{'source': 'https://www.visitlisboa.com/en/p/m...
4,"<!DOCTYPE html><html lang=""en""><head><title>Jo...",{'source': 'https://www.visitlisboa.com/en/p/m...


In [4]:
num_rows = df.shape[0]
print(num_rows)

227


In [5]:
sum(not row.startswith('<!DOCTYPE html>') for row in df['page_content'])

8

In [6]:
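# Drops rows with missing content first, so the string check below only sees real strings
df = df.dropna(subset=['page_content'])

# Keeps only rows with valid HTML content; .copy() avoids SettingWithCopyWarning on later assignments
df = df[df['page_content'].str.startswith('<!DOCTYPE html>')].copy()
df.reset_index(drop=True, inplace=True)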

In [7]:
num_rows = df.shape[0]
print(num_rows)

219


### **Step 2.1: Parsing the HTML and Source Content**

In [8]:
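import ast

# Extracts the source URL from the metadata string.
# ast.literal_eval parses the dict literal without executing arbitrary code, unlike eval.
df['source'] = df['metadata'].apply(lambda x: ast.literal_eval(x)['source'])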
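The `metadata` column stores each dict as a string. As a small sketch (the sample string is made up in the same shape as the column), this is how `ast.literal_eval` handles such a string, and how it rejects input that is not a plain literal:

```python
import ast

# Made-up sample in the same shape as the `metadata` column
raw = "{'source': 'https://www.visitlisboa.com/en/p/only-in-lisbon'}"

parsed = ast.literal_eval(raw)   # parses literals only, never runs code
source = parsed["source"]

# A string containing a function call is rejected instead of being executed
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
```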

In [9]:
# Function to parse HTML content based on source
def parse_html_content(row):
    html_content = row['page_content']
    source = row['source']
    
    soup = BeautifulSoup(html_content, 'html.parser')
    
    # Extracts the title
    title = soup.title.string if soup.title and soup.title.string else ''
    
    # Removes unwanted substrings from the title
    unwanted_titles = ["| Visit Lisboa", "| CP - Comboios de Portugal"]
    for unwanted in unwanted_titles:
        title = title.replace(unwanted, '').strip()
    
    # Removes unwanted elements such as footer, headers, etc.
    for element in soup(['footer', 'script', 'style', 'header', 'nav', 'aside']):
        element.extract()
    
    # Determines the main content based on the source URL
    main_content = ""
    main_content_tag = None
    main_content_class = None  # initialized so the cp.pt branch cannot leave it undefined
    
    if "visitlisboa.com" in source:
        main_content_class = "mt-3"
    elif "carris" in source:
        main_content_class = "col-lg-9 col-12 pl-lg-0"
    elif "metrolisboa" in source:
        main_content_class = "et-l et-l--post"
    elif "cp.pt" in source:
        # For cp.pt, a CSS selector locates the main content container directly
        main_content_tag = soup.select_one("body > div.wrapper > div.wrapper-content > div:nth-child(2) > div > div > div.row > div")
    else:
        main_content_class = None
    
    # Extracts the main content using the specified class
    if main_content_tag:
        main_content = main_content_tag.get_text(separator=" ")
    elif main_content_class:
        main_content_tag = soup.find(class_=main_content_class)
        if main_content_tag:
            main_content = main_content_tag.get_text(separator=" ")
    
    # Pattern to match unwanted sentences
    pattern = re.compile(r'This content is hosted by [\w\s,]+ functional cookies[^.]*\. Cookie Settings')
    
    # Removes the unwanted sentences using regular expressions
    main_content = re.sub(pattern, '', main_content)
    
    # Cleans up the text
    main_content = re.sub(r'\s+', ' ', main_content).strip()
    
    return title, main_content

# Applies parsing function to each row in the dataframe
df[['title', 'text']] = df.apply(parse_html_content, axis=1, result_type='expand')
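To make the two cleanup regexes in `parse_html_content` concrete, here is a minimal sketch that applies the same patterns to a made-up string (the sample text and the helper name `clean` are assumptions for illustration):

```python
import re

# Same two patterns as in parse_html_content
banner = re.compile(
    r'This content is hosted by [\w\s,]+ functional cookies[^.]*\. Cookie Settings'
)


def clean(text):
    text = banner.sub('', text)                 # drop the cookie banner
    return re.sub(r'\s+', ' ', text).strip()    # collapse runs of whitespace


# Made-up sample text
sample = ("Opening hours\n\n  10:00 to 18:00. This content is hosted by YouTube, "
          "and you did not accept functional cookies yet. Cookie Settings   Map")
cleaned = clean(sample)
print(cleaned)  # Opening hours 10:00 to 18:00. Map
```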

In [10]:
df.head()

Unnamed: 0,page_content,metadata,source,title,text
0,"<!DOCTYPE html><html lang=""en""><head><title>On...",{'source': 'https://www.visitlisboa.com/en/p/o...,https://www.visitlisboa.com/en/p/only-in-lisbon,Only in Lisbon,The Monastery that turned into Parliament Did ...
1,"<!DOCTYPE html><html lang=""en""><head><title>Wh...",{'source': 'https://www.visitlisboa.com/en/p/w...,https://www.visitlisboa.com/en/p/why-lisbon,Why Lisbon?,Belém Heritage and history with a special twis...
2,"<!DOCTYPE html><html lang=""en""><head><title>Ge...",{'source': 'https://www.visitlisboa.com/en/p/g...,https://www.visitlisboa.com/en/p/get-to-lisbon,Get to Lisbon,There are many ways to get to Lisbon and all o...
3,"<!DOCTYPE html><html lang=""en""><head><title>My...",{'source': 'https://www.visitlisboa.com/en/p/m...,https://www.visitlisboa.com/en/p/my-city,My City,Alfacinhas (Lisbon natives) know there’s no pl...
4,"<!DOCTYPE html><html lang=""en""><head><title>Jo...",{'source': 'https://www.visitlisboa.com/en/p/m...,https://www.visitlisboa.com/en/p/my-city/joana...,Joana Amendoeira,"I am from Santarém, but I fell in love with Li..."


In [11]:
# Counts the number of characters in the text
df['text'].apply(len)

0      9079
1      6416
2      1603
3       623
4      1968
       ... 
214    1294
215     924
216     700
217    3278
218    8364
Name: text, Length: 219, dtype: int64

In [12]:
print(df.loc[0, 'text'])

The Monastery that turned into Parliament Did you know that the building that is today known as the Parliament of Portugal used to be a monastery? Only in Lisbon Only in Lisbon Only in Lisbon Curiosities Jau street, a tribute to the enslaved people of Lisbon There is a curious street in Alcântara about a strange man: Jau. An old artists' villa hidden in Principe Real hills Vila Martel hides behind a closed door at the number 55 in Rua das Taipas, on the slope of Glória. Rua do Arsenal, home of the cod shops, witness to history The trade of Lisbon's favourite fish for Christmas dinner has been more intense in the past, but it still survives in the street that witnessed the most important events in Lisbon's recent history, from the earthquake to 25th April. A barge and two crows: Saint Vincent, the ancient patron saint of Lisbon January 22nd marks another anniversary of the death of Saint Vincent, the saint that was once the patron saint of Lisbon and the Kingdom of Portugal, until he wa

In [13]:
print(df.loc[210, 'source'])
print(df.loc[210, 'text'])

https://www.visitlisboa.com/en/events/festival-big-bang-23
Festival Big Bang'23 [ October ] Music and Adventure Festival for Kids, is surely an intriguing musical voyage of discovery for all involved. Description Map BIG BANG MUSIC AND ADVENTURE FESTIVAL FOR KIDS Big Bang is an international project that began as a business with six partners from five different countries. It is through this project that CCB / Fábrica das Artes has opened a venue where Portuguese artists can create new artistic approaches to music for kids and see their work recognised both in Portugal and throughout Europe. Details www.ccb.pt + info 2023 Centro Cultural de Belém Praça do Império 1449-003 Lisboa From 20 Oct, 2023 to 21 Oct, 2023


### **Step 2.2: Data Preprocessing**

In [14]:
# Let's create a copy so we can proceed safely to the next steps
nltk_df = df.copy()

In [None]:
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('wordnet')
# nltk.download('stopwords')

In [15]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Helper function to map NLTK's part-of-speech tags to the format expected by the lemmatizer
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

def preprocess_text(text):
    # Removes non-alphanumeric characters
    text = re.sub(r'\W', ' ', text)
    # Tokenizes the text
    tokens = word_tokenize(text)
    # POS tagging
    pos_tags = nltk.pos_tag(tokens)
    
    # Lemmatization, skipping stop words. The stop-word check is case-sensitive,
    # so capitalized tokens such as "The" are kept.
    lemmatized_tokens = [
        lemmatizer.lemmatize(token, get_wordnet_pos(tag))
        for token, tag in pos_tags if token not in stop_words
    ]
    return ' '.join(lemmatized_tokens)

nltk_df['preprocessed_text'] = nltk_df['text'].apply(preprocess_text)

In [16]:
nltk_df.drop(['page_content', 'metadata', 'text'], axis=1, inplace=True)
nltk_df.head()

Unnamed: 0,source,title,preprocessed_text
0,https://www.visitlisboa.com/en/p/only-in-lisbon,Only in Lisbon,The Monastery turn Parliament Did know buildin...
1,https://www.visitlisboa.com/en/p/why-lisbon,Why Lisbon?,Belém Heritage history special twist Lisbon 28...
2,https://www.visitlisboa.com/en/p/get-to-lisbon,Get to Lisbon,There many way get Lisbon easy use With airpor...
3,https://www.visitlisboa.com/en/p/my-city,My City,Alfacinhas Lisbon native know place like home ...
4,https://www.visitlisboa.com/en/p/my-city/joana...,Joana Amendoeira,I Santarém I fell love Lisbon listen fado song...
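In the preview above, capitalized words such as "The" and "There" survive preprocessing. NLTK's English stop-word list is all lowercase and the membership test in `preprocess_text` is case-sensitive. A small sketch of the effect, using a tiny hand-written stand-in for the real stop-word list:

```python
# Tiny stand-in for stopwords.words('english'), which is all lowercase
stop_words = {"the", "that", "into", "there", "are", "to"}

tokens = "The Monastery that turned into Parliament".split()

# Case-sensitive check, as in preprocess_text: "The" is kept
case_sensitive = [t for t in tokens if t not in stop_words]

# Lowercasing only for the comparison removes it while preserving the original casing
case_insensitive = [t for t in tokens if t.lower() not in stop_words]

print(case_sensitive)    # ['The', 'Monastery', 'turned', 'Parliament']
print(case_insensitive)  # ['Monastery', 'turned', 'Parliament']
```

Whether to lowercase for the comparison is a design choice; the notebook's results were produced with the case-sensitive variant.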


In [17]:
# Counts the number of characters in the text
nltk_df['preprocessed_text'].apply(len)

0      6393
1      5052
2      1028
3       479
4      1291
       ... 
214     903
215     593
216     441
217    2298
218    5589
Name: preprocessed_text, Length: 219, dtype: int64

## **Step 3: Vector Database Setup**

### **Step 3.1: Initializing Pinecone and Embedding Model**

In [18]:
# Initializes Pinecone with the API key (Pinecone is a vector database used for similarity search)
pc = Pinecone(
    api_key=os.getenv("PINECONE_API_KEY")
)


# We need to specify the index name and check if it exists
index_name = 'lisbon-tourism'
if index_name not in pc.list_indexes().names():
    
    # If it doesn't then we create a new index with the specified dimension and metric
    pc.create_index(
        name=index_name, 
        dimension=1536,  # The dimension is adjusted to match OpenAI's embedding size
        metric='cosine',
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )

    # wait for index to be initialized
    while not pc.describe_index('lisbon-tourism').status['ready']:
        time.sleep(1)

# Connects to the created/existing index
index = pc.Index('lisbon-tourism')
time.sleep(1)
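The readiness loop above polls forever if the index never becomes ready. A minimal sketch of a bounded wait, assuming a hypothetical helper `wait_until` (not part of the Pinecone client):

```python
import time


def wait_until(predicate, timeout=60.0, interval=1.0):
    """Poll `predicate` until it returns True or `timeout` seconds elapse.

    Hypothetical helper; raises TimeoutError when the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(interval)
```

In this notebook it could be called as `wait_until(lambda: pc.describe_index(index_name).status['ready'])`.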

In [None]:
# In case we wish to delete the index in Pinecone
# pc.delete_index(index_name)

**Euclidean Distance** - Useful when the magnitude (length) of the vectors carries meaningful information, such as geographical data or cases where the scale of the data matters.<br/>
**Cosine Similarity** - Useful in text analysis and other cases where the direction of the vectors matters more than their magnitude, such as comparing document similarity. This is the metric chosen for the index above.<br/>
**Manhattan Distance** - Useful for grid-like data.<br/>
**Hamming Distance** - Useful for categorical data.<br/>
**Jaccard Distance** - Useful for binary or categorical data.
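To illustrate the difference between the first two metrics, here is a minimal pure-Python sketch (the vectors are made up): two vectors that point in the same direction but differ in length have a cosine similarity of 1, while their Euclidean distance is non-zero.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


v = [1.0, 2.0, 3.0]
w = [2.0, 4.0, 6.0]   # same direction as v, twice the length

cos = cosine_similarity(v, w)     # approximately 1.0: the direction is identical
dist = euclidean_distance(v, w)   # sqrt(14): the difference in magnitude shows up
```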

In [19]:
index.describe_index_stats() # let's view the index stats

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 441}},
 'total_vector_count': 441}

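On a first run the index is ready but empty at this point, so the next step is to fill it with vectors. (The stats above already show 441 vectors, which suggests the cell was re-run after the upload in Step 3.3.) <br/>
To create the vector embeddings we will use OpenAI's text-embedding-ada-002 model, which we can access via LangChain.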

In [20]:
embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002",
                                   api_key=os.environ["OPENAI_API_KEY"])

### **Step 3.2: Chunking the Data**
Some of the texts are quite large (6000+ characters). Texts of that size still fit within the input limit of OpenAI's text-embedding-ada-002 (8191 tokens), but a single embedding for a whole page mixes many topics and is harder to search over, and large retrieved passages use up the chat model's context window. So we will apply chunking.

In [21]:
def chunk_text(text, chunk_size=1000, chunk_overlap=200):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, 
        chunk_overlap=chunk_overlap, 
        add_start_index=True
    )
    document = Document(page_content=text)
    chunks = text_splitter.split_documents([document])
    return [chunk.page_content for chunk in chunks]
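To show what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch that cuts fixed-size character windows. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraphs, newlines, and spaces, so its chunks are usually shorter than `chunk_size`. The function name `sliding_chunks` is an assumption for illustration.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Cut `text` into windows of `chunk_size` characters that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the role the 200-character overlap plays at full scale: a sentence cut at a boundary still appears whole in one of the neighbouring chunks.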

### **Step 3.3: Uploading Embeddings to Pinecone**

In [22]:
# Function to upload embeddings to Pinecone with chunking and batching
def upload_embeddings_to_pinecone_with_chunking(index, nltk_df, chunk_size=1000, chunk_overlap=200, batch_size=10):
    all_chunks = []
    all_metadata = []

    for idx, row in nltk_df.iterrows():
        # Chunks the preprocessed text
        chunks = chunk_text(row['preprocessed_text'], chunk_size, chunk_overlap)

        for i, chunk in enumerate(chunks):
            all_chunks.append(chunk)
            all_metadata.append({
                "id": f"{idx}_{i}",
                "source": row['source'], 
                "title": row['title'],
                "chunk": i,
                "text": chunk
            })

            if len(all_chunks) >= batch_size:
                # Generates embeddings for the batch
                embeddings = embedding_model.embed_documents(all_chunks)

                # Upserts the batch of embeddings and metadata
                to_upsert = [(md['id'], emb, md) for emb, md in zip(embeddings, all_metadata)]
                index.upsert(to_upsert)

                # Clears the lists
                all_chunks = []
                all_metadata = []

    # Upserts any remaining chunks
    if all_chunks:
        embeddings = embedding_model.embed_documents(all_chunks)
        to_upsert = [(md['id'], emb, md) for emb, md in zip(embeddings, all_metadata)]
        index.upsert(to_upsert)
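The function above accumulates chunks until `batch_size` is reached, flushes them, and flushes the remainder at the end. The same batching idea can be sketched in isolation with a small generator (the helper name `batched` and the sample ids are assumptions; the ids follow the `"<row>_<chunk>"` format used in the metadata):

```python
def batched(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Made-up chunk ids: 3 rows with 2 chunks each
ids = [f"{row}_{i}" for row in range(3) for i in range(2)]
batches = list(batched(ids, batch_size=4))
print(batches)  # [['0_0', '0_1', '1_0', '1_1'], ['2_0', '2_1']]
```

The final, smaller batch corresponds to the "remaining chunks" upsert at the end of the upload function.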

In [23]:
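# Uploads embeddings for all documents in the dataframe with chunking.
# Note: the bar is updated only once, after the whole upload finishes,
# so it reports the total time rather than live progress.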
batch_size = 100
with tqdm(total=len(nltk_df)) as pbar:
    upload_embeddings_to_pinecone_with_chunking(index, nltk_df, batch_size=batch_size)
    pbar.update(len(nltk_df))

100%|██████████| 219/219 [00:13<00:00, 15.65it/s]


In [24]:
index.describe_index_stats() # let's view the index stats

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 441}},
 'total_vector_count': 441}

In [30]:
# Initializing the VectorStore
vectorstore = PineconeVectorStore(index=index, embedding=embedding_model)

## **Step 4: RAG Model Implementation (Retrieval and generation)**

We've built a fully-fledged knowledge base. Now it's time to connect that knowledge base to our chatbot. To do that we'll be using LangChain for the higher-level abstractions. <br/>
LangChain's abstraction for a vector index is called a vectorstore. We initialized it in the previous cell by passing in our Pinecone index and the embedding model.

### **Retrieve**

We want to create an application that takes a user question, searches for documents relevant to that question, passes the retrieved documents and the initial question to a model, and returns an answer.

In [31]:
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})

retrieved_docs = retriever.invoke("Do you know about the Festival Big Bang23?")

len(retrieved_docs)

6
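Conceptually, a similarity retriever with `k=6` embeds the question and returns the `k` stored vectors closest to it. The brute-force sketch below shows the idea with made-up 2-dimensional vectors and a hypothetical `top_k` function. A vector database like Pinecone is built to answer the same query without scanning every vector, so this is only an illustration of the result, not of the implementation.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec, store, k):
    """Return the k (id, score) pairs with the highest cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up 2-dimensional "embeddings"; the real ones in this notebook have 1536 dimensions
store = {"festival": [0.9, 0.1], "metro": [0.1, 0.9], "museum": [0.6, 0.4]}
hits = top_k([1.0, 0.0], store, k=2)
print([doc_id for doc_id, _ in hits])  # ['festival', 'museum']
```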

In [32]:
print(retrieved_docs[0].page_content)

Festival Big Bang 23 October Music Adventure Festival Kids surely intriguing musical voyage discovery involve Description Map BIG BANG MUSIC AND ADVENTURE FESTIVAL FOR KIDS Big Bang international project begin business six partner five different country It project CCB Fábrica da Artes open venue Portuguese artist create new artistic approach music kid see work recognise Portugal throughout Europe Details www ccb pt info 2023 Centro Cultural de Belém Praça Império 1449 003 Lisboa From 20 Oct 2023 21 Oct 2023


### **Generate**

Let’s put it all together into a chain that takes a question, retrieves relevant documents, constructs a prompt, passes that to a model, and parses the output.

In [33]:
# Initializes OpenAI model (ensure the API key is set in the environment)
openai_model = ChatOpenAI(api_key=os.getenv("OPENAI_API_KEY"),
                          model="gpt-4o",
                          temperature=0,
                          max_tokens=None)

In [37]:
prompt = hub.pull("rlm/rag-prompt")

example_messages = prompt.invoke(
    {"context": "filler context", "question": "filler question"}
).to_messages()

print(example_messages[0].content)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: filler question 
Context: filler context 
Answer:


In [38]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | openai_model
    | StrOutputParser()
)

for chunk in rag_chain.stream("Do you know about the Festival Big Bang23?"):
    print(chunk, end="", flush=True)

Yes, the Festival Big Bang 23 is a music and adventure festival for kids. It is an international project involving six partners from five different countries and will take place at the Centro Cultural de Belém in Lisbon from October 20 to October 21, 2023. More details can be found at www.ccb.pt.
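The `|` chain above is essentially function composition: the leading dict sends the same question to two branches, and each later step receives the previous step's output. A plain-Python sketch of that data flow, with stand-in functions in place of the retriever, prompt, and model (all `fake_*` names are hypothetical, and no LangChain code is involved):

```python
def format_docs(docs):
    return "\n\n".join(docs)


def fake_retriever(question):
    # Stand-in for the Pinecone retriever: always returns the same two passages
    return ["Big Bang is a music festival for kids.", "It takes place at CCB."]


def fake_prompt(inputs):
    return f"Question: {inputs['question']}\nContext: {inputs['context']}\nAnswer:"


def fake_model(prompt_text):
    # Stand-in for the chat model: reports how many passages it was given
    context = prompt_text.split("Context: ", 1)[1].rsplit("\nAnswer:", 1)[0]
    return f"(answer based on {len(context.split(chr(10) + chr(10)))} passages)"


def rag(question):
    # Step 1: the dict fans the same input out to both branches
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    # Steps 2 and 3: build the prompt, then call the model
    return fake_model(fake_prompt(inputs))


answer = rag("Do you know about the Festival Big Bang23?")
print(answer)  # (answer based on 2 passages)
```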

### Built-in chains

In [40]:
system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)


question_answer_chain = create_stuff_documents_chain(openai_model, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

In [41]:
response = rag_chain.invoke({"input": "Do you know about the Festival Big Bang23?"})
print(response["answer"])

Yes, the Festival Big Bang 23 is a music and adventure festival for kids. It is an international project that began with six partners from five different countries and is hosted by CCB Fábrica da Artes in Lisbon. The festival will take place from October 20 to October 21, 2023, at the Centro Cultural de Belém.


In [42]:
# Returning the Sources
for document in response["context"]:
    print(document)
    print()

page_content='Festival Big Bang 23 October Music Adventure Festival Kids surely intriguing musical voyage discovery involve Description Map BIG BANG MUSIC AND ADVENTURE FESTIVAL FOR KIDS Big Bang international project begin business six partner five different country It project CCB Fábrica da Artes open venue Portuguese artist create new artistic approach music kid see work recognise Portugal throughout Europe Details www ccb pt info 2023 Centro Cultural de Belém Praça Império 1449 003 Lisboa From 20 Oct 2023 21 Oct 2023' metadata={'chunk': 0.0, 'id': '210_0', 'source': 'https://www.visitlisboa.com/en/events/festival-big-bang-23', 'title': "Festival Big Bang'23"}

page_content='Out Fest 23 OUT FEST Festival Internacional de Música Exploratória Barreiro Barreiro International Festival Exploratory Music On south shore Tejo Barreiro prepare host OUT FEST consider one important experimental music event country Description On south shore Tejo Barreiro prepare host OUT FEST consider one impo
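Printing whole documents is verbose, and several chunks can come from the same page. A minimal sketch of listing each source once, assuming a hypothetical helper `unique_sources` and a small `Doc` stand-in for LangChain's `Document` (the response below is made up in the shape shown above):

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def unique_sources(docs):
    """Return (title, source) pairs in retrieval order, without duplicates."""
    seen, result = set(), []
    for doc in docs:
        pair = (doc.metadata.get("title"), doc.metadata.get("source"))
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


url = "https://www.visitlisboa.com/en/events/festival-big-bang-23"
response = {
    "answer": "(model answer)",
    "context": [
        Doc("chunk a", {"title": "Festival Big Bang'23", "source": url, "chunk": 0.0}),
        Doc("chunk b", {"title": "Festival Big Bang'23", "source": url, "chunk": 1.0}),
    ],
}
sources = unique_sources(response["context"])
print(sources)  # one (title, source) pair, although two chunks were retrieved
```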

We can also customize the prompt:

In [43]:
template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
Always say "thanks for asking!" at the end of the answer.

{context}

Question: {question}

Helpful Answer:"""
custom_rag_prompt = PromptTemplate.from_template(template)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | custom_rag_prompt
    | openai_model
    | StrOutputParser()
)

In [44]:
rag_chain.invoke("What is the French Film Festival 23?")

'The French Film Festival 23 in Lisboa is a celebration of French cinema, featuring a variety of films including premieres of newly produced works from France. Highlights include the opening film "Jeanne du Barry," sessions dedicated to gastronomy and fashion, and opportunities to meet filmmakers. The festival runs from October 5th to 15th at venues like Cinema São Jorge and Cinemateca Portuguesa. Thanks for asking!'
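`PromptTemplate.from_template` uses f-string-style placeholders by default, so plain `str.format` is enough to preview what the model receives. A small sketch with a shortened copy of the template and made-up values:

```python
template = """Use the following pieces of context to answer the question at the end.

{context}

Question: {question}

Helpful Answer:"""

# Made-up values; in the chain, `context` comes from retriever | format_docs
filled = template.format(
    context="The festival runs from October 5th to 15th.",
    question="When is the French Film Festival?",
)
print(filled)
```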