<a href="https://colab.research.google.com/github/YoAkeHotaru/Erdos-Deep-Learning-2024-RAG-Project/blob/main/Pipeline_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

About this notebook
---
- Loading the data.
  - Data cleaning: fix the abbreviated input text by replacing abbreviations and removing punctuation and hashtag terms.
  - Processing the data: adding elements, combining fields, etc.
  - Deciding on chunk size and overlap.

- Embeddings.
  - Create a list of embedding models that work with this pipeline.
  - Put the embeddings into the vector database.
  - Choosing a vector database. -> FAISS
- Query. -> This part will also be used with RAGAS to generate more queries for evaluating the scores.
  - Embedding a query.
- Retrieval.
  - Retrieving texts from the vector database.

- LLM choice and summarization.
  - GPT-3.5 Turbo (`gpt-3.5-turbo`)
  - Generate a summary of the retrieved texts.
- Embedding of the answer into the vector database.


### Measurements
---
- RAGAS





In [None]:
# Necessary packages to load
# LangChain provides the framework for the pipeline
!pip install pyarrow
!pip install langchain
# FAISS vector store, LangChain integrations and the OpenAI tokenizer
!pip install --upgrade --quiet faiss faiss-cpu langchain-community langchain-openai tiktoken
!pip install faiss-gpu
# jq is needed to read the JSON file. NOTE: this may change if the data moves to .parquet
!pip install jq
!pip install langchain-chroma
# Vector Databases
!pip install lancedb chromadb
!pip install langchain-openai
!pip install --upgrade --quiet sentence_transformers
!pip install ragas
!pip install --upgrade --quiet  cohere
# OR  (depending on Python version)
!pip install --upgrade langchain_cohere


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Preprocessing the Data
---
Fixes the abbreviated input text by replacing abbreviations and removing punctuation and hashtag terms.
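
As a minimal sketch of the cleaning step described above, the snippet below expands abbreviations and drops hashtag terms and punctuation. The mapping `toy_mappings` and the helper `expand_abbreviations` are made up for illustration; the notebook loads its real mapping from `abbreviation_mappings.json`.

```python
import re

# Hypothetical mapping for illustration only; the notebook loads the real one from JSON.
toy_mappings = {"idk": "i do not know", "mgr": "manager", "tbh": "to be honest"}


def expand_abbreviations(text: str, mappings: dict) -> str:
    # Drop hashtag terms such as "#retail"
    text = re.sub(r"#\S+", " ", text)
    # Replace every run of characters other than letters, digits and apostrophes with a space
    text = re.sub(r"[^a-zA-Z0-9']+", " ", text).strip()
    # Lowercase each word and swap in the full form when one is known
    words = [mappings.get(word.lower(), word.lower()) for word in text.split()]
    return " ".join(words)


print(expand_abbreviations("IDK what my mgr wants, tbh... #retail", toy_mappings))
```

Words without an entry in the mapping pass through unchanged apart from lowercasing.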


In [None]:
dataset_path = '/content/drive/MyDrive/Erdos2024/AwareProject/dataset'
reddit_path = dataset_path +'/reddit.json'

In [None]:
import json
from pathlib import Path
from pprint import pprint

data = json.loads(Path(reddit_path).read_text())

In [None]:
!pip install cleantext

Collecting cleantext
  Downloading cleantext-1.1.4-py3-none-any.whl (4.9 kB)
Installing collected packages: cleantext
Successfully installed cleantext-1.1.4


In [None]:
import os
import re
import json
from cleantext import clean #Function to remove emojis


abb_path = '/content/drive/MyDrive/Erdos2024/AwareProject/abbreviation_mappings.json'


with open(abb_path, "r") as json_file:
    abbreviation_mappings = json.load(json_file)

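def fix(text):
    """
    Fixes the abbreviated input text by replacing abbreviations and removing punctuation and hashtag terms.

    Parameters:
    text (str): The input abbreviated text to be fixed.

    Returns:
    str: The fixed text.
    """
    # Remove "<3" hearts and hashtag terms. The pattern matches the literal "<3" or a
    # "#" followed by non-space characters; a character class such as [<3#] would
    # also delete every token containing the digit 3.
    text_punc = re.sub(r'<3|#\S+', ' ', text)
    # Replace anything that is not a letter, digit or apostrophe with a space
    text_punc = re.sub(r'[^a-zA-Z0-9\']+', ' ', text_punc).strip()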


    # Split text into words
    words = text_punc.split()

    # Convert words to lowercase
    normalized_words = [word.lower() for word in words]

    # Replace abbreviations with their full forms
    words_fixed = [abbreviation_mappings.get(word, word) for word in normalized_words]

    text_fixed = ' '.join(words_fixed)


    # Return the processed string
    return text_fixed

# Function to remove emojis and correct abbreviations
def clean_text(text):
  text = clean(text)
  text = fix(text)
  return text

In [None]:
# for i in range(len(data)):
#   try:
#     data[i]['reddit_text'] = clean_text(data[i]['reddit_text'])
#   except Exception:
#     print(f"Doesn't work for {i}")

### Saving the preprocessed data

In [None]:
# import os
# preprocessed_path = os.path.join(dataset_path,'preprocessedreddit.json')

# with open(preprocessed_path, "w") as json_file:
#   json.dump(data,json_file)

# Loading Data.

In [None]:
import os
dataset_path = '/content/drive/MyDrive/Erdos2024/AwareProject/dataset'
reddit_path = os.path.join(dataset_path,'preprocessedreddit.json')

In [None]:
# Function that extracts the relevant metadata of each post.

def metadata_func(record: dict, metadata: dict ) -> dict:

  metadata["aware_post_type"] = record.get("aware_post_type")
  metadata["reddit_author"] = record.get("reddit_author")
  metadata["reddit_id"] = record.get("reddit_id")
  # metadata["reddit_link_id"] = record.get("reddit_link_id")

  # metadata["reddit_parent_id"] = record.get("reddit_parent_id")
  if  metadata["aware_post_type"] != 'submission':
    metadata["reddit_submission"] = record.get("reddit_submission")

  # metadata["reddit_submission"] = record.get("reddit_submission")
  metadata["reddit_subreddit"] = record.get("reddit_subreddit")
  # metadata["reddit_title"] = record.get("reddit_title")

  # metadata["reddit_url"] = record.get("reddit_url")


  return metadata


In [None]:
# The JSON file is a top-level list, so the jq schema must start with '.[]'
from langchain_community.document_loaders import JSONLoader
loader = JSONLoader(
    file_path= reddit_path,
    jq_schema='.[]',
    content_key = 'reddit_text',
    metadata_func=metadata_func,
    )

data_lang = loader.load()

### Chunk size and overlap

In [None]:
# Chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=200, add_start_index=True
)
splits = text_splitter.split_documents(data_lang)
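
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on separators such as paragraphs and sentences); the helper `sliding_chunks` is a hypothetical illustration of the overlap idea only.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Return (start_index, chunk) pairs; consecutive chunks share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


toy_text = "abcdefghijklmnopqrstuvwxyz"
for start, chunk in sliding_chunks(toy_text, chunk_size=10, chunk_overlap=4):
    print(start, chunk)
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so text near a boundary appears in both neighbouring chunks.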

# Embeddings into VDB
---

## Embedding Models
---
This section lists the Hugging Face embedding models used in the pipeline, loaded through `sentence_transformers`.

In [None]:
from sentence_transformers import SentenceTransformer

model_names = ["all-MiniLM-L6-v2", "nq-distilbert-base-v1", "thenlper/gte-large"]

# model_name = model_names[2]
# embedding = SentenceTransformer(model_name)

## Vector DataBase
- FAISS
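
For intuition, similarity search over a vector store can be sketched as a brute-force nearest-neighbour lookup. The snippet below uses made-up 3-dimensional vectors and squared Euclidean distance; it is a simplified stand-in for what an exact (flat) index computes, not FAISS's actual implementation, and all names in it are hypothetical.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    # Squared Euclidean distance between two vectors of equal length
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query, vectors, k):
    """Return the k (text, distance) pairs closest to the query, nearest first."""
    scored = [(text, squared_l2(query, vec)) for text, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy "embeddings"; real models produce vectors with hundreds of dimensions.
toy_vectors = {
    "chai latte recipe": [0.9, 0.1, 0.0],
    "paid time off policy": [0.0, 0.8, 0.3],
    "steamed milk tips": [0.7, 0.2, 0.1],
}
toy_query = [1.0, 0.0, 0.0]
print(brute_force_search(toy_query, toy_vectors, k=2))
```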

### FAISS

In [None]:
!pip install faiss-gpu



In [None]:
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings.sentence_transformer import (
    SentenceTransformerEmbeddings,
)
# embeddings = SentenceTransformerEmbeddings(model_name=model_name)




In [None]:
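# Build the FAISS index from the chunks. Embedding ~400k chunks is slow, so run this
# once and afterwards load the saved index from disk.
embeddings = SentenceTransformerEmbeddings(model_name=model_names[2])
db = FAISS.from_documents(splits, embeddings)
print(db.index.ntotal)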

404919


In [None]:

# query = 'How to make Chai Latte???'
# docs = db.similarity_search(query, k=30)

# # print results
# print(docs[1].page_content)

1 steam 2 milk 2 pump chai 2 for short tall grande venti fill cup halfway with hot water 4 fill remaining half with steamed milk https sbuxdates com is a great site for checking recipes


In [None]:
# db = FAISS.from_documents(splits, embeddings)
# print(db.index.ntotal)
# query = 'How to make Chai Latte???'
# docs = db.similarity_search(query, k=30)

# # print results
# print(docs[1].page_content)

Chai tea latte with oat milk and cinnamon steamed in.



## Adding Metadata to database.
---
It is also possible to add documents to the database under IDs derived from their metadata, so that they can be looked up afterwards. Each ID is a SHA-1 hash of the document's metadata.
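
As a small sketch of why a metadata hash is a stable ID: serialising the metadata with sorted keys gives the same string, and therefore the same hash, regardless of the order in which the keys were inserted. The metadata values below are hypothetical.

```python
import hashlib
import json


def metadata_hash(metadata: dict) -> str:
    # sort_keys=True makes the JSON string, and hence the hash, independent of key order
    return hashlib.sha1(json.dumps(metadata, sort_keys=True).encode()).hexdigest()


a = {"reddit_id": "abc123", "reddit_subreddit": "starbucks"}
b = {"reddit_subreddit": "starbucks", "reddit_id": "abc123"}
print(metadata_hash(a) == metadata_hash(b))  # same metadata, different key order
print(len(metadata_hash(a)))                 # SHA-1 hex digests are 40 characters long
```

Two documents with identical metadata get identical hashes, which is why the IDs are de-duplicated before they are passed to the vector store.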

In [None]:
import hashlib
import json
from langchain_core.documents import Document

def stable_hash(doc: Document) -> str:
    """
    Stable hash document based on its metadata.
    """
    return hashlib.sha1(json.dumps(doc.metadata, sort_keys=True).encode()).hexdigest()

# this function comes from ChatGPT3.5
def ensure_unique_ids(ids):
    unique_ids = []
    seen = set()
    for id in ids:
        if id not in seen:
            unique_ids.append(id)
            seen.add(id)
        else:
            # Append a suffix to make it unique
            suffix = 1
            new_id = f"{id}_{suffix}"
            while new_id in seen:
                suffix += 1
                new_id = f"{id}_{suffix}"
            unique_ids.append(new_id)
            seen.add(new_id)
    return unique_ids



split_ids = list(map(stable_hash, splits))
unique_split_ids = ensure_unique_ids(split_ids)
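# NOTE: if `db` was built with FAISS.from_documents(splits, ...), this call adds every
# chunk a second time, so searches return duplicated hits. Passing
# ids=unique_split_ids to FAISS.from_documents instead assigns the ids at build time.
db.add_documents(splits, ids=unique_split_ids)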


['9fc3ebdec37a8d1a81d7d110b1996072bb4887ae',
 '7cb4ded394921b1486aeeeb3db7b9c6bc9feebfb',
 'ed948b9b2453f9a30e05876248b5099a2ec6446e',
 '9aa8d3381e6ffd8ef9e52fe12fff301f227ab2ea',
 '2f86dc0aab133d18c26160dcd8327a0f53f2145f',
 'e839e01d57096eba78faf7c3a72dd3d2116ef38f',
 'f48ad64ba94cdee254dd34a21f90239feddb5044',
 'f84cd48a39b27bb14dd6204a0c5415f539e474c1',
 '9740efe74de01c017b9f364b3cbb6c9abdfec0e1',
 'da4f20ae138dd06401c2b80bc65c58d644c87fc2',
 '35886e746d64d5a2e27cddc6af0256a74bb00bce',
 '398c879bd964404a87f60c11a1adc38effe9de96',
 '8b86bd736902d7dbe67577d1d32ac94826ba4dff',
 'c473f07832fe3fbc4d12487d2177023eea35e2a9',
 '4369b48ba0a6f3754f6fca25dabe86847b65d823',
 '6475003ad374e6486ec69e9e737ec31fcacf0de7',
 'd3c6cc06084190751975d2842271ec642386569c',
 '0f5f2795f92db20c2c22189adb0a4e3962c478f8',
 '36ce7007847dcd1cec99c5abac727821f9ab5e1c',
 'be4aa65a3b9f5c24f8f5671cf1d816a6352d97fa',
 'fb1d587a3cb8392d3de3fee225ec854b77a75cf7',
 '7bdb37a6956d443094623eed95e02b7d18d7956f',
 'b872cb75

## Saving the Embeddings

In [None]:
db.save_local(folder_path='/content/drive/MyDrive/Erdos2024/AwareProject/dataset_new', index_name='preemb3')

## Loading the Embeddings

In [None]:
!pip install faiss-gpu



In [None]:
db1 = FAISS.load_local('/content/drive/MyDrive/Erdos2024/AwareProject/dataset_new',index_name='Emb1',allow_dangerous_deserialization=True, embeddings=SentenceTransformerEmbeddings(model_name=model_names[0]))
predb1 = FAISS.load_local('/content/drive/MyDrive/Erdos2024/AwareProject/dataset_new',index_name='preemb1',allow_dangerous_deserialization=True, embeddings=SentenceTransformerEmbeddings(model_name=model_names[0]))

In [None]:
db2 = FAISS.load_local('/content/drive/MyDrive/Erdos2024/AwareProject/dataset_new',index_name='Emb2',allow_dangerous_deserialization=True, embeddings=SentenceTransformerEmbeddings(model_name=model_names[1]))
# Index name assumed to be 'preemb2'; 'preemb1' belongs to model_names[0], whose embedding size differs
predb2 = FAISS.load_local('/content/drive/MyDrive/Erdos2024/AwareProject/dataset_new',index_name='preemb2',allow_dangerous_deserialization=True, embeddings=SentenceTransformerEmbeddings(model_name=model_names[1]))


In [None]:
db3 = FAISS.load_local('/content/drive/MyDrive/Erdos2024/AwareProject/dataset_new',index_name='Emb3',allow_dangerous_deserialization=True, embeddings=SentenceTransformerEmbeddings(model_name=model_names[2]))
predb3 = FAISS.load_local('/content/drive/MyDrive/Erdos2024/AwareProject/dataset_new',index_name='preemb3',allow_dangerous_deserialization=True, embeddings=SentenceTransformerEmbeddings(model_name=model_names[2]))


In [None]:
data_bases = [db1, db2, db3]
data_basespre = [predb1, predb2, predb3]

## LLMs and Answer Generation
---
- Candidate LLMs:
  - GPT-3.5 Turbo
  - Llama-2
  - Mistral
  - ...

### With OpenAI API
- GPT-3.5 Turbo (`gpt-3.5-turbo`)

In [None]:
# Import Colab Secrets userdata module
from google.colab import userdata

import os

# Read the API keys from Colab Secrets and expose them as environment variables
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
os.environ["HF_TOKEN"] = userdata.get('HF_TOKEN')
os.environ["COHERE_API_KEY"] = userdata.get('COHERE_API_KEY')

In [None]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.0)
# retriever = new_db.as_retriever(search_kwargs={"k": 30})
retriever_base = db1.as_retriever(search_kwargs={"k": 30})



### Reranking with Cohere


In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import CohereEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter

from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_community.llms import Cohere


compressor = CohereRerank(model="rerank-english-v3.0", top_n = 10)
# retriever_reranked = ContextualCompressionRetriever(
#     base_compressor=compressor, base_retriever=retriever_base
# )

In [None]:
def retrievers(db):
  """Returns [base retriever (top 30 by similarity), reranked retriever (Cohere keeps the top 10 of those)]."""
  retriever_base = db.as_retriever(search_kwargs={"k": 30})
  retriever_reranked = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever_base)
  return [retriever_base, retriever_reranked]
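
The retrieve-then-rerank pattern used here can be sketched without any external service. A cheap first-stage score keeps `k` candidates, then a second score reorders them and keeps `top_n`. Both scorers below are hypothetical word-overlap stand-ins chosen for illustration; Cohere's reranker is a learned model and behaves differently.

```python
def retrieve_then_rerank(query, documents, first_stage_score, rerank_score, k=30, top_n=10):
    """Keep the k best documents by the cheap score, then reorder them with the
    second score and return the top_n."""
    candidates = sorted(documents, key=lambda d: first_stage_score(query, d), reverse=True)[:k]
    return sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)[:top_n]


# Hypothetical scorers: raw word overlap, then overlap normalised by document length.
def word_overlap(query, doc):
    return len(set(query.lower().split()) & set(doc.lower().split()))


def overlap_ratio(query, doc):
    return word_overlap(query, doc) / max(len(doc.split()), 1)


toy_docs = [
    "chai latte with oat milk",
    "how to make a chai latte at home with tea bags and milk",
    "walmart paid time off",
    "make chai latte",
]
print(retrieve_then_rerank("how to make chai latte", toy_docs, word_overlap, overlap_ratio, k=3, top_n=2))
```

The second stage can change the order of the first-stage candidates, but it can never bring back a document that the first stage dropped.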


In [None]:
from langchain_core.prompts import ChatPromptTemplate

template = """
You are an assistant for question-answering tasks.
Given the following extracted parts of a long document and a question, create a final answer with references ("SOURCES").
If you don't know the answer, just say that you don't know. Don't try to make up an answer.
ALWAYS return a "SOURCES" part in your answer.

QUESTION: {question}
=========
{source_documents}
=========
FINAL ANSWER: """
prompt = ChatPromptTemplate.from_template(template)

In [None]:
from typing import List

from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser


def format_docs(docs1) -> str:
    return "\n\n".join(
        f"Content: {doc.page_content}\n Source: {doc.metadata['reddit_subreddit']}" for doc in docs1
    )


rag_chain_from_docs = (
    RunnablePassthrough.assign(
        source_documents=(lambda x: format_docs(x["source_documents"]))
    )
    | prompt
    | llm
    | StrOutputParser()
)



def rag_chains(retriever_base, retriever_reranked):

  rag_chain_base = RunnableParallel(
      {
          "source_documents": retriever_base,
          "question": RunnablePassthrough(),
      }
  ).assign(answer=rag_chain_from_docs)


  rag_chain_rerank = RunnableParallel(
      {
          "source_documents": retriever_reranked,
          "question": RunnablePassthrough(),
      }
  ).assign(answer=rag_chain_from_docs)
  return rag_chain_base, rag_chain_rerank
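
To see what the LLM actually receives, the sketch below fills a shortened version of the prompt by hand. `ToyDoc` is a hypothetical stand-in for a LangChain `Document`, and `toy_template` keeps only the placeholder part of the notebook's template; the document texts are invented.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    # Stand-in for a LangChain Document: just text plus a metadata dict
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_toy_docs(docs) -> str:
    # Same layout as format_docs: one "Content / Source" block per document
    return "\n\n".join(
        f"Content: {doc.page_content}\n Source: {doc.metadata['reddit_subreddit']}" for doc in docs
    )


toy_template = "QUESTION: {question}\n=========\n{source_documents}\n=========\nFINAL ANSWER: "

toy_sources = [
    ToyDoc("steam milk, add chai pumps", {"reddit_subreddit": "starbucksbaristas"}),
    ToyDoc("chai latte with oat milk", {"reddit_subreddit": "starbucks"}),
]
filled_prompt = toy_template.format(
    question="How to make Chai Latte?",
    source_documents=format_toy_docs(toy_sources),
)
print(filled_prompt)
```

Because only the subreddit name is passed as the source, the model can cite nothing more specific than a subreddit in its "SOURCES" section.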

## Question and Answer examples

In [None]:
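# Build the base and reranked chains before querying (db1 is an assumption; any loaded index works)
rag_chain_base, rag_chain_rerank = rag_chains(*retrievers(db1))

question = "How to make Chai Latte??"
response = rag_chain_base.invoke(question)
answer = response["answer"]
answer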

"To make a hot Chai Latte, you typically steep chai tea bags in water, then add milk, sweetener (such as vanilla syrup), and optional spices like cinnamon. It's important to follow the correct recipe to ensure the best flavor. If you prefer a different variation, you can customize your order at Starbucks by specifying your preferences to the barista. \n\nSOURCES: starbucks, starbucksbaristas"

In [None]:
question = "How to make Chai Latte??"
response = rag_chain_rerank.invoke(question)
answer = response["answer"]
answer

'To make a Chai Latte, you can start by brewing chai tea with oat milk and cinnamon steamed in. For a hot version, you can add 2-3 pumps of vanilla and no water. For an iced version, you can try adding soy milk and 3 espresso shots. If you are a newbie and want to learn how to make a hot Chai latte, you can refer to Starbucks for guidance.\n\nSOURCES: starbucks, starbucksbaristas'

In [None]:
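# Gtelarge
# rag_chain is assumed to be the base (non-reranked) chain over db3 (thenlper/gte-large)
rag_chain = rag_chains(*retrievers(db3))[0]
question = "How to make Chai Latte??"
response = rag_chain.invoke(question)
answer = response["answer"]
answer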

'To make a Chai Latte, you can follow these steps:\n1. Steam 2% milk\n2. Pump chai: 2/3/4/5 for short/tall/grande/venti\n3. Fill cup halfway with hot water\n4. Fill remaining half with steamed milk\n\nFor a homemade version, you can try:\n1. Boil water in a pot\n2. Add spices and let simmer on medium-high heat for 5-15 minutes\n3. Add tea and let sit on low heat for 2-5 minutes\n4. Add milk, honey, warm back up\n5. Strain and enjoy!\n\nFor a unique twist, you can try adding apple to your Chai Latte or order a Brown Sugar Oatmilk Shaken Espresso and add chai to it. You can also customize your Chai Latte with different flavors like vanilla, brown sugar syrup, or cinnamon.\n\nSources: starbucks, starbucksbaristas, sbuxdates.com, Chai Box Instagram.'

In [None]:
# nq-distilbert-base-v1
# question = "How to make Chai Latte??"
# response = rag_chain.invoke(question)
# answer = response["answer"]
# answer

'To make a Chai Latte, you can follow these steps:\n1) Boil water in a pot and add spices, let simmer for 5-15 minutes.\n2) Add tea and let it sit on low heat for 2-5 minutes.\n3) Add milk, honey, warm it back up, strain, and enjoy!\nAlternatively, you can use a Chai concentrate like the one sold by Chai Box on Instagram. For a hot Chai Latte, steam 2% milk, pump chai syrup according to size, fill half the cup with hot water, and the remaining half with steamed milk. You can also try adding brown sugar and oat milk for an iced Chai Latte. If you prefer a spicier flavor, consider adding more chai concentrate or cinnamon. Adding apple juice or making a Lavender Oatmilk Chai are also delicious options. For more recipes and ideas, you can visit sbuxdates.com.\n\nSOURCES: starbucks, starbucksbaristas, sbuxdates.com'

In [None]:
# EMBD1
# question = "How to make Chai Latte??"
# response = rag_chain.invoke(question)
# answer = response["answer"]
# answer

"To make a hot Chai Latte, you typically steep chai tea bags in water, then add milk, sweetener (such as vanilla syrup), and optional spices like cinnamon. The drink can be customized with different types of milk and additional flavorings. It's important to note that the preparation may vary depending on the location and personal preferences. \n\nSOURCES: starbucks, starbucksbaristas"

In [None]:
# GteLarge
question2 = "where is disneyland and how can I go there???"
response2 = rag_chain.invoke(question2)
answer2 = response2["answer"]
answer2

'Disneyland is located in Anaheim, California. To get there, you can fly into nearby airports such as John Wayne Airport (SNA) or Los Angeles International Airport (LAX) and then take a shuttle, taxi, or rental car to the park. Additionally, there are public transportation options available. It is recommended to check the official Disneyland website for the most up-to-date information on transportation options.\n\nSOURCES: Disneyland official website'

In [None]:
question2 = "where is disneyland and how can I go there???"
response2 = rag_chain_base.invoke(question2)
answer2 = response2["answer"]
answer2

'Disneyland is located in Anaheim, California. To get there, you can fly into nearby airports such as John Wayne Airport (SNA) or Los Angeles International Airport (LAX) and then take a shuttle, taxi, or rental car to the park. Additionally, there are public transportation options available. \n\nSOURCES: Disneyland'

In [None]:
question2 = "where is disneyland and how can I go there???"
response2 = rag_chain_rerank.invoke(question2)
answer2 = response2["answer"]
answer2

'Disneyland is located in California. To get there, you can fly into nearby airports such as Los Angeles International Airport (LAX) or John Wayne Airport (SNA) and then take a shuttle, taxi, or rental car to the Disneyland Resort. Additionally, there are public transportation options available. For specific directions and transportation options, it is recommended to visit the official Disneyland website or contact their guest services for more information.\n\nSOURCES: Disneyland Website'

In [None]:
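# nq-distilbert-base-v1
# rag_chain is assumed to be the base (non-reranked) chain over db2 (nq-distilbert-base-v1)
rag_chain = rag_chains(*retrievers(db2))[0]
question2 = "where is disneyland and how can I go there???"
response2 = rag_chain.invoke(question2)
answer2 = response2["answer"]
answer2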

"Disneyland is located in Anaheim, California. To get there, you can park at Mickey/Minnie/Pixar Pals, walk to Disneyland hotel, then either walk through Downtown Disney or take the monorail (if it's running) to reach the park. Additionally, you can use the Disneyland app for information and updates on the park. (SOURCES: Disneyland)"

In [None]:
response2.keys()

dict_keys(['source_documents', 'question', 'answer'])

In [None]:
# question2 = "where is disneyland and how can I go there???"
# response2 = rag_chain.invoke(question2)
# answer2 = response2["answer"]
# answer2

"Disneyland is located in Anaheim, California. To get there, you can park at Mickey/Minnie/Pixar Pals, walk to Disneyland hotel, then either walk through Downtown Disney or take the monorail (if it's running). You can also use the Disneyland app for information and updates on the park. (SOURCES: Disneyland)"

In [None]:
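# GteLarge
# rag_chain is assumed to be the base (non-reranked) chain over db3 (thenlper/gte-large)
rag_chain = rag_chains(*retrievers(db3))[0]
question3 = "What is the cheapest product in walmart, and/or what is the best one???"
response3 = rag_chain.invoke(question3)
answer3 = response3["answer"]
answer3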

"I don't have enough information to determine the cheapest product in Walmart or the best one. \n\nSOURCES: Walmart, Target, Disneyland, Lowes, Fedexers, Amazon, DisneyWorld"

In [None]:
# nq-distilbert-base-v1
# question3 = "What is the cheapest product in walmart, and/or what is the best one???"
# response3 = rag_chain.invoke(question3)
# answer3 = response3["answer"]
# answer3

"I don't know the answer to the question about the cheapest product in Walmart or the best one. \n\nSOURCES: Bestbuy, UPSers, DisneyWorld, CVS, Target"

In [None]:
# question3 = "What is the cheapest product in walmart, and/or what is the best one???"
# response3 = rag_chain.invoke(question3)
# answer3 = response3["answer"]
# answer3

"I don't have enough information to determine the cheapest product in Walmart or the best one based on the provided content. \n\nSOURCES: N/A"

In [None]:
# Gte Large
question4 = "What is in walmart and do they have starbucks inside???"
response4 = rag_chain.invoke(question4)
answer4 = response4["answer"]
answer4

"Walmart does not have Starbucks inside. However, there are some Walmarts that have Starbucks locations. It is more common to find Starbucks inside Target and Giant stores. Additionally, a new grocery store with a Starbucks recently opened near a customer's house. It is important to note that Walmart did not buy Starbucks, as there was a false rumor about it. \n\nSOURCES: starbucks, CVS, walmart, Target, starbucksbaristas"

In [None]:
# nq-distilbert-base-v1
# question4 = "What is in walmart and do they have starbucks inside???"
# response4 = rag_chain.invoke(question4)
# answer4 = response4["answer"]
# answer4

'Walmart does not have Starbucks inside. However, other stores like Target, CVS, and Giant do have Starbucks locations within them. Starbucks does not carry ice cream, so they cannot make affogatos. \n\nSOURCES: starbucks, Disneyland, Target, CVS, starbucksbaristas'

In [None]:
# question4 = "What is in walmart and do they have starbucks inside???"
# response4 = rag_chain.invoke(question4)
# answer4 = response4["answer"]
# answer4

'Walmart does not have Starbucks inside. However, other stores like Target and Giant do have Starbucks inside. It seems that there may have been confusion regarding the presence of Starbucks in Walmart. (SOURCES: starbucks, Target, Giant)'

In [None]:
question5 = "What is Turkey ???"
docs = db.similarity_search(question5, k=30)
for i in range(20):
  print(docs[i])

page_content='turkish? lol' metadata={'source': '/content/drive/MyDrive/Erdos2024/AwareProject/dataset/reddit.json', 'seq_num': 142547, 'aware_post_type': 'comment', 'reddit_author': 'a32m50', 'reddit_id': 'hahi2ov', 'reddit_submission': 'pcakhq', 'reddit_subreddit': 'TalesFromYourBank', 'start_index': 0}
page_content='turkish? lol' metadata={'source': '/content/drive/MyDrive/Erdos2024/AwareProject/dataset/reddit.json', 'seq_num': 142547, 'aware_post_type': 'comment', 'reddit_author': 'a32m50', 'reddit_id': 'hahi2ov', 'reddit_submission': 'pcakhq', 'reddit_subreddit': 'TalesFromYourBank', 'start_index': 0}
page_content='Dang, and I thought 1 turkey was scary 😳' metadata={'source': '/content/drive/MyDrive/Erdos2024/AwareProject/dataset/reddit.json', 'seq_num': 94951, 'aware_post_type': 'comment', 'reddit_author': 'keeganvzw', 'reddit_id': 'j93byem', 'reddit_submission': '115dfvc', 'reddit_subreddit': 'Fedexers', 'start_index': 0}
page_content='Dang, and I thought 1 turkey was scary 😳' m

In [None]:
# Gte Large
question5 = "What is Turkey ???"
response5 = rag_chain.invoke(question5)
answer5 = response5["answer"]
answer5

'Turkey is a type of bird commonly associated with Thanksgiving meals. It is often consumed as a main dish, such as roasted or fried turkey. Additionally, turkey legs are a popular food item at amusement parks like Disneyland and DisneyWorld. Some people find turkeys to be intimidating or annoying, while others enjoy eating them. Overall, turkeys play a significant role in American culture and cuisine.\n\nSOURCES: TalesFromYourBank, Fedexers, DisneyWorld, Disneyland, walmart'

In [None]:
# nq-distilbert-base-v1
# question5 = "What is Turkey ???"
# response5 = rag_chain.invoke(question5)
# answer5 = response5["answer"]
# answer5

"I don't know the answer to the question as the provided content does not contain information about Turkey. \n\nSOURCES: N/A"

In [None]:
# question5 = "What is Turkey ???"
# response5 = rag_chain.invoke(question5)
# answer5 = response5["answer"]
# answer5

'Turkey is a type of meat commonly associated with popular theme parks like DisneyWorld and Disneyland, where turkey legs are a popular snack item. Additionally, there are references to cooking turkey and humorous comments about turkeys in various contexts. Overall, Turkey is a type of meat that is consumed by many people. \n\nSOURCES: DisneyWorld, Disneyland, Walmart, TalesFromYourBank, Fedexers'

In [None]:
# GteLarge
question6 = "What is Walmart???"
response6 = rag_chain.invoke(question6)
answer6 = response6["answer"]
answer6

'Walmart is a multinational retail corporation known for its wide range of products and services. It is a popular destination for shopping and is known for its affordable prices. \n\nSOURCES: walmart'

In [None]:
# nq-distilbert-base-v1
# question6 = "What is Walmart???"
# response6 = rag_chain.invoke(question6)
# answer6 = response6["answer"]
# answer6

'Walmart is a retail corporation known for its chain of hypermarkets, discount department stores, and grocery stores. It is a major player in the retail industry, offering a wide range of products at competitive prices. \n\nSOURCES: walmart'

In [None]:
# question6 = "What is Walmart???"
# response6 = rag_chain.invoke(question6)
# answer6 = response6["answer"]
# answer6

'Walmart is a multinational retail corporation known for its wide range of products and low prices. It is a popular destination for shopping for various items such as groceries, clothing, electronics, and more. \n\nSOURCES: \n- Walmart corporate website\n- Wikipedia page on Walmart'

## Manual test set

In [None]:
def ret2rag(db):
  retdb, retrerankdb = retrievers(db)
  rag_chain_base, rag_chain_rerank = rag_chains(retdb,retrerankdb)
  return rag_chain_base, rag_chain_rerank

In [None]:
data_bases = [db1, db2, db3]
data_basespre = [predb1, predb2, predb3]



rag_chain_basedb1, rag_chain_rerankdb1 = ret2rag(db1)

rag_chain_basedb2, rag_chain_rerankdb2 = ret2rag(db2)
rag_chain_basedb3, rag_chain_rerankdb3 = ret2rag(db3)

rag_chain_basepredb1, rag_chain_rerankpredb1 = ret2rag(predb1)
rag_chain_basepredb2, rag_chain_rerankpredb2 = ret2rag(predb2)
rag_chain_basepredb3, rag_chain_rerankpredb3 = ret2rag(predb3)


base_rag = [rag_chain_basedb1,rag_chain_basedb2,rag_chain_basedb3]

rerank_rag = [rag_chain_rerankdb1,rag_chain_rerankdb2,rag_chain_rerankdb3]

basepre_rag = [rag_chain_basepredb1,rag_chain_basepredb2,rag_chain_basepredb3]

rerankpre_rag = [rag_chain_rerankpredb1,rag_chain_rerankpredb2,rag_chain_rerankpredb3]


In [None]:
from datasets import Dataset


In [None]:
questions = [
    # 'What is walmart?',
    # 'How to make Chai Latte?',
    "Do employees have opinions of their managers at Tjmaxx? How do employees get along with them?",
    "Can I wear earbuds at work in Walmart?",
    "What to do if you injured yourself during your delivery job?",
    "Do you get paid time off at Walmart?",
    "How to handle inappropriate customer instructions at Starbucks?",
]

ground_truth = [
    # 'Walmart Inc. is an American multinational retail corporation that operates a chain of hypermarkets, discount department stores, and grocery stores in the United States, headquartered in Bentonville, Arkansas.',
    # 'Heat the milk and steep the tea. Let the milk warm up in a small saucepan over medium heat. Once it simmers, turn it off and add in the tea bags for 4 to 5 minutes. Then remove the tea bags. Add the spices. Turn the heat back to medium heat, and add the cinnamon, ginger, cloves, cardamom, vanilla, and maple syrup. Whisk it all together until its perfectly hot. Top the mugs off! You can drink the chai latte as is, or top it off with extra froth and a dash of cinnamon.',
    "Employees do have opinions of their managers. For example, some managers' attitudes toward their employees change from day to day. It is suggested that they can be condescending and biased. On the other hand, many employees really like their managers. Usually such managers display fairness and patience during conversations. Overall, if you do not like your manager, be civil with them and remember that they are also human and can be stressed and have emotions.",
    "There is no consensus on whether this is prohibited or not. It depends on the department one works in. It is suggested that it is ok to wear a single earbud so long as one remains aware of the surrounding environment.",
    "You should go to the UPS doctor to talk about your injury, and they will check on you and decide what to do next. However, it is suggested that sometimes it is more useful to get proof of the injury from your own doctor. Some people suggested going to a lawyer straightaway to handle the issue. The employees do not seem to expect much from UPS in handling injury issues.",
    "There is no policy stating that employees don't get paid time off. Although some managers grant paid time off (PTO) or protected paid time off (PPTO), it is likely that most managers will not grant paid time off due to 'no coverage' issues.",
    "Customers can have various instructions on how their drink should be made. It is important that, as a barista, you also follow the rules for making drinks. Usually small additional instructions can be fulfilled, but definitely appeal to your manager if you are unsure about how to handle the instructions.",
]

In [None]:
from datasets import Dataset  # needed for Dataset.from_dict below
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)

def answer_context(rag_chain, retriever):
  """Answer every question with `rag_chain` and score the results with RAGAS.

  Uses the notebook-level `questions` and `ground_truth` lists.
  """
  answers  = []
  contexts = []

  # Pass each question through the chain and keep the retrieved passages as its contexts
  for question in questions:
      answers.append(rag_chain.invoke(question)['answer'])
      contexts.append([doc.page_content for doc in retriever.get_relevant_documents(question)])

  # Preparing the dataset
  data = {
      "question": questions,
      "answer": answers,
      "contexts": contexts,
      "ground_truth": ground_truth
  }

  dataset = Dataset.from_dict(data)


  result = evaluate(
      dataset=dataset,
      metrics=[
          context_precision,
          context_recall,
          faithfulness,
          answer_relevancy,
      ],
  )

  df = result.to_pandas()

  return df
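
`answer_context` reads `questions` and `ground_truth` from the notebook's global scope and does not check that the two lists have the same length before calling the chain for every question. A minimal sketch of the record-building step with both lists passed in explicitly. `build_eval_records`, `answer_fn`, and `retrieve_fn` are hypothetical names introduced here; the two callables stand in for `rag_chain.invoke(question)['answer']` and the retriever call.

```python
from typing import Callable, Dict, List


def build_eval_records(
    questions: List[str],
    ground_truth: List[str],
    answer_fn: Callable[[str], str],
    retrieve_fn: Callable[[str], List[str]],
) -> Dict[str, list]:
    """Collect answers and contexts in the column layout used by answer_context."""
    # Fail early, before any (possibly paid) model call is made.
    if len(questions) != len(ground_truth):
        raise ValueError(
            f"{len(questions)} questions but {len(ground_truth)} ground-truth answers"
        )

    answers = []
    contexts = []
    for question in questions:
        answers.append(answer_fn(question))
        contexts.append(list(retrieve_fn(question)))

    return {
        "question": list(questions),
        "answer": answers,
        "contexts": contexts,
        "ground_truth": list(ground_truth),
    }


# Stand-ins for the chain and retriever, only to show the call shape.
records = build_eval_records(
    ["Do you get paid time off at Walmart?"],
    ["There is no policy stating that employees don't get paid time off."],
    answer_fn=lambda q: "stub answer",
    retrieve_fn=lambda q: ["stub passage 1", "stub passage 2"],
)
print(records["contexts"])
```

In the notebook, the returned dict would be passed to `Dataset.from_dict` in the same way as `data` in the function above.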


In [None]:
df1 =  answer_context(rag_chain_basedb1, retrievers(db1)[0])
# df2 =  answer_context(rag_chain_basedb2, retrievers(db2)[0])
df3 =  answer_context(rag_chain_basedb3, retrievers(db3)[0])

df1re = answer_context(rag_chain_rerankdb1, retrievers(db1)[1])
# df2re = answer_context(rag_chain_rerankdb2, retrievers(db2)[1])
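# NOTE: despite the "2" in its name, df2re holds the reranked db3 results (saved later as df3re.csv)
df2re = answer_context(rag_chain_rerankdb3, retrievers(db3)[1])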

In [None]:
df1pre =  answer_context(rag_chain_basepredb1, retrievers(predb1)[0])
# df2pre =  answer_context(rag_chain_basepredb2, retrievers(predb2)[0])
# df3pre =  answer_context(rag_chain_basepredb3, retrievers(predb3)[0])

df1repre = answer_context(rag_chain_rerankpredb1, retrievers(predb1)[1])
# df2repre = answer_context(rag_chain_rerankdb2, retrievers(predb1)[1])
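# NOTE: this pairs the db3 rerank chain with the predb1 retriever, so the scored contexts come from predb1; a predb3 chain/retriever pair may have been intended
df2repre = answer_context(rag_chain_rerankdb3, retrievers(predb1)[1])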


In [None]:
df3pre

Unnamed: 0,question,answer,contexts,ground_truth,context_precision,context_recall,faithfulness,answer_relevancy
0,Do employees have opinions of their mangers at...,Employees at Tjmaxx have varying opinions abou...,[i take it you work there besides pay what els...,Employees do have opinions of their managers. ...,0.987875,0.333333,1.0,0.952166
1,Can I wear earbuds at work in Walmart?,"Yes, you can wear earbuds at work in Walmart a...",[i do not work at walmart anymore but i pretty...,There is no consensus on whether this is prohi...,0.785415,0.0,1.0,0.982825
2,What to do if you injured yourself during your...,If you injured yourself during your delivery j...,[follow the methods that means just refuse to ...,You should go to the UPS doctor to talk about ...,0.246532,0.75,1.0,0.993618
3,Do you get paid time off at Walmart?,"Yes, Walmart does not provide paid time off. E...",[no walmart is not doing away with unpaid time...,There is no policy stating that employees don'...,0.683724,0.0,0.714286,0.967193
4,How to handle inappropriate customer instructi...,To handle inappropriate customer instructions ...,[definitely need to get that attitude under co...,Customers can have various instructions on how...,0.361706,0.0,1.0,0.979235


In [None]:
df3

Unnamed: 0,question,answer,contexts,ground_truth,context_precision,context_recall,faithfulness,answer_relevancy
0,Do employees have opinions of their mangers at...,Employees at TjMaxx have varying opinions abou...,"[I take it you work there. Besides pay, what e...",Employees do have opinions of their managers. ...,0.964294,1.0,1.0,0.844519
1,Can I wear earbuds at work in Walmart?,"Yes, you can wear earbuds at work in Walmart. ...",[I don't work at Walmart anymore but I pretty ...,There is no consensus on whether this is prohi...,0.791839,0.0,0.8,0.894335
2,What to do if you injured yourself during your...,If you injured yourself during your delivery j...,[Follow the methods. That means just refuse to...,You should go to the UPS doctor to talk about ...,0.658498,0.75,1.0,0.993618
3,Do you get paid time off at Walmart?,"Yes, Walmart does offer time off options, incl...",[This is about the most ludicrous thing I've e...,There is no policy stating that employees don'...,0.657662,0.0,0.833333,0.950244
4,How to handle inappropriate customer instructi...,To handle inappropriate customer instructions ...,[Definitely need to get that attitude under co...,Customers can have various instructions on how...,0.318609,1.0,1.0,0.981694


In [None]:
# df2repre was produced with the db3 rerank chain, so it is displayed under the name df3repre
df3repre = df2repre
df3repre

Unnamed: 0,question,answer,contexts,ground_truth,context_precision,context_recall,faithfulness,answer_relevancy
0,Do employees have opinions of their mangers at...,Employees at TjMaxx have varying opinions of t...,[that is not how tjmaxx operates the employee ...,Employees do have opinions of their managers. ...,0.880258,0.166667,1.0,0.94913
1,Can I wear earbuds at work in Walmart?,"Yes, Walmart's dress code allows for one earbu...",[i do not work at walmart anymore but i pretty...,There is no consensus on whether this is prohi...,0.642758,1.0,0.0,0.975375
2,What to do if you injured yourself during your...,If you injured yourself during your delivery j...,[have you spoken to anyone at sedgwick or hr t...,You should go to the UPS doctor to talk about ...,0.196429,0.25,1.0,0.989903
3,Do you get paid time off at Walmart?,"No, Walmart does not provide paid time off. Ma...",[no walmart is not doing away with unpaid time...,There is no policy stating that employees don'...,0.766667,0.0,0.75,0.967193
4,How to handle inappropriate customer instructi...,To handle inappropriate customer instructions ...,[100 this i am a nice friendly barista and lov...,Customers can have various instructions on how...,1.0,0.0,1.0,0.983967


In [None]:
df1

Unnamed: 0,question,answer,contexts,ground_truth,context_precision,context_recall,faithfulness,answer_relevancy
0,Do employees have opinions of their mangers at...,Employees at TjMaxx have varying opinions of t...,[You should see how they treat their employees...,Employees do have opinions of their managers. ...,0.913221,0.166667,1.0,0.958887
1,Can I wear earbuds at work in Walmart?,The policy on wearing earbuds at work in Walma...,[I don't work at Walmart anymore but I pretty ...,There is no consensus on whether this is prohi...,0.813295,0.125,1.0,0.928918
2,What to do if you injured yourself during your...,"If you are injured during your delivery job, i...",[Get the workers comp. Accidents happen. It wo...,You should go to the UPS doctor to talk about ...,0.489316,0.75,1.0,0.986655
3,Do you get paid time off at Walmart?,"Yes, Walmart does offer paid time off, but man...",[This is about the most ludicrous thing I've e...,There is no policy stating that employees don'...,0.701894,0.0,0.857143,0.968731
4,How to handle inappropriate customer instructi...,Handling inappropriate customer instructions a...,[This is why customers at Starbucks have no pr...,Customers can have various instructions on how...,0.373227,1.0,1.0,0.982555


In [None]:
df1repre

Unnamed: 0,question,answer,contexts,ground_truth,context_precision,context_recall,faithfulness,answer_relevancy
0,Do employees have opinions of their mangers at...,Employees at Tjmaxx have varying opinions of t...,[that is not how tjmaxx operates the employee ...,Employees do have opinions of their managers. ...,0.880258,0.166667,1.0,0.95828
1,Can I wear earbuds at work in Walmart?,"Based on the information provided, it seems th...",[i do not work at walmart anymore but i pretty...,There is no consensus on whether this is prohi...,0.642758,1.0,1.0,0.0
2,What to do if you injured yourself during your...,"If you are injured during your delivery job, i...",[have you spoken to anyone at sedgwick or hr t...,You should go to the UPS doctor to talk about ...,0.196429,0.25,1.0,0.986676
3,Do you get paid time off at Walmart?,"Yes, Walmart does not do away with unpaid time...",[no walmart is not doing away with unpaid time...,There is no policy stating that employees don'...,0.777778,0.0,0.4,0.944941
4,How to handle inappropriate customer instructi...,To handle inappropriate customer instructions ...,[100 this i am a nice friendly barista and lov...,Customers can have various instructions on how...,0.5,0.0,1.0,0.97832
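
Per-question rows are hard to compare across configurations, so a per-metric mean is a common first summary. A minimal sketch using the `context_recall` and `faithfulness` columns printed above for `df1` and `df1repre`. `mean_scores` is a hypothetical helper; with the real dataframes, selecting the metric columns and calling pandas' `.mean()` should give the same figures.

```python
from statistics import mean
from typing import Dict, List


def mean_scores(table: Dict[str, List[float]]) -> Dict[str, float]:
    """Average each metric column over all questions."""
    return {metric: mean(values) for metric, values in table.items()}


# Values copied from the df1 and df1repre outputs above.
df1_scores = {
    "context_recall": [0.166667, 0.125, 0.75, 0.0, 1.0],
    "faithfulness": [1.0, 1.0, 1.0, 0.857143, 1.0],
}
df1repre_scores = {
    "context_recall": [0.166667, 1.0, 0.25, 0.0, 0.0],
    "faithfulness": [1.0, 1.0, 1.0, 0.4, 1.0],
}

for name, table in [("df1", df1_scores), ("df1repre", df1repre_scores)]:
    summary = mean_scores(table)
    print(name, {metric: round(value, 3) for metric, value in summary.items()})
```

With only five questions per run, these means are indicative at best.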


In [None]:
df1.to_csv(f'{dataset_path}/df1.csv',index=False)
df3.to_csv(f'{dataset_path}/df3.csv',index=False)
df1re.to_csv(f'{dataset_path}/df1re.csv',index=False)
df2re.to_csv(f'{dataset_path}/df3re.csv',index=False)  # df2re holds the reranked db3 results

df3pre.to_csv(f'{dataset_path}/df3pre.csv',index=False)
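
The file names in the cell above are typed by hand, which is how `df2re` ends up written to `df3re.csv`. A small sketch that derives each file name from a dictionary key instead. `save_frames` is a hypothetical helper, and it only assumes that each value offers the pandas-style `to_csv(path, index=False)` method.

```python
import os
from typing import Dict, List


def save_frames(frames: Dict[str, object], directory: str) -> List[str]:
    """Write every frame to <directory>/<key>.csv and return the paths written."""
    os.makedirs(directory, exist_ok=True)
    written = []
    for name, frame in frames.items():
        path = os.path.join(directory, f"{name}.csv")
        frame.to_csv(path, index=False)  # same call as in the cell above
        written.append(path)
    return written
```

Called as `save_frames({'df1': df1, 'df3': df3}, dataset_path)`, the key doubles as the file name, so a dataframe cannot be saved under a different label by accident.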

In [None]:
df1 =  answer_context(rag_chain_basedb1, retrievers(db1)[0])
# df2 =  answer_context(rag_chain_basedb2, retrievers(db2)[0])
df3 =  answer_context(rag_chain_basedb3, retrievers(db3)[0])

df1re = answer_context(rag_chain_rerankdb1, retrievers(db1)[1])
# df2re = answer_context(rag_chain_rerankdb2, retrievers(db2)[1])
# NOTE: df2re holds the reranked db3 results
df2re = answer_context(rag_chain_rerankdb3, retrievers(db3)[1])

df1pre =  answer_context(rag_chain_basepredb1, retrievers(predb1)[0])
# df2pre =  answer_context(rag_chain_basepredb2, retrievers(predb2)[0])
# df3pre =  answer_context(rag_chain_basepredb3, retrievers(predb3)[0])

df1repre = answer_context(rag_chain_rerankpredb1, retrievers(predb1)[1])
# df2repre = answer_context(rag_chain_rerankdb2, retrievers(predb1)[1])
# NOTE: db3 rerank chain paired with the predb1 retriever; a predb3 pair may have been intended
df2repre = answer_context(rag_chain_rerankdb3, retrievers(predb1)[1])




Exception in thread Thread-24:
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 96, in run
    results = self.loop.run_until_complete(self._aresults())
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 84, in _aresults
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 79, in _aresults
    r = await future
  File "/usr/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 38, in sema_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 112, in wrapped_callable_async
    return counter, await callable(

ExceptionInRunner: The runner thread which was running the jobs raised an exeception. Read the traceback above to debug it. You can also pass `raise_exceptions=False` incase you want to show only a warning message instead.
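
The cell above runs seven evaluations back to back, so an exception in one run stops the cell before the remaining dataframes are assigned. A minimal sketch of running the configurations independently and recording which ones failed. `run_all` and the job names are hypothetical; in the notebook each job would wrap one `answer_context(...)` call.

```python
from typing import Any, Callable, Dict, Tuple


def run_all(
    jobs: Dict[str, Callable[[], Any]]
) -> Tuple[Dict[str, Any], Dict[str, Exception]]:
    """Run each job; keep successful results and record failures by name."""
    results: Dict[str, Any] = {}
    failures: Dict[str, Exception] = {}
    for name, job in jobs.items():
        try:
            results[name] = job()
        except Exception as exc:  # keep going so one bad run does not block the rest
            failures[name] = exc
    return results, failures


def failing_job():
    # Stand-in for a run that raises, like the one in the traceback above.
    raise RuntimeError("simulated evaluation failure")


# In the notebook each value would be e.g.
# lambda: answer_context(rag_chain_basedb1, retrievers(db1)[0])
results, failures = run_all({
    "df1": lambda: "scores for db1",
    "df1pre": failing_job,
    "df3": lambda: "scores for db3",
})
print(sorted(results), sorted(failures))
```

The error message above points to a second option, passing `raise_exceptions=False` to `evaluate`, which reports the failure as a warning instead of raising.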

In [None]:
retrievers(predb2)[0]

VectorStoreRetriever(tags=['FAISS', 'HuggingFaceEmbeddings'], vectorstore=<langchain_community.vectorstores.faiss.FAISS object at 0x7ec036544e80>, search_kwargs={'k': 30})

In [None]:
df3pre =  answer_context(rag_chain_basepredb3, retrievers(predb3)[0])


In [None]:
# df1 = answer_context(rag_chain1, retriever1)
# df2 = answer_context(rag_chain2, retriever2)
# df3 = answer_context(rag_chain3, retriever3)

# dfs = [df1, df2, df3]


In [None]:
df_base = answer_context(rag_chain_base, retriever_base)
df_rerank = answer_context(rag_chain_rerank, retriever_reranked)


In [None]:
dataset = Dataset.from_dict(data)

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)




result = evaluate(
    dataset=dataset,
    metrics=[
        context_precision,
        context_recall,
        faithfulness,
        answer_relevancy,
    ],
)

df = result.to_pandas()


In [None]:
dfs = [df_base, df_rerank]

In [None]:
df_base

Unnamed: 0,question,answer,contexts,ground_truth,context_precision,context_recall,faithfulness,answer_relevancy
0,How to make Chai Latte?,"To make a hot Chai latte, you can steep three ...",[Newbie here. How is a hot Chai latte made? Th...,Heat the milk and steep the tea. Let the milk ...,0.782999,0.222222,1.0,0.960563
1,What is walmart?,Walmart is a multinational retail corporation ...,"[What does this have to do with walmart?, What...",Walmart Inc. is an American multinational reta...,0.231795,0.066667,1.0,0.876947


In [None]:
df_rerank

NameError: name 'df_rerank' is not defined