# Enhancing Data Retrieval Accuracy

### Including Traditional NLP tasks in RAG Applications

## This notebook compares retrieval accuracy between the usual embedding approach and the same approach with additional NLP preprocessing

#### Helper code for reading the data, creating embeddings, and storing them in the vector DB lives in Python scripts and is imported here

In [1]:
from helper.docs_helper import load_data, chunk_data
from handle_vectors import create_embeddings, store_vectors_unaltered, create_vectors

In [2]:
# Read the PDF contents from the given path
pdf_contents = load_data(path="data/Mahabharata.pdf")

In [3]:
# Split the text read from the PDF into chunks
chunks = chunk_data(pdf_contents)
# Create embeddings and the record structure used for vector storage
vectors = create_vectors(text_chunks=chunks)
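
The body of `chunk_data` is not shown in this notebook, so the following is only a minimal sketch of one common way to chunk text: fixed-size character windows with a small overlap. The name `chunk_text` and the `chunk_size` and `overlap` parameters are assumptions made for illustration; the real helper may split differently (for example by page or by sentence).

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap (hypothetical helper)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 3  # 30 characters of toy text
demo_chunks = chunk_text(sample, chunk_size=12, overlap=2)
print([len(c) for c in demo_chunks])
```

The overlap means the last characters of one chunk reappear at the start of the next, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.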

In [4]:
print(len(vectors))
vectors[0]["metadata"]

63


{'text': 'Mahabharata\nManuscript illustration of the Battle of\nKurukshetra\nInformation\nReligion Hinduism\nAuthor Vyasa\nLanguageSanskrit\nPeriod Principally compiled in 3rd century\nBCE–4th century CE\nChapters 18 Parvas\nVerses 200,000\nFull text\nMahabharata at Sanskrit Wikisource\n Mahabharata at English Wikisource\nKrishna and Arjuna at Kurukshetra,\n18th–19th-century painting\nModern depiction of Vyasa narrating\nthe Mahābhārata to Ganesha at the\nMurudeshwara temple, Karnataka.\nMahabharata\nThe Mahābhārata (/məˌhɑːˈbɑː rətə, ˌmɑːhə-/ m ə -HAH-BAR- ə -t ə , MAH-\nh ə -;[1][2][3][4] Sanskrit: म ह ा भ ा र त म ् , IAST: Mahābhāratam, pronounced\n[m ɐɦ a ːˈ b ʱ a ː r ɐt̪ɐ m]) is one of the two major Smriti texts and Sanskrit epics of\nancient India revered in Hinduism, the other being the Rāmāya ṇ a.[5] It narratesthe events and aftermath of the Kurukshetra War, a war of succession between\ntwo groups of princely cousins, the Kauravas and the Pā ṇḍ avas.'}

In [5]:
# Store the vectors created above in the vector DB (Pinecone)
index = store_vectors_unaltered(vectors)

##### Executing the cell above stores the vectors in the Pinecone vector DB, under the index "mahabharata" and the namespace "unaltered"

#### Checking the retrieval score with the usual approach: embed the input query and retrieve from the knowledge base

In [8]:
query = "Who and all seeked advice from Krishna in this"
query_embedding = create_embeddings(query)
results = index.query(
    namespace="unaltered",
    vector=query_embedding,
    top_k=3,
    include_values=False,
    include_metadata=True
)

##### Below are the results from approach 1: contents embedded as-is, compared against the embedding of the as-is user query

In [9]:
print(results["matches"][0]["score"])
print(results["matches"][1]["score"])
print(results["matches"][2]["score"])

0.836120367
0.824133754
0.817277193
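
To make the scores above easier to read: the notebook does not say which similarity metric the "mahabharata" index uses. Assuming it is cosine similarity (one of the metrics Pinecone offers), each score is the cosine of the angle between the query embedding and a stored chunk embedding. The sketch below computes that value for two toy vectors; `cosine_similarity`, `query_vec`, and `doc_vec` are illustrative names, not part of the helper scripts.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings
query_vec = [1.0, 0.0, 1.0]
doc_vec = [1.0, 1.0, 0.0]
print(cosine_similarity(query_vec, doc_vec))
```

Under this assumption, a score closer to 1 means the chunk embedding points in nearly the same direction as the query embedding.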


##### Below are the results from approach 2: contents embedded as-is, compared against the embedding of the processed user query (NLP, NLTK-processed)

In [10]:
from helper.docs_helper import process_input
query = "Who and all seeked advice from Krishna in this"
query = process_input(query)
query_embedding = create_embeddings(query)
results = index.query(
    namespace="unaltered",
    vector=query_embedding,
    top_k=3,
    include_values=False,
    include_metadata=True
)
print(results["matches"][0]["score"])
print(results["matches"][1]["score"])
print(results["matches"][2]["score"])

0.836811602
0.825062931
0.815428615
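
The exact cleaning steps inside `process_input` are not shown here. As a rough illustration of the kind of preprocessing involved, the sketch below lowercases the text, tokenizes it, and drops stop words. It deliberately avoids NLTK so that it runs without downloading any corpora: `STOP_WORDS` is a small hand-written list and `process_input_sketch` is a hypothetical name, so its output will not match the real helper exactly.

```python
import re

# Small hand-written stop-word list (an assumption; NLTK's English list is much longer)
STOP_WORDS = {"a", "an", "and", "the", "in", "from", "this", "who", "all", "of", "to", "is"}


def process_input_sketch(text: str) -> str:
    """Lowercase, tokenize on word characters, and remove stop words."""
    tokens = re.findall(r"\w+", text.lower())
    kept = [token for token in tokens if token not in STOP_WORDS]
    return " ".join(kept)


cleaned_query = process_input_sketch("Who and all seeked advice from Krishna in this")
print(cleaned_query)  # seeked advice krishna
```

Removing filler words leaves only the content-bearing terms, which is presumably why the processed query embeds slightly closer to the relevant chunks.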


## The additional imports below apply the NLTK cleaning steps to the document contents themselves and vectorize the result

In [11]:
from handle_vectors import store_vectors_processed, create_vectors_processed
chunks = chunk_data(pdf_contents)
processed_chunks = [process_input(chunk) for chunk in chunks]
vectors = create_vectors_processed(original_chunks=chunks, processed_chunks=processed_chunks)
vectors[0]["metadata"]

{'text': 'Mahabharata\nManuscript illustration of the Battle of\nKurukshetra\nInformation\nReligion Hinduism\nAuthor Vyasa\nLanguageSanskrit\nPeriod Principally compiled in 3rd century\nBCE–4th century CE\nChapters 18 Parvas\nVerses 200,000\nFull text\nMahabharata at Sanskrit Wikisource\n Mahabharata at English Wikisource\nKrishna and Arjuna at Kurukshetra,\n18th–19th-century painting\nModern depiction of Vyasa narrating\nthe Mahābhārata to Ganesha at the\nMurudeshwara temple, Karnataka.\nMahabharata\nThe Mahābhārata (/məˌhɑːˈbɑː rətə, ˌmɑːhə-/ m ə -HAH-BAR- ə -t ə , MAH-\nh ə -;[1][2][3][4] Sanskrit: म ह ा भ ा र त म ् , IAST: Mahābhāratam, pronounced\n[m ɐɦ a ːˈ b ʱ a ː r ɐt̪ɐ m]) is one of the two major Smriti texts and Sanskrit epics of\nancient India revered in Hinduism, the other being the Rāmāya ṇ a.[5] It narratesthe events and aftermath of the Kurukshetra War, a war of succession between\ntwo groups of princely cousins, the Kauravas and the Pā ṇḍ avas.'}

In [12]:
index = store_vectors_processed(vectors)

#### Executing the cell above stores the vectors in the Pinecone vector DB, under the same index "mahabharata" but in a different namespace, "processed"

In [13]:
from helper.docs_helper import process_input
query = "Who and all seeked advice from Krishna in this"
# The query is intentionally left unprocessed in this cell (no process_input call)
query_embedding = create_embeddings(query)
results = index.query(
    namespace="processed",
    vector=query_embedding,
    top_k=3,
    include_values=False,
    include_metadata=True
)

##### Below are the results from approach 3: NLTK-processed contents embedded, compared against the embedding of the as-is user query

In [14]:
print(results["matches"][0]["score"])
print(results["matches"][1]["score"])
print(results["matches"][2]["score"])

0.839606
0.825872838
0.817132533


In [15]:
from helper.docs_helper import process_input
query = "Who and all seeked advice from Krishna in this"
query = process_input(query)
query_embedding = create_embeddings(query)
results = index.query(
    namespace="processed",
    vector=query_embedding,
    top_k=3,
    include_values=False,
    include_metadata=True
)

##### Below are the results from approach 4: NLTK-processed contents embedded, compared against the embedding of the NLTK-processed user query

In [16]:
print(results["matches"][0]["score"])
print(results["matches"][1]["score"])
print(results["matches"][2]["score"])

0.850452602
0.835941494
0.824624062
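
To compare the four approaches side by side, the top-3 scores printed in the cells above can be collected into one structure. The sketch below copies those numbers verbatim and reports, for each approach, the top-1 score, the mean of the top 3, and the top-1 gain over approach 1. The dictionary keys and variable names are my own labels. All scores come from a single query, so the gaps are indicative only.

```python
# Top-3 scores copied from the cell outputs above, keyed by (namespace, query form)
scores = {
    ("unaltered", "raw query"): [0.836120367, 0.824133754, 0.817277193],
    ("unaltered", "processed query"): [0.836811602, 0.825062931, 0.815428615],
    ("processed", "raw query"): [0.839606, 0.825872838, 0.817132533],
    ("processed", "processed query"): [0.850452602, 0.835941494, 0.824624062],
}

baseline = scores[("unaltered", "raw query")]  # approach 1

for (namespace, query_form), top3 in scores.items():
    mean_score = sum(top3) / len(top3)
    gain = top3[0] - baseline[0]  # top-1 improvement over approach 1
    print(f"{namespace:>9} / {query_form:<15} "
          f"top-1={top3[0]:.4f} mean={mean_score:.4f} gain={gain:+.4f}")

# Approach with the highest top-1 score
best = max(scores, key=lambda key: scores[key][0])
print("best:", best)
```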


In [17]:
results["matches"][0]


{'id': '35',
 'metadata': {'text': "king; he seeks Krishna's advice. Krishna advises him, "
                      'and after due preparation and the elimination of some '
                      'opposition,\n'
                      'Yudhishthira carries out the rājasūya yagna ceremony; '
                      'he is thus recognized as pre-eminent among kings.\n'
                      'The Pandavas have a new palace built for them, by Maya '
                      'the Danava.[64] They invite their Kaurava cousins to '
                      'Indraprastha.Duryodhana walks round the palace, and '
                      'mistakes a glossy floor for water, and will not step '
                      'in. After being told of his error, he\n'
                      'then sees a pond and assumes it is not water and falls '
                      'in. Bhima, Arjuna, the twins and the servants laugh at '
                      'him.[65] In popularadaptations, this insult is wrongly '
                   

#### As the output above shows, the metadata text is stored unchanged, so the cleaning steps applied before embedding do not affect later steps that read the retrieved text
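
The implementation of `create_vectors_processed` is not shown, but the behaviour described above can be sketched as follows: the embedding is computed from the processed chunk, while the metadata keeps the original chunk. `build_records` and `fake_embed` are hypothetical names. `fake_embed` is a toy stand-in for the real embedding model and returns meaningless numbers. The `id` / `values` / `metadata` layout follows the record format Pinecone accepts for upserts, and IDs are strings as in the match shown above.

```python
from typing import Callable


def fake_embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model (hypothetical, values are meaningless)."""
    return [float(len(text)), float(text.count(" ")), float(sum(map(ord, text)) % 97)]


def build_records(
    original_chunks: list[str],
    processed_chunks: list[str],
    embed: Callable[[str], list[float]],
) -> list[dict]:
    """Embed the processed text but keep the original text in the metadata."""
    if len(original_chunks) != len(processed_chunks):
        raise ValueError("original and processed chunks must align one-to-one")
    records = []
    for i, (original, processed) in enumerate(zip(original_chunks, processed_chunks)):
        records.append({
            "id": str(i),
            "values": embed(processed),      # vector comes from the cleaned text
            "metadata": {"text": original},  # readable text is stored untouched
        })
    return records


originals = ["He seeks Krishna's advice.", "The Pandavas have a new palace built."]
processed = ["seeks krishna advice", "pandavas new palace built"]
records = build_records(originals, processed, fake_embed)
print(records[0]["metadata"]["text"])
```

Keeping the original text in the metadata means whatever is passed on to the generation step of a RAG pipeline is the readable passage, not the stop-word-stripped version used only for matching.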
