# Visualize your RAG Data - EDA for Retrieval-Augmented Generation
## How to use UMAP dimensionality reduction for embeddings to show questions, answers, and their relationships to source documents with OpenAI, LangChain, and ChromaDB
This notebook accompanies an [article on ITNEXT](https://itnext.io/visualize-your-rag-data-eda-for-retrieval-augmented-generation-0701ee98768f).

### Get ready

In [1]:
!pip install langchain chromadb renumics-spotlight sentence-transformers einops



### Prepare documents

In [2]:
from langchain.vectorstores.chroma import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
import pandas as pd
import numpy as np

In [3]:
# create embeddings model
model_kwargs = {'trust_remote_code': True}
embedding = HuggingFaceEmbeddings(model_name='nomic-ai/nomic-embed-text-v1.5', model_kwargs=model_kwargs)


In [4]:
# Retrieve vector store
vectordb = Chroma(persist_directory='db_v3', embedding_function=embedding)

### Ask a Question

In [5]:
# sample questions taken from the app
sample_questions = [
    'How to enroll?',
    'How do I apply for a car sticker?',
    'Who to contact about student organizations?',
    'What is the difference between BS CS and BS IT?',
    'What is the difference between overload, tutorial, and override?',
    'What are the fee discounts available and how can I apply for them?',
    'Please discuss the core values of USC.'
]
question = sample_questions[0]

In [6]:
retriever = vectordb.as_retriever(search_kwargs={"k": 5})
docs = retriever.get_relevant_documents(question)
sources = [doc.metadata.get("source") for doc in docs]
sources


['A18-Enrollment Steps for Continuing Students.txt',
 'A15-Post-Admission Enrollment Steps.txt',
 'A16-Enrollment Mechanics.txt',
 'A24-Simultaneous Enrollment.txt',
 'A28-Additional Enrollment Steps (Shiftee).txt']

### Visualize

In [33]:
# extract embeddings for the documents from the vector store and store them in a dataframe
response = vectordb.get(include=["metadatas", "documents", "embeddings"])
df = pd.DataFrame(
    {
        "id": response["ids"],
        "source": [metadata.get("source") for metadata in response["metadatas"]],
        "document": response["documents"],
        "embedding": response["embeddings"],
    }
)

In [34]:
df = df.sort_values(by=['source'])
df

| | id | source | document | embedding |
|---|---|---|---|---|
| 95 | ae0dbca3-cc02-44e9-b21d-bb4110f0ffb8 | A00-Abbreviations.txt | USC stands for University of San Carlos.\nSVD ... | [0.46299073100090027, 0.36808809638023376, -2.... |
| 119 | da84031c-f3f3-4e2c-aca9-e9f3068247c7 | A01-Description.txt | TOPIC 1: Description\n\nUniversity of San Carl... | [-0.01337280310690403, 1.720857858657837, -3.1... |
| 90 | a68a39f3-0e0a-44e4-97eb-c2da93483cc8 | A02-Catholic Identity, Vision Mission and Core... | TOPIC 2: Catholic Identity, Vision, Mission, a... | [0.48381316661834717, 0.8756269812583923, -3.3... |
| 15 | 18f81ef3-6772-44a3-a32c-7d6874f0f137 | A03-The University Seal.txt | Topic 3: The University Seal\n\nThe University... | [-0.26976484060287476, 1.9007909297943115, -3.... |
| 37 | 452a0946-0f7a-4cdc-9dae-973aa0c3adec | A04-History.txt | Topic 4: History\n\nThe University of San Carl... | [0.26369062066078186, 1.508581280708313, -3.29... |
| ... | ... | ... | ... | ... |
| 107 | c2089b3d-e87b-47b0-b7b1-4095ae6bc15b | B38-BS Pharmacy.txt | Bachelor of Science in Pharmacy / BS Pharmacy ... | [-0.570658802986145, 0.36090323328971863, -2.2... |
| 32 | 40a5a9f2-383b-4473-9bad-15cd2544b6d5 | B39-BS Psychology.txt | Bachelor of Science in Psychology / BS Psychol... | [-0.12269929051399231, 0.17832259833812714, -2... |
| 52 | 65758b26-9d29-4924-8ddc-55ad9e86bb97 | B40-BS Tourism Management.txt | Bachelor of Science in Tourism Management / BS... | [-1.1161071062088013, 1.0692917108535767, -3.6... |
| 103 | bc2b6782-1a0e-4c6e-b3e5-c6b5c624ae0e | B41-B Secondary Education Major in Science.txt | Bachelor of Secondary Education Major in Scien... | [-0.15809260308742523, 0.5261877179145813, -2.... |


In [35]:
# add the questions and their embeddings to the dataframe
question_rows = []
for count, question in enumerate(sample_questions, start=1):
  question_rows.append(
      {
          "id": f"question_{count}",
          "question": question,
          "embedding": embedding.embed_query(question),
      }
  )
question_rows = pd.DataFrame(question_rows)
# question rows come first; their source and document columns stay empty
df = pd.concat([question_rows, df])
df.head(20)

| | id | question | embedding | source | document |
|---|---|---|---|---|---|
| 0 | question_1 | How to enroll? | [-1.4416790008544922, 1.325736403465271, -3.31... | | |
| 1 | question_2 | How do I apply for a car sticker? | [-0.09891726821660995, -0.4405957758426666, -3... | | |
| 2 | question_3 | Who to contact about student organizations? | [-1.1395211219787598, -0.04957957565784454, -3... | | |
| 3 | question_4 | What is the difference between BS CS and BS IT? | [1.1334507465362549, 1.0081663131713867, -2.41... | | |
| 4 | question_5 | What is the difference between overload, tutor... | [1.0645222663879395, 0.7283675670623779, -3.81... | | |
| 5 | question_6 | What are the fee discounts available and how c... | [-0.5865501761436462, 0.9015997648239136, -3.5... | | |
| 6 | question_7 | Please discuss the core values of USC. | [0.12190892547369003, 0.8585222363471985, -3.4... | | |
| 95 | ae0dbca3-cc02-44e9-b21d-bb4110f0ffb8 | | [0.46299073100090027, 0.36808809638023376, -2.... | A00-Abbreviations.txt | USC stands for University of San Carlos.\nSVD ... |
| 119 | da84031c-f3f3-4e2c-aca9-e9f3068247c7 | | [-0.01337280310690403, 1.720857858657837, -3.1... | A01-Description.txt | TOPIC 1: Description\n\nUniversity of San Carl... |
| 90 | a68a39f3-0e0a-44e4-97eb-c2da93483cc8 | | [0.48381316661834717, 0.8756269812583923, -3.3... | A02-Catholic Identity, Vision Mission and Core... | TOPIC 2: Catholic Identity, Vision, Mission, a... |


In [38]:
# make 2 copies of df: df_euclid, df_cosine
# note: df_euclid will use euclidean distance, df_cosine will use cosine similarity
df_euclid = df.copy(deep=True)
df_cosine = df.copy(deep=True)

In [39]:
# calculate the Euclidean distance and the cosine similarity between each question and every embedding in df
# note: the dist_N columns of df_cosine hold similarities, so higher values mean closer
for count, question in enumerate(sample_questions, start=1):
  question_embedding = np.array(embedding.embed_query(question))
  df_euclid[f"dist_{count}"] = df.apply(
      # Euclidean distance - norm(A - B)
      lambda row: np.linalg.norm(np.array(row["embedding"]) - question_embedding),
      axis=1,
  )
  df_cosine[f"dist_{count}"] = df.apply(
      # Cosine similarity - dot(A, B) / (norm(A) * norm(B))
      lambda row: np.dot(np.array(row["embedding"]), question_embedding)
      / (np.linalg.norm(np.array(row["embedding"])) * np.linalg.norm(question_embedding)),
      axis=1,
  )
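
The two metrics computed above can rank the same vectors differently when the embeddings are not unit length, which appears to be the case here: the Euclidean distances reported in this notebook are around 20, and unit vectors are never more than 2 apart. The following is a minimal sketch of both formulas in vectorized NumPy. The helper name `euclidean_and_cosine` and the toy 2-D vectors are made up for illustration and are not part of the notebook.

```python
import numpy as np

def euclidean_and_cosine(doc_matrix, query):
    """Return (Euclidean distances, cosine similarities) of each row to the query."""
    docs = np.asarray(doc_matrix, dtype=float)
    q = np.asarray(query, dtype=float)
    # Euclidean distance: norm of the row-wise difference, norm(A - B)
    dists = np.linalg.norm(docs - q, axis=1)
    # Cosine similarity: dot(A, B) / (norm(A) * norm(B))
    sims = (docs @ q) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    return dists, sims

# Toy vectors (hypothetical): the third row points in the same direction
# as the query but is much longer.
docs = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
query = np.array([1.0, 1.0])

dists, sims = euclidean_and_cosine(docs, query)
nearest_by_distance = int(np.argmin(dists))    # smaller distance means closer
nearest_by_similarity = int(np.argmax(sims))   # larger similarity means closer
```

In this toy example the first row is nearest by Euclidean distance, while the third row is nearest by cosine similarity, because cosine similarity ignores vector length.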

In [40]:
df_euclid.head(20)

| | id | question | embedding | source | document | dist_1 | dist_2 | dist_3 | dist_4 | dist_5 | dist_6 | dist_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | question_1 | How to enroll? | [-1.4416790008544922, 1.325736403465271, -3.31... | | | 0.0 | 21.620063 | 22.668039 | 25.112623 | 23.288184 | 22.087395 | 25.292707 |
| 1 | question_2 | How do I apply for a car sticker? | [-0.09891726821660995, -0.4405957758426666, -3... | | | 21.620063 | 0.0 | 23.503184 | 25.036329 | 23.969033 | 19.371689 | 24.768926 |
| 2 | question_3 | Who to contact about student organizations? | [-1.1395211219787598, -0.04957957565784454, -3... | | | 22.668039 | 23.503184 | 0.0 | 24.595079 | 24.486594 | 23.543206 | 23.159292 |
| 3 | question_4 | What is the difference between BS CS and BS IT? | [1.1334507465362549, 1.0081663131713867, -2.41... | | | 25.112623 | 25.036329 | 24.595079 | 0.0 | 21.257554 | 24.765791 | 24.368647 |
| 4 | question_5 | What is the difference between overload, tutor... | [1.0645222663879395, 0.7283675670623779, -3.81... | | | 23.288184 | 23.969033 | 24.486594 | 21.257554 | 0.0 | 22.468376 | 24.194169 |
| 5 | question_6 | What are the fee discounts available and how c... | [-0.5865501761436462, 0.9015997648239136, -3.5... | | | 22.087395 | 19.371689 | 23.543206 | 24.765791 | 22.468376 | 0.0 | 24.240844 |
| 6 | question_7 | Please discuss the core values of USC. | [0.12190892547369003, 0.8585222363471985, -3.4... | | | 25.292707 | 24.768926 | 23.159292 | 24.368647 | 24.194169 | 24.240844 | 0.0 |
| 95 | ae0dbca3-cc02-44e9-b21d-bb4110f0ffb8 | | [0.46299073100090027, 0.36808809638023376, -2.... | A00-Abbreviations.txt | USC stands for University of San Carlos.\nSVD ... | 21.597172 | 20.620203 | 18.99645 | 21.098682 | 21.428712 | 21.040901 | 17.542745 |
| 119 | da84031c-f3f3-4e2c-aca9-e9f3068247c7 | | [-0.01337280310690403, 1.720857858657837, -3.1... | A01-Description.txt | TOPIC 1: Description\n\nUniversity of San Carl... | 22.745493 | 23.098682 | 20.450947 | 23.338648 | 22.449175 | 21.99972 | 18.396513 |
| 90 | a68a39f3-0e0a-44e4-97eb-c2da93483cc8 | | [0.48381316661834717, 0.8756269812583923, -3.3... | A02-Catholic Identity, Vision Mission and Core... | TOPIC 2: Catholic Identity, Vision, Mission, a... | 23.475701 | 22.903508 | 21.058618 | 23.098824 | 22.249181 | 22.311607 | 15.826622 |


In [41]:
df_cosine.head(20)

| | id | question | embedding | source | document | dist_1 | dist_2 | dist_3 | dist_4 | dist_5 | dist_6 | dist_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | question_1 | How to enroll? | [-1.4416790008544922, 1.325736403465271, -3.31... | | | 1.0 | 0.543929 | 0.498955 | 0.392178 | 0.458715 | 0.506885 | 0.363029 |
| 1 | question_2 | How do I apply for a car sticker? | [-0.09891726821660995, -0.4405957758426666, -3... | | | 0.543929 | 1.0 | 0.430437 | 0.362216 | 0.392217 | 0.597239 | 0.35323 |
| 2 | question_3 | Who to contact about student organizations? | [-1.1395211219787598, -0.04957957565784454, -3... | | | 0.498955 | 0.430437 | 1.0 | 0.385106 | 0.366335 | 0.405335 | 0.435179 |
| 3 | question_4 | What is the difference between BS CS and BS IT? | [1.1334507465362549, 1.0081663131713867, -2.41... | | | 0.392178 | 0.362216 | 0.385106 | 1.0 | 0.52917 | 0.35109 | 0.383146 |
| 4 | question_5 | What is the difference between overload, tutor... | [1.0645222663879395, 0.7283675670623779, -3.81... | | | 0.458715 | 0.392217 | 0.366335 | 0.52917 | 1.0 | 0.4433 | 0.36692 |
| 5 | question_6 | What are the fee discounts available and how c... | [-0.5865501761436462, 0.9015997648239136, -3.5... | | | 0.506885 | 0.597239 | 0.405335 | 0.35109 | 0.4433 | 1.0 | 0.354405 |
| 6 | question_7 | Please discuss the core values of USC. | [0.12190892547369003, 0.8585222363471985, -3.4... | | | 0.363029 | 0.35323 | 0.435179 | 0.383146 | 0.36692 | 0.354405 | 1.0 |
| 95 | ae0dbca3-cc02-44e9-b21d-bb4110f0ffb8 | | [0.46299073100090027, 0.36808809638023376, -2.... | A00-Abbreviations.txt | USC stands for University of San Carlos.\nSVD ... | 0.457341 | 0.464826 | 0.551686 | 0.45026 | 0.397318 | 0.406457 | 0.607627 |
| 119 | da84031c-f3f3-4e2c-aca9-e9f3068247c7 | | [-0.01337280310690403, 1.720857858657837, -3.1... | A01-Description.txt | TOPIC 1: Description\n\nUniversity of San Carl... | 0.459561 | 0.405676 | 0.535396 | 0.402993 | 0.422652 | 0.435849 | 0.614445 |
| 90 | a68a39f3-0e0a-44e4-97eb-c2da93483cc8 | | [0.48381316661834717, 0.8756269812583923, -3.3... | A02-Catholic Identity, Vision Mission and Core... | TOPIC 2: Catholic Identity, Vision, Mission, a... | 0.413277 | 0.403995 | 0.49762 | 0.403847 | 0.421039 | 0.407163 | 0.710005 |
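
When reading the two result frames, the `dist_N` columns need opposite sort orders: in `df_euclid` the smallest value is the closest document, while in `df_cosine` the largest value is. The question rows are also part of both frames and score 0.0 or 1.0 against themselves, so they should be filtered out before ranking. The sketch below shows one way to do this, assuming small stand-in frames with invented values and file names (`toy_euclid` and `toy_cosine` are hypothetical, not the notebook's data).

```python
import pandas as pd

# Hypothetical stand-ins for df_euclid and df_cosine; all values are made up.
toy_euclid = pd.DataFrame(
    {
        "id": ["question_1", "doc_a", "doc_b", "doc_c"],
        "source": [None, "a.txt", "b.txt", "c.txt"],
        "dist_1": [0.0, 21.6, 18.9, 23.4],
    }
)
toy_cosine = pd.DataFrame(
    {
        "id": ["question_1", "doc_a", "doc_b", "doc_c"],
        "source": [None, "a.txt", "b.txt", "c.txt"],
        "dist_1": [1.0, 0.46, 0.55, 0.41],
    }
)

# Keep only document rows: question rows have no source.
docs_euclid = toy_euclid[toy_euclid["source"].notna()]
docs_cosine = toy_cosine[toy_cosine["source"].notna()]

# Euclidean distance: smaller means closer.
closest_euclid = docs_euclid.nsmallest(2, "dist_1")["source"].tolist()
# Cosine similarity: larger means closer.
closest_cosine = docs_cosine.nlargest(2, "dist_1")["source"].tolist()
```

The same pattern should carry over to the real frames by swapping in `df_euclid` and `df_cosine` and picking the `dist_N` column of the question of interest.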


### Store results

In [42]:
df_euclid.to_csv('embeddings_and_distances(euclid).csv')

In [43]:
df_cosine.to_csv('embeddings_and_distances(cosine).csv')
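
One thing to keep in mind when loading these files again: CSV has no list type, so each embedding is written as its string representation and comes back as a plain string. The sketch below shows a possible round trip on a toy frame, assuming the embeddings are Python lists of floats (as the outputs above suggest); the toy values and the in-memory buffer are for illustration only. It also passes `index=False`, which avoids an extra `Unnamed: 0` column when the file is read back.

```python
import io
import json

import pandas as pd

# Toy frame with list-valued embeddings (made-up values).
toy = pd.DataFrame(
    {
        "id": ["question_1", "doc_1"],
        "embedding": [[0.1, -0.2], [0.3, 0.4]],
    }
)

# Write to an in-memory buffer instead of a file to keep the example self-contained.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading without a converter yields strings such as "[0.1, -0.2]".
buffer.seek(0)
raw = pd.read_csv(buffer)

# A converter parses each cell back into a list of floats.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"embedding": json.loads})
```

If the embeddings were NumPy arrays rather than lists, their string form would not be valid JSON, so a different storage format (for example Parquet or `np.save`) may be the safer choice.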