### RAG(Retrieval Augmented Generation) practice 
- Retrieval -Find relevant information given a query
- Augmentation -Take the relevant information from retrieval and augment our input(prompt) to an LLM with that relevant information.
- Generation - Take the first two steps and pass them to an LLM for generative outputs.

### Why RAG?
- The main goal of RAG is to improve the generation outputs of LLMs. 
- Most of what I practiced were based on fine-tuning. It can provide up-to-date information to the model with less efforts.
### Steps
- Import PDF
- Process text for embedding
- Embed text chunks with embedding model.
- Save embeddings to file for later
### Practiced with 
- Daniel Bourke- Local Retrieval Generation(RAG) from Scratch(step by step tutorial)
- https://www.youtube.com/watch?v=qN_2fnOPY-M&t=513s
### The project is about
- Using RAG to augment pdf document information to a local LLM model.
- Getting a LLM model that is knowledgable on up-to-date Virginia Tech ECE Graduate students policy.

### Import PDF Document

In [1]:
import os 
import requests

# Get PDF docu path
pdf_path = "ECE Graduate Policy Manual AY2023-2024.pdf"

# download 
if not os.path.exists(pdf_path):
    print(f"[INFO] File doesn't exist, downloading...")

    # Enter the URL of the PDF 
    # Did not put URL of VT ECE graduate policy link on purpose. Let's edit this if I need to actually download a pdf file from a site later on.
    url = "example_url" 

    filename = pdf_path 

    response = requests.get(url)

    if response.status_code == 200: 
        with open(filename,"wb") as file:
            file.write(response.content)
        print(f"[INFO] The file has been downloaded and saved as {filename}")
    else:
        print(f"[INFO] failed to download the file. status code: {response.status_code}")
else:
    print(f"File {pdf_path} exists.")


File ECE Graduate Policy Manual AY2023-2024.pdf exists.


In [2]:
# Use PyMuPDF to open a pdf instead of pypdf
import fitz 
from tqdm.auto import tqdm

def text_formatter(text: str) -> str:
    """Performs minor formatting on text."""
    cleaned_text = text.replace("\n", " ").strip()

    # potentially more text formatting can be put in here. 
    return cleaned_text

def open_and_read_pdf(pdf_path: str) -> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text=text)
        # page_number - 4 was done to distinguish Table of Contents from main contents
        pages_and_texts.append({"page_number": page_number-4,
                                "page_char_count":len(text),
                                "page_word_count":len(text.split(" ")),
                                "page_sentence_count_raw":len(text.split(". ")),
                                "page_token_count":len(text)/4, # 1 token = ~4 characters
                                "text":text})
    return pages_and_texts
pages_and_texts = open_and_read_pdf(pdf_path = pdf_path)
pages_and_texts[:15]
        

0it [00:00, ?it/s]

[{'page_number': -4,
  'page_char_count': 67,
  'page_word_count': 11,
  'page_sentence_count_raw': 1,
  'page_token_count': 16.75,
  'text': 'ECE Graduate Student Policy Manual  For the 2023-2024 Academic Year'},
 {'page_number': -3,
  'page_char_count': 5340,
  'page_word_count': 389,
  'page_sentence_count_raw': 50,
  'page_token_count': 1335.0,
  'text': '1  Table of Contents  1  General Information ............................................................................................................................. 5  1.1  ECE Graduate Advising Offices ..................................................................................................... 5  1.1.1  Graduate Academic Advisors ................................................................................................ 5  1.1.2  Graduate Program Director.................................................................................................... 6  1.1.3  Assistant Graduate Program Directors ..........

In [3]:
import random

random.sample(pages_and_texts, k=1)

[{'page_number': 38,
  'page_char_count': 2901,
  'page_word_count': 531,
  'page_sentence_count_raw': 43,
  'page_token_count': 725.25,
  'text': "42 foreign language requirement for Ph.D. students. Additional requirements for coursework listed  on the Ph.D. plan of study include the following.  •  Ph.D. course work consists of 27 credit hours: a minimum of 24 course credit hours at the  5000-level or above.  •  2 credit hours of Seminar, ECE 5944 (These courses can be transferred from a Virginia  Tech M.S. degree in Computer Engineering or Electrical Engineering).  •  1 credit hour of Graduate Student Success in Multicultural Environments, ENGE  5304. (This course can be transferred from a Virginia Tech M.S. degree in Computer  Engineering or Electrical Engineering).  •  A maximum of 3 credit hours of 4000-level courses may be listed on the Ph.D. Plan of  Study. 3000-level or lower courses are not permitted.  •  The Virginia Tech Residency Requirement requires course work at the Ph.D

In [4]:
import pandas as pd
df = pd.DataFrame(pages_and_texts)
df.head()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,-4,67,11,1,16.75,ECE Graduate Student Policy Manual For the 20...
1,-3,5340,389,50,1335.0,1 Table of Contents 1 General Information ....
2,-2,6006,467,60,1501.5,2 2.9.10 Exchange (Non-Degree) Status .........
3,-1,5352,389,63,1338.0,3 4.9 Coursework Justification ................
4,0,664,47,6,166.0,4 7 Appendix ..................................


In [5]:
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,54.0,54.0,54.0,54.0,54.0
mean,22.5,2761.83,445.83,25.11,690.46
std,15.73,1095.41,151.94,15.45,273.85
min,-4.0,67.0,11.0,1.0,16.75
25%,9.25,2412.25,425.0,13.25,603.06
50%,22.5,2822.5,478.0,24.0,705.62
75%,35.75,3230.0,531.75,34.5,807.5
max,49.0,6006.0,638.0,63.0,1501.5


### Further text processing(splitting pages into sentences)


In [6]:
from spacy.lang.en import English

nlp = English()
nlp.add_pipe('sentencizer')

doc= nlp("This is a sentence. This is another sentence. I like penguins")
assert len(list(doc.sents)) == 3

# print split sentences

list(doc.sents)

[This is a sentence., This is another sentence., I like penguins]

In [7]:
for item in tqdm(pages_and_texts):
    item["sentences"] = list(nlp(item["text"]).sents)

    # All sentences are strings(the default type is a spaCy data type) 
    item["sentences"] = [str(sentence) for sentence in item["sentences"]]
    #count the sentences
    item["page_sentence_count_spacy"] = len(item["sentences"])

  0%|          | 0/54 [00:00<?, ?it/s]

In [8]:
random.sample(pages_and_texts, k=1)

[{'page_number': 19,
  'page_char_count': 3053,
  'page_word_count': 507,
  'page_sentence_count_raw': 25,
  'page_token_count': 763.25,
  'text': "23 Students may enroll in ECE 5904, Project and Report; ECE 5994, Thesis; or ECE 7994,  Dissertation, to fulfill the requirement the full-time enrollment requirement. Students must enroll  in Project and Report/Thesis/Dissertation by the CRN number assigned to their interim faculty  advisor or faculty advisor. Students are required to inform the faculty if enrolled in one of their  CRN numbers to determine the requirements and faculty expectations to earn a passing (EQ)  grade. These credit hours are not to be viewed as filler but rather a commitment to a certain  research load as mutually agreed upon by the student and faculty advisor.    Once an applicant registers, the application materials become part of the student's educational  record. ECE faculty and staff are allowed to access graduate student's educational files for  consideration

In [9]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy
count,54.0,54.0,54.0,54.0,54.0,54.0
mean,22.5,2761.83,445.83,25.11,690.46,17.0
std,15.73,1095.41,151.94,15.45,273.85,9.52
min,-4.0,67.0,11.0,1.0,16.75,1.0
25%,9.25,2412.25,425.0,13.25,603.06,11.25
50%,22.5,2822.5,478.0,24.0,705.62,17.5
75%,35.75,3230.0,531.75,34.5,807.5,23.0
max,49.0,6006.0,638.0,63.0,1501.5,40.0


### chunking sentences 
text splitting(chunking)
- This is done to fit into embedding model context window(model I'll use has 384 tokens as a limit)

In [10]:
# could use langchain instead, but stick to python for now. 
num_sentence_chunk_size=10

def split_list(input_list: list[str], 
               slice_size: int=num_sentence_chunk_size) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

test_list = list(range(25))
split_list(test_list)

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
 [20, 21, 22, 23, 24]]

In [11]:
# loop through pages and texts and split sentences into chunks 
for item in tqdm(pages_and_texts): 
    item["sentence_chunks"] = split_list(input_list=item["sentences"],
                                         slice_size = num_sentence_chunk_size)
    item["num_chunks"] = len(item["sentence_chunks"])

  0%|          | 0/54 [00:00<?, ?it/s]

In [12]:
random.sample(pages_and_texts, k=1)

[{'page_number': 14,
  'page_char_count': 3438,
  'page_word_count': 612,
  'page_sentence_count_raw': 28,
  'page_token_count': 859.5,
  'text': '18 Philosophy (PhD) in either the Electrical Engineering (EE) or Computer Engineering (CPE)  programs. The advantage of the program is that it allows enrolled students to double-count up  to 12 course credit hours (with the restrictions listed below) toward both degrees.  Virginia Tech undergraduate students who have a minimum GPA of 3.3 or better on all  undergraduate work, may apply for admission to the Accelerated ECE UGG Degree program. A  student may enter the program within the 12-month time period prior to the expected completion  date of their BS degree. To receive graduate credit, acceptance into the Accelerated UGG  Degree program is required prior to the semester in which enrolled in the courses selected for  double-counting.  Once accepted to the program, and during the last two semesters of their  undergraduate program, students

In [13]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy,num_chunks
count,54.0,54.0,54.0,54.0,54.0,54.0,54.0
mean,22.5,2761.83,445.83,25.11,690.46,17.0,2.24
std,15.73,1095.41,151.94,15.45,273.85,9.52,0.91
min,-4.0,67.0,11.0,1.0,16.75,1.0,1.0
25%,9.25,2412.25,425.0,13.25,603.06,11.25,2.0
50%,22.5,2822.5,478.0,24.0,705.62,17.5,2.0
75%,35.75,3230.0,531.75,34.5,807.5,23.0,3.0
max,49.0,6006.0,638.0,63.0,1501.5,40.0,4.0


### splitting each chunk into its own item
- Embed each chunk of sentences into its own numerical representation.(granularity)

In [14]:
import re

# split each chunk into its own item
pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item["sentence_chunks"]:
        chunk_dict = {}
        chunk_dict["page_number"] = item["page_number"]

        # join the sentences together into a paragraph like structure
        joined_sentence_chunk = "".join(sentence_chunk).replace("  ", " ").strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'.\1', joined_sentence_chunk)

        chunk_dict["sentence_chunk"] = joined_sentence_chunk

        chunk_dict["chunk_char_count"] = len(joined_sentence_chunk)
        chunk_dict["chunk_word_count"] = len([word for word in joined_sentence_chunk.split(" ")])
        chunk_dict["chunk_token_count"] = len(joined_sentence_chunk) / 4 # 1 token ~ 4 chars

        pages_and_chunks.append(chunk_dict)
len(pages_and_chunks)
        


  0%|          | 0/54 [00:00<?, ?it/s]

121

In [15]:
random.sample(pages_and_chunks, k=1)

[{'page_number': -2,
  'sentence_chunk': '2 2.9.10 Exchange (Non-Degree) Status ........................................................................................... 19 2.9.11 Readmission ........................................................................................................................ 20 3 New Student Information .................................................................................................................. 21 3.1 ECE Advisement Orientation Session ......................................................................................... 21 3.2 Payroll Forms ............................................................................................................................... 21 3.3 The Interim Faculty Advisor ......................................................................................................... 21 3.3.1 Interim Faculty Advisor for a M.Eng.Student ..................................................................

In [16]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)
df.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count
0,-4,ECE Graduate Student Policy Manual For the 202...,66,10,16.5
1,-3,1 Table of Contents 1 General Information .......,5256,305,1314.0
2,-2,2 2.9.10 Exchange (Non-Degree) Status ...........,5906,367,1476.5
3,-1,3 4.9 Coursework Justification ..................,5271,308,1317.75
4,0,4 7 Appendix ....................................,654,37,163.5


In [17]:
min_token_length = 30
for row in df[df["chunk_token_count"] <= min_token_length].sample(3).iterrows():
    print(f'chunk token count: {row[1]["chunk_token_count"]} | Text: {row[1]["sentence_chunk"]}')
    

chunk token count: 0.75 | Text: The
chunk token count: 16.5 | Text: ECE Graduate Student Policy Manual For the 2023-2024 Academic Year
chunk token count: 2.75 | Text: There is no


In [18]:
# filter out dataframe for rows under 30 tokens 
pages_and_chunks_over_min_token_len = df[df["chunk_token_count"] > min_token_length].to_dict(orient="records")
pages_and_chunks_over_min_token_len[5:10]

[{'page_number': 1,
  'sentence_chunk': 'The advisors may be contacted by email to vt.ece.gradadm@vt.edu (for all admissions questions) and to eceadvising@vt.edu (for all graduate advising questions). Mailing Address for Admissions Correspondence: ECE Graduate Admissions 1185 Perry Street  453 Whittemore (0111) Virginia Tech Blacksburg, VA 24061-0111',
  'chunk_char_count': 311,
  'chunk_word_count': 41,
  'chunk_token_count': 77.75},
 {'page_number': 2,
  'sentence_chunk': '6 1.1.2 Graduate Program Director The ECE Graduate Program Director is appointed by the ECE Department Head.The duties of the Director include the following:  • Chair the ECE Graduate Committee • Manage ECE graduate admissions • Ensure that the ECE Graduate Handbook is available on the ECE web page and is kept current • Make final decisions regarding exceptions to ECE graduate policies • Help to resolve conflict or outstanding issues that may arise as part of ECE graduate students’ education at Virginia Tech • Ensu

### embedding text chunks
- text -> number .. learned representations
- Embedding model: all-mpnet-base-v2
- Transformer: Sentence-Transformer

In [19]:
from sentence_transformers import SentenceTransformer
#choose other model if needed
embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                     device="cpu")

sentences = ["One winner, 42 losers. I eat losers for breakfast.",
             "I am Moana of Motunui. You will board my boat, sail across the sea, and restore the heart of Te Fiti.",
             "Your mind is like this water, my friend. When it is agitated, it becomes difficult to see. But if you allow it to settle, the answer becomes clear"]

embeddings = embedding_model.encode(sentences)
embeddings_dict = dict(zip(sentences,embeddings))

for sentence, embedding in embeddings_dict.items():
    print(f'Sentence: {sentence}')
    print(f"Embedding: {embedding}")
    print("")

Sentence: One winner, 42 losers. I eat losers for breakfast.
Embedding: [ 1.15046920e-02  1.13322273e-01  1.85373034e-02  6.38279617e-02
 -3.11114974e-02 -2.39148107e-03 -7.74082169e-02  7.86641333e-03
  8.84707365e-03  1.39631983e-02  5.92519678e-02 -3.69009115e-02
 -1.52639113e-02 -1.45488102e-02  4.17278074e-02 -2.85042431e-02
 -3.15775648e-02 -2.98556685e-02 -4.84288717e-03 -1.44464076e-02
 -7.40296021e-02  1.76092349e-02  2.75924206e-02  2.90649780e-03
 -1.54584108e-04  4.33346368e-02  3.12447194e-02  2.67929360e-02
 -1.64484996e-02 -7.76191652e-02  1.54406307e-02  1.30614564e-02
 -3.93566415e-02 -7.51571031e-03  1.92927700e-06 -3.57409604e-02
 -1.43471798e-02  3.73234488e-02 -3.78098786e-02  1.95336640e-02
 -5.33196069e-02  4.20520715e-02 -2.45027579e-02  1.37061253e-02
 -2.96841860e-02  1.34968217e-02 -2.76567489e-02  4.77359258e-02
  1.62776429e-02  1.54969830e-03  1.26459068e-02 -3.21757048e-02
  1.46062532e-02 -2.52563637e-02 -1.62457358e-02 -1.10004451e-02
  3.45287174e-02 -

In [20]:
embeddings[0].shape

(768,)

In [21]:
embedding = embedding_model.encode("John Donne was the best poet in history")
embedding

array([ 7.97574501e-03,  8.51745158e-02,  1.43370498e-02,  3.23541239e-02,
       -1.08743710e-02,  1.46376691e-03,  1.62877105e-02, -6.10514032e-03,
       -6.42639920e-02, -1.35684870e-02, -8.98757204e-03,  2.22515035e-02,
       -1.72912106e-02, -4.42087613e-02, -3.35173234e-02, -5.40120713e-02,
       -1.87907871e-02,  4.45181876e-02,  2.61868145e-02,  2.73707900e-02,
       -5.16821817e-02,  1.58530343e-02,  4.20359056e-03, -5.53753600e-02,
        5.07049598e-02, -3.91208343e-02,  4.91885049e-03,  3.11098937e-02,
       -1.92843955e-02,  1.80281103e-02,  8.38572159e-03, -8.21125414e-03,
       -2.30197906e-02,  4.06032503e-02,  1.02432637e-06,  1.20371551e-04,
        1.74724602e-03,  1.42615419e-02,  3.86207327e-02, -7.17447419e-03,
       -5.99864051e-02,  4.32519875e-02, -1.40723661e-02,  1.33939376e-02,
        3.64025757e-02, -3.28767933e-02,  4.18468518e-03, -4.46832310e-05,
        9.67290811e-03,  5.35940751e-02,  1.56235099e-02, -3.67998928e-02,
        5.29414080e-02, -

In [22]:
%%time

embedding_model.to("cpu")

#embed each chunk one by one 
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])

  0%|          | 0/118 [00:00<?, ?it/s]

CPU times: total: 2min 9s
Wall time: 16.7 s


In [23]:
import torch
torch.cuda.is_available()

True

In [24]:
%%time

embedding_model.to('cuda')
for item in tqdm(pages_and_chunks_over_min_token_len):
    item["embedding"] = embedding_model.encode(item["sentence_chunk"])


  0%|          | 0/118 [00:00<?, ?it/s]

CPU times: total: 2.19 s
Wall time: 2.13 s


In [26]:
%%time

text_chunks = [item["sentence_chunk"] for item in pages_and_chunks_over_min_token_len]
text_chunks[100]

CPU times: total: 0 ns
Wall time: 0 ns


'If the Preliminary Exam is failed a second time, the student will be dismissed from the Ph.D. program at Virginia Tech.  The Preliminary Exam must be scheduled through the Virginia Tech Graduate School at least two (2) weeks in advance via ESS.To pass the examination at most one negative vote may be recorded by the Advisory Committee.Only two opportunities to take the examination are permitted.See the Virginia Tech Graduate Catalog for additional details.'

In [27]:
len(text_chunks)

118

In [28]:
%%time

# embed all texts in batches , significant improvement over single ones.
text_chunk_embeddings = embedding_model.encode(text_chunks,
                                               batch_size =32,
                                               convert_to_tensor=True)
text_chunk_embeddings

CPU times: total: 1.59 s
Wall time: 1.34 s


tensor([[ 0.0311, -0.0545, -0.0227,  ..., -0.0468,  0.0078, -0.0749],
        [-0.0019, -0.0627,  0.0015,  ..., -0.0075, -0.0325, -0.0448],
        [-0.0178, -0.0756, -0.0206,  ..., -0.0164, -0.0394, -0.0507],
        ...,
        [ 0.0006, -0.0394,  0.0027,  ..., -0.0429,  0.0150, -0.0261],
        [ 0.0646,  0.0059, -0.0208,  ..., -0.0443,  0.0422, -0.0860],
        [-0.0044, -0.0095, -0.0424,  ..., -0.0411,  0.0141,  0.0452]],
       device='cuda:0')

### Save embeddings to file

In [29]:
#save embeddings to file 
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = "text_chunks_and_embeddings_df.csv"
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False)

In [33]:
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load.head()

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-3,1 Table of Contents 1 General Information .......,5256,305,1314.0,[ 3.10723260e-02 -5.44653423e-02 -2.26697065e-...
1,-2,2 2.9.10 Exchange (Non-Degree) Status ...........,5906,367,1476.5,[-1.92042568e-03 -6.27196282e-02 1.53697829e-...
2,-1,3 4.9 Coursework Justification ..................,5271,308,1317.75,[-1.77967027e-02 -7.55857304e-02 -2.06447337e-...
3,0,4 7 Appendix ....................................,654,37,163.5,[ 1.84279792e-02 -3.08060311e-02 -2.64991056e-...
4,1,5 1 General Information The Bradley Departme...,2008,293,502.0,[ 1.27470512e-02 -5.17672189e-02 -3.01634315e-...


## Rag- Search and Answer
- Retrieve relevant passages based on a query -> Use the passages to augment an input to an LLM -> generate output based on those relevant passages
- 

### similarity search

embeddings can be used for almost any type of data .. images,sound,text
comparing embeddings are known as similarity search, vector search, semantic search. 



In [47]:
import random
import torch
import numpy as np
import pandas as pd
device = 'cuda' if torch.cuda.is_available() else"cpu"

# import texts and embedding df
text_chunks_and_embedding_df = pd.read_csv("text_chunks_and_embeddings_df.csv")

# convert embedding column back to np.array(it got converted to string when it saved to CSV)
text_chunks_and_embedding_df["embedding"] = text_chunks_and_embedding_df["embedding"].apply(lambda x: np.fromstring(x.strip("[]"), sep=" "))

#convert embeddings into a torch.tensor

embeddings= torch.tensor(np.stack(text_chunks_and_embedding_df["embedding"].tolist(),axis = 0))
# convert texts and embedding df to list of dicts
pages_and_chunks = text_chunks_and_embedding_df.to_dict(orient="records")

text_chunks_and_embedding_df


Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-3,1 Table of Contents 1 General Information .......,5256,305,1314.00,"[0.031072326, -0.0544653423, -0.0226697065, 0...."
1,-2,2 2.9.10 Exchange (Non-Degree) Status ...........,5906,367,1476.50,"[-0.00192042568, -0.0627196282, 0.00153697829,..."
2,-1,3 4.9 Coursework Justification ..................,5271,308,1317.75,"[-0.0177967027, -0.0755857304, -0.0206447337, ..."
3,0,4 7 Appendix ....................................,654,37,163.50,"[0.0184279792, -0.0308060311, -0.0264991056, 0..."
4,1,5 1 General Information The Bradley Departme...,2008,293,502.00,"[0.0127470512, -0.0517672189, -0.0301634315, -..."
...,...,...,...,...,...,...
113,46,The winner is also usually nominated for the U...,1292,199,323.00,"[0.00505551975, -0.00881108548, 0.0178461, 0.0..."
114,46,6.3 Health Insurance Premium Compensation Virg...,492,66,123.00,"[-0.00438998686, 0.0239519812, -0.0100923814, ..."
115,47,"51 Student Services Building Blacksburg, VA 2...",867,116,216.75,"[0.000632528041, -0.0393929072, 0.00265786168,..."
116,48,52 7 Appendix 7.1 ECE Administrative and Grad...,1663,191,415.75,"[0.0646070018, 0.00591729302, -0.0207962431, 0..."


In [48]:
embeddings.shape

torch.Size([118, 768])

In [43]:
text_chunks_and_embedding_df["embedding"]

0      [0.031072326, -0.0544653423, -0.0226697065, 0....
1      [-0.00192042568, -0.0627196282, 0.00153697829,...
2      [-0.0177967027, -0.0755857304, -0.0206447337, ...
3      [0.0184279792, -0.0308060311, -0.0264991056, 0...
4      [0.0127470512, -0.0517672189, -0.0301634315, -...
                             ...                        
113    [0.00505551975, -0.00881108548, 0.0178461, 0.0...
114    [-0.00438998686, 0.0239519812, -0.0100923814, ...
115    [0.000632528041, -0.0393929072, 0.00265786168,...
116    [0.0646070018, 0.00591729302, -0.0207962431, 0...
117    [-0.00438736659, -0.00953250285, -0.0423646644...
Name: embedding, Length: 118, dtype: object

In [45]:
embeddings

array([[ 0.03107233, -0.05446534, -0.02266971, ..., -0.04675224,
         0.00777942, -0.07486996],
       [-0.00192043, -0.06271963,  0.00153698, ..., -0.00747272,
        -0.0325443 , -0.04475317],
       [-0.0177967 , -0.07558573, -0.02064473, ..., -0.01643178,
        -0.03937143, -0.05073398],
       ...,
       [ 0.00063253, -0.03939291,  0.00265786, ..., -0.04287352,
         0.01497129, -0.02610558],
       [ 0.064607  ,  0.00591729, -0.02079624, ..., -0.04434307,
         0.04218665, -0.08595697],
       [-0.00438737, -0.0095325 , -0.04236466, ..., -0.04112922,
         0.01405487,  0.04522928]], shape=(118, 768))

In [50]:
# create model
from sentence_transformers import util, SentenceTransformer

embedding_model = SentenceTransformer(model_name_or_path="all-mpnet-base-v2",
                                      device=device)

Embedding model ready. 
Creating a small semantic search pipeline.
