### Overview:

- This RAG (retrieval-augmented generation) pipeline answers space-related queries using relevant passages from an astronomy textbook
- Cosine similarity is used to measure how close the query embedding is to each content embedding

In [3]:
from os import putenv
putenv("HSA_OVERRIDE_GFX_VERSION", "11.0.0") # The line must be defined before importing torch.

import torch # OK. The HIP Runtime of PyTorch can recognize your ISA.
import torch.nn as nn

#### Import PDF Document:

In [4]:
pdf_path = 'Introduction_to_Astronomy.pdf'

In [5]:
import os
from tqdm.auto import tqdm

def text_formatter(text:str) -> str:
    ''' Performs basic text cleaning'''

    cleaned_text = text.replace('\n', ' ').strip()
    return cleaned_text

In [6]:
import fitz

def open_and_read_pdf(pdf_path:str)-> list[dict]:
    doc = fitz.open(pdf_path)
    pages_and_texts = []
    for page_number, page in tqdm(enumerate(doc)):
        text = page.get_text()
        text = text_formatter(text)
        pages_and_texts.append({'page_number': page_number-25,
                                'page_char_count': len(text), # includes spaces, special characters, and punctuation
                                'page_word_count': len(text.split(' ')), # number of space-separated words
                                'page_sentence_count_raw': len(text.split('. ')), # naive count: split on period + space
                                'page_token_count': len(text)/4, # rough estimate: about 4 characters per token
                                'text': text})
    return pages_and_texts
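
To make the per-page statistics concrete, here is a minimal sketch that computes the same fields for a short sample string. The helper name `page_stats` and the sample text are assumptions made for illustration; the 4-characters-per-token ratio is the same rough heuristic used in the function above, not an exact tokenizer count.

```python
def page_stats(text: str) -> dict:
    """Compute the same rough statistics as open_and_read_pdf for one page of text."""
    return {
        "page_char_count": len(text),                      # counts spaces and punctuation too
        "page_word_count": len(text.split(" ")),           # space-separated words
        "page_sentence_count_raw": len(text.split(". ")),  # naive split on period + space
        "page_token_count": len(text) / 4,                 # heuristic: ~4 characters per token
    }

sample = "The sun is a star. It is very hot."
stats = page_stats(sample)
print(stats)
```

Because the token count is only a character-based estimate, it is best read as an order-of-magnitude guide when checking chunks against a model's input limit.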
                                

In [7]:
pages_and_texts = open_and_read_pdf(pdf_path)


In [8]:
import random 

random.sample(pages_and_texts,k=3)

[{'page_number': 151,
  'page_char_count': 2715,
  'page_word_count': 453,
  'page_sentence_count_raw': 18,
  'page_token_count': 678.75,
  'text': '[ch. VI, 112] AN INTRODUCTION TO ASTRONOMY 152 the national observatory. For example, in the United States, the chief source of time for railroads and commercial purposes is the Naval Ob- servatory, at Georgetown Heights, Washington, D.C. There are three high-grade clocks keeping standard time at this observatory. Their errors are found from observations of the stars; and after applying cor- rections for these errors, the mean of the three clocks is taken as giving the true standard time for the successive 24 hours. At 5 minutes before noon, Eastern Time, the Western Union Telegraph Company and the Postal Telegraph Company suspend their ordinary business and throw their lines into electrical connection with the standard clock at the Naval Observatory. The connection is arranged so that the sounding key makes a stroke every second during th

#### Data Analysis:

In [9]:
import pandas as pd

df = pd.DataFrame(pages_and_texts)
df

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,text
0,-25,554,88,2,138.50,Project Gutenberg’s An Introduction to Astrono...
1,-24,660,101,7,165.00,"Produced by Brenda Lewis, Andrew D. Hwang, Bup..."
2,-23,28,4,1,7.00,AN INTRODUCTION TO ASTRONOMY
3,-22,186,32,3,46.50,THE MACMILLAN COMPANY NEW YORK · BOSTON · CHIC...
4,-21,59,9,3,14.75,"Fig. 1. — The Lick Observatory, Mount Hamilton..."
...,...,...,...,...,...,...
513,488,2331,389,16,582.75,PROJECT GUTENBERG LICENSE D form. Any alternat...
514,489,2906,475,18,726.50,PROJECT GUTENBERG LICENSE E effort to identify...
515,490,2647,401,19,661.75,PROJECT GUTENBERG LICENSE F 1.F.6. INDEMNITY -...
516,491,2160,331,19,540.00,PROJECT GUTENBERG LICENSE G For additional con...


In [10]:
df.describe()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count
count,518.0,518.0,518.0,518.0,518.0
mean,233.5,2092.694981,376.710425,30.204633,523.173745
std,149.677988,747.552305,136.722428,67.050752,186.888076
min,-25.0,28.0,4.0,1.0,7.0
25%,104.25,1769.5,314.25,16.0,442.375
50%,233.5,2333.0,428.5,20.0,583.25
75%,362.75,2704.0,472.0,23.0,676.0
max,492.0,2924.0,754.0,581.0,731.0


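- On average, a page contains about 30 sentences (by the raw `'. '` split) and about 377 words; the sentence mean is inflated by outliers (the maximum is 581, while the median is 20)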
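
The gap between the mean (about 30) and the median (20) of the raw sentence count is what a few extreme pages would produce. The sketch below shows the effect with made-up numbers; the list is hypothetical and is not taken from the PDF.

```python
from statistics import mean, median

# Hypothetical per-page sentence counts: most pages are near 20, one is an extreme outlier
sentence_counts = [18, 19, 20, 21, 22, 580]

print(mean(sentence_counts))    # pulled far above the typical page by the outlier
print(median(sentence_counts))  # stays close to the typical page
```

For skewed columns like this one, the median (the `50%` row of `describe()`) is usually the more representative summary.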


#### Splitting paragraph into sentences:

In [11]:
from spacy.lang.en import English

nlp = English()
nlp.add_pipe('sentencizer') #add a sentencizer pipeline
#spaCy's sentencizer generally splits sentences more reliably than a plain .split('. ')

<spacy.pipeline.sentencizer.Sentencizer at 0x76a4505cf600>

In [12]:
for item in pages_and_texts:
    item['sentences'] = list(nlp(item['text']).sents)

    #making sure all the sentences are in string format
    item['sentences'] = [str(sentence) for sentence in item['sentences']]
    item['page_sentence_count_spacy'] = len(item['sentences'])
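
To illustrate why a plain `'. '` split is a weak sentence counter, the sketch below compares it with a simple punctuation-aware regular expression. The regex and the sample text are assumptions chosen for illustration only; this is not how spaCy's sentencizer is implemented.

```python
import re

text = "Is Mars habitable? Probably not. It is very cold!"

# Naive approach: only a period followed by a space counts as a boundary
naive = text.split(". ")

# Punctuation-aware approach: split after '.', '!' or '?' when followed by whitespace
aware = re.split(r"(?<=[.!?])\s+", text)

print(len(naive), naive)
print(len(aware), aware)
```

The naive split misses boundaries that end in `?` or `!`, which is one reason the raw and spaCy sentence counts can disagree.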

In [13]:
random.sample(pages_and_texts, k =4)

[{'page_number': 152,
  'page_char_count': 2671,
  'page_word_count': 487,
  'page_sentence_count_raw': 23,
  'page_token_count': 667.75,
  'text': '[ch. VI, 115] TIME 153 observational work is done at night. The hours of the astronomical day are numbered up to 24, just as in the case of sidereal time. 113. Place of Change of Date.—If one should start at any point on the earth and go entirely around it westward, the number of times the sun would cross his meridian would be one less than it would have been if he had stayed at home. Since it would be very inconvenient for him to use fractional dates, he would count his day from midnight to midnight, whatever his longitude, and correct the increasing diﬀerence from the time of his starting point by arbitrarily changing his date one day forward at some point in his journey. That is, he would omit one date and day of the week from his reckoning. On the other hand, if he were going around the earth eastward, he would give two days the same d

In [14]:
df = pd.DataFrame(pages_and_texts)
df.describe()

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy
count,518.0,518.0,518.0,518.0,518.0,518.0
mean,233.5,2092.694981,376.710425,30.204633,523.173745,19.388031
std,149.677988,747.552305,136.722428,67.050752,186.888076,9.364013
min,-25.0,28.0,4.0,1.0,7.0,1.0
25%,104.25,1769.5,314.25,16.0,442.375,16.0
50%,233.5,2333.0,428.5,20.0,583.25,20.0
75%,362.75,2704.0,472.0,23.0,676.0,22.0
max,492.0,2924.0,754.0,581.0,731.0,75.0


#### Chunking our sentences together:

- Chunking groups sentences into smaller passages, so that retrieved context is specific and fits within the LLM's input token limit

In [15]:
def split_list(input_list: list, slice_size: int = 10) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]
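
As a quick check of how `split_list` behaves, the sketch below chunks 25 placeholder sentences ten at a time. The function is repeated so that the example runs on its own; the sentence strings are made up.

```python
def split_list(input_list: list, slice_size: int = 10) -> list[list[str]]:
    return [input_list[i:i+slice_size] for i in range(0, len(input_list), slice_size)]

# 25 placeholder sentences, chunked ten at a time
sentences = [f"Sentence {n}." for n in range(25)]
chunks = split_list(sentences, slice_size=10)

print([len(chunk) for chunk in chunks])  # [10, 10, 5]
```

The last chunk holds whatever is left over, so it can be much shorter than the others. Very short leftover chunks are one source of the low-token chunks that are filtered out later.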
    

In [16]:
for item in tqdm(pages_and_texts):
    item['sentence_chunks'] = split_list(item['sentences'])
    item['num_chunks'] = len(item['sentence_chunks'])
    


In [17]:
random.sample(pages_and_texts,k=2)

[{'page_number': 285,
  'page_char_count': 2672,
  'page_word_count': 460,
  'page_sentence_count_raw': 21,
  'page_token_count': 668.0,
  'text': '[ch. X, 208] AN INTRODUCTION TO ASTRONOMY 286 the earth are at present exceedingly slight, and it is very probable that their inﬂuences upon the rotations of the other members of the system are also inappreciable. A retardation in the translatory motion of a body causes its orbit to decrease in size. Hence, so far as the meteors aﬀect the planets in this way, they cause them continually to approach the sun. Another eﬀect of meteors upon the members of the solar system is to increase their masses by the accretion of matter which may have come originally from far beyond the orbit of Neptune. As the masses of the sun and planets increase, their mutual attractions increase and the orbits of the planets become smaller. Looking backward in time, we are struck by the possibility that the accretion of meteoric matter may have been more rapid in for

In [18]:
df = pd.DataFrame(pages_and_texts)
df.describe().round(2)

Unnamed: 0,page_number,page_char_count,page_word_count,page_sentence_count_raw,page_token_count,page_sentence_count_spacy,num_chunks
count,518.0,518.0,518.0,518.0,518.0,518.0,518.0
mean,233.5,2092.69,376.71,30.2,523.17,19.39,2.41
std,149.68,747.55,136.72,67.05,186.89,9.36,0.97
min,-25.0,28.0,4.0,1.0,7.0,1.0,1.0
25%,104.25,1769.5,314.25,16.0,442.38,16.0,2.0
50%,233.5,2333.0,428.5,20.0,583.25,20.0,2.0
75%,362.75,2704.0,472.0,23.0,676.0,22.0,3.0
max,492.0,2924.0,754.0,581.0,731.0,75.0,8.0


#### Splitting each chunk into its own item:

In [19]:
import re

pages_and_chunks = []
for item in tqdm(pages_and_texts):
    for sentence_chunk in item['sentence_chunks']:
        chunk_dict = {}
        chunk_dict['page_number'] = item['page_number']
        joined_sentence_chunk = ''.join(sentence_chunk).replace('  ',' ').strip()
        joined_sentence_chunk = re.sub(r'\.([A-Z])', r'. \1', joined_sentence_chunk)
        chunk_dict['sentence_chunk'] = joined_sentence_chunk
        #stats
        chunk_dict['chunk_char_count'] = len(joined_sentence_chunk)
        chunk_dict['chunk_word_count'] = len([word for word in joined_sentence_chunk.split(' ')])
        chunk_dict['chunk_token_count'] = len(joined_sentence_chunk)/4

        pages_and_chunks.append(chunk_dict)

len(pages_and_chunks)


1248
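
The join-then-regex step above can be traced on a tiny example. This is a sketch that assumes each sentence string carries no trailing space, so joining with `''` glues sentences together and the regex puts a space back after any period that is directly followed by a capital letter. The sample sentences are placeholders.

```python
import re

# Placeholder sentences without trailing whitespace
sentence_chunk = ["The sun is a star.", "It is very hot.", "The value 3.14 is unchanged."]

joined = ''.join(sentence_chunk).replace('  ', ' ').strip()
print(joined)

# Insert a space after a period that is immediately followed by a capital letter
repaired = re.sub(r'\.([A-Z])', r'. \1', joined)
print(repaired)
```

The decimal `3.14` is left alone because the character after its period is a digit, not a capital letter.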

In [20]:
random.sample(pages_and_chunks, k = 1)

[{'page_number': 264,
  'sentence_chunk': '[ch. X, 196] COMETS AND METEORS 265 are two types of these groups, and they are known as comet families. Families of the ﬁrst type are made up of comets which pursue nearly identical paths. The most celebrated family of this type is composed of the great comets of 1668, 1843, 1880, and 1882. A much smaller one seen in 1887 probably should be added to this list. Their orbits were not only nearly identical, but the comets themselves were very similar in every respect. They came to the sun from the direction of Sirius—that is, from the direction away from which the sun is moving with respect to the stars—and escaped the notice of observers in the northern hemisphere until they were near perihelion. They passed half way around the sun in a few hours at a distance of less than 200, 000 miles from its surface, moving at the enormous velocity of more than 350 miles per second. Their tails extended out in dazzling splendor 100, 000, 000 miles from the

In [21]:
df = pd.DataFrame(pages_and_chunks)
df.describe().round(2)

Unnamed: 0,page_number,chunk_char_count,chunk_word_count,chunk_token_count
count,1248.0,1248.0,1248.0,1248.0
mean,217.72,866.48,154.82,216.62
std,148.95,467.26,79.62,116.82
min,-25.0,3.0,1.0,0.75
25%,87.0,454.75,93.0,113.69
50%,213.0,915.0,161.0,228.75
75%,348.0,1218.25,216.0,304.56
max,492.0,2248.0,390.0,562.0


In [22]:
min_token_length = 30
for row in df[df['chunk_token_count']<min_token_length].sample(5).iterrows():
    print(f'chunk: {row[1]["chunk_token_count"]} | text: {row[1]["sentence_chunk"]}')

chunk: 7.0 | text: AN INTRODUCTION TO ASTRONOMY
chunk: 1.75 | text: Another
chunk: 1.0 | text: 181.
chunk: 2.0 | text: 376, 378
chunk: 5.5 | text: However, the reduction


In [23]:
pages_and_chunks_over_min_token_len = df[df['chunk_token_count'] > min_token_length].to_dict(orient='records')
len(pages_and_chunks_over_min_token_len)

1192

#### Embedding our text chunks:

In [1]:
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('all-mpnet-base-v2')




In [2]:
embedding_model.to('cuda')

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [26]:
embedding = embedding_model.encode('My main aim of my life is to master the mindfulness')
embedding.shape

(768,)

In [27]:
for item in tqdm(pages_and_chunks_over_min_token_len):
    item['embedding'] = embedding_model.encode(item['sentence_chunk'],
                                              batch_size = 32,
                                              convert_to_tensor = True)


#### Save embeddings to file:

In [28]:
pages_and_chunks_over_min_token_len[419]

{'page_number': 139,
 'sentence_chunk': '[ch. V, 101] AN INTRODUCTION TO ASTRONOMY 140 moving in nearly the opposite direction. The history of Sirius during the last two centuries is very interesting, and furnishes a good illustration of the value of the deductive method in making discoveries. First, Halley found, in 1718, that Sirius has a mo- tion with respect to ﬁxed reference points and lines; then, a little more than a century later, Bessel found that this motion is slightly variable. He inferred from this, on the basis of the laws of motion, that Sirius and an unseen companion were traveling around their common cen- ter of gravity which was moving with uniform speed in a straight line. This companion actually was discovered by Alvan G. Clark, in 1862, while adjusting the 18-inch telescope now of the Dearborn Observa- tory, at Evanston, Ill. The distance of the two stars from each other is 1, 800, 000, 000 miles, and they complete a revolution in 48.8 years. The combined mass of t

In [29]:
#save embeddings to file
text_chunks_and_embeddings_df = pd.DataFrame(pages_and_chunks_over_min_token_len)
embeddings_df_save_path = 'text_chunks_and_embeddings_df.csv'
text_chunks_and_embeddings_df.to_csv(embeddings_df_save_path, index=False,  escapechar='\\')
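
A CSV cell can only hold text, so each embedding is written as a string and has to be parsed again after loading. The sketch below shows that round trip with the standard-library `csv` module (a swap-in for pandas so the example has no dependencies) and stores the vector as JSON, which is one possible way to give the cell a well-defined format. The record contents are made up.

```python
import csv
import io
import json

record = {"sentence_chunk": "The sun is a star.", "embedding": [0.25, -0.5, 0.125]}

# Write: store the embedding as a JSON string so the cell has a well-defined format
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["sentence_chunk", "embedding"])
writer.writeheader()
writer.writerow({"sentence_chunk": record["sentence_chunk"],
                 "embedding": json.dumps(record["embedding"])})

# Read: every CSV field comes back as a string and has to be decoded explicitly
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
restored = json.loads(loaded["embedding"])

print(type(loaded["embedding"]).__name__)
print(restored)
```

Binary formats that preserve arrays directly would avoid the parsing step altogether; CSV is convenient mainly because it is easy to inspect.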

In [3]:
import pandas as pd
embeddings_df_save_path = 'text_chunks_and_embeddings_df.csv'
#Import saved file and view
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load

Unnamed: 0,page_number,sentence_chunk,chunk_char_count,chunk_word_count,chunk_token_count,embedding
0,-25,Project Gutenberg’s An Introduction to Astrono...,554,88,138.50,"tensor([ 6.1519e-02, -3.7411e-02, 1.8281e-02,..."
1,-24,"Produced by Brenda Lewis, Andrew D. Hwang, Bup...",659,100,164.75,"tensor([ 8.5111e-03, 1.9476e-02, 6.7720e-03,..."
2,-22,THE MACMILLAN COMPANY NEW YORK · BOSTON · CHIC...,185,31,46.25,"tensor([ 3.3732e-02, -1.8354e-02, -1.9186e-02,..."
3,-20,AN INTRODUCTION TO ASTRONOMY BY FOREST RAY MOU...,251,39,62.75,"tensor([ 3.9364e-02, -5.1087e-02, 9.7266e-03,..."
4,-19,"Copyright, 1906 and 1916, By THE MACMILLAN COM...",330,52,82.50,"tensor([-5.2572e-02, -3.6525e-02, 1.5003e-04,..."
...,...,...,...,...,...,...
1187,490,PROJECT GUTENBERG LICENSE F 1. F.6. INDEMNITY ...,1977,292,494.25,"tensor([ 7.7008e-03, 8.3735e-02, 1.2195e-02,..."
1188,490,The Foundation’s EIN or federal tax identifica...,670,110,167.50,"tensor([ 1.3882e-02, 1.4830e-01, 7.6977e-03,..."
1189,491,PROJECT GUTENBERG LICENSE G For additional con...,1686,253,421.50,"tensor([ 1.0598e-02, 1.2781e-01, 2.0647e-02,..."
1190,491,Donations are accepted in a number of other wa...,474,79,118.50,"tensor([ 1.8604e-02, 7.0090e-02, -9.8827e-03,..."


- With more than roughly 100k embeddings, exhaustive search becomes slow and a vector database is usually a better fit; it uses approximate nearest neighbor (ANN) search to find the closest embeddings quickly
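
For comparison, exact search scores the query against every stored vector, so its cost grows linearly with the number of embeddings. The sketch below is a dependency-free illustration of that brute-force approach; the function names and the toy two-dimensional vectors are assumptions, not part of the notebook.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_top_k(query, embeddings, k=2):
    """Brute-force search: score the query against every stored vector, keep the k best."""
    scored = ((dot(query, emb), idx) for idx, emb in enumerate(embeddings))
    return heapq.nlargest(k, scored)

# Toy unit-length vectors standing in for real embeddings
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [-1.0, 0.0]]
query = [0.8, 0.6]

top = exact_top_k(query, embeddings, k=2)
for score, idx in top:
    print(idx, round(score, 2))
```

ANN indexes avoid scanning every vector by accepting a small chance of missing the true nearest neighbor. For the roughly 1,200 chunks in this notebook, exact search is fast enough.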

#### RAG - search and answer:
- We want to retrieve relevant passages for a query and use them to augment the input to an LLM, so that it can generate an answer grounded in those passages

In [4]:
#semantic search
import random
text_chunks_and_embedding_df_load = pd.read_csv(embeddings_df_save_path)
text_chunks_and_embedding_df_load



In [5]:

import ast
import numpy as np

def normalize(embedding):
    # Scale to unit length so that the inner product equals cosine similarity
    norm = np.linalg.norm(embedding)
    return embedding / norm if norm > 0 else embedding


def parse_and_normalize_embedding(embedding_str):
    # The CSV stores each tensor as a string; FAISS needs NumPy float32 arrays
    cleaned_str = embedding_str.replace('tensor(', '').replace(', device=\'cuda:0\')', '').replace('\n', '')
    # ast.literal_eval only parses literals, so it is safer than eval on file contents
    embedding = np.array(ast.literal_eval(cleaned_str), dtype=np.float32)
    return normalize(embedding)
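
The string clean-up above depends on the exact `device='cuda:0'` suffix. As a possible alternative, the sketch below pulls out the bracketed list with a regular expression, so it does not matter whether a device suffix is present. The helper name `parse_tensor_string` is hypothetical and the sample string is shortened to three values; it returns a plain list that would still need converting to a float32 NumPy array and normalizing.

```python
import ast
import re

def parse_tensor_string(embedding_str: str) -> list[float]:
    """Extract the bracketed list of numbers from a printed tensor and parse it safely."""
    match = re.search(r"\[.*\]", embedding_str, flags=re.DOTALL)
    if match is None:
        raise ValueError("no bracketed list found in embedding string")
    return [float(x) for x in ast.literal_eval(match.group(0))]

sample = "tensor([ 6.1519e-02, -3.7411e-02,\n         1.8281e-02], device='cuda:0')"
values = parse_tensor_string(sample)
print(values)
```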


In [6]:
import pandas as pd
from datasets import Dataset, DatasetDict
import numpy as np
import faiss

hf_dataset = Dataset.from_pandas(text_chunks_and_embedding_df_load)
hf_dataset = hf_dataset.map(lambda x: {'embedding': parse_and_normalize_embedding(x['embedding'])})



In [7]:
first_embedding = np.array(hf_dataset[0]['embedding'])

index = faiss.IndexFlatIP(first_embedding.shape[0]) #creates index to search based on the inner product between vectors

hf_dataset.add_faiss_index(column='embedding', custom_index=index) #custom_index makes the search use inner product; without it the default index uses L2 distance


Dataset({
    features: ['page_number', 'sentence_chunk', 'chunk_char_count', 'chunk_word_count', 'chunk_token_count', 'embedding'],
    num_rows: 1192
})

In [8]:
query = 'what is temperature of earth?'
query_embedding = embedding_model.encode(query, convert_to_tensor=True) 
query_embedding = query_embedding.cpu().numpy()  # Convert to NumPy array
query_embedding = normalize(query_embedding) 

In [9]:
scores, neighbors = hf_dataset.get_nearest_examples('embedding', query_embedding, k=25)

In [30]:
neighbors.keys()

dict_keys(['page_number', 'sentence_chunk', 'chunk_char_count', 'chunk_word_count', 'chunk_token_count', 'embedding'])

In [10]:
for i in range(len(scores)):
    print(f"Neighbor {i+1}:")
    print(f"Score: {scores[i]}")
    print(f"Text Chunk: {neighbors['sentence_chunk'][i]}")
    print(f"Page Number: {neighbors['page_number'][i]}")
    print("-----------")

Neighbor 1:
Score: 0.5766785144805908
Text Chunk: [ch. IX, 172] AN INTRODUCTION TO ASTRONOMY 234 the Fahrenheit scale the mean annual surface temperature of the whole earth is about 60◦, or 28◦above freezing. The absolute zero on the Fahrenheit scale is 491◦below freezing. Therefore, the mean tempera- ture of the earth on the Fahrenheit scale counted from the absolute zero is about 491◦+ 28◦= 519◦. Let x represent the absolute temperature of Mars; then, under the assumption that its surface is like that of the earth, the proportion becomes x : 519 = 4 √ 0.43 : 4 √ 1, from which it is found that x = 420◦. That is, under these hypotheses, the mean surface temperature of Mars comes out 491◦−420◦= 71◦ below freezing, or 71◦−32◦= 39◦below zero Fahrenheit. The results just obtained can lay no claim to any particular degree of accuracy because of the uncertain hypotheses on which they rest. But it does not seem that the hypothesis that the surfaces of Mars and the earth radiate similarly can 

- **Note**: To retrieve relevant documents we use cosine similarity, which depends only on the direction of the vectors and not on their magnitude. Both vectors are therefore normalized to unit length, so their dot product (the inner product computed by `IndexFlatIP`) equals their cosine similarity
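
The relationship in the note can be checked numerically. The sketch below uses plain Python lists rather than NumPy so that it runs without dependencies; the vectors are toy values chosen to make the arithmetic easy to follow.

```python
import math

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]
b = [6.0, 8.0]   # same direction as a, twice the magnitude
c = [4.0, -3.0]  # perpendicular to a

print(dot(normalize(a), normalize(b)))  # same direction
print(dot(normalize(a), normalize(c)))  # perpendicular
print(cosine_similarity(a, b))          # computed from the definition
```

Doubling the length of `b` does not change its score against `a`, which is the property that makes cosine similarity suitable for comparing embeddings.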

#### Functionizing our semantic search pipeline:

In [13]:
def print_top_results_and_scores(query,
                                 hf_dataset,
                                 n_resources_to_return=5):
    # Step 1: Create the query embedding
    query_embedding = embedding_model.encode(query, convert_to_tensor=True)
    query_embedding = query_embedding.cpu().numpy()
    query_embedding = normalize(query_embedding)

    # Step 2: Perform FAISS search, returning scores and neighbors
    scores, neighbors = hf_dataset.get_nearest_examples('embedding', query_embedding, k=n_resources_to_return)

    # Step 3: Print the top results with their scores, text chunks, and page numbers
    for i in range(len(scores)):
        print(f"Neighbor {i+1}:")
        print(f"Score: {scores[i]}")
        print(f"Text Chunk: {neighbors['sentence_chunk'][i]}")
        print(f"Page Number: {neighbors['page_number'][i]}")
        print("-----------")

    # Return scores and neighbors
    return scores, neighbors

# Example usage
query = 'How far is the moon?'
scores, neighbors = print_top_results_and_scores(query, hf_dataset, n_resources_to_return=5)


Neighbor 1:
Score: 0.6406916975975037
Text Chunk: [ch. VII, 124] AN INTRODUCTION TO ASTRONOMY 164 Fig.72. —Measuring the distance to the moon.included angle are known, and the distance EM can be computed. In general, the relations and observations will not be so simple as those assumed here, but in no case are serious mathematical or observational diﬃculties encountered. It is to be noted that the result obtained is not guesswork, but that it is based on measurements, and that it is in reality given by measurements in the same sense that a distance on the surface of the earth may be obtained by measurement. The percentage of error in the determination of the moon’s distance is actually much less than that in most of the ordinary distances on the surface of the earth. The mean distance from the center of the earth to the center of the moon has been found to be 238, 862 miles, and the circumference of its orbit is therefore 1, 500, 818 miles. On dividing the circumference by the moon’s s

#### Getting LLM:

In [16]:
# pip install accelerate
import os
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Hugging Face access token for the gated repo, read from the environment
# instead of being hard-coded (never commit a real token to a notebook)
token = os.environ.get('HF_TOKEN')

# Load the tokenizer and model with token authentication
# (`token` replaces the deprecated `use_auth_token` argument in recent transformers releases)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it", token=token)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",
    token=token,  # Pass the token to ensure access to the gated model
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Generate text
input_text = "how much does sun weight."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=50)

# Decode and print the output
print(tokenizer.decode(outputs[0]))





<bos>how much does sun weight.

The Sun is a star, and stars don't have weight in the traditional sense. 

Here's why:

* **Mass vs. Weight:**  Weight is the force of gravity acting on an object's mass.  


In [22]:
input_text = 'what micronutrients are required for body'

dialogue_template = [ { 'role':'user', 'content':input_text}]
prompt = tokenizer.apply_chat_template(conversation = dialogue_template, 
                                       tokenize = False,
                                       add_generation_prompt = True)
print(prompt)

<bos><start_of_turn>user
what micronutrients are required for body<end_of_turn>
<start_of_turn>model



- `apply_chat_template` wraps the message in the turn markers the instruction-tuned model was trained on, which it needs in order to respond conversationally
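
To show what the template adds, the sketch below rebuilds the printed prompt by hand. It is an imitation of the output shown above for a single user turn, written only for illustration; the function name is hypothetical, and the real template is stored with the tokenizer and handles more cases than this.

```python
def format_gemma_style_prompt(messages: list[dict]) -> str:
    """Hand-rolled imitation of the template output shown above (illustration only)."""
    prompt = "<bos>"
    for message in messages:
        prompt += f"<start_of_turn>{message['role']}\n{message['content']}<end_of_turn>\n"
    # Equivalent of add_generation_prompt=True: open the model's turn
    prompt += "<start_of_turn>model\n"
    return prompt

dialogue = [{"role": "user", "content": "what micronutrients are required for body"}]
prompt_text = format_gemma_style_prompt(dialogue)
print(prompt_text)
```

The trailing `<start_of_turn>model` line is what signals the model to begin its reply rather than continue the user's text.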

In [23]:
inputs = tokenizer(prompt, return_tensors = 'pt').to('cuda')

In [24]:
outputs = model.generate(**inputs, max_new_tokens = 256)
outputs

tensor([[     2,      2,    106,   1645,    108,   5049,  92800, 184592,    708,
           3690,    604,   2971,    107,    108,    106,   2516,    108,   4858,
         235303, 235256,    476,  25497,    576,    573,   8727,  92800, 184592,
            861,   2971,   4026, 235269,   3731,    675,   1024,  16065,    578,
           8269, 235292,    109,    688,  34212,  89092,    688,    109, 235287,
           5231,  62651,    586,  66058,    108,    141, 235287,   5231,  11071,
          66058,  23852, 235269,  24091,   1411, 235269,   3027,   5115, 235269,
          31152, 235269,   5239,   2962,    108,    141, 235287,   5231,  17803,
          66058,    139,  28266,  25741, 235269,  54134, 235269,  65757, 235269,
          63602, 235269,  19967,    108, 235287,   5231,  62651,    599,  25280,
          66058,    108,    141, 235287,   5231, 235305, 235274,    591,  18227,
          20724,   1245,    688,  10367,   4584, 235269,  25606,   1411,    108,
            141, 235287,   5

In [25]:
outputs_decoded = tokenizer.decode(outputs[0])
print(outputs_decoded) 

<bos><bos><start_of_turn>user
what micronutrients are required for body<end_of_turn>
<start_of_turn>model
Here's a breakdown of the essential micronutrients your body needs, along with their roles and sources:

**Vitamins**

* **Vitamin A:**
    * **Role:** Vision, immune function, cell growth, reproduction, skin health
    * **Sources:**  Sweet potatoes, carrots, spinach, kale, liver
* **Vitamin B Complex:**
    * **B1 (Thiamine):** Energy production, nerve function
    * **B2 (Riboflavin):** Energy production, cell growth
    * **B3 (Niacin):** Energy production, DNA repair
    * **B5 (Pantothenic Acid):** Energy production, hormone production
    * **B6 (Pyridoxine):** Brain function, red blood cell production
    * **B7 (Biotin):** Hair, skin, and nail health, metabolism
    * **B9 (Folate):** Cell division, DNA synthesis, red blood cell production
    * **B12 (Cobalamin):** Nerve function, red blood cell production
    * **Sources:**  Whole grains, legumes, leafy green vegetables,

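- The conversational opening ("Here's a breakdown of the essential...") appears because the prompt was formatted with the chat template
- The doubled `<bos>` in the decoded output arises because the templated prompt already starts with `<bos>` and the tokenizer adds another one by default; passing `add_special_tokens=False` when tokenizing the templated prompt avoids the duplicate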

#### Augmenting our prompt with context items:


In [109]:
def get_top_results_and_scores(query,
                                 hf_dataset,
                                 n_resources_to_return=10):
    # Step 1: Create the query embedding
    query_embedding = embedding_model.encode(query, convert_to_tensor=True)
    query_embedding = query_embedding.cpu().numpy()
    query_embedding = normalize(query_embedding)

    # Step 2: Perform FAISS search, returning scores and neighbors
    scores, neighbors = hf_dataset.get_nearest_examples('embedding', query_embedding, k=n_resources_to_return)


    # Return scores and neighbors
    return scores, neighbors

# Example usage
query = 'How far is the moon?'
scores, neighbors = get_top_results_and_scores(query, hf_dataset, n_resources_to_return=5)


In [103]:
def prompt_formatter(query:str, context_items:dict)->str:

    context = "- "+"\n- ".join([sentence_chunk for sentence_chunk in context_items['sentence_chunk']])
    base_prompt = '''Based on the following context items, please answer the query.
Give yourself room to think by extracting relevant passages from the context before answering the query.
Don't return the thinking, only return the answer.
Make sure your answers are as explanatory as possible.
Use the following examples as reference for the ideal answer style.
\nExample 1:
Query: What are the fat-soluble vitamins?
Answer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body's fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.
\nExample 2:
Query: What are the causes of type 2 diabetes?
Answer: Type 2 diabetes is often associated with overnutrition, particularly the overconsumption of calories leading to obesity. Factors include a diet high in refined sugars and saturated fats, which can lead to insulin resistance, a condition where the body's cells do not respond effectively to insulin. Over time, the pancreas cannot produce enough insulin to manage blood sugar levels, resulting in type 2 diabetes. Additionally, excessive caloric intake without sufficient physical activity exacerbates the risk by promoting weight gain and fat accumulation, particularly around the abdomen, further contributing to insulin resistance.
\nExample 3:
Query: What is the importance of hydration for physical performance?
Answer: Hydration is crucial for physical performance because water plays key roles in maintaining blood volume, regulating body temperature, and ensuring the transport of nutrients and oxygen to cells. Adequate hydration is essential for optimal muscle function, endurance, and recovery. Dehydration can lead to decreased performance, fatigue, and increased risk of heat-related illnesses, such as heat stroke. Drinking sufficient water before, during, and after exercise helps ensure peak physical performance and recovery.
\nNow use the following context items to answer the user query:
{context}
\nRelevant passages: <extract relevant passages from the context here>
User query: {query}
Answer:'''
    base_prompt = base_prompt.format(context=context,query=query)

    dialogue_template = [{'role':'user','content':base_prompt}]
    prompt = tokenizer.apply_chat_template(conversation = dialogue_template,
                                           tokenize=False,
                                           add_generation_prompt = True)
    return prompt

query = 'What is sun made up of'
scores, neighbors = get_top_results_and_scores(query, hf_dataset, n_resources_to_return=5)
prompt = prompt_formatter(query, neighbors)
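
The context string is built by turning the retrieved chunks into a bulleted list and substituting it into the template. A minimal sketch of just that step is shown below, with placeholder passages and a shortened, hypothetical template standing in for `base_prompt`.

```python
# Placeholder retrieval result with the same shape as `neighbors` (a dict of lists)
context_items = {"sentence_chunk": ["First retrieved passage.", "Second retrieved passage."]}

# One "- " bullet per retrieved chunk
context = "- " + "\n- ".join(context_items["sentence_chunk"])

# Shortened stand-in for base_prompt
template = "Context items:\n{context}\nUser query: {query}\nAnswer:"
filled = template.format(context=context, query="What is the sun made of?")
print(filled)
```

Because `str.format` treats braces in the template as placeholders, any literal `{` or `}` in the template itself would need to be doubled; braces inside the substituted values are inserted unchanged.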

In [104]:
prompt

'<bos><start_of_turn>user\nBased on the following context items, please answer the query.\nGive yourself room to think by extracting relevant passages from the context before answering the query.\nDon\'t return the thinking, only return the answer.\nMake sure your answers are as explanatory as possible.\nUse the following examples as reference for the ideal answer style.\n\nExample 1:\nQuery: What are the fat-soluble vitamins?\nAnswer: The fat-soluble vitamins include Vitamin A, Vitamin D, Vitamin E, and Vitamin K. These vitamins are absorbed along with fats in the diet and can be stored in the body\'s fatty tissue and liver for later use. Vitamin A is important for vision, immune function, and skin health. Vitamin D plays a critical role in calcium absorption and bone health. Vitamin E acts as an antioxidant, protecting cells from damage. Vitamin K is essential for blood clotting and bone metabolism.\n\nExample 2:\nQuery: What are the causes of type 2 diabetes?\nAnswer: Type 2 diabete

In [105]:
input_ids = tokenizer(prompt, return_tensors = 'pt').to('cuda')
outputs = model.generate(**input_ids,
                         temperature = 0.7, #the lower the temperature the more deterministic the text,
                         do_sample = True,
                         max_new_tokens = 400)

output_text = tokenizer.decode(outputs[0])
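
The effect of `temperature` can be seen in isolation with a temperature-scaled softmax. This is a sketch of the standard formulation with made-up logits for three candidate tokens, not the library's own code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (1.5, 0.7, 0.1):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature drops, almost all of the probability mass moves to the highest-scoring token, so sampling becomes close to deterministic.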

In [106]:
print(output_text.replace(prompt,' '))

<bos> The sun is composed of a complex layered structure. The outermost layer, the photosphere, is the visible surface of the sun. It appears sharply defined and is responsible for its light emission. The photosphere is a turbulent layer and it is likely broken in outline due to the violent vertical motions within the sun. Above the photosphere lies the reversing layer, a 500-1000 mile thick layer of gas in which many terrestrial substances like calcium and iron exist in a vaporous state. The chromosphere, a layer of gas 5,000-10,000 miles deep, can be seen during a total solar eclipse, appearing as a brilliant red fringe with leaping flames on its outer surface. 


The sun also contains several other constituents that have been identified through studies of its spectrum. These include elements like calcium, iron, and carbon, as well as many more. The presence of these elements is inferred from the absorption lines observed in the sun's spectrum. 
<end_of_turn>


#### Functionize our LLM answering feature:

In [123]:
def ask(query:str,
        temperature: float = 0.7,
        max_new_tokens:int = 256,
        format_answer_text: bool=True,
        return_answer_only:bool =True):
    # Note: format_answer_text and return_answer_only are accepted but not used yet

    #RETRIEVAL
    #Get the scores and passages of the top related results
    scores, neighbors = get_top_results_and_scores(query, hf_dataset, n_resources_to_return=10)

    #AUGMENTATION
    #Create the prompt and format it with the context items
    prompt = prompt_formatter(query, neighbors)

    #GENERATION
    #Tokenize the prompt
    input_ids = tokenizer(prompt, return_tensors = 'pt').to('cuda')
    outputs = model.generate(**input_ids,
                         temperature = temperature, #the lower the temperature, the more deterministic the text
                         do_sample = True,
                         max_new_tokens = max_new_tokens)

    #Decode the tokens and drop the prompt so that only the generated answer remains
    output_text = tokenizer.decode(outputs[0])
    output_text = output_text.replace(prompt,' ')

    return output_text
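
The decoded text still contains special-token strings such as `<bos>` and `<end_of_turn>`. One way to tidy the answer is a small post-processing helper like the sketch below; the name `clean_answer` and the sample string are assumptions for illustration.

```python
def clean_answer(output_text: str, special_tokens=("<bos>", "<end_of_turn>")) -> str:
    """Remove the listed special-token strings and surrounding whitespace."""
    for token in special_tokens:
        output_text = output_text.replace(token, "")
    return output_text.strip()

raw = "<bos> The sun is composed of many elements. \n<end_of_turn>"
cleaned = clean_answer(raw)
print(cleaned)
```

Decoding with `skip_special_tokens=True` is another option, although the prompt text would then also need to be decoded the same way before it can be stripped from the output.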

In [124]:
ask(query= 'what is sun madeup of?')

'<bos> The sun is composed of many elements, with the photosphere (the visible surface) containing a majority of the elements.  The reversing layer is a sheet of gas with many terrestrial elements like calcium and iron in a vaporous state. The chromosphere, located above the reversing layer, is a layer of gas that gets its scarlet color from the incandescent hydrogen and calcium it contains.  Heavier elements like iron and calcium are found in the reversing layer, while lighter elements like hydrogen and helium are found in the chromosphere. \n<end_of_turn>'