## RAG Concepts Version 3

Improvements:

1. Understanding embedding models, and using Hugging Face embedding models instead of Ollama embeddings
2. Using `JsonOutputParser()`
3. More ways of writing better prompts
4. Creating JSON files from the provided PDF by semantic chunking
5. Using the sentence-transformers library
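
Item 1 rests on the idea that an embedding model maps text to a vector, and that semantically similar texts end up with vectors pointing in similar directions. A minimal sketch of that comparison in plain Python follows. The three-dimensional vectors are made up for illustration; the model used in this notebook produces much longer ones.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings (hypothetical values, not model output).
vec_sound = [0.9, 0.1, 0.0]
vec_acoustics = [0.8, 0.2, 0.1]
vec_cooking = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(vec_sound, vec_acoustics)
sim_unrelated = cosine_similarity(vec_sound, vec_cooking)
print(sim_related > sim_unrelated)
```

Semantic search ranks stored chunks by a score of this kind against the embedded query.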

In [17]:
from langchain.chat_models import init_chat_model
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import JsonOutputParser, StrOutputParser, PydanticOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field, validator
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import OllamaLLM

In [2]:
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from sentence_transformers import SentenceTransformer
import json 

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
filepath = 'Datafiles/Publication606.pdf'
threshold_type = 'gradient'
save_as_name = 'BoseEins'

def create_tfiles(embeddings, filepath, threshold_type, save_as_name):
    """Split a PDF into semantic chunks and save them as <save_as_name>.json."""
    loader = PyMuPDFLoader(filepath)
    docs = loader.load()
    # Split on embedding-based breakpoints rather than on fixed character counts.
    text_splitter = SemanticChunker(embeddings=embeddings, breakpoint_threshold_type=threshold_type)
    documents = text_splitter.create_documents([d.page_content for d in docs])
    # Key each chunk by its 1-based index (these are chunks, not PDF pages).
    jdata = {i: doc.page_content for i, doc in enumerate(documents, start=1)}
    # Write the file once, after all chunks have been collected.
    with open(save_as_name + '.json', 'w') as f:
        json.dump(jdata, f)
    print("The "+" "+save_as_name+".json file has been created.")
    return jdata
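
The idea behind semantic chunking is to cut the text where neighbouring sentences stop being similar. The sketch below illustrates that idea with a plain percentile cut-off on precomputed distances. It is a simplification for intuition only: it is not LangChain's `SemanticChunker` implementation and does not reproduce the `'gradient'` option passed above. The function name and the distance values are hypothetical.

```python
def split_at_breakpoints(sentences, distances, percentile=80):
    """Group sentences into chunks, cutting where the distance to the next one is unusually large.

    distances[i] is the (hypothetical) embedding distance between
    sentences[i] and sentences[i + 1].
    """
    if len(sentences) < 2:
        return list(sentences)
    if len(distances) != len(sentences) - 1:
        raise ValueError("need one distance per adjacent sentence pair")
    # Nearest-rank percentile: distances strictly above this value become cut points.
    ranked = sorted(distances)
    rank = max(0, min(len(ranked) - 1, round(percentile / 100 * len(ranked)) - 1))
    threshold = ranked[rank]
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

# Made-up sentences and distances: one large jump between the second and third sentence.
sentences = ["A1.", "A2.", "B1.", "B2.", "B3."]
distances = [0.1, 0.8, 0.15, 0.2]
chunks = split_at_breakpoints(sentences, distances)
print(chunks)
```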

In [3]:
jdat=create_tfiles(embeddings , filepath , threshold_type , save_as_name)

The  BoseEins.json file has been created.
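
`create_tfiles` returns a dictionary keyed by integers, which is why `jdat[7]` works in the cells below. JSON object keys are always strings, though, so reading `BoseEins.json` back from disk yields keys such as `"7"`. A small sketch of restoring the integer keys on load, using an in-memory round trip so that it runs without the file (the helper name `load_chunks` is hypothetical):

```python
import json

def load_chunks(json_text):
    """Parse chunk JSON and convert its string keys back to integers."""
    raw = json.loads(json_text)
    return {int(key): text for key, text in raw.items()}

# Simulate what json.dump writes for an integer-keyed dictionary.
saved = json.dumps({1: "first chunk", 2: "second chunk"})
print(saved)  # the keys are serialized as strings

chunks = load_chunks(saved)
print(chunks[2])
```

With the real file, the same conversion would apply to the result of `json.load` on `BoseEins.json`.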


In [22]:
#### Ref prompt template : https://www.youtube.com/watch?v=hztWQcoUbt0&list=LL&index=3


label_template  = """  
                    You are an AI assistant tasked with generating question-answer pairs based on the document.
                    The question should be something the user might naturally ask when seeking information contained in the document.

                    Given:{chunk}

                    Instructions:
                    1. Analyse the key topics, facts and concepts in the given document, and choose one to focus on.
                    2. Generate 2 similar questions that users might ask to find the information in this document.
                    3. Use natural language and occasionally include typos to mimic real user behaviour in the question.
                    4. Ensure the question is semantically related to the document content WITHOUT directly copying phrases.
                    5. Make sure that all of the questions are similar to each other, i.e. all asking about the same topic or requesting the same information.

                    Output Format:
                    Return a JSON object with the following structure:
                    '''json 
                    {{
                     "question_1" : "Generated question text" ,
                     "question_2" : "Generated question text" ,
                     ...
                    }} ''' 
                    
                    Be creative, think like a curious user, and generate your 2 similar questions that would naturally lead to the given document in a semantic search.
                    Ensure your response is a valid JSON object containing only the questions.
                    """

label_prompt = ChatPromptTemplate.from_template(label_template)
model = OllamaLLM(model='qwen2.5:1.5b' , temperature=0)
label_chain = label_prompt | model | JsonOutputParser()                

In [23]:
label_chain.invoke({"chunk" : jdat[7]})

{'question_1': 'What are the critical temperatures observed in the strongest and weakest traps?',
 'question_2': "How does the condensate's length affect its propagation speed, and what was the initial condition for this experiment?"}

In [24]:
label_chain.invoke({"chunk" : jdat[6]})

{'question_1': 'How was the imaging of Bose-Einstein condensates achieved?',
 'question_2': 'What techniques were used to generate localized density perturbations in a Bose-Einstein condensate?'}
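
The chain ends in `JsonOutputParser()`, which is why the replies above come back as Python dictionaries rather than raw text. The following is a rough standard-library approximation of that step, not LangChain's implementation: it strips an optional Markdown code fence and parses what remains with `json.loads`. The function name and the sample reply are hypothetical.

```python
import json
import re

FENCE = "`" * 3  # three backticks, the usual Markdown code-fence marker

def parse_json_reply(reply):
    """Extract a JSON object from a model reply, with or without a code fence."""
    pattern = FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE
    match = re.search(pattern, reply, flags=re.DOTALL)
    payload = match.group(1) if match else reply.strip()
    return json.loads(payload)

# A made-up reply in the shape the prompt asks for.
reply = (
    "Here are the questions:\n"
    + FENCE + "json\n"
    + '{"question_1": "How were condensates imaged?", '
    + '"question_2": "What imaging method was used?"}\n'
    + FENCE
)
questions = parse_json_reply(reply)
print(questions["question_1"])
```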

In [29]:
datajson = {}
for i in range(len(jdat)):
    datajson["Chunk "+ str(i+1)] = label_chain.invoke({"chunk" : jdat[i+1]})

print(datajson)

{'Chunk 1': {'question_1': 'What are some key findings regarding sound propagation in Bose-Einstein condensates?', 'question_2': 'How does the concept of Bose-Einstein condensation affect our understanding of sound waves and their behavior within this state?'}, 'Chunk 2': {'question_1': 'What are the names of the individuals mentioned in this document?', 'question_2': 'Who are R. Andrews, D. M. Kurn, H.-J. Miesner, and D. S. Durfee?'}, 'Chunk 3': {'question_1': 'How was sound propagation observed in a magnetically trapped dilute Bose-Einstein condensate, and what method was used to determine the speed of sound?', 'question_2': 'What is the microscopic picture developed for quantum liquids based on elementary excitations and quantum hydrodynamics, and how does it apply to Bose-Einstein condensed gases?'}, 'Chunk 4': {'question_1': 'How do the frequencies of the lowest collective excitations depend on c0yd, and what does this imply for previous experiments involving these excitations?', 

In [30]:
datajson

{'Chunk 1': {'question_1': 'What are some key findings regarding sound propagation in Bose-Einstein condensates?',
  'question_2': 'How does the concept of Bose-Einstein condensation affect our understanding of sound waves and their behavior within this state?'},
 'Chunk 2': {'question_1': 'What are the names of the individuals mentioned in this document?',
  'question_2': 'Who are R. Andrews, D. M. Kurn, H.-J. Miesner, and D. S. Durfee?'},
 'Chunk 3': {'question_1': 'How was sound propagation observed in a magnetically trapped dilute Bose-Einstein condensate, and what method was used to determine the speed of sound?',
  'question_2': 'What is the microscopic picture developed for quantum liquids based on elementary excitations and quantum hydrodynamics, and how does it apply to Bose-Einstein condensed gases?'},
 'Chunk 4': {'question_1': 'How do the frequencies of the lowest collective excitations depend on c0yd, and what does this imply for previous experiments involving these excita
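
A small model will not always return valid JSON, and a single parser error would stop the labelling loop partway through the chunks. One possible guard is sketched below with a stand-in `invoke` callable so that it runs without Ollama; with the real chain you would pass `label_chain.invoke`. The helper `label_chunks` and the fake labeller are hypothetical.

```python
def label_chunks(chunks, invoke):
    """Run invoke on every chunk; collect results and the ids of chunks that failed."""
    results, failed = {}, []
    for chunk_id, text in chunks.items():
        try:
            results[f"Chunk {chunk_id}"] = invoke({"chunk": text})
        except Exception as err:  # broad on purpose: parser and connection errors alike
            failed.append(chunk_id)
            print(f"Chunk {chunk_id} skipped: {err}")
    return results, failed

def fake_invoke(inputs):
    """Stand-in for label_chain.invoke that fails on empty text."""
    if not inputs["chunk"].strip():
        raise ValueError("empty chunk")
    return {"question_1": "placeholder", "question_2": "placeholder"}

results, failed = label_chunks({1: "some text", 2: "   ", 3: "more text"}, fake_invoke)
print(sorted(results), failed)
```

Keeping the failed ids separate makes it easy to retry only those chunks later.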

In [32]:
with open('QSetData.json', 'w') as f:
    json.dump(datajson, f, indent=4)
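
The prompt asks for questions that should lead back to their source chunk in a semantic search, so one natural use of `QSetData.json` is as a retrieval test set. A minimal sketch of flattening the nested structure into (question, chunk number) pairs, assuming the `"Chunk N"` key format used above; `flatten_questions` is a hypothetical helper and the sample dictionary is a shortened copy of the output shown earlier.

```python
def flatten_questions(datajson):
    """Turn {"Chunk N": {"question_k": text}} into a list of (text, N) pairs."""
    pairs = []
    for chunk_label, questions in datajson.items():
        chunk_number = int(chunk_label.split()[-1])  # "Chunk 3" -> 3
        for question in questions.values():
            pairs.append((question, chunk_number))
    return pairs

sample = {
    "Chunk 1": {"question_1": "What are some key findings regarding sound propagation in Bose-Einstein condensates?"},
    "Chunk 2": {"question_1": "What are the names of the individuals mentioned in this document?"},
}
pairs = flatten_questions(sample)
print(len(pairs))
```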