## Dataset preparation

In [1]:
from datasets import load_dataset

# Step 1: Load the SQuAD dataset
dataset = load_dataset("squad")

# Step 2: Extract unique contexts from the dataset
data = [item["context"] for item in dataset["train"]]
texts = list(set(data))



In [2]:
texts[0]

'In Prussia, some officials considered a war against France both inevitable and necessary to arouse German nationalism in those states that would allow the unification of a great German empire. This aim was epitomized by Prussian Chancellor Otto von Bismarck\'s later statement: "I did not doubt that a Franco-German war must take place before the construction of a United Germany could be realised." Bismarck also knew that France should be the aggressor in the conflict to bring the southern German states to side with Prussia, hence giving Germans numerical superiority. Many Germans also viewed the French as the traditional destabilizer of Europe, and sought to weaken France to prevent further breaches of the peace.'

## Embed dataset

In [3]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from tqdm.auto import tqdm  # progress bar used by EmbedData.embed

def batch_iterate(lst, batch_size):
    for i in range(0, len(lst), batch_size):
        yield lst[i : i + batch_size]

class EmbedData:
    def __init__(self, 
                 embed_model_name="nomic-ai/nomic-embed-text-v1.5",
                 batch_size=32):
        
        self.embed_model_name = embed_model_name
        self.embed_model = self._load_embed_model()
        self.batch_size = batch_size
        self.embeddings = []
        
    def _load_embed_model(self):
        embed_model = HuggingFaceEmbedding(model_name=self.embed_model_name,
                                           trust_remote_code=True,
                                           cache_folder='./hf_cache')
        return embed_model
    
    def generate_embedding(self, context):
        return self.embed_model.get_text_embedding_batch(context)
    
    def embed(self, contexts):
        self.contexts = contexts
        
        # Ceiling division so the progress-bar total counts a final partial batch
        for batch_context in tqdm(batch_iterate(contexts, self.batch_size),
                                  total=-(-len(contexts) // self.batch_size),
                                  desc="Embedding data in batches"):
                                  
            batch_embeddings = self.generate_embedding(batch_context)
            
            self.embeddings.extend(batch_embeddings)

In [4]:
from tqdm.auto import tqdm

batch_size = 32

embeddata = EmbedData(batch_size=batch_size)

<All keys matched successfully>


In [5]:
embeddata.embed(texts[:100]) # Embed the first 100 contexts

Embedding data in batches: 4it [00:37,  9.43s/it]                       
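
With 100 contexts and a batch size of 32, the last batch is only partly filled, so the run has 4 batches even though `100 // 32` is 3. The sketch below repeats `batch_iterate` from the cell above on a stand-in list (the integers are placeholders, not SQuAD data) to show the batch sizes and the ceiling-division count.

```python
import math

def batch_iterate(lst, batch_size):
    """Yield consecutive slices of lst with at most batch_size items each."""
    for i in range(0, len(lst), batch_size):
        yield lst[i : i + batch_size]

items = list(range(100))  # placeholder for the 100 contexts
batch_sizes = [len(batch) for batch in batch_iterate(items, 32)]
expected_batches = math.ceil(len(items) / 32)

print(batch_sizes)        # sizes of each batch, the last one is partial
print(expected_batches)   # number of batches including the partial one
```

Passing the ceiling value as `total` lets a progress bar finish exactly at its last batch.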


In [6]:
# import pickle

# with open("embeddata_full.pickle", "rb") as h:
#     embeddata = pickle.load(h)

In [7]:
embeddata.contexts[0]

'In Prussia, some officials considered a war against France both inevitable and necessary to arouse German nationalism in those states that would allow the unification of a great German empire. This aim was epitomized by Prussian Chancellor Otto von Bismarck\'s later statement: "I did not doubt that a Franco-German war must take place before the construction of a United Germany could be realised." Bismarck also knew that France should be the aggressor in the conflict to bring the southern German states to side with Prussia, hence giving Germans numerical superiority. Many Germans also viewed the French as the traditional destabilizer of Europe, and sought to weaken France to prevent further breaches of the peace.'

## Vector database

In [10]:
from qdrant_client import QdrantClient, models

class QdrantVDB:

    def __init__(self, collection_name, vector_dim=768, batch_size=512):
        self.collection_name = collection_name
        self.batch_size = batch_size
        self.vector_dim = vector_dim
    
    def define_client(self):
        self.client = QdrantClient(url="http://localhost:6333",
                                   prefer_grpc=True)
    
    def create_collection(self):
        
        if not self.client.collection_exists(collection_name=self.collection_name):

            self.client.create_collection(collection_name=self.collection_name,
                                          
                                          vectors_config=models.VectorParams(
                                                              size=self.vector_dim,
                                                              distance=models.Distance.DOT,
                                                              on_disk=True),
                                          
                                          optimizers_config=models.OptimizersConfigDiff(
                                                                            default_segment_number=5,
                                                                            indexing_threshold=0)
                                         )
    
    def ingest_data(self, embeddata):
    
        # Ceiling division so the progress-bar total counts a final partial batch
        for batch_context, batch_embeddings in tqdm(zip(batch_iterate(embeddata.contexts, self.batch_size), 
                                                        batch_iterate(embeddata.embeddings, self.batch_size)), 
                                                    total=-(-len(embeddata.contexts) // self.batch_size), 
                                                    desc="Ingesting in batches"):
        
            self.client.upload_collection(collection_name=self.collection_name,
                                        vectors=batch_embeddings,
                                        payload=[{"context": context} for context in batch_context])

        # Re-enable indexing once the bulk upload is done
        self.client.update_collection(collection_name=self.collection_name,
                                    optimizers_config=models.OptimizersConfigDiff(indexing_threshold=20000)
                                    )

In [11]:
database = QdrantVDB("squad_collection")
database.define_client()
database.create_collection()
database.ingest_data(embeddata)

Ingesting in batches: 1it [00:00, 11.82it/s]
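
The collection is created with `models.Distance.DOT`. For unit-length vectors the dot product equals cosine similarity, so the two metrics rank results identically when the embeddings are normalized. Whether the embedding model in this notebook returns normalized vectors is an assumption worth checking. The sketch below uses small made-up vectors (not real embeddings) to show the equivalence.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity computed from the raw vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]  # made-up example vectors
b = [1.0, 2.0, 2.0]

dot_of_units = dot(normalize(a), normalize(b))
cos_raw = cosine(a, b)
print(dot_of_units, cos_raw)
```

If the stored vectors were not normalized, dot-product scores would also reflect vector length, and `models.Distance.COSINE` would be the safer choice.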


## Retriever class

In [12]:
import time

class Retriever:

    def __init__(self, vector_db, embeddata):
        
        self.vector_db = vector_db
        self.embeddata = embeddata
    
    def search(self, query):
        query_embedding = self.embeddata.embed_model.get_query_embedding(query)
            
        # Start the timer
        start_time = time.time()
        
        result = self.vector_db.client.search(
            collection_name=self.vector_db.collection_name,
            
            query_vector=query_embedding,
            
            search_params=models.SearchParams(
                quantization=models.QuantizationSearchParams(
                    ignore=True,
                    rescore=True,
                    oversampling=2.0,
                )
            ),
            
            timeout=1000,
        )
        
        # End the timer
        end_time = time.time()
        elapsed_time = end_time - start_time

        print(f"Execution time for the search: {elapsed_time:.4f} seconds")

        return result

In [13]:
Retriever(database, embeddata).search("Sample query")[0]

Execution time for the search: 0.0224 seconds


ScoredPoint(id='07659848-6f30-477b-b828-3bef46f203dd', version=3, score=0.4932173490524292, payload={'context': 'Cladistics is a technique borrowed from biology, where it was originally named phylogenetic systematics by Willi Hennig. In biology, the technique is used to determine the evolutionary relationships between different species. In its application in textual criticism, the text of a number of different manuscripts is entered into a computer, which records all the differences between them. The manuscripts are then grouped according to their shared characteristics. The difference between cladistics and more traditional forms of statistical analysis is that, rather than simply arranging the manuscripts into rough groupings according to their overall similarity, cladistics assumes that they are part of a branching family tree and uses that assumption to derive relationships between them. This makes it more like an automated approach to stemmatics. However, where there is a differen

## RAG class

In [21]:
from llama_index.llms.ollama import Ollama

class RAG:

    def __init__(self,
                 retriever,
                 llm_name="llama3.2:1b"):
        
        self.llm_name = llm_name
        self.llm = self._setup_llm()
        self.retriever = retriever
        self.qa_prompt_tmpl_str = """Context information is below.
                                     ---------------------
                                     {context}
                                     ---------------------
                                     
                                     Given the context information above I want you
                                     to think step by step to answer the query in a
                                     crisp manner, in case you don't know the
                                     answer say 'I don't know!'
                                     
                                     ---------------------
                                     Query: {query}
                                     ---------------------
                                     Answer: """
    
    def _setup_llm(self):
        return Ollama(model=self.llm_name)
    
    def generate_context(self, query):
    
        result = self.retriever.search(query)
        # Collect the stored passage text from each retrieved point's payload
        combined_prompt = [point.payload["context"] for point in result]

        return "\n\n---\n\n".join(combined_prompt)
    
    def query(self, query):
        context = self.generate_context(query=query)
        print("context is extracted successfully !!!")
        prompt = self.qa_prompt_tmpl_str.format(context=context,
                                                query=query)
        response = self.llm.complete(prompt)
        
        return dict(response)['text']
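
The prompt assembly done by `generate_context` and `query` can be tried without Qdrant or Ollama running. This is a minimal sketch: the `retrieved` list is a hypothetical stand-in for search results, and the template is a shortened version of `qa_prompt_tmpl_str`.

```python
# Hypothetical stand-ins for the payloads of retrieved points
retrieved = [
    {"payload": {"context": "First passage."}},
    {"payload": {"context": "Second passage."}},
]

# Shortened version of the notebook's prompt template
template = (
    "Context information is below.\n"
    "---------------------\n"
    "{context}\n"
    "---------------------\n"
    "Query: {query}\n"
    "Answer: "
)

# Join the passages with the same separator the RAG class uses
context = "\n\n---\n\n".join(entry["payload"]["context"] for entry in retrieved)
prompt = template.format(context=context, query="What does the first passage say?")
print(prompt)
```

Printing the final prompt like this is a quick way to check what the LLM actually receives before looking at its answer.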

In [20]:
Retriever(database, embeddata).search("Sample query")[0]

Execution time for the search: 0.0043 seconds


ScoredPoint(id='07659848-6f30-477b-b828-3bef46f203dd', version=3, score=0.4932173490524292, payload={'context': 'Cladistics is a technique borrowed from biology, where it was originally named phylogenetic systematics by Willi Hennig. In biology, the technique is used to determine the evolutionary relationships between different species. In its application in textual criticism, the text of a number of different manuscripts is entered into a computer, which records all the differences between them. The manuscripts are then grouped according to their shared characteristics. The difference between cladistics and more traditional forms of statistical analysis is that, rather than simply arranging the manuscripts into rough groupings according to their overall similarity, cladistics assumes that they are part of a branching family tree and uses that assumption to derive relationships between them. This makes it more like an automated approach to stemmatics. However, where there is a differen

## Let's use it

In [22]:
retriever = Retriever(database, embeddata)

rag = RAG(retriever)

In [25]:
embeddata.contexts[16]

'In Latin, papyri from Herculaneum dating before 79 AD (when it was destroyed) have been found that have been written in old Roman cursive, where the early forms of minuscule letters "d", "h" and "r", for example, can already be recognised. According to papyrologist Knut Kleve, "The theory, then, that the lower-case letters have been developed from the fifth century uncials and the ninth century Carolingian minuscules seems to be wrong." Both majuscule and minuscule letters existed, but the difference between the two variants was initially stylistic rather than orthographic and the writing system was still basically unicameral: a given handwritten document could use either one style or the other but these were not mixed. European languages, except for Ancient Greek and Latin, did not make the case distinction before about 1300.[citation needed]'

In [38]:
query = """what is the difference between majuscule and minuscule"""

answer = rag.query(query)

Execution time for the search: 0.0100 seconds


## Display the output

In [39]:
from IPython.display import Markdown, display

display(Markdown(str(answer)))

Based on the provided context information, there doesn't seem to be a clear distinction or explanation for the difference between majuscule (uppercase) and minuscule (lowercase) letters in Latin writing systems. 

It appears that both majuscule and minuscule letters existed and were used, but with initial stylistic differences rather than being strictly orthographic or having any significant functional differences.

In [36]:
embeddata.contexts[16]

'In Latin, papyri from Herculaneum dating before 79 AD (when it was destroyed) have been found that have been written in old Roman cursive, where the early forms of minuscule letters "d", "h" and "r", for example, can already be recognised. According to papyrologist Knut Kleve, "The theory, then, that the lower-case letters have been developed from the fifth century uncials and the ninth century Carolingian minuscules seems to be wrong." Both majuscule and minuscule letters existed, but the difference between the two variants was initially stylistic rather than orthographic and the writing system was still basically unicameral: a given handwritten document could use either one style or the other but these were not mixed. European languages, except for Ancient Greek and Latin, did not make the case distinction before about 1300.[citation needed]'

## Binary Quantization

In [40]:
from qdrant_client import models
from qdrant_client import QdrantClient

class QdrantVDB_BQ:

    def __init__(self, collection_name, vector_dim=768, batch_size=512):
        self.collection_name = collection_name
        self.batch_size = batch_size
        self.vector_dim = vector_dim
    
    def define_client(self):
        self.client = QdrantClient(url="http://localhost:6333",
                                   prefer_grpc=True)
        
    def create_collection(self):
        
        if not self.client.collection_exists(collection_name=self.collection_name):

            self.client.create_collection(collection_name=self.collection_name,
                                          
                                          vectors_config=models.VectorParams(
                                                              size=self.vector_dim,
                                                              distance=models.Distance.DOT,
                                                              on_disk=True),
                                          
                                          optimizers_config=models.OptimizersConfigDiff(
                                                                            default_segment_number=5,
                                                                            indexing_threshold=0),
                                          
                                          quantization_config=models.BinaryQuantization(
                                                        binary=models.BinaryQuantizationConfig(always_ram=True)),
                                         )
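
The idea behind binary quantization, together with the `oversampling` and `rescore` search parameters, can be sketched in plain Python. This is an illustration under simplifying assumptions, not Qdrant's internal implementation: each component is reduced to its sign bit, candidates are shortlisted by Hamming distance on the bits, and the shortlist is rescored with the original vectors. The helper names and the toy vectors are hypothetical.

```python
def binarize(vec):
    """Keep only the sign of each component: 1 if positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Number of positions where two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search(query, vectors, limit=1, oversampling=2.0):
    """Shortlist by Hamming distance on bits, then rescore with full vectors."""
    q_bits = binarize(query)
    codes = [binarize(v) for v in vectors]
    shortlist_size = min(len(vectors), int(limit * oversampling))
    shortlist = sorted(range(len(vectors)),
                       key=lambda i: hamming(q_bits, codes[i]))[:shortlist_size]
    # Rescoring: rank the shortlist by the exact dot product
    rescored = sorted(shortlist, key=lambda i: dot(query, vectors[i]), reverse=True)
    return rescored[:limit]

vectors = [
    [0.9, 0.1, -0.2, 0.3],
    [0.1, 0.8, -0.1, 0.9],
    [-0.5, -0.4, 0.7, -0.6],
]
query = [0.2, 0.9, -0.3, 0.8]

top = search(query, vectors, limit=1, oversampling=2.0)
print(top)
```

In this toy data the first two vectors share the same bit pattern as the query, so the bits alone cannot separate them. The oversampled shortlist keeps both, and rescoring with the full vectors picks the closer one.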

In [41]:
import time

class Retriever:

    def __init__(self, vector_db, embeddata):
        
        self.vector_db = vector_db
        self.embeddata = embeddata
    
    def search(self, query):
        query_embedding = self.embeddata.embed_model.get_query_embedding(query)
            
        # Start the timer
        start_time = time.time()
        
        result = self.vector_db.client.search(
            collection_name=self.vector_db.collection_name,
            
            query_vector=query_embedding,
            
            search_params=models.SearchParams(
                quantization=models.QuantizationSearchParams(
                    ignore=False,
                    rescore=True,
                    oversampling=2.0,
                )
            ),
            
            timeout=1000,
        )
        
        # End the timer
        end_time = time.time()
        elapsed_time = end_time - start_time

        print(f"Execution time for the search: {elapsed_time:.4f} seconds")

        return result

In [42]:
database = QdrantVDB_BQ("squad_collection_qb")
database.define_client()
database.create_collection()
# QdrantVDB_BQ defines no ingest_data of its own, so borrow QdrantVDB's method
QdrantVDB.ingest_data(database, embeddata)

Ingesting in batches: 1it [00:00,  6.54it/s]


In [43]:
retriever = Retriever(database, embeddata)

rag = RAG(retriever)

In [46]:
query = """what is the difference between majuscule and minuscule"""

answer = rag.query(query)

Execution time for the search: 0.0077 seconds


In [47]:
from IPython.display import Markdown, display

display(Markdown(str(answer)))

To answer your question, there are differences between majuscule (uppercase) and minuscule (lowercase) letters in the Latin alphabet.

Majuscule refers to the uppercase form of a letter, which is used for writing large or formal documents, titles, and headings. Examples of majuscule letters include A, B, C, etc.

Minuscule, on the other hand, refers to the lowercase form of a letter, which is used for everyday writing, such as in newspapers, books, and informal correspondence. Examples of minuscule letters include a, b, c, etc.

In addition, the difference between majuscule and minuscule is also reflected in their visual appearance. Majuscule letters are typically larger and more distinct than minuscule letters.

It's worth noting that the term "majuscule" was not commonly used until the 16th century, when the English language began to adopt a standardized alphabet based on the Latin alphabet. Prior to this time, the terms "capital" and "minor" were often used to refer to uppercase and lowercase letters, respectively.

In summary, the main differences between majuscule and minuscule are:

* The use of uppercase or lowercase for writing purposes
* The visual appearance of the letters (majuscules are larger and more distinct)
* The terminology used to describe these forms (majuscule refers to uppercase, while minuscule refers to lowercase)