**Assignment 2**

Use the semantic chunking code in this notebook to chunk the documents for the naive RAG pipeline we developed, then analyze its performance on a range of queries.

Repeat the semantic chunking with each of the following embedding models and compare the performance:


1.   BAAI/bge-small-en-v1.5
2.   all-MiniLM-L6-v2
3.   sentence-transformers/all-MiniLM-L12-v2
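
One way to organize the comparison is a small harness that runs every chunker over the same text and reports simple statistics per model. The block below is only a sketch: `chunk_stats`, `compare_chunkers`, and `fixed_size_chunker` are hypothetical names, and the fixed-size chunker is a toy stand-in so the example runs without downloading any model.

```python
from statistics import mean
from typing import Callable, Dict, List


def chunk_stats(chunks: List[str]) -> Dict[str, float]:
    """Summarize a list of chunks: how many there are and how long they run."""
    word_counts = [len(chunk.split()) for chunk in chunks]
    return {
        "num_chunks": len(chunks),
        "avg_words": mean(word_counts) if word_counts else 0.0,
        "max_words": max(word_counts, default=0),
    }


def compare_chunkers(
    text: str, chunkers: Dict[str, Callable[[str], List[str]]]
) -> Dict[str, Dict[str, float]]:
    """Run every chunker on the same text and collect its statistics."""
    return {name: chunk_stats(chunk(text)) for name, chunk in chunkers.items()}


def fixed_size_chunker(words_per_chunk: int) -> Callable[[str], List[str]]:
    """Toy stand-in chunker (an assumption for this demo): groups every N words."""
    def chunk(text: str) -> List[str]:
        words = text.split()
        return [
            " ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)
        ]
    return chunk


if __name__ == "__main__":
    demo_text = "one two three four five six seven eight nine ten"
    results = compare_chunkers(
        demo_text,
        {"size-2": fixed_size_chunker(2), "size-5": fixed_size_chunker(5)},
    )
    for name, stats in results.items():
        print(name, stats)
```

In the notebook, the dictionary could instead map each of the three model names to `SentenceTransformerChunker(model_name=name).semantic_chunking`, so that all models are scored on identical input.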






In [8]:
# First, install required packages
!pip install sentence-transformers numpy scikit-learn

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import re
from typing import List




In [2]:
class SentenceTransformerChunker:
    def __init__(self, model_name='all-MiniLM-L6-v2', similarity_threshold=0.7):
        """
        Initialize the chunker with a SentenceTransformer model

        Args:
            model_name: Name of the sentence transformer model
            similarity_threshold: Threshold for semantic similarity
        """
        self.model = SentenceTransformer(model_name)
        self.similarity_threshold = similarity_threshold

    def split_into_sentences(self, text: str) -> List[str]:
        """Split text into sentences"""
        # Naive split on '.', '!' and '?'. This drops the punctuation and also
        # splits inside tokens such as "vit.ac.in" or "B.Tech".
        sentences = re.split(r'[.!?]+', text)
        sentences = [s.strip() for s in sentences if s.strip()]
        return sentences

    def semantic_chunking(self, text: str, max_chunk_size: int = 5) -> List[str]:
        """
        Create chunks based on semantic similarity

        Args:
            text: Input text to chunk
            max_chunk_size: Maximum number of sentences per chunk

        Returns:
            List of text chunks
        """
        sentences = self.split_into_sentences(text)
        if not sentences:
            return []

        # Generate embeddings for all sentences
        embeddings = self.model.encode(sentences)

        chunks = []
        current_chunk = [sentences[0]]
        current_embedding = embeddings[0:1]

        for i in range(1, len(sentences)):
            # Cosine similarity between the current chunk's mean embedding
            # and the embedding of the next sentence
            chunk_mean_embedding = np.mean(current_embedding, axis=0).reshape(1, -1)
            similarity = cosine_similarity(
                chunk_mean_embedding,
                embeddings[i:i+1]
            )[0][0]

            # If similar enough and chunk not too large, add to current chunk
            if (similarity >= self.similarity_threshold and
                len(current_chunk) < max_chunk_size):
                current_chunk.append(sentences[i])
                current_embedding = np.vstack([current_embedding, embeddings[i:i+1]])
            else:
                # Start new chunk
                chunks.append(' '.join(current_chunk))
                current_chunk = [sentences[i]]
                current_embedding = embeddings[i:i+1]

        # Add the last chunk
        if current_chunk:
            chunks.append(' '.join(current_chunk))

        return chunks
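
To see how `similarity_threshold` and `max_chunk_size` interact in `semantic_chunking`, the grouping rule can be run on hand-made vectors without loading a model. This is a minimal pure-Python sketch of the same greedy rule, not the class itself: `cosine`, `mean_vector`, and `group_by_similarity` are hypothetical helper names, and the 2-D vectors are made up for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    """Element-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def group_by_similarity(
    vectors: List[Sequence[float]], threshold: float, max_chunk_size: int
) -> List[List[int]]:
    """Greedy grouping: extend the current group while the next vector is
    similar enough to the group's mean and the group is below the size cap."""
    if not vectors:
        return []
    groups = []
    current = [0]
    for i in range(1, len(vectors)):
        centroid = mean_vector([vectors[j] for j in current])
        if cosine(centroid, vectors[i]) >= threshold and len(current) < max_chunk_size:
            current.append(i)
        else:
            groups.append(current)
            current = [i]
    groups.append(current)
    return groups


if __name__ == "__main__":
    # Hand-made 2-D "embeddings": three point roughly along x, two along y.
    toy = [(1.0, 0.0), (0.9, 0.1), (1.0, 0.1), (0.0, 1.0), (0.1, 0.9)]
    print(group_by_similarity(toy, threshold=0.7, max_chunk_size=5))
    print(group_by_similarity(toy, threshold=0.7, max_chunk_size=2))
```

With these toy vectors the first call groups the indices as `[[0, 1, 2], [3, 4]]`, while the size cap of 2 in the second call forces `[[0, 1], [2], [3, 4]]` even though vector 2 is similar to its neighbours.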


In [3]:
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [6]:
def main():
    # Path to the PDF whose text will be chunked
    pdf_path = "/content/VITEEE-2024-information-brochure.pdf"

    sample_text = ""
    try:
        import PyPDF2
        with open(pdf_path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            for page in reader.pages:
                # Guard against pages that yield no extractable text
                sample_text += page.extract_text() or ""
    except FileNotFoundError:
        print(f"Error: The file {pdf_path} was not found.")
        return
    except Exception as e:
        print(f"An error occurred while reading the PDF: {e}")
        return

    # Initialize chunker
    chunker = SentenceTransformerChunker(similarity_threshold=0.6)

    # Perform semantic chunking
    chunks = chunker.semantic_chunking(sample_text, max_chunk_size=4)

    print("=== Semantic Chunks ===")
    for i, chunk in enumerate(chunks, 1):
        print(f"Chunk {i}:")
        print(chunk)
        print("-" * 50)


if __name__ == "__main__":
    main()



=== Semantic Chunks ===
Chunk 1:
/vituniversity /vellore_vit www vit
--------------------------------------------------
Chunk 2:
ac
--------------------------------------------------
Chunk 3:
in /vellore-institute-of-technology /VIT_univ
20242024
VIT ENGINEERING ENTRANCE
EXAMINATION
VIT ENGINEERING ENTRANCE
EXAMINATION
For Admission to B
--------------------------------------------------
Chunk 4:
T ech
--------------------------------------------------
Chunk 5:
Programmes of
VIT - Vellore | VIT - Chennai | VIT - AP | VIT - Bhopal
For Admission to B
--------------------------------------------------
Chunk 6:
T ech
--------------------------------------------------
Chunk 7:
Programmes of
VIT - Vellore | VIT - Chennai | VIT - AP | VIT - BhopalciogNg caHT jUk
VELLORE INSTITUTE OF TECHNOLOGYVITVIT
Vellore Institute of Technology
(Deemed to be University under section 3 of UGC Act, 1956)R
VITEEE
ProspectusVIT - VelloreVIT - Chennai
VIT - APVIT - Bhopal
1
-------------------------------------
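
Fragments such as `ac` and `T ech` in the output above most likely come from the sentence splitter: `re.split(r'[.!?]+', text)` cuts at every period, including periods inside a web address or an abbreviation like `B.Tech`. A possible alternative, sketched below under the assumption that a sentence ends only when the punctuation is followed by whitespace, also collapses the line breaks left by PDF extraction. `normalize_whitespace` and `split_sentences` are hypothetical names, and the sample string is invented.

```python
import re
from typing import List


def normalize_whitespace(text: str) -> str:
    """Collapse runs of spaces, tabs, and newlines into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' only when whitespace follows, so periods
    inside tokens such as 'vit.ac.in' or 'B.Tech' do not end a sentence."""
    parts = re.split(r"(?<=[.!?])\s+", normalize_whitespace(text))
    return [part for part in parts if part]


if __name__ == "__main__":
    sample = (
        "Visit www.vit.ac.in for details.\n"
        "Admission to B.Tech\nprogrammes is open! Apply now."
    )
    for sentence in split_sentences(sample):
        print(sentence)
```

This rule still splits after abbreviations that are followed by a space (for example `Dr. Smith`), so it is a modest improvement rather than a full sentence tokenizer.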

In [None]:
# prompt: An error occurred while reading the PDF: [Errno 22] Invalid argument

# Mount Google Drive to access files
from google.colab import drive
drive.mount('/content/drive')

# Make sure the PDF path points to the correct location in your Google Drive
# For example, if VITEEE-2024-information-brochure.pdf is in the root of My Drive:
# pdf_path = "/content/drive/My Drive/VITEEE-2024-information-brochure.pdf"
# If it's in a subfolder, adjust the path accordingly
# e.g., pdf_path = "/content/drive/My Drive/MyFolder/VITEEE-2024-information-brochure.pdf"

# Replace with the actual path to your PDF file in Google Drive
pdf_path = "/content/drive/My Drive/VITEEE-2024-information-brochure.pdf" # @param {type:"string"}
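
Before handing the path to the PDF reader, a quick check can help separate a wrong or unmounted path from other read errors. This is a sketch only: `check_pdf_path` is a hypothetical helper, and the demo uses a throwaway temporary file rather than the brochure.

```python
import tempfile
from pathlib import Path


def check_pdf_path(pdf_path: str) -> Path:
    """Return the path as a Path object, or raise with a specific message."""
    path = Path(pdf_path)
    if not path.exists():
        raise FileNotFoundError(f"No file at {path}; is Google Drive mounted?")
    if not path.is_file():
        raise IsADirectoryError(f"{path} is a directory, not a file")
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"{path.name} does not have a .pdf extension")
    return path


if __name__ == "__main__":
    # Demo on a throwaway file so the sketch runs anywhere.
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "demo.pdf"
        demo.write_bytes(b"%PDF-1.4\n")
        print(check_pdf_path(str(demo)).name)
```

Calling `check_pdf_path(pdf_path)` at the top of `main()` would surface a missing file or a directory path with a clearer message than a generic read error.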