In [9]:
from transformers import AutoModel, AutoTokenizer
import torch
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 300,
        chunk_overlap = 30,
        length_function = len,
        is_separator_regex = False
)

with open("../content.txt", "r") as file:
    content = file.read()

chunks = text_splitter.split_text(content)
chunks

['Retrieval-Augmented Generation (RAG) is a groundbreaking technique in natural language processing that combines the strengths of retrieval-based and generative models to create a system capable of producing highly accurate and contextually relevant text. By leveraging both retrieval and generation,',
 'retrieval and generation, RAG models address many of the limitations of traditional models, offering a more robust and flexible approach to various tasks. The process begins with the retrieval of relevant documents or passages from a large corpus. The retrieval component typically uses dense',
 'typically uses dense embeddings, which are vector representations learned to capture the semantic meaning of the text. These embeddings allow the model to measure the similarity between the input query and potential documents, even when they do not share exact keywords. Dense retrieval models,',
 'Dense retrieval models, often based on transformer architectures like BERT, excel at finding conte

In [5]:
len(chunks)

48

In [6]:
def calculate_embeddings(text, model, tokeniser):
    inputs = tokeniser(text, return_tensors = "pt", truncation = True, padding = True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim = 1).squeeze().numpy()

In [7]:
def calculate_similarity(embedding1, embedding2):
    similarity = cosine_similarity([embedding1], [embedding2])
    return similarity[0][0]

In [11]:
def test_with_model(chunks, user_query, top_k, model_name = "bert-base-uncased"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    user_query_embedding = calculate_embeddings(user_query, model, tokenizer)
    embeddings = [calculate_embeddings(chunk, model, tokenizer) for chunk in chunks]
    count = [i for i in range(len(chunks))]
    scores = {i: calculate_similarity(user_query_embedding, embedding) for embedding, i in zip(embeddings, count)}
    sorted_chunks = sorted(scores.items(), key = lambda item: item[1])[:top_k]
    
    print("Matching chunks:\n")
    for chunk in sorted_chunks:
        print(chunks[chunk[0]], "\n")

In [12]:
test_with_model(chunks, user_query="What is rag?", top_k=3)

Matching chunks:

In summary, Retrieval-Augmented Generation (RAG) represents a powerful and versatile approach to natural language processing. By combining the strengths of retrieval-based and generative models, RAG models can provide accurate, contextually relevant, and coherent responses to a wide range of 

Retrieval-Augmented Generation (RAG) is a groundbreaking technique in natural language processing that combines the strengths of retrieval-based and generative models to create a system capable of producing highly accurate and contextually relevant text. By leveraging both retrieval and generation, 

RAG models have shown significant promise in a variety of applications. In knowledge-intensive tasks, RAG models can provide accurate and contextually relevant answers to complex questions by retrieving and synthesizing information from multiple sources. In summarization tasks, RAG models can 

