# Research Paper Based Semantic System
by Tejas, Pranit and Tushar

### Objective

Develop a reliable generative search system. For now, we use a single research paper to build a prototype of the project before scaling it up.

### Design

The solution is a three-layer pipeline for semantic search over research papers.

The Embedding Layer processes PDFs by extracting, cleaning, and chunking text so that context is preserved, then generates vector embeddings with SentenceTransformers from HuggingFace. These embeddings are stored in ChromaDB for retrieval. The chunking strategy (fixed-length, semantic, or sliding window) is crucial for maintaining context.

The Semantic Search Layer converts user queries into embeddings and performs a vector similarity search in ChromaDB to retrieve the top results. A caching mechanism (e.g., Redis; this notebook uses a second ChromaDB collection) avoids repeating searches, while a re-ranking model (cross-encoder) refines the relevance ordering of the results.

The Generation Layer uses Gemini 2.0 Flash Lite to construct responses from the retrieved snippets. The prompt is designed to produce structured, concise answers with citations and proper formatting, and to handle out-of-scope queries. Few-shot examples can be added to improve response quality.


In [1]:
# Installing Pre-requisites

import sys

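# Install packages if they are not already installed.
# The pip name and the import name differ for some packages, so map one to the other.
import importlib

def install_packages():
    packages = {
        "pdfplumber": "pdfplumber",
        "tiktoken": "tiktoken",
        "openai": "openai",
        "chromadb": "chromadb",
        "sentence-transformers": "sentence_transformers",
        "google-generativeai": "google.generativeai",
    }

    for package, module in packages.items():
        try:
            importlib.import_module(module)
        except ImportError:
            print(f"Installing {package}...")
            %pip install --quiet --upgrade {package}

# Run the installation function
install_packages()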

# Verify installations
print("✅ All packages are installed and ready to use.")

Installing sentence-transformers...
Note: you may need to restart the kernel to use updated packages.
Installing google-generativeai...





Note: you may need to restart the kernel to use updated packages.
✅ All packages are installed and ready to use.





## 1. Importing the Required Libraries

In [3]:
# Import libraries that are required throughout the project
import pandas as pd
from operator import itemgetter
from pathlib import Path
import json
import tiktoken
import openai
import chromadb
import pdfplumber
from typing import List, Tuple, Any
from sentence_transformers import SentenceTransformer



## 2. Read, Process, and Chunk the PDF Files


In [4]:
# Currently the path is hardcoded
#pdf_path = "Principal-Sample-Life-Insurance-Policy.pdf"
pdf_path = "EJ1172284.pdf"

In [5]:
# Creating functions that would extract data from the pdf based on the structure


def check_bboxes(word: dict, table_bbox: tuple) -> bool:
    """
    Checks whether a word is inside a table bounding box.

    Parameters:
    - word (dict): A dictionary containing word coordinates with keys 'x0', 'top', 'x1', and 'bottom'.
    - table_bbox (tuple): A tuple representing the bounding box of a table (x0, y0, x1, y1).

    Returns:
    - bool: True if the word is inside the table bounding box, otherwise False.
    """
    try:
        word_bbox = (word['x0'], word['top'], word['x1'], word['bottom'])
        return (
            word_bbox[0] > table_bbox[0] and
            word_bbox[1] > table_bbox[1] and
            word_bbox[2] < table_bbox[2] and
            word_bbox[3] < table_bbox[3]
        )
    except KeyError as e:
        print(f"Missing key in word dictionary: {e}")
        return False

def extract_text_from_pdf(pdf_path: str) -> List[Tuple[str, str]]:
    """
    Extracts text and tables from a PDF while maintaining their chronological order.

    Parameters:
    - pdf_path (str): Path to the PDF file.

    Returns:
    - List[Tuple[str, str]]: A list of tuples containing page numbers and extracted text.
    """
    full_text = []

    try:
        with pdfplumber.open(pdf_path) as pdf:
            for p, page in enumerate(pdf.pages):
                page_no = f"Page {p + 1}"
                text = page.extract_text()

                # Extract tables and their bounding boxes
                tables = page.find_tables()
                table_bboxes = [table.bbox for table in tables]
                tables_data = [{'table': table.extract(), 'top': table.bbox[1]} for table in tables]

                # Extract non-table words
                words = page.extract_words()
                non_table_words = [word for word in words if not any(check_bboxes(word, bbox) for bbox in table_bboxes)]

                lines = []

                # Cluster words and tables based on their vertical positions
                for cluster in pdfplumber.utils.cluster_objects(non_table_words + tables_data, itemgetter('top'), tolerance=5):
                    if 'text' in cluster[0]:  # Regular text
                        try:
                            lines.append(' '.join(word['text'] for word in cluster))
                        except KeyError:
                            pass
                    elif 'table' in cluster[0]:  # Tables
                        lines.append(json.dumps(cluster[0]['table']))

                # Store page number and extracted content
                full_text.append((page_no, " ".join(lines)))

    except Exception as e:
        print(f"Error processing PDF: {e}")

    return full_text

*Now that we have defined the function for extracting the text and tables from a PDF, let's call it on our PDF and store the result, one row per page, in a DataFrame.*

In [6]:
# Call the function to extract the text from the PDF
extracted_text = extract_text_from_pdf(pdf_path)
# Convert the extracted list to a DataFrame with one row per page
extracted_text_df = pd.DataFrame(extracted_text, columns=['Page No.', 'Page_Text'])
print("All PDFs have been processed.")

All PDFs have been processed.


In [7]:
extracted_text_df.head()

Unnamed: 0,Page No.,Page_Text
0,Page 1,"The EUROCALL Review, Volume 25, No. 2, Septemb..."
1,Page 2,"The EUROCALL Review, Volume 25, No. 2, Septemb..."
2,Page 3,"The EUROCALL Review, Volume 25, No. 2, Septemb..."
3,Page 4,"The EUROCALL Review, Volume 25, No. 2, Septemb..."
4,Page 5,"The EUROCALL Review, Volume 25, No. 2, Septemb..."


In [8]:
extracted_text_df.Page_Text[7]

'The EUROCALL Review, Volume 25, No. 2, September 2017 [["referred their students to electronic or online resources; however, they did not ask"], ["students to use them in classes, they did not recommend any mobile apps or design"], ["language tasks which required using such devices in order to solve them."], ["4.5. Language practiced"], ["When asked to indicate the most frequently practiced language skills and subsystems by"], ["means of mobile devices, all the interviewees indicated the target language vocabulary."], ["In addition to this, some referred to pronunciation and only a few students mentioned"], ["grammar and practicing reading, listening and speaking skills. As far as practicing"], ["English vocabulary is concerned, the subjects chose to practice it through their"], ["smartphones and/or tables because they regarded this language subsystem as the most"], ["important to learn, they praised their MobDs for providing them with quick and easy"], ["access to needed words and se

In [9]:
extracted_text_df['Text_Length'] = extracted_text_df['Page_Text'].apply(lambda x: len(x.split(' ')))
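
A note on the word count above: `split(' ')` splits on every single space, so consecutive spaces produce empty strings that are counted as words, whereas `split()` with no argument (used later by the chunking function) splits on any run of whitespace. A minimal illustration with a made-up string, not taken from the paper:

```python
# Hypothetical sample text containing two double spaces
sample = "mobile  devices for  English study"

count_single_space = len(sample.split(' '))  # empty strings between double spaces are counted
count_whitespace = len(sample.split())       # runs of whitespace are treated as one separator

print(count_single_space, count_whitespace)  # 7 5
```

This is why `Text_Length` may differ slightly from the number of words the chunker actually sees; for the length filter used here the difference is unlikely to matter.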

In [10]:
extracted_text_df['Text_Length']

0     520
1     703
2     705
3     662
4     544
5     492
6     734
7     766
8     797
9     508
10    390
Name: Text_Length, dtype: int64

In [11]:
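# Keep pages with at least 10 words; .copy() avoids a SettingWithCopyWarning when columns are added later
extracted_text_df = extracted_text_df.loc[extracted_text_df['Text_Length'] >= 10].copy()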
extracted_text_df.head()

Unnamed: 0,Page No.,Page_Text,Text_Length
0,Page 1,"The EUROCALL Review, Volume 25, No. 2, Septemb...",520
1,Page 2,"The EUROCALL Review, Volume 25, No. 2, Septemb...",703
2,Page 3,"The EUROCALL Review, Volume 25, No. 2, Septemb...",705
3,Page 4,"The EUROCALL Review, Volume 25, No. 2, Septemb...",662
4,Page 5,"The EUROCALL Review, Volume 25, No. 2, Septemb...",544


###### Store the metadata for each page in a separate column



In [12]:
extracted_text_df['Metadata'] = extracted_text_df.apply(lambda x: {'Page_No.': x['Page No.']}, axis=1)

#### Chunking
Overlapping (sliding-window) chunking with a chunk size of 300 words and an overlap of 100 words, so each new chunk starts 200 words after the previous one.


In [13]:
def chunk_text(text, chunk_size=300, overlap_size=100):
    """
    Splits the text into chunks with overlap.

    Parameters:
    - text: The text to be split.
    - chunk_size: The size of each chunk (number of words).
    - overlap_size: The number of words that should overlap between consecutive chunks.

    Returns:
    - chunks: A list of text chunks.
    """
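    # A non-positive step would make range() fail or never advance
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")

    words = text.split()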
    chunks = []

    for i in range(0, len(words), chunk_size - overlap_size):
        chunk = " ".join(words[i:i + chunk_size])
        chunks.append(chunk)

    return chunks
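
To make the window arithmetic concrete, here is a small sketch that only computes where each chunk starts and ends, without touching any text. The helper name `chunk_starts` is made up for illustration; it mirrors the `range(0, len(words), chunk_size - overlap_size)` loop above.

```python
def chunk_starts(n_words, chunk_size=300, overlap_size=100):
    """Return the start index of each chunk for a text of n_words words."""
    step = chunk_size - overlap_size  # 200 words with the defaults
    return list(range(0, n_words, step))

# A page of 520 words yields three chunks
starts = chunk_starts(520)
spans = [(s, min(s + 300, 520)) for s in starts]
print(spans)  # [(0, 300), (200, 500), (400, 520)]
```

Note that the final chunk can be short, and for some page lengths (500 words, for example) it covers only words that the previous chunk already contains.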


In [14]:
# Assuming extracted_text_df is your DataFrame containing the extracted text
chunk_size = 300  # Number of words in each chunk
overlap_size = 100  # Number of overlapping words between chunks

# Create a new column in the DataFrame to store the chunks
extracted_text_df['Chunks'] = extracted_text_df['Page_Text'].apply(lambda x: chunk_text(x, chunk_size, overlap_size))

# Flatten the DataFrame to have one row per chunk
chunked_df = extracted_text_df.explode('Chunks').reset_index(drop=True)

# Add an identifier to each chunk to keep track of the page and chunk number
chunked_df['Chunk_ID'] = chunked_df.index + 1


In [15]:
chunked_df.head()


Unnamed: 0,Page No.,Page_Text,Text_Length,Metadata,Chunks,Chunk_ID
0,Page 1,"The EUROCALL Review, Volume 25, No. 2, Septemb...",520,{'Page_No.': 'Page 1'},"The EUROCALL Review, Volume 25, No. 2, Septemb...",1
1,Page 1,"The EUROCALL Review, Volume 25, No. 2, Septemb...",520,{'Page_No.': 'Page 1'},"of""], [""interest among researchers in recent y...",2
2,Page 1,"The EUROCALL Review, Volume 25, No. 2, Septemb...",520,{'Page_No.': 'Page 1'},prepare them for effective usage of such devic...,3
3,Page 2,"The EUROCALL Review, Volume 25, No. 2, Septemb...",703,{'Page_No.': 'Page 2'},"The EUROCALL Review, Volume 25, No. 2, Septemb...",4
4,Page 2,"The EUROCALL Review, Volume 25, No. 2, Septemb...",703,{'Page_No.': 'Page 2'},"different forms according to the""], [""person, ...",5


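Note that the `Metadata` column still holds only the page number; the chunk id lives in the separate `Chunk_ID` column and has not been merged into the metadata.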

In [16]:
chunked_df.head()

Unnamed: 0,Page No.,Page_Text,Text_Length,Metadata,Chunks,Chunk_ID
0,Page 1,"The EUROCALL Review, Volume 25, No. 2, Septemb...",520,{'Page_No.': 'Page 1'},"The EUROCALL Review, Volume 25, No. 2, Septemb...",1
1,Page 1,"The EUROCALL Review, Volume 25, No. 2, Septemb...",520,{'Page_No.': 'Page 1'},"of""], [""interest among researchers in recent y...",2
2,Page 1,"The EUROCALL Review, Volume 25, No. 2, Septemb...",520,{'Page_No.': 'Page 1'},prepare them for effective usage of such devic...,3
3,Page 2,"The EUROCALL Review, Volume 25, No. 2, Septemb...",703,{'Page_No.': 'Page 2'},"The EUROCALL Review, Volume 25, No. 2, Septemb...",4
4,Page 2,"The EUROCALL Review, Volume 25, No. 2, Septemb...",703,{'Page_No.': 'Page 2'},"different forms according to the""], [""person, ...",5
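
If the chunk id should travel with each chunk into the vector store, one option is to build the metadata dict from both columns. The sketch below uses plain tuples instead of the DataFrame so it runs on its own; `build_chunk_metadata` is a hypothetical helper, not part of the notebook.

```python
def build_chunk_metadata(page_no, chunk_id):
    """Combine the page label and chunk id into one flat metadata dict."""
    # int() converts NumPy integers (as returned by pandas) to a plain Python int
    return {"Page_No.": page_no, "Chunk_ID": int(chunk_id)}

# Rows shaped like the first three rows of chunked_df (page label, chunk id)
rows = [("Page 1", 1), ("Page 1", 2), ("Page 1", 3)]
metadata = [build_chunk_metadata(page, cid) for page, cid in rows]
print(metadata[1])  # {'Page_No.': 'Page 1', 'Chunk_ID': 2}
```

With the DataFrame, the same helper could be applied row by row, for example `chunked_df.apply(lambda r: build_chunk_metadata(r['Page No.'], r['Chunk_ID']), axis=1)`.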


## 3. <font color = purple> Generate and Store Embeddings using SentenceTransformers and ChromaDB

In this section, we will embed the chunks in the dataframe with the SentenceTransformers model `all-MiniLM-L6-v2` (384-dimensional embeddings) and store them in a ChromaDB collection.

In [19]:
# Import the EmbeddingFunction base class from chroma; it is subclassed below to wrap the SentenceTransformers model

from chromadb.utils.embedding_functions import EmbeddingFunction

In [20]:
# Define the path where chroma collections will be stored

chroma_data_path = r'ChromaDB_Database'

In [21]:
import chromadb

In [22]:
# Call PersistentClient()

client = chromadb.PersistentClient(path=chroma_data_path)

In [None]:
# Set up the embedding function using the embedding model

model = SentenceTransformer("all-MiniLM-L6-v2")

# Create a custom embedding function class
class SentenceTransformerEmbeddingFunction(EmbeddingFunction):
    # Recent ChromaDB releases expect the parameter to be named `input`
    def __call__(self, input):
        return model.encode(input).tolist()

# Initialize the embedding function
embedding_function = SentenceTransformerEmbeddingFunction()

In [24]:
# Initialise a collection in chroma and pass the embedding_function to it so that it uses the SentenceTransformers model to embed the documents

# Create or get the collection, passing the custom embedding function
research_collection = client.get_or_create_collection(
    name="RAG_on_Research",
    embedding_function=embedding_function
)

In [25]:
# Convert the page text and metadata from your dataframe to lists to be able to pass it to chroma

documents_list = chunked_df["Chunks"].tolist()
metadata_list = chunked_df['Metadata'].tolist()

In [26]:
# Add the documents and metadata to the collection along with generic integer IDs (0-based strings, so ID '0' corresponds to Chunk_ID 1). You could instead build the IDs from the metadata, e.g. by combining the page number and the chunk number.

research_collection.add(
    documents= documents_list,
    ids = [str(i) for i in range(0, len(documents_list))],
    metadatas = metadata_list
)
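
As an alternative to the generic integer IDs, the IDs can be derived from the page labels, which makes a retrieved ID readable on its own. This is only a sketch; the ID format and the helper name `make_chunk_ids` are assumptions made for illustration.

```python
from collections import defaultdict

def make_chunk_ids(page_labels):
    """Build IDs such as 'page-1-chunk-2' from a list of page labels (one per chunk)."""
    counters = defaultdict(int)  # number of chunks seen so far for each page
    ids = []
    for label in page_labels:
        counters[label] += 1
        slug = label.lower().replace(" ", "-")
        ids.append(f"{slug}-chunk-{counters[label]}")
    return ids

print(make_chunk_ids(["Page 1", "Page 1", "Page 2"]))
# ['page-1-chunk-1', 'page-1-chunk-2', 'page-2-chunk-1']
```

With the DataFrame, the input would be `chunked_df['Page No.'].tolist()`, and the resulting list could be passed as `ids` when adding documents to the collection.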

In [27]:
# Let's take a look at the first few entries in the collection

research_collection.get(
    ids = ['0','1','2'],
    include = ['embeddings', 'documents', 'metadatas']
)

{'ids': ['0', '1', '2'],
 'embeddings': array([[-0.05353186,  0.05806484,  0.06055533, ...,  0.14469525,
         -0.02763622,  0.06009349],
        [-0.04756785, -0.02034263,  0.00397582, ...,  0.07732493,
          0.01114378,  0.01447436],
        [-0.00796748,  0.01207486,  0.08549718, ...,  0.13953213,
          0.01091361,  0.0653479 ]], shape=(3, 384)),
 'documents': ['The EUROCALL Review, Volume 25, No. 2, September 2017 [["Research paper"], [""], ["A look at advanced learners\\u2019 use of mobile devices for"], ["English language study: Insights from interview data"], ["Mariusz Kruk"], ["University of Zielona Gora, Poland"], ["______________________________________________________________"], ["mkruk @ uz.zgora.pl"], [""], ["Abstract"], ["The paper discusses the results of a study which explored advanced learners of English"], ["engagement with their mobile devices to develop learning experiences that meet their"], ["needs and goals as foreign language learners. The data were c

In [28]:
cache_collection = client.get_or_create_collection(name='Research_Cache', embedding_function=embedding_function)

In [29]:
cache_collection.peek()

{'ids': [],
 'embeddings': array([], dtype=float64),
 'documents': [],
 'uris': None,
 'data': None,
 'metadatas': [],
 'included': [<IncludeEnum.embeddings: 'embeddings'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

## 4. <font color = purple> Semantic Search with Cache

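In this section, we run a semantic search for each query against the collection's embeddings and retrieve the most similar chunks. A small cache collection is checked first, so that a repeated or near-identical query can be answered without searching the main collection again.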

In [30]:
query_1="What is covered in the Abstract section of the document?"

In [31]:
# Search the cache collection first
# Query the cache against the user query and return the single closest cached query
cache_results= cache_collection.query(
         query_texts=query_1,
          n_results=1
     )


In [32]:
print(cache_results)

{'ids': [[]], 'embeddings': None, 'documents': [[]], 'uris': None, 'data': None, 'metadatas': [[]], 'distances': [[]], 'included': [<IncludeEnum.distances: 'distances'>, <IncludeEnum.documents: 'documents'>, <IncludeEnum.metadatas: 'metadatas'>]}


In [33]:
results = research_collection.query(
query_texts=query_1,
n_results=10
)
# results.items()

In [34]:
# Implementing Cache in Semantic Search

# Set a threshold for cache search
threshold = 0.2

ids_1 = []
documents_1 = []
distances_1 = []
metadatas_1 = []
results_df_1 = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results['distances'][0] == [] or cache_results['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = research_collection.query(
      query_texts=query_1,
      n_results=10
      )

      # Store the query in cache_collection as document w.r.t to ChromaDB so that it can be embedded and searched against later
      # Store retrieved text, ids, distances and metadatas in cache_collection as metadatas, so that they can be fetched easily if a query indeed matches to a query in cache
      Keys = []
      Values = []

      for key, val in results.items():
        if val is None:
          continue
        for i in range(9):
          Keys.append(str(key)+str(i))
          Values.append(str(val[0][i]))


      cache_collection.add(
          documents= [query_1],
          ids = [query_1],  # Or if you want to assign integers as IDs 0,1,2,.., then you can use "len(cache_results['documents'])" as will return the no. of queries currently in the cache and assign the next digit to the new query."
          metadatas = dict(zip(Keys, Values))
      )

      print("Not found in cache. Found in main collection.")

      result_dict = {'Metadatas_1': results['metadatas'][0], 'Documents_1': results['documents'][0], 'Distances_1': results['distances'][0], "IDs":results["ids"][0]}
      results_df_1 = pd.DataFrame.from_dict(result_dict)
      results_df_1


# If the distance is, however, less than the threshold, you can return the results from cache

elif cache_results['distances'][0][0] <= threshold:
      cache_result_dict = cache_results['metadatas'][0][0]

      # Sort each cached entry back into its list based on the key prefix
      for key, value in cache_result_dict.items():
          if 'ids' in key:
              ids_1.append(value)
          elif 'documents' in key:
              documents_1.append(value)
          elif 'distances' in key:
              distances_1.append(value)
          elif 'metadatas' in key:
              metadatas_1.append(value)

      print("Found in cache!")

      # Create a DataFrame with the same column names as the cache-miss branch
      results_df_1 = pd.DataFrame({
        'IDs': ids_1,
        'Documents_1': documents_1,
        'Distances_1': distances_1,
        'Metadatas_1': metadatas_1
      })

Not found in cache. Found in main collection.


In [35]:
results_df_1.head()

Unnamed: 0,Metadatas_1,Documents_1,Distances_1,IDs
0,{'Page_No.': 'Page 4'},"on the problems raised""], [""during it (D\u00f6...",1.420501,13
1,{'Page_No.': 'Page 2'},"different forms according to the""], [""person, ...",1.442721,4
2,{'Page_No.': 'Page 5'},"The EUROCALL Review, Volume 25, No. 2, Septemb...",1.447016,15
3,{'Page_No.': 'Page 2'},"The EUROCALL Review, Volume 25, No. 2, Septemb...",1.578391,3
4,{'Page_No.': 'Page 3'},"built""], [""upon mobile devices whereas some ot...",1.590897,9
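
To put the distances and the 0.2 cache threshold in perspective, they can be converted to cosine similarities. This sketch rests on two assumptions that are not confirmed in the notebook: that the collection uses ChromaDB's default squared-L2 distance, and that the embeddings are unit-length. Under those assumptions, distance = 2 × (1 − cosine similarity).

```python
def cosine_from_squared_l2(distance):
    """Cosine similarity implied by a squared-L2 distance between unit vectors."""
    return 1.0 - distance / 2.0

# 0.2 is the cache threshold; 1.420501 is the best distance in the table above
for d in (0.2, 1.420501):
    print(f"distance={d:.3f} -> cosine similarity ~ {cosine_from_squared_l2(d):.3f}")
```

Read this way, a cache hit requires a cosine similarity of about 0.9 between the new query and a cached one, while the best-matching chunk for query 1 sits at roughly 0.29.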


#### For query 2

In [36]:
query_2="What does Literature Review section of the document contain?"

In [37]:
# Search the cache collection first
# Query the cache against the user query and return the single closest cached query
cache_results= cache_collection.query(
         query_texts=query_2,
          n_results=1
     )


In [38]:
cache_results

{'ids': [['What is covered in the Abstract section of the document?']],
 'embeddings': None,
 'documents': [['What is covered in the Abstract section of the document?']],
 'uris': None,
 'data': None,
 'metadatas': [[{'distances0': '1.420501251740232',
    'distances1': '1.442720875095425',
    'distances2': '1.4470158893905203',
    'distances3': '1.578391199737988',
    'distances4': '1.5908967517347046',
    'distances5': '1.5936611833578234',
    'distances6': '1.6019941410934753',
    'distances7': '1.605907653032187',
    'distances8': '1.6197981160681336',
    'documents0': 'on the problems raised"], ["during it (D\\u00f6rnyei, 2007). As D\\u00f6rnyei (2007) explains, in this type of the interview \\u201cthe"], ["interviewer provides guidelines and direction (hence the \\u2018-structured\\u2019 part in the name),"], ["but is also keen to follow up interesting developments and to let the interviewee"], ["elaborate on certain issues (hence the \\u2018semi-\\u2019 part)\\u201d (p. 

In [39]:
results = research_collection.query(
query_texts=query_2,
n_results=10
)
# results.items()

In [40]:
# Implementing Cache in Semantic Search

# Set a threshold for cache search
threshold = 0.2

ids_2 = []
documents_2 = []
distances_2 = []
metadatas_2 = []
results_df_2 = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results['distances'][0] == [] or cache_results['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = research_collection.query(
      query_texts=query_2,
      n_results=10
      )

      # Store the query in cache_collection as document w.r.t to ChromaDB so that it can be embedded and searched against later
      # Store retrieved text, ids, distances and metadatas in cache_collection as metadatas, so that they can be fetched easily if a query indeed matches to a query in cache
      Keys = []
      Values = []

      for key, val in results.items():
        if val is None:
          continue
        for i in range(9):
          Keys.append(str(key)+str(i))
          Values.append(str(val[0][i]))


      cache_collection.add(
          documents= [query_2],
          ids = [query_2],  # Or if you want to assign integers as IDs 0,1,2,.., then you can use "len(cache_results['documents'])" as will return the no. of queries currently in the cache and assign the next digit to the new query."
          metadatas = dict(zip(Keys, Values))
      )

      print("Not found in cache. Found in main collection.")

      result_dict_2 = {'Metadatas_2': results['metadatas'][0], 'Documents_2': results['documents'][0], 'Distances_2': results['distances'][0], "IDs":results["ids"][0]}
      results_df_2 = pd.DataFrame.from_dict(result_dict_2)
      results_df_2


# If the distance is, however, less than the threshold, you can return the results from cache

elif cache_results['distances'][0][0] <= threshold:
      cache_result_dict = cache_results['metadatas'][0][0]

      # Sort each cached entry back into its list based on the key prefix
      for key, value in cache_result_dict.items():
          if 'ids' in key:
              ids_2.append(value)
          elif 'documents' in key:
              documents_2.append(value)
          elif 'distances' in key:
              distances_2.append(value)
          elif 'metadatas' in key:
              metadatas_2.append(value)

      print("Found in cache!")

      # Create a DataFrame with the same column names as the cache-miss branch
      results_df_2 = pd.DataFrame({
        'IDs': ids_2,
        'Documents_2': documents_2,
        'Distances_2': distances_2,
        'Metadatas_2': metadatas_2
      })

Not found in cache. Found in main collection.


In [41]:
results_df_2.head()

Unnamed: 0,Metadatas_2,Documents_2,Distances_2,IDs
0,{'Page_No.': 'Page 5'},"The EUROCALL Review, Volume 25, No. 2, Septemb...",1.4332,15
1,{'Page_No.': 'Page 5'},"compared, modified and either merged or""], [""a...",1.4774,16
2,{'Page_No.': 'Page 6'},"The EUROCALL Review, Volume 25, No. 2, Septemb...",1.495414,18
3,{'Page_No.': 'Page 4'},"on the problems raised""], [""during it (D\u00f6...",1.514388,13
4,{'Page_No.': 'Page 6'},"[""Illustrative examples of such opinions are p...",1.525234,19


#### For query 3

In [42]:
query_3="What are the methods covered in the method section of the document?"

In [43]:
# Search the cache collection first
# Query the cache against the user query and return the single closest cached query
cache_results= cache_collection.query(
         query_texts=query_3,
          n_results=1
     )

In [44]:
cache_results

{'ids': [['What is covered in the Abstract section of the document?']],
 'embeddings': None,
 'documents': [['What is covered in the Abstract section of the document?']],
 'uris': None,
 'data': None,
 'metadatas': [[{'distances0': '1.420501251740232',
    'distances1': '1.442720875095425',
    'distances2': '1.4470158893905203',
    'distances3': '1.578391199737988',
    'distances4': '1.5908967517347046',
    'distances5': '1.5936611833578234',
    'distances6': '1.6019941410934753',
    'distances7': '1.605907653032187',
    'distances8': '1.6197981160681336',
    'documents0': 'on the problems raised"], ["during it (D\\u00f6rnyei, 2007). As D\\u00f6rnyei (2007) explains, in this type of the interview \\u201cthe"], ["interviewer provides guidelines and direction (hence the \\u2018-structured\\u2019 part in the name),"], ["but is also keen to follow up interesting developments and to let the interviewee"], ["elaborate on certain issues (hence the \\u2018semi-\\u2019 part)\\u201d (p. 

In [45]:
# Implementing Cache in Semantic Search

# Set a threshold for cache search
threshold = 0.2

ids_3 = []
documents_3 = []
distances_3 = []
metadatas_3 = []
results_df_3 = pd.DataFrame()


# If the distance is greater than the threshold, then return the results from the main collection.

if cache_results['distances'][0] == [] or cache_results['distances'][0][0] > threshold:
      # Query the collection against the user query and return the top 10 results
      results = research_collection.query(
      query_texts=query_3,
      n_results=10
      )

      # Store the query in cache_collection as document w.r.t to ChromaDB so that it can be embedded and searched against later
      # Store retrieved text, ids, distances and metadatas in cache_collection as metadatas, so that they can be fetched easily if a query indeed matches to a query in cache
      Keys = []
      Values = []

      for key, val in results.items():
        if val is None:
          continue
        for i in range(9):
          Keys.append(str(key)+str(i))
          Values.append(str(val[0][i]))


      cache_collection.add(
          documents= [query_3],
          ids = [query_3],  # Or if you want to assign integers as IDs 0,1,2,.., then you can use "len(cache_results['documents'])" as will return the no. of queries currently in the cache and assign the next digit to the new query."
          metadatas = dict(zip(Keys, Values))
      )

      print("Not found in cache. Found in main collection.")

      result_dict_3 = {'Metadatas_3': results['metadatas'][0], 'Documents_3': results['documents'][0], 'Distances_3': results['distances'][0], "IDs":results["ids"][0]}
      results_df_3 = pd.DataFrame.from_dict(result_dict_3)
      results_df_3


# If the distance is, however, less than the threshold, you can return the results from cache

elif cache_results['distances'][0][0] <= threshold:
      cache_result_dict = cache_results['metadatas'][0][0]

      # Sort each cached entry back into its list based on the key prefix
      for key, value in cache_result_dict.items():
          if 'ids' in key:
              ids_3.append(value)
          elif 'documents' in key:
              documents_3.append(value)
          elif 'distances' in key:
              distances_3.append(value)
          elif 'metadatas' in key:
              metadatas_3.append(value)

      print("Found in cache!")

      # Create a DataFrame with the same column names as the cache-miss branch
      results_df_3 = pd.DataFrame({
        'IDs': ids_3,
        'Documents_3': documents_3,
        'Distances_3': distances_3,
        'Metadatas_3': metadatas_3
      })

Not found in cache. Found in main collection.


In [46]:
results_df_3.head()

Unnamed: 0,Metadatas_3,Documents_3,Distances_3,IDs
0,{'Page_No.': 'Page 5'},"The EUROCALL Review, Volume 25, No. 2, Septemb...",1.46251,15
1,{'Page_No.': 'Page 6'},"a variety of mobile apps, such as Google""], [""...",1.474083,20
2,{'Page_No.': 'Page 4'},"on the problems raised""], [""during it (D\u00f6...",1.474174,13
3,{'Page_No.': 'Page 2'},"different forms according to the""], [""person, ...",1.48093,4
4,{'Page_No.': 'Page 5'},"compared, modified and either merged or""], [""a...",1.49422,16
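
The cache-then-search cell is repeated almost verbatim for each of the three queries. One way to avoid the repetition is to wrap the logic in a function. The sketch below is not the notebook's implementation: it stores the retrieved results as a single JSON string in the cache metadata (a different layout from the `documents0`, `distances0`, ... keys used above), and `FakeCollection` is a hypothetical in-memory stand-in so the example runs without ChromaDB. The function relies only on ChromaDB-style `query` and `add` methods.

```python
import json

def search_with_cache(query, cache_collection, main_collection,
                      threshold=0.2, n_results=10):
    """Return (results, from_cache) for a query, checking the cache first."""
    cached = cache_collection.query(query_texts=[query], n_results=1)
    distances = cached["distances"][0]

    # Cache hit: the nearest cached query is close enough to reuse its results
    if distances and distances[0] <= threshold:
        return json.loads(cached["metadatas"][0][0]["results_json"]), True

    # Cache miss: search the main collection and remember the answer
    results = main_collection.query(query_texts=[query], n_results=n_results)
    to_store = {key: results[key][0]
                for key in ("ids", "documents", "distances", "metadatas")}
    cache_collection.add(
        documents=[query],
        ids=[query],
        metadatas=[{"results_json": json.dumps(to_store)}],
    )
    return to_store, False


class FakeCollection:
    """Hypothetical stand-in for a collection (exact-match lookup only)."""
    def __init__(self, canned=None):
        self.store = {}
        self.canned = canned  # fixed answer returned by query(), if given

    def add(self, documents, ids, metadatas):
        for doc, meta in zip(documents, metadatas):
            self.store[doc] = meta

    def query(self, query_texts, n_results):
        if self.canned is not None:
            return self.canned
        q = query_texts[0]
        if q in self.store:
            return {"distances": [[0.0]], "metadatas": [[self.store[q]]]}
        return {"distances": [[]], "metadatas": [[]]}


main = FakeCollection(canned={
    "ids": [["13"]], "documents": [["chunk text"]],
    "distances": [[1.42]], "metadatas": [[{"Page_No.": "Page 4"}]],
})
cache = FakeCollection()

first, hit1 = search_with_cache("What is in the abstract?", cache, main)
second, hit2 = search_with_cache("What is in the abstract?", cache, main)
print(hit1, hit2)  # False True
```

In the notebook, `cache_collection` and `research_collection` would take the place of the two stand-ins, and the returned dict can be turned into a DataFrame with `pd.DataFrame(results)`.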


## 5. <font color = Purple> Re-Ranking with a Cross Encoder


In [47]:
# Import the CrossEncoder library from sentence_transformers

from sentence_transformers import CrossEncoder, util

In [48]:
# Initialise the cross encoder model

cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')



##### For query 1

In [49]:
# Input (query, response) pairs for each of the top  responses received from the semantic search to the cross encoder
# Generate the cross_encoder scores for these pairs

cross_inputs_1 = [[query_1, response] for response in results_df_1['Documents_1']]
cross_rerank_scores_1 = cross_encoder.predict(cross_inputs_1)

In [50]:
cross_rerank_scores_1

array([-10.551626,  -9.403101,  -9.809732,  -9.541539, -10.775994,
        -9.310375, -10.547911, -10.770069, -10.321409, -10.342968],
      dtype=float32)
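
To see how re-ranking changes the order, the scores printed above can be ranked by hand. The sketch below copies the ten scores from the output and picks the positions of the three highest ones; position `i` corresponds to the `i`-th result returned by the vector search.

```python
# Scores copied from the cross-encoder output above
scores = [-10.551626, -9.403101, -9.809732, -9.541539, -10.775994,
          -9.310375, -10.547911, -10.770069, -10.321409, -10.342968]

# Positions of the three highest scores (a higher score means more relevant)
top_3 = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:3]
print(top_3)  # [5, 1, 3]
```

The result at position 5, which the vector search ranked sixth, comes out on top after re-ranking. All ten scores are strongly negative, which suggests the cross-encoder does not consider any of the chunks a close match for the query.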

In [51]:
# Store the rerank_scores in results_df

results_df_1['Reranked_scores'] = cross_rerank_scores_1

In [52]:
results_df_1.head()

Unnamed: 0,Metadatas_1,Documents_1,Distances_1,IDs,Reranked_scores
0,{'Page_No.': 'Page 4'},"on the problems raised""], [""during it (D\u00f6...",1.420501,13,-10.551626
1,{'Page_No.': 'Page 2'},"different forms according to the""], [""person, ...",1.442721,4,-9.403101
2,{'Page_No.': 'Page 5'},"The EUROCALL Review, Volume 25, No. 2, Septemb...",1.447016,15,-9.809732
3,{'Page_No.': 'Page 2'},"The EUROCALL Review, Volume 25, No. 2, Septemb...",1.578391,3,-9.541539
4,{'Page_No.': 'Page 3'},"built""], [""upon mobile devices whereas some ot...",1.590897,9,-10.775994


In [53]:
import pandas as pd
# change the display properties of pandas to max
pd.set_option('display.max_colwidth', 800)
pd.set_option('display.max_columns', 800)
pd.set_option('display.max_rows', 5000)

In [54]:
# Return the top 3 results from semantic search

top_3_semantic_1 = results_df_1.sort_values(by='Distances_1')
top_3_semantic_1[:3]

Unnamed: 0,Metadatas_1,Documents_1,Distances_1,IDs,Reranked_scores
0,{'Page_No.': 'Page 4'},"on the problems raised""], [""during it (D\u00f6rnyei, 2007). As D\u00f6rnyei (2007) explains, in this type of the interview \u201cthe""], [""interviewer provides guidelines and direction (hence the \u2018-structured\u2019 part in the name),""], [""but is also keen to follow up interesting developments and to let the interviewee""], [""elaborate on certain issues (hence the \u2018semi-\u2019 part)\u201d (p. 136).""], [""During the interview, the present researcher attempted to encourage the subjects to""], [""describe their learning experiences concerning the use of mobile devices for English""], [""study. This was a form of introspection where the students were prompted to examine""], [""their behaviors and provide a first person narrative of such experiences. All the study""], [""participants were inf...",1.420501,13,-10.551626
1,{'Page_No.': 'Page 2'},"different forms according to the""], [""person, the setting, and multiple contextual and micro-contextual factors\u201d and it is \u201ca""], [""multi-faceted concept that consists of several layers\u201d (Reinders, 2011, p. 48) whose""], [""roots are based in political, societal and educational developments. In addition to this,""], [""work on autonomy emphasizes social dimensions of learner autonomy in view of the""], [""fact that \u201cautonomous learners always do things for themselves, but they may or may""], [""not do things on their own\u201d (Little, 2009, p. 223) and that by means of social""], [""interactions language learners \u201cdevelop a capacity to analyze, reflect upon and""], [""synthesize information to create new perspectives\u201d (Lee, 2011, p. 88). It should also be""], [""noted t...",1.442721,4,-9.403101
2,{'Page_No.': 'Page 5'},"The EUROCALL Review, Volume 25, No. 2, September 2017 [[""\uf0b7 Do you organize regular formal or informal mobile English language learning""], [""sessions?""], [""\uf0b7 What do you learn most frequently by means of your mobile device(s)? Why""], [""this?""], [""\uf0b7 Do you feel that thanks to the use of your mobile device(s) you devote more""], [""time for learning the English language?""], [""\uf0b7 As far as learning English through your mobile device(s) is concerned, do you""], [""consider yourself as an experienced user of such device(s)?""]] consider yourself as an experienced user of such device(s)? [[""The gathered data were subjected to qualitative and quantitative analysis. The analysis""], [""started with partial transcription of the important parts of the data (D\u00f6rnyei, 2007) on a""],...",1.447016,15,-9.809732


In [55]:
# Return the top 3 results after reranking

top_3_rerank_1 = results_df_1.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_1[:3]

Unnamed: 0,Metadatas_1,Documents_1,Distances_1,IDs,Reranked_scores
5,{'Page_No.': 'Page 3'},"The EUROCALL Review, Volume 25, No. 2, September 2017 [[""2.2. Autonomy and new technologies""], [""As stated in the previous section, the concept of autonomy has been one of the most""], [""researched areas in the field of second/foreign language learning and teaching over the""], [""last few decades. It should be noted, however, that the field of learner autonomy""], [""started to be influenced by technology in the mid-1990s as a result of the growing""], [""influence of the internet on almost every sphere of our life (including second/foreign""], [""language education) and the opportunities for online collaboration and communication""], [""(Reinders & White, 2016). As stated by Benson and Chik (2010), the latest generations""], [""of new technologies, particularly those encompassing the internet, us...",1.593661,7,-9.310375
1,{'Page_No.': 'Page 2'},"different forms according to the""], [""person, the setting, and multiple contextual and micro-contextual factors\u201d and it is \u201ca""], [""multi-faceted concept that consists of several layers\u201d (Reinders, 2011, p. 48) whose""], [""roots are based in political, societal and educational developments. In addition to this,""], [""work on autonomy emphasizes social dimensions of learner autonomy in view of the""], [""fact that \u201cautonomous learners always do things for themselves, but they may or may""], [""not do things on their own\u201d (Little, 2009, p. 223) and that by means of social""], [""interactions language learners \u201cdevelop a capacity to analyze, reflect upon and""], [""synthesize information to create new perspectives\u201d (Lee, 2011, p. 88). It should also be""], [""noted t...",1.442721,4,-9.403101
3,{'Page_No.': 'Page 2'},"The EUROCALL Review, Volume 25, No. 2, September 2017 [[""namely a research question, description of participants, data collection tools and""], [""analysis. This is followed by the presentation of the results of the study. The article""], [""closes with discussion and conclusions.""], [""2. Literature review""], [""2.1. Autonomy in foreign/second language learning""], [""The concept of autonomy in second/foreign language learning and teaching has been""], [""the focus of attention for many researchers and practitioners for more than three""], [""decades. According to Benson (2001), the notion of autonomy was introduced and""], [""popularized in 1981 by Henri Holec in his seminal report for the Council of Europe""], [""entitled Autonomy in Foreign Language Learning in which the researcher defined""], [""au...",1.578391,3,-9.541539


In [56]:
top_3_RAG_1 = top_3_rerank_1[["Documents_1", "Metadatas_1"]][:3]

In [57]:
query_1

'What is covered in the Abstract section of the document?'

In [58]:
top_3_RAG_1

Unnamed: 0,Documents_1,Metadatas_1
5,"The EUROCALL Review, Volume 25, No. 2, September 2017 [[""2.2. Autonomy and new technologies""], [""As stated in the previous section, the concept of autonomy has been one of the most""], [""researched areas in the field of second/foreign language learning and teaching over the""], [""last few decades. It should be noted, however, that the field of learner autonomy""], [""started to be influenced by technology in the mid-1990s as a result of the growing""], [""influence of the internet on almost every sphere of our life (including second/foreign""], [""language education) and the opportunities for online collaboration and communication""], [""(Reinders & White, 2016). As stated by Benson and Chik (2010), the latest generations""], [""of new technologies, particularly those encompassing the internet, us...",{'Page_No.': 'Page 3'}
1,"different forms according to the""], [""person, the setting, and multiple contextual and micro-contextual factors\u201d and it is \u201ca""], [""multi-faceted concept that consists of several layers\u201d (Reinders, 2011, p. 48) whose""], [""roots are based in political, societal and educational developments. In addition to this,""], [""work on autonomy emphasizes social dimensions of learner autonomy in view of the""], [""fact that \u201cautonomous learners always do things for themselves, but they may or may""], [""not do things on their own\u201d (Little, 2009, p. 223) and that by means of social""], [""interactions language learners \u201cdevelop a capacity to analyze, reflect upon and""], [""synthesize information to create new perspectives\u201d (Lee, 2011, p. 88). It should also be""], [""noted t...",{'Page_No.': 'Page 2'}
3,"The EUROCALL Review, Volume 25, No. 2, September 2017 [[""namely a research question, description of participants, data collection tools and""], [""analysis. This is followed by the presentation of the results of the study. The article""], [""closes with discussion and conclusions.""], [""2. Literature review""], [""2.1. Autonomy in foreign/second language learning""], [""The concept of autonomy in second/foreign language learning and teaching has been""], [""the focus of attention for many researchers and practitioners for more than three""], [""decades. According to Benson (2001), the notion of autonomy was introduced and""], [""popularized in 1981 by Henri Holec in his seminal report for the Council of Europe""], [""entitled Autonomy in Foreign Language Learning in which the researcher defined""], [""au...",{'Page_No.': 'Page 2'}


##### Query 2

In [59]:
# Input (query, response) pairs for each of the top responses received from the semantic search to the cross encoder
# Generate the cross_encoder scores for these pairs

cross_inputs_2 = [[query_2, response] for response in results_df_2['Documents_2']]
cross_rerank_scores_2 = cross_encoder.predict(cross_inputs_2)

In [60]:
cross_rerank_scores_2

array([ -7.062504 , -10.025257 ,  -9.446735 , -10.247514 ,  -9.3070545,
       -10.936045 , -11.400219 ,  -9.389368 ,  -7.9865794, -11.275461 ],
      dtype=float32)

In [61]:
# Store the rerank_scores in results_df

results_df_2['Reranked_scores'] = cross_rerank_scores_2
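
The pair-building, scoring, and sorting steps are repeated for each query. A minimal sketch of how they could be wrapped in one helper is shown below. The names `rerank` and `toy_score_fn` are hypothetical and not part of the notebook; `toy_score_fn` is only a stand-in so the example runs without loading a model, and in the notebook `cross_encoder.predict` would be passed instead.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(
    query: str,
    documents: Sequence[str],
    score_fn: Callable[[List[List[str]]], Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (position, score) for the top_k documents, best score first."""
    # Build one [query, document] pair per retrieved chunk
    pairs = [[query, doc] for doc in documents]
    # score_fn is expected to return one relevance score per pair
    scores = [float(s) for s in score_fn(pairs)]
    # A higher cross-encoder score means more relevant, so sort descending
    ranked = sorted(enumerate(scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Stand-in scorer for illustration only: counts query words found in the document
def toy_score_fn(pairs):
    return [sum(word in doc.lower() for word in q.lower().split()) for q, doc in pairs]


docs = ["methods and data analysis", "mobile apps", "interview methods"]
print(rerank("data analysis methods", docs, toy_score_fn, top_k=2))
```

In the notebook this would presumably be called as `rerank(query_2, list(results_df_2['Documents_2']), cross_encoder.predict)`. The returned positions refer to the row order of the results DataFrame.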

In [62]:
# Return the top 3 results from semantic search

top_3_semantic_2 = results_df_2.sort_values(by='Distances_2')
top_3_semantic_2[:3]

Unnamed: 0,Metadatas_2,Documents_2,Distances_2,IDs,Reranked_scores
0,{'Page_No.': 'Page 5'},"The EUROCALL Review, Volume 25, No. 2, September 2017 [[""\uf0b7 Do you organize regular formal or informal mobile English language learning""], [""sessions?""], [""\uf0b7 What do you learn most frequently by means of your mobile device(s)? Why""], [""this?""], [""\uf0b7 Do you feel that thanks to the use of your mobile device(s) you devote more""], [""time for learning the English language?""], [""\uf0b7 As far as learning English through your mobile device(s) is concerned, do you""], [""consider yourself as an experienced user of such device(s)?""]] consider yourself as an experienced user of such device(s)? [[""The gathered data were subjected to qualitative and quantitative analysis. The analysis""], [""started with partial transcription of the important parts of the data (D\u00f6rnyei, 2007) on a""],...",1.4332,15,-7.062504
1,{'Page_No.': 'Page 5'},"compared, modified and either merged or""], [""abandoned. It should also be noted that the obtained data were analyzed quantitatively.""], [""This type of analysis involved counting the number of the interviewees\u2019 responses and""], [""calculating percentages.""], [""4. Findings""], [""A thorough analysis of the data yielded the following thematic categories: usage of""], [""mobile devices, reasons for using mobile devices, resources and tools, mobile""], [""encounters, language practiced and study performance.""], [""4.1. Usage of mobile devices""], [""Table 1 shows the study participants\u2019 mobile devices (MobDs) usage descriptions. The""], [""table demonstrates that smartphones were the most often used mobile devices by the""], [""students. In addition, the numerical information in the table indic...",1.4774,16,-10.025257
2,{'Page_No.': 'Page 6'},"The EUROCALL Review, Volume 25, No. 2, September 2017 [["""", ""S8"", ""female"", ""smartphone"", ""2 years"", ""fairly experienced""], ["""", ""S9"", ""male"", ""smartphone"", ""4 years"", ""not very experienced""], [""3rd year"", ""S10"", ""female"", ""smartphone"", ""5 years"", ""fairly experienced""], [null, ""S11"", ""female"", ""tablet and cell phone"", ""2 years"", ""fairly experienced""], [null, ""S12"", ""female"", ""smartphone"", ""2 years"", ""not very experienced""], [""B.A."", ""S13"", ""female"", ""smartphone"", ""3 years"", ""not very experienced""], [null, ""S14"", ""male"", ""smartphone and tablet"", ""3 years"", ""experienced""], [null, ""S15"", ""female"", ""smartphone and tablet"", ""5 years"", ""fairly experienced""], ["""", ""S16"", ""female"", ""smartphone"", ""3 years"", ""not very experienced""], [null, ""S17"", ""female"", ""smartphone and tablet"", ""6 years"", ""fa...",1.495414,18,-9.446735


In [63]:
# Return the top 3 results after reranking

top_3_rerank_2 = results_df_2.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_2[:3]

Unnamed: 0,Metadatas_2,Documents_2,Distances_2,IDs,Reranked_scores
0,{'Page_No.': 'Page 5'},"The EUROCALL Review, Volume 25, No. 2, September 2017 [[""\uf0b7 Do you organize regular formal or informal mobile English language learning""], [""sessions?""], [""\uf0b7 What do you learn most frequently by means of your mobile device(s)? Why""], [""this?""], [""\uf0b7 Do you feel that thanks to the use of your mobile device(s) you devote more""], [""time for learning the English language?""], [""\uf0b7 As far as learning English through your mobile device(s) is concerned, do you""], [""consider yourself as an experienced user of such device(s)?""]] consider yourself as an experienced user of such device(s)? [[""The gathered data were subjected to qualitative and quantitative analysis. The analysis""], [""started with partial transcription of the important parts of the data (D\u00f6rnyei, 2007) on a""],...",1.4332,15,-7.062504
8,{'Page_No.': 'Page 10'},"The EUROCALL Review, Volume 25, No. 2, September 2017 [[""data collection instrument, namely the semi-structured interview which was conducted""], [""only once. Perhaps a different set of questions, their wording or a series of such""], [""interviews carried out over a particular period of time (say one academic year) may""], [""have yielded more detailed and insightful results. Despite these limitations, this study""], [""provided some insights into why and how advanced English language learners engage""], [""with their mobile devices to develop learning experiences. It should be stressed,""], [""however, that teacher involvement in creating conditions conducive to the use of mobile""], [""devices for language study may result in greater learner engagement with mobile""], [""technology (i.e. mobile de...",1.556413,33,-7.986579
4,{'Page_No.': 'Page 6'},"[""Illustrative examples of such opinions are provided below (3):""]] Illustrative examples of such opinions are provided below (3): [[""S10: It\u2019s very comfortable. I can reach for my dictionary any time I want and I""], [""don\u2019t have to carry thick books (...) The main aspect is convenience.""], [""S5: It\u2019s because I can find needed information ... it\u2019s convenient because I""], [""always carry my smartphone and I have access to the internet all the time (...)""], [""At home I also use my smartphone and I don\u2019t mind it has a small screen.""], [""S14: My tablet lets me organize things and keep my documents in one place.""], [""This is because studying English means having countless study materials (...) I""], [""can store them there (...) this also gives me easier access to them...",1.525234,19,-9.307055


In [64]:
top_3_RAG_2 = top_3_rerank_2[["Documents_2", "Metadatas_2"]][:3]

In [65]:
query_2

'What does Literature Review section of the document contain?'

In [66]:
top_3_RAG_2

Unnamed: 0,Documents_2,Metadatas_2
0,"The EUROCALL Review, Volume 25, No. 2, September 2017 [[""\uf0b7 Do you organize regular formal or informal mobile English language learning""], [""sessions?""], [""\uf0b7 What do you learn most frequently by means of your mobile device(s)? Why""], [""this?""], [""\uf0b7 Do you feel that thanks to the use of your mobile device(s) you devote more""], [""time for learning the English language?""], [""\uf0b7 As far as learning English through your mobile device(s) is concerned, do you""], [""consider yourself as an experienced user of such device(s)?""]] consider yourself as an experienced user of such device(s)? [[""The gathered data were subjected to qualitative and quantitative analysis. The analysis""], [""started with partial transcription of the important parts of the data (D\u00f6rnyei, 2007) on a""],...",{'Page_No.': 'Page 5'}
8,"The EUROCALL Review, Volume 25, No. 2, September 2017 [[""data collection instrument, namely the semi-structured interview which was conducted""], [""only once. Perhaps a different set of questions, their wording or a series of such""], [""interviews carried out over a particular period of time (say one academic year) may""], [""have yielded more detailed and insightful results. Despite these limitations, this study""], [""provided some insights into why and how advanced English language learners engage""], [""with their mobile devices to develop learning experiences. It should be stressed,""], [""however, that teacher involvement in creating conditions conducive to the use of mobile""], [""devices for language study may result in greater learner engagement with mobile""], [""technology (i.e. mobile de...",{'Page_No.': 'Page 10'}
4,"[""Illustrative examples of such opinions are provided below (3):""]] Illustrative examples of such opinions are provided below (3): [[""S10: It\u2019s very comfortable. I can reach for my dictionary any time I want and I""], [""don\u2019t have to carry thick books (...) The main aspect is convenience.""], [""S5: It\u2019s because I can find needed information ... it\u2019s convenient because I""], [""always carry my smartphone and I have access to the internet all the time (...)""], [""At home I also use my smartphone and I don\u2019t mind it has a small screen.""], [""S14: My tablet lets me organize things and keep my documents in one place.""], [""This is because studying English means having countless study materials (...) I""], [""can store them there (...) this also gives me easier access to them...",{'Page_No.': 'Page 6'}


##### Query 3

In [67]:
# Input (query, response) pairs for each of the top responses received from the semantic search to the cross encoder
# Generate the cross_encoder scores for these pairs

cross_inputs_3 = [[query_3, response] for response in results_df_3['Documents_3']]
cross_rerank_scores_3 = cross_encoder.predict(cross_inputs_3)

In [68]:
cross_rerank_scores_3

array([-10.051212, -11.329065, -10.870022, -10.207796, -10.456713,
       -10.905957, -11.150194, -11.128535, -11.413888, -11.122671],
      dtype=float32)

In [69]:
# Store the rerank_scores in results_df

results_df_3['Reranked_scores'] = cross_rerank_scores_3

In [70]:
# Return the top 3 results from semantic search

top_3_semantic_3 = results_df_3.sort_values(by='Distances_3')
top_3_semantic_3[:3]

Unnamed: 0,Metadatas_3,Documents_3,Distances_3,IDs,Reranked_scores
0,{'Page_No.': 'Page 5'},"The EUROCALL Review, Volume 25, No. 2, September 2017 [[""\uf0b7 Do you organize regular formal or informal mobile English language learning""], [""sessions?""], [""\uf0b7 What do you learn most frequently by means of your mobile device(s)? Why""], [""this?""], [""\uf0b7 Do you feel that thanks to the use of your mobile device(s) you devote more""], [""time for learning the English language?""], [""\uf0b7 As far as learning English through your mobile device(s) is concerned, do you""], [""consider yourself as an experienced user of such device(s)?""]] consider yourself as an experienced user of such device(s)? [[""The gathered data were subjected to qualitative and quantitative analysis. The analysis""], [""started with partial transcription of the important parts of the data (D\u00f6rnyei, 2007) on a""],...",1.46251,15,-10.051212
1,{'Page_No.': 'Page 6'},"a variety of mobile apps, such as Google""], [""Translate, Duolingo and Fiszkoteka. The students usually accessed these tools in order""], [""to check, revise and learn the target language vocabulary. Two students also reported""], [""using Voscreen and WhatsApp, i.e. mobile apps for watching video and communicating""], [""with people, respectively. It should also be noted that the interviewees pointed out""], [""various online resources they used with the purpose of practicing reading and listening""], [""skills (e.g. TED, online newspapers, YouTube), vocabulary (e.g. 6 Minute""]] skills (e.g. TED, online newspapers, YouTube), vocabulary (e.g. 6 Minute 23",1.474083,20,-11.329065
2,{'Page_No.': 'Page 4'},"on the problems raised""], [""during it (D\u00f6rnyei, 2007). As D\u00f6rnyei (2007) explains, in this type of the interview \u201cthe""], [""interviewer provides guidelines and direction (hence the \u2018-structured\u2019 part in the name),""], [""but is also keen to follow up interesting developments and to let the interviewee""], [""elaborate on certain issues (hence the \u2018semi-\u2019 part)\u201d (p. 136).""], [""During the interview, the present researcher attempted to encourage the subjects to""], [""describe their learning experiences concerning the use of mobile devices for English""], [""study. This was a form of introspection where the students were prompted to examine""], [""their behaviors and provide a first person narrative of such experiences. All the study""], [""participants were inf...",1.474174,13,-10.870022


In [71]:
# Return the top 3 results after reranking

top_3_rerank_3 = results_df_3.sort_values(by='Reranked_scores', ascending=False)
top_3_rerank_3[:3]

Unnamed: 0,Metadatas_3,Documents_3,Distances_3,IDs,Reranked_scores
0,{'Page_No.': 'Page 5'},"The EUROCALL Review, Volume 25, No. 2, September 2017 [[""\uf0b7 Do you organize regular formal or informal mobile English language learning""], [""sessions?""], [""\uf0b7 What do you learn most frequently by means of your mobile device(s)? Why""], [""this?""], [""\uf0b7 Do you feel that thanks to the use of your mobile device(s) you devote more""], [""time for learning the English language?""], [""\uf0b7 As far as learning English through your mobile device(s) is concerned, do you""], [""consider yourself as an experienced user of such device(s)?""]] consider yourself as an experienced user of such device(s)? [[""The gathered data were subjected to qualitative and quantitative analysis. The analysis""], [""started with partial transcription of the important parts of the data (D\u00f6rnyei, 2007) on a""],...",1.46251,15,-10.051212
3,{'Page_No.': 'Page 2'},"different forms according to the""], [""person, the setting, and multiple contextual and micro-contextual factors\u201d and it is \u201ca""], [""multi-faceted concept that consists of several layers\u201d (Reinders, 2011, p. 48) whose""], [""roots are based in political, societal and educational developments. In addition to this,""], [""work on autonomy emphasizes social dimensions of learner autonomy in view of the""], [""fact that \u201cautonomous learners always do things for themselves, but they may or may""], [""not do things on their own\u201d (Little, 2009, p. 223) and that by means of social""], [""interactions language learners \u201cdevelop a capacity to analyze, reflect upon and""], [""synthesize information to create new perspectives\u201d (Lee, 2011, p. 88). It should also be""], [""noted t...",1.48093,4,-10.207796
4,{'Page_No.': 'Page 5'},"compared, modified and either merged or""], [""abandoned. It should also be noted that the obtained data were analyzed quantitatively.""], [""This type of analysis involved counting the number of the interviewees\u2019 responses and""], [""calculating percentages.""], [""4. Findings""], [""A thorough analysis of the data yielded the following thematic categories: usage of""], [""mobile devices, reasons for using mobile devices, resources and tools, mobile""], [""encounters, language practiced and study performance.""], [""4.1. Usage of mobile devices""], [""Table 1 shows the study participants\u2019 mobile devices (MobDs) usage descriptions. The""], [""table demonstrates that smartphones were the most often used mobile devices by the""], [""students. In addition, the numerical information in the table indic...",1.49422,16,-10.456713


In [72]:
top_3_RAG_3 = top_3_rerank_3[["Documents_3", "Metadatas_3"]][:3]

In [73]:
query_3

'What are the methods covered in the method section of the document?'

In [74]:
top_3_RAG_3

Unnamed: 0,Documents_3,Metadatas_3
0,"The EUROCALL Review, Volume 25, No. 2, September 2017 [[""\uf0b7 Do you organize regular formal or informal mobile English language learning""], [""sessions?""], [""\uf0b7 What do you learn most frequently by means of your mobile device(s)? Why""], [""this?""], [""\uf0b7 Do you feel that thanks to the use of your mobile device(s) you devote more""], [""time for learning the English language?""], [""\uf0b7 As far as learning English through your mobile device(s) is concerned, do you""], [""consider yourself as an experienced user of such device(s)?""]] consider yourself as an experienced user of such device(s)? [[""The gathered data were subjected to qualitative and quantitative analysis. The analysis""], [""started with partial transcription of the important parts of the data (D\u00f6rnyei, 2007) on a""],...",{'Page_No.': 'Page 5'}
3,"different forms according to the""], [""person, the setting, and multiple contextual and micro-contextual factors\u201d and it is \u201ca""], [""multi-faceted concept that consists of several layers\u201d (Reinders, 2011, p. 48) whose""], [""roots are based in political, societal and educational developments. In addition to this,""], [""work on autonomy emphasizes social dimensions of learner autonomy in view of the""], [""fact that \u201cautonomous learners always do things for themselves, but they may or may""], [""not do things on their own\u201d (Little, 2009, p. 223) and that by means of social""], [""interactions language learners \u201cdevelop a capacity to analyze, reflect upon and""], [""synthesize information to create new perspectives\u201d (Lee, 2011, p. 88). It should also be""], [""noted t...",{'Page_No.': 'Page 2'}
4,"compared, modified and either merged or""], [""abandoned. It should also be noted that the obtained data were analyzed quantitatively.""], [""This type of analysis involved counting the number of the interviewees\u2019 responses and""], [""calculating percentages.""], [""4. Findings""], [""A thorough analysis of the data yielded the following thematic categories: usage of""], [""mobile devices, reasons for using mobile devices, resources and tools, mobile""], [""encounters, language practiced and study performance.""], [""4.1. Usage of mobile devices""], [""Table 1 shows the study participants\u2019 mobile devices (MobDs) usage descriptions. The""], [""table demonstrates that smartphones were the most often used mobile devices by the""], [""students. In addition, the numerical information in the table indic...",{'Page_No.': 'Page 5'}


## 6. Retrieval Augmented Generation


In [None]:
import google.generativeai as genai
import os

# NOTE: despite its name, this environment variable must hold the Gemini API key
API_KEY = os.getenv("OPENAI_API_KEY")

# Set up Gemini API key
genai.configure(api_key=API_KEY)


def generate_response(query, top_3_RAG):
    """
    Generate a response using a Gemini model for a research paper-based semantic search system.
    """

    model = genai.GenerativeModel("gemini-2.0-flash-lite")

    messages = f"""
    You are an AI assistant specialized in analyzing and summarizing research papers written by domain specialists. 
    Your goal is to provide **accurate, well-structured, and research-backed answers** based on the retrieved documents.

    **User Query:** "{query}"

    **Retrieved Research Snippets:**  
    {top_3_RAG}

    ### **Response Guidelines:**
    1. **Use the provided document snippets** to construct a well-informed, factually correct response.  
    2. **Maintain scientific accuracy** and avoid adding unsupported claims.  
    3. If relevant information is present in a **tabular format**, reformat it for **better readability**.  
    4. **Cite sources** by mentioning the research paper title and page numbers.  
    5. If the query **is not relevant** to the retrieved documents, explicitly state:  
       - `"No relevant information was found in the retrieved research papers."`  

    ### **Expected Response Format:**
    - **Answer:** [Provide a clear, concise, and well-supported response]
    - **Citations:** [List the relevant research paper names and page numbers]
    
    Ensure that the response is **structured, professional, and easy to understand**.
    """

    # Generate the response with the Gemini model
    response = model.generate_content(messages)

    return response.text.strip().split("\n")
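
Because the prompt interpolates a DataFrame directly, the model receives pandas' string rendering of the table, which may be shortened by display options such as the column width. A minimal sketch of building the context explicitly is shown below. The helper name `build_context` and the snippet label format are assumptions for illustration, not part of the notebook, and the metadata is assumed to be a dict with a `Page_No.` key as in the outputs above.

```python
from typing import Iterable, Mapping, Union


def build_context(
    documents: Iterable[str],
    metadatas: Iterable[Union[Mapping[str, str], str]],
) -> str:
    """Join retrieved chunks into one plain-text block, each labelled with its page."""
    parts = []
    for number, (doc, meta) in enumerate(zip(documents, metadatas), start=1):
        # Metadata is assumed to be a dict like {'Page_No.': 'Page 5'}; fall back to str()
        page = meta.get("Page_No.", "unknown page") if isinstance(meta, Mapping) else str(meta)
        parts.append(f"[Snippet {number} | {page}]\n{doc}")
    return "\n\n".join(parts)


context = build_context(
    ["First chunk of text.", "Second chunk of text."],
    [{"Page_No.": "Page 5"}, {"Page_No.": "Page 2"}],
)
print(context)
```

The resulting string could then be passed in place of the DataFrame, for example `generate_response(query_3, build_context(top_3_RAG_3["Documents_3"], top_3_RAG_3["Metadatas_3"]))`, so that the full chunk text and page labels reach the model regardless of display settings.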


In [80]:
query_1

'What is covered in the Abstract section of the document?'

In [81]:
# Generate the response - For Query 1

response = generate_response(query_1, top_3_RAG_1)
# Print the query and response in a well-formatted manner
print("Query 1:\n", query_1)
print("_" * 120)  # Separator line
print(response)

Query 1:
 What is covered in the Abstract section of the document?
________________________________________________________________________________________________________________________
['- **Answer:** The provided document snippets do not contain information about what is covered in the Abstract section.', '- **Citations:** None']


##### Query 2

In [82]:
query_2

'What does Literature Review section of the document contain?'

In [83]:
# Generate the response - For Query 2

response = generate_response(query_2, top_3_RAG_2)
# Print the query and response in a well-formatted manner
print("Query 2:\n", query_2)
print("_" * 120)  # Separator line
print(response)

Query 2:
 What does Literature Review section of the document contain?
________________________________________________________________________________________________________________________
['No relevant information was found in the retrieved research papers.']


##### Query 3

In [84]:
query_3

'What are the methods covered in the method section of the document?'

In [85]:
# Generate the response - For Query 3

response = generate_response(query_3, top_3_RAG_3)
# Print the query and response in a well-formatted manner
print("Query 3:\n", query_3)
print("_" * 120)  # Separator line
print(response)

Query 3:
 What are the methods covered in the method section of the document?
________________________________________________________________________________________________________________________
["- **Answer:** The method section of the research paper includes both qualitative and quantitative analysis of the gathered data. The qualitative analysis involved partial transcription of the data, while the quantitative analysis involved counting the number of interviewees' responses and calculating percentages (Page 5).", '- **Citations:** The EUROCALL Review, Volume 25, No. 2, September 2017 (Page 5)']
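
The three generation cells share the same printing logic and show the response as a raw Python list. A small sketch of a helper that numbers the query and joins the response lines into readable text is given below. The name `render_response` is hypothetical, and it assumes the response is the list of lines returned by `generate_response`.

```python
from typing import List


def render_response(query_number: int, query: str, response_lines: List[str]) -> str:
    """Return the query header, a separator, and the response as readable text."""
    header = f"Query {query_number}:\n {query}"
    separator = "_" * 120
    # Drop empty lines left over from splitting the model output on newlines
    body = "\n".join(line for line in response_lines if line.strip())
    return f"{header}\n{separator}\n{body}"


example = render_response(
    3,
    "Example question?",
    ["- **Answer:** Example answer.", "", "- **Citations:** None"],
)
print(example)
```

Passing the query number as an argument keeps the printed label in step with the query being shown.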
