<a href="https://colab.research.google.com/github/Oumayma-O/Information_retrieval_sys/blob/main/SRI.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Overview: Information Retrieval System**

This project focuses on building an **Information Retrieval (IR) system** using the **vector space model**. The goal is to efficiently retrieve relevant documents from a collection based on user queries. The system incorporates various stages of preprocessing, indexing, and retrieval to demonstrate the core principles of IR.

#### **Key Components**

1. **Collecting Data**

2. **Data Preprocessing**

3. **Inverted Index Construction**:
   - Creating a structured dictionary where each term maps to the documents it appears in.
   - Including additional information like term frequency in each document.
   - Sorting terms alphabetically for efficient querying.

4. **Query Processing and Retrieval**:
   - Implementing a vector space model for ranking documents based on relevance.
   - Testing the system on manually crafted queries to evaluate performance.
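
To make the ranking step concrete, here is a minimal sketch of cosine-similarity ranking in the vector space model. The function names `cosine_similarity` and `rank_documents` and the toy weights are illustrative assumptions, not the notebook's actual implementation; documents and the query are represented as sparse `{term: weight}` dictionaries.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs by decreasing similarity, dropping zero scores."""
    scored = ((doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in doc_vecs.items())
    return sorted((pair for pair in scored if pair[1] > 0), key=lambda pair: pair[1], reverse=True)


# Hypothetical term weights for three tiny documents
doc_vecs = {
    1: {"machine": 0.5, "learn": 0.8},
    2: {"robot": 0.9, "machine": 0.2},
    3: {"ethic": 1.0},
}
query_vec = {"machine": 1.0, "learn": 1.0}

ranking = rank_documents(query_vec, doc_vecs)
print(ranking)
```

Document 1 shares both query terms and ranks first, document 2 shares one term, and document 3 shares none and is dropped.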

#### **Objective**
The ultimate aim is to demonstrate the functionality of an IR system, providing practical insights into how search engines and other retrieval systems operate at a fundamental level.




# Importing packages

In [None]:
!pip install langchain_community sentence_transformers -q


In [None]:
%pip install -qU langchain-google-genai


In [None]:
!pip install nltk
!pip install twython
!pip install textblob

Collecting twython
  Downloading twython-3.9.1-py3-none-any.whl.metadata (20 kB)
Downloading twython-3.9.1-py3-none-any.whl (33 kB)
Installing collected packages: twython
Successfully installed twython-3.9.1


In [None]:
# Standard library
import json
import math
import os
import re
from collections import Counter, defaultdict

# Third-party
import nltk
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from google.colab import drive
from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import JsonOutputParser
from langchain_google_genai import ChatGoogleGenerativeAI
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from textblob import TextBlob

# Document Collection Creation

This section covers the creation and preparation of the document collection that serves as the foundation for the Information Retrieval (IR) system. The process involves generating a set of URLs pointing to relevant articles, scraping content from those URLs, and structuring the data into a consistent format for subsequent processing.

#### **Steps and Processes**

1. **Curated URL Generation**:
   - Used large language models (OpenAI GPT and Gemini Flash) to generate a list of Wikipedia URLs covering topics and subtopics related to Artificial Intelligence (AI).
   - Topics included:
     - Machine Learning
     - Deep Learning
     - Natural Language Processing
     - AI Ethics
     - AI Applications
     - AI in Robotics
     - Generative AI
     - Self-driving Cars
     - AI Trends and Innovations

2. **Web Scraping**:
   - Scraped content from the generated Wikipedia URLs using `requests` and `BeautifulSoup`.
   - Extracted the paragraph text of each article and structured it as a collection of documents.

3. **Document Numbering and Storage**:
   - Assigned a unique document number to each article for indexing purposes.
   - Saved the document collection as a JSON file to ensure reproducibility and ease of access for subsequent stages of the project.

4. **Outcome**:
   - Built a document collection of 103 AI-related articles, each stored with its URL, text content, and document number; 6 of the 109 generated URLs returned HTTP 404 and were skipped.
   - Saved the collection to `/content/drive/MyDrive/sri-doc-collection/articles.json` for later stages.
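
Because the URL list comes from a language model, it can contain duplicates or malformed entries. The following is a minimal sketch of a pre-scrape sanity filter; the helper name `filter_wikipedia_urls` and the sample list are hypothetical and not part of the notebook. It only checks the shape of each URL, so links to pages that do not exist still have to be caught by the HTTP status check during scraping.

```python
from urllib.parse import urlparse


def filter_wikipedia_urls(candidates):
    """Keep well-formed English Wikipedia article URLs, dropping duplicates in order."""
    seen = set()
    kept = []
    for url in candidates:
        if not isinstance(url, str):
            continue
        parsed = urlparse(url.strip())
        is_article = (
            parsed.scheme == "https"
            and parsed.netloc == "en.wikipedia.org"
            and parsed.path.startswith("/wiki/")
            and len(parsed.path) > len("/wiki/")
        )
        # Normalizing to scheme + host + path discards query strings and fragments
        normalized = f"https://en.wikipedia.org{parsed.path}"
        if is_article and normalized not in seen:
            seen.add(normalized)
            kept.append(normalized)
    return kept


# Hypothetical model output with typical problems
candidates = [
    "https://en.wikipedia.org/wiki/Machine_learning",
    "https://en.wikipedia.org/wiki/Machine_learning",  # duplicate
    "https://example.com/not-wikipedia",                # wrong host
    "https://en.wikipedia.org/wiki/",                   # no article title
    None,                                               # not a string
]
print(filter_wikipedia_urls(candidates))
```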


## LangChain & Gemini

In [None]:
drive.mount('/content/drive')

env_path = "/content/drive/MyDrive/sri-doc-collection/.env.txt"

load_dotenv(env_path)

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",
    temperature=0,
    max_tokens=None,
    timeout=None,
    max_retries=2,
    # other params...
)

In [None]:
prompt = PromptTemplate(
    template="""You are an assistant for gathering valid and trusted URLs specifically from Wikipedia about a topic and its subtopics.

Your task is to generate a list of 120 valid, up-to-date, and public Wikipedia URLs from January 2025. These links should be accessible without requiring permissions and focus on the following topic and its subtopics:

Topic: {topic}
Subtopics include but are not limited to:
- Machine Learning
- Deep Learning
- Natural Language Processing
- AI Ethics
- AI Applications
- AI in Robotics
- Generative AI
- Self-driving Cars
- AI Trends and Innovations

**Output**:
- Only provide a valid JSON-formatted list with no additional explanations, comments, or placeholders.
- Example:
```json
[
  "https://en.wikipedia.org/wiki/Artificial_intelligence",
  "https://en.wikipedia.org/wiki/Machine_learning",
  ...
]
    """,
    input_variables=["topic"],
)

urls_chain = prompt | llm | JsonOutputParser()

topic = "Artificial Intelligence"
output = urls_chain.invoke({"topic": topic})



In [None]:
output

['https://en.wikipedia.org/wiki/Artificial_intelligence',
 'https://en.wikipedia.org/wiki/Machine_learning',
 'https://en.wikipedia.org/wiki/Deep_learning',
 'https://en.wikipedia.org/wiki/Natural_language_processing',
 'https://en.wikipedia.org/wiki/Ethics_of_artificial_intelligence',
 'https://en.wikipedia.org/wiki/Applications_of_artificial_intelligence',
 'https://en.wikipedia.org/wiki/Robotics',
 'https://en.wikipedia.org/wiki/Generative_artificial_intelligence',
 'https://en.wikipedia.org/wiki/Self-driving_car',
 'https://en.wikipedia.org/wiki/Artificial_intelligence_in_healthcare',
 'https://en.wikipedia.org/wiki/Artificial_intelligence_in_finance',
 'https://en.wikipedia.org/wiki/Artificial_intelligence_in_education',
 'https://en.wikipedia.org/wiki/Artificial_intelligence_in_video_games',
 'https://en.wikipedia.org/wiki/Artificial_intelligence_in_the_military',
 'https://en.wikipedia.org/wiki/Computer_vision',
 'https://en.wikipedia.org/wiki/Reinforcement_learning',
 'https://

In [None]:
len(output)

109

In [None]:
urls = output

for idx, url in enumerate(urls):
    print(f"{idx + 1}: {url}")

1: https://en.wikipedia.org/wiki/Artificial_intelligence
2: https://en.wikipedia.org/wiki/Machine_learning
3: https://en.wikipedia.org/wiki/Deep_learning
4: https://en.wikipedia.org/wiki/Natural_language_processing
5: https://en.wikipedia.org/wiki/Ethics_of_artificial_intelligence
6: https://en.wikipedia.org/wiki/Applications_of_artificial_intelligence
7: https://en.wikipedia.org/wiki/Robotics
8: https://en.wikipedia.org/wiki/Generative_artificial_intelligence
9: https://en.wikipedia.org/wiki/Self-driving_car
10: https://en.wikipedia.org/wiki/Artificial_intelligence_in_healthcare
11: https://en.wikipedia.org/wiki/Artificial_intelligence_in_finance
12: https://en.wikipedia.org/wiki/Artificial_intelligence_in_education
13: https://en.wikipedia.org/wiki/Artificial_intelligence_in_video_games
14: https://en.wikipedia.org/wiki/Artificial_intelligence_in_the_military
15: https://en.wikipedia.org/wiki/Computer_vision
16: https://en.wikipedia.org/wiki/Reinforcement_learning
17: https://en.wiki

In [None]:
articles = []
for url in urls:
    try:
        # Fetch the page; raise on 4xx/5xx so dead links are reported below
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')

        # Keep only the paragraph text as the article body
        paragraphs = soup.find_all('p')
        article_text = " ".join(para.get_text() for para in paragraphs)
        articles.append({"url": url, "content": article_text})
    except Exception as e:
        print(f"Failed to scrape {url}: {e}")

# Write a local copy of the scraped articles
with open("articles.json", "w") as f:
    json.dump(articles, f, indent=4)


Failed to scrape https://en.wikipedia.org/wiki/Artificial_intelligence_in_education: 404 Client Error: Not Found for url: https://en.wikipedia.org/wiki/Artificial_intelligence_in_education
Failed to scrape https://en.wikipedia.org/wiki/Artificial_intelligence_in_the_military: 404 Client Error: Not Found for url: https://en.wikipedia.org/wiki/Artificial_intelligence_in_the_military
Failed to scrape https://en.wikipedia.org/wiki/Microsoft_AI: 404 Client Error: Not Found for url: https://en.wikipedia.org/wiki/Microsoft_AI
Failed to scrape https://en.wikipedia.org/wiki/AI_writing: 404 Client Error: Not Found for url: https://en.wikipedia.org/wiki/AI_writing
Failed to scrape https://en.wikipedia.org/wiki/Image_generation: 404 Client Error: Not Found for url: https://en.wikipedia.org/wiki/Image_generation
Failed to scrape https://en.wikipedia.org/wiki/Video_generation: 404 Client Error: Not Found for url: https://en.wikipedia.org/wiki/Video_generation


In [None]:
len(articles)

103

In [None]:
articles

[{'url': 'https://en.wikipedia.org/wiki/Artificial_intelligence',
  'content': '\n Artificial intelligence (AI), in its broadest sense, is intelligence exhibited by machines, particularly computer systems. It is a field of research in computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.[1] Such machines may be called AIs.\n High-profile applications of AI include advanced web search engines (e.g., Google Search); recommendation systems (used by YouTube, Amazon, and Netflix); virtual assistants (e.g., Google Assistant, Siri, and Alexa); autonomous vehicles (e.g., Waymo); generative and creative tools (e.g., ChatGPT and AI art); and superhuman play and analysis in strategy games (e.g., chess and Go). However, many AI applications are not perceived as AI: "A lot of cutting edge AI has filtered into general applications, oft

In [None]:
drive.mount('/content/drive')

drive_path = "/content/drive/MyDrive/sri-doc-collection"
os.makedirs(drive_path, exist_ok=True)
file_path = os.path.join(drive_path, "articles.json")

In [None]:
for i, article in enumerate(articles, start=1):
    article["doc_number"] = i

In [None]:
article

{'url': 'https://en.wikipedia.org/wiki/Named-entity_recognition',
 'content': 'Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.\n Most research on NER/NEE systems has been structured as taking an unannotated block of text, such as this one:\n Jim bought 300 shares of Acme Corp. in 2006. And producing an annotated block of text that highlights the names of entities:\n [Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time. In this example, a person name consisting of one token, a two-token company name and a temporal expression have been detected and classified.\n State-of-the-art NER systems for English produce near-human performanc

In [None]:
with open(file_path, "w") as f:
    json.dump(articles, f, indent=4)

print(f"Scraped articles saved to {file_path}")

Scraped articles saved to /content/drive/MyDrive/sri-doc-collection/articles.json


# Article preprocessing

1. **Text Cleaning**:
   - Defines a `clean_text` function that lowercases text and removes punctuation, numbers, and newline characters.

2. **Stopwords Removal**:
   - Downloads the required NLTK resources.
   - Defines a `remove_stopwords` function to remove English stopwords using NLTK's list.

3. **Tokenization**:
   - Tokenizes the text in the `filtered_content` column with `word_tokenize` from `nltk.tokenize`.

4. **POS Tagging**:
   - Applies Part-of-Speech (POS) tagging to the tokenized text using NLTK.

5. **Lemmatization**:
   - Defines a function `apply_lemmatization_on_tokens` that lemmatizes tokens using POS tags.
   - Maps NLTK POS tags to WordNet POS tags for more accurate lemmatization.
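
The tag mapping in the last step can be illustrated on its own. This is a minimal, table-driven sketch assuming Penn Treebank tags as produced by NLTK's `pos_tag`; the names `PENN_PREFIX_TO_WORDNET` and `penn_to_wordnet` are hypothetical. The returned one-letter codes are the part-of-speech values that `WordNetLemmatizer.lemmatize` accepts.

```python
# First letter of a Penn Treebank tag -> WordNet POS code
PENN_PREFIX_TO_WORDNET = {
    "J": "a",  # adjectives: JJ, JJR, JJS
    "V": "v",  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
    "N": "n",  # nouns: NN, NNS, NNP, NNPS
    "R": "r",  # adverbs: RB, RBR, RBS
}


def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag such as 'VBG' to a WordNet POS code, defaulting to noun."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:1], default)


# Hypothetical tagged tokens, in the (word, tag) format returned by pos_tag
tagged = [("running", "VBG"), ("better", "RBR"), ("dogs", "NNS"), ("the", "DT")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

Tags outside the four families, such as `DT`, fall back to the noun code, which is also the lemmatizer's own default.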



In [None]:
drive.mount('/content/drive')

drive_path = "/content/drive/MyDrive/sri-doc-collection"
os.makedirs(drive_path, exist_ok=True)
file_path = os.path.join(drive_path, "articles.json")

if os.path.exists(file_path):
    print(f"File exists at {file_path}. Loading articles...")
    with open(file_path, "r") as f:
        articles = json.load(f)
    print(f"Loaded {len(articles)} articles from file.")
else:
    print("File does not exist. Proceeding with scraping...")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
File exists at /content/drive/MyDrive/sri-doc-collection/articles.json. Loading articles...
Loaded 103 articles from file.


In [None]:
df = pd.DataFrame(articles)

In [None]:
df

Unnamed: 0,url,content,doc_number
0,https://en.wikipedia.org/wiki/Artificial_intel...,"\n Artificial intelligence (AI), in its broade...",1
1,https://en.wikipedia.org/wiki/Machine_learning,Machine learning (ML) is a field of study in a...,2
2,https://en.wikipedia.org/wiki/Deep_learning,Deep learning is a subset of machine learning ...,3
3,https://en.wikipedia.org/wiki/Natural_language...,Natural language processing (NLP) is a subfiel...,4
4,https://en.wikipedia.org/wiki/Ethics_of_artifi...,\n The ethics of artificial intelligence cover...,5
...,...,...,...
98,https://en.wikipedia.org/wiki/Speech-to-text,\nSpeech recognition is an interdisciplinary s...,99
99,https://en.wikipedia.org/wiki/Machine_translation,\n Machine translation is use of computational...,100
100,https://en.wikipedia.org/wiki/Question_answering,Question answering (QA) is a computer science ...,101
101,https://en.wikipedia.org/wiki/Text_summarization,Automatic summarization is the process of shor...,102


In [None]:
isinstance(df['content'][1], str)

True

In [None]:
type(df['content'])

In [None]:
file_path = f"{drive_path}/articles.csv"
df.to_csv(file_path, index=False)
print(f"DataFrame saved to {file_path}")

DataFrame saved to /content/drive/MyDrive/sri-doc-collection/articles.csv


In [None]:
drive.mount('/content/drive')

drive_path = "/content/drive/MyDrive/sri-doc-collection"
file_path = f"{drive_path}/articles.csv"

if os.path.exists(file_path):
    print(f"File found at {file_path}. Loading...")
    df = pd.read_csv(file_path)
    print("DataFrame loaded successfully:")
else:
    print(f"File not found at {file_path}. Please check the path or create the file.")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
File found at /content/drive/MyDrive/sri-doc-collection/articles.csv. Loading...
DataFrame loaded successfully:


In [None]:
df

Unnamed: 0,url,content,doc_number
0,https://en.wikipedia.org/wiki/Artificial_intel...,"\n Artificial intelligence (AI), in its broade...",1
1,https://en.wikipedia.org/wiki/Machine_learning,Machine learning (ML) is a field of study in a...,2
2,https://en.wikipedia.org/wiki/Deep_learning,Deep learning is a subset of machine learning ...,3
3,https://en.wikipedia.org/wiki/Natural_language...,Natural language processing (NLP) is a subfiel...,4
4,https://en.wikipedia.org/wiki/Ethics_of_artifi...,\n The ethics of artificial intelligence cover...,5
...,...,...,...
98,https://en.wikipedia.org/wiki/Speech-to-text,\nSpeech recognition is an interdisciplinary s...,99
99,https://en.wikipedia.org/wiki/Machine_translation,\n Machine translation is use of computational...,100
100,https://en.wikipedia.org/wiki/Question_answering,Question answering (QA) is a computer science ...,101
101,https://en.wikipedia.org/wiki/Text_summarization,Automatic summarization is the process of shor...,102


## Text cleaning

In [None]:
def clean_text(text):
    """
    Clean and preprocess text data.
    This function performs several cleaning operations:
    - Lowercases the text (Case Folding)
    - Removes punctuation, replacing hyphens with space
    - Removes numbers
    - Removes newline characters
    - Removes underscores
    - Removes lone characters (length 1 words)
    - Removes leading and trailing spaces

    Parameters:
    text (str): A string containing text data.
    Returns:
    str: A cleaned text string.
    """
    if not isinstance(text, str):
        return text

    # Lowercase the text
    text = text.lower()

    # Replace hyphens with space
    text = re.sub(r'-', ' ', text)

    # Remove underscores
    text = re.sub(r'_', '', text)

    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)

    # Remove numbers
    text = re.sub(r'\d', '', text)

    # Remove newlines
    text = re.sub(r'\n', ' ', text)

    # Remove lone characters (length 1 words)
    text = re.sub(r'\b\w{1}\b', '', text)

    # Remove any remaining bracket characters (their contents are kept);
    # the punctuation step above already strips these, so this is a safeguard
    text = re.sub(r'[\(\)\{\}\[\]\<\>]', '', text)

    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text)

    # Remove leading/trailing spaces
    text = text.strip()

    return text

text = "I am a test example with a lone character x and above-mentioned _word-to-check_ an example (with some parentheses) and  extra    spaces."
cleaned_text = clean_text(text)
print(cleaned_text)


am test example with lone character and above mentioned word to check an example with some parentheses and extra spaces


In [None]:
df["cleaned_content"] = df["content"].apply(clean_text)

In [None]:
df.head()

Unnamed: 0,url,content,doc_number,cleaned_content
0,https://en.wikipedia.org/wiki/Artificial_intel...,"\n Artificial intelligence (AI), in its broade...",1,artificial intelligence ai in its broadest sen...
1,https://en.wikipedia.org/wiki/Machine_learning,Machine learning (ML) is a field of study in a...,2,machine learning ml is field of study in artif...
2,https://en.wikipedia.org/wiki/Deep_learning,Deep learning is a subset of machine learning ...,3,deep learning is subset of machine learning th...
3,https://en.wikipedia.org/wiki/Natural_language...,Natural language processing (NLP) is a subfiel...,4,natural language processing nlp is subfield of...
4,https://en.wikipedia.org/wiki/Ethics_of_artifi...,\n The ethics of artificial intelligence cover...,5,the ethics of artificial intelligence covers b...


## Stopwords removal

In [None]:
nltk.download('stopwords')


stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    """
    Remove stopwords from text data.

    This function filters out common stopwords from the text data.
    Stopwords are removed based on the NLTK's English stopwords list.

    Parameters:
    text (str): A string containing text data.

    Returns:
    str: A string with stopwords removed.
    """
    if not isinstance(text, str):
        return text

    filtered_text = " ".join(word for word in text.split() if word not in stop_words)
    return filtered_text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
df["filtered_content"] = df["cleaned_content"].apply(remove_stopwords)

In [None]:
df.head()

Unnamed: 0,url,content,doc_number,cleaned_content,filtered_content
0,https://en.wikipedia.org/wiki/Artificial_intel...,"\n Artificial intelligence (AI), in its broade...",1,artificial intelligence ai in its broadest sen...,artificial intelligence ai broadest sense inte...
1,https://en.wikipedia.org/wiki/Machine_learning,Machine learning (ML) is a field of study in a...,2,machine learning ml is field of study in artif...,machine learning ml field study artificial int...
2,https://en.wikipedia.org/wiki/Deep_learning,Deep learning is a subset of machine learning ...,3,deep learning is subset of machine learning th...,deep learning subset machine learning focuses ...
3,https://en.wikipedia.org/wiki/Natural_language...,Natural language processing (NLP) is a subfiel...,4,natural language processing nlp is subfield of...,natural language processing nlp subfield compu...
4,https://en.wikipedia.org/wiki/Ethics_of_artifi...,\n The ethics of artificial intelligence cover...,5,the ethics of artificial intelligence covers b...,ethics artificial intelligence covers broad ra...


In [None]:
df["cleaned_content"][1]

'machine learning ml is field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalize to unseen data and thus perform tasks without explicit instructions advances in the field of deep learning have allowed neural networks to surpass many previous approaches in performance ml finds application in many fields including natural language processing computer vision speech recognition email filtering agriculture and medicine the application of ml to business problems is known as predictive analytics statistics and mathematical optimization mathematical programming methods comprise the foundations of machine learning data mining is related field of study focusing on exploratory data analysis eda via unsupervised learning from theoretical viewpoint probably approximately correct pac learning provides framework for describing machine learning the term machine learning was coined in by arthur samuel an ibm emp

In [None]:
df["filtered_content"][1]

'machine learning ml field study artificial intelligence concerned development study statistical algorithms learn data generalize unseen data thus perform tasks without explicit instructions advances field deep learning allowed neural networks surpass many previous approaches performance ml finds application many fields including natural language processing computer vision speech recognition email filtering agriculture medicine application ml business problems known predictive analytics statistics mathematical optimization mathematical programming methods comprise foundations machine learning data mining related field study focusing exploratory data analysis eda via unsupervised learning theoretical viewpoint probably approximately correct pac learning provides framework describing machine learning term machine learning coined arthur samuel ibm employee pioneer field computer gaming artificial intelligence synonym self teaching computers also used time period although earliest machine 

## Tokenization

In [None]:
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')

df["tokenized_content"] = df["filtered_content"].apply(word_tokenize)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [None]:
df.head()

Unnamed: 0,url,content,doc_number,cleaned_content,filtered_content,tokenized_content
0,https://en.wikipedia.org/wiki/Artificial_intel...,"\n Artificial intelligence (AI), in its broade...",1,artificial intelligence ai in its broadest sen...,artificial intelligence ai broadest sense inte...,"[artificial, intelligence, ai, broadest, sense..."
1,https://en.wikipedia.org/wiki/Machine_learning,Machine learning (ML) is a field of study in a...,2,machine learning ml is field of study in artif...,machine learning ml field study artificial int...,"[machine, learning, ml, field, study, artifici..."
2,https://en.wikipedia.org/wiki/Deep_learning,Deep learning is a subset of machine learning ...,3,deep learning is subset of machine learning th...,deep learning subset machine learning focuses ...,"[deep, learning, subset, machine, learning, fo..."
3,https://en.wikipedia.org/wiki/Natural_language...,Natural language processing (NLP) is a subfiel...,4,natural language processing nlp is subfield of...,natural language processing nlp subfield compu...,"[natural, language, processing, nlp, subfield,..."
4,https://en.wikipedia.org/wiki/Ethics_of_artifi...,\n The ethics of artificial intelligence cover...,5,the ethics of artificial intelligence covers b...,ethics artificial intelligence covers broad ra...,"[ethics, artificial, intelligence, covers, bro..."


In [None]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
nltk.download('all')

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_r

True

## POS tagging

In [None]:
def apply_pos_tagging(tokens):
    """
    Apply POS tagging to tokenized text.

    Parameters:
    tokens (list): A list of tokenized words.

    Returns:
    list: A list of tuples, each containing a token and its corresponding POS tag.
    """
    pos_tags = pos_tag(tokens)

    return pos_tags

tokens = ['The', 'cats', 'are', 'running', 'better', 'than', 'before']
pos_tags = apply_pos_tagging(tokens)

In [None]:
df["pos_tagging_content"] = df["tokenized_content"].apply(apply_pos_tagging)

In [None]:
df.head()

Unnamed: 0,url,content,doc_number,cleaned_content,filtered_content,tokenized_content,pos_tagging_content
0,https://en.wikipedia.org/wiki/Artificial_intel...,"\n Artificial intelligence (AI), in its broade...",1,artificial intelligence ai in its broadest sen...,artificial intelligence ai broadest sense inte...,"[artificial, intelligence, ai, broadest, sense...","[(artificial, JJ), (intelligence, NN), (ai, NN..."
1,https://en.wikipedia.org/wiki/Machine_learning,Machine learning (ML) is a field of study in a...,2,machine learning ml is field of study in artif...,machine learning ml field study artificial int...,"[machine, learning, ml, field, study, artifici...","[(machine, NN), (learning, VBG), (ml, JJ), (fi..."
2,https://en.wikipedia.org/wiki/Deep_learning,Deep learning is a subset of machine learning ...,3,deep learning is subset of machine learning th...,deep learning subset machine learning focuses ...,"[deep, learning, subset, machine, learning, fo...","[(deep, JJ), (learning, NN), (subset, VBN), (m..."
3,https://en.wikipedia.org/wiki/Natural_language...,Natural language processing (NLP) is a subfiel...,4,natural language processing nlp is subfield of...,natural language processing nlp subfield compu...,"[natural, language, processing, nlp, subfield,...","[(natural, JJ), (language, NN), (processing, N..."
4,https://en.wikipedia.org/wiki/Ethics_of_artifi...,\n The ethics of artificial intelligence cover...,5,the ethics of artificial intelligence covers b...,ethics artificial intelligence covers broad ra...,"[ethics, artificial, intelligence, covers, bro...","[(ethics, NNS), (artificial, JJ), (intelligenc..."


## Lemmatization

In [None]:
lemmatizer = WordNetLemmatizer()

def apply_lemmatization_on_tokens(tokens_with_pos):
    """
    Apply lemmatization to a list of tokenized words using NLTK's WordNetLemmatizer,
    considering their POS tags.

    Parameters:
    tokens_with_pos (list): A list of tuples containing token and its POS tag.

    Returns:
    list: A list of lemmatized words.
    """
    lemmatized_tokens = []

    for word, tag in tokens_with_pos:
        # Convert NLTK POS tags to WordNet POS tags for lemmatizer
        if tag.startswith('J'):  # Adjectives (JJ, JJR, JJS)
            pos = 'a'
        elif tag.startswith('V'):  # Verbs (VB, VBD, VBG, VBN, VBP, VBZ)
            pos = 'v'
        elif tag.startswith('N'):  # Nouns (NN, NNS, NNP, NNPS)
            pos = 'n'
        elif tag.startswith('R'):  # Adverbs (RB, RBR, RBS)
            pos = 'r'
        else:
            pos = 'n'  # Default to noun if unsure


        lemmatized_tokens.append(lemmatizer.lemmatize(word, pos))

    return lemmatized_tokens

tokens = ['running', 'better', 'dogs']

pos_tags = apply_pos_tagging(tokens)

lemmatized_tokens = apply_lemmatization_on_tokens(pos_tags)

print("Original Tokens:", tokens)
print("POS Tags:", pos_tags)
print("Lemmatized Tokens:", lemmatized_tokens)


Original Tokens: ['running', 'better', 'dogs']
POS Tags: [('running', 'VBG'), ('better', 'RBR'), ('dogs', 'NNS')]
Lemmatized Tokens: ['run', 'well', 'dog']


In [None]:
df["lemmatized_content"] = df["pos_tagging_content"].apply(apply_lemmatization_on_tokens)

In [None]:
df.head()

Unnamed: 0,url,content,doc_number,cleaned_content,filtered_content,tokenized_content,pos_tagging_content,lemmatized_content
0,https://en.wikipedia.org/wiki/Artificial_intel...,"\n Artificial intelligence (AI), in its broade...",1,artificial intelligence ai in its broadest sen...,artificial intelligence ai broadest sense inte...,"[artificial, intelligence, ai, broadest, sense...","[(artificial, JJ), (intelligence, NN), (ai, NN...","[artificial, intelligence, ai, broad, sense, i..."
1,https://en.wikipedia.org/wiki/Machine_learning,Machine learning (ML) is a field of study in a...,2,machine learning ml is field of study in artif...,machine learning ml field study artificial int...,"[machine, learning, ml, field, study, artifici...","[(machine, NN), (learning, VBG), (ml, JJ), (fi...","[machine, learn, ml, field, study, artificial,..."
2,https://en.wikipedia.org/wiki/Deep_learning,Deep learning is a subset of machine learning ...,3,deep learning is subset of machine learning th...,deep learning subset machine learning focuses ...,"[deep, learning, subset, machine, learning, fo...","[(deep, JJ), (learning, NN), (subset, VBN), (m...","[deep, learning, subset, machine, learn, focus..."
3,https://en.wikipedia.org/wiki/Natural_language...,Natural language processing (NLP) is a subfiel...,4,natural language processing nlp is subfield of...,natural language processing nlp subfield compu...,"[natural, language, processing, nlp, subfield,...","[(natural, JJ), (language, NN), (processing, N...","[natural, language, processing, nlp, subfield,..."
4,https://en.wikipedia.org/wiki/Ethics_of_artifi...,\n The ethics of artificial intelligence cover...,5,the ethics of artificial intelligence covers b...,ethics artificial intelligence covers broad ra...,"[ethics, artificial, intelligence, covers, bro...","[(ethics, NNS), (artificial, JJ), (intelligenc...","[ethic, artificial, intelligence, cover, broad..."


In [None]:
file_path = f"{drive_path}/preprocessed_articles.csv"
df.to_csv(file_path, index=False)
print(f"DataFrame saved to {file_path}")

DataFrame saved to /content/drive/MyDrive/sri-doc-collection/preprocessed_articles.csv


In [None]:
len(df)

103

# Creating an Inverted Index

An inverted index is a data structure that maps terms to the documents in which they appear, along with their frequency. Here's an outline of what each part of the code accomplishes:

1. **Initializing the Inverted Index**  
   A nested `defaultdict` initializes the inverted index. This structure lets term-document mappings be added on the fly and keeps a count of each term's occurrences per document.

2. **Building the Inverted Index**  
   The code iterates over each document in the DataFrame and:
   - Extracts the `doc_number` and the lemmatized tokens (`lemmatized_content`).
   - Updates the inverted index with each term, its document ID, and its frequency in that document.
   - Computes the document frequency (DF) and inverse document frequency (IDF) of each term.
   - Converts raw frequencies into TF-IDF weights when documents are scored.

3. **Sorting the Inverted Index**  
   Once the inverted index is constructed, it is sorted alphabetically by term for better readability and accessibility.

4. **Saving the Inverted Index**  
   The final inverted index is saved as a JSON file (`inverted_index.json`) to Google Drive. This file is the output of the indexing process and can be reused for later retrieval tasks.
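
The steps above can be sketched on a toy corpus. The three documents below are illustrative assumptions, not part of the project's data, and the save step is simulated with an in-memory JSON round trip rather than a file on Drive.

```python
import json
from collections import defaultdict

# Toy corpus (hypothetical): doc_number -> lemmatized tokens
toy_docs = {
    1: ["machine", "learn", "model"],
    2: ["deep", "learn", "learn"],
    3: ["machine", "translation"],
}

# term -> {doc_number: term frequency}
toy_index = defaultdict(lambda: defaultdict(int))
for doc_number, tokens in toy_docs.items():
    for token in tokens:
        toy_index[token][doc_number] += 1

# Sort terms alphabetically and convert to plain dicts for serialization
toy_index = {term: dict(postings) for term, postings in sorted(toy_index.items())}

# JSON object keys are always strings, so integer document numbers
# come back as strings after a save/load round trip.
restored = json.loads(json.dumps(toy_index))

print(toy_index["learn"])   # {1: 1, 2: 2}
print(restored["learn"])    # {'1': 1, '2': 2}
```

One consequence of the round trip is that any later lookup by document number has to use the same key type as the loaded index.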


In [None]:
drive.mount('/content/drive')

drive_path = "/content/drive/MyDrive/sri-doc-collection"
file_path = f"{drive_path}/preprocessed_articles.csv"

if os.path.exists(file_path):
    print(f"File found at {file_path}. Loading...")
    df = pd.read_csv(file_path)
    print("DataFrame loaded successfully:")
else:
    print(f"File not found at {file_path}. Please check the path or create the file.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
File found at /content/drive/MyDrive/sri-doc-collection/preprocessed_articles.csv. Loading...
DataFrame loaded successfully:


## Creating an inverted index with doc_num and freq

In [None]:
df

Unnamed: 0,url,content,doc_number,cleaned_content,filtered_content,tokenized_content,pos_tagging_content,lemmatized_content
0,https://en.wikipedia.org/wiki/Artificial_intel...,"\n Artificial intelligence (AI), in its broade...",1,artificial intelligence ai in its broadest sen...,artificial intelligence ai broadest sense inte...,"[artificial, intelligence, ai, broadest, sense...","[(artificial, JJ), (intelligence, NN), (ai, NN...","[artificial, intelligence, ai, broad, sense, i..."
1,https://en.wikipedia.org/wiki/Machine_learning,Machine learning (ML) is a field of study in a...,2,machine learning ml is field of study in artif...,machine learning ml field study artificial int...,"[machine, learning, ml, field, study, artifici...","[(machine, NN), (learning, VBG), (ml, JJ), (fi...","[machine, learn, ml, field, study, artificial,..."
2,https://en.wikipedia.org/wiki/Deep_learning,Deep learning is a subset of machine learning ...,3,deep learning is subset of machine learning th...,deep learning subset machine learning focuses ...,"[deep, learning, subset, machine, learning, fo...","[(deep, JJ), (learning, NN), (subset, VBN), (m...","[deep, learning, subset, machine, learn, focus..."
3,https://en.wikipedia.org/wiki/Natural_language...,Natural language processing (NLP) is a subfiel...,4,natural language processing nlp is subfield of...,natural language processing nlp subfield compu...,"[natural, language, processing, nlp, subfield,...","[(natural, JJ), (language, NN), (processing, N...","[natural, language, processing, nlp, subfield,..."
4,https://en.wikipedia.org/wiki/Ethics_of_artifi...,\n The ethics of artificial intelligence cover...,5,the ethics of artificial intelligence covers b...,ethics artificial intelligence covers broad ra...,"[ethics, artificial, intelligence, covers, bro...","[(ethics, NNS), (artificial, JJ), (intelligenc...","[ethic, artificial, intelligence, cover, broad..."
...,...,...,...,...,...,...,...,...
98,https://en.wikipedia.org/wiki/Speech-to-text,\nSpeech recognition is an interdisciplinary s...,99,speech recognition is an interdisciplinary sub...,speech recognition interdisciplinary subfield ...,"[speech, recognition, interdisciplinary, subfi...","[(speech, NN), (recognition, NN), (interdiscip...","[speech, recognition, interdisciplinary, subfi..."
99,https://en.wikipedia.org/wiki/Machine_translation,\n Machine translation is use of computational...,100,machine translation is use of computational te...,machine translation use computational techniqu...,"[machine, translation, use, computational, tec...","[(machine, NN), (translation, NN), (use, IN), ...","[machine, translation, use, computational, tec..."
100,https://en.wikipedia.org/wiki/Question_answering,Question answering (QA) is a computer science ...,101,question answering qa is computer science disc...,question answering qa computer science discipl...,"[question, answering, qa, computer, science, d...","[(question, NN), (answering, VBG), (qa, JJ), (...","[question, answer, qa, computer, science, disc..."
101,https://en.wikipedia.org/wiki/Text_summarization,Automatic summarization is the process of shor...,102,automatic summarization is the process of shor...,automatic summarization process shortening set...,"[automatic, summarization, process, shortening...","[(automatic, JJ), (summarization, NN), (proces...","[automatic, summarization, process, shorten, s..."


In [None]:
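import ast

# term -> {doc_number: raw term frequency}
inverted_index = defaultdict(lambda: defaultdict(int))

for index, row in df.iterrows():
    doc_number = row['doc_number']
    tokens = row['lemmatized_content']
    # After a CSV round trip the token list is stored as its string
    # representation, so parse it back into a list before counting.
    if isinstance(tokens, str):
        tokens = ast.literal_eval(tokens)

    for token in tokens:
        inverted_index[token][doc_number] += 1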

In [None]:
inverted_index

defaultdict(<function __main__.<lambda>()>,
            {'artificial': defaultdict(int,
                         {1: 41,
                          2: 28,
                          3: 13,
                          4: 4,
                          5: 35,
                          6: 34,
                          7: 14,
                          8: 15,
                          9: 2,
                          10: 26,
                          11: 34,
                          12: 11,
                          13: 6,
                          16: 1,
                          17: 2,
                          18: 1,
                          19: 2,
                          21: 1,
                          22: 9,
                          23: 13,
                          24: 9,
                          25: 6,
                          26: 5,
                          27: 5,
                          28: 5,
                          29: 6,
                          30: 2,
                   

In [None]:
inverted_index = dict(sorted(inverted_index.items()))

In [None]:
for term, docs in inverted_index.items():
    print(f"{term}: {dict(docs)}")

Streaming output truncated to the last 5000 lines.
raëlism: {52: 1}
rbf: {61: 1}
rbm: {16: 2}
rcmn: {17: 2}
rcnknn: {83: 1}
rcnwnn: {83: 1}
rctextbayesleftbfrac: {83: 1}
rctextbayesleftbsnbtnrighto: {83: 1}
rd: {10: 1, 27: 1, 56: 1, 74: 1}
rdb: {67: 1}
rdbms: {45: 1, 89: 1}
rdbmspromotional: {45: 1}
rdbrdf: {67: 1}
rdf: {29: 2, 67: 13}
rdfa: {67: 1}
rdi: {7: 1}
rds: {42: 1}
reach: {1: 6, 3: 1, 6: 1, 7: 1, 8: 2, 9: 4, 11: 1, 13: 1, 19: 3, 24: 1, 26: 1, 27: 1, 28: 1, 31: 5, 33: 1, 36: 7, 39: 2, 40: 3, 41: 1, 44: 2, 45: 3, 46: 3, 47: 3, 48: 1, 49: 1, 51: 11, 52: 1, 57: 2, 59: 1, 69: 1, 75: 1, 76: 2, 79: 2, 80: 2, 81: 1, 84: 2, 86: 2, 87: 1, 95: 1, 97: 3, 98: 1, 99: 2}
reachable: {74: 1}
reached: {36: 1}
react: {2: 1, 12: 5, 30: 1, 33: 1, 43: 1, 61: 1}
reactants: {87: 1}
reaction: {2: 1, 3: 1, 6: 1, 7: 1, 9: 2, 10: 1, 11: 1, 19: 2, 33: 8, 38: 2, 43: 4, 49: 1, 61: 2, 68: 1, 74: 1, 75: 1, 77: 1, 79: 1, 87: 3, 94: 1}
reactiondiffusion: {33: 1}
reactive

In [None]:
len(inverted_index)

18023

In [None]:
index_path = "/content/drive/MyDrive/sri-doc-collection/inverted_index.json"
with open(index_path, "w") as f:
    json.dump(inverted_index, f, indent=4)

print("Inverted index created and saved.")

Inverted index created and saved.


In [None]:
drive.mount('/content/drive')

drive_path = "/content/drive/MyDrive/sri-doc-collection"
index_path = f"{drive_path}/inverted_index.json"

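def load_inverted_index(path):
    """Load the inverted index from a JSON file, or return None if it is missing."""
    if os.path.exists(path):
        with open(path, "r") as f:
            # JSON object keys are strings, so document numbers load as str.
            inverted_index = json.load(f)
        print("Inverted index loaded successfully.")
        return inverted_index
    else:
        print(f"Inverted index not found at {path}.")
        return None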

inverted_index = load_inverted_index(index_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Inverted index loaded successfully.


In [None]:
len(inverted_index)

18023

## Calculating IDF for all the tokens

In [None]:
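num_docs = len(df)  # N: number of documents in the collection
idf = {}

for term, docs in inverted_index.items():
    df_term = len(docs)  # document frequency: number of documents containing the term
    idf[term] = math.log(num_docs / df_term)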

In [None]:
idf

{'aaai': 3.248434627109745,
 'aadhaar': 4.634728988229636,
 'aae': 3.9415818076696905,
 'aahc': 4.634728988229636,
 'aamin': 4.634728988229636,
 'aaple': 4.634728988229636,
 'aaron': 2.842969519001581,
 'ab': 3.248434627109745,
 'abandon': 2.842969519001581,
 'abandonedac': 4.634728988229636,
 'abandonedy': 4.634728988229636,
 'abarbanel': 4.634728988229636,
 'abb': 4.634728988229636,
 'abbreviate': 4.634728988229636,
 'abbreviated': 4.634728988229636,
 'abbreviation': 3.5361166995615263,
 'abc': 3.9415818076696905,
 'abductive': 4.634728988229636,
 'abelian': 4.634728988229636,
 'aberystwyth': 4.634728988229636,
 'abide': 4.634728988229636,
 'ability': 0.540384426007535,
 'ablation': 4.634728988229636,
 'able': 0.4300363688386697,
 'abm': 4.634728988229636,
 'abnormal': 2.6888188391743224,
 'abnormality': 4.634728988229636,
 'abolition': 3.9415818076696905,
 'abortion': 4.634728988229636,
 'aboveb': 4.634728988229636,
 'aboveclarification': 4.634728988229636,
 'abraham': 4.63472898822
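
As a quick check of the values above, the IDF used here is the natural logarithm of the collection size divided by the term's document frequency, idf(t) = ln(N / df_t). The sketch below assumes a collection of 103 documents (the value of `len(df)` printed earlier); the helper name `idf_value` is hypothetical.

```python
import math

def idf_value(num_docs, doc_freq):
    """Inverse document frequency: ln(N / df)."""
    return math.log(num_docs / doc_freq)

N = 103  # collection size, taken from len(df)

# A term that occurs in a single document gets the maximum IDF.
print(round(idf_value(N, 1), 4))   # 4.6347
# A term that occurs in every document gets an IDF of zero.
print(idf_value(N, N))             # 0.0
```

The value 4.6347 that recurs in the output is therefore consistent with terms that appear in exactly one of the 103 documents, while frequent terms such as `able` score close to zero.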

# **Query Processing and Retrieval with SMART Combinations**

This section describes how we process a user query and retrieve the most relevant documents using SMART (System for the Mechanical Analysis and Retrieval of Text) weighting schemes for both queries and documents. A scheme is written as three letters, one each for the term-frequency, document-frequency, and normalization components (for example, `ltc`), and the query and the documents may use different schemes. Key steps include:

1. **Document Normalization:**
   - Precomputing the norms of document vectors to facilitate cosine similarity calculations.
   - These norms are stored for efficient retrieval during comparisons.

2. **Query Processing:**
   - Cleaning the input query by removing stopwords and applying lemmatization to standardize terms.
   - Tokenizing the query and calculating term frequencies (TF) for its terms.
   - Applying a user-selected SMART weighting scheme (e.g., logarithmic, augmented, or boolean term frequency with optional IDF and normalization).

3. **Cosine Similarity Computation:**
   - Comparing the processed query vector with document vectors using cosine similarity.
   - Weighting document terms according to the selected SMART scheme for documents.

4. **Sorting and Ranking:**
   - Sorting documents by their similarity scores in descending order.
   - Returning the top-ranked documents as the most relevant results.
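
The four steps above can be illustrated end to end on a toy collection. This is a simplified sketch, not the notebook's exact code: the three documents are made up, the helper names (`log_tf`, `l2_normalize`, `rank`) are hypothetical, and both query and document vectors are fully length-normalized. It weights documents with `lnc` and the query with `ltc`.

```python
import math
from collections import Counter

# Toy collection (hypothetical): doc id -> lemmatized tokens
docs = {
    "d1": ["machine", "learn", "learn", "model"],
    "d2": ["deep", "learn", "network"],
    "d3": ["machine", "translation", "language"],
}

def log_tf(tf):
    """SMART 'l' component: 1 + ln(tf) for tf > 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

def l2_normalize(vector):
    """SMART 'c' component: divide each weight by the Euclidean norm."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / norm for t, w in vector.items()} if norm else vector

# Document frequencies and IDF (SMART 't' component)
num_docs = len(docs)
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
idf = {term: math.log(num_docs / df) for term, df in doc_freq.items()}

# lnc document vectors: log tf, no idf, cosine normalization
doc_vectors = {
    doc_id: l2_normalize({t: log_tf(tf) for t, tf in Counter(tokens).items()})
    for doc_id, tokens in docs.items()
}

def rank(query_tokens):
    """Score documents against an ltc-weighted query, highest score first."""
    query_vector = l2_normalize({
        t: log_tf(tf) * idf[t]
        for t, tf in Counter(query_tokens).items() if t in idf
    })
    scores = {
        doc_id: sum(w * vec.get(t, 0.0) for t, w in query_vector.items())
        for doc_id, vec in doc_vectors.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

for doc_id, score in rank(["machine", "learn"]):
    print(doc_id, round(score, 3))
```

Here `d1` ranks first because it contains both query terms, while `d2` and `d3` each match a single term and tie.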



In [None]:
def process_query(query, idf, query_scheme):
    """
    Process a query with the given SMART weighting scheme.
    """
    if not isinstance(query, str):
        raise ValueError("Query must be a string.")

    query = clean_text(query)

    query = remove_stopwords(query)

    tokens = word_tokenize(query)

    pos_tags = apply_pos_tagging(tokens)

    lemmatized_tokens = apply_lemmatization_on_tokens(pos_tags)

    query_tf = Counter(lemmatized_tokens)

    query_vector = {}
    for term, tf in query_tf.items():
        if term in idf:
            query_vector[term] = apply_smart_scheme(tf, idf[term], query_scheme)

    return query_vector

In [None]:
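# Euclidean norm of each document vector, computed from the raw term
# frequencies stored in the inverted index (not the SMART-weighted values).
doc_norms = defaultdict(float)
for term, docs in inverted_index.items():
    for doc, weight in docs.items():
        doc_norms[doc] += weight ** 2

doc_norms = {doc: math.sqrt(norm) for doc, norm in doc_norms.items()}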

In [None]:
# SMART triple -> (term-frequency, document-frequency, normalization) components
smart_schemes = {
    "nnn": ("natural_tf", "none", "none"),
    "ltc": ("logarithmic_tf", "idf", "cosine_normalization"),
    "lnc": ("logarithmic_tf", "none", "cosine_normalization"),
    "ntc": ("natural_tf", "idf", "cosine_normalization"),
    "anc": ("augmented_tf", "none", "cosine_normalization"),
}

In [None]:
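def apply_smart_scheme(tf, idf, scheme, max_tf=None):
    """
    Apply a SMART weighting scheme to compute a term weight.

    tf: raw term frequency; idf: ln(N / df) for the term;
    max_tf: largest raw tf in the same document or query (used by augmented_tf).
    """
    # Cosine normalization is not applied here: cosine_similarity divides by doc_norms.
    tf_scheme, idf_scheme, _normalization_scheme = smart_schemes[scheme]

    if tf_scheme == "logarithmic_tf":
        tf = 1 + math.log(tf) if tf > 0 else 0
    elif tf_scheme == "augmented_tf":
        # Falls back to tf itself (weight 1.0) when the caller does not supply max_tf.
        tf = 0.5 + 0.5 * tf / max(max_tf or tf, 1)
    elif tf_scheme == "boolean_tf":
        tf = 1 if tf > 0 else 0
    # "natural_tf" leaves tf unchanged.

    if idf_scheme == "idf":
        tf *= idf
    elif idf_scheme == "prob_idf":
        # Probabilistic IDF is max(0, ln((N - df) / df)).
        # With idf = ln(N / df), (N - df) / df equals exp(idf) - 1.
        odds = math.exp(idf) - 1
        tf *= max(0, math.log(odds)) if odds > 0 else 0

    return tf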

def cosine_similarity(query_vector, doc_scheme):
    """
    Score documents against the query vector, weighting document terms
    with the given SMART scheme, and return (doc, score) pairs sorted
    by descending score.
    """
    scores = defaultdict(float)

    # Accumulate the dot product term by term via the postings lists.
    for term, weight in query_vector.items():
        if term in inverted_index:
            for doc, doc_tf in inverted_index[term].items():
                doc_weight = apply_smart_scheme(doc_tf, idf[term], doc_scheme)
                scores[doc] += weight * doc_weight

    # Length-normalize by the precomputed document norms. The query norm is a
    # constant factor across documents, so omitting it leaves the ranking unchanged.
    for doc in scores:
        scores[doc] /= doc_norms[doc]

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

default_query_scheme = "ltc"
default_doc_scheme = "lnc"
print("Available SMART schemes:", ", ".join(smart_schemes.keys()))
user_query_scheme = input(f"Select a SMART scheme for the query (default: {default_query_scheme}): ").strip() or default_query_scheme
user_doc_scheme = input(f"Select a SMART scheme for the documents (default: {default_doc_scheme}): ").strip() or default_doc_scheme


query = "machine learning and AI applications"
query_vector = process_query(query, idf, user_query_scheme)
results = cosine_similarity(query_vector, user_doc_scheme)

print("\nTop results:")
for doc, score in results[:10]:
    print(f"Document {doc} - Score: {score}")


Available SMART schemes: nnn, ltc, lnc, ntc, anc
Select a SMART scheme for the query (default: ltc): 
Select a SMART scheme for the documents (default: lnc): 

Top results:
Document 78 - Score: 0.14334545627955655
Document 77 - Score: 0.057179662307690796
Document 41 - Score: 0.05241085484961286
Document 50 - Score: 0.050145591821804815
Document 35 - Score: 0.04453636680338076
Document 34 - Score: 0.03731926507652948
Document 4 - Score: 0.036496364136459264
Document 21 - Score: 0.03577533044316129
Document 54 - Score: 0.034223194567693085
Document 60 - Score: 0.028835733047042653


### Interpreting the Results:
**Top Results:**

The output lists the ten documents with the highest similarity to the query, in descending order of score.
Document 78 ranks first with a score of 0.1433, roughly 2.5 times the score of the runner-up, Document 77 (0.0572).
The scores then decrease gradually down to Document 60 (0.0288), the lowest-ranked of the ten.



### Limitations:
- **Term-Based Approach**: The approach here assumes that the importance of each word is determined by its frequency or presence in the document. This may not capture deeper semantic meaning, especially for complex or ambiguous queries.
- **Contextual Understanding**: The method doesn't account for synonyms, polysemy (same word with multiple meanings), or context.
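
The vocabulary-mismatch problem described above can be seen in a tiny example. The query, the two documents, and the helper `term_overlap` are hypothetical and only serve to show that a purely term-based score is zero when no words are shared, even if the topics are related.

```python
from collections import Counter

def term_overlap(query_tokens, doc_tokens):
    """Dot product of raw term-frequency vectors (hypothetical helper)."""
    query_tf, doc_tf = Counter(query_tokens), Counter(doc_tokens)
    return sum(tf * doc_tf[term] for term, tf in query_tf.items())

query = ["automobile", "repair"]
doc_same_words = ["automobile", "repair", "manual"]
doc_synonym = ["car", "maintenance", "guide"]  # related topic, different words

print(term_overlap(query, doc_same_words))  # 2
print(term_overlap(query, doc_synonym))     # 0
```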





### Conclusion and Suggestions

After analyzing the current system's performance and the retrieved results, we can suggest several improvements that would enhance the relevance and accuracy of the search results.

- **Embedding-Based Approach**:  
  Embedding-based methods are now widely used in information retrieval. By converting documents and queries into dense vector representations, they capture semantic relationships beyond exact term matches. This approach would allow the system to retrieve relevant documents even when the query uses different terminology than the documents.
  - **Why This Could Help**: By using pre-trained models like Sentence-BERT, we can improve the system’s understanding of context, synonyms, and semantic similarity between queries and documents. This would result in better search relevance, as the model can identify concepts beyond exact word matches.
  - **Implementation Suggestion**: We can integrate a model such as `SentenceTransformer` to generate embeddings for both queries and documents. Then, using cosine similarity, we can rank documents based on how closely their vector representation matches the query. This method is well-suited for more complex queries and larger datasets.

- **Retrieval-Augmented Generation (RAG)**:  
  RAG systems combine retrieval with generation, which could be particularly useful for applications that require detailed, contextual answers, such as question answering or document summarization. This method retrieves relevant documents and then generates answers based on them, ensuring that the responses are grounded in the retrieved content.
  - **Why This Could Help**: RAG has proven to be effective in generating contextually accurate and relevant responses by blending retrieval with natural language generation. By grounding the output in retrieved documents, we can enhance the system’s performance for tasks requiring detailed information synthesis.
  - **Implementation Suggestion**: A possible implementation involves integrating a generative model, such as GPT-3 or T5, with a retrieval system. We would first retrieve the most relevant documents using embeddings or BM25 and then pass them to the generative model to produce the final response.
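
As a minimal sketch of the embedding-based suggestion, the function below ranks documents by cosine similarity between query and document embeddings. The names `cosine`, `rank_by_embedding`, and `toy_encode` are hypothetical, and `toy_encode` is only an offline stand-in that counts a few hand-picked words; it captures no semantics and is there so the example runs without downloading a model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_embedding(query, documents, encode):
    """
    Rank documents by cosine similarity of their embeddings to the query.
    `encode` is any callable mapping a list of strings to a list of vectors.
    """
    doc_vectors = encode(documents)
    query_vector = encode([query])[0]
    scored = [(doc, cosine(query_vector, vec))
              for doc, vec in zip(documents, doc_vectors)]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Stand-in encoder (assumption, for illustration only): counts occurrences
# of a tiny fixed vocabulary. A real system would pass the encode method of
# a sentence-embedding model instead.
TOY_VOCAB = ["learn", "network", "cat", "sleep"]

def toy_encode(texts):
    return [[text.lower().count(word) for word in TOY_VOCAB] for text in texts]

documents = ["Neural networks learn representations", "Cats sleep a lot"]
for doc, score in rank_by_embedding("deep learning", documents, toy_encode):
    print(f"{score:.3f}  {doc}")
```

With the `sentence_transformers` package installed at the top of the notebook, `SentenceTransformer("all-MiniLM-L6-v2").encode` could be passed as `encode` in place of `toy_encode`; the model name is an example choice, not one the notebook specifies.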

