# Transformer-based contextual text representation framework for intelligent information retrieval

In [None]:
!pip install torch transformers sentence-transformers elasticsearch numpy scikit-learn google-generativeai

DATASET

In [None]:
documents = [
    # Climate change and agriculture
    "Climate change is increasingly recognized as one of the most significant challenges facing agriculture globally. Rising temperatures, altered rainfall patterns, and an increase in the frequency of extreme weather events are all affecting agricultural productivity. In regions that depend heavily on agriculture for their livelihoods, such as sub-Saharan Africa and parts of South Asia, these climate impacts are particularly severe. Crop yields are declining in many areas, and pests and diseases are spreading into new regions due to the changing climate. Addressing climate change and its impacts on agriculture requires immediate and sustained action, including adopting climate-smart agricultural practices and enhancing the resilience of farming systems. Farmers are also being encouraged to adopt drought-resistant crops, more efficient irrigation systems, and new farming technologies to adapt to the changing environment.",

    # Global warming and rising sea levels
    "Global warming, driven by the continued release of greenhouse gases into the atmosphere, has led to a steady rise in the global average temperature. As a result, polar ice sheets and glaciers are melting, contributing to rising sea levels. Coastal communities and cities are particularly vulnerable to these changes, as higher sea levels lead to flooding, erosion, and storm surges. In addition to the physical damage, rising sea levels threaten freshwater resources, agricultural land, and infrastructure. Vulnerable regions, such as small island nations and major coastal cities like New York, Tokyo, and Miami, are being forced to consider drastic measures like relocating populations or constructing expensive flood defenses. Despite the urgency of the issue, global efforts to curb emissions have been slow, and more aggressive policies are needed to mitigate the long-term effects of sea-level rise.",

    # Environmental risks due to pollution
    "Pollution, in its various forms, continues to pose one of the largest environmental risks to both the planet and human health. Air pollution, water contamination, and the accumulation of hazardous waste are just a few of the growing environmental issues we face today. In urban areas, industrial activities, vehicle emissions, and power plants contribute to high levels of air pollution, leading to respiratory problems, heart disease, and premature death. Water pollution from agricultural runoff, industrial waste, and sewage discharge is also a major concern, with harmful chemicals entering the water supply and affecting aquatic ecosystems and human communities that rely on these waters for drinking and agriculture. Furthermore, plastic waste, which takes centuries to decompose, is increasingly polluting oceans, rivers, and landscapes. The long-term impact of pollution on biodiversity and ecosystems cannot be overstated, as many species face extinction due to habitat destruction and the accumulation of toxins in their environments.",

    # Electric vehicles and carbon emissions
    "The transition from gasoline and diesel-powered vehicles to electric vehicles (EVs) is one of the key solutions for reducing carbon emissions from the transportation sector. The transportation industry is one of the largest contributors to greenhouse gas emissions, with conventional vehicles powered by fossil fuels emitting significant amounts of carbon dioxide and other pollutants. Electric vehicles, by contrast, run on electricity, and when charged from renewable sources like solar and wind, they offer a near-zero-emission alternative. While the adoption of EVs is increasing globally, challenges remain in terms of infrastructure, affordability, and battery technology. However, governments around the world are offering subsidies and incentives to encourage the transition to electric vehicles, while automakers are investing heavily in research and development to improve battery performance and reduce costs. In the future, EVs are expected to play a critical role in achieving net-zero emissions targets and mitigating climate change.",

    # Forest fires and their environmental impact
    "Forest fires have always been a natural part of the environment, but their frequency and intensity are increasing in recent years due to climate change. Rising temperatures, prolonged droughts, and changing precipitation patterns create ideal conditions for fires to spread uncontrollably across vast areas of forest. In addition to the immediate destruction of property and life, forest fires have significant long-term environmental consequences. They release vast amounts of carbon dioxide into the atmosphere, contributing to the greenhouse effect and global warming. Forests also act as important carbon sinks, absorbing carbon dioxide from the atmosphere, and when they burn, this stored carbon is released back into the atmosphere. The destruction of forests also leads to habitat loss for wildlife, reduces biodiversity, and disrupts local water cycles. As forest fires become more frequent and severe, mitigating their impact and adapting to the changing climate will require both global efforts to reduce greenhouse gas emissions and localized strategies to manage forest health.",

    # Ocean acidification and its impact on marine life
    "Ocean acidification is a major and often overlooked consequence of increased carbon dioxide levels in the atmosphere. As carbon dioxide is absorbed by the oceans, it reacts with seawater to form carbonic acid, which lowers the pH of the water. Over time, this acidification has significant implications for marine ecosystems, particularly organisms that rely on calcium carbonate to build their shells and skeletons, such as corals, mollusks, and certain types of plankton. The decline in coral reefs, which are highly sensitive to changes in pH, is a direct consequence of ocean acidification, with widespread impacts on marine biodiversity and the millions of people who depend on coral ecosystems for food and livelihood. Additionally, the disruption of the food chain due to acidification threatens fish populations, many of which are crucial for global fisheries. Ocean acidification is a slow but ongoing process, and if current trends continue, it could significantly alter marine life as we know it, leading to ecosystem collapse in some regions.",

    # Reforestation efforts to combat climate change
    "Reforestation is a powerful yet often underutilized tool in the fight against climate change. The process involves planting trees in areas where forests have been cut down or degraded, helping to restore ecosystems and increase the global capacity to sequester carbon. Forests play a vital role in absorbing carbon dioxide from the atmosphere, and reforestation has the potential to offset a significant portion of global emissions. In addition to carbon sequestration, reforestation helps restore biodiversity, protect watersheds, and improve soil quality. Various countries and organizations are investing in large-scale reforestation projects, with some initiatives aiming to plant billions of trees in the coming decades. However, reforestation efforts must be coupled with strong conservation policies and sustainable land management practices to ensure that newly planted forests are not later cleared for agriculture or development. While reforestation is an essential strategy, it is not a silver bullet and must be combined with efforts to reduce emissions at their source."
]


In [None]:
queries = [
    "What are the impacts of climate change on agriculture?",
    "What is the relationship between global warming and sea level rise?",
    "How does pollution affect the environment and human health?",
    "What are the benefits of electric vehicles in reducing carbon emissions?",
    "How do forest fires contribute to global warming and environmental degradation?",
    "What is ocean acidification and how does it affect marine life?",
    "What role does reforestation play in combating climate change?"
]


QUERY EXPANSION USING GEMINI

In [None]:
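import os
import google.generativeai as genai

# Configure the Gemini API key from the environment instead of hardcoding it
# (GEMINI_API_KEY is an assumed variable name; set it before running this cell)
genai.configure(api_key=os.environ["GEMINI_API_KEY"])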

# Function to expand a given query
def expand_query(query, model_name="gemini-2.0-flash"):
    prompt = f"""
    Expand the following query to make it more informative and detailed.
    Retain the original intent but add relevant context, missing details,
    and important aspects to enhance clarity and usefulness.
    example:
    Before Expansion:
    climate change impact agriculture
    After Expansion:
    How does climate change impact agricultural productivity, crop yields, and food security, including its effects on soil fertility, water availability, and farming sustainability?
    (just return the expanded query as a single sentence)
    Query: "{query}"

    Expanded Query:
    """

    # Keep the model name and the model object in separate variables
    model = genai.GenerativeModel(model_name)
    response = model.generate_content(prompt)

    # Return the expanded query without surrounding whitespace
    return response.text.strip()

# Expand and store queries
expanded_queries = {query: expand_query(query) for query in queries}

# Print expanded queries
for query, expanded_query in expanded_queries.items():
    print(f"Original Query: {query}")
    print(f"Expanded Query: {expanded_query}")
    print("-" * 50)


Original Query: What are the impacts of climate change on agriculture?
Expanded Query: What are the multifaceted impacts of climate change on agriculture globally, considering alterations in temperature, precipitation patterns, increased frequency of extreme weather events, changes in atmospheric carbon dioxide concentrations, and sea-level rise, specifically examining effects on crop yields, livestock production, soil health, water resources, pest and disease distribution, and the livelihoods of farmers and agricultural communities, while also exploring regional variations and potential adaptation strategies?
--------------------------------------------------
Original Query: What is the relationship between global warming and sea level rise?
Expanded Query: What is the relationship between global warming and sea level rise, including the primary mechanisms driving this relationship such as thermal expansion of water and melting of glaciers and ice sheets, and what are the projected im
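
Because `expand_query` depends on a remote API call, a single failed or empty response would abort the whole expansion step. A minimal sketch of a fallback wrapper follows. The names `expand_with_fallback` and `expand_fn` are hypothetical and introduced only for illustration; they are not part of the notebook or of the Gemini client library.

```python
def expand_with_fallback(queries, expand_fn):
    """Map each query to its expansion, keeping the original on failure.

    expand_fn is any callable that takes a query string and returns a string
    (for example the notebook's expand_query); this is an assumption of the sketch.
    """
    expanded = {}
    for query in queries:
        try:
            candidate = expand_fn(query)
        except Exception:
            # Network or API errors: fall back to the unexpanded query
            candidate = ""
        candidate = (candidate or "").strip()
        expanded[query] = candidate if candidate else query
    return expanded


# Stand-in expander so the sketch runs without an API key
def fake_expand(query):
    if "fail" in query:
        raise RuntimeError("simulated API error")
    return query + " (expanded)"


demo = expand_with_fallback(["sea level rise", "fail case"], fake_expand)
print(demo)
```

In the notebook, passing `expand_query` as `expand_fn` would keep retrieval working with the original query whenever expansion fails.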

DOCUMENT SUMMARIZATION TO HANDLE LONG DOCUMENTS

In [None]:
from transformers import pipeline

# Initialize the BART summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Function to summarize the documents
def summarize_documents(documents, max_length=80, min_length=50):
    summaries = []

    for doc in documents:
        # Generate summary for each document
        summary = summarizer(doc, max_length=max_length, min_length=min_length, do_sample=False)
        summaries.append(summary[0]['summary_text'])

    return summaries

# Get the summarized documents
summarized_documents = summarize_documents(documents)

# Print the original and summarized documents
for i, doc in enumerate(documents):
    print(f"Original Document {i+1}: \n{doc[:500]}...")  # Display the first 500 chars of the document
    print(f"Summarized Document {i+1}: \n{summarized_documents[i]}")
    print("-" * 100)


Original Document 1: 
Climate change is increasingly recognized as one of the most significant challenges facing agriculture globally. Rising temperatures, altered rainfall patterns, and an increase in the frequency of extreme weather events are all affecting agricultural productivity. In regions that depend heavily on agriculture for their livelihoods, such as sub-Saharan Africa and parts of South Asia, these climate impacts are particularly severe. Crop yields are declining in many areas, and pests and diseases are...
Summarized Document 1: 
Climate change is increasingly recognized as one of the most significant challenges facing agriculture globally. Rising temperatures, altered rainfall patterns, and an increase in the frequency of extreme weather events are all affecting agricultural productivity. Crop yields are declining in many areas, and pests and diseases are spreading into new regions due to the changing climate.
-------------------------------------------------------------
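
The summaries above are produced for every document regardless of its length. One possible refinement, sketched here under stated assumptions, is to summarize only documents that exceed a word budget and pass shorter ones through unchanged. The helper `summarize_if_long`, the `max_words` threshold of 120, and the `summarize_fn` callable are all hypothetical; in the notebook, `summarize_fn` could wrap the `summarizer` pipeline.

```python
def summarize_if_long(documents, summarize_fn, max_words=120):
    """Summarize only documents longer than max_words whitespace-separated words.

    Word count is a rough proxy for token count; the threshold is illustrative.
    """
    results = []
    for doc in documents:
        if len(doc.split()) > max_words:
            results.append(summarize_fn(doc))
        else:
            results.append(doc)
    return results


# Stand-in summarizer that keeps the first sentence, so the sketch runs offline
def first_sentence(doc):
    return doc.split(". ")[0].rstrip(".") + "."


short_doc = "Reforestation restores ecosystems."
long_doc = "Forests absorb carbon dioxide. " + "They also protect watersheds. " * 50
processed = summarize_if_long([short_doc, long_doc], first_sentence)
print(processed)
```

The short document is returned as-is, while the long one is reduced by whatever summarizer is supplied.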

DOCUMENT AND QUERY ENCODING USING SBERT

In [None]:
import json
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Initialize SBERT Model
sbert_model = SentenceTransformer('all-MiniLM-L6-v2')

# Collect the expanded queries for encoding
all_queries = list(expanded_queries.values())

# Encode queries and documents using SBERT
# Note: the full documents are encoded here, not summarized_documents;
# all-MiniLM-L6-v2 truncates inputs longer than 256 word pieces.
query_embeddings = sbert_model.encode(all_queries, convert_to_tensor=True)
document_embeddings = sbert_model.encode(documents, convert_to_tensor=True)

# Convert embeddings to NumPy
query_embeddings_np = query_embeddings.cpu().numpy()
document_embeddings_np = document_embeddings.cpu().numpy()

# Compute cosine similarities
similarities = cosine_similarity(query_embeddings_np, document_embeddings_np)

# Print results
for i, query in enumerate(all_queries):
    print(f"Query: {query}")
    for j, doc in enumerate(documents):
        print(f"  Similarity with Document {j+1}: {similarities[i][j]:.4f}")
    print("-" * 50)


Query: What are the multifaceted impacts of climate change on agriculture globally, considering alterations in temperature, precipitation patterns, increased frequency of extreme weather events, changes in atmospheric carbon dioxide concentrations, and sea-level rise, specifically examining effects on crop yields, livestock production, soil health, water resources, pest and disease distribution, and the livelihoods of farmers and agricultural communities, while also exploring regional variations and potential adaptation strategies?
  Similarity with Document 1: 0.7659
  Similarity with Document 2: 0.4430
  Similarity with Document 3: 0.2411
  Similarity with Document 4: 0.1232
  Similarity with Document 5: 0.4335
  Similarity with Document 6: 0.2716
  Similarity with Document 7: 0.3873
--------------------------------------------------
Query: What is the relationship between global warming and sea level rise, including the primary mechanisms driving this relationship such as thermal ex
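
The scores printed above are cosine similarities, that is, the dot product of two embedding vectors divided by the product of their norms. As a minimal illustration, the sketch below computes that quantity and the resulting ranking on toy 3-dimensional vectors rather than real SBERT embeddings. The vectors and the helper names `cosine` and `rank_documents` are made up for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
ranking = rank_documents(query_vec, doc_vecs)
for idx, score in ranking:
    print(f"Document {idx + 1}: {score:.4f}")
```

The vector pointing in the same direction as the query ranks first, the partially aligned one second, and the orthogonal one last.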

RE-RANK USING BERT CROSS ENCODER

In [None]:
import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

# Load the precomputed document embeddings (ensure these are tensors or numpy arrays)
document_embeddings = document_embeddings_np  # shape: (num_documents, embedding_dim)

# Sample queries
queries = ["electric vehicles"]

# Encode the query with SBERT (ensure this is a tensor or numpy array)
query_embedding = sbert_model.encode(queries, convert_to_tensor=True).cpu().numpy()  # shape: (1, embedding_dim)

# Step 1: Retrieve Top N Documents using Cosine Similarity

def retrieve_top_documents(query_embedding, document_embeddings, top_n=3):
    similarities = cosine_similarity(query_embedding, document_embeddings)
    sorted_idx = np.argsort(similarities[0])[::-1]  # Sort by similarity (descending order)
    return sorted_idx[:top_n], similarities[0][sorted_idx[:top_n]]

top_n = 3
top_document_indices, top_similarities = retrieve_top_documents(query_embedding, document_embeddings, top_n)

print("Top documents based on cosine similarity:")
for i in range(top_n):
    print(f"Document {top_document_indices[i]+1}: {documents[top_document_indices[i]]}")
    print(f"Cosine Similarity: {top_similarities[i]:.4f}")
    print("-" * 50)

# Step 2: Re-rank using BERT Cross-Encoder

# Load BERT tokenizer and model
bert_model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(bert_model_name)
bert_model = BertForSequenceClassification.from_pretrained(bert_model_name)

# Create Dataset for Query-Document pairs
class QueryDocumentDataset(Dataset):
    def __init__(self, query, documents, tokenizer, max_length=512):
        self.query = query
        self.documents = documents
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.documents)

    def __getitem__(self, idx):
        query = self.query
        document = self.documents[idx]
        inputs = self.tokenizer(query, document, padding='max_length', truncation=True, max_length=self.max_length, return_tensors="pt")
        return inputs

# Prepare the dataset for BERT re-ranking
top_documents = [documents[i] for i in top_document_indices]
dataset = QueryDocumentDataset(queries[0], top_documents, tokenizer)

# Create DataLoader for batching
dataloader = DataLoader(dataset, batch_size=2)

# Re-rank the documents using BERT Cross-Encoder
bert_model.eval()

def rerank_documents_with_bert(dataloader, model):
    scores = []
    with torch.no_grad():
        for batch in dataloader:
            # Each item is (1, max_length); drop the extra dimension after batching
            input_ids = batch['input_ids'].squeeze(1)  # (batch_size, max_length)
            attention_mask = batch['attention_mask'].squeeze(1)  # (batch_size, max_length)
            token_type_ids = batch['token_type_ids'].squeeze(1)  # marks query vs. document tokens

            # Forward pass through BERT
            outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
            logits = outputs.logits  # Shape: (batch_size, num_labels)
            probs = F.softmax(logits, dim=-1)  # Apply softmax to get probabilities

            # Class 1 is treated as 'relevant'. The classification head of
            # bert-base-uncased is newly initialized, so these scores are not
            # meaningful until the model is fine-tuned on relevance labels.
            scores.extend(probs[:, 1].cpu().numpy())  # Relevance score (class 1)

    return scores

# Get the re-ranking scores for the top documents
rerank_scores = rerank_documents_with_bert(dataloader, bert_model)

# Display the documents sorted by re-ranking score (descending)
reranked_order = np.argsort(rerank_scores)[::-1]
print("\nRe-ranked documents using BERT:")
for i in reranked_order:
    print(f"Document {top_document_indices[i]+1}: {documents[top_document_indices[i]]}")
    print(f"Relevance Score: {rerank_scores[i]:.4f}")
    print("-" * 50)


Top documents based on cosine similarity:
Document 4: The transition from gasoline and diesel-powered vehicles to electric vehicles (EVs) is one of the key solutions for reducing carbon emissions from the transportation sector. The transportation industry is one of the largest contributors to greenhouse gas emissions, with conventional vehicles powered by fossil fuels emitting significant amounts of carbon dioxide and other pollutants. Electric vehicles, by contrast, run on electricity, and when charged from renewable sources like solar and wind, they offer a near-zero-emission alternative. While the adoption of EVs is increasing globally, challenges remain in terms of infrastructure, affordability, and battery technology. However, governments around the world are offering subsidies and incentives to encourage the transition to electric vehicles, while automakers are investing heavily in research and development to improve battery performance and reduce costs. In the future, EVs are ex

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Re-ranked documents using BERT:
Document 4: The transition from gasoline and diesel-powered vehicles to electric vehicles (EVs) is one of the key solutions for reducing carbon emissions from the transportation sector. The transportation industry is one of the largest contributors to greenhouse gas emissions, with conventional vehicles powered by fossil fuels emitting significant amounts of carbon dioxide and other pollutants. Electric vehicles, by contrast, run on electricity, and when charged from renewable sources like solar and wind, they offer a near-zero-emission alternative. While the adoption of EVs is increasing globally, challenges remain in terms of infrastructure, affordability, and battery technology. However, governments around the world are offering subsidies and incentives to encourage the transition to electric vehicles, while automakers are investing heavily in research and development to improve battery performance and reduce costs. In the future, EVs are expected to
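
Re-ranking means reordering the retrieved candidates by the second-stage score rather than by the first-stage cosine similarity. The sketch below shows that step in isolation with a pluggable scorer. The names `rerank` and `score_fn` are hypothetical, and the keyword-overlap scorer is only a stand-in so the example runs without a model; a cross-encoder fine-tuned for relevance would take its place in practice.

```python
def rerank(query, candidates, score_fn):
    """Reorder (doc_id, text) candidates by score_fn(query, text), best first."""
    scored = [(doc_id, score_fn(query, text)) for doc_id, text in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-in scorer: fraction of query words that occur in the document
def overlap_score(query, text):
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words)


candidates = [
    (3, "pollution harms human health"),
    (4, "electric vehicles reduce carbon emissions"),
    (7, "reforestation and electric grids"),
]
reranked = rerank("electric vehicles", candidates, overlap_score)
for doc_id, score in reranked:
    print(f"Document {doc_id}: {score:.2f}")
```

Keeping the document identifier next to its score makes the final order independent of the order in which the first stage returned the candidates.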