
Lexical Reranking

Lexical reranking reorders an initial list of retrieved candidates using term-matching signals such as term frequency, inverse document frequency, and document length. It does not model meaning: a document scores well only when it shares exact words with the query. The code cell in this section extracts the text of three quarterly reports with pdfplumber, scores each report against the query "credit growth" with BM25, scores it again with TF-IDF, and then sums the two normalized score vectors to produce the final order.
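To make the lexical scoring step concrete, here is a minimal pure-Python sketch of a BM25-style scorer. It is an illustration, not the rank_bm25 implementation: the function name bm25_scores, the toy corpus, the parameter defaults k1=1.5 and b=0.75, and the smoothed IDF variant are all assumptions chosen for the example.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            n_t = doc_freq[term]
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalization
            denom = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / denom
        scores.append(score)
    return scores

# Toy corpus (hypothetical sentences, not taken from the reports)
corpus = [
    "credit growth slowed while margins declined".split(),
    "asset quality remained stable this quarter".split(),
    "strong credit growth and loan growth across retail".split(),
]
query = "credit growth".split()

scores = bm25_scores(query, corpus)
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking)
print(scores)
```

The second document shares no term with the query, so its score is zero no matter how relevant it might be in meaning. That is the limitation the later sections address.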

In [None]:
!pip install pdfplumber rank_bm25



In [None]:
import pdfplumber
import nltk
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Function to extract text from PDF using pdfplumber
def extract_text_from_pdf(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # extract_text() can return None for pages without a text layer
            text += (page.extract_text() or "") + "\n"
    return text

# Extract text from the provided PDFs
texts = [
    extract_text_from_pdf('icici_q1_by_idbi.pdf'),
    extract_text_from_pdf('icici_q2_by_idbi.pdf'),
    extract_text_from_pdf('icici_q3_by_idbi.pdf')
]

# Preprocess the text
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer data from this resource
tokenized_texts = [nltk.word_tokenize(text.lower()) for text in texts]

# Implement Initial Search Using BM25
bm25 = BM25Okapi(tokenized_texts)
query = "credit growth"
tokenized_query = nltk.word_tokenize(query.lower())
bm25_scores = bm25.get_scores(tokenized_query)

# Implement TF-IDF Reranking
documents = [' '.join(tokens) for tokens in tokenized_texts]
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
query_vec = vectorizer.transform([query])
tfidf_scores = (tfidf_matrix * query_vec.T).toarray().flatten()

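# Combine BM25 and TF-IDF Scores for Reranking
# L2-normalize each score vector; skip the division when a vector is all zeros
bm25_norm = np.linalg.norm(bm25_scores)
tfidf_norm = np.linalg.norm(tfidf_scores)
bm25_scores_normalized = bm25_scores / bm25_norm if bm25_norm > 0 else bm25_scores
tfidf_scores_normalized = tfidf_scores / tfidf_norm if tfidf_norm > 0 else tfidf_scores
final_scores = bm25_scores_normalized + tfidf_scores_normalized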
sorted_indices = final_scores.argsort()[::-1]

# Retrieve and print the sorted documents
sorted_texts = [texts[i] for i in sorted_indices]

for i, text in enumerate(sorted_texts):
    print(f"Document {i+1}:")
    print(text[:1000])  # Print first 1000 characters of each document for preview
    print("\n" + "-"*100 + "\n")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Document 1:
Q1FY24 Result Review
ICICI Bank B U Y
T P Rs.1,240 Key Stock Data
CMP Rs.997 ICICIBC IN/ICBK.BO
NIMs declined QoQ; RoA sustained at multi quarter high Potential upside/downside 24% Sector Banking
Previous Rating BUY
Shares o/s (mn) 6,997
Summary Market cap. (Rs mn) 6,974,300
Price Performance (%)
ICICI Bank’s (one of our top picks) reported decline in NIMs by 12bps QoQ to 223.7
-1m -3m -12m
4.78% during Q1FY24 led by higher cost of deposits. Asset quality remain stable Absolute 7.7 12.7 24.6 52-week high / low Rs1,002 / 787
with GNPA at 2.76% vs 2.81% QoQ led by higher slippages. Also, restructured Rel to Sensex 2.2 0.9 5.7 Sensex / Nifty 66,684 / 19,745
assets stood at 0.4% vs 0.4% QoQ. Credit growth declined to 18% YoY vs 19%
V/s Consensus Shareholding Pattern (%)
YoY (FY23) as overseas book declined by 29.5% YoY. Bank reported strong
EPS (Rs) FY24E FY25E Promoters 0.0
profitability growth at 40% YoY led by strong NII growth. During Q1FY24, NII
IDBI Capital 49.7 59.2 FII 

LTR (Learning to Rank)

LTR stands for Learning to Rank. An LTR reranker is a supervised model that orders candidates: each candidate is described by a feature vector, and the model learns from relevance labels to assign scores so that more relevant candidates rank higher. The code cell in this section extracts four financial metrics (NIMs, GNPA, credit growth, net profit growth) from each quarterly report with regular expressions, uses them as features, and trains an XGBoost ranker with the rank:pairwise objective on example labels. In the recorded run the model assigns every report a score of 0.0, so it yields no usable ordering. This is likely because three rows with placeholder labels are too few to train on, and metrics the regular expressions fail to find are filled in as 0.
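The two ideas behind a pairwise ranker evaluated with NDCG can be sketched in plain Python. This is an illustrative sketch rather than XGBoost's internal implementation: the helper names preference_pairs, dcg_at_k, and ndcg_at_k and the two score lists are hypothetical, and the exponential gain 2^rel - 1 is one common NDCG formulation. The labels [3, 1, 2] are the example targets used in this section's code cell.

```python
import math

def preference_pairs(labels):
    """Return (i, j) index pairs where item i should rank above item j."""
    return [(i, j)
            for i in range(len(labels))
            for j in range(len(labels))
            if labels[i] > labels[j]]

def dcg_at_k(relevances, k):
    """Discounted cumulative gain with exponential gain (2^rel - 1)."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(true_relevance, predicted_scores, k):
    """NDCG@k of the ordering induced by predicted_scores (higher = earlier)."""
    order = sorted(range(len(predicted_scores)),
                   key=lambda i: predicted_scores[i], reverse=True)
    ranked = [true_relevance[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(true_relevance, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0

labels = [3, 1, 2]             # example relevance labels for Q1, Q2, Q3
good_scores = [0.9, 0.1, 0.5]  # hypothetical scores that agree with the labels
bad_scores = [0.1, 0.9, 0.5]   # hypothetical scores that invert the labels

pairs = preference_pairs(labels)
ndcg_good = ndcg_at_k(labels, good_scores, k=3)
ndcg_bad = ndcg_at_k(labels, bad_scores, k=3)
print(pairs)
print(ndcg_good, ndcg_bad)
```

A pairwise objective tries to score the first item of each pair above the second. When a model returns the same score for every item, the ordering comes only from tie-breaking, so the metric says nothing about what the model learned.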



In [None]:
!pip install pdfplumber xgboost



In [None]:
# Import Libraries
import pdfplumber
import re
import xgboost as xgb
import pandas as pd

# Function to extract text from PDF using pdfplumber
def extract_text_from_pdf(file_path):
    text = ""
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    return text

# Extract text from the provided PDFs
pdf_q1_path = "icici_q1_by_idbi.pdf"
pdf_q2_path = "icici_q2_by_idbi.pdf"
pdf_q3_path = "icici_q3_by_idbi.pdf"

text_q1 = extract_text_from_pdf(pdf_q1_path)
text_q2 = extract_text_from_pdf(pdf_q2_path)
text_q3 = extract_text_from_pdf(pdf_q3_path)

# Function to extract financial metrics from the text
def extract_metrics(text):
    metrics = {
        'NIMs': 0,
        'GNPA': 0,
        'Credit Growth': 0,
        'Net Profit Growth': 0
    }

    # Extract NIMs
    nim_match = re.search(r'NIM(?:s)? (?:declined|grew) by (\d+\.?\d*)\s*bps', text)
    if nim_match:
        metrics['NIMs'] = float(nim_match.group(1))
    else:
        print("NIMs not found")

    # Extract GNPA
    gnpa_match = re.search(r'GNPA (?:at|stood at) (\d+\.?\d*)%', text)
    if gnpa_match:
        metrics['GNPA'] = float(gnpa_match.group(1))
    else:
        print("GNPA not found")

    # Extract Credit Growth
    credit_growth_match = re.search(r'Credit growth (?:remains strong at|was) (\d+\.?\d*)% YoY', text)
    if credit_growth_match:
        metrics['Credit Growth'] = float(credit_growth_match.group(1))
    else:
        print("Credit Growth not found")

    # Extract Net Profit Growth
    net_profit_growth_match = re.search(r'Net profit growth (?:at|was) (\d+\.?\d*)% YoY', text)
    if net_profit_growth_match:
        metrics['Net Profit Growth'] = float(net_profit_growth_match.group(1))
    else:
        print("Net Profit Growth not found")

    return metrics

# Extract metrics from each quarterly report
metrics_q1 = extract_metrics(text_q1)
metrics_q2 = extract_metrics(text_q2)
metrics_q3 = extract_metrics(text_q3)

print("Q1 FY24 Metrics:", metrics_q1)
print("Q2 FY24 Metrics:", metrics_q2)
print("Q3 FY24 Metrics:", metrics_q3)

# Create a DataFrame with the extracted metrics
data = pd.DataFrame([
    {**metrics_q1, 'Quarter': 'Q1 FY24'},
    {**metrics_q2, 'Quarter': 'Q2 FY24'},
    {**metrics_q3, 'Quarter': 'Q3 FY24'}
])

# Print DataFrame to check for correct columns
print(data)

# Fill missing values with zeros (if any)
data.fillna(0, inplace=True)

# Define the features and target
features = ['NIMs', 'GNPA', 'Credit Growth', 'Net Profit Growth']
X = data[features]
y = [3, 1, 2]  # Example target values for ranking; adjust these based on actual ranking criteria

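# Convert data to DMatrix format required by xgboost
dtrain = xgb.DMatrix(X, label=y)
# Ranking objectives compare rows within a query group; here all three reports form one group
dtrain.set_group([len(X)])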

# Define parameters for the XGBoost ranker
params = {
    'objective': 'rank:pairwise',
    'eval_metric': 'ndcg',
    'eta': 0.1,
    'gamma': 1.0,
    'min_child_weight': 0.1,
    'max_depth': 6
}

# Train the model
model = xgb.train(params, dtrain, num_boost_round=10)

# Predict the ranking
predictions = model.predict(dtrain)

# Add predictions to the DataFrame and sort by them.
# Note: the values are relevance scores (higher = ranked earlier), not rank positions.
data['Predicted Rank'] = predictions
data = data.sort_values(by='Predicted Rank', ascending=False)

# Print the final ranked DataFrame
print(data)


Credit Growth not found
Net Profit Growth not found
Net Profit Growth not found
Net Profit Growth not found
Q1 FY24 Metrics: {'NIMs': 12.0, 'GNPA': 2.76, 'Credit Growth': 0, 'Net Profit Growth': 0}
Q2 FY24 Metrics: {'NIMs': 25.0, 'GNPA': 2.48, 'Credit Growth': 18.0, 'Net Profit Growth': 0}
Q3 FY24 Metrics: {'NIMs': 10.0, 'GNPA': 2.3, 'Credit Growth': 18.5, 'Net Profit Growth': 0}
   NIMs  GNPA  Credit Growth  Net Profit Growth  Quarter
0  12.0  2.76            0.0                  0  Q1 FY24
1  25.0  2.48           18.0                  0  Q2 FY24
2  10.0  2.30           18.5                  0  Q3 FY24
   NIMs  GNPA  Credit Growth  Net Profit Growth  Quarter  Predicted Rank
0  12.0  2.76            0.0                  0  Q1 FY24             0.0
1  25.0  2.48           18.0                  0  Q2 FY24             0.0
2  10.0  2.30           18.5                  0  Q3 FY24             0.0


Semantic Reranking

Semantic reranking reorders candidates by similarity of meaning rather than by shared words. The query and each document are encoded as dense vectors, and documents are ranked by how close their vectors are to the query vector, so a document can rank highly even when its wording differs from the query. The code cell in this section mean-pools the final-layer token embeddings of bert-base-uncased for each report and for the query "financial performance", then ranks the reports by cosine similarity. Because inputs are truncated to 512 tokens, only the beginning of each report contributes to its embedding.
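The ranking step itself is independent of BERT and can be sketched with small hand-written vectors. In this sketch the helper names mean_pool and cosine, the document ids, and all vector values are hypothetical stand-ins for real token embeddings.

```python
import math

def mean_pool(token_vectors):
    """Average a list of per-token vectors into one fixed-size vector."""
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / len(token_vectors)
            for d in range(dim)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy 3-dimensional "token embeddings" (made-up values for illustration)
query_vec = mean_pool([[1.0, 0.0, 1.0], [1.0, 0.2, 0.8]])
doc_vecs = {
    "doc_a": mean_pool([[0.8, 0.2, 1.0], [1.0, 0.0, 0.6]]),
    "doc_b": mean_pool([[0.0, 1.0, 0.0], [0.2, 0.8, 0.1]]),
}

# Rank documents by similarity to the query, most similar first
ranked = sorted(((doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()),
                key=lambda item: item[1], reverse=True)
for doc_id, sim in ranked:
    print(doc_id, round(sim, 4))
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why mean-pooled embeddings of texts with very different lengths can still be compared.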



In [None]:
# Import Libraries
import pdfplumber
import re
import pandas as pd
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Function to extract text from PDF using pdfplumber
def extract_text_from_pdf(file_path):
    text = ""
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    return text

# Extract text from the provided PDFs
pdf_q1_path = "icici_q1_by_idbi.pdf"
pdf_q2_path = "icici_q2_by_idbi.pdf"
pdf_q3_path = "icici_q3_by_idbi.pdf"

text_q1 = extract_text_from_pdf(pdf_q1_path)
text_q2 = extract_text_from_pdf(pdf_q2_path)
text_q3 = extract_text_from_pdf(pdf_q3_path)

# Combine texts into a list for easy processing
texts = [text_q1, text_q2, text_q3]

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings for a given text
def get_bert_embeddings(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze()
    return embeddings.numpy()

# Get embeddings for each text
embeddings = [get_bert_embeddings(text) for text in texts]

# Rank the documents based on a specific query or criterion
# For this example, we'll rank based on similarity to a hypothetical query "financial performance"
query = "financial performance"
query_embedding = get_bert_embeddings(query)

# Calculate cosine similarity between the query and document embeddings
similarities = cosine_similarity([query_embedding], embeddings)[0]

# Create a DataFrame with the extracted metrics and similarities
data = pd.DataFrame([
    {'Quarter': 'Q1 FY24', 'Similarity': similarities[0]},
    {'Quarter': 'Q2 FY24', 'Similarity': similarities[1]},
    {'Quarter': 'Q3 FY24', 'Similarity': similarities[2]}
])

# Sort the DataFrame by similarity to get the ranking
data = data.sort_values(by='Similarity', ascending=False)

# Print the final ranked DataFrame
print(data)


   Quarter  Similarity
2  Q3 FY24    0.362903
0  Q1 FY24    0.350945
1  Q2 FY24    0.350294


Hybrid Methods

A hybrid method combines lexical and semantic signals into a single score. In the code cell for this section, the lexical score is the number of occurrences of a fixed keyword list (NIM, GNPA, Credit Growth, Net Profit Growth) in each report, and the semantic score is the cosine similarity between the BERT embeddings of the report and of the query "financial performance". Both score lists are min-max normalized to the range [0, 1] and then averaged with equal weight, so a report ranks first only if it does well on both signals or very well on one.
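The fusion step can be sketched on its own, with a guard for the case where all scores in a list are equal. The names min_max and hybrid_scores and the weight alpha are assumptions for this sketch, and the keyword counts are hypothetical. The semantic values are the cosine similarities printed in the Semantic Reranking output.

```python
def min_max(scores):
    """Scale scores to [0, 1]; return zeros when all scores are equal."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(lexical, semantic, alpha=0.5):
    """Weighted sum of normalized scores; alpha is the lexical weight."""
    lex_n = min_max(lexical)
    sem_n = min_max(semantic)
    return [alpha * l + (1 - alpha) * s for l, s in zip(lex_n, sem_n)]

quarters = ["Q1 FY24", "Q2 FY24", "Q3 FY24"]
lexical_counts = [4, 10, 12]                    # hypothetical keyword counts
semantic_sims = [0.350945, 0.350294, 0.362903]  # from the semantic output

combined = hybrid_scores(lexical_counts, semantic_sims, alpha=0.5)
ranking = sorted(range(len(quarters)), key=lambda i: combined[i], reverse=True)
for i in ranking:
    print(quarters[i], round(combined[i], 5))
```

Min-max scaling stretches each list to the full [0, 1] range regardless of how far apart the raw values are. The three cosine similarities here differ by about 0.013, yet after scaling they span 0 to 1, so small differences in the raw similarity can have a large effect on the combined score.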

In [None]:
# Import Libraries
import pdfplumber
import re
import pandas as pd
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Function to extract text from PDF using pdfplumber
def extract_text_from_pdf(file_path):
    text = ""
    with pdfplumber.open(file_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"
    return text

# Extract text from the provided PDFs
pdf_q1_path = "icici_q1_by_idbi.pdf"
pdf_q2_path = "icici_q2_by_idbi.pdf"
pdf_q3_path = "icici_q3_by_idbi.pdf"

text_q1 = extract_text_from_pdf(pdf_q1_path)
text_q2 = extract_text_from_pdf(pdf_q2_path)
text_q3 = extract_text_from_pdf(pdf_q3_path)

# Combine texts into a list for easy processing
texts = [text_q1, text_q2, text_q3]

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings for a given text
def get_bert_embeddings(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze()
    return embeddings.numpy()

# Get embeddings for each text
embeddings = [get_bert_embeddings(text) for text in texts]

# Rank the documents based on a specific query or criterion
# For this example, we'll rank based on similarity to a hypothetical query "financial performance"
query = "financial performance"
query_embedding = get_bert_embeddings(query)

# Calculate cosine similarity between the query and document embeddings
semantic_similarities = cosine_similarity([query_embedding], embeddings)[0]

# Define important keywords for lexical reranking
keywords = ["NIM", "GNPA", "Credit Growth", "Net Profit Growth"]

# Function to perform lexical scoring
def lexical_score(text, keywords):
    score = 0
    for keyword in keywords:
        matches = re.findall(r'\b' + re.escape(keyword) + r'\b', text, re.IGNORECASE)
        score += len(matches)
    return score

# Get lexical scores for each text
lexical_scores = [lexical_score(text, keywords) for text in texts]

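# Normalize scores to combine them
def normalize_scores(scores):
    min_score = min(scores)
    max_score = max(scores)
    score_range = max_score - min_score
    # Min-max scaling is undefined when all scores are equal; return zeros instead
    if score_range == 0:
        return [0.0 for _ in scores]
    normalized = [(score - min_score) / score_range for score in scores]
    return normalized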

# Normalize lexical and semantic scores
normalized_lexical_scores = normalize_scores(lexical_scores)
normalized_semantic_scores = normalize_scores(semantic_similarities)

# Combine scores with a simple average
combined_scores = [(lex + sem) / 2 for lex, sem in zip(normalized_lexical_scores, normalized_semantic_scores)]

# Create a DataFrame with the combined scores
data = pd.DataFrame([
    {'Quarter': 'Q1 FY24', 'Lexical Score': normalized_lexical_scores[0], 'Semantic Score': normalized_semantic_scores[0], 'Combined Score': combined_scores[0]},
    {'Quarter': 'Q2 FY24', 'Lexical Score': normalized_lexical_scores[1], 'Semantic Score': normalized_semantic_scores[1], 'Combined Score': combined_scores[1]},
    {'Quarter': 'Q3 FY24', 'Lexical Score': normalized_lexical_scores[2], 'Semantic Score': normalized_semantic_scores[2], 'Combined Score': combined_scores[2]}
])

# Sort the DataFrame by combined score to get the final ranking
data = data.sort_values(by='Combined Score', ascending=False)

# Print the final ranked DataFrame
print(data)


   Quarter  Lexical Score  Semantic Score  Combined Score
2  Q3 FY24           1.00         1.00000         1.00000
1  Q2 FY24           0.75         0.00000         0.37500
0  Q1 FY24           0.00         0.05168         0.02584
