## ❗ Problem Statement


##### Understand Quantitative Measures of Relevance


+ **NDCG@10**: Normalized Discounted Cumulative Gain at 10 measures how well a retrieval system finds and correctly orders the top 10 documents. The score is normalized against the ideal ordering, so it ranges from 0 to 1 (often reported on a 0 to 100 scale); higher values mean the system's ranked list is closer to the ideal order. NDCG@10 is widely used because it rewards both retrieving relevant results and placing them in the right sequence.

+ **NDCG@3**: The same metric restricted to the top 3 documents. It is particularly relevant where accuracy in the topmost results is crucial, as in generative AI applications. It scores the system's ability to identify and correctly rank the three most relevant documents.

+ **Recall@50**: The proportion of high-quality documents found within the top 50 results. It is calculated by counting the documents in the top 50 that a scoring prompt rates as high quality and dividing by the total number of known good documents for the query. It is useful for assessing how well the system retrieves a broad set of relevant documents from a large pool.
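
To make these definitions concrete, here is a minimal from-scratch sketch of the three metrics. It assumes linear gain (the relevance grade itself, discounted by `log2` of the rank), and the relevance labels are hypothetical sample data, not results from a real index.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """NDCG@k: DCG of the system's ranking divided by DCG of the ideal ranking."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_relevances, k, total_relevant):
    """Recall@k: relevant documents in the top k divided by all known relevant documents."""
    if total_relevant == 0:
        return 0.0
    hits = sum(1 for rel in ranked_relevances[:k] if rel > 0)
    return hits / total_relevant


# Hypothetical relevance labels, listed in the order the system ranked the documents.
ranked = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

print(f"NDCG@10: {ndcg_at_k(ranked, 10):.4f}")
print(f"NDCG@3: {ndcg_at_k(ranked, 3):.4f}")
print(f"Recall@50: {recall_at_k(ranked, 50, total_relevant=3):.2f}")
```

NDCG@3 comes out lower than NDCG@10 for this ranking because the third relevant document sits at position 5, outside the top 3.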

##### Limitations of Keyword Search and Embeddings

+ Limitations of Embedding-Based Search

    - Weakness in Keyword Precision: Embedding-based search captures overall context and semantic meaning well but may fail to match specific keywords or phrases. It can miss documents containing exact terms if those terms are not semantically aligned with the rest of the content or query.
    - Contextual Misinterpretation: Embeddings can overgeneralize or misinterpret the context, retrieving documents that are broadly relevant but miss specific nuances or details. They may struggle to distinguish subtle differences in meaning, especially in specialized or technical domains.
    - Dependency on Training Data: The effectiveness of embeddings depends heavily on the data they were trained on. If the training data lacks diversity or depth in certain topics, the embeddings may not capture those areas well.

+ Limitations of Keyword Search

    - Struggles with Synonyms and Paraphrasing: Traditional keyword search is rigid in matching terms. It may not recognize synonyms or different ways of expressing the same idea, which limits its ability to retrieve all relevant documents.
    - Limited Understanding of Context: Keyword search is effective at finding documents with specific terms but may not grasp the broader context or the intent behind a query. This limitation becomes pronounced in complex queries where understanding the context or the relationship between terms is crucial.
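
The synonym problem described above can be seen with a toy exact-term scorer. This is only an illustrative sketch: `term_overlap` is a hypothetical helper, far simpler than a production ranking function, and the query and documents are made up.

```python
import re


def term_overlap(query, document):
    """Count query terms that appear verbatim in the document (case-insensitive)."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    doc_terms = set(re.findall(r"\w+", document.lower()))
    return len(query_terms & doc_terms)


query = "car repair"
docs = [
    "Car repair manual for beginners",  # shares both query terms
    "Automobile maintenance guide",     # same topic, no shared terms
]

for doc in docs:
    print(term_overlap(query, doc), doc)
```

The second document is about the same topic but scores zero, which is the gap that vector search is meant to close.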



## 💡 Solution


Hybrid Search as a Winner: Hybrid search combines keyword and vector search, capitalizing on the strengths of both. Keyword search is strong at matching specific terms, while vector search is strong at capturing semantic similarity. Running both and merging their results gives more comprehensive and accurate retrieval, which is especially effective for diverse and complex queries.
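
One common way to merge a keyword ranking and a vector ranking is Reciprocal Rank Fusion (RRF), which is the approach Azure AI Search documents for hybrid queries. The sketch below is a simplified illustration, not the service's implementation: the constant `k=60` is a conventional default assumed here, and the document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one list of (doc_id, score).

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1; the contributions are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical rankings from the two retrieval methods.
keyword_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_c", "doc_d"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents that appear in both lists (`doc_b`, `doc_c`) rise above documents that only one method found, even when that method ranked them first.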

Re-Ranking and L2 in Cognitive Search: The L2 layer in cognitive search improves upon the initial retrieval (L1) results by applying advanced ranking algorithms. It reorders the top documents, focusing on enhancing relevance and contextual accuracy. This is particularly important in scenarios where the initial retrieval might miss subtle nuances. L2 uses more sophisticated techniques, often leveraging deep learning models, to ensure the most relevant results are prioritized. 

In more detail: The semantic ranker runs the query and document text simultaneously through transformer models that use the cross-attention mechanism to produce a ranker score. The score for each query and document chunk is calibrated to a range that is consistent across all indexes and queries. A score of 0 represents a very irrelevant chunk, and a score of 4 represents an excellent one. In the chart below, Hybrid + Semantic ranking finds the best content for the LLM at each result set size.
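
Because the reranker score is calibrated to the same 0 to 4 range across queries, it can be used as a cutoff before passing chunks to an LLM. The following is a minimal sketch under assumptions: the threshold of 2.0 is arbitrary, `filter_by_reranker_score` is a hypothetical helper, and the sample results only mimic the shape of search result dictionaries.

```python
def filter_by_reranker_score(results, min_score=2.0):
    """Keep results whose semantic reranker score is at least min_score, best first."""
    kept = [
        doc for doc in results
        if doc.get("@search.reranker_score") is not None
        and doc["@search.reranker_score"] >= min_score
    ]
    return sorted(kept, key=lambda doc: doc["@search.reranker_score"], reverse=True)


# Hypothetical result documents shaped like the dictionaries a search call yields.
sample_results = [
    {"id": "chunk-1", "@search.reranker_score": 2.58},
    {"id": "chunk-2", "@search.reranker_score": 0.91},
    {"id": "chunk-3", "@search.reranker_score": None},  # reranker was not applied
    {"id": "chunk-4", "@search.reranker_score": 3.12},
]

top_chunks = filter_by_reranker_score(sample_results, min_score=2.0)
print([doc["id"] for doc in top_chunks])
```

A suitable threshold depends on the corpus and the application, so it is worth tuning against labelled queries rather than fixing it up front.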

## 📝 How-to

### 🌐 Azure Hybrid Search with Semantic Reranker

This section covers the implementation of a hybrid search system that combines traditional Azure search with a semantic reranker for improved results.

### 📊 Implementation of Evaluation Metrics using Scikit-learn

We will use the `ndcg_score` function from the `sklearn.metrics` module to evaluate our search system. This function calculates the Normalized Discounted Cumulative Gain (NDCG), a commonly used metric for evaluating the quality of a ranked list of items.

```python
from sklearn.metrics import ndcg_score
import numpy as np

# Sample data: Predicted scores and true relevance scores for a set of documents
predicted_scores = np.array([[0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]])
true_relevance = np.array([[1, 1, 0, 0, 1, 0, 0, 0, 0, 0]])  # Assuming binary relevance (1 for relevant, 0 for not relevant)

# NDCG@10
ndcg_at_10 = ndcg_score(true_relevance, predicted_scores, k=10)
print(f"NDCG@10: {ndcg_at_10}")

# NDCG@3
ndcg_at_3 = ndcg_score(true_relevance, predicted_scores, k=3)
print(f"NDCG@3: {ndcg_at_3}")

# Recall@50: relevant documents found in the top k, divided by all known relevant documents.
# Only 10 documents are available in this example, so k is capped at 10.
k = min(50, true_relevance.shape[1])
top_k_indices = np.argsort(-predicted_scores[0])[:k]
recall_at_50 = np.sum(true_relevance[0][top_k_indices] > 0) / np.sum(true_relevance[0] > 0)
print(f"Recall@50: {recall_at_50}")
```

### 🎯 Evaluation Process

1. **🔍 Gather Azure Cognitive Search Results**: Retrieve the search results from Azure Cognitive Search.

2. **🎯 Define Ground Truth Relevance Scores**: Establish a ground truth set of relevance scores for the search results.

3. **📈 Calculate NDCG@3**: Use the `ndcg_score` function to calculate the NDCG at the 3rd position. This gives us a measure of the quality of the top 3 results.
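
The three steps above can be sketched as a small helper that lines up retrieved results with ground-truth labels in the shape `ndcg_score` expects. This is an illustrative sketch: `build_relevance_arrays`, the document IDs, and the relevance grades are all hypothetical, and unlabelled documents are assumed to have relevance 0.

```python
def build_relevance_arrays(results, ground_truth):
    """Align search results with ground-truth labels.

    results: list of (doc_id, score) pairs in the order returned by the search.
    ground_truth: dict mapping doc_id to a relevance grade; missing IDs count as 0.
    Returns (true_relevance, predicted_scores), each a list holding one row.
    """
    true_relevance = [[ground_truth.get(doc_id, 0) for doc_id, _ in results]]
    predicted_scores = [[score for _, score in results]]
    return true_relevance, predicted_scores


# Step 1: hypothetical search results (document ID and score).
results = [("doc-7", 2.58), ("doc-2", 2.51), ("doc-9", 2.49), ("doc-4", 1.10)]

# Step 2: hypothetical ground-truth relevance grades for this query.
ground_truth = {"doc-7": 0, "doc-2": 1, "doc-4": 1}

# Step 3: arrays ready to be scored.
true_relevance, predicted_scores = build_relevance_arrays(results, ground_truth)
print(true_relevance)
print(predicted_scores)
```

The two returned lists can then be passed to `ndcg_score(true_relevance, predicted_scores, k=3)`.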

## Getting Started

Before you start, ensure you have a `.env` file in your project directory with the following keys:

```plaintext
# Azure AI Search Service Configuration
AZURE_AI_SEARCH_SERVICE_ENDPOINT="[Your Azure Search Service Endpoint]"
AZURE_SEARCH_ADMIN_KEY="[Your Azure Search Admin Key]"

# Azure OpenAI Configuration
AZURE_OPENAI_API_KEY='[Your Azure OpenAI API Key]'
AZURE_OPENAI_ENDPOINT='[Your Azure OpenAI Endpoint]'
AZURE_OPENAI_API_VERSION='[Your Azure OpenAI API Version]'
```
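
Before running the notebook, it can help to confirm that every key is set. The check below is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper, and it assumes the variables have already been loaded into the process environment (for example with `load_dotenv()`).

```python
import os

# Keys listed in the .env template above.
REQUIRED_ENV_VARS = [
    "AZURE_AI_SEARCH_SERVICE_ENDPOINT",
    "AZURE_SEARCH_ADMIN_KEY",
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_ENDPOINT",
    "AZURE_OPENAI_API_VERSION",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_ENV_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```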

#### Setting Up Conda Environment and Configuring VSCode for Jupyter Notebooks

Follow these steps to create a Conda environment and set up your VSCode for running Jupyter Notebooks:

##### Create Conda Environment from the Repository

1. **Prepare the Environment File**:
   - Ensure you have an `environment.yml` file in your repository. This file should list all the necessary libraries and dependencies for your project.

2. **Use `make` to Create the Conda Environment**:
   - In your terminal or command line, navigate to the repository directory and look at the Makefile.
   - Execute the `make` command specified below to create the Conda environment using the `environment.yml` file:
     
     ```bash
     make create_conda_env
     ```

   - This command runs a `make` target that creates a Conda environment as defined in `environment.yml`.

3. **Activate the Environment**:
   - After creation, activate the new Conda environment by using:
     ```bash
     conda activate [YourEnvName]
     ```
     Replace `[YourEnvName]` with the name of your environment as specified in `environment.yml`.

##### Configure VSCode for Jupyter Notebooks

1. **Install Required Extensions**:
   - Download and install the `Python` and `Jupyter` extensions in VSCode.

2. **Attach Kernel to VSCode**:
   - Once the Conda environment is created, you should be able to see it in the kernel selection (top right corner of your VSCode interface).
   - Select your newly created environment as the kernel for running Jupyter Notebooks.

By following these steps, you'll set up a dedicated Conda environment for your project and configure VSCode to run Jupyter Notebooks efficiently. This environment will contain all the necessary dependencies in your `environment.yml` file.




```python
import os
from dotenv import load_dotenv
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import Vector

from src.gbb_ai.langchain_integration_azureai import TextChunkingIndexing
```

```python
# Load environment variables from .env file
load_dotenv()

# Set up Azure Cognitive Search credentials
service_endpoint = os.getenv("AZURE_AI_SEARCH_SERVICE_ENDPOINT")
key = os.getenv("AZURE_SEARCH_ADMIN_KEY")
credential = AzureKeyCredential(key)

# Create an instance of the TextChunkingIndexing class
gbb_ai_client = TextChunkingIndexing()

# Load the environment variables from the .env file
gbb_ai_client.load_environment_variables_from_env_file()

# Define the name of the deployment
DEPLOYMENT_NAME = "foundational-ada"

# Load the embedding model associated with the specified deployment
embedding_model = gbb_ai_client.load_embedding_model(azure_deployment=DEPLOYMENT_NAME)
```

```plaintext
2023-12-13 17:52:02,425 - micro - MainProcess - INFO     Loading OpenAIEmbeddings object with model, deployment foundational-ada, and chunk size 1000 (langchain_integration_azureai.py:load_embedding_model:114)
2023-12-13 17:52:03,202 - micro - MainProcess - INFO     AzureOpenAIEmbeddings object created successfully. (langchain_integration_azureai.py:load_embedding_model:125)
```


```python
# Define the name of the Azure Search index
# This is the index where your data is stored in Azure Search
index_name = 'index-churchofjesuschrist-web'

# Set up the Azure Search client with the specified index
# This prepares the client to interact with the Azure Search service
search_client = SearchClient(service_endpoint, index_name, credential=credential)
```

```python
# Embed the query text so it can be used for vector search
search_query = "Who is Jesus Christ?"
search_vector = embedding_model.embed_query(search_query)
```

```python
# Pure vector search
r = search_client.search(None, top=5, vectors=[Vector(value=search_vector, k=50, fields="content_vector")])
for doc in r:
    content = doc["content"].replace("\n", " ")[:1000]
    print(f"score: {doc['@search.score']}. {content}")
```

```plaintext
score: 0.84436035. God’s Work of Salvation and Exaltation   Living the Gospel of Jesus Christ   16. Living the Gospel of Jesus ChristWe live the gospel as we exercise faith in Jesus Christ, repent daily, make covenants with God as we receive the ordinances of salvation and exaltation, and endure to the end by keeping those covenants.  17. Teaching the Gospel   17. Teaching the GospelEffective gospel teaching helps people grow in their testimonies and their faith in Heavenly Father and Jesus Christ.
score: 0.83588856. Isaiah 7Ephraim and Syria wage war against Judah—Christ will be born of a virgin—Compare 2 Nephi 17.   Isaiah 8Christ will be as a stone of stumbling and a rock of offense—Seek the Lord, not muttering wizards—Turn to the law and to the testimony for guidance—Compare 2 Nephi 18.   Isaiah 9Isaiah speaks about the Messiah—The people in darkness will see a great Light—Unto us a Child is born—He will be the Prince of Peace and reign on David’s throne—Compare 2 Nephi 19.
score: 
```

```python
# Keyword search
r = search_client.search(search_query, top=5)
for doc in r:
    if "Jesus" in doc["content"]:
        content = doc["content"].replace("\n", " ")[:1000]
        print(f"score: {doc['@search.score']}. {content}")
```

```plaintext
score: 8.379741. 18.12.1. Who Performs the OrdinanceOrdinances and blessings are sacred acts performed by the authority of the priesthood and in the name of Jesus Christ. As priesthood holders perform ordinances and blessings, they follow the Savior’s example of blessing others.
score: 8.292201. 18.10.4. Who Performs the OrdinanceOrdinances and blessings are sacred acts performed by the authority of the priesthood and in the name of Jesus Christ. As priesthood holders perform ordinances and blessings, they follow the Savior’s example of blessing others.
score: 8.288818. 18.6.1. Who Gives the BlessingOrdinances and blessings are sacred acts performed by the authority of the priesthood and in the name of Jesus Christ. As priesthood holders perform ordinances and blessings, they follow the Savior’s example of blessing others.   18.6.2. InstructionsOrdinances and blessings are sacred acts performed by the authority of the priesthood and in the name of Jesus Christ. As priesthood holders pe
```

```python
# Hybrid search (keyword + vector), without semantic reranking
r = search_client.search(search_query, top=5, vectors=[Vector(value=search_vector, k=50, fields="content_vector")])
for doc in r:
    content = doc["content"].replace("\n", " ")[:1000]
    print(f"score: {doc['@search.score']}, reranker: {doc['@search.reranker_score']}. {content}")
```

```plaintext
score: 0.027973394840955734, reranker: None. 17.1. Principles of Christlike Teaching   17.1. Principles of Christlike TeachingEffective gospel teaching helps people grow in their testimonies and their faith in Heavenly Father and Jesus Christ.   17.1.1. Love Those You TeachEffective gospel teaching helps people grow in their testimonies and their faith in Heavenly Father and Jesus Christ.   17.1.2. Teach by the SpiritEffective gospel teaching helps people grow in their testimonies and their faith in Heavenly Father and Jesus Christ.
score: 0.026012461632490158, reranker: None. God’s Work of Salvation and Exaltation   Living the Gospel of Jesus Christ   16. Living the Gospel of Jesus ChristWe live the gospel as we exercise faith in Jesus Christ, repent daily, make covenants with God as we receive the ordinances of salvation and exaltation, and endure to the end by keeping those covenants.  17. Teaching the Gospel   17. Teaching the GospelEffective gospel teaching helps people grow in th
```

```python
# Hybrid retrieval + semantic rerank
r = search_client.search(
    search_query,
    top=5,
    vectors=[Vector(value=search_vector, k=50, fields="content_vector")],
    query_type="semantic",
    semantic_configuration_name="config",
    query_language="en-us",
)

for doc in r:
    content = doc["content"].replace("\n", " ")[:1000]
    print(f"score: {doc['@search.score']}, reranker: {doc['@search.reranker_score']}. {content}")
```

```plaintext
score: 0.015384615398943424, reranker: 2.584941864013672. 27.1.3. Members Who Have Physical DisabilitiesThe temple is the house of the Lord. It points us to our Savior, Jesus Christ. In temples, we participate in sacred ordinances and make covenants with Heavenly Father that bind us to Him and to our Savior. These covenants and ordinances prepare us to return to Heavenly Father’s presence and to be sealed together as families for eternity.
score: 0.016393441706895828, reranker: 2.510263442993164. 18.10.4. Who Performs the OrdinanceOrdinances and blessings are sacred acts performed by the authority of the priesthood and in the name of Jesus Christ. As priesthood holders perform ordinances and blessings, they follow the Savior’s example of blessing others.
score: 0.017401045188307762, reranker: 2.4976255893707275. 27.3.1. Who May Be Sealed in a TempleThe temple is the house of the Lord. It points us to our Savior, Jesus Christ. In temples, we participate in sacred ordinances and make cov
```