# Vector Search Evaluation

1. Generation of **ground truth** dataset. 

✅ This step was completed in the `01_evaluation_generation_ground_truth.ipynb` notebook.

  This can be done:
  * Manually, by annotators or domain experts
  * By collecting real user queries
  * By generating it with an LLM


In general, one query can have multiple relevant documents, but in this use case each query (user question) has exactly one relevant document (answer).

The dataset is generated automatically as follows:
1. For every user query (question), the LLM is prompted to generate 5 similar questions
2. Vector search is run with the LLM-generated questions as queries to find the relevant document in the knowledge base
3. During the test phase, we evaluate whether vector search retrieves the relevant document for these similar (generated) queries
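
The generation steps above yield one record per generated question, each pointing back to its source document. A minimal sketch of that record layout, assuming the LLM output has already been collected into a dict keyed by document id. The `generated` dict, the `build_ground_truth` helper, and the field names `question` and `document` are illustrative and not taken from the first notebook:

```python
def build_ground_truth(generated):
    """Flatten {doc_id: [questions]} into one record per generated question."""
    records = []
    for doc_id, questions in generated.items():
        for question in questions:
            records.append({"question": question, "document": doc_id})
    return records


# Hypothetical LLM output: document id -> generated questions
generated = {
    "doc-1": ["Can I enroll after the start date?", "Is late registration allowed?"],
    "doc-2": ["Do I need a Mac for this course?"],
}

ground_truth = build_ground_truth(generated)
print(len(ground_truth))
```

Keeping the document id next to each generated question is what later lets the evaluation check whether the expected document came back.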

-------------
2. Evaluation of **Text Retrieval** techniques.

✅ This step was completed in `02_evaluation_text_retrieval.ipynb`.

For every record in the ground truth dataset we will:
  1. Execute the query (perform text search in our vector database)
  2. Check whether the retrieved results contain the answer assigned to the original query (from which we generated our artificial queries)
  3. Calculate the metrics
      * **Hit Rate** (or Recall)
      * **MRR** (Mean Reciprocal Rank)


-------------
3. Evaluation of **Vector Retrieval** techniques.

For every record in the ground truth dataset we will:
  1. Execute the query (perform vector search in our vector database)
  2. Check whether the retrieved results contain the answer assigned to the original query (from which we generated our artificial queries)
  3. Calculate the metrics
      * **Hit Rate** (or Recall)
      * **MRR** (Mean Reciprocal Rank)

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
import json

with open('documents-with-ids.json', 'rt') as f_in:
    documents = json.load(f_in)

In [None]:
model_name = 'multi-qa-MiniLM-L6-cos-v1'
model = SentenceTransformer(model_name)

In [None]:
v = model.encode('I just discovered the course. Can I still join?')
len(v)

## 1. Indexing our KnowledgeBase data into ElasticSearch

### 1.1 Setup ElasticSearch connection

Running ES with Docker
```
docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```

In [None]:
from elasticsearch import Elasticsearch

In [None]:
!curl localhost:9200

In [None]:
es_client = Elasticsearch('http://localhost:9200')
es_client.info()

In [None]:
health = es_client.cluster.health()
print(health)

In [None]:
# Adjust high and low watermarks temporarily
es_client.cluster.put_settings(body={
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "95%",
        "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
    }
})

Using OpenAI API for generation task.

### 1.2 Indexing data

Building embeddings for question, answer and question+answer

In [None]:
from elasticsearch import Elasticsearch

es_client = Elasticsearch('http://localhost:9200')

index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"},
            "id": {"type": "keyword"},
            "question_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
            "text_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
            "question_text_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine"
            },
        }
    }
}

index_name = "course-questions"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)

In [None]:
from tqdm.auto import tqdm

In [None]:
for doc in tqdm(documents):
    question = doc['question']
    text = doc['text']
    qt = question + ' ' + text

    doc['question_vector'] = model.encode(question)
    doc['text_vector'] = model.encode(text)
    doc['question_text_vector'] = model.encode(qt)

In [None]:
for doc in tqdm(documents):
    es_client.index(index=index_name, document=doc)
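
Indexing one document per request works, but it issues a separate HTTP call for every record. One possible alternative is the bulk helper from the Elasticsearch Python client. The sketch below only builds the action dicts, so it runs without a cluster. The `to_bulk_actions` name is illustrative, and converting vectors with `.tolist()` is an assumption about how the embeddings are stored:

```python
def to_bulk_actions(documents, index_name):
    """Yield one bulk action per document, converting array-like values to lists."""
    for doc in documents:
        source = {
            key: (value.tolist() if hasattr(value, "tolist") else value)
            for key, value in doc.items()
        }
        yield {"_index": index_name, "_source": source}


# Toy document, only to show the shape of an action
toy_documents = [{"id": "doc-1", "question": "Can I still join?", "text": "Yes."}]
actions = list(to_bulk_actions(toy_documents, "course-questions"))
print(actions[0]["_index"])

# With a running cluster the actions could be sent in one call:
#   from elasticsearch import helpers
#   helpers.bulk(es_client, to_bulk_actions(documents, index_name))
```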

## Evaluation Metrics

Now we can iterate over our (synthetic) ground truth dataset, run each query, and check whether the expected document is in the results.
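
A minimal sketch of that loop, assuming each ground truth record has a `question` and a `document` (expected document id) field and that the search function returns a ranked list of dicts with an `id` key. The names `build_relevance` and `toy_search` are illustrative; in the notebook the search function would wrap an Elasticsearch query:

```python
def build_relevance(ground_truth, search_fn):
    """For every query, mark each returned result as relevant (True) or not (False)."""
    relevance = []
    for record in ground_truth:
        results = search_fn(record["question"])
        relevance.append([doc["id"] == record["document"] for doc in results])
    return relevance


def toy_search(query):
    # Stand-in for a real search call: always returns the same ranked results
    return [{"id": "a"}, {"id": "b"}, {"id": "c"}]


toy_ground_truth = [
    {"question": "first generated question", "document": "b"},
    {"question": "second generated question", "document": "z"},
]

relevance = build_relevance(toy_ground_truth, toy_search)
print(relevance)
```

Each inner list keeps the ranking order, so the position of the first `True` is the rank of the expected document.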

## Text search

## Hit Rate or Recall at k

Measures the proportion of queries for which at least one relevant document is retrieved in the top-k results.

HR@k = (number of queries with at least one relevant document in the top k) / |Q|, where |Q| is the total number of queries

## Mean Reciprocal Rank (MRR)

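Evaluates the rank position of the first relevant document: for each query, take 1 / rank of the first relevant result (0 if none is retrieved), then average over all queries.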
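
Both metrics can be computed from per-query relevance lists. A minimal sketch, assuming the results of each query have already been reduced to a ranked list of booleans (`True` where the expected document appears). The function names and the `example` data are illustrative:

```python
def hit_rate(relevance):
    """Share of queries with at least one relevant result."""
    hits = sum(1 for row in relevance if any(row))
    return hits / len(relevance)


def mrr(relevance):
    """Mean of 1 / rank of the first relevant result (0 when there is none)."""
    total = 0.0
    for row in relevance:
        for rank, is_relevant in enumerate(row, start=1):
            if is_relevant:
                total += 1 / rank
                break
    return total / len(relevance)


example = [
    [True, False, False],   # first relevant result at rank 1
    [False, False, True],   # first relevant result at rank 3
    [False, False, False],  # no relevant result
    [False, True, False],   # first relevant result at rank 2
]

print(hit_rate(example))  # 3 of 4 queries have a hit -> 0.75
print(mrr(example))
```

Hit rate ignores where the hit occurs, while MRR rewards hits near the top of the ranking, so the two are usually reported together.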


## Embeddings

Ranking with question, answer, question+answer embeddings

## 1. Embedding

An embedding is a vector representation of text (or other types of data). In this form, different ML/DL algorithms can be applied to it.
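
One operation that becomes possible on vectors is measuring how similar two texts are. A minimal sketch of cosine similarity in plain Python, using short toy vectors instead of real model output (the `cosine_similarity` helper is illustrative):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Vectors pointing in the same direction score close to 1
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
# Orthogonal vectors score 0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction, orthogonal)
```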

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
model = SentenceTransformer('all-mpnet-base-v2')

In [None]:
# dimensionality of our embedding
len(model.encode("This is a simple sentence"))

In [None]:
# create a dense vector using a pre-trained model for our answer (aka the `text` field)
opertaions = []

for doc in documents:
    doc['text_vector'] = model.encode(doc['text']).tolist()
    opertaions.append(doc)

In [None]:
opertaions[0]

## 3. Vector DataBase

Permits **efficient** storage of embeddings and **fast** similarity search over them

### 3.1 Setup ElasticSearch connection

Running ES with Docker
```
docker run -it \
    --rm \
    --name elasticsearch \
    -p 9200:9200 \
    -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "xpack.security.enabled=false" \
    docker.elastic.co/elasticsearch/elasticsearch:8.4.3
```

In [None]:
from elasticsearch import Elasticsearch

In [None]:
es_client = Elasticsearch('http://localhost:9200')
es_client.info()

In [None]:
health = es_client.cluster.health()
print(health)

In [None]:
# Adjust high and low watermarks temporarily
es_client.cluster.put_settings(body={
    "transient": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "95%",
        "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
    }
})

In [None]:
es_client.cluster.health()

In [None]:
#! pip install sentence_transformers==2.7.0

#### 3.1.1 Create Mappings and Index

* A mapping defines how we are going to store and index our data
* Each document is a collection of fields, each with its own data type
* A mapping is like a database schema

In [None]:
index_settings = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "section": {"type": "text"},
            "question": {"type": "text"},
            "course": {"type": "keyword"} ,
            "text_vector": {"type": "dense_vector",
                            "dims": 768,
                            "index": True,
                            "similarity": "cosine"},
        }
    }
}

In [None]:
index_name = "course-questions"

es_client.indices.delete(index=index_name, ignore_unavailable=True)
es_client.indices.create(index=index_name, body=index_settings)
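
The `dims` value of a `dense_vector` field has to match the length of the vectors the embedding model produces, otherwise indexing is expected to fail. A minimal sketch of a pre-indexing check. The `check_vector_dims` helper and the toy mapping are illustrative:

```python
def check_vector_dims(index_settings, field, vector):
    """Raise ValueError if the vector length differs from the mapping's `dims`."""
    expected = index_settings["mappings"]["properties"][field]["dims"]
    actual = len(vector)
    if actual != expected:
        raise ValueError(f"{field}: mapping expects {expected} dims, got {actual}")
    return actual


# Toy mapping and vector, only to demonstrate the check
toy_settings = {
    "mappings": {
        "properties": {
            "text_vector": {"type": "dense_vector", "dims": 4},
        }
    }
}

print(check_vector_dims(toy_settings, "text_vector", [0.1, 0.2, 0.3, 0.4]))
```

In this notebook the check could be called as `check_vector_dims(index_settings, "text_vector", model.encode("test"))` before any document is indexed.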

#### 3.1.2 Add documents into index


In [None]:
for doc in opertaions:
    try:
        es_client.index(index=index_name, document=doc)
    except Exception as e:
        print(e)

#### 3.1.3 Create end user query

In [None]:
search_term = "windows or mac?"
vector_search_term = model.encode(search_term)


#### 3.1.4 Perform Semantic Search 

In [None]:
query = {
    "field": "text_vector",
    "query_vector": vector_search_term,
    "k": 5,
    "num_candidates": 10000,
}

In [None]:
res = es_client.search(index=index_name, knn=query, source=["text", "section", "question", "course"])
res["hits"]["hits"]
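
The kNN query can also be restricted to a subset of documents. A minimal sketch that builds the same query dict with an optional `filter` on the `course` keyword field. This assumes the `knn` search option accepts a `filter` clause in the Elasticsearch version in use. The `build_knn_query` helper and the course value are hypothetical:

```python
def build_knn_query(field, query_vector, course=None, k=5, num_candidates=10000):
    """Build a kNN query dict, optionally restricted to one course."""
    vector = query_vector.tolist() if hasattr(query_vector, "tolist") else list(query_vector)
    knn = {
        "field": field,
        "query_vector": vector,
        "k": k,
        "num_candidates": num_candidates,
    }
    if course is not None:
        # Assumption: only documents matching this filter are considered
        knn["filter"] = {"term": {"course": course}}
    return knn


# Toy vector and a hypothetical course name
filtered_query = build_knn_query("text_vector", [0.1, 0.2, 0.3], course="example-course")
print(filtered_query["filter"])
```

The resulting dict would be passed the same way as before, for example `es_client.search(index=index_name, knn=filtered_query)`.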

#### 3.1.5 Perform Keyword search with Semantic Search (Hybrid/Advanced Search)


When using cosine similarity with normalized vectors, kNN scores fall between 0 and 1.

With hybrid search the score may differ, because the keyword (BM25) score is combined with the kNN score. Check the output and use the `explain` parameter to see how the score was calculated.
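
A minimal sketch of the arithmetic behind these scores, to the best of my understanding of the Elasticsearch documentation: for `cosine` similarity the kNN `_score` is `(1 + cosine) / 2`, and when `query` and `knn` are combined, a document matching both receives the (optionally boosted) sum of the two scores. The helper names are illustrative, and the `explain` output remains the authoritative source:

```python
def es_cosine_score(cosine):
    """Map a cosine value in [-1, 1] to a score in [0, 1]."""
    return (1 + cosine) / 2


def hybrid_score(query_score, knn_score, query_boost=1.0, knn_boost=1.0):
    """Combined score for a document matched by both the keyword and kNN parts."""
    return query_boost * query_score + knn_boost * knn_score


print(es_cosine_score(1.0))   # identical direction
print(es_cosine_score(0.0))   # orthogonal vectors
print(es_cosine_score(-1.0))  # opposite direction
```

Because keyword scores are not bounded by 1, a hybrid score above 1 is not by itself a sign of a problem.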

In [None]:
# Including "knn" in the search request runs a semantic search alongside the keyword query
knn_query = {
    "field": "text_vector",
    "query_vector": vector_search_term,
    "k": 5,
    "num_candidates": 10000
}

In [None]:
response = es_client.search(
    index=index_name,
    query={
        "match": {"section": "General course-related questions"},
    },
    knn=knn_query,
    size=5,
    explain=True  # shows how each score is calculated
)

In [None]:
response["hits"]["hits"][0]