# Search for duplicate documents to identify low-value content

This notebook demonstrates how to use vector search to find duplicate documents.
Before running it, create the index and index the documents with the notebook [create_index_and_index_documents](../../4.-search-and-retrieval/4.1.-create-index-and-index-documents/create_index_and_index_documents.ipynb).

The output lists, for each document, any other indexed document whose semantic similarity reaches the configured threshold (97% in this example).
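
To make "semantic similarity" concrete, here is a minimal sketch of cosine similarity between two vectors. It assumes the index compares embeddings with the cosine metric, which this notebook does not state. The function name `cosine_similarity` and the toy vectors are illustrative only; real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings (assumption)
doc_a = [1.0, 2.0, 3.0]
doc_b = [2.0, 4.0, 6.0]   # same direction as doc_a, so similarity is close to 1
doc_c = [3.0, -1.5, 0.0]  # orthogonal to doc_a, so similarity is close to 0

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Two documents with near-identical wording produce embeddings that point in almost the same direction, which is why a high threshold singles out likely duplicates.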

## Prerequisites

+ An Azure subscription, with [access to Azure OpenAI](https://aka.ms/oai/access).
+ An Azure OpenAI service with the service name and an API key.
+ A deployment of the text-embedding-ada-002 embedding model on the Azure OpenAI Service.
+ An Azure AI Search service with the endpoint, API key, and index name.

We used Python 3.12.5, [Visual Studio Code with the Python extension](https://code.visualstudio.com/docs/python/python-tutorial), and the [Jupyter extension](https://marketplace.visualstudio.com/items?itemName=ms-toolsai.jupyter) to test this example.
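
The notebook reads its service settings from environment variables (loaded from a `.env` file). As a rough sketch, the check below reports which of those variables are still missing. The variable names are the ones the notebook reads; the helper `missing_settings` and the constant `REQUIRED_SETTINGS` are hypothetical names introduced for illustration, and the sketch assumes the `.env` values have already been loaded into the environment.

```python
import os

# Variable names read by this notebook's configuration cell
REQUIRED_SETTINGS = [
    "SEARCH_SERVICE_ENDPOINT",
    "SEARCH_SERVICE_QUERY_KEY",
    "SEARCH_INDEX_NAME",
    "AZURE_OPENAI_EMBEDDING_ENDPOINT",
    "AZURE_OPENAI_EMBEDDING_API_KEY",
    "AZURE_OPENAI_EMBEDDING_NAME_ADA",
]


def missing_settings(env=None) -> list[str]:
    """Return the required names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]


missing = missing_settings()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Running a check like this first gives a readable message instead of a `KeyError` when a setting is absent.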

### Set up a Python virtual environment in Visual Studio Code

1. Open the Command Palette (Ctrl+Shift+P).
1. Search for **Python: Create Environment**.
1. Select **Venv**.
1. Select a Python interpreter. Choose 3.10 or later.

It can take a minute to set up. If you run into problems, see [Python environments in VS Code](https://code.visualstudio.com/docs/python/environments).

### Install packages

In [None]:
! pip install openai
! pip install azure-search-documents
! pip install python-dotenv

## Import packages and create AOAI client

In [1]:
import os
import time
from dotenv import load_dotenv
import json
from openai import AzureOpenAI
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery, SearchScoreThreshold
import sys
sys.path.append('../..')
from rag_utils import cut_max_tokens

# Load environment variables from .env
load_dotenv(override=True)

# AZURE AI SEARCH
ai_search_endpoint = os.environ["SEARCH_SERVICE_ENDPOINT"]
ai_search_apikey = os.environ["SEARCH_SERVICE_QUERY_KEY"]
ai_search_index_name = os.environ["SEARCH_INDEX_NAME"]
ai_search_credential = AzureKeyCredential(ai_search_apikey)
# Create AI Search client 
ai_search_client = SearchClient(endpoint=ai_search_endpoint, index_name=ai_search_index_name, credential=ai_search_credential)

# AZURE OPENAI FOR EMBEDDING
aoai_embedding_endpoint = os.environ["AZURE_OPENAI_EMBEDDING_ENDPOINT"]
azure_openai_embedding_key = os.environ["AZURE_OPENAI_EMBEDDING_API_KEY"]
embedding_model_name = os.environ["AZURE_OPENAI_EMBEDDING_NAME_ADA"]
# Create AOAI client for embedding creation (ADA)
aoai_api_version = '2024-02-15-preview'
aoai_embedding_client = AzureOpenAI(
    azure_deployment=embedding_model_name,
    api_version=aoai_api_version,
    azure_endpoint=aoai_embedding_endpoint,
    api_key=azure_openai_embedding_key
)


## Vector search of documents to identify duplicates
**NOTE:** First create the index and upload the documents with `create_index_and_index_documents.ipynb`.
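
The threshold used in this section is applied to the search score, which is not necessarily the raw cosine similarity. The sketch below converts between the two under two assumptions that this notebook does not confirm: that the vector field uses the cosine metric, and that the score is computed as `1 / (1 + (1 - cosine))`, as described in the Azure AI Search relevance documentation. The function names are hypothetical.

```python
def cosine_to_score(cosine: float) -> float:
    """Map a cosine similarity in [-1, 1] to a score (assumed formula)."""
    if not -1.0 <= cosine <= 1.0:
        raise ValueError("cosine must be in [-1, 1]")
    return 1.0 / (2.0 - cosine)


def score_to_cosine(score: float) -> float:
    """Invert the assumed formula: cosine = 2 - 1 / score."""
    if not 1.0 / 3.0 <= score <= 1.0:
        raise ValueError("score must be in [1/3, 1] for the cosine metric")
    return 2.0 - 1.0 / score


threshold = 0.97
print(f"score {threshold} ~ cosine {score_to_cosine(threshold):.4f}")
print(f"cosine {threshold} ~ score {cosine_to_score(threshold):.4f}")
```

Under these assumptions, a score threshold of 0.97 corresponds to a cosine similarity of roughly 0.969, so the two scales are close near 1 but diverge for lower values.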

In [None]:
def vector_search(query: str, threshold: float):
    """Return (results, count) for the documents closest to `query`, or (None, 0) on error."""
    # Embed the query text with the Azure OpenAI embedding deployment
    embedding = aoai_embedding_client.embeddings.create(input=query, model=embedding_model_name).data[0].embedding
    # Ask for the 2 nearest neighbors: typically the document itself plus its closest match
    vector_query = VectorizedQuery(
        vector=embedding,
        k_nearest_neighbors=2,
        fields="embeddingContent",
        #exhaustive=True,
        threshold=SearchScoreThreshold(value=threshold))  # discard hits scoring below the threshold

    try:
        results = ai_search_client.search(
            search_text=None,
            vector_queries=[vector_query],
            select="id, title, content",
            include_total_count=True,
            top=2
            )
        return results, results.get_count()

    except Exception as ex:
        print(ex)
        return None, 0

# Read the file with every JSON record
fileinput = 'docs_duplicates.json'
print(f'Loading file {fileinput}...')
with open(fileinput, encoding='utf-8') as file:
    data = json.load(file)

# Search every record by content to find duplicates
for i, reg in enumerate(data):
    print(f'[{i + 1}]: id {reg["id"]}, title: {reg["title"]}')
    content = reg["content"]

    # Vector search on the "content" field with a score threshold of 0.97
    results, count = vector_search(cut_max_tokens(content), threshold=0.97)
    if results is not None:
        if count > 1:
            print(f"\tnum results: {count}")
            for result in results:
                # Skip the hit where the document matched itself
                if str(reg['id']).strip() != result['id'].strip():
                    print(f"\t*** DUPLICATE DOCUMENT *** id {reg['id']}, title: {reg['title']} --> id: {result['id']}, title: {result['title']}")
    # Pause between requests to avoid hammering the services
    time.sleep(1)
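
Because every document is searched in turn, a duplicate pair can be reported twice (A matches B, then B matches A). The following is a minimal sketch of collapsing such hits into unordered pairs. The helper `unique_duplicate_pairs` and the sample `hits` are hypothetical; in the loop above, each `(reg['id'], result['id'])` hit could be appended to a list instead of only being printed.

```python
def unique_duplicate_pairs(matches):
    """Collapse directed (queried id, returned id) hits into unordered pairs."""
    pairs = set()
    for source_id, match_id in matches:
        a, b = str(source_id).strip(), str(match_id).strip()
        if a != b:  # ignore a document matching itself
            pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)


# Hypothetical hits collected as (queried id, returned id) tuples
hits = [("1", "7"), ("7", "1"), ("3", "3"), ("2", "9")]
for first, second in unique_duplicate_pairs(hits):
    print(f"{first} <-> {second}")
```

Sorting each pair before adding it to the set is what makes `("1", "7")` and `("7", "1")` count as the same duplicate.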
