# Search - from Data Source

The scripts in this notebook use API keys to authenticate their connections.

This notebook creates the following objects on your search service:

+ data source
+ search index
+ indexer

Once you've run all cells, indexing begins, but queries won't return results until the indexer has finished and the search index is loaded.

## Prerequisites

- [Azure Storage](https://learn.microsoft.com/azure/storage/common/storage-account-create)
   - Create a new container in your storage account. Make it identifiable to you.
   - Upload your data set (PDFs).
   - Update the .env file with BLOB_STORAGE_ACCOUNT_CONNECTION_STRING.

*A common data set to use is the NASA e-book. Upload the [PDFs from this folder](https://github.com/Azure-Samples/azure-search-sample-data/tree/main/nasa-e-book/earth_book_2019_text_pages).*

- [Azure AI Search](https://learn.microsoft.com/azure/search/search-create-service-portal) (you may have already created this service in a previous notebook)
  - Basic tier or higher is recommended.
  - Choose the same region as Azure OpenAI.
  - Enable semantic ranking.
  - Enable role-based access control.
  - Enable a system identity for Azure AI Search.
  - Update the .env file with AI_SEARCH_KEY (in the Azure portal, open the search service, then select Settings > Keys in the left menu).
  - Update the .env file with AI_SEARCH_ENDPOINT.
  
Make sure you know the name of the deployed models, and have the endpoints for all Azure resources at hand. You will provide this information in the steps that follow.  


In [None]:
%pip install python-dotenv
%pip install azure-search-documents==11.5.1



### Set the container name to the name of your newly created container

In [None]:
container_name = "nasa"
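The container name is reused later in the notebook as the prefix for the data source connection, index, and indexer names, so it can help to check it up front. The following is a minimal sketch, assuming the usual naming rules (lowercase letters and digits separated by single dashes); `derive_names` and the pattern are hypothetical helpers for illustration, not part of the Azure SDK, and the check does not enforce length limits.

```python
import re

# Assumed rule: lowercase letters and digits, separated by single dashes.
_NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def derive_names(container_name: str) -> dict:
    """Return the object names this notebook builds from the container name."""
    if not _NAME_PATTERN.match(container_name):
        raise ValueError(
            f"'{container_name}' should use lowercase letters, digits, and single dashes"
        )
    return {
        "data_source": container_name + "-connection",
        "index": container_name + "-index",
        "indexer": container_name + "-indexer",
    }


print(derive_names("nasa"))
```

Running a check like this before creating anything surfaces a bad name as a local error rather than as a failed service call.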

In [None]:
# Load credentials
from dotenv import load_dotenv
import os 
load_dotenv()

# Check the environment variables are set and assign them to variables.
AI_SEARCH_ENDPOINT = os.getenv('AI_SEARCH_ENDPOINT')
AI_SEARCH_KEY = os.getenv('AI_SEARCH_KEY')

BLOB_STORAGE_ACCOUNT_CONNECTION_STRING = os.getenv('BLOB_STORAGE_ACCOUNT_CONNECTION_STRING')


# Ensure all required environment variables are set
required_vars = {
    'AI_SEARCH_ENDPOINT': AI_SEARCH_ENDPOINT,
    'AI_SEARCH_KEY': AI_SEARCH_KEY,
    'BLOB_STORAGE_ACCOUNT_CONNECTION_STRING': BLOB_STORAGE_ACCOUNT_CONNECTION_STRING,
}
missing_vars = [name for name, value in required_vars.items() if not value]
if missing_vars:
    raise ValueError(f"Environment variables {', '.join(missing_vars)} must be set.")

# Print the environment variables
print(f"AI_SEARCH_ENDPOINT: {AI_SEARCH_ENDPOINT}")
print(f"AI_SEARCH_KEY: {AI_SEARCH_KEY}")
print(f"BLOB_STORAGE_ACCOUNT_CONNECTION_STRING: {BLOB_STORAGE_ACCOUNT_CONNECTION_STRING}")
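The cell above prints the API key and the connection string in full, which leaves secrets in the saved notebook output. If that is a concern, a masked summary is one alternative. This is only a sketch: `mask_secret` and `storage_account_name` are hypothetical helpers, and the example values are placeholders rather than real credentials.

```python
def mask_secret(value: str, visible: int = 4) -> str:
    """Show only the last `visible` characters of a secret."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


def storage_account_name(connection_string: str) -> str:
    """Extract AccountName from a 'Key=Value;Key=Value' connection string."""
    parts = dict(
        segment.split("=", 1)  # split on the first '=' only; keys may end in '='
        for segment in connection_string.split(";")
        if "=" in segment
    )
    return parts.get("AccountName", "<unknown>")


# Placeholder values for illustration only
example_key = "abcd1234efgh5678"
example_conn = (
    "DefaultEndpointsProtocol=https;AccountName=mystorage;"
    "AccountKey=Zm9vYmFy==;EndpointSuffix=core.windows.net"
)

print(f"AI_SEARCH_KEY: {mask_secret(example_key)}")
print(f"Storage account: {storage_account_name(example_conn)}")
```

In the notebook, the same helpers could be applied to `AI_SEARCH_KEY` and `BLOB_STORAGE_ACCOUNT_CONNECTION_STRING` instead of printing them directly.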



In [None]:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    SearchIndex
)

## Create a Data Source (Blob Container containing the PDFs)

Although only PDF files are used here, the same approach works at a much larger scale. Azure AI Search supports a range of other file formats, including Microsoft Office (DOCX/DOC, XLSX/XLS, PPTX/PPT, MSG), HTML, XML, ZIP, and plain text files (including JSON).
Azure AI Search supports the following sources: [Data Sources Gallery](https://learn.microsoft.com/EN-US/AZURE/search/search-data-sources-gallery)
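If a container holds a mix of these formats and only some should be indexed, the blob indexer configuration accepts an `indexedFileNameExtensions` setting (a comma-separated list of extensions). The sketch below only builds the configuration dictionary; `blob_indexer_configuration` is a hypothetical helper, and the other two settings are the ones used in the indexer cell later in this notebook.

```python
def blob_indexer_configuration(extensions=None) -> dict:
    """Build an indexer configuration, optionally limited to some file extensions."""
    config = {
        "dataToExtract": "contentAndMetadata",
        "parsingMode": "default",
    }
    if extensions:
        # Normalize each entry to the lowercase ".ext" form
        normalized = ["." + ext.lower().lstrip(".") for ext in extensions]
        config["indexedFileNameExtensions"] = ",".join(normalized)
    return config


print(blob_indexer_configuration(["pdf", ".DOCX"]))
```

The resulting dictionary could be passed as `IndexingParameters(configuration=...)` when the indexer is defined.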

In [None]:
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection
)

# Create a data source 
indexer_client = SearchIndexerClient(endpoint=AI_SEARCH_ENDPOINT, credential=AzureKeyCredential(AI_SEARCH_KEY))
container = SearchIndexerDataContainer(name=container_name)
data_source_connection = SearchIndexerDataSourceConnection(
    name=container_name+"-connection",
    type="azureblob",
    connection_string=BLOB_STORAGE_ACCOUNT_CONNECTION_STRING,
    container=container
)
data_source = indexer_client.create_or_update_data_source_connection(data_source_connection)

print(f"Data source '{data_source.name}' created or updated")

## Create Index

In [None]:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchField,
    SearchFieldDataType,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    AzureOpenAIVectorizer,
    AzureOpenAIVectorizerParameters,
    SearchIndex
)

AZURE_SEARCH_CREDENTIAL = AzureKeyCredential(AI_SEARCH_KEY)

# Create a search index  
index_name = container_name+"-index"
index_client = SearchIndexClient(endpoint=AI_SEARCH_ENDPOINT, credential=AZURE_SEARCH_CREDENTIAL)  
fields = [
    SearchField(name="title", type=SearchFieldDataType.String),
    SearchField(name="metadata_storage_path", type=SearchFieldDataType.String, key=True, sortable=True, filterable=True, facetable=True, analyzer_name="keyword"),  
    SearchField(name="content", type=SearchFieldDataType.String, sortable=False, filterable=False, facetable=False, analyzer_name="standard.lucene")
    ]  
  

# Create the search index
index = SearchIndex(name=index_name, fields=fields)  
result = index_client.create_or_update_index(index)  
print(f"{result.name} created")  

## Create indexer

In [None]:
from azure.search.documents.indexes.models import (
    SearchIndexer,
    FieldMapping,
    IndexingParameters
)

# Create an indexer 
indexer_name = container_name+"-indexer" 

indexer_parameters = IndexingParameters(
    configuration={
        "dataToExtract": "contentAndMetadata",
        "parsingMode": "default"
    }
)

indexer = SearchIndexer(  
    name=indexer_name,  
    description="Indexer to index documents from the blob container",
    target_index_name=index_name,  
    data_source_name=data_source.name,
    skillset_name=None,
    # Map the metadata_storage_name field to the title field in the index to display the PDF title in the search results  
    field_mappings=[
        FieldMapping(source_field_name="metadata_storage_name", target_field_name="title"),
        FieldMapping(
            source_field_name="metadata_storage_path",
            target_field_name="metadata_storage_path",
            mapping_function={"name": "base64Encode", "parameters": None}
        )
    ],
    parameters=indexer_parameters
)  

# Create and run the indexer  
indexer_client = SearchIndexerClient(endpoint=AI_SEARCH_ENDPOINT, credential=AZURE_SEARCH_CREDENTIAL) 

indexer_result = indexer_client.create_or_update_indexer(indexer)


print(f"{indexer_name} is created and running. Give the indexer a few minutes before running a query.")
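
Rather than waiting a fixed few minutes, the indexer status can be polled until the run finishes. The following is a minimal sketch: `wait_for_indexer` is a hypothetical helper that takes any callable returning a status string, and it is driven here by simulated statuses so that it runs without a search service. It assumes a run that is still going reports `"inProgress"`.

```python
import time


def wait_for_indexer(get_status, timeout_s=600.0, poll_s=10.0, sleep=time.sleep):
    """Poll get_status() until it reports something other than 'inProgress'."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        # None means no run has been recorded yet; keep waiting in that case
        if status is not None and status != "inProgress":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("Indexer did not finish within the timeout")
        sleep(poll_s)


# Simulated status source for illustration
_fake_statuses = iter([None, "inProgress", "inProgress", "success"])
final_status = wait_for_indexer(lambda: next(_fake_statuses), poll_s=0.0)
print(f"Indexer finished with status: {final_status}")

# Against the real service, the callable could wrap the SDK call, for example:
# wait_for_indexer(
#     lambda: getattr(indexer_client.get_indexer_status(indexer_name).last_result, "status", None)
# )
```

Passing the status source in as a callable keeps the waiting logic separate from the SDK, which makes it easy to try out locally.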

## Check results

In [None]:
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizableTextQuery
  

search_client = SearchClient(endpoint=AI_SEARCH_ENDPOINT, credential=AZURE_SEARCH_CREDENTIAL, index_name=index_name)

results = search_client.search(
    query_type="simple",
    search_text="argentina",
    select=["title", "content"],
    include_total_count=True,
)

# include_total_count=True makes the total number of matches available
print(f"Total matches: {results.get_count()}")

for result in results:
    print(f"Score: {result['@search.score']}")
    print(f"Title: {result['title']}")
    print(f"Content: {result['content']}")
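
The indexer above stores `metadata_storage_path` through the `base64Encode` mapping function, so if that field is added to `select`, the value comes back encoded rather than as a readable URL. The sketch below shows one way to decode it. It assumes the mapping ran with its default settings, where the result is URL-safe base64 with the padding replaced by a trailing digit that counts the removed `=` characters. `decode_document_key` is a hypothetical helper.

```python
import base64


def decode_document_key(encoded_key: str) -> str:
    """Decode a key assumed to come from base64Encode with default settings."""
    padding = int(encoded_key[-1])            # trailing digit = number of '=' removed
    body = encoded_key[:-1] + "=" * padding   # restore standard padding
    return base64.urlsafe_b64decode(body).decode("utf-8")


# Short sample value for illustration
print(decode_document_key("MDA-MDA_MDA1"))
```

If the mapping function is configured differently, for example with the padding-count behavior turned off, the trailing-digit step would not apply.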