# Semantic Search with Azure Cognitive Search

Aim:
1. Demonstrate how Azure Cognitive Search discerns and retrieves meaningful content from a document
2. Show where it improves on conventional keyword search in understanding natural-language queries

Both aims are framed around its use in our product, "Walter White".

Requirements:
1. [“Frederick Douglass,” written by Booker T. Washington](https://books.google.com/googlebooks/about/free_books.html/)
2. An Azure Subscription

Notes:
1. `admin_key`, `service_name`, `index_name`, and any other keys or credentials must be hidden before committing to GitHub
---

Team Cyber Wardens, VIT Pune
---


Installing dependencies

In [None]:
!pip install azure-search-documents pdfplumber

Setting up Azure Cognitive Search Service

In [None]:
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents import SearchClient

service_name = "**hidden**"
admin_key = "**hidden**"
index_name = "**hidden**"

endpoint = f"https://{service_name}.search.windows.net/"
credential = AzureKeyCredential(admin_key)
# SearchIndexClient manages indexes at the service level, so it takes no index_name
admin_client = SearchIndexClient(endpoint=endpoint, credential=credential)
# SearchClient uploads and queries documents in one specific index
search_client = SearchClient(endpoint=endpoint, index_name=index_name, credential=credential)
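
The notes at the top ask for credentials to be hidden before committing. One common way to do that is to keep them out of the notebook entirely and read them from environment variables. The sketch below is only an illustration: the helper `load_search_settings` and the variable names `AZURE_SEARCH_SERVICE_NAME`, `AZURE_SEARCH_ADMIN_KEY`, and `AZURE_SEARCH_INDEX_NAME` are assumptions, not names used by Azure or by this project.

```python
import os


def load_search_settings(env=os.environ):
    """Read the search settings from environment variables.

    The variable names below are hypothetical; pick whatever your team uses.
    """
    names = {
        "service_name": "AZURE_SEARCH_SERVICE_NAME",
        "admin_key": "AZURE_SEARCH_ADMIN_KEY",
        "index_name": "AZURE_SEARCH_INDEX_NAME",
    }
    # Fail early with a clear message instead of sending an empty key to the service
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


# Usage (commented out because it needs the variables to be set):
# settings = load_search_settings()
# service_name = settings["service_name"]
# admin_key = settings["admin_key"]
# index_name = settings["index_name"]
```

With this in place the three assignments in the setup cell no longer contain secrets, so nothing has to be masked by hand before a commit.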

Define the index schema and create the index.

In [None]:
from azure.search.documents.indexes.models import SearchIndex, SimpleField, SearchFieldDataType, SearchableField

fields = [
    SimpleField(name="id", type=SearchFieldDataType.String, key=True),
    SearchableField(name="title", type=SearchFieldDataType.String, sortable=True),
    SearchableField(name="content", type=SearchFieldDataType.String, analyzer_name="en.lucene"),
]
index = SearchIndex(name=index_name, fields=fields)
admin_client.create_index(index)

Download and Read the PDF Content

In [None]:
import pdfplumber
import requests

url = "https://raw.githubusercontent.com/fenago/datasets/main/books/Frederick_Douglass.pdf"
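response = requests.get(url, timeout=60)
response.raise_for_status()  # stop early if the download failed
filename = "Frederick_Douglass.pdf"

with open(filename, 'wb') as file:
    file.write(response.content)

with pdfplumber.open(filename) as pdf:
    # extract_text() may return None for a page without text, so fall back to ''
    # and join pages with a newline so words at page boundaries do not run together
    text = '\n'.join(page.extract_text() or '' for page in pdf.pages)
print(text[:500])  # print the first 500 characters of the book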

Upload Data to Index

In [None]:
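# upload_documents() applies the "upload" action itself, so the document lists only its fields
batch = [{"id": "1", "title": "Frederick Douglass", "content": text}]
results = search_client.upload_documents(batch)
print([r.succeeded for r in results])  # one success flag per uploaded document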
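
Because the whole book is uploaded as a single document, every matching query can only return that one document. A possible refinement, sketched below under that assumption, is to split the extracted text into shorter passages and upload one document per passage. The helpers `split_into_passages` and `build_documents` and the `max_chars` default are hypothetical and not part of the original notebook.

```python
def split_into_passages(text, max_chars=2000):
    """Group non-empty lines into passages of roughly max_chars characters.

    Lines are never split, so a single very long line can exceed max_chars.
    """
    passages, current = [], ""
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue
        # Start a new passage when adding this line would overflow the current one
        if current and len(current) + len(line) + 1 > max_chars:
            passages.append(current)
            current = line
        else:
            current = f"{current} {line}".strip()
    if current:
        passages.append(current)
    return passages


def build_documents(text, title, max_chars=2000):
    """Build one index document per passage, using string ids starting at "1"."""
    return [
        {"id": str(i), "title": title, "content": passage}
        for i, passage in enumerate(split_into_passages(text, max_chars), start=1)
    ]


# Usage (commented out because it needs the live search_client and the book text):
# documents = build_documents(text, "Frederick Douglass")
# search_client.upload_documents(documents)
```

The documents keep the same `id`, `title`, and `content` fields as the index schema, so no schema change is needed. The service limits one indexing request to 1000 documents, so a longer text would need to be uploaded in several batches.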

Perform Semantic Search

In [None]:
search_text = "freedom"
results = search_client.search(search_text=search_text, include_total_count=True)
for result in results:
    print(result)

In [None]:
search_text = "who is Frederick Douglass?"
results = search_client.search(search_text=search_text, include_total_count=True)

for result in results:
    print(f"ID: {result['id']}")
    print(f"Title: {result['title']}")
    print(f"Content: {result['content']}\n{'='*40}\n")
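
Printing the full `content` field dumps the entire book for every hit, which makes the output hard to read. A small formatting helper can show the relevance score and a short snippet instead. This is only a sketch: `format_hit` and `snippet_chars` are hypothetical names, and it assumes each result behaves like a dictionary that carries the `@search.score` key alongside the index fields.

```python
def format_hit(result, snippet_chars=200):
    """Return a one-line summary of a search result: score, id, title, snippet."""
    score = result.get("@search.score")
    content = result.get("content") or ""
    # Collapse newlines and repeated spaces before cutting the snippet
    snippet = " ".join(content.split())[:snippet_chars]
    return f"[{score}] {result.get('id')} | {result.get('title')}: {snippet}"


# Usage (commented out because it needs the live search_client):
# for result in search_client.search(search_text="freedom"):
#     print(format_hit(result))
```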

In [None]:
import json

search_text = "who is Frederick Douglass?"
results = search_client.search(search_text=search_text, include_total_count=True)

for result in results:
    print(json.dumps(result, indent=4))
    print('='*40)
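
The queries above use the default keyword query type, which ranks results by term matching. Semantic ranking is a separate opt-in: it has to be available on the search service, the index needs a semantic configuration, and each query has to request it. The sketch below covers only the query side and assumes `azure-search-documents` 11.4 or later. The helper `semantic_query_kwargs` and the configuration name `"default"` are hypothetical; the configuration must already exist on the index.

```python
def semantic_query_kwargs(search_text, configuration_name, top=3):
    """Build keyword arguments for SearchClient.search() that request semantic ranking."""
    return {
        "search_text": search_text,
        "query_type": "semantic",
        # Name of a semantic configuration that is assumed to exist on the index
        "semantic_configuration_name": configuration_name,
        "top": top,
    }


# Usage (commented out because it needs the live search_client and a configured index):
# kwargs = semantic_query_kwargs("who is Frederick Douglass?", "default")
# for result in search_client.search(**kwargs):
#     print(result["title"], result.get("@search.reranker_score"))
```

Comparing the order of results from this call with the keyword queries above would be one way to support the second aim of the notebook.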