# Search Solution
https://docs.microsoft.com/en-us/azure/search/search-get-started-python



## Creating an index

An index definition must include a name, a fields collection, and a key.

The *fields collection* defines the structure of a logical search document. It is used both when loading data and when returning results. 

Each field has a name, a type, and a set of attributes. The attributes determine how the field is used.

Within an index, one field **must** be of type `Edm.String` and it will be the *key* for document identity. 
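
As a minimal sketch of the rule above, assume the index definition is held as a plain dictionary before it is sent to the service. The helper name `find_key_field` and the sample index are illustrative only and are not part of any Azure SDK.

```python
def find_key_field(index_definition: dict) -> dict:
    """Return the single key field of an index definition, or raise ValueError."""
    key_fields = [f for f in index_definition["fields"] if f.get("key")]
    if len(key_fields) != 1:
        raise ValueError(f"expected exactly one key field, found {len(key_fields)}")
    key_field = key_fields[0]
    if key_field["type"] != "Edm.String":
        raise ValueError("the key field must be of type Edm.String")
    return key_field


# Hypothetical index with the three required parts: name, fields, key
sample_index = {
    "name": "sample-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "key": False},
    ],
}

print(find_key_field(sample_index)["name"])  # prints: id
```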

Attributes -> https://docs.microsoft.com/en-us/azure/search/search-what-is-an-index

CorsOptions - https://docs.microsoft.com/en-us/dotnet/api/microsoft.azure.search.models.corsoptions?view=azure-dotnet

Python Classes -> https://docs.microsoft.com/en-us/python/api/azure-search-documents/azure.search.documents.indexes.searchindexclient?view=azure-python

## Load documents


REST API calls 

Azure Cognitive Search - https://docs.microsoft.com/en-us/rest/api/searchservice/search-documents

# Demo Cognitive Search - Thesis

Blob to Cognitive Search -> https://docs.microsoft.com/en-us/azure/search/cognitive-search-tutorial-blob-python

In [1]:
import json
import requests
import os
import time
from pprint import pprint # pretty printer
from dotenv import load_dotenv

# Define the names for the data source, skillset, index and indexer
datasource_name = "cogsrch-py-datasource"
skillset_name = "cogsrch-py-skillset"
index_name = "cogsrch-py-index"
indexer_name = "cogsrch-py-indexer"

In [2]:
load_dotenv()

endpoint = os.getenv('SEARCH_ENDPOINT')
admin_key = os.getenv('SEARCH_ADMIN_API_KEY')

headers = {
    'Content-Type':'application/json',
    'api-key': admin_key
}

params = {
    'api-version':'2020-06-30'
}



### 0. Clear previous existing resources



In [5]:
# delete the data source, skillset, index and indexer left over from a previous run
# status code 204 on success, 404 if the resource did not exist

r = requests.delete(endpoint + "/datasources/" + datasource_name,
                    headers=headers, params=params)
print(r.status_code)

r = requests.delete(endpoint + "/skillsets/" + skillset_name,
                    headers=headers, params=params)
print(r.status_code)

r = requests.delete(endpoint + "/indexes/" + index_name,
                    headers=headers, params=params)
print(r.status_code)

r = requests.delete(endpoint + "/indexers/" + indexer_name,
                    headers=headers, params=params)
print(r.status_code)

204
404
404
404
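
The printed codes can be read as follows: 204 means the resource was deleted, 404 means there was nothing to delete. A small sketch of that check, assuming these are the only two outcomes that count as "cleared"; the helper name `is_cleared` is hypothetical.

```python
def is_cleared(status_code: int) -> bool:
    """True when the resource is gone after a DELETE request.

    204 means the resource was deleted; 404 means it did not exist to begin with.
    """
    return status_code in (204, 404)


# Status codes printed by the cell above
for code in (204, 404, 404, 404):
    print(code, "cleared" if is_cleared(code) else "unexpected")
```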


## Create the pipeline


### 1. Create a data source

A *data source object* provides the connection string to the Blob container containing the sample data files. 

In [6]:
datasource_connection_string = os.getenv('STORAGE_CONNECTION_STRING')
blob_container_name = os.getenv('CONTAINER_NAME')

datasource_payload = {
    "name": datasource_name,
    "description": "Article examples",
    "type" : "azureblob",
    "credentials": {
        "connectionString": datasource_connection_string
    },
    "container": {
        "name": blob_container_name
    }    
}

r = requests.put(endpoint + "/datasources/" + datasource_name, 
                data=json.dumps(datasource_payload), headers=headers, params=params)

print(r.status_code)


201


### 2. Create a skillset

A skillset is the set of enrichments that will be applied to the data.

Create a skillset -> https://docs.microsoft.com/en-us/azure/search/cognitive-search-defining-skillset

In [7]:
language_detection_skill = {
            "@odata.type": "#Microsoft.Skills.Text.LanguageDetectionSkill",
            "inputs": [
                {
                    "name": "text", 
                    "source": "/document/content"
                }
            ],
            "outputs": [
                {
                    "name": "languageCode",
                    "targetName": "languageCode"
                }
            ]
        } 

text_split_skill = {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "textSplitMode": "pages",
            "maximumPageLength": 4000,
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/content"
                },
                {
                    "name": "languageCode",
                    "source": "/document/languageCode"
                }
            ],
            "outputs": [
                {
                    "name": "textItems",
                    "targetName": "pages"
                }
            ]
        }

key_phrase_extraction_skill = {
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "context": "/document/pages/*",
            "inputs": [
                {
                    "name": "text", 
                    "source": "/document/pages/*"
                },
                {
                    "name": "languageCode", 
                    "source": "/document/languageCode"
                }
            ],
            "outputs": [
                {
                    "name": "keyPhrases",
                    "targetName": "keyPhrases"
                }
            ]
        }


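# The key phrase skill must be part of the list: the skillset description and
# the indexer's output field mappings both rely on its keyPhrases output
skill_list = [language_detection_skill, text_split_skill, key_phrase_extraction_skill]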

skillset_payload = {
    "name": skillset_name,
    "description": "Detect language, split text and extract key-phrases",
    "skills": skill_list    
}

r = requests.put(endpoint + "/skillsets/" + skillset_name,
                 data=json.dumps(skillset_payload), headers=headers, params=params)
                 
print(r.status_code)

201
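
Each skill writes its outputs into the enrichment tree under its `context`, which defaults to `/document` when omitted. The sketch below lists the paths a skill list would produce, which can help check that every `source` used by a later skill or by an indexer output field mapping actually exists. The helper `produced_paths` is illustrative and not part of the service API, and the two trimmed-down skills are examples only.

```python
def produced_paths(skills: list) -> set:
    """Collect the enrichment-tree paths written by a list of skill definitions."""
    paths = set()
    for skill in skills:
        context = skill.get("context", "/document")  # default context
        for output in skill["outputs"]:
            # targetName falls back to the output name when it is not given
            target = output.get("targetName", output["name"])
            paths.add(f"{context}/{target}")
    return paths


# Trimmed-down versions of two skills, for illustration only
example_skills = [
    {"outputs": [{"name": "textItems", "targetName": "pages"}]},
    {
        "context": "/document/pages/*",
        "outputs": [{"name": "keyPhrases", "targetName": "keyPhrases"}],
    },
]

print(sorted(produced_paths(example_skills)))
# prints: ['/document/pages', '/document/pages/*/keyPhrases']
```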


### Create custom skill 

A custom skill can be used to separate the text into sections.

Create custom skill https://docs.microsoft.com/en-us/azure/search/cognitive-search-custom-skill-python

About custom skills https://docs.microsoft.com/en-us/azure/search/cognitive-search-custom-skill-interface


### 3. Create an index

Define the index schema by specifying the fields to include in the searchable index and setting the search attributes for each field.

Indexes -> https://docs.microsoft.com/en-us/rest/api/searchservice/create-index

https://docs.microsoft.com/en-us/azure/search/search-howto-indexing-azure-blob-storage

In [8]:
def create_field(name: str, type: str, searchable=False, filterable=False, sortable=False, facetable=False, key=False) -> dict:
    # Attributes are Python booleans so that json.dumps emits JSON true/false
    field = {
            "name": name,
            "type": type,
            "key": key,
            "searchable": searchable,
            "filterable": filterable,
            "sortable": sortable,
            "facetable": facetable
        }
    
    return field

id_field = create_field(name="id", type="Edm.String", searchable=True, sortable=True, key=True)
content_field = create_field(name="content", type="Edm.String", searchable=True)
language_code_field = create_field(name="languageCode", type="Edm.String", searchable=True, filterable=True)
key_phrases_field = create_field(name="keyPhrases", type="Collection(Edm.String)", searchable=True)

# metadata
metadata_author_field = create_field(name="authors", type="Edm.String", searchable=True)
metadata_title_field = create_field(name="title", type="Edm.String", searchable=True, sortable=True)
metadata_creation_date_field = create_field(name="creationDate", type="Edm.String", sortable=True, filterable=True)

metadata_storage_name = create_field(name="storageName", type="Edm.String", searchable=True, sortable=True)
metadata_storage_size = create_field(name="storageSize", type="Edm.Int64", sortable=True, filterable=True)

fields_list = [id_field, content_field, language_code_field, key_phrases_field, metadata_author_field, metadata_title_field, metadata_creation_date_field, metadata_storage_name, metadata_storage_size]

In [9]:
index_payload = {
    "name": index_name,
    "fields": fields_list
}

r = requests.put(endpoint + "/indexes/" + index_name, 
                data=json.dumps(index_payload), headers=headers, params=params)

print(r.status_code)

201


### 4. Create an indexer

Indexers drive the pipeline. Creating an indexer puts the pipeline in motion.

The three components above (data source, skillset and index) are the inputs for the indexer.

How to map fields -> https://docs.microsoft.com/en-us/azure/search/search-indexer-field-mappings

In [10]:
indexer_payload = {
    "name": indexer_name,
    "description": None,
    "dataSourceName": datasource_name,
    "targetIndexName": index_name,
    "skillsetName": skillset_name,
    "fieldMappings": [
        {
            "sourceFieldName": "metadata_storage_path",
            "targetFieldName": "id",
            "mappingFunction":
            {"name": "base64Encode"}
        },
        {
            "sourceFieldName": "content",
            "targetFieldName": "content"
        }, 
        {
            "sourceFieldName": "metadata_storage_size",
            "targetFieldName": "storageSize"
        }, 
        {
            "sourceFieldName": "metadata_storage_name",
            "targetFieldName": "storageName"
        }, 
        {
            "sourceFieldName": "metadata_title", 
            "targetFieldName": "title"
        }, 
        {
            "sourceFieldName": "metadata_creation_date",
            "targetFieldName": "creationDate"   # does not always work
        },
        {
            "sourceFieldName": "metadata_author",
            "targetFieldName": "authors"        # does not work well; it depends on how the authors are separated in the metadata
        }
    ],
    "outputFieldMappings":
    [
        {
            "sourceFieldName": "/document/pages/*/keyPhrases/*",
            "targetFieldName": "keyPhrases"
        },
        {
            "sourceFieldName": "/document/languageCode",
            "targetFieldName": "languageCode"
        }
    ],
    "parameters":
    {
        "maxFailedItems": -1, #ignore errors during data import
        "maxFailedItemsPerBatch": -1,
        "configuration":
        {
            "dataToExtract": "contentAndMetadata"
        }
    }
}

r = requests.put(endpoint + "/indexers/" + indexer_name,
                 data=json.dumps(indexer_payload), headers=headers, params=params)
print(r.status_code)


201
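
The `base64Encode` mapping function in the payload above is there because blob storage paths contain characters such as `/` and `:` that are not valid in a document key. A rough local equivalent is sketched below, assuming plain URL-safe Base64 over the UTF-8 bytes. The service's exact padding convention may differ, so this is for illustration rather than for reproducing keys byte for byte. The blob URL is made up.

```python
import base64


def encode_key(storage_path: str) -> str:
    """URL-safe Base64 encoding of a storage path (illustrative only)."""
    return base64.urlsafe_b64encode(storage_path.encode("utf-8")).decode("ascii")


def decode_key(key: str) -> str:
    """Reverse of encode_key."""
    return base64.urlsafe_b64decode(key.encode("ascii")).decode("utf-8")


# Hypothetical blob URL
path = "https://example.blob.core.windows.net/articles/paper-01.pdf"
key = encode_key(path)
print(key)
print(decode_key(key) == path)  # prints: True
```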


In [10]:
# Get indexer status
#r = requests.get(endpoint + "/indexers/" + indexer_name +
#                 "/status", headers=headers, params=params)
#pprint(json.dumps(r.json(), indent=1))
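
Instead of waiting a fixed number of seconds, the status call above can be polled until the indexer run has finished. The sketch below assumes that the status response carries a `lastResult` object whose `status` is `"inProgress"` while a run is active. The function name `wait_for_indexer` and its parameters are hypothetical, and the request itself is passed in as a callable so the example runs without a search service.

```python
import time


def wait_for_indexer(fetch_status, timeout_s=120.0, interval_s=5.0):
    """Poll until the last indexer run leaves the 'inProgress' state.

    fetch_status is any callable returning the parsed JSON of the status call.
    Returns the final status string, or None if the timeout is reached.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        last_result = fetch_status().get("lastResult") or {}
        status = last_result.get("status")
        if status is not None and status != "inProgress":
            return status
        time.sleep(interval_s)
    return None


# Stand-in for the real request, so the sketch runs without a search service
fake_responses = iter([
    {"lastResult": None},
    {"lastResult": {"status": "inProgress"}},
    {"lastResult": {"status": "success"}},
])
print(wait_for_indexer(lambda: next(fake_responses), timeout_s=5, interval_s=0))
# prints: success
```

In the notebook, `fetch_status` could be a lambda that wraps the commented-out `requests.get(...)` call and returns `r.json()`.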

### 5. Search





Not strictly needed for now: how to page results -> https://docs.microsoft.com/en-us/azure/search/search-pagination-page-layout
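
Should paging become necessary, the query parameters for one page could be built as in the sketch below. It assumes the `$top` and `$skip` query parameters of the Search Documents REST API; the helper name `page_params` and the zero-based page numbering are choices made for this example.

```python
def page_params(page: int, page_size: int = 10) -> dict:
    """Query parameters for one page of results (pages are numbered from 0)."""
    return {
        "search": "*",
        "$count": "true",
        "$top": page_size,          # number of results to return
        "$skip": page * page_size,  # number of results to skip
    }


print(page_params(page=2, page_size=10))
# prints: {'search': '*', '$count': 'true', '$top': 10, '$skip': 20}
```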

In [11]:
# Verification step: get the index definition of all fields

#r = requests.get(endpoint + "/indexes/" + index_name, headers=headers, params=params)
#pprint(json.dumps(r.json(), indent=1))

In [11]:
# Query the index to return the contents of a single field -> content
# Give the indexer some time to process the documents first
time.sleep(10)

r = requests.get(endpoint + "/indexes/" + index_name + "/docs?search=*&$count=true&$select=content", headers=headers, params=params)

result_json = r.json()

#result = json.dumps(result_json, indent=1)

#pprint(result)
# "@odata.count" is the total number of matches, which can exceed the number of
# documents returned in this response, so count the returned documents instead
article_list = list(result_json["value"])
document_count = len(article_list)
    


In [12]:
print(article_list[0]["content"])


O R I G I N A L A R T I C L E

Self-reported olfactory loss associates with outpatient clinical course
in COVID-19

Carol H. Yan, MD1 , Farhoud Faraji, MD, PhD1, Divya P. Prajapati, BS1,2, Benjamin T. Ostrander, MD1 and
Adam S. DeConde, MD1

Background: Rapid spread of the severe acute respiratory
syndrome-coronavirus-2 (SARS-CoV-2) virus has left many
health systems around the world overwhelmed, forcing
triaging of scarce medical resources. Identifying indicators
of hospital admission for coronavirus disease 2019 (COVID-
19) patients early in the disease course could aid the ef-
ficient allocation of medical interventions. Self-reported
olfactory impairment has recently been recognized as a
hallmark of COVID-19 and may be an important predictor
of clinical outcome.

Methods: A retrospective review of all patients pre-
senting to a San Diego Hospital system with laboratory-
confirmed positive COVID-19 infection was conducted with
evaluation of olfactory and gustatory function and clini

## Extract sections

In [16]:
def extract_section(article_list, section_to_be_extracted, article_index):

   # select the requested article
   article = article_list[article_index]

   # copy the article and add the section name so both can be sent in one request
   # (copying avoids modifying the entry stored in article_list)
   payload = {**article, "section": section_to_be_extracted}
   req = json.dumps(payload, indent=1)

   #print(article)

   section_extraction_url = os.getenv('SECTION_EXTRACTION_FUNCTION')
   #section_extraction_url = "http://localhost:7071/api/section_identifier"

   res = requests.get(section_extraction_url, data=req)

   print(res.status_code)

   if section_to_be_extracted == "All":
      section_to_be_extracted = "Introduction"

   section_lowerletter = section_to_be_extracted.lower()
   section = res.json()[section_lowerletter][0]
   print(section)


   #save file for further use without having to run the services again
   #save_file('texts/article_' + str(article_index), section_lowerletter + '.txt', section)

   return section
   
def save_file(dir_path, filename, file_content):
   os.makedirs(dir_path, exist_ok=True)
   with open(os.path.join(dir_path, filename), 'w', encoding="utf-8") as f:
      f.write(file_content)

In [17]:
# VARIABLES
# Which section to extract (capitalize the first letter).
# Supported values: Introduction, Methods, Results, Discussion, Conclusion
# Check the section_identifier function for variations of the names (don't change them here)
section_to_be_extracted = "Conclusion" 

# which article to extract from
#article_index = 0
#section = extract_section(article_list, section_to_be_extracted, article_index)

conclusion_list = []

# not necessarily needed to create the conclusion list
for i in range(document_count):
   
   conclusion_list.append(extract_section(article_list, section_to_be_extracted, i))
   print("\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n")



200
The current study has identified a strong inverse association
between COVID-19–related anosmia and a critical branch
point in management of COVID-19: the decision to com-
mit to hospital admission. Patients admitted for COVID-19
were 10 times less likely to report anosmia. These find-
ings have important immediate practical applications to
the lay public as well as healthcare workers and health-
care systems looking to efficiently risk-stratify patients to
efficiently provide appropriate medical and nonmedical in-
terventions. The association between olfactory dysfunction
and clinical outcomes also carries important implications
for future investigations seeking to understand the abil-
ity of SARS-CoV-2 virus to overwhelm the host immune
response.



~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


200

According to our results, the prevalence of gustatory and
olfactory dysfunction in COVID-19 patients was almo

In [18]:
print(conclusion_list[2])


Although there have been few reports of definite change in 
smell and taste perception in patients with COVID-19, the 
findings of these studies raise the question whether SARS-
CoV-2 may cause olfactory and gustatory dysfunctions [2, 
13, 21, 22]. The results of our study support recent reports 
that SARS-CoV-2 may infect oral and nasal tissues and 
cause olfactory and gustatory dysfunctions. These findings 
may aid future research on the diagnosis, prevention and 
treatment of COVID-19 consequences.



In [None]:
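# DEBUGGING PURPOSES: extract a section directly with a regular expression
import re
#text = "sdfsfds\n\nConclusion\n\nRTDHDDGDE\nGRTHNBTRDHNT\n\nResult and ethics\n\nskhfgksdfhsdfhku\n\ngfbfgfgfr\n\n.\n\nDiscussion\n\nJingleBells"

# `article` only exists inside extract_section, so read from article_list here
text = article_list[0]["content"]

sec_name = "Conclusion"
edge = "References"

# Match from the section heading up to the edge heading, each on its own line
regex_terms = r'\n' + re.escape(sec_name) + r'\n(.|\n)+\n' + re.escape(edge) + r'\n'

# re.search returns None when nothing matches, so guard before indexing
match = re.search(regex_terms, text)
x = match[0] if match else ""

# Remove the section headings (real newlines, not literal backslash-n sequences)
y = x.replace("\n\nConclusion\n\n", "")
z = y.replace("\n\nDiscussion\n\n", "")
w = z.replace("\n\nResult\n\n", "")
print(w)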
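
The same idea can be wrapped in a small function and tried without the search service. The sketch below uses a non-greedy match with `re.DOTALL`, which is a variation on the pattern above and not necessarily the expression used by the `section_identifier` function. The function name `extract_between` and the sample text are made up for illustration.

```python
import re


def extract_between(text: str, sec_name: str, edge: str) -> str:
    """Return the text between a section heading and the next edge heading.

    Both headings are expected on their own line. Returns "" when not found.
    """
    pattern = r"\n" + re.escape(sec_name) + r"\n(.+?)\n" + re.escape(edge) + r"\n"
    match = re.search(pattern, text, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


# Made-up article text for illustration
sample = "Intro text\n\nConclusion\n\nFirst line\nSecond line\n\nReferences\n\n[1] Someone"
print(repr(extract_between(sample, "Conclusion", "References")))
# prints: 'First line\nSecond line'
```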

