## Introduction

Enterprise knowledge bases are now common and can hold a large number of documents across many categories. Retrieving relevant content for a user query is a challenging task. Ranking methods such as PageRank, developed and refined most notably by Google, made it possible to retrieve information accurately at the document level, but the user still has to open the document and search it for the relevant passage. Recent advances in foundation models, such as those developed by OpenAI, ease this problem: "semantic search" methods use encoded representations of text, known as "embeddings", to locate the relevant passage, and the model can then summarize that content for the user in a concise manner.

This notebook introduces the use case and walks you through using Azure Cognitive Search to retrieve relevant documents, then using GPT-3 to locate the relevant part of a document and summarize it according to the prompt (the instruction given to the model). It aims to demonstrate how to adapt Azure OpenAI's GPT-3 capabilities to your summarization case, and how to set up and evaluate summarization results. The method is customizable and can be applied to many different datasets.

## Use Case

This use case consists of three sections:
- Document search
- Document Zone search
- Text summarization

Document search is the process of retrieving the relevant document from a corpus of documents based on the query.
Document zone search is the process of finding the relevant part of the document returned by document search.
Text summarization is the process of creating summaries from large volumes of data while maintaining significant informational elements and content value.
This use case can help subject matter experts find relevant information in a large document corpus.
Example: In the drug discovery process, scientists in the pharmaceutical industry read a corpus of documents to find specific information related to concepts, experiment results, and so on. This use case lets them ask questions of the document corpus and receive a succinct answer, which expedites the drug discovery process.
 
Benefits of the solution:
1. Shortens reading time
2. Improves the effectiveness of searching for information
3. Removes bias from human summarization techniques
4. Increases bandwidth for humans to focus on more in-depth analysis 


Document summarization can be applied to any subject matter (legal, financial, journalistic, medical, academic, etc.) that involves long documents. This notebook focuses on journalistic content: we will walk through news articles. For more domain-specific topics, fine-tuning the GPT-3 model is likely to work better than relying on zero-shot or few-shot prompting alone.


## CNN/Daily Mail dataset
For this walkthrough, we will use the CNN/Daily Mail dataset, a common dataset for text summarization and question answering tasks. It pairs news stories from the CNN and Daily Mail websites with human-written abstractive summary bullets.


## Data Description
The relevant schema for our work today consists of:

- id: a string containing the hexadecimal-formatted SHA1 hash of the URL where the story was retrieved from
- article: a string containing the body of the news article
- highlights: a string containing the highlight of the article as written by the article author
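
To make the `id` format concrete, the sketch below computes a hexadecimal SHA1 digest of a URL with Python's standard library. The helper name `story_id` and the example URL are hypothetical, and the exact URL normalization used when the dataset was built is not stated here, so the output illustrates the shape of the identifier rather than reproducing a real one.

```python
import hashlib

def story_id(url: str) -> str:
    """Return the hexadecimal SHA1 digest of a URL (the format described for the `id` column)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()

# Hypothetical URL, used only to show what such an identifier looks like
example_id = story_id("https://example.com/news/story.html")
print(len(example_id), example_id)
```

A SHA1 digest is 160 bits, so every `id` in this format is a 40-character string of hexadecimal digits.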


## Import Python modules

In [1]:
import pandas as pd
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient 
from azure.search.documents import SearchClient
import openai
import re
import requests
import sys
from num2words import num2words
import os
import numpy as np
from openai.embeddings_utils import get_embedding, cosine_similarity
from transformers import GPT2TokenizerFast

In [1]:
# read the CNN dailymail dataset in pandas dataframe
df = pd.read_csv('') #path to CNN daily mail dataset
df.head()

## Section 1: Leveraging Cognitive Search to extract the relevant article based on the query

## Creating a Cognitive Search index using the CNN/Daily Mail dataset
<img src="images/AzureCogSearchIndex.png" alt="Alternative text" />

In [3]:
# Setting up the Azure Cognitive Search client
service_name = "" # Cognitive Search Service Name
admin_key = "" # Cognitive Search Admin Key
index_name = "" # Cognitive Search index name

# Create an SDK client
# The endpoint is derived from the search service name, and the admin key is
# wrapped in AzureKeyCredential, which is the credential type SearchClient expects.
endpoint = f"https://{service_name}.search.windows.net"

search_client = SearchClient(endpoint=endpoint,
                      index_name=index_name,
                      api_version="2021-04-30-Preview",
                      credential=AzureKeyCredential(admin_key))

In [4]:
#Extracting relevant article based on the query. eg: Clinton Democratic Nomination
results = search_client.search(search_text="Clinton Democratic nomination", include_total_count=True)
document = next(results)['article']

In [2]:
document

In [3]:
#length of article extracted from Azure Cognitive search
len(document) 

## Section 2: Document Zone Search
Document Zone: Azure OpenAI Embedding API
Now that we have narrowed the search to a single document from our knowledge base using Azure Cognitive Search, we can dive deeper into that document and refine our initial query to a more specific section or "zone" of the article.

To do this, we will use the Azure OpenAI Embeddings API.

## Embeddings Overview
An embedding is a special format of data representation that can be easily utilized by machine learning models and algorithms. The embedding is an information dense representation of the semantic meaning of a piece of text. Each embedding is a vector of floating-point numbers, such that the distance between two embeddings in the vector space is correlated with semantic similarity between two inputs in the original format. For example, if two texts are similar, then their vector representations should also be similar.

Different Azure OpenAI embedding models are specifically created to be good at a particular task. Similarity embeddings are good at capturing semantic similarity between two or more pieces of text. Text search embeddings help measure whether long documents are relevant to a short query. Code search embeddings are useful for embedding code snippets and natural language search queries.

Embeddings make it easier to do machine learning on large inputs representing words by capturing the semantic similarities in a vector space. Therefore, we can use embeddings to determine whether two text chunks are semantically related or similar, and to obtain a score that quantifies that similarity.

## Cosine Similarity
An earlier approach to matching similar documents counted the number of words the documents had in common. This is flawed because, as document size increases, the number of shared words tends to increase even when the topics differ. Cosine similarity is therefore a better approach.

Mathematically, cosine similarity measures the cosine of the angle between two vectors projected in a multi-dimensional space. This is beneficial because if two documents are far apart by Euclidean distance because of size, they could still have a smaller angle between them and therefore higher cosine similarity.

The Azure OpenAI embeddings rely on cosine similarity to compute similarity between documents and a query.
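
The contrast between Euclidean distance and cosine similarity can be illustrated with a few toy vectors. The sketch below uses the standard definition, the dot product of two vectors divided by the product of their lengths; the helper `cosine_sim` and the vectors are made up for illustration and are not embeddings returned by the service.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (hypothetical): long_doc points in the same direction as
# short_doc but has a much larger magnitude, as a longer document might.
short_doc = np.array([1.0, 2.0, 0.0])
long_doc = 10 * short_doc
other_doc = np.array([0.0, 1.0, 3.0])

euclid = float(np.linalg.norm(short_doc - long_doc))
print("Euclidean distance short vs long:", euclid)
print("Cosine similarity short vs long:", cosine_sim(short_doc, long_doc))
print("Cosine similarity short vs other:", cosine_sim(short_doc, other_doc))
```

The first pair is far apart by Euclidean distance yet has a cosine similarity of essentially 1, while the pair pointing in different directions scores much lower.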

## Name of models deployed in Azure OpenAI 
<img src="images/Model deployment names.png" alt="Alternative text" />

## Setting up Azure OpenAI service and using deployed models

In [7]:
API_KEY = "" # SET YOUR OWN API KEY HERE
RESOURCE_ENDPOINT = "" # SET A LINK TO YOUR RESOURCE ENDPOINT
openai.api_type = "azure"
openai.api_key = API_KEY
openai.api_base = RESOURCE_ENDPOINT
openai.api_version = "2022-06-01-preview" # Azure OpenAI API version

TEXT_SEARCH_DOC_EMBEDDING_ENGINE = '' # Model deployment name - mentioned in the above screenshot 
TEXT_SEARCH_QUERY_EMBEDDING_ENGINE = '' # Model deployment name - mentioned in the above screenshot
TEXT_DAVINCI_001 = "" # Model deployment name - mentioned in the above screenshot

In [8]:
# Defining helper functions
# Splits text on sentence boundaries (a period followed by a space) and combines n sentences per chunk.
def splitter(n, s):
    pieces = s.split(". ")
    # Rejoin with ". " so the periods consumed by split() are restored inside each chunk
    list_out = [". ".join(pieces[i:i+n]) for i in range(0, len(pieces), n)]
    return list_out

# Perform light data cleaning (removing redundant whitespace and cleaning up punctuation)
def normalize_text(s, sep_token = " \n "):
    # collapse every run of whitespace (including newlines) into a single space
    s = re.sub(r'\s+',  ' ', s).strip()
    # remove stray ". ," sequences; the dot is escaped so it matches a literal period
    s = re.sub(r"\. ,","",s)
    # collapse doubled periods
    s = s.replace("..",".")
    s = s.replace(". .",".")
    s = s.replace("\n", "")
    s = s.strip()
    
    return s

In [9]:
document_chunks = splitter(10, normalize_text(document)) #splitting extracted document into chunks of 10 sentences

In [4]:
document_chunks

In [11]:
embed_df = pd.DataFrame(document_chunks, columns = ["chunks"]) # dataframe with document chunks


In [12]:
#Create an embedding vector for each chunk that will capture the semantic meaning and overall topic of that chunk
embed_df['curie_search'] = embed_df["chunks"].apply(lambda x : get_embedding(x, engine = TEXT_SEARCH_DOC_EMBEDDING_ENGINE))

In [5]:
embed_df #dataframe with document chunks and their embeddings created using Curie embeddings model 

In [14]:
# search through the document for the text segments most similar to the query
# return the top_n most similar chunks, ranked by cosine similarity
def search_docs(df, user_query, top_n=3):
    embedding = get_embedding(
        user_query,
        engine=TEXT_SEARCH_QUERY_EMBEDDING_ENGINE
    )
    df["similarities"] = df.curie_search.apply(lambda x: cosine_similarity(x, embedding))

    res = (
        df.sort_values("similarities", ascending=False)
        .reset_index(drop=True)
        .head(top_n)
    )
    return res

In [15]:
document_specific_query = "trouble so far in clinton campaign" 
res = search_docs(embed_df, document_specific_query, top_n=2) #finding top 2 results based on similarity 
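
The ranking step inside `search_docs` can be tried without calling the embeddings API. The sketch below is a minimal stand-in that assumes hand-written two-dimensional vectors in place of real embeddings; the names `rank_chunks`, `cosine_sim`, and `toy_df` are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd

def cosine_sim(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_chunks(df, query_vec, top_n=2):
    """Score every chunk against the query vector and return the top_n rows."""
    scored = df.copy()  # work on a copy so the caller's dataframe is not modified
    scored["similarities"] = scored["embedding"].apply(lambda v: cosine_sim(v, query_vec))
    return (
        scored.sort_values("similarities", ascending=False)
        .reset_index(drop=True)
        .head(top_n)
    )

# Toy data (hypothetical): real embeddings have far more dimensions
toy_df = pd.DataFrame({
    "chunks": ["chunk A", "chunk B", "chunk C"],
    "embedding": [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]],
})
query_vec = [0.0, 1.0]

top = rank_chunks(toy_df, query_vec, top_n=2)
print(top[["chunks", "similarities"]])
```

Unlike `search_docs`, this variant scores a copy of the dataframe, a design choice that keeps repeated queries from overwriting each other's `similarities` column.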

## Section 3: Summarizer

This section covers the end-to-end flow of using the GPT-3 models for summarization tasks.
The Azure OpenAI service exposes the model through a generative completion call, which relies on natural language instructions to identify the task being asked and the skill required, a practice known as prompt engineering. With this approach, the first part of the prompt contains natural language instructions and/or examples of the specific task desired. The model then completes the task by predicting the most probable next text. This technique is known as "in-context" learning.

There are three main approaches for adapting the model to a task: zero-shot, few-shot, and fine-tuning. They differ in the amount of task-specific data given to the model. Zero-shot and few-shot work purely in context, while fine-tuning updates the model weights:

**Zero-shot**: In this case, no examples are provided to the model and only the task request is provided. 

**Few-shot**: In this case, a user includes several examples in the call prompt that demonstrate the expected answer format and content. 

**Fine-Tuning**: Fine-tuning lets you tailor models to your own datasets. This customization step lets you get more out of the service:
-	With enough data (at least 500 examples), traditional optimization techniques with backpropagation are used to readjust the weights of the model, which enables higher-quality results than zero-shot or few-shot alone.
-	A customized model improves on the few-shot learning approach by training the model weights on your specific prompts and structure. This lets you achieve better results on a wider range of tasks without needing to provide examples in the prompt. The result is less text sent and fewer tokens processed per call.
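
In practice, the difference between zero-shot and few-shot prompting comes down to how the prompt string is assembled. The sketch below is one possible layout, assuming the same "Text:" and "Summary:" labels this notebook uses in its prompt; the helper `build_prompt` and all placeholder strings are hypothetical.

```python
def build_prompt(instruction, text, examples=None):
    """Assemble a completion prompt.

    `examples` is an optional list of (text, summary) pairs; leaving it
    empty yields a zero-shot prompt, supplying pairs yields a few-shot prompt.
    """
    parts = [instruction]
    for ex_text, ex_summary in examples or []:
        # Each worked example shows the model the expected answer format
        parts.append(f"Text:\n{ex_text}\n\nSummary:\n{ex_summary}")
    # The final block ends with an empty "Summary:" slot for the model to complete
    parts.append(f"Text:\n{text}\n\nSummary:\n")
    return "\n\n".join(parts)

instruction = "Summarize the text provided."
zero_shot = build_prompt(instruction, "Placeholder article text.")
few_shot = build_prompt(
    instruction,
    "Placeholder article text.",
    examples=[("Placeholder example text.", "Placeholder example summary.")],
)
print(zero_shot)
print("-----")
print(few_shot)
```

Fine-tuning has no equivalent at the prompt level: the examples are used to train the model ahead of time, so the prompt sent at inference can stay as short as the zero-shot version.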


In [None]:
'''Designing a prompt that will show and tell GPT-3 how to proceed. 
+ Providing an instruction to summarize the text about the general topic (prefix)
+ Providing quality data for the chunks to summarize and specifically mentioning they are the text provided (context + context primer)
+ Providing a space for GPT-3 to fill in the summary to follow the format (suffix)
'''

# result_1 corresponds to the top chunk from Section 2; result_2 corresponds to the second chunk.
# change the index to select a different chunk
result_1 = res.chunks[0]
result_2 = res.chunks[1]
# Each chunk is introduced by a "Text:" label on its own line, followed by an empty "Summary:" slot
prompt_i = 'Summarize the content about the Clinton campaign given the text provided.\n\nText:\n'+normalize_text(result_1)+ '\n\nText:\n'+ normalize_text(result_2)+'\n\nSummary:\n'

# Using a low temperature to limit the creativity in the response. 
response = openai.Completion.create(
        engine= TEXT_DAVINCI_001,
        prompt = prompt_i,
        temperature = 0.0,
        max_tokens = 500,
        top_p = 1.0,
        frequency_penalty=0.5,
        presence_penalty = 0.5,
        best_of = 1
    )

print(response.choices[0].text)

In [None]:
# testing the same prompt with a slightly higher temperature (0.2)
response = openai.Completion.create(
        engine= TEXT_DAVINCI_001,
        prompt = prompt_i,
        temperature = 0.2,
        max_tokens = 500,
        top_p = 1.0,
        frequency_penalty=0.5,
        presence_penalty = 0.5,
        best_of = 1
    )

print(response.choices[0].text)

In [None]:
# testing the same prompt with a higher temperature (0.7)
response = openai.Completion.create(
        engine= TEXT_DAVINCI_001,
        prompt = prompt_i,
        temperature = 0.7,
        max_tokens = 500,
        top_p = 1.0,
        frequency_penalty=0.5,
        presence_penalty = 0.5,
        best_of = 1
    )

print(response.choices[0].text)
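
To build intuition for why the three calls above behave differently, the sketch below shows how temperature reshapes a probability distribution over candidate tokens. It assumes the commonly described scheme in which token scores are divided by the temperature before a softmax; the service's internals are not described in this notebook, and the scores and the helper `token_probabilities` are made up for illustration.

```python
import numpy as np

def token_probabilities(logits, temperature):
    """Temperature-scaled softmax over a vector of token scores."""
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:
        # Limiting case: all probability mass goes to the highest-scoring token
        probs = np.zeros_like(logits)
        probs[np.argmax(logits)] = 1.0
        return probs
    scaled = logits / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Hypothetical scores for three candidate next tokens
toy_logits = [2.0, 1.0, 0.5]
for t in (0.0, 0.2, 0.7):
    print(t, np.round(token_probabilities(toy_logits, t), 3))
```

Under this assumption, a temperature of 0.0 always selects the top-scoring token, 0.2 concentrates almost all of the probability on it, and 0.7 leaves noticeably more room for the alternatives, which is consistent with using a low temperature to limit creativity in summaries.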