# Semantic Search on Specific Data Corpus
Run semantic queries against a specific corpus of documents.

In [1]:
%load_ext autoreload
%autoreload 2

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Set up Azure OpenAI

In [2]:
import os
import openai
from dotenv import load_dotenv

# Set up Azure OpenAI
load_dotenv()
openai.api_type = "azure"
openai.api_base = "" # The API base is the 'Endpoint' of your Azure OpenAI resource, shown in the Azure Portal. It looks like https://xxxxxx.openai.azure.com/
openai.api_version = "2022-12-01"
openai.api_key = os.getenv("OPENAI_API_KEY")

True

## Deploy a Language Model

In [3]:
# list models deployed with embeddings capability
deployment_id = None
result = openai.Deployment.list()

for deployment in result.data:
    if deployment["status"] != "succeeded":
        continue
    
    model = openai.Model.retrieve(deployment["model"])
    if not model["capabilities"]["embeddings"]:
        continue
    
    deployment_id = deployment["id"]
    break

# if no suitable deployment exists, create one
if not deployment_id:
    print('No succeeded deployment that supports embeddings was found.')
    model = "text-similarity-davinci-001"

    # Now let's create the deployment
    print(f'Creating a new deployment with model: {model}')
    result = openai.Deployment.create(model=model, scale_settings={"scale_type":"standard"})
    deployment_id = result["id"]
    print(f'Successfully created {model} with deployment_id {deployment_id}')
else:
    print(f'Found a succeeded deployment that supports embeddings with id: {deployment_id}.')

Found a succeeded deployment that supports embeddings with id: deployment-89153abdfa934e1580296dbee586239b.
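
The selection logic in the cell above can be isolated into a pure function, which makes it easy to check without calling the service. The following is a minimal sketch, assuming deployments and models are plain dictionaries with the same keys the notebook reads (`status`, `model`, `id`, `capabilities`). The function name `pick_embedding_deployment` and the sample data are hypothetical and are not part of the `openai` library.

```python
from typing import Optional


def pick_embedding_deployment(deployments, models) -> Optional[str]:
    """Return the id of the first succeeded deployment whose model supports embeddings."""
    for deployment in deployments:
        # Skip deployments that are still being created or that failed
        if deployment.get("status") != "succeeded":
            continue
        # Look up the model behind the deployment (hypothetical dict lookup)
        model = models.get(deployment.get("model"), {})
        if model.get("capabilities", {}).get("embeddings"):
            return deployment["id"]
    return None


# Hypothetical sample data, shaped like the fields the notebook accesses
sample_deployments = [
    {"id": "dep-a", "status": "failed", "model": "embed-model"},
    {"id": "dep-b", "status": "succeeded", "model": "chat-model"},
    {"id": "dep-c", "status": "succeeded", "model": "embed-model"},
]
sample_models = {
    "chat-model": {"capabilities": {"embeddings": False}},
    "embed-model": {"capabilities": {"embeddings": True}},
}

print(pick_embedding_deployment(sample_deployments, sample_models))
```

Returning `None` when nothing matches mirrors the `if not deployment_id` check that decides whether a new deployment has to be created.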


## Load Data
The next cell will load embeddings generated in notebook [01-get-embeddings.ipynb](./01-get-embeddings.ipynb).

In [4]:
import pandas as pd
fname = '../data/bbc-news-data-embedding.csv'
df_orig = pd.read_csv(fname, delimiter='\t', index_col=False)

In [5]:
import numpy as np

DEVELOPMENT = False # Set this to True for development on small subset of data

if DEVELOPMENT:
    # Sub-sample for development
    df = df_orig.sample(n=20, replace=False, random_state=9).copy() # Set sample size
else:
    df = df_orig.copy()

# drop rows with NaN
df.dropna(inplace=True)

# convert the stringified lists to NumPy arrays (ast.literal_eval parses literals only, unlike eval)
import ast
df["embedding"] = df["embedding"].apply(ast.literal_eval).apply(np.array)
df

Unnamed: 0,category,filename,title,content,embedding
0,business,001.txt,Ad sales boost Time Warner profit,Quarterly profits at US media giant TimeWarne...,"[-0.0012276918860152364, 0.00733763724565506, ..."
1,business,002.txt,Dollar gains on Greenspan speech,The dollar has hit its highest level against ...,"[0.0009311728645116091, 0.014099937863647938, ..."
2,business,003.txt,Yukos unit buyer faces loan claim,The owners of embattled Russian oil giant Yuk...,"[-0.010487922467291355, 0.009665092453360558, ..."
3,business,004.txt,High fuel prices hit BA's profits,British Airways has blamed high fuel prices f...,"[0.0111119095236063, 0.004624682944267988, -0...."
4,business,005.txt,Pernod takeover talk lifts Domecq,Shares in UK drinks and food firm Allied Dome...,"[-0.0021637482568621635, 0.005410161800682545,..."
...,...,...,...,...,...
2219,tech,396.txt,New consoles promise big problems,Making games for future consoles will require...,"[0.014879594556987286, 0.004789963364601135, -..."
2220,tech,397.txt,BT program to beat dialler scams,BT is introducing two initiatives to help bea...,"[0.007671569474041462, 0.00624304823577404, -0..."
2221,tech,398.txt,Spam e-mails tempt net shoppers,Computer users across the world continue to i...,"[0.0026338498573750257, 0.015989987179636955, ..."
2222,tech,399.txt,Be careful how you code,A new European directive could put software w...,"[0.007126151118427515, 0.008495588786900043, -..."
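
The embedding column is stored in the CSV as stringified lists, so each cell has to be parsed back into a numeric array before it can be compared with a query embedding. The following is a minimal sketch of that step using `ast.literal_eval`, which accepts Python literals only and therefore does not execute arbitrary code found in the file. The helper name `parse_embedding` and the sample values are made up for illustration.

```python
import ast

import numpy as np


def parse_embedding(cell: str) -> np.ndarray:
    """Parse a stringified list of floats, such as "[0.1, -0.2]", into a float array."""
    values = ast.literal_eval(cell)  # evaluates literals only, never function calls
    return np.asarray(values, dtype=float)


# Made-up cells standing in for the 'embedding' column
sample_cells = ["[0.6, 0.8]", "[1.0, 0.0]"]

# Stack the parsed rows into a (documents x dimensions) matrix
matrix = np.vstack([parse_embedding(cell) for cell in sample_cells])
print(matrix.shape)
```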


## Find Articles with Similar Embeddings to that of the Question

In [7]:
import numpy as np

def get_embedding(text, deployment_id=deployment_id):
    """ 
    Get the embedding of an input text from the Azure OpenAI embeddings deployment.
    """
    result = openai.Embedding.create(
      deployment_id=deployment_id,
      input=text
    )
    result = np.array(result["data"][0]["embedding"])
    return result

def vector_similarity(x, y):
    """
    Returns the similarity between two vectors.    
    Because OpenAI Embeddings are normalized to length 1, the cosine similarity is the same as the dot product.
    """
    similarity = np.dot(x, y)
    return similarity 

def order_document_sections_by_query_similarity(query, contexts):
    """
    Compute the embedding of the supplied query and compare it against all of the pre-calculated article embeddings
    to find the most relevant articles.
    Return a list of (similarity, article index) tuples, sorted by relevance in descending order.
    """
    query_embedding = get_embedding(query)

    document_similarities = sorted(
        [(vector_similarity(query_embedding, doc_embedding), doc_index) for doc_index, doc_embedding in contexts.items()], 
        reverse=True)
    
    return document_similarities
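
The docstring of `vector_similarity` relies on the fact that for unit-length vectors the cosine similarity and the dot product coincide. The following small sketch illustrates this with made-up two-dimensional vectors; `cosine_similarity` is a helper written for this illustration and does not appear in the notebook.

```python
import numpy as np


def cosine_similarity(x, y):
    """Cosine of the angle between x and y, independent of their lengths."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


# Made-up vectors that are NOT unit length
a = np.array([3.0, 4.0])
b = np.array([4.0, 3.0])

# The same directions rescaled to length 1
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

print("dot (raw):        ", np.dot(a, b))            # depends on vector length
print("cosine (raw):     ", cosine_similarity(a, b))
print("dot (normalized): ", np.dot(a_unit, b_unit))  # matches the cosine value
```

If embeddings from another source are not normalized, dividing each vector by its norm first keeps the dot-product shortcut valid.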

## Retrieve Relevant Articles 

In [14]:
def retrieve_relevant_documents(query, contexts = df['embedding']):
    # find text most similar to the query
    answers = order_document_sections_by_query_similarity(query=query, contexts=contexts)[0:3] # Set to top 3

    # print top 3
    for answer in answers:
        print(f'similarity score:   {answer[0]}')
        print(df['content'].loc[answer[1]], '\n')

    return
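
For larger corpora, or when the results are needed programmatically rather than printed, the ranking can be done with a single matrix product and the top matches returned to the caller. The following is a sketch under stated assumptions: the function `top_k_documents` and the toy embeddings are hypothetical, and the query embedding is passed in directly so that the example runs without calling the embeddings service.

```python
import numpy as np


def top_k_documents(query_embedding, doc_embeddings, k=3):
    """
    Rank documents by dot-product similarity to the query.
    doc_embeddings maps a document index to a 1-D array.
    Returns a list of (score, index) tuples, best match first.
    """
    indices = list(doc_embeddings.keys())
    # Stack all document vectors into a (documents x dimensions) matrix
    matrix = np.vstack([doc_embeddings[i] for i in indices])
    # One matrix-vector product yields every similarity score at once
    scores = matrix @ np.asarray(query_embedding)
    # Negate the scores so argsort returns the highest scores first
    order = np.argsort(-scores)[:k]
    return [(float(scores[pos]), indices[pos]) for pos in order]


# Toy unit-length embeddings keyed by made-up document indices
toy_embeddings = {
    10: np.array([1.0, 0.0]),
    11: np.array([0.0, 1.0]),
    12: np.array([0.6, 0.8]),
}
toy_query = np.array([0.0, 1.0])

for score, index in top_k_documents(toy_query, toy_embeddings, k=2):
    print(index, round(score, 3))
```

Because a pandas Series also exposes `keys()` and label-based lookup, a column such as `df['embedding']` could likely be passed in place of the dictionary, though that is untested here.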

## Query Examples

In [15]:
query = 'News about stock market.'
retrieve_relevant_documents(query=query)

similarity score:   0.5842770878602115
 The owner of the technology-dominated Nasdaq stock index plans to sell shares to the public and list itself on the market it operates.  According to a registration document filed with the Securities and Exchange Commission, Nasdaq Stock Market plans to raise $100m (£52m) from the sale. Some observers see this as another step closer to a full public listing. However Nasdaq, an icon of the 1990s technology boom, recently poured cold water on those suggestions.  The company first sold shares in private placements during 2000 and 2001. It technically went public in 2002 when the stock started trading on the OTC Bulletin Board, which lists equities that trade only occasionally. Nasdaq will not make money from the sale, only investors who bought shares in the private placings, the filing documents said. The Nasdaq is made up shares in technology firms and other companies with high growth potential. It was the most potent symbol of the 1990s internet an

In [17]:
query = 'What is happening in the rugby world?'
retrieve_relevant_documents(query=query)

similarity score:   0.5600900443509507
 England will have to negotiate their way through a tough draw if they are to win the Rugby World Cup Sevens in Hong Kong next month.  The second seeds have been drawn against Samoa, France, Italy, Georgia and Chinese Taipei. The top two sides in each pool qualify but England could face 2001 winners New Zealand in the quarter-finals if they stumble against Samoa. Scotland and Ireland are in Pool A together with the All Blacks. England won the first event of the International Rugby Board World Sevens series in Dubai but have slipped to fourth in the table after failing to build on that victory.  However, they beat Samoa in the recent Los Angeles Sevens before losing to Argentina in the semi-finals. "England have the ability and determination to win this World Cup and create sporting history by being the only nation to hold both the 15s and Sevens World Cups at the same time," said England sevens coach Mike Friday. "England have a fantastic record i

In [18]:
query = 'What happened in Brazil?'
retrieve_relevant_documents(query=query)

similarity score:   0.5232753153602943
 A major reform of Brazil's bankruptcy laws has been approved by the country's Congress, in a move which it is hoped will cut the cost of borrowing.  The bill, proposed in 1993, has finally been approved by the leadership of President Luiz Inacio Lula da Silva. The old law, dating from 1945, gave priority first to workers, second to tax revenue and finally to creditors. The new legislation changes this, giving priority to creditors and limiting payments to workers. The new regulations will limit payments to workers to 150 times the minimum monthly salary, which is currently $94. The law also makes it more difficult for a company to declare bankruptcy. However, when a firm is declared bankrupt it will gain protection from creditors for 180 days while a recovery plan is worked out.  The proposals were opposed in the past by leftist parties, including Mr Lula's Worker Party. They considered that they undermined workers' rights. But President Lula bec