In [130]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:96% !important; }</style>"))

# Module 2: Text Search with Amazon OpenSearch Service 

In this module, we are going to perform a simple search in OpenSearch by matching the individual words in our search query. We will:
1. Load data into OpenSearch from the Amazon Product Question and Answer (PQA) dataset. This dataset contains a list of common questions and answers related to products.
2. Query the data using a simple search query to find potentially matching questions. We will search the PQA dataset for questions similar to our sample question "does this work with xbox?". We expect to find matches in the dataset based on individual words such as "xbox" and "work".
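
To make the idea of word matching concrete, here is a minimal sketch that ranks a few questions by how many distinct words they share with the query. It is only an illustration: the helper names are hypothetical, the third sample question is made up, and OpenSearch uses a more refined relevance score than a plain overlap count.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def count_shared_words(query, candidate):
    """Count the distinct words that the query and the candidate share."""
    return len(set(tokenize(query)) & set(tokenize(candidate)))

# The third question is a made-up example with no words in common with the query
questions = [
    "Does this work with xbox one?",
    "Does this work with PS4",
    "Is the cable long enough?",
]
query = "does this work with xbox?"

# Rank the candidates by descending word overlap
ranked = sorted(questions, key=lambda q: count_shared_words(query, q), reverse=True)
for q in ranked:
    print(count_shared_words(query, q), q)
```

Note that the PS4 question still shares four of the five query words, which hints at why pure keyword matching can return results that are not really about the same thing.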

In subsequent modules, we will then demonstrate how to use semantic search to improve the relevance of the query results.

### 1. Install required libraries

Before we begin, we need to install some required libraries.

In [None]:
!pip install -q boto3
!pip install -q requests
!pip install -q requests-aws4auth
!pip install -q opensearch-py
!pip install -q tqdm
!pip install -q "transformers[torch]"
!pip install -q sentence-transformers rank_bm25

### 2. Get CloudFormation stack output variables

We also need to grab some key values from the infrastructure we provisioned using CloudFormation. To do this, we will list the outputs from the stack and store them in `outputs` to be used later.

You can ignore any "PythonDeprecationWarning" warnings.

In [None]:
import boto3
import json

cfn = boto3.client('cloudformation')
kms = boto3.client('secretsmanager')  # despite the variable name, this is a Secrets Manager client


def get_cfn_outputs(stackname):
    outputs = {}
    for output in cfn.describe_stacks(StackName=stackname)['Stacks'][0]['Outputs']:
        outputs[output['OutputKey']] = output['OutputValue']
    return outputs

## Setup variables to use for the rest of the demo
cloudformation_stack_name = "semantic-search"

outputs = get_cfn_outputs(cloudformation_stack_name)
bucket = outputs['s3BucketTraining']
aos_host = outputs['OpenSearchDomainEndpoint']
aos_credentials = json.loads(kms.get_secret_value(SecretId=outputs['OpenSearchSecret'])['SecretString'])

outputs

### 3. Copy the data set locally
Before we can run any queries, we need to download the Amazon Product Question and Answer data from https://registry.opendata.aws/amazon-pqa/

Let's start by having a look at all the files in the dataset.

In [2]:
!aws s3 ls --no-sign-request s3://amazon-pqa/

2021-05-20 22:11:25 2267692311 amazon-pqa.tar.gz
2021-05-09 20:53:53  442066567 amazon_pqa_accessories.json
2021-05-09 20:53:49  275062405 amazon_pqa_activity_&_fitness_trackers.json
2021-05-09 20:53:49  127094083 amazon_pqa_adapters.json
2021-05-09 20:53:49  143639699 amazon_pqa_amazon_echo_&_alexa_devices.json
2021-05-09 20:53:49  106017252 amazon_pqa_area_rugs.json
2021-05-09 20:53:49  164430689 amazon_pqa_backpacks.json
2021-05-09 20:53:49  679285046 amazon_pqa_basic_cases.json
2021-05-09 20:53:49  390964941 amazon_pqa_batteries.json
2021-05-09 20:53:49  107896488 amazon_pqa_battery_chargers.json
2021-05-09 20:53:49   77113272 amazon_pqa_bed_frames.json
2021-05-09 20:53:49  157944761 amazon_pqa_beds.json
2021-05-09 20:53:49  218133567 amazon_pqa_bullet_cameras.json
2021-05-09 20:53:50  118106256 amazon_pqa_camcorders.json
2021-05-09 20:53:50   71239417 amazon_pqa_car.json
2021-05-09 20:53:50  137487049 amazon_pqa_car_stereo_receivers.json
2021-05-09 20:53:50  153301436 amazon_pqa_c

There are a lot of files here, so for the purposes of this demo, we focus on just the headset data. Let's download the amazon_pqa_headsets.json data locally. 

In [3]:
!aws s3 cp --no-sign-request s3://amazon-pqa/amazon_pqa_headsets.json ./amazon-pqa/amazon_pqa_headsets.json

download: s3://amazon-pqa/amazon_pqa_headsets.json to amazon-pqa/amazon_pqa_headsets.json


### 4. Create an OpenSearch cluster connection.
Next, we'll use the Python API to set up a connection to the Amazon OpenSearch Service domain.

Note: if you're using a region other than us-east-1, please update the region in the code below. 

In [102]:
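# KwankiAhn settings: these override the CloudFormation values above; replace them with your own endpoint and credentials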
aos_host = "search-search-test-fsnokwgicqo2bvwylh4ahuk2ru.us-east-1.es.amazonaws.com"
aos_credentials = {
    "username": "AK...",
    "password": "XT..."
}

In [103]:
from opensearchpy import OpenSearch, RequestsHttpConnection, AWSV4SignerAuth
import boto3

# Update the region if you're working in a region other than us-east-1
region = 'us-east-1' 

credentials = boto3.Session().get_credentials()
auth = AWSV4SignerAuth(credentials, region)
# auth = (aos_credentials['username'], aos_credentials['password'])

aos_client = OpenSearch(
    hosts = [{'host': aos_host, 'port': 443}],
    http_auth = auth,
    use_ssl = True,
    verify_certs = True,
    connection_class = RequestsHttpConnection
)
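
Because the region has to match the domain, one option is to read it from the endpoint instead of hard-coding it. The sketch below assumes the endpoint has the shape `<domain>.<region>.es.amazonaws.com`, as the host used above does; `region_from_endpoint` is a hypothetical helper, not part of boto3 or opensearch-py, and it falls back to a default for any other shape.

```python
def region_from_endpoint(host, default="us-east-1"):
    """Guess the AWS region from an endpoint shaped like
    '<domain>.<region>.es.amazonaws.com'; return `default` otherwise."""
    parts = host.split(".")
    if len(parts) >= 5 and parts[-3:] == ["es", "amazonaws", "com"]:
        return parts[-4]
    return default

# Host taken from the settings cell above
host = "search-search-test-fsnokwgicqo2bvwylh4ahuk2ru.us-east-1.es.amazonaws.com"
print(region_from_endpoint(host))
```

The result could then be passed as `region` when building the `AWSV4SignerAuth` object.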

### 5. Create an index in Amazon OpenSearch Service 
We define an index whose default analyzer is the `standard` analyzer with the English stopword list (`_english_`), which strips common stopwords such as `the`, `is`, `a`, `an`, etc.

We will use the aos_client connection we initiated earlier to create an index in Amazon OpenSearch Service.
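
To get a feel for what stopword removal does to a question, here is a rough pure-Python imitation of the analysis step. It is a sketch under assumptions: the `STOPWORDS` set is only an illustrative subset of the English list, the `analyze` helper is hypothetical, and the real analyzer tokenizes text with more sophisticated rules than this regular expression.

```python
import re

# Illustrative subset of English stopwords (an assumption, not the full built-in list)
STOPWORDS = {"a", "an", "and", "is", "the", "this", "with"}

def analyze(text):
    """Lowercase the text, split it into word tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(analyze("Does this work with the xbox?"))
```

Only the tokens that survive this step are indexed and matched, so very common words no longer influence which documents are found.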

In [137]:
headset_default_index = {
    "settings": {
        "number_of_replicas": 1,
        "number_of_shards": 1,
        "analysis": {
            "analyzer": {
                "default": {
                    "type": "standard",
                    "stopwords": "_english_"
                }
            }
        }
    },
#     "mappings": {
#         "properties": {
#             "question": {
#                 "type": "text",
#                 "store": True
#             }
#         }
#     }
#     "mappings": {
#         "_source": {"enabled": True}, 
#         "properties": {
#             "question": {"type": "text"}
#         }
#     }
#     "mappings": {
#         'properties': {
#             'question': {'type': 'text'}
#         }
#     }           
}
# KwankiAhn 231225 Remark
# Error : Exception raised even though number of shard copies is a multiple of awareness attributes
# Solution: Resolved by changing the OpenSearch domain configuration from 3-AZ to 1-AZ only (the number_of_replicas parameter apparently has to match the actual node configuration)

If for any reason you need to recreate your dataset, you can execute the following cell to delete the previously created index. If this is the first time you're running this, skip this step, because deleting an index that does not exist raises an error.

In [138]:
aos_client.indices.delete(index="headset_pqa")

{'acknowledged': True}
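
If you are not sure whether the index exists yet, a guarded delete avoids the not-found error. The sketch below relies only on the `indices.exists` and `indices.delete` calls of the client; `delete_index_if_exists` is a hypothetical helper, and the small in-memory stand-in classes exist only so the example runs without a live domain.

```python
def delete_index_if_exists(client, index_name):
    """Delete `index_name` only if it exists; return True when a delete was issued."""
    if client.indices.exists(index=index_name):
        client.indices.delete(index=index_name)
        return True
    return False

# In-memory stand-in for the real client, for demonstration only
class FakeIndices:
    def __init__(self, names):
        self.names = set(names)

    def exists(self, index):
        return index in self.names

    def delete(self, index):
        self.names.remove(index)

class FakeClient:
    def __init__(self, names):
        self.indices = FakeIndices(names)

demo_client = FakeClient(["headset_pqa"])
print(delete_index_if_exists(demo_client, "headset_pqa"))  # index present, so it is deleted
print(delete_index_if_exists(demo_client, "headset_pqa"))  # already gone, nothing to do
```

Against the real domain, the call would be `delete_index_if_exists(aos_client, "headset_pqa")`.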

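Next, we create the index in Amazon OpenSearch Service. Note that in the cell below the `body=headset_default_index` argument is commented out, so the index is created with default settings rather than with the definition above.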

In [139]:
ret = aos_client.indices.create(index="headset_pqa")  #,body=headset_default_index,ignore=400)

In [140]:
import json
def displayPrettyJson(json_data):
    # Accept either a JSON string or an already-parsed object
    if isinstance(json_data, str):
        json_object = json.loads(json_data)
    else:
        json_object = json_data
    json_formatted_str = json.dumps(json_object, indent=2)
    print(json_formatted_str)
displayPrettyJson(ret)

{
  "acknowledged": true,
  "shards_acknowledged": true,
  "index": "headset_pqa"
}


Let's verify the settings of the index we just created.

In [141]:
aos_client.indices.get(index="headset_pqa")

{'headset_pqa': {'aliases': {},
  'mappings': {},
  'settings': {'index': {'replication': {'type': 'DOCUMENT'},
    'number_of_shards': '5',
    'provided_name': 'headset_pqa',
    'creation_date': '1703680478113',
    'number_of_replicas': '1',
    'uuid': 'ysTtonrXSuWPU-O1Ij1opg',
    'version': {'created': '136327827'}}}}}

### 6. Load the raw data into the Index
Next, let's load the headset PQA data we copied locally into the index we've just created.

In [142]:
import re
import json
from tqdm.contrib.concurrent import process_map
from multiprocessing import cpu_count

def load_pqa_as_json(file_name,number_rows=1000):
    result=[]
    with open(file_name) as f:
        i=0
        for line in f:
            data = json.loads(line)
            result.append(data)
            # refinedStr = re.sub(r'\W+', '', data["question_text"])
            # result.append({"question": str(data["question_text"])})
            i+=1
            if(i == number_rows):
                break
    return result

qa_list_json = load_pqa_as_json('amazon-pqa/amazon_pqa_headsets.json',number_rows=1000)

def es_import(question):
    aos_client.index(index='headset_pqa', body=question)

In [143]:
qa_list_json[:2]

[{'question_id': 'Tx39GCUOS5AYAFK',
  'question_text': 'does this work with cisco ip phone 7942',
  'asin': 'B000LSZ2D6',
  'bullet_point1': 'Noise-Canceling microphone filters out background sound',
  'bullet_point2': 'HW251N P/N 75100-06',
  'bullet_point3': 'Uses Plantronics QD Quick Disconnect Connector. Must be used with Plantronics Amp or with proper phone or USB adapter cable',
  'bullet_point4': 'Connectivity Technology: Wired, Earpiece Design: Over-the-head, Earpiece Type: Monaural, Host Interface: Proprietary, Microphone Design: Boom, Microphone Technology: Noise Canceling, Product Model: HW251N, Product Series: SupraPlus, Standard Warranty: 2 Year',
  'bullet_point5': 'Easy Lightweight Wear -Leaving One Ear Uncovered For Person-to-Person Conversations',
  'product_description': '',
  'brand_name': 'Plantronics',
  'item_name': 'Plantronics HW251N SupraPlus Wideband Headset (64338-31)',
  'question_type': 'yes-no',
  'answer_aggregated': 'neutral',
  'answers': [{'answer_text

In [None]:
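workers = 4 * cpu_count()

# Parallel alternative to the sequential loop in the next cell; run only one of the two,
# otherwise every document is indexed twice
process_map(es_import, qa_list_json, max_workers=workers, chunksize=100)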

In [144]:
for query in qa_list_json:
    es_import(query)

To validate the load, we'll query the number of documents in the index. We should have 1000 documents in the index.

In [145]:
res = aos_client.search(index="headset_pqa", body={"query": {"match_all": {}}})
print("Records found: %d " % res['hits']['total']['value'])

Records found: 1000 


### 7. Run a "Simple Text Search"

Now that we've loaded our data, let's run a keyword search for the question "does this work with xbox?", using the default OpenSearch query, and display the results.

In [154]:
import pandas as pd
query={
  "size": 10,
  "query": {
    "match": {
      "question_text": "does this work with xbox?"
    }
  }
}
res = aos_client.search(index="headset_pqa", body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['question_text'],hit['_source']['answers'][0]['answer_text']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
with pd.option_context("display.max_colwidth", 120):
    display(query_result_df)

Unnamed: 0,_id,_score,question,answer
0,olJGq4wBZBEtYQQvFRaF,9.786783,Does this work with xbox one?,"sorry, Im not an xbox user."
1,rlJHq4wBZBEtYQQv4Bi0,9.415402,Does this work with the xbox one?,"Yeah of course , but you must have an adapter to use this beautiful headset"
2,IFJHq4wBZBEtYQQvZBip,8.508419,does this work on xbox one?,"I'm sorry, but not!"
3,GFJFq4wBZBEtYQQvmBYq,8.366218,Does this work for xbox one S?,It should work.
4,KVJHq4wBZBEtYQQvbBgd,7.526715,Does it work for Xbox 360?,"Sorry , it can't .Just for PS4"
5,MFJHq4wBZBEtYQQvchi5,7.40166,Does it work for xbox one?,"Thanks for your inquiry, it just works with PS4. Hope this is helpful for you."
6,YlJFq4wBZBEtYQQv3BYt,7.353278,work with xbox one???,with the stereo headset adapter with is not included
7,-FJHq4wBZBEtYQQvQhec,7.270713,Does this work with PS4,yes
8,o1JIq4wBZBEtYQQvsRkj,7.0938,Can this work with an xbox one with the adpater???,Yes but the audio is not very good. I recommend using the Mixamp and the adapter. The adapter I use for just voice chat
9,_1JHq4wBZBEtYQQvSBd0,7.03897,will these work with xbox one?,Yes
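
The loop that turns search hits into table rows is repeated in every search cell in this module, so it can be factored into a small function. This is a sketch: `hits_to_rows` is a hypothetical helper, and unlike the inline loop it tolerates documents without answers. The second sample hit is made up to exercise that case.

```python
def hits_to_rows(response):
    """Turn a search response into [_id, _score, question, first answer] rows."""
    rows = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        # Fall back to an empty answer when the document has no answers
        answers = source.get("answers") or [{}]
        rows.append([
            hit["_id"],
            hit["_score"],
            source.get("question_text", ""),
            answers[0].get("answer_text", ""),
        ])
    return rows

sample_response = {"hits": {"hits": [
    {"_id": "olJGq4wBZBEtYQQvFRaF", "_score": 9.786783,
     "_source": {"question_text": "Does this work with xbox one?",
                 "answers": [{"answer_text": "sorry, Im not an xbox user."}]}},
    {"_id": "hypothetical-id", "_score": 1.0,
     "_source": {"question_text": "A question with no answers", "answers": []}},
]}}

for row in hits_to_rows(sample_response):
    print(row)
```

The returned rows can be passed to `pd.DataFrame` with the same column names used above.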


### 8. Search across multiple fields

Searching across multiple fields can bring back more results, which are scored by BM25 relevance.

In [155]:
import pandas as pd
query={
  "size": 10,
  "query": {
    "multi_match": {
      "query": "does this work with xbox?",
      "fields": ["question_text","bullet_point*", "answers.answer_text", "item_name"]
    }
  }
}
res = aos_client.search(index="headset_pqa", body=query)

query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['question_text'],hit['_source']['answers'][0]['answer_text']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
with pd.option_context("display.max_colwidth", 120):
    display(query_result_df)

Unnamed: 0,_id,_score,question,answer
0,b1JFq4wBZBEtYQQv5xYP,11.929837,Can someone help me figure out how to get the mic working on the head set with a xbox one. we have the adapter cord ...,"Hi J. Klenske, if your xbox one is the new version, you can just use with a normal 2 in 1 splitter cable. If your xb..."
1,mlJIq4wBZBEtYQQvqRm9,9.80099,DOes it work with Xbox 360 or do you need the mixamp or chat cord?,This was a gift for my son. I asked him and he said that you need a chat cord.
2,olJGq4wBZBEtYQQvFRaF,9.786783,Does this work with xbox one?,"sorry, Im not an xbox user."
3,rlJHq4wBZBEtYQQv4Bi0,9.415402,Does this work with the xbox one?,"Yeah of course , but you must have an adapter to use this beautiful headset"
4,GlJFq4wBZBEtYQQvmRbS,9.162766,"So, these are totally comparable with the Xbox One?. Any adapters needed?.",Per Skullcandy....yes it works directly with xbox. Al
5,k1JGq4wBZBEtYQQvCBY5,8.92103,How do I know which version of the Xbox one that I have,"There is Xbox one, Xbox one s, now Xbox one x. It will say on the box it came in. It doesn't matter which version th..."
6,IFJHq4wBZBEtYQQvZBip,8.508419,does this work on xbox one?,"I'm sorry, but not!"
7,GFJFq4wBZBEtYQQvmBYq,8.366218,Does this work for xbox one S?,It should work.
8,QlJHq4wBZBEtYQQvgRin,8.153154,Will it work for xbox one?,"Sorry, it is not compatible with PS4 Xbox one. It could work with PCs, PS2, PS3, Xbox 360."
9,UVJGq4wBZBEtYQQvsBfK,7.888779,Does it work for android?,It does work with Android
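
To see roughly how BM25 arrives at scores like these, here is a small textbook implementation applied to three of the questions from the results. It is a simplified sketch, not the engine's exact computation: the function names are hypothetical, `k1=1.2` and `b=0.75` are the commonly used defaults, no stopwords are removed, and the absolute numbers will not match the scores OpenSearch reports.

```python
import math
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a textbook BM25 formula."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        score = 0.0
        for term in set(tokenize(query)):
            freq = tokens.count(term)
            if freq == 0:
                continue
            # Rare terms get a higher inverse document frequency
            doc_freq = sum(1 for t in tokenized if term in t)
            idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
            # Term frequency saturates and is normalized by document length
            norm = freq + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "Does this work with xbox one?",
    "Does this work with PS4",
    "Does it work for android?",
]
for doc, score in zip(docs, bm25_scores("does this work with xbox?", docs)):
    print(round(score, 3), doc)
```

The rare word "xbox" contributes the most to the score, while words that occur in every document, such as "does" and "work", contribute very little.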


### 9. Search with field preference or boosting

When searching across fields, all fields are given the same priority by default. You can control the preference by giving each field a static boost factor.

In [156]:
import pandas as pd
query={
  "size": 10,
  "query": {
    "multi_match": {
      "query": "does this work with xbox?",
      "fields": ["question_text^2", "bullet_point*", "answers.answer_text^2", "item_name^1.5"]
    }
  }
}
res = aos_client.search(index="headset_pqa", body=query)

query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['question_text'],hit['_source']['answers'][0]['answer_text']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
with pd.option_context("display.max_colwidth", 120):
    display(query_result_df)

Unnamed: 0,_id,_score,question,answer
0,mlJIq4wBZBEtYQQvqRm9,19.60198,DOes it work with Xbox 360 or do you need the mixamp or chat cord?,This was a gift for my son. I asked him and he said that you need a chat cord.
1,olJGq4wBZBEtYQQvFRaF,19.573566,Does this work with xbox one?,"sorry, Im not an xbox user."
2,rlJHq4wBZBEtYQQv4Bi0,18.830805,Does this work with the xbox one?,"Yeah of course , but you must have an adapter to use this beautiful headset"
3,GlJFq4wBZBEtYQQvmRbS,18.325531,"So, these are totally comparable with the Xbox One?. Any adapters needed?.",Per Skullcandy....yes it works directly with xbox. Al
4,k1JGq4wBZBEtYQQvCBY5,17.84206,How do I know which version of the Xbox one that I have,"There is Xbox one, Xbox one s, now Xbox one x. It will say on the box it came in. It doesn't matter which version th..."
5,IFJHq4wBZBEtYQQvZBip,17.016838,does this work on xbox one?,"I'm sorry, but not!"
6,GFJFq4wBZBEtYQQvmBYq,16.732435,Does this work for xbox one S?,It should work.
7,QlJHq4wBZBEtYQQvgRin,16.306309,Will it work for xbox one?,"Sorry, it is not compatible with PS4 Xbox one. It could work with PCs, PS2, PS3, Xbox 360."
8,UVJGq4wBZBEtYQQvsBfK,15.777558,Does it work for android?,It does work with Android
9,plJGq4wBZBEtYQQv_BcC,15.595388,Can I use my wireless bluetotth head phones for sound with game?,My son doesn’t think you can. He uses his wired headset that he uses for his xbox
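
The effect of the boost can be checked with simple arithmetic. The sketch below assumes the default `best_fields` behavior of `multi_match`, where a document's score is the highest per-field score after each field's boost is applied. The `best_fields_score` helper is hypothetical, and the `item_name` score of 0.0 is an assumption made for illustration.

```python
def best_fields_score(field_scores, boosts):
    """Return the highest boosted per-field score (fields without a boost use 1.0)."""
    return max(score * boosts.get(field, 1.0) for field, score in field_scores.items())

# question_text score taken from the unboosted results; item_name score is assumed
field_scores = {"question_text": 9.786783, "item_name": 0.0}

unboosted = best_fields_score(field_scores, {})
boosted = best_fields_score(field_scores, {"question_text": 2.0, "item_name": 1.5})
print(unboosted, boosted)
```

With `question_text^2`, the score of "Does this work with xbox one?" doubles from 9.786783 to 19.573566, which matches the value in the table above.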


### 10. Compound queries with `bool`

With `bool` queries, you can give more preference based on other field values or their existence. In the query below, a document gets a higher score if its `answer_aggregated` value is `neutral`.

In [157]:
import pandas as pd
query={
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "does this work with xbox?",
            "fields": [ "question_text^2", "bullet_point*", "answers.answer_text^2","item_name^2"]
          }
        }
      ],
      "should": [
        {
          "term": {
            "answer_aggregated.keyword": {
              "value": "neutral"
            }
          }
        }
      ]
    }
  }
}
res = aos_client.search(index="headset_pqa", body=query)
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['question_text'],hit['_source']['answers'][0]['answer_text']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
with pd.option_context("display.max_colwidth", 120):
    display(query_result_df)

Unnamed: 0,_id,_score,question,answer
0,rlJHq4wBZBEtYQQv4Bi0,20.291574,Does this work with the xbox one?,"Yeah of course , but you must have an adapter to use this beautiful headset"
1,mlJIq4wBZBEtYQQvqRm9,19.60198,DOes it work with Xbox 360 or do you need the mixamp or chat cord?,This was a gift for my son. I asked him and he said that you need a chat cord.
2,olJGq4wBZBEtYQQvFRaF,19.573566,Does this work with xbox one?,"sorry, Im not an xbox user."
3,IFJHq4wBZBEtYQQvZBip,18.477608,does this work on xbox one?,"I'm sorry, but not!"
4,GlJFq4wBZBEtYQQvmRbS,18.325531,"So, these are totally comparable with the Xbox One?. Any adapters needed?.",Per Skullcandy....yes it works directly with xbox. Al
5,GFJFq4wBZBEtYQQvmBYq,18.181099,Does this work for xbox one S?,It should work.
6,k1JGq4wBZBEtYQQvCBY5,17.84206,How do I know which version of the Xbox one that I have,"There is Xbox one, Xbox one s, now Xbox one x. It will say on the box it came in. It doesn't matter which version th..."
7,QlJHq4wBZBEtYQQvgRin,17.658031,Will it work for xbox one?,"Sorry, it is not compatible with PS4 Xbox one. It could work with PCs, PS2, PS3, Xbox 360."
8,UVJGq4wBZBEtYQQvsBfK,17.129282,Does it work for android?,It does work with Android
9,KVJHq4wBZBEtYQQvbBgd,16.405153,Does it work for Xbox 360?,"Sorry , it can't .Just for PS4"
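
A `bool` query adds up the scores of its matching clauses, so a matching `should` clause acts as a bonus on top of the `must` score. The sketch below illustrates this with the first row of the table. `bool_score` is a hypothetical helper, and the bonus of 1.460769 is not reported by OpenSearch directly; it is inferred here from the difference between this table and the one in the previous section.

```python
def bool_score(must_scores, should_scores):
    """Sum the scores of all `must` clauses and of the `should` clauses that matched.

    A `should` clause that did not match is represented by None.
    """
    return sum(must_scores) + sum(s for s in should_scores if s is not None)

must_score = 18.830805   # multi_match score from the previous section
should_bonus = 1.460769  # inferred bonus for answer_aggregated == "neutral"

print(bool_score([must_score], [should_bonus]))  # document matches the should clause
print(bool_score([must_score], [None]))          # document does not match it
```

Documents that do not match the `should` clause keep their original score, which is why some rows are unchanged compared with the previous section.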


### 11. Use custom scoring with function score queries

Function score queries are handy for overriding the default BM25 scoring. The query below recalculates the score based on how many answers the question has received.

In [158]:
import pandas as pd
query={
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "must": [
            {
              "multi_match": {
                "query": "does this work with xbox?",
                "fields": ["question_text^5","bullet_point*","answers.answer_text^2", "item_name^2" ]
              }
            }
          ],
          "should": [
            {
              "term": {
                "answer_aggregated.keyword": {
                  "value": "neutral"
                }
              }
            }
          ]
        }
      },
      "functions": [
        {
          "script_score": {
            "script": "_score * 0.25 * doc['answers.answer_text.keyword'].length"
          }
        }
      ]
    }
  }
}
res = aos_client.search(index="headset_pqa", body=query)
print("Got %d Hits:" % res['hits']['total']['value'])
query_result=[]
for hit in res['hits']['hits']:
    row=[hit['_id'],hit['_score'],hit['_source']['question_text'],hit['_source']['answers'][0]['answer_text']]
    query_result.append(row)

query_result_df = pd.DataFrame(data=query_result,columns=["_id","_score","question","answer"])
with pd.option_context("display.max_colwidth", 120):
    display(query_result_df)

Got 984 Hits:


Unnamed: 0,_id,_score,question,answer
0,olJGq4wBZBEtYQQvFRaF,1795.896,Does this work with xbox one?,"sorry, Im not an xbox user."
1,rlJHq4wBZBEtYQQv4Bi0,1766.9371,Does this work with the xbox one?,"Yeah of course , but you must have an adapter to use this beautiful headset"
2,_1JHq4wBZBEtYQQvSBd0,1238.6775,will these work with xbox one?,Yes
3,JFJFq4wBZBEtYQQvohYy,1136.2794,Do they work with xbox one?,"No they don't , but let's hope that Microsoft makes a wireless headset"
4,p1JIq4wBZBEtYQQvtBl3,1134.56,Do they work with Xbox One system?,If you have the Xbox controller stereo headset adapter then yes
5,MlJFq4wBZBEtYQQvsBYc,840.7012,Does this work with PC? If it does how?,Yes it does work on pc. I personally just plug it into the headphone jack on my pc. If you want to use the extensi...
6,FlJIq4wBZBEtYQQvOxl0,823.2865,Does it work with galaxy s7?,"Yes, I've used this with the Galaxy S7 and j7. Hands-free dialing from the phone book also works very well."
7,nVJHq4wBZBEtYQQv0hhi,806.4542,Does this work with a Blackberry phone?,"You need a 2.5 adapter with it. ,"
8,DlJIq4wBZBEtYQQvNBne,775.08484,does this work with a kindle fire hd 8?,Hi! I believe so. I use it with Samsung devices and it works well. So it should work.
9,G1JIq4wBZBEtYQQvPxmS,759.9503,Does the M70 work with Android phones?,"Hello, Yes, I think so, thank you."


### 12. Observe The Results and Refine

Congratulations, you've now explored the possibilities of text search on the data in OpenSearch.

If you take a look at the results above, you'll notice that the results match one or more of the keywords from our question, most commonly the words "work" and "xbox". You'll also notice that a lot of these results aren't relevant to our original question, such as "Does this work with PS4" and "Does it work for android?". In Module 3, we'll instead use semantic search to make the results more relevant.

### Store Variables Used for the Next Notebook

You will need a few values in the next notebook. Execute the cell below to store them so they can be reused in the next part of the exercise.

In [None]:
%store outputs
%store bucket
%store aos_host