## HTX xData Test cv-hotword-similarity-5b Python Notebook

In [1]:
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')


load INSTRUCTOR_Transformer
max_seq_length  512


In [39]:
import pandas as pd
import os

cv_dev_metadata = pd.read_csv(os.path.join('..', 'asr-train', 'cv-valid-dev.csv'))
cv_dev_metadata['finetuned_text'] = cv_dev_metadata['finetuned_text'].astype(str)

cv_dev_metadata.head(5)

| | filename | text | up_votes | down_votes | age | gender | accent | duration | generated_text | finetuned_text |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cv-valid-dev/sample-000000.mp3 | be careful with your prognostications said the... | 1 | 0 | | | | | BE CAREFUL WITH YOUR PROGNOSTICATIONS SAID THE... | BE CAREFUL WITH YOUR PROGNOSTICATIONS SAID THE... |
| 1 | cv-valid-dev/sample-000001.mp3 | then why should they be surprised when they se... | 2 | 0 | | | | | THEN WHY SHOULD THEY BE SURPRISED WHEN THEY SE... | THEN WHY SHOULD THEY BE SURPRISED WHEN THEY SE... |
| 2 | cv-valid-dev/sample-000002.mp3 | a young arab also loaded down with baggage ent... | 2 | 0 | | | | | A YOUNG ARAB ALSO LOADED DOWN WITH BAGGAGE ENT... | A YOUNG ARAB ALSO LOADED DOWN WITH BAGGAGE ENT... |
| 3 | cv-valid-dev/sample-000003.mp3 | i thought that everything i owned would be des... | 3 | 0 | | | | | I FELT THAT EVERYTHING I OWNED WOULD BE DESTROYED | I THOUGHT THAT EVERYTHING I OWNED WOULD BE DES... |
| 4 | cv-valid-dev/sample-000004.mp3 | he moved about invisible but everyone could he... | 1 | 0 | fourties | female | england | | HE MOVED ABOUT INVISIBLE BUT EVERY ONE COULD H... | HE MOVED ABOUT INVISIBLE BUT EVERYONE COULD HE... |


Encode each hotword and each transcript as an embedding, then use cosine similarity to score how closely each sentence matches each hotword phrase.

In [40]:
from sklearn.metrics.pairwise import cosine_similarity

hotword_list = ["destroy", "be careful", "stranger"]

# The transcripts are all uppercase. Capitalize them first, because the encodings
# behave differently for all-uppercase input and the provided examples use capitalized text.
sentences_a = [['Represent the sentence to match: ', s.capitalize()] for s in cv_dev_metadata["finetuned_text"]]
sentences_b = [['Represent the phrase to find: ', hotword] for hotword in hotword_list]
embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)
# Rows are sentences, columns are hotwords (same order as hotword_list).
similarities = cosine_similarity(embeddings_a, embeddings_b)

print(similarities)
print(similarities[0])

[[0.72322273 0.8924132  0.8676585 ]
 [0.7357803  0.79125535 0.80064064]
 [0.7337803  0.7572438  0.79558104]
 ...
 [0.7480864  0.8030087  0.7612227 ]
 [0.7127923  0.7348097  0.75321025]
 [0.73576105 0.74254346 0.7491277 ]]
[0.72322273 0.8924132  0.8676585 ]
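
To make the shape of the score matrix concrete, here is a minimal sketch that computes cosine similarity by hand on toy vectors. The 2-D vectors and the helper names `cosine` and `cosine_matrix` are made up for illustration and are not part of the notebook; the real embeddings have far more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_matrix(rows, cols):
    """One row per sentence vector, one column per hotword vector."""
    return [[cosine(r, c) for c in cols] for r in rows]

# Hypothetical stand-ins for sentence and hotword embeddings.
toy_sentences = [[1.0, 0.0], [1.0, 1.0]]
toy_hotwords = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

toy_similarities = cosine_matrix(toy_sentences, toy_hotwords)
for row in toy_similarities:
    print([round(score, 3) for score in row])
```

As in the notebook output, entry `[i][j]` is the score of sentence `i` against hotword `j`, so two sentences and three hotwords give a 2 x 3 matrix.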


From cv-hotword-5a, we know that some samples in the dataset contain phrases that exactly match the hotwords. We can label these samples as similar and use their scores to estimate, for each hotword, the minimum similarity score a sample needs in order to be considered similar.

In [42]:
min_similarity_score = {}

for idx, hotword in enumerate(hotword_list):
    # Rows whose transcript contains the hotword verbatim (the transcripts are uppercase).
    # A positional boolean mask avoids relying on the DataFrame index labels matching row positions.
    exact_match = cv_dev_metadata['finetuned_text'].str.contains(hotword.upper(), regex=False, na=False).to_numpy()
    # The lowest score among the exact matches becomes the threshold for this hotword.
    min_similarity_score[hotword] = similarities[exact_match, idx].min()

print(min_similarity_score)

{'destroy': 0.85274625, 'be careful': 0.8924132, 'stranger': 0.81990176}
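
The thresholding idea can be sketched on toy data for a single hotword. The transcripts and scores below are hypothetical, chosen only to show how the minimum is taken and what happens when a hotword has no exact match.

```python
# Toy data: hypothetical transcripts and hypothetical scores for ONE hotword.
hotword = "stranger"
toy_texts = [
    "THE STRANGER SPOKE FIRST",
    "STRANGE IMAGES PASSED BY",
    "AFRAID OF STRANGERS",
    "THE BOY WENT HOME",
]
toy_scores = [0.90, 0.84, 0.82, 0.70]

# Substring match, as in the notebook: "STRANGERS" also counts as an exact match.
exact_scores = [score for text, score in zip(toy_texts, toy_scores)
                if hotword.upper() in text]

# min() raises ValueError on an empty list, so guard against a hotword
# that never appears verbatim in the transcripts.
threshold = min(exact_scores) if exact_scores else None
print(threshold)
```

In this toy case the threshold comes from the weakest exact match (0.82), so the "STRANGE IMAGES" row at 0.84 would later be flagged even though it does not contain the hotword itself.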


We then iterate through all the similarity scores and flag every entry whose score for at least one hotword is greater than or equal to that hotword's minimum exact-match score.

In [45]:
from IPython.display import display, HTML

# Flag a row as similar if its score for any hotword reaches that hotword's threshold.
boolean_list = []

for similarity in similarities:
    boolean_list.append(any(score >= min_similarity_score[hotword] for hotword, score in zip(hotword_list, similarity)))

cv_dev_metadata["similarity"] = boolean_list

similar_rows = cv_dev_metadata[cv_dev_metadata["similarity"]]
print("Number of similar entries: {}".format(len(similar_rows)))
display(HTML(similar_rows[["finetuned_text", "similarity"]].sample(10).to_html()))

Number of similar entries: 61


| | finetuned_text | similarity |
|---|---|---|
| 3662 | AND THE GIRL POINTED TO THE SOUTH INDICATING THAT IT WAS THERE THE STRANGE MAN LIVED | True |
| 3909 | THIS WAS THE STRANGEST OF ALL THINGS THAT EVER CAME TO EARTH FROM OUTER SPACE | True |
| 3507 | STRANGE IMAGES PASSED THROUGH MY MIND | True |
| 1080 | THE GUY THOUGHT HE WAS A LUNATIC AT LARGE AND MADE AN UNSUCCESSFUL ATTEMPT TO STOP HIM | True |
| 2453 | I DON'T LIKE PEOPLE TO DO THAT BECAUSE THE SHEEP ARE AFRAID OF STRANGERS | True |
| 3 | I THOUGHT THAT EVERYTHING I OWNED WOULD BE DESTROYED | True |
| 1036 | SANDRA READ ALOUD THE STRANGE EXCERT | True |
| 892 | HE DIDN'T KNOW THE MAN YET BUT HIS PRACTICED EYE WOULD RECOGNIZE HIM WHEN HE APPEARED | True |
| 2706 | STRANGE IMAGES PASSED THROUGH MY MIND | True |
| 3225 | HE DIDN'NT KNOW THE MAN YET BUT HIS PRACTICED EYE WOULD RECOGNIZE HIM WHEN HE APPEARED | True |
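
The sample mixes rows that match different hotwords, and the single boolean column does not say which threshold a row passed. One possible extension, sketched here as an assumption rather than as part of the notebook, is to keep the list of matching hotwords per row. The thresholds are rounded from the printed `min_similarity_score` output, and the score rows and the helper `matched_hotwords` are hypothetical.

```python
hotwords = ["destroy", "be careful", "stranger"]

# Thresholds rounded from the min_similarity_score output printed earlier.
thresholds = {"destroy": 0.853, "be careful": 0.892, "stranger": 0.820}

# Hypothetical score rows, one score per hotword in the order above.
toy_rows = [
    [0.72, 0.90, 0.80],
    [0.74, 0.79, 0.80],
    [0.86, 0.75, 0.83],
]

def matched_hotwords(row):
    """Return the hotwords whose threshold this row's scores reach."""
    return [h for h, score in zip(hotwords, row) if score >= thresholds[h]]

matches = [matched_hotwords(row) for row in toy_rows]
flags = [bool(m) for m in matches]  # same meaning as the boolean "similarity" column

print(matches)
print(flags)
```

Keeping the matched hotwords alongside the flag makes it easier to inspect rows that pass a threshold without containing any hotword.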


In [44]:
cv_dev_metadata.to_csv("cv-valid-dev.csv", index=False)