We use a pretrained language model to find words similar to the three original hotwords ("be careful", "destroy", "stranger"). Expanding the hotword dictionary this way should let the pipeline flag a broader set of relevant transcriptions.

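We chose Google's BERT (bert-base-uncased) because it is a strong general-purpose language model. Its fill-mask task predicts the most likely tokens for a [MASK] placeholder, so we insert [MASK] among the existing hotwords and treat the predictions as candidate hotwords. Moving [MASK] to a different position in the list yields different predictions, which enlarges the candidate vocabulary.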
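The masked queries used in the cells below can also be generated programmatically instead of being typed by hand. The following is a minimal sketch; `build_masked_queries` is a hypothetical helper that is not part of the original notebook, and it also produces a fourth query with [MASK] at the end, which the notebook does not use.

```python
def build_masked_queries(phrases, mask_token="[MASK]"):
    """Return one comma-separated query per possible position of the mask token."""
    queries = []
    for position in range(len(phrases) + 1):
        parts = list(phrases)              # copy so the input list is not modified
        parts.insert(position, mask_token)
        queries.append(", ".join(parts))
    return queries


# Hypothetical usage with the three original hotwords
queries = build_masked_queries(["be careful", "destroy", "stranger"])
for query in queries:
    print(query)
```

Each generated string could then be passed to the fill-mask pipeline in the same way as `query1`, `query2`, and `query3`.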

In [None]:
# Load BERT's fill-mask pipeline once; it is reused for every query below
from transformers import pipeline

unmasker = pipeline("fill-mask", model="google-bert/bert-base-uncased")

In [None]:
query1 = "be careful, destroy, [MASK], stranger"

outputs1 = unmasker(query1)
outputs1

In [None]:
query2 = "be careful, [MASK], destroy, stranger"

outputs2 = unmasker(query2)
outputs2

In [None]:
query3 = "[MASK], be careful, destroy, stranger"

outputs3 = unmasker(query3)
outputs3

In [7]:
outputs = []
outputs.extend([prediction['token_str'] for prediction in outputs1])
outputs.extend([prediction['token_str'] for prediction in outputs2])
outputs.extend([prediction['token_str'] for prediction in outputs3])

outputs

['stranger',
 'kill',
 'destroy',
 'protect',
 'danger',
 'destroy',
 'kill',
 'protect',
 'fight',
 'attack',
 'destroy',
 'kill',
 'please',
 'protect',
 'but']
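Each fill-mask prediction is a dictionary that carries a `score` alongside its `token_str`, so duplicates and low-confidence tokens could be filtered in one pass. The sketch below is one possible way to do this, not the approach the notebook takes; `collect_candidates`, the `min_score` threshold, and the sample scores are all made up for illustration, and a threshold is not guaranteed to drop the same words a manual review would.

```python
def collect_candidates(prediction_lists, min_score=0.0):
    """Flatten fill-mask outputs into unique tokens, keeping first-seen order."""
    seen = {}
    for predictions in prediction_lists:
        for prediction in predictions:
            token = prediction["token_str"].strip()
            if prediction["score"] >= min_score and token not in seen:
                seen[token] = prediction["score"]
    return list(seen)


# Made-up predictions for illustration only; real scores come from the model
sample_outputs1 = [
    {"token_str": "stranger", "score": 0.20},
    {"token_str": "kill", "score": 0.10},
    {"token_str": "but", "score": 0.01},
]
sample_outputs2 = [
    {"token_str": "kill", "score": 0.15},
    {"token_str": "fight", "score": 0.05},
]

candidates = collect_candidates([sample_outputs1, sample_outputs2], min_score=0.02)
print(candidates)
```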

In [None]:
PATH = 'hotword_inference.csv'

In [18]:
import pandas as pd

results_df = pd.read_csv(PATH) 
results_df['pred_str_lowercase'] = results_df['pred_str'].str.lower()
results_df

Unnamed: 0.1,Unnamed: 0,filename,text,pred_str,pred_str_lowercase
0,0,cv-valid-dev/sample-000000.mp3,BE CAREFUL WITH YOUR PROGNOSTICATIONS SAID THE...,BE CAREFUL WITH YOUR PROPMASTIGATIONS SAID THE...,be careful with your propmastigations said the...
1,1,cv-valid-dev/sample-000001.mp3,THEN WHY SHOULD THEY BE SURPRISED WHEN THEY SE...,THEN WHY SHOULD THEY BE SURPRISED WHEN THI SEE...,then why should they be surprised when thi see...
2,2,cv-valid-dev/sample-000002.mp3,A YOUNG ARAB ALSO LOADED DOWN WITH BAGGAGE ENT...,A YOUNG ARAB ALSO LOADED DOWN WITH BAGGAGE ENT...,a young arab also loaded down with baggage ent...
3,3,cv-valid-dev/sample-000003.mp3,I THOUGHT THAT EVERYTHING I OWNED WOULD BE DES...,I FELT THAT EVERYTHING I OWNED WOULD BE DESTROYED,i felt that everything i owned would be destroyed
4,4,cv-valid-dev/sample-000004.mp3,HE MOVED ABOUT INVISIBLE BUT EVERYONE COULD HE...,HE MOVED ABOUT INVISIBLE BUT EVERY ONE COULD H...,he moved about invisible but every one could h...
...,...,...,...,...,...
4071,4071,cv-valid-dev/sample-004071.mp3,BUT THEY COULD NEVER HAVE TAUGHT HIM ARABIC,BUT THEY COULD NEVER HAVE TAUGHT HIM ARABIC,but they could never have taught him arabic
4072,4072,cv-valid-dev/sample-004072.mp3,HE DECIDED TO CONCENTRATE ON MORE PRACTICAL MA...,HE DECIDED TO CONCENTRATE ON MORE PRACTICAL MA...,he decided to concentrate on more practical ma...
4073,4073,cv-valid-dev/sample-004073.mp3,THAT'S WHAT I'M NOT SUPPOSED TO SAY,THAT'S WHAT I'M NOT SUPPOSED TO SAY,that's what i'm not supposed to say
4074,4074,cv-valid-dev/sample-004074.mp3,JUST HANDLING THEM MADE HIM FEEL BETTER,JUST ANDILY PO BAD HIM FEEL PICTURE,just andily po bad him feel picture


In [19]:
# Define hotwords
hotwords = ["be careful", "destroy", "stranger"]

In [None]:
# Combine the original hotwords with the BERT predictions, casting each entry to str
hotwords_list = [str(word) for word in hotwords + outputs]
hotwords_list

In [None]:
# Manual review: "but" is not related to the other hotwords, so remove it
word_to_remove = 'but'

# Keep every word except the one to remove
hotwords_list = [word for word in hotwords_list if word != word_to_remove]
hotwords_list

In [22]:
# Remove duplicate words (note: set() does not preserve the original order)
hotwords_list = list(set(hotwords_list))
hotwords_list

['kill',
 'attack',
 'danger',
 'please',
 'be careful',
 'protect',
 'stranger',
 'fight',
 'destroy']

In [23]:
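def contains_hotword(text):
    """Return 1 if any hotword occurs in text as a case-insensitive substring, else 0."""
    lowered = str(text).lower()
    for hotword in hotwords_list:
        if hotword.lower() in lowered:
            return 1
    return 0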
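Because `contains_hotword` matches substrings, a hotword such as "fight" also matches inside longer words such as "firefighters". If that is not desired, one option is to match whole words only. The sketch below uses Python's `re` module; `contains_hotword_whole_word` and `example_hotwords` are hypothetical names that are not part of the original notebook. The trade-off is that inflected forms such as "attacks" no longer match "attack".

```python
import re


def contains_hotword_whole_word(text, hotwords):
    """Return 1 if any hotword appears in text as a whole word or phrase, else 0."""
    lowered = str(text).lower()
    for hotword in hotwords:
        # \b anchors the match at word boundaries; re.escape guards special characters
        pattern = r"\b" + re.escape(hotword.lower()) + r"\b"
        if re.search(pattern, lowered):
            return 1
    return 0


# Hypothetical subset of the hotword list, for illustration
example_hotwords = ["be careful", "fight", "attack"]

print(contains_hotword_whole_word("the firefighters arrived", example_hotwords))
print(contains_hotword_whole_word("don't fight club", example_hotwords))
```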

In [24]:
# Create a new column 'similarity': 1 if the transcription contains a hotword, else 0

results_df['similarity'] = results_df['pred_str_lowercase'].apply(contains_hotword)
results_df

Unnamed: 0.1,Unnamed: 0,filename,text,pred_str,pred_str_lowercase,similarity
0,0,cv-valid-dev/sample-000000.mp3,BE CAREFUL WITH YOUR PROGNOSTICATIONS SAID THE...,BE CAREFUL WITH YOUR PROPMASTIGATIONS SAID THE...,be careful with your propmastigations said the...,1
1,1,cv-valid-dev/sample-000001.mp3,THEN WHY SHOULD THEY BE SURPRISED WHEN THEY SE...,THEN WHY SHOULD THEY BE SURPRISED WHEN THI SEE...,then why should they be surprised when thi see...,0
2,2,cv-valid-dev/sample-000002.mp3,A YOUNG ARAB ALSO LOADED DOWN WITH BAGGAGE ENT...,A YOUNG ARAB ALSO LOADED DOWN WITH BAGGAGE ENT...,a young arab also loaded down with baggage ent...,0
3,3,cv-valid-dev/sample-000003.mp3,I THOUGHT THAT EVERYTHING I OWNED WOULD BE DES...,I FELT THAT EVERYTHING I OWNED WOULD BE DESTROYED,i felt that everything i owned would be destroyed,1
4,4,cv-valid-dev/sample-000004.mp3,HE MOVED ABOUT INVISIBLE BUT EVERYONE COULD HE...,HE MOVED ABOUT INVISIBLE BUT EVERY ONE COULD H...,he moved about invisible but every one could h...,0
...,...,...,...,...,...,...
4071,4071,cv-valid-dev/sample-004071.mp3,BUT THEY COULD NEVER HAVE TAUGHT HIM ARABIC,BUT THEY COULD NEVER HAVE TAUGHT HIM ARABIC,but they could never have taught him arabic,0
4072,4072,cv-valid-dev/sample-004072.mp3,HE DECIDED TO CONCENTRATE ON MORE PRACTICAL MA...,HE DECIDED TO CONCENTRATE ON MORE PRACTICAL MA...,he decided to concentrate on more practical ma...,0
4073,4073,cv-valid-dev/sample-004073.mp3,THAT'S WHAT I'M NOT SUPPOSED TO SAY,THAT'S WHAT I'M NOT SUPPOSED TO SAY,that's what i'm not supposed to say,0
4074,4074,cv-valid-dev/sample-004074.mp3,JUST HANDLING THEM MADE HIM FEEL BETTER,JUST ANDILY PO BAD HIM FEEL PICTURE,just andily po bad him feel picture,0


In [27]:
# Sanity check: show only the rows flagged as containing a hotword
results_df[results_df['similarity'] == 1]

Unnamed: 0.1,Unnamed: 0,filename,text,pred_str,pred_str_lowercase,similarity
0,0,cv-valid-dev/sample-000000.mp3,BE CAREFUL WITH YOUR PROGNOSTICATIONS SAID THE...,BE CAREFUL WITH YOUR PROPMASTIGATIONS SAID THE...,be careful with your propmastigations said the...,1
3,3,cv-valid-dev/sample-000003.mp3,I THOUGHT THAT EVERYTHING I OWNED WOULD BE DES...,I FELT THAT EVERYTHING I OWNED WOULD BE DESTROYED,i felt that everything i owned would be destroyed,1
55,55,cv-valid-dev/sample-000055.mp3,NO ONE ATTACKS AN OASIS,NO ONE ATTACKS AN OASIS,no one attacks an oasis,1
89,89,cv-valid-dev/sample-000089.mp3,THE STRANGER SEEMED SATISFIED WITH THE ANSWER,THE STRANGER SEEMED SATISFIED WIT THE ANSWER,the stranger seemed satisfied wit the answer,1
159,159,cv-valid-dev/sample-000159.mp3,THE FIRST RULE OF DON'T FIGHT CLUB IS LET'S TA...,THE FIRST RULE OF DON'T FIGHT CLUB IS LET'S TA...,the first rule of don't fight club is let's ta...,1
...,...,...,...,...,...,...
3535,3535,cv-valid-dev/sample-003535.mp3,YOU SHOULD SEE THE OTHER GUY SPECIFICALLY HOW ...,YOU SHOULD SEE THE OTHER GAY SPECIFICALLY HOW ...,you should see the other gay specifically how ...,1
3553,3553,cv-valid-dev/sample-003553.mp3,HE ASKED IT PLEASE NEVER TO STOP SPEAKING TO HIM,HE ASKED IT PLEASE NEVER TO STOP SPEAKING TO HIM,he asked it please never to stop speaking to him,1
3808,3808,cv-valid-dev/sample-003808.mp3,I HAD TO TEST YOUR COURAGE THE STRANGER SAID,I HAD TO TEST YOUR COURAGE THE STRANGER SAID,i had to test your courage the stranger said,1
3817,3817,cv-valid-dev/sample-003817.mp3,TWO WOMEN WERE STILL MISSING WHEN THE FIREFIGH...,TWO WOMEN WERE STILL MISSING WHEN THE FIRE FIG...,two women were still missing when the fire fig...,1
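When checking the flagged rows, it can help to see which hotword triggered each match, since the `similarity` column only records that some hotword was found. The sketch below is a hypothetical helper (`matched_hotwords` is not part of the original notebook) that uses the same case-insensitive substring rule as `contains_hotword`; the hotword list is copied from the deduplicated output shown earlier.

```python
def matched_hotwords(text, hotwords):
    """Return every hotword that occurs in text as a case-insensitive substring."""
    lowered = str(text).lower()
    return [hotword for hotword in hotwords if hotword.lower() in lowered]


# Hotword list copied from the deduplicated output shown earlier
expanded_hotwords = ['kill', 'attack', 'danger', 'please', 'be careful',
                     'protect', 'stranger', 'fight', 'destroy']

print(matched_hotwords("no one attacks an oasis", expanded_hotwords))
print(matched_hotwords("i felt that everything i owned would be destroyed", expanded_hotwords))
```

Such a function could be applied to `pred_str_lowercase` to add a column that lists the matching hotwords per row.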


In [28]:
# Get filenames of rows with hotwords
detected_files = results_df[results_df['similarity'] == 1]['filename'] 
detected_files

0       cv-valid-dev/sample-000000.mp3
3       cv-valid-dev/sample-000003.mp3
55      cv-valid-dev/sample-000055.mp3
89      cv-valid-dev/sample-000089.mp3
159     cv-valid-dev/sample-000159.mp3
                     ...              
3535    cv-valid-dev/sample-003535.mp3
3553    cv-valid-dev/sample-003553.mp3
3808    cv-valid-dev/sample-003808.mp3
3817    cv-valid-dev/sample-003817.mp3
3861    cv-valid-dev/sample-003861.mp3
Name: filename, Length: 65, dtype: object

In [None]:
# Save the detected filenames to similar-hotwords.txt, one per line
with open("similar-hotwords.txt", "w") as f:
    for filename in detected_files:
        f.write(f"{filename}\n")