## B. Text Embedding
This task uses `instructor-large` to find phrases similar to the hotwords detected in 5a. It adds a boolean `similarity` column to the DataFrame, indicating whether each sentence is similar to a hotword sentence.

In [1]:
# Import necessary libraries
from transformers import AutoTokenizer, AutoModel
import pandas as pd
import torch
import numpy as np

In [None]:
# Load model and tokenizer
# model_name = "hkunlp/instructor-large"
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModel.from_pretrained(model_name)

from InstructorEmbedding import INSTRUCTOR
model = INSTRUCTOR('hkunlp/instructor-large')

This section extracts the sentences that contain a hotword. They serve as the reference set when computing cosine similarity with `INSTRUCTOR` against the rest of the DataFrame.

In [None]:
import re
import pandas as pd

df = pd.read_csv('C:/Users/Clarence/Desktop/GitHub/technical-test/asr-train/updated_v2_cv-valid-dev.csv') 

#convert text to lowercase first
df['generated_text'] = df['generated_text'].str.lower()
df['finetuned_text'] = df['finetuned_text'].str.lower()
df.fillna('', inplace=True)

#Define hotwords and their regex patterns
hotwords = {
    "be careful": r"be\s*careful",
    "destroy": r"destroy",
    "stranger": r"stranger"
}

# Function to check for hotwords in text
def contains_hotword(text, patterns):
    for pattern in patterns.values():
        if re.search(pattern, text, re.IGNORECASE):
            return True
    return False

# Iterate through the DataFrame and collect the sentences in which a hotword is detected
hotword_sentences = []
for _, row in df.iterrows():
    if contains_hotword(row['finetuned_text'], hotwords):
        hotword_sentences.append(row['finetuned_text'])
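As a side note on the matching step above, the three patterns can also be joined into a single compiled regex so each sentence is scanned once. The following is a minimal sketch under that assumption; the `samples` list is hypothetical and stands in for the `finetuned_text` column.

```python
import re

# Same patterns as the hotword dictionary in the cell above.
hotwords = {
    "be careful": r"be\s*careful",
    "destroy": r"destroy",
    "stranger": r"stranger",
}

# Join the patterns into one alternation; each pattern is wrapped in a
# non-capturing group so the alternation cannot change its meaning.
combined = re.compile(
    "|".join(f"(?:{pattern})" for pattern in hotwords.values()),
    re.IGNORECASE,
)

# Hypothetical sample sentences, for illustration only.
samples = [
    "be careful with your prognostications said the stranger",
    "i thought that everything i owned would be destroyed",
    "the boy was surprised by his thoughts",
    "Becareful out there",
]

# Keep only the sentences in which at least one hotword pattern matches.
matched = [sentence for sentence in samples if combined.search(sentence)]
```

Because the patterns have no word boundaries, `destroy` also matches inside `destroyed`, and `be\s*careful` matches with or without whitespace between the two words.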

In [17]:
hotword_sentences

['be careful whit your prognostications said the stranger', 'i thought that everything i owned would be destroyed', 'the stranger seemed satisfied with the answer', 'i had to test your courage the stranger said', 'i had to test your corrage the stranger said', 'be careful with your proagnostications said the stranger', 'the stranger was speaking of things that very few people knew about', 'the stranger was speaking of things that very few people knew about', 'i had to test your courage the stranger said', 'the stranger seemed satisfied with the answer', 'the stranger was speaking of things that very few people knew about', "i don't like people to do that because the sheep are afraid of strangers", "the stranger withdrew the sword from the boy's forehead and the boy felt immensely relieved", 'i had to test your courage the stranger said', 'i had to test your courage the stranger said']


## Running Similarity
Following the `INSTRUCTOR` documentation [[1]](https://github.com/xlang-ai/instructor-embedding), cosine similarity is computed between the hotword sentences and every sentence in the DataFrame (including the hotword sentences themselves). The similarity threshold was first set at 0.999 as a reference point and later lowered to 0.87 (see Observations).
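For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a plain-Python illustration of that formula, not the scikit-learn implementation used in this notebook; the function name `cosine` is chosen here for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Vectors pointing in the same direction score 1, orthogonal vectors score 0.
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
```

Since embeddings are stored as floating-point numbers, identical sentences can score marginally above or below 1 (for example `1.0000002`) because of rounding.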

In [24]:
# Function to calculate text embeddings
def get_embedding(text):
    embeddings = model.encode(text)
    
    return embeddings
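The `INSTRUCTOR` README encodes inputs as `[instruction, sentence]` pairs, whereas `get_embedding` above passes the raw text. The sketch below only shows how such pairs could be built. The helper `build_instructor_inputs` and the instruction wording are assumptions for illustration, and whether an instruction would change the results in this task has not been tested here.

```python
def build_instructor_inputs(sentences, instruction):
    """Pair every sentence with the same instruction string (hypothetical helper)."""
    return [[instruction, sentence] for sentence in sentences]

# Assumed instruction text; it is not taken from this notebook.
instruction = "Represent the sentence for retrieving similar sentences:"

pairs = build_instructor_inputs(
    [
        "the stranger seemed satisfied with the answer",
        "i had to test your courage the stranger said",
    ],
    instruction,
)

# With the model loaded as above, the pairs would be passed on like this:
# embeddings = model.encode(pairs)
```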

In [25]:
#reference embedding tensor for comparison of cosine similarity against sentences in DataFrame
hotword_embeddings = get_embedding(hotword_sentences)

In [28]:
#testing it out
from sklearn.metrics.pairwise import cosine_similarity
similarity = cosine_similarity(hotword_embeddings, hotword_embeddings) #test

In [58]:
similarity

array([[0.9999999 , 0.8266499 , 0.87050796, 0.8903919 , 0.8858587 ,
        0.9952303 , 0.8990371 , 0.8990371 , 0.8903919 , 0.87050796,
        0.8990371 , 0.8255396 , 0.83485675, 0.89039195, 0.89039195],
       [0.8266499 , 1.0000002 , 0.80499256, 0.83657837, 0.8354285 ,
        0.83064413, 0.8207717 , 0.8207717 , 0.83657837, 0.80499256,
        0.8207717 , 0.8044783 , 0.8110869 , 0.8365785 , 0.8365785 ],
       [0.87050796, 0.80499256, 1.0000002 , 0.8845076 , 0.8920334 ,
        0.8703649 , 0.8823137 , 0.8823137 , 0.8845076 , 1.0000002 ,
        0.8823137 , 0.80607855, 0.8875704 , 0.88450754, 0.88450754],
       [0.8903919 , 0.83657837, 0.8845076 , 0.99999994, 0.9691083 ,
        0.89021873, 0.8917724 , 0.8917724 , 0.99999994, 0.8845076 ,
        0.8917724 , 0.8343547 , 0.84707516, 0.9999998 , 0.9999998 ],
       [0.8858587 , 0.8354285 , 0.8920334 , 0.9691083 , 1.        ,
        0.88466245, 0.8862458 , 0.8862458 , 0.9691083 , 0.8920334 ,
        0.8862458 , 0.829512  , 0.843199  , 

In [69]:
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np 

# Function to check similarity.
# Relies on get_embedding() (which calls model.encode()) and the precomputed hotword_embeddings.
def is_similar(text, threshold=0.87):
    text_embedding = get_embedding(text).reshape(1, -1)
    cos_sim_scores = cosine_similarity(text_embedding, hotword_embeddings)
    # print(cos_sim_scores)
    
    # Check if any similarity score meets the threshold
    return np.any(cos_sim_scores >= threshold)

In [73]:
# trial with the sentence at index 13
similarity_2 = is_similar(df['finetuned_text'][13])

[[0.81524837 0.81018466 0.805854   0.8037375  0.813777   0.81189054
  0.8798793  0.8798793  0.8037375  0.805854   0.8798793  0.8150883
  0.7821808  0.8037375  0.8037375 ]]


In [74]:
similarity_2

True
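A boolean hides which hotword sentence caused the match. As a sketch, the score row printed above for index 13 can be inspected directly to find the best-scoring reference sentence; the variable names are illustrative only.

```python
# Score row printed above for the sentence at index 13,
# one score per entry of hotword_sentences.
scores = [
    0.81524837, 0.81018466, 0.805854, 0.8037375, 0.813777, 0.81189054,
    0.8798793, 0.8798793, 0.8037375, 0.805854, 0.8798793, 0.8150883,
    0.7821808, 0.8037375, 0.8037375,
]
threshold = 0.87

# Position of the highest score (the first one if there are ties).
best_index = max(range(len(scores)), key=scores.__getitem__)
best_score = scores[best_index]
is_match = best_score >= threshold
```

Here the best score is 0.8798793 at position 6 of `hotword_sentences` ('the stranger was speaking of things that very few people knew about'), which only just clears the 0.87 threshold.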

In [62]:
# Calculate similarity for each text and attach the result as a new column
df['similarity'] = df['finetuned_text'].apply(is_similar)
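Calling `is_similar` row by row encodes one sentence at a time. Since `model.encode` accepts a list (as in the `hotword_embeddings` cell), one alternative is to encode the whole column in a single call and reduce the resulting score matrix row by row. The sketch below shows only that reduction step in plain Python, on hypothetical scores; `flag_rows` is an illustrative name.

```python
def flag_rows(score_matrix, threshold=0.87):
    """Return one boolean per row: True if any score in the row meets the threshold."""
    return [any(score >= threshold for score in row) for row in score_matrix]

# Hypothetical scores: 3 sentences (rows) against 2 hotword sentences (columns).
score_matrix = [
    [0.99, 0.82],
    [0.81, 0.80],
    [0.86, 0.88],
]

flags = flag_rows(score_matrix)
```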


In [53]:
# Display the DataFrame
print(df.head())

                         filename  \
0  cv-valid-dev/sample-000000.mp3   
1  cv-valid-dev/sample-000001.mp3   
2  cv-valid-dev/sample-000002.mp3   
3  cv-valid-dev/sample-000003.mp3   
4  cv-valid-dev/sample-000004.mp3   

                                                text  up_votes  down_votes  \
0  be careful with your prognostications said the...         1           0   
1  then why should they be surprised when they se...         2           0   
2  a young arab also loaded down with baggage ent...         2           0   
3  i thought that everything i owned would be des...         3           0   
4  he moved about invisible but everyone could he...         1           0   

        age  gender   accent duration  \
0                                       
1                                       
2                                       
3                                       
4  fourties  female  england            

                                      generated_text  \
0  be

In [72]:
df['similarity'][0:20]

0      True
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13     True
14    False
15    False
16    False
17    False
18    False
19    False
Name: similarity, dtype: bool

## Observations
- Iteration 1 (threshold = 0.8): returned True for every row in a 100-sample run (too permissive for detecting similarity).
- Iteration 2 (threshold = 0.999): mostly False; True only for sentences that are near-identical to a hotword sentence (hotword-to-hotword similarity).
- Iteration 3 (threshold = 0.87): returned some sentences with similar embeddings (e.g. entry 13).

While the threshold can be adjusted manually, the similarity matrices are large and their values depend on the `INSTRUCTOR` model. For this task, the threshold was chosen heuristically by comparing hotword-to-hotword similarity scores against hotword-to-sentence similarity scores.
A more detailed treatment of `similarity` should include plots of both score distributions and/or summary statistics, to better determine the threshold at which sentences count as having similar embeddings.
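As a rough sketch of the summary-statistics idea, the two score rows printed earlier can be compared directly: row 0 of the hotword-to-hotword matrix (with the self-similarity entry removed) and the score row for the sentence at index 13. Two rows are far too few to set a threshold; the helper names are illustrative only.

```python
import statistics

# Row 0 of the hotword-to-hotword matrix above, self-similarity removed.
hotword_row = [
    0.8266499, 0.87050796, 0.8903919, 0.8858587, 0.9952303, 0.8990371,
    0.8990371, 0.8903919, 0.87050796, 0.8990371, 0.8255396, 0.83485675,
    0.89039195, 0.89039195,
]

# Score row printed above for the sentence at index 13.
candidate_row = [
    0.81524837, 0.81018466, 0.805854, 0.8037375, 0.813777, 0.81189054,
    0.8798793, 0.8798793, 0.8037375, 0.805854, 0.8798793, 0.8150883,
    0.7821808, 0.8037375, 0.8037375,
]

def summarize(values):
    """Minimum, median and maximum of a list of scores."""
    return {
        "min": min(values),
        "median": statistics.median(values),
        "max": max(values),
    }

def share_at_or_above(values, threshold):
    """Fraction of scores that meet the threshold."""
    return sum(value >= threshold for value in values) / len(values)

hotword_summary = summarize(hotword_row)
candidate_summary = summarize(candidate_row)
hotword_share = share_at_or_above(hotword_row, 0.87)
candidate_share = share_at_or_above(candidate_row, 0.87)
```

At 0.87, 11 of the 14 hotword-to-hotword scores pass, against 3 of the 15 scores for the index-13 sentence. At 0.8, all but one of the index-13 scores pass, which is consistent with Iteration 1 flagging every row.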

In [75]:
# Save the updated DataFrame to CSV
df.to_csv('updated_v3_cv-valid-dev.csv', index=False)