# NOTE! NEW ENVIRONMENT REQUIRED
In Task 5B, we'll use an embeddings model provided by the InstructorEmbedding library. This library and its requirements conflict with the dependencies for the other tasks, so we will use a separate environment for this task. Follow the instructions below:
1. Create a new environment with Python 3.10: `conda create --name myenv2 python=3.10`
2. Activate the environment: `conda activate myenv2`
3. Install the following packages from conda-forge:  
`conda install -c conda-forge huggingface_hub=0.11.1 sentence-transformers==2.2.2 transformers==4.20.0 InstructorEmbedding pandas scikit-learn`
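
After installation, it can help to confirm that the new environment actually holds the pinned versions. The following is a minimal sketch rather than part of the task: the `EXPECTED` mapping and the `check_environment` helper are names introduced here for illustration, and the pins are copied from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the conda install command; None means no version was pinned.
EXPECTED = {
    "huggingface_hub": "0.11.1",
    "sentence-transformers": "2.2.2",
    "transformers": "4.20.0",
    "InstructorEmbedding": None,
    "pandas": None,
    "scikit-learn": None,
}


def check_environment(expected):
    """Return a {package: status} report for the active Python environment."""
    report = {}
    for name, pinned in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "missing"
            continue
        if pinned is None or installed == pinned:
            report[name] = f"ok ({installed})"
        else:
            report[name] = f"mismatch (installed {installed}, expected {pinned})"
    return report


if __name__ == "__main__":
    for package, status in check_environment(EXPECTED).items():
        print(f"{package}: {status}")
```

Run it with the `myenv2` environment active; any line reporting `missing` or `mismatch` points at a package to reinstall.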

### Note
If you run into installation issues with the `tokenizers` package, make sure that Rust is installed on your device. Follow the steps below to install it:
1. Run the following command: `curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh`
2. Add Rust to your environment path: `source $HOME/.cargo/env`
3. Check that Rust is installed: `rustc --version`

# Task 5B
We will use an embeddings model to find phrases similar to the three hot words specified in the task: "be careful", "destroy", and "stranger".

In [3]:
import pandas as pd
from InstructorEmbedding import INSTRUCTOR
from sklearn.metrics.pairwise import cosine_similarity


In [7]:
# Load the pre-trained embedding model
model = INSTRUCTOR('hkunlp/instructor-large')

# Define hot words with a task-specific instruction
hot_words = [
    ["Represent the hotword for similarity", "be careful"],
    ["Represent the hotword for similarity", "destroy"],
    ["Represent the hotword for similarity", "stranger"]
]

# Embed the hot words
hotword_embeddings = model.encode(hot_words)

# Load cv-valid-dev.csv
cv_valid_dev = pd.read_csv('../data/common_voice/cv-valid-dev.csv')

# Define a function to compute the similarity for each text entry
def compute_similarity(row_text):
    instruction_text = [["Represent the text for similarity", row_text]]
    text_embedding = model.encode(instruction_text)
    # Calculate cosine similarity with hotword embeddings
    similarities = cosine_similarity(text_embedding, hotword_embeddings)
    return similarities.max()  # Return the highest similarity score

# Apply similarity computation
cv_valid_dev['similarity_score'] = cv_valid_dev['text'].astype(str).apply(compute_similarity)


load INSTRUCTOR_Transformer
max_seq_length  512
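
The cell above calls `model.encode` once per transcript, which can be slow when applied to every row. A minimal sketch of a batched alternative follows. It is an illustration under stated assumptions, not the method used in this notebook: `encode_fn` is a hypothetical stand-in for a callable such as `model.encode`, the helper names `cosine` and `max_similarity_scores` are introduced here, and the toy encoder exists only so that the example runs without downloading the model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def max_similarity_scores(texts, hotword_vectors, encode_fn,
                          instruction="Represent the text for similarity",
                          batch_size=32):
    """Encode texts in batches and return each text's highest hot-word similarity."""
    scores = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Same [instruction, text] pair format as in compute_similarity
        pairs = [[instruction, text] for text in batch]
        for vector in encode_fn(pairs):
            scores.append(max(cosine(vector, hot) for hot in hotword_vectors))
    return scores


# Toy encoder (assumption for illustration only): two-dimensional fake embeddings.
def toy_encode(pairs):
    return [[1.0, 0.0] if "careful" in text else [0.0, 1.0] for _, text in pairs]


toy_scores = max_similarity_scores(
    ["be careful", "hello there", "careful now"],
    hotword_vectors=[[1.0, 0.0]],
    encode_fn=toy_encode,
    batch_size=2,
)
print(toy_scores)
```

In the notebook, passing `model.encode` as `encode_fn` and `hotword_embeddings` as `hotword_vectors` would be expected to reproduce the scores from `compute_similarity` up to floating-point differences, although that has not been verified here.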



In [8]:
# Classify similarity based on a threshold
threshold = 0.85
cv_valid_dev['similarity'] = cv_valid_dev['similarity_score'] >= threshold

# Save the updated DataFrame to a CSV
cv_valid_dev.to_csv('similarity.csv', index=False)
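
The cutoff of 0.85 is a judgment call, and the number of flagged rows depends heavily on it. As a rough illustration, the sketch below counts how many scores clear several candidate thresholds. The helper `count_flagged` and the candidate thresholds are assumptions introduced here; the five scores are the `similarity_score` values displayed by `similar.head()` later in this notebook.

```python
def count_flagged(scores, thresholds):
    """Return {threshold: number of scores greater than or equal to it}."""
    return {t: sum(score >= t for score in scores) for t in thresholds}


# similarity_score values of the first five rows of similarity.csv
head_scores = [0.872691, 0.823436, 0.785186, 0.817373, 0.754095]

counts = count_flagged(head_scores, [0.75, 0.80, 0.85, 0.90])
for threshold, flagged in counts.items():
    print(f"threshold {threshold:.2f}: {flagged} of {len(head_scores)} rows flagged")
```

Among these five rows, only the first clears 0.85. `sample-000003`, which is listed in `detected.txt`, scores 0.817373 and would only be flagged at a lower cutoff.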

### Check Through the Similarity Data

In [4]:
detected = pd.read_csv('detected.txt')
detected

| | cv-valid-dev/sample-000000.mp3 |
|---|---|
| 0 | cv-valid-dev/sample-000003.mp3 |
| 1 | cv-valid-dev/sample-000089.mp3 |
| 2 | cv-valid-dev/sample-000508.mp3 |
| 3 | cv-valid-dev/sample-000674.mp3 |
| 4 | cv-valid-dev/sample-001093.mp3 |
| 5 | cv-valid-dev/sample-001101.mp3 |
| 6 | cv-valid-dev/sample-001243.mp3 |
| 7 | cv-valid-dev/sample-001501.mp3 |
| 8 | cv-valid-dev/sample-001933.mp3 |
| 9 | cv-valid-dev/sample-002405.mp3 |


In [5]:
similar = pd.read_csv('similarity.csv')
similar.head()

| | filename | text | up_votes | down_votes | age | gender | accent | duration | similarity_score | similarity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | cv-valid-dev/sample-000000.mp3 | be careful with your prognostications said the... | 1 | 0 | | | | | 0.872691 | True |
| 1 | cv-valid-dev/sample-000001.mp3 | then why should they be surprised when they se... | 2 | 0 | | | | | 0.823436 | False |
| 2 | cv-valid-dev/sample-000002.mp3 | a young arab also loaded down with baggage ent... | 2 | 0 | | | | | 0.785186 | False |
| 3 | cv-valid-dev/sample-000003.mp3 | i thought that everything i owned would be des... | 3 | 0 | | | | | 0.817373 | False |
| 4 | cv-valid-dev/sample-000004.mp3 | he moved about invisible but everyone could he... | 1 | 0 | fourties | female | england | | 0.754095 | False |


In [7]:
similar.loc[similar['similarity'], 'filename']

0       cv-valid-dev/sample-000000.mp3
89      cv-valid-dev/sample-000089.mp3
508     cv-valid-dev/sample-000508.mp3
579     cv-valid-dev/sample-000579.mp3
674     cv-valid-dev/sample-000674.mp3
1093    cv-valid-dev/sample-001093.mp3
1101    cv-valid-dev/sample-001101.mp3
1115    cv-valid-dev/sample-001115.mp3
1243    cv-valid-dev/sample-001243.mp3
1501    cv-valid-dev/sample-001501.mp3
1507    cv-valid-dev/sample-001507.mp3
1717    cv-valid-dev/sample-001717.mp3
1781    cv-valid-dev/sample-001781.mp3
1828    cv-valid-dev/sample-001828.mp3
1933    cv-valid-dev/sample-001933.mp3
1978    cv-valid-dev/sample-001978.mp3
2104    cv-valid-dev/sample-002104.mp3
2120    cv-valid-dev/sample-002120.mp3
2405    cv-valid-dev/sample-002405.mp3
2410    cv-valid-dev/sample-002410.mp3
2432    cv-valid-dev/sample-002432.mp3
3127    cv-valid-dev/sample-003127.mp3
3219    cv-valid-dev/sample-003219.mp3
3245    cv-valid-dev/sample-003245.mp3
3344    cv-valid-dev/sample-003344.mp3
3808    cv-valid-dev/samp

Comparing the two files, we observe the following:
- 12 of the 15 files in `detected.txt` were also flagged in the similarity data.
- 3 of the 15 files in `detected.txt` were not flagged in the similarity data.
- 16 additional files, not listed in `detected.txt`, were flagged in the similarity data. A look through some of them shows that they contain words similar in meaning to the hot words, which is likely why they were flagged.
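
A comparison like this can be made with set operations on the two lists of filenames. The sketch below is one possible way to do it, not necessarily how the counts above were produced: `compare_detections` is a helper introduced here, and the sample inputs are only the first few filenames visible in the outputs above, not the full files.

```python
def compare_detections(detected, similar):
    """Split filenames into those in both lists, only in detected, and only in similar."""
    detected_set, similar_set = set(detected), set(similar)
    return {
        "in_both": sorted(detected_set & similar_set),
        "only_detected": sorted(detected_set - similar_set),
        "only_similar": sorted(similar_set - detected_set),
    }


# First few filenames shown in the detected.txt output above (partial list)
detected_sample = [
    "cv-valid-dev/sample-000000.mp3",
    "cv-valid-dev/sample-000003.mp3",
    "cv-valid-dev/sample-000089.mp3",
    "cv-valid-dev/sample-000508.mp3",
]
# First few filenames with similarity == True shown above (partial list)
similar_sample = [
    "cv-valid-dev/sample-000000.mp3",
    "cv-valid-dev/sample-000089.mp3",
    "cv-valid-dev/sample-000508.mp3",
    "cv-valid-dev/sample-000579.mp3",
]

result = compare_detections(detected_sample, similar_sample)
for group, names in result.items():
    print(f"{group}: {len(names)} file(s)")
```

To apply it to the real data, pass the full list of filenames from `detected.txt` and the `filename` values of the rows where `similarity` is `True`.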