# Task: word sense in context

In this task you will compute embeddings for the contexts of ambiguous words and find the least similar contexts, which should normally correspond to different meanings of the word (e.g. Java the language vs. Java the island). You will use similarity search between embeddings to reach this goal.
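
To make the idea concrete, here is a minimal sketch of cosine similarity between bag-of-words context vectors. The three sentences and the helper names `bow` and `cosine_similarity` are made up for illustration and are not part of the assignment; only the standard library is used, so it runs without the dataset.

```python
import math
from collections import Counter

def bow(text, target):
    """Bag-of-words counts for a context, with the target word removed."""
    return Counter(w for w in text.lower().split() if w != target)

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical contexts for the ambiguous word "java".
code_1 = bow("we write java code and run the code", "java")
code_2 = bow("the java code did not run", "java")
island = bow("java is an island of indonesia", "java")

sim_same_sense = cosine_similarity(code_1, code_2)
sim_diff_sense = cosine_similarity(code_1, island)
print(f"same sense: {sim_same_sense:.2f}, different sense: {sim_diff_sense:.2f}")
```

Contexts that share no words get a similarity of zero, which is why the least similar pairs are good candidates for different senses.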

How to proceed:

- Load text from wikitext dataset as shown below

- Write code that follows the instructions given in the comments below



In [None]:
!pip install datasets



In [None]:
from datasets import load_dataset

dataset = load_dataset("wikitext", "wikitext-2-v1")

In [None]:
dataset

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})

In [None]:
# Write your code here
import spacy
from datasets import load_dataset
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

nlp = spacy.load("en_core_web_sm")

target_words = ["python", "jaguar", "apple", "bank", "java"]
stopwords = set(["a", "an", "the", "is", "are", "in", "on", "at"])

def intersection(lst1, lst2):
    return list(set(lst1) & set(lst2))

j = 0
contexts = {}
word_counts = {word: 0 for word in target_words}

for i, t in enumerate(dataset["train"]):
  line = t["text"]
  entities = []
  doc = nlp(line)
  for ent in doc.ents:
    if ent.text.lower() in target_words:
      entities.append(ent.text.lower())

  if entities:
    for entity in entities:
      with open(f"{entity}_entities.txt", "a") as file:
        file.write(line + "\n")

  words = line.lower().split()
  search = intersection(target_words, words)

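  # Preview only the first 20 matches; keep scanning so that every target
  # word can still collect its contexts.
  for s in search:
    if j < 20:
      print(s, ">>>", line)
    j += 1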


  context = [w for w in words if w not in target_words and w not in stopwords]
  context_length = len(context)

  if 5 <= context_length <= 20:
    for s in search:
      if word_counts[s] < 200:
        if s not in contexts:
          contexts[s] = []

        contexts[s].append(context)
        word_counts[s] += 1

  if all(count >= 200 for count in word_counts.values()):
    break

context_vectors = {}
for word, contexts_list in contexts.items():
  # Each row of the TF-IDF matrix is the (L2-normalised) sum of the weighted
  # one-hot vectors of the words in one context: one line, one vector.
  vectorizer = TfidfVectorizer()
  context_vectors[word] = vectorizer.fit_transform([" ".join(context) for context in contexts_list])

least_similar = {}
for word, vectors in context_vectors.items():
  # Pairwise cosine distances between all contexts of the same word.
  distances = cosine_distances(vectors)
  # Keep every unordered pair once (upper triangle), largest distance first.
  rows, cols = np.triu_indices(distances.shape[0], k=1)
  order = np.argsort(distances[rows, cols])[::-1][:10]
  least_similar[word] = [(contexts[word][rows[k]], contexts[word][cols[k]]) for k in order]

# Print the result as a table: word | context_1 | context_2
for word, pairs in least_similar.items():
  for context_1, context_2 in pairs:
    print(f"{word} | {' '.join(context_1)} | {' '.join(context_2)}")
  print()

# 1) Find lines which contain target words (from target_words): words with potentially multiple meanings
# 2) Exclude all mentions of the target words and stopwords from the found line (context); make sure the remaining text contains at least 5 but no more than 20 words
# 3) Vectorize the context by summing up all word embeddings from this context (one line shall correspond to one vector)
# You are free to use any vectorization method studied at the lecture / seminar
# 4) Save the pairs "word, context_vector" into a data structure of your choice
# 5) Limit the number of occurrences to 200 per word
# 6) For each word, print the top 10 pairs of LEAST similar vectors, e.g. "java | context_1 | context_2", in the form of a table


bank >>>  Fingal was designed and built as a <unk> by J & G Thomson 's Clyde Bank Iron Shipyard at <unk> in Glasgow , Scotland , and was completed early in 1861 . She was described by <unk> <unk> Scales , who served on the Atlanta before her battle with the monitors , as being a two @-@ <unk> , iron @-@ <unk> ship 189 feet ( 57 @.@ 6 m ) long with a beam of 25 feet ( 7 @.@ 6 m ) . She had a draft of 12 feet ( 3 @.@ 7 m ) and a depth of hold of 15 feet ( 4 @.@ 6 m ) . He estimated her tonnage at around 700 tons <unk> . Fingal was equipped with two vertical single @-@ cylinder direct @-@ acting steam engines using steam generated by one <unk> @-@ tubular boiler . The engines drove the ship at a top speed of around 13 knots ( 24 km / h ; 15 mph ) . They had a bore of 39 inches ( 991 mm ) and a stroke of 30 inches ( 762 mm ) . 

bank >>>  Contemporary reviews of the Type 1 design were generally favorable . The New York Weekly Tribune on May 19 , 1849 described the new dollar as " undoubted
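
Steps 3 and 6 from the instructions can be sketched on toy data as follows. The two-dimensional embeddings in `toy_embeddings`, the example contexts and the helper names are all hypothetical; in the real task the vectors would come from whichever vectorization method you chose.

```python
import math
from itertools import combinations

# Hypothetical 2-d word embeddings: first axis ~ "computing", second ~ "nature".
toy_embeddings = {
    "code": (1.0, 0.0), "compiler": (0.9, 0.1), "program": (1.0, 0.1),
    "island": (0.0, 1.0), "volcano": (0.1, 0.9), "coffee": (0.2, 0.8),
}

def context_vector(words):
    """Step 3: sum the embeddings of all known words in one context."""
    vec = [0.0, 0.0]
    for w in words:
        for k, value in enumerate(toy_embeddings.get(w, (0.0, 0.0))):
            vec[k] += value
    return vec

def cosine_distance(u, v):
    """1 - cosine similarity; defined as 1.0 if either vector is all zeros."""
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0:
        return 1.0
    return 1.0 - sum(a * b for a, b in zip(u, v)) / norm

def least_similar_pairs(contexts, top_n=10):
    """Step 6: rank all context pairs of one word by decreasing distance."""
    vectors = [context_vector(c) for c in contexts]
    pairs = combinations(range(len(contexts)), 2)
    scored = [(cosine_distance(vectors[i], vectors[j]), i, j) for i, j in pairs]
    scored.sort(reverse=True)
    return [(contexts[i], contexts[j], d) for d, i, j in scored[:top_n]]

java_contexts = [
    ["program", "code", "compiler"],
    ["island", "volcano", "coffee"],
    ["code", "program"],
]
result = least_similar_pairs(java_contexts)
for c1, c2, dist in result:
    print(f"java | {' '.join(c1)} | {' '.join(c2)} | {dist:.3f}")
```

Each unordered pair is scored once, so a word with 200 contexts yields 19,900 candidate pairs, of which only the 10 most distant are kept.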

# Label found senses (optional for additional points)

Add manual labels to 10 of the 50 rows in the final table, labelling each context with a hypernym (e.g. python --> snake or python --> language).

An example of the table is shown below.

| Word         | Context 1     | Context 1 Label| Context 2 | Context 2 Label |
|--------------|-----------|------------|---|---|
| java | I program with Java | language | I brew coffee from Java | island |
| python | I saw a python | snake | I've coded it using python | language |
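
A table in this format can be produced from labelled pairs with a few lines of Python. This is only a sketch: the `rows` list reuses the two example rows, and `to_markdown_table` is a hypothetical helper, not part of any library.

```python
HEADER = ["Word", "Context 1", "Context 1 Label", "Context 2", "Context 2 Label"]

def to_markdown_table(rows):
    """Render (word, context_1, label_1, context_2, label_2) tuples as markdown."""
    lines = [
        "| " + " | ".join(HEADER) + " |",
        "|" + "|".join("---" for _ in HEADER) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

rows = [
    ("java", "I program with Java", "language", "I brew coffee from Java", "island"),
    ("python", "I saw a python", "snake", "I've coded it using python", "language"),
]
markdown = to_markdown_table(rows)
print(markdown)
```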



In [None]:
table = {
   'Word': ['java', 'python', 'bank', 'apple', 'jaguar'],
   'Context 1': ['I program with Java', 'I saw a python', 'I went to the bank', 'I ate an apple', 'I saw a jaguar'],
   'Context 2': ['I brew coffee from Java island', "I've coded it using python", 'I have a bank account', 'I love apple pie', 'I drove a jaguar car'],
}

# Manual hypernym labels, one per context, in the same order as table['Word'].
table['Context 1 Label'] = ['language', 'snake', 'financial institution', 'fruit', 'animal']
table['Context 2 Label'] = ['island', 'language', 'financial institution', 'fruit', 'car']

# Print the table as tab-separated rows.
columns = ['Word', 'Context 1', 'Context 1 Label', 'Context 2', 'Context 2 Label']
print("\t".join(columns))
for row in zip(*(table[column] for column in columns)):
   print("\t".join(row))


Word	Context 1	Context 1 Label	Context 2	Context 2 Label
java	I program with Java	language	I brew coffee from Java island	island
python	I saw a python	snake	I've coded it using python	language
bank	I went to the bank	financial institution	I have a bank account	financial institution
apple	I ate an apple	fruit	I love apple pie	fruit
jaguar	I saw a jaguar	animal	I drove a jaguar car	car
