<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/3.embeddings/HW2_Lexical_Semantics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/3.embeddings/HW2_Lexical_Semantics.ipynb)

**N.B.** Once it's open on Colab, remember to save a copy (by e.g. clicking `Copy to Drive` above).

---

## Homework 2: Lexical Semantics

In this homework, we will explore lexical semantics in the context of slang and FastText, an alternative to word2vec (Part 1), and how to represent a sentence using individual word vectors so that we can measure the similarity between a pair of sentences (Part 2).

### Part 1: Slang and word similarity with FastText

Slang presents an interesting linguistic phenomenon that involves non-standard word forms. For this question, you will explore how FastText, an alternative to Word2Vec, handles the lexical semantics of slang and informal language.

First, **familiarize yourself with the slang dataset that we are using**, introduced in ["Toward Informal Language Processing: Knowledge of Slang in Large Language Models" (Sun et al., NAACL 2024)](https://aclanthology.org/2024.naacl-long.94/). The full dataset includes annotations indicating whether a sentence from OpenSubtitles (typically a line from a movie) contains a slang term, and you can find some example sentences and terms here:

https://raw.githubusercontent.com/dbamman/anlp25/main/data/slang_examples.tsv

What are the 10 most common slang words?

In [148]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/slang_examples.tsv

--2025-09-13 20:36:11--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/slang_examples.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 376405 (368K) [text/plain]
Saving to: ‘slang_examples.tsv.3’


2025-09-13 20:36:11 (7.80 MB/s) - ‘slang_examples.tsv.3’ saved [376405/376405]



In [149]:
# Map each example sentence to its annotated slang term
slang = {}
with open("slang_examples.tsv") as f:
    for line in f:
        try:
            key, value = line.rstrip('\n').split('\t')
        except ValueError:
            continue  # skip lines that do not have exactly two tab-separated fields
        slang[key] = value

# Count how often each slang term occurs across the (unique) sentences
freq_dict = {}
for value in slang.values():
    freq_dict[value] = freq_dict.get(value, 0) + 1

# Sort terms by frequency, most common first
slang_list = sorted(freq_dict.items(), key=lambda x: x[1], reverse=True)

for i in slang_list[:10]:
    print(i)

('gonna', 399)
('yeah', 176)
('shit', 116)
('wanna', 105)
("ain't", 76)
('gotta', 62)
('mate', 62)
('okay', 47)
('man', 47)
('kid', 46)
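
The same tally can also be written with `collections.Counter`. The sketch below is only an illustration on a few made-up rows (the `rows` list is hypothetical and is not taken from `slang_examples.tsv`), and it assumes the same column order as the cell above: sentence first, slang term second. Unlike the dictionary keyed by sentence, it counts every row, so repeated sentences would be counted more than once.

```python
from collections import Counter

# Hypothetical (sentence, slang term) rows standing in for the TSV contents
rows = [
    ("i 'm gonna go home", "gonna"),
    ("yeah , sure", "yeah"),
    ("we 're gonna win this", "gonna"),
]

# Count each slang term once per row, then take the most frequent ones
counts = Counter(term for _, term in rows)
top_terms = counts.most_common(2)

for term, count in top_terms:
    print(term, count)
```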


Next, **train a [FastText model](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText) using `gensim` and `FastText`** on our slang data derived from the dataset described above, which you can download here:

https://raw.githubusercontent.com/dbamman/anlp25/main/data/slang_corpus.txt

For preprocessing, in the `txt` file, each token is already separated by whitespace, so you don't need to worry about tokenization. Treat each line in the file as one "sentence". For training, use the following parameters: embedding size of 400, context window of 5, frequency threshold of 5, and use 5 workers to train for 5 epochs.

Note: we are using the `FastText` _implementation_ included in the `gensim` library! **Don't use the `fasttext` library.**

In [150]:
!pip install gensim



In [151]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/slang_corpus.txt

--2025-09-13 20:36:19--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/slang_corpus.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2868710 (2.7M) [text/plain]
Saving to: ‘slang_corpus.txt.2’


2025-09-13 20:36:19 (36.4 MB/s) - ‘slang_corpus.txt.2’ saved [2868710/2868710]



In [152]:
# copied the file parser from WordEmbeddings.ipynb
import re

sentences=[]
filename="slang_corpus.txt"
with open(filename) as file:
    for line in file:
        words = line.rstrip().lower()
        # this file is already tokenized, so we can split on whitespace
        # but first let's replace any sequence of whitespace (space, tab, newline, etc.) with single space
        words = re.sub(r"\s+", " ", words)
        sentences.append(words.split(" "))

In [153]:
import gensim
from gensim.models import FastText

ft_model = FastText(vector_size=400, window=5, min_count=5, workers=5)
ft_model.build_vocab(sentences)
ft_model.train(sentences, total_examples=ft_model.corpus_count, epochs=ft_model.epochs)

(2042983, 3344590)

With the trained models:

**Q1.** Pick a slang term, and in about 100 words, discuss:

- what the most similar words are to the slang term of your choosing (as measured by the model), and
- whether the result is aligned with your understanding.

In [154]:
ft_model.wv.most_similar('yeah', topn=5)

[('yep', 0.8565532565116882),
 ('ye', 0.8376001119613647),
 ('ah', 0.8322916030883789),
 ('swell', 0.8200086951255798),
 ('uh', 0.8141594529151917)]

The five most similar words to the slang term 'yeah' are 'yep', 'ye', 'ah', 'swell', and 'uh'. This list fits reasonably well with my expectations, as several of these words share surface components with 'yeah': 'ye' and 'yep', for example, start with the same two letters. This is consistent with FastText's use of character n-grams, which lets words that share subword units end up with similar vectors.
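
To make the n-gram argument concrete, here is a minimal sketch of how a word can be broken into character n-grams. It assumes the word is wrapped in `<` and `>` boundary markers and uses n-gram lengths 3 to 6, which I believe are the `gensim` defaults (`min_n=3`, `max_n=6`); the helper `char_ngrams` is hypothetical and is not part of `gensim`, and the real implementation hashes n-grams into buckets rather than storing them as strings.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word wrapped in < and >."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams

yeah_grams = char_ngrams("yeah")

# Show which n-grams each neighbor shares with 'yeah'
for other in ["yep", "ye", "ah", "okay"]:
    shared = sorted(yeah_grams & char_ngrams(other))
    print(other, shared)
```

Under these assumptions, 'yep' and 'ye' share the n-gram `<ye` with 'yeah', 'ah' shares `ah>`, and 'okay' shares nothing, which matches the intuition that FastText neighbors often overlap in spelling.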

**Q2.** Train a separate word2vec (not FastText) model using the same dataset, and in about 100 words, compare the two approaches used to estimate word vectors. Here are some potential topics:

- Look up the token `gonna` in both word2vec and FastText models. What does this tell you?
- What is the high level difference between word2vec and FastText?
- Evaluate the quality of the trained embeddings through intrinsic evaluation.

In [155]:
from gensim.models import Word2Vec

w2v_model = Word2Vec(vector_size=400, window=5, min_count=5, workers=5)
w2v_model.build_vocab(sentences)
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=w2v_model.epochs)

(2043310, 3344590)

In [156]:
w2v_model.wv.most_similar('yeah', topn=5)

[('uh', 0.8316991329193115),
 ('yes', 0.8296308517456055),
 ('okay', 0.8067767024040222),
 ('ok', 0.7860023975372314),
 ('fine', 0.7690392732620239)]

Looking up the same slang term 'yeah' in the Word2Vec model gives results that differ quite a lot from the FastText model. Most of the top five Word2Vec neighbors ('yes', 'okay', 'ok', 'fine') are words that can appear in the same contexts as 'yeah' in a sentence, but they do not necessarily share its letters or structure, as was the case with FastText. The main difference visible in this example is that Word2Vec learns one vector per word purely from the contexts in which the word appears, whereas FastText also builds each word's vector from its character n-grams, so the structure of the word itself influences similarity.

### Part 2. From words to sentences

So far we've been working with word vectors, but in real-world scenarios, we often want to work with not just a word, but a sequence (like a sentence), which we will explore later in the semester. However, with what we have learned so far, how do you represent a sentence? One approach is to look up the word vectors for individual words in the sentence and then *average* them, which we will explore in this question. We will be using the pre-trained GloVe vectors [cf. SLP 6.8.3] that we used in class. Download them here:

https://raw.githubusercontent.com/dbamman/anlp25/main/data/glove.6B.100d.100K.txt

**Q3.** Load the pretrained embeddings with `gensim`'s `load_word2vec_format` (see the lab notebooks), and create a function that takes a pair of sentences as input and outputs the similarity of the two sentences measured by cosine. A sentence pair you can use as a sanity check is provided below.

Find a pair of sentences where the similarity is high, but which mean different (or opposite) things. Find a pair of sentences where the similarity is low, but you think the meanings are similar. In a paragraph, discuss why we might see these results given how we construct sentence embeddings.
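
For reference, the averaging-plus-cosine setup can be written out as follows. The notation is my own and not from the assignment: $\mathbf{w}_i$ is the vector of the $i$-th in-vocabulary token of a sentence with $n$ such tokens, and $\mathbf{s}$ is the resulting sentence vector.

$$
\mathbf{s} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{w}_i,
\qquad
\cos(\mathbf{s}_1, \mathbf{s}_2) = \frac{\mathbf{s}_1 \cdot \mathbf{s}_2}{\lVert \mathbf{s}_1 \rVert \, \lVert \mathbf{s}_2 \rVert}
$$

Because the sum does not depend on the order of the tokens, two sentences containing the same words in a different order get the same $\mathbf{s}$ and therefore a cosine of 1.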

In [157]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/glove.6B.100d.100K.txt

--2025-09-13 20:37:10--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/glove.6B.100d.100K.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85951834 (82M) [text/plain]
Saving to: ‘glove.6B.100d.100K.txt.2’


2025-09-13 20:37:12 (173 MB/s) - ‘glove.6B.100d.100K.txt.2’ saved [85951834/85951834]



----

In [158]:
from gensim.models import KeyedVectors

glove_model = KeyedVectors.load_word2vec_format("glove.6B.100d.100K.txt", binary=False, no_header=True)

In [159]:
import numpy as np

def sentence_similarity(sentence1, sentence2):
    s1_tokens = sentence1.lower().split()
    s2_tokens = sentence2.lower().split()

    s1_vectors = [glove_model[s1_token] for s1_token in s1_tokens if s1_token in glove_model]
    s2_vectors = [glove_model[s2_token] for s2_token in s2_tokens if s2_token in glove_model]

    if not s1_vectors or not s2_vectors:
        return 0

    avg_vector1 = np.mean(s1_vectors, axis=0)
    avg_vector2 = np.mean(s2_vectors, axis=0)

    return glove_model.cosine_similarities(avg_vector1, [avg_vector2])[0]

In [160]:
one="The queen rules the castle"
two="The king rules the castle"

sentence_similarity(one, two)

0.98152906

In [161]:
three="The turkey we had for supper last night was absolutely fantastic"
four="We had a really great turkey for dinner yesterday evening"

sentence_similarity(three, four)

0.9676

Both sentence pairs yielded a high cosine score, including the pair I intended as the low-similarity, similar-meaning example. I believe that this is in large part due to how sentence inputs are handled by the `sentence_similarity` function. The input sentences are first tokenized and each token is looked up in the GloVe model; then, to compare the two sentences, I averaged the word vectors into one vector per sentence, because `glove_model.cosine_similarities` compares a single vector against a set of other vectors. Averaging discards word order and lets the words shared by both sentences (such as 'the', 'rules', and 'castle') dominate, so swapping 'queen' for 'king' barely moves the average, and it is difficult to end up with two sentence vectors that are close to orthogonal. Another influence is the GloVe model itself, as words that *normally* have different meanings may appear in similar contexts in its training corpus and therefore receive similar vectors. We see this effect when comparing two unrelated words (such as 'computer' and 'ancient' below) in isolation: the cosine score is much lower, as expected, but still not negative. Substituting a single word for each sentence also removes the averaging effect mentioned previously.

In [162]:
five="computer"
six="ancient"

sentence_similarity(five, six)

0.25513265
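
The averaging effect discussed above can be illustrated without GloVe at all. The following is a dependency-free sketch using made-up three-dimensional vectors (`toy_vectors` and the helper names are hypothetical, chosen only for illustration); it mirrors the structure of `sentence_similarity` but is not the function used for the results in this notebook.

```python
import math

# Made-up word vectors; each word points along its own axis
toy_vectors = {
    "dog": [1.0, 0.0, 0.0],
    "bites": [0.0, 1.0, 0.0],
    "man": [0.0, 0.0, 1.0],
}

def average_vector(tokens, vectors):
    """Average the vectors of all in-vocabulary tokens (None if there are none)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def toy_sentence_similarity(s1, s2, vectors=toy_vectors):
    v1 = average_vector(s1.lower().split(), vectors)
    v2 = average_vector(s2.lower().split(), vectors)
    if v1 is None or v2 is None:
        return 0.0
    return cosine(v1, v2)

# Same words, opposite meaning: averaging cannot tell these apart
reordered = toy_sentence_similarity("dog bites man", "man bites dog")
# No shared words and orthogonal toy vectors: similarity drops to zero
disjoint = toy_sentence_similarity("dog bites", "man")
print(reordered, disjoint)
```

With real embeddings the vectors are never this cleanly orthogonal, which is one reason the scores above stay well above zero even for unrelated words.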

## To submit

Congratulations on finishing this homework!
Please follow the instructions below to download the notebook file (`.ipynb`) and its printed version (`.pdf`) for submission on bCourses -- remember **all cells must be executed**.