<a href="https://colab.research.google.com/github/GoldPapaya/info256-applied-nlp/blob/main/3.embeddings/HW2_Lexical_Semantics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp25/blob/main/3.embeddings/HW2_Lexical_Semantics.ipynb)

**N.B.** Once it's open on Colab, remember to save a copy (by e.g. clicking `Copy to Drive` above).

---

## Homework 2: Lexical Semantics

In this homework, we will explore lexical semantics in the context of slang and FastText, an alternative to word2vec (Part 1), and how to represent a sentence using individual word vectors so that we can measure the similarity between a pair of sentences (Part 2).

### Part 1: Slang and word similarity with FastText

Slang presents an interesting linguistic phenomenon that involves non-standard word forms. For this question, you will explore how FastText, an alternative to Word2Vec, handles the lexical semantics of slang and informal language.

First, **familiarize yourself with the slang dataset that we are using**, introduced in ["Toward Informal Language Processing: Knowledge of Slang in Large Language Models" (Sun et al., NAACL 2024)](https://aclanthology.org/2024.naacl-long.94/). The full dataset includes annotations indicating whether a sentence from OpenSubtitles (typically a line from a movie) contains a slang term, and you can find some example sentences and terms here:

https://raw.githubusercontent.com/dbamman/anlp25/main/data/slang_examples.tsv

What are the 10 most common slang words?

In [17]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/slang_examples.tsv

--2025-09-13 00:02:09--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/slang_examples.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 376405 (368K) [text/plain]
Saving to: ‘slang_examples.tsv.4’


2025-09-13 00:02:09 (11.9 MB/s) - ‘slang_examples.tsv.4’ saved [376405/376405]



In [19]:
with open("slang_examples.tsv", encoding="utf-8") as f:
    document = f.read()

# Map the first tab-separated field of each line to the second (the slang term).
slang = {}
sentences = document.split('\n')
for sentence in sentences:
    try:
        key, value = sentence.split('\t')
    except ValueError:
        continue  # skip lines that do not have exactly two fields
    slang[key] = value

# Count how often each slang term occurs.
freq_dict = {}
for value in slang.values():
    freq_dict[value] = freq_dict.get(value, 0) + 1

# Sort terms by frequency, most common first, and print the top 10.
slang_list = sorted(freq_dict.items(), key=lambda x: x[1], reverse=True)

for item in slang_list[:10]:
    print(item)

('gonna', 399)
('yeah', 176)
('shit', 116)
('wanna', 105)
("ain't", 76)
('gotta', 62)
('mate', 62)
('okay', 47)
('man', 47)
('kid', 46)


Next, **train a [FastText model](https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.FastText) using `gensim` and `FastText`** on our slang data derived from the dataset described above, which you can download here:

https://raw.githubusercontent.com/dbamman/anlp25/main/data/slang_corpus.txt

For preprocessing, in the `txt` file, each token is already separated by whitespace, so you don't need to worry about tokenization. Treat each line in the file as one "sentence". For training, use the following parameters: embedding size of 400, context window of 5, frequency threshold of 5, and use 5 workers to train for 5 epochs.

Note: we are using the `FastText` _implementation_ included in the `gensim` library! **Don't use the `fasttext` library.**

In [20]:
!pip install gensim



In [21]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/slang_corpus.txt

--2025-09-13 00:02:29--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/slang_corpus.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2868710 (2.7M) [text/plain]
Saving to: ‘slang_corpus.txt.1’


2025-09-13 00:02:29 (50.8 MB/s) - ‘slang_corpus.txt.1’ saved [2868710/2868710]



In [83]:
import gensim
from gensim.models import FastText

model = FastText(
    corpus_file="slang_corpus.txt",
    vector_size=400,
    window=5,
    min_count=5,
    workers=5,
    epochs=5,
    )

With the trained models:

**Q1.** Pick a slang term, and in about 100 words, discuss:

- what the most similar words are to the slang term of your choosing (as measured by the model), and
- whether the result is aligned with your understanding.

In [85]:
model.wv.most_similar('yeah', topn=5)

[('ah', 0.8827776312828064),
 ('Nah', 0.8369936943054199),
 ('Yeah', 0.8281385898590088),
 ('uh', 0.7648449540138245),
 ('Yep', 0.7647594213485718)]

**Q2.** Train a separate word2vec (not FastText) model using the same dataset, and in about 100 words, compare the two approaches used to estimate word vectors. Here are some potential topics:

- Look up the token `gonna` in both word2vec and FastText models. What does this tell you?
- What is the high level difference between word2vec and FastText?
- Evaluate the quality of the trained embeddings through intrinsic evaluation.
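
One way to build intuition for the high-level difference is to look at the character n-grams that FastText associates with a word. The block below is a simplified sketch, not `gensim`'s actual implementation: the helper `char_ngrams` is a hypothetical name, the word is padded with `<` and `>` as in the FastText paper, and the range 3 to 6 mirrors the default `min_n` and `max_n` values of `gensim`'s `FastText`. The real model also hashes n-grams into a fixed number of buckets, which is omitted here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, padded with '<' and '>'.

    Hypothetical helper for illustration only; gensim additionally hashes
    n-grams into buckets, which this sketch does not do.
    """
    padded = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


# Two spellings of the same slang term share some subword units.
shared = set(char_ngrams("gonna")) & set(char_ngrams("gunna"))
print(sorted(shared))  # ['na>', 'nna', 'nna>']
```

Because a FastText word vector is built from the vectors of its n-grams, a spelling variant that never appeared in the training data can still receive a vector from the n-grams it shares with known words. A word2vec model stores one vector per vocabulary entry and has nothing to fall back on for an unseen token.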

In [25]:
from gensim.models import Word2Vec

w2v_model = Word2Vec(
    corpus_file="slang_corpus.txt",
    vector_size=400,
    window=5,
    min_count=5,
    workers=5,
    epochs=5,
)

In [86]:
w2v_model.wv.most_similar('yeah', topn=5)

[('hey', 0.7975084781646729),
 ('okay', 0.7787396311759949),
 ('OK', 0.7622476816177368),
 ('God', 0.7427469491958618),
 ('mate', 0.7350471615791321)]

### Part 2. From words to sentences

So far we've been working with word vectors, but in real-world scenarios, we often want to work with not just a word, but a sequence (like a sentence), which we will explore later in the semester. However, with what we have learned so far, how do you represent a sentence? One approach is to look up the word vectors for the individual words in the sentence and then *average* them, which we will explore in this question. We will be using the pre-trained GloVe vectors [cf. SLP 6.8.3] that we used in class. Download them here:

https://raw.githubusercontent.com/dbamman/anlp25/main/data/glove.6B.100d.100K.txt

**Q3.** Load the pretrained embeddings with `gensim`'s `load_word2vec_format` (see the lab notebooks), and create a function that takes a pair of sentences as input and outputs their cosine similarity. A sentence pair that you can use as a sanity check is provided below.

Find a pair of sentences whose similarity is high but that mean different (or opposite) things. Find a pair of sentences whose similarity is low but whose meanings you think are similar. In a paragraph, discuss why we might see these results given how we construct sentence embeddings.
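
The averaging approach described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference solution: `toy_vectors` is an invented 3-dimensional lookup table (the real GloVe vectors have 100 dimensions), the function names are hypothetical, tokenization is a plain lowercase whitespace split, and out-of-vocabulary tokens are simply skipped.

```python
import math

# Invented toy vectors for illustration only; these are NOT GloVe values.
toy_vectors = {
    "dog": [1.0, 0.0, 0.0],
    "man": [0.0, 1.0, 0.0],
    "bites": [0.0, 0.0, 1.0],
}


def sentence_vector(sentence, vectors):
    """Average the vectors of the in-vocabulary tokens; None if there are none."""
    tokens = sentence.lower().split()
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(s1, s2, vectors):
    """Cosine similarity between the averaged word vectors of two sentences."""
    v1, v2 = sentence_vector(s1, vectors), sentence_vector(s2, vectors)
    if v1 is None or v2 is None:
        return 0.0  # assumption: treat sentences with no known words as dissimilar
    return cosine(v1, v2)


# Averaging ignores word order, so these two sentences get the same vector.
print(sentence_similarity("dog bites man", "man bites dog", toy_vectors))
```

The same functions should work if the toy dictionary is replaced with a `gensim` `KeyedVectors` object, since it supports both `token in vectors` and `vectors[token]`. The last line hints at one reason high similarity does not guarantee similar meaning: an average discards word order entirely.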

In [87]:
!wget --no-check-certificate https://raw.githubusercontent.com/dbamman/anlp25/main/data/glove.6B.100d.100K.txt

--2025-09-13 00:46:01--  https://raw.githubusercontent.com/dbamman/anlp25/main/data/glove.6B.100d.100K.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85951834 (82M) [text/plain]
Saving to: ‘glove.6B.100d.100K.txt’


2025-09-13 00:46:03 (273 MB/s) - ‘glove.6B.100d.100K.txt’ saved [85951834/85951834]



----

## To submit

Congratulations on finishing this homework!
Please follow the instructions below to download the notebook file (`.ipynb`) and its printed version (`.pdf`) for submission on bCourses -- remember **all cells must be executed**.