<a href="https://colab.research.google.com/github/Nouw/models-for-language-processing/blob/master/M4LP_Assignment_2_(2024)_student_version.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Assignment 2

This is the complete Assignment 2. You are asked to train and test linear and logistic regression models and access lexical resources.

# <font color="red">Contributions</font>

~~Delete this text and write instead of it your:~~
* ~~group number (same as the file name, for sanity check)~~
* ~~a list of group members' names (NOT student IDs)~~
* ~~who contributed to which exercises (you don't need to be very detailed)~~

To start the assignment, import prerequisite packages:

In [None]:
import torch
import torch.nn.functional as F
import torchtext as text
import numpy as np
from tqdm.notebook import trange, tqdm
import nltk,sklearn
from sklearn.model_selection import train_test_split

In [None]:
import pandas as pd
import collections, itertools
import more_itertools

# 2.1 Import GloVe word embedding model

GloVe contains _static_ word embeddings: each word type is assigned a single vector that does not vary across contexts or word senses. Pretrained GloVe word embedding models exist in different sizes. For this exercise, we will use the smallest GloVe vectors, with 50 dimensions. First, let's download the vectors. This may take a few minutes.

In [None]:
glovedim=50
#If you experience an issue downloading the vectors, try a different download source by uncommenting the following line:
#text.vocab.GloVe.url["6B"] = "https://huggingface.co/stanfordnlp/glove/resolve/main/glove.6B.zip"
vec = text.vocab.GloVe(name='6B', dim=glovedim)

How many words does it contain?

In [None]:
#your code here
len(vec)

400000

We can now check whether particular orthographic words have vectors in `vec`:

In [None]:
#check if "cat" has a GloVe vector, return a Boolean value
#your code here

In [None]:
#check if "cact" has a GloVe vector, return a Boolean value
#your code here

What is the vector of _cat_?



In [None]:
#your code here

# 2.2 Linear regression: Concreteness prediction

Obtain concreteness ratings from the paper:

Brysbaert, M., Warriner, A., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior Research Methods, 46(3), 904–911. https://doi.org/10.3758/s13428-013-0403-5

In [None]:
!wget https://raw.githubusercontent.com/ArtsEngine/concreteness/master/Concreteness_ratings_Brysbaert_et_al_BRM.txt

This is a tab-separated file with a header row. Structured data like this can be read conveniently with the `DictReader` class from the `csv` module.

In [None]:
import csv
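
As a minimal sketch of how `DictReader` handles a tab-separated file with a header, the following reads a tiny in-memory table. The column names `Word` and `Score` and the values are made up for illustration; the real ratings file has its own header, which should be inspected before relying on any column name.

```python
import csv
import io

# Toy tab-separated data with a header row (hypothetical columns and values).
toy_tsv = "Word\tScore\napple\t5.0\nidea\t1.6\n"

# DictReader uses the header row as keys, so each row is a dict of strings.
reader = csv.DictReader(io.StringIO(toy_tsv), delimiter="\t")
rows = list(reader)

words = [row["Word"] for row in rows]
# Values are read as strings, so numeric columns need an explicit conversion.
scores = [float(row["Score"]) for row in rows]
print(words, scores)
```

Note that every value arrives as a string, which is why the score column is converted with `float` before it can be used as a regression target.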

Using `DictReader`, create the lists ```concreteness_words``` and ```concreteness_scores```, containing the words in the concreteness ratings file that have a GloVe vector and their concreteness scores, respectively.

In [None]:
concreteness_words=[]
concreteness_scores=[]
with open("Concreteness_ratings_Brysbaert_et_al_BRM.txt",'r') as concfile:
  read_tsv = csv.DictReader(concfile, delimiter="\t")
  # complete the code below

How many words ended up in your concreteness dataset?

In [None]:
len(concreteness_words)

Create train and test partitions of the concreteness data:

In [None]:
conc_words_train, conc_words_test, conc_scores_train, conc_scores_test = train_test_split(concreteness_words,concreteness_scores,test_size=0.1)

Convert data to torch tensor format to use in a regression model:

In [None]:
#here we *stack* vectors for all words in a single torch tensor:
vecs_train=torch.stack([vec[w] for w in conc_words_train])
vecs_test=torch.stack([vec[w] for w in conc_words_test])
#here we convert lists of scores into a tensor
scores_train=torch.tensor(conc_scores_train)
scores_test=torch.tensor(conc_scores_test)

Now we can define a linear regression model in PyTorch:

In [None]:
class Regression(torch.nn.Module):
     def __init__(self, input_dim, output_dim):
         super(Regression, self).__init__()
         self.linear = torch.nn.Linear(input_dim, output_dim)
     def forward(self, x):
         outputs = self.linear(x)
         return outputs

For our task, the model's input dimensionality is that of the word embeddings, and its output is 1-dimensional (the concreteness score):

In [None]:
model = Regression(glovedim,1)

To train the model, we need a loss function and an optimiser:

In [None]:
criterion = torch.nn.MSELoss()

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
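
To make the roles of the loss and the optimiser concrete, here is a pure-Python sketch of mean squared error and plain gradient-descent updates for a one-feature linear model. The toy data, the helper names `mse` and `sgd_step`, and the number of steps are assumptions for illustration; `torch.nn.MSELoss` and `torch.optim.SGD` perform the analogous computation on tensors.

```python
def mse(preds, targets):
    """Mean of squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

def sgd_step(w, b, xs, ys, lr):
    """One gradient-descent update of y = w*x + b under the MSE loss."""
    n = len(xs)
    preds = [w * x + b for x in xs]
    # Gradients of the MSE with respect to w and b.
    grad_w = sum(2 * (p - y) * x for p, x, y in zip(preds, xs, ys)) / n
    grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
    return w - lr * grad_w, b - lr * grad_b

# Toy data generated from y = 2x + 1 (hypothetical values).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
loss_before = mse([w * x + b for x in xs], ys)
for _ in range(200):
    w, b = sgd_step(w, b, xs, ys, lr=0.1)
loss_after = mse([w * x + b for x in xs], ys)
print(loss_before, loss_after, w, b)
```

The loss shrinks as `w` and `b` approach the values used to generate the data, which is the behaviour to look for when monitoring training.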

We can now train our linear regression model using gradient descent. Every 5 training steps, evaluate the model, printing out loss and accuracy for the training and test sets. Calculate the accuracy as the percentage of examples where the predicted score of the model differs from the correct score by less than 1. The training may take a minute or so.

In [None]:
losses = []
losses_test = []
Iterations = []
epochs=20
tolerance = 1.0
#after defining parameters, we train the model several times on the same data; each iteration is an epoch:
for epoch in trange(epochs, desc='Training Epochs'):
    x = vecs_train
    scores = scores_train
    optimizer.zero_grad()
    #here we pass the word vectors from the training set, obtaining regression model's predicted outputs:
    outputs = model(x)
    loss = criterion(torch.squeeze(outputs), scores)
    loss.backward()
    optimizer.step()

    #Now, every 5 epochs we can evaluate how the model performs on the data
    if (epoch + 1) % 5 == 0:
        #we don't compute gradients as the model is only evaluated and not updated
        with torch.no_grad():
            model.eval()
            #complete the code for evaluating the model here
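
The tolerance-based accuracy described above can be sketched on plain Python lists. The function name `tolerance_accuracy` and the toy numbers are assumptions for illustration; on tensors, comparing absolute differences against `tolerance` would play the same role.

```python
def tolerance_accuracy(preds, targets, tolerance=1.0):
    """Percentage of predictions strictly within `tolerance` of the target."""
    hits = sum(abs(p - t) < tolerance for p, t in zip(preds, targets))
    return 100.0 * hits / len(targets)

# Toy predicted and gold scores (hypothetical values).
preds = [4.2, 1.9, 3.0, 2.5]
gold = [4.9, 3.5, 3.0, 1.5]
accuracy = tolerance_accuracy(preds, gold)
print(accuracy)
```

The last pair differs by exactly 1, so under the "less than 1" criterion it does not count as correct.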

What is the predicted concreteness score of the noun _abstractness_?

In [None]:
def predicted_concreteness(wd):
  #complete the code here

predicted_concreteness("abstractness")

What is the predicted concreteness score of the noun _dog_?

In [None]:
predicted_concreteness("dog")

# 2.3. Create a dataset of WordNet supersenses for words that have GloVe vectors.

First, download the WordNet database:

In [None]:
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet as wn

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


You can read documentation on object types in WordNet, for example by invoking the types:

In [None]:
nltk.corpus.reader.wordnet.Lemma

In [None]:
nltk.corpus.reader.wordnet.Synset

Now, retrieve the first lemma corresponding to the adjective _dry_ as `d`:

In [None]:
d=

What lemmas belong to the same SynSet? Retrieve the list.

In [None]:
#your code here

Retrieve the frequency of the lemma `d` recorded in WordNet:

In [None]:
d.count()

Define a function `ant_freq` that returns the frequency of the first antonym of a lemma.

In [None]:
def ant_freq(x):
  #your code here

Apply `ant_freq` to `d`. This will output the frequency of _wet_.

In [None]:
ant_freq(d)

Now, create a dataset that pairs each WordNet word that has a GloVe vector with the lexicographic file (supersense) of its first synset.

In [None]:
#your code here
wn_words =
wn_supersenses =

Split the dataset into train and test partitions:

In [None]:
wn_words_train, wn_words_test, wn_supersenses_train, wn_supersenses_test = train_test_split(wn_words,wn_supersenses,test_size=0.1)

# 2.4. Logistic regression: word class prediction.

Now we can address a classification task. Define a (multinomial) logistic regression model using softmax, and choose a loss function and an optimizer for it.

In [None]:
num_classes = len(set(wn_supersenses))
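
A classifier is trained on integer class indices rather than on label strings, so the supersense names need a consistent encoding. A minimal sketch, using a toy label list and the hypothetical names `label_to_index` and `index_to_label`:

```python
# Toy label list (hypothetical; the real labels come from WordNet).
labels = ["noun.animal", "verb.motion", "noun.animal", "adj.all"]

# Sorting makes the index assignment reproducible across runs.
classes = sorted(set(labels))
label_to_index = {name: i for i, name in enumerate(classes)}
index_to_label = {i: name for name, i in label_to_index.items()}

# Replace every label string by its integer index.
encoded = [label_to_index[name] for name in labels]
print(classes, encoded)
```

Iterating over a bare `set` gives no guaranteed order, so fixing the order once (here by sorting) keeps the indices stable between training and prediction.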

Initialize your model. Use the same Regression class for logistic regression as for linear regression - the difference will come from the objective (loss) function.

In [None]:
logreg_model = Regression(glovedim,num_classes)

Choose the loss function. This is a crucial choice: some of the loss functions in PyTorch (see https://pytorch.org/docs/stable/nn.html) already include a softmax or a sigmoid in their implementation, which gives computational advantages. Be sure to read the documentation on your loss function to confirm you made a good choice.
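
To see why folding the softmax into the loss is convenient, the sketch below computes the cross-entropy of one example directly from raw scores (logits) with the log-sum-exp trick, and compares it with the two-step version that applies softmax first. The function names and toy numbers are assumptions for illustration. If a loss already applies softmax internally, the model should output raw scores; applying softmax a second time would distort the result.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, target):
    """Negative log-probability of the target class, computed from raw scores."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

logits = [2.0, 0.5, -1.0]   # toy scores for three classes
target = 0                  # index of the correct class

fused = cross_entropy_from_logits(logits, target)
two_step = -math.log(softmax(logits)[target])
print(fused, two_step)
```

Both routes give the same value up to floating-point error, but the fused version never forms the probabilities explicitly, which is more stable for large scores.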

In [None]:
#your code here


And initialize the optimiser:

In [None]:
#your code here


WordNet is quite big. For efficiency, use the following function for splitting your data into batches when processing:

In [None]:
batch_size = 200
def get_batches(src_iter, tgt_iter, batch_size=batch_size):
    for batch in more_itertools.chunked(zip(src_iter, tgt_iter), batch_size):
        x, y = zip(*batch)
        x = torch.stack(x)
        y = torch.stack(y)
        yield x, y
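
The batching logic can be illustrated without tensors. The sketch below uses a small stand-in for `more_itertools.chunked` (the helper `chunk` is hypothetical) to show how paired inputs and targets are cut into batches and unzipped; the last batch may be smaller than the requested size.

```python
from itertools import islice

def chunk(iterable, size):
    """Yield successive lists of at most `size` items (stand-in for chunked)."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

inputs = ["a", "b", "c", "d", "e"]   # toy inputs
targets = [0, 1, 0, 1, 1]            # toy targets

batches = []
for batch in chunk(zip(inputs, targets), 2):
    x, y = zip(*batch)   # separate the pairs back into inputs and targets
    batches.append((x, y))
print(batches)
```

In `get_batches`, `torch.stack` additionally requires every element of `x` and `y` to be a tensor, so both the word vectors and the labels need to be tensors before batching.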

Train and test your logistic regression model, printing the train and test loss and accuracy:

In [None]:
#your code here


Define a mapping from indices to lexicographic file names:

In [None]:
#your code
itolexname=

What is the name of lexicographic file 2?

In [None]:
itolexname[2]

Which supersense (lexicographic file) does your classifier assign to the noun _abstractness_?

In [None]:
def predicted_lexname(wd):
#your code
#

predicted_lexname("abstractness")

Which supersense (lexicographic file) does your classifier assign to the noun _dog_?

In [None]:
predicted_lexname("dog")

# 2.5. Hypernymy classification

Now, download a lexical entailment (hypernymy) dataset called WBLESS. The dataset was developed by Weeds et al. with the goal of testing models on distinguishing hypernyms from other related word pairs.

Julie Weeds, Daoud Clarke, Jeremy Reffin, David Weir, and Bill Keller. 2014. Learning to distinguish hypernyms and co-hyponyms. In Proceedings of the 2014 International Conference on Computational Linguistics, pages 2249–2259, Dublin, Ireland.

WBLESS (together with other relevant datasets) can be conveniently downloaded from the Facebook Research GitHub page by Stephen Roller, who worked on hypernymy learning:

Stephen Roller, Douwe Kiela, and Maximilian Nickel. 2018. Hearst Patterns Revisited: Automatic Hypernym Detection from Large Text Corpora. ACL.

Download the tab-separated dataset:

In [None]:
!wget https://github.com/facebookresearch/hypernymysuite/raw/main/data/wbless.tsv

Check how the data looks:

In [None]:
!head wbless.tsv

We used ```csv``` above to process a tab-separated file. The same can be done using ```pandas```:

In [None]:
wbless_df = pd.read_csv('wbless.tsv', sep='\t')
wbless_df.head()

Unnamed: 0,word1,word2,label,relation,fold
0,frigate,craft,True,hyper,val
1,trouble,carp,False,other,val
2,fox,mouth,False,other,val
3,foot,robin,False,other,val
4,vest,garment,True,hyper,val


Now create training and test data for relation classification:

In [None]:
#your code here

Split into training and test data:

In [None]:
wbless_words_train, wbless_words_test, hypernymy_train, hypernymy_test = train_test_split(wbless_words,hypernymy,test_size=0.1)

Finally, create, train and test a logistic regression model that predicts whether two words stand in the hypernymy relation given their GloVe vectors.

Make sure your model predicts a single score used for the binary decision (hypernymy vs. non-hypernymy) rather than scores for multiple classes, and choose the loss function in PyTorch accordingly.

Print the train and test loss and accuracy. Use the concatenation of the two words' vectors as input to the logistic regression classifier.
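
A minimal sketch of the single-score setup, in plain Python: the two word vectors are concatenated, a linear score is computed, and a sigmoid turns it into a probability that is thresholded at 0.5. The toy vectors, weights, and function names are made up for illustration; in the exercise the weights are learned. Because concatenation preserves order, swapping the two words generally changes the score.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def pair_score(v1, v2, weights, bias):
    """Linear score over the concatenation of two vectors."""
    features = v1 + v2   # list concatenation: [v1..., v2...]
    return sum(w * f for w, f in zip(weights, features)) + bias

# Toy 2-dimensional "embeddings" and hand-picked weights (hypothetical).
dog = [1.0, 0.0]
animal = [0.0, 1.0]
weights = [1.5, -0.5, -1.0, 2.0]
bias = -1.0

p_forward = sigmoid(pair_score(dog, animal, weights, bias))
p_reverse = sigmoid(pair_score(animal, dog, weights, bias))
print(p_forward > 0.5, p_reverse > 0.5)
```

With these hand-picked weights the two orders fall on opposite sides of the threshold, which is the kind of asymmetry a hypernymy classifier needs to be able to express.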

In [None]:
#your code here


What label does your model predict for the pair _dog,animal_? Your code below should produce a Boolean value, `True` for the positive class (hypernymy) and `False` for the negative class (not hypernymy).

In [None]:
def predicted_hypernymy(w1,w2):
  #complete the code

predicted_hypernymy("dog","animal")

What label does your model predict for the pair _dog,cat_?

In [None]:
predicted_hypernymy("dog","cat")

What label does your model predict for the pair _animal,dog_?

In [None]:
predicted_hypernymy("animal","dog")

# 2.6 Using FrameNet



You can explore FrameNet online:
https://framenet.icsi.berkeley.edu/frameIndex

Or read detailed documentation here:
https://framenet2.icsi.berkeley.edu/docs/r1.7/book.pdf

Now, load FrameNet via the NLTK package:

In [None]:
nltk.download('framenet_v17')
from nltk.corpus import framenet as fn

`fn.lus` lets you retrieve the lexical units (LUs) recorded in FrameNet. A lexical unit corresponds approximately to a lemma in WordNet, i.e. a word taken in a specific sense. Called without arguments, it returns the complete list:



In [None]:
fn.lus()

[<lu ID=16601 name=(can't) help.v>, <lu ID=14632 name=(in/out of) line.n>, ...]

You can also retrieve lexical units by regular expression search. For example, the following returns the list of LUs whose names begin with the string _pres_:

In [None]:
fn.lus('^pres')

[<lu ID=957 name=press.v>, <lu ID=10117 name=press.v>, ...]
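
The effect of the `^` anchor can be checked with Python's `re` module on a toy list of names. This sketch assumes the pattern is applied with search semantics (a match anywhere in the name unless the pattern is anchored); the list below is a small hand-written sample, not FrameNet output.

```python
import re

names = ["press.v", "express.v", "present.n", "compress.v"]  # toy LU-like names

# With the anchor, only names that start with "pres" match.
anchored = [n for n in names if re.search(r"^pres", n)]
# Without it, any name containing "pres" matches.
unanchored = [n for n in names if re.search(r"pres", n)]
print(anchored)
print(unanchored)
```

Dropping the anchor therefore widens the result to names that merely contain the string.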

Individual lexical units can be retrieved by ID, for example:

In [None]:
fn.lu(10117)

lexical unit (10117): press.v

[definition]
  COD: make strong efforts to persuade or force to do something.

[frame] Attempt_suasion(87)

[POS] V

[status] Finished_Initial

[lexemes] press/V

[semTypes] 0 semantic types

[URL] https://framenet2.icsi.berkeley.edu/fnReports/data/lu/lu10117.xml

[subCorpus] 12 subcorpora
  01-T-Won-(1), 02-T-Wto-(1), 03-T-Winto-(1), 04-T-Wwith-(1),
  05-T-Wfor-(1), 06-AVP-T-(1), 07-T-AVP-(1), 08-T-NP-PP-(1),
  09-T-NP-(1), manually-added, other-matched-(1), other-
  unmatched-(1)

[exemplars] 20 sentences across all subcorpora

Terms that show up in square brackets are attributes, in this case of the lexical unit. They include usage examples:

In [None]:
fn.lu(10117).exemplars

exemplar sentences for press.v in Attempt_suasion:

[0] The sponsors of the bill made clear their intention to press for a vote on it within the current legislative session .
[1] Anne McIntosh , MEP for North East Essex , is pressing the Government to reverse its policy of controlled retreat , which allows the sea to gradually take its natural course .
[2] Lewis pressed Adam to accompany him on those solicitous weekend visits but Adam nearly always said he was too busy or would be bored .
[3] The committee also agreed to press for changes to the rule which limits the level of right-to-buy discounts in the case of newly built houses or those recently modernised .
[4] It is no secret that Damascus has been using the PKK as a bargaining chip to press Turkey into complying with its demands for more water from the Euphrates , on which it heavily depends .
[5] The tenant 's adviser should therefore press for an obligation on the landlord to notify the tenant of any application made by him to

... and the frame that the lexical unit evokes, which in turn has its own attributes:

In [None]:
fn.lu(10117).frame

frame (87): Attempt_suasion

[URL] https://framenet2.icsi.berkeley.edu/fnReports/data/frame/Attempt_suasion.xml

[definition]
  The  Speaker expresses through language his wish to get the
  Addressee to act in some way that will help to bring about events
  or states described in the Content. There is no implication that
  the Addressee forms an intention to act, let alone acts.     'Mr
  Smithers always encourages the employees to stay late and work
  harder.'  'Dennis Rodman advises moderation in all things. INI'
  The Content most prototypically refers to an action that the
  Addressee will carry out themselves, but may (in the case of
  valences with a non-finite Content clause) merely refer to a
  situation that they have indirect influence over, as in the
  following  'When I talked to her, I suggested that he be removed
  from office . DNI '

[semTypes] 0 semantic types

[frameRelations] 8 frame relations
  <Parent=Attempt -- Inheritance -> Child=Attempt_suasion>
  <Parent=Attem

**Exercise**. Find all LUs in FrameNet that share a frame with the noun *car*. Print these words along with their definitions.

In [None]:
#your code here

**Exercise**. Define a function that takes a FrameNet frame as input and prints the definitions of all core frame elements associated with the frame. For example, for the frame associated with the noun _car_, your function will print the definition of the only core frame element, _Vehicle_:


> Vehicle is the transportation device that the human beings use to travel.This FE is incorporated into each LU in this frame.

In [None]:
def printcoreFE(frame):
  """
  Args:
    frame: a frame object from FrameNet
  """
#your code here

Test your function on the frame evoked by the verb _sing_:

In [None]:
sing=
printcoreFE(sing)