# 1. Information about the submission

## 1.1 Name and number of the assignment 

## **Taxonomy enrichment**, Assignment 3.

## 1.2 Student name

## **Albert Sayapin**

## 1.3 Codalab user ID

## **albertSayapin**

## 1.4 Additional comments

## *This is an interesting task, but I wish I had had more spare time to explore it thoroughly :(*

Checked it on **Google Colab** successfully!

# 2. Technical Report

## 2.1 Methodology 

The main problem I tried to solve is the **taxonomy enrichment** task: we want to *add new, unseen words to the taxonomy* by matching them with hypernyms from the existing set.

Identifying hypernymic relations has many applications in Natural Language Processing, especially in semantically intensive tasks such as *Question Answering*, *Textual Entailment*, and *semantic search systems*. These relations play a crucial role in thesaurus construction, but extracting them manually is difficult and inefficient. That is why the problem is worth solving.

**Example:**

*Existing taxonomy*: "Mona Lisa" is-a {"Painting", "Art", "Picture", ...}, "Equation" is-a {"Mathematical object", "Representation", "Math", ...}

*Add word to the taxonomy*: "Square" -> is-a {"Math", "Representation", ..., "Painting"}


The model I used is a modified version of the baseline:
- Calculate the embeddings of all the words in each synset and average them;
- Calculate the embeddings of the out-of-taxonomy (orphan) words;
- Calculate similarity scores (cosine similarity or dot product) between every orphan and every synset and keep the 10-15 most similar synsets;
- Return the top-10 hypernyms of those synsets.
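
The steps above can be sketched on toy data. This is only an illustration under stated assumptions, not the notebook's exact code: the 2-dimensional vectors, the synset contents, the hypernym IDs, and the helper names `embed` and `rank_hypernyms` are all made up for the example.

```python
import numpy as np

# Toy word vectors (hypothetical values, 2 dimensions for readability).
vectors = {
    "painting": np.array([1.0, 0.1]),
    "picture": np.array([0.9, 0.2]),
    "equation": np.array([0.1, 1.0]),
    "formula": np.array([0.2, 0.9]),
    "fresco": np.array([0.95, 0.15]),
}

# Each synset: its member words and the IDs of its hypernym synsets (made up).
synsets = [
    {"words": ["painting", "picture"], "parents": ["ART-1", "OBJECT-1"]},
    {"words": ["equation", "formula"], "parents": ["MATH-1", "OBJECT-1"]},
]

def embed(words):
    """Average the word vectors and L2-normalize the result."""
    mean = np.mean([vectors[w] for w in words], axis=0)
    return mean / np.linalg.norm(mean)

def rank_hypernyms(orphan_words, k_synsets=10, k_hypernyms=10):
    """Rank synsets by cosine similarity and collect their hypernyms in order."""
    query = embed(orphan_words)
    synset_matrix = np.stack([embed(s["words"]) for s in synsets])
    scores = synset_matrix @ query  # cosine similarity, since all rows are unit length
    best = np.argsort(scores)[::-1][:k_synsets]
    result = []
    for idx in best:
        for parent in synsets[idx]["parents"]:
            if parent not in result:  # keep each hypernym once, in rank order
                result.append(parent)
    return result[:k_hypernyms]

print(rank_hypernyms(["fresco"]))  # ['ART-1', 'OBJECT-1', 'MATH-1']
```

The orphan "fresco" lands closest to the painting synset, so that synset's hypernyms come first and the hypernyms of the less similar synset fill the remaining positions.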

However, I used different pretrained embeddings, such as http://vectors.nlpl.eu/repository/20/214.zip, rather than https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.ru.300.bin.gz, because I could not load the latter either in Colab or on my local laptop (they required too much RAM).


For the methodology I relied on these papers:
- https://www.dialog-21.ru/media/5111/nikishinaiplusetal-160.pdf
- https://www.dialog-21.ru/media/5123/tikhomirovmmplusetal-149.pdf
- https://www.dialog-21.ru/media/5125/yadrintsevvvplusetal-144.pdf

Steps of the project:

1. **Data preprocessing**: I had to preprocess the "context" column of every dataset (train/test):
- *Lemmatized* all the words with the pymystem3.Mystem analyzer and lowercased them (normalization step);
- *Dropped* all the words from the nltk *Russian stopwords* list (as they do not bring any additional information);
- *Eliminated* all the words with *length* less than 2.

2. **Model training**: I had to calculate the embeddings of all the synsets in RuWordNet with:
- geowac_lemmas_none_fasttextskipgram_300_5_2020
- ruscorpora_upos_skipgram_300_5_2018
- each of them either L2-normalized or not.

3. **Model evaluation**: I used *MRR* and *MAP* to get an intuition about the model quality.

4. **Submitting the results**: the test .tsv files were zipped and sent to the CodaLab system.
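
For reference, the two metrics named in step 3 can be sketched as follows. This is a textbook-style version written for illustration; the official scoring script of the shared task may differ in details, and the function names and toy data here are assumptions. Predictions are assumed to contain no duplicates.

```python
def reciprocal_rank(predicted, gold):
    """1 / rank of the first correct prediction, or 0.0 if none is correct."""
    for rank, item in enumerate(predicted, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0

def average_precision(predicted, gold, k=10):
    """Precision averaged over the ranks of correct predictions within the top k."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(predicted[:k], start=1):
        if item in gold:
            hits += 1
            precision_sum += hits / rank
    denominator = min(len(gold), k)
    return precision_sum / denominator if denominator else 0.0

def mean_over_queries(metric, predictions, gold_sets):
    """Average a per-query metric over all query words."""
    values = [metric(predictions[w], gold_sets[w]) for w in gold_sets]
    return sum(values) / len(values)

# Toy predictions and gold hypernyms (made-up IDs).
predictions = {"square": ["MATH-1", "ART-1", "OBJECT-1"], "fresco": ["MATH-1", "ART-1"]}
gold_sets = {"square": {"MATH-1", "OBJECT-1"}, "fresco": {"ART-1"}}

mrr = mean_over_queries(reciprocal_rank, predictions, gold_sets)
map_score = mean_over_queries(average_precision, predictions, gold_sets)
print(f"MRR={mrr:.3f} MAP={map_score:.3f}")  # MRR=0.750 MAP=0.667
```

MRR only rewards the position of the first correct hypernym, while MAP also rewards placing the remaining correct hypernyms high in the list.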

## 2.2 Discussion of results

### **Summary of the experiment:**
Below you can see the best results I achieved with the following configurations:
- **TE20** -> based on *geowac_lemmas_none_fasttextskipgram_300_5_2020*
- **TE20N** -> the normalized version of **TE20**
- **TE18** -> based on *ruscorpora_upos_skipgram_300_5_2018*
- **TE18N** -> the normalized version of **TE18**
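
The normalized variants differ only in that each averaged vector is divided by its L2 norm before the dot product is taken, which turns the dot product into cosine similarity. The toy example below (made-up vectors, hypothetical helper name `l2_normalize`) illustrates how this can change the ranking; it is one plausible reason for the difference between the runs, not a verified explanation.

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)

query = np.array([1.0, 1.0])
long_but_off = np.array([10.0, 0.0])    # large norm, 45 degrees away from the query
short_but_close = np.array([0.5, 0.5])  # small norm, same direction as the query

# The raw dot product rewards vector length.
raw = [query @ long_but_off, query @ short_but_close]

# After normalization the score depends only on direction (cosine similarity).
q = l2_normalize(query)
cosine = [q @ l2_normalize(long_but_off), q @ l2_normalize(short_but_close)]

print(raw)     # the long vector gets the higher score
print(cosine)  # the aligned vector gets the higher score
```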

### *Train.csv results:*

| Method | Nouns_Public MAP | Nouns_Public MRR | Verbs_Public MAP | Verbs_Public MRR |
| --- | --- | --- | --- | --- |
| TE20 | 0.17 | 0.18 | 0.08 | 0.09 |
| TE20N | 0.21 | 0.24 | 0.09 | 0.10 |
| TE18 | 0.20 | 0.22 | 0.09 | 0.11 |
| TE18N | 0.23 | 0.25 | 0.12 | 0.14 |
| Baseline | 0.42 | 0.45 | 0.33 | 0.38 |

**Remark:** the baseline results come from the private evaluation (they are likely not very different from the public ones).


The table shows that the models based on *geowac_lemmas_none_fasttextskipgram_300_5_2020* and *ruscorpora_upos_skipgram_300_5_2018* perform worse than the baseline model with embeddings from fasttext/vectors-crawl. Normalizing the averaged vectors helps in every setting (TE20N vs. TE20, TE18N vs. TE18), but it does not close the gap.
I attribute the gap to embeddings that do not reflect the word senses needed for this particular data.
We can therefore conclude that the choice of embeddings is a key component of this methodology.

### **Conclusion:**
As the results show, a model based on pretrained word embeddings can work quite well (see the Baseline row), provided that the embeddings suit the data.

# 3. Code

## 3.1 Requirements

In [1]:
# Some essential packages:
!pip install pymystem3==0.1.10
!pip install gensim==4.1.2
!pip install nltk


## 3.2 Download the data:

In [2]:
# Embeddings for the model:
!wget http://vectors.nlpl.eu/repository/20/214.zip
!unzip -o 214.zip -d ru_fasttext_model
!rm 214.zip

!wget http://vectors.nlpl.eu/repository/20/213.zip
!unzip -o 213.zip -d new_model
!rm 213.zip

!wget https://github.com/dialogue-evaluation/taxonomy-enrichment/archive/refs/heads/master.zip
!unzip -o master.zip -d tax_rich
!rm master.zip

# Create directory for the results:
!mkdir -p results/


In [3]:
import re
from collections import Counter

import numpy as np
import pandas as pd
import ast

from gensim.models import KeyedVectors


import nltk
from nltk.corpus import stopwords
from pymystem3 import Mystem

nltk.download("stopwords")
# A set gives fast membership checks during token filtering.
russian_stopwords = set(stopwords.words("russian"))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


# Load the data:

In [4]:
path = "tax_rich/taxonomy-enrichment-master/data/"

## Nouns:

In [5]:
part = 'nouns'

# for model:
synsets_nouns = pd.read_csv(path + f'training_data/synsets_{part}.tsv', sep='\t', encoding='utf-8')
synsets_nouns["PARENTS"] = synsets_nouns["PARENTS"].apply(lambda x: ast.literal_eval(x))
synsets_nouns["PARENT_TEXTS"] = synsets_nouns["PARENT_TEXTS"].apply(lambda x: ast.literal_eval(x))

# public data:
nouns_public = pd.read_csv(path + f'public_test/{part}_public.tsv', sep='\t', names=["TEXT"])

# private data:
nouns_private = pd.read_csv(path + f'private_test/{part}_private.tsv', sep='\t', names=["TEXT"])

## Verbs:

In [6]:
part = 'verbs'

# for model:
synsets_verbs = pd.read_csv(path + f'training_data/synsets_{part}.tsv', sep='\t', encoding='utf-8')
synsets_verbs["PARENTS"] = synsets_verbs["PARENTS"].apply(lambda x: ast.literal_eval(x))
synsets_verbs["PARENT_TEXTS"] = synsets_verbs["PARENT_TEXTS"].apply(lambda x: ast.literal_eval(x))

# public data:
verbs_public = pd.read_csv(path + f'public_test/{part}_public.tsv', sep='\t', names=["TEXT"])

# private data:
verbs_private = pd.read_csv(path + f'private_test/{part}_private.tsv', sep='\t', names=["TEXT"])

# Preprocess the data:

In [7]:
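def lemmatized_context(row, stemmer):
    """Lemmatize the TEXT field and keep only informative, word-like tokens."""
    s = row["TEXT"]
    tokens = stemmer.lemmatize(s.lower())
    tokens = [
        token for token in tokens
        if token not in russian_stopwords   # drop stopwords
        and re.match(r'[\w\-]+$', token)    # drop spaces and punctuation
        and len(token) > 1                  # drop one-character tokens
    ]
    return tokens

def get_counter(row):
    # Helper for data with a 'context' column; not called in the cells shown here.
    s = row['context']
    c = Counter(s)
    return list(set(s)), c

def get_counter_global(data):
    # Helper for data with 'word' and 'context' columns; not called in the cells shown here.
    return data.groupby("word").apply(lambda x: Counter(x["context"].sum()))

def preprocess(data):
    """Add a 'tokens' column with the cleaned lemmas of each row (modifies data in place)."""
    stemmer = Mystem()
    data['tokens'] = data.apply(lemmatized_context, axis=1, stemmer=stemmer)
    return data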


In [8]:
preprocess(synsets_nouns)
preprocess(synsets_verbs)

preprocess(nouns_public)
preprocess(nouns_private)

preprocess(verbs_public)
preprocess(verbs_private);

Installing mystem to /root/.local/bin/mystem from http://download.cdn.yandex.net/mystem/mystem-3.0-linux3.1-64bit.tar.gz
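
The filtering logic of `lemmatized_context` can be tried without downloading Mystem by swapping in a stand-in lemmatizer. In the sketch below, `toy_lemmatize`, `TOY_STOPWORDS`, and `clean_tokens` are hypothetical names introduced for illustration; in the notebook their roles are played by `Mystem().lemmatize`, the nltk Russian stopword list, and `lemmatized_context`.

```python
import re

TOY_STOPWORDS = {"и", "в", "на"}  # stand-in for the nltk Russian stopword list
WORD_RE = re.compile(r"[\w\-]+$")

def toy_lemmatize(text):
    """Stand-in for Mystem().lemmatize: splits into words and separators, no real lemmatization."""
    return re.findall(r"[\w\-]+|[^\w\-]+", text)

def clean_tokens(text, lemmatize=toy_lemmatize, stopwords=TOY_STOPWORDS):
    """Lowercase, tokenize, and keep word-like tokens longer than one character that are not stopwords."""
    tokens = lemmatize(text.lower())
    return [
        t for t in tokens
        if t not in stopwords and WORD_RE.match(t) and len(t) > 1
    ]

print(clean_tokens("Картина и фреска в музее, а"))  # ['картина', 'фреска', 'музее']
```

Separators fail the regular expression, "и" and "в" are removed as stopwords, and the one-letter token "а" is removed by the length check.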


# Model:

In [9]:
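def get_hypernyms(x, y_df):
    """Return up to 10 unique hypernym IDs taken from the 10 synsets most similar to x."""
    res_list = []
    res_set = set()
    # Rank synsets by dot product with x (equals cosine similarity for unit vectors).
    cand = np.argsort([x.dot(y) for y in y_df["emb"]])[::-1][:10]
    hts = y_df.iloc[cand]["PARENTS"].tolist()
    for cd in hts:
        for j in cd:
            if len(res_list) == 10:
                return res_list  # stop both loops once 10 hypernyms are collected
            if j not in res_set:
                res_set.add(j)
                res_list.append(j)
    return res_list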

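def save_to_file(file_name, df):
    """Write one 'TEXT<TAB>hypernym' line per predicted hypernym."""
    # utf-8 is set explicitly because the words are Cyrillic.
    with open(file_name, "w", encoding="utf-8") as f:
        for text, hypernyms in zip(df["TEXT"], df["hypernyms"]):
            for j in hypernyms:
                f.write(f"{text}\t{j}\n")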
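
The shell `zip` and `rm` steps used in the later cells could also be done in Python, which keeps the notebook portable outside Colab. The sketch below is one possible way to do it with the standard library; `write_submission` and the toy rows are hypothetical and not part of the original notebook.

```python
import csv
import os
import tempfile
import zipfile

def write_submission(rows, zip_path, tsv_name="results.tsv"):
    """Write one 'word<TAB>hypernym' line per prediction and store the file in a zip archive."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        tsv_path = os.path.join(tmp_dir, tsv_name)
        with open(tsv_path, "w", encoding="utf-8", newline="") as f:
            writer = csv.writer(f, delimiter="\t", lineterminator="\n")
            for word, hypernyms in rows:
                for hypernym in hypernyms:
                    writer.writerow([word, hypernym])
        # The temporary .tsv is deleted automatically after it has been archived.
        with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
            archive.write(tsv_path, arcname=tsv_name)

# Toy usage with a made-up word and hypernym IDs.
out_dir = tempfile.mkdtemp()
zip_path = os.path.join(out_dir, "nouns_public_results.zip")
write_submission([("ФРЕСКА", ["ART-1", "OBJECT-1"])], zip_path, "nouns_public_results.tsv")

with zipfile.ZipFile(zip_path) as archive:
    content = archive.read("nouns_public_results.tsv").decode("utf-8")
print(content)
```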
                

In [10]:
emb_18 = 'ru_fasttext_model/model.model'
emb_20 = 'new_model/model.model'
choice = emb_20

model = KeyedVectors.load(choice)

## Compute embeddings with the model:

In [11]:
def count_embeddings(x, model):
    """Average the vectors of the tokens in x and scale the result to unit L2 norm."""
    res = model[x].mean(axis=0)
    return res / np.linalg.norm(res)
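
`count_embeddings` assumes that every token list is non-empty and that every token has a vector. A more defensive variant is sketched below with a plain dict standing in for the gensim model; `safe_mean_embedding`, `toy_vectors`, and the zero-vector fallback are assumptions made for illustration, and how membership checks behave for a particular gensim model class is not verified here.

```python
import numpy as np

def safe_mean_embedding(tokens, vectors, dim):
    """Average the vectors of known tokens and L2-normalize; return zeros if nothing is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    mean = np.mean(known, axis=0)
    norm = np.linalg.norm(mean)
    return mean / norm if norm > 0 else mean

# Toy vectors (made-up values) in place of a real embedding model.
toy_vectors = {"картина": np.array([3.0, 4.0]), "фреска": np.array([3.0, 4.0])}

print(safe_mean_embedding(["картина", "фреска"], toy_vectors, dim=2))  # unit-length vector
print(safe_mean_embedding(["неизвестно"], toy_vectors, dim=2))         # all zeros
print(safe_mean_embedding([], toy_vectors, dim=2))                      # all zeros
```

A zero vector has a dot product of 0 with every synset, so such a word gets an arbitrary ranking instead of crashing the run; whether that is preferable to skipping the word is a design choice.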

In [12]:
# original:
#synsets_nouns["emb"] = synsets_nouns["tokens"].apply(lambda x: model[x].mean(axis=0))
#synsets_verbs["emb"] = synsets_verbs["tokens"].apply(lambda x: model[x].mean(axis=0))

# normalized:
synsets_nouns["emb"] = synsets_nouns["tokens"].apply(count_embeddings, model=model)
synsets_verbs["emb"] = synsets_verbs["tokens"].apply(count_embeddings, model=model)

## Public Nouns:

In [13]:
# original:
#nouns_public["emb"] = nouns_public["tokens"].apply(lambda x: model[x].mean(axis=0))

# normalized:
nouns_public["emb"] = nouns_public["tokens"].apply(count_embeddings, model=model)

nouns_public["hypernyms"] = nouns_public["emb"].apply(get_hypernyms, y_df=synsets_nouns)

file_name = "nouns_public_results.tsv"
save_to_file(file_name, nouns_public)

!zip results/nouns_public_results.zip nouns_public_results.tsv
!rm nouns_public_results.tsv

  adding: nouns_public_results.tsv (deflated 81%)


## Private Nouns:

In [14]:
# original:
#nouns_private["emb"] = nouns_private["tokens"].apply(lambda x: model[x].mean(axis=0))

# normalized:
nouns_private["emb"] = nouns_private["tokens"].apply(count_embeddings, model=model)

nouns_private["hypernyms"] = nouns_private["emb"].apply(get_hypernyms, y_df=synsets_nouns)

file_name = "nouns_private_results.tsv"
save_to_file(file_name, nouns_private)

!zip results/nouns_private_results.zip nouns_private_results.tsv
!rm nouns_private_results.tsv

  adding: nouns_private_results.tsv (deflated 82%)


## Public Verbs:

In [15]:
# original:
#verbs_public["emb"] = verbs_public["tokens"].apply(lambda x: model[x].mean(axis=0))

# normalized:
verbs_public["emb"] = verbs_public["tokens"].apply(count_embeddings, model=model)

verbs_public["hypernyms"] = verbs_public["emb"].apply(get_hypernyms, y_df=synsets_verbs)

file_name = "verbs_public_results.tsv"
save_to_file(file_name, verbs_public)

!zip results/verbs_public_results.zip verbs_public_results.tsv
!rm verbs_public_results.tsv

  adding: verbs_public_results.tsv (deflated 84%)


## Private Verbs:

In [17]:
# original:
#verbs_private["emb"] = verbs_private["tokens"].apply(lambda x: model[x].mean(axis=0))

# normalized:
verbs_private["emb"] = verbs_private["tokens"].apply(count_embeddings, model=model)

verbs_private["hypernyms"] = verbs_private["emb"].apply(get_hypernyms, y_df=synsets_verbs)

file_name = "verbs_private_results.tsv"
save_to_file(file_name, verbs_private)

!zip results/verbs_private_results.zip verbs_private_results.tsv
!rm verbs_private_results.tsv

  adding: verbs_private_results.tsv (deflated 84%)
