# 1. Information about the submission

## 1.1 Name and number of the assignment 

## **Word Sense Induction(Knowledge-free)**, Assignment 1.

## 1.2 Student name

## **Albert Sayapin**

## 1.3 Codalab user ID

## **albertSayapin**

## 1.4 Additional comments

## *This is a very cool task:)*

# 2. Technical Report

## 2.1 Methodology 

The main problem I tried to solve is **Word Sense Induction**, meaning that we want to automatically discover the senses of semantically ambiguous words from unannotated text.

Generally speaking, I used the model which consists of 3 components(Knowledge-Free):
- precalculated words embeddings;
- particular weighting scheme of embeddings;
- 1 step / 2 step clustering technique, namely how many clustering algorithms I use to get better results;

There are some essential steps in this project I had to walk through:

1. **Data preprocessing**: I had to preprocess "context" column of every dataset(train/test):
- *Lemmatized* all the words by pymystem3.Mystem stemmer and made them lowercase(Normalization step);
- *Dropped* all the words from nltk *Russian stopwords* list(As they do not bring any information usually);
- *Eliminated* all the words with *length* less than 3 and target word("word" column) as well;
- *Calculated words occurrences* both locally(for every context for every word) and globally(for every word) and left only unique tokens;
(It could help to get more elaborate weighting method)

2. **Model training**: I had to find optimal parameters for my model(using Brute Force as the size of the problem is not that big):
- *weighting scheme*: Average, Sum, Local Sum, Global Sum;
- *Normalize* context vectors or not?
- *1 step or 2 step clustering*? (Used AffinityPropagation for the first step to identify the number of clusters)
- What *clustering algorithm* to use? (AgglomerativeClustering, AffinityPropagation, KMeans, SpectralClustering)

3. **Model evaluation**: I had to use *Adjusted Rank Score* to compare different clusterings.

4. **Send the results**: the test.csv -> .zip files to CodaLab system.

Some words about why I decided to move this particular way:

*First of all*, I used fastText Skipgram model that was precalculated on GeoWAC data. It is a good choice to get results quickly and with high accuracy(especially Gensim interface). 

*Secondly*, in order to represent context out of words you have to combine them in some way to get one representation.
It is not pretty intuitive what method to use: just take the average or sum them up, weight all the words according to their relative frequency. Hence, I decided to look at them all and choose the best.

*Thirdly*, It is clustering of course. The most challenging part(and important) I think, as we do not know how to define the number of clusters beforehand. Thanks to Affinity Propagation algorithm, it can find it almost automatically. After that we can just leave it as it is or use the next clustering algorithm that needs the number of clusters.

## 2.2 Discussion of results

***Enter here** a discussion of results and a summary of the experiment. Here we want to see the final table with comparison of the baseline and all tried approaches you decided to report. Even if some method did not bring you to the top of the leaderboard, you should nevertheless indicate this result and a discussion, why, in your opinion, some approach worked and another failed. Interesting findings in the discussion will be a plus.*

### Summary of the experiment:

### Train.csv results:

Method | ARS_wiki-wiki | ARS_bts-rnc | ARS_active-dict|
--- | --- | --- | --- |
Baseline | 0.6278 | 0.2624 | 0.1764 |
WSI_?_?_?_? | ? | ? | ? |
WSI_?_?_?_? | ? | ? | ? |


### Test.csv results:

Method | ARS_wiki-wiki | ARS_bts-rnc | ARS_active-dict|
--- | --- | --- | --- |
Baseline | 0.6278 | 0.2624 | 0.1764 |
WSI_?_?_?_? | ? | ? | ? |
WSI_?_?_?_? | ? | ? | ? |

### Conclusion:

# 3. Code

## 3.1 Requirements

In [1]:
# Some essential packages:
!pip install pymystem3==0.1.10
!pip install gensim
!pip install sklearn
!pip install pandas
!pip install nltk
!pip install wget

# Embeddings for the model:
!wget http://vectors.nlpl.eu/repository/20/214.zip
!unzip 214.zip -d ru_fasttext_model
!rm 214.zip
!mkdir results

--2021-11-26 19:20:37--  http://vectors.nlpl.eu/repository/20/214.zip
Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.181
Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.181|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1920218982 (1,8G) [application/zip]
Saving to: ‘214.zip’


2021-11-26 19:37:59 (1,76 MB/s) - ‘214.zip’ saved [1920218982/1920218982]

Archive:  214.zip
  inflating: ru_fasttext_model/meta.json  
  inflating: ru_fasttext_model/model.model  
  inflating: ru_fasttext_model/model.model.vectors_ngrams.npy  
  inflating: ru_fasttext_model/model.model.vectors.npy  
  inflating: ru_fasttext_model/model.model.vectors_vocab.npy  
  inflating: ru_fasttext_model/README  


## 3.2 Download the data:

In [2]:
# To get train/test.csv:
!git clone https://github.com/nlpub/russe-wsi-kit.git

Cloning into 'russe-wsi-kit'...
remote: Enumerating objects: 148, done.[K
remote: Counting objects: 100% (32/32), done.[K
remote: Compressing objects: 100% (28/28), done.[K
remote: Total 148 (delta 4), reused 22 (delta 4), pack-reused 116[K
Receiving objects: 100% (148/148), 3.83 MiB | 8.02 MiB/s, done.
Resolving deltas: 100% (59/59), done.


## 3.3 Data Preprocessing: 

In [3]:
import re
from collections import Counter

import pandas as pd
import nltk
from pymystem3 import Mystem

from nltk.corpus import stopwords

nltk.download("stopwords")
russian_stopwords = stopwords.words("russian")


def lemmatized_context(row, stemmer):
    s = row['context']
    target = row['word']
    
    tokens = stemmer.lemmatize(s.lower())
    tokens = [token for token in tokens if token not in russian_stopwords\
                and token != " " \
                and re.match('[\w\-]+$', token)\
                and (len(token) > 2)\
                and token != target]
    return tokens

def get_counter(row):
    s = row['context']
    c = Counter(s)
    return list(set(s)), c

def get_counter_global(data):
    return data.groupby("word").apply(lambda x: Counter(x["context"].sum()))

def read_preprocess(data_path):
    
    # Read the data:
    data = pd.read_csv(data_path, sep='\t')

    # Essential columns:
    cols = [
        'context_id', 'word', 'gold_sense_id',
        'predict_sense_id', 'context',
    ]

    # Leave only essential columns:
    data = data[cols]
    
    # Lemmatization:
    stemmer = Mystem()
    data['context'] = data.apply(lemmatized_context, 1, stemmer=stemmer)

    # Leave unique and count all:
    data["embedding_need"] = data.apply(get_counter, axis=1)

    # Calculate frequency based on all contexts:
    word2glob_freq = get_counter_global(data).to_dict()

    return data, word2glob_freq

[nltk_data] Downloading package stopwords to /home/albert/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
# Define necessary paths:
path = "russe-wsi-kit/data/main/"

wiki = "wiki-wiki/"
bts = "bts-rnc/"
actd = "active-dict/"

train_file = "train.csv"
test_file = "test.csv"


# Prepare the data:
wiki_train, wiki_train_global = read_preprocess(path + wiki + train_file)
wiki_test, wiki_test_global = read_preprocess(path + wiki + test_file)

bts_train, bts_train_global = read_preprocess(path + bts + train_file)
bts_test, bts_test_global = read_preprocess(path + bts + test_file)

actd_train, actd_train_global = read_preprocess(path + actd + train_file)
actd_test, actd_test_global = read_preprocess(path + actd + test_file)

wiki_train.head()

Unnamed: 0,context_id,word,gold_sense_id,predict_sense_id,context,embedding_need
0,1,замок,1,,"[владимир, мономах, любеч, многочисленный, укр...","([ограда, строительство, лишь, часть, оборонит..."
1,2,замок,1,,"[шильонский, шильйон, известный, русскоязычный...","([шильонский, разный, время, русскоязычный, эл..."
2,3,замок,1,,"[проведение, архитектурный, археологический, р...","([музей, консультация, исторический, управлени..."
3,4,замок,1,,"[топь, белокуров, легенда, завещание, мавра, ю...","([топь, рождение, янтарный, мавра, завещание, ..."
4,5,замок,1,,"[великий, князь, литовский, гедимин, успешный,...","([кейстут, век, тракай, место, рождаться, окол..."


## 3.4 My method of text processing

In [5]:
import operator

import numpy as np
from sklearn.cluster import(
    AgglomerativeClustering,
    AffinityPropagation,
    KMeans,
    SpectralClustering,
)
from sklearn.metrics import adjusted_rand_score
import matplotlib.pyplot as plt

import gensim

import warnings
warnings.filterwarnings('ignore')


def give_emb_local_inv(model, context, counter, size, norm=False):
    """     
        Calculates the context vector based on word embeddings and
        each word is weighted by its relative frequency within one context. 
        
        Parameters:
            - model: dictionary like structure, gives embedding by the word: model[word];
            - context: list of words defining the context;
            - counter: dictionary like structure, gives the number of times 
                a word was seen in this context by the key: model[key];
            - size: size of word embeddings;
            - norm: True - normalize, False - do not;
        
        Result: array
    """

    if len(context) == 0:
        return np.zeros(size)

    weights = np.array([1 / counter[word] for word in context])
    ov = np.average(model[context], axis=0, weights=weights)
    if norm: 
        ov /= np.linalg.norm(ov)
    return ov


def give_emb_average(model, context, counter, size, norm=False):
    """     
        Calculates the context vector based on average of word embeddings.

        Parameters:
            - model: dictionary like structure, gives embedding by the word: model[word];
            - context: list of words defining the context;
            - counter: dictionary like structure, gives the number of times 
                a word was seen in this context by the key: model[key];
            - size: size of word embeddings;
            - norm: True - normalize, False - do not;
        
        Result: array  
    """

    if len(context) == 0:
        return np.zeros(size)

    ov = model[context].mean(axis=0)
    if norm: 
        ov /= np.linalg.norm(ov)
    return ov


def give_emb_sum(model, context, counter, size, norm=False):
    """     
        Calculates the context vector based on sum of word embeddings.

        Parameters:
            - model: dictionary like structure, gives embedding by the word: model[word];
            - context: list of words defining the context;
            - counter: dictionary like structure, gives the number of times 
                a word was seen in this context by the key: model[key];
            - size: size of word embeddings;
            - norm: True - normalize, False - do not;
        
        Result: array  
    """

    if len(context) == 0:
        return np.zeros(size)
        
    ov = model[context].sum(axis=0)
    if norm: 
        ov /= np.linalg.norm(ov)
    return ov


def give_result(model, data, clus_alg, emb_alg, norm, train=True):
    """    
        Calculates the context vector based on sum of word embeddings.

        Parameters:
            - model: dictionary like structure, gives embedding by the word: model[word];
            - data: Pandas Dataframe that has necessary columns:
                "embedding_need", "word", "gold_sense_id";
            - clus_alg: clustering function: e.g. 'Kmeans', 'SpectralClustering';
            - emb_alg: algorithm to get context vector;
            - norm: True - normalize, False - do not;
            - train: True - evaluate with ARS, False - do not;
        
        Result: array  
    """

    emb = np.array([emb_alg(model, context, counter, 300, norm) for context, counter in data["embedding_need"]])
    words = data["word"].unique().tolist()
    ars = []

    labels = []
    for i, word in enumerate(words):
        mask = data["word"] == word
        clustering = clus_alg.fit(emb[mask])
        labels += clustering.labels_.tolist()
        if train:
            ars.append(adjusted_rand_score(clustering.labels_, data.query("word == @word")["gold_sense_id"]))

    return np.mean(ars), labels


def give_2step_result(model, data, clus_alg, emb_alg, norm, pref=-2, train=True):
    """    
        Calculates the context vector based on sum of word embeddings.

        Parameters:
            - model: dictionary like structure, gives embedding by the word: model[word];
            - data: Pandas Dataframe that has necessary columns:
                "embedding_need", "word", "gold_sense_id";
            - clus_alg: clustering function: e.g. 'Kmeans', 'SpectralClustering';
            - emb_alg: algorithm to get context vector;
            - norm: True - normalize, False - do not;
            - pref: preference parameter for AffinityPropagation algorithm;
            - train: True - evaluate with ARS, False - do not;
        
        Result: array  
    """

    clus_find_n = AffinityPropagation(preference=pref, random_state=42)
    emb = np.array([emb_alg(model, context, counter, 300, norm) for context, counter in data["embedding_need"]])

    words = data["word"].unique().tolist()
    ars = []

    labels = []
    for i, word in enumerate(words):
        mask = data["word"] == word
        clustering = clus_find_n.fit(emb[mask])
        clustering = clus_alg(len(set(clustering.labels_))).fit(emb[mask])
        labels += clustering.labels_.tolist()
        if train:
            ars.append(adjusted_rand_score(clustering.labels_, data.query("word == @word")["gold_sense_id"]))

    return np.mean(ars), labels

In [6]:
class WSIModel:
    """ Word Sense Induction model based on pretrained word embeddings and clustering"""

    def __init__(self, pretrained_model_path):
        self.model = gensim.models.KeyedVectors.load(pretrained_model_path)
        self.size = 300#self.model.
        self.rs = 42
        self.embedded = None
        self.params = {
            "is_norm": None,
            "emb_func": None,
            "preference": None,
            "clust_func": None,
        }
        self.best_train_labels = None
        self.test_labels = None       
        self.global_dict = {}

    def _update_params(self, is_norm, emb_func, pref, clust_func):
        self.params["is_norm"] = is_norm
        self.params["emb_func"] = emb_func
        self.params["preference"] = pref
        self.params["clust_func"] = clust_func

    def _print_progess(self, emb_func, norm, clust_func, pref, ars_step, steps):
        print(
            f"Embedding Method: {emb_func}; Is_norm = {norm}; " 
            + f"Clustering = {clust_func} ({steps} step); Preference = {pref}"
            + f"\n***Mean ARS = {ars_step}***\n"
        )
    
    def _update_global_dict(self, data, labels):
        pass


    def fit(self, train_data, nv=(True,), print_all=True, app=(-3, -2, -1)):
        """    
            Fit the model parameters and returns train_data labels.

            Parameters:
                - train_data: Pandas Dataframe that has necessary columns:
                    "embedding_need", "word", "gold_sense_id";
                - nv: normalization parameters True/False ;
                - print_all: print the results of every model to stdout;
                - app: affinity propagation preference:
                    iterable structure that has preferences for AP algorithm;
        
            Result: train_data clustering labels,
                the order cooresponds to rows of train_data  
        """

        embedding_funcs = (give_emb_local_inv, give_emb_average, give_emb_sum)
        names_embedding_funcs = ("Local_Average", "Average", "Sum")

        clustering_funcs = (
            KMeans, AgglomerativeClustering, SpectralClustering,
        )
        names_clustering_funcs = (
            "KMeans", "AgglomerativeClustering", "SpectralClustering",
        )

        norm_variants = nv
        affinity_propagation_preference = app

        best_train_labels = None
        ars_init = 0.0
        for is_norm in norm_variants:
            for emb_func, name_emb_func in zip(embedding_funcs, names_embedding_funcs):
                for pref in affinity_propagation_preference:
                    for clust_func, name_clust_func in zip(clustering_funcs, names_clustering_funcs):
                        
                        # Get results for two step algorithm:
                        ars_2step, labels_2step = give_2step_result(
                            self.model,
                            train_data,
                            clust_func,
                            emb_func,
                            is_norm,
                            pref,
                        )
                        if print_all:
                            self._print_progess(
                                name_emb_func, is_norm,
                                name_clust_func, pref, ars_2step, 2,
                            )
                        
                        if ars_init <= ars_2step:
                            ars_init = ars_2step
                            best_train_labels = labels_2step
                            # Update parameters of the model:
                            self._update_params(
                                is_norm, emb_func,
                                pref, clust_func,
                            )
                            self._print_progess(
                                name_emb_func, is_norm,
                                name_clust_func, pref, ars_2step, 2,
                            )
                    
                    # Get results for one step algorithm:
                    ars_1step, labels_1step = give_result(
                            self.model,
                            train_data,
                            AffinityPropagation(preference=pref, random_state=self.rs),
                            emb_func,
                            is_norm,
                    )
                    if print_all:
                        self._print_progess(
                            name_emb_func, is_norm,
                            "AffinityPropagation", pref, ars_2step, 1,
                        )

                    if ars_init <= ars_1step:
                        ars_init = ars_1step
                        best_train_labels = labels_1step
                        # Update parameters of the model:
                        self._update_params(
                            is_norm, emb_func,
                            pref, -1,
                        )
                        self._print_progess(
                            name_emb_func, is_norm,
                            "AffinityPropagation", pref, ars_2step, 1,
                        )

        self._update_global_dict(train_data, best_train_labels)
        self.best_train_labels = best_train_labels
        return best_train_labels


    def predict(self, test_data):
        """    
            Predict the labels for test_data.

            Parameters:
                - test_data: Pandas Dataframe that has necessary columns:
                    "embedding_need", "word";

            Result: test_data clustering labels,
                the order cooresponds to rows of test_data  
        """

        if self.params["clust_func"] == -1:
            clust_func = AffinityPropagation(
                preference=self.params["preference"],
                random_state=self.rs,
            )
            _, labels = give_result(
                self.model,
                test_data,
                clust_func,
                self.params["emb_func"],
                self.params["is_norm"],
                train=False
            )
        else:
            _, labels = give_2step_result(
                self.model,
                test_data,
                self.params["clust_func"],
                self.params["emb_func"],
                self.params["is_norm"],
                self.params["preference"],
                train=False,
            )
        
        self._update_global_dict(test_data, labels)
        self.test_labels = labels
        return labels


    def get_cluster_label(word, context):
        """Gives label by word and context list"""
        
        default_cluster = 0
        counter = Counter(s)
        cxt = list(set(context))
        context_emb = self.params["emb_func"](self.model, cxt, counter, self.size, self.params["is_norm"])
        
        if word not in self.global_dict.keys():
            self.global_dict[word] = {default_cluster:context_emb}
            return default_cluster
        
        candidates = self.global_dict[word]
        result = {clust_label: context_emb.dot() for clust_label, av_emb in candidates.items()}
        return max(result.items(), key=operator.itemgetter(1))[0]

## 3.5 Train and evaluate the models:

In [7]:
cols = ["context_id", "word", "gold_sense_id", "predict_sense_id"]

## wiki-wiki dataset:

In [8]:
# Train/Evaluate/Get test for CodaLab:

wsi_model = WSIModel('ru_fasttext_model/model.model')
wiki_train_labels = wsi_model.fit(wiki_train, print_all=False)
wiki_test_labels = wsi_model.predict(wiki_test)

Embedding Method: Local_Average; Is_norm = True; Clustering = KMeans (2 step); Preference = -3
***Mean ARS = 0.6919861201365022***

Embedding Method: Local_Average; Is_norm = True; Clustering = AgglomerativeClustering (2 step); Preference = -3
***Mean ARS = 0.7508223416134545***

Embedding Method: Local_Average; Is_norm = True; Clustering = AgglomerativeClustering (2 step); Preference = -2
***Mean ARS = 0.7851293039268663***

Embedding Method: Average; Is_norm = True; Clustering = AgglomerativeClustering (2 step); Preference = -3
***Mean ARS = 0.7957682490549647***

Embedding Method: Sum; Is_norm = True; Clustering = AgglomerativeClustering (2 step); Preference = -3
***Mean ARS = 0.7957682490549647***



In [9]:
# Save the results:

wiki_test["predict_sense_id"] = wiki_test_labels
wiki_test[cols].to_csv("test.csv", sep='\t', index=None)
!zip wiki-wiki.zip test.csv

  adding: test.csv (deflated 85%)


## bts-rnc dataset:

In [10]:
# Train/Evaluate/Get test for CodaLab:

wsi_model = WSIModel('ru_fasttext_model/model.model')
bts_train_labels = wsi_model.fit(bts_train, nv=(False, True), print_all=False)
bts_test_labels = wsi_model.predict(bts_test)

Embedding Method: Local_Average; Is_norm = False; Clustering = KMeans (2 step); Preference = -3
***Mean ARS = 0.0***

Embedding Method: Local_Average; Is_norm = False; Clustering = AgglomerativeClustering (2 step); Preference = -3
***Mean ARS = 0.0***

Embedding Method: Local_Average; Is_norm = False; Clustering = AffinityPropagation (1 step); Preference = -3
***Mean ARS = -0.0026280260061273316***

Embedding Method: Local_Average; Is_norm = False; Clustering = KMeans (2 step); Preference = -2
***Mean ARS = 0.0***

Embedding Method: Local_Average; Is_norm = False; Clustering = AgglomerativeClustering (2 step); Preference = -2
***Mean ARS = 0.0***

Embedding Method: Local_Average; Is_norm = False; Clustering = SpectralClustering (2 step); Preference = -2
***Mean ARS = 0.000710873017918456***

Embedding Method: Average; Is_norm = False; Clustering = KMeans (2 step); Preference = -3
***Mean ARS = 0.08097254288020056***

Embedding Method: Average; Is_norm = False; Clustering = Agglomerativ

In [11]:
# Save the results:

bts_test["predict_sense_id"] = bts_test_labels
bts_test[cols].to_csv("bts_test.csv", sep='\t', index=None)
!zip bts.zip bts_test.csv

  adding: bts_test.csv (deflated 86%)


## active-dict dataset:

In [12]:
# Train/Evaluate/Get test for CodaLab:

wsi_model = WSIModel('ru_fasttext_model/model.model')
actd_train_labels = wsi_model.fit(actd_train, nv=(False, True), print_all=False, app=(-5, -4, -3, -2, -1))
actd_test_labels = wsi_model.predict(actd_test)

Embedding Method: Local_Average; Is_norm = False; Clustering = KMeans (2 step); Preference = -5
***Mean ARS = 0.00019790524438484513***

Embedding Method: Local_Average; Is_norm = False; Clustering = AgglomerativeClustering (2 step); Preference = -5
***Mean ARS = 0.00019790524438484513***

Embedding Method: Local_Average; Is_norm = False; Clustering = AffinityPropagation (1 step); Preference = -5
***Mean ARS = -0.0013731907532188628***

Embedding Method: Local_Average; Is_norm = False; Clustering = SpectralClustering (2 step); Preference = -4
***Mean ARS = 0.00506626049585486***

Embedding Method: Average; Is_norm = False; Clustering = KMeans (2 step); Preference = -5
***Mean ARS = 0.10785270897836366***

Embedding Method: Average; Is_norm = False; Clustering = AgglomerativeClustering (2 step); Preference = -4
***Mean ARS = 0.11175089441780814***

Embedding Method: Local_Average; Is_norm = True; Clustering = KMeans (2 step); Preference = -1
***Mean ARS = 0.23477566327576707***

Embeddi

In [13]:
# Save the results:

actd_test["predict_sense_id"] = actd_test_labels
actd_test[cols].to_csv("actd.csv", sep='\t', index=None)
!zip actd.zip actd.csv

  adding: actd.csv (deflated 85%)


In [59]:
a = np.random.rand(4, 5)
a

array([[0.24682799, 0.41761101, 0.70951733, 0.59810252, 0.79420665],
       [0.55941436, 0.87442077, 0.98456346, 0.81484982, 0.78967382],
       [0.03175679, 0.13472281, 0.70276014, 0.7575814 , 0.75208123],
       [0.79515903, 0.91340694, 0.3668862 , 0.84580481, 0.2110145 ]])

In [63]:
a /= a.sum(axis=0)
a

array([[0.15113538, 0.17845393, 0.25672481, 0.1982876 , 0.31182335],
       [0.34253532, 0.37365829, 0.35624481, 0.27014534, 0.31004366],
       [0.01944502, 0.05756988, 0.25427986, 0.25115927, 0.29528397],
       [0.48688428, 0.3903179 , 0.13275052, 0.28040779, 0.08284903]])

In [67]:
np.average(a, axis=0, weights=np.array([1, 2, 1, 1]))

array([0.26850706, 0.27473166, 0.27124896, 0.25402907, 0.26200873])