<a href="https://colab.research.google.com/github/LUMII-AILab/NLP_Course/blob/main/notebooks/w2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Semantics and Word Embeddings
## Building and using a Word2vec model

### Training
Preprocessing: download the text and split it into lines

In [None]:
# ! pip install gensim
import urllib.request
import re
import multiprocessing
from time import time
from gensim.models import Word2Vec


# change to your own path if you have downloaded the file locally
# url = 'https://raw.githubusercontent.com/alexisperrier/intro2nlp/master/data/Shakespeare_alllines.txt'
url = "https://repository.clarin.lv/repository/xmlui/bitstream/handle/20.500.12574/41/rainis_v20180716.txt?sequence=1&isAllowed=y"
# read file into list of lines
lines = urllib.request.urlopen(url).read().decode('utf-8').split("\n")

Text preprocessing: tokenization

In [None]:
sentences = []

for line in lines:
    # remove punctuation
    line = re.sub(r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]', '', line).strip()
    # line = line.lower()  # uncomment to fold case

    # simple tokenizer: runs of word characters
    tokens = re.findall(r'\b\w+\b', line)

    # keep only lines with at least two tokens; a single word has no context
    if len(tokens) > 1:
        sentences.append(tokens)
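
To see what the two regular expressions above do to a single line, here is a minimal sketch. The helper name `tokenize` and the sample sentence are assumptions made for illustration; they are not part of the notebook or the corpus.

```python
import re

# Same two regular expressions as in the tokenization cell above.
PUNCT = r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]'


def tokenize(line):
    """Strip the listed punctuation, then return the runs of word characters."""
    line = re.sub(PUNCT, '', line).strip()
    return re.findall(r'\b\w+\b', line)


sample = 'Saule, mīla un pavasaris!'  # hypothetical example line
tokens = tokenize(sample)
print(tokens)

# Hyphens are deleted rather than replaced by a space,
# so the two parts of a hyphenated word are glued together.
print(tokenize('a-b'))
```

Because punctuation is removed rather than replaced by a space, hyphenated words end up as a single token. Whether that is desirable depends on the corpus.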



Model training

The parameters:

* min_count = int - Ignores all words with total absolute frequency lower than this - (2, 100)
* window = int - The maximum distance between the current and predicted word within a sentence, i.e. up to window words to the left and window words to the right of the target - (2, 10)
* vector_size = int - Dimensionality of the feature vectors (called size in gensim versions before 4.0) - (50, 300)
* sample = float - The threshold for configuring which higher-frequency words are randomly downsampled. Highly influential. - (0, 1e-5)
* alpha = float - The initial learning rate - (0.01, 0.05)
* min_alpha = float - The learning rate drops linearly to min_alpha as training progresses. To set it: alpha - (min_alpha * epochs) ~ 0.00
* negative = int - If > 0, negative sampling is used, and the value specifies how many "noise words" are drawn. If set to 0, no negative sampling is used. - (5, 20)
* workers = int - Use this many worker threads to train the model (faster training on multicore machines)
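
The effect of window is easiest to see on a toy sentence. The sketch below lists the (target, context) pairs that a skip-gram model would be trained on. The function `skipgram_pairs` and the sentence are hypothetical, and this is the idealized picture only: gensim's real training loop also applies subsampling and, as far as I know, randomly shrinks the window for each target.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


sentence = ["saule", "spīd", "pār", "jūru"]  # hypothetical toy sentence
for pair in skipgram_pairs(sentence, window=1):
    print(pair)
```

A larger window yields more pairs per target and tends to capture topical rather than syntactic similarity.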



In [None]:

w2v_model = Word2Vec(
    sentences,
    min_count=3,     # ignore words that appear fewer than 3 times
    vector_size=50,  # dimensionality of the word embeddings
    sg=1,            # 1 = skip-gram, 0 = CBOW
    window=5,        # context window size during training
    epochs=40)       # number of passes over the corpus

Alternative: training in several steps

In [None]:
# import multiprocessing
# from gensim.models import Word2Vec


cores = multiprocessing.cpu_count()

w2v_model = Word2Vec(min_count=20,
                     window=2,
                     vector_size=50,
                     sample=6e-5,
                     alpha=0.03,
                     min_alpha=0.0007,
                     negative=20,
                     workers=max(1, cores - 1))  # keep one core free, but never drop to 0 workers


t = time()

w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

In [None]:
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)
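
The linear drop from alpha to min_alpha described in the parameter list can be sketched per epoch as follows. The function `linear_learning_rates` is a hypothetical helper, and the per-epoch view is a simplification: gensim updates the rate continuously as training progresses, not once per epoch.

```python
def linear_learning_rates(alpha, min_alpha, epochs):
    """Approximate learning rate for each epoch, falling linearly to min_alpha."""
    if epochs == 1:
        return [alpha]
    step = (alpha - min_alpha) / (epochs - 1)
    return [alpha - step * epoch for epoch in range(epochs)]


# Values taken from the step-by-step training cells above.
rates = linear_learning_rates(alpha=0.03, min_alpha=0.0007, epochs=30)
print(rates[0], rates[-1])
```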

Using the model

In [None]:
w2v_model.wv.most_similar('mīla')
# w2v_model.wv.most_similar('Romeo')

In [None]:
w2v_model.wv.most_similar(positive=["saule"])

In [None]:
w2v_model.wv.most_similar(negative=["saule"])

In [None]:
w2v_model.wv.similarity("saule", "pavasaris")
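
The similarity method returns the cosine similarity of the two word vectors. A minimal sketch of that computation in plain Python, using made-up 3-dimensional vectors (an assumption for illustration, not values from the trained model):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, chosen so the results are easy to check by hand.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a
c = [0.0, 0.0, 3.0]  # orthogonal to a

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(a, c))  # 0: nothing in common
```

Applied to two vectors taken from `w2v_model.wv`, this should agree with `similarity` up to floating-point rounding.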

In [None]:
w2v_model.wv.doesnt_match(["ziema", "pavasaris", "saule"])

Further materials, some more serious than others:

* TensorFlow Word2Vec tutorial: https://www.tensorflow.org/text/tutorials/word2vec
* Gensim Word2Vec tutorial: https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial/notebook#Getting-Started




## Visualisation of word similarity
We use the Matplotlib and Seaborn libraries for visualisation. <br>
From scikit-learn we use PCA and t-SNE, which map each word vector (see vector_size in Word2vec) to a point in 2D space. <br>
These dimensionality reduction methods preserve the similarities between individual words as far as possible.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set_style("darkgrid")

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

In [None]:
def tsnescatterplot(model, word, list_names):
    """ Plot in seaborn the results from the t-SNE dimensionality reduction algorithm of the vectors of a query word,
    its list of most similar words, and a list of words.
    """
    arrays = np.empty((0, model.wv.vector_size), dtype='f')
    word_labels = [word]
    color_list  = ['red']

    # adds the vector of the query word
    arrays = np.append(arrays, model.wv[[word]], axis=0)

    # gets list of most similar words
    close_words = model.wv.most_similar([word])

    # adds the vector for each of the closest words to the array
    for wrd_score in close_words:
        wrd_vector = model.wv[[wrd_score[0]]]
        word_labels.append(wrd_score[0])
        color_list.append('blue')
        arrays = np.append(arrays, wrd_vector, axis=0)

    # adds the vector for each of the words from list_names to the array
    for wrd in list_names:
        wrd_vector = model.wv[[wrd]]
        word_labels.append(wrd)
        color_list.append('green')
        arrays = np.append(arrays, wrd_vector, axis=0)

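    # Reduces the dimensionality with PCA before t-SNE; n_components cannot
    # exceed the number of plotted words or the vector size
    n_components = min(21, arrays.shape[0], arrays.shape[1])
    reduc = PCA(n_components=n_components).fit_transform(arrays)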

    # Finds t-SNE coordinates for 2 dimensions
    np.set_printoptions(suppress=True)

    Y = TSNE(n_components=2, random_state=0, perplexity=15).fit_transform(reduc)

    # Sets everything up to plot
    df = pd.DataFrame({'x': [x for x in Y[:, 0]],
                       'y': [y for y in Y[:, 1]],
                       'words': word_labels,
                       'color': color_list})

    fig, _ = plt.subplots()
    fig.set_size_inches(9, 9)

    # Basic plot
    p1 = sns.regplot(data=df,
                     x="x",
                     y="y",
                     fit_reg=False,
                     marker="o",
                     scatter_kws={'s': 40,
                                  'facecolors': df['color']
                                 }
                    )

    # Adds annotations one by one with a loop
    for line in range(0, df.shape[0]):
         p1.text(df["x"][line],
                 df['y'][line],
                 '  ' + df["words"][line].title(),
                 horizontalalignment='left',
                 verticalalignment='bottom', size='medium',
                 color=df['color'][line],
                 weight='normal'
                ).set_size(15)


    plt.xlim(Y[:, 0].min()-np.absolute(Y[:, 0].min())*0.2, Y[:, 0].max()+np.absolute(Y[:, 0].max())*0.2)
    plt.ylim(Y[:, 1].min()-np.absolute(Y[:, 1].min())*0.2, Y[:, 1].max()+np.absolute(Y[:, 1].max())*0.2)

    plt.title('t-SNE visualization for {}'.format(word.title()))

In [None]:
# Red: Original word
# Blue: 10 closest word matches to the original word
# Green: 10 furthest word matches to the original word

word = "saule"
tsnescatterplot(w2v_model, word, [i[0] for i in w2v_model.wv.most_similar(negative=[word])])

# TF-IDF
### term frequency–inverse document frequency

First we need a set of individual files on which to run the tf-idf analysis. We use the Shakespeare corpus referenced earlier and split it into plays by detecting where a new play starts, i.e. where a first act begins.

In [None]:
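import urllib.request

url = 'https://raw.githubusercontent.com/alexisperrier/intro2nlp/master/data/Shakespeare_alllines.txt'
lines = urllib.request.urlopen(url).read().decode('utf-8').split("\n")
work_count = 0
outfile = None
for line in lines:
    # a line reading "ACT I" (quotes included) marks the start of a new play
    if line == "\"ACT I\"":
        if outfile is not None:
            outfile.close()
        work_count += 1
        outfile = open("work_"+str(work_count)+".txt", "w")
    # lines before the first "ACT I" belong to no play and are skipped
    if outfile is not None:
        outfile.write(line+"\n")
if outfile is not None:
    outfile.close()
print("Divided into {} files".format(work_count))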

<br> We compute the tf-idf score with functions written by hand

In [None]:
# Document frequency: number of files that contain the word
def number_of_docs_with_term(word):
    found = 0
    for i in range(1, work_count+1):
        with open("work_"+str(i)+".txt", "r") as infile:
            for line in infile:
                line = re.sub(r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]','',line).strip()
                tokens = re.findall(r'\b\w+\b', line)
                if word in tokens:
                    found += 1
                    break
    return found


def doc_appearance(word):
    found = number_of_docs_with_term(word)
    print("Word \"{}\" was found in {} out of {} files.".format(word, found, work_count))

doc_appearance("king")
doc_appearance("Romeo")
doc_appearance("said")

In [None]:
# Term frequency: number of occurrences of the word in one file
def word_frequency(word, doc_id):
    freq = 0
    with open("work_"+str(doc_id)+".txt", "r") as infile:
        for line in infile:
            line = re.sub(r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]','',line).strip()
            tokens = re.findall(r'\b\w+\b', line)
            freq += tokens.count(word)
    return freq


def frequency(word, doc_id):
    w_freq = word_frequency(word, doc_id)
    filename = "work_"+str(doc_id)+".txt"
    print("Found word \"{}\" in file \"{}\" a total of {} times.".format(word, filename, w_freq))

frequency("king", 28)
frequency("Romeo", 28)
frequency("said", 28)

In [None]:
# Calculating tf-idf using formula
import numpy as np
def tf_idf(word, doc_id):
    inverse_doc_freq = np.log((1+work_count)/(number_of_docs_with_term(word)+1))+1
    score = word_frequency(word, doc_id) * inverse_doc_freq
    return score


def print_score(word, doc_id):
    filename = "work_"+str(doc_id)+".txt"
    score = tf_idf(word, doc_id)
    print("The tf-idf score for word \"{}\" in file \"{}\" is: {}".format(word, filename, score))

print_score("king", 28)
print_score("Romeo", 28)
print_score("said", 28)
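
The formula used in `tf_idf` is tf × (ln((1 + N) / (1 + df)) + 1), where N is the number of documents and df the number of documents containing the word. To check it by hand, here is a minimal in-memory sketch on a made-up corpus. The dictionary `docs` and the function `tf_idf_toy` are assumptions for illustration and do not touch the work_*.txt files.

```python
import math

# Hypothetical toy corpus: three tiny "documents", already tokenized.
docs = {
    1: ["king", "queen", "king"],
    2: ["romeo", "juliet"],
    3: ["king", "romeo", "romeo", "romeo"],
}


def tf_idf_toy(word, doc_id):
    """Raw term count times the smoothed inverse document frequency."""
    n_docs = len(docs)
    doc_freq = sum(1 for tokens in docs.values() if word in tokens)
    idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
    tf = docs[doc_id].count(word)
    return tf * idf


print(tf_idf_toy("king", 1))    # tf = 2, df = 2
print(tf_idf_toy("juliet", 2))  # tf = 1, df = 1: rarer word, higher idf
print(tf_idf_toy("queen", 3))   # tf = 0, so the score is 0
```

A word that occurs in fewer documents gets a larger idf factor, so a single occurrence of a rare word can outweigh several occurrences of a common one.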

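<br> <br>
Alternatively, we can use the scikit-learn library to compute tf-idf. <br> <br>
TfidfVectorizer applies idf smoothing (smooth_idf=True), which matches the +1 terms in the formula above, and additionally L2-normalizes each document vector (norm='l2'). By default it also lowercases the text, and here it drops English stop words. <br>
The tf-idf results obtained with the library therefore differ from those calculated earlier.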
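
What norm='l2' does to a document's scores can be sketched in a few lines. The helper `l2_normalize` and the raw scores are hypothetical. The sketch only illustrates the scaling step, not scikit-learn's internals.

```python
import math


def l2_normalize(scores):
    """Scale a dict of term scores so the vector has Euclidean length 1."""
    length = math.sqrt(sum(s * s for s in scores.values()))
    if length == 0:
        return dict(scores)  # an all-zero vector stays as it is
    return {term: s / length for term, s in scores.items()}


raw = {"king": 3.0, "romeo": 4.0}  # hypothetical raw tf-idf scores
normalized = l2_normalize(raw)
print(normalized)
```

After normalization the scores of one document are comparable across documents of different lengths, but they no longer equal tf × idf.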

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd


text_files = []
for i in range(1,work_count+1):
    text_files.append("work_"+str(i)+".txt")

text_titles = [text.split(".")[0] for text in text_files]


# Get tf-idf score data for word
def vectorizer_score(word, doc_id):

    # Initialize and run TfidfVectorizer
    tfidf_vectorizer = TfidfVectorizer(input='filename', stop_words='english')
    tfidf_vector = tfidf_vectorizer.fit_transform(text_files)

    # Create a DataFrame out of the resulting tf–idf vector
    tfidf_df = pd.DataFrame(tfidf_vector.toarray(), index=text_titles, columns=tfidf_vectorizer.get_feature_names_out())
    tfidf_df = tfidf_df.stack().reset_index()
    tfidf_df = tfidf_df.rename(columns={0:'tfidf', 'level_0': 'document','level_1': 'term', 'level_2': 'term'})


    tfidf_df = tfidf_df[tfidf_df['document'] == 'work_'+str(doc_id)]
    print(tfidf_df[tfidf_df['term'] == word], "\n")


vectorizer_score("king", 28)
vectorizer_score("romeo", 28)
vectorizer_score("said", 28)


# Fasttext

In [None]:
! pip install fasttext
# git clone https://github.com/facebookresearch/fastText.git
import fasttext


In [None]:
# Skip-gram model
model = fasttext.train_unsupervised('work_1.txt', model='skipgram')

In [None]:
print(model.words)   # list of words in dictionary
print(model['king']) # get the vector of the word 'king'

In [None]:
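# train_supervised expects labelled data: each line must contain at least one
# label with the __label__ prefix. work_1.txt has no labels, so replace it with
# a labelled file for this cell to be meaningful.
model = fasttext.train_supervised('work_1.txt')
print(model.words)
print(model.labels)
print(model['wind']) # get the vector of the word 'wind'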