<a href="https://colab.research.google.com/github/LUMII-AILab/NLP_Course/blob/main/notebooks/w2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Building and using a Word2vec model

Preprocessing: download the text and split it into lines

In [1]:
import urllib.request
import re
from time import time
from gensim.models import Word2Vec


# change to a file:// URL if you have downloaded the file locally
url = 'https://raw.githubusercontent.com/alexisperrier/intro2nlp/master/data/Shakespeare_alllines.txt'
#url = "https://repository.clarin.lv/repository/xmlui/bitstream/handle/20.500.12574/41/rainis_v20180716.txt?sequence=1&isAllowed=y"

# read file into list of lines
lines = urllib.request.urlopen(url).read().decode('utf-8').split("\n")
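
The comment in the cell above mentions a locally downloaded copy, but `urlopen` rejects a plain file path. A minimal sketch of a helper that accepts either a URL or a local path follows; `read_lines` is a hypothetical name, not part of the notebook, and the demo text is made up.

```python
import os
import tempfile
import urllib.request


def read_lines(source, encoding="utf-8"):
    """Return the lines of a text file given a URL or a local path (hypothetical helper)."""
    if source.startswith(("http://", "https://")):
        # Remote file: download it and decode the bytes
        with urllib.request.urlopen(source) as response:
            text = response.read().decode(encoding)
    else:
        # Local file: read it directly
        with open(source, encoding=encoding) as handle:
            text = handle.read()
    return text.split("\n")


# Demonstration with a temporary local file, so no network access is needed
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("To be, or not to be\nThat is the question")
    tmp_path = tmp.name

demo_lines = read_lines(tmp_path)
os.remove(tmp_path)
print(demo_lines)
```

With such a helper, `lines = read_lines(url)` would work for both the remote and the local case.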

Tokenization: split each line into tokens





In [2]:
sentences = []

for line in lines:
    # remove punctuation
    line = re.sub(r'[!"#$%&*+,\-./:;<=>?@^_`()|~]', '', line).strip()

    # simple tokenizer: runs of word characters
    tokens = re.findall(r'\b\w+\b', line)

    # only keep lines with at least two tokens
    if len(tokens) > 1:
        sentences.append(tokens)
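
To see what this preprocessing does to individual lines, the same two regular expressions can be wrapped in a function and tried on a few strings. This is only an illustrative sketch: `tokenize` and `PUNCTUATION` are names introduced here, and the sample strings are made up.

```python
import re

# Same character set as in the cell above, with the hyphen escaped explicitly
PUNCTUATION = r'[!"#$%&*+,\-./:;<=>?@^_`()|~]'


def tokenize(line):
    """Strip the listed punctuation, then return the runs of word characters."""
    line = re.sub(PUNCTUATION, '', line).strip()
    return re.findall(r'\b\w+\b', line)


# The apostrophe is not in the punctuation set, so contractions are split in two
print(tokenize("What's in a name?"))

# Hyphens are deleted rather than replaced by a space, so the parts are glued together
print(tokenize("well-known, sir!"))
```

Both effects are worth keeping in mind when reading the nearest neighbours later: a token such as `s` or a glued compound is a side effect of this tokenizer, not a word of the corpus.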



Training the model

The parameters:

* min_count = int - Ignores all words whose total absolute frequency is lower than this. - (2, 100)
* window = int - The maximum distance between the current and the predicted word within a sentence, i.e. up to window words to the left and window words to the right of the target. - (2, 10)
* vector_size = int - Dimensionality of the feature vectors (called size in gensim versions before 4.0). - (50, 300)
* sample = float - The threshold that determines which higher-frequency words are randomly downsampled. Highly influential. - (0, 1e-5)
* alpha = float - The initial learning rate. - (0.01, 0.05)
* min_alpha = float - The learning rate drops linearly to min_alpha as training progresses. To set it: alpha - (min_alpha * epochs) ~ 0.00
* negative = int - If > 0, negative sampling is used, and the value specifies how many "noise words" are drawn. If set to 0, no negative sampling is used. - (5, 20)
* workers = int - Use this many worker threads to train the model (faster training on multicore machines).
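
To make the window parameter concrete, here is a small sketch that lists the (target, context) pairs a skip-gram model would see for one sentence. It assumes a fixed symmetric window; gensim's own training code differs in details (as far as I know, by default it also shrinks the window randomly for each target word), and `skipgram_pairs` is a name invented for this illustration.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs using a fixed symmetric window."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # leftmost context position
        hi = min(len(tokens), i + window + 1)  # one past the rightmost position
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


example_tokens = ["to", "be", "or", "not", "to", "be"]
example_pairs = skipgram_pairs(example_tokens, window=2)

print(len(example_pairs))
print(example_pairs[:5])
```

A larger window produces more pairs per word and tends to capture broader, more topical associations, while a small window stays closer to syntactic neighbours.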



In [3]:

w2v_model = Word2Vec(
         sentences,
         min_count=3,      # Ignore words that appear fewer than this many times
         vector_size=50,   # Dimensionality of word embeddings
         sg=1,             # Use skip-gram (sg=0 would select CBOW)
         window=7,         # Context window size during training
         epochs=40)        # Number of training passes over the corpus

Training in several steps (alternative)

In [None]:
import multiprocessing


cores = multiprocessing.cpu_count()

w2v_model = Word2Vec(min_count=20,
                     window=2,
                     vector_size=50,
                     sample=6e-5,
                     alpha=0.03,
                     min_alpha=0.0007,
                     negative=20,
                     workers=max(1, cores - 1))  # leave one core free, but keep at least one worker


t = time()

w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

Time to build vocab: 0.01 mins
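
The constructor above sets alpha=0.03 and min_alpha=0.0007. A linear schedule between these two values can be sketched as follows; this illustrates the idea of linear decay and is not gensim's internal code. `linear_alpha` is a hypothetical helper.

```python
def linear_alpha(progress, alpha=0.03, min_alpha=0.0007):
    """Learning rate at a given training progress in [0, 1], decaying linearly."""
    return alpha - (alpha - min_alpha) * progress


# Learning rate at the start, the midpoint and the end of training
for progress in (0.0, 0.5, 1.0):
    print(progress, round(linear_alpha(progress), 5))
```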


In [None]:
w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

(21426316, 52958850)
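
As I read the gensim documentation, `train` returns a tuple of (effective word count, total raw word count), where the effective count excludes words that were dropped, for example because they are out of vocabulary or were downsampled. Under that reading, the numbers printed above can be unpacked like this:

```python
# Tuple printed by the training cell above
trained_words, raw_words = (21426316, 52958850)
epochs = 30  # same value as passed to train()

# Raw tokens seen in one pass over the corpus
raw_words_per_epoch = raw_words // epochs

# Share of raw tokens that actually contributed to training
kept_fraction = trained_words / raw_words

print(raw_words_per_epoch)
print(round(kept_fraction, 3))
```

So roughly 40% of the tokens took part in training with min_count=20 and sample=6e-5; the rest were filtered out.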

The model in action

In [20]:
#w2vcat=w2v_model.wv.most_similar('rīta')
w2v_model.wv.most_similar('king')

[('prince', 0.7845737338066101),
 ('duke', 0.7715505361557007),
 ('queen', 0.7306444644927979),
 ('emperor', 0.7019931674003601),
 ('college', 0.6877387166023254),
 ('rightful', 0.6712759137153625),
 ('sovereign', 0.6658298373222351),
 ('Dauphin', 0.6556982398033142),
 ('Naples', 0.6545740365982056),
 ('newness', 0.6544961333274841)]

In [21]:
w2v_model.wv.most_similar(positive=["Romeo"])

[('Tybalt', 0.7864425778388977),
 ('murdered', 0.7040663361549377),
 ('dead', 0.683326780796051),
 ('Bassianus', 0.6790679693222046),
 ('Polonius', 0.6736945509910583),
 ('mountaineers', 0.6709297299385071),
 ('Carlisle', 0.6702226996421814),
 ('Antony', 0.6452981233596802),
 ('Clarence', 0.6452953815460205),
 ('banished', 0.6417002081871033)]

In [22]:
w2v_model.wv.similarity("Romeo", "love")

0.4380147
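
The similarity score is the cosine of the angle between the two word vectors. A minimal pure-Python sketch of that computation, using made-up two-dimensional vectors rather than real embeddings:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors (not taken from the model)
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))   # vectors 45 degrees apart
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal vectors
```

Values near 1 mean the vectors point in almost the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposite directions.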

In [23]:
w2v_model.wv.doesnt_match(["love", "Romeo", "cat"])

'cat'
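
One way to pick the odd one out is to average the unit-length vectors of all the words and return the word least similar to that average. The sketch below follows this idea with made-up two-dimensional vectors; it approximates what `doesnt_match` does, and gensim's own code may differ in details. `odd_one_out` and the toy vectors are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the mean unit vector."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = [sum(u[k] for u in units.values()) / len(units) for k in range(dim)]
    # Dot product with the mean measures how central each word is
    scores = {word: sum(a * b for a, b in zip(u, mean)) for word, u in units.items()}
    return min(scores, key=scores.get)


# Toy vectors (not taken from the model): two similar words and one outlier
toy_vectors = {
    "love": [0.9, 0.1],
    "Romeo": [0.8, 0.3],
    "cat": [0.1, 0.9],
}
print(odd_one_out(toy_vectors))
```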

Further materials, some more serious than others:

* Tensorflow Word2Vec Tutorial: https://www.tensorflow.org/text/tutorials/word2vec
* Gensim Word2Vec Tutorial: https://www.kaggle.com/code/pierremegret/gensim-word2vec-tutorial/notebook#Getting-Started


