<a href="https://colab.research.google.com/github/TurkuNLP/intro-to-nlp/blob/master/intro_2023_exercise_10_solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Creating word embeddings

This notebook demonstrates how to create new `word2vec` word embeddings using `gensim`.

---

## Setup

We'll use [`gensim`](https://pypi.org/project/gensim/) to induce the word embeddings and (following the [sentence splitting and tokenization notebook](https://github.com/TurkuNLP/intro-to-nlp/blob/master/sentence_splitting_and_tokenization.ipynb)) the [`sentence-splitter`](https://pypi.org/project/sentence-splitter/) package to split sentences.

In [1]:
!pip install --quiet gensim sentence-splitter

In [2]:
!wget -nc http://dl.turkunlp.org/TKO_7095_2023/fiwiki-20221120-sample.txt

File ‘fiwiki-20221120-sample.txt’ already there; not retrieving.



In [3]:
import regex

from sentence_splitter import SentenceSplitter

## Download and read data

We'll use the English data here, but you can switch to the Finnish data by commenting out the lines that load the English data and uncommenting the lines that load the Finnish data.

In [4]:
# !wget -nc http://dl.turkunlp.org/TKO_7095_2023/fiwiki-20221120-sample.txt
!wget -nc http://dl.turkunlp.org/TKO_7095_2023/enwiki-20220301-sample.txt

File ‘enwiki-20220301-sample.txt’ already there; not retrieving.



In [5]:
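# Each line of the file is one paragraph; swap the two `with` lines to load the Finnish sample instead.
# with open('fiwiki-20221120-sample.txt', encoding='utf-8') as f:
with open('enwiki-20220301-sample.txt', encoding='utf-8') as f:
    paragraphs = f.readlines()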

Check what we have

In [6]:
print('Total paragraphs:', len(paragraphs))
print('Total characters:', sum(len(p) for p in paragraphs))

Total paragraphs: 1000000
Total characters: 526082203


So, that's a million paragraphs totaling over 500 million characters. This is a reasonably large corpus, though still notably smaller than corpora on which word2vec models are generally trained.

In [7]:
for i, p in enumerate(paragraphs[:10]):
    print(f'paragraph {i}: {p}', end='')

paragraph 0: Incumbent CM Mayawati began her campaign on 27 January at a rally in Bijnor. On 15 January, she released the BSP's list of candidates for all the 403 constituencies. The list included 88 candidates belonging to SCs, 113 from OBCs, 85 religious minorities and 117 upper castes, out of which 74 are Brahmins.
paragraph 1: As a member of the OSS Research and Analysis Division, Wheeler had government security clearance to received secret and confidential "ditto" copies of monthly and semi-monthly reports of political developments throughout the world. Wheeler is alleged to have passed these reports as well as handwritten and typewritten material of cable reports from the State Department and the OSS to Soviet intelligence.  Wheeler is alleged to have provided information on the organization and policies of British intelligence services and furnished memoranda prepared by the Foreign Nationalities Branch of OSS on material relating to the particular racial groups and activities w

## Split sentences and tokenize

This step follows the [sentence splitting and tokenization notebook](https://github.com/TurkuNLP/intro-to-nlp/blob/master/sentence_splitting_and_tokenization.ipynb).

(The line `%%time` is Jupyter ["cell magic"](https://ipython.readthedocs.io/en/stable/interactive/magics.html) to print execution time.) 
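
Outside Jupyter, `%%time` is not available. A minimal sketch of measuring wall-clock and CPU time with the standard library follows; the helper name `timed` and the toy workload are assumptions for illustration and are not part of this notebook.

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock timer
    cpu_start = time.process_time()    # CPU time of this process (user + sys)
    result = fn(*args, **kwargs)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return result, wall, cpu

# Toy workload standing in for sentence splitting or tokenization
result, wall, cpu = timed(sum, range(1_000_000))
print(f'Wall time: {wall:.3f} s, CPU time: {cpu:.3f} s')
```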

In [8]:
%%time

# Match the splitter language to the corpus: 'en' for the English sample, 'fi' for the Finnish one
splitter = SentenceSplitter(language='en')

sentences = [s for p in paragraphs for s in splitter.split(p)]

CPU times: user 15min 26s, sys: 3.96 s, total: 15min 30s
Wall time: 15min 50s


In [9]:
%%time

TOKENIZE_RE = regex.compile(r'([[:alnum:]]+|\S)')

tokenized = [TOKENIZE_RE.findall(s) for s in sentences]

CPU times: user 1min 23s, sys: 10.9 s, total: 1min 34s
Wall time: 1min 34s


So, with this approach, sentence splitting took about 16 minutes and tokenization about a minute and a half for the million paragraphs.

See a few examples

In [10]:
for i, s in enumerate(tokenized[:10]):
    print(f'Sentence {i}:', s)

Sentence 0: ['Incumbent', 'CM', 'Mayawati', 'began', 'her', 'campaign', 'on', '27', 'January', 'at', 'a', 'rally', 'in', 'Bijnor', '.']
Sentence 1: ['On', '15', 'January', ',', 'she', 'released', 'the', 'BSP', "'", 's', 'list', 'of', 'candidates', 'for', 'all', 'the', '403', 'constituencies', '.']
Sentence 2: ['The', 'list', 'included', '88', 'candidates', 'belonging', 'to', 'SCs', ',', '113', 'from', 'OBCs', ',', '85', 'religious', 'minorities', 'and', '117', 'upper', 'castes', ',', 'out', 'of', 'which', '74', 'are', 'Brahmins', '.']
Sentence 3: ['As', 'a', 'member', 'of', 'the', 'OSS', 'Research', 'and', 'Analysis', 'Division', ',', 'Wheeler', 'had', 'government', 'security', 'clearance', 'to', 'received', 'secret', 'and', 'confidential', '"', 'ditto', '"', 'copies', 'of', 'monthly', 'and', 'semi', '-', 'monthly', 'reports', 'of', 'political', 'developments', 'throughout', 'the', 'world', '.']
Sentence 4: ['Wheeler', 'is', 'alleged', 'to', 'have', 'passed', 'these', 'reports', 'as', 
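
To see what the tokenization pattern does on a single sentence, here is a small sketch. It uses the standard-library `re` module rather than `regex`, so the pattern is an approximation: `[^\W_]` stands in for `[[:alnum:]]`, and the name `TOKENIZE_APPROX_RE` and the example sentence are assumptions for illustration.

```python
import re

# Stand-in for the notebook's pattern r'([[:alnum:]]+|\S)': the standard-library
# `re` module has no POSIX classes, so [^\W_] (word characters minus underscore)
# is used here as an approximation of [[:alnum:]].
TOKENIZE_APPROX_RE = re.compile(r'[^\W_]+|\S')

def tokenize(sentence):
    """Return runs of alphanumeric characters and single non-space symbols."""
    return TOKENIZE_APPROX_RE.findall(sentence)

example_tokens = tokenize("She released the BSP's semi-monthly list.")
print(example_tokens)
```

As in the output above, the possessive `BSP's` is split into three tokens and the hyphenated `semi-monthly` into three as well, because every non-alphanumeric symbol becomes a token of its own.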

## Alternative using UDPipe

You can alternatively perform sentence splitting and tokenization using [UDPipe](https://lindat.mff.cuni.cz/services/udpipe/) as follows, but note that this is considerably slower.

In [11]:
!pip3 install --quiet ufal.udpipe

In [12]:
!wget -nc https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/fi.segmenter.udpipe
!wget -nc https://github.com/TurkuNLP/intro-to-nlp/raw/master/Data/en.segmenter.udpipe

File ‘fi.segmenter.udpipe’ already there; not retrieving.

File ‘en.segmenter.udpipe’ already there; not retrieving.



In [13]:
import ufal.udpipe as udpipe

#model = udpipe.Model.load('fi.segmenter.udpipe')
model = udpipe.Model.load('en.segmenter.udpipe')

pipeline = udpipe.Pipeline(model, 'tokenize', 'none', 'none', 'horizontal')

(We're only processing the first 10,000 paragraphs here as this is quite slow.)

In [14]:
%%time

udpipe_segmented = [pipeline.process(p) for p in paragraphs[:10000]]

CPU times: user 3min 22s, sys: 544 ms, total: 3min 23s
Wall time: 3min 27s


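For UDPipe, sentence splitting and tokenization took over three minutes for just 10,000 paragraphs. That is roughly 20 ms per paragraph, compared with about 1 ms per paragraph for the regex-based approach above.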

In [15]:
udpipe_sentences = [s for t in udpipe_segmented for s in t.split('\n')]
udpipe_tokenized = [s.split() for s in udpipe_sentences]

In [16]:
for i, s in enumerate(udpipe_tokenized[:10]):
    print(f'UDpipe sentence {i}:', s)

UDpipe sentence 0: ['Incumbent', 'CM', 'Mayawati', 'began', 'her', 'campaign', 'on', '27', 'January', 'at', 'a', 'rally', 'in', 'Bijnor', '.']
UDpipe sentence 1: ['On', '15', 'January', ',', 'she', 'released', 'the', 'BSP', "'s", 'list', 'of', 'candidates', 'for', 'all', 'the', '403', 'constituencies', '.']
UDpipe sentence 2: ['The', 'list', 'included', '88', 'candidates', 'belonging', 'to', 'SCs', ',', '113', 'from', 'OBCs', ',', '85', 'religious', 'minorities', 'and', '117', 'upper', 'castes', ',', 'out', 'of', 'which', '74', 'are', 'Brahmins', '.']
UDpipe sentence 3: []
UDpipe sentence 4: ['As', 'a', 'member', 'of', 'the', 'OSS', 'Research', 'and', 'Analysis', 'Division', ',', 'Wheeler', 'had', 'government', 'security', 'clearance', 'to', 'received', 'secret', 'and', 'confidential', '"', 'ditto', '"', 'copies', 'of', 'monthly', 'and', 'semi-monthly', 'reports', 'of', 'political', 'developments', 'throughout', 'the', 'world', '.']
UDpipe sentence 5: ['Wheeler', 'is', 'alleged', 'to',
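
The empty list at `UDpipe sentence 3` is consistent with each processed paragraph ending in a newline, so that `split('\n')` yields a trailing empty string. A minimal sketch of filtering these out follows. It uses a hand-written list (`toy_segmented`, an assumption standing in for real `pipeline.process` output) so that it runs without UDPipe.

```python
# Hand-written stand-in for UDPipe 'horizontal' output: one tokenized
# sentence per line, with a newline after the last sentence of each paragraph.
toy_segmented = [
    'On 15 January , she released the list .\nThe list included 88 candidates .\n',
    'Wheeler is alleged to have passed these reports .\n',
]

# str.splitlines() does not produce a trailing empty string, and the
# `if s.strip()` guard drops any remaining blank lines.
toy_sentences = [s for t in toy_segmented for s in t.splitlines() if s.strip()]
toy_tokenized = [s.split() for s in toy_sentences]

for i, s in enumerate(toy_tokenized):
    print(f'sentence {i}:', s)
```

Dropping empty sentences before training is harmless, since they contribute no context windows.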

## Create word2vec model

We'll create a word2vec model using `gensim` mostly with default parameters. For details on the parameters of this class, see <https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec>

In [17]:
%%time

from gensim.models import Word2Vec

model = Word2Vec(
    sentences=tokenized,
    vector_size=100,
)

CPU times: user 25min 21s, sys: 8.91 s, total: 25min 30s
Wall time: 15min 41s


Test with Finnish words (if you've trained a Finnish model)

In [18]:
# for word in ('hyvä', 'huono', 'koira', 'kissa', 'kuningas', 'kuningatar'):
#     print(f'{word}:\t', [w for w, s in model.wv.most_similar(word)])

Test with English words (if you've trained an English model)

In [19]:
for word in ('good', 'bad', 'dog', 'cat', 'king', 'queen'):
    print(f'{word}:\t', [w for w, s in model.wv.most_similar(word)])

good:	 ['bad', 'decent', 'perfect', 'tough', 'nice', 'sensible', 'genuine', 'wonderful', 'great', 'little']
bad:	 ['good', 'terrible', 'horrible', 'tough', 'careless', 'sad', 'poor', 'funny', 'crazy', 'stupid']
dog:	 ['cat', 'rabbit', 'donkey', 'pet', 'monkey', 'puppy', 'cow', 'fox', 'goat', 'horse']
cat:	 ['rabbit', 'dog', 'monkey', 'snake', 'spider', 'creature', 'wolf', 'beast', 'monster', 'donkey']
king:	 ['prince', 'emperor', 'monarch', 'sultan', 'ruler', 'duke', 'queen', 'Emperor', 'shah', 'kings']
queen:	 ['princess', 'prince', 'king', 'empress', 'bride', 'consort', 'monarch', 'dowager', 'regent', 'duchess']


Not too bad a result for about half an hour of total processing (sentence splitting, tokenization, and training)! Some observations:

* Perhaps surprisingly, antonyms (words with opposite meaning) show up in the lists of most similar words. This is a known issue with word embeddings (see e.g. [Scheible _et al._ 2013](https://aclanthology.org/I13-1056.pdf)) and makes sense if you think about it: for example, _bad_ can appear in many of the same contexts as _good_.
* These word embeddings group animal words with both _dog_ and _cat_, but don't show the level of granularity we got with the embeddings trained on more data, where different types of dogs and cats were more similar to the words _dog_ and _cat_ (respectively) than other animals.
* We see a similar phenomenon with _king_ and _queen_ as with animals, with many words related to royalty and ruling classes showing up in both, with more limited distinction by e.g. gender than found in embeddings trained on more data.
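
The neighbour lists above come from `most_similar`, which ranks words by the cosine similarity between their vectors. Here is a minimal pure-Python sketch of that ranking, using made-up three-dimensional vectors. The vectors and the helper names `cosine` and `nearest` are assumptions for illustration, not gensim internals.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word, vectors, topn=2):
    """Rank all other words by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors, not taken from any trained model
toy_vectors = {
    'dog':   [0.9, 0.1, 0.0],
    'cat':   [0.8, 0.2, 0.1],
    'king':  [0.0, 0.9, 0.4],
    'queen': [0.1, 0.8, 0.5],
}

print(nearest('dog', toy_vectors))
print(nearest('king', toy_vectors))
```

Words that occur in similar contexts end up with vectors pointing in similar directions, which is why antonyms such as _good_ and _bad_ can rank as close neighbours.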