# Wiki2vec

A Jupyter notebook for creating a [Word2vec](https://en.wikipedia.org/wiki/Word2vec) model from a Wikipedia dump. The resulting model file can then be loaded with [gensim's Word2Vec class](https://radimrehurek.com/gensim/models/word2vec.html). Feel free to edit this notebook as you see fit.

### Dependencies
- Python 3
- Jupyter
- Gensim

### Steps
- Download a Wikipedia dump from the URL below, replacing `<locale>` with the language code of the Wikipedia you want:

```
https://dumps.wikimedia.org/<locale>wiki/latest/<locale>wiki-latest-pages-articles.xml.bz2

E.g. https://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles.xml.bz2
```
- Once downloaded, assign the following paths:

In [8]:
WIKIPEDIA_DUMP_PATH = './data/wiki-corpuses/enwiki-latest-pages-articles.xml.bz2'

# Choose a path that the word2vec model should be saved to
# (during training), and read from afterwards.
WIKIPEDIA_W2V_PATH = './data/enwiki.model'
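
The download URL pattern from the steps above can also be filled in programmatically. A minimal sketch, assuming `<locale>` is a Wikipedia language code such as `it` or `en`; the helper names `wiki_dump_url` and `dump_is_present` are hypothetical and not part of the notebook:

```python
import os

# URL pattern taken from the download step; {locale} stands in for <locale>
DUMP_URL_TEMPLATE = (
    'https://dumps.wikimedia.org/{locale}wiki/latest/'
    '{locale}wiki-latest-pages-articles.xml.bz2'
)

def wiki_dump_url(locale):
    """Return the latest pages-articles dump URL for a Wikipedia locale."""
    return DUMP_URL_TEMPLATE.format(locale=locale)

def dump_is_present(path):
    """Return True if the dump file exists and is non-empty."""
    return os.path.isfile(path) and os.path.getsize(path) > 0

print(wiki_dump_url('it'))
```

Calling `dump_is_present(WIKIPEDIA_DUMP_PATH)` before training is a cheap way to catch a mistyped path before the long-running cells start.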

## Train Word2vec on Wikipedia dump

Here we train the word2vec model on the given Wikipedia dump. Specifically, we:

1. Read the Wikipedia dump with gensim's `WikiCorpus`
2. Write the article text to a temporary text file (deleted automatically afterwards)
3. Train the word2vec model on that text file
4. Save the word2vec model to `WIKIPEDIA_W2V_PATH`

*NB: 1 Wikipedia article is fed into word2vec as a single sentence.*
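
To make the note above concrete: each article's tokens are joined with spaces onto a single line, and that line is later split back into tokens on whitespace when the file is read for training. A pure-Python sketch of the round trip, using made-up token lists rather than real Wikipedia text:

```python
import io

# Two hypothetical tokenized articles (stand-ins for real corpus output)
articles = [
    ['anarchism', 'is', 'a', 'political', 'philosophy'],
    ['autism', 'is', 'a', 'developmental', 'disorder'],
]

# Write one article per line, tokens separated by single spaces
buffer = io.StringIO()
for tokens in articles:
    buffer.write(' '.join(tokens) + '\n')

# Read back: every line becomes one "sentence" (a list of tokens)
buffer.seek(0)
sentences = [line.split() for line in buffer]

print(len(sentences), 'sentences, one per article')
```

One consequence of this design is that the context window (`window=5` in the training call) can span real sentence boundaries within an article.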

In [9]:
import sys
import os
import tempfile
import multiprocessing
import logging

from gensim.corpora import WikiCorpus
from gensim.models.word2vec import LineSentence
from gensim.models import Word2Vec

In [3]:
def write_wiki_corpus(wiki, output_file):
    """Write a WikiCorpus as plain text to file."""
    
    i = 0
    for text in wiki.get_texts():
        output_file.write(b' '.join(text) + b'\n')
        i = i + 1
        if (i % 10000 == 0):
            print('\rSaved %d articles' % i, end='', flush=True)
            
    print('\rFinished saving %d articles' % i, end='', flush=True)
    
def build_trained_model(text_file):
    """Reads text file and returns a trained model."""
    
    sentences = LineSentence(text_file)
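    # NOTE: gensim >= 4.0 renames `size` to `vector_size`
    model = Word2Vec(sentences, size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # Keep only the L2-normalized vectors to reduce RAM usage
    # (the model cannot be trained further after this call)
    model.init_sims(replace=True)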
    return model
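
The two functions above use the gensim API as it was when this notebook was run (2016). In gensim 4.0 and later, `size` is called `vector_size`, and `get_texts()` is expected to yield `str` tokens rather than `bytes`; the `lemmatize` argument to `WikiCorpus` also appears to have been dropped there. A hedged sketch of equivalents under those assumptions follows. The `_v4` function names are hypothetical, and the output file is assumed to be opened in text mode:

```python
import multiprocessing

def write_wiki_corpus_v4(wiki, output_file):
    """Write one article per line to a text-mode file; return the article count."""
    count = 0
    for tokens in wiki.get_texts():
        # Accept either str or bytes tokens, depending on the gensim version
        words = [t.decode('utf-8') if isinstance(t, bytes) else t for t in tokens]
        output_file.write(' '.join(words) + '\n')
        count += 1
    return count

def build_trained_model_v4(text_path):
    """Train Word2Vec from a one-article-per-line text file (gensim >= 4.0)."""
    # Imported lazily so this sketch can be loaded without gensim installed
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    return Word2Vec(
        sentences=LineSentence(text_path),
        vector_size=400,  # called `size` before gensim 4.0
        window=5,
        min_count=5,
        workers=multiprocessing.cpu_count(),
    )
```

The hyperparameters are copied from `build_trained_model`; only the parameter name and the token handling differ.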

In [None]:
logging_format = '%(asctime)s : %(levelname)s : %(message)s'
logging.basicConfig(format=logging_format, level=logging.INFO)

with tempfile.NamedTemporaryFile(suffix='.txt') as text_output_file:
    # Create wiki corpus, and save text to temp file
    wiki_corpus = WikiCorpus(WIKIPEDIA_DUMP_PATH, lemmatize=False, dictionary={})
    write_wiki_corpus(wiki_corpus, text_output_file)
    del wiki_corpus

    # Train model on wiki corpus
    model = build_trained_model(text_output_file)    
    model.save(WIKIPEDIA_W2V_PATH)

Saved 4180000 articles

2016-12-10 20:30:10,614 : INFO : finished iterating over Wikipedia corpus of 4181821 documents with 2280891234 positions (total 17098961 articles, 2343122692 positions before pruning articles shorter than 50 words)


Finished saving 4181821 articles

2016-12-10 20:30:10,616 : INFO : collecting all words and their counts
2016-12-10 20:30:10,620 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2016-12-10 20:30:12,033 : INFO : PROGRESS: at sentence #10000, processed 4937796 words, keeping 184723 word types
2016-12-10 20:30:13,336 : INFO : PROGRESS: at sentence #20000, processed 9824272 words, keeping 292933 word types
2016-12-10 20:30:14,708 : INFO : PROGRESS: at sentence #30000, processed 15030030 words, keeping 378072 word types
2016-12-10 20:30:16,025 : INFO : PROGRESS: at sentence #40000, processed 20035111 words, keeping 453474 word types
2016-12-10 20:30:17,393 : INFO : PROGRESS: at sentence #50000, processed 25192694 words, keeping 525904 word types
2016-12-10 20:30:18,730 : INFO : PROGRESS: at sentence #60000, processed 30261196 words, keeping 588151 word types
2016-12-10 20:30:20,047 : INFO : PROGRESS: at sentence #70000, processed 35254822 words, keeping 642323 word types
2016-12-10 20:30:21,447 : I

## Demo word2vec

Read in the saved word2vec model and perform some basic analysis on it.

In [10]:
import random

In [11]:
%%time
model = Word2Vec.load(WIKIPEDIA_W2V_PATH)

In [12]:
vocab = list(model.vocab.keys())
print('Vocabulary sample:', vocab[:5])

Vocabulary sample: ['odeleye', 'mahajatra', 'wakandas', 'cluster', 'receli']
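
The cell above relies on `model.vocab`, which matches the gensim version used for this run. Under gensim 4.0 and later, the vocabulary and query methods live on `model.wv` instead. A small sketch under that assumption; `sample_vocab_and_neighbours` is a hypothetical helper, and `model` is assumed to be a loaded `Word2Vec` instance:

```python
import random

def sample_vocab_and_neighbours(model, sample_size=5, topn=10, seed=None):
    """Return a vocabulary sample, a random word, and that word's neighbours."""
    wv = model.wv
    vocab = wv.index_to_key  # list of words; replaces list(model.vocab.keys())
    word = random.Random(seed).choice(vocab)
    # most_similar returns (word, cosine similarity) pairs
    return vocab[:sample_size], word, wv.most_similar(word, topn=topn)
```

Passing a fixed `seed` makes the randomly chosen word reproducible between runs.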


In [13]:
word = random.choice(vocab)

print('Similar words to:', word)
model.most_similar(word)

Similar words to: vrasivanopoulos


KeyboardInterrupt: 

In [9]:
word1 = random.choice(vocab)
word2 = random.choice(vocab)
print('similarity(%s, %s) = %f' % (word1, word2, model.similarity(word1, word2)))

similarity(laáb, shagarakti) = 0.228848
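
`model.similarity` returns the cosine similarity between the two words' vectors. A self-contained sketch of that formula on made-up 3-dimensional vectors (the model trained above uses 400 dimensions); `cosine_similarity` is a hypothetical helper, not a gensim function:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # orthogonal: 0.0
```

Scores range from -1 to 1, so the 0.228848 reported for two randomly chosen words indicates only weak relatedness.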
