# Gensim Training Experiments
- **Where:** HEIA-FR GPU-2
- **Dataset:** enwiki-latest-pages-articles.xml.bz2 (16GB)
- **Dictionary:** enwiki-english-wordids.txt.bz2 (16MB)

## What's going on
- Training a Word2Vec model on the full English Wikipedia dump, using its complete pre-extracted dictionary.

## Results
- **Timeframe:** ~17h
- **Errors:** the kernel crashed while gensim was *resetting layer weights*
- **Comments:**
    - The 17 hours of training were lost because the model could not be saved before the crash
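
A possible explanation, offered as an assumption rather than a diagnosis: gensim logs "resetting layer weights" when it allocates the weight matrices at the end of vocabulary building, so a crash at that point is consistent with running out of memory. The sketch below gives a rough lower bound for that allocation, assuming two float32 matrices of shape vocabulary × vector size (input vectors plus negative-sampling output weights). The vocabulary size used here is hypothetical; the real figure for this run was not recorded.

```python
def estimate_weight_bytes(vocab_size, vector_size, n_matrices=2, bytes_per_value=4):
    """Rough lower bound on the memory needed for the Word2Vec weight matrices.

    Assumes n_matrices dense float32 arrays of shape (vocab_size, vector_size).
    The vocabulary dictionary itself adds further overhead not counted here.
    """
    return vocab_size * vector_size * n_matrices * bytes_per_value


# Hypothetical vocabulary size, for illustration only (not measured in this run)
hypothetical_vocab = 10_000_000
needed = estimate_weight_bytes(hypothetical_vocab, 300)
print(f"{needed / 1024**3:.1f} GiB for the weight matrices alone")
```

With `min_count=1` every distinct token is kept, so the vocabulary (and this estimate) grows with every rare word and typo in the dump. Raising `min_count` is the usual way to shrink it.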

In [None]:
# Set constants
datafile_name = "enwiki-latest-pages-articles.xml.bz2"
dictionary_name = 'enwiki-english-wordids.txt.bz2'

In [None]:
# Start logging process at root level
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
logging.root.setLevel(level=logging.INFO)

In [None]:
# Load dictionary from file
from gensim.corpora import Dictionary
dictionary = Dictionary.load_from_text(dictionary_name)

In [None]:
# Build WikiCorpus based on the dictionary
from gensim.corpora import WikiCorpus
wiki = WikiCorpus(datafile_name, dictionary=dictionary)

In [None]:
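# Wrap the corpus in a restartable iterable: Word2Vec makes several passes
# over the data, and streaming avoids loading every article into memory
class SentencesIterator:
    def __init__(self, wiki):
        self.wiki = wiki

    def __iter__(self):
        for sentence in self.wiki.get_texts():
            # Older gensim releases yield bytes tokens, newer ones yield str
            yield [t.decode('utf-8') if isinstance(t, bytes) else t for t in sentence]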

sentences = SentencesIterator(wiki)
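
Every pass over `SentencesIterator` re-parses the compressed XML dump, which is likely a large share of the run time. One option, sketched here as an assumption and not something done in this experiment, is to write the tokenized articles to a plain text file once and stream later passes from that file. The names `cache_sentences`, `CachedSentences` and `wiki_tokens.txt` are hypothetical, and the two toy articles stand in for the real corpus.

```python
import os
import tempfile


def cache_sentences(sentences, path):
    """Write one space-separated article per line; return the article count."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for tokens in sentences:
            handle.write(" ".join(tokens) + "\n")
            count += 1
    return count


class CachedSentences:
    """Restartable iterable that streams token lists back from the cache file."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


# Toy stand-in for the Wikipedia iterator defined above
toy_sentences = [
    ["anarchism", "is", "a", "political", "philosophy"],
    ["autism", "is", "a", "disorder"],
]
cache_path = os.path.join(tempfile.mkdtemp(), "wiki_tokens.txt")
n_articles = cache_sentences(toy_sentences, cache_path)
cached = CachedSentences(cache_path)
```

Because `CachedSentences` reopens the file in `__iter__`, it can be iterated as many times as the model needs. gensim's `LineSentence` reads the same one-sentence-per-line layout and could replace the hand-written class.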

In [None]:
# Train model: passing `sentences` to the constructor builds the vocabulary
# and runs training in a single call
from gensim.models import Word2Vec
import multiprocessing

cores = multiprocessing.cpu_count()
model = Word2Vec(sentences=sentences, size=300, min_count=1, window=5, workers=cores)
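
Since the constructor call above does everything in one step, nothing is on disk if the kernel dies partway through. A minimal sketch of a staged alternative follows, assuming the vocabulary fits in memory: build the vocabulary, save, train, save again. The function name `train_in_stages`, the `checkpoint_path` argument and the `min_count=5` default are assumptions for illustration, not settings from this experiment. The `try`/`except` covers the rename of `size` to `vector_size` in gensim 4.

```python
import multiprocessing


def train_in_stages(sentences, checkpoint_path, vector_size=300, min_count=5):
    """Build the vocabulary and train as separate steps, saving after each one."""
    # Imported lazily so the function can be defined without gensim loaded
    from gensim.models import Word2Vec

    params = dict(min_count=min_count, window=5,
                  workers=multiprocessing.cpu_count())
    try:
        model = Word2Vec(vector_size=vector_size, **params)  # gensim >= 4
    except TypeError:
        model = Word2Vec(size=vector_size, **params)  # gensim < 4

    # Step 1: scan the corpus once and allocate the weight matrices
    model.build_vocab(sentences)
    model.save(checkpoint_path)

    # Step 2: train, then overwrite the checkpoint with the trained model
    epochs = getattr(model, "epochs", None) or model.iter
    model.train(sentences, total_examples=model.corpus_count, epochs=epochs)
    model.save(checkpoint_path)
    return model
```

A call such as `train_in_stages(sentences, "wiki_w2v.model")` would then leave a loadable file behind after the vocabulary scan. This does not help if the crash happens inside `build_vocab` itself, as the weight allocation does; in that case shrinking the vocabulary is the more relevant change.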