## Loading the required libraries

In [None]:
import logging
import smart_open
from gensim.corpora.wikicorpus import WikiCorpus, tokenize

In [None]:
# Mount Google Drive so the processed corpus can be saved there
from google.colab import drive
drive.mount('/content/drive')

In [None]:
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Preparing the corpus

The notebook used a dump that was current as of 06/21/2022. To train a new model, download the latest Wikipedia dump from [here](https://dumps.wikimedia.org/enwiki/).

In [None]:
!wget https://dumps.wikimedia.org/enwiki/20220320/enwiki-20220320-pages-articles-multistream9.xml-p2936261p4045402.bz2

--2022-06-21 14:46:54--  https://dumps.wikimedia.org/enwiki/20220320/enwiki-20220320-pages-articles-multistream9.xml-p2936261p4045402.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.7, 2620:0:861:1:208:80:154:7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.7|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 555343728 (530M) [application/octet-stream]
Saving to: ‘enwiki-20220320-pages-articles-multistream9.xml-p2936261p4045402.bz2’


2022-06-21 14:48:48 (4.66 MB/s) - ‘enwiki-20220320-pages-articles-multistream9.xml-p2936261p4045402.bz2’ saved [555343728/555343728]



Convert the Wikipedia article dump from the Wikimedia XML format into a text file. Each article is preprocessed at the same time: its text is normalized to lowercase and split into tokens.
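To make the preprocessing step concrete, here is a minimal sketch of lowercasing plus regexp tokenization using only the standard library. It is a simplified stand-in and not gensim's own code: the function name `simple_tokenize`, the regular expression, and the length limits of 2 and 15 characters are assumptions chosen to approximate what the default `tokenize` function is understood to do.

```python
import re

# Assumption: tokens are runs of word characters that contain no digits.
_ALPHABETIC = re.compile(r"(((?![\d])\w)+)", re.UNICODE)


def simple_tokenize(text, token_min_len=2, token_max_len=15):
    """Lowercase the text and return alphabetic tokens within the length limits."""
    tokens = []
    for match in _ALPHABETIC.finditer(text.lower()):
        token = match.group()
        # Drop tokens that are too short, too long, or start with an underscore.
        if token_min_len <= len(token) <= token_max_len and not token.startswith("_"):
            tokens.append(token)
    return tokens


example_tokens = simple_tokenize("An Example sentence, written in 2022.")
print(example_tokens)
```

Punctuation and numbers are dropped, and every remaining token is lowercase. A custom tokenizer can be passed to `WikiCorpus` through `tokenizer_func` in the same way `tokenize` is passed in the next cell.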

In [None]:
wiki = WikiCorpus(
    "enwiki-20220320-pages-articles-multistream9.xml-p2936261p4045402.bz2",  # path to the file you downloaded above
    tokenizer_func=tokenize,  # simple regexp; plug in your own tokenizer here
    dictionary={},  # don't start processing the data yet
)
wiki.metadata = True

with smart_open.open("drive/MyDrive/Colab Notebooks/wiki.txt.gz", "w", encoding='utf8') as fout:
    for article_no, (content, (page_id, title)) in enumerate(wiki.get_texts()):
        title = ' '.join(title.split())
        if article_no % 500000 == 0:
            logging.info("processing article #%i: %r (%i tokens)", article_no, title, len(content))
        fout.write(f"{title}\t{' '.join(content)}\n")  # title_of_article [TAB] words of the article

2022-06-21 14:48:48,477 : INFO : processing article #0: 'David Stagg' (326 tokens)
2022-06-21 15:10:21,485 : INFO : finished iterating over Wikipedia corpus of 153429 documents with 110468270 positions (total 415310 articles, 111511108 positions before pruning articles shorter than 50 words)
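The output file stores one article per line in the form title, TAB, space-separated tokens. A minimal sketch of reading that format back is shown below. It uses the standard-library `gzip` module instead of `smart_open` so that it runs on its own, and the helper name `iter_articles`, the temporary file, and the sample lines are all made up for illustration.

```python
import gzip
import os
import tempfile


def iter_articles(path):
    """Yield (title, tokens) pairs from a gzipped 'title<TAB>tokens' file."""
    with gzip.open(path, "rt", encoding="utf8") as fin:
        for line in fin:
            # Split on the first TAB only: the title comes first, tokens follow.
            title, _, text = line.rstrip("\n").partition("\t")
            yield title, text.split()


# Write a tiny made-up sample in the same line format as wiki.txt.gz.
sample_path = os.path.join(tempfile.mkdtemp(), "sample_wiki.txt.gz")
with gzip.open(sample_path, "wt", encoding="utf8") as fout:
    fout.write("Example Article\texample tokens go here\n")
    fout.write("Another Title\tmore example tokens\n")

articles = list(iter_articles(sample_path))
for title, tokens in articles:
    print(title, len(tokens))
```

To read the real file, pass the path used in the cell above (`drive/MyDrive/Colab Notebooks/wiki.txt.gz`) to `iter_articles`. Because the function is a generator, the corpus is streamed line by line and never loaded into memory all at once.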
