## Using gensim's WikiCorpus

I initially used gensim's `WikiCorpus` to extract a plain-text file from a Wikipedia dump. This caused some problems: `WikiCorpus` lowercases the text and removes punctuation, very short tokens (such as "a"), and non-alphabetic tokens (dates etc.), as the sample output further down shows. It therefore did not seem like the best preprocessing method. The code is shown below.

In [None]:
"""
Inspired by:
https://github.com/panyang/Wikipedia_Word2vec/blob/master/v1/process_wiki.py
"""

import sys
from gensim.corpora import WikiCorpus

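# Convert a Wikipedia dump file to a plain-text corpus, one article per line
def make_corpus(in_f, out_f):

    # dictionary={} skips building a vocabulary, which would need an extra pass over the dump
    wiki = WikiCorpus(in_f, dictionary={})

    with open(out_f, 'w', encoding='utf-8') as output:
        # get_texts() yields each article as a list of tokens
        for text in wiki.get_texts():
            output.write(' '.join(text) + '\n')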

# System input
if __name__ == '__main__':

    if len(sys.argv) != 3:
        print('Usage: python3 make_wiki_corpus.py <wikipedia_dump_file> <processed_text_file>')
        sys.exit(1)
    in_f = sys.argv[1]
    out_f = sys.argv[2]
    make_corpus(in_f, out_f)

In [1]:
# Running the following command converts a dump file to txt
# (replace the <...> placeholders with real paths):

!python3 make_wiki_corpus.py <wikipedia_dump_file> <processed_text_file>

In [10]:
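# The following output is produced (see full file: wiki_en.txt)

from itertools import islice

# Print the first three articles (one article per line);
# the with block closes the file, so no explicit close() is needed
with open("wiki_en.txt", encoding="utf-8") as f:
    for article in islice(f, 3):
        print(article)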

sayf al din ghazi ii ibn mawdud full name sayf al din ghazi ii ibn mawdud ibn zengi died was zangid emir of mosul the nephew of nur ad din zengi he became emir of mosul in after the death of his father qutb ad din mawdud saif had been chosen as the successor under the advice of eunuch abd al masish who wanted to keep the effective rule in lieu of the young emir the disinherited son of mawdud imad ad din zengi ii fled to aleppo at the court of nur ad din the latter who was waiting for an excuse to annex mosul conquered sinjar in september and besieged mosul which surrendered on january after ousting al masish he put gümüshtekin one of his officers as governor leaving saif ud din nothing but the nominal title of emir the latter also married the daughter of nur ad din at nur ad din death may gümüshtekin went to damascus to take control of his son and entitled himself of atabeg of aleppo saif ud din rejected his tutorage and restored his independence the nobles of damascus worried by gümüs
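The missing dates and the missing article "a" in the sample above fit a tokenizer that lowercases the text, keeps only runs of letters, and discards very short or very long tokens. The sketch below is a rough pure-Python approximation of that behaviour, not gensim's actual implementation. The function name `approx_wikicorpus_tokens` is hypothetical, and the length limits of 2 and 15 characters are assumed to match gensim's defaults.

```python
import re

# Runs of letters only: digits, punctuation and underscores act as separators.
_LETTERS = re.compile(r"[^\W\d_]+")


def approx_wikicorpus_tokens(text, min_len=2, max_len=15):
    """Rough approximation of WikiCorpus-style tokenization (not gensim's code)."""
    tokens = _LETTERS.findall(text.lower())
    # Drop tokens outside the assumed length limits, e.g. the article "a".
    return [t for t in tokens if min_len <= len(t) <= max_len]


sample = "Sayf al-Din Ghazi II (died 1180) was a Zangid Emir of Mosul."
print(" ".join(approx_wikicorpus_tokens(sample)))
```

On the sample sentence, the year and the single-letter word disappear while words such as "of" are kept, which is the same pattern visible in wiki_en.txt.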

## WikiExtractor

I then tried a different approach using WikiExtractor. These are the steps I took:

1. In a terminal, run: `python3 -m wikiextractor.WikiExtractor enwiki-latest-pages-articles16.xml-p20460153p20570392.bz2`

2. Concatenate the extracted files into a single txt file: `cat text/*/* > wiki.txt`

The result can be seen in wiki.txt. I then removed the doc tags (`<doc id= ...............>`) and wrote the output to a new file: new_wiki.txt
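The tag-removal step is not shown in this notebook. One way it could be done is sketched below, assuming each article is wrapped in a standalone `<doc ...>` line and a closing `</doc>` line. The helper names `strip_doc_tags` and `clean_file` are hypothetical, and the attribute values in the sample are made up for illustration.

```python
import re

# Matches a line that consists only of an opening <doc ...> or closing </doc> tag.
_DOC_TAG = re.compile(r"^\s*</?doc\b[^>]*>\s*$")


def strip_doc_tags(lines):
    """Yield every line that is not a standalone <doc ...> or </doc> tag."""
    for line in lines:
        if not _DOC_TAG.match(line):
            yield line


def clean_file(src="wiki.txt", dst="new_wiki.txt"):
    """Copy src to dst line by line, leaving out the doc tag lines."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(strip_doc_tags(fin))


# Small in-memory example (attribute values are placeholders)
sample = [
    '<doc id="1" url="placeholder" title="Example">\n',
    "Example\n",
    "\n",
    "Body text of the article.\n",
    "</doc>\n",
]
print("".join(strip_doc_tags(sample)))
```

Calling `clean_file()` with its default arguments would read wiki.txt and write new_wiki.txt. Processing the file line by line avoids loading the whole dump into memory.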

In [12]:
# The following output is produced (see full file: new_wiki.txt)

from itertools import islice

# Print the first five lines; the with block closes the file automatically
with open("new_wiki.txt", encoding="utf-8") as f:
    for line in islice(f, 5):
        print(line)

Sayf al-Din Ghazi II



Sayf al-Din Ghazi (II) ibn Mawdud (; full name: Sayf al-Din Ghazi II ibn Mawdud ibn Zengi; died 1180) was a Zangid Emir of Mosul, the nephew of Nur ad-Din Zengi. 

He became Emir of Mosul in 1170 after the death of his father Qutb ad-Din Mawdud. Saif had been chosen as the successor under the advice of eunuch ’Abd al-Masish, who wanted to keep the effective rule in lieu of the young emir; the disinherited son of Mawdud, Imad ad-Din Zengi II, fled to Aleppo at the court of Nur ad-Din. The latter, who was waiting for an excuse to annex Mosul, conquered Sinjar in September 1170 and besieged Mosul, which surrendered on 22 January 1171. After ousting al-Masish, he put Gümüshtekin, one of his officers, as governor, leaving Saif ud-Din nothing but the nominal title of emir. The latter also married the daughter of Nur ad-Din. 

At Nur ad-Din's death (May 1174), Gümüshtekin went to Damascus to take control of his son and entitled himself of atabeg of Aleppo. Saif ud-Din 

In [8]:
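# import spacy
# from itertools import islice

# nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer works in spaCy 3

# # nlp() expects text, not a file name, so read (part of) the file first
# with open("wiki.txt", encoding="utf-8") as f:
#     tokens = nlp(f.read(100000))

# # Join the first four sentences into one prompt
# prompt = ' '.join(sent.text.strip() for sent in islice(tokens.sents, 4))
# print(prompt)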