### Dependencies
`pip install wikipedia`

`pip install ipywidgets`

`pip install requests`

In [1]:
import wikidownloader as wiki

In [2]:
help(wiki.pages)

Help on function pages in module wikidownloader:

pages(page_amount, buffer_size=50, retry_failed_pages=False) -> Generator[str, NoneType, NoneType]
    Generator that downloads random pages and yields strings, each joining 'buffer_size' pages.
    It loads 'buffer_size' pages at a time in separate threads and joins them into one string,
    so you don't have to wait about 2 seconds per page as you would with a single thread.
    It also keeps loading while your code is running, so if your code is relatively slow, the next
    string will probably be ready by the time you ask for it.
    
    Note: if 'page_amount' < 'buffer_size' then 'buffer_size' will be set to max(page_amount//100, 1)
    
     - 'page_amount': number of random Wikipedia pages you want to get. You are guaranteed to
     get at least the requested amount, and will probably get more
     - 'buffer_size': amount of pages that will be loaded simultaneously and yielded to you joined 
     in a single string. (ret
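
The docstring describes two ideas: fetching a batch of pages in parallel threads, and preparing the next batch while the caller is still working on the current one. The following is a minimal sketch of that general pattern, not the actual `wikidownloader` implementation. `buffered_pages`, `fetch_page` and `separator` are hypothetical names introduced for illustration, and unlike the real `pages` this sketch yields exactly `page_amount` pages and has no retry logic.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Generator


def buffered_pages(
    fetch_page: Callable[[], str],  # hypothetical: returns the text of one random page
    page_amount: int,
    buffer_size: int = 50,
    separator: str = "\n",
) -> Generator[str, None, None]:
    """Yield strings that each join up to `buffer_size` pages fetched in parallel."""

    def load_batch(n: int) -> str:
        # Fetch n pages concurrently, one worker thread per page.
        with ThreadPoolExecutor(max_workers=n) as pool:
            futures = [pool.submit(fetch_page) for _ in range(n)]
            return separator.join(f.result() for f in futures)

    # Split the requested amount into batch sizes, e.g. 7 pages -> [3, 3, 1].
    sizes = []
    remaining = page_amount
    while remaining > 0:
        sizes.append(min(buffer_size, remaining))
        remaining -= sizes[-1]
    if not sizes:
        return

    # A single background worker prepares the next batch while the caller
    # is still processing the batch that was just yielded.
    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(load_batch, sizes[0])
        for next_size in sizes[1:] + [0]:
            batch = pending.result()
            if next_size:
                pending = prefetcher.submit(load_batch, next_size)
            yield batch
```

With a stand-in such as `fetch_page = lambda: "text"`, the call `list(buffered_pages(fetch_page, 7, buffer_size=3))` produces three strings holding 3, 3 and 1 pages.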

In [3]:
%%time
for page_text in wiki.pages(3):
    words = page_text.split(" ")
    print(f"{len(words)} words\t{''.join(w+' ' for w in words[:7])}...")

1 words	 ...
1194 words	The 1125 German royal election was the ...
85 words	Mir Mukhtar Akhyar (1653-1719)(Urdu: میرمختار اخیار) was ...
45 words	Wick is an unincorporated community in Tyler ...
CPU times: total: 609 ms
Wall time: 7.43 s
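
The run above yields one page per string because of the note in the `pages` docstring: when `page_amount` is smaller than `buffer_size`, the buffer shrinks to `max(page_amount//100, 1)`. The sketch below only restates that arithmetic. `effective_buffer_size` and `min_yields` are hypothetical helpers, not part of `wikidownloader`, and the lower bound on the number of yields is an assumption derived from the "at least the requested amount" guarantee.

```python
import math


def effective_buffer_size(page_amount: int, buffer_size: int = 50) -> int:
    """Hypothetical helper mirroring the note in the `pages` docstring."""
    if page_amount < buffer_size:
        return max(page_amount // 100, 1)
    return buffer_size


def min_yields(page_amount: int, buffer_size: int = 50) -> int:
    """Assumed lower bound on how many strings the generator yields."""
    return math.ceil(page_amount / effective_buffer_size(page_amount, buffer_size))


print(effective_buffer_size(3))        # 1
print(effective_buffer_size(300, 50))  # 50
print(min_yields(3))                   # 3
print(min_yields(50, 15))              # 4
```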


In [4]:
%%time
wiki.set_lang('ru')
for page_text in wiki.pages(3):
    words = page_text.split(" ")
    print(f"{len(words)} words\t{''.join(w+' ' for w in words[:10])}...")

182 words	Александр Карл Ангальт-Бернбургский (нем. Alexander Carl von Anhalt-Bernburg; 2 марта ...
182 words	Глисерио Бадильес (исп. Glicerio Badilles) — филиппинский шахматист, национальный мастер.
Входил ...
241 words	Тенофовир/эмтрицитабин, известный под торговой маркой Truvada — комбинация двух антиретровирусных ...
CPU times: total: 547 ms
Wall time: 6.32 s


In [5]:
%%time
# retry_failed_pages=True: almost every yielded string contains exactly buffer_size (here 50) pages, but it takes longer
for page_text in wiki.pages(300, buffer_size=50, retry_failed_pages=True):
    words = page_text.split(" ")
    print(f"{len(words)} words\t{''.join(w+' ' for w in words[:10])}...")



19911 words	Евгений Петрович Новиков (15 [27] августа 1826, Москва — 21 ...
26078 words	Доротея Энгельбретсдаттер (норв. Dorothe Engelbretsdotter; 16 января 1634, Берген, Норвегия ...
16409 words	Месяцослов — нецерковный календарь, появившийся в России в 1702 году ...
23643 words	Дворец Гижицкого — памятник архитектуры в селе Новоселица Хмельницкой области ...
23238 words	Сидни А. Монкриф (англ. Sidney A. Moncrief; род. 21 сентября ...
33620 words	Фаминцыны —  древний дворянский род.


== Происхождение и история рода ...
CPU times: total: 1min 34s
Wall time: 1min 20s


In [6]:
%%time
# retry_failed_pages=False: a yielded string may contain fewer than buffer_size (here 50) pages, but it is faster in the long run.
# You will still receive at least the requested number of pages in total (probably even more).
for page_text in wiki.pages(300, buffer_size=50, retry_failed_pages=False):
    words = page_text.split(" ")
    print(f"{len(words)} words\t{''.join(w+' ' for w in words[:10])}...")

22813 words	Ниже представлена хронология мировых рекордов у мужчин в легкоатлетической эстафете ...
38072 words	Густаво Эспиноса Эспадас младший (исп. Gustavo Espinosa Espadas Jr.; род. ...
40041 words	Шайыр (каз. Шайыр) — село в Мангистауском районе Мангистауской области ...
23328 words	Тайгер Девор (англ. Tiger Devore; ранее известный как Говард Девор ...
17532 words	Веджини (груз. ვეჯინი) — село в Грузии. Находится в Гурджаанском ...
31044 words	The Sims 3: Pets — однопользовательская видеоигра в жанре симулятора ...
19496 words	Воронежский институт МВД России — высшее военно-учебное заведение, основанное 29 ...
CPU times: total: 1min 30s
Wall time: 1min 22s


In [7]:
help(wiki.get_widget)

Help on function get_widget in module wikidownloader:

get_widget() -> ipywidgets.widgets.widget_box.HBox
    # Returns ipywidgets.HBox. 
    its progress bar can be modified with following methods:
     - update_bar_val()
     - update_bar_max()
     - update_bar_desc()
     - update_bar_done()
     - update_bar_notdone()
    
    also upper text from the left widget box is editable with
     - update_text_status()
    
    additionally, you may set text between the upper and lower texts in the left box
     - update_text_info()
    
    These methods will do nothing if called before get_widget()



In [8]:
wiki.get_widget()

HBox(children=(VBox(children=(Output(layout=Layout(margin='auto', width='92%')), Output(layout=Layout(margin='…

In [9]:
wiki.update_text_status("I'm here!")

In [10]:
wiki.update_text_info("And I am here too")

In [11]:
wiki.update_bar_max(150)
wiki.update_bar_val(78)

In [12]:
wiki.update_bar_desc("Custom text")

In [13]:
wiki.update_bar_done()

In [14]:
wiki.update_bar_notdone()

In [15]:
wiki.get_widget()

HBox(children=(VBox(children=(Output(layout=Layout(margin='auto', width='92%')), Output(layout=Layout(margin='…

In [16]:
%%time
for page_text in wiki.pages(50, buffer_size=15):
    words = page_text.split(" ")
    print(f"{len(words)} words\t{''.join(w+' ' for w in words[:10])}...")

3547 words	1049 (ты́сяча со́рок девя́тый) год  по юлианскому календарю — ...
44844 words	Республиканская альтернатива (РЕАЛ) (азерб. Respublikaçı Alternativ Partiyası; англ. Republican Alternative ...
8804 words	Caliban — немецкая металкор-группа. На сегодняшний день выпустила одиннадцать студийных ...
4004 words	Furcifer lateralis  (лат.) — вид ящериц из семейства хамелеонов, ...
CPU times: total: 18.1 s
Wall time: 32.3 s


In [17]:
wiki.get_widget()

HBox(children=(VBox(children=(Output(layout=Layout(margin='auto', width='92%')), Output(layout=Layout(margin='…

In [18]:
%%time
# Practical example with progress bar interaction
from time import time
from nltk import sent_tokenize
from nltk.tokenize import word_tokenize
from pymorphy2 import MorphAnalyzer
morph = MorphAnalyzer()

pages_amount = 500
output_file_name = f'{pages_amount}_pages.txt'; open(output_file_name, 'w').close() # clearing/creating file
b_time = time()
bar_update_step = 50

for pages_text in wiki.pages(pages_amount, buffer_size=pages_amount//10):
    # tokenizing
    wiki.update_text_status('Tokenizing sentences...')
    sentences = sent_tokenize(pages_text)
    # widget visuals
    wiki.update_bar_notdone()
    wiki.update_bar_val(0)
    wiki.update_bar_max(len(sentences))
    token_count = 0
    # lemmatizing
    wiki.update_text_status('Lemmatizing sentences...')
    for i in range(len(sentences)):
        tokens = word_tokenize(sentences[i]); token_count += len(tokens)
        sentences[i] = ''.join([morph.parse(t)[0].normal_form + ' ' for t in tokens])
        # widget visuals not on every step so we have no bottleneck in graphics
        if i % bar_update_step == 0:
            wiki.update_bar_val(i)
            wiki.update_text_info(f"{round(token_count/max(time()-b_time, 1), 2)} tokens/s")
            token_count = 0
            b_time = time()
    wiki.update_bar_done()
    # saving to file
    wiki.update_text_status(f"Writing to '{output_file_name}'...")
    with open(output_file_name, 'a', encoding='utf-8') as f:
        f.write(''.join([sent + "\n" for sent in sentences]))
        
wiki.update_text_status("Done!")
wiki.update_bar_done()



CPU times: total: 4min 17s
Wall time: 3min 10s
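
The cell above updates the widget only every `bar_update_step` sentences so that drawing does not become a bottleneck. A time-based variant of the same idea is sketched below; it is a different technique from the step counter used above, and `ThrottledUpdater`, `min_interval` and `clock` are hypothetical names, not part of `wikidownloader`.

```python
from time import monotonic
from typing import Any, Callable


class ThrottledUpdater:
    """Forward values to `callback` at most once per `min_interval` seconds."""

    def __init__(
        self,
        callback: Callable[[Any], None],
        min_interval: float = 0.2,
        clock: Callable[[], float] = monotonic,  # injectable for testing
    ) -> None:
        self._callback = callback
        self._min_interval = min_interval
        self._clock = clock
        self._last = None  # time of the last forwarded update

    def __call__(self, value: Any, force: bool = False) -> bool:
        """Return True if the update was forwarded, False if it was skipped."""
        now = self._clock()
        due = self._last is None or now - self._last >= self._min_interval
        if force or due:
            self._callback(value)
            self._last = now
            return True
        return False
```

Wrapping an existing function, for example `update_bar = ThrottledUpdater(wiki.update_bar_val)`, would let the loop call `update_bar(i)` on every iteration, with `update_bar(i, force=True)` for the final value.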


In [19]:
from ipywidgets import Output
out = Output()
out

Output()

In [20]:
%%time
# The same practical example using only the wikipedia module: no threads and no generators
from time import time, sleep
from nltk import sent_tokenize
from nltk.tokenize import word_tokenize
from pymorphy2 import MorphAnalyzer
from wikipedia import random as random_page, page
from wikipedia.exceptions import WikipediaException
from requests.exceptions import RequestException
morph = MorphAnalyzer()

def get_page(rec_depth=0):
    if rec_depth >= 20:
        raise RecursionError
    page_title = random_page()
    try:
        return page(page_title).content
    except WikipediaException:
        return get_page(rec_depth=rec_depth + 1)
    except RequestException:
        sleep(5)
        return get_page(rec_depth = rec_depth + 1)

pages_amount = 500
output_file_name = f'{pages_amount}_pages.txt'; open(output_file_name, 'w').close() # clearing/creating file
update_step = 5

for i in range(pages_amount):
    if i % update_step == 0:
        with out:
            out.clear_output(wait=True)
            print(f"{round(100*i/pages_amount, 2)}%")
    pages_text = get_page()
    # tokenizing
    sentences = sent_tokenize(pages_text)
    # lemmatizing
    for j in range(len(sentences)):  # j avoids shadowing the outer page counter i
        tokens = word_tokenize(sentences[j])
        sentences[j] = ''.join([morph.parse(t)[0].normal_form + ' ' for t in tokens])
    # saving to file
    with open(output_file_name, 'a', encoding='utf-8') as f:
        f.write(''.join([sent + "\n" for sent in sentences]))
        

CPU times: total: 2min 22s
Wall time: 20min 18s


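*The result speaks for itself: about 3 minutes of wall time with wikidownloader versus about 20 minutes with the plain wikipedia module.*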
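
The two 500-page runs can be compared directly from their reported wall times. The snippet below only does that arithmetic; `to_seconds` is a hypothetical helper, and since each figure comes from a single run over random pages, the ratio is a rough indication rather than a benchmark.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'Xmin Ys' wall time into seconds."""
    return minutes * 60 + seconds


threaded = to_seconds(3, 10)     # In [18]: Wall time: 3min 10s
sequential = to_seconds(20, 18)  # In [20]: Wall time: 20min 18s

speedup = sequential / threaded
print(f"{speedup:.1f}x faster")  # 6.4x faster
```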