# Data Preparation

This notebook lays the foundation for testing the **Low-Entropy Conjecture** across three typologically different languages. We extract and clean Wikipedia corpora for **Turkish** (agglutinative), **Russian** (fusional), and **English** (analytic) to create the balanced datasets needed for the cross-linguistic comparison that follows.

## Corpus Extraction and Initial Processing

We begin with Wikipedia dumps from October 2018, chosen for their comparable coverage and linguistic diversity across all three languages. Wikipedia suits this purpose because it contains both frequent everyday vocabulary and specialized terminology, which together exercise a wide range of each language's morphology.

The extraction process balances thoroughness with computational efficiency. Rather than processing entire dumps, we target approximately **350,000 articles per language** to ensure sufficient lexical coverage while maintaining manageable processing times.
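The capping idea can be illustrated with a small sketch. It assumes, as the extractor in this notebook does, that each article starts on a line containing `<article name="` and ends on a line containing `</article>`. The generator `iter_articles` and the in-memory `sample` are hypothetical and exist only for illustration; the notebook's own extractor uses an explicit counter instead.

```python
import io
from itertools import islice

def iter_articles(lines):
    """Lazily yield one raw article string at a time (illustrative helper)."""
    buffer = []
    in_article = False
    for line in lines:
        if '<article name="' in line:
            # A new article begins: start a fresh buffer.
            buffer = [line]
            in_article = True
        elif in_article:
            buffer.append(line)
            if '</article>' in line:
                # The article is complete: hand it to the caller.
                yield ''.join(buffer)
                in_article = False

# Tiny stand-in for a corpus file with three articles.
sample = io.StringIO(
    '<article name="A">\n<content>first</content>\n</article>\n'
    '<article name="B">\n<content>second</content>\n</article>\n'
    '<article name="C">\n<content>third</content>\n</article>\n'
)

# Take only the first two articles; the generator stops reading after that.
first_two = list(islice(iter_articles(sample), 2))
print(len(first_two))
```

Because the generator is lazy, lines belonging to articles beyond the cap are never parsed, which is what keeps processing time bounded for the larger dumps.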

In [2]:
import bz2
import re
import os
import json
import pickle
from pathlib import Path
from collections import Counter
import html
from tqdm import tqdm

Let's define the functions so that everything works for Russian, English, and Turkish.

In [4]:
def open_corpus_file(filepath):
    if filepath.endswith('.bz2'):
        return bz2.open(filepath, 'rt', encoding='utf-8')
    else:
        return open(filepath, 'r', encoding='utf-8')

def clean_xml_entities(text):
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    text = text.replace('&quot;', '"')
    text = text.replace('&apos;', "'")
    text = text.replace('&amp;', '&')  # This must be last
    return text

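def extract_plain_text_from_xml(filepath, skip_disambiguation=True, skip_redirects=True, max_articles=1000000):
    """Return a list of cleaned article texts from a corpus XML file.

    Disambiguation and redirect-only articles are skipped when the matching
    flag is set, articles of 50 characters or fewer after cleaning are
    dropped, and extraction stops once max_articles articles are kept.
    """
    print(f"Extracting plain text from {filepath} (max {max_articles:,} articles)...")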
    
    all_text = []
    article_count = 0
    filtered_disambig = 0
    filtered_redirects = 0

    with open_corpus_file(filepath) as file:
        current_article = ""
        in_article = False
        
        for line_num, line in enumerate(tqdm(file, desc="Processing XML")):
            if '<article name="' in line:
                name_match = re.search(r'<article name="([^"]+)">', line)
                if name_match:
                    article_name = name_match.group(1)
                    current_article = line
                    in_article = True
            
            elif '</article>' in line and in_article:
                current_article += line
                article_text = current_article
                article_text = article_text.replace('\n', '&newline;')
                
                if skip_disambiguation and '<disambiguation/>' in article_text:
                    filtered_disambig += 1
                    in_article = False
                    current_article = ""
                    continue
                
                if skip_redirects and '<redirect name=' in article_text and '<content>' not in article_text:
                    filtered_redirects += 1
                    in_article = False
                    current_article = ""
                    continue
                
                article_text = re.sub(r'<!--.+?-->', '', article_text)
                article_text = re.sub(r'</?cell>', ' ', article_text)
                article_text = re.sub(r'</?content>', '', article_text)
                article_text = re.sub(r'</?wikipedia>', '', article_text)
                article_text = re.sub(r'<wikipedia lang="[^"]*">', '', article_text)
                article_text = re.sub(r'<redirect name="[^"]*"/>', '', article_text)
                article_text = re.sub(r'<links_in name="[^"]*"/>', '', article_text)
                article_text = re.sub(r'<links_out name="[^"]*"/>', '', article_text)
                article_text = re.sub(r'<category name="[^"]*"/>', '', article_text)
                article_text = re.sub(r'<crosslanguage_link language="[^"]*" name="[^"]*"/>', '', article_text)
                article_text = re.sub(r'<link target="[^"]*">', '', article_text)
                article_text = re.sub(r'</link>', '', article_text)
                article_text = re.sub(r'<textlink name="[^"]*" freq="[0-9]+"/>', '', article_text)
                article_text = re.sub(r'<disambiguation/>', '', article_text)
                article_text = re.sub(r'<article name="[^"]*">', '', article_text)
                article_text = re.sub(r'</article>', '', article_text)
                article_text = re.sub(r'</?p>', '', article_text)
                article_text = re.sub(r'</?h>', '', article_text)
                article_text = re.sub(r'</?math>', '', article_text)
                article_text = re.sub(r'</?table>', '', article_text)
                
                article_text = clean_xml_entities(article_text)
                
                article_text = re.sub(r'&newline;\s+', '&newline;', article_text)
                article_text = re.sub(r'\s+&newline;', '&newline;', article_text)
                article_text = re.sub(r'(&newline;)+', '&newline;', article_text)
                article_text = re.sub(r'\s+', ' ', article_text)
                
                article_text = article_text.replace('&newline;', '\n')
                
                clean_text = article_text.strip()
                if len(clean_text) > 50:
                    all_text.append(clean_text)
                    article_count += 1
                    
                    if article_count % 1000 == 0:
                        print(f"Processed {article_count} articles...")
                    
                    if article_count >= max_articles:
                        print(f"Reached maximum articles limit: {max_articles:,}")
                        break
                
                in_article = False
                current_article = ""
                
            elif in_article:
                current_article += line
            
            if article_count >= max_articles:
                break
    
    print(f"Extracted {article_count} articles")
    if skip_disambiguation:
        print(f"Filtered {filtered_disambig} disambiguation articles")
    if skip_redirects:
        print(f"Filtered {filtered_redirects} redirect-only articles")
    
    return all_text

def extract_text_for_language(filepath, language_name, max_articles=1000000):
    print(f"\n=== Extracting {language_name} Text (max {max_articles:,} articles) ===")
    
    article_texts = extract_plain_text_from_xml(filepath, max_articles=max_articles)
    
    combined_text = '\n\n'.join(article_texts)
    
    print(f"Total text length: {len(combined_text):,} characters")
    print(f"Number of articles: {len(article_texts):,}")
    
    if combined_text:
        sample = combined_text[:500].replace('\n', ' ')
        print(f"Sample text: {sample}...")
    
    return combined_text

def save_extracted_text(text, language_code, output_dir="../data/processed/raw_text"):
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    
    output_file = Path(output_dir) / f"{language_code}_text.txt"
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(text)
    
    print(f"Saved text to: {output_file}")
    print(f"File size: {output_file.stat().st_size / (1024*1024):.1f} MB")

def create_vocabulary_from_text(text, language_code, top_k=10000, min_word_len=2):
    print(f"\nCreating vocabulary for {language_code}...")
    
    words = text.lower().split()
    filtered_words = []
    for word in words:
        word = re.sub(r'^[^\w]+|[^\w]+$', '', word, flags=re.UNICODE)
        if len(word) >= min_word_len and re.search(r'[a-zA-Zа-яё\u00C0-\u017F\u0100-\u024F\u1E00-\u1EFF]', word):
            filtered_words.append(word)
    
    word_counts = Counter(filtered_words)
    
    top_words = word_counts.most_common(top_k)
    vocabulary = [word for word, count in top_words]
    
    print(f"Total words: {len(filtered_words):,}")
    print(f"Unique words: {len(word_counts):,}")
    print(f"Top {top_k} vocabulary: {len(vocabulary):,}")
    print(f"Sample vocabulary: {vocabulary[:10]}")
    
    return vocabulary

def extract_turkish(max_articles=350000):
    return extract_text_for_language('../data/raw_corpora/trwiki-20181001-corpus.xml.bz2', 'Turkish', max_articles)

def extract_russian(max_articles=350000):
    return extract_text_for_language('../data/raw_corpora/ruwiki-20181001-corpus.xml.bz2', 'Russian', max_articles)

def extract_english(max_articles=350000):
    return extract_text_for_language('../data/raw_corpora/enwiki-20181001-corpus.xml.bz2', 'English', max_articles)

def analyze_text_structure(text, language_name):
    print(f"\n=== Text Structure Analysis for {language_name} ===")
    
    total_chars = len(text)
    lines = text.split('\n')
    non_empty_lines = [line for line in lines if line.strip()]
    # Guard against division by zero when the text is empty.
    line_count = max(len(non_empty_lines), 1)
    
    print(f"Total characters: {total_chars:,}")
    print(f"Total lines: {len(lines):,}")
    print(f"Non-empty lines: {len(non_empty_lines):,}")
    print(f"Average characters per line: {total_chars / line_count:.1f}")
    
    words = text.split()
    print(f"Total words: {len(words):,}")
    print(f"Average words per line: {len(words) / line_count:.1f}")
    
    double_newlines = text.count('\n\n')
    print(f"Article separators (\\n\\n): {double_newlines:,}")
    print(f"Estimated articles: {double_newlines + 1:,}")
    
    print("\nStructure: Text contains full articles separated by double newlines")
    print("Each article contains multiple sentences in paragraph form")
    print("Words are separated by spaces (not pre-tokenized)")
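
One caveat for the Turkish vocabulary: `create_vocabulary_from_text` lowercases with Python's built-in `str.lower()`, which is locale-independent. It maps `I` to `i` rather than to dotless `ı`, and maps `İ` to `i` followed by a combining dot (U+0307), so forms such as `İstanbul` and `istanbul` may end up as different vocabulary entries. Below is a minimal sketch of one possible workaround. `turkish_lower` is a hypothetical helper, not part of the functions above, and it would only make sense for Turkish text.

```python
def turkish_lower(text):
    """Lowercase text with the Turkish dotted/dotless i mapping (illustrative helper)."""
    # Map the two Turkish-specific capitals first, then defer to str.lower().
    return text.replace('İ', 'i').replace('I', 'ı').lower()

# Built-in lowercasing keeps a combining dot after the 'i'.
default_form = 'İstanbul'.lower()
turkish_form = turkish_lower('İstanbul')

# The default form is one code point longer than the Turkish-aware form.
print(len(default_form), len(turkish_form))
```

Whether to apply such a mapping is an analysis decision; it changes which surface forms are merged and therefore the resulting word counts.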

In [5]:
turkish_text = extract_turkish()


=== Extracting Turkish Text (max 350,000 articles) ===
Extracting plain text from trwiki-20181001-corpus.xml.bz2 (max 350,000 articles)...


Processing XML: 238564it [00:00, 399321.30it/s]

Processed 1000 articles...


Processing XML: 401105it [00:01, 222504.70it/s]

Processed 2000 articles...


[... progress output for articles 3,000 through 339,000 omitted ...]


Processing XML: 19099655it [01:20, 262678.65it/s]

Processed 340000 articles...
Processed 341000 articles...


Processing XML: 19183111it [01:21, 269013.04it/s]

Processed 342000 articles...
Processed 343000 articles...


Processing XML: 19268540it [01:21, 279829.95it/s]

Processed 344000 articles...
Processed 345000 articles...


Processing XML: 19391435it [01:22, 303941.80it/s]

Processed 346000 articles...


Processing XML: 19473236it [01:22, 333380.51it/s]

Processed 347000 articles...
Processed 348000 articles...


Processing XML: 19568769it [01:22, 237023.77it/s]

Processed 349000 articles...
Processed 350000 articles...
Reached maximum articles limit: 350,000
Extracted 350000 articles
Filtered 4789 disambiguation articles
Filtered 0 redirect-only articles
Total text length: 587,824,356 characters
Number of articles: 350,000
Sample text: Cengiz Han (Cenghis Khan, Çinggis Haan ya da doğum adıyla Temuçin (anlamı: demirci), Moğolca: Чингис Хаан ya da "Tengiz" (anlamı: deniz), ; d. 1162 – ö. 18 Ağustos 1227), Moğol komutan, hükümdar ve Moğol İmparatorluğu'nun kurucusudur. Cengiz Han, 13. Yüzyılın başında Orta Asya'daki tüm göçebe bozkır kavimlerini birleştirerek bir ulus haline getirdi ve o ulusu Moğol siyasi kimliği çatısı altında topladı. Dünya tarihinin en büyük askeri dehalarından biri olarak kabul edilen Cengiz Han, hükümdarlığ...





In [6]:
russian_text = extract_russian()


=== Extracting Russian Text (max 350,000 articles) ===
Extracting plain text from ruwiki-20181001-corpus.xml.bz2 (max 350,000 articles)...


Processed 350000 articles...
Reached maximum articles limit: 350,000
Extracted 350000 articles
Filtered 18118 disambiguation articles
Filtered 0 redirect-only articles
Total text length: 1,267,408,989 characters
Number of articles: 350,000
Sample text: Литва́ ( ), официальное название — Лито́вская Респу́блика ( ) — государство, расположенное в Северной Европе (одна из стран Балтии). Столица страны — Вильнюс. Площадь — 65 300 км². Протяжённость с севера на юг - 280 км, а с запада на восток - 370 км. Население составляет 3 054 000 человек. Согласно более современной гипотезе, название страны могло произойти от этнонима «леты» или «лейти», которым жители окрестных земель называли дружинников литовских князей. В начале XIII века в земли балтов-язы...
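
The two summaries above report the same article count but very different text volumes. A quick check makes the gap concrete. The figures below are copied by hand from the Turkish and Russian extraction output. The `stats` dictionary and both helper functions are illustrative names, not part of the notebook's pipeline. The disambiguation share assumes that kept and filtered pages are the only pages the extractor encountered, which the output does not confirm.

```python
# Figures copied from the extraction summaries printed above.
stats = {
    "Turkish": {"articles": 350_000, "chars": 587_824_356, "disambig_filtered": 4_789},
    "Russian": {"articles": 350_000, "chars": 1_267_408_989, "disambig_filtered": 18_118},
}


def mean_article_length(entry):
    """Average number of characters per extracted article."""
    return entry["chars"] / entry["articles"]


def disambig_share(entry):
    """Share of encountered pages dropped as disambiguation pages.

    Assumption: encountered pages = kept articles + filtered disambiguation pages.
    """
    seen = entry["articles"] + entry["disambig_filtered"]
    return entry["disambig_filtered"] / seen


for language, entry in stats.items():
    print(
        f"{language}: {mean_article_length(entry):,.0f} chars/article, "
        f"{disambig_share(entry):.2%} disambiguation pages filtered"
    )

# How much larger the Russian sample is than the Turkish one, in characters.
ratio = stats["Russian"]["chars"] / stats["Turkish"]["chars"]
print(f"Russian/Turkish character ratio: {ratio:.2f}")
```

With equal article counts, the Russian sample is roughly twice the size of the Turkish one in characters, so any later comparison that depends on corpus size may need to normalize by characters or tokens rather than by articles.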


In [7]:
english_text = extract_english()


=== Extracting English Text (max 350,000 articles) ===
Extracting plain text from enwiki-20181001-corpus.xml.bz2 (max 350,000 articles)...




Processed 350000 articles...
Reached maximum articles limit: 350,000
Extracted 350000 articles
Filtered 5080 disambiguation articles
Filtered 0 redirect-only articles
Total text length: 2,413,476,077 characters
Number of articles: 350,000
Sample text: Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical or free associations. Anarchism holds the state to be undesirable, unnecessary and harmful. According to Peter Kropotkin, Godwin was "the first to formulate the political and economical conceptions of anarchism, even though he did not give...


## Now let's save all of that so that we don't depend on the kernel

In [8]:
from pathlib import Path

In [9]:
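Path("research_data").mkdir(exist_ok=True)
# The text files below are written to ../data/processed/raw_text, so make sure that directory exists too
Path("../data/processed/raw_text").mkdir(parents=True, exist_ok=True)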

In [10]:
# Assumes turkish_text holds the string returned by the Turkish extraction step above
with open("../data/processed/raw_text/turkish_text.txt", 'w', encoding='utf-8') as f:
    f.write(turkish_text)
print(f"Saved Turkish text: {len(turkish_text):,} characters")

Saved Turkish text: 587,824,356 characters


In [11]:
# Assumes russian_text holds the string returned by the Russian extraction step above
with open("../data/processed/raw_text/russian_text.txt", 'w', encoding='utf-8') as f:
    f.write(russian_text)
print(f"Saved Russian text: {len(russian_text):,} characters")

Saved Russian text: 1,267,408,989 characters


In [12]:
with open("../data/processed/raw_text/english_text.txt", 'w', encoding='utf-8') as f:
    f.write(english_text)
print(f"Saved English text: {len(english_text):,} characters")

Saved English text: 2,413,476,077 characters


In [13]:
import os

In [14]:
for lang in ['turkish', 'russian', 'english']:
    filepath = f"../data/processed/raw_text/{lang}_text.txt"
    if os.path.exists(filepath):
        size_mb = os.path.getsize(filepath) / (1024 * 1024)
        print(f"{lang.capitalize()} file size: {size_mb:.1f} MB")

Turkish file size: 603.2 MB
Russian file size: 2122.9 MB
English file size: 2310.5 MB
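
The sizes printed above are in bytes, while the extraction logs report lengths in characters. The two diverge because UTF-8 uses a different number of bytes per character depending on the script. A minimal sketch of that arithmetic follows, using only the figures already printed in this notebook; the helper name `bytes_per_char` is made up for illustration, and the results are approximate because the sizes are rounded to 0.1 MB.

```python
MB = 1024 * 1024

# (file size in MB, character count) exactly as printed earlier in this notebook
reported = {
    "turkish": (603.2, 587_824_356),
    "russian": (2122.9, 1_267_408_989),
    "english": (2310.5, 2_413_476_077),
}


def bytes_per_char(size_mb: float, n_chars: int) -> float:
    """Approximate average number of UTF-8 bytes per character in a file."""
    return size_mb * MB / n_chars


for lang, (size_mb, n_chars) in reported.items():
    print(f"{lang.capitalize()}: {bytes_per_char(size_mb, n_chars):.2f} bytes per character")

# The same effect on single words: Cyrillic letters need two bytes each in UTF-8.
for word in ["Литва", "Anarchism"]:
    print(word, len(word), "chars,", len(word.encode("utf-8")), "bytes")
```

Comparing corpora by character count rather than by file size avoids mixing this encoding effect into the cross-linguistic comparison.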


At this point we can already notice something interesting: Turkish produces the smallest amount of text, just 603 MB (588 million characters) from 350k articles, while Russian generates 2123 MB (1,267 million characters) and English yields 2311 MB (2,413 million characters) from the same article count. The character counts are the fairer comparison, since the Russian file size is inflated by two-byte Cyrillic characters in UTF-8.


> This is consistent with the idea that Turkish handles semantics differently because of its agglutinative morphology: extensive suffixation lets it pack meaning more compactly at the word level, whereas the more analytic English and fusional Russian need more words to express equivalent semantic content. Article length also varies between Wikipedia editions, so the size gap alone is suggestive rather than conclusive.

In [15]:
with open("../data/processed/raw_text/russian_text.txt", 'r', encoding='utf-8') as f:
    sample = f.read(500)
    print("Russian file sample:", sample[:200] + "...")

Russian file sample: Литва́ ( ), официальное название — Лито́вская Респу́блика ( ) — государство, расположенное в Северной Европе (одна из стран Балтии). Столица страны — Вильнюс.
Площадь — 65 300 км². Протяжённость с сев...


In [16]:
with open("../data/processed/raw_text/turkish_text.txt", 'r', encoding='utf-8') as f:
    sample = f.read(500)
    print("Turkish file sample:", sample[:200] + "...")

Turkish file sample: Cengiz Han (Cenghis Khan, Çinggis Haan ya da doğum adıyla Temuçin (anlamı: demirci), Moğolca: Чингис Хаан ya da "Tengiz" (anlamı: deniz), ; d. 1162 – ö. 18 Ağustos 1227), Moğol komutan, hükümdar ve Mo...


In [17]:
with open("../data/processed/raw_text/english_text.txt", 'r', encoding='utf-8') as f:
    sample = f.read(500)
    print("English file sample:", sample[:200] + "...")

English file sample: Anarchism is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them...


Next up: remove the text in brackets, such as
- (одна из стран Балтии) or
- (Cenghis Khan, Çinggis Haan ya da doğum adıyla Temuçin (anlamı: demirci), Moğolca: Чингис Хаан ya da "Tengiz" (anlamı: deniz), ; d. 1162 – ö. 18 Ağustos 1227)
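
The Turkish example is nested, which matters for any regex-based removal: a pattern that forbids parentheses inside the match only strips the innermost level on each pass. Below is a minimal sketch of repeated innermost removal. The function name, the `max_passes` guard and the leading `\s*` (which also drops the space before the bracket) are assumptions made for illustration, and the second test string is a shortened, made-up variant of the example above.

```python
import re

# One innermost parenthetical: "(" ... ")" with no parentheses inside,
# plus any whitespace directly before it.
INNERMOST_PARENS = re.compile(r"\s*\([^()]*\)")


def strip_parentheticals(text: str, max_passes: int = 10) -> str:
    """Remove parenthetical text, peeling off one nesting level per pass."""
    for _ in range(max_passes):
        text, n_removed = INNERMOST_PARENS.subn("", text)
        if n_removed == 0:  # nothing left to remove
            break
    return text


flat = "государство в Северной Европе (одна из стран Балтии)."
nested = "Cengiz Han (doğum adıyla Temuçin (anlamı: demirci)), Moğol komutan"

print(strip_parentheticals(flat))
print(strip_parentheticals(nested))
```

A single `sub` call on the nested string would leave `(doğum adıyla Temuçin)` behind, so the loop runs until a pass removes nothing.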

Multiple sources suggest removing parenthetical text:

https://raw.githubusercontent.com/eberlitz/pt-br-corpus/master/README.md#:~:text=,will%20allow%20abbreviations%2C%20like%20%27Dr
https://willbeason.com/2021/08/06/cleaning-the-wikipedia-corpus-articles-and-text/

The author of the second post, which comes up when searching for good practices for cleaning Wikipedia text, says:

> I'm currently debating whether to exclude parenthetical text (such as this). Simply ignoring the parentheses often breaks grammatical structure, and can sometime form their own prose-like thoughts... Update: I will be discarding parenthetical statements. They're noisy and if I don't have enough data I can care about that problem later.
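
One detail worth checking before committing to this: the Turkish example above contains nested parentheses, and a pattern that only matches innermost pairs needs more than one pass to remove them fully. The sketch below illustrates this on a shortened version of that example. The helper name `strip_parentheticals` and the `max_passes` cap are assumptions made for illustration, not part of the notebook's pipeline.

```python
import re

# Matches a parenthesis pair that contains no other parentheses (innermost only).
INNERMOST_PARENS = re.compile(r'\([^()]*\)')

def strip_parentheticals(text, max_passes=10):
    """Remove parenthetical spans, peeling nested ones from the inside out.

    `max_passes` is an assumed safety cap, not a value from the notebook.
    """
    for _ in range(max_passes):
        text, n_removed = INNERMOST_PARENS.subn('', text)
        if n_removed == 0:
            break
    return text

# Shortened, hypothetical variants of the examples quoted above.
flat = "государство (одна из стран Балтии)."
nested = "Cengiz Han (Temuçin (anlamı: demirci), d. 1162), Moğol komutan"

single_pass = INNERMOST_PARENS.sub('', nested)
print(single_pass)                    # outer parentheses survive one pass
print(strip_parentheticals(nested))   # repeated passes remove them
print(strip_parentheticals(flat))
```

Note that removal leaves a space before the following punctuation, which a later whitespace step has to deal with.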

## Text Cleaning and Normalization

Raw Wikipedia text contains all sorts of artifacts that could hinder morphological and semantic analyses. We should remove all that markup noise before we proceed.

We will go with the basic set of cleaning operations recommended here:

https://raw.githubusercontent.com/eberlitz/pt-br-corpus/master/README.md#:~:text=,will%20allow%20abbreviations%2C%20like%20%27Dr

keeping in mind that we have already:
- removed the HTML/XML tags
- converted XML entities such as `&lt;`, `&gt;`, `&quot;`, `&apos;`, `&amp;`
- removed Wikipedia markup such as links, categories and cross-language links
- applied basic whitespace normalization

In [32]:
import pandas as pd
import re
import os
from pathlib import Path
from tqdm import tqdm
import time
import warnings
warnings.filterwarnings('ignore')

Now we proceed straight to cleaning the language files. The originals are kept: the cleaned text is written to separate files.

As we noticed before, our files are huge, so looping over them several times in pure Python and applying the regexes one article at a time does not finish in any reasonable time (a single Turkish corpus was still not processed after nearly 2 hours). Instead we read each corpus in chunks and apply pandas **vectorized string operations** to each chunk, which can be roughly $10$-$50$x faster than explicit loops.
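
The chunking idea can be sketched separately from the cleaning itself. The generator below is a minimal illustration, assuming (as elsewhere in this notebook) that articles are separated by a blank line. The name `iter_article_batches` and the `chunk_chars` parameter are hypothetical. It reads a fixed number of characters at a time and carries the trailing, possibly incomplete article over to the next read, so no article is ever cut in half.

```python
import io

def iter_article_batches(fileobj, chunk_chars=1_000_000, sep="\n\n"):
    """Yield lists of complete articles, reading `chunk_chars` characters at a time."""
    buffer = ""
    while True:
        chunk = fileobj.read(chunk_chars)
        if not chunk:
            break
        buffer += chunk
        cut = buffer.rfind(sep)
        if cut == -1:
            # No article boundary yet: keep buffering instead of splitting mid-article.
            continue
        complete, buffer = buffer[:cut], buffer[cut + len(sep):]
        batch = [a.strip() for a in complete.split(sep) if a.strip()]
        if batch:
            yield batch
    # Whatever is left after the last separator is the final article.
    if buffer.strip():
        yield [buffer.strip()]

# Tiny in-memory stand-in for a corpus file; 20 characters per read.
demo = io.StringIO("first article\n\nsecond article\n\nthird article")
batches = list(iter_article_batches(demo, chunk_chars=20))
print(batches)
```

Each yielded batch could then be wrapped in a `pd.Series` and cleaned in one go, which is where the vectorized string operations come in.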

In [33]:
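class Patterns:
    # Innermost parentheses only: nested spans need more than one pass
    parentheses = re.compile(r'\([^()]*\)')
    square_brackets = re.compile(r'\[[^\]]*\]')
    # Integers and decimals such as 1162, 3.5 or 65,300
    numbers = re.compile(r'\b\d+[,.]?\d*\b')
    # Latin "km" / "km²" only; Cyrillic "км" is not matched
    measurements = re.compile(r'\b\d+\s*km²?\b')
    long_dashes = re.compile(r'[–—−‑]+')
    # Curly, low and angle quotes, later normalized to a straight double quote
    fancy_quotes = re.compile(r'[“”„«»]')
    # Curly apostrophes, low single quote and backtick, later normalized to '
    fancy_apostrophes = re.compile(r"[‘’‚`]")
    multiple_spaces = re.compile(r'\s+')
    # Whitespace before sentence-final punctuation
    space_punct = re.compile(r'\s+([.!?])')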
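
To make the effect of these patterns concrete, here is a small stdlib-only sketch that applies them one after another to two short strings. The order used is parentheses, brackets, numbers, measurements, dashes, quotes, apostrophes, whitespace. The `STEPS` list and `clean_line` helper are illustrative names, the English sentence is invented, and the Russian one is adapted from the sample printed earlier. The curly quotes are written out explicitly here.

```python
import re

# (pattern, replacement) pairs, applied top to bottom.
STEPS = [
    (re.compile(r'\([^()]*\)'), ''),        # innermost parentheses
    (re.compile(r'\[[^\]]*\]'), ''),        # square brackets
    (re.compile(r'\b\d+[,.]?\d*\b'), '0'),  # numbers -> placeholder 0
    (re.compile(r'\b\d+\s*km²?\b'), ''),    # "<number> km" / "km²", Latin script only
    (re.compile(r'[–—−‑]+'), '-'),          # long dashes -> hyphen
    (re.compile(r'[“”„«»]'), '"'),          # fancy quotes -> "
    (re.compile(r"[‘’‚`]"), "'"),           # fancy apostrophes -> '
    (re.compile(r'\s+'), ' '),              # collapse whitespace
    (re.compile(r'\s+([.!?])'), r'\1'),     # no space before . ! ?
]

def clean_line(text):
    """Apply every step in order and strip the ends."""
    for pattern, replacement in STEPS:
        text = pattern.sub(replacement, text)
    return text.strip()

english = "The lake (a reservoir) is 12 km long [1] and holds 3,5 units ."
russian = "Площадь — 65 300 км²."

print(clean_line(english))
print(clean_line(russian))
```

Two things stand out when running it: the measurement pattern still fires after number normalization, because the placeholder `0` is itself a digit, and the Cyrillic unit `км²` is left in place since the pattern only covers Latin `km`.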

In [34]:
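def vectorized_clean_chunk(articles_series):
    """Clean one chunk of articles (a pandas Series of strings); return (list, stats)."""
    # Counts are taken before cleaning: opening parentheses and digit runs found
    stats = {
        'parentheses': articles_series.str.count(r'\(').sum(),
        'numbers': articles_series.str.count(r'\b\d+').sum(),
        'chars_before': articles_series.str.len().sum()
    }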
    
    articles_series = articles_series.str.replace(Patterns.parentheses, '', regex=True)
    articles_series = articles_series.str.replace(Patterns.square_brackets, '', regex=True)
    articles_series = articles_series.str.replace(Patterns.numbers, '0', regex=True)
    articles_series = articles_series.str.replace(Patterns.measurements, '', regex=True)
    articles_series = articles_series.str.replace(Patterns.long_dashes, '-', regex=True)
    articles_series = articles_series.str.replace(Patterns.fancy_quotes, '"', regex=True)
    articles_series = articles_series.str.replace(Patterns.fancy_apostrophes, "'", regex=True)
    articles_series = articles_series.str.replace(Patterns.multiple_spaces, ' ', regex=True)
    articles_series = articles_series.str.replace(Patterns.space_punct, r'\1', regex=True)
    articles_series = articles_series.str.strip()
    
    # Drop articles that end up shorter than 10 characters
    articles_series = articles_series[articles_series.str.len() >= 10]
    
    stats['chars_after'] = articles_series.str.len().sum()
    stats['articles_kept'] = len(articles_series)
    
    return articles_series.tolist(), stats

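def clean_language_file(input_file, output_file, chunk_size_mb=50):
    """Clean a corpus file chunk by chunk; articles are separated by blank lines."""
    file_size_mb = os.path.getsize(input_file) / (1024 * 1024)
    # Note: read() on a text-mode file counts characters, not bytes, so chunks of
    # multi-byte text (e.g. Cyrillic) are larger than chunk_size_mb on disk
    chunk_size = chunk_size_mb * 1024 * 1024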
    
    print(f"File: {file_size_mb:.1f} MB, Chunk size: {chunk_size_mb} MB")
    
    Path(output_file).parent.mkdir(parents=True, exist_ok=True)
    
    total_stats = {'parentheses': 0, 'numbers': 0, 'chars_before': 0, 'chars_after': 0, 'articles': 0}
    
    chunks_processed = 0
    
    with open(input_file, 'r', encoding='utf-8') as infile, \
         open(output_file, 'w', encoding='utf-8') as outfile:
        
        # Holds the trailing, possibly incomplete article between reads
        buffer = ""
        
        with tqdm(total=file_size_mb, unit='MB', desc="Processing") as pbar:
            
            while True:
                chunk = infile.read(chunk_size)
                if not chunk:
                    if buffer.strip():
                        articles = [art.strip() for art in buffer.split('\n\n') if art.strip()]
                        if articles:
                            cleaned_articles, chunk_stats = vectorized_clean_chunk(pd.Series(articles))
                            for article in cleaned_articles:
                                outfile.write(article + '\n\n')
                            
                            for key in ['parentheses', 'numbers', 'chars_before', 'chars_after']:
                                total_stats[key] += chunk_stats[key]
                            total_stats['articles'] += len(cleaned_articles)
                    break
                
                chunk_size_mb_actual = len(chunk.encode('utf-8')) / (1024 * 1024)
                pbar.update(chunk_size_mb_actual)
                
                full_text = buffer + chunk
                last_article_end = full_text.rfind('\n\n')
                
                if last_article_end != -1:
                    process_text = full_text[:last_article_end + 2]
                    buffer = full_text[last_article_end + 2:]
                else:
                    # No article boundary in this chunk yet: keep buffering
                    # rather than cutting an article in half
                    buffer = full_text
                    continue
                
                if process_text.strip():
                    articles = [art.strip() for art in process_text.split('\n\n') if art.strip()]
                    
                    if articles:
                        # here we are doing the vectorized processing so it's done faster
                        cleaned_articles, chunk_stats = vectorized_clean_chunk(pd.Series(articles))
                        
                        for article in cleaned_articles:
                            outfile.write(article + '\n\n')
                        
                        # Update statistics
                        for key in ['parentheses', 'numbers', 'chars_before', 'chars_after']:
                            total_stats[key] += chunk_stats[key]
                        total_stats['articles'] += len(cleaned_articles)
                        
                        chunks_processed += 1
    
    # Final statistics
    output_size_mb = os.path.getsize(output_file) / (1024 * 1024)
    reduction_pct = (1 - output_size_mb / file_size_mb) * 100
    
    print(f"Processed {chunks_processed} chunks")
    print(f"Results: {total_stats['articles']:,} articles, "
          f"{total_stats['parentheses']:,} parentheses removed, "
          f"{total_stats['numbers']:,} numbers normalized")
    print(f"Size: {file_size_mb:.1f} MB -> {output_size_mb:.1f} MB ({reduction_pct:.1f}% reduction)")
    print(f"Saved: {output_file}")
    
    return output_file

In [35]:
Path("../data/processed/clean_text").mkdir(parents=True, exist_ok=True)

In [37]:
for language in ['turkish', 'russian', 'english']:
    print(f"\nProcessing {language.title()}:")
    input_file = f"../data/processed/raw_text/{language}_text.txt"
    output_file = f"../data/processed/clean_text/{language}_clean.txt"
    
    start_time = time.time()
    result_file = clean_language_file(input_file, output_file)
    processing_time = time.time() - start_time
    
    print(f"Processing time: {processing_time:.1f} seconds")


Processing Turkish:
File: 603.2 MB, Chunk size: 50 MB


Processing: 100%|████████████████████████████████| 603.1974821090698/603.1974821090698 [00:45<00:00, 13.27MB/s]


Processed 12 chunks
Results: 349,969 articles, 1,688,200 parentheses removed, 10,673,264 numbers normalized
Size: 603.2 MB -> 560.0 MB (7.2% reduction)
Saved: research_data/clean_data/turkish_clean.txt
Processing time: 45.5 seconds

Processing Russian:
File: 2122.9 MB, Chunk size: 50 MB


Processing: 100%|████████████████████████████████| 2122.904839515686/2122.904839515686 [01:38<00:00, 21.60MB/s]


Processed 25 chunks
Results: 349,988 articles, 4,304,471 parentheses removed, 16,250,307 numbers normalized
Size: 2122.9 MB -> 1961.2 MB (7.6% reduction)
Saved: research_data/clean_data/russian_clean.txt
Processing time: 98.3 seconds

Processing English:
File: 2310.5 MB, Chunk size: 50 MB


Processing: 100%|████████████████████████████████| 2310.487558364868/2310.487558364868 [02:57<00:00, 13.03MB/s]

Processed 47 chunks
Results: 350,000 articles, 5,759,614 parentheses removed, 26,233,991 numbers normalized
Size: 2310.5 MB -> 2159.5 MB (6.5% reduction)
Saved: research_data/clean_data/english_clean.txt
Processing time: 177.4 seconds





In [38]:
import random

In [44]:
chunks = [
    c.strip()
    for c in Path("../data/processed/clean_text/russian_clean.txt")
             .read_text(encoding="utf-8")
             .split("\n\n")
    if c.strip()
]

snippet = random.choice(chunks)
print(snippet)

Марк Антуан Эду - французский переводчик. Начал свою переводческую деятельность с участия в работе над переводом с английского языка "Медицинского словаря" Роберта Джеймса. Среди последующих переводов Эду - "История поэзии" Джона Брауна , "Теория нравственных чувств" Адама Смита , путевые записки Джона Белла , философские сочинения Фрэнсиса Хатчесона и др.Примечания


In [40]:
chunks = [
    c.strip()
    for c in Path("../data/processed/clean_text/turkish_clean.txt")
             .read_text(encoding="utf-8")
             .split("\n\n")
    if c.strip()
]

snippet = random.choice(chunks)
print(snippet)

Korumalı kruvazör, 0. yüzyıl sonlarında ortaya çıkan bir savaş gemisi tipidir. Bu gemi tipinde zırh kaplaması geminin makine dairesi gibi hayati bölümleri zırhla korunur, ayrıca geminin kömür ambarları da zırha ek koruma sağlarlar. Korumalı kruvazörler, yanlarda bir zırh kemeri olan zırhlı kruvazörlere alternatif bir tasarımdı.


In [42]:
chunks = [
    c.strip()
    for c in Path("../data/processed/clean_text/english_clean.txt")
             .read_text(encoding="utf-8")
             .split("\n\n")
    if c.strip()
]

snippet = random.choice(chunks)
print(snippet)

Pierre Antoine Noël Bruno, comte Daru was a French soldier, statesman, historian, and poet. The French generally refer to him as Pierre Daru.Early career Born in Montpellier, he was educated at the Oratorian-maintained military school of Tournon, and entered artillery service at an early age. He also took an interest in literature, and he published several minor pieces, until the outbreak of the French Revolution made him concentrate on his military assignments. In 0 he became commissary to the army, protecting the coasts of Brittany from projected descents of the British, or of French Royalists. Thrown into prison during the Reign of Terror, on an unsubstantiated charge of friendliness to the Royalists and the British, he was released after the fall of Maximilien Robespierre in the summer of 0 , and rose through the ranks until, in 0, he became chief commissary to the French Revolutionary Army serving under André Masséna in the north of Switzerland. In that position he won repute for 
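
The sampling cells above read a whole cleaned file into memory before picking one article, which is heavy for a 2 GB corpus. As an alternative sketch, reservoir sampling with a reservoir of size one picks a uniformly random article in a single streaming pass. The function name `sample_article` is hypothetical, and the code assumes that articles are separated by blank lines, as in the files written above.

```python
import io
import random

def sample_article(fileobj, rng=random):
    """Return one article chosen uniformly at random, reading line by line."""
    chosen = None
    seen = 0
    current = []

    def consider(lines):
        nonlocal chosen, seen
        article = "".join(lines).strip()
        if not article:
            return
        seen += 1
        # Reservoir sampling with k=1: keep the i-th article with probability 1/i.
        if rng.randrange(seen) == 0:
            chosen = article

    for line in fileobj:
        if line.strip():
            current.append(line)
        else:
            consider(current)  # a blank line closes the current article
            current = []
    consider(current)          # the last article may lack a trailing blank line
    return chosen

# In-memory stand-in for a cleaned corpus file.
demo = io.StringIO("alpha\n\nbeta\n\ngamma\n\n")
print(sample_article(demo, rng=random.Random(0)))

# With a real file (path as used in this notebook):
# with open("../data/processed/clean_text/russian_clean.txt", encoding="utf-8") as f:
#     print(sample_article(f))
```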

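> Okay, I think the texts have been reasonably well cleaned. Some residue is still visible in the samples, such as section titles fused to the preceding sentence ("др.Примечания", "Pierre Daru.Early career") and stray spaces before commas.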
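
A quick way to put rough numbers on "reasonably well cleaned" is to count a few residual artifacts in sampled snippets. The checks below are heuristics chosen for illustration (the names `CHECKS` and `residual_artifacts` are assumptions), and the fused-heading check will also fire on abbreviations such as "U.S.", so the counts are indicative only. The test strings are excerpts from the samples printed above.

```python
import re

CHECKS = {
    "space_before_comma": re.compile(r"\s+,"),
    "leftover_parenthesis": re.compile(r"[()]"),
    # A sentence end glued to a capitalised word, e.g. "Daru.Early career".
    "fused_heading": re.compile(r"[.!?](?=[A-ZА-ЯЁÇĞİÖŞÜ])"),
}

def residual_artifacts(text):
    """Count occurrences of each heuristic artifact pattern in `text`."""
    return {name: len(pattern.findall(text)) for name, pattern in CHECKS.items()}

english_1 = "The French generally refer to him as Pierre Daru.Early career Born in Montpellier"
english_2 = "in the summer of 0 , and rose through the ranks"
russian = "путевые записки Джона Белла , философские сочинения Фрэнсиса Хатчесона и др.Примечания"

for excerpt in (english_1, english_2, russian):
    print(residual_artifacts(excerpt))
```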

## Texts are ready

Now we have cleaned corpora that we will be using for the morphological complexity measurement and semantic network construction. 

The next phase will use these corpora to extract frequency-ranked vocabulary lists and begin the morphological analysis.