# Wikipedia Dump

As of 2022, the Georgian Wikipedia dump is over 180 MB compressed (1.5 GB decompressed). A smaller but still linguistically rich corpus containing only article abstracts is just 18 MB gzipped (233 MB decompressed). Both can be downloaded from https://dumps.wikimedia.org/kawiki/latest/.

Text can be extracted either with the WikiExtractor library or by hand with Beautiful Soup, which produces cleaner results for Georgian.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import wikitextparser as wtp
from tqdm import tqdm
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True, nb_workers=6)

INFO: Pandarallel will run on 6 workers.
INFO: Pandarallel will use Memory file system to transfer data between the main process and workers.


In [10]:
import gzip

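# The abstracts dump stores one <doc> element per article; the full articles
# dump (commented out) stores one <page> element per page instead.
dump_path = "../raw/kawiki-latest-abstract.xml.gz"
# dump_path = "../raw/kawiki-20220520-pages-articles-multistream.xml.gz"
with gzip.open(dump_path, "rt", encoding="utf-8") as dump:
    contents = dump.read()  # loads the whole decompressed dump into memory
soup = BeautifulSoup(contents, 'xml')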
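
Reading the whole file into memory is fine for the abstracts dump but may get heavy for the 1.5 GB articles dump. As a possible alternative, here is a minimal sketch that streams elements one at a time with the standard library's `xml.etree.ElementTree.iterparse` instead of Beautiful Soup. The helper name `iter_elements` and the tiny sample file are made up for illustration. The sketch also assumes plain tag names: the full articles dump declares an XML namespace, so tags there would need the namespace prefix.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET


def iter_elements(path, tag):
    """Yield each <tag> element from a gzipped XML file, one at a time."""
    with gzip.open(path, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == tag:
                yield elem
                elem.clear()  # release the element's text and children


# Tiny made-up sample in the shape of the abstracts dump.
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8"?>'
    "<feed>"
    "<doc><title>A</title><abstract>first</abstract></doc>"
    "<doc><title>B</title><abstract>second</abstract></doc>"
    "</feed>"
)

sample_path = os.path.join(tempfile.mkdtemp(), "sample-abstract.xml.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    f.write(SAMPLE)

titles = [doc.findtext("title") for doc in iter_elements(sample_path, "doc")]
print(titles)
```

Each element is consumed before the generator resumes and clears it, so the caller must read whatever it needs from an element before asking for the next one.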

In [12]:
print(soup.prettify()[:1000])

<?xml version="1.0" encoding="utf-8"?>
<feed>
 <doc>
  <title>
   ვიკიპედია: ედუარდ შევარდნაძე
  </title>
  <url>
   https://ka.wikipedia.org/wiki/%E1%83%94%E1%83%93%E1%83%A3%E1%83%90%E1%83%A0%E1%83%93_%E1%83%A8%E1%83%94%E1%83%95%E1%83%90%E1%83%A0%E1%83%93%E1%83%9C%E1%83%90%E1%83%AB%E1%83%94
  </url>
  <abstract>
   | დაბადების ადგილი= მამათი, ოზურგეთის მაზრა, საქართველოს სსრ
  </abstract>
  <links>
   <sublink linktype="nav">
    <anchor>
     ბიოგრაფია
    </anchor>
    <link>
     https://ka.wikipedia.org/wiki/%E1%83%94%E1%83%93%E1%83%A3%E1%83%90%E1%83%A0%E1%83%93_%E1%83%A8%E1%83%94%E1%83%95%E1%83%90%E1%83%A0%E1%83%93%E1%83%9C%E1%83%90%E1%83%AB%E1%83%94#ბიოგრაფია
    </link>
   </sublink>
   <sublink linktype="nav">
    <anchor>
     ადრეული წლები და განათლება
    </anchor>
    <link>
     https://ka.wikipedia.org/wiki/%E1%83%94%E1%83%93%E1%83%A3%E1%83%90%E1%83%A0%E1%83%93_%E1%83%A8%E1%83%94%E1%83%95%E1%83%90%E1%83%A0%E1%83%93%E1%83%9C%E1%83%90%E1%83%AB%E1%83%94#ადრეული_წლები_და_გან

In [11]:
# <page> elements exist only in the full articles dump, so the abstracts dump yields none.
pages = soup.find_all('page')
print(len(pages))

0


In [13]:
# In the abstracts dump each article is a <doc> element.
pages = soup.find_all('doc')
print(len(pages))

161032
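
The zero count for `page` and the non-zero count for `doc` appear to reflect a schema difference between the two dumps. One way to check which element names a dump actually uses is to count tags on a small sample before parsing the whole file. The sketch below uses the standard library's `xml.etree.ElementTree` rather than Beautiful Soup, and both the sample XML and the helper name `count_tags` are made up for illustration.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Made-up sample in the shape of the abstracts dump.
ABSTRACT_SAMPLE = """<feed>
  <doc><title>A</title><url>u</url><abstract>x</abstract></doc>
  <doc><title>B</title><url>v</url><abstract>y</abstract></doc>
</feed>"""


def count_tags(xml_text):
    """Return a Counter of element names found in an XML string."""
    root = ET.fromstring(xml_text)
    return Counter(elem.tag for elem in root.iter())


counts = count_tags(ABSTRACT_SAMPLE)
# A Counter returns 0 for names that never occur, such as "page" here.
print(counts["page"], counts["doc"])
```

With the `soup` object from the cells above, a comparable count could probably be obtained with `Counter(tag.name for tag in soup.find_all(True))`.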


In [4]:
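# The fields read below (id, timestamp, contributor, sha1) exist only in the full
# articles dump, so `pages` must hold its <page> elements, not abstract <doc> elements.
kawiki = []
for page in tqdm(pages):
    contributor = page.find('contributor')
    username = contributor.find('username')
    wiki = {
        'id': page.find('id').text,
        'ts': page.find('timestamp').text,
        'title': page.find('title').text,
        # Anonymous edits carry an <ip> element instead of a <username>.
        'author': username.text if username is not None else contributor.find('ip').text,
        'text': page.find('text').text,
        'sha1': page.find('sha1').text,
    }
    kawiki.append(wiki)

kawiki = pd.DataFrame(kawiki)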

100%|██████████| 320745/320745 [00:42<00:00, 7480.00it/s]
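
The author lookup in the loop above falls back from `<username>` to `<ip>`. The sketch below isolates that logic on a made-up two-page sample so it can be checked without the full dump. It uses the standard library's `xml.etree.ElementTree` rather than Beautiful Soup, the helper name `page_to_record` is hypothetical, and returning `None` when a contributor has neither child is an added assumption, not something the loop above does.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of the full articles dump (namespace omitted).
PAGE_SAMPLE = """<mediawiki>
  <page>
    <title>Registered edit</title>
    <id>1</id>
    <revision>
      <id>10</id>
      <timestamp>2022-05-20T04:35:44Z</timestamp>
      <contributor><username>ExampleUser</username><id>7</id></contributor>
      <text>body one</text>
      <sha1>aaa</sha1>
    </revision>
  </page>
  <page>
    <title>Anonymous edit</title>
    <id>2</id>
    <revision>
      <id>11</id>
      <timestamp>2022-05-19T23:20:41Z</timestamp>
      <contributor><ip>192.0.2.1</ip></contributor>
      <text>body two</text>
      <sha1>bbb</sha1>
    </revision>
  </page>
</mediawiki>"""


def page_to_record(page):
    """Flatten one <page> element into a dict of the fields used above."""
    revision = page.find("revision")
    contributor = revision.find("contributor")
    # Prefer the account name; fall back to the IP of an anonymous edit.
    author = contributor.findtext("username")
    if author is None:
        author = contributor.findtext("ip")  # stays None if both are missing
    return {
        "id": page.findtext("id"),  # direct child only, i.e. the page id
        "ts": revision.findtext("timestamp"),
        "title": page.findtext("title"),
        "author": author,
        "text": revision.findtext("text"),
        "sha1": revision.findtext("sha1"),
    }


records = [page_to_record(p) for p in ET.fromstring(PAGE_SAMPLE).iter("page")]
for record in records:
    print(record["id"], record["author"])
```

Looking up `id` as a direct child of `<page>` avoids picking up the revision or contributor ids, which share the same tag name.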


In [12]:
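# Strip wiki markup from every page in parallel.
kawiki['text'] = kawiki['text'].parallel_apply(wtp.remove_markup)
# Flag redirect pages, which start with the Georgian or the English redirect keyword.
kawiki['isRedirection'] = kawiki.text.str.startswith('#გადამისამართება') | kawiki.text.str.startswith('#REDIRECT')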

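
The redirect flag above matches the two keywords exactly as written. A slightly more forgiving check, sketched below in plain Python, ignores leading whitespace and letter case. Whether the dump really contains such variants is an assumption, and the helper name `is_redirect` is made up for illustration.

```python
# Redirect keywords taken from the cell above (Georgian and English).
REDIRECT_KEYWORDS = ("#გადამისამართება", "#REDIRECT")

# Case-folded once so the comparison ignores letter case.
_FOLDED_KEYWORDS = tuple(keyword.casefold() for keyword in REDIRECT_KEYWORDS)


def is_redirect(text):
    """Return True if the page text starts with a redirect keyword."""
    return text.lstrip().casefold().startswith(_FOLDED_KEYWORDS)


samples = [
    "#REDIRECT [[Target]]",
    "#redirect [[Target]]",
    "  #გადამისამართება [[Target]]",
    "Plain article text",
]
flags = [is_redirect(sample) for sample in samples]
print(flags)
```

On the DataFrame, this could be applied with `kawiki['text'].apply(is_redirect)` in place of the two `startswith` calls.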

In [14]:
kawiki

Unnamed: 0,id,ts,title,author,text,sha1,isRedirection
0,2,2004-01-26T08:48:14Z,ვიკიპედია:Orphaned articles,204.95.98.251,"<a href=""/wiki/Main_Page"" class='internal' tit...",ptlg4dsza52o10rnmk59f282y0fptzi,False
1,3,2004-01-26T00:26:51Z,ვიკიპედია:Most wanted articles,204.95.98.251,,pezx7385o3910tfwsxb3suac2dw5g9i,False
2,4,2004-01-26T00:38:52Z,ვიკიპედია:Short articles,204.95.98.251,"<a href=""/wiki/Main_Page"" class='internal' tit...",84gitu5sgfaec85yfccjzvl2kfafvn1,False
3,5,2004-01-26T00:35:44Z,ვიკიპედია:Long articles,204.95.98.251,"<a href=""/wiki/Main_Page"" class='internal' tit...",84gitu5sgfaec85yfccjzvl2kfafvn1,False
4,7,2004-11-18T15:15:52Z,მედიავიკი:Category,Malafaya,კატეგორია,3k5d8zjv7h101jficinh3d2x5jjwtpw,False
...,...,...,...,...,...,...,...
320740,532938,2022-05-20T04:35:44Z,ვახტანგ რობაქიძე,Jaba1977,ვახტანგ რობაქიძე (მოსამართლეობის კანდიდატი) და...,iavvdscaounfaz6sqowhc2ogsvp4u0l,False
320741,532939,2022-05-20T08:54:31Z,ნინა კარპაჩოვა,Ekkatterrinna,\n\nნინა კარპაჩოვა (უკრ. Ні́на Іва́нівна Карпа...,m955153juz99zg6n3e4pphtna8s6f8t,False
320742,532941,2022-05-19T23:20:41Z,Los Williames,185.225.28.38,220px,dagmtfb81asi3gnehcjfw5l1fhj4f0i,False
320743,532947,2022-05-20T08:27:24Z,ჰაილბრონის აჩრდილი,Lisztomaniac,\nthumb|right|250px|მემორიალური ნიშნული ოფიცერ...,6sgtfdyggb9svx8whe22owjhwkm8fia,False


In [13]:
# index=False keeps the DataFrame index out of the CSV file.
kawiki.to_csv("kawiki.csv", index=False, encoding='utf-8')