# Czech Parliament Meetings (CPM) transcriptions - data preprocessing
The CPM web site contains a large number of transcripts from all meetings. The ones we are interested in are [here](http://www.psp.cz/eknih/2013ps/stenprot/index.htm). Fortunately, we don't have to download the individual texts one by one, because the transcripts are also available in compressed form [here](http://www.psp.cz/eknih/2013ps/stenprot/zip/index.htm).

The workflow is: download the ZIP archives, extract them, and parse the extracted HTML into JSON.

In [1]:
import urllib.request as ur
import urllib.parse as up
import urllib
import lxml.html
import os.path
import zipfile
import glob
from collections import OrderedDict
import json
from collections import Counter
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import string
import re

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
base = 'http://www.psp.cz/eknih/'
zp_folder = './zip/'
zp_ext = '.zip'
st_folder = './html/'
js_folder = 'json/'

Download the page that lists all electoral terms and collect the links to the meetings of the Chamber of Deputies ("Poslanecká sněmovna").

In [3]:
ht = lxml.html.parse(base).getroot()

In [4]:
links = []

libs = ht.cssselect('div#main-content')[0]
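for elem_a in libs.cssselect('a'):
    # Look for the <b> inside this particular link, so labels and hrefs cannot get misaligned
    elem_b = elem_a.find('.//b')
    if elem_b is not None and elem_b.text and elem_b.text.endswith('Poslanecká sněmovna'):
        links.append(up.urljoin(base, elem_a.attrib['href']))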

print(links)

['http://www.psp.cz/eknih/2017ps/index.htm', 'http://www.psp.cz/eknih/2013ps/index.htm', 'http://www.psp.cz/eknih/2010ps/index.htm', 'http://www.psp.cz/eknih/2006ps/index.htm', 'http://www.psp.cz/eknih/2002ps/index.htm', 'http://www.psp.cz/eknih/1998ps/index.htm', 'http://www.psp.cz/eknih/1996ps/index.htm']


Extract the links to the compressed transcripts from the index page of each term. Only five of the seven terms offer them.

In [5]:
comp_links = []

for ln in links:
    page = lxml.html.parse(ln).getroot().cssselect('div#main-content')[0]
    for elem_a in page.cssselect('a'):
        if elem_a.text:
            if 'komprimované' in elem_a.text:
                comp_links.append(up.urljoin(ln, elem_a.attrib['href']))

print(comp_links)

['http://www.psp.cz/eknih/2017ps/stenprot/zip/', 'http://www.psp.cz/eknih/2013ps/stenprot/zip/index.htm', 'http://www.psp.cz/eknih/2010ps/stenprot/zip/index.htm', 'http://www.psp.cz/eknih/2006ps/stenprot/zip/index.htm', 'http://www.psp.cz/eknih/2002ps/stenprot/zip/index.htm']
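Both cells above rely on `urljoin` to turn the relative `href` values into absolute URLs. The sketch below shows how the resolution behaves for the two kinds of base URL that appear in the output (one ending in `index.htm`, one ending in `/`). The relative paths used here are assumptions chosen for illustration, not values read from the site.

```python
from urllib.parse import urljoin

# Base URL that ends in a file name: the last segment is replaced
index_page = 'http://www.psp.cz/eknih/2013ps/index.htm'
zip_page = urljoin(index_page, 'stenprot/zip/index.htm')  # assumed relative href
archive = urljoin(zip_page, '059schuz.zip')

# Base URL that ends in a slash: the relative path is appended
dir_page = 'http://www.psp.cz/eknih/2017ps/stenprot/zip/'
archive_from_dir = urljoin(dir_page, '037schuz.zip')

print(zip_page)
print(archive)
print(archive_from_dir)
```

Because the last path segment of the base is dropped unless the base ends in `/`, the same call works for both styles of index link.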


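Each of these pages lists several archives, so we fetch the page, collect every link in it, and download the files. Files that already exist locally are skipped, and each file name is prefixed with the index of the page it came from.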

In [6]:
if not os.path.exists('zip'): os.mkdir('zip')
page_num = 0
for comp_link in comp_links:
    page = lxml.html.parse(comp_link).getroot()
    for ln in page.cssselect('div#main-content a'):
        ln_attr_href = ln.attrib['href'].split('/')[-1]
        target = os.path.join(zp_folder, str(page_num) + '_' + ln_attr_href)
        if os.path.isfile(target): continue
        try:
            print('Retrieving ' + up.urljoin(comp_link, ln_attr_href) + ' ', end='')
            ur.urlretrieve(up.urljoin(comp_link, ln_attr_href), target)
            print('done.')
        except urllib.error.HTTPError as err:
            print('failed: {}'.format(err))
    page_num += 1

Retrieving http://www.psp.cz/eknih/2017ps/stenprot/zip/037schuz.zip done.
Retrieving http://www.psp.cz/eknih/2013ps/stenprot/zip/059schuz.zip failed: HTTP Error 404: Not Found
Retrieving http://www.psp.cz/eknih/2013ps/stenprot/zip/060schuz.zip failed: HTTP Error 404: Not Found
Retrieving http://www.psp.cz/eknih/2013ps/stenprot/zip/061schuz.zip failed: HTTP Error 404: Not Found


Extracting the archives

In [7]:
import shutil

if not os.path.exists('html'): os.mkdir('html')

for arch in os.listdir(zp_folder):
    if arch.endswith(zp_ext):
        file_name = os.path.join(zp_folder, arch)
        target = os.path.join(st_folder, arch.split('.')[0])

        if os.path.exists(target): continue

        print('Extracting ' + file_name + ' to ' + target + ' ', end='')
        try:
            with zipfile.ZipFile(file_name) as zpf:
                zpf.extractall(target)
            print('done.')
        except (OSError, zipfile.BadZipFile) as err:
            # Remove a partially extracted folder; os.rmdir would fail on a non-empty one
            shutil.rmtree(target, ignore_errors=True)
            print('failed: {}'.format(err))

Extracting ./zip/2_042schuz.zip to ./html/2_042schuz failed: [Errno 22] Invalid argument
Extracting ./zip/0_037schuz.zip to ./html/0_037schuz done.
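The extraction step can be tried in isolation on throwaway files. The following is a minimal sketch, not the notebook's own code: `extract_archive` is a hypothetical helper, and the archive and file names are made up to resemble the real ones. It shows one way to make sure a failed extraction leaves no half-filled folder behind, so that a later run can retry it.

```python
import os
import shutil
import tempfile
import zipfile

def extract_archive(zip_path, target_root):
    """Extract zip_path into target_root/<archive name without extension>.

    Returns the target directory, or None if the extraction failed.
    """
    name = os.path.splitext(os.path.basename(zip_path))[0]
    target = os.path.join(target_root, name)
    try:
        with zipfile.ZipFile(zip_path) as zpf:
            zpf.extractall(target)
    except (OSError, zipfile.BadZipFile):
        # Remove whatever was extracted so that a later run can retry cleanly
        shutil.rmtree(target, ignore_errors=True)
        return None
    return target

with tempfile.TemporaryDirectory() as tmp:
    # A valid archive with one (made-up) transcript page
    good = os.path.join(tmp, '0_037schuz.zip')
    with zipfile.ZipFile(good, 'w') as z:
        z.writestr('s037001.htm', '<p>test</p>')

    # A file that only pretends to be an archive
    bad = os.path.join(tmp, '2_042schuz.zip')
    with open(bad, 'wb') as f:
        f.write(b'not a zip archive')

    out_root = os.path.join(tmp, 'html')
    good_target = extract_archive(good, out_root)
    bad_target = extract_archive(bad, out_root)

    good_ok = good_target is not None and os.path.isfile(
        os.path.join(good_target, 's037001.htm'))
    bad_dir_exists = os.path.exists(os.path.join(out_root, '2_042schuz'))
    print(good_ok, bad_target, bad_dir_exists)
```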


Conversion to text. A few notes:

- Each paragraph either continues the current speech or begins a new one.
- A new speech begins with a link to the speaker's profile. Relying on the link alone is risky, however, because older transcripts also contain links to votes.
- A speech may run past the end of one file and continue in the next. This doesn't matter, since one meeting is treated as a single flow of text regardless of file boundaries.
- Speaker names include their (political) position. This is a problem because one person can appear under multiple names. It is "solved" with a list of possible name prefixes (role, position, term of address, etc.).
- EDIT: As of October 2019, the prefix removal works well for all entries. A problem will occur if a politician's surname is "Poslanec" (Member of Parliament) or "Senátor" (Senator).

In [8]:
if not os.path.exists(js_folder): os.mkdir(js_folder)
html_files = glob.glob('./html/*')

In [9]:
poz = 'Poslanec; PSP; Paní; Senátorka; Senátor; \
Poslankyně; mužů; Předsedající; práv; republiky; financí; prostředí; věcí; ČR'.split('; ')

def rm_position(name):    
    for p in poz:
        if p in name:
            name = name[name.rindex(p) + len(p) + 1:]
            
    return name
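To make the behaviour of the prefix removal concrete, here is a self-contained copy of the function applied to a few inputs. The input strings are illustrative assumptions about how names might appear in the transcripts, and the last one is a hypothetical speaker whose surname is itself a prefix, which is the failure case noted above.

```python
# Same prefix list as in the cell above, written out as a list literal
PREFIXES = ['Poslanec', 'PSP', 'Paní', 'Senátorka', 'Senátor', 'Poslankyně',
            'mužů', 'Předsedající', 'práv', 'republiky', 'financí',
            'prostředí', 'věcí', 'ČR']

def rm_position(name):
    # Cut everything up to and including the last occurrence of each prefix
    for p in PREFIXES:
        if p in name:
            name = name[name.rindex(p) + len(p) + 1:]
    return name

plain = rm_position('Poslanec Vojtěch Filip')
with_role = rm_position('Místopředseda PSP Petr Gazdík')
untouched = rm_position('Jan Bartošek')
# Hypothetical surname that equals a prefix: the whole name is cut away
clash = rm_position('Poslanec Jan Poslanec')

print(repr(plain), repr(with_role), repr(untouched), repr(clash))
```

Since the function cuts at the last occurrence of a prefix, a surname that matches one leaves an empty string rather than the name.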

In [10]:
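def write_json(nm, dt):
    fn = os.path.join(js_folder, '%s.json' % nm)

    # ensure_ascii=False writes Czech characters verbatim, so set the encoding explicitly
    with open(fn, 'w', encoding='utf-8') as f:
        json.dump(dt, f, ensure_ascii=False, indent=2)

pid = 0      # running id, assigned when a speech is stored
tema = None  # the topic is not extracted; the field stays empty
for htmlf in html_files:
    name = os.path.basename(htmlf)  # e.g. '2_042schuz'
    # The meeting number follows the page prefix. Taking name[:3] would be
    # wrong: Python 3.6+ parses int('2_0') as 20.
    schuze = int(name.split('_', 1)[1][:3])
    aut = None
    buf = []   # paragraphs of the speech being collected
    buff = []  # finished speeches of this meeting
    # glob returns files in arbitrary order; sort them so that speeches spanning
    # several files stay in order (assumes the names sort chronologically)
    for fn in sorted(glob.glob(os.path.join(htmlf, 's*.htm'))):
        h = lxml.html.parse(fn).getroot()
        for p in h.cssselect('p'):
            pt = p.text_content().strip()
            if len(pt) == 0: continue
            pt = pt.replace('\xa0', ' ')

            od = p.find('a')  # a link in the paragraph marks a new speaker
            if od is None:
                buf += [pt]
                continue

            if len(buf) > 0:
                buff.append(OrderedDict(id=pid, autor=aut, schuze=schuze,
                                        fn=fn, tema=tema, text='\n'.join(buf)))
                pid += 1

            aut = rm_position(od.text.strip())
            buf = [pt[len(od.text)+1:].strip()]  # keep the current text, minus the speaker's name
    if len(buf) > 0:  # flush the last speech so it does not leak into the next meeting
        buff.append(OrderedDict(id=pid, autor=aut, schuze=schuze,
                                fn=fn, tema=tema, text='\n'.join(buf)))
        pid += 1
    write_json(name, buff)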

print(htmlf)

./html/2_010schuz
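The grouping of paragraphs into speeches can be illustrated without any HTML. In the sketch below each paragraph is reduced to a `(speaker, text)` pair, where `speaker` is `None` for a paragraph that continues the current speech. `group_speeches`, the speaker labels, and the texts are all made up for the example. Note the explicit flush after the loop: without it, the speech that is still open when the input ends would be lost.

```python
from collections import OrderedDict

def group_speeches(paragraphs):
    """Group (speaker, text) pairs into one record per speech."""
    speeches = []
    author, buf = None, []
    for speaker, text in paragraphs:
        if speaker is None:
            # No speaker link: the paragraph continues the current speech
            buf.append(text)
            continue
        if buf:
            # A new speaker starts, so store the speech collected so far
            speeches.append(OrderedDict(autor=author, text='\n'.join(buf)))
        author, buf = speaker, [text]
    if buf:
        # Flush the speech that was still open at the end of the input
        speeches.append(OrderedDict(autor=author, text='\n'.join(buf)))
    return speeches

paragraphs = [
    ('Speaker A', 'First paragraph.'),
    (None, 'Second paragraph.'),
    ('Speaker B', 'A reply.'),
]
speeches = group_speeches(paragraphs)
for s in speeches:
    print(s['autor'], repr(s['text']))
```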


## Name prefix removal check

In [11]:
json_files = glob.glob('json/*.json')

In [12]:
auts = []
for fn in json_files:
    with open(fn, encoding='utf-8') as f:
        dt = json.load(f)
    
    for el in dt:
        if el['autor'] is not None:
            auts.append(el['autor'])

The most frequent speakers

In [13]:
Counter(auts).most_common()[:3]

[('Vojtěch Filip', 12595), ('Petr Gazdík', 8884), ('Jan Bartošek', 7854)]

Names with more than two words

In [14]:
[j for j in set(auts) if len(j.split(' ')) > 2]

['Augustin Karel Andrle Sylor',
 'Markéta Pekarová Adamová',
 'Jaroslava Pokorná Jermanová',
 'Zuzana Majerová Zahradníková',
 'Jana Mračková Vildumetzová',
 'Tomáš Jan Podivínský',
 'Zuzka Bebarová Rujbrová',
 'Hana Aulická Jírovcová']

## Sentence tokenization

Take the text of every speech in the JSON files and split it into sentences.

Lower-case each sentence, remove punctuation (the ASCII set in `string.punctuation`), and strip leading/trailing whitespace. Write the result to a file where each line is one sentence.

Sentences containing numeric characters are excluded.

In [15]:
def hasNumbers(inputString):
    return bool(re.search(r'\d', inputString))

In [16]:
json_files = glob.glob('json/*.json')
translator = str.maketrans('', '', string.punctuation)

# The context manager closes the output file even if a JSON file fails to load
with open('./vocabulary.txt', 'w', encoding='utf-8') as out:
    for fn in json_files:
        with open(fn, encoding='utf-8') as f:
            jsf = json.load(f)

        for el in jsf:
            # The transcripts are Czech, so use the Czech Punkt model instead of the English default
            for s in sent_tokenize(el['text'], language='czech'):
                processed_str = s.lower().translate(translator).strip()
                if processed_str and not hasNumbers(processed_str):
                    out.write(processed_str + '\n')
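The per-sentence cleaning can be checked on a handful of strings without NLTK. The sketch below applies the same three steps (lower-casing, punctuation removal, digit filter) to made-up example sentences. `normalize` is a hypothetical helper name. One thing it makes visible: `string.punctuation` covers ASCII characters only, so Czech typographic quotes such as „ and “ are kept.

```python
import re
import string

translator = str.maketrans('', '', string.punctuation)

def normalize(sentence):
    """Return the cleaned sentence, or None if it should be skipped."""
    cleaned = sentence.lower().translate(translator).strip()
    if not cleaned or re.search(r'\d', cleaned):
        return None  # empty after cleaning, or contains a digit
    return cleaned

examples = [
    'Vážené kolegyně, vážení kolegové!',
    'Hlasování číslo 12.',
    '„Děkuji.“',
    '...',
]
results = [normalize(s) for s in examples]
for src, res in zip(examples, results):
    print(repr(src), '->', repr(res))
```

If the typographic quotes should go as well, they would have to be added to the translation table explicitly.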