# What needs doing:

* Count all words and sentences from all terms ✓
* Get term limits from wikipedia ✓
* Count occurrences of body, desc, kinesic... items ✓

# Questions for Tomaž/Nikola:

* How is validation run?
* Do we need to follow the per-term directory and file structure (e.g. ParlaSpeech-HR) or can we preserve the current structure?
* How are relations supposed to be encoded? The coalition relation lists its parties in a `mutual` attribute, while the opposition relation lists them in `active` and also carries a `passive` attribute pointing to `#government.HR`.
* I don't have the education title (e.g. `<education n="6">stručni prvostupnik sestrinstva</education>`), only years of education.
* I don't have the exact term dates; I can only estimate them from the dates of the speeches.
* I get different data for coalition-opposition formations than in the existing main TEI files for term 9.
* I need the `affiliation` roles, but they are not in the MP datasheet.

Tomaž:
* Education titles and `n` parameter?
* How to date ongoing terms?

In [39]:
from typing import Union, List
from pathlib import Path
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

mpdf = pd.read_pickle("mpdf_corrected")
partiesdf = pd.read_pickle("partiesdf_corrected")
termdata = pd.read_csv("../terms.csv").set_index("Term")

def transform_abbreviation(s: str) -> str:
    return f"party.{s.replace(' ', '_').upper().replace('!', '')}"
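
For reference, a short sketch of what `transform_abbreviation` produces for a few of the party labels found in `partiesdf.party`. The function is repeated only so the snippet runs on its own; the list `examples` is just an illustrative selection.

```python
def transform_abbreviation(s: str) -> str:
    # Same logic as above: spaces become underscores, upper-case, drop "!".
    return f"party.{s.replace(' ', '_').upper().replace('!', '')}"

examples = ["HDZ", "HSP AS", "Naprijed Hrvatska!", "Živi zid"]
party_ids = {name: transform_abbreviation(name) for name in examples}

for name, party_id in party_ids.items():
    print(f"{name!r} -> {party_id!r}")
```

Only spaces and `!` are handled, so any other punctuation in a future party label would pass through unchanged into the `xml:id`.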


In [40]:
# set(mpdf.party.unique()) - set(partiesdf.party.unique())
mpdf.party[~mpdf.party.isin(partiesdf.party)].unique()

array(['nezavisni', 'HKS', 'HS'], dtype=object)

# Preparing persons:

In [41]:
from xml.etree.ElementTree import Element, SubElement, tostring, XML, parse
from xml.dom import minidom
def pretty_print(s) -> None:
    print(minidom.parseString(tostring(s).decode("utf-8")).toprettyxml("\t"))
def pretty_string(s) -> str:
    return minidom.parseString(tostring(s).decode("utf-8")).toprettyxml("\t")

In [42]:
partiesdf.party.unique()

array(['DC', 'HDSSB', 'HDZ', 'HNS', 'HSLS', 'HSP', 'HSS', 'HSU', 'IDS',
       'LIBRA', 'LS', 'MDS', 'PGS', 'SBHS', 'SDAH', 'SDP', 'SDSS',
       'Hrvatski laburisti', 'HSD', 'BDSH', 'HGS', 'HSP AS', 'ID',
       'Naprijed Hrvatska!', 'Novi val', 'ORaH', 'Reformisti', 'BM365',
       'BUZ', 'HDS', 'HRAST', 'HRID', 'MOST', 'NLSP', 'PH', 'Živi zid',
       'GLAS', 'SMSH', 'SNAGA', 'DP', 'ZK', 'SIP', 'Pametno', 'Fokus'],
      dtype=object)

In [43]:
termdata

Term,From,To
5,2003-12-22,2007-10-12
6,2008-01-11,2011-10-28
7,2011-12-22,2015-09-25
8,2015-12-03,2016-07-15
9,2016-10-14,2022-05-18
10,2022-07-22,ongoing


# Terms (line 438):

path: `/teiCorpus/teiHeader/profileDesc/particDesc/listOrg/org[2]/listEvent`

In [44]:
listEvent = Element("listEvent")
head = SubElement(listEvent, "head")
head.set("xml:lang", "hr")
head.text = "Mandatno razdoblje"
head = SubElement(listEvent, "head")
head.set("xml:lang", "en")
head.text = "Legislative period"

for i, row in termdata.iterrows():
    event = SubElement(listEvent, "event")
    event.set("xml:id", f"HS.{i}")
    event.set("from", row["From"])
    event.set("to", row["To"])
    
    label = SubElement(event, "label")
    label.set("xml:lang", "hr")
    label.text = f"{i}. mandat"
    label = SubElement(event, "label")
    label.set("xml:lang", "en")
    label.text = f"Term {i}"
listEvent_str = pretty_string(listEvent)
listEvent_str = "\n".join(listEvent_str.split("\n")[1:])
# print(listEvent_str)

# Parties: line 447

path:
`/teiCorpus/teiHeader/profileDesc/particDesc/listOrg/org[3:]`

In [45]:
"Hrvatski laburisti" in mpdf.party

False
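
A caveat on the check above: `in` applied to a pandas Series tests the index labels, not the values, so `False` here does not by itself show that the party is absent from `mpdf`. A minimal sketch on toy data (the Series below is made up for illustration) showing the difference and two value-based alternatives:

```python
import pandas as pd

# Toy stand-in for mpdf.party; the real column is not reproduced here.
party = pd.Series(["SDP", "Hrvatski laburisti", "HDZ"])

in_index = "Hrvatski laburisti" in party              # checks index labels 0, 1, 2
in_values = "Hrvatski laburisti" in party.to_numpy()  # checks the stored values
via_eq = bool((party == "Hrvatski laburisti").any())  # element-wise comparison

print(in_index, in_values, via_eq)
```

Either of the last two forms answers the question that was actually being asked.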

In [46]:
partiesdf["id"] = partiesdf.party.apply(transform_abbreviation)

partydata = partiesdf["party full_name id".split()].drop_duplicates()

orgs = []
for i, row in partydata.iterrows():
    org = Element("org")
    org.set("xml:id", row["id"])
    org.set("role", "parliamentaryGroup")
    orgName = SubElement(org, "orgName")
    orgName.set("full", "yes")
    orgName.set("xml:lang", "hr")
    orgName.text = row["full_name"]
    
    orgName = SubElement(org, "orgName")
    orgName.set("full", "abb")
    orgName.text = row["party"]
    orgs.append(org)
    
# Add nezavisni:
org = Element("org")
org.set("xml:id", "party.NEZAVISNI")
org.set("role", "parliamentaryGroup")
orgName = SubElement(org, "orgName")
orgName.set("full", "yes")
orgName.set("xml:lang", "hr")
orgName.text = "nezavisni"
orgName = SubElement(org, "orgName")
orgName.set("full", "abb")
orgName.text = "nezavisni"
orgs.append(org)

# Add Hrvatski suverenisti:
org = Element("org")
org.set("xml:id", "party.HS")
org.set("role", "parliamentaryGroup")
orgName = SubElement(org, "orgName")
orgName.set("full", "yes")
orgName.set("xml:lang", "hr")
orgName.text = "Hrvatski Suverenisti"
orgName = SubElement(org, "orgName")
orgName.set("full", "abb")
orgName.text = "HS"
orgs.append(org)

# Add Hrvatska konzervativna stranka
org = Element("org")
org.set("xml:id", "party.HKS")
org.set("role", "parliamentaryGroup")
orgName = SubElement(org, "orgName")
orgName.set("full", "yes")
orgName.set("xml:lang", "hr")
orgName.text = "Hrvatska konzervativna stranka"
orgName = SubElement(org, "orgName")
orgName.set("full", "abb")
orgName.text = "HKS"
orgs.append(org)


orgs = [pretty_string(i) for i in orgs]

orgstring = "\n".join(
    ["\n".join(org.split("\n")[1:]) for org in orgs]
)
# print(orgstring)

# People: line 525

path: `/teiCorpus/teiHeader/profileDesc/particDesc/listPerson`

In [47]:
for name in mpdf.fullname.unique():
    if mpdf.loc[mpdf.fullname==name, "codemp"].unique().shape[0] != 1:
        print(name, "\n", mpdf.loc[mpdf.fullname==name, "codemp"])

Vrbat Grgić, Tanja 
 334    M581
524    M582
725    M582
Name: codemp, dtype: object
Lovrić Merzel, Marina 
 376    M393
461    M394
Name: codemp, dtype: object


In [48]:
listPerson = []
codemps = mpdf.codemp.drop_duplicates()
n = 0
for c in codemps:
    subset = mpdf.query(f"codemp=='{c}'").reset_index(drop=True)
    surname = subset.lastname[0]
    forename = subset.firstname[0]
    sex = "M" if subset.gender[0] == 0 else "F"
    birth = str(subset.date_of_birth[0])
    birth = f"{birth[0:4]}-{birth[4:6]}-{birth[-2:]}"
    reference_name = f"{surname.replace(' ', '')}{forename.replace(' ', '')}"
    
    person = Element("person")
    person.set("xml:id", reference_name)
    person.set("n", str(n))
    n += 1
    persName = SubElement(person, "persName")
    
    buf = SubElement(persName, "surname")
    buf.text = surname
    
    buf = SubElement(persName, "forename")
    buf.text = forename
    
    buf = SubElement(person, "sex")
    buf.set("value", sex)
    
    buf = SubElement(person, "birth")
    buf.set("when", birth)
    

    for _, row in subset.drop_duplicates().iterrows():
        ana = SubElement(person, "affiliation")
        ana.set("role", "member")
        ana.set("ref", f"#{transform_abbreviation(row['party'])}")
        
        rolename = SubElement(ana, "roleName")
        rolename.set("xml:lang", "en")
        rolename.text = "Member"

        ana = SubElement(person, "affiliation")
        ana.set("role", "member")
        ana.set("ref", "#HS")
        ana.set("ana", f"#HS.{row['term2']}")
        rolename = SubElement(ana, "roleName")
        rolename.set("xml:lang", "en")
        rolename.text = "MP"
    listPerson.append(person)


In [49]:
listPerson = ["\n".join(pretty_string(p).split("\n")[1:]) for p in listPerson]
listPerson = "\n".join(listPerson)


In [50]:
path_to_term_tmxs = Path("../T/")
assert path_to_term_tmxs.exists()


xiincludes = ""
for file in sorted(list(path_to_term_tmxs.glob("ParlaMint-HR_T*.xml"))):
    xiincludes += f"""    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="{file.name}"/>\n"""


# Relations

path: `/teiCorpus/teiHeader/profileDesc/particDesc/listOrg/listRelation`

In [51]:
gb = partiesdf.groupby(["term2", "coalition"]).agg(
    {
        # sorted() keeps the order of the party references stable between runs
        "id": lambda l: " ".join(sorted({"#" + i for i in l}))
    }
)
terms = partiesdf.term2.unique()
listRelation = ""
for term in terms:
    fromto = termdata.loc[term, :]
    ind = (term, 1)
    coalition = gb.loc[ind].id
    ind = (term, 0)
    opposition = gb.loc[ind].id
    listRelation += f"""
    <relation name="coalition"
            mutual="{coalition}"
            from="{fromto.From}"
            to="{fromto.To}"
            ana="#HS.{term}"/>
    <relation name="opposition"
            active="{opposition}"
            passive="#government.HR"
            from="{fromto.From}"
            to="{fromto.To}"
            ana="#HS.{term}"/>\n"""    

# Counting extents and tagUsages

In [52]:
from typing import Union
from pathlib import Path
def get_extent(path: Union[str, Path]):
    if isinstance(path, Path):
        assert path.exists()
        path = str(path)
    namespace= {
        "n": "http://www.tei-c.org/ns/1.0",
        
    }
    termtei = parse(path)
    measures = termtei.getroot().findall(".//n:measure", namespaces=namespace)
    for i in measures:
        if i.get("unit") == "speeches":
            speeches = i.get("quantity")
        if i.get("unit") == "words":
            words = i.get("quantity")
    return {"words": int(words), "speeches": int(speeches)}

def get_tagusage(path: Union[str, Path]):
    if isinstance(path, Path):
        assert path.exists()
        path = str(path)
    namespace = {"n": "http://www.tei-c.org/ns/1.0"}
    termtei = parse(path)
    tagusages = termtei.getroot().findall(".//n:tagUsage", namespaces=namespace)
    # Map each element name (gi) to its occurrence count (still a string here).
    return {i.get("gi"): i.get("occurs") for i in tagusages}

path_to_term_tmxs = Path("../T/")
assert path_to_term_tmxs.exists()


extents = []
tagusages = []
for file in sorted(list(path_to_term_tmxs.glob("ParlaMint-HR_T*.xml"))):
    extents.append(get_extent(file))
    tagusages.append(get_tagusage(file))

tagusage = ""
tus = pd.DataFrame(data=tagusages).astype(int)
for gi, occurs in tus.sum().items():
    tagusage += f"""<tagUsage gi="{gi}" occurs="{int(occurs)}"/>\n"""
extent_count = pd.DataFrame(data=extents).astype(int).sum()
words = extent_count["words"]
speeches = extent_count["speeches"]
extent = f"""<measure unit="speeches" quantity="{speeches}" xml:lang="hr">{f'{speeches:,d}'.replace(',','.')} govora</measure>
<measure unit="speeches" quantity="{speeches}" xml:lang="en">{speeches:,d} speeches</measure>
<measure unit="words" quantity="{words}" xml:lang="hr">{f'{words:,d}'.replace(',','.')} riječi</measure>
<measure unit="words" quantity="{words}" xml:lang="en">{words:,d} words</measure>
"""

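The Croatian `measure` lines use a period as the thousands separator, obtained by formatting with `,` and then swapping the character. A small sketch that isolates this step; `format_hr` is a hypothetical helper name, not something defined in this notebook:

```python
def format_hr(n: int) -> str:
    # Format with "," as the thousands separator, then swap it for ".".
    return f"{n:,d}".replace(",", ".")

def format_en(n: int) -> str:
    # English convention: keep "," as the thousands separator.
    return f"{n:,d}"

sample = 1234567  # made-up count, for illustration only
hr_text = f"{format_hr(sample)} riječi"
en_text = f"{format_en(sample)} words"
print(hr_text)
print(en_text)
```

Factoring the conversion out would also shorten the nested f-strings in the `extent` template.
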
# Putting it all together

In [53]:
with open("../T/ParlaMint-HR_template2.xml") as f:
    content = f.read()
from string import Template
template = Template(content)
filled_in = template.substitute(
                    listEvent=listEvent_str,
                    orgs=orgstring,
                    listPerson = listPerson,
                    xiincludes = xiincludes,
                    listRelation = listRelation,
                    extent = extent,
                    tagusage = tagusage
                    )

with open("../T/ParlaMint-HR.xml", "w") as f:
    f.write(
        filled_in
    )
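
`Template.substitute` replaces `$name` placeholders and raises `KeyError` when a placeholder has no value, which helps catch a section that was forgotten. A toy sketch of that behaviour follows. The template text is invented, since the actual `ParlaMint-HR_template2.xml` is not shown here; it is only assumed to contain placeholders such as `$listEvent` and `$extent`.

```python
from string import Template

# Invented miniature template; the real file is assumed to look similar.
toy = Template("<particDesc>\n$listEvent\n</particDesc>\n<extent>\n$extent\n</extent>")

filled = toy.substitute(listEvent="<listEvent/>", extent="<measure/>")

# Leaving out a value makes substitute() fail loudly instead of writing a broken file.
try:
    toy.substitute(listEvent="<listEvent/>")
    missing = None
except KeyError as err:
    missing = err.args[0]

print(filled)
print("missing placeholder:", missing)
```

A literal `$` in the template file has to be written as `$$`, otherwise `substitute` treats it as the start of a placeholder.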

In [54]:
%%bash
pwd

/home/rupnik/parlamint/task15


In [55]:
%%bash

cp ../T/ParlaMint-HR_T*.xml ../ParlaMint/Data/ParlaMint-HR/
cp ../T/ParlaMint-HR.xml ../ParlaMint/Data/ParlaMint-HR/

cd ../ParlaMint
make add-common-content-HR
cd Data/ParlaMint-HR/
cp add-common-content/* .
rm -r add-common-content

rm -rf Data/ParlaMint-HR/add-common-content
mkdir -p Data/ParlaMint-HR/add-common-content
java -jar /usr/share/java/saxon.jar outDir=Data/ParlaMint-HR/add-common-content \
   -xsl:Scripts/parlamint-add-common-content.xsl \
   Data/ParlaMint-HR/ParlaMint-HR.ana.xml || :
java -jar /usr/share/java/saxon.jar outDir=Data/ParlaMint-HR/add-common-content \
   anaDir=`pwd`/Data/ParlaMint-HR/add-common-content/ParlaMint-HR \
   -xsl:Scripts/parlamint-add-common-content.xsl \
   Data/ParlaMint-HR/ParlaMint-HR.xml || :


Source file Data/ParlaMint-HR/ParlaMint-HR.ana.xml does not exist
ERROR : cannot locate .ana file /home/rupnik/parlamint/ParlaMint/Data/ParlaMint-HR/add-common-content/ParlaMint-HR/ParlaMint-HR_T05.ana.xml
ERROR : cannot locate .ana file /home/rupnik/parlamint/ParlaMint/Data/ParlaMint-HR/add-common-content/ParlaMint-HR/ParlaMint-HR_T06.ana.xml
ERROR : cannot locate .ana file /home/rupnik/parlamint/ParlaMint/Data/ParlaMint-HR/add-common-content/ParlaMint-HR/ParlaMint-HR_T07.ana.xml
ERROR : cannot locate .ana file /home/rupnik/parlamint/ParlaMint/Data/ParlaMint-HR/add-common-content/ParlaMint-HR/ParlaMint-HR_T08.ana.xml
ERROR : cannot locate .ana file /home/rupnik/parlamint/ParlaMint/Data/ParlaMint-HR/add-common-content/ParlaMint-HR/ParlaMint-HR_T09.ana.xml
ERROR : cannot locate .ana file /home/rupnik/parlamint/ParlaMint/Data/ParlaMint-HR/add-common-content/ParlaMint-HR/ParlaMint-HR_T10.ana.xml
INFO: Starting to process ParlaMint-HR
INFO: Processing ParlaMint-HR_T05.xml
INFO ParlaMint-HR

In [56]:
%%bash
pwd

/home/rupnik/parlamint/task15


In [57]:
%%bash
cd ../ParlaMint
make validate-parlamint-HR

Scripts/validate-parlamint.pl Schema 'Data/ParlaMint-HR'


INFO: Validating directory /home/rupnik/parlamint/ParlaMint/Data/ParlaMint-HR
INFO: Validating TEI root /home/rupnik/parlamint/ParlaMint/Data/ParlaMint-HR/ParlaMint-HR.xml
INFO: XML validation for ParlaMint-HR.xml
/home/rupnik/parlamint/ParlaMint/Data/ParlaMint-HR/ParlaMint-HR.xml:473:55: error: value of attribute "to" is invalid; must be an ISO year and month, must be an ISO date, must be an ISO date and time or must be a year
/home/rupnik/parlamint/ParlaMint/Data/ParlaMint-HR/ParlaMint-HR.xml:781:27: error: value of attribute "to" is invalid; must be an ISO year and month, must be an ISO date, must be an ISO date and time or must be a year
/home/rupnik/parlamint/ParlaMint/Data/ParlaMint-HR/ParlaMint-HR.xml:787:27: error: value of attribute "to" is invalid; must be an ISO year and month, must be an ISO date, must be an ISO date and time or must be a year
/home/rupnik/parlamint/ParlaMint/Data/ParlaMint-HR/ParlaMint-HR.xml:7829:45: error: ID "LovrićMerzelMarina" has already been defined

# Notes on the output:

`ParlaMint-HR.xml:473:55: error: value of attribute "to" is invalid; must be an ISO year and month, must be an ISO date, must be an ISO date and time or must be a year`: this is because I set the end of term 10 as ongoing instead of a date. 
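
One way to avoid the invalid value would be to leave out `to` for a term that has not ended. Whether an open-ended `event` is what ParlaMint expects is an assumption and remains a question for Tomaž. A minimal sketch with toy rows that mirror the last two entries of `termdata`:

```python
from xml.etree.ElementTree import Element, SubElement, tostring

# Toy copy of two rows of termdata: term -> (From, To).
terms = {9: ("2016-10-14", "2022-05-18"), 10: ("2022-07-22", "ongoing")}

list_event = Element("listEvent")
for term, (start, end) in terms.items():
    event = SubElement(list_event, "event")
    event.set("xml:id", f"HS.{term}")
    event.set("from", start)
    # Assumption: an unfinished term simply carries no "to" attribute.
    if end != "ongoing":
        event.set("to", end)

xml = tostring(list_event, encoding="unicode")
print(xml)
```

The other two errors of this kind (lines 781 and 787) presumably come from the term 10 `relation` elements, which reuse the same `To` value and would need the same treatment.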

`ParlaMint-HR.xml:595:66: error: value of attribute "xml:id" is invalid; must be an XML name without colons`: ~~Will try to fix by removing punctuations from party references. There aren't any colons in the name, though.~~ Yup, this worked.

`ParlaMint-HR.xml:727:26: error: value of attribute "name" is invalid; must be equal to "coalition", "opposition", "renaming", "representing" or "successor"`: the value was "oposition" [sic]. This has since been corrected.

```
ParlaMint-HR.xml:7829:45: error: ID "LovrićMerzelMarina" has already been defined
ParlaMint-HR.xml:6797:45: error: first occurrence of ID "LovrićMerzelMarina"
ParlaMint-HR.xml:8645:42: error: ID "VrbatGrgićTanja" has already been defined
ParlaMint-HR.xml:6506:42: error: first occurrence of ID "VrbatGrgićTanja"
```
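This stems from the fact that each of these two MPs appears under two different `codemp` values, so the loop over `codemp` creates two `person` elements with the same `xml:id`.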
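
A possible remedy, sketched on toy data, is to build one `person` per individual rather than per `codemp`. The grouping key used here (`fullname` plus `date_of_birth`) is an assumption; it would wrongly merge two distinct MPs who share both.

```python
import pandas as pd

# Toy rows copied from the two affected MPs (only the relevant columns).
toy = pd.DataFrame({
    "codemp": ["M393", "M394", "M581", "M582", "M582"],
    "fullname": ["Lovrić Merzel, Marina", "Lovrić Merzel, Marina",
                 "Vrbat Grgić, Tanja", "Vrbat Grgić, Tanja", "Vrbat Grgić, Tanja"],
    "date_of_birth": [19630902, 19630902, 19790618, 19790618, 19790618],
    "term2": [6, 7, 6, 7, 8],
})

# One entry per person: xml:id -> all terms, regardless of codemp.
people = {}
for (fullname, dob), subset in toy.groupby(["fullname", "date_of_birth"], sort=False):
    surname, forename = fullname.split(", ")
    xml_id = f"{surname}{forename}".replace(" ", "")
    people[xml_id] = sorted(subset.term2.tolist())

print(people)
```

With this grouping each `xml:id` is generated once, and the affiliations from both `codemp` values end up under the same person.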

Errors I should focus on:
`ERROR ParlaMint-HR_T06: ERROR: Can't find local id for u/@who="#ErrorError:row['Speaker_name']='Nepoznat',row['lastname']=nan,row['firstname']=nan"`: Clearly missing data.

Other errors that I do not know how to remedy:

```
ERROR ParlaMint-HR: Duplicate party affiliation for #HS
ERROR ParlaMint-HR: Duplicate party affiliation for #party.SDP
```
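
The cause of these two messages is not confirmed here. One plausible reading is that the person loop emits an undated `affiliation` once per row, so an MP with several terms in the same party gets repeated entries for the same `ref`. Under that assumption, a sketch that dates each affiliation with the term limits and skips exact repeats could look like this (toy rows, `roleName` children omitted for brevity):

```python
from xml.etree.ElementTree import Element, SubElement

# Term limits copied from termdata; rows are toy (party, term2) pairs for one MP.
term_dates = {7: ("2011-12-22", "2015-09-25"), 8: ("2015-12-03", "2016-07-15")}
rows = [("SDP", 7), ("SDP", 8), ("SDP", 8)]

person = Element("person")
seen = set()
for party, term in rows:
    for ref in (f"#party.{party}", "#HS"):
        key = (ref, term)
        if key in seen:  # skip an affiliation that was already written
            continue
        seen.add(key)
        aff = SubElement(person, "affiliation")
        aff.set("role", "member")
        aff.set("ref", ref)
        # Assumption: dating each entry is what makes repeated refs acceptable.
        aff.set("from", term_dates[term][0])
        aff.set("to", term_dates[term][1])
        aff.set("ana", f"#HS.{term}")

affiliations = [(a.get("ref"), a.get("from")) for a in person]
print(affiliations)
```

If the validator still objects, the next thing to compare would be how the existing term 9 root file encodes affiliations that span several terms.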

`ERROR ParlaMint-HR_T05: ERROR: Can't find local id for u/@who="#PalarićAntun"`



In [58]:
mpdf[mpdf.fullname.isin(["Lovrić Merzel, Marina", "Vrbat Grgić, Tanja"])]

Unnamed: 0,codemp,order_id,term1,term2,term_id,type_of_list,fullname,firstname,lastname,party,date_of_birth,year_of_birth,gender,place_of_birth,field_of_study,education_y,constituency,bp_lat,bp_lon
334,M581,535,2007-2011,6,148,normal,"Vrbat Grgić, Tanja",Tanja,Vrbat Grgić,SDP,19790618,1979,1,"Kopar, Slovenia",1,18,8,45.548059,13.730188
376,M393,581,2007-2011,6,194,not_active,"Lovrić Merzel, Marina",Marina,Lovrić Merzel,SDP,19630902,1963,1,Sisak,2,16,6,45.485077,16.373116
461,M394,667,2011-2015,7,82,normal,"Lovrić Merzel, Marina",Marina,Lovrić Merzel,nezavisni,19630902,1963,1,Sisak,2,16,6,45.485077,16.373116
524,M582,730,2011-2015,7,145,normal,"Vrbat Grgić, Tanja",Tanja,Vrbat Grgić,SDP,19790618,1979,1,"Kopar, Slovenia",1,18,8,45.548059,13.730188
725,M582,941,2015-2016,8,143,normal,"Vrbat Grgić, Tanja",Tanja,Vrbat Grgić,SDP,19790618,1979,1,"Kopar, Slovenia",1,18,8,45.548059,13.730188
