# What needs doing:

* Do we have conflicts?
* Count all words and sentences from all terms
* Get min and max date from all terms
* Count occurrences of body, desc, kinesic... items
* Get term dates for `Mandatno obdoblje` sections and all coalition and opposition parties
* Get people info and nest them properly. See below.
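
The tag-counting item above could be sketched roughly as follows for a single TEI document. The `count_tags` helper and the miniature sample string are assumptions made for illustration; the real term files would be read from disk instead.

```python
from collections import Counter
from xml.etree.ElementTree import fromstring

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

def count_tags(xml_text: str) -> Counter:
    """Count element occurrences by local tag name (TEI namespace stripped)."""
    root = fromstring(xml_text)
    return Counter(el.tag.replace(TEI_NS, "") for el in root.iter())

# Hypothetical miniature document, only to exercise the helper.
sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>'
    '<u><seg>Hvala.</seg></u><kinesic><desc>Pljesak</desc></kinesic>'
    '</body></text></TEI>'
)
tag_counts = count_tags(sample)
print(tag_counts["body"], tag_counts["desc"], tag_counts["kinesic"])
```

Summing the per-file counters (`Counter` supports `+`) would give corpus-wide totals.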

# Questions for Tomaž/Nikola:

* How is validation run?
* Do we need to follow the per-term directory and file structure (e.g. ParlaSpeech-HR) or can we preserve the current structure?
* What is the deal with relations? The coalition relation uses the `mutual` field for the party listing, while the opposition relation uses `active` plus a field `passive #government.hr`.
* I don't have the education title (e.g. `<education n="6">stručni prvostupnik sestrinstva</education>`), only years of education.
* I don't have the exact dates of terms. I can only infer them from the rest of the data I have (I estimate it from the dates of speeches.)
* I get different data for coalition-opposition formations than in the existing main TEI files for term 9.
* I need the affiliation>roles, but they are not in the MP datasheet.
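
Estimating term boundaries from speech dates, as mentioned in the list above, might look like the sketch below. The `(term, date)` pairs and the `estimate_term_bounds` helper are hypothetical stand-ins; the real dates would come from the speech files.

```python
from datetime import date

# Hypothetical (term, speech date) pairs, not taken from the actual corpus.
speech_dates = [
    (5, "2003-12-22"), (5, "2005-03-01"), (5, "2007-10-12"),
    (6, "2008-01-11"), (6, "2011-10-28"),
]

def estimate_term_bounds(pairs):
    """Return {term: (first_speech_date, last_speech_date)}."""
    bounds = {}
    for term, iso_day in pairs:
        day = date.fromisoformat(iso_day)
        first, last = bounds.get(term, (day, day))
        bounds[term] = (min(first, day), max(last, day))
    return bounds

term_bounds = estimate_term_bounds(speech_dates)
for term, (first, last) in sorted(term_bounds.items()):
    print(term, first.isoformat(), last.isoformat())
```

This only yields the first and last day on which someone spoke, so the true term can start earlier and end later than the estimate.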

# Resolved:
* ~~Is party `NZ` the same as `nezavisni`?~~ Yes, yes it is.
* ~~How to encode when an MP changes parties?~~ Like so:
```xml
<person xml:id="PernarIvan" n="1302">
                  <persName>
                     <surname>Pernar</surname>
                     <forename>Ivan</forename>
                  </persName>
                  <sex value="M"/>
                  <birth when="1985-10-14"/>
                  <education n="6">stručni prvostupnik sestrinstva</education>
                  <affiliation role="member" ref="#party.Nezavisni">
                     <roleName xml:lang="en">Member</roleName>
                  </affiliation>
                  <affiliation role="member" ref="#HS" ana="#HS.9">
                     <roleName xml:lang="en">MP</roleName>
                  </affiliation>
```
                  
# Meeting notes:
* Try running the perl scripts for validation
* Fill the extent fields in the main TEI.
* Term dates: to be corrected from Wikipedia data; construct a new tsv and populate it with the data.
* Coalition/Opposition: use only the `coalition` attribute in the parties metadata (.xls file). Check for issues.
* Opposition: `passive` always seems to be the government. Is it tagged in the speeches in any parliament?
* Education: ask Tomaž.
* Affiliation: every MP needs two, one as a member of parliament and one as a member of their party. People like the PM need only their party reference. But we do not have them in the MP Excel file, so this will probably cause validation issues.
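
The two-affiliation rule from the last note could be sketched like this, mirroring the attributes in the resolved `PernarIvan` example. The `add_mp_affiliations` helper is an assumption for illustration, and whether the party affiliation also needs an `ana` or date attributes is still open.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def add_mp_affiliations(person, party_code, term):
    """Attach one party affiliation and one parliament (MP) affiliation."""
    party = SubElement(
        person, "affiliation",
        {"role": "member", "ref": f"#party.{party_code}"},
    )
    SubElement(party, "roleName", {"xml:lang": "en"}).text = "Member"

    mp = SubElement(
        person, "affiliation",
        {"role": "member", "ref": "#HS", "ana": f"#HS.{term}"},
    )
    SubElement(mp, "roleName", {"xml:lang": "en"}).text = "MP"
    return person

# Values taken from the resolved example above.
person = Element("person", {"xml:id": "PernarIvan"})
add_mp_affiliations(person, "Nezavisni", 9)
print(tostring(person, encoding="unicode"))
```

For people who are not MPs (such as the PM), only the first of the two `SubElement` groups would apply.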

In [47]:
from typing import Union, List
from pathlib import Path

import pandas as pd

directory = ".."
mp_path = Path(directory) / Path("Croatia_MPs_final_ 20220917.xlsx")
parties_path = Path(directory) / Path("Croatia_parties_final_20220917.xlsx")

mpdf = pd.read_excel(str(mp_path))
termdata = pd.read_csv("../terms.csv").set_index("Term")

# Normalize the party code "NZ" to "nezavisni" (they denote the same group).
mpdf["party"] = mpdf["party"].replace("NZ", "nezavisni")
partiesdf = pd.read_excel(str(parties_path))


# Do we have any MPs that occur in more than one party?

In [48]:
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
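# Collect the set of parties recorded for each MP code.
gb = mpdf.groupby("codemp").agg({
    "party": lambda parties: set(parties)})
# MPs whose rows list more than one distinct party.
gb_problematic = gb[gb.party.apply(len) != 1]
gb_problematic.shape[0]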

54

In [49]:
# problematic_codemps = gb_problematic.index.tolist()

# subset = mpdf[mpdf.codemp.isin(problematic_codemps)]
# gb = subset.groupby("fullname term1 party".split()).agg({
#     "party": lambda i: i,
    
# }
# )

# gb

# Preparing persons:

In [50]:
from xml.etree.ElementTree import Element, SubElement, tostring, XML, parse
from xml.dom import minidom
def pretty_string(s) -> str:
    """Serialize an Element to a tab-indented XML string (XML declaration included)."""
    return minidom.parseString(tostring(s).decode("utf")).toprettyxml("\t")
def pretty_print(s) -> None:
    """Print the tab-indented serialization of an Element."""
    print(pretty_string(s))

In [51]:
# Take the rows of the first MP as a sample for inspection.
first_codemp = mpdf.codemp.unique()[0]
subset = mpdf[mpdf.codemp == first_codemp]

In [52]:
subset

Unnamed: 0,codemp,order_id,term1,term2,term_id,type_of_list,fullname,firstname,lastname,party,date_of_birth,year_of_birth,gender,place_of_birth,field_of_study,education_y,constituency,bp_lat,bp_lon
0,M172,1,2000-2003,4,1,normal,"Adlešič, Đurđa",Đurđa,Adlešič,HSLS,19600418,1960,1,Bjelovar,1,16,2,45.898797,16.842309
193,M172,194,2003-2007,5,2,normal,"Adlešič, Đurđa",Đurđa,Adlešič,HSLS,19600418,1960,1,Bjelovar,1,16,2,45.898797,16.842309
387,M172,388,2007-2011,6,1,normal,"Adlešič, Đurđa",Đurđa,Adlešič,nezavisni,19600418,1960,1,Bjelovar,1;2,16,2,45.898797,16.842309


In [53]:
mpdf.type_of_list.unique()

array(['normal', 'not_active', 'finished'], dtype=object)

In [54]:

termdata


Term,From,To
5,2003-12-22,2007-10-12
6,2008-01-11,2011-10-28
7,2011-12-22,2015-09-25
8,2015-12-03,2016-07-15
9,2016-10-14,2022-05-18
10,2022-07-22,ongoing


# Terms (line 438):

path: `/teiCorpus/teiHeader/profileDesc/particDesc/listOrg/org[2]/listEvent`

In [55]:
listEvent = Element("listEvent")
head = SubElement(listEvent, "head")
head.set("xml:lang", "hr")
head.text = "Mandatno obdoblje"
head = SubElement(listEvent, "head")
head.set("xml:lang", "en")
head.text = "Legislative period"

for i, row in termdata.iterrows():
    event = SubElement(listEvent, "event")
    event.set("xml:id", f"HS.{i}")
    event.set("from", row["From"])
    # An open-ended term has no end date; "ongoing" is not a valid date value,
    # so the `to` attribute is left out for it.
    if row["To"] != "ongoing":
        event.set("to", row["To"])
    
    label = SubElement(event, "label")
    label.set("xml:lang", "hr")
    label.text = f"{i}. mandat"
    label = SubElement(event, "label")
    label.set("xml:lang", "en")
    label.text = f"Term {i}"
listEvent_str = pretty_string(listEvent)
listEvent_str = "\n".join(listEvent_str.split("\n")[1:])
print(listEvent_str)

<listEvent>
	<head xml:lang="hr">Mandatno obdoblje</head>
	<head xml:lang="en">Legislative period</head>
	<event xml:id="HS.5" from="2003-12-22" to="2007-10-12">
		<label xml:lang="hr">5. mandat</label>
		<label xml:lang="en">Term 5</label>
	</event>
	<event xml:id="HS.6" from="2008-01-11" to="2011-10-28">
		<label xml:lang="hr">6. mandat</label>
		<label xml:lang="en">Term 6</label>
	</event>
	<event xml:id="HS.7" from="2011-12-22" to="2015-09-25">
		<label xml:lang="hr">7. mandat</label>
		<label xml:lang="en">Term 7</label>
	</event>
	<event xml:id="HS.8" from="2015-12-03" to="2016-07-15">
		<label xml:lang="hr">8. mandat</label>
		<label xml:lang="en">Term 8</label>
	</event>
	<event xml:id="HS.9" from="2016-10-14" to="2022-05-18">
		<label xml:lang="hr">9. mandat</label>
		<label xml:lang="en">Term 9</label>
	</event>
	<event xml:id="HS.10" from="2022-07-22">
		<label xml:lang="hr">10. mandat</label>
		<label xml:lang="en">Term 10</label>
	</event>
</listEvent>



# Parties: line 447

path:
`/teiCorpus/teiHeader/profileDesc/particDesc/listOrg/org[3:]`

In [56]:
partydata = partiesdf["party full_name".split()].drop_duplicates()
partydata["id"] = partydata.party.apply(lambda s: f"party.{s.replace(' ', '_').upper()}")

orgs = []
for i, row in partydata.iterrows():
    org = Element("org")
    org.set("xml:id", row["id"])
    org.set("role", "parliamentaryGroup")
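    orgName = SubElement(org, "orgName")
    orgName.set("full", "yes")
    orgName.set("xml:lang", "hr")
    orgName.text = row["full_name"]
    
    # NOTE: these two lines overwrite the values set above on the same element,
    # so only the abbreviation is serialized; keeping the full name as well
    # would need a second orgName element.
    orgName.set("full", "abb")
    orgName.text = row["party"]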
    orgs.append(org)
orgs = [pretty_string(i) for i in orgs]

orgstring = "\n".join(
    ["\n".join(org.split("\n")[1:])for org in orgs]
)
print(orgstring)

<org xml:id="party.DC" role="parliamentaryGroup">
	<orgName full="abb" xml:lang="hr">DC</orgName>
</org>

<org xml:id="party.HDSSB" role="parliamentaryGroup">
	<orgName full="abb" xml:lang="hr">HDSSB</orgName>
</org>

<org xml:id="party.HDZ" role="parliamentaryGroup">
	<orgName full="abb" xml:lang="hr">HDZ</orgName>
</org>

<org xml:id="party.HNS" role="parliamentaryGroup">
	<orgName full="abb" xml:lang="hr">HNS</orgName>
</org>

<org xml:id="party.HSLS" role="parliamentaryGroup">
	<orgName full="abb" xml:lang="hr">HSLS</orgName>
</org>

<org xml:id="party.HSP" role="parliamentaryGroup">
	<orgName full="abb" xml:lang="hr">HSP</orgName>
</org>

<org xml:id="party.HSS" role="parliamentaryGroup">
	<orgName full="abb" xml:lang="hr">HSS</orgName>
</org>

<org xml:id="party.HSU" role="parliamentaryGroup">
	<orgName full="abb" xml:lang="hr">HSU</orgName>
</org>

<org xml:id="party.IDS" role="parliamentaryGroup">
	<orgName full="abb" xml:lang="hr">IDS</orgName>
</org>

<org xml:id="party.LIBRA
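
The printed orgs carry only the abbreviated name, because the cell reuses a single `orgName` element. A possible variant that keeps both names is sketched here; `build_org` is a hypothetical helper, and `"XY"` / `"Primjer stranka"` are placeholder values rather than rows from the parties sheet.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def build_org(party_code, full_name):
    """Build an <org> with separate full and abbreviated <orgName> children."""
    org = Element("org", {
        "xml:id": f"party.{party_code.replace(' ', '_').upper()}",
        "role": "parliamentaryGroup",
    })
    full = SubElement(org, "orgName", {"full": "yes", "xml:lang": "hr"})
    full.text = full_name
    abb = SubElement(org, "orgName", {"full": "abb", "xml:lang": "hr"})
    abb.text = party_code
    return org

# Placeholder values for illustration only.
org = build_org("XY", "Primjer stranka")
print(tostring(org, encoding="unicode"))
```

The id construction is copied from the cell above, so the generated ids would stay the same.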

# People: line 525

path: `/teiCorpus/teiHeader/profileDesc/particDesc/listPerson`

In [57]:
people = mpdf.fullname.drop_duplicates()
assert people.shape[0] == mpdf.codemp.unique().shape[0], "People names are not unique!"
listPerson = []
for p in people:
    # Boolean indexing avoids query-string quoting problems for names containing an apostrophe.
    subset = mpdf[mpdf.fullname == p].reset_index(drop=True)
    surname = subset.lastname[0]
    forename = subset.firstname[0]
    sex = "M" if subset.gender[0] == 0 else "F"
    birth = str(subset.date_of_birth[0])
    birth = f"{birth[0:4]}-{birth[4:6]}-{birth[-2:]}"
    reference_name = f"{surname.replace(' ', '')}{forename.replace(' ', '')}"
    
    person = Element("person")
    person.set("xml:id", reference_name)
    persName = SubElement(person, "persName")
    
    buf = SubElement(persName, "surname")
    buf.text = surname
    
    buf = SubElement(persName, "forename")
    buf.text = forename
    
    buf = SubElement(person, "sex")
    buf.set("value", sex)
    
    buf = SubElement(person, "birth")
    buf.set("when", birth)
    
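    for _, row in subset.iterrows():
        ana = SubElement(person, "affiliation")
        ana.set("role", "member")
        # NOTE: the party org ids are upper-cased with spaces replaced by "_",
        # while this ref uses the raw party code, so lowercase codes such as
        # "nezavisni" will not match a generated org id.
        ana.set("ref", f"#party.{row['party']}")
        ana.set("ana", f"#HS.{row['term2']}")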
        
        rolename = SubElement(ana, "roleName")
        rolename.set("xml:lang", "en")
        rolename.text = "MP"
        
    listPerson.append(person)

In [58]:
listPerson = ["\n".join(pretty_string(p).split("\n")[1:]) for p in listPerson]
listPerson = "\n".join(listPerson)
print(listPerson)

<person xml:id="AdlešičĐurđa">
	<persName>
		<surname>Adlešič</surname>
		<forename>Đurđa</forename>
	</persName>
	<sex value="F"/>
	<birth when="1960-04-18"/>
	<affiliation role="member" ref="#party.HSLS" ana="#HS.4">
		<roleName xml:lang="en">MP</roleName>
	</affiliation>
	<affiliation role="member" ref="#party.HSLS" ana="#HS.5">
		<roleName xml:lang="en">MP</roleName>
	</affiliation>
	<affiliation role="member" ref="#party.nezavisni" ana="#HS.6">
		<roleName xml:lang="en">MP</roleName>
	</affiliation>
</person>

<person xml:id="ArlovićMato">
	<persName>
		<surname>Arlović</surname>
		<forename>Mato</forename>
	</persName>
	<sex value="M"/>
	<birth when="1952-11-04"/>
	<affiliation role="member" ref="#party.SDP" ana="#HS.4">
		<roleName xml:lang="en">MP</roleName>
	</affiliation>
	<affiliation role="member" ref="#party.SDP" ana="#HS.5">
		<roleName xml:lang="en">MP</roleName>
	</affiliation>
</person>

<person xml:id="Babić-PetričevićZdenka">
	<persName>
		<surname>Babić-Petričević</
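
The output above points at `#HS.4` and `#party.nezavisni`, neither of which matches an id produced by the earlier cells (terms start at 5, and party ids are upper-cased). A rough check for such dangling pointers is sketched below. `dangling_refs` is a hypothetical helper, and the sample fragment is hand-written from values visible in this notebook, not the real header.

```python
from xml.etree.ElementTree import fromstring

# ElementTree expands the xml: prefix to this namespace when parsing.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def dangling_refs(xml_text):
    """Return sorted ids that ref/ana attributes point to but nothing declares."""
    root = fromstring(xml_text)
    known = {el.get(XML_ID) for el in root.iter() if el.get(XML_ID)}
    pointers = set()
    for el in root.iter():
        for attr in ("ref", "ana"):
            for token in el.get(attr, "").split():
                if token.startswith("#"):
                    pointers.add(token[1:])
    return sorted(pointers - known)

# Hand-written fragment for illustration.
sample = (
    '<particDesc>'
    '<event xml:id="HS.5"/><event xml:id="HS.6"/>'
    '<org xml:id="party.HSLS"/><org xml:id="party.NEZAVISNI"/>'
    '<person xml:id="AdlešičĐurđa">'
    '<affiliation ref="#party.HSLS" ana="#HS.4"/>'
    '<affiliation ref="#party.nezavisni" ana="#HS.6"/>'
    '</person></particDesc>'
)
print(dangling_refs(sample))
```

Running something like this on the filled-in header before the official validation could catch id mismatches early.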

In [59]:
termdata

Term,From,To
5,2003-12-22,2007-10-12
6,2008-01-11,2011-10-28
7,2011-12-22,2015-09-25
8,2015-12-03,2016-07-15
9,2016-10-14,2022-05-18
10,2022-07-22,ongoing


In [60]:
path_to_term_tmxs = Path("../T/")
assert path_to_term_tmxs.exists()


xiincludes = ""
for file in sorted(list(path_to_term_tmxs.glob("ParlaMint-HR_T*.xml"))):
    xiincludes += f"""    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="{file.name}"/>\n"""
xiincludes

'    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ParlaMint-HR_T05.xml"/>\n    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ParlaMint-HR_T06.xml"/>\n    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ParlaMint-HR_T07.xml"/>\n    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ParlaMint-HR_T08.xml"/>\n    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ParlaMint-HR_T09.xml"/>\n    <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ParlaMint-HR_T10.xml"/>\n'

# Putting it all together

In [61]:
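from string import Template

# Read the corpus-root template; it is expected to contain the $listEvent,
# $orgs, $listPerson and $xiincludes placeholders.
with open("../T/ParlaMint-HR_template2.xml", encoding="utf-8") as f:
    content = f.read()

template = Template(content)
# substitute() raises KeyError if the template has a placeholder not supplied here.
filled_in = template.substitute(
    listEvent=listEvent_str,
    orgs=orgstring,
    listPerson=listPerson,
    xiincludes=xiincludes,
)

# Write explicitly as UTF-8, since names contain non-ASCII characters.
with open("../T/ParlaMint-HR.xml", "w", encoding="utf-8") as f:
    f.write(filled_in)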
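
A quick well-formedness check of the filled-in template could look like the sketch below. The tiny `template_text` is a hypothetical stand-in for `ParlaMint-HR_template2.xml`, and parsing only confirms well-formedness, not validity against the ParlaMint schema.

```python
from string import Template
from xml.etree.ElementTree import fromstring

# Hypothetical miniature template; the real one is much larger.
template_text = (
    '<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">\n'
    '$listEvent\n'
    '$xiincludes'
    '</teiCorpus>\n'
)
filled = Template(template_text).substitute(
    listEvent='<listEvent><event xml:id="HS.5" from="2003-12-22" to="2007-10-12"/></listEvent>',
    xiincludes='<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ParlaMint-HR_T05.xml"/>\n',
)

# fromstring raises xml.etree.ElementTree.ParseError if the result is not well-formed.
root = fromstring(filled)
print(root.tag, len(root))
```

Note that a literal `$` in the template text has to be written as `$$`, otherwise `Template.substitute` treats it as a placeholder.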