The objective of this notebook is to fix the party abbreviations in the `MP` Excel file so that they correspond to the `parties` table.

In [78]:
import pandas as pd
from pathlib import Path
from typing import List, Union

pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# directory = ".."
# mp_path = Path(directory) / Path("Croatia_MPs_final_ 20220917.xlsx")
# parties_path = Path(directory) / Path("Croatia_parties_final_20220917.xlsx")

# mpdf = pd.read_excel(str(mp_path))#.dropna()
# partiesdf = pd.read_excel(str(parties_path))

mpdf = pd.read_pickle("mpdf")
partiesdf = pd.read_pickle("partiesdf")



# Let's filter parties that do not appear in the metadata:

In [79]:
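from functools import lru_cache


@lru_cache
def is_present(query: str) -> bool:
    """Return True if `query` occurs in any `*_meta.tsv` metadata file."""
    paths = list(
        Path("/home/rupnik/parlamint/").glob("*_meta.tsv"))
    for path in paths:
        with open(str(path)) as f:
            # Plain substring search over the whole file content.
            if query in f.read():
                return True
    return False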


In [80]:

is_present('"nezavisni"')

True
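`is_present` re-reads the metadata files from disk for every query that is not yet cached. A minimal sketch of an alternative, assuming the files fit in memory: read them once and search the loaded strings. The helper names `load_texts` and `is_present_in` are hypothetical and not part of this notebook.

```python
from pathlib import Path
from typing import List


def load_texts(directory: str, pattern: str = "*_meta.tsv") -> List[str]:
    """Read every file matching `pattern` once and return the contents."""
    paths = sorted(Path(directory).glob(pattern))
    return [p.read_text(encoding="utf-8") for p in paths]


def is_present_in(query: str, texts: List[str]) -> bool:
    """Return True if `query` occurs in at least one of the loaded texts."""
    return any(query in text for text in texts)
```

With `texts = load_texts("/home/rupnik/parlamint/")`, a call such as `is_present_in('"nezavisni"', texts)` would perform the same substring check without touching the disk again.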

In [81]:
# partiesdf["party_present"] = partiesdf.party.apply(lambda s: is_present(f'"{s}"'))
# partiesdf["full_name_present"] = partiesdf.full_name.apply(lambda s: is_present(f'"{s}"'))
# mpdf["speaker_present"] = mpdf.fullname.apply(lambda s: is_present(f'"{s}"'))

# mpdf.to_pickle("mpdf")
# partiesdf.to_pickle("partiesdf")

Let's drop all parties that appear in the metadata neither under their full name nor under their abbreviation. Let's also drop speakers who do not appear in the metadata or who spoke in Term 4:

In [82]:
partiesdf = partiesdf[partiesdf.party_present | partiesdf.full_name_present].reset_index(drop=True)
mpdf = mpdf[mpdf.speaker_present & (mpdf.term2 != 4)].reset_index(drop=True)

Let's see which values we have to impute:

In [83]:
values_to_fill = set(mpdf.party[mpdf.speaker_present]) - set(partiesdf.party)
values_to_fill

{'Blok',
 'Centar',
 'Damir Bajs NL',
 'Domovinski pokret',
 'HKS',
 'HSP dr. Ante Starčević',
 'Hrvatski suverenisti',
 'Most',
 'Možemo!',
 'NL',
 'NZ',
 'RF',
 'SDA HR',
 'SsIP',
 'nezavisni'}

We can already change NZ to nezavisni and solve one of them:

In [84]:
mpdf["party"] = mpdf.party.replace({
    "NZ": "nezavisni"
})
values_to_fill = set(mpdf.party[mpdf.speaker_present]) - set(partiesdf.party)

Approach:

For every discrepancy in party abbreviations we look in the respective term and see which party the person belongs to.

In [85]:
mpdf["party_new"] = mpdf.party


def get_party(name: str, term: Union[str, int]) -> Union[str, pd.Series]:
    """Look up the party of speaker `name` in the metadata of `term`.

    Returns a single string when exactly one row matches; otherwise
    returns the (possibly empty) Series of matching parties.
    """
    df = pd.read_csv(
        f"/home/rupnik/parlamint/{term}_meta.tsv",
        sep="\t",
        usecols=["Term", "Speaker_name", "Speaker_party"],
    ).drop_duplicates()
    subset = df[(df.Speaker_name == name) & (df.Term == int(term))]

    if subset.shape[0] == 1:
        return subset.Speaker_party.values[0]
    return subset.Speaker_party


get_party("Banac, Ivo", 5)

1125            LS
34768    nezavisni
Name: Speaker_party, dtype: object
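The call above finds two parties for the same speaker in the same term, so the lookup yields a Series rather than a single string. A sketch of a variant that always returns a list of distinct parties, which makes the ambiguous and the empty case explicit. The function name `candidate_parties` and the toy rows are hypothetical; only the column names are taken from the metadata.

```python
import pandas as pd


def candidate_parties(meta: pd.DataFrame, name: str, term: int) -> list:
    """Return the distinct parties recorded for `name` in `term` (possibly empty)."""
    mask = (meta["Speaker_name"] == name) & (meta["Term"] == int(term))
    return meta.loc[mask, "Speaker_party"].dropna().unique().tolist()


# Toy metadata with the same columns that get_party reads.
meta = pd.DataFrame({
    "Term": [5, 5, 5],
    "Speaker_name": ["Banac, Ivo", "Banac, Ivo", "Doe, Jane"],
    "Speaker_party": ["LS", "nezavisni", "ABC"],
})

print(candidate_parties(meta, "Banac, Ivo", 5))  # two candidates
print(candidate_parties(meta, "Doe, Jane", 5))   # exactly one candidate
print(candidate_parties(meta, "Nobody", 5))      # no candidate
```

A caller can then branch on the length of the list instead of on the return type.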

In [86]:
mpdf[mpdf.fullname == "Banac, Ivo"]

Unnamed: 0,codemp,order_id,term1,term2,term_id,type_of_list,fullname,firstname,lastname,party,date_of_birth,year_of_birth,gender,place_of_birth,field_of_study,education_y,constituency,bp_lat,bp_lon,speaker_present,party_new
13,M273,206,2003-2007,5,14,normal,"Banac, Ivo",Ivo,Banac,nezavisni,19470301,1947,0,Dubrovnik,1,22,6,42.650661,18.094424,True,nezavisni


In [87]:
mpdf["party_new"] = mpdf.party.copy()
mpdf = mpdf.loc[mpdf.speaker_present & (mpdf.term2 != 4)]
mpdf = mpdf.reset_index(drop=True)


for i, row in mpdf.iterrows():
    if row["party_new"] not in values_to_fill:
        continue
    try:
        mpdf.loc[i, "party_new"] = get_party(row["fullname"], row["term2"])
    except Exception:
        # Keep the current value if the lookup or the assignment fails.
        continue


In [88]:
mpdf.party_new.str.contains(" ").sum()

17

In [89]:
mpdf.loc[mpdf.party_new.isna(),:]

Unnamed: 0,codemp,order_id,term1,term2,term_id,type_of_list,fullname,firstname,lastname,party,date_of_birth,year_of_birth,gender,place_of_birth,field_of_study,education_y,constituency,bp_lat,bp_lon,speaker_present,party_new
942,M660,1180,2016-2020,9,185,not_active,"Marić, Zdravko",Zdravko,Marić,nezavisni,19770203,1977,0,Slavonski Brod,4,22,5,45.163143,18.011608,True,
1057,M660,1300,2020-2023,10,111,not_active,"Marić, Zdravko",Marić,Zdravko,nezavisni,19770203,1977,0,Slavonski Brod,4,22,5,45.163143,18.011608,True,


In [90]:
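# Fill only the party_new column; selecting ":" would overwrite every column of these rows.
mpdf.loc[mpdf.party_new.isna(), "party_new"] = "nezavisni"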
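When filling missing values with `.loc`, the second indexer decides which columns are written: `:` assigns the value to every column of the selected rows, while a column label restricts the assignment to that column. A toy sketch with made-up rows that restricts the fill to `party_new`:

```python
import numpy as np
import pandas as pd

# Made-up rows; only the column names mirror mpdf.
df = pd.DataFrame({
    "fullname": ["A", "B"],
    "party_new": ["XY", np.nan],
})

# Fill party_new only in the rows where it is missing.
df.loc[df["party_new"].isna(), "party_new"] = "nezavisni"
print(df)
```

The `fullname` column of the filled row keeps its original value.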

In [91]:
mpdf.loc[mpdf.party_new.isin(values_to_fill), "party_new"].unique()

array(['nezavisni', 'Most', 'HKS', 'Hrvatski suverenisti'], dtype=object)

`nezavisni` is to be left alone. I can find `MOST` in both the metadata and the parties Excel file, so I can use that.

I can't find HKS in the metadata or in the parties table, except in the `coalition_composition` attribute: `partiesdf[partiesdf.coalition_composition.str.lower().str.contains("hks")]`.

The same goes for `Hrvatski suverenisti`: I can't find `suverenisti` or `HS` in the metadata. For now I'll just convert them to "HS", because no party has this abbreviation yet.
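The two checks described here (searching `coalition_composition` and confirming that an abbreviation is still free) can be sketched on a toy table. The rows below are invented for illustration; only the column names come from `partiesdf`. Passing `na=False` is an assumption about how missing cells should be treated: they count as non-matching.

```python
import pandas as pd

# Invented rows; "-" marks "no coalition" as in the parties table.
parties = pd.DataFrame({
    "party": ["MOST", "AAA", "BBB"],
    "coalition_composition": ["-", "AAA, HKS, CCC", None],
})

# Case-insensitive search; missing cells are treated as non-matching.
mask = parties["coalition_composition"].str.contains("hks", case=False, na=False)
print(parties[mask])

# Check whether a candidate abbreviation is already taken.
print("HS" in set(parties["party"]))
```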

In [92]:
partiesdf[partiesdf.party == "MOST"].sample()

Unnamed: 0,codeparty,term1,term2,party,full_name,established,chairman,ideology_LR,party_family,election_result,no_seats,coalition,coalition_composition,ruling,party_present,full_name_present
96,P97,2020-2024,10,MOST,Most nezavisnih lista,2012,Božo Petrov,4,1,7.39,8,0,-,0,True,True


In [93]:
mpdf["party"] = mpdf.party.replace({
    "Most": "MOST",
    "Hrvatski suverenisti": "HS",
})

In [94]:
mpdf.to_pickle("mpdf_corrected")
partiesdf.to_pickle("partiesdf_corrected")