# Data description:

Information available in the bronbeek.constituent.csv
    - ConstituentID	
    - ConstituentTypeID	
    - Active	
    - AlphaSort	
    - LastName	
    - FirstName	
    - NameTitle	
    - Institution	
    - DisplayName	
    - BeginDate	
    - EndDate	
    - DisplayDate	
    - Biography	
    - Code	
    - Nationality	
    - School	
    - LoginID	
    - EnteredDate	
    - SysTimeStamp	
    - PublicAccessOld	
    - Remarks	
    - Position	
    - MiddleName	
    - Suffix	
    - CultureGroup	
    - salutation	
    - Approved	
    - PublicAccess	
    - IsPrivate	
    - DefaultNameID	
    - SystemFlag	
    - InternalStatus	
    - DefaultDisplayBioID	
    - BeginDateISO	
    - EndDateISO	
    - FullName (constructed by me: FirstName + LastName)
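
The constructed `FullName` column can be reproduced with a few lines of pandas. This is a minimal sketch, not necessarily how the column was built: the single-space separator, the handling of missing parts, the helper name `build_full_name` and the demo rows are all assumptions.

```python
import pandas as pd

def build_full_name(df: pd.DataFrame) -> pd.Series:
    """Join FirstName and LastName with a space, skipping missing parts (assumed behaviour)."""
    first = df["FirstName"].fillna("").astype(str).str.strip()
    last = df["LastName"].fillna("").astype(str).str.strip()
    # The outer strip removes the stray space left when one part is missing
    return (first + " " + last).str.strip()

# Made-up rows, for illustration only
demo = pd.DataFrame({
    "FirstName": ["Jan", None, "Piet"],
    "LastName": ["Jansen", "Bakker", None],
})
demo["FullName"] = build_full_name(demo)
```

Filling missing parts with an empty string first avoids ending up with `NaN` for every row where only one of the two name parts is present.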
    
Information kept in the ground truth file (Bronbeek columns):
    - ConstituentID		
    - AlphaSort	
    - LastName	
    - FirstName	
    - NameTitle		
    - DisplayName	
    - BeginDate	
    - EndDate	
    - DisplayDate	
    - Biography	
    - Nationality	
    - MiddleName	
    - Suffix		
    - BeginDateISO	
    - EndDateISO	
    - FullName 

NMVW constituent shape:

![Constituent_shape](../nmvw_data/sample/sample_constituent.png)

Distinct properties on constituent instances:
1. rdf:type --> crm:E21_Person (we only consider persons), crm:E74_Group, crm:E39_Actor
2. crm:P1_is_identified_by --> typically the name
3. crm:P2_has_type -->
    - nationality: http://vocab.getty.edu/aat/300379842
    - gender: http://vocab.getty.edu/aat/300055147
        - male: http://vocab.getty.edu/aat/300189557
        - female: http://vocab.getty.edu/aat/300189559
4. la:equivalent --> external identifier, e.g., Wikidata
5. crm:P100i_died_in --> the person's date of death (dod)
6. crm:P98i_was_born --> the person's date of birth (dob)
7. crm:P67i_is_referred_to_by --> typically a description or biography


Information kept in the ground truth file (NMVW fields):
    - nmvw_uri
    - name_labels (only one preferred term)
    - dob
    - dod
    - ~~nationality~~
    - ~~and gender~~

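Constructing the ground truth (size: 60):
    - we select 20 random rows from each previously evaluated result (surname: 20, abbreviation: 20, fuzzy: 20)
    - from each result we aim for 10 correct matches and 10 unknown or negative matches; where fewer than 10 correct matches exist (abbreviation: 8, fuzzy: 5), all correct matches are kept and the remaining rows are unknown or negative matches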
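
The selection rule above can be sketched as a single helper. This is a hedged sketch rather than the code used in the cells that follow: the function name `sample_pos_neg`, its parameters and the demo table are hypothetical, and only the `HumanEval == "YES"` test and `random_state=1` are taken from the notebook.

```python
import pandas as pd

def sample_pos_neg(df, total=20, max_pos=10, label_col="HumanEval", seed=1):
    """Take up to max_pos rows labelled YES and fill the rest with non-YES rows."""
    pos = df.loc[df[label_col] == "YES"]
    neg = df.loc[df[label_col] != "YES"]
    # If fewer than max_pos correct matches exist, keep all of them
    pos_sample = pos.sample(min(max_pos, len(pos)), random_state=seed)
    n_neg = min(total - len(pos_sample), len(neg))
    neg_sample = neg.sample(n_neg, random_state=seed)
    return pd.concat([pos_sample, neg_sample], ignore_index=True, sort=False)

# Made-up evaluation table: 8 correct matches and 52 other rows
demo = pd.DataFrame({"HumanEval": ["YES"] * 8 + ["NO"] * 40 + ["UNKNOWN"] * 12})
picked = sample_pos_neg(demo)
```

With 8 correct matches available, the helper returns those 8 plus 12 other rows, which mirrors the 8/12 split described for the abbreviation result. Unlike the cells below, it shuffles the correct matches even when all of them are kept.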

## Sample selection from surname match evaluation

In [1]:
import pandas

In [2]:
new_df = pandas.read_csv("../exp201/results/SurnameMatchResult_sample_60.csv", sep=";")

Choosing 10 successful samples from the surname match

In [3]:
sample_df = new_df.loc[new_df["HumanEval"] == "YES"].sample(10, random_state=1)

In [4]:
ground_truth = pandas.DataFrame() 
ground_truth = pandas.concat([ground_truth, sample_df], ignore_index=True, sort=False)

Choosing 10 unsuccessful samples from the surname match

In [5]:
sample_df = new_df.loc[new_df["HumanEval"] != "YES"].sample(10, random_state=1)

In [6]:
ground_truth = pandas.concat([ground_truth, sample_df], ignore_index=True, sort=False)

### Samples from Abbreviation Match

In [7]:
new_df = pandas.read_csv("../exp201/results/AbbreviationMatchResult_sample_60.csv", sep=";")

Choosing 8 successful samples from the abbreviation match (only 8 correct matches exist, so all of them are taken)

In [8]:
sample_df = new_df.loc[new_df["HumanEval"] == "YES"]

In [9]:
sample_df = sample_df.drop('Unnamed: 0', axis=1)

In [10]:
ground_truth = ground_truth.assign(nmvw_uri= None)
ground_truth = ground_truth.assign(name_label = None)

In [11]:
sample_df = sample_df.rename(columns={'Abbreviations': 'RetrievedNames'})

In [12]:
ground_truth = pandas.concat([ground_truth, sample_df], ignore_index=True, sort=False)

Choosing 12 unsuccessful samples from the abbreviation match

In [13]:
sample_df = new_df.loc[new_df["HumanEval"] != "YES"].sample(12, random_state=1)
sample_df = sample_df.drop('Unnamed: 0', axis=1)
sample_df = sample_df.rename(columns={'Abbreviations': 'RetrievedNames'})

In [14]:
ground_truth = pandas.concat([ground_truth, sample_df], ignore_index=True, sort=False)

### Samples from FuzzyString Match

In [15]:
new_df = pandas.read_csv("../exp201/results/FuzzyStringMatchResult_sample_60.csv", sep=";")

Choosing 5 successful samples from the fuzzy string match (only 5 correct matches exist, so all of them are taken)

In [16]:
sample_df = new_df.loc[new_df["HumanEval"] == "YES"]
sample_df = sample_df.rename(columns={'Unnamed: 0': 'index'})

In [17]:
ground_truth = pandas.concat([ground_truth, sample_df], ignore_index=True, sort=False)

Choosing 15 unsuccessful samples from the fuzzy string match

In [18]:
sample_df = new_df.loc[new_df["HumanEval"] != "YES"].sample(15, random_state=1)
sample_df = sample_df.rename(columns={'Unnamed: 0': 'index'})

In [19]:
ground_truth = pandas.concat([ground_truth, sample_df], ignore_index=True, sort=False)

In [20]:
ground_truth = ground_truth.rename(columns={'name_label': 'pref_label'})
ground_truth = ground_truth.assign(birth_begin_time= None)
ground_truth = ground_truth.assign(birth_end_time= None)
ground_truth = ground_truth.assign(death_begin_time= None)
ground_truth = ground_truth.assign(death_end_time= None)

ground_truth = ground_truth.assign(ConstituentID= None)

<ol>
    <li> Merge the column values needed from both data sources </li>
    <li> NMVW ['nmvw_uri', 'pref_label', 'birth_begin_time', 'birth_end_time', 'death_begin_time', 'death_end_time'] </li>
    <li> Bronbeek ['index', 'ConstituentID', 'AlphaSort', 'LastName', 'FirstName', 'NameTitle', 'DisplayName', 'BeginDate', 'EndDate', 'DisplayDate', 'Biography', 'Nationality', 'MiddleName', 'Suffix', 'BeginDateISO', 'EndDateISO', 'FullName'] </li>
    <li> Previous evaluation ['RetrievedNames', 'MATCH', 'HumanEval'] </li>
</ol>
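
The merge described in the list above can be checked on a small example before running it on the real tables. The sketch below uses made-up stand-ins for the evaluated samples and the Bronbeek dump; the `validate` and `indicator` arguments are real `pandas.merge` options, but using them here is a suggestion, not part of the original cells.

```python
import pandas as pd

# Made-up stand-ins for the evaluated samples and the Bronbeek constituents
evaluated = pd.DataFrame({"index": [0, 2, 5], "HumanEval": ["YES", "NO", "YES"]})
constituents = pd.DataFrame({
    "ConstituentID": [101, 102, 103],
    "DisplayName": ["A", "B", "C"],
})
constituents["index"] = constituents.index  # row position used as the join key

merged = pd.merge(
    evaluated, constituents,
    how="left", on="index",
    validate="many_to_one",  # raise if the key is duplicated on the right side
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

# Rows whose key was not found in the constituents table
unmatched = merged.loc[merged["_merge"] == "left_only", "index"].tolist()
```

Inspecting `unmatched` makes it visible when a sampled row has no counterpart in the dump, which a plain left merge would hide behind silently empty columns.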

In [21]:
def read_csv(file):
    """Load a CSV file into a DataFrame; `file` must be the path to a CSV file, not a directory."""
    df = pandas.read_csv(file)
    return df

bronbeek_const_df = read_csv("/Users/sarah_shoilee/Desktop/Sarah/Bronbeek_Data/csv_dump/Constituents.csv")
bronbeek_const_df['index'] = bronbeek_const_df.index

In [22]:
ground_truth.drop(['AlphaSort', 'FirstName', 'NameTitle', 'Biography', 'LastName',
       'DisplayName','MiddleName', 'FullName', 'Suffix','ConstituentID'], axis=1, inplace=True)

In [23]:
# join on the shared 'index' column (the key only needs to be named once)
ground_truth = pandas.merge(ground_truth, bronbeek_const_df, how="left", on="index")

In [24]:
ground_truth.drop(['Unnamed: 0.4', 'Unnamed: 0.3', 'Unnamed: 0.2', 'Unnamed: 0.1', 'Unnamed: 0', 'ConstituentTypeID', 'Active','Institution','Code','School','LoginID','EnteredDate','SysTimeStamp','PublicAccessOld','Remarks','Position','CultureGroup','salutation','Approved','PublicAccess','IsPrivate','DefaultNameID' ,'SystemFlag','InternalStatus','DefaultDisplayBioID'], axis=1, inplace=True)

In [25]:
# each column is listed once; repeating 'MiddleName' or 'Suffix' would write duplicate columns to the CSV
ground_truth[['nmvw_uri', 'pref_label', 'birth_begin_time', 'birth_end_time', 'death_begin_time', 'death_end_time', 'index', 'ConstituentID', 'AlphaSort', 'NameTitle', 'Biography', 'FirstName', 'MiddleName', 'LastName', 'Suffix', 'FullName', 'DisplayName', 'BeginDate', 'EndDate', 'DisplayDate', 'Nationality', 'BeginDateISO', 'EndDateISO', 'RetrievedNames', 'MATCH', 'HumanEval']].to_csv("ground_truth.csv", index=False)

TODO:
    - decide which columns the ground truth must have (e.g., nmvw_dob, nmvw_dod; bronbeek --> index, ConstituentID, AlphaSort, LastName, FirstName, NameTitle, DisplayName, BeginDate, EndDate, DisplayDate, Biography, Nationality, MiddleName, Suffix, DefaultDisplayBioID, BeginDateISO, EndDateISO, FullName)
    - store it in a csv

TODO: three more steps: 
        1. add the common names given by the experts
        2. ~~add missing information from NMVW (dataset merge)~~
        3. add missing information from bronbeek (dataset merge)

## 2. Add missing information from NMVW (dataset merge)

In [26]:
%pip install ipython_sparql_pandas

Note: you may need to restart the kernel to use updated packages.


In [27]:
%load_ext ipython_sparql_pandas

In [28]:
%%sparql http://localhost:7200/repositories/LinkedArt_NMvW_Constituent -q -s constituent_graph
        

    PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
    PREFIX la: <https://linked.art/ns/terms/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?s ?name ?birth_begin_time ?birth_end_time ?death_begin_time ?death_end_time WHERE {
        GRAPH <http://www.example.com/nmvw.linkedart/ccrdfconst/> {
            ?s a crm:E21_Person .
            ?s crm:P1_is_identified_by ?name_bnode .
            ?name_bnode crm:P2_has_type <http://vocab.getty.edu/aat/300404670> .
            ?name_bnode crm:P190_has_symbolic_content ?name .
            OPTIONAL {
                ?s crm:P98i_was_born ?birth_bNode .
                ?birth_bNode a crm:E67_Birth .
                ?birth_bNode crm:P4_has_time-span ?b_time .
                ?b_time a crm:E52_Time-Span .
                ?b_time crm:P82a_begin_of_the_begin ?birth_begin_time .
                ?b_time crm:P82b_end_of_the_end ?birth_end_time .
            }
            OPTIONAL {
                ?s crm:P100i_died_in ?death_bNode .
                ?death_bNode a crm:E69_Death .
                ?death_bNode crm:P4_has_time-span ?d_time .
                ?d_time a crm:E52_Time-Span .
                ?d_time crm:P82a_begin_of_the_begin ?death_begin_time .
                ?d_time crm:P82b_end_of_the_end ?death_end_time .
            }
        }
    }


In [29]:
constituent_graph.head()

|   | s | name | birth_begin_time | birth_end_time | death_begin_time | death_end_time |
|---|---|------|------------------|----------------|------------------|----------------|
| 0 | https://hdl.handle.net/20.500.11840/pi21251 | Raden Adipati Karta Negara | | | | |
| 1 | https://hdl.handle.net/20.500.11840/pi22580 | Neettiy Kwee | | | | |
| 2 | https://hdl.handle.net/20.500.11840/pi2643 | P.C. Nelson | | | | |
| 3 | https://hdl.handle.net/20.500.11840/pi2648 | Dhr. A.H. Neijs | 1879.0 | 1879.0 | 1963.0 | 1963.0 |
| 4 | https://hdl.handle.net/20.500.11840/pi30249 | C.J. (Chris) Neeb | 1860.0 | 1860.0 | 1924.0 | 1924.0 |
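
In the output above the birth and death values appear as floats such as `1879.0`, so converting them with `str(...)` would store `"1879.0"` rather than `"1879"`. A minimal sketch of one way to avoid that, assuming the values are whole years, is to cast the column to pandas' nullable integer type first. The demo frame reuses two rows from the output; everything else is illustrative.

```python
import pandas as pd

# Two rows copied from the query output above
demo = pd.DataFrame({
    "name": ["P.C. Nelson", "Dhr. A.H. Neijs"],
    "birth_begin_time": [float("nan"), 1879.0],
})

# Int64 (capital I) is the nullable integer dtype: missing values become <NA>
demo["birth_begin_time"] = pd.to_numeric(
    demo["birth_begin_time"], errors="coerce"
).astype("Int64")

# String form without the trailing ".0"; missing values stay missing
years = demo["birth_begin_time"].astype("string")
```

The cast raises if a value is not a whole number, which acts as a cheap check on the whole-year assumption.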


In [73]:
def add_data_from_nmvw(ground_truth, constituent_graph):
    """Copy the name and birth/death time-span values from the NMVW query result into ground_truth."""
    date_cols = ['birth_begin_time', 'birth_end_time', 'death_begin_time', 'death_end_time']
    target_cols = ['pref_label'] + date_cols
    # columns read back from the CSV as all-NaN are float64; cast so strings can be stored
    ground_truth[target_cols] = ground_truth[target_cols].astype(object)
    for i, row in ground_truth.iterrows():
        # rows without an NMVW match have no URI
        if pandas.isna(row['nmvw_uri']):
            continue
        q_row = constituent_graph.loc[constituent_graph['s'] == row['nmvw_uri']]
        # skip URIs that the query did not return
        if q_row.empty:
            continue
        ground_truth.at[i, 'pref_label'] = str(q_row['name'].values[0])
        for col in date_cols:
            value = q_row[col].values[0]
            # missing dates come back as NaN, not None, so test with isna
            if not pandas.isna(value):
                ground_truth.at[i, col] = str(value)
    return ground_truth

In [75]:
ground_truth = pandas.read_csv("ground_truth.csv")
ground_truth = add_data_from_nmvw(ground_truth, constituent_graph)
ground_truth.to_csv('ground_truth.csv', index= False)

In [None]:
def add_data_from_bronbeek():
    pass
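
The `add_data_from_bronbeek` stub is still empty. One possible way to fill it is sketched below, assuming the ground truth and the Bronbeek table share the `index` key used in the earlier merge. The function name `fill_from_bronbeek`, its signature and the demo rows are hypothetical; this is not the author's implementation.

```python
import pandas as pd

def fill_from_bronbeek(ground_truth, bronbeek_df, key="index"):
    """Fill missing cells in ground_truth from the bronbeek_df row with the same key."""
    out = ground_truth.copy()
    # One row per key, so each key maps to a single value per column
    lookup = bronbeek_df.drop_duplicates(subset=key).set_index(key)
    for col in out.columns:
        if col == key or col not in lookup.columns:
            continue
        # Value from the Bronbeek table for each ground truth row (NaN if the key is absent)
        filler = out[key].map(lookup[col])
        # Keep existing values and use the Bronbeek value only where the cell is missing
        out[col] = out[col].where(out[col].notna(), filler)
    return out

# Made-up rows, for illustration only
bronbeek = pd.DataFrame({
    "index": [0, 1, 2],
    "DisplayName": ["A. Jansen", "B. Bakker", "C. Visser"],
    "Nationality": ["Dutch", None, "Dutch"],
})
gt = pd.DataFrame({
    "index": [2, 0],
    "DisplayName": [None, "Kept name"],
    "HumanEval": ["YES", "NO"],
})
filled = fill_from_bronbeek(gt, bronbeek)
```

Only columns already present in the ground truth are touched, and values that are already filled in are never overwritten, so the function can be run repeatedly without changing earlier manual corrections.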