## Augmenting text descriptions of object, organisation and person records using their metadata.
For example, 'Made in China, 1994' is added from the PLACE_MADE and DATE_MADE metadata fields, after checking that these values aren't already in the description.

**Why?** The augmented descriptions give machine learning models (e.g. an entity linker) more information to work with when the original descriptions are short or non-existent.

**How?** 
1. Get useful metadata values
2. Check that they're not already in the description
3. Form a new description from the old description plus sentences built by filling templates with these values (*'Made in xxxx'*)
4. Shuffle the sentences before joining them, so that a machine learning model doesn't learn the order in which they were added
5. Store this new description alongside the original as `sdo:disambiguatingDescription`
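
Steps 1 to 4 can be sketched on a single toy record as follows. This is a minimal illustration rather than the notebook's implementation: `augment_description`, the `templates` mapping and the example record are hypothetical names and values, and step 5 (storing the result) is not covered.

```python
import random


def augment_description(description, metadata, templates, rng=random):
    """Append templated metadata sentences that add new information, then shuffle."""
    description = description.strip()
    components = [description] if description else []
    for field, template in templates.items():
        # Step 1: get the metadata value (hypothetical field names)
        value = str(metadata.get(field, "")).strip()
        # Step 2: skip values that are missing or already mentioned in the description
        if value and value.lower() not in description.lower():
            # Step 3: fill the template with the value
            components.append(template.format(value))
    # Step 4: shuffle so a model cannot rely on a fixed component order
    rng.shuffle(components)
    return " ".join(components)


record = {"PLACE_MADE": "China", "DATE_MADE": "1994"}
templates = {"PLACE_MADE": "Made in {}.", "DATE_MADE": "Made {}."}
augmented = augment_description("Plastic toy robot.", record, templates, random.Random(0))
print(augmented)
```

Because only non-empty components are collected, the joined string never contains doubled spaces, and passing a seeded `random.Random` makes the shuffle reproducible.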

In [97]:
import sys
sys.path.append("..")

from heritageconnector import datastore, datastore_helpers
from heritageconnector.utils.data_transformation import get_year_from_date_value
from heritageconnector.utils.generic import flatten_list_of_lists
import pandas as pd
import random, re, string

# pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

people_data_path = "./GITIGNORE_DATA/mimsy_adlib_joined_people.csv"
object_data_path = (
    "./GITIGNORE_DATA/smg-datasets-private/mimsy-catalogue-export.csv"
)
collection_prefix = "https://collection.sciencemuseumgroup.org.uk/objects/co"
people_prefix = "https://collection.sciencemuseumgroup.org.uk/people/cp"


### Objects

In [42]:
# from smg_loader
def load_catalogue_df():
    catalogue_df = pd.read_csv(object_data_path, low_memory=False)
    catalogue_df["URI"] = collection_prefix + catalogue_df["MKEY"].astype(str)
    catalogue_df["MATERIALS"] = catalogue_df["MATERIALS"].apply(
        datastore_helpers.split_list_string
    )
    catalogue_df["ITEM_NAME"] = catalogue_df["ITEM_NAME"].apply(
        datastore_helpers.split_list_string
    )
    catalogue_df.loc[:, ["DESCRIPTION", "OPTION1"]] = catalogue_df.loc[
        :, ["DESCRIPTION", "OPTION1"]
    ].applymap(datastore_helpers.process_text)

    newline = " \n "
    catalogue_df.loc[:, "DESCRIPTION"] = catalogue_df[["DESCRIPTION", "OPTION1"]].apply(
        lambda x: f"{newline.join(x)}"
        if x["DESCRIPTION"] != x["OPTION1"] and (str(x["OPTION1"]) != "nan")
        else x["DESCRIPTION"],
        axis=1,
    )
    
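    # NOTE: the original line referenced `cat_df`, which is not defined inside this
    # function. Year extraction is left disabled so that DATE_MADE keeps its raw string
    # (e.g. "1922-1939"), which the description templates below rely on.
    # catalogue_df["DATE_MADE"] = catalogue_df["DATE_MADE"].apply(
    #     get_year_from_date_value
    # )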

    catalogue_df = catalogue_df[~catalogue_df["CATEGORY1"].str.contains("Disposal")]
    catalogue_df["CATEGORY1"] = catalogue_df["CATEGORY1"].apply(
        lambda x: x.split(" - ")[1].strip()
    )
    
    return catalogue_df

cat_df = load_catalogue_df()

cat_df.head()

Unnamed: 0,MKEY,TITLE,ITEM_NAME,CATEGORY1,COLLECTOR,PLACE_COLLECTED,DATE_COLLECTED,PLACE_MADE,CULTURE,DATE_MADE,MATERIALS,MEASUREMENTS,EXTENT,DESCRIPTION,ITEM_COUNT,PARENT_KEY,BROADER_TEXT,WHOLE_PART,ARRANGEMENT,LANGUAGE_OF_MATERIAL,EDITION,OPTION1,OPTION2,OPTION3,OPTION4,OPTION5,OPTION6,OPTION7,OPTION8,OPTION9,OPTION10,OPTION11,OPTION12,OPTION13,OPTION14,OPTION15,CREATE_DATE,UPDATE_DATE,URI
0,16,Ansonia Sunwatch (pocket compass dial),[pocket horizontal sundial],Time Measurement,,,,"New York county, New York state, United States",,1922-1939,[nan],,,Ansonia Sunwatch (pocket compass dial),1.0,,,WHOLE,,eng,,,,"Desborough, Jane",,,,,,,,SMG00083125,,,,,12-MAR-96,27-MAY-20,https://collection.sciencemuseumgroup.org.uk/o...
1,17,Model of train of wheels used in a clock (full...,"[spring-driven clock mechanism, fusee, model]",Time Measurement,,,,,,,[nan],,,Model of train of wheels used in a clock (full...,1.0,,,WHOLE,,eng,,,,"Desborough, Jane",,,,,,,,,,,,,12-MAR-96,27-MAY-20,https://collection.sciencemuseumgroup.org.uk/o...
2,18,Ship's log sandglass,"[log (nautical instrument), sandglass]",Time Measurement,,,,,,,"[glass, sand, mounted, wood, timer]","overall: 140 mm 70 mm, 0.252 kg",,Ship's log-glass in wooden mount. 14 secs. Abb...,1.0,,,WHOLE,,eng,,,,"Desborough, Jane",,,,,,,,,,,,,12-MAR-96,27-MAY-20,https://collection.sciencemuseumgroup.org.uk/o...
3,20,Watch with Chinese duplex escapement,"[pocket watch, duplex watch]",Time Measurement,,,,,,,[nan],,,Watch with Chinese duplex escapement,1.0,,,WHOLE,,eng,,,,"Desborough, Jane",,,,,,,,,,,,,12-MAR-96,27-MAY-20,https://collection.sciencemuseumgroup.org.uk/o...
4,22,"""Ever Ready"" ceiling clock",[clocks],Time Measurement,,,,,,,[nan],"overall: 140 mm x 124 mm x 152 mm,",,"""Ever Ready"" ceiling clock",1.0,,,WHOLE,,eng,,,,"Desborough, Jane",,,,,,,,,RECORD ACTIVE IN ASSET PANDA – EDIT WITH CAUTION,,,,12-MAR-96,11-JUN-20,https://collection.sciencemuseumgroup.org.uk/o...


In [71]:
def create_object_disambiguating_description(row: pd.Series) -> str:
    # Create item_name (object type).
    # Also check that the name without a trailing 's' or 'es' is not already in the
    # description, which should cover the majority of plurals.
    # (str.rstrip("es") strips any run of trailing 'e'/'s' characters, so a regex is used.)
    item_name_raw = str(row.ITEM_NAME[0]).strip().lower()
    description_lower = row.DESCRIPTION.lower()
    name_variants = {item_name_raw, re.sub(r"s$", "", item_name_raw), re.sub(r"es$", "", item_name_raw)}
    if item_name_raw != "nan" and not any(v in description_lower for v in name_variants):
        item_name = f"{item_name_raw.capitalize()}."
    else:
        item_name = ""
    
    # create made in place and/or date 
    add_place_made = (str(row['PLACE_MADE']) != "nan") and (str(row['PLACE_MADE']).lower() not in row.DESCRIPTION.lower())
    add_date_made = (str(row['DATE_MADE']) != "nan") and (str(row['DATE_MADE']).lower() not in row.DESCRIPTION.lower())
    # Also check for dates minus suffixes, e.g. 200-250 should match with 200-250 AD and vice-versa
    # str() guards against non-string values (e.g. NaN or numeric years)
    date_match = re.search(r"\d+-?\d*", str(row['DATE_MADE']))
    if date_match:
        add_date_made = add_date_made and (date_match.group(0) not in row.DESCRIPTION.lower())
    
    if add_place_made and add_date_made:
        made_str = f"Made in {row.PLACE_MADE.strip()}, {row.DATE_MADE.strip()}."
    elif add_place_made:
        made_str = f"Made in {row.PLACE_MADE.strip()}."
    elif add_date_made:
        made_str = f"Made {row.DATE_MADE.strip()}."
    else:
        made_str = ""
    
    # add full stop (if needed) to end of description; endswith() is safe on empty strings
    description = row.DESCRIPTION.strip()
    if description and not description.endswith("."):
        description += "."
    
    # we shuffle the components of the description so any model using them does not learn the order that we put them in
    aug_description_components = [item_name, description, made_str]
    random.shuffle(aug_description_components)
    
    return (" ".join(aug_description_components)).strip()

for _, row in cat_df.sample(20).iterrows():
    print(create_object_disambiguating_description(row))
    print()

Folder "dyes and textiles: 1930." Papers giving figures and data used. Part of the Morton Collection.

Leslie's differential thermoscope on simple wooden stand.  Made 1851-1950.

Map, Stockton & Darlington Railway, system map, 1 inch to 10 miles, c. 1869, published by A McKay, printed by G W Hurd, paper.

Made in Birmingham, Borough of Birmingham, West Midlands, England, United Kingdom.  Fish knife, good condition, stamped with makers mark and BR (M) logo.

Made in London, Greater London, England, United Kingdom, 1841-1910. Urethral sound, narrow gauge cylindrical rod with tapering proximal curve and metal handle, by Montague, c. 1870.

Wooden cigar holder, cylinder form on a short stalk, screw on horn mouthpiece, Burstow's patent, English, 1870-1920.

Sample of patent prismatic rolled glass by Pilkington Brothers Ltd., St. Helens, England, 1892-1930 
 Sample of patent prismatic rolled glass by Pilkington Brothers Ltd., St. Helens, England, 1892-1930. Glass is fluid at high temperature
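
The date check used in the cell above (exact match first, then a match on the date with its suffix removed) can be isolated as a small predicate. This is a sketch only: `date_already_mentioned` is a hypothetical helper name, while the regex is the one used in the notebook.

```python
import re


def date_already_mentioned(date_value, description):
    """Return True if the date, or its leading numeric part, appears in the description."""
    date_str = str(date_value).strip().lower()
    text = description.lower()
    if date_str in text:
        return True
    # Strip suffixes such as "AD" by keeping only the first number or number range
    match = re.search(r"\d+-?\d*", date_str)
    return bool(match) and match.group(0) in text


print(date_already_mentioned("200-250 AD", "Roman coin, 200-250"))
print(date_already_mentioned("1994", "Plastic toy robot"))
```

Note that this is plain substring matching, so a range such as `1841-1910` is not considered mentioned by a description that only contains `c. 1870`, which is why both appear in one of the sample outputs above.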

### Organisations

In [77]:
# from smg_loader
def load_orgs_data(people_data_path):
    # identifier in field_mapping
    table_name = "ORGANISATION"

    org_df = pd.read_csv(people_data_path, low_memory=False, nrows=None)
    # TODO: use isIndividual flag here
    org_df = org_df[org_df["GENDER"] == "N"]

    # PREPROCESS
    org_df["URI"] = people_prefix + org_df["LINK_ID"].astype(str)

#     org_df["BIRTH_DATE"] = org_df["BIRTH_DATE"].apply(get_year_from_date_value)
#     org_df["DEATH_DATE"] = org_df["DEATH_DATE"].apply(get_year_from_date_value)

    org_df[["adlib_id", "adlib_DESCRIPTION", "DESCRIPTION", "NOTE"]] = org_df[
        ["adlib_id", "adlib_DESCRIPTION", "DESCRIPTION", "NOTE"]
    ].fillna("")
    org_df[["DESCRIPTION", "adlib_DESCRIPTION", "NOTE"]] = org_df[
        ["DESCRIPTION", "adlib_DESCRIPTION", "NOTE"]
    ].applymap(datastore_helpers.process_text)
    org_df[["OCCUPATION", "NATIONALITY"]] = org_df[
        ["OCCUPATION", "NATIONALITY"]
    ].applymap(datastore_helpers.split_list_string)

#     org_df["NATIONALITY"] = org_df["NATIONALITY"].apply(
#         lambda x: flatten_list_of_lists(
#             [datastore_helpers.get_country_from_nationality(i) for i in x]
#         )
#     )

    org_df["adlib_id"] = org_df["adlib_id"].apply(
        lambda i: [
            f"https://collection.sciencemuseumgroup.org.uk/people/{x}"
            for x in str(i).split(",")
        ]
        if i
        else ""
    )

    newline = " \n "  # can't insert into fstring below
    org_df.loc[:, "BIOGRAPHY"] = org_df[
        ["DESCRIPTION", "adlib_DESCRIPTION", "NOTE"]
    ].apply(lambda x: f"{newline.join(x)}" if any(x) else "", axis=1)
    
    return org_df

org_df = load_orgs_data(people_data_path)
org_df.head()

Unnamed: 0,LINK_ID,PREFERRED_NAME,TITLE_NAME,FIRSTMID_NAME,LASTSUFF_NAME,SUFFIX_NAME,HONORARY_SUFFIX,GENDER,BRIEF_BIO,DESCRIPTION,NOTE,BIRTH_DATE,BIRTH_PLACE,DEATH_DATE,DEATH_PLACE,CAUSE_OF_DEATH,NATIONALITY,OCCUPATION,WEBSITE,AFFILIATION,LINGUISTIC_GROUP,TYPE,REFERENCE_NUMBER,SOURCE,CREATE_DATE,UPDATE_DATE,adlib_id,adlib_ALIAS,adlib_DESCRIPTION,URI,BIOGRAPHY
0,10243,Brooklyn Arms Company,,,Brooklyn Arms Company,,,N,Y,,object record: 1987-1020,c. 1870,"Brooklyn, New York, New York state, United States",,,,[american],[manufacturer of mathematical instruments],,,,,,N,28-Jan-98,06-Nov-18,,,,https://collection.sciencemuseumgroup.org.uk/p...,\n \n object record: 1987-1020
6,1036,British Railways Board,,,British Railways Board,,,N,Y,Created by the 1962 Transport Act as a success...,REF: http://www.ndad.nationalarchives.gov.uk/A...,1962,United Kingdom,1996.0,United Kingdom,,[british],[railway board],,,,,,Y,08-Jun-96,06-Nov-18,,,,https://collection.sciencemuseumgroup.org.uk/p...,Created by the 1962 Transport Act as a success...
8,1040,A Clarkson and Company Limited,,,A Clarkson and Company Limited,,,N,Y,,1996-218 object record,,United Kingdom,,,,[british],[supplier],,,,,,N,10-Jun-96,06-Nov-18,,,,https://collection.sciencemuseumgroup.org.uk/p...,\n \n 1996-218 object record
12,10769,London School of Weaving,,,London School of Weaving,,,N,Y,1898 - Founded by Katie Grasett. 1932 - Londo...,,1898,"London, Greater London, England, United Kingdom",1970.0,,,[british],[training establishment],,,,,,Y,05-Feb-98,06-Nov-18,,,,https://collection.sciencemuseumgroup.org.uk/p...,1898 - Founded by Katie Grasett. 1932 - Londo...
14,1083,James G Biddle Company,,,James G Biddle Company,,,N,N,,,,United States,,,,[american],[manufacturer of electrical equipment],,,,,,N,14-Jun-96,02-Nov-15,,,,https://collection.sciencemuseumgroup.org.uk/p...,


In [95]:
def create_org_disambiguating_description(row: pd.Series) -> str:
    """
    Original description col = BIOGRAPHY.
    Components:
    - NATIONALITY + OCCUPATION -> 'British railway board.'
    - BIRTH_DATE + BIRTH_PLACE -> 'Founded in United Kingdom, 1962.'
    - DEATH_DATE + DEATH_PLACE -> 'Dissolved 1996.' (Place is only added if there is no overlap between
        the BIRTH_PLACE and DEATH_PLACE strings. Joined to the founded string above.)
    - BIOGRAPHY (original description)
    """
    
    # NATIONALITY + OCCUPATION (only uses first of each)
    nationality = str(row['NATIONALITY'][0])
    occupation = str(row['OCCUPATION'][0])
    add_nationality = (nationality != "nan") and (nationality.lower() not in row.BIOGRAPHY.lower())
    add_occupation = (occupation != "nan") and (occupation.lower() not in row.BIOGRAPHY.lower())
    
    if add_nationality and add_occupation:
        nationality_occupation_str = f"{nationality.strip().title()} {occupation.strip()}."
    elif add_nationality:
        nationality_occupation_str = f"{nationality.strip().title()}."
    elif add_occupation:
        nationality_occupation_str = f"{occupation.strip().capitalize()}."
    else:
        nationality_occupation_str = ""
        
    # BIRTH_PLACE + BIRTH_DATE
    add_birth_place = (str(row['BIRTH_PLACE']) != "nan") and (str(row['BIRTH_PLACE']).lower() not in row.BIOGRAPHY.lower())
    add_birth_date = (str(row['BIRTH_DATE']) != "nan") and (str(row['BIRTH_DATE']).lower() not in row.BIOGRAPHY.lower())
    # Also check for dates minus suffixes, e.g. 200-250 should match with 200-250 AD and vice-versa
    # str() guards against non-string values (e.g. NaN or numeric years)
    birth_date_match = re.search(r"\d+-?\d*", str(row['BIRTH_DATE']))
    if birth_date_match:
        add_birth_date = add_birth_date and (birth_date_match.group(0) not in row.BIOGRAPHY.lower())
    
    if add_birth_place and add_birth_date:
        founded_str = f"Founded in {row.BIRTH_PLACE.strip()}, {row.BIRTH_DATE.strip()}."
    elif add_birth_place:
        founded_str = f"Founded in {row.BIRTH_PLACE.strip()}."
    elif add_birth_date:
        founded_str = f"Founded {row.BIRTH_DATE.strip()}."
    else:
        founded_str = ""
        
    # DEATH_PLACE + DEATH_DATE
    add_death_place = (str(row['DEATH_PLACE']) != "nan") and (str(row['DEATH_PLACE']).lower() not in row.BIOGRAPHY.lower()) and \
    (str(row['DEATH_PLACE']) not in str(row['BIRTH_PLACE'])) and (str(row['BIRTH_PLACE']) not in str(row['DEATH_PLACE']))
    add_death_date = (str(row['DEATH_DATE']) != "nan") and (str(row['DEATH_DATE']).lower() not in row.BIOGRAPHY.lower())
    # Also check for dates minus suffixes, e.g. 200-250 should match with 200-250 AD and vice-versa
    # str() guards against non-string values (e.g. NaN or numeric years)
    death_date_match = re.search(r"\d+-?\d*", str(row['DEATH_DATE']))
    if death_date_match:
        add_death_date = add_death_date and (death_date_match.group(0) not in row.BIOGRAPHY.lower())
    
    if add_death_place and add_death_date:
        dissolved_str = f"Dissolved in {row.DEATH_PLACE.strip()}, {row.DEATH_DATE.strip()}."
    elif add_death_place:
        dissolved_str = f"Dissolved in {row.DEATH_PLACE.strip()}."
    elif add_death_date:
        dissolved_str = f"Dissolved {row.DEATH_DATE.strip()}."
    else:
        dissolved_str = ""
    
    # Assemble 
    dates_str = " ".join([founded_str, dissolved_str]).strip()
    
    # add space and full stop (if needed) to end of description
    if row.BIOGRAPHY:
        description = (row.BIOGRAPHY.strip() if row.BIOGRAPHY.strip()[-1] == "." else f"{row.BIOGRAPHY.strip()}.")
    else:
        description = ""
    
    # we shuffle the components of the description so any model using them does not learn the order that we put them in
    aug_description_components = [nationality_occupation_str, description, dates_str]
    random.shuffle(aug_description_components)
    
    return (" ".join(aug_description_components)).strip()

for _, row in org_df.sample(20).iterrows():
    print(create_org_disambiguating_description(row))
    print("--")

1988-550/33; 34; 40; 41; 42; 43; 56; 59: - Object Description 
 Chicago-based publisher of educational materials, founded in 1938. The company was purchased by IBM in 1964 and continued to operate as a subsidary. It was again purchased by Maxwell Communications Corporation in 1988, later becoming part of McGraw-Hill in 1989.
--
British manufacturer.  SOURCE:  1972-141/1231 
  
 No information on maker found on radiomuseum.org, or in internet searches. The valves may be made by another maker and Vacuum Science Products may be the vendor under licence, as was common with for instance RCA Corporation valves.
--
Manufacturer of pharamceuticals.  http://www.pfizer.com/about/leadership_and_structure/company_fact_sheet.jsp http://en.wikipedia.org/wiki/Pfizer 
  
 Established in 1849 by cousins Charles Pfizer and Charles Erhart in Brooklyn, New York; developed products through intensive pharmaceutical research during the second half of the 20th century; acquired Warner-Lambert (2000) and Pharm
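
The 'Made', 'Founded' and 'Dissolved' branches all follow the same place/date pattern, so they could share one helper. The sketch below assumes the caller has already decided which values to include (passing an empty string for anything to be left out); `make_event_sentence` is a hypothetical name and not part of `heritageconnector`.

```python
def make_event_sentence(verb, place="", date=""):
    """Build e.g. 'Founded in United Kingdom, 1962.' from whichever parts are present."""
    place, date = str(place).strip(), str(date).strip()
    if place and date:
        return f"{verb} in {place}, {date}."
    if place:
        return f"{verb} in {place}."
    if date:
        return f"{verb} {date}."
    return ""


print(make_event_sentence("Founded", "United Kingdom", "1962"))
print(make_event_sentence("Dissolved", date="1996"))
```

Calling `str()` on the inputs also means numeric dates such as `1996.0` do not raise on `.strip()`.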

### People

In [101]:
# from smg_loader
def load_people_data(people_data_path):
    """Load data from CSV files """

    def reverse_preferred_name(name: str) -> str:
        if not pd.isnull(name) and len(name.split(",")) == 2:
            return f"{name.split(',')[1].strip()} {name.split(',')[0].strip()}"
        else:
            return name

    # identifier in field_mapping
    table_name = "PERSON"

    people_df = pd.read_csv(people_data_path, low_memory=False)
    # TODO: use isIndividual flag here
    people_df = people_df[people_df["GENDER"].isin(["M", "F"])]

    # PREPROCESS
    people_df["URI"] = people_prefix + people_df["LINK_ID"].astype(str)
    # remove punctuation and capitalise first letter
    people_df["TITLE_NAME"] = people_df["TITLE_NAME"].apply(
        lambda i: str(i)
        .capitalize()
        .translate(str.maketrans("", "", string.punctuation))
    )
    people_df["PREFERRED_NAME"] = people_df["PREFERRED_NAME"].apply(
        reverse_preferred_name
    )
#     people_df["BIRTH_DATE"] = people_df["BIRTH_DATE"].apply(get_year_from_date_value)
#     people_df["DEATH_DATE"] = people_df["DEATH_DATE"].apply(get_year_from_date_value)
    people_df["OCCUPATION"] = people_df["OCCUPATION"].apply(
        datastore_helpers.split_list_string
    )
    people_df["NATIONALITY"] = people_df["NATIONALITY"].apply(
        datastore_helpers.split_list_string
    )
#     people_df["NATIONALITY"] = people_df["NATIONALITY"].apply(
#         lambda x: flatten_list_of_lists(
#             [datastore_helpers.get_country_from_nationality(i) for i in x]
#         )
#     )

#     people_df["BIRTH_PLACE"] = people_df["BIRTH_PLACE"].apply(
#         lambda i: get_wikidata_uri_from_placename(i, False, placename_qid_mapping)
#     )
#     people_df["DEATH_PLACE"] = people_df["DEATH_PLACE"].apply(
#         lambda i: get_wikidata_uri_from_placename(i, False, placename_qid_mapping)
#     )
    people_df[["adlib_id", "adlib_DESCRIPTION", "DESCRIPTION", "NOTE"]] = people_df[
        ["adlib_id", "adlib_DESCRIPTION", "DESCRIPTION", "NOTE"]
    ].fillna("")
    people_df["adlib_id"] = people_df["adlib_id"].apply(
        lambda i: [
            f"https://collection.sciencemuseumgroup.org.uk/people/{x}"
            for x in str(i).split(",")
        ]
        if i
        else ""
    )
    # remove newlines and tab chars
    people_df.loc[:, ["DESCRIPTION", "adlib_DESCRIPTION", "NOTE"]] = people_df.loc[
        :, ["DESCRIPTION", "adlib_DESCRIPTION", "NOTE"]
    ].applymap(datastore_helpers.process_text)

    # create combined text fields
    newline = " \n "  # can't insert into fstring below
    people_df.loc[:, "BIOGRAPHY"] = people_df[
        ["DESCRIPTION", "adlib_DESCRIPTION", "NOTE"]
    ].apply(lambda x: f"{newline.join(x)}" if any(x) else "", axis=1)

#     people_df.loc[:, "GENDER"] = people_df.loc[:, "GENDER"].replace(
#         {"F": WD.Q6581072, "M": WD.Q6581097}
#     )
    
    return people_df

people_df = load_people_data(people_data_path)
people_df.head()

Unnamed: 0,LINK_ID,PREFERRED_NAME,TITLE_NAME,FIRSTMID_NAME,LASTSUFF_NAME,SUFFIX_NAME,HONORARY_SUFFIX,GENDER,BRIEF_BIO,DESCRIPTION,NOTE,BIRTH_DATE,BIRTH_PLACE,DEATH_DATE,DEATH_PLACE,CAUSE_OF_DEATH,NATIONALITY,OCCUPATION,WEBSITE,AFFILIATION,LINGUISTIC_GROUP,TYPE,REFERENCE_NUMBER,SOURCE,CREATE_DATE,UPDATE_DATE,adlib_id,adlib_ALIAS,adlib_DESCRIPTION,URI,BIOGRAPHY
1,10245,Edward Rupert Zenthon,Nan,Edward Rupert,Zenthon,,,M,Y,REF: http://www.iwm.org.uk/collections/item/ob...,,1920-07,"London, Greater London, England, United Kingdom",c. 2002,,,[british],[engineer],,,,,,N,28-Jan-98,05-Aug-15,,,,https://collection.sciencemuseumgroup.org.uk/p...,REF: http://www.iwm.org.uk/collections/item/ob...
2,10269,John Troughton,Nan,John,Troughton,,,M,Y,"1739 - Born in Corney, Cumbria, England; Appre...","ODNB: Anita McConnell, ‘Troughton, Edward (175...",1739,"Broughton in Furness, Cumbria, England, United...",1807,"London, Greater London, England, United Kingdom",,"[english, british]",[mathematical instrument maker],,,,,,N,28-Jan-98,06-Nov-18,,,,https://collection.sciencemuseumgroup.org.uk/p...,"1739 - Born in Corney, Cumbria, England; Appre..."
3,1027,O Winston Link,Nan,O Winston,Link,,,M,Y,,WIKI: http://en.wikipedia.org/wiki/O._Winston_...,16/12/1914,"Brooklyn, New York city, New York state, Unite...",30/01/2001,"South Salem, Westchester county, New York stat...",heart attack,[american],[photographer],,,,,,N,08-Jun-96,07-Nov-19,,,,https://collection.sciencemuseumgroup.org.uk/p...,\n \n WIKI: http://en.wikipedia.org/wiki/O._...
4,1030,Stanley V Walton,Nan,Stanley V,Walton,,,M,N,,object record: 1996-7033,,,,,,[british],[railway photographer],,,,,,N,08-Jun-96,06-Nov-18,,,,https://collection.sciencemuseumgroup.org.uk/p...,\n \n object record: 1996-7033
5,10343,Archibald Turner,Nan,Archibald,Turner,,,M,Y,about 1840 - moved to Leicester by 1846 - est...,TNA: http://discovery.nationalarchives.gov.uk/...,,United Kingdom,1876,,,[british],"[manufacturer of fancy hosiery, inventor]",,,,,,N,29-Jan-98,06-Nov-18,,,,https://collection.sciencemuseumgroup.org.uk/p...,about 1840 - moved to Leicester by 1846 - est...


In [114]:
def create_people_disambiguating_description(row: pd.Series) -> str:
    """
    Original description col = BIOGRAPHY.
    Components:
    - NATIONALITY + OCCUPATION -> 'American photographer.'
    - BIRTH_DATE + BIRTH_PLACE -> 'Born in United Kingdom, 1962.'
    - DEATH_DATE + DEATH_PLACE + CAUSE_OF_DEATH -> 'Died 1996 of heart attack.' (Place is only added if there is
        no overlap between the BIRTH_PLACE and DEATH_PLACE strings. Joined to the born string above.)
    - BIOGRAPHY (original description)
    """
    
    # NATIONALITY + OCCUPATION (only uses first of each)
    nationality = str(row['NATIONALITY'][0])
    occupation = str(row['OCCUPATION'][0])
    add_nationality = (nationality != "nan") and (nationality.lower() not in row.BIOGRAPHY.lower())
    add_occupation = (occupation != "nan") and (occupation.lower() not in row.BIOGRAPHY.lower())
    
    if add_nationality and add_occupation:
        nationality_occupation_str = f"{nationality.strip().title()} {occupation.strip()}."
    elif add_nationality:
        nationality_occupation_str = f"{nationality.strip().title()}."
    elif add_occupation:
        nationality_occupation_str = f"{occupation.strip().capitalize()}."
    else:
        nationality_occupation_str = ""
        
    # BIRTH_PLACE + BIRTH_DATE
    add_birth_place = (str(row['BIRTH_PLACE']) != "nan") and (str(row['BIRTH_PLACE']).lower() not in row.BIOGRAPHY.lower())
    add_birth_date = (str(row['BIRTH_DATE']) != "nan") and (str(row['BIRTH_DATE']).lower() not in row.BIOGRAPHY.lower())
    
    # Also check for dates minus suffixes, e.g. 200-250 should match with 200-250 AD and vice-versa
    # str() guards against non-string values (e.g. NaN or numeric years)
    birth_date_match = re.search(r"\d+-?\d*", str(row['BIRTH_DATE']))
    if birth_date_match:
        add_birth_date = add_birth_date and (birth_date_match.group(0) not in row.BIOGRAPHY.lower())
    
    if add_birth_place and add_birth_date:
        founded_str = f"Born in {row.BIRTH_PLACE.strip()}, {row.BIRTH_DATE.strip()}."
    elif add_birth_place:
        founded_str = f"Born in {row.BIRTH_PLACE.strip()}."
    elif add_birth_date:
        founded_str = f"Born {row.BIRTH_DATE.strip()}."
    else:
        founded_str = ""
        
    # DEATH_PLACE + DEATH_DATE
    add_death_place = (str(row['DEATH_PLACE']) != "nan") and (str(row['DEATH_PLACE']).lower() not in row.BIOGRAPHY.lower()) and \
    (str(row['DEATH_PLACE']) not in str(row['BIRTH_PLACE'])) and (str(row['BIRTH_PLACE']) not in str(row['DEATH_PLACE']))
    add_death_date = (str(row['DEATH_DATE']) != "nan") and (str(row['DEATH_DATE']).lower() not in row.BIOGRAPHY.lower())
    # Also check for dates minus suffixes, e.g. 200-250 should match with 200-250 AD and vice-versa
    # str() guards against non-string values (e.g. NaN or numeric years)
    death_date_match = re.search(r"\d+-?\d*", str(row['DEATH_DATE']))
    if death_date_match:
        add_death_date = add_death_date and (death_date_match.group(0) not in row.BIOGRAPHY.lower())
    
    cause_of_death = str(row['CAUSE_OF_DEATH']).strip()
    add_cause_of_death = (cause_of_death != "nan") and (cause_of_death.lower() not in row.BIOGRAPHY.lower())
    if cause_of_death.startswith("illness (") and cause_of_death.endswith(")"):
        cause_of_death = cause_of_death.split("(")[1][0:-1]
    
    if add_death_place and add_death_date:
        dissolved_str = f"Died in {row.DEATH_PLACE.strip()}, {row.DEATH_DATE.strip()}."
    elif add_death_place:
        dissolved_str = f"Died in {row.DEATH_PLACE.strip()}."
    elif add_death_date:
        dissolved_str = f"Died {row.DEATH_DATE.strip()}."
    else:
        dissolved_str = ""
        
    # use the cleaned cause_of_death (with any 'illness (...)' wrapper removed), not the raw column value
    if add_cause_of_death and (add_death_date or add_death_place):
        dissolved_str = dissolved_str[0:-1] + " of " + cause_of_death.lower() + "."
    elif add_cause_of_death:
        dissolved_str += f"Cause of death was {cause_of_death.lower()}."
    
    # Assemble 
    dates_str = " ".join([founded_str, dissolved_str]).strip()
    
    # add space and full stop (if needed) to end of description
    if row.BIOGRAPHY:
        description = (row.BIOGRAPHY.strip() if row.BIOGRAPHY.strip()[-1] == "." else f"{row.BIOGRAPHY.strip()}.")
    else:
        description = ""
    
    # we shuffle the components of the description so any model using them does not learn the order that we put them in
    aug_description_components = [nationality_occupation_str, description, dates_str]
    random.shuffle(aug_description_components)
    
    return (" ".join(aug_description_components)).strip()

for _, row in people_df.sample(20).iterrows():
    print(row.PREFERRED_NAME.upper())
    print(row.BIOGRAPHY.strip())
    print("-")
    print(create_people_disambiguating_description(row))
    print("--")

M ABEL
object 1977-381 
  
 issued token in Bungay, Suffolk
-
British token issuer.  object 1977-381 
  
 issued token in Bungay, Suffolk.
--
CASPAR VOPEL
Source: www.mhs.ox.ac.uk/epact/maker.php?MakerID=31
-
Source: www.mhs.ox.ac.uk/epact/maker.php?MakerID=31. Born in Medebach, Germany, 1511. Died 1561. German cartographer.
--
JULIE EASLEY

-

--
WILLIAM MACKENZIE
ODNB: Mike Chrimes, ‘Mackenzie, William (1794–1851)’, Oxford Dictionary of National Biography, Oxford University Press, Sept 2004; online edn, May 2009 http://www.oxforddnb.com/view/article/50205, accessed 8 June 2009 
  
 railway contractor
-
ODNB: Mike Chrimes, ‘Mackenzie, William (1794–1851)’, Oxford Dictionary of National Biography, Oxford University Press, Sept 2004; online edn, May 2009 http://www.oxforddnb.com/view/article/50205, accessed 8 June 2009 
  
 railway contractor. Born in Nelson, Lancashire, England, United Kingdom, 1794-03-20. Died in Liverpool, Liverpool, Merseyside, England, United Kingdom, 1851-10-29. B