# Disambiguation Pipeline: People

In [31]:
%load_ext autoreload
%autoreload 2

import sys
sys.path.append("../..")
import os

from heritageconnector.config import config

from tqdm import tqdm
import pandas as pd
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_columns', None)



## 1. import data from record matching step
At this point we have: 
- converted any URLs in our records to Wikidata references: `entity_matching.lookup`
- filtered Wikidata references for each record down to the exact person using their name, date of birth and date of death: `entity_matching.filtering`
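
As a rough illustration of the date part of that filtering step, the sketch below keeps only the candidates whose birth and death years agree with a record. It is not the implementation in `entity_matching.filtering`: `extract_year`, `filter_candidates` and the two candidate entries are hypothetical, and name matching is left out for brevity.

```python
import re

def extract_year(value):
    """Return the first 4-digit year found in values such as '1920-07' or 'c. 2002', else None."""
    match = re.search(r"\d{4}", str(value))
    return int(match.group()) if match else None

def filter_candidates(record, candidates):
    """Keep candidate ids whose birth and death years do not contradict the record.

    A year that is missing on either side does not exclude a candidate.
    """
    kept = []
    for cand in candidates:
        consistent = True
        for field in ("birth", "death"):
            rec_year = extract_year(record.get(field))
            cand_year = cand.get(field)
            if rec_year is not None and cand_year is not None and rec_year != cand_year:
                consistent = False
        if consistent:
            kept.append(cand["id"])
    return kept

# record dates taken from the John Troughton row; the candidates are invented for illustration
record = {"name": "John Troughton", "birth": "1739", "death": "1807"}
candidates = [
    {"id": "candidate_a", "birth": 1739, "death": 1807},
    {"id": "candidate_b", "birth": 1753, "death": 1835},
]
print(filter_candidates(record, candidates))
```

Treating a missing year as "no evidence" rather than a mismatch is a design choice of this sketch; a stricter filter could require both dates to be present.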

In [7]:
df = pd.read_pickle("../../GITIGNORE_DATA/results/filtering_people_orgs_result.pkl")
df_people = df[df['GENDER'].isin(["M", "F"])].copy()  # explicit copy avoids SettingWithCopyWarning on later assignments
df_people['JOINED_NAME'] = df_people['FIRSTMID_NAME'] + " " + df_people['LASTSUFF_NAME']

df_people.head(2)


Unnamed: 0,LINK_ID,PREFERRED_NAME,TITLE_NAME,FIRSTMID_NAME,LASTSUFF_NAME,SUFFIX_NAME,HONORARY_SUFFIX,GENDER,BRIEF_BIO,DESCRIPTION,NOTE,BIRTH_DATE,BIRTH_PLACE,DEATH_DATE,DEATH_PLACE,CAUSE_OF_DEATH,NATIONALITY,OCCUPATION,WEBSITE,AFFILIATION,LINGUISTIC_GROUP,TYPE,REFERENCE_NUMBER,SOURCE,CREATE_DATE,UPDATE_DATE,res_ALL_NOTES,res_WIKIDATA_IDs,res_URLS,qcodes_filtered,JOINED_NAME
1,10245,"Zenthon, Edward Rupert",,Edward Rupert,Zenthon,,,M,Y,REF: http://www.iwm.org.uk/collections/item/object/1030031461,,1920-07,"London, Greater London, England, United Kingdom",c. 2002,,,British,engineer,,,,,,N,28-JAN-98,05-AUG-15,REF: http://www.iwm.org.uk/collections/item/object/1030031461 --- nan,[],[http://www.iwm.org.uk/collections/item/object/1030031461],[],Edward Rupert Zenthon
2,10269,"Troughton, John",,John,Troughton,,,M,Y,"1739 - Born in Corney, Cumbria, England; Apprenticed to his Uncle John Troughton \n1764 - traded at Surrey St., Strand, London \n1768-71 - traded at Crown Court, Fleet St., London\n1771-78 - traded at 17 Dean St., Fetter Lane, London \n1778-82 - traded at 1 Queen's Sq., Bartholomew Close, London \n1782 - purchased the business of Benjamin Cole \n1782-1788 - traded at the sign of the Orrery, 136 Fleet St, London, England. \n1788-1804 - in partnership as J & E Troughton, with brother Edward Troughton (1756-1835)","ODNB: Anita McConnell, ‘Troughton, Edward (1753–1835)’, Oxford Dictionary of National Biography, Oxford University Press, 2004; online edn, May 2005 [http://www.oxforddnb.com/view/article/27767]\nREF: A. McConnell, Instrument makers to the world: a history of Cooke, Troughton & Simms (1992) · A. W. Skempton and J. Brown, ‘John and Edward Troughton’, Notes and Records of the Royal Society, 27 (1972–3), 233–62",1739,"Broughton in Furness, Cumbria, England, United Kingdom",1807,"London, Greater London, England, United Kingdom",,English; British,mathematical instrument maker,,,,,,N,28-JAN-98,06-NOV-18,"1739 - Born in Corney, Cumbria, England; Apprenticed to his Uncle John Troughton \n1764 - traded at Surrey St., Strand, London \n1768-71 - traded at Crown Court, Fleet St., London\n1771-78 - traded at 17 Dean St., Fetter Lane, London \n1778-82 - traded at 1 Queen's Sq., Bartholomew Close, London \n1782 - purchased the business of Benjamin Cole \n1782-1788 - traded at the sign of the Orrery, 136 Fleet St, London, England. \n1788-1804 - in partnership as J & E Troughton, with brother Edward Troughton (1756-1835) --- ODNB: Anita McConnell, ‘Troughton, Edward (1753–1835)’, Oxford Dictionary of National Biography, Oxford University Press, 2004; online edn, May 2005 [http://www.oxforddnb.com/view/article/27767]\nREF: A. McConnell, Instrument makers to the world: a history of Cooke, Troughton & Simms (1992) · A. W. Skempton and J. Brown, ‘John and Edward Troughton’, Notes and Records of the Royal Society, 27 (1972–3), 233–62",[Q1293897],[http://www.oxforddnb.com/view/article/27767],[],John Troughton


In [8]:
# fraction of people with exactly one remaining Wikidata candidate
perc_matched_from_lookup = (df_people['qcodes_filtered'].apply(len) == 1).mean()

print(f"{round(100*perc_matched_from_lookup, 1)}% of {len(df_people)} people have been matched unambiguously from the URL lookup step")

28.2% of 10352 people have been matched unambiguously from the URL lookup step
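
"Matched unambiguously" here means that exactly one candidate is left in `qcodes_filtered`. A minimal sketch of how the full breakdown (no candidate, one candidate, several candidates) could be inspected is shown below; the toy frame and its ID strings are placeholders made up for illustration, not values from the dataset.

```python
import pandas as pd

# toy stand-in for df_people; the ID strings are placeholders
toy = pd.DataFrame({"qcodes_filtered": [[], ["id_a"], ["id_b", "id_c"], ["id_d"], []]})

# number of remaining candidates per record
n_candidates = toy["qcodes_filtered"].apply(len)

# how many records have 0, 1, 2, ... candidates
summary = n_candidates.value_counts().sort_index()

# share of records with exactly one candidate
share_unambiguous = (n_candidates == 1).mean()

print(summary)
print(share_unambiguous)
```

Records with zero candidates are the ones passed on to the search step; records with more than one would still need disambiguating.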


## 2. disambiguation

### 2.1 resolve categorical fields to Wikidata entities
Resolving categorical fields to Wikidata entities lets us narrow down the searches that provide candidates for the disambiguator. For people, we only look up **OCCUPATION** for now. Other fields that could be resolved in the same way are GENDER, BIRTH_PLACE and DEATH_PLACE.
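
Before reconciliation, a semicolon-separated field has to be turned into a list of values. The sketch below shows one way this could work in plain pandas; `split_to_list` is a hypothetical stand-in written for this example and may behave differently from `transform_series_str_to_list`.

```python
import pandas as pd

def split_to_list(series: pd.Series, separator: str = ";") -> pd.Series:
    """Split each string on `separator`, strip whitespace and drop empty parts.

    Hypothetical helper: missing or non-string values become an empty list.
    """
    def _split(value):
        if not isinstance(value, str):
            return []
        return [part.strip() for part in value.split(separator) if part.strip()]

    return series.apply(_split)

occupations = pd.Series(["amateur photographer; stockbroker", "inventor", None])
occupation_lists = split_to_list(occupations)
print(occupation_lists.tolist())
```

Each list element can then be looked up separately, which is why the reconciliation step needs to accept multiple values per record.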

In [9]:
from heritageconnector.utils.data_transformation import transform_series_str_to_list
from heritageconnector.entity_matching.reconciler import reconciler

In [5]:
# create list column from OCCUPATION
df_people['OCCUPATION'] = transform_series_str_to_list(df_people['OCCUPATION'], separator=";")

# reconcile list column into list of qcodes
rec = reconciler(df_people, table="PEOPLE")
df_people["OCCUPATION_qcodes"] = rec.process_column("OCCUPATION", multiple_vals=True)

df_people.to_pickle("../../GITIGNORE_DATA/results/people_occupations_reconciled.pkl")

  0%|          | 0/2216 [00:00<?, ?it/s]

Looking up Wikidata qcodes for unique items..


  6%|▌         | 137/2216 [01:24<21:27,  1.61it/s]


KeyboardInterrupt: 

In [11]:
# otherwise, load the cached result by uncommenting the next line
# df_people = pd.read_pickle("../../GITIGNORE_DATA/results/people_occupations_reconciled.pkl")
df_people.loc[:, 'JOINED_NAME'] = df_people['FIRSTMID_NAME'] + " " + df_people['LASTSUFF_NAME'] # to remove once 2.1 is rerun

### 2.2 search - records for which we have at least one Wikidata ID for Occupations

In [13]:
from heritageconnector.disambiguation import search, retrieve

In [14]:
df_people_resolved = df_people[df_people["OCCUPATION_qcodes"].apply(len) > 0]

perc_resolved = len(df_people_resolved) / len(df_people)
num_without_match = len(set(df_people_resolved.index) - set(df_people[df_people['qcodes_filtered'].apply(len) > 0].index))

print(f"{round(perc_resolved*100, 2)}% of people have resolved occupations.")
print(f"This is {round(num_without_match/len(df_people)*100, 2)}%, excluding those who already have a match")

80.9% of people have resolved occupations.
This is 54.8%, excluding those who already have a match


#### Run search

In [35]:
len(df_people_resolved[df_people_resolved['qcodes_filtered'].apply(len) == 0])

5673

In [42]:
df_people_resolved = df_people_resolved.assign(search_results="")  # returns a new frame, so no SettingWithCopyWarning
df_search = df_people_resolved[df_people_resolved['qcodes_filtered'].apply(len) == 0].head(100).copy()

for idx, row in tqdm(df_search.iterrows(), total=df_search.shape[0]):
    # filter candidates to humans (Q5) using the reconciled occupation qcodes (P106)
    res = search.run(
        text=row["JOINED_NAME"], topn=3, limit=10,
        instanceof_filter="Q5", property_filters={"P106": row['OCCUPATION_qcodes']},
    ).index.values.tolist()
    df_search.at[idx, "search_results"] = res

100%|██████████| 100/100 [17:05<00:00, 10.26s/it]


In [43]:
df_search[df_search['search_results'].apply(len) > 0]

Unnamed: 0,LINK_ID,PREFERRED_NAME,TITLE_NAME,FIRSTMID_NAME,LASTSUFF_NAME,SUFFIX_NAME,HONORARY_SUFFIX,GENDER,BRIEF_BIO,DESCRIPTION,NOTE,BIRTH_DATE,BIRTH_PLACE,DEATH_DATE,DEATH_PLACE,CAUSE_OF_DEATH,NATIONALITY,OCCUPATION,WEBSITE,AFFILIATION,LINGUISTIC_GROUP,TYPE,REFERENCE_NUMBER,SOURCE,CREATE_DATE,UPDATE_DATE,res_ALL_NOTES,res_WIKIDATA_IDs,res_URLS,qcodes_filtered,OCCUPATION_list,OCCUPATION_qcodes,JOINED_NAME,search_results
31,103617,"Job, Charles",,Charles,Job,,,M,Y,RPS obituary.\nhttp://www.luminous-lint.com/app/photographer/Charles__Job/A/\n2010\nhttp://www.old-church-galleries.com/s_1600.asp,"Charles Job, a stockbroker by profession, was a keen amateur photographer, elected member of the Royal Photographic Society in 1893 and member of the Linked Ring Brotherhood. Worked at the Censor's Office in Liverpool during the first world war before returning to his native Sussex. Awarded Hon. FRPS in 1928. Noted for his landscape photographs.",1853,"Sussex, England, United Kingdom",1930,,,British,"[amateur photographer, stockbroker]",,,,,,N,06-JAN-05,30-OCT-12,"RPS obituary.\nhttp://www.luminous-lint.com/app/photographer/Charles__Job/A/\n2010\nhttp://www.old-church-galleries.com/s_1600.asp --- Charles Job, a stockbroker by profession, was a keen amateur photographer, elected member of the Royal Photographic Society in 1893 and member of the Linked Ring Brotherhood. Worked at the Censor's Office in Liverpool during the first world war before returning to his native Sussex. Awarded Hon. FRPS in 1928. Noted for his landscape photographs.",[],"[http://www.luminous-lint.com/app/photographer/Charles__Job/A/, http://www.old-church-galleries.com/s_1600.asp]",[],"[amateur photographer, stockbroker]","[Q21694268, Q4182927]",Charles Job,[http://www.wikidata.org/entity/Q5075774]
33,10165,"Melville, Alexander",,Alexander,Melville,,,M,Y,1878 - The London Gazette records Alexander Melville: 'improvements in machinery for producing ornamental designs on woollen fabrics'.,THE LONDON GAZETTE: https://www.thegazette.co.uk/London/issue/24540/page/135/data.pdf,,,1971,,,British,[inventor],,,,,,N,28-JAN-98,06-JUN-18,1878 - The London Gazette records Alexander Melville: 'improvements in machinery for producing ornamental designs on woollen fabrics'. --- THE LONDON GAZETTE: https://www.thegazette.co.uk/London/issue/24540/page/135/data.pdf,[],[https://www.thegazette.co.uk/London/issue/24540/page/135/data.pdf],[],[inventor],[Q205375],Alexander Melville,[http://www.wikidata.org/entity/Q34286]
40,11209,"Garratt, Colin",,Colin,Garratt,,,M,N,http://www.milepost92-half.co.uk/ColinGarratt.htm,Colin Garratt is a photographer and author on steam railways.,,,,,,British,"[railway photographer, author]",,,,,,Y,11-FEB-98,31-AUG-17,http://www.milepost92-half.co.uk/ColinGarratt.htm --- Colin Garratt is a photographer and author on steam railways.,[],[http://www.milepost92-half.co.uk/ColinGarratt.htm],[],"[railway photographer, author]","[Q482980, Q15296811, Q36180]",Colin Garratt,[http://www.wikidata.org/entity/Q75727490]
56,11808,"Jaeger, Eduard",,Eduard,Jaeger,,,M,Y,REF: http://fm.iowa.uiowa.edu/fmi/xsl/hardin/heirs/record_detail.xsl?-db=heirs&-lay=weblayout&HeirsNo=1977&-find\nWIKI: http://en.wikipedia.org/wiki/Eduard_J%C3%A4ger_von_Jaxtthal\nREF: http://www.aeiou.at/aeiou.encyclop.j/j072115.htm;internal&action=_setlanguage.action?LANGUAGE=en\nREF: http://beckerexhibits.wustl.edu/becker/records250.htm,"professor at the university of Vienna; used the ophthalmoscope for the determination of refractivity; made improvements to eye chart test types that were developed by Heinrich Kuechler; Ritter von Jaxtthal; published from 1844,",1818-06-25,"Vienna, Wien, Austria",1884-07-05,"Vienna, Wien, Austria",,Austrian,[ophthalmologist],,,,,,N,18-FEB-98,01-MAR-16,"REF: http://fm.iowa.uiowa.edu/fmi/xsl/hardin/heirs/record_detail.xsl?-db=heirs&-lay=weblayout&HeirsNo=1977&-find\nWIKI: http://en.wikipedia.org/wiki/Eduard_J%C3%A4ger_von_Jaxtthal\nREF: http://www.aeiou.at/aeiou.encyclop.j/j072115.htm;internal&action=_setlanguage.action?LANGUAGE=en\nREF: http://beckerexhibits.wustl.edu/becker/records250.htm --- professor at the university of Vienna; used the ophthalmoscope for the determination of refractivity; made improvements to eye chart test types that were developed by Heinrich Kuechler; Ritter von Jaxtthal; published from 1844,",[Q328536],"[http://fm.iowa.uiowa.edu/fmi/xsl/hardin/heirs/record_detail.xsl?-db=heirs&-lay=weblayout&HeirsNo=1977&-find, http://en.wikipedia.org/wiki/Eduard_J%C3%A4ger_von_Jaxtthal, http://www.aeiou.at/aeiou.encyclop.j/j072115.htm, http://beckerexhibits.wustl.edu/becker/records250.htm]",[],[ophthalmologist],[Q12013238],Eduard Jaeger,[http://www.wikidata.org/entity/Q328536]
83,1230,"Porter, Charles Talbot",,Charles Talbot,Porter,,,M,Y,JSTOR: http://www.jstor.org/pss/3103435,,1826,"Auburn, Cayuga, New York state, United States",1910,,,American,[engineer],,,,,,N,26-JUN-96,01-JUN-10,JSTOR: http://www.jstor.org/pss/3103435 --- nan,[],[http://www.jstor.org/pss/3103435],[],[engineer],"[Q151197, Q81096]",Charles Talbot Porter,[http://www.wikidata.org/entity/Q959123]
115,13390,"Johnson, Henry",,Henry,Johnson,,,M,Y,"Inv. nos. 1910-165 and 1980-1175 - Object descriptions;\nT/1980-1175: - \nDr. Booth 1859, 'A Description of the Volutor: An Instrument for Describing Spirals and Volutes'. London: Judd & Glass.",,,,,,,British,[inventor],,,,,,N,06-MAR-98,27-FEB-20,"Inv. nos. 1910-165 and 1980-1175 - Object descriptions;\nT/1980-1175: - \nDr. Booth 1859, 'A Description of the Volutor: An Instrument for Describing Spirals and Volutes'. London: Judd & Glass. --- nan",[],[],[],[inventor],[Q205375],Henry Johnson,"[http://www.wikidata.org/entity/Q353036, http://www.wikidata.org/entity/Q370171, http://www.wikidata.org/entity/Q516276]"
160,137316,"Smith, David",,David,Smith,,,M,N,,Biog provided by David Smith;\nDavid Smith is an internationally renowned Photographer turned Commercials Director. David won many international awards in his long and successful career. He was born in London and developed his creative sensibilities in response to growing up in the sixties. Smith spent three years at the Ealing College of Art; a hotbed of new creative thought that lead Smith to an incredible and enviable career as a creative entrepreneur.,,,,,,,[photographer],,,,,,Y,01-NOV-12,26-NOV-14,nan --- Biog provided by David Smith;\nDavid Smith is an internationally renowned Photographer turned Commercials Director. David won many international awards in his long and successful career. He was born in London and developed his creative sensibilities in response to growing up in the sixties. Smith spent three years at the Ealing College of Art; a hotbed of new creative thought that lead Smith to an incredible and enviable career as a creative entrepreneur.,[],[],[],[photographer],"[Q7187777, Q33231]",David Smith,"[http://www.wikidata.org/entity/Q726169, http://www.wikidata.org/entity/Q19879243, http://www.wikidata.org/entity/Q557]"
161,14844,"Dollond, George",,George,Dollond,,,M,Y,"ODNB: Gloria Clifton, ‘Dollond family (per. 1750–1871)’, Oxford Dictionary of National Biography, Oxford University Press, Sept 2004; Anderson, R.G.W., Burnett, J.,& Gee, B., Handlist of Scientific Instrument-Makers' Trade Catalogues 1600-1914 (National Museums of Scotland, 1990); Clifton, G., 'Directory of British Scientific Instrument Makers 1550-1851' (Zwemmer, 1995)","Traded at 59 St. Paul's Churchyard (1852-65) & 61 Paternoster Row (1859-65) both London, England. Apprenticed to uncle, George Dollond (1774-1852), took over business in 1852, succeeded by son, William Dollond (1834-93) in 1866. Born George Huggins later changed to George Dollond",1797,"England, United Kingdom",1866,"England, United Kingdom",,English; British,[optician],,,,,,N,26-MAR-98,22-MAY-14,"ODNB: Gloria Clifton, ‘Dollond family (per. 1750–1871)’, Oxford Dictionary of National Biography, Oxford University Press, Sept 2004; Anderson, R.G.W., Burnett, J.,& Gee, B., Handlist of Scientific Instrument-Makers' Trade Catalogues 1600-1914 (National Museums of Scotland, 1990); Clifton, G., 'Directory of British Scientific Instrument Makers 1550-1851' (Zwemmer, 1995) --- Traded at 59 St. Paul's Churchyard (1852-65) & 61 Paternoster Row (1859-65) both London, England. Apprenticed to uncle, George Dollond (1774-1852), took over business in 1852, succeeded by son, William Dollond (1834-93) in 1866. Born George Huggins later changed to George Dollond",[],[],[],[optician],"[Q1996635, Q71133019]",George Dollond,[http://www.wikidata.org/entity/Q5536785]
176,15129,"Stevens, Thomas",,Thomas,Stevens,,,M,Y,"REF: http://www.victoriansilk.com/\nREF: http://www.stevengraphs.com/thomstevandh.html\nThe silk pictures of Thomas Stevens: a biography of the Coventry weaver and his contribution to the art of weaving, with an illustrated catalogue of his work. New York, Wilma Sinclair Le Van Baker, 1957.","1854 - set up ribbon weaving business in Queen Street, Coventry; invented a woven silk technique he called the Stevengraph, first used in 1862; new factory in West Orchard and large warehouse in Much Park Street, Coventry; 1875 - Stevengraph Works were built in Cox Street, Coventry; 1879 - pictures first appeared; firm produced silk bookmarks, pictures, portraits and postcards",1828,"Coventry, West Midlands, England, United Kingdom",1888-10-24,,,British,"[silk ribbon weaver, inventor]",,,,,,N,08-APR-98,18-JUN-12,"REF: http://www.victoriansilk.com/\nREF: http://www.stevengraphs.com/thomstevandh.html\nThe silk pictures of Thomas Stevens: a biography of the Coventry weaver and his contribution to the art of weaving, with an illustrated catalogue of his work. New York, Wilma Sinclair Le Van Baker, 1957. --- 1854 - set up ribbon weaving business in Queen Street, Coventry; invented a woven silk technique he called the Stevengraph, first used in 1862; new factory in West Orchard and large warehouse in Much Park Street, Coventry; 1875 - Stevengraph Works were built in Cox Street, Coventry; 1879 - pictures first appeared; firm produced silk bookmarks, pictures, portraits and postcards",[],"[http://www.victoriansilk.com/, http://www.stevengraphs.com/thomstevandh.html]",[],"[silk ribbon weaver, inventor]",[Q205375],Thomas Stevens,"[http://www.wikidata.org/entity/Q245774, http://www.wikidata.org/entity/Q1294403]"
179,15237,"Lee, Rev. William",Rev.,William,Lee,,,M,Y,"ODNB: Marilyn Palmer, ‘Lee, William (d. 1614/15?)’, Oxford Dictionary of National Biography, Oxford University Press, 2004 [http://www.oxforddnb.com/view/article/16314, accessed 13 April 2007] William Lee (d. 1614/15?): doi:10.1093/ref:odnb/16314","Rev. William Lee is believed to have invented the first knitting frame in 1589 at Calverton, near Nottingham, assisted by his brother James. Lee's frame was not well received - no patent was granted even after a machine to make fine silk stockings was produced in 1798, and demonstrated to the Queen. In 1603 he moved to France but, being a Protestant, did not do very well there either. He died in France in 1610 in an impoverished state. However, his machines soon became much used in London and Nottinghamshireand was little altered until Jedediah Strutt invented the ribbing attachment in 1758/59: this was an addition to the machine; so that Lee's basic design continued in use well into the latter part of the 19th century.",,,1615,France,,English,[inventor],,,,,,N,21-APR-98,06-MAY-15,"ODNB: Marilyn Palmer, ‘Lee, William (d. 1614/15?)’, Oxford Dictionary of National Biography, Oxford University Press, 2004 [http://www.oxforddnb.com/view/article/16314, accessed 13 April 2007] William Lee (d. 1614/15?): doi:10.1093/ref:odnb/16314 --- Rev. William Lee is believed to have invented the first knitting frame in 1589 at Calverton, near Nottingham, assisted by his brother James. Lee's frame was not well received - no patent was granted even after a machine to make fine silk stockings was produced in 1798, and demonstrated to the Queen. In 1603 he moved to France but, being a Protestant, did not do very well there either. He died in France in 1610 in an impoverished state. However, his machines soon became much used in London and Nottinghamshireand was little altered until Jedediah Strutt invented the ribbing attachment in 1758/59: this was an addition to the machine; so that Lee's basic design continued in use well into the latter part of the 19th century.",[Q611727],[http://www.oxforddnb.com/view/article/16314],[],[inventor],[Q205375],William Lee,"[http://www.wikidata.org/entity/Q611727, http://www.wikidata.org/entity/Q165749, http://www.wikidata.org/entity/Q4800687]"
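
One quick sanity check on the search output is to see whether a Wikidata ID already recorded in `res_WIKIDATA_IDs` reappears among the search results. The sketch below does this for three rows copied from the table above. It assumes both columns hold Python lists of strings; `uri_to_qid` and `known_id_found` are helper names introduced here for illustration and are not part of heritageconnector.

```python
import pandas as pd

def uri_to_qid(uri: str) -> str:
    """Turn 'http://www.wikidata.org/entity/Q123' into 'Q123'."""
    return uri.rsplit("/", 1)[-1]

def known_id_found(known_ids, search_results):
    """Return True/False if a known ID is (not) among the results, None if no ID is known."""
    if not known_ids:
        return None
    found = {uri_to_qid(uri) for uri in search_results}
    return any(qid in found for qid in known_ids)

# values copied from the rows for Eduard Jaeger, William Lee and Charles Job
sample = pd.DataFrame({
    "JOINED_NAME": ["Eduard Jaeger", "William Lee", "Charles Job"],
    "res_WIKIDATA_IDs": [["Q328536"], ["Q611727"], []],
    "search_results": [
        ["http://www.wikidata.org/entity/Q328536"],
        [
            "http://www.wikidata.org/entity/Q611727",
            "http://www.wikidata.org/entity/Q165749",
            "http://www.wikidata.org/entity/Q4800687",
        ],
        ["http://www.wikidata.org/entity/Q5075774"],
    ],
})

hits = [
    known_id_found(ids, results)
    for ids, results in zip(sample["res_WIKIDATA_IDs"], sample["search_results"])
]
print(dict(zip(sample["JOINED_NAME"], hits)))
```

Rows without a known ID return `None` rather than `False`, so they can be excluded when estimating how often the search recovers an expected entity.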
