# Create LifeSpans Data Frame

Updated: 2021-12-28


## Things to Do

1. Get a large list of data items. (People, events etc.)
2. Keep growing one "master dataframe" and save it to CSV with a date stamp.
3. Pick two (or more) and plot them comparatively.
4. Given a time period, list all eligible events/people.
5. Given a field, get all relevant people.
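Items 2 and 4 can be sketched in a few lines. This is only a rough illustration under assumptions: the toy rows reuse years that appear later in this notebook, BC years are stored as negative integers, and the helper names `alive_during` and `stamped_filename` (and the file naming pattern) are hypothetical.

```python
from datetime import date

import pandas as pd

# Toy stand-in for the master dataframe; YrB / YrD are the derived
# birth-year and death-year columns (negative values mean BC).
df_main = pd.DataFrame({
    "name": ["Edmund Burke", "Epicurus", "El Cid"],
    "YrB": [1729, -341, 1040],
    "YrD": [1797, -270, 1098],
})


def alive_during(df, period_start, period_end):
    """Return rows whose lifespan overlaps [period_start, period_end]."""
    overlaps = (df["YrB"] <= period_end) & (df["YrD"] >= period_start)
    return df[overlaps]


def stamped_filename(prefix="df_main", today=None):
    """Build a date-stamped CSV name (the naming pattern is an assumption)."""
    today = today or date.today()
    return f"{prefix}_{today.isoformat()}.csv"


print(alive_during(df_main, 1000, 1800)["name"].tolist())
print(stamped_filename(today=date(2021, 12, 28)))
```

Two intervals overlap when each starts before the other ends, which is why the filter needs only two comparisons.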


### Output

**df_main**

**Columns**

```
ID_str
record_type: SE, SE_with_Activity, Event (single point), Duration
name
nickname
start_date    # mostly date of birth
end_date      # often year/date of death
active_start
active_end    # when the active period ended
occupation    # e.g. POLITICIAN
industry      # e.g. GOVERNMENT
domain
notes         # [list of keywords, phrases]
padding       # 10 years by default; specify here if it should differ
source        # filename, if CSV/JSON

## DERIVED
YrB
YrD
YrAS
YrAE

```
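The derived year columns could be filled from the date columns roughly as follows. This is a sketch under assumptions: dates are taken to be strings that begin with an optionally signed year (for example `1729-01-12` or `-341`), and the helper `year_of` is hypothetical.

```python
import re

import pandas as pd


def year_of(value):
    """Extract a signed year from strings like '1729-01-12' or '-341'."""
    if value is None or pd.isna(value):
        return None
    match = re.match(r"\s*([+-]?\d{1,4})", str(value))
    return int(match.group(1)) if match else None


# Toy rows; the exact date formats are an assumption.
df_main = pd.DataFrame({
    "name": ["Edmund Burke", "Epicurus"],
    "start_date": ["1729-01-12", "-341"],
    "end_date": ["1797-07-09", "-270"],
    "active_start": [None, None],
    "active_end": [None, None],
})

derived = {"YrB": "start_date", "YrD": "end_date",
           "YrAS": "active_start", "YrAE": "active_end"}
for new_col, src_col in derived.items():
    # Nullable Int64 keeps missing years as <NA> instead of forcing floats.
    years = [year_of(v) for v in df_main[src_col]]
    df_main[new_col] = pd.array(years, dtype="Int64")

print(df_main[["name", "YrB", "YrD", "YrAS", "YrAE"]])
```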



In [1]:
import os
import pandas as pd
import re
import wikipedia
import wptools

## Get Country (particular field) from Wikipedia itself.

Ex. Given John Adams, get his Country.
This could be from Pantheon, or failing that, from Wikidata

1. Is the person available?
2. Get their country


`get_field_for_person(Name, fieldName)`
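One possible shape for this helper, as a minimal sketch: try an exact name match in the Pantheon table first and hand over to a fallback otherwise. The `pantheon_df` and `fallback` parameters are assumptions added for illustration; the fallback stands in for whatever Wikidata lookup is eventually used.

```python
import pandas as pd


def get_field_for_person(name, field_name, pantheon_df, fallback=None):
    """Look up field_name for an exact name in Pantheon, else ask the fallback."""
    rows = pantheon_df[pantheon_df["name"] == name]
    if len(rows) == 1 and pd.notna(rows.iloc[0][field_name]):
        return rows.iloc[0][field_name]
    # fallback is any callable (name, field_name) -> value, e.g. a Wikidata lookup.
    return fallback(name, field_name) if fallback else None


# Toy one-row table using values from the Pantheon dataset.
toy = pd.DataFrame({"name": ["Abraham Lincoln"],
                    "countryName": ["UNITED STATES"]})
print(get_field_for_person("Abraham Lincoln", "countryName", toy))
print(get_field_for_person("John Adams", "countryName", toy))
```

Passing the fallback in as a callable keeps the network-dependent part separate, so the Pantheon path can be checked offline.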


In [6]:
name_str = "John Adams"
print(wikipedia.search(name_str, results=10))

['John Adams', 'John Quincy Adams', 'John Adams (miniseries)', 'John Adams (composer)', 'John "Grizzly" Adams', 'John Adams Morgan', 'John Adams II', 'John Adams (mutineer)', 'John Adams Sr.', 'Ansel Adams']


In [12]:
name = "Adam Smith"

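wkdata_person = None
try:
    wkdata_person = wptools.page(name, silent=True).get_parse(show=False)
except Exception as e:
    # The lookup can fail for a missing page or a network problem.
    print(f"Nothing found for {name}: {e}")

# Guard against the failed lookup so the name is always defined here.
if wkdata_person is not None and wkdata_person.data is not None:
    print(wkdata_person.data)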



In [2]:

useful_cols = ['record_type', 'name', 'start_date', 'end_date', 'active_start', 'active_end', 
                   'notes', 'occupation', 'industry', 'domain']

    


In [3]:
name_str = "birbal"
print(wikipedia.search(name_str, results=1))

['Birbal']


In [4]:
print(wikipedia.summary(name_str, sentences=5))

Birdal is a Turkish surname.


In [28]:
def get_death_from_summary(name, verbose=False):
    
    try:
        summ = wikipedia.summary(name, sentences=1)
        if verbose:
            print(summ)
    except Exception:
        # e.g. wikipedia.exceptions.PageError or DisambiguationError
        if verbose:
            print('No summary found. Will try without auto suggest')

        try:
            summ = wikipedia.summary(name, sentences=1, auto_suggest=False)
            if verbose:
                print(summ)
        except Exception:
            if verbose:
                print('No summary found. Will try wikidata infobox next')
            return " "
            

    # PARSING SUMMARY: the first parenthesised group usually holds the life dates
    result = re.search(r"\((.+?)\)", summ)
    if verbose:
        print(result)
    if result:
        bracket = result.group(1)
        if verbose:
            print(f'SUMM bracket {bracket}')

        if re.search(' ?– ?', bracket):
            # Take the last number in the bracket. For a BC range this is
            # still the death year, returned without a sign, so no separate
            # BC branch is needed.
            death_yr = re.search(r"(\d+)(?!.*\d)", bracket)
            if death_yr:
                if verbose:
                    print(f'Summ Found {death_yr}')
                return death_yr.group(1)
        else:
            if verbose:
                print('full bracket, possible alive')
            return bracket

    elif 'born' in summ:
        return('possibly alive')

    
    return ' '

def get_death_year_from_wikidata(name, verbose=False):
        
    try:
        wkdata_person = wptools.page(name, silent=True).get_parse(show=False)
    except:
        return "Nothing in Wikidata"
    
    if wkdata_person.data is not None:
        if 'infobox' in wkdata_person.data:
            try:
                if 'birth_date' in wkdata_person.data['infobox']:
                    if verbose:
                        print(wkdata_person.data['infobox']['birth_date'])
                    pass
                else:
                    if verbose:
                        print('keys are:')
                        print(wkdata_person.data['infobox'].keys())                    
            except TypeError as e:
                print( e)
                #print(wkdata_person.data['infobox'])
                return('No Wikidata/Infobox?')

            
            if 'death_date' in wkdata_person.data['infobox']:                    
                s= wkdata_person.data['infobox']['death_date']
                if verbose:
                    print('Death', s)                    
                
                # first number after an age/df/y/yes template field
                result = re.search(r"(age|df|y|yes)\|(\d+?)\|", s)
                if result:
                    if len(result.groups()) >=2:
                        if verbose:
                            print(f"Death year {result.group(2)}")
                            print(result.group(2).isdigit())
                        return result.group(2)
                elif " BC " in s:
                    # Lazy .*? so the whole year is captured; a greedy .* would
                    # leave only the last digit for the group.
                    result = re.search(r"Death.*?(\d+) BC", s)
                    if result:
                        if verbose:
                            print(f"BC Death year {result.group(1)}")
                            print(result.group(1).isdigit())
                        return result.group(1)
                    
                elif "aged" in s: #(aged 47)
                    # NOTE: this branch returns the age at death, not the death year.
                    age = re.search(r"aged (\d+)\)", s) #first number after aged
                    if age:
                        if verbose:
                            print(f'Age {age.group(1)}')
                        return age.group(1)
                    
            else:
                if verbose:
                    print(wkdata_person.data['infobox'].keys())
                return(f'death_date not found in infobox')
        else:
            return('No Wikidata/Infobox?')
            

    else:
        return('No Wikidata/Infobox?')
        
        



def struct_wikidata_death(name, verbose = False):
    
    page = wptools.page(name, silent=True)
    try:
        page.get_wikidata(show=True)
    except Exception as e:
        print(name,  e)
        return(' ')
            
    for k in page.data['wikidata'].keys():
        if 'date of death' in k:
            dod = page.data['wikidata'][k]
            if isinstance(dod, list):
                dod = dod[0]
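            if isinstance(dod, str):
                # Wikidata times are signed, e.g. '-0322-00-00T00:00:00Z', so
                # split('-')[0] would return '' for BC dates. Match the signed
                # year at the start instead.
                m = re.match(r"[+-]?\d+", dod)
                dd = m.group(0) if m else " "
                if verbose:
                    print(dod)
                    print(type(dod))
                    print(dd)
                return dd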
        
    return " "
    

In [29]:
name = 'Aristotle'
dy = get_death_year_from_wikidata(name, verbose=True)
dy

384 BC {{efn-ua|That these dates (the first half of the Olympiad year 384/383 BC, and in 322 shortly before the death of Demosthenes) are correct was shown by [[August Boeckh]] (''Kleine Schriften'' VI 195); for further discussion, see [[Felix Jacoby]] on ''[[FGrHist]]'' 244 F 38. Ingemar Düring, ''Aristotle in the Ancient Biographical Tradition'', Göteborg, 1957, |p.|253}} {{p.|253}}
dict_keys(['name', 'image', 'caption', 'birth_date', 'birth_place', 'death_place', 'education', 'spouse', 'era', 'region', 'notable_students'])


'death_date not found in infobox'

In [None]:
start, end = 400, 404
df2 = df.loc[start:end]
verbose = True

rows_list = []
for (idx, row) in df2.iterrows():
    adds = {}
    #print('\n\n')
    name = row['name']
    adds['name'] = name
    adds['BY'] = row['birthyear']
    print(f"{idx}: {name} BY {row['birthyear']}")
    d_from_s = get_death_from_summary(name, verbose=verbose)
    adds['Summ'] = d_from_s
    if verbose:
        print(d_from_s)
    if d_from_s.isdigit():
        adds['Summ_DY'] = d_from_s
        
    if verbose:
        print('_' * 25)
    dy = get_death_year_from_wikidata(name, verbose=verbose)
    adds['Wikidata'] = dy
    if dy is not None and dy.isdigit():
        adds['WkData_DY'] = dy
            
    #structured WIKIDATA        
    dod = struct_wikidata_death(name, verbose=verbose)
    adds['Struct_DY'] = dod
    
    
    rows_list.append(adds)
                    
df_adds = pd.DataFrame(rows_list)

In [33]:
df_adds.to_csv(f'data/adds{start}_{end}.csv', index=False)

In [12]:
import os
os.listdir()
os.getcwd()

'C:\\Users\\ramna\\PyLibrary\\Lifetimes'

In [34]:
df_adds

Unnamed: 0,name,BY,Summ,Summ_DY,Wikidata,WkData_DY,Struct_DY
0,Edmund Burke,1729,1797,1797.0,1797,1797.0,1797.0
1,Eric Clapton,1945,born 30 March 1945,,death_date not found in infobox,,
2,Epicurus,-341,270,270.0,,,
3,El Cid,1040,,,,,1098.0
4,Ethelred the Unready,968,,,,,


In [31]:
df.dtypes

en_curid                 int64
name                    object
numlangs                 int64
birthcity               object
birthstate              object
countryName             object
countryCode             object
countryCode3            object
LAT                    float64
LON                    float64
continentName           object
birthyear               object
gender                  object
occupation              object
industry                object
domain                  object
TotalPageViews           int64
L_star                 float64
StdDevPageViews        float64
PageViewsEnglish         int64
PageViewsNonEnglish      int64
AverageViews           float64
HPI                    float64
dtype: object

In [32]:
df_adds[df_adds['BY'] < '1950']

Unnamed: 0,name,BY,Summ,Summ_DY,Wikidata,WkData_DY,Struct_DY
0,Edmund Burke,1729,1797,1797.0,1797,1797.0,1797.0
1,Eric Clapton,1945,born 30 March 1945,,death_date not found in infobox,,
2,Epicurus,-341,270,270.0,,,
3,El Cid,1040,,,,,1098.0
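Ethelred the Unready (BY 968) is missing from this result because `birthyear` has dtype `object`, so `BY` is compared as text and `'968' < '1950'` is false. A small sketch of the difference, using only the names and birth years shown above; converting with `pd.to_numeric` is one possible fix, not necessarily the one this project adopts.

```python
import pandas as pd

df_adds = pd.DataFrame({
    "name": ["Edmund Burke", "Eric Clapton", "Epicurus", "El Cid",
             "Ethelred the Unready"],
    "BY": ["1729", "1945", "-341", "1040", "968"],
})

# Text comparison: strings are ordered character by character.
as_text = df_adds[df_adds["BY"] < "1950"]

# Numeric comparison: unparseable values become NaN and are dropped by the filter.
by_num = pd.to_numeric(df_adds["BY"], errors="coerce")
as_number = df_adds[by_num < 1950]

print(len(as_text), len(as_number))
```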


In [21]:
page.get_wikidata()

www.wikidata.org (wikidata) Q226252
www.wikidata.org (labels) P31|Q4167410
Elizabeth (en) data
{
  assessments: <dict(2)> Disambiguation, Anthroponymy
  claims: <dict(4)> P460, P31, P1889, P373
  description: Wikimedia disambiguation page
  disambiguation: 47
  extext: **Elizabeth** or **Elisabeth** may refer to:
  extract: <p><b>Elizabeth</b> or <b>Elisabeth</b> may refer to:</p>
  label: Elizabeth
  labels: <dict(2)> P31, Q4167410
  length: 2,790
  links: <list(47)> City of Elizabeth, Edgar Allan Poe, Elisabeth ...
  modified: <dict(2)> page, wikidata
  pageid: 88283
  random: Cesáreo Quezadas
  redirects: <list(18)> {'pageid': 364335, 'ns': 0, 'title': 'Elis...
  requests: <list(3)> query, wikidata, labels
  title: Elizabeth
  url: https://en.wikipedia.org/wiki/Elizabeth
  url_raw: https://en.wikipedia.org/wiki/Elizabeth?action=raw
  watchers: 63
  what: Wikimedia disambiguation page
  wikibase: Q226252
  wikidata: <dict(1)> instance of (P31)
  wikidata_pageid: 220743
  wikidata_url

<wptools.page.WPToolsPage at 0x22a1dd03cc8>

In [231]:
page.data['labels']

{'P3280': 'BAnQ author ID',
 'Q36180': 'writer',
 'P8851': 'CONOR.SR ID',
 'P8945': 'Museo Galileo biography ID',
 'P3123': 'Stanford Encyclopedia of Philosophy ID',
 'P8833': 'Union Catalog of Armenian Libraries authority ID',
 'Q194414': 'Peripatetic school',
 'Q125551': 'Parmenides',
 'P5008': 'on focus list of Wikimedia project',
 'P6831': 'Pinakes author ID',
 'P6060': 'MoEML ID',
 'P8850': 'CONOR.KS ID',
 'P31': 'instance of',
 'P2959': 'permanent duplicated item',
 'P7699': 'National Library of Lithuania ID',
 'P6173': 'Bitraga author ID',
 'Q382282': 'Constitution of the Athenians',
 'P734': 'family name',
 'P3417': 'Quora topic ID',
 'P6886': 'writing language',
 'P2596': 'culture',
 'P6394': 'ELNET ID',
 'P268': 'Bibliothèque nationale de France ID',
 'P1871': 'CERL Thesaurus ID',
 'P1938': 'Project Gutenberg author ID',
 'P7902': 'Deutsche Biographie ID',
 'P1741': 'GTAA ID',
 'Q8409': 'Alexander the Great',
 'Q99413897': 'Medvik',
 'P213': 'ISNI',
 'Q132157': 'Zeno of Elea'

In [20]:
page.wanted_labels(['P570'])


In [245]:
page.data['wikidata']['date of death (P570)']

['-0322-00-00T00:00:00Z', '-0322-03-07T00:00:00Z']
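These time strings carry a leading sign, and calling `split('-')[0]` on a BC value such as `'-0322-00-00T00:00:00Z'` yields an empty string. A minimal sketch of a sign-aware parser follows; the helper name `wikidata_year` is hypothetical, and the `+1997` string is a made-up value in the same shape.

```python
import re


def wikidata_year(timestamp):
    """Return the signed year from a Wikidata time string, or None."""
    match = re.match(r"([+-]?)(\d+)-", timestamp)
    if not match:
        return None
    sign, digits = match.groups()
    return -int(digits) if sign == "-" else int(digits)


print("-0322-00-00T00:00:00Z".split("-")[0] == "")   # the pitfall
print(wikidata_year("-0322-03-07T00:00:00Z"))
print(wikidata_year("+1997-00-00T00:00:00Z"))        # made-up example value
```

Returning an integer rather than a string also lets BC and AD years be compared directly.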

In [242]:
page.data['labels']['date of death (P570)']

KeyError: 'date of death (P570)'

In [16]:
start, end = 1548, 1554
df2 = df.loc[start:end]
verbose = True

rows_list = []
for (idx, row) in df2.iterrows():
    adds = {}
    name = row['name']
    print('\n\n', name)
    dod = struct_wikidata_death(name, verbose=False)
    print(dod)



 William S. Burroughs
+1997


 William of Ockham
+1349


 William Pitt, 1st Earl of Chatham
+1778


 William Butler Yeats
William Butler Yeats https://www.wikidata.org/w/api.php?action=wbgetentities&formatversion=2&languages=en&props=aliases|info|claims|descriptions|labels|sitelinks&redirects=yes&sites=enwiki&titles=William%20Butler%20Yeats
 


 Walter Raleigh
+1618


 William Faulkner
+1962


 Wim Kok
+2018


In [18]:
page = wptools.page('Elizabeth')

In [19]:
page.get_query()

en.wikipedia.org (query) Elizabeth
Elizabeth (en) data
{
  assessments: <dict(2)> Disambiguation, Anthroponymy
  description: Wikimedia disambiguation page
  disambiguation: 47
  extext: **Elizabeth** or **Elisabeth** may refer to:
  extract: <p><b>Elizabeth</b> or <b>Elisabeth</b> may refer to:</p>
  label: Elizabeth
  length: 2,790
  links: <list(47)> City of Elizabeth, Edgar Allan Poe, Elisabeth ...
  modified: <dict(1)> page
  pageid: 88283
  random: Cesáreo Quezadas
  redirects: <list(18)> {'pageid': 364335, 'ns': 0, 'title': 'Elis...
  requests: <list(1)> query
  title: Elizabeth
  url: https://en.wikipedia.org/wiki/Elizabeth
  url_raw: https://en.wikipedia.org/wiki/Elizabeth?action=raw
  watchers: 63
  wikibase: Q226252
  wikidata_url: https://www.wikidata.org/wiki/Q226252
}


<wptools.page.WPToolsPage at 0x22a1dd03cc8>

-------

========

# WIKIPEDIA PANTHEON. 11000 Persons


In [2]:
pantheon = pd.read_csv('data/pantheon.tsv', sep='\t')
pantheon.shape

(11341, 23)

In [4]:
df.loc[0]

en_curid                           307
name                   Abraham Lincoln
numlangs                           131
birthcity                  Hodgenville
birthstate                          KY
countryName              UNITED STATES
countryCode                         US
countryCode3                       USA
LAT                            37.5711
LON                           -85.7386
continentName            North America
birthyear                         1809
gender                            Male
occupation                  POLITICIAN
industry                    GOVERNMENT
domain                    INSTITUTIONS
TotalPageViews                66145211
L_star                         5.80139
StdDevPageViews                 586915
PageViewsEnglish              41477236
PageViewsNonEnglish           24667975
AverageViews                    504925
HPI                            27.9386
Name: 0, dtype: object

In [29]:
get_field_from_pantheon('Lincoln')

Lincoln matches 3 Please be more specific
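`get_field_from_pantheon` is not defined in the cells shown here. A hypothetical reconstruction that would produce a message like the one above is sketched below. The substring matching, the default `field`, the `df` parameter, and the two placeholder rows are all assumptions.

```python
import pandas as pd


def get_field_from_pantheon(name, field="countryName", df=None):
    """Return `field` for the single Pantheon row whose name contains `name`."""
    matches = df[df["name"].str.contains(name, case=False, regex=False, na=False)]
    if len(matches) == 0:
        print(f"{name} not found")
        return None
    if len(matches) > 1:
        print(f"{name} matches {len(matches)} Please be more specific")
        return None
    return matches.iloc[0][field]


# Toy table: one real Pantheon row plus two placeholder names.
toy = pd.DataFrame({
    "name": ["Abraham Lincoln", "Lincoln Placeholder One", "Lincoln Placeholder Two"],
    "countryName": ["UNITED STATES", "TOYLAND", "TOYLAND"],
})

get_field_from_pantheon("Lincoln", df=toy)
print(get_field_from_pantheon("Abraham Lincoln", df=toy))
```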
