# Create LifeSpans Data Frame

Updated: 2021-12-28


- Write a function to add DETAILS to any given person.
- Example: Find Tipu Sultan and add "East India Company" in details.

- For each economist, find their OCCUP, INDUSTRY and DOMAIN.
For each econ, find the COUNTRYNAME and add it.



## Things to Do

1. Get a large list of data items. (People, events etc.)
2. Keep growing one "master dataframe" - save it to CSV with data_stamp.
3. Pick two (or more) and plot them comparitively
4. Given a time period, list all eligible events/people.
5. Given a field, get all relevant people.


### Output

**df_main**

**Columns**

```
ID_str
record_type: SE, SE_with_Activity, Event (single point), Duration
name
nickname
start_date # mostly DOB
end_date #often year/date of death
active_start
active_end # time when active period ended
countryName
occupation #  POLITICIAN
industry   # GOVERNMENT
domain      
notes # [list of keywords, phrases]
padding # 10 years default. Specify here if it should be different
source # filename, if CSV/JSON

##DERIVED
YrB
YrD
YrAS
YrAE

```



------
# Get from CSV/JSON Sources

1. US Presidents

### Save as CSV

In [1]:
import os
import pandas as pd
import re
import numpy as np
import wikipedia
import wptools

### POSSIBLE RECORD TYPES

SE, Active_SE, Event

In [2]:
def convert_presidents_to_dfmain_format():
    pres = pd.read_json('data/US_Presidents.json')    

    pres.rename(columns={'president':'name'}, inplace=True)
    pres['end_date'] = pd.to_datetime(pres['death_year'], format='%Y')
    pres['start_date'] = pd.to_datetime(pres['birth_year'], format='%Y')
    pres['active_start'] = pd.to_datetime(pres['took_office'])
    pres['active_end'] = pd.to_datetime(pres['left_office'])
    pres['details'] = pres['number'].astype(str) + "," + pres['party']
    

    pres["occupation"] = "Politician"
    pres["industry"] = "Government"
    pres["domain"] = "Institutions"
    
    pres['countryName'] = "USA"
    
    pres['record_type'] = 'Active_SE' #active start and end
    useful_cols = ['record_type', 'name', 'start_date', 'end_date', 'active_start', 'active_end', 
                   'details', 'occupation', 'industry', 'domain', 'countryName']

    
    return pres[useful_cols]
    


# Creation of Base Files


## BHHS

## Presidents
## Economists

In [3]:
df_main = pd.DataFrame()

PRESIDENTS = True

if PRESIDENTS:
    df_pres = convert_presidents_to_dfmain_format()
    df_main = df_main.append(df_pres)


In [5]:
bhht = pd.read_csv('data/bhht-100k-sample.csv', encoding = "ISO-8859-1")
bhht.shape

(100473, 46)

In [21]:
rows = bhht['name'].str.contains('Humay')
bhht[rows]['name']

12658    Humayun_Faridi
56856       Humayon_Dar
Name: name, dtype: object

In [12]:
pantheon = pd.read_csv('data/pantheon.tsv', sep='\t')
pantheon.shape

(11341, 23)

In [74]:
def get_field_from_pantheon(pname, field, pantheon=pantheon):
    row = pantheon['name'].str.contains(pname)
    
    num_matches = row.sum()
    
    if(num_matches == 1):
        #print(pantheon[row][field], type(pantheon[row][field]))
        new_field = pantheon[row][field].values
        return new_field[0]
    else:
        print(f'{pname} matches {num_matches} Please be more specific')
        return None

# READ ECONOMISTS DATA

In [75]:
print(os.getcwd())
os.listdir('data')

C:\Users\ramna\PyLibrary\Lifespans


['adds1548_1554.csv',
 'celebrity_deaths_4.csv',
 'df_main.csv',
 'economists_lifetimes.csv',
 'list_of_presidents_of_france-836j.csv',
 'list_of_prime_ministers_of_uk-839j.csv',
 'pageviews_2008-2013.tsv',
 'pantheon.tsv',
 'philosophers.csv',
 'README.md',
 'US_Presidents.json',
 'wikilangs.tsv']

In [76]:
df = pd.read_csv('data/economists_lifetimes.csv')

In [77]:
df_main.columns

Index(['record_type', 'name', 'start_date', 'end_date', 'active_start',
       'active_end', 'details', 'occupation', 'industry', 'domain',
       'countryName'],
      dtype='object')

In [78]:
df.rename(columns={'entity':'name', 'birth_year':'start_date', 'death_year':'end_date'}, inplace=True)
df['record_type'] = 'SE' #active start and end

df["occupation"] = "Economist"
df["industry"] = "Social Sciences"
df["domain"] = "Science & Technology"

In [89]:
countries = []
for index, row in df.iterrows():
    #name = row['name']
    name = row['name']
    country = get_field_from_pantheon(name, field='countryName')
    countries.append(country)

print(countries)

df['countryName'] = countries


Milton Freedman matches 0 Please be more specific
Irving Fisher matches 0 Please be more specific
John Bates Clark matches 0 Please be more specific
Thomas Malthus  matches 0 Please be more specific
Arthur Pigou matches 0 Please be more specific
['UNITED KINGDOM', 'Germany', 'UNITED KINGDOM', 'AUSTRIA', None, 'United Kingdom', None, None, 'United States', 'UNITED STATES', 'UNITED STATES', 'SAINT LUCIA', None, 'United Kingdom', 'UNITED KINGDOM', None]


In [90]:
econ_useful_cols = ['record_type', 'name', 'start_date', 'end_date', 
                    'details', 'occupation', 'industry', 'domain', 'countryName']
df= df[econ_useful_cols]

===== End of Processing ECONOMISTS 

In [94]:
df_main = df_main.append(df, sort=False)

df_main.shape

(78, 11)

In [95]:
df_main.columns, df.columns

(Index(['active_end', 'active_start', 'countryName', 'details', 'domain',
        'end_date', 'industry', 'name', 'occupation', 'record_type',
        'start_date'],
       dtype='object'),
 Index(['record_type', 'name', 'start_date', 'end_date', 'details',
        'occupation', 'industry', 'domain', 'countryName'],
       dtype='object'))

In [96]:
df_main

Unnamed: 0,active_end,active_start,countryName,details,domain,end_date,industry,name,occupation,record_type,start_date
0,1797-03-04,1789-04-30,USA,"1,No Party",Institutions,1799-01-01 00:00:00,Government,George Washington,Politician,Active_SE,1732-01-01 00:00:00
1,1801-03-04,1797-03-04,USA,"2,Federalist",Institutions,1826-01-01 00:00:00,Government,John Adams,Politician,Active_SE,1735-01-01 00:00:00
2,1809-03-04,1801-03-04,USA,"3,Democratic-Republican",Institutions,1826-01-01 00:00:00,Government,Thomas Jefferson,Politician,Active_SE,1743-01-01 00:00:00
3,1817-03-04,1809-03-04,USA,"4,Democratic-Republican",Institutions,1836-01-01 00:00:00,Government,James Madison,Politician,Active_SE,1751-01-01 00:00:00
4,1825-03-04,1817-03-04,USA,"5,Democratic-Republican",Institutions,1831-01-01 00:00:00,Government,James Monroe,Politician,Active_SE,1758-01-01 00:00:00
...,...,...,...,...,...,...,...,...,...,...,...
11,NaT,NaT,SAINT LUCIA,"Movement of money in Ghana, Nobel prize for de...",Science & Technology,1991,Social Sciences,Arthur Lewis,Economist,SE,1915
12,NaT,NaT,,"Principles of population, population growth, f...",Science & Technology,1834,Social Sciences,Thomas Malthus,Economist,SE,1766
13,NaT,NaT,United Kingdom,"Classical liberalism, theory of liberty",Science & Technology,1873,Social Sciences,John Stuart Mill,Economist,SE,1806
14,NaT,NaT,UNITED KINGDOM,The concept of economic rent comparative advan...,Science & Technology,1823,Social Sciences,David Ricardo,Economist,SE,1772


In [97]:
df_main.to_csv('data/df_main.csv', index=False)


+++++++

# WIKIPEDIA PANTHEON. 11000 Persons


In [22]:
df = pd.read_csv('data/pantheon.tsv', sep='\t')
df.shape

(11341, 23)

In [4]:
df.loc[0]

en_curid                           307
name                   Abraham Lincoln
numlangs                           131
birthcity                  Hodgenville
birthstate                          KY
countryName              UNITED STATES
countryCode                         US
countryCode3                       USA
LAT                            37.5711
LON                           -85.7386
continentName            North America
birthyear                         1809
gender                            Male
occupation                  POLITICIAN
industry                    GOVERNMENT
domain                    INSTITUTIONS
TotalPageViews                66145211
L_star                         5.80139
StdDevPageViews                 586915
PageViewsEnglish              41477236
PageViewsNonEnglish           24667975
AverageViews                    504925
HPI                            27.9386
Name: 0, dtype: object