## Data and eight questions
This week's data set comes from a GitHub repository that is remarkably detailed:

https://github.com/unitedstates/congress-legislators

We will only be looking at two of the files, legislators-historical.yaml and legislators-current.yaml, both (as the suffix indicates) in YAML format.

This week's learning goals involve working with YAML, manipulating complex data within a Pandas data frame, working with time-series data, pivot tables, and plotting.

Here are my eight questions; as usual, I'll be back on Thursday with my detailed solutions and Jupyter notebook:

- Read the two YAML files (legislators-historical and legislators-current) into a single data frame.
- Three columns (id, bio, and name) contain Python dicts. Expand each dict to be new columns in the row, and then remove the original columns. Then take the "terms" column, which contains a list of dicts, and expand it such that you have multiple rows per legislator, one term per row. (The rest of the data for the legislator will be duplicated.) Finally, set the "bioguide" column to be the index.
- Now expand the "terms" column to be new columns in the row, and remove the original "terms" column. The "bioguide" column should remain the index.
- Turn the birthday, start, and end columns into datetimes.
- What is the greatest number of terms a legislator has served in Congress? How old were they at the start of their first term, and how old at the end of their final term?
- Calculate how old each legislator was at the start of each term, and show the 10 term starts at which a legislator was oldest. Did any legislators make the list more than once? Display their ages in years.
- Create a pivot table showing, for Democrat and Republican legislators that started their terms in 1990 or after, the mean age (in years) of the members of each party.
- How many legislators in the current Congress are older than Joe Biden? Get their names, birthdates, party affiliation, and gender. How many such people are in each party? How does it break down by gender?
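
The reshaping asked for in the first four questions can be tried out on a tiny hand-written sample before touching the real files. This is only a sketch: the records below are hypothetical stand-ins for what `yaml.safe_load` would return, and every ID, name, and date in them is invented for illustration.

```python
import pandas as pd

# Hypothetical records shaped like the parsed YAML (all values invented).
records = [
    {
        "id": {"bioguide": "X000001"},
        "name": {"first": "Ann", "last": "Example"},
        "bio": {"birthday": "1950-01-01", "gender": "F"},
        "terms": [
            {"type": "rep", "start": "1991-01-03", "end": "1993-01-03", "party": "Democrat"},
            {"type": "sen", "start": "1993-01-05", "end": "1999-01-03", "party": "Democrat"},
        ],
    },
    {
        "id": {"bioguide": "X000002"},
        "name": {"first": "Bob", "last": "Sample"},
        "bio": {"birthday": "1960-06-15", "gender": "M"},
        "terms": [
            {"type": "rep", "start": "2001-01-03", "end": "2003-01-03", "party": "Republican"},
        ],
    },
]

# json_normalize flattens nested dicts into dotted column names such as
# "id.bioguide"; list values (here "terms") are left untouched.
flat = pd.json_normalize(records)

# explode produces one row per term, repeating the legislator's other columns.
long = flat.explode("terms").reset_index(drop=True)

# Expand each term dict into its own columns, then drop the original column.
terms = pd.DataFrame(long["terms"].tolist())
sample = pd.concat([long.drop(columns="terms"), terms], axis=1).set_index("id.bioguide")

# Convert the date columns from strings to datetimes.
for col in ["bio.birthday", "start", "end"]:
    sample[col] = pd.to_datetime(sample[col])

print(sample[["name.last", "type", "start", "end"]])
```

The sample ends up with one row per term and the bioguide ID repeated in the index, which is the shape the later questions rely on.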

In [96]:
import os
import pickle
import pandas as pd
import yaml
import dateutil.relativedelta as rd


In [97]:
# Cache the combined frame as a pickle so the YAML only has to be parsed once.
if not os.path.isfile('data/df1.pickle'):
    dfs = {}
    for era in ['current', 'historical']:
        with open(f'data/legislators-{era}.yaml', 'r') as fh:
            # json_normalize flattens the id/name/bio dicts into dotted columns
            dfs[era] = pd.json_normalize(yaml.safe_load(fh))
    # explode gives one row per term; the legislator's other columns are repeated
    df1 = pd.concat(dfs.values(), ignore_index=True).explode('terms').reset_index(drop=True)
    with open('data/df1.pickle', 'wb') as fh:
        pickle.dump(df1, fh)
else:
    df1 = pd.read_pickle('data/df1.pickle')

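# Expand each term dict into its own columns
terms_df = pd.DataFrame.from_records(df1['terms'].values)
display(terms_df)
# Swap the original "terms" column for the expanded columns
df = pd.concat([df1.drop('terms', axis=1), terms_df], axis=1)
# Index by bioguide ID and convert the date columns to datetimes
df = (
    df.set_index('id.bioguide').astype(
        {'bio.birthday': 'datetime64[ns]', 'start': 'datetime64[ns]',
         'end': 'datetime64[ns]'}
    )
)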
df

,type,start,end,state,district,party,url,class,address,phone,fax,contact_form,office,state_rank,rss_url,how,caucus,party_affiliations,end-type
0,rep,1993-01-05,1995-01-03,OH,13.0,Democrat,,,,,,,,,,,,,
1,rep,1995-01-04,1997-01-03,OH,13.0,Democrat,,,,,,,,,,,,,
2,rep,1997-01-07,1999-01-03,OH,13.0,Democrat,,,,,,,,,,,,,
3,rep,1999-01-06,2001-01-03,OH,13.0,Democrat,,,,,,,,,,,,,
4,rep,2001-01-03,2003-01-03,OH,13.0,Democrat,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45030,rep,2015-01-06,2017-01-03,NJ,9.0,Democrat,http://pascrell.house.gov,,2370 Rayburn HOB; Washington DC 20515-3009,202-225-5751,202-225-5782,https://pascrell.house.gov/contact/email-me,2370 Rayburn House Office Building,,http://www.house.gov/apps/list/press/nj08_pasc...,,,,
45031,rep,2017-01-03,2019-01-03,NJ,9.0,Democrat,https://pascrell.house.gov,,2370 Rayburn House Office Building; Washington...,202-225-5751,202-225-5782,,2370 Rayburn House Office Building,,http://www.house.gov/apps/list/press/nj08_pasc...,,,,
45032,rep,2019-01-03,2021-01-03,NJ,9.0,Democrat,https://pascrell.house.gov,,2409 Rayburn House Office Building Washington ...,202-225-5751,,,2409 Rayburn House Office Building,,http://www.house.gov/apps/list/press/nj08_pasc...,,,,
45033,rep,2021-01-03,2023-01-03,NJ,9.0,Democrat,https://pascrell.house.gov,,2409 Rayburn House Office Building Washington ...,202-225-5751,,,2409 Rayburn House Office Building,,http://www.house.gov/apps/list/press/nj08_pasc...,,,,


id.bioguide,id.thomas,id.lis,id.govtrack,id.opensecrets,id.votesmart,id.fec,id.cspan,id.wikipedia,id.house_history,id.ballotpedia,...,phone,fax,contact_form,office,state_rank,rss_url,how,caucus,party_affiliations,end-type
B000944,00136,S307,400050,N00003535,27018.0,"[H2OH13033, S6OH00163]",5051.0,Sherrod Brown,9996.0,Sherrod Brown,...,,,,,,,,,,
B000944,00136,S307,400050,N00003535,27018.0,"[H2OH13033, S6OH00163]",5051.0,Sherrod Brown,9996.0,Sherrod Brown,...,,,,,,,,,,
B000944,00136,S307,400050,N00003535,27018.0,"[H2OH13033, S6OH00163]",5051.0,Sherrod Brown,9996.0,Sherrod Brown,...,,,,,,,,,,
B000944,00136,S307,400050,N00003535,27018.0,"[H2OH13033, S6OH00163]",5051.0,Sherrod Brown,9996.0,Sherrod Brown,...,,,,,,,,,,
B000944,00136,S307,400050,N00003535,27018.0,"[H2OH13033, S6OH00163]",5051.0,Sherrod Brown,9996.0,Sherrod Brown,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
P000096,01510,,400309,N00000751,478.0,[H6NJ08118],45543.0,Bill Pascrell,19391.0,Bill Pascrell,...,202-225-5751,202-225-5782,https://pascrell.house.gov/contact/email-me,2370 Rayburn House Office Building,,http://www.house.gov/apps/list/press/nj08_pasc...,,,,
P000096,01510,,400309,N00000751,478.0,[H6NJ08118],45543.0,Bill Pascrell,19391.0,Bill Pascrell,...,202-225-5751,202-225-5782,,2370 Rayburn House Office Building,,http://www.house.gov/apps/list/press/nj08_pasc...,,,,
P000096,01510,,400309,N00000751,478.0,[H6NJ08118],45543.0,Bill Pascrell,19391.0,Bill Pascrell,...,202-225-5751,,,2409 Rayburn House Office Building,,http://www.house.gov/apps/list/press/nj08_pasc...,,,,
P000096,01510,,400309,N00000751,478.0,[H6NJ08118],45543.0,Bill Pascrell,19391.0,Bill Pascrell,...,202-225-5751,,,2409 Rayburn House Office Building,,http://www.house.gov/apps/list/press/nj08_pasc...,,,,


In [98]:

# One row per term, so counting rows per bioguide ID (the index) gives the number of terms served
df.loc[:, 'type'].groupby(lambda x: x).count().sort_values(ascending=False)

id.bioguide
D000355    30
W000428    27
C000714    27
V000105    26
R000082    25
           ..
M001226     1
M001225     1
M001224     1
M001223     1
A000001     1
Name: type, Length: 12684, dtype: int64
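
Because the frame has one row per term and is indexed by bioguide ID, counting terms per legislator amounts to counting rows per index label. Here is a minimal sketch of an alternative spelling, grouping on the index level by name. The toy frame and its IDs are invented, and this is not the call used in the cell above.

```python
import pandas as pd

# Toy frame: one row per term, indexed by a made-up bioguide ID.
toy = pd.DataFrame(
    {"type": ["rep", "rep", "sen", "rep"]},
    index=pd.Index(["X1", "X1", "X1", "X2"], name="id.bioguide"),
)

# size() counts rows per group regardless of missing values,
# whereas count() only counts non-null entries in each column.
term_counts = toy.groupby(level="id.bioguide").size().sort_values(ascending=False)
print(term_counts)

# The ID with the most terms, and how many terms that is.
print(term_counts.idxmax(), term_counts.max())
```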

In [99]:
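# First term row for the legislator with the most terms (D000355)
ding0 = df.loc['D000355'].iloc[0].to_dict()
name, birthday, start = (ding0[x] for x in ['name.official_full', 'bio.birthday', 'start'])
# End date of the final term row
end = df.loc['D000355'].iloc[-1].to_dict()['end']
print(name, birthday, start, end)
# Age at the start of the first term, and at the end of the final term
print(rd.relativedelta(start, birthday), rd.relativedelta(end, birthday))
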
John D. Dingell 1926-07-08 00:00:00 1955-01-05 00:00:00 2015-01-03 00:00:00
relativedelta(years=+28, months=+5, days=+28) relativedelta(years=+88, months=+5, days=+26)
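
The two ages can be checked in isolation with `relativedelta`, using only the three dates printed in the output above. A small sketch with plain `date` objects instead of the data frame:

```python
from datetime import date
from dateutil.relativedelta import relativedelta

# Dates copied from the printed output above.
birthday = date(1926, 7, 8)
first_start = date(1955, 1, 5)
last_end = date(2015, 1, 3)

# relativedelta(later, earlier) breaks the gap into calendar years, months, and days.
age_at_start = relativedelta(first_start, birthday)
age_at_end = relativedelta(last_end, birthday)

print(f"{age_at_start.years} years, {age_at_start.months} months at the first start")
print(f"{age_at_end.years} years, {age_at_end.months} months at the final end")
```

Unlike dividing a day count by 365, this calendar-based difference is not thrown off by leap years.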


In [100]:
def age_as_text(row):
    """Age at term start as a readable string (NaT if a date is missing)."""
    if pd.isna(row['start']) or pd.isna(row['bio.birthday']):
        return pd.NaT
    diff = rd.relativedelta(row['start'], row['bio.birthday'])
    return f'{diff.years} yrs, {diff.months} months, {diff.days} days'

def age_in_days(row):
    """Age at term start in whole days (NaT if a date is missing)."""
    if pd.isna(row['start']) or pd.isna(row['bio.birthday']):
        return pd.NaT
    return (row['start'] - row['bio.birthday']).days

df['Age at start (days)'] = df.apply(age_in_days, axis=1)
df['Age at start'] = df.apply(age_as_text, axis=1)
# Sort on the numeric column; the text column is only for display
(
    df.sort_values(by='Age at start (days)', ascending=False).iloc[:10]
      .loc[:, ['name.last', 'name.first', 'type', 'state', 'party',
               'start', 'Age at start']]
)

id.bioguide,name.last,name.first,type,state,party,start,Age at start
T000254,Thurmond,J.,sen,SC,Republican,1997-01-07,"94 yrs, 1 months, 2 days"
H000067,Hall,Ralph,rep,TX,Republican,2013-01-03,"89 yrs, 8 months, 0 days"
G000386,Grassley,Charles,sen,IA,Republican,2023-01-03,"89 yrs, 3 months, 17 days"
B001210,Byrd,Robert,sen,WV,Democrat,2007-01-04,"89 yrs, 1 months, 15 days"
P000218,Pepper,Claude,rep,FL,Democrat,1989-01-03,"88 yrs, 3 months, 26 days"
S000355,Sherwood,Isaac,rep,OH,Democrat,1923-12-03,"88 yrs, 3 months, 20 days"
S000827,Stedman,Charles,rep,NC,Democrat,1929-04-15,"88 yrs, 2 months, 17 days"
T000254,Thurmond,J.,sen,SC,Republican,1991-01-03,"88 yrs, 0 months, 29 days"
H000067,Hall,Ralph,rep,TX,Republican,2011-01-05,"87 yrs, 8 months, 2 days"
C000714,Conyers,John,rep,MI,Democrat,2017-01-03,"87 yrs, 7 months, 18 days"
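
Row-wise `apply` works, but mixing integers and `NaT` in the results can leave the day count as an object column. A possible vectorized alternative is sketched below on invented data. It is not the approach taken in the cell above, and the 365.25 divisor is my own assumption for approximating years.

```python
import pandas as pd

# Toy data with made-up IDs and dates; X3 has no known birthday.
toy = pd.DataFrame(
    {
        "bio.birthday": ["2000-01-01", "1950-06-15", None],
        "start": ["2001-01-01", "2001-01-03", "2001-01-03"],
    },
    index=pd.Index(["X1", "X2", "X3"], name="id.bioguide"),
)
toy["bio.birthday"] = pd.to_datetime(toy["bio.birthday"])
toy["start"] = pd.to_datetime(toy["start"])

# Subtracting datetime columns yields timedeltas; missing dates propagate as NaT.
age = toy["start"] - toy["bio.birthday"]

# .dt.days gives a numeric column (NaN where a date was missing),
# so it can be sorted and averaged without special handling.
toy["Age at start (days)"] = age.dt.days
toy["Age at start (years)"] = toy["Age at start (days)"] / 365.25

# nlargest skips the missing value and returns the oldest starts first.
print(toy.nlargest(2, "Age at start (days)"))
```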


In [101]:
df.columns

Index(['id.thomas', 'id.lis', 'id.govtrack', 'id.opensecrets', 'id.votesmart',
       'id.fec', 'id.cspan', 'id.wikipedia', 'id.house_history',
       'id.ballotpedia', 'id.maplight', 'id.icpsr', 'id.wikidata',
       'id.google_entity_id', 'name.first', 'name.last', 'name.official_full',
       'bio.birthday', 'bio.gender', 'name.middle', 'name.nickname',
       'name.suffix', 'leadership_roles', 'other_names', 'family',
       'id.bioguide_previous', 'id.house_history_alternate', 'type', 'start',
       'end', 'state', 'district', 'party', 'url', 'class', 'address', 'phone',
       'fax', 'contact_form', 'office', 'state_rank', 'rss_url', 'how',
       'caucus', 'party_affiliations', 'end-type', 'Age at start (days)',
       'Age at start'],
      dtype='object')

In [102]:
# Approximate age in years: dividing by 365 ignores leap days
df['Age at start (years)'] = df['Age at start (days)'] / 365.0
df['Age at start (days)']

id.bioguide
B000944    14667
B000944    15396
B000944    16130
B000944    16859
B000944    17587
           ...  
P000096    28470
P000096    29198
P000096    29928
P000096    30659
P000096    31389
Name: Age at start (days), Length: 45035, dtype: object

In [103]:
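# pivot_table aggregates with the mean by default.
# No party filter is applied here, so every party label in the data appears in the result.
(
    df[df['start'] >= pd.to_datetime('1990-01-01')].loc[:, ['party', 'Age at start (years)']]
        .pivot_table(index='party', values='Age at start (years)')
)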


party,Age at start (years)
AL,38.915068
Democrat,56.631832
Democrat-Liberal,70.956164
Independent,60.363014
Libertarian,38.736986
Popular Democrat,40.926027
Republican,54.560661
Republican-Conservative,59.824658
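
The question asks only about Democrats and Republicans, whereas the table above includes every party label found in the data. One way to restrict the pivot is to add an `isin` condition to the row filter. This is a sketch on invented rows; the dates and ages below are made up and are not taken from the data set.

```python
import pandas as pd

# Toy rows with invented start dates and ages.
toy = pd.DataFrame(
    {
        "party": ["Democrat", "Democrat", "Republican", "Independent", "Republican"],
        "start": pd.to_datetime(
            ["1991-01-03", "1989-01-03", "1995-01-04", "2001-01-03", "2003-01-07"]
        ),
        "Age at start (years)": [50.0, 40.0, 60.0, 55.0, 44.0],
    }
)

# Keep terms starting in 1990 or later AND belonging to one of the two parties.
mask = (toy["start"] >= pd.Timestamp("1990-01-01")) & toy["party"].isin(
    ["Democrat", "Republican"]
)

# pivot_table aggregates with the mean by default.
mean_age = toy[mask].pivot_table(index="party", values="Age at start (years)")
print(mean_age)
```

In this toy example the 1989 term and the Independent row are both excluded before the mean is taken.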


In [104]:
df[df['party'] == 'AL']

id.bioguide,id.thomas,id.lis,id.govtrack,id.opensecrets,id.votesmart,id.fec,id.cspan,id.wikipedia,id.house_history,id.ballotpedia,...,office,state_rank,rss_url,how,caucus,party_affiliations,end-type,Age at start (days),Age at start,Age at start (years)
A000359,1679,,400002,N00009825,,,,Aníbal Acevedo Vilá,8754.0,,...,,,,,,,,14204,"38 yrs, 10 months, 21 days",38.915068


In [105]:
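# Terms ending after 2024 stand in for "the current Congress";
# 1942-11-20 is Joe Biden's birthday, so an earlier birthday means older than Biden.
# Note that .iloc[:10] keeps only the first 10 matching rows.
fossils = (
    df[(df['end'] > pd.to_datetime('2024-12-31')) & (df['bio.birthday'] < pd.to_datetime('1942-11-20'))]
        .iloc[:10].loc[:, ['name.official_full', 'bio.birthday', 'party', 'bio.gender']]
)
fossils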

id.bioguide,name.official_full,bio.birthday,party,bio.gender
S000033,Bernard Sanders,1941-09-08,Independent,M
M000355,Mitch McConnell,1942-02-20,Republican,M
C001051,John R. Carter,1941-11-06,Republican,M
C000537,James E. Clyburn,1940-07-21,Democrat,M
D000096,Danny K. Davis,1941-09-06,Democrat,M
G000386,Chuck Grassley,1933-09-17,Republican,M
H000874,Steny H. Hoyer,1939-06-14,Democrat,M
N000179,Grace F. Napolitano,1936-12-04,Democrat,F
N000147,Eleanor Holmes Norton,1937-06-13,Democrat,F
P000197,Nancy Pelosi,1940-03-26,Democrat,F


In [106]:
fossils['party'].value_counts()

party
Democrat       6
Republican     3
Independent    1
Name: count, dtype: int64

In [107]:
fossils['bio.gender'].value_counts()

bio.gender
M    7
F    3
Name: count, dtype: int64