## Party label issues

It was noticed that there was mislabelling of Representatives' party membership (see Sheila Jackson Lee). This needs to be fixed. This notebook address this issue and shows the careful process of correctly labelling all representatives.

In [681]:
import boto3
import pandas as pd
import requests
import json
import unidecode
client = boto3.client('s3')
import re
import numpy as np
from io import StringIO

In [1108]:
DF = pd.read_csv(client.get_object(Bucket='ascsagemaker',
                                       Key=f'JMP_congressional_nmf/House_bigrams/097.csv')['Body'])
DF = DF.loc[DF.chamber_x == 'H']

# get the member info for the given congress
rq = requests.get('https://api.propublica.org/congress/v1/97/house/members.json',headers={"X-API-Key":'WDmcjspeHVFMDSmrIZ3NV2gKAESs5vldAS8rz3X6'})
ref = pd.DataFrame(json.loads(rq.text)['results'][0]['members'])

ref = ref[['first_name','last_name','gender','state','party']]
ref['state_long'] = ref.state.apply(lambda x: abbrev_to_us_state[x].lower())
ref['last_name'] = ref['last_name'].apply(lambda x: unidecode.unidecode(x.lower()))

KeyError: "None of [Index(['first_name', 'last_name', 'gender', 'state', 'party'], dtype='object')] are in the [columns]"

In [None]:
ref

### How it will work

The congressional record includes the name of the representative speaking before the speech. Ordinarily the honorifics of the speaker are written before the last name of the speaker. However, in the case that the individual has a last name and gender that is shared with another representative, that representatives state is also entered (e.g. Mr. Young of Alaska). 

To match data to the propublica database, the speaker name must be parsed to extract these three pieces of information (honorific/gender, last name, state). The Gentzkow process does not work accurately, so I need to perform this process myself. 

In [1085]:
unique_names = DF.speaker.unique()

match_df = []
unmatched = []
for name in tqdm(unique_names):
    found_matches = []
    norm = name.lower()
    
    # filter by gender first
    if re.findall('mr\..*',norm):
        gender_ref = ref.loc[ref.gender == 'M']
    elif re.findall('mrs\..*',norm) or re.findall('ms\..*',norm):
        gender_ref = ref.loc[ref.gender == 'F']
    else:
        gender_ref = ref
    
    # find name matches
    rows = []
    for ix,row in gender_ref.iterrows():
        if re.findall(r'\b%s\b'%row.last_name,norm):
            rows.append(row)
            
    rows = pd.DataFrame(rows)
    if len(rows) > 0:
        if len(rows) == 1:
            rows['speaker'] = name
            match_df.append(rows)
        else:
            split_name = norm.split('of')
            if len(split_name) > 1:
                state = split_name[1].strip()
                rows = rows.loc[rows.state_long.str.contains(state)]
                if len(rows) == 1:
                    rows['speaker'] = name
                    match_df.append(rows)
                elif len(rows) > 1:
                    n_rows = []
                    for ix,row in rows.iterrows():
                        if re.findall(r'\b%s\b'%row['first_name'].lower(),norm):
                            n_rows.append(row)
                    if len(n_rows) != 1:
                        unmatched.append(name)
                    else:
                        n_rows = pd.DataFrame(n_rows)
                        n_rows['speaker'] = name
                        match_df.append(n_rows)
            else:
                unmatched.append(name)

match_df = pd.concat(match_df)
match_df = match_df[['speaker','party']]
df2 = DF.merge(match_df,on='speaker',how='outer')
print(f"original DF length was - {len(DF)}")
print(f'removed {len(DF) - (len(df2.loc[-df2.party_y.isnull()]))} or {(len(DF) - (len(df2.loc[-df2.party_y.isnull()])))/(len(DF)):2f}')

100%|██████████| 805/805 [00:23<00:00, 34.80it/s]


original DF length was - 40363
removed 7282 or 0.180413


In [1086]:
for v,i in df2.loc[df2.party_y.isnull()].groupby('speaker').count().reset_index()[['speaker','date']].iterrows():
    if i['date'] > 20:
        print(i)

speaker    Mr. ANDERSON
date                 95
Name: 6, dtype: object
speaker    Mr. AuCOIN
date              210
Name: 9, dtype: object
speaker    Mr. BATES
date              36
Name: 11, dtype: object
speaker    Mr. BEDELL
date              192
Name: 14, dtype: object
speaker    Mr. BIAGGI
date              278
Name: 17, dtype: object
speaker    Mr. BONKER
date              179
Name: 18, dtype: object
speaker    Mr. CHANDLER
date                 50
Name: 31, dtype: object
speaker    Mr. CONABLE
date               195
Name: 37, dtype: object
speaker    Mr. CONTE
date             378
Name: 38, dtype: object
speaker    Mr. DENNY SMITH
date                    44
Name: 48, dtype: object
speaker    Mr. DONNELLY
date                 65
Name: 52, dtype: object
speaker    Mr. DYMALLY
date                40
Name: 58, dtype: object
speaker    Mr. DYSON
date              41
Name: 59, dtype: object
speaker    Mr. ECKART
date               65
Name: 61, dtype: object
speaker    Mr. EDWARDS of Okla

In [1104]:
ref.loc[ref.last_name == 'won pat']

Unnamed: 0,first_name,last_name,gender,state,party,state_long
436,Antonio,won pat,,GU,D,guam


In [1105]:
df2.loc[df2.speaker == 'Mr. MORRISON of Connecticut','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. MORRISON of Washington','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. ROBERT F. SMITH','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. ROBINSON','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. ST GERMAIN','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. FOWLER','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. HANCE','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. LEVITAS','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. WEISS','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. WON PAT','party_y'] = 'D'

df2.loc[df2.speaker == 'Mr. PATTERSON','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. PERKINS','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. SAM B. HALL. JR','party_y'] = 'D'

df2.loc[df2.speaker == 'Mr. HIGHTOWER','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. PHILIP M. CRANE','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. RUDD','party_y'] = 'R'

df2.loc[df2.speaker == 'Mr. FRENZEL','party_y'] = 'R'

df2.loc[df2.speaker == 'Mr. SIKORSKI','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. SNYDER','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. CONABLE','party_y'] = 'R'

df2.loc[df2.speaker == 'Mr. STANGELAND','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. McKINNEY','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. RUDD','party_y'] = 'R'

df2.loc[df2.speaker == 'Mr. MICHEL','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. MITCHELL','party_y'] = 'D'

df2.loc[df2.speaker == 'Mr. SUNIA','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. JONES of Tennessee','party_y'] = 'D'

df2.loc[df2.speaker == 'Mr. TAYLOR','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. HILLIS','party_y'] = 'R'

df2.loc[df2.speaker == 'Mr. WALGREN','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. BEDELL','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. FUQUA','party_y'] = 'D'

df2.loc[df2.speaker == 'Mr. UDALL','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. WALGREN','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. BONKER','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. ANDERSON','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. BIAGGI','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. AuCOIN','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. BATES','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. FLIPPO','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. BOSCO','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. BRUCE','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. BOULTER','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. LATTA','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. CALLAHAN','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. CONTE','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. CHANDLER','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. DENNY SMITH','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. SMITH of Mississippi','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. DYMALLY','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. DYSON','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. DONNELLY','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. ECKART','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. ECKART of Ohio','party_y'] = 'D'

df2.loc[df2.speaker == 'Mr. ERDREICH','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. PAYNE of Virginia','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. FASCELL','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. THOMAS A. LUKEN','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. GREEN','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. GREEN of New York','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. HERTELL','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. HERTEL of Michigan','party_y'] = 'D'

df2.loc[df2.speaker == 'Mr. HERTEL','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. HOLLOWAY','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. HUBBARD','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. JAMES','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. LENT','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. LEVINE of California','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. MARLENEE','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. MOODY','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. PARRIS','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. McEWEN','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. HOPKINS','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. STANGELAND','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. SAVAGE','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. SCHUETTE','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. SHUSTER','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. SHUMWAY','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. TRAXLER','party_y'] = 'D'
df2.loc[df2.speaker == 'Mr. WYLIE','party_y'] = 'R'
df2.loc[df2.speaker == 'Mr. YATRON','party_y'] = 'D'
df2.loc[df2.speaker == 'Ms. LONG','party_y'] = 'R'



In [1106]:
print(f'removed {len(DF) - (len(df2.loc[-df2.party_y.isnull()]))} or {(len(DF) - (len(df2.loc[-df2.party_y.isnull()])))/(len(DF)):2f}')

removed 769 or 0.019052


In [1107]:
csv_buffer = StringIO()
df2.to_csv(csv_buffer)
s3_resource = boto3.resource('s3')
s3_resource.Object('ascsagemaker', 'JMP_congressional_nmf/House_bigrams/098_fixed_party.csv').put(Body=csv_buffer.getvalue())

{'ResponseMetadata': {'RequestId': 'JEB9RXA7SKNJRQ4H',
  'HostId': 'pJpXXtLM2EbEgOAKgWSTsULJUpcWHcJZhD0vnoM9xqFnPB+amfS+PXxOR2w3ed8rXj/9lfBKBKo=',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amz-id-2': 'pJpXXtLM2EbEgOAKgWSTsULJUpcWHcJZhD0vnoM9xqFnPB+amfS+PXxOR2w3ed8rXj/9lfBKBKo=',
   'x-amz-request-id': 'JEB9RXA7SKNJRQ4H',
   'date': 'Tue, 04 Jan 2022 16:32:22 GMT',
   'x-amz-version-id': '.MEnkWzDAzvhKfE_5ie8VsxgL1SQwRn7',
   'etag': '"15051d1093da73f43f220b28f5563287"',
   'server': 'AmazonS3',
   'content-length': '0'},
  'RetryAttempts': 0},
 'ETag': '"15051d1093da73f43f220b28f5563287"',
 'VersionId': '.MEnkWzDAzvhKfE_5ie8VsxgL1SQwRn7'}

In [1110]:
for congress in range(98,115): 

    DF2 = pd.read_csv(client.get_object(Bucket='ascsagemaker',
                                           Key=f'JMP_congressional_nmf/House_bigrams/{congress:0>3}_fixed_party.csv')['Body'])
    
    DD = DF2.loc[DF2.party_x != DF2.party_y]
    print(congress,len(DD))

98 2018
99 1813
100 1453
101 1276
102 1353
103 1449
104 2400
105 1441
106 1636
107 1355
108 1606
109 1639
110 2671
111 1844
112 3207
113 2613
114 2263
