The postings table contains nested data that needs flattening. Some members switched parties, and each change needs to be treated as a separate posting covering the time served with each party.

In [1]:
# Import Dependencies
import pandas as pd
from datetime import datetime
import csv
import re



In [2]:
# Load the postings CSV file into a DataFrame
postings_df = pd.read_csv('data\\postings.csv', dtype=str)
postings_df.head()

Unnamed: 0,bioguide_id,chamber,job_type,congress_number,congress_start_date,congress_end_date,region_type,region_code,party_name,job_start_date,job_end_date
0,A000002,Representative,CongressMemberJob,86,1959-01-03,1961-01-03,StateRegion,VA,['Democrat'],,
1,A000016,Representative,CongressMemberJob,86,1959-01-03,1961-01-03,StateRegion,MS,['Democrat'],,
2,A000024,Representative,CongressMemberJob,86,1959-01-03,1961-01-03,StateRegion,IN,['Republican'],,
3,A000054,Representative,CongressMemberJob,86,1959-01-03,1961-01-03,StateRegion,NJ,['Democrat'],,
4,A000062,Senator,CongressMemberJob,86,1959-01-03,1961-01-03,StateRegion,VT,['Republican'],,


In [3]:
postings_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36674 entries, 0 to 36673
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   bioguide_id          36674 non-null  object
 1   chamber              36674 non-null  object
 2   job_type             36674 non-null  object
 3   congress_number      36674 non-null  object
 4   congress_start_date  36674 non-null  object
 5   congress_end_date    36674 non-null  object
 6   region_type          36558 non-null  object
 7   region_code          36558 non-null  object
 8   party_name           36654 non-null  object
 9   job_start_date       1662 non-null   object
 10  job_end_date         1440 non-null   object
dtypes: object(11)
memory usage: 3.1+ MB


In [4]:
# Convert date columns to date format
date_columns = ['congress_start_date', 'congress_end_date', 'job_start_date', 'job_end_date']
for col in date_columns:
    postings_df[col] = pd.to_datetime(postings_df[col], errors='coerce')
postings_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36674 entries, 0 to 36673
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   bioguide_id          36674 non-null  object        
 1   chamber              36674 non-null  object        
 2   job_type             36674 non-null  object        
 3   congress_number      36674 non-null  object        
 4   congress_start_date  36674 non-null  datetime64[ns]
 5   congress_end_date    36674 non-null  datetime64[ns]
 6   region_type          36558 non-null  object        
 7   region_code          36558 non-null  object        
 8   party_name           36654 non-null  object        
 9   job_start_date       1662 non-null   datetime64[ns]
 10  job_end_date         1440 non-null   datetime64[ns]
dtypes: datetime64[ns](4), object(7)
memory usage: 3.1+ MB


The nested data in the party_name column needs to be flattened into one row for each party served. The column also needs to be changed from a list of strings to a single string for each row.

In [5]:
# Make a list of bioguide_ids where the party_name column contains a list containing more than one party
# Ignore NaN values
multi_party_bioguide_ids = postings_df[postings_df['party_name'].str.contains(',', na=False)]['bioguide_id'].unique().tolist()
# print the list
print(multi_party_bioguide_ids)
# print a count of how many bioguide_ids are in the list
print(f"Number of bioguide_ids with multiple parties: {len(multi_party_bioguide_ids)}")

['R000150', 'J000057', 'I000029', 'G000382', 'R000354', 'T000058', 'E000172', 'F000257', 'G000280', 'M000206', 'J000072', 'A000361', 'H000067', 'S001177', 'A000367', 'M001201', 'V000133', 'M001183']
Number of bioguide_ids with multiple parties: 18
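
The flattening described above could be sketched as follows. This is a minimal illustration, not the notebook's actual implementation. It assumes party_name holds stringified Python lists such as "['Democrat']", as the head() output suggests. The names sample_df, parse_party_list and flat_df, and the X00000n ids, are hypothetical. The sketch only splits rows per party; it does not assign the dates served with each party.

```python
import ast
import pandas as pd

# Hypothetical sample mimicking the stringified lists seen in party_name
sample_df = pd.DataFrame({
    'bioguide_id': ['A000002', 'X000001', 'X000002'],
    'party_name': ["['Democrat']", "['Democrat', 'Republican']", None],
})

def parse_party_list(value):
    """Turn a string such as "['Democrat']" into a Python list; keep missing values as-is."""
    if pd.isna(value):
        return value
    return ast.literal_eval(value)

flat_df = sample_df.copy()
flat_df['party_name'] = flat_df['party_name'].apply(parse_party_list)
# explode() emits one row per list element; missing values pass through unchanged
flat_df = flat_df.explode('party_name', ignore_index=True)
print(flat_df)
```

After this step each row carries a single party string, and a member with two parties appears on two rows.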


A data error was found during exploration: for some members, the row for the final term served carries the start date of the first congress served, so job_start_date falls before congress_start_date.

In [19]:
# Make a list of bioguide_ids where the job_start_date is earlier than the congress_start_date
# The date columns were already converted to datetime in In [4], so they compare directly
starts_before_congress = postings_df['job_start_date'] < postings_df['congress_start_date']
date_error_bioguide_ids = postings_df.loc[starts_before_congress, 'bioguide_id'].unique().tolist()
print(date_error_bioguide_ids)
# print a count of how many bioguide_ids are in the list
print(f"Number of bioguide_ids with job_start_date earlier than congress_start_date: {len(date_error_bioguide_ids)}")


['B000223', 'P000152', 'B001251', 'B001270', 'D000482', 'G000582', 'M001165']
Number of bioguide_ids with job_start_date earlier than congress_start_date: 7


In [None]:
# Clean the postings data and correct the date errors.
# Update the posting start and end dates based on the following rules:
# 1. The posting start date cannot be after the posting end date.
# 2. The posting start date cannot be before the session start date.
# 3. The posting end date cannot be after the session end date.
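
One way the three rules above might be applied is sketched below. It is an assumption, not the notebook's confirmed method: rules 2 and 3 are enforced by clamping the posting dates to the congress (session) dates, and rule 1 is only flagged, because the source does not say how a violation should be resolved. The function name apply_date_rules, the start_after_end column and the sample rows are hypothetical.

```python
import pandas as pd

def apply_date_rules(df):
    """Return a copy with posting dates clamped to the congress dates (assumed correction)."""
    out = df.copy()
    start, end = out['job_start_date'], out['job_end_date']
    # Rule 2: a start before the session start is moved to the session start; NaT is kept
    out['job_start_date'] = start.where(
        start.isna() | (start >= out['congress_start_date']), out['congress_start_date'])
    # Rule 3: an end after the session end is moved to the session end; NaT is kept
    out['job_end_date'] = end.where(
        end.isna() | (end <= out['congress_end_date']), out['congress_end_date'])
    # Rule 1: flag rows whose start is still after the end, for manual review
    out['start_after_end'] = out['job_start_date'] > out['job_end_date']
    return out

# Hypothetical rows: one early start, one late end, one with no posting dates
sample = pd.DataFrame({
    'congress_start_date': pd.to_datetime(['2019-01-03'] * 3),
    'congress_end_date': pd.to_datetime(['2021-01-03'] * 3),
    'job_start_date': pd.to_datetime(['2011-01-05', '2019-06-01', None]),
    'job_end_date': pd.to_datetime([None, '2021-05-01', None]),
})
cleaned = apply_date_rules(sample)
print(cleaned)
```

Missing posting dates are left as NaT here rather than filled from the congress dates, since the rules above do not state how gaps should be handled.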