# Project 4: Team 7
## Predicting Congressional Bill Passage
### Extract, Transform, and Load: Data for the Current (118th) Congress
- The data covers all bills before the House and Senate in the current Congress that have not yet been voted on.
- Neural network models for the House and the Senate were trained separately on historical data from the last ten years (113th to 117th Congresses). This notebook cleans and processes the current Congress's bills and uses those saved models to predict whether each bill will pass.

#### Import dependencies and read in data:

In [None]:
# Import Dependencies:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from joblib import dump, load
from keras.models import load_model
import pandas as pd
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

output_filepath = "/content/gdrive/MyDrive/DataClassNotebooks/Project-4/output/"
Resources_filepath ="/content/gdrive/MyDrive/DataClassNotebooks/Project-4/Resources/"

In [None]:
# Read in currentbills.csv: bills currently before Congress, none of which have been voted on or passed.
# Mount Google Drive and load the file from the project folder:
from google.colab import drive
drive.mount('/content/gdrive')
current_filepath = "/content/gdrive/MyDrive/DataClassNotebooks/Project-4/Resources/currentbills.csv"
current_df = pd.read_csv(current_filepath)
# current_df = pd.read_csv('../Resources/currentbills.csv')
# current_df = pd.read_csv('https://raw.githubusercontent.com/JJERANEK/Project-4/main/Resources/currentbills.csv')

# Split the data into raw House and Senate dataframes.
# Note: "." in these patterns is a regex wildcard, not a literal period.
df_house = current_df[current_df['Legislation Number'].str.contains("H.J|H.R.", na=False)]
df_house = df_house.reset_index(drop=True)
df_senate = current_df[current_df['Legislation Number'].str.contains("S.J|S.", na=False)]
df_senate = df_senate.reset_index(drop=True)

# Select only the columns that will be needed:
df_house = df_house[['Legislation Number', "URL",'Congress', 'Title', "Latest Summary", 'Sponsor',
       'Date of Introduction', 'Number of Cosponsors', 'Committees',
       'Latest Action', 'Latest Action Date', 'Subject']]
df_senate = df_senate[['Legislation Number', "URL", 'Congress', 'Title', "Latest Summary", 'Sponsor',
       'Date of Introduction', 'Number of Cosponsors', 'Committees',
       'Latest Action', 'Latest Action Date', 'Subject']]

# Check master df:
current_df.head()

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).




Unnamed: 0,Legislation Number,URL,Congress,Title,Amends Bill,Sponsor,Date Offered,Date of Introduction,Number of Cosponsors,Date Submitted,...,Subject.44,Subject.45,Subject.46,Subject.47,Subject.48,Subject.49,Subject.50,Subject.51,Subject.52,Latest Summary
0,H.R. 1667,https://www.congress.gov/bill/118th-congress/h...,118th Congress (2023-2024),To require the Secretary of Agriculture to ide...,,"Westerman, Bruce [Rep.-R-AR-4]",,3/17/23,0,,...,,,,,,,,,,
1,H.R. 1666,https://www.congress.gov/bill/118th-congress/h...,118th Congress (2023-2024),To amend title XVIII to protect patient access...,,"Wenstrup, Brad R. [Rep.-R-OH-2]",,3/17/23,3,,...,,,,,,,,,,
2,H.R. 1665,https://www.congress.gov/bill/118th-congress/h...,118th Congress (2023-2024),To direct the Secretary of Transportation to e...,,"Velazquez, Nydia M. [Rep.-D-NY-7]",,3/17/23,4,,...,,,,,,,,,,
3,H.R. 1664,https://www.congress.gov/bill/118th-congress/h...,118th Congress (2023-2024),To require the Board of Governors of the Feder...,,"Torres, Ritchie [Rep.-D-NY-15]",,3/17/23,0,,...,,,,,,,,,,
4,H.R. 1663,https://www.congress.gov/bill/118th-congress/h...,118th Congress (2023-2024),To require the Secretary of the Treasury to de...,,"Torres, Ritchie [Rep.-D-NY-15]",,3/17/23,0,,...,,,,,,,,,,
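The patterns passed to `str.contains` in the split above are regular expressions, so each `.` matches any character rather than a literal period. The sketch below uses made-up bill numbers to compare the notebook's House pattern with an escaped version. Under the wildcard reading, a simple resolution such as `H.Res. 10` is also matched. The notebook does not say which behaviour is intended, so this is an illustration rather than a correction.

```python
import pandas as pd

# Illustrative bill numbers (not taken from currentbills.csv).
numbers = pd.Series(["H.R. 1667", "H.J.Res. 5", "H.Res. 10", "H.Con.Res. 2", "S. 875"])

# Pattern as written in the notebook: "." is a wildcard.
loose = numbers.str.contains("H.J|H.R.")
# Escaped pattern: only literal periods match.
strict = numbers.str.contains(r"H\.J|H\.R\.")

print(pd.DataFrame({"number": numbers, "loose": loose, "strict": strict}))
```

The two columns differ only for `H.Res. 10`, where `H.R.` matches the characters `H.Re`.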


In [None]:
# Get the list of cosponsor columns (excluding the count column):
cosponsor_cols = [col for col in current_df.columns if 'Cosponsor' in col]
cosponsor_cols.remove('Number of Cosponsors')
# Create a new df with the cosponsor columns (.copy() so later assignments do not act on a slice of current_df):
cosponsors_df = current_df[cosponsor_cols].copy()
# Add bill number and Congress for identification, plus the number of cosponsors to check that the party counts total correctly:
cosponsors_df.insert(0, "Legislation Number", current_df['Legislation Number'])
cosponsors_df.insert(1, "Congress", current_df['Congress'])
cosponsors_df.insert(2, "Number of Cosponsors", current_df['Number of Cosponsors'])
cosponsors_df.head()

Unnamed: 0,Legislation Number,Congress,Number of Cosponsors,Cosponsor,Cosponsor.1,Cosponsor.2,Cosponsor.3,Cosponsor.4,Cosponsor.5,Cosponsor.6,...,Cosponsor.224,Cosponsor.225,Cosponsor.226,Cosponsor.227,Cosponsor.228,Cosponsor.229,Cosponsor.230,Cosponsor.231,Cosponsor.232,Cosponsor.233
0,H.R. 1667,118th Congress (2023-2024),0,,,,,,,,...,,,,,,,,,,
1,H.R. 1666,118th Congress (2023-2024),3,"Carter, Earl L. ""Buddy"" [Rep.-R-GA-1]","Sewell, Terri A. [Rep.-D-AL-7]","Tonko, Paul [Rep.-D-NY-20]",,,,,...,,,,,,,,,,
2,H.R. 1665,118th Congress (2023-2024),4,"Espaillat, Adriano [Rep.-D-NY-13]","Meng, Grace [Rep.-D-NY-6]","Goldman, Daniel S. [Rep.-D-NY-10]","Clarke, Yvette D. [Rep.-D-NY-9]",,,,...,,,,,,,,,,
3,H.R. 1664,118th Congress (2023-2024),0,,,,,,,,...,,,,,,,,,,
4,H.R. 1663,118th Congress (2023-2024),0,,,,,,,,...,,,,,,,,,,


In [None]:
# Count Democratic cosponsors per row and add to df (only the cosponsor columns are searched):
cosponsor_dems = cosponsors_df[cosponsor_cols].astype(str).apply(lambda x: x.str.contains('-D-')).sum(axis=1)
cosponsors_df['Cosponsor Dems'] = cosponsor_dems
# Count Republican cosponsors per row and add to df:
cosponsor_reps = cosponsors_df[cosponsor_cols].astype(str).apply(lambda x: x.str.contains('-R-')).sum(axis=1)
cosponsors_df['Cosponsor Reps'] = cosponsor_reps
# Count Independent cosponsors per row and add to df:
cosponsor_ind = cosponsors_df[cosponsor_cols].astype(str).apply(lambda x: x.str.contains('-I-')).sum(axis=1)
cosponsors_df['Cosponsor Ind'] = cosponsor_ind



In [None]:
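# Get the state for each cosponsor (the third "-"-separated field, e.g. "GA" in "[Rep.-R-GA-1]").
# Assign the result back to the column; calling .update() on a selected column may not modify the df:
for col in cosponsor_cols:
    cosponsors_df[col] = cosponsors_df[col].str.split('-').str[2]
cosponsors_df.head()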



Unnamed: 0,Legislation Number,Congress,Number of Cosponsors,Cosponsor,Cosponsor.1,Cosponsor.2,Cosponsor.3,Cosponsor.4,Cosponsor.5,Cosponsor.6,...,Cosponsor.227,Cosponsor.228,Cosponsor.229,Cosponsor.230,Cosponsor.231,Cosponsor.232,Cosponsor.233,Cosponsor Dems,Cosponsor Reps,Cosponsor Ind
0,H.R. 1667,118th Congress (2023-2024),0,,,,,,,,...,,,,,,,,0,0,0
1,H.R. 1666,118th Congress (2023-2024),3,GA,AL,NY,,,,,...,,,,,,,,2,1,0
2,H.R. 1665,118th Congress (2023-2024),4,NY,NY,NY,NY,,,,...,,,,,,,,4,0,0
3,H.R. 1664,118th Congress (2023-2024),0,,,,,,,,...,,,,,,,,0,0,0
4,H.R. 1663,118th Congress (2023-2024),0,,,,,,,,...,,,,,,,,0,0,0
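Splitting on `-` and taking the third piece assumes the member's name itself contains no hyphen. For a hyphenated surname the third piece would be the party letter rather than the state. The sketch below shows one possible alternative that reads the state from the bracketed tag instead. The helper name `extract_state` and the sample rows are illustrative and not part of the notebook, and any change here would need to match however the training data was prepared.

```python
import pandas as pd

def extract_state(members: pd.Series) -> pd.Series:
    """Return the two-letter state from tags like '[Rep.-D-NY-14]' or '[Sen.-R-FL]'."""
    # Anchor on the opening bracket so hyphens in the member's name are ignored.
    return members.str.extract(r"\[[^\]-]+-[^\]-]+-([A-Z]{2})", expand=False)

# Illustrative rows in the same format as the cosponsor columns.
sample = pd.Series([
    'Carter, Earl L. "Buddy" [Rep.-R-GA-1]',
    "Ocasio-Cortez, Alexandria [Rep.-D-NY-14]",  # hyphenated surname
    "Rubio, Marco [Sen.-R-FL]",                  # state is the last field
    None,
])

naive = sample.str.split("-").str[2]   # split-and-index approach from the cell above
robust = extract_state(sample)
print(pd.DataFrame({"naive": naive, "robust": robust}))
```

With the regex approach there is also no trailing `]` to strip afterwards.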


In [None]:
# Remove any remaining closing brackets (e.g. "FL]" when the state is the last field):
cosponsors_df[cosponsor_cols] = cosponsors_df[cosponsor_cols].replace({r'\]': ''}, regex=True)
# Get the count of unique cosponsor states per bill:
cosponsor_states = cosponsors_df[cosponsor_cols].nunique(axis=1)
cosponsors_df['Cosponsor States'] = cosponsor_states



In [None]:
# Create clean df with cosponsor counts
clean_cosponsor_df = cosponsors_df[['Legislation Number','Congress','Number of Cosponsors','Cosponsor Dems','Cosponsor Reps','Cosponsor Ind', 'Cosponsor States']].reset_index(drop=True)
# Join clean cosponsor df with house df
house_df = pd.merge(clean_cosponsor_df, df_house, how='inner', on=['Legislation Number', 'Congress'])
house_df = house_df.drop(columns='Number of Cosponsors_y').rename(columns={'Number of Cosponsors_x': 'Number of Cosponsors'})
# Join clean cosponsor df with senate df
senate_df = pd.merge(clean_cosponsor_df, df_senate, how='inner', on=['Legislation Number', 'Congress'])
senate_df = senate_df.drop(columns='Number of Cosponsors_y').rename(columns={'Number of Cosponsors_x': 'Number of Cosponsors'})
# Concat house and senate dfs to finish cleaning
frames = [house_df, senate_df]
congress_df = pd.concat(frames).reset_index(drop=True)
congress_df.head()

Unnamed: 0,Legislation Number,Congress,Number of Cosponsors,Cosponsor Dems,Cosponsor Reps,Cosponsor Ind,Cosponsor States,URL,Title,Latest Summary,Sponsor,Date of Introduction,Committees,Latest Action,Latest Action Date,Subject
0,H.R. 1667,118th Congress (2023-2024),0,0,0,0,0,https://www.congress.gov/bill/118th-congress/h...,To require the Secretary of Agriculture to ide...,,"Westerman, Bruce [Rep.-R-AR-4]",3/17/23,House - Natural Resources,Referred to the House Committee on Natural Res...,3/17/23,
1,H.R. 1666,118th Congress (2023-2024),3,2,1,0,3,https://www.congress.gov/bill/118th-congress/h...,To amend title XVIII to protect patient access...,,"Wenstrup, Brad R. [Rep.-R-OH-2]",3/17/23,"House - Energy and Commerce, Ways and Means",Referred to the Committee on Energy and Commer...,3/17/23,
2,H.R. 1665,118th Congress (2023-2024),4,4,0,0,1,https://www.congress.gov/bill/118th-congress/h...,To direct the Secretary of Transportation to e...,,"Velazquez, Nydia M. [Rep.-D-NY-7]",3/17/23,House - Transportation and Infrastructure,Referred to the House Committee on Transportat...,3/17/23,
3,H.R. 1664,118th Congress (2023-2024),0,0,0,0,0,https://www.congress.gov/bill/118th-congress/h...,To require the Board of Governors of the Feder...,,"Torres, Ritchie [Rep.-D-NY-15]",3/17/23,House - Financial Services,Referred to the House Committee on Financial S...,3/17/23,
4,H.R. 1663,118th Congress (2023-2024),0,0,0,0,0,https://www.congress.gov/bill/118th-congress/h...,To require the Secretary of the Treasury to de...,,"Torres, Ritchie [Rep.-D-NY-15]",3/17/23,House - Financial Services,Referred to the House Committee on Financial S...,3/17/23,
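An inner merge on `Legislation Number` and `Congress` will quietly multiply rows if a key appears more than once on either side. As a hedged sketch, the `validate` argument of `pd.merge` can make that assumption explicit. The toy frames below stand in for `clean_cosponsor_df` and `df_house`; whether duplicates actually occur in `currentbills.csv` is not known from the notebook.

```python
import pandas as pd

keys = ["Legislation Number", "Congress"]

# Toy stand-ins for clean_cosponsor_df and df_house.
counts = pd.DataFrame({"Legislation Number": ["H.R. 1", "H.R. 2", "S. 1"],
                       "Congress": ["118th", "118th", "118th"],
                       "Cosponsor Dems": [2, 0, 1]})
house = pd.DataFrame({"Legislation Number": ["H.R. 1", "H.R. 2"],
                      "Congress": ["118th", "118th"],
                      "Title": ["A", "B"]})

# validate="one_to_one" raises MergeError if either side has duplicate keys.
merged = pd.merge(counts, house, how="inner", on=keys, validate="one_to_one")
print(len(merged))

# The same merge with a duplicated bill fails loudly instead of adding rows.
dupes = pd.concat([house, house.iloc[[0]]], ignore_index=True)
try:
    pd.merge(counts, dupes, how="inner", on=keys, validate="one_to_one")
except pd.errors.MergeError as err:
    print("duplicate keys detected:", err)
```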


In [None]:
# Strip the digits from "Legislation Number" to create "Bill Type" (regex=True is required for the pattern to be treated as a regex):
congress_df.insert(0, "Bill Type", congress_df['Legislation Number'].str.replace(r'\d+', '', regex=True))
# congress_df = congress_df.rename(columns = {"Legislation Number": "Bill Type"})
# Keep only the numeric part of "Legislation Number":
congress_df['Legislation Number'] = congress_df['Legislation Number'].str.extract(r'(\d+)', expand=False)
# Cast as int:
congress_df['Legislation Number'] = congress_df['Legislation Number'].astype(int)
# Keep only the Congress number (the first three characters, e.g. "118"):
congress_df['Congress'] = congress_df['Congress'].str[:3]
# Cast as int:
congress_df['Congress'] = congress_df['Congress'].astype(int)



In [None]:
# Extract the bracketed tag (title, party, state) from "Sponsor" into a temporary column:
sponsor_split = congress_df["Sponsor"].str.split("[", n=1, expand=True)
congress_df['Sponsor Split'] = sponsor_split[1]
congress_df = congress_df.drop(columns=["Sponsor"])
# Split the tag into sponsor title, party, and state columns:
sponsor_parts = congress_df["Sponsor Split"].str.split("-", n=3, expand=True)
congress_df['Sponsor Title'] = sponsor_parts[0]
congress_df['Sponsor Party'] = sponsor_parts[1]
congress_df['Sponsor State'] = sponsor_parts[2]
congress_df = congress_df.drop(columns=['Sponsor Split'])

In [None]:
# Create the month of bill introduction:
congress_df['Date of Introduction'] = pd.to_datetime(congress_df['Date of Introduction'])
congress_df['Month Introduced'] = congress_df['Date of Introduction'].dt.month
# Drop columns that are no longer needed:
congress_df = congress_df.drop(columns=['Date of Introduction', 'Latest Action Date'])
# Remove the trailing bracket left in the "Sponsor State" column:
congress_df['Sponsor State'] = congress_df['Sponsor State'].replace({r'\]': ''}, regex=True)

## Save the whole Congress dataset up to this point:

In [None]:
# Save whole cleaned dataset:
# congress_df.to_csv(f'{Resources_filepath}/congress_predict_clean.csv')

### Split the data by House and Senate:

In [None]:
# Split into House and Senate dfs (.copy() so new columns can be added without acting on a slice of congress_df):
house_cleaned = congress_df[congress_df['Bill Type'].str.contains("H.J|H", na=False)].copy()
senate_cleaned = congress_df[congress_df['Bill Type'].str.contains("S.J|S.", na=False)].copy()
senate_cleaned = senate_cleaned.reset_index(drop=True)
house_cleaned.head()

Unnamed: 0,Bill Type,Legislation Number,Congress,Number of Cosponsors,Cosponsor Dems,Cosponsor Reps,Cosponsor Ind,Cosponsor States,URL,Title,Latest Summary,Committees,Latest Action,Subject,Sponsor Title,Sponsor Party,Sponsor State,Month Introduced
0,H.R.,1667,118,0,0,0,0,0,https://www.congress.gov/bill/118th-congress/h...,To require the Secretary of Agriculture to ide...,,House - Natural Resources,Referred to the House Committee on Natural Res...,,Rep.,R,AR,3
1,H.R.,1666,118,3,2,1,0,3,https://www.congress.gov/bill/118th-congress/h...,To amend title XVIII to protect patient access...,,"House - Energy and Commerce, Ways and Means",Referred to the Committee on Energy and Commer...,,Rep.,R,OH,3
2,H.R.,1665,118,4,4,0,0,1,https://www.congress.gov/bill/118th-congress/h...,To direct the Secretary of Transportation to e...,,House - Transportation and Infrastructure,Referred to the House Committee on Transportat...,,Rep.,D,NY,3
3,H.R.,1664,118,0,0,0,0,0,https://www.congress.gov/bill/118th-congress/h...,To require the Board of Governors of the Feder...,,House - Financial Services,Referred to the House Committee on Financial S...,,Rep.,D,NY,3
4,H.R.,1663,118,0,0,0,0,0,https://www.congress.gov/bill/118th-congress/h...,To require the Secretary of the Treasury to de...,,House - Financial Services,Referred to the House Committee on Financial S...,,Rep.,D,NY,3


## House data cleaning:

### Committees:

In [None]:
# Committee column recoding to indicator variables:
# Create a list of committees for the House:
house_committees_lst = ["Agriculture", "Appropriations", "Armed Services", "Budget", "Education and the Workforce", "Energy and Commerce", "Ethics", "Financial Services", 
                        "Foreign Affairs", "Homeland Security", "House Administration", "Judiciary", "Natural Resources", 
                        "Oversight and Accountability", "Rules", "Science, Space, and Technology", "Small Business", "Transportation and Infrastructure", 
                        "Veterans' Affairs", "Ways and Means", "Intelligence", "Printing", "Taxation", "Library", "Economic"]
# Loop over the committees, adding a 0/1 indicator column for each (case=False makes str.contains case-insensitive):
for comm in house_committees_lst:
    house_cleaned[comm] = np.where(house_cleaned['Committees'].str.contains(comm, case=False), 1, 0)


## Subject:

In [None]:
# Subject column recoding to indicator variables:
# Create a list of the Subjects copied from etl_congress.ipynb for the house data:
house_subject_lst = ['Accounting and auditing',
 'Administrative law and regulatory procedures',
 'Administrative remedies',
 'Advisory bodies',
 'Agriculture and Food',
 'Appropriations',
 'Armed Forces and National Security',
 'Civil actions and liability',
 'Commerce',
 'Congress',
 'Congressional oversight',
 'Congressional tributes',
 'Crime and Law Enforcement',
 'Education',
 'Emergency Management',
 'Energy',
 'Environmental Protection',
 'Finance and Financial Sector',
 'Government Operations and Politics',
 'Health',
 'Housing and Community Development',
 'Immigration',
 'International Affairs',
 'Labor and Employment',
 'Native Americans',
 'Public Lands and Natural Resources',
 'Science, Technology, Communications',
 'Social Welfare',
 'Taxation',
 'Transportation and Public Works']

 
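# Loop over the subjects, adding a 0/1 indicator column for each (case=False makes str.contains case-insensitive).
# Note: str.contains returns NaN where "Subject" is missing, and np.where treats NaN as truthy, so those rows are coded 1.
# Subject names that match existing columns ('Congress', 'Appropriations', 'Taxation') overwrite those columns.
for sub in house_subject_lst:
    house_cleaned[sub] = np.where(house_cleaned['Subject'].str.contains(sub, case=False), 1, 0)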


In [None]:
# Keep a copy of the columns needed for the MongoDB upload; the model's
# predictions are appended to this df later:
house_predictions = house_cleaned[["Bill Type", "Legislation Number", "Title", "Latest Summary",
                                   "Congress", "Sponsor Party", "Sponsor State", "Number of Cosponsors", 
                                   "Cosponsor Dems", "Cosponsor Reps", "Cosponsor Ind", "Month Introduced",
                                   "Subject", "Committees", "URL"]].copy()
# Drop the columns that are not model features:
house_cleaned = house_cleaned.drop(["Bill Type", "Legislation Number", "Title", "Subject", 
                                    "Committees", "Latest Action" , "Latest Summary", "URL"], axis='columns')
house_cleaned.head()


Unnamed: 0,Congress,Number of Cosponsors,Cosponsor Dems,Cosponsor Reps,Cosponsor Ind,Cosponsor States,Sponsor Title,Sponsor Party,Sponsor State,Month Introduced,...,Health,Housing and Community Development,Immigration,International Affairs,Labor and Employment,Native Americans,Public Lands and Natural Resources,"Science, Technology, Communications",Social Welfare,Transportation and Public Works
0,1,0,0,0,0,0,Rep.,R,AR,3,...,1,1,1,1,1,1,1,1,1,1
1,1,3,2,1,0,3,Rep.,R,OH,3,...,1,1,1,1,1,1,1,1,1,1
2,1,4,4,0,0,1,Rep.,D,NY,3,...,1,1,1,1,1,1,1,1,1,1
3,1,0,0,0,0,0,Rep.,D,NY,3,...,1,1,1,1,1,1,1,1,1,1
4,1,0,0,0,0,0,Rep.,D,NY,3,...,1,1,1,1,1,1,1,1,1,1
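In the preview above, every subject indicator is 1 and `Congress` shows 1 rather than 118. Both are consistent with how the subject loop behaves when `Subject` is empty: `str.contains` returns NaN for missing values, `np.where` treats NaN as truthy, and the subject named `Congress` writes over the existing `Congress` column. The sketch below reproduces this on a toy frame and shows one possible alternative using `na=False` and a hypothetical `subj_` prefix. Column names must match what the model was trained on, so this is an illustration of the mechanism, not a drop-in change.

```python
import numpy as np
import pandas as pd

# Toy frame: one bill with a subject, one without (NaN).
bills = pd.DataFrame({
    "Congress": [118, 118],
    "Subject": pd.Series(["Health", np.nan], dtype=object),
})
subjects = ["Health", "Congress"]  # "Congress" is also an existing column name

for sub in subjects:
    hit = bills["Subject"].str.contains(sub, case=False, regex=False)
    # NaN is truthy in np.where, so the missing-subject row is coded 1.
    bills[f"naive_{sub}"] = np.where(hit, 1, 0)
    # na=False codes missing subjects as 0; the prefix avoids overwriting "Congress".
    safe_hit = bills["Subject"].str.contains(sub, case=False, regex=False, na=False)
    bills[f"subj_{sub}"] = safe_hit.astype(int)

print(bills)
```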


In [None]:
# Convert categorical data to numeric with `pd.get_dummies`
house_cleaned = pd.get_dummies(house_cleaned,dtype=float)
house_cleaned.head()

Unnamed: 0,Congress,Number of Cosponsors,Cosponsor Dems,Cosponsor Reps,Cosponsor Ind,Cosponsor States,Month Introduced,Agriculture,Appropriations,Armed Services,...,Sponsor State_TN,Sponsor State_TX,Sponsor State_UT,Sponsor State_VA,Sponsor State_VI,Sponsor State_VT,Sponsor State_WA,Sponsor State_WI,Sponsor State_WV,Sponsor State_WY
0,1,0,0,0,0,0,3,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1,3,2,1,0,3,3,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,4,4,0,0,1,3,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,0,0,0,0,0,3,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1,0,0,0,0,0,3,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Read in saved training data:
house_historical_df = pd.read_csv(f'{Resources_filepath}/house_processed.csv' )


In [None]:
# Compare the training data to the current bills and find the positions of the missing columns:
missing_cols_lst = house_historical_df.columns.difference(house_cleaned.columns).tolist()
missing_cols_index = [house_historical_df.columns.get_loc(col) for col in missing_cols_lst]
print(missing_cols_lst)
print(missing_cols_index)
# 'bill_passed' (position 59) is the target and is not added back. Because it precedes the
# other missing columns in the training data, each of their positions shifts down by 1:
# 'Sponsor Party_I' = 63, 'Sponsor Party_L' = 64, 'Sponsor State_AK' = 66, 'Sponsor State_AS' = 69
# Add the missing columns back at those positions, filled with 0's:
house_cleaned.insert(63, 'Sponsor Party_I', 0)
house_cleaned.insert(64, 'Sponsor Party_L', 0)
house_cleaned.insert(66, 'Sponsor State_AK', 0)
house_cleaned.insert(69, 'Sponsor State_AS', 0)

['Sponsor Party_I', 'Sponsor Party_L', 'Sponsor State_AK', 'Sponsor State_AS', 'bill_passed']
[64, 65, 67, 70, 59]
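The hard-coded insert positions depend on which categories happen to be absent from the current bills, so they would need to be recomputed whenever the input changes. A possible alternative, sketched here on toy frames standing in for `house_historical_df` and `house_cleaned`, is to reindex the current data to the training feature columns. Note that `reindex` also drops any column that is not in the training data, which may or may not be desirable.

```python
import pandas as pd

# Hypothetical stand-ins for house_historical_df and house_cleaned.
train_df = pd.DataFrame(columns=["Congress", "bill_passed", "Sponsor Party_D",
                                 "Sponsor Party_I", "Sponsor Party_R"])
current = pd.DataFrame({"Congress": [118, 118],
                        "Sponsor Party_D": [1.0, 0.0],
                        "Sponsor Party_R": [0.0, 1.0]})

# Feature columns in training order, minus the target.
feature_cols = [c for c in train_df.columns if c != "bill_passed"]

# reindex adds each missing column filled with 0 and enforces the training order.
aligned = current.reindex(columns=feature_cols, fill_value=0)
print(aligned.columns.tolist())
```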


In [None]:
# Save clean/processed house data:
# house_cleaned.to_csv(f'{Resources_filepath}/house_processed_currentbills.csv' ,index=False)

In [None]:
# Load the scaler saved with the House model (house_scaler.bin):
output_filepath = "/content/gdrive/MyDrive/DataClassNotebooks/Project-4/output"
house_scaler = load(f'{output_filepath}/house_scaler.bin')
# house_scaler = load('../output/house_scaler.bin')
# Scale house_cleaned with the saved scaler (transform only; the scaler is not refit):
house_scaled = house_scaler.transform(house_cleaned)




In [None]:
# Load in saved house_model.h5:
house_model = load_model(f'{output_filepath}/house_model.h5')

In [None]:
# Make predictions with the saved house_model on the scaled House data (house_scaled) and
# append them to house_predictions, the df that will be uploaded to MongoDB for the server side.
# The model is run once and its output flattened to one value per bill:
house_probs = house_model.predict(house_scaled).ravel()
house_predictions["house_probabilities"] = house_probs
house_predictions["house_predictions"] = (house_probs > 0.5).astype("int32")
house_predictions["house_predictions"].unique()



array([0, 1], dtype=int32)
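The probability and the 0/1 label both come from the same model output, so the output only needs to be computed once and then thresholded. The sketch below uses a small NumPy array as a stand-in for the result of `predict`, assuming the model ends in a single sigmoid unit and therefore returns an array of shape `(n, 1)`. The values are made up.

```python
import numpy as np

# Stand-in for model.predict(scaled_features): one probability per bill, shape (n, 1).
raw = np.array([[0.03], [0.72], [0.50], [0.91]])

probs = raw.ravel()                      # flatten to shape (n,)
labels = (probs > 0.5).astype("int32")   # strict threshold: exactly 0.5 maps to 0
print(probs.shape, labels)
```

Flattening first also keeps the new DataFrame column one-dimensional.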

In [None]:
house_predictions["house_predictions"].value_counts()

0    1586
1     110
Name: house_predictions, dtype: int64
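The share of bills predicted to pass is the number of positive predictions divided by the total number of bills, not by the number of negative predictions. As a small worked example using the counts shown above:

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({0: 1586, 1: 110})

share_passing = counts[1] / counts.sum()   # positives over all bills
pass_to_fail = counts[1] / counts[0]       # positives over negatives (a ratio, not a share)
print(f"{share_passing:.2%} of bills predicted to pass; pass-to-fail ratio {pass_to_fail:.2%}")
```

On the DataFrame itself, the mean of the 0/1 prediction column gives the same share without hard-coding the counts.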

In [None]:
# Print the share of current House bills predicted to pass (110 of 1,696):
print(f'Predicting {round((110/(1586+110))*100,2)}% passing for current House bills')

Predicting 6.49% passing for current House bills


In [None]:
# Clean up column names.
# Note: "Legislation Number" is mapped to an empty string, so that column has a blank header in the saved output.
house_predictions = house_predictions.rename(columns = {"Bill Type": "bill_id", 
                                                         "Legislation Number":"", "Title":"title", "Latest Summary":"summary",
                                   "Congress":"congress_num", "Sponsor Party":"sponsor_party", 
                                   "Sponsor State":"sponsor_state", "Number of Cosponsors":"cosponsors_total", 
                                   "Cosponsor Dems":"cosponsors_dem", "Cosponsor Reps":"cosponsors_rep", 
                                   "Cosponsor Ind":"cosponsors_ind", "Month Introduced":"month_introduced",
                                   "Subject":"subject", "Committees":"committees", "URL":"url"})
# Convert "house_probabilities" to a percent string (e.g. "6.49%"):
house_predictions["house_probabilities"] = house_predictions["house_probabilities"].astype(float).map("{:.2%}".format)
house_predictions.head()

## Save clean house_predictions dataset with prediction and probability columns: 

In [None]:
# Save house cleaned dataset using current Congress and added predictions:
house_predictions.to_csv(f'{output_filepath}/house_predict_current.csv', index=False)

## Senate data cleaning:

In [None]:
# Committee column recoding to indicator variables:
# Create a list of committees for the Senate:
senate_committees_lst = ["Agriculture, Nutrition, and Forestry", "Appropriations", "Armed Services", "Banking, Housing, and Urban Affairs", "Budget", 
                         "Commerce, Science, and Transportation", "Energy and Natural Resources", "Environment and Public Works", "Finance", 
                         "Foreign Relations", "Health, Education, Labor, and Pensions", "Homeland Security and Governmental Affairs","Judiciary", 
                         "Rules and Administration", "Small Business and Entrepreneurship", "Veterans Affairs", "International Narcotics Control", 
                         "Ethics", "Indian Affairs", "Intelligence", "Printing", "Taxation", "Library", "Economic"]
# Loop over the committees, adding a 0/1 indicator column for each (case=False makes str.contains case-insensitive):
for comm in senate_committees_lst:
    senate_cleaned[comm] = np.where(senate_cleaned['Committees'].str.contains(comm, case=False), 1, 0)

senate_cleaned.head()

Unnamed: 0,Bill Type,Legislation Number,Congress,Number of Cosponsors,Cosponsor Dems,Cosponsor Reps,Cosponsor Ind,Cosponsor States,URL,Title,...,Small Business and Entrepreneurship,Veterans Affairs,International Narcotics Control,Ethics,Indian Affairs,Intelligence,Printing,Taxation,Library,Economic
0,S.,875,118,1,0,1,0,1,https://www.congress.gov/bill/118th-congress/s...,A bill to prohibit the receipt of Federal fund...,...,0,0,0,0,0,0,0,0,0,0
1,S.,874,118,1,0,1,0,1,https://www.congress.gov/bill/118th-congress/s...,A bill to direct the Secretary of Labor to mod...,...,0,0,0,0,0,0,0,0,0,0
2,S.,873,118,1,0,1,0,1,https://www.congress.gov/bill/118th-congress/s...,"A bill to improve recreation opportunities on,...",...,0,0,0,0,0,0,0,0,0,0
3,S.,872,118,0,0,0,0,0,https://www.congress.gov/bill/118th-congress/s...,A bill to identify social media entities under...,...,0,0,0,0,0,0,0,0,0,0
4,S.,871,118,5,2,3,0,5,https://www.congress.gov/bill/118th-congress/s...,A bill to amend section 7014 of the Elementary...,...,0,0,0,0,0,0,0,0,0,0


## Subject:


In [None]:
# Subject column recoding to indicator variables:
# Create a list of the Subjects copied from etl_congress.ipynb for the Senate data:
senate_subject_lst = ['Academic performance and assessments',
 'Accounting and auditing',
 'Administrative law and regulatory procedures',
 'Administrative remedies',
 'Advisory bodies',
 'Agriculture and Food',
 'Alternative and renewable resources',
 'Appropriations',
 'Armed Forces and National Security',
 'Civil actions and liability',
 'Commerce',
 'Congressional oversight',
 'Crime and Law Enforcement',
 'Economics and Public Finance',
 'Education',
 'Emergency Management',
 'Energy',
 'Environmental Protection',
 'Finance and Financial Sector',
 'Foreign Trade and International Finance',
 'Government Operations and Politics',
 'Health',
 'Housing and Community Development',
 'Immigration',
 'International Affairs',
 'Labor and Employment',
 'Native Americans',
 'Public Lands and Natural Resources',
 'Science, Technology, Communications',
 'Social Welfare',
 'Taxation',
 'Transportation and Public Works']
 
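# Loop over the subjects, adding a 0/1 indicator column for each (case=False makes str.contains case-insensitive).
# Note: str.contains returns NaN where "Subject" is missing, and np.where treats NaN as truthy, so those rows are coded 1.
# Subject names that match existing columns ('Appropriations', 'Taxation') overwrite those columns.
for sub in senate_subject_lst:
    senate_cleaned[sub] = np.where(senate_cleaned['Subject'].str.contains(sub, case=False), 1, 0)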
    


In [None]:
# Keep a copy of the columns needed for the MongoDB upload; the model's
# predictions are appended to this df later:
senate_predictions = senate_cleaned[["Bill Type", "Legislation Number", "Title", "Latest Summary",
                                   "Congress", "Sponsor Party", "Sponsor State", "Number of Cosponsors", 
                                   "Cosponsor Dems", "Cosponsor Reps", "Cosponsor Ind", "Month Introduced",
                                   "Subject", "Committees", "URL"]].copy()
# Drop the columns that are not model features:
senate_cleaned = senate_cleaned.drop(["Bill Type", "Legislation Number", "Title", "Subject", 
                                    "Committees", "Latest Action" , "Latest Summary", "URL"], axis='columns')
senate_cleaned.head()


Unnamed: 0,Congress,Number of Cosponsors,Cosponsor Dems,Cosponsor Reps,Cosponsor Ind,Cosponsor States,Sponsor Title,Sponsor Party,Sponsor State,Month Introduced,...,Health,Housing and Community Development,Immigration,International Affairs,Labor and Employment,Native Americans,Public Lands and Natural Resources,"Science, Technology, Communications",Social Welfare,Transportation and Public Works
0,118,1,0,1,0,1,Sen.,R,FL,3,...,1,1,1,1,1,1,1,1,1,1
1,118,1,0,1,0,1,Sen.,D,GA,3,...,1,1,1,1,1,1,1,1,1,1
2,118,1,0,1,0,1,Sen.,D,WV,3,...,1,1,1,1,1,1,1,1,1,1
3,118,0,0,0,0,0,Sen.,R,AR,3,...,1,1,1,1,1,1,1,1,1,1
4,118,5,2,3,0,5,Sen.,D,NM,3,...,1,1,1,1,1,1,1,1,1,1


In [None]:
# Convert categorical data to numeric with `pd.get_dummies`
senate_cleaned = pd.get_dummies(senate_cleaned,dtype=float)
senate_cleaned.head()
# Creates a df of 115 columns


Unnamed: 0,Congress,Number of Cosponsors,Cosponsor Dems,Cosponsor Reps,Cosponsor Ind,Cosponsor States,Month Introduced,"Agriculture, Nutrition, and Forestry",Appropriations,Armed Services,...,Sponsor State_SD,Sponsor State_TN,Sponsor State_TX,Sponsor State_UT,Sponsor State_VA,Sponsor State_VT,Sponsor State_WA,Sponsor State_WI,Sponsor State_WV,Sponsor State_WY
0,118,1,0,1,0,1,3,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,118,1,0,1,0,1,3,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,118,1,0,1,0,1,3,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,118,0,0,0,0,0,3,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,118,5,2,3,0,5,3,0,1,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [None]:
# Read in saved training data:
senate_historical_df = pd.read_csv(f'{Resources_filepath}/senate_processed.csv' )

In [None]:
# Compare training data to current bills and find the index of missing columns:
missing_cols_lst_sen = senate_historical_df.columns.difference(senate_cleaned.columns).tolist()
missing_cols_index_sen = [senate_historical_df.columns.get_loc(col) for col in missing_cols_lst_sen]
print(missing_cols_lst_sen)
print(missing_cols_index_sen)

['bill_passed']
[61]
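Only `bill_passed` is missing here, but `columns.difference` compares names and says nothing about column order, which the saved scaler and model also depend on. A hedged sketch of a stricter check is shown below. The helper `check_feature_alignment` is hypothetical, and the toy frames stand in for `senate_historical_df` and `senate_cleaned`.

```python
import pandas as pd

def check_feature_alignment(train_df: pd.DataFrame, current_df: pd.DataFrame,
                            target: str = "bill_passed") -> None:
    """Raise ValueError if current_df's columns differ from the training features in name or order."""
    expected = [c for c in train_df.columns if c != target]
    actual = list(current_df.columns)
    if expected != actual:
        missing = sorted(set(expected) - set(actual))
        extra = sorted(set(actual) - set(expected))
        raise ValueError(f"Column mismatch. Missing: {missing}; extra: {extra}; "
                         f"order differs only: {not missing and not extra}")

# Toy frames standing in for the training data and the current bills.
train = pd.DataFrame(columns=["Congress", "bill_passed", "Sponsor Party_D", "Sponsor Party_R"])
ok = pd.DataFrame(columns=["Congress", "Sponsor Party_D", "Sponsor Party_R"])
swapped = pd.DataFrame(columns=["Congress", "Sponsor Party_R", "Sponsor Party_D"])

check_feature_alignment(train, ok)   # passes silently
try:
    check_feature_alignment(train, swapped)
except ValueError as err:
    print(err)
```

Running a check like this just before `transform` would surface a reordered or renamed column instead of letting it pass through as a mis-scaled feature.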


In [None]:
# Save clean/processed Senate data:
# senate_cleaned.to_csv(f'{Resources_filepath}/senate_processed_currentbills.csv' ,index=False)

In [None]:
# Load the scaler saved with the Senate model (senate_scaler.bin):
output_filepath = "/content/gdrive/MyDrive/DataClassNotebooks/Project-4/output"
senate_scaler = load(f'{output_filepath}/senate_scaler.bin')
# senate_scaler = load('../output/senate_scaler.bin')
# Scale senate_cleaned with the saved scaler (transform only; the scaler is not refit):
senate_scaled = senate_scaler.transform(senate_cleaned)



In [None]:
# Load in saved senate_model.h5:
senate_model = load_model(f'{output_filepath}/senate_model.h5')

In [None]:
# Make predictions with the saved senate_model on the scaled Senate data (senate_scaled).
# The model is run once and its output flattened to one value per bill:
senate_probs = senate_model.predict(senate_scaled).ravel()
senate_predictions["senate_probabilities"] = senate_probs
senate_predictions["senate_predictions"] = (senate_probs > 0.5).astype("int32")
senate_predictions["senate_predictions"].unique()



array([0, 1], dtype=int32)

In [None]:
senate_predictions["senate_predictions"].value_counts()

0    867
1     23
Name: senate_predictions, dtype: int64

In [None]:
# Print the share of current Senate bills predicted to pass (23 of 890):
print(f'Predicting {round((23/(867+23))*100,2)}% passing for current Senate bills')


Predicting 2.58% passing for current Senate bills


In [None]:
# Clean up column names.
# Note: "Legislation Number" is mapped to an empty string, so that column has a blank name.
senate_predictions = senate_predictions.rename(columns = {"Bill Type": "bill_id", 
                                                         "Legislation Number":"", "Title":"title", "Latest Summary":"summary",
                                   "Congress":"congress_num", "Sponsor Party":"sponsor_party", 
                                   "Sponsor State":"sponsor_state", "Number of Cosponsors":"cosponsors_total", 
                                   "Cosponsor Dems":"cosponsors_dem", "Cosponsor Reps":"cosponsors_rep", 
                                   "Cosponsor Ind":"cosponsors_ind", "Month Introduced":"month_introduced",
                                   "Subject":"subject", "Committees":"committees", "URL":"url"})
# Convert "senate_probabilities" to a percent string (e.g. "2.58%"):
senate_predictions["senate_probabilities"] = senate_predictions["senate_probabilities"].astype(float).map("{:.2%}".format)
senate_predictions.head()

## Save clean senate dataset: 

In [None]:
# Save senate_predictions dataset using current Congress and added predictions:
# senate_predictions.to_csv(f'{output_filepath}/senate_predict_current.csv', index=False)