
# **Project 4: Predicting Congressional Bill Passage**

# House Model: Machine Learning Optimization and Model Output

## Team 7


## Import dependencies and read in data:

In [5]:
# Import our dependencies
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from joblib import dump, load
import pandas as pd
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

In [None]:
#  Import and read the cleaned house data from AWS S3:
# house_df = pd.read_csv("https://project-4-team7.s3.ca-central-1.amazonaws.com/cleaned_house.csv")
# house_df.head()

In [6]:
# Import from google drive folder:
# Mount google drive to get data:
from google.colab import drive
drive.mount('/content/gdrive')
house_filepath = "/content/gdrive/MyDrive/DataClassNotebooks/Project-4/Resources/house_cleaned.csv"

# Read in house data using pandas:
house_df = pd.read_csv(house_filepath)
# Glimpse house data:
house_df.head()

Mounted at /content/gdrive


Unnamed: 0,Bill Type,Congress,Number of Cosponsors,Cosponsor Dems,Cosponsor Reps,Cosponsor Ind,Cosponsor States,Committees,Latest Action,Subject,...,Transportation and Infrastructure,Veterans' Affairs,Ways and Means,Intelligence,Printing,Taxation,Library,Economic,bill_passed,bill_referred_committee
0,H.R.,113,200,200,0,0,46,"House - Judiciary, Energy and Commerce, Educat...",Referred to the Subcommittee on Higher Educati...,Accounting and auditing,...,0,0,0,0,0,0,0,0,0,0
1,H.R.,113,179,179,0,0,42,"House - House Administration, Judiciary, Scien...",Referred to the Subcommittee on Higher Educati...,Administrative law and regulatory procedures,...,0,1,0,0,0,0,0,0,0,0
2,H.R.,113,0,0,0,0,0,,,,...,1,1,1,1,1,1,1,1,1,1
3,H.R.,113,0,0,0,0,0,,,,...,1,1,1,1,1,1,1,1,1,1
4,H.R.,113,200,197,3,0,46,"House - Judiciary, Foreign Affairs, Homeland S...",Motion to Discharge Committee filed by Mr. Gar...,Administrative law and regulatory procedures,...,1,0,1,0,0,0,0,0,0,0


#### Building two ML models: one with House of Representatives data and one with Senate data

# House Model:

## Preprocessing

In [7]:
# Check for NAs and duplicates, and get the shape of the data:
print(f'The shape of the house_df data is: {house_df.shape}')
print(f'The number of NAs in the house_df data: {house_df.isnull().sum()}')
print(f'The number of duplicate rows in the house_df data: {house_df.duplicated().sum()}')
# There are 41 columns and 40683 rows. Committees, Latest Action, Subject, and the
# Sponsor Title/Party/State columns contain NAs (see the per-column counts in the output).
# The target is bill_passed.

The shape of the house_df data is: (40683, 41)
The number of NAs in the house_df data: Bill Type                              0
Congress                               0
Number of Cosponsors                   0
Cosponsor Dems                         0
Cosponsor Reps                         0
Cosponsor Ind                          0
Cosponsor States                       0
Committees                            49
Latest Action                         30
Subject                              729
Sponsor Title                         16
Sponsor Party                         16
Sponsor State                         16
Month Introduced                       0
Agriculture                            0
Appropriations                         0
Armed Services                         0
Budget                                 0
Education and the Workforce            0
Energy and Commerce                    0
Ethics                                 0
Financial Services                     0
Foreign Aff
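
The per-column NA listing above is long, and most entries are zero. A minimal sketch, using a small made-up DataFrame rather than the project data, of listing only the columns that actually contain NAs (`toy_df` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame standing in for house_df (values are made up).
toy_df = pd.DataFrame({
    "Congress": [113, 113, 114],
    "Subject": ["Health", None, None],
    "Sponsor Party": ["D", "R", None],
})

# Count NAs per column, then keep only the columns with at least one NA,
# sorted from most to fewest missing values.
na_counts = toy_df.isnull().sum()
na_counts = na_counts[na_counts > 0].sort_values(ascending=False)
print(na_counts)
```

The same two lines could be applied to `house_df` to get a shorter summary than the full listing.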

In [8]:
# Numeric variable stats
house_df.describe()

Unnamed: 0,Congress,Number of Cosponsors,Cosponsor Dems,Cosponsor Reps,Cosponsor Ind,Cosponsor States,Month Introduced,Agriculture,Appropriations,Armed Services,...,Transportation and Infrastructure,Veterans' Affairs,Ways and Means,Intelligence,Printing,Taxation,Library,Economic,bill_passed,bill_referred_committee
count,40683.0,40683.0,40683.0,40683.0,40683.0,40683.0,40683.0,40683.0,40683.0,40683.0,...,40683.0,40683.0,40683.0,40683.0,40683.0,40683.0,40683.0,40683.0,40683.0,40683.0
mean,115.362805,15.79033,9.883711,5.906398,0.000172,7.205049,5.376472,0.042843,0.016837,0.053438,...,0.077452,0.057985,0.192242,0.009857,0.001204,0.001204,0.001204,0.001204,0.02898,0.115847
std,1.470997,34.220849,25.16276,17.237882,0.013116,9.655811,3.294271,0.202507,0.128664,0.224907,...,0.267311,0.233718,0.394067,0.098792,0.034684,0.034684,0.034684,0.034684,0.167753,0.320045
min,113.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,114.0,1.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,116.0,4.0,1.0,1.0,0.0,3.0,5.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,117.0,15.0,7.0,4.0,0.0,10.0,7.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,118.0,386.0,238.0,244.0,1.0,58.0,12.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [9]:
# Get the column names:
list(house_df.columns)


['Bill Type',
 'Congress',
 'Number of Cosponsors',
 'Cosponsor Dems',
 'Cosponsor Reps',
 'Cosponsor Ind',
 'Cosponsor States',
 'Committees',
 'Latest Action',
 'Subject',
 'Sponsor Title',
 'Sponsor Party',
 'Sponsor State',
 'Month Introduced',
 'Agriculture',
 'Appropriations',
 'Armed Services',
 'Budget',
 'Education and the Workforce',
 'Energy and Commerce',
 'Ethics',
 'Financial Services',
 'Foreign Affairs',
 'Homeland Security',
 'House Administration',
 'Judiciary',
 'Natural Resources',
 'Oversight and Accountability',
 'Rules',
 'Science, Space, and Technology',
 'Small Business',
 'Transportation and Infrastructure',
 "Veterans' Affairs",
 'Ways and Means',
 'Intelligence',
 'Printing',
 'Taxation',
 'Library',
 'Economic',
 'bill_passed',
 'bill_referred_committee']

### Model Target:


*   The target for the models is `bill_passed`, which is based on "Latest Action" == "Became Public Law"



In [10]:
# Model Target:
house_df["bill_passed"].value_counts()
# 1179 bills in the dataset that originated in the House became law in the 113th-118th Congresses

0    39504
1     1179
Name: bill_passed, dtype: int64
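
The counts above show a heavily imbalanced target. As a rough sketch of what that implies, the arithmetic below uses only the two counts printed above to compute the share of passed bills and the accuracy a trivial "always predict 0" model would reach (about 97.1%). The variable names are illustrative.

```python
# Class counts taken from the value_counts() output above.
not_passed = 39504
passed = 1179
total = not_passed + passed

# Share of bills in the dataset that became law.
positive_rate = passed / total

# Accuracy of a trivial model that always predicts "did not pass".
majority_baseline = not_passed / total

print(f"Positive rate: {positive_rate:.4f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any accuracy figure for this target is easier to interpret when read against that baseline.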

In [11]:
# Drop the non-beneficial columns: 'Bill Type', 'Committees', and 'Latest Action'.
house_df = house_df.drop(["Bill Type", "Committees", "Latest Action"], axis='columns')
house_df.head()

Unnamed: 0,Congress,Number of Cosponsors,Cosponsor Dems,Cosponsor Reps,Cosponsor Ind,Cosponsor States,Subject,Sponsor Title,Sponsor Party,Sponsor State,...,Transportation and Infrastructure,Veterans' Affairs,Ways and Means,Intelligence,Printing,Taxation,Library,Economic,bill_passed,bill_referred_committee
0,113,200,200,0,0,46,Accounting and auditing,Rep.,D,WI,...,0,0,0,0,0,0,0,0,0,0
1,113,179,179,0,0,42,Administrative law and regulatory procedures,Rep.,D,GA,...,0,1,0,0,0,0,0,0,0,0
2,113,0,0,0,0,0,,Rep.,D,CA,...,1,1,1,1,1,1,1,1,1,1
3,113,0,0,0,0,0,,Rep.,D,CA,...,1,1,1,1,1,1,1,1,1,1
4,113,200,197,3,0,46,Administrative law and regulatory procedures,Rep.,D,FL,...,1,0,1,0,0,0,0,0,0,0


In [12]:
house_df.dtypes
# Need to create dummies for the Subject, Sponsor Title, Sponsor Party, and Sponsor State columns.
# Need to bin the Subject column before creating dummy variables.

Congress                              int64
Number of Cosponsors                  int64
Cosponsor Dems                        int64
Cosponsor Reps                        int64
Cosponsor Ind                         int64
Cosponsor States                      int64
Subject                              object
Sponsor Title                        object
Sponsor Party                        object
Sponsor State                        object
Month Introduced                      int64
Agriculture                           int64
Appropriations                        int64
Armed Services                        int64
Budget                                int64
Education and the Workforce           int64
Energy and Commerce                   int64
Ethics                                int64
Financial Services                    int64
Foreign Affairs                       int64
Homeland Security                     int64
House Administration                  int64
Judiciary                       

In [13]:
# Determine the number of unique values in each column.
house_df.nunique()


Congress                               6
Number of Cosponsors                 325
Cosponsor Dems                       236
Cosponsor Reps                       214
Cosponsor Ind                          2
Cosponsor States                      59
Subject                              768
Sponsor Title                          3
Sponsor Party                          3
Sponsor State                         56
Month Introduced                      12
Agriculture                            2
Appropriations                         2
Armed Services                         2
Budget                                 2
Education and the Workforce            2
Energy and Commerce                    2
Ethics                                 2
Financial Services                     2
Foreign Affairs                        2
Homeland Security                      2
House Administration                   2
Judiciary                              2
Natural Resources                      2
Oversight and Ac

In [14]:
# Look at Subject value counts for binning
Subject_counts = house_df["Subject"].value_counts()
print(f'Count of values for Subject column: \n{Subject_counts}')

Count of values for Subject column: 
Health                                                2496
Armed Forces and National Security                    2437
Taxation                                              1707
Administrative law and regulatory procedures          1580
Government Operations and Politics                    1519
                                                      ... 
U.S. Commission on International Religious Freedom       1
Italy                                                    1
Watersheds                                               1
Military law                                             1
Cosmetics and personal care                              1
Name: Subject, Length: 768, dtype: int64


In [15]:
# Bin rare Subject values: any subject with fewer than 200 bills becomes "Other".
Subject_types_to_replace = list(Subject_counts[Subject_counts<200].index)

# Replace all rare subjects in the dataframe in a single call
house_df['Subject'] = house_df['Subject'].replace(Subject_types_to_replace, "Other")

# Check to make sure binning was successful
house_df['Subject'].value_counts()
# 33 Bins

Other                                           15670
Health                                           2496
Armed Forces and National Security               2437
Taxation                                         1707
Administrative law and regulatory procedures     1580
Government Operations and Politics               1519
Crime and Law Enforcement                        1371
Education                                        1103
Transportation and Public Works                   884
Finance and Financial Sector                      880
Public Lands and Natural Resources                843
International Affairs                             825
Commerce                                          745
Congressional oversight                           656
Science, Technology, Communications               652
Immigration                                       600
Energy                                            570
Labor and Employment                              566
Agriculture and Food        
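
The binning step above can be illustrated on a tiny example. This is a minimal sketch with a made-up Series and an assumed cutoff of 2 (the notebook uses 200); `subjects`, `cutoff`, and `binned` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy column standing in for house_df["Subject"].
subjects = pd.Series(
    ["Health", "Health", "Health", "Taxation", "Taxation", "Italy", "Watersheds"]
)

cutoff = 2  # assumed cutoff for the toy data; the notebook uses 200
counts = subjects.value_counts()
rare = counts[counts < cutoff].index

# Keep values that are not rare; replace every rare category with "Other".
binned = subjects.where(~subjects.isin(rare), "Other")
print(binned.value_counts())
```

Missing values are not in `rare`, so they stay missing rather than being moved into "Other".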

In [16]:
# Convert categorical data to numeric with `pd.get_dummies`
house_df = pd.get_dummies(house_df,dtype=float)
house_df.head()

Unnamed: 0,Congress,Number of Cosponsors,Cosponsor Dems,Cosponsor Reps,Cosponsor Ind,Cosponsor States,Month Introduced,Agriculture,Appropriations,Armed Services,...,Sponsor State_TN,Sponsor State_TX,Sponsor State_UT,Sponsor State_VA,Sponsor State_VI,Sponsor State_VT,Sponsor State_WA,Sponsor State_WI,Sponsor State_WV,Sponsor State_WY
0,113,200,200,0,0,46,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,113,179,179,0,0,42,1,0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,113,0,0,0,0,0,1,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,113,0,0,0,0,0,1,1,1,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,113,200,197,3,0,46,10,1,0,1,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
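
Since `Subject` and the Sponsor columns contain NAs, it helps to see how `pd.get_dummies` treats them. A small sketch on made-up data (`toy_df` is hypothetical): with the default `dummy_na=False`, a row with a missing category gets 0.0 in every dummy column for that feature.

```python
import pandas as pd

# Hypothetical toy frame; the last row has a missing Sponsor Party.
toy_df = pd.DataFrame({
    "Number of Cosponsors": [200, 0, 3],
    "Sponsor Party": ["D", "R", None],
})

# Only the object column is expanded; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy_df, dtype=float)
print(encoded)
```

Passing `dummy_na=True` would instead add an explicit indicator column for the missing values.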


In [17]:
# Split our preprocessed data into our features and target arrays
y = house_df["bill_passed"].values
X = house_df.drop(["bill_passed"], axis='columns').values

# Split the preprocessed data into a training and testing dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=78)
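
With a rare positive class, a plain random split can leave the train and test sets with slightly different positive rates. The sketch below is an optional variation, not what the notebook does: it uses the `stratify` argument of `train_test_split` on synthetic data (`X_toy` and `y_toy` are made up) to keep the class proportions matched.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 3% positives (made-up values).
rng = np.random.default_rng(78)
X_toy = rng.normal(size=(1000, 4))
y_toy = np.zeros(1000, dtype=int)
y_toy[:30] = 1

# stratify=y_toy keeps the share of positives similar in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, random_state=78, stratify=y_toy
)

print(f"Train positive rate: {y_tr.mean():.4f}")
print(f"Test positive rate:  {y_te.mean():.4f}")
```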

In [18]:
# Create a StandardScaler instance
house_scaler = StandardScaler()

# Fit the StandardScaler
X_scaler = house_scaler.fit(X_train)

# Scale the data
X_train_scaled = X_scaler.transform(X_train)
X_test_scaled = X_scaler.transform(X_test)
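
The scaler is fit on the training data only and then applied to both splits. As a rough illustration of what that does, the NumPy sketch below standardizes made-up arrays by hand using the training mean and the population standard deviation. It ignores edge cases such as zero-variance columns, which `StandardScaler` handles itself.

```python
import numpy as np

# Made-up training and test arrays with two features.
X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[4.0, 40.0]])

# Statistics come from the training data only.
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)  # population std (ddof=0)

# The same training statistics are applied to both splits.
X_train_std = (X_train_toy - mean) / std
X_test_std = (X_test_toy - mean) / std

print(X_train_std)
print(X_test_std)
```

Reusing the training statistics on the test set keeps information about the test data out of the preprocessing step.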

## Compile, Train and Evaluate the Model

### Attempt #1


*   Total layers: 5 (4 hidden layers plus the output layer)
*   Activation function for each layer: relu, relu, tanh, tanh, sigmoid
*   Number of neurons in each layer: 9, 7, 7, 5, 1
*   Epochs: 100



In [19]:
# Define the model - deep neural net, i.e., the number of input features and hidden nodes for each layer.
number_input_features = len(X_train[0])
hidden_nodes_layer1 = 9
hidden_nodes_layer2_3 = 7
hidden_nodes_layer4 = 5

nn_1 = tf.keras.models.Sequential()

# First hidden layer
nn_1.add(
    tf.keras.layers.Dense(units=hidden_nodes_layer1, input_dim=number_input_features, activation="relu")
)

# Second hidden layer
nn_1.add(tf.keras.layers.Dense(units=hidden_nodes_layer2_3, activation="relu"))

# Third hidden layer
nn_1.add(tf.keras.layers.Dense(units=hidden_nodes_layer2_3, activation="tanh"))

# Fourth hidden layer
nn_1.add(tf.keras.layers.Dense(units=hidden_nodes_layer4, activation="tanh"))

# Output layer
nn_1.add(tf.keras.layers.Dense(units=1, activation="sigmoid"))

# Check the structure of the model
nn_1.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 9)                 1161      
                                                                 
 dense_1 (Dense)             (None, 7)                 70        
                                                                 
 dense_2 (Dense)             (None, 7)                 56        
                                                                 
 dense_3 (Dense)             (None, 5)                 40        
                                                                 
 dense_4 (Dense)             (None, 1)                 6         
                                                                 
Total params: 1,333
Trainable params: 1,333
Non-trainable params: 0
_________________________________________________________________
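
The parameter counts in the summary above can be checked by hand: a dense layer has one weight per input per unit, plus one bias per unit. The sketch below assumes 128 input features, which is inferred from the first layer's count (1161 = 9 × (128 + 1)) rather than stated in the notebook. `dense_params` is an illustrative helper, not a Keras function.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_units * (n_inputs + 1)


layer_units = [9, 7, 7, 5, 1]  # units per layer, from the model definition
prev_width = 128               # assumed input width, inferred from the summary

params = []
for units in layer_units:
    params.append(dense_params(prev_width, units))
    prev_width = units  # this layer's output feeds the next layer

print(params)       # per-layer parameter counts
print(sum(params))  # total trainable parameters
```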


In [20]:
# Compile the model
nn_1.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [21]:
# Train the model
fit_model = nn_1.fit(X_train_scaled,y_train,epochs=100)

Epoch 1/100
Epoch 2/100
...

In [22]:
# Evaluate the model using the test data
model_loss_1, model_accuracy_1 = nn_1.evaluate(X_test_scaled,y_test,verbose=2)
print(f"Loss: {round(model_loss_1,4)}, Accuracy: {round(model_accuracy_1,4)}")

318/318 - 0s - loss: 0.1193 - accuracy: 0.9723 - 478ms/epoch - 2ms/step
Loss: 0.1193, Accuracy: 0.9723
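
Only about 2.9% of the bills in this dataset passed (1179 of 40683), so accuracy alone says little about how well passed bills are identified. The sketch below is a plain-Python illustration on made-up labels, not the notebook's results: a predictor that always outputs 0 scores high accuracy but zero precision and recall. `precision_recall` is a hypothetical helper. In the notebook, binary predictions could be obtained by thresholding `nn_1.predict(X_test_scaled)`, for example at 0.5, which is an assumption here.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, treating 1 as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 3 positives out of 100, and a predictor that always says 0.
y_true_toy = [0] * 97 + [1] * 3
y_pred_all_zero = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_all_zero)) / len(y_true_toy)
precision, recall = precision_recall(y_true_toy, y_pred_all_zero)

print(f"Accuracy: {accuracy:.2f}, Precision: {precision:.2f}, Recall: {recall:.2f}")
```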


### Save the model from optimization attempt 1, which had an accuracy of 0.97

In [23]:
# Export the model from optimization attempt 1 to an HDF5 file
from google.colab import files

output_filepath = "/content/gdrive/MyDrive/DataClassNotebooks/Project-4/output"

# Save house model, nn_1, and download a copy to local machine:
nn_1.save(f'{output_filepath}/house_model.h5')
files.download(f'{output_filepath}/house_model.h5')

# Save the StandardScaler() instance, house_scaler, for use in the flask app later:
dump(house_scaler, f'{output_filepath}/house_scaler.bin', compress=True)
files.download(f'{output_filepath}/house_scaler.bin')
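
For later use, the saved scaler has to be loaded back and applied to new inputs before they reach the model. A minimal round-trip sketch, assuming a toy scaler and a temporary directory instead of the Drive path (`toy_scaler.bin` is a hypothetical file name):

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.preprocessing import StandardScaler

# Fit a toy scaler on made-up data (stands in for house_scaler).
scaler = StandardScaler().fit(np.array([[1.0], [2.0], [3.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, "toy_scaler.bin")  # hypothetical file name
    dump(scaler, scaler_path, compress=True)
    restored = load(scaler_path)

# The restored scaler applies the same training mean and scale to new rows.
new_rows = np.array([[4.0]])
print(restored.transform(new_rows))
```

The Keras model saved above could be reloaded in a similar way with `tf.keras.models.load_model`.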

