# Logistic Regression

The data this model is based on is the Congressional Voting Records data set from the University of California, Irvine (UCI) Machine Learning Repository.

Source:

Origin:

Congressional Quarterly Almanac, 98th Congress, 2nd session 1984, Volume XL: Congressional Quarterly Inc. Washington, D.C., 1985.

Donor:

Jeff Schlimmer (Jeffrey.Schlimmer '@' a.gp.cs.cmu.edu)

Data Set Information:

>This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA. The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).
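The simplification described above can be sketched as a lookup table. This is only an illustration: the key strings and the `simplify_vote` helper are hypothetical, and the codes "y", "n", and "?" are the ones that appear in the data file shown later.

```python
# Hypothetical mapping from the nine CQA vote types to the three codes
# found in the data file ("y", "n", "?"). Key strings are illustrative.
CQA_TO_CODE = {
    "voted for": "y",
    "paired for": "y",
    "announced for": "y",
    "voted against": "n",
    "paired against": "n",
    "announced against": "n",
    "voted present": "?",
    "voted present to avoid conflict of interest": "?",
    "did not vote": "?",
}


def simplify_vote(cqa_vote: str) -> str:
    """Return 'y', 'n', or '?' for one of the nine CQA vote types."""
    return CQA_TO_CODE[cqa_vote]


codes = [simplify_vote(v) for v in ("paired for", "announced against", "voted present")]
print(codes)
```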

Attribute Information:

1. Class Name: 2 (democrat, republican)
2. handicapped-infants: 2 (y,n)
3. water-project-cost-sharing: 2 (y,n)
4. adoption-of-the-budget-resolution: 2 (y,n)
5. physician-fee-freeze: 2 (y,n)
6. el-salvador-aid: 2 (y,n)
7. religious-groups-in-schools: 2 (y,n)
8. anti-satellite-test-ban: 2 (y,n)
9. aid-to-nicaraguan-contras: 2 (y,n)
10. mx-missile: 2 (y,n)
11. immigration: 2 (y,n)
12. synfuels-corporation-cutback: 2 (y,n)
13. education-spending: 2 (y,n)
14. superfund-right-to-sue: 2 (y,n)
15. crime: 2 (y,n)
16. duty-free-exports: 2 (y,n)
17. export-administration-act-south-africa: 2 (y,n)

Dua, D. and Karra Taniskidou, E. (2017). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. 

In [1]:
# Import dependencies.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [2]:
#Read the data into a dataframe using pd.read_table.
data = pd.read_table("house-votes-84.data", sep=",", header=None)
data

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
5,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
6,democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
7,republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
8,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
9,democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?


In [3]:
#Renamed the columns to show the title of the bills.
vote_names = data.rename(columns={0:"Party",1:"Disabled Infants",2:"Water Project Cost Sharing",3:"Adoption of the Budget Resolution",
                                  4:"Physician Fee Freeze",5:"El Salvador Aid",6:"Religious Groups is Schools",7:"Anti-Satellite Test Ban",
                                  8:"Aid to Nicaraguan Contras",9:"MX Missile",10:"Immigration", 11:"Synfuels Corporation Cutback",
                                  12:"Education Spending", 13:"Superfund Right to Sue", 14:"Crime",15:"Duty Free Exports",16:"Export Administration Act South Africa"})
vote_names

Unnamed: 0,Party,Disabled Infants,Water Project Cost Sharing,Adoption of the Budget Resolution,Physician Fee Freeze,El Salvador Aid,Religious Groups is Schools,Anti-Satellite Test Ban,Aid to Nicaraguan Contras,MX Missile,Immigration,Synfuels Corporation Cutback,Education Spending,Superfund Right to Sue,Crime,Duty Free Exports,Export Administration Act South Africa
0,republican,n,y,n,y,y,y,n,n,n,y,?,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
2,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
5,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
6,democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
7,republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
8,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
9,democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?


In [4]:
#Replace the question marks ("?") in the data with NaN by using the .replace() method.
#np.nan is used because the np.NaN alias was removed in NumPy 2.0.
vote_data = vote_names.replace("?", np.nan)
vote_data

Unnamed: 0,Party,Disabled Infants,Water Project Cost Sharing,Adoption of the Budget Resolution,Physician Fee Freeze,El Salvador Aid,Religious Groups is Schools,Anti-Satellite Test Ban,Aid to Nicaraguan Contras,MX Missile,Immigration,Synfuels Corporation Cutback,Education Spending,Superfund Right to Sue,Crime,Duty Free Exports,Export Administration Act South Africa
0,republican,n,y,n,y,y,y,n,n,n,y,,y,y,y,n,y
1,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,
2,democrat,,y,y,,y,y,n,n,n,n,y,n,y,y,n,n
3,democrat,n,y,y,n,,y,n,n,n,n,y,n,y,n,n,y
4,democrat,y,y,y,n,y,y,n,n,n,n,y,,y,y,y,y
5,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
6,democrat,n,y,n,y,y,y,n,n,n,n,n,n,,y,y,y
7,republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,,y
8,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
9,democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,,
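Once the question marks are NaN, pandas can count them directly. The sketch below uses a small made-up frame (`toy`) rather than the real data, so the counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-in for vote_names; the values are made up for illustration.
toy = pd.DataFrame({
    "Party": ["republican", "democrat", "democrat"],
    "Immigration": ["y", "?", "n"],
    "Crime": ["?", "?", "y"],
})

# Same replacement as in the cell above, then count NaN values per column.
toy_clean = toy.replace("?", np.nan)
missing_per_column = toy_clean.isna().sum()
print(missing_per_column)
```

Running `vote_data.isna().sum()` on the real dataframe would give the number of unknown positions for each bill.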


## We dropped the "Party" column from the features, because we are predicting political party.

In [5]:
X = vote_data.drop("Party", axis=1)  # axis must be passed by keyword in pandas 2.0 and later
y = vote_data["Party"]
print(X.shape, y.shape)

(435, 16) (435,)


## Because the data is categorical, each column of "y" and "n" values had to be transformed into 0/1 indicator columns (one for "y" and one for "n") so that the model could work on the data. To do this, "get_dummies()" was called on the dataframe. This step is called preprocessing.

In [6]:
#Transform the categorical data into a format that is readable by the model calling pd.get_dummies on the entire dataframe.
vote_data_encoded = pd.get_dummies(vote_data)
vote_data_encoded

Unnamed: 0,Party_democrat,Party_republican,Disabled Infants_n,Disabled Infants_y,Water Project Cost Sharing_n,Water Project Cost Sharing_y,Adoption of the Budget Resolution_n,Adoption of the Budget Resolution_y,Physician Fee Freeze_n,Physician Fee Freeze_y,...,Education Spending_n,Education Spending_y,Superfund Right to Sue_n,Superfund Right to Sue_y,Crime_n,Crime_y,Duty Free Exports_n,Duty Free Exports_y,Export Administration Act South Africa_n,Export Administration Act South Africa_y
0,0,1,1,0,0,1,1,0,0,1,...,0,1,0,1,0,1,1,0,0,1
1,0,1,1,0,0,1,1,0,0,1,...,0,1,0,1,0,1,1,0,0,0
2,1,0,0,0,0,1,0,1,0,0,...,1,0,0,1,0,1,1,0,1,0
3,1,0,1,0,0,1,0,1,1,0,...,1,0,0,1,1,0,1,0,0,1
4,1,0,0,1,0,1,0,1,1,0,...,0,0,0,1,0,1,0,1,0,1
5,1,0,1,0,0,1,0,1,1,0,...,1,0,0,1,0,1,0,1,0,1
6,1,0,1,0,0,1,1,0,0,1,...,1,0,0,0,0,1,0,1,0,1
7,0,1,1,0,0,1,1,0,0,1,...,1,0,0,1,0,1,0,0,0,1
8,0,1,1,0,0,1,1,0,0,1,...,0,1,0,1,0,1,1,0,0,1
9,1,0,0,1,0,1,0,1,1,0,...,1,0,1,0,1,0,0,0,0,0
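Note what happens to missing values here: a NaN entry gets 0 in both the `_n` and the `_y` column, as in row 2 of the output above for "Disabled Infants". A minimal sketch on a made-up one-column frame shows the same behavior (`dtype=int` is passed so that the result is 0/1 rather than booleans on recent pandas versions):

```python
import numpy as np
import pandas as pd

# One toy column with a "yes", a "no", and a missing vote.
toy = pd.DataFrame({"Crime": ["y", "n", np.nan]})

# By default get_dummies ignores NaN, so the missing row is all zeros.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```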


## To reduce redundancy between the paired indicator columns, we dropped the "_n" columns from the data set. Only the "_y" columns were used to create the model, so a 0 means either a "no" vote or an unknown position.

In [7]:
y_votes = vote_data_encoded.drop(["Disabled Infants_n","Water Project Cost Sharing_n", "Adoption of the Budget Resolution_n",
                            "Physician Fee Freeze_n", "El Salvador Aid_n","Religious Groups is Schools_n", "Anti-Satellite Test Ban_n", "Aid to Nicaraguan Contras_n","MX Missile_n","Immigration_n","Synfuels Corporation Cutback_n","Education Spending_n","Superfund Right to Sue_n", 
                           "Crime_n", "Duty Free Exports_n","Export Administration Act South Africa_n" ], axis=1)
y_votes

Unnamed: 0,Party_democrat,Party_republican,Disabled Infants_y,Water Project Cost Sharing_y,Adoption of the Budget Resolution_y,Physician Fee Freeze_y,El Salvador Aid_y,Religious Groups is Schools_y,Anti-Satellite Test Ban_y,Aid to Nicaraguan Contras_y,MX Missile_y,Immigration_y,Synfuels Corporation Cutback_y,Education Spending_y,Superfund Right to Sue_y,Crime_y,Duty Free Exports_y,Export Administration Act South Africa_y
0,0,1,0,1,0,1,1,1,0,0,0,1,0,1,1,1,0,1
1,0,1,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,0
2,1,0,0,1,1,0,1,1,0,0,0,0,1,0,1,1,0,0
3,1,0,0,1,1,0,0,1,0,0,0,0,1,0,1,0,0,1
4,1,0,1,1,1,0,1,1,0,0,0,0,1,0,1,1,1,1
5,1,0,0,1,1,0,1,1,0,0,0,0,0,0,1,1,1,1
6,1,0,0,1,0,1,1,1,0,0,0,0,0,0,0,1,1,1
7,0,1,0,1,0,1,1,1,0,0,0,0,0,0,1,1,0,1
8,0,1,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1
9,1,0,1,1,1,0,0,0,1,1,1,0,0,0,0,0,0,0
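Listing every column by hand is easy to get wrong when the names are long. As an alternative sketch, the same columns can be selected by their `_n` suffix. The frame below is a made-up miniature of `vote_data_encoded`.

```python
import pandas as pd

# Miniature stand-in for vote_data_encoded (values are illustrative).
encoded = pd.DataFrame({
    "Party_democrat": [1, 0],
    "Party_republican": [0, 1],
    "Crime_n": [1, 0],
    "Crime_y": [0, 1],
    "Immigration_n": [0, 1],
    "Immigration_y": [1, 0],
})

# Collect every column whose name ends in "_n" and drop them in one call.
no_columns = [c for c in encoded.columns if c.endswith("_n")]
yes_only = encoded.drop(columns=no_columns)
print(list(yes_only.columns))
```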


## In this step we split the data into testing and training data.

In [8]:
#Split the data into training data and testing data.

from sklearn.model_selection import train_test_split

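# Note: y_votes still contains the Party_democrat and Party_republican columns,
# so the features include the label that the model is asked to predict.
X = y_votes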

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

X_train.head()

Unnamed: 0,Party_democrat,Party_republican,Disabled Infants_y,Water Project Cost Sharing_y,Adoption of the Budget Resolution_y,Physician Fee Freeze_y,El Salvador Aid_y,Religious Groups is Schools_y,Anti-Satellite Test Ban_y,Aid to Nicaraguan Contras_y,MX Missile_y,Immigration_y,Synfuels Corporation Cutback_y,Education Spending_y,Superfund Right to Sue_y,Crime_y,Duty Free Exports_y,Export Administration Act South Africa_y
311,1,0,0,0,1,0,0,1,1,1,1,1,0,0,1,0,0,1
3,1,0,0,1,1,0,0,1,0,0,0,0,1,0,1,0,0,1
18,0,1,0,1,0,1,1,1,0,0,0,0,0,0,1,1,0,0
208,1,0,0,0,1,0,0,0,1,1,1,0,0,0,0,0,1,1
60,1,0,1,1,1,0,0,0,1,1,1,1,0,0,0,0,1,0
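With no `test_size` argument, `train_test_split` holds out 25% of the rows for testing. The sketch below uses placeholder arrays with the same number of rows as the data set (435) to show the resulting sizes. The class counts are illustrative, and `stratify` is an optional extra, not part of the cell above, that keeps the party proportions similar in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels with 435 rows (values are illustrative).
X_demo = np.arange(435 * 2).reshape(435, 2)
y_demo = np.array(["democrat"] * 267 + ["republican"] * 168)

# Default test_size is 0.25; stratify keeps class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=42, stratify=y_demo
)
print(X_tr.shape, X_te.shape)
```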


## In this step we created the logistic regression model.

In [9]:
#Create a logistic regression model.

from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression()
classifier

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

## In this step we fitted (trained) the model using the training data.

In [10]:
#Fit(train) the model using the training data.

classifier.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
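Fitting learns one weight per feature plus an intercept. To classify a row, logistic regression computes a weighted sum of the features and passes it through the sigmoid function to get a probability. The sketch below shows that calculation with NumPy. The weights, intercept, and vote rows are hypothetical and are not the values learned by `classifier`.

```python
import numpy as np


def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


# Hypothetical weights and intercept for three "yes" indicator features.
weights = np.array([2.0, -1.5, 0.5])
intercept = -0.25

# Two made-up legislators, one row each.
votes = np.array([[1, 0, 1],
                  [0, 1, 0]])

# Probability of the positive class, then a 0.5 threshold for the label.
probabilities = sigmoid(votes @ weights + intercept)
labels = np.where(probabilities >= 0.5, "republican", "democrat")
print(probabilities, labels)
```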

## Used the testing data to validate the model.

In [11]:
#Validate the model using the testing data.

print(f"Training Data Score: {classifier.score(X_train, y_train)}")
print(f"Testing Data Score: {classifier.score(X_test, y_test)}")

Training Data Score: 1.0
Testing Data Score: 1.0


In [12]:
#Make predictions.

predictions = classifier.predict(X_test)
print(f"First 10 Predictions:   {predictions[:10]}")
print(f"First 10 Actual labels: {y_test[:10].tolist()}")

First 10 Predictions:   ['democrat' 'democrat' 'republican' 'republican' 'republican' 'republican'
 'democrat' 'republican' 'democrat' 'republican']
First 10 Actual labels: ['democrat', 'democrat', 'republican', 'republican', 'republican', 'republican', 'democrat', 'republican', 'democrat', 'republican']
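For a classifier, `score` returns the mean accuracy: the fraction of rows where the predicted label equals the actual label. The same number can be computed by hand, as in this sketch with made-up labels:

```python
# Made-up predictions and true labels for four legislators.
predicted = ["democrat", "democrat", "republican", "republican"]
actual = ["democrat", "republican", "republican", "republican"]

# Accuracy is the share of positions where the two lists agree.
correct = sum(p == a for p, a in zip(predicted, actual))
accuracy = correct / len(actual)
print(accuracy)
```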


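## The training and testing scores were both 1.0. A perfect score on both sets is a warning sign: the feature matrix X still contains the Party_democrat and Party_republican columns (see the X_train output above), so the model can read the party directly from its inputs. The predictions match the actual labels, but these scores do not show how well the votes alone predict party.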
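One way to check whether the perfect scores come from the `Party_democrat` and `Party_republican` columns that appear in `X_train` is to remove every label-derived column from the features before splitting. A minimal sketch on a made-up miniature frame follows. The names `label_columns`, `X_features`, and `y_labels` are hypothetical, and no claim is made here about what score the real data would give.

```python
import pandas as pd

# Miniature stand-in for the encoded "yes only" frame (values are illustrative).
encoded = pd.DataFrame({
    "Party_democrat": [1, 0, 1, 0],
    "Party_republican": [0, 1, 0, 1],
    "Crime_y": [0, 1, 1, 1],
    "Immigration_y": [1, 0, 0, 1],
})

# Every column derived from the label must be kept out of the features.
label_columns = [c for c in encoded.columns if c.startswith("Party_")]
X_features = encoded.drop(columns=label_columns)

# A single 0/1 target is enough for a two-class problem.
y_labels = encoded["Party_republican"]
print(list(X_features.columns))
```

The resulting `X_features` and `y_labels` can then be passed to `train_test_split` and the classifier in the same way as before.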