# INFO371 Lab: Bayes theorem

Your task is to use Bayes theorem to predict whether a
representative was a democrat or a republican based on how they
voted on bill \#8 in the dataset.  This is in preparation for using the Naive Bayes estimator next class.

## Data
The data **house-votes-84-yeas** contains votes on 16 bills by 435 Representatives in 1984. The US Congress consists of two chambers: the House of Representatives (or just "House") and the Senate.  The House has 435 members, called Representatives (or just Reps).  Almost all Reps are members of either the Democratic or the Republican party.

The first variable is the party membership (_republican_ or _democrat_), and the 16 following ones indicate whether the representative voted _yea_ on the corresponding bill. The word "yea" means "yes" or "affirmative" in a vote.

The variables **yea1--yea16** are coded so that "1"
means voting **yea**, and "0" usually means **nay** but
sometimes also a missing or abstaining vote. We use the votes on bill 8 (column **yea8**) to predict the party membership **party**.

---
## Implement the Bayes Classifier

The task in this lab is to compute the probability that a
Representative is a democrat, given how she voted on bill 8:

$Pr(party = D|vote_8)$. 

### Compute Priors

We are using the Bayes theorem to compute:


$Pr(party = D|vote = V) = \frac{Pr(vote = V|party = D) * Pr(party = D)} {Pr(vote = V)}$
    
where $D$ means the representative is a democrat, and $V$ means how she
voted for a particular bill (yea or not).  We focus on a single
bill, **yea8**.
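
The normalizer in the denominator can also be written out with the law of total probability. This is a sketch that assumes every representative in the data is coded as either democrat or republican:

$Pr(vote = V) = Pr(vote = V|party = D) * Pr(party = D) + Pr(vote = V|party = R) * Pr(party = R)$

On the same data this equals the plain share of representatives who voted $V$, so either form can serve as a check on the other.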


* Compute the priors, $Pr(party = D)$ and $Pr(party = R)$, the percentage of democrats and republicans in your data.  


* Extract the data corresponding to the bill (the column **yea8** in
  your data matrix).  Let's call it $x_8$ below.
  
  
* We also need the normalizers $Pr(vote = 1)$ and $Pr(vote = 0)$, the probabilities that representatives voted **yea** or not for that bill.
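
The steps in this list can be sketched on a small made-up table. The data frame `toy` and the names `priors`, `pr_vote1` and `pr_vote0` are hypothetical and only stand in for the real file; the sketch assumes the vote column is coded 0/1 as described in the Data section.

```python
import pandas as pd

# Hypothetical toy data, not the real house-votes file
toy = pd.DataFrame({
    "party": ["democrat", "democrat", "democrat", "republican", "republican"],
    "yea8": [1, 1, 0, 0, 1],
})

# Priors: share of each party among all rows
priors = toy["party"].value_counts(normalize=True)

# Normalizers: the mean of a 0/1 column is the share of ones
pr_vote1 = toy["yea8"].mean()
pr_vote0 = 1 - pr_vote1

print(priors["democrat"], priors["republican"], pr_vote1, pr_vote0)
```

`value_counts(normalize=True)` returns shares instead of counts, so the priors sum to one by construction.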


In [2]:
# code goes here
import pandas as pd
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

hv = pd.read_csv("house-votes-84-yeas.csv.bz2", sep="\t")

In [3]:
#1
PrD = hv.party[hv.party == "democrat"].count() / hv.shape[0]
PrR = hv.party[hv.party == "republican"].count() / hv.shape[0]

print(PrD)
print(PrR)

0.6137931034482759
0.38620689655172413


In [4]:
#2
x8 = hv[["party", "yea8"]]

In [5]:
#3
Pr1 = x8.yea8[x8.yea8 == 1].count() / x8.shape[0]
Pr0 = x8.yea8[x8.yea8 == 0].count() / x8.shape[0]

print(Pr1)
print(Pr0)

0.5563218390804597
0.4436781609195402


Now we need the four conditional probabilities:  
$Pr(vote = Y|party = D)$, $Pr(vote = N|party = D)$, $Pr(vote = Y|party=R)$, and $Pr(vote=N|party=R)$ for the bill.


* Compute these conditional probabilities: the shares of yeas and non-yeas among democrats and among republicans. A good way to do it is to split the data into two groups, republicans and democrats, and find the mean vote value in each group (the mean of a 0/1 column is the share of yeas).
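
One possible way to get the group means in a single step is `groupby`. The table `toy` and the result names below are made up for illustration, and the sketch again assumes 0/1 coding of the vote column.

```python
import pandas as pd

# Hypothetical toy data, not the real house-votes file
toy = pd.DataFrame({
    "party": ["democrat", "democrat", "democrat", "republican", "republican"],
    "yea8": [1, 1, 0, 0, 1],
})

# Pr(vote = Y | party): mean of the 0/1 votes within each party
pr_yea_given_party = toy.groupby("party")["yea8"].mean()

# Pr(vote = N | party): the complement within each party
pr_nay_given_party = 1 - pr_yea_given_party

print(pr_yea_given_party)
print(pr_nay_given_party)
```

The result is indexed by party name, so `pr_yea_given_party["democrat"]` plays the role of $Pr(vote = Y|party = D)$.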

In [6]:
# code goes here
#4
x8d = x8[x8.party == "democrat"]
x8r = x8[x8.party == "republican"]

PrYD = (x8d.yea8[x8d.yea8 == 1].count() / x8.shape[0]) / PrD
PrND = (x8d.yea8[x8d.yea8 == 0].count() / x8.shape[0]) / PrD

PrYR = (x8r.yea8[x8r.yea8 == 1].count() / x8.shape[0]) / PrR
PrNR = (x8r.yea8[x8r.yea8 == 0].count() / x8.shape[0]) / PrR

print(PrYD, PrND, PrYR, PrNR)

0.8164794007490636 0.18352059925093633 0.14285714285714285 0.8571428571428571


## Predict using Bayes theorem

Now you have all the probabilities you need.  Next, let's use Bayes
theorem (see the equation in the "Compute Priors" section above) and compute, for each representative, the probability that they are a democrat and the probability that they are a republican, given their vote.


* Compute the probability of interest $Pr(party = D|vote_8)$. Use the plain Bayes theorem here and the conditional probabilities you calculated above.  Do not compute it using the counts directly (that will not scale to Naive Bayes!).
  
  Note: you have to pick the correct conditional probability.  For
  instance, when computing the probability that the representative is a
  democrat, you have to choose either
  $Pr(vote=Y|party=D)$ or
  $Pr(vote=N|party=D)$, and the correct normalizer, either
  $Pr(vote=Y)$ or $Pr(vote =N)$, 
  depending on whether she voted yea or nay. 
  
  
* Categorize the representatives as democrats or republicans using the threshold
  0.5.  This means that representatives with $Pr(party = D|vote = V) > 0.5$ are classified as democrats, and all others as republicans.
  
  
* Print the confusion matrix and accuracy. (Note: it may help to use the `confusion_matrix` function from `sklearn.metrics`.)


* Compare your accuracy with the accuracy of the naive model that assigns every representative to the majority class.  How much better is your classifier?


* Repeat the process with other bills.  Which bill will give you the best accuracy?  Which one the worst? 

  Hint: you may want to write a function and loop over the columns of the dataset.
  
  Hint2: Bills 2, 10, 16 will give the lowest accuracy.
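
The steps in this list can be sketched as one function that is then looped over columns. Everything below is illustrative: `bayes_accuracy`, `toy`, `yea1` and `yea2` are made-up names and data, and the sketch assumes 0/1 vote coding and that both vote values occur in the column (otherwise a normalizer would be zero). Note that the posterior assigned to each row depends only on that row's vote, not on its true party.

```python
import numpy as np
import pandas as pd


def bayes_accuracy(df, col, label="party", pos="democrat", neg="republican"):
    """Accuracy of a one-column Bayes classifier (hypothetical helper)."""
    vote = df[col]
    is_pos = df[label] == pos
    prior = is_pos.mean()                # Pr(party = D)
    pr_yea = vote.mean()                 # Pr(vote = 1)
    pr_yea_pos = vote[is_pos].mean()     # Pr(vote = 1 | party = D)
    # Bayes theorem, once per possible vote value
    post_yea = pr_yea_pos * prior / pr_yea
    post_nay = (1 - pr_yea_pos) * prior / (1 - pr_yea)
    # Each row gets the posterior that matches its own vote
    posterior = np.where(vote == 1, post_yea, post_nay)
    predicted = np.where(posterior > 0.5, pos, neg)
    return float((df[label] == predicted).mean())


# Made-up data: yea1 separates the parties well, yea2 does not
toy = pd.DataFrame({
    "party": ["democrat"] * 6 + ["republican"] * 4,
    "yea1": [1, 1, 1, 1, 1, 0, 0, 0, 0, 1],
    "yea2": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

accuracies = {col: bayes_accuracy(toy, col) for col in ["yea1", "yea2"]}
baseline = toy["party"].value_counts(normalize=True).max()  # majority-class accuracy
print(accuracies, baseline)
```

In this toy table the uninformative column `yea2` leaves both posteriors above 0.5, so every row is predicted democrat and the accuracy falls back to the majority-class share.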

In [7]:
# code goes here 
#5 
PrDY = PrYD * PrD / Pr1
PrDN = PrND * PrD / Pr0
PrRY = PrYR * PrR / Pr1
PrRN = PrNR * PrR / Pr0

print(PrDY, PrDN, PrRY, PrRN) # the first two probs are Pr(party = D | vote8)

0.9008264462809917 0.2538860103626943 0.09917355371900827 0.7461139896373057
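
As a check on the arithmetic, using the values printed in the earlier cells rounded to four digits, the first posterior works out as:

$Pr(party = D|vote = Y) = \frac{0.8165 * 0.6138}{0.5563} \approx 0.90$

and likewise $Pr(party = D|vote = N) = \frac{0.1835 * 0.6138}{0.4437} \approx 0.25$. For each vote the two party posteriors sum to one ($0.9008 + 0.0992 = 1$ and $0.2539 + 0.7461 = 1$), so $Pr(party = D|vote)$ alone is enough for the classification.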


In [8]:
#6
# Pr(party = D | vote) depends only on the vote, never on the true party label
x8n = x8.copy()
x8n["pr"] = np.where(x8n["yea8"] == 1, PrDY, PrDN)
x8n["new_party"] = np.where(x8n["pr"] > 0.5, "democrat", "republican")

In [9]:
#7
# rows are true labels, columns are predictions, both in the order democrat, republican
cm = confusion_matrix(x8n["party"], x8n["new_party"])
accuracy = accuracy_score(x8n["party"], x8n["new_party"])
print(cm)
print()
print(accuracy)

[[218  49]
 [ 24 144]]

0.832183908045977


In [10]:
#8
# baseline: always predict the majority class, scored on a random 20% test split
X_train, X_test, y_train, y_test = train_test_split(x8["yea8"], x8["party"], test_size=0.2)
d = DummyClassifier(strategy="most_frequent")
d.fit(X_train, y_train)
d.score(X_test, y_test)

0.5977011494252874

The Bayes classifier on bill 8 reaches an accuracy of about 0.83, while the majority-class baseline scores about 0.60 on this random test split (the share of democrats in the full data is 0.61). The Bayes classifier is therefore roughly 22 to 23 percentage points more accurate than the naive model.

In [11]:
#9
def nm(col):
    # Accuracy of the one-bill Bayes classifier that predicts party from column `col`
    new_df = hv[["party", col]].copy()
    PrD = (new_df.party == "democrat").mean()
    Pr1 = (new_df[col] == 1).mean()
    Pr0 = (new_df[col] == 0).mean()

    # Conditional probabilities of the vote among democrats
    demo_df = new_df[new_df.party == "democrat"]
    PrYD = (demo_df[col] == 1).mean()
    PrND = (demo_df[col] == 0).mean()

    # Bayes theorem: Pr(party = D | vote)
    PrDY = PrYD * PrD / Pr1
    PrDN = PrND * PrD / Pr0

    # The prediction uses only the vote, never the true party label
    new_df["pr"] = np.where(new_df[col] == 1, PrDY, PrDN)
    new_df["new_party"] = np.where(new_df["pr"] > 0.5, "democrat", "republican")

    return accuracy_score(new_df["party"], new_df["new_party"])

cols = list(hv.columns)
for i in range(1, 17):
    print("Accuracy of bill " + str(i) + ": " + str(nm(cols[i])))

