<a href="https://colab.research.google.com/github/Obyd/Machine-Learning-Fundamentals/blob/master/hw1_naive_bayes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 1: Naive Bayes Classifier

In this homework you will implement the Naive Bayes Classifier on a data set of votes in the U.S. House of Representatives, with the goal of predicting the party affiliation of each congressman. The input data $X$ is given by an $N$-by-$D$ matrix, where $N$ is the number of examples and $D=16$ is the number of input features. Each feature is binary (yes/no). The targets are given by a length-$N$ sequence of classes, $Y$, that are also binary. More information on the data set can be found at https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records.
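
For reference, the standard Naive Bayes model assumes that the votes are conditionally independent given the party, so the posterior follows from Bayes' rule. The symbols $\pi_c$ and $\theta_{cj}$ are notation introduced here for illustration only; they play the role of what the starter code calls `p_label` and `p_votes`.

$$
p(y = c \mid x) = \frac{\pi_c \prod_{j=1}^{D} \theta_{cj}^{x_j} (1-\theta_{cj})^{1-x_j}}{\sum_{c'} \pi_{c'} \prod_{j=1}^{D} \theta_{c'j}^{x_j} (1-\theta_{c'j})^{1-x_j}},
\qquad \pi_c = p(y = c), \quad \theta_{cj} = p(x_j = 1 \mid y = c).
$$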




First, we need to download the data. The following code uses the urllib library to request the data from a website. The pandas library is a powerful library for data analysis; we use its read_csv function to automatically parse the comma-separated values (CSV) file.

In [0]:
import pandas 
import urllib.request  
import numpy   # Numerical python.

# Download the data.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data"
response = urllib.request.urlopen(url)

# Interpret text data into pandas data frame. Interpret 'abstain' votes as 'no'.
dataset  = pandas.read_csv(response, header=None, true_values=['y'], false_values=['n','?'])

# Set the column names.
names = ['label'] + [f'vote_{i}' for i in range(16)]
dataset.columns = names

# Tells pandas that this is a categorical feature.
dataset['label'] = pandas.Categorical(dataset['label'])

print("Dataset shape: ", dataset.shape)
dataset.head() # Prints first 5 examples from the data set.

Dataset shape:  (435, 17)


Unnamed: 0,label,vote_0,vote_1,vote_2,vote_3,vote_4,vote_5,vote_6,vote_7,vote_8,vote_9,vote_10,vote_11,vote_12,vote_13,vote_14,vote_15
0,republican,False,True,False,True,True,True,False,False,False,True,False,True,True,True,False,True
1,republican,False,True,False,True,True,True,False,False,False,False,False,True,True,True,False,False
2,democrat,False,True,True,False,True,True,False,False,False,False,True,False,True,True,False,False
3,democrat,False,True,True,False,False,True,False,False,False,False,True,False,True,False,False,True
4,democrat,True,True,True,False,True,True,False,False,False,False,True,False,True,True,True,True


Numpy is a powerful library for mathematical operations on vectors and matrices. Here we convert the pandas data into a 2-dimensional numpy array (a matrix). 

In [0]:
X = numpy.array(dataset.iloc[:,1:]) # Convert input features into Numpy array.
Y = dataset['label'].cat.codes # Converts string labels to binary codes (democrat=0, republican=1).

# Split data into train and test set. Use the first 335 examples for training.
num_train = 335
Xtrain = X[0:num_train, :]
Xtest  = X[num_train:,:]
Ytrain=Y[:num_train]
Ytest=Y[num_train:]
print(Xtrain.shape, Xtest.shape)
print(Ytrain.shape, Ytest.shape)

(335, 16) (100, 16)
(335,) (100,)


You are asked to implement the following functions.

In [0]:
def generative_model(Xtrain, Ytrain):
    ''' 
    Implements a generative algorithm on binary data.
    Inputs
        Xtrain: NxD matrix of features.
        Ytrain: N vector of class labels
    
    Returns
        p_label: Length 2 vector of class probabilities.
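        p_votes: 2xD matrix where entry i,j is p(x_j=1|y=i).
    ''' 
    # WRITE ME 
    num_examples, num_features = Xtrain.shape
    p_label = numpy.zeros(2)
    p_votes = numpy.zeros((2, num_features))

    # Count the training examples in each class.
    for i in range(num_examples):
        if Ytrain[i] == 0:
            p_label[0] += 1
    p_label[1] = num_examples - p_label[0]

    # Accumulate, per class, the fraction of 'yes' votes on each feature.
    for i in range(num_features):
        for j in range(num_examples):
            if Ytrain[j] == 0:
                if Xtrain[j, i] == 1:
                    p_votes[0, i] += 1 / p_label[0]
            else:
                if Xtrain[j, i] == 1:
                    p_votes[1, i] += 1 / p_label[1]

    # Convert class counts to prior probabilities.
    p_label = p_label / num_examples
    return p_label, p_votes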
  
def discriminative_model(p_label, p_votes, Xtest):
    '''
    Implements Naive Bayes Classification.
    Inputs
      p_label, p_votes: From generative_model.
      Xtest: NxD matrix of binary features.
    
    Outputs 
      pred_prob: Probability of label=1 under model.
    '''
    # WRITE ME
    num_examples, num_features = Xtest.shape
    pred_prob = numpy.zeros(num_examples)
    for i in range(num_examples):
        # Class-conditional likelihoods p(x|y) for this example, built up
        # as a product over the features (label 0 = democrat, 1 = republican).
        prob_dem = 1.0
        prob_rep = 1.0
        for j in range(num_features):
            if Xtest[i, j] == 1:
                prob_dem *= p_votes[0, j]
                prob_rep *= p_votes[1, j]
            else:
                prob_dem *= 1 - p_votes[0, j]
                prob_rep *= 1 - p_votes[1, j]
        # Bayes' rule: posterior probability of label 1.
        pred_prob[i] = (prob_rep * p_label[1]) / (prob_dem * p_label[0] +
                                                  prob_rep * p_label[1])
    return pred_prob

def accuracy(y_true, y_predicted):
    ''' Calculates the fraction of correct predictions.
    '''
    assert(len(y_true) == len(y_predicted))
    # WRITE ME
    pred_results=numpy.zeros(len(y_predicted))
    for i in range(len(y_predicted)):
      pred_results[i]=int(round(y_predicted[i]))
      
    correct_pred=1-sum(abs(y_true-pred_results))/len(y_true)
    return correct_pred
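
The counting loops in `generative_model` can also be written with array operations. The following is a minimal sketch, assuming binary labels coded as 0 and 1 and at least one training example per class; the function name `fit_bernoulli_nb` and the toy data are hypothetical and not part of the starter code.

```python
import numpy

def fit_bernoulli_nb(X, y):
    """Hypothetical vectorized counterpart of generative_model."""
    X = numpy.asarray(X, dtype=float)
    y = numpy.asarray(y)
    # Class priors: fraction of training examples in each class.
    p_label = numpy.array([numpy.mean(y == c) for c in (0, 1)])
    # Row c holds the fraction of 'yes' votes on each feature within class c.
    p_votes = numpy.stack([X[y == c].mean(axis=0) for c in (0, 1)])
    return p_label, p_votes

# Tiny made-up example (not the voting data): 4 examples, 2 features.
X_toy = numpy.array([[1, 0], [1, 1], [0, 1], [0, 1]])
y_toy = numpy.array([0, 0, 1, 1])
p_label_toy, p_votes_toy = fit_bernoulli_nb(X_toy, y_toy)
print(p_label_toy)
print(p_votes_toy)
```

On the toy data both priors are 0.5, and class 0 voted 'yes' on the first feature every time and on the second feature half of the time.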
    

In [0]:
def logprediction(p_label, p_votes, Xtest):
    '''Calculates the joint probability p(label=1, x) of each test example
    under the Naive Bayes model, accumulating the per-vote terms in log space.
    '''
    pred_prob=numpy.zeros(Xtest.shape[0])
    for i in range(Xtest.shape[0]):
      exponent=0
      # Loop over all D votes; len(p_votes) would give the number of classes (2).
      for j in range(p_votes.shape[1]):
        if Xtest[i,j]==1:
          exponent = exponent + numpy.log(p_votes[1,j])
        else:
          exponent = exponent + numpy.log(1-p_votes[1,j])
      pred_prob[i]=numpy.exp(numpy.log(p_label[1])+exponent)

    return pred_prob  
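
A product of many probabilities can underflow as the number of features grows, which is the usual reason for working in log space. The following is a minimal sketch of a normalized log-space posterior, assuming `p_label` and `p_votes` have the shapes described in the docstrings above; the function name `log_posterior`, the clipping constant `eps`, and the toy parameters are illustrative assumptions, not part of the starter code.

```python
import numpy

def log_posterior(p_label, p_votes, X, eps=1e-12):
    """Return log p(y=c | x) for each row of X, as an N x 2 array."""
    X = numpy.asarray(X, dtype=float)
    # Clip to avoid log(0); eps is an arbitrary illustrative choice.
    theta = numpy.clip(numpy.asarray(p_votes, dtype=float), eps, 1 - eps)
    # log p(x | y=c) = sum_j [x_j log theta_cj + (1 - x_j) log(1 - theta_cj)]
    log_lik = X @ numpy.log(theta).T + (1 - X) @ numpy.log(1 - theta).T
    log_joint = log_lik + numpy.log(numpy.asarray(p_label, dtype=float))
    # Normalize with log-sum-exp over the two classes.
    log_norm = numpy.logaddexp(log_joint[:, 0], log_joint[:, 1])
    return log_joint - log_norm[:, None]

# Made-up parameters for two features (not fitted to the voting data).
p_label_toy = numpy.array([0.5, 0.5])
p_votes_toy = numpy.array([[0.9, 0.2], [0.1, 0.8]])
X_toy = numpy.array([[1, 0], [0, 1]])
post = numpy.exp(log_posterior(p_label_toy, p_votes_toy, X_toy))
print(post)  # each row sums to 1
```

Exponentiating column 1 of the result gives the same quantity that `discriminative_model` returns, up to floating-point error and the effect of clipping.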

In [0]:
# Make sure to print these for submission.
p_label, p_votes = generative_model(Xtrain, Ytrain)
print('Label priors:', p_label)
print('Conditional vote probabilities:', p_votes)

pred_prob = discriminative_model(p_label, p_votes, Xtest)
print('Predictions:', pred_prob)

Label priors: [0.62089552 0.37910448]
Conditional vote probabilities: [[0.58653846 0.40384615 0.88461538 0.03365385 0.16346154 0.41346154
  0.77884615 0.83653846 0.74519231 0.45192308 0.44230769 0.11538462
  0.24519231 0.33173077 0.62980769 0.64903846]
 [0.16535433 0.46456693 0.11023622 0.96850394 0.94488189 0.86614173
  0.23622047 0.12598425 0.09448819 0.54330709 0.08661417 0.81889764
  0.81102362 0.94488189 0.08661417 0.54330709]]
Predictions: [9.99999899e-01 4.95640927e-06 3.63198586e-11 1.80963571e-09
 9.99998953e-01 9.99997918e-01 4.11813135e-10 6.62684047e-05
 9.75348251e-01 4.48636963e-11 9.99949158e-01 9.99999981e-01
 9.99999913e-01 2.82386635e-09 9.85339654e-01 8.03188656e-07
 8.08427700e-01 4.10067045e-04 9.99888994e-01 4.41144464e-09
 1.81942458e-05 9.99999983e-01 9.99999983e-01 3.24919501e-05
 9.99999981e-01 1.14011503e-07 1.87073748e-02 6.05272439e-07
 9.99322094e-01 9.99955946e-01 9.97583651e-01 4.89728308e-09
 2.81548613e-07 1.23638351e-07 9.99999264e-01 1.72665393e-08
 

In [0]:
correct_pred = accuracy(Ytest,pred_prob)
print('Percent of Correct Predictions:',correct_pred)

new_pred = logprediction(p_label, p_votes, Xtest)

log_correct = accuracy(Ytest,new_pred)
print('Percent of Correct Predictions:',log_correct)

Percent of Correct Predictions: 0.85
Percent of Correct Predictions: 0.5900000000000001


## To turn in:
1) Implement the Naive Bayes Classifier using the starter code above. Make sure to print out the parameters of the generative model and the predictions on the test points.

2) Compute the log probability of the test set --- this is a single scalar value.

3) Compare the NB classifier to a model in which we predict a 50-50 chance for each vote, in terms of accuracy and the log probability. Which model is better and why? Describe two situations in which the Naive Bayes Classifier will fail. 
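
One common reading of item 2 is the sum, over the test examples, of the log probability that the model assigns to the true label. The sketch below takes that reading as an assumption, and also assumes that the 50-50 model in item 3 outputs 0.5 for every example. The helper name `label_log_probability`, the clipping constant `eps`, and the toy numbers are illustrative only.

```python
import math
import numpy

def label_log_probability(y_true, pred_prob, eps=1e-12):
    """Sum over examples of log p(y_i | x_i), given pred_prob = p(y_i=1 | x_i)."""
    y = numpy.asarray(y_true, dtype=float)
    # Clip so that a prediction of exactly 0 or 1 does not produce log(0).
    p = numpy.clip(numpy.asarray(pred_prob, dtype=float), eps, 1 - eps)
    return float(numpy.sum(y * numpy.log(p) + (1 - y) * numpy.log(1 - p)))

# Made-up labels and predictions (not the homework results).
y_toy = numpy.array([1, 0, 1, 0])
confident = numpy.array([0.9, 0.1, 0.8, 0.3])
coin_flip = numpy.full(4, 0.5)

score_confident = label_log_probability(y_toy, confident)
score_coin_flip = label_log_probability(y_toy, coin_flip)
print(score_confident)  # closer to 0 is better
print(score_coin_flip)  # equals 4 * log(0.5)
```

Under this reading, the 50-50 model always scores $N \log 0.5$ regardless of the data, so any model that puts more than half of its probability on the true label on average will beat it.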



3) If we predict a 50-50 chance for each vote then any politician 

The Naive Bayes Classifier can fail when features $x_i$ and $x_j$ are not actually conditionally independent given the class, and when the data contains extraneous features.
An example of the first situation: when classifying sick versus healthy, watery eyes and a runny nose tend to occur together, so treating them as separate pieces of evidence counts the same underlying symptom twice.
An example of the second situation: again classifying health, some of the features may have nothing to do with health, such as whether someone had coffee that morning.
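
The double-counting effect in the first situation can be illustrated numerically. In this sketch, which uses made-up probabilities, a single symptom is copied into a second, perfectly correlated feature. Naive Bayes treats the copy as independent evidence and becomes more confident even though no new information was added. The function name `nb_posterior` is hypothetical.

```python
def nb_posterior(prior_sick, p_sym_sick, p_sym_healthy, n_copies):
    """Posterior p(sick | symptom present) when the same symptom is
    counted n_copies times as if the copies were independent."""
    sick = prior_sick * p_sym_sick ** n_copies
    healthy = (1 - prior_sick) * p_sym_healthy ** n_copies
    return sick / (sick + healthy)

# Made-up numbers chosen only for illustration.
prior_sick, p_sym_sick, p_sym_healthy = 0.5, 0.8, 0.2

once = nb_posterior(prior_sick, p_sym_sick, p_sym_healthy, n_copies=1)
twice = nb_posterior(prior_sick, p_sym_sick, p_sym_healthy, n_copies=2)
print(once, twice)
```

With these numbers the posterior rises from 0.8 to about 0.94 when the symptom is counted twice, although the correct posterior is 0.8 in both cases because the second feature is an exact copy of the first.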