## Darren Abramson
## Assignment 3

### 1. Predict the class label and discuss how good your prediction is.

First, load in the data.

In [3]:
import numpy as np
from sklearn.neural_network import MLPClassifier
data = np.loadtxt("train.txt")
data.shape

(11, 200)

Given `sklearn` conventions it is convenient to transpose the data so that columns represent features and rows represent data examples.

In [4]:
data = np.transpose(data)
print data.shape


(200, 11)


Features and labels are separated out.

In [5]:
labels = data[:, 10]
print labels.shape
features = data[:, 0:10:1]
print features.shape

(200,)
(200, 10)


Next we examine how balanced the labels are between classes.

In [9]:
unique, counts = np.unique(labels, return_counts=True)
print np.asarray((unique, counts)).T

[[   0.   20.]
 [   1.  180.]]


This presents an immediate difficulty. If we assign, say, 30% of the data randomly to a test set, there will only be about 6 items from class `0` in the test set (and about 14 in the training set). This will affect our ability to assess the quality of the model. Nevertheless, for the purpose of this exercise, I will use this split, recognizing that more sophisticated approaches exist (e.g. repeating examples of the less-represented class, with or without some level of noise).
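As a rough illustration of the "repeating the less-represented class" idea mentioned above, here is a minimal sketch of random oversampling in plain Python. The function name `oversample_minority` and the toy data are hypothetical and are not used elsewhere in this assignment.

```python
import random

def oversample_minority(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw extra rows (with replacement) to reach the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy stand-in with the same 20-versus-180 imbalance as the data above.
toy_rows = [[float(i)] for i in range(200)]
toy_labels = [0] * 20 + [1] * 180
balanced_rows, balanced_labels = oversample_minority(toy_rows, toy_labels)
print("class 0: %d, class 1: %d" % (balanced_labels.count(0), balanced_labels.count(1)))
```

If something like this were used, it would normally be applied to the training portion only, after the test set has been split off, so that duplicated rows cannot appear on both sides of the split.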

Below I fit a simple classifier (Gaussian naive Bayes), estimate its accuracy with 10-fold cross-validation on the training portion, and then examine its accuracy on the held-out test set.

In [23]:
# Note: newer scikit-learn releases provide these tools in sklearn.model_selection.
from sklearn.cross_validation import train_test_split
from sklearn import cross_validation
from sklearn.naive_bayes import GaussianNB

# Hold out 30% of the examples as a test set.
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.3, random_state=42)

clf = GaussianNB()

# 10-fold cross-validation over the training portion only.
k_fold = cross_validation.KFold(len(features_train), 10)

averagePerformance = 0.

for k, (train, test) in enumerate(k_fold):
    clf.fit(features_train[train], labels_train[train])
    #print(k, clf.score(features_train[test], labels_train[test]))
    averagePerformance += clf.score(features_train[test], labels_train[test])

# After the loop, clf remains fitted on the last fold's training subset.
print "Average cross-validation performance: " , averagePerformance / 10

Average cross-validation performance:  0.821428571429


In [21]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

pred = clf.predict(features_test)

print confusion_matrix(labels_test, pred)
print "accuracy score for classifier with 30% withheld test set:", "\t", accuracy_score(labels_test, pred)

print 

[[ 1  5]
 [ 5 49]]
accuracy score for classifier with 30% withheld test set: 	0.833333333333


This is a **very bad** model. Of the 6 test items whose true label is 0, it classified only 1 correctly. In total it made 10 errors involving the 0 label: 5 true-0 items were predicted as 1, and 5 true-1 items were predicted as 0. Furthermore, its overall accuracy is worse than that of a classifier that assigns the label 1 to every input.

Assuming the incidence of 0 and 1 labels in the available data is close to their actual frequencies, such a classifier would achieve 90% accuracy; this model has an accuracy of only 83.3%.
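The figures in this discussion can be recomputed directly from the confusion matrix printed above. The sketch below assumes the usual `sklearn` layout (rows are true labels, columns are predicted labels); the variable names are mine.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions.
confusion = [[1, 5],
             [5, 49]]

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / float(total)

# Class-0 recall: fraction of true-0 items that were predicted as 0.
recall_0 = confusion[0][0] / float(sum(confusion[0]))
# Class-0 precision: fraction of predicted-0 items that are truly 0.
precision_0 = confusion[0][0] / float(confusion[0][0] + confusion[1][0])

# Accuracy of a baseline that predicts label 1 for every test item.
baseline_accuracy = sum(confusion[1]) / float(total)

print("accuracy:          %.3f" % accuracy)           # 0.833
print("class-0 recall:    %.3f" % recall_0)           # 0.167
print("class-0 precision: %.3f" % precision_0)        # 0.167
print("always-1 baseline: %.3f" % baseline_accuracy)  # 0.900
```

On this particular test set the always-1 baseline also reaches 90%, since 54 of the 60 held-out items have label 1.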

### 2. What is the maximum likelihood estimate of the parameter $\phi$ from $m$ independent trials in which $h$ heads are thrown given a Bernoulli random variable $Y$?

A Bernoulli random variable $Y$ is characterized by the given probability $P(Y=y) = \phi^y(1-\phi)^{1-y}$. To derive the maximum likelihood estimate of the parameter $\phi$ from $m$ independent trials in which $h$ heads are thrown, we write the log likelihood and set its derivative to 0 to find the maximum.

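Let $y_h = h/m$ be the proportion of trials that are heads. The likelihood of the $m$ trials is $\phi^{h}(1-\phi)^{m-h}$. Taking the logarithm and dividing by $m$, which does not change the location of the maximum, gives

$$\frac{1}{m}\log\, \phi^{h}(1-\phi)^{m-h} = \log\, \phi^{y_h}(1-\phi)^{1-y_h}$$

$$ = y_h\, \log\,\phi + (1-y_h)\,\log\,(1-\phi)$$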

Now we find the partial derivative with respect to $\phi$:

$$\frac{\partial}{\partial\phi} \left[ y_h\, \log\,\phi + (1-y_h)\,\log\,(1-\phi) \right] = \frac{\phi - y_h}{(\phi-1)\phi}$$

[1]: http://www.wolframalpha.com/input/?i=d%2Fdx+y+log+x+%2B+(1-y)log(1-x)

Note: due to my rusty calculus, the above line was obtained from [Wolfram Alpha][1] (with $y$ standing in for $y_h$ and $x$ for $\phi$).
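The same derivative can also be checked by hand, using $\frac{d}{d\phi}\log\phi = 1/\phi$ and $\frac{d}{d\phi}\log(1-\phi) = -1/(1-\phi)$:

$$\frac{\partial}{\partial\phi} \left[ y_h \log\phi + (1-y_h)\log(1-\phi) \right] = \frac{y_h}{\phi} - \frac{1-y_h}{1-\phi} = \frac{y_h(1-\phi) - (1-y_h)\phi}{\phi(1-\phi)} = \frac{y_h - \phi}{\phi(1-\phi)}$$

Multiplying numerator and denominator by $-1$ gives the form $\frac{\phi - y_h}{(\phi-1)\phi}$ shown above.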

Setting the partial derivative to 0 requires the numerator $\phi - y_h$ to vanish, which gives $\phi = y_h = h/m$.
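As a quick numerical sanity check of this result, the sketch below evaluates the log-likelihood on a grid of $\phi$ values and confirms that the maximizer coincides with $h/m$. The counts `h = 7` and `m = 10` are hypothetical values chosen only for illustration.

```python
import math

def log_likelihood(phi, h, m):
    """Log-likelihood of h heads in m independent Bernoulli(phi) trials."""
    return h * math.log(phi) + (m - h) * math.log(1.0 - phi)

# Hypothetical counts, chosen only for illustration.
h, m = 7, 10

# Evaluate on a grid strictly inside (0, 1) to avoid log(0).
grid = [i / 1000.0 for i in range(1, 1000)]
phi_hat = max(grid, key=lambda phi: log_likelihood(phi, h, m))

print("grid maximizer: %.3f, h/m: %.3f" % (phi_hat, h / float(m)))
```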

### 3. Factorize the joint probability $p(x_1, x_2, x_3, x_4)$ according to the provided graphical causal model.

As Trappenberg (201x) illustrates with an example (Equation 4.24), a causal directed acyclic graph depicts causal dependencies that yield a factorization of the joint probability that is simpler than the one obtained by applying the general chain rule. In this case, the joint probability is

$$P(x_1) P(x_2) P(x_3 \mid x_1, x_2) P(x_4 \mid x_3)$$
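To make the factorization concrete, here is a small sketch that builds the joint distribution from its four factors for binary variables and checks that it sums to 1. All of the probability values are hypothetical; only the structure of the product comes from the graph.

```python
import itertools

# Hypothetical parameters for binary variables (values invented for illustration).
p_x1 = 0.4                                  # P(x1 = 1)
p_x2 = 0.3                                  # P(x2 = 1)
p_x3_given = {(0, 0): 0.1, (0, 1): 0.5,     # P(x3 = 1 | x1, x2)
              (1, 0): 0.6, (1, 1): 0.9}
p_x4_given = {0: 0.2, 1: 0.8}               # P(x4 = 1 | x3)

def bernoulli(p_one, value):
    """Probability of a binary value given P(value = 1)."""
    return p_one if value == 1 else 1.0 - p_one

def joint(x1, x2, x3, x4):
    """P(x1) P(x2) P(x3 | x1, x2) P(x4 | x3)."""
    return (bernoulli(p_x1, x1) * bernoulli(p_x2, x2)
            * bernoulli(p_x3_given[(x1, x2)], x3)
            * bernoulli(p_x4_given[x3], x4))

total = sum(joint(*xs) for xs in itertools.product([0, 1], repeat=4))
print("sum over all 16 assignments: %.6f" % total)
print("P(1, 1, 1, 1) = %.4f" % joint(1, 1, 1, 1))
```

For binary variables the factorized form needs 1 + 1 + 4 + 2 = 8 parameters, compared with 15 for an unrestricted joint distribution over four binary variables.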


### 4. Write out the network as a function $x_4 = \ldots$ and give the value for input $x_1 = 1, x_2 = 2$.

Below I define a Python function that computes the output of the specified neural network, and evaluate it on the given input values. I have included the $w_{x_3 x_1}$ weight for clarity even though it is 1.

In [8]:
import math

def smallNN(x1, x2):
    return math.tanh(3. * math.tanh(x1 * 1. + x2 * 2))

print smallNN(1, 2)

0.995052065576
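Written as a formula, the function above corresponds to the following. The weight names $w_{x_3 x_2}$ and $w_{x_4 x_3}$ are my own labels, following the pattern of $w_{x_3 x_1}$; their values are read off the code.

$$x_4 = \tanh\big(w_{x_4 x_3} \tanh(w_{x_3 x_1} x_1 + w_{x_3 x_2} x_2)\big) = \tanh\big(3 \tanh(x_1 + 2 x_2)\big)$$

For $x_1 = 1, x_2 = 2$ this is $\tanh(3 \tanh(5)) \approx 0.995$, in agreement with the printed value.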
