## Question 1

Script to download the genres and extract features into `q1.arff`:

```bash
GENRE_1=country
GENRE_2=disco
GENRE_3=classical

# Get the Genres
if [ ! -d genres ]; then
    curl http://opihi.cs.uvic.ca/sound/genres.tar.gz | tar xz
fi

# Build collections
if [ ! -f q1_${GENRE_1}.mf ]; then
    mkcollection -c q1_${GENRE_1}.mf -l ${GENRE_1} genres/${GENRE_1}
fi

if [ ! -f q1_${GENRE_2}.mf ]; then
    mkcollection -c q1_${GENRE_2}.mf -l ${GENRE_2} genres/${GENRE_2}
fi

if [ ! -f q1_${GENRE_3}.mf ]; then
    mkcollection -c q1_${GENRE_3}.mf -l ${GENRE_3} genres/${GENRE_3}
fi

cat q1_${GENRE_1}.mf q1_${GENRE_2}.mf q1_${GENRE_3}.mf > q1.mf

bextract -sv q1.mf -w q1.arff
```

### Weka: ZeroR

```
=== Summary ===

Correctly Classified Instances         100               33.3333 %
Incorrectly Classified Instances       200               66.6667 %
Kappa statistic                          0     
Mean absolute error                      0.4444
Root mean squared error                  0.4714
Relative absolute error                100      %
Root relative squared error            100      %
Coverage of cases (0.95 level)         100      %
Mean rel. region size (0.95 level)     100      %
Total Number of Instances              300     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 1.000    1.000    0.333      1.000    0.500      0.000    0.500     0.333     classical
                 0.000    0.000    0.000      0.000    0.000      0.000    0.500     0.333     country
                 0.000    0.000    0.000      0.000    0.000      0.000    0.500     0.333     disco
Weighted Avg.    0.333    0.333    0.111      0.333    0.167      0.000    0.500     0.333     

=== Confusion Matrix ===

   a   b   c   <-- classified as
 100   0   0 |   a = classical
 100   0   0 |   b = country
 100   0   0 |   c = disco
```

### Weka: Naive Bayes

```
=== Summary ===

Correctly Classified Instances         253               84.3333 %
Incorrectly Classified Instances        47               15.6667 %
Kappa statistic                          0.765 
Mean absolute error                      0.1035
Root mean squared error                  0.3174
Relative absolute error                 23.2891 %
Root relative squared error             67.321  %
Coverage of cases (0.95 level)          86      %
Mean rel. region size (0.95 level)      34.3333 %
Total Number of Instances              300     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.890    0.030    0.937      0.890    0.913      0.872    0.987     0.959     classical
                 0.790    0.120    0.767      0.790    0.778      0.665    0.928     0.857     country
                 0.850    0.085    0.833      0.850    0.842      0.761    0.963     0.923     disco
Weighted Avg.    0.843    0.078    0.846      0.843    0.844      0.766    0.959     0.913     

=== Confusion Matrix ===

  a  b  c   <-- classified as
 89 11  0 |  a = classical
  4 79 17 |  b = country
  2 13 85 |  c = disco
```
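
The summary figures in a Weka report can be recomputed from its confusion matrix. The sketch below does this for the Naive Bayes matrix above; `summarize` is a hypothetical helper written for this illustration, not part of Weka, and it assumes rows are the true classes and columns the predicted ones.

```python
def summarize(confusion, labels):
    """Return accuracy, Cohen's kappa and per-class (precision, recall)."""
    total = sum(sum(row) for row in confusion)
    row_sums = [sum(row) for row in confusion]        # true-class counts
    col_sums = [sum(col) for col in zip(*confusion)]  # predicted-class counts
    correct = sum(confusion[i][i] for i in range(len(labels)))

    accuracy = correct / total
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    per_class = {}
    for i, label in enumerate(labels):
        precision = confusion[i][i] / col_sums[i] if col_sums[i] else 0.0
        recall = confusion[i][i] / row_sums[i] if row_sums[i] else 0.0
        per_class[label] = (precision, recall)
    return accuracy, kappa, per_class


# Confusion matrix copied from the Naive Bayes report above.
naive_bayes = [[89, 11, 0], [4, 79, 17], [2, 13, 85]]
accuracy, kappa, per_class = summarize(naive_bayes, ["classical", "country", "disco"])
print(round(accuracy, 4), round(kappa, 3))
for label, (precision, recall) in per_class.items():
    print(label, round(precision, 3), round(recall, 3))
```

With three balanced classes the chance agreement is 1/3, which is also why ZeroR, at 33.3% accuracy, gets a kappa of 0.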

### Weka: J48

```
=== Summary ===

Correctly Classified Instances         241               80.3333 %
Incorrectly Classified Instances        59               19.6667 %
Kappa statistic                          0.705 
Mean absolute error                      0.137 
Root mean squared error                  0.3515
Relative absolute error                 30.8158 %
Root relative squared error             74.5673 %
Coverage of cases (0.95 level)          84      %
Mean rel. region size (0.95 level)      36.7778 %
Total Number of Instances              300     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 0.950    0.045    0.913      0.950    0.931      0.896    0.955     0.911     classical
                 0.730    0.145    0.716      0.730    0.723      0.582    0.792     0.652     country
                 0.730    0.105    0.777      0.730    0.753      0.635    0.826     0.693     disco
Weighted Avg.    0.803    0.098    0.802      0.803    0.802      0.705    0.858     0.752     

=== Confusion Matrix ===

  a  b  c   <-- classified as
 95  4  1 |  a = classical
  7 73 20 |  b = country
  2 25 73 |  c = disco
```

### Weka: SMO

```
=== Summary ===

Correctly Classified Instances         282               94      %
Incorrectly Classified Instances        18                6      %
Kappa statistic                          0.91  
Mean absolute error                      0.2363
Root mean squared error                  0.2969
Relative absolute error                 53.1667 %
Root relative squared error             62.9815 %
Coverage of cases (0.95 level)          99.6667 %
Mean rel. region size (0.95 level)      66.6667 %
Total Number of Instances              300     

=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
                 1.000    0.010    0.980      1.000    0.990      0.985    0.995     0.980     classical
                 0.880    0.030    0.936      0.880    0.907      0.864    0.929     0.865     country
                 0.940    0.050    0.904      0.940    0.922      0.882    0.954     0.877     disco
Weighted Avg.    0.940    0.030    0.940      0.940    0.940      0.910    0.959     0.908     

=== Confusion Matrix ===

   a   b   c   <-- classified as
 100   0   0 |   a = classical
   2  88  10 |   b = country
   0   6  94 |   c = disco

```

## Scikit-Learn
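
The code below loads `q1.libsvm`, while the script above writes `q1.arff`; the conversion step is not shown. One possible way to bridge the two is sketched here. It assumes a dense ARFF file whose rows are comma-separated numeric features with the class label last, and `arff_to_libsvm` is a hypothetical helper, not a Marsyas or scikit-learn function.

```python
def arff_to_libsvm(arff_lines):
    """Convert dense ARFF rows (features..., label) to libsvm-format lines."""
    label_ids = {}   # class name -> integer id, in order of first appearance
    in_data = False
    converted = []
    for line in arff_lines:
        line = line.strip()
        if not line or line.startswith('%'):
            continue  # skip blank lines and ARFF comments
        if not in_data:
            # Everything before @data is header; attribute names are not needed.
            if line.lower().startswith('@data'):
                in_data = True
            continue
        *features, label = line.split(',')
        label_id = label_ids.setdefault(label, len(label_ids))
        # libsvm feature indices are 1-based.
        pairs = ' '.join('{}:{}'.format(i, float(value))
                         for i, value in enumerate(features, start=1))
        converted.append('{} {}'.format(label_id, pairs))
    return converted


# Tiny made-up sample, only to show the shape of the input and output.
sample = [
    "@relation example",
    "@attribute a real",
    "@attribute b real",
    "@attribute output {classical,country}",
    "@data",
    "% a comment line",
    "0.5,2,classical",
    "1.5,3,country",
]
for row in arff_to_libsvm(sample):
    print(row)
```

For the real file, the same function could be applied to `open("q1.arff")` and the result written to `q1.libsvm`, one row per line.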

Some common code:

In [1]:
def do_scikit_problem(classifier):
    from sklearn.datasets import load_svmlight_file
    from sklearn import metrics
    
    data, target = load_svmlight_file("q1.libsvm")

    model = classifier()
    model.fit(data, target)

    print(model)
    
    # make predictions
    expected = target
    predicted = model.predict(data)
    
    # summarize the fit of the model
    print(metrics.classification_report(expected, predicted))
    print("Confusion Matrix:")
    print(metrics.confusion_matrix(expected, predicted))
    print("----")

from sklearn.linear_model import LogisticRegression
do_scikit_problem(LogisticRegression)

from sklearn.tree import DecisionTreeClassifier
do_scikit_problem(DecisionTreeClassifier)

from sklearn.naive_bayes import BernoulliNB
do_scikit_problem(BernoulliNB)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
             precision    recall  f1-score   support

        0.0       0.99      0.99      0.99       100
        1.0       0.93      0.92      0.92       100
        2.0       0.92      0.93      0.93       100

avg / total       0.95      0.95      0.95       300

Confusion Matrix:
[[99  1  0]
 [ 0 92  8]
 [ 1  6 93]]
----
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
             precision    recall  f1-score   support

        0.0       1.00      1.00      1.00       100
        1.0       1.00    

## Question 2

First, let's traverse the directory tree and build a presence vector for each review. Then we'll print, for each word, the percentage of positive and negative reviews that contain it:

In [2]:
from os import listdir

hotwords = [ 'awful', 'bad', 'boring', 'dull', 'effective', 'great', 'hilarious' ]
dataset_size = 1000

def process_sample(string):
    instance_vector = [False] * len(hotwords)
    for i, word in enumerate(hotwords):
        if word in string:
            instance_vector[i] = True
    return instance_vector

def get_samples(polarity):
    data = []
    for file_name in listdir("q2_data/"+polarity):
        file_descriptor = open("q2_data/"+polarity+"/"+file_name)
        file_contents = file_descriptor.read()
        data.append(file_contents)
        file_descriptor.close()
    return data
        

# Load/Parse the data
pos = [process_sample(sample) for sample in get_samples("pos")]
neg = [process_sample(sample) for sample in get_samples("neg")]

# Calc probabilities
## Positives
pos_probabilities = [0] * len(hotwords)
for instance in pos:
    for i,exists in enumerate(instance):
        if exists:
            pos_probabilities[i] += 1
for i,val in enumerate(pos_probabilities):
    pos_probabilities[i] = float(val) / 10  # count / 1000 reviews * 100 = percentage
## Negatives
neg_probabilities = [0] * len(hotwords)
for instance in neg:
    for i,exists in enumerate(instance):
        if exists:
            neg_probabilities[i] += 1
for i,val in enumerate(neg_probabilities):
    neg_probabilities[i] = float(val) / 10
## Total
total_probabilities = [0] * len(hotwords)
for instance in neg + pos:
    for i,exists in enumerate(instance):
        if exists:
            total_probabilities[i] += 1
for i,val in enumerate(total_probabilities):
    total_probabilities[i] = float(val) / 10

# Print all nice like.
print('{:10s} {:7s} {:7s}'.format("word", "pos", "neg"))

for i,pair in enumerate(pos_probabilities):
    print('{:10s} {:4.1f}  {:4.1f}'.format(hotwords[i], pos_probabilities[i], neg_probabilities[i]))

word       pos     neg    
awful       3.4  12.2
bad        28.0  54.5
boring      5.4  17.5
dull        2.5  10.1
effective  15.4   8.6
great      48.5  32.0
hilarious  13.2   5.9


(Part 2)

Since half the reviews are negative and half are positive, there is a 50% chance of any given sample being positive and a 50% chance of it being negative.
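
As a worked example of how these priors combine with the word rates from the table above, the sketch below scores a review containing "bad" and "great". The names `score` and `posterior` are hypothetical helpers for this illustration. Like the classifier that follows, it multiplies only over the words that are present and ignores absent ones, which is a simplification of full Bernoulli naive Bayes.

```python
hotwords = ['awful', 'bad', 'boring', 'dull', 'effective', 'great', 'hilarious']
# Word-presence rates from the table above, converted from percent to fractions.
p_word_given_pos = [0.034, 0.280, 0.054, 0.025, 0.154, 0.485, 0.132]
p_word_given_neg = [0.122, 0.545, 0.175, 0.101, 0.086, 0.320, 0.059]
prior = {'pos': 0.5, 'neg': 0.5}


def score(present_words, likelihoods, class_prior):
    """Unnormalised posterior: prior times P(word | class) for each present word."""
    result = class_prior
    for word, p in zip(hotwords, likelihoods):
        if word in present_words:
            result *= p
    return result


def posterior(present_words):
    """Normalise the two class scores so that they sum to 1."""
    pos = score(present_words, p_word_given_pos, prior['pos'])
    neg = score(present_words, p_word_given_neg, prior['neg'])
    total = pos + neg
    return {'pos': pos / total, 'neg': neg / total}


example = posterior({'bad', 'great'})
print(example)
```

Here the positive score is 0.5 × 0.280 × 0.485 = 0.0679 and the negative score is 0.5 × 0.545 × 0.320 = 0.0872, so the review is classified as negative. Because the priors are equal, they cancel out and only the word rates decide the outcome.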

In [3]:
prob_polarity = {
    'pos': .50,
    'neg': .50
}

samples = {
    'pos': get_samples('pos'),
    'neg': get_samples('neg')
}

compound_probabilities = {
    'pos': [prob * prob_polarity['pos'] for prob in pos_probabilities],
    'neg': [prob * prob_polarity['neg'] for prob in neg_probabilities]
}

def classify(instance_vector):
    pos_numerator = prob_polarity['pos']
    neg_numerator = prob_polarity['neg']
    
    pos_denominator = 1 # We can safely start at 1.
    neg_denominator = 1
    
    for i,present in enumerate(instance_vector):
        if present:
            pos_numerator *= compound_probabilities['pos'][i]
            neg_numerator *= compound_probabilities['neg'][i]
            
            pos_denominator *= total_probabilities[i]
            neg_denominator *= total_probabilities[i]
    pos = pos_numerator / pos_denominator
    neg = neg_numerator / neg_denominator
    if pos > neg:
        return 'pos'
    else:
        return 'neg'
    

In [4]:
pos_as_pos = 0
pos_as_neg = 0
for i in [classify(sample) for sample in pos]:
    if i == 'pos':
        pos_as_pos += 1
    elif i == 'neg':
        pos_as_neg += 1

neg_as_pos = 0
neg_as_neg = 0
for i in [classify(sample) for sample in neg]:
    if i == 'pos':
        neg_as_pos += 1
    elif i == 'neg':
        neg_as_neg += 1

print(" pos  neg\n{:4d} {:4d} pos\n{:4d} {:4d} neg".format(pos_as_pos, pos_as_neg, neg_as_pos, neg_as_neg))
print("Accuracy: {}".format(float(pos_as_pos + neg_as_neg) / float(pos_as_pos + pos_as_neg + neg_as_neg + neg_as_pos)))

 pos  neg
 452  548 pos
 153  847 neg
Accuracy: 0.6495


(Part 3, Crossfold validation)

In [5]:
def get_folds(polarity):
    return zip(*[iter(get_samples(polarity))]*100)

def get_crossfolds(folds):
    crossfolds = []
    for i in range(0, 10-1):
        crossfolds.append(folds[i:] + folds[:i+1])
    return crossfolds
    
def get_probabilities(processed_instances):
    probabilities = [0] * len(hotwords)
    for instance in processed_instances:
        for i,exists in enumerate(instance):
            if exists:
                probabilities[i] += 1
    for i,val in enumerate(probabilities):
        probabilities[i] = (float(val) / len(processed_instances)) * 100
    return probabilities

def classify_crossfold(pos_fold_probabilities, neg_fold_probabilities, instance_vector):
    compound_probabilities = {
        'pos': [prob * .5 for prob in pos_fold_probabilities],
        'neg': [prob * .5 for prob in neg_fold_probabilities]
    }
    
    total_probabilities = [x + y for x,y in zip(compound_probabilities["pos"], compound_probabilities["neg"])]
    
    pos_numerator = prob_polarity['pos']
    neg_numerator = prob_polarity['neg']
    
    pos_denominator = 1 # We can safely start at 1.
    neg_denominator = 1
    
    for i,present in enumerate(instance_vector):
        if present:
            pos_numerator *= compound_probabilities['pos'][i]
            neg_numerator *= compound_probabilities['neg'][i]
            
            pos_denominator *= total_probabilities[i]
            neg_denominator *= total_probabilities[i]
    pos = pos_numerator / pos_denominator
    neg = neg_numerator / neg_denominator
    if pos > neg:
        return 'pos'
    else:
        return 'neg'

pos_folds = []
for chunk in get_folds("pos"):
    samples = []
    for sample in chunk:
        samples.append(process_sample(sample))
    pos_folds.append(samples)
    
neg_folds = []
for chunk in get_folds("neg"):
    samples = []
    for sample in chunk:
        samples.append(process_sample(sample))
    neg_folds.append(samples)

import itertools
pos_crossfold_probabilities = [get_probabilities(crossfold) for crossfold in itertools.chain(*get_crossfolds(pos_folds))]
neg_crossfold_probabilities = [get_probabilities(crossfold) for crossfold in itertools.chain(*get_crossfolds(neg_folds))]

def acc_and_confusion_for_fold(fold_number):
    pos_as_pos = 0
    pos_as_neg = 0
    for i in [classify_crossfold(pos_crossfold_probabilities[fold_number], neg_crossfold_probabilities[fold_number], sample) for sample in pos_folds[fold_number]]:
        if i == 'pos':
            pos_as_pos += 1
        elif i == 'neg':
            pos_as_neg += 1

    neg_as_pos = 0
    neg_as_neg = 0
    for i in [classify_crossfold(pos_crossfold_probabilities[fold_number], neg_crossfold_probabilities[fold_number], sample) for sample in neg_folds[fold_number]]:
        if i == 'pos':
            neg_as_pos += 1
        elif i == 'neg':
            neg_as_neg += 1
            
    print("For fold number: " + str(fold_number))
    print(" pos  neg\n{:4d} {:4d} pos\n{:4d} {:4d} neg".format(pos_as_pos, pos_as_neg, neg_as_pos, neg_as_neg))
    print("Accuracy: {}".format(float(pos_as_pos + neg_as_neg) / float(pos_as_pos + pos_as_neg + neg_as_neg + neg_as_pos)))
    print("\n")
    
[acc_and_confusion_for_fold(i) for i in range(0,10)]

For fold number: 0
 pos  neg
  70   30 pos
  21   79 neg
Accuracy: 0.745


For fold number: 1
 pos  neg
  50   50 pos
  14   86 neg
Accuracy: 0.68


For fold number: 2
 pos  neg
  38   62 pos
  10   90 neg
Accuracy: 0.64


For fold number: 3
 pos  neg
  47   53 pos
  16   84 neg
Accuracy: 0.655


For fold number: 4
 pos  neg
  48   52 pos
  18   82 neg
Accuracy: 0.65


For fold number: 5
 pos  neg
  42   58 pos
  20   80 neg
Accuracy: 0.61


For fold number: 6
 pos  neg
  46   54 pos
   7   93 neg
Accuracy: 0.695


For fold number: 7
 pos  neg
  46   54 pos
  15   85 neg
Accuracy: 0.655


For fold number: 8
 pos  neg
  43   57 pos
  19   81 neg
Accuracy: 0.62


For fold number: 9
 pos  neg
  47   53 pos
  11   89 neg
Accuracy: 0.68




[None, None, None, None, None, None, None, None, None, None]
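
As written, `get_crossfolds` followed by `itertools.chain` appears to yield the individual folds again, so each fold is classified with probabilities estimated from that same fold. For comparison, k-fold cross-validation usually estimates the probabilities on the other folds and classifies only the held-out one. A minimal sketch of that split follows; `held_out_splits` is a hypothetical helper, not part of the code above.

```python
def held_out_splits(folds):
    """Yield (train, test) pairs: one fold held out, all other folds flattened."""
    for i, test_fold in enumerate(folds):
        train = [item
                 for j, fold in enumerate(folds) if j != i
                 for item in fold]
        yield train, list(test_fold)


# Toy folds standing in for lists of instance vectors.
toy_folds = [[0, 1], [2, 3], [4, 5]]
splits = list(held_out_splits(toy_folds))
for train, test in splits:
    print(train, test)
```

Under this scheme, the `train` half from `pos_folds` and from `neg_folds` could each be passed to `get_probabilities`, and only the matching `test` instances would be classified with the result.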

(Part 4, Generator)

In [25]:
def generate_instance(polarity, random_number):
    instance = []
    if polarity == "pos":
        probs = pos_probabilities 
    else:
        probs = neg_probabilities  
    print(probs, random_number)
    for i,item in enumerate(probs):
        if random_number < item:
            instance.append(hotwords[i])
    return instance

from random import random
for polarity in ["pos", "neg"]:
    for i in range(0, 5):
        print(generate_instance(polarity, random() * 100))
    

([3.4, 28.0, 5.4, 2.5, 15.4, 48.5, 13.2], 60.19179438579083)
[]
([3.4, 28.0, 5.4, 2.5, 15.4, 48.5, 13.2], 97.77804699769412)
[]
([3.4, 28.0, 5.4, 2.5, 15.4, 48.5, 13.2], 26.357732538160093)
['bad', 'great']
([3.4, 28.0, 5.4, 2.5, 15.4, 48.5, 13.2], 16.56455714643361)
['bad', 'great']
([3.4, 28.0, 5.4, 2.5, 15.4, 48.5, 13.2], 95.38887198005234)
[]
([12.2, 54.5, 17.5, 10.1, 8.6, 32.0, 5.9], 49.8602127130661)
['bad']
([12.2, 54.5, 17.5, 10.1, 8.6, 32.0, 5.9], 17.282635757639852)
['bad', 'boring', 'great']
([12.2, 54.5, 17.5, 10.1, 8.6, 32.0, 5.9], 81.91954951438164)
[]
([12.2, 54.5, 17.5, 10.1, 8.6, 32.0, 5.9], 29.723791755111474)
['bad', 'great']
([12.2, 54.5, 17.5, 10.1, 8.6, 32.0, 5.9], 3.319642946796786)
['awful', 'bad', 'boring', 'dull', 'effective', 'great', 'hilarious']


In my experience with movie reviews, this is basically par for the course: most reviews are totally meaningless.
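
Note that `generate_instance` compares a single random number against every word's probability, so a low draw switches on all seven words at once and a high draw none. A variation that draws a fresh random number per word, in line with the naive Bayes independence assumption, might look like the sketch below. The function name `generate_instance_independent` and the seeded `rng` are assumptions made for this illustration.

```python
import random

hotwords = ['awful', 'bad', 'boring', 'dull', 'effective', 'great', 'hilarious']
# Percentages copied from the table in the first part.
pos_probabilities = [3.4, 28.0, 5.4, 2.5, 15.4, 48.5, 13.2]
neg_probabilities = [12.2, 54.5, 17.5, 10.1, 8.6, 32.0, 5.9]


def generate_instance_independent(polarity, rng):
    """Include each word with its own probability, using one draw per word."""
    probs = pos_probabilities if polarity == "pos" else neg_probabilities
    return [word for word, p in zip(hotwords, probs) if rng.random() * 100 < p]


rng = random.Random(0)  # seeded only so that reruns give the same samples
for polarity in ["pos", "neg"]:
    for _ in range(5):
        print(polarity, generate_instance_independent(polarity, rng))
```

Over many samples, each word should then appear at roughly its rate from the table, independently of the other words.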