<div class="alert alert-danger">
**Due date:** 2017-01-27
</div>

# Lab 1: Text Classification

**Students:** Alexander Stolpe (alest170), Fredrik Jonsén (frejo105)

## Introduction

Text classification is the task of sorting text documents into predefined classes. The concrete problem you will work on in this lab is classifying texts by political bloc affiliation (right-wing/left-wing). The texts you are going to classify are speeches given in the Swedish parliament. The classifier reads a speech and predicts whether the speaker belongs to a right-wing or a left-wing party.

In [1]:
import lab1

## Read in the data

The data used in this lab consists of all speeches held in the Swedish parliament in the 2014/2015 and 2015/2016 sessions. The raw data is taken from [Riksdag's open data](http://data.riksdagen.se/). Speeches are divided into two files in the [JSON format](https://en.wikipedia.org/wiki/JSON):
* `anforande-201415.json` with 12928 speeches
* `anforande-201516.json` with 13702 speeches

In order to read in the data files, we define a helper function. The function opens the given file and returns a list with speeches.

In [2]:
import json

def read_data(filename):
    with open(filename) as f:
        return json.load(f)

speeches_201415 = read_data("/home/729G17/labs/lab1/data/anforande-201415.json")
speeches_201516 = read_data("/home/729G17/labs/lab1/data/anforande-201516.json")

From a Python perspective, a speech is represented as a dictionary with three keys:

* `id`, a unique identifier for the speech (a string)
* `words`, a list with all the words in the speech (represented as strings)
* `class`, a string giving the correct class for the speech: either `'L'` (left) or `'R'` (right)

Here is an example:

In [3]:
sample = speeches_201516[42]
print(sample)

{'words': ['Fru', 'talman', 'Tack', 'Hans', 'Wallmark', 'för', 'frågan', 'Det', 'är', 'alldeles', 'riktigt', 'att', 'jag', 'reagerade', 'omedelbart', 'och', 'kommenterade', 'saken', 'på', 'det', 'enda', 'möjliga', 'sättet', 'nämligen', 'att', 'vi', 'beslutar', 'själva', 'om', 'vår', 'säkerhetspolitik', 'Vi', 'låter', 'oss', 'inte', 'påverkas', 'av', 'hotfulla', 'uttalanden', 'UD', 'kallade', 'också', 'upp', 'den', 'ryske', 'ambassadören', 'för', 'samtal', 'Nu', 'hör', 'det', 'till', 'saken', 'att', 'vi', 'inte', 'brukar', 'avslöja', 'innehållet', 'i', 'de', 'här', 'samtalen', 'och', 'det', 'är', 'ju', 'för', 'att', 'vi', 'ska', 'kunna', 'ställa', 'relevanta', 'frågor', 'i', 'det', 'här', 'fallet', 'till', 'ambassadören', 'och', 'förhoppningsvis', 'kunna', 'få', 'ett', 'svar', 'och', 'en', 'dialog', 'kring', 'de', 'frågor', 'som', 'är', 'viktiga', 'för', 'oss', 'Vår', 'säkerhetspolitiska', 'linje', 'som', 'Hans', 'Wallmark', 'väl', 'känner', 'till', 'ligger', 'fast', 'och', 'vår', 'själ
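
Because every speech is a plain dictionary, the standard library is enough to inspect a data set. The following is a minimal sketch on made-up toy speeches in the same format (the ids and word lists are invented for illustration and are not taken from the Riksdag files). It counts the speeches per class and computes the average speech length.

```python
from collections import Counter

# Toy speeches in the same format as the lab data (contents are invented).
toy_speeches = [
    {"id": "a1", "words": ["Fru", "talman", "Tack"], "class": "L"},
    {"id": "a2", "words": ["Herr", "talman"], "class": "R"},
    {"id": "a3", "words": ["Fru", "talman", "Vi", "beslutar"], "class": "L"},
]

# Number of speeches per class.
class_counts = Counter(speech["class"] for speech in toy_speeches)

# Average number of words per speech.
average_length = sum(len(s["words"]) for s in toy_speeches) / len(toy_speeches)

print(class_counts)
print(average_length)
```

The same two expressions should work unchanged on `speeches_201415` or `speeches_201516`, since they rely only on the `words` and `class` keys.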

## Train and evaluate a classifier

The next code cell creates a new Naive Bayes classifier and trains it on the speeches from the 2014/2015 session:

In [4]:
classifier1 = lab1.Classifier.train(speeches_201415)

You can use the trained classifier for predicting the class for a new speech:

In [5]:
classifier1.predict(sample)

'L'

Was it correct? Your first exercise is to evaluate the classifier with respect to accuracy, precision, and recall:

In [6]:
lab1.evaluate(classifier1, speeches_201415)

accuracy = 0.9517
class L: precision = 0.9585, recall = 0.9486, f1 = 0.9535
class R: precision = 0.9445, recall = 0.9552, f1 = 0.9498


You should get the same results as in the table below.

**Table 1: Train on the speeches from 2014/2015, evaluate on the speeches from 2014/2015**

<table class="table">
<thead>
<tr><th>total</th><th colspan="3" style="width: auto">L (left)</th><th colspan="3">R (right)</th></tr>
<tr><th>accuracy</th><th>precision</th><th>recall</th><th>F1</th><th>precision</th><th>recall</th><th>F1</th></tr>
</thead>
<tbody>
<tr><td>95.17%</td><td>95.85%</td><td>94.86%</td><td>95.35%</td><td>94.45%</td><td>95.52%</td><td>94.98%</td></tr>
</tbody>
</table>

<div class="panel panel-primary">
<div class="panel-heading">Problem 1</div>
<div class="panel-body">
Redo the evaluation on the speeches from 2015/2016. Report the results from the evaluation. How and why do the results differ from the first evaluation? Why is the second evaluation more meaningful? Explain using your knowledge of machine learning methodology.
</div>
</div>

In [7]:
# Room for your code for Problem 1
lab1.evaluate(classifier1, speeches_201516)

accuracy = 0.8197
class L: precision = 0.8209, recall = 0.8475, f1 = 0.8340
class R: precision = 0.8181, recall = 0.7877, f1 = 0.8027


Fill the table below with the results from your evaluation.

**Table 2: Train on the speeches from 2014/2015, evaluate on the speeches from 2015/2016**

<table class="table">
<thead>
<tr><th>total</th><th colspan="3">L (left)</th><th colspan="3">R (right)</th></tr>
<tr><th>accuracy</th><th>precision</th><th>recall</th><th>F1</th><th>precision</th><th>recall</th><th>F1</th></tr>
</thead>
<tbody>
<tr><td>81.97%</td><td>82.09%</td><td>84.75%</td><td>83.40%</td><td>81.81%</td><td>78.77%</td><td>80.27%</td></tr>
</tbody>
</table>

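The results are clearly lower in the second evaluation: accuracy drops from 95.17% to 81.97%, and precision, recall, and F1 for both classes fall by roughly 10 to 17 percentage points. They differ because the first evaluation used the same data set that was used to train the classifier, so it mostly measures how well the model fits speeches it has already seen.
The second evaluation is more meaningful because it evaluates against data that the classifier has never seen before, and therefore estimates how well the classifier generalizes to new speeches.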
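
The effect can be illustrated with an extreme case. The sketch below uses a hypothetical `MemorizingClassifier` (not part of `lab1`) that stores the class of every training document and falls back to a fixed guess for anything it has not seen. The toy documents and the helper `toy_accuracy` are invented for this illustration.

```python
class MemorizingClassifier:
    """Hypothetical classifier that memorizes its training documents."""

    def __init__(self, documents, fallback):
        # Map each training document's words to its gold-standard class.
        self.memory = {tuple(d["words"]): d["class"] for d in documents}
        self.fallback = fallback

    def predict(self, document):
        # Seen documents are looked up; unseen ones get the fallback class.
        return self.memory.get(tuple(document["words"]), self.fallback)


def toy_accuracy(classifier, documents):
    """Share of documents whose predicted class matches the gold class."""
    correct = sum(classifier.predict(d) == d["class"] for d in documents)
    return correct / len(documents)


train = [
    {"words": ["skatt", "sänka"], "class": "R"},
    {"words": ["välfärd", "bygga"], "class": "L"},
]
test = [
    {"words": ["skatt", "höja"], "class": "L"},
    {"words": ["företag", "stödja"], "class": "R"},
]

memorizer = MemorizingClassifier(train, fallback="L")
train_accuracy = toy_accuracy(memorizer, train)  # evaluated on seen data
test_accuracy = toy_accuracy(memorizer, test)    # evaluated on unseen data
print(train_accuracy, test_accuracy)
```

The memorizer is perfect on its own training data but no better than its fixed guess on new documents, which is why only the held-out score says something about generalization.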

<div class="alert alert-danger">
From now on you will always train on the speeches from 2014/2015 (training data) and test on the speeches from 2015/2016 (test data).
</div>

## Implement functions for evaluation

The function `lab1.evaluate()` you used above calls three functions from the module `lab1`:

* `accuracy()`, computes the classifier's accuracy
* `precision()`, computes precision
* `recall()`, computes recall
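
As a reminder of what the three measures mean, here is a small worked example on invented label lists (no classifier involved). For a target class, precision is the share of documents predicted as that class that really belong to it, and recall is the share of documents of that class that were found. The variable names are chosen for this sketch only.

```python
# Invented gold-standard and predicted labels for six documents.
gold      = ["L", "L", "L", "R", "R", "R"]
predicted = ["L", "L", "R", "L", "R", "R"]

target = "L"  # the class whose precision and recall we compute
pairs = list(zip(gold, predicted))

# Counts for the target class.
true_positives = sum(g == target and p == target for g, p in pairs)
false_positives = sum(g != target and p == target for g, p in pairs)
false_negatives = sum(g == target and p != target for g, p in pairs)

accuracy_value = sum(g == p for g, p in pairs) / len(pairs)
precision_value = true_positives / (true_positives + false_positives)
recall_value = true_positives / (true_positives + false_negatives)

print(accuracy_value, precision_value, recall_value)
```

In this example four of six predictions are correct, and class `'L'` has two true positives, one false positive, and one false negative, so all three measures come out at two thirds.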

Your next exercise is to do your own implementation of these three functions.

<div class="panel panel-primary">
<div class="panel-heading">Problem 2</div>
<div class="panel-body">
Do your own implementation of the evaluation functions. Test your implementation by redoing the evaluation from the previous exercise with your new functions. You should get the same results as earlier.
</div>
</div>

Write the code for the three functions in the cells below. The only method of the classifier you will have to call in your implementation is `predict()`. (Look above for an example of how to use it.)

**Accuracy.** This function takes a classifier (`classifier`) and a list of documents (`documents`) and should return the classifier's accuracy on the data set as a floating-point number between 0 and 1. If the measure is not defined it should return `float('NaN')` instead.

In [8]:
def accuracy(classifier, documents):
    """Compute the accuracy of a classifier on a list of gold-standard documents."""
    correct = 0
    for document in documents:
        result = classifier.predict(document)
        if result == document['class']: correct += 1 
    return float('NaN') if len(documents) == 0 else correct / len(documents)

**Precision.** This function takes a classifier (`classifier`), a class `c`, and a list of documents (`documents`) and should compute the classifier's precision on the task of finding documents with class `c` as a floating-point number between 0 and 1. If the measure is not defined it should return `float('NaN')` instead.

In [17]:
def precision(classifier, c, documents):
    """Compute the class-specific precision of a classifier on a list of gold-standard documents."""
    total_class_predictions = 0
    correct_predictions = 0
    for document in documents:
        guess = classifier.predict(document)
        if guess != c: continue
        total_class_predictions += 1
        if document['class'] == guess:
            correct_predictions += 1
    return float('NaN') if total_class_predictions == 0 else correct_predictions / total_class_predictions

**Recall.** This function should do the same as the previous one, but instead compute the recall.

In [21]:
def recall(classifier, c, documents):
    """Compute the class-specific recall of a classifier on a list of gold-standard documents."""
    # Only documents whose gold-standard class is c count towards recall.
    correct_predictions = 0
    total_of_class = 0
    for document in documents:
        if document['class'] != c: continue
        total_of_class += 1
        result = classifier.predict(document)
        if result == document['class']: correct_predictions += 1
    
    return float('NaN') if total_of_class == 0 else correct_predictions / total_of_class

Use this version of `evaluate()` in order to test your implementation. Note that you will have to change the code below to compute the F1-score.
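
For reference, the F1-score is the harmonic mean of precision $P$ and recall $R$, that is $F_1 = 2PR / (P + R)$. It is undefined when either value is undefined or when both are zero. A minimal sketch, using a hypothetical helper name `f1_score` and the class L figures reported earlier as a sanity check:

```python
from math import isnan

def f1_score(p, r):
    """Harmonic mean of precision p and recall r; NaN when undefined."""
    if isnan(p) or isnan(r) or p + r == 0:
        return float("nan")
    return 2 * p * r / (p + r)

# Precision and recall for class L on the 2015/2016 speeches (from the lab output).
print("%.4f" % f1_score(0.8209, 0.8475))
```

The printed value matches the `f1 = 0.8340` reported by `lab1.evaluate()` for class L.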

In [23]:
def our_evaluate(classifier, data):
    from math import isnan
    print("accuracy = %.4f" % accuracy(classifier, data))

    for c in sorted(classifier.classes):
        p = precision(classifier, c, data)
        r = recall(classifier, c, data)
        # F1 is the harmonic mean of precision and recall; it is undefined
        # when either value is NaN or when both are zero.
        f = float('NaN')
        if not isnan(p) and not isnan(r) and p + r > 0:
            f = (2*p*r) / (p+r)
        print("class %s: precision = %.4f, recall = %.4f, f1 = %.4f" % (c, p, r, f))

In [24]:
our_evaluate(classifier1, speeches_201516)

accuracy = 0.8197
class L: precision = 0.8209, recall = 0.8475, f1 = 0.8340
class R: precision = 0.8181, recall = 0.7877, f1 = 0.8027


## Create a baseline

A **baseline** is a reference against which other methods are compared. A simple baseline for text classification is *Most Frequent Class*. This method always predicts the class that appears most frequently in the training documents, regardless of which words appear in the document being classified. We would hope that a Naive Bayes classifier gives better results than this simple baseline.

<div class="panel panel-primary">
<div class="panel-heading">Problem 3</div>
<div class="panel-body">
Implement the Most Frequent Class baseline. Start with finding out which class appears most frequently in the training data. Then implement a class `MostFrequentClassClassifier` which always predicts this class. Evaluate this new classifier in the same way as you evaluated the others above.
</div>
</div>

In [46]:
# Use this cell for code that determines the class that appears most frequently in the training data
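def most_frequent_class(documents):
    """Return 'L' or 'R', whichever is more frequent in documents; None on a tie."""
    r_count = 0
    l_count = 0
    for doc in documents:
        doc_class = doc['class']
        if doc_class == 'R': r_count += 1
        elif doc_class == 'L': l_count += 1
    # A tie (including an empty list) has no most frequent class.
    if r_count == l_count: return None
    return 'R' if r_count > l_count else 'L'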
print(most_frequent_class(speeches_201415))

L


In [47]:
class MostFrequentClassClassifier(lab1.Classifier):
    most_frequent = most_frequent_class(speeches_201415)
    # Always predict the most frequent training class, ignoring the document.
    def predict(self, document):
        """Predict the class of the specified document."""
        return self.most_frequent

Note that you do not need to implement the training; this functionality is gained automatically by inheriting from `lab1.Classifier`.

In [48]:
classifier2 = MostFrequentClassClassifier.train(speeches_201415)
lab1.evaluate(classifier2, speeches_201516)

accuracy = 0.5344
class L: precision = 0.5344, recall = 1.0000, f1 = 0.6966
class R: precision = nan, recall = 0.0000, f1 = nan


**Table 3: Baseline results (Most Frequent Class) for the speeches from 2015/2016**

<table class="table">
<thead>
<tr><th>total</th><th colspan="3">L (left)</th><th colspan="3">R (right)</th></tr>
<tr><th>accuracy</th><th>precision</th><th>recall</th><th>F1</th><th>precision</th><th>recall</th><th>F1</th></tr>
</thead>
<tbody>
<tr><td>53.44%</td><td>53.44%</td><td>100%</td><td>69.66%</td><td>NaN</td><td>0%</td><td>NaN</td></tr>
</tbody>
</table>
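
The pattern in the baseline table follows directly from the definitions: a classifier that always predicts one class has an accuracy equal to that class's share of the test data and a recall of 1 for it, while precision is undefined (0/0) for the class it never predicts. A toy sketch with invented labels:

```python
from collections import Counter

# Invented class labels; in the lab these come from the speech dictionaries.
train_labels = ["L", "L", "L", "R", "R"]
test_labels = ["L", "R", "L", "R", "R", "L", "L"]

# The baseline's single prediction: the most frequent training class.
majority = Counter(train_labels).most_common(1)[0][0]
predictions = [majority] * len(test_labels)

# Accuracy equals the share of the majority class in the test labels.
baseline_accuracy = sum(
    g == p for g, p in zip(test_labels, predictions)
) / len(test_labels)

# The other class is never predicted, so its precision would be 0/0.
predicted_r = predictions.count("R")
print(majority, baseline_accuracy, predicted_r)
```

Here four of the seven test labels are `'L'`, so the baseline's accuracy is 4/7, in the same way that 53.44% reflects the share of left-wing speeches in the 2015/2016 data.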

## Implement the predict function

In the last part of this lab you will implement the classification rule for a Naive Bayes classifier.

### How a classifier is represented in Python

Remember that the core of a Naive Bayes classifier is a probabilistic model with four components: a set of possible classes, $C$, a vocabulary, $V$, a number of class probabilities, $P(c)$ and a number of word probabilities, $P(w \mid c)$. In order to implement the classification rule, you will need to know how these components are represented in Python. Each one of those exists as an attribute of the class `lab1.Classifier`:

**The set of possible classes.** This set is represented as a set of strings. Example:

In [49]:
# classes: Set[str]
print(sorted(classifier1.classes))

['L', 'R']


**The vocabulary.** The vocabulary is also represented as a set of strings. The following cell prints 20 words in the vocabulary for `classifier1`:

In [50]:
# vocabulary: Set[str]
print(sorted(classifier1.vocabulary)[20000:20020])

['begåvningar', 'behagat', 'behagligt', 'behandla', 'behandlad', 'behandlade', 'behandlades', 'behandlande', 'behandlar', 'behandlas', 'behandlat', 'behandlats', 'behandling', 'behandlingar', 'behandlingarna', 'behandlingen', 'behandlingsansvaret', 'behandlingsbehov', 'behandlingsform', 'behandlingsformer']


**Class probabilities.** For every possible class $c \in C$ there is a probability $P(c)$ that a given document has class $c$. These class probabilities are encoded as a dictionary, mapping classes (strings) to log probabilities (floating point numbers). The following code returns the class probability for the class `'L'` (left):

In [51]:
# pc: Mapping[str, float]
print(classifier1.pc['L'])

-0.6502975401334512
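
To read such a value as an ordinary probability, exponentiate it. The sketch below assumes that `lab1` stores natural logarithms, which the lab text does not state explicitly.

```python
from math import exp

log_p_left = -0.6502975401334512  # value printed above for class 'L'

# Assumption: natural logarithm, so exp() inverts it.
p_left = exp(log_p_left)
print(round(p_left, 3))
```

Under that assumption the result is a little over one half, which would be consistent with `'L'` being the more frequent class in the 2014/2015 training data.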


**Word probabilities.** For every possible class $c \in C$ and every word&nbsp;$w$ in the vocabulary, there is a probability $P(w \mid c)$ for $w$ occurring in a document with class&nbsp;$c$. These word probabilities are encoded as a nested dictionary, mapping classes (strings) to class-specific dictionaries mapping words (strings) to log probabilities (floating-point numbers). The following code returns the word probabilities for the word *behandlingsbehov* for the two possible classes.

In [52]:
# pw: Mapping[str, Mapping[str, float]]
print(classifier1.pw['L']['behandlingsbehov'])
print(classifier1.pw['R']['behandlingsbehov'])

-13.849960775761835
-14.526787889580016


<div class="alert alert-danger">
Note that the implementation uses log probabilities!
</div>
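
One common reason for working with log probabilities is numerical: multiplying many small probabilities quickly underflows to zero in floating-point arithmetic, whereas adding their logarithms does not. A small sketch with invented probabilities (it uses `math.prod`, available from Python 3.8):

```python
import math

# One hundred invented word probabilities, each 1e-5.
word_probabilities = [1e-5] * 100

# Multiplying them directly underflows to exactly 0.0 in double precision.
product = math.prod(word_probabilities)

# Summing their logarithms keeps the value representable.
log_sum = sum(math.log(p) for p in word_probabilities)

print(product)
print(log_sum)
```

With the product at zero for every class, no comparison between classes would be possible; the log sums remain finite and comparable.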

### The classification rule

Remember that the predicted class $\hat{c}$ for a document $d$ is given by the equation

$$
\hat{c} = \mathop{\text{arg max}}_{c \in C} P(c) \cdot \prod_{w \in V} P(w\mid c)^{f(w)}
$$

where $f(w)$ denotes the number of occurrences of word $w$ in document $d$. Observe that this equation uses normal probabilities; in order to implement it you will need to translate it to log probabilities.
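
As a sketch of that translation: taking the logarithm of the right-hand side turns the product into a sum and each exponent into a factor. Because the logarithm is strictly increasing, the maximizing class stays the same:

$$
\hat{c} = \mathop{\text{arg max}}_{c \in C} \left( \log P(c) + \sum_{w \in V} f(w) \cdot \log P(w \mid c) \right)
$$

Words with $f(w) = 0$ contribute nothing to the sum, so it is enough to add $\log P(w \mid c)$ once for every occurrence in $d$ of a word that is in $V$.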

<div class="panel panel-primary">
<div class="panel-heading">Problem 4</div>
<div class="panel-body">
Do your own implementation of the method `predict()`. Test your implementation by redoing the evaluation from Problem&nbsp;1 with the new implementation. You should get the same results as previously.
</div>
</div>

In [80]:
class OurClassifier(lab1.Classifier):
    
    def predict(self, document):
        """Predict the class of the specified document."""
        best_class, best_score = '', float('-inf')
        for c in self.classes:
            # Work in log space: log P(c) plus the sum of log P(w|c).
            score = self.pc[c]
            for word in document['words']:
                # Words without a stored probability for c are skipped.
                if word in self.pw[c]:
                    score += self.pw[c][word]
            if score > best_score:
                best_class, best_score = c, score
        return best_class

Just as before, you do not need to implement the training procedure; this functionality is automatically derived from `lab1.Classifier`.

In [81]:
import time
classifier3 = OurClassifier.train(speeches_201415)
start_time = time.time()
lab1.evaluate(classifier3, speeches_201516)
print("--- %s seconds ---" % (time.time() - start_time))

accuracy = 0.8197
class L: precision = 0.8209, recall = 0.8475, f1 = 0.8340
class R: precision = 0.8181, recall = 0.7877, f1 = 0.8027
--- 8.837891817092896 seconds ---
