# DL4NLP SS17 Home Exercise 07
----------------------------------
**Due Tuesday, 06.06., at 13:00**

## Task 1 Text Classification on 20 Newsgroups (10P)

This exercise uses the 20 Newsgroups dataset, a collection of e-mails from different newsgroups. The goal is to assign each e-mail to its corresponding newsgroup.

**Warning: There is not much code to write in this exercise, but training CNNs is computationally intensive! Plan for extended training times.**


#### Hints on the Submission Format
* Please submit your Python code for all tasks where it is applicable. Make sure to include comments explaining complicated or non-obvious sections of your code.
* Please also submit a copy of the console output of your code execution. Your code might run in 10 minutes on your watercooled battlestation, but it might not run in 10 minutes for the person who corrects your home exercises. Thank you!
    * If your code didn't finish running, please be honest and add a short statement saying so. Thank you again!

### Task 1.0 Creating Data Splits (0P)
The following code generates train, dev and test sets for the 20 Newsgroups dataset (see [hex07_data.zip](https://moodle.informatik.tu-darmstadt.de/pluginfile.php/55190/mod_assign/introattachment/0/hex07_data.zip?forcedownload=1) in Moodle). You might need to modify the `input_file` variable according to your setup.
After execution, `train_y` is a list of 20-component one-hot vectors representing newsgroups, and `train_x` is a list of 300-component vectors in which each entry is a word ID. Each 300-component vector represents one e-mail.
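To make the data layout concrete, the sketch below parses one line in the format that `read_data` expects (a space-separated one-hot label, a tab, then space-separated word IDs) and pads it the way `pad_sequences` does with its default settings (zeros on the left, truncation from the front). The sample line and the helper `pad_left` are made up for illustration, and a pad length of 6 stands in for the 300 used in the exercise.

```python
# Hypothetical sample line: a 3-class one-hot label, a tab, then word IDs.
sample_line = "0 1 0\t12 7 431 9\n"


def pad_left(ids, maxlen, value=0):
    """Mimic the default pad_sequences behavior for a single sequence:
    keep the last `maxlen` IDs and fill missing positions on the left."""
    ids = ids[-maxlen:]
    return [value] * (maxlen - len(ids)) + ids


label_part, content_part = sample_line.split("\t")
label = tuple(int(v) for v in label_part.split())
word_ids = [int(v) for v in content_part.split()]

padded = pad_left(word_ids, maxlen=6)
print(label)   # (0, 1, 0)
print(padded)  # [0, 0, 12, 7, 431, 9]
```

Because padding uses the value 0, word ID 0 is effectively reserved for "no word", which is why `read_data` starts its vocabulary set with `{0}`.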

In [None]:
from keras.preprocessing.sequence import pad_sequences
import random
import os

input_file = 'data.txt'
tmp_dir = '/tmp'
train_verbose = 1
pad_length = 300

def read_data(input_file):
    vocab = {0}
    data_x = []
    data_y = []
    with open(input_file) as f:
        for line in f:
            label, content = line.split('\t')
            content = [int(v) for v in content.split()]
            vocab.update(content)
            data_x.append(content)
            label = tuple(int(v) for v in label.split())
            data_y.append(label)

    data_x = pad_sequences(data_x, maxlen=pad_length)
    return list(zip(data_y , data_x)), vocab

data, vocab = read_data(input_file)
# The Embedding layer needs input_dim = largest word ID + 1, because IDs start at 0
vocab_size = max(vocab) + 1
random.seed(42)
random.shuffle(data)
input_len = len(data)
train_y, train_x = zip(*(data[:(input_len * 8) // 10]))
dev_y, dev_x = zip(*(data[(input_len * 8) // 10: (input_len * 9) // 10]))
test_y, test_x = zip(*(data[(input_len * 9) // 10:]))

### Task 1.1 - Dataset inspection (1 P)
Complete the function below for a basic inspection of the dataset. It should return a tuple containing the majority baseline accuracy (the accuracy of a classifier that always predicts the most frequent class) and the number of classes.
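As an illustration of the concept, the following sketch computes a majority baseline on a tiny, made-up set of one-hot labels. The array `toy_labels` and the `toy_*` names are hypothetical and not part of the exercise code; note also that `train_y` is still a tuple of tuples at this point in the notebook, so it would need converting (for example with `np.array`) before array operations like these apply.

```python
import numpy as np

# Made-up one-hot labels: 5 samples, 3 classes, class 2 is the most frequent.
toy_labels = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
])

class_counts = toy_labels.sum(axis=0)        # number of samples per class
toy_num_classes = toy_labels.shape[1]        # one column per class
toy_baseline = class_counts.max() / len(toy_labels)

print(toy_baseline, toy_num_classes)
```

Here the most frequent class covers 3 of 5 samples, so always predicting it would be correct 60% of the time.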

In [None]:
def inspect_dataset(labels):
    return 0, 0

baseline, num_classes = inspect_dataset(train_y)
print('Majority baseline accuracy is %f over %i classes' % (baseline, num_classes))

### Task 1.2 - A basic CNN (2 P)
Build a basic CNN by completing the code below that contains

* A convolutional layer with 75 filters and kernel size 3, using a ReLU activation function
* A (global) max pooling layer
* A softmax output layer

How do you rate your model's performance with respect to the established baseline?
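To see what the first two layers do to the shape of the data, here is a NumPy-only sketch of a 1D convolution followed by ReLU and global max pooling for a single e-mail. It is not the Keras implementation asked for above: the small dimensions are stand-ins for `pad_length`, `embedding_dims`, `filters` and `kernel_size`, the weights are random, and it assumes no padding at the sequence borders.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, emb_dim = 10, 4     # stand-ins for pad_length=300, embedding_dims=100
n_filters, k = 5, 3          # stand-ins for filters=75, kernel_size=3

x = rng.normal(size=(seq_len, emb_dim))        # one embedded e-mail
w = rng.normal(size=(k, emb_dim, n_filters))   # random convolution kernels
b = np.zeros(n_filters)

# Slide a window of k consecutive word vectors over the sequence (no padding).
windows = np.stack([x[i:i + k] for i in range(seq_len - k + 1)])  # (8, 3, 4)
conv = np.tensordot(windows, w, axes=([1, 2], [0, 1])) + b        # (8, 5)
relu = np.maximum(conv, 0.0)

# Global max pooling keeps only the strongest response of each filter.
pooled = relu.max(axis=0)                                         # (5,)

print(conv.shape, pooled.shape)  # (8, 5) (5,)
```

With the exercise's settings the same reasoning gives 298 window positions with 75 filter responses each, reduced by pooling to one 75-component vector per e-mail that feeds the softmax layer.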

In [None]:
from keras.models import Sequential, Model
from keras.layers import *
from keras.callbacks import EarlyStopping, ModelCheckpoint

import numpy as np
train_x, train_y = np.array(train_x), np.array(train_y)
dev_x, dev_y = np.array(dev_x), np.array(dev_y)
test_x, test_y = np.array(test_x), np.array(test_y)

# Leave these unmodified; if a task asks for different values, override them locally in that task
batch_size = 64
embedding_dims = 100
epochs = 2
filters = 75
kernel_size = 3

In [None]:
model = Sequential()
model.add(Embedding(vocab_size, embedding_dims, input_length=pad_length))

#####
# Your implementation here

#####

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_x, train_y, batch_size=batch_size, epochs=epochs, verbose=train_verbose)
print('Accuracy of simple CNN: %f' % model.evaluate(dev_x, dev_y, verbose=0)[1])
print()

### Task 1.3 - Early Stopping (2 P)
Based on the basic CNN from Task 1.2, create a new model that uses early stopping to determine the optimal number of epochs. This can be done by initially setting a high number of epochs (say 50), recording the best result achieved on the dev set during training, and stopping early if there is no improvement after a set number of epochs (say 4).

To implement this in Keras, callbacks can be supplied to the training process. In particular, have a look at [ModelCheckpoint](https://keras.io/callbacks/#modelcheckpoint) and [EarlyStopping](https://keras.io/callbacks/#earlystopping).

Train the model as described above and report its results on the dev and test sets. Don't forget to load the best weights after training.
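The stopping rule described above can be sketched in plain Python, independent of Keras. The function `early_stopping_epoch` and the dev accuracies below are made up for illustration; in the actual task the two callbacks take over this bookkeeping.

```python
def early_stopping_epoch(dev_scores, patience):
    """Return (best_epoch, stop_epoch), both 0-based, for a list of
    per-epoch dev scores where higher is better."""
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(dev_scores):
        if score > best_score:
            # New best result: this is where a checkpoint would be saved.
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop training.
            return best_epoch, epoch
    return best_epoch, len(dev_scores) - 1


# Made-up dev accuracies, for illustration only.
scores = [0.40, 0.55, 0.61, 0.60, 0.59, 0.61, 0.58, 0.70]
print(early_stopping_epoch(scores, patience=4))  # (2, 6)
```

Training stops at epoch 6 although a later epoch would have scored higher, and the weights in memory at that point are those of epoch 6, not of the best epoch 2. This is why the best weights have to be reloaded from the checkpoint before evaluating.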

### Task 1.4 - Testing CNN hyperparameters (3.5 P)
Based on the basic CNN from Task 1.2, create additional models to test the effect of the following hyperparameters/modifications:

1. Number of filters
1. Kernel size
1. Dropout
1. Stride size
1. Pooling strategies
1. Number of Convolutional Layers, with different kernel sizes in each (1P for this hyperparameter, 0.5P for every other)

For each of these, report 3 tested configurations on the dev set. At the end, provide a short summary of the gained insights.
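When varying kernel size and stride, it helps to know how many positions the convolution produces before pooling, especially when stacking several convolutional layers. The helper below is a hypothetical illustration of the standard formula for a 1D convolution without padding; it is not part of the exercise code.

```python
def conv_output_length(input_length, kernel_size, stride=1):
    """Number of positions a 1D convolution produces without padding."""
    if kernel_size > input_length:
        return 0
    return (input_length - kernel_size) // stride + 1


# Output lengths for an input of 300 word positions (the exercise's pad_length).
for kernel_size, stride in [(3, 1), (5, 1), (3, 2), (7, 3)]:
    print(kernel_size, stride, conv_output_length(300, kernel_size, stride))
```

For example, kernel size 3 with stride 1 yields 298 positions, while stride 2 roughly halves that to 149.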

### Task 1.5 - Freestyle (1.5 P)
The task is simple: Construct a model of your choice to achieve the best performance. You may modify every parameter to your liking (except `pad_length` from the given code).

Report your 3 best identified configurations. Hand in your best found configuration in code and report its performance on the test set.

*Can you beat an accuracy of 85%?*