# DATA20001 Deep Learning - Group Project
## Text project

**Due Thursday, May 22, before 23:59.**

The task is to learn to assign the correct labels to news articles. The corpus contains ~850K articles from Reuters. The test set is about 10% of the articles. The data is provided as raw XML files, so you have to extract the text and labels yourself.

We're only giving you the code for downloading the data, and how to save the final model. The rest you'll have to do yourselves.

Some comments and hints particular to the project:

- One document may belong to many classes in this problem, i.e., it's a multi-label classification problem. In fact there are documents that don't belong to any class, and you should also be able to handle these correctly. Pay careful attention to how you design the outputs of the network (e.g., what activation to use) and what loss function should be used.
- You may use word embeddings to get better results. For example, you were already using a smaller version of the GloVe embeddings in exercise 4. Do note that these embeddings take a lot of memory.
- In the exercises we used e.g. `torchvision.datasets.MNIST` to handle loading the data in suitable batches. Here, you need to handle the data loading yourself. The easiest way is probably to create a custom `Dataset`. [See for example here for a tutorial](https://github.com/utkuozbulak/pytorch-custom-dataset-examples).
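
To make the first hint concrete, here is a minimal pure-Python sketch of why independent sigmoid outputs with a binary cross-entropy loss fit a multi-label problem better than a softmax. The logits are made-up numbers and the helper names are illustrative, not part of the course code.

```python
import math


def sigmoid(z):
    """Squash one logit into an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Turn logits into probabilities that compete and sum to 1."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def binary_cross_entropy(targets, probs, eps=1e-7):
    """Average per-label binary cross-entropy for one document."""
    losses = []
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# Hypothetical logits for three labels; the first two labels both apply.
logits = [3.0, 2.5, -4.0]
sigmoid_probs = [sigmoid(z) for z in logits]
softmax_probs = softmax(logits)

# With sigmoids, several labels can exceed the threshold at once.
predicted = [int(p >= 0.5) for p in sigmoid_probs]

# A document with no label at all is simply "every probability is low".
no_label_pred = [int(sigmoid(z) >= 0.5) for z in [-3.0, -2.0, -5.0]]

print(predicted, no_label_pred)
print([round(p, 3) for p in softmax_probs])
```

Because softmax probabilities always sum to 1, a softmax output can represent neither "no label" nor "two labels, both certain", whereas thresholded sigmoids can represent both.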

In [1]:
import pickle
import warnings

warnings.filterwarnings("ignore")

import numpy as np
import tensorflow as tf
from gensim.models import Word2Vec
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.datasets import reuters
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Dense, Dropout, Activation, Embedding, Conv1D, \
        GlobalMaxPooling1D, SpatialDropout1D, LSTM, GRU, Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import data
import preprocessing

seed = 42

The data files are downloaded and extracted into the `train` subdirectory.

The files can be found in `train/`, and are named as `19970405.zip`, etc. You will have to manage the content of these zips to get the data. There is a readme which has links to further descriptions on the data.
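
One way to read the articles without unpacking every archive to disk is sketched below with the standard `zipfile` module. The helper name `iter_zip_members` and the member file names are hypothetical, and the example builds a tiny in-memory archive so that it runs without the real data.

```python
import io
import zipfile


def iter_zip_members(zip_file, suffix=".xml"):
    """Yield (member_name, raw_bytes) for each matching file in a zip archive."""
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            if name.endswith(suffix):
                yield name, archive.read(name)


# Build a small in-memory archive that stands in for a file like train/19970405.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example1.xml", "<newsitem><headline>A</headline></newsitem>")
    archive.writestr("example2.xml", "<newsitem><headline>B</headline></newsitem>")
    archive.writestr("notes.txt", "not an article")
buffer.seek(0)

members = list(iter_zip_members(buffer))
for name, raw in members:
    print(name, len(raw))
```

With the real data, a path to one of the zip files in `train/` can be passed in place of the in-memory buffer.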

The class labels, or topics, can be found in the archive `train/codes.zip`. The zip contains a file called "topic_codes.txt". This file lists the special codes for the topics (about 130 of them) together with an explanation of what each code means.
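
The sketch below shows one way to turn such a file into a code-to-description mapping and a fixed label order. The file layout is an assumption (one tab-separated `code` and `description` per line, with comment lines starting with `;`), so check it against the real "topic_codes.txt"; the `X01` entry is a made-up placeholder.

```python
def parse_topic_codes(text):
    """Map each topic code to its description, keeping the file order."""
    codes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):  # assumed comment marker
            continue
        code, _, description = line.partition("\t")  # assumed separator
        codes[code.strip()] = description.strip()
    return codes


# Hypothetical file content; only the C18 entry is taken from the project text.
sample = ";; example header comment\nC18\tOWNERSHIP CHANGES\nX01\tEXAMPLE TOPIC\n"

codes = parse_topic_codes(sample)
# Column position of each label in the output vector, in file order.
label_index = {code: i for i, code in enumerate(codes)}
print(codes)
print(label_index)
```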

The XML document files contain the article's headline, the main body text, and the list of topic labels assigned to each article. You will have to extract the topics of each article from the XML. For example, `<code code="C18">` refers to the topic "OWNERSHIP CHANGES" (like a corporate buyout).

You should pre-process the XML to extract the words from the article: the `<headline>` element and the `<text>` element. You should not need any other parts of the article.
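
As a rough illustration of this extraction step, the sketch below parses one article with `xml.etree.ElementTree` and builds a multi-hot label vector. The sample document is a simplified guess at the structure (the real files may nest or name elements differently), and `label_index` and its `X01`/`X02` codes are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up article; inspect a real file before relying on this layout.
SAMPLE_XML = """<newsitem>
  <headline>Example headline</headline>
  <text>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </text>
  <metadata>
    <codes>
      <code code="C18"/>
      <code code="NOT_A_TOPIC"/>
    </codes>
  </metadata>
</newsitem>"""


def extract_article(xml_string, label_index):
    """Return (headline + body text, multi-hot topic vector) for one article."""
    root = ET.fromstring(xml_string)
    headline = (root.findtext("headline") or "").strip()

    text_node = root.find("text")
    paragraphs = []
    if text_node is not None:
        paragraphs = ["".join(p.itertext()).strip() for p in text_node.iter("p")]
    body = " ".join(p for p in paragraphs if p)

    # Keep only codes that are known topics; other codes are ignored.
    multi_hot = [0] * len(label_index)
    for code_node in root.iter("code"):
        code = code_node.get("code")
        if code in label_index:
            multi_hot[label_index[code]] = 1
    return (headline + " " + body).strip(), multi_hot


label_index = {"X01": 0, "C18": 1, "X02": 2}  # hypothetical label order
text, multi_hot = extract_article(SAMPLE_XML, label_index)
print(text)
print(multi_hot)
```

An article without any known topic code yields an all-zero vector, which is exactly the "no class" case mentioned in the hints.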

## Extracting the data

In [2]:
# data.extract_data(extraction_dir="train", data_dir="data", data_zip_name="reuters-training-corpus.zip")

try:
    with open("train/docs.pkl", "rb") as f:
        docs = pickle.load(f)
    labels = np.load("train/labels.npy")
except FileNotFoundError:  # no cache yet, so parse the raw corpus
    docs, labels = data.get_docs_labels("train/REUTERS_CORPUS_2")
    with open("train/docs.pkl", "wb") as f:
        pickle.dump(docs, f)
    np.save("train/labels.npy", labels)

n_labels = len(labels[0])

print(len(docs))
print(labels.shape)

print(docs[-2])
print(labels[-2])

299773
(299773, 126)
India must allow free imports and exports of gold and liberalise trade in the yellow metal to boost industrial and employment growth, senior government officials said on Saturday. "The major objective of the new gold policy should perhaps be to recognise the importance of gold in the Indian economic system...," Y.V. Reddy, deputy governor of the Reserve Bank of India (RBI), told a gold seminar in the Indian capital. "...(The policy) should enable gold to play a transparent and positive role in the industrial development, employment and export sectors of the economy," Reddy said. Reddy said India should allow free imports and exports of gold and give up its present system of trade through designated agencies. Currently state-run State Bank of India and MMTC Ltd are the only entities permitted to import gold. "Free import under open general licence and free exports are pre-conditions for capturing world markets," Reddy said. He said most South Asian nations had liber

## Preprocessing the data

In [3]:
try:
    with open("train/preprocessed_docs.pkl", "rb") as f:
        preprocessed_docs = pickle.load(f)
except FileNotFoundError:  # no cache yet, so run the preprocessing
    preprocessed_docs = preprocessing.preprocess_corpus(docs)
    with open("train/preprocessed_docs.pkl", "wb") as f:
        pickle.dump(preprocessed_docs, f)

preprocessed_docs = [s.split() for s in preprocessed_docs]

print(" ".join(preprocessed_docs[-2]))

india must allow free import and export of gold and liberalise trade in the yellow metal to boost industrial and employment growth senior government official say on saturday the major objective of the new gold policy should perhaps be to recognise the importance of gold in the indian economic system y.v. reddy deputy governor of the reserve bank of india rbi tell a gold seminar in the indian capital the policy should enable gold to play a transparent and positive role in the industrial development employment and export sector of the economy reddy say reddy say india should allow free import and export of gold and give up -pron- present system of trade through designate agency currently state run state bank of india and mmtc ltd be the only entity permit to import gold free import under open general licence and free export be pre condition for capture world market reddy say -pron- say most south asian nation have liberalise gold import charge nominal duty this development make -pron- im

In [4]:
n_embedding = 100

try:
    w2v_model = Word2Vec.load("train/w2v.model")
except FileNotFoundError:  # no saved model yet, so train the embeddings
    w2v_model = Word2Vec(
        sentences=preprocessed_docs,
        size=n_embedding, window=5,
        workers=10,
        min_count=1
    )
    w2v_model.save("train/w2v.model")

print(len(list(w2v_model.wv.vocab)))

300821


In [5]:
n_sequence = max([len(doc) for doc in preprocessed_docs])

tokenizer = Tokenizer(filters="")
tokenizer.fit_on_texts(preprocessed_docs)
word_idx = tokenizer.word_index
n_vocabulary = len(word_idx) + 1

sequences = tokenizer.texts_to_sequences(preprocessed_docs)
sequences = pad_sequences(sequences, maxlen=n_sequence, padding="post")

print(n_sequence)
print(n_vocabulary)
print(" ".join(preprocessed_docs[1]))
print(sequences[1])

# n_docs = 50000
# n_vocabulary = 5000

# x_train = preprocessed_docs[:n_docs]
# y_train = labels[:n_docs]
#
# # x_train = preprocessed_docs
# # y_train = labels

# tokenizer = Tokenizer(filters="", num_words=n_vocabulary)
# tokenizer.fit_on_texts(x_train)
# # n_vocabulary = len(tokenizer.word_index) + 1
# x_train = tokenizer.texts_to_sequences(x_train)
# n_max_sequence = len(max(x_train, key=len))
# x_train = pad_sequences(x_train, maxlen=n_max_sequence, padding="post")
# x_train = np.array(x_train)

# print(n_vocabulary)
# print(n_max_sequence)
# print(x_train.shape)
# print(preprocessed_docs[1])
# print(x_train[1])

# x_train, x_test, y_train, y_test = train_test_split(x_train, y_train, test_size=.1, random_state=seed)

9271
300822
decision of the eea joint committee no -num- of -num- october -num- amend protocol -num- to the eea agreement on cooperation in specific field outside the -num- freedom decision of the eea joint committee no -num- of -num- january -num- amend annex ii technical regulation standard testing and certification to the eea agreement decision of the eea joint committee no -num- of -num- february -num- amend annex vi social security to the eea agreement end of document
[347   5   2 ...   0   0   0]


In [6]:
try:
    embedding_matrix = np.load("train/embedding_matrix.npy")
except FileNotFoundError:
    # Rows stay zero for the padding index and for tokens without a vector.
    embedding_matrix = np.zeros((n_vocabulary, n_embedding))
    for token, i in word_idx.items():
        if token in w2v_model.wv:
            embedding_matrix[i] = w2v_model.wv[token]
    np.save("train/embedding_matrix.npy", embedding_matrix)

print(embedding_matrix.shape)

# embedding_idx = {}
# for doc in preprocessed_docs:
#     for token in doc:
#         if token in w2v_model:
#             embedding_idx[token] = w2v_model[token]
#         else:
#             embedding_idx[token] = np.zeros(n_embedding)

# embedding_matrix = np.zeros((n_vocabulary, n_embedding))
# for word, i in word_idx.items():
#     embedding = embedding_idx.get(word)
#     if embedding is not None:
#         embedding_matrix[i] = embedding

# print(embedding_matrix.shape)

(300822, 100)


In [7]:
model = Sequential()

model.add(Embedding(
    n_vocabulary,
    n_embedding,
    embeddings_initializer=Constant(embedding_matrix),
    input_length=n_sequence,
    trainable=False
))
# model.add(Embedding(n_vocabulary, n_embedding, input_length=n_sequence))

# model.add(GRU(32, dropout=.2, recurrent_dropout=.2))

# model.add(LSTM(150, dropout=.5, return_sequences=True))
model.add(LSTM(100, dropout=.5))

# model.add(Conv1D(filters=100, kernel_size=3, activation="relu"))
# model.add(GlobalMaxPooling1D())
# model.add(Dense(100, activation="relu"))
# model.add(Dropout(.5))

# model.add(SpatialDropout1D(.2))
# model.add(LSTM(150, dropout=.2))

# model.add(Dropout(.2))
# model.add(Conv1D(filters=n_hidden, kernel_size=50, activation='relu', strides=1))
# model.add(GlobalMaxPooling1D())
# model.add(Dense(n_hidden, activation="relu"))
# model.add(Dropout(.2))

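# Multi-label problem: one independent sigmoid per label with binary cross-entropy,
# so a document can get several labels or none at all.
model.add(Dense(n_labels, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])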
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 9271, 100)         30082200  
_________________________________________________________________
lstm (LSTM)                  (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 126)               12726     
Total params: 30,175,326
Trainable params: 93,126
Non-trainable params: 30,082,200
_________________________________________________________________


In [None]:
np.random.seed(seed)
tf.random.set_seed(seed)

batch_size = 128
epochs = 100

x_train, x_test, y_train, y_test = train_test_split(
    sequences,
    labels,
    test_size=.1,
    random_state=seed
)

es = EarlyStopping(patience=5, verbose=1, restore_best_weights=True)
history = model.fit(
    x_train, y_train, batch_size=batch_size, epochs=epochs, verbose=1, validation_split=.1, callbacks=[es]
)

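y_prob = model.predict(x_test, batch_size=batch_size, verbose=1)
# f1_score expects binary indicators, so threshold the per-label probabilities.
y_pred = (y_prob >= 0.5).astype(int)
f1 = f1_score(y_test, y_pred, average="micro")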
print(f"test f1: {f1}")
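
For reference, the micro-averaged F1 score used above pools true positives, false positives and false negatives over all documents and labels before computing a single score. A minimal pure-Python sketch on made-up binary matrices follows; returning 0.0 when there are no positives at all is a convention chosen here.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 over binary label matrices given as lists of rows."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up example: three documents, three labels.
y_true_example = [[1, 0, 1], [0, 0, 0], [0, 1, 0]]
y_pred_example = [[1, 0, 0], [0, 0, 0], [0, 1, 1]]

score = micro_f1(y_true_example, y_pred_example)
print(score)
```

Here there are 2 true positives, 1 false positive and 1 false negative, so the score is 2·2 / (2·2 + 1 + 1) = 2/3.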

## Save your model

It might be useful to save your model if you want to continue your work later, or use it for inference later.

In [9]:
# model.save("model.h5")

The model file should now be visible in the "Home" screen of the jupyter notebooks interface.  There you should be able to select it and press "download".

## Download test set

The test set will be made available during the last week before the deadline and can be downloaded in the same way as the training set.

## Predict for test set

You will be asked to return your predictions for a separate test set. These should be returned as a matrix with one row for each test article. Each row contains a binary prediction for each label: 1 if the label applies to the article, and 0 if not. The order of the labels is the order of the label (topic) codes.

An example row could look like this if your system predicts the presence of the second and fourth topic:

    0 1 0 1 0 0 0 0 0 0 0 0 0 0 ...
    
If you have the matrix prepared in `y` you can use the following function to save it to a text file.

In [10]:
# np.savetxt('results.txt', y, fmt='%d')
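
If the network outputs per-label probabilities, they have to be turned into 0/1 values before saving. The sketch below does this in plain Python on made-up probabilities; the 0.5 threshold and the helper names are assumptions, and the printed rows follow the same space-separated layout as the example row above.

```python
def to_binary_rows(probabilities, threshold=0.5):
    """Turn per-label probabilities into 0/1 predictions, row by row."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


def format_rows(rows):
    """Render a binary matrix as one space-separated line per article."""
    return "\n".join(" ".join(str(value) for value in row) for row in rows)


# Made-up probabilities for two articles and four labels.
probabilities = [[0.1, 0.9, 0.2, 0.7], [0.3, 0.2, 0.1, 0.4]]
y = to_binary_rows(probabilities)
print(format_rows(y))
```

The second row is all zeros, which is a valid prediction for an article that belongs to no class.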