# DATA20001 Deep Learning - Group Project
## Text project

**Due Thursday, May 22, before 23:59.**

The task is to learn to assign the correct labels to news articles. The corpus contains ~850K articles from Reuters, and the test set is about 10% of the articles. The data is provided as raw XML files, so you will have to extract the text and labels yourselves.

We're only giving you the code for downloading the data and for saving the final model. The rest you'll have to do yourselves.

Some comments and hints particular to the project:

- One document may belong to many classes in this problem, i.e., it's a multi-label classification problem. In fact there are documents that don't belong to any class, and you should also be able to handle these correctly. Pay careful attention to how you design the outputs of the network (e.g., what activation to use) and what loss function should be used.
- You may use word embeddings to get better results. For example, you already used a smaller version of the GloVe embeddings in exercise 4. Do note that these embeddings take a lot of memory.
- In the exercises we used, e.g., `torchvision.datasets.MNIST` to handle loading the data in suitable batches. Here, you need to handle the data loading yourself. The easiest way is probably to create a custom `Dataset`. [See for example here for a tutorial](https://github.com/utkuozbulak/pytorch-custom-dataset-examples).
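
The output design mentioned in the first hint can be sketched as follows: one independent sigmoid per label, trained with binary cross-entropy, so that a document may activate several labels or none at all. This is a minimal pure-Python illustration; the function names and the 0.5 threshold are assumptions made for the example, not part of the course material.

```python
import math

def sigmoid(z):
    """Squash a raw score (logit) into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(targets, probs, eps=1e-7):
    """Mean binary cross-entropy over all labels of one document."""
    total = 0.0
    for t, p in zip(targets, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(targets)

def predict_labels(logits, threshold=0.5):
    """Decide each label independently of the others."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]

# Three hypothetical labels: two are active for the first document,
# none for the second.
multi = predict_labels([2.0, -3.0, 1.5])
empty = predict_labels([-2.0, -3.0, -1.5])
loss = binary_cross_entropy([1, 0, 1], [sigmoid(z) for z in [2.0, -3.0, 1.5]])
```

A softmax output would force the label probabilities to sum to one, so it could represent neither several labels at once nor the absence of all labels.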

In [13]:
import os
import pickle
import random as rn
import warnings
from multiprocessing import cpu_count

warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import tensorflow as tf
from gensim.models import Word2Vec
from sklearn.metrics import accuracy_score, f1_score, hamming_loss
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.initializers import Constant, GlorotUniform
from tensorflow.keras.layers import Dense, Dropout, Activation, Embedding, Conv1D, \
        GlobalMaxPooling1D, SpatialDropout1D, LSTM, GRU, Flatten, MaxPooling1D, \
        BatchNormalization, ReLU
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

import data
import preprocessing

seed = 42

The above command downloads and extracts the data files into the `train` subdirectory.

The files can be found in `train/` and are named `19970405.zip`, etc. You will have to unpack these zips to get the data. There is a readme which has links to further descriptions of the data.

The class labels, or topics, are listed in `train/codes.zip`. The zip contains a file called `topic_codes.txt`, which lists the codes for the topics (about 130 of them) together with an explanation of what each code means.

The XML document files contain the article's headline, the main body text, and the list of topic labels assigned to each article. You will have to extract the topics of each article from the XML. For example, `<code code="C18">` refers to the topic "OWNERSHIP CHANGES" (like a corporate buyout).

You should pre-process the XML to extract the words from the article: the `<headline>` element and the `<text>` element. You should not need any other parts of the article.
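
The extraction step could look roughly like the sketch below, which uses the standard-library `xml.etree.ElementTree`. The sample document is a simplified stand-in: the `<p>` children of the text element, the `class` attribute on `<codes>`, and the helper name `parse_article` are assumptions made for illustration, so check them against the real files and the readme.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for one article; real files carry more metadata.
SAMPLE = """<newsitem itemid="1" date="1997-04-05">
<headline>Example Corp buys Sample Ltd.</headline>
<text>
<p>Example Corp said on Monday it had agreed to buy Sample Ltd.</p>
<p>Terms were not disclosed.</p>
</text>
<metadata>
<codes class="bip:countries:1.0">
  <code code="USA"/>
</codes>
<codes class="bip:topics:1.0">
  <code code="C18"/>
  <code code="CCAT"/>
</codes>
</metadata>
</newsitem>"""

def parse_article(xml_string):
    """Return (headline + body text, list of topic codes) for one article."""
    root = ET.fromstring(xml_string)
    headline = (root.findtext("headline") or "").strip()
    text_node = root.find("text")
    paragraphs = [] if text_node is None else [
        "".join(p.itertext()).strip() for p in text_node.iter("p")
    ]
    # Keep only the topic codes; other code groups (e.g. countries) are skipped.
    topics = [
        code.get("code")
        for codes in root.iter("codes")
        if "topics" in codes.get("class", "")
        for code in codes.iter("code")
    ]
    return headline + " " + " ".join(paragraphs), topics

doc, topics = parse_article(SAMPLE)
```

An article without any topic codes simply yields an empty list, which matches the unlabeled case mentioned in the hints.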

## Extracting the data

In [3]:
# data.extract_data(extraction_dir="train", data_dir="data", data_zip_name="reuters-training-corpus.zip")

df = pd.read_pickle("train/data.pkl")

# df = data.get_docs_labels("train/REUTERS_CORPUS_2")
# df.to_pickle("train/data.pkl")

docs = df["doc"].values
labels = np.array(df["labels"].tolist())
n_labels = len(data.CODEMAP)

print(docs.shape)
print(labels.shape)
print(docs[-2])
print(labels[-2])

(299773,)
(299773, 126)
Typhoon Winnie kills 25 in Taiwan. A typhoon that packed high winds and torrential rain killed 25 people in Taiwan on Monday and Tuesday, with landslides bringing down buildings and floodwaters turning streets into rivers, officials said on Tuesday. The death toll has risen to 25, one missing, 16 seriously injured and 62 slightly hurt, the government's anti-typhoon centre said in a statement. Three houses totally collapsed and 37 partly collapsed, it said. State television showed several five-storey buildings in eastern Taipei that had sunk two stories into the ground. The Central Weather Bureau said late on Monday the danger had passed as Typhoon Winnie headed towards mainland China. Heavy torrential rain and strong winds triggered landslides in Taipei, destroying or damaging buildings and blocking traffic. "The whole scene looks as if it has gone through an explosion," a state television reporter in the city said. Local authorities mobilised hundreds of rescue
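
The label matrix has one column per topic code, so each article's list of codes has to be turned into a multi-hot row. The sketch below shows one way this might be done. The four-entry `CODEMAP` is a hypothetical excerpt and its code-to-column layout is an assumption; the project's own `data.CODEMAP` is not shown here and may be structured differently.

```python
# Hypothetical excerpt: maps a topic code to its column in the label matrix.
CODEMAP = {"C18": 0, "CCAT": 1, "GDIS": 2, "GWEA": 3}

def encode_labels(topic_codes, codemap):
    """Build a multi-hot row with a 1 in the column of every assigned topic."""
    row = [0] * len(codemap)
    for code in topic_codes:
        if code in codemap:  # ignore codes outside the known topic list
            row[codemap[code]] = 1
    return row

typhoon = encode_labels(["GDIS", "GWEA"], CODEMAP)
unlabeled = encode_labels([], CODEMAP)  # an article with no topics at all
```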

## Preprocessing the data

In [4]:
with open("train/preprocessed_docs_no_sw_no_rep.pkl", "rb") as f:
    preprocessed_docs = pickle.load(f)

# preprocessed_docs = preprocessing.preprocess_corpus(docs)
# with open("train/preprocessed_docs.pkl", "wb") as f:
#     pickle.dump(preprocessed_docs, f)

print(preprocessed_docs[-2])

typhoon winnie kill 25 taiwan typhoon pack high wind torrential rain kill 25 people taiwan monday tuesday landslide bring building floodwater turn street river official say tuesday death toll rise 25 miss 16 seriously injure 62 slightly hurt government anti typhoon centre say statement house totally collapse 37 partly collapse say state television show storey building eastern taipei sink story ground central weather bureau say late monday danger pass typhoon winnie head mainland china heavy torrential rain strong wind trigger landslide taipei destroy damaging building block traffic scene look go explosion state television reporter city say local authority mobilise hundred rescue worker soldier help evacuate resident people remain trap house state medium say seven people bury landslide hit house north taipei woman survive torrential rain swamp house low lie area capital road flood score vehicle submerge typhoon pack maximum sustained wind 144 km hour 89 mile hour gust 180 kph 112 mph fo
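
The `preprocessing` module itself is not shown in this notebook. As a rough, simplified stand-in, lowercasing, punctuation removal and stop-word filtering can be sketched as below. The stop-word list and the function name are assumptions, and the sketch does not lemmatize, so it keeps "kills" where the output above shows "kill".

```python
import re

# Tiny illustrative stop-word list; a real one has a few hundred entries.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "that", "the", "to", "with"}

def preprocess(doc):
    """Lowercase, keep alphanumeric tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", doc.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = preprocess("Typhoon Winnie kills 25 in Taiwan.")
```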

## Converting the docs to token index sequences

In [5]:
n_vocabulary = 5000

tokenizer = Tokenizer(num_words=n_vocabulary, filters="")
tokenizer.fit_on_texts(preprocessed_docs)
word_idx = tokenizer.word_index

if n_vocabulary is None:
    # Keras word indices start at 1 (0 is reserved for padding), hence the +1.
    n_vocabulary = len(word_idx) + 1

print(n_vocabulary)

5000


In [6]:
n_sequence = 64
# n_sequence = max([len(doc) for doc in preprocessed_docs])

sequences = tokenizer.texts_to_sequences(preprocessed_docs)
sequences = pad_sequences(sequences, maxlen=n_sequence, padding="post", truncating="post")

print(n_sequence)
print(sequences.shape)
print(sequences[1])

# doc_matrix = tokenizer.texts_to_matrix(preprocessed_docs, mode="tfidf")

# print(doc_matrix.shape)
# print(doc_matrix[1])

64
(299773, 64)
[  40 2068 2043 1758   80  496   25  282  541  418  473  581   38 2952
  173  232 1113 2016  840  861 1915  282  541  418   86  452   25 2952
 2687  743 1119  610 4245  232  282  541  418   79  463   25 2952  788
  206  232   17 1144    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0]
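
To make the two steps above concrete, here is a small pure-Python sketch of what the tokenizer and the padding do conceptually: words are ranked by frequency, only indices below `num_words` are kept, and every sequence is cut or zero-padded at the end to a fixed length. The helper names are made up for this illustration, and the sketch is not the Keras implementation.

```python
from collections import Counter

def fit_word_index(docs):
    """Rank words by frequency; index 1 is the most frequent, 0 means padding."""
    counts = Counter(word for doc in docs for word in doc.split())
    ranked = counts.most_common()
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_padded_sequence(doc, word_index, num_words, maxlen):
    """Map words to indices, dropping unknown words and those ranked too low."""
    seq = [word_index[w] for w in doc.split()
           if w in word_index and word_index[w] < num_words]
    seq = seq[:maxlen]                      # corresponds to truncating="post"
    return seq + [0] * (maxlen - len(seq))  # corresponds to padding="post"

docs = ["rain rain wind", "rain wind flood"]
word_index = fit_word_index(docs)
padded = to_padded_sequence("flood rain wind storm", word_index,
                            num_words=3, maxlen=5)
```

With `num_words=3` only the two most frequent words survive, which is why "flood" (index 3) and the unseen word "storm" both disappear from the sequence.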


## Creating word embeddings

In [7]:
n_embedding = 256

In [8]:
w2v_model = Word2Vec(sentences=[s.split() for s in preprocessed_docs],
                     size=n_embedding, 
                     window=5,
                     sg=1,
                     workers=cpu_count(),
                     min_count=1)

n_vocabulary_w2v = len(w2v_model.wv.vocab)
print(n_vocabulary_w2v)

648463


In [9]:
embedding_matrix = np.zeros((n_vocabulary, n_embedding))
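for token, i in word_idx.items():
    # The tokenizer only emits indices below num_words, so skip rarer words.
    if i >= n_vocabulary:
        continue
    # Look tokens up through `.wv`; tokens without a vector keep their zero row.
    if token in w2v_model.wv:
        embedding_matrix[i] = w2v_model.wv[token]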

print(embedding_matrix.shape)

(5000, 256)


## Defining the NN model

In [10]:
model = Sequential()

model.add(Embedding(
    n_vocabulary,
    n_embedding,
    embeddings_initializer=Constant(embedding_matrix),
    input_length=n_sequence,
    trainable=False
))
# model.add(Embedding(n_vocabulary, n_embedding, input_length=n_sequence))

model.add(Dropout(.25))
model.add(Conv1D(64, 5, activation="relu"))
model.add(Dropout(.25))
model.add(Conv1D(128, 5, activation="relu"))
model.add(Dropout(.25))
model.add(Flatten())
model.add(Dense(128))
model.add(BatchNormalization())
model.add(ReLU())
model.add(Dropout(.25))
model.add(Dense(128))
model.add(BatchNormalization())
model.add(ReLU())
model.add(Dropout(.25))

# model.add(Bidirectional(LSTM(256, return_sequences=True)))
# model.add(Bidirectional(LSTM(128)))
# model.add(Dense(128, activation="relu"))
# model.add(Dropout(.5))

# model.add(GRU(32, dropout=.2))

# model.add(Dense(512, activation="relu", input_shape=(n_vocabulary,)))
# model.add(Dropout(.5))

# model.add(Conv1D(100, 4, activation="relu"))
# model.add(MaxPooling1D(pool_size=3))
# model.add(Conv1D(100, 2, activation="relu"))
# model.add(Dropout(.5))
# model.add(Flatten())
# model.add(Dense(300, activation="relu"))

model.add(Dense(n_labels, activation="sigmoid"))
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 64, 256)           1280000   
_________________________________________________________________
dropout (Dropout)            (None, 64, 256)           0         
_________________________________________________________________
conv1d (Conv1D)              (None, 60, 64)            81984     
_________________________________________________________________
dropout_1 (Dropout)          (None, 60, 64)            0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 56, 128)           41088     
_________________________________________________________________
dropout_2 (Dropout)          (None, 56, 128)           0         
_________________________________________________________________
flatten (Flatten)            (None, 7168)              0
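
The shapes and parameter counts in the summary above can be checked by hand. The sketch below assumes the Keras defaults for `Conv1D` (stride 1, `"valid"` padding, one bias per filter); the helper names are made up for this example.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1

def conv1d_params(in_channels, filters, kernel_size):
    """Weights (kernel x input channels x filters) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

n_vocabulary, n_embedding, n_sequence = 5000, 256, 64

embedding_params = n_vocabulary * n_embedding       # 1280000
len1 = conv1d_output_length(n_sequence, 5)          # 60
len2 = conv1d_output_length(len1, 5)                # 56
conv1_params = conv1d_params(n_embedding, 64, 5)    # 81984
conv2_params = conv1d_params(64, 128, 5)            # 41088
flatten_size = len2 * 128                           # 7168
```

Note that the first dense layer after `Flatten` sees 7168 inputs, so it holds far more weights than both convolutions together.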

## Splitting data to train and test

In [11]:
n = None # set to None for full dataset

x_train, x_test, y_train, y_test = train_test_split(sequences,
# x_train, x_test, y_train, y_test = train_test_split(doc_matrix,
                                                    labels,
                                                    train_size=n,
                                                    test_size=n,
                                                    random_state=seed)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

(224829, 64)
(224829, 126)
(74944, 64)
(74944, 126)
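
With `n = None`, both `train_size` and `test_size` are left unset, and scikit-learn then falls back to a 25% test split. The shapes printed above can be reproduced with a little arithmetic, sketched here under the assumption that the test count is rounded up and the remainder goes to training:

```python
import math

n_samples = 299773
test_fraction = 0.25  # scikit-learn's default when both sizes are None

n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test
```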


## Fitting and predicting

In [16]:
batch_size = 256

es = EarlyStopping(patience=3, verbose=1, restore_best_weights=True)
history = model.fit(x_train,
                    y_train,
                    batch_size=batch_size,
                    epochs=100,
                    verbose=1,
                    validation_split=.1,
                    callbacks=[es])

y_pred = np.round(model.predict(x_test, batch_size=batch_size, verbose=1)).astype(int)

print(f"test avg. accuracy: {np.mean([accuracy_score(y_test[i], y_pred[i]) for i in range(len(y_test))])}")
print(f"test hamming loss: {hamming_loss(y_test, y_pred)}")

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 00007: early stopping
test avg. accuracy: 0.9937531134358097
test hamming loss: 0.006246886564190151
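
The two figures printed above add up to one: accuracy averaged per document over the labels is the complement of the Hamming loss. With 126 labels that are mostly zero, both are dominated by correctly predicted absences. A micro-averaged F1 score is one commonly used alternative (`f1_score` is imported at the top but never called). The toy data and helper names below are assumptions for illustration only.

```python
def flatten_pairs(y_true, y_pred):
    """Pair up every individual (true, predicted) label decision."""
    return [(t, p) for row_t, row_p in zip(y_true, y_pred)
            for t, p in zip(row_t, row_p)]

def hamming(y_true, y_pred):
    """Fraction of individual label decisions that are wrong."""
    pairs = flatten_pairs(y_true, y_pred)
    return sum(t != p for t, p in pairs) / len(pairs)

def micro_f1(y_true, y_pred):
    """F1 computed from true/false positives pooled over all labels."""
    pairs = flatten_pairs(y_true, y_pred)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

y_true = [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]]

row_accuracy = sum(
    sum(t == p for t, p in zip(rt, rp)) / len(rt)
    for rt, rp in zip(y_true, y_pred)
) / len(y_true)
loss = hamming(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
```

Here only 2 of 12 decisions are wrong, yet the F1 score is just two thirds, because it ignores the many true negatives.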


## Save your model

It might be useful to save your model if you want to continue your work later, or use it for inference later.

In [None]:
# model.save("model.h5")  # the model above is a Keras model, so use its own save method

The model file should now be visible in the "Home" screen of the jupyter notebooks interface.  There you should be able to select it and press "download".

## Download test set

The test set will be made available during the last week before the deadline and can be downloaded in the same way as the training set.

## Predict for test set

You will be asked to return your predictions for a separate test set. These should be returned as a matrix with one row for each test article. Each row contains a binary prediction for each label: 1 if the label applies to the article, and 0 if not. The order of the labels is the order of the label (topic) codes.

An example row could look like this if your system predicts the presence of the second and fourth topic:

    0 1 0 1 0 0 0 0 0 0 0 0 0 0 ...
    
If you have the matrix prepared in `y` you can use the following function to save it to a text file.

In [None]:
# np.savetxt('results.txt', y, fmt='%d')
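
As a small end-to-end check of the output format, the sketch below thresholds some made-up probabilities, writes them with `np.savetxt`, and reads the file back. The probabilities, the 0.5 threshold and the temporary path are assumptions for illustration; in the project, `y` would come from your model's predictions on the test set.

```python
import os
import tempfile

import numpy as np

# Made-up sigmoid outputs for two articles and four labels.
probs = np.array([[0.1, 0.9, 0.2, 0.7],
                  [0.3, 0.2, 0.1, 0.4]])
y = (probs >= 0.5).astype(int)  # binary prediction per label

# Write to a temporary directory so the example leaves no files behind.
path = os.path.join(tempfile.mkdtemp(), "results.txt")
np.savetxt(path, y, fmt="%d")

with open(path) as f:
    lines = f.read().splitlines()
```

The second article gets an all-zero row, which is a valid prediction for a document that belongs to no class.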