# DATA20001 Deep Learning - Group Project
## Text project

**Due Thursday, May 22, before 23:59.**

The task is to learn to assign the correct labels to news articles. The corpus contains ~850K articles from Reuters. The test set is about 10% of the articles. The data comes as raw XML files, so you will have to extract the text and labels yourselves.

We're only giving you the code for downloading the data and for saving the final model. The rest you'll have to do yourselves.

Some comments and hints particular to the project:

- One document may belong to many classes in this problem, i.e., it's a multi-label classification problem. In fact there are documents that don't belong to any class, and you should also be able to handle these correctly. Pay careful attention to how you design the outputs of the network (e.g., what activation to use) and what loss function should be used.
- You may use word embeddings to get better results. For example, you were already using a smaller version of the GloVe embeddings in exercise 4. Do note that these embeddings take a lot of memory.
- In the exercises we used, e.g., `torchvision.datasets.MNIST` to handle the loading of the data in suitable batches. Here, you need to handle the data loading yourself. The easiest way is probably to create a custom `Dataset`. [See for example here for a tutorial](https://github.com/utkuozbulak/pytorch-custom-dataset-examples).
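
As a rough illustration of the hints above, here is a minimal sketch of a custom `Dataset` combined with a multi-label output layer. It assumes each article has already been turned into a fixed-size feature vector; the class name `ArticleDataset`, the toy sizes, and the random data are made up for the example. The key point is one independent sigmoid per label (here folded into `BCEWithLogitsLoss`) rather than a softmax, so that a document can have several labels or none at all.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


class ArticleDataset(Dataset):
    """Hypothetical dataset: feature vectors plus multi-hot label rows."""

    def __init__(self, features, labels):
        self.features = torch.as_tensor(features, dtype=torch.float32)
        self.labels = torch.as_tensor(labels, dtype=torch.float32)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


torch.manual_seed(0)
num_docs, num_features, num_labels = 8, 20, 5  # toy sizes, not the real corpus

features = torch.randn(num_docs, num_features)
labels = (torch.rand(num_docs, num_labels) < 0.3).float()
labels[0] = 0.0  # a document that belongs to no class at all

loader = DataLoader(ArticleDataset(features, labels), batch_size=4, shuffle=True)

model = nn.Linear(num_features, num_labels)  # one raw logit per label
criterion = nn.BCEWithLogitsLoss()  # sigmoid + binary cross-entropy per label
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for x, y in loader:
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()

# At prediction time each label is thresholded independently,
# so an all-zero row (no topics) is a perfectly valid output.
with torch.no_grad():
    probs = torch.sigmoid(model(features))
    predictions = (probs > 0.5).int()
```

The 0.5 threshold is only a starting point; it could be tuned on a validation split.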

In [3]:
import pickle
import warnings

import numpy as np

import data
import preprocessing

warnings.filterwarnings("ignore")

The above command downloads and extracts the data files into the `train` subdirectory.

The files can be found in `train/`, and are named as `19970405.zip`, etc. You will have to manage the content of these zips to get the data. There is a readme which has links to further descriptions on the data.

The class labels, or topics, can be found in the file `train/codes.zip`. The zip contains a file called `topic_codes.txt`, which lists the special codes for the topics (about 130 of them) together with an explanation of what each code means.
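
A minimal sketch of turning the topic code list into a fixed label order. It assumes that `topic_codes.txt` holds one tab-separated code and description per line and that comment lines start with `;`. That layout is an assumption, so check it against the actual file. The function name `parse_topic_codes` and the sample text (apart from `C18`) are hypothetical.

```python
def parse_topic_codes(text):
    """Map topic code -> description, keeping the order of the file."""
    codes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):  # assumed comment marker
            continue
        code, _, description = line.partition("\t")
        codes[code.strip()] = description.strip()
    return codes


# Made-up sample in the assumed format; X01 is not a real code.
sample = "; hypothetical comment line\nC18\tOWNERSHIP CHANGES\nX01\tEXAMPLE TOPIC\n"

topic_codes = parse_topic_codes(sample)
# Column index of each topic in the label matrix.
code_to_index = {code: i for i, code in enumerate(topic_codes)}

# With the real data, the text could be read along these lines:
#   import zipfile
#   with zipfile.ZipFile("train/codes.zip") as zf:
#       text = zf.read("topic_codes.txt").decode("latin-1")  # encoding is a guess
```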

The XML document files contain the article's headline, the main body text, and the list of topic labels assigned to each article. You will have to extract the topics of each article from the XML. For example:
`<code code="C18">` refers to the topic "OWNERSHIP CHANGES" (like a corporate buyout).

You should pre-process the XML to extract the words from the article: the `<headline>` element and the `<text>` element. You should not need any other parts of the article.
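
The extraction could look roughly like the sketch below, which uses only the standard library. The sample document is made up, and its layout is an assumption: paragraphs inside `<text>`, and topic codes grouped under a `<codes>` element whose `class` attribute mentions `topics`. Verify this against a real file before relying on it. The function names `parse_article` and `iter_articles` are hypothetical and are not the helpers from the provided `data` module.

```python
import xml.etree.ElementTree as ET
import zipfile

# Made-up article in the assumed layout; "XX" is a placeholder code.
SAMPLE_XML = """<newsitem>
  <headline>Example headline</headline>
  <text>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </text>
  <metadata>
    <codes class="bip:countries:1.0"><code code="XX"/></codes>
    <codes class="bip:topics:1.0"><code code="C18"/></codes>
  </metadata>
</newsitem>"""


def parse_article(xml_data):
    """Return (headline, body, topic codes) for one article (str or bytes)."""
    root = ET.fromstring(xml_data)
    headline = root.findtext(".//headline", default="")
    text_elem = root.find(".//text")
    # Collect all nested text and normalise whitespace.
    body = " ".join("".join(text_elem.itertext()).split()) if text_elem is not None else ""
    topics = [
        code.get("code")
        for group in root.iter("codes")
        if "topics" in group.get("class", "")  # skip non-topic code groups
        for code in group.iter("code")
    ]
    return headline, body, topics


def iter_articles(zip_path):
    """Yield parsed articles from one daily zip such as train/19970405.zip."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".xml"):
                yield parse_article(zf.read(name))


headline, body, topics = parse_article(SAMPLE_XML)
```

An article without any topic code simply yields an empty `topics` list, which corresponds to an all-zero label row.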

## Extracting the data

In [4]:
# data.extract_data(extraction_dir="train", data_dir="data", data_zip_name="reuters-training-corpus.zip")

try:
    with open("train/docs.pkl", "rb") as f:
        docs = pickle.load(f)
    labels = np.load("train/labels.npy")
except FileNotFoundError:  # no cached files yet, so extract from the corpus
    docs, labels = data.get_docs_labels("train/REUTERS_CORPUS_2")
    with open("train/docs.pkl", "wb") as f:
        pickle.dump(docs, f)
    np.save("train/labels.npy", labels)

print(len(docs))
print(labels.shape)

print(docs[-2])
print(labels[-2])

extracting train/REUTERS_CORPUS_2/19970401.zip
5
(5, 126)
After a huge volume of trade in Bre-X Minerals Ltd caused the Toronto Stock Exchange's computer trading system to crash for the second time in as many weeks, the TSE said on Tuesday it will change the way it deals with the troubled Calgary exploration firm. TSE President Rowland Fleming called a news conference on Tuesday to inform the market that the exchange will no longer halt trading in Bre-X's stock until it first sees the information on which the company bases its halt request. Traditionally the TSE has halted trading of a stock at the request of a company pending the release of material news. Fleming charged that Bre-X took advantage of the policy to have its stock halted and then failed to report material news. "I think they have asked us to stretch the reasonable, the expected, approach to trading halts and the release of material information," Fleming told reporters. He assured investors that Bre-X shares would resume 

## Preprocessing the data

In [5]:
try:
    with open("train/preprocessed_docs.pkl", "rb") as f:
        preprocessed_docs = pickle.load(f)
except FileNotFoundError:  # no cached file yet, so preprocess from scratch
    preprocessed_docs = preprocessing.preprocess_corpus(docs)
    with open("train/preprocessed_docs.pkl", "wb") as f:
        pickle.dump(preprocessed_docs, f)

print(preprocessed_docs[-2])

after a huge volume of trade in bre x minerals ltd cause the toronto stock exchange 's computer trading system to crash for the second time in as many week the tse say on tuesday -pron- will change the way -pron- deal with the troubled calgary exploration firm tse president rowland fleming call a news conference on tuesday to inform the market that the exchange will no longer halt trading in bre x 's stock until -pron- first see the information on which the company base -pron- halt request traditionally the tse have halt trading of a stock at the request of a company pende the release of material news fleming charge that bre x take advantage of the policy to have -pron- stock halt and then fail to report material news -pron- think -pron- have ask -pron- to stretch the reasonable the expect approach to trading halt and the release of material information fleming tell reporter -pron- assure investor that bre x share would resume trade when the stock market open at -num- est/1430 gmt on w
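
Before the preprocessed text can be fed to an embedding layer, each token needs an integer id. A minimal sketch of one way to do this is shown here, assuming whitespace-separated tokens as in the output above. The helper names `build_vocab` and `encode`, the special tokens, and the toy documents are all hypothetical.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def build_vocab(texts, min_count=1):
    """Assign ids to tokens, most frequent first; 0 and 1 are reserved."""
    counts = Counter(token for text in texts for token in text.split())
    vocab = {PAD: 0, UNK: 1}
    for token, count in counts.most_common():
        if count >= min_count:
            vocab[token] = len(vocab)
    return vocab


def encode(text, vocab, max_len):
    """Truncate or pad a document to exactly max_len token ids."""
    ids = [vocab.get(token, vocab[UNK]) for token in text.split()[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


toy_docs = ["the tse say on tuesday", "the exchange will halt trading"]
vocab = build_vocab(toy_docs)
encoded = [encode(doc, vocab, max_len=6) for doc in toy_docs]
```

Raising `min_count` shrinks the vocabulary, which may help with the memory concerns mentioned earlier, at the cost of mapping rare words to the unknown token.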

## Save your model

It might be useful to save your model if you want to continue your work or use it for inference later.

In [6]:
# torch.save(model.state_dict(), 'model.pkl')

The model file should now be visible in the "Home" screen of the Jupyter Notebook interface. There you should be able to select it and press "download".
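
To continue from a saved file later, the state dict has to be loaded into a model with the same architecture. A minimal sketch follows, where the `nn.Linear` layer is only a stand-in for your own model and the file is written to a temporary directory for the sake of the example.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 3)  # hypothetical stand-in for your trained model

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
torch.save(model.state_dict(), path)

# Later: rebuild the same architecture, then load the saved weights.
restored = nn.Linear(4, 3)
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()  # switch to inference mode (matters for dropout, batch norm)
```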

## Download test set

The test set will be made available during the last week before the deadline and can be downloaded in the same way as the training set.

## Predict for test set

You will be asked to return your predictions for a separate test set. These should be returned as a matrix with one row for each test article. Each row contains a binary prediction for each label: 1 if the label applies to the article, and 0 if not. The order of the labels is the order of the label (topic) codes.

An example row could look like this if your system predicts the presence of the second and fourth topic:

    0 1 0 1 0 0 0 0 0 0 0 0 0 0 ...
    
If you have the matrix prepared in `y`, you can use the following command to save it to a text file.

In [7]:
# np.savetxt('results.txt', y, fmt='%d')
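
A minimal sketch of going from predicted probabilities to the file format described above. The probabilities and the 0.5 threshold are made up for illustration, and the file is written to a temporary directory rather than to `results.txt`.

```python
import os
import tempfile

import numpy as np

# Hypothetical probabilities for 2 articles and 4 labels.
probs = np.array([[0.1, 0.9, 0.2, 0.7],
                  [0.3, 0.2, 0.1, 0.4]])

# Threshold each label independently; the second row gets no label at all.
y = (probs >= 0.5).astype(int)

path = os.path.join(tempfile.mkdtemp(), "results.txt")
np.savetxt(path, y, fmt="%d")

# Read the file back to check that the shape and the values survived.
loaded = np.loadtxt(path, dtype=int, ndmin=2)
```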