# DATA20001 Deep Learning - Group Project
## Text project

**Due Thursday, May 20, before 23:59.**

The task is to learn to assign the correct labels to news articles.  The corpus contains ~850K articles from Reuters.  The test set is about 10% of the articles. The data is provided as raw XML files, so you have to extract the text and labels yourselves.

We're only giving you the code for downloading the data, and how to save the final model. The rest you'll have to do yourselves.

Some comments and hints particular to the project:

- One document may belong to many classes in this problem, i.e., it's a multi-label classification problem. In fact there are documents that don't belong to any class, and you should also be able to handle these correctly. Pay careful attention to how you design the outputs of the network (e.g., what activation to use) and what loss function should be used.
- You may use word embeddings to get better results. For example, you already used a smaller version of the GloVe embeddings in exercise 4. Note that these embeddings take a lot of memory.
- In the exercises we used e.g., `torchvision.datasets.MNIST` to handle the loading of the data in suitable batches. Here, you need to handle the data loading yourself.  The easiest way is probably to create a custom `Dataset`. [See for example here for a tutorial](https://github.com/utkuozbulak/pytorch-custom-dataset-examples).
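
The hints above can be sketched as follows. This is a minimal illustration, not the required solution: the `ArticleDataset` class, the toy token ids, the four-code label set and the tiny `EmbeddingBag` model are all made up for the example. It shows one common way to set up a multi-label problem: one logit per label, a multi-hot target vector (all zeros for an article with no labels), `nn.BCEWithLogitsLoss` as the loss, and a sigmoid with a 0.5 threshold at prediction time.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader


class ArticleDataset(Dataset):
    """Hypothetical dataset over already-tokenised articles and their codes."""

    def __init__(self, token_ids, codes, all_codes, max_len=8):
        self.token_ids = token_ids
        self.codes = codes
        self.code_to_idx = {code: i for i, code in enumerate(all_codes)}
        self.max_len = max_len

    def __len__(self):
        return len(self.token_ids)

    def __getitem__(self, idx):
        # Truncate, then pad with index 0 so all examples have equal length.
        ids = self.token_ids[idx][:self.max_len]
        ids = ids + [0] * (self.max_len - len(ids))
        # Multi-hot target; it stays all-zero for an article without labels.
        target = torch.zeros(len(self.code_to_idx))
        for code in self.codes[idx]:
            target[self.code_to_idx[code]] = 1.0
        return torch.tensor(ids, dtype=torch.long), target


all_codes = ["C13", "C18", "GCAT", "M11"]  # toy label set (assumption)
dataset = ArticleDataset(
    token_ids=[[5, 2, 9], [7, 1], [3, 3, 3, 3]],
    codes=[["C18", "GCAT"], [], ["M11"]],
    all_codes=all_codes,
)
loader = DataLoader(dataset, batch_size=3, shuffle=False)

# Stand-in model: averaged word embeddings followed by one logit per label.
model = nn.Sequential(
    nn.EmbeddingBag(10, 16, mode="mean"),
    nn.Linear(16, len(all_codes)),
)
loss_fn = nn.BCEWithLogitsLoss()  # applies the sigmoid internally

inputs, targets = next(iter(loader))
logits = model(inputs)
loss = loss_fn(logits, targets)
predictions = (torch.sigmoid(logits) > 0.5).int()  # independent yes/no per label
print(loss.item(), predictions.shape)
```

Because every label gets its own sigmoid, the model can predict several topics for one article or none at all, which a softmax over the labels could not express.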

## Get the data

In [2]:
import sys
import os
from os.path import join
from os.path import abspath
from os.path import split

import torch
import torchvision
from torchvision.datasets.utils import download_url
import zipfile

root_dir = os.getcwd()
if root_dir not in sys.path:
    sys.path.append(root_dir)
    
train_path = 'train'

data_folder_name = 'text-training-corpus'
DATA_FOLDER_DIR = os.path.abspath(os.path.join(root_dir, data_folder_name))

data_zip_name = 'reuters-training-corpus.zip'
DATA_ZIP_DIR = os.path.abspath(os.path.join(DATA_FOLDER_DIR, data_zip_name))

with zipfile.ZipFile(DATA_ZIP_DIR) as zip_f:
    zip_f.extractall(train_path)

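The cell above extracts the data files from `text-training-corpus/reuters-training-corpus.zip` into the `train` subdirectory. It does not download anything itself, so the archive must already be present at that location.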

The files can be found in `train/`, and are named as `19970405.zip`, etc. You will have to manage the content of these zips to get the data. There is a readme which has links to further descriptions on the data.

The class labels, or topics, can be found in the archive `train/codes.zip`.  The zip contains a file called `topic_codes.txt`, which lists the codes for the topics (about 130 of them) together with an explanation of what each code means.
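
As a rough sketch of how the label set could be loaded: the layout of `topic_codes.txt` assumed here (comment lines starting with `;`, then one `CODE<TAB>DESCRIPTION` pair per line, UTF-8 compatible encoding) is an assumption and should be checked against the real file. The helper name `parse_topic_codes` and the in-memory zip are only for illustration.

```python
import io
import zipfile


def parse_topic_codes(lines):
    """Return {code: description}, keeping the order of the file."""
    topics = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        code, _, description = line.partition("\t")
        topics[code] = description.strip()
    return topics


# Build a tiny stand-in for codes.zip in memory so the sketch runs anywhere.
# "XYZ" is a made-up code; only C18 is taken from the project description.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "topic_codes.txt",
        ";Toy excerpt\nC18\tOWNERSHIP CHANGES\nXYZ\tEXAMPLE TOPIC\n",
    )

with zipfile.ZipFile(buffer) as zf:
    with zf.open("topic_codes.txt") as f:
        topics = parse_topic_codes(io.TextIOWrapper(f, encoding="utf-8"))

# Fixed column index per topic code, e.g. for building multi-hot targets.
code_to_index = {code: i for i, code in enumerate(topics)}
print(topics)
print(code_to_index)
```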

The XML document files contain the article's headline, the main body text, and the list of topic labels assigned to each article.  You will have to extract the topics of each article from the XML.  For example, `<code code="C18">` refers to the topic "OWNERSHIP CHANGES" (like a corporate buyout).

You should pre-process the XML to extract the words from the article: the `<headline>` element and the `<text>` element.  You should not need any other parts of the article.
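
A minimal sketch of this extraction using only the standard library. The sample document is hand-written, and its layout (a `<newsitem>` root, paragraphs in `<p>` elements, topic codes inside `<codes class="bip:topics:1.0">` under `<metadata>`) is an assumption about the corpus format that should be verified against a real file; `parse_article` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for one article file (layout is an assumption).
SAMPLE = """<newsitem itemid="1" lang="en">
<title>USA: Example Corp to buy Sample Inc.</title>
<headline>Example Corp to buy Sample Inc.</headline>
<text>
<p>Example Corp said on Monday it would buy Sample Inc.</p>
<p>Terms were not disclosed.</p>
</text>
<metadata>
<codes class="bip:countries:1.0"><code code="USA"/></codes>
<codes class="bip:topics:1.0"><code code="C18"/><code code="CCAT"/></codes>
</metadata>
</newsitem>"""


def parse_article(xml_source):
    """Return (headline, body text, topic codes) for one article."""
    root = ET.fromstring(xml_source)
    headline = (root.findtext("headline") or "").strip()
    paragraphs = [p.text.strip() for p in root.findall("./text/p") if p.text]
    # Only codes from the topics group; country codes are ignored.
    topics = [
        code.get("code")
        for code in root.findall("./metadata/codes[@class='bip:topics:1.0']/code")
    ]
    return headline, " ".join(paragraphs), topics


headline, body, topics = parse_article(SAMPLE)
print(headline)
print(body)
print(topics)
```

An article without a topics group simply yields an empty list, which corresponds to the "no class" case mentioned in the hints.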

## Your stuff goes here ...

In [321]:
#pip install lxml
from bs4 import BeautifulSoup  # imported under the name used in the parsing cell
import pandas as pd
import re

In [327]:
train_path = os.path.abspath(os.path.join(root_dir, train_path))
reuters_unzipped_path = os.path.join(train_path, 'REUTERS_CORPUS_2')

# Keep only the daily archives (e.g. 19970405.zip). os.listdir() returns names in
# arbitrary order, so codes.zip, dtds.zip and readme.txt are filtered out by name
# rather than by position.
zipped_news_files = sorted(
    name for name in os.listdir(reuters_unzipped_path)
    if re.fullmatch(r'\d{8}\.zip', name)
)

Generating dataframe. Takes ~5min.

In [319]:
rows_list = []

for news_file in zipped_news_files:
    # Open each daily archive; the context manager closes it when done.
    with zipfile.ZipFile(os.path.join(reuters_unzipped_path, news_file), 'r') as zf:
        for name in zf.namelist():
            soup = BeautifulSoup(zf.read(name), "xml")
            # One dict per article; the dicts are collected in a list and then turned into the df.
            rows_list.append({
                "title": soup.title.text,
                "text": soup.find("text").text,
                # Every <code> element under <metadata> is collected here, not only the topic codes.
                "codes": [val["code"] for val in soup.metadata.find_all("code")],
            })

df = pd.DataFrame(rows_list, columns=['title', 'text', 'codes'])

In [320]:
df

| | title | text | codes |
|---|---|---|---|
| 0 | EU: REUTER EC REPORT LONG-TERM DIARY FOR APR ... | `\n****\nHIGHLIGHTS\n****\nAMSTERDAM - The Neth...` | [EEC, G15, GCAT] |
| 1 | EU: OFFICIAL JOURNAL CONTENTS - OJ L 85 OF MA... | `\n* Decision of the EEA Joint Committee No 55/...` | [EEC, G15, GCAT] |
| 2 | CANADA: Toronto stocks end higher after volati... | `\nCHANGE\t\t\t\t CHANGE\nTSE\t 5900.37 ...` | [CANA, M11, MCAT] |
| 3 | CANADA: TSE says will not halt Bre-X on request. | `\nAfter a huge volume of trade in Bre-X Minera...` | [CANA, I21000, C13, C14, C15, C152, CCAT, M11,... |
| 4 | CANADA: Suncor lowers Canada posted oil prices. | `\nSuncor Inc said it lowered the price it woul...` | [CANA, M14, M143, MCAT] |
| ... | ... | ... | ... |
| 299768 | USA: UPS says has deal to end Teamsters' strike. | `\nUnited Parcel Service said on Monday it had ...` | [USA, GJOB] |
| 299769 | USA: UPS says has tentative deal to end strike. | `\nUnited Parcel Service said late Monday night...` | [USA, I79010, C42, CCAT, E41, ECAT, GCAT, GJOB] |
| 299770 | JAPAN: Asia currency woes hurt region's oil pr... | `\nThis year's rash of Asian currency crises co...` | [INDON, JAP, MALAY, PHLNS, THAIL, C21, C24] |
| 299771 | TAIWAN: Typhoon Winnie kills 25 in Taiwan. | `\nA typhoon that packed high winds and torrent...` | [TAIWAN, GDIS, GENV] |


## Save your model

It might be useful to save your model if you want to continue your work later, or use it for inference later.

In [None]:
torch.save(model.state_dict(), 'model.pkl')

The model file should now be visible in the "Home" screen of the Jupyter notebook interface.  There you should be able to select it and press "download".
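
To continue from a saved file, the weights can be loaded back into a model of the same architecture. The sketch below uses a throwaway `nn.Linear` as a stand-in for your own network and a temporary directory instead of the notebook's working directory; both are assumptions made to keep the example self-contained.

```python
import os
import tempfile

import torch
from torch import nn

model = nn.Linear(4, 3)  # stand-in for the real network

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    torch.save(model.state_dict(), path)

    # Re-create the same architecture, then load the saved weights into it.
    restored = nn.Linear(4, 3)
    restored.load_state_dict(torch.load(path))

restored.eval()  # switch to inference mode before predicting
same = all(
    torch.equal(a, b)
    for a, b in zip(model.state_dict().values(), restored.state_dict().values())
)
print(same)
```

Since only the `state_dict` is stored, the code that defines the model class has to be available wherever the file is loaded.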

## Download test set

The test set will be made available during the last week before the deadline and can be downloaded in the same way as the training set.

## Predict for test set

You will be asked to return your predictions for a separate test set.  These should be returned as a matrix with one row for each test article.  Each row contains a binary prediction for each label: 1 if the label applies to the article, and 0 if not. The order of the labels is the order of the label (topic) codes.

An example row could look like this if your system predicts the presence of the second and fourth topic:

    0 1 0 1 0 0 0 0 0 0 0 0 0 0 ...
    
If you have the matrix prepared in `y` you can use the following function to save it to a text file.

In [None]:
import numpy as np
np.savetxt('results.txt', y, fmt='%d')
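
For illustration, `y` could be built from per-label probabilities as below. The probabilities are made up and the 0.5 threshold is an assumption, not something the assignment prescribes; a temporary directory is used so the example does not overwrite a real `results.txt`.

```python
import os
import tempfile

import numpy as np

# Made-up sigmoid outputs: 3 test articles x 5 labels.
probabilities = np.array([
    [0.1, 0.9, 0.2, 0.7, 0.0],
    [0.3, 0.2, 0.1, 0.4, 0.2],  # nothing above the threshold: no labels
    [0.8, 0.1, 0.6, 0.2, 0.9],
])
y = (probabilities > 0.5).astype(int)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.txt")
    np.savetxt(path, y, fmt="%d")
    with open(path) as f:
        saved_lines = f.read().splitlines()

print(saved_lines[0])
```

Reading the file back like this is a cheap way to check that it has one row per article and one 0/1 column per label before submitting.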