# DATA20001 Deep Learning - Group Project
## Text project

**Due Thursday, December 13, before 23:59.**

The task is to learn to assign the correct labels to news articles. The corpus contains ~850K articles from Reuters. The test set is about 10% of the articles. The data is provided as XML files inside zip archives, which you need to extract yourselves.

We're only giving you the code for downloading the data, and how to save the final model. The rest you'll have to do yourselves.

Some comments and hints particular to the project:

- One document may belong to many classes in this problem, i.e., it's a multi-label classification problem. In fact there are documents that don't belong to any class, and you should also be able to handle these correctly. Pay careful attention to how you design the outputs of the network (e.g., what activation to use) and what loss function should be used.
- You may use word embeddings to get better results. For example, you already used a smaller version of the GloVe embeddings in exercise 4. Note that these embeddings take a lot of memory.
- In the exercises we used, e.g., `torchvision.datasets.MNIST` to handle loading the data in suitable batches. Here, you need to handle the data loading yourself. The easiest way is probably to create a custom `Dataset`. [See for example here for a tutorial](https://github.com/utkuozbulak/pytorch-custom-dataset-examples).
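
As a rough illustration of the first hint, the sketch below treats every label as an independent yes/no decision: one sigmoid per label, with binary cross-entropy averaged over the labels. It is written in plain Python to show the arithmetic only; in PyTorch the same combination is available as `torch.nn.BCEWithLogitsLoss`. The label names and logit values are made up for the example.

```python
import math

def sigmoid(z):
    """Map a raw network output (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over labels; each label is scored independently."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return total / len(logits)

# Hypothetical logits for three labels, e.g. ["C18", "M13", "MCAT"]
logits = [2.0, -3.0, -1.5]

# An article carrying only the first label, and an article with no label at all.
loss_one_label = multilabel_bce(logits, [1, 0, 0])
loss_no_label = multilabel_bce(logits, [0, 0, 0])
```

An all-zero target is a valid target here, which is what lets the network represent articles without any label. A softmax output would instead force the label probabilities to sum to one.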

## Download the data

In [1]:
import os
import sys
import torch
from torchvision.datasets.utils import download_url
import zipfile

In [11]:
train_path = 'train/'

dl_file='reuters.zip'
dl_url='https://www.cs.helsinki.fi/u/jgpyykko/'
zip_path = os.path.join(train_path, dl_file)
if not os.path.isfile(zip_path):
    download_url(dl_url + dl_file, root=train_path, filename=dl_file, md5=None)
with zipfile.ZipFile(zip_path) as zip_f:
    zip_f.extractall(train_path)
os.unlink(zip_path)  # remove the archive only after it has been closed

Downloading https://www.cs.helsinki.fi/u/jgpyykko/reuters.zip to train/reuters.zip
None


The above command downloads and extracts the data files into the `train` subdirectory.

The files can be found in `train/` and are named `19970405.zip`, etc. You will have to handle the contents of these zips yourselves to get the data. There is a readme with links to further descriptions of the data.
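
One possible way to handle the daily archives, sketched below, is to read the XML members directly from each zip with the standard `zipfile` module instead of unpacking everything to disk. The helper name `iter_xml_members` is hypothetical, and the tiny in-memory archive only stands in for a real file such as `train/19970405.zip`.

```python
import io
import zipfile

def iter_xml_members(zip_source):
    """Yield (member_name, raw_bytes) for every .xml member of a zip archive.

    zip_source may be a path or a file-like object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if name.endswith(".xml"):
                yield name, archive.read(name)

# Build a small in-memory archive so the sketch runs without the real data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("1234newsML.xml", "<newsitem><headline>Example</headline></newsitem>")
    archive.writestr("notes.txt", "not an article")
buffer.seek(0)

members = list(iter_xml_members(buffer))
```

The raw bytes can then be handed to an XML parser one article at a time, which avoids creating hundreds of thousands of small files.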

The class labels, or topics, can be found in the archive `train/codes.zip`. The zip contains a file called `topic_codes.txt`. This file lists the codes for the topics (about 130 of them) together with an explanation of what each code means.

The XML document files contain the article's headline, the main body text, and the list of topic labels assigned to each article.  You will have to extract the topics of each article from the XML.  For example:
`<code code="C18">` refers to the topic "OWNERSHIP CHANGES" (like a corporate buyout).

You should pre-process the XML to extract the words from the article: the `<headline>` element and the `<text>` element.  You should not need any other parts of the article.

## Your stuff goes here ...

### Dataset extraction

In [6]:
import os
zip_path = "train/REUTERS_CORPUS_2/"
txt_path = "train/extracted/"

In [15]:
# Collect the zip archives shipped with the dataset
zip_files = [os.path.join(zip_path, f) for f in os.listdir(zip_path)
             if os.path.isfile(os.path.join(zip_path, f))
             and f.endswith(".zip")]
# Train data is all files like 132456.zip, excluding codes.zip and dtds.zip
zip_data = [f for f in zip_files
            if "codes" not in f
            and "dtds" not in f]

print("Extracting train data")
for i, zfpath in enumerate(zip_data):
    print("..Extracting ", zfpath, "\t(", i+1, "/", len(zip_data),")")
    with zipfile.ZipFile(zfpath) as zf:
        zf.extractall(txt_path)
# Exporting codes

print("Extracting train labels")
with zipfile.ZipFile(zip_path+"codes.zip") as zf:
    zf.extractall(txt_path+"labels/")

Extracting train data
..Extracting  train/REUTERS_CORPUS_2/19970728.zip 	( 1 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970404.zip 	( 2 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970415.zip 	( 3 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970412.zip 	( 4 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970529.zip 	( 5 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970805.zip 	( 6 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970730.zip 	( 7 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970604.zip 	( 8 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970725.zip 	( 9 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970710.zip 	( 10 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970414.zip 	( 11 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970705.zip 	( 12 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970630.zip 	( 13 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970626.zip 	( 14 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970806.zip 	( 15 / 127 )
..Extracting  train/REUTERS

### Dataset parsing

In [16]:
# Build a dictionary linking code labels to their actual meanings.
# For example: the code I50100 stands for "CONSTRUCTION OF BUILDINGS"
# Codes are the labels we will try to predict.
codes_path = txt_path+"labels/"
code_files = [codes_path+f for f in os.listdir(codes_path)
              if os.path.isfile(os.path.join(codes_path, f))
              and not f.startswith("readme")]

code_meaning_dict = {}  # Maps a code label to its meaning
for code_file in code_files:
    with open(code_file) as f:
        lines = f.read().split("\n")

    for line in lines:
        if not line.startswith(";") and line != "":  # Dropping comments and empty lines
            code_tuple = line.split("\t")  # [code, meaning]
            if code_tuple[0] in code_meaning_dict:
                # Check that no code appears twice across the different code files
                print("COLLISION!", file=sys.stderr)
                # No collision was observed, so it is safe to use a single dictionary for all codes
            code_meaning_dict[code_tuple[0]] = code_tuple[1]

In [None]:
from xml.etree import ElementTree as ET
# ElementTree is much faster than xml.dom (about 100 times)
# because it is backed by a C implementation. cElementTree is deprecated; use ElementTree instead
# ref : http://effbot.org/zone/celementtree.htm
#       https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree

files = [txt_path+f for f in os.listdir(txt_path) \
              if os.path.isfile(os.path.join(txt_path, f)) and f.endswith(".xml")]
file_to_parse = files[1658] # Chosen arbitrarily


def parse_article(xml_str):
    """ Extracts data (headline, text and labels/codes) from an XML-formatted article into a dictionary (JSON-like) """
    xml_tree = ET.fromstring(xml_str) # xml tree
    res = {}
    for node in list(xml_tree):
        if node.tag=="headline":
            res["headline"] = node.text
        elif node.tag=="text":
            res["text"]=""
            # The text is included in paragraph subnodes <p>
            for paragraph_texts in node.itertext(): #entering <p>
                res["text"] += paragraph_texts
            res["text"] = res["text"].strip("\n") # Text is surrounded by \n, stripping them

        elif node.tag=="metadata":
            res["codes"] = []
            for codesNode in node.findall("codes"): #entering <codes>
                for codeNode in codesNode.findall("code"): #entering <code>
                    res["codes"].append(codeNode.attrib["code"])
    return res


with open(file_to_parse) as f:
    xml_str = f.read()
parsed = parse_article(xml_str)

In [7]:
print("FILE:", file_to_parse)
print("HEADLINE:", parsed["headline"])
print("TEXT:", parsed["text"])
print("LABELS:", parsed["codes"])
codes_meaning = []
for code in parsed["codes"]:
    codes_meaning.append(code_meaning_dict[code])
print("LABELS' MEANING:", codes_meaning)
print(xml_str, file=sys.stderr)

FILE: train/extracted/765168newsML.xml
HEADLINE: June new-home sales up, inventories lean.
TEXT: Brisk new-home sales in June pulled the supply of houses on the market to a four-year low, the Commerce Department said on Wednesday, as an expanding economy braced consumer demand.
Sales increased 6.1 percent in June to a seasonally adjusted annual rate of 819,000 units after a steeply revised 1 percent rise to 772,000 in May.
Previously, the department said May sales had jumped 7.1 percent to a much higher rate of 825,000. Commerce also revised down sales for March and April from levels that it had estimated earlier.
Analysts said widespread revisions, especially the big downward change in May sales, complicated interpetation of the monthly figures. But, they said the positive trend on a year-over-year basis combined with slim inventories increased chances that sales could set a new record this year.
"We are on target to shatter the record for home sales of 757,000 units set last year," s

NameError: name 'code_meaning_dict' is not defined

In [22]:
# Number of distinct codes across all code files
print(len(code_meaning_dict))

1362


In [11]:
%%time
# Save as a JSON file
# Very slow (~1 hour); run only once
import json
jsontab = []
for i, file in enumerate(files):
    # latin-1 handles some Spanish accented characters that crash with the default utf-8
    with open(file, encoding="latin-1") as f:
        xml_str = f.read()
    parsed = parse_article(xml_str)
    jsontab.append(parsed)
    if i % 1000 == 0:
        print(i,"/",len(files)-1)

with open("train/database.json", "w") as json_db:
    json.dump(jsontab, json_db)
del jsontab

0 / 299772
1000 / 299772
2000 / 299772
3000 / 299772
4000 / 299772
5000 / 299772
6000 / 299772
7000 / 299772
8000 / 299772
9000 / 299772
10000 / 299772
11000 / 299772
12000 / 299772
13000 / 299772
14000 / 299772
15000 / 299772
16000 / 299772
17000 / 299772
18000 / 299772
19000 / 299772
20000 / 299772
21000 / 299772
22000 / 299772
23000 / 299772
24000 / 299772
25000 / 299772
26000 / 299772
27000 / 299772
28000 / 299772
29000 / 299772
30000 / 299772
31000 / 299772
32000 / 299772
33000 / 299772
34000 / 299772
35000 / 299772
36000 / 299772
37000 / 299772
38000 / 299772
39000 / 299772
40000 / 299772
41000 / 299772
42000 / 299772
43000 / 299772
44000 / 299772
45000 / 299772
46000 / 299772
47000 / 299772
48000 / 299772
49000 / 299772
50000 / 299772
51000 / 299772
52000 / 299772
53000 / 299772
54000 / 299772
55000 / 299772
56000 / 299772
57000 / 299772
58000 / 299772
59000 / 299772
60000 / 299772
61000 / 299772
62000 / 299772
63000 / 299772
64000 / 299772
65000 / 299772
66000 / 299772
67000 / 

In [222]:
# Read JSON file
import json
with open("train/database.json") as jsondb_file:
    jsondb = json.load(jsondb_file)
#print(jsondb[1])

In [23]:
# Work on a smaller subset to make debugging easier
with open("train/database_mini.json", "w") as jsondb_file:
    for x in jsondb[:100]:
        jsondb_file.write(json.dumps(x)+"\n")
        # One JSON object per line (JSON Lines), the layout torchtext's TabularDataset expects
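
The same one-object-per-line file could also back the custom `Dataset` suggested in the hints. The class below is a minimal sketch in plain Python: it only implements `__len__` and `__getitem__`, which is the interface of a map-style `torch.utils.data.Dataset`; subclassing that class and converting text and codes to tensors is left out. The class name and the sample records are illustrative assumptions.

```python
import json
import os
import tempfile

class JsonLinesArticles:
    """Map-style dataset over a JSON Lines file: one article dict per line."""

    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            self.articles = [json.loads(line) for line in f if line.strip()]

    def __len__(self):
        return len(self.articles)

    def __getitem__(self, index):
        article = self.articles[index]
        return article["text"], article["codes"]

# Two made-up records, written in the same layout as database_mini.json.
records = [
    {"headline": "h1", "text": "first article", "codes": ["M13", "MCAT"]},
    {"headline": "h2", "text": "second article", "codes": []},
]
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    for record in records:
        tmp.write(json.dumps(record) + "\n")

dataset = JsonLinesArticles(tmp.name)
first_text, first_codes = dataset[0]
os.unlink(tmp.name)  # clean up the temporary file
```

Loading every line into memory is fine for the 100-article subset; for the full corpus a lazier scheme would probably be needed.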

In [47]:
# Init glove with 50dim
# ref : https://nlp.stanford.edu/projects/glove/
from torchtext import datasets, vocab

if torch.cuda.is_available():
    print('Using GPU.')
    device = torch.device('cuda')
else:
    print('Using CPU.')
    device = torch.device('cpu')
    
glove = vocab.GloVe(name='6B', dim=50)

.vector_cache/glove.6B.zip: 0.00B [00:00, ?B/s]

Using CPU.


.vector_cache/glove.6B.zip: 862MB [1:07:13, 214kB/s]                              
100%|█████████▉| 399419/400000 [00:30<00:00, 39025.27it/s]

In [223]:
import torchtext.data as data

TEXT = data.Field(sequential=True, tokenize=lambda x: x.split(), lower=True)
LABEL = data.Field(sequential=False, use_vocab=False)


# Headline to be added later on
datafields = {"text" : ("text", TEXT),
             "codes": ("codes", LABEL)
            }


#trn, vld = data.TabularDataset.splits(
#               path="train/", # the root directory where the data lies
#               train='database_mini.json',
#               validation="database_mini.json", # TODO
#               format='json',
#               skip_header=True, # if your csv header has a header, make sure to pass this to ensure it doesn't get proceesed as data!
#               fields=datafields)


trn = data.TabularDataset(path='./train/database_mini.json',
                          format="json",
                          skip_header=False,
                          fields=datafields)


TEXT.build_vocab(trn, vectors=glove)
LABEL.build_vocab(trn)

In [224]:
trn.examples[50].codes # trn dataset OK

['MEX', 'M13', 'M132', 'MCAT']

In [244]:
# here comes the trouble
# ref : https://github.com/pytorch/text/blob/master/torchtext/datasets/imdb.py 

train_iter = data.BucketIterator(
    dataset=trn,
    batch_size=4,
    sort_key=lambda x: len(x.text)
)

In [237]:
# same here
#train_iter = data.Iterator(
#     dataset=trn,
#    batch_size=4,
#    device =0,
#    sort_key=lambda x: len(x.text)
#)

The `device` argument should be set by using `torch.device` or passing a string as an argument. This behavior will be deprecated soon and currently defaults to cpu.


In [245]:
train_iter

<torchtext.data.iterator.BucketIterator at 0x7f310b2be630>

In [246]:
next(iter(train_iter)) # raises the ValueError below

ValueError: too many dimensions 'str'

In [241]:
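for batch in train_iter:
    inputs = batch.text.transpose(0, 1).to(device) # token indices, batch first; not named `data` to avoid shadowing torchtext.data
    target = batch.codes.to(device) # labels (the field is called `codes`, not `label`)
print(batch)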

ValueError: too many dimensions 'str'
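
A plausible cause of the `ValueError` above is that `LABEL` is declared with `use_vocab=False`, so torchtext tries to build a tensor directly from the lists of code strings. One way around it, sketched below under that assumption, is to convert each list of codes into a fixed-length 0/1 vector before batching. The helper names and the three-code label list are illustrative; the real label list would come from the code files.

```python
def build_label_index(all_codes):
    """Assign each code a fixed column, using sorted order for reproducibility."""
    return {code: i for i, code in enumerate(sorted(all_codes))}

def to_multi_hot(article_codes, label_index):
    """Return a 0/1 list with a 1 in the column of every code of the article.

    Codes missing from label_index are ignored.
    """
    vector = [0] * len(label_index)
    for code in article_codes:
        if code in label_index:
            vector[label_index[code]] = 1
    return vector

label_index = build_label_index(["MCAT", "C18", "M13"])  # illustrative subset
with_labels = to_multi_hot(["M13", "MCAT"], label_index)
without_labels = to_multi_hot([], label_index)
```

Numeric vectors like these can be turned into tensors, and an article without any code simply maps to the all-zero vector.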

### Neural network stuff (todo)

In [9]:
# Note: data pre-processing and the neural network could live in two separate notebook
# files, as they will share few variables

## Save your model

It might be useful to save your model if you want to continue your work later, or use it for inference later.

In [2]:
torch.save(model.state_dict(), 'model.pkl')

NameError: name 'model' is not defined

The model file should now be visible in the "Home" screen of the Jupyter Notebook interface.  There you should be able to select it and press "download".

## Download test set

The test set will be made available during the last week before the deadline and can be downloaded in the same way as the training set.

## Predict for test set

You will be asked to return your predictions for a separate test set.  These should be returned as a matrix with one row for each test article.  Each row contains a binary prediction for each label: 1 if the label applies to the article, and 0 if not. The order of the labels is the order of the label (topic) codes.

An example row could look like this if your system predicts the presence of the second and fourth topic:

    0 1 0 1 0 0 0 0 0 0 0 0 0 0 ...
    
If you have the matrix prepared in `y` you can use the following function to save it to a text file.

In [None]:
import numpy as np
np.savetxt('results.txt', y, fmt='%d')
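
How `y` is built is up to you. As one hedged sketch, per-label probabilities can be thresholded into 0/1 rows; the 0.5 threshold and the probability values below are arbitrary placeholders, and the columns are assumed to already follow the order of the topic codes.

```python
def probabilities_to_rows(probabilities, threshold=0.5):
    """Turn per-label probabilities into binary rows, one row per article."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]

# Made-up probabilities for two articles and four labels.
probabilities = [
    [0.10, 0.90, 0.20, 0.70],
    [0.05, 0.30, 0.10, 0.40],  # no label passes the threshold
]
y = probabilities_to_rows(probabilities)
```

A nested list of this shape can be passed to `np.savetxt` in place of an array.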