# DATA20001 Deep Learning - Group Project
## Text project

**Due Thursday, December 13, before 23:59.**

The task is to learn to assign the correct labels to news articles.  The corpus contains ~850K articles from Reuters.  The test set is about 10% of the articles. The data is provided as XML files inside zip archives, which you have to extract yourselves.

We're only giving you the code for downloading the data, and how to save the final model. The rest you'll have to do yourselves.

Some comments and hints particular to the project:

- One document may belong to many classes in this problem, i.e., it's a multi-label classification problem. In fact there are documents that don't belong to any class, and you should also be able to handle these correctly. Pay careful attention to how you design the outputs of the network (e.g., what activation to use) and what loss function should be used.
- You may use word embeddings to get better results. For example, you were already using a smaller version of the GloVe embeddings in exercise 4. Note that these embeddings take a lot of memory.
- In the exercises we used e.g., `torchvision.datasets.MNIST` to handle the loading of the data in suitable batches. Here, you need to handle the dataloading yourself.  The easiest way is probably to create a custom `Dataset`. [See for example here for a tutorial](https://github.com/utkuozbulak/pytorch-custom-dataset-examples).
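
The hints above can be sketched as follows: a minimal custom `Dataset` that returns a multi-hot target per document, combined with one output logit per label and `torch.nn.BCEWithLogitsLoss` (a sigmoid followed by binary cross-entropy). The class name `ArticleDataset`, the toy feature vectors and all sizes are illustrative assumptions, not part of the provided material.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader


class ArticleDataset(Dataset):
    """Hypothetical dataset: pre-computed feature vectors plus label index lists."""

    def __init__(self, features, label_indices, num_labels):
        self.features = features            # list of equal-length float lists
        self.label_indices = label_indices  # list of label index lists (may be empty)
        self.num_labels = num_labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        x = torch.tensor(self.features[idx], dtype=torch.float32)
        # Multi-hot target: an all-zero row represents a document with no label.
        y = torch.zeros(self.num_labels, dtype=torch.float32)
        if self.label_indices[idx]:
            y[self.label_indices[idx]] = 1.0
        return x, y


# Toy data (assumption): 4 documents, 5 features, 3 possible labels.
features = [[0.1, 0.0, 0.3, 0.0, 1.0],
            [0.0, 0.2, 0.0, 0.5, 0.0],
            [1.0, 0.0, 0.0, 0.0, 0.0],
            [0.0, 0.0, 0.0, 0.0, 0.0]]
label_indices = [[0, 2], [1], [], [0]]

dataset = ArticleDataset(features, label_indices, num_labels=3)
loader = DataLoader(dataset, batch_size=2, shuffle=False)

model = nn.Linear(5, 3)             # one logit per label, no softmax
criterion = nn.BCEWithLogitsLoss()  # applies the sigmoid internally

for x_batch, y_batch in loader:
    logits = model(x_batch)
    loss = criterion(logits, y_batch)
    # At inference time, threshold each label's probability independently.
    predictions = (torch.sigmoid(logits) > 0.5).int()
```

Because every label gets its own sigmoid, a document can receive several labels or none at all, which a softmax output could not express.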

## Download the data

In [2]:
import os
import sys
import torch
from torchvision.datasets.utils import download_url
import zipfile

In [11]:
train_path = 'train/'

dl_file='reuters.zip'
dl_url='https://www.cs.helsinki.fi/u/jgpyykko/'
zip_path = os.path.join(train_path, dl_file)
if not os.path.isfile(zip_path):
    download_url(dl_url + dl_file, root=train_path, filename=dl_file, md5=None)
with zipfile.ZipFile(zip_path) as zip_f:
    zip_f.extractall(train_path)
os.unlink(zip_path)  # delete the archive only after it has been closed

Downloading https://www.cs.helsinki.fi/u/jgpyykko/reuters.zip to train/reuters.zip
None


The above command downloads and extracts the data files into the `train` subdirectory.

The files can be found in `train/`, and are named as `19970405.zip`, etc. You will have to manage the content of these zips to get the data. There is a readme which has links to further descriptions on the data.
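
One way to manage the nested archives is to read the XML members straight from each zip without unpacking them to disk. The sketch below builds a tiny in-memory archive with an invented member name and article so that it runs on its own. The helper name `iter_xml_members` is hypothetical, and with the real data you would pass the path of one of the daily zip files instead of the buffer.

```python
import io
import zipfile
from xml.etree import ElementTree as ET


def iter_xml_members(zip_source):
    """Yield (member name, parsed XML root) for every .xml file in a zip archive."""
    with zipfile.ZipFile(zip_source) as zf:
        for name in zf.namelist():
            if name.endswith(".xml"):
                # Passing bytes lets the parser honour the encoding declared in the file.
                yield name, ET.fromstring(zf.read(name))


# Stand-in archive (assumption): one invented article in a similar overall shape.
sample_xml = (
    b'<?xml version="1.0" encoding="iso-8859-1" ?>'
    b'<newsitem itemid="1"><headline>Sample headline.</headline>'
    b'<text><p>First paragraph.</p></text></newsitem>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("1newsML.xml", sample_xml)
buffer.seek(0)

for name, root in iter_xml_members(buffer):
    print(name, "->", root.findtext("headline"))
```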

The class labels, or topics, can be found in the archive `train/codes.zip`.  The zip contains a file called `topic_codes.txt`, which lists the topic codes (about 130 of them) together with an explanation of what each code means.

The XML document files contain the article's headline, the main body text, and the list of topic labels assigned to each article.  You will have to extract the topics of each article from the XML.  For example: 
`<code code="C18">` refers to the topic "OWNERSHIP CHANGES" (like a corporate buyout).

You should pre-process the XML to extract the words from the article: the `<headline>` element and the `<text>` element.  You should not need any other parts of the article.
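
Once the headline and body have been extracted, they still have to be split into words. A minimal tokenizer and vocabulary sketch is shown below. The lowercasing, the regular expression and the names `tokenize` and `build_vocab` are assumptions chosen for illustration; a real pipeline may want to treat numbers, punctuation and rare words differently.

```python
import re
from collections import Counter

# Word-like tokens; inner hyphens and apostrophes are kept (e.g. "new-home").
TOKEN_RE = re.compile(r"[a-z0-9]+(?:[-'][a-z0-9]+)*")


def tokenize(text):
    """Lowercase the text and split it into word-like tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocab(token_lists, min_count=1):
    """Map each sufficiently frequent token to an integer id; 0 is reserved for unknown words."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    kept = sorted(tok for tok, count in counts.items() if count >= min_count)
    return {tok: i + 1 for i, tok in enumerate(kept)}


headline = "June new-home sales up, inventories lean."
tokens = tokenize(headline)
vocab = build_vocab([tokens])
ids = [vocab.get(tok, 0) for tok in tokens]
print(tokens)
print(ids)
```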

## Your stuff goes here ...

### Dataset extraction

In [12]:
import os
zip_path = "train/REUTERS_CORPUS_2/"
txt_path = "train/extracted/"

In [15]:
# Extracting the zips issued from the dataset file
zip_files =  [zip_path+f for f in os.listdir(zip_path) \
              if os.path.isfile(os.path.join(zip_path, f))\
              and f.endswith(".zip")]
zip_data  = [f for f in zip_files\
             if "codes" not in f 
             and "dtds" not in f] # Training data is every zip except codes.zip and dtds.zip

print("Extracting train data")
for i, zfpath in enumerate(zip_data):
    print("..Extracting ", zfpath, "\t(", i+1, "/", len(zip_data),")")
    with zipfile.ZipFile(zfpath) as zf:
        zf.extractall(txt_path)
# Exporting codes

print("Extracting train labels")
with zipfile.ZipFile(zip_path+"codes.zip") as zf:
    zf.extractall(txt_path+"labels/")

Extracting train data
..Extracting  train/REUTERS_CORPUS_2/19970728.zip 	( 1 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970404.zip 	( 2 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970415.zip 	( 3 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970412.zip 	( 4 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970529.zip 	( 5 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970805.zip 	( 6 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970730.zip 	( 7 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970604.zip 	( 8 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970725.zip 	( 9 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970710.zip 	( 10 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970414.zip 	( 11 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970705.zip 	( 12 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970630.zip 	( 13 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970626.zip 	( 14 / 127 )
..Extracting  train/REUTERS_CORPUS_2/19970806.zip 	( 15 / 127 )
..Extracting  train/REUTERS

### Dataset parsing

In [124]:
# Here goes your code ...
#from xml.dom import minidom
#files = [txt_path+f for f in os.listdir(txt_path) \
#              if os.path.isfile(os.path.join(txt_path, f))]
#print(len(files))
#print(files[1])

#file_to_parse = files[1]
##with open(files[0], "r") as f:
##    content = f.read()
##print(content)
#xmldoc = minidom.parse(file_to_parse)
#for node in xmldoc.getElementsByTagName('code'):
#    print("LABEL:", node.attributes.items()[0][1])
#textnode = xmldoc.getElementsByTagName('text')
#hlnode = xmldoc.getElementsByTagName('headline')
##dir(hlnode[0])
#headline = hlnode[0].firstChild.toxml()
#print("HEADLINE:", headline)
#print(dir(textnode[0]))
##print("TEXT:", textnode[0].toxml())
#def extract_text(graph, acc):
#    # Depth first search to extract text from xml tree
#    for children in graph.childNodes:
#        pass
#    return acc
#print(textnode[0].toxml())
#for c in textnode[0].childNodes:
#    #print(c.toxml())
#    print(c.childNodes.toxml())

In [16]:
# Build a dictionary that links code labels to their actual meanings.
# For example, the code I50100 stands for "CONSTRUCTION OF BUILDINGS".
# Codes are the labels we will try to predict.
codes_path = txt_path+"labels/"
code_files = [codes_path+f for f in os.listdir(codes_path) \
              if os.path.isfile(os.path.join(codes_path, f))
              and not f.startswith("readme")]

code_meaning_dict={} # Dictionary linking a code label to its meaning.
for code_file in code_files:
    with open(code_file) as f:
        tmp = f.read().split("\n")

    for line in tmp:
        if not line.startswith(";") and line!="": # Dropping comments and empty lines
            code_tuple = line.split("\t") # [Code, meaning]
            if code_tuple[0] in code_meaning_dict:
                print("COLLISION!", file=sys.stderr) # Warn if a code appears twice
                                                     # (across the different code files)
                # No collision observed, so it is safe to use a single dictionary for all codes
            code_meaning_dict[code_tuple[0]] = code_tuple[1]

In [23]:
from xml.etree import ElementTree as ET
# ElementTree is much faster than xml.dom (about 100 times)
# because it is backed by a C implementation. cElementTree is deprecated; use ElementTree instead.
# ref : http://effbot.org/zone/celementtree.htm
#       https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree

files = [txt_path+f for f in os.listdir(txt_path) \
              if os.path.isfile(os.path.join(txt_path, f))]
file_to_parse = files[1658] # Chosen arbitrarily


def parse_article(xml_str):
    """ Extracts data (headline, text and labels/codes) from an XML-formatted article into a dictionary (JSON-like) """
    xml_tree = ET.fromstring(xml_str) # xml tree
    res = {}
    for node in list(xml_tree):
        if node.tag=="headline":
            res["headline"] = node.text
        elif node.tag=="text":
            res["text"]=""
            # The text is included in paragraph subnodes <p>
            for paragraph_texts in node.itertext(): #entering <p>
                res["text"] += paragraph_texts
            res["text"] = res["text"].strip("\n") # Text is surrounded by \n, stripping them

        elif node.tag=="metadata":
            res["codes"] = []
            for codesNode in node.findall("codes"): #entering <codes>
                for codeNode in codesNode.findall("code"): #entering <code>
                    res["codes"].append(codeNode.attrib["code"])
    return res

file_to_parse = files[1658]

with open(file_to_parse) as f:
    xml_str = f.read()
parsed = parse_article(xml_str)

print("FILE:", file_to_parse)
print("HEADLINE:", parsed["headline"])
print("TEXT:", parsed["text"])
print("LABELS:", parsed["codes"])
codes_meaning = []
for code in parsed["codes"]:
    codes_meaning.append(code_meaning_dict[code])
print("LABELS' MEANING:", codes_meaning)
print(xml_str, file=sys.stderr)

FILE: train/extracted/765168newsML.xml
HEADLINE: June new-home sales up, inventories lean.
TEXT: Brisk new-home sales in June pulled the supply of houses on the market to a four-year low, the Commerce Department said on Wednesday, as an expanding economy braced consumer demand.
Sales increased 6.1 percent in June to a seasonally adjusted annual rate of 819,000 units after a steeply revised 1 percent rise to 772,000 in May.
Previously, the department said May sales had jumped 7.1 percent to a much higher rate of 825,000. Commerce also revised down sales for March and April from levels that it had estimated earlier.
Analysts said widespread revisions, especially the big downward change in May sales, complicated interpetation of the monthly figures. But, they said the positive trend on a year-over-year basis combined with slim inventories increased chances that sales could set a new record this year.
"We are on target to shatter the record for home sales of 757,000 units set last year," s

<?xml version="1.0" encoding="iso-8859-1" ?>
<newsitem itemid="765168" id="root" date="1997-07-30" xml:lang="en">
<title>USA: June new-home sales up, inventories lean.</title>
<headline>June new-home sales up, inventories lean.</headline>
<byline>Glenn Somerville</byline>
<dateline>WASHINGTON 1997-07-30</dateline>
<text>
<p>Brisk new-home sales in June pulled the supply of houses on the market to a four-year low, the Commerce Department said on Wednesday, as an expanding economy braced consumer demand.</p>
<p>Sales increased 6.1 percent in June to a seasonally adjusted annual rate of 819,000 units after a steeply revised 1 percent rise to 772,000 in May.</p>
<p>Previously, the department said May sales had jumped 7.1 percent to a much higher rate of 825,000. Commerce also revised down sales for March and April from levels that it had estimated earlier.</p>
<p>Analysts said widespread revisions, especially the big downward change in May sales, complicated interpetation of the monthly fi

In [22]:
# Number of different labels
print(len(code_meaning_dict.keys())) 

1362
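
The dictionary above merges every code file, which may explain why it is much larger than the roughly 130 topics mentioned earlier. For training, each article's code list has to become a fixed-length 0/1 vector with a stable label order. The sketch below assumes that only a chosen set of codes are prediction targets and that sorting them defines the column order. The helper names `build_label_index` and `encode_labels` and the three-code target list are illustrative, not taken from the data.

```python
def build_label_index(target_codes):
    """Assign a column index to every target code, in sorted order."""
    return {code: i for i, code in enumerate(sorted(target_codes))}


def encode_labels(article_codes, label_index):
    """Return a multi-hot list; codes outside the target set are ignored."""
    row = [0] * len(label_index)
    for code in article_codes:
        if code in label_index:
            row[label_index[code]] = 1
    return row


# "C18" appears in the task description; "X01" and "X02" are made-up placeholders.
label_index = build_label_index(["X02", "C18", "X01"])

print(encode_labels(["C18", "I50100"], label_index))  # I50100 is not a target here
print(encode_labels([], label_index))                 # article without any target label
```

With the real data, the target list would presumably be the codes read from `topic_codes.txt`.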


### Neural network stuff (todo)

In [9]:
# Note: data pre-processing and the neural network code could be split into two separate
# notebook files, as they will share few variables

## Save your model

It might be useful to save your model if you want to continue your work later, or use it for inference later.

In [2]:
torch.save(model.state_dict(), 'model.pkl')

NameError: name 'model' is not defined

The model file should now be visible in the "Home" screen of the jupyter notebooks interface.  There you should be able to select it and press "download".
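
To continue working or to run inference later, the saved state dict has to be loaded into a model with the same architecture. A minimal sketch, assuming a stand-in `nn.Linear` model and a temporary file in place of your own network and `model.pkl`:

```python
import os
import tempfile

import torch
from torch import nn

# Stand-in architecture (assumption); use your own model class here.
model = nn.Linear(5, 3)

path = os.path.join(tempfile.mkdtemp(), "model.pkl")
torch.save(model.state_dict(), path)

# Later: rebuild the same architecture, then load the saved weights into it.
restored = nn.Linear(5, 3)
restored.load_state_dict(torch.load(path))
restored.eval()  # switch layers such as dropout to inference behaviour
```

Only the parameters are stored this way, so the code that defines the model class must still be available when loading.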

## Download test set

The test set will be made available during the last week before the deadline and can be downloaded in the same way as the training set.

## Predict for test set

You will be asked to return your predictions for a separate test set.  These should be returned as a matrix with one row for each test article.  Each row contains a binary prediction for each label: 1 if the label applies to the article, and 0 if not. The order of the labels is the order of the label (topic) codes.

An example row could look like this if your system predicts the presence of the second and fourth topic:

    0 1 0 1 0 0 0 0 0 0 0 0 0 0 ...
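
One possible way to produce such rows, assuming the network outputs one logit per label, is to apply a sigmoid and threshold each label independently. The function name `logits_to_binary`, the 0.5 threshold and the logits below are made up for illustration; the threshold in particular is something you may want to tune on validation data.

```python
import numpy as np


def logits_to_binary(logits, threshold=0.5):
    """Apply a sigmoid to each logit and threshold it independently per label."""
    probabilities = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return (probabilities >= threshold).astype(int)


# Made-up logits for 2 articles and 4 labels.
logits = [[-2.0, 1.5, -0.3, 3.0],
          [-1.0, -4.0, -0.2, -2.5]]
y = logits_to_binary(logits)
print(y)
```

The second row comes out as all zeros, which is how an article without any label is represented.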
    
If you have the matrix prepared in `y` you can use the following function to save it to a text file.

In [None]:
import numpy as np
np.savetxt('results.txt', y, fmt='%d')