# DATA20001 Deep Learning - Group Project
## Text project

**Due Thursday, December 13, before 23:59.**

The task is to learn to assign the correct labels to news articles. The corpus contains ~850K articles from Reuters. The test set is about 10% of the articles. The data is provided as raw XML files, so you have to extract the text and labels yourselves.

We're only giving you the code for downloading the data, and how to save the final model. The rest you'll have to do yourselves.

Some comments and hints particular to the project:

- One document may belong to many classes in this problem, i.e., it's a multi-label classification problem. In fact there are documents that don't belong to any class, and you should also be able to handle these correctly. Pay careful attention to how you design the outputs of the network (e.g., what activation to use) and what loss function should be used.
- You may use word embeddings to get better results. For example, you already used a smaller version of the GloVe embeddings in exercise 4. Do note that these embeddings take a lot of memory.
- In the exercises we used e.g., `torchvision.datasets.MNIST` to handle the loading of the data in suitable batches. Here, you need to handle the dataloading yourself.  The easiest way is probably to create a custom `Dataset`. [See for example here for a tutorial](https://github.com/utkuozbulak/pytorch-custom-dataset-examples).
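The hints above (multi-label outputs, a matching loss, and a custom `Dataset`) can be sketched together as follows. This is a minimal illustration and not the intended solution: the class name `ArticleDataset`, the toy feature vectors, the three-code label set and the linear model are all assumptions made for the example. The idea shown is one independent sigmoid output per topic, trained with `nn.BCEWithLogitsLoss`, so an article with no topics simply gets an all-zero target.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader


class ArticleDataset(Dataset):
    """Hypothetical dataset: pre-computed feature vectors plus topic-code lists."""

    def __init__(self, features, code_lists, all_codes):
        self.features = features          # list of equal-length float lists
        self.code_lists = code_lists      # list of lists of topic codes
        self.code_to_idx = {c: i for i, c in enumerate(all_codes)}

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        x = torch.tensor(self.features[idx], dtype=torch.float32)
        # Multi-hot target: one slot per topic, all zeros if the article has no topic.
        y = torch.zeros(len(self.code_to_idx), dtype=torch.float32)
        for code in self.code_lists[idx]:
            y[self.code_to_idx[code]] = 1.0
        return x, y


all_codes = ["C18", "G15", "GCAT"]        # toy label set (assumption)
features = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.9, 0.8, 0.7, 0.6]]
code_lists = [["C18"], ["G15", "GCAT"], []]   # the last article has no topic

dataset = ArticleDataset(features, code_lists, all_codes)
loader = DataLoader(dataset, batch_size=2, shuffle=False)

model = nn.Linear(4, len(all_codes))      # outputs raw logits, no softmax
criterion = nn.BCEWithLogitsLoss()        # sigmoid + binary cross-entropy per label

for x_batch, y_batch in loader:
    logits = model(x_batch)
    loss = criterion(logits, y_batch)
    # At prediction time, threshold the per-label sigmoid probabilities.
    predictions = (torch.sigmoid(logits) > 0.5).int()
```

A softmax output would force the topic probabilities to sum to one, which does not fit articles with several topics or with none. The 0.5 threshold is only a starting point and may be worth tuning on held-out data.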

## Download the data

In [1]:
import os
import torch
from torchvision.datasets.utils import download_url
import zipfile

train_path = 'train/'

dl_file='reuters.zip'
dl_url='https://www.cs.helsinki.fi/u/jgpyykko/'
zip_path = os.path.join(train_path, dl_file)
if not os.path.isfile(zip_path):
    download_url(dl_url + dl_file, root=train_path, filename=dl_file, md5=None)

with zipfile.ZipFile(zip_path) as zip_f:
    zip_f.extractall(train_path)
    #os.unlink(zip_path)

The code above downloads the data and extracts it into the `train` subdirectory.

The files can be found in `train/`, and are named as `19970405.zip`, etc. You will have to manage the content of these zips to get the data. There is a readme which has links to further descriptions on the data.
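One way to handle the zip contents without unpacking everything to disk is to read the XML members directly from each archive. The sketch below assumes only that the articles are members whose names end in `.xml`. The helper name `iter_xml_members` and the tiny in-memory archive are made up for the illustration.

```python
import io
import zipfile


def iter_xml_members(zip_file):
    """Yield (member name, raw bytes) for every XML file inside a zip archive."""
    with zipfile.ZipFile(zip_file) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".xml"):
                yield name, zf.read(name)


# Build a tiny in-memory archive as a stand-in for one of the daily zips.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("1newsML.xml", "<newsitem><headline>A</headline></newsitem>")
    zf.writestr("2newsML.xml", "<newsitem><headline>B</headline></newsitem>")
    zf.writestr("notes.txt", "not an article")
buffer.seek(0)

members = list(iter_xml_members(buffer))
print([name for name, _ in members])
```

With the real data, a path such as `train/19970405.zip` would be passed instead of the in-memory buffer.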

The class labels, or topics, can be found in the archive `train/codes.zip`. The zip contains a file called "topic_codes.txt". This file lists the special codes for the topics (about 130 of them), together with an explanation of what each code means.

The XML document files contain the article's headline, the main body text, and the list of topic labels assigned to each article.  You will have to extract the topics of each article from the XML.  For example, `<code code="C18">` refers to the topic "OWNERSHIP CHANGES" (like a corporate buyout).

You should pre-process the XML to extract the words from the article: the `<headline>` element and the `<text>` element.  You should not need any other parts of the article.
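As a rough sketch of this pre-processing step, the snippet below pulls the headline, body text and topic codes out of one article using the standard library's `xml.etree.ElementTree`. The `SAMPLE` string is a hand-written stand-in, not a real corpus file: its layout (a `<newsitem>` root, `<p>` paragraphs inside `<text>`, and a `<codes class="bip:topics:1.0">` group) is an assumption about the format. The function name `parse_article` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the real files may differ in detail (assumption).
SAMPLE = """<newsitem itemid="1">
  <headline>Example headline.</headline>
  <text>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </text>
  <metadata>
    <codes class="bip:countries:1.0"><code code="EEC"/></codes>
    <codes class="bip:topics:1.0">
      <code code="G15"/>
      <code code="GCAT"/>
    </codes>
  </metadata>
</newsitem>
"""


def parse_article(xml_string):
    """Return headline, whitespace-normalised body text and topic codes."""
    root = ET.fromstring(xml_string)
    headline = root.findtext("headline", default="").strip()
    text_el = root.find("text")
    body = " ".join("".join(text_el.itertext()).split()) if text_el is not None else ""
    # Keep only the topic codes, not the region or industry codes.
    topics = [
        code.get("code")
        for codes in root.iter("codes")
        if codes.get("class") == "bip:topics:1.0"
        for code in codes.findall("code")
    ]
    return {"headline": headline, "text": body, "codes": topics}


article = parse_article(SAMPLE)
print(article)
```

An article without a topic group simply yields an empty `codes` list, which matches the "no class" case mentioned in the hints.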

## Your stuff goes here ...

In [2]:
from bs4 import BeautifulSoup
import pandas as pd

In [3]:
# example http://www2.hawaii.edu/~takebaya/cent110/xml_parse/xml_parse.html
from bs4 import BeautifulSoup
# a single xml file was unzipped for testing
with open("train/477886newsML.xml", "r") as infile:
    contents = infile.read()
soup = BeautifulSoup(contents, 'lxml')  # use the 'lxml' parser, as the 'xml' parser returned an empty list

headline = soup.find('headline')
print(headline.get_text())

text = soup.find('text')
print(text.get_text())

REUTER EC REPORT LONG-TERM DIARY FOR APR 7 - DEC 31, 1997.

****
HIGHLIGHTS
****
AMSTERDAM - The Netherlands hosts summit of European Union leaders (June 16-17).
MADRID - NATO holds summit to set the course for enlargement (July 8 and 9).
LUXEMBOURG - Luxembourg hosts summit of European Union leaders (December 12-13).
APRIL
BRUSSELS (MODIFIED ITEM) - Conference of Bosnian donor countries originally scheduled for April has been POSTPONED to an unspecified date before June.
MONDAY, APRIL 7
NOORDWIJK, Netherlands (NEW ITEM) - EU foreign ministers hold conclave on the inter-governmental conference (second of two days).
NOORDWIJK, Netherlands (EXPANDED ITEM) - EU-Rio Group meeting involving South American countries and Mexico, Panama, Trinidad, Tobago and Costa Rica (To April 8). Honduras, Guyana and chairmen of the Organisation of American States, the InterAmerican Development Bank, Latin American Parliament, the Institute for European-Latin American Relations, the European Investment Bank

In [4]:
codes = soup.find_all('code')
for code in codes:
    print(code)

<code code="EEC">
<editdetail action="confirmed" attribution="Reuters BIP Coding Group" date="1997-04-01"></editdetail>
</code>
<code code="G15">
<editdetail action="confirmed" attribution="Reuters BIP Coding Group" date="1997-04-01"></editdetail>
</code>
<code code="GCAT">
<editdetail action="confirmed" attribution="Reuters BIP Coding Group" date="1997-04-01"></editdetail>
</code>


The file topic_codes.txt inside codes.zip defines these codes as follows:

G15 EUROPEAN COMMUNITY 

GCAT	GOVERNMENT/SOCIAL

EEC is found in region_codes.txt: EEC	EUROPEAN UNION

In [5]:
# only take codes by topic, not region or industry
codes = soup.find_all('codes', class_='bip:topics:1.0')
for code in codes:
    print(code)

<codes class="bip:topics:1.0">
<code code="G15">
<editdetail action="confirmed" attribution="Reuters BIP Coding Group" date="1997-04-01"></editdetail>
</code>
<code code="GCAT">
<editdetail action="confirmed" attribution="Reuters BIP Coding Group" date="1997-04-01"></editdetail>
</code>
</codes>


In [6]:
# extract all topic-classes 
# only take "codes" by topic, not region or industry: class == 'bip:topics:1.0'
# example
# <codes class="bip:topics:1.0">
# <code code="G15"> ... </code>
# <code code="GCAT"> ... </code>
# </codes>
for element in soup.find_all('codes', class_='bip:topics:1.0'):
    for code in element.find_all('code'):
        clas = code['code']
        print(clas)

G15
GCAT


In [7]:
# get the list of *.zip files in dir that contain xml-files (their names start with 1)
dirpath = 'train/REUTERS_CORPUS_2/'
files = [f for f in os.listdir(dirpath) if os.path.isfile(os.path.join(dirpath, f))]
# cut out codes.zip, readme.txt etc. All zips containing .xml start with 1
filenames_zip = [f for f in files if f.startswith('1') and f.endswith('.zip')]
print(len(filenames_zip))
print(filenames_zip[0:4])

127
['19970722.zip', '19970508.zip', '19970421.zip', '19970612.zip']


In [8]:
# get xml-filenames inside a single zip-file
mypath = 'train/REUTERS_CORPUS_2/'
file = '19970722.zip'
zf = zipfile.ZipFile(mypath+file, 'r')

# get names of all xml-files within a zip
for name in zf.namelist():    
    print(name)    
    #f = zf.open(name)
    #print(f.read()) 

744587newsML.xml
744588newsML.xml
744589newsML.xml
744590newsML.xml
744591newsML.xml
744592newsML.xml
744593newsML.xml
744594newsML.xml
744595newsML.xml
744596newsML.xml
744597newsML.xml
744598newsML.xml
744599newsML.xml
744600newsML.xml
744601newsML.xml
744602newsML.xml
744603newsML.xml
744604newsML.xml
744605newsML.xml
744606newsML.xml
744607newsML.xml
744608newsML.xml
744609newsML.xml
744610newsML.xml
744611newsML.xml
744612newsML.xml
744613newsML.xml
744614newsML.xml
744615newsML.xml
744616newsML.xml
744617newsML.xml
744618newsML.xml
744619newsML.xml
744620newsML.xml
744621newsML.xml
744622newsML.xml
744623newsML.xml
744624newsML.xml
744625newsML.xml
744626newsML.xml
744627newsML.xml
744628newsML.xml
744629newsML.xml
744630newsML.xml
744631newsML.xml
744632newsML.xml
744633newsML.xml
744634newsML.xml
744635newsML.xml
744636newsML.xml
744637newsML.xml
744638newsML.xml
744639newsML.xml
744640newsML.xml
744641newsML.xml
744642newsML.xml
744643newsML.xml
744644newsML.xml
744645newsML.x

In [9]:
def read_one_zipfile(filepath):
    '''
    Read and parse the contents of a single zipfile of xml-files.
    Fields: headline, text, codes.
    Returns them as a list of dicts, one per article.
    '''
    this_documents = []

    with zipfile.ZipFile(filepath, 'r') as zf:
        # for all xml-files within a zip
        for name in zf.namelist():
            if not name.endswith('xml'):
                continue

            contents = zf.read(name)
            soup = BeautifulSoup(contents, 'lxml')

            headline = soup.find('headline')
            text = soup.find('text')

            # extract all topic-classes
            # only take "codes" by topic, not region or industry: class == 'bip:topics:1.0'
            classcodes = []
            for element in soup.find_all('codes', class_='bip:topics:1.0'):
                for code in element.find_all('code'):
                    classcodes.append(code['code'])

            # fall back to an empty string if an element is missing
            this_documents.append({
                'headline': headline.get_text() if headline is not None else '',
                'text': text.get_text() if text is not None else '',
                'codes': classcodes,
            })
    return this_documents

In [10]:
# Read single zipfiles contents
mypath = 'train/REUTERS_CORPUS_2/'
documents= []    

file = '19970722.zip'

documents.extend( read_one_zipfile(mypath+file) )
len(documents)

3426

In [11]:
data_small = pd.DataFrame(documents)
data_small[0:5]

Unnamed: 0,codes,headline,text
0,"[C18, C181, CCAT]",Eureko is latest suitor for French insurer GAN.,"\nEureko, an alliance of six European financia..."
1,"[G15, GCAT]",Reuter EC Report Long-Term Diary for July 28 -...,\n****\nHIGHLIGHTS\n****\nLUXEMBOURG - Luxembo...
2,"[G15, GCAT]",Official Journal contents - OJ L 190 of July 1...,\n*\n(Note - contents are displayed in reverse...
3,"[G15, GCAT]",Official Journal contents - OJ C 221 of July 1...,\n*\n(Note - contents are displayed in reverse...
4,"[G15, GCAT]",Official Journal contents - OJ C 220 of July 1...,\n*\n(Note - contents are displayed in reverse...


In [38]:
# optional - faster test with cutting list to 2 zipfiles
#filenames_zip = filenames_zip[0:2]
#filenames_zip

['19970722.zip', '19970508.zip']

In [None]:
# CAN TAKE ABOUT 30-60 MIN!
# Read all zipfiles
mypath = 'train/REUTERS_CORPUS_2/'
documents= []

for file in filenames_zip:
    documents.extend( read_one_zipfile(mypath+file) )
len(documents)    

In [47]:
data = pd.DataFrame(documents)
data[0:5]

Unnamed: 0,codes,headline,text
0,"[C18, C181, CCAT]",Eureko is latest suitor for French insurer GAN.,"\nEureko, an alliance of six European financia..."
1,"[G15, GCAT]",Reuter EC Report Long-Term Diary for July 28 -...,\n****\nHIGHLIGHTS\n****\nLUXEMBOURG - Luxembo...
2,"[G15, GCAT]",Official Journal contents - OJ L 190 of July 1...,\n*\n(Note - contents are displayed in reverse...
3,"[G15, GCAT]",Official Journal contents - OJ C 221 of July 1...,\n*\n(Note - contents are displayed in reverse...
4,"[G15, GCAT]",Official Journal contents - OJ C 220 of July 1...,\n*\n(Note - contents are displayed in reverse...


In [12]:
# Save small example-table to pickle
data_small.to_pickle('input/reuters_small.pkl')

In [13]:
# load small
reuters = pd.read_pickle('input/reuters_small.pkl')

In [18]:
# Save large table to pickle
data.to_pickle('input/reuters_all.pkl')

In [19]:
# load large
reuters = pd.read_pickle('input/reuters_all.pkl')

In [21]:
len(reuters)

299773

### Read the CLASS codes

In [32]:
# read one of the Class-code files
import pandas as pd
import zipfile

zf = zipfile.ZipFile('train/REUTERS_CORPUS_2/codes.zip', 'r') 
colnames=['Code','Description']
df = pd.read_csv(zf.open('topic_codes.txt'), skiprows=2,
                 header=None, names=colnames, sep='\t')

df # (the file has 2 first rows as CODE/DESCRIPTION, the extra line is still at row 0)

Unnamed: 0,Code,Description
0,1POL,CURRENT NEWS - POLITICS
1,2ECO,CURRENT NEWS - ECONOMICS
2,3SPO,CURRENT NEWS - SPORT
3,4GEN,CURRENT NEWS - GENERAL
4,6INS,CURRENT NEWS - INSURANCE
5,7RSK,CURRENT NEWS - RISK NEWS
6,8YDB,TEMPORARY
7,9BNX,TEMPORARY
8,ADS10,CURRENT NEWS - ADVERTISING
9,BNW14,CURRENT NEWS - BUSINESS NEWS


In [49]:
# Save to csv
df.to_csv('input/classcodes.csv', index=None)

In [50]:
# read csv
classcodes= pd.read_csv('input/classcodes.csv')

In [48]:
classcodes

Unnamed: 0,Code,Description
0,1POL,CURRENT NEWS - POLITICS
1,2ECO,CURRENT NEWS - ECONOMICS
2,3SPO,CURRENT NEWS - SPORT
3,4GEN,CURRENT NEWS - GENERAL
4,6INS,CURRENT NEWS - INSURANCE
5,7RSK,CURRENT NEWS - RISK NEWS
6,8YDB,TEMPORARY
7,9BNX,TEMPORARY
8,ADS10,CURRENT NEWS - ADVERTISING
9,BNW14,CURRENT NEWS - BUSINESS NEWS
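Once the code table is loaded, each article's list of topic codes can be turned into a fixed-length 0/1 vector whose columns follow the order of the code table. The sketch below uses plain Python and a toy five-code list. The helper names `build_index` and `multi_hot` are made up for the illustration, and ignoring unknown codes is an assumption about how such cases should be handled.

```python
def build_index(codes):
    """Map each topic code to a fixed column position, in the order given."""
    return {code: i for i, code in enumerate(codes)}


def multi_hot(article_codes, code_to_idx):
    """Return a 0/1 list with a 1 for every topic code the article carries."""
    row = [0] * len(code_to_idx)
    for code in article_codes:
        idx = code_to_idx.get(code)
        if idx is not None:          # codes missing from the table are ignored
            row[idx] = 1
    return row


topic_codes = ["C18", "C181", "CCAT", "G15", "GCAT"]   # toy subset
code_to_idx = build_index(topic_codes)

print(multi_hot(["G15", "GCAT"], code_to_idx))   # [0, 0, 0, 1, 1]
print(multi_hot([], code_to_idx))                # [0, 0, 0, 0, 0]
```

With the real table, `topic_codes` could be taken from the `Code` column, for example `classcodes['Code'].tolist()`.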


## Save your model

It might be useful to save your model if you want to continue your work later, or use it for inference later.

In [None]:
torch.save(model.state_dict(), 'model.pkl')

The model file should now be visible in the "Home" screen of the jupyter notebooks interface.  There you should be able to select it and press "download".
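To use the saved file later, the same architecture has to be created again before the weights are loaded into it. The following is a minimal sketch of that round trip. The `make_model` helper and its tiny linear layer are placeholders for whatever network you actually build, and the file is written to a temporary directory only to keep the example self-contained.

```python
import os
import tempfile

import torch
from torch import nn


def make_model():
    # Placeholder architecture; replace with your own network definition.
    return nn.Linear(4, 3)


model = make_model()
path = os.path.join(tempfile.gettempdir(), "model_example.pkl")
torch.save(model.state_dict(), path)

# Later (or in another notebook): rebuild the model, then load the weights.
restored = make_model()
restored.load_state_dict(torch.load(path, map_location="cpu"))
restored.eval()   # switch to inference mode (affects dropout, batch norm)
```

Because only the `state_dict` is saved, the code that defines the model class must be available wherever the file is loaded.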

## Download test set

The test set will be made available during the last week before the deadline and can be downloaded in the same way as the training set.

## Predict for test set

You will be asked to return your predictions for a separate test set.  These should be returned as a matrix with one row for each test article.  Each row contains a binary prediction for each label: 1 if the topic is assigned to the article, and 0 if not. The order of the labels is the order of the label (topic) codes.

An example row could look like this if your system predicts the presence of the second and fourth topic:

    0 1 0 1 0 0 0 0 0 0 0 0 0 0 ...
    
If you have the matrix prepared in `y` you can use the following function to save it to a text file.

In [None]:
import numpy as np

np.savetxt('results.txt', y, fmt='%d')
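The prediction format described above can be produced from per-label probabilities by thresholding. The sketch below uses made-up probabilities for two articles and four topics. The 0.5 threshold and the temporary output path are assumptions chosen for the example.

```python
import os
import tempfile

import numpy as np

# Made-up sigmoid outputs: one row per article, one column per topic.
probabilities = np.array([
    [0.1, 0.9, 0.2, 0.7],
    [0.3, 0.2, 0.1, 0.4],   # no topic passes the threshold
])

# Binary matrix: 1 where the probability reaches the threshold, else 0.
y = (probabilities >= 0.5).astype(int)

out_path = os.path.join(tempfile.gettempdir(), "results_example.txt")
np.savetxt(out_path, y, fmt="%d")

with open(out_path) as f:
    print(f.read())
```

The second row stays all zeros, which is a valid prediction for an article that belongs to no class.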