# Capstone Part 2

## Problem Statement, goals and success criteria

I will use multiple machine learning methods and compare how well they perform on a single-label text classification task.

The main goal is to reproduce part of my PhD work using state-of-the-art Python libraries, and to assess how this area has evolved over the past 10 years.

I will consider this work successful if, within this capstone project, I am able to reproduce the initial "related work" experiments from my thesis, which took about one year to complete at the time.  I expect the results to be approximately the same as the previously published ones, and I will also apply some machine learning models that I did not use at the time.

## Outline of proposed methods and models

I will use some of the classification methods available in sklearn for this task, including but not limited to k-Nearest Neighbors, Naive Bayes, and Support Vector Machines.

I will use accuracy to evaluate how well they perform.  
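
As a rough illustration of this plan, the sketch below fits the three kinds of classifier on a training split and reports test accuracy for each. The TF-IDF features, the specific estimator classes, the hyperparameters, and the helper name `compare_classifiers` are all assumptions made for this example, not choices fixed by the proposal, and the toy corpus is made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def compare_classifiers(train_texts, train_labels, test_texts, test_labels):
    """Fit each candidate model on the training split and return test accuracy."""
    # Hyperparameters here are illustrative, not tuned values.
    candidates = {
        "kNN": KNeighborsClassifier(n_neighbors=1),
        "Naive Bayes": MultinomialNB(),
        "Linear SVM": LinearSVC(),
    }
    scores = {}
    for name, classifier in candidates.items():
        # Each model gets its own TF-IDF vectorizer, fitted on training text only.
        model = make_pipeline(TfidfVectorizer(), classifier)
        model.fit(train_texts, train_labels)
        scores[name] = accuracy_score(test_labels, model.predict(test_texts))
    return scores


# Tiny made-up corpus, only to show the call shape.
toy_train_texts = [
    "wheat harvest grain exports rise",
    "grain stocks fall after poor harvest",
    "tanker ship leaves port with cargo",
    "port strike delays ship cargo",
]
toy_train_labels = ["grain", "grain", "ship", "ship"]
toy_test_texts = ["grain harvest forecast", "cargo ship docks at port"]
toy_test_labels = ["grain", "ship"]

toy_scores = compare_classifiers(
    toy_train_texts, toy_train_labels, toy_test_texts, toy_test_labels
)
```

With the real data, the same function could be called with the `text` and `label` columns of a train/test pair of DataFrames.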

I will use public datasets that are actively used in scientific papers, so that these results are comparable to previously published work.

## Identify risks and assumptions

The datasets are publicly available and have been actively used in research for at least the past ten years.  They are small enough to be loaded into memory easily.  In fact, some of them are available through sklearn.datasets.  It is safe to say that the risks are minimal, and that I know my assumptions to be true.

I will download the datasets from my webpage:

http://ana.cachopo.org/datasets-for-single-label-text-categorization

## Create local PostgreSQL database

In [1]:
! ls -l ../datasets/

total 74600
-rw-r--r--+ 1 acardoso  staff  10808748 19 Jan  2005 20ng-test-all-terms.txt
-rw-r--r--+ 1 acardoso  staff  16682920 19 Jan  2005 20ng-train-all-terms.txt
-rw-r--r--@ 1 acardoso  staff         0  5 Dec 17:59 Icon?
-rw-r--r--+ 1 acardoso  staff   1522484 19 Jan  2005 r52-test-all-terms.txt
-rw-r--r--+ 1 acardoso  staff   4281453 19 Jan  2005 r52-train-all-terms.txt
-rw-r--r--+ 1 acardoso  staff   1195261 19 Jan  2005 r8-test-all-terms.txt
-rw-r--r--+ 1 acardoso  staff   3354424 19 Jan  2005 r8-train-all-terms.txt


In [2]:
from __future__ import division, print_function, unicode_literals
import pandas as pd

In [7]:
def read_file(filename):
    """Read a tab-separated dataset file into a DataFrame with 'label' and 'text' columns."""
    return pd.read_csv("../datasets/" + filename,
                       header=None, sep='\t',
                       names=['label', 'text'])

ng20_test_df = read_file('20ng-test-all-terms.txt')
ng20_train_df = read_file('20ng-train-all-terms.txt')
r52_test_df = read_file('r52-test-all-terms.txt')
r52_train_df = read_file('r52-train-all-terms.txt')
r8_test_df = read_file('r8-test-all-terms.txt')
r8_train_df = read_file('r8-train-all-terms.txt')

all_dfs = [ng20_test_df, ng20_train_df, 
           r52_test_df, r52_train_df, 
           r8_test_df, r8_train_df]

for df in all_dfs:
    print(df.shape)
    print(df.head())

(7528, 2)
         label                                               text
0  alt.atheism  re about the bible quiz answers in article hea...
1  alt.atheism  re amusing atheists and agnostics in article t...
2  alt.atheism  re yet more rushdie re islamic law jaeger buph...
3  alt.atheism  re christian morality is in article vice ico t...
4  alt.atheism  re after years can we say that christian moral...
(11293, 2)
         label                                               text
0  alt.atheism  alt atheism faq atheist resources archive name...
1  alt.atheism  alt atheism faq introduction to atheism archiv...
2  alt.atheism  re gospel dating in article mimsy umd edu mang...
3  alt.atheism  re university violating separation of church s...
4  alt.atheism  re soc motss et al princeton axes matching fun...
(2568, 2)
   label                                               text
0  trade  asian exporters fear damage from u s japan rif...
1  grain  china daily says vermin eat pct grain stocks a.

In [8]:
from sqlalchemy import create_engine
import pandas as pd
%load_ext sql

engine = create_engine('postgresql://postgres:chocolate@localhost:5432')

In [9]:
ng20_test_df.to_sql("ng20_test", engine, if_exists='replace')
ng20_train_df.to_sql("ng20_train", engine, if_exists='replace')
r52_test_df.to_sql("r52_test", engine, if_exists='replace')
r52_train_df.to_sql("r52_train", engine, if_exists='replace')
r8_test_df.to_sql("r8_test", engine, if_exists='replace')
r8_train_df.to_sql("r8_train", engine, if_exists='replace')

In [10]:
%%sql postgresql://postgres:chocolate@localhost:5432
SELECT * FROM ng20_test LIMIT 3;

3 rows affected.


index,label,text
0,alt.atheism,re about the bible quiz answers in article healta saturn wwc edu healta saturn wwc edu tammy r healy writes the cheribums are on the ark of the covenant when god said make no graven image he was refering to idols which were created to be worshipped the ark of the covenant wasn t wrodhipped and only the high priest could enter the holy of holies where it was kept once a year on the day of atonement i am not familiar with or knowledgeable about the original language but i believe there is a word for idol and that the translator would have used the word idol instead of graven image had the original said idol so i think you re wrong here but then again i could be too i just suggesting a way to determine whether the interpretation you offer is correct dean kaflowitz
1,alt.atheism,re amusing atheists and agnostics in article timmbake mcl timmbake mcl ucsb edu clam bake timmons writes fallacy atheism is a faith lo i hear the faq beckoning once again wonderful rule deleted you re correct you didn t say anything about a conspiracy correction hard atheism is a faith yes rule don t mix apples with oranges how can you say that the extermination by the mongols was worse than stalin khan conquered people unsympathetic to his cause that was atrocious but stalin killed millions of his own people who loved and worshipped him and his atheist state how can anyone be worse than that i will not explain this to you again stalin did nothing in the name of atheism whethe he was or was not an atheist is irrelevant get a grip man the stalin example was brought up not as an indictment of atheism but merely as another example of how people will kill others under any name that s fit for the occasion no look again while you never said it the implication is pretty clear i m sorry but i can only respond to your words not your true meaning usenet is a slippery medium deleted wrt the burden of proof so hard atheism has nothing to prove then how does it justify that god does not exist i know there s the faq etc but guess what if those justifications were so compelling why aren t people flocking to hard atheism they re not and they won t i for one will discourage people from hard atheism by pointing out those very sources as reliable statements on hard atheism look i m not supporting any dogmatic position i d be a fool to say that in the large group of people that are atheists no people exist who wish to proselytize in the same fashion as religion how many hard atheists do you see posting here anyway maybe i mm just not looking hard enough second what makes you think i m defending any given religion i m merely recognizing hard atheism for what it is a faith i never meant to do so although i understand where you might get that idea i was merely using the 
bible example as an allegory to illustrate my point and yes by we i am referring to every reader of the post where is the evidence that the poster stated that he relied upon evidence for what who i think i may have lost this thread why theists are arrogant deleted because they say such and such is absolutely unalterably true because my dogma says it is true i am not prepared to issue blanket statements indicting all theists of arrogance as you are wont to do with atheists bzzt by virtue of your innocent little pronoun they you ve just issued a blanket statement at least i will apologize by qualifying my original statement with hard atheist in place of atheist would you call john the baptist arrogant who boasted of one greater than he that s what many christians do today how is that in itself arrogant guilty as charged what i meant to say was the theists who are arrogant are this way because they say other than that i thought my meaning was clear enough any position that claims itself as superior to another with no supporting evidence is arrogant thanks for your apology btw i m not worthy only seriously misinformed with your sophisticated put down of they the theists your serious misinformation shines through explained above bake timmons iii there s nothing higher stronger more wholesome and more useful in life than some good memory alyosha in brothers karamazov dostoevsky
2,alt.atheism,re yet more rushdie re islamic law jaeger buphy bu edu gregg jaeger writes in article vice ico tek com bobbe vice ico tek com robert beauchaine writes bennett neil how bcci adapted the koran rules of banking the times august so let s see if some guy writes a piece with a title that implies something is the case then it must be so is that it gregg you haven t provided even a title of an article to support your contention this is how you support a position if you intend to have anyone respect it gregg any questions and i even managed to include the above reference with my head firmly engaged in my ass what s your excuse this supports nothing i have no reason to believe that this is piece is anything other than another anti islamic slander job you also have no reason to believe it is an anti islamic slander job apart from your own prejudices i have no respect for titles only for real content i can look up this article if i want true but i can tell you bcci was not an islamic bank why yes what s a mere report in the times stating that bcci followed islamic banking rules gregg knows islam is good and he knows bcci were bad therefore bcci cannot have been islamic anyone who says otherwise is obviously spreading slanderous propaganda if someone wants to discuss the issue more seriously then i d be glad to have a real discussion providing references etc i see if someone wants to provide references to articles you agree with you will also respond with references to articles you agree with mmm yes that would be a very intellectually stimulating debate doubtless that s how you spend your time in soc culture islam i ve got a special place for you in my kill file right next to bobby want to join him the more you post the more i become convinced that it is simply a waste of time to try and reason with moslems is that what you are hoping to achieve mathew


In [11]:
%%sql
SELECT * FROM r8_test LIMIT 3;

3 rows affected.


index,label,text
0,trade,asian exporters fear damage from u s japan rift mounting trade friction between the u s and japan has raised fears among many of asia s exporting nations that the row could inflict far reaching economic damage businessmen and officials said they told reuter correspondents in asian capitals a u s move against japan might boost protectionist sentiment in the u s and lead to curbs on american imports of their products but some exporters said that while the conflict would hurt them in the long run in the short term tokyo s loss might be their gain the u s has said it will impose mln dlrs of tariffs on imports of japanese electronics goods on april in retaliation for japan s alleged failure to stick to a pact not to sell semiconductors on world markets at below cost unofficial japanese estimates put the impact of the tariffs at billion dlrs and spokesmen for major electronics firms said they would virtually halt exports of products hit by the new taxes we wouldn t be able to do business said a spokesman for leading japanese electronics firm matsushita electric industrial co ltd mc t if the tariffs remain in place for any length of time beyond a few months it will mean the complete erosion of exports of goods subject to tariffs to the u s said tom murtha a stock analyst at the tokyo office of broker james capel and co in taiwan businessmen and officials are also worried we are aware of the seriousness of the u s threat against japan because it serves as a warning to us said a senior taiwanese trade official who asked not to be named taiwan had a trade trade surplus of billion dlrs last year pct of it with the u s the surplus helped swell taiwan s foreign exchange reserves to billion dlrs among the world s largest we must quickly open our markets remove trade barriers and cut import tariffs to allow imports of u s products if we want to defuse problems from possible u s retaliation said paul sheen chairman of textile exporters taiwan safe group a senior official 
of south korea s trade promotion association said the trade dispute between the u s and japan might also lead to pressure on south korea whose chief exports are similar to those of japan last year south korea had a trade surplus of billion dlrs with the u s up from billion dlrs in in malaysia trade officers and businessmen said tough curbs against japan might allow hard hit producers of semiconductors in third countries to expand their sales to the u s in hong kong where newspapers have alleged japan has been selling below cost semiconductors some electronics manufacturers share that view but other businessmen said such a short term commercial advantage would be outweighed by further u s pressure to block imports that is a very short term view said lawrence mills director general of the federation of hong kong industry if the whole purpose is to prevent imports one day it will be extended to other sources much more serious for hong kong is the disadvantage of action restraining trade he said the u s last year was hong kong s biggest export market accounting for over pct of domestically produced exports the australian government is awaiting the outcome of trade talks between the u s and japan with interest and concern industry minister john button said in canberra last friday this kind of deterioration in trade relations between two countries which are major trading partners of ours is a very serious matter button said he said australia s concerns centred on coal and beef australia s two largest exports to japan and also significant u s exports to that country meanwhile u s japanese diplomatic manoeuvres to solve the trade stand off continue japan s ruling liberal democratic party yesterday outlined a package of economic measures to boost the japanese economy the measures proposed include a large supplementary budget and record public works spending in the first half of the financial year they also call for stepped up spending as an emergency measure to stimulate 
the economy despite prime minister yasuhiro nakasone s avowed fiscal reform program deputy u s trade representative michael smith and makoto kuroda japan s deputy minister of international trade and industry miti are due to meet in washington this week in an effort to end the dispute reuter
1,grain,china daily says vermin eat pct grain stocks a survey of provinces and seven cities showed vermin consume between seven and pct of china s grain stocks the china daily said it also said that each year mln tonnes or pct of china s fruit output are left to rot and mln tonnes or up to pct of its vegetables the paper blamed the waste on inadequate storage and bad preservation methods it said the government had launched a national programme to reduce waste calling for improved technology in storage and preservation and greater production of additives the paper gave no further details reuter
2,ship,australian foreign ship ban ends but nsw ports hit tug crews in new south wales nsw victoria and western australia yesterday lifted their ban on foreign flag ships carrying containers but nsw ports are still being disrupted by a separate dispute shipping sources said the ban imposed a week ago over a pay claim had prevented the movement in or out of port of nearly vessels they said the pay dispute went before a hearing of the arbitration commission today meanwhile disruption began today to cargo handling in the ports of sydney newcastle and port kembla they said the industrial action at the nsw ports is part of the week of action called by the nsw trades and labour council to protest changes to the state s workers compensation laws the shipping sources said the various port unions appear to be taking it in turn to work for a short time at the start of each shift and then to walk off cargo handling in the ports has been disrupted with container movements most affected but has not stopped altogether they said they said they could not say how long the disruption will go on and what effect it will have on shipping movements reuter


I created a local PostgreSQL database and uploaded the data to it as requested, but I will not be using it for my work for two main reasons:

- The datasets are small and fit in memory, so I can query them with pandas, which I am more comfortable with.

- These datasets contain text, and they can be processed much more efficiently as DataFrames (or NumPy arrays) than as database tables.
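
To make the first point concrete, here is a small sketch of how the kind of SQL query shown above maps onto pandas. The DataFrame `sample_df` is a hypothetical stand-in with made-up rows; with the real data, `r8_test_df` would take its place.

```python
import pandas as pd

# Hypothetical stand-in for r8_test_df, with the same 'label' and 'text' columns.
sample_df = pd.DataFrame({
    "label": ["trade", "grain", "ship", "grain"],
    "text": [
        "asian exporters fear damage",
        "china daily says vermin eat grain",
        "australian foreign ship ban ends",
        "wheat stocks fall",
    ],
})

# SELECT * FROM r8_test LIMIT 3;
first_three = sample_df.head(3)

# SELECT * FROM r8_test WHERE label = 'grain';
grain_docs = sample_df[sample_df["label"] == "grain"]

# SELECT COUNT(*) FROM r8_test;
row_count = len(sample_df)
```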

## Query, Sort and Clean Data

The data was already clean and available in tab-separated text files that pandas can read directly with read_csv.

## Create a Data Dictionary

I will use three datasets commonly used for research in single-label text categorization.

Each dataset was available in a file containing one document per line.

Each document is composed of its class and its terms (or words).

Each line starts with a "word" representing the document's class, followed by a TAB character and then a sequence of "words" delimited by spaces, representing the terms contained in the document.  This can be read as a tab-separated file where each row is a different document, the first column is the document's class and the second column holds the document's words.
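
The format can also be parsed without pandas. The sketch below splits one line into its class and its list of terms; `parse_line` is a hypothetical helper name and the example line is a shortened, made-up document.

```python
def parse_line(line):
    """Split one dataset line into (label, list_of_terms)."""
    # Split on the first TAB only: the class comes first, the terms follow.
    label, text = line.rstrip("\n").split("\t", 1)
    # Terms are delimited by spaces.
    return label, text.split()


# Shortened, made-up line in the documented format.
example_label, example_terms = parse_line(
    "grain\tchina daily says vermin eat grain stocks\n"
)
```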

## Perform and summarize EDA

#### 20 Newsgroups

<p>This dataset is a
collection of approximately 20,000 newsgroup documents, partitioned
(nearly) evenly across 20 different newsgroups.  I used the "bydate" version,
because it already had a standard train/test split.  
</p>
<p>Although already cleaned-up, this dataset still had several
attachments, many PGP keys and some duplicates.
</p>
<p>After removing them, along with the messages that became empty as a
result, the distribution of train and test messages was the following for
each newsgroup:
</p>

<table align="center" border="1">
<tbody>
<tr>
<th colspan="4">20 Newsgroups</th>
</tr>
<tr>
<th>Class</th>
<th># train docs</th>
<th># test docs</th>
<th>Total # docs</th>
</tr>
<tr align="right">
<td>alt.atheism</td>
<td>480</td>
<td>319</td>
<td>799</td>
</tr>
<tr align="right">
<td>comp.graphics</td>
<td>584</td>
<td>389</td>
<td>973</td>
</tr>
<tr align="right">
<td>comp.os.ms-windows.misc</td>
<td>572</td>
<td>394</td>
<td>966</td>
</tr>
<tr align="right">
<td>comp.sys.ibm.pc.hardware</td>
<td>590</td>
<td>392</td>
<td>982</td>
</tr>
<tr align="right">
<td>comp.sys.mac.hardware</td>
<td>578</td>
<td>385</td>
<td>963</td>
</tr>
<tr align="right">
<td>comp.windows.x</td>
<td>593</td>
<td>392</td>
<td>985</td>
</tr>
<tr align="right">
<td>misc.forsale</td>
<td>585</td>
<td>390</td>
<td>975</td>
</tr>
<tr align="right">
<td>rec.autos</td>
<td>594</td>
<td>395</td>
<td>989</td>
</tr>
<tr align="right">
<td>rec.motorcycles</td>
<td>598</td>
<td>398</td>
<td>996</td>
</tr>
<tr align="right">
<td>rec.sport.baseball</td>
<td>597</td>
<td>397</td>
<td>994</td>
</tr>
<tr align="right">
<td>rec.sport.hockey</td>
<td>600</td>
<td>399</td>
<td>999</td>
</tr>
<tr align="right">
<td>sci.crypt</td>
<td>595</td>
<td>396</td>
<td>991</td>
</tr>
<tr align="right">
<td>sci.electronics</td>
<td>591</td>
<td>393</td>
<td>984</td>
</tr>
<tr align="right">
<td>sci.med</td>
<td>594</td>
<td>396</td>
<td>990</td>
</tr>
<tr align="right">
<td>sci.space</td>
<td>593</td>
<td>394</td>
<td>987</td>
</tr>
<tr align="right">
<td>soc.religion.christian</td>
<td>598</td>
<td>398</td>
<td>996</td>
</tr>
<tr align="right">
<td>talk.politics.guns</td>
<td>545</td>
<td>364</td>
<td>909</td>
</tr>
<tr align="right">
<td>talk.politics.mideast</td>
<td>564</td>
<td>376</td>
<td>940</td>
</tr>
<tr align="right">
<td>talk.politics.misc</td>
<td>465</td>
<td>310</td>
<td>775</td>
</tr>
<tr align="right">
<td>talk.religion.misc</td>
<td>377</td>
<td>251</td>
<td>628</td>
</tr>
<tr align="right">
<th>Total</th>
<th>11293</th>
<th>7528</th>
<th>18821</th>
</tr>
</tbody>
</table>
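
A table like the one above can be recomputed from the loaded DataFrames. The following is a minimal sketch, assuming DataFrames with a `label` column as produced by `read_file`; the helper name `class_distribution` and the toy inputs are made up for illustration.

```python
import pandas as pd


def class_distribution(train_df, test_df):
    """Count documents per class in each split, with a total column."""
    counts = pd.DataFrame({
        "# train docs": train_df["label"].value_counts(),
        "# test docs": test_df["label"].value_counts(),
    })
    # A class missing from one split shows up as NaN; treat it as zero documents.
    counts = counts.fillna(0).astype(int)
    counts["Total # docs"] = counts["# train docs"] + counts["# test docs"]
    return counts.sort_index()


# Toy inputs, only to show the call shape.
toy_train_df = pd.DataFrame({"label": ["alt.atheism", "alt.atheism", "sci.space"]})
toy_test_df = pd.DataFrame({"label": ["alt.atheism", "sci.med"]})
toy_counts = class_distribution(toy_train_df, toy_test_df)
```

With the real data, the call would be `class_distribution(ng20_train_df, ng20_test_df)`.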

#### Reuters 21578

<p>I downloaded the Reuters-21578 dataset from <a href="http://www.daviddlewis.com/resources/testcollections/reuters21578/">David
Lewis' page</a> and used the standard "ModApté" train/test split.  These documents
appeared on the Reuters newswire in 1987 and were manually classified
by personnel from Reuters Ltd.  

</p>
<p>Because the class distribution for these documents is
very skewed, two sub-collections are usually considered for text
categorization tasks:

</p>
<ul>
<li><strong>R10</strong> The set of the 10 classes with the highest number of 
positive training examples.
</li>
<li><strong>R90</strong> The set of the 90 classes with at least one positive 
training and testing example.
</li></ul>
<p>Moreover, many of these documents are classified as having no topic
at all or as having more than one topic.  The following table shows the
distribution of the documents per number of topics, where
<i># train docs</i> and <i># test docs</i> refer to
the <i>ModApté</i> split and <i># other</i> refers to documents
that were not considered in this split:

</p>
<table align="center" border="1">
<tbody>
<tr>
<th colspan="5">Reuters 21578</th>
</tr>
<tr>
<th># Topics</th>
<th># train docs</th>
<th># test docs</th>
<th># other</th>
<th>Total # docs</th>
</tr>
<tr align="right">
<td>0</td>
<td>1828</td>
<td>280</td>
<td>8103</td>
<td>10211</td>
</tr>
<tr align="right">
<td>1</td>
<td>6552</td>
<td>2581</td>
<td>361</td>
<td>9494</td>
</tr>
<tr align="right">
<td>2</td>
<td>890</td>
<td>309</td>
<td>135</td>
<td>1334</td>
</tr>
<tr align="right">
<td>3</td>
<td>191</td>
<td>64</td>
<td>55</td>
<td>310</td>
</tr>
<tr align="right">
<td>4</td>
<td>62</td>
<td>32</td>
<td>10</td>
<td>104</td>
</tr>
<tr align="right">
<td>5</td>
<td>39</td>
<td>14</td>
<td>8</td>
<td>61</td>
</tr>
<tr align="right">
<td>6</td>
<td>21</td>
<td>6</td>
<td>3</td>
<td>30</td>
</tr>
<tr align="right">
<td>7</td>
<td>7</td>
<td>4</td>
<td>0</td>
<td>11</td>
</tr>
<tr align="right">
<td>8</td>
<td>4</td>
<td>2</td>
<td>0</td>
<td>6</td>
</tr>
<tr align="right">
<td>9</td>
<td>4</td>
<td>2</td>
<td>0</td>
<td>6</td>
</tr>
<tr align="right">
<td>10</td>
<td>3</td>
<td>1</td>
<td>0</td>
<td>4</td>
</tr>
<tr align="right">
<td>11</td>
<td>0</td>
<td>1</td>
<td>1</td>
<td>2</td>
</tr>
<tr align="right">
<td>12</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>2</td>
</tr>
<tr align="right">
<td>13</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr align="right">
<td>14</td>
<td>0</td>
<td>2</td>
<td>0</td>
<td>2</td>
</tr>
<tr align="right">
<td>15</td>
<td>0</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr align="right">
<td>16</td>
<td>1</td>
<td>0</td>
<td>0</td>
<td>1</td>
</tr>
</tbody>
</table>
<p>As the goal in this project is to consider
<strong>single-labeled</strong> datasets, all the documents with no
topic or with more than one topic were removed.  As a result, some of
the classes in R10 and R90 were left with no train or test documents.

</p>
<p>Considering only the documents with a single topic and the classes
which still have at least one train and one test example, we have 8 of the
10 most frequent classes and 52 of the original 90.  

</p>
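
The two filtering steps described here can be sketched as follows. This is not the code originally used to build the datasets: the in-memory representation (a list of `(split, topics)` pairs), the helper name `single_label_subset`, and the toy documents are all assumptions made for illustration.

```python
def single_label_subset(documents, candidate_classes):
    """Keep single-topic documents whose class has train and test examples.

    `documents` is a list of (split, topics) pairs, a hypothetical in-memory
    representation; `split` is "train" or "test" and `topics` is a list.
    """
    # Step 1: keep documents with exactly one topic, within the candidate classes.
    single = [(split, topics[0]) for split, topics in documents
              if len(topics) == 1 and topics[0] in candidate_classes]
    # Step 2: keep only classes that still appear in both splits.
    train_classes = {label for split, label in single if split == "train"}
    test_classes = {label for split, label in single if split == "test"}
    surviving = train_classes & test_classes
    return [(split, label) for split, label in single if label in surviving]


# Made-up documents, only to show the behaviour.
toy_documents = [
    ("train", ["grain"]),
    ("test", ["grain"]),
    ("train", ["grain", "wheat"]),  # more than one topic: dropped
    ("test", ["wheat", "grain"]),   # more than one topic: dropped
    ("train", []),                  # no topic: dropped
    ("train", ["earn"]),
    ("test", ["earn"]),
]
toy_subset = single_label_subset(toy_documents, {"grain", "wheat", "earn"})
toy_classes = sorted({label for _, label in toy_subset})
```

In this toy input, every document mentioning the class wheat has more than one topic, so that class drops out entirely while the other two classes survive.
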
<p>Following Sebastiani's convention, we will call these sets
<strong>R8</strong> and <strong>R52</strong>.  Note that from R10 to
R8 the classes <i>corn</i> and <i>wheat</i>, which are closely
related to the class <i>grain</i>, disappeared, and this last class lost
many of its documents.

</p>
<p>The distribution of documents per class is the following for
<strong>R8</strong> and <strong>R52</strong>:

</p>
<table align="center" border="1">
<tbody>
<tr>
<th colspan="4">R8</th>
</tr>
<tr>
<th>Class</th>
<th># train docs</th>
<th># test docs</th>
<th>Total # docs</th>
</tr>
<tr align="right">
<td>acq</td>
<td>1596</td>
<td>696</td>
<td>2292</td>
</tr>
<tr align="right">
<td>crude</td>
<td>253</td>
<td>121</td>
<td>374</td>
</tr>
<tr align="right">
<td>earn</td>
<td>2840</td>
<td>1083</td>
<td>3923</td>
</tr>
<tr align="right">
<td>grain</td>
<td>41</td>
<td>10</td>
<td>51</td>
</tr>
<tr align="right">
<td>interest</td>
<td>190</td>
<td>81</td>
<td>271</td>
</tr>
<tr align="right">
<td>money-fx</td>
<td>206</td>
<td>87</td>
<td>293</td>
</tr>
<tr align="right">
<td>ship</td>
<td>108</td>
<td>36</td>
<td>144</td>
</tr>
<tr align="right">
<td>trade</td>
<td>251</td>
<td>75</td>
<td>326</td>
</tr>
<tr align="right">
<th>Total</th>
<th>5485</th>
<th>2189</th>
<th>7674</th>
</tr>
</tbody>
</table>
<table align="center" border="1">
<tbody>
<tr>
<th colspan="4">R52</th>
</tr>
<tr>
<th>Class</th>
<th># train docs</th>
<th># test docs</th>
<th>Total # docs</th>
</tr>
<tr align="right">
<td>acq</td>
<td>1596</td>
<td>696</td>
<td>2292</td>
</tr>
<tr align="right">
<td>alum</td>
<td>31</td>
<td>19</td>
<td>50</td>
</tr>
<tr align="right">
<td>bop</td>
<td>22</td>
<td>9</td>
<td>31</td>
</tr>
<tr align="right">
<td>carcass</td>
<td>6</td>
<td>5</td>
<td>11</td>
</tr>
<tr align="right">
<td>cocoa</td>
<td>46</td>
<td>15</td>
<td>61</td>
</tr>
<tr align="right">
<td>coffee</td>
<td>90</td>
<td>22</td>
<td>112</td>
</tr>
<tr align="right">
<td>copper</td>
<td>31</td>
<td>13</td>
<td>44</td>
</tr>
<tr align="right">
<td>cotton</td>
<td>15</td>
<td>9</td>
<td>24</td>
</tr>
<tr align="right">
<td>cpi</td>
<td>54</td>
<td>17</td>
<td>71</td>
</tr>
<tr align="right">
<td>cpu</td>
<td>3</td>
<td>1</td>
<td>4</td>
</tr>
<tr align="right">
<td>crude</td>
<td>253</td>
<td>121</td>
<td>374</td>
</tr>
<tr align="right">
<td>dlr</td>
<td>3</td>
<td>3</td>
<td>6</td>
</tr>
<tr align="right">
<td>earn</td>
<td>2840</td>
<td>1083</td>
<td>3923</td>
</tr>
<tr align="right">
<td>fuel</td>
<td>4</td>
<td>7</td>
<td>11</td>
</tr>
<tr align="right">
<td>gas</td>
<td>10</td>
<td>8</td>
<td>18</td>
</tr>
<tr align="right">
<td>gnp</td>
<td>58</td>
<td>15</td>
<td>73</td>
</tr>
<tr align="right">
<td>gold</td>
<td>70</td>
<td>20</td>
<td>90</td>
</tr>
<tr align="right">
<td>grain</td>
<td>41</td>
<td>10</td>
<td>51</td>
</tr>
<tr align="right">
<td>heat</td>
<td>6</td>
<td>4</td>
<td>10</td>
</tr>
<tr align="right">
<td>housing</td>
<td>15</td>
<td>2</td>
<td>17</td>
</tr>
<tr align="right">
<td>income</td>
<td>7</td>
<td>4</td>
<td>11</td>
</tr>
<tr align="right">
<td>instal-debt</td>
<td>5</td>
<td>1</td>
<td>6</td>
</tr>
<tr align="right">
<td>interest</td>
<td>190</td>
<td>81</td>
<td>271</td>
</tr>
<tr align="right">
<td>ipi</td>
<td>33</td>
<td>11</td>
<td>44</td>
</tr>
<tr align="right">
<td>iron-steel</td>
<td>26</td>
<td>12</td>
<td>38</td>
</tr>
<tr align="right">
<td>jet</td>
<td>2</td>
<td>1</td>
<td>3</td>
</tr>
<tr align="right">
<td>jobs</td>
<td>37</td>
<td>12</td>
<td>49</td>
</tr>
<tr align="right">
<td>lead</td>
<td>4</td>
<td>4</td>
<td>8</td>
</tr>
<tr align="right">
<td>lei</td>
<td>11</td>
<td>3</td>
<td>14</td>
</tr>
<tr align="right">
<td>livestock</td>
<td>13</td>
<td>5</td>
<td>18</td>
</tr>
<tr align="right">
<td>lumber</td>
<td>7</td>
<td>4</td>
<td>11</td>
</tr>
<tr align="right">
<td>meal-feed</td>
<td>6</td>
<td>1</td>
<td>7</td>
</tr>
<tr align="right">
<td>money-fx</td>
<td>206</td>
<td>87</td>
<td>293</td>
</tr>
<tr align="right">
<td>money-supply</td>
<td>123</td>
<td>28</td>
<td>151</td>
</tr>
<tr align="right">
<td>nat-gas</td>
<td>24</td>
<td>12</td>
<td>36</td>
</tr>
<tr align="right">
<td>nickel</td>
<td>3</td>
<td>1</td>
<td>4</td>
</tr>
<tr align="right">
<td>orange</td>
<td>13</td>
<td>9</td>
<td>22</td>
</tr>
<tr align="right">
<td>pet-chem</td>
<td>13</td>
<td>6</td>
<td>19</td>
</tr>
<tr align="right">
<td>platinum</td>
<td>1</td>
<td>2</td>
<td>3</td>
</tr>
<tr align="right">
<td>potato</td>
<td>2</td>
<td>3</td>
<td>5</td>
</tr>
<tr align="right">
<td>reserves</td>
<td>37</td>
<td>12</td>
<td>49</td>
</tr>
<tr align="right">
<td>retail</td>
<td>19</td>
<td>1</td>
<td>20</td>
</tr>
<tr align="right">
<td>rubber</td>
<td>31</td>
<td>9</td>
<td>40</td>
</tr>
<tr align="right">
<td>ship</td>
<td>108</td>
<td>36</td>
<td>144</td>
</tr>
<tr align="right">
<td>strategic-metal</td>
<td>9</td>
<td>6</td>
<td>15</td>
</tr>
<tr align="right">
<td>sugar</td>
<td>97</td>
<td>25</td>
<td>122</td>
</tr>
<tr align="right">
<td>tea</td>
<td>2</td>
<td>3</td>
<td>5</td>
</tr>
<tr align="right">
<td>tin</td>
<td>17</td>
<td>10</td>
<td>27</td>
</tr>
<tr align="right">
<td>trade</td>
<td>251</td>
<td>75</td>
<td>326</td>
</tr>
<tr align="right">
<td>veg-oil</td>
<td>19</td>
<td>11</td>
<td>30</td>
</tr>
<tr align="right">
<td>wpi</td>
<td>14</td>
<td>9</td>
<td>23</td>
</tr>
<tr align="right">
<td>zinc</td>
<td>8</td>
<td>5</td>
<td>13</td>
</tr>
<tr align="right">
<th>Total</th>
<th>6532</th>
<th>2568</th>
<th>9100</th>
</tr>
</tbody>
</table>

## How to tune and evaluate results

I will evaluate my results using accuracy, the standard evaluation measure for single-label text categorization tasks.  I will use three datasets derived from two collections commonly used in research (20 Newsgroups, R8 and R52), so I can compare my results with others that have been previously published.
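
As a small worked example of the measure, the sketch below computes accuracy by hand and applies it to a trivial classifier that always predicts the largest class of the R8 test split. The function name `accuracy` and the placeholder label `"other"` (which collapses the seven remaining classes) are assumptions made for this illustration; the counts come from the R8 table above.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of documents whose predicted class equals the true class."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    correct = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return float(correct) / len(true_labels)


# R8 test split: 1083 of the 2189 documents belong to the class 'earn'.
# 'other' is a placeholder that collapses the seven remaining classes.
r8_test_labels = ["earn"] * 1083 + ["other"] * (2189 - 1083)
always_earn = ["earn"] * len(r8_test_labels)

# 1083 / 2189, about 0.495
baseline_accuracy = accuracy(r8_test_labels, always_earn)
```

Because the classes are skewed, a number like this may be a useful reference point when reading the accuracy of the trained models.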

If I find other public datasets that are commonly used in research papers, I will probably use them as well.

## Create blog post summary

### Blog post here:

https://acardocacho.github.io/capstone-part2/