![license_header_logo](../../../images/license_header_logo.png)

> **Copyright (c) 2021 CertifAI Sdn. Bhd.**<br>
<br>
This program is part of OSRFramework. You can redistribute it and/or modify
<br>it under the terms of the GNU Affero General Public License as published by
<br>the Free Software Foundation, either version 3 of the License, or
<br>(at your option) any later version.
<br>
<br>This program is distributed in the hope that it will be useful
<br>but WITHOUT ANY WARRANTY; without even the implied warranty of
<br>MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
<br>GNU Affero General Public License for more details.
<br>
<br>You should have received a copy of the GNU Affero General Public License
<br>along with this program.  If not, see <http://www.gnu.org/licenses/>.
<br>

# Introduction

This notebook shows how to carry out **natural language document classification** with a Support Vector Machine (SVM), assigning each text document to one of several mutually exclusive groups. Because this is a supervised classification task, we work with a document corpus that is **pre-labelled**. We begin by loading the training corpus into a Python data structure that is suitable for pre-processing and for consumption by the classifier.

# Notebook Content

* [Supervised Classifiers](#Supervised-Classifiers)


* [Support Vector Machine](#Support-Vector-Machine)


* [Dataset](#Dataset)


* [HTMLParser](#HTMLParser)


* [Term-Frequency Inverse Document-Frequency (TF-IDF)](#Term-Frequency-Inverse-Document-Frequency-(TF-IDF))


* [Train-Test Split](#Train-Test-Split)


* [Support Vector Machine (SVM)](#Support-Vector-Machine-(SVM))


* [Performance Metrics](#Performance-Metrics)


* [Training on Full Dataset](#Training-on-Full-Dataset)

# Supervised Classifiers

**Supervised Classifiers** are a group of **statistical machine learning techniques** that attempt to attach a "class", or "label", to a particular set of features, based on **prior known labels** attached to other similar sets of features.

This is clearly quite an abstract definition, so it may help to have an example. Consider a set of text documents. Each document has an associated set of words, which we will call "features". Each of these documents might be associated with a class label that describes what the article is about.

![Classification](../../../images/classification.png)

# Support Vector Machine

**Support Vector Machines** are a subclass of supervised classifiers that attempt to partition a feature space into two or more groups, which in our case means separating a collection of articles into two or more class labels.

SVMs achieve this by **finding an optimal means** of separating such groups based on their known class labels. In the simpler cases the separation **"boundary"** is linear, leading to groups that are split up by lines (or planes) in high-dimensional spaces.

In more complicated cases (where groups are not nicely separated by lines or planes), SVMs are able to carry out **non-linear partitioning**. This is achieved by means of a **kernel function**. Ultimately, this makes them very sophisticated and capable classifiers, but at the usual expense that they can be **prone to overfitting**.

See the figure below for two examples of **non-linear decision boundaries** (**polynomial kernel** and **radial kernel respectively**) for two class labels (orange and blue), across two features:

![Support Vector Machine](../../../images/SVM.png)
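
To make the idea of a kernel function more concrete, the sketch below evaluates three common kernels on a pair of toy vectors using only the standard library. The function names and the default values for `degree`, `coef0` and `gamma` are chosen for illustration and are not taken from this notebook.

```python
import math

def linear_kernel(x, z):
    """Plain dot product: corresponds to a linear decision boundary."""
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, coef0=1.0):
    """(x . z + coef0) ** degree: allows curved, polynomial boundaries."""
    return (linear_kernel(x, z) + coef0) ** degree

def rbf_kernel(x, z, gamma=0.5):
    """exp(-gamma * ||x - z||^2): similarity decays with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

# Two toy feature vectors (illustrative values only)
x = [1.0, 0.0]
z = [0.0, 1.0]

print("linear:    ", linear_kernel(x, z))
print("polynomial:", polynomial_kernel(x, z))
print("rbf:       ", rbf_kernel(x, z))
print("rbf (x, x):", rbf_kernel(x, x))
```

The classifier only ever needs these pairwise kernel values, not an explicit transformed feature vector, which is what keeps non-linear boundaries tractable.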

# Dataset

The dataset used in this notebook is [Reuters-21578](http://www.daviddlewis.com/resources/testcollections/reuters21578/), which is a widely used benchmark dataset for text classification tasks. The set consists of a collection of **news articles (a "corpus")** that are tagged with a selection of topics and geographic locations. Thus it comes "ready made" to be used in classification tests, since it is already **pre-labelled**.

This dataset should be available in the `dataset` folder; otherwise you can download it through [this link](http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz) and extract the archive.

![Document Folder](../../../images/docs.png)

You will see that all the files beginning with *reut2-* have the *.sgm* extension, which means that they are [SGML files](http://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language). Unfortunately, Python's `sgmllib` module was deprecated in Python 2.6 and removed entirely in Python 3. However, all is not lost, because we can create our own **SGML Parser class** by subclassing Python's built-in `HTMLParser`.

# HTMLParser

As stated above, our first goal is to create the **SGML Parser** that reads these files. To do this we will subclass Python's `HTMLParser` class to handle the specific tags in the Reuters dataset.

Upon subclassing HTMLParser we override three methods, `handle_starttag`, `handle_endtag` and `handle_data`, which tell the parser what to do at the beginning of SGML tags, what to do at the closing of SGML tags and how to handle the data in between.

We also create two additional methods, `_reset` and `parse`, which are used to take care of internal state of the class and to parse the actual data in a chunked fashion, so as not to use up too much memory.
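
Before the full parser, here is a minimal sketch of how the three overridden handlers fire. The `TagLogger` class and the sample SGML string are hypothetical and exist only for this illustration. Note that `HTMLParser` lowercases tag names, which is why the real parser compares against `"reuters"`, `"body"`, `"topics"` and `"d"`.

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Hypothetical helper: records every parser event in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag; "tag" arrives lowercased
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called for each closing tag
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for the text found between tags
        if data.strip():
            self.events.append(("data", data.strip()))

# A made-up fragment in the style of the Reuters files
sample = (
    "<REUTERS><TOPICS><D>cocoa</D></TOPICS>"
    "<BODY>Showers continued in Bahia.</BODY></REUTERS>"
)

logger = TagLogger()
logger.feed(sample)
logger.close()

for event in logger.events:
    print(event)
```

The real parser follows the same pattern, but instead of logging events it flips boolean flags on the start and end tags and accumulates text in `handle_data` while a flag is set.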

In [1]:
import html
from pprint import pprint
import re
from html.parser import HTMLParser

class ReutersParser(HTMLParser):
    """
    ReutersParser subclasses HTMLParser and is used to open the SGML
    files associated with the Reuters-21578 categorised test collection.

    The parser is a generator and will yield a single document at a time.
    Since the data will be chunked on parsing, it is necessary to keep 
    some internal state of when tags have been "entered" and "exited".
    Hence the in_body, in_topics and in_topic_d boolean members.
    """
    def __init__(self, encoding='latin-1'):
        """
        Initialise the superclass (HTMLParser) and reset the parser.
        Sets the encoding of the SGML files by default to latin-1.
        """
        super().__init__()
        self._reset()
        self.encoding = encoding

    def _reset(self):
        """
        This is called only on initialisation of the parser class
        and when a new topic-body tuple has been generated. It
        resets all of the state so that a new tuple can be subsequently
        generated.
        """
        self.in_body = False
        self.in_topics = False
        self.in_topic_d = False
        self.body = ""
        self.topics = []
        self.topic_d = ""

    def parse(self, fd):
        """
        parse accepts a file object opened in binary mode and loads
        the data in chunks (line by line) in order to minimise memory
        usage. It then yields new documents as they are parsed.
        """
        self.docs = []
        for chunk in fd:
            self.feed(chunk.decode(self.encoding))
            for doc in self.docs:
                yield doc
            self.docs = []
        self.close()

    def handle_starttag(self, tag, attrs):
        """
        This method is used to determine what to do when the parser
        comes across a particular tag of type "tag". In this instance
        we simply set the internal state booleans to True if that particular
        tag has been found.
        """
        if tag == "reuters":
            pass
        elif tag == "body":
            self.in_body = True
        elif tag == "topics":
            self.in_topics = True
        elif tag == "d":
            self.in_topic_d = True 

    def handle_endtag(self, tag):
        """
        This method is used to determine what to do when the parser
        finishes with a particular tag of type "tag". 

        If the tag is a <REUTERS> tag, then we remove all 
        white-space with a regular expression and then append the 
        topic-body tuple.

        If the tag is a <BODY> or <TOPICS> tag then we simply set
        the internal state to False for these booleans, respectively.

        If the tag is a <D> tag (found within a <TOPICS> tag), then we
        append the particular topic to the "topics" list and 
        finally reset it.
        """
        if tag == "reuters":
            self.body = re.sub(r'\s+', r' ', self.body)
            self.docs.append( (self.topics, self.body) )
            self._reset()
        elif tag == "body":
            self.in_body = False
        elif tag == "topics":
            self.in_topics = False
        elif tag == "d":
            self.in_topic_d = False
            self.topics.append(self.topic_d)
            self.topic_d = ""  

    def handle_data(self, data):
        """
        The data is simply appended to the appropriate member state
        for that particular tag, up until the end closing tag appears.
        """
        if self.in_body:
            self.body += data
        elif self.in_topic_d:
            self.topic_d += data

In [2]:
# Open the first Reuters data set and create the parser
path = "../../../resources/day_04/reuters21578/"
filename = "reut2-000.sgm"

parser = ReutersParser()

# Parse the document and force all generated docs into
# a list so that it can be printed out to the console.
# The "with" block closes the file once parsing is done.
with open(path + filename, 'rb') as fd:
    docs = list(parser.parse(fd))
pprint(docs)

[(['cocoa', 'el-salvador', 'usa', 'uruguay'],
  'Showers continued throughout the week in the Bahia cocoa zone, alleviating '
  'the drought since early January and improving prospects for the coming '
  'temporao, although normal humidity levels have not been restored, '
  'Comissaria Smith said in its weekly review. The dry period means the '
  'temporao will be late this year. Arrivals for the week ended February 22 '
  'were 155,221 bags of 60 kilos making a cumulative total for the season of '
  '5.93 mln against 5.81 at the same stage last year. Again it seems that '
  'cocoa delivered earlier on consignment was included in the arrivals '
  'figures. Comissaria Smith said there is still some doubt as to how much old '
  'crop cocoa is still available as harvesting has practically come to an end. '
  'With total Bahia crop estimates around 6.4 mln bags and sales standing at '
  'almost 6.2 mln there are a few hundred thousand bags still in the hands of '
  'farmers, middlemen, exp

  'guaranteed return on both interest and principal since no payment of any '
  'kind is made until the bond matures. It said a bank can sell the bonds on '
  'the secondary bond market for either dlrs or pesos depending on its '
  'requirement. The documents said peso proceeds can be invested in selected '
  'industries under the Philippines" debt/equity program. Ongpin said Manila '
  'is sticking to its demand of a spread of 5/8 percentage points over London '
  'Interbank Offered Rates (LIBOR) for restructuring 3.6 billion dlrs of debt '
  'repayments. "(The proposal) will give the banks a choice of 5/8ths or the '
  'alternative," Ongpin said. "Our representatives have gone to Washington to '
  'the (International Monetary) Fund, the (World) Bank, the Fed (Federal '
  'Reserve Board) and the (U.S.) Treasury to brief them in advance on this '
  'alternative and it has generally been positively received." "We don"t '
  'believe that there is going to be a problem on the accounting s

  'investment certificates on March 9. Company chairman Jean-Rene Fourtou said '
  '500 mln francs of the issue will be placed in the U.S. Details of the issue '
  'will be announced by Finance Minister Edouard Balladur on March 6. The '
  'group, due to be privatised at an unspecified date, said in January it was '
  'planning a capital increase to pursue its development strategy and make '
  'further acquisitions. Rhone-Poulenc shares were suspended from trading on '
  "the Paris Bourse last Thursday ahead of the capital increase. The group's "
  'capital currently stands at 4.03 billion francs. Fourtou, speaking at a '
  'news conference, did not give details of acquisitions the company planned '
  'for 1987. He said acquisitions in 1987 would complement an industrial '
  'investment program of around five billion francs, and research spending of '
  'about 3.5 billion francs. Rhone-Poulenc spent 5.5 billion francs on '
  'acquisitions last year. "Chemistry is on the move and we fac

  'Shr 30 cts vs 36 cts Net 1,914,388 vs 1,906,095 Sales 58.8 mln vs 40.7 mln '
  '1st half Shr 47 cts vs 53 cts Net 2,961,718 vs 2,817,439 Sales 107.7 mln vs '
  '74.9 mln Avg shrs 6,342,353 vs 5,342,353 Reuter '),
 ([],
  'Shr profit six cts vs profit eight cts Net profit 102,000 vs profit 151,000 '
  'Revs 4,846,000 vs 5,041,000 Avg shrs 1,725,744 vs 1,806,323 12 mths Shr '
  'loss 1.36 dlrs vs profit 56 cts Net loss 2,318,00 vs profit 789,000 Revs '
  '17.5 mln vs 20.9 mln Avg shrs 1,710,655 vs 1,404,878 Reuter '),
 ([],
  'Thousands of Spanish farmers battled police in this northeastern city '
  'during a march to demand a better deal from the EC, protest organisers '
  'said. The farmers traded stones for tear gas and rubber pellets and '
  'occupied local government buildings in Saragossa. In the southern city of '
  'Malaga, citrus growers dumped more than 20 tonnes of lemons on the streets '
  'to protest against duties levied by the EC against their exports. Spain '
  'joined

  "Volcker's reappointment to the Fed in mid-1983. The sources said Baker "
  'respects Volcker and when appointed Treasury Secretary in February 1985, he '
  'decided to ensure a good working relationship, in part because he believed '
  'the two key government economic institutions have to work closely. Regan, '
  "Treasury Secretary during President Reagan's first term, was formerly head "
  "of Wall Street's largest brokerage firm Merrill Lynch and came to "
  "Washington determined to be America's pre-eminent economic spokesman. He "
  'developed a deep antipathy for Volcker, whose political skills undermined '
  'that ambition, and who financial markets took much more seriously. But the '
  'sources said Volcker would have to be invited to stay. "Is the president '
  'going to ask him? he wouldn\'t stay otherwise," said one. "He\'d have to be '
  'asked," said Stephen Axilrod, formerly staff director of monetary policy at '
  'the Fed and now vice-chairman of Nikko Securities Co.

In particular, note that instead of having a single topic label associated with a document, we have **multiple topics**. In order to increase the effectiveness of the classifier, it is necessary to assign only a **single class label to each document**. However, you'll also note that some of the labels are actually **geographic location tags**, such as "japan" or "thailand". Since we are concerned solely with topics and not countries we want to remove these before we select our topic.

The particular method that we will use to carry this out is rather simple. We will strip out the country names and then select the first remaining topic on the list. If there are no associated topics we will eliminate the article from our corpus.

In [3]:
# Obtain all the topic tags

def obtain_topic_tags(filename):
    """
    Open the topic list file and import all of the topic names
    taking care to strip the trailing "\n" from each word.
    """
    # Use a context manager so the file is closed after reading
    with open(filename, "r") as f:
        topics = [line.strip() for line in f]
    return topics

In [4]:
topics = obtain_topic_tags(path + "all-topics-strings.lc.txt")

print(topics, "-> length:",len(topics))

['acq', 'alum', 'austdlr', 'austral', 'barley', 'bfr', 'bop', 'can', 'carcass', 'castor-meal', 'castor-oil', 'castorseed', 'citruspulp', 'cocoa', 'coconut', 'coconut-oil', 'coffee', 'copper', 'copra-cake', 'corn', 'corn-oil', 'cornglutenfeed', 'cotton', 'cotton-meal', 'cotton-oil', 'cottonseed', 'cpi', 'cpu', 'crude', 'cruzado', 'dfl', 'dkr', 'dlr', 'dmk', 'drachma', 'earn', 'escudo', 'f-cattle', 'ffr', 'fishmeal', 'flaxseed', 'fuel', 'gas', 'gnp', 'gold', 'grain', 'groundnut', 'groundnut-meal', 'groundnut-oil', 'heat', 'hk', 'hog', 'housing', 'income', 'instal-debt', 'interest', 'inventories', 'ipi', 'iron-steel', 'jet', 'jobs', 'l-cattle', 'lead', 'lei', 'lin-meal', 'lin-oil', 'linseed', 'lit', 'livestock', 'lumber', 'lupin', 'meal-feed', 'mexpeso', 'money-fx', 'money-supply', 'naphtha', 'nat-gas', 'nickel', 'nkr', 'nzdlr', 'oat', 'oilseed', 'orange', 'palladium', 'palm-meal', 'palm-oil', 'palmkernel', 'peseta', 'pet-chem', 'platinum', 'plywood', 'pork-belly', 'potato', 'propane', 'r

In [5]:
def filter_doc(topics, docs):
    """
    Reads all of the documents and creates a new list of two-tuples
    that contain a single topic label and the body text, instead of
    a list of topics. It removes all geographic tags and only
    retains those documents which have at least one non-geographic
    topic. The first non-geographic topic found becomes the label.
    """
    ref_docs = []
    for d in docs:
        if d[0] == [] or d[0] == "":
            continue
        for t in d[0]:
            if t in topics:
                d_tup = (t, d[1])
                ref_docs.append(d_tup)
                break
    return ref_docs

In [6]:
with open(path + "reut2-000.sgm", 'rb') as fd:
    docs = list(parser.parse(fd))

ref_docs = filter_doc(topics, docs)

pprint(ref_docs)

[('cocoa',
  'Showers continued throughout the week in the Bahia cocoa zone, alleviating '
  'the drought since early January and improving prospects for the coming '
  'temporao, although normal humidity levels have not been restored, '
  'Comissaria Smith said in its weekly review. The dry period means the '
  'temporao will be late this year. Arrivals for the week ended February 22 '
  'were 155,221 bags of 60 kilos making a cumulative total for the season of '
  '5.93 mln against 5.81 at the same stage last year. Again it seems that '
  'cocoa delivered earlier on consignment was included in the arrivals '
  'figures. Comissaria Smith said there is still some doubt as to how much old '
  'crop cocoa is still available as harvesting has practically come to an end. '
  'With total Bahia crop estimates around 6.4 mln bags and sales standing at '
  'almost 6.2 mln there are a few hundred thousand bags still in the hands of '
  'farmers, middlemen, exporters and processors. There are do

 ('acq',
  'Investor David F. La Roche of North Kingstown, R.I., said he is offering to '
  'purchase 170,000 common shares of NECO Enterprises Inc at 26 dlrs each. He '
  'said the successful completion of the offer, plus shares he already owns, '
  "would give him 50.5 pct of NECO's 962,016 common shares. La Roche said he "
  'may buy more, and possible all NECO shares. He said the offer and '
  'withdrawal rights will expire at 1630 EST/2130 gmt, March 30, 1987. '
  'Reuter '),
 ('earn',
  'Period ended December 31, 1986 Oper shr loss 1.08 dlrs vs loss 84 cts Oper '
  'loss 7,700,000 vs loss 1,700,000 Revs 11,800,000 vs 9,800,000 Note: Current '
  'shr and net exclude extraordinary gain of 300,000 dlrs or five cts shr, '
  'versus extraordinary gain of 200,000 dlrs or four cts shr Reuter '),
 ('acq',
  '<Senior Engineering Group Plc> said it reached agreement with <Cronus '
  'Industries Inc> to acquire the whole share capital of <South Western '
  'Engineering Co> for 12.5 mln dlrs

  'financing is very difficult," she said. "But I am looking at it in terms of '
  'the economy." She said she was not trying to oppose official policy. "I\'m '
  'just saying, keep it competitive. I do not want it to become uncompetitive '
  'because then we are dead." Monsod said, "The ideal movement in the '
  'peso/dollar rate is a movement that will reflect differences in inflation '
  "(rates) of the Philippines versus the other country. It's an arithmetic "
  'thing." Official figures show Philippine inflation averaged 0.8 pct in '
  'calendar 1986. Ongpin told reporters on Saturday it was expected to touch '
  'five pct this year. He said the government and the International Monetary '
  'Fund had set the peso/dollar 1987 target rate at 20.80. The peso lost 22.2 '
  'pct in value to slump to 18.002 to the dollar when it was floated in 1984. '
  'REUTER '),
 ('acq',
  'The U.K. Trade Department said it would not refer Consolidated Goldfields '
  "Plc's <CGLD.L> purchase of <Amer

# Term-Frequency Inverse Document-Frequency (TF-IDF)

The TF-IDF value for a token increases in proportion to the **frequency of the word in the document**, but is offset by the **frequency of the word in the corpus**. This reduces the importance of words that appear often everywhere, as opposed to words that appear often within a particular document.

This is precisely what we need, as words such as "a" and "the" will have extremely high occurrences within the entire corpus, but the word "cat" may only appear often in a particular document. This means that we give "cat" a relatively higher weight than "a" or "the" for that document.
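
The effect can be checked by hand on a tiny corpus. The sketch below uses the plain textbook form, tf × log(N / df), with made-up sentences. It is an illustration only: scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalisation by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Made-up three-document corpus (illustrative only)
toy_corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bank raised the interest rate",
]

tokenised = [doc.split() for doc in toy_corpus]
n_docs = len(tokenised)

# Document frequency: in how many documents does each token appear?
doc_freq = Counter(tok for doc in tokenised for tok in set(doc))

def tf_idf(token, doc_tokens):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    tf = doc_tokens.count(token) / len(doc_tokens)
    idf = math.log(n_docs / doc_freq[token])
    return tf * idf

# Score every distinct token of the first document
scores = {tok: tf_idf(tok, tokenised[0]) for tok in set(tokenised[0])}

for tok, value in sorted(scores.items(), key=lambda kv: kv[1]):
    print("%-4s %.4f" % (tok, value))
```

Here "the" occurs in every document, so its IDF is log(1) = 0 and it contributes nothing, while "mat", which appears in only one document, receives the largest weight.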

Hence we wish to combine the **process of vectorisation with that of TF-IDF** to produce a normalised matrix of document-token occurrences. This will then be used to provide a list of features to the classifier upon which to train.

Thankfully, the developers at **scikit-learn** realised that it would be an extremely common operation to vectorise and transform text files in this manner and so included the `TfidfVectorizer` class.

We can use this class to take our list of two-tuples representing class labels and raw document text, to produce both a vector of class labels and a sparse matrix, which represents the TF-IDF and Vectorisation procedure applied to the raw text data.

Since scikit-learn classifiers take two separate data structures for training, namely, $y$, the vector of class labels or "responses" associated with an ordered set of documents, and, $X$, the sparse TF-IDF matrix of raw document text, we modify our two-tuple list to create $y$ and $X$. 

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer

def create_tfidf_training_data(docs):
    """
    Creates a document corpus list (by stripping out the
    class labels), then applies the TF-IDF transform to this
    list. 

    The function returns both the class label vector (y) and 
    the corpus token/feature matrix (X).
    """
    # Create the training data class labels
    y = [d[0] for d in docs]

    # Create the document corpus list
    corpus = [d[1] for d in docs]

    # Create the TF-IDF vectoriser and transform the corpus
    vectorizer = TfidfVectorizer(min_df=1)
    X = vectorizer.fit_transform(corpus)
    return X, y

In [8]:
# Vectorise and TF-IDF transform the corpus 
X, y = create_tfidf_training_data(ref_docs)

In [9]:
print(X)

  (0, 4978)	0.006505782293080079
  (0, 3849)	0.03659649179315834
  (0, 2450)	0.03246621749997353
  (0, 6179)	0.015354664765474238
  (0, 1530)	0.03488227307917922
  (0, 910)	0.018268790703293826
  (0, 1735)	0.023287127212758878
  (0, 5859)	0.021957475878191645
  (0, 1389)	0.03488227307917922
  (0, 1467)	0.013365801761576927
  (0, 4618)	0.030751998785994415
  (0, 2585)	0.0184550160543475
  (0, 2723)	0.0270062918722215
  (0, 703)	0.025919887627583044
  (0, 176)	0.018177816016485215
  (0, 37)	0.029422347451427192
  (0, 694)	0.025597715857460346
  (0, 175)	0.01471561800960256
  (0, 81)	0.018177816016485215
  (0, 2519)	0.024459270004817787
  (0, 2018)	0.02785899062423066
  (0, 238)	0.05684326005422488
  (0, 346)	0.03246621749997353
  (0, 351)	0.03488227307917922
  (0, 302)	0.03246621749997353
  :	:
  (511, 145)	0.0588335304973607
  (511, 3985)	0.20264324226946331
  (511, 4978)	0.01833917048771844
  (511, 1467)	0.07535380256155447
  (511, 238)	0.053411882448067945
  (511, 2268)	0.065746164192

At this stage we have two components to our training data. The first, $X$, is a **matrix of TF-IDF-weighted document-token occurrences**. The second, $y$, is a vector (which matches the ordering of the matrix rows) that contains the **correct class labels** for each of the documents. This is all we need to begin training and testing the **Support Vector Machine**.

# Train-Test Split

One question that arises here is what percentage to retain for training and what to use for testing. Clearly the more that is retained for training, the "better" the classifier will be because it will have seen more data. However, more training data means less testing data and as such a poorer estimate of its true classification capability. In practice, it is common to retain about **70-80%** of the data for training and use the remainder for testing.

Since the **training-test split** is such a common operation in machine learning, the developers of scikit-learn provided the `train_test_split` function to automatically create the split from a dataset provided.

The `test_size` keyword argument controls the **size of the testing set**, in this case 20%. The `random_state` keyword argument controls the **random seed** for selecting the partition randomly.
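
Conceptually, the split shuffles the document indices with a seeded random number generator and then cuts the shuffled list in two. The sketch below shows that idea using only the standard library. `simple_train_test_split` is a hypothetical helper written for this illustration; it does not reproduce scikit-learn's exact shuffling or rounding.

```python
import random

def simple_train_test_split(items, labels, test_size=0.2, random_state=42):
    """Hypothetical helper: shuffle indices with a fixed seed, then cut."""
    indices = list(range(len(items)))
    # A dedicated Random instance keeps the split reproducible
    random.Random(random_state).shuffle(indices)

    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    # Items and labels are selected with the same indices, so they stay aligned
    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy data (illustrative only)
docs_toy = ["doc%d" % i for i in range(10)]
labels_toy = ["earn" if i % 2 == 0 else "acq" for i in range(10)]

X_tr, X_te, y_tr, y_te = simple_train_test_split(docs_toy, labels_toy)
print(len(X_tr), "training documents,", len(X_te), "testing documents")
```

Fixing the seed matters because it makes the reported hit-rate repeatable from one run to the next.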

In [10]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
  X, y, test_size=0.2, random_state=42
)

# Support Vector Machine (SVM)

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection. `SVC`, `NuSVC` and `LinearSVC` are classes capable of performing binary and multi-class classification on a dataset.

In this instance we are going to use the `SVC` (Support Vector Classifier) class from scikit-learn.

![SVC](../../../images/SVC.png)

There are a few important parameters in configuring SVC:

* **C** (default=1.0): **Regularization parameter**. The strength of the regularization is inversely proportional to C. Must be strictly positive.


* **kernel** (default='rbf'): Specifies the kernel type to be used in the algorithm.


* **gamma** (default='scale'): Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.
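
The two string options for `gamma` can be worked out by hand. According to the scikit-learn documentation, `gamma='auto'` uses `1 / n_features`, while `gamma='scale'` uses `1 / (n_features * X.var())`. The sketch below computes both for a small made-up matrix using only the standard library; `X_toy` and its values are assumptions for illustration.

```python
from statistics import pvariance

# Made-up dense feature matrix: 3 documents, 4 features
X_toy = [
    [0.0, 0.5, 0.0, 0.1],
    [0.2, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.9, 0.0],
]

n_features = len(X_toy[0])

# Variance over every entry of the matrix (population variance)
all_values = [value for row in X_toy for value in row]
overall_variance = pvariance(all_values)

gamma_auto = 1.0 / n_features
gamma_scale = 1.0 / (n_features * overall_variance)

print("gamma='auto'  ->", gamma_auto)
print("gamma='scale' ->", gamma_scale)
```

With thousands of TF-IDF features, `gamma='auto'` becomes very small, which gives a very smooth RBF kernel. That may be one reason a large `C` is paired with it in the training cell that follows.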

In [11]:
from sklearn.svm import SVC

def train_svm(X, y):
    """
    Create and train the Support Vector Machine.
    """
    svm = SVC(C=1000000.0, gamma='auto', kernel='rbf')
    svm.fit(X, y)
    return svm

In [12]:
# Create and train the Support Vector Machine
svm = train_svm(X_train, y_train)

Now that the SVM has been trained we need to assess its performance on the *testing* data.

# Performance Metrics

The two main performance metrics that we will consider for this supervised classifier are the **hit-rate** and the **confusion matrix**. The former is simply the ratio of correct assignments to total assignments and is usually quoted as a percentage.

The confusion matrix goes into more detail and provides output on **true-positives**, **true-negatives**, **false-positives** and **false-negatives**. In a binary classification system, with a "true" or "false" class labelling, these characterise the rate at which the classifier correctly classifies something as true or false when it is, respectively, true or false, and also incorrectly classifies something as true or false when it is, respectively, false or true.

![Confusion Matrix](../../../images/confusion_matrix.png)

A confusion matrix need not be restricted to a binary classifier situation. For multiple class groups (as in our situation with the Reuters dataset) we will have an $N \times N$ matrix, where $N$ is the number of class labels (or document topics).
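
Both metrics are simple to compute by hand, which helps when reading the output later. The sketch below builds a 3 × 3 confusion matrix and the hit-rate for six made-up predictions, using only the standard library. Rows are indexed by true label and columns by predicted label, which is the layout of scikit-learn's `confusion_matrix(y_true, y_pred)`. All names and values here are illustrative.

```python
# Made-up true and predicted labels (illustrative only)
toy_true = ["acq", "acq", "earn", "earn", "earn", "cocoa"]
toy_pred = ["acq", "earn", "earn", "earn", "acq", "cocoa"]

# Fix an ordering of the class labels so rows and columns line up
class_labels = sorted(set(toy_true) | set(toy_pred))
position = {label: i for i, label in enumerate(class_labels)}

# matrix[i][j] counts documents of true class i predicted as class j
matrix = [[0] * len(class_labels) for _ in class_labels]
for true_label, predicted_label in zip(toy_true, toy_pred):
    matrix[position[true_label]][position[predicted_label]] += 1

# The hit-rate is the sum of the diagonal divided by the number of documents
correct = sum(matrix[i][i] for i in range(len(class_labels)))
hit_rate = correct / len(toy_true)

print(class_labels)
for row in matrix:
    print(row)
print("hit-rate:", hit_rate)
```

Off-diagonal entries show which pairs of topics are confused with one another, something the single hit-rate figure cannot reveal.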

Scikit-learn provides tools for calculating both the hit-rate and the confusion matrix of a supervised classifier. The former is a method on the classifier itself called `score`. The latter is the `confusion_matrix` function, which must be imported from the `sklearn.metrics` module.

The first task is to create a *predictions* array from the `X_test` test set. This will simply contain the class labels that the SVM predicts for the retained 20% test set. This prediction array is used to create the confusion matrix. Notice that the `confusion_matrix` function takes both the `pred` predictions array and the `y_test` correct class labels to produce the matrix. In addition, we compute the hit-rate by providing `score` with both the `X_test` and `y_test` subsets of the dataset.

In [13]:
from sklearn.metrics import confusion_matrix

# Make an array of predictions on the test set
pred = svm.predict(X_test)

In [14]:
# Output the hit-rate, score
print(svm.score(X_test, y_test))

0.6601941747572816


In [15]:
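# Output confusion matrix
# Note: scikit-learn's signature is confusion_matrix(y_true, y_pred).
# With the arguments in this order, rows correspond to predicted
# labels and columns to true labels.
print(confusion_matrix(pred, y_test))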

[[21  0  0  0  2  3  0  0  0  1  0  0  0  0  1  1  1  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  4  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  1  0  0  1 26  0  0  0  1  0  1  0  1  0  0  0  0  0]
 [ 0  0  0  0  0  0  2  0  0  0  0  0  0  0  0  0  0  0  1]
 [ 0  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  1  0  0  0  0  3  0  0  0  0  0  0  0  0  0]
 [ 3  0  0  1  2  2  3  0  1  1  6  0  1  0  0  0  2  3  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1  0  0  0]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0

Thus we have a 66% classification hit rate, with a confusion matrix that has entries mainly on the diagonal (i.e. the correct assignment of class label). Notice that since we are only using a **single file from the Reuters set (number 000)**, we aren't going to see the entire set of class labels and hence our confusion matrix is smaller in dimension than if we had used the full dataset.

# Training on Full Dataset

Now we are going to use all of the functions above to train on the full Reuters dataset. We will load all 22 Reuters files (`reut2-000.sgm` to `reut2-021.sgm`), train the SVM on the full dataset and then output the hit-rate.

**Note**: The confusion matrix is not output, as it becomes large given the total number of class labels across all documents.

In [16]:
# Create a list of all Reuters filenames (reut2-000.sgm to reut2-021.sgm)
files = [path + "reut2-%03d.sgm" % r for r in range(0, 22)]

In [17]:
# Initialize parser
parser = ReutersParser()

In [18]:
# Parse all the documents, closing each file once it has been read
docs = []
for fn in files:
    with open(fn, 'rb') as fd:
        docs.extend(parser.parse(fd))

In [19]:
# Obtain the topic tags and filter docs through it 
topics = obtain_topic_tags(path + "all-topics-strings.lc.txt")
ref_docs = filter_doc(topics, docs)

In [20]:
# Vectorise and TF-IDF transform the corpus 
X, y = create_tfidf_training_data(ref_docs)

In [21]:
# Create the training-test split of the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

In [22]:
# Create and train the Support Vector Machine
svm = train_svm(X_train, y_train)

In [23]:
# Make an array of predictions on the test set
pred = svm.predict(X_test)

In [24]:
# Output the hit-rate
print(svm.score(X_test, y_test))

0.8359718557607739


For the full corpus, the hit rate is **83.6%**.

# Extension

There are plenty of ways to improve on this figure. In particular we can perform a **Grid Search Cross-Validation**, which is a means of determining the optimal parameters for the classifier that will achieve the best hit-rate (or other metric of choice).
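
A minimal sketch of such a search is shown below, using scikit-learn's `GridSearchCV` on a tiny made-up corpus so that it runs in seconds. The toy documents, the parameter grid values and the choice of `cv=2` are assumptions for illustration, not tuned recommendations. Wrapping `TfidfVectorizer` and `SVC` in a `Pipeline` means the TF-IDF weights are refitted on the training portion of each fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Made-up miniature corpus with two topics (illustrative only)
toy_docs = [
    "cocoa crop harvest bags", "cocoa farmers export bags",
    "cocoa beans harvest season", "cocoa exporters crop sales",
    "shares acquisition offer company", "company to acquire shares",
    "merger offer for company shares", "acquisition of shares completed",
]
toy_labels = ["cocoa"] * 4 + ["acq"] * 4

# Vectorisation and classification chained into one estimator
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC()),
])

# Candidate values to try; "svm__" routes each entry to the SVC step
param_grid = {
    "svm__C": [1.0, 100.0],
    "svm__kernel": ["linear", "rbf"],
    "svm__gamma": ["scale", "auto"],
}

# Every combination is scored with 2-fold cross-validation
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(toy_docs, toy_labels)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

On the real corpus the same pattern would apply to the list of document bodies and topic labels in `ref_docs`, although the search would take considerably longer with an RBF kernel.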

# Contributors

**Author**
<br>Chee Lam

# References

1. [Supervised Learning for Document Classification with Scikit-Learn](https://www.quantstart.com/articles/Supervised-Learning-for-Document-Classification-with-Scikit-Learn/)
2. [Support Vector Classifier from Sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)