# Supervised Learning of Keywords using fastText
## Text Classification
The goal of text classification is to assign documents (such as emails, posts, text messages, or product reviews) to one or more categories. Such categories can be review scores, spam vs. non-spam, or the language in which the document was written. Nowadays, the dominant approach to building such classifiers is machine learning, that is, learning classification rules from examples. To build such classifiers, we need labeled data, which consists of documents and their corresponding categories (or tags, or labels).

As an example, we build a classifier which automatically classifies ONS publications by their supplied keywords.

In [1]:
import fastText
import gensim
import os

In [2]:
if not os.path.isdir("./models/"):
    os.makedirs("./models/")

# Corpus
To create the text corpus, we load articles and bulletins published on the ONS website. The final corpus consists of sentences found in the page summaries and markdown sections, tagged with the keywords provided with each page.

For the purpose of this notebook, we have stored all articles and bulletins in a local MongoDB database for painless retrieval. Let's define some functions for loading the pages as JSON below.

In [3]:
def getMongoDBClient(mongoUrl, port):
    from pymongo import MongoClient
    return MongoClient(mongoUrl, port)

def load_pages(use_mongo=True):
    # Load pages from disk/mongo
    pages = None
    # Parenthesise the conditional so the prefix is printed in both cases
    print "Loading pages from " + ("mongoDB" if use_mongo else "filesystem")
    if use_mongo:
        mongoClient = getMongoDBClient("localhost", 27017)
        collection = mongoClient.local.pages

        query = {
            "sections": {
                "$exists": True
            }
        }
        cursor = collection.find(query)

        pages = []
        for doc in cursor:
            pages.append(doc)
    else:
        # Read from filesystem
        from modules.ONS.file_scanner import FileScanner
        scanner = FileScanner()
        pages = scanner.load_pages()
    print "Done"
    return pages

In [4]:
pages = load_pages(use_mongo=True)
print "Loaded %d pages" % (len(pages))

Loading pages from mongoDB
Done
Loaded 1960 pages
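
The query used in load_pages keeps only documents that have a "sections" field. For readers without a MongoDB instance, the same filter can be sketched over a plain list of dictionaries. The sample pages and the helper name has_sections are made up for illustration and are not part of the ONS data.

```python
# Hypothetical stand-ins for documents returned by the pages collection.
sample_pages = [
    {"uri": "/a", "sections": [{"markdown": "Some text."}]},
    {"uri": "/b"},                  # no "sections" field, so it is filtered out
    {"uri": "/c", "sections": []},  # the field exists (even though empty), so it is kept
]

def has_sections(doc):
    # Mirrors {"sections": {"$exists": True}}: only the presence of the key matters.
    return "sections" in doc

filtered = [doc for doc in sample_pages if has_sections(doc)]
print([doc["uri"] for doc in filtered])
```

Note that a page with an empty "sections" list still passes this filter; such pages are only dropped later if they yield no sentences.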


# Text processing
To give our model the best chance of making accurate classifications, we should clean up the raw text and keywords, for example by removing stop words. Below we define some basic utility functions to clean raw text, then process the pages we just loaded.

In [5]:
from gensim.utils import lemmatize
from nltk.corpus import stopwords

def get_stopwords():
    return set(stopwords.words('english'))  # nltk stopwords list

def get_bigram(train_texts):
    import gensim
    bigram = gensim.models.Phrases(train_texts)  # for bigram collocation detection
    return bigram

def build_texts_from_file(fname):
    """
    Function to build tokenized texts from file
    
    Parameters:
    ----------
    fname: File to be read
    
    Returns:
    -------
    yields preprocessed line
    """
    import gensim
    with open(fname) as f:
        for line in f:
            yield gensim.utils.simple_preprocess(line, deacc=True, min_len=3)

def build_texts_as_list_from_file(fname):
    return list(build_texts_from_file(fname))

def build_texts(texts):
    """
    Function to build tokenized texts from an iterable of strings
    
    Parameters:
    ----------
    texts: Iterable of raw text lines
    
    Returns:
    -------
    yields preprocessed line
    """
    import gensim
    for line in texts:
        yield gensim.utils.simple_preprocess(line, deacc=True, min_len=3)

def build_texts_as_list(texts):
    return list(build_texts(texts))

def process_texts(texts, stops=get_stopwords()):
    """
    Function to process texts. Following are the steps we take:
    
    1. Stopword Removal.
    2. Collocation detection.
    3. Lemmatization (not stem since stemming can reduce the interpretability).
    
    Parameters:
    ----------
    texts: Tokenized texts.
    
    Returns:
    -------
    texts: Pre-processed tokenized texts.
    """
    bigram = get_bigram(texts)

    texts = [[word for word in line if word not in stops] for line in texts]
    texts = [bigram[line] for line in texts]
    
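    from nltk.stem import WordNetLemmatizer
    lemmatizer = WordNetLemmatizer()

    # NOTE: lemmatize() is called once on the whole joined line rather than per token,
    # so in practice tokens pass through unchanged (compare the sample line printed later,
    # which still contains 'working' and 'estimates').
    texts = [[word for word in lemmatizer.lemmatize(' '.join(line), pos='v').split()] for line in texts]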
    return texts
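
The tokenisation and stop word removal steps of process_texts can be illustrated without gensim or NLTK. The tiny stop list and the regular-expression tokenizer below are simplifications chosen for this sketch; they are not the NLTK stop word list or gensim's simple_preprocess, and the names toy_tokenize and remove_stopwords are hypothetical.

```python
import re

# Assumption: a tiny illustrative stop list, far smaller than NLTK's English list.
TOY_STOPWORDS = {"the", "and", "for", "was", "with"}

def toy_tokenize(line, min_len=3):
    # Lowercase, keep alphabetic runs, and drop tokens shorter than min_len
    # (roughly what simple_preprocess(min_len=3) does, minus accent stripping).
    return [tok for tok in re.findall(r"[a-z]+", line.lower()) if len(tok) >= min_len]

def remove_stopwords(tokens, stops=TOY_STOPWORDS):
    # Keep only tokens that are not in the stop list.
    return [tok for tok in tokens if tok not in stops]

tokens = toy_tokenize("The unemployment rate for the UK was 4.3%")
print(remove_stopwords(tokens))
```

With min_len=3, short tokens such as "uk" are discarded, which is worth keeping in mind when keywords themselves are short.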

In [7]:
import markdown, re
from string import punctuation
from bs4 import BeautifulSoup
from nltk.tokenize import sent_tokenize

delimiter = ","
pattern = r"[, \-!?:]+"
fix_encoding = lambda s: s.decode('utf8', 'ignore')

def markdown_to_text(md):
    extensions = ['extra', 'smarty']
    html = markdown.markdown(md, extensions=extensions, output_format='html5')
    soup = BeautifulSoup(html, "lxml")
    return soup.text

def process(text):
    content = []
    if len(text) > 0:
        sentences = sent_tokenize(text)
        for sentence in sentences:
            if (len(filter(None, re.split(pattern, sentence))) > 10):
                content.append(fix_encoding(sentence.encode("utf-8").strip()))
    return content

def parse_page(page):
    description = page["description"]
    keywords = description["keywords"]
                   
    sentences = []
    if "summary" in description:
        sentences.extend(process(description["summary"]))
    if "sections" in page:
        for section in page["sections"]:
            if "markdown" in section:
                # Use a local name that does not shadow the imported markdown module
                md = section["markdown"]
                text = markdown_to_text(md)
                sentences.extend(process(text))
                   
    # Collect list of unique, clean keywords
    labels = []
    for entry in keywords:
        entry = entry.strip().lower()
        # Replace spaces in keywords with '_', so the classifier identifies them as a single 'word'
        entry = re.sub(r'\s+', '_', entry)
        labels.append(entry)
    
    if len(labels) > 0 and len(sentences) > 0:
        return {"sentences": sentences, "labels": labels}
    else:
        return None

data = []
for page in pages:
    if "description" in page and "keywords" in page["description"] and len(page["description"]["keywords"]) > 0:
        d = parse_page(page)
        if d is not None:
            data.append(d)

print len(data)

1787
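
The keyword clean-up in parse_page (trim, lowercase, join words with an underscore) can be checked in isolation. This is a minimal sketch; the function name normalise_keyword and the sample keywords are invented for illustration.

```python
import re

def normalise_keyword(entry):
    # Same three steps as in parse_page: trim, lowercase, join words with '_'.
    return re.sub(r"\s+", "_", entry.strip().lower())

print(normalise_keyword("  Labour  Market "))
print(normalise_keyword("GDP"))
```

Joining multi-word keywords with an underscore means each keyword reaches the classifier as a single token.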


# Writing the training set
The final step is to write the training data out in the format that fastText expects. We prepend a prefix ("__label__") to each label so that the classifier can tell labels apart from the words of the text.

In [16]:
stops = get_stopwords()

def label_sentences(sentences, labels, label_prefix="__label__"):
    # Clean up the labels by removing stop words etc.
    labels = build_texts_as_list(labels)
    # Filter out empty keywords
    labels = filter(None, process_texts(labels, stops=stops))
    
    sentences = build_texts_as_list(sentences)
    sentences = filter(None, process_texts(sentences, stops=stops))
    
    labels = set( ["%s%s" % (label_prefix, l[0]) for l in labels] )
    joined_labels = " ".join(labels)
    
    labelled_sentences = []
    for sentence in sentences:
        text = " ".join(sentence)
#         labelled_sentences.append( "%s %s" % (joined_labels, text) )
        labelled_sentences.append( (joined_labels, text) )

    return labelled_sentences

lines = []
for d in data:
    labelled_sentences = label_sentences(d["sentences"], d["labels"])
    lines.extend(labelled_sentences)
    
# Shuffle the list
import random
random.shuffle(lines)

In [17]:
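# Prefix of the labelled files written by write_corpus(lines, "ons"):
# ons_labelled.txt.train and ons_labelled.txt.valid
corpus_fname = "ons_labelled.txt"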
print lines[0]

(u'__label__work', u'working uncertain estimates general changes numbers especially rates reported statistical_bulletin month periods small usually greater level explainable sampling variability')
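
Each entry in lines is a (labels, text) tuple, and fastText's supervised mode reads one example per line, with the prefixed labels and the text on the same line. The sketch below shows that round trip on a made-up example; the helper names to_fasttext_line and split_fasttext_line are hypothetical.

```python
def to_fasttext_line(labels, text):
    # labels: space-separated, already-prefixed labels; text: space-separated tokens
    return "%s %s" % (labels, text)

def split_fasttext_line(line, label_prefix="__label__"):
    # Separate the prefixed labels from the ordinary words of a line.
    tokens = line.split()
    labels = [t[len(label_prefix):] for t in tokens if t.startswith(label_prefix)]
    words = [t for t in tokens if not t.startswith(label_prefix)]
    return labels, words

line = to_fasttext_line("__label__gdp __label__economy", "output grew quarter")
print(line)
print(split_fasttext_line(line))
```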


# Training and Validation Data
To test the accuracy of our model, we hold back part of the data as a validation dataset by splitting the corpus into two files, with extensions ".train" and ".valid".

In [21]:
import numpy as np

def write_corpus(corpus, prefix):
    """
    Splits the corpus into training (.train) and validation (.valid) datasets
    """
    random.shuffle(corpus)
    
    size_train = int(np.round(len(corpus) * (3./4.)))
    train_corpus = corpus[:size_train]
    valid_corpus = corpus[size_train:]
    
    extensions = ["train", "valid"]
    corpora = [train_corpus, valid_corpus]
    
    # Use a separate loop variable so the corpus argument is not overwritten
    for ext, subset in zip(extensions, corpora):
        modes = ["labelled", "unlabelled"]
        fnames = ["%s_%s.txt.%s" % (prefix, mode, ext) for mode in modes]
        for fname, mode in zip(fnames, modes):
            with open(fname, "w") as f:
                for line in subset:
                    label, text = line
                    s = None
                    if mode == "labelled":
                        s = "%s %s" % (label, text)
                    elif mode == "unlabelled":
                        s = text
                    if len(s) > 0:
                        s = re.sub(r'\s+', ' ', s.encode("ascii", "ignore")).strip()
                        f.write("%s\n" % s)

# if os.path.isfile(corpus_fname) is False:
write_corpus(lines, "ons")
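
The 3:1 split performed by write_corpus can be sketched on its own. The version below is an illustration, not the notebook's function: it takes a seed so the split is reproducible and returns the two parts instead of writing files. The name split_corpus and its parameters are assumptions made for this example.

```python
import random

def split_corpus(corpus, train_fraction=0.75, seed=0):
    # Copy before shuffling so the caller's list is left untouched.
    shuffled = list(corpus)
    random.Random(seed).shuffle(shuffled)
    size_train = int(round(len(shuffled) * train_fraction))
    return shuffled[:size_train], shuffled[size_train:]

# Toy corpus of (label, text) tuples in the same shape as `lines`.
toy_corpus = [("__label__x", "sentence %d" % i) for i in range(8)]
train, valid = split_corpus(toy_corpus)
print(len(train), len(valid))
```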

In [23]:
def which(file):
    # Return the full path of the first match for `file` on $PATH, or None
    import os
    for path in os.environ["PATH"].split(os.pathsep):
        candidate = os.path.join(path, file)
        if os.path.exists(candidate):
            return candidate

    return None

def train_models(fastText_mode, corpus_file, output_name, models_dir):
    # fastText training params
    lr = 0.05
    dim = 300
    ws = 5
    epoch = 5
    minCount = 5
    neg = 5
    loss = 'ns'
    t = 1e-4

    from gensim.models import Word2Vec
    from gensim.models.word2vec import Text8Corpus

    # Same values as used for fastText training below
    params = {
        'alpha': lr,
        'size': dim,
        'window': ws,
        'iter': epoch,
        'min_count': minCount,
        'sample': t,
        'sg': 1,
        'hs': 0,
        'negative': neg
    }

    # Check for fasttext binary in path
    fasttext = which("fasttext")
    if (fasttext is None):
        raise Exception("Unable to locate fasttext binary in $PATH")

    # Generate the models

    # fastText with ngrams
    output_file = '{:s}_ft'.format(output_name)
    print('Training fasttext (mode={:s}) on {:s} corpus..'.format(fastText_mode, corpus_file))
    exe = "{fasttext} {mode} -input {corpus_file} -output {output}  -lr {lr} -dim {dim} -ws {ws} -epoch {epoch} -minCount {minCount} -neg {neg} -loss {loss} -t {t}"
    exe = exe.format(fasttext=fasttext, mode=fastText_mode, corpus_file=corpus_file, output=models_dir+output_file, lr=lr, dim=dim, ws=ws, epoch=epoch, minCount=minCount, neg=neg, loss=loss, t=t)
    os.system(exe)
        
    # fastText with NO ngrams
    output_file = '{:s}_ft_no_ng'.format(output_name)
    print('\nTraining fasttext (mode={:s}) on {:s} corpus (without char n-grams)..'.format(fastText_mode, corpus_file))
    exe = "{fasttext} {mode} -input {corpus_file} -output {output}  -lr {lr} -dim {dim} -ws {ws} -epoch {epoch} -minCount {minCount} -neg {neg} -loss {loss} -t {t} -maxn 0"
    exe = exe.format(fasttext=fasttext, mode=fastText_mode, corpus_file=corpus_file, output=models_dir+output_file, lr=lr, dim=dim, ws=ws, epoch=epoch, minCount=minCount, neg=neg, loss=loss, t=t)
    os.system(exe)
        
    # Word2Vec
    output_file = '{:s}_gs'.format(output_name)
    print('\nTraining word2vec on {:s} corpus..'.format(corpus_file))
    
    # Text8Corpus class for reading space-separated words file
    gs_model = Word2Vec(Text8Corpus(corpus_file), **params)
    # Save the vectors in word2vec text format alongside the fastText models
    gs_model.wv.save_word2vec_format(os.path.join(models_dir, '{:s}.vec'.format(output_file)))
    print('\nSaved gensim model as {:s}.vec'.format(output_file))
    
train_models("skipgram", "ons_unlabelled.txt.train", "ons", "models/")

Training fasttext (mode=skipgram) on ons_unlabelled.txt.train corpus..

Training fasttext (mode=skipgram) on ons_unlabelled.txt.train corpus (without char n-grams)..

Training word2vec on ons_unlabelled.txt.train corpus..

Saved gensim model as ons_gs.vec
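
train_models builds each fasttext command as a single string and runs it with os.system, which does not raise if the command fails. One possible alternative, sketched below, assembles the same flags as an argument list that could be passed to subprocess.check_call. The helper name build_fasttext_command is hypothetical; the flags are the ones already used in train_models.

```python
def build_fasttext_command(binary, mode, corpus_file, output, params, use_ngrams=True):
    # Start with the executable, the training mode and the input/output paths.
    cmd = [binary, mode, "-input", corpus_file, "-output", output]
    # Append every hyperparameter as "-name value".
    for name in sorted(params):
        cmd.extend(["-" + name, str(params[name])])
    if not use_ngrams:
        cmd.extend(["-maxn", "0"])  # disable character n-grams, as in the second run
    return cmd

params = {"lr": 0.05, "dim": 300, "ws": 5, "epoch": 5,
          "minCount": 5, "neg": 5, "loss": "ns", "t": 1e-4}
cmd = build_fasttext_command("fasttext", "skipgram", "ons_unlabelled.txt.train",
                             "models/ons_ft_no_ng", params, use_ngrams=False)
print(" ".join(cmd))
# To actually run it: import subprocess; subprocess.check_call(cmd)
```

Passing a list avoids shell quoting problems with file names, and check_call raises an exception when the binary exits with a non-zero status.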


In [12]:
# Train the model
model = None
if os.path.isfile("models/ons_supervised.bin"):
    model = fastText.load_model("models/ons_supervised.bin")
else:
    print "Training"
    # dim must match the dimension of the pretrained vectors (300, as set in train_models)
    model = fastText.train_supervised(input="%s.train" % corpus_fname, label="__label__",
                              epoch=50, lr=1.0, wordNgrams=3, verbose=2, minCount=15,
                              minCountLabel=15, thread=12, dim=300,
                              pretrainedVectors="models/ons_ft.vec")
    model.save_model("models/ons_supervised.bin")
print "Done"

Training
Done


# Testing
To test the model, we use the validation set and ask the model to predict the top k labels for each sample. The outputs of the test below are the precision at k=1 (P@1) and the recall at k=1 (R@1):

In [13]:
# Test the model
k=1
N, P, R = model.test("%s.valid" % corpus_fname, k)
print "Total number of samples=", N
print "P@%d=" % k, P
print "R@%d=" % k, R

Total number of samples= 52269
P@1= 0.744934856225
R@1= 0.227324210084


We can also compute the precision and recall at k=5 with:

In [14]:
# Test the model
k=5
N, P, R = model.test("%s.valid" % corpus_fname, k)
print "Total number of samples=", N
print "P@%d=" % k, P
print "R@%d=" % k, R

Total number of samples= 52269
P@5= 0.503594865025
R@5= 0.768384671073
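
To make the two metrics concrete, here is a small sketch that computes them on made-up data. It assumes the definitions that, to our understanding, fastText uses: precision is the number of correct labels among the top k predictions divided by the number of predictions made, and recall is the same count divided by the number of true labels. The function name and the toy labels are hypothetical.

```python
def precision_recall_at_k(true_labels, ranked_predictions, k):
    # true_labels: list of sets; ranked_predictions: list of lists, best guess first
    n_correct = n_predicted = n_true = 0
    for truth, ranked in zip(true_labels, ranked_predictions):
        top_k = ranked[:k]
        n_correct += len(truth.intersection(top_k))
        n_predicted += len(top_k)
        n_true += len(truth)
    return n_correct / float(n_predicted), n_correct / float(n_true)

# Two toy samples: true keyword sets and ranked predictions for each.
truth = [{"gdp", "economy", "growth"}, {"inflation"}]
preds = [["gdp", "trade", "economy"], ["cpi", "inflation", "prices"]]
print(precision_recall_at_k(truth, preds, 1))
print(precision_recall_at_k(truth, preds, 3))
```

Because each sample usually carries several keywords, recall at k=1 stays well below 1 even for a perfect model, which fits the jump in recall between k=1 and k=5 reported above. Under these definitions, and if the model returns exactly k labels per sample, k times P@k divided by R@k equals the average number of labels per sample; both pairs of figures above give roughly 3.3.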


In [15]:
# Example output
k = 10
label_prefix = "__label__"
labels, probs = model.predict("UKs shortfall to Germany", k)
for label,prob in zip(labels, probs):
    print "%s:%f" % (label.replace(label_prefix, ""), prob)

gdp:0.806220
economy:0.106632
oecd:0.045716
growth:0.007496
output:0.006851
inflation:0.006261
foreign:0.003135
monetary:0.003119
labour_market:0.002981
trade:0.002403
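
When the predictions are used to recommend keywords, it may help to strip the label prefix and drop very unlikely labels. The sketch below does this on the first four predictions copied from the output above; the helper name clean_predictions and the 0.01 cut-off are arbitrary choices for illustration, not values used elsewhere in this notebook.

```python
def clean_predictions(labels, probs, label_prefix="__label__", min_prob=0.01):
    # Strip the prefix and keep only labels whose probability reaches min_prob.
    return [(label.replace(label_prefix, ""), prob)
            for label, prob in zip(labels, probs) if prob >= min_prob]

# First four predictions copied from the example output.
labels = ["__label__gdp", "__label__economy", "__label__oecd", "__label__growth"]
probs = [0.806220, 0.106632, 0.045716, 0.007496]
print(clean_predictions(labels, probs))
```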


# Conclusions
In this notebook, we have trained a model that predicts ONS keywords from raw text, using published articles and bulletins for supervised training. Such a model can be used to recommend keywords for raw, human-written text input, or even to classify search terms into keyword categories.