### A simple chunking example

One of the most common sequence tagging tasks is chunking. Here we will download the CoNLL-2000 data and use it to train a chunking model. Make note that the terminology used here is relevant for chunking, but the it may vary if the tagger is applied to another task.

In [1]:
import gzip, os, wget

def extract(fp):
    f = gzip.open(fp, 'rb')
    with open(fp[:-3], 'w') as fh:
        fh.write(f.read())
    f.close()

try:
    os.makedirs('data')
except OSError:
    pass

try:
    os.makedirs('thesauri')
except OSError:
    pass

train_url = 'http://www.cnts.ua.ac.be/conll2000/chunking/train.txt.gz'
test_url = 'http://www.cnts.ua.ac.be/conll2000/chunking/test.txt.gz'
stanford_clusters_url = 'http://nlp.stanford.edu/software/egw4-reut.512.clusters'

train = wget.download(train_url, out='data/train.txt.gz')
test = wget.download(test_url, out='data/test.txt.gz')

stanford = wget.download(
    stanford_clusters_url,
    out='thesauri/egw4-reut.512.clusters'
)

extract(train)
extract(test)

# clean up
os.remove(train)
os.remove(test)

Now that we have the data, we need to set up the configuration for the tagger. Usually these are loaded from a text file into a `ConfigParser` object, but here we parse them from a string for presentation purposes.

In [2]:
import ConfigParser, StringIO

cfg_str = """[tagger]
# Training data path
train=data/test.txt

# Testing data path
test=data/test.txt

# Model path
model=tmp/model

# Feature vector
ftvec=word:[-3:3];can:[-3:3];cls:[0];short

# column separator in input (and output) file(s)
tab_sep=\s

# Column pattern
# [pos <form, postag>, chunk <form, postag, chunktag>]
cols=chunk

# Label column name
label_col=chunktag

# Evaluation function [pos, conll]
# Note: the evaluation functions are not constrained by tagset. However, the
# conll and bio evaluation functions work only with BIO or BIOSE tagsets.
eval_func=bio

# Name for the guess label column
guess_label_col=guesstag

[resources]
# Stanford clusters
cls=thesauri/egw4-reut.512.clusters

[crfsuite]
# coefficient for L1 penalty
c1=0.80
# coefficient for L2 penalty
c2=1e-3
# stop earlier
max_iterations=100
# include transitions that are possible, but not observed
feature.possible_transitions=True
"""

sio = StringIO.StringIO(cfg_str)
cfg = ConfigParser.ConfigParser()
cfg.readfp(sio)

Now let's train a model using that configuration.

In [3]:
from crfsuitetagger.tagger import CRFSTagger
from crfsuitetagger.utils import parse_tsv, export

c = CRFSTagger(cfg)
c.train()
r, d = c.test()
r

--------------------------------------------------------
--------------------------------------------------------
Total ==> pre: 95.69, rec: 94.87, f: 95.28 acc: n.a.
--------------------------------------------------------


###Reusing a model

The `model` option in the tagger configuration sets the location where the newly created model should be dumped. Using that location, one can load the model later on and use it for tagging more data.

In [4]:
c = CRFSTagger(mp='tmp/model')
data = parse_tsv('data/test.txt', cols='chunk', ts=' ')
d = c.tag(data=data)
export(d, open('tmp/chunk_output.txt', 'w'), cols='chunk')
d[:5]

array([('Rockwell', 'NNP', 'B-NP', 'B-NP', 28),
       ('International', 'NNP', 'I-NP', 'I-NP', -1),
       ('Corp.', 'NNP', 'I-NP', 'I-NP', -1),
       ("'s", 'POS', 'B-NP', 'B-NP', -1),
       ('Tulsa', 'NNP', 'I-NP', 'I-NP', -1)], 
      dtype=[('form', 'S60'), ('postag', 'S10'), ('chunktag', 'S10'), ('guesstag', 'S10'), ('eos', '<i4')])

### Feature vector templates

The central piece in the configuration of the tagger is the feature vector template. It is the sample pattern used for generating the feature vectors of every observation. For the model we just used, we set up a feature vector template with the following features:

    word:[-3:3];can:[-3:3];cls:[0];short

Each feature name in the template, e.g. `word`, is in fact a series of features generated from a context window. The context window is defined by the numbers in brackets, for example `[-3:3]`. Features are separated by semi-columns. Some features require additional parameters, like n-grams, for example, need to be specified as bigrams, trigrams, etc. These parameters are specified after the window brackets separated with commas like this:
    
    npos[-1:1],2

We can improve chunking by adding another window feature based on part-of-speech tags.

    word:[-3:3];pos:[-3:3];can:[-3:3];cls:[0];short

To do that, we just set the value of the `ftvec` option in the tagger configuration to the new feature vector template, and re-train the model

In [5]:
cfg.set('tagger', 'ftvec', 'word:[-3:3];pos:[-3:3];can:[-3:3];cls:[0];short')

c = CRFSTagger(cfg)
c.train()
r, d = c.test()
r

--------------------------------------------------------
--------------------------------------------------------
Total ==> pre: 96.9, rec: 96.5, f: 96.7 acc: n.a.
--------------------------------------------------------


### Additional features

`CRFSuiteTagger` allows the easy integration of new types of features through custom functions following a simple pattern. The feature function needs to have four leading parameters named in a particular way, apart from whatever other parameters it needs. It is recommended that the functions take `*args` and `**kwargs` parameters for safety reasons. The obligatory parameters are `data`, `i`, `cols`, and `rel`. Each of them has a vital importance for the way context features are generated. 

Here is an example function that we can use to overwrite the existing `word` function in order to introduce noise into the data.

In [6]:
def word(data, i, cols, rel=0, *args, **kwargs):
    """Generates a feature based on the `form` column, but replaces some
    prepositions with a placeholder <preposition>.

    **FEATURE GENERATION FUNCTION**

    :param data: data
    :type: np.recarray
    :param i: focus position
    :type i: int
    :param cols: column map
    :type cols: dict
    :param rel: relative position of context features
    :type rel: int
    :return: feature
    :rtype: str
    """
    if 0 <= i + rel < len(data):
        form = data[i + rel][cols['form']]
    else:
        form = None
    if form in ['to', 'from', 'with', 'in', 'over', 'by', 'through']:
        form = '<preposition>'
    return 'w[%s]=%s' % (rel, form)

A list of feature functions is passed to the constructor of the `CRFSuiteTagger` object.

In [7]:
from crfsuitetagger.tagger import CRFSTagger
from crfsuitetagger.utils import parse_tsv, export

c = CRFSTagger(cfg, fnx=[word])
c.train()
r, d = c.test()
r

--------------------------------------------------------
--------------------------------------------------------
Total ==> pre: 96.78, rec: 96.6, f: 96.69 acc: n.a.
--------------------------------------------------------


We see that the new `word` feature function brought down the precision of the model with .12%.

In [8]:
word = None
c = CRFSTagger(mp='tmp/model')
data = parse_tsv('data/test.txt', cols='chunk', ts=' ')
d = c.tag(data=data)
export(d, open('tmp/chunk_output.txt', 'w'), cols='chunk')
d[:5]

array([('Rockwell', 'NNP', 'B-NP', 'B-NP', 28),
       ('International', 'NNP', 'I-NP', 'I-NP', -1),
       ('Corp.', 'NNP', 'I-NP', 'I-NP', -1),
       ("'s", 'POS', 'B-NP', 'B-NP', -1),
       ('Tulsa', 'NNP', 'I-NP', 'I-NP', -1)], 
      dtype=[('form', 'S60'), ('postag', 'S10'), ('chunktag', 'S10'), ('guesstag', 'S10'), ('eos', '<i4')])

##### Limitation

Due to the serialisation of the custom feature functions, no packages (including those part of the distribution) can be used, nor should they. If you are doing any more complicated computation during feature generation then probably you are doing something wrong, and you probably need to pre-compute a resource and load it as a parameter (see how cluster features work in `crfsuitetagger.features.ft_cls` and `crfsuitetagger.readers.cls`). If you really, really have to use some package you can hack it by importing it in the function -- bad style, worse performance, etc.