This library implements Prediction by Partial Matching (PPM), originally for language identification (LID), but it is suitable for any task that assigns discrete labels to sequences. Output scores are log-probabilities of a sequence under each label's model, averaged per character.
The Classifier class is compliant with Scikit-learn's API.
Install via pip:
# from source
pip install . --user
# from pypi
pip install valid --user
Training a classification model is very simple. Assuming you have sequences of hashable values (e.g. characters) associated with labels:
from valid.model import Classifier

c = Classifier()
for text in english_texts:
    c.train("eng", text)
for text in spanish_texts:
    c.train("spa", text)
You can then apply the model to new data points:
>>> print(c.predict("Some English text"))
{'eng': -0.1, 'spa': -2.3}
Note that it returns log-probabilities for each possible label; the most likely label is the one with the highest (least negative) score.
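Picking the winning label is then just an argmax over the returned dictionary. A minimal sketch (the variable names here are illustrative, not part of the library's API):

scores = c.predict("Some English text")
# The best guess is the label with the highest (least negative)
# average per-character log-probability.
best_label = max(scores, key=scores.get)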
The model can be serialized with the pickle library:

import gzip
import pickle
with gzip.open(model_file, "wb") as ofd:
    pickle.dump(c, ofd)
Later, it can be deserialized and applied to test data, or trained further:

import gzip
import pickle
with gzip.open(model_file, "rb") as ifd:
    c = pickle.load(ifd)
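The reloaded model supports the same methods as before; a short sketch reusing the train and predict calls shown above (the example strings are illustrative):

# Continue training with additional labeled text.
c.train("eng", "a little more English training text")
# Score a new sequence.
print(c.predict("¿Dónde está la biblioteca?"))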
See scripts/valid_example.py for a full, but still very simple, example that reads lines of tab-separated label/text from STDIN, sweeps n-gram orders from 1 to 5 to find the best byte-based, character-based, and word-based models, and writes them to disk. Experiments with ~70k tweets consistently show the 3-gram byte model to be optimal.
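A hypothetical invocation (the input filename is illustrative; the script reads tab-separated label/text pairs on STDIN, as described above):

python scripts/valid_example.py < labeled_data.tsv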
The normalizeTweet function in valid.utils may be useful for preprocessing microblog texts before training.
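A minimal sketch of using it, assuming normalizeTweet takes a raw tweet string and returns a cleaned string (this signature is an assumption, and english_tweets is an illustrative variable):

from valid.model import Classifier
from valid.utils import normalizeTweet

c = Classifier()
for text in english_tweets:
    # Normalize each tweet before feeding it to the model.
    c.train("eng", normalizeTweet(text))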
There are three lookup tables in valid.languages:

- ALL_LANGUAGES is a list of ISO 639-1 language codes
- MAJOR_LANGUAGES is a subset of major languages
- MAP_2_TO_3 is a best-effort map from ISO 639-1 to ISO 639-3
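A short sketch of how these tables might be used, assuming MAP_2_TO_3 behaves like a plain dict and the other two like lists (these types are assumptions):

from valid.languages import ALL_LANGUAGES, MAJOR_LANGUAGES, MAP_2_TO_3

# Convert a two-letter code to its three-letter equivalent.
print(MAP_2_TO_3.get("en"))  # expected: "eng"
# Restrict the full code list to major languages only.
major_codes = [code for code in ALL_LANGUAGES if code in MAJOR_LANGUAGES]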
If you use this code in published work, please cite the following paper:
https://www.aclweb.org/anthology/W12-2108.pdf
Short bibliography of compression language models used in NLP:

- Benedetto et al., 'Language Trees and Zipping' (used for author ID and language ID): http://www.ccs.neu.edu/home/jaa/CSG195.08F/Topics/Papers/BenedettoCaLo.pdf
- Pavelec et al., 'Author Identification Using Compression Models': http://www.cvc.uab.es/icdar2009/papers/3725a936.pdf
- Bratko et al., 'Spam Filtering Using Statistical Data Compression Models': http://jmlr.csail.mit.edu/papers/volume7/bratko06a/bratko06a.pdf
- A talk by Gord Cormack: http://plg.uwaterloo.ca/~gvcormac/cormack-nato.pdf
- Frank et al., 'Text categorization using compression models': http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.148.9031
- Teahan, 'Text classification and segmentation using minimum cross-entropy': http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.25.8114
- Marton et al., 'On compression-based text classification': http://www.umiacs.umd.edu/~ymarton/pub/ecir05/final.pdf
Papers about the Prediction by Partial Matching (PPM) algorithm:

- Cleary and Witten, 'Data Compression Using Adaptive Coding and Partial String Matching', IEEE Transactions on Communications, 32(4), 1984.
- Moffat, 'Implementing the PPM Data Compression Scheme', IEEE Transactions on Communications, 38(11), 1990.