
preprocessing

Test repo for the WE1S preprocessing script.

Preliminary Description of the WE1S Preprocessing Pipeline

  1. Normalise whitespace and bad Unicode:

a. Match r'((\r\n)|[\n\v])+' and replace it with '\n'.
b. Match r'(?!\n)\s+' and replace it with ' '.
c. Trim whitespace from the start and end of the string.
d. Normalise the Unicode according to the 'NFKD' method.
e. If a character is an accented Unicode character, replace it with an unaccented ASCII equivalent (a by-product of 'NFKD' normalisation).
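
A minimal Python sketch of this normalisation step might look like the following (the function name and the strategy for stripping combining marks are assumptions, not taken from the repo):

```python
import re
import unicodedata

def normalise(text):
    # a. Replace runs of line breaks (CRLF, LF, vertical tab) with a single '\n'
    text = re.sub(r'((\r\n)|[\n\v])+', '\n', text)
    # b. Replace whitespace runs not beginning with '\n' with a single space
    text = re.sub(r'(?!\n)\s+', ' ', text)
    # c. Trim whitespace from the start and end of the string
    text = text.strip()
    # d./e. NFKD normalisation decomposes accented characters; dropping the
    # combining marks leaves unaccented ASCII equivalents
    text = unicodedata.normalize('NFKD', text)
    return ''.join(c for c in text if not unicodedata.combining(c))
```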

  2. Tokenise according to spaCy's pipeline (see below) using the following settings:

```python
PREFIX_RE = re.compile(r'''^[\[\]\("'\.,;:-]''')
SUFFIX_RE = re.compile(r'''[\[\]\)"'\.,;:-]$''')
INFIX_RE = re.compile(r'''[-~]''')
SIMPLE_URL_RE = re.compile(r'''^https?://''')
```
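
These patterns might be wired into spaCy's Tokenizer roughly as follows (a sketch; whether the repo keeps the model's default exception rules, as assumed here, is not documented in this README):

```python
import re
import spacy
from spacy.tokenizer import Tokenizer

PREFIX_RE = re.compile(r'''^[\[\]\("'\.,;:-]''')
SUFFIX_RE = re.compile(r'''[\[\]\)"'\.,;:-]$''')
INFIX_RE = re.compile(r'''[-~]''')
SIMPLE_URL_RE = re.compile(r'''^https?://''')

nlp = spacy.load('en_core_web_lg')
# Replace the default tokenizer with one using the WE1S patterns
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,  # keep the stock exception rules
    prefix_search=PREFIX_RE.search,
    suffix_search=SUFFIX_RE.search,
    infix_finditer=INFIX_RE.finditer,
    token_match=SIMPLE_URL_RE.match,
)
```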
  3. When parsing the word "humanities", do not lemmatise it as "humanity".

  4. Re-tokenise the text, merging entities into single tokens using spaCy's named entity recognition. Do not tag cardinal numbers, dates (except those containing month names), quantities, or times as entities.

  5. Remove stop words from the WE1S standard list.

  6. From the remaining tokens, try to remove meaningless punctuation using the regex pattern r'\.\W|\W\.|^[\!\?\(\),;:\[\]\{\}]|[\!\?\(\),;:\[\]\{\}]$'.
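
A sketch of steps 5 and 6, assuming the tokens are plain strings; we1s_stopwords stands in for the WE1S standard list (not reproduced here), and stripping the matched punctuation is one plausible reading of "remove":

```python
import re

MEANINGLESS_PUNCT_RE = re.compile(
    r'\.\W|\W\.|^[\!\?\(\),;:\[\]\{\}]|[\!\?\(\),;:\[\]\{\}]$')
we1s_stopwords = {'the', 'a', 'an', 'and', 'of'}  # illustrative subset only

def filter_tokens(tokens):
    out = []
    for token in tokens:
        if token.lower() in we1s_stopwords:  # step 5: drop stop words
            continue
        token = MEANINGLESS_PUNCT_RE.sub('', token)  # step 6: strip punctuation
        if token:
            out.append(token)
    return out
```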

  7. Create a term count matrix.

  8. Generate a "vectors" file where each document is on a separate line and terms appearing N times in the document are repeated N times. All terms are separated by spaces.
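
Steps 7 and 8 might be implemented along these lines (the function and file names are placeholders):

```python
from collections import Counter

def write_vectors(tokenised_docs, path='vectors.txt'):
    # tokenised_docs: a list of token lists, one per document
    with open(path, 'w', encoding='utf-8') as f:
        for tokens in tokenised_docs:
            counts = Counter(tokens)  # step 7: term counts for one document
            # step 8: repeat each term N times, space-separated, one line per document
            f.write(' '.join(t for t, n in counts.items() for _ in range(n)) + '\n')
```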

  9. Import into MALLET using --token-regex "\S+". Do not strip stopwords.
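
The exact MALLET invocation is not given here; a typical import along these lines (file names are placeholders) would be:

```bash
bin/mallet import-file --input vectors.txt --output corpus.mallet \
    --keep-sequence --token-regex "\S+"
```

Omitting MALLET's --remove-stopwords flag leaves stop word removal to the pipeline above.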

Note that Twitter data is preprocessed according to a variant of this pipeline for which the documentation is pending.

The spaCy Pipeline

The main WE1S corpus has linguistic features extracted using spaCy's en_core_web_lg language model. The preprocessor records a features table containing a row for each token with its normalised (lower-cased) form, lemma, part of speech, detailed part of speech (tag), and named entity type. The features table is saved to the document's manifest so that features can be extracted from the document without reparsing it. However, because the table preserves token order, the original text could be reconstructed from it; public document manifests may therefore have the table sorted.
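
The attributes involved can be read directly from a parsed spaCy Doc; a minimal sketch (the manifest's exact column names and serialisation are not shown in this README):

```python
import spacy

nlp = spacy.load('en_core_web_lg')
doc = nlp('Toni Morrison taught in the humanities.')

# One row per token: text, normalised (lower-cased) form, lemma,
# part of speech, detailed tag, and named entity type
features = [(t.text, t.lower_, t.lemma_, t.pos_, t.tag_, t.ent_type_)
            for t in doc]
```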

Detailed information about spaCy's linguistic feature extraction can be found on the spaCy website. Some features extracted during preprocessing are not preserved in the document's features table.

Tokenisation

spaCy tokenises documents by applying rules from its language model to segment the text into words, punctuation, and so on. Whitespace information is preserved in the tokens, and no information is added or removed. As a result, the features table contains punctuation and line breaks, which are treated as tokens.

First, the raw text is split on whitespace characters, similar to text.split(' '). Then the tokeniser processes the text from left to right, performing two checks on each substring:

  • Does the substring match one of the language model's tokeniser exception rules, such as splitting “don’t” into two tokens but leaving “U.K.” as one?
  • Does the substring contain a prefix, suffix, or infix (typically a punctuation mark, hyphen, or quote) that can be split off?

If so, these rules are applied before spaCy continues to the next substring. Note that WE1S uses the customised prefix, suffix, and infix patterns listed above.

Lemmatisation and Part of Speech Tagging

spaCy uses rules in its language model to identify lemmas and parts of speech for each token. Although spaCy is very accurate, its accuracy does vary with the type of data. By default, it lemmatises "humanities" as "humanity", so the WE1S pipeline applies an exception to this rule.
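
In spaCy 3.x, one way to express such an exception is through the attribute ruler; the repo, which may target an earlier spaCy release, could implement it differently:

```python
import spacy

nlp = spacy.load('en_core_web_lg')
# Override the lemma for 'humanities'; the attribute ruler runs before the
# lemmatizer in the pretrained pipelines, and the lemmatizer does not
# overwrite lemmas that are already set
ruler = nlp.get_pipe('attribute_ruler')
ruler.add(patterns=[[{'LOWER': 'humanities'}]], attrs={'LEMMA': 'humanities'})

doc = nlp('The humanities are under pressure.')
print([t.lemma_ for t in doc])  # 'humanities' keeps its surface form
```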

Named Entity Recognition

Named entity recognition is applied after tokenisation, lemmatisation, and part of speech tagging. spaCy uses rules in its language model to identify named entities. As with lemmatisation and part of speech tagging, accuracy depends on the data. Named entities are tagged with a wide variety of descriptive labels, which are described in spaCy's documentation.

spaCy typically identifies space-separated phrases (e.g. "Toni Morrison") as named entities. Phrases like these have to be converted to forms like "toni_morrison" for use in WE1S visualisations, so spaCy's named entity recognition would otherwise produce a proliferation of underscored phrases. To limit this, the WE1S pipeline customises spaCy's named entity recognition by skipping some categories of entities: CARDINAL, DATE (except dates containing the names of months), QUANTITY, and TIME.
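
A sketch of this customised entity-merging step; the label sets below restate the categories from the text, while the month check is an assumption about how dates containing month names are allowed through:

```python
import spacy

SKIP_LABELS = {'CARDINAL', 'QUANTITY', 'TIME'}
MONTHS = {'january', 'february', 'march', 'april', 'may', 'june', 'july',
          'august', 'september', 'october', 'november', 'december'}

nlp = spacy.load('en_core_web_lg')
doc = nlp('Toni Morrison won the Nobel Prize in Literature in October 1993.')

with doc.retokenize() as retokenizer:
    for ent in doc.ents:
        if ent.label_ in SKIP_LABELS:
            continue  # leave cardinals, quantities, and times unmerged
        if ent.label_ == 'DATE' and not any(t.lower_ in MONTHS for t in ent):
            continue  # only dates naming a month become entity tokens
        retokenizer.merge(ent)

# Merged multi-word entities can then be underscored for visualisations
tokens = [t.text.lower().replace(' ', '_') for t in doc]
```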