# Text preprocessing

During text preprocessing, a corpus of documents is tokenized (i.e. the document strings are split into individual words, punctuation, numbers, etc.) and these tokens can then be transformed, filtered or annotated. The goal is to prepare the raw texts so that later analysis steps become easier, e.g. by reducing noise in the dataset. tmtoolkit provides a rich set of tools for this purpose in the [tmtoolkit.preprocess](api.rst#tmtoolkit-preprocess) module.

## Two approaches: functional API and `TMPreproc` class

There are two ways to apply text preprocessing methods to your documents: First, there is the [functional API](api.rst#module-tmtoolkit.preprocess) which consists of a set of Python functions that accept a list of (tokenized) documents. An example might be:

```python
corpus = [
    "Hello world!",    # document 1
    "Another example"  # document 2
]

docs = tokenize(corpus)
to_lowercase(docs)
# Out: [['hello', 'world', '!'],
#       ['another', 'example']]
```


The advantage of this approach is that it's straightforward and flexible. However, you must manage any metadata associated with the documents yourself (e.g. document labels or token metadata). Furthermore, the processing is not done in parallel.

Second, there is the [TMPreproc class](api.rst#tmpreproc-class-for-parallel-text-preprocessing) which addresses these limitations. You can create an instance of this class from your (labelled) documents and then apply preprocessing methods to it. This instance is a "state-machine", i.e. its contents (the documents) can change when you call a method. An example:

```python
corpus = {
    "doc1": "Hello world!",
    "doc2": "Another example"
}

preproc = TMPreproc(corpus)     # documents are directly tokenized
preproc.tokens_to_lowercase()   # this changes the documents
preproc.tokens                  # one of many ways to access the tokens

# Out:
# {
#   'doc1': ['hello', 'world', '!'],
#   'doc2': ['another', 'example']
# }
```

The most important advantage is that `TMPreproc` employs parallel processing, i.e. it uses all available processors on your machine for the computations required during preprocessing. For large text corpora, this can lead to a substantial speedup.

Both approaches offer mostly the same features in terms of available preprocessing methods. `TMPreproc` has some more methods to export the data to dataframes or datatables. In general, the functional API is mostly used for quick prototyping and when using a small amount of data. For projects with large amounts of data, it's recommended to use `TMPreproc`, especially because of the parallel computation support.

This chapter starts with a few examples using the functional API and then turns to `TMPreproc`.

## Functional API

The functions in the preprocessing module make up the [functional API](api.rst#module-tmtoolkit.preprocess) for text preprocessing. We will explore some of the available functions. Most of them require at least passing a list of tokenized documents. In order to tokenize raw text documents (for example from a [Corpus](text_corpora.ipynb) object), we can use [tokenize()](api.rst#tmtoolkit.preprocess.tokenize). 

### Loading example data

Let's load a sample of three documents from the built-in *NewsArticles* dataset. We'll save the document labels in `doc_labels` since the functional API works with lists of documents (not with dicts): 

In [1]:
import random
random.seed(20191018)   # to make the sampling reproducible

from tmtoolkit.corpus import Corpus
from tmtoolkit.preprocess import tokenize

corpus = Corpus.from_builtin_corpus('english-NewsArticles').sample(3)
doc_labels = corpus.keys()
doc_labels

dict_keys(['NewsArticles-1880', 'NewsArticles-3350', 'NewsArticles-99'])

### Tokenization

We can now tokenize these documents. We pass the documents via `corpus.values()` and get back a list of tokenized documents (i.e. a list of lists). We peek into the documents by showing at most the first 10 tokens of each.

In [2]:
docs = tokenize(corpus.values())
[doc[:10] for doc in docs]

[['White',
  'House',
  'aides',
  'told',
  'to',
  'keep',
  'Russia-related',
  'materials',
  'Lawyers',
  'for'],
 ['Frustration',
  'as',
  'cabin',
  'electronics',
  'ban',
  'comes',
  'into',
  'force',
  'Passengers',
  'decry'],
 ['Should',
  'you',
  'have',
  'two',
  'bins',
  'in',
  'your',
  'bathroom',
  '?',
  'Our']]
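Because the functional API returns plain lists, the document labels have to be carried along separately. A minimal sketch of one way to re-attach them, using the labels from above and only the first three tokens of each document; `labelled_docs` is an illustrative name and not part of tmtoolkit:

```python
# Labels and tokenized documents are kept in two parallel sequences.
doc_labels = ['NewsArticles-1880', 'NewsArticles-3350', 'NewsArticles-99']
docs = [
    ['White', 'House', 'aides'],
    ['Frustration', 'as', 'cabin'],
    ['Should', 'you', 'have'],
]

# Re-attach the labels; this relies on both sequences being in the same order.
labelled_docs = dict(zip(doc_labels, docs))

print(labelled_docs['NewsArticles-99'])
```

This only works as long as no step reorders or drops documents, which is part of what "managing metadata on your own" means in practice.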

### Corpus language

Some preprocessing steps are language-dependent, i.e. they're trained for specific languages, so you have to specify the language your documents are written in. At the moment, tmtoolkit only supports two languages off the shelf: English and German.

In the functional API, all functions that are language-dependent have a `language` argument. Examples of such functions are [tokenize()](api.rst#tmtoolkit.preprocess.tokenize), [pos_tag()](api.rst#tmtoolkit.preprocess.pos_tag), [stem()](api.rst#tmtoolkit.preprocess.stem) and [lemmatize()](api.rst#tmtoolkit.preprocess.lemmatize). The default value for the `language` parameter of the preprocessing functions is set in [tmtoolkit.defaults.language](api.rst#tmtoolkit.defaults.language). If you don't change it, it's set to `"english"`. So when you use the functional API with a corpus that is not in English, you have two options: either pass the `language` parameter each time you call a language-dependent function, or set `tmtoolkit.defaults.language` once at the beginning, which is then used as the default by all subsequent language-dependent preprocessing functions. Let's try both options with a German sample corpus:

In [3]:
from tmtoolkit.preprocess import stem

docs_de = [
    'Von der Wiege bis zur Bahre, Formulare, Formulare.',
    'Fischers Fritz fischt frische Fische.',
    'Viel schon ist getan, mehr noch ist zu tun, sagt der Wasserhahn zum Wasserhuhn.'
]

Option 1, passing the `language` parameter each time:

In [4]:
tokens_de = tokenize(docs_de, language='german')
stemmed_de = stem(tokens_de, language='german')
stemmed_de

[['von',
  'der',
  'wieg',
  'bis',
  'zur',
  'bahr',
  ',',
  'formular',
  ',',
  'formular',
  '.'],
 ['fisch', 'fritz', 'fischt', 'frisch', 'fisch', '.'],
 ['viel',
  'schon',
  'ist',
  'getan',
  ',',
  'mehr',
  'noch',
  'ist',
  'zu',
  'tun',
  ',',
  'sagt',
  'der',
  'wasserhahn',
  'zum',
  'wasserhuhn',
  '.']]

Option 2, setting `tmtoolkit.defaults.language` provides the same output:

In [5]:
import tmtoolkit.defaults
tmtoolkit.defaults.language = 'german'

tokens_de = tokenize(docs_de)
stemmed_de == stem(tokens_de) 

True
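The two options follow a common pattern: an explicit argument overrides a package-wide default that is looked up when the function is called. The following is a generic sketch of that pattern only; `DEFAULT_LANGUAGE` and `describe_language` are hypothetical names, and this is not a claim about how tmtoolkit implements it internally:

```python
DEFAULT_LANGUAGE = 'english'   # stands in for a package-wide default setting

def describe_language(language=None):
    """Return the language a language-dependent function would use."""
    # An explicit argument wins; otherwise the default is read at call time.
    return DEFAULT_LANGUAGE if language is None else language

print(describe_language())                    # english
print(describe_language(language='german'))   # german (option 1)

DEFAULT_LANGUAGE = 'german'                   # option 2: change the default once
print(describe_language())                    # german
```

Because the default is read at call time rather than at definition time, changing it affects all later calls that omit the argument.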

Since we will return to the English corpus, we reset the default language and clean up:

In [6]:
tmtoolkit.defaults.language = 'english'

del docs_de, tokens_de, stemmed_de 

### A small tour around the functional preprocessing API

We will continue with the most important functions in the preprocessing API and apply them to our English sample corpus.

#### Document length

The document length is the number of tokens per document and can be obtained with [doc_lengths()](api.rst#tmtoolkit.preprocess.doc_lengths):

In [7]:
from tmtoolkit.preprocess import doc_lengths

doc_lengths(docs)

[227, 646, 1052]

#### Vocabulary and document frequencies

The vocabulary is the set of unique tokens in the corpus, i.e. all tokens that occur at least once in at least one of the documents. You can use [vocabulary()](api.rst#tmtoolkit.preprocess.vocabulary) for that and [vocabulary_counts()](api.rst#tmtoolkit.preprocess.vocabulary_counts) to additionally get the number of times each token appears in the corpus. 

The document frequency of a token is the number of documents in which this token occurs at least once. The function [doc_frequencies()](api.rst#tmtoolkit.preprocess.doc_frequencies) returns this measure for all tokens in the vocabulary. 

In [8]:
from tmtoolkit.preprocess import vocabulary, vocabulary_counts, doc_frequencies

# first 10 entries from the sorted vocab
vocabulary(docs, sort=True)[:10]

['%', "'", "''", "'s", '(', ')', ',', '-', '-Al', '.']

In [9]:
# get unsorted vocabulary counts as Counter object
vocab_counts = vocabulary_counts(docs)
# get top 10 tokens by occurrence
vocab_counts.most_common(10)

[('the', 82),
 (',', 70),
 ('.', 60),
 ('to', 53),
 ('and', 45),
 ('in', 38),
 ('a', 31),
 ('``', 28),
 ('of', 25),
 ("''", 23)]

In [10]:
doc_freq = doc_frequencies(docs)

# "the" occurs in all three documents, "Lawyers" only in one
doc_freq['the'], doc_freq['Lawyers']


(3, 1)
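To make the difference between vocabulary counts and document frequencies concrete, here is a minimal pure-Python sketch on a toy corpus. The data is made up for illustration and the code does not use tmtoolkit:

```python
from collections import Counter

# Toy tokenized corpus (hypothetical data, not the NewsArticles sample).
docs = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'dog', 'barked'],
    ['a', 'cat', 'and', 'a', 'dog'],
]

# Vocabulary: the unique tokens across all documents.
vocab = sorted(set(t for doc in docs for t in doc))

# Vocabulary counts: how often each token occurs in the whole corpus.
vocab_counts = Counter(t for doc in docs for t in doc)

# Document frequency: in how many documents each token occurs at least once.
# Converting each document to a set counts a token only once per document.
doc_freq = Counter(t for doc in docs for t in set(doc))

print(vocab_counts['the'], doc_freq['the'])  # 3 occurrences, in 2 documents
```

The token "the" appears three times overall but only in two documents, so its count and its document frequency differ.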

#### Part-of-speech (POS) tagging

Part-of-speech (POS) tagging finds the grammatical word category for each token in a document. The function [pos_tag()](api.rst#tmtoolkit.preprocess.pos_tag) applies this to the whole corpus. It returns a list of tags for each document. These tags conform to a specific *tagset*. For English this is the [Penn Treebank tagset](https://ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and for German this is the [STTS tagset](http://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html).

These tags can be used to filter, annotate or lemmatize the documents.

Remember that this is a language-dependent function.

In [11]:
from tmtoolkit.preprocess import pos_tag

docs_pos = pos_tag(docs)

# show pairs of tokens and POS tags for the first 10 tokens in the first document
list(zip(docs[0][:10], docs_pos[0][:10]))

[('White', 'NNP'),
 ('House', 'NNP'),
 ('aides', 'NNS'),
 ('told', 'VBD'),
 ('to', 'TO'),
 ('keep', 'VB'),
 ('Russia-related', 'JJ'),
 ('materials', 'NNS'),
 ('Lawyers', 'NNS'),
 ('for', 'IN')]
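As an illustration of how such tags can be used for filtering, the following plain-Python sketch keeps only the nouns from the token/tag pairs shown above. It relies on the fact that all noun tags in the Penn Treebank tagset start with `NN`:

```python
# Token/tag pairs for the first document, copied from the output above.
tagged = [
    ('White', 'NNP'), ('House', 'NNP'), ('aides', 'NNS'), ('told', 'VBD'),
    ('to', 'TO'), ('keep', 'VB'), ('Russia-related', 'JJ'),
    ('materials', 'NNS'), ('Lawyers', 'NNS'), ('for', 'IN'),
]

# Keep tokens whose tag marks a noun (NN, NNS, NNP, NNPS).
nouns = [tok for tok, tag in tagged if tag.startswith('NN')]

print(nouns)
# ['White', 'House', 'aides', 'materials', 'Lawyers']
```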

#### Stemming and lemmatization

Stemming and lemmatization bring a token, if it is a word, to a base form. The former is rule-based and creates base forms by cutting off common prefixes and suffixes. The resulting token may no longer be a lexicographically correct word. We've already used [stem()](api.rst#tmtoolkit.preprocess.stem) in an example above.

Lemmatization is a more sophisticated process that tries to find the lexicographically correct base form of a given word by also considering its POS tag and possibly its context (tokens and POS tags nearby). It's usually not rule-based but relies on a trained model that predicts the base form from these inputs. Lemmatization can be applied with [lemmatize()](api.rst#tmtoolkit.preprocess.lemmatize).

Remember that both functions are language-dependent.

In [12]:
from tmtoolkit.preprocess import lemmatize

docs_lem = lemmatize(docs, docs_pos)
# show pairs of original tokens and lemmata for the first 10 tokens of first document
list(zip(docs[0][:10], docs_lem[0][:10]))

[('White', 'White'),
 ('House', 'House'),
 ('aides', 'aide'),
 ('told', 'tell'),
 ('to', 'to'),
 ('keep', 'keep'),
 ('Russia-related', 'Russia-related'),
 ('materials', 'material'),
 ('Lawyers', 'Lawyers'),
 ('for', 'for')]
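The rule-based nature of stemming can be illustrated with a deliberately naive suffix stripper. This is a toy sketch with a made-up rule set; it is not the algorithm behind `stem()` and far simpler than real stemmers:

```python
def toy_stem(token, suffixes=('ing', 'ies', 'es', 'ed', 's')):
    """Strip the first matching suffix; a toy rule set, not a real stemmer."""
    for suffix in suffixes:
        # Keep at least three characters so that short words stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

for word in ['materials', 'agencies', 'comes', 'told']:
    print(word, '->', toy_stem(word))
```

The results for "agencies" and "comes" are not valid words, and the irregular form "told" is left untouched, whereas the lemmatizer output above maps it to "tell".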

#### "Cleaning" tokens

Depending on your methodology, it may be necessary to "clean" or "normalize" your tokens in different ways in order to remove noise from the corpus, such as punctuation tokens or numbers, upper/lowercase forms of the same word, etc. Note that this is usually not necessary when you work with more modern approaches such as word embeddings (word vectors).   

If you want to remove certain characters in *all* tokens in your corpus, you can use [remove_chars()](api.rst#tmtoolkit.preprocess.remove_chars) and pass it a sequence of characters to remove.

Note that the following examples continue to work with the lemmatized documents `docs_lem`.

In [13]:
from tmtoolkit.preprocess import remove_chars

# remove all vowels from the documents, show first 10 tokens from first document
remove_chars(docs_lem, 'aeiou')[0][:10]

['Wht', 'Hs', 'd', 'tll', 't', 'kp', 'Rss-rltd', 'mtrl', 'Lwyrs', 'fr']

You can for example use this to remove all punctuation characters from all tokens:

In [14]:
import string

docs_clean = remove_chars(docs_lem, string.punctuation)
# show pairs of original and cleaned tokens for the first 10 tokens of the third document (index 2)
list(zip(docs_lem[2][:10], docs_clean[2][:10]))

[('Should', 'Should'),
 ('you', 'you'),
 ('have', 'have'),
 ('two', 'two'),
 ('bin', 'bin'),
 ('in', 'in'),
 ('your', 'your'),
 ('bathroom', 'bathroom'),
 ('?', ''),
 ('Our', 'Our')]

Notice how the token `'?'` was transformed to an empty string `''`, because "?" is a punctuation character.
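Character removal of this kind can be sketched in plain Python with `str.translate`. The helper name `strip_chars` is hypothetical and the snippet only mirrors the behaviour described above; it is not tmtoolkit's implementation:

```python
import string

def strip_chars(docs, chars):
    """Remove every character in `chars` from every token (hypothetical helper)."""
    table = str.maketrans('', '', chars)   # maps each listed character to None
    return [[tok.translate(table) for tok in doc] for doc in docs]

docs = [['bathroom', '?', 'Russia-related', "'s"]]
print(strip_chars(docs, string.punctuation))
# [['bathroom', '', 'Russiarelated', 's']]
```

Tokens that consist only of removed characters become empty strings, so a later step has to drop them.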

A common (but harsh) practice is to transform all tokens to lowercase, which can be done with [to_lowercase()](api.rst#tmtoolkit.preprocess.to_lowercase):

In [15]:
from tmtoolkit.preprocess import to_lowercase

docs_clean = to_lowercase(docs_clean)
docs_clean[2][:10]

['should', 'you', 'have', 'two', 'bin', 'in', 'your', 'bathroom', '', 'our']

The function [clean_tokens()](api.rst#tmtoolkit.preprocess.clean_tokens) finally applies several steps that remove tokens that meet certain criteria. This includes removing:

- punctuation tokens
- stopwords (very common words for the given language)
- empty tokens (i.e. `''`)
- tokens that are longer or shorter than a certain number of characters
- numbers  

Note that this is a language-dependent function, because the default stopword list is determined per language. This function has lots of parameters to tweak, so it's recommended to check out the documentation.

In [16]:
from tmtoolkit.preprocess import clean_tokens

# remove punct., stopwords, empty tokens (this is the default)
# plus tokens shorter than 2 characters and numeric tokens like "2019"
docs_final = clean_tokens(docs_clean, remove_shorter_than=2, remove_numbers=True)

# first 10 tokens of the third document (index 2)
docs_final[2][:10]

['two',
 'bin',
 'bathroom',
 'bathroom',
 'fill',
 'shampoo',
 'bottle',
 'toilet',
 'roll',
 'cleaning']

Due to the removal of several tokens in the previous step, the document lengths for the processed corpus are much smaller than for the original corpus:

In [17]:
doc_lengths(docs), doc_lengths(docs_final)

([227, 646, 1052], [129, 310, 504])

We can also observe that the vocabulary got smaller after the processing steps, which, for large corpora, is also important in terms of computation time and memory consumption for later analyses:

In [18]:
len(vocabulary(docs)), len(vocabulary(docs_final))

(681, 478)
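The filter criteria listed for `clean_tokens()` can be made concrete with a small pure-Python sketch. The stopword list and the helper `toy_clean` are hypothetical and only approximate the described behaviour:

```python
import string

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {'should', 'you', 'have', 'in', 'your', 'our'}

def toy_clean(doc, min_len=2):
    """Keep only tokens that pass all filter criteria (hypothetical helper)."""
    kept = []
    for tok in doc:
        if tok == '':                                   # empty token
            continue
        if all(c in string.punctuation for c in tok):   # punctuation token
            continue
        if tok in STOPWORDS:                            # stopword
            continue
        if len(tok) < min_len:                          # too short
            continue
        if tok.isnumeric():                             # number
            continue
        kept.append(tok)
    return kept

doc = ['should', 'you', 'have', 'two', 'bin', 'in', 'your', 'bathroom',
       '', 'our', '2019', '?']
print(toy_clean(doc))
# ['two', 'bin', 'bathroom']
```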

You can also apply custom token transform functions by using [transform()](api.rst#tmtoolkit.preprocess.transform) and passing it a function that should be applied to each token in each document (hence it must accept one string argument).

First let's define such a function. Here we create a simple function that should return a token's "shape" in terms of the case of its characters:

In [19]:
def token_shape(t):
    return ''.join(['X' if str.isupper(c) else 'x' for c in t])

token_shape('USA'), token_shape('CamelCase'), token_shape('lower')

('XXX', 'XxxxxXxxx', 'xxxxx')

We can now apply this function to our corpus:

In [20]:
from tmtoolkit.preprocess import transform

doc_shapes = transform(docs, token_shape)

# show pairs of tokens and token shapes for the first 10 tokens in the first document
list(zip(docs[0][:10], doc_shapes[0][:10]))

[('White', 'Xxxxx'),
 ('House', 'Xxxxx'),
 ('aides', 'xxxxx'),
 ('told', 'xxxx'),
 ('to', 'xx'),
 ('keep', 'xxxx'),
 ('Russia-related', 'Xxxxxxxxxxxxxx'),
 ('materials', 'xxxxxxxxx'),
 ('Lawyers', 'Xxxxxxx'),
 ('for', 'xxx')]
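Conceptually, applying a function to each token in each document amounts to a nested list comprehension. A minimal sketch with a hypothetical helper `apply_to_tokens`, shown only to illustrate the idea:

```python
def apply_to_tokens(docs, func):
    """Apply `func` to every token of every document (hypothetical helper)."""
    return [[func(tok) for tok in doc] for doc in docs]

def token_shape(t):
    return ''.join('X' if c.isupper() else 'x' for c in t)

docs = [['White', 'House', 'aides'], ['USA']]
print(apply_to_tokens(docs, token_shape))
# [['Xxxxx', 'Xxxxx', 'xxxxx'], ['XXX']]
```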

#### Keywords-in-context (KWIC)

*Keywords-in-context (KWIC)* allow you to quickly investigate certain keywords and their neighborhood of tokens, i.e. the tokens that appear right before and after this keyword.

tmtoolkit provides two functions for this purpose:

- [kwic()](api.rst#tmtoolkit.preprocess.kwic) is the base function accepting the input documents, a search pattern and several options that control how the search pattern is matched (more on that below); use this function when you want to further process the output of a KWIC search;
- [kwic_table()](api.rst#tmtoolkit.preprocess.kwic_table) is the more "user friendly" version of the above function as it produces a datatable with the highlighted keyword by default

Let's see both functions in action:

In [21]:
from tmtoolkit.preprocess import kwic, kwic_table

kwic(docs, 'news')

[[],
 [['told', 'Reuters', 'news', 'agency', '.'],
  ['Jazeera', 'and', 'news', 'agencies']],
 []]

We see that the first and last document do not contain any keyword that matches `"news"`, hence we get empty results for these documents. In the second document, we get two result contexts for the requested keyword. This keyword stands in the middle and is surrounded by its "context tokens", which by default means two tokens to the left and two tokens to the right. Notice that in the second result context only one token to the right is shown since the document ends after "agencies".

In [22]:
kwic_table(docs, 'news')

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | 1 | 0 | told Reuters \*news\* agency . |
| 1 | 1 | 1 | Jazeera and \*news\* agencies |


With `kwic_table()`, we get back a datatable that is formatted for quick inspection. The matched tokens are highlighted as `*news*` and empty results are removed: only document `1` contains the keyword, which is the *second* document, since Python indexing starts at 0.

We can also pass the document labels via `doc_labels` to get proper labels in the `doc` column instead of document indices:

In [23]:
kwic_table(docs, 'news', doc_labels=doc_labels)

|   | doc | context | kwic |
|---|-----|---------|------|
| 0 | NewsArticles-3350 | 0 | told Reuters \*news\* agency . |
| 1 | NewsArticles-3350 | 1 | Jazeera and \*news\* agencies |

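The basic idea behind a KWIC search with exact matching can be sketched in a few lines of plain Python. The function `toy_kwic` is hypothetical and much simpler than `kwic()`, which supports further matching options; the second toy document loosely mimics the end of the news article above:

```python
def toy_kwic(docs, keyword, context_size=2):
    """Return, per document, a list of context windows around `keyword`."""
    results = []
    for doc in docs:
        contexts = []
        for i, tok in enumerate(doc):
            if tok == keyword:
                # Clamp the left edge; slicing clamps the right edge itself.
                start = max(0, i - context_size)
                contexts.append(doc[start:i + context_size + 1])
        results.append(contexts)
    return results

docs = [
    ['no', 'match', 'here'],
    ['he', 'told', 'Reuters', 'news', 'agency', '.',
     'Al', 'Jazeera', 'and', 'news', 'agencies'],
]
print(toy_kwic(docs, 'news'))
# [[], [['told', 'Reuters', 'news', 'agency', '.'], ['Jazeera', 'and', 'news', 'agencies']]]
```

As in the real output, the last window is shorter because the document ends one token after the keyword.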

#### Filtering tokens

- `token_match`
- `filter_tokens`, `remove_tokens`, `filter_documents`, `remove_documents`
- `filter_documents_by_name`, `remove_documents_by_name`, `filter_for_pos`
- `remove_common_tokens`, `remove_uncommon_tokens`

#### Expanding contractions and "gluing" tokens

#### Generating n-grams

#### Generating a sparse document-term matrix (DTM)

