# Basic text processing, data sources and corpora

## Short intro to NLTK

NLTK library organization ([modules](http://www.nltk.org/py-modindex.html)):

| Module |	Shortcuts |	Data Structures	| Interfaces|	NLP Pyramid|
|---|---|---|---|---|
|*nltk.stem*, *nltk.text*, *nltk.tokenize* | *word_tokenize*, *sent_tokenize* |	*str*, *nltk.Text* => *[str]* |	*StemmerI*, *TokenizerI* |	*Morphology* |
|*nltk.tag*, *nltk.chunk*|	*pos_tag*|	*[str]* => *[(str, tag)]*, *nltk.Tree*|	*TaggerI*, *ParserI*, *ChunkParserI*	|Syntax|
|*nltk.chunk*, *nltk.sem*|	*ne_chunk*|	*nltk.Tree*, *nltk.DependencyGraph*|	*ParserI*, *ChunkParserI*|	Semantics|
|*nltk.sem.drt*	|–|	*Expression*	|–|	Pragmatics|

An example that passes through the first three levels of the NLP Pyramid:

In [None]:
from nltk import word_tokenize, pos_tag, ne_chunk
 
text = "John works at OBI." 
 
# Morphology Level
tokens = word_tokenize(text)
print("Tokens:", tokens)
 
# Syntax Level
tagged_tokens = pos_tag(tokens)
print("POS tagging:", tagged_tokens)
 
# Semantics Level
ner_tree = ne_chunk(tagged_tokens)
print("Light parsing:", ner_tree)

When working with text data, a [Text object](http://www.nltk.org/api/nltk.html#nltk.text.Text) can be useful:

In [None]:
from nltk import Text
from nltk.corpus import reuters
 
text = Text(reuters.words())
 
print("Similar words to 'Monday':")
text.similar('Monday', 5)

print("\nCommon contexts to a list of words ('August', 'June'):")
text.common_contexts(['August', 'June'])

print("\nContexts of a word 'Monday':")
text.concordance('Monday')

Working with n-grams (bigrams, trigrams) and collocations extraction:

In [None]:
import nltk
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
 
# Bigrams
finder = BigramCollocationFinder.from_words(nltk.corpus.reuters.words())
finder.apply_freq_filter(5)

print("\nBest 50 bigrams according to PMI:", finder.nbest(bigram_measures.pmi, 50))
 
# Trigrams
finder = TrigramCollocationFinder.from_words(nltk.corpus.reuters.words())
finder.apply_freq_filter(5)
 
print("\nBest 50 trigrams according to PMI:", finder.nbest(trigram_measures.pmi, 50))
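
To make the PMI ranking above more concrete, here is a minimal pure-Python sketch that scores adjacent word pairs in a toy token list. It assumes the common definition PMI(x, y) = log2(P(x, y) / (P(x) P(y))) with probabilities estimated from raw counts. The helper name `bigram_pmi` and the toy sentence are illustrative only, and NLTK's own implementation may differ in details.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_freq=1):
    """Return {(w1, w2): PMI} for adjacent word pairs seen at least min_freq times."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigram_counts.items():
        if count < min_freq:
            continue  # same idea as apply_freq_filter: drop rare pairs
        p_xy = count / n_bigrams
        p_x = unigram_counts[w1] / n_tokens
        p_y = unigram_counts[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores

# Toy data (an assumption for illustration, not taken from the Reuters corpus)
toy_tokens = "new york is big and new york is busy and life is new".split()
pmi_scores = bigram_pmi(toy_tokens, min_freq=2)
for pair, score in sorted(pmi_scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(score, 3))
```

PMI tends to reward pairs of rare words, which is why a frequency filter is usually applied before ranking.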

Conversion between different data formats:

In [None]:
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tag import untag, str2tuple, tuple2str
from nltk.chunk import tree2conllstr, conllstr2tree, conlltags2tree, tree2conlltags
 
text = "John works at OBI."
 
tokens = word_tokenize(text)
print("Tokens: ", tokens)
 
tagged_tokens = pos_tag(tokens)
print("\nTagged tokens: ", tagged_tokens)
 
print("\nUntagged tokens", untag(tagged_tokens))
 
tagged_tokens = [tuple2str(t) for t in tagged_tokens] 
print("\nTagged tokens to strings:", tagged_tokens)
 
tagged_tokens = [str2tuple(t) for t in tagged_tokens]
print("\nTagged tokens from strings to tuples:",  tagged_tokens)
 
ner_tree = ne_chunk(tagged_tokens)
print("\nNER tree:", ner_tree)
 
iob_tagged = tree2conlltags(ner_tree)
print("\nIOB tagged tree:", iob_tagged)
 
ner_tree = conlltags2tree(iob_tagged)
print("\nBack to tree:", ner_tree)
 
tree_str = tree2conllstr(ner_tree)
print("\nTree as CoNLL string:\n", tree_str)
 
ner_tree = conllstr2tree(tree_str, chunk_types=('PERSON', 'ORGANIZATION'))
print("\nCoNLL string to tree:", ner_tree)
 

## Sentence splitting

Generally, pretrained models are good enough for tokenization. You might encounter issues using them if you are working with a specific genre of text (e.g. technical with specific abbreviations) or working with an unsupported language.

By default, NLTK uses *PunktSentenceTokenizer*, which is an unsupervised, trainable model. A simple scenario:

In [None]:
from nltk import sent_tokenize
 
sentence = "All FRI students could get a Msc. in Computer Science."
print("Sentences:", sent_tokenize(sentence))

The output above is not what we want (the input is a single sentence), so let's train our own model.

In [None]:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
from nltk.corpus import gutenberg
 
# 1. Prepare text data
text = ""
for file_id in gutenberg.fileids():
    text += gutenberg.raw(file_id)

# 2. Train the params
trainer = PunktTrainer()
trainer.INCLUDE_ALL_COLLOCS = True
trainer.train(text)
 
# 3. Instantiate the model
tokenizer = PunktSentenceTokenizer(trainer.get_params())

Now we can test again ...

In [None]:
sentences = "Mr. James told me Dr. Zitnik is not available today. I will go to his office hours."
 
print("Tokenized: ", tokenizer.tokenize(sentences))
print("\nLearned abbreviations:", tokenizer._params.abbrev_types)
 
from pprint import pprint
print("\nSplit decisions debugging:")
for decision in tokenizer.debug_decisions(sentences):
    pprint(decision)
    print('=' * 30)

We can manually add abbreviations or other parameters to update the model.

In [None]:
tokenizer._params.abbrev_types.add('msc')
tokenizer._params.abbrev_types.add('dr')

sentence = "All FRI students could get a Msc. in Computer Science."
print("Tokenized:", tokenizer.tokenize(sentence))

sentences = "Mr. James told me Dr. Zitnik is not available today. I will go to his office hours."
print("Tokenized: ", tokenizer.tokenize(sentences))
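
To illustrate why the abbreviation list matters, here is a deliberately naive splitter. It is not the Punkt algorithm: it simply breaks after sentence-final punctuation unless the token is a known abbreviation. The function name `naive_sent_split` and the abbreviation set are hypothetical; the set uses lowercase, period-free entries like the ones added above.

```python
def naive_sent_split(text, abbreviations):
    """Split after tokens ending in '.', '!' or '?', except after known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?":
            # Compare the token without its final period, lowercased.
            if token[-1] == "." and token[:-1].lower() in abbreviations:
                continue  # abbreviation: do not end the sentence here
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

text = "Mr. James told me Dr. Zitnik is not available today. I will go to his office hours."
without_abbrevs = naive_sent_split(text, abbreviations=set())
with_abbrevs = naive_sent_split(text, abbreviations={"mr", "dr"})
print(without_abbrevs)
print(with_abbrevs)
```

Without the abbreviation set the text is cut after "Mr." and "Dr."; with it, only the two real sentence boundaries remain.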

## Basic text processing tools

### Stemming

> In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.

One of the first stemmers was designed by [Martin Porter](https://tartarus.org/martin/PorterStemmer/) (*M.F. Porter, 1980, An algorithm for suffix stripping, Program, 14(3), pp. 130−137*). It is a suffix-stripping algorithm and remains among the most widely used stemmers. Porter later continued this work and presented an improved stemmer for English, along with stemmers for several other languages, under the name Snowball.

NLTK includes several stemmer implementations, among them Porter, Lancaster, and Snowball.

In [None]:
import nltk
import string

# Prepare text
def getTokens():
    with open('shakespeare.txt', 'r') as shakes:
        text = shakes.read().lower()
    # Remove punctuation
    table = str.maketrans('', '', string.punctuation)
    text = text.translate(table)
    tokens = nltk.word_tokenize(text)
    return tokens

# Do stemming
from nltk.stem import SnowballStemmer
stemmer = SnowballStemmer("english")
stems = [(token, stemmer.stem(token)) for token in getTokens()]
stems[:50]
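
The idea of suffix stripping can be sketched in a few lines. The following toy stemmer is far simpler than the Porter or Snowball algorithms and is only meant to show the principle; the suffix list, the minimum stem length, and the name `toy_stem` are assumptions made for this illustration.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix if the remaining stem is long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["walk", "walks", "walked", "walking", "is", "bus"]:
    print(word, "->", toy_stem(word))
```

Real stemmers apply many more rules and conditions on the remaining stem to limit over-stemming, so their output will differ from this sketch on many words.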

### Lemmatisation

> Lemmatisation (or lemmatization) in linguistics is the process of grouping together the different inflected forms of a word so they can be analysed as a single item.

> In computational linguistics, lemmatisation is the algorithmic process of determining the lemma for a given word. Since the process may involve complex tasks such as understanding context and determining the part of speech of a word in a sentence (requiring, for example, knowledge of the grammar of a language) it can be a hard task to implement a lemmatiser for a new language.

> In many languages, words appear in several inflected forms. For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks', 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma for the word.

> Lemmatisation is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications.

In [None]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(token) for token in getTokens()]
lemmas

In [None]:
# Find the differences between lemmas and stems

The function above assumes that the word is a noun by default, so we should also pass part-of-speech information, e.g., `lemmatizer.lemmatize("walking", pos='v')` for verbs.

In [None]:
# Try to lemmatise is, are, was, were without and with pos parameter. Is there a difference?
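
Since `pos_tag` returns Penn Treebank tags while the WordNet lemmatizer expects a single-letter WordNet part of speech, a small mapping is often used between the two. The sketch below is one possible mapping; the helper name `penn_to_wordnet_pos` and the hand-tagged sample are assumptions for illustration, not output of `pos_tag`.

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to a WordNet POS letter; fall back to noun."""
    if penn_tag.startswith("J"):
        return "a"  # adjective (JJ, JJR, JJS)
    if penn_tag.startswith("V"):
        return "v"  # verb (VB, VBD, VBG, ...)
    if penn_tag.startswith("RB"):
        return "r"  # adverb (RB, RBR, RBS)
    return "n"      # noun, which is also the lemmatizer's default

# Hand-tagged sample (assumed tags)
tagged = [("The", "DT"), ("dogs", "NNS"), ("were", "VBD"), ("barking", "VBG"), ("loudly", "RB")]
for word, tag in tagged:
    print(word, tag, penn_to_wordnet_pos(tag))
```

With the `lemmatizer` defined above, the result could be passed on as `lemmatizer.lemmatize(word, pos=penn_to_wordnet_pos(tag))`.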

### Part-of-speech tagging

> A part of speech (abbreviated form: PoS or POS) is a category of words (or, more generally, of lexical items) which have similar grammatical properties. Words that are assigned to the same part of speech generally display similar behavior in terms of syntax—they play similar roles within the grammatical structure of sentences—and sometimes in terms of morphology, in that they undergo inflection for similar properties. Commonly listed English parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, and sometimes numeral, article or determiner.

NLTK provides several tagger implementations and supports multiple tagsets ([source](http://www.nltk.org/_modules/nltk/tag.html#pos_tag)).

In [None]:
posTagged = nltk.pos_tag(getTokens())
posTagged[:50]

By default, the Penn Treebank tags are used ([tagset](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)).

In [None]:
nltk.help.upenn_tagset("NNP") # Meaning of a specific tag with examples

### Documents comparison using TF-IDF

TF-IDF is a common metric in information retrieval which weights a term with respect to a list of documents.

> In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval and text mining. The tf-idf value increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general.

> Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.

The metric is calculated as follows:

$$\textrm{TF}(term) = \frac{\textrm{Number of times term appears in a document}}{\textrm{Total number of terms in the document}}$$

$$\textrm{IDF}(term) = \frac{\textrm{Total number of documents}}{\textrm{Number of documents with the term in it}}$$

$$\textrm{TF-IDF}(term) = \textrm{TF}(term) * \log_e(\textrm{IDF}(term))$$
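
The three formulas above can be evaluated by hand on a tiny corpus. The sketch below follows them literally; the function names and the pre-tokenized toy documents are assumptions for illustration. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each document vector, so its numbers will differ from these.

```python
import math

# Toy corpus, already tokenized and lowercased
documents = [
    ["halloween", "is", "scary"],
    ["people", "wear", "halloween", "masks"],
    ["people", "build", "scary", "lantern", "pumpkins"],
]

def tf(term, document):
    """Term count divided by the total number of terms in the document."""
    return document.count(term) / len(document)

def idf(term, documents):
    """Number of documents divided by the number of documents containing the term."""
    containing = sum(1 for doc in documents if term in doc)
    return len(documents) / containing  # assumes the term occurs somewhere in the corpus

def tf_idf(term, document, documents):
    return tf(term, document) * math.log(idf(term, documents))

for term in ["halloween", "scary", "is"]:
    print(term, round(tf_idf(term, documents[0], documents), 4))
```

The word "is" appears in only one document and therefore receives a higher weight than "halloween" or "scary", which each appear in two.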

Computing TF-IDF scores for given documents:

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer() # parameters for tokenization, stopwords can be passed
tfidf = vect.fit_transform(["Halloween is scary!",
                            "People wear halloween masks.",
                            "There exist many masks and halloween pumpkins.",
                            "People build scary lantern pumpkins."])

print("TF-IDF vectors (each column is a document):\n{}\nRows:\n{}".format(tfidf.T.toarray(), vect.get_feature_names_out()))

Computing [cosine similarities](https://en.wikipedia.org/wiki/Cosine_similarity) for documents:

In [None]:
cosine = (tfidf @ tfidf.T).toarray()
print("Cosine similarity between the documents: \n{}".format(cosine))
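
The matrix product of `tfidf` with its transpose is enough here because `TfidfVectorizer` L2-normalizes each document vector by default, so the dot product of two rows already equals their cosine. For vectors that are not normalized, the full formula is needed. A minimal pure-Python sketch follows; the function name and the example vectors are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0, 1], [1, 0, 1]))  # identical direction
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # no shared terms
print(cosine_similarity([1, 1, 0], [1, 0, 0]))  # partial overlap
```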

Comparing a new document (or searching) to the corpus:

In [None]:
weights = vect.transform(["Where to buy a halloween mask and pumpkins?"])

# HINT: If the query shares no terms with the corpus vocabulary,
# a zero vector is returned.
print("New document:\n{}".format(weights.T.toarray()))

In [None]:
# Try to calculate, which documents are the most similar to a query above

## Useful data sources/tools

When recognizing important parts of a text, it helps to consult semantic databases or gazetteer lists (e.g., lists of organization names, countries, ...).
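
A gazetteer lookup can be as simple as a dictionary of known names matched against the token sequence. The sketch below prefers the longest match at each position; the gazetteer entries, the entity types, and the function name `gazetteer_matches` are hypothetical and chosen only for illustration.

```python
# Hypothetical gazetteer: lowercased surface form -> entity type
GAZETTEER = {
    "slovenia": "COUNTRY",
    "ljubljana": "CITY",
    "university of ljubljana": "ORGANIZATION",
}
MAX_LEN = max(len(entry.split()) for entry in GAZETTEER)

def gazetteer_matches(tokens):
    """Return (start, end, type) for the longest gazetteer match at each position."""
    matches, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + length]).lower()
            if candidate in GAZETTEER:
                matches.append((i, i + length, GAZETTEER[candidate]))
                i += length
                break
        else:
            i += 1  # nothing matched at this position
    return matches

tokens = "The University of Ljubljana is in Ljubljana , Slovenia .".split()
matches = gazetteer_matches(tokens)
print(matches)
```

Preferring the longest match keeps "University of Ljubljana" from being labeled as a city.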

### WordNet

> WordNet is a lexical database for the English language. It groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members. WordNet can thus be seen as a combination of dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. The database and software tools have been released under a BSD style license and are freely available for download from the WordNet website. Both the lexicographic data (lexicographer files) and the compiler (called grind) for producing the distributed database are available.

Let's find all sets of synonyms for the word `book`.

In [None]:
from nltk.corpus import wordnet
syns = wordnet.synsets("book")
syns

Print out some data about these synsets.

In [None]:
for synset in syns:
    print("Synset name: '{}'".format(synset.name()))
    print("Lemmas:     '{}'".format([lemma.name() for lemma in synset.lemmas()]))
    print("Definition: '{}'".format(synset.definition()))
    print("Examples:   '{}'\n".format(synset.examples()))

Find synonyms, antonyms, hypernyms and hyponyms of the word `good`.

In [None]:
synonyms = []
antonyms = []
hypernyms = []
hyponyms = []

for synset in wordnet.synsets("good"):
    # Hypernyms and hyponyms are relations between synsets
    if synset.hypernyms():
        hypernyms.append(synset.hypernyms()[0].name())
    if synset.hyponyms():
        hyponyms.append(synset.hyponyms()[0].name())
    # Synonyms and antonyms are collected from the lemmas of each synset
    for lemma in synset.lemmas():
        synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())

print("Synonyms:  '{}'\n".format(set(synonyms)))
print("Antonyms:  '{}'".format(set(antonyms)))
print("Hypernyms: '{}'".format(set(hypernyms)))
print("Hyponyms:  '{}'".format(set(hyponyms)))

Compare similarity of two synsets.

The Wu & Palmer measure calculates relatedness by considering the depths of the two synsets in the WordNet taxonomies, along with the depth of their least common subsumer (LCS), i.e., the most specific ancestor that both synsets share. The formula is `score = 2 * depth(lcs) / (depth(s1) + depth(s2))`. This means that 0 < score <= 1. The score can never be zero because the depth of the LCS is never zero (the depth of the root of a taxonomy is one). The score is one if the two input synsets are the same.

In [None]:
good = wordnet.synsets("good")[0]
nice = wordnet.synsets("nice")[0]
bad  = wordnet.synsets("bad")[0]

print("Good vs. nice: {:.2}".format(good.wup_similarity(nice)))
print(" Good vs. bad: {:.2}".format(good.wup_similarity(bad)))
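
The Wu & Palmer formula can be traced by hand on a small taxonomy. The sketch below uses a hypothetical single-parent tree in which the root has depth one, as described above. WordNet itself allows multiple hypernym paths, so NLTK's `wup_similarity` involves more than this sketch shows.

```python
# Hypothetical toy taxonomy: node -> parent (the root has parent None)
PARENT = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "plant": "entity",
    "tree": "plant",
}

def path_to_root(node):
    """Return the list of nodes from the given node up to the root."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    return len(path_to_root(node))  # the root has depth 1

def wup(s1, s2):
    ancestors_of_s1 = set(path_to_root(s1))
    # Walking up from s2, the first node shared with s1 is the LCS.
    lcs = next(node for node in path_to_root(s2) if node in ancestors_of_s1)
    return 2 * depth(lcs) / (depth(s1) + depth(s2))

print("dog vs. cat: ", round(wup("dog", "cat"), 2))   # LCS is "animal"
print("dog vs. tree:", round(wup("dog", "tree"), 2))  # LCS is "entity"
print("dog vs. dog: ", round(wup("dog", "dog"), 2))   # identical nodes
```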

### Other sources

A plethora of other data sources exist, for example:

* [VerbNet](http://verbs.colorado.edu/~mpalmer/projects/verbnet.html)
* [FrameNet](https://framenet.icsi.berkeley.edu/fndrupal/)
* [SentiStrength](http://sentistrength.wlv.ac.uk/)
* [BabelNet](http://babelnet.org/)
* [ConceptNet](http://www.conceptnet.io/)


## Corpora

To train or test an NLP system, we need annotated data against which the system's output can be compared. Several datasets are available from shared tasks and other sources, for example:

* [CoNLL](http://www.conll.org/previous-tasks)
* [SemEval](https://en.wikipedia.org/wiki/SemEval)
* BioNLP [2016](http://2016.bionlp-st.org/), [2013](http://2013.bionlp-st.org/), [2011](http://2011.bionlp-st.org/), [2009](http://www.nactem.ac.uk/tsujii/GENIA/SharedTask/)
* [CHEMDNER](http://www.biocreative.org/tasks/biocreative-iv/chemdner/)
* [ACE2004](https://catalog.ldc.upenn.edu/LDC2005T09)
* [Enron email dataset](https://www.cs.cmu.edu/~./enron/)
* [Citeseer](http://csxstatic.ist.psu.edu/about/data)
* [Reuters](http://www.daviddlewis.com/resources/testcollections/reuters21578/)

## Exercise

Select one of the NLP datasets or extract your own and then:

* Give an overview of its structure.
* Report on dataset features (number of tags, words, sentences, ...).
* Use some existing data source and define some manual rules to tackle the dataset problem. Report on the accuracy of your method (percentage of correct labels) and compare to baseline or existing work.
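
Accuracy as described in the last item (percentage of correct labels) and a simple majority-class baseline could be computed as sketched below. The label sequences are made up for illustration, and the majority-class baseline is only one possible choice of baseline.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

# Hypothetical labels for six tokens
gold      = ["O", "PER", "O", "O", "ORG", "O"]
predicted = ["O", "PER", "O", "ORG", "ORG", "O"]

# Majority-class baseline: always predict the most frequent gold label
majority_label = Counter(gold).most_common(1)[0][0]
baseline = [majority_label] * len(gold)

method_acc = accuracy(gold, predicted)
baseline_acc = accuracy(gold, baseline)
print("Method accuracy:   {:.1%}".format(method_acc))
print("Baseline accuracy: {:.1%}".format(baseline_acc))
```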

You can also visualize data using [Matplotlib](http://matplotlib.org/gallery.html).