# P21 Introduction Textmining

## Agenda

Natural Language Processing\
Information extraction architecture\
Topic modelling

Textmining toolstacks
* in python
    * [NLTK](https://www.nltk.org/)
    * [spaCy](https://spacy.io/)
    * notebook examples
* in R
    * tm package
    * [tidytext](https://www.tidytextmining.com/index.html)

## Natural Language Processing

The term __Natural Language Processing__ (NLP) encompasses a broad set of techniques
__for the automated generation, manipulation and analysis of natural or human
languages__.

Although most NLP techniques derive largely from Linguistics and Artificial
Intelligence, they are also influenced by relatively new areas such as
Machine Learning, Computational Statistics and Cognitive Science.

## Can Humans Parse Natural Language?

<div id="left", class="smaller">

__Usually not!__\
We make mistakes on complex syntactic structures.\
We can’t parse without world knowledge and lexical knowledge:
* We need to know what we’re talking about
* We need to know the words used

__Garden Path Sentences__ (sentences that humans usually misparse on first reading)
* While she hunted the deer ran into the woods.
* The woman who whistles tunes pianos.

Confusing without context, sometimes even with it.\
Early semantic/pragmatic feedback in syntactic discrimination

__Center Embedding__\
Leads to “stack overflow”
* The mouse ran.
* The mouse the cat chased ran.
* The mouse the cat the dog bit chased ran.
* The mouse the cat the dog the person petted bit chased ran.
</div>
<div id="right", class="smaller">

<span style="color:red">__Problem is ambiguity and eager decision making.__\
We can only keep a few analyses in memory at a time!</span>

![](./png/thomas_bever.png)

Thomas Bever
</div>

## Information Extraction Architecture

<img align="middle" src="./png/Information_extraction_architecture.png" width="800"/>

## Some Basic Terminology

* __Token__: Before any real processing can be done on the input text, it needs to be segmented into linguistic units such as words, punctuation, numbers or alphanumerics. These units are known as tokens.
* __Sentence__: An ordered sequence of tokens.
* __Tokenization__: The process of splitting a sentence into its constituent tokens. For segmented languages such as English, whitespace makes tokenization relatively easy. For other languages the task is harder: Chinese is written without spaces between words, so there are no explicit word boundaries, and in Arabic clitics are attached to the words they accompany.
* __Corpus__: A body of text, usually containing a large number of sentences.
* __Part-of-speech (POS) Tag__: A word can be classified into one or more lexical or part-of-speech categories such as nouns, verbs, adjectives and articles, to name a few. A POS tag is a symbol representing such a lexical category, for example NN (noun), VB (verb), JJ (adjective) or AT (article). One of the oldest and most commonly used tag sets is the Brown Corpus tag set.
* __Parse Tree__: A tree defined over a given sentence that represents the syntactic structure of the sentence as defined by a formal grammar.

## Tokenization

Tokenizers divide strings into lists of substrings.

For example, tokenizers can be used to find the list of sentences or words in a string.
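As a minimal sketch of both uses, the snippet below splits a string into sentences and then into word and punctuation tokens using only regular expressions. The function names `split_sentences` and `tokenize` are made up for this illustration; libraries such as NLTK ship far more robust tokenizers (for instance `nltk.sent_tokenize` and `nltk.word_tokenize`).

```python
import re

# Naive sentence boundary: ., ! or ? followed by whitespace.
# This will wrongly split after abbreviations such as "Dr.".
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A token is a run of word characters or a single punctuation mark.
TOKEN = re.compile(r"\w+|[^\w\s]")


def split_sentences(text):
    """Split text into sentences at ., ! or ? followed by whitespace."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence):
    """Split one sentence into word and punctuation tokens."""
    return TOKEN.findall(sentence)


text = "The mouse ran. The mouse the cat chased ran."
sentences = split_sentences(text)
tokens = [tokenize(s) for s in sentences]

print(sentences)
print(tokens)
```

Note that this regex treats an apostrophe as its own token, so contractions such as "can't" are split into three pieces. Handling such cases well is one reason to prefer a library tokenizer.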

## Stemming

Stemmers remove morphological affixes from words, leaving only the word stem ([online demo](http://text-processing.com/demo/stem/)).

Simple stemmers handle:\
Plurals\
Verb forms

Different Stemming Algorithms:
* Paice/Husk Stemming Algorithm
* Porter Stemming Algorithm
* Lovins Stemming Algorithm
* Dawson Stemming Algorithm
* Krovetz Stemming Algorithm
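To make the idea of a simple stemmer concrete, here is a toy suffix stripper for English plurals and verb endings. It is not one of the algorithms listed above: the rule table and the `min_stem` threshold are assumptions chosen for this sketch, and real stemmers such as Porter's apply many more rules and conditions.

```python
# Ordered (suffix, replacement) rules; the first matching rule wins.
# Illustrative only, far cruder than the Porter algorithm.
RULES = [
    ("sses", "ss"),  # classes -> class
    ("ies", "y"),    # ponies  -> pony
    ("ss", "ss"),    # glass   -> glass (protects the final "ss")
    ("ing", ""),     # walking -> walk
    ("ed", ""),      # walked  -> walk
    ("s", ""),       # cats    -> cat
]


def simple_stem(word, min_stem=3):
    """Strip one known suffix, keeping at least `min_stem` characters before it."""
    word = word.lower()
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word


words = ["cats", "ponies", "walked", "walking", "classes", "bus"]
stems = [simple_stem(w) for w in words]
print(list(zip(words, stems)))
```

The sketch also shows why real algorithms are more elaborate: `simple_stem("running")` returns `"runn"`, because nothing here undoes the doubled consonant.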

## Parts of Speech Tagging (PoS tagging)

Part-of-speech tagging (PoS tagging) assigns a part of speech to each word in a text ([online demo](https://corenlp.run/)).

    Als vliegen vliegen vliegen vliegen vliegensvlug.
    Als/CONJ vliegen/NN vliegen/VB vliegen/VB vliegen/NN vliegensvlug/ADV

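The Dutch example reads roughly "When flies fly, flies fly lightning-fast": the word form "vliegen" is tagged twice as a noun (flies) and twice as a verb (to fly).

PoS tagging is a kind of word sense disambiguation: the PoS tag gives some information about the sense of the word in its context of use. It is a non-trivial task: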

* Some words in the lexicon or dictionary have more than one possible part of speech. For example, "vliegen" can be a noun (flies) as well as a verb (to fly).\
Note that even if we restrict ourselves to verbs, the word “vliegen” has several senses: "Een vogel kan vliegen" (a bird can fly), "Als de bom valt vliegen de mensen uiteen" (when the bomb falls, people scatter).
* Some words are unknown.
* Tags are not well-defined. In "Wat fietsen", is “fietsen” a noun (some bicycles) or a verb (cycling a bit)?
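The first two difficulties can be seen in a small sketch of a baseline tagger that looks every word up in a lexicon and ignores context. The lexicon, the tag order inside it and the `UNK` tag are hypothetical and invented for this illustration. Real taggers (for example `nltk.pos_tag`) use the surrounding words to choose between candidate tags.

```python
# Hypothetical mini-lexicon: each word maps to its possible tags,
# with the tag assumed to be most frequent listed first.
LEXICON = {
    "als": ["CONJ"],
    "vliegen": ["VB", "NN"],
    "vliegensvlug": ["ADV"],
}


def baseline_tag(tokens, unknown_tag="UNK"):
    """Tag each token with its first lexicon tag, ignoring context."""
    return [(tok, LEXICON.get(tok.lower(), [unknown_tag])[0]) for tok in tokens]


def ambiguous_tokens(tokens):
    """Return the tokens that have more than one possible tag."""
    return [tok for tok in tokens if len(LEXICON.get(tok.lower(), [])) > 1]


tokens = "Als vliegen vliegen vliegen vliegen vliegensvlug".split()
tagged = baseline_tag(tokens)

# Tags from the example sentence in this section.
GOLD = ["CONJ", "NN", "VB", "VB", "NN", "ADV"]
correct = sum(tag == gold for (_, tag), gold in zip(tagged, GOLD))

print(tagged)
print(f"{correct} of {len(GOLD)} tags correct")
print(baseline_tag(["fiets"]))  # a word missing from the lexicon
```

Because the baseline always picks the same tag for "vliegen", it cannot get both the noun and the verb readings right in one sentence. Resolving that requires looking at neighbouring words.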

## Parse Tree example

<img align="middle" src="./png/parse_tree_example.png" width="800"/>

## Chunking

<img align="middle" src="./png/chunking.png" width="600"/>

The basic technique we will use for entity detection is chunking, which
segments and labels multi-token sequences as illustrated above.

The smaller boxes show the word-level tokenization and part-of-speech
tagging, while the larger boxes show higher-level chunking. Each of these
larger boxes is called a chunk.
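A rough sketch of noun-phrase chunking over already tagged tokens is given below. It assumes one simple pattern, an optional determiner (`DT`), any number of adjectives (`JJ`) and one or more nouns (`NN`). The function name `np_chunk` and the example sentence are made up for this illustration. NLTK offers a grammar-driven alternative in `nltk.RegexpParser`.

```python
def np_chunk(tagged):
    """Group tokens matching DT? JJ* NN+ into ("NP", [...]) chunks.

    Tokens outside a chunk are returned unchanged as (word, tag) pairs.
    """
    result, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                 # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":    # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] == "NN":    # one or more nouns
            k += 1
        if k > j:
            # At least one noun was found: tokens i..k-1 form one chunk.
            result.append(("NP", tagged[i:k]))
            i = k
        else:
            # No noun followed: keep the current token unchunked.
            result.append(tagged[i])
            i += 1
    return result


sentence = [
    ("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
    ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"),
]
chunks = np_chunk(sentence)
for item in chunks:
    print(item)
```

The chunker depends entirely on the tags it is given, so tagging errors carry over directly into wrong chunk boundaries.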

## Universal Part-of-Speech Tagset

<img align="middle" src="./png/universal_part_of_speech_tagset.png" width="600"/>


## Named Entity Recognition

There are websites and APIs that can carry out this process for you.

For example:
http://text-processing.com/demo/
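To give a feel for what such a service returns, here is a deliberately simple sketch that finds entities by dictionary lookup (a gazetteer). The gazetteer entries, labels and example sentence are all hypothetical. Real named entity recognizers are typically statistical models that also use context, so this only illustrates the input and output shape.

```python
# Hypothetical gazetteer: lowercased token tuples mapped to an entity label.
GAZETTEER = {
    ("anna", "jansen"): "PERSON",
    ("utrecht",): "LOCATION",
    ("amsterdam",): "LOCATION",
}
MAX_LEN = max(len(name) for name in GAZETTEER)


def find_entities(tokens):
    """Return (entity_text, label) pairs using greedy longest-match lookup."""
    entities, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for size in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + size])
            if key in GAZETTEER:
                entities.append((" ".join(tokens[i:i + size]), GAZETTEER[key]))
                i += size
                break
        else:
            i += 1  # no entity starts at this token
    return entities


tokens = "Anna Jansen moved from Utrecht to Amsterdam".split()
entities = find_entities(tokens)
print(entities)
```

A lookup like this misses every name that is not in the dictionary and cannot tell apart words that are only sometimes names, which is why practical systems rely on trained models.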

## Recap: Natural Language Processing

* Tokenization
* Stemming
* Tagging
* Chunking
* Entity Recognition

<img align="middle" src="./png/nlp.png" width="600"/>


### FROG

For the Dutch language: [Frog](https://languagemachines.github.io/frog/)

## Textmining in python

Textmining and natural language processing have become a huge field of research...

* [NLTK](https://www.nltk.org/)
* [spaCy](https://spacy.io/)
* [Gensim](https://radimrehurek.com/gensim/)

Deep learning advances in natural language processing have been enormous; see [Hugging Face](https://huggingface.co/)

* word embeddings (TensorFlow Embedding Projector [demo](https://projector.tensorflow.org/))
* [transformers](https://towardsdatascience.com/transformers-explained-visually-part-1-overview-of-functionality-95a6dd460452)
* automatic translation
* automatic summarization
* automatic Q&A chatbots

## Jupyter notebook basic examples

Start a Jupyter notebook server and play around with some examples:

    jupyter notebook

![](./png/jupyter_notebook.png)

## Questions?