In [26]:
from __future__ import print_function
import nltk
nltk.download()  # opens the interactive downloader; you only need the examples from the book

# NLTK


A typical NLP pipeline adds richer metadata to the text at each stage so that higher-level information can be extracted, for example **biomedical 'events'**:

1. Gene Expression between the trigger word “expression” and the protein “interferon regulatory factor 4”; and
2. Negative Regulation between the trigger “Down-regulation” and the trigger “expression”, which stands for event 1.

![event](biomedical_event.jpg)

The figure below presents an NLP pipeline to recognize such 'events'. It contains the following steps:

* Reader: read input data and mark the text regions of interest;
* NLP-Preprocessing: perform sentence splitting and tokenization, lemmatization, part-of-speech (POS) tagging, chunking and dependency parsing;
* Concept loader: load relevant concepts;
* Dictionary tagger: perform trigger recognition using previously built dictionaries;
* Machine learning: perform trigger recognition using previously trained models;
* Post-processing: remove false positive trigger names through rule-based approaches;
* Writer: write the output to an external resource.

![nlp](nlp.png)

(source: David Campos, http://doi.org/10.1186/1751-0473-9-1)
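The steps listed above can be sketched in a few lines of plain Python. This is only a toy illustration, not the implementation from the cited paper: the trigger dictionary, the example sentences, the "expression vector" rule and all function names are assumptions made up for this sketch, and the concept loader and machine learning steps are left out.

```python
import json
import re

# Hypothetical trigger dictionary (surface form -> event type); a real
# system would load a much larger, curated resource.
TRIGGERS = {
    "expression": "Gene_expression",
    "down-regulation": "Negative_regulation",
}

def read(raw):
    """Reader: normalise whitespace in the text region of interest."""
    return " ".join(raw.split())

def preprocess(text):
    """NLP-preprocessing, reduced here to sentence and word splitting."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [re.findall(r"[\w-]+|[^\w\s]", s) for s in sentences if s]

def tag_triggers(tokens):
    """Dictionary tagger: look every token up in the trigger dictionary."""
    return [
        {"index": i, "text": tok, "type": TRIGGERS[tok.lower()]}
        for i, tok in enumerate(tokens)
        if tok.lower() in TRIGGERS
    ]

def postprocess(tokens, triggers):
    """Post-processing: one illustrative rule that drops 'expression vector'."""
    return [
        t for t in triggers
        if not (t["text"].lower() == "expression"
                and tokens[t["index"] + 1 : t["index"] + 2] == ["vector"])
    ]

def write(triggers):
    """Writer: serialise to JSON instead of an external resource."""
    return json.dumps(triggers)

# Made-up input: the second sentence contains a false positive.
raw = ("Down-regulation of interferon regulatory factor 4 expression was seen.  "
       "An expression vector was used.")
for tokens in preprocess(read(raw)):
    print(write(postprocess(tokens, tag_triggers(tokens))))
```

The first sentence keeps both triggers, while the rule in `postprocess` removes the match in the second sentence.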

---

## Let's get started ourselves:

**Tokenization** (splitting text into sentences and words)

In [27]:
text = "I just ate the cake with a spoon"
words = nltk.word_tokenize(text)
words

['I', 'just', 'ate', 'the', 'cake', 'with', 'a', 'spoon']
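To make both halves of tokenization concrete (sentences and words), here is a rough regex-based approximation using only the standard library. It is not the algorithm NLTK uses; `simple_sent_split` and `simple_word_tokenize` are hypothetical helper names, and the second sentence is added only to have something to split.

```python
import re

def simple_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    """Return words (optionally with an apostrophe part) and single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "I just ate the cake with a spoon. It was good!"
for sent in simple_sent_split(text):
    print(simple_word_tokenize(sent))
```

A rule this simple breaks on abbreviations such as "Dr." or "e.g.", which is one reason to prefer a trained tokenizer for real text.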

**Part-of-speech tagging** assigns the most probable POS tag to each token.

In [28]:
nltk.pos_tag(words)

[('I', 'PRP'),
 ('just', 'RB'),
 ('ate', 'VB'),
 ('the', 'DT'),
 ('cake', 'NN'),
 ('with', 'IN'),
 ('a', 'DT'),
 ('spoon', 'NN')]

In [29]:
tagged = nltk.pos_tag(words)  # tag once and reuse the result for both filters
print('only nouns: ', [word for word, tag in tagged if tag.startswith('N')])
print('only verbs: ', [word for word, tag in tagged if tag.startswith('V')])

only nouns:  ['cake', 'spoon']
only verbs:  ['ate']
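Filtering on the start of the tag generalizes to grouping all words by a coarse tag. The sketch below hard-codes the tagged list exactly as printed above so that it runs without the tagger; using the first two characters of the tag as the coarse class is an assumption of this sketch, not an NLTK convention.

```python
from collections import Counter, defaultdict

# Tagger output copied from the cell above.
tagged = [('I', 'PRP'), ('just', 'RB'), ('ate', 'VB'), ('the', 'DT'),
          ('cake', 'NN'), ('with', 'IN'), ('a', 'DT'), ('spoon', 'NN')]

# How often does each tag occur?
tag_counts = Counter(tag for _, tag in tagged)

# Group words by the first two characters of the tag,
# so that e.g. NN, NNS and NNP would all land under 'NN'.
by_coarse = defaultdict(list)
for word, tag in tagged:
    by_coarse[tag[:2]].append(word)

print(tag_counts)
print(dict(by_coarse))
```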


**Chunking** and **named entity recognition**

In [30]:
from nltk import word_tokenize, ne_chunk, pos_tag
text2 = "Barack Obama meets Michael Jackson in Nihonbashi"
chunked = ne_chunk(pos_tag(word_tokenize(text2)))
for i in chunked:
    print(i)

(PERSON Barack/NNP)
(ORGANIZATION Obama/NNP)
('meets', 'VBZ')
(PERSON Michael/NNP Jackson/NNP)
('in', 'IN')
(GPE Nihonbashi/NNP)
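The chunker output mixes plain `(word, tag)` tuples with labelled subtrees, so a common next step is to collect only the labelled chunks. The sketch below assumes that named-entity chunks expose `label()` and `leaves()`, as `nltk.Tree` does. To keep it runnable on its own, it uses a small stand-in class `Chunk` (hypothetical, not part of NLTK) filled with the output printed above; with NLTK loaded you would pass the `chunked` object from the previous cell instead.

```python
class Chunk:
    """Stand-in for nltk.Tree that exposes only label() and leaves()."""

    def __init__(self, label, leaves):
        self._label = label
        self._leaves = list(leaves)

    def label(self):
        return self._label

    def leaves(self):
        return self._leaves


def extract_entities(chunked):
    """Return (entity_type, text) pairs for every labelled chunk."""
    entities = []
    for node in chunked:
        # Plain tokens are (word, tag) tuples and have no label() method.
        if hasattr(node, "label"):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((node.label(), text))
    return entities


# The output shown above, rebuilt by hand.
chunked_demo = [
    Chunk("PERSON", [("Barack", "NNP")]),
    Chunk("ORGANIZATION", [("Obama", "NNP")]),
    ("meets", "VBZ"),
    Chunk("PERSON", [("Michael", "NNP"), ("Jackson", "NNP")]),
    ("in", "IN"),
    Chunk("GPE", [("Nihonbashi", "NNP")]),
]
print(extract_entities(chunked_demo))
```

The result also makes the tagger's mistake easy to spot: "Barack" and "Obama" come back as two separate entities with different types.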
