## Word and Sentence transformation

### Basics of NLP and NLU

- Basic preprocessing tasks:
    - tokenization
    - lemmatization
    - POS tagging
    - syntactic parsing

- More semantic tasks:
    - summarization
    - question answering
    - information extraction (e.g. NER tagging)
    - relation extraction 
    - chatbots
    - machine translation
    - ...

In [23]:
import spacy
from spacy import displacy

# Load the small English model
nlp = spacy.load('en_core_web_sm')

## Basic preprocessing tasks, text normalization

<h3 id="Tokenization">Tokenization</h3>


In [24]:
sens = "Muffins cost $3.88 in New York. Please buy me two as I can't go." \
" They'll taste good. I'm going to Finland's capital to hear about state-of-the-art solutions in NLP."

print(sens.split())

print(len(sens.split()))


['Muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'as', 'I', "can't", 'go.', "They'll", 'taste', 'good.', "I'm", 'going', 'to', "Finland's", 'capital', 'to', 'hear', 'about', 'state-of-the-art', 'solutions', 'in', 'NLP.']
29
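Whitespace splitting leaves punctuation attached to words (`'York.'`, `'go.'`, `'NLP.'`). A minimal sketch of a rule-based alternative is shown below. `TOKEN_PATTERN` and `simple_tokenize` are hypothetical names introduced here for illustration, and the pattern is an assumption, not the rule set that spaCy uses.

```python
import re

# Hypothetical pattern for illustration: a number with an optional decimal
# part, a run of word characters, or any single non-space symbol.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into numbers, words, and single punctuation marks."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("Muffins cost $3.88 in New York.")
print(tokens)
# ['Muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.']
```

A pattern this simple still mishandles contractions such as `can't`, which is one reason to prefer a tokenizer with language-specific rules.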


In [25]:
sens = "Muffins cost $3.88 in New York. Please buy me two as I can't go." \
" They'll taste good. I'm going to Finland's capital to hear about state-of-the-art solutions in NLP."

doc = nlp(sens)

tokens = [token.text for token in doc]
print(tokens)

['Muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.', 'Please', 'buy', 'me', 'two', 'as', 'I', 'ca', "n't", 'go', '.', 'They', "'ll", 'taste', 'good', '.', 'I', "'m", 'going', 'to', 'Finland', "'s", 'capital', 'to', 'hear', 'about', 'state', '-', 'of', '-', 'the', '-', 'art', 'solutions', 'in', 'NLP', '.']


In [26]:
for sen in doc.sents:
    print(sen)

Muffins cost $3.88 in New York.
Please buy me two as I can't go.
They'll taste good.
I'm going to Finland's capital to hear about state-of-the-art solutions in NLP.


In [27]:
for token in doc:
    print(token.text, token.is_alpha, token.is_stop)

Muffins True False
cost True False
$ False False
3.88 False False
in True True
New True False
York True False
. False False
Please True True
buy True False
me True True
two True True
as True True
I True True
ca True True
n't False True
go True True
. False False
They True True
'll False True
taste True False
good True False
. False False
I True True
'm False True
going True False
to True True
Finland True False
's False True
capital True False
to True True
hear True False
about True True
state True False
- False False
of True True
- False False
the True True
- False False
art True False
solutions True False
in True True
NLP True False
. False False


### Lemmatization, stemming

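- The goal of lemmatization is to find the dictionary form of a word, called its "lemma"
- _dogs_ -> _dog_, _went_ -> _go_
- Ambiguity plays a role: _saw_ is either the past tense of _see_ or the noun _saw_
- The POS tag is needed to disambiguate
- Stemming instead strips affixes with heuristic rules, so the result need not be a dictionary word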
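Why the POS tag matters can be sketched with a toy lookup table. `LEMMA_TABLE` and `lemmatize` are hypothetical names, and the entries are hand-written for illustration; a real lemmatizer relies on much larger resources and rules.

```python
# Toy lemma table keyed by (lowercased word, POS tag).
# The entries are hand-written assumptions, for illustration only.
LEMMA_TABLE = {
    ("saw", "VERB"): "see",
    ("saw", "NOUN"): "saw",
    ("dogs", "NOUN"): "dog",
    ("went", "VERB"): "go",
}


def lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


print(lemmatize("saw", "VERB"))  # see
print(lemmatize("saw", "NOUN"))  # saw
```

The same surface form maps to different lemmas, so the lookup cannot be keyed on the word alone.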

In [28]:
doc = nlp("I saw two dogs yesterday.")

lemmata = [token.lemma_ for token in doc]
print(lemmata)

['I', 'see', 'two', 'dog', 'yesterday', '.']


### POS tagging

- Words can be grouped into grammatical categories.
- These categories are called the part-of-speech (POS) tags of the words.
- Words belonging to the same category are largely interchangeable within a sentence.
- Ambiguity: _guard_ can be a noun or a verb.


In [29]:
doc = nlp("The white dog went to play football yesterday.")

[token.pos_ for token in doc]

['DET', 'ADJ', 'NOUN', 'VERB', 'PART', 'VERB', 'NOUN', 'NOUN', 'PUNCT']

## Advanced tasks

### Syntactic parsing


-  *Colorless green ideas sleep furiously.* 

- *Furiously sleep ideas green colorless.*

Chomsky (1956)


Two types of grammar:
- Phrase structure grammar
- __Dependency grammar__


### Universal Dependency Parsing
- Started and standardized in the [UD](http://universaldependencies.org/) project.
- The dependency types are language-independent.
- The annotations aim to be consistent across 70+ languages.
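A dependency analysis can be stored as one head index and one relation label per token. The sketch below uses a hand-annotated analysis of the Chomsky sentence with UD relation names; it is not parser output, and `dependents` is a hypothetical helper.

```python
# Hand-annotated dependency analysis (an assumption, not parser output).
# Each entry: (token, index of its head, UD relation); the root's head is -1.
sentence = [
    ("Colorless", 2, "amod"),
    ("green", 2, "amod"),
    ("ideas", 3, "nsubj"),
    ("sleep", -1, "root"),
    ("furiously", 3, "advmod"),
]


def dependents(tree, head_index):
    """Return the tokens whose head is the token at head_index."""
    return [token for token, head, _ in tree if head == head_index]


for token, head, relation in sentence:
    head_text = "ROOT" if head == -1 else sentence[head][0]
    print(f"{token} --{relation}--> {head_text}")

print(dependents(sentence, 3))  # ['ideas', 'furiously']
```

Every token has exactly one head, which is what makes the structure a tree rooted at the main verb.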

In [30]:
doc = nlp("Colorless green ideas sleep furiously")
displacy.render(doc, style='dep', jupyter=True, options={'distance': 100})

### Named entity recognition

- Identify the named entities present in the text

In [31]:
sens = "Muffins cost $3.88 in New York. Please buy me two as I can't go." \
" They'll taste good. I'm going to Finland's capital to hear about state-of-the-art solutions in NLP."

doc = nlp(sens)
for ent in doc.ents:
    print(ent)

    
displacy.render(doc, style='ent', jupyter=True)

3.88
New York
two
Finland
NLP


### Language modelling

- One of the most important tasks in NLP
- The goal is to compute the "probability" of a sentence
- Can be used in:
    - Machine Translation
    - Text generation
    - Correcting spelling
    - Word vectors?
- P(the quick brown __fox__) > P(the quick brown __stick__)
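The comparison above can be reproduced with a count-based bigram model. This is a minimal sketch, assuming a tiny hand-made corpus and add-one smoothing; `sequence_probability` is a hypothetical name, and neural models such as GPT-2 estimate these probabilities very differently.

```python
from collections import Counter

# Tiny hand-made corpus, for illustration only.
corpus = [
    "the quick brown fox jumps",
    "the quick brown fox runs",
    "the lazy brown dog sleeps",
    "the dog fetched the stick",
]

contexts = Counter()  # how often each word appears as a left context
bigrams = Counter()   # how often each (previous, current) pair appears
vocab = set()         # every word that can be predicted, including </s>
for line in corpus:
    words = ["<s>"] + line.split() + ["</s>"]
    contexts.update(words[:-1])
    bigrams.update(zip(words, words[1:]))
    vocab.update(words[1:])


def sequence_probability(text):
    """Chain-rule probability of a sentence prefix, with add-one smoothing."""
    words = ["<s>"] + text.split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
    return prob


p_fox = sequence_probability("the quick brown fox")
p_stick = sequence_probability("the quick brown stick")
print(p_fox > p_stick)  # True
```

Smoothing keeps the unseen pair `brown stick` from getting probability zero, while the pair seen in the corpus still scores higher.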

In [32]:
#!pip install transformers
from transformers import pipeline

text_generator = pipeline("text-generation")
print(text_generator("The quick brown ", max_length=10, do_sample=False))

No model was supplied, defaulted to gpt2 and revision 6c0e608 (https://huggingface.co/gpt2).
Using a pipeline without specifying a model name and revision in production is not recommended.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'The quick brown iced tea is a great way'}]


## Semantic tasks

### Summarization

In [33]:
summarizer = pipeline("summarization")
summarizer("Deep learning is used almost exclusively in a Linux environment.\
You need to be comfortable using the command line if you are serious about deep learning and NLP.\
    Most NLP and deep learning libraries have better support for Linux and MacOS than Windows. \
    Most papers nowadays release the source code for their experiments with Linux support only.",
           min_length=5)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.
Your max_length is set to 142, but your input_length is only 73. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=36)


[{'summary_text': ' Deep learning is used almost exclusively in a Linux environment . Most NLP and deep learning libraries have better support for Linux and MacOS than Windows .'}]

### Sentiment Analysis
- In the simplest case, decide whether a text is negative or positive.
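As a point of comparison for this simplest case, a word-counting baseline can be sketched in a few lines. The word lists and the name `lexicon_sentiment` are hypothetical; this is not how the transformer-based pipeline works.

```python
# Hypothetical mini-lexicons, for illustration only.
POSITIVE_WORDS = {"cool", "good", "great", "recommend"}
NEGATIVE_WORDS = {"bad", "boring", "awful", "terrible"}


def lexicon_sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    words = [word.strip(".,!?").lower() for word in text.split()]
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    if positive > negative:
        return "POSITIVE"
    if negative > positive:
        return "NEGATIVE"
    return "NEUTRAL"


label = lexicon_sentiment(
    "This class is really cool! I would recommend this to anyone!"
)
print(label)  # POSITIVE
```

A baseline like this ignores negation and context ("not good"), which is where learned classifiers help.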

In [34]:
sentiment = pipeline("sentiment-analysis")
sentiment(['This class is really cool! I would recommend this to anyone!'])

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


[{'label': 'POSITIVE', 'score': 0.9998173117637634}]

### Question Answering

- Given a context and a question, choose the right answer
- Can be extractive (the answer is a span of the context) or abstractive (the answer is newly generated text)
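In the extractive setting the answer is fully described by character offsets into the context. A small sketch, with the offsets copied from the pipeline output shown in this section:

```python
context = "Adam went to the store yesterday."

# Shape of an extractive QA result: character offsets into the context.
# The values are copied from the pipeline output in this section.
result = {"start": 0, "end": 4, "answer": "Adam"}

# An extractive answer can be recovered by slicing the context.
span = context[result["start"]:result["end"]]
print(span)  # Adam
```

An abstractive system would instead generate the answer text, so no such offsets need to exist.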

In [35]:
question_answerer = pipeline('question-answering')
question_answerer({
    'question': 'Who went to the store ?',
    'context': 'Adam went to the store yesterday.'})

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.


{'score': 0.9957693219184875, 'start': 0, 'end': 4, 'answer': 'Adam'}