<small><i>This notebook was put together by [Alexander Fridman](http://www.rocketscience.ai) and [Volha Hedranovich](http://www.rocketscience.ai) for the Lecture Course. Source and license info is on [GitHub](https://github.com/volhahedranovich/jupyter_lectures).</i></small>

In [15]:
import en_core_web_sm
import nltk
import pandas as pd
from pywsd.lesk import simple_lesk
from spacy import displacy

# Load the small English spaCy model
nlp = en_core_web_sm.load()

# Fetch the NLTK perceptron POS tagger data
nltk.download('averaged_perceptron_tagger')


# <div class="alert alert-block alert-danger">NLP Basic Tasks</div>

- Part-of-Speech Tagging

- Word-Sense Disambiguation

- Named Entity Recognition

- Language Identification

- Text Summarisation

- Sentiment Analysis

- Semantic Text Similarity

- Topic Modeling

- Authorship Identification

### <div class="alert alert-block alert-success">Part-of-Speech Tagging</div>
Part-of-speech (POS) tagging marks each word in a text with its part of speech, based on both its definition and its context, i.e. its relationship with adjacent and related words in a phrase. The example below reports the following spaCy token attributes:

**Text**: The original word text.

**Lemma**: The base form of the word.

**POS**: The simple part-of-speech tag.

**Tag**: The detailed part-of-speech tag.

**Dep**: Syntactic dependency, i.e. the relation between tokens.

**Shape**: The word shape – capitalisation, punctuation, digits.

**is alpha**: Does the token consist only of alphabetic characters?

**is stop**: Is the token part of a stop list, i.e. the most common words of the language?

In [52]:
doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

rows = []

for token in doc:
    rows.append([token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
                 token.shape_, token.is_alpha, token.is_stop])

cols = ['Text', 'Lemma', 'POS', 'Tag', 'Dep', 'Shape', 'Alpha', 'Stop']
print(pd.DataFrame(data=rows, columns=cols))
displacy.render(doc, style='dep', jupyter=True, options={'distance':90})

       Text    Lemma    POS  Tag       Dep  Shape  Alpha   Stop
0     Apple    apple  PROPN  NNP     nsubj  Xxxxx   True  False
1        is       be   VERB  VBZ       aux     xx   True   True
2   looking     look   VERB  VBG      ROOT   xxxx   True  False
3        at       at    ADP   IN      prep     xx   True   True
4    buying      buy   VERB  VBG     pcomp   xxxx   True  False
5      U.K.     u.k.  PROPN  NNP  compound   X.X.  False  False
6   startup  startup   NOUN   NN      dobj   xxxx   True  False
7       for      for    ADP   IN      prep    xxx   True   True
8         $        $    SYM    $  quantmod      $  False  False
9         1        1    NUM   CD  compound      d  False  False
10  billion  billion    NUM   CD      pobj   xxxx   True  False
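
The Shape column above can be approximated in a few lines of plain Python. The sketch below is an assumption about how such a feature could be computed, not spaCy's actual implementation: `word_shape` is a hypothetical helper that maps uppercase letters to `X`, lowercase letters to `x` and digits to `d`, keeps other characters unchanged, and truncates runs of the same shape character after four, which is consistent with the rows in the table.

```python
def word_shape(text, max_run=4):
    """Approximate a spaCy-like word shape (illustrative helper, not spaCy's code)."""
    shape = []
    run_char, run_len = None, 0
    for ch in text:
        if ch.isalpha():
            mapped = 'X' if ch.isupper() else 'x'
        elif ch.isdigit():
            mapped = 'd'
        else:
            mapped = ch  # punctuation and symbols are kept as-is
        # Track how long the current run of identical shape characters is
        if mapped == run_char:
            run_len += 1
        else:
            run_char, run_len = mapped, 1
        # Keep at most max_run characters of each run
        if run_len <= max_run:
            shape.append(mapped)
    return ''.join(shape)


for word in ['Apple', 'looking', 'U.K.', '$', '1']:
    print(word, word_shape(word))
```

Truncating long runs keeps the number of distinct shapes small, so words such as "looking" and "startup" share the same shape.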


### <div class="alert alert-block alert-success">Word-Sense Disambiguation</div>

Word-sense disambiguation (WSD) identifies which sense (i.e. meaning) of a word is used in a sentence when the word has multiple meanings.

For example, the word "bass" can mean:
- a type of fish
- tones of low frequency
- a type of instrument

<img src="img/WSD.png" alt="Word-sense disambiguation" title="Word-sense disambiguation" />
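
One classic way to approach this task is the Lesk idea of picking the sense whose dictionary gloss shares the most words with the sentence. The following is a minimal sketch of that idea only: the `TOY_SENSES` glosses, the `STOP_WORDS` set and the `content_words` and `toy_lesk` functions are made up for illustration and are far simpler than what `pywsd` and WordNet provide.

```python
# Hypothetical mini sense inventory; real systems use WordNet glosses.
TOY_SENSES = {
    'bass': {
        'fish': 'a fish that lives in the sea or in fresh water and is good to eat',
        'tone': 'the lowest tones or notes in music',
        'instrument': 'an instrument with strings that plays the lowest notes',
    }
}

STOP_WORDS = {'a', 'an', 'the', 'in', 'of', 'or', 'and', 'to', 'is', 'that', 'with', 'i'}


def content_words(text):
    """Lowercase, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip('.,;:!?').lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}


def toy_lesk(sentence, word):
    """Return the sense label whose gloss overlaps most with the sentence."""
    context = content_words(sentence) - {word}
    scores = {
        label: len(context & content_words(gloss))
        for label, gloss in TOY_SENSES[word].items()
    }
    # On a tie, max() returns the first sense in dictionary order
    return max(scores, key=scores.get)


print(toy_lesk('I want to eat a bass', 'bass'))
print(toy_lesk('The bass line is the lowest part of the music', 'bass'))
```

Because the score is plain word overlap, the result depends heavily on how the glosses are worded, so short or unusual sentences can easily end up with the wrong sense.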



In [89]:
# sent = 'I want to eat a bass'
sent = 'Musically the music ranges from growly bass to quiet high strings, with percussion and brass standing out also'
# sent = 'I work at the plant'
# sent = 'Eventually, it melts to supply water and nutrients to plants and aquatic organisms'
ambiguous = 'bass'
answer = simple_lesk(sent, ambiguous, pos='n')
print(answer)
print(answer.definition())

Synset('sea_bass.n.01')
the lean flesh of a saltwater fish of the family Serranidae

Note that `simple_lesk` selects a fish-related sense here even though the sentence is about music, which shows that this method can pick the wrong sense.


In [75]:
from pywsd import disambiguate
from pywsd.similarity import max_similarity as maxsim
disambiguate('bank', algorithm=maxsim, similarity_option='wup', keepLemmas=True)

[('bank', 'bank', None)]

### <div class="alert alert-block alert-success">Named Entity Recognition</div>
A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. Because models are statistical and strongly depend on the examples they were trained on, this doesn't always work perfectly and might need some tuning later, depending on your use case.

**Text**: The original entity text.

**Start**: Character offset of the start of the entity in the Doc.

**End**: Character offset of the end of the entity in the Doc.

**Label**: Entity label, i.e. type.

In [54]:
rows = []

for ent in doc.ents:
    rows.append([ent.text, ent.start_char, ent.end_char, ent.label_])
    
cols = ['Text', 'Start', 'End', 'Label']
print(pd.DataFrame(data=rows, columns=cols))
displacy.render(doc, style='ent', jupyter=True)

         Text  Start  End  Label
0       Apple      0    5    ORG
1        U.K.     27   31    GPE
2  $1 billion     44   54  MONEY
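
The Start and End values in this table are character offsets into the original text, with End exclusive, so slicing the raw string recovers each entity. A small check in plain Python, with the offsets copied from the table above (no spaCy needed):

```python
text = 'Apple is looking at buying U.K. startup for $1 billion'

# (start, end, label) triples copied from the output table above
entities = [(0, 5, 'ORG'), (27, 31, 'GPE'), (44, 54, 'MONEY')]

# Slice the raw text with each pair of offsets
spans = [(text[start:end], label) for start, end, label in entities]
for span_text, label in spans:
    print(f'{span_text!r:15} {label}')
```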


### <div class="alert alert-block alert-success">Language Identification</div>

### <div class="alert alert-block alert-success">Text Summarisation</div>

### <div class="alert alert-block alert-success">Sentiment Analysis</div>

### <div class="alert alert-block alert-success">Semantic Text Similarity</div>

### <div class="alert alert-block alert-success">Topic Modeling</div>

### <div class="alert alert-block alert-success">Authorship Identification</div>

# <div class="alert alert-block alert-danger">Key Python packages for NLP</div>

### <div class="alert alert-block alert-info">NLTK</div>
A general-purpose toolkit that covers most text processing tasks.



*More Info:*

*[Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper](http://www.nltk.org/book/)*

*[Python 3 Text Processing with NLTK 3 Cookbook, by Jacob Perkins](https://www.packtpub.com/application-development/python-3-text-processing-nltk-3-cookbook)*

### <div class="alert alert-block alert-info">spaCy</div>
When you call **nlp** on a text, `spaCy` first **tokenizes** the text to produce a Doc object. The Doc is then processed in several different steps – this is also referred to as the processing pipeline. The **pipeline used by the default** models consists of a **tagger**, a **parser** and an **entity recognizer**. Each pipeline component returns the processed Doc, which is then passed on to the next component.

<img src="img/spacy_pipeline.png" alt="spaCy NLP pipeline" title="spaCy NLP pipeline" />
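
The idea of a pipeline, where each component receives the document, annotates it and hands it on, can be sketched without spaCy. The code below is a toy model only: `ToyDoc`, `toy_tokenizer`, `toy_tagger`, `toy_entity_recognizer` and `run_pipeline` are hypothetical names with deliberately naive rules, the parser step is left out for brevity, and none of it reflects how spaCy's components work internally.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    """Container that pipeline components annotate in turn."""
    text: str
    tokens: list = field(default_factory=list)
    tags: list = field(default_factory=list)
    ents: list = field(default_factory=list)


def toy_tokenizer(text):
    """Create the doc from raw text (naive whitespace split)."""
    return ToyDoc(text=text, tokens=text.split())


def toy_tagger(doc):
    """Naive rule: capitalised tokens are PROPN, everything else is X."""
    doc.tags = ['PROPN' if tok[:1].isupper() else 'X' for tok in doc.tokens]
    return doc


def toy_entity_recognizer(doc):
    """Naive rule: every PROPN token is reported as an entity."""
    doc.ents = [tok for tok, tag in zip(doc.tokens, doc.tags) if tag == 'PROPN']
    return doc


def run_pipeline(text, components):
    """Tokenize first, then pass the doc through each component in order."""
    doc = toy_tokenizer(text)
    for component in components:
        doc = component(doc)
    return doc


toy_doc = run_pipeline('Apple is looking at buying U.K. startup',
                       [toy_tagger, toy_entity_recognizer])
print(toy_doc.tokens)
print(toy_doc.tags)
print(toy_doc.ents)
```

In this sketch the order of components matters: the entity step reads the tags written by the tagger, so swapping the two would leave `ents` empty.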

*[More info](https://spacy.io/usage/linguistic-features)*