# Quick tour of traditional NLP

Sources:

- Zheng and Casari (2016)
- Russell and Norvig (2016)
- Foundations of Statistical Natural Language Processing
- Natural Language Processing with Python: Analyzing Text with NLTK
- Speech and Language Processing

## Corpora, tokens and types

In [None]:
# A corpus is a text dataset: a collection of documents stored as raw text (ASCII or UTF-8)
corpora = ["machine learning is a useful tool",
          "ai and machine learning are related, one is a subset of the other",
          "youtube videos are amazing"]

In [None]:
# Tokens are the words and numeric sequences in a text. str.split() separates
# on whitespace only, so punctuation stays attached to the neighboring word
tokens = [corpus.split() for corpus in corpora]
tokens

![datasets](assets/datasets.png)
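
The heading also mentions types, which the cells above do not compute. A minimal sketch, assuming a type is simply a unique token and the vocabulary is the set of all types in the corpus. The `docs` list is a made-up example, not the notebook's `corpora` variable.

```python
# Hypothetical example documents (not the notebook's `corpora` variable).
docs = ["machine learning is a useful tool",
        "machine learning is a subset of ai"]

# Whitespace tokenization, one token list per document.
tokenized = [doc.split() for doc in docs]

# Flatten to count every token occurrence in the corpus.
all_tokens = [tok for doc_tokens in tokenized for tok in doc_tokens]

# Types are the unique tokens; together they form the vocabulary.
vocabulary = sorted(set(all_tokens))

print(f"{len(all_tokens)} tokens, {len(vocabulary)} types")
```

Repeated words such as "machine" count once per occurrence as tokens but only once as a type.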

## NLP preprocessing packages

- NLTK
- spaCy

In [3]:
import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut was removed in spaCy v3

In [4]:
text = "Mary, don't slap the green witch."
print([token.text for token in nlp(text.lower())])

['mary', ',', 'do', "n't", 'slap', 'the', 'green', 'witch', '.']


In [7]:
from nltk.tokenize import TweetTokenizer
tweet = "Snow White and the Seven Degrees #MakeAMovieCold@midnight:-)"
tokenizer = TweetTokenizer()
print(tokenizer.tokenize(tweet.lower()))

['snow', 'white', 'and', 'the', 'seven', 'degrees', '#makeamoviecold', '@midnight', ':-)']
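
Both library tokenizers separate punctuation from words, which plain `str.split()` does not. For comparison, here is a rough standard-library sketch using a regular expression. The `simple_tokenize` helper is hypothetical and far cruder than the rules spaCy or NLTK apply.

```python
import re

def simple_tokenize(text):
    """Lowercase text, then split it into runs of word characters
    and individual punctuation marks (whitespace is discarded)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = simple_tokenize("Mary, don't slap the green witch.")
print(tokens)
```

This pattern splits "don't" into `don`, `'`, `t`, whereas spaCy produces `do` and `n't`. Handling contractions, hashtags, and emoticons sensibly is the main reason to prefer a dedicated tokenizer.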


## Unigrams, Bigrams, Trigrams, ..., N-grams

In [51]:
def n_grams(text, n):
    """Take a list of tokens and return a list of n-grams."""
    return [text[i:i+n] for i in range(len(text) - n + 1)]

In [52]:
cleaned = [token.text for token in nlp(text.lower())]
n_grams(cleaned, 3)

[['mary', ',', 'do'],
 [',', 'do', "n't"],
 ['do', "n't", 'slap'],
 ["n't", 'slap', 'the'],
 ['slap', 'the', 'green'],
 ['the', 'green', 'witch'],
 ['green', 'witch', '.']]
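
N-grams are usually counted rather than just listed. A small sketch of that step, assuming each n-gram is converted to a tuple so it can serve as a `Counter` key (lists are not hashable). The example sentence is made up for illustration.

```python
from collections import Counter

def n_grams(tokens, n):
    """Return every run of n consecutive tokens as a list."""
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, tokenized on whitespace.
tokens = "the green witch slapped the green frog".split()

# Convert each bigram to a tuple before counting.
bigram_counts = Counter(tuple(gram) for gram in n_grams(tokens, 2))

print(bigram_counts.most_common(1))
```

If `n` exceeds the number of tokens, the range is empty and the function returns an empty list.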

## Lemmas and Stems

In [53]:
# Lemmatization reduces words to their dictionary forms (lemmas)
doc = nlp(u"he was running late")
for token in doc:
    print(f"{token} --> {token.lemma_}")

he --> -PRON-
was --> be
running --> run
late --> late


In [57]:
# Stemming strips word endings with heuristic rules, ignoring meaning, so a stem need not be a real word
from nltk.stem import PorterStemmer
pst = PorterStemmer()
doc = nlp(u"he was running late")
for token in doc:
    print(f"{token} --> {pst.stem(token.text)}")

he --> he
was --> wa
running --> run
late --> late
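
To make the contrast concrete, here is a toy suffix stripper. It is not the Porter algorithm, only a sketch of the general idea that a stemmer applies mechanical rules without consulting a dictionary. The `toy_stem` function and its suffix list are assumptions made for illustration.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word

for word in ["he", "was", "running", "late"]:
    print(f"{word} --> {toy_stem(word)}")
```

Like the Porter stemmer, this rule turns "was" into "wa". Unlike it, the toy version leaves "running" as "runn", because it has no rule for doubled consonants.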


## Categorizing Words: POS Tagging

In [58]:
doc = nlp(u"Mary slapped the green witch.")
for token in doc:
    print(f"{token} - {token.pos_}")

Mary - PROPN
slapped - VERB
the - DET
green - ADJ
witch - NOUN
. - PUNCT


## Categorizing Spans: Chunking and Named Entity Recognition

In [63]:
print("Noun Phrases:")
for chunk in doc.noun_chunks:
    print(f"{chunk} - {chunk.label_}")

Noun Phrases:
Mary - NP
the green witch - NP


In [68]:
doc = nlp(u"Mary was born in Chicken. Alaska, and studies at Harvard")
print("Named entities:")
for token in doc:
    print(f"{token} - {token.ent_type_}")

Named entities:
Mary - PERSON
was - 
born - 
in - 
Chicken - GPE
. - 
Alaska - GPE
, - 
and - 
studies - 
at - 
Harvard - ORG
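
The per-token labels above can be grouped into entity spans. A minimal standard-library sketch, assuming consecutive tokens with the same non-empty label belong to one entity. The `tagged` list is copied from the output above; spaCy also exposes entity spans directly through `doc.ents`.

```python
from itertools import groupby

# (token, entity label) pairs copied from the output above.
tagged = [("Mary", "PERSON"), ("was", ""), ("born", ""), ("in", ""),
          ("Chicken", "GPE"), (".", ""), ("Alaska", "GPE"), (",", ""),
          ("and", ""), ("studies", ""), ("at", ""), ("Harvard", "ORG")]

# Group runs of consecutive tokens that share a label, skipping unlabeled runs.
spans = [(" ".join(tok for tok, _ in group), label)
         for label, group in groupby(tagged, key=lambda pair: pair[1])
         if label]

for text, label in spans:
    print(f"{text} - {label}")
```

This simplification would merge two distinct adjacent entities that happen to share a label, so it is only a rough approximation of proper span boundaries.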
