**Corpus, Tokens and N-Grams**

Corpus > Documents > Paragraphs > Sentences > Tokens


* **Corpus**: Collection of text documents
* **Tokens**: Smaller units of a text (words, phrases, ngrams)
* **N-Grams**: Sequences of N consecutive words or characters


* **Unigrams**: I, love, my, phone
* **Bigrams**: I love, love my, my phone
* **Trigrams**: I love my, love my phone
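
The sliding window behind these lists can be sketched in plain Python. This is only an illustration, not how NLTK implements `ngrams`; the helper name `make_ngrams` is made up for this example.

```python
def make_ngrams(items, n):
    """Return every run of n consecutive items as a tuple."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


tokens = "I love my phone".split()

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)
# A string is a sequence of characters, so the same helper gives character n-grams
char_trigrams = make_ngrams("phone", 3)

print(bigrams)
print(trigrams)
print(char_trigrams)
```

A sequence shorter than `n` simply yields an empty list, so no special case is needed for very short sentences.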

In [17]:
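import nltk
nltk.download('punkt')    # tokenizer models used by word_tokenize
nltk.download('wordnet')  # WordNet corpus used by wordnet.synsets

from nltk import ngrams
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize

# Look up every WordNet sense (synset) of the word "good"
print(wordnet.synsets('good'))

sentence = "\nI love to play football"
print(sentence)

# Slide a window of n tokens over the sentence to build the n-grams
n = 2
for gram in ngrams(word_tokenize(sentence), n):
    print(gram)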

[Synset('good.n.01'), Synset('good.n.02'), Synset('good.n.03'), Synset('commodity.n.01'), Synset('good.a.01'), Synset('full.s.06'), Synset('good.a.03'), Synset('estimable.s.02'), Synset('beneficial.s.01'), Synset('good.s.06'), Synset('good.s.07'), Synset('adept.s.01'), Synset('good.s.09'), Synset('dear.s.02'), Synset('dependable.s.04'), Synset('good.s.12'), Synset('good.s.13'), Synset('effective.s.04'), Synset('good.s.15'), Synset('good.s.16'), Synset('good.s.17'), Synset('good.s.18'), Synset('good.s.19'), Synset('good.s.20'), Synset('good.s.21'), Synset('well.r.01'), Synset('thoroughly.r.02')]

I love to play football
('I', 'love')
('love', 'to')
('to', 'play')
('play', 'football')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Hikari\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


**Tokenization**

Process of splitting a text object into smaller units (tokens)

Smaller units: words, numbers, symbols, n-grams, characters

Whitespace tokenizer / unigram tokenizer

* **Sentence:** "I went to New-York to play football"
* **Tokens:** "I", "went", "to", "New-York", "to", "play", "football"

Regular expression tokenizer

* **Sentence:** "Football, Cricket; Golf Tennis"
* **Tokens:** "Football", "Cricket", "Golf", "Tennis"
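
Both tokenizers can be approximated with the standard library alone. The sketch below is a simplified stand-in, not NLTK's tokenizers; the function names and the letters-only pattern are assumptions chosen for this example.

```python
import re


def whitespace_tokenize(text):
    # str.split() with no arguments splits on any run of whitespace
    return text.split()


def regex_tokenize(text, pattern=r"[A-Za-z]+"):
    # Keep every maximal run of letters; punctuation and spaces are dropped
    return re.findall(pattern, text)


ws_tokens = whitespace_tokenize("I went to New-York to play football")
re_tokens = regex_tokenize("Football, Cricket; Golf Tennis")

print(ws_tokens)
print(re_tokens)
```

The choice of pattern matters: with the letters-only pattern, "New-York" would be split into "New" and "York", whereas the whitespace tokenizer keeps it as one token.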

In [5]:
import nltk
nltk.download('punkt')

from nltk.tokenize import sent_tokenize,  word_tokenize
text = 'Hi John, How are you doing? I will be travelling to your city. Lets catchup'

print(sent_tokenize(text))
print(word_tokenize(text))

['Hi John, How are you doing?', 'I will be travelling to your city.', 'Lets catchup']
['Hi', 'John', ',', 'How', 'are', 'you', 'doing', '?', 'I', 'will', 'be', 'travelling', 'to', 'your', 'city', '.', 'Lets', 'catchup']




**Normalization**

**Morpheme:** smallest meaningful unit of a word; here, the base form that remains once prefixes and suffixes are removed

**Structure of token**: \<prefix\>\<morpheme\>\<suffix\>

* **Example:** Antinationalist = Anti + national + ist

**Normalization:** Process of converting a token into its base form (morpheme)

Helps reduce data dimensionality and clean text

**Types:** Stemming and Lemmatization

In [6]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Hikari\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

**Normalization: Stemming** 

Elementary rule-based process that removes inflectional forms from a token

Outputs the stem of a word

"laughing", "laughed", "laughs", "laugh" >>> "laugh"

| **Form** | **Suffix** | **Stem** |
|----------|------------|----------|
| studies  | es         | studi    |
| studying | ing        | study    |


* **Example:**
* Input: his teams are not winning
* Stemmed: hi team are not winn
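
The rule-based idea can be sketched as plain suffix stripping. This toy stemmer is not the Porter algorithm used by NLTK; the suffix list, the minimum stem length, and the name `toy_stem` are assumptions picked so that the examples above are reproduced.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # tried in this order, first match wins


def toy_stem(word, min_stem_len=2):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    for suffix in SUFFIXES:
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word


sentence_stems = [toy_stem(w) for w in "his teams are not winning".split()]
laugh_stems = {toy_stem(w) for w in ["laughing", "laughed", "laughs", "laugh"]}

print(" ".join(sentence_stems))
print(toy_stem("studies"), toy_stem("studying"))
print(laugh_stems)
```

Because the rules know nothing about vocabulary, "his" loses its final "s" just like a plural would, which is why stems are not always valid words.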

In [7]:
import nltk
nltk.download('wordnet')

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()   

print(stemmer.stem("player"))
print(stemmer.stem("playing"))
print(stemmer.stem("plays"))
print(stemmer.stem("increases"))


player
play
play
increas




**Normalization: Lemmatization**

Systematic process for reducing a token to its lemma

Makes use of vocabulary, word structure, part of speech tags and grammar relations

**Example:** 

* am, are, is >> be
* running, ran, run, runs >> run

The lemma depends on the part of speech:

* **verb:** running >> run
* **noun:** running >> running
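
Why the part of speech matters can be sketched with a lookup keyed by (word, POS). This is a toy table, not WordNet; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the noun default is chosen to mirror the default `pos` of NLTK's `WordNetLemmatizer.lemmatize`.

```python
# Hand-written (word, part of speech) -> lemma entries, for illustration only
LEMMA_TABLE = {
    ("am", "v"): "be",
    ("are", "v"): "be",
    ("is", "v"): "be",
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("runs", "v"): "run",
    ("running", "n"): "running",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos); unknown words come back lowercased."""
    key = (word.lower(), pos)
    return LEMMA_TABLE.get(key, word.lower())


print(toy_lemmatize("Running", pos="v"))
print(toy_lemmatize("Running", pos="n"))
print(toy_lemmatize("is", pos="v"))
```

Unlike a stemmer, the lookup always returns a real word, but it only knows the entries it was given, which is the role the vocabulary plays in a real lemmatizer.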

**Example: the "multiple" word family**

* multiplications >> multiplication >> multiplicate >> multiple
* multiplicatively >> multiplicative >> multiplicate
* multiplicably >> multiplicable >> multiplicate >> multiple
* multiples >> multiple
* multiplied >> multiply >> multiple
* multipliers >> multiplier >> multiply >> multiple
* multiplies >> multiply >> multiple
* multiplying >> multiply >> multiple
* multipliably >> multipliable >> multiply >> multiple

In [8]:
import nltk
nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer
lemm = WordNetLemmatizer()

print(lemm.lemmatize("increases"))
print(lemm.lemmatize("running", pos="v"))

increase
run




**Part of speech tags**

Defines the syntactic context and role of words in the sentence

**Common POS Tags:** Noun, Verb, Adjectives, Adverbs

**Sentence:** David has purchased a new laptop from Apple store

* {David, NNP}
* {has, VBZ}
* {purchased, VBN}
* {a, DT}
* {new, JJ}
* {laptop, NN}
* {from, IN}
* {Apple, NNP}
* {store, NN}

A word's tag is determined by its relationship with the adjacent words

**English tag set**

| **Tag** | **Description**                                            | **Example**                                                     |
|---------|------------------------------------------------------------|-----------------------------------------------------------------|
| DT      | Determiner                                                 | the, a, an, this, that, these, those                            |
| QT      | Quantifier                                                 | some, any, many, few, several, enough                           |
| CD      | Cardinal number                                            | one, two, three, four, five, ...                                |
| NN      | Noun, singular                                             | book, table, dog, city                                          |
| NNS     | Noun, plural                                               | books, tables, dogs, cities                                     |
| NNP     | Proper noun, singular                                      | John, Mary, New York                                            |
| NNPS    | Proper noun, plural                                        | United States, Beatles                                          |
| EX      | Existential there                                          | There was a party.                                              |
| PRP     | Personal pronoun                                           | I, you, he, she, it, we, they, me, <br/>you, him, her, us, them |
| PRP$    | Possessive pronoun                                         | my, your, his, her, <br/>its, our, their                        |
| POS     | Possessive ending                                          | 's, '                                                           |
| RBS     | Adverb, superlative                                        | most, best, worst                                               |
| RBR     | Adverb, comparative                                        | more, better, worse                                             |
| RB      | Adverb                                                     | quickly, happily, very, often                                   |
| JJS     | Adjective, superlative                                     | biggest, best, worst                                            |
| JJR     | Adjective, comparative                                     | bigger, better, worse                                           |
| JJ      | Adjective                                                  | big, good, bad                                                  |
| MD      | Modal                                                      | can, could, may, might, <br/>must, should, will, would          |
| VB      | Verb, base form                                            | go, eat, play                                                   |
| VBP     | Verb, present tense, <br/>other than third person singular | go, eat, play                                                   |
| VBZ     | Verb, present tense, <br/>third person singular            | goes, eats, plays                                               |
| VBD     | Verb, past tense                                           | went, ate, played                                               |
| VBN     | Verb, past participle                                      | gone, eaten, played                                             |
| VBG     | Verb, gerund or present participle                         | going, eating, playing                                          |
| WDT     | Wh-determiner                                              | which, that, what                                               |

| **Tag**  | **Description**  | **Example** |
|----------|------------------|-------------|
| WP       | Wh-pronoun        | who, whom, which, what |
| WP$      | Possessive wh-pronoun | whose |
| WRB      | Wh-adverb         | when, where, why, how |
| TO       | The preposition to | to go, to eat, to play |
| IN       | Preposition or <br/>subordinating conjunction | in, on, at, for, because, since |
| CC       | Coordinating <br/>conjunction | and, but, or, nor, for, yet, so |
| UH       | Interjection      | oh, wow, ouch |
| RP       | Particle          | up, down, out, off |
| SYM      | Symbol            | +, =, < |
| \$       | Currency sign     | \$ |
| ''       | Double or single<br/> quotation marks | "hello", 'world' |
| (        | Opening parenthesis,<br/> bracket,<br/> angle bracket, or brace | ( |
| )        | Closing parenthesis,<br/> bracket,<br/> angle bracket, or brace | ) |
| ,        | Comma             | , |
| .        | End of sentence punctuation | . ! ? |
| :        | Mid-sentence punctuation | : ; ... -- - |

In [9]:
import nltk
nltk.download('averaged_perceptron_tagger')

from nltk import pos_tag
from nltk.tokenize import word_tokenize  # needed for the word_tokenize call below

text = 'Hi John, How are you doing? I will be travelling to your city. Lets catchup'

tokens = word_tokenize(text)
pos_tags = pos_tag(tokens)
print(pos_tags)

[('Hi', 'NNP'), ('John', 'NNP'), (',', ','), ('How', 'NNP'), ('are', 'VBP'), ('you', 'PRP'), ('doing', 'VBG'), ('?', '.'), ('I', 'PRP'), ('will', 'MD'), ('be', 'VB'), ('travelling', 'VBG'), ('to', 'TO'), ('your', 'PRP$'), ('city', 'NN'), ('.', '.'), ('Lets', 'VBZ'), ('catchup', 'JJ')]


[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Hikari\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
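
Once tokens are tagged, the (token, tag) pairs can be filtered by tag prefix, since all noun tags start with `NN` and all verb tags with `VB`. The sketch below reuses the tagger output printed above as literal data so it runs without NLTK; the helper name `words_with_tag_prefix` is an assumption for this example.

```python
# (token, tag) pairs copied from the tagger output above
tagged = [
    ('Hi', 'NNP'), ('John', 'NNP'), (',', ','), ('How', 'NNP'), ('are', 'VBP'),
    ('you', 'PRP'), ('doing', 'VBG'), ('?', '.'), ('I', 'PRP'), ('will', 'MD'),
    ('be', 'VB'), ('travelling', 'VBG'), ('to', 'TO'), ('your', 'PRP$'),
    ('city', 'NN'), ('.', '.'), ('Lets', 'VBZ'), ('catchup', 'JJ'),
]


def words_with_tag_prefix(pairs, prefix):
    """Return the tokens whose tag starts with the given prefix."""
    return [word for word, tag in pairs if tag.startswith(prefix)]


nouns = words_with_tag_prefix(tagged, "NN")  # NN, NNS, NNP, NNPS
verbs = words_with_tag_prefix(tagged, "VB")  # VB, VBP, VBZ, VBD, VBN, VBG

print(nouns)
print(verbs)
```

The result inherits any tagging mistakes: here "Hi" and "How" were tagged as proper nouns, so they show up in the noun list.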


**Constituency Grammar**

**Constituents:** Words / phrases / group of words

**Constituency Grammar:** Organize any sentence into constituents using their properties

**Properties:** Part of Speech Tags / Noun Phrases / Verb Phrases

**Sentence:** \<subject\>\<context\>\<object\>

* **\<subject\>** The cats / The dogs / They
* **\<context\>** are running / are barking / are eating
* **\<object\>** in the park / happily / since the morning
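
Because each slot is a constituent, any filler can be swapped for another of the same kind and the sentence stays well formed. A minimal sketch of that substitution, using only the fillers listed above (the variable names are assumptions for this example):

```python
from itertools import product

subjects = ["The cats", "The dogs", "They"]
contexts = ["are running", "are barking", "are eating"]
objects_ = ["in the park", "happily", "since the morning"]

# Every subject/context/object combination gives one sentence
sentences = [" ".join(parts) for parts in product(subjects, contexts, objects_)]

print(len(sentences))
print(sentences[0])
```

Three fillers per slot give 3 x 3 x 3 = 27 sentences, including "The dogs are barking in the park", which is the one parsed in the next cell.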

**Another view (using part of speech)**

\<DT NNS\> \<VBP VBG\> \<IN DT NN\> ---> The dogs are barking in the park

In [10]:
import nltk
from nltk import CFG


nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')


grammar = CFG.fromstring("""
  S -> NP VP
  NP -> DT NN | DT NNS | PRP
  VP -> VBZ VP | VBG PP | VBG Adv
  PP -> IN NP
  Adv -> RB
  DT -> 'The' | 'the'
  NN -> 'dog' | 'cat' | 'park'
  NNS -> 'dogs' | 'cats'
  PRP -> 'They'
  VBZ -> 'are'
  VBG -> 'running' | 'barking' | 'eating'
  IN -> 'in'
  RB -> 'happily'
""")


parser = nltk.ChartParser(grammar)


sentence = "The dogs are barking in the park".split()


for tree in parser.parse(sentence):
    #print(tree)
    tree.pretty_print()



                    S                        
      ______________|_____                    
     |                    VP                 
     |         ___________|___                
     |        |               VP             
     |        |      _________|___            
     |        |     |             PP         
     |        |     |      _______|___        
     NP       |     |     |           NP     
  ___|___     |     |     |        ___|___    
 DT     NNS  VBZ   VBG    IN      DT      NN 
 |       |    |     |     |       |       |   
The     dogs are barking  in     the     park





**Dependency Grammar**

Describes which other word each word of a sentence depends on (its dependencies)

* **Example:** Modifiers (barking dog)

Organizes the words of a sentence according to their dependencies

All the words are directly or indirectly linked to the root

These dependencies represent relationships among the words in a sentence

**Sentence:** User is the largest community of data scientists and provides best resources for understanding data and analytics

* **nsubj:** {User, NNP} and {community, NN}
* **cop:** {is, VBZ} and {community, NN}
* **det:** {the, DT} and {community, NN}
* **amod:** {largest, JJS} and {community, NN}
* **nmod:** {community, NN} and {scientists, NNS}
* **case:** {of, IN} and {scientists, NNS}
* **compound:** {data, NNS} and {scientists, NNS}
* **cc:** {community, NN} and {and, CC}
* **conj:** {community, NN} and {provides, VBZ}
* **dobj:** {provides, VBZ} and {resources, NNS}
* **amod:** {best, JJS} and {resources, NNS}
* **acl:** {resources, NNS} and {understanding, VBG}
* **mark:** {for, IN} and {understanding, VBG}
* **dobj:** {understanding, VBG} and {data, NNS}
* **cc:** {data, NNS} and {and, CC}
* **conj:** {data, NNS} and {analytics, NNS}

Relation: (Governor, Relation, Dependent)

* \<User\>\<is\>\<the largest community of data scientists\>
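
The triple form lends itself to a simple data structure. The sketch below stores part of the relation list above as (governor, relation, dependent) tuples and finds the root as the only governor that is never a dependent. The list above does not mark which word of each pair is the governor, so the orientation used here is an assumption; `Dep` and `dependents_of` are hypothetical names.

```python
from collections import namedtuple

Dep = namedtuple("Dep", ["governor", "relation", "dependent"])

# Subset of the relations listed above; governor/dependent order is assumed
triples = [
    Dep("community", "nsubj", "User"),
    Dep("community", "cop", "is"),
    Dep("community", "det", "the"),
    Dep("community", "amod", "largest"),
    Dep("community", "nmod", "scientists"),
    Dep("scientists", "case", "of"),
    Dep("scientists", "compound", "data"),
    Dep("community", "cc", "and"),
    Dep("community", "conj", "provides"),
    Dep("provides", "dobj", "resources"),
    Dep("resources", "amod", "best"),
]


def dependents_of(word):
    """Return (relation, dependent) pairs directly governed by word."""
    return [(d.relation, d.dependent) for d in triples if d.governor == word]


# The root governs other words but is not itself a dependent of anything
governors = {d.governor for d in triples}
dependents = {d.dependent for d in triples}
roots = governors - dependents

print(roots)
print(dependents_of("scientists"))
```

Keying on the word text works for this subset only because no word repeats; a full parse would index tokens by position, since "data" and "and" each occur twice in the sentence.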

In [18]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

sentence = "User is the largest community of data scientists and provides best resources for understanding data and analytics."

doc = nlp(sentence)

for token in doc:
    print(f'{token.text:10} {token.dep_:10} {token.head.text:10} {token.head.pos_:10} {token.pos_:10}')


displacy.render(doc, style="dep", jupyter=True)



**Dependency Grammar - Use Cases**

* Named Entity Recognition
* Question Answering Systems
* Coreference Resolution
* Text Summarization
* Text Classification