
In [0]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [0]:
text = """
Dealing with textual data is very crucial so to handle these text data we need some 
basic text processing steps. Most of the processing steps covered in this section are 
commonly used in NLP and involve the combination of several steps into a single 
executable flow. This is usually referred to as the NLP pipeline. These flow 
can be a combination of tokenization, stemming, word frequency, parts of 
speech tagging, etc.
"""

In [0]:
sentences = nltk.sent_tokenize(text)
words = [nltk.word_tokenize(s) for s in sentences]
words

[['Dealing',
  'with',
  'textual',
  'data',
  'is',
  'very',
  'crucial',
  'so',
  'to',
  'handle',
  'these',
  'text',
  'data',
  'we',
  'need',
  'some',
  'basic',
  'text',
  'processing',
  'steps',
  '.'],
 ['Most',
  'of',
  'the',
  'processing',
  'steps',
  'covered',
  'in',
  'this',
  'section',
  'are',
  'commonly',
  'used',
  'in',
  'NLP',
  'and',
  'involve',
  'the',
  'combination',
  'of',
  'several',
  'steps',
  'into',
  'a',
  'single',
  'executable',
  'flow',
  '.'],
 ['This',
  'is',
  'usually',
  'referred',
  'to',
  'as',
  'the',
  'NLP',
  'pipeline',
  '.'],
 ['These',
  'flow',
  'can',
  'be',
  'a',
  'combination',
  'of',
  'tokenization',
  ',',
  'stemming',
  ',',
  'word',
  'frequency',
  ',',
  'parts',
  'of',
  'speech',
  'tagging',
  ',',
  'etc',
  '.']]
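
Word frequency, one of the steps named in the sample text, can be computed directly from tokenized sentences. The following is a minimal sketch and not part of the original notebook: the token lists are copied from the last two sentences of the output above, and the choice to lowercase tokens and drop punctuation with `str.isalpha` is an assumption.

```python
from collections import Counter

# Tokens copied from the last two sentences of the output above
tokenized = [
    ['This', 'is', 'usually', 'referred', 'to', 'as', 'the', 'NLP', 'pipeline', '.'],
    ['These', 'flow', 'can', 'be', 'a', 'combination', 'of', 'tokenization', ',',
     'stemming', ',', 'word', 'frequency', ',', 'parts', 'of', 'speech',
     'tagging', ',', 'etc', '.'],
]

# Flatten the sentences, lowercase each token, and skip punctuation
word_freq = Counter(
    token.lower()
    for sentence in tokenized
    for token in sentence
    if token.isalpha()
)

print(word_freq.most_common(1))  # [('of', 2)]
```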

## Part-of-Speech tagging
Some words can act as more than one part of speech: for example, charge is a noun, but it can also be a verb, (to) charge. Knowing a word's part of speech (POS) can help disambiguate its meaning. Each token in a sentence has several attributes that we can use for our analysis.

The POS of a word is one example: nouns name a person, place, or thing; verbs describe actions or occurrences; and adjectives describe nouns. Using these attributes, it becomes straightforward to summarize a piece of text by counting its most common nouns, verbs, and adjectives.

In [0]:
tagged_wt = [nltk.pos_tag(w) for w in words]

In [0]:
tagged_wt

[[('Dealing', 'VBG'),
  ('with', 'IN'),
  ('textual', 'JJ'),
  ('data', 'NNS'),
  ('is', 'VBZ'),
  ('very', 'RB'),
  ('crucial', 'JJ'),
  ('so', 'RB'),
  ('to', 'TO'),
  ('handle', 'VB'),
  ('these', 'DT'),
  ('text', 'JJ'),
  ('data', 'NN'),
  ('we', 'PRP'),
  ('need', 'VBP'),
  ('some', 'DT'),
  ('basic', 'JJ'),
  ('text', 'NN'),
  ('processing', 'NN'),
  ('steps', 'NNS'),
  ('.', '.')],
 [('Most', 'JJS'),
  ('of', 'IN'),
  ('the', 'DT'),
  ('processing', 'NN'),
  ('steps', 'NNS'),
  ('covered', 'VBN'),
  ('in', 'IN'),
  ('this', 'DT'),
  ('section', 'NN'),
  ('are', 'VBP'),
  ('commonly', 'RB'),
  ('used', 'VBN'),
  ('in', 'IN'),
  ('NLP', 'NNP'),
  ('and', 'CC'),
  ('involve', 'VB'),
  ('the', 'DT'),
  ('combination', 'NN'),
  ('of', 'IN'),
  ('several', 'JJ'),
  ('steps', 'NNS'),
  ('into', 'IN'),
  ('a', 'DT'),
  ('single', 'JJ'),
  ('executable', 'JJ'),
  ('flow', 'NN'),
  ('.', '.')],
 [('This', 'DT'),
  ('is', 'VBZ'),
  ('usually', 'RB'),
  ('referred', 'VBN'),
  ('to', 'TO'),

In [0]:
patternPOS = []
for sentence in tagged_wt:
  # keep only the tag from each (word, tag) pair
  patternPOS.append([tag for word, tag in sentence])

In [0]:
patternPOS

[['VBG',
  'IN',
  'JJ',
  'NNS',
  'VBZ',
  'RB',
  'JJ',
  'RB',
  'TO',
  'VB',
  'DT',
  'JJ',
  'NN',
  'PRP',
  'VBP',
  'DT',
  'JJ',
  'NN',
  'NN',
  'NNS',
  '.'],
 ['JJS',
  'IN',
  'DT',
  'NN',
  'NNS',
  'VBN',
  'IN',
  'DT',
  'NN',
  'VBP',
  'RB',
  'VBN',
  'IN',
  'NNP',
  'CC',
  'VB',
  'DT',
  'NN',
  'IN',
  'JJ',
  'NNS',
  'IN',
  'DT',
  'JJ',
  'JJ',
  'NN',
  '.'],
 ['DT', 'VBZ', 'RB', 'VBN', 'TO', 'IN', 'DT', 'NNP', 'NN', '.'],
 ['DT',
  'NN',
  'MD',
  'VB',
  'DT',
  'NN',
  'IN',
  'NN',
  ',',
  'VBG',
  ',',
  'NN',
  'NN',
  ',',
  'NNS',
  'IN',
  'NN',
  'NN',
  ',',
  'FW',
  '.']]
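
Once the tags are separated from the words, a quick way to inspect them is to count how often each tag occurs. This is a minimal sketch that uses the tag list for the third sentence exactly as shown in the output above; `tag_counts` is an illustrative name and not part of the original notebook.

```python
from collections import Counter

# Tag sequence of the third sentence, copied from the patternPOS output above
pattern = ['DT', 'VBZ', 'RB', 'VBN', 'TO', 'IN', 'DT', 'NNP', 'NN', '.']

# Count how many times each tag appears in the sentence
tag_counts = Counter(pattern)
print(tag_counts.most_common(1))  # [('DT', 2)]
```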

## Extracting nouns
Let's extract all of the nouns in the corpus. This is a useful technique whenever you want to pull out one specific word class. We use the NN, NNS, NNP, and NNPS tags to identify nouns:

In [0]:
nouns = []
for sentence in tagged_wt:
  # keep the word when its tag is one of the four noun tags
  nouns.append([word for word, tag in sentence if tag in ['NN','NNS','NNP','NNPS']])

## Extracting verbs
Let's extract all of the verbs present in the corpus. In this case, we are using VB, VBD, VBG, VBN, VBP, and VBZ as verb tags:

In [0]:
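verbs = []
for sentence in tagged_wt:
  # keep the word (not the tag) when the tag is one of the six verb tags
  verbs.append([word for word, tag in sentence if tag in ['VB','VBD','VBG','VBN','VBP','VBZ']])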
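
Because the Penn Treebank noun tags all start with NN, the verb tags with VB, and the adjective tags with JJ, the same extraction can be written once and reused for each word class. The sketch below is one possible generalization, not the notebook's own method; `words_with_prefix` is a hypothetical helper, and the sample pairs are copied from the start of the tagged output above.

```python
def words_with_prefix(tagged_sentence, prefix):
    """Return the words whose POS tag starts with the given prefix."""
    return [word for word, tag in tagged_sentence if tag.startswith(prefix)]

# (word, tag) pairs copied from the start of the tagged output above
sample = [('Dealing', 'VBG'), ('with', 'IN'), ('textual', 'JJ'), ('data', 'NNS'),
          ('is', 'VBZ'), ('very', 'RB'), ('crucial', 'JJ')]

print(words_with_prefix(sample, 'NN'))  # ['data']
print(words_with_prefix(sample, 'VB'))  # ['Dealing', 'is']
print(words_with_prefix(sample, 'JJ'))  # ['textual', 'crucial']
```

Applied to the whole corpus, this could be called once per sentence in `tagged_wt`, in the same way as the loops above.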

Now, let's use spaCy to tokenize a piece of text and access the POS attribute of each token.

As an example application, we'll tokenize the sample text from earlier and count its most common nouns with the following code. We'll also lemmatize the tokens, which gives the root form of a word, to help us standardize across the different forms of a word:

In [0]:
import spacy
from collections import Counter
from tabulate import tabulate
nlp = spacy.load('en_core_web_sm')

In [0]:
doc = nlp(text)
noun_counter = Counter(token.lemma_ for token in doc if token.pos_ == 'NOUN')

In [0]:
doc


Dealing with textual data is very crucial so to handle these text data we need some 
basic text processing steps. Most of the processing steps covered in this section are 
commonly used in NLP and involve the combination of several steps into a single 
executable flow. This is usually referred to as the NLP pipeline. These flow 
can be a combination of tokenization, stemming, word frequency, parts of 
speech tagging, etc.

In [0]:
noun_counter

Counter({'combination': 2,
         'datum': 2,
         'flow': 2,
         'frequency': 1,
         'part': 1,
         'pipeline': 1,
         'processing': 2,
         'section': 1,
         'speech': 1,
         'stemming': 1,
         'step': 3,
         'tagging': 1,
         'text': 2,
         'tokenization': 1,
         'word': 1})

In [0]:
print(tabulate(noun_counter.most_common(5), headers = ['Noun', 'Count']))

Noun           Count
-----------  -------
step               3
datum              2
text               2
processing         2
combination        2
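
The same counting pattern extends to other word classes. The following sketch wraps it in a hypothetical helper, `top_lemmas`, which is not part of the original notebook. It assumes only that each token exposes the `pos_` and `lemma_` attributes used above, and that the coarse tag passed in (for example `'VERB'` or `'ADJ'`) is one that the loaded model assigns.

```python
from collections import Counter

def top_lemmas(doc, pos, n=5):
    """Count the lemmas of tokens with the given coarse POS tag
    and return the n most common as (lemma, count) pairs."""
    counter = Counter(token.lemma_ for token in doc if token.pos_ == pos)
    return counter.most_common(n)

# Usage with the `doc` created above (the result depends on the loaded model):
# print(tabulate(top_lemmas(doc, 'VERB'), headers=['Verb', 'Count']))
```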


## Dependency parsing
Dependency parsing is a way to understand the relationships between words in a sentence. Dependency relations are a more fine-grained attribute than POS tags: each token is linked to a head word by a labeled relation, which helps a model understand how the words in a sentence relate to each other:

In [0]:
doc = nlp(sentences[2])
doc

This is usually referred to as the NLP pipeline.

In [0]:
from spacy import displacy
displacy.render(doc, style='dep', options={'distance': 140}, jupyter=True)
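
Outside a notebook, or when a diagram is not needed, the same relations can be read from the token attributes. This is a minimal sketch, assuming the spaCy token attributes `dep_` (the relation label) and `head` (the governing token); `dependency_rows` is a hypothetical helper name.

```python
def dependency_rows(doc):
    """Return one (token, relation, head) triple per token in a parsed document."""
    return [(token.text, token.dep_, token.head.text) for token in doc]

# Usage with the parsed `doc` from the cell above (labels depend on the loaded model):
# print(tabulate(dependency_rows(doc), headers=['Token', 'Relation', 'Head']))
```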

## Named entity recognition (NER)
Finally, there's named entity recognition (NER). Named entities are the proper nouns of a sentence, such as people, places, and organizations. Modern models are fairly good at detecting whether a sentence contains named entities and at classifying each one by type. spaCy exposes named entities at the document level (through `doc.ents`), since the name of an entity can span several tokens:

In [0]:
doc = nlp("My name is Jack and I live in San Francisco.")

entity_types = ((ent.text, ent.label_) for ent in doc.ents)
print(tabulate(entity_types, headers = ['Entity','Entity Type']))

Entity         Entity Type
-------------  -------------
Jack           PERSON
San Francisco  GPE
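
When a document contains many entities, it can help to group them by type. This is a small sketch, assuming the `ents`, `text`, and `label_` attributes used above; `entities_by_label` is a hypothetical helper and not part of spaCy.

```python
from collections import defaultdict

def entities_by_label(doc):
    """Group the text of each named entity under its entity type."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# Usage with the `doc` from the cell above:
# entities_by_label(doc)
```

With the two entities printed above, this would return `{'PERSON': ['Jack'], 'GPE': ['San Francisco']}`, although the entities found depend on the model that is loaded.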
