## Python Libraries for Chunking with PoS Tagging

- spaCy 

- TextBlob
---

> Each library behaves a little differently. spaCy's noun chunks keep common English words such as 'a' and 'the', while TextBlob's noun phrases drop them.


### Using the Yelp Dataset Challenge

- https://www.yelp.com/dataset/challenge
---

In [1]:
# Import Libraries

import pandas as pd
import json

In [2]:
# Load the first 10 reviews (the file stores one JSON object per line)

js = []
with open('../../data/yelp_dataset/yelp_academic_dataset_review.json', encoding='utf8') as f:
    for i in range(10):
        js.append(json.loads(f.readline()))

review_df = pd.DataFrame(js)
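
The loading cell reads the file line by line because each line holds one complete JSON record. A minimal sketch of the same idea as a reusable helper, using only the standard library, might look like the following. The function name `read_json_lines` and its `limit` parameter are illustrative and not part of the original notebook.

```python
import json
from itertools import islice


def read_json_lines(path, limit):
    """Return up to `limit` records from a JSON Lines file as a list of dicts."""
    with open(path, encoding='utf8') as f:
        # islice stops after `limit` lines, or earlier if the file is shorter
        return [json.loads(line) for line in islice(f, limit)]
```

Passing the result to `pd.DataFrame(...)` should give a frame equivalent to `review_df`, and unlike repeated `readline()` calls the helper does not fail when the file has fewer lines than requested.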

---

## Using spaCy

In [3]:
# first we'll walk through spaCy's functions

import spacy

In [4]:
# model metadata

spacy.info('en')


    Info about model en

    lang               en             
    pipeline           ['tagger', 'parser', 'ner']
    accuracy           {'token_acc': 99.8698372794, 'ents_p': 84.9664503965, 'ents_r': 85.6312524451, 'uas': 91.7237657538, 'tags_acc': 97.0403350292, 'ents_f': 85.2975560875, 'las': 89.800872413}
    name               core_web_sm    
    license            CC BY-SA 3.0   
    author             Explosion AI   
    url                https://explosion.ai
    vectors            {'keys': 0, 'width': 0, 'vectors': 0}
    sources            ['OntoNotes 5', 'Common Crawl']
    version            2.0.0          
    spacy_version      >=2.0.0a18     
    parent_package     spacy          
    speed              {'gpu': None, 'nwords': 291344, 'cpu': 5122.3040471407}
    email              contact@explosion.ai
    description        English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.

{'lang': 'en',
 'pipeline': ['tagger', 'parser', 'ner'],
 'accuracy': {'token_acc': 99.8698372794,
  'ents_p': 84.9664503965,
  'ents_r': 85.6312524451,
  'uas': 91.7237657538,
  'tags_acc': 97.0403350292,
  'ents_f': 85.2975560875,
  'las': 89.800872413},
 'name': 'core_web_sm',
 'license': 'CC BY-SA 3.0',
 'author': 'Explosion AI',
 'url': 'https://explosion.ai',
 'vectors': {'keys': 0, 'width': 0, 'vectors': 0},
 'sources': ['OntoNotes 5', 'Common Crawl'],
 'version': '2.0.0',
 'spacy_version': '>=2.0.0a18',
 'parent_package': 'spacy',
 'speed': {'gpu': None, 'nwords': 291344, 'cpu': 5122.3040471407},
 'email': 'contact@explosion.ai',
 'description': 'English multi-task CNN trained on OntoNotes, with GloVe vectors trained on Common Crawl. Assigns word vectors, context-specific token vectors, POS tags, dependency parse and named entities.',
 'link': '/usr/local/lib/python3.6/dist-packages/spacy/data/en',
 'source': '/usr/local/lib/python3.6/dist-packages/en_core_web_sm'}

In [5]:
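# Load the language model once and reuse it
# ('en' is a spaCy 2.x shortcut link; later versions expect the package name, e.g. 'en_core_web_sm')

nlp = spacy.load('en')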

In [6]:
# Applying nlp to each review gives a pandas Series of spaCy Doc objects

doc_df = review_df['text'].apply(nlp)

In [7]:
type(doc_df)

pandas.core.series.Series

In [8]:
type(doc_df[0])

spacy.tokens.doc.Doc

In [9]:
doc_df[0]

The pizza was okay. Not the best I've had. I prefer Biaggio's on Flamingo / Fort Apache. The chef there can make a MUCH better NY style pizza. The pizzeria @ Cosmo was over priced for the quality and lack of personality in the food. Biaggio's is a much better pick if youre going for italian - family owned, home made recipes, people that actually CARE if you like their food. You dont get that at a pizzeria in a casino. I dont care what you say...

In [10]:
# spaCy gives you both coarse-grained (.pos_) and fine-grained (.tag_) parts of speech

for token in doc_df[0]:
    print(token.text, token.pos_, token.tag_)

The DET DT
pizza NOUN NN
was VERB VBD
okay ADJ JJ
. PUNCT .
Not ADV RB
the DET DT
best ADJ JJS
I PRON PRP
've VERB VB
had VERB VBN
. PUNCT .
I PRON PRP
prefer VERB VBP
Biaggio PROPN NNP
's PART POS
on ADP IN
Flamingo PROPN NNP
/ SYM SYM
Fort PROPN NNP
Apache PROPN NNP
. PUNCT .
The DET DT
chef NOUN NN
there ADV EX
can VERB MD
make VERB VB
a DET DT
MUCH ADV RB
better ADJ JJR
NY PROPN NNP
style NOUN NN
pizza NOUN NN
. PUNCT .
The DET DT
pizzeria NOUN NN
@ ADP IN
Cosmo PROPN NNP
was VERB VBD
over ADV RB
priced VERB VBN
for ADP IN
the DET DT
quality NOUN NN
and CCONJ CC
lack NOUN NN
of ADP IN
personality NOUN NN
in ADP IN
the DET DT
food NOUN NN
. PUNCT .
Biaggio PROPN NNP
's PART POS
is VERB VBZ
a DET DT
much ADV RB
better ADJ JJR
pick NOUN NN
if ADP IN
you PRON PRP
re VERB VBZ
going VERB VBG
for ADP IN
italian ADJ JJ
- PUNCT HYPH
family NOUN NN
owned VERB VBN
, PUNCT ,
home NOUN NN
made VERB VBD
recipes NOUN NNS
, PUNCT ,
people NOUN NNS
that ADJ WDT
actually ADV RB
CARE VERB VBP
if ADP 
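
In the output above, the second column is the coarse-grained tag (`.pos_`) and the third is the fine-grained tag (`.tag_`). To make the relationship concrete, here is a small sketch that groups the fine-grained tags under the coarse tag they appeared with. The triples are copied by hand from the output; the variable names are illustrative only.

```python
from collections import defaultdict

# (text, coarse tag, fine tag) triples copied from the spaCy output above
tagged = [
    ("The", "DET", "DT"),
    ("pizza", "NOUN", "NN"),
    ("was", "VERB", "VBD"),
    ("okay", "ADJ", "JJ"),
    ("best", "ADJ", "JJS"),
    ("had", "VERB", "VBN"),
    ("prefer", "VERB", "VBP"),
    ("recipes", "NOUN", "NNS"),
]

# Collect every fine-grained tag seen under each coarse-grained tag
fine_by_coarse = defaultdict(set)
for _text, coarse, fine in tagged:
    fine_by_coarse[coarse].add(fine)

for coarse in sorted(fine_by_coarse):
    print(coarse, sorted(fine_by_coarse[coarse]))
```

Each coarse tag fans out into several fine tags: for example `VERB` covers `VBD`, `VBN` and `VBP` in this sample.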

In [11]:
# spaCy also does some basic noun chunking for us

print(list(doc_df[0].noun_chunks))

[The pizza, I, I, Biaggio, Flamingo / Fort Apache, The chef, a MUCH better NY style pizza, The pizzeria, Cosmo, the quality, lack, personality, the food, Biaggio, a much better pick, you, recipes, people, you, their food, You, a pizzeria, a casino, I, what, you]
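
To show how chunks can be derived from PoS tags at all, here is a deliberately simplified rule-based chunker. It is not how spaCy builds `noun_chunks` (spaCy relies on the dependency parse); it is only a sketch of the pattern "optional determiner, any adjectives, one or more nouns" over Penn Treebank style tags. The function name `chunk_noun_phrases` is hypothetical.

```python
def chunk_noun_phrases(tagged_tokens):
    """Group (word, tag) pairs into simple noun phrases: DT? JJ* NN+."""
    chunks = []
    current = []       # words collected for the phrase in progress
    has_noun = False   # a phrase is only kept once it contains a noun

    # A sentinel with an empty tag closes a phrase still open at the end.
    for word, tag in list(tagged_tokens) + [("", "")]:
        is_noun = tag.startswith("NN")
        is_modifier = tag == "DT" or tag.startswith("JJ")
        # Close the phrase in progress when the pattern cannot continue:
        # on any other tag, on a determiner, or on a modifier after a noun.
        if not is_noun and (not is_modifier or has_noun or tag == "DT"):
            if has_noun:
                chunks.append(" ".join(current))
            current, has_noun = [], False
        if is_noun or is_modifier:
            current.append(word)
            has_noun = has_noun or is_noun
    return chunks


# Tags taken from the spaCy output for the first sentence of the review
sentence = [("The", "DT"), ("pizza", "NN"), ("was", "VBD"), ("okay", "JJ"), (".", ".")]
print(chunk_noun_phrases(sentence))  # ['The pizza']
```

A rule this simple misses cases that spaCy handles. With the tags shown earlier, "a MUCH better NY style pizza" would come out as "better NY style pizza", because the adverb `MUCH` (`RB`) interrupts the pattern.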


---

## Using TextBlob
---

In [12]:
# Import TextBlob

from textblob import TextBlob

> TextBlob's default tagger is the PatternTagger, the same tagger used by the [pattern](https://www.clips.uantwerpen.be/pattern) library, which is fine for our example. To use the NLTK tagger instead, pass the `pos_tagger` argument when constructing a `TextBlob`. More [here](http://textblob.readthedocs.io/en/dev/advanced_usage.html#advanced).
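
As a rough sketch of switching taggers, the helper below builds a blob with `NLTKTagger` from `textblob.taggers`. The function name `tag_with_nltk` is hypothetical, and the sketch assumes TextBlob and the NLTK tagger data are installed (for instance via `python -m textblob.download_corpora`).

```python
def tag_with_nltk(text):
    """Return (word, tag) pairs from TextBlob using the NLTK tagger."""
    # Imported inside the function so the sketch can be loaded even
    # in an environment where TextBlob is not installed.
    from textblob import TextBlob
    from textblob.taggers import NLTKTagger

    blob = TextBlob(text, pos_tagger=NLTKTagger())
    return blob.tags
```

Calling `tag_with_nltk` on a review should return a list in the same `(word, tag)` format as `blob.tags`, though individual tags may differ from the PatternTagger results.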

In [13]:
blob_df = review_df['text'].apply(TextBlob)
type(blob_df)

pandas.core.series.Series

In [14]:
type(blob_df[0])

textblob.blob.TextBlob

In [15]:
blob_df[0].tags

[('The', 'DT'),
 ('pizza', 'NN'),
 ('was', 'VBD'),
 ('okay', 'RB'),
 ('Not', 'RB'),
 ('the', 'DT'),
 ('best', 'JJS'),
 ('I', 'PRP'),
 ("'ve", 'VBP'),
 ('had', 'VBN'),
 ('I', 'PRP'),
 ('prefer', 'VBP'),
 ('Biaggio', 'NNP'),
 ("'s", 'POS'),
 ('on', 'IN'),
 ('Flamingo', 'NNP'),
 ('/', 'NNP'),
 ('Fort', 'NNP'),
 ('Apache', 'NNP'),
 ('The', 'DT'),
 ('chef', 'NN'),
 ('there', 'EX'),
 ('can', 'MD'),
 ('make', 'VB'),
 ('a', 'DT'),
 ('MUCH', 'NNP'),
 ('better', 'JJR'),
 ('NY', 'NNP'),
 ('style', 'NN'),
 ('pizza', 'NN'),
 ('The', 'DT'),
 ('pizzeria', 'NN'),
 ('@', 'NNP'),
 ('Cosmo', 'NNP'),
 ('was', 'VBD'),
 ('over', 'RB'),
 ('priced', 'VBN'),
 ('for', 'IN'),
 ('the', 'DT'),
 ('quality', 'NN'),
 ('and', 'CC'),
 ('lack', 'NN'),
 ('of', 'IN'),
 ('personality', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('food', 'NN'),
 ('Biaggio', 'NNP'),
 ("'s", 'POS'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('much', 'RB'),
 ('better', 'RBR'),
 ('pick', 'NN'),
 ('if', 'IN'),
 ('youre', 'NN'),
 ('going', 'VBG'),
 ('for', 'IN'

In [16]:
# TextBlob objects also support noun phrase extraction

print(list(blob_df[0].noun_phrases))

['biaggio', 'flamingo', '/ fort', 'apache', 'much', 'ny', 'style pizza', 'pizzeria @', 'cosmo', 'biaggio', 'care', 'dont care']
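
The two outputs are hard to compare directly: spaCy keeps the original casing and leading articles, while TextBlob returns lowercased phrases without them. One possible way to put spaCy's chunks on the same footing is sketched below. The helper `normalize_chunk` and the `ARTICLES` set are illustrative assumptions; with real spaCy spans you would pass `chunk.text`.

```python
ARTICLES = {"a", "an", "the"}


def normalize_chunk(chunk_text):
    """Lowercase a chunk and drop a single leading article, if present."""
    words = chunk_text.lower().split()
    if words and words[0] in ARTICLES:
        words = words[1:]
    return " ".join(words)


# A few chunk texts copied from the spaCy output earlier in the notebook
spacy_chunks = ["The pizza", "a MUCH better NY style pizza", "Cosmo"]
print([normalize_chunk(c) for c in spacy_chunks])
# ['pizza', 'much better ny style pizza', 'cosmo']
```

Even after normalization the two libraries disagree on phrase boundaries (TextBlob yields 'style pizza' where spaCy yields the full phrase), so this only removes the surface differences.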
