# Introduction to spaCy

This code accompanies Kim Fessel's post on the ODSC blog: ["Level Up: spaCy NLP for the Win,"](https://opendatascience.com/level-up-spacy-nlp-for-the-win/) published February 2020.

## spaCy Installation

Install spaCy with pip:

`pip install spacy`

You will also need to download a language model. For learning purposes, we will start with the small English model:

`python -m spacy download en_core_web_sm`
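
Before loading the model, it can help to confirm that both packages are importable in the current environment. The following is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of spaCy.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if a top-level package can be found in this environment."""
    return find_spec(package_name) is not None


# The downloaded model is installed as a regular Python package,
# so it can be checked the same way as spaCy itself.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If either line reports `missing`, rerun the corresponding install command above in the same environment that the notebook kernel uses.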

## spaCy Basics

### Tokenization

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
review = "I'm so happy I went to this awesome Vegas buffet!"

In [None]:
doc = nlp(review)

> "The resulting spaCy document is a rich collection of tokens that have been annotated with many attributes... To see this in action, loop over each token in the document and print out the part of speech, lemma, and whether or not this token is a so-called stop word."

In [None]:
for token in doc:
    print(token.text, token.pos_, token.lemma_, token.is_stop)

> "... spaCy tokenizes text in an entirely nondestructive manner... The underlying text does not change... spaCy does not explicitly break the original text into a list, but tokens can be accessed by index span."

In [None]:
doc.text

In [None]:
doc[:5]

In [None]:
doc[-5:-1]
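
To make the idea of nondestructive tokenization concrete, here is a rough standard-library sketch. It is not how spaCy is implemented, and the regular expression is only a stand-in that splits differently from spaCy's tokenizer (for example, it breaks "I'm" into three pieces). The point is that tokens are stored as character offsets into the unchanged original string; the helper names `token_text` and `span_text` are hypothetical.

```python
import re

text = "I'm so happy I went to this awesome Vegas buffet!"

# Store each token as (start, end) character offsets instead of copying text.
# Rough stand-in pattern: runs of word characters, or single punctuation marks.
spans = [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]


def token_text(i):
    """Look up the text of token i by slicing the original string."""
    start, end = spans[i]
    return text[start:end]


def span_text(start_tok, end_tok):
    """Return the original text covered by tokens start_tok up to (not including) end_tok."""
    return text[spans[start_tok][0]:spans[end_tok - 1][1]]


print(token_text(4))    # a single token
print(span_text(0, 5))  # a slice of tokens, with the original spacing intact
print(text)             # the underlying string is never modified
```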

> "spaCy also performs automatic sentence detection.  Iterating over the generator `doc.sents` yields each recognized sentence."

In [None]:
type(doc.sents)

In [None]:
for sent in doc.sents:
    print(sent)

### Dependencies

> "... spaCy provides syntactic parsing to show word usage, thus creating a dependency tree..."

In [None]:
for token in doc:
    print(token.text, token.dep_)

> "... visualizing these relationships reveals an even more comprehensive story.  First load a submodule called displaCy to help with the visualization... ask displaCy to render the dependency tree..."

In [None]:
from spacy import displacy

In [None]:
displacy.render(doc, style='dep', options={'distance': 80})

> "You can even traverse this parse tree... spaCy accurately labels 'awesome' as an adjectival modifier (amod) and also detects its relationship to 'buffet':"

In [None]:
from spacy.symbols import amod

In [None]:
for token in doc:
    # Compare the integer dependency ID against the imported `amod` symbol
    if token.dep == amod:
        print(f"ADJ MODIFIER: {token.text} --> NOUN: {token.head}")

In [None]:
spacy.explain("amod")
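
The head/child relationship used in this traversal can be illustrated without spaCy. The sketch below uses a hand-written parse of the noun phrase "this awesome Vegas buffet"; apart from the `amod` link between "awesome" and "buffet" noted above, the heads and labels are assumptions made for illustration, not output copied from spaCy. The `Token` tuple and both helper functions are hypothetical.

```python
from collections import namedtuple

# Hand-written toy parse; `head` is the index of the parent token.
Token = namedtuple("Token", ["text", "dep", "head"])

tokens = [
    Token("this", "det", 3),
    Token("awesome", "amod", 3),
    Token("Vegas", "compound", 3),
    Token("buffet", "ROOT", 3),  # the root of this fragment is its own head
]


def children(index):
    """Indices of the tokens whose head is the token at `index`."""
    return [i for i, tok in enumerate(tokens) if tok.head == index and i != index]


def modifiers(index, dep="amod"):
    """Texts of the children of `index` that carry the dependency label `dep`."""
    return [tokens[i].text for i in children(index) if tokens[i].dep == dep]


buffet = 3
print(modifiers(buffet))              # adjectival modifiers of "buffet"
print(modifiers(buffet, "compound"))  # compound modifiers of "buffet"
```

Walking from a head down to its children is the same direction that the `get_amods` function in the case study takes with `token.children`.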

### Named Entity Recognition

> "To see which tokens spaCy identifies as named entities... simply cycle through `doc.ents`"

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

In [None]:
spacy.explain("GPE")

In [None]:
displacy.render(doc, style='ent', jupyter=True)

> "Consider this more complicated example with four different kinds of entities; displaCy provides unique colors to each."

In [None]:
document = nlp(
    "One year ago, I visited the Eiffel Tower with Jeff in Paris, France."
    )

In [None]:
displacy.render(document, style='ent', jupyter=True)

In [None]:
spacy.explain("FAC")

## Case Study: Restaurant Reviews

> "We will examine [this Kaggle dataset](https://www.kaggle.com/vigneshwarsofficial/reviews), consisting of 1,000 [restaurant] reviews labeled by sentiment."

In [None]:
import pandas as pd

# Show up to 100 characters per column so the review text is readable
pd.set_option('display.max_colwidth', 100)

In [None]:
url = 'http://bit.ly/375FDrO'  # Kaggle dataset

# The file is tab-separated, so pass sep='\t'
df = pd.read_csv(url, sep='\t')

In [None]:
df.shape

In [None]:
# Rename the two columns: review text and sentiment label (1 = positive, 0 = negative)
df.columns = ['text', 'rating']

In [None]:
df.head()

### Pipelines

> "We will now use spaCy's `pipe` method in order to process multiple documents in one go."

In [None]:
df['spacy_doc'] = list(nlp.pipe(df.text))

In [None]:
df.head()

### Parts of Speech by Sentiment

> "Splitting the information by sentiment..."

In [None]:
positive_reviews = df[df.rating==1]
negative_reviews = df[df.rating==0]

> "What are the most common adjectives used in positive versus negative reviews?... Let's [also] check the nouns..."

In [None]:
def words_by_pos(docs, pos):
    """Lowercased text of every token in `docs` tagged with the part of speech `pos`."""
    return [token.text.lower() for doc in docs for token in doc if token.pos_ == pos]

pos_adj = words_by_pos(positive_reviews.spacy_doc, 'ADJ')
neg_adj = words_by_pos(negative_reviews.spacy_doc, 'ADJ')

pos_noun = words_by_pos(positive_reviews.spacy_doc, 'NOUN')
neg_noun = words_by_pos(negative_reviews.spacy_doc, 'NOUN')

In [None]:
from collections import Counter

In [None]:
Counter(pos_adj).most_common(10)

In [None]:
Counter(neg_adj).most_common(10)

In [None]:
Counter(pos_noun).most_common(10)

In [None]:
Counter(neg_noun).most_common(10)
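
Raw counts can be hard to compare when the same word ranks high in both lists. One possible follow-up, sketched here with made-up word lists standing in for `pos_adj` and `neg_adj`, is to compare each word's share of its own list. The function names `relative_freq` and `most_distinctive` are hypothetical, and this is only one of several reasonable ways to score distinctiveness.

```python
from collections import Counter

# Made-up data for illustration; in the notebook these would be pos_adj and neg_adj.
toy_pos = ["good", "great", "good", "friendly", "fresh", "great", "good"]
toy_neg = ["bad", "good", "slow", "bad", "cold", "good"]


def relative_freq(words):
    """Map each word to its share of the total word count."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def most_distinctive(words_a, words_b, n=3):
    """Words whose share in words_a most exceeds their share in words_b."""
    freq_a, freq_b = relative_freq(words_a), relative_freq(words_b)
    diffs = {word: freq_a[word] - freq_b.get(word, 0.0) for word in freq_a}
    return sorted(diffs, key=diffs.get, reverse=True)[:n]


print(most_distinctive(toy_pos, toy_neg))
print(most_distinctive(toy_neg, toy_pos))
```

In this toy data, "good" is the most frequent word in the positive list but drops out of the top three once its frequency in the negative list is taken into account.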

### Dependency Parsing

> "For a given noun of interest, extract each of the adjectival modifiers that are among its children tokens..."

In [None]:
from spacy.symbols import amod
from pprint import pprint

In [None]:
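def get_amods(noun, ser):
    """Collect the lowercased adjectival modifiers of `noun` across a Series of spaCy docs."""
    amod_list = []
    for doc in ser:
        for token in doc:
            # Case-insensitive match, so "Food" at the start of a sentence also counts
            if token.lower_ == noun.lower():
                for child in token.children:
                    if child.dep == amod:
                        amod_list.append(child.text.lower())
    return sorted(amod_list)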

def amods_by_sentiment(noun):
    """Print the adjectival modifiers of `noun`, split by review sentiment."""
    print(f"Adjectives describing {noun.upper()}:\n")

    print("POSITIVE:")
    pprint(get_amods(noun, positive_reviews.spacy_doc))

    print("\nNEGATIVE:")
    pprint(get_amods(noun, negative_reviews.spacy_doc))

In [None]:
amods_by_sentiment("food")

In [None]:
amods_by_sentiment("service")