# Chapter 3 - spaCy Application

## Instructions

- Run the cells containing `assert` statements to check your answers. If a cell runs without error, your output matches the expected output. If your output is different, you'll get a hint.

In [1]:
import pandas as pd

import spacy
nlp = spacy.load('en_core_web_sm')

_If you get an error about being unable to find `en_core_web_sm`, uncomment and run this line first:_

In [2]:
#!python3 -m spacy download en_core_web_sm

Use the dataframe `df` for all of the exercises in this chapter. The data consist of [reviews for AirBnB properties](https://www.kaggle.com/broach/denverairbnb?select=reviews.csv) in the Denver, Colorado, USA area.

In [3]:
df = pd.read_csv('https://github.com/kimfetti/Projects/blob/master/Etc/airbnb_reviews10K.csv?raw=True')

In [4]:
df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,34283811,563042535,2019-11-11,64422953,Latisha,Mitzi’s place is so AWESOME and she is a great...
1,28316749,368346030,2019-01-06,230355204,Dustin,We really enjoyed staying here. Great location...
2,4909321,73216678,2016-05-07,8332410,Ben,Rebecca and her family were very hospitable an...
3,875596,4841584,2013-05-28,6213025,Mari,"Great house, very well located. Clean and quiet."
4,15766497,142924541,2017-04-09,57826718,Mackenzie,"This place is located in a great, fun area! Yo..."


In [5]:
df.shape

(10000, 6)

## Exercise 1

Create a new column in `df` called "spacy_doc".  This column should be the parsed spaCy document for the text in the "comments" column of `df`.

_Hint_: You should be able to accomplish this with one line of code.  This calculation may take a few minutes.

In [6]:
### BEGIN SOLUTION
df['spacy_doc'] = list(nlp.pipe(df.comments))
### END SOLUTION

df.head()

Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments,spacy_doc
0,34283811,563042535,2019-11-11,64422953,Latisha,Mitzi’s place is so AWESOME and she is a great...,"(Mitzi, ’s, place, is, so, AWESOME, and, she, ..."
1,28316749,368346030,2019-01-06,230355204,Dustin,We really enjoyed staying here. Great location...,"(We, really, enjoyed, staying, here, ., Great,..."
2,4909321,73216678,2016-05-07,8332410,Ben,Rebecca and her family were very hospitable an...,"(Rebecca, and, her, family, were, very, hospit..."
3,875596,4841584,2013-05-28,6213025,Mari,"Great house, very well located. Clean and quiet.","(Great, house, ,, very, well, located, ., , C..."
4,15766497,142924541,2017-04-09,57826718,Mackenzie,"This place is located in a great, fun area! Yo...","(This, place, is, located, in, a, great, ,, fu..."


In [7]:
### CHECK YOUR OUTPUT WITH THE ANSWER
assert 'spacy_doc' in df.columns, "Be sure to create a column in df called 'spacy_doc'."
assert type(df.spacy_doc[0]) == spacy.tokens.doc.Doc, "Be sure that each entry in the spacy_doc column of df is a spaCy document."

In [8]:
### BEGIN HIDDEN TESTS

# test first document text
assert df.spacy_doc[0].text == df.comments[0]

# test last document dependencies
last_doc = nlp(df.comments.iloc[-1])
for token, test_token in zip(df.spacy_doc.iloc[-1], last_doc):
    assert token.dep_ == test_token.dep_

### END HIDDEN TESTS

## Exercise 2 - Popular Adjectives

Now it's time to find the most popular adjectives in this collection of AirBnB reviews.  Complete the following steps:

1. Create a list called `adj_list` that contains the TEXT strings of all the tokens that spaCy has identified as adjectives in documents of the "spacy_doc" column.  Be sure to also make all the adjective text lowercase!  (You can create this list either with a double list comprehension or with a double `for` loop.)
2. Use a counter to find the 20 most common adjectives and the number of times these adjectives were used. Call this result `top_adj`, which should be structured as a Python list of tuples:

  ```[(adj1, int1), (adj2, int2), ..., (adj20, int20)]```

  The demo below shows how a Python `Counter` object works.

Do your top adjectives make sense given that these reviews are about AirBnB rentals?

#### Counter Demo

In [9]:
word_list = ['hi', 'bye', 'hello', 'hello', 'hiya', 'hello', 'bye']

First we import `Counter` from the `collections` module.  Then we apply it to `word_list` to tally up the strings.

In [10]:
from collections import Counter

Counter(word_list)

Counter({'hi': 1, 'bye': 2, 'hello': 3, 'hiya': 1})

Then we can apply the `.most_common()` method to sort these words descending by popularity.

In [11]:
Counter(word_list).most_common()

[('hello', 3), ('bye', 2), ('hi', 1), ('hiya', 1)]

If we pass an integer $n$ to `.most_common()`, it returns only the $n$ most popular items.

In [12]:
Counter(word_list).most_common(2)

[('hello', 3), ('bye', 2)]
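
The exercise also mentions a double list comprehension. As a minimal sketch of that pattern, the cell below flattens a list of lists and tallies the result with `Counter`. The nested lists are hypothetical stand-ins for tokenized documents, and the names `toy_docs`, `flat_words`, and `top_two` are illustrative only.

```python
from collections import Counter

# Hypothetical stand-in data: each inner list plays the role of one document's tokens.
toy_docs = [['Great', 'clean', 'great'], ['Quiet', 'great'], ['clean']]

# Double list comprehension: the outer loop is written first, the inner loop second.
flat_words = [word.lower() for doc in toy_docs for word in doc]

# Tally the lowercase words and keep the two most common ones.
top_two = Counter(flat_words).most_common(2)
print(top_two)  # [('great', 3), ('clean', 2)]
```

The loop order inside the comprehension reads the same way as the equivalent nested `for` loops, which is why either approach gives the same list.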

#### spaCy Lowercase Text Demo

Also remember that the `.text` attribute of a spaCy document or token is an ordinary Python string, so you can make it lowercase with `.lower()`.

In [13]:
my_phrase = "HiYa!"

my_doc = nlp(my_phrase)

my_doc.text.lower()

'hiya!'
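
The same idea works token by token. As a small sketch, the cell below uses a blank English pipeline (tokenizer only, so no trained model is required) and compares `token.text.lower()` with spaCy's `token.lower_` attribute. The names `blank_nlp`, `demo_doc`, and `lower_tokens` are assumptions made for this illustration and are not part of the exercise.

```python
import spacy

# A blank English pipeline only tokenizes; it is used here purely for illustration.
blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("HiYa! Great Place")

# Token.lower_ holds the lowercase form of each token's text.
lower_tokens = [token.lower_ for token in demo_doc]

# Calling .lower() on the plain-string .text attribute gives the same result.
assert lower_tokens == [token.text.lower() for token in demo_doc]
print(lower_tokens)
```

Either form is fine for the exercises; `.text.lower()` is the one used in the demo above.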

Okay, time to put what you learned in these demos into practice! Follow the directions above.

In [14]:
### BEGIN SOLUTION
adj_list = []
for doc in df.spacy_doc:
    for token in doc:
        if token.pos_ == 'ADJ':
            adj_list.append(token.text.lower())

top_adj = Counter(adj_list).most_common(20)
### END SOLUTION

adj_list[:15]

['great',
 'beautiful',
 'easy',
 'keyless',
 'safe',
 'lovely',
 'close',
 'good',
 'comfortable',
 'more',
 'laundry',
 'great',
 'great',
 'entire',
 'great']

In [15]:
top_adj

[('great', 6868),
 ('clean', 2879),
 ('nice', 1835),
 ('comfortable', 1709),
 ('perfect', 1661),
 ('easy', 1445),
 ('amazing', 874),
 ('good', 871),
 ('close', 870),
 ('beautiful', 848),
 ('wonderful', 836),
 ('quiet', 784),
 ('cozy', 673),
 ('quick', 664),
 ('awesome', 657),
 ('friendly', 630),
 ('little', 606),
 ('helpful', 582),
 ('lovely', 523),
 ('cute', 503)]

In [16]:
### CHECK YOUR OUTPUT WITH THE ANSWER
assert type(adj_list) == list, "Be sure that adj_list is a Python list."
assert type(adj_list[0]) == str, "Be sure that the elements of adj_list are Python strings."
for adj in adj_list[:15]:
    assert adj == adj.lower(), "Be sure to make all of your adjectives lowercase."
assert len(adj_list) > 10000, "You should find well over 10,000 adjectives in these AirBnB reviews."

assert type(top_adj) == list, "Be sure that top_adj is a Python list."
assert type(top_adj[0]) == tuple, "The elements of top_adj should be tuples.  Have you looked for the most common adjectives?"
assert len(top_adj) == 20, "Be sure that top_adj contains the 20 most popular adjectives for this dataset."
assert top_adj[0][1] > 5000, "You should find that the top adjective was used over 5,000 times."

In [17]:
### BEGIN HIDDEN TESTS

# This test trusts that the spacy_doc column was created correctly;
# re-parsing every review here would be too slow.
test_adj_list = []
for doc in df.spacy_doc:
    for token in doc:
        if token.pos_ == 'ADJ':
            test_adj_list.append(token.text.lower())

assert adj_list == test_adj_list

assert top_adj == Counter(test_adj_list).most_common(20)
### END HIDDEN TESTS

## Exercise 3 - Adjective Modifiers

Now let's find out how customers who left these AirBnB reviews describe the "neighborhood" of their rental.  

You will be creating a list of adjective modifiers for the noun "neighborhood" and saving this list as `adj_modifiers`.  To do this, you will want to cycle through all the documents in `df.spacy_doc` and cycle through all of the tokens in each doc to look for the LOWERCASE token text "neighborhood".  If the token text is "neighborhood" you will look for adjective modifier children and collect the LOWERCASE text of these in your `adj_modifiers` list.

Then extract the top 10 adjectives and their count in a list of tuples, the same structure as Exercise 2.  (Again, a Python counter works best.) Save this in a variable called `top_adj_mod`.

Do the adjective modifiers you collected about "neighborhood" seem to make sense?
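
Note that Exercise 2 filtered on a part-of-speech tag (`token.pos_`), whereas an adjective modifier is a dependency relation between a child token and its head (`token.dep_`). If a label is unfamiliar, `spacy.explain` returns a short description of it. The sketch below looks up the two labels relevant here; the variable names are illustrative only.

```python
import spacy

# spacy.explain looks up a short human-readable description for a tag or label.
pos_meaning = spacy.explain("ADJ")   # part-of-speech tag checked in Exercise 2
dep_meaning = spacy.explain("amod")  # dependency label for adjective modifiers

print(pos_meaning, "|", dep_meaning)
```

A token can be tagged `ADJ` without modifying "neighborhood", so checking the dependency label on the noun's children is the more targeted test for this exercise.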

In [18]:
noun_str = 'neighborhood'
adj_modifiers = []
top_adj_mod = []

### BEGIN SOLUTION

for doc in df.spacy_doc:
    for token in doc:
        if token.text.lower() == noun_str:
            for child in token.children:
                if child.dep_ == 'amod':
                    adj_modifiers.append(child.text.lower())

top_adj_mod = Counter(adj_modifiers).most_common(10)

### END SOLUTION

adj_modifiers[:15]

['nice',
 'nice',
 'quiet',
 'quiet',
 'nice',
 'mature',
 'quiet',
 'quiet',
 'close',
 'lovely',
 'safe',
 'nice',
 'quiet',
 'full',
 'quiet']

In [19]:
top_adj_mod

[('quiet', 162),
 ('great', 102),
 ('nice', 83),
 ('safe', 35),
 ('beautiful', 32),
 ('close', 23),
 ('residential', 18),
 ('walkable', 18),
 ('cool', 17),
 ('cute', 16)]

In [20]:
### CHECK YOUR OUTPUT WITH THE ANSWER
assert type(adj_modifiers) == list, "Be sure that adj_modifiers is a Python list."
assert type(adj_modifiers[0]) == str, "Be sure that the elements of adj_modifiers are Python strings."
for adj in adj_modifiers[:5]:
    assert adj == adj.lower(), "Be sure to make all of your adjectives lowercase."
assert len(adj_modifiers) > 500, "You should find over 500 adjective modifiers for 'neighborhood' in these AirBnB reviews."

assert type(top_adj_mod) == list, "Be sure that top_adj_mod is a Python list."
assert type(top_adj_mod[0]) == tuple, "The elements of top_adj_mod should be tuples.  Have you looked for the most common adjective modifiers?"
assert len(top_adj_mod) == 10, "Be sure that top_adj_mod contains the 10 most popular adjective modifiers for 'neighborhood'."
assert top_adj_mod[0][1] > 100, "You should find that the top adjective modifier was used over 100 times to describe 'neighborhood'."

In [21]:
### BEGIN HIDDEN TESTS
# These tests assume you have created the spaCy docs correctly
test_adj_modifiers = []

for doc in df.spacy_doc:
    for token in doc:
        if token.text.lower() == noun_str:
            for child in token.children:
                if child.dep_ == 'amod':
                    test_adj_modifiers.append(child.text.lower())
assert adj_modifiers == test_adj_modifiers

assert top_adj_mod == Counter(test_adj_modifiers).most_common(10)
### END HIDDEN TESTS