# Basic Text Processing with spaCy
    
You're a consultant for [DeFalco's Italian Restaurant](https://defalcosdeli.com/index.html).

<img src="https://upload.wikimedia.org/wikipedia/commons/0/0a/Meatballs_sandwich10000000041678_000334_%2815638892980%29.jpg" alt="meatball sub">

The owner asked you to identify whether there are any foods on their menu that diners find disappointing. Before getting started, run the following cell to set up code checking for this coding exercise.

In [1]:
import pandas as pd

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex1 import *
print('Setup Complete')

Setup Complete


The business owner suggested you use diner reviews from the Yelp website to determine which dishes people liked and disliked. You pulled the data from Yelp. Before you get to analysis, run the code cell below for a quick look at the data you have to work with.

In [2]:
# Load in the data from JSON file
data = pd.read_json('../input/nlp-course/restaurant.json')
data.head()

| Unnamed: 0 | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 100086 | h8l10hiVsn4DlAv3_LlsQw | UFE_5r4ewNK3jA2pRPn2ww | r5PLDU-4mSbde5XekTXSCA | 5 | 0 | 0 | 0 | Wow! Really good Italian food with a grocery s... | 2018-07-08 22:42:32 |
| 1013 | vvIzf3pr8lTqE_AOsxmgaA | MAmijW4ooUzujkufYYLMeQ | r5PLDU-4mSbde5XekTXSCA | 4 | 0 | 0 | 0 | We have been trying Eggplant sandwiches all ov... | 2015-04-15 04:50:56 |
| 101755 | 27t2Z9QXd6Pm9lKfEG_LzQ | kjeX2RXvW7RhBbD2QLd5jA | r5PLDU-4mSbde5XekTXSCA | 5 | 7 | 8 | 6 | (Lyrics - Falco - Rock Me Amadeus)\n\nOoh, Roc... | 2014-01-20 16:31:22 |
| 102164 | ZB4Okiod5Yxaxx5UEEvdWg | fDBybzZAL5UDscd33HCXyA | r5PLDU-4mSbde5XekTXSCA | 4 | 3 | 4 | 2 | The omnipresent crowds here speak volumes abou... | 2015-12-06 23:21:51 |
| 102256 | BbJWlQRUPGFxnlZbgpdrLA | 62JJoUPxKxqb6snMJxi2ng | r5PLDU-4mSbde5XekTXSCA | 3 | 0 | 0 | 0 | I had a bruschetta open face sandwich basicall... | 2017-12-18 22:42:51 |


The owner also gave you this list of menu items and common alternate spellings.

In [3]:
menu = ["Cheese Steak", "Cheesesteak", "Steak and Cheese", "Italian Combo", "Tiramisu", "Cannoli",
        "Chicken Salad", "Chicken Spinach Salad", "Meatball", "Pizza", "Pizzas", "Spaghetti",
        "Bruchetta", "Eggplant", "Italian Beef", "Purista", "Pasta", "Calzones",  "Calzone",
        "Italian Sausage", "Chicken Cutlet", "Chicken Parm", "Chicken Parmesan", "Gnocchi",
        "Chicken Pesto", "Turkey Sandwich", "Turkey Breast", "Ziti", "Portobello", "Reuben",
        "Mozzarella Caprese",  "Corned Beef", "Garlic Bread", "Pastrami", "Roast Beef",
        "Tuna Salad", "Lasagna", "Artichoke Salad", "Fettuccini Alfredo", "Chicken Parmigiana",
        "Grilled Veggie", "Grilled Veggies", "Grilled Vegetable", "Mac and Cheese", "Macaroni",  
         "Prosciutto", "Salami"]

# Step 1: Plan Your Analysis

Given the data from Yelp and the list of menu items, do you have any ideas for how you could find which menu items have disappointed diners?

Think about your answer. Then run the cell below to see one approach.

In [4]:
q_1.solution()

<span style="color:#33cc99">Solution:</span> You could group reviews by the menu items they mention, and then calculate the average rating
    for the reviews that mention each item. That shows which foods are mentioned in reviews with low scores,
    so the restaurant can fix the recipe or remove those foods from the menu.
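
As a rough illustration of that plan, the sketch below groups a few made-up reviews by the items they mention and averages the stars for each item. The reviews, the two-item menu, and the naive substring check are all assumptions for illustration; the exercise itself uses spaCy's `PhraseMatcher` on the real Yelp data.

```python
from collections import defaultdict

# Hypothetical toy data: (review text, star rating) pairs, not real Yelp reviews
toy_reviews = [
    ("The pizza was great and the cannoli was even better", 5),
    ("Soggy pizza, would not order again", 2),
    ("Best cannoli in town", 4),
]
toy_menu = ["pizza", "cannoli"]

# Collect the star ratings of every review that mentions each item
ratings_by_item = defaultdict(list)
for text, stars in toy_reviews:
    lowered = text.lower()
    for item in toy_menu:
        if item in lowered:  # naive substring check, a stand-in for real phrase matching
            ratings_by_item[item].append(stars)

# Average the collected ratings per item
average_by_item = {item: sum(r) / len(r) for item, r in ratings_by_item.items()}
print(average_by_item)
```

An item with a low average in this kind of table is a candidate for a closer look, though a real analysis would also check how many reviews back each average.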

# Step 2: Find items in one review

You'll pursue this plan of calculating average scores of the reviews mentioning each menu item.

As a first step, you'll write code to extract the foods mentioned in a single review.

You've previously seen `Matcher`, which looks for a single token in a document. But many menu items are multiple tokens long, so you'll use `PhraseMatcher`, which can match a series of tokens.
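
To make the idea of matching a series of tokens concrete, here is a minimal pure-Python sketch. It is not how spaCy implements `PhraseMatcher`; the helper `find_phrases` and the whitespace tokenization are assumptions made for illustration. Like `PhraseMatcher` with `attr='LOWER'`, it compares lowercased tokens and reports each match as a `(start, end)` token span.

```python
def find_phrases(tokens, phrases):
    """Return sorted (start, end) token spans where any phrase occurs, ignoring case."""
    lowered = [tok.lower() for tok in tokens]
    spans = []
    for phrase in phrases:
        target = phrase.lower().split()
        width = len(target)
        # Slide a window of the phrase's length across the token list
        for start in range(len(lowered) - width + 1):
            if lowered[start:start + width] == target:
                spans.append((start, start + width))
    return sorted(spans)

# Hypothetical review text, split on whitespace instead of using a real tokenizer
tokens = "I had the Cheese Steak and a cannoli".split()
spans = find_phrases(tokens, ["Cheese Steak", "Cannoli", "Steak and Cheese"])
for start, end in spans:
    print(start, end, " ".join(tokens[start:end]))
```

Slicing the token list with `tokens[start:end]` recovers the matched phrase, which mirrors how a span is taken from a spaCy document.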

Fill in the `____` values below to get a list of the menu items mentioned in a single review.

In [5]:
import spacy
from spacy.matcher import PhraseMatcher

index_of_review_to_test_on = 14
text_to_test_on = data.text.iloc[index_of_review_to_test_on]

# Load the spaCy model
tokenizer = spacy.load('en')

# Create the tokenized version of text_to_test_on
review_doc = ____

# Create the PhraseMatcher object. The tokenizer's vocabulary is the first argument. Use attr='LOWER' for case-insensitive matching
matcher = PhraseMatcher(tokenizer.vocab, attr='LOWER')

# Create a list of tokenized docs, one for each item in the menu
menu_tokens_list = [____ for item in menu]

# Add the item patterns to the matcher. 
# Look at https://spacy.io/api/phrasematcher#add in the docs for help with this step
# Then uncomment the lines below

#matcher.add("MENU",            # Just a name for the set of rules we're matching to
#            None,              # Special actions to take on matched words
#            ____  
#           )

# Find matches in the review_doc
# matches = ____

q_2.check()

<span style="color:#ccaa33">Check:</span> When you've updated the starter code, `check()` will tell you whether your code is correct. Remember, you must create the following variable: `matches`

In [6]:
# Uncomment if you need some guidance
# q_2.hint()
# q_2.solution()

After implementing the above cell, uncomment the following cell to print the matches.

In [7]:
# for match in matches:
#    print(f"Token number {match[1]}: {review_doc[match[1]:match[2]]}")

In [8]:
#%%RM_IF(PROD)%%
import spacy
from spacy.matcher import PhraseMatcher

# Load the spaCy model
tokenizer = spacy.load('en')

index_of_review_to_test_on = 14
text_to_test_on = data.text.iloc[index_of_review_to_test_on]

# Create the tokenized review_doc
review_doc = tokenizer(text_to_test_on)

# Create the PhraseMatcher object. The tokenizer's vocabulary is the first argument.
# Reviews don't have consistent capitalization, so match case-insensitively with attr='LOWER'
matcher = PhraseMatcher(tokenizer.vocab, attr='LOWER')

# Create a list of docs for each item in the menu
menu_tokens_list = [tokenizer(item) for item in menu]

# Add the item patterns to the matcher
matcher.add("MENU",            # Just a name for the set of rules we're matching to
            None,
            *menu_tokens_list  # Add the patterns to match to. In this case the menu_tokens_list
           )

# Find matches in the review_doc
matches = matcher(review_doc)

# Uncomment when checking code is complete
q_2.assert_check_passed()

<span style="color:#33cc33">Correct</span>

# Step 3: Matching on the whole dataset

Now run this matcher over the whole dataset and collect ratings for each menu item. Each review has a rating, `review.stars`. For each item that appears in the review text (`review.text`), append the review's rating to a list of ratings for that item. The lists are kept in a dictionary `item_ratings`.

To get the matched phrases, you can reference the PhraseMatcher documentation for the structure of each match object:

>A list of `(match_id, start, end)` tuples, describing the matches. A match tuple describes a span `doc[start:end]`. The `match_id` is the ID of the added match pattern.
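
The sketch below shows how such tuples turn into item names and ratings. The token list stands in for a spaCy `Doc`, and the match tuples, the `match_id` value, and the star rating are made up for illustration. Because the phrases are collected as lowercased strings in a set, a review that mentions an item twice contributes its rating only once in this sketch.

```python
from collections import defaultdict

# Stand-in for a tokenized review; slicing a list here plays the role of doc[start:end]
tokens = ["Great", "pizza", ",", "and", "the", "Garlic", "Bread",
          "beats", "the", "pizza", "elsewhere"]
stars = 4  # hypothetical rating for this review

# Hypothetical (match_id, start, end) tuples in the format described above
matches = [(101, 1, 2), (101, 5, 7), (101, 9, 10)]

# Slice out each matched span and lowercase it; the set drops the repeated "pizza"
found_items = {" ".join(tokens[start:end]).lower() for _, start, end in matches}

toy_ratings = defaultdict(list)  # missing keys start out as empty lists
for item in found_items:
    toy_ratings[item].append(stars)

print(dict(toy_ratings))
```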

In [9]:
from collections import defaultdict

# item_ratings is a dictionary of lists. If a key doesn't exist in item_ratings,
# the key is added with an empty list as the value.
item_ratings = defaultdict(list)

for idx, review in data.iterrows():
    doc = ____
    # Using the matcher from the previous exercise
    matches = ____
    
    # Create a set of the items found in the review text
    found_items = ____
    
    # Update item_ratings with rating for each item in found_items
    # Transform the item strings to lowercase to make it case insensitive
    ____

q_3.check()

<span style="color:#cc3333">Incorrect:</span> Please add items to item_ratings. You should have 43 items.

In [10]:
# Uncomment if you need some guidance
#q_3.hint()
#q_3.solution()

In [11]:
#%%RM_IF(PROD)%%

from collections import defaultdict

item_ratings = defaultdict(list)

for idx, review in data.iterrows():
    doc = tokenizer(review.text)
    matches = matcher(doc)

    found_items = set([doc[match[1]:match[2]] for match in matches])
    
    for item in found_items:
        item_ratings[str(item).lower()].append(review.stars)
        
q_3.assert_check_passed()

<span style="color:#33cc33">Correct</span>

# Step 4: What's the worst reviewed item?

Using these item ratings, find the menu item with the worst average rating.
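
One way to approach this, sketched on made-up ratings rather than the real `item_ratings`: average each list, then take the key with the smallest average. The dictionary `toy_ratings` and its values are hypothetical.

```python
# Hypothetical ratings per item, shaped like item_ratings
toy_ratings = {
    "pizza": [5, 4, 4],
    "ziti": [3, 2],
    "tiramisu": [5, 5, 4, 4],
}

# Mean rating for each item
toy_means = {item: sum(stars) / len(stars) for item, stars in toy_ratings.items()}

# min() with key=toy_means.get returns the item whose mean is smallest
toy_worst = min(toy_means, key=toy_means.get)
print(toy_worst, toy_means[toy_worst])
```

Sorting the keys by their mean and taking the first element would give the same item; `min` simply avoids building the full sorted list.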

In [12]:
# Calculate the mean ratings for each menu item as a dictionary
mean_ratings = ____

# Find the worst item, and store it as a string in worst_item. This can be multiple lines of code if you want.
worst_item = ____

q_4.check()

<span style="color:#ccaa33">Check:</span> When you've updated the starter code, `check()` will tell you whether your code is correct. You need to update the code that creates variable `worst_item`

In [13]:
# Uncomment if you need some guidance
# q_4.hint()
# q_4.solution()

In [14]:
# After implementing the above cell, uncomment and run this to print
# out the worst item and its mean rating. Otherwise you'll get an error.

# print(f"{worst_item:>25}{mean_ratings[worst_item]:>10.3f}")

In [15]:
#%%RM_IF(PROD)%%

mean_ratings = {item: sum(ratings)/len(ratings) for item, ratings in item_ratings.items()}
worst_item = sorted(mean_ratings, key=mean_ratings.get)[0]
    
q_4.assert_check_passed()

<span style="color:#33cc33">Correct</span>

In [16]:
worst_item

'chicken cutlet'

# Step 5: Are counts important here?

Similar to the mean ratings, you can calculate the number of reviews for each item.

In [17]:
counts = {item: len(ratings) for item, ratings in item_ratings.items()}

item_counts = sorted(counts, key=counts.get, reverse=True)
for item in item_counts:
    print(f"{item:>25}{counts[item]:>5}")

                    pizza  358
                    pasta  255
                 meatball  163
              cheesesteak  146
                  calzone  110
                 eggplant   95
                  cannoli   89
             cheese steak   88
                  lasagna   83
                  purista   67
               prosciutto   63
             chicken parm   58
          italian sausage   57
             garlic bread   46
                  gnocchi   45
                spaghetti   41
                 calzones   38
                   pizzas   33
                   salami   32
            chicken pesto   30
             italian beef   29
                 tiramisu   27
                     ziti   26
            italian combo   22
         chicken parmesan   21
       chicken parmigiana   18
           mac and cheese   18
               portobello   18
                 pastrami   16
           chicken cutlet   11
         steak and cheese    9
               roast beef    7
        

Here is code to print the 10 best and 10 worst rated items. Look at the results, and decide whether you think it's important to consider the number of reviews when interpreting scores of which items are best and worst.

In [18]:
sorted_ratings = sorted(mean_ratings, key=mean_ratings.get)

print("Worst rated menu items:")
for item in sorted_ratings[:10]:
    print(f"{item:20} Ave rating: {mean_ratings[item]:.2f} \tcount: {counts[item]}")
    
print("\n\nBest rated menu items:")
for item in sorted_ratings[-10:]:
    print(f"{item:20} Ave rating: {mean_ratings[item]:.2f} \tcount: {counts[item]}")

Worst rated menu items:
chicken cutlet       Ave rating: 3.55 	count: 11
turkey sandwich      Ave rating: 3.80 	count: 5
spaghetti            Ave rating: 3.85 	count: 41
italian combo        Ave rating: 3.91 	count: 22
eggplant             Ave rating: 3.97 	count: 95
italian beef         Ave rating: 4.00 	count: 29
tuna salad           Ave rating: 4.00 	count: 5
garlic bread         Ave rating: 4.02 	count: 46
meatball             Ave rating: 4.08 	count: 163
portobello           Ave rating: 4.11 	count: 18


Best rated menu items:
prosciutto           Ave rating: 4.62 	count: 63
purista              Ave rating: 4.64 	count: 67
chicken salad        Ave rating: 4.67 	count: 6
pastrami             Ave rating: 4.69 	count: 16
reuben               Ave rating: 4.80 	count: 5
steak and cheese     Ave rating: 4.89 	count: 9
artichoke salad      Ave rating: 5.00 	count: 5
corned beef          Ave rating: 5.00 	count: 2
fettuccini alfredo   Ave rating: 5.00 	count: 6
turkey breast        Ave ra

Uncomment the following line after you've decided your answer.

In [19]:
# q_5.solution()
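
One way to take counts into account, sketched with the five lowest means and counts printed above: ignore items with fewer than some minimum number of reviews before ranking. The threshold `min_reviews = 20` is an arbitrary assumption chosen for illustration, not a recommended value.

```python
# (mean rating, review count) copied from the "Worst rated menu items" output above
worst_five = {
    "chicken cutlet": (3.55, 11),
    "turkey sandwich": (3.80, 5),
    "spaghetti": (3.85, 41),
    "italian combo": (3.91, 22),
    "eggplant": (3.97, 95),
}

min_reviews = 20  # arbitrary cutoff chosen for illustration

# Keep only the items that have at least min_reviews reviews
well_reviewed = {
    item: mean
    for item, (mean, count) in worst_five.items()
    if count >= min_reviews
}

# Rank the remaining items from worst to best mean rating
ranked = sorted(well_reviewed, key=well_reviewed.get)
print(ranked)
```

With this cutoff, the two lowest-rated items drop out because their means rest on only 11 and 5 reviews, and a different item ends up at the bottom of the ranking.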

# Keep Going

Now you are ready to combine your NLP skills with your ML skills. **[See how it's done](#$NEXT_NOTEBOOK_URL$)**.