# Basic Text Processing with Spacy

In this exercise, you'll use SpaCy to generate some basic statistics from Yelp reviews. You'll be looking at reviews for specific dishes from an Italian deli, [DelFalco's in Scottsdale, Arizona](https://defalcosdeli.com/index.html). For example, they have meatball subs!

![meatball sub](https://upload.wikimedia.org/wikipedia/commons/0/0a/Meatballs_sandwich10000000041678_000334_%2815638892980%29.jpg)

You're a consultant for the restaurant looking to get insight into the quality of their food. You have an idea to use customer ratings from Yelp reviews to measure the quality of specific dishes. Assuming that a customer's rating and the menu items mentioned in the review are correlated, items that consistently appear in reviews with low ratings are likely subpar. Using this analysis, you can provide feedback to the owner.

The goal then is to extract menu items from the review text and find basic statistics on the ratings. For example, you can count how many times specific dishes appear in the reviews.

First you'll load in Pandas and SpaCy, then load the data from a JSON file.

In [111]:
import pandas as pd

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.nlp.ex2 import *
print("\nSetup complete")


Setup complete


In [6]:
# Load in the data from JSON file
data = pd.read_json('../input/restaurant.json')
data.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
109,lDJIaF4eYRF4F7g6Zb9euw,lb0QUR5bc4O-Am4hNq9ZGg,r5PLDU-4mSbde5XekTXSCA,4,2,0,0,I used to work food service and my manager at ...,2013-01-27 17:54:54
1013,vvIzf3pr8lTqE_AOsxmgaA,MAmijW4ooUzujkufYYLMeQ,r5PLDU-4mSbde5XekTXSCA,4,0,0,0,We have been trying Eggplant sandwiches all ov...,2015-04-15 04:50:56
1204,UF-JqzMczZ8vvp_4tPK3bQ,slfi6gf_qEYTXy90Sw93sg,r5PLDU-4mSbde5XekTXSCA,5,1,0,0,Amazing Steak and Cheese... Better than any Ph...,2011-03-20 00:57:45
1251,geUJGrKhXynxDC2uvERsLw,N_-UepOzAsuDQwOUtfRFGw,r5PLDU-4mSbde5XekTXSCA,1,0,0,0,Although I have been going to DeFalco's for ye...,2018-07-17 01:48:23
1354,aPctXPeZW3kDq36TRm-CqA,139hD7gkZVzSvSzDPwhNNw,r5PLDU-4mSbde5XekTXSCA,2,0,0,0,"Highs: Ambience, value, pizza and deserts. Thi...",2018-01-21 10:52:58


I've provided a list with the menu items and common alternate spellings. This could be improved, but it will be good for this exercise.

In [77]:
menu = ["Cheese Steak", "Cheesesteak", "Steak and Cheese", "Italian Combo", "Tiramisu", "Cannoli",
        "Chicken Salad", "Chicken Spinach Salad", "Meatball", "Pizza", "Pizzas", "Spaghetti",
        "Bruchetta", "Eggplant", "Italian Beef", "Purista", "Pasta", "Calzones",  "Calzone",
        "Italian Sausage", "Chicken Cutlet", "Chicken Parm", "Chicken Parmesan", "Gnocchi",
        "Chicken Pesto", "Turkey Sandwich", "Turkey Breast", "Ziti", "Portobello", "Reuben",
        "Mozzarella Caprese",  "Corned Beef", "Garlic Bread", "Pastrami", "Roast Beef",
        "Tuna Salad", "Lasagna", "Artichoke Salad", "Fettuccini Alfredo", "Chicken Parmigiana",
        "Grilled Veggie", "Grilled Veggies", "Grilled Vegetable", "Mac and Cheese", "Macaroni",  
         "Prosciutto", "Salami"]

### Exercise: Find items in one review

First up, you'll use SpaCy to find menu items in a single review. For this you can use `PhraseMatcher` which matches based on phrase patterns. Comparatively `Matcher` matches on tokens, individual words. However, some of the menu items are phrases, so you can't match on individual tokens only. Note that while the menu items are in title case, review authors will often write the food items in a variety of cases. You'll need to tell the `PhraseMatcher` to perform case-insensitive matching with the `attr` keyword argument.

Using the `nlp` model, create a list of phrase docs from the `menu` list. Add the patterns to `PhraseMatcher` with the key `"MENU"`. Then use the `PhraseMatcher` to find matches in `doc`, an example 

In [112]:
import spacy
from spacy.matcher import PhraseMatcher

# Load the SpaCy model
nlp = spacy.load('en_core_web_sm')
# Create the doc object
review_doc = nlp(data.iloc[4].text)

# Create the PhraseMatcher object, be sure to match on lowercase text
matcher = ____

# Create a list of docs for each item in the menu
patterns = ____

# Add the item patterns to the matcher
____

# Find matches in the review_doc
matches = ____

In [113]:
# After implementing the above cell, run this to print out the matches
# Otherwise you'll can an error
for match in matches:
    print(f"At position {match[1]}: {doc[match[1]:match[2]]}")

TypeError: 'PlaceholderValue' object is not iterable

In [None]:
#q1.hint()
#q1.solution()

In [107]:
#%%RM_IF(PROD)%%

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')
review_doc = nlp(data.iloc[4].text)

matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
patterns = [nlp(item) for item in menu]
matcher.add("MENU", None, *patterns)
matches = matcher(review_doc)

for match in matches:
    print(f"At position {match[1]}: {doc[match[1]:match[2]]}")
    
# Uncomment when checking code is complete
#q1.assert_check_passed()

At position 6: pizza
At position 51: Pizza
At position 70: Cannoli
At position 98: Pasta


### Exercise: Matching on the whole dataset

Now run this matcher over the whole dataset and collect ratings for each menu item. Each review has a rating, `data.stars`. For each item that appears in the review text, append the review's rating to a list of ratings for that item. The lists are kept in a dictionary `item_ratings`.

In [None]:
from collections import defaultdict

# item_ratings is a dictionary of lists. If a key doesn't exist in item_ratings,
# the key is added with an empty list.
item_ratings = defaultdict(list)

for idx, review in data.iterrows():
    doc = ____
    matches = ____
    
    # Create a set of the items found in the review text
    # Transform the item strings to lowercase to make it case insensitive
    found_items = ____
    
    # Update item_ratings with rating for each item in found_items
    ____

In [120]:
#%%RM_IF(PROD)%%

from collections import defaultdict

item_ratings = defaultdict(list)

for idx, review in data.iterrows():
    doc = nlp(review.text)
    matches = matcher(doc)

    found_items = set([doc[match[1]:match[2]] for match in matches])
    
    for item in found_items:
        item_ratings[str(item).lower()].append(review.stars)

### Combine Similar Items

You have some items like Steak and Cheese, Cheesesteak, and Cheese Steak that all refer to the same item, but are counted separately. Because language is messy. Before doing analysis, you should combine these items.

In [60]:
similar_items = [('cheesesteak', 'cheese steak'),
                 ('cheesesteak', 'steak and cheese'),
                 ('chicken parmigiana', 'chicken parm'),
                 ('chicken parmigiana', 'chicken parmesan'),
                 ('mac and cheese', 'macaroni'),
                 ('calzone', 'calzones')]

for (destination, source) in similar_items:
    item_ratings[destination].extend(item_ratings.pop(source))

### Exercise: Which items are the best reviewed?

Using these item ratings, find the mean ratings for each item. Then sort the ratings to find the best 

In [116]:
mean_ratings = {item: sum(ratings)/len(ratings) for item, ratings in item_ratings.items()}
best_items = sorted(mean_ratings, key=mean_ratings.get, reverse=True)

In [117]:
for item in best_items:
    print(f"{item:>25}{mean_ratings[item]:>5}")

       chicken parmigiana  4.0
                 eggplant  4.0
                    pizza  4.0
         steak and cheese  4.0
                 meatball  4.0
                    pasta  4.0
                  cannoli  4.0
               prosciutto  4.0
                  purista  4.0
             cheese steak  4.0
              cheesesteak  4.0
                  calzone  4.0
    chicken spinach salad  4.0
                 tiramisu  4.0
            italian combo  4.0
             italian beef  4.0
                   salami  4.0
             chicken parm  4.0
            chicken pesto  4.0
          turkey sandwich  4.0
           chicken cutlet  4.0
               tuna salad  4.0
                     ziti  4.0
          artichoke salad  4.0
                  lasagna  4.0
       fettuccini alfredo  4.0
                   pizzas  4.0
            turkey breast  4.0
                 calzones  4.0
           mac and cheese  4.0
           grilled veggie  4.0
             garlic bread  4.0
        

### Which items are the most popular?

Similar to the mean ratings, you can calculate the number of reviews for each item.

In [101]:
counts = {item: len(ratings) for item, ratings in item_ratings.items()}

In [102]:
item_counts = sorted(counts, key=counts.get, reverse=True)
for item in item_counts:
    print(f"{item:>25}{counts[item]:>5}")

                    pizza  358
                    pasta  255
              cheesesteak  243
                 meatball  163
                  calzone  148
       chicken parmigiana   97
                 eggplant   95
                  cannoli   89
                  lasagna   83
                  purista   67
               prosciutto   63
          italian sausage   57
             garlic bread   46
                  gnocchi   45
                spaghetti   41
                   pizzas   33
                   salami   32
            chicken pesto   30
             italian beef   29
                 tiramisu   27
                     ziti   26
           mac and cheese   24
            italian combo   22
               portobello   18
                 pastrami   16
           chicken cutlet   11
               roast beef    7
       fettuccini alfredo    6
           grilled veggie    6
            chicken salad    6
               tuna salad    5
          turkey sandwich    5
        

### Thought Question: Are counts important here?

Finally, print out the 10 best and 10 worst items. Print the item name, the average rating, and the count. It's important to consider the number of ratings for a specific item when using the mean to make decisions. Why is this?

Answer: The less data we have for any specific item, the less we can trust that the average rating is the "real" sentiment of the customers. This is fairly common sense. If more people tell you the same thing, you're more likely to believe it. It's also mathematically sound. As the number of data points increases, the error on the mean decreases as $1 / \sqrt{n}$.

In [73]:
for item in best_items[:10]:
    print(f"{item:20} Average rating: {mean_ratings[item]:.3f} \tcount: {counts[item]}")

artichoke salad      Average rating: 5.000 	count: 5
fettuccini alfredo   Average rating: 5.000 	count: 6
turkey breast        Average rating: 5.000 	count: 1
corned beef          Average rating: 5.000 	count: 2
reuben               Average rating: 4.800 	count: 5
pastrami             Average rating: 4.688 	count: 16
chicken salad        Average rating: 4.667 	count: 6
purista              Average rating: 4.642 	count: 67
prosciutto           Average rating: 4.619 	count: 63
chicken pesto        Average rating: 4.567 	count: 30


In [74]:
for item in best_items[:-10:-1]:
    print(f"{item:20} Average rating: {mean_ratings[item]:.3f} \tcount: {counts[item]}")

chicken cutlet       Average rating: 3.545 	count: 11
turkey sandwich      Average rating: 3.800 	count: 5
spaghetti            Average rating: 3.854 	count: 41
italian combo        Average rating: 3.909 	count: 22
eggplant             Average rating: 3.968 	count: 95
tuna salad           Average rating: 4.000 	count: 5
italian beef         Average rating: 4.000 	count: 29
garlic bread         Average rating: 4.022 	count: 46
meatball             Average rating: 4.080 	count: 163


### Next Up!

In the next tutorial you'll learn how to create a text classification model with SpaCy.