# Basic Text Processing with Spacy

In this exercise, you'll use SpaCy to generate some basic statistics from Yelp reviews.

You're a consultant for a restaurant looking to get insight into the quality of their food. You have an idea to use Yelp reviews to use customer ratings to measure the quality of specific dishes. Your hypothesis is that a customer's rating and the menu items mentioned in the review will be correlated. Items that consistently appear in reviews with low ratings are likely subpar. Using this analysis, you can provide feedback to the owner and head cook.

The goal then is to extract menu items from the review text and find basic statistics on the ratings. For example, you can count how many times specific dishes appear in the reviews.

First you'll load in Pandas and SpaCy, then load the data from a JSON file.

In [10]:
import pandas as pd
import spacy

In [6]:
# Load in the data from JSON file
data = pd.read_json('../input/restaurant.json')
data.head()

Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
109,lDJIaF4eYRF4F7g6Zb9euw,lb0QUR5bc4O-Am4hNq9ZGg,r5PLDU-4mSbde5XekTXSCA,4,2,0,0,I used to work food service and my manager at ...,2013-01-27 17:54:54
1013,vvIzf3pr8lTqE_AOsxmgaA,MAmijW4ooUzujkufYYLMeQ,r5PLDU-4mSbde5XekTXSCA,4,0,0,0,We have been trying Eggplant sandwiches all ov...,2015-04-15 04:50:56
1204,UF-JqzMczZ8vvp_4tPK3bQ,slfi6gf_qEYTXy90Sw93sg,r5PLDU-4mSbde5XekTXSCA,5,1,0,0,Amazing Steak and Cheese... Better than any Ph...,2011-03-20 00:57:45
1251,geUJGrKhXynxDC2uvERsLw,N_-UepOzAsuDQwOUtfRFGw,r5PLDU-4mSbde5XekTXSCA,1,0,0,0,Although I have been going to DeFalco's for ye...,2018-07-17 01:48:23
1354,aPctXPeZW3kDq36TRm-CqA,139hD7gkZVzSvSzDPwhNNw,r5PLDU-4mSbde5XekTXSCA,2,0,0,0,"Highs: Ambience, value, pizza and deserts. Thi...",2018-01-21 10:52:58


I've provided a list with the menu items and common misspellings for a few of the items. This could be improved, but it will be good for this exercise.

In [29]:
menu = ["Cheese Steak", "Cheesesteak", "Steak and Cheese", "Italian Combo", "Tiramisu", "Cannoli",
        "Chicken Salad", "Chicken Spinach Salad", "Meatball", "Pizza", "Pizzas",
        "Bruchetta", "Eggplant", "Italian Beef", "Purista", "Pasta",
        "Italian Sausage", "Chicken Cutlet", "Chicken Parm", "Chicken Parmesan", "Calzone", 
        "Chicken Pesto", "Turkey Sandwich", "Turkey Breast", "Ziti", "Portobello", "Reuben",
        "Mozzarella Caprese", "Calzones", "Corned Beef", "Garlic Bread", "Spaghetti",
        "Tuna Salad", "Lasagna", "Artichoke Salad", "Fettuccini Alfredo", "Chicken Parmigiana",
        "Grilled Veggie", "Grilled Veggies", "Grilled Vegetable", "Mac and Cheese", "Macaroni",  
        "Pastrami", "Roast Beef", "Prosciutto", "Salami", "Gnocchi"]

In [22]:
nlp = spacy.load('en_core_web_sm')

### Exercise: Match One Review

In [30]:
from spacy.matcher import PhraseMatcher

doc = nlp(data.iloc[4].text)

matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
patterns = [nlp(item) for item in menu]
matcher.add("MENU", None, *patterns)
matches = matcher(doc)

for match in matches:
    print(f"At position {match[1]}: {doc[match[1]:match[2]]}")

print("\n")
print(doc)

At position 6: pizza
At position 51: Pizza
At position 70: Cannoli
At position 98: Pasta


Highs: Ambience, value, pizza and deserts. This is a genuine Italian grocery and I might shop here during the day when the crowds are smaller. The restaurant meals are cheap for what you get and word is out among the ASU students and young families. Pizza crust was chewy, crispy with just the right amount of char, like a good bread. Cannoli was outstanding, obviously had been freshly filled, as shell was perfectly crunchy and the sweet ricotta center had a nice clean flavor. Lows: Pasta and disorganization.  It was a packed Saturday night and they weren't prepared. The wine glasses and forks ran out. They had a weird ordering system where there were two cash registers, one halfway through the line. When one of the staff would become available they would open the halfway register, which meant that someone who was lucky enough to be there at that moment would get to order early, in effect cutting i

### Exercise: Matching on the whole dataset

Now run this matcher over the whole dataset and collect ratings for each menu item.

In [59]:
from collections import defaultdict

matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
patterns = [nlp(item) for item in menu]
matcher.add("MENU", None, *patterns)

item_ratings = defaultdict(list)
for idx, row in data.iterrows():
    doc = nlp(row.text)
    matches = matcher(doc)

    found_items = set([doc[match[1]:match[2]] for match in matches])
    
    for item in found_items:
        item_ratings[str(item).lower()].append(row.stars)

### Combine Similar Items

You have some items like Steak and Cheese, Cheesesteak, and Cheese Steak that all refer to the same item, but are counted separately. Because language is messy. Before doing analysis, you should combine these items and otherwise clean up the extracted data.

In [60]:
similar_items = [('cheesesteak', 'cheese steak'),
                 ('cheesesteak', 'steak and cheese'),
                 ('chicken parmigiana', 'chicken parm'),
                 ('chicken parmigiana', 'chicken parmesan'),
                 ('mac and cheese', 'macaroni'),
                 ('calzone', 'calzones')]


for (destination, source) in similar_items:
    item_ratings[destination].extend(item_ratings.pop(source))

### Exercise: Which items are the most popular?

Look at the counts

In [61]:
counts = {item: len(ratings) for item, ratings in item_ratings.items()}

In [75]:
item_counts = sorted(counts, key=mean_ratings.get, reverse=True)

In [76]:
item_counts

['artichoke salad',
 'fettuccini alfredo',
 'turkey breast',
 'corned beef',
 'reuben',
 'pastrami',
 'chicken salad',
 'purista',
 'prosciutto',
 'chicken pesto',
 'chicken spinach salad',
 'grilled veggie',
 'gnocchi',
 'lasagna',
 'cheesesteak',
 'pizzas',
 'pasta',
 'mac and cheese',
 'calzone',
 'cannoli',
 'pizza',
 'tiramisu',
 'ziti',
 'chicken parmigiana',
 'salami',
 'italian sausage',
 'roast beef',
 'portobello',
 'meatball',
 'garlic bread',
 'italian beef',
 'tuna salad',
 'eggplant',
 'italian combo',
 'spaghetti',
 'turkey sandwich',
 'chicken cutlet']

### Exercise: Which items are the best reviewed?

Look at the means

In [63]:
mean_ratings = {item: sum(ratings)/len(ratings) for item, ratings in item_ratings.items()}

In [64]:
best_items = sorted(mean_ratings, key=mean_ratings.get, reverse=True)

#### Thought Question: Are counts important here?

Finally, print out the 10 best and 10 worst items. Print the item name, the average rating, and the count. It's important to consider the number of ratings for a specific item when using the mean to make decisions. Why is this?

Answer: The less data we have for any specific item, the less we can trust that the average rating is the "real" sentiment of the customers. This is fairly common sense. If more people tell you the same thing, you're more likely to believe it. It's also mathematically sound. As the number of data points increases, the error on the mean decreases as $1 / \sqrt{n}$.

In [73]:
for item in best_items[:10]:
    print(f"{item:20} Average rating: {mean_ratings[item]:.3f} \tcount: {counts[item]}")

artichoke salad      Average rating: 5.000 	count: 5
fettuccini alfredo   Average rating: 5.000 	count: 6
turkey breast        Average rating: 5.000 	count: 1
corned beef          Average rating: 5.000 	count: 2
reuben               Average rating: 4.800 	count: 5
pastrami             Average rating: 4.688 	count: 16
chicken salad        Average rating: 4.667 	count: 6
purista              Average rating: 4.642 	count: 67
prosciutto           Average rating: 4.619 	count: 63
chicken pesto        Average rating: 4.567 	count: 30


In [74]:
for item in best_items[:-10:-1]:
    print(f"{item:20} Average rating: {mean_ratings[item]:.3f} \tcount: {counts[item]}")

chicken cutlet       Average rating: 3.545 	count: 11
turkey sandwich      Average rating: 3.800 	count: 5
spaghetti            Average rating: 3.854 	count: 41
italian combo        Average rating: 3.909 	count: 22
eggplant             Average rating: 3.968 	count: 95
tuna salad           Average rating: 4.000 	count: 5
italian beef         Average rating: 4.000 	count: 29
garlic bread         Average rating: 4.022 	count: 46
meatball             Average rating: 4.080 	count: 163


### Next Up!

In the next tutorial you'll learn how to create a text classification model with SpaCy.