* Tokenize the `REVIEWS` text into meaningful words
* Combine the tokens again and print them
* Stem the tokens using the Porter and Snowball stemmers. Which one works best, and why?
* Remove stopwords
* Identify the ten most frequent unigrams and bigrams

In [4]:
REVIEWS = '''\
After a morning of Thrift Store hunting, a friend and I were thinking of lunch, and he suggested Emil's after he'd seen Chris Sebak do a bit on it and had tried it a time or two before, and I had not. He said they had a decent Reuben, but to be prepared to step back in time.

Well, seeing as how I'm kind of addicted to late 40's and early 50's, and the whole Rat Pack scene, stepping back in time is a welcomed change in da burgh...as long as it doesn't involve 1979, which I can see all around me every day.

And yet another shot at finding a decent Reuben in da burgh...well, that's like hunting the Holy Grail. So looking under one more bush certainly wouldn't hurt.

So off we go right at lunchtime in the middle of...where exactly were we? At first I thought we were lost, driving around a handful of very rather dismal looking blocks in what looked like a neighborhood that had been blighted by the building of a highway. And then...AHA! Here it is! And yep, there it was. This little unassuming building with an add-on entrance with what looked like a very old hand painted sign stating quite simply 'Emil's.

We walked in the front door, and entered another world. Another time, and another place. Oh, and any Big Burrito/Sousa foodies might as well stop reading now. I wouldn't want to see you walk in, roll your eyes and say 'Reaaaaaalllly?'

This is about as old world bar/lounge/restaurant as it gets. Plain, with a dark wood bar on one side, plain white walls with no yinzer pics, good sturdy chairs and actual white linens on the tables. This is the kind of neighborhood dive that I could see Frank and Dino pulling a few tables together for some poker, a fish sammich, and some cheap scotch. And THAT is exactly what I love.

Oh...but good food counts too.

We each had a Reuben, and my friend had a side of fries. The Reubens were decent, but not NY awesome. A little too thick on the bread, but overall, tasty and definitely filling. Not too skimpy on the meat. I seriously CRAVE a true, good NY Reuben, but since I can't afford to travel right now, what I find in da burgh will have to do. But as we sat and ate, burgers came out to an adjoining table. Those were some big thick burgers. A steak went past for the table behind us. That was HUGE! And when we asked about it, the waitress said 'Yeah, it's huge and really good, and he only charges $12.99 for it, ain't that nuts?' Another table of five came in, and wham. Fish sandwiches PILED with breaded fish that looked amazing. Yeah, I want that, that, that and THAT!

My friend also mentioned that they have a Chicken Parm special one day of the week that is only served UNTIL 4 pm, and that it is fantastic. If only I could GET there on that week day before 4...

The waitress did a good job, especially since there was quite a growing crowd at lunchtime on a Saturday, and only one of her. She kept up and was very friendly.

They only have Pepsi products, so I had a brewed iced tea, which was very fresh, and she did pop by to ask about refills as often as she could. As the lunch hour went on, they were getting busy.

Emil's is no frills, good portions, very reasonable prices, VERY comfortable neighborhood hole in the wall...kind of like Cheers, but in a blue collar neighborhood in the 1950's. Fan-freakin-tastic! I could feel at home here.

You definitely want to hit Mapquest or plug in your GPS though. I am not sure that I could find it again on my own...it really is a hidden gem. I will be making my friend take me back until I can memorize where the heck it is.

Addendum: 2nd visit for the fish sandwich. Excellent. Truly. A pound of fish on a fish-shaped bun (as opposed to da burgh's seemingly popular hamburger bun). The fish was flavorful, the batter excellent, and for just $8. This may have been the best fish sandwich I've yet to have in da burgh.
'''

In [5]:
# Import nltk and download the list of stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/minor/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [6]:
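# Normalize the text: lowercase it, then replace every character that is not
# a letter, digit, or whitespace with a space (this also splits contractions, e.g. "emil's" -> "emil s")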
import re
lower_case_text = REVIEWS.lower()
space_normalized_text = re.sub(r'[^a-zA-Z0-9\s]', ' ', lower_case_text)
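
Because the normalization above replaces apostrophes with spaces, contractions and possessives end up as separate fragments (`emil s`, `doesn t`). The sketch below contrasts that behaviour with a variant that keeps apostrophes inside words. The `normalize` helper and its `keep_apostrophes` flag are hypothetical names introduced for illustration, and `str.split` stands in for `word_tokenize` so the example needs no NLTK data.

```python
import re

def normalize(text, keep_apostrophes=False):
    """Lowercase text and strip punctuation; optionally keep word-internal apostrophes."""
    text = text.lower()
    if keep_apostrophes:
        # Drop everything except letters, digits, whitespace, and apostrophes...
        text = re.sub(r"[^a-z0-9\s']", " ", text)
        # ...then drop apostrophes that are not surrounded by letters/digits on both sides.
        text = re.sub(r"(?<![a-z0-9])'|'(?![a-z0-9])", " ", text)
    else:
        # Same idea as the notebook cell: every non-alphanumeric character becomes a space.
        text = re.sub(r"[^a-z0-9\s]", " ", text)
    return text.split()

sample = "He suggested Emil's, but it doesn't open!"
print(normalize(sample))
print(normalize(sample, keep_apostrophes=True))
```

Which variant is preferable depends on the downstream steps: split fragments such as `s` and `t` are later caught by the stopword list, whereas intact contractions would need a stopword list that contains them in that form.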

In [7]:
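# Tokenize the text (word_tokenize needs the NLTK "punkt" tokenizer data,
# or "punkt_tab" in recent NLTK releases, to be downloaded beforehand)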
from nltk.tokenize import word_tokenize
tokens = word_tokenize(space_normalized_text)

In [8]:
# Show the reconstituted text
' '.join(tokens)

'after a morning of thrift store hunting a friend and i were thinking of lunch and he suggested emil s after he d seen chris sebak do a bit on it and had tried it a time or two before and i had not he said they had a decent reuben but to be prepared to step back in time well seeing as how i m kind of addicted to late 40 s and early 50 s and the whole rat pack scene stepping back in time is a welcomed change in da burgh as long as it doesn t involve 1979 which i can see all around me every day and yet another shot at finding a decent reuben in da burgh well that s like hunting the holy grail so looking under one more bush certainly wouldn t hurt so off we go right at lunchtime in the middle of where exactly were we at first i thought we were lost driving around a handful of very rather dismal looking blocks in what looked like a neighborhood that had been blighted by the building of a highway and then aha here it is and yep there it was this little unassuming building with an add on ent

In [9]:
# Stem the tokens

import pandas as pd
from nltk.stem import PorterStemmer, SnowballStemmer

porter_stemmer = PorterStemmer()
snowball_stemmer = SnowballStemmer(language="english")

stemmed_tokens_df = pd.DataFrame({
    "token": tokens,
    "porter_stemmed_token": [porter_stemmer.stem(token) for token in tokens],
    "snowball_stemmed_token": [snowball_stemmer.stem(token) for token in tokens],
})

stemmed_tokens_df

     token    porter_stemmed_token  snowball_stemmed_token
0    after    after                 after
1    a        a                     a
2    morning  morn                  morn
3    of       of                    of
4    thrift   thrift                thrift
...  ...      ...                   ...
749  to       to                    to
750  have     have                  have
751  in       in                    in
752  da       da                    da
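
The rows shown above are identical for both stemmers, so they do not help answer which one works better. One way to approach the question is to list only the tokens on which the two stemmers disagree and inspect those by hand. The sketch below does that; `stemmer_disagreements` is a hypothetical helper, and the sample tokens are a few words taken from the review rather than the full token list. For context, NLTK's English Snowball stemmer implements the Porter2 algorithm, a later revision of the original Porter stemmer.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

def stemmer_disagreements(tokens):
    """Return (token, porter_stem, snowball_stem) for each distinct token
    on which the two stemmers produce different stems."""
    porter = PorterStemmer()
    snowball = SnowballStemmer(language="english")
    rows = []
    # dict.fromkeys de-duplicates while keeping first-seen order
    for token in dict.fromkeys(tokens):
        porter_stem = porter.stem(token)
        snowball_stem = snowball.stem(token)
        if porter_stem != snowball_stem:
            rows.append((token, porter_stem, snowball_stem))
    return rows

# Illustrative sample; in the notebook you would pass the full `tokens` list instead.
sample_tokens = ["morning", "was", "this", "definitely", "seriously", "morning"]
for row in stemmer_disagreements(sample_tokens):
    print(row)
```

Reading through the disagreements makes it easier to judge which stemmer leaves more recognizable stems for this particular text.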


In [10]:
# Remove stopwords
import nltk.corpus
stopwords_set = set(nltk.corpus.stopwords.words("english"))
tokens_without_stopwords = [token for token in tokens if token not in stopwords_set]
tokens_without_stopwords

['morning',
 'thrift',
 'store',
 'hunting',
 'friend',
 'thinking',
 'lunch',
 'suggested',
 'emil',
 'seen',
 'chris',
 'sebak',
 'bit',
 'tried',
 'time',
 'two',
 'said',
 'decent',
 'reuben',
 'prepared',
 'step',
 'back',
 'time',
 'well',
 'seeing',
 'kind',
 'addicted',
 'late',
 '40',
 'early',
 '50',
 'whole',
 'rat',
 'pack',
 'scene',
 'stepping',
 'back',
 'time',
 'welcomed',
 'change',
 'da',
 'burgh',
 'long',
 'involve',
 '1979',
 'see',
 'around',
 'every',
 'day',
 'yet',
 'another',
 'shot',
 'finding',
 'decent',
 'reuben',
 'da',
 'burgh',
 'well',
 'like',
 'hunting',
 'holy',
 'grail',
 'looking',
 'one',
 'bush',
 'certainly',
 'hurt',
 'go',
 'right',
 'lunchtime',
 'middle',
 'exactly',
 'first',
 'thought',
 'lost',
 'driving',
 'around',
 'handful',
 'rather',
 'dismal',
 'looking',
 'blocks',
 'looked',
 'like',
 'neighborhood',
 'blighted',
 'building',
 'highway',
 'aha',
 'yep',
 'little',
 'unassuming',
 'building',
 'add',
 'entrance',
 'looked',
 'li

In [13]:
from collections import Counter
from nltk.util import ngrams

# Counter can count the items of any iterable directly, so no explicit loop is needed
unigram_counter = Counter(tokens_without_stopwords)
print("Most common unigrams:")
print(unigram_counter.most_common(10))
print()

# Stopwords were removed before pairing, so a bigram such as ('back', 'time')
# joins words that were separated by a stopword in the original text ("back in time")
bigram_counter = Counter(ngrams(tokens_without_stopwords, 2))
print("Most common bigrams:")
print(bigram_counter.most_common(10))

Most common unigrams:
[('fish', 8), ('good', 6), ('da', 5), ('burgh', 5), ('another', 5), ('could', 5), ('friend', 4), ('time', 4), ('reuben', 4), ('like', 4)]

Most common bigrams:
[(('da', 'burgh'), 5), (('decent', 'reuben'), 2), (('back', 'time'), 2), (('looked', 'like'), 2), (('fish', 'sandwich'), 2), (('morning', 'thrift'), 1), (('thrift', 'store'), 1), (('store', 'hunting'), 1), (('hunting', 'friend'), 1), (('friend', 'thinking'), 1)]
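
The bigram `('back', 'time')` in the output comes from the phrase "back in time": removing stopwords first makes words adjacent that were not adjacent in the review. The sketch below compares that approach with one that builds bigrams from the unfiltered tokens and then discards any pair containing a stopword. The tiny `STOPWORDS` set and both function names are assumptions made for illustration (the notebook uses NLTK's English list), and `zip` stands in for `nltk.util.ngrams` to keep the example dependency-free.

```python
from collections import Counter

# Tiny illustrative stopword set (an assumption; the notebook uses NLTK's English list)
STOPWORDS = {"to", "in", "a", "the", "of"}

def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))

def bigrams_filter_first(tokens):
    """Remove stopwords, then pair the remaining tokens (the notebook's approach)."""
    kept = [token for token in tokens if token not in STOPWORDS]
    return Counter(bigrams(kept))

def bigrams_adjacent_only(tokens):
    """Pair the original tokens, then drop any pair that contains a stopword."""
    return Counter(
        pair for pair in bigrams(tokens)
        if not any(token in STOPWORDS for token in pair)
    )

tokens = "prepared to step back in time".split()
print(bigrams_filter_first(tokens))
print(bigrams_adjacent_only(tokens))
```

The second variant only counts pairs that were truly adjacent in the text, at the cost of producing fewer bigrams overall; which definition is appropriate depends on what the frequency list is meant to show.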
