<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Natural Language Processing (NLP)
## *Data Science Unit 4 Sprint 1 Assignment 1*

Your goal in this assignment: find the attributes of the best & worst coffee shops in the dataset. The text is fairly raw: dates in the review, extra words in the `star_rating` column, etc. You'll probably want to clean that stuff up for a better analysis. 

Analyze the corpus of text using text visualizations of token frequency. Try cleaning the data as much as possible. Try the following techniques: 
- Lemmatization
- Custom stopword removal

Keep in mind the attributes of good tokens. Once you have a solid baseline, layer in the star rating in your visualization(s). The key part of this assignment is to produce a write-up of the attributes of the best and worst coffee shops. Based on your analysis, what makes the best the best and the worst the worst? Use graphs and numbers from your analysis to support your conclusions. There should be plenty of markdown cells! :coffee:

In [13]:
from IPython.display import YouTubeVideo

YouTubeVideo('Jml7NVYm8cs')

# Watch your mouth.

In [64]:
from collections import Counter
import re
 
import pandas as pd

# Plotting
import squarify
import matplotlib.pyplot as plt
import seaborn as sns

# NLP Libraries
import spacy
from spacy.tokenizer import Tokenizer
from nltk.stem import PorterStemmer


In [3]:
%pwd

'C:\\Users\\David_Cruz\\Desktop\\DS-Unit-4-Sprint-1-NLP\\module1-text-data'

In [14]:
import pandas as pd

url = "https://raw.githubusercontent.com/DAVIDCRUZ0202/DS-Unit-4-Sprint-1-NLP/main/module1-text-data/data/yelp_coffeeshop_review_data.csv"

shops = pd.read_csv(url)
shops.head()

# The raw download of this dataframe is surprisingly clean.

Unnamed: 0,coffee_shop_name,full_review_text,star_rating
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5.0 star rating
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4.0 star rating
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4.0 star rating
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2.0 star rating
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4.0 star rating


In [15]:
shops.shape

(7616, 3)

In [17]:
shops.isna().sum()

coffee_shop_name    0
full_review_text    0
star_rating         0
dtype: int64

In [18]:
# No null values. This means that we don't have to drop any of our observations, and 
# we can work with the dataframe in its entirety. Awesomesauce.

In [23]:
# The goal is to split all reviews in the same manner. An approach that I'll try is to first tokenize each review,
# then slice the first item out of each tokenized review and save it into a new column named
# dates
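
One caveat with slicing the date out of the tokens: a tokenizer that strips punctuation also strips the slashes, so `1/21/2016` and `12/1/2016` both collapse to `1212016`. A minimal sketch of reading the date from the raw text instead, assuming every review starts with an `m/d/YYYY` date as in the rows shown above (`leading_date` is a hypothetical helper, not part of the notebook):

```python
from datetime import datetime


def leading_date(review_text):
    """Return the date at the start of a review, or None if it does not parse."""
    parts = review_text.split()
    if not parts:
        return None
    try:
        # The slashes are still present here, so month and day are unambiguous.
        return datetime.strptime(parts[0], "%m/%d/%Y")
    except ValueError:
        return None


# Stripping the slashes makes two different dates collide.
assert "1/21/2016".replace("/", "") == "12/1/2016".replace("/", "")

print(leading_date(" 1/21/2016 Ok, The Steeping Room IS awesome."))
```

Returning `None` for unparseable text keeps one malformed review from stopping the whole pass; whether that is the right policy depends on how many such rows exist.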

In [24]:
def tokenize(text):
    """Parses a string into a list of semantic units (words)

    Args:
        text (str): The string that the function will tokenize.

    Returns:
        list: tokens parsed out by the mechanics of your choice
    """
    
    tokens = re.sub('[^a-zA-Z 0-9]', '', text)
    tokens = tokens.lower().split()
    
    return tokens

In [28]:
shops['tokens'] = shops['full_review_text'].apply(tokenize)

In [54]:
dates = []

# Read the date from the raw text rather than from the tokens: tokenize() strips
# the slashes, so a token such as '1212016' is ambiguous (1/21/2016 or 12/1/2016).
for text in shops['full_review_text']:
    dates.append(pd.to_datetime(text.split()[0], format='%m/%d/%Y', errors='coerce'))

# Preview a few parsed dates instead of printing all of them
for date in dates[:5]:
    print(date)

2016-11-25 00:00:00
2016-12-02 00:00:00
2016-11-30 00:00:00
2016-11-25 00:00:00
2016-12-03 00:00:00

In [55]:
shops['date_reviewed'] = dates

In [56]:
shops

Unnamed: 0,coffee_shop_name,full_review_text,star_rating,tokens,date_reviewed
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5.0 star rating,"[11252016, 1, checkin, love, love, loved, the,...",2016-11-25
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4.0 star rating,"[1222016, listed, in, date, night, austin, amb...",2016-12-02
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4.0 star rating,"[11302016, 1, checkin, listed, in, brunch, spo...",2016-11-30
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2.0 star rating,"[11252016, very, cool, decor, good, drinks, ni...",2016-11-25
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4.0 star rating,"[1232016, 1, checkin, they, are, located, with...",2016-12-03
...,...,...,...,...,...
7611,The Steeping Room,2/19/2015 I actually step into this restauran...,4.0 star rating,"[2192015, i, actually, step, into, this, resta...",2015-02-19
7612,The Steeping Room,"1/21/2016 Ok, The Steeping Room IS awesome. H...",5.0 star rating,"[1212016, ok, the, steeping, room, is, awesome...",2016-01-21
7613,The Steeping Room,"4/30/2015 Loved coming here for tea, and the ...",4.0 star rating,"[4302015, loved, coming, here, for, tea, and, ...",2015-04-30
7614,The Steeping Room,8/2/2015 The food is just average. The booths...,3.0 star rating,"[822015, the, food, is, just, average, the, bo...",2015-08-02


In [57]:
# Now that we've made a date column, let's further examine our tokens

In [70]:
def count(docs):
    """Summarize token frequencies across a collection of tokenized documents.

    Returns a DataFrame with one row per word: total count, rank,
    share of all tokens, cumulative share, and document frequency.
    """
    word_counts = Counter()  # total occurrences of each word
    appears_in = Counter()   # number of documents containing each word

    total_docs = len(docs)

    for doc in docs:
        word_counts.update(doc)
        appears_in.update(set(doc))

    temp = zip(word_counts.keys(), word_counts.values())
    wc = pd.DataFrame(temp, columns=['word', 'count'])

    # Rank 1 is the most frequent word; ties keep their order of appearance
    wc['rank'] = wc['count'].rank(method='first', ascending=False)
    total = wc['count'].sum()

    wc['pct_total'] = wc['count'].apply(lambda x: x / total)

    wc = wc.sort_values(by='rank')
    wc['cul_pct_total'] = wc['pct_total'].cumsum()

    t2 = zip(appears_in.keys(), appears_in.values())
    ac = pd.DataFrame(t2, columns=['word', 'appears_in'])
    wc = ac.merge(wc, on='word')

    wc['appears_in_pct'] = wc['appears_in'].apply(lambda x: x / total_docs)

    return wc.sort_values(by='rank')
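
The two counters in `count` measure different things: one tallies every occurrence of a word, the other tallies how many documents contain it. A small sketch on a made-up two-document corpus (the `docs` list is illustrative only) shows why the second counter is updated with `set(doc)`:

```python
from collections import Counter

# Toy corpus: two already-tokenized documents.
docs = [["love", "love", "latte"], ["latte", "slow"]]

word_counts = Counter()  # every occurrence of a word
appears_in = Counter()   # at most one tally per document

for doc in docs:
    word_counts.update(doc)
    appears_in.update(set(doc))  # set() removes repeats within a document

print(word_counts["love"], appears_in["love"])
```

A word that one enthusiastic reviewer repeats many times gets a high count but a low document frequency, which is worth keeping in mind when reading the tables that follow.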

In [69]:
word_counts = Counter()
shops['tokens'].apply(lambda x: word_counts.update(x))
word_counts.most_common(50)

[('the', 34809),
 ('and', 26650),
 ('a', 22755),
 ('i', 20237),
 ('to', 17164),
 ('of', 12600),
 ('is', 11999),
 ('coffee', 10353),
 ('was', 9707),
 ('in', 9546),
 ('it', 9379),
 ('for', 8680),
 ('this', 6583),
 ('but', 6501),
 ('with', 6332),
 ('my', 6202),
 ('they', 6165),
 ('that', 6151),
 ('you', 5847),
 ('place', 5426),
 ('on', 5251),
 ('have', 5019),
 ('so', 4557),
 ('are', 4359),
 ('not', 4207),
 ('good', 3973),
 ('great', 3919),
 ('its', 3633),
 ('their', 3633),
 ('had', 3402),
 ('here', 3299),
 ('be', 3282),
 ('like', 3088),
 ('at', 3079),
 ('as', 3044),
 ('there', 2975),
 ('if', 2897),
 ('out', 2726),
 ('or', 2653),
 ('we', 2633),
 ('just', 2615),
 ('me', 2613),
 ('all', 2551),
 ('from', 2501),
 ('very', 2443),
 ('get', 2427),
 ('were', 2339),
 ('really', 2317),
 ('one', 2287),
 ('austin', 2252)]
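
The list above is dominated by function words ('the', 'and', 'a') and by words that appear in nearly every review ('coffee', 'place', 'austin'), which is where custom stopword removal helps. A minimal sketch, assuming a hand-picked starter list; `CUSTOM_STOP_WORDS` and `remove_stop_words` are hypothetical names, and the list should be extended after inspecting your own counts:

```python
from collections import Counter

# Hypothetical starter list drawn from the top of the counts above.
CUSTOM_STOP_WORDS = {
    "the", "and", "a", "i", "to", "of", "is", "was", "in", "it", "for",
    "coffee", "place", "austin",
}


def remove_stop_words(tokens, stop_words=CUSTOM_STOP_WORDS):
    """Drop stop words and purely numeric tokens (dates, check-in counts)."""
    return [t for t in tokens if t not in stop_words and not t.isdigit()]


sample = ["11252016", "1", "checkin", "love", "love", "loved", "the", "coffee"]
filtered = remove_stop_words(sample)
print(Counter(filtered).most_common(2))
```

Note that 'love' and 'loved' are still counted separately here; lemmatization is the step that would merge them.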

In [71]:
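# rstrip() strips a set of characters, not a suffix, so remove the text explicitly
shops['star_rating'] = shops['star_rating'].str.replace('star rating', '', regex=False).str.strip()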

In [73]:
shops.head()

Unnamed: 0,coffee_shop_name,full_review_text,star_rating,tokens,date_reviewed
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5.0,"[11252016, 1, checkin, love, love, loved, the,...",2016-11-25
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4.0,"[1222016, listed, in, date, night, austin, amb...",2016-12-02
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4.0,"[11302016, 1, checkin, listed, in, brunch, spo...",2016-11-30
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2.0,"[11252016, very, cool, decor, good, drinks, ni...",2016-11-25
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4.0,"[1232016, 1, checkin, they, are, located, with...",2016-12-03


In [74]:
# Now that we've removed the words from the rating, we can convert it to an int
# and that will make it easier to rank reviews by rating
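
A regular expression is one way to pull the number out of values like `5.0 star rating` without depending on the exact suffix. The sketch below assumes each value contains exactly one number; `parse_rating` is a hypothetical helper. It also illustrates a common pitfall: `str.rstrip` takes a set of characters rather than a suffix.

```python
import re


def parse_rating(raw):
    """Extract the first number from a string such as ' 5.0 star rating '."""
    match = re.search(r"\d+(?:\.\d+)?", raw)
    if match is None:
        raise ValueError(f"no rating found in {raw!r}")
    return int(float(match.group()))


# rstrip removes any trailing characters found in the argument, so on other
# inputs it can remove more than the intended suffix:
print("great rating".rstrip(" rating"))  # prints "gre"

print(parse_rating(" 4.0 star rating "))  # prints 4
```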

In [83]:
ratings = []

for i in shops['star_rating']:
    # Convert each row's own rating (indexing with [0] would repeat the first row's value)
    ratings.append(int(float(i)))

In [86]:
shops['star_rating'] = ratings

In [91]:
shops

Unnamed: 0,coffee_shop_name,full_review_text,star_rating,tokens,date_reviewed
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5,"[11252016, 1, checkin, love, love, loved, the,...",2016-11-25
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4,"[1222016, listed, in, date, night, austin, amb...",2016-12-02
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4,"[11302016, 1, checkin, listed, in, brunch, spo...",2016-11-30
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2,"[11252016, very, cool, decor, good, drinks, ni...",2016-11-25
4,The Factory - Cafe With a Soul,12/3/2016 1 check-in They are located within ...,4,"[1232016, 1, checkin, they, are, located, with...",2016-12-03
...,...,...,...,...,...
7611,The Steeping Room,2/19/2015 I actually step into this restauran...,4,"[2192015, i, actually, step, into, this, resta...",2015-02-19
7612,The Steeping Room,"1/21/2016 Ok, The Steeping Room IS awesome. H...",5,"[1212016, ok, the, steeping, room, is, awesome...",2016-01-21
7613,The Steeping Room,"4/30/2015 Loved coming here for tea, and the ...",4,"[4302015, loved, coming, here, for, tea, and, ...",2015-04-30
7614,The Steeping Room,8/2/2015 The food is just average. The booths...,3,"[822015, the, food, is, just, average, the, bo...",2015-08-02


## How do we want to analyze these coffee shop tokens? 

- Overall Word / Token Count
- View Counts by Rating 
- *Hint:* a 'bad' coffee shop has a rating between 1 & 3 based on the distribution of ratings. A 'good' coffee shop is a 4 or 5. 
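
Following the hint, one way to view counts by rating is to label each review 'good' or 'bad' and keep a separate counter per label. A minimal sketch on made-up data; the `reviews` list stands in for rows of the real dataframe, and `label` is a hypothetical helper:

```python
from collections import Counter


def label(star_rating):
    """Map a numeric rating to 'good' (4-5) or 'bad' (1-3), per the hint."""
    return "good" if star_rating >= 4 else "bad"


# Toy (rating, tokens) pairs standing in for rows of the real dataframe.
reviews = [
    (5, ["love", "atmosphere", "latte"]),
    (4, ["friendly", "staff", "latte"]),
    (2, ["slow", "service", "latte"]),
]

counts_by_label = {"good": Counter(), "bad": Counter()}
for rating, tokens in reviews:
    counts_by_label[label(rating)].update(tokens)

print(counts_by_label["good"].most_common(1))
```

With the real data, the same split could be done by filtering the dataframe on `star_rating` and passing each subset's `tokens` column to `count`.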

In [93]:
wc = count(shops['tokens'])

In [94]:
wc.head()

Unnamed: 0,word,appears_in,count,rank,pct_total,cul_pct_total,appears_in_pct
26,the,6847,34809,1.0,0.044537,0.044537,0.899028
14,and,6864,26650,2.0,0.034098,0.078635,0.901261
11,a,6246,22755,3.0,0.029114,0.107749,0.820116
56,i,5528,20237,4.0,0.025893,0.133642,0.72584
69,to,5653,17164,5.0,0.021961,0.155602,0.742253


In [95]:
wc[wc['rank'] <= 100]['cul_pct_total'].max()

0.5335194530026511

## Can you visualize the words with the greatest difference in counts between 'good' & 'bad'?

A couple of notes: 
- Use relative frequencies instead of absolute counts, because the 'good' and 'bad' groups contain different numbers of reviews
- Only look at the top 5-10 words with the greatest differences
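
The two notes above can be sketched as follows: convert each group's counts to shares of that group's tokens, subtract, and keep the words with the largest absolute difference. The counters here are made-up numbers, and `relative_freq` and `top_differences` are hypothetical helper names:

```python
from collections import Counter


def relative_freq(counter):
    """Convert raw counts to each word's share of all tokens in the group."""
    total = sum(counter.values())
    return {word: n / total for word, n in counter.items()}


def top_differences(good, bad, k=10):
    """Rank words by |share in good - share in bad|, largest first."""
    good_rf, bad_rf = relative_freq(good), relative_freq(bad)
    vocab = set(good_rf) | set(bad_rf)
    diffs = {w: good_rf.get(w, 0.0) - bad_rf.get(w, 0.0) for w in vocab}
    return sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)[:k]


# Toy counts: 10 tokens in each group.
good = Counter({"great": 5, "friendly": 4, "latte": 1})
bad = Counter({"slow": 6, "rude": 2, "latte": 2})

for word, diff in top_differences(good, bad, k=3):
    print(f"{word:10s} {diff:+.2f}")
```

A positive difference means the word is relatively more common in 'good' reviews, a negative one in 'bad' reviews, so the signed values map naturally onto a diverging bar chart.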


## Stretch Goals

* Analyze another corpus of documents - such as Indeed.com job listings ;).
* Play with the spaCy API to
 - Extract Named Entities
 - Extract 'noun chunks'
 - Attempt Document Classification with just spaCy
 - *Note:* This [course](https://course.spacy.io/) will be of interest in helping you with these stretch goals. 
* Try to build a Plotly Dash app with your text data 

