<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

# Natural Language Processing (NLP)
## *Data Science Unit 4 Sprint 1 Assignment 1*

Your goal in this assignment: find the attributes of the best & worst coffee shops in the dataset. The text is fairly raw: dates in the review, extra words in the `star_rating` column, etc. You'll probably want to clean that stuff up for a better analysis. 

Analyze the corpus of text using text visualizations of token frequency. Try cleaning the data as much as possible. Try the following techniques: 
- Lemmatization
- Custom stopword removal

Keep in mind the attributes of good tokens. Once you have a solid baseline, layer the star rating into your visualization(s). The key part of this assignment is to produce a write-up of the attributes of the best and worst coffee shops: based on your analysis, what makes the best the best and the worst the worst? Use graphs and numbers from your analysis to support your conclusions. There should be plenty of markdown cells! :coffee:

In [1]:
%pwd

'/Users/danielfernandez/Documents/Github-Proj/DS-Unit-4-Sprint-1-NLP/module1-text-data'

In [2]:
import pandas as pd

url = "https://raw.githubusercontent.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP/main/module1-text-data/data/yelp_coffeeshop_review_data.csv"

shops = pd.read_csv(url)
shops.tail(10)

Unnamed: 0,coffee_shop_name,full_review_text,star_rating
7606,The Steeping Room,6/11/2015 Same great tea and food as their Do...,4.0 star rating
7607,The Steeping Room,8/14/2015 This place is amazing! It's one of ...,5.0 star rating
7608,The Steeping Room,9/20/2015 I come here when I visit my friend ...,4.0 star rating
7609,The Steeping Room,12/7/2014 1 check-in After noticing many frie...,4.0 star rating
7610,The Steeping Room,3/1/2016 Great food! I haven't had a meal I d...,5.0 star rating
7611,The Steeping Room,2/19/2015 I actually step into this restauran...,4.0 star rating
7612,The Steeping Room,"1/21/2016 Ok, The Steeping Room IS awesome. H...",5.0 star rating
7613,The Steeping Room,"4/30/2015 Loved coming here for tea, and the ...",4.0 star rating
7614,The Steeping Room,8/2/2015 The food is just average. The booths...,3.0 star rating
7615,The Steeping Room,5/23/2015 I finally stopped in for lunch with...,4.0 star rating


In [3]:
# Import statements

import re
import spacy
import squarify
import seaborn as sns
from collections import Counter
import matplotlib.pyplot as plt

# Spacy Model
!python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')


In [4]:
# shops['full_review_text'][5]

" 11/20/2016 1 check-in Very cute cafe! I think from the moment I stepped in, there really wasn't anything I didn't find cute at The Factory. From their decor to their cups, everything was really cute. It's really the perfect place for a catching up with friends or a coffee date.  When you go order, there's really the least amount of interaction ever with the workers. You just pick your order on an iPad and they'll call your order out after for you to pick up at the counter. The whole thing's pretty novel honestly. I got the Viva Matcha Latte and it was so good! Perfect amount of sweetness and perfect temperature. I went on a cold night and this cafe is just so cozy, it was such a great combination. They have these swings as well which were pretty fun to sit on honestly.  Prices are what I would expect for a cafe like this, not super cheap, but not too pricey. There's no wifi here, so if you want to study, maybe this isn't the right place. But overall, very nice atmosphere! Viva matcha

In [5]:
# Observe the star_rating formatting, e.g. ' 4.0 star rating '
pd.unique(shops['star_rating'])

array([' 5.0 star rating ', ' 4.0 star rating ', ' 2.0 star rating ',
       ' 3.0 star rating ', ' 1.0 star rating '], dtype=object)

In [6]:
def tokenize(raw, col):
    """Return a list of token lists, one per row of raw[col]."""
    token_list = []
    regx_txt_date = r'\d{1}[/-]\d{1}[/-]\d{4}|\d{2}[/-]\d{2}[/-]\d{4}'
    for row in raw[col]:
        # Remove dates, then non-alphanumeric chars, then lower
        clean_0 = re.sub(regx_txt_date, '', row)
        text = re.sub(r'[^a-zA-Z 0-9]', '', clean_0)
        text = text.lower()

        tokens = text.split(" ")
        token_list.append(tokens)

    return token_list

In [7]:
def tokenize_df(raw):
    """
    Remove dates, non-alphanumeric chars and empty strings from a review.
    Use with `df[col].apply(tokenize_df)`.
    
    Input: String
    
    Output: List
    """
    # Remove dates (1/1/1111 or 01/01/1111) and non-alphanum.
    # Note: mixed-width dates such as 12/2/2016 are only partly matched
    # ('2/2/2016' is removed), which leaves a stray leading digit as a token.
    regx_txt_date = r'\d{1}[/-]\d{1}[/-]\d{4}|\d{2}[/-]\d{2}[/-]\d{4}'
    text_0 = re.sub(regx_txt_date, '', raw)
    text = re.sub(r'[^a-zA-Z 0-9]', '', text_0)
    
    # Lowercase tokens and split
    text = text.lower()
    # Remove empty strings left by repeated spaces
    tokens = list(filter(None, text.split(" ")))
    
    return tokens
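
The two-alternative date pattern above only matches dates whose month and day have the same number of digits, so a date such as `12/2/2016` is only partly removed and leaves a stray `1` token. A minimal sketch of a more permissive pattern, assuming dates always appear as month/day/four-digit year; `DATE_RE` and `strip_dates` are hypothetical names, not part of the notebook:

```python
import re

# Hypothetical pattern: 1-2 digit month, 1-2 digit day, 4-digit year.
DATE_RE = re.compile(r'\b\d{1,2}[/-]\d{1,2}[/-]\d{4}\b')

def strip_dates(text):
    """Remove M/D/YYYY-style dates and collapse leftover whitespace."""
    return ' '.join(DATE_RE.sub(' ', text).split())

samples = [
    "12/2/2016 Listed in Date Night",
    "11/25/2016 1 check-in Love love loved",
    "3/1/2016 Great food!",
]
cleaned = [strip_dates(s) for s in samples]
print(cleaned)
```

The `1` in the second sample survives on purpose: it is the check-in count from the review text, not part of the date.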

In [8]:
# Clean star_rating: keep only the digits 1-5, e.g. ' 4.0 star rating ' -> 4
def comet(tail):
    return int(re.sub('[^1-5]', '', tail))

shops['star_rating_int'] = shops['star_rating'].apply(comet)
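
`comet` works here because every rating in the data is a whole number from 1 to 5. It strips every character outside `1-5`, so a value such as `4.5` would silently become `45`. A hedged alternative, assuming the column always looks like `' 4.0 star rating '`, is to capture the leading number explicitly; `RATING_RE` and `parse_rating` are hypothetical names:

```python
import re

# Hypothetical pattern: a number (optionally with decimals) before "star rating".
RATING_RE = re.compile(r'(\d+(?:\.\d+)?)\s*star rating')

def parse_rating(raw):
    """Return the numeric rating as a float, or raise if the format is unexpected."""
    match = RATING_RE.search(raw)
    if match is None:
        raise ValueError(f"unrecognized rating: {raw!r}")
    return float(match.group(1))

ratings = [parse_rating(r) for r in [' 5.0 star rating ', ' 2.0 star rating ']]
print(ratings)
```

Raising on an unexpected format makes bad rows visible instead of turning them into misleading integers.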

In [9]:
wrd_cnt = Counter()
shops['tokens'] = shops['full_review_text'].apply(lambda text: tokenize_df(text))
shops['tokens'].apply(lambda token: wrd_cnt.update(token))
wrd_cnt.most_common(10)

[('the', 34809),
 ('and', 26650),
 ('a', 22755),
 ('i', 20237),
 ('to', 17164),
 ('of', 12600),
 ('is', 11999),
 ('coffee', 10353),
 ('was', 9707),
 ('in', 9546)]
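
The top ten is dominated by function words, which is where custom stop word removal comes in. A minimal sketch with plain Python sets; the stop word list here is a small illustrative assumption (in practice you might start from spaCy's default list and add domain words such as `coffee`), and `count_tokens` is a hypothetical helper:

```python
from collections import Counter

# Illustrative stop words only; extend this set for a real analysis.
CUSTOM_STOP_WORDS = {"the", "and", "a", "i", "to", "of", "is", "was", "in", "coffee"}

def count_tokens(token_lists, stop_words=CUSTOM_STOP_WORDS):
    """Count tokens across documents, skipping anything in stop_words."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(t for t in tokens if t not in stop_words)
    return counts

# Toy documents standing in for shops['tokens'].
docs = [
    ["the", "coffee", "was", "great", "and", "the", "staff", "friendly"],
    ["great", "atmosphere", "and", "great", "latte"],
]
counts = count_tokens(docs)
print(counts.most_common(3))
```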

In [21]:
# Use spaCy to tokenize and remove stop words
def space_token(df, col):
    """Return a copy of df with a 'tokens_spacy' column of lowercased tokens."""
    cpy_df = df.copy()
    tokens_space = []
    for doc in nlp.pipe(cpy_df[col]):
        # Keep tokens that are not stop words, digits, or out-of-vocabulary
        doc_tokens = [
            t.text.lower()
            for t in doc
            if not (t.is_stop or t.is_digit or t.is_oov)
        ]
        tokens_space.append(doc_tokens)
    cpy_df['tokens_spacy'] = tokens_space
    return cpy_df
        

In [22]:
test = space_token(shops, 'full_review_text')
test.head(4)

Unnamed: 0,coffee_shop_name,full_review_text,star_rating,star_rating_int,tokens,tokens_spacy
0,The Factory - Cafe With a Soul,11/25/2016 1 check-in Love love loved the atm...,5.0 star rating,5,"[1, checkin, love, love, loved, the, atmospher...","[ , check, -, love, love, loved, atmosphere, !..."
1,The Factory - Cafe With a Soul,"12/2/2016 Listed in Date Night: Austin, Ambia...",4.0 star rating,4,"[1, listed, in, date, night, austin, ambiance,...","[ , listed, date, night, :, austin, ,, ambianc..."
2,The Factory - Cafe With a Soul,11/30/2016 1 check-in Listed in Brunch Spots ...,4.0 star rating,4,"[1, checkin, listed, in, brunch, spots, i, lov...","[ , check, -, listed, brunch, spots, loved, ec..."
3,The Factory - Cafe With a Soul,11/25/2016 Very cool decor! Good drinks Nice ...,2.0 star rating,2,"[very, cool, decor, good, drinks, nice, seatin...","[ , cool, decor, !, good, drinks, nice, seatin..."


## How do we want to analyze these coffee shop tokens? 

- Overall Word / Token Count
- View Counts by Rating 
- *Hint:* a 'bad' coffee shop has a rating between 1 & 3, based on the distribution of ratings. A 'good' coffee shop is a 4 or 5.
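
One way to view counts by rating, sketched with toy data rather than the real `shops` frame: label each review 'good' (4-5 stars) or 'bad' (1-3 stars) and keep a separate `Counter` per label. The `reviews` list and the `label` helper are illustrative assumptions:

```python
from collections import Counter, defaultdict

def label(star_rating_int):
    """Map an integer star rating to 'good' (4-5) or 'bad' (1-3)."""
    return "good" if star_rating_int >= 4 else "bad"

# Toy (rating, tokens) pairs standing in for rows of the DataFrame.
reviews = [
    (5, ["great", "latte", "friendly", "staff"]),
    (4, ["cozy", "great", "wifi"]),
    (2, ["slow", "service", "rude", "staff"]),
    (3, ["average", "latte", "slow"]),
]

# One token counter per label.
counts_by_label = defaultdict(Counter)
for rating, tokens in reviews:
    counts_by_label[label(rating)].update(tokens)

for name, counts in counts_by_label.items():
    print(name, counts.most_common(2))
```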

## Can you visualize the words with the greatest difference in counts between 'good' & 'bad'?

A couple of notes:
- Use relative frequencies instead of absolute counts, because the two groups have different numbers of reviews
- Only look at the top 5-10 words with the greatest differences
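
The two notes above can be sketched as follows: convert each group's counts to relative frequencies (count divided by the group's total tokens), subtract, and keep the words with the largest absolute gap. The counts below are made up for illustration, and `rel_freq` and `top_differences` are hypothetical helpers:

```python
from collections import Counter

def rel_freq(counts):
    """Convert raw counts to each word's share of the group's total tokens."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def top_differences(good_counts, bad_counts, n=5):
    """Return the n words with the largest |good share - bad share|."""
    good, bad = rel_freq(good_counts), rel_freq(bad_counts)
    vocab = set(good) | set(bad)
    diffs = {w: good.get(w, 0.0) - bad.get(w, 0.0) for w in vocab}
    return sorted(diffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]

# Made-up counts: the groups deliberately have different totals (10 vs 5).
good_counts = Counter({"great": 5, "friendly": 3, "latte": 2})
bad_counts = Counter({"slow": 3, "rude": 1, "latte": 1})

top = top_differences(good_counts, bad_counts, n=3)
print(top)
```

A positive difference means the word takes a larger share of 'good' reviews, and a negative one means it leans 'bad'. The resulting pairs can be passed straight to a bar chart.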


## Stretch Goals

* Analyze another corpus of documents - such as Indeed.com job listings ;).
* Play with the spaCy API to
 - Extract named entities
 - Extract 'noun chunks'
 - Attempt document classification with just spaCy
 - *Note:* This [course](https://course.spacy.io/) will be of interest in helping you with these stretch goals.
* Try to build a Plotly Dash app with your text data

