# Section 1: Dictionary

The dictionary was built from the three blogs cited previously and from a 5% sample of customer reviews, grouped by rating score. We noted that relying only on the keywords generated from the blogs loses valuable information. We therefore randomly selected 5% of the reviews with score 1, 5% with score 2, and so on, to capture a representative text sample of what people talk about in each rating category.

#### Importing all relevant packages

In [1]:
from nltk.stem import WordNetLemmatizer
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import json
import nltk
import re

Defining functions for pre-processing stage:

#### Building **expandContractions** function

In [2]:
"""
from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
all credits go to alko and arturomp @ stack overflow.
"""

with open('../Data_Story/wordLists/contractionList.txt', 'r') as f:
    cList = json.loads(f.read())
    c_re = re.compile('(%s)' % '|'.join(cList.keys()))

def expandContractions(text, c_re=c_re):
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text)
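
To illustrate how the regex-based expansion behaves, here is a minimal sketch on a toy input. The `contractions` dictionary is a hypothetical stand-in for the contents of `contractionList.txt`, and `expand` is an illustrative name; `re.escape` is an extra precaution that the original cell does not use.

```python
import re

# Hypothetical stand-in for contractionList.txt: contraction -> expansion.
contractions = {
    "can't": "cannot",
    "couldn't": "could not",
    "i'm": "i am",
    "it's": "it is",
}

# One alternation pattern over all keys; re.escape guards against keys
# that contain regex metacharacters.
pattern = re.compile("(%s)" % "|".join(re.escape(k) for k in contractions))

def expand(text, pattern=pattern):
    # Replace every matched contraction with its dictionary entry.
    return pattern.sub(lambda m: contractions[m.group(0)], text)

expanded = expand("i'm sure it's good but i couldn't stay")
print(expanded)
```

The keys are lowercase, which is why the pipeline lowercases the text before expanding it.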

#### Stop words
Add the names of coffee shops as stop words

In [3]:
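# Load the stop words as a set so that `word not in stop_words` compares whole
# tokens; against a plain string it would match substrings instead.
# Assumption: the file lists the words separated by whitespace.
with open('../Data_Story/wordLists/stop_wordsList.txt') as f:
    stop_words = set(f.read().split())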

#### Pipeline preprocessing

1. Normalize and expand contractions
2. Delete special characters
3. Tokenize words for filtering stopwords
4. Lemmatization
5. Join sentences again

In [4]:
wpt = nltk.WordPunctTokenizer()
lemmatizer = WordNetLemmatizer() 

def pre_processing(text):
    # Normalize apostrophes and expand contractions
    text = re.sub(r'’',"'", text)
    text = expandContractions(text.lower())
    # Filtering special characters
    text = re.sub(r'[^a-zA-Z\s]','', text)
    # Tokenization and filtering stop-words
    tokens = wpt.tokenize(text)
    words = [word for word in tokens if word not in stop_words]
    # Lemmatization
    words_lem = [lemmatizer.lemmatize(word) for word in words]
    text_norm = ' '.join(words_lem)
    
    return text_norm
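
The five steps can be traced on a single sentence with the sketch below. It is a simplified illustration, not the notebook's function: the `contractions`, `stop_words` and `lemmas` tables are hypothetical toy data, whitespace splitting replaces `WordPunctTokenizer`, and a dictionary lookup replaces `WordNetLemmatizer` so that the example runs without NLTK corpora.

```python
import re

# Hypothetical toy resources standing in for the notebook's word lists.
contractions = {"couldn't": "could not", "it's": "it is"}
stop_words = {"the", "are", "but", "i", "all", "could"}
lemmas = {"donuts": "donut", "flavors": "flavor"}  # toy lemmatizer table

def pre_processing_sketch(text):
    # 1. Normalize curly apostrophes, lowercase, expand contractions
    text = text.replace("\u2019", "'").lower()
    for short, full in contractions.items():
        text = text.replace(short, full)
    # 2. Delete special characters (anything but letters and whitespace)
    text = re.sub(r"[^a-z\s]", "", text)
    # 3. Tokenize and filter stop words
    tokens = [t for t in text.split() if t not in stop_words]
    # 4. Lemmatize (dictionary lookup here, WordNet in the notebook)
    tokens = [lemmas.get(t, t) for t in tokens]
    # 5. Join the tokens back into one string
    return " ".join(tokens)

result = pre_processing_sketch(
    "The donuts are good, but I couldn\u2019t try all the flavors!"
)
print(result)
```

Note that the negation "not" survives the filtering here only because it is absent from the toy stop-word set.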

#### Building TF-IDF function

1. Build a Vectorizer model of TF-IDF with pairs of words (bi-grams)
2. Fit the model with the input of the function (a normalized text)
3. Transform the TF-IDF matrix to an array
4. Use a threshold to filter out low TF-IDF values
5. Return the key words and their TF-IDF values

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
def key_word_extractor(norm_corpus):
    tv = TfidfVectorizer(use_idf=True, ngram_range=(2,2))
    tv_matrix = tv.fit_transform(norm_corpus)
    tv_matrix = tv_matrix.toarray()

    idx = [i for i in range(len(norm_corpus))]
    # get_feature_names() was removed in scikit-learn 1.2
    vocab = tv.get_feature_names_out()
    df_tfidV = pd.DataFrame(np.round(tv_matrix, 2), columns=vocab, index=idx)

    key_values = []
    key_words = []
    for i in range(len(norm_corpus)):
        for column in vocab:
            if df_tfidV[column][i] > 0.0001:
                key_values.append(df_tfidV[column][i])
                key_words.append(column)
    return key_words, key_values
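
To make the weighting concrete, the sketch below computes bi-gram TF-IDF by hand on a two-document toy corpus. It assumes the default scikit-learn scheme as we understand it (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 normalization per document) and uses plain whitespace tokenization, so it is an approximation of `TfidfVectorizer`, not a replacement. The names `bigrams` and `tfidf_bigrams` are illustrative.

```python
import math
from collections import Counter

def bigrams(text):
    # Consecutive word pairs, e.g. "cold brew great" -> ["cold brew", "brew great"]
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

def tfidf_bigrams(corpus, threshold=0.0001):
    doc_counts = [Counter(bigrams(doc)) for doc in corpus]
    n_docs = len(corpus)
    # Document frequency: in how many documents each bi-gram occurs
    doc_freq = Counter(term for counts in doc_counts for term in counts)
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in doc_freq.items()}

    key_words, key_values = [], []
    for counts in doc_counts:
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2 norm of the document vector (loop below is skipped if it is empty)
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for term in sorted(weights):
            value = round(weights[term] / norm, 2)
            if value > threshold:
                key_words.append(term)
                key_values.append(value)
    return key_words, key_values

words, values = tfidf_bigrams(["cold brew great", "cold brew strong espresso"])
for word, value in zip(words, values):
    print(word, value)
```

In the first document the shared bi-gram "cold brew" receives a lower weight than "brew great", which appears in that document only.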

## Blogs

#### Preparing Blog reviews for analysis
Merging all blogs into one dataset. A **Blog** column identifies the source blog.

In [6]:
df = pd.read_csv('../Data_Extraction/Reviews/reviews_blog_a.csv', usecols=['Name', 'Description'])

In [7]:
df['Blog'] = 'A'
df = df.rename(index=str, columns={"Name": "Names", "Description": "Review"})
df.head()

| | Names | Review | Blog |
|---|---|---|---|
| 0 | Trouble Coffee | Yeah yeah, they've got the toast you crave. Ci... | A |
| 1 | Andytown Coffee Roasters | This small Outer Sunset shop chooses and roast... | A |
| 2 | Garden House Cafe | In the reaches of the Outer Richmond lies Gard... | A |
| 3 | Snowbird Coffee | Opened by former filmmaker Eugene Kim, Snowbir... | A |
| 4 | Flywheel Coffee | Haight-Ashbury's new school roaster and cafe F... | A |


In [8]:
df1 = pd.read_csv('../Data_Extraction/Reviews/reviews_blogs_bc.csv', usecols=['Names', 'Wifi', 'Review', 'Blog'])

In [9]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
df_blogs = pd.concat([df1, df], ignore_index=True, sort=False)

Applying preprocessing to the blogs. We build the list **docs** from the new column **text normalized** of the blogs dataFrame.

In [10]:
df_blogs['text normalized'] = df_blogs.Review.apply(lambda x: pre_processing(x))

In [11]:
docs = df_blogs['text normalized'].tolist()

Inspecting the names of the coffee shops and adding them to the stop words. This step is manual because the names contain common words such as baked, coffee, tea and pastry, and we cannot tell which ones must be filtered until we inspect them.

In [12]:
df_blogs['Names'].unique()

array(['Beacon Coffee & Pantry', 'Contraband Coffeebar', 'The Station SF',
       'Mazarine Coffee', 'Verve Coffee Roasters',
       'Ritual Coffee Roasters', 'Philz Coffee', 'Mercury Cafe',
       'Jane on Fillmore', 'Saint Frank', 'Nook', 'Réveille Coffee Co.',
       'Duboce Park Cafe', 'Atlas Cafe', 'The Grove',
       'The Interval at Long Now', '20th Century Cafe',
       'Sightglass Coffee', 'Verve', 'The\xa0Mill SF', 'Cafe Réveille',
       'Trouble Coffee', 'Blue Bottle Coffee', 'Four Barrel Coffee',
       'Mission Heirloom', 'Paramo Coffee', 'Le Marais Bakery', 'Hollow',
       'Red Door Coffee', 'Andytown Coffee Roasters', 'Garden House Cafe',
       'Snowbird Coffee', 'Flywheel Coffee', 'fifty/fifty',
       'Ritual Roasters Coffee', 'The Mill',
       'Wrecking Ball Coffee Roasters', 'Lady Falcon Coffee Club',
       'Linea Caffe', 'George and Lennie', 'Coffee Cultures',
       'Sextant Coffee Roasters', 'Tartine Manufactory',
       'Equator Coffees & Teas', 'Caffe Tries
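
The manual review can be supported by listing every word that occurs in the shop names, as in the sketch below. This is only an aid, not the procedure used in the notebook: the `keep` set is a hypothetical example of words an analyst might decide not to filter, and the four names are taken from the outputs above.

```python
import re

shop_names = [
    "Trouble Coffee",
    "Equator Coffees & Teas",
    "Le Marais Bakery",
    "Garden House Cafe",
]

# Hypothetical choice: words judged informative enough to keep in the text.
keep = {"teas", "bakery", "garden", "house"}

# Collect every alphabetic token that appears in a shop name.
candidates = set()
for name in shop_names:
    candidates.update(re.findall(r"[a-z]+", name.lower()))

# Remaining words are candidates for the stop-word list, pending review.
to_review = sorted(candidates - keep)
print(to_review)
```

The resulting list still needs a human decision per word, which is the manual part described above.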

#### Applying Vectorizer function to **docs**

Make a dataFrame with the resulting vocabulary, **df_keys_group**.

In [13]:
# Unpack both results from a single call instead of fitting the model twice
key_words, key_values = key_word_extractor(docs)

In [14]:
# Copy the extracted key words and values into new lists
l_words = list(key_words)
l_values = list(key_values)

In [15]:
df_keys = pd.DataFrame({'key_words': l_words, 'key_values': l_values})

In [16]:
df_keys_group = df_keys.groupby('key_words').mean().sort_values('key_values')
df_keys_group.head()

| key_words | key_values |
|---|---|
| espresso drink | 0.115 |
| cold brew | 0.12 |
| another location | 0.13 |
| authentic soda | 0.13 |
| pastry free | 0.13 |


## Reviews

#### Loading customers dataset

In [17]:
df_reviews = pd.read_csv("../Data_Extraction/Reviews/reviews_rating_date.csv", \
                         usecols=['Coffee', 'Description','Rating', 'date'])

Inspecting the distribution of ratings, we reduce the positive reviews by randomly selecting 500 reviews with 5 stars and 500 with 4 stars.

In [18]:
df_reviews.groupby('Rating').count()

| Rating | Coffee | Description | date |
|---|---|---|---|
| 1.0 star rating | 245 | 263 | 263 |
| 2.0 star rating | 196 | 207 | 207 |
| 3.0 star rating | 305 | 327 | 327 |
| 4.0 star rating | 766 | 821 | 821 |
| 5.0 star rating | 1756 | 1930 | 1930 |


#### Downsampling the ratings with the most reviews

In [19]:
df1_ = df_reviews[df_reviews['Rating'] == '1.0 star rating']
df2_ = df_reviews[df_reviews['Rating'] == '2.0 star rating']
df3_ = df_reviews[df_reviews['Rating'] == '3.0 star rating']
df4_ = df_reviews[df_reviews['Rating'] == '4.0 star rating']
df5_ = df_reviews[df_reviews['Rating'] == '5.0 star rating']

df4_sample = df4_.sample(n=500, random_state=42)
df5_sample = df5_.sample(n=500, random_state=42)
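
The same downsampling can be written as a single cap applied to every rating, sketched below on toy data. The frame `toy` and the value `cap = 4` are hypothetical (the notebook uses 500), and groups smaller than the cap are kept whole.

```python
import pandas as pd

# Toy data: 3 one-star reviews and 10 five-star reviews (hypothetical counts).
toy = pd.DataFrame({
    "Rating": ["1.0 star rating"] * 3 + ["5.0 star rating"] * 10,
    "Description": [f"review {i}" for i in range(13)],
})

cap = 4  # stands in for the 500 used in the notebook

# Sample at most `cap` rows from each rating group, then stack the groups.
balanced = pd.concat(
    [group.sample(n=min(len(group), cap), random_state=42)
     for _, group in toy.groupby("Rating")],
    sort=False,
)

counts = balanced["Rating"].value_counts().to_dict()
print(counts)
```

With a cap of 500 this would leave the 1, 2 and 3 star groups untouched, since each has fewer than 500 reviews.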

#### **data** is the new dataset used in the analysis



In [20]:
data = pd.concat([df1_, df2_, df3_, df4_sample, df5_sample], sort=False)

Take 5% of every rating category to add features to the **Coffee Dictionary**

In [21]:
n1 = data[data.Rating == '1.0 star rating'].sample(frac=0.05, random_state=42)
n2 = data[data.Rating == '2.0 star rating'].sample(frac=0.05, random_state=42)
n3 = data[data.Rating == '3.0 star rating'].sample(frac=0.05, random_state=42)
n4 = data[data.Rating == '4.0 star rating'].sample(frac=0.05, random_state=42)
n5 = data[data.Rating == '5.0 star rating'].sample(frac=0.05, random_state=42)
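
A more compact way to draw the same kind of stratified sample is `DataFrameGroupBy.sample` (available in pandas 1.1 and later), sketched here on hypothetical toy data. The rows drawn may differ from those obtained with one `sample` call per rating, even with the same `random_state`.

```python
import pandas as pd

# Toy data: 40 one-star and 60 two-star reviews (hypothetical counts).
toy = pd.DataFrame({
    "Rating": ["1.0 star rating"] * 40 + ["2.0 star rating"] * 60,
    "Description": [f"review {i}" for i in range(100)],
})

# Draw 5% of the rows within each rating group.
samples = toy.groupby("Rating").sample(frac=0.05, random_state=42)

sizes = samples.groupby("Rating").size().to_dict()
print(sizes)
```

Each group contributes in proportion to its size: 2 of the 40 one-star rows and 3 of the 60 two-star rows.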

We define a dataFrame called **samples_dict** and then apply the preprocessing and the TF-IDF Vectorizer.

In [22]:
samples_dict = pd.concat([n1, n2, n3, n4, n5])

In [23]:
samples_dict['Coffee'].unique()

array(['Dynamo Donut &', 'The Revolution', nan, 'Trouble Coffee',
       'Saint Frank', 'Philz', 'Martha & Brothers', 'Golden Bear Trading',
       'Beanstalk', 'Ritual Coffee', 'YakiniQ', 'Le Marais',
       'Flywheel Coffee', 'U :Dessert', 'Sightglass', 'Vive La',
       'Blue Bottle', 'Spike’s Coffees &', 'Tartine Bakery &',
       'Equator Coffees &', 'Little', 'The', 'Black', 'Mazarine',
       'Les Gourmands', 'Cantata Coffee', 'Contraband Coffee', 'Coffee',
       'Workshop Cafe', 'Urban', 'Spin City', 'Paramo Coffee',
       'UpForDayz', 'Java Beach', 'Réveille Coffee',
       'Howard Street Coffee', 'Craftsman and', 'Morning Due',
       'Weaver’s Coffee &', 'Socola Chocolatier and', 'Four Barrel',
       'Henry’s House of', 'ilana’s', 'Wrecking Ball Coffee', 'Saltroot',
       'Earth’s', 'Cafe', 'Alamo Square', 'Le Cafe du', 'Oakside',
       'Central Coffee Tea &', 'Pinhole', 'Faye’s', 'Crostini &',
       'Cinderella Bakery &'], dtype=object)

In [24]:
samples_dict['text normalized'] = samples_dict.Description.apply(lambda x: pre_processing(x))

In [25]:
samples_dict.head()

| | Coffee | Description | Rating | date | text normalized |
|---|---|---|---|---|---|
| 1548 | Dynamo Donut & | The donuts are good but the service is horribl... | 1.0 star rating | 2/23/2019 | donut good service horrible run flavor early t... |
| 1559 | The Revolution | the music is great but make sure you buy somet... | 1.0 star rating | 11/13/2018 | music great sure buy something else rude guy s... |
| 3475 | NaN | I used to work at this cafe and couldn't stand... | 1.0 star rating | 8/17/2017 | used work not stand working longer week owner ... |
| 2836 | Trouble Coffee | Horrible staff. I just came in on the phone an... | 1.0 star rating | 2/25/2019 | horrible staff came phone told gf wait ordered... |
| 946 | Saint Frank | This is the worst coffee I've tasted in SF. Th... | 1.0 star rating | 1/8/2019 | worst tasted almond milk sucked drink almond m... |


In [26]:
docs_customer = samples_dict['text normalized'].tolist()

In [27]:
# Unpack both results from a single call instead of fitting the model twice
customer_key_words, customer_key_values = key_word_extractor(docs_customer)

In [28]:
# Copy the extracted key words and values into new lists
l_words = list(customer_key_words)
l_values = list(customer_key_values)

Following exactly the same procedure, we build a second dataFrame with the key words from customers.

In [29]:
df_keys_customer = pd.DataFrame({'key_words': l_words, 'key_values': l_values})

We group the key words by the mean of their TF-IDF values, which avoids duplicate key words.

In [30]:
df_keys_group_customer = df_keys_customer.groupby('key_words').mean().sort_values('key_values')
df_keys_group_customer.head()

| key_words | key_values |
|---|---|
| repeat customer | 0.07 |
| opt sweetness | 0.07 |
| sylvain chaillout | 0.07 |
| sylvain looked | 0.07 |
| sylvain very | 0.07 |


Now we concatenate both (from blogs and customers) and save the result as **tfidf_vocab**, available in the folder **preprocessing_ml**.

In [31]:
#pd.concat([df_keys_group_customer, df_keys_group]).to_csv('preprocessing_ml/tfidf_vocab.csv')
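
The grouping and concatenation steps can be checked on toy data with the sketch below. The two small frames and their values are hypothetical. The final check is an optional addition that lists bi-grams present in both sources, since `pd.concat` keeps one row per source for them.

```python
import pandas as pd

# Hypothetical key words and TF-IDF values for each source.
blog_keys = pd.DataFrame({
    "key_words": ["cold brew", "cold brew", "espresso drink"],
    "key_values": [0.10, 0.14, 0.115],
})
customer_keys = pd.DataFrame({
    "key_words": ["cold brew", "repeat customer"],
    "key_values": [0.20, 0.07],
})

# Average repeated key words within each source, as in the cells above.
blog_group = blog_keys.groupby("key_words").mean().sort_values("key_values")
customer_group = customer_keys.groupby("key_words").mean().sort_values("key_values")

# Stack both vocabularies; the index holds the key words.
vocab = pd.concat([customer_group, blog_group])

# Key words that occur in both sources appear twice in the index.
shared = vocab.index[vocab.index.duplicated()].tolist()
print(shared)
```

Whether shared bi-grams should be averaged again or kept as separate rows depends on how the vocabulary is consumed downstream.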