# Section 2: Feature vectors

## Manual categorization of the dictionary 

This dictionary extends the categories and sections used in the Data Story folder. The criteria for labeling phrases are unchanged, but three new sections are included: **price** (in the place category), **snacks** (in the food category) and **do** (in the place category, covering activities such as *study, work, talking with friends, hanging out* and associated items such as *laptop, wifi, books*). The section **decoration** is renamed **ambient** because this topic covers the **decoration**, the **size** of the business and the **music**. The section **go** merges **to here** and **to go** (previously used in Data Story) and covers everything related to whether one can find a seat, whether to grab the coffee to sit or to go, how crowded the place is and the availability of tables. The section **out** relates to the view, the neighborhood and parking.

The final 3 categories and 16 sections are listed below:

1. **Coffee**: *Baristas, Roasting, Beans, Drinks, Sentiment*

2. **Place**: *Ambient, Go, Do, Out (outside), Sentiment, Price*

3. **Food**: *Sentiment, Breakfast, Baked, Lunch, Snacks*

## Building feature vectors

1. **Similarity vectors**

Joining the key-words grouped by section, we build one topic document per section. We then split the customer reviews (paragraphs) into sentences and compute the cosine similarity between every sentence and every topic document. Each sentence is thus represented as a vector of 16 features, one similarity score per topic document, called the **Similarity feature vector**.

2. **Polarity vectors**

Additionally, the polarity of every sentence is computed and rescaled from the range -1 to 1 to the range 0 to 1. We do this because, as noted in the previous analysis, neutral sentences are frequent and have a polarity around 0: multiplying the similarity scores by the original polarity would cancel many values in every feature and we could lose a lot of information. Finally, the sentence vectors are grouped by review and aggregated using the mean of each component, which gives one vector per review with 16 features, each weighting the similarity by the polarity of the topics present in the review. These vectors are called **Polarity feature vectors**.
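
To illustrate why the rescaling matters, here is a minimal sketch with made-up numbers (the three similarity scores and the helper `rescale` are illustrative assumptions, not values from the data). A neutral sentence has polarity 0 on the original scale, so weighting by it erases every topic score, while the rescaled polarity of 0.5 keeps half of each score.

```python
# Toy sentence: similarity to three hypothetical topics, and a neutral polarity.
similarity = [0.24, 0.0, 0.05]
raw_polarity = 0.0  # neutral sentence on the original [-1, 1] scale


def rescale(p):
    """Map a polarity from [-1, 1] to [0, 1]."""
    return (p + 1) / 2


# Weight each similarity score by the polarity, before and after rescaling.
weighted_raw = [s * raw_polarity for s in similarity]
weighted_rescaled = [s * rescale(raw_polarity) for s in similarity]

print(weighted_raw)       # every component is zero: the topic signal is lost
print(weighted_rescaled)  # half of each similarity score survives
```

The trade-off is that, after rescaling, only strongly negative sentences (polarity near -1) push a feature towards 0.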

#### Importing relevant packages

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import json
import nltk
import re

In [2]:
from nltk.stem import WordNetLemmatizer 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from textblob.sentiments import PatternAnalyzer
from textblob import TextBlob

#### Importing the categorized vocabulary

In [3]:
df_all = pd.read_csv('../Machine_Learning/preprocessing_ml/tfidf_vocab_categorized.csv')
df = df_all.dropna()
print('Original shape = {} and shape after to delete NaN categories and sections = {}'.format(df_all.shape, df.shape))

Original shape = (5378, 4) and shape after to delete NaN categories and sections = (1267, 4)


In [4]:
df.head()

Unnamed: 0,key_words,key_values,category,section
11,fifthgeneration baker,0.07,food,baked
17,tea decent,0.07,coffee,drinks
18,tea said,0.07,coffee,drinks
23,bouncy crunchy,0.07,food,baked
25,buttery lot,0.07,food,baked


Function to transform a list of key-words into a single document:

In [5]:
def convert(s):
    # Concatenate the key-words into one document,
    # adding a space after each key-word
    return ''.join(x + ' ' for x in s)

Inspecting the data grouped by category and section:

In [6]:
df_food = df[df.category == 'food']
df_coffee = df[df.category == 'coffee']
df_place = df[df.category == 'place']

In [7]:
print(df_food.shape, df_coffee.shape, df_place.shape)

(373, 4) (341, 4) (553, 4)


In [8]:
df_food.groupby('section').count()

Unnamed: 0_level_0,key_words,key_values,category
section,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
baked,115,115,115
breakfast,102,102,102
lunch,85,85,85
sentiment,50,50,50
snacks,21,21,21


In [9]:
df_coffee.groupby('section').count()

Unnamed: 0_level_0,key_words,key_values,category
section,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
barista,65,65,65
beans,40,40,40
drinks,174,174,174
roast,33,33,33
sentiment,29,29,29


In [10]:
df_place.groupby('section').count()

Unnamed: 0_level_0,key_words,key_values,category
section,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
decoration,142,142,142
do,84,84,84
go,91,91,91
out,114,114,114
price,20,20,20
sentiment,102,102,102


#### Transforming key-words into documents

In [11]:
doc_coffee_beans = convert(df_coffee[df_coffee.section == 'beans'].key_words.tolist())
doc_coffee_roast = convert(df_coffee[df_coffee.section == 'roast'].key_words.tolist())
doc_coffee_drinks = convert(df_coffee[df_coffee.section == 'drinks'].key_words.tolist())
doc_coffee_barista = convert(df_coffee[df_coffee.section == 'barista'].key_words.tolist())
doc_coffee_sentiment = convert(df_coffee[df_coffee.section == 'sentiment'].key_words.tolist())

In [12]:
doc_place_go = convert(df_place[(df_place.section == 'go')].key_words.tolist())
doc_place_do = convert(df_place[df_place.section == 'do'].key_words.tolist())
doc_place_out = convert(df_place[df_place.section == 'out'].key_words.tolist())
doc_place_price = convert(df_place[(df_place.section == 'price')].key_words.tolist())
doc_place_ambient = convert(df_place[(df_place.section == 'decoration')].key_words.tolist())
doc_place_sentiment = convert(df_place[df_place.section == 'sentiment'].key_words.tolist())

In [13]:
doc_food_baked = convert(df_food[df_food.section == 'baked'].key_words.tolist())
doc_food_lunch = convert(df_food[df_food.section == 'lunch'].key_words.tolist())
doc_food_breakfast = convert(df_food[df_food.section == 'breakfast'].key_words.tolist())
doc_food_sentiment = convert(df_food[df_food.section == 'sentiment'].key_words.tolist())
doc_food_snacks = convert(df_food[df_food.section == 'snacks'].key_words.tolist())
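
As a possible alternative to writing one line per section, the topic documents could be built in a single `groupby`. This is only a sketch on a toy frame: `vocab` and `docs` are hypothetical names, and the four rows are taken from the `df.head()` output above.

```python
import pandas as pd

# Toy stand-in for the categorized vocabulary.
vocab = pd.DataFrame({
    'key_words': ['tea decent', 'tea said', 'bouncy crunchy', 'buttery lot'],
    'category':  ['coffee', 'coffee', 'food', 'food'],
    'section':   ['drinks', 'drinks', 'baked', 'baked'],
})

# One space-joined document per (category, section) pair.
docs = (vocab.groupby(['category', 'section'])['key_words']
             .apply(' '.join)
             .to_dict())

for key, doc in docs.items():
    print(key, '->', doc)
```

Keying the result by `(category, section)` keeps the three `sentiment` sections apart, since they share a section name across categories.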

#### Importing reviews

Importing the reviews and renaming coffee shops of the same chain that share a name. One of our analyses aims to cluster the coffee shops, so it is important to distinguish businesses with the same name.

In [14]:
df_reviews = pd.read_csv("../Data_Extraction/Reviews/reviews_rating_date.csv", \
                         usecols=['Coffee', 'Description','Rating', 'date'])

In [15]:
df_reviews.loc[20:40, 'Coffee'] = 'Coffee 1'
df_reviews.loc[40:60, 'Coffee'] = 'Coffee 2'
df_reviews.loc[60:80, 'Coffee'] = 'The Mill'
df_reviews.loc[1939:1959, 'Coffee'] = 'The Mill 1'
df_reviews.loc[1200:1220, 'Coffee'] = 'Andytown Coffee 1'
df_reviews.loc[1691:1711, 'Coffee'] = 'Réveille Coffee 1'
df_reviews.loc[340:360, 'Coffee'] = 'Sightglass 1'
df_reviews.loc[780:800, 'Coffee'] = 'Sightglass 2'
df_reviews.loc[2439:2459, 'Coffee'] = 'Sightglass 3'
df_reviews.loc[940:960, 'Coffee'] = 'Saint Frank 2'
df_reviews.loc[800:820, 'Coffee'] = 'Philz 1'
df_reviews.loc[2739:2759, 'Coffee'] = 'Philz 2'
df_reviews.loc[3228:3248, 'Coffee'] = 'Philz 3'
df_reviews.loc[700:720, 'Coffee'] = 'Blue Bottle 1'
df_reviews.loc[2119:2139, 'Coffee'] = 'Blue Bottle 2'
df_reviews.loc[2499:2519, 'Coffee'] = 'Blue Bottle 3'
df_reviews.loc[2299:2319, 'Coffee'] = 'Jane on 1'
df_reviews.loc[2159:2179, 'Coffee'] = 'Equator Coffees & 1'
df_reviews.loc[2399:2419, 'Coffee'] = 'Equator Coffees & 2'
df_reviews.loc[3172:3192 , 'Coffee'] = 'Contraband Coffee 1'
df_reviews.loc[2859:2879, 'Coffee'] = 'Little 1'
df_reviews.loc[1471:1491, 'Coffee'] = 'Cafe 1'
df_reviews.loc[2599:2619, 'Coffee'] = 'Cafe 2'
df_reviews.loc[2962:2982, 'Coffee'] = 'Cafe 3'
df_reviews.loc[3122:3132, 'Coffee'] = 'Bluestone 1'
df_reviews.loc[3528:3548, 'Coffee'] = 'Red Door 1'
df_reviews.loc[2339:2359, 'Coffee'] = 'Martha & Brothers 1'
df_reviews.loc[3042:3062, 'Coffee'] = 'Boba 1'
df_reviews.loc[3102:3122, 'Coffee'] = 'Boba 2'
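
Note that `.loc` slices by label and includes both endpoints, so consecutive ranges such as `20:40` and `40:60` share row 40 and the later assignment wins. A small sketch on a made-up six-row frame (the frame `toy` is purely illustrative):

```python
import pandas as pd

toy = pd.DataFrame({'Coffee': ['Coffee'] * 6})

toy.loc[0:2, 'Coffee'] = 'Coffee 1'  # label slice: rows 0, 1 and 2
toy.loc[2:4, 'Coffee'] = 'Coffee 2'  # row 2 is overwritten here

names = toy['Coffee'].tolist()
print(names)
```

If the boundary row matters, it may be worth checking which shop it actually belongs to before relying on the renamed labels.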

In [16]:
df_reviews.groupby('Rating').count()

Unnamed: 0_level_0,Coffee,Description,date
Rating,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1.0 star rating,246,263,263
2.0 star rating,196,207,207
3.0 star rating,309,327,327
4.0 star rating,780,821,821
5.0 star rating,1780,1930,1930


Because we work with polarity, we want a similar proportion of negative and positive reviews. We therefore keep all the reviews with 1, 2 and 3 stars (roughly 200 to 330 each) and draw a random sample of 200 reviews from each of the 4-star and 5-star groups.

In [17]:
df1_ = df_reviews[df_reviews['Rating'] == '1.0 star rating']
df2_ = df_reviews[df_reviews['Rating'] == '2.0 star rating']
df3_ = df_reviews[df_reviews['Rating'] == '3.0 star rating']
df4_ = df_reviews[df_reviews['Rating'] == '4.0 star rating']
df5_ = df_reviews[df_reviews['Rating'] == '5.0 star rating']

df4_sample = df4_.sample(n=200, random_state=42)
df5_sample = df5_.sample(n=200, random_state=42)

In [18]:
data = pd.concat([df1_, df2_, df3_, df4_sample, df5_sample], sort=False)

#### Pre-processing data

In [19]:
"""
from http://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python
all credits go to alko and arturomp @ stack overflow.
"""

with open('../Data_Story/wordLists/contractionList.txt', 'r') as f:
    cList = json.loads(f.read())
    c_re = re.compile('(%s)' % '|'.join(cList.keys()))

def expandContractions(text, c_re=c_re):
    def replace(match):
        return cList[match.group(0)]
    return c_re.sub(replace, text)
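
The same idea can be sketched on a tiny, hypothetical contraction list (the two entries and the names `contractions`, `pattern` and `expand` are assumptions for illustration; the notebook loads its own list from JSON). Escaping the keys and trying longer keys first is an optional safeguard in case a key contains regex metacharacters or is a prefix of another key.

```python
import re

# Hypothetical two-entry contraction list.
contractions = {"can't": "cannot", "i'm": "i am"}

# Escape each key and try longer keys first.
pattern = re.compile('|'.join(
    re.escape(key) for key in sorted(contractions, key=len, reverse=True)))


def expand(text):
    # Replace every matched contraction with its expansion.
    return pattern.sub(lambda match: contractions[match.group(0)], text)


expanded = expand("i'm sure they can't be closed")
print(expanded)
```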

In [20]:
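# Load the stop words as a set so that `word not in stop_words` tests whole
# words (assumes the file is whitespace-separated, e.g. one word per line)
with open('../Data_Story/wordLists/stop_wordsList.txt') as f:
    stop_words = set(f.read().split())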
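
One detail worth checking here: the `in` operator tests for substrings when applied to a string, but for whole elements when applied to a list or a set. A minimal sketch with hypothetical stop words (`stop_text` stands in for the file contents):

```python
stop_text = "however the and"      # hypothetical file contents, as one string
stop_set = set(stop_text.split())  # the same words, as a set

in_string = 'ever' in stop_text    # substring test: matches inside "however"
in_set = 'ever' in stop_set        # whole-word test

print(in_string, in_set)
```

With the string form, a token such as `ever` would be filtered out even though it is not a stop word.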

In [21]:
wpt = nltk.WordPunctTokenizer()
lemmatizer = WordNetLemmatizer() 

def pre_processing(text):
    # Normalize apostrophes and expand contractions
    text = re.sub(r'’',"'", text)
    text = expandContractions(text.lower())
    # Filtering special characters
    text = re.sub(r'[^a-zA-Z\s]','', text)
    # Tokenization and filtering stop-words
    tokens = wpt.tokenize(text)
    words = [word for word in tokens if word not in stop_words]
    # Lemmatization
    words_lem = [lemmatizer.lemmatize(word) for word in words]
    text_norm = ' '.join(words_lem)
    
    return text_norm

In [22]:
data = data.reset_index()

#### Building Similarity features vectors

1. Building a dictionary with the 16 topic documents.
2. Defining a function to build vectors from text using CountVectorizer.
3. Defining a function to compute the similarity as the cosine of two input vectors.
4. Iterating over the reviews split into sentences to calculate the similarity score of every sentence with the 16 topics (features) and build the **Similarity features vectors**.
5. Filtering out the sentences with a similarity score of 0 in all the features (the sum of the scores is 0).
6. Saving the features in the dataFrame *df_similarity* in the folder **preprocessing_ml** as a CSV file called **similarity_features_vectors.csv**.

Dictionary of documents:

In [23]:
documents = {
           'beans' : doc_coffee_beans,
           'roast' : doc_coffee_roast,
           'drinks' : doc_coffee_drinks,
           'barista' : doc_coffee_barista,
           'coffee sentiment' : doc_coffee_sentiment,
           'go' : doc_place_go,
           'do' : doc_place_do,
           'out' : doc_place_out,
           'ambient': doc_place_ambient,
           'price' : doc_place_price,
           'place sentiment' : doc_place_sentiment,
           'baked' : doc_food_baked,
           'lunch' : doc_food_lunch,
           'breakfast' : doc_food_breakfast,
           'snacks' : doc_food_snacks,
           'food sentiment' : doc_food_sentiment
            }

Similarity functions:

In [24]:
def get_cosine_sim(*strs):
    # Cosine similarity between the first two input strings
    vectors = get_vectors(*strs)
    return cosine_similarity(vectors)[0, 1]

def get_vectors(*strs):
    # Bag-of-words count vectors over the vocabulary of the input strings
    text = list(strs)
    vectorizer = CountVectorizer()
    return vectorizer.fit_transform(text).toarray()

In [25]:
def find_similarities(documents, s1):
    # Return the similarities (score and name of the topic)
    score, topic = [], []

    for key, doc in documents.items():
        score.append(get_cosine_sim(s1, doc))
        topic.append(key)
    
    return topic, score
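
To make the score concrete, the cosine of two bag-of-words count vectors can be worked out by hand. The sketch below uses only the standard library and plain whitespace tokenization, which is a simplifying assumption (CountVectorizer applies its own tokenization rules); `cosine_bow` and the two toy strings are hypothetical.

```python
import math
from collections import Counter


def cosine_bow(a, b):
    """Cosine similarity between the word-count vectors of two strings."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


sentence = 'quick cappuccino'
topic_doc = 'cappuccino latte espresso cappuccino'

# Counts: sentence = {quick: 1, cappuccino: 1},
#         topic    = {cappuccino: 2, latte: 1, espresso: 1}
# dot = 2, norms = sqrt(2) * sqrt(6), so the score is 2 / sqrt(12).
score = cosine_bow(sentence, topic_doc)
print(round(score, 4))
```

A sentence that shares no word with a topic document scores 0 for that topic, which is why many feature values in the resulting vectors are exactly 0.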

Compute the similarity between sentences and documents:

In [26]:
id_review = []
coffee = []
rating = []
date = []
sentence = []
beans = []
roast = []
drinks = []
barista = []
coffee_sentiment = []
go = []
price = []
do = []
out = []
ambient = []
place_sentiment = []
baked = []
lunch = []
breakfast = []
snacks = []
food_sentiment = []

for idx, review in enumerate(data.Description): 
    coffee_el = data.Coffee[idx]
    rating_el = data.Rating[idx]
    date_el = data.date[idx]
    idx_el = idx
    for s in review.split('.'):
        coffee.append(coffee_el)
        rating.append(rating_el)
        date.append(date_el)
        id_review.append(idx_el)
        a, b = find_similarities(documents, pre_processing(s))
        beans.append(b[0])
        roast.append(b[1])
        drinks.append(b[2])
        barista.append(b[3])
        coffee_sentiment.append(b[4])
        go.append(b[5])
        do.append(b[6])
        out.append(b[7])
        ambient.append(b[8])
        price.append(b[9])
        place_sentiment.append(b[10])
        baked.append(b[11])
        lunch.append(b[12])
        breakfast.append(b[13])
        snacks.append(b[14])
        food_sentiment.append(b[15])
        sentence.append(s)

Building a dataFrame of similarities

In [27]:
df_similarity = pd.DataFrame({'id review' : id_review,
                              'coffee' : coffee,
                              'rating' : rating,
                              'date' : date,
                              'beans' : beans,
                              'roast' : roast,
                              'drinks' : drinks,
                              'barista' : barista,
                              'coffee_sentiment' : coffee_sentiment,
                              'go' : go,
                              'do' : do,
                              'out' : out,
                              'ambient' : ambient,
                              'price' : price,
                              'place_sentiment' : place_sentiment,
                              'baked' : baked,
                              'lunch' : lunch,
                              'breakfast' : breakfast,
                              'snacks' : snacks,
                              'food_sentiment' : food_sentiment,
                              'sentence_normalized' : sentence
                              })

Filtering null similarities

In [28]:
df_similarity['sum'] = df_similarity.loc[:, 'beans':'food_sentiment'].sum(axis=1)
df_similarity_filter = df_similarity[df_similarity['sum']>0]
df_similarity_filter.shape

(7197, 22)

In [29]:
df_similarity_copy = df_similarity_filter.copy()
df_sc = df_similarity_copy

In [30]:
df_similarity_copy.head()

Unnamed: 0,id review,coffee,rating,date,beans,roast,drinks,barista,coffee_sentiment,go,...,ambient,price,place_sentiment,baked,lunch,breakfast,snacks,food_sentiment,sentence_normalized,sum
0,0,Réveille Coffee,1.0 star rating,4/16/2019,0.0,0.0,0.026689,0.013172,0.0,0.028212,...,0.050878,0.028748,0.240632,0.0519,0.027582,0.046318,0.0,0.0,"Folks, avoid this place unless you like being ...",0.643034
1,0,Réveille Coffee,1.0 star rating,4/16/2019,0.0,0.021167,0.029506,0.029123,0.0,0.0,...,0.022499,0.0,0.0,0.019126,0.076232,0.025603,0.0,0.097037,I went for a quick cappuccino only since they...,0.343937
2,0,Réveille Coffee,1.0 star rating,4/16/2019,0.0,0.0,0.036137,0.0,0.0,0.025466,...,0.0,0.0,0.0,0.0,0.0,0.015679,0.0,0.0,The girl behind the counter walked over and s...,0.077282
6,0,Réveille Coffee,1.0 star rating,4/16/2019,0.0,0.0,0.031296,0.03089,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,I said hello,0.062186
9,0,Réveille Coffee,1.0 star rating,4/16/2019,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.056433,0.0,0.0,0.0,0.0,0.0,again nothing but a stare,0.056433


Save the dataFrame in the folder **preprocessing_ml** as **similarity_features_vectors.csv**

In [31]:
#df_sc.to_csv('preprocessing_ml/similarity_features_vectors.csv')

#### Building Polarity features vectors

1. Measure the polarity of every sentence stored in the `sentence_normalized` column (as built above, this column holds the sentence text as split from the review).
2. Rescale the polarity from (-1, 1) to (0, 1).
3. Weight the similarity features by the rescaled polarity of every sentence.
4. Group the sentences by review id using the mean score of every feature. The final feature vector describes a review by the topics it mentions, weighted by how strong the similarity between its sentences and the coffee topics is and by the polarity of the message.

Defining sentiment analysis function:

In [32]:
def sentiment_parameters_Pattern(sentence):
    blob = TextBlob(sentence, analyzer=PatternAnalyzer())
    return blob.sentiment.polarity

Computing polarity pattern:

In [33]:
polarity = []
for sentence in df_sc['sentence_normalized']:
    polarity.append(sentiment_parameters_Pattern(sentence))
    
df_sc['polarity'] = polarity

In [34]:
df_sc.head()

Unnamed: 0,id review,coffee,rating,date,beans,roast,drinks,barista,coffee_sentiment,go,...,price,place_sentiment,baked,lunch,breakfast,snacks,food_sentiment,sentence_normalized,sum,polarity
0,0,Réveille Coffee,1.0 star rating,4/16/2019,0.0,0.0,0.026689,0.013172,0.0,0.028212,...,0.028748,0.240632,0.0519,0.027582,0.046318,0.0,0.0,"Folks, avoid this place unless you like being ...",0.643034,0.0
1,0,Réveille Coffee,1.0 star rating,4/16/2019,0.0,0.021167,0.029506,0.029123,0.0,0.0,...,0.0,0.0,0.019126,0.076232,0.025603,0.0,0.097037,I went for a quick cappuccino only since they...,0.343937,0.144444
2,0,Réveille Coffee,1.0 star rating,4/16/2019,0.0,0.0,0.036137,0.0,0.0,0.025466,...,0.0,0.0,0.0,0.0,0.015679,0.0,0.0,The girl behind the counter walked over and s...,0.077282,-0.4
6,0,Réveille Coffee,1.0 star rating,4/16/2019,0.0,0.0,0.031296,0.03089,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,I said hello,0.062186,0.0
9,0,Réveille Coffee,1.0 star rating,4/16/2019,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.056433,0.0,0.0,0.0,0.0,0.0,again nothing but a stare,0.056433,0.0


Rescaling the polarity pattern scores:

In [35]:
max_old = 1
max_new = 1
min_old = -1
min_new = 0

df_sc['polarity'] = df_sc['polarity'].apply(lambda v:((max_new - min_new)/(max_old - min_old))*(v - max_old) + max_new)
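
For reference, the lambda above is the usual linear map between two intervals. Writing $p$ for the original polarity and $p'$ for the rescaled one (symbols introduced here only for illustration), and substituting the four constants of the cell:

```latex
p' = \frac{\mathrm{max}_{new} - \mathrm{min}_{new}}{\mathrm{max}_{old} - \mathrm{min}_{old}}\,(p - \mathrm{max}_{old}) + \mathrm{max}_{new}
   = \frac{1 - 0}{1 - (-1)}\,(p - 1) + 1
   = \frac{p + 1}{2}
```

So a polarity of -1 maps to 0, a neutral polarity of 0 maps to 0.5, and a polarity of 1 stays at 1.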

Weighting the similarity features calculated previously by the rescaled polarity:

In [36]:
df_polarity_vector = df_sc[['beans', 'roast', 'drinks', 'barista',
                            'coffee_sentiment', 'go', 'do', 'out',
                            'ambient', 'price', 'place_sentiment',
                            'baked', 'lunch', 'breakfast', 'snacks',
                            'food_sentiment']].multiply(df_sc['polarity'], axis="index")

Transforming the rating column to float:

In [37]:
df_polarity_vector['id review'] = df_sc['id review']
df_polarity_vector['rating'] = df_sc['rating'].str.split(' star rating').str.get(0).astype(float)

Group sentences by id review:

In [39]:
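# Mask zeros in the feature columns only, so the mean skips sentences that do
# not mention a topic and the review with 'id review' 0 keeps its group key
feature_cols = df_polarity_vector.columns.drop(['id review', 'rating'])
df_polarity_vector[feature_cols] = df_polarity_vector[feature_cols].replace(0, np.nan)
df_polarity_vector = df_polarity_vector.groupby('id review').mean()
df_polarity_vector = df_polarity_vector.fillna(0)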
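
The effect of this aggregation can be sketched on a toy frame (the three rows and two features below are made up). Zeros are masked only in the feature columns, so a review whose id is 0 keeps its key, and the mean of a feature averages only the sentences that actually mention the topic.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'id review': [0, 0, 1],
    'beans':     [0.2, 0.0, 0.4],
    'price':     [0.0, 0.0, 0.1],
})

features = ['beans', 'price']

# Treat zero scores as missing in the feature columns only.
toy[features] = toy[features].replace(0, np.nan)

# Mean per review ignores the missing values; topics never mentioned go back to 0.
per_review = toy.groupby('id review').mean().fillna(0)
print(per_review)
```

For review 0, `beans` is 0.2 rather than 0.1, because the sentence with no similarity to that topic is left out of the mean.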

Save the dataFrame in the folder **preprocessing_ml** as **features_extended_vectors.csv**

In [40]:
#df_polarity_vector.to_csv('preprocessing_ml/features_extended_vectors.csv')