# Extension of Exploratory Data Analysis

In this section, we explore the polarity and subjectivity of the customer reviews dataset and the length of the reviews, and we look for similarities and differences between reviews that can feed the feature vectors built in the later Machine Learning sections. Customer reviews will be used to predict satisfaction and to investigate the causes of good or bad experiences, grouped into categories. This lets us split coffee shops into clusters according to how positive or negative the mentions are in each category. Additionally, the blog reviews are used exclusively to build clusters of coffee shops based on features extracted from their descriptions.

## Guideline

The main goals here are the following:
### To Resolve the Predictive Score Modeling:
- Compare different polarity algorithms to determine which is the most appropriate in this scenario.
- Determine whether polarity scores should be calculated per review, or per sentence within each review and then aggregated (by the average or some percentile) to estimate the polarity of the review.
- Determine whether it is useful to use more than one polarity algorithm as a feature.
- Look for other potential language features (such as the length of the message, the distribution of polarity and subjectivity within the review, type of language, slang, uppercase expressions, etc.).
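
The per-review versus per-sentence question above can be sketched as follows. This is a minimal illustration, not the notebook's method: `split_sentences`, `percentile`, `review_polarity`, `TOY_LEXICON` and `toy_polarity` are hypothetical names, the sentence splitter is a naive regex, and the toy lexicon only stands in for a real polarity algorithm so that the example runs without extra packages.

```python
import math
import re
from statistics import mean

# Hypothetical toy lexicon; a real run would plug in an actual polarity algorithm.
TOY_LEXICON = {"love": 1.0, "great": 0.8, "slow": -0.5, "terrible": -1.0}


def toy_polarity(sentence):
    """Average the lexicon scores of the known words (0.0 if none are known)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    hits = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return mean(hits) if hits else 0.0


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def percentile(values, q):
    """Nearest-rank percentile for q in [0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


def review_polarity(text, polarity_fn, q=None):
    """Score each sentence, then aggregate by mean (q=None) or by percentile q."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    scores = [polarity_fn(s) for s in sentences]
    return mean(scores) if q is None else percentile(scores, q)


review = "I love the espresso. The service was terrible! Great location."
print(review_polarity(review, toy_polarity))        # mean over sentences
print(review_polarity(review, toy_polarity, q=25))  # low percentile highlights the complaint
```

A low percentile keeps a single strongly negative sentence visible, whereas the mean can wash it out. In this notebook, `polarity_fn` could be a wrapper that returns only the polarity from the Pattern-based function defined further down.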


### To Resolve the topics that people mention in their reviews:
- Create a dictionary of words per topic.
- Use hierarchical topic modeling.
- Use TF-IDF and select some keywords.
- Use word embeddings.
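
As one illustration of the TF-IDF option in the list above, the sketch below ranks the terms of each document by term frequency times inverse document frequency, using only the standard library. The function names and the three sample sentences are made up for the example, and the unsmoothed `log(N / df)` weighting is an assumption; library implementations typically smooth this differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(documents, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    keywords = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:top_k])
    return keywords


# Made-up sentences, used only to exercise the function.
docs = [
    "the wifi is fast and the wifi is free",
    "the espresso is rich and the espresso is smooth",
    "the staff is friendly",
]
print(tfidf_keywords(docs, top_k=2))
```

Words that occur in every document, such as "the" and "is" here, get a weight of zero, which is why TF-IDF tends to surface topic-specific vocabulary.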

### To Resolve the unsupervised clustering of coffee shops:
- Use information from blogs and brief descriptions.
- Identify features (such as the presence or absence of wifi, music, decoration, tables and others).
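
The presence-or-absence features mentioned above could be extracted with simple keyword patterns, as in this sketch. `AMENITY_PATTERNS` and `amenity_features` are hypothetical names, and the vocabulary is an assumption that would need tuning against the real blog descriptions.

```python
import re

# Hypothetical vocabulary: one regular expression per amenity.
AMENITY_PATTERNS = {
    "wifi": r"\bwi-?fi\b",
    "music": r"\b(music|playlist|dj)\b",
    "decoration": r"\b(decor|decoration|interior)\b",
    "tables": r"\b(table|tables|seating)\b",
}


def amenity_features(description):
    """Map a description to 0/1 flags, one per amenity pattern."""
    text = description.lower()
    return {
        name: int(re.search(pattern, text) is not None)
        for name, pattern in AMENITY_PATTERNS.items()
    }


# Made-up description, used only to exercise the function.
print(amenity_features("Free Wi-Fi, great playlist and plenty of tables."))
```

Stacking one such dictionary per coffee shop gives a binary feature matrix that a clustering algorithm can consume. Note that a keyword match does not distinguish "has wifi" from "no wifi", so negation would need separate handling.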

Importing relevant packages

In [1]:
import warnings

warnings.filterwarnings("ignore")


from textblob.sentiments import PatternAnalyzer
from textblob import TextBlob
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import datetime
import re

Defining the **sentiment parameters pattern** function using one of the sentiment analyzers provided by *TextBlob*

In [2]:
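def sentiment_parameters_Pattern(sentence):
    """Return (polarity, subjectivity) of `sentence` using TextBlob's PatternAnalyzer."""
    blob = TextBlob(sentence, analyzer=PatternAnalyzer())
    # Polarity lies in [-1, 1]; subjectivity lies in [0, 1].
    sentiment = blob.sentiment
    return sentiment.polarity, sentiment.subjectivity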

### Customer reviews dataset

In [3]:
df_reviews = pd.read_csv("../Data_Extraction/Reviews/reviews_rating_date.csv",
                         usecols=['Coffee', 'Description', 'Rating', 'date'])

In [4]:
df_reviews.head()

| | Coffee | Description | Rating | date |
|---|---|---|---|---|
| 0 | Réveille Coffee | This is a cute coffee shop, I love the ambianc... | 4.0 star rating | 5/10/2019 |
| 1 | Réveille Coffee | I wanted to like this place, however the venti... | 2.0 star rating | 4/20/2019 |
| 2 | Réveille Coffee | I didn't tried brunch before in another locati... | 5.0 star rating | 4/18/2019 |
| 3 | Réveille Coffee | Folks, avoid this place unless you like being ... | 1.0 star rating | 4/16/2019 |
| 4 | Réveille Coffee | A nice compact coffee shop at Castro area. Qua... | 4.0 star rating | 4/16/2019 |


In [5]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3548 entries, 0 to 3547
Data columns (total 4 columns):
Coffee         3268 non-null object
Description    3548 non-null object
Rating         3548 non-null object
date           3548 non-null object
dtypes: object(4)
memory usage: 111.0+ KB


In [5]:
df_reviews['scores'] = df_reviews['Rating'].str.split(' star rating').str.get(0).astype(float)

In [10]:
df_reviews.drop(columns='Rating', inplace=True)

In [6]:
df_reviews['Polarity_Pattern'] = [sentiment_parameters_Pattern(r)[0] for r in df_reviews['Description']]

In [7]:
df_reviews['Subjectivity_Pattern'] = [sentiment_parameters_Pattern(r)[1] for r in df_reviews['Description']]

In [8]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3548 entries, 0 to 3547
Data columns (total 7 columns):
Coffee                  3268 non-null object
Description             3548 non-null object
Rating                  3548 non-null object
date                    3548 non-null object
scores                  3548 non-null float64
Polarity_Pattern        3548 non-null float64
Subjectivity_Pattern    3548 non-null float64
dtypes: float64(3), object(4)
memory usage: 194.1+ KB


In [8]:
df_reviews.head()

| | Coffee | Description | Rating | date | scores | Polarity_Pattern | Subjectivity_Pattern |
|---|---|---|---|---|---|---|---|
| 0 | Réveille Coffee | This is a cute coffee shop, I love the ambianc... | 4.0 star rating | 5/10/2019 | 4.0 | 0.250758 | 0.692424 |
| 1 | Réveille Coffee | I wanted to like this place, however the venti... | 2.0 star rating | 4/20/2019 | 2.0 | 0.051736 | 0.624306 |
| 2 | Réveille Coffee | I didn't tried brunch before in another locati... | 5.0 star rating | 4/18/2019 | 5.0 | 0.666667 | 0.616667 |
| 3 | Réveille Coffee | Folks, avoid this place unless you like being ... | 1.0 star rating | 4/16/2019 | 1.0 | 0.07 | 0.607778 |
| 4 | Réveille Coffee | A nice compact coffee shop at Castro area. Qua... | 4.0 star rating | 4/16/2019 | 4.0 | 0.39 | 0.61 |
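
One possible way to judge how well a polarity algorithm tracks the star ratings is a rank correlation between `scores` and the polarity column. The sketch below computes Spearman's correlation by hand on the five preview rows shown above; `average_ranks`, `pearson` and `spearman` are illustrative helpers, not part of the notebook.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# The five preview rows from the table above.
scores = [4.0, 2.0, 5.0, 1.0, 4.0]
polarity_pattern = [0.250758, 0.051736, 0.666667, 0.07, 0.39]
print(round(spearman(scores, polarity_pattern), 3))
```

On these five rows the value is about 0.87, but five reviews are far too few to draw conclusions; the same computation on the full columns, for each candidate polarity algorithm, would give a fairer comparison. `scipy.stats.spearmanr`, reachable through the `stats` import at the top of the notebook, should return the same coefficient.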


In [None]:
# import nltk
# nltk.download('vader_lexicon')
# from nltk.sentiment.vader import SentimentIntensityAnalyzer
# sid = SentimentIntensityAnalyzer()