# Extension of Exploratory Data Analysis

In this section, we explore the polarity and subjectivity of the customer reviews dataset and the length of the reviews, looking for similarities and differences between reviews that can be used to build feature vectors in the following Machine Learning sections. Customer reviews will be used to predict satisfaction and to investigate the causes of good or bad experiences, which we define as categories. This lets us split coffee shops into clusters according to how positive or negative the mentions are in each category. Additionally, the blog reviews are used exclusively to build clusters of coffee shops based on features extracted from the descriptions.

## Guideline

The main purposes here are the following:
### To Resolve the Predictive Score Modeling:
- Compare different polarity algorithms to determine which is the most appropriate in this scenario.
- Determine whether to calculate polarity scores per review, or per sentence within each review and then take the average or some percentile to estimate the polarity of the review.
- Determine whether it is useful to use more than one polarity algorithm as a feature.
- Look for more potential language features (such as the length of the message, the distribution of polarity and subjectivity within the review, type of language, slang, uppercase expressions, etc.).
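
As an illustration of the last point, here is a minimal sketch of a few surface-level language features. The function name `language_features` and the choice of features are assumptions made for this example, not the final feature set.

```python
import re

def language_features(text):
    """Return a few surface features of a review (illustrative only)."""
    words = re.findall(r"[A-Za-z']+", text)
    # Words written fully in capitals, ignoring one-letter words such as "I"
    shouted = [w for w in words if len(w) > 1 and w.isupper()]
    return {
        "n_chars": len(text),
        "n_words": len(words),
        "n_uppercase_words": len(shouted),
        "n_exclamations": text.count("!"),
    }

example = "The one outlet they do have is VERY loose. Enjoy!"
print(language_features(example))
```

Each key could become one column of the review dataframe, next to the polarity scores.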


### To Resolve topics that people mention in their reviews:
- Create a dictionary of words per topic.
- Use hierarchical topic modeling.
- Use TF-IDF and select some keywords.
- Use word embeddings.
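
The first option, a dictionary of words per topic, can be sketched as follows. The topics and keywords in `TOPIC_KEYWORDS` are hypothetical placeholders chosen for this example; the real lists would have to be built from the reviews.

```python
import re

# Hypothetical topic dictionary: placeholder keywords, not the project's final lists
TOPIC_KEYWORDS = {
    "coffee": {"coffee", "espresso", "latte"},
    "service": {"service", "staff", "barista"},
    "workspace": {"wifi", "outlet", "charger", "tables"},
}

def topics_mentioned(text, topic_keywords=TOPIC_KEYWORDS):
    """Return the topics whose keywords appear at least once in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(topic for topic, kws in topic_keywords.items() if tokens & kws)

review = "I love the ambiance, the coffee and service are great. Remember, no wifi on the weekends."
print(topics_mentioned(review))  # ['coffee', 'service', 'workspace']
```

Exact keyword matching is the simplest possible approach; it misses synonyms and misspellings, which is one reason to also consider TF-IDF or word embeddings.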

### To Resolve the unsupervised clustering of coffee shops:
- Use information from blogs and brief descriptions.
- Identify features (such as the presence or absence of wifi, music, decoration, tables and others).

# 1. Features for predictive score modeling

Importing relevant packages

In [1]:
import warnings

warnings.filterwarnings("ignore")

from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import datetime
import re

### a. TextBlob

We define the **sentiment parameters** function using one of the sentiment analyzers provided by *TextBlob*, the `PatternAnalyzer`.

In [2]:
from textblob.sentiments import PatternAnalyzer
from textblob import TextBlob

def sentiment_parameters_textblob(text_data):
    blob = TextBlob(text_data, analyzer=PatternAnalyzer())
    return blob.sentiment.polarity, blob.sentiment.subjectivity

In this case, we get a **polarity** score in the range [-1, 1] and a **subjectivity** score in the range [0, 1].

### b. nltk Sentiment Intensity Analyzer

In [3]:
import nltk

nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sentiment_nltk = SentimentIntensityAnalyzer()

def sentiment_parameters_nltk(text_data):
    return sentiment_nltk.polarity_scores(text_data)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/daniela/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
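
`polarity_scores` returns a dictionary with the keys `neg`, `neu`, `pos` and `compound`. To our understanding of the VADER reference implementation, the `compound` value is the sum of the word valences squashed into (-1, 1) with a normalization constant `alpha = 15`. The sketch below reproduces only that squashing step; the function name `vader_style_normalize` is ours, and the constant should be treated as an assumption to check against the library source.

```python
import math

def vader_style_normalize(raw_sum, alpha=15):
    """Squash an unbounded sum of word valences into the open interval (-1, 1)."""
    return raw_sum / math.sqrt(raw_sum * raw_sum + alpha)

# Larger absolute sums saturate towards -1 or 1
for raw in (-8.0, -1.0, 0.0, 1.0, 4.0, 8.0):
    print(raw, round(vader_style_normalize(raw), 3))
```

This saturation means that long, strongly worded reviews tend to end up with `compound` values close to -1 or 1.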


### c. FLAIR Sentiment Classifier

In [4]:
import sys
#!{sys.executable} -m pip install flair
import flair

flair_sentiment = flair.models.TextClassifier.load('en-sentiment')

def sentiment_parameters_flair(text_data):
    s = flair.data.Sentence(text_data)
    flair_sentiment.predict(s)
    return s.labels

2020-06-14 18:05:37,000 loading file /Users/daniela/.flair/models/sentiment-en-mix-distillbert.pt
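
Each Flair label carries a class (`value`) and a confidence (`score`), so unlike the other two analyzers it does not give a single signed number. If one is needed for comparison, a possible mapping is sketched below. The helper `signed_confidence` is hypothetical and is not part of Flair.

```python
def signed_confidence(value, score):
    """Map a ('POSITIVE' | 'NEGATIVE', confidence) pair to one number in [-1, 1]."""
    if value == "POSITIVE":
        return score
    if value == "NEGATIVE":
        return -score
    raise ValueError(f"Unexpected label: {value!r}")

print(signed_confidence("POSITIVE", 0.9985))  # 0.9985
print(signed_confidence("NEGATIVE", 0.9999))  # -0.9999
```

Note that the confidence measures how sure the classifier is, not how intense the sentiment is, so the signed value is not directly comparable to a polarity score.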


### Customer reviews dataset

In [5]:
df_reviews = pd.read_csv("../Data_Extraction/Reviews/reviews_rating_date.csv",
                         usecols=['Coffee', 'Description', 'Rating', 'date'])

In [6]:
df_reviews.tail()

Unnamed: 0,Coffee,Description,Rating,date
3543,Red Door,Great atmosphere. Awesome coffee. One of my fa...,4.0 star rating,2/26/2018
3544,Red Door,This place was great- recommended to me by som...,4.0 star rating,2/5/2018
3545,Red Door,I really like the concept of this place - coff...,3.0 star rating,12/15/2017
3546,Red Door,"I love the artsy vibes in this cafe, because i...",4.0 star rating,10/29/2017
3547,Red Door,Really nice coffee shop within an art gallery....,5.0 star rating,10/17/2017


In [7]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3548 entries, 0 to 3547
Data columns (total 4 columns):
Coffee         3268 non-null object
Description    3548 non-null object
Rating         3548 non-null object
date           3548 non-null object
dtypes: object(4)
memory usage: 111.0+ KB


### Pre-processing steps

In [8]:
df_reviews['scores'] = df_reviews['Rating'].str.split(' star rating').str.get(0).astype(float)

In [9]:
df_reviews.drop(columns='Rating', inplace=True)

In [10]:
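import datetime

def validate_date(date_str):
    # Parse the first line of the raw field, expected as MM/DD/YYYY
    try:
        date_updated = datetime.datetime.strptime(date_str.splitlines()[0], "%m/%d/%Y")
    except (ValueError, IndexError, AttributeError) as err:
        # Show the offending value before failing
        print(date_str)
        raise ValueError("Incorrect date format, should be MM/DD/YYYY") from err
    return date_updated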

In [11]:
df_reviews['valid_date'] =  df_reviews.date.apply(validate_date)

In [12]:
df_reviews.drop(columns='date', inplace=True)

### 1.1 Sentiment Analysis: Polarity and Subjectivity Patterns using TextBlob, NLTK and Flair:

In [13]:
df_reviews['polarity_textblob'] = [sentiment_parameters_textblob(r)[0] for r in df_reviews['Description']]

In [14]:
df_reviews['subjectivity_textblob'] = [sentiment_parameters_textblob(r)[1] for r in df_reviews['Description']]

In [20]:
df_reviews['neg_nltk'] = [sentiment_parameters_nltk(r)['neg'] for r in df_reviews['Description']]

In [21]:
df_reviews['neu_nltk'] = [sentiment_parameters_nltk(r)['neu'] for r in df_reviews['Description']]

In [22]:
df_reviews['pos_nltk'] = [sentiment_parameters_nltk(r)['pos'] for r in df_reviews['Description']]

In [23]:
df_reviews['compound_nltk'] = [sentiment_parameters_nltk(r)['compound'] for r in df_reviews['Description']]

In [34]:
df_reviews['value_flair'] = [sentiment_parameters_flair(r)[0].value for r in df_reviews['Description']]

In [35]:
df_reviews['score_flair'] = [sentiment_parameters_flair(r)[0].score for r in df_reviews['Description']]

In [36]:
df_reviews.head()

Unnamed: 0,Coffee,Description,scores,valid_date,polarity_textblob,subjectivity_textblob,neg_nltk,neu_nltk,pos_nltk,compound_nltk,value_flair,score_flair
0,Réveille Coffee,"This is a cute coffee shop, I love the ambianc...",4.0,2019-05-10,0.250758,0.692424,0.102,0.661,0.237,0.9308,POSITIVE,0.998546
1,Réveille Coffee,"I wanted to like this place, however the venti...",2.0,2019-04-20,0.051736,0.624306,0.075,0.826,0.099,0.1606,NEGATIVE,0.999996
2,Réveille Coffee,I didn't tried brunch before in another locati...,5.0,2019-04-18,0.666667,0.616667,0.0,0.655,0.345,0.9538,POSITIVE,0.997934
3,Réveille Coffee,"Folks, avoid this place unless you like being ...",1.0,2019-04-16,0.07,0.607778,0.075,0.822,0.103,0.6261,NEGATIVE,0.999989
4,Réveille Coffee,A nice compact coffee shop at Castro area. Qua...,4.0,2019-04-16,0.39,0.61,0.0,0.647,0.353,0.8689,POSITIVE,0.96848


### 1.2 Readability tests applied to reviews

In this section, we use **textstat**, a Python package that calculates statistics from text to determine the readability, complexity and grade level of a corpus. The scores tell us how easy or difficult the reviews are to understand.
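
To make one of these scores concrete, the Flesch reading ease formula is `206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)`, where higher values mean easier text. The sketch below applies it with a deliberately naive syllable counter (groups of consecutive vowels), so its values will differ from those of textstat, which uses its own tokenization and syllable counting. Both function names are ours.

```python
import re

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least one per word)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease_sketch(text):
    """Approximate Flesch reading ease; not a replacement for textstat."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("Text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

print(round(flesch_reading_ease_sketch("The cat sat. The dog ran."), 2))  # 119.19
```

Short sentences made of short words score high, as in the example; long sentences with many polysyllabic words pull the score down.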

In [49]:
!{sys.executable} -m pip install textstat

import textstat



In [50]:
def flesch(text_data):
    return textstat.flesch_reading_ease(text_data)

def smog_index(text_data):
    return textstat.smog_index(text_data)

def kincaid_grade(text_data):
    return textstat.flesch_kincaid_grade(text_data)

def coleman_liau_index(text_data):
    return textstat.coleman_liau_index(text_data)

def ari(text_data):
    return textstat.automated_readability_index(text_data)

def dale_chall_readability(text_data):
    return textstat.dale_chall_readability_score(text_data)

def difficult_words(text_data):
    return textstat.difficult_words(text_data)

def linsear_write(text_data):
    return textstat.linsear_write_formula(text_data)

def gunning_fog(text_data):
    return textstat.gunning_fog(text_data)

def text_standard(text_data):
    return textstat.text_standard(text_data)

In [51]:
df_reviews['flesch'] = [flesch(r) for r in df_reviews['Description']]

In [52]:
df_reviews['smog_index'] = [smog_index(r) for r in df_reviews['Description']]

In [53]:
df_reviews['kincaid_grade'] = [kincaid_grade(r) for r in df_reviews['Description']]

In [54]:
df_reviews['coleman_liau_index'] = [coleman_liau_index(r) for r in df_reviews['Description']]

In [55]:
df_reviews['ari'] = [ari(r) for r in df_reviews['Description']]

In [56]:
df_reviews['dale_chall_readability'] = [dale_chall_readability(r) for r in df_reviews['Description']]

In [57]:
df_reviews['difficult_words'] = [difficult_words(r) for r in df_reviews['Description']]

In [58]:
df_reviews['linsear_write'] = [linsear_write(r) for r in df_reviews['Description']]

In [59]:
df_reviews['gunning_fog'] = [gunning_fog(r) for r in df_reviews['Description']]

In [60]:
df_reviews['text_standard'] = [text_standard(r) for r in df_reviews['Description']]

In [61]:
df_reviews.head()

Unnamed: 0,Coffee,Description,scores,valid_date,polarity_textblob,subjectivity_textblob,neg_nltk,neu_nltk,pos_nltk,compound_nltk,...,flesch,smog_index,kincaid_grade,coleman_liau_index,ari,dale_chall_readability,difficult_words,linsear_write,gunning_fog,text_standard
0,Réveille Coffee,"This is a cute coffee shop, I love the ambianc...",4.0,2019-05-10,0.250758,0.692424,0.102,0.661,0.237,0.9308,...,79.09,7.2,6.6,7.31,8.3,6.53,9,8.25,7.57,6th and 7th grade
1,Réveille Coffee,"I wanted to like this place, however the venti...",2.0,2019-04-20,0.051736,0.624306,0.075,0.826,0.099,0.1606,...,68.1,9.6,8.7,7.31,9.0,6.66,18,8.5,9.71,8th and 9th grade
2,Réveille Coffee,I didn't tried brunch before in another locati...,5.0,2019-04-18,0.666667,0.616667,0.0,0.655,0.345,0.9538,...,79.97,8.8,4.2,7.04,5.1,5.96,4,4.125,6.96,6th and 7th grade
3,Réveille Coffee,"Folks, avoid this place unless you like being ...",1.0,2019-04-16,0.07,0.607778,0.075,0.822,0.103,0.6261,...,68.5,10.6,8.6,7.89,9.9,6.11,13,7.571429,10.18,7th and 8th grade
4,Réveille Coffee,A nice compact coffee shop at Castro area. Qua...,4.0,2019-04-16,0.39,0.61,0.0,0.647,0.353,0.8689,...,72.53,8.8,5.0,8.77,6.6,7.59,5,3.666667,6.56,6th and 7th grade


### 1.3 Sentiment Analysis by sentence

In [None]:
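# Per-sentence polarity for every review. The scorer is missing from the original fragment; TextBlob polarity is assumed here.
polarity = [[sentiment_parameters_textblob(sentence)[0] for sentence in review.split('.')[:-1]]
            for review in df_reviews['Description']]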

In [37]:
df_reviews.Description[0]

'This is a cute coffee shop, I love the ambiance, the coffee and service are great. The only downside to coming to this cafe is that they only have one outlet making it hard to charge devices while working, the one outlet they do have is VERY loose (meaning your charger will fall out of it). Other than that, nice and friendly environment. Remember, no wifi on the weekends. Enjoy!'
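
A minimal sketch of the per-sentence approach, assuming we split on `.`, `!` and `?` (so that a closing sentence such as "Enjoy!" is kept) and then combine the sentence scores into one review score. The names `split_sentences`, `review_polarity` and `toy_scorer` are ours; `toy_scorer` only exists to make the example runnable and would be replaced by one of the analyzers defined above.

```python
import re
from statistics import mean

def split_sentences(review):
    """Split on '.', '!' or '?' and drop empty fragments."""
    return [s.strip() for s in re.split(r"[.!?]+", review) if s.strip()]

def review_polarity(review, scorer, aggregate=mean):
    """Score each sentence with `scorer` and combine the scores with `aggregate`."""
    scores = [scorer(s) for s in split_sentences(review)]
    return aggregate(scores) if scores else 0.0

def toy_scorer(sentence):
    """Stand-in scorer: +1 for a positive cue, -1 for a negative cue."""
    lowered = sentence.lower()
    positive = any(w in lowered for w in ("great", "nice", "love"))
    negative = any(w in lowered for w in ("loose", "downside"))
    return float(positive) - float(negative)

review = "Great coffee. The outlet is loose. Enjoy!"
print(split_sentences(review))
print(review_polarity(review, toy_scorer))                  # mean of the sentence scores
print(review_polarity(review, toy_scorer, aggregate=min))   # most negative sentence
```

Passing the aggregation as a parameter makes it easy to compare the mean against a minimum or a percentile, which is one of the open questions listed in the guideline.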

In [None]:
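# Flatten the per-sentence polarities into parallel lists (one entry per sentence)
p, rating, name, date = [], [], [], []
n_review = []
n_sentences = []

for x, y in enumerate(polarity):
    # Review-level attributes are repeated for each sentence of the review
    r = df_reviews['scores'][x]
    n = df_reviews['Coffee'][x]
    d = df_reviews['valid_date'][x]
    for k, elem in enumerate(y, start=1):
        p.append(elem)
        rating.append(r)
        name.append(n)
        date.append(d)
        n_review.append(x)      # index of the parent review
        n_sentences.append(k)   # 1-based position of the sentence in the review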