# Extension of Exploratory Data Analysis

In this section, we explore the polarity and subjectivity of the customer reviews dataset and the length of the reviews, and we look for similarities and differences between reviews that can be used to build feature vectors in the later Machine Learning sections. Customer reviews will be used to predict satisfaction and to investigate the causes of good or bad experiences, grouped into categories. This will let us split coffee shops into clusters according to how positive or negative the mentions in each category are. Additionally, the blog reviews are used exclusively to build clusters of coffee shops based on features extracted from their descriptions.

## Guideline

The main purposes here are the following:
### To Resolve the Predictive Score Modeling:
- Compare different polarity algorithms to determine which is the most appropriate in this scenario.
- Determine whether polarity scores should be calculated per review, or per sentence within each review and then aggregated (with the average or some percentile) to estimate the polarity of the review.
- Determine whether it is useful to use more than one polarity algorithm as a feature.
- Look for more potential language features (such as the length of the message, the distribution of polarity and subjectivity within the review, type of language, slang, uppercase expressions, etc.).
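To make the per-review versus per-sentence question concrete, here is a minimal sketch of both options. The lexicon, the `toy_polarity` scorer and the naive sentence splitter are hypothetical stand-ins chosen so the example runs without external packages; any of the analyzers defined below could be passed as `scorer` instead.

```python
import re
import statistics

# Hypothetical toy lexicon (an assumption for illustration only).
TOY_LEXICON = {"great": 1.0, "love": 1.0, "good": 0.5, "slow": -0.5, "terrible": -1.0}

def toy_polarity(text):
    """Mean lexicon score of the known words in text; 0.0 if none is known."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return statistics.mean(scores) if scores else 0.0

def split_sentences(review):
    """Naive split after ., ! or ?; a real tokenizer handles more cases."""
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    return [p for p in parts if p]

def review_polarity_summary(review, scorer=toy_polarity):
    """Compare one whole-review score with aggregates of per-sentence scores."""
    # Fall back to a single neutral score when the review has no sentences.
    per_sentence = [scorer(s) for s in split_sentences(review)] or [0.0]
    return {
        "whole_review": scorer(review),
        "sentence_mean": statistics.mean(per_sentence),
        "sentence_median": statistics.median(per_sentence),
        "sentence_min": min(per_sentence),
    }

summary = review_polarity_summary(
    "Great coffee. I love the place. The service was terrible and slow."
)
print(summary)
```

In this toy example the whole-review score is mildly positive, while the minimum sentence score exposes the strongly negative last sentence. That kind of gap is what the comparison above is meant to detect.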


### To Resolve topics that people mention in their reviews:
- Create a dictionary of words per topic.
- Use hierarchical topic modeling.
- Use TF-IDF and select some keywords.
- Use word embeddings.
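As an illustration of the TF-IDF option, the following sketch ranks the terms of each document with a plain-Python TF-IDF. It assumes tf = count / document length and idf = log(N / df), which is only one common variant (library implementations usually smooth the idf). The three short reviews are invented examples, not rows from the dataset.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_keywords(documents, top_k=3):
    """Return the top_k highest-scoring terms for each document."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    keywords = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
        keywords.append([term for term, _ in ranked[:top_k]])
    return keywords

reviews = [
    "the espresso was rich and the espresso was strong",
    "the wifi was fast and the wifi was free",
    "the staff was friendly and the staff was quick",
]
top_terms = tfidf_keywords(reviews, top_k=1)
print(top_terms)
```

Words that occur in every document (such as "the" or "was") get an idf of zero, so they never rank as keywords.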

### To Resolve the unsupervised clustering of coffee shops:
- Use information from blogs and brief descriptions.
- Identify features (such as the presence or absence of wifi, music, decoration, tables and others).
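One simple way to turn a description into presence/absence features is keyword matching. The sketch below is only an assumption about how this could look: the `FEATURE_KEYWORDS` lists and the sample description are hypothetical, and the real vocabulary would have to come from the blog texts.

```python
import re

# Hypothetical keyword lists, one per feature (an assumption for illustration).
FEATURE_KEYWORDS = {
    "wifi": ["wifi", "wi-fi", "internet"],
    "music": ["music", "playlist", "jazz"],
    "decoration": ["decor", "decoration", "art"],
    "tables": ["table", "tables", "seating"],
}

def presence_features(description):
    """Map a description to a dict of 0/1 flags, one per feature."""
    # Tokens are lowercase words; hyphenated words such as "wi-fi" stay whole.
    tokens = set(re.findall(r"[a-z]+(?:-[a-z]+)*", description.lower()))
    return {
        feature: int(any(keyword in tokens for keyword in keywords))
        for feature, keywords in FEATURE_KEYWORDS.items()
    }

features = presence_features(
    "Cozy spot with free Wi-Fi, local art on the walls and plenty of seating."
)
print(features)
```

The resulting binary vectors can then be stacked, one row per coffee shop, as input for a clustering algorithm.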

# 1. Features for predictive score modeling

## 1.1 Compare different polarity algorithms

Importing relevant packages

In [1]:
import warnings

warnings.filterwarnings("ignore")

from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import datetime
import re

### a. TextBlob

Define the **sentiment parameters** function using `PatternAnalyzer`, one of the sentiment analyzers provided by *TextBlob*.

In [2]:
from textblob.sentiments import PatternAnalyzer
from textblob import TextBlob

def sentiment_parameters_textblob(sentence):
    blob = TextBlob(sentence, analyzer=PatternAnalyzer())
    return blob.sentiment.polarity, blob.sentiment.subjectivity

In this case, we get a **polarity** score in the range [-1, 1] and a **subjectivity** score in the range [0, 1].

### b. nltk Sentiment Intensity Analyzer

In [3]:
import nltk

nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sentiment_nltk = SentimentIntensityAnalyzer()

def sentiment_parameters_nltk(sentence):
    return sentiment_nltk.polarity_scores(sentence)

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/daniela/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
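The analyzer returns a dictionary with `neg`, `neu`, `pos` and `compound` keys. If a categorical label is needed, the `compound` score can be thresholded. The sketch below assumes the ±0.05 cut-offs commonly suggested in VADER's documentation; `label_from_compound` is a hypothetical helper, and the example dictionary is the output shown for the first review later in this notebook.

```python
def label_from_compound(compound, threshold=0.05):
    """Turn a VADER compound score in [-1, 1] into a coarse label.

    The default threshold is a convention, not a tuned value.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

example_scores = {'neg': 0.102, 'neu': 0.661, 'pos': 0.237, 'compound': 0.9308}
label = label_from_compound(example_scores['compound'])
print(label)
```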


### c. FLAIR Sentiment Classifier

In [4]:
import sys
#!{sys.executable} -m pip install flair
import flair

flair_sentiment = flair.models.TextClassifier.load('en-sentiment')

def sentiment_parameters_flair(sentence):
    s = flair.data.Sentence(sentence)
    flair_sentiment.predict(s)
    return s.labels

2020-06-14 18:05:37,000 loading file /Users/daniela/.flair/models/sentiment-en-mix-distillbert.pt
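Unlike the other two analyzers, this classifier returns a label (`value`) together with a confidence (`score`). To compare it with the polarity scores above, one option is to fold both into a single signed number. This is a sketch under the assumption that `value` is either `'POSITIVE'` or `'NEGATIVE'` and that `score` lies in [0, 1]; `signed_flair_score` is a hypothetical helper, not part of flair.

```python
def signed_flair_score(value, score):
    """Combine a label and its confidence into one number in [-1, 1]."""
    signs = {"POSITIVE": 1.0, "NEGATIVE": -1.0}
    if value not in signs:
        raise ValueError(f"Unexpected label: {value!r}")
    return signs[value] * score

print(signed_flair_score("POSITIVE", 0.98))
print(signed_flair_score("NEGATIVE", 0.75))
```

Note that the result reflects the classifier's confidence rather than sentiment intensity, so it is not directly equivalent to a polarity score.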


### Customer reviews dataset

In [5]:
df_reviews = pd.read_csv("../Data_Extraction/Reviews/reviews_rating_date.csv",
                         usecols=['Coffee', 'Description', 'Rating', 'date'])

In [6]:
df_reviews.tail()

| | Coffee | Description | Rating | date |
|---|---|---|---|---|
| 3543 | Red Door | Great atmosphere. Awesome coffee. One of my fa... | 4.0 star rating | 2/26/2018 |
| 3544 | Red Door | This place was great- recommended to me by som... | 4.0 star rating | 2/5/2018 |
| 3545 | Red Door | I really like the concept of this place - coff... | 3.0 star rating | 12/15/2017 |
| 3546 | Red Door | I love the artsy vibes in this cafe, because i... | 4.0 star rating | 10/29/2017 |
| 3547 | Red Door | Really nice coffee shop within an art gallery.... | 5.0 star rating | 10/17/2017 |


In [7]:
df_reviews.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3548 entries, 0 to 3547
Data columns (total 4 columns):
Coffee         3268 non-null object
Description    3548 non-null object
Rating         3548 non-null object
date           3548 non-null object
dtypes: object(4)
memory usage: 111.0+ KB


### Pre-processing steps

In [8]:
df_reviews['scores'] = df_reviews['Rating'].str.split(' star rating').str.get(0).astype(float)
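The split above works as long as every value has the form `"<number> star rating"`. A slightly more defensive variant, sketched here with a regular expression, returns NaN for anything that does not match instead of raising during the float conversion. `parse_rating` and `RATING_PATTERN` are hypothetical names introduced for this example.

```python
import re

# Matches strings such as "4.0 star rating" and captures the number.
RATING_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*star rating\s*$")

def parse_rating(raw):
    """Return the numeric rating, or NaN when the text has another shape."""
    match = RATING_PATTERN.match(raw) if isinstance(raw, str) else None
    return float(match.group(1)) if match else float("nan")

print(parse_rating("4.0 star rating"))
print(parse_rating("no rating"))
```

It could be applied with `df_reviews['Rating'].apply(parse_rating)`, after which any NaN values point to rows that need a closer look.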

In [9]:
df_reviews.drop(columns='Rating', inplace=True)

In [10]:
import datetime

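def validate_date(date_str):
    # Keep only the first line of the field and parse it as MM/DD/YYYY.
    try:
        date_updated = datetime.datetime.strptime(date_str.splitlines()[0], "%m/%d/%Y")
    except (ValueError, IndexError, AttributeError) as err:
        # Report the offending value in the error instead of printing it.
        raise ValueError(f"Incorrect date format, should be MM/DD/YYYY: {date_str!r}") from err
    return date_updated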

In [11]:
df_reviews['valid_date'] = df_reviews.date.apply(validate_date)

In [12]:
df_reviews.drop(columns='date', inplace=True)

In [13]:
df_reviews['polarity_textblob'] = [sentiment_parameters_textblob(r)[0] for r in df_reviews['Description']]

In [14]:
df_reviews['subjectivity_textblob'] = [sentiment_parameters_textblob(r)[1] for r in df_reviews['Description']]

In [20]:
df_reviews['neg_nltk'] = [sentiment_parameters_nltk(r)['neg'] for r in df_reviews['Description']]

In [21]:
df_reviews['neu_nltk'] = [sentiment_parameters_nltk(r)['neu'] for r in df_reviews['Description']]

In [22]:
df_reviews['pos_nltk'] = [sentiment_parameters_nltk(r)['pos'] for r in df_reviews['Description']]

In [23]:
df_reviews['compound_nltk'] = [sentiment_parameters_nltk(r)['compound'] for r in df_reviews['Description']]

In [34]:
df_reviews['value_flair'] = [sentiment_parameters_flair(r)[0].value for r in df_reviews['Description']]

In [None]:
df_reviews['score_flair'] = [sentiment_parameters_flair(r)[0].score for r in df_reviews['Description']]
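The cells above call each analyzer once per output column, so every review is scored several times. A possible alternative is to score each review once and then unpack the results into columns. The sketch below shows the idea with a hypothetical `counting_scorer` stand-in that returns a fixed VADER-style dictionary and counts its calls; `score_once` is also a hypothetical helper.

```python
call_count = 0

def counting_scorer(text):
    """Stand-in for an expensive analyzer; returns a fixed VADER-style dict."""
    global call_count
    call_count += 1
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}

def score_once(reviews, scorer, keys):
    """Score every review a single time, then build one list per key."""
    scores = [scorer(review) for review in reviews]
    return {key: [s[key] for s in scores] for key in keys}

sample_reviews = ["Great coffee.", "Slow service."]
columns = score_once(sample_reviews, counting_scorer, ("neg", "neu", "pos", "compound"))
print(call_count)
```

With a DataFrame, each list could then be assigned to its column, for example `df_reviews['neg_nltk'] = columns['neg']`. This matters most for the FLAIR classifier, which is by far the slowest of the three.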

In [17]:
sentiment_parameters_nltk(df_reviews['Description'][0])

{'neg': 0.102, 'neu': 0.661, 'pos': 0.237, 'compound': 0.9308}