## Analysis of user reviews

In this section, we apply natural language preprocessing techniques to extract meaningful keywords. We want to predict ratings and recommendations from user reviews, and to discover the most common topics in the reviews, so that we can identify specific (or cross-cutting) improvements for each department and division.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import scipy as scp
import matplotlib.pyplot as plt

import statsmodels.api as sm
import scipy.stats as stats
import pylab


from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize 
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tag import pos_tag
import re



In [2]:
# None disables column-width truncation; -1 is deprecated in recent pandas versions
pd.set_option('display.max_colwidth', None)

In [3]:
df_review = pd.read_csv('dataset/raw/Womens_Clothing_E-Commerce_Reviews.csv', index_col=0)

In [4]:
df_review.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 23486 entries, 0 to 23485
Data columns (total 10 columns):
Clothing ID                23486 non-null int64
Age                        23486 non-null int64
Title                      19676 non-null object
Review Text                22641 non-null object
Rating                     23486 non-null int64
Recommended IND            23486 non-null int64
Positive Feedback Count    23486 non-null int64
Division Name              23472 non-null object
Department Name            23472 non-null object
Class Name                 23472 non-null object
dtypes: int64(5), object(5)
memory usage: 2.0+ MB


### 1. Text Normalization and tokens

In [20]:
stop_words = set(stopwords.words('english')) 
wordnet_lemmatizer = WordNetLemmatizer()

def pre_processing(text):
    """Lowercase, strip non-letters, drop stop words, and lemmatize by POS tag."""
    # Missing reviews are read as NaN (a float); normalize them to an empty string
    if not isinstance(text, str):
        return ''
    # Keep only letters and whitespace, then remove the literal 'xxxx' sequence
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    text = text.replace('xxxx', '')
    # Tokenize, tag, filter stop words, and lemmatize nouns, verbs, and adjectives
    lem = []
    for word, tag in pos_tag(word_tokenize(text)):
        if word in stop_words:
            continue
        if tag.startswith('NN'):
            lem.append(wordnet_lemmatizer.lemmatize(word, pos='n'))
        elif tag.startswith('VB'):
            lem.append(wordnet_lemmatizer.lemmatize(word, pos='v'))
        elif tag.startswith('JJ'):
            lem.append(wordnet_lemmatizer.lemmatize(word, pos='a'))
        else:
            # Any other part of speech is kept unchanged
            lem.append(word)
    return ' '.join(lem)

Example:

In [25]:
print("Original Text: {0:s}".format(df_review['Review Text'][0]))

Original Text: Absolutely wonderful - silky and sexy and comfortable


In [26]:
print("Normalized Text: {0:s}".format(pre_processing(df_review['Review Text'][0])))

Normalized Text: absolutely wonderful silky sexy comfortable
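
To make the first step of `pre_processing` concrete, the sketch below applies only the character filter, with the same regular expression, to two snippets taken from the reviews in this dataset. It uses just the standard library; `strip_non_letters` is a hypothetical helper name, not part of the notebook. Apostrophes are removed along with other punctuation, which is why a contraction such as "i'm" shows up as "im" in the normalized text.

```python
import re

def strip_non_letters(text):
    """Lowercase the text and drop every character that is not a letter or whitespace."""
    return re.sub(r'[^a-zA-Z\s]', '', text.lower())

samples = [
    "Absolutely wonderful - silky and sexy and comfortable",
    "i'm glad i did bc i never would have ordered it online",
]
cleaned = [strip_non_letters(s) for s in samples]
for original, result in zip(samples, cleaned):
    print(repr(original), '->', repr(result))
```

The removed hyphen leaves two consecutive spaces behind; this is harmless here because tokenization discards the extra whitespace.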


In [21]:
df_review['Review_norm'] = df_review['Review Text'].apply(pre_processing)
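
The branching on tags inside `pre_processing` maps Penn Treebank tag prefixes to the part-of-speech codes passed to `lemmatize`. A minimal sketch of that mapping as a lookup table, assuming a hypothetical helper `penn_to_wordnet` (not defined in the notebook) and no NLTK dependency:

```python
# Penn Treebank tag prefix -> WordNet part-of-speech code,
# mirroring the three branches handled in pre_processing.
PREFIX_TO_WORDNET = {'NN': 'n', 'VB': 'v', 'JJ': 'a'}

def penn_to_wordnet(tag):
    """Return 'n', 'v', or 'a' for noun, verb, or adjective tags, else None."""
    return PREFIX_TO_WORDNET.get(tag[:2])

for tag in ['NNS', 'VBD', 'JJR', 'RB']:
    print(tag, '->', penn_to_wordnet(tag))
```

A word whose tag maps to `None` would be kept unchanged, as in the `else` branch of `pre_processing`.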

In [None]:
def tokens_counter(text):
    # Missing reviews (NaN) have no tokens
    if not isinstance(text, str):
        return 0
    text = re.sub(r'[^a-zA-Z\s]', '', text.lower())
    return len(word_tokenize(text))

### 2. Polarity metrics

#### a. TextBlob

We define a function that returns the polarity and subjectivity of a text using `PatternAnalyzer`, one of the sentiment analyzers provided by TextBlob.

In [8]:
from textblob.sentiments import PatternAnalyzer
from textblob import TextBlob

def sentiment_parameters_textblob(text_data):
    # Missing reviews (NaN) get NaN scores instead of raising inside TextBlob
    if not isinstance(text_data, str):
        return {'polarity': np.nan, 'subjectivity': np.nan}
    sentiment = TextBlob(text_data, analyzer=PatternAnalyzer()).sentiment
    return {'polarity': sentiment.polarity, 'subjectivity': sentiment.subjectivity}

#### b. NLTK Sentiment Intensity Analyzer

In [9]:
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sentiment_nltk = SentimentIntensityAnalyzer()

def sentiment_parameters_nltk(text_data):
    # Missing reviews (NaN) get NaN scores; the keys match those of polarity_scores
    if not isinstance(text_data, str):
        return {'neg': np.nan, 'neu': np.nan, 'pos': np.nan, 'compound': np.nan}
    return sentiment_nltk.polarity_scores(text_data)
    

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/daniela/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [10]:
textblob = [sentiment_parameters_textblob(r) for r in df_review['Review Text']]

In [11]:
nltkList = [sentiment_parameters_nltk(r) for r in df_review['Review Text']]

In [12]:
df_review['polTextB'] = [nested_dict['polarity'] for nested_dict in textblob]

In [13]:
df_review['subTextB'] = [nested_dict['subjectivity'] for nested_dict in textblob]

In [14]:
df_review['negNLTK'] = [nested_dict['neg'] for nested_dict in nltkList]

In [15]:
df_review['neuNLTK'] = [nested_dict['neu'] for nested_dict in nltkList]

In [16]:
df_review['posNLTK'] = [nested_dict['pos'] for nested_dict in nltkList]
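
The assignments above each pull one key out of a list of score dictionaries. As an alternative sketch, the same columns can be collected in a single pass with plain Python. The helper `scores_to_columns` is a hypothetical name, and the sample values are copied from the first two rows of this dataset only to keep the example self-contained.

```python
def scores_to_columns(score_dicts, rename):
    """Turn a list of score dicts into {column_name: list_of_values}.

    `rename` maps each dict key to the column name it should get.
    """
    return {
        column: [scores[key] for scores in score_dicts]
        for key, column in rename.items()
    }

# Sample values with the three keys used in this section.
sample_scores = [
    {'neg': 0.0, 'neu': 0.272, 'pos': 0.728},
    {'neg': 0.0, 'neu': 0.664, 'pos': 0.336},
]
columns = scores_to_columns(
    sample_scores, {'neg': 'negNLTK', 'neu': 'neuNLTK', 'pos': 'posNLTK'}
)
print(columns)
```

With pandas, `df_review.assign(**columns)` would then add all three columns at once, provided each list has one value per row.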

In [22]:
df_review.head(2)

Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name,polTextB,subTextB,negNLTK,neuNLTK,posNLTK,Review_norm
0,767,33,,Absolutely wonderful - silky and sexy and comfortable,4,1,0,Initmates,Intimate,Intimates,0.633333,0.933333,0.0,0.272,0.728,absolutely wonderful silky sexy comfortable
1,1080,34,,"Love this dress! it's sooo pretty. i happened to find it in a store, and i'm glad i did bc i never would have ordered it online bc it's petite. i bought a petite and am 5'8"". i love the length on me- hits just a little below the knee. would definitely be a true midi on someone who is truly petite.",5,1,4,General,Dresses,Dresses,0.339583,0.725,0.0,0.664,0.336,love dress sooo pretty happen find store im glad bc never would order online bc petite buy petite love length hit little knee would definitely true midi someone truly petite


### 2. Readability scores and statistical tests