# A Machine Learning journey from customer reviews to business insights
# *Part 2: Data preparation for review text*

*Author: Federica Lionetto*  
*Email: federica.lionetto@gmail.com*  
*Date: 17 November 2020*  
*License: Creative Commons BY-NC-SA*

*Based on the dataset available at:*
- https://www.kaggle.com/efehandanisman/skytrax-airline-reviews

### Further reading

- Hutto, C.J. and Gilbert, E.E., 2014, "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text", Eighth International Conference on Weblogs and Social Media (ICWSM-14), Ann Arbor, MI, June 2014, https://www.researchgate.net/publication/275828927_VADER_A_Parsimonious_Rule-based_Model_for_Sentiment_Analysis_of_Social_Media_Text
- Sentiment analysis using VADER, https://github.com/cjhutto/vaderSentiment
- "Detecting bad customer reviews with NLP", https://towardsdatascience.com/detecting-bad-customer-reviews-with-nlp-d8b36134dc7e

## 1 - Import modules and helper functions

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_palette('Set2')

import datetime as dt
import dateutil

import string

from sklearn.feature_extraction.text import CountVectorizer

import nltk
from nltk import tokenize, pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.stem import WordNetLemmatizer

import os
import importlib

In [None]:
# Debugging capabilities.
import pdb

In [None]:
# Needed for Colab.
!git clone https://github.com/FedericaLionetto/UZHMLWorkshop2020-NLP
os.chdir('UZHMLWorkshop2020-NLP/')

In [None]:
import sys  
sys.path.insert(0, './helper_functions')

In [None]:
# Related to visualization.
import plot_cmap
import plot_two_hists_comp_sns

# Related to NLP.
import get_wordnet_pos

## 2 - Load the input data

In [None]:
# Type of each field in the input data.
df_dtype = pd.read_csv('../Results/PreprocessedDataLightTypes.csv')
dict_dtype = df_dtype[['index','dtypes']].set_index('index').to_dict()['dtypes']
dict_dtype['recommended'] = 'bool'

In [None]:
# Input data.
df = pd.read_csv('../Results/PreprocessedDataLight.csv', dtype=dict_dtype, keep_default_na=False, na_values=['_'])
df.drop(columns=['Unnamed: 0'],inplace=True)

In [None]:
df.head()

In [None]:
df.shape

Get the names of the columns in the dataset.

In [None]:
cols = df.columns.to_list()
print('Columns in the dataset:')
print(cols)

Get the total number of customer reviews in the dataset.

In [None]:
n_reviews = df.shape[0]
print('Number of customer reviews in the dataset: {:d}'.format(n_reviews))

## 3 - Work with the review text

### 3.1 - Get review text and create a new data frame with NLP information

In [None]:
# Series of all review texts in the dataset.
reviews_list = df['review_text'].copy()

In [None]:
reviews_list.shape

In [None]:
df_nlp = df.copy()

### 3.2 - Sentiment analysis using VADER

Sentiment analysis is the field of NLP that aims at understanding the sentiment expressed in a portion of text. One of the best-known tools for sentiment analysis is the open-source package VADER, which is also distributed as part of NLTK.

The official description of VADER reads:
"VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media."   
VADER was developed on social media text, but it is generally applicable to other domains, including customer reviews.  
VADER is based on a lexicon (vocabulary) that is validated by multiple human judges according to a well-defined and standard procedure. Each word in the lexicon is associated with a sentiment valence, consisting of two properties, polarity and intensity. The polarity describes whether the word is positive or negative. The intensity describes how strongly positive or negative the word is, on a scale from -4 to 4. Words not included in the lexicon are treated as neutral. 

To evaluate the sentiment of a sentence or list of sentences, VADER looks for words in the text that are part of the lexicon, modifies the intensity and polarity of the identified words according to a series of rules, sums up these values and then normalises to the range [-1,1].  
VADER incorporates emoticons (for example ":-)"), acronyms (for example "LOL") and slang (for example "nah"). The algorithm differs from a Bag of Words approach as it takes word order and degree modifiers into account, e.g. by increasing/decreasing the intensity of the sentiment.   
For example, the sentences:
- "This flight was great."
- "This flight was really great."
- "This flight was really GREAT."
- "This flight was really GREAT!"
- "This flight was really GREAT! :-)"

would have an increasing intensity, triggered in turn by a degree modifier ("really"), capitalization, an exclamation mark and an emoticon.
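
The last step of the scoring described above, mapping the summed valence to the range [-1,1], can be sketched as follows. This is a minimal sketch, assuming the normalisation used in the reference vaderSentiment implementation, `x / sqrt(x^2 + alpha)` with `alpha = 15`. The function name `normalize_valence` and the input values are chosen here for illustration only.

```python
import math

def normalize_valence(summed_valence, alpha=15):
    """Map an unbounded summed valence to the range [-1, 1].

    Assumption: same functional form as the normalisation in vaderSentiment,
    with alpha approximating the maximum expected value.
    """
    norm = summed_valence / math.sqrt(summed_valence ** 2 + alpha)
    # Clip to guard against values marginally outside the range.
    return max(-1.0, min(1.0, norm))

# Illustrative summed valences (hypothetical inputs, not taken from the dataset).
for valence in [3.1, 0.0, -3.1, 10.0]:
    print('{:5.1f} -> {:.4f}'.format(valence, normalize_valence(valence)))
```

The mapping is monotonic and saturates for large absolute values, so adding more strongly positive or negative words moves the compound score towards 1 or -1 without ever exceeding these limits.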

The output of the sentiment analysis is a series of scores, namely "compound", "pos", "neu" and "neg".  
The compound score is normalized between -1 (extremely negative) and 1 (extremely positive) and is a good metric if we need a single value that summarises the sentiment of a given sentence. The compound score can also be used to classify sentences into positive, neutral and negative by setting appropriate thresholds. The officially recommended thresholds are:
- positive sentiment, compound score >= 0.05
- neutral sentiment, compound score > -0.05 and < 0.05
- negative sentiment, compound score <= -0.05  

The positive, neutral and negative scores represent the fraction of the sentence that has a positive, neutral and negative sentiment. These three scores should sum up to 1. The positive, neutral and negative scores are a good metric if we need multiple values that summarise the sentiment of a given sentence.
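
The classification by compound score described above can be written as a small function. This is a minimal sketch: the name `classify_compound` and the example scores are hypothetical, and the default threshold of 0.05 is the recommended value quoted above.

```python
def classify_compound(compound, threshold=0.05):
    """Map a compound score to 'positive', 'neutral' or 'negative'."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    # Everything strictly between -threshold and threshold is neutral.
    return 'neutral'

# Illustrative compound scores (hypothetical values).
for score in [0.62, 0.0, -0.30]:
    print('{:+.2f} -> {}'.format(score, classify_compound(score)))
```

A larger threshold widens the neutral band, which can be useful if only clearly positive or clearly negative reviews should be labelled.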

In [None]:
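import nltk
nltk.download('vader_lexicon')
# Needed by tokenize.sent_tokenize and tokenize.word_tokenize used below
# (recent NLTK releases may require 'punkt_tab' instead).
nltk.download('punkt')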

In [None]:
sid = SentimentIntensityAnalyzer()

In [None]:
# Simple examples.
print(sid.polarity_scores("This flight was great."))
print(sid.polarity_scores("This flight was really great."))
print(sid.polarity_scores("This flight was really GREAT."))
print(sid.polarity_scores("This flight was really GREAT!"))
print(sid.polarity_scores("This flight was really GREAT! :-)"))

In [None]:
# Examples.
review = reviews_list[0]
review_tok = tokenize.sent_tokenize(review)
print(review_tok)

In [None]:
# Example on a review level.
print('Review text:')
print(review)

review_polarity_scores = sid.polarity_scores(review)

for key in sorted(review_polarity_scores.keys()):
    print('{}: {}, '.format(key,review_polarity_scores[key]), end='')
print('\n')

In [None]:
# Example on a sentence level.
print('Review text:')
print(review_tok)

for sentence in review_tok:
    print('Sentence text:')
    print(sentence)
    sentence_polarity_scores = sid.polarity_scores(sentence)

    for key in sorted(sentence_polarity_scores.keys()):
        print('{}: {}, '.format(key,sentence_polarity_scores[key]), end='')
    print('\n')

In [None]:
# Augment the dataset with the overall polarity score of the review, as obtained using VADER on the review level.
reviews_polarity = []

for i_review, review in enumerate(reviews_list):
    # print('Review text:')
    # print(review)

    review_polarity_scores = sid.polarity_scores(review)
    review_polarity_score_compound = review_polarity_scores['compound']
    
    print('Review #{:d}: '.format(i_review), end='')
    for key in sorted(review_polarity_scores.keys()):
        print('{}: {:.4f}, '.format(key,review_polarity_scores[key]), end='')
    print('')
    
    reviews_polarity.append(review_polarity_score_compound)

# print(reviews_polarity)

In [None]:
df_nlp['polarity'] = reviews_polarity

In [None]:
df_nlp.head()

We look at the correlation between compound score and recommendation and at the distribution of the compound score for positive and negative customer reviews.

In [None]:
corr_values = df_nlp[['polarity','recommended']].dropna(axis=0,how='any').corr()

In [None]:
plot_cmap.plot_cmap(matrix_values=corr_values, 
                    figsize_w=4, 
                    figsize_h=4, 
                    filename='../Results/02/Corr.png')

In [None]:
plot_two_hists_comp_sns.plot_two_hists_comp_sns(df_1=df_nlp[df_nlp['recommended']==True],
                                                df_2=df_nlp[df_nlp['recommended']==False],
                                                label_1='recommended',
                                                label_2='not recommended',
                                                feat='polarity',
                                                bins=30,
                                                title='Distribution of all customer reviews',
                                                x_label='Polarity',
                                                y_label='Entries / bin',
                                                filename='../Results/02/HistPolarityByRecommendation.png')

**DISCUSSION**:  
*What could be the limitations of this approach? Would you expect it to perform well on the customer reviews?*

### 3.3 - Preprocess review text

#### 3.3.1 - Import packages

In [None]:
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

#### 3.3.2 - Stop words

In [None]:
# Stop words.
# Airlines appearing in the dataset. This is the official name of the airlines. These words should be removed from the review text.
airlines_lower = df_nlp['airline'].str.lower().unique().tolist()
# Words appearing in the official name of the airlines. These words should be removed from the review text.
airlines_identifier = ['airlines',
                       'air lines',
                       'airline',
                       'air line',
                       'airways',
                       'air']
# In addition to the official name of the airlines, customers can use shortened versions of this name.
airlines_informal_lower = []
for airline in airlines_lower:
    for airline_identifier in airlines_identifier:
        # Use only the first identifier that matches, then move on to the next airline.
        if ' ' + airline_identifier in airline:
            airline_informal = airline.replace(' ' + airline_identifier, '')
            airlines_informal_lower.append(airline_informal)
            break
# Other stop words.
additional_stopwords = ['one','get','also','however','even','make']

In [None]:
print(airlines_lower)

In [None]:
print(airlines_identifier)

In [None]:
print(airlines_informal_lower)

In [None]:
nltk_stopwords = stopwords.words('english')
nltk_stopwords_extended = nltk_stopwords + airlines_lower + airlines_identifier + airlines_informal_lower + additional_stopwords
print('Number of stopwords in NLTK: {:d}'.format(len(nltk_stopwords)))
print('Number of stopwords after extension: {:d}'.format(len(nltk_stopwords_extended)))

#### 3.3.3 - Lower/upper case, punctuation, tokenization, stop words, POS tagging and lemmatization

First of all, we convert all characters in the review text to lower case.

After that, we remove the punctuation and tokenize each customer review into a list of individual words. 

As a next step, we need to select only those words in the review text that could be relevant to solve the problem at hand. In particular, stop words (very common words such as "the", "is" or "and") should be filtered out, as they carry little information about the content of the review.  
We can download the stopwords from NLTK and specify that we want to use those corresponding to the English language.

We then proceed to POS (part-of-speech) tagging, which allows us to identify the role of each word in the sentence, according to the categories noun, verb, adjective, adverb and others. This is needed for a correct lemmatization of the words in the review text.

Lemmatization consists in bringing the words to their "standard" form (lemma), e.g. converting "wrote" to "write" or "writing" to "write".
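
The POS tags returned by NLTK follow the Penn Treebank convention (for example `VBD` for a verb in the past tense), whereas the WordNet lemmatizer expects one of four coarse categories. The helper `get_wordnet_pos` used below is not shown in this notebook. The following is a minimal sketch of what such a mapping typically looks like, under the assumption that the tag's first letter determines the category. The function name `penn_to_wordnet_pos` is hypothetical, and the actual helper may differ.

```python
# WordNet part-of-speech constants written out as plain strings
# (they match wordnet.ADJ, wordnet.VERB, wordnet.NOUN and wordnet.ADV in NLTK).
WORDNET_ADJ, WORDNET_VERB, WORDNET_NOUN, WORDNET_ADV = 'a', 'v', 'n', 'r'

def penn_to_wordnet_pos(penn_tag):
    """Hypothetical stand-in for get_wordnet_pos: map a Penn Treebank tag to a WordNet POS."""
    if penn_tag.startswith('J'):
        return WORDNET_ADJ
    if penn_tag.startswith('V'):
        return WORDNET_VERB
    if penn_tag.startswith('R'):
        return WORDNET_ADV
    # Nouns and all remaining tags fall back to noun, the default of the lemmatizer.
    return WORDNET_NOUN

# Illustrative tags: adjective, past-tense verb, adverb, plural noun, determiner.
for tag in ['JJ', 'VBD', 'RB', 'NNS', 'DT']:
    print(tag, '->', penn_to_wordnet_pos(tag))
```

Without this mapping the lemmatizer would treat every word as a noun, so that a verb form such as "wrote" would be left unchanged.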

In [None]:
def get_clean_text(text):
    # Transform the text so that all words are lower case.
    # print(text)
    text = text.lower()
    # Remove stop words corresponding to airlines. This is needed here as airline names can consist of multiple words and will not be removed after splitting by words.
    # print(text)
    for airline_lower in airlines_lower:
        text = text.replace(airline_lower, '')
    # Remove punctuation and tokenize the text into individual words.
    # print(text)
    text = [word.strip(string.punctuation) for word in text.split(" ")]
    # Remove words that contain numbers.
    # print(text)
    text = [word for word in text if not any(c.isdigit() for c in word)]
    # Remove stop words.
    # print(text)
    text = [word for word in text if word not in nltk_stopwords_extended]
    # Remove empty tokens.
    # print(text)
    text = [word for word in text if len(word)>0]
    # POS tagging of the text.
    # print(text)
    pos_tags = pos_tag(text)
    # Lemmatize the text, using the POS tag of each word. The lemmatizer is created once and reused for all words.
    # print(text)
    lemmatizer = WordNetLemmatizer()
    text = [lemmatizer.lemmatize(word, get_wordnet_pos.get_wordnet_pos(tag)) for word, tag in pos_tags]
    # Remove words with only one letter.
    # print(text)
    text = [word for word in text if len(word)>1]
    # Join the text with space as a word delimiter.
    # print(text)
    text = " ".join(text)
    # Remove non-ASCII characters.
    printable = set(string.printable)
    text = ''.join(filter(lambda x: x in printable, text))
    return text

In [None]:
# Example of POS tagging.
pos_tag(tokenize.word_tokenize('This is a simple test for you.'))

In [None]:
# Example of lemmatization.
WordNetLemmatizer().lemmatize('written',wordnet.VERB)

In [None]:
reviews_list[0]

In [None]:
get_clean_text(reviews_list[0])

In [None]:
df_nlp['review_text_clean'] = df_nlp['review_text'].apply(get_clean_text)

In [None]:
df_nlp['review_text_clean'][0]

#### 3.3.4 - Vectorization

We convert the text of each customer review from a textual representation to a numerical representation. The features (columns) of the numerical representation correspond to the words that appear in the preprocessed text of the customer reviews. The values in the numerical representation correspond to the number of occurrences of the given word in the customer review. To avoid ending up with too many features in the numerical representation, we limit the dictionary to the words that appear in at least a minimum number of customer reviews. This threshold is specified through the parameter `min_df` of `CountVectorizer`.  

For example, if we want to use a 3D numerical representation, we might have features corresponding to the words `flight`, `service`, `food`. For a certain customer review, the value of the feature `flight` will correspond to how many times `flight` is mentioned in the text, and similarly for the other two features `service` and `food`.

It should be noted that, up to this point, there is no "meaning" associated with the words in the dictionary. The numerical representation does not take the similarity between two words into account. Embeddings are a way to map words to meanings and are an interesting option, but for the sake of simplicity they are not implemented in this exercise.
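
To make the count representation and the role of the document-frequency threshold more concrete, the idea can be sketched with the standard library only. This is an illustration of the principle, not the `CountVectorizer` implementation: the functions `build_vocabulary` and `count_vector` and the toy corpus are hypothetical, and tokens are obtained by simply splitting on whitespace.

```python
from collections import Counter

def build_vocabulary(corpus, min_df=1):
    """Return the sorted words that appear in at least min_df documents."""
    doc_freq = Counter()
    for doc in corpus:
        # A set is used so that each word is counted once per document.
        doc_freq.update(set(doc.split()))
    return sorted(word for word, n_docs in doc_freq.items() if n_docs >= min_df)

def count_vector(doc, vocabulary):
    """Return the number of occurrences of each vocabulary word in the document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

# Toy corpus of already preprocessed reviews (hypothetical).
toy_corpus = ['flight delay flight cancel',
              'service good food good',
              'flight good service']

vocabulary = build_vocabulary(toy_corpus, min_df=2)
vectors = [count_vector(doc, vocabulary) for doc in toy_corpus]

print(vocabulary)
for vector in vectors:
    print(vector)
```

With `min_df=2`, words that occur in a single review are dropped from the vocabulary, even if they are repeated several times within that review.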

In [None]:
# List of reviews.
corpus = df_nlp['review_text_clean'].values

In [None]:
corpus[0]

In [None]:
len(corpus)

In [None]:
vectorizer_ngrams = CountVectorizer(binary=False, ngram_range=(1, 1), analyzer='word', min_df=50)

In [None]:
vectorizer_ngrams.fit(corpus)

In [None]:
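# Note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from version 1.0.
vec_review_text_clean_feats = vectorizer_ngrams.get_feature_names_out().tolist()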

In [None]:
vec_review_text_clean_feats[:10]

In [None]:
len(vec_review_text_clean_feats)

In [None]:
vec_review_text_clean = vectorizer_ngrams.transform(df_nlp['review_text_clean'])

In [None]:
vec_review_text_clean.shape

In [None]:
vec_review_text_clean.dtype

In [None]:
vec_review_text_clean

In [None]:
vec_review_text_clean_feats_new = ['count_'+feat for feat in vec_review_text_clean_feats]

In [None]:
# Add features to the dataset.
df_vec_review_text_clean = pd.DataFrame(vec_review_text_clean.toarray(),columns=vec_review_text_clean_feats_new)

In [None]:
df_vec_review_text_clean.head()

In [None]:
df_nlp['review_text_clean'][0]

In [None]:
df_vec_review_text_clean.iloc[0]['count_lose']

In [None]:
df_nlp_final = pd.concat([df_nlp,df_vec_review_text_clean], axis=1)

In [None]:
df_nlp_final.head()

In [None]:
df_nlp_final['cabin'].head()

## 4 - Save the dataset

In [None]:
df_nlp_final_types = df_nlp_final.dtypes.to_frame('dtypes').reset_index()

df_nlp_types = df_nlp.dtypes.to_frame('dtypes').reset_index()

In [None]:
df_nlp_final.to_csv('../Results/NLPFinalDataLight.csv')
df_nlp_final_types.to_csv('../Results/NLPFinalDataLightTypes.csv')

df_nlp.to_csv('../Results/NLPDataLight.csv')
df_nlp_types.to_csv('../Results/NLPDataLightTypes.csv')

In [None]:
with open('../Results/VecReviewTextCleanFeats.csv', 'w') as f:
    f.write(', '.join(vec_review_text_clean_feats_new))

In [None]:
with open('../Results/NLTKStopWordsExtended.csv', 'w') as f:
    f.write(', '.join(nltk_stopwords_extended))