# Objective

Using a small toy (fake) dataset, this notebook demonstrates how to use the NLP extractor classes to obtain features (e.g., sentiment, returnability, and topic scores) from customer reviews.

# I. Basic Setup

In [1]:
# Configure environment
from os import path
CURR_PATH = path.abspath("__file__") # Placeholder path inside the working directory (notebooks define no __file__)
ROOT_PATH = path.dirname(path.dirname(path.dirname(CURR_PATH)))
import sys; sys.path.insert(0, ROOT_PATH)
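
Because a notebook has no `__file__` variable, the cell above passes the literal string `"__file__"` to `path.abspath`, which resolves it against the current working directory. The three `dirname` calls then strip that placeholder name and two directory levels. The sketch below replays this logic on a hypothetical working directory (`/home/user/project/demo/notebooks` is an assumption for illustration, not the repository's confirmed layout) and uses `posixpath` so it behaves the same on any operating system.

```python
import posixpath

# Hypothetical working directory of the notebook (an assumption for illustration)
cwd = "/home/user/project/demo/notebooks"

# abspath("__file__") effectively joins the literal name onto the working directory
curr_path = posixpath.join(cwd, "__file__")

# Each dirname call strips one trailing path component
notebook_dir = posixpath.dirname(curr_path)  # back to the working directory
root_path = posixpath.dirname(posixpath.dirname(notebook_dir))  # two levels up

print(root_path)
```

The setup therefore only works when the notebook is launched from a directory exactly two levels below the project root.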

In [2]:
import numpy as np
import pandas as pd
from src.features import Sentiment, Returnability, Topics

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/alex/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Let’s first load and inspect the toy data.

In [3]:
# Load data
products_df = pd.read_csv(path.join(ROOT_PATH, "demo", "data", "toydata_products.csv"))
reviews_df = pd.read_csv(path.join(ROOT_PATH, "demo", "data", "toydata_reviews.csv"))

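# Keep 2017 products only, then drop the now-redundant year column
# (the positional axis argument, drop('year', 1), was removed in pandas 2.0)
products_df = products_df[products_df['year'] == 2017].drop(columns='year')
reviews_df = reviews_df.drop(columns='year')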

In [4]:
# Preview product-level data
products_df

Unnamed: 0,wfsku,mkcname,avgweight,returnratescaled
0,A,Bath,29.071028,0.443155
1,B,Bath,32.733974,0.32651
2,C,Bedroom,26.795903,0.703419
3,D,Bedroom,37.564827,0.013095
4,E,Lighting,29.051481,0.050202
5,F,Lighting,31.353412,0.813216
6,G,Bath,28.813357,0.758053
7,H,Bedroom,31.003479,0.769882
8,I,Bath,35.247619,0.464246
9,J,Lighting,35.580414,0.169323


In [5]:
# Preview review-level data
reviews_df

Unnamed: 0,rvprcomments,wasreturned,wfsku
0,Absolutely thrilled about these. They are so ...,1,A
1,They were nice but too big for my area that I ...,1,B
2,I returned this item the item was not the prob...,1,C
3,Awful! Sending back. Doesn’t even look close t...,1,D
4,There was no replacement bulb as advertised. W...,1,E
...,...,...,...
995,Chair is gorgeous and was easy to put together...,0,F
996,love this chair!! its nice looking and comfy,0,G
997,"Wayfair gave a great price, great shipping and...",0,H
998,Perfect size for pasta or salad. Classic style...,0,I


The goal is to extract NLP features from texts in `reviews_df` and merge their product-level averages into `products_df`.

# II. Extracting NLP Features

## 1. Sentiment

The `Sentiment` class extracts sentiment-related features (e.g., negativity) using pre-trained models from the `nltk` and `textblob` packages, so it needs no separate training.

### Extraction

In [6]:
# Initiate model
sentiment = Sentiment()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/sangyoonpark/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [7]:
# Extract features
sentiment_features = sentiment.extract(reviews_df['rvprcomments'])

In [8]:
# Check result
sentiment_features

Unnamed: 0,subjectivity,neg,neu,pos,compound
0,0.816667,0.000,0.614,0.386,0.9067
1,0.550000,0.000,0.863,0.137,0.2263
2,0.250000,0.113,0.771,0.116,0.0150
3,0.636667,0.230,0.770,0.000,-0.5093
4,0.700000,0.059,0.941,0.000,-0.2960
...,...,...,...,...,...
995,0.533333,0.000,0.619,0.381,0.9634
996,0.800000,0.000,0.442,0.558,0.8217
997,0.635000,0.000,0.420,0.580,0.9450
998,0.583333,0.000,0.730,0.270,0.5719


In [9]:
# Merge extracted features into original data
reviews_df = pd.concat([reviews_df, sentiment_features], axis=1)
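
The `neg`, `neu`, `pos`, and `compound` columns match the output format of NLTK's VADER analyzer, whose lexicon is downloaded when `Sentiment` is initiated. A common convention from the VADER documentation labels a text positive when `compound >= 0.05`, negative when `compound <= -0.05`, and neutral otherwise. A minimal sketch applying that convention to the first five `compound` values from the preview above; the `label` column and the cut-off constants are illustrative additions, not something `Sentiment` produces:

```python
import numpy as np
import pandas as pd

# First five compound scores copied from the preview above
scores = pd.DataFrame({"compound": [0.9067, 0.2263, 0.0150, -0.5093, -0.2960]})

# Conventional VADER cut-offs (an assumption here; tune them for your own data)
POS_CUTOFF, NEG_CUTOFF = 0.05, -0.05

# Map each score to a coarse label; anything between the cut-offs is neutral
scores["label"] = np.select(
    [scores["compound"] >= POS_CUTOFF, scores["compound"] <= NEG_CUTOFF],
    ["positive", "negative"],
    default="neutral",
)
print(scores)
```

Note that the third review, which mentions a return, lands in the neutral band, a reminder that sentiment alone may not capture return behavior.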

## 2. Returnability

The `Returnability` class extracts return-related features using a BERT-based classifier that predicts whether a review is associated with a product return. It must therefore first be trained on labeled text data (i.e., reviews that resulted in a return vs. no return).

### Training

In [10]:
# Initiate model
BERT_returnability = Returnability()

No GPU available, using the CPU instead
Work may take extremely long time; consider using GPU


Training the BERT-based model takes a long time (especially without a GPU), so we often train on a subset of the data. Let’s do that here too.

In [11]:
# Subset data
reviews_df_sub = reviews_df.sample(n=100) # Roughly 10% of the toy data

# Train model
BERT_returnability.train(
    reviews=reviews_df_sub['rvprcomments'],
    labels=reviews_df_sub['wasreturned'],
    save_filename=path.join(ROOT_PATH, "demo", "models", "toymodel_returnability.pt") # Save trained model
)



Training...

  Average training loss: 0.69
  Training epoch took: 0:00:35

Evaluating...

  Average validation loss: 0.69
  Validation epoch took: 0:00:01


Training...

  Average training loss: 0.69
  Training epoch took: 0:00:33

Evaluating...

  Average validation loss: 0.69
  Validation epoch took: 0:00:01


Training...

  Average training loss: 0.68
  Training epoch took: 0:00:33

Evaluating...

  Average validation loss: 0.69
  Validation epoch took: 0:00:01

Training complete!
Total training took 0:01:48 (h:mm:ss)


The model is trained and validated over 3 epochs, and the best version (i.e., the one with the lowest average validation loss) is saved.
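
The "keep the best epoch" rule described above can be sketched as follows. This is not the actual `Returnability.train` implementation, which the notebook does not show; the `select_best_epoch` helper and the loss values are hypothetical.

```python
from typing import Sequence


def select_best_epoch(val_losses: Sequence[float]) -> int:
    """Return the 0-based index of the epoch with the lowest validation loss.

    Ties go to the earliest epoch, because a later epoch only replaces the
    current best when its loss is strictly lower.
    """
    if not val_losses:
        raise ValueError("val_losses must contain at least one epoch")
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
            # A real training loop would write the checkpoint to disk here
    return best_epoch


# Hypothetical per-epoch validation losses (more digits than the log prints)
print(select_best_epoch([0.694, 0.691, 0.692]))
```

With losses as close together as those in the log above, which epoch counts as "best" can hinge on digits that the two-decimal printout hides.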

### Extraction

Now that the model has been trained and saved, let’s load it back.

In [12]:
# Load previously trained model weights
BERT_returnability.load_model(path.join(ROOT_PATH, "demo", "models", "toymodel_returnability.pt"))

Model loaded successfully


Alternatively, we can use a different pre-trained model to perform feature extraction. For instance, the following loads the model trained on the entire review data from 2017:

In [13]:
# # Load model trained on entire review data from 2017
# BERT_returnability.load_model(path.join(ROOT_PATH, "models", "BERT_classifier_weights_sprint8.pt"))

We now use the loaded model to extract features related to product return.

In [14]:
# Extract features (from entire toy data)
returnability_features = BERT_returnability.extract(reviews_df['rvprcomments'])


Extracting features...


In [15]:
# Check result
returnability_features

Unnamed: 0,emb1,emb2,emb3,emb4,emb5,emb6,emb7,emb8,emb9,emb10,emb11,emb12,emb13,emb14,emb15,emb16,p_return
0,0.085738,0.257699,0.118066,0.013044,-0.124014,-0.136310,0.043116,0.283141,-0.122666,-0.165492,0.191405,-0.088334,0.068480,-0.079817,0.051973,-0.180668,0.475461
1,0.065253,0.037195,0.146367,-0.076044,-0.256372,0.002775,-0.108024,0.233402,-0.099681,0.020216,-0.073740,-0.107011,-0.034351,-0.138147,-0.023599,0.177158,0.505305
2,0.084194,0.102908,0.156051,-0.095641,-0.102037,-0.086480,-0.023378,0.256139,-0.142111,-0.042361,0.081371,-0.101391,0.019372,-0.173496,0.020558,0.042286,0.492520
3,0.133768,0.234132,0.087523,0.022394,-0.110837,-0.161413,0.006956,0.284745,-0.078459,-0.184026,0.184995,-0.074023,0.076958,-0.044890,0.043529,-0.230942,0.475055
4,0.062233,0.165933,0.119340,-0.054423,-0.090893,-0.155029,0.034449,0.192151,-0.072766,-0.179385,0.148082,-0.049794,0.053612,-0.065891,0.035864,-0.131583,0.488051
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.076263,0.167665,0.123643,-0.069697,-0.114485,-0.138100,0.036304,0.234786,-0.102340,-0.204039,0.187579,-0.062094,0.075429,-0.064454,0.104489,-0.150888,0.480893
996,0.066437,0.106028,0.196471,-0.078311,-0.142841,-0.043661,-0.044618,0.263376,-0.104600,-0.060075,0.101862,-0.143819,0.035562,-0.196204,0.005051,0.131481,0.497815
997,0.071167,0.222655,0.138505,-0.026557,-0.108133,-0.112422,0.084748,0.286177,-0.116125,-0.131257,0.158168,-0.089275,0.036126,-0.113423,0.012092,-0.086477,0.481502
998,0.100210,0.262722,0.131102,0.018450,-0.089444,-0.146816,0.029026,0.298695,-0.088645,-0.148253,0.215767,-0.088470,0.066891,-0.110652,0.031237,-0.178992,0.475455


In [16]:
# Merge extracted features into original data
reviews_df = pd.concat([reviews_df, returnability_features], axis=1)
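
The `p_return` values in the preview above all sit close to 0.5, and the training log reports losses of about 0.69. If the classifier is trained with binary cross-entropy (an assumption; the notebook does not state the loss function), a model that always predicts 0.5 scores exactly ln 2 ≈ 0.693, which would suggest that this toy model is still close to chance level. A quick check of that arithmetic:

```python
import math


def binary_cross_entropy(p: float, y: int) -> float:
    """Cross-entropy loss for one example with predicted probability p and label y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


# An uninformative prediction of 0.5 costs ln(2) regardless of the true label
chance_loss = binary_cross_entropy(0.5, 1)
print(round(chance_loss, 3))  # 0.693
```

For the toy run this is plausible given only 100 training reviews and three epochs; the features are still useful for demonstrating the workflow.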

## 3. Topics

The `Topics` class builds an LDA (latent Dirichlet allocation) topic model that identifies a set of topics commonly present across the given text data. The model can then score any text on those topics. It therefore requires unsupervised training on the target text data.

### Training

In [17]:
# Initiate model
topics = Topics(20) # Obtain 20 topics

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/sangyoonpark/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [18]:
# Train model
topics.train(
    reviews=reviews_df['rvprcomments'],
    save_filename=path.join(ROOT_PATH, "demo", "models", "toymodel_topics") # Save trained model
)

----Preprocessing----
strip_newline...
sent to words ...
remove stop words...
lemmatization...
0
dictionary...
filter...
compactify...
bag of word
Training Completed. The modeling process took 0.38419293562571205 minutes


### Extraction

In [19]:
# Load previously trained model
topics.load_model(path.join(ROOT_PATH, "demo", "models", "toymodel_topics"))

----Model loaded to disk----


Alternatively, we can use a different pre-trained model to perform feature extraction. For instance, the following loads the model trained on the entire review data from 2017:

In [20]:
# # Load model trained on entire review data from 2017
# topics.load_model(path.join(ROOT_PATH, "models", "topic_model"))

We now use the loaded model to extract topic scores for each review.

In [21]:
# Extract topic scores for each review
topic_features = topics.extract(reviews=reviews_df['rvprcomments'])

----Topic Score Extracted----


In [22]:
# Check result
topic_features

Unnamed: 0,topic_0,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,topic_7,topic_8,topic_9,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
0,0.012500,0.262500,0.012500,0.012500,0.012500,0.262500,0.012500,0.012500,0.012500,0.012500,0.012500,0.012500,0.012500,0.012500,0.012500,0.012500,0.012500,0.012500,0.262499,0.012500
1,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.350000,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.350000,0.016667,0.016667,0.016667
2,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.350000,0.016667,0.350000,0.016667,0.016667,0.016667
3,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.525000,0.025000,0.025000,0.025000,0.025000
4,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.683333,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.010000,0.210000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000,0.210000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000,0.410000
996,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.350000,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.016667,0.350000,0.016667,0.016667,0.016667,0.016667
997,0.010000,0.010000,0.010000,0.010000,0.010000,0.610000,0.210000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000,0.010000
998,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.525000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000,0.025000


In [23]:
# Merge extracted features into original data
reviews_df = pd.concat([reviews_df, topic_features], axis=1)
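
Each row of topic scores in the preview above sums to 1 (for example, row 3 has nineteen scores of 0.025 and one of 0.525), so a row can be read as a distribution over topics. A minimal sketch of pulling out the dominant topic per review with `idxmax`; to stay short it uses a hypothetical four-topic table rather than the notebook's twenty, and the `dominant_topic` name is an illustrative addition:

```python
import pandas as pd

# Hypothetical topic scores for three reviews (each row sums to 1)
toy_topics = pd.DataFrame(
    {
        "topic_0": [0.70, 0.10, 0.25],
        "topic_1": [0.10, 0.10, 0.25],
        "topic_2": [0.10, 0.60, 0.40],
        "topic_3": [0.10, 0.20, 0.10],
    }
)

# Sanity check: rows behave like probability distributions
assert toy_topics.sum(axis=1).round(6).eq(1.0).all()

# Column name of the highest-scoring topic in each row
dominant_topic = toy_topics.idxmax(axis=1)
print(dominant_topic.tolist())
```

When two topics tie for the top score, as in several rows of the preview, `idxmax` returns the first of them, so a dominant-topic label should be treated as approximate.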

# III. Merging NLP Features into Product-Level Data

Now that all NLP features have been extracted, we can combine, aggregate, and merge them into the product-level data.

In [24]:
# Identify relevant columns in review-level data
nlp_feature_cols = (
    list(sentiment_features.columns) + 
    list(returnability_features.columns) + 
    list(topic_features.columns)
)

In [25]:
# Aggregate up to product level
nlp_features_product_level = reviews_df[['wfsku'] + nlp_feature_cols].groupby('wfsku').mean()

In [26]:
# Merge NLP features into product-level data
products_df = products_df.merge(nlp_features_product_level, on='wfsku', how='left')
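
When reusing this aggregate-then-merge pattern on real data, it can help to let pandas verify the join. The sketch below uses small hypothetical frames; `validate` and `indicator` are standard `DataFrame.merge` parameters, while the frame contents and the `n_reviews` column are illustrative assumptions.

```python
import pandas as pd

# Hypothetical product-level and review-level data
products = pd.DataFrame(
    {"wfsku": ["A", "B", "C"], "mkcname": ["Bath", "Bath", "Bedroom"]}
)
reviews = pd.DataFrame({"wfsku": ["A", "A", "B"], "compound": [0.9, 0.1, -0.5]})

# Average the feature per product and keep the review count for context
per_product = reviews.groupby("wfsku", as_index=False).agg(
    compound=("compound", "mean"),
    n_reviews=("compound", "size"),
)

# validate raises MergeError if either side has duplicate keys;
# indicator adds a _merge column showing which products had no reviews
merged = products.merge(
    per_product, on="wfsku", how="left", validate="one_to_one", indicator=True
)
print(merged)
```

Rows marked `left_only` are products without any reviews; their NLP features are missing and need handling before modeling.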

In [27]:
# Check result
products_df

Unnamed: 0,wfsku,mkcname,avgweight,returnratescaled,subjectivity,neg,neu,pos,compound,emb1,...,topic_10,topic_11,topic_12,topic_13,topic_14,topic_15,topic_16,topic_17,topic_18,topic_19
0,A,Lighting,28.414403,0.593681,0.603041,0.03527,0.66913,0.29563,0.452125,0.070353,...,0.030215,0.055562,0.020465,0.02838,0.049465,0.078697,0.043955,0.053008,0.063191,0.052206
1,B,Rugs,36.619338,0.065961,0.671528,0.04283,0.59847,0.35869,0.443316,0.076775,...,0.032076,0.037353,0.022174,0.026493,0.050864,0.106854,0.046019,0.044543,0.056818,0.042712
2,C,Tabletop,33.790857,0.120171,0.650995,0.04503,0.62919,0.32579,0.501917,0.072777,...,0.037819,0.032704,0.01941,0.052709,0.048334,0.079482,0.043561,0.05239,0.062868,0.056888
3,D,Bedroom,32.060588,0.64066,0.596602,0.0318,0.65842,0.30977,0.453621,0.072893,...,0.034913,0.064193,0.02308,0.037382,0.047633,0.093049,0.045611,0.04258,0.058763,0.045817
4,E,Outdoor,25.741503,0.476572,0.634164,0.03968,0.62331,0.337,0.465525,0.078262,...,0.044107,0.042934,0.019172,0.056636,0.043017,0.098722,0.040517,0.05143,0.039114,0.049712
5,F,Window,26.438588,0.835132,0.619593,0.03165,0.6276,0.34071,0.530379,0.072929,...,0.045305,0.041234,0.020361,0.036111,0.062213,0.074373,0.034305,0.057444,0.046259,0.069673
6,G,Bath,31.159452,0.765305,0.658849,0.04378,0.59818,0.35804,0.453283,0.072793,...,0.029437,0.054974,0.020601,0.043238,0.046005,0.066493,0.039471,0.067132,0.05035,0.061944
7,H,Kitchen,32.532757,0.284236,0.667442,0.05142,0.63595,0.31261,0.406191,0.077334,...,0.030151,0.053651,0.020984,0.036746,0.051528,0.098598,0.053651,0.050651,0.07572,0.059077
8,I,Nursery,29.706887,0.426242,0.632424,0.04873,0.63864,0.31266,0.42051,0.076763,...,0.029085,0.06924,0.020656,0.043989,0.042778,0.106477,0.039585,0.063251,0.052381,0.047285
9,J,Pet,32.36533,0.521686,0.627844,0.03881,0.6231,0.33813,0.498119,0.075529,...,0.040801,0.054829,0.022079,0.038181,0.044079,0.063353,0.05766,0.061912,0.046183,0.054268


All the extracted NLP features are now merged into the product-level data. With this augmented data, we can build product-level models that predict product return rates.
