In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
%matplotlib inline

We'll be using BERT as our classifier, since at the time of writing it is among the state-of-the-art methods for text classification.

In [2]:
empath = pd.read_pickle('empath')

empath.head()

Unnamed: 0,reviewerID,asin,reviewText,overall,unixReviewTime,categories,brand,healthScore,danger
0,A1ZQZ8RJS1XVTX,0657745316,"No sugar, no GMO garbage, no fillers that come...",5,2013-10-11,[[Grocery & Gourmet Food]],,0.0,0.0
1,A31W38VGZAUUM4,0700026444,"This is my absolute, undisputed favorite tea r...",5,2012-12-06,[[Grocery & Gourmet Food]],,0.0,
2,A3I0AV0UJX5OH0,1403796890,I ordered spongbob slippers and I got John Cen...,1,2013-12-02,[[Grocery & Gourmet Food]],,0.0,
3,A3QAAOLIXKV383,1403796890,The cart is fine and works for the purpose for...,3,2011-06-12,[[Grocery & Gourmet Food]],,0.0,0.0
4,AB1A5EGHHVA9M,141278509X,This product by Archer Farms is the best drink...,5,2012-03-24,[[Grocery & Gourmet Food]],,0.0,


Let's first check whether the columns reviewerID and asin can be used together as a unique index.

In [3]:
empath.set_index(['reviewerID', 'asin'])

reviewerID,asin,reviewText,overall,unixReviewTime,categories,brand,healthScore,danger
A1ZQZ8RJS1XVTX,0657745316,"No sugar, no GMO garbage, no fillers that come...",5,2013-10-11,[[Grocery & Gourmet Food]],,0.0,0.0
A31W38VGZAUUM4,0700026444,"This is my absolute, undisputed favorite tea r...",5,2012-12-06,[[Grocery & Gourmet Food]],,0.0,
A3I0AV0UJX5OH0,1403796890,I ordered spongbob slippers and I got John Cen...,1,2013-12-02,[[Grocery & Gourmet Food]],,0.0,
A3QAAOLIXKV383,1403796890,The cart is fine and works for the purpose for...,3,2011-06-12,[[Grocery & Gourmet Food]],,0.0,0.0
AB1A5EGHHVA9M,141278509X,This product by Archer Farms is the best drink...,5,2012-03-24,[[Grocery & Gourmet Food]],,0.0,
...,...,...,...,...,...,...,...,...
A1DGBQQIBHTLPF,B00LOZ7F0S,I was attracted to this product initially beca...,5,2014-07-17,[[Grocery & Gourmet Food]],,0.0,
A3DYJ90MXMH04P,B00LOZ7F0S,"Wonderful vanilla. Very strong vanilla flavor,...",5,2014-07-10,[[Grocery & Gourmet Food]],,0.0,
AWQYAJV2QWLWA,B00LOZ7F0S,I usually don't write reviews but this product...,5,2014-07-13,[[Grocery & Gourmet Food]],,1.0,
A2AT0AI6QKSY2C,B00LOZ7F0S,"I like to drink my tea with vanilla extract, i...",5,2014-07-15,[[Grocery & Gourmet Food]],,0.0,


In [4]:
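# set_index returns a new frame, so check the uniqueness of its index directly
empath.set_index(['reviewerID', 'asin']).index.is_unique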

True

The index is unique, so we can use the pair (reviewerID, asin) to identify each review!
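
Another way to test whether the pair (reviewerID, asin) identifies each review is to look for repeated pairs directly. The sketch below uses a small made-up frame (the rows are illustrative and not taken from the dataset) and compares `DataFrame.duplicated` with the index-based check:

```python
import pandas as pd

# Toy data, invented for illustration: the last row repeats the first (reviewerID, asin) pair.
toy = pd.DataFrame({
    'reviewerID': ['A1', 'A2', 'A1', 'A1'],
    'asin':       ['P1', 'P1', 'P2', 'P1'],
})

key = ['reviewerID', 'asin']

# True as soon as any (reviewerID, asin) pair occurs more than once.
has_duplicates = toy.duplicated(subset=key).any()

# Same question asked through a MultiIndex built from the two columns.
index_is_unique = toy.set_index(key).index.is_unique

print(has_duplicates, index_is_unique)
```

On real data, `toy[toy.duplicated(subset=key, keep=False)]` would list the offending rows if the check fails.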

In [9]:
# Build a single string ID from the two key columns
empath['ID'] = empath['reviewerID'] + "_" + empath['asin']

empath.head()

Unnamed: 0,reviewerID,asin,reviewText,overall,unixReviewTime,categories,brand,healthScore,danger,ID
0,A1ZQZ8RJS1XVTX,0657745316,"No sugar, no GMO garbage, no fillers that come...",5,2013-10-11,[[Grocery & Gourmet Food]],,0.0,0.0,A1ZQZ8RJS1XVTX_0657745316
1,A31W38VGZAUUM4,0700026444,"This is my absolute, undisputed favorite tea r...",5,2012-12-06,[[Grocery & Gourmet Food]],,0.0,,A31W38VGZAUUM4_0700026444
2,A3I0AV0UJX5OH0,1403796890,I ordered spongbob slippers and I got John Cen...,1,2013-12-02,[[Grocery & Gourmet Food]],,0.0,,A3I0AV0UJX5OH0_1403796890
3,A3QAAOLIXKV383,1403796890,The cart is fine and works for the purpose for...,3,2011-06-12,[[Grocery & Gourmet Food]],,0.0,0.0,A3QAAOLIXKV383_1403796890
4,AB1A5EGHHVA9M,141278509X,This product by Archer Farms is the best drink...,5,2012-03-24,[[Grocery & Gourmet Food]],,0.0,,AB1A5EGHHVA9M_141278509X


BERT expects the training data in a very specific format:
* a unique ID for each data sample,
* the label associated with that sample as an int (0 or 1 for us),
* a filler column containing the same letter in every row,
* the text to classify.

This data must then be exported in **tsv** format in a data/ folder, in two files named *train.tsv* and *dev.tsv*.
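
To make the column layout concrete, the sketch below writes two made-up rows in this four-column layout and reads them back with a plain tab split and quoting disabled. That reading behaviour, and the idea that the label sits in column 1 and the text in column 3, are assumptions based on the CoLA processor in the BERT repository, so check them against your copy of the script.

```python
import csv
import io

# Made-up samples in the layout described above: ID, label, filler letter, text.
rows = [
    ['A1_P1', 1, 'a', 'Made me sick after two days.'],
    ['A2_P1', 0, 'a', 'Great tea, will buy again.'],
]

# One sample per line, fields separated by tabs, no header and no quoting.
tsv_text = '\n'.join('\t'.join(str(field) for field in row) for row in rows) + '\n'

# Read the lines back as raw tab-separated fields (quotes are not interpreted).
reader = csv.reader(io.StringIO(tsv_text), delimiter='\t',
                    quotechar=None, quoting=csv.QUOTE_NONE)

# Keep the ID (column 0), the label (column 1) and the text (column 3).
parsed = [(line[0], line[1], line[3]) for line in reader]
print(parsed)
```

Note that the label comes back as the string `'1'` or `'0'`. A float label would be written as `1.0` and would presumably not match the labels the processor expects, which is one reason to store it as an int.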

In [12]:
# .copy() gives an independent frame, so adding a column does not raise SettingWithCopyWarning
train = empath[empath['danger'].notna()].copy()

train['LetterColumn'] = 'a'

train.head()

Unnamed: 0,reviewerID,asin,reviewText,overall,unixReviewTime,categories,brand,healthScore,danger,ID,LetterColumn
0,A1ZQZ8RJS1XVTX,0657745316,"No sugar, no GMO garbage, no fillers that come...",5,2013-10-11,[[Grocery & Gourmet Food]],,0.0,0.0,A1ZQZ8RJS1XVTX_0657745316,a
3,A3QAAOLIXKV383,1403796890,The cart is fine and works for the purpose for...,3,2011-06-12,[[Grocery & Gourmet Food]],,0.0,0.0,A3QAAOLIXKV383_1403796890,a
6,A3LZA698SQPCXE,1453060464,My wife picked some of this up on sale. I usu...,3,2013-07-17,[[Grocery & Gourmet Food]],,0.0,0.0,A3LZA698SQPCXE_1453060464,a
19,A2IPE2KFGTZMI3,3295000018,"Fast and hot, did I mention hot, well it's hot",5,2014-07-10,[[Grocery & Gourmet Food]],,0.0,0.0,A2IPE2KFGTZMI3_3295000018,a
23,A2H05FXY2BXUAN,3295000018,High quality! worth every penny you pay for it...,5,2014-03-13,[[Grocery & Gourmet Food]],,0.0,0.0,A2H05FXY2BXUAN_3295000018,a


In [13]:
train = train[['ID', 'danger', 'LetterColumn', 'reviewText']]

train.head()

Unnamed: 0,ID,danger,LetterColumn,reviewText
0,A1ZQZ8RJS1XVTX_0657745316,0.0,a,"No sugar, no GMO garbage, no fillers that come..."
3,A3QAAOLIXKV383_1403796890,0.0,a,The cart is fine and works for the purpose for...
6,A3LZA698SQPCXE_1453060464,0.0,a,My wife picked some of this up on sale. I usu...
19,A2IPE2KFGTZMI3_3295000018,0.0,a,"Fast and hot, did I mention hot, well it's hot"
23,A2H05FXY2BXUAN_3295000018,0.0,a,High quality! worth every penny you pay for it...


We now split the dataframe in two: one part for training (*train_final*) and one for evaluation (*dev*).

Note that we split our data in a stratified fashion using the label column, as we want both sets to contain the labels in the same proportions.

In [15]:
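import os
from sklearn.model_selection import train_test_split

# The label is stored as a float (0.0 / 1.0); write it as an int, as the format above requires
train = train.astype({'danger': int})

train_final, dev = train_test_split(train, random_state=10, test_size=0.01,
                                    stratify=train['danger'])

# Make sure the output folder exists before writing
os.makedirs('data', exist_ok=True)
train_final.to_csv('data/train.tsv', sep='\t', index=False, header=False)
dev.to_csv('data/dev.tsv', sep='\t', index=False, header=False)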
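
As a small illustration of what the stratified split above does, the sketch below runs `train_test_split` on made-up labels (90 negatives and 10 positives, with a larger `test_size` than in the notebook so the counts are easy to read) and counts the labels on each side:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up labels: 90 negatives and 10 positives (10% positive overall).
labels = [0] * 90 + [1] * 10
texts = [f'review {i}' for i in range(len(labels))]

# stratify=labels asks for the same label proportions in both parts.
texts_train, texts_dev, y_train, y_dev = train_test_split(
    texts, labels, test_size=0.2, random_state=10, stratify=labels
)

print(Counter(y_train), Counter(y_dev))
```

Both parts keep roughly one positive for every nine negatives. With a very small dev share, a rare label may end up with only a handful of dev rows, so it is worth printing these counts on the real data as well.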

Let us now also create our testing set, which contains all the other reviews, to which we want to assign a label. This time the format is:
* a unique ID for each data sample,
* the text to classify.

This data must again be exported in **tsv** format in the data/ folder, in a file named *test.tsv*.

In [17]:
test = empath[empath['danger'].isna()]

test.head()

Unnamed: 0,reviewerID,asin,reviewText,overall,unixReviewTime,categories,brand,healthScore,danger,ID
1,A31W38VGZAUUM4,0700026444,"This is my absolute, undisputed favorite tea r...",5,2012-12-06,[[Grocery & Gourmet Food]],,0.0,,A31W38VGZAUUM4_0700026444
2,A3I0AV0UJX5OH0,1403796890,I ordered spongbob slippers and I got John Cen...,1,2013-12-02,[[Grocery & Gourmet Food]],,0.0,,A3I0AV0UJX5OH0_1403796890
4,AB1A5EGHHVA9M,141278509X,This product by Archer Farms is the best drink...,5,2012-03-24,[[Grocery & Gourmet Food]],,0.0,,AB1A5EGHHVA9M_141278509X
5,A3DTB6RVENLQ9Q,1453060375,Don't buy this item - rip off at this price. ...,1,2013-03-03,[[Grocery & Gourmet Food]],Mio,0.0,,A3DTB6RVENLQ9Q_1453060375
7,A2XZPK86YY9R6G,1453060782,I bought these on sale (2 for $4) at my local ...,4,2013-08-20,[[Grocery & Gourmet Food]],,0.0,,A2XZPK86YY9R6G_1453060782


In [19]:
test = test[['ID', 'reviewText']]

test.head()

Unnamed: 0,ID,reviewText
1,A31W38VGZAUUM4_0700026444,"This is my absolute, undisputed favorite tea r..."
2,A3I0AV0UJX5OH0_1403796890,I ordered spongbob slippers and I got John Cen...
4,AB1A5EGHHVA9M_141278509X,This product by Archer Farms is the best drink...
5,A3DTB6RVENLQ9Q_1453060375,Don't buy this item - rip off at this price. ...
7,A2XZPK86YY9R6G_1453060782,I bought these on sale (2 for $4) at my local ...


In [20]:
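# Keep the header row here: the CoLA processor in run_classifier.py appears to skip the first line of test.tsv
test.to_csv('data/test.tsv', sep='\t', index=False, header=True)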
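
One thing worth checking before export: a reader that splits on raw tabs and line breaks without honouring quotes (assumed here to be how the classification script reads its input) would see a review containing a tab or a newline as extra columns or rows. A minimal sketch of a whitespace-flattening helper, where `flatten_whitespace` is a hypothetical name and the sample text is invented:

```python
def flatten_whitespace(text: str) -> str:
    """Collapse tabs, newlines and repeated spaces into single spaces."""
    # str.split() with no argument splits on any run of whitespace.
    return ' '.join(text.split())


# Invented review text containing a newline, a tab and a Windows line break.
sample = 'Tasted fine.\nThen\tI got\r\n sick.'
clean = flatten_whitespace(sample)
print(clean)
```

On a copy of the frame, something like `test['reviewText'].map(flatten_whitespace)` could be applied before calling `to_csv`, and the same goes for the training data.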

We are now ready for the actual classification phase: we will be using the [BERT-Large, Uncased (Whole Word Masking)](https://github.com/google-research/bert#pre-trained-models) model to classify our reviews.

Note that the classification task must be done with the python script *run_classifier.py* available in the [original repo](https://github.com/google-research/bert#pre-trained-models), and thus cannot be shown here.

This is the command executed to run our classification process:
```
python run_classifier.py --task_name=cola \
  --do_train=true --do_eval=true --do_predict=true \
  --data_dir=./data/ --vocab_file=./uncased_L-24_H-1024_A-16/vocab.txt \
  --bert_config_file=./uncased_L-24_H-1024_A-16/bert_config.json \
  --init_checkpoint=./uncased_L-24_H-1024_A-16/bert_model.ckpt \
  --max_seq_length=400 --train_batch_size=8 --learning_rate=2e-5 \
  --num_train_epochs=3.0 --output_dir=./bert_output/ --do_lower_case=True
```