# Sentiment Analysis of Amazon Reviews

![](https://www.topbots.com/wp-content/uploads/2020/01/cover_sentiment_analysis_BERT_1600px_web-1280x640.jpg)

Hello Everyone!

In this notebook, I'll work with the Amazon Reviews dataset, which contains 360,000 reviews. Each review is labeled as either positive or negative.

Steps:
* EDA
* Baseline Logistic Regression (Tf-Idf)
* DistilBert
* [DistilBert Inference Optimization](https://www.kaggle.com/alexalex02/nlp-transformers-inference-optimization)

## Importing libraries and reading data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import warnings
import seaborn as sns
warnings.filterwarnings('ignore')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix
from sklearn.model_selection import train_test_split
import joblib
import eli5

In [None]:
train_val = pd.read_csv('../input/amazontrainreviews/train.csv', index_col=0)
train_val.reset_index(drop=True, inplace=True)

# EDA

In [None]:
print(train_val.info())
display(train_val.head())

There are no null values, so let's look at the target distribution.

In [None]:
sns.countplot(x='labels', data=train_val);  # explicit keywords work across seaborn versions
plt.title('Labels distribution');

Let's count the number of words in each review and look at the distribution.

In [None]:
train_val['len'] = train_val['sentences'].apply(lambda x: len(x.split()))
sns.distplot(train_val['len']);

Now we'll split the word counts by sentiment and calculate the mean for each class.

In [None]:
neg_mean_len = train_val.groupby('labels')['len'].mean().values[0]
pos_mean_len = train_val.groupby('labels')['len'].mean().values[1]

print(f"Negative mean length: {neg_mean_len:.2f}")
print(f"Positive mean length: {pos_mean_len:.2f}")
print(f"Mean Difference: {neg_mean_len-pos_mean_len:.2f}")
ax = sns.catplot(x='labels', y='len', data=train_val, kind='box')

We can see that negative reviews are longer on average. To assess how significant this difference is, we use a permutation test and calculate a p-value.

First, we define a function that generates a permutation sample from two arrays. Then we generate permutation replicates, each of which is a single statistic computed from a permutation sample. Finally, we compute the probability of observing a difference in means of at least 5.91 words under the null hypothesis that review lengths follow the same distribution in both classes.

In [None]:
neg_array = train_val[train_val['labels']==0]['len'].values
pos_array = train_val[train_val['labels']==1]['len'].values
mean_diff = neg_mean_len - pos_mean_len

In [None]:
def permutation_sample(data1, data2):
    # Permute the concatenated array: permuted_data
    data = np.concatenate((data1,data2))
    permuted_data = np.random.permutation(data)

    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]

    return perm_sample_1, perm_sample_2

In [None]:
def draw_perm_reps(data_1, data_2, size=1):

    perm_replicates = np.empty(size)

    for i in range(size):
        # Generate permutation sample
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)

        # Compute the test statistic
        perm_replicates[i] = np.mean(perm_sample_1) - np.mean(perm_sample_2)

    return perm_replicates

In [None]:
perm_replicates = draw_perm_reps(neg_array, pos_array,
                                 size=10000)

# Compute p-value: p
p = np.sum(perm_replicates >= mean_diff) / len(perm_replicates)

print(f'p-value = {p}')

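The p-value is small enough to reject the null hypothesis: a difference this large would be very unlikely if review lengths had the same distribution in both classes.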
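
To make the p-value computation concrete, here is a minimal self-contained sketch of the same one-sided permutation test on synthetic data. The word counts, the seed and the helper name `perm_test_p_value` are illustrative assumptions and are not part of the notebook. The sketch also adds one to the numerator and the denominator, a common correction that keeps the estimate from being exactly zero when no replicate reaches the observed difference.

```python
import random
from statistics import mean


def perm_test_p_value(sample_a, sample_b, n_perm=2000, seed=0):
    """One-sided permutation test for mean(sample_a) - mean(sample_b)."""
    rng = random.Random(seed)
    observed = mean(sample_a) - mean(sample_b)
    pooled = list(sample_a) + list(sample_b)
    n_a = len(sample_a)

    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel the observations at random
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if diff >= observed:
            hits += 1

    # (hits + 1) / (n_perm + 1) never returns exactly 0
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical word counts, made up for illustration only
neg_lengths = [92, 85, 101, 78, 110, 95, 88, 104]
pos_lengths = [70, 82, 65, 90, 74, 68, 79, 72]

observed_diff, p_value = perm_test_p_value(neg_lengths, pos_lengths)
print(f"observed difference = {observed_diff:.2f}, p-value = {p_value:.4f}")
```

With 10,000 replicates, as used above, the smallest p-value this corrected estimator can report is 1/10001, which is a more honest statement than a printed `0.0`.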

# Baseline - LogReg (Tf-Idf)

Our baseline is Logistic Regression on Tf-Idf features. First, we define a function that fits the model, computes accuracy, F1 score and the confusion matrix on the validation set, and saves the fitted model.

In [None]:
def prediction(model, X_train, y_train, X_valid, y_valid):
    model.fit(X_train, y_train)
    pred = model.predict(X_valid)
    acc = accuracy_score(y_valid, pred)
    f1 = f1_score(y_valid, pred)
    conf = confusion_matrix(y_valid, pred)
    joblib.dump(model, f"model_acc_{acc:.5f}.pkl")
    return model, acc, f1, conf

Next, we extract unigrams, bigrams and trigrams as Tf-Idf features and remove English stop words.

In [None]:
transformer = TfidfVectorizer(stop_words='english', ngram_range=(1, 3), 
                              lowercase=True, max_features=100000)
X = transformer.fit_transform(train_val['sentences'])
y = train_val.labels
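
To show what "unigrams, bigrams and trigrams with stop words removed" means in practice, here is a simplified pure-Python sketch. It is not scikit-learn's implementation: it splits on whitespace instead of using the vectorizer's token pattern, and both the helper `word_ngrams` and the tiny stop list are hypothetical. It does mirror one real detail: stop words are dropped before n-grams are built, so an n-gram can join words that were not adjacent in the original text.

```python
def word_ngrams(text, stop_words, ngram_range=(1, 3)):
    """Lowercase, drop stop words, then build word n-grams."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


# Tiny hypothetical stop list; scikit-learn's 'english' list is much longer
STOP_WORDS = {"the", "is", "a", "this", "and"}

features = word_ngrams("This is a great product and the price is fair", STOP_WORDS)
print(features)
```

In this toy sentence, `"product price"` becomes a bigram even though "and the" originally separated the two words.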

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, 
                                                      random_state=42, stratify=y)
model = LogisticRegression(C=1, random_state=42, n_jobs=-1)
fit_model, acc, f1, conf = prediction(model, X_train, y_train, X_valid, y_valid)

In [None]:
print(f"Accuracy: {acc:.5f}")
print(f"F1_Score: {f1:.5f}")
print(f"Confusion Matrix: {conf}")
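
To relate the three reported numbers to each other, the sketch below derives accuracy and F1 directly from a 2x2 confusion matrix. The counts are hypothetical and are not results from this notebook. The helper name `metrics_from_confusion` is also an assumption. The matrix layout follows scikit-learn's convention, where rows are true labels and columns are predicted labels.

```python
def metrics_from_confusion(conf):
    """Derive accuracy, precision, recall and F1 from a 2x2 confusion matrix.

    Layout follows scikit-learn: rows are true labels, columns are predictions,
    so conf = [[TN, FP], [FN, TP]] for labels 0 (negative) and 1 (positive).
    """
    (tn, fp), (fn, tp) = conf
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical counts, not results from this notebook
example_conf = [[90, 10], [20, 80]]
accuracy, precision, recall, f1 = metrics_from_confusion(example_conf)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```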

Interpreting model weights with ELI5.

In [None]:
# Note: on scikit-learn >= 1.0, get_feature_names_out() replaces get_feature_names()
eli5.show_weights(estimator=fit_model,
                  feature_names=list(transformer.get_feature_names()),
                  top=(20, 20))

# DistilBert

Here we'll use DistilBert from [transformers](https://huggingface.co/transformers/index.html) and [catalyst](https://github.com/catalyst-team/catalyst) to run the experiment.

First, we install a torch nightly build for mixed-precision training.

In [None]:
!pip install --pre torch==1.7.0.dev20200701+cu101 torchvision==0.8.0.dev20200701+cu101 -f https://download.pytorch.org/whl/nightly/cu101/torch_nightly.html
import torch
torch.__version__

In [None]:
import os
os.environ['WANDB_SILENT'] = 'True'
os.environ['CUDA_VISIBLE_DEVICES'] = "0"

from typing import Mapping, List
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F

from transformers import AutoConfig, AutoTokenizer, AutoModel

from catalyst.dl import SupervisedRunner
from catalyst.dl.callbacks import AccuracyCallback, OptimizerCallback, CheckpointCallback, WandbLogger
from catalyst.utils import set_global_seed, prepare_cudnn
from catalyst.contrib.nn import RAdam, Lookahead, OneCycleLRWithWarmup
import wandb

Config setup

In [None]:
MODEL_NAME = 'distilbert-base-uncased'
LOG_DIR = "./amazon" 
NUM_EPOCHS = 2 
LEARNING_RATE = 5e-5
MAX_SEQ_LENGTH = 512
BATCH_SIZE = 32
WEIGHT_DECAY = 1e-3
ACCUMULATION_STEPS = 3
SEED = 42
FP_16 = dict(opt_level="O1")

For reproducibility

In [None]:
set_global_seed(SEED)
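# benchmark=False: cuDNN autotuning can pick different kernels between runs,
# which works against deterministic results
prepare_cudnn(deterministic=True, benchmark=False)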

Next, we'll create the dataset class. It instantiates the tokenizer, converts tokens to integer ids, adds special tokens and pads each sequence to `max_length`. Each item is returned as a dict with the keys `'input_ids'`, `'attention_mask'` and `'targets'`.

In [None]:
class ReviewDataset(Dataset):

    
    def __init__(self,
                 sentences: List[str],
                 labels: List[int] = None,
                 max_seq_length: int = MAX_SEQ_LENGTH,
                 model_name: str = 'distilbert-base-uncased'):

        self.sentences = sentences
        self.labels = labels
        self.max_seq_length = max_seq_length

        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

        
    def __len__(self):

        return len(self.sentences)

    
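    def __getitem__(self, index) -> Mapping[str, torch.Tensor]:

        sentence = self.sentences[index]
        encoded = self.tokenizer.encode_plus(sentence, add_special_tokens=True,
                                        pad_to_max_length=True, max_length=self.max_seq_length,
                                        return_tensors="pt",)

        # encode_plus returns tensors of shape [1, max_seq_length]; drop the leading
        # dimension so the DataLoader collates batches of shape [batch, max_seq_length]
        output = {
            'input_ids': encoded['input_ids'].squeeze(0),
            'attention_mask': encoded['attention_mask'].squeeze(0)
        }

        # Labels are optional (e.g. at inference time)
        if self.labels is not None:
            output['targets'] = torch.tensor(self.labels[index], dtype=torch.long)

        return output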
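
The encoding steps described above (token to id lookup, special tokens, padding, attention mask) can be sketched without the tokenizer library. Everything in this block is a simplified stand-in: the helper `encode_with_padding`, the toy vocabulary and the special-token ids are hypothetical, and the real tokenizer splits text into WordPiece sub-words rather than whole words.

```python
def encode_with_padding(tokens, vocab, max_length, cls_id, sep_id, pad_id, unk_id):
    """Map tokens to ids, add special tokens, truncate and pad to max_length."""
    ids = [vocab.get(tok, unk_id) for tok in tokens]
    ids = ids[: max_length - 2]            # leave room for [CLS] and [SEP]
    ids = [cls_id] + ids + [sep_id]
    attention_mask = [1] * len(ids)        # 1 marks real tokens
    n_pad = max_length - len(ids)
    ids += [pad_id] * n_pad
    attention_mask += [0] * n_pad          # 0 marks padding
    return {"input_ids": ids, "attention_mask": attention_mask}


# Hypothetical toy vocabulary; the real one has about 30,000 entries
toy_vocab = {"great": 10, "product": 11, "fair": 12}
encoded = encode_with_padding(
    ["great", "product", "really"], toy_vocab,
    max_length=8, cls_id=1, sep_id=2, pad_id=0, unk_id=3,
)
print(encoded)
```

The attention mask is what lets the model ignore the padded positions, which matters here because most reviews are far shorter than 512 tokens.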

Making train_test_split, defining datasets and loaders

In [None]:
df_train, df_valid = train_test_split(
            train_val,
            test_size=0.2,
            random_state=42,
            stratify = train_val.labels.values
        )
print(df_train.shape, df_valid.shape)

In [None]:
train_dataset = ReviewDataset(
    sentences=df_train['sentences'].values.tolist(),
    labels=df_train['labels'].values,
    max_seq_length=MAX_SEQ_LENGTH,
    model_name=MODEL_NAME
)

valid_dataset = ReviewDataset(
    sentences=df_valid['sentences'].values.tolist(),
    labels=df_valid['labels'].values,
    max_seq_length=MAX_SEQ_LENGTH,
    model_name=MODEL_NAME
)

In [None]:
train_val_loaders = {
    "train": DataLoader(dataset=train_dataset,
                        batch_size=BATCH_SIZE, 
                        shuffle=True, num_workers=2, pin_memory=True),
    "valid": DataLoader(dataset=valid_dataset,
                        batch_size=BATCH_SIZE, 
                        shuffle=False, num_workers=2, pin_memory=True)    
}

Let's look at one review and the corresponding model input.

In [None]:
print(df_valid.sentences.values[50])
valid_dataset[50]

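Next, we initialize the pre-trained model. From the config we take the hidden dimensionality of the encoder (`config.dim` = 768) for the classification head and the dropout probability (`config.seq_classif_dropout` = 0.2). The forward pass then computes logits for the input sequence from the hidden state of its first token.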

In [None]:
class DistilBert(nn.Module):

    def __init__(self, pretrained_model_name: str = MODEL_NAME, num_classes: int = 2):

        super().__init__()

        config = AutoConfig.from_pretrained(
             pretrained_model_name)

        self.distilbert = AutoModel.from_pretrained(pretrained_model_name,
                                                    config=config)
        self.pre_classifier = nn.Linear(config.dim, config.dim)
        self.classifier = nn.Linear(config.dim, num_classes)
        self.dropout = nn.Dropout(config.seq_classif_dropout)

    def forward(self, input_ids, attention_mask=None, head_mask=None):

        assert attention_mask is not None, "attention mask is none"
        distilbert_output = self.distilbert(input_ids=input_ids,
                                            attention_mask=attention_mask,
                                            head_mask=head_mask)
        hidden_state = distilbert_output[0]  # [BATCH_SIZE=32, MAX_SEQ_LENGTH = 512, DIM = 768]
        pooled_output = hidden_state[:, 0]  # [32, 768]
        pooled_output = self.pre_classifier(pooled_output)  # [32, 768]
        pooled_output = F.relu(pooled_output)  # [32, 768]
        pooled_output = self.dropout(pooled_output)  # [32, 768]
        logits = self.classifier(pooled_output)  # [32, 2]

        return logits

In [None]:
model = DistilBert()

Training setup:

1. Weight decay is applied to all parameters except biases and LayerNorm weights.
1. Lookahead optimizer (improves learning stability and lowers the variance of its inner optimizer, RAdam here).
1. OneCycleLRWithWarmup with 0 warmup steps and cosine annealing from 5e-5 to 1e-8.
1. Gradient accumulation to emulate training with a larger batch.

In [None]:
param_optim = list(model.named_parameters())
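# Matching is by case-sensitive substring. DistilBert names its encoder norms
# `sa_layer_norm` / `output_layer_norm`, so 'LayerNorm.weight' alone would miss them.
no_decay = ['bias', 'LayerNorm.weight', 'layer_norm.weight']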
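
The name-based split that the `no_decay` list drives can be checked in isolation. The sketch below uses a hypothetical helper, `split_decay_groups`, and made-up parameter names in the style of a transformer encoder (they are assumptions and are not read from the model). It shows that the match is a plain case-sensitive substring test.

```python
def split_decay_groups(param_names, no_decay):
    """Split names into (decayed, not_decayed) by substring match."""
    decayed, not_decayed = [], []
    for name in param_names:
        if any(pattern in name for pattern in no_decay):
            not_decayed.append(name)
        else:
            decayed.append(name)
    return decayed, not_decayed


# Illustrative names only (assumed, not read from the model)
names = [
    "encoder.layer.0.attention.q_lin.weight",
    "encoder.layer.0.attention.q_lin.bias",
    "encoder.embeddings.LayerNorm.weight",
    "encoder.layer.0.output_layer_norm.weight",
]

decayed, not_decayed = split_decay_groups(names, ["bias", "LayerNorm.weight"])
print("decayed:", decayed)
print("not decayed:", not_decayed)
```

Because `"LayerNorm.weight"` does not match `"output_layer_norm.weight"`, that parameter ends up in the decayed group. Printing the names from `model.named_parameters()` is a quick way to confirm the patterns cover what is intended.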

In [None]:
criterion = nn.CrossEntropyLoss()

base_optimizer = RAdam([
    {'params': [p for n,p in param_optim if not any(nd in n for nd in no_decay)],
     'weight_decay': WEIGHT_DECAY}, 
    {'params': [p for n,p in param_optim if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
])
optimizer = Lookahead(base_optimizer)
scheduler = OneCycleLRWithWarmup(
    optimizer, 
    num_steps=NUM_EPOCHS, 
    lr_range=(LEARNING_RATE, 1e-8),
    init_lr=LEARNING_RATE,
    warmup_steps=0,
)
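
For intuition about the annealing part of the schedule, here is a generic cosine interpolation between the two learning rates used above. This is a sketch of the general formula under the assumption of a simple step counter, not Catalyst's internal implementation of `OneCycleLRWithWarmup`. The function name `cosine_annealing` and the choice of five steps are illustrative.

```python
import math


def cosine_annealing(step, total_steps, lr_start, lr_end):
    """Cosine interpolation from lr_start (step 0) to lr_end (last step)."""
    progress = step / max(total_steps - 1, 1)
    return lr_end + 0.5 * (lr_start - lr_end) * (1 + math.cos(math.pi * progress))


schedule = [cosine_annealing(s, total_steps=5, lr_start=5e-5, lr_end=1e-8) for s in range(5)]
for step, lr in enumerate(schedule):
    print(f"step {step}: lr = {lr:.3e}")
```

The curve starts flat, drops fastest in the middle and flattens again near the end, so the model spends its last updates at a very small learning rate.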

In [None]:
runner = SupervisedRunner(
    input_key=(
        "input_ids",
        "attention_mask"
    )
)
# model training
runner.train(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    scheduler=scheduler,
    loaders=train_val_loaders,
    callbacks=[
        AccuracyCallback(num_classes=2),
        OptimizerCallback(accumulation_steps=ACCUMULATION_STEPS),
        WandbLogger(name="Name", project="sentiment-analysis"),
    ],
    fp16=FP_16,
    logdir=LOG_DIR,
    num_epochs=NUM_EPOCHS,
    verbose=True
)
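
Gradient accumulation, as configured through `ACCUMULATION_STEPS`, can be illustrated with a one-parameter toy model: averaging the gradients of equally sized micro-batches gives the same gradient as one large batch. The data, the starting weight and the helper `grad_mse` are hypothetical, and whether a given framework rescales the loss in exactly this way is an assumption not shown in the notebook.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: six samples split into three equal micro-batches
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 3.9, 6.1, 8.2, 9.8, 12.1]
w = 0.5
accumulation_steps = 3
micro_batch_size = len(xs) // accumulation_steps

accumulated = 0.0
for i in range(accumulation_steps):
    lo, hi = i * micro_batch_size, (i + 1) * micro_batch_size
    # Scale each micro-batch gradient so the sum is an average over all samples
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps

full_batch = grad_mse(w, xs, ys)
print(f"accumulated = {accumulated:.6f}, full batch = {full_batch:.6f}")

# With the notebook's settings, one optimizer step sees 32 * 3 = 96 reviews
effective_batch_size = 32 * 3
```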

![](https://i.ibb.co/9wxK0Zz/Val-Metric.png)

After two epochs, we reach 96.22% accuracy, which is about 6 percentage points higher than logistic regression.

To improve the result further, we can continue fine-tuning with a frozen encoder.


### Test

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Note: this redefines the `prediction` helper used for the Tf-Idf baseline above
def prediction(model, sentence: str, max_len: int = 512, device='cpu'):
    # Tokenize a single review into a padded batch of size 1
    x_encoded = tokenizer.encode_plus(sentence, add_special_tokens=True, pad_to_max_length=True,
                                      max_length=max_len, return_tensors="pt").to(device)
    model.eval()  # disable dropout for inference
    with torch.no_grad():  # no gradients are needed at inference time
        logits = model(x_encoded['input_ids'], x_encoded['attention_mask'])
    probabilities = F.softmax(logits, dim=1)
    output = probabilities.max(dim=1)
    label = 'Negative' if output.indices[0] == 0 else 'Positive'
    print(sentence)
    print(f"Class: {label}, Probability: {output.values[0]:.4f}")

In [None]:
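# NOTE: `plain_model` is not defined earlier in this notebook; it is assumed to be
# the fine-tuned DistilBert instance, moved to the CPU
prediction(plain_model, df_valid['sentences'].values[20])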
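
The last step of the prediction helper, turning two logits into a class name and a probability, can be reproduced by hand. The logits below are hypothetical and not an actual model output. The `softmax` helper and `CLASS_NAMES` list are illustrative names, with index 0 as negative and index 1 as positive to match the labels column.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


CLASS_NAMES = ["Negative", "Positive"]  # index 0 / 1, matching the labels column

example_logits = [-1.2, 2.3]  # hypothetical model output for one review
probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"Class: {CLASS_NAMES[best]}, Probability: {probs[best]:.4f}")
```

Subtracting the largest logit before exponentiating does not change the result but avoids overflow for large values.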