<a href="https://colab.research.google.com/github/ITU-Business-Analytics-Team/Business_Analytics_for_Professionals/blob/main/Part%20I%20%3A%20Methods%20%26%20Technologies%20for%20Business%20Analytics/Chapter%207%3A%20Text%20Analytics/7_6_3_Deep_Learning_Based_Sentiment_Analysis__XLNET.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Sentiment Analysis (Opinion Mining)**
## Deep Learning Based Sentiment Analysis

Sentiment analysis of commodity news was previously investigated with statistical methods. As noted at the end of that notebook, sentiment analysis with so few instances (only 1120 news items in total for training and testing) is a hard problem. In recent years, NLP research has advanced and produced high-quality classification models, most of which are based on deep learning. This notebook introduces the XLNet approach.



### XLNet

XLNet is another sequence classification model based on the transformer architecture. Its implementation and results are very similar to BERT's. PyTorch is used again, and most of the functions defined in the BERT notebook are reused.

In [None]:
!pip install sentencepiece
# if this is the first time sentencepiece is installed, restart the runtime from the Runtime menu in the Colab toolbar

In [None]:
!pip install transformers
!pip install torch torchvision torchaudio

In [None]:
# for deep learning implementation
import torch
# to work on the dataset
import pandas as pd
# to follow progress as bar in notebook
from tqdm.notebook import tqdm

In [None]:
# read the training data
url = 'https://docs.google.com/spreadsheets/d/1XXyxrd7r0mx7kyLaYHDVwh6BFJzo8cPD/edit?usp=sharing&ouid=108589602591644119588&rtpof=true&sd=true'
path = 'https://drive.google.com/uc?export=download&id=' + url.split('/')[-2]

# str.lstrip() removes a set of characters rather than a prefix, so it can also
# eat the first letters of a headline; an anchored regex removes only the prefixes
prefix_pattern = r'^(?:News :\s*)?(?:UPDATE)?(?:METALS-)?'

df = pd.read_excel(path)
df['summary'] = df['summary'].str.replace(prefix_pattern, '', regex=True)
df.rename(columns={'summary': 'text'}, inplace=True)

In [None]:
# read the test data and clean it in the same way
url = 'https://docs.google.com/spreadsheets/d/145tqf2J949KGCYnH-Nx3hiaHTogiZFn4/edit?usp=sharing&ouid=108589602591644119588&rtpof=true&sd=true'
path = 'https://drive.google.com/uc?export=download&id=' + url.split('/')[-2]
test_df = pd.read_excel(path)
test_df['summary'] = test_df['summary'].str.replace(prefix_pattern, '', regex=True)
test_df.rename(columns={'summary': 'text'}, inplace=True)
test_df
test_df

Unnamed: 0,text,sentiment
0,Copper at near 2-week highs on hopes China imp...,0
1,"China's Yunnan to help firms stockpile 110,000...",1
2,COLUMN-Politics trumps aluminium as U.S. reimp...,-1
3,Base metals decline on weak China demand outlook,-1
4,"ALUMINIUM FALLS TO $1,751.50/T, LOWEST SINCE...",-1
...,...,...
163,China names former Chinalco exec as industry m...,1
164,Copper edges off two-year low as Washington so...,0
165,"Uncertainty on global growth, trade war weighs...",-1
166,Copper gains after Fed chief rekindles rate cu...,1


In [None]:
df = df.drop_duplicates().merge(test_df.drop_duplicates(), on=test_df.columns.to_list(), 
                   how='left', indicator=True, right_index = False, left_index = False)
df = df.loc[df._merge=='left_only',df.columns!='_merge']
df = df.reset_index(drop = True, inplace= False)
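
The cell above removes from the training frame every row that also appears in the test frame (an anti-join). A minimal sketch of the same pattern on toy data follows; the frames `train` and `test` and their rows are hypothetical and only illustrate how `indicator=True` marks the rows to keep.

```python
import pandas as pd

# Toy stand-ins for the training and test frames (hypothetical rows)
train = pd.DataFrame({'text': ['copper up', 'zinc down', 'tin flat'],
                      'sentiment': [1, -1, 0]})
test = pd.DataFrame({'text': ['zinc down'], 'sentiment': [-1]})

# Left merge on all shared columns; indicator=True adds a '_merge' column
merged = train.drop_duplicates().merge(test.drop_duplicates(),
                                       on=test.columns.to_list(),
                                       how='left', indicator=True)

# Rows marked 'left_only' exist only in the training frame
train_only = merged.loc[merged['_merge'] == 'left_only', ['text', 'sentiment']]
train_only = train_only.reset_index(drop=True)
print(train_only)
```

Rows present in both frames are marked `both` and dropped, so no test headline leaks into training.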

In [None]:
df.sentiment.value_counts()

 1    486
-1    366
 0     64
Name: sentiment, dtype: int64

In [None]:
possible_labels = df.sentiment.unique()
label_dict = {}
for index, possible_label in enumerate(possible_labels):
    label_dict[possible_label] = index

In [None]:
df['label'] = df.sentiment.replace(label_dict)
test_df['label'] = test_df.sentiment.replace(label_dict)

In [None]:
df['data_type'] = ['train']*df.shape[0]
test_df['data_type'] = ['val']*test_df.shape[0]

In [None]:
df.groupby(['sentiment', 'label', 'data_type']).count()

sentiment,label,data_type,text
-1,1,train,366
0,2,train,64
1,0,train,486


In [None]:
test_df.groupby(['sentiment', 'label', 'data_type']).count()

sentiment,label,data_type,text
-1,1,val,65
0,2,val,12
1,0,val,91
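
The tables above show that the sentiment values are re-coded as consecutive class indices (1 → 0, -1 → 1, 0 → 2). A small sketch of that mapping and its inverse is given below; the order of `possible_labels` is an assumption chosen to reproduce the mapping in the tables, since `unique()` returns values in order of first appearance.

```python
# Sentiment values in order of first appearance (assumed order, see above)
possible_labels = [1, -1, 0]

# Forward mapping: sentiment value -> class index used by the model
label_dict = {label: index for index, label in enumerate(possible_labels)}
# Inverse mapping: class index -> sentiment value, used when reporting results
label_dict_inverse = {index: label for label, index in label_dict.items()}

sentiments = [1, 0, -1, -1, 1]
encoded = [label_dict[s] for s in sentiments]
decoded = [label_dict_inverse[i] for i in encoded]

print(label_dict)
print(encoded)
```

The model needs indices starting at 0, while the inverse dictionary turns predictions back into the original sentiment values.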


In [None]:
from torch.utils.data import TensorDataset
from transformers import XLNetTokenizer, XLNetModel

Unlike in the BERT notebook, the 'xlnet-large-cased' tokenizer is used here.

In [None]:
PRE_TRAINED_MODEL_NAME = 'xlnet-large-cased'
tokenizer = XLNetTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

In [None]:
# padding='max_length' and truncation=True replace the deprecated pad_to_max_length
# flag and make the truncation to 52 tokens explicit
encoded_data_train = tokenizer.batch_encode_plus(
    df.text.values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=52,
    return_tensors='pt'
)

encoded_data_val = tokenizer.batch_encode_plus(
    test_df.text.values.tolist(),
    add_special_tokens=True,
    return_attention_mask=True,
    padding='max_length',
    truncation=True,
    max_length=52,
    return_tensors='pt'
)


input_ids_train = encoded_data_train['input_ids']
attention_masks_train = encoded_data_train['attention_mask']
labels_train = torch.tensor(df.label.values)

input_ids_val = encoded_data_val['input_ids']
attention_masks_val = encoded_data_val['attention_mask']
labels_val = torch.tensor(test_df.label.values)
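
Encoding every headline to a fixed `max_length` means cutting long sequences and padding short ones, with the attention mask marking which positions hold real tokens. The sketch below is a simplified illustration of that idea, not the tokenizer's implementation: the function `pad_or_truncate`, the pad id `0` and the `pad_left` switch are assumptions made for the example, and special tokens are ignored.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0, pad_left=True):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = list(token_ids)[:max_length]      # truncation: drop tokens beyond max_length
    n_pad = max_length - len(ids)           # number of padding positions needed
    mask = [1] * len(ids)                   # 1 marks a real token
    if pad_left:
        return [pad_id] * n_pad + ids, [0] * n_pad + mask
    return ids + [pad_id] * n_pad, mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([7, 8, 9], max_length=5)
long_ids, long_mask = pad_or_truncate([1, 2, 3, 4, 5, 6], max_length=4, pad_left=False)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The padding side is a parameter here because it differs between tokenizers; the model ignores positions whose mask value is 0.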


In [None]:
dataset_train = TensorDataset(input_ids_train, attention_masks_train, labels_train)
dataset_val = TensorDataset(input_ids_val, attention_masks_val, labels_val)

In [None]:
import numpy as np
from sklearn.metrics import f1_score

In [None]:
def f1_score_func(preds, labels):
    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return f1_score(labels_flat, preds_flat, average='weighted')
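
To make the weighted F1 score concrete, the following sketch computes it by hand in plain Python: the F1 of each class is weighted by that class's share of the true labels. The helper `weighted_f1` is written for this illustration and is not part of the notebook; the class counts (91, 65, 12) are the validation counts shown earlier.

```python
def weighted_f1(labels, preds):
    """Support-weighted average of the per-class F1 scores."""
    n = len(labels)
    score = 0.0
    for c in sorted(set(labels)):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0   # F1 = 2TP / (2TP + FP + FN)
        support = tp + fn                        # number of true items of class c
        score += f1 * support / n
    return score

# A classifier that always predicts the majority class (label 0)
labels = [0] * 91 + [1] * 65 + [2] * 12
preds = [0] * 168
score = weighted_f1(labels, preds)
print(round(score, 4))  # 0.3806
```

A constant majority-class prediction already scores about 0.38 on this validation set, which is the value the training log reports for the first two epochs. This is consistent with, though it does not prove, the model predicting a single class at that stage.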

In [None]:
def accuracy_per_class(preds, labels, test=False):
    # `test` is kept so that existing calls still work; the overall accuracy is
    # normalised by the number of labels passed in, so no global dataset
    # (dataset_val or the undefined dataset_test) is needed
    label_dict_inverse = {v: k for k, v in label_dict.items()}

    preds_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()

    correct_total = 0
    for label in np.unique(labels_flat):
        y_preds = preds_flat[labels_flat == label]
        n_correct = int((y_preds == label).sum())
        acc = n_correct / len(y_preds) * 100
        correct_total += n_correct
        print(f'Class: {label_dict_inverse[label]}')
        print(f'Accuracy: {acc}\n')
    print(f'Overall Accuracy: {correct_total / len(labels_flat) * 100}\n')
    

In [None]:
from transformers import XLNetForSequenceClassification

In [None]:
import random

seed_val = 17
random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

In [None]:
model = XLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=len(label_dict), mem_len=1024)

Some weights of the model checkpoint at xlnet-base-cased were not used when initializing XLNetForSequenceClassification: ['lm_loss.weight', 'lm_loss.bias']
- This IS expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLNetForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of XLNetForSequenceClassification were not initialized from the model checkpoint at xlnet-base-cased and are newly initialized: ['logits_proj.weight', 'sequence_summary.summary.bias', 'logits_proj.bias', 'sequence_summary.summary.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions a

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

In [None]:
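from torch.optim import AdamW  # transformers' AdamW is deprecated; torch's default weight_decay is 0.01, not 0.0
from transformers import get_linear_schedule_with_warmup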

In [None]:
batch_size = 32

dataloader_train = DataLoader(dataset_train, 
                              sampler=RandomSampler(dataset_train), 
                              batch_size=batch_size)

dataloader_validation = DataLoader(dataset_val, 
                                   sampler=SequentialSampler(dataset_val), 
                                   batch_size=batch_size)

In [None]:
optimizer = AdamW(model.parameters(),
                  lr=5e-5, 
                  eps=1e-8)

In [None]:
epochs = 25

scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps=0,
                                            num_training_steps=len(dataloader_train)*epochs)
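
The scheduler scales the learning rate linearly: up from zero during warmup (none here) and then down to zero at the last training step. A minimal sketch of that rule in plain Python follows; `linear_schedule_lr` is a helper written for this illustration, and the step count assumes 29 batches per epoch (916 training rows at batch size 32) over 25 epochs.

```python
def linear_schedule_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Learning rate at a given optimizer step under linear warmup and decay."""
    if step < num_warmup_steps:
        factor = step / max(1, num_warmup_steps)
    else:
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor

total_steps = 29 * 25  # batches per epoch * epochs
for step in (0, 145, 725):
    print(step, linear_schedule_lr(step, 5e-5, 0, total_steps))
```

With no warmup the rate starts at the full 5e-5 and shrinks every step, so late epochs make much smaller updates than early ones.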

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

print(device)

cuda


In [None]:
def evaluate(dataloader_val):

    model.eval()
    
    loss_val_total = 0
    predictions, true_vals = [], []
    
    for batch in dataloader_val:
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }

        with torch.no_grad():        
            outputs = model(**inputs)
            
        loss = outputs[0]
        logits = outputs[1]
        loss_val_total += loss.item()

        logits = logits.detach().cpu().numpy()
        label_ids = inputs['labels'].cpu().numpy()
        predictions.append(logits)
        true_vals.append(label_ids)
    
    loss_val_avg = loss_val_total/len(dataloader_val) 
    
    predictions = np.concatenate(predictions, axis=0)
    true_vals = np.concatenate(true_vals, axis=0)
            
    return loss_val_avg, predictions, true_vals

In [None]:
for epoch in tqdm(range(1, epochs+1)):
    
    model.train()
    
    loss_train_total = 0

    progress_bar = tqdm(dataloader_train, desc='Epoch {:1d}'.format(epoch), leave=False, disable=False)
    for batch in progress_bar:

        model.zero_grad()
        
        batch = tuple(b.to(device) for b in batch)
        
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2],
                 }       

        outputs = model(**inputs)
        
        loss = outputs[0]
        loss_train_total += loss.item()
        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        optimizer.step()
        scheduler.step()
        
        # the loss is already averaged over the batch; len(batch) is the tuple length (3), not the batch size
        progress_bar.set_postfix({'training_loss': '{:.3f}'.format(loss.item())})
         
        
    torch.save(model.state_dict(), f'finetuned_XLNET_epoch_{epoch}.model')
        
    tqdm.write(f'\nEpoch {epoch}')
    
    loss_train_avg = loss_train_total/len(dataloader_train)             
    tqdm.write(f'Training loss: {loss_train_avg}')
    
    val_loss, predictions, true_vals = evaluate(dataloader_validation)
    val_f1 = f1_score_func(predictions, true_vals)
    tqdm.write(f'Validation loss: {val_loss}')
    tqdm.write(f'F1 Score (Weighted): {val_f1}')

Epoch 1
Training loss: 0.930281425344533
Validation loss: 0.9117169777552286
F1 Score (Weighted): 0.3806306306306307

Epoch 2
Training loss: 0.9007037836929848
Validation loss: 0.8955048223336538
F1 Score (Weighted): 0.3806306306306307

Epoch 3
Training loss: 0.8872000842258848
Validation loss: 0.7881355186303457
F1 Score (Weighted): 0.6646459719316392

Epoch 4
Training loss: 0.7391687117773911
Validation loss: 0.7723020712534586
F1 Score (Weighted): 0.6300926487842375

Epoch 5
Training loss: 0.6648332908235747
Validation loss: 0.8688974877198538
F1 Score (Weighted): 0.6438859494415051

Epoch 6
Training loss: 0.5926435168447166
Validation loss: 0.7045249342918396
F1 Score (Weighted): 0.6772300469483569

Epoch 7
Training loss: 0.5283459671612444
Validation loss: 0.766220768292745
F1 Score (Weighted): 0.6726025954881676

Epoch 8
Training loss: 0.40335605421970633
Validation loss: 0.9514694611231486
F1 Score (Weighted): 0.6825051321619059

Epoch 9
Training loss: 0.30661046761890937
Validation loss: 1.219559907913208
F1 Score (Weighted): 0.6708798817669785

Epoch 10
Training loss: 0.26154837654582386
Validation loss: 1.1069376567999523
F1 Score (Weighted): 0.6592586948423582

Epoch 11
Training loss: 0.2033536323699458
Validation loss: 1.415563941001892
F1 Score (Weighted): 0.6766510558177226

Epoch 12
Training loss: 0.1745362190915079
Validation loss: 1.9261420369148254
F1 Score (Weighted): 0.6352421412113232

Epoch 13
Training loss: 0.1372369714723579
Validation loss: 1.5664267142613728
F1 Score (Weighted): 0.6682839046396163

Epoch 14
Training loss: 0.11499147065754595
Validation loss: 1.506350080172221
F1 Score (Weighted): 0.6551720389033632

Epoch 15
Training loss: 0.11106352379609799
Validation loss: 1.6428695718447368
F1 Score (Weighted): 0.6807521395655036

Epoch 16
Training loss: 0.11553412922336881
Validation loss: 1.7775602340698242
F1 Score (Weighted): 0.6498501642036126

Epoch 17
Training loss: 0.10220914248955147
Validation loss: 1.8201106190681458
F1 Score (Weighted): 0.6358866963484124

Epoch 18
Training loss: 0.08466668038404193
Validation loss: 1.9533601999282837
F1 Score (Weighted): 0.6800389510642967

Epoch 19
Training loss: 0.05690824959812493
Validation loss: 1.8015793164571126
F1 Score (Weighted): 0.6724480021893814

Epoch 20
Training loss: 0.051393256313970376
Validation loss: 1.8780896266301472
F1 Score (Weighted): 0.67947912302751

Epoch 21
Training loss: 0.049729566146009443
Validation loss: 2.1004692912101746
F1 Score (Weighted): 0.6609629358989972

Epoch 22
Training loss: 0.04718722517056198
Validation loss: 1.972888171672821
F1 Score (Weighted): 0.6985193561960867

Epoch 23
Training loss: 0.0437426020547844
Validation loss: 2.020647943019867
F1 Score (Weighted): 0.6714736441024497

Epoch 24
Training loss: 0.035648891668172616
Validation loss: 2.095287561416626
F1 Score (Weighted): 0.67214858580248

Epoch 25
Training loss: 0.033835522214287955
Validation loss: 2.073530375957489
F1 Score (Weighted): 0.6730248299743503
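
In the log above the training loss keeps falling while the validation loss rises after the first few epochs, which is the usual sign of overfitting on a small dataset. Since a checkpoint is saved after every epoch, one option is to pick the epoch to load from the logged validation metrics. The sketch below does this with the logged values rounded to four decimals; which criterion to prefer is a judgment call, not something the notebook prescribes.

```python
# Validation metrics per epoch, copied from the training log (rounded to 4 decimals)
val_loss = [0.9117, 0.8955, 0.7881, 0.7723, 0.8689, 0.7045, 0.7662, 0.9515,
            1.2196, 1.1069, 1.4156, 1.9261, 1.5664, 1.5064, 1.6429, 1.7776,
            1.8201, 1.9534, 1.8016, 1.8781, 2.1005, 1.9729, 2.0206, 2.0953, 2.0735]
val_f1 = [0.3806, 0.3806, 0.6646, 0.6301, 0.6439, 0.6772, 0.6726, 0.6825,
          0.6709, 0.6593, 0.6767, 0.6352, 0.6683, 0.6552, 0.6808, 0.6499,
          0.6359, 0.6800, 0.6724, 0.6795, 0.6610, 0.6985, 0.6715, 0.6721, 0.6730]

# Epochs are numbered from 1, list indices from 0
best_loss_epoch = min(range(len(val_loss)), key=lambda i: val_loss[i]) + 1
best_f1_epoch = max(range(len(val_f1)), key=lambda i: val_f1[i]) + 1

print(f'Lowest validation loss: epoch {best_loss_epoch}')
print(f'Highest weighted F1:    epoch {best_f1_epoch}')
print(f'Checkpoint file:        finetuned_XLNET_epoch_{best_loss_epoch}.model')
```

The two criteria disagree here, and the F1 differences between later epochs are small on a 168-item validation set, so they should be read with caution.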


>The performance of the model can vary from one run to the next because of the randomness in the training procedure; the optimizer is not guaranteed to reach a global optimum. To reduce the effect of this randomness, the model can be trained several times and the results averaged. You can go ahead and try this approach.
>
>Alternatively, a pretrained, well-performing model is provided here. The following cells download and load it; PyTorch imports the weights and biases from the saved file.
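
Averaging over several runs can be organised as a loop over seeds. The sketch below shows only the bookkeeping: `average_over_seeds` and `dummy_run` are hypothetical names, and `dummy_run` is a placeholder that returns a seeded pseudo-random number rather than a real model score. In practice it would seed all libraries, fine-tune the model and return the validation F1.

```python
import random
import statistics

def average_over_seeds(run_fn, seeds):
    """Call run_fn once per seed and return (mean, sample standard deviation)."""
    scores = [run_fn(seed) for seed in seeds]
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return statistics.mean(scores), spread

def dummy_run(seed):
    # Placeholder for "seed everything, fine-tune, return validation F1".
    # The value is a seeded pseudo-random number, NOT a real model score.
    return random.Random(seed).uniform(0.6, 0.7)

mean_score, spread = average_over_seeds(dummy_run, seeds=[17, 18, 19])
print(f'mean={mean_score:.4f}, std={spread:.4f}')
```

Reporting the spread next to the mean shows how much of a difference between models could be due to the seed alone.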

In [None]:
!pip install gdown



In [None]:
import gdown
url = 'https://drive.google.com/uc?id=1tftMIqAUw66hRwS8xCIoAIU6TKNOFqYF'
output = 'finetuned_XLNET_epoch_19.model'
gdown.download(url, output, quiet=False)

Downloading...
From: https://drive.google.com/uc?id=1tftMIqAUw66hRwS8xCIoAIU6TKNOFqYF
To: /content/finetuned_XLNET_epoch_19.model
100%|██████████| 469M/469M [00:03<00:00, 127MB/s]


'finetuned_XLNET_epoch_19.model'

In [None]:
# map_location=device lets the same line load the weights on a CPU or a GPU runtime
model.load_state_dict(torch.load('finetuned_XLNET_epoch_19.model', map_location=device))

<All keys matched successfully>

In [None]:
# evaluate the loaded model so that the metrics below are computed from its predictions
_, predictions, true_vals = evaluate(dataloader_validation)

In [None]:
f1_score_func(predictions, true_vals)

0.6730248299743503

In [None]:
accuracy_per_class(predictions, true_vals)

Class: 1
Accuracy: 82.41758241758241

Class: -1
Accuracy: 61.53846153846154

Class: 0
Accuracy: 8.333333333333332

Overall Accuracy: 69.04761904761905



As mentioned before, the result is very similar to BERT's performance and outperforms the statistical models. As expected with an imbalanced dataset, the model predicts positive news, the largest class, more accurately than the other classes.
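
The per-class figures reported above can be cross-checked with a few lines of arithmetic. The support values come from the validation table earlier in the notebook; the counts of correct predictions are an inference (75/91, 40/65 and 1/12 reproduce the reported percentages), not numbers printed by the notebook.

```python
# Validation items per sentiment class (from the group-by table)
support = {1: 91, -1: 65, 0: 12}
# Correct predictions per class, inferred from the reported percentages (assumption)
correct = {1: 75, -1: 40, 0: 1}

per_class = {c: 100 * correct[c] / support[c] for c in support}
overall = 100 * sum(correct.values()) / sum(support.values())
# Unweighted mean of the per-class accuracies: every class counts equally
macro = sum(per_class.values()) / len(per_class)

print(f'Overall accuracy: {overall:.2f}')
print(f'Mean per-class accuracy: {macro:.2f}')
```

The overall accuracy is dominated by the two large classes, while the unweighted mean is pulled down sharply by the neutral class, which makes the effect of the imbalance visible.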