# **Sentiment Analysis**

Analyze text consistently and quickly using machine learning.

Besides the usual packages, this notebook uses Hugging Face's pytorch_pretrained_bert package. It is by far the easiest way to implement and manipulate Google's state-of-the-art BERT algorithm.

The structure of our model is partly based on the structure proposed by Michel Kana in [this article](https://towardsdatascience.com/bert-for-dummies-step-by-step-tutorial-fb90890ffe03).

In [0]:
#!pip install pytorch-pretrained-bert pytorch-nlp

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import string
import io

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

import torch
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
from tensorflow.keras.preprocessing.sequence import pad_sequences
from pytorch_pretrained_bert import BertTokenizer, BertConfig
from pytorch_pretrained_bert import BertAdam, BertForSequenceClassification
from tqdm import tqdm, trange

In [0]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
torch.cuda.get_device_name(0)

'Tesla P100-PCIE-16GB'

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [0]:
path = '/content/gdrive/My Drive'
file_name = r"training.1600000.processed.noemoticon.csv"
file = os.path.join(path, file_name)

column_names = ['Sentiment', 'ID', 'Date', 'Query', 'Username', 'Text']

df = pd.read_csv(file, header=None, names=column_names, encoding = "ISO-8859-1")

print(df['Sentiment'].unique())

df.head()

[0 4]


| | Sentiment | ID | Date | Query | Username | Text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | _TheSpecialOne_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [0]:
mask = df["Sentiment"] == 4

df.loc[mask, "Sentiment"] = 1

In [0]:
Text = df['Text'].copy()
Sentiment = df["Sentiment"].copy()

Sentiment = Sentiment.to_numpy()

In [0]:
X_train, X_test, Y_train, Y_test = train_test_split(Text, Sentiment, test_size=0.1, random_state=42)
X_cv, X_test, Y_cv, Y_test = train_test_split(X_test, Y_test, test_size=0.5, random_state=42)

print("{} records in train, {} records in cv, {} records in test".format(len(X_train), len(X_cv), len(X_test)))

1440000 records in train, 80000 records in cv, 80000 records in test
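
The two `train_test_split` calls above first hold out 10% of the tweets and then halve that holdout into a cross-validation set and a test set. The following is a minimal pure-Python sketch of the same idea, not scikit-learn's implementation; the helper name `three_way_split` and its parameters are made up for illustration.

```python
import random

def three_way_split(items, holdout_fraction=0.1, seed=42):
    """Shuffle items, hold out a fraction, and halve the holdout into cv and test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    n_cv = n_holdout // 2
    holdout, train = shuffled[:n_holdout], shuffled[n_holdout:]
    return train, holdout[:n_cv], holdout[n_cv:]

train, cv, test = three_way_split(range(1000))
print(len(train), len(cv), len(test))  # 900 50 50
```

Applied to the 1,600,000 tweets, the same proportions give the 1,440,000 / 80,000 / 80,000 split reported above.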


In [0]:
X_train = X_train.str.translate(str.maketrans('', '', string.punctuation))
X_cv = X_cv.str.translate(str.maketrans('', '', string.punctuation))
X_test = X_test.str.translate(str.maketrans('', '', string.punctuation))
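
To make the effect of the punctuation removal concrete, here is a small sketch using only the standard library. It applies the same translation table to the start of the first tweet shown earlier; the names `PUNCT_TABLE` and `strip_punctuation` are illustrative and do not appear in the notebook.

```python
import string

# Translation table that deletes every character in string.punctuation
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    """Remove all ASCII punctuation characters from text."""
    return text.translate(PUNCT_TABLE)

tweet = "@switchfoot http://twitpic.com/2y1zl - Awww"
cleaned = strip_punctuation(tweet)
print(cleaned)  # switchfoot httptwitpiccom2y1zl  Awww
```

Note the side effects: mentions lose their `@`, a URL collapses into one long token, and a removed symbol that stood between spaces leaves a double space behind.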

In [0]:
X_train = X_train.astype(str)
X_cv = X_cv.astype(str)
X_test = X_test.astype(str)

# Model

In [0]:
# Add BERT's special tokens: [CLS] marks the start and [SEP] the end of a sentence.
sentences = ['[CLS] ' + sentence + ' [SEP]' for sentence in list(X_train)]
print(sentences[1])

[CLS] ethanonly its during the uni exam period   asot400 [SEP]


In [0]:
# Instead of a regular tokenizer, we initialize the pretrained BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
tokenized_texts = [tokenizer.tokenize(sent) for sent in sentences]
print ("Tokenize the first sentence:")
print (tokenized_texts[0])

100%|██████████| 231508/231508 [00:00<00:00, 1236098.79B/s]


Tokenize the first sentence:
['[CLS]', 'just', 'saw', 'your', 'picture', 'and', 'my', 'heart', 'melted', '[SEP]']
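
The sentence above happens to consist of whole-word tokens only. Words that are missing from the vocabulary are split by BERT's tokenizer into sub-word pieces, with continuation pieces prefixed by `##`. The sketch below illustrates the greedy longest-match-first idea behind that splitting. It is a toy version, not the library's code: the function `wordpiece` and the four-entry `TOY_VOCAB` are hypothetical, and the toy vocabulary deliberately lacks `melted` even though the real one contains it.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

TOY_VOCAB = {"heart", "melt", "##ed", "##ing"}
print(wordpiece("heart", TOY_VOCAB))   # ['heart']
print(wordpiece("melted", TOY_VOCAB))  # ['melt', '##ed']
print(wordpiece("xyz", TOY_VOCAB))     # ['[UNK]']
```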


In [0]:
# Set the maximum sentence length (in tokens).
MAX_LEN = 50
# Use the BERT tokenizer to convert the tokens to their indices in the BERT vocabulary
input_ids = pad_sequences([tokenizer.convert_tokens_to_ids(txt) for txt in tokenized_texts],
                          maxlen=MAX_LEN, dtype="long", truncating="post", padding="post")

input_ids

array([[  101,  2074,  2387, ...,     0,     0,     0],
       [  101,  6066,  2239, ...,     0,     0,     0],
       [  101,  9389, 10020, ...,     0,     0,     0],
       ...,
       [  101,  7098, 16650, ...,     0,     0,     0],
       [  101,  4060,  2378, ...,     0,     0,     0],
       [  101,  2188,  5702, ...,     0,     0,     0]])
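
What `padding="post"` and `truncating="post"` do to a single sequence can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the Keras implementation: `pad_or_truncate` is a made-up helper, 101 and 102 are the ids of `[CLS]` and `[SEP]` in `bert-base-uncased`, and the other ids are placeholders.

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Cut ids to max_len from the end, then right-pad with pad_id."""
    clipped = list(ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))

short_ids = [101, 2074, 2387, 102]
long_ids = [101, 2074, 2387, 2115, 3861, 102]
print(pad_or_truncate(short_ids, 5))  # [101, 2074, 2387, 102, 0]
print(pad_or_truncate(long_ids, 5))   # [101, 2074, 2387, 2115, 3861]
```

The second call shows a consequence worth knowing: cutting from the end drops the closing `[SEP]` token of any sequence longer than the maximum length.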

In [0]:
# Build the attention masks
attention_masks = []
# Create a mask with 1 for every real token, followed by 0 for padding
for seq in input_ids:
  seq_mask = [float(i>0) for i in seq]
  attention_masks.append(seq_mask)
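
A worked example on one padded sequence shows what the mask looks like. The helper `attention_mask` below is illustrative and not part of the notebook; it relies on the same assumption as the loop above, namely that the padding id is 0 and every real token has a positive id.

```python
def attention_mask(padded_ids, pad_id=0):
    """Return 1.0 for real tokens and 0.0 for padding positions."""
    return [float(token_id != pad_id) for token_id in padded_ids]

padded = [101, 2074, 2387, 102, 0, 0]
mask = attention_mask(padded)
print(mask)            # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]
print(int(sum(mask)))  # 4
```

The sum of a mask equals the number of real tokens in that sequence, which is a quick sanity check on the padding.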

In [0]:
# Use train_test_split to split the data into training and validation sets
train_inputs, validation_inputs, train_labels, validation_labels = train_test_split(input_ids, Y_train, 
                                                            random_state=42, test_size=0.1)
train_masks, validation_masks, _, _ = train_test_split(attention_masks, input_ids,
                                             random_state=42, test_size=0.1)
                                             
# Convert all data to torch tensors
train_inputs = torch.tensor(train_inputs)
validation_inputs = torch.tensor(validation_inputs)
train_labels = torch.tensor(train_labels)
validation_labels = torch.tensor(validation_labels)
train_masks = torch.tensor(train_masks)
validation_masks = torch.tensor(validation_masks)

# Choose the batch size for training.
batch_size = 32

# Build an iterator over the data with torch DataLoader
train_data = TensorDataset(train_inputs, train_masks, train_labels)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)
validation_data = TensorDataset(validation_inputs, validation_masks, validation_labels)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=batch_size)
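
The number of batches per epoch follows directly from the figures above. A `DataLoader` with the default `drop_last=False` yields the number of samples divided by the batch size, rounded up. The short calculation below uses the 1,440,000 training tweets and the `test_size=0.1` split from this cell; the variable names are chosen for illustration.

```python
import math

n_train_tweets = 1_440_000              # size of X_train reported earlier
n_validation = n_train_tweets // 10     # test_size=0.1
n_training = n_train_tweets - n_validation
batch_size = 32

train_batches = math.ceil(n_training / batch_size)
validation_batches = math.ceil(n_validation / batch_size)
print(train_batches, validation_batches)  # 40500 4500
```

One epoch therefore consists of 40,500 training batches and 4,500 validation batches.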

In [0]:
# Instantiate the pretrained model and display its structure
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.cuda()

100%|██████████| 407873900/407873900 [00:11<00:00, 36299067.35B/s]


BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): BertLayerNorm()
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): BertLayerNorm()
              (dropout): Dropout(p=0.1, inplace=False)
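
The dimensions in the printout are enough to count the parameters of the parts it shows. The arithmetic below is a sketch that assumes each `BertLayerNorm` holds one scale and one shift vector of size 768, which the printout itself does not show; all variable names are illustrative.

```python
hidden = 768
vocab_size, max_positions, token_types = 30522, 512, 2

def linear_params(n_in, n_out):
    """Weights plus biases of a Linear layer with bias=True."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # assumed: one scale and one shift vector

# Three embedding tables plus the LayerNorm that follows them
embeddings = (vocab_size + max_positions + token_types) * hidden + layer_norm

# Query, key and value projections of one self-attention block
qkv_projections = 3 * linear_params(hidden, hidden)

print(embeddings)       # 23837184
print(qkv_projections)  # 1771776
```

The word embedding table alone accounts for most of the roughly 23.8 million embedding parameters.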
   

In [0]:
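# BERT fine-tuning parameters
param_optimizer = list(model.named_parameters())
# Parameters whose names contain one of these substrings get no weight decay
no_decay = ['bias', 'gamma', 'beta']
# Note: check that 'weight_decay_rate' is the group key your installed
# pytorch_pretrained_bert release reads; later releases use 'weight_decay'
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay_rate': 0.0}
]

# Note: without t_total the warmup schedule is not applied (see the warning in the output)
optimizer = BertAdam(optimizer_grouped_parameters,
                     lr=2e-5,
                     warmup=.1)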

# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)
  
# Store our loss and accuracy for plotting
train_loss_set = []
# Number of training epochs 
epochs = 4

# BERT training loop
for _ in trange(epochs, desc="Epoch"):  
  
  ## TRAINING
  
  # Put the model in training mode, which enables training-time behavior such as dropout.
  model.train()  
  # Tracking variables
  tr_loss = 0
  nb_tr_examples, nb_tr_steps = 0, 0
  # Train the model on the data for one epoch
  for step, batch in enumerate(train_dataloader):
    # Print progress every 5000 batches (the hard-coded 45,000 is approximate;
    # len(train_dataloader) gives the exact number of batches)
    if step % 5000 == 0:
      print("Batch {} of 45,000 processed".format(step))
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the tensors from the dataloader built earlier
    b_input_ids, b_input_mask, b_labels = batch
    # Reset the gradients to zero.
    optimizer.zero_grad()
    # Forward prop
    loss = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)
    # Record the loss every 200 batches for plotting
    if step % 200 == 0:
      train_loss_set.append(loss.item())
    # Backward prop
    loss.backward()
    # Update parameters
    optimizer.step()
    # Update tracking variables
    tr_loss += loss.item()
    nb_tr_examples += b_input_ids.size(0)
    nb_tr_steps += 1
  print("Train loss: {}".format(tr_loss/nb_tr_steps))
       
  ## VALIDATION

  # Put the model in evaluation mode
  model.eval()
  # Tracking variables 
  eval_loss, eval_accuracy = 0, 0
  nb_eval_steps, nb_eval_examples = 0, 0
  # Compute the evaluation metrics for one epoch
  for batch in validation_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the tensors from the dataloader built earlier
    b_input_ids, b_input_mask, b_labels = batch
    # Tell PyTorch not to compute or store gradients, which saves memory
    with torch.no_grad():
      # Forward pass, compute logit predictions
      logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask)
    # Move logits and labels to the CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    tmp_eval_accuracy = flat_accuracy(logits, label_ids)    
    eval_accuracy += tmp_eval_accuracy
    nb_eval_steps += 1
  print("Validation Accuracy: {}".format(eval_accuracy/nb_eval_steps))

# Plot training performance
plt.figure(figsize=(15,8))
plt.title("Training loss")
plt.xlabel("Batch")
plt.ylabel("Loss")
plt.plot(train_loss_set)
plt.show()

t_total value of -1 results in schedule not being applied
Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Batch 0 of 45,000 processed
Batch 5000 of 45,000 processed
Batch 10000 of 45,000 processed
Batch 15000 of 45,000 processed
Batch 20000 of 45,000 processed
Batch 25000 of 45,000 processed
Batch 30000 of 45,000 processed
Batch 35000 of 45,000 processed
Batch 40000 of 45,000 processed
Train loss: 0.35261475471342785


Epoch:  25%|██▌       | 1/4 [2:44:22<8:13:06, 9862.28s/it]

Validation Accuracy: 0.8597569444444444
Batch 0 of 45,000 processed
Batch 5000 of 45,000 processed
Batch 10000 of 45,000 processed
Batch 15000 of 45,000 processed
Batch 20000 of 45,000 processed
Batch 25000 of 45,000 processed
Batch 30000 of 45,000 processed
Batch 35000 of 45,000 processed
Batch 40000 of 45,000 processed
Train loss: 0.29640969763549024


Epoch:  50%|█████     | 2/4 [5:28:48<5:28:46, 9863.46s/it]

Validation Accuracy: 0.8617708333333334
Batch 0 of 45,000 processed
Batch 5000 of 45,000 processed
Batch 10000 of 45,000 processed
Batch 15000 of 45,000 processed
Batch 20000 of 45,000 processed
Batch 25000 of 45,000 processed
Batch 30000 of 45,000 processed
Batch 35000 of 45,000 processed
Batch 40000 of 45,000 processed
Train loss: 0.24923814755521806


Epoch:  75%|███████▌  | 3/4 [8:14:07<2:44:40, 9880.19s/it]

Validation Accuracy: 0.8570486111111111
Batch 0 of 45,000 processed
Batch 5000 of 45,000 processed
Batch 10000 of 45,000 processed
Batch 15000 of 45,000 processed
Batch 20000 of 45,000 processed
