<a href="https://colab.research.google.com/github/Logicus03/Bert-Sentiment-Analysis-/blob/master/TensorFlow2_BERT_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TensorFlow 2 - BERT: Tweet Sentiment Analysis

BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language model that can be fine-tuned to build strong models for a wide range of NLP tasks, such as question answering, sentiment analysis, and named entity recognition.

**Dataset**

The dataset is a collection of Portuguese-language tweets labelled by sentiment.
Please download it from [Kaggle](https://www.kaggle.com/dataset/4af304c0f797e3b08f22895d6a0dcf95eee4c37f7a20775c7a4ee2281c6ba2ff).

**Problem**

Each tweet carries one of three sentiment labels, encoded as 0, 1 and 2 in the `sentiment` column. The tweet sentiment analysis task is therefore a supervised multi-class (three-class) classification problem.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
# Install the required package
!pip install bert-for-tf2

Collecting bert-for-tf2
  Downloading https://files.pythonhosted.org/packages/35/5c/6439134ecd17b33fe0396fb0b7d6ce3c5a120c42a4516ba0e9a2d6e43b25/bert-for-tf2-0.14.4.tar.gz (40kB)
Collecting py-params>=0.9.6
  Downloading https://files.pythonhosted.org/packages/a4/bf/c1c70d5315a8677310ea10a41cfc41c5970d9b37c31f9c90d4ab98021fd1/py-params-0.9.7.tar.gz
Collecting params-flow>=0.8.0
  Downloading https://files.pythonhosted.org/packages/a9/95/ff49f5ebd501f142a6f0aaf42bcfd1c192dc54909d1d9eb84ab031d46056/params-flow-0.8.2.tar.gz
Building wheels for collected packages: bert-for-tf2, py-params, params-flow
  Building wheel for bert-for-tf2 (setup.py) ... done
  Created wheel for bert-for-tf2: filename=bert_for_tf2

In [3]:
# Import modules
import pandas as pd
import numpy as np
import bert
import tensorflow as tf
import tensorflow_hub as hub
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import  Model
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import ModelCheckpoint, TensorBoard
from tqdm import tqdm
import matplotlib.pyplot as plt

print("TensorFlow Version:",tf.__version__)
print("Hub version: ",hub.__version__)
pd.set_option('display.max_colwidth',1000)


TensorFlow Version: 2.2.0
Hub version:  0.8.0


## Data preprocessing

In [4]:
# Mount Google Drive
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:
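# Read the tweet dataset (semicolon-separated CSV) into a pandas DataFrame.
# Note: `error_bad_lines` was deprecated in pandas 1.3 and removed in pandas 2.0;
# newer versions use on_bad_lines="skip" instead.
df = pd.read_csv("/content/drive/My Drive/Colab Notebooks/0. Mestrado/train3n.csv", error_bad_lines=False, sep=';')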

In [6]:
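def get_treated_data(dataset, cols, cols_drop=None, col_to_change='sentiment', val_col_change=None):
    """Rename the columns, drop the unneeded ones and remove rows with missing values."""
    # Avoid mutable default arguments
    if cols_drop is None:
        cols_drop = []
    if val_col_change is None:
        val_col_change = {"Negativo": 0, "Positivo": 1}

    # 1. (Disabled) Load the data from disk
    # dataset = pd.read_csv(
    #     DATASET_PATH,
    #     engine="python",
    #     encoding="latin1"
    # )

    # 2. Rename columns
    dataset.columns = cols

    # 3. Drop columns that are not needed
    dataset.drop(cols_drop, axis=1, inplace=True)

    # 3.1 Drop every row that has at least one missing value
    # (a bare dataset.dropna() returns a new frame and leaves `dataset` unchanged)
    dataset.dropna(inplace=True)

    # 4. (Disabled) Convert sentiments from "Negativo/Positivo" to "0/1"
    # dataset.replace({col_to_change: val_col_change}, inplace=True)

    # Return the treated dataset
    return dataset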

In [7]:
df.head()

Unnamed: 0,id,tweet_text,tweet_date,sentiment,query_used
0,1050785521201541121,@Laranjito76 A pessoa certa para isso seria o vale e azevedo :),Fri Oct 12 16:29:25 +0000 2018,1,:)
1,1050785431955140608,"@behin_d_curtain Para mim, é precisamente o contrário :) Vem a chuva e vem a boa disposição :)",Fri Oct 12 16:29:04 +0000 2018,1,:)
2,1050785401248645120,Vou fazer um video hoje... estou pensando em falar um pouco sobre o novo meta do CSGO e sobre a pagina https://t.co/5RjhKnj0oh Alguem tem uma sugestao? Queria falar sobre algo do cenario nacional :D,Fri Oct 12 16:28:56 +0000 2018,1,:)
3,1050785370982547461,"aaaaaaaa amei tanto essas polaroids, nem sei expressar o quanto eu to apaixonada de vdd✨💖🎈🎉🎊 espero que outras pessoas consigam ganhar também :) https://t.co/pbIp7tRcSE",Fri Oct 12 16:28:49 +0000 2018,1,:)
4,1050785368902131713,"Valoriza o coração do menininho que vc tem. Ele é diferente. O faça sorrir e ter certeza disso ❤️ — Eu valorizo todo mundo na minha vida, não vai ser diferente com ele :)) https://t.co/5c7wlXQyz9",Fri Oct 12 16:28:49 +0000 2018,1,:)


In [8]:
default_cols = ["id", "text", "date", "sentiment", "query"]
default_drop_cols = ["id", "date", "query"]
# default_cols = ["sentiment", "text"];
# default_drop_cols = ["id", "date", "query"]

# class_names = ['Negativo', 'Positive']
df = get_treated_data(df, default_cols, cols_drop = default_drop_cols)

In [9]:
# Check the relative frequency of each sentiment class
df["sentiment"].value_counts(normalize=True)

1    0.33334
2    0.33333
0    0.33333
Name: sentiment, dtype: float64

In [10]:
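def preprocess_text(text):

    # Imported locally because these modules are only needed here
    from bs4 import BeautifulSoup
    import re

    text = BeautifulSoup(text, "lxml").get_text()  # Strip HTML markup
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)  # Remove lines that start with a URL
    text = re.sub(r"@[A-Za-z0-9]+", ' ', text)  # Remove @mentions (stops at "_", so part of such handles remains)
    text = re.sub(r"https?://[A-Za-z0-9./]+", ' ', text)  # Remove remaining URLs
    # Keep only ASCII letters and . ! ? '
    # Note: this also removes accented Portuguese letters (é, ã, ç, ...)
    text = re.sub(r"[^a-zA-Z.!?']", ' ', text)
    text = re.sub(r" +", ' ', text)  # Collapse repeated spaces

    return text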
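
The cleaned output of `preprocess_text` loses accented characters ("contrário" becomes "contr rio"), which may matter for a cased Portuguese model. The following is a minimal sketch of an accent-preserving variant, not the function used to produce the results in this notebook. The name `clean_tweet`, the Latin-1 character ranges, the final `strip()`, and the omission of the BeautifulSoup step are all assumptions made for illustration.

```python
import re

def clean_tweet(text):
    """Hypothetical variant of preprocess_text that keeps accented letters."""
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs anywhere in the text
    text = re.sub(r"@\w+", " ", text)           # remove @mentions, underscores included
    # keep ASCII letters, Latin-1 accented letters and . ! ? '
    text = re.sub(r"[^a-zA-ZÀ-ÖØ-öø-ÿ.!?']", " ", text)
    text = re.sub(r" +", " ", text)             # collapse repeated spaces
    return text.strip()

sample = "@behin_d_curtain Para mim, é precisamente o contrário :) https://t.co/abc"
print(clean_tweet(sample))  # Para mim é precisamente o contrário
```

Switching to a variant like this would change the training data, so the model would need to be retrained before the metrics could be compared.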

In [11]:
df['text'] = df['text'].apply(preprocess_text)

df.head(5)

  ' that document to Beautiful Soup.' % decoded_markup

Unnamed: 0,text,sentiment
0,A pessoa certa para isso seria o vale e azevedo,1
1,d curtain Para mim precisamente o contr rio Vem a chuva e vem a boa disposi o,1
2,Vou fazer um video hoje... estou pensando em falar um pouco sobre o novo meta do CSGO e sobre a pagina Alguem tem uma sugestao? Queria falar sobre algo do cenario nacional D,1
3,aaaaaaaa amei tanto essas polaroids nem sei expressar o quanto eu to apaixonada de vdd espero que outras pessoas consigam ganhar tamb m,1
4,Valoriza o cora o do menininho que vc tem. Ele diferente. O fa a sorrir e ter certeza disso Eu valorizo todo mundo na minha vida n o vai ser diferente com ele,1


In [12]:
print("The number of rows and columns in the dataset is: {}".format(df.shape))

The number of rows and columns in the dataset is: (100000, 2)


In [13]:
# Identify missing values
df.isnull().sum()

text         0
sentiment    0
dtype: int64

In [14]:
# Check the target class balance
df["sentiment"].value_counts(normalize=True)

1    0.33334
2    0.33333
0    0.33333
Name: sentiment, dtype: float64

**Download the pre-trained model and vocabulary**

In [15]:
!rm -rf bert-base-portuguese-cased
!mkdir bert-base-portuguese-cased
!wget https://neuralmind-ai.s3.us-east-2.amazonaws.com/nlp/bert-base-portuguese-cased/bert-base-portuguese-cased_pytorch_checkpoint.zip
!wget https://neuralmind-ai.s3.us-east-2.amazonaws.com/nlp/bert-base-portuguese-cased/vocab.txt 

!apt-get install unzip

!unzip bert-base-portuguese-cased_pytorch_checkpoint.zip -d bert-base-portuguese-cased
!mv vocab.txt bert-base-portuguese-cased/vocab.txt 
!pip install -U transformers

--2020-07-12 11:38:41--  https://neuralmind-ai.s3.us-east-2.amazonaws.com/nlp/bert-base-portuguese-cased/bert-base-portuguese-cased_pytorch_checkpoint.zip
Resolving neuralmind-ai.s3.us-east-2.amazonaws.com (neuralmind-ai.s3.us-east-2.amazonaws.com)... 52.219.96.8
Connecting to neuralmind-ai.s3.us-east-2.amazonaws.com (neuralmind-ai.s3.us-east-2.amazonaws.com)|52.219.96.8|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 406220891 (387M) [application/zip]
Saving to: ‘bert-base-portuguese-cased_pytorch_checkpoint.zip’


2020-07-12 11:38:46 (83.6 MB/s) - ‘bert-base-portuguese-cased_pytorch_checkpoint.zip’ saved [406220891/406220891]

--2020-07-12 11:38:56--  https://neuralmind-ai.s3.us-east-2.amazonaws.com/nlp/bert-base-portuguese-cased/vocab.txt
Resolving neuralmind-ai.s3.us-east-2.amazonaws.com (neuralmind-ai.s3.us-east-2.amazonaws.com)... 52.219.105.82
Connecting to neuralmind-ai.s3.us-east-2.amazonaws.com (neuralmind-ai.s3.us-east-2.amazonaws.com)|52.219.105.82

In [16]:
from transformers import BertTokenizer, BertConfig, TFBertModel
bert_model = TFBertModel.from_pretrained("bert-base-portuguese-cased", from_pt=True)


All PyTorch model weights were used when initializing TFBertModel.

Some weights or buffers of the PyTorch model TFBertModel were not initialized from the TF 2.0 model and are newly initialized: ['cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [17]:
# Functions for constructing BERT Embeddings: input_ids, input_masks, input_segments and Inputs
MAX_SEQ_LEN=500 # max sequence length

def get_masks(tokens):
    """Masks: 1 for real tokens and 0 for paddings"""
    return [1]*len(tokens) + [0] * (MAX_SEQ_LEN - len(tokens))
 
def get_segments(tokens):
    """Segments: 0 for the first sequence, 1 for the second"""  
    segments = []
    current_segment_id = 0
    for token in tokens:
        segments.append(current_segment_id)
        if token == "[SEP]":
            current_segment_id = 1
    return segments + [0] * (MAX_SEQ_LEN - len(tokens))

def get_ids(tokens, tokenizer):
    """Token ids from Tokenizer vocab"""
    token_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_ids = token_ids + [0] * (MAX_SEQ_LEN - len(token_ids))
    return input_ids

def create_single_input(sentence, tokenizer, max_len):
    """Create an input from a sentence"""
    stokens = tokenizer.tokenize(sentence)
    stokens = stokens[:max_len]
    stokens = ["[CLS]"] + stokens + ["[SEP]"]
 
    ids = get_ids(stokens, tokenizer)
    masks = get_masks(stokens)
    segments = get_segments(stokens)

    return ids, masks, segments
 
def convert_sentences_to_features(sentences, tokenizer):
    """Convert sentences to features: input_ids, input_masks and input_segments"""
    input_ids, input_masks, input_segments = [], [], []

    for sentence in tqdm(sentences, position=0, leave=True):
        # Reserve two positions for the [CLS] and [SEP] tokens
        ids, masks, segments = create_single_input(sentence, tokenizer, MAX_SEQ_LEN - 2)
        assert len(ids) == MAX_SEQ_LEN
        assert len(masks) == MAX_SEQ_LEN
        assert len(segments) == MAX_SEQ_LEN
        input_ids.append(ids)
        input_masks.append(masks)
        input_segments.append(segments)

    return [np.asarray(input_ids, dtype=np.int32),
            np.asarray(input_masks, dtype=np.int32),
            np.asarray(input_segments, dtype=np.int32)]

def create_tonkenizer(bert_layer):
    """Instantiate the tokenizer from the downloaded vocab.txt.

    `bert_layer` is unused here; only the TF Hub variant kept below for reference needs it.
    """
    # vocab_file=bert_layer.resolved_object.vocab_file.asset_path.numpy()
    # do_lower_case=bert_layer.resolved_object.do_lower_case.numpy() 
    # tokenizer=bert.bert_tokenization.FullTokenizer(vocab_file,do_lower_case)
    do_lower_case = False  # The model is cased, so keep the original casing
    tokenizer = BertTokenizer("bert-base-portuguese-cased/vocab.txt", do_lower_case=do_lower_case)
    return tokenizer
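
To make the three feature arrays concrete, here is a small worked sketch of what the helper functions above produce for one short sentence. It uses a toy vocabulary and a toy length of 8 instead of the real `vocab.txt` and `MAX_SEQ_LEN = 500`; `toy_vocab`, `toy_features` and `MAX_LEN` are hypothetical names introduced only for this illustration.

```python
MAX_LEN = 8  # toy value; the notebook uses MAX_SEQ_LEN = 500

# Hypothetical toy vocabulary; the real ids come from vocab.txt
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "eu": 3, "gosto": 4, "disso": 5}

def toy_features(words, max_len=MAX_LEN):
    """Build (ids, masks, segments) for a single sentence, padded to max_len."""
    # Truncate so that [CLS] and [SEP] still fit, then wrap the sentence
    tokens = ["[CLS]"] + words[: max_len - 2] + ["[SEP]"]
    pad = max_len - len(tokens)
    ids = [toy_vocab[t] for t in tokens] + [0] * pad   # 0 is the padding id
    masks = [1] * len(tokens) + [0] * pad              # 1 = real token, 0 = padding
    segments = [0] * max_len                           # a single sentence only uses segment 0
    return ids, masks, segments

ids, masks, segments = toy_features(["eu", "gosto", "disso"])
print(ids)       # [1, 3, 4, 5, 2, 0, 0, 0]
print(masks)     # [1, 1, 1, 1, 1, 0, 0, 0]
print(segments)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

All three lists have the same fixed length, which is what the `assert` statements in `convert_sentences_to_features` check.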

## Modelling

In [18]:
def nlp_model(callable_object):
    # Load the pre-trained BERT base model
    # bert_layer = hub.KerasLayer(handle=callable_object, trainable=True)  

    bert_layer = callable_object
   
    # BERT layer three inputs: ids, masks and segments
    input_ids = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="input_ids")           
    input_masks = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="input_masks")       
    input_segments = Input(shape=(MAX_SEQ_LEN,), dtype=tf.int32, name="segment_ids")
    
    inputs = [input_ids, input_masks, input_segments] # BERT inputs
    # Note: hub.KerasLayer returns the outputs in the opposite order, so with it use:
    # pooled_output, sequence_output = bert_layer(inputs)
    sequence_output, pooled_output = bert_layer(inputs) # BERT outputs 
    
    # Add a hidden layer
    x = Dense(units=768, activation='relu')(pooled_output)
    x = Dropout(0.3)(x)
 
    # Add output layer
    outputs = Dense(3, activation="softmax")(x)

    # Construct a new model
    model = Model(inputs=inputs, outputs=outputs, )
    return model




In [19]:
model = nlp_model(bert_model)
model.summary()


Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_ids (InputLayer)          [(None, 500)]        0                                            
__________________________________________________________________________________________________
input_masks (InputLayer)        [(None, 500)]        0                                            
__________________________________________________________________________________________________
segment_ids (InputLayer)        [(None, 500)]        0                                            
__________________________________________________________________________________________________
tf_bert_model (TFBertModel)     ((None, 500, 768), ( 108923136   input_ids[0][0]                  
                                                                 input_masks[0][0]            

## Model training

In [20]:
# Create examples for training and testing

df = df.sample(frac=1) # Shuffle the dataset
tokenizer = create_tonkenizer(model.layers[3])

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['sentiment'], 
    test_size=0.3, 
    stratify=df['sentiment'], 
    random_state=15 
    )

print("\nX_train: {}; \tX_test: {}".format(X_train.shape, X_test.shape))
print("\ny_train: \n{}, \n\ny_test: \n{}".format(y_train.value_counts(normalize=True), y_test.value_counts(normalize=True)))

X_train = convert_sentences_to_features(X_train, tokenizer)
X_test = convert_sentences_to_features(X_test, tokenizer)

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)


  0%|          | 0/70000 [00:00<?, ?it/s]


X_train: (70000,); 	X_test: (30000,)

y_train: 
1    0.333343
2    0.333329
0    0.333329
Name: sentiment, dtype: float64, 

y_test: 
2    0.333333
1    0.333333
0    0.333333
Name: sentiment, dtype: float64


100%|██████████| 70000/70000 [00:14<00:00, 4969.09it/s]
100%|██████████| 30000/30000 [00:06<00:00, 4993.97it/s]


In [21]:
y_train

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.]], dtype=float32)
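
The array above is the one-hot form of the integer labels. As a rough sketch of what that encoding looks like and how `argmax` reverses it (the step used later when evaluating predictions), here is a NumPy-only equivalent; the `labels` values are made up for the example.

```python
import numpy as np

labels = np.array([2, 2, 1, 0])                 # made-up integer class ids
one_hot = np.eye(3, dtype=np.float32)[labels]   # row k of the identity matrix encodes class k
recovered = one_hot.argmax(axis=1)              # argmax inverts the encoding

print(one_hot)
print(recovered)  # [2 2 1 0]
```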

In [22]:
# Callback that saves a checkpoint at the end of every epoch

checkpoint_path = "./sentiment_analysis_model"
ckpt = tf.train.Checkpoint(model=model)
ckpt_manager = tf.train.CheckpointManager(ckpt, checkpoint_path, max_to_keep=1)

class CustomCallback(tf.keras.callbacks.Callback):

    def on_epoch_end(self, epoch, logs=None):
        ckpt_manager.save()
        print("Checkpoint saved at {}.".format(checkpoint_path))

In [23]:
# Train the model
BATCH_SIZE = 10
EPOCHS = 2

# Use the Adam optimizer to minimize the categorical cross-entropy loss
opt = Adam(learning_rate=2e-5)

# The labels are one-hot encoded over three classes, so use categorical
# (not binary) cross-entropy and categorical accuracy
model.compile(optimizer=opt,
              loss='categorical_crossentropy',
              metrics=['categorical_accuracy']
              )

# Fit the data to the model
history = model.fit(X_train, y_train,
                    validation_data=(X_test, y_test),
                    epochs=EPOCHS,
                    batch_size=BATCH_SIZE,
                    verbose = 1,
                    callbacks=[CustomCallback()]
                    )


Epoch 1/2
Epoch 2/2


In [24]:
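def save_model(model, name, path, h5=False):
  '''
  Save `model` under `name`.

  model: the Keras model to save
  name:  output name (directory for SavedModel, file stem for HDF5)
  path:  currently unused; the model is written relative to the working directory
  h5:    if True, save a single HDF5 file instead of a SavedModel directory
  '''
  if h5:
    !pip install -q pyyaml h5py  # Required to save models in HDF5 format (notebook-only shell command)
    model.save("{}.h5".format(name))
  else:
    model.save(name)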


In [25]:
save_model(model, "sentiment_model", "trained_model")

Instructions for updating:
If using Keras pass *_constraint arguments to layers.
INFO:tensorflow:Assets written to: sentiment_model/assets


In [26]:
history.history

{'categorical_accuracy': [0.8448571562767029, 0.8967857360839844],
 'loss': [0.3415980637073517, 0.2433004528284073],
 'val_categorical_accuracy': [0.8707333207130432, 0.8638333082199097],
 'val_loss': [0.28664323687553406, 0.3480300009250641]}
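
One way to read these numbers is to find the epoch with the lowest validation loss and to compare training and validation accuracy after the last epoch. The sketch below does this with the values copied from the output above; `history_dict` is a hypothetical stand-in for `history.history`.

```python
# Values copied from the history.history output above
history_dict = {
    "categorical_accuracy": [0.8448571562767029, 0.8967857360839844],
    "loss": [0.3415980637073517, 0.2433004528284073],
    "val_categorical_accuracy": [0.8707333207130432, 0.8638333082199097],
    "val_loss": [0.28664323687553406, 0.3480300009250641],
}

# Epoch (1-based) with the lowest validation loss
val_loss = history_dict["val_loss"]
best_index = min(range(len(val_loss)), key=val_loss.__getitem__)
best_epoch = best_index + 1
print("Lowest validation loss at epoch", best_epoch)

# Gap between training and validation accuracy after the last epoch
gap = history_dict["categorical_accuracy"][-1] - history_dict["val_categorical_accuracy"][-1]
print("Train/validation accuracy gap after the last epoch: {:.4f}".format(gap))
```

Here the validation loss is lowest after the first epoch while the training loss keeps falling, a pattern that is often a sign of overfitting. Note that the checkpoint manager above keeps only the most recent checkpoint (`max_to_keep=1`).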

## Analysis of model performance

In [27]:
# # Load the pretrained nlp_model
# from tensorflow.keras.models import load_model
# new_model = load_model('test')
# new_model.summary



In [28]:
# Predict on test dataset
from sklearn.metrics import classification_report, confusion_matrix
pred_test = np.argmax(model.predict(X_test), axis=1)

In [29]:
print(classification_report(np.argmax(y_test,axis=1), pred_test))

              precision    recall  f1-score   support

           0       0.77      0.89      0.82     10000
           1       0.87      0.71      0.78     10000
           2       0.96      1.00      0.98     10000

    accuracy                           0.86     30000
   macro avg       0.87      0.86      0.86     30000
weighted avg       0.87      0.86      0.86     30000



In [30]:
print(pred_test[:40])
print( y_test[:40].argmax(1) )

[1 2 0 0 1 2 2 0 2 1 1 2 2 2 1 2 2 2 2 0 1 0 0 1 1 2 2 1 0 1 0 0 1 2 0 2 0
 0 0 1]
[1 2 0 0 1 2 2 0 2 0 1 2 2 2 1 2 2 2 2 0 0 1 0 1 1 2 2 1 0 1 1 0 1 2 0 2 0
 0 0 1]
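
As a worked illustration of how the figures in the classification report are derived, the sketch below builds a confusion matrix from the 40 predictions and labels printed above and computes accuracy, recall and precision per class. It uses plain Python rather than scikit-learn; the variable names are chosen for this example only.

```python
# First 40 predictions and true labels, copied from the output above
pred = [1, 2, 0, 0, 1, 2, 2, 0, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 0,
        1, 0, 0, 1, 1, 2, 2, 1, 0, 1, 0, 0, 1, 2, 0, 2, 0, 0, 0, 1]
true = [1, 2, 0, 0, 1, 2, 2, 0, 2, 0, 1, 2, 2, 2, 1, 2, 2, 2, 2, 0,
        0, 1, 0, 1, 1, 2, 2, 1, 0, 1, 1, 0, 1, 2, 0, 2, 0, 0, 0, 1]

n_classes = 3
# cm[t][p] counts samples whose true class is t and predicted class is p
cm = [[0] * n_classes for _ in range(n_classes)]
for t, p in zip(true, pred):
    cm[t][p] += 1

accuracy = sum(cm[k][k] for k in range(n_classes)) / len(true)
recall = [cm[k][k] / sum(cm[k]) for k in range(n_classes)]                    # row-wise
precision = [cm[k][k] / sum(row[k] for row in cm) for k in range(n_classes)]  # column-wise

print(cm)        # [[11, 2, 0], [2, 10, 0], [0, 0, 15]]
print(accuracy)  # 0.9
```

On this small sample all errors are confusions between classes 0 and 1, which is consistent with the lower scores for those two classes in the full report.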


## Predict

In [31]:
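def get_predictions(model_, sentence):
  # Build the BERT features for a single sentence (uses the global tokenizer)
  sentence_feature = convert_sentences_to_features([sentence], tokenizer)

  # Pick the class with the highest predicted probability
  prediction = np.argmax(model_.predict(sentence_feature), axis=1)

  # Map class ids to label names.
  # Note: df.head() above shows ":)" tweets labelled 1, so check this mapping
  # against the label encoding of the dataset.
  pred = ["Negativo" if x == 0 else "Positivo" if x == 2 else "Neutro" for x in prediction]

  return pred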

In [32]:
# Predict
get_predictions( model, "Aquele ator é ruim" )

100%|██████████| 1/1 [00:00<00:00, 1994.44it/s]


['Negativo']

In [33]:
get_predictions( model, "Eu gosto do seu sorriso" )

100%|██████████| 1/1 [00:00<00:00, 1916.08it/s]


['Neutro']