# Sentiment analysis with BERT transformers

We load the pre-trained BERT tokenizer and sequence classifier, along with InputExample and InputFeatures. We then build our model from the sequence classifier and our tokenizer from the BERT tokenizer.

In [1]:
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

model = TFBertForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

All model checkpoint layers were used when initializing TFBertForSequenceClassification.

Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Let's look at our model:

In [2]:
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________


We can either train the whole model or train only the final classifier.
The more layers are left unfrozen, the more memory is used (CPU or GPU) and the longer each epoch takes.

In [3]:
for lay in model.layers:
    if lay.name != 'classifier':
        lay.trainable=False
model.summary()

Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
bert (TFBertMainLayer)       multiple                  109482240 
_________________________________________________________________
dropout_37 (Dropout)         multiple                  0         
_________________________________________________________________
classifier (Dense)           multiple                  1538      
Total params: 109,483,778
Trainable params: 1,538
Non-trainable params: 109,482,240
_________________________________________________________________
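
The parameter counts in the two summaries can be checked by hand. The sketch below assumes a hidden size of 768 (the value for bert-base models) and 2 output labels; neither number is printed in the summary, so treat both as assumptions.

```python
# Assumptions: hidden size 768 (bert-base) and 2 output labels.
hidden_size = 768
num_labels = 2

# A Dense layer has one weight per (input, output) pair plus one bias per output.
classifier_params = hidden_size * num_labels + num_labels

bert_params = 109_482_240  # taken from the "bert" row of the summary
total_params = bert_params + classifier_params

print(classifier_params)  # trainable parameters once everything but the classifier is frozen
print(total_params)       # total parameters of the model
```

With everything but the classifier frozen, only the Dense layer's parameters are updated during training, which is why the trainable count drops so sharply.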


In [4]:

import tensorflow as tf
import pandas as pd
import numpy as np

## Data import

In [5]:

df = pd.read_csv('../data/Reviews.csv', usecols=['Text', 'Summary', 'Score'])
df.head()
df = df[df['Score'] != 3]
df['Sentiment'] = df['Score'].apply(lambda rating : 1 if rating > 3 else 0)
df = df.sample(2500).reset_index()  # subset to make the tutorial faster
index = df.index

df['random_number'] = np.random.rand(len(index))  # uniform draws in [0, 1), so the 0.8 threshold gives roughly an 80/20 split

train_full = df[df['random_number'] <= 0.8]
test_full = df[df['random_number'] > 0.8]
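
The labelling and the threshold-based split can be sketched with the standard library alone. This is an illustration on made-up scores, not the notebook's data; `label_sentiment` and `threshold_split` are hypothetical helper names. A 0.8 threshold gives roughly an 80/20 split only when the draws are uniform on [0, 1); with standard-normal draws about 79% of values fall at or below 0.8.

```python
import random

def label_sentiment(score):
    """Map a 1-5 star score to a binary label; 3-star reviews are dropped upstream."""
    return 1 if score > 3 else 0

def threshold_split(rows, train_fraction=0.8, seed=0):
    """Assign each row to train or test using an independent uniform draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() <= train_fraction else test).append(row)
    return train, test

scores = [5, 1, 4, 2, 5, 4, 1, 5, 2, 4] * 100  # made-up scores, no 3s
labels = [label_sentiment(s) for s in scores]
train_rows, test_rows = threshold_split(labels)
print(len(train_rows), len(test_rows))
```

Because each row is assigned independently, the split sizes vary from run to run unless a seed is fixed.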


Unlike bag-of-words models, we can now use the text column rather than the summary column, because instead of a sparse representation of word frequencies we have a fixed-length vector built from "tokens", that is, word pieces. It would be worth evaluating whether the text, the summary, or the two combined works best.
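
To illustrate what "word pieces" means, here is a minimal sketch of greedy longest-match-first splitting, the idea behind WordPiece tokenization. The vocabulary is a tiny made-up set, not the real BERT vocabulary, and `wordpiece_tokenize` is a hypothetical helper rather than a function from the transformers library.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry a "##" prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary, for illustration only.
toy_vocab = {"crack", "##er", "##s", "tasty", "un", "##believ", "##able"}

print(wordpiece_tokenize("crackers", toy_vocab))
print(wordpiece_tokenize("unbelievable", toy_vocab))
print(wordpiece_tokenize("shrimp", toy_vocab))
```

Splitting rare words into known pieces is what lets a fixed vocabulary cover long free-text reviews.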

In [6]:

train = train_full.filter(['Text', 'Sentiment']).reset_index(drop=True)
train.columns = ['DATA_COLUMN', 'LABEL_COLUMN']
train.head()

Unnamed: 0,DATA_COLUMN,LABEL_COLUMN
0,"My mom and I bought these some years back, fro...",1
1,These tasty shrimp crackers remind me of homet...,1
2,I got this tea because it is supposed to help ...,1
3,I received the pack of three flavors today and...,1
4,"I use this for cooking rice, veggies, soups, e...",1


In [7]:

test = test_full.filter(['Text', 'Sentiment']).reset_index(drop=True)
test.columns = ['DATA_COLUMN', 'LABEL_COLUMN']
test.head()

Unnamed: 0,DATA_COLUMN,LABEL_COLUMN
0,These crackers are my new favorite. The Parmes...,1
1,I have an 8 year old and a 2 year old. They ar...,1
2,Me and my kids are loving the aerogarden. The...,1
3,To be honest I wish I had more to say about th...,1
4,Excellent product and delivered fresh. Made of...,1


## Creating input sequences
We have two pandas DataFrame objects waiting to be converted into objects suitable for the BERT model. We will use the InputExample class, which helps us build sequences from our dataset. An InputExample can be created as follows:

In [8]:

InputExample(guid=None,
             text_a = "Hello, world",
             text_b = None,
             label = 1)

InputExample(guid=None, text_a='Hello, world', text_b=None, label=1)

## We will now create two main functions:

1 - convert_data_to_examples: takes our training and test datasets and converts each row into an InputExample object.
2 - convert_examples_to_tf_dataset: tokenizes the InputExample objects, builds the required input format from the tokenized objects and, finally, creates an input dataset that we can feed to the model.
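
To make the "required input format" concrete, the sketch below reproduces the truncation, padding and attention-mask logic in plain Python. It is an illustration, not the tokenizer's implementation: `build_features` is a hypothetical helper, the token ids in the examples are arbitrary placeholders, and the special-token ids (101, 102, 0) are assumed to match bert-base-uncased.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # assumed ids for [CLS], [SEP], [PAD]

def build_features(token_ids, max_length=8):
    """Add special tokens, truncate, then right-pad to max_length."""
    # Reserve two positions for [CLS] and [SEP].
    body = token_ids[: max_length - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)   # 1 marks a real token
    padding = max_length - len(input_ids)
    input_ids = input_ids + [PAD_ID] * padding
    attention_mask = attention_mask + [0] * padding  # 0 marks padding
    token_type_ids = [0] * max_length       # single-sentence input: all segment 0
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }

short_example = build_features([2001, 2002, 2003])         # shorter than max_length
long_example = build_features(list(range(1000, 1020)))     # longer than max_length

print(short_example)
print(long_example)
```

Every example ends up with the same length, which is what allows examples to be stacked into batches.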



In [9]:
def convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN): 
  train_InputExamples = train.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)

  validation_InputExamples = test.apply(lambda x: InputExample(guid=None, # Globally unique ID for bookkeeping, unused in this case
                                                          text_a = x[DATA_COLUMN], 
                                                          text_b = None,
                                                          label = x[LABEL_COLUMN]), axis = 1)
  
  return train_InputExamples, validation_InputExamples

def convert_examples_to_tf_dataset(examples, tokenizer, max_length=128):
    features = [] # -> will hold InputFeatures to be converted later

    for e in examples:
        # Documentation is really strong for this method, so please take a look at it
        input_dict = tokenizer.encode_plus(
            e.text_a,
            add_special_tokens=True,
            max_length=max_length, # truncates if len(s) > max_length
            return_token_type_ids=True,
            return_attention_mask=True,
            padding="max_length", # pads to the right by default; replaces the deprecated pad_to_max_length=True
            truncation=True
        )

        input_ids, token_type_ids, attention_mask = (input_dict["input_ids"],
            input_dict["token_type_ids"], input_dict['attention_mask'])

        features.append(
            InputFeatures(
                input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, label=e.label
            )
        )

    def gen():
        for f in features:
            yield (
                {
                    "input_ids": f.input_ids,
                    "attention_mask": f.attention_mask,
                    "token_type_ids": f.token_type_ids,
                },
                f.label,
            )

    return tf.data.Dataset.from_generator(
        gen,
        ({"input_ids": tf.int32, "attention_mask": tf.int32, "token_type_ids": tf.int32}, tf.int64),
        (
            {
                "input_ids": tf.TensorShape([None]),
                "attention_mask": tf.TensorShape([None]),
                "token_type_ids": tf.TensorShape([None]),
            },
            tf.TensorShape([]),
        ),
    )


DATA_COLUMN = 'DATA_COLUMN'
LABEL_COLUMN = 'LABEL_COLUMN'

We can call the functions we created above with the following lines:

In [10]:

train_InputExamples, validation_InputExamples = convert_data_to_examples(train, test, DATA_COLUMN, LABEL_COLUMN)

train_data = convert_examples_to_tf_dataset(list(train_InputExamples), tokenizer)
train_data = train_data.shuffle(100).batch(16)

validation_data = convert_examples_to_tf_dataset(list(validation_InputExamples), tokenizer)
validation_data = validation_data.batch(16)



Our dataset containing the processed input sequences is now ready to be fed to the model.
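
The `shuffle(100).batch(16)` step can be approximated in plain Python to show what it does to the stream of examples. This is a simplified sketch, not the tf.data implementation: `buffered_shuffle` and `batch` are hypothetical helpers, and the 250 items stand in for tokenized examples.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        buffer.append(item)
        if len(buffer) > buffer_size:
            # Emit a random element once the buffer is over capacity.
            yield buffer.pop(rng.randrange(len(buffer)))
    # Drain what is left at the end of the stream.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

def batch(items, batch_size):
    """Group items into lists of batch_size; the last batch may be smaller."""
    current = []
    for item in items:
        current.append(item)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current

batches = list(batch(buffered_shuffle(range(250), buffer_size=100), batch_size=16))
print(len(batches), len(batches[-1]))
```

A buffer smaller than the dataset only shuffles locally: early batches can contain only items from the start of the stream. A buffer at least as large as the training set gives a full shuffle.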

## Configuring the BERT model and fine-tuning

We will use Adam as the optimizer, SparseCategoricalCrossentropy as the loss function and SparseCategoricalAccuracy as the accuracy metric.
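
As a rough illustration of what a sparse categorical cross-entropy computed from logits amounts to for a single example, here is a sketch in plain Python. The logits are made up, and the function is a hypothetical stand-in, not the Keras implementation.

```python
import math

def sparse_categorical_crossentropy_from_logits(logits, label):
    """Cross-entropy for one example: -log(softmax(logits)[label])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[label]

example_logits = [-1.0, 2.0]  # made-up scores for (Negative, Positive)

loss_correct = sparse_categorical_crossentropy_from_logits(example_logits, 1)
loss_wrong = sparse_categorical_crossentropy_from_logits(example_logits, 0)

print(round(loss_correct, 4), round(loss_wrong, 4))
```

"Sparse" here means the label is an integer class index (0 or 1) rather than a one-hot vector, which matches the integer Sentiment column built earlier.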

In [11]:

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0), 
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), 
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

model.fit(train_data, epochs=2, validation_data=validation_data)

Epoch 1/2
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
Epoch 2/2


<tensorflow.python.keras.callbacks.History at 0x1e011825af0>

In [12]:

pred_sentences = [test['DATA_COLUMN'][0]]
pred_sentences


["These crackers are my new favorite. The Parmesan really is the best. The sharp cheddar that comes mixed in the box is really good, too. I used to think the white cheddar was the best. Now, it's an 'old reliable'. I keep wondering which of these flavors would be best crumbled in the food processor, or blender to be used in meatballs, or meatloaf. Cheez-Its, made into crumbs, would be good on baked, or fried chicken, or fish, or on top of your favorite casserole dish. Really, try these, they're delicious."]

In [13]:

tf_batch = tokenizer(pred_sentences, max_length=128, padding=True, truncation=True, return_tensors='tf')
tf_outputs = model(tf_batch)
tf_predictions = tf.nn.softmax(tf_outputs[0], axis=-1)
labels = ['Negative','Positive']
label = tf.argmax(tf_predictions, axis=1)
label = label.numpy()
for i in range(len(pred_sentences)):
  print(pred_sentences[i], ": \n", labels[label[i]])

These crackers are my new favorite. The Parmesan really is the best. The sharp cheddar that comes mixed in the box is really good, too. I used to think the white cheddar was the best. Now, it's an 'old reliable'. I keep wondering which of these flavors would be best crumbled in the food processor, or blender to be used in meatballs, or meatloaf. Cheez-Its, made into crumbs, would be good on baked, or fried chicken, or fish, or on top of your favorite casserole dish. Really, try these, they're delicious. : 
 Positive
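
The post-processing in the last cell (softmax, then argmax, then label lookup) can be sketched without TensorFlow. The logits below are hypothetical values, not outputs of the fine-tuned model.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["Negative", "Positive"]
example_logits = [[-1.2, 2.3], [0.9, -0.4]]  # hypothetical model outputs

predictions = []
for logits in example_logits:
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
    predictions.append(labels[predicted])
    print(labels[predicted], [round(p, 3) for p in probs])
```

Since softmax preserves the ordering of the logits, the argmax could be taken on the logits directly; the softmax is useful when the probabilities themselves are of interest.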
