# Sentiment Analysis

For this exercise we will use a dataset of movie reviews, each labeled with the sentiment it expresses. The reviews are labeled with one of 5 classes:

1.   0: negative review
2.   1: somewhat negative review
3.   2: neutral review
4.   3: somewhat positive review
5.   4: positive review
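
As a small aid for reading predictions later, the label scheme above can be written as a lookup table. This is only a sketch: `SENTIMENT_NAMES` and `describe` are hypothetical names that do not appear in the notebook.

```python
# Hypothetical lookup table; the names mirror the five classes listed above.
SENTIMENT_NAMES = {
    0: "negative",
    1: "somewhat negative",
    2: "neutral",
    3: "somewhat positive",
    4: "positive",
}

def describe(label: int) -> str:
    """Return the readable name for a numeric sentiment label."""
    return SENTIMENT_NAMES[label]

print(describe(2))  # neutral
```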







In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.21.2-py3-none-any.whl (4.7 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.9.1-py3-none-any.whl (120 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.9.1 tokenizers-0.12.1 transformers-4.21.2


In [None]:
import numpy as np          # numpy: converts the dataset into arrays
from tqdm.auto import tqdm  # tqdm: progress bar while building the inputs
import tensorflow as tf     # tf: trains the model
import pandas as pd         # pandas: reads the input file
from transformers import BertTokenizer, TFBertModel
from sklearn.preprocessing import OneHotEncoder

In [None]:
data_path = "drive/MyDrive/BERT/data/"
weights_path = "drive/MyDrive/BERT/weights/"
file_name = "train.csv"

In [None]:
df = pd.read_csv(data_path + file_name, sep="\t")
df.head()

Unnamed: 0,PhraseId,SentenceId,Phrase,Sentiment
0,1,1,A series of escapades demonstrating the adage ...,1
1,2,1,A series of escapades demonstrating the adage ...,2
2,3,1,A series,2
3,4,1,A,2
4,5,1,series,2


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   PhraseId    156060 non-null  int64 
 1   SentenceId  156060 non-null  int64 
 2   Phrase      156060 non-null  object
 3   Sentiment   156060 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 4.8+ MB


In [None]:
sentiment = list(df["Sentiment"].unique())
sentiment

[1, 2, 3, 4, 0]

In [None]:
MAX_LENGTH = 256
BATCH = 16
OUTPUT = 5
PERC_TRAIN = 0.8
train_size = int((len(df) // BATCH) * PERC_TRAIN)
LR = 1e-5
BERT_MODEL = "bert-base-cased"
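
Note that `train_size` counts batches, not rows. The arithmetic can be checked by hand; the sketch below uses the 156060 rows reported by `df.info()` and assumes the dataset is batched with `drop_remainder=True`, as done later in the notebook. The lowercase variable names are illustrative.

```python
# Values taken from the notebook: 156060 rows, BATCH = 16, PERC_TRAIN = 0.8.
n_rows = 156060
batch = 16
perc_train = 0.8

n_batches = n_rows // batch                  # full batches only (12 rows are left over)
train_batches = int(n_batches * perc_train)  # same formula as train_size
val_batches = n_batches - train_batches      # what remains for validation

print(n_batches, train_batches, val_batches)  # 9753 7802 1951
```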

## Preprocessing

In [None]:
tz = BertTokenizer.from_pretrained(BERT_MODEL)


In [None]:
# Prepare two arrays: one for the input IDs and
# one for the corresponding attention mask
x_input_id = np.zeros((len(df), MAX_LENGTH))
x_attention_mask = np.zeros((len(df), MAX_LENGTH))

In [None]:
def preprocessing_dataset(df, ids, masks, tokenizer):
    for i, text in tqdm(enumerate(df['Phrase'])):
        tokenized_text = tokenizer.encode_plus(
            text=text,
            max_length=MAX_LENGTH, 
            truncation=True, 
            padding='max_length', 
            add_special_tokens=True,
            return_tensors='tf'
        )
        ids[i, :] = tokenized_text.input_ids
        masks[i, :] = tokenized_text.attention_mask
    return ids, masks
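
To make the effect of `truncation=True` and `padding='max_length'` concrete, here is a simplified sketch in plain NumPy. It is not the tokenizer's implementation: `pad_and_mask` is a hypothetical helper, the token IDs are illustrative, and a pad ID of 0 is assumed. Unlike the real tokenizer, this version does not keep the closing special token when it truncates.

```python
import numpy as np

def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad a list of token IDs and build its attention mask."""
    kept = list(token_ids)[:max_length]          # truncation
    n_pad = max_length - len(kept)               # how many pad positions are needed
    ids = np.array(kept + [pad_id] * n_pad, dtype=np.int32)
    mask = np.array([1] * len(kept) + [0] * n_pad, dtype=np.int32)  # 1 = real token
    return ids, mask

# Illustrative IDs only; real IDs come from the tokenizer vocabulary.
ids, mask = pad_and_mask([101, 1134, 102], max_length=6)
print(ids.tolist())   # [101, 1134, 102, 0, 0, 0]
print(mask.tolist())  # [1, 1, 1, 0, 0, 0]
```

The attention mask tells the model which positions hold real tokens, so the padding does not influence the result.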

In [None]:
x_input_id, x_attention_mask = preprocessing_dataset(df, 
                                                  x_input_id, 
                                                  x_attention_mask, 
                                                  tz)


In [None]:
labels = OneHotEncoder().fit_transform(df[["Sentiment"]]).toarray()
labels.shape

(156060, 5)

In [None]:
df.iloc[42]

PhraseId         43
SentenceId        1
Phrase        which
Sentiment         2
Name: 42, dtype: object

In [None]:
labels[42]

array([0., 0., 1., 0., 0.])
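
The same encoding can be reproduced with plain NumPy, which may help to see what `OneHotEncoder` produces here. This sketch relies on the labels being exactly the integers 0 to 4, so that each label can be used directly as a column index; `labels_int` holds illustrative values, not rows from the dataset.

```python
import numpy as np

labels_int = np.array([1, 2, 2, 4, 0])   # illustrative integer labels
one_hot = np.eye(5)[labels_int]          # row i of the identity matrix encodes class i

print(one_hot.shape)  # (5, 5)
print(one_hot[1])     # [0. 0. 1. 0. 0.]
```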

### Building the Dataset

In [None]:
def SentimentDataset(input_ids, attn_masks, labels):
    return {
        'input_ids': input_ids,
        'attention_mask': attn_masks
    }, labels

In [None]:
dataset = tf.data.Dataset.from_tensor_slices((x_input_id, x_attention_mask, labels))
dataset = dataset.map(SentimentDataset)
dataset.take(42)

<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(256,), dtype=tf.float64, name=None), 'attention_mask': TensorSpec(shape=(256,), dtype=tf.float64, name=None)}, TensorSpec(shape=(5,), dtype=tf.float64, name=None))>

In [None]:
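# Shuffle the dataset, then batch it. reshuffle_each_iteration=False keeps
# the order fixed across epochs, so the take/skip split below stays disjoint.
dataset = dataset.shuffle(10000, reshuffle_each_iteration=False).batch(BATCH, drop_remainder=True)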

In [None]:
# Create the training and validation datasets
train_dataset = dataset.take(train_size)
val_dataset = dataset.skip(train_size)
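
The split logic can be illustrated without TensorFlow. The sketch below mimics `batch(..., drop_remainder=True)` followed by `take` and `skip` on a toy list; the `batches` helper and the toy sizes are assumptions made for illustration only.

```python
def batches(items, batch_size):
    """Yield full batches only, mimicking batch(..., drop_remainder=True)."""
    for start in range(0, len(items) - batch_size + 1, batch_size):
        yield items[start:start + batch_size]

items = list(range(10))
all_batches = list(batches(items, batch_size=3))   # 3 full batches, item 9 is dropped
train_size = int(len(all_batches) * 0.8)           # int(2.4) == 2

train = all_batches[:train_size]    # plays the role of dataset.take(train_size)
val = all_batches[train_size:]      # plays the role of dataset.skip(train_size)

print(train)  # [[0, 1, 2], [3, 4, 5]]
print(val)    # [[6, 7, 8]]
```

The two parts stay disjoint only if the batches arrive in the same order on every pass. With `tf.data`, this means the shuffle must not reorder the data between iterations.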

In [None]:
train_dataset.take(42)

<TakeDataset element_spec=({'input_ids': TensorSpec(shape=(16, 256), dtype=tf.float64, name=None), 'attention_mask': TensorSpec(shape=(16, 256), dtype=tf.float64, name=None)}, TensorSpec(shape=(16, 5), dtype=tf.float64, name=None))>

### Building the Model

In [None]:
model = TFBertModel.from_pretrained(BERT_MODEL) 


Some layers from the model checkpoint at bert-base-cased were not used when initializing TFBertModel: ['mlm___cls', 'nsp___cls']
- This IS expected if you are initializing TFBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertModel were initialized from the model checkpoint at bert-base-cased.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions without further training.


In [None]:
input_ids = tf.keras.layers.Input(shape=(MAX_LENGTH,), 
                                  name='input_ids', 
                                  dtype='int32')

attn_masks = tf.keras.layers.Input(shape=(MAX_LENGTH,), 
                                   name='attention_mask', 
                                   dtype='int32')

# Index 1 of the BERT output is the pooled representation of the [CLS] token
bert_embds = model.bert(input_ids, 
                        attention_mask=attn_masks)[1]
 

intermediate_layer = tf.keras.layers.Dense(512, 
                                           activation='relu', 
                                           name='intermediate_layer')(bert_embds)


output_layer = tf.keras.layers.Dense(OUTPUT, 
                                     activation='softmax', 
                                     name='output_layer')(intermediate_layer)


sentiment_model = tf.keras.Model(inputs=[input_ids, attn_masks], 
                                 outputs=output_layer)


sentiment_model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 input_ids (InputLayer)         [(None, 256)]        0           []                               
                                                                                                  
 attention_mask (InputLayer)    [(None, 256)]        0           []                               
                                                                                                  
 bert (TFBertMainLayer)         TFBaseModelOutputWi  108310272   ['input_ids[0][0]',              
                                thPoolingAndCrossAt               'attention_mask[0][0]']         
                                tentions(last_hidde                                               
                                n_state=(None, 256,                                           
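
The classification head stacked on top of BERT is small enough to sketch in NumPy. The example below only illustrates the shapes and the two activations: the weights are random, the pooled vector is a random stand-in for the BERT output, and a hidden size of 768 (the `bert-base` size) is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, INTERMEDIATE, OUTPUT = 768, 512, 5   # 768 is the bert-base hidden size

pooled = rng.normal(size=(2, HIDDEN))        # stand-in for the pooled output, batch of 2

# Randomly initialized weights, for shape illustration only.
w1, b1 = rng.normal(scale=0.02, size=(HIDDEN, INTERMEDIATE)), np.zeros(INTERMEDIATE)
w2, b2 = rng.normal(scale=0.02, size=(INTERMEDIATE, OUTPUT)), np.zeros(OUTPUT)

hidden = np.maximum(pooled @ w1 + b1, 0.0)   # Dense(512, activation='relu')
logits = hidden @ w2 + b2                    # Dense(5) before the activation
logits = logits - logits.max(axis=1, keepdims=True)              # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax

print(probs.shape)  # (2, 5)
```

Each row of `probs` sums to 1 and holds one probability per sentiment class.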

In [None]:
sentiment_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=LR),
                        loss=tf.keras.losses.CategoricalCrossentropy(),
                        metrics=["accuracy"])
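
For a single example with one-hot target y and predicted probabilities p, categorical cross-entropy is the sum over classes of -y_c * log(p_c), which reduces to minus the log of the probability assigned to the true class. A worked sketch with made-up numbers:

```python
import numpy as np

y_true = np.array([0., 0., 1., 0., 0.])        # one-hot label for class 2
y_pred = np.array([0.1, 0.1, 0.6, 0.1, 0.1])   # made-up softmax output

loss = -np.sum(y_true * np.log(y_pred))        # only the true class contributes
print(round(float(loss), 4))  # 0.5108
```

A confident correct prediction drives the loss toward 0, while a low probability on the true class makes it grow quickly.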

In [None]:
train_data = sentiment_model.fit(train_dataset,
                                 validation_data=val_dataset,
                                 epochs=2)

Epoch 1/2
Epoch 2/2


In [None]:
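# Save the full fine-tuned model (BERT plus the classification head) into weights_path
sentiment_model.save_weights(weights_path + 'weights_2_epochs.h5')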