<a href="https://colab.research.google.com/github/DRSLima/sentiment-analysis-twitter/blob/master/HandsOn_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
!pip install transformers



In [0]:
import pandas as pd
import numpy as np
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

df_train = pd.read_csv('https://raw.githubusercontent.com/jpvmm/FASAM_NLP/master/train_sample.csv')

# Use a smaller subset, since the notebook's memory cannot handle the full dataset.
df_train = df_train[0:10000]


df_test = pd.read_csv('https://raw.githubusercontent.com/jpvmm/FASAM_NLP/master/test_sample.csv')

df_test = df_test[0:500]


# **Getting Started**


---
**Question**: How can we work with two different languages? What strategy could be adopted using BERT?

## **Using DistilBERT**

We will use DistilBERT from the Hugging Face Transformers library.

Here we use the pretrained multilingual model.

In [0]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-multilingual-cased')

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
modelo = model_class.from_pretrained(pretrained_weights)

### **Cleaning and Tokenization**

In [0]:
x = df_train['title']
y = df_train['category']

In [0]:
# Data cleaning

import re
import string
import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords

def limpaTexto(texto):

  tokens = texto.split()

  # Regex that matches punctuation characters
  re_puc = re.compile('[%s]' % re.escape(string.punctuation))
  # Remove punctuation
  tokens = [re_puc.sub('', w) for w in tokens]
  # Remove non-alphabetic tokens
  tokens = [word for word in tokens if word.isalpha()]
  # Remove stopwords (Portuguese and Spanish)
  stop_words1 = set(stopwords.words('portuguese'))
  stop_words2 = set(stopwords.words('spanish'))

  tokens = [w for w in tokens if w not in stop_words1]
  tokens = [w for w in tokens if w not in stop_words2]
  # Drop single-character tokens
  tokens = [word for word in tokens if len(word) > 1]
  tokens = ' '.join(tokens)

  return tokens

x = x.apply(limpaTexto)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
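
One detail of `limpaTexto` worth keeping in mind: the stopword comparison is case-sensitive, and the NLTK lists are, as far as I know, stored in lowercase, so a capitalized stopword such as `De` would pass through. The sketch below shows a case-insensitive variant. The tiny `STOP_WORDS` set, the `remove_stopwords` helper, and the sample title are made up for illustration and are not part of the notebook.

```python
# Hypothetical stand-in for the NLTK Portuguese/Spanish lists (lowercase entries).
STOP_WORDS = {"de", "para", "com", "el", "la", "y"}

def remove_stopwords(text, stop_words=STOP_WORDS, case_sensitive=False):
    """Drop stopwords from a whitespace-separated string, keeping the original casing."""
    kept = []
    for word in text.split():
        # Compare in lowercase unless a case-sensitive match is requested.
        key = word if case_sensitive else word.lower()
        if key not in stop_words:
            kept.append(word)
    return " ".join(kept)

title = "Capa De Chuva Para Moto"  # made-up product title
print(remove_stopwords(title, case_sensitive=True))   # capitalized stopwords survive
print(remove_stopwords(title, case_sensitive=False))  # stopwords are removed
```

Whether removing stopwords helps at all with a pretrained BERT tokenizer is a separate question; the sketch only addresses the casing issue.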


In [0]:
# Visualizing how the BERT tokenizer splits sentences
# WordPiece tokenization

for i in range(3):
  print(x[i], '---->', tokenizer.tokenize(x[i]))

Microscópio Biológico Binocular Meopta Profissional ----> ['Micro', '##sc', '##ó', '##pio', 'Biol', '##ógico', 'Bin', '##oc', '##ular', 'Me', '##op', '##ta', 'Prof', '##ission', '##al']
Nissan Versa ----> ['Nissan', 'Vers', '##a']
Llave Contacto Yamaha Fz Tapa Tanque Mpr ----> ['L', '##lave', 'Contact', '##o', 'Yamaha', 'F', '##z', 'Ta', '##pa', 'Tan', '##que', 'M', '##pr']
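
The `##` prefixes in the output above mark pieces that continue a word. A simplified sketch of the greedy longest-match-first idea behind WordPiece splitting follows. It assumes a toy vocabulary (`toy_vocab`) picked only to reproduce two of the splits shown above; the real tokenizer also normalizes text and uses a vocabulary of many thousands of entries.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into word pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is found in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Toy vocabulary (an assumption for this example only).
toy_vocab = {"Nissan", "Vers", "##a", "L", "##lave"}

print(wordpiece("Versa", toy_vocab))   # ['Vers', '##a']
print(wordpiece("Llave", toy_vocab))   # ['L', '##lave']
print(wordpiece("Yamaha", toy_vocab))  # ['[UNK]']
```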


In [0]:
tokenized = x.apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

for i in range(3):
  print(x[i],'--->', tokenized[i])

Microscópio Biológico Binocular Meopta Profissional ---> [101, 78857, 31505, 10443, 22196, 35992, 42151, 50754, 25125, 18062, 11589, 13362, 10213, 24864, 58334, 10415, 102]
Nissan Versa ---> [101, 41650, 46744, 10113, 102]
Llave Contacto Yamaha Fz Tapa Tanque Mpr ---> [101, 149, 57782, 77562, 10133, 56988, 143, 10305, 14248, 11359, 30594, 11189, 150, 52302, 102]


In [0]:
# Create the padding
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
padded.shape

(10000, 33)
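
The padding step can be checked on a small scale. The sketch below wraps the same logic in a hypothetical `pad_sequences` helper and applies it to two short id lists. The first list is the `Nissan Versa` encoding shown earlier; the second is made up.

```python
import numpy as np

def pad_sequences(sequences, pad_id=0):
    """Right-pad every id list with pad_id up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return np.array([list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences])

toy_ids = [
    [101, 41650, 46744, 10113, 102],  # "Nissan Versa", as encoded above
    [101, 149, 102],                  # made-up shorter sequence
]
padded_toy = pad_sequences(toy_ids)
print(padded_toy.shape)  # (2, 5)
print(padded_toy[1])     # [101 149 102   0   0]
```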

### **Masking**
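The attention mask marks which positions hold real tokens (1) and which hold padding (0), so the model ignores the padded positions. Because the padding id used above is 0, comparing `padded` against 0 yields the mask.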

In [0]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(10000, 33)
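
To see why the mask matters, the simplified sketch below applies a masked softmax to one padded row: positions with mask 0 end up with zero attention weight. This is an illustration of the idea only, not the actual DistilBERT implementation, and `masked_softmax` is a hypothetical helper. It assumes every row contains at least one real token.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis that gives zero weight where mask == 0."""
    scores = np.where(mask == 1, scores, -np.inf)         # padding can never win
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

padded_row = np.array([101, 41650, 46744, 10113, 102, 0, 0])
mask_row = np.where(padded_row != 0, 1, 0)  # same rule as in the cell above
scores = np.zeros(len(padded_row))          # equal raw scores for every position
weights = masked_softmax(scores, mask_row)

print(mask_row)  # [1 1 1 1 1 0 0]
print(weights)   # 0.2 for each real token, 0.0 for the two padded positions
```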

### **BERT Embeddings**

BERT embeddings are taken from the model's last layer.

Since we are working on a classification task, only the position of the [CLS] token matters to us, because that is the position BERT uses for classification.

In [0]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = modelo(input_ids, attention_mask=attention_mask)
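
The cell above pushes all 10,000 rows through the model in a single call, which can exhaust memory. A common alternative is to process the rows in batches and concatenate the results. The sketch below shows the batching logic with NumPy only: `embed_in_batches` and `fake_encoder` are hypothetical names, and in the notebook the encoder callable would wrap the `torch.no_grad()` forward pass and return a NumPy array.

```python
import numpy as np

def embed_in_batches(encode_fn, input_ids, attention_mask, batch_size=64):
    """Apply encode_fn to consecutive row blocks and stack the results."""
    outputs = []
    for start in range(0, len(input_ids), batch_size):
        rows = slice(start, start + batch_size)  # the last block may be shorter
        outputs.append(encode_fn(input_ids[rows], attention_mask[rows]))
    return np.concatenate(outputs, axis=0)

def fake_encoder(ids, mask):
    """Stand-in for the model: one 4-dimensional vector per input row."""
    row_sums = (ids * mask).sum(axis=1, keepdims=True)
    return np.repeat(row_sums, 4, axis=1).astype(float)

ids = np.arange(30).reshape(10, 3)
mask = np.ones_like(ids)
batched = embed_in_batches(fake_encoder, ids, mask, batch_size=4)
print(batched.shape)  # (10, 4)
```

The batch size of 64 is arbitrary; a suitable value depends on the available memory.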

In [0]:
last_hidden_states[0].shape

torch.Size([10000, 33, 768])

**Taking only the first-position vector from the output**

In [0]:
features = last_hidden_states[0][:,0,:].numpy()
features.shape

(10000, 768)
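
The slice `[:, 0, :]` keeps every sample, picks position 0 of the sequence axis (the [CLS] position), and keeps the full hidden dimension. A small NumPy sketch with made-up dimensions shows the effect on the shape:

```python
import numpy as np

batch, seq_len, hidden = 2, 5, 4  # toy sizes standing in for 10000, 33, 768
hidden_states = np.arange(batch * seq_len * hidden).reshape(batch, seq_len, hidden)

# Keep all samples, only sequence position 0, and all hidden units.
cls_vectors = hidden_states[:, 0, :]

print(hidden_states.shape)  # (2, 5, 4)
print(cls_vectors.shape)    # (2, 4)
print(cls_vectors)          # [[ 0  1  2  3], [20 21 22 23]]
```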

## **Classification Model**

To perform the classification task, we will use a standard scikit-learn model.

In [0]:
x_train, x_test, y_train, y_test = train_test_split(features, y)

In [0]:
lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
lr_clf.score(x_test, y_test)

0.1936
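
For a classifier, `score` returns the mean accuracy. To judge whether a value like this is good, it helps to compare it against a trivial baseline that always predicts the most frequent training category. The sketch below computes both quantities with NumPy; `accuracy`, `majority_baseline`, and the toy labels are illustrative and are not taken from the dataset.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def majority_baseline(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    labels, counts = np.unique(y_train, return_counts=True)
    majority = labels[np.argmax(counts)]
    return accuracy(y_test, np.full(len(y_test), majority))

toy_train = ["CARS", "CARS", "TOOLS", "TOYS"]  # made-up categories
toy_test = ["CARS", "TOOLS", "CARS", "TOYS"]
print(majority_baseline(toy_train, toy_test))  # 0.5
```

With many categories the majority baseline is typically low, so an accuracy that looks modest in absolute terms can still be well above chance.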

# **Challenge**

Connect a neural network to BERT's output and see whether you can get better results.

In [0]:
import tensorflow as tf
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

embedding_size = 768
max_seq_length = max_len
n_classes = y.nunique()  # one output unit per category

# The network consumes the DistilBERT hidden states,
# i.e. last_hidden_states[0].numpy(), one (max_seq_length, embedding_size) matrix per sample
input1 = tf.keras.Input(shape=(max_seq_length, embedding_size), name='input1')
gru_out = tf.keras.layers.GRU(100, activation='sigmoid')(input1)
dense = tf.keras.layers.Dense(256, activation="relu")(gru_out)
# Softmax over the categories instead of a single sigmoid unit
pred = tf.keras.layers.Dense(n_classes, activation="softmax")(dense)

In [0]:
modelo = tf.keras.Model(inputs=input1, outputs=pred)

modelo.summary()

# The sparse loss assumes the category labels are encoded as integer ids
modelo.compile(loss='sparse_categorical_crossentropy',
               optimizer='adam',
               metrics=['accuracy'])

weights_filepath='./pesos.h5'

callbacks = [ModelCheckpoint(weights_filepath, monitor='val_loss', mode='min',
                             verbose=1, save_best_only=True),
             EarlyStopping(monitor='val_loss', mode='min', patience=10, verbose=1)]
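
Keras losses expect numeric targets, so the category strings need to be encoded before training a network like this one: `categorical_crossentropy` takes one-hot vectors, while `sparse_categorical_crossentropy` takes integer class ids. A minimal NumPy sketch of both encodings follows; `encode_labels` and the toy labels are hypothetical.

```python
import numpy as np

def encode_labels(labels):
    """Return the sorted class names, integer ids, and one-hot rows for the labels."""
    classes, ids = np.unique(np.asarray(labels), return_inverse=True)
    one_hot = np.eye(len(classes), dtype=np.float32)[ids]
    return classes, ids, one_hot

toy_labels = ["TOYS", "CARS", "TOYS", "TOOLS"]  # made-up categories
classes, ids, one_hot = encode_labels(toy_labels)

print(classes)        # ['CARS' 'TOOLS' 'TOYS']
print(ids)            # [2 0 2 1]
print(one_hot.shape)  # (4, 3)
```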


In [0]:
# Split the dataset into training/validation data
X_train, X_test, y_train, Y_test =  train_test_split(features, y)

lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
lr_clf.score(X_test, Y_test)

0.1364
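
This score differs from the 0.1936 obtained earlier with the same features and model settings. The most likely reason is that `train_test_split` shuffles the rows with a fresh random state on every call unless `random_state` is set, so each run trains and evaluates on a different split. The NumPy sketch below illustrates the idea of a seeded, repeatable split; `seeded_split` is a hypothetical helper, and in the notebook passing `random_state` to `train_test_split` would be the usual route.

```python
import numpy as np

def seeded_split(features, labels, test_fraction=0.25, seed=0):
    """Shuffle row indices with a fixed seed, then split into train and test parts."""
    features, labels = np.asarray(features), np.asarray(labels)
    order = np.random.default_rng(seed).permutation(len(features))
    n_test = int(round(len(features) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return features[train_idx], features[test_idx], labels[train_idx], labels[test_idx]

toy_x = np.arange(16).reshape(8, 2)
toy_y = np.arange(8)

first = seeded_split(toy_x, toy_y, seed=42)
second = seeded_split(toy_x, toy_y, seed=42)

# The same seed reproduces exactly the same split.
print(all(np.array_equal(a, b) for a, b in zip(first, second)))  # True
print(first[0].shape, first[1].shape)                            # (6, 2) (2, 2)
```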