<a href="https://colab.research.google.com/github/DRSLima/sentiment-analysis-twitter/blob/master/Text%20Classification%20with%20BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [0]:
!pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/50/10/aeefced99c8a59d828a92cc11d213e2743212d3641c87c82d61b035a7d5c/transformers-2.3.0-py3-none-any.whl (447kB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/74/f4/2d5214cbf13d06e7cb2c20d84115ca25b53ea76fa1f0ade0e3c9749de214/sentencepiece-0.1.85-cp36-cp36m-manylinux1_x86_64.whl (1.0MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/a6/b4/7a41d630547a4afd58143597d5a49e07bfd4c42914d8335b2a5657efc14b/sacremoses-0.0.38.tar.gz (860kB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.38-cp36-none-any.whl size=884629 sha256=0b58c3bbd449563d

In [0]:
import pandas as pd
import numpy as np
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

df_train = pd.read_csv('https://raw.githubusercontent.com/jpvmm/FASAM_NLP/master/train_sample.csv')

# Use a smaller subset, because the notebook's memory cannot handle the full dataset.
df_train = df_train[0:10000]


df_test = pd.read_csv('https://raw.githubusercontent.com/jpvmm/FASAM_NLP/master/test_sample.csv')

df_test = df_test[0:500]


# **Getting Started**


---
**Question**: How can we work with two different languages? What strategy could be adopted using BERT?

## **Using DistilBERT**

We will use DistilBERT from the Hugging Face Transformers library.

Here we use the pretrained multilingual model.

In [0]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-multilingual-cased')

tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
modelo = model_class.from_pretrained(pretrained_weights)

### **Cleaning and Tokenization**

In [0]:
x = df_train['title']
y = df_train['category']

In [0]:
# Data cleaning

import re
import string
import nltk

nltk.download('stopwords')

from nltk.corpus import stopwords

# Build the punctuation regex and the stop-word set once, not on every call
re_punc = re.compile('[%s]' % re.escape(string.punctuation))
stop_words = set(stopwords.words('portuguese')) | set(stopwords.words('spanish'))

def limpaTexto(texto):
  tokens = texto.split()
  # Strip punctuation from each token
  tokens = [re_punc.sub('', w) for w in tokens]
  # Keep only alphabetic tokens longer than one character that are not
  # Portuguese or Spanish stop words (the comparison is case-sensitive)
  tokens = [w for w in tokens if w.isalpha() and len(w) > 1 and w not in stop_words]
  return ' '.join(tokens)

x = x.apply(limpaTexto)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [0]:
# See how the BERT tokenizer splits sentences
# WordPiece tokenization

for i in range(3):
  print(x[i], '---->', tokenizer.tokenize(x[i]))

Microscópio Biológico Binocular Meopta Profissional ----> ['micro', '##sco', '##pio', 'bio', '##logico', 'bin', '##oc', '##ular', 'me', '##op', '##ta', 'profissional']
Nissan Versa ----> ['ni', '##ssa', '##n', 'vers', '##a']
Llave Contacto Yamaha Fz Tapa Tanque Mpr ----> ['ll', '##ave', 'contacto', 'ya', '##mah', '##a', 'f', '##z', 'tapa', 'tan', '##que', 'm', '##pr']
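
To make the `##` pieces in the output above concrete, here is a minimal sketch of greedy longest-match-first WordPiece splitting. The `wordpiece` helper and the `toy_vocab` set are hypothetical, written only for illustration; the real tokenizer loads its vocabulary from the pretrained checkpoint and normalizes the text before splitting.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one word into pieces by greedy longest-match-first lookup."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first and shrink it until it is in the vocab
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the real DistilBERT one
toy_vocab = {"ni", "##ssa", "##n", "vers", "##a"}
nissan_pieces = wordpiece("nissan", toy_vocab)
versa_pieces = wordpiece("versa", toy_vocab)
print(nissan_pieces, versa_pieces)
```

With this toy vocabulary the split of "nissan" and "versa" mirrors the pieces printed for "Nissan Versa" above.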


In [0]:
tokenized = x.apply(lambda text: tokenizer.encode(text, add_special_tokens=True))

for i in range(3):
  print(x[i],'--->', tokenized[i])

Microscópio Biológico Binocular Meopta Profissional ---> [101, 54396, 22402, 22196, 12297, 59361, 16292, 25125, 18062, 10911, 13362, 10213, 40604, 102]
Nissan Versa ---> [101, 10414, 11253, 10115, 12576, 10113, 102]
Llave Contacto Yamaha Fz Tapa Tanque Mpr ---> [101, 22469, 23641, 45620, 10549, 56271, 10113, 174, 10305, 89711, 15176, 11189, 181, 52302, 102]


In [0]:
# Pad every sequence with 0 up to the length of the longest one
max_len = max(len(i) for i in tokenized.values)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])
padded.shape

(10000, 28)

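### **Masking**
Padding positions carry no information, so the model needs an attention mask: 1 marks a real token and 0 marks padding, which the attention layers then ignore.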

In [0]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(10000, 28)
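
As a small illustration of the mask built above, the same `np.where` call can be applied to a toy batch. The array `toy_padded` is made up for this sketch; only 101 and 102 (the `[CLS]` and `[SEP]` ids seen in the encoded output earlier) and the padding value 0 come from the notebook.

```python
import numpy as np

# Two toy sequences already padded with 0 (the middle ids are arbitrary examples)
toy_padded = np.array([
    [101, 10414, 11253, 102, 0, 0],
    [101, 22469, 102, 0, 0, 0],
])

# 1 marks a real token, 0 marks padding
toy_mask = np.where(toy_padded != 0, 1, 0)
print(toy_mask)

# Summing each row gives the number of real tokens in that sequence
real_lengths = toy_mask.sum(axis=1)
print(real_lengths)
```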

### **BERT Embeddings**

The BERT embeddings are taken from the model's last layer.

Since this is a classification task, only the position holding the [CLS] token matters here, because that is the position BERT uses for classification.

In [0]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = modelo(input_ids, attention_mask=attention_mask)

In [0]:
last_hidden_states[0].shape

torch.Size([10000, 28, 768])

**Taking only the first-position vector of the output**

In [0]:
features = last_hidden_states[0][:,0,:].numpy()
features.shape

(10000, 768)
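
Passing all 10,000 rows through the model in a single call keeps every intermediate activation in memory at once. A hedged sketch of running the same forward pass in chunks follows. `embed_in_batches` and `batch_size` are names introduced here for illustration, and the model is assumed to return a tuple-like output whose first element is the last hidden state, as `modelo` does above.

```python
import torch

def embed_in_batches(model, input_ids, attention_mask, batch_size=256):
    """Run the forward pass chunk by chunk and keep only the first-position vectors."""
    chunks = []
    with torch.no_grad():  # inference only, so no gradients are stored
        for start in range(0, input_ids.shape[0], batch_size):
            stop = start + batch_size
            outputs = model(input_ids[start:stop],
                            attention_mask=attention_mask[start:stop])
            # outputs[0] has shape (batch, seq_len, hidden); position 0 holds [CLS]
            chunks.append(outputs[0][:, 0, :])
    # Stack the per-chunk results back into one (n_rows, hidden) tensor
    return torch.cat(chunks, dim=0)
```

Under those assumptions, `features = embed_in_batches(modelo, input_ids, attention_mask).numpy()` would produce the same `(10000, 768)` array while holding only one chunk of activations at a time.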

## **Classification Model**

To perform the classification, we will use a standard scikit-learn model.

In [0]:
x_train, x_test, y_train, y_test = train_test_split(features, y)

In [0]:
lr_clf = LogisticRegression(max_iter=1000)
lr_clf.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
lr_clf.score(x_test, y_test)

0.1936
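
A single random split gives a noisy estimate, and an accuracy value is easier to judge next to a trivial baseline. The sketch below uses `cross_val_score` (imported at the top of the notebook but never called) together with scikit-learn's `DummyClassifier`. The data comes from `make_classification` and merely stands in for the DistilBERT vectors, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (features, y); this is not the notebook's data
X_toy, y_toy = make_classification(n_samples=300, n_features=20, n_informative=5,
                                   n_classes=3, random_state=0)

# 5-fold cross-validated accuracy of the classifier
clf_scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy, cv=5)

# Baseline that always predicts the most frequent class
base_scores = cross_val_score(DummyClassifier(strategy="most_frequent"), X_toy, y_toy, cv=5)

print("logistic regression:", clf_scores.mean())
print("majority baseline:  ", base_scores.mean())
```

On the notebook's data, `features` and `y` would replace the toy arrays; categories with very few rows may trigger warnings from the stratified splitter.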

# **Challenge**

Connect a neural network to BERT's output and see whether you can get better results.
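
One possible starting point, offered as a sketch and not as a reference solution: a small feed-forward head trained on the frozen 768-dimensional vectors. The layer sizes, dropout rate, learning rate, number of epochs, and the random toy tensors are all assumptions made here for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy stand-ins: 64 vectors of size 768 and 4 hypothetical classes
toy_features = torch.randn(64, 768)
toy_labels = torch.randint(0, 4, (64,))

# Feed-forward head: 768 -> 256 -> number of classes
head = nn.Sequential(
    nn.Linear(768, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, 4),
)

optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # expects raw logits and integer class ids

head.train()
losses = []
for epoch in range(20):
    optimizer.zero_grad()
    logits = head(toy_features)        # shape (64, 4)
    loss = loss_fn(logits, toy_labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

# Switch off dropout and predict the most likely class for each row
head.eval()
with torch.no_grad():
    predictions = head(toy_features).argmax(dim=1)
print(predictions.shape)
```

With the real data, `torch.tensor(features)` would replace `toy_features`, and the string categories in `y` would first need to be mapped to integer ids, for example with scikit-learn's `LabelEncoder`.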