# Document Classification

Document classification assigns a label to each text in a collection. Sentiment analysis is one kind of text classification.

To illustrate document classification, we will build a model that classifies a sentence as positive or negative.

## Loading the embedding and the data

In [1]:
!pip install unidecode
!pip install vaderSentiment

Collecting unidecode
  Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.3.8
Collecting vaderSentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vaderSentiment
Successfully installed vaderSentiment-3.3.2


In [39]:
import gensim
import pandas as pd
from nltk.corpus import stopwords
import string
from unidecode import unidecode
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import re
import numpy as np

In [66]:
# option 1 -> mount Google Drive in Colab and read the embedding file from the drive
from google.colab import drive
drive.mount('/content/drive')

# option 2 -> download the file and upload it here
#from google.colab import files
#uploaded = files.upload()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [22]:
%%time
path='drive/MyDrive/aulas/Processamento de Linguagem Natural - Saude/ptwiki_20180420_100d.txt.bz2'
word_vectors = gensim.models.KeyedVectors.load_word2vec_format(path,
                                                               binary=False,
                                                               limit=50000)

CPU times: user 8.97 s, sys: 51.4 ms, total: 9.03 s
Wall time: 10.4 s


In [69]:
df = pd.read_csv('drive/MyDrive/aulas/Processamento de Linguagem Natural - Saude/imdb-reviews-pt-br.csv')

In [71]:
df.sentiment.value_counts()

sentiment
neg    24765
pos    24694
Name: count, dtype: int64

## Data preparation

1. Turn the target variable (sentiment) into a binary variable
2. Apply the following pre-processing to the text:
   1. Tokenize the text using the regex `\w+(?:'\w+)?|[^\w\s]`
   2. Convert everything to lowercase
   3. Remove stop words
   4. Remove punctuation
   5. Remove numbers

Write two pre-processing functions: one that returns the processed sentence as a string and one that returns a list of tokens.
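
To make the steps above concrete, the sketch below runs them on a made-up sentence. Only the regex comes from the exercise; the sentence and the tiny stop-word set are illustrative assumptions (the real list comes from NLTK), and the final filter that discards empty tokens is an extra step that is not part of the exercise list.

```python
import re
import string

# Regex from the exercise: a word (optionally with an apostrophe part) or one punctuation mark
PADRAO = r"\w+(?:'\w+)?|[^\w\s]"

# Hypothetical inputs, for illustration only
frase = "Não gostei do filme, nota 3!"
stops_exemplo = {"do", "de", "o", "a"}  # tiny stand-in for the NLTK Portuguese list

tokens = re.findall(PADRAO, frase)                            # 1. tokenize
tokens = [t.lower() for t in tokens]                          # 2. lowercase
tokens = [t for t in tokens if t not in stops_exemplo]        # 3. drop stop words
tokens = [t for t in tokens if t not in string.punctuation]   # 4. drop punctuation
tokens = [re.sub(r"\d", "", t) for t in tokens]               # 5. strip digits
tokens = [t for t in tokens if t]                             # extra: discard tokens left empty

print(tokens)
print(" ".join(tokens))
```

The last two lines correspond to the two return types requested: a token list and a single processed string.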

## Solution

In [23]:
df = pd.read_csv('drive/MyDrive/aulas/Processamento de Linguagem Natural - Saude/imdb-reviews-pt-br.csv')

In [24]:
target = df['sentiment'].replace(['neg','pos'],[0,1])

In [25]:
def pre_processamento_texto_return_str(corpus, portugues_stops):
    corpus_alt = re.findall(r"\w+(?:'\w+)?|[^\w\s]", corpus)  # tokenize into words and punctuation marks
    corpus_alt = [t.lower() for t in corpus_alt]  # lowercase
    corpus_alt = [t for t in corpus_alt if t not in portugues_stops]  # drop stop-word tokens
    corpus_alt = [t for t in corpus_alt if t not in string.punctuation]  # drop punctuation tokens
    corpus_alt = [re.sub(r'\d', '', t) for t in corpus_alt]  # strip digits (an all-digit token becomes '')
    corpus_alt_str = ' '.join(corpus_alt)
    return corpus_alt_str

In [26]:
def pre_processamento_texto_return_token(corpus, portugues_stops):
    # Same steps as the string version, but the token list is returned.
    # The stop-word list passed as an argument is used as-is instead of
    # being reloaded from NLTK on every call.
    corpus_alt = re.findall(r"\w+(?:'\w+)?|[^\w\s]", corpus)
    corpus_alt = [t.lower() for t in corpus_alt]
    corpus_alt = [t for t in corpus_alt if t not in portugues_stops]
    corpus_alt = [t for t in corpus_alt if t not in string.punctuation]
    corpus_alt = [re.sub(r'\d', '', t) for t in corpus_alt]

    return corpus_alt

In [27]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [28]:
portugues_stops = stopwords.words('portuguese')

In [31]:
from tqdm import tqdm
tqdm.pandas()

In [73]:
df["text_pt_sem_stopwords"] = df["text_pt"]\
      .progress_apply(lambda x: pre_processamento_texto_return_str(x, portugues_stops))

100%|██████████| 49459/49459 [00:53<00:00, 924.18it/s] 


In [74]:
df["text_pt_sem_stopwords"]

0        vez sr costner arrumou filme tempo necessário ...
1        exemplo motivo maioria filmes ação mesmos gené...
2        primeiro tudo odeio raps imbecis poderiam agir...
3        beatles puderam escrever músicas todos gostass...
4        filmes fotos latão palavra apropriada verdade ...
                               ...                        
49454    média votos baixa fato funcionário locadora ac...
49455    enredo algumas reviravoltas infelizes inacredi...
49456    espantado forma filme maioria outros média  es...
49457    christmas together realmente veio antes tempo ...
49458    drama romântico classe trabalhadora diretor ma...
Name: text_pt_sem_stopwords, Length: 49459, dtype: object

## Bag-of-words

Create a bag-of-words representation of the text.
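
As a rough illustration of what a bag-of-words matrix holds, the sketch below builds one by hand with the standard library for two made-up documents. It is not how scikit-learn's `CountVectorizer` works internally (for instance, its default tokenizer drops single-character tokens and it returns a sparse matrix); the documents and variable names are illustrative assumptions.

```python
from collections import Counter

# Hypothetical, already pre-processed documents
docs = ["filme bom elenco bom", "filme ruim"]

# Vocabulary: every distinct word, sorted so each word gets a fixed column
vocab = sorted({palavra for doc in docs for palavra in doc.split()})

# One row per document, one column per vocabulary word; each cell is a count
matriz = []
for doc in docs:
    contagem = Counter(doc.split())
    matriz.append([contagem[palavra] for palavra in vocab])

print(vocab)
print(matriz)
```

Word order is discarded: each document is described only by how often each vocabulary word occurs in it.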

## Solution

In [75]:
vect_bag = CountVectorizer()
X_bag = vect_bag.fit_transform(df['text_pt_sem_stopwords'])
vocabulario = vect_bag.get_feature_names_out()

In [76]:
print("Vocabulario",len(vocabulario))
print("Features",X_bag.shape)
print("Target",target.shape)

Vocabulario 127500
Features (49459, 127500)
Target (49459,)


## Embedding

Using the embedding, represent each text as the mean of the vectors of its words.
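
The averaging idea can be sketched without gensim, assuming a small hand-written dictionary of vectors in place of the loaded embedding. The 3-dimensional vectors, the function name `media_embedding` and the zero-vector fallback for texts with no known word are all assumptions made for illustration; the notebook itself does not define a fallback.

```python
# Hypothetical 3-dimensional embeddings (the loaded file name suggests 100 dimensions)
vetores_exemplo = {
    "filme": [1.0, 0.0, 2.0],
    "bom": [3.0, 2.0, 0.0],
}

def media_embedding(tokens, vetores, dim=3):
    """Average the vectors of the tokens that exist in `vetores`."""
    encontrados = [vetores[t] for t in tokens if t in vetores]
    if not encontrados:
        # No known token: return a zero vector (assumption, not from the notebook)
        return [0.0] * dim
    # zip(*...) groups together the values of each dimension
    return [sum(coluna) / len(encontrados) for coluna in zip(*encontrados)]

print(media_embedding(["filme", "bom", "xyz"], vetores_exemplo))
print(media_embedding(["xyz"], vetores_exemplo))
```

Tokens missing from the vocabulary are simply skipped, so the result always has the same length regardless of how many words the text contains.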

## Solution

In [36]:
df["text_pt_sem_stopwords_token"] = df["text_pt"]\
.progress_apply(lambda x: pre_processamento_texto_return_token(x, portugues_stops))

100%|██████████| 49459/49459 [01:13<00:00, 668.73it/s]


In [50]:
def calcula_embedding_frase(tokens):
    # average the vectors of the tokens that exist in the embedding vocabulary
    vetores = [word_vectors[t] for t in tokens if t in word_vectors.key_to_index]
    return np.mean(vetores, axis=0)

In [51]:
X_embedding = df["text_pt_sem_stopwords_token"].progress_apply(lambda x: calcula_embedding_frase(x))

100%|██████████| 49459/49459 [00:23<00:00, 2081.58it/s]


In [77]:
X_embedding

0        [0.18474115, 0.19828236, -0.021269115, -0.2198...
1        [0.18389921, 0.22538583, -0.048877165, -0.1858...
2        [0.25800207, 0.14127004, -0.0005150017, -0.183...
3        [0.29422346, 0.12918055, -0.05406107, -0.20715...
4        [0.22152832, 0.111977965, -0.080611, -0.158311...
                               ...                        
49454    [0.18613999, 0.17669998, -0.081205994, -0.0967...
49455    [0.20748213, 0.15518929, -0.019182147, -0.1851...
49456    [0.20601723, 0.17797449, -0.045954805, -0.1948...
49457    [0.18068357, 0.14797525, -0.09047413, -0.18754...
49458    [0.27019876, 0.16974878, -0.06660596, -0.18025...
Name: text_pt_sem_stopwords_token, Length: 49459, dtype: object

### Training

Split the bag-of-words data into training and test sets and train a logistic regression model. How does this model perform?
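
For intuition about the model trained here: logistic regression computes a weighted sum of the features and passes it through the sigmoid function to obtain a probability for the positive class. The sketch below does this by hand. The weights, bias and feature vectors are made-up values, not parameters learned from the IMDB data, and the function names are hypothetical.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def prob_positiva(x, pesos, vies):
    """Probability of class 1 for the feature vector `x`."""
    z = sum(w * xi for w, xi in zip(pesos, x)) + vies
    return sigmoid(z)

# Hypothetical weights for the bag-of-words columns ["bom", "filme", "ruim"]
pesos = [1.5, 0.0, -2.0]
vies = 0.0

p_bom = prob_positiva([2, 1, 0], pesos, vies)   # counts for "filme bom bom"
p_ruim = prob_positiva([0, 1, 1], pesos, vies)  # counts for "filme ruim"

# A probability above 0.5 is predicted as class 1 (positive)
print(round(p_bom, 3), p_bom > 0.5)
print(round(p_ruim, 3), p_ruim > 0.5)
```

Training consists of finding the weights and bias that make these probabilities match the labels of the training set as closely as possible.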

## Solution

In [78]:
X_train_bow, X_test_bow, y_train_bow, y_test_bow = \
    train_test_split(X_bag, target, random_state=123)

In [53]:
modelo_bow = LogisticRegression()
modelo_bow.fit(X_train_bow,y_train_bow)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [54]:
y_pred = modelo_bow.predict(X_test_bow)

In [55]:
print(classification_report(y_test_bow, y_pred))

              precision    recall  f1-score   support

           0       0.88      0.87      0.88      6112
           1       0.87      0.89      0.88      6253

    accuracy                           0.88     12365
   macro avg       0.88      0.88      0.88     12365
weighted avg       0.88      0.88      0.88     12365
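
The precision, recall and f1-score columns of such a report can be recomputed from confusion-matrix counts for a class. The sketch below shows the formulas; the counts are made-up round numbers, not values taken from the run above, and `metricas` is a hypothetical helper name.

```python
def metricas(tp, fp, fn):
    """Precision, recall and F1 for one class.

    tp: examples of the class predicted as the class
    fp: examples of other classes predicted as the class
    fn: examples of the class predicted as another class
    """
    precisao = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precisao * recall / (precisao + recall)
    return precisao, recall, f1

# Hypothetical counts, for illustration only
precisao, recall, f1 = metricas(tp=80, fp=20, fn=10)
print(round(precisao, 2), round(recall, 2), round(f1, 2))
```

The `support` column is the number of test examples that truly belong to each class, and `macro avg` is the unweighted mean of the per-class values.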



### Embedding

Split the embedding data into training and test sets and train a logistic regression model. How does this model perform?

## Solution

In [56]:
X_train_embedding, X_test_embedding, y_train_embedding, y_test_embedding = \
train_test_split(X_embedding.values, target,random_state=123)

In [57]:
X_train_embedding = pd.DataFrame([x for x in X_train_embedding])
X_test_embedding = pd.DataFrame([x for x in X_test_embedding])

In [58]:
modelo_embedding = LogisticRegression()
modelo_embedding.fit(X_train_embedding,y_train_embedding)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [59]:
y_pred = modelo_embedding.predict(X_test_embedding)

In [60]:
print(classification_report(y_test_embedding, y_pred))

              precision    recall  f1-score   support

           0       0.76      0.77      0.77      6112
           1       0.77      0.76      0.77      6253

    accuracy                           0.77     12365
   macro avg       0.77      0.77      0.77     12365
weighted avg       0.77      0.77      0.77     12365



# Sentiment Analysis

When the goal is sentiment analysis, we can either train our own model or use an existing tool such as VADER.

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon- and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. Its `compound` score is typically interpreted with the following thresholds:

- positive sentiment: compound score >= 0.05
- neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
- negative sentiment: compound score <= -0.05

More information: https://github.com/cjhutto/vaderSentiment
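
The thresholds above can be turned into a small helper that maps a compound score to a label. This is a sketch: `rotulo_vader` is a hypothetical name and is not part of the vaderSentiment package, and the scores passed to it are made-up examples.

```python
def rotulo_vader(compound):
    """Map a VADER compound score to a sentiment label using the thresholds above."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# Hypothetical compound scores
for score in (0.6, 0.0, -0.3):
    print(score, rotulo_vader(score))
```

With an analyzer available, the helper would be applied to `analyzer.polarity_scores(texto)["compound"]`.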

In [79]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [80]:
analyzer = SentimentIntensityAnalyzer()

In [81]:
texto_neg = df.loc[0, "text_en"]
texto_pos = df.loc[49431, "text_en"]

In [82]:
texto_neg

'Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrific sea rescue sequences, of which there are very few I just did not care about any of the characters. Most of us have ghosts in the closet, and Costners character are realized early on, and then forgotten until much later, by which time I did not care. The character we should really care about is a very cocky, overconfident Ashton Kutcher. The problem is he comes off as kid who thinks hes better than anyone else around him and shows no signs of a cluttered closet. His only obstacle appears to be winning over Costner. Finally when we are well past the half way point of this stinker, Costner tells us all about Kutchers ghosts. We are told why Kutcher is driven to be the best with no prior inkling or foreshadowing. No magic here, it was all I could do to keep from turning it off an hour in.'

In [64]:
analyzer.polarity_scores(texto_neg)

{'neg': 0.126, 'neu': 0.76, 'pos': 0.114, 'compound': 0.3958}

In [83]:
analyzer.polarity_scores(texto_pos)

{'neg': 0.084, 'neu': 0.737, 'pos': 0.179, 'compound': 0.9969}