# Lesson 1: Introduction to Sentiment Analysis

- What it is used for
  - Assessing how a product is perceived
    - Overall, over time, by demographic, by customer or product category
  - Monitoring social media
    - Public figures, politicians, etc.
  - Monitoring brands, products, launches
  - Market research


- Output formats
  - Positive, Negative, Neutral
  - Specific sentiments
    - Furious, satisfied
    - Interested, not interested
  - Integer value (score): 1 to 10
  - Star ratings

- A challenging task
  - A negative sentiment is not always expressed through adjectives such as "bad"
    - E.g.: 'The screen could be bigger'
  - Context matters a lot
    - E.g.: 'I had a terrible experience with the competitor'
  - Irony and sarcasm
    - E.g.: 'Thank you, Latam, for sending my suitcase to who knows where. Brilliant service'
  - Ambiguity
  - Emojis



---

## How does it work?

- Rule-based systems
  - Need no training and do not depend on data
  - Do not learn

- AI/ML systems
  - Depend on data for training
  - Can be trained for specific business domains
  - Supervised: labeled data
  - Unsupervised
    - Relies on patterns such as negative words appearing close to one another
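
To make the rule-based approach concrete, here is a minimal sketch of a lexicon scorer. The word lists and the function name `rule_based_sentiment` are hypothetical, chosen only for illustration; a real system would rely on a curated lexicon and far richer rules.

```python
# Hypothetical mini-lexicon (assumption): real systems use curated word lists.
POSITIVE = {"good", "great", "amazing", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}
NEGATIONS = {"not", "never", "no"}


def rule_based_sentiment(text: str) -> str:
    """Count lexicon hits; a negation word flips the polarity of the next hit."""
    score = 0
    negate = False
    for raw in text.lower().split():
        word = raw.strip(".,!?'\"")
        if word in NEGATIONS:
            negate = True
            continue
        # +1 for a positive word, -1 for a negative word, 0 otherwise
        polarity = (word in POSITIVE) - (word in NEGATIVE)
        if polarity:
            score += -polarity if negate else polarity
            negate = False
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


plain = rule_based_sentiment("The flight was great")
negated = rule_based_sentiment("The service was not good")
implicit = rule_based_sentiment("The screen could be bigger")
```

No training is involved, which matches the notes. The last sentence contains no lexicon word, so the implicit complaint is scored as neutral, which is exactly the kind of case that makes the task challenging.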

# Lesson 2: Practical Example with LSTM

### Supervised Task: Labeled Data

What changes in this analysis compared with the earlier spam/ham binary classification?
- It is now a multiclass problem (Neutral, Positive, Negative)
- Binary classification: modelo.add(Dense(units=1, activation='sigmoid'))
- A single neuron can only output one number
- How can 3 categories be represented with a single neuron? **THEY CANNOT**


---


- SOLUTION: **Output layer with 3 neurons and a softmax activation function**
  - 1 neuron per class
- modelo.add(Dense(3, activation='softmax'))
- How does each neuron come to represent a class? One-hot encoding
  - The output is expressed as probabilities
    - [prob Neutral     prob Positive   prob Negative], summing to 1 (the column order follows the label encoding)
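
The difference between a single sigmoid output and a 3-neuron softmax output can be sketched in NumPy. The raw scores and the class order below are made-up assumptions for illustration, not values from the course.

```python
import numpy as np


def sigmoid(z):
    """Squash one raw score into (0, 1): enough for a binary decision."""
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    """Turn a vector of raw scores into probabilities that sum to 1."""
    shifted = z - np.max(z)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()


# Assumed class order and made-up raw scores from a 3-neuron output layer
classes = ["negative", "neutral", "positive"]
logits = np.array([2.0, 0.5, -1.0])

binary_prob = sigmoid(0.3)   # one neuron: a single probability
probs = softmax(logits)      # three neurons: one probability per class
predicted = classes[int(np.argmax(probs))]
```

The predicted class is simply the position of the largest probability, which is why each neuron can be tied to one column of the one-hot encoded target.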


---


- Some kinds of data depend on order
  - Text
    - Fernando é uma pessoa nova (Fernando is a young person)
    - Fernando é uma nova pessoa (Fernando is a changed person)
  - Solution: **Recurrent Neural Network (RNN)**
    - Takes order into account
    - But struggles to retain information from the early steps

- LSTM: a type of RNN
  - Has mechanisms called "gates" that learn which information should be kept or discarded
  - It is a layer that is 'stacked' in the network topology
  - The remaining steps of the workflow are similar
  - Very high computational cost
  - Can take advantage of NVIDIA CUDA (GPU)


# Lesson 3: LSTM I

In [3]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay

from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding, LSTM, SpatialDropout1D
from tensorflow.keras.utils import to_categorical

from tensorflow.keras.optimizers import Adam

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [5]:
url = 'https://raw.githubusercontent.com/AnaClaraGuerra22/PLN-LLMs-e-Gen-AI---Udemy/931c68630a663029c81bb1066b92d67ae90a4001/Analise%20de%20Sentimento/Data/Tweets.csv'

tweets = pd.read_csv(url)
print(tweets.shape)
tweets.head()


(14640, 15)


Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)


In [8]:
tweets['airline_sentiment'].value_counts()

# imbalanced
# negative	9178
# neutral	3099
# positive	2363

airline_sentiment,count
negative,9178
neutral,3099
positive,2363


In [10]:
tweets = tweets[tweets['airline_sentiment_confidence'] > 0.8]
tweets

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
3,570301031407624196,negative,1.0,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0,Can't Tell,1.0000,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
5,570300767074181121,negative,1.0,Can't Tell,0.6842,Virgin America,,jnardino,,0,@VirginAmerica seriously would pay $30 a fligh...,,2015-02-24 11:14:33 -0800,,Pacific Time (US & Canada)
9,570295459631263746,positive,1.0,,,Virgin America,,YupitsTate,,0,"@VirginAmerica it was amazing, and arrived an ...",,2015-02-24 10:53:27 -0800,Los Angeles,Eastern Time (US & Canada)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14631,569588464896876545,negative,1.0,Bad Flight,1.0000,American,,MDDavis7,,0,@AmericanAir thx for nothing on getting us out...,,2015-02-22 12:04:07 -0800,US,Eastern Time (US & Canada)
14633,569587705937600512,negative,1.0,Cancelled Flight,1.0000,American,,RussellsWriting,,0,@AmericanAir my flight was Cancelled Flightled...,,2015-02-22 12:01:06 -0800,Los Angeles,Arizona
14636,569587371693355008,negative,1.0,Customer Service Issue,1.0000,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,
14637,569587242672398336,neutral,1.0,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",


In [11]:
tweets['airline_sentiment'].value_counts()

airline_sentiment,count
negative,7392
neutral,1550
positive,1517


In [13]:
token = Tokenizer(num_words=1000)
token.fit_on_texts(tweets['text'].values) # builds the vocabulary (word -> integer index)

In [14]:
x = token.texts_to_sequences(tweets['text'].values) # converts each tweet into a sequence of word indices
x = pad_sequences(x, padding = 'post', maxlen=100)    # pads with trailing zeros so every sequence has length 100

In [15]:
print(x)

[[ 97  62 229 ...   0   0   0]
 [ 97  99 131 ...   0   0   0]
 [ 97   9  99 ...   0   0   0]
 ...
 [ 13 429  98 ...   0   0   0]
 [ 13  89 692 ...   0   0   0]
 [ 13   6  23 ...   0   0   0]]
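
What the tokenizer and the padding step do can be mimicked in plain Python. This is a simplified stand-in, not the Keras implementation: it splits on whitespace only, the helper names (`fit_vocab`, `to_sequences`, `pad_post`) are hypothetical, and the sample sentences are made up.

```python
from collections import Counter


def fit_vocab(texts):
    """Rank words by frequency: id 1 is the most frequent; 0 is kept for padding."""
    freq = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(freq.most_common(), start=1)}


def to_sequences(texts, vocab, num_words):
    """Replace words with ids, skipping unknown words and ids >= num_words."""
    return [
        [vocab[w] for w in text.lower().split() if w in vocab and vocab[w] < num_words]
        for text in texts
    ]


def pad_post(sequences, maxlen):
    """Keep at most the last maxlen ids, then fill with zeros on the right."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]
        padded.append(kept + [0] * (maxlen - len(kept)))
    return padded


texts = ["the flight was late", "the crew was great", "late again"]
vocab = fit_vocab(texts)
sequences = to_sequences(texts, vocab, num_words=6)
padded = pad_post(sequences, maxlen=4)
```

As in the notebook, the vocabulary still holds every word, but only ids below `num_words` survive in the sequences, and shorter texts end in a run of zeros.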


# Lesson 4: LSTM II

In [16]:
Labelencoder = LabelEncoder()
y = Labelencoder.fit_transform(tweets['airline_sentiment'])
print(y)

[1 0 0 ... 0 1 0]


In [18]:
# one-hot encoding
#  turns airline_sentiment into 3 numeric columns; LabelEncoder sorts classes alphabetically: (negative, neutral, positive)

y = to_categorical(y)
print(y)

[[0. 1. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 ...
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]]
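
The two encoding steps can be reproduced with NumPy alone, which makes the column order explicit. The four sample labels are made up; `np.unique` sorts the class names alphabetically, which is also how `LabelEncoder` orders them.

```python
import numpy as np

# Made-up sample labels, in the style of the airline_sentiment column
labels = np.array(["neutral", "negative", "negative", "positive"])

# Step 1: integer encoding. Classes come back sorted alphabetically.
classes, y_int = np.unique(labels, return_inverse=True)

# Step 2: one-hot encoding. Row k of the identity matrix encodes class k.
y_onehot = np.eye(len(classes))[y_int]

print(classes)    # class name for each column
print(y_int)      # integer code per sample
print(y_onehot)   # one row per sample, one column per class
```

In the notebook the same mapping can be inspected through `Labelencoder.classes_` after fitting.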


In [19]:
x_train, x_test, y_tran, y_test = train_test_split(x, y, test_size=0.3)
x_test

array([[ 16,  66,  61, ...,   0,   0,   0],
       [  8,   3, 258, ...,   0,   0,   0],
       [ 12, 125, 651, ...,   0,   0,   0],
       ...,
       [ 16,  20,   7, ...,   0,   0,   0],
       [ 16,   7,   1, ...,   0,   0,   0],
       [ 12,   3,  88, ...,   0,   0,   0]], dtype=int32)

In [21]:
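modelo = Sequential()

# Embedding: maps each token id to a 128-dimensional vector.
# Tokenizer(num_words=1000) only emits ids below 1000, so input_dim=1000 would be enough here.
modelo.add(Embedding(input_dim=len(token.word_index), output_dim=128, input_length=x.shape[1]))

# Drops entire embedding channels to reduce overfitting
modelo.add(SpatialDropout1D(0.2))

# Recurrent layer with 196 units that reads each tweet token by token
modelo.add(LSTM(196, dropout=0.2, recurrent_dropout=0, activation='tanh', recurrent_activation='sigmoid', unroll=False, use_bias=True))

# Output layer: one neuron per class, probabilities via softmax
modelo.add(Dense(3, activation='softmax'))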



In [22]:
modelo.compile(optimizer='Adam', loss='categorical_crossentropy', metrics=['accuracy'])
modelo.summary()  # summary() prints the table itself and returns None, so wrapping it in print() only adds "None"
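
The `categorical_crossentropy` loss chosen in `compile` compares the one-hot target with the softmax output: for each sample it is minus the log of the probability assigned to the true class. A minimal NumPy sketch, with made-up targets and predictions:

```python
import numpy as np


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-sample loss: -sum over classes of y_true * log(y_pred)."""
    y_pred = np.clip(y_pred, eps, 1.0)  # avoid log(0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)


# Made-up one-hot targets and predicted probabilities for two samples
y_true = np.array([[0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0]])
y_pred = np.array([[0.2, 0.7, 0.1],
                   [0.6, 0.3, 0.1]])

losses = categorical_crossentropy(y_true, y_pred)
mean_loss = losses.mean()  # a training log reports an average like this
```

Because the target is one-hot, only the probability of the correct class contributes, so a confident correct prediction drives the loss toward zero.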


# Lesson 5: LSTM III

In [23]:
modelo.fit(x_train, y_tran, epochs=10, batch_size=30, verbose = True, validation_data=(x_test, y_test))

Epoch 1/10
245/245 ━━━━━━━━━━━━━━━━━━━━ 60s 235ms/step - accuracy: 0.6861 - loss: 0.8596 - val_accuracy: 0.7113 - val_loss: 0.8061
Epoch 2/10
245/245 ━━━━━━━━━━━━━━━━━━━━ 81s 233ms/step - accuracy: 0.7108 - loss: 0.8078 - val_accuracy: 0.7113 - val_loss: 0.8057
Epoch 3/10
245/245 ━━━━━━━━━━━━━━━━━━━━ 84s 240ms/step - accuracy: 0.7113 - loss: 0.8052 - val_accuracy: 0.7113 - val_loss: 0.8019
Epoch 4/10
245/245 ━━━━━━━━━━━━━━━━━━━━ 82s 241ms/step - accuracy: 0.7075 - loss: 0.8102 - val_accuracy: 0.7113 - val_loss: 0.8031
Epoch 5/10
245/245 ━━━━━━━━━━━━━━━━━━━━ 77s 221ms/step - accuracy: 0.7115 - loss: 0.8034 - val_accuracy: 0.7113 - val_loss: 0.8021
Epoch 6/10
245/245 ━━━━━━━━━━━━━━━━━━━━ 54s 219ms/step - accuracy: 0.7007 - loss: 0.8202 - val_accuracy: 0.7113 - val_loss: 0.8016
Epoch 7/10

<keras.src.callbacks.history.History at 0x780f6185d910>

In [24]:
loss, accuracy = modelo.evaluate(x_test, y_test)
print('Loss: ', loss)
print('Accuracy: ', accuracy)

99/99 ━━━━━━━━━━━━━━━━━━━━ 7s 72ms/step - accuracy: 0.6999 - loss: 0.8198
Loss:  0.8019313216209412
Accuracy:  0.7112810611724854


In [25]:
prev = modelo.predict(x_test)
prev

99/99 ━━━━━━━━━━━━━━━━━━━━ 7s 64ms/step


array([[0.6990494 , 0.14385913, 0.15709135],
       [0.6990494 , 0.14385913, 0.15709138],
       [0.6990494 , 0.14385913, 0.15709135],
       ...,
       [0.69904953, 0.14385915, 0.15709138],
       [0.6990495 , 0.14385913, 0.15709136],
       [0.6990495 , 0.14385913, 0.15709136]], dtype=float32)
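
Every row of `prev` is almost identical, and `val_accuracy` stays at 0.7113 in every epoch, so it is worth comparing the results with a trivial baseline. The sketch below uses the class counts from In [11] and two rows copied from the `predict` output; it computes the accuracy of always guessing the majority class and the cross-entropy of always predicting the class frequencies.

```python
import numpy as np

# Class counts from In [11], in label-encoder order: negative, neutral, positive
counts = np.array([7392, 1550, 1517])
prior = counts / counts.sum()

# Baseline 1: always predict the most frequent class
majority_accuracy = prior.max()

# Baseline 2: always output the class frequencies; its expected
# cross-entropy is the entropy of the class distribution
prior_entropy = -np.sum(prior * np.log(prior))

# Two rows copied from the predict() output
prev_rows = np.array([[0.6990494, 0.14385913, 0.15709135],
                      [0.69904953, 0.14385915, 0.15709138]])
predicted_classes = prev_rows.argmax(axis=1)

print(f"majority accuracy: {majority_accuracy:.4f}")
print(f"prior entropy:     {prior_entropy:.4f}")
print(f"predicted classes: {predicted_classes}")
```

Both baselines land close to the reported accuracy (about 0.71) and loss (about 0.80), which suggests, without proving it, that the network has only learned the class frequencies and predicts "negative" for every tweet. One hypothesis worth testing, not verified against this notebook, is that with `padding='post'` the LSTM reads a long run of trailing zeros after each tweet; switching to `padding='pre'` or setting `mask_zero=True` on the `Embedding` layer are common experiments for that situation.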