# BT-G3 Group Project


## **Group members**
- Daniel Barzilai
- Larissa Carvalho
- Maria Luisa Maia
- Pedro Rezende
- Rafael Moritz
- Vitor Oliveira

<center><img src="https://www.inteli.edu.br/wp-content/uploads/2021/08/20172028/marca_1-2.png" width="50%" height="50%"/></center>

<h1 align='center'><b>AI for Marketing: Campaign monitoring using natural language processing (NLP)</b></h1>

<center><img src="https://upload.wikimedia.org/wikipedia/commons/c/c2/Btg-logo-blue.svg" width="50%" height="50%"/></center>

<h2 align='center'>Banco BTG Pactual's Marketing team faces the challenge of understanding customers' needs and demands on social media quickly and easily. The proposed solution is an artificial intelligence based on natural language processing (NLP) that monitors marketing campaigns on Instagram. Its main goal is to track the data in real time and to analyze and interpret the messages and comments that customers post on the social network, in order to identify their needs and demands accurately and efficiently.</h2>

---

# About the data

This project uses data collected and processed by BTG Pactual's Automation team, which made the dataset available. Based on the information in this dataset, we derive insights about the comments made on the bank's own Instagram posts. Note that the data is anonymized and protected to preserve privacy and the ethical treatment of both the users and the bank.

# 1. Installation / Setup

We started developing the project in Google Colab, which is why there is a cell that connects to Google Drive to access the data. To run it in Jupyter Notebook instead, you need a downloaded copy of the dataset.

In [121]:
# Connect the environment to Google Drive

from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
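
The choice between Colab and a local Jupyter Notebook can be made explicit with a small guard. The following is a minimal sketch and not part of the original notebook: the helper `running_in_colab` and the local file name `base_tratada.csv` in the working directory are assumptions for illustration, while the Drive path is the one used by the `read_csv` cell of this notebook.

```python
import importlib.util

def running_in_colab() -> bool:
    """Return True when the google.colab package can be found (i.e. inside Colab)."""
    try:
        return importlib.util.find_spec("google.colab") is not None
    except (ImportError, ValueError):
        # Raised when the parent `google` package is missing or malformed
        return False

# Hypothetical local location: assumes the CSV was downloaded next to the notebook
LOCAL_PATH = "base_tratada.csv"
DRIVE_PATH = "/content/drive/MyDrive/Módulo 6/projeto/base_tratada.csv"

data_path = DRIVE_PATH if running_in_colab() else LOCAL_PATH
print(data_path)
```

With a guard like this, the `drive.mount` call would only be needed when `running_in_colab()` returns `True`.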


Here we install and import the libraries used for data handling, preprocessing, and Bag of Words modeling.

## pip installs

In [122]:
!pip install -U spacy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [123]:
!pip install wordcloud

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [124]:
!pip install keras

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [125]:
!pip install tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [126]:
!pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Imports

In [127]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud
import re

from sklearn.feature_extraction.text import CountVectorizer
import ast
from keras.preprocessing.text import Tokenizer
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

import nltk
import spacy
import gensim
from scipy.spatial.distance import cosine
from gensim.models import KeyedVectors

# 2. Data Understanding and Processing

Loading the dataset to inspect its contents:

In [128]:
df = pd.read_csv('/content/drive/MyDrive/Módulo 6/projeto/base_tratada.csv')
df

Unnamed: 0.1,Unnamed: 0,autor,sentimento,texto_tratado
0,0,winthegame_of,1,"['alvarez', 'marsal', 'estar', 'conosco', 'spo..."
1,1,marta_bego,1,"['btgpactual', 'with', 'makerepost', 'entender..."
2,2,lmviapiana,2,"['minuto', 'touro', 'ouro']"
3,3,vanilson_dos,1,['sim']
4,4,ricktolledo,2,"['querer', 'saber', 'btg', 'banking', 'próprio..."
...,...,...,...,...
9472,9472,perspectiveinvestimentos,2,"['excelente', 'explicação']"
9473,9473,eduardocolares,2,"['atendar', 'telefone', 'amor', 'deus']"
9474,9474,danielucm,2,"['saber', 'qual', '10', 'grande', 'fiis', 'mer..."
9475,9475,amgcapitalinvest,1,"['erro', 'financeiro', 'eliminar', 'antes', '3..."


In [129]:
df.columns

Index(['Unnamed: 0', 'autor', 'sentimento', 'texto_tratado'], dtype='object')

In [130]:
df = df.drop(['Unnamed: 0'], axis=1)
df

Unnamed: 0,autor,sentimento,texto_tratado
0,winthegame_of,1,"['alvarez', 'marsal', 'estar', 'conosco', 'spo..."
1,marta_bego,1,"['btgpactual', 'with', 'makerepost', 'entender..."
2,lmviapiana,2,"['minuto', 'touro', 'ouro']"
3,vanilson_dos,1,['sim']
4,ricktolledo,2,"['querer', 'saber', 'btg', 'banking', 'próprio..."
...,...,...,...
9472,perspectiveinvestimentos,2,"['excelente', 'explicação']"
9473,eduardocolares,2,"['atendar', 'telefone', 'amor', 'deus']"
9474,danielucm,2,"['saber', 'qual', '10', 'grande', 'fiis', 'mer..."
9475,amgcapitalinvest,1,"['erro', 'financeiro', 'eliminar', 'antes', '3..."


In [131]:
df['texto_tratado']

0       ['alvarez', 'marsal', 'estar', 'conosco', 'spo...
1       ['btgpactual', 'with', 'makerepost', 'entender...
2                             ['minuto', 'touro', 'ouro']
3                                                 ['sim']
4       ['querer', 'saber', 'btg', 'banking', 'próprio...
                              ...                        
9472                          ['excelente', 'explicação']
9473              ['atendar', 'telefone', 'amor', 'deus']
9474    ['saber', 'qual', '10', 'grande', 'fiis', 'mer...
9475    ['erro', 'financeiro', 'eliminar', 'antes', '3...
9476    ['porque', 'morning', 'call', 'aparecer', 'spo...
Name: texto_tratado, Length: 9477, dtype: object

In [132]:
# Remove the single quotes from the stringified token lists in 'texto_tratado'
# (each value is still one string such as "[minuto, touro, ouro]", not a Python list)
df['texto_tratado'] = df['texto_tratado'].str.replace("'", "")
df['texto_tratado']

0       [alvarez, marsal, estar, conosco, sportainmet,...
1       [btgpactual, with, makerepost, entender, impac...
2                                   [minuto, touro, ouro]
3                                                   [sim]
4       [querer, saber, btg, banking, próprio, btg, ad...
                              ...                        
9472                              [excelente, explicação]
9473                      [atendar, telefone, amor, deus]
9474    [saber, qual, 10, grande, fiis, mercado, selec...
9475    [erro, financeiro, eliminar, antes, 30, ano, 1...
9476    [porque, morning, call, aparecer, spotify, atu...
Name: texto_tratado, Length: 9477, dtype: object
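
After this replacement each value is still a single string, so iterating over it yields characters rather than words. If real Python lists are needed, one option is to parse the original stringified lists (before the quotes are removed) with `ast.literal_eval`; the `ast` module is already imported above. A minimal sketch on two sample values copied from the output above (`raw_values`, `parsed` and `stripped` are illustrative names):

```python
import ast

# Two values as they appear in the CSV column before the quotes are removed
raw_values = ["['minuto', 'touro', 'ouro']", "['sim']"]

# literal_eval safely evaluates a Python literal and returns the real list
parsed = [ast.literal_eval(value) for value in raw_values]

print(parsed[0])        # ['minuto', 'touro', 'ouro']
print(len(parsed[0]))   # 3

# For comparison: the quote-stripped string is iterated character by character
stripped = raw_values[0].replace("'", "")
print(list(stripped)[:3])  # ['[', 'm', 'i']
```

On the full DataFrame this could be applied with `df['texto_tratado'].apply(ast.literal_eval)` instead of the `str.replace` call. The later cells would then have to handle lists and be re-run, and their outputs may differ from the ones shown in this notebook.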

# 3. Bag of Words (BoW) - **TO FIX**

The Bag of Words (BoW) model is a technique used in natural language processing to represent a text as an unordered collection of words, ignoring the word order and the grammatical structure of the sentences.

In this model, each unique word in the text becomes a feature, and the number of times the word occurs in the text is the numeric value of the corresponding feature.

For example, the sentence "O gato preto pulou o muro" ("The black cat jumped over the wall") would be represented as the unordered collection of words `'o', 'gato', 'preto', 'pulou', 'o', 'muro'`. The occurrences of each word are counted, and the result is a numeric vector holding the frequency of each word in the sentence.
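
The counting step for this example sentence can be sketched with the standard library alone. This is a deliberately naive illustration (lowercasing plus whitespace splitting), not the tokenization that `CountVectorizer` applies; the variable names are illustrative.

```python
from collections import Counter

sentence = "O gato preto pulou o muro"

# Lowercase and split on whitespace (a deliberately naive tokenizer)
tokens = sentence.lower().split()

# Count how many times each word occurs
counts = Counter(tokens)

# Fix a vocabulary order (alphabetical) and read the counts off as a vector
vocabulary = sorted(counts)
vector = [counts[word] for word in vocabulary]

print(vocabulary)  # ['gato', 'muro', 'o', 'preto', 'pulou']
print(vector)      # [1, 1, 2, 1, 1]
```

The word "o" occurs twice, so its position in the vector holds 2 while every other position holds 1.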


## 3.1 Isolated test

In [19]:
# Multiple documents
text = ["Estamos fazendo um projeto pro BTG!", "Somos alunos de Sistemas de Informação do Inteli", "O Renato é o nosso orientador", "O Hayashi é o nosso professor de programação"] 

# create the transform
vectorizer = CountVectorizer()

# tokenize and build vocab
vectorizer.fit(text)

# summarize
print(sorted(vectorizer.vocabulary_))

# encode document
vector = vectorizer.transform(text)

# summarize encoded vector
print(vector.shape)
print(vector.toarray())

['alunos', 'btg', 'de', 'do', 'estamos', 'fazendo', 'hayashi', 'informação', 'inteli', 'nosso', 'orientador', 'pro', 'professor', 'programação', 'projeto', 'renato', 'sistemas', 'somos', 'um']
(4, 19)
[[0 1 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1]
 [1 0 2 1 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0]
 [0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0]]


**NOTE**:
The Bag of Words model is a simple and efficient technique for representing texts as vectors, which makes it possible to use them in machine learning algorithms.

However, this approach ignores important information about the structure and meaning of sentences, such as word order and the syntactic relations between words. For this reason, more advanced techniques, such as topic modeling and neural networks, are commonly used to handle more complex texts.

For academic purposes, however, we implement it here to learn how the process works.
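
The loss of word order can be seen on a tiny example: two sentences with opposite meanings ("the cat bit the dog" and "the dog bit the cat") end up with the same bag of words. A minimal sketch, where `bag_of_words` is a hypothetical helper written for this illustration:

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Count lowercase, whitespace-separated tokens, discarding their order."""
    return Counter(sentence.lower().split())

a = bag_of_words("o gato mordeu o cachorro")
b = bag_of_words("o cachorro mordeu o gato")

print(a == b)  # True
```

A classifier that only sees these counts cannot tell the two sentences apart.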

## 3.2 Function definition


In [133]:
df

Unnamed: 0,autor,sentimento,texto_tratado
0,winthegame_of,1,"[alvarez, marsal, estar, conosco, sportainmet,..."
1,marta_bego,1,"[btgpactual, with, makerepost, entender, impac..."
2,lmviapiana,2,"[minuto, touro, ouro]"
3,vanilson_dos,1,[sim]
4,ricktolledo,2,"[querer, saber, btg, banking, próprio, btg, ad..."
...,...,...,...
9472,perspectiveinvestimentos,2,"[excelente, explicação]"
9473,eduardocolares,2,"[atendar, telefone, amor, deus]"
9474,danielucm,2,"[saber, qual, 10, grande, fiis, mercado, selec..."
9475,amgcapitalinvest,1,"[erro, financeiro, eliminar, antes, 30, ano, 1..."


In [134]:
df['texto_tratado']

0       [alvarez, marsal, estar, conosco, sportainmet,...
1       [btgpactual, with, makerepost, entender, impac...
2                                   [minuto, touro, ouro]
3                                                   [sim]
4       [querer, saber, btg, banking, próprio, btg, ad...
                              ...                        
9472                              [excelente, explicação]
9473                      [atendar, telefone, amor, deus]
9474    [saber, qual, 10, grande, fiis, mercado, selec...
9475    [erro, financeiro, eliminar, antes, 30, ano, 1...
9476    [porque, morning, call, aparecer, spotify, atu...
Name: texto_tratado, Length: 9477, dtype: object

In [145]:
def bow(frases):
    # Initialize the CountVectorizer
    vectorizer = CountVectorizer()

    # Make sure each comment is a single string: token lists are joined with
    # spaces, while values that are already strings are kept as they are
    frases_concatenadas = [tokens if isinstance(tokens, str) else ' '.join(tokens) for tokens in frases]

    # Build the Bag of Words model
    bow_model = vectorizer.fit_transform(frases_concatenadas)

    # Vocabulary: maps each word to its column index
    dicionario = vectorizer.vocabulary_

    # DataFrame with one row per comment and one column per word
    bow_df = pd.DataFrame(bow_model.toarray(), columns=vectorizer.get_feature_names_out())

    return bow_model, dicionario, bow_df

# Apply the Bag of Words function
bow_model, dicionario, bow_df = bow(df['texto_tratado'].tolist())

In [146]:
# Print the vocabulary
print("Vocabulary:")
print(dicionario, "\n")

# Print the Bag of Words matrix
print("Bag of Words representation:")
print(bow_model.toarray())

Vocabulary:

Bag of Words representation:
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


---

In [148]:
bow_df

Unnamed: 0,00,000,0000,0001,001,002,00244,004,00georgeleandro00,01,...,𝚜𝚎𝚛,𝚜𝚎𝚞𝚜,𝚜𝚞𝚊,𝚝𝚎,𝚝𝚞𝚍𝚘,𝚞𝚖𝚊,𝚟𝚊,𝚟𝚊𝚒,𝚟𝚊𝚕𝚘𝚛𝚎𝚜,𝚟𝚘𝚕𝚊𝚝𝚒𝚕𝚒𝚍𝚊𝚍𝚎
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9472,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9473,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9474,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9475,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
# # Vectorization function: builds a DataFrame with every term and its count in each comment (identified by the index)
# def bow(comentarios):
#     # Initialize the CountVectorizer
#     vectorizer = CountVectorizer(analyzer=lambda x: x)
#     # Build the Bag of Words model
#     bow_model = vectorizer.fit_transform(comentarios)
#     # Vocabulary
#     dicionario = vectorizer.vocabulary_
#     # Build a dataframe with the words and their frequencies
#     bow_df = pd.DataFrame(bow_model.toarray(), columns=vectorizer.get_feature_names_out())
#     return bow_df, dicionario

# bow_model, dicionario = bow(df['texto_tratado'])

In [25]:
# bow_model, dicionario

In [26]:
# # Print the vocabulary
# print("Vocabulary:")
# print(dicionario, "\n")

---

# 10. Word2Vec with CBOW

## Structure

In [34]:
df

Unnamed: 0,autor,sentimento,texto_tratado
0,winthegame_of,1,"[alvarez, marsal, estar, conosco, sportainmet,..."
1,marta_bego,1,"[btgpactual, with, makerepost, entender, impac..."
2,lmviapiana,2,"[minuto, touro, ouro]"
3,vanilson_dos,1,[sim]
4,ricktolledo,2,"[querer, saber, btg, banking, próprio, btg, ad..."
...,...,...,...
9472,perspectiveinvestimentos,2,"[excelente, explicação]"
9473,eduardocolares,2,"[atendar, telefone, amor, deus]"
9474,danielucm,2,"[saber, qual, 10, grande, fiis, mercado, selec..."
9475,amgcapitalinvest,1,"[erro, financeiro, eliminar, antes, 30, ano, 1..."


In [37]:
cbow = '/content/drive/MyDrive/Módulo 6/Prog/cbow_s50.txt'

In [38]:
model_cbow = KeyedVectors.load_word2vec_format(cbow)

## Isolated test

In [39]:
# Testing word2vec
wordvec_test = model_cbow['projeto']

wordvec_test

array([-0.074174, -0.152088,  0.086627, -0.224567,  0.362562,  0.130683,
       -0.089179, -0.086973,  0.309501,  0.004112, -0.308202,  0.351789,
       -0.477863,  0.050276,  0.213283,  0.159895, -0.285545, -0.08832 ,
       -0.015449,  0.014816, -0.613861,  0.502556,  0.021688,  0.369492,
        0.280691,  0.016868,  0.105584, -0.180754, -0.078456,  0.148032,
        0.36293 , -0.011634,  0.412191, -0.009049,  0.010404,  0.131242,
       -0.032483, -0.133067, -0.063802,  0.434015, -0.214768, -0.072132,
        0.045601, -0.368866,  0.502808,  0.048293, -0.254894,  0.142581,
       -0.075066,  0.015646], dtype=float32)
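
Word vectors like the one above are usually compared with cosine similarity. The notebook imports `cosine` from `scipy.spatial.distance`, which returns the cosine distance, that is, one minus the similarity. Below is a minimal pure-Python sketch on made-up 3-dimensional vectors, not on real `cbow_s50` embeddings; `cosine_similarity` is a hypothetical helper written for this illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, made up for illustration
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]   # same direction as v1
v3 = [-3.0, 0.0, 1.0]  # orthogonal to v1

print(round(cosine_similarity(v1, v2), 6))  # 1.0
print(round(cosine_similarity(v1, v3), 6))  # 0.0
```

Vectors pointing in the same direction score 1, and orthogonal vectors score 0, regardless of their lengths.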

## Function definition

In [40]:
def create_sentence_vector(model, df):
    sentence_table = []
    for sentence in df['texto_tratado']:
        # Note: if `sentence` is a string instead of a list of tokens,
        # this loop iterates over its characters, not its words
        word_vectors = [model[word] for word in sentence if word in model]
        if len(word_vectors) > 0:
            # Sentence vector = mean of the vectors found in the model
            sentence_vector = sum(word_vectors) / len(word_vectors)
        else:
            # Placeholder row of None values; removed by dropna() below
            sentence_vector = [None] * 50
        sentence_table.append((sentence, *sentence_vector[:50]))  # Keep only the first 50 components of the vector

    column_labels = ['Frase']
    for i in range(50):
        column_labels.append(f'Vetor{i+1}')
    df_vec = pd.DataFrame(sentence_table, columns=column_labels)

    # Map text labels to numbers; values that are already numeric are left unchanged
    df["sentimentoNumerico"] = df["sentimento"].replace({'NEGATIVE': -1, 'POSITIVE': 1, 'NEUTRAL': 0})

    # Give df_vec the same index as df["sentimentoNumerico"]
    df_vec.set_index(df["sentimentoNumerico"].index, inplace=True)

    df_vec['sentimento'] = df["sentimentoNumerico"]
    df_vec = df_vec.dropna()

    return df_vec
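
The averaging step relies on each sentence being a list of tokens. When the sentence is a single string, as produced by the `str.replace` step earlier in this notebook, the loop visits individual characters. A minimal sketch of the difference, using a plain dictionary as a stand-in for the embedding model (the 2-dimensional vectors and the helper `mean_vector` are made up for illustration):

```python
def mean_vector(model, tokens):
    """Average the vectors of the tokens found in `model`; None if none are found."""
    found = [model[token] for token in tokens if token in model]
    if not found:
        return None
    dim = len(found[0])
    return [sum(vec[i] for vec in found) / len(found) for i in range(dim)]

# Toy "model": two words and one single letter, all with made-up vectors
toy_model = {
    "touro": [1.0, 0.0],
    "ouro": [0.0, 1.0],
    "o": [5.0, 5.0],
}

as_list = ["touro", "ouro"]
as_string = "[touro, ouro]"

print(mean_vector(toy_model, as_list))    # [0.5, 0.5]
print(mean_vector(toy_model, as_string))  # [5.0, 5.0]
```

With the list, the two word vectors are averaged. With the string, only the letter "o" is found in the toy model, so the result reflects character embeddings instead of word embeddings.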

## Function test

In [41]:
df_vec = create_sentence_vector(model_cbow, df)
df_vec

Unnamed: 0,Frase,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,Vetor6,Vetor7,Vetor8,Vetor9,...,Vetor42,Vetor43,Vetor44,Vetor45,Vetor46,Vetor47,Vetor48,Vetor49,Vetor50,sentimento
0,"[alvarez, marsal, estar, conosco, sportainmet,...",0.216308,-0.126807,0.240276,-0.070348,-0.016067,0.209095,0.073239,0.057978,0.061847,...,0.021205,-0.104079,0.155380,0.091749,-0.043675,0.158847,-0.037001,0.020596,0.184101,1
1,"[btgpactual, with, makerepost, entender, impac...",0.225879,-0.123276,0.215921,-0.052553,-0.007740,0.207449,0.073616,0.037889,0.062654,...,0.003267,-0.071278,0.156911,0.084502,-0.004404,0.159728,-0.029227,0.029953,0.190338,1
2,"[minuto, touro, ouro]",0.265227,-0.068285,0.152235,-0.044329,-0.102729,0.141353,0.092800,0.113174,0.015783,...,0.078032,-0.202677,0.155750,0.062291,0.007038,0.134573,0.014635,0.034189,0.345674,2
3,[sim],0.166258,-0.029796,0.204045,-0.297490,0.046077,0.140763,0.035251,-0.174491,0.211817,...,0.065839,-0.092451,0.308218,-0.034692,-0.032851,-0.028724,-0.068701,0.011158,0.258413,1
4,"[querer, saber, btg, banking, próprio, btg, ad...",0.192599,-0.164825,0.279021,-0.053466,-0.041365,0.211765,0.074588,0.092593,0.094428,...,0.088143,-0.130285,0.164846,0.077747,-0.068860,0.181872,-0.076526,-0.003717,0.197040,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9472,"[excelente, explicação]",0.190917,-0.133475,0.241675,-0.053180,0.067256,0.201138,0.034109,-0.078718,-0.066131,...,-0.082151,0.016113,0.154861,0.068700,-0.004302,0.079717,-0.028388,-0.017448,0.188785,2
9473,"[atendar, telefone, amor, deus]",0.188641,-0.119377,0.199339,-0.105448,0.023176,0.178837,0.069476,-0.004494,0.034710,...,0.034035,-0.126673,0.165176,0.080313,-0.024160,0.118848,-0.003502,0.087053,0.215656,2
9474,"[saber, qual, 10, grande, fiis, mercado, selec...",0.213684,-0.135475,0.219169,-0.072920,-0.014112,0.203863,0.063960,0.038996,0.069897,...,0.034963,-0.097084,0.175341,0.088866,-0.047218,0.153991,-0.030393,0.021887,0.179608,2
9475,"[erro, financeiro, eliminar, antes, 30, ano, 1...",0.212646,-0.112742,0.221387,-0.078302,-0.032960,0.218673,0.071489,0.038917,0.037444,...,0.022441,-0.094323,0.150688,0.081716,-0.028744,0.145769,-0.029518,0.024455,0.191446,1


In [42]:
#df_vec.to_csv('Word2Vec_Cbow_modelo_treinado',encoding='utf-8', index=False, header=True)

# 11. Naive Bayes + Word2Vec with CBOW

In [43]:
label = preprocessing.LabelEncoder()

In [44]:
label.fit(df_vec['sentimento'])
df_vec['sentimento'] = label.transform(df_vec['sentimento'])

In [45]:
df_vec = df_vec.dropna()
df_vec

Unnamed: 0,Frase,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,Vetor6,Vetor7,Vetor8,Vetor9,...,Vetor42,Vetor43,Vetor44,Vetor45,Vetor46,Vetor47,Vetor48,Vetor49,Vetor50,sentimento
0,"[alvarez, marsal, estar, conosco, sportainmet,...",0.216308,-0.126807,0.240276,-0.070348,-0.016067,0.209095,0.073239,0.057978,0.061847,...,0.021205,-0.104079,0.155380,0.091749,-0.043675,0.158847,-0.037001,0.020596,0.184101,1
1,"[btgpactual, with, makerepost, entender, impac...",0.225879,-0.123276,0.215921,-0.052553,-0.007740,0.207449,0.073616,0.037889,0.062654,...,0.003267,-0.071278,0.156911,0.084502,-0.004404,0.159728,-0.029227,0.029953,0.190338,1
2,"[minuto, touro, ouro]",0.265227,-0.068285,0.152235,-0.044329,-0.102729,0.141353,0.092800,0.113174,0.015783,...,0.078032,-0.202677,0.155750,0.062291,0.007038,0.134573,0.014635,0.034189,0.345674,2
3,[sim],0.166258,-0.029796,0.204045,-0.297490,0.046077,0.140763,0.035251,-0.174491,0.211817,...,0.065839,-0.092451,0.308218,-0.034692,-0.032851,-0.028724,-0.068701,0.011158,0.258413,1
4,"[querer, saber, btg, banking, próprio, btg, ad...",0.192599,-0.164825,0.279021,-0.053466,-0.041365,0.211765,0.074588,0.092593,0.094428,...,0.088143,-0.130285,0.164846,0.077747,-0.068860,0.181872,-0.076526,-0.003717,0.197040,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9472,"[excelente, explicação]",0.190917,-0.133475,0.241675,-0.053180,0.067256,0.201138,0.034109,-0.078718,-0.066131,...,-0.082151,0.016113,0.154861,0.068700,-0.004302,0.079717,-0.028388,-0.017448,0.188785,2
9473,"[atendar, telefone, amor, deus]",0.188641,-0.119377,0.199339,-0.105448,0.023176,0.178837,0.069476,-0.004494,0.034710,...,0.034035,-0.126673,0.165176,0.080313,-0.024160,0.118848,-0.003502,0.087053,0.215656,2
9474,"[saber, qual, 10, grande, fiis, mercado, selec...",0.213684,-0.135475,0.219169,-0.072920,-0.014112,0.203863,0.063960,0.038996,0.069897,...,0.034963,-0.097084,0.175341,0.088866,-0.047218,0.153991,-0.030393,0.021887,0.179608,2
9475,"[erro, financeiro, eliminar, antes, 30, ano, 1...",0.212646,-0.112742,0.221387,-0.078302,-0.032960,0.218673,0.071489,0.038917,0.037444,...,0.022441,-0.094323,0.150688,0.081716,-0.028744,0.145769,-0.029518,0.024455,0.191446,1


## Train/test split

In [56]:
target = df_vec['sentimento']

In [57]:
feature = df_vec.iloc[:,1:50]

In [58]:
feature

Unnamed: 0,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,Vetor6,Vetor7,Vetor8,Vetor9,Vetor10,...,Vetor40,Vetor41,Vetor42,Vetor43,Vetor44,Vetor45,Vetor46,Vetor47,Vetor48,Vetor49
0,0.216308,-0.126807,0.240276,-0.070348,-0.016067,0.209095,0.073239,0.057978,0.061847,0.172586,...,0.076437,-0.272009,0.021205,-0.104079,0.155380,0.091749,-0.043675,0.158847,-0.037001,0.020596
1,0.225879,-0.123276,0.215921,-0.052553,-0.007740,0.207449,0.073616,0.037889,0.062654,0.167859,...,0.092294,-0.304136,0.003267,-0.071278,0.156911,0.084502,-0.004404,0.159728,-0.029227,0.029953
2,0.265227,-0.068285,0.152235,-0.044329,-0.102729,0.141353,0.092800,0.113174,0.015783,0.202198,...,-0.008447,-0.193025,0.078032,-0.202677,0.155750,0.062291,0.007038,0.134573,0.014635,0.034189
3,0.166258,-0.029796,0.204045,-0.297490,0.046077,0.140763,0.035251,-0.174491,0.211817,0.288314,...,0.183434,-0.415105,0.065839,-0.092451,0.308218,-0.034692,-0.032851,-0.028724,-0.068701,0.011158
4,0.192599,-0.164825,0.279021,-0.053466,-0.041365,0.211765,0.074588,0.092593,0.094428,0.177284,...,0.048799,-0.253769,0.088143,-0.130285,0.164846,0.077747,-0.068860,0.181872,-0.076526,-0.003717
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9472,0.190917,-0.133475,0.241675,-0.053180,0.067256,0.201138,0.034109,-0.078718,-0.066131,0.187608,...,0.014565,-0.321192,-0.082151,0.016113,0.154861,0.068700,-0.004302,0.079717,-0.028388,-0.017448
9473,0.188641,-0.119377,0.199339,-0.105448,0.023176,0.178837,0.069476,-0.004494,0.034710,0.150081,...,0.071114,-0.194663,0.034035,-0.126673,0.165176,0.080313,-0.024160,0.118848,-0.003502,0.087053
9474,0.213684,-0.135475,0.219169,-0.072920,-0.014112,0.203863,0.063960,0.038996,0.069897,0.170876,...,0.081039,-0.298570,0.034963,-0.097084,0.175341,0.088866,-0.047218,0.153991,-0.030393,0.021887
9475,0.212646,-0.112742,0.221387,-0.078302,-0.032960,0.218673,0.071489,0.038917,0.037444,0.146022,...,0.065854,-0.252245,0.022441,-0.094323,0.150688,0.081716,-0.028744,0.145769,-0.029518,0.024455
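
Because the upper bound of a slice is exclusive, `iloc[:, 1:50]` stops at `Vetor49`, which matches the last column shown above, so `Vetor50` is not used as a feature. A small sketch with a plain list of the same column labels makes this visible (the list `columns` mirrors the layout built in `create_sentence_vector`):

```python
# Column layout of df_vec: 'Frase', 'Vetor1'..'Vetor50', 'sentimento'
columns = ['Frase'] + [f'Vetor{i+1}' for i in range(50)] + ['sentimento']

selected = columns[1:50]     # same bounds as iloc[:, 1:50]
all_vectors = columns[1:51]  # bounds that would cover every vector column

print(len(selected), selected[-1])        # 49 Vetor49
print(len(all_vectors), all_vectors[-1])  # 50 Vetor50
```

Using `iloc[:, 1:51]` would include all 50 components; the metrics reported below were obtained with the 49-column selection.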


In [59]:
X_train, X_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=42)

## Model evaluation

In [60]:
clf = GaussianNB()

clf = clf.fit(X_train,y_train.values.ravel())

Y_pred = clf.predict(X_test)

print(classification_report(y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.29      0.69      0.41       417
           1       0.74      0.45      0.56       855
           2       0.37      0.24      0.29       624

    accuracy                           0.43      1896
   macro avg       0.47      0.46      0.42      1896
weighted avg       0.52      0.43      0.44      1896



In [61]:
acc_score = accuracy_score(y_test, Y_pred)
format_output = "{:.2%}".format(acc_score)
print("Final accuracy:", format_output)

Final accuracy: 43.35%
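
The value printed above comes from `accuracy_score`, so it is the accuracy: the share of correctly classified test comments. It can be approximately recovered from the classification report, since accuracy equals the support-weighted average of the per-class recalls. A worked check using the rounded recall and support values from the report (so the result only matches up to rounding):

```python
# (recall, support) per class, copied from the classification report above
report = {
    0: (0.69, 417),
    1: (0.45, 855),
    2: (0.24, 624),
}

total = sum(support for _, support in report.values())
# recall * support approximates the number of correct predictions in each class
correct = sum(recall * support for recall, support in report.values())

accuracy = correct / total
print(total)              # 1896
print(f"{accuracy:.2%}")  # 43.37%
```

The small gap to 43.35% comes from the two-decimal rounding of the recalls in the report.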


# 12. Word2Vec with embedding layer

## Function definition

In [86]:
from gensim.models import Word2Vec

# Train a Word2Vec model on the dataframe's corpus
def train_word2vec(df, column_name):
    # Get the sentences (Word2Vec expects each one to be a list of tokens)
    sentences = df[column_name].tolist()
    
    # Train the Word2Vec model
    model = Word2Vec(sentences, min_count=1)
    
    return model

In [91]:
# Compute the vector of a sentence from the vectors of its words
def get_word_vectors(model, sentence):
    vectors = []
    for word in sentence:
        if word in model.wv:
            vectors.append(model.wv[word]) # Collect the vector of each known word
    if vectors:
        return np.sum(vectors, axis=0)/len(sentence) # Sum of the word vectors divided by the sentence length
    else:
        return np.zeros(model.vector_size)

# Build the dataframe with one vector per sentence
def create_word2vec_dataframe(df, column_name, model):
    sentences = df[column_name].tolist()
    vectors = [get_word_vectors(model, sentence) for sentence in sentences] # One vector per sentence
    # Build the dataframe
    df_vectors = pd.DataFrame(vectors, columns=[f"Vetor{i}" for i in range(model.vector_size)])
    df_word2vec = pd.concat([df, df_vectors], axis=1)
    return df_word2vec

## Function test

In [88]:
model = train_word2vec(df, 'texto_tratado')



In [106]:
df_word2vec = create_word2vec_dataframe(df,'texto_tratado', model)
df_word2vec

Unnamed: 0,autor,sentimento,texto_tratado,sentimentoNumerico,Vetor0,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,...,Vetor90,Vetor91,Vetor92,Vetor93,Vetor94,Vetor95,Vetor96,Vetor97,Vetor98,Vetor99
0,winthegame_of,1,"[alvarez, marsal, estar, conosco, sportainmet,...",1,0.428724,0.432991,-0.069373,-0.390625,0.167993,0.150017,...,0.158695,-0.021530,0.064255,0.200677,-0.343568,-0.150278,0.230582,0.138340,0.116164,0.015622
1,marta_bego,1,"[btgpactual, with, makerepost, entender, impac...",1,0.403599,0.396930,-0.090693,-0.378203,0.168807,0.165952,...,0.149624,-0.021719,0.076423,0.185731,-0.343840,-0.103155,0.225854,0.109737,0.125461,0.038858
2,lmviapiana,2,"[minuto, touro, ouro]",2,0.415616,0.379824,-0.028566,-0.463731,0.139106,0.132234,...,0.159725,0.066124,0.105851,0.169309,-0.307872,0.018729,0.325074,0.177072,0.128467,-0.057827
3,vanilson_dos,1,[sim],1,0.043417,0.729222,-0.035962,-0.106024,0.147785,0.231467,...,-0.018889,-0.232910,0.265774,0.039684,-0.324337,0.001886,0.156238,-0.021586,0.092202,0.265736
4,ricktolledo,2,"[querer, saber, btg, banking, próprio, btg, ad...",2,0.330063,0.426375,-0.119240,-0.384473,0.195382,0.153858,...,0.185977,-0.036598,0.019594,0.171955,-0.361425,-0.090451,0.166682,0.058767,0.070553,0.079802
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9472,perspectiveinvestimentos,2,"[excelente, explicação]",2,0.381040,0.407306,-0.065901,-0.504870,0.080631,0.252352,...,0.134118,0.041042,0.041759,0.281209,-0.357967,-0.192821,0.302168,0.096228,0.154049,0.025645
9473,eduardocolares,2,"[atendar, telefone, amor, deus]",2,0.395917,0.530222,-0.039731,-0.400390,0.122958,0.180124,...,0.110479,0.000207,0.061072,0.156445,-0.315394,-0.072328,0.232020,0.139051,0.095105,0.036066
9474,danielucm,2,"[saber, qual, 10, grande, fiis, mercado, selec...",2,0.428404,0.398813,-0.022072,-0.430962,0.147732,0.159678,...,0.149765,-0.004145,0.061801,0.212190,-0.354424,-0.174809,0.281553,0.152908,0.108495,0.006989
9475,amgcapitalinvest,1,"[erro, financeiro, eliminar, antes, 30, ano, 1...",1,0.472402,0.341054,-0.040685,-0.382366,0.148118,0.116719,...,0.251312,0.017855,0.042332,0.199385,-0.370954,-0.160366,0.176536,0.147924,0.046004,-0.067451


In [108]:
df_word2vec = df_word2vec.drop(columns=['autor', 'sentimento'])
df_word2vec

Unnamed: 0,texto_tratado,sentimentoNumerico,Vetor0,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,Vetor6,Vetor7,...,Vetor90,Vetor91,Vetor92,Vetor93,Vetor94,Vetor95,Vetor96,Vetor97,Vetor98,Vetor99
0,"[alvarez, marsal, estar, conosco, sportainmet,...",1,0.428724,0.432991,-0.069373,-0.390625,0.167993,0.150017,-0.125445,-0.307210,...,0.158695,-0.021530,0.064255,0.200677,-0.343568,-0.150278,0.230582,0.138340,0.116164,0.015622
1,"[btgpactual, with, makerepost, entender, impac...",1,0.403599,0.396930,-0.090693,-0.378203,0.168807,0.165952,-0.137153,-0.299038,...,0.149624,-0.021719,0.076423,0.185731,-0.343840,-0.103155,0.225854,0.109737,0.125461,0.038858
2,"[minuto, touro, ouro]",2,0.415616,0.379824,-0.028566,-0.463731,0.139106,0.132234,-0.182196,-0.331240,...,0.159725,0.066124,0.105851,0.169309,-0.307872,0.018729,0.325074,0.177072,0.128467,-0.057827
3,[sim],1,0.043417,0.729222,-0.035962,-0.106024,0.147785,0.231467,-0.147459,-0.076083,...,-0.018889,-0.232910,0.265774,0.039684,-0.324337,0.001886,0.156238,-0.021586,0.092202,0.265736
4,"[querer, saber, btg, banking, próprio, btg, ad...",2,0.330063,0.426375,-0.119240,-0.384473,0.195382,0.153858,-0.135111,-0.289961,...,0.185977,-0.036598,0.019594,0.171955,-0.361425,-0.090451,0.166682,0.058767,0.070553,0.079802
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9472,"[excelente, explicação]",2,0.381040,0.407306,-0.065901,-0.504870,0.080631,0.252352,-0.074826,-0.285107,...,0.134118,0.041042,0.041759,0.281209,-0.357967,-0.192821,0.302168,0.096228,0.154049,0.025645
9473,"[atendar, telefone, amor, deus]",2,0.395917,0.530222,-0.039731,-0.400390,0.122958,0.180124,-0.127471,-0.352738,...,0.110479,0.000207,0.061072,0.156445,-0.315394,-0.072328,0.232020,0.139051,0.095105,0.036066
9474,"[saber, qual, 10, grande, fiis, mercado, selec...",2,0.428404,0.398813,-0.022072,-0.430962,0.147732,0.159678,-0.106857,-0.310669,...,0.149765,-0.004145,0.061801,0.212190,-0.354424,-0.174809,0.281553,0.152908,0.108495,0.006989
9475,"[erro, financeiro, eliminar, antes, 30, ano, 1...",1,0.472402,0.341054,-0.040685,-0.382366,0.148118,0.116719,-0.172834,-0.373485,...,0.251312,0.017855,0.042332,0.199385,-0.370954,-0.160366,0.176536,0.147924,0.046004,-0.067451


# 13. Naive Bayes + Word2Vec with embedding layer

In [109]:
df_word2vec = df_word2vec.dropna()
df_word2vec

Unnamed: 0,texto_tratado,sentimentoNumerico,Vetor0,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,Vetor6,Vetor7,...,Vetor90,Vetor91,Vetor92,Vetor93,Vetor94,Vetor95,Vetor96,Vetor97,Vetor98,Vetor99
0,"[alvarez, marsal, estar, conosco, sportainmet,...",1,0.428724,0.432991,-0.069373,-0.390625,0.167993,0.150017,-0.125445,-0.307210,...,0.158695,-0.021530,0.064255,0.200677,-0.343568,-0.150278,0.230582,0.138340,0.116164,0.015622
1,"[btgpactual, with, makerepost, entender, impac...",1,0.403599,0.396930,-0.090693,-0.378203,0.168807,0.165952,-0.137153,-0.299038,...,0.149624,-0.021719,0.076423,0.185731,-0.343840,-0.103155,0.225854,0.109737,0.125461,0.038858
2,"[minuto, touro, ouro]",2,0.415616,0.379824,-0.028566,-0.463731,0.139106,0.132234,-0.182196,-0.331240,...,0.159725,0.066124,0.105851,0.169309,-0.307872,0.018729,0.325074,0.177072,0.128467,-0.057827
3,[sim],1,0.043417,0.729222,-0.035962,-0.106024,0.147785,0.231467,-0.147459,-0.076083,...,-0.018889,-0.232910,0.265774,0.039684,-0.324337,0.001886,0.156238,-0.021586,0.092202,0.265736
4,"[querer, saber, btg, banking, próprio, btg, ad...",2,0.330063,0.426375,-0.119240,-0.384473,0.195382,0.153858,-0.135111,-0.289961,...,0.185977,-0.036598,0.019594,0.171955,-0.361425,-0.090451,0.166682,0.058767,0.070553,0.079802
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9472,"[excelente, explicação]",2,0.381040,0.407306,-0.065901,-0.504870,0.080631,0.252352,-0.074826,-0.285107,...,0.134118,0.041042,0.041759,0.281209,-0.357967,-0.192821,0.302168,0.096228,0.154049,0.025645
9473,"[atendar, telefone, amor, deus]",2,0.395917,0.530222,-0.039731,-0.400390,0.122958,0.180124,-0.127471,-0.352738,...,0.110479,0.000207,0.061072,0.156445,-0.315394,-0.072328,0.232020,0.139051,0.095105,0.036066
9474,"[saber, qual, 10, grande, fiis, mercado, selec...",2,0.428404,0.398813,-0.022072,-0.430962,0.147732,0.159678,-0.106857,-0.310669,...,0.149765,-0.004145,0.061801,0.212190,-0.354424,-0.174809,0.281553,0.152908,0.108495,0.006989
9475,"[erro, financeiro, eliminar, antes, 30, ano, 1...",1,0.472402,0.341054,-0.040685,-0.382366,0.148118,0.116719,-0.172834,-0.373485,...,0.251312,0.017855,0.042332,0.199385,-0.370954,-0.160366,0.176536,0.147924,0.046004,-0.067451


## Train/test split

In [112]:
target = df_word2vec['sentimentoNumerico']

In [116]:
feature = df_word2vec.iloc[:,2:102]

In [117]:
feature

Unnamed: 0,Vetor0,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,Vetor6,Vetor7,Vetor8,Vetor9,...,Vetor90,Vetor91,Vetor92,Vetor93,Vetor94,Vetor95,Vetor96,Vetor97,Vetor98,Vetor99
0,0.428724,0.432991,-0.069373,-0.390625,0.167993,0.150017,-0.125445,-0.307210,-0.142480,0.231476,...,0.158695,-0.021530,0.064255,0.200677,-0.343568,-0.150278,0.230582,0.138340,0.116164,0.015622
1,0.403599,0.396930,-0.090693,-0.378203,0.168807,0.165952,-0.137153,-0.299038,-0.139476,0.258859,...,0.149624,-0.021719,0.076423,0.185731,-0.343840,-0.103155,0.225854,0.109737,0.125461,0.038858
2,0.415616,0.379824,-0.028566,-0.463731,0.139106,0.132234,-0.182196,-0.331240,-0.126577,0.281796,...,0.159725,0.066124,0.105851,0.169309,-0.307872,0.018729,0.325074,0.177072,0.128467,-0.057827
3,0.043417,0.729222,-0.035962,-0.106024,0.147785,0.231467,-0.147459,-0.076083,-0.280814,0.387266,...,-0.018889,-0.232910,0.265774,0.039684,-0.324337,0.001886,0.156238,-0.021586,0.092202,0.265736
4,0.330063,0.426375,-0.119240,-0.384473,0.195382,0.153858,-0.135111,-0.289961,-0.199326,0.239902,...,0.185977,-0.036598,0.019594,0.171955,-0.361425,-0.090451,0.166682,0.058767,0.070553,0.079802
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9472,0.381040,0.407306,-0.065901,-0.504870,0.080631,0.252352,-0.074826,-0.285107,-0.200422,0.303381,...,0.134118,0.041042,0.041759,0.281209,-0.357967,-0.192821,0.302168,0.096228,0.154049,0.025645
9473,0.395917,0.530222,-0.039731,-0.400390,0.122958,0.180124,-0.127471,-0.352738,-0.170759,0.313089,...,0.110479,0.000207,0.061072,0.156445,-0.315394,-0.072328,0.232020,0.139051,0.095105,0.036066
9474,0.428404,0.398813,-0.022072,-0.430962,0.147732,0.159678,-0.106857,-0.310669,-0.122632,0.232796,...,0.149765,-0.004145,0.061801,0.212190,-0.354424,-0.174809,0.281553,0.152908,0.108495,0.006989
9475,0.472402,0.341054,-0.040685,-0.382366,0.148118,0.116719,-0.172834,-0.373485,-0.039884,0.235572,...,0.251312,0.017855,0.042332,0.199385,-0.370954,-0.160366,0.176536,0.147924,0.046004,-0.067451


In [118]:
X_train, X_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=42)

## Model evaluation

In [119]:
clf = GaussianNB()

clf = clf.fit(X_train,y_train.values.ravel())

Y_pred = clf.predict(X_test)

print(classification_report(y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.30      0.77      0.43       417
           1       0.77      0.45      0.57       855
           2       0.34      0.17      0.22       624

    accuracy                           0.43      1896
   macro avg       0.47      0.46      0.41      1896
weighted avg       0.52      0.43      0.42      1896



In [120]:
acc_score = accuracy_score(y_test, Y_pred)
format_output = "{:.2%}".format(acc_score)
print("Final accuracy:", format_output)

Final accuracy: 42.77%
