# BT-G3 Group Project


## **Group members**
- Daniel Barzilai
- Larissa Carvalho
- Maria Luisa Maia
- Pedro Rezende
- Rafael Moritz
- Vitor Oliveira

<center><img src="https://www.inteli.edu.br/wp-content/uploads/2021/08/20172028/marca_1-2.png" width="50%" height="50%"/></center>

<h1 align='center'><b>AI for Marketing: Campaign monitoring using natural language processing (NLP)</b></h1>

<center><img src="https://upload.wikimedia.org/wikipedia/commons/c/c2/Btg-logo-blue.svg" width="50%" height="50%"/></center>

<h2 align='center'>Banco BTG Pactual's Marketing team faces the challenge of understanding customers' needs and demands on social media quickly and easily. The proposed solution is an Artificial Intelligence system based on natural language processing (NLP) that monitors marketing campaigns on Instagram. Its main goal is to track data in real time and to analyze and interpret the messages and comments posted by customers on the social network, in order to identify their needs and demands accurately and efficiently.</h2>

---

# About the data

This project uses data collected and processed by BTG Pactual's Automation team, which provided the dataset. Based on the information in this dataset, we derive insights about the comments made on the bank's own Instagram posts. Note that the data is anonymized and protected, to preserve the privacy of users and of the bank and to handle the data ethically.

# 1. Installation / Setup

We developed the project on Google Colab, so the notebook includes a cell that connects to Google Drive to access the data. To run it in Jupyter Notebook instead, download the dataset first.

In [1]:
# Connect the environment to Google Drive (uncomment when running on Google Colab)

# from google.colab import drive
# drive.mount('/content/drive')

Mounted at /content/drive
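
Because the notebook may run either on Colab or on a local Jupyter installation, one possible way to pick the dataset location is sketched below. The local path `data/nova_base_tratada` is a hypothetical placeholder and not something defined by the project; the Colab path is the one used later in this notebook.

```python
from pathlib import Path

# Detect Colab by trying to import its helper package.
try:
    from google.colab import drive  # only available inside Google Colab
    IN_COLAB = True
except ImportError:
    IN_COLAB = False

if IN_COLAB:
    drive.mount('/content/drive')
    DATA_PATH = Path('/content/drive/MyDrive/Módulo 6/projeto/nova_base_tratada')
else:
    # Hypothetical local location of the downloaded dataset (assumption).
    DATA_PATH = Path('data') / 'nova_base_tratada'

print(IN_COLAB, DATA_PATH)
```

`DATA_PATH` could then be passed to `pd.read_csv` in place of the hard-coded string.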


- The cells below install and import the libraries used for data preparation and Word2Vec modeling.

## pip installs

In [2]:
!pip install keras

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
!pip install tensorflow

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
!pip install scikit-learn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


## Imports

In [29]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.feature_extraction.text import CountVectorizer
from keras.preprocessing.text import Tokenizer
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.model_selection import train_test_split

import gensim
from scipy.spatial.distance import cosine
from gensim.models import KeyedVectors
from gensim.models import Word2Vec

# 2. Data Understanding and Preparation

Loading the dataset to inspect its contents:

In [37]:
df = pd.read_csv('/content/drive/MyDrive/Módulo 6/projeto/nova_base_tratada').drop(['Unnamed: 0'], axis=1)
df

Unnamed: 0,autor,sentimento,texto_tratado
0,winthegame_of,1,"['alvarez', 'marsal', 'estar', 'conosco', 'spo..."
1,marta_bego,1,"['btgpactual', 'with', 'makerepost', 'entender..."
2,lmviapiana,2,"['minuto', 'touro', 'ouro']"
3,vanilson_dos,1,['sim']
4,ricktolledo,2,"['querer', 'saber', 'banking', 'próprio', 'adm..."
...,...,...,...
9202,perspectiveinvestimentos,2,"['excelente', 'explicação']"
9203,eduardocolares,2,"['atendar', 'telefone', 'amor', 'deus']"
9204,danielucm,2,"['saber', 'qual', 'grande', 'fiis', 'mercado',..."
9205,amgcapitalinvest,1,"['erro', 'financeiro', 'eliminar', 'antes', 'a..."


In [38]:
df.columns

Index(['autor', 'sentimento', 'texto_tratado'], dtype='object')

In [39]:
# Remove the quote characters; each value is still a single string, not a Python list
df['texto_tratado'] = df['texto_tratado'].str.replace("'", "")
df['texto_tratado']

0       [alvarez, marsal, estar, conosco, sportainmet,...
1       [btgpactual, with, makerepost, entender, impac...
2                                   [minuto, touro, ouro]
3                                                   [sim]
4           [querer, saber, banking, próprio, administro]
                              ...                        
9202                              [excelente, explicação]
9203                      [atendar, telefone, amor, deus]
9204    [saber, qual, grande, fiis, mercado, selecione...
9205    [erro, financeiro, eliminar, antes, ano, _, pa...
9206    [porque, morning, call, aparecer, spotify, atu...
Name: texto_tratado, Length: 9207, dtype: object
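
A point worth checking here: `read_csv` loads `texto_tratado` as text, and removing the quotes leaves each value as one string such as `"[minuto, touro, ouro]"`. Iterating over such a value yields characters rather than tokens. The sketch below contrasts that with parsing the original value through `ast.literal_eval`; it is an illustration on a single value, not a cell from the original notebook.

```python
import ast

raw = "['minuto', 'touro', 'ouro']"   # value as stored in the CSV

stripped = raw.replace("'", "")       # still a single string
tokens = ast.literal_eval(raw)        # an actual list of tokens

print(list(stripped)[:3])   # ['[', 'm', 'i']
print(tokens)               # ['minuto', 'touro', 'ouro']
```

On the dataframe, the equivalent would be `df['texto_tratado'].apply(ast.literal_eval)` applied to the column before the quotes are stripped. As far as the cells shown here go, the vectors and scores reported later were produced from the string version, so they would change if the column were parsed this way.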

In [40]:
df

Unnamed: 0,autor,sentimento,texto_tratado
0,winthegame_of,1,"[alvarez, marsal, estar, conosco, sportainmet,..."
1,marta_bego,1,"[btgpactual, with, makerepost, entender, impac..."
2,lmviapiana,2,"[minuto, touro, ouro]"
3,vanilson_dos,1,[sim]
4,ricktolledo,2,"[querer, saber, banking, próprio, administro]"
...,...,...,...
9202,perspectiveinvestimentos,2,"[excelente, explicação]"
9203,eduardocolares,2,"[atendar, telefone, amor, deus]"
9204,danielucm,2,"[saber, qual, grande, fiis, mercado, selecione..."
9205,amgcapitalinvest,1,"[erro, financeiro, eliminar, antes, ano, _, pa..."


# 3. Word2Vec with CBOW

- In this section, the group used a pre-trained model, referenced below (the `cbow_s50.txt` file).

- CBOW (Continuous Bag-of-Words) is a Word2Vec architecture that learns distributed word representations by predicting a target word from its surrounding context words. It is widely used in natural language processing (NLP).
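
To make the CBOW idea concrete, the sketch below builds the (context, target) pairs that such a model would be trained on. The function name `cbow_pairs` and the `window` parameter are illustrative assumptions; this shows only how training examples are formed, not the neural network itself.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for a tokenized sentence."""
    pairs = []
    for i, target in enumerate(tokens):
        # Words up to `window` positions to the left and right of the target
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if context:
            pairs.append((context, target))
    return pairs

sentence = ["querer", "saber", "banking", "próprio"]
for context, target in cbow_pairs(sentence, window=1):
    print(context, "->", target)
```

With `window=1`, the word `saber` is predicted from `["querer", "banking"]`, and the words at the edges have a single context word.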

## Setup

In [41]:
df

Unnamed: 0,autor,sentimento,texto_tratado
0,winthegame_of,1,"[alvarez, marsal, estar, conosco, sportainmet,..."
1,marta_bego,1,"[btgpactual, with, makerepost, entender, impac..."
2,lmviapiana,2,"[minuto, touro, ouro]"
3,vanilson_dos,1,[sim]
4,ricktolledo,2,"[querer, saber, banking, próprio, administro]"
...,...,...,...
9202,perspectiveinvestimentos,2,"[excelente, explicação]"
9203,eduardocolares,2,"[atendar, telefone, amor, deus]"
9204,danielucm,2,"[saber, qual, grande, fiis, mercado, selecione..."
9205,amgcapitalinvest,1,"[erro, financeiro, eliminar, antes, ano, _, pa..."


In [42]:
cbow = '/content/drive/MyDrive/Módulo 6/Prog/cbow_s50.txt'

In [43]:
model_cbow = KeyedVectors.load_word2vec_format(cbow)

## Isolated test

To check that the model loaded correctly, we look up the vectors for the words 'projeto' and 'banco'.

In [44]:
# Look up the 50-dimensional vector for 'projeto'
wordvec_test = model_cbow['projeto']

wordvec_test

array([-0.074174, -0.152088,  0.086627, -0.224567,  0.362562,  0.130683,
       -0.089179, -0.086973,  0.309501,  0.004112, -0.308202,  0.351789,
       -0.477863,  0.050276,  0.213283,  0.159895, -0.285545, -0.08832 ,
       -0.015449,  0.014816, -0.613861,  0.502556,  0.021688,  0.369492,
        0.280691,  0.016868,  0.105584, -0.180754, -0.078456,  0.148032,
        0.36293 , -0.011634,  0.412191, -0.009049,  0.010404,  0.131242,
       -0.032483, -0.133067, -0.063802,  0.434015, -0.214768, -0.072132,
        0.045601, -0.368866,  0.502808,  0.048293, -0.254894,  0.142581,
       -0.075066,  0.015646], dtype=float32)

In [45]:
# Look up the 50-dimensional vector for 'banco'
wordvec_test = model_cbow['banco']

wordvec_test

array([ 1.81041e-01,  1.07700e-01, -1.04667e-01,  2.43361e-01,
        6.06380e-02,  3.92829e-01, -3.33944e-01, -3.81778e-01,
        1.42200e-01,  8.59360e-02, -1.16615e-01,  3.95722e-01,
       -6.12684e-01, -7.68980e-02,  3.34396e-01,  8.11270e-02,
       -5.17700e-02, -3.21950e-01, -6.91509e-01, -3.31210e-01,
       -5.43213e-01,  6.09881e-01,  2.43700e-01,  3.73240e-02,
        1.16518e-01,  1.78859e-01, -3.78839e-01,  1.27430e-01,
        1.94497e-01,  7.32000e-04,  3.14395e-01, -2.04550e-01,
        5.34431e-01, -5.55100e-03,  3.52343e-01, -4.92000e-02,
       -1.38384e-01,  2.31630e-02, -3.40013e-01,  5.00201e-01,
       -1.14170e-02, -1.29925e-01, -6.12800e-03, -1.80481e-01,
        1.99391e-01,  1.37645e-01, -7.66434e-01, -2.26784e-01,
       -6.16110e-02,  9.05920e-02], dtype=float32)
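
Raw vectors like the two above are hard to interpret on their own. A common follow-up is to compare them with cosine similarity. The sketch below uses small made-up vectors (not values from the model) to show the computation.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a
c = [0.0, 0.0, 3.0]   # orthogonal to a

print(cosine_similarity(a, b))  # close to 1.0
print(cosine_similarity(a, c))  # 0.0
```

With the loaded model, `cosine_similarity(model_cbow['projeto'], model_cbow['banco'])` would give the similarity between the two test words. Note that `scipy.spatial.distance.cosine`, imported earlier, returns the cosine distance, which is 1 minus this value.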

## Function definition

We wrote the function below, `vetorizando`, to build a vector representation for each sentence of the given DataFrame: it averages the vectors of the sentence's words that are present in the model.

In [46]:
def vetorizando(model, df):
    sentence_table = []
    for sentence in df['texto_tratado']:
        # Keep only the vectors of items present in the model's vocabulary
        word_vectors = [model[word] for word in sentence if word in model]
        if len(word_vectors) > 0:
            # Sentence vector = element-wise mean of the collected vectors
            sentence_vector = sum(word_vectors) / len(word_vectors)
        else:
            sentence_vector = [None] * 50  # Placeholder row, removed by dropna() below
        sentence_table.append((sentence, *sentence_vector[:50]))  # Keep the first 50 components

    column_labels = ['Frase']
    for i in range(50):
        column_labels.append(f'Vetor{i+1}')
    df_vec = pd.DataFrame(sentence_table, columns=column_labels)

    # Map string labels to numbers (already-numeric labels are left unchanged).
    # Note: this adds the column to the df passed in.
    df["sentimentoNumerico"] = df["sentimento"].replace({'NEGATIVE': -1, 'POSITIVE': 1, 'NEUTRAL': 0})

    # Give df_vec the same index as df["sentimentoNumerico"]
    df_vec.set_index(df["sentimentoNumerico"].index, inplace=True)

    df_vec['sentimento'] = df["sentimentoNumerico"]
    df_vec = df_vec.dropna()

    return df_vec
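
The core of the function is the averaging step. The sketch below isolates it with a toy two-dimensional embedding table; the dictionary `toy_vectors` and the helper `sentence_vector` are hypothetical stand-ins for the pre-trained model and are not part of the notebook.

```python
import numpy as np

# Toy embedding table standing in for the pre-trained model (assumption)
toy_vectors = {
    "minuto": np.array([1.0, 0.0]),
    "ouro": np.array([0.0, 1.0]),
}

def sentence_vector(tokens, vectors, dim):
    """Mean of the vectors of known tokens; NaN vector if none is known."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.full(dim, np.nan)
    return np.mean(found, axis=0)

print(sentence_vector(["minuto", "touro", "ouro"], toy_vectors, 2))  # [0.5 0.5]
print(sentence_vector(["touro"], toy_vectors, 2))                    # [nan nan]
```

The unknown token `touro` is skipped, so the result is the mean of the two known vectors. A sentence with no known token yields NaN values, which a later `dropna()` would remove.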

## Testing the function

Below we apply the function defined above to the entire dataframe. It produces 50 vector columns plus a final column holding the sentiment label of each sentence, which is already encoded.

In [47]:
df_vec = vetorizando(model_cbow, df)
df_vec

Unnamed: 0,Frase,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,Vetor6,Vetor7,Vetor8,Vetor9,...,Vetor42,Vetor43,Vetor44,Vetor45,Vetor46,Vetor47,Vetor48,Vetor49,Vetor50,sentimento
0,"[alvarez, marsal, estar, conosco, sportainmet,...",0.213634,-0.129877,0.241601,-0.075002,-0.015629,0.206194,0.072658,0.055472,0.061554,...,0.024361,-0.111328,0.157674,0.094309,-0.047458,0.157365,-0.033920,0.022211,0.182153,1
1,"[btgpactual, with, makerepost, entender, impac...",0.222697,-0.124886,0.213157,-0.059091,-0.010530,0.201566,0.071898,0.033920,0.059524,...,0.008988,-0.079109,0.159296,0.085387,-0.008607,0.158519,-0.022680,0.031107,0.189521,1
2,"[minuto, touro, ouro]",0.265227,-0.068285,0.152235,-0.044329,-0.102729,0.141353,0.092800,0.113174,0.015783,...,0.078032,-0.202677,0.155750,0.062291,0.007038,0.134573,0.014635,0.034189,0.345674,2
3,[sim],0.166258,-0.029796,0.204045,-0.297490,0.046077,0.140763,0.035251,-0.174491,0.211817,...,0.065839,-0.092451,0.308218,-0.034692,-0.032851,-0.028724,-0.068701,0.011158,0.258413,1
4,"[querer, saber, banking, próprio, administro]",0.187512,-0.183612,0.300155,-0.052422,-0.034717,0.232278,0.058778,0.084289,0.088006,...,0.097538,-0.161461,0.196748,0.088577,-0.080884,0.167507,-0.049984,-0.000942,0.187811,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9202,"[excelente, explicação]",0.190917,-0.133475,0.241675,-0.053180,0.067256,0.201138,0.034109,-0.078718,-0.066131,...,-0.082151,0.016113,0.154861,0.068700,-0.004302,0.079717,-0.028388,-0.017448,0.188785,2
9203,"[atendar, telefone, amor, deus]",0.188641,-0.119377,0.199339,-0.105448,0.023176,0.178837,0.069476,-0.004494,0.034710,...,0.034035,-0.126673,0.165176,0.080313,-0.024160,0.118848,-0.003502,0.087053,0.215656,2
9204,"[saber, qual, grande, fiis, mercado, selecione...",0.215474,-0.137852,0.223206,-0.072183,-0.013213,0.205186,0.063497,0.039164,0.070273,...,0.034706,-0.097793,0.177275,0.090335,-0.047405,0.154374,-0.028906,0.023713,0.179591,2
9205,"[erro, financeiro, eliminar, antes, ano, _, pa...",0.219393,-0.129317,0.239226,-0.064735,-0.025696,0.224218,0.070732,0.042386,0.040706,...,0.025414,-0.108338,0.160880,0.092846,-0.032266,0.151619,-0.023750,0.028080,0.191956,1


- Optionally, export the table as a CSV file.

In [48]:
#df_vec.to_csv('Word2Vec_Cbow_modelo_treinado',encoding='utf-8', index=False, header=True)

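# 4. Naive Bayes + Word2Vec with CBOW

Naive Bayes is a family of probabilistic classifiers that apply Bayes' theorem under the assumption that features are conditionally independent given the class. The Gaussian variant used here (`GaussianNB`) models each continuous feature with a normal distribution, which makes it applicable to vector representations of text.

In this section we apply this algorithm to the dataframe vectorized by Word2Vec with the CBOW model, in order to classify the texts.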
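
As a rough illustration of what a Gaussian Naive Bayes classifier computes, the sketch below fits per-class means and variances with NumPy and predicts the class with the highest log-posterior. It is a simplified stand-in, not scikit-learn's implementation (which, among other things, applies its own variance smoothing), and the toy data is invented.

```python
import numpy as np

def fit_gaussian_nb(X, y, eps=1e-9):
    """Estimate class priors and per-feature mean/variance for each class."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    return classes, priors, means, variances

def predict_gaussian_nb(params, X):
    """Pick, for each row, the class maximizing log prior + log likelihood."""
    classes, priors, means, variances = params
    diff = X[:, None, :] - means[None, :, :]            # (n_samples, n_classes, n_features)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]

# Invented toy data: two well-separated classes in two dimensions
X = np.array([[0.0, 0.1], [0.2, -0.1], [1.0, 1.1], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])

params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(params, np.array([[0.1, 0.0], [1.0, 1.0]])))  # [0 1]
```

Summing the per-feature log-likelihoods is where the "naive" independence assumption enters: each vector component is treated as independent of the others given the class.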

In [49]:
df_vec = df_vec.dropna()
df_vec

Unnamed: 0,Frase,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,Vetor6,Vetor7,Vetor8,Vetor9,...,Vetor42,Vetor43,Vetor44,Vetor45,Vetor46,Vetor47,Vetor48,Vetor49,Vetor50,sentimento
0,"[alvarez, marsal, estar, conosco, sportainmet,...",0.213634,-0.129877,0.241601,-0.075002,-0.015629,0.206194,0.072658,0.055472,0.061554,...,0.024361,-0.111328,0.157674,0.094309,-0.047458,0.157365,-0.033920,0.022211,0.182153,1
1,"[btgpactual, with, makerepost, entender, impac...",0.222697,-0.124886,0.213157,-0.059091,-0.010530,0.201566,0.071898,0.033920,0.059524,...,0.008988,-0.079109,0.159296,0.085387,-0.008607,0.158519,-0.022680,0.031107,0.189521,1
2,"[minuto, touro, ouro]",0.265227,-0.068285,0.152235,-0.044329,-0.102729,0.141353,0.092800,0.113174,0.015783,...,0.078032,-0.202677,0.155750,0.062291,0.007038,0.134573,0.014635,0.034189,0.345674,2
3,[sim],0.166258,-0.029796,0.204045,-0.297490,0.046077,0.140763,0.035251,-0.174491,0.211817,...,0.065839,-0.092451,0.308218,-0.034692,-0.032851,-0.028724,-0.068701,0.011158,0.258413,1
4,"[querer, saber, banking, próprio, administro]",0.187512,-0.183612,0.300155,-0.052422,-0.034717,0.232278,0.058778,0.084289,0.088006,...,0.097538,-0.161461,0.196748,0.088577,-0.080884,0.167507,-0.049984,-0.000942,0.187811,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9202,"[excelente, explicação]",0.190917,-0.133475,0.241675,-0.053180,0.067256,0.201138,0.034109,-0.078718,-0.066131,...,-0.082151,0.016113,0.154861,0.068700,-0.004302,0.079717,-0.028388,-0.017448,0.188785,2
9203,"[atendar, telefone, amor, deus]",0.188641,-0.119377,0.199339,-0.105448,0.023176,0.178837,0.069476,-0.004494,0.034710,...,0.034035,-0.126673,0.165176,0.080313,-0.024160,0.118848,-0.003502,0.087053,0.215656,2
9204,"[saber, qual, grande, fiis, mercado, selecione...",0.215474,-0.137852,0.223206,-0.072183,-0.013213,0.205186,0.063497,0.039164,0.070273,...,0.034706,-0.097793,0.177275,0.090335,-0.047405,0.154374,-0.028906,0.023713,0.179591,2
9205,"[erro, financeiro, eliminar, antes, ano, _, pa...",0.219393,-0.129317,0.239226,-0.064735,-0.025696,0.224218,0.070732,0.042386,0.040706,...,0.025414,-0.108338,0.160880,0.092846,-0.032266,0.151619,-0.023750,0.028080,0.191956,1


## Train/test split

In [50]:
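target = df_vec['sentimento']
# Note: iloc[:, 1:50] selects Vetor1..Vetor49 (49 columns); Vetor50 is left out.
# Use iloc[:, 1:51] to include all 50 components.
feature = df_vec.iloc[:,1:50]
feature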

Unnamed: 0,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,Vetor6,Vetor7,Vetor8,Vetor9,Vetor10,...,Vetor40,Vetor41,Vetor42,Vetor43,Vetor44,Vetor45,Vetor46,Vetor47,Vetor48,Vetor49
0,0.213634,-0.129877,0.241601,-0.075002,-0.015629,0.206194,0.072658,0.055472,0.061554,0.170172,...,0.074627,-0.270438,0.024361,-0.111328,0.157674,0.094309,-0.047458,0.157365,-0.033920,0.022211
1,0.222697,-0.124886,0.213157,-0.059091,-0.010530,0.201566,0.071898,0.033920,0.059524,0.164536,...,0.086584,-0.301574,0.008988,-0.079109,0.159296,0.085387,-0.008607,0.158519,-0.022680,0.031107
2,0.265227,-0.068285,0.152235,-0.044329,-0.102729,0.141353,0.092800,0.113174,0.015783,0.202198,...,-0.008447,-0.193025,0.078032,-0.202677,0.155750,0.062291,0.007038,0.134573,0.014635,0.034189
3,0.166258,-0.029796,0.204045,-0.297490,0.046077,0.140763,0.035251,-0.174491,0.211817,0.288314,...,0.183434,-0.415105,0.065839,-0.092451,0.308218,-0.034692,-0.032851,-0.028724,-0.068701,0.011158
4,0.187512,-0.183612,0.300155,-0.052422,-0.034717,0.232278,0.058778,0.084289,0.088006,0.148424,...,0.046667,-0.247744,0.097538,-0.161461,0.196748,0.088577,-0.080884,0.167507,-0.049984,-0.000942
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9202,0.190917,-0.133475,0.241675,-0.053180,0.067256,0.201138,0.034109,-0.078718,-0.066131,0.187608,...,0.014565,-0.321192,-0.082151,0.016113,0.154861,0.068700,-0.004302,0.079717,-0.028388,-0.017448
9203,0.188641,-0.119377,0.199339,-0.105448,0.023176,0.178837,0.069476,-0.004494,0.034710,0.150081,...,0.071114,-0.194663,0.034035,-0.126673,0.165176,0.080313,-0.024160,0.118848,-0.003502,0.087053
9204,0.215474,-0.137852,0.223206,-0.072183,-0.013213,0.205186,0.063497,0.039164,0.070273,0.172185,...,0.081652,-0.300155,0.034706,-0.097793,0.177275,0.090335,-0.047405,0.154374,-0.028906,0.023713
9205,0.219393,-0.129317,0.239226,-0.064735,-0.025696,0.224218,0.070732,0.042386,0.040706,0.154811,...,0.063415,-0.258450,0.025414,-0.108338,0.160880,0.092846,-0.032266,0.151619,-0.023750,0.028080


In [51]:
X_train, X_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=42)
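
The size of the resulting test set can be checked by hand. To my understanding, when `test_size` is a fraction scikit-learn rounds the number of test samples up, which the arithmetic below reproduces for the 9207 rows of this dataset (assuming no row was dropped).

```python
import math

n_samples = 9207      # rows in the dataset shown above
test_size = 0.2

n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 7365 1842
```

The value 1842 matches the total support shown in the classification reports further down in this notebook.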

## Model evaluation

In [52]:
clf = GaussianNB()

clf = clf.fit(X_train,y_train.values.ravel())

Y_pred = clf.predict(X_test)

print(classification_report(y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.29      0.74      0.42       386
           1       0.74      0.46      0.56       844
           2       0.34      0.20      0.25       612

    accuracy                           0.43      1842
   macro avg       0.46      0.46      0.41      1842
weighted avg       0.51      0.43      0.43      1842
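
The two average rows of the report can be recomputed from the per-class rows. The sketch below does so for the F1 column, using the rounded values printed above, so small rounding differences are possible in general.

```python
# Per-class F1 and support, copied from the report above
f1 = {0: 0.42, 1: 0.56, 2: 0.25}
support = {0: 386, 1: 844, 2: 612}

total = sum(support.values())
macro_f1 = sum(f1.values()) / len(f1)                       # plain mean over classes
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total   # mean weighted by support

print(total, round(macro_f1, 2), round(weighted_f1, 2))  # 1842 0.41 0.43
```

The macro average treats the three classes equally, while the weighted average gives class 1 the most influence because it has the largest support.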



In [53]:
acc_score = accuracy_score(y_test, Y_pred)
format_output = "{:.2%}".format(acc_score)
print("Final accuracy:", format_output)

Final accuracy: 42.94%
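
One way to put this accuracy in context is to compare it with a majority-class baseline, that is, a classifier that always predicts the most frequent class in the test set. The sketch below computes that baseline from the supports in the report above; it is an added sanity check, not part of the original analysis.

```python
support = {0: 386, 1: 844, 2: 612}   # test-set class counts from the report
model_accuracy = 0.4294              # accuracy reported above

majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())

print(majority_class, f"{baseline_accuracy:.2%}")  # 1 45.82%
print(baseline_accuracy > model_accuracy)          # True
```

On this split, always predicting class 1 would score slightly higher than the model, which suggests the features or the classifier need revisiting.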


# 5. Word2Vec trained on the *corpus*

## Setup


In [54]:
df

Unnamed: 0,autor,sentimento,texto_tratado,sentimentoNumerico
0,winthegame_of,1,"[alvarez, marsal, estar, conosco, sportainmet,...",1
1,marta_bego,1,"[btgpactual, with, makerepost, entender, impac...",1
2,lmviapiana,2,"[minuto, touro, ouro]",2
3,vanilson_dos,1,[sim],1
4,ricktolledo,2,"[querer, saber, banking, próprio, administro]",2
...,...,...,...,...
9202,perspectiveinvestimentos,2,"[excelente, explicação]",2
9203,eduardocolares,2,"[atendar, telefone, amor, deus]",2
9204,danielucm,2,"[saber, qual, grande, fiis, mercado, selecione...",2
9205,amgcapitalinvest,1,"[erro, financeiro, eliminar, antes, ano, _, pa...",1


## Function definition

The function below uses the Gensim library to train a Word2Vec model on a column of the dataframe.

In [55]:
# Trains a Word2Vec model on the corpus held in the dataframe
def train_word2vec(df, column_name):
    # Get the tokenized sentences
    sentences = df[column_name].tolist()
    
    # Train the Word2Vec model (min_count=1 keeps every item, even those seen once)
    model = Word2Vec(sentences, min_count=1)
    
    return model

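The first function takes the trained Word2Vec model and a sentence, collects the vector of every word that is in the model's vocabulary, and returns their sum divided by the sentence length (or a zero vector when no word is found).

The second function applies it to every sentence and builds a new dataframe with one column per vector component.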

In [56]:
# Builds one vector per sentence from the vectors of its words
def get_word_vectors(model, sentence):
    vectors = []
    for word in sentence:
        if word in model.wv:
            vectors.append(model.wv[word]) # Collect the vector of each known word
    if vectors:
        return np.sum(vectors, axis=0)/len(sentence) # Sum of the vectors divided by the sentence length
    else:
        return np.zeros(model.vector_size)

# Builds the dataframe holding one vector per sentence
def create_word2vec_dataframe(df, column_name, model):
    sentences = df[column_name].tolist()
    vectors = [get_word_vectors(model, sentence) for sentence in sentences] # One vector per sentence
    # Assemble the dataframe: one column per vector component
    df_vectors = pd.DataFrame(vectors, columns=[f"Vetor{i}" for i in range(model.vector_size)])
    df_word2vec = pd.concat([df, df_vectors], axis=1)
    return df_word2vec
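
A detail of `get_word_vectors` is that the sum is divided by the length of the sentence rather than by the number of vectors found. The sketch below, with a made-up two-dimensional vocabulary, shows how the two choices differ when a word is out of vocabulary.

```python
import numpy as np

# Made-up vectors; "fiis" is deliberately treated as out-of-vocabulary here
vectors = {"saber": np.array([3.0, 0.0]), "mercado": np.array([0.0, 3.0])}
sentence = ["saber", "mercado", "fiis"]

found = [vectors[w] for w in sentence if w in vectors]
by_sentence_length = np.sum(found, axis=0) / len(sentence)  # what get_word_vectors does
by_found_words = np.mean(found, axis=0)                     # mean of the found vectors only

print(by_sentence_length)  # [1. 1.]
print(by_found_words)      # [1.5 1.5]
```

In this notebook the model is trained with `min_count=1` on the same column it is later applied to, so every item should be in the vocabulary and the two results coincide. The distinction matters if the model is reused on new text.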

## Testing the functions

The functions defined above are tested below.

In [57]:
model = train_word2vec(df, 'texto_tratado')



In [58]:
df_word2vec = create_word2vec_dataframe(df,'texto_tratado', model)
df_word2vec

Unnamed: 0,autor,sentimento,texto_tratado,sentimentoNumerico,Vetor0,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,...,Vetor90,Vetor91,Vetor92,Vetor93,Vetor94,Vetor95,Vetor96,Vetor97,Vetor98,Vetor99
0,winthegame_of,1,"[alvarez, marsal, estar, conosco, sportainmet,...",1,-0.034426,0.300972,-0.286190,-0.065209,-0.054679,0.096039,...,-0.068513,-0.155436,0.140325,0.219084,-0.289465,-0.035990,0.078490,-0.070072,-0.003752,0.339329
1,marta_bego,1,"[btgpactual, with, makerepost, entender, impac...",1,-0.037321,0.289959,-0.259288,-0.064339,-0.080668,0.098687,...,-0.060072,-0.149105,0.130588,0.186867,-0.302085,-0.018805,0.053526,-0.054743,-0.009448,0.354135
2,lmviapiana,2,"[minuto, touro, ouro]",2,-0.193654,0.391434,-0.203922,0.001809,-0.143087,0.069520,...,-0.066392,-0.108025,0.141333,0.167520,-0.338259,0.087925,0.085912,-0.092600,0.032975,0.329287
3,vanilson_dos,1,[sim],1,0.082540,0.359721,-0.090063,0.209893,0.271785,-0.004873,...,-0.064920,-0.291844,0.273709,0.119920,-0.276886,-0.050655,0.175334,-0.094184,-0.075605,0.116921
4,ricktolledo,2,"[querer, saber, banking, próprio, administro]",2,-0.076144,0.319210,-0.289153,-0.018826,-0.025748,0.047773,...,-0.103470,-0.161082,0.136264,0.172653,-0.259784,0.001959,0.127059,-0.129546,0.007025,0.357119
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9202,perspectiveinvestimentos,2,"[excelente, explicação]",2,0.013271,0.233060,-0.269739,-0.228910,-0.098336,0.027902,...,0.048463,-0.198648,0.078566,0.267573,-0.418458,-0.069492,0.134299,-0.233718,0.027100,0.381566
9203,eduardocolares,2,"[atendar, telefone, amor, deus]",2,-0.097030,0.357497,-0.289556,-0.076197,-0.054787,0.091044,...,-0.089239,-0.201867,0.149633,0.174404,-0.306654,-0.032587,0.117704,-0.042513,0.026960,0.443921
9204,danielucm,2,"[saber, qual, grande, fiis, mercado, selecione...",2,-0.047819,0.247797,-0.269047,-0.119334,-0.083395,0.097093,...,-0.031900,-0.186846,0.143561,0.254065,-0.343297,-0.009880,0.068257,-0.120914,0.022000,0.399099
9205,amgcapitalinvest,1,"[erro, financeiro, eliminar, antes, ano, _, pa...",1,-0.041982,0.288899,-0.303314,-0.088639,-0.096838,0.103386,...,-0.067419,-0.155152,0.124977,0.252822,-0.333116,-0.019388,0.046955,-0.103713,0.002978,0.393447


In [59]:
df_word2vec = df_word2vec.drop(columns=['autor', 'sentimento'])
df_word2vec

Unnamed: 0,texto_tratado,sentimentoNumerico,Vetor0,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,Vetor6,Vetor7,...,Vetor90,Vetor91,Vetor92,Vetor93,Vetor94,Vetor95,Vetor96,Vetor97,Vetor98,Vetor99
0,"[alvarez, marsal, estar, conosco, sportainmet,...",1,-0.034426,0.300972,-0.286190,-0.065209,-0.054679,0.096039,-0.265692,0.127671,...,-0.068513,-0.155436,0.140325,0.219084,-0.289465,-0.035990,0.078490,-0.070072,-0.003752,0.339329
1,"[btgpactual, with, makerepost, entender, impac...",1,-0.037321,0.289959,-0.259288,-0.064339,-0.080668,0.098687,-0.287177,0.113840,...,-0.060072,-0.149105,0.130588,0.186867,-0.302085,-0.018805,0.053526,-0.054743,-0.009448,0.354135
2,"[minuto, touro, ouro]",2,-0.193654,0.391434,-0.203922,0.001809,-0.143087,0.069520,-0.235220,0.141716,...,-0.066392,-0.108025,0.141333,0.167520,-0.338259,0.087925,0.085912,-0.092600,0.032975,0.329287
3,[sim],1,0.082540,0.359721,-0.090063,0.209893,0.271785,-0.004873,-0.347171,0.258687,...,-0.064920,-0.291844,0.273709,0.119920,-0.276886,-0.050655,0.175334,-0.094184,-0.075605,0.116921
4,"[querer, saber, banking, próprio, administro]",2,-0.076144,0.319210,-0.289153,-0.018826,-0.025748,0.047773,-0.273074,0.146078,...,-0.103470,-0.161082,0.136264,0.172653,-0.259784,0.001959,0.127059,-0.129546,0.007025,0.357119
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9202,"[excelente, explicação]",2,0.013271,0.233060,-0.269739,-0.228910,-0.098336,0.027902,-0.262189,0.145487,...,0.048463,-0.198648,0.078566,0.267573,-0.418458,-0.069492,0.134299,-0.233718,0.027100,0.381566
9203,"[atendar, telefone, amor, deus]",2,-0.097030,0.357497,-0.289556,-0.076197,-0.054787,0.091044,-0.288496,0.154998,...,-0.089239,-0.201867,0.149633,0.174404,-0.306654,-0.032587,0.117704,-0.042513,0.026960,0.443921
9204,"[saber, qual, grande, fiis, mercado, selecione...",2,-0.047819,0.247797,-0.269047,-0.119334,-0.083395,0.097093,-0.253981,0.148540,...,-0.031900,-0.186846,0.143561,0.254065,-0.343297,-0.009880,0.068257,-0.120914,0.022000,0.399099
9205,"[erro, financeiro, eliminar, antes, ano, _, pa...",1,-0.041982,0.288899,-0.303314,-0.088639,-0.096838,0.103386,-0.263547,0.119052,...,-0.067419,-0.155152,0.124977,0.252822,-0.333116,-0.019388,0.046955,-0.103713,0.002978,0.393447


# 6. Naive Bayes + Word2Vec trained on the *corpus*

Here the same Gaussian Naive Bayes classifier is applied to the sentence vectors produced by the Word2Vec model trained on the corpus.

## Setup

In [60]:
df_word2vec = df_word2vec.dropna()
df_word2vec

Unnamed: 0,texto_tratado,sentimentoNumerico,Vetor0,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,Vetor6,Vetor7,...,Vetor90,Vetor91,Vetor92,Vetor93,Vetor94,Vetor95,Vetor96,Vetor97,Vetor98,Vetor99
0,"[alvarez, marsal, estar, conosco, sportainmet,...",1,-0.034426,0.300972,-0.286190,-0.065209,-0.054679,0.096039,-0.265692,0.127671,...,-0.068513,-0.155436,0.140325,0.219084,-0.289465,-0.035990,0.078490,-0.070072,-0.003752,0.339329
1,"[btgpactual, with, makerepost, entender, impac...",1,-0.037321,0.289959,-0.259288,-0.064339,-0.080668,0.098687,-0.287177,0.113840,...,-0.060072,-0.149105,0.130588,0.186867,-0.302085,-0.018805,0.053526,-0.054743,-0.009448,0.354135
2,"[minuto, touro, ouro]",2,-0.193654,0.391434,-0.203922,0.001809,-0.143087,0.069520,-0.235220,0.141716,...,-0.066392,-0.108025,0.141333,0.167520,-0.338259,0.087925,0.085912,-0.092600,0.032975,0.329287
3,[sim],1,0.082540,0.359721,-0.090063,0.209893,0.271785,-0.004873,-0.347171,0.258687,...,-0.064920,-0.291844,0.273709,0.119920,-0.276886,-0.050655,0.175334,-0.094184,-0.075605,0.116921
4,"[querer, saber, banking, próprio, administro]",2,-0.076144,0.319210,-0.289153,-0.018826,-0.025748,0.047773,-0.273074,0.146078,...,-0.103470,-0.161082,0.136264,0.172653,-0.259784,0.001959,0.127059,-0.129546,0.007025,0.357119
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9202,"[excelente, explicação]",2,0.013271,0.233060,-0.269739,-0.228910,-0.098336,0.027902,-0.262189,0.145487,...,0.048463,-0.198648,0.078566,0.267573,-0.418458,-0.069492,0.134299,-0.233718,0.027100,0.381566
9203,"[atendar, telefone, amor, deus]",2,-0.097030,0.357497,-0.289556,-0.076197,-0.054787,0.091044,-0.288496,0.154998,...,-0.089239,-0.201867,0.149633,0.174404,-0.306654,-0.032587,0.117704,-0.042513,0.026960,0.443921
9204,"[saber, qual, grande, fiis, mercado, selecione...",2,-0.047819,0.247797,-0.269047,-0.119334,-0.083395,0.097093,-0.253981,0.148540,...,-0.031900,-0.186846,0.143561,0.254065,-0.343297,-0.009880,0.068257,-0.120914,0.022000,0.399099
9205,"[erro, financeiro, eliminar, antes, ano, _, pa...",1,-0.041982,0.288899,-0.303314,-0.088639,-0.096838,0.103386,-0.263547,0.119052,...,-0.067419,-0.155152,0.124977,0.252822,-0.333116,-0.019388,0.046955,-0.103713,0.002978,0.393447


## Train/test split

The sentences must be split into a training set and a test set, which is done below.

In [61]:
target = df_word2vec['sentimentoNumerico']
feature = df_word2vec.iloc[:,2:102]
feature

Unnamed: 0,Vetor0,Vetor1,Vetor2,Vetor3,Vetor4,Vetor5,Vetor6,Vetor7,Vetor8,Vetor9,...,Vetor90,Vetor91,Vetor92,Vetor93,Vetor94,Vetor95,Vetor96,Vetor97,Vetor98,Vetor99
0,-0.034426,0.300972,-0.286190,-0.065209,-0.054679,0.096039,-0.265692,0.127671,0.082438,0.144984,...,-0.068513,-0.155436,0.140325,0.219084,-0.289465,-0.035990,0.078490,-0.070072,-0.003752,0.339329
1,-0.037321,0.289959,-0.259288,-0.064339,-0.080668,0.098687,-0.287177,0.113840,0.072201,0.169703,...,-0.060072,-0.149105,0.130588,0.186867,-0.302085,-0.018805,0.053526,-0.054743,-0.009448,0.354135
2,-0.193654,0.391434,-0.203922,0.001809,-0.143087,0.069520,-0.235220,0.141716,-0.046249,0.223912,...,-0.066392,-0.108025,0.141333,0.167520,-0.338259,0.087925,0.085912,-0.092600,0.032975,0.329287
3,0.082540,0.359721,-0.090063,0.209893,0.271785,-0.004873,-0.347171,0.258687,-0.046280,0.064418,...,-0.064920,-0.291844,0.273709,0.119920,-0.276886,-0.050655,0.175334,-0.094184,-0.075605,0.116921
4,-0.076144,0.319210,-0.289153,-0.018826,-0.025748,0.047773,-0.273074,0.146078,0.010143,0.117052,...,-0.103470,-0.161082,0.136264,0.172653,-0.259784,0.001959,0.127059,-0.129546,0.007025,0.357119
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9202,0.013271,0.233060,-0.269739,-0.228910,-0.098336,0.027902,-0.262189,0.145487,0.034696,0.108496,...,0.048463,-0.198648,0.078566,0.267573,-0.418458,-0.069492,0.134299,-0.233718,0.027100,0.381566
9203,-0.097030,0.357497,-0.289556,-0.076197,-0.054787,0.091044,-0.288496,0.154998,0.078797,0.167947,...,-0.089239,-0.201867,0.149633,0.174404,-0.306654,-0.032587,0.117704,-0.042513,0.026960,0.443921
9204,-0.047819,0.247797,-0.269047,-0.119334,-0.083395,0.097093,-0.253981,0.148540,0.066677,0.146063,...,-0.031900,-0.186846,0.143561,0.254065,-0.343297,-0.009880,0.068257,-0.120914,0.022000,0.399099
9205,-0.041982,0.288899,-0.303314,-0.088639,-0.096838,0.103386,-0.263547,0.119052,0.072317,0.157323,...,-0.067419,-0.155152,0.124977,0.252822,-0.333116,-0.019388,0.046955,-0.103713,0.002978,0.393447


In [62]:
X_train, X_test, y_train, y_test = train_test_split(feature, target, test_size=0.2, random_state=42)

## Model evaluation

The cell below reports per-class precision, recall and F1-score, along with the model's accuracy, macro average and weighted average.

In [63]:
clf = GaussianNB()

clf = clf.fit(X_train,y_train.values.ravel())

Y_pred = clf.predict(X_test)

print(classification_report(y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.28      0.83      0.42       386
           1       0.77      0.46      0.58       844
           2       0.35      0.11      0.17       612

    accuracy                           0.42      1842
   macro avg       0.47      0.47      0.39      1842
weighted avg       0.53      0.42      0.41      1842
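
The combination of high recall and low precision for class 0 suggests that the model predicts class 0 for most comments. This can be estimated from the report itself: true positives are recall times support, and the number of predictions for a class is true positives divided by precision. The figures below are approximate because the report values are rounded.

```python
# class: (precision, recall, support), copied from the report above
report = {
    0: (0.28, 0.83, 386),
    1: (0.77, 0.46, 844),
    2: (0.35, 0.11, 612),
}

predicted = {}
for label, (precision, recall, support) in report.items():
    true_positives = recall * support               # recall = TP / support
    predicted[label] = true_positives / precision   # precision = TP / predicted

for label, count in predicted.items():
    print(label, round(count))          # 0 1144, then 1 504, then 2 192
print(round(sum(predicted.values())))   # 1841
```

The estimates add up to about 1841, close to the 1842 test samples, and indicate that roughly 62% of the test comments were assigned to class 0 even though only 386 of them belong to it.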



In [64]:
acc_score = accuracy_score(y_test, Y_pred)
format_output = "{:.2%}".format(acc_score)
print("Final accuracy:", format_output)

Final accuracy: 42.29%
