# NeuralMed Test - Sentiment Analysis

The dataset contains samples from three different sources: Amazon, Yelp, and IMDB. Jupyter Notebook was chosen because it gives a better presentation of the code and its results.

## Type of learning
This is a supervised learning task: we work with a predefined training set in which every example already carries its expected output. Within supervised learning, it is a classification problem: each input text is labeled as either positive or negative.

## Classification algorithm
The classifier is Naive Bayes, which applies Bayes' theorem under the "naive" assumption that terms are conditionally independent given the class. MultinomialNB was chosen by elimination: BernoulliNB treats each word as a boolean value (present or absent), whereas MultinomialNB works with word counts. The boolean view did not seem the most suitable here, since the same term can appear in a positive sentence and in a negative one.
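
To make the scoring rule concrete, here is a minimal from-scratch sketch of multinomial Naive Bayes with Laplace smoothing (alpha = 1, the default shown in the `MultinomialNB` output further down). The toy sentences and the helper names `train_nb` and `predict_nb` are made up for illustration; this is not how scikit-learn implements it internally.

```python
import math
from collections import Counter


def train_nb(texts, labels, alpha=1.0):
    """Collect per-class word counts and class frequencies (hypothetical helper)."""
    word_counts = {}                 # class -> Counter of word frequencies
    class_counts = Counter(labels)   # class -> number of training sentences
    vocab = set()
    for text, label in zip(texts, labels):
        words = text.lower().split()
        word_counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return word_counts, class_counts, vocab, alpha


def predict_nb(model, text):
    """Return the class with the highest log-posterior score."""
    word_counts, class_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(class_counts[label] / total_docs)  # log prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Every occurrence contributes independently: the "naive" assumption.
            score += math.log(
                (counts[word] + alpha) / (total_words + alpha * len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


model = train_nb(
    ["great great film", "loved it", "not good", "bad and boring"],
    [1, 1, 0, 0],
)
print(predict_nb(model, "great film"))     # 1
print(predict_nb(model, "not good film"))  # 0
```

Because "great" occurs twice in the positive toy sentences, its count raises the positive score more than a presence/absence flag would, which is the difference between the multinomial and Bernoulli variants.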

## Preprocessing
I chose not to remove stopwords, so that words such as "not" are kept and the meaning of a sentence does not change.
For feature extraction I used a bag-of-words representation, because it takes into account how often each term occurs in the text.
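
As a small illustration of the stopword concern, the sketch below compares two `CountVectorizer` instances on a pair of made-up sentences (the first is borrowed from the test sentences used later). It assumes scikit-learn's built-in English stop list, which is only one possible stop list and is not used anywhere in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Crust is not good.", "Crust is good."]

# Same settings as the notebook (stopwords kept) versus a built-in stop list.
with_stopwords = CountVectorizer(analyzer="word").fit(sentences)
without_stopwords = CountVectorizer(analyzer="word", stop_words="english").fit(sentences)

vocab_kept = sorted(with_stopwords.vocabulary_)
vocab_filtered = sorted(without_stopwords.vocabulary_)
print(vocab_kept)      # ['crust', 'good', 'is', 'not']
print(vocab_filtered)  # ['crust', 'good']

# With the stop list applied, both sentences collapse to the same count vector.
a, b = without_stopwords.transform(sentences).toarray()
print((a == b).all())  # True
```

Once "not" is filtered out, the classifier can no longer tell the two sentences apart, which is the change of meaning the choice above avoids.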

## Training set
I chose to compare the accuracy obtained with each dataset on its own and with the datasets concatenated.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.model_selection import cross_val_predict

In [2]:
df_imdb = pd.read_csv("imdb_labelled.txt", delimiter="\t", header=None)
df_yelp = pd.read_csv("yelp_labelled.txt", delimiter="\t", header=None)
df_amazon = pd.read_csv("amazon_cells_labelled.txt", delimiter="\t", header=None)
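
One thing that may be worth checking at this step: review text can contain double-quote characters, and `pd.read_csv` treats `"` as a quoting character by default, so a field that opens with a quote can absorb the lines that follow until the quote is closed. The IMDB accuracy reported later, 0.7473262032085561, is consistent with 559 correct predictions out of 748 rows, so it may be worth confirming that `df_imdb` has the expected number of rows. The sketch below uses made-up in-memory data to show the effect and one possible remedy, `quoting=csv.QUOTE_NONE`; whether this applies to the actual files is an assumption that the notebook does not verify.

```python
import csv
import io

import pandas as pd

# Made-up sample: the first review opens with a quote that is closed two lines later.
raw = '"Wow\t1\nNot good\t0\nGreat" film\t1\n'

# Default behaviour: the quote starts a quoted field that spans several lines.
default_df = pd.read_csv(io.StringIO(raw), delimiter="\t", header=None)

# QUOTE_NONE: quote characters are read as ordinary text.
literal_df = pd.read_csv(io.StringIO(raw), delimiter="\t", header=None,
                         quoting=csv.QUOTE_NONE)

print(len(default_df))  # fewer rows, because the quoted span is read as one field
print(len(literal_df))  # 3
```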


Rename the columns to indicate which one holds the text and which one holds the positive or negative label.

In [3]:
df_imdb.columns = ['Reviews', 'Rating']
df_yelp.columns = ['Reviews', 'Rating']
df_amazon.columns = ['Reviews', 'Rating']

In [4]:
comments_imdb = df_imdb["Reviews"].values
notes_imdb = df_imdb['Rating'].values

In [5]:
comments_yelp = df_yelp["Reviews"].values
notes_yelp = df_yelp['Rating'].values

In [6]:
comments_amazon = df_amazon["Reviews"].values
notes_amazon = df_amazon['Rating'].values

## Using the IMDB dataset for training

In [9]:
vectorizer_imdb = CountVectorizer(analyzer="word")
freq_comments_imdb = vectorizer_imdb.fit_transform(comments_imdb)
modelo_imdb = MultinomialNB()
modelo_imdb.fit(freq_comments_imdb, notes_imdb)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [10]:
teste_1 =["Wow... Loved this place.",
          "Crust is not good.",
          "Not tasty and the texture was just nasty.",
          "Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.",
          "The selection on the menu was great and so were the prices.",
          "Now I am getting angry and I want my damn pho.",
          "Honeslty it didn't taste THAT fresh.)",
          "The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.",
          "The fries were great too.",
          "A great touch."]

In [12]:
freq_teste_1 = vectorizer_imdb.transform(teste_1)
modelo_imdb.predict(freq_teste_1)

array([0, 0, 0, 1, 1, 1, 1, 1, 1, 1])

In [13]:
results_imdb = cross_val_predict(modelo_imdb, freq_comments_imdb, notes_imdb, cv=10)

In [14]:
metrics.accuracy_score(notes_imdb, results_imdb)

0.7473262032085561
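
In the cells above, the vectorizer is fitted on all reviews before cross-validation, so each fold's vocabulary already includes words from its held-out reviews. One alternative, sketched here with made-up sentences rather than the notebook's data, is to wrap both steps in a scikit-learn pipeline so that the vocabulary is rebuilt inside every training fold. The variable names are illustrative only.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data: four positive and four negative sentences.
texts = ["great film", "loved it", "really great", "loved the cast",
         "not good", "bad plot", "really bad", "not funny"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The pipeline receives raw text, so vectorizing happens inside each fold.
pipeline = make_pipeline(CountVectorizer(analyzer="word"), MultinomialNB())
predicted = cross_val_predict(pipeline, texts, labels, cv=2)
accuracy = metrics.accuracy_score(labels, predicted)
print(len(predicted), accuracy)
```

With the real data, the raw `comments_imdb` array would take the place of `texts` and `cv=10` could be kept as in the notebook.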

## Now using the Yelp dataset for training

In [15]:
vectorizer_yelp = CountVectorizer(analyzer="word")
freq_comments_yelp = vectorizer_yelp.fit_transform(comments_yelp)
modelo_yelp = MultinomialNB()
modelo_yelp.fit(freq_comments_yelp, notes_yelp)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [17]:
teste_2=["The structure of this film is easily the most tightly constructed in the history of cinema.",
         "I can think of no other film where something vitally important occurs every other minute.",
         "In other words, the content level of this film is enough to easily fill a dozen other films. ",
         "How can anyone in their right mind ask for anything more from a movie than this?",
         "It's quite simply the highest, most superlative form of cinema imaginable.",
         "But it's just not funny.",
         "But even the talented Carrell can't save this. ",
         "His co-stars don't fare much better, with people like Morgan Freeman, Jonah Hill, and Ed Helms just wasted.",
         "The story itself is just predictable and lazy.",
         "Wasted two hours."]

In [18]:
freq_teste_2 = vectorizer_yelp.transform(teste_2)
modelo_yelp.predict(freq_teste_2)

array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

In [19]:
results_yelp = cross_val_predict(modelo_yelp, freq_comments_yelp, notes_yelp, cv=10)

In [20]:
metrics.accuracy_score(notes_yelp, results_yelp)

0.805

## Combining the datasets

In [23]:
yelp_imdb = [df_imdb, df_yelp]
yelp_imdb_c = pd.concat(yelp_imdb)
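
`pd.concat` keeps each frame's original index by default, so the combined frame carries repeated index labels. That does not affect the `.values` arrays extracted in the next cell, but if label-based indexing is needed later, `ignore_index=True` is one option. A small sketch with made-up frames:

```python
import pandas as pd

first = pd.DataFrame({"Reviews": ["great film", "bad plot"], "Rating": [1, 0]})
second = pd.DataFrame({"Reviews": ["loved this place", "not tasty"], "Rating": [1, 0]})

kept = pd.concat([first, second])                           # original labels repeated
renumbered = pd.concat([first, second], ignore_index=True)  # fresh 0..n-1 index

print(list(kept.index))        # [0, 1, 0, 1]
print(list(renumbered.index))  # [0, 1, 2, 3]
```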

In [24]:
comments_yelp_imdb = yelp_imdb_c["Reviews"].values
notes_yelp_imdb = yelp_imdb_c['Rating'].values

In [25]:
vectorizer_yelp_imdb = CountVectorizer(analyzer="word")
freq_comments_yelp_imdb = vectorizer_yelp_imdb.fit_transform(comments_yelp_imdb)
modelo_yelp_imdb = MultinomialNB()
modelo_yelp_imdb.fit(freq_comments_yelp_imdb, notes_yelp_imdb)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [26]:
teste_3 = ["The structure of this film is easily the most tightly constructed in the history of cinema.",
           "I can think of no other film where something vitally important occurs every other minute.",
           "In other words, the content level of this film is enough to easily fill a dozen other films. ",
           "How can anyone in their right mind ask for anything more from a movie than this?",
           "It's quite simply the highest, most superlative form of cinema imaginable.",
           "But it's just not funny.",
           "But even the talented Carrell can't save this. ",
           "His co-stars don't fare much better, with people like Morgan Freeman, Jonah Hill, and Ed Helms just wasted.",
           "The story itself is just predictable and lazy.",
           "Wasted two hours.",
           "Wow... Loved this place.",
           "Crust is not good.",
           "Not tasty and the texture was just nasty.",
           "Stopped by during the late May bank holiday off Rick Steve recommendation and loved it.",
           "The selection on the menu was great and so were the prices.",
           "Now I am getting angry and I want my damn pho.",
           "Honeslty it didn't taste THAT fresh.)",
           "The potatoes were like rubber and you could tell they had been made up ahead of time being kept under a warmer.",
           "The fries were great too.",
           "A great touch."]

In [27]:
freq_teste_3 = vectorizer_yelp_imdb.transform(teste_3)
modelo_yelp_imdb.predict(freq_teste_3)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1])

In [28]:
results_yelp_imdb = cross_val_predict(modelo_yelp_imdb, freq_comments_yelp_imdb, notes_yelp_imdb, cv=10)

In [29]:
metrics.accuracy_score(notes_yelp_imdb, results_yelp_imdb)

0.7934782608695652

## Predictions for each model

IMDB

In [31]:
freq_amazon = vectorizer_imdb.transform(comments_amazon)
modelo_imdb.predict(freq_amazon)

array([0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0,
       1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1,
       0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0,

In [32]:
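# Note: cross_val_predict refits a clone of the estimator on each fold of the
# Amazon data, so the score below is a cross-validated accuracy on Amazon
# (using the IMDB vocabulary), not the accuracy of the model fitted on IMDB.
results_amazon = cross_val_predict(modelo_imdb, freq_amazon, notes_amazon, cv=10)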

In [36]:
metrics.accuracy_score(notes_amazon, results_amazon)

0.806
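
To measure how a model fitted on one source performs on another, one option is to predict with the already fitted vectorizer and model and score those predictions directly, since `cross_val_predict` fits fresh clones instead. The sketch below uses made-up sentences and illustrative variable names; it is not a rerun of the notebook's experiment.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Made-up "source" reviews used for fitting.
train_texts = ["great movie", "loved the film", "terrible plot", "boring and bad"]
train_labels = [1, 1, 0, 0]

# Made-up "target" reviews from a different domain.
target_texts = ["great phone", "bad battery"]
target_labels = [1, 0]

vectorizer = CountVectorizer(analyzer="word")
model = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

# Reuse the fitted vocabulary and model; nothing is refitted on the target data.
predicted = model.predict(vectorizer.transform(target_texts))
transfer_accuracy = accuracy_score(target_labels, predicted)
print(transfer_accuracy)  # 1.0
```

In the notebook's terms this would correspond to scoring `modelo_imdb.predict(freq_amazon)` against `notes_amazon`.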

Yelp

In [35]:
freq_amazon_y = vectorizer_yelp.transform(comments_amazon)
modelo_yelp.predict(freq_amazon_y)

array([0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0,
       0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,
       0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1,
       0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0,
       1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0,

In [37]:
# cross_val_predict refits a clone on the Amazon folds; modelo_yelp's fitted state is not used.
results_amazon_y = cross_val_predict(modelo_yelp, freq_amazon_y, notes_amazon, cv=10)

In [38]:
metrics.accuracy_score(notes_amazon, results_amazon_y)

0.802

Concatenated dataframes

In [39]:
freq_amazon_yi = vectorizer_yelp_imdb.transform(comments_amazon)
modelo_yelp_imdb.predict(freq_amazon_yi)

array([0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1,
       0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0,
       0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1,
       1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0,

In [40]:
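# cross_val_predict refits a clone on the Amazon folds; modelo_yelp_imdb's fitted state is not used.
results_amazon_yi = cross_val_predict(modelo_yelp_imdb, freq_amazon_yi, notes_amazon, cv=10)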

In [41]:
metrics.accuracy_score(notes_amazon, results_amazon_yi)

0.81

After rounding, the IMDB model and the model built from the concatenated IMDB+Yelp dataframe gave very similar results on the Amazon predictions (0.806 and 0.81), even though the separate datasets did not give very different results on their own. With more data, the model tends to reach slightly better accuracy in training and, consequently, in prediction. Evaluating the models, there is no clear sign of overfitting: the model does not appear to be memorizing the training data, but learning from it.
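
One common way to look for overfitting is to compare accuracy on the training data with accuracy on held-out folds: a large gap suggests memorization. The sketch below does this on made-up sentences with illustrative names; it shows the kind of check meant here and does not reproduce the notebook's numbers.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data: four positive and four negative sentences.
texts = ["great film", "loved it", "really great", "loved the cast",
         "not good", "bad plot", "really bad", "not funny"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = make_pipeline(CountVectorizer(analyzer="word"), MultinomialNB())

# Accuracy on the same sentences the model was fitted on.
train_accuracy = accuracy_score(labels, pipeline.fit(texts, labels).predict(texts))

# Accuracy on sentences held out from each fit.
cv_accuracy = accuracy_score(labels, cross_val_predict(pipeline, texts, labels, cv=2))

gap = train_accuracy - cv_accuracy
print(train_accuracy, cv_accuracy, gap)
```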