***Sentiment analysis***


A relatively recent trend in text analysis goes beyond topic detection and tries to identify the emotion behind a text. This is called sentiment analysis, also known as opinion mining or emotion AI.

For example, the sentence "I love chocolate" is very positive about chocolate as a food. "I hate this new phone" also gives a clear indication of the customer's preferences about the product. In these two cases, the words "love" and "hate" carry a clear sentiment polarity. A more complex case is the sentence "I do not like the new phone", where the positive polarity of "like" is inverted into a negative polarity by the negation. The same goes for "I do not dislike chocolate", where the negation of a negative word such as "dislike" yields a positive sentence.

Sometimes the polarity (that is, the positivity or negativity) of a word depends on the context. "These mushrooms are edible" is a positive statement with respect to health. However, "This steak is edible" is a negative statement about a restaurant. Sometimes the polarity of a word is limited in time, as in "I like to travel, sometimes", where "sometimes" limits the positive polarity of "like". And so on, with subtler examples such as "I do not think this chocolate is really good", or harder still, "Was this show really that fantastic?".

So far we have talked about positive and negative sentiment. However, POSITIVE and NEGATIVE are not the only labels that can be used to describe the sentiment of a sentence. Usually the whole range VERY NEGATIVE, NEGATIVE, NEUTRAL, POSITIVE and VERY POSITIVE is used. Sometimes, however, additional, less obvious labels are also used, such as IRONY, EUPHEMISM, UNCERTAINTY, etc.

How can we extract sentiment from a text? Sometimes even humans are not sure of the real emotion between the lines. Even if we manage to extract the feature associated with sentiment, how can we measure it? There are several approaches, involving NLP, computational linguistics and, finally, text mining. Here we focus on the text mining approaches, of which there are mainly two: a machine learning approach and a lexical approach.

The lexical approach relies on the words in the text and the sentiment they carry. This technique uses NLP concepts and a dictionary to extract the tone of voice.
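As a rough illustration of the lexical approach, the sketch below scores a sentence with a tiny hand-made dictionary and flips the polarity of the next sentiment-bearing word after a negation. The lexicon values, the negation list and the score thresholds are all assumptions made up for this example; a real system would rely on a curated sentiment dictionary.

```python
# Toy lexicon: word -> polarity. Values are illustrative assumptions, not a real dictionary.
LEXICON = {"love": 2, "fantastic": 2, "like": 1, "good": 1,
           "hate": -2, "dislike": -1, "bad": -1}
NEGATIONS = {"not", "no", "never"}


def lexical_score(sentence):
    """Sum word polarities, inverting the next sentiment word after a negation."""
    score = 0
    negate = False
    for token in sentence.lower().split():
        word = token.strip(".,!?\"'")
        if word in NEGATIONS:
            negate = True
            continue
        polarity = LEXICON.get(word, 0)
        if polarity != 0:
            score += -polarity if negate else polarity
            negate = False
    return score


def label(score):
    """Map a numeric score onto the five-level scale (thresholds are assumptions)."""
    if score >= 2:
        return "VERY POSITIVE"
    if score == 1:
        return "POSITIVE"
    if score == 0:
        return "NEUTRAL"
    if score == -1:
        return "NEGATIVE"
    return "VERY NEGATIVE"


for text in ["I love chocolate", "I hate this new phone",
             "I do not like the new phone", "I do not dislike chocolate"]:
    print(text, "->", label(lexical_score(text)))
```

A scorer this simple handles plain negation but not the context-dependent cases ("edible") or the rhetorical question, which is one reason the ML approach is often preferred.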


The ML-based approach needs a collection of documents labeled with sentiment, that is, a collection in which each document has been manually evaluated and labeled in terms of sentiment. After some preprocessing, a supervised ML algorithm is trained to recognize the sentiment of each text.



https://www.infoq.com/br/articles/sentiment-analysis-whats-with-the-tone/

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re # for regex
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB,MultinomialNB,BernoulliNB
from sklearn.metrics import accuracy_score
import pickle # save the trained model

#from nltk.test.portuguese_en_fixt import setup_module # Portuguese stopwords

SENTIMENT ANALYSIS - ML

In [2]:
data = pd.read_csv('/content/drive/MyDrive/1-CIENCIA DE DADOS-CURSOS_ESTUDO DE CASO/ESTUDO DE CASO/ANALISE DE SENTIMENTOS /1_5010757440220693066.csv')
data

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [3]:
print(data.shape)

(50000, 2)


In [5]:
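data.info  # without parentheses this only shows the bound method (and the DataFrame repr); call data.info() for the column summary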

<bound method DataFrame.info of                                                   review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]>

In [6]:
# Count the positive and negative reviews
data.sentiment.value_counts()


# The classes are perfectly balanced

positive    25000
negative    25000
Name: sentiment, dtype: int64

In [9]:
data.review[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the fa

Data Cleaning

     * Remove HTML tags
     * Regex: '<.*?>'
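The `?` in the pattern makes the match non-greedy, which matters as soon as a line contains more than one tag. The sketch below compares it with a greedy variant on a made-up sample string, and also shows an optional variation (an assumption, not what this notebook does) that substitutes a space so that words on either side of a `<br />` do not get glued together.

```python
import re

TAG = re.compile(r"<.*?>")     # non-greedy: each match stops at the first '>'
GREEDY = re.compile(r"<.*>")   # greedy: one match runs to the last '>' on the line

# Made-up sample with text between tags
sample = "A <b>wonderful</b> production.<br /><br />The filming"

no_space = TAG.sub("", sample)     # tags removed, neighbouring words may be joined
greedy = GREEDY.sub("", sample)    # also deletes the text between the first and last tag

# Optional variation: replace tags with a space, then collapse repeated whitespace
normalized = " ".join(TAG.sub(" ", sample).split())

print(repr(no_space))
print(repr(greedy))
print(repr(normalized))
```

Replacing tags with an empty string is harmless for most of the pipeline, but it can merge the last word of one paragraph with the first word of the next, so the space variant may be worth considering.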

In [11]:
# Create a function that strips HTML tags
# re is Python's regular expression module

def clean(text):
  cleaned = re.compile(r'<.*?>')
  return re.sub(cleaned, '',text)

data.review   = data.review.apply(clean)
data.review[0]

"One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.I would say the main appeal of the show is due to the fact that it goes where other shows wo

Remove Special Characters

In [13]:
def is_special(text):
  # Replace every non-alphanumeric character with a space
  return ''.join(ch if ch.isalnum() else ' ' for ch in text)

data.review = data.review.apply(is_special)
data.review

0        One of the other reviewers has mentioned that ...
1        A wonderful little production  The filming tec...
2        I thought this was a wonderful way to spend ti...
3        Basically there s a family where a little boy ...
4        Petter Mattei s  Love in the Time of Money  is...
                               ...                        
49995    I thought this movie did a down right good job...
49996    Bad plot  bad dialogue  bad acting  idiotic di...
49997    I am a Catholic taught in parochial elementary...
49998    I m going to have to disagree with the previou...
49999    No one expects the Star Trek movies to be high...
Name: review, Length: 50000, dtype: object

Converting Everything to Lowercase

   (Lowercase)

In [17]:
def to_lower(text):
    return text.lower()

data.review = data.review.apply(to_lower)  
data.review[0]

'one of the other reviewers has mentioned that after watching just 1 oz episode you ll be hooked  they are right  as this is exactly what happened with me the first thing that struck me about oz was its brutality and unflinching scenes of violence  which set in right from the word go  trust me  this is not a show for the faint hearted or timid  this show pulls no punches with regards to drugs  sex or violence  its is hardcore  in the classic use of the word it is called oz as that is the nickname given to the oswald maximum security state penitentary  it focuses mainly on emerald city  an experimental section of the prison where all the cells have glass fronts and face inwards  so privacy is not high on the agenda  em city is home to many  aryans  muslims  gangstas  latinos  christians  italians  irish and more    so scuffles  death stares  dodgy dealings and shady agreements are never far away i would say the main appeal of the show is due to the fact that it goes where other shows wo

Removing Stopwords

In [18]:
import nltk

nltk.download('stopwords')
nltk.download('punkt')

# Build the stopword set once instead of on every call
stop_words = set(stopwords.words('english'))

def rem_stopwords(text):
  # English stopwords are used here; for Portuguese text, switch to stopwords.words('portuguese')
  words = word_tokenize(text)
  return [w for w in words if w not in stop_words]

data.review = data.review.apply(rem_stopwords)
data.review[0]


# Keeps only the informative words, leaving the text cleaner

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


['one',
 'reviewers',
 'mentioned',
 'watching',
 '1',
 'oz',
 'episode',
 'hooked',
 'right',
 'exactly',
 'happened',
 'first',
 'thing',
 'struck',
 'oz',
 'brutality',
 'unflinching',
 'scenes',
 'violence',
 'set',
 'right',
 'word',
 'go',
 'trust',
 'show',
 'faint',
 'hearted',
 'timid',
 'show',
 'pulls',
 'punches',
 'regards',
 'drugs',
 'sex',
 'violence',
 'hardcore',
 'classic',
 'use',
 'word',
 'called',
 'oz',
 'nickname',
 'given',
 'oswald',
 'maximum',
 'security',
 'state',
 'penitentary',
 'focuses',
 'mainly',
 'emerald',
 'city',
 'experimental',
 'section',
 'prison',
 'cells',
 'glass',
 'fronts',
 'face',
 'inwards',
 'privacy',
 'high',
 'agenda',
 'em',
 'city',
 'home',
 'many',
 'aryans',
 'muslims',
 'gangstas',
 'latinos',
 'christians',
 'italians',
 'irish',
 'scuffles',
 'death',
 'stares',
 'dodgy',
 'dealings',
 'shady',
 'agreements',
 'never',
 'far',
 'away',
 'would',
 'say',
 'main',
 'appeal',
 'show',
 'due',
 'fact',
 'goes',
 'shows',
 'da

Stem the Words

In [20]:
def stem_txt(text):
    ss = SnowballStemmer('english') # a 'portuguese' stemmer is also available
    return " ".join([ss.stem(w) for w in text])

data.review = data.review.apply(stem_txt)
data.review[0]

# Stemming reduces each word to its root, packing the most information into fewer distinct words

'one review mention watch 1 oz episod hook right exact happen first thing struck oz brutal unflinch scene violenc set right word go trust show faint heart timid show pull punch regard drug sex violenc hardcor classic use word call oz nicknam given oswald maximum secur state penitentari focus main emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home mani aryan muslim gangsta latino christian italian irish scuffl death stare dodgi deal shadi agreement never far away would say main appeal show due fact goe show dare forget pretti pictur paint mainstream audienc forget charm forget romanc oz mess around first episod ever saw struck nasti surreal say readi watch develop tast oz got accustom high level graphic violenc violenc injustic crook guard sold nickel inmat kill order get away well manner middl class inmat turn prison bitch due lack street skill prison experi watch oz may becom comfort uncomfort view that get touch darker side'

Model Creation

  -  BOW - Bag of Words

In [21]:
x = np.array(data.iloc[:,0].values)  # review column as an array (not used below)
y = np.array(data.sentiment.values)
cv = CountVectorizer(max_features= 1000)
X = cv.fit_transform(data.review).toarray()
print("X.shape =", X.shape)
print("y.shape =", y.shape)


# Build the feature matrix X (word counts) and the label vector y


X.shape = (50000, 1000)
y.shape = (50000,)
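To make the (50000, 1000) shape more concrete, here is a minimal pure-Python sketch of the bag-of-words idea: keep the most frequent terms of the corpus, give each one a column, and count occurrences per document. The helper names `fit_vocabulary` and `transform` are hypothetical, and the sketch splits on whitespace, so it only approximates `CountVectorizer` (whose default tokenizer, for instance, ignores single-character tokens and may break frequency ties differently).

```python
from collections import Counter


def fit_vocabulary(docs, max_features):
    """Keep the max_features most frequent terms and map each to a column index."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Most frequent first; alphabetical order breaks ties deterministically
    top = sorted(totals, key=lambda w: (-totals[w], w))[:max_features]
    # Columns are assigned in alphabetical order of the kept terms
    return {word: idx for idx, word in enumerate(sorted(top))}


def transform(doc, vocabulary):
    """Turn one document into a fixed-length row of word counts."""
    row = [0] * len(vocabulary)
    for word in doc.split():
        idx = vocabulary.get(word)
        if idx is not None:  # words outside the vocabulary are ignored
            row[idx] += 1
    return row


# Made-up, already stemmed documents
docs = ["good movi good act", "bad movi bad plot", "good plot"]
vocab = fit_vocabulary(docs, max_features=4)
matrix = [transform(d, vocab) for d in docs]

print(vocab)
for row in matrix:
    print(row)
```

Every document becomes a row of the same length regardless of how long the text is, which is what lets the classifiers below treat the reviews as a regular numeric matrix.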


In [22]:
# Inspect the labels

print(y)

['positive' 'positive' 'positive' ... 'negative' 'negative' 'negative']


Train Test Split

In [26]:
trainx, testx, trainy, testy = train_test_split(X,y, test_size=0.2, random_state=9)
print("Train Shapes: X = {}, y = {}".format(trainx.shape, trainy.shape))
print("Test Shapes: X = {}, y = {}". format(testx.shape, testy.shape))


# Now we have one set for training and one for testing

Train Shapes: X = (40000, 1000), y = (40000,)
Test Shapes: X = (10000, 1000), y = (10000,)
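The idea behind the split can be sketched in a few lines: shuffle the row indices with a fixed seed, then hold out a fraction of them for testing. The function `split_indices` is a hypothetical helper for illustration; it will not pick the same rows as `train_test_split`, it only shows why a fixed `random_state` makes the split reproducible.

```python
import random


def split_indices(n_samples, test_size=0.2, seed=9):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(50000)
print(len(train_idx), len(test_idx))

# Same seed, same split: results can be reproduced run after run
assert split_indices(50000) == (train_idx, test_idx)
```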


Define the models and train them

In [28]:

# alpha=1.0 (Laplace smoothing) and fit_prior=True are scikit-learn's default values, written out explicitly here
gnb,mnb,bnb = GaussianNB(),MultinomialNB(alpha=1.0,fit_prior=True),BernoulliNB(alpha=1.0,fit_prior=True)
gnb.fit(trainx,trainy)
mnb.fit(trainx,trainy)
bnb.fit(trainx,trainy)

BernoulliNB()
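The three variants differ in how they read the count matrix: Gaussian treats each count as a continuous value, Multinomial uses the counts themselves, and Bernoulli only looks at whether a word is present or absent. The sketch below is a small pure-Python version of the Bernoulli idea with add-alpha smoothing, written for illustration under those assumptions; the function names and toy data are made up, and it is not a substitute for `BernoulliNB`.

```python
import math


def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed P(word present | class) from count rows."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == c]
        # Number of documents of class c in which each word appears at least once
        present = [sum(1 for row in rows if row[j] > 0) for j in range(n_features)]
        probs = [(p + alpha) / (len(rows) + 2 * alpha) for p in present]
        model[c] = (math.log(len(rows) / len(X)), probs)
    return model


def predict_bernoulli_nb(model, row):
    """Pick the class with the highest log prior plus presence/absence log-likelihood."""
    best_class, best_score = None, -math.inf
    for c, (log_prior, probs) in model.items():
        score = log_prior
        for count, p in zip(row, probs):
            # Absent words contribute too, through log(1 - p)
            score += math.log(p) if count > 0 else math.log(1 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy data; columns stand for the words: bad, good, movi, plot
X_toy = [[0, 2, 1, 0], [0, 1, 0, 1], [2, 0, 1, 1], [1, 0, 0, 1]]
y_toy = ["positive", "positive", "negative", "negative"]

model = fit_bernoulli_nb(X_toy, y_toy)
print(predict_bernoulli_nb(model, [0, 3, 0, 0]))
print(predict_bernoulli_nb(model, [1, 0, 1, 0]))
```

Because only presence matters, a review that repeats "good" three times is scored the same as one that uses it once, which can help when a few words are repeated heavily.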

Predictions and accuracy metrics to choose the best model

In [30]:
ypg = gnb.predict(testx)
ypm = mnb.predict(testx)
ypb = bnb.predict(testx)

Compute the accuracy to find the best model

In [31]:
print("Gaussian = ",accuracy_score(testy,ypg))
print("Multinomial = ",accuracy_score(testy,ypm))
print("Bernoulli = ",accuracy_score(testy,ypb))

Gaussian =  0.7843
Multinomial =  0.831
Bernoulli =  0.8386


The best model in terms of accuracy is Bernoulli (0.8386).
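Accuracy is simply the share of test reviews whose predicted label matches the true one, so a score of 0.8386 on 10,000 test reviews corresponds to 8,386 correct predictions. The sketch below computes it by hand on made-up labels and also counts each (true, predicted) pair, which shows where the errors fall. The helper names and the sample labels are assumptions for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


# Made-up labels, not results from this notebook
y_true = ["positive", "positive", "negative", "negative", "negative"]
y_pred = ["positive", "negative", "negative", "negative", "positive"]

print(accuracy(y_true, y_pred))
print(confusion_counts(y_true, y_pred))
```

Accuracy is a reasonable single number here because the two classes are balanced; on a skewed dataset the per-pair counts would be more informative.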


Save

In [32]:
# Use a context manager so the file is closed after writing
with open('model1.pkl', 'wb') as f:
    pickle.dump(bnb, f)
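Saving is only half of the round trip, so here is a minimal sketch of writing an object with `pickle` and reading it back. A plain dictionary stands in for the fitted model and a temporary directory stands in for the working folder; both are assumptions so that the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted model
artifact = {"name": "stand-in for a fitted model", "alpha": 1.0}

# Temporary location so the example does not touch real files
path = os.path.join(tempfile.mkdtemp(), "model1.pkl")

with open(path, "wb") as f:
    pickle.dump(artifact, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == artifact)
```

Two practical points: only unpickle files from a source you trust, since loading a pickle can execute arbitrary code, and keep the fitted vectorizer (or its vocabulary) alongside the model, because new text has to be encoded with the same columns.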

Using the Predictive Model to Analyze a Customer's Sentiment

TEST:

In [33]:
rev =  """Terrible. Complete trash. Brainless tripe. Insulting to anyone who isn't an 8 year old fan boy. Im actually pretty disgusted that this movie is making the money it is - what does it say about the people who brainlessly hand over the hard earned cash to be 'entertained' in this fashion and then come here to leave a positive 8.8 review?? Oh yes, they are morons. Its the only sensible conclusion to draw. How anyone can rate this movie amongst the pantheon of great titles is beyond me.

So trying to find something constructive to say about this title is hard...I enjoyed Iron Man? Tony Stark is an inspirational character in his own movies but here he is a pale shadow of that...About the only 'hook' this movie had into me was wondering when and if Iron Man would knock Captain America out...Oh how I wished he had :( What were these other characters anyways? Useless, bickering idiots who really couldn't organise happy times in a brewery. The film was a chaotic mish mash of action elements and failed 'set pieces'...

I found the villain to be quite amusing.

And now I give up. This movie is not robbing any more of my time but I felt I ought to contribute to restoring the obvious fake rating and reviews this movie has been getting on IMDb."""

f1 = clean(rev)          # strip HTML tags
f2 = is_special(f1)      # replace special characters with spaces
f3 = to_lower(f2)        # convert to lowercase
f4 = rem_stopwords(f3)   # remove stopwords
f5 = stem_txt(f4)        # stem the remaining words

# Save the fitted vocabulary (word -> column index) for later reuse
word_dict = cv.vocabulary_
with open('bow.pkl', 'wb') as f:
    pickle.dump(word_dict, f)

# Encode the review with the same fitted CountVectorizer used for training,
# so that every word count lands in the column the model expects
inp = cv.transform([f5]).toarray()
y_pred = bnb.predict(inp)

Result:

In [34]:
y_pred

# The review was classified as negative

array(['negative'], dtype='<U8')
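A new review only gets a meaningful prediction if it goes through the same cleaning steps, in the same order, and is encoded with the same column layout as the training data. The sketch below bundles that inference path into two small functions. The stopword set and the vocabulary are tiny made-up stand-ins, stemming is left out for brevity, and the names `preprocess` and `encode` are hypothetical.

```python
import re

TAG = re.compile(r"<.*?>")
# Tiny illustrative stopword set, not NLTK's English list
STOP_WORDS = {"the", "a", "is", "this", "to", "and"}


def preprocess(text):
    """Apply the cleaning steps in order: tags, special characters, lowercase, stopwords."""
    text = TAG.sub("", text)
    text = "".join(ch if ch.isalnum() else " " for ch in text)
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]


def encode(tokens, vocabulary):
    """Count each known token in the column given by the vocabulary."""
    row = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:  # unseen words are dropped
            row[vocabulary[token]] += 1
    return row


# Made-up vocabulary: word -> column index
vocabulary = {"complete": 0, "great": 1, "terrible": 2, "trash": 3}

tokens = preprocess("Terrible.<br />Complete trash!")
print(tokens)
print(encode(tokens, vocabulary))
```

Keeping these steps in one place makes it harder for the training and prediction paths to drift apart as the preprocessing changes.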