# Undergraduate thesis
## Comparison of sentiment analysis on Twitter/Reddit posts related to stocks

# Part 3 - Sentiment processing

## Setting up the environment

In [None]:
# Helpers

!pip install timely --quiet



In [None]:
# Sentiment Libs

!pip install afinn --quiet
# !pip install pattern --quiet
# !pip install stanza --quiet
!pip install transformers --quiet

try:
  import polyglot
except ImportError:
  !pip3 install -U git+https://github.com/aboSamoor/polyglot.git@master --quiet



In [None]:
import math

import pandas as pd  # needed below for pd.read_csv and pd.to_datetime
from timely import Stopwatch
from datetime import datetime, timedelta


In [None]:
# from google.colab import drive

# drive.mount('/content/drive')

Mounted at /content/drive


## Analysis

In [None]:
dfDados = pd.read_csv(
  'tsla_2019_clean.csv',
  sep = ',',
  lineterminator = '\n',
  index_col = 0,
  dtype = {
    'Unnamed: 0': 'str',
    'Created At': 'str',
    'Name': 'str',
    'Text': 'str',
    'Source': 'str',
    'CleanText': 'str'
  }
)
dfDados['Created At'] = pd.to_datetime(dfDados['Created At'])
dfDados.dtypes

Created At    datetime64[ns]
Name                  object
Text                  object
Source                object
CleanText             object
dtype: object

### Sentimentos

* [NLTK/VADER](https://pypi.org/project/nltk/) (Rule-based Model): Python package for natural language processing;
* [Textblob](https://pypi.org/project/textblob/): Library for processing textual data;
* [AFINN](https://pypi.org/project/afinn/): Wordlist-based approach for sentiment analysis;
* [Polyglot](https://pypi.org/project/polyglot/): Polyglot is a natural language pipeline that supports massive multilingual applications;
* [Pattern](https://github.com/clips/pattern/) *: Pattern is a web mining module for Python;
* [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) *: CoreNLP is your one stop shop for natural language processing in Java!;
* [Transformers](https://huggingface.co/) **: Build, train and deploy state of the art models powered by the reference open source in natural language processing;

\* Not in use; the code is commented out.
\** Processing is slow; it hit the Colab timeout.

<br>

---
#### Types of methods
1 - Lexicon-based method \
2 - Machine Learning method

* **NLTK/VADER**: 1 - Rule-based model;
* **Textblob**: 1 and 2 - Two sentiment analysis implementations: PatternAnalyzer (based on the Pattern library) and NaiveBayesAnalyzer (an NLTK classifier trained on a movie reviews corpus);
* **AFINN**: 1 - Wordlist-based approach for sentiment analysis;
* **Polyglot**: 1 - Uses a word-polarity scale with three levels: +1 for positive words, -1 for negative words, and 0 for neutral words;
* **Pattern**: 1 - Assigns the text a sentiment score between -1 and 1 based on its most frequently occurring positive and negative adjectives;
* **Stanford CoreNLP**: 2 - SentimentAnnotator implements Socher et al.'s sentiment model. It attaches a binarized tree of the sentence to the sentence-level CoreMap; the nodes of the tree contain the annotations from RNNCoreAnnotations indicating the predicted class and scores for that subtree;
* **Transformers**: 2 - In 2017, researchers at Google introduced the transformer architecture, which is considerably more efficient than its recurrent predecessors. First, it processes all tokens of a sentence together rather than one word at a time. Second, its multi-head attention mechanism lets each word attend to every other word in the sentence, which preserves context and the relationships between words; several attention heads run in parallel so that different kinds of relationships are captured. Finally, in a sentiment model, a feed-forward classification layer on top of the encoder turns the resulting representation into a sentiment (or polarity) prediction.
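
To make the lexicon-based category (method 1) concrete, the sketch below scores a text by summing per-word valences. `TOY_LEXICON` and `lexicon_score` are hypothetical names made up for this illustration; the real lexicons used by AFINN or VADER are far larger, and VADER adds rules on top of the lookup.

```python
import re

# Hypothetical toy lexicon (word -> valence); not taken from any library.
TOY_LEXICON = {"good": 2, "great": 3, "bad": -2, "terrible": -3, "crash": -2}

def lexicon_score(text):
    """Sum the valence of every known word; unknown words count as 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return float(sum(TOY_LEXICON.get(token, 0) for token in tokens))

print(lexicon_score("Great quarter, good margins"))
print(lexicon_score("TSLA opens flat"))
```

A machine-learning method (method 2) replaces the fixed lookup table with a model trained on labelled examples, which is why it can react to word order and context but costs more to run.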

In [None]:
def calculateSentiment(df, name, calculateFunc, resultFunc):
  # Keep at most 512 characters per text (Transformers limit)
  listTexts = df['CleanText'].apply(lambda x: str(x)[:512])

  # Time only the scoring step
  with Stopwatch() as s:
    listScores = listTexts.apply(calculateFunc)
  print(f'Took {s.duration()}\n')

  # Store the raw score in `name` and its label in `{name}Score`
  df[name] = listScores
  df[f'{name}Score'] = df[name].apply(resultFunc)
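
The truncate, score, label flow of `calculateSentiment` can be tried without pandas or `timely`. The sketch below is a stand-in under those assumptions: `score_texts` is a hypothetical helper, timing uses the standard library's `time.perf_counter`, and the scorer (counting exclamation marks) is only a placeholder for a real library call.

```python
import time

def score_texts(texts, score_fn, label_fn, max_chars=512):
    """Truncate, score and label each text; return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = []
    for text in texts:
        clipped = str(text)[:max_chars]  # same character cut as the notebook
        score = score_fn(clipped)
        rows.append((score, label_fn(score)))
    elapsed = time.perf_counter() - start
    return rows, elapsed

# Placeholder scorer and labeller, for illustration only
rows, elapsed = score_texts(
    ["TSLA to the moon!!", "meh"],
    score_fn=lambda t: float(t.count("!")),
    label_fn=lambda s: "Positive" if s > 0 else "Neutral",
)
print(rows, f"took {elapsed:.6f}s")
```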

In [None]:
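def getDefaultTextAnalysis(value):
  # Map a numeric polarity score to a label by its sign.
  # Non-numeric values and NaN return None.
  if isinstance(value, (int, float)) and not math.isnan(value):
    if value < 0:
      return 'Negative'
    elif value == 0:
      return 'Neutral'
    else:
      return 'Positive'
  return None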

#### Nltk

In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


In [None]:
calculateSentiment(
  dfDados,
  'nltkCompound',
  lambda text: sia.polarity_scores(text)['compound'],
  getDefaultTextAnalysis
)
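
The cell above labels a text as neutral only when the compound score is exactly 0. The VADER authors suggest a small neutral band instead (compound >= 0.05 positive, <= -0.05 negative). The sketch below shows that alternative; `label_vader_compound` is a hypothetical helper and is not what this notebook uses.

```python
def label_vader_compound(compound, threshold=0.05):
    """Label a VADER compound score using a symmetric neutral band."""
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'

for value in (0.6, 0.01, -0.3):
    print(value, label_vader_compound(value))
```

It could be passed to `calculateSentiment` in place of `getDefaultTextAnalysis`, at the cost of making the NLTK labels follow a different rule from the other libraries.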

#### Textblob

In [None]:
from textblob import TextBlob # TextBlob - Python library for processing textual data
from textblob import Blobber
from textblob.sentiments import NaiveBayesAnalyzer

nltk.download('movie_reviews')
nltk.download('punkt')

textBlobber = Blobber(analyzer=NaiveBayesAnalyzer())

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
calculateSentiment(
  dfDados,
  'textblob',
  lambda text: TextBlob(text).sentiment.polarity,
  getDefaultTextAnalysis
)

In [None]:
def getTextblobNaiveBayesTextAnalysis(value):
  if value == 'neg':
    return 'Negative'
  elif value == 'pos':
    return 'Positive'
  else:
    return 'Neutral'

In [None]:
calculateSentiment(
  dfDados,
  'textblobNaiveBayes',
  lambda text: textBlobber(str(text)).sentiment.classification,
  getTextblobNaiveBayesTextAnalysis
)
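
TextBlob's `NaiveBayesAnalyzer` is a two-class classifier: its `sentiment` result carries `classification` (`'pos'` or `'neg'`) together with `p_pos` and `p_neg`, so the `'Neutral'` branch above is not expected to be reached. If a neutral class is wanted, one option is to treat near-even probabilities as neutral. The sketch below assumes that approach; `label_from_p_pos` and its `margin` default are hypothetical choices, not part of TextBlob.

```python
def label_from_p_pos(p_pos, margin=0.1):
    """Treat probabilities within `margin` of 0.5 as neutral."""
    if abs(p_pos - 0.5) < margin:
        return 'Neutral'
    return 'Positive' if p_pos > 0.5 else 'Negative'

# In the notebook, p_pos would come from textBlobber(text).sentiment.p_pos
for p in (0.9, 0.55, 0.2):
    print(p, label_from_p_pos(p))
```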

#### Afinn

In [None]:
from afinn import Afinn

afinn = Afinn()

In [None]:
calculateSentiment(
  dfDados,
  'afinn',
  lambda text: afinn.score(text),
  getDefaultTextAnalysis
)
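
AFINN's score is a sum of per-word valences, so longer posts can reach larger magnitudes than short ones. The sign, and therefore the label from `getDefaultTextAnalysis`, is unaffected, but if the raw scores need to be comparable across post lengths, dividing by the word count is one option. This is only a sketch and is not applied in this notebook; `normalized_score` is a hypothetical helper.

```python
def normalized_score(score, text):
    """Average valence per whitespace-separated word (0.0 for empty text)."""
    word_count = len(text.split())
    return score / word_count if word_count else 0.0

print(normalized_score(6.0, "a b c"))
print(normalized_score(0.0, ""))
```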

#### Polyglot

In [None]:
from polyglot.downloader import downloader
from polyglot.text import Text

downloader.supported_tasks(lang = 'en');
downloader.download('sentiment2.en')

[polyglot_data] Downloading package sentiment2.en to
[polyglot_data]     /root/polyglot_data...


True

In [None]:
calculateSentiment(
  dfDados,
  'polyglot',
  lambda text: (Text(text, hint_language_code = 'en')).polarity,
  getDefaultTextAnalysis
)
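
Because every scorer runs inside a single `apply`, one text that a library cannot handle would abort the whole run. Whether Polyglot raises on a given input depends on the installed version and language data, so the following is only a defensive sketch: `safe_scorer` is a hypothetical wrapper, and `fragile` is a stand-in for a scorer that fails on some inputs.

```python
def safe_scorer(score_fn, default=None):
    """Wrap score_fn so that an exception on one text yields `default`."""
    def wrapped(text):
        try:
            return score_fn(text)
        except Exception:
            return default
    return wrapped

def fragile(text):
    # Stand-in scorer: fails on empty input
    return 1.0 / len(text)

guarded = safe_scorer(fragile)
print(guarded("ab"), guarded(""))
```

A `None` default fits the notebook because `getDefaultTextAnalysis` returns `None` for non-numeric values, so failed rows stay visibly unlabelled instead of being counted as neutral.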

#### Pattern (not used)

In [None]:
# from pattern.en import sentiment

In [None]:
# calculateSentiment(
#   dfDados,
#   'pattern',
#   lambda text: sentiment(text)[0],
#   getDefaultTextAnalysis
# )

#### Stanford CoreNLP (not used)

In [None]:
# import stanza

# # Download the Stanford CoreNLP package with Stanza's installation command
# # This'll take several minutes, depending on the network speed
# corenlp_dir = './corenlp'
# stanza.install_corenlp(dir=corenlp_dir)

# # Set the CORENLP_HOME environment variable to point to the installation location
# import os
# os.environ["CORENLP_HOME"] = corenlp_dir

# from stanza.server import CoreNLPClient



In [None]:
# # Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001
# client = CoreNLPClient(
#   annotators = ['sentiment'],
#   outputFormat = 'json',
#   memory = '4G', 
#   endpoint = 'http://localhost:9001',
#   be_quiet = True
# )
# # print(client)



In [None]:
# calculateSentiment(
#   dfDados,
#   'coreNLP',
#   lambda text: (client.annotate(text)).sentence[0].sentiment,
#   getDefaultTextAnalysis
# )

In [None]:
# # Shut down the background CoreNLP server
# client.stop()

#### Transformers

In [None]:
from transformers import pipeline


In [None]:
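def getBertTextAnalysis(value):
  # The default sentiment-analysis pipeline returns 'NEGATIVE' / 'POSITIVE'
  if value == 'NEGATIVE':
    return 'Negative'
  elif value == 'POSITIVE':
    return 'Positive'
  else:
    return 'Neutral'

def getNlptownTextAnalysis(value):
  # nlptown/bert-base-multilingual-uncased-sentiment labels are star ratings
  # ('1 star' ... '5 stars'): below 3 is negative, above 3 is positive
  stars = int(value[0])
  if stars < 3:
    return 'Negative'
  elif stars > 3:
    return 'Positive'
  else:
    return 'Neutral'

def getRobertaTextAnalysis(value):
  # aychang/roberta-base-imdb is a binary classifier; its labels are assumed
  # to start with 'neg' / 'pos' (check the model's id2label config)
  label = str(value).lower()
  if label.startswith('neg'):
    return 'Negative'
  elif label.startswith('pos'):
    return 'Positive'
  else:
    return 'Neutral'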

{'label': 'NEGATIVE', 'score': 0.999782383441925}

In [None]:
def calculateTransformerSentiment(df, name, resultFunc):
  # Uses the global `classifier` pipeline created in the cell that calls this function
  listTexts = df['CleanText'].apply(lambda x: str(x)[:512]) # Transformers limit

  listLabels = []
  listScores = []

  with Stopwatch() as s:
    for i, text in enumerate(listTexts, start=1):
      classification = classifier(text)[0]
      classificationLabel = classification['label']
      classificationScore = classification['score']

      listLabels.append(classificationLabel)
      listScores.append(classificationScore)

      print(f'{i} - {classificationLabel} - {classificationScore}')
  print(f'Took {s.duration()}\n')

  # listLabels is a plain Python list (it has no .apply), so map it with a comprehension
  df[name] = listScores
  df[f'{name}Score'] = [resultFunc(label) for label in listLabels]
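
The loop above calls the pipeline once per text and prints every result, which is part of why this step is slow. Hugging Face pipelines also accept a list of strings and return one result dict per input, so the texts can be sent in chunks. The sketch below assumes that calling convention; `classify_in_batches` is a hypothetical helper, `texts` is assumed to be a plain Python list, and `fake_classifier` only stands in for a real pipeline so the example runs without downloading a model. Whether chunking is actually faster depends on the hardware and pipeline settings.

```python
def classify_in_batches(classifier, texts, batch_size=32):
    """Call `classifier` on lists of texts and collect labels and scores."""
    labels, scores = [], []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for result in classifier(batch):  # one dict per input text
            labels.append(result['label'])
            scores.append(result['score'])
    return labels, scores

def fake_classifier(batch):
    # Stand-in for a transformers pipeline: same input/output shape, no model
    return [
        {'label': 'POSITIVE' if 'up' in text else 'NEGATIVE', 'score': 1.0}
        for text in batch
    ]

labels, scores = classify_in_batches(
    fake_classifier, ['up 5%', 'down 3%', 'up again'], batch_size=2
)
print(labels, scores)
```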

In [None]:
classifier = pipeline('sentiment-analysis')

calculateTransformerSentiment(
  dfDados,
  'transformer',
  getBertTextAnalysis,
)

In [None]:
classifier = pipeline('sentiment-analysis', model="nlptown/bert-base-multilingual-uncased-sentiment")

calculateTransformerSentiment(
  dfDados,
  'transformerNlptown',
  getNlptownTextAnalysis,
)

In [None]:
classifier = pipeline('sentiment-analysis', model="aychang/roberta-base-imdb")

calculateTransformerSentiment(
  dfDados,
  'transformerRoberta',
  getRobertaTextAnalysis,
)

In [None]:
dfDados.to_csv('tsla_2019_process.csv')