# Description

This notebook builds a pipeline for training a Naive Bayes classifier with the NLTK library in Python to classify the emotion expressed in a sentence. The dataset is a generic one covering six emotions in Portuguese.

For text processing we use stop word removal and stemming, and then build a Naive Bayes classification model. Naive Bayes is a statistical model whose training consists of building a frequency table, which is then used to estimate the probability that a sentence belongs to a given class. It is a classical algorithm widely used in text classification problems.

The code is organized into several functions that encapsulate each step as much as possible, which also helps keep it efficient and reproducible.

The notebook is divided into two parts. In the first part I do an EDA to better understand the NLTK library and the stop word and stemming strategies used to process text data. In the second part I build the Naive Bayes classification model.

In [1]:
# Load nltk library
import nltk

# 1. NLTK, Stop Words and Stemming

In [2]:
# Example dataset that is already preprocessed.

database = [('eu sou admirada por muitos','alegria'),
        ('me sinto completamente amado','alegria'),
        ('amar e maravilhoso','alegria'),
        ('estou me sentindo muito animado novamente','alegria'),
        ('eu estou muito bem hoje','alegria'),
        ('que belo dia para dirigir um carro novo','alegria'),
        ('o dia está muito bonito','alegria'),
        ('estou contente com o resultado do teste que fiz no dia de ontem','alegria'),
        ('o amor e lindo','alegria'),
        ('nossa amizade e amor vai durar para sempre', 'alegria'),
        ('estou amedrontado', 'medo'),
        ('ele esta me ameacando a dias', 'medo'),
        ('isso me deixa apavorada', 'medo'),
        ('este lugar e apavorante', 'medo'),
        ('se perdermos outro jogo seremos eliminados e isso me deixa com pavor', 'medo'),
        ('tome cuidado com o lobisomem', 'medo'),
        ('se eles descobrirem estamos encrencados', 'medo'),
        ('estou tremendo de medo', 'medo'),
        ('eu tenho muito medo dele', 'medo'),
        ('estou com medo do resultado dos meus testes', 'medo')]

In [3]:
# visualizing database
print(database)

[('eu sou admirada por muitos', 'alegria'), ('me sinto completamente amado', 'alegria'), ('amar e maravilhoso', 'alegria'), ('estou me sentindo muito animado novamente', 'alegria'), ('eu estou muito bem hoje', 'alegria'), ('que belo dia para dirigir um carro novo', 'alegria'), ('o dia está muito bonito', 'alegria'), ('estou contente com o resultado do teste que fiz no dia de ontem', 'alegria'), ('o amor e lindo', 'alegria'), ('nossa amizade e amor vai durar para sempre', 'alegria'), ('estou amedrontado', 'medo'), ('ele esta me ameacando a dias', 'medo'), ('isso me deixa apavorada', 'medo'), ('este lugar e apavorante', 'medo'), ('se perdermos outro jogo seremos eliminados e isso me deixa com pavor', 'medo'), ('tome cuidado com o lobisomem', 'medo'), ('se eles descobrirem estamos encrencados', 'medo'), ('estou tremendo de medo', 'medo'), ('eu tenho muito medo dele', 'medo'), ('estou com medo do resultado dos meus testes', 'medo')]


In [4]:
# Printing the first record
print(database[0])

('eu sou admirada por muitos', 'alegria')


### Stop Words

Stop words are usually removed before training NLP machine learning models, because they carry little information that is relevant to the task and can add noise to the data.

In [5]:
# Stop words
stopwords = ['a', 'agora', 'algum', 'alguma', 'aquele', 'aqueles', 'de', 'deu', 'do', 'e', 'estou', 'esta', 'esta',
             'ir', 'meu', 'muito', 'mesmo', 'no', 'nossa', 'o', 'outro', 'para', 'que', 'sem', 'talvez', 'tem', 'tendo',
             'tenha', 'teve', 'tive', 'todo', 'um', 'uma', 'umas', 'uns', 'vou']

In [6]:
# NLTK stopwords
stopwords_nltk = nltk.corpus.stopwords.words('portuguese')
print(len(stopwords_nltk))

207


The NLTK library ships with a standard stop word list for Portuguese, and more stop words can be added to it when the analysis calls for that.
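
As a minimal sketch of extending a stop word list, the example below uses a short hand-written list as a stand-in for `stopwords_nltk`, so that it runs without downloading the NLTK corpus. The extra words `dia` and `hoje` are hypothetical choices for illustration, not a recommendation.

```python
# Stand-in for stopwords_nltk; the real list comes from nltk.corpus.stopwords
base_stopwords = ['a', 'de', 'e', 'o', 'que', 'muito']

# Hypothetical domain-specific additions
extra_stopwords = ['dia', 'hoje']

# A set gives fast membership tests and silently drops duplicates
custom_stopwords = set(base_stopwords) | set(extra_stopwords)

phrase = 'o dia está muito bonito'
tokens = [word for word in phrase.split() if word not in custom_stopwords]
print(tokens)  # ['está', 'bonito']
```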

In [7]:
# Function to remove stop words 
def remove_stopwords(text):
    phrases = []
    for (words, emotions) in text:
        without_stopwords = [p for p in words.split() if p not in stopwords_nltk]
        phrases.append((without_stopwords, emotions))
    return phrases

print(remove_stopwords(database))

[(['admirada', 'muitos'], 'alegria'), (['sinto', 'completamente', 'amado'], 'alegria'), (['amar', 'maravilhoso'], 'alegria'), (['sentindo', 'animado', 'novamente'], 'alegria'), (['bem', 'hoje'], 'alegria'), (['belo', 'dia', 'dirigir', 'carro', 'novo'], 'alegria'), (['dia', 'bonito'], 'alegria'), (['contente', 'resultado', 'teste', 'fiz', 'dia', 'ontem'], 'alegria'), (['amor', 'lindo'], 'alegria'), (['amizade', 'amor', 'vai', 'durar', 'sempre'], 'alegria'), (['amedrontado'], 'medo'), (['ameacando', 'dias'], 'medo'), (['deixa', 'apavorada'], 'medo'), (['lugar', 'apavorante'], 'medo'), (['perdermos', 'outro', 'jogo', 'eliminados', 'deixa', 'pavor'], 'medo'), (['tome', 'cuidado', 'lobisomem'], 'medo'), (['descobrirem', 'encrencados'], 'medo'), (['tremendo', 'medo'], 'medo'), (['medo'], 'medo'), (['medo', 'resultado', 'testes'], 'medo')]


### Stemming
Stemming reduces each word to its root, which lowers the dimensionality of the data.
Some information can be lost in the process, so this method needs to be applied with care.
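
The information loss can be seen in this notebook's own data, where several different words collapse to the same root. The sketch below groups words by stem. The word-to-stem pairs are copied from the RSLP output shown in this notebook rather than computed, so that the example runs without the stemmer data.

```python
from collections import defaultdict

# (word, stem) pairs taken from this notebook's RSLP output
word_to_stem = {
    'amado': 'am', 'amar': 'am', 'amor': 'am',
    'apavorada': 'apavor', 'apavorante': 'apavor',
    'dia': 'dia', 'dias': 'dia',
    'medo': 'med',
}

# Group the original words under the root they collapse to
words_by_stem = defaultdict(list)
for word, stem in word_to_stem.items():
    words_by_stem[stem].append(word)

for stem, words in words_by_stem.items():
    print(stem, words)
```

After stemming, "amado", "amar" and "amor" can no longer be told apart by the model.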


In [8]:
# visualizing stemmers for nltk
stemmer = nltk.stem.RSLPStemmer()
print(stemmer.stem(word="livro"))
print(stemmer.stem(word="estudar"))

livr
estud


In [9]:
# Function to remove stop words and apply stemming

# NLTK offers several stemmers; RSLP is designed for Portuguese
def apply_stemming(text):
    stemmer = nltk.stem.RSLPStemmer()
    phrases_stemming = []
    for (words, emotions) in text:
        with_stemming = [str(stemmer.stem(p)) for p in words.split() if p not in stopwords_nltk]
        phrases_stemming.append((with_stemming, emotions))
    
    return phrases_stemming

phrases_stemming = apply_stemming(database)
phrases_stemming

[(['admir', 'muit'], 'alegria'),
 (['sint', 'complet', 'am'], 'alegria'),
 (['am', 'maravilh'], 'alegria'),
 (['sent', 'anim', 'nov'], 'alegria'),
 (['bem', 'hoj'], 'alegria'),
 (['bel', 'dia', 'dirig', 'carr', 'nov'], 'alegria'),
 (['dia', 'bonit'], 'alegria'),
 (['cont', 'result', 'test', 'fiz', 'dia', 'ont'], 'alegria'),
 (['am', 'lind'], 'alegria'),
 (['amizad', 'am', 'vai', 'dur', 'sempr'], 'alegria'),
 (['amedront'], 'medo'),
 (['ameac', 'dia'], 'medo'),
 (['deix', 'apavor'], 'medo'),
 (['lug', 'apavor'], 'medo'),
 (['perd', 'outr', 'jog', 'elimin', 'deix', 'pav'], 'medo'),
 (['tom', 'cuid', 'lobisom'], 'medo'),
 (['descobr', 'encrenc'], 'medo'),
 (['trem', 'med'], 'medo'),
 (['med'], 'medo'),
 (['med', 'result', 'test'], 'medo')]

### List database's words

In [10]:
def search_words(phrase):
    allwords = []
    for (words, emotions) in phrase:
        allwords.extend(words)
    return allwords

allwords = search_words(phrases_stemming)
allwords


['admir',
 'muit',
 'sint',
 'complet',
 'am',
 'am',
 'maravilh',
 'sent',
 'anim',
 'nov',
 'bem',
 'hoj',
 'bel',
 'dia',
 'dirig',
 'carr',
 'nov',
 'dia',
 'bonit',
 'cont',
 'result',
 'test',
 'fiz',
 'dia',
 'ont',
 'am',
 'lind',
 'amizad',
 'am',
 'vai',
 'dur',
 'sempr',
 'amedront',
 'ameac',
 'dia',
 'deix',
 'apavor',
 'lug',
 'apavor',
 'perd',
 'outr',
 'jog',
 'elimin',
 'deix',
 'pav',
 'tom',
 'cuid',
 'lobisom',
 'descobr',
 'encrenc',
 'trem',
 'med',
 'med',
 'med',
 'result',
 'test']

### Extract unique words

We now count how often each word occurs in allwords and then keep each word only once.
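
`nltk.FreqDist` is a subclass of `collections.Counter`, so the idea can be sketched with the standard library alone. The short word list below is a hypothetical excerpt standing in for `allwords`.

```python
from collections import Counter

# Hypothetical excerpt standing in for allwords
sample_words = ['am', 'dia', 'am', 'med', 'dia', 'am']

# Count how many times each word occurs
counts = Counter(sample_words)

# The keys are the unique words, in order of first appearance
unique = list(counts.keys())

print(counts.most_common(2))  # [('am', 3), ('dia', 2)]
print(unique)                 # ['am', 'dia', 'med']
```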

In [11]:
def word_frequency(words):
    words = nltk.FreqDist(words)
    return words

frequency = word_frequency(allwords)
frequency.most_common(50)

[('am', 4),
 ('dia', 4),
 ('med', 3),
 ('nov', 2),
 ('result', 2),
 ('test', 2),
 ('deix', 2),
 ('apavor', 2),
 ('admir', 1),
 ('muit', 1),
 ('sint', 1),
 ('complet', 1),
 ('maravilh', 1),
 ('sent', 1),
 ('anim', 1),
 ('bem', 1),
 ('hoj', 1),
 ('bel', 1),
 ('dirig', 1),
 ('carr', 1),
 ('bonit', 1),
 ('cont', 1),
 ('fiz', 1),
 ('ont', 1),
 ('lind', 1),
 ('amizad', 1),
 ('vai', 1),
 ('dur', 1),
 ('sempr', 1),
 ('amedront', 1),
 ('ameac', 1),
 ('lug', 1),
 ('perd', 1),
 ('outr', 1),
 ('jog', 1),
 ('elimin', 1),
 ('pav', 1),
 ('tom', 1),
 ('cuid', 1),
 ('lobisom', 1),
 ('descobr', 1),
 ('encrenc', 1),
 ('trem', 1)]

In [12]:
# Search for unique words
def search_unique_words(frequency):
    list_with_unique_words = frequency.keys()
    return list_with_unique_words

unique_words = search_unique_words(frequency)
unique_words

dict_keys(['admir', 'muit', 'sint', 'complet', 'am', 'maravilh', 'sent', 'anim', 'nov', 'bem', 'hoj', 'bel', 'dia', 'dirig', 'carr', 'bonit', 'cont', 'result', 'test', 'fiz', 'ont', 'lind', 'amizad', 'vai', 'dur', 'sempr', 'amedront', 'ameac', 'deix', 'apavor', 'lug', 'perd', 'outr', 'jog', 'elimin', 'pav', 'tom', 'cuid', 'lobisom', 'descobr', 'encrenc', 'trem', 'med'])

### Extract which words are in a phrase

We are going to build a dictionary that maps each word in unique_words to True or False, depending on whether that word appears in the phrase.
The function receives the stemmed words (roots) obtained before.

In [13]:
def extract_words_from_a_phrase_based_on_a_list(document):
    doc = set(document)
    characteristic = {}
    for word in unique_words:
        # True when the root occurs in the phrase, False otherwise
        characteristic[word] = (word in doc)

    return characteristic

phrase_characteristic = extract_words_from_a_phrase_based_on_a_list(['am', 'nov', 'dia']) # test function
print(phrase_characteristic)

{'admir': False, 'muit': False, 'sint': False, 'complet': False, 'am': True, 'maravilh': False, 'sent': False, 'anim': False, 'nov': True, 'bem': False, 'hoj': False, 'bel': False, 'dia': True, 'dirig': False, 'carr': False, 'bonit': False, 'cont': False, 'result': False, 'test': False, 'fiz': False, 'ont': False, 'lind': False, 'amizad': False, 'vai': False, 'dur': False, 'sempr': False, 'amedront': False, 'ameac': False, 'deix': False, 'apavor': False, 'lug': False, 'perd': False, 'outr': False, 'jog': False, 'elimin': False, 'pav': False, 'tom': False, 'cuid': False, 'lobisom': False, 'descobr': False, 'encrenc': False, 'trem': False, 'med': False}


### Apply the word extraction to the complete database after stemming

In [14]:
# Complete dataset
complete_dataset = nltk.classify.apply_features(extract_words_from_a_phrase_based_on_a_list, phrases_stemming)
print(complete_dataset)

[({'admir': True, 'muit': True, 'sint': False, 'complet': False, 'am': False, 'maravilh': False, 'sent': False, 'anim': False, 'nov': False, 'bem': False, 'hoj': False, 'bel': False, 'dia': False, 'dirig': False, 'carr': False, 'bonit': False, 'cont': False, 'result': False, 'test': False, 'fiz': False, 'ont': False, 'lind': False, 'amizad': False, 'vai': False, 'dur': False, 'sempr': False, 'amedront': False, 'ameac': False, 'deix': False, 'apavor': False, 'lug': False, 'perd': False, 'outr': False, 'jog': False, 'elimin': False, 'pav': False, 'tom': False, 'cuid': False, 'lobisom': False, 'descobr': False, 'encrenc': False, 'trem': False, 'med': False}, 'alegria'), ({'admir': False, 'muit': False, 'sint': True, 'complet': True, 'am': True, 'maravilh': False, 'sent': False, 'anim': False, 'nov': False, 'bem': False, 'hoj': False, 'bel': False, 'dia': False, 'dirig': False, 'carr': False, 'bonit': False, 'cont': False, 'result': False, 'test': False, 'fiz': False, 'ont': False, 'lind':

In [15]:
# Visualizing an example
complete_dataset[0]

({'admir': True,
  'muit': True,
  'sint': False,
  'complet': False,
  'am': False,
  'maravilh': False,
  'sent': False,
  'anim': False,
  'nov': False,
  'bem': False,
  'hoj': False,
  'bel': False,
  'dia': False,
  'dirig': False,
  'carr': False,
  'bonit': False,
  'cont': False,
  'result': False,
  'test': False,
  'fiz': False,
  'ont': False,
  'lind': False,
  'amizad': False,
  'vai': False,
  'dur': False,
  'sempr': False,
  'amedront': False,
  'ameac': False,
  'deix': False,
  'apavor': False,
  'lug': False,
  'perd': False,
  'outr': False,
  'jog': False,
  'elimin': False,
  'pav': False,
  'tom': False,
  'cuid': False,
  'lobisom': False,
  'descobr': False,
  'encrenc': False,
  'trem': False,
  'med': False},
 'alegria')

### Naive Bayes for Emotion Classification

During training, the Naive Bayes algorithm builds the frequency table that is later used to calculate the probability that a phrase belongs to each class.
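
In general form (a sketch of the standard Naive Bayes rule, not NLTK's exact implementation, which also smooths the counts), the probability of class $c$ given the features $f_1, \dots, f_n$ of a phrase is:

$$
P(c \mid f_1, \dots, f_n) = \frac{P(c)\prod_{i=1}^{n} P(f_i \mid c)}{\sum_{c'} P(c')\prod_{i=1}^{n} P(f_i \mid c')}
$$

Here $P(c)$ and $P(f_i \mid c)$ are estimated from the frequency table, and each $f_i$ is one of the True/False word features built above.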

In [16]:
# Train the Naive Bayes classifier on the feature sets
classifier = nltk.NaiveBayesClassifier.train(complete_dataset)

# Print the labels
print('Classifier Labels: ', classifier.labels())

# Show the most informative features (the method prints its own output and returns None)
classifier.show_most_informative_features(5)

Classifier Labels:  ['alegria', 'medo']
Most Informative Features
                     dia = True           alegri : medo   =      2.3 : 1.0
                      am = False            medo : alegri =      1.6 : 1.0
                     med = False          alegri : medo   =      1.4 : 1.0
                     dia = False            medo : alegri =      1.3 : 1.0
                  apavor = False          alegri : medo   =      1.2 : 1.0


The most informative feature is "dia = True": the root "dia" is about 2.3 times more likely to be present in an alegria phrase than in a medo phrase, so its presence pushes the classifier towards alegria.

The same interpretation is valid for False values. "med = False", that is, the root "med" not being in the phrase, is 1.4 times more likely for an alegria phrase than for a medo phrase.
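
These ratios can be reproduced by hand from the stemmed phrases. The sketch below assumes NLTK's default estimator for `NaiveBayesClassifier.train`, the expected likelihood estimate, which adds 0.5 to every count. The helper `smoothed_prob` is a hypothetical name, and the data is copied from the stemming output above.

```python
# Stemmed phrases copied from the notebook output above
phrases_stemming = [
    (['admir', 'muit'], 'alegria'),
    (['sint', 'complet', 'am'], 'alegria'),
    (['am', 'maravilh'], 'alegria'),
    (['sent', 'anim', 'nov'], 'alegria'),
    (['bem', 'hoj'], 'alegria'),
    (['bel', 'dia', 'dirig', 'carr', 'nov'], 'alegria'),
    (['dia', 'bonit'], 'alegria'),
    (['cont', 'result', 'test', 'fiz', 'dia', 'ont'], 'alegria'),
    (['am', 'lind'], 'alegria'),
    (['amizad', 'am', 'vai', 'dur', 'sempr'], 'alegria'),
    (['amedront'], 'medo'),
    (['ameac', 'dia'], 'medo'),
    (['deix', 'apavor'], 'medo'),
    (['lug', 'apavor'], 'medo'),
    (['perd', 'outr', 'jog', 'elimin', 'deix', 'pav'], 'medo'),
    (['tom', 'cuid', 'lobisom'], 'medo'),
    (['descobr', 'encrenc'], 'medo'),
    (['trem', 'med'], 'medo'),
    (['med'], 'medo'),
    (['med', 'result', 'test'], 'medo'),
]

def smoothed_prob(feature, value, label, phrases, gamma=0.5, bins=2):
    """Estimate P(feature == value | label) with add-gamma smoothing over two bins (True/False)."""
    in_class = [set(words) for words, emotion in phrases if emotion == label]
    count = sum((feature in words) == value for words in in_class)
    return (count + gamma) / (len(in_class) + gamma * bins)

# Ratio for "dia = True", alegria versus medo
ratio_dia = (smoothed_prob('dia', True, 'alegria', phrases_stemming)
             / smoothed_prob('dia', True, 'medo', phrases_stemming))

# Ratio for "med = False", alegria versus medo
ratio_med = (smoothed_prob('med', False, 'alegria', phrases_stemming)
             / smoothed_prob('med', False, 'medo', phrases_stemming))

print(round(ratio_dia, 1), round(ratio_med, 1))  # 2.3 1.4
```

The root "dia" occurs in 3 of the 10 alegria phrases and in 1 of the 10 medo phrases, which gives (3.5 / 11) / (1.5 / 11) after smoothing.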

### Test Algorithm

In [17]:
test = "estou com medo"
stemmer = nltk.stem.RSLPStemmer()

# Stem every word of the test phrase
tests_stemming = [str(stemmer.stem(word)) for word in test.split()]

# Build the feature dictionary and print the probability of each class
novo = extract_words_from_a_phrase_based_on_a_list(tests_stemming)
distribuition_prob = classifier.prob_classify(novo)
for classe in distribuition_prob.samples():
    print(classe, distribuition_prob.prob(classe))

alegria 0.04104610212207739
medo 0.9589538978779223


# 2. Naive Bayes Machine Learning Classification Model

From now on, we build the machine learning model that tries to identify the emotion in a phrase. There are two datasets, one for training and one for testing, and the model is evaluated with accuracy and a confusion matrix.
Finally, we run inference on a random example to test the functions and the output of the model.

New datasets are loaded for this analysis, and several functions are defined to make the code easier to follow. The most important ones are:
- training_pipeline_emotion_analysis
- test_pipeline_emotion_analysis
- inference_pipeline_emotion_analysis

In [18]:
import ast

def read_txt_file(path):
    
    # Open the text file for reading (assuming UTF-8 encoding)
    with open(path, 'r', encoding='utf-8') as file:
        # Read the contents of the file
        content = file.read()

    # Keep only what follows the first '=' (drops the variable name)
    content = content.split('=', 1)[1]

    # Drop brackets and line breaks so only the comma-separated tuples remain
    content = content.replace("[", "").replace("]", "").replace("\n", "")

    # Parse the cleaned string into a tuple of (phrase, emotion) tuples.
    # ast.literal_eval only accepts Python literals, unlike eval.
    data = ast.literal_eval(content.strip())

    return data


In [19]:
# Load txt file as train and test database
base_treino_path = 'Arquivos_curso/dados/base_treino.txt'
base_teste_path = 'Arquivos_curso/dados/base_teste.txt'

# Python list with the text values
base_treino = read_txt_file(base_treino_path)
base_teste = read_txt_file(base_teste_path)


In [20]:
# Visualizing training data
print("Training data:", base_treino[0:5])

# Visualizing test data
print("Test data:", base_teste[0:5])

Training data: (('este trabalho e agradável', 'alegria'), ('gosto de ficar no seu aconchego', 'alegria'), ('fiz a adesão ao curso hoje', 'alegria'), ('eu sou admirada por muitos', 'alegria'), ('adoro como você e', 'alegria'))
Test data: (('não precisei pagar o ingresso', 'alegria'), ('se eu ajeitar tudo fica bem', 'alegria'), ('minha fortuna ultrapassa a sua', 'alegria'), ('sou muito afortunado', 'alegria'), ('e benefico para todos esta nova medida', 'alegria'))


### Pre-defined NLTK parameters

In [21]:
# NLTK stopwords
stopwords_nltk = nltk.corpus.stopwords.words('portuguese')

# Stemmer for Portuguese
stemmer = nltk.stem.RSLPStemmer()

### Preprocess functions

In [22]:
# Function to apply stemming and remove stopwords
def apply_stemming(text, train_test_or_inference):
    stemmer = nltk.stem.RSLPStemmer()
    phrases_stemming = []
    if (train_test_or_inference == 'train') or (train_test_or_inference == 'test'):
        for (words, emotions) in text:
            with_stemming = [str(stemmer.stem(p)) for p in words.split() if p not in stopwords_nltk]
            phrases_stemming.append((with_stemming, emotions))
    else:
        for (words) in text:
            with_stemming = [str(stemmer.stem(p)) for p in words.split() if p not in stopwords_nltk]
            phrases_stemming.append((with_stemming))
    
    return phrases_stemming

# Obtain all words in the document
def search_words(phrase):
    allwords = []
    for (words, emotions) in phrase:
        allwords.extend(words)

    return allwords

# Calculate frequency of the words
def word_frequency(words):
    words = nltk.FreqDist(words)
    return words

# Obtain unique words in the document
def search_unique_words(frequency):
    list_with_unique_words = frequency.keys()
    return list_with_unique_words

def extract_words_from_a_phrase_based_on_a_list(document, unique_words):
    doc = set(document)
    characteristic = {}
    for words in unique_words:
        characteristic[words] = (words in doc)

    return characteristic

def preprocessing(database, train_test_or_inference):
    
    phrases_stemming = apply_stemming(database, train_test_or_inference)

    allwords = search_words(phrases_stemming)

    frequency = word_frequency(allwords)

    unique_words = search_unique_words(frequency)

    # Use a lambda function to pass unique_words as an argument to extract_words_from_a_phrase_based_on_a_list
    feature_extractor = lambda document: extract_words_from_a_phrase_based_on_a_list(document, unique_words)
    
    complete_dataset = nltk.classify.apply_features(feature_extractor, phrases_stemming)

    return complete_dataset, unique_words

def training_pipeline_emotion_analysis(dataset, train_test_or_inference = 'train'):
    
    # Preprocessing data
    preprocessed_dataset, train_unique_words = preprocessing(dataset, train_test_or_inference)
    
    # Train the Naive Bayes classifier on the preprocessed data
    classifier = nltk.NaiveBayesClassifier.train(preprocessed_dataset)

    # Print the labels
    print('Classifier Labels: ', classifier.labels())

    # Print informative features
    print(classifier.show_most_informative_features(5))

    return classifier, train_unique_words

def test_pipeline_emotion_analysis(classifier, dataset, train_test_or_inference = 'test'):
    
    # Preprocessing data
    processed_test_dataset, test_unique_words = preprocessing(dataset, train_test_or_inference)

    # Printing results
    print("Model Accuracy", round(nltk.classify.accuracy(classifier, processed_test_dataset), 2))

    return processed_test_dataset, test_unique_words

def visualizing_classifier_errors(classifier, test_dataset):

    # Collect (true class, predicted class, features) for every misclassified phrase
    errors = []
    for (frase, classe) in test_dataset:
        resultado = classifier.classify(frase)
        if resultado != classe:
            errors.append((classe, resultado, frase))
    
    return errors

def inference_pipeline_emotion_analysis(classifier, dataset, unique_words, train_test_or_inference = 'inference'):
    
    # Stemming dataset
    phrases_stemming = apply_stemming(dataset, train_test_or_inference)[0]

    # Extract features
    new_dataset = extract_words_from_a_phrase_based_on_a_list(phrases_stemming, unique_words)
    
    # Calculate probability
    distribuition_prob = classifier.prob_classify(new_dataset)
    
    for classe in distribuition_prob.samples():
        print(classe, distribuition_prob.prob(classe))

    return None
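
Note that `test_pipeline_emotion_analysis` calls `preprocessing` on the test set, so the test features are keyed by the test set's own vocabulary instead of `train_unique_words`. Words seen only in training then never appear as features of a test phrase, and words seen only in the test set have no learned probabilities. A minimal sketch of the alternative, where every dataset is encoded with the vocabulary learned from the training set, is shown below. The helper `encode_with_vocabulary`, the toy data and the stem `escur` are hypothetical, and whether this changes the accuracy of the model has not been checked here.

```python
def encode_with_vocabulary(stemmed_phrases, vocabulary):
    """Turn (tokens, label) pairs into (feature dict, label) pairs using a fixed vocabulary."""
    encoded = []
    for tokens, label in stemmed_phrases:
        present = set(tokens)
        features = {word: (word in present) for word in vocabulary}
        encoded.append((features, label))
    return encoded

# Toy stemmed data standing in for the real train and test sets
train_stemmed = [(['am', 'lind'], 'alegria'), (['trem', 'med'], 'medo')]
test_stemmed = [(['med', 'escur'], 'medo')]

# The vocabulary is built from the training set only
train_vocabulary = sorted({word for tokens, _ in train_stemmed for word in tokens})

# The test set is encoded with the training vocabulary; 'escur' is unknown and dropped
test_features = encode_with_vocabulary(test_stemmed, train_vocabulary)
print(test_features)
```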


# Results

In [23]:
# Trained algorithm
emotion_classifier, train_unique_words = training_pipeline_emotion_analysis(base_treino)
processed_test_dataset, test_unique_words = test_pipeline_emotion_analysis(emotion_classifier, base_teste)

Classifier Labels:  ['alegria', 'desgosto', 'medo', 'raiva', 'surpresa', 'tristeza']
Most Informative Features
                 acredit = True           surpre : triste =      6.3 : 1.0
                    real = True           surpre : alegri =      5.8 : 1.0
                     vou = True             medo : triste =      5.7 : 1.0
                     tão = True           surpre : raiva  =      4.3 : 1.0
                     dia = True           triste : alegri =      4.0 : 1.0
None
Model Accuracy 0.42


In [24]:
# Visualizing errors
errors = visualizing_classifier_errors(emotion_classifier, processed_test_dataset)
for (classe, resultado, frase) in errors:
    print((classe, resultado, frase))

('alegria', 'raiva', {'precis': True, 'pag': True, 'ingress': True, 'ajeit': False, 'tud': False, 'fic': False, 'bem': False, 'fortun': False, 'ultrapass': False, 'afortun': False, 'benef': False, 'tod': False, 'nov': False, 'med': False, 'lind': False, 'ach': False, 'sapat': False, 'simpá': False, 'ansi': False, 'cheg': False, 'congratul': False, 'aniversári': False, 'delicad': False, 'coloc': False, 'dorm': False, 'music': False, 'viv': False, 'conclu': False, 'taref': False, 'difícil': False, 'gradu': False, 'cont': False, 'confi': False, 'praz': False, 'conhecê-l': False, 'colegu': False, 'anim': False, 'aproveit': False, 'fer': False, 'vam': False, 'aprove': False, 'divert': False, 'jog': False, 'ter': False, 'muit': False, 'divers': False, 'tant': False, 'assim': False, 'vou': False, 'consent': False, 'orç': False, 'client': False, 'pal': False, 'pod': False, 'cas': False, 'ador': False, 'perfum': False, 'bondad': False, 'cativ': False, 'despreocup': False, 'preocup': False, 'aconte

In [25]:
from nltk.metrics import ConfusionMatrix

true_value = []
prediction = []

for (frase, classe) in processed_test_dataset:
    result = emotion_classifier.classify(frase)
    prediction.append(result)
    true_value.append(classe)

matrix = ConfusionMatrix(true_value, prediction)
print(matrix)

         |     d        s  t |
         |  a  e        u  r |
         |  l  s        r  i |
         |  e  g     r  p  s |
         |  g  o  m  a  r  t |
         |  r  s  e  i  e  e |
         |  i  t  d  v  s  z |
         |  a  o  o  a  a  a |
---------+-------------------+
 alegria |<26> 3  4 10  5  . |
desgosto |  9<21> 1  1  2  2 |
    medo | 10  5<12> 4  2  3 |
   raiva | 10  3  8 <8> 3  4 |
surpresa | 18  1  .  2<13> 2 |
tristeza | 12  2  3  3  1<15>|
---------+-------------------+
(row = reference; col = test)



Many phrases are being classified as alegria, so a next step to improve the algorithm is to inspect these cases and understand why so many records fall into this class.
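
The counts in the confusion matrix above make this concrete. The sketch below copies the matrix into a plain list of lists (rows are true classes, columns are predictions) and derives the accuracy and, per class, the number of predictions, the recall and the precision. The variable names are chosen for this example only.

```python
labels = ['alegria', 'desgosto', 'medo', 'raiva', 'surpresa', 'tristeza']

# Rows are true classes, columns are predictions (values copied from the matrix above)
confusion_counts = [
    [26, 3, 4, 10, 5, 0],
    [9, 21, 1, 1, 2, 2],
    [10, 5, 12, 4, 2, 3],
    [10, 3, 8, 8, 3, 4],
    [18, 1, 0, 2, 13, 2],
    [12, 2, 3, 3, 1, 15],
]

total = sum(sum(row) for row in confusion_counts)
correct = sum(confusion_counts[i][i] for i in range(len(labels)))
accuracy = correct / total
print('accuracy', round(accuracy, 2))  # accuracy 0.42

for i, label in enumerate(labels):
    predicted = sum(row[i] for row in confusion_counts)        # column total
    recall = confusion_counts[i][i] / sum(confusion_counts[i])  # share of the true class that was found
    precision = confusion_counts[i][i] / predicted              # share of predictions that were right
    print(label, 'predicted:', predicted,
          'recall:', round(recall, 2), 'precision:', round(precision, 2))
```

With these numbers, 85 of the 228 test phrases are predicted as alegria, although only 48 of them actually belong to that class.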

In [26]:
# Inference on a random phrase
new_test = 'eu sinto amor por voce'

# The function prints the probability of each class and returns None
inference_pipeline_emotion_analysis(emotion_classifier, [new_test], train_unique_words)

alegria 0.8804406738365976
desgosto 0.016370373709289034
medo 0.0031238677859800196
raiva 0.009496972316888471
surpresa 0.0013069281269570722
tristeza 0.08926118422428708
