<h1>Sentiment Analysis using Twitter messages</h1>
This notebook walks step by step through the development of an algorithm that classifies the sentiment of a phrase as positive or negative using a Naive Bayes classifier.

In [20]:
__author__ = "Ionésio Junior"

import pandas as pd
import nltk, re
from collections import defaultdict
from string import punctuation as punct
from collections import OrderedDict
from nltk.classify.util import accuracy as eval_accuracy
from nltk.classify import NaiveBayesClassifier 
from nltk.metrics import (BigramAssocMeasures, precision as eval_precision,
    recall as eval_recall, f_measure as eval_f_measure)
from math import floor

<h2>Data pre-processing</h2>
This snippet implements the functions that clean our data set. Before training a model, we remove mentions, links and hashtags from each tweet, tokenize the lowercased text, and drop Portuguese stop words and punctuation tokens.

In [6]:
stopwords = nltk.corpus.stopwords.words('portuguese')

def filter_by_stopwords(word):
    # Keep a token only if it is neither a stop word nor punctuation
    return word not in stopwords and word not in punct


def filter_dataset(data_text):
    # Remove @mentions, URLs and #hashtags
    data_text = re.sub(r'@\S+', '', data_text)
    data_text = re.sub(r'http\S+', '', data_text)
    data_text = re.sub(r'#\S+', '', data_text)

    # Filter stop words and extract tokens
    tokens = [word for word in nltk.word_tokenize(data_text.lower()) if filter_by_stopwords(word)]
    return tokens
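
To make the cleaning step concrete, here is a small sketch that applies the same three regular expressions to a made-up tweet. The helper name `strip_twitter_markup` and the sample text are assumptions for illustration, and the final whitespace collapse is an extra step that `filter_dataset` does not perform (tokenization makes it unnecessary there).

```python
import re

def strip_twitter_markup(text):
    """Remove @mentions, URLs and #hashtags (same patterns as filter_dataset)."""
    text = re.sub(r'@\S+', '', text)     # @mentions
    text = re.sub(r'http\S+', '', text)  # links starting with http/https
    text = re.sub(r'#\S+', '', text)     # hashtags
    # Illustration only: collapse the whitespace left by the removed pieces
    return ' '.join(text.split())

sample = "@maria Adoro correr! http://t.co/abc #bomdia"  # hypothetical tweet
cleaned = strip_twitter_markup(sample)
print(cleaned)
```

Because each pattern ends in `\S+`, everything up to the next whitespace is removed, so a hashtag such as `#MeuPaiNãoSabeMas` disappears as a whole and contributes no token to the classifier.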

<h2>Structuring our data set</h2>
Now we convert each tweet into a "bag of words" and label the bags, keeping the positive (pos) and negative (neg) examples in separate lists.

In [5]:
def build_bag_of_words(tweet_text):
    '''
        Build a "bag of words" representation of a tweet.
        Args:
            tweet_text (str): text of the tweet message
        Return:
            {word: True}: dict mapping each remaining token to True
    '''
    return { word:True for word in filter_dataset(tweet_text) }

def extract_labels(dataset):
    '''
        Split the data set by label and convert each tweet to a bag of words.
        
        Args:
            dataset (DataFrame): tweets previously labeled (sentiment: 1 = positive, 0 = negative)
        Return:
            (positive, negative): two lists of (bag_of_words, label) tuples
    '''
    positive_label = dataset[dataset.sentiment == 1].text
    negative_label = dataset[dataset.sentiment == 0].text
    filtered_positive_label = [ (build_bag_of_words(tweet),"pos") for tweet in positive_label ]
    filtered_negative_label = [ (build_bag_of_words(tweet),"neg") for tweet in negative_label ]
    return (filtered_positive_label, filtered_negative_label)
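
The structure handed to the classifier can be sketched without NLTK. The token lists below are made up, and `to_bag_of_words` is a hypothetical stand-in for `build_bag_of_words` that skips the filtering step.

```python
def to_bag_of_words(tokens):
    """Map every distinct token to True; word order and counts are discarded."""
    return {token: True for token in tokens}

# Hypothetical, already-filtered token lists with their labels
labeled = [
    (to_bag_of_words(["adoro", "correr"]), "pos"),
    (to_bag_of_words(["odeio", "café", "frio"]), "neg"),
    (to_bag_of_words(["muito", "muito", "bom"]), "pos"),
]

for bag, label in labeled:
    print(label, bag)
```

Note that the repeated token in the last example collapses into a single key: this representation records only whether a word is present, not how often it occurs.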

<h2>Training</h2>
After filtering, structuring and labeling our data set, we can train our classifier. To be able to test it, we first split the data (70% for training and 30% for testing).

In [15]:
def train_model(dataset):
    '''
        Build and train a classifier model
        
        Args:
            dataset (DataFrame): data set to be used by the classifier
        Return:
            (classifier, test_set): trained classifier and held-out test set
    '''
    # Extracting filtered and labeled text data
    positive_label, negative_label = extract_labels( dataset )
    
    # Splitting data set (70% train / 30% test), with one cut point per class
    positive_cut = floor(len(positive_label) * 0.7)
    negative_cut = floor(len(negative_label) * 0.7)
    train_set = positive_label[:positive_cut] + negative_label[:negative_cut]
    test_set = positive_label[positive_cut:] + negative_label[negative_cut:]
    
    # Training our model
    classifier = NaiveBayesClassifier.train(train_set)
    return (classifier, test_set)
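
A 70/30 split only holds for both classes when each class gets its own cut point. The following sketch, with the hypothetical helper `split_per_class` and deliberately unbalanced toy data, shows the idea.

```python
from math import floor

def split_per_class(positive, negative, train_ratio=0.7):
    """Split each class separately so both keep the same train/test ratio."""
    pos_cut = floor(len(positive) * train_ratio)
    neg_cut = floor(len(negative) * train_ratio)
    train = positive[:pos_cut] + negative[:neg_cut]
    test = positive[pos_cut:] + negative[neg_cut:]
    return train, test

# Toy data (made up): 9 positive and 4 negative examples
positive = [("p%d" % i, "pos") for i in range(9)]
negative = [("n%d" % i, "neg") for i in range(4)]

train, test = split_per_class(positive, negative)
print(len(train), len(test))
```

If a single cut point derived from the 9 positive examples (here 6) were reused for the negative list, all 4 negative examples would land in the training set and none would remain for testing.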

<h2>Evaluate Classifier</h2>
Now we need a function that measures the accuracy and other metrics (precision, recall and F-measure) of our classifier on the test set.

In [64]:
def evaluate(test_set, classifier=None, accuracy=True, f_measure=True,
                 precision=True, recall=True, verbose=False):
        """
        Evaluate and print classifier performance on the test set.

        :param test_set: A list of (tokens, label) tuples to use as gold set.
        :param classifier: a classifier instance (previously trained).
        :param accuracy: if `True`, evaluate classifier accuracy.
        :param f_measure: if `True`, evaluate classifier f_measure.
        :param precision: if `True`, evaluate classifier precision.
        :param recall: if `True`, evaluate classifier recall.
        :return: evaluation results.
        :rtype: dict(str): float
        """
        if classifier is None:
            raise ValueError("A trained classifier is required.")
        print("=== Evaluating {0} results... ===".format(type(classifier).__name__))
        metrics_results = {}
        if accuracy == True:
            accuracy_score = eval_accuracy(classifier, test_set)
            metrics_results['Accuracy'] = accuracy_score

        gold_results = defaultdict(set)
        test_results = defaultdict(set)
        labels = set()
        for i, (feats, label) in enumerate(test_set):
            labels.add(label)
            gold_results[label].add(i)
            observed = classifier.classify(feats)
            test_results[observed].add(i)

        for label in labels:
            if precision == True:
                precision_score = eval_precision(gold_results[label],
                    test_results[label])
                metrics_results['Precision [{0}]'.format(label)] = precision_score
            if recall == True:
                recall_score = eval_recall(gold_results[label],
                    test_results[label])
                metrics_results['Recall [{0}]'.format(label)] = recall_score
            if f_measure == True:
                f_measure_score = eval_f_measure(gold_results[label],
                    test_results[label])
                metrics_results['F-measure [{0}]'.format(label)] = f_measure_score

        # Print evaluation results (in alphabetical order)
        if verbose == True:
            for result in sorted(metrics_results):
                print('{0}: {1}'.format(result, metrics_results[result]))

        return metrics_results
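
The per-label metrics above are computed from sets of example indices. As a worked illustration, the sketch below reproduces that bookkeeping in plain Python on five made-up predictions. The function `per_label_scores` is hypothetical, and its handling of empty sets (returning 0.0) is a simplification that may differ from the NLTK helpers.

```python
from collections import defaultdict

def per_label_scores(gold_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} computed from index sets."""
    gold = defaultdict(set)
    predicted = defaultdict(set)
    for i, (g, p) in enumerate(zip(gold_labels, predicted_labels)):
        gold[g].add(i)       # indices that truly have label g
        predicted[p].add(i)  # indices the classifier assigned to label p

    scores = {}
    for label in sorted(gold):
        hits = len(gold[label] & predicted[label])
        precision = hits / len(predicted[label]) if predicted[label] else 0.0
        recall = hits / len(gold[label])
        f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
        scores[label] = (precision, recall, f1)
    return scores

gold_labels = ["pos", "pos", "pos", "neg", "neg"]       # made-up gold labels
predicted_labels = ["pos", "neg", "pos", "pos", "pos"]  # made-up predictions
scores = per_label_scores(gold_labels, predicted_labels)
print(scores)
```

For the `pos` label, two of the four predicted positives are correct (precision 0.5) and two of the three true positives are found (recall 2/3), which gives an F-measure of 4/7.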

<h2>Apply trained classifier on other text</h2>
Here we take a new text input and produce a positive/negative label, together with its probability, using our trained classifier.

In [60]:
def classify(classifier, text):
    probabilities = classifier.prob_classify(build_bag_of_words(text))
    predicted = probabilities.max()
    print(text + "[ " + "{:.2f}".format(probabilities.prob(predicted)*100)+ "% " + predicted.capitalize() + "]")
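
To give some intuition for the percentage printed next to each label, here is a from-scratch sketch of how a Naive Bayes posterior can be computed over bags of words. It is a simplified illustration with add-one smoothing, not NLTK's implementation (whose estimator and smoothing differ in detail); the names `train_counts` and `posterior` and the toy training data are assumptions.

```python
from collections import Counter
from math import exp, log

def train_counts(labeled_bags):
    """Count documents per label and, per label, documents containing each word."""
    label_counts = Counter()
    word_counts = {}
    for bag, label in labeled_bags:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(bag.keys())
    return label_counts, word_counts

def posterior(bag, label_counts, word_counts):
    """Return P(label | words present in bag) under the naive independence assumption."""
    total = sum(label_counts.values())
    log_scores = {}
    for label, n_docs in label_counts.items():
        score = log(n_docs / total)  # prior P(label)
        for word in bag:
            # Add-one smoothed estimate of P(word present | label)
            score += log((word_counts[label][word] + 1) / (n_docs + 2))
        log_scores[label] = score
    # Normalize in a numerically stable way
    top = max(log_scores.values())
    unnormalized = {label: exp(s - top) for label, s in log_scores.items()}
    z = sum(unnormalized.values())
    return {label: value / z for label, value in unnormalized.items()}

# Toy training data (made up)
training = [
    ({"adoro": True, "correr": True}, "pos"),
    ({"adoro": True, "dançar": True}, "pos"),
    ({"odeio": True, "frio": True}, "neg"),
    ({"odeio": True, "correr": True}, "neg"),
]
label_counts, word_counts = train_counts(training)
probs = posterior({"adoro": True}, label_counts, word_counts)
print(probs)
```

With this toy data, "adoro" appears in both positive examples and in no negative one, so the smoothed likelihoods are 3/4 and 1/4 and the posterior for `pos` works out to 0.75.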

<h2>Testing with simple inputs</h2>
This snippet tests the classifier using simple texts

In [89]:
def test_classifier_on_simple_texts(classifier):
    print("=== Applying the classifier on other phrases ===")
    classify(classifier, "Hoje é um dia triste!")
    classify(classifier, "Sorrir faz bem para a alma!")
    classify(classifier, "Dançar é muito bom!")
    classify(classifier, "Adoro correr!")
    classify(classifier, "Odeio tomar café frio")
    classify(classifier, "Sorria, você está sendo filmado!")
    print("\n== Outlier ==")
    classify(classifier, "Não grite comigo!")
    classify(classifier, "Adoro ver os outros perderem!")
    classify(classifier, "Vamos destruir o preconceito!")

<h2>Using the classifier on other tweets</h2>
This snippet labels ***#MeuPaiNãoSabeMas*** posts from Twitter.

In [90]:
def test_classifier_on_tweets(classifier):
    print("=== Applying the classifier on tweets ===")
    classify(classifier, "#MeuPaiNãoSabeMas eu sinto saudade de ele me chamar de princesa")
    classify(classifier, "#MeuPaiNaoSabeMas quando ele grita comigo eu choro")
    classify(classifier, "#MeuPaiNaoSabeMas amo ele e tenho muito orgulho de ser filho dele")
    classify(classifier, "#MeuPaiNãoSabeMas a falta dele ainda machuca muito")
    classify(classifier, "#MeuPaiNãoSabeMas quando ele pedia pra pegar mais uma cerveja na geladeira eu dava um gole so pra saber qual era o gosto")
    classify(classifier, "#MeuPaiNãoSabeMas ele é minha maior inspiração")

<h2>Let's put it all together</h2>

In [91]:
if __name__ == "__main__":
    # Load data set
    dataset = pd.read_csv('database/db.csv',encoding='utf-8', sep='\t')
    # Train the classifier
    classifier, test_set = train_model(dataset)
    # Evaluate on the test set
    test_results = evaluate(test_set, classifier)
    for key, value in test_results.items():
        print("{0}:{1:.2f}".format(key, value))
    print("\n\n")
    # Apply classifier on other phrases
    test_classifier_on_simple_texts(classifier)
    print("\n\n")
    # Apply classifier on other tweets
    test_classifier_on_tweets(classifier)

=== Evaluating NaiveBayesClassifier results... ===
Accuracy:0.74
Precision [neg]:0.69
Recall [neg]:0.74
F-measure [neg]:0.72
Precision [pos]:0.78
Recall [pos]:0.74
F-measure [pos]:0.76



=== Applying the classifier on other phrases ===
Hoje é um dia triste![ 82.22% Neg]
Sorrir faz bem para a alma![ 87.24% Pos]
Dançar é muito bom![ 84.48% Pos]
Adoro correr![ 93.07% Pos]
Odeio tomar café frio[ 81.79% Neg]
Sorria, você está sendo filmado![ 86.51% Pos]

== Outlier ==
Não grite comigo![ 78.16% Pos]
Adoro ver os outros perderem![ 95.51% Pos]
Vamos destruir o preconceito![ 94.10% Neg]



#MeuPaiNãoSabeMas eu sinto saudade de ele me chamar de princesa[ 97.63% Neg]
#MeuPaiNaoSabeMas quando ele grita comigo eu choro[ 83.00% Neg]
#MeuPaiNaoSabeMas amo ele e tenho muito orgulho de ser filho dele[ 64.00% Pos]
#MeuPaiNãoSabeMas a falta dele ainda machuca muito[ 78.77% Neg]
#MeuPaiNãoSabeMas quando ele pedia pra pegar mais uma cerveja na geladeira eu dava um gole so pra saber qual era o gosto[ 65.18

<h2>Analysis of results</h2>
The model labeled most of our manual queries plausibly, although the phrases listed under "Outlier" show its limits: "Não grite comigo!" was labeled positive and "Vamos destruir o preconceito!" negative. The metrics computed on the test set are also modest, with an accuracy of 0.74 and per-class F-measures of 0.72 (neg) and 0.76 (pos). Possible improvements would be to enlarge the data set and to apply more advanced preprocessing.