<h1>Sentiment Analysis of Twitter Messages</h1>
This notebook walks step by step through the development of an algorithm that classifies the sentiment of a phrase as positive or negative using a Naive Bayes classifier.
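
For reference, the decision rule behind a Naive Bayes classifier can be written as below. This is the textbook formulation, given as a sketch: $c$ ranges over the labels (here `pos` and `neg`), and $f_1, \dots, f_n$ are the features of a tweet (here, the words it contains). The smoothing that NLTK applies when estimating the probabilities is not shown.

```latex
$$\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(f_i \mid c)$$
```

The "naive" part is the assumption that the features are independent of each other given the label, which is what allows the product over individual words.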

In [32]:
__author__ = "Ionésio Junior"

import pandas as pd
import nltk, re
from string import punctuation as punct
from collections import OrderedDict

from nltk.classify import NaiveBayesClassifier 
from nltk.classify.util import accuracy as nltk_acc
from math import floor

<h2>Data pre-processing</h2>
This snippet implements the functions that filter our data set. Before training a model, we need to remove mentions, links, hashtags, and punctuation from the text. After that, we tokenize the text and drop the stop words.

In [31]:
stopwords = nltk.corpus.stopwords.words('portuguese')

def filter_by_stopwords(word):
    # Keep a token only if it is neither a stop word nor a punctuation mark
    return word not in stopwords and word not in punct


def filter_dataset(data_text):
    # Remove mentions, URLs, and hashtags
    data_text = re.sub(r'@\S+', '', data_text)
    data_text = re.sub(r'http\S+', '', data_text)
    data_text = re.sub(r'#\S+', '', data_text)

    # Filter stop words and extract tokens
    tokens = [word for word in nltk.word_tokenize(data_text.lower()) if filter_by_stopwords(word)]
    return tokens
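
To see what the three substitutions do on their own, here is a minimal sketch that applies them to a made-up tweet. `strip_twitter_markup` and the sample text are hypothetical names introduced only for this illustration; the tokenization and stop-word steps are left out so that the example runs without any NLTK data.

```python
import re

def strip_twitter_markup(text):
    # Hypothetical helper: the same three substitutions used in
    # filter_dataset, without the tokenization step
    text = re.sub(r'@\S+', '', text)     # mentions
    text = re.sub(r'http\S+', '', text)  # links
    text = re.sub(r'#\S+', '', text)     # hashtags
    return text

sample = "@maria adorei o filme! http://t.co/abc #cinema"
cleaned = strip_twitter_markup(sample)
print(cleaned.split())
```

Note that a hashtag is removed together with the word attached to it, so any sentiment carried by the hashtag text is discarded as well.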

<h2>Structuring our data set</h2>
Now we need to organize our data set into a "bag of words" structure and label each bag. After that, we'll separate the data set by label (pos / neg).

In [None]:
def build_bag_of_words(tweet_text):
    '''
        Build a "bag of words" representation of a tweet

        Args:
            tweet_text (str) : text of the tweet message
        Return:
            {word: bool} : bag of words, mapping each token to True
    '''
    return { word:True for word in filter_dataset(tweet_text) }

def extract_labels(dataset):
    '''
        Filter the data set and split it by label

        Args:
            dataset (DataFrame) : set of previously labeled tweets
        Return:
            (positive, negative) : lists of (bag of words, label) pairs, one list per label
    '''
    positive_label = dataset[dataset.sentiment == 1].text
    negative_label = dataset[dataset.sentiment == 0].text
    filtered_positive_label = [ (build_bag_of_words(tweet),"pos") for tweet in positive_label ]
    filtered_negative_label = [ (build_bag_of_words(tweet),"neg") for tweet in negative_label ]
    return (filtered_positive_label, filtered_negative_label)
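
As a small illustration of the structure produced above, the sketch below builds a labeled bag from a hand-written token list. `bag_from_tokens` is a hypothetical helper that uses the same dict comprehension as `build_bag_of_words`, but takes tokens directly so that it does not depend on the tokenizer.

```python
def bag_from_tokens(tokens):
    # Hypothetical helper: same comprehension as build_bag_of_words,
    # applied to tokens that were already extracted
    return {word: True for word in tokens}

tokens = ["adorei", "filme", "adorei"]  # made-up tokens for illustration
bag = bag_from_tokens(tokens)
labeled = (bag, "pos")                  # (features, label) pair
print(labeled)
```

Because the bag is a dictionary keyed by word, a repeated word appears only once: this representation records whether a word is present, not how often it occurs.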

<h2>Training</h2>
After filtering, structuring, and labeling our data set, we can train our classifier. For testing purposes, we'll split the data (70% for training and 30% for testing).

In [49]:
def train_model(dataset):
    '''
        Build and train a Naive Bayes classifier

        Args:
            dataset (DataFrame) : labeled tweets used to train and test the classifier
        Return:
            classifier : trained classifier
    '''
    # Extract the filtered and labeled bags of words
    positive_label, negative_label = extract_labels(dataset)

    # Split each class separately (70% train / 30% test), since the two
    # classes may not contain the same number of tweets
    pos_cut = floor(len(positive_label) * 0.7)
    neg_cut = floor(len(negative_label) * 0.7)
    train_set = positive_label[:pos_cut] + negative_label[:neg_cut]
    test_set = positive_label[pos_cut:] + negative_label[neg_cut:]

    # Train the model and report its accuracy on the held-out test set
    classifier = NaiveBayesClassifier.train(train_set)
    print("Accuracy on test set: {:.2%}".format(nltk_acc(classifier, test_set)))
    return classifier
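
When the two classes differ in size, computing the cut point per class keeps roughly 30% of each class in the test set. The sketch below shows this on placeholder data; `split_by_class` and its `train_fraction` parameter are hypothetical names introduced for this example, and the class sizes are made up.

```python
from math import floor

def split_by_class(positive, negative, train_fraction=0.7):
    # Hypothetical helper: cut each class at its own 70% mark
    pos_cut = floor(len(positive) * train_fraction)
    neg_cut = floor(len(negative) * train_fraction)
    train = positive[:pos_cut] + negative[:neg_cut]
    test = positive[pos_cut:] + negative[neg_cut:]
    return train, test

# Placeholder data: 8 positive and 4 negative examples
positive = [("p%d" % i, "pos") for i in range(8)]
negative = [("n%d" % i, "neg") for i in range(4)]

train, test = split_by_class(positive, negative)
print(len(train), len(test))
```

If a single cut point taken from the larger class were applied to both lists, the smaller class could end up entirely in the training set, leaving none of its examples for testing.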

In [50]:
if __name__ == "__main__":
    dataset = pd.read_csv('database/db.csv',encoding='utf-8', sep='\t')
    classifier = train_model(dataset)
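
Once trained, the classifier can label a new phrase by passing its bag of words to `classify`, for example `classifier.classify(build_bag_of_words(phrase))`. The sketch below shows the same calls on a tiny hand-made training set so that it runs without the CSV file or any NLTK corpus; the words and labels are invented for illustration and say nothing about the results on the real data set.

```python
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

# Made-up labeled bags of words, in the same (features, label) format
toy_train = [
    ({"bom": True, "otimo": True}, "pos"),
    ({"adorei": True, "bom": True}, "pos"),
    ({"ruim": True, "pessimo": True}, "neg"),
    ({"odiei": True, "ruim": True}, "neg"),
]
toy_test = [({"bom": True}, "pos"), ({"ruim": True}, "neg")]

toy_classifier = NaiveBayesClassifier.train(toy_train)

# Classify a single bag and measure accuracy on the toy test set
prediction = toy_classifier.classify({"bom": True})
toy_accuracy = accuracy(toy_classifier, toy_test)
print(prediction, toy_accuracy)
```

On the real classifier, `show_most_informative_features()` can additionally be used to inspect which words weigh most heavily toward each label.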