# Detecting Sentiment Based on OpenAI's Solution

https://openai.com/blog/unsupervised-sentiment-neuron/

This solution consists of training a multiplicative LSTM (mLSTM) simply to predict the next character in product reviews. After the recurrent network is trained, its hidden units are used as features to train a sentiment classifier.
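To make the idea more concrete, here is a minimal NumPy sketch of a single mLSTM step and of how the final hidden state could serve as a feature vector for a classifier. It follows the usual mLSTM formulation as I understand it, not necessarily the exact code behind the OpenAI post. The parameter names (`W_mx`, `W_mh`, ...), the byte-level one-hot input, the omitted biases, and the helpers `init_params` and `encode` are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_in, n_hidden, rng):
    """Random (untrained) weights; names are hypothetical, biases are omitted."""
    p = {}
    for name in ["W_mx", "W_ix", "W_fx", "W_ox", "W_ux"]:
        p[name] = rng.normal(0.0, 0.1, size=(n_hidden, n_in))
    for name in ["W_mh", "W_im", "W_fm", "W_om", "W_um"]:
        p[name] = rng.normal(0.0, 0.1, size=(n_hidden, n_hidden))
    return p

def mlstm_step(x, h_prev, c_prev, p):
    """One mLSTM time step for a single input vector x."""
    # Multiplicative intermediate state: the input rescales the previous hidden state.
    m = (p["W_mx"] @ x) * (p["W_mh"] @ h_prev)
    # Gates and candidate use x and m, where a plain LSTM would use x and h_prev.
    i = sigmoid(p["W_ix"] @ x + p["W_im"] @ m)
    f = sigmoid(p["W_fx"] @ x + p["W_fm"] @ m)
    o = sigmoid(p["W_ox"] @ x + p["W_om"] @ m)
    u = np.tanh(p["W_ux"] @ x + p["W_um"] @ m)
    c = f * c_prev + i * u
    h = o * np.tanh(c)
    return h, c

def encode(text, p, n_hidden):
    """Run the cell over the UTF-8 bytes of text and return the last hidden state."""
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    for byte in text.encode("utf-8"):
        x = np.zeros(256)
        x[byte] = 1.0  # one-hot encoding of the current byte
        h, c = mlstm_step(x, h, c, p)
    return h

rng = np.random.default_rng(0)
params = init_params(n_in=256, n_hidden=8, rng=rng)
features = encode("Recebi bem antes do prazo estipulado.", params, n_hidden=8)
print(features.shape)
```

In the real setup the weights would come from next-character training, and the vectors returned by `encode` would be the inputs of a separate classifier trained on the review scores.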

I lack the resources and time available at OpenAI (as well as the knowledge), so I am going to do my best with a few days of training on a GeForce GTX 1050 instead of "one month across four NVIDIA Pascal GPUs [...] processing 12,500 characters per second."

Instead of using Amazon reviews, I will use a Portuguese-language review dataset obtained from https://www.kaggle.com/olistbr/brazilian-ecommerce.

This project was made for the Natural Language Processing course in UFPel's Computer Science program.

In [1]:
import os
import sys
import pandas as pd
import random
raw_data = "../data/raw/"

reviews = pd.read_csv(os.path.join(raw_data, "olist_order_reviews_dataset.csv"), 
                      usecols=["review_score", "review_comment_message"]).fillna("")
reviews = reviews[reviews.review_comment_message != ""]
reviews.head()

Unnamed: 0,review_score,review_comment_message
3,5,Recebi bem antes do prazo estipulado.
4,5,Parabéns lojas lannister adorei comprar pela I...
9,4,aparelho eficiente. no site a marca do aparelh...
12,4,"Mas um pouco ,travando...pelo valor ta Boa.\r\n"
15,5,"Vendedor confiável, produto ok e entrega antes..."


In [2]:
# Group the unique comments by review score (duplicate (score, comment) pairs are dropped).
separated = dict()
for score, comment in set(map(tuple, reviews.values)):
    separated.setdefault(score, set()).add(comment)

# Hold out 20% of each score's comments for testing, so both splits keep the score proportions.
train, test = [], []
for key in separated:
    # random.sample() no longer accepts a set (Python 3.11+), so sample from a list.
    test_examples = set(random.sample(list(separated[key]), k=int(len(separated[key]) * 0.2)))
    train_examples = [x for x in separated[key] if x not in test_examples]

    for example in test_examples:
        test.append([key, example])
    for example in train_examples:
        train.append([key, example])
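
The split above changes on every run because `random` is not seeded. As a sketch of a reproducible variant, the same per-score 20% hold-out can be wrapped in a function that takes a seed. The function name `stratified_split`, its parameters, and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import random
from collections import defaultdict

def stratified_split(pairs, test_fraction=0.2, seed=42):
    """Split (score, comment) pairs so each score keeps the same test fraction."""
    rng = random.Random(seed)
    by_score = defaultdict(set)
    for score, comment in pairs:
        by_score[score].add(comment)  # the set drops duplicate comments per score

    train, test = [], []
    for score in sorted(by_score):
        # Sort first so that the same seed always samples the same comments.
        comments = sorted(by_score[score])
        held_out = set(rng.sample(comments, k=int(len(comments) * test_fraction)))
        for comment in comments:
            (test if comment in held_out else train).append((score, comment))
    return train, test

# Toy data: ten 5-star comments and five 1-star comments.
toy = [(5, f"bom {i}") for i in range(10)] + [(1, f"ruim {i}") for i in range(5)]
toy_train, toy_test = stratified_split(toy)
print(len(toy_train), len(toy_test))
```

Calling the function twice with the same seed returns the same split, which makes later experiments comparable.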
    

In [3]:
random.shuffle(train)
train = pd.DataFrame(train, columns=["score", "message"])

random.shuffle(test)
test = pd.DataFrame(test, columns=["score", "message"])

In [4]:
train.to_csv("../data/interim/train.csv")
test.to_csv("../data/interim/test.csv")

In [10]:
len(reviews[reviews.review_score == 1])

9179

In [11]:
len(reviews[reviews.review_score == 2])

2229

In [12]:
len(reviews[reviews.review_score == 3])

3665

In [13]:
len(reviews[reviews.review_score == 4])

6034

In [14]:
len(reviews[reviews.review_score == 5])

20646
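
The counts above show that the scores are far from balanced: 5-star reviews make up nearly half of the non-empty comments, while 2-star reviews are rare. A small sketch that turns the counts printed above into shares (the dictionary is just those numbers copied by hand; with the `reviews` DataFrame loaded, `reviews.review_score.value_counts()` should give the same counts in one call):

```python
# Counts of non-empty reviews per score, copied from the cell outputs above.
counts = {1: 9179, 2: 2229, 3: 3665, 4: 6034, 5: 20646}

total = sum(counts.values())
shares = {score: n / total for score, n in counts.items()}

for score, share in shares.items():
    print(f"score {score}: {counts[score]:>6} reviews ({share:.1%})")
```

This imbalance is worth keeping in mind when evaluating a classifier, since always predicting the majority score already gets a large share of the examples right.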