<h1>Análise de Sentimentos dos comentários do IMDB</h1>

In [16]:
from warnings import filterwarnings
filterwarnings('ignore')

import os
import re
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, average_precision_score
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from skopt import forest_minimize
from lightgbm import LGBMClassifier

<a name='IND'></a>
<h2>Índice</h2>
<font size=3>
<ol>
    <li><a href='#SETDATA'>Montando o dataset</a></li>
    <ol>
        <li><a href='#DATATRAIN'>Dataset de Treino</a></li>
        <li><a href='#DATATEST'>Dataset de Teste</a></li>
    </ol>
    <li><a href='#PREPRO'>Pré-processamento</a></li>
    <li><a href='#LR'>Regressão Logística</a></li>
    <li><a href='#RF'>Random Forrest</a></li>
    <li><a href='#LGBM'>LightGBM</a></li>
    <li><a href='#BAYES'>Bayesian Optimization</a></li>
    <li><a href='#VOT'>Voting</a></li>
</ol>
</font>

<a name='SETDATA'></a>
<h2>Montando o dataset</h2>
<font size=3>Como os comentários dos filmes estão em diferentes arquivos <i>.txt</i> vamos ter que agregar todos os comentários em um dataset. Com isso criamos uma função que irá limpar os comentários e retirar as <i>stop words</i>.</font>

In [3]:
def remove_stop_words(comment):
    stop_words = pd.read_csv('./stop-words.csv', header=None)
    stop_words = stop_words[0].values
    word2word = comment.split(sep=' ')      # Divide o comentário por palavra
    for word in word2word:
        if word in stop_words:
            word2word.remove(word)          # Remove a palavra se ela for uma stop word
    return ' '.join(word2word)              # Junta as palavras que sobraram do comentário

<font size=3>Agora com a função de limpeza dos comentários feita, vamos criar a função para agregar os comentários de um diretório específico</font>

In [4]:
def get_commentaries(path):
    files = os.listdir(path)                                 # Pega todos os nomes dos arquivos do diretório
    break_line = re.compile('(<br /><br />)')                # Compila um padrão de expressão regular de quebra de linha 
    character =  re.compile("[.;:!\'?,\"()\[\]]")            # Compila um padrão de expressão regular de acentuação
    comm_list = []
    for file in files:
        nota = file[-5]                                      # Pega a nota que o usuário deu para o filme
        if nota=='0':
            nota = '10'
        with open(path+file, 'r') as file:
            comment = file.read()                            # Lê o comentário
            comment = break_line.sub(' ', comment)           # Substitui a quebra de linha em espaço
            comment = character.sub('', comment)             # Substitui a acentuação por nada
            comment = comment.lower()                        # Deixa todos os caracteres em minúsculo
            comment = remove_stop_words(comment)             # Remove as stop words
            comm_list.append([comment, nota])
    df = pd.DataFrame(comm_list, columns=['Comments', 'Rating'])
    return df

<font size=3>Função feita vamos pegar os comentários nos seguintes diretórios:
<ul>
    <li>./aclimdb/train/neg/</li>
    <li>./aclimdb/train/pos/</li>
    <li>./aclimdb/test/neg/</li>
    <li>./aclimdb/test/pos/</li>
</ul>
</font>
<a name=DATATRAIN></a>
<h3>Dataset de Treino</h3>

In [5]:
train_dir = ['./aclimdb/train/neg/', './aclimdb/train/pos/']
df_train = pd.DataFrame()
for directory in train_dir:
    temp_df = get_commentaries(directory)
    df_train = pd.concat([df_train, temp_df], ignore_index=True)
print('Tamanho do dataset: {}'.format(df_train.shape))

Tamanho do dataset: (25000, 2)


<a name='DATATEST'></a>
<h3>Dataset de Teste</h3>

In [6]:
train_dir = ['./aclimdb/test/neg/', './aclimdb/test/pos/']
df_test = pd.DataFrame()
for directory in train_dir:
    temp_df = get_commentaries(directory)
    df_test = pd.concat([df_test, temp_df], ignore_index=True)
print('Tamanho do dataset: {}'.format(df_test.shape))

Tamanho do dataset: (25000, 2)


<a name='PREPRO'></a>
<h2>Pré-processamento</h2>
<font size=3>Vamos ajustar a variável alvo, definindo como 1 para comentário <b>bom</b> e 0 para comentário <b>ruim</b>.

In [7]:
df_train['Rating'] = df_train['Rating'].astype(int)
df_train['Target'] = df_train['Rating'].apply(lambda x: 1 if x > 6 else 0)
df_test['Rating'] = df_train['Rating'].astype(int)
df_test['Target'] = df_test['Rating'].apply(lambda x: 1 if x > 6 else 0)

<font size=3>Agora dividiremos em <b>X</b> e <b>Y</b>.</font>

In [8]:
Xtrain = df_train['Comments']
ytrain = df_train['Target']
Xtest = df_test['Comments']
ytest = df_test['Target']

<font size=3>Com a divisão feita, vamos utilizar o <i>TfidfVectorizer</i>.</font>

In [9]:
tfidf = TfidfVectorizer(min_df=2, ngram_range=(1,1))
Xtrain_vec = tfidf.fit_transform(Xtrain)
Xtest_vec = tfidf.transform(Xtest)
print('Xtrain:\t{}\nXtest:\t{}'.format(Xtrain.shape, Xtest.shape))

Xtrain:	(25000,)
Xtest:	(25000,)


<a name='LR'></a>
<h2>Regressão Logística</h2>

In [10]:
lr = LogisticRegression(random_state=79)
lr.fit(Xtrain_vec, ytrain)
ypred_lr = lr.predict(Xtest_vec)
roc_auc_lr = roc_auc_score(ytest, ypred_lr)
acc_lr = accuracy_score(ytest, ypred_lr)
f1_lr = f1_score(ytest, ypred_lr)
avg_prec_lr = average_precision_score(ytest, ypred_lr)
print('roc auc:\t{}\naccuracy:\t{}\nf1 score:\t{}\navg precision:\t{}'.format(roc_auc_lr, acc_lr, f1_lr, avg_prec_lr))

roc auc:	0.88416
accuracy:	0.88416
f1 score:	0.8841321917260143
avg precision:	0.8397297774931968


<a name='RF'></a>
<h2>Random Forrest</h2>

In [11]:
rf = RandomForestClassifier(n_estimators=1000, min_samples_leaf=2, class_weight='balanced', n_jobs=5, random_state=79)
rf.fit(Xtrain_vec, ytrain)
ypred_rf = rf.predict(Xtest_vec)
roc_auc_fr = roc_auc_score(ytest, ypred_rf)
acc_fr = accuracy_score(ytest, ypred_rf)
f1_fr = f1_score(ytest, ypred_rf)
avg_prec_fr = average_precision_score(ytest, ypred_rf)
print('roc auc:\t{}\naccuracy:\t{}\nf1 score:\t{}\navg precision:\t{}'.format(roc_auc_fr, acc_fr, f1_fr, avg_prec_fr))

roc auc:	0.8586800000000001
accuracy:	0.85868
f1 score:	0.8583116101864849
avg precision:	0.808663826296743


<a name='LGBM'></a>
<h2>LightGBM</h2>

In [18]:
lgbm = LGBMClassifier(random_state=79, class_weight='balanced', n_jobs=5)
lgbm.fit(Xtrain_vec, ytrain)
ypred_lgbm = lgbm.predict(Xtest_vec)
roc_auc_lgbm = roc_auc_score(ytest, ypred_lgbm)
acc_lgbm = accuracy_score(ytest, ypred_lgbm)
f1_lgbm = f1_score(ytest, ypred_lgbm)
avg_prec_lgbm = average_precision_score(ytest, ypred_lgbm)
print('roc auc:\t{}\naccuracy:\t{}\nf1 score:\t{}\navg precision:\t{}'.format(roc_auc_lgbm, acc_lgbm,
                                                                              f1_lgbm, avg_prec_lgbm))

roc auc:	0.86104
accuracy:	0.86104
f1 score:	0.8625138515117935
avg precision:	0.8081338408521304


<a name='BAYES'></a>
<h2>Bayesian Optimization</h2>

In [13]:
def tune_lgbm(params):
    print(params)
    lr = params[0]
    max_depth = params[1]
    min_child_samples = params[2]
    subsample = params[3]
    colsample_bytree = params[4]
    n_estimators = params[5]
    
    min_df = params[6]
    ngram_range = (1, params[7])
    
    tfidf = TfidfVectorizer(min_df=min_df, ngram_range=ngram_range)
    Xtrain_vec = tfidf.fit_transform(Xtrain)
    Xtest_vec = tfidf.transform(Xtest)
    
    lgbm = LGBMClassifier(learning_rate=lr, num_leaves=2 ** max_depth, max_depth=max_depth, 
                         min_child_samples=min_child_samples, subsample=subsample,
                         colsample_bytree=colsample_bytree, bagging_freq=1,n_estimators=n_estimators, random_state=0, 
                         class_weight="balanced", n_jobs=6)
    lgbm.fit(Xtrain_vec, ytrain)
    
    ypred_lgbm = lgbm.predict(Xtest_vec)
    
    print(roc_auc_score(ytest, ypred_lgbm))
    
    return -average_precision_score(ytest, ypred_lgbm)


space = [(1e-3, 1e-1, 'log-uniform'), # lr
          (1, 10), # max_depth
          (1, 20), # min_child_samples
          (0.05, 1.), # subsample
          (0.05, 1.), # colsample_bytree
          (100,1000), # n_estimators
          (1,5), # min_df
          (1,5)] # ngram_range

res = forest_minimize(tune_lgbm, space, random_state=79, n_random_starts=20, n_calls=50, verbose=1)

Iteration No: 1 started. Evaluating function at random point.
[0.006292827825947035, 4, 19, 0.46248056016314054, 0.8805011161066282, 827, 3, 4]
0.7979999999999999
Iteration No: 1 ended. Evaluation done at random point.
Time taken: 155.6676
Function value obtained: -0.7271
Current minimum: -0.7271
Iteration No: 2 started. Evaluating function at random point.
[0.04267373105004074, 3, 16, 0.1261298714290931, 0.998324765488932, 163, 3, 5]
0.80608
Iteration No: 2 ended. Evaluation done at random point.
Time taken: 130.3084
Function value obtained: -0.7366
Current minimum: -0.7366
Iteration No: 3 started. Evaluating function at random point.
[0.008175050624738778, 7, 5, 0.2915815679278623, 0.6808984760418078, 497, 5, 4]
0.814
Iteration No: 3 ended. Evaluation done at random point.
Time taken: 254.6230
Function value obtained: -0.7466
Current minimum: -0.7466
Iteration No: 4 started. Evaluating function at random point.
[0.013777428975026594, 1, 3, 0.8384254088062836, 0.6808661755213409, 628,

<font size=3>Após 50 iterações, verificamos o conjunto de hiperparâmetros que nos deu a melhor métrica.</font>

In [14]:
res.x

[0.09431638480128277, 4, 3, 0.6295926398680836, 0.34512927281549965, 815, 3, 1]

<font size=3>Com a <i>Otimização Bayesiana</i> conseguimos os melhores hiperparâmetros para o <i>LightGBM</i>, vamos verificar as metricas com o modelo treinado.</font>

In [19]:
params = res.x
lr = params[0]
max_depth = params[1]
min_child_samples = params[2]
subsample = params[3]
colsample_bytree = params[4]
n_estimators = params[5]
min_df = params[6]
ngram_range = (1, params[7])

tfidf = TfidfVectorizer(min_df=min_df, ngram_range=ngram_range)
Xtrain_vec = tfidf.fit_transform(Xtrain)
Xtest_vec = tfidf.transform(Xtest)

lgbm = LGBMClassifier(learning_rate=lr, num_leaves=2 ** max_depth, max_depth=max_depth, 
                     min_child_samples=min_child_samples, subsample=subsample,
                     colsample_bytree=colsample_bytree, bagging_freq=1,n_estimators=n_estimators, random_state=0, 
                     class_weight="balanced", n_jobs=6)
lgbm.fit(Xtrain_vec, ytrain)
ypred_lgbm = lgbm.predict(Xtest_vec)

roc_auc_lgbm = roc_auc_score(ytest, ypred_lgbm)
acc_lgbm = accuracy_score(ytest, ypred_lgbm)
f1_lgbm = f1_score(ytest, ypred_lgbm)
avg_prec_lgbm = average_precision_score(ytest, ypred_lgbm)
print('roc auc:\t{}\naccuracy:\t{}\nf1 score:\t{}\navg precision:\t{}'.format(roc_auc_lgbm, acc_lgbm,
                                                                              f1_lgbm, avg_prec_lgbm))

roc auc:	0.8740799999999999
accuracy:	0.87408
f1 score:	0.8758577174856061
avg precision:	0.823079670244206


<font size=3>Agora com todos os modelos testados, podemos ver que o que tem a melhor performance é a <i>Regressão Logística</i>. Para tentarmos melhorar essas métricas fazendo um ensemble dos modelos.</font>

<a name='VOT'></a>
<h2>Voting</h2>

<font size=3>Uma tentativa de melhorar as métricas do nosso modelo seria fazer um ensemble dos modelos que treinamos.</font>

In [31]:
params = res.x
learn_rate = params[0]
max_depth = params[1]
min_child_samples = params[2]
subsample = params[3]
colsample_bytree = params[4]
n_estimators = params[5]
min_df = params[6]
ngram_range = (1, params[7])

tfidf = TfidfVectorizer(min_df=min_df, ngram_range=ngram_range)
Xtrain_vec = tfidf.fit_transform(Xtrain)
Xtest_vec = tfidf.transform(Xtest)

lr = LogisticRegression(random_state=79)
rf = RandomForestClassifier(n_estimators=1000, min_samples_leaf=2, class_weight='balanced', n_jobs=5, random_state=79)
lgbm = LGBMClassifier(learning_rate=learn_rate, num_leaves=2 ** max_depth, max_depth=max_depth, 
                     min_child_samples=min_child_samples, subsample=subsample,
                     colsample_bytree=colsample_bytree, bagging_freq=1,n_estimators=n_estimators, random_state=0, 
                     class_weight="balanced", n_jobs=6)

vot = VotingClassifier(estimators=[('lr', lr), ('rf', rf), ('lgbm', lgbm)], voting='soft')
vot.fit(Xtrain_vec, ytrain)
ypred_vot = vot.predict(Xtest_vec)

roc_auc_vot = roc_auc_score(ytest, ypred_vot)
acc_vot = accuracy_score(ytest, ypred_vot)
f1_vot = f1_score(ytest, ypred_vot)
avg_prec_vot = average_precision_score(ytest, ypred_vot)
print('roc auc:\t{}\naccuracy:\t{}\nf1 score:\t{}\navg precision:\t{}'.format(roc_auc_vot, acc_vot,
                                                                              f1_vot, avg_prec_vot))

roc auc:	0.886
accuracy:	0.886
f1 score:	0.8871465906391066
avg precision:	0.8390286968794103


<font size=3>Com as métricas acima podemos verificar que mesmo usando <i>Voting</i> não conseguimos bater a <i>Regressão Logística</i>.</font>