# IMDB sentiment analysis
In this project, IMDB user movie reviews are classified as either negative or positive. The training set includes sentiment labels, so this is a supervised learning problem. The accompanying Kaggle competition is evaluated on area under the ROC curve (AUC), so the predictions are probabilities of a review being positive. There is a "leak" in the test set: the review ids encode the rating, which reveals the ground truth and lets us compute AUC scores without submitting to Kaggle.

First, a simple bag-of-words + logistic regression approach is evaluated, which achieves a surprisingly good AUC of about 0.957 on the test set. Then, a convolutional neural network (CNN) is trained for the same purpose. With the current architecture it does not perform quite as well (AUC of about 0.955), but combining its predictions with those of the simpler model raises the test AUC to about 0.961.

**Note**: This is a work in progress. The documentation will be improved, and pretrained word vectors and other neural network architectures will be examined later.

In [72]:
import pandas as pd
from bs4 import BeautifulSoup
import re
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score

from keras.models import Sequential
from keras.layers import Embedding, Dropout, Conv1D, MaxPooling1D, Flatten, Dense, BatchNormalization, SpatialDropout1D
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.optimizers import Adam, RMSprop

DATA_DIR = 'D:/Data/Kaggle_imdb/'
np.random.seed(2017)

In [2]:
train_labeled = pd.read_csv(DATA_DIR + 'labeledTrainData.tsv', sep='\t', quoting=3)
train_unlabeled = pd.read_csv(DATA_DIR + 'unlabeledTrainData.tsv', sep='\t', quoting=3)
test = pd.read_csv(DATA_DIR + 'testData.tsv', sep='\t', quoting=3)

In [3]:
train_labeled.head()

|   | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |


In [4]:
train_unlabeled.head()

|   | id | review |
|---|---|---|
| 0 | "9999_0" | "Watching Time Chasers, it obvious that it was... |
| 1 | "45057_0" | "I saw this film about 20 years ago and rememb... |
| 2 | "15561_0" | "Minor Spoilers<br /><br />In New York, Joan B... |
| 3 | "7161_0" | "I went to see this film with a great deal of ... |
| 4 | "43971_0" | "Yes, I agree with everyone on this site this ... |


In [5]:
print(train_labeled['sentiment'].unique())

[1 0]


In [6]:
print(train_labeled.shape)

(25000, 3)


In [7]:
print(train_labeled['review'].values[:2])

[ '"With all this stuff going down at the moment with MJ i\'ve started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ\'s feeling towards the press and also the obvious message of drugs are bad m\'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it final

Several text preprocessing techniques, such as lemmatization and stop-word removal, were tried, but the cross-validated (CV) score was highest without them.

In [8]:
def text_process(review):
    text = BeautifulSoup(review, 'html5lib').get_text().lower()
    text = re.sub('[^A-Za-z0-9! ]', '', text)
    return text
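
To make the effect of the regular-expression step concrete, here is a small standalone sketch that applies the same kind of character filter to a made-up sentence. It leaves out the BeautifulSoup HTML stripping so that it runs with the standard library alone; the helper name `strip_non_alnum` and the sample text are illustrative assumptions, not part of the notebook.

```python
import re


def strip_non_alnum(text):
    """Lowercase the text, then keep only letters, digits, '!' and spaces."""
    return re.sub('[^a-z0-9! ]', '', text.lower())


# Hypothetical sample sentence, used only for illustration.
sample = "Visually impressive, but it's all about MJ!"
cleaned = strip_non_alnum(sample)
print(cleaned)
```

Punctuation other than `!` is deleted rather than replaced by a space, so contractions such as "it's" collapse into a single token.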

In [9]:
%%time
train_reviews = train_labeled['review'].apply(text_process).values
test_reviews = test['review'].apply(text_process).values
unlabeled_reviews = train_unlabeled['review'].apply(text_process).values
train_sentiment = train_labeled['sentiment']

Wall time: 6min 42s


In [10]:
test_sentiment = test['id'].apply(lambda x: (int(x.split('_')[1][:-1]) > 5) * 1)
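
The cell above is where the "leak" is exploited: the part of each test id after the underscore is the review's rating, and a rating above 5 is treated as positive. Because the files are read with `quoting=3`, the ids keep their surrounding double quotes, which is why the original expression drops the last character. The sketch below shows the same idea on hypothetical ids; the function name `sentiment_from_id` is an assumption, and it strips the quotes with `strip('"')` instead of slicing.

```python
def sentiment_from_id(review_id):
    """Derive a 0/1 sentiment label from an id of the form '"<number>_<rating>"'."""
    # Remove the literal double quotes kept by quoting=3, then take the rating.
    rating = int(review_id.strip('"').split('_')[1])
    # Ratings above 5 count as positive, as in the notebook cell.
    return int(rating > 5)


# Hypothetical ids, used only for illustration.
example_ids = ['"101_10"', '"102_2"', '"103_7"']
example_labels = [sentiment_from_id(i) for i in example_ids]
print(example_labels)
```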

TF-IDF weighting lowered the cross-validated AUC score, so it was not used.

In [11]:
text_pipe = Pipeline([
    ('bow', CountVectorizer(ngram_range=(1,2), min_df=2)),
    #('tf-idf', TfidfTransformer()),
    ('lr', LogisticRegression())
])
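
As a rough illustration of what the `bow` step produces, the sketch below builds a vocabulary of unigrams and bigrams and keeps only the terms that occur in at least two documents, mirroring `ngram_range=(1,2)` and `min_df=2`. It is a simplified stand-in, not scikit-learn's implementation: it splits on whitespace only, and the helper names and toy documents are assumptions.

```python
from collections import Counter


def ngrams(text, n_max=2):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    tokens = text.split()
    return [' '.join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def build_vocabulary(docs, min_df=2):
    """Keep the n-grams that appear in at least min_df different documents."""
    doc_freq = Counter()
    for doc in docs:
        # A set ensures each term is counted once per document.
        doc_freq.update(set(ngrams(doc)))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


# Toy documents, used only for illustration.
docs = ['not good at all', 'not good', 'very good film']
vocab = build_vocabulary(docs)
print(vocab)
```

Bigrams such as "not good" let a linear model pick up simple negation that single words cannot express.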

In [12]:
%%time
cv_score = cross_val_score(text_pipe, train_reviews, train_sentiment, 
                           cv=5, n_jobs=-1, scoring='roc_auc')
print('CV mean: {}, CV std: {}'.format(cv_score.mean(), cv_score.std()))

CV mean: 0.955424096, CV std: 0.00338045430448
Wall time: 1min 49s


In [13]:
text_pipe.fit(train_reviews, train_sentiment)

Pipeline(steps=[('bow', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=2,
        ngram_range=(1, 2), preprocessor=None, stop_words=None,
        str...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

In [14]:
lr_preds = text_pipe.predict_proba(test_reviews)[:,1]
print(roc_auc_score(test_sentiment, lr_preds))

0.9570968768


## WE MUST GO DEEPER

In [15]:
ascii_train = [review.encode('ascii') for review in train_reviews]
ascii_test = [review.encode('ascii') for review in test_reviews]
ascii_unlabeled = [review.encode('ascii') for review in unlabeled_reviews]
ascii_all = ascii_train + ascii_test + ascii_unlabeled

In [21]:
np.percentile(train_labeled['review'].apply(lambda x: len(x.split())), 98)

789.0

In [16]:
MAX_WORDS = 8000
SEQ_LEN = 800
BATCH_SIZE = 64

In [17]:
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(ascii_all)

In [18]:
train_seq = tokenizer.texts_to_sequences(ascii_train)
test_seq = tokenizer.texts_to_sequences(ascii_test)
unlabeled_seq = tokenizer.texts_to_sequences(ascii_unlabeled)

In [19]:
train_seq = sequence.pad_sequences(train_seq, maxlen=SEQ_LEN)
test_seq = sequence.pad_sequences(test_seq, maxlen=SEQ_LEN)
unlabeled_seq = sequence.pad_sequences(unlabeled_seq, maxlen=SEQ_LEN)
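
With its default settings, `pad_sequences` pads short sequences with zeros at the front and truncates long ones from the front, so every review ends up exactly `SEQ_LEN` tokens long and the last words of a long review are the ones kept. The pure-Python sketch below imitates that default behaviour for a single sequence; the helper name `pad_pre` is an assumption and it expects a positive `maxlen`.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad with `value` or drop leading items so the result has length maxlen."""
    # Keep only the last maxlen items (truncation from the front).
    kept = list(seq)[-maxlen:]
    # Fill the remaining positions at the front with the padding value.
    return [value] * (maxlen - len(kept)) + kept


short = pad_pre([5, 8, 2], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)
print(long_)
```

Index 0 is reserved for padding, which is why the tokenizer's word indices start at 1.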

In [20]:
mask = np.random.rand(len(train_seq)) < 0.9

In [21]:
X_train = train_seq[mask]
y_train = train_sentiment[mask]
X_val = train_seq[~mask]
y_val = train_sentiment[~mask]

In [22]:
model = Sequential([
    Embedding(MAX_WORDS, 32, input_length=SEQ_LEN),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')
])

model.compile(Adam(lr=0.0001), 'binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=8, validation_data=(X_val, y_val), batch_size=BATCH_SIZE)

Train on 22451 samples, validate on 2549 samples
Epoch 1/8
Epoch 2/8
Epoch 3/8
Epoch 4/8
Epoch 5/8
Epoch 6/8
Epoch 7/8
Epoch 8/8


<keras.callbacks.History at 0x1e6c24710>

In [27]:
conv_model = Sequential([
    Embedding(MAX_WORDS, 50, input_length=SEQ_LEN),
    SpatialDropout1D(0.2),
    Conv1D(64, 5, activation='relu'),
    MaxPooling1D(),
    Flatten(),
    Dropout(0.2),
    Dense(128, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')
])

conv_model.compile(RMSprop(lr=0.0001), 'binary_crossentropy', metrics=['accuracy'])
conv_model.fit(X_train, y_train, epochs=4, validation_data=(X_val, y_val), batch_size=BATCH_SIZE)

Train on 22451 samples, validate on 2549 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0xa0c3ff748>

In [28]:
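from keras import backend as K
# Update the learning-rate variable in place; plain attribute assignment may not
# reach a training function that has already been built by an earlier fit() call.
K.set_value(conv_model.optimizer.lr, 0.00001)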
conv_model.fit(X_train, y_train, epochs=2, validation_data=(X_val, y_val), batch_size=BATCH_SIZE)

Train on 22451 samples, validate on 2549 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0xa0bdd3978>

In [29]:
conv_model.save_weights('val_acc_8917.hdf5')

In [44]:
conv_model.load_weights('val_acc_8917.hdf5')

In [45]:
conv_model.fit(X_train, y_train, epochs=1, validation_data=(X_val, y_val), batch_size=BATCH_SIZE)

Train on 22451 samples, validate on 2549 samples
Epoch 1/1


<keras.callbacks.History at 0xa0bdd3cc0>

In [57]:
cnn_pred = conv_model.predict(test_seq, batch_size=BATCH_SIZE * 2).ravel()

The CNN does not perform quite as well as the logistic regression model.

In [58]:
print(roc_auc_score(test_sentiment, cnn_pred))

0.9546203456


In [70]:
np.corrcoef([lr_preds, cnn_pred])

array([[ 1.        ,  0.93157591],
       [ 0.93157591,  1.        ]])

With a correlation of about 0.93, the CNN's predictions still differ enough from those of the simpler model to give a decent boost when the two are combined.

In [67]:
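# Weighted average (0.4 CNN, 0.6 logistic regression); dividing by the sum of the
# weights keeps the result in [0, 1]. AUC depends only on ranking, so the score is unchanged.
combined_preds = (cnn_pred * 4 + lr_preds * 6) / 10
print(roc_auc_score(test_sentiment, combined_preds))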

0.9612864256


With an AUC score of 0.96129, this submission would rank 65th out of 578 in the Kaggle competition.
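
To illustrate why blending two imperfectly correlated models can help, here is a minimal sketch with a pairwise AUC computed in plain Python and a 0.4/0.6 weighted average like the one used for the CNN and logistic regression predictions. The toy labels and scores are invented for illustration, and the helper names `auc` and `blend` are assumptions; the real scores in this notebook come from `roc_auc_score`.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / float(len(pos) * len(neg))


def blend(scores_a, scores_b, weight_a=0.4):
    """Weighted average of two score lists; the weights sum to 1."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(scores_a, scores_b)]


# Toy data: each model misranks a different pair of reviews.
labels = [1, 0, 1, 0]
model_a = [0.9, 0.6, 0.4, 0.2]
model_b = [0.6, 0.3, 0.8, 0.7]
blended = blend(model_a, model_b)

print(auc(labels, model_a), auc(labels, model_b), auc(labels, blended))
```

Because AUC depends only on the ordering of the scores, multiplying the blended scores by any positive constant leaves it unchanged; normalising the weights matters only if the outputs are to be read as probabilities.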

## SAVE RESULTS

In [68]:
# Use the review ids (not the positional index) so the file matches the submission format.
results = pd.DataFrame({'id': test['id'], 'sentiment': combined_preds})

In [69]:
results.to_csv('submission.csv', index=False, quoting=3)