# 00 Training and evaluating a DNN model on the IMDB Dataset
## Downloading and data preprocessing

The dataset was downloaded from http://ai.stanford.edu/~amaas/data/sentiment/ and is described in the following paper:

```
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}
```

In [47]:
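%time
# Note: `%time` on a line by itself times an empty statement, hence the
# microsecond readings below; `%%time` on the first line would time the cell.

import os
import pandas as pd

imdb_dir = "./datasets/aclImdb"
labels = ['neg', 'pos']  # label index: neg -> 0, pos -> 1

# Collect rows in a list and build the DataFrame once: DataFrame.append was
# removed in pandas 2.0, and appending row by row is slow on 50,000 files.
rows = []
for dir_kind in ['train', 'test']:
    for label_type in labels:
        dir_name = os.path.join(imdb_dir, dir_kind, label_type)
        for fname in os.listdir(dir_name):
            if fname.endswith('.txt'):
                # The with block closes the file even if read() fails.
                # encoding='utf-8' assumes the review files are UTF-8.
                with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                    rows.append({'text': f.read(), 'sentiment': labels.index(label_type)})

df = pd.DataFrame(rows, columns=['text', 'sentiment'])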

CPU times: user 21 µs, sys: 1 µs, total: 22 µs
Wall time: 39.1 µs


In [48]:
df.head()

|   | text | sentiment |
|---|------|-----------|
| 0 | I am quite a fan of novelist/screenwriter Mich... | 0 |
| 1 | If this book remained faithful to the book the... | 0 |
| 2 | The Eternal Jew (Der Ewige Jude) does not have... | 0 |
| 3 | Here are the matches . . . (adv. = advantage)<... | 0 |
| 4 | I'm sorry but I didn't like this doc very much... | 0 |


In [49]:
print('Number of negative instances:', len(df[df['sentiment'] == 0]))
print('Number of positive instances:', len(df[df['sentiment'] == 1]))
print('The dataset is balanced!')

Number of negative instances: 25000
Number of positive instances: 25000
The dataset is balanced!


In [50]:
print(df['text'][0])

I am quite a fan of novelist/screenwriter Michael Chabon. His novel "Wonder Boys" became a fantastic movie by Curtis Hanson. His masterful novel "The Amazing Adventures of Kavalier and Clay" won the Pulitzer Prize a few years back, and he had a hand in the script of "Spider Man 2", arguably the greatest comic book movie of all time.<br /><br />Director Rawson Marshall Thurber has also directed wonderful comedic pieces, such as the gut-busting "Dodgeball" and the genius short film series "Terry Tate: Office Linebacker". And with a cast including Peter Saarsgard, Sienna Miller, Nick Nolte and Mena Suvari, this seems like a no-brainer.<br /><br />It is. Literally.<br /><br />Jon Foster stars as Art Bechstein, the son of a mobster (Nolte) who recently graduated with a degree in Economics. Jon is in a state of arrested development: he works a minimum wage job at Book Barn, has a vapid relationship with his girlfriend/boss, Phlox (Suvari), which amounts to little more than copious amounts of

In [57]:
from bs4 import BeautifulSoup

def remove_html_tags(text):
    """Return the text content of `text` with HTML tags such as <br /> removed."""
    return BeautifulSoup(text, 'lxml').text

In [58]:
remove_html_tags(df['text'][0])

'I am quite a fan of novelist/screenwriter Michael Chabon. His novel "Wonder Boys" became a fantastic movie by Curtis Hanson. His masterful novel "The Amazing Adventures of Kavalier and Clay" won the Pulitzer Prize a few years back, and he had a hand in the script of "Spider Man 2", arguably the greatest comic book movie of all time.Director Rawson Marshall Thurber has also directed wonderful comedic pieces, such as the gut-busting "Dodgeball" and the genius short film series "Terry Tate: Office Linebacker". And with a cast including Peter Saarsgard, Sienna Miller, Nick Nolte and Mena Suvari, this seems like a no-brainer.It is. Literally.Jon Foster stars as Art Bechstein, the son of a mobster (Nolte) who recently graduated with a degree in Economics. Jon is in a state of arrested development: he works a minimum wage job at Book Barn, has a vapid relationship with his girlfriend/boss, Phlox (Suvari), which amounts to little more than copious amounts of sex, with no plans other than to c
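
In the cleaned text above, the `<br />` tags disappear without leaving a separator, so neighbouring sentences run together (`all time.Director`, `no-brainer.It is.`). One way to avoid this is to emit a space wherever a tag was. BeautifulSoup's `get_text` accepts a `separator` argument for this purpose; the sketch below shows the same idea using only the standard library's `html.parser`, so it is a stand-in rather than the method used in this notebook. The names `_TextExtractor` and `strip_tags_with_spaces` are hypothetical, and collapsing repeated whitespace is an extra choice made here.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and records a space wherever a tag appears."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(" ")

    def handle_startendtag(self, tag, attrs):
        # Called for self-closing tags such as <br />.
        self.parts.append(" ")

    def handle_endtag(self, tag):
        self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags_with_spaces(text):
    """Remove tags, put a space in their place, and collapse whitespace."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()
    return " ".join("".join(parser.parts).split())


sample = 'the greatest comic book movie of all time.<br /><br />Director Rawson'
print(strip_tags_with_spaces(sample))
# -> the greatest comic book movie of all time. Director Rawson
```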

In [59]:
df['text'] = df['text'].apply(remove_html_tags)

In [60]:
df.head()

|   | text | sentiment |
|---|------|-----------|
| 0 | I am quite a fan of novelist/screenwriter Mich... | 0 |
| 1 | If this book remained faithful to the book the... | 0 |
| 2 | The Eternal Jew (Der Ewige Jude) does not have... | 0 |
| 3 | Here are the matches . . . (adv. = advantage)T... | 0 |
| 4 | I'm sorry but I didn't like this doc very much... | 0 |


## Creating the DNN Model

In [101]:
from sklearn.model_selection import train_test_split

# Hold out 33% of the reviews for validation (shuffle=True is the default).
x_train, x_val, y_train, y_val = train_test_split(df['text'], df['sentiment'], test_size=0.33, shuffle=True)


In [102]:
print(y_train.value_counts())

print(y_val.value_counts())

1    16784
0    16716
Name: sentiment, dtype: int64
0    8284
1    8216
Name: sentiment, dtype: int64
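
The counts above are close to 50/50 but not exact, because the split is purely random, and it will differ between runs because no seed is set. `train_test_split` accepts `stratify=` and `random_state=` arguments that address both points. The pure-Python sketch below only illustrates what stratifying means: each class is split separately with the same ratio. `stratified_split` is a hypothetical helper, the data is synthetic, and this is not scikit-learn's implementation.

```python
import random
from collections import Counter, defaultdict


def stratified_split(texts, labels, test_size=0.33, seed=0):
    """Split so that each label keeps roughly the same share in both parts."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible

    # Group sample indices by label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Split every label group with the same ratio.
    train_idx, val_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    # Shuffle again so the classes are interleaved.
    rng.shuffle(train_idx)
    rng.shuffle(val_idx)

    def pick(seq, idxs):
        return [seq[i] for i in idxs]

    return (pick(texts, train_idx), pick(texts, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))


# Synthetic, perfectly balanced data: 100 samples per class.
texts = [f"review {i}" for i in range(200)]
labels = [i % 2 for i in range(200)]

x_tr, x_va, y_tr, y_va = stratified_split(texts, labels)
print(Counter(y_tr))  # 67 samples of each class
print(Counter(y_va))  # 33 samples of each class
```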


In [106]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

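# Note: the vocabulary is built from both splits, so validation words leak
# into the tokenizer; fitting on x_train alone would avoid that.
tokenizer = Tokenizer()
texts = np.concatenate((x_train, x_val), axis=0)
tokenizer.fit_on_texts(texts)

# Length of the longest review, measured in whitespace-separated words.
maxlen = max(len(t.split()) for t in texts)

# +1 because index 0 is reserved for padding.
words_size = len(tokenizer.word_index) + 1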

train_sequences = tokenizer.texts_to_sequences(x_train)
val_sequences = tokenizer.texts_to_sequences(x_val)

print('Found %s unique tokens.' % len(tokenizer.word_index))

train_data = pad_sequences(train_sequences, maxlen = maxlen)
val_data = pad_sequences(val_sequences, maxlen = maxlen)

print('Shape of train data tensor:', train_data.shape)
print('Shape of train label tensor:', y_train.shape)

print('Shape of validation data tensor:', val_data.shape)
print('Shape of validation label tensor:', y_val.shape)

Found 126505 unique tokens.
Shape of train data tensor: (33500, 2450)
Shape of train label tensor: (33500,)
Shape of validation data tensor: (16500, 2450)
Shape of validation label tensor: (16500,)
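
Every review is padded to 2450 positions, the length of the longest one. If typical reviews are much shorter (the notebook does not show the length distribution, so this is an assumption), most of each tensor is padding, and a common alternative is to cap `maxlen` at a high percentile of the lengths. By default `pad_sequences` pads with zeros at the start of a sequence and, for sequences longer than `maxlen`, drops values from the start. The sketch below mimics that default in plain Python to make the behaviour visible; `percentile_length` and `pad_pre` are hypothetical helpers, not Keras code, and `maxlen` is assumed to be positive.

```python
import math


def percentile_length(lengths, q=0.95):
    """Smallest length that covers at least a fraction q of the sequences."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value`; keep only the last `maxlen` ids of longer sequences."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


# Toy token-id sequences of lengths 2, 5, 1 and 10.
sequences = [[5, 8], [3, 1, 4, 1, 5], [9], [2, 7, 1, 8, 2, 8, 1, 8, 2, 8]]

maxlen = percentile_length([len(s) for s in sequences], q=0.75)
print(maxlen)  # -> 5

for row in pad_pre(sequences, maxlen):
    print(row)
# -> [0, 0, 0, 5, 8]
# -> [3, 1, 4, 1, 5]
# -> [0, 0, 0, 0, 9]
# -> [8, 1, 8, 2, 8]
```

With a cap like this, only the longest reviews lose their opening words, while the padded tensors become much narrower.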
