In [1]:
import os
import numpy as np
import pandas as pd
import pickle
from tqdm import tqdm

## Looking into the Data

In [2]:
path = "./data/aclImdb/"
positiveFiles = [x for x in os.listdir(path+"train/pos/") if x.endswith(".txt")]
negativeFiles = [x for x in os.listdir(path+"train/neg/") if x.endswith(".txt")]
testPositiveFiles = [x for x in os.listdir(path+"test/pos") if x.endswith(".txt")]
testNegativeFiles = [x for x in os.listdir(path+"test/neg") if x.endswith(".txt")]

In [3]:
print(len(positiveFiles),len(negativeFiles))

12500 12500


In [4]:
print(len(testPositiveFiles),len(testNegativeFiles))

12500 12500


In [5]:
with open(path+"train/pos/"+positiveFiles[4], encoding="latin1") as f:
    print(f.read())

Having watched this film strictly on the strength of reviewers' ratings I was most pleasantly surprised. Although clearly low-budget, it bears the signs of clever ingenuity. For example, when Julia wakes in the strange house and looks out the window I found myself thinking that her sense of isolation would be enhanced with an exterior shot focused on her face and then moving backwards to include the house and its isolated location. And lo and behold! the next scene was exactly that last shot of the house standing lonely on the cliff at the water's edge. There are other examples of how a clever director can elevate his film to the level of a very enjoyable thriller. Savvy viewers will surely spot them but should rest assured they will not be disappointed.<br /><br />As to the performances, George Macready is his usual creepy self, barely maintaining his composure while suggesting a capacity for unadulterated violence. Nina Foch was surprisingly good as the no-nonsense working girl who's

In [6]:
with open(path+"train/pos/"+positiveFiles[2000], encoding="latin1") as f:
    print(f.read())

Definitely one of the lesser of the Astaire/Rogers musicals. It's just very poorly plotted and paced. It only runs a few minutes longer than Swing Time, for example, but it feels a heck of a lot longer. This is partly due to the secondary romance between Randolph Scott and Harriet Hilliard. Scott is rarely ever interesting. I like Hilliard. She's sweet, and I love at least one of her songs, "But Where Are You?" ("Get Thee Behind Me Satan", her other number, is a weak leftover from Top Hat, thankfully cut from that masterpiece). Follow the Fleet would actually be a bad film if not for at least three brilliant dance sequences between Astaire and Rogers. The dancing contest vies for the top spot of any of their numbers. The dance is just fantastic. "I'm Putting All My Eggs in One Basket" presents the two rehearsing a dance that they don't quite have perfected yet. Its imperfections make it all the more perfect. And "Let's Face the Music and Dance" is easily one of Irving Berlin's best son

## Loading Data and Cleaning
Each review is cleaned with the following steps:
* Remove punctuation from words (e.g. "what's" becomes "whats").
* Remove tokens that are only punctuation (e.g. "-").
* Remove tokens that contain numbers (e.g. "10/10").
* Remove tokens that have only one character (e.g. "a").
* Remove stop words that carry little meaning (e.g. "and").

In [7]:
from nltk.corpus import stopwords
import string

# build the stop-word set once rather than on every call
STOP_WORDS = set(stopwords.words('english'))

def clean_doc(text):
    # lowercase and split on whitespace
    tokens = text.lower().split()
    # strip punctuation characters from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # keep only purely alphabetic tokens (drops numbers and empty strings)
    tokens = [w for w in tokens if w.isalpha()]
    # filter out stop words
    tokens = [w for w in tokens if w not in STOP_WORDS]
    # filter out one-character tokens
    tokens = [w for w in tokens if len(w) > 1]
    return ' '.join(tokens)

In [8]:
train_X, train_y, test_X, test_y = [], [], [], []

for pfile in positiveFiles:
    with open(path+"train/pos/"+pfile, encoding="latin1") as f:
        text = clean_doc(f.read())
        train_X.append(text)
        train_y.append(1)
for nfile in negativeFiles:
    with open(path+"train/neg/"+nfile, encoding="latin1") as f:
        text = clean_doc(f.read())
        train_X.append(text)
        train_y.append(0)
        
for tpfile in testPositiveFiles:
    with open(path+"test/pos/"+tpfile, encoding="latin1") as f:
        text = clean_doc(f.read())
        test_X.append(text)
        test_y.append(1)
for tnfile in testNegativeFiles:
    with open(path+"test/neg/"+tnfile, encoding="latin1") as f:
        text = clean_doc(f.read())
        test_X.append(text)
        test_y.append(0)

In [9]:
# cleaned data
train_X[4]

'watched film strictly strength reviewers ratings pleasantly surprised although clearly lowbudget bears signs clever ingenuity example julia wakes strange house looks window found thinking sense isolation would enhanced exterior shot focused face moving backwards include house isolated location lo behold next scene exactly last shot house standing lonely cliff waters edge examples clever director elevate film level enjoyable thriller savvy viewers surely spot rest assured disappointedbr br performances george macready usual creepy self barely maintaining composure suggesting capacity unadulterated violence nina foch surprisingly good nononsense working girl whos submit without fight dame may witty oh boy even doubting eyes believing could get away evil schemesbr br real diamond rough missed'
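The cleaned review above still contains fragments such as `disappointedbr br`. The raw text includes `<br />` tags, and the punctuation step only strips the angle brackets and the slash, leaving the letters `br` behind. A minimal sketch of one way to handle this, assuming it is acceptable to replace the tags with a space before tokenizing. `strip_line_breaks` is a hypothetical helper and is not part of the pipeline used in this notebook:

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)

def strip_line_breaks(text):
    """Replace HTML line-break tags with a single space (hypothetical helper)."""
    return BR_TAG.sub(" ", text)

sample = "they will not be disappointed.<br /><br />As to the performances"
cleaned_sample = strip_line_breaks(sample)
print(cleaned_sample)
```

Calling such a helper on the raw text before `clean_doc` would keep `disappointed` and the following word as separate, clean tokens. The outputs shown in this notebook were produced without it.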

In [10]:
test_X[0]

'woman aunt go scotland locate evasive muchmaligned film denouement point interesting wellacted eerie fine set design william cameron menzies developed projection veronica hurst captivating genteel sort chic british marilyn monroe still love richard carlson hiding family secret forbidding castle even bats belfry moves leisurely final extraordinary setpiece hurst aunt katherine emery also narrator sneak castle night venture maze find theyre looking center kid always remembered sequence theres nothing scarier claustrophobic finding way high maze hedges naturally two women get separated setting stage engrossing suspense horrific music final result mildly disappointing really since carlsons epilogue end makes sense goingson even provoking sympathy worth seeing'

In [11]:
# shuffling data
from sklearn.utils import shuffle
train_X, train_y = shuffle(train_X, train_y, random_state=0)
test_X, test_y = shuffle(test_X, test_y, random_state=1)

In [12]:
print(train_X[2],train_y[2])

reason dont give movie fewer stars isnt quite par movie like manos hands fate movies greatest crime fact headmeltingly boring terribly unforgivably british premise movie sounds potentially promising whole teleporting concept direction went completely uninteresting movie research funding bowties projecting lasers actors wooden unemotional aloof love affair two scientists anything intriguing never able tell attraction chemistry nonexistent really understand meltyfaced main guy decided slaughter everyone met least know always give someone fair hearing cut research grants else go rampaging killing wantonly goofy hand gestures 0


## Loading GloVe word vectors

In [13]:
word2vec = {}
with open('./data/glove/glove.6B.100d.txt', encoding='utf-8') as f:
  for line in f:
    values = line.split()
    word = values[0]
    vec = np.asarray(values[1:], dtype='float32')
    word2vec[word] = vec
print('Found %s word vectors.' % len(word2vec))

Found 400000 word vectors.
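Each line of the GloVe text file holds a word followed by its vector components, separated by spaces. The sketch below parses two made-up lines in that layout (3 dimensions instead of 100, values invented for illustration) to show what ends up in the dictionary:

```python
import io
import numpy as np

# Two invented lines in the GloVe text layout: word, then its components.
fake_glove = io.StringIO("film 0.1 -0.2 0.3\nmovie 0.0 0.5 -0.5\n")

toy_vectors = {}
for line in fake_glove:
    values = line.split()
    # First field is the word, the rest are the vector components.
    toy_vectors[values[0]] = np.asarray(values[1:], dtype="float32")

print(toy_vectors["film"].shape, toy_vectors["film"].dtype)
```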


## Tokenization and padding

In [14]:
# count training reviews longer than 500 tokens
count = sum(len(text.split()) > 500 for text in train_X)
count

170
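Only 170 of the 25000 training reviews (0.68%) exceed 500 tokens, so a cutoff of 500 truncates very few of them. A small sketch of how the same check could be generalized to any cutoff, using invented toy documents and a hypothetical `limit`:

```python
import numpy as np

# Invented toy documents standing in for the cleaned reviews.
toy_docs = ["good film", "bad", "one of the best films ever made", "fine"]
lengths = np.array([len(doc.split()) for doc in toy_docs])

limit = 3  # hypothetical cutoff, playing the role of 500 above
share_truncated = float(np.mean(lengths > limit))

print("longest:", lengths.max())
print("share longer than limit:", share_truncated)
```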

In [15]:
MAX_SEQUENCE_LENGTH = 500
MAX_VOCAB_SIZE = 30000
EMBEDDING_DIM = 100

In [16]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

Using TensorFlow backend.


In [17]:
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts(train_X)
train_X = tokenizer.texts_to_sequences(train_X)
test_X = tokenizer.texts_to_sequences(test_X)

In [18]:
word2idx = tokenizer.word_index
print('Found %s unique tokens.' % len(word2idx))

Found 117014 unique tokens.
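Note that `word_index` lists every token seen during fitting (117014 here), while `num_words` only limits which indices `texts_to_sequences` emits: tokens whose index is `num_words` or higher are dropped. The sketch below imitates that behaviour with plain Python on toy data. It is an illustrative re-implementation under that assumption, not the Keras code itself:

```python
from collections import Counter

toy_texts = ["good film good cast", "bad film", "good plot"]
toy_num_words = 3  # keep only indices 1 and 2

counts = Counter(word for text in toy_texts for word in text.split())
# Index 1 goes to the most frequent word; index 0 is left for padding.
toy_word_index = {
    word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)
}

# Drop every word whose index is not below toy_num_words.
toy_sequences = [
    [toy_word_index[w] for w in text.split() if toy_word_index[w] < toy_num_words]
    for text in toy_texts
]
print(len(toy_word_index), toy_sequences)
```

The full index has five entries, but only the two most frequent words survive in the sequences.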


In [19]:
train_X = pad_sequences(train_X, maxlen=MAX_SEQUENCE_LENGTH)
test_X = pad_sequences(test_X, maxlen=MAX_SEQUENCE_LENGTH)
print('Shape of train_X and test_X:', train_X.shape, test_X.shape)

Shape of train_X and test_X: (25000, 500) (25000, 500)
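With its default arguments (`padding="pre"`, `truncating="pre"`), `pad_sequences` adds zeros at the front of short sequences and drops tokens from the front of long ones. The following NumPy sketch imitates that default on toy input. `pad_pre` is a hypothetical stand-in written for illustration, not the Keras function:

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """Left-pad with zeros; keep the last `maxlen` items of longer sequences."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for row, seq in enumerate(sequences):
        tail = seq[-maxlen:]
        if tail:  # an empty sequence stays all zeros
            out[row, -len(tail):] = tail
    return out

toy_padded = pad_pre([[5, 6], [1, 2, 3, 4, 5]], maxlen=4)
print(toy_padded)
```

The short sequence becomes `[0, 0, 5, 6]` and the long one keeps its final four tokens, `[2, 3, 4, 5]`.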


In [20]:
with open("./data_pickle/train_X.pkl","wb") as f:
    pickle.dump(train_X,f)

In [21]:
with open("./data_pickle/train_y.pkl","wb") as f:
    pickle.dump(train_y,f)

In [22]:
with open("./data_pickle/test_X.pkl","wb") as f:
    pickle.dump(test_X,f)

In [23]:
with open("./data_pickle/test_y.pkl","wb") as f:
    pickle.dump(test_y,f)
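The pickled arrays can be read back with `pickle.load`. The sketch below shows a round trip on a toy array in a temporary directory, so it does not touch `./data_pickle/`. Note that `open(..., "wb")` does not create missing directories, so the target folder has to exist before the cells above are run:

```python
import os
import pickle
import tempfile

import numpy as np

toy_array = np.arange(6).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_X.pkl")  # hypothetical file name
    with open(toy_path, "wb") as f:
        pickle.dump(toy_array, f)
    with open(toy_path, "rb") as f:
        restored = pickle.load(f)

print(np.array_equal(toy_array, restored))
```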

## Preparing the embedding matrix

In [24]:
# row 0 is reserved for padding; Tokenizer indices start at 1
num_words = min(MAX_VOCAB_SIZE, len(word2idx) + 1)
embedding_matrix = np.zeros((num_words, EMBEDDING_DIM))
for word, i in tqdm(word2idx.items(), total=len(word2idx)):
    if i < num_words:
        embedding_vector = word2vec.get(word)
        # words missing from the GloVe index keep their all-zero row
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

100%|██████████| 117014/117014 [00:00<00:00, 681853.88it/s]


In [25]:
num_words

30000
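Rows for words without a pretrained vector stay all zeros, so counting the non-zero rows gives a rough idea of how much of the vocabulary is covered. A minimal sketch on an invented 4-row matrix; the same idea could be applied to `embedding_matrix`, assuming no pretrained vector is exactly zero:

```python
import numpy as np

# Invented toy matrix: row 0 is padding, row 2 has no pretrained vector.
toy_matrix = np.array([
    [0.0, 0.0],
    [0.1, 0.2],
    [0.0, 0.0],
    [0.3, -0.1],
])

has_vector = np.any(toy_matrix != 0, axis=1)
covered = int(has_vector[1:].sum())   # skip the padding row
total = toy_matrix.shape[0] - 1

print(f"{covered}/{total} vocabulary rows have a pretrained vector")
```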

In [26]:
with open("./data_pickle/embedding_matrix.pkl","wb") as f:
    pickle.dump(embedding_matrix,f)