# Assignment 10

# Mary Donovan Martello

## This file contains code from Deep Learning with Python, www.manning.com/books/deep-learning-with-python, Copyright 2018 Francois Chollet.

## Purpose:  Transform text input into tokens and convert those tokens into numeric vectors using one-hot encoding and feature hashing.  Build basic text-processing and classification models using recurrent neural networks.  Demonstrate how word embeddings such as Word2Vec can help improve the performance of text-processing models.

## Assignment 10.1:  Text Preprocessing

### Create a `tokenize` function that splits a sentence into words, implement an `ngram` function that splits tokens into N-grams, and implement a `one_hot_encode` function that creates a numerical vector from a list of tokens.

In [4]:
# Load libraries
import unicodedata
import sys
import re
import string

**Create a `tokenize` function to split a sentence into words.**

In [42]:
# tokenize sentence and remove punctuation

def tokenize(sentence):
    tokens = []
    # translation table that deletes every punctuation character
    table = str.maketrans('', '', string.punctuation)
    # split on whitespace, lower-case each word, and strip its punctuation
    for word in sentence.split():
        tokens.append(word.lower().translate(table))

    return tokens



In [43]:
tokens = tokenize("The man said: I've sat on the mat by the cat.")

In [44]:
print(tokens)

['the', 'man', 'said', 'ive', 'sat', 'on', 'the', 'mat', 'by', 'the', 'cat']


In [49]:
len(tokens)

11
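One edge case worth noting: because `tokenize` splits on whitespace before stripping punctuation, a word made up only of punctuation (a standalone dash, for example) ends up as an empty string in the token list. Below is a minimal sketch of a variant that drops such empties. The name `tokenize_nonempty` is hypothetical and not part of the assignment.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def tokenize_nonempty(sentence):
    """Hypothetical variant of tokenize() that skips tokens left empty
    after punctuation removal."""
    tokens = []
    for word in sentence.split():
        stripped = word.lower().translate(_PUNCT_TABLE)
        if stripped:  # skip words that consisted only of punctuation
            tokens.append(stripped)
    return tokens

print(tokenize_nonempty("The cat - a tabby - sat."))
# ['the', 'cat', 'a', 'tabby', 'sat']
```

On the sample sentence used above, this variant returns the same 11 tokens, since none of its words is punctuation only.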

**Implement an `ngram` function that splits tokens into N-grams.**

In [47]:
def ngram(tokens, n):
    # Build n shifted views of the token list and zip them together,
    # so that each tuple holds n consecutive tokens
    grams = zip(*[tokens[i:] for i in range(n)])
    # Join each tuple into a single space-separated n-gram and return a flat list
    return [" ".join(gram) for gram in grams]


In [48]:
ngram(tokens, 2)

['the man',
 'man said',
 'said ive',
 'ive sat',
 'sat on',
 'on the',
 'the mat',
 'mat by',
 'by the',
 'the cat']
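The `zip` idiom in `ngram` works by lining up `n` copies of the token list, each shifted one position further, so that zipping them yields consecutive windows. A small sketch with a made-up token list (not from the assignment) shows the intermediate steps:

```python
tokens_demo = ['a', 'b', 'c', 'd']  # illustrative tokens only
n = 3

# n shifted views of the list: offsets 0, 1, ..., n-1
shifted = [tokens_demo[i:] for i in range(n)]
print(shifted)
# [['a', 'b', 'c', 'd'], ['b', 'c', 'd'], ['c', 'd']]

# zip stops at the shortest view, so only complete windows are produced
windows = list(zip(*shifted))
print(windows)
# [('a', 'b', 'c'), ('b', 'c', 'd')]

trigrams = [" ".join(w) for w in windows]
print(trigrams)
# ['a b c', 'b c d']
```

A list of `k` tokens therefore yields `k - n + 1` n-grams, which is why the 11 tokens above produce 10 bigrams.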

**Implement a `one_hot_encode` function to create a numerical vector from a list of tokens.**

In [55]:
import numpy as np

def one_hot_encode(tokens, num_words):
    token_index = {}
    # First, build an index of all tokens
    for word in tokens:
        if word not in token_index:
            # Assign a unique index to each unique word
            token_index[word] = len(token_index) + 1 # Note that we don't attribute index 0 to anything.
    # vectorize the tokens: one row per token, one column per word index
    results = np.zeros((len(tokens), num_words))
    for j, word in enumerate(tokens):
        index = token_index[word]
        # words whose index does not fit in num_words keep an all-zero row
        if index < num_words:
            results[j, index] = 1.
    return results


In [56]:
one_hot_encode(tokens, 10)

array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
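The purpose statement also mentions feature hashing, which avoids building an explicit token index by hashing each token straight into a fixed number of buckets. Below is a minimal sketch, assuming an MD5-based bucket function (Python's built-in `hash` is salted per process for strings, so it is not reproducible across runs). The name `hash_encode` and the default `dimensionality` are illustrative choices, not part of the assignment.

```python
import hashlib

def hash_encode(tokens, dimensionality=1000):
    """Hypothetical helper: one row per token, with a single 1.0 in the
    bucket chosen by hashing the token."""
    results = [[0.0] * dimensionality for _ in tokens]
    for j, word in enumerate(tokens):
        # Stable hash of the token, reduced to a bucket in [0, dimensionality)
        digest = hashlib.md5(word.encode('utf8')).hexdigest()
        index = int(digest, 16) % dimensionality
        results[j][index] = 1.0
    return results

encoded = hash_encode(['the', 'cat', 'the'])
print(len(encoded), len(encoded[0]))  # 3 1000
print(encoded[0] == encoded[2])       # True: the same token always lands in the same bucket
```

The trade-off is that two different tokens can land in the same bucket. Such collisions become more likely as the vocabulary grows relative to `dimensionality`.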

## Assignment 10.2:  Sequential Neural Network with Embeddings

### Classify text with a sequential model including embeddings.

In [57]:
import keras

**Load Data**

In [61]:
import os

imdb_dir = "c:\\dev\\code\\DSC650\\dsc650\\data\\external\\imdb\\aclImdb"

train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

# collect reviews into a list of strings and collect review labels into a labels list
# negative reviews stored in negative directory and positive reviews in positive directory
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # https://stackoverflow.com/questions/9233027/unicodedecodeerror-charmap-codec-cant-decode-byte-x-in-position-y-character
            # the with-block closes the file even if reading fails
            with open(os.path.join(dir_name, fname), encoding="utf8") as f:
                texts.append(f.read())
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

In [62]:
len(texts)

25000

In [63]:
len(labels)

25000

**Tokenize the raw IMDB text data**

In [65]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

maxlen = 100  # cut reviews after 100 words
training_samples = 15000  
validation_samples = 10000  
max_words = 10000  # limit to the top 10,000 words in the dataset

# tokenizing the text of the raw data
# create vocabulary index based on word frequency
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# word_index maps every word seen in the texts to its integer index
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# pad data b/c all sequences in a batch must be of same length
data = pad_sequences(sequences, maxlen=maxlen)


labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

# shuffle the data because samples were ordered by negative and positive
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

# split the data into a training set and a validation set
x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

Found 88582 unique tokens.
Shape of data tensor: (25000, 100)
Shape of label tensor: (25000,)
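`pad_sequences` pads and truncates at the front of each sequence by default (`padding='pre'`, `truncating='pre'`), so with `maxlen=100` a long review keeps its last 100 tokens rather than its first 100. The pure-Python sketch below mimics that default behaviour on toy sequences. `pad_pre` is a hypothetical stand-in for illustration, not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Hypothetical stand-in for the default pad_sequences behaviour:
    truncate from the front, then left-pad with `value`."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]  # keep only the last maxlen items
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

print(pad_pre([[1, 2], [3, 4, 5, 6, 7]], maxlen=4))
# [[0, 0, 1, 2], [4, 5, 6, 7]]
```

Index 0 is safe to use as the padding value because the Keras `Tokenizer` starts its word indices at 1.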


**Train the model.**

In [67]:
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

embedding_dim = 100

# instantiate a model
model = Sequential()
# add embedding layer to vectorize words
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
# flattens 3D embedded tensor into a 2D tensor
model.add(Flatten())
# binary classifier model
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# review layers
model.summary()

# compile the model
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
# fit the model
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 100)          1000000   
_________________________________________________________________
flatten (Flatten)            (None, 10000)             0         
_________________________________________________________________
dense (Dense)                (None, 32)                320032    
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
Total params: 1,320,065
Trainable params: 1,320,065
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
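The parameter counts in the summary above can be reproduced by hand: an embedding layer stores one vector per vocabulary entry, and a dense layer stores one weight per input-output pair plus one bias per output. A short check, reusing the values of `max_words`, `embedding_dim`, and `maxlen` from the cells above:

```python
max_words = 10000      # vocabulary size
embedding_dim = 100    # length of each embedding vector
maxlen = 100           # tokens per review

embedding_params = max_words * embedding_dim   # 1,000,000
flat_size = maxlen * embedding_dim             # 10,000 inputs to Dense(32)
dense_params = flat_size * 32 + 32             # weights + biases = 320,032
output_params = 32 * 1 + 1                     # weights + bias = 33

total = embedding_params + dense_params + output_params
print(total)  # 1320065
```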


### Evaluate the model on the test data.

**Tokenize the test data.**

In [73]:
test_dir = os.path.join(imdb_dir, 'test')

labels = []
texts = []

for label_type in ['neg', 'pos']:
    dir_name = os.path.join(test_dir, label_type)
    for fname in sorted(os.listdir(dir_name)):
        if fname.endswith('.txt'):
            # the with-block closes the file even if reading fails
            with open(os.path.join(dir_name, fname), encoding="utf8") as f:
                texts.append(f.read())
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)

sequences = tokenizer.texts_to_sequences(texts)
x_test = pad_sequences(sequences, maxlen=maxlen)
y_test = np.asarray(labels)

**Evaluate the model on the test data.**

In [74]:
model.evaluate(x_test, y_test)



[3.182436466217041, 0.49827998876571655]
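`model.evaluate` returns the loss followed by the compiled metrics, so the pair above is binary cross-entropy and accuracy. As a reference point, a classifier that always outputs 0.5 has a cross-entropy of ln 2 ≈ 0.693 whatever the labels are. The sketch below computes both quantities by hand on made-up predictions (the numbers are illustrative, not taken from the model):

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy, clipping predictions away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of predictions that fall on the correct side of the threshold."""
    hits = sum((p > threshold) == bool(y) for y, p in zip(y_true, y_pred))
    return hits / len(y_true)

y_true = [0, 1, 0, 1]                      # illustrative balanced labels
uninformed = [0.5, 0.5, 0.5, 0.5]          # always predicts 0.5
overconfident = [0.99, 0.99, 0.99, 0.99]   # always predicts "positive"

print(round(binary_crossentropy(y_true, uninformed), 3))     # 0.693
print(round(binary_crossentropy(y_true, overconfident), 3))  # 2.308
print(accuracy(y_true, overconfident))                       # 0.5
```

A loss well above 0.693 combined with accuracy near 0.5, as in the result above, is consistent with predictions that are confident but carry little information about the test labels.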

## Assignment 10.3:  Add LSTM Layer

**Load and preprocess the data.**

In [75]:
from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000  # number of words to consider as features
maxlen = 500  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

print('Loading data...')
(input_train, y_train2), (input_test, y_test2) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')

print('Pad sequences (samples x time)')
# pad data b/c all sequences in a batch must be of same length
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)

Loading data...
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
input_train shape: (25000, 500)
input_test shape: (25000, 500)


**Build and train the model.**

In [76]:
from keras.layers import LSTM

# instantiate a model
model = Sequential()
# add embedding layer to vectorize words
model.add(Embedding(max_features, 32))
# add LSTM RNN layer
model.add(LSTM(32))
# binary classifier model
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
# fit the model
history = model.fit(input_train, y_train2,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
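No `model.summary()` output is shown for this model, but its size can be estimated by hand. An LSTM layer has four internal transformations (input gate, forget gate, cell candidate, output gate), each with a weight matrix over the concatenated input and hidden state plus a bias vector. Below is a sketch of the count for the layers defined above, assuming the standard Keras `LSTM` layout with default settings:

```python
max_features = 10000   # vocabulary size
embed_dim = 32         # Embedding output size, which is also the LSTM input size
units = 32             # LSTM hidden units

embedding_params = max_features * embed_dim                # 320,000
# Four transformations, each with units * (embed_dim + units) weights
# and units biases
lstm_params = 4 * (units * (embed_dim + units) + units)    # 8,320
dense_params = units * 1 + 1                               # 33

print(embedding_params + lstm_params + dense_params)  # 328353
```

Despite processing sequences five times longer (`maxlen = 500`), this model has roughly a quarter of the parameters of the flattened dense model, because the LSTM reuses the same weights at every time step.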
