<a href="https://colab.research.google.com/github/TJConnellyContingentMacro/NU422/blob/master/RNN_on_Movie_Reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Language Modeling with RNN - Movie Reviews

## Chakin and Other Libraries Needed

In [182]:
!pip install chakin
import chakin

chakin.search(lang='English')


                   Name  Dimension  ... Language    Author
2          fastText(en)        300  ...  English  Facebook
11         GloVe.6B.50d         50  ...  English  Stanford
12        GloVe.6B.100d        100  ...  English  Stanford
13        GloVe.6B.200d        200  ...  English  Stanford
14        GloVe.6B.300d        300  ...  English  Stanford
15       GloVe.42B.300d        300  ...  English  Stanford
16      GloVe.840B.300d        300  ...  English  Stanford
17    GloVe.Twitter.25d         25  ...  English  Stanford
18    GloVe.Twitter.50d         50  ...  English  Stanford
19   GloVe.Twitter.100d        100  ...  English  Stanford
20   GloVe.Twitter.200d        200  ...  English  Stanford
21  word2vec.GoogleNews        300  ...  English    Google

[12 rows x 7 columns]


In [0]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np

import os  # operating system functions
import os.path  # for manipulation of file path names

import re  # regular expressions

from collections import defaultdict

import nltk
from nltk.tokenize import TreebankWordTokenizer

import tensorflow as tf

RANDOM_SEED = 9999

# To make output stable across runs
def reset_graph(seed=RANDOM_SEED):
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    np.random.seed(seed)

REMOVE_STOPWORDS = False  # no stopword removal 



In [184]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [0]:
import os


In [186]:
!ls "/content/drive/My Drive/embeddings"

gloVe.6B


In [187]:
embeddings_directory = '/content/drive/My Drive/embeddings/gloVe.6B'
filename = 'glove.6B.50d.txt'
embeddings_filename = os.path.join(embeddings_directory, filename)
embeddings_filename

'/content/drive/My Drive/embeddings/gloVe.6B/glove.6B.50d.txt'


## Set vocab size

In [0]:
EVOCABSIZE = 10000

## Modules - loading embeddings


In [190]:
def load_embedding_from_disks(embeddings_filename, with_indexes=True):
    """
    Read an embeddings text file. If `with_indexes=True`,
    return a tuple `(word_to_index_dict, index_to_embedding_array)`:
    a dictionary mapping each word to its row index, and a NumPy
    array holding one embedding per row. Otherwise return only a
    `word_to_embedding_dict` dictionary mapping
    from a string to a NumPy array.
    """
    if with_indexes:
        word_to_index_dict = dict()
        index_to_embedding_array = []
  
    else:
        word_to_embedding_dict = dict()

    with open(embeddings_filename, 'r', encoding='utf-8') as embeddings_file:
        for (i, line) in enumerate(embeddings_file):

            split = line.split(' ')

            word = split[0]

            representation = split[1:]
            representation = np.array(
                [float(val) for val in representation]
            )

            if with_indexes:
                word_to_index_dict[word] = i
                index_to_embedding_array.append(representation)
            else:
                word_to_embedding_dict[word] = representation

    # Empty representation for unknown words.
    _WORD_NOT_FOUND = [0.0] * len(representation)
    if with_indexes:
        _LAST_INDEX = i + 1
        word_to_index_dict = defaultdict(
            lambda: _LAST_INDEX, word_to_index_dict)
        index_to_embedding_array = np.array(
            index_to_embedding_array + [_WORD_NOT_FOUND])
        return word_to_index_dict, index_to_embedding_array
    else:
        # Keep the words already read; only unknown words get the default
        word_to_embedding_dict = defaultdict(
            lambda: _WORD_NOT_FOUND, word_to_embedding_dict)
        return word_to_embedding_dict

print('\nLoading embeddings from', embeddings_filename)
word_to_index, index_to_embedding = \
    load_embedding_from_disks(embeddings_filename, with_indexes=True)
print("Embedding loaded from disks.")


Loading embeddings from /content/drive/My Drive/embeddings/gloVe.6B/glove.6B.50d.txt
Embedding loaded from disks.
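To make the two returned structures concrete, here is a minimal sketch that applies the same parsing idea to a tiny in-memory file. The three words, their 4-dimensional vectors, and the `toy_` names are made up for illustration; only the file layout (a word followed by space-separated floats) and the extra all-zero row for unknown words follow the loader above.

```python
import io
from collections import defaultdict

import numpy as np

# Three made-up words with 4-dimensional vectors, in GloVe's text layout:
# one word per line followed by space-separated floats.
toy_glove = io.StringIO(
    "the 0.1 0.2 0.3 0.4\n"
    "movie 0.5 0.6 0.7 0.8\n"
    "good 0.9 1.0 1.1 1.2\n"
)

toy_word_to_index = {}
rows = []
for i, line in enumerate(toy_glove):
    parts = line.rstrip().split(' ')
    toy_word_to_index[parts[0]] = i
    rows.append([float(val) for val in parts[1:]])

# Append one all-zero row and send every unknown word to it.
unknown_index = len(rows)
rows.append([0.0] * len(rows[0]))
toy_index_to_embedding = np.array(rows)
toy_word_to_index = defaultdict(lambda: unknown_index, toy_word_to_index)

print(toy_index_to_embedding.shape)
print(toy_word_to_index['movie'], toy_word_to_index['not-a-word'])
```

The array has one more row than there are words, which is why the real embedding above has 400001 rows for 400000 words.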


In [191]:
# Additional background code from
# https://github.com/guillaume-chevalier/GloVe-as-a-TensorFlow-Embedding-Layer
# shows the general structure of the data structures for word embeddings
# This code is modified for our purposes in language modeling 
vocab_size, embedding_dim = index_to_embedding.shape
print("Embedding is of shape: {}".format(index_to_embedding.shape))
print("This means (number of words, number of dimensions per word)\n")
print("The first words are words that tend occur more often.")

print("Note: for unknown words, the representation is an empty vector,\n"
      "and the index is the last one. The dictionnary has a limit:")
print("    {} --> {} --> {}".format("A word", "Index in embedding", 
      "Representation"))
word = "worsdfkljsdf"  # a word obviously not in the vocabulary
idx = word_to_index[word] # index for word obviously not in the vocabulary
complete_vocabulary_size = idx 
embd = list(np.array(index_to_embedding[idx], dtype=int)) # "int" compact print
print("    {} --> {} --> {}".format(word, idx, embd))
word = "the"
idx = word_to_index[word]
embd = list(index_to_embedding[idx])  # full-precision values, no "int" cast
print("    {} --> {} --> {}".format(word, idx, embd))

# Show how to use embeddings dictionaries with a test sentence
# This is a famous typing exercise with all letters of the alphabet
# https://en.wikipedia.org/wiki/The_quick_brown_fox_jumps_over_the_lazy_dog
a_typing_test_sentence = 'The quick brown fox jumps over the lazy dog'
print('\nTest sentence: ', a_typing_test_sentence, '\n')
words_in_test_sentence = a_typing_test_sentence.split()

print('Test sentence embeddings from complete vocabulary of', 
      complete_vocabulary_size, 'words:\n')
for word in words_in_test_sentence:
    word_ = word.lower()
    embedding = index_to_embedding[word_to_index[word_]]
    print(word_ + ": ", embedding)


Embedding is of shape: (400001, 50)
This means (number of words, number of dimensions per word)

The first words are words that tend occur more often.
Note: for unknown words, the representation is an empty vector,
and the index is the last one. The dictionnary has a limit:
    A word --> Index in embedding --> Representation
    worsdfkljsdf --> 400000 --> [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    the --> 0 --> [0.418, 0.24968, -0.41242, 0.1217, 0.34527, -0.044457, -0.49688, -0.17862, -0.00066023, -0.6566, 0.27843, -0.14767, -0.55677, 0.14658, -0.0095095, 0.011658, 0.10204, -0.12792, -0.8443, -0.12181, -0.016801, -0.33279, -0.1552, -0.23131, -0.19181, -1.8823, -0.76746, 0.099051, -0.42125, -0.19526, 4.0071, -0.18594, -0.52287, -0.31681, 0.00059213, 0.0074449, 0.17778, -0.15897, 0.012041, -0.054223, -0.29871, -0.15749, -0.34758, -0.045637, -0.44251, 0.18785, 0.0027849, -0.18

In [192]:

# ------------------------------------------------------------- 
# Define vocabulary size for the language model    
# To reduce the size of the vocabulary to the n most frequently used words

def default_factory():
    return EVOCABSIZE  # last/unknown-word row in limited_index_to_embedding
# Keep only the words whose index is below EVOCABSIZE; every other word
# falls back to the unknown-word index returned by default_factory
limited_word_to_index = defaultdict(default_factory,
    {k: v for k, v in word_to_index.items() if v < EVOCABSIZE})

# Select the first EVOCABSIZE rows of index_to_embedding
limited_index_to_embedding = index_to_embedding[0:EVOCABSIZE,:]
# Set the unknown-word row to be all zeros as previously
limited_index_to_embedding = np.append(limited_index_to_embedding, 
    index_to_embedding[index_to_embedding.shape[0] - 1, :].\
        reshape(1,embedding_dim), 
    axis = 0)

# Delete large numpy array to clear some CPU RAM
#del index_to_embedding

# Verify the new vocabulary: should get same embeddings for test sentence
# Note that a small EVOCABSIZE may yield some zero vectors for embeddings
print('\nTest sentence embeddings from vocabulary of', EVOCABSIZE, 'words:\n')
for word in words_in_test_sentence:
    word_ = word.lower()
    embedding = limited_index_to_embedding[limited_word_to_index[word_]]
    print(word_ + ": ", embedding)

# ------------------------------------------------------------
# code for working with movie reviews data 
# Source: Miller, T. W. (2016). Web and Network Data Science.
#    Upper Saddle River, N.J.: Pearson Education.
#    ISBN-13: 978-0-13-388644-3
# This original study used a simple bag-of-words approach
# to sentiment analysis, along with pre-defined lists of
# negative and positive words.        
# Code available at:  https://github.com/mtpa/wnds       
# ------------------------------------------------------------
# Utility function to get file names within a directory
def listdir_no_hidden(path):
    start_list = os.listdir(path)
    end_list = []
    for file in start_list:
        if (not file.startswith('.')):
            end_list.append(file)
    return(end_list)

# define list of codes to be dropped from document
# carriage-returns, line-feeds, tabs
codelist = ['\r', '\n', '\t']   

# We will not remove stopwords in this exercise because they are
# important to keeping sentences intact
if REMOVE_STOPWORDS:
    print(nltk.corpus.stopwords.words('english'))

# previous analysis of a list of top terms showed a number of words, along 
# with contractions and other word strings to drop from further analysis, add
# these to the usual English stopwords to be dropped from a document collection
    more_stop_words = ['cant','didnt','doesnt','dont','goes','isnt','hes',\
        'shes','thats','theres','theyre','wont','youll','youre','youve', 'br',\
        've', 're', 'vs'] 

    some_proper_nouns_to_remove = ['dick','ginger','hollywood','jack',\
        'jill','john','karloff','kudrow','orson','peter','tcm','tom',\
        'toni','welles','william','wolheim','nikita']

    # start with the initial list and add to it for movie text work 
    stoplist = nltk.corpus.stopwords.words('english') + more_stop_words +\
        some_proper_nouns_to_remove

# text parsing function for creating text documents 
# there is more we could do for data preparation 
# stemming... looking for contractions... possessives... 
# but we will work with what we have in this parsing function
# if we want to do stemming at a later time, we can use
#     porter = nltk.PorterStemmer()  
# in a construction like this
#     words_stemmed =  [porter.stem(word) for word in initial_words]  
def text_parse(string):
    # replace non-alphabetic characters with spaces
    temp_string = re.sub(r'[^a-zA-Z]', '  ', string)
    # replace codes with space
    for code in codelist:
        stopstring = ' ' + code + '  '
        temp_string = re.sub(stopstring, '  ', temp_string)
    # replace single-character words with space
    temp_string = re.sub(r'\s.\s', ' ', temp_string)
    # convert uppercase to lowercase
    temp_string = temp_string.lower()
    if REMOVE_STOPWORDS:
        # replace selected character strings/stop-words with space
        for stopword in stoplist:
            stopstring = ' ' + str(stopword) + ' '
            temp_string = re.sub(stopstring, ' ', temp_string)
    # replace multiple blank characters with one blank character
    temp_string = re.sub(r'\s+', ' ', temp_string)
    return(temp_string)    

# -----------------------------------------------


Test sentence embeddings from vocabulary of 10000 words:

the:  [ 4.1800e-01  2.4968e-01 -4.1242e-01  1.2170e-01  3.4527e-01 -4.4457e-02
 -4.9688e-01 -1.7862e-01 -6.6023e-04 -6.5660e-01  2.7843e-01 -1.4767e-01
 -5.5677e-01  1.4658e-01 -9.5095e-03  1.1658e-02  1.0204e-01 -1.2792e-01
 -8.4430e-01 -1.2181e-01 -1.6801e-02 -3.3279e-01 -1.5520e-01 -2.3131e-01
 -1.9181e-01 -1.8823e+00 -7.6746e-01  9.9051e-02 -4.2125e-01 -1.9526e-01
  4.0071e+00 -1.8594e-01 -5.2287e-01 -3.1681e-01  5.9213e-04  7.4449e-03
  1.7778e-01 -1.5897e-01  1.2041e-02 -5.4223e-02 -2.9871e-01 -1.5749e-01
 -3.4758e-01 -4.5637e-02 -4.4251e-01  1.8785e-01  2.7849e-03 -1.8411e-01
 -1.1514e-01 -7.8581e-01]
quick:  [ 0.13967   -0.53798   -0.18047   -0.25142    0.16203   -0.13868
 -0.24637    0.75111    0.27264    0.61035   -0.82548    0.038647
 -0.32361    0.30373   -0.14598   -0.23551    0.39267   -1.1287
 -0.23636   -1.0629     0.046277   0.29143   -0.25819   -0.094902
  0.79478   -1.2095    -0.01039   -0.092086   0.84322   
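The vocabulary limiting in the cell above can be illustrated on a toy scale. This is a sketch under stated assumptions: the five words, the 3-dimensional vectors, and the names `full_word_to_index`, `full_embedding`, and `keep` are hypothetical stand-ins for `word_to_index`, `index_to_embedding`, and `EVOCABSIZE`.

```python
from collections import defaultdict

import numpy as np

# Hypothetical full vocabulary: 5 words ordered by frequency, 3 dimensions,
# plus a final all-zero row for unknown words.
full_word_to_index = {'the': 0, 'film': 1, 'great': 2,
                      'tedious': 3, 'zeitgeist': 4}
full_embedding = np.vstack([np.arange(15, dtype=float).reshape(5, 3) + 1.0,
                            np.zeros((1, 3))])

keep = 3  # plays the role of EVOCABSIZE

# First `keep` rows, followed by the all-zero unknown-word row.
small_embedding = np.vstack([full_embedding[:keep], full_embedding[-1:]])

# Words outside the kept range fall back to index `keep`, the zero row.
small_word_to_index = defaultdict(
    lambda: keep,
    {word: idx for word, idx in full_word_to_index.items() if idx < keep})

for word in ['the', 'great', 'zeitgeist']:
    idx = small_word_to_index[word]
    print(word, idx, small_embedding[idx])
```

Frequent words keep their original vectors, while a rarer word such as the hypothetical `zeitgeist` maps to the zero row. This is the effect noted in the cell's comment about small values of `EVOCABSIZE`.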

### Gather Reviews - Negative and Positive


In [0]:
dir_name = '/content/drive/My Drive/movie-reviews-negative/'
# neg_filenames = !ls '/content/drive/My Drive/movie-reviews-negative'
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

# for i in range(len(neg_file_list)):
#     file_exists = os.path.isfile(os.path.join(dir_name, neg_file_list[i]))
#     assert file_exists
# print('\nDirectory:',dir_name)    
# print('%d files found' % len(neg_file_list))

In [0]:
def read_data(filename):

  with open(filename, encoding='utf-8') as f:
    data = tf.compat.as_str(f.read())
    data = data.lower()
    data = text_parse(data)
    data = TreebankWordTokenizer().tokenize(data)  # The Penn Treebank

  return data

In [195]:
negative_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    #print(' ', filenames[i])

    words = read_data(os.path.join(dir_name, filenames[i]))
    negative_documents.append(words)
    


Processing document files under /content/drive/My Drive/movie-reviews-negative/


In [196]:
len(negative_documents)

500
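To see what the cleaning and tokenizing steps do to a single review, the sketch below applies the same regular expressions to a made-up sentence. The name `clean_review` is hypothetical, and `str.split` is used as a stand-in for `TreebankWordTokenizer`, on the assumption that the two behave alike once the text contains only lowercase letters and single spaces.

```python
import re

def clean_review(text):
    """Apply the same substitutions as text_parse, then split on spaces."""
    text = re.sub(r'[^a-zA-Z]', '  ', text)   # non-letters become two spaces
    text = re.sub(r'\s.\s', ' ', text)        # drop isolated single characters
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)          # collapse runs of whitespace
    return text.strip().split()

sample = "A great film! It isn't dull."
print(clean_review(sample))
```

Two details are worth noticing. The `t` left over from `isn't` is removed as a single-character word, while the leading `A` survives because the pattern requires whitespace on both sides of the character.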

Positive Reviews - Gather Data

In [197]:
dir_name = '/content/drive/My Drive/movie-reviews-positive/'
# neg_filenames = !ls '/content/drive/My Drive/movie-reviews-negative'
filenames = listdir_no_hidden(path=dir_name)
num_files = len(filenames)

positive_documents = []

print('\nProcessing document files under', dir_name)
for i in range(num_files):
    #print(' ', filenames[i])

    words = read_data(os.path.join(dir_name, filenames[i]))
    positive_documents.append(words)


Processing document files under /content/drive/My Drive/movie-reviews-positive/


In [198]:
len(positive_documents)

500

### Show Major Words

In [199]:
for doc in positive_documents:
  for word in doc:
    word_ = word.lower()
    embedding = index_to_embedding[word_to_index[word_]]
    norm = np.linalg.norm(embedding)
    if norm > 7:
      print((word + ": ").ljust(15) + str(norm))
# print("Note: here we printed words starting with capital letters, \n"
#       "however to take their embeddings we need their lowercase version (str.lower())")

cents:         7.132643022166443
yen:           7.352429926608617
huk:           7.433102355194111
billion:       7.371013581415314
duh:           7.516570923567951
mur:           7.017110525600192
cents:         7.132643022166443
index:         7.13995329162215
bouchet:       7.072356350444384
bouchet:       7.072356350444384
bouchet:       7.072356350444384
bouchet:       7.072356350444384
bouchet:       7.072356350444384
bouchet:       7.072356350444384
bouchet:       7.072356350444384
bouchet:       7.072356350444384
bouchet:       7.072356350444384
bouchet:       7.072356350444384
bouchet:       7.072356350444384
bouchet:       7.072356350444384
roeper:        7.147004479193854
roeper:        7.147004479193854


In [200]:
for doc in negative_documents:
  for word in doc:
    word_ = word.lower()
    embedding = index_to_embedding[word_to_index[word_]]
    norm = np.linalg.norm(embedding)
    if norm > 7:
      print((word + ": ").ljust(15) + str(norm))

cents:         7.132643022166443
herein:        9.025840270305642
heh:           7.136435992616148
heh:           7.136435992616148
expectable:    8.218706406533816
ee:            7.408210289925834
minister:      7.055862783003153
feh:           9.175938250165755
hah:           8.893612129012613
cents:         7.132643022166443
minister:      7.055862783003153
minister:      7.055862783003153
rahs:          8.61003805380795
uh:            7.278735447096524
minister:      7.055862783003153
hahk:          7.136847005913956
behl:          7.008982925702131
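The two listings above repeat a word once for every occurrence in the reviews. As a hedged alternative, the sketch below computes all norms in one vectorized call and reports each large-norm word once. The vocabulary, the 2-dimensional vectors, and the token list are made up for illustration.

```python
import numpy as np

# Made-up 2-dimensional embeddings, for illustration only.
toy_vocab = ['the', 'feh', 'movie', 'hah']
toy_matrix = np.array([[0.3, 0.4],
                       [6.0, 8.0],
                       [1.0, 1.0],
                       [0.0, 9.0]])

# One L2 norm per row, computed in a single vectorized call.
norms = np.linalg.norm(toy_matrix, axis=1)

# Tokens as they might appear across several reviews, with repeats.
tokens = ['the', 'feh', 'movie', 'feh', 'hah', 'the', 'hah']

threshold = 7
row_of = {word: i for i, word in enumerate(toy_vocab)}

# A set removes the repeats; sorting makes the listing reproducible.
large = sorted({t for t in tokens if norms[row_of[t]] > threshold})
for word in large:
    print((word + ": ").ljust(15) + str(norms[row_of[word]]))
```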


# Convert positive/negative documents into a NumPy array
Note that reviews vary from 22 to 1052 words, so we use the first 20 and last 20 words of each review as our word sequences for analysis.

In [201]:
max_review_length = 0  # initialize
for doc in negative_documents:
    max_review_length = max(max_review_length, len(doc))    
for doc in positive_documents:
    max_review_length = max(max_review_length, len(doc)) 
print('max_review_length:', max_review_length) 

min_review_length = max_review_length  # initialize
for doc in negative_documents:
    min_review_length = min(min_review_length, len(doc))    
for doc in positive_documents:
    min_review_length = min(min_review_length, len(doc)) 
print('min_review_length:', min_review_length) 

max_review_length: 1052
min_review_length: 22
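The first-20/last-20 windowing described above can be sketched with plain list slicing. The helper name `head_tail` and the placeholder tokens are hypothetical. The sketch also shows what happens for the shortest review, where the two windows overlap.

```python
def head_tail(tokens, n=20):
    """Return the first n and last n tokens as one list.

    Assumes len(tokens) >= n; shorter inputs give fewer than 2 * n items.
    """
    return tokens[:n] + tokens[-n:]

long_doc = ['w%d' % i for i in range(100)]
short_doc = ['w%d' % i for i in range(22)]   # the shortest review has 22 words

print(len(head_tail(long_doc)), len(head_tail(short_doc)))

# For the 22-word document the windows overlap: w2 through w19 appear twice.
print(head_tail(short_doc)[18:22])
```

Every review therefore yields exactly 40 tokens, which is what lets the sequences be stacked into one rectangular array later. For short reviews, some of those 40 tokens are duplicates.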


Construct a list of 1000 lists with 40 words in each list.

In [0]:
from itertools import chain
documents = []
for doc in negative_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))    
for doc in positive_documents:
    doc_begin = doc[0:20]
    doc_end = doc[len(doc) - 20: len(doc)]
    documents.append(list(chain(*[doc_begin, doc_end])))   

Create a list of lists of lists for the embeddings.

In [0]:

embeddings = []    
for doc in documents:
    embedding = []
    for word in doc:
        embedding.append(limited_index_to_embedding[limited_word_to_index[word]])
    embeddings.append(embedding)


In [204]:
# Check on the embeddings list of lists of lists
# -----------------------------------------------------
# Show the first word in the first document
test_word = documents[0][0]    
print('First word in first document:', test_word)    
print('Embedding for this word:\n', 
      limited_index_to_embedding[limited_word_to_index[test_word]])
print('Corresponding embedding from embeddings list of list of lists\n',
      embeddings[0][0][:])

First word in first document: story
Embedding for this word:
 [ 0.48251    0.87746   -0.23455    0.0262     0.79691    0.43102
 -0.60902   -0.60764   -0.42812   -0.012523  -1.2894     0.52656
 -0.82763    0.30689    1.1972    -0.47674   -0.46885   -0.19524
 -0.28403    0.35237    0.45536    0.76853    0.0062157  0.55421
  1.0006    -1.3973    -1.6894     0.30003    0.60678   -0.46044
  2.5961    -1.2178     0.28747   -0.46175   -0.25943    0.38209
 -0.28312   -0.47642   -0.059444  -0.59202    0.25613    0.21306
 -0.016129  -0.29873   -0.19468    0.53611    0.75459   -0.4112
  0.23625    0.26451  ]
Corresponding embedding from embeddings list of list of lists
 [ 0.48251    0.87746   -0.23455    0.0262     0.79691    0.43102
 -0.60902   -0.60764   -0.42812   -0.012523  -1.2894     0.52656
 -0.82763    0.30689    1.1972    -0.47674   -0.46885   -0.19524
 -0.28403    0.35237    0.45536    0.76853    0.0062157  0.55421
  1.0006    -1.3973    -1.6894     0.30003    0.60678   -0.46044
  2.596

In [205]:
# Show the tenth word in the seventh document
test_word = documents[6][9]    
print('Tenth word in seventh document:', test_word)    
print('Embedding for this word:\n', 
      limited_index_to_embedding[limited_word_to_index[test_word]])
print('Corresponding embedding from embeddings list of list of lists\n',
      embeddings[6][9][:])


Tenth word in seventh document: here
Embedding for this word:
 [ 0.14094   0.68201  -0.50406   0.38316   0.63427  -1.1851   -0.46932
  0.28639  -0.43216  -0.55399  -0.44542  -0.37547  -0.3705   -0.10563
  1.1606    0.43494   0.38033   0.030184 -0.24547  -0.43203  -0.031259
  0.50174   0.27714   0.12505   0.82877  -1.7273   -0.3644    0.30344
 -0.17817  -0.012443  3.4775    0.51806  -0.46432  -0.13342  -0.22624
 -0.24472   0.062998  0.50663  -0.31938   0.079926 -0.54474   0.19452
  0.12387  -0.055269  0.65444   0.43451  -0.42384   0.11082   0.11009
 -0.27094 ]
Corresponding embedding from embeddings list of list of lists
 [ 0.14094   0.68201  -0.50406   0.38316   0.63427  -1.1851   -0.46932
  0.28639  -0.43216  -0.55399  -0.44542  -0.37547  -0.3705   -0.10563
  1.1606    0.43494   0.38033   0.030184 -0.24547  -0.43203  -0.031259
  0.50174   0.27714   0.12505   0.82877  -1.7273   -0.3644    0.30344
 -0.17817  -0.012443  3.4775    0.51806  -0.46432  -0.13342  -0.22624
 -0.24472   0.062998  

In [206]:
# Show the last word in the last document
test_word = documents[999][39]    
print('Last word in last document:', test_word)    
print('Embedding for this word:\n', 
      limited_index_to_embedding[limited_word_to_index[test_word]])
print('Corresponding embedding from embeddings list of list of lists\n',
      embeddings[999][39][:])        

Last word in last document: ages
Embedding for this word:
 [ 0.23112    1.1862     0.11349   -0.88792    0.86805    0.91468
 -0.45029   -1.031     -0.9349    -0.0053284 -0.026446  -0.74621
  0.52657   -0.43498    1.089      0.35619   -0.55435   -0.69322
 -0.50731    0.3368    -0.052842   0.48811    0.99526    1.0149
  0.086193   0.39121   -0.65664   -1.3824     0.40994   -0.29029
  2.3207     0.87304    0.14717   -0.54456    1.0059     0.057116
 -0.36031    0.062007  -0.068883  -0.048031  -0.51337   -0.73957
  1.0286     0.71768    0.21626    0.289      0.93457   -0.86077
 -0.24782   -0.33616  ]
Corresponding embedding from embeddings list of list of lists
 [ 0.23112    1.1862     0.11349   -0.88792    0.86805    0.91468
 -0.45029   -1.031     -0.9349    -0.0053284 -0.026446  -0.74621
  0.52657   -0.43498    1.089      0.35619   -0.55435   -0.69322
 -0.50731    0.3368    -0.052842   0.48811    0.99526    1.0149
  0.086193   0.39121   -0.65664   -1.3824     0.40994   -0.29029
  2.3207

### Make embeddings a numpy array for use in an RNN 
### Create training and test sets with Scikit Learn


In [0]:
embeddings_array = np.array(embeddings)

# Define the labels to be used 500 negative (0) and 500 positive (1)
thumbs_down_up = np.concatenate((np.zeros((500), dtype = np.int32), 
                      np.ones((500), dtype = np.int32)), axis = 0)


### SKLearn

In [0]:
from sklearn.model_selection import train_test_split

# Random splitting of the data into training (80%) and test (20%)
X_train, X_test, y_train, y_test = \
    train_test_split(embeddings_array, thumbs_down_up, test_size=0.20, 
                     random_state = RANDOM_SEED)
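As a shape check on the steps above, the sketch below stacks a small nested list into a 3-D array and holds out 20 percent of the documents. It uses a NumPy permutation in place of `train_test_split`, so it is not the same routine and will not reproduce the same split. All `toy_` names and sizes are made up.

```python
import numpy as np

rng = np.random.RandomState(9999)

# Toy stand-in: 10 documents, 4 words each, 3-dimensional embeddings,
# stored as a list of lists of arrays like `embeddings` above.
toy_embeddings = [[rng.rand(3) for _ in range(4)] for _ in range(10)]
toy_array = np.array(toy_embeddings)        # shape (documents, words, dims)
toy_labels = np.concatenate((np.zeros(5, dtype=np.int32),
                             np.ones(5, dtype=np.int32)))

# Shuffle document indices once, then hold out the last 20 percent.
order = rng.permutation(len(toy_labels))
n_test = int(0.20 * len(toy_labels))
test_idx, train_idx = order[-n_test:], order[:-n_test]

X_tr, X_te = toy_array[train_idx], toy_array[test_idx]
y_tr, y_te = toy_labels[train_idx], toy_labels[test_idx]
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

With the notebook's data, the corresponding shapes would be 800 and 200 documents of 40 words by 50 dimensions.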

We use a very simple recurrent neural network for this assignment, adapted from Géron, A. 2017. Hands-On Machine Learning with Scikit-Learn & TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Sebastopol, Calif.: O'Reilly. [ISBN-13 978-1-491-96229-9]. See Chapter 14, Recurrent Neural Networks, pages 390-391. Source code is available at https://github.com/ageron/handson-ml in the Jupyter notebook file 14_recurrent_neural_networks.ipynb.

See the section on training a sequence classifier (In [34]), which uses the MNIST case data. We revise that code to accommodate the movie review data in this assignment.
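Before the TensorFlow version, it may help to see the recurrence that a basic RNN cell computes, written out in NumPy. This is a minimal sketch, assuming a tanh activation and random weights in place of trained ones. The sizes mirror this notebook (40 words per review, 50-dimensional embeddings, 20 neurons, 2 classes), and names such as `W_x`, `W_h`, and `W_out` are illustrative only.

```python
import numpy as np

rng = np.random.RandomState(0)

batch, n_steps, n_inputs, n_neurons, n_outputs = 5, 40, 50, 20, 2

# Random stand-ins for a batch of embedded reviews and for the weights
# that training would normally learn.
X_batch = rng.randn(batch, n_steps, n_inputs)
W_x = rng.randn(n_inputs, n_neurons) * 0.1
W_h = rng.randn(n_neurons, n_neurons) * 0.1
b = np.zeros(n_neurons)
W_out = rng.randn(n_neurons, n_outputs) * 0.1
b_out = np.zeros(n_outputs)

# Recurrence: h_t = tanh(x_t W_x + h_{t-1} W_h + b), starting from h_0 = 0.
h = np.zeros((batch, n_neurons))
for t in range(n_steps):
    h = np.tanh(X_batch[:, t, :] @ W_x + h @ W_h + b)

# A dense layer on the final state gives one logit per class.
logits = h @ W_out + b_out
predicted = logits.argmax(axis=1)
print(h.shape, logits.shape, predicted)
```

Only the final state feeds the classifier, so each review is summarized by a single 20-dimensional vector after its last word has been read.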

In [209]:
reset_graph()

n_steps = embeddings_array.shape[1]  # number of words per document 
n_inputs = embeddings_array.shape[2]  # dimension of  pre-trained embeddings
n_neurons = 20  # analyst specified number of neurons
n_outputs = 2  # thumbs-down or thumbs-up

learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, n_steps, n_inputs])
y = tf.placeholder(tf.int32, [None])

basic_cell = tf.contrib.rnn.BasicRNNCell(num_units=n_neurons)
outputs, states = tf.nn.dynamic_rnn(basic_cell, X, dtype=tf.float32)

logits = tf.layers.dense(states, n_outputs)
xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                          logits=logits)
loss = tf.reduce_mean(xentropy)
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)
correct = tf.nn.in_top_k(logits, y, 1)
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

init = tf.global_variables_initializer()

n_epochs = 50
batch_size = 100

with tf.Session() as sess:
    init.run()
    for epoch in range(n_epochs):
        print('\n  ---- Epoch ', epoch, ' ----\n')
        for iteration in range(y_train.shape[0] // batch_size):          
            X_batch = X_train[iteration*batch_size:(iteration + 1)*batch_size,:]
            y_batch = y_train[iteration*batch_size:(iteration + 1)*batch_size]
            print('  Batch ', iteration, ' training observations from ',  
                  iteration*batch_size, ' to ', (iteration + 1)*batch_size-1,)
            sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
        acc_train = accuracy.eval(feed_dict={X: X_batch, y: y_batch})
        acc_test = accuracy.eval(feed_dict={X: X_test, y: y_test})
        print('\n  Train accuracy:', acc_train, 'Test accuracy:', acc_test)



  ---- Epoch  0  ----

  Batch  0  training observations from  0  to  99
  Batch  1  training observations from  100  to  199
  Batch  2  training observations from  200  to  299
  Batch  3  training observations from  300  to  399
  Batch  4  training observations from  400  to  499
  Batch  5  training observations from  500  to  599
  Batch  6  training observations from  600  to  699
  Batch  7  training observations from  700  to  799

  Train accuracy: 0.47 Test accuracy: 0.515

  ---- Epoch  1  ----

  Batch  0  training observations from  0  to  99
  Batch  1  training observations from  100  to  199
  Batch  2  training observations from  200  to  299
  Batch  3  training observations from  300  to  399
  Batch  4  training observations from  400  to  499
  Batch  5  training observations from  500  to  599
  Batch  6  training observations from  600  to  699
  Batch  7  training observations from  700  to  799

  Train accuracy: 0.46 Test accuracy: 0.5

  ---- Epoch  2  ----
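The loop above visits the training observations in the same order every epoch and reports training accuracy on the final batch only. The sketch below shows two small helpers that could address both points: shuffled mini-batch indices, and accuracy over an arbitrary set of predictions. Both function names are hypothetical, and this is one possible variation rather than the approach used in the referenced source.

```python
import numpy as np

def shuffled_batches(n_obs, batch_size, rng):
    """Yield index arrays that cover 0..n_obs-1 once, in random order."""
    order = rng.permutation(n_obs)
    for start in range(0, n_obs, batch_size):
        yield order[start:start + batch_size]

def accuracy(predicted, actual):
    """Share of predictions that equal the true labels."""
    return float(np.mean(np.asarray(predicted) == np.asarray(actual)))

rng = np.random.RandomState(9999)
batches = list(shuffled_batches(n_obs=800, batch_size=100, rng=rng))
print(len(batches), batches[0][:5])

# Accuracy over all supplied observations rather than one batch.
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))
```

Inside the session, each index array would select rows of `X_train` and `y_train`, and the accuracy could be evaluated once per epoch on the whole training set.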

In [0]:

import matplotlib.pyplot as plt

def plot_with_labels(low_dim_embs, labels):
    assert low_dim_embs.shape[0] >= len(labels), "More labels than embeddings"
    plt.figure(figsize=(18, 18))  #in inches
    for i, label in enumerate(labels):
        x, y = low_dim_embs[i,:]
        plt.scatter(x, y)
        plt.annotate(label,
                     xy=(x, y),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')

In [0]:
# from sklearn.manifold import TSNE

# tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
# plot_only = 500
# low_dim_embs = tsne.fit_transform(final_embeddings[:plot_only,:])
# labels = [vocabulary[i] for i in range(plot_only)]
# plot_with_labels(low_dim_embs, labels)
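The commented-out cell refers to `final_embeddings` and `vocabulary`, which are not defined in this notebook. As a hedged illustration of the kind of 2-D input `plot_with_labels` expects, the sketch below reduces a random stand-in matrix to two dimensions. It uses a plain PCA via SVD rather than t-SNE, so the layout would differ from what the commented code would produce. The function name `pca_2d` and the toy data are assumptions.

```python
import numpy as np

def pca_2d(matrix):
    """Project rows onto their first two principal components."""
    centered = matrix - matrix.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

rng = np.random.RandomState(9999)
toy_embs = rng.randn(30, 50)              # stand-in for 30 word vectors
toy_labels = ['word%d' % i for i in range(30)]

low_dim = pca_2d(toy_embs)
print(low_dim.shape)
```

In the notebook, a slice of `limited_index_to_embedding` together with the matching words could play the role of `toy_embs` and `toy_labels`.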

# Conclusion


Overall, more work is needed to vet the RNN model and its training accuracy. Initial modeling, though, suggests it could be quite useful for monitoring call logs. Model calculation intensity will be significant and could limit the timeliness of the model results. So while real-time monitoring won’t be possible yet, an RNN-based model could do a good job of alerting customer support personnel when and whom they should be contacting.