# Natural Language Processing
- **Week 1**
  Tokenized, sequenced and padded words to prepare a sentiment analysis. Every word is given an index. Words of a sentence are combined in a tokenized representation array. To unify the different sequence sizes we applied a padding of zeroes, such as allowing tensorflow to work with the texts. 
- **Week 2**
  Used embeddings and tensorflow datasets to make a sentiment analysis of IMDB reviews. Instead of using the previously applied *word-bag-model* we represent words in an n-dimensional vector space as so called *word embedding*. The weights of the word embedding can be visualized in a 3d projection using [tensorflow projector](https://projector.tensorflow.org).
- **Week 3**
  Applied a *Recurrent Neural Network* (RNN) and used *long short-term memory* (LSTM) besides *gated recurrent unit* (GRU). RNN can be applied to sequences as sentences. Here previously seen words can have an impact of the meaning of the sentence. RNN permit us to transfer this information to later calculations in the sequence. LSTM and GRU further introduces a internal memory state to deal with long term temporal dependencies.

In [2]:
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv \
    -O ./assets/bbc-text.csv \
    -nv



2019-11-07 11:41:09 URL:https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv [5057493/5057493] -> "./assets/bbc-text.csv" [1]


In [1]:
import csv
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
print(tf.__version__)

2.0.0


In [15]:
sentences = []
with open('./assets/bbc-text.csv', 'r') as f:
    csv_reader = csv.reader(f, delimiter=',')
    next(csv_reader)
    for row in csv_reader:
        sentences += row[1:]
print(len(sentences))
print(sentences[0])

2225
tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to 

In [18]:
tokenizer = Tokenizer(oov_token='<OOV>')
tokens = tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

print(len(word_index))

29727


In [22]:
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)

[177 265   7 ...   0   0   0]
(2225, 4491)


## Week2

### import data

In [2]:
import tensorflow_datasets as tfds
imdb, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)

In [7]:
import numpy as np

train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []

testing_sentences = []
testing_labels = []

for s, l in train_data:
    training_sentences.append(str(s.numpy()))
    training_labels.append(l.numpy())

for s, l in test_data:
    testing_sentences.append(str(s.numpy()))
    testing_labels.append(l.numpy())

training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)

### Hyperparameters

In [8]:
vocab_size = 10000
oov_token = '<OOV>'
max_length = 120
trunc_type = 'post'
embedding_dim = 16
num_epochs = 10

# Tokenization

In [9]:
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(training_sentences)
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences,maxlen=max_length)

In [26]:
word_index = tokenizer.word_index
reverse_word_index = dict([(val, index) for (index, val) in word_index.items()])

### Model

In [10]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['acc'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 120, 16)           160000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 1920)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 11526     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 7         
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________


In [11]:
model.fit(padded,
          training_labels_final,
          epochs=num_epochs,
          validation_data=(testing_padded, testing_labels_final),
          verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10


25000/25000 - 3s - loss: 0.4938 - acc: 0.7484 - val_loss: 0.3450 - val_acc: 0.8505


Epoch 2/10


25000/25000 - 2s - loss: 0.2421 - acc: 0.9066 - val_loss: 0.3832 - val_acc: 0.8330


Epoch 3/10


25000/25000 - 3s - loss: 0.0947 - acc: 0.9750 - val_loss: 0.4487 - val_acc: 0.8276


Epoch 4/10


25000/25000 - 3s - loss: 0.0273 - acc: 0.9967 - val_loss: 0.5176 - val_acc: 0.8275


Epoch 5/10


25000/25000 - 3s - loss: 0.0124 - acc: 0.9983 - val_loss: 0.5769 - val_acc: 0.8242


Epoch 6/10


25000/25000 - 3s - loss: 0.0060 - acc: 0.9992 - val_loss: 0.6316 - val_acc: 0.8214


Epoch 7/10


25000/25000 - 3s - loss: 0.0023 - acc: 0.9998 - val_loss: 0.7049 - val_acc: 0.8186


Epoch 8/10


25000/25000 - 3s - loss: 8.2560e-04 - acc: 1.0000 - val_loss: 0.7540 - val_acc: 0.8196


Epoch 9/10


25000/25000 - 3s - loss: 3.3887e-04 - acc: 1.0000 - val_loss: 0.7896 - val_acc: 0.8214


Epoch 10/10


25000/25000 - 2s - loss: 1.7957e-04 - acc: 1.0000 - val_loss: 0.8233 - val_acc: 0.8219


<tensorflow.python.keras.callbacks.History at 0x13d2daa50>

### Export - Get Embedding weights and write to file

In [32]:
e = model.layers[0]
weights = e.get_weights()[0]

import io

out_v = io.open('./assets/imdb_vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('./assets/imdb_meta.tsv', 'w', encoding='utf-8')

for i in range(1, vocab_size):
    embeddings = weights[i]
    word = reverse_word_index[i]
    out_m.write(word + '\n')
    out_v.write('\t'.join([str(x) for x in embeddings]) + '\n') 

## Week 3 - RNN: Recurrent Neural Networks

In [36]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64), return_sequences=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    # tf.keras.layers.Bidirectional(tf.keras.layers.GRU(32)),
    # tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['acc'])
model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 120, 16)           160000    
_________________________________________________________________
bidirectional (Bidirectional (None, 128)               41472     
_________________________________________________________________
flatten_2 (Flatten)          (None, 128)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 6)                 774       
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 7         
Total params: 202,253
Trainable params: 202,253
Non-trainable params: 0
_________________________________________________________________


In [56]:
embedding_dim = 100
max_length = 16
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'
training_size = 160000
test_portion = .1

In [39]:
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/training_cleaned.csv \
    -O .assets/training_cleaned.csv



--2019-11-08 16:00:20--  https://storage.googleapis.com/laurencemoroney-blog.appspot.com/training_cleaned.csv
Resolving storage.googleapis.com (storage.googleapis.com)... 

216.58.211.48
Connecting to storage.googleapis.com (storage.googleapis.com)|216.58.211.48|:443... connected.


HTTP request sent, awaiting response... 

200 OK
Length: 238942690 (228M) [application/octet-stream]
Saving to: ‘/tmp/training_cleaned.csv’

          /tmp/trai   0%[                    ]       0  --.-KB/s               

         /tmp/train   0%[                    ] 216.23K  1012KB/s               

        /tmp/traini   0%[                    ] 572.58K  1.32MB/s               

       /tmp/trainin   0%[                    ]   1.12M  1.79MB/s               

      /tmp/training   0%[                    ]   1.63M  1.98MB/s               

     /tmp/training_   0%[                    ]   2.18M  2.08MB/s               

    /tmp/training_c   1%[                    ]   2.66M  2.14MB/s               

   /tmp/training_cl   1%[                    ]   3.41M  2.35MB/s               

  /tmp/training_cle   1%[                    ]   4.02M  2.39MB/s               

 /tmp/training_clea   1%[                    ]   4.46M  2.35MB/s               

/tmp/training_clean   2%[                    ]   4.93M  2.34MB/s               

tmp/training_cleane   2%[                    ]   5.51M  2.38MB/s               

mp/training_cleaned   2%[                    ]   6.01M  2.37MB/s               

p/training_cleaned.   2%[                    ]   6.22M  2.27MB/s               

/training_cleaned.c   2%[                    ]   6.43M  2.18MB/s               

training_cleaned.cs   3%[                    ]   6.88M  2.17MB/s    eta 1m 42s 

raining_cleaned.csv   3%[                    ]   7.38M  2.18MB/s    eta 1m 42s 

aining_cleaned.csv    3%[                    ]   7.97M  2.30MB/s    eta 1m 42s 

ining_cleaned.csv     3%[                    ]   8.58M  2.35MB/s    eta 1m 42s 

ning_cleaned.csv      3%[                    ]   8.94M  2.30MB/s    eta 1m 42s 

ing_cleaned.csv       4%[                    ]   9.47M  2.27MB/s    eta 98s    

ng_cleaned.csv        4%[                    ]   9.99M  2.27MB/s    eta 98s    

g_cleaned.csv         4%[                    ]  10.24M  2.21MB/s    eta 98s    

_cleaned.csv          4%[                    ]  10.60M  2.12MB/s    eta 98s    

cleaned.csv           4%[                    ]  10.90M  2.00MB/s    eta 98s    

leaned.csv            5%[>                   ]  11.41M  2.05MB/s    eta 1m 40s 

eaned.csv             5%[>                   ]  11.88M  2.04MB/s    eta 1m 40s 

aned.csv              5%[>                   ]  12.66M  2.08MB/s    eta 1m 40s 

ned.csv               5%[>                   ]  13.13M  2.12MB/s    eta 1m 40s 

ed.csv                6%[>                   ]  13.77M  2.25MB/s    eta 1m 40s 

d.csv                 6%[>                   ]  14.41M  2.36MB/s    eta 94s    

.csv                  6%[>                   ]  15.01M  2.38MB/s    eta 94s    

csv                   6%[>                   ]  15.52M  2.38MB/s    eta 94s    

sv                    7%[>                   ]  16.16M  2.40MB/s    eta 94s    

v                     7%[>                   ]  16.62M  2.44MB/s    eta 94s    

                      7%[>                   ]  17.30M  2.51MB/s    eta 90s    

                  /   7%[>                   ]  17.76M  2.51MB/s    eta 90s    

                 /t   7%[>                   ]  18.19M  2.48MB/s    eta 90s    

                /tm   8%[>                   ]  18.65M  2.51MB/s    eta 90s    

               /tmp   8%[>                   ]  19.29M  2.66MB/s    eta 90s    

              /tmp/   8%[>                   ]  19.77M  2.67MB/s    eta 89s    

             /tmp/t   8%[>                   ]  20.13M  2.58MB/s    eta 89s    

            /tmp/tr   9%[>                   ]  20.71M  2.62MB/s    eta 89s    

           /tmp/tra   9%[>                   ]  21.37M  2.58MB/s    eta 89s    

          /tmp/trai   9%[>                   ]  21.82M  2.58MB/s    eta 89s    

         /tmp/train   9%[>                   ]  22.13M  2.41MB/s    eta 89s    

        /tmp/traini   9%[>                   ]  22.22M  2.29MB/s    eta 89s    

       /tmp/trainin   9%[>                   ]  22.60M  2.22MB/s    eta 89s    

      /tmp/training  10%[=>                  ]  22.91M  2.11MB/s    eta 89s    

     /tmp/training_  10%[=>                  ]  23.43M  2.12MB/s    eta 89s    

    /tmp/training_c  10%[=>                  ]  23.85M  2.11MB/s    eta 90s    

   /tmp/training_cl  10%[=>                  ]  24.40M  2.11MB/s    eta 90s    

  /tmp/training_cle  10%[=>                  ]  24.87M  2.11MB/s    eta 90s    

 /tmp/training_clea  11%[=>                  ]  25.43M  2.18MB/s    eta 90s    

/tmp/training_clean  11%[=>                  ]  26.01M  2.17MB/s    eta 90s    

tmp/training_cleane  11%[=>                  ]  26.76M  2.22MB/s    eta 87s    

mp/training_cleaned  12%[=>                  ]  27.51M  2.32MB/s    eta 87s    

p/training_cleaned.  12%[=>                  ]  28.15M  2.39MB/s    eta 87s    

/training_cleaned.c  12%[=>                  ]  28.76M  2.36MB/s    eta 87s    

training_cleaned.cs  12%[=>                  ]  29.33M  2.43MB/s    eta 87s    

raining_cleaned.csv  13%[=>                  ]  30.05M  2.45MB/s    eta 83s    

aining_cleaned.csv   13%[=>                  ]  30.65M  2.69MB/s    eta 83s    

ining_cleaned.csv    13%[=>                  ]  31.22M  2.73MB/s    eta 83s    

ning_cleaned.csv     13%[=>                  ]  31.79M  2.79MB/s    eta 83s    

ing_cleaned.csv      14%[=>                  ]  32.51M  2.91MB/s    eta 83s    

ng_cleaned.csv       14%[=>                  ]  33.16M  3.00MB/s    eta 80s    

g_cleaned.csv        14%[=>                  ]  33.69M  2.99MB/s    eta 80s    

_cleaned.csv         15%[==>                 ]  34.32M  2.98MB/s    eta 80s    

cleaned.csv          15%[==>                 ]  35.01M  3.04MB/s    eta 80s    

leaned.csv           15%[==>                 ]  35.68M  3.09MB/s    eta 80s    

eaned.csv            15%[==>                 ]  36.27M  3.05MB/s    eta 78s    

aned.csv             16%[==>                 ]  36.80M  3.00MB/s    eta 78s    

ned.csv              16%[==>                 ]  37.41M  2.95MB/s    eta 78s    

ed.csv               16%[==>                 ]  38.30M  3.08MB/s    eta 78s    

d.csv                17%[==>                 ]  38.99M  3.11MB/s    eta 78s    

.csv                 17%[==>                 ]  39.68M  3.12MB/s    eta 75s    

csv                  17%[==>                 ]  40.29M  3.06MB/s    eta 75s    

sv                   17%[==>                 ]  40.80M  3.07MB/s    eta 75s    

v                    18%[==>                 ]  41.47M  3.16MB/s    eta 75s    

                     18%[==>                 ]  42.26M  3.16MB/s    eta 75s    

                  /  18%[==>                 ]  42.96M  3.19MB/s    eta 72s    

                 /t  19%[==>                 ]  43.33M  3.11MB/s    eta 72s    

                /tm  19%[==>                 ]  43.94M  3.13MB/s    eta 72s    

               /tmp  19%[==>                 ]  44.49M  3.08MB/s    eta 72s    

              /tmp/  19%[==>                 ]  44.96M  3.02MB/s    eta 72s    

             /tmp/t  20%[===>                ]  45.65M  3.05MB/s    eta 71s    

            /tmp/tr  20%[===>                ]  46.26M  3.04MB/s    eta 71s    

           /tmp/tra  20%[===>                ]  46.94M  3.07MB/s    eta 71s    

          /tmp/trai  20%[===>                ]  47.44M  2.98MB/s    eta 71s    

         /tmp/train  21%[===>                ]  47.87M  2.93MB/s    eta 71s    

        /tmp/traini  21%[===>                ]  48.51M  2.88MB/s    eta 70s    

       /tmp/trainin  21%[===>                ]  49.19M  2.88MB/s    eta 70s    

      /tmp/training  21%[===>                ]  49.83M  2.91MB/s    eta 70s    

     /tmp/training_  22%[===>                ]  50.37M  2.94MB/s    eta 70s    

    /tmp/training_c  22%[===>                ]  51.01M  2.89MB/s    eta 70s    

   /tmp/training_cl  22%[===>                ]  51.52M  2.81MB/s    eta 68s    

  /tmp/training_cle  22%[===>                ]  52.32M  2.84MB/s    eta 68s    

 /tmp/training_clea  23%[===>                ]  52.87M  2.91MB/s    eta 68s    

/tmp/training_clean  23%[===>                ]  53.54M  2.94MB/s    eta 68s    

tmp/training_cleane  23%[===>                ]  53.80M  2.81MB/s    eta 68s    

mp/training_cleaned  23%[===>                ]  54.10M  2.76MB/s    eta 67s    

p/training_cleaned.  23%[===>                ]  54.58M  2.66MB/s    eta 67s    

/training_cleaned.c  24%[===>                ]  54.99M  2.64MB/s    eta 67s    

training_cleaned.cs  24%[===>                ]  55.63M  2.63MB/s    eta 67s    

raining_cleaned.csv  24%[===>                ]  56.29M  2.71MB/s    eta 67s    

aining_cleaned.csv   24%[===>                ]  56.87M  2.72MB/s    eta 66s    

ining_cleaned.csv    25%[====>               ]  57.46M  2.67MB/s    eta 66s    

ning_cleaned.csv     25%[====>               ]  58.15M  2.67MB/s    eta 66s    

ing_cleaned.csv      25%[====>               ]  58.65M  2.65MB/s    eta 66s    

ng_cleaned.csv       26%[====>               ]  59.43M  2.72MB/s    eta 66s    

g_cleaned.csv        26%[====>               ]  60.12M  2.74MB/s    eta 64s    

_cleaned.csv         26%[====>               ]  60.37M  2.63MB/s    eta 64s    

cleaned.csv          26%[====>               ]  60.87M  2.58MB/s    eta 64s    

leaned.csv           26%[====>               ]  61.12M  2.41MB/s    eta 64s    

eaned.csv            27%[====>               ]  61.96M  2.51MB/s    eta 64s    

aned.csv             27%[====>               ]  62.30M  2.60MB/s    eta 64s    

ned.csv              27%[====>               ]  62.66M  2.57MB/s    eta 64s    

ed.csv               27%[====>               ]  62.97M  2.55MB/s    eta 64s    

d.csv                27%[====>               ]  63.22M  2.45MB/s    eta 64s    

.csv                 27%[====>               ]  63.59M  2.38MB/s    eta 64s    

csv                  28%[====>               ]  63.97M  2.29MB/s    eta 64s    

sv                   28%[====>               ]  64.18M  2.14MB/s    eta 64s    

v                    28%[====>               ]  64.62M  2.14MB/s    eta 64s    

                     28%[====>               ]  65.13M  2.11MB/s    eta 64s    

                  /  28%[====>               ]  65.62M  2.04MB/s    eta 64s    

                 /t  28%[====>               ]  66.04M  1.96MB/s    eta 64s    

                /tm  29%[====>               ]  66.49M  1.94MB/s    eta 64s    

               /tmp  29%[====>               ]  66.80M  1.94MB/s    eta 64s    

              /tmp/  29%[====>               ]  67.33M  1.94MB/s    eta 64s    

             /tmp/t  29%[====>               ]  67.72M  2.00MB/s    eta 64s    

            /tmp/tr  29%[====>               ]  67.96M  1.83MB/s    eta 64s    

           /tmp/tra  30%[=====>              ]  68.46M  1.84MB/s    eta 64s    

          /tmp/trai  30%[=====>              ]  68.77M  1.88MB/s    eta 64s    

         /tmp/train  30%[=====>              ]  68.94M  1.81MB/s    eta 64s    

        /tmp/traini  30%[=====>              ]  69.33M  1.86MB/s    eta 64s    

       /tmp/trainin  30%[=====>              ]  70.04M  1.95MB/s    eta 64s    

      /tmp/training  30%[=====>              ]  70.60M  2.00MB/s    eta 64s    

     /tmp/training_  31%[=====>              ]  71.05M  2.07MB/s    eta 64s    

    /tmp/training_c  31%[=====>              ]  71.32M  2.01MB/s    eta 64s    

   /tmp/training_cl  31%[=====>              ]  71.74M  1.97MB/s    eta 64s    

  /tmp/training_cle  31%[=====>              ]  72.07M  1.93MB/s    eta 63s    

 /tmp/training_clea  31%[=====>              ]  72.57M  1.91MB/s    eta 63s    

/tmp/training_clean  32%[=====>              ]  72.93M  1.90MB/s    eta 63s    

tmp/training_cleane  32%[=====>              ]  73.32M  1.86MB/s    eta 63s    

mp/training_cleaned  32%[=====>              ]  73.69M  1.82MB/s    eta 63s    

p/training_cleaned.  32%[=====>              ]  74.07M  1.87MB/s    eta 63s    

/training_cleaned.c  32%[=====>              ]  74.43M  1.85MB/s    eta 63s    

training_cleaned.cs  32%[=====>              ]  74.62M  1.78MB/s    eta 63s    

raining_cleaned.csv  32%[=====>              ]  74.93M  1.80MB/s    eta 63s    

aining_cleaned.csv   32%[=====>              ]  75.05M  1.72MB/s    eta 63s    

ining_cleaned.csv    33%[=====>              ]  75.46M  1.56MB/s    eta 64s    

ning_cleaned.csv     33%[=====>              ]  75.49M  1.41MB/s    eta 64s    

ing_cleaned.csv      33%[=====>              ]  75.93M  1.41MB/s    eta 64s    

ng_cleaned.csv       33%[=====>              ]  76.16M  1.10MB/s    eta 66s    

g_cleaned.csv        33%[=====>              ]  76.32M  1015KB/s    eta 66s    

_cleaned.csv         33%[=====>              ]  76.47M   903KB/s    eta 66s    

cleaned.csv          33%[=====>              ]  76.54M   723KB/s    eta 69s    

leaned.csv           33%[=====>              ]  76.58M   681KB/s    eta 69s    

eaned.csv            33%[=====>              ]  76.63M   560KB/s    eta 69s    

aned.csv             33%[=====>              ]  76.71M   538KB/s    eta 71s    

ned.csv              33%[=====>              ]  76.76M   472KB/s    eta 71s    

ed.csv               33%[=====>              ]  76.89M   438KB/s    eta 71s    

d.csv                33%[=====>              ]  77.05M   433KB/s    eta 71s    

.csv                 33%[=====>              ]  77.22M   411KB/s    eta 71s    

csv                  33%[=====>              ]  77.44M   436KB/s    eta 72s    

sv                   34%[=====>              ]  77.74M   425KB/s    eta 72s    

v                    34%[=====>              ]  78.12M   464KB/s    eta 72s    

                     34%[=====>              ]  78.52M   479KB/s    eta 72s    

                  /  34%[=====>              ]  78.90M   607KB/s    eta 72s    

                 /t  34%[=====>              ]  79.40M   732KB/s    eta 71s    
















































































































































































































































































































































































































































































































































































































































2019-11-08 16:02:05 (2.19 MB/s) - ‘/tmp/training_cleaned.csv’ saved [238942690/238942690]



In [82]:
import csv
corpus = []

with open('./assets/training_cleaned.csv', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        list_item = [row[5]]
        label = row[0]
        if label == '0':
            list_item.append(0)
        else:
            list_item.append(1)
        corpus.append(list_item) 

In [102]:
import random
random.shuffle(corpus)

labels = []
sentences = []
for x in range(1, training_size):
    sentences.append(corpus[x][0])
    labels.append(corpus[x][1])

tokenizer = Tokenizer(oov_token=oov_tok)
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
vocab_size = len(word_index)
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=max_length, padding=padding_type, truncating=trunc_type)

split = int(test_portion * training_size)

test_sequences = padded[:split]
test_labels = np.array(labels[:split])
training_sequences = padded[split:]
training_labels = np.array(labels[split:])

## Download and prepare GloVe: Global Vectors for Word Representation

In [86]:
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/glove.6B.100d.txt \
    -O ./assets/glove.6B.100d.txt \
    -nv



2019-11-08 16:42:22 URL:https://storage.googleapis.com/laurencemoroney-blog.appspot.com/glove.6B.100d.txt [347116733/347116733] -> "./assets/glove.6B.100d.txt" [1]


In [89]:
embeddings_index = {};
with open('./assets/glove.6B.100d.txt') as f:
    for line in f:
        values = line.split();
        word = values[0];
        coefs = np.asarray(values[1:], dtype='float32');
        embeddings_index[word] = coefs;

embeddings_matrix = np.zeros((vocab_size+1, embedding_dim));
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word);
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector; 

In [96]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length, weights=[embeddings_matrix], trainable=False),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_11 (Embedding)     (None, 16, 100)           13844600  
_________________________________________________________________
dropout_4 (Dropout)          (None, 16, 100)           0         
_________________________________________________________________
bidirectional_7 (Bidirection (None, 16, 128)           84480     
_________________________________________________________________
bidirectional_8 (Bidirection (None, 64)                41216     
_________________________________________________________________
flatten_5 (Flatten)          (None, 64)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 64)                4160      
_________________________________________________________________
dense_7 (Dense)              (None, 1)                

In [101]:
print(type(training_labels))
training_labels_final = np.array(training_labels)
print(type(training_labels_final))

<class 'list'>
<class 'numpy.ndarray'>


In [103]:
num_epochs = 10
history = model.fit(training_sequences, training_labels, epochs=num_epochs, verbose=2)

Train on 143999 samples
Epoch 1/10


InvalidArgumentError:  indices[10,0] = 138460 is not in [0, 138446)
	 [[node sequential_3/embedding_11/embedding_lookup (defined at /usr/local/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1751) ]] [Op:__inference_distributed_function_227574]

Function call stack:
distributed_function
