# Natural Language Processing
- **Week 1**
  Tokenized, sequenced and padded words to prepare a sentiment analysis. Every word is given an index. Words of a sentence are combined in a tokenized representation array. To unify the different sequence sizes we applied a padding of zeroes, such as allowing tensorflow to work with the texts. 
- **Week 2**
  Used embeddings and tensorflow datasets to make a sentiment analysis of IMDB reviews. Instead of using the previously applied *word-bag-model* we represent words in an n-dimensional vector space as so called *word embedding*. The weights of the word embedding can be visualized in a 3d projection using [tensorflow projector](https://projector.tensorflow.org).

In [2]:
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv \
    -O ./assets/bbc-text.csv \
    -nv



2019-11-07 11:41:09 URL:https://storage.googleapis.com/laurencemoroney-blog.appspot.com/bbc-text.csv [5057493/5057493] -> "./assets/bbc-text.csv" [1]


In [1]:
import csv
import tensorflow as tf
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
print(tf.__version__)

2.0.0


In [15]:
sentences = []
with open('./assets/bbc-text.csv', 'r') as f:
    csv_reader = csv.reader(f, delimiter=',')
    next(csv_reader)
    for row in csv_reader:
        sentences += row[1:]
print(len(sentences))
print(sentences[0])

2225
tv future in the hands of viewers with home theatre systems  plasma high-definition tvs  and digital video recorders moving into the living room  the way people watch tv will be radically different in five years  time.  that is according to an expert panel which gathered at the annual consumer electronics show in las vegas to discuss how these new technologies will impact one of our favourite pastimes. with the us leading the trend  programmes and other content will be delivered to viewers via home networks  through cable  satellite  telecoms companies  and broadband service providers to front rooms and portable devices.  one of the most talked-about technologies of ces has been digital and personal video recorders (dvr and pvr). these set-top boxes  like the us s tivo and the uk s sky+ system  allow people to record  store  play  pause and forward wind tv programmes when they want.  essentially  the technology allows for much more personalised tv. they are also being built-in to 

In [18]:
tokenizer = Tokenizer(oov_token='<OOV>')
tokens = tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

print(len(word_index))

29727


In [22]:
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)

[177 265   7 ...   0   0   0]
(2225, 4491)


## Week2

### import data

In [2]:
import tensorflow_datasets as tfds
imdb, info = tfds.load('imdb_reviews', with_info=True, as_supervised=True)

In [7]:
import numpy as np

train_data, test_data = imdb['train'], imdb['test']

training_sentences = []
training_labels = []

testing_sentences = []
testing_labels = []

for s, l in train_data:
    training_sentences.append(str(s.numpy()))
    training_labels.append(l.numpy())

for s, l in test_data:
    testing_sentences.append(str(s.numpy()))
    testing_labels.append(l.numpy())

training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)

### Hyperparameters

In [8]:
vocab_size = 10000
oov_token = '<OOV>'
max_length = 120
trunc_type = 'post'
embedding_dim = 16
num_epochs = 10

# Tokenization

In [9]:
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_token)
tokenizer.fit_on_texts(training_sentences)
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences,maxlen=max_length)

In [26]:
word_index = tokenizer.word_index
reverse_word_index = dict([(val, index) for (index, val) in word_index.items()])

### Model

In [10]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
    ])
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['acc'])
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 120, 16)           160000    
_________________________________________________________________
flatten_1 (Flatten)          (None, 1920)              0         
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 11526     
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 7         
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________


In [11]:
model.fit(padded,
          training_labels_final,
          epochs=num_epochs,
          validation_data=(testing_padded, testing_labels_final),
          verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/10


25000/25000 - 3s - loss: 0.4938 - acc: 0.7484 - val_loss: 0.3450 - val_acc: 0.8505


Epoch 2/10


25000/25000 - 2s - loss: 0.2421 - acc: 0.9066 - val_loss: 0.3832 - val_acc: 0.8330


Epoch 3/10


25000/25000 - 3s - loss: 0.0947 - acc: 0.9750 - val_loss: 0.4487 - val_acc: 0.8276


Epoch 4/10


25000/25000 - 3s - loss: 0.0273 - acc: 0.9967 - val_loss: 0.5176 - val_acc: 0.8275


Epoch 5/10


25000/25000 - 3s - loss: 0.0124 - acc: 0.9983 - val_loss: 0.5769 - val_acc: 0.8242


Epoch 6/10


25000/25000 - 3s - loss: 0.0060 - acc: 0.9992 - val_loss: 0.6316 - val_acc: 0.8214


Epoch 7/10


25000/25000 - 3s - loss: 0.0023 - acc: 0.9998 - val_loss: 0.7049 - val_acc: 0.8186


Epoch 8/10


25000/25000 - 3s - loss: 8.2560e-04 - acc: 1.0000 - val_loss: 0.7540 - val_acc: 0.8196


Epoch 9/10


25000/25000 - 3s - loss: 3.3887e-04 - acc: 1.0000 - val_loss: 0.7896 - val_acc: 0.8214


Epoch 10/10


25000/25000 - 2s - loss: 1.7957e-04 - acc: 1.0000 - val_loss: 0.8233 - val_acc: 0.8219


<tensorflow.python.keras.callbacks.History at 0x13d2daa50>

### Export - Get Embedding weights and write to file

In [32]:
e = model.layers[0]
weights = e.get_weights()[0]

import io

out_v = io.open('./assets/imdb_vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('./assets/imdb_meta.tsv', 'w', encoding='utf-8')

for i in range(1, vocab_size):
    embeddings = weights[i]
    word = reverse_word_index[i]
    out_m.write(word + '\n')
    out_v.write('\t'.join([str(x) for x in embeddings]) + '\n') 

In [31]:
print(len(weights), len(reverse_word_index))


10000 86539
