# Subword level representation


In this notebook, we preprocess the data to represent sentences at the subword level. The dataset is `ag_news`, the same one used in the `char-level-cnn` [project](https://github.com/BrambleXu/nlp-beginner-guide-keras/tree/master/char-level-cnn). I created [nlp-beginner-guide-keras](https://github.com/BrambleXu/nlp-beginner-guide-keras) to learn different techniques, so here we take a different approach to preprocessing: a subword-level representation instead of a character-level one.


## What is subword level representation



## Why use subword level representation


## How to preprocess 

For a detailed explanation of the preprocessing, see the [subword-preprocess](https://github.com/BrambleXu/nlp-beginner-guide-keras/blob/master/char-level-rnn/notebooks/subword-preprocess.ipynb) notebook.
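As a quick illustration of the text normalization done in the preprocessing cell (lowercasing, digit masking, URL masking), here is a minimal standalone sketch. The helper name `normalize_text` and the sample sentence are made up for illustration; the URL pattern is the one used in this notebook.

```python
import re

# Same URL pattern as in the preprocessing cell of this notebook
URL_REG = r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b'

def normalize_text(s):
    """Hypothetical helper: lowercase, mask digits, then mask URLs."""
    s = s.lower()
    s = re.sub(r'\d', '0', s)        # every digit becomes 0
    s = re.sub(URL_REG, '<url>', s)  # every URL becomes <url>
    return s

print(normalize_text("Oil hits $50; see http://example.com/a?b=1 now"))
# oil hits $00; see <url> now
```

Because digits are masked before URLs are replaced, any digits inside a URL are already `0` by the time the URL pattern runs; the pattern still matches since `0` is a word character.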

In [7]:
import sys, os
sys.path.append(os.pardir)
from data_helpers import BPE

In [18]:
#=======================All Preprocessing====================

# load data
import numpy as np
import pandas as pd
train_data_source = '../../char-level-cnn/data/ag_news_csv/train.csv'
test_data_source = '../../char-level-cnn/data/ag_news_csv/test.csv'
train_df = pd.read_csv(train_data_source, header=None)
test_df = pd.read_csv(test_data_source, header=None)

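# concatenate column 1 and column 2 as one text, then drop column 2
for df in [train_df, test_df]:
    df[1] = df[1] + df[2]
    # drop in place: `df = df.drop(...)` would only rebind the loop variable
    df.drop(columns=[2], inplace=True)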
    
# convert string to lower case 
train_texts = train_df[1].values 
train_texts = [s.lower() for s in train_texts]
test_texts = test_df[1].values 
test_texts = [s.lower() for s in test_texts]

# replace all digits with 0
import re
train_texts = [re.sub(r'\d', '0', s) for s in train_texts]
test_texts = [re.sub(r'\d', '0', s) for s in test_texts]

# replace all URLs with <url> 
url_reg  = r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b'
train_texts = [re.sub(url_reg, '<url>', s) for s in train_texts]
test_texts = [re.sub(url_reg, '<url>', s) for s in test_texts]

# Convert string to subword, this process may take several minutes
bpe = BPE("../pre-trained-model/en.wiki.bpe.op25000.vocab")
train_texts = [bpe.encode(s) for s in train_texts]
test_texts = [bpe.encode(s) for s in test_texts]

# Build vocab, {token: index}
vocab = {}
for i, token in enumerate(bpe.words):
    vocab[token] = i + 1
    
# Convert subword to index, function version 
def subword2index(texts, vocab):
    # the vocab file names the unknown token '<unk>' (see the vocab listing further down)
    unk_index = vocab['<unk>']
    sentences = []
    for s in texts:
        # map every subword to its index, falling back to the unknown token
        sentences.append([vocab.get(word, unk_index) for word in s.split()])
    return sentences

# Convert train and test 
train_sentences = subword2index(train_texts, vocab)
test_sentences = subword2index(test_texts, vocab)

# Padding
from keras.preprocessing.sequence import pad_sequences
train_data = pad_sequences(train_sentences, maxlen=1014, padding='post')
test_data = pad_sequences(test_sentences, maxlen=1014, padding='post')

# Convert to numpy array
train_data = np.array(train_data)
test_data = np.array(test_data)

#=======================Get classes================
train_classes = train_df[0].values
train_class_list = [x-1 for x in train_classes]
test_classes = test_df[0].values
test_class_list = [x-1 for x in test_classes]

from keras.utils import to_categorical
train_classes = to_categorical(train_class_list)
test_classes = to_categorical(test_class_list)

Using TensorFlow backend.
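To make the subword-to-index step concrete, here is a small sketch on a toy vocabulary. The six entries mirror the first tokens of the BPE vocab printed later in this notebook (index = position + 1, so that 0 stays free for padding); the function name `encode_pieces` and the piece `▁zzz` are assumptions for illustration only.

```python
# Toy vocabulary: first six BPE tokens, shifted by one so 0 is reserved for padding
toy_vocab = {'<unk>': 1, '<s>': 2, '</s>': 3, '▁t': 4, '▁a': 5, 'he': 6}

def encode_pieces(text, vocab, unk_token='<unk>'):
    """Hypothetical helper: map space-separated subwords to indices."""
    unk_index = vocab[unk_token]
    # pieces missing from the vocab fall back to the unknown-token index
    return [vocab.get(piece, unk_index) for piece in text.split()]

print(encode_pieces('▁t he ▁zzz', toy_vocab))
# [4, 6, 1]
```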


In [30]:
# from os.path import join, exists, split

# data_dir = '../preprocessed-data/dataset'
# train_x = 'train_data.npy'
# train_y = 'train_class.npy'
# test_x = 'test_data.npy'
# test_y = 'test_classes.npy'

# # np.save(join(data_dir, train_x), train_data) 
# np.savez(data_dir, x_train=train_data, y_train=train_classes, x_test=test_data, y_test=test_classes)
# # This file is too big, 519.6MB

# import numpy as np
# import pandas as pd
# x = pd.HDFStore("dataset.hdf")
# x.append("train_data", pd.DataFrame(train_data)) # <-- This will take a while.
# x.append("test_data", pd.DataFrame(test_data)) # <-- This will take a while.
# x.close()
# # This will also output a datafile bigger than 500MB

In [20]:
train_data

array([[ 5323,    68, 24904, ...,     0,     0,     0],
       [ 3226,    84,    51, ...,     0,     0,     0],
       [18658,    36,  6182, ...,     0,     0,     0],
       ...,
       [21745,    18,   313, ...,     0,     0,     0],
       [15235, 24915, 24889, ...,     0,     0,     0],
       [  591,   302,  2622, ...,     0,     0,     0]], dtype=int32)
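The trailing zeros in each row come from `padding='post'`. The sketch below imitates that behaviour in plain Python so the effect is easy to see on tiny inputs. It is not the Keras implementation: `pad_post` is a made-up stand-in, and it assumes the default `truncating='pre'` of `pad_sequences`, which, as far as I know, drops tokens from the start of sequences longer than `maxlen`.

```python
def pad_post(sequences, maxlen, value=0):
    """Hypothetical stand-in: pad at the end, truncate at the start."""
    padded = []
    for seq in sequences:
        trimmed = list(seq[-maxlen:])  # keep only the last `maxlen` items
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

print(pad_post([[5, 6], [1, 2, 3, 4]], maxlen=3))
# [[5, 6, 0], [2, 3, 4]]
```

If the beginning of a news item matters more than its end, passing `truncating='post'` to `pad_sequences` may be worth trying.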

## Embedding layer weights

To initialize the embedding layer with pre-trained weights, we first load the subword embeddings.

In [38]:
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("../pre-trained-model/en.wiki.bpe.op25000.d50.w2v.bin", binary=True)

In [39]:
len(vocab)

25000

In [52]:
for i, subword in enumerate(vocab):
    print(subword)
    if i > 4:
        break

<unk>
<s>
</s>
▁t
▁a
he


In [40]:
model['in']

array([-0.399702, -0.769862, -0.06641 , -0.211852, -0.359098, -0.055825,
        0.4286  , -0.256576, -0.086343,  0.406772, -0.072157, -0.174386,
        0.398903, -0.040825, -0.155359,  0.048774, -0.238695,  0.024354,
       -0.347787,  0.081793,  0.141403,  0.08835 , -0.070075,  0.110401,
        0.003846, -0.265394,  0.724276, -0.523481, -0.162674,  0.147213,
       -0.209789, -0.132434, -0.067623,  0.691781,  0.421201, -0.047779,
        0.397612, -0.279393, -0.967681,  0.55612 , -0.042962, -0.3673  ,
        0.314757,  0.114486, -0.278512, -0.042936, -0.020144,  0.100965,
        0.181277,  0.040286], dtype=float32)

In [46]:
model['<unk>']

array([-0.047345, -0.813617, -0.402143,  0.163767,  3.029769,  0.466452,
       -0.859536,  0.912698,  0.513252, -0.082041,  1.04137 ,  1.15992 ,
        0.183564,  0.32676 , -0.983799, -0.744597,  0.547359,  0.341305,
        0.239759,  0.953342, -0.474623, -1.014153,  0.780751, -0.970756,
       -0.436472,  0.998653, -1.763717,  0.156439, -0.411622,  0.544716,
       -0.902719, -0.825915,  0.549098, -0.080528, -1.215276, -0.113391,
       -0.735994, -0.501781,  1.573995, -0.817193,  0.087332,  0.090806,
        0.293357, -0.444164,  0.192026, -0.580188,  0.51405 , -0.857277,
        1.569506,  0.143075], dtype=float32)

In [47]:
embedding_dim = 50
embedding_weights = np.zeros((len(vocab) + 1, embedding_dim)) # (25001, 50)

for subword, i in vocab.items():
    # `in` works across gensim versions (`model.vocab` was removed in gensim 4)
    if subword in model:
        embedding_weights[i] = model[subword]
    else:
        print(subword) # print the subword in vocab but not in model; its row stays all zeros

<s>
</s>
▁distric
ptember
bruary
▁performan
orporated
▁headqu
▁attem
▁mathem
▁passeng
uguese
▁azerbai
▁compris
urday
▁emplo
▁portra
▁thous
▁lithu
▁leban
▁councill
▁specim
▁molec
▁entrepren
▁predecess
▁glouc
▁earthqu
▁istan
imination
▁infloresc
▁ingred
chiidae
▁sofl
ürttemberg
▁practition
echua
eteries
bridgeshire
▁nudi
rzys
tokrzys
uchestan
▁taekw
kopol
giluyeh
▁fute
ivisie
marthen
▁gillesp
aziland
scray
alandhar
azulu
alisco


In [53]:
print(len(vocab))
print(embedding_weights.shape)

25000
(25001, 50)


In [60]:
from keras.layers import Embedding

# parameter 
input_size = 1014
embedding_size = 50

num_of_classes = 4
dropout_p = 0.5
optimizer = 'adam'
loss = 'categorical_crossentropy'

embedding_layer = Embedding(len(vocab)+1,
                            embedding_size,
                            weights=[embedding_weights],
                            input_length=input_size)


## Model Construction

In [68]:
from keras.layers import Input, Embedding, Dense, Flatten
from keras.layers import LSTM, Dropout
from keras.models import Model

In [71]:
inputs = Input(shape=(input_size,))
embedded_sequence = embedding_layer(inputs)
x = LSTM(256, return_sequences=True, activation='relu')(embedded_sequence)
x = LSTM(256, return_sequences=True, activation='relu')(x)
x = Flatten()(x)
x = Dense(1024, activation='relu')(x)
x = Dropout(dropout_p)(x)
x = Dense(1024, activation='relu')(x)
x = Dropout(dropout_p)(x)
prediction = Dense(num_of_classes, activation='softmax')(x)

model = Model(inputs=inputs, outputs=prediction)
model.compile(optimizer=optimizer,  # 'adam', defined with the other parameters
              loss=loss,            # 'categorical_crossentropy'
              metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_6 (InputLayer)         (None, 1014)              0         
_________________________________________________________________
embedding_2 (Embedding)      (None, 1014, 50)          1250050   
_________________________________________________________________
lstm_11 (LSTM)               (None, 1014, 256)         314368    
_________________________________________________________________
lstm_12 (LSTM)               (None, 1014, 256)         525312    
_________________________________________________________________
flatten_2 (Flatten)          (None, 259584)            0         
_________________________________________________________________
dense_7 (Dense)              (None, 1024)              265815040 
_________________________________________________________________
dropout_5 (Dropout)          (None, 1024)              0         
__________
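The parameter counts in the summary can be reproduced by hand, which also shows where most of the weights are. The sketch below uses the standard LSTM count (four gates, each with input weights, recurrent weights and a bias); the variable names are my own.

```python
vocab_rows, embedding_dim = 25001, 50
seq_len, lstm_units, dense_units = 1014, 256, 1024

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias
    return 4 * ((input_dim + units) * units + units)

embedding = vocab_rows * embedding_dim
lstm_1 = lstm_params(embedding_dim, lstm_units)
lstm_2 = lstm_params(lstm_units, lstm_units)
flattened = seq_len * lstm_units               # output size of Flatten
dense_1 = flattened * dense_units + dense_units

print(embedding, lstm_1, lstm_2, flattened, dense_1)
# 1250050 314368 525312 259584 265815040
```

Almost all of the parameters sit in the first `Dense` layer, because `Flatten` turns every time step of the second LSTM into input features. Returning only the last LSTM state (`return_sequences=False` on the second LSTM) would shrink that layer considerably, at the cost of changing the architecture.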

## Train the model

In [74]:
# prepare the data 
indices = np.arange(train_data.shape[0])
np.random.shuffle(indices)

x_train = train_data[indices][:1000]
y_train = train_classes[indices][:1000]

x_test = test_data[:100]
y_test = test_classes[:100]

# training
model.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          batch_size=128,
          epochs=1,
          verbose=1)

Train on 1000 samples, validate on 100 samples
Epoch 1/1


<keras.callbacks.History at 0x1aaaf76b00>