# Subword level representation


In this notebook, we preprocess the data to represent sentences at the subword level. The data set is `ag_news`, the same one used in the `char-level-cnn` [project](https://github.com/BrambleXu/nlp-beginner-guide-keras/tree/master/char-level-cnn). I created [nlp-beginner-guide-keras](https://github.com/BrambleXu/nlp-beginner-guide-keras) to learn different techniques, so here we take a different approach to preprocessing: a subword-level representation instead of a character-level one.


## What is subword level representation



## Why use subword level representation


## How to preprocess 

For a detailed explanation of the preprocessing, see the notebook [subword-preprocess](https://github.com/BrambleXu/nlp-beginner-guide-keras/blob/master/char-level-rnn/notebooks/subword-preprocess.ipynb).

In [1]:
import sys, os
sys.path.append(os.pardir)
from data_helpers import BPE

In [2]:
#=======================All Preprocessing====================

# load data
import numpy as np
import pandas as pd
train_data_source = '../../char-level-cnn/data/ag_news_csv/train.csv'
test_data_source = '../../char-level-cnn/data/ag_news_csv/test.csv'
train_df = pd.read_csv(train_data_source, header=None)
test_df = pd.read_csv(test_data_source, header=None)

# concatenate column 1 and column 2 as one text
for df in [train_df, test_df]:
    df[1] = df[1] + df[2]
    # drop in place; `df = df.drop(...)` would only rebind the loop variable
    df.drop([2], axis=1, inplace=True)
    
# convert string to lower case 
train_texts = train_df[1].values 
train_texts = [s.lower() for s in train_texts]
test_texts = test_df[1].values 
test_texts = [s.lower() for s in test_texts]

# replace all digits with 0
import re
train_texts = [re.sub(r'\d', '0', s) for s in train_texts]
test_texts = [re.sub(r'\d', '0', s) for s in test_texts]

# replace all URLs with <url> 
url_reg  = r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b'
train_texts = [re.sub(url_reg, '<url>', s) for s in train_texts]
test_texts = [re.sub(url_reg, '<url>', s) for s in test_texts]

# Convert string to subword, this process may take several minutes
bpe = BPE("../pre-trained-model/en.wiki.bpe.op25000.vocab")
train_texts = [bpe.encode(s) for s in train_texts]
test_texts = [bpe.encode(s) for s in test_texts]

# Build vocab, {token: index}
vocab = {}
for i, token in enumerate(bpe.words):
    vocab[token] = i + 1
    
# Convert subword to index, function version 
def subword2index(texts, vocab):
    sentences = []
    for s in texts:
        s = s.split()
        one_line = []
        for word in s:
            if word not in vocab:
                one_line.append(vocab['unk'])
            else:
                one_line.append(vocab[word])
        sentences.append(one_line)
    return sentences

# Convert train and test 
train_sentences = subword2index(train_texts, vocab)
test_sentences = subword2index(test_texts, vocab)

# Padding
from keras.preprocessing.sequence import pad_sequences
train_data = pad_sequences(train_sentences, maxlen=1014, padding='post')
test_data = pad_sequences(test_sentences, maxlen=1014, padding='post')

# pad_sequences already returns NumPy arrays, so this is only a defensive copy
train_data = np.array(train_data)
test_data = np.array(test_data)

#=======================Get classes================
train_classes = train_df[0].values
train_class_list = [x-1 for x in train_classes]
test_classes = test_df[0].values
test_class_list = [x-1 for x in test_classes]

from keras.utils import to_categorical
train_classes = to_categorical(train_class_list)
test_classes = to_categorical(test_class_list)

Using TensorFlow backend.
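To make the text normalization steps easier to follow, here is a minimal sketch that applies them to a single made-up sentence. The `normalize` helper and the sample string are hypothetical and only for illustration; the regular expressions are the ones used in the cell above.

```python
import re

# Same URL pattern as in the preprocessing cell.
URL_REG = r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b'

def normalize(text):
    """Lower-case, replace digits with 0, then replace URLs (hypothetical helper)."""
    text = text.lower()
    text = re.sub(r'\d', '0', text)
    return re.sub(URL_REG, '<url>', text)

# Made-up sample sentence, not taken from ag_news.
sample = "Oil hits $55 in 2004, see http://example.com/a?b=1 for details"
normalized = normalize(sample)
print(normalized)
```

Note that digits are replaced before URLs, so any digits inside a URL are already zeros by the time the URL pattern runs.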


In [3]:
train_data

array([[ 5323,    68, 24904, ...,     0,     0,     0],
       [ 3226,    84,    51, ...,     0,     0,     0],
       [18658,    36,  6182, ...,     0,     0,     0],
       ...,
       [21745,    18,   313, ...,     0,     0,     0],
       [15235, 24915, 24889, ...,     0,     0,     0],
       [  591,   302,  2622, ...,     0,     0,     0]], dtype=int32)
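The trailing zeros in the array above come from padding. The following sketch reproduces the index lookup and padding on a toy vocabulary so the effect is visible on short sequences. The toy vocabulary and the `pad_post` helper are assumptions made for illustration; `pad_post` is a plain NumPy stand-in that mimics `pad_sequences(..., padding='post')` with its default front truncation, not the Keras function itself.

```python
import numpy as np

# Toy vocabulary built the same way as above: indices start at 1,
# which leaves index 0 free for padding.
toy_words = ["unk", "sub", "word"]
toy_vocab = {token: i + 1 for i, token in enumerate(toy_words)}

def to_indices(texts, vocab):
    """Map whitespace-separated tokens to indices, falling back to 'unk'."""
    return [[vocab[t] if t in vocab else vocab["unk"] for t in s.split()]
            for s in texts]

def pad_post(sequences, maxlen, value=0):
    """Pad at the end; over-long sequences keep their last `maxlen` items."""
    out = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        trimmed = seq[-maxlen:]
        out[row, :len(trimmed)] = trimmed
    return out

toy_sentences = to_indices(["sub word", "sub xyz word sub"], toy_vocab)
toy_data = pad_post(toy_sentences, maxlen=3)
print(toy_data)
```

The first row is padded with a zero at the end, while the second row is longer than `maxlen` and loses its first token.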

## Embedding layer weights

To initialize the embedding layer with pre-trained weights, we first need to load the subword embeddings.

In [4]:
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("../pre-trained-model/en.wiki.bpe.op25000.d50.w2v.bin", binary=True)

In [5]:
from keras.layers import Embedding

# parameter
input_size = 1014
embedding_size = 50
conv_layers = [[256, 7, 3],
               [256, 7, 3],
               [256, 3, -1],
               [256, 3, -1],
               [256, 3, -1],
               [256, 3, 3]]

fully_connected_layers = [1024, 1024]
num_of_classes = 4
dropout_p = 0.5
optimizer = 'adam'
loss = 'categorical_crossentropy'


embedding_dim = 50
embedding_weights = np.zeros((len(vocab) + 1, embedding_dim)) # (25001, 50)

for subword, i in vocab.items():
    # subwords in vocab but not in the pre-trained model keep their all-zero row
    if subword in model.vocab:
        embedding_weights[i] = model[subword]
    
embedding_layer = Embedding(len(vocab)+1,
                            embedding_size,
                            weights=[embedding_weights],
                            input_length=input_size)
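The weight matrix has one extra row because index 0 is reserved for padding. A minimal sketch of the same construction on toy data, where a plain dict stands in for the gensim `KeyedVectors` object (the dict, the toy vectors, and the `build_embedding_matrix` helper are assumptions for illustration only):

```python
import numpy as np

def build_embedding_matrix(vocab, vectors, dim):
    """Row i holds the vector of the token with index i; unmatched rows stay zero."""
    weights = np.zeros((len(vocab) + 1, dim))  # +1 for the padding index 0
    for token, index in vocab.items():
        vec = vectors.get(token)
        if vec is not None:
            weights[index] = vec
    return weights

toy_vocab = {"unk": 1, "sub": 2, "word": 3}
# Hypothetical 2-dimensional vectors; "unk" is deliberately missing.
toy_vectors = {"sub": np.array([0.1, 0.2]), "word": np.array([0.3, 0.4])}

toy_weights = build_embedding_matrix(toy_vocab, toy_vectors, dim=2)
print(toy_weights)
```

Row 0 (padding) and row 1 (a token without a pre-trained vector) remain zero, which mirrors what happens to out-of-model subwords in the cell above.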


## Model Construction

In [6]:
from keras.layers import Input, Embedding, Activation, Flatten, Dense
from keras.layers import Conv1D, MaxPooling1D, Dropout
from keras.models import Model

In [7]:
# Model Construction
# Input
inputs = Input(shape=(input_size,), name='input', dtype='int64')  # shape=(?, 1014)
# Embedding
x = embedding_layer(inputs)
# Conv
for filter_num, filter_size, pooling_size in conv_layers:
    x = Conv1D(filter_num, filter_size)(x)
    x = Activation('relu')(x)
    if pooling_size != -1:
        x = MaxPooling1D(pool_size=pooling_size)(x)
x = Flatten()(x)  # (None, 34, 256) -> (None, 8704)
# Fully connected layers
for dense_size in fully_connected_layers:
    x = Dense(dense_size, activation='relu')(x)  # dense_size == 1024
    x = Dropout(dropout_p)(x)
# Output Layer
predictions = Dense(num_of_classes, activation='softmax')(x)

# Callbacks
# earlystopper = EarlyStopping(monitor='val_loss', patience=15, verbose=1)

# Build model
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])  # Adam, categorical_crossentropy
model.summary()


# filepath="weights.best.hdf5"
# checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1, save_best_only=True, mode='max')

# # check 5 epochs
# early_stop = EarlyStopping(monitor='val_acc', patience=5, mode='max') 

# callbacks_list = [checkpoint, early_stop]

# history = model.fit(x, y, validation_data=(x_test, y_test), epochs=100, callbacks=callbacks_list)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 1014)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1014, 50)          1250050   
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 1008, 256)         89856     
_________________________________________________________________
activation_1 (Activation)    (None, 1008, 256)         0         
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 336, 256)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 330, 256)          459008    
_________________________________________________________________
activation_2 (Activation)    (None, 330, 256)          0         
__________
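The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults used in the model cell: `Conv1D` with `'valid'` padding and stride 1, and `MaxPooling1D` with a stride equal to its pool size. The helper names are made up for this example.

```python
conv_layers = [[256, 7, 3], [256, 7, 3], [256, 3, -1],
               [256, 3, -1], [256, 3, -1], [256, 3, 3]]

def trace_lengths(input_len, layers):
    """Sequence length after each conv (+ optional pooling) block."""
    length = input_len
    lengths = []
    for _filters, kernel, pool in layers:
        length = length - kernel + 1              # 'valid' convolution, stride 1
        if pool != -1:
            length = (length - pool) // pool + 1  # pooling with stride == pool size
        lengths.append(length)
    return lengths

def conv1d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter."""
    return kernel * in_channels * filters + filters

lengths = trace_lengths(1014, conv_layers)
flat_size = lengths[-1] * conv_layers[-1][0]
embedding_params = 25001 * 50

print(lengths, flat_size)
print(embedding_params, conv1d_params(7, 50, 256), conv1d_params(7, 256, 256))
```

The results agree with the summary: 1,250,050 embedding parameters, 89,856 and 459,008 parameters for the first two convolutions, and a flattened size of 8704.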

## Train the model

In [9]:
# prepare the data 
# indices = np.arange(train_data.shape[0])
# np.random.shuffle(indices)

x_train = train_data
y_train = train_classes

x_test = test_data
y_test = test_classes

# Training
model.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          batch_size=128,
          epochs=10,
          shuffle=True,
          verbose=2)

Train on 120000 samples, validate on 7600 samples
Epoch 1/10
 - 390s - loss: 0.1762 - acc: 0.9432 - val_loss: 0.2915 - val_acc: 0.9146
Epoch 2/10
 - 390s - loss: 0.1389 - acc: 0.9554 - val_loss: 0.3005 - val_acc: 0.9141
Epoch 3/10
 - 390s - loss: 0.1107 - acc: 0.9650 - val_loss: 0.3227 - val_acc: 0.9108
Epoch 4/10
 - 390s - loss: 0.0914 - acc: 0.9712 - val_loss: 0.3619 - val_acc: 0.9104
Epoch 5/10
 - 390s - loss: 0.0748 - acc: 0.9773 - val_loss: 0.3720 - val_acc: 0.9174
Epoch 6/10
 - 390s - loss: 0.0612 - acc: 0.9812 - val_loss: 0.4105 - val_acc: 0.9093
Epoch 7/10
 - 390s - loss: 0.0519 - acc: 0.9839 - val_loss: 0.4378 - val_acc: 0.9068
Epoch 8/10
 - 390s - loss: 0.0437 - acc: 0.9865 - val_loss: 0.4788 - val_acc: 0.9089
Epoch 9/10
 - 390s - loss: 0.0404 - acc: 0.9880 - val_loss: 0.5484 - val_acc: 0.9064
Epoch 10/10
 - 390s - loss: 0.0394 - acc: 0.9883 - val_loss: 0.5281 - val_acc: 0.9072


<keras.callbacks.History at 0x7f8c5c04e5c0>
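In the log above, the training loss keeps falling while the validation loss rises from 0.2915 to 0.5281, which suggests the model overfits from early on. As a rough sketch, the validation numbers can be copied by hand from the log to see which epoch a checkpoint or early-stopping callback (like the commented-out ones in the model cell) would likely have favored. The lists below are transcribed from the output, not read from the `History` object.

```python
# Validation metrics transcribed from the training log, epochs 1..10.
val_loss = [0.2915, 0.3005, 0.3227, 0.3619, 0.3720,
            0.4105, 0.4378, 0.4788, 0.5484, 0.5281]
val_acc = [0.9146, 0.9141, 0.9108, 0.9104, 0.9174,
           0.9093, 0.9068, 0.9089, 0.9064, 0.9072]

# Epochs are 1-based in the log, list positions are 0-based.
best_loss_epoch = val_loss.index(min(val_loss)) + 1
best_acc_epoch = val_acc.index(max(val_acc)) + 1

print("lowest val_loss at epoch", best_loss_epoch)
print("highest val_acc at epoch", best_acc_epoch)
```

Monitoring `val_loss` would pick the first epoch, while monitoring `val_acc` would pick the fifth, so the choice of monitored metric matters for this run.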