# Subword-level representation


In this notebook, we preprocess the data so that sentences are represented at the subword level.

## Byte-pair encoding (BPE)


While researching subword-level word representations, I found two useful tools:

- [SentencePiece](https://github.com/google/sentencepiece) 
- [BPEmb](https://github.com/bheinzerling/bpemb)

SentencePiece is an unsupervised text tokenizer and detokenizer that supports several subword algorithms, such as byte-pair encoding (BPE) and the unigram language model. BPEmb is a collection of pre-trained subword embeddings in 275 languages, and it supports SentencePiece. However, I had trouble installing SentencePiece.

Because I am using OSX, I have to build SentencePiece from source with C++. Running `brew install protobuf autoconf automake libtool` failed with `Error: Failed to download resource "pkg-config"`, and after trying several workarounds I still could not install SentencePiece. Fortunately, the BPEmb page links to a script that converts text to subwords without using SentencePiece, so we use that script here.

The script is discussed in [On-the-fly conversion to subwords in Python](https://github.com/bheinzerling/bpemb/issues/10). This version is easier to use: [bpe.py](https://github.com/bheinzerling/bpemb/blob/master/bpe.py)




In [3]:
from math import log


class BPE(object):

    def __init__(self, vocab_file):
        with open(vocab_file, encoding="utf8") as f:
            self.words = [l.split()[0] for l in f]
            log_len = log(len(self.words))
            self.wordcost = {
                k: log((i+1) * log_len)
                for i, k in enumerate(self.words)}
            self.maxword = max(len(x) for x in self.words)

    def encode(self, s):
        """Uses dynamic programming to infer the location of spaces in a string
        without spaces."""

        s = s.replace(" ", "▁")

        # Find the best match for the i first characters, assuming cost has
        # been built for the i-1 first characters.
        # Returns a pair (match_cost, match_length).
        def best_match(i):
            candidates = enumerate(reversed(cost[max(0, i - self.maxword):i]))
            return min(
                (c + self.wordcost.get(s[i-k-1:i], 9e999), k+1)
                for k, c in candidates)

        # Build the cost array.
        cost = [0]
        for i in range(1, len(s) + 1):
            c, k = best_match(i)
            cost.append(c)

        # Backtrack to recover the minimal-cost string.
        out = []
        i = len(s)
        while i > 0:
            c, k = best_match(i)
            assert c == cost[i]
            out.append(s[i-k:i])

            i -= k

        return " ".join(reversed(out))

There are three attributes in the BPE class:
- `BPE.words`: a list of all subword tokens, which we can treat as the dictionary
- `BPE.wordcost`: a dict mapping each token to its cost, `log((i+1) * log(len(self.words)))`, where `i` is the token's position in `BPE.words`
- `BPE.maxword`: the length of the longest token in `BPE.words`
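
To make the cost formula concrete, here is a minimal sketch on a made-up five-token vocabulary (`toy_words` and `toy_wordcost` are hypothetical names, not part of `bpe.py`). It assumes, as the BPEmb vocab files appear to do, that tokens are listed from most to least frequent, so the cost depends only on a token's rank and earlier tokens are cheaper.

```python
from math import log

# Hypothetical vocabulary, ordered from most to least frequent.
toy_words = ["▁the", "▁a", "he", "in", "er"]

# Same formula as BPE.__init__: the cost grows with the token's rank.
log_len = log(len(toy_words))
toy_wordcost = {token: log((rank + 1) * log_len)
                for rank, token in enumerate(toy_words)}

for token, cost in toy_wordcost.items():
    print(f"{token}\t{cost:.3f}")
```

The printed costs increase from the first token to the last, which is why the encoder prefers splits made of frequent (early) tokens.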

Let's try an example to see the result:

In [32]:
bpe = BPE("../pre_trained_model/en.wiki.bpe.op25000.vocab")
print(bpe.encode(' this is our house in boomchakalaka'))  
# >>> ▁this ▁is ▁our ▁house ▁in ▁boom ch ak al aka 

▁this ▁is ▁our ▁house ▁in ▁boom ch ak al aka
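
To see how the dynamic programming picks a split, the following is a small self-contained sketch of the same idea on a hypothetical nine-token vocabulary. The function `segment` and the list `toy_words` are illustrative names only; unlike `BPE.encode`, this variant stores the length of the last token at each position instead of recomputing it while backtracking.

```python
from math import log

def segment(text, words):
    """Split `text` into the cheapest sequence of tokens from `words`."""
    log_len = log(len(words))
    wordcost = {w: log((i + 1) * log_len) for i, w in enumerate(words)}
    maxword = max(len(w) for w in words)
    s = text.replace(" ", "▁")

    # best[i] = (cost of the cheapest split of s[:i], length of its last token)
    best = [(0.0, 0)]
    for i in range(1, len(s) + 1):
        best.append(min(
            (best[j][0] + wordcost.get(s[j:i], float("inf")), i - j)
            for j in range(max(0, i - maxword), i)))

    # Walk back through the stored token lengths.
    out, i = [], len(s)
    while i > 0:
        k = best[i][1]
        out.append(s[i - k:i])
        i -= k
    return " ".join(reversed(out))

# Hypothetical vocabulary, most frequent token first.
toy_words = ["▁the", "▁a", "▁cat", "▁c", "at", "s", "a", "t", "c"]
print(segment(" the cats", toy_words))  # ▁the ▁cat s
```

Here `▁cat` + `s` is cheaper than `▁c` + `at` + `s`, so the longer, higher-ranked token wins.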


In [34]:
# see the attributes in the BPE class
print(bpe.words[:10])

for i, token in enumerate(bpe.wordcost):
    print(token)
    if i > 4:
        break
        
print(bpe.maxword)

['<unk>', '<s>', '</s>', '▁t', '▁a', 'he', 'in', '▁the', '00', 'er']
<unk>
<s>
</s>
▁t
▁a
he
16


Next, we process our sentences to get the subword-level representation.

## Preprocess 

- Load data 
- Convert string to subword
- Convert subword to index 
- Padding 



In [103]:
#========================Load data=========================
import numpy as np
import pandas as pd
train_data_source = '../../char-level-cnn/data/ag_news_csv/train.csv'
test_data_source = '../../char-level-cnn/data/ag_news_csv/test.csv'
train_df = pd.read_csv(train_data_source, header=None)
test_df = pd.read_csv(test_data_source, header=None)
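# concatenate column 1 and column 2 as one text
# (no separator is inserted between the two columns)
for df in [train_df, test_df]:
    df[1] = df[1] + df[2]
    # drop in place; `df = df.drop(...)` would only rebind the loop variable
    df.drop(columns=[2], inplace=True)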
    
# convert string to lower case 
train_texts = train_df[1].values 
train_texts = [s.lower() for s in train_texts]
test_texts = test_df[1].values 
test_texts = [s.lower() for s in test_texts]

In [104]:
print(train_texts[0])
print(test_texts[0])


wall st. bears claw back into the black (reuters)reuters - short-sellers, wall street's dwindling\band of ultra-cynics, are seeing green again.
fears for t n pension after talksunions representing workers at turner   newall say they are 'disappointed' after talks with stricken parent firm federal mogul.


In [105]:
# replace all digits with 0
import re
# raw strings keep '\d' from being treated as a string escape
train_texts = [re.sub(r'\d', '0', s) for s in train_texts]
test_texts = [re.sub(r'\d', '0', s) for s in test_texts]

In [106]:
print(train_texts[0])
print(test_texts[0])

wall st. bears claw back into the black (reuters)reuters - short-sellers, wall street's dwindling\band of ultra-cynics, are seeing green again.
fears for t n pension after talksunions representing workers at turner   newall say they are 'disappointed' after talks with stricken parent firm federal mogul.
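
The two printed samples contain no digits, so the output above is unchanged. A minimal illustration on a made-up sentence (`sample` and `masked` are hypothetical names) shows what the substitution does when digits are present:

```python
import re

sample = "the dow rose 35 points to 10,120.24 in 2004"
masked = re.sub(r"\d", "0", sample)
print(masked)  # the dow rose 00 points to 00,000.00 in 0000
```

Each digit is replaced individually, so the length and punctuation of every number are preserved.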


In [107]:
# replace all URLs with <url> 
url_reg  = r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b'
train_texts = [re.sub(url_reg, '<url>', s) for s in train_texts]
test_texts = [re.sub(url_reg, '<url>', s) for s in test_texts]

In [108]:
print(train_texts[0])
print(test_texts[0])

wall st. bears claw back into the black (reuters)reuters - short-sellers, wall street's dwindling\band of ultra-cynics, are seeing green again.
fears for t n pension after talksunions representing workers at turner   newall say they are 'disappointed' after talks with stricken parent firm federal mogul.
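
Again the two samples contain no URL, so nothing changes in the output above. As a quick check of the pattern, here is a sketch on a made-up sentence (`sample` and `cleaned` are hypothetical names, and the address is only an example):

```python
import re

url_reg = r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b'
sample = "see http://www.example.com/news?id=0 for details"
cleaned = re.sub(url_reg, '<url>', sample)
print(cleaned)  # see <url> for details
```

Note that the character set in `url_reg` does not include `-`, so a URL containing a hyphen would only be replaced up to the hyphen.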


For an explanation of `re.MULTILINE`, see [here](https://teamtreehouse.com/community/dont-quite-understand-the-use-of-remultiline) and [here](http://messefor.hatenablog.com/entry/2017/01/15/215722).
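
In short, `re.MULTILINE` makes `^` and `$` match at the start and end of every line rather than only at the start and end of the whole string. A minimal illustration on a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# Without the flag, ^ matches only at the very start of the string.
print(re.findall(r"^\w+", text))                # ['first']

# With the flag, ^ also matches right after each newline.
print(re.findall(r"^\w+", text, re.MULTILINE))  # ['first', 'second']
```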


In [109]:
# Convert string to subword, this process may take several minutes
train_texts = [bpe.encode(s) for s in train_texts]
test_texts = [bpe.encode(s) for s in test_texts]

In [110]:
print(train_texts[0])
print(test_texts[0])

wall ▁st . ▁bears ▁c law ▁back ▁into ▁the ▁black ▁( re uters ) re uters ▁- ▁short - sel lers , ▁wall ▁street ' s ▁d wind ling \ b a n d ▁ o f ▁ u l t r a - c y n i c s , ▁ a r e ▁ s e e i n g ▁ g r e e n ▁ a g a i n .
fe ars ▁for ▁t ▁n ▁pension ▁after ▁talks un ions ▁representing ▁workers ▁at ▁turner ▁ ▁ ▁new all ▁say ▁they ▁are ▁' dis app ointed ' ▁after ▁talks ▁with ▁strick en ▁parent ▁firm ▁federal ▁mog ul .
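
The run of single characters after the backslash in the first output is probably an artifact of the cost function rather than of the text itself. Assuming `\` has no entry in the vocabulary, every candidate that ends at or after it gets the fallback cost `9e999`, which Python parses as infinity. All candidate costs then tie at `inf`, and `min` falls back to comparing the second tuple element, the match length, so the one-character match always wins. A minimal illustration of that tie-break (the numbers are made up):

```python
inf = 9e999                 # the fallback cost used in best_match
print(inf == float("inf"))  # True

# Three hypothetical (cost, length) candidates once the running cost is inf.
candidates = [(inf + 0.5, 3), (inf + 1.2, 2), (inf + 2.0, 1)]
print(min(candidates))      # (inf, 1)
```

One possible workaround, not applied in this notebook, would be to remove or replace characters that the vocabulary does not cover before calling `bpe.encode`.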


In [113]:
# Build vocab, {token: index}
vocab = {}
for i, token in enumerate(bpe.words):
    vocab[token] = i + 1

In [114]:
for i, (key, value) in enumerate(vocab.items()):
    print(key, value)
    if i > 4:
        break

<unk> 1
<s> 2
</s> 3
▁t 4
▁a 5
he 6


In [95]:
# # Convert subword to index 
# train_sentences = []
# for s in train_texts:
#     s = s.split()
#     one_line = []
#     for word in s:
#         if word not in vocab.keys():
#             one_line.append(vocab['<unk>'])
#         else:
#             one_line.append(vocab[word])
#     train_sentences.append(one_line)

In [97]:
# print(train_sentences[0])

[5323, 68, 24904, 6039, 13, 5012, 1025, 549, 8, 1237, 72, 14, 21182, 24912, 14, 21182, 865, 1144, 24910, 3065, 5583, 24905, 2227, 1230, 24915, 24889, 37, 9683, 1307, 1, 24901, 24884, 24887, 24893, 24882, 24888, 24898, 24882, 24897, 24892, 24885, 24890, 24884, 24910, 24894, 24903, 24887, 24886, 24894, 24889, 24905, 24882, 24884, 24890, 24883, 24882, 24889, 24883, 24883, 24886, 24887, 24900, 24882, 24900, 24890, 24883, 24883, 24887, 24882, 24884, 24900, 24884, 24886, 24887, 24904]


In [118]:
# Convert subword to index, function version 
def subword2index(texts, vocab):
    sentences = []
    for s in texts:
        s = s.split()
        one_line = []
        for word in s:
            # fall back to the '<unk>' index for tokens missing from the vocab
            # (plain 'unk' is an ordinary subword with its own index)
            one_line.append(vocab.get(word, vocab['<unk>']))
        sentences.append(one_line)
    return sentences
    
# Convert train and test 
train_sentences = subword2index(train_texts, vocab)
test_sentences = subword2index(test_texts, vocab)

In [121]:
print(len(train_sentences))
print(train_sentences[0])
print(len(test_sentences))
print(test_sentences[0])

120000
[5323, 68, 24904, 6039, 13, 5012, 1025, 549, 8, 1237, 72, 14, 21182, 24912, 14, 21182, 865, 1144, 24910, 3065, 5583, 24905, 2227, 1230, 24915, 24889, 37, 9683, 1307, 1, 24901, 24884, 24887, 24893, 24882, 24888, 24898, 24882, 24897, 24892, 24885, 24890, 24884, 24910, 24894, 24903, 24887, 24886, 24894, 24889, 24905, 24882, 24884, 24890, 24883, 24882, 24889, 24883, 24883, 24886, 24887, 24900, 24882, 24900, 24890, 24883, 24883, 24887, 24882, 24884, 24900, 24884, 24886, 24887, 24904]
7600
[3225, 1572, 71, 4, 48, 15602, 296, 13706, 47, 225, 3499, 4629, 95, 7898, 24882, 24882, 232, 130, 3651, 422, 157, 1186, 8429, 1383, 23271, 24915, 296, 13706, 105, 20439, 25, 7619, 3430, 2352, 13573, 97, 24904]


In [122]:
from keras.preprocessing.sequence import pad_sequences
# Padding
train_data = pad_sequences(train_sentences, maxlen=1014, padding='post')
test_data = pad_sequences(test_sentences, maxlen=1014, padding='post')

# Convert to numpy array
train_data = np.array(train_data)
test_data = np.array(test_data)

In [128]:
print(len(train_data[0]))
print(train_data[0])
print(len(test_data[0]))
print(test_data[0])

1014
[ 5323    68 24904 ...     0     0     0]
1014
[3225 1572   71 ...    0    0    0]
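
For readers unfamiliar with `pad_sequences`, the following pure-Python sketch mimics what the call above does to a single sequence. `pad_post` is a hypothetical helper, not a Keras function. It assumes the Keras default `truncating='pre'`, which means a sequence longer than `maxlen` loses its first tokens, not its last ones.

```python
def pad_post(seq, maxlen, value=0):
    """Mimic pad_sequences(..., padding='post') for one sequence."""
    seq = seq[-maxlen:]                         # truncating='pre': keep the last maxlen items
    return seq + [value] * (maxlen - len(seq))  # padding='post': fill on the right

print(pad_post([5, 3, 8], 5))           # [5, 3, 8, 0, 0]
print(pad_post([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```

Because the vocabulary indices start at 1, the padding value 0 never collides with a real token.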


In [129]:
#=======================Get classes================
train_classes = train_df[0].values
train_class_list = [x-1 for x in train_classes]
test_classes = test_df[0].values
test_class_list = [x-1 for x in test_classes]

from keras.utils import to_categorical
train_classes = to_categorical(train_class_list)
test_classes = to_categorical(test_class_list)

In [132]:
print(train_classes)
print(test_classes)

[[0. 0. 1. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]
 ...
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]
 [0. 1. 0. 0.]]
[[0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [0. 0. 0. 1.]
 ...
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 1. 0.]]
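
The label handling can be sketched without Keras as follows. `one_hot` is a hypothetical stand-in for `to_categorical`, and the raw labels are made up; the sketch assumes, as the `x-1` shift and the four output columns above suggest, that the CSV stores classes as 1 to 4.

```python
def one_hot(labels, num_classes):
    """Return one row per label with a 1.0 in the label's column."""
    return [[1.0 if c == label else 0.0 for c in range(num_classes)]
            for label in labels]

raw_labels = [3, 4, 1]                    # hypothetical raw labels, 1-based
shifted = [x - 1 for x in raw_labels]     # shift to 0-based class ids
print(one_hot(shifted, 4))
# [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0], [1.0, 0.0, 0.0, 0.0]]
```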


We can write all code together:

In [None]:
# BPE
from math import log

class BPE(object):

    def __init__(self, vocab_file):
        with open(vocab_file, encoding="utf8") as f:
            self.words = [l.split()[0] for l in f]
            log_len = log(len(self.words))
            self.wordcost = {
                k: log((i+1) * log_len)
                for i, k in enumerate(self.words)}
            self.maxword = max(len(x) for x in self.words)

    def encode(self, s):
        """Uses dynamic programming to infer the location of spaces in a string
        without spaces."""

        s = s.replace(" ", "▁")

        # Find the best match for the i first characters, assuming cost has
        # been built for the i-1 first characters.
        # Returns a pair (match_cost, match_length).
        def best_match(i):
            candidates = enumerate(reversed(cost[max(0, i - self.maxword):i]))
            return min(
                (c + self.wordcost.get(s[i-k-1:i], 9e999), k+1)
                for k, c in candidates)

        # Build the cost array.
        cost = [0]
        for i in range(1, len(s) + 1):
            c, k = best_match(i)
            cost.append(c)

        # Backtrack to recover the minimal-cost string.
        out = []
        i = len(s)
        while i > 0:
            c, k = best_match(i)
            assert c == cost[i]
            out.append(s[i-k:i])

            i -= k

        return " ".join(reversed(out))

In [None]:
#=======================All Preprocessing====================

# load data
import numpy as np
import pandas as pd
train_data_source = '../../char-level-cnn/data/ag_news_csv/train.csv'
test_data_source = '../../char-level-cnn/data/ag_news_csv/test.csv'
train_df = pd.read_csv(train_data_source, header=None)
test_df = pd.read_csv(test_data_source, header=None)
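# concatenate column 1 and column 2 as one text
# (no separator is inserted between the two columns)
for df in [train_df, test_df]:
    df[1] = df[1] + df[2]
    # drop in place; `df = df.drop(...)` would only rebind the loop variable
    df.drop(columns=[2], inplace=True)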
    
# convert string to lower case 
train_texts = train_df[1].values 
train_texts = [s.lower() for s in train_texts]
test_texts = test_df[1].values 
test_texts = [s.lower() for s in test_texts]

# replace all digits with 0
import re
# raw strings keep '\d' from being treated as a string escape
train_texts = [re.sub(r'\d', '0', s) for s in train_texts]
test_texts = [re.sub(r'\d', '0', s) for s in test_texts]

# replace all URLs with <url> 
url_reg  = r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b'
train_texts = [re.sub(url_reg, '<url>', s) for s in train_texts]
test_texts = [re.sub(url_reg, '<url>', s) for s in test_texts]

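# Load the BPE vocabulary (same file as in the example above)
bpe = BPE("../pre_trained_model/en.wiki.bpe.op25000.vocab")

# Convert string to subword, this process may take several minutes
train_texts = [bpe.encode(s) for s in train_texts]
test_texts = [bpe.encode(s) for s in test_texts]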

# Build vocab, {token: index}
vocab = {}
for i, token in enumerate(bpe.words):
    vocab[token] = i + 1
    
# Convert subword to index, function version 
def subword2index(texts, vocab):
    sentences = []
    for s in texts:
        s = s.split()
        one_line = []
        for word in s:
            # fall back to the '<unk>' index for tokens missing from the vocab
            # (plain 'unk' is an ordinary subword with its own index)
            one_line.append(vocab.get(word, vocab['<unk>']))
        sentences.append(one_line)
    return sentences

# Convert train and test 
train_sentences = subword2index(train_texts, vocab)
test_sentences = subword2index(test_texts, vocab)

# Padding
from keras.preprocessing.sequence import pad_sequences
train_data = pad_sequences(train_sentences, maxlen=1014, padding='post')
test_data = pad_sequences(test_sentences, maxlen=1014, padding='post')

# Convert to numpy array
train_data = np.array(train_data)
test_data = np.array(test_data)

#=======================Get classes================
train_classes = train_df[0].values
train_class_list = [x-1 for x in train_classes]
test_classes = test_df[0].values
test_class_list = [x-1 for x in test_classes]

from keras.utils import to_categorical
train_classes = to_categorical(train_class_list)
test_classes = to_categorical(test_class_list)