Here just for simplicity, I write all preprocess code together. If you are instested what happend in the preprocess step, please move to this [notebook](https://github.com/BrambleXu/nlp-beginner-guide-keras/blob/master/char-level-cnn/notebooks/char-level-text-preprocess-with-keras-summary.ipynb). 

In [12]:
# write all code in one cell 

#========================Load data=========================
import numpy as np
import pandas as pd

train_data_source = '../data/ag_news_csv/train.csv'
test_data_source = '../data/ag_news_csv/test.csv'

train_df = pd.read_csv(train_data_source, header=None)
test_df = pd.read_csv(test_data_source, header=None)

# concatenate column 1 and column 2 as one text
for df in [train_df, test_df]:
    df[1] = df[1] + df[2]
    df = df.drop([2], axis=1)
    
# convert string to lower case 
train_texts = train_df[1].values 
train_texts = [s.lower() for s in train_texts] 

test_texts = test_df[1].values 
test_texts = [s.lower() for s in test_texts] 

#=======================Convert string to index================
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# Tokenizer
tk = Tokenizer(num_words=None, char_level=True, oov_token='UNK')
tk.fit_on_texts(train_texts)
# If we already have a character list, then replace the tk.word_index
# If not, just skip below part

#-----------------------Skip part start--------------------------
# construct a new vocabulary 
alphabet="abcdefghijklmnopqrstuvwxyz0123456789 ,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"
char_dict = {}
for i, char in enumerate(alphabet):
    char_dict[char] = i + 1
    
# Use char_dict to replace the tk.word_index
tk.word_index = char_dict.copy() 
# Add 'UNK' to the vocabulary 
tk.word_index[tk.oov_token] = max(char_dict.values()) + 1
#-----------------------Skip part end----------------------------

# Convert string to index 
train_sequences = tk.texts_to_sequences(train_texts)
test_texts = tk.texts_to_sequences(test_texts)

# Padding
train_data = pad_sequences(train_sequences, maxlen=1014, padding='post')
test_data = pad_sequences(test_texts, maxlen=1014, padding='post')

# Convert to numpy array
train_data = np.array(train_data, dtype='float32')
test_data = np.array(test_data, dtype='float32')

#=======================Get classes================
train_classes = train_df[0].values
train_class_list = [x-1 for x in train_classes]

test_classes = test_df[0].values
test_class_list = [x-1 for x in test_classes]

from keras.utils import to_categorical
train_classes = to_categorical(train_class_list)
test_classes = to_categorical(test_class_list)

## Construct Model

We implement the char_cnn_zhang model from this paper:

- Xiang Zhang, Junbo Zhao, Yann LeCun. [Character-level Convolutional Networks for Text Classification](http://arxiv.org/abs/1509.01626). NIPS 2015

The model structure:

![](https://cdn-images-1.medium.com/max/1600/0*fovAEUSdSkbsnJw5.png)

This graph may look difficult to understand. Here is the model setup. 


![](https://img-blog.csdn.net/20170721104727009)


If you want to see the detail for this model, please move to this [notebook](https://github.com/BrambleXu/nlp-beginner-guide-keras/blob/master/char-level-cnn/notebooks/char_cnn_zhang.ipynb)

We choose the small frame, 256 filters in convolutional layer and 1024 output units in dense layer. 

- Embedding Layer
- Six convolutional layers, and 3 convolutional layers followed by a max pooling layer
- Two fully connected layer(dense layer in keras), neuron units are 1024. 
- Output layer(dense layer), neuron units depends on classes. In this task, we set it 4. 

First we have to construct a embedding index. Beside the 68 characters in `alphabet`, we add `UNK` with `index 69`. These 69 characters are saved to `tk.word_index`, we could output it to see:

In [13]:
print(tk.word_index)

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7, 'h': 8, 'i': 9, 'j': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'v': 22, 'w': 23, 'x': 24, 'y': 25, 'z': 26, '0': 27, '1': 28, '2': 29, '3': 30, '4': 31, '5': 32, '6': 33, '7': 34, '8': 35, '9': 36, ' ': 37, ',': 38, ';': 39, '.': 40, '!': 41, '?': 42, ':': 43, "'": 44, '"': 45, '/': 46, '\\': 47, '|': 48, '_': 49, '@': 50, '#': 51, '$': 52, '%': 53, '^': 54, '&': 55, '*': 56, '~': 57, '`': 58, '+': 59, '-': 60, '=': 61, '<': 62, '>': 63, '(': 64, ')': 65, '[': 66, ']': 67, '{': 68, '}': 69, 'UNK': 70}


In [14]:
vocab_size = len(tk.word_index)
vocab_size

70

We can use one-hot vector to represent these 69 words. Because Keras use `0` for `PAD`. We add first line containing all 0 to represent the `PAD`

In [15]:
embedding_weights = [] #(71, 70)
embedding_weights.append(np.zeros(vocab_size)) # first row is pad

for char, i in tk.word_index.items(): # from index 1 to 70
    onehot = np.zeros(vocab_size)
    onehot[i-1] = 1
    embedding_weights.append(onehot)
embedding_weights = np.array(embedding_weights)

In [16]:
print(embedding_weights.shape) # first row all 0 for PAD, 69 char, last row for UNK
embedding_weights

(71, 70)


array([[0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 1., 0.],
       [0., 0., 0., ..., 0., 0., 1.]])

In [6]:
from keras.layers import Input, Embedding, Activation, Flatten, Dense
from keras.layers import Conv1D, MaxPooling1D, Dropout
from keras.models import Model

In [7]:
# parameter 
input_size = 1014
# vocab_size = 69
embedding_size = 69
conv_layers = [[256, 7, 3], 
               [256, 7, 3], 
               [256, 3, -1], 
               [256, 3, -1], 
               [256, 3, -1], 
               [256, 3, 3]]

fully_connected_layers = [1024, 1024]
num_of_classes = 4
dropout_p = 0.5
optimizer = 'adam'
loss = 'categorical_crossentropy'

In [8]:
# Embedding layer Initialization
embedding_layer = Embedding(vocab_size+1, 
                            embedding_size,
                            input_length=input_size,
                            weights=[embedding_weights])

In [9]:
# Model 

# Input
inputs = Input(shape=(input_size,), name='input', dtype='int64')  # shape=(?, 1014)
# Embedding 
x = embedding_layer(inputs)
# Conv 
for filter_num, filter_size, pooling_size in conv_layers:
    x = Conv1D(filter_num, filter_size)(x) 
    x = Activation('relu')(x)
    if pooling_size != -1:
        x = MaxPooling1D(pool_size=pooling_size)(x) # Final shape=(None, 34, 256)
x = Flatten()(x) # (None, 8704)
# Fully connected layers 
for dense_size in fully_connected_layers:
    x = Dense(dense_size, activation='relu')(x) # dense_size == 1024
    x = Dropout(dropout_p)(x)
# Output Layer
predictions = Dense(num_of_classes, activation='softmax')(x)
# Build model
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy']) # Adam, categorical_crossentropy
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input (InputLayer)           (None, 1014)              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 1014, 69)          4830      
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 1008, 256)         123904    
_________________________________________________________________
activation_1 (Activation)    (None, 1008, 256)         0         
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 336, 256)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 330, 256)          459008    
_________________________________________________________________
activation_2 (Activation)    (None, 330, 256)          0         
__________

## train the model
Because here I just use CPU to run the model, so I only use 10000 samples for trianing and 1000 samples for testing.

In [10]:
# 1000 training samples and 100 testing samples
indices = np.arange(train_data.shape[0])
np.random.shuffle(indices)

x_train = train_data[indices][:1000]
y_train = train_classes[indices][:1000]

x_test = test_data[:100]
y_test = test_classes[:100]

In [11]:
# Training
model.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          batch_size=128,
          epochs=10,
          verbose=2)

Train on 1000 samples, validate on 100 samples
Epoch 1/10
 - 34s - loss: 1.4076 - acc: 0.2440 - val_loss: 1.3802 - val_acc: 0.3800
Epoch 2/10
 - 36s - loss: 1.3869 - acc: 0.2730 - val_loss: 1.3468 - val_acc: 0.4300
Epoch 3/10
 - 33s - loss: 1.3834 - acc: 0.2650 - val_loss: 1.3415 - val_acc: 0.4400
Epoch 4/10
 - 34s - loss: 1.3798 - acc: 0.3020 - val_loss: 1.3610 - val_acc: 0.4500
Epoch 5/10
 - 31s - loss: 1.3715 - acc: 0.3040 - val_loss: 1.2889 - val_acc: 0.4500
Epoch 6/10
 - 32s - loss: 1.3656 - acc: 0.3400 - val_loss: 1.2839 - val_acc: 0.3400
Epoch 7/10
 - 34s - loss: 1.3470 - acc: 0.3370 - val_loss: 1.2851 - val_acc: 0.4100
Epoch 8/10
 - 35s - loss: 1.3216 - acc: 0.3400 - val_loss: 1.2680 - val_acc: 0.3900
Epoch 9/10
 - 32s - loss: 1.2564 - acc: 0.3910 - val_loss: 1.2213 - val_acc: 0.4600
Epoch 10/10
 - 34s - loss: 1.0872 - acc: 0.5080 - val_loss: 1.2991 - val_acc: 0.3200


<keras.callbacks.History at 0x1868b05ac8>

Because we use a small dataset, so the model is easy to overfit. 

In [None]:
# =====================Char CNN in whole dataset=======================
# parameter
input_size = 1014
vocab_size = len(tk.word_index)
embedding_size = 70
conv_layers = [[256, 7, 3],
               [256, 7, 3],
               [256, 3, -1],
               [256, 3, -1],
               [256, 3, -1],
               [256, 3, 3]]

fully_connected_layers = [1024, 1024]
num_of_classes = 4
dropout_p = 0.5
optimizer = 'adam'
loss = 'categorical_crossentropy'

# Embedding weights
embedding_weights = []  # (71, 70)
embedding_weights.append(np.zeros(vocab_size))  # (0, 70)

for char, i in tk.word_index.items():  # from index 1 to 70
    onehot = np.zeros(vocab_size)
    onehot[i - 1] = 1
    embedding_weights.append(onehot)

embedding_weights = np.array(embedding_weights)
print('Load')

# Embedding layer Initialization
embedding_layer = Embedding(vocab_size + 1,
                            embedding_size,
                            input_length=input_size,
                            weights=[embedding_weights])

# Model Construction
# Input
inputs = Input(shape=(input_size,), name='input', dtype='int64')  # shape=(?, 1014)
# Embedding
x = embedding_layer(inputs)
# Conv
for filter_num, filter_size, pooling_size in conv_layers:
    x = Conv1D(filter_num, filter_size)(x)
    x = Activation('relu')(x)
    if pooling_size != -1:
        x = MaxPooling1D(pool_size=pooling_size)(x)  # Final shape=(None, 34, 256)
x = Flatten()(x)  # (None, 8704)
# Fully connected layers
for dense_size in fully_connected_layers:
    x = Dense(dense_size, activation='relu')(x)  # dense_size == 1024
    x = Dropout(dropout_p)(x)
# Output Layer
predictions = Dense(num_of_classes, activation='softmax')(x)
# Build model
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])  # Adam, categorical_crossentropy
model.summary()

# Shuffle
indices = np.arange(train_data.shape[0])
np.random.shuffle(indices)

x_train = train_data[indices]
y_train = train_classes[indices]

x_test = test_data
y_test = test_classes

# Training
model.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          batch_size=128,
          epochs=10,
          verbose=2)

After training the whole data in GPU, we can get the result below.

```
Train on 120000 samples, validate on 7600 samples
Epoch 1/10
 - 425s - loss: 0.8142 - acc: 0.6320 - val_loss: 0.3946 - val_acc: 0.8578
Epoch 2/10
 - 420s - loss: 0.3400 - acc: 0.8818 - val_loss: 0.3144 - val_acc: 0.8879
Epoch 3/10
 - 420s - loss: 0.2699 - acc: 0.9080 - val_loss: 0.2871 - val_acc: 0.8988
Epoch 4/10
 - 420s - loss: 0.2261 - acc: 0.9229 - val_loss: 0.3066 - val_acc: 0.8979
Epoch 5/10
 - 420s - loss: 0.1961 - acc: 0.9328 - val_loss: 0.3286 - val_acc: 0.8950
Epoch 6/10
 - 420s - loss: 0.1669 - acc: 0.9432 - val_loss: 0.3220 - val_acc: 0.8953
Epoch 7/10
 - 420s - loss: 0.1371 - acc: 0.9537 - val_loss: 0.3573 - val_acc: 0.8922
Epoch 8/10
 - 420s - loss: 0.1197 - acc: 0.9594 - val_loss: 0.3808 - val_acc: 0.8917
Epoch 9/10
 - 420s - loss: 0.1045 - acc: 0.9643 - val_loss: 0.3834 - val_acc: 0.8957
Epoch 10/10
 - 420s - loss: 0.0885 - acc: 0.9699 - val_loss: 0.4172 - val_acc: 0.8976

```