**In this project we implement the paper:**

# Character-level Convolutional Networks for Text Classification
By ``Xiang Zhang, Junbo Zhao, Yann LeCun``

- We use the AG's news corpus dataset to show that character-level convolutional networks can achieve
state-of-the-art or competitive results.

- The AG's news topic classification dataset is constructed by choosing the 4 largest classes from the original corpus.
    - Each class contains 30,000 training samples and 1,900 testing samples.
    - In total there are 120,000 training samples and 7,600 testing samples.

## Data Preprocessing

Importing the dependencies

In [None]:
import numpy as np
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

Loading the AG's News corpus dataset

In [None]:
# path to our dataset
train_path = '/content/train.csv'
test_path = '/content/test.csv'

# loading dataset in pandas dataframe
train_df = pd.read_csv(train_path, header=None)
test_df = pd.read_csv(test_path, header=None)

print("Shape of the Train Data : ", train_df.shape)
print("Shape of the Test Data : ", test_df.shape)

Shape of the Train Data :  (120000, 3)
Shape of the Test Data :  (7600, 3)


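The text of both the train & test data spans 2 columns, so we:
1. Concatenate column 2 and column 3 of the Train & Test datasets. The two columns are joined without a separator, so the last word of column 2 runs into the first word of column 3.
2. Drop the now-redundant column 3 in both Train & Test.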

In [None]:
# concatenating column 2 and 3 of train
train_df[1] = train_df[1] + train_df[2]
# dropping redundant column 3 in train
train_df = train_df.drop([2],axis=1)

# concatenating column 2 and 3 of test
test_df[1] = test_df[1] + test_df[2]
# dropping redundant column 3 in test
test_df = test_df.drop([2],axis=1)

print("Shape of the Train Data : ", train_df.shape)
print("Shape of the Test Data : ", test_df.shape)

Shape of the Train Data :  (120000, 2)
Shape of the Test Data :  (7600, 2)


The letters in our alphabet are all lowercase, so we convert the train & test text to lowercase.

In [None]:
# converting train data to lowercase
train_lower = train_df[1].str.lower()

# converting test data to lowercase
test_lower = test_df[1].str.lower()

In this section we tokenize the text in the training dataset:
- We set `char_level=True` so that we can process the text character by character.
- We set `oov_token='UNK'` to add this token to the vocabulary in order to handle unseen characters in the test data.

In [None]:
# initializing the tokenizer
tokens = Tokenizer(num_words=None, char_level=True, oov_token='UNK')
# fitting tokenizer on the training data
tokens.fit_on_texts(train_lower)

We now define the characters in our alphabet and map them to integers in the range `[1, len(alphabet)]`, in the same order as they appear in the alphabet.

In [None]:
# known characters in our alphabet
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"

# mapping of character to integer
character_dictionary = dict()
for i in range(len(alphabet)):
    character_dictionary[alphabet[i]] = i + 1
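
The same mapping can be written more compactly with `enumerate`. The sketch below is only an illustration (the name `char_to_index` is introduced here and is not part of the notebook). It also makes one detail of the alphabet string visible: the hyphen appears twice, so the dictionary has one key fewer than the string has positions, and the later position is the one that is kept.

```python
# Same alphabet string as in the cell above.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"

# enumerate(..., start=1) yields (index, character) pairs with 1-based indices.
# A repeated character overwrites its earlier entry, so the last index wins.
char_to_index = {ch: i for i, ch in enumerate(alphabet, start=1)}

print(len(alphabet))       # number of positions in the string
print(len(char_to_index))  # number of distinct characters
print(char_to_index['-'])  # index kept for the repeated hyphen
```

Because the loop in the cell above also overwrites earlier entries, both versions produce the same dictionary.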


- Then we set the `word_index` of the tokenizer to the character mapping we just built. We also map the UNK token to a value not used in our dictionary, here `len(alphabet) + 1`. Characters outside the alphabet, including spaces, are therefore encoded as the UNK index.

- Then we convert the lowercase train & test data to sequences of integers using our dictionary mapping.

In [None]:
tokens.word_index = character_dictionary 
tokens.word_index[tokens.oov_token] = len(alphabet) + 1

# converting data to sequence of integers using our character to int mapping
train_sequences = tokens.texts_to_sequences(train_lower)
test_sequences = tokens.texts_to_sequences(test_lower)
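
As a rough illustration of what this conversion amounts to in our setup, the sketch below re-implements the lookup in plain Python. It is not the Keras implementation; `char_to_index`, `unk_index` and `encode` are names introduced here for illustration only.

```python
# Same alphabet string as in the notebook.
alphabet = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}"

# 1-based character indices, plus one extra index for unknown characters.
char_to_index = {ch: i for i, ch in enumerate(alphabet, start=1)}
unk_index = len(alphabet) + 1


def encode(text):
    """Map each character to its index; characters not in the alphabet get unk_index."""
    return [char_to_index.get(ch, unk_index) for ch in text.lower()]


encoded = encode("Fears for")
print(encoded)
```

The space between the two words is not in the alphabet, so it is encoded with the same index as any other unknown character.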

Next we pad the sequences with zeros at the end so that they all have the same length of 1014. Sequences longer than 1014 are truncated to that length.

In [None]:
train_padded = pad_sequences(train_sequences, padding='post', maxlen=1014)
test_padded = pad_sequences(test_sequences,  padding='post', maxlen=1014)

train_data = np.array(train_padded)
test_data = np.array(test_padded)
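
To make the effect of `padding='post'` concrete, here is a small dependency-free sketch of the same behaviour on short lists. The helper `pad_post` is hypothetical and written only for illustration. It assumes the Keras default `truncating='pre'`, which applies here because the cell above does not override it: sequences that are too long lose items from the front.

```python
def pad_post(sequence, maxlen, value=0):
    """Return a list of exactly maxlen items.

    Short sequences get `value` appended at the end (padding='post').
    Long sequences keep only their last maxlen items (truncating='pre').
    """
    trimmed = sequence[-maxlen:]
    return trimmed + [value] * (maxlen - len(trimmed))


short = pad_post([6, 5, 1], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)
print(long)
```

If keeping the start of long texts matters more than the end, `pad_sequences` also accepts `truncating='post'`.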

Now we extract the labels/classes of both the train & test text data. The labels in the dataset are `1, 2, 3 and 4`; we subtract 1 so that the classes are `0, 1, 2 and 3`.

In [None]:
# classes corresponding to the train data
train_classes = np.array(train_df[0].values)
train_classes = train_classes - 1

# classes corresponding to the test data
test_classes = np.array(test_df[0].values)
test_classes = test_classes - 1



In the final step of the data preprocessing, we use the `to_categorical` API in Keras to convert the arrays of class labels to one-hot vectors.

In [None]:
train_classes = to_categorical(train_classes)
test_classes = to_categorical(test_classes)
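
The label shift and the one-hot conversion can be sketched together in plain Python. This is an illustration rather than the Keras implementation: `one_hot` is a hypothetical helper, and `raw_labels` is a made-up sample, not data taken from the corpus.

```python
def one_hot(labels, num_classes):
    """Turn zero-based integer labels into rows with a single 1.0 at the label's position."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


raw_labels = [3, 4, 1, 2]                        # made-up labels in the range 1..4
zero_based = [label - 1 for label in raw_labels]  # shift to 0..3
encoded = one_hot(zero_based, num_classes=4)

for row in encoded:
    print(row)
```

A raw label of 3 becomes class 2, so its row has the 1.0 in the third position.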

## Debugging (to be deleted before submission)

In [None]:
train_data.shape, test_data.shape

((120000, 1014), (7600, 1014))

In [None]:
for i in range(1014):
  print(train_data[0][i],end=" ")

23 1 12 12 70 19 20 40 70 2 5 1 18 19 70 3 12 1 23 70 2 1 3 11 70 9 14 20 15 70 20 8 5 70 2 12 1 3 11 70 64 18 5 21 20 5 18 19 65 18 5 21 20 5 18 19 70 60 70 19 8 15 18 20 60 19 5 12 12 5 18 19 38 70 23 1 12 12 70 19 20 18 5 5 20 44 19 70 4 23 9 14 4 12 9 14 7 47 2 1 14 4 70 15 6 70 21 12 20 18 1 60 3 25 14 9 3 19 38 70 1 18 5 70 19 5 5 9 14 7 70 7 18 5 5 14 70 1 7 1 9 14 40 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 

In [None]:
test_lower[0], test_sequences[0]

("fears for t n pension after talksunions representing workers at turner   newall say they are 'disappointed' after talks with stricken parent firm federal mogul.",
 [6,
  5,
  1,
  18,
  19,
  70,
  6,
  15,
  18,
  70,
  20,
  70,
  14,
  70,
  16,
  5,
  14,
  19,
  9,
  15,
  14,
  70,
  1,
  6,
  20,
  5,
  18,
  70,
  20,
  1,
  12,
  11,
  19,
  21,
  14,
  9,
  15,
  14,
  19,
  70,
  18,
  5,
  16,
  18,
  5,
  19,
  5,
  14,
  20,
  9,
  14,
  7,
  70,
  23,
  15,
  18,
  11,
  5,
  18,
  19,
  70,
  1,
  20,
  70,
  20,
  21,
  18,
  14,
  5,
  18,
  70,
  70,
  70,
  14,
  5,
  23,
  1,
  12,
  12,
  70,
  19,
  1,
  25,
  70,
  20,
  8,
  5,
  25,
  70,
  1,
  18,
  5,
  70,
  44,
  4,
  9,
  19,
  1,
  16,
  16,
  15,
  9,
  14,
  20,
  5,
  4,
  44,
  70,
  1,
  6,
  20,
  5,
  18,
  70,
  20,
  1,
  12,
  11,
  19,
  70,
  23,
  9,
  20,
  8,
  70,
  19,
  20,
  18,
  9,
  3,
  11,
  5,
  14,
  70,
  16,
  1,
  18,
  5,
  14,
  20,
  70,
  6,
  9,
  18,
  13,
  70,
  6,

In [None]:
train_classes

array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.]], dtype=float32)