<a href="https://colab.research.google.com/github/Atabak-Touri/NLP-CNN_RNN/blob/main/sentenceclassification_CNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%matplotlib inline
import collections
import math
import numpy as np
import pandas as pd
import os
import random
import tensorflow as tf
import zipfile
from matplotlib import pylab
from urllib.request import urlretrieve

seed = 54321

%env TF_FORCE_GPU_ALLOW_GROWTH=true

env: TF_FORCE_GPU_ALLOW_GROWTH=true


**Downloading the dataset**

In [2]:
url = 'http://cogcomp.org/Data/QA/QC/'
dir_name = 'data'

def download_data(dir_name, filename, expected_bytes):
    os.makedirs(dir_name, exist_ok=True)
    if not os.path.exists(os.path.join(dir_name,filename)):
        filepath, _ = urlretrieve(url + filename, os.path.join(dir_name,filename))
    else:
        filepath = os.path.join(dir_name, filename)

    statinfo = os.stat(filepath)
    if statinfo.st_size == expected_bytes:
        print('Found and verified %s' % filepath)
    else:
        print(statinfo.st_size)
        raise Exception(
          'Failed to verify ' + filepath + '. Can you get to it with a browser?')

    return filepath

train_filename = download_data(dir_name, 'train_5500.label', 335858)
test_filename = download_data(dir_name, 'TREC_10.label',23354)

Found and verified data/train_5500.label
Found and verified data/TREC_10.label


We go through each line of the files and split it into the question, the category and the sub-category, collecting each in its own list.
In the end we have separate lists for the training and the testing data.
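
As a minimal sketch of the split described above, the snippet below parses a single line in the `<cat>:<sub cat> <question>` format. The helper name `parse_line` and the sample string are assumptions made for illustration; the sample is modelled on the first training question.

```python
def parse_line(row):
    """Split one '<cat>:<sub_cat> <question>' line into its three parts."""
    # Split on the first colon only, so colons inside the question survive
    cat, rest = row.split(":", 1)
    # The first space-separated token is the sub-category, the rest is the question
    sub_cat, _, question = rest.partition(" ")
    return question.lower().strip(), cat, sub_cat

# Hypothetical line in the dataset's format
sample = "DESC:manner How did serfdom develop in and then leave Russia ?\n"
question, cat, sub_cat = parse_line(sample)
print(f"{question} / cat - {cat} / sub_cat - {sub_cat}")
```

Splitting on the first colon only is a defensive choice: a question that itself contains a colon would otherwise lose everything after it.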

In [5]:
def read_data(filename):
    '''
    Read data from a file with the given filename.
    Returns three parallel lists: lower-cased question strings,
    categories and sub-categories.
    '''

    # Holds question strings, categories and sub categories
    # category/sub_category definitions: https://cogcomp.seas.upenn.edu/Data/QA/QC/definition.html
    questions, categories, sub_categories = [], [], []

    with open(filename,'r',encoding='latin-1') as f:
        # Read each line
        for row in f:
            # Each string has the format <cat>:<sub cat> <question>
            # Split on the first ':' only, so a colon inside the question is preserved
            row_str = row.split(":", 1)
            cat, sub_cat_and_question = row_str[0], row_str[1]
            tokens = sub_cat_and_question.split(' ')
            # The first word in sub_cat_and_question is the sub category,
            # the rest is the question
            sub_cat, question = tokens[0], ' '.join(tokens[1:])

            questions.append(question.lower().strip())
            categories.append(cat)
            sub_categories.append(sub_cat)


    return questions, categories, sub_categories

train_questions, train_categories, train_sub_categories = read_data(train_filename)
test_questions, test_categories, test_sub_categories = read_data(test_filename)

n_samples = 10
print(f"train_questions has {len(train_questions)} questions / {len(train_categories)} labels")
print("Some samples")
for question, cat, sub_cat in zip(train_questions[:n_samples], train_categories[:n_samples], train_sub_categories[:n_samples]):
    print(f"\t{question} / cat - {cat} / sub_cat - {sub_cat}")

print(f"\ntest_questions has {len(test_questions)} questions / {len(test_categories)} labels")
print("Some samples")
for question, cat, sub_cat in zip(test_questions[:n_samples], test_categories[:n_samples], test_sub_categories[:n_samples]):
    print(f"\t{question} / cat - {cat} / sub_cat - {sub_cat}")

train_questions has 5452 questions / 5452 labels
Some samples
	how did serfdom develop in and then leave russia ? / cat - DESC / sub_cat - manner
	what films featured the character popeye doyle ? / cat - ENTY / sub_cat - cremat
	how can i find a list of celebrities ' real names ? / cat - DESC / sub_cat - manner
	what fowl grabs the spotlight after the chinese year of the monkey ? / cat - ENTY / sub_cat - animal
	what is the full form of .com ? / cat - ABBR / sub_cat - exp
	what contemptible scoundrel stole the cork from my lunch ? / cat - HUM / sub_cat - ind
	what team did baseball 's st. louis browns become ? / cat - HUM / sub_cat - gr
	what is the oldest profession ? / cat - HUM / sub_cat - title
	what are liver enzymes ? / cat - DESC / sub_cat - def
	name the scar-faced bounty hunter of the old west . / cat - HUM / sub_cat - ind

test_questions has 500 questions / 500 labels
Some samples
	how far is it from denver to aspen ? / cat - NUM / sub_cat - dist
	what county is modesto , cal

**Creating Pandas data frame**

A pandas DataFrame is an expressive, labelled table for storing multi-column data.

In [6]:
train_df = pd.DataFrame(
    {'question': train_questions, 'category': train_categories, 'sub_category': train_sub_categories}
)
# Construct the frame from a dictionary: keys are the column names and values are the column contents.
# This gives three columns: question, category and sub_category.
test_df = pd.DataFrame(
    {'question': test_questions, 'category': test_categories, 'sub_category': test_sub_categories}
)

train_df.head(n=10)

Unnamed: 0,question,category,sub_category
0,how did serfdom develop in and then leave russ...,DESC,manner
1,what films featured the character popeye doyle ?,ENTY,cremat
2,how can i find a list of celebrities ' real na...,DESC,manner
3,what fowl grabs the spotlight after the chines...,ENTY,animal
4,what is the full form of .com ?,ABBR,exp
5,what contemptible scoundrel stole the cork fro...,HUM,ind
6,what team did baseball 's st. louis browns bec...,HUM,gr
7,what is the oldest profession ?,HUM,title
8,what are liver enzymes ?,DESC,def
9,name the scar-faced bounty hunter of the old w...,HUM,ind


In [8]:
# The training data should not follow any particular order,
# so we shuffle it:
train_df = train_df.sample(frac=1.0, random_state=seed)
train_df.head()

Unnamed: 0,question,category,sub_category
4327,how old was stevie wonder when he signed with ...,NUM,period
1233,what baseball team was routinely called `` dem...,HUM,gr
127,what crooner joined the andrews sisters for pi...,HUM,ind
3040,what is it like to experience a near death epi...,DESC,desc
2803,what are the medical purposes of `` clitoridec...,DESC,reason


In [10]:
# Identify the unique values in train_df["category"] (in order of first appearance)
unique_cats = train_df["category"].unique()
# Map each category to a numerical ID; np.arange yields the integers 0..n-1
labels_map = dict(zip(unique_cats, np.arange(unique_cats.shape[0])))

print(f"Label->ID mapping: {labels_map}")

n_classes = len(labels_map)

# Convert all string labels to numerical IDs.
# Re-running this cell after the conversion prints an ID->ID mapping instead of a label->ID mapping.
train_df["category"] = train_df["category"].map(labels_map)
test_df["category"] = test_df["category"].map(labels_map)

# View
train_df.head(n=10)

Label->ID mapping: {np.int64(0): np.int64(0), np.int64(1): np.int64(1), np.int64(2): np.int64(2), np.int64(3): np.int64(3), np.int64(4): np.int64(4), np.int64(5): np.int64(5)}


Unnamed: 0,question,category,sub_category
4327,how old was stevie wonder when he signed with ...,0,period
1233,what baseball team was routinely called `` dem...,1,gr
127,what crooner joined the andrews sisters for pi...,1,ind
3040,what is it like to experience a near death epi...,2,desc
2803,what are the medical purposes of `` clitoridec...,2,reason
331,what is the normal resting heart rate of a hea...,0,other
4060,what are the largest deserts in the world ?,3,other
1228,what game is garry kasparov really good at ?,4,sport
3749,what is the largest island in the mediterranea...,3,other
2845,when did the carolingian period begin ?,0,date
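
The IDs depend on the order in which the categories first appear in the shuffled frame, which is why the same `labels_map` has to be reused for the test labels. A small sketch of that idea on toy data (the labels and the names `toy_labels_map` and `toy_ids` are made up for illustration):

```python
# Toy labels in shuffled order (hypothetical data, not the real dataset)
labels = ["NUM", "HUM", "HUM", "DESC", "NUM", "LOC"]

# dict.fromkeys keeps first-appearance order, mirroring pandas' unique()
unique_labels = list(dict.fromkeys(labels))
toy_labels_map = {label: idx for idx, label in enumerate(unique_labels)}

# Apply the mapping to every label
toy_ids = [toy_labels_map[label] for label in labels]

print(toy_labels_map)  # {'NUM': 0, 'HUM': 1, 'DESC': 2, 'LOC': 3}
print(toy_ids)         # [0, 1, 1, 2, 0, 3]
```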


**Validation split:**

In [11]:
from sklearn.model_selection import train_test_split

train_df, valid_df = train_test_split(train_df, test_size=0.1)  # hold out 10% as the validation set
print(f"Train size: {train_df.shape}")
print(f"Valid size: {valid_df.shape}")

# Print data
train_df.head()

Train size: (4906, 3)
Valid size: (546, 3)


Unnamed: 0,question,category,sub_category
5293,what is a horologist ?,2,def
1924,what major victorian novelist spent as much ti...,1,ind
1555,who were the yankee 's frequent enemies ?,1,gr
2463,what is the average life expectancy of a male ...,0,period
2738,what oldtime kids ' fare did tv guide writer j...,4,cremat


**Building tokenizer:**

In [12]:
from tensorflow.keras.preprocessing.text import Tokenizer

# Define a tokenizer and fit on train data
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_df["question"].tolist())

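# Derive the vocabulary size (+1 because ID 0 is reserved for padding)
n_vocab = len(tokenizer.index_word) + 1
print(f"Vocabulary size: {n_vocab}")

# Distribution of question lengths (in whitespace-separated tokens)
train_df["question"].str.split(" ").str.len().describe(percentiles=[0.01, 0.5, 0.99])


Vocabulary size: 7895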


Unnamed: 0,question
count,4906.0
mean,10.087852
std,3.80083
min,2.0
1%,4.0
50%,10.0
99%,22.0
max,37.0
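
To make the tokenizer step more concrete, here is a simplified sketch of the idea behind it: words get integer IDs starting at 1, most frequent first, and ID 0 stays free for padding. This is not the Keras implementation (for example, it does not filter punctuation), and the helper names `build_word_index` and `to_sequence` are assumptions for illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Assign IDs 1..N to words, most frequent first; ID 0 stays free for padding."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() orders by descending count (ties keep first-seen order)
    return {word: idx for idx, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index):
    """Map known words to IDs and silently drop unknown ones."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

toy_texts = ["what is a horologist ?", "what is the oldest profession ?"]
word_index = build_word_index(toy_texts)
toy_vocab_size = len(word_index) + 1  # +1 accounts for the reserved padding ID 0

print(word_index)
print(to_sequence("what is a bird ?", word_index))  # "bird" is unseen and dropped
```

Fitting the index on the training questions only, as the notebook does, means that words appearing only in the validation or test sets are treated as unknown.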


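**Padding:**

Questions differ in length, so shorter sequences are padded with zeros at the end until every row has the same length, e.g. for a maximum length of 10:
$$\text{Input Tensor} = \begin{bmatrix} 101 & 5800 & 320 & 102 & 0 & 0 & 0 & 0 & 0 & 0 \\ 200 & 345 & 678 & 901 & 112 & 334 & 200 & 567 & 890 & 102 \end{bmatrix}$$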
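
The example tensor can be reproduced with a few lines of NumPy. The function `pad_post` below is a hypothetical helper that sketches post-padding and post-truncation; it is not the Keras `pad_sequences` implementation.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Pad or truncate each sequence at the end so all have length maxlen."""
    padded = np.full((len(sequences), maxlen), value, dtype="int32")
    for row, seq in enumerate(sequences):
        trimmed = seq[:maxlen]                # truncate at the end if too long
        padded[row, :len(trimmed)] = trimmed  # the remaining tail keeps the pad value
    return padded

toy_sequences = [
    [101, 5800, 320, 102],
    [200, 345, 678, 901, 112, 334, 200, 567, 890, 102],
]
print(pad_post(toy_sequences, maxlen=10))
```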



In [13]:

# Convert each list of tokens to a list of IDs, using tokenizer's mapping
train_sequences = tokenizer.texts_to_sequences(train_df["question"].tolist())
train_labels = train_df["category"].values
valid_sequences = tokenizer.texts_to_sequences(valid_df["question"].tolist())
valid_labels = valid_df["category"].values
test_sequences = tokenizer.texts_to_sequences(test_df["question"].tolist())
test_labels = test_df["category"].values

max_seq_length = 22  # 99th percentile of the training question lengths

# Pad shorter sentences and truncate longer ones (maximum length: max_seq_length)
preprocessed_train_sequences = tf.keras.preprocessing.sequence.pad_sequences(
    train_sequences, maxlen=max_seq_length, padding='post', truncating='post'
)  # maxlen: final sequence length; 'post': pad or truncate at the end of the sequence
preprocessed_valid_sequences = tf.keras.preprocessing.sequence.pad_sequences(
    valid_sequences, maxlen=max_seq_length, padding='post', truncating='post'
)
preprocessed_test_sequences = tf.keras.preprocessing.sequence.pad_sequences(
    test_sequences, maxlen=max_seq_length, padding='post', truncating='post'
)

For sentence classification, the size of the convolution window matters, and so does preserving spatial information, i.e. word order. The convolution operation in a CNN slides over neighbouring words, so each output position still corresponds to a position in the sentence and the spatial information is maintained.
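
A rough NumPy sketch of what a single convolution filter does to one embedded sentence may help here. It assumes stride 1, zero padding that keeps the sentence length, an odd window size and ReLU; the function name `conv1d_same` and the random inputs are made up for illustration, and this is not how Keras implements `Conv1D` internally.

```python
import numpy as np

def conv1d_same(x, kernel, bias=0.0):
    """One convolution filter with stride 1, length-preserving padding and ReLU.

    x:      [sent_length, emb_size] embeddings of one sentence
    kernel: [kernel_size, emb_size] weights of one filter (odd kernel_size)
    """
    kernel_size = kernel.shape[0]
    pad = kernel_size // 2
    # Zero-pad along the time axis so the output keeps sent_length positions
    x_padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.empty(x.shape[0])
    for t in range(x.shape[0]):
        window = x_padded[t:t + kernel_size]  # kernel_size consecutive words
        out[t] = np.sum(window * kernel) + bias
    return np.maximum(out, 0.0)  # ReLU

rng = np.random.default_rng(0)
sentence = rng.normal(size=(22, 64))  # hypothetical embedded sentence
kernel = rng.normal(size=(3, 64))     # one filter looking at 3-word windows
feature_map = conv1d_same(sentence, kernel)
print(feature_map.shape)  # (22,)
```

Each output value summarises a window of adjacent words, so a larger window lets one filter respond to longer phrases.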

**Implementation of CNN in sentence classification:**

In [14]:
import tensorflow.keras.backend as K
import tensorflow.keras.layers as layers
import tensorflow.keras.regularizers as regularizers
from tensorflow.keras.models import Model

K.clear_session()

# Input layer takes word IDs as inputs
word_id_inputs = layers.Input(shape=(max_seq_length,), dtype='int32')

# Get the embeddings of the inputs / out [batch_size, sent_length, output_dim]
embedding_out = layers.Embedding(input_dim=n_vocab, output_dim=64)(word_id_inputs)


# For all conv layers: in [batch_size, sent_length, emb_size] / out [batch_size, sent_length, 100]
# We use 1D convolution here, as opposed to the 2D convolution used in image classification
conv1_1 = layers.Conv1D(
    100, kernel_size=3, strides=1, padding='same', activation='relu'
)(embedding_out)
conv1_2 = layers.Conv1D(
    100, kernel_size=4, strides=1, padding='same', activation='relu'
)(embedding_out)
conv1_3 = layers.Conv1D(
    100, kernel_size=5, strides=1, padding='same', activation='relu'
)(embedding_out)

# in: the three conv outputs above / out [batch_size, sent_length, 300]
conv_out = layers.Concatenate(axis=-1)([conv1_1, conv1_2, conv1_3])

# Pooling over time: max pooling over the whole sequence length,
# so each feature map results in a single output
# in [batch_size, sent_length, 300] / out [batch_size, 1, 300]
pool_over_time_out = layers.MaxPool1D(pool_size=max_seq_length, padding='valid')(conv_out)

# Flatten the unit length dimension
flatten_out = layers.Flatten()(pool_over_time_out)

# Compute the final output
out = layers.Dense(
    n_classes, activation='softmax',
    kernel_regularizer=regularizers.l2(0.001)
)(flatten_out)

# Define the model
cnn_model = Model(inputs=word_id_inputs, outputs=out)

# Compile the model with loss/optimizer/metrics
cnn_model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

cnn_model.summary()
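
The pooling-over-time step in the model can be illustrated on a random tensor with NumPy alone. The array `toy_conv_out` is hypothetical and only mimics the shape of the concatenated convolution output; the second half contrasts it with regional max pooling over windows of two positions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical concatenated conv output: [batch_size, sent_length, n_feature_maps]
toy_conv_out = rng.normal(size=(2, 22, 300))

# Pooling over time: one maximum per feature map across all positions
pooled = toy_conv_out.max(axis=1)
print(pooled.shape)  # (2, 300)

# For contrast, regional max pooling with a window of 2 keeps 11 positions
regional = toy_conv_out.reshape(2, 11, 2, 300).max(axis=2)
print(regional.shape)  # (2, 11, 300)
```

In Keras, `layers.GlobalMaxPooling1D()` would likely be an equivalent alternative to `MaxPool1D` with `pool_size=max_seq_length` followed by `Flatten`, since it returns `[batch_size, 300]` directly.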

**Training the model:**

In [16]:
# Callback: multiply the learning rate by 0.1 when val_loss has not improved for 3 epochs
lr_reduce_callback = tf.keras.callbacks.ReduceLROnPlateau(
    monitor='val_loss', factor=0.1, patience=3, verbose=1,
    mode='auto', min_delta=0.0001, min_lr=0.000001
)

# Train the model
cnn_model.fit(
    preprocessed_train_sequences, train_labels,
    validation_data=(preprocessed_valid_sequences, valid_labels),
    batch_size=128,
    epochs=25,
    callbacks=[lr_reduce_callback]
)

Epoch 1/25
39/39 ━━━━━━━━━━━━━━━━━━━━ 4s 53ms/step - accuracy: 0.3249 - loss: 1.6873 - val_accuracy: 0.6154 - val_loss: 1.2384 - learning_rate: 0.0010
Epoch 2/25
39/39 ━━━━━━━━━━━━━━━━━━━━ 2s 48ms/step - accuracy: 0.6355 - loss: 1.0944 - val_accuracy: 0.7344 - val_loss: 0.7828 - learning_rate: 0.0010
Epoch 3/25
39/39 ━━━━━━━━━━━━━━━━━━━━ 2s 47ms/step - accuracy: 0.8138 - loss: 0.6371 - val_accuracy: 0.8059 - val_loss: 0.5843 - learning_rate: 0.0010
Epoch 4/25
39/39 ━━━━━━━━━━━━━━━━━━━━ 3s 75ms/step - accuracy: 0.9131 - loss: 0.3693 - val_accuracy: 0.8407 - val_loss: 0.4913 - learning_rate: 0.0010
Epoch 5/25
39/39 ━━━━━━━━━━━━━━━━━━━━ 2s 51ms/step - accuracy: 0.9636 - loss: 0.2038 - val_accuracy: 0.8480 - val_loss: 0.4561 - learning_rate: 0.0010
Epoch 6/25
39/39 ━━━━━━━━━━━━━━━━━━━━ 2s

<keras.src.callbacks.history.History at 0x7e1219838590>
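
The learning-rate callback can be pictured with a simplified simulation in plain Python. It uses the same `factor`, `patience`, `min_delta` and `min_lr` values as the notebook, but it is only a sketch of the idea (no cooldown handling, for instance), not the Keras implementation. The function name and the validation losses are made up.

```python
def simulate_lr_schedule(val_losses, lr=1e-3, factor=0.1, patience=3,
                         min_delta=1e-4, min_lr=1e-6):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best - min_delta:  # counts as an improvement
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:     # plateau: shrink the learning rate
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history

# Hypothetical validation losses that stop improving after the third epoch
toy_val_losses = [1.2, 0.8, 0.5, 0.5, 0.51, 0.52, 0.5]
print(simulate_lr_schedule(toy_val_losses))
```

With these toy losses the rate stays at 0.001 for five epochs and drops by a factor of ten once three epochs in a row have failed to improve on the best loss.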

**Testing the model:**

In [17]:
cnn_model.evaluate(preprocessed_test_sequences, test_labels, return_dict=True)

16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.8865 - loss: 0.3541


{'accuracy': 0.8799999952316284, 'loss': 0.3868033289909363}
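
To show what the two reported numbers measure, the sketch below computes accuracy and sparse categorical cross-entropy by hand for a handful of made-up softmax outputs. The arrays `probs` and `true_ids` are hypothetical, and the result is unrelated to the figures above.

```python
import numpy as np

# Hypothetical softmax outputs for 4 questions over 3 classes
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
true_ids = np.array([0, 1, 2, 1])  # integer labels, in the same form as test_labels

# Accuracy: fraction of rows whose most probable class matches the label
predicted_ids = probs.argmax(axis=1)
accuracy = np.mean(predicted_ids == true_ids)

# Sparse categorical cross-entropy: mean negative log-probability of the true class
true_class_probs = probs[np.arange(len(true_ids)), true_ids]
loss = -np.mean(np.log(true_class_probs))

print(f"accuracy={accuracy:.2f}, loss={loss:.4f}")
```

The "sparse" variant of the loss takes integer class IDs directly, which is why the notebook never one-hot encodes the labels.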

In this chapter I learned how important "pooling over time" is in 1D-CNN classification, as it reduces the dimensionality of the feature maps. Whereas regional max pooling takes the maximum within a local window, pooling over time takes the maximum across the entire sequence length, so each feature map contributes its strongest activation wherever it occurs in the sentence.
I also learned how to implement a CNN in TensorFlow and to evaluate its performance on a sentence classification task rather than on image classification.
It is also worth mentioning how this technique can be used for real-world problems. For instance, if I have a book about birds and only want to know about a specific species, it is much easier to search for the term and get a summary of the sentences that correspond to that name than to read the whole book.