### Text Classification
In this series of notebooks we are going to take a closer look at Natural Language Processing in Keras.

### About this one!
In this notebook we are going to create a model that performs sentiment classification. Along the way we will learn how to:

1. Load data from disk for a text classification task.
2. Use the `TextVectorization` layer for word splitting and indexing.

### Imports

In [1]:
import tensorflow as tf
import numpy as np

import os, random, string, re

from tensorflow.keras.layers import TextVectorization

np.__version__

'1.19.5'

### Data
We are going to download the IMDB dataset and load it in Google Colab.

To download the data we are going to use the following command:

```shell
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz

# extracting
!tar -xf aclImdb_v1.tar.gz
```

In [2]:
!curl -O https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
!tar -xf aclImdb_v1.tar.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 80.2M  100 80.2M    0     0  13.2M      0  0:00:06  0:00:06 --:--:-- 17.5M


The `aclImdb/train/pos` and `aclImdb/train/neg` folders contain text files, each of which represents one review (either `positive` or `negative`).

Let's check a single example of a positive review using the shell command `cat`.

In [3]:
!cat aclImdb/train/pos/6248_7.txt

Being an Austrian myself this has been a straight knock in my face. Fortunately I don't live nowhere near the place where this movie takes place but unfortunately it portrays everything that the rest of Austria hates about Viennese people (or people close to that region). And it is very easy to read that this is exactly the directors intention: to let your head sink into your hands and say "Oh my god, how can THAT be possible!". No, not with me, the (in my opinion) totally exaggerated uncensored swinger club scene is not necessary, I watch porn, sure, but in this context I was rather disgusted than put in the right context.<br /><br />This movie tells a story about how misled people who suffer from lack of education or bad company try to survive and live in a world of redundancy and boring horizons. A girl who is treated like a whore by her super-jealous boyfriend (and still keeps coming back), a female teacher who discovers her masochism by putting the life of her super-cruel "lover" 

We are only interested in the `pos` and `neg` folders, so let's delete the `unsup` folder, which holds unlabeled reviews.

In [4]:
!rm -r aclImdb/train/unsup
!rm -r aclImdb/test/unsup

rm: cannot remove 'aclImdb/test/unsup': No such file or directory


Now our folder and file structure looks as follows:

```
📁 aclImdb
  📁 test
    📁 neg
      🗄...txt
    📁 pos
     🗄...txt
  📁 train
    📁 neg
     🗄...txt
    📁 pos
     🗄...txt
```

We are going to use [`tf.keras.preprocessing.text_dataset_from_directory`](https://keras.io/api/preprocessing/text#textdatasetfromdirectory-function) to generate a labeled [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) object from the text files on disk.

We will then generate three datasets: train, validation and test. The validation set is a `20%` fraction of the training set.
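
As a rough illustration of the layout this loader expects, the sketch below builds a tiny throwaway tree and derives labels by convention: one sub-directory per class, class names sorted alphanumerically (so `neg` maps to `0` and `pos` to `1`), and the position in that order used as the integer label. The helper name `infer_labels` and the two sample reviews are made up for illustration; this is plain Python and not the Keras implementation.

```python
import tempfile
from pathlib import Path


def infer_labels(root):
    """Return (class_names, samples) for a root with one sub-directory per class.

    Hypothetical helper for illustration only: each file is labelled with the
    index of its parent directory in the sorted list of class directories.
    """
    root = Path(root)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    samples = []
    for label, name in enumerate(class_names):
        for path in sorted((root / name).glob("*.txt")):
            samples.append((path.read_text(encoding="utf-8"), label))
    return class_names, samples


# Build a tiny stand-in for aclImdb/train with two made-up reviews.
with tempfile.TemporaryDirectory() as tmp:
    for name, review in [("pos", "loved it"), ("neg", "hated it")]:
        (Path(tmp) / name).mkdir()
        (Path(tmp) / name / "0_1.txt").write_text(review, encoding="utf-8")
    class_names, samples = infer_labels(tmp)

print(class_names)  # ['neg', 'pos']
print(samples)      # [('hated it', 0), ('loved it', 1)]
```

This is also why any stray sub-directory under `aclImdb/train` would be picked up as an extra class, so the tree should contain only the class folders.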

### Loading the data using `tf.keras.preprocessing.text_dataset_from_directory()`

In [5]:
!rm -r aclImdb/train/.ipynb_checkpoints
!rm -r aclImdb/test/.ipynb_checkpoints

rm: cannot remove 'aclImdb/train/.ipynb_checkpoints': No such file or directory
rm: cannot remove 'aclImdb/test/.ipynb_checkpoints': No such file or directory


In [6]:
BATCH_SIZE = 32
SEED = 42

raw_train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size = BATCH_SIZE,
    validation_split = .2,
    subset = "training",
    seed = SEED
)
raw_valid_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/train",
    batch_size = BATCH_SIZE,
    validation_split = .2,
    subset = "validation",
    seed = SEED
)
raw_test_ds = tf.keras.preprocessing.text_dataset_from_directory(
    "aclImdb/test",
    batch_size = BATCH_SIZE
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
Found 25000 files belonging to 2 classes.


Getting the classname

In [7]:
raw_train_ds.class_names

['neg', 'pos']

Counting batches (the datasets are batched, so the cardinality is the number of batches, each holding up to `BATCH_SIZE` examples)...

In [8]:
print("train batches: %d"
 % tf.data.experimental.cardinality(raw_train_ds))
print("test batches: %d"
 % tf.data.experimental.cardinality(raw_test_ds))
print("valid batches: %d"
 % tf.data.experimental.cardinality(raw_valid_ds))

train batches: 625
test batches: 782
valid batches: 157
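
The numbers printed above count batches rather than individual reviews. A minimal sketch of the arithmetic, assuming the file counts reported by the loader and that the last, partially filled batch is kept (which is consistent with the printed values); `num_batches` is a hypothetical helper:

```python
import math

BATCH_SIZE = 32
VALIDATION_SPLIT = 0.2
N_TRAIN_FILES = 25000  # files found under aclImdb/train
N_TEST_FILES = 25000   # files found under aclImdb/test

# 20% of the training files are held out for validation.
n_valid = round(N_TRAIN_FILES * VALIDATION_SPLIT)
n_train = N_TRAIN_FILES - n_valid


def num_batches(num_examples, batch_size=BATCH_SIZE):
    """Number of batches when the final partial batch is not dropped."""
    return math.ceil(num_examples / batch_size)


print(n_train, n_valid)           # 20000 5000
print(num_batches(n_train))       # 625
print(num_batches(N_TEST_FILES))  # 782
print(num_batches(n_valid))       # 157
```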


Let's check some examples from a single batch.

In [9]:
for text_batch, label_batch in raw_train_ds.take(1):
  for text, label in zip(text_batch, label_batch[:5]):
    print(f"text: {text}\nlabel: {label}\n")

text: b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
label: 0

text: b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into

### Data cleaning
We are going to clean our data. By cleaning I mean removing unnecessary text, for example ``html`` tags. We are going to create a `standardization` function which will do the following:

1. lower-case the text
2. replace the html `<br />` tags with a space
3. remove punctuation


In [10]:
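def standardization(input_data):
  # Lower-case the text, then replace HTML line breaks with a space
  input_data = tf.strings.lower(input_data)
  input_data = tf.strings.regex_replace(input_data, "<br />", " ")
  # Strip punctuation: the square brackets make this a character class,
  # so any single punctuation character is matched
  return tf.strings.regex_replace(
      input_data, f"[{re.escape(string.punctuation)}]", ""
  )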
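
To see what this kind of standardization is meant to do on an ordinary Python string, here is a minimal sketch using the standard `re` module as a stand-in for `tf.strings.regex_replace`. The name `standardize_py` and the sample sentence are made up for illustration. Note that the escaped punctuation has to sit inside square brackets to act as a character class; without them the pattern only matches the entire punctuation string as one literal sequence.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
PUNCTUATION_CLASS = f"[{re.escape(string.punctuation)}]"


def standardize_py(text):
    """Plain-Python sketch: lower-case, drop <br /> tags, strip punctuation."""
    text = text.lower()
    text = text.replace("<br />", " ")
    return re.sub(PUNCTUATION_CLASS, "", text)


sample = "Great movie!<br /><br />Loved it."
print(standardize_py(sample))  # great movie  loved it

# Without the brackets, the pattern is one long literal and matches nothing here.
unbracketed = re.sub(re.escape(string.punctuation), "", "wow!!!")
print(unbracketed)  # wow!!!
```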

### Model Hyper Params

In [11]:
MAX_FEATURES = 20000
EMBEDDING_DIM = 128
SEQUENCE_LENGTH = 500

### Text Vectorization
Now that we have a function that standardizes text, we can create a vectorization layer. We are using this layer to:
1. split strings into words
2. map the words to integer representations


In [12]:
vectorize_layer = TextVectorization(
    standardize=standardization,
    max_tokens=MAX_FEATURES,
    output_mode="int",
    output_sequence_length=SEQUENCE_LENGTH
)

In [13]:
vectorize_layer.get_vocabulary()

['', '[UNK]']

As you can see, the vocabulary only contains ``''`` (padding) and ``[UNK]`` (out-of-vocabulary) so far. We need our `vectorize_layer` to adapt to the training text. To do that, we first extract the text features from `raw_train_ds` and then call the `adapt` method to build the vocabulary.

In [14]:
text_ds = raw_train_ds.map(lambda x, y: x)

In [15]:
vectorize_layer.adapt(text_ds)
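
Conceptually, adapting builds a frequency-ordered vocabulary and the layer then maps each word to its position in it. The following is a toy, pure-Python sketch of that idea, not the layer's actual implementation: `build_vocabulary` and `to_indices` are hypothetical helpers, the corpus is made up, and splitting is plain whitespace splitting. Index `0` is reserved for padding and index `1` for unknown words, mirroring the two entries shown above.

```python
from collections import Counter


def build_vocabulary(texts, max_tokens):
    """Keep the most frequent words; two slots are reserved for '' and [UNK]."""
    counts = Counter(token for text in texts for token in text.split())
    most_frequent = [token for token, _ in counts.most_common()]
    return ["", "[UNK]"] + most_frequent[: max_tokens - 2]


def to_indices(text, vocabulary, sequence_length):
    """Map words to vocabulary indices, then truncate or zero-pad to a fixed length."""
    lookup = {token: index for index, token in enumerate(vocabulary)}
    ids = [lookup.get(token, 1) for token in text.split()][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


corpus = ["the movie was good", "the movie was bad", "the movie", "the"]
vocab = build_vocabulary(corpus, max_tokens=5)
print(vocab)  # ['', '[UNK]', 'the', 'movie', 'was']
print(to_indices("the movie was awful", vocab, sequence_length=6))  # [2, 3, 4, 1, 0, 0]
```

In the notebook the same roles are played by `max_tokens=MAX_FEATURES` and `output_sequence_length=SEQUENCE_LENGTH`.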

### Text vectorization layer.

There are two options we can use to vectorize our data.

1. **Make it part of the model**

This method allows us to pass raw text strings to the model, which takes care of the rest for us. The following example shows how we can use this method:

```py
inputs = tf.keras.Input(shape=(1,), dtype=tf.string, name='text')
x = vectorize_layer(inputs)
x = tf.keras.layers.Embedding(MAX_FEATURES + 1, EMBEDDING_DIM)(x)

....
```

2. **Apply it to the text dataset**.

This method produces a dataset of word indices, which we then feed to the network.

> An important difference between the two is that option 2 enables you to do **asynchronous CPU processing and buffering** of your data when training on GPU. So if you're training the model on GPU, you probably want to go with this option to get the best performance.



In [16]:
def vectorize(text, label):
  text= tf.expand_dims(text, -1)
  return vectorize_layer(text), label

Vectorizing the data

In [17]:
train_ds = raw_train_ds.map(vectorize)
valid_ds = raw_valid_ds.map(vectorize)
test_ds = raw_test_ds.map(vectorize)

### Async prefetching / buffering of the data 

In [18]:
BUFFER_SIZE =  100
train_ds = train_ds.cache().prefetch(buffer_size=BUFFER_SIZE)
test_ds = test_ds.cache().prefetch(buffer_size=BUFFER_SIZE)
valid_ds = valid_ds.cache().prefetch(buffer_size=BUFFER_SIZE)
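
The idea behind prefetching is that a producer prepares the next batches in the background while the consumer (the training step) works on the current one. The sketch below illustrates that idea with a bounded queue and a background thread; it is a conceptual toy and not how `tf.data` is implemented, and the `prefetch` helper here is hypothetical (it also ignores errors raised by the producer).

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def prefetch(iterable, buffer_size):
    """Yield items from `iterable` while a background thread fills a bounded buffer."""
    buffer = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()  # blocks until the producer has something ready
        if item is _DONE:
            return
        yield item


batches = (f"batch-{i}" for i in range(5))
print(list(prefetch(batches, buffer_size=2)))
# ['batch-0', 'batch-1', 'batch-2', 'batch-3', 'batch-4']
```

The buffer size bounds how far ahead the producer may run, which is the role `BUFFER_SIZE` plays in the cell above.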

### Building the model

Our model uses an `Embedding` layer followed by `Conv1D`, `GlobalAveragePooling1D`, `Dropout` and `Dense` layers. We are going to build it with the functional API.

In [19]:
inputs = tf.keras.layers.Input(shape=(None, ), dtype="int64")
x = tf.keras.layers.Embedding(MAX_FEATURES,EMBEDDING_DIM)(inputs)
x = tf.keras.layers.Dropout(rate=.5)(x)

x = tf.keras.layers.Conv1D(128, 5, padding="valid", 
                           activation="relu", strides=3)(x)
x = tf.keras.layers.Conv1D(128, 5, padding="valid", 
                           activation="relu", strides=3)(x)
x = tf.keras.layers.GlobalAveragePooling1D()(x)


x = tf.keras.layers.Dense(128, activation="relu")(x)
x = tf.keras.layers.Dropout(rate=.5)(x)

outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)

model.summary()


Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, None)]            0         
_________________________________________________________________
embedding (Embedding)        (None, None, 128)         2560000   
_________________________________________________________________
dropout (Dropout)            (None, None, 128)         0         
_________________________________________________________________
conv1d (Conv1D)              (None, None, 128)         82048     
_________________________________________________________________
conv1d_1 (Conv1D)            (None, None, 128)         82048     
_________________________________________________________________
global_average_pooling1d (Gl (None, 128)               0         
_________________________________________________________________
dense (Dense)                (None, 128)               16512 
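
The parameter counts in the summary above can be checked by hand. A small sketch of the arithmetic, assuming the usual layouts (one weight per vocabulary entry and embedding dimension; one kernel of `kernel_size × in_channels` weights plus a bias per filter; a weight matrix plus bias for the dense layer). The helper names are made up for illustration:

```python
MAX_FEATURES = 20000
EMBEDDING_DIM = 128


def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary index.
    return vocab_size * dim


def conv1d_params(kernel_size, in_channels, filters):
    # Each filter spans kernel_size steps over all input channels, plus one bias.
    return kernel_size * in_channels * filters + filters


def dense_params(in_units, out_units):
    # Weight matrix plus one bias per output unit.
    return in_units * out_units + out_units


print(embedding_params(MAX_FEATURES, EMBEDDING_DIM))  # 2560000
print(conv1d_params(5, 128, 128))                     # 82048
print(dense_params(128, 128))                         # 16512
```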

### Compiling the model

In [20]:
model.compile(loss="binary_crossentropy",
              optimizer="adam", 
              metrics=["accuracy"])
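
For intuition about the loss chosen here, binary cross-entropy averages `-(y·log(p) + (1 - y)·log(1 - p))` over the examples, where `y` is the true label and `p` the predicted probability. A plain-Python sketch, with a small clipping constant `eps` added as an assumption to avoid `log(0)`; this is an illustration and not the Keras code path:

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels and predicted probabilities."""
    losses = []
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)


# A coin-flip prediction costs log(2); confident correct predictions cost much less.
uncertain = binary_crossentropy([1], [0.5])
confident = binary_crossentropy([1, 0], [0.9, 0.1])
print(uncertain > confident)  # True
```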

### Training the model

In [21]:
EPOCHS = 3

model.fit(train_ds, validation_data=valid_ds, epochs=EPOCHS)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f50604ebed0>

### Evaluating the model

In [22]:
model.evaluate(test_ds,verbose=1)



[0.4240582287311554, 0.8715999722480774]

### Model inference

To make a prediction we first vectorize the text and then call the model, rounding its output to obtain a predicted label.

In [23]:
data = []
for text_batch, label_batch in raw_test_ds.take(1):
  for text, label in zip(text_batch, label_batch[:5]):
    data.append({"text":text, "label": label})

In [None]:
def vectorize_text(text):
  text= tf.expand_dims(text, -1)
  return vectorize_layer(text)

vectorize_text("this movie sucks!")

In [32]:
def make_prediction(text):
  vectors = vectorize_text(text)
  pred = tf.round(tf.squeeze(model(vectors)))
  return pred.numpy().astype("int32")
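
Rounding the sigmoid output amounts to thresholding the predicted probability at `0.5` and using the result as an index into the class names. A small stand-alone sketch of that step, where `predicted_class` is a hypothetical helper and the probabilities are made-up inputs. Python's `round`, like `tf.round`, rounds halves to the nearest even integer, so a probability of exactly `0.5` maps to class `0`.

```python
CLASS_NAMES = ["neg", "pos"]  # same order as raw_train_ds.class_names


def predicted_class(probability, class_names=CLASS_NAMES):
    """Turn a sigmoid probability into (label index, class name)."""
    index = int(round(probability))
    return index, class_names[index]


print(predicted_class(0.13))  # (0, 'neg')
print(predicted_class(0.87))  # (1, 'pos')
print(predicted_class(0.5))   # (0, 'neg')
```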

In [36]:
print("predicted label\treal label\tpredicted class")
for ele in data:
  predicted = make_prediction(ele["text"])
  print(f"{predicted}\t\t{ele['label']}\t\t{raw_train_ds.class_names[predicted]}")

predicted label	real label	predicted class
0		1		neg
1		1		pos
1		1		pos
1		1		pos
1		1		pos


### Making an End-to-End Model.

If you want to obtain a model capable of processing raw strings, you can simply create a new model (using the weights we just trained) as follows:


```py
# A string input
inputs = tf.keras.Input(shape=(1,), dtype="string")
# Turn strings into vocab indices
indices = vectorize_layer(inputs)
# Turn vocab indices into predictions
outputs = model(indices)

# Our end to end model
end_to_end_model = tf.keras.Model(inputs, outputs)
end_to_end_model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
)

# Test it with `raw_test_ds`, which yields raw strings
end_to_end_model.evaluate(raw_test_ds)
```

Instead of reusing the trained weights, we are going to build and train this model from scratch in this notebook.

In [37]:
inputs = tf.keras.layers.Input(shape=(1, ), dtype="string")
indices = vectorize_layer(inputs)
x = tf.keras.layers.Embedding(MAX_FEATURES,EMBEDDING_DIM)(indices)
x = tf.keras.layers.Dropout(rate=.5)(x)

x = tf.keras.layers.Conv1D(128, 5, padding="valid", 
                           activation="relu", strides=3)(x)
x = tf.keras.layers.Conv1D(128, 5, padding="valid", 
                           activation="relu", strides=3)(x)
x = tf.keras.layers.GlobalAveragePooling1D()(x)


x = tf.keras.layers.Dense(128, activation="relu")(x)
x = tf.keras.layers.Dropout(rate=.5)(x)

outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)

model.summary()

Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 1)]               0         
_________________________________________________________________
text_vectorization (TextVect (None, 500)               0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 500, 128)          2560000   
_________________________________________________________________
dropout_2 (Dropout)          (None, 500, 128)          0         
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 166, 128)          82048     
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 54, 128)           82048     
_________________________________________________________________
global_average_pooling1d_1 ( (None, 128)               0   
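
The sequence lengths in this summary (500, then 166, then 54) follow from the convolution settings. With `padding="valid"`, the output length is `floor((input_length - kernel_size) / strides) + 1`. A short sketch of that calculation, where `conv1d_valid_length` is a hypothetical helper:

```python
SEQUENCE_LENGTH = 500


def conv1d_valid_length(input_length, kernel_size, strides):
    """Output length of a 1D convolution with 'valid' padding."""
    return (input_length - kernel_size) // strides + 1


after_first = conv1d_valid_length(SEQUENCE_LENGTH, kernel_size=5, strides=3)
after_second = conv1d_valid_length(after_first, kernel_size=5, strides=3)
print(after_first, after_second)  # 166 54
```

The global average pooling layer then collapses the remaining 54 steps into a single 128-dimensional vector per review.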

### Training the end-to-end model

In [39]:
model.compile(
    loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"]
)

EPOCHS = 3

model.fit(raw_train_ds, validation_data=raw_valid_ds, epochs=EPOCHS)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f4fcb979650>

### Evaluating the model

In [40]:
model.evaluate(raw_test_ds,verbose=1)



[0.5001795887947083, 0.8663600087165833]

### Model Inference

In [49]:
def make_prediction(text):
  text = tf.constant(text, dtype="string")
  text= tf.expand_dims(text, -1)
  pred = tf.round(tf.squeeze(model(text)))
  return pred.numpy().astype("int32")

In [50]:
print("predicted label\treal label\tpredicted class")
for ele in data:
  predicted = make_prediction(ele["text"])
  print(f"{predicted}\t\t{ele['label']}\t\t{raw_train_ds.class_names[predicted]}")

predicted label	real label	predicted class
0		1		neg
1		1		pos
1		1		pos
1		1		pos
1		1		pos


### Conclusion 

In this notebook we have learnt how to load a dataset using `text_dataset_from_directory` and train a model to perform sentiment analysis using `Conv1D` layers. In the next notebook we are going to perform the same task using an RNN.