<a href="https://colab.research.google.com/github/PaulToronto/TensorFlow_Tutorials/blob/main/3_Keras_Basic_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Keras - Basic text classification

- https://tinyurl.com/42fab6rw

## Imports

In [1]:
import tensorflow as tf

import os
from textwrap import wrap
import shutil
import re
import string


from tensorflow.keras import layers
from tensorflow.keras import losses

## The IMDB dataset

- this notebook trains a sentiment analysis model to classify reviews as either *positive* or *negative* based on the text of the review
- this is a binary classification problem
- dataset source: https://ai.stanford.edu/%7Eamaas/data/sentiment/
    - 50,000 movie reviews from the Internet Movie Database
    - Split into 25,000 for training and 25,000 for testing
    - The training and testing sets are **balanced**, meaning they contain an equal number of positive and negative reviews


In [2]:
url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

dataset = tf.keras.utils.get_file(fname='aclImdb_v1',
                                  origin=url,
                                  untar=True,
                                  cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
[1m84125825/84125825[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 0us/step


In [3]:
type(dataset), dataset

(str, './aclImdb_v1')

In [4]:
type(dataset_dir), dataset_dir

(str, './aclImdb')

In [5]:
os.listdir(dataset_dir)

['imdbEr.txt', 'README', 'train', 'test', 'imdb.vocab']

In [6]:
train_dir = os.path.join(dataset_dir,'train')
train_dir

'./aclImdb/train'

In [7]:
os.listdir(train_dir)

['neg',
 'unsupBow.feat',
 'labeledBow.feat',
 'pos',
 'unsup',
 'urls_neg.txt',
 'urls_unsup.txt',
 'urls_pos.txt']

The `./aclImdb/train/neg` and `./aclImdb/train/pos` directories contain many text files, each of which is a single movie review.
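
To check the claim that the classes are balanced, one could count the review files in each class folder. The helper below is a minimal sketch: `count_reviews` is a hypothetical name, and it assumes every review is a `.txt` file sitting directly inside its class folder.

```python
import os

def count_reviews(split_dir, class_names=('neg', 'pos')):
    """Count the .txt files in each class folder of a split directory."""
    counts = {}
    for name in class_names:
        class_dir = os.path.join(split_dir, name)
        counts[name] = sum(
            1 for entry in os.listdir(class_dir) if entry.endswith('.txt')
        )
    return counts

# Example call (assumes the archive has been extracted to ./aclImdb):
# count_reviews('./aclImdb/train')
```

If the 25,000-review training split is balanced as described, both counts should come out to 12,500.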

In [8]:
# looking at one of those files
sample_pos_file = os.path.join(train_dir, 'pos/1181_9.txt')
with open(sample_pos_file) as f:
    text = f.read()

wrap(text)

['Rachel Griffiths writes and directs this award winning short film. A',
 'heartwarming story about coping with grief and cherishing the memory',
 "of those we've loved and lost. Although, only 15 minutes long,",
 'Griffiths manages to capture so much emotion and truth onto film in',
 'the short space of time. Bud Tingwell gives a touching performance as',
 "Will, a widower struggling to cope with his wife's death. Will is",
 'confronted by the harsh reality of loneliness and helplessness as he',
 "proceeds to take care of Ruth's pet cow, Tulip. The film displays the",
 'grief and responsibility one feels for those they have loved and lost.',
 'Good cinematography, great direction, and superbly acted. It will',
 'bring tears to all those who have lost a loved one, and survived.']

In [9]:
sample_neg_file = os.path.join(train_dir, 'neg/10008_2.txt')
with open(sample_neg_file) as f:
    text = f.read()

wrap(text)

['The film is bad. There is no other way to say it. The story is weak',
 "and outdated, especially for this country. I don't think most people",
 'know what a "walker" is or will really care. I felt as if I was',
 "watching a movie from the 70's. The subject was just not believable",
 'for the year 2007, even being set in DC. I think this rang true for',
 'everyone else who watched it too as the applause were low and quick at',
 "the end. Most didn't stay for the Q&A either.<br /><br />I don't think",
 'Schrader really thought the film out ahead of time. Many of the scenes',
 'seemed to be cut short as if they were never finished or he just',
 "didn't know how to finish them. He jumped from one scene to the next",
 'and you had to try and figure out or guess what was going on. I really',
 "didn't get Woody's (Carter) private life or boyfriend either. What",
 'were all the "artistic" male bondage and torture pictures (from Iraq',
 'prisons) about? What was he thinking? I think it was hi

### Load the dataset

To do this, the `text_dataset_from_directory` utility is used. This utility expects a directory structure as follows:

```python
main_directory/
...class_a/
......a_text_1.txt
......a_text_2.txt
...class_b/
......b_text_1.txt
......b_text_2.txt
```

In [10]:
# 3 of these are folders: 'neg', 'pos' and 'unsup'
os.listdir(train_dir)

['neg',
 'unsupBow.feat',
 'labeledBow.feat',
 'pos',
 'unsup',
 'urls_neg.txt',
 'urls_unsup.txt',
 'urls_pos.txt']

To prepare a dataset for binary classification, we need two folders, corresponding to `class_a` and `class_b` above, so we want to get rid of `unsup`.

In [11]:
remove_dir = os.path.join(train_dir, 'unsup')
remove_dir

'./aclImdb/train/unsup'

In [12]:
# shutil.rmtree recursively deletes a directory "tree"
shutil.rmtree(remove_dir)

In [13]:
os.listdir(train_dir)

['neg',
 'unsupBow.feat',
 'labeledBow.feat',
 'pos',
 'urls_neg.txt',
 'urls_unsup.txt',
 'urls_pos.txt']

The IMDB dataset has already been divided into train and test, but it lacks a validation set. Let's create a validation set using an 80:20 split of the training data by using the `validation_split` argument below.

In [14]:
batch_size = 32
seed = 42       # optional seed for random shuffling
# note, the `shuffle` parameter defaults to True

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [15]:
type(raw_train_ds), raw_train_ds

(tensorflow.python.data.ops.prefetch_op._PrefetchDataset,
 <_PrefetchDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>)

You can train a model by passing the dataset directly into the `model.fit()` method. First let's have a look at the data.
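
Each element of the dataset is one batch of reviews and labels rather than a single review. As a quick arithmetic sketch (plain Python, no TensorFlow needed), the number of batches follows from the file counts and `batch_size`; the `num_batches` helper is hypothetical.

```python
import math

def num_batches(num_examples, batch_size):
    """Number of batches needed to cover all examples (the last may be smaller)."""
    return math.ceil(num_examples / batch_size)

batch_size = 32
print(num_batches(20_000, batch_size))  # training subset: 625 batches
print(num_batches(5_000, batch_size))   # the held-out 20% (5,000 reviews): 157 batches
print(5_000 % batch_size)               # size of that final, partial batch: 8
```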

In [16]:
test = raw_train_ds.take(1)

for a, b in test:
    print(type(a))
    print(type(b))

<class 'tensorflow.python.framework.ops.EagerTensor'>
<class 'tensorflow.python.framework.ops.EagerTensor'>


In [17]:
for a, b in test:
    print(a.numpy())
    print(len(a.numpy()))       # batch size
    print('\n______________\n')
    print(b.numpy())
    print(len(b.numpy()))       # batch size

[b"Having seen most of Ringo Lam's films, I can say that this is his best film to date, and the most unusual. It's a ancient china period piece cranked full of kick-ass martial arts, where the location of an underground lair full of traps and dungeons plays as big a part as any of the characters. The action is fantastic, the story is tense and entertaining, and the set design is truely memorable. Sadly, Burning Paradise has not been made available on DVD and vhs is next-to-impossible to get your mitts on, even if you near the second biggest china-town in North America (like I do). If you can find it, don't pass it up."
 b'Caution: May contain spoilers...<br /><br />I\'ve seen this movie 3 times & I\'ve liked it every time. Upon seeing it again, I\'m always reminded of how good it is. An HBO TV movie- very well done like most of their movies are- this would\'ve gotten Oscars for it\'s performances had it been released for general distribution instead of made for TV.<br /><br />As I\'m s

In [18]:
for a, b in test:
    for i in range(3):
        print(a.numpy()[i])
        print(b.numpy()[i])
        print('\n________\n')

b'Silent Night, Deadly Night 5 is the very last of the series, and like part 4, it\'s unrelated to the first three except by title and the fact that it\'s a Christmas-themed horror flick.<br /><br />Except to the oblivious, there\'s some obvious things going on here...Mickey Rooney plays a toymaker named Joe Petto and his creepy son\'s name is Pino. Ring a bell, anyone? Now, a little boy named Derek heard a knock at the door one evening, and opened it to find a present on the doorstep for him. Even though it said "don\'t open till Christmas", he begins to open it anyway but is stopped by his dad, who scolds him and sends him to bed, and opens the gift himself. Inside is a little red ball that sprouts Santa arms and a head, and proceeds to kill dad. Oops, maybe he should have left well-enough alone. Of course Derek is then traumatized by the incident since he watched it from the stairs, but he doesn\'t grow up to be some killer Santa, he just stops talking.<br /><br />There\'s a mysteri

In [19]:
# the dataset is reshuffled on every iteration (`shuffle` defaults to True),
# so each loop over `.take(1)` above returned a different batch;
# here we re-create the dataset to restart from the seeded order
batch_size = 32
seed = 42       # optional seed for random shuffling
# note, the `shuffle` parameter defaults to True

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [20]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(3):
    print("Review", text_batch.numpy()[i])
    print("Label", label_batch.numpy()[i])

Review b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label 0
Review b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into 

#### The classes

The labels are 0 and 1. How do these correspond to the sentiment of movie reviews?

In [21]:
print('Label 0 corresponds to', raw_train_ds.class_names[0])
print('Label 1 corresponds to', raw_train_ds.class_names[1])

Label 0 corresponds to neg
Label 1 corresponds to pos
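
The mapping follows from the folder names: when labels are inferred from the directory structure, the class directory names are sorted alphanumerically and numbered from 0. The snippet below is a plain-Python sketch of that rule, not the library's code; `infer_class_indices` is a hypothetical helper and the directory is a throwaway toy layout.

```python
import os
import tempfile

def infer_class_indices(directory):
    """Map each subdirectory name to an integer label, in sorted order."""
    class_names = sorted(
        entry for entry in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, entry))
    )
    return {name: index for index, name in enumerate(class_names)}

# Toy layout mirroring aclImdb/train after `unsup` has been removed.
with tempfile.TemporaryDirectory() as root:
    for name in ('pos', 'neg'):
        os.makedirs(os.path.join(root, name))
    # A stray file is ignored because only directories count as classes.
    open(os.path.join(root, 'urls_pos.txt'), 'w').close()
    mapping = infer_class_indices(root)

print(mapping)  # {'neg': 0, 'pos': 1}
```

This is also why the `unsup` folder had to be removed earlier: left in place, it would have been picked up as a third class.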


### Test and Validation sets

- We will use the remaining 5,000 reviews from the training set for validation

#### IMPORTANT NOTE:

When using the `validation_split` and `subset` arguments, make sure to either specify a random `seed`, or to pass `shuffle=False`, so that the validation and training splits have no overlap.
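
To see why the seed matters, here is a plain-Python sketch of a shuffle-then-split scheme. It is an illustration under assumptions, not the Keras implementation: `split_subset` is a hypothetical helper that shuffles file indices with a seeded generator, then hands the first 80% to training and the remainder to validation.

```python
import random

def split_subset(num_files, validation_split, subset, seed):
    """Return the file indices for one subset of a seeded shuffle-then-split."""
    indices = list(range(num_files))
    random.Random(seed).shuffle(indices)       # same seed -> same order
    num_val = int(validation_split * num_files)
    if subset == 'training':
        return indices[:-num_val]
    return indices[-num_val:]

train = set(split_subset(25_000, 0.2, 'training', seed=42))
val = set(split_subset(25_000, 0.2, 'validation', seed=42))
print(len(train), len(val), len(train & val))  # 20000 5000 0

# With a different seed for the second call, the two shuffles disagree
# and some files land in both subsets.
leaky_val = set(split_subset(25_000, 0.2, 'validation', seed=7))
print(len(train & leaky_val) > 0)              # True
```

Because both calls shuffle identically when they share a seed, the two slices are complementary; that guarantee disappears as soon as the orders differ.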

In [22]:
raw_val_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='validation',    # this is the only arg that changed
    seed=seed
)

Found 25000 files belonging to 2 classes.
Using 5000 files for validation.


In [23]:
raw_test_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/test',
    batch_size=batch_size
)

Found 25000 files belonging to 2 classes.


### Prepare the dataset for training

#### List of things to do:

1. **Standardize**: preprocess the text, typically to remove punctuation or HTML
2. **Tokenize**: split the strings into *tokens* (example: splitting a sentence into individual words by splitting on whitespace)
3. **Vectorize**: convert tokens into numbers so they can be fed into a neural network

All of these tasks can be completed using the `tf.keras.layers.TextVectorization` layer.

- The default standardizer in this layer only converts text to lowercase and strips punctuation. We need a custom standardization function to remove the HTML.
- **Note**: To prevent training-testing skew (also known as training-serving skew), it is important to preprocess the data identically at train and test time. To facilitate this, the `TextVectorization` layer can be included directly inside your model, as shown later in this tutorial.
    - https://developers.google.com/machine-learning/guides/rules-of-ml#training-serving_skew
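
The three steps listed above can be sketched in plain Python on a single string. This is only an illustration of the idea, not what the Keras layer runs internally: the tiny `vocab` dictionary is made up for the example, and index 1 stands in for out-of-vocabulary words.

```python
import re
import string

def standardize(text):
    """Lowercase, replace <br /> tags with spaces, and strip punctuation."""
    text = text.lower().replace('<br />', ' ')
    return re.sub('[%s]' % re.escape(string.punctuation), '', text)

def tokenize(text):
    """Split on whitespace."""
    return text.split()

def vectorize(tokens, vocab, unknown_index=1):
    """Look up each token's integer index, falling back to the unknown index."""
    return [vocab.get(token, unknown_index) for token in tokens]

# Hypothetical toy vocabulary (a real one is learned from the training text).
vocab = {'the': 2, 'film': 3, 'is': 4, 'bad': 5}

clean = standardize('The film is bad.<br /><br />Really bad!')
tokens = tokenize(clean)
print(tokens)                    # ['the', 'film', 'is', 'bad', 'really', 'bad']
print(vectorize(tokens, vocab))  # [2, 3, 4, 5, 1, 5]
```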

#### Custom standardizer

In [24]:
input_data_sample = text
input_data_sample

'The film is bad. There is no other way to say it. The story is weak and outdated, especially for this country. I don\'t think most people know what a "walker" is or will really care. I felt as if I was watching a movie from the 70\'s. The subject was just not believable for the year 2007, even being set in DC. I think this rang true for everyone else who watched it too as the applause were low and quick at the end. Most didn\'t stay for the Q&A either.<br /><br />I don\'t think Schrader really thought the film out ahead of time. Many of the scenes seemed to be cut short as if they were never finished or he just didn\'t know how to finish them. He jumped from one scene to the next and you had to try and figure out or guess what was going on. I really didn\'t get Woody\'s (Carter) private life or boyfriend either. What were all the "artistic" male bondage and torture pictures (from Iraq prisons) about? What was he thinking? I think it was his very poor attempt at trying to create this d

In [25]:
lowercase = tf.strings.lower(input_data_sample)
lowercase.numpy() # converts EagerTensor to bytes string literal

b'the film is bad. there is no other way to say it. the story is weak and outdated, especially for this country. i don\'t think most people know what a "walker" is or will really care. i felt as if i was watching a movie from the 70\'s. the subject was just not believable for the year 2007, even being set in dc. i think this rang true for everyone else who watched it too as the applause were low and quick at the end. most didn\'t stay for the q&a either.<br /><br />i don\'t think schrader really thought the film out ahead of time. many of the scenes seemed to be cut short as if they were never finished or he just didn\'t know how to finish them. he jumped from one scene to the next and you had to try and figure out or guess what was going on. i really didn\'t get woody\'s (carter) private life or boyfriend either. what were all the "artistic" male bondage and torture pictures (from iraq prisons) about? what was he thinking? i think it was his very poor attempt at trying to create this 

In [26]:
stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
stripped_html.numpy()

b'the film is bad. there is no other way to say it. the story is weak and outdated, especially for this country. i don\'t think most people know what a "walker" is or will really care. i felt as if i was watching a movie from the 70\'s. the subject was just not believable for the year 2007, even being set in dc. i think this rang true for everyone else who watched it too as the applause were low and quick at the end. most didn\'t stay for the q&a either.  i don\'t think schrader really thought the film out ahead of time. many of the scenes seemed to be cut short as if they were never finished or he just didn\'t know how to finish them. he jumped from one scene to the next and you had to try and figure out or guess what was going on. i really didn\'t get woody\'s (carter) private life or boyfriend either. what were all the "artistic" male bondage and torture pictures (from iraq prisons) about? what was he thinking? i think it was his very poor attempt at trying to create this dark priva

In [27]:
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

In [28]:
# removes punctuation
tf.strings.regex_replace(stripped_html,
                         '[%s]' % re.escape(string.punctuation),
                         '').numpy()

b'the film is bad there is no other way to say it the story is weak and outdated especially for this country i dont think most people know what a walker is or will really care i felt as if i was watching a movie from the 70s the subject was just not believable for the year 2007 even being set in dc i think this rang true for everyone else who watched it too as the applause were low and quick at the end most didnt stay for the qa either  i dont think schrader really thought the film out ahead of time many of the scenes seemed to be cut short as if they were never finished or he just didnt know how to finish them he jumped from one scene to the next and you had to try and figure out or guess what was going on i really didnt get woodys carter private life or boyfriend either what were all the artistic male bondage and torture pictures from iraq prisons about what was he thinking i think it was his very poor attempt at trying to create this dark private subculture life for woodys character
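
The pattern `'[%s]' % re.escape(string.punctuation)` builds a regex character class that matches any single punctuation character; escaping keeps characters such as `]`, `\` and `^` from being read as regex syntax. The same pattern can be tried with Python's built-in `re` module, as in this small sketch (the sample sentence is made up for illustration):

```python
import re
import string

# A character class containing every ASCII punctuation character.
punctuation_pattern = '[%s]' % re.escape(string.punctuation)

sample = "Most didn't stay for the Q&A [sic] either."
print(re.sub(punctuation_pattern, '', sample))
# Most didnt stay for the QA sic either
```

Note that apostrophes are removed too, so `didn't` becomes `didnt`, which matches the output above.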

In [29]:
# putting it all together in a function
def custom_standardization(input_data):
  lowercase = tf.strings.lower(input_data)
  stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
  return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')

# test function
custom_standardization(input_data_sample).numpy()

b'the film is bad there is no other way to say it the story is weak and outdated especially for this country i dont think most people know what a walker is or will really care i felt as if i was watching a movie from the 70s the subject was just not believable for the year 2007 even being set in dc i think this rang true for everyone else who watched it too as the applause were low and quick at the end most didnt stay for the qa either  i dont think schrader really thought the film out ahead of time many of the scenes seemed to be cut short as if they were never finished or he just didnt know how to finish them he jumped from one scene to the next and you had to try and figure out or guess what was going on i really didnt get woodys carter private life or boyfriend either what were all the artistic male bondage and torture pictures from iraq prisons about what was he thinking i think it was his very poor attempt at trying to create this dark private subculture life for woodys character

#### `TextVectorization` layer

- This is the layer that is used to standardize, tokenize and vectorize the data
- Arguments used:
    - **standardize**: our `custom_standardization` function
    - **split**: we leave it as the default which is `"whitespace"`
    - **max_tokens**: maximum size of the vocabulary for this layer
    - **output_mode**: `"int"`: Outputs integer indices, one integer index per split string token. When output mode is `"int"`, 0 is reserved for masked locations; this reduces the vocab size to `max_tokens - 2` instead of `max_tokens - 1`
    - **output_sequence_length**: only valid in `"int"` mode. If set, the output will have its time dimension padded or truncated to exactly `output_sequence_length`, regardless of how many tokens resulted from the splitting step.
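
The effect of `output_sequence_length` can be pictured with a small plain-Python sketch. `pad_or_truncate` is a hypothetical helper, not part of Keras; it assumes padding is appended at the end using index 0, the value reserved for masked locations.

```python
def pad_or_truncate(token_ids, sequence_length, pad_value=0):
    """Force a list of token indices to exactly `sequence_length` entries."""
    clipped = token_ids[:sequence_length]
    return clipped + [pad_value] * (sequence_length - len(clipped))

print(pad_or_truncate([2, 19, 7, 5], 6))           # [2, 19, 7, 5, 0, 0]
print(pad_or_truncate([2, 19, 7, 5, 9, 4, 8], 6))  # [2, 19, 7, 5, 9, 4]
```

With a sequence length of 250, every review would become a row of exactly 250 integers, however long or short the original text was.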

In [30]:
max_features = 10_000
sequence_length = 250

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=sequence_length
)

vectorize_layer

<TextVectorization name=text_vectorization, built=False>

In [39]:
print(type(vectorize_layer))

<class 'keras.src.layers.preprocessing.text_vectorization.TextVectorization'>


Next, you will call the `adapt` method to fit the state of the preprocessing layer to the dataset. This will cause the model to build an index of strings to integers.
- This creates the vocabulary
- Note: It's important to only use your training data when calling adapt (using the test set would leak information).
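
Conceptually, adapting amounts to counting tokens in the training text and keeping the most frequent ones. The following plain-Python sketch illustrates that idea under simplifying assumptions (whitespace tokens, already standardized text); `build_vocabulary` and `lookup` are hypothetical helpers, not the Keras internals.

```python
from collections import Counter

def build_vocabulary(texts, max_tokens):
    """Return a vocabulary list: '' (padding), '[UNK]', then tokens by frequency."""
    counts = Counter(token for text in texts for token in text.split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ['', '[UNK]'] + most_common

def lookup(tokens, vocabulary):
    """Map tokens to their vocabulary index; unseen tokens map to 1 ('[UNK]')."""
    index = {token: i for i, token in enumerate(vocabulary)}
    return [index.get(token, 1) for token in tokens]

train_texts = ['the film is bad', 'the film is good', 'the acting']
vocabulary = build_vocabulary(train_texts, max_tokens=5)
print(vocabulary)                                      # ['', '[UNK]', 'the', 'film', 'is']
print(lookup('the film is dull'.split(), vocabulary))  # [2, 3, 4, 1]
```

Because the vocabulary is built from `train_texts` only, a word that appears only at test time (here `dull`) falls back to `[UNK]`, so no information from the test set shapes the index.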


In [32]:
raw_train_ds

<_PrefetchDataset element_spec=(TensorSpec(shape=(None,), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>

In [33]:
# make a text only dataset (without the labels)
# `.map`: Maps `map_func` across the elements of this dataset.
train_text = raw_train_ds.map(lambda x, y: x)

In [34]:
for item in raw_train_ds.as_numpy_iterator():
    print(item) # label is still here
    break

(array([b"Having seen most of Ringo Lam's films, I can say that this is his best film to date, and the most unusual. It's a ancient china period piece cranked full of kick-ass martial arts, where the location of an underground lair full of traps and dungeons plays as big a part as any of the characters. The action is fantastic, the story is tense and entertaining, and the set design is truely memorable. Sadly, Burning Paradise has not been made available on DVD and vhs is next-to-impossible to get your mitts on, even if you near the second biggest china-town in North America (like I do). If you can find it, don't pass it up.",
       b'Caution: May contain spoilers...<br /><br />I\'ve seen this movie 3 times & I\'ve liked it every time. Upon seeing it again, I\'m always reminded of how good it is. An HBO TV movie- very well done like most of their movies are- this would\'ve gotten Oscars for it\'s performances had it been released for general distribution instead of made for TV.<br /><

In [35]:
for item in train_text.as_numpy_iterator():
    print(item)
    break

[b'Silent Night, Deadly Night 5 is the very last of the series, and like part 4, it\'s unrelated to the first three except by title and the fact that it\'s a Christmas-themed horror flick.<br /><br />Except to the oblivious, there\'s some obvious things going on here...Mickey Rooney plays a toymaker named Joe Petto and his creepy son\'s name is Pino. Ring a bell, anyone? Now, a little boy named Derek heard a knock at the door one evening, and opened it to find a present on the doorstep for him. Even though it said "don\'t open till Christmas", he begins to open it anyway but is stopped by his dad, who scolds him and sends him to bed, and opens the gift himself. Inside is a little red ball that sprouts Santa arms and a head, and proceeds to kill dad. Oops, maybe he should have left well-enough alone. Of course Derek is then traumatized by the incident since he watched it from the stairs, but he doesn\'t grow up to be some killer Santa, he just stops talking.<br /><br />There\'s a myster

In [36]:
# UNK is for an unknown word that doesn't exist in the vocabulary set
vectorize_layer.get_vocabulary()

['', '[UNK]']

In [37]:
# creates the vocabulary
# causes the model to build an index of strings to integers
vectorize_layer.adapt(train_text)

In [38]:
vectorize_layer.get_vocabulary()[:25]

['',
 '[UNK]',
 'the',
 'and',
 'a',
 'of',
 'to',
 'is',
 'in',
 'it',
 'i',
 'this',
 'that',
 'was',
 'as',
 'for',
 'with',
 'movie',
 'but',
 'film',
 'on',
 'not',
 'you',
 'are',
 'his']