<a href="https://colab.research.google.com/github/PaulToronto/TensorFlow_Tutorials/blob/main/3_Keras_Basic_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Keras - Basic text classification

- https://tinyurl.com/42fab6rw

## Imports

In [1]:
import tensorflow as tf

import os
from textwrap import wrap
import shutil

## The IMDB dataset

- this notebook trains a sentiment analysis model to classify reviews as either *positive* or *negative* based on the text of the review
- this is a binary classification problem
- dataset source: https://ai.stanford.edu/%7Eamaas/data/sentiment/
    - 50,000 movie reviews from the Internet Movie Database
    - Split into 25,000 for training and 25,000 for testing
    - The training and testing sets are **balanced**, meaning they contain an equal number of positive and negative reviews


In [2]:
url = 'https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'

dataset = tf.keras.utils.get_file(fname='aclImdb_v1',
                                  origin=url,
                                  untar=True,
                                  cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')

In [3]:
type(dataset), dataset

(str, './aclImdb_v1')

In [4]:
type(dataset_dir), dataset_dir

(str, './aclImdb')

In [5]:
os.listdir(dataset_dir)

['imdbEr.txt', 'README', 'train', 'test', 'imdb.vocab']

In [6]:
train_dir = os.path.join(dataset_dir,'train')
train_dir

'./aclImdb/train'

In [7]:
os.listdir(train_dir)

['neg',
 'unsupBow.feat',
 'labeledBow.feat',
 'pos',
 'unsup',
 'urls_neg.txt',
 'urls_unsup.txt',
 'urls_pos.txt']

The `./aclImdb/train/neg` and `./aclImdb/train/pos` directories contain many text files, each of which is a single movie review.
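
One way to check that claim is to count the review files in each class directory. The sketch below is a minimal illustration: `count_text_files` is a hypothetical helper (not part of TensorFlow), and it runs on a small throwaway directory so that it works without the download.

```python
import os
import tempfile


def count_text_files(directory):
    """Return the number of .txt files directly inside `directory`."""
    return sum(1 for name in os.listdir(directory) if name.endswith('.txt'))


# Demonstration on a throwaway directory that mimics train/pos and train/neg.
with tempfile.TemporaryDirectory() as demo_dir:
    for label, n_files in (('pos', 3), ('neg', 3)):
        class_dir = os.path.join(demo_dir, label)
        os.makedirs(class_dir)
        for i in range(n_files):
            with open(os.path.join(class_dir, f'{i}_0.txt'), 'w') as f:
                f.write('placeholder review text')

    demo_counts = {label: count_text_files(os.path.join(demo_dir, label))
                   for label in ('pos', 'neg')}

print(demo_counts)  # {'pos': 3, 'neg': 3}
```

In this notebook, `count_text_files(os.path.join(train_dir, 'pos'))` and the same call for `'neg'` should each report 12,500 if the archive was extracted completely, since the 25,000 training reviews are balanced.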

In [8]:
# looking at one of those files
sample_pos_file = os.path.join(train_dir, 'pos/1181_9.txt')
with open(sample_pos_file) as f:
    text = f.read()

wrap(text)

['Rachel Griffiths writes and directs this award winning short film. A',
 'heartwarming story about coping with grief and cherishing the memory',
 "of those we've loved and lost. Although, only 15 minutes long,",
 'Griffiths manages to capture so much emotion and truth onto film in',
 'the short space of time. Bud Tingwell gives a touching performance as',
 "Will, a widower struggling to cope with his wife's death. Will is",
 'confronted by the harsh reality of loneliness and helplessness as he',
 "proceeds to take care of Ruth's pet cow, Tulip. The film displays the",
 'grief and responsibility one feels for those they have loved and lost.',
 'Good cinematography, great direction, and superbly acted. It will',
 'bring tears to all those who have lost a loved one, and survived.']

In [9]:
sample_neg_file = os.path.join(train_dir, 'neg/10008_2.txt')
with open(sample_neg_file) as f:
    text = f.read()

wrap(text)

['The film is bad. There is no other way to say it. The story is weak',
 "and outdated, especially for this country. I don't think most people",
 'know what a "walker" is or will really care. I felt as if I was',
 "watching a movie from the 70's. The subject was just not believable",
 'for the year 2007, even being set in DC. I think this rang true for',
 'everyone else who watched it too as the applause were low and quick at',
 "the end. Most didn't stay for the Q&A either.<br /><br />I don't think",
 'Schrader really thought the film out ahead of time. Many of the scenes',
 'seemed to be cut short as if they were never finished or he just',
 "didn't know how to finish them. He jumped from one scene to the next",
 'and you had to try and figure out or guess what was going on. I really',
 "didn't get Woody's (Carter) private life or boyfriend either. What",
 'were all the "artistic" male bondage and torture pictures (from Iraq',
 'prisons) about? What was he thinking? I think it was hi

### Load the dataset

The `text_dataset_from_directory` utility is used to load the data. It expects a directory structure as follows:

```text
main_directory/
...class_a/
......a_text_1.txt
......a_text_2.txt
...class_b/
......b_text_1.txt
......b_text_2.txt
```

In [10]:
# 3 of these are folders: 'neg', 'pos' and 'unsup'
os.listdir(train_dir)

['neg',
 'unsupBow.feat',
 'labeledBow.feat',
 'pos',
 'unsup',
 'urls_neg.txt',
 'urls_unsup.txt',
 'urls_pos.txt']

To prepare a dataset for binary classification, we need exactly two folders, corresponding to `class_a` and `class_b` above. Each subdirectory is treated as a class, so we want to get rid of `unsup`.
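
To see which folders would be picked up as classes, the subdirectories can be listed on their own. This is a rough sketch rather than what Keras does internally: `list_class_dirs` is a hypothetical helper, and the demonstration uses a temporary directory with the same folder names as `train_dir`.

```python
import os
import tempfile


def list_class_dirs(main_directory):
    """Return the sorted names of the immediate subdirectories."""
    return sorted(
        name for name in os.listdir(main_directory)
        if os.path.isdir(os.path.join(main_directory, name))
    )


with tempfile.TemporaryDirectory() as demo_dir:
    for name in ('neg', 'pos', 'unsup'):
        os.makedirs(os.path.join(demo_dir, name))
    # A plain file such as 'urls_pos.txt' is not a directory, so it is skipped.
    open(os.path.join(demo_dir, 'urls_pos.txt'), 'w').close()

    demo_classes = list_class_dirs(demo_dir)

print(demo_classes)  # ['neg', 'pos', 'unsup']
```

Three entries here would mean three classes, which is why `unsup` has to go before loading.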

In [11]:
remove_dir = os.path.join(train_dir, 'unsup')
remove_dir

'./aclImdb/train/unsup'

In [12]:
# shutil.rmtree recursively deletes a directory "tree"
shutil.rmtree(remove_dir)

In [13]:
os.listdir(train_dir)

['neg',
 'unsupBow.feat',
 'labeledBow.feat',
 'pos',
 'urls_neg.txt',
 'urls_unsup.txt',
 'urls_pos.txt']
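
Note that `shutil.rmtree` raises `FileNotFoundError` if the directory is already gone, which happens when the removal cell is run a second time. A guarded variant, sketched here with a hypothetical helper named `remove_dir_if_present`, makes the step safe to repeat:

```python
import os
import shutil
import tempfile


def remove_dir_if_present(path):
    """Delete `path` recursively; return True if something was removed."""
    if os.path.isdir(path):
        shutil.rmtree(path)
        return True
    return False


with tempfile.TemporaryDirectory() as demo_dir:
    target = os.path.join(demo_dir, 'unsup')
    os.makedirs(target)
    first_call = remove_dir_if_present(target)   # directory exists
    second_call = remove_dir_if_present(target)  # already gone, no error

print(first_call, second_call)  # True False
```

In this notebook the equivalent call would be `remove_dir_if_present(remove_dir)`.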

The IMDB dataset has already been divided into train and test, but it lacks a validation set. Let's create a validation set using an 80:20 split of the training data by using the `validation_split` argument below.
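
The expected split sizes can be worked out by hand first. The sketch below is plain arithmetic, not the Keras implementation, and `split_sizes` is a hypothetical helper; it assumes the validation count is the truncated product of the fraction and the file count.

```python
def split_sizes(n_files, validation_split):
    """Return (n_training, n_validation) for a fractional validation split."""
    n_validation = int(validation_split * n_files)
    return n_files - n_validation, n_validation


n_train, n_val = split_sizes(25000, 0.2)
print(n_train, n_val)  # 20000 5000
```

The 20,000 figure should match the "Using 20000 files for training." message printed when the dataset is loaded.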

In [59]:
batch_size = 32
seed = 42       # optional seed for random shuffling
# note, the `shuffle` parameter defaults to True

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [22]:
type(raw_train_ds)

You can train a model by passing the dataset directly into the `model.fit()` method. First let's have a look at the data.

In [63]:
test = raw_train_ds.take(1)

for a, b in test:
    print(type(a))
    print(type(b))

<class 'tensorflow.python.framework.ops.EagerTensor'>
<class 'tensorflow.python.framework.ops.EagerTensor'>


In [62]:
for a, b in test:
    print(a.numpy())
    print(len(a.numpy()))       # batch size
    print('\n______________\n')
    print(b.numpy())
    print(len(b.numpy()))       # batch size

[b'Silent Night, Deadly Night 5 is the very last of the series, and like part 4, it\'s unrelated to the first three except by title and the fact that it\'s a Christmas-themed horror flick.<br /><br />Except to the oblivious, there\'s some obvious things going on here...Mickey Rooney plays a toymaker named Joe Petto and his creepy son\'s name is Pino. Ring a bell, anyone? Now, a little boy named Derek heard a knock at the door one evening, and opened it to find a present on the doorstep for him. Even though it said "don\'t open till Christmas", he begins to open it anyway but is stopped by his dad, who scolds him and sends him to bed, and opens the gift himself. Inside is a little red ball that sprouts Santa arms and a head, and proceeds to kill dad. Oops, maybe he should have left well-enough alone. Of course Derek is then traumatized by the incident since he watched it from the stairs, but he doesn\'t grow up to be some killer Santa, he just stops talking.<br /><br />There\'s a myster

In [68]:
for a, b in test:
    for i in range(3):
        print(a.numpy()[i])
        print(b.numpy()[i])
        print('\n________\n')

b'Every scene was put together perfectly.This movie had a wonderful cast and crew. I mean, how can you have a bad movie with Robert Downey Jr. in it,none have and ever will exist. He has the ability to brighten up any movie with his amazing talent.This movie was perfect! I saw this movie sitting all alone on a movie shelf in "Blockbuster" and like it was calling out to me,I couldn\'t resist picking it up and bringing it home with me. You can call me a sappy romantic, but this movie just touched my heart, not to mention made me laugh with pleasure at the same time. Even though it made me cry,I admit, at the end, the whole movie just brightened up my outlook on life thereafter.I suggested to my horror, action, and pure humor movie buff of a brother,who absolutely adored this movie. This is a movie with a good sense of feeling.It could make you laugh out loud, touch your heart, make you fall in love,and enjoy your life.Every time you purposefully walk past this movie, just be aware that y

In [71]:
# with `shuffle=True` the data is reshuffled on every pass, so each loop
# over `.take(1)` above returned a different batch;
# here we re-create the dataset to restart the shuffle sequence from the seed
batch_size = 32
seed = 42       # optional seed for random shuffling
# note, the `shuffle` parameter defaults to True

raw_train_ds = tf.keras.utils.text_dataset_from_directory(
    'aclImdb/train',
    batch_size=batch_size,
    validation_split=0.2,
    subset='training',
    seed=seed
)

Found 25000 files belonging to 2 classes.
Using 20000 files for training.


In [72]:
for text_batch, label_batch in raw_train_ds.take(1):
  for i in range(3):
    print("Review", text_batch.numpy()[i])
    print("Label", label_batch.numpy()[i])

Review b'"Pandemonium" is a horror movie spoof that comes off more stupid than funny. Believe me when I tell you, I love comedies. Especially comedy spoofs. "Airplane", "The Naked Gun" trilogy, "Blazing Saddles", "High Anxiety", and "Spaceballs" are some of my favorite comedies that spoof a particular genre. "Pandemonium" is not up there with those films. Most of the scenes in this movie had me sitting there in stunned silence because the movie wasn\'t all that funny. There are a few laughs in the film, but when you watch a comedy, you expect to laugh a lot more than a few times and that\'s all this film has going for it. Geez, "Scream" had more laughs than this film and that was more of a horror film. How bizarre is that?<br /><br />*1/2 (out of four)'
Label 0
Review b"David Mamet is a very interesting and a very un-equal director. His first movie 'House of Games' was the one I liked best, and it set a series of films with characters whose perspective of life changes as they get into 
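
The `b'...'` prefix in the output above shows that each review comes back from `.numpy()` as a Python `bytes` object rather than a `str`. A minimal sketch of turning one into text, using a short literal in place of a real batch element and assuming the reviews are UTF-8 encoded:

```python
raw_review = b"The film is bad. There is no other way to say it."

# bytes -> str; UTF-8 is an assumption about the review files
decoded_review = raw_review.decode('utf-8')

print(type(raw_review).__name__, '->', type(decoded_review).__name__)
print(decoded_review)
```

For a batch, the same call would be applied to an element such as `text_batch.numpy()[i]`.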

#### The classes

The labels are 0 and 1. How do these correspond to the sentiment of movie reviews?

In [77]:
print('Label 0 corresponds to', raw_train_ds.class_names[0])
print('Label 1 corresponds to', raw_train_ds.class_names[1])

Label 0 corresponds to neg
Label 1 corresponds to pos
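
Since the integer label is just an index into `class_names`, a batch of labels can be translated back to readable names with a lookup. In this sketch the `label_batch` values are made up for illustration, and `class_names` is written out by hand to match the output above:

```python
class_names = ['neg', 'pos']   # as reported by raw_train_ds.class_names
label_batch = [0, 1, 1, 0]     # hypothetical labels, for illustration only

sentiments = [class_names[label] for label in label_batch]
print(sentiments)  # ['neg', 'pos', 'pos', 'neg']
```

The order matches the alphanumeric order of the directory names, which is how the labels are assigned when no explicit class list is passed.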
