
### Text Classification

In [None]:
import tensorflow as tf
import numpy as np
import pandas as pd

import os

# Get helper_functions.py script from course GitHub
!wget https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py 

# Import helper functions we're going to use
from helper_functions import create_tensorboard_callback, plot_loss_curves, unzip_data, walk_through_dir, calculate_results


--2022-07-13 19:06:23--  https://raw.githubusercontent.com/mrdbourke/tensorflow-deep-learning/main/extras/helper_functions.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 10246 (10K) [text/plain]
Saving to: ‘helper_functions.py.1’


2022-07-13 19:06:23 (65.8 MB/s) - ‘helper_functions.py.1’ saved [10246/10246]



## Get Data

In [None]:
# Get data
imdb_dataset = pd.read_csv("/content/drive/MyDrive/IMDB_dataset.zip")
imdb_dataset.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


## Preprocess Data

In [None]:
# Import preprocessing utilities (only LabelEncoder and train_test_split are used below)
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, LabelEncoder
from sklearn.model_selection import train_test_split


In [None]:
X = imdb_dataset.drop(["sentiment"], axis=1)
y = imdb_dataset["sentiment"]

In [None]:
X.head(), y.head()

(                                              review
 0  One of the other reviewers has mentioned that ...
 1  A wonderful little production. <br /><br />The...
 2  I thought this was a wonderful way to spend ti...
 3  Basically there's a family where a little boy ...
 4  Petter Mattei's "Love in the Time of Money" is..., 0    positive
 1    positive
 2    positive
 3    negative
 4    positive
 Name: sentiment, dtype: object)

### Setup train/test split

In [None]:
# Setup train/test split
train_sentences, test_sentences, train_labels, test_labels = train_test_split(X.to_numpy(), y.to_numpy(), test_size=0.2, random_state=42)

In [None]:
len(train_sentences), len(test_sentences), len(train_labels), len(test_labels)

(40000, 10000, 40000, 10000)

In [None]:
test_labels

array(['positive', 'positive', 'negative', ..., 'positive', 'negative',
       'positive'], dtype=object)

In [None]:
le = LabelEncoder()
# Fit on the training labels only, then reuse the same mapping for the test labels
train_labels_encoded = le.fit_transform(train_labels)
test_labels_encoded = le.transform(test_labels)

In [None]:
train_labels_encoded, test_labels_encoded

(array([0, 0, 1, ..., 0, 1, 1]), array([1, 1, 0, ..., 1, 0, 1]))
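The encoded arrays above are consistent with `negative` mapping to 0 and `positive` to 1, which is what an encoder that sorts class names alphabetically would produce. The following is a minimal pure-Python sketch of that idea, not scikit-learn's actual implementation; the helper names and toy labels are made up for illustration.

```python
# Pure-Python sketch of integer label encoding (illustrative, not scikit-learn's code).

def fit_label_classes(labels):
    """Return the sorted unique classes, so each class gets a stable integer id."""
    return sorted(set(labels))


def encode_labels(labels, classes):
    """Map each label to its position in the sorted class list."""
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels]


# Toy labels (assumed for this example, not taken from the dataset)
toy_train = ["negative", "negative", "positive"]
toy_test = ["positive", "positive", "negative"]

classes = fit_label_classes(toy_train)   # learned from training labels only
print(classes)
print(encode_labels(toy_test, classes))
```

Learning the classes from the training labels and reusing them for the test labels keeps the integer ids consistent across both splits.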

In [None]:
train_sentences[:5]

array([['That\'s what I kept asking myself during the many fights, screaming matches, swearing and general mayhem that permeate the 84 minutes. The comparisons also stand up when you think of the one-dimensional characters, who have so little depth that it is virtually impossible to care what happens to them. They are just badly written cyphers for the director to hang his multicultural beliefs on, a topic that has been done much better in other dramas both on TV and the cinema.<br /><br />I must confess, I\'m not really one for spotting bad performances during a film, but it must be said that Nichola Burley (as the heroine\'s slutty best friend) and Wasim Zakir (as the nasty, bullying brother) were absolutely terrible. I don\'t know what acting school they graduated from, but if I was them I\'d apply for a full refund post haste. Only Samina Awan in the lead role manages to impress in a cast of so-called British talent that we\'ll probably never hear from again. At least, that\'s the 

#### Text Vectorization and Embedding

In [None]:
import tensorflow as tf
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
# Note: in TensorFlow 2.6+, you no longer need "layers.experimental.preprocessing"
# you can use: "tf.keras.layers.TextVectorization", see https://github.com/tensorflow/tensorflow/releases/tag/v2.6.0 for more

# Use the default TextVectorization parameters
text_vectorizer = TextVectorization(max_tokens=None, # how many words in the vocabulary (all of the different words in your text)
                                    standardize="lower_and_strip_punctuation", # how to process text
                                    split="whitespace", # how to split tokens
                                    ngrams=None, # create groups of n-words?
                                    output_mode="int", # how to map tokens to numbers
                                    output_sequence_length=None) # how long should the output sequence of tokens be?
                                    # pad_to_max_tokens=True) # Not valid if using max_tokens=None

In [None]:
# Find average number of tokens in text
round(sum([len(str(i).split()) for i in train_sentences])/len(train_sentences))

231

In [None]:
sum([len(str(i).split()) for i in train_sentences])

9239833
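The average length (231 tokens) is used as the output sequence length in the next cell. Since the mean cuts off every review longer than average, it can be worth comparing it with a high percentile before settling on a value. A minimal sketch, assuming whitespace tokenization as above; `token_length_stats` and the toy reviews are hypothetical.

```python
import math


def token_length_stats(texts, percentile=95):
    """Return (rounded mean length, nearest-rank percentile length) in whitespace tokens."""
    lengths = sorted(len(text.split()) for text in texts)
    mean_length = round(sum(lengths) / len(lengths))
    # Nearest-rank percentile: smallest length that covers `percentile`% of the texts
    rank = max(1, math.ceil(percentile / 100 * len(lengths)))
    return mean_length, lengths[rank - 1]


# Made-up toy reviews, only to show the call
toy_reviews = [
    "good film",
    "a really long and winding review that keeps going on",
    "bad",
    "not great but fine",
]

mean_len, p95_len = token_length_stats(toy_reviews)
print(mean_len, p95_len)
```

If the percentile is much larger than the mean, a longer `output_sequence_length` keeps more of each review at the cost of more padding.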

In [None]:
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization

max_vocab_length = 900000
max_length = 231

# Cap the vocabulary size and pad/truncate every sequence to max_length
text_vectorizer = TextVectorization(max_tokens=max_vocab_length,
                                    output_mode="int",
                                    output_sequence_length=max_length)

In [None]:
# Fit to training text
text_vectorizer.adapt(train_sentences)

In [None]:
sample_sentence = "There's a flood in my street!"
text_vectorizer([sample_sentence])

<tf.Tensor: shape=(1, 231), dtype=int64, numpy=
array([[ 215,    4, 6848,    8,   54,  948,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,   

In [None]:
import random
# Choose a random sentence from the training dataset and tokenize it
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nVectorized version:")
text_vectorizer([random_sentence])

Original text:
["Jason Lee does his best to bring fun to a silly situation, but the movie just fails to make a connect. <br /><br />Perhaps because Julia Stiles character seems awkward as the conniving and sexy soon to be cousin-in-law. <br /><br />Maybe it is because she and Selma Blair's characters should have been cast the opposite way. (Selma Blair seems more conniving than Julia would be).<br /><br />Either way this movie is yet another Hollywood trivialization of a possibly real world situation (that being getting caught with your pants out at your bachelor party not stooping your cousin), which while having promise fails to deliver.<br /><br />There are some laughs to be sure and the cast (even if miscast) do their best with sub grade material which doesn't transcend its raunchy topic. So instead of getting a successful raunch fest (ie Animal House or American Pie) we are left with a middle ground of part humor and part stupidity (ala Meatballs 2 or something)."]      

Vectoriz

<tf.Tensor: shape=(1, 231), dtype=int64, numpy=
array([[ 1612,   955,   122,    25,   116,     6,   728,   245,     6,
            4,   670,   873,    19,     2,    18,    40,   996,     6,
           94,     4,  3829,    13,    13,   373,    85,  2664,  9647,
          109,   181,  1981,    15,     2,  9093,     3,  1233,   528,
            6,    27, 69648,    13,    13,   269,     9,     7,    85,
           59,     3, 11837, 28830,   102,   140,    26,    75,   179,
            2,  1802,    98, 11837,  2937,   181,    53,  9093,    72,
         2664,    58,  4565,    13,   351,    98,    11,    18,     7,
          242,   155,   369, 57012,     5,     4,   872,   145,   187,
          873,    12,   107,   374,   991,    17,   123,  4238,    46,
           31,   123,  8005,  1002,    22, 34027,   123,  3014,    62,
          132,   255,  2129,   996,     6, 24254,    13,    48,    24,
           47,   935,     6,    27,   246,     3,     2,   179,    55,
           44,  3187,    77, 

In [None]:
# Get the unique words in the vocabulary
words_in_vocab = text_vectorizer.get_vocabulary()
top_5_words = words_in_vocab[:5] # most common tokens (notice the [UNK] token for "unknown" words)
bottom_5_words = words_in_vocab[-5:] # least common tokens
print(f"Number of words in vocab: {len(words_in_vocab)}")
print(f"Top 5 most common words: {top_5_words}") 
print(f"Bottom 5 least common words: {bottom_5_words}")

Number of words in vocab: 160802
Top 5 most common words: ['', '[UNK]', 'the', 'and', 'a']
Bottom 5 least common words: ['0001', '000001', '00000001', '\x10own', '\x08\x08\x08\x08a']
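The vocabulary output shows index 0 reserved for padding (`''`), index 1 for unknown words (`[UNK]`), and the remaining tokens ordered by frequency. The sketch below imitates that behaviour in plain Python to make the mapping explicit. It is an approximation, not the Keras implementation: the punctuation set, the alphabetical tie-breaking, and all function names are assumptions.

```python
import string
from collections import Counter


def standardize(text):
    """Lowercase and strip punctuation, roughly like 'lower_and_strip_punctuation'."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))


def build_vocab(texts, max_tokens=None):
    """Index 0 = padding, index 1 = unknown, then tokens by descending frequency."""
    counts = Counter(tok for text in texts for tok in standardize(text).split())
    # Ties are broken alphabetically here for determinism (an assumption)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    vocab = ["", "[UNK]"] + ordered
    return vocab[:max_tokens] if max_tokens else vocab


def vectorize(text, vocab, sequence_length):
    """Map tokens to ids, truncate to sequence_length, and pad with zeros."""
    lookup = {tok: i for i, tok in enumerate(vocab)}
    ids = [lookup.get(tok, 1) for tok in standardize(text).split()]
    ids = ids[:sequence_length]
    return ids + [0] * (sequence_length - len(ids))


# Made-up corpus for illustration
toy_corpus = ["The movie was great!", "The movie was bad.", "Great acting, great plot."]
toy_vocab = build_vocab(toy_corpus)
print(toy_vocab)
print(vectorize("There's a great flood!", toy_vocab, sequence_length=6))
```

Words never seen during `adapt` fall back to id 1, and short inputs are right-padded with 0, which matches the long run of zeros in the vectorized sample sentence.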


In [None]:
# Embedding layer
tf.random.set_seed(42)
from tensorflow.keras import layers

embedding = layers.Embedding(input_dim = max_vocab_length, # size of the vocabulary (number of distinct token ids)
                             output_dim = 128, # set size of embedding vector
                             embeddings_initializer="uniform", # default, initialize randomly
                             input_length=max_length, # how long is each input
                             name="embedding_1")

embedding

<keras.layers.embeddings.Embedding at 0x7fa55eb29810>

In [None]:
# Get a random sentence from training set
random_sentence = random.choice(train_sentences)
print(f"Original text:\n{random_sentence}\
      \n\nEmbedded version:")

# Embed the random sentence (turn it into numerical representation)
sample_embed = embedding(text_vectorizer([random_sentence]))
sample_embed

Original text:
["A wonderful film by Powell and Pressburger, whose work I now want to explore more. The film is about what we perceive as real and what is real, and how the two can be so difficult to distinguish from one another. Beautifully shot and acted, although David Niven doesn't seem to be 27 years old, as his character claims to be. Fun to see a very young Richard Attenborough. This film made me think, while I was watching it, and afterwards."]      

Embedded version:


<tf.Tensor: shape=(1, 231, 128), dtype=float32, numpy=
array([[[-0.04364428,  0.02437404, -0.03696011, ..., -0.04763393,
          0.02931459,  0.0068561 ],
        [ 0.00731655, -0.03667552,  0.04437304, ..., -0.00131422,
         -0.04961992, -0.02081001],
        [-0.03769413, -0.00289064, -0.03677417, ..., -0.04211248,
         -0.03247404,  0.04107854],
        ...,
        [ 0.01645621, -0.00589932, -0.01471175, ..., -0.02511839,
          0.00912381, -0.00024097],
        [ 0.01645621, -0.00589932, -0.01471175, ..., -0.02511839,
          0.00912381, -0.00024097],
        [ 0.01645621, -0.00589932, -0.01471175, ..., -0.02511839,
          0.00912381, -0.00024097]]], dtype=float32)>
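Notice that the last rows of the embedded output are identical. The review is shorter than 231 tokens, so the trailing positions are all padding (token id 0), and every padding token looks up the same row of the embedding table. A minimal pure-Python sketch of an embedding lookup, assuming a small table initialised uniformly in [-0.05, 0.05]; the function names and sizes are hypothetical.

```python
import random


def make_embedding_table(vocab_size, dim, seed=42):
    """One random vector per token id, drawn uniformly from [-0.05, 0.05]."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(vocab_size)]


def embed(token_ids, table):
    """An embedding layer is a table lookup: token id -> row of the table."""
    return [table[token_id] for token_id in token_ids]


table = make_embedding_table(vocab_size=10, dim=4)
token_ids = [3, 7, 0, 0]          # two real tokens followed by two padding tokens
embedded = embed(token_ids, table)

print(len(embedded), len(embedded[0]))   # sequence length, embedding size
print(embedded[2] == embedded[3])        # both padding positions share one vector
```

During training these rows are updated like any other weights, so the vectors only become meaningful once the layer has been trained as part of a model.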

### OR Use Universal Sentence Encoder from TF Hub (pretrained embeddings)

In [None]:
import tensorflow_hub as hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

In [None]:
sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[],
                                        dtype=tf.string,
                                        trainable=False, # False = feature extraction (frozen weights); set True to fine-tune
                                        name="USE")

## Create model using Sequential API and encoding layer from USE

In [None]:
# Create model using Sequential API and encoder layer from USE
model_USE = tf.keras.Sequential([
                                 sentence_encoder_layer,
                                 layers.Dense(64, activation="relu"),
                                 layers.Dense(1, activation="sigmoid")
], name="model_USE")

# Compile
model_USE.compile(loss="binary_crossentropy",
                  optimizer=tf.keras.optimizers.Adam(),
                  metrics=["accuracy"])

model_USE.summary()

Model: "model_USE"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 USE (KerasLayer)            (None, 512)               256797824 
                                                                 
 dense (Dense)               (None, 64)                32832     
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
                                                                 
Total params: 256,830,721
Trainable params: 32,897
Non-trainable params: 256,797,824
_________________________________________________________________


In [None]:
history_USE = model_USE.fit(train_sentences,
                            train_labels_encoded,
                            epochs=5,
                            validation_data=(test_sentences, test_labels_encoded))

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


### Make predictions

In [None]:
# Make preds
model_USE_pred_probs = model_USE.predict(test_sentences)
model_USE_pred_probs

array([[0.66395766],
       [0.9724653 ],
       [0.03029633],
       ...,
       [0.3844333 ],
       [0.04628875],
       [0.9414945 ]], dtype=float32)

In [None]:
model_USE_preds = tf.squeeze(tf.round(model_USE_pred_probs))
model_USE_preds[:10]

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([1., 1., 0., 1., 0., 1., 1., 0., 0., 0.], dtype=float32)>
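Rounding the sigmoid outputs is equivalent to applying a 0.5 decision threshold: probabilities above 0.5 become class 1 (`positive`), the rest class 0 (`negative`). A minimal sketch in plain Python, using a few of the probabilities printed above; `probs_to_labels` is a hypothetical helper, and making the threshold a parameter is an addition for illustration.

```python
def probs_to_labels(pred_probs, threshold=0.5):
    """Convert a column of sigmoid probabilities, shape (n, 1), into 0/1 labels."""
    return [1 if row[0] > threshold else 0 for row in pred_probs]


# Values copied from the prediction output above
sample_probs = [[0.66395766], [0.9724653], [0.03029633],
                [0.3844333], [0.04628875], [0.9414945]]

print(probs_to_labels(sample_probs))
print(probs_to_labels(sample_probs, threshold=0.7))  # a stricter cut-off for "positive"
```

Raising the threshold trades recall on the positive class for precision, which can be useful when false positives are more costly than false negatives.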

### Evaluate Results

In [None]:
model_USE_results = calculate_results(test_labels_encoded, model_USE_preds)
model_USE_results

{'accuracy': 86.36,
 'f1': 0.8635593341435942,
 'precision': 0.8643254352944936,
 'recall': 0.8636}
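The recall (0.8636) equals the accuracy (86.36%) exactly, which is what support-weighted averaging over classes produces. The sketch below assumes `calculate_results` uses that kind of weighted averaging; it is a pure-Python illustration with made-up labels, not the helper's actual code.

```python
def weighted_metrics(y_true, y_pred):
    """Accuracy (in %) plus support-weighted precision, recall, and F1."""
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n
    precision = recall = f1 = 0.0
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in pairs)   # true positives for class c
        predicted = sum(p == c for p in y_pred)         # times class c was predicted
        support = sum(t == c for t in y_true)           # times class c actually occurs
        prec_c = tp / predicted if predicted else 0.0
        rec_c = tp / support
        f1_c = 2 * prec_c * rec_c / (prec_c + rec_c) if (prec_c + rec_c) else 0.0
        weight = support / n                            # weight each class by its support
        precision += weight * prec_c
        recall += weight * rec_c
        f1 += weight * f1_c
    return {"accuracy": accuracy * 100, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for illustration
toy_true = [1, 1, 1, 0, 0, 0, 1, 0]
toy_pred = [1, 1, 0, 0, 0, 1, 1, 0]

toy_results = weighted_metrics(toy_true, toy_pred)
print(toy_results)
```

With support weights, the per-class recalls sum to the fraction of correct predictions, so weighted recall and accuracy always agree.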

* For CNNs for text classification using Conv1D and MaxPooling, see https://dev.mrdbourke.com/tensorflow-deep-learning/08_introduction_to_nlp_in_tensorflow/#convolutional-neural-networks-for-text

* For RNNs for text classification using LSTM, GRU, or Bidirectional layers, see https://dev.mrdbourke.com/tensorflow-deep-learning/08_introduction_to_nlp_in_tensorflow/#recurrent-neural-networks-rnns

In [None]:
model_USE.save("/content/drive/MyDrive/Test_Models/model_USE.h5")