## Training your own model with TensorFlow!

This exercise will be done using Google Colab, Python, and TensorFlow.



Please click on the play button below to make sure you have the necessary libraries.

In [None]:
!pip install tensorflow
!pip install matplotlib
import tensorflow as tf
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences


In this lesson, we'll classify movie reviews as negative or positive using the IMDB dataset! If you haven't heard of IMDB, it's a popular movie database website where users can rate and review movies.

Press the button below to download the dataset onto this Colab instance! It won't download to your actual machine.
It may take anywhere from 5 seconds to 1 minute to download.

In [None]:

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
print(f"Training Sample Count: {len(x_train)}")
print(f"Test Sample Count: {len(x_test)}")

This dataset has been split into train & test subsets.
The train subset is used to build the model. The test subset is never used to build the model; it is held back to check how accurate the model is on reviews it hasn't seen.

You should always have separate train & test subsets to make sure your model is accurate on new data, not just on the data it learned from. In this case, 50% of the data is used for testing and 50% is used for training.

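To get the dataset into a format that TensorFlow can use, every review needs to be the same length. Each review is stored as a list of numbers, one per word. This command trims any review longer than 256 words and pads shorter ones with zeros, so every review ends up exactly 256 numbers long.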

---



In [None]:
x_train = pad_sequences(x_train, maxlen=256)
x_test = pad_sequences(x_test, maxlen=256)
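
To see what padding and trimming actually do, here is a small pure-Python sketch that mimics `pad_sequences` with its default settings (zeros are added at the front, and extra numbers are dropped from the front). `pad_like_keras` is a made-up helper for illustration only, not part of TensorFlow, and the two reviews are invented.

```python
def pad_like_keras(sequences, maxlen, value=0):
    """Mimic the pad_sequences defaults: pad and trim at the front."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen numbers
        padded.append([value] * (maxlen - len(seq)) + seq)  # left-pad with zeros
    return padded

short_review = [1, 14, 22]                      # shorter than maxlen: gets zeros
long_review = [1, 14, 22, 16, 43, 530, 973]     # longer than maxlen: gets trimmed
result = pad_like_keras([short_review, long_review], maxlen=5)
print(result)
```

Both reviews come out exactly 5 numbers long, which is what the real call does with `maxlen=256`.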


Now let's actually create the model!

In [None]:
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

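model = Sequential([
    # Turn each of the 10,000 word numbers into a learned list of 64 values
    Embedding(input_dim=10000, output_dim=64, input_length=256),
    # Join the 256 word vectors into one long list of 256 * 64 values
    Flatten(),
    # A small hidden layer that looks for useful patterns
    Dense(16, activation='relu'),
    # One output between 0 (negative) and 1 (positive)
    Dense(1, activation='sigmoid')
])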



This might be a bit difficult to understand.
The model never sees words directly: each word in a review is represented by a number. The Embedding layer turns each of those numbers into a list of 64 values that it adjusts during training, so words used in similar ways (e.g. "mad" and "angry") can end up with similar values. That makes patterns easier for the model to recognize.
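
If it helps, an embedding can be pictured as a lookup table with one row per word number. The sketch below is a toy version in plain Python, assuming a 10-word vocabulary and 4 values per word (the model in this lesson uses 10,000 and 64). The values are random here; in the real model, training adjusts them.

```python
import random

vocab_size, embedding_dim = 10, 4   # toy sizes, chosen only for illustration

random.seed(0)
# One row of values per word number; training would adjust these values.
embedding_table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in range(vocab_size)]

review = [0, 0, 3, 7, 3]            # a padded review made of 5 word numbers
vectors = [embedding_table[i] for i in review]    # Embedding: look up one row per word
flattened = [x for vec in vectors for x in vec]   # Flatten: join the rows into one list

print(len(vectors), len(vectors[0]))
print(len(flattened))
```

The same word number (3 appears twice) always maps to the same row, and the 5 rows of 4 values become one list of 20 values before reaching the Dense layers.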

Flatten and Dense are outside the scope of this lesson, but feel free to look them up if you're interested!

Now we'll train the model using the training data we downloaded earlier! This may take a while. If you want it to go faster, you can decrease the number of epochs (the number of times training runs through the entire training set), but that may reduce accuracy!

In [None]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=4, batch_size=32, validation_data=(x_test, y_test))
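
To get a feel for how much work each epoch is, this small sketch counts the weight updates implied by the settings above. It is plain arithmetic, assuming the 25,000 training reviews that the IMDB dataset provides.

```python
import math

train_samples = 25000   # size of the IMDB training subset
batch_size = 32         # reviews processed before each weight update
epochs = 4              # full passes over the training subset

# The last batch of each epoch is smaller, so round up.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)
```

Halving `epochs` halves the total number of updates, which is why training finishes faster but the model has less opportunity to learn.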


Now that our model is trained, it's time to see if it can correctly classify our test data!




In [None]:
test_loss, test_accuracy = model.evaluate(x_test, y_test)
print(f'Accuracy: {round(test_accuracy*100, 2)}%')


The accuracy will change slightly each time you train the model, but for me it was 85.7% accurate. Not bad!
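
Accuracy is simply the fraction of reviews the model labels correctly, where an output above 0.5 counts as "positive". Here is a tiny sketch with made-up outputs and labels (they are not results from the model above):

```python
probabilities = [0.92, 0.15, 0.61, 0.48, 0.07]   # made-up model outputs
labels = [1, 0, 0, 1, 0]                         # made-up true answers

# Anything above 0.5 is treated as a "positive" prediction.
predictions = [1 if p > 0.5 else 0 for p in probabilities]
correct = sum(pred == label for pred, label in zip(predictions, labels))
accuracy = correct / len(labels)
print(f"Accuracy: {round(accuracy * 100, 2)}%")
```

Three of the five made-up predictions match their labels, so this prints an accuracy of 60.0%.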

Don't focus on the loss value for now; it measures how far the model's predictions are from the true labels (lower is better). Unlike accuracy, it isn't a percentage, so you can ignore it for the purposes of this exercise.
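
If you're curious anyway, here is a rough sketch of the binary cross-entropy formula for a single review, written in plain Python. The helper below is for illustration only; Keras computes the loss for you and averages it over many reviews.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Loss for one prediction; y_true is 0 or 1, y_pred is between 0 and 1."""
    y_pred = min(max(y_pred, eps), 1 - eps)   # avoid taking log(0)
    return -(y_true * math.log(y_pred) + (1 - y_true) * math.log(1 - y_pred))

# The review really is positive (label 1); try three different model outputs.
confident_right = binary_crossentropy(1, 0.9)
unsure = binary_crossentropy(1, 0.6)
confident_wrong = binary_crossentropy(1, 0.1)
print(round(confident_right, 3), round(unsure, 3), round(confident_wrong, 3))
```

A confident correct answer gets a small loss, while a confident wrong answer is penalized heavily.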

## Model in action

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Reload the raw (unpadded) reviews so they can be decoded back into text
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

word_index = imdb.get_word_index()
reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(encoded_review):
    return " ".join([reverse_word_index.get(i - 3, "?") for i in encoded_review])

def visualize_prediction(review, actual_label, model):
    padded_review = pad_sequences([review], maxlen=256)
    prediction = model.predict(padded_review)[0][0]
    actual_label = "Positive" if actual_label == 1 else "Negative"
    predicted_label = "Positive" if prediction > 0.5 else "Negative"

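    # Show only the first 256 characters of the decoded review
    review_text = decode_review(review)
    print(f"Review: {review_text[:256]}{'...' if len(review_text) > 256 else ''}")
    print(f"Actual: {actual_label}")
    print(f"Predicted: {predicted_label}")
    # Format after multiplying so the percentage prints cleanly
    print(f"Confidence: {prediction * 100:.0f}% chance it's Positive")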

for i in range(5):
    visualize_prediction(x_test[i], y_test[i], model)
    print('\n')


As mentioned before, the text is processed as numbers rather than words. The `decode_review` function converts those numbers back into words so humans can read the review.
The confidence is the model's output expressed as a percentage: how likely it thinks the review is positive.
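
The `i - 3` in `decode_review` is there because `imdb.load_data` reserves the first few numbers (0 for padding, 1 for the start of a review, 2 for unknown words) and shifts every word's number up by 3. The sketch below shows the round trip with a tiny made-up vocabulary; `toy_word_index`, `toy_encode`, and `toy_decode` are hypothetical names used only for this example.

```python
toy_word_index = {"the": 1, "movie": 2, "was": 3, "great": 4}   # made-up, tiny vocabulary
toy_reverse = {rank: word for word, rank in toy_word_index.items()}
INDEX_FROM = 3   # 0 = padding, 1 = start of review, 2 = unknown word

def toy_encode(text):
    # Start marker, then each known word shifted by 3; unknown words become 2.
    return [1] + [toy_word_index[w] + INDEX_FROM if w in toy_word_index else 2
                  for w in text.lower().split()]

def toy_decode(tokens):
    # Undo the shift; reserved numbers have no word, so they show up as "?".
    return " ".join(toy_reverse.get(t - INDEX_FROM, "?") for t in tokens)

encoded = toy_encode("The movie was surprisingly great")
print(encoded)
print(toy_decode(encoded))
```

The start marker and the unknown word "surprisingly" both decode to "?", which is why question marks appear in the decoded reviews.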

Type your own review here and see how the model categorizes it!

In [None]:
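def predict_custom_text(model, text):
    # Encode words the way imdb.load_data does: word rank + 3, with 2 standing
    # in for unknown words and words outside the 10,000-word vocabulary.
    tokens = []
    for word in text.lower().split():
        rank = word_index.get(word)
        tokens.append(rank + 3 if rank is not None and rank + 3 < 10000 else 2)
    padded_tokens = pad_sequences([tokens], maxlen=256)

    prediction = model.predict(padded_tokens)[0][0]
    predicted_label = "Positive" if prediction > 0.5 else "Negative"
    print(f"Text: {text}")
    print(f"Predicted Label: {predicted_label}")
    print(f"Confidence: {prediction * 100:.0f}% it's Positive")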

user_text = input("Type your own review: ")
predict_custom_text(model, user_text)
