<a href="https://colab.research.google.com/github/JasperAD11/Sentiment-Across-Signals-Neural-Networks-vs.-LLMs/blob/main/imdb_starting.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Starting from section 4.1 of the book **Deep Learning with Python**

## Dataset

In [1]:
from tensorflow.keras.datasets import imdb    # imdb is a dataset with reviews (encoded in numbers)

In [5]:
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
# num_words keeps the TOP words only.
# In this case 10000 most frequent individual words in the dataset,
# This is: we keep all review, but inside each we only keep "top words"

Both train_data and test_data are **lists with nested lists inside**, each of the nested lists is a review, where each word is casted into a number.
(each word in the dictionary is equivalente to a specific number)

In [24]:
train_data[:2]

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]),
       list([1, 194, 1153, 194, 8255, 78, 228,

Both train_labels and test_labels are lists of 0/1 label, where 0=negative review and 1=positive review.

In [25]:
train_labels[1:3]

array([0, 0])

## Extra: decoding back to natural laguage


In [26]:
word_index = imdb.get_word_index()    # dict that maps each word to its code number

reverse_word_index = dict(            # dict inverting value and key (number to word)
    [(value, key) for (key, value) in word_index.items()])

# Decoding train_data[0]
decoded_review = " ".join(
    [reverse_word_index.get(i - 3, "?") for i in train_data[0]])

decoded_review

"? this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert ? is an amazing actor and now the same being director ? father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for ? and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also ? to the two little boy's that played the ? of norman and paul they were just brilliant children are often left out of the ? list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you th

## Turning the list of words(numbers) into tensors

The underlying idea is that we want to have a matrix with 10000 columns (one for each of the top10000 words) and a row for each review.

(The book uses 'sequence' instead of 'review')

In [28]:
import numpy as np

def vectorize_reviews(reviews, dimension=10000):     # dimension=10000 because there are 10000 different words
  results = np.zeros((len(reviews), dimension))      # starting with a tensor of all 0.(s)
  for i, review in enumerate(reviews):
    for j in review:                                # j is each index inside a nested list in data = each word in a review
      results[i, j] = 1.                            # inside a tensor inside "results", 1. is that word appers
      return results

x_train = vectorize_reviews(train_data)
x_test = vectorize_reviews(test_data)

x_train[0]

# After the transformation above, each review is now a RANK-1 TENSOR with 10000 dimensions.

array([0., 1., 0., ..., 0., 0., 0.])

Similarly, we need to vectorize labels. For that, we do the following:

In [29]:
y_train = np.asarray(train_labels).astype("float32")
y_test = np.asarray(test_labels).astype("float32")