# Get data

Use `tfds.load` to download a dataset by name. The catalog of available datasets and their details is at https://github.com/tensorflow/datasets/tree/master/docs/catalog.

Parameters:

* `with_info=True` also returns an info object with metadata about the dataset, such as label information and the encoder used.
* `as_supervised=True` returns each example as an `(input, label)` tuple instead of a dictionary of features, which is the form supervised learning needs.

In [1]:
import tensorflow_datasets as tfds 

imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)

04 examples/s][A
Generating unsupervised examples...:   7%|▋         | 3549/50000 [00:07<00:21, 2144.65 examples/s][A
Generating unsupervised examples...:   8%|▊         | 3782/50000 [00:07<00:22, 2089.13 examples/s][A
Generating unsupervised examples...:   8%|▊         | 4004/50000 [00:07<00:22, 2064.74 examples/s][A
Generating unsupervised examples...:   8%|▊         | 4220/50000 [00:07<00:21, 2090.64 examples/s][A
Generating unsupervised examples...:   9%|▉         | 4436/50000 [00:07<00:26, 1694.25 examples/s][A
Generating unsupervised examples...:   9%|▉         | 4622/50000 [00:07<00:26, 1730.12 examples/s][A
Generating unsupervised examples...:  10%|▉         | 4807/50000 [00:08<00:26, 1690.18 examples/s][A
Generating unsupervised examples...:  10%|▉         | 4985/50000 [00:08<00:27, 1659.55 examples/s][A
Generating unsupervised examples...:  10%|█         | 5198/50000 [00:08<00:25, 1784.41 examples/s][A
Generating unsupervised examples...:  11%|█         | 5401/50000

# Playing around with data

Splits available in the `imdb` dataset:

In [4]:
imdb.keys()

dict_keys([Split('train'), Split('test'), Split('unsupervised')])

In [8]:
print("No of training sentences : ", len(imdb["train"]))
print("No of testing sentences : ", len(imdb["test"]))

No of training sentences :  25000
No of testing sentences :  25000


# Creating train and test data

In [9]:
train_data = imdb["train"]
test_data = imdb["test"]

In [16]:
train_sentences = []
train_labels = []

# Each element is a (text, label) pair of tensors; convert both to plain values.
for sentence, label in train_data:
    train_sentences.append(sentence.numpy().decode('utf-8'))
    train_labels.append(label.numpy())

test_sentences = []
test_labels = []

# Same conversion for the test split.
for sentence, label in test_data:
    test_sentences.append(sentence.numpy().decode('utf-8'))
    test_labels.append(label.numpy())

print("No of training sentences : ", len(train_sentences))
print("No of testing sentences : ", len(test_sentences))

No of training sentences :  25000
No of testing sentences :  25000


# Defining train parameters

In [26]:
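vocab_length = 10000      # vocabulary size cap for the tokenizer and embedding layer
embedding_size = 16       # dimensions of each word vector
max_length = 120          # tokens per review after padding/truncating
oov_token = "<OOV>"       # placeholder for words outside the vocabulary
padding_type = "post"     # add zeros at the end of short sequences
truncate_type = "post"    # cut tokens from the end of long sequences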
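To make the two `"post"` settings concrete, here is a minimal pure-Python sketch of what they do to a single sequence. The helper `pad_post` is a hypothetical name used only for illustration; it is not part of Keras, and it assumes the padding value is 0.

```python
def pad_post(sequence, max_length, pad_value=0):
    """Keep the first max_length tokens, then pad at the end if too short."""
    truncated = sequence[:max_length]          # "post" truncation drops the tail
    missing = max_length - len(truncated)
    return truncated + [pad_value] * missing   # "post" padding appends zeros


short_seq = [5, 8, 2]
long_seq = [1, 2, 3, 4, 5, 6]

print(pad_post(short_seq, 4))  # [5, 8, 2, 0]
print(pad_post(long_seq, 4))   # [1, 2, 3, 4]
```

With `"pre"` instead, the zeros would go in front and the start of a long review would be dropped.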

# Creating tokenizer

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(
    num_words=vocab_length,
    oov_token=oov_token
)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
# Convert the training sentences to padded integer sequences
train_sequence = tokenizer.texts_to_sequences(train_sentences)
train_padded_sequence = pad_sequences(
    sequences=train_sequence,
    maxlen=max_length,
    padding=padding_type,
    truncating=truncate_type
)
# Convert the test sentences with the same tokenizer (fitted on training data only)
test_sequence = tokenizer.texts_to_sequences(test_sentences)
test_padded_sequence = pad_sequences(
    sequences=test_sequence,
    maxlen=max_length,
    padding=padding_type,
    truncating=truncate_type
)
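The effect of `num_words` and `oov_token` can be illustrated with a simplified pure-Python sketch. It is not the Keras implementation: `build_word_index` and `to_sequence` are hypothetical helpers, and the sketch only lowercases and splits on whitespace, whereas the real `Tokenizer` also filters punctuation. It assumes the OOV token takes index 1 and that words ranked at or beyond `num_words` are mapped to it.

```python
from collections import Counter


def build_word_index(texts, oov_token="<OOV>"):
    """Rank words by frequency; the OOV token is reserved as index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank
    return word_index


def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    """Map words to ids; unknown or too-rare words become the OOV id."""
    oov_id = word_index[oov_token]
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word, oov_id)
        ids.append(idx if idx < num_words else oov_id)
    return ids


texts = ["the movie was great", "the movie was bad", "the plot was thin"]
word_index = build_word_index(texts)
seq = to_sequence("the ending was great", word_index, num_words=4)
print(seq)  # [2, 1, 3, 1]
```

Here "ending" never appeared in the fitted texts and "great" is too rare to fit under `num_words=4`, so both map to the OOV id.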

# Defining the model

In [25]:
import tensorflow as tf 

model = tf.keras.models.Sequential([
    tf.keras.layers.Embedding(vocab_length, embedding_size, input_length=max_length),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])

model.compile(
    loss='binary_crossentropy',
    metrics=["accuracy"],
    optimizer='adam'
)
model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_3 (Embedding)      (None, 120, 16)           160000    
_________________________________________________________________
flatten_3 (Flatten)          (None, 1920)              0         
_________________________________________________________________
dense_6 (Dense)              (None, 6)                 11526     
_________________________________________________________________
dense_7 (Dense)              (None, 1)                 7         
=================================================================
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0
_________________________________________________________________
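The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the training parameters defined earlier; `hidden_units` is a name introduced here for the 6-unit dense layer, and the other derived variable names are illustrative only.

```python
vocab_length = 10000
embedding_size = 16
max_length = 120
hidden_units = 6  # size of the first Dense layer

# Embedding: one vector of embedding_size per vocabulary entry.
embedding_params = vocab_length * embedding_size

# Flatten has no weights; it reshapes (120, 16) into a single vector.
flatten_units = max_length * embedding_size

# Dense layers: one weight per input-output pair, plus one bias per output.
dense_hidden_params = flatten_units * hidden_units + hidden_units
dense_output_params = hidden_units * 1 + 1

total_params = embedding_params + dense_hidden_params + dense_output_params

print(embedding_params)     # 160000
print(flatten_units)        # 1920
print(dense_hidden_params)  # 11526
print(dense_output_params)  # 7
print(total_params)         # 171533
```

Most of the weights sit in the embedding table, so `vocab_length` and `embedding_size` dominate the model size.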
