# Deep Learning Fundamentals 11 - Important Features of Tensorflow

Welcome to another notebook on Deep Learning Fundamentals. In this notebook, we will be talking about the Data API of Tensorflow which lets us easily manage data in a pipeline, and Tensorflow datasets which provide us with an easy way to download commonly used datasets. Moreover, we will learn more about Preprocessing layers, Embeddings, and TensorflowHub while exploring Tensorflow Data/Datasets. Let's get started.

In [1]:
import sklearn
import numpy as np
import os
import tensorflow as tf
from tensorflow import keras
import matplotlib as mpl
import matplotlib.pyplot as plt

## The Tensorflow Data API

In this part, we will work through a short case study in which we pretend the dataset is too large to fit into memory. Before getting to that, I would like to show some methods that are used heavily with the Data API.

In [2]:
dataset = tf.data.Dataset.range(10)

In [3]:
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
tf.Tensor([8 9], shape=(2,), dtype=int64)


When we want to transform the instances in our dataset, we can use the `map()` and `apply()` methods. They do almost the same thing, with one difference:

* While the `map()` method applies a transformation to each item, the `apply()` method applies a transformation to the dataset as a whole. - [Géron, A. (2019)](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)
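To make the difference concrete, here is a minimal sketch on a toy dataset. The helper name `keep_even_and_batch` is my own and only serves the illustration: the function given to `map()` receives one element at a time, whereas the function given to `apply()` receives the whole dataset and returns a new one.

```python
import tensorflow as tf

numbers = tf.data.Dataset.range(6)

# map(): the function is called on each element and returns the new element.
doubled = numbers.map(lambda x: x * 2)

# apply(): the function is called once on the whole dataset and must
# return a dataset, so it can chain several dataset-level operations.
def keep_even_and_batch(ds):
    return ds.filter(lambda x: x % 2 == 0).batch(2)

even_batches = numbers.apply(keep_even_and_batch)

doubled_values = [int(x) for x in doubled.as_numpy_iterator()]
even_values = [batch.tolist() for batch in even_batches.as_numpy_iterator()]
print(doubled_values)  # [0, 2, 4, 6, 8, 10]
print(even_values)     # [[0, 2], [4]]
```

In practice, `apply()` is handy for packaging a reusable chain of dataset operations into one function.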

In [4]:
dataset = dataset.map(lambda x: x * 2)
dataset

<MapDataset element_spec=TensorSpec(shape=(None,), dtype=tf.int64, name=None)>

We can also use the `filter()` method to filter the dataset.

In [5]:
dataset = dataset.unbatch()
dataset = dataset.filter(lambda x: x < 10)  # keep only items < 10
for item in dataset.take(3):
    print(item)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)


We can use the `shuffle()` method to shuffle our data.

In [6]:
tf.random.set_seed(42)

dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=3, seed=42).batch(7)
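The shuffled dataset is never iterated in the cell, so here is a small sketch that rebuilds it and prints its batches. `shuffle()` fills a buffer with `buffer_size` items and draws the next item at random from that buffer, so with a buffer of only 3 items an element cannot move far from its original position. I do not list the expected output because the exact order may depend on the Tensorflow version; the variable names `batches` and `flat` are my own.

```python
import tensorflow as tf

tf.random.set_seed(42)

dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size=3, seed=42).batch(7)

# Collect the batches as plain Python lists so they are easy to inspect.
batches = [batch.numpy().tolist() for batch in dataset]
for batch in batches:
    print(batch)

# Shuffling only reorders: no item is dropped or duplicated.
flat = [value for batch in batches for value in batch]
print(sorted(flat) == sorted(list(range(10)) * 3))
```

For a well-mixed dataset, the buffer should be large compared to the dataset (or at least to the run lengths of similar items), which is why the pipeline later in this notebook uses a much larger buffer.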

## Creating a pipeline with Tensorflow

In this part, I will load the California Housing dataset and split it into several files. This serves as an example for the case where a dataset is too large to fit in memory: we can first split the data into many files, then let Tensorflow read these files in parallel.

Let's load the dataset first and split it into training, validation, and test sets.

In [7]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_


Now let's split the training set into 20 CSV files, and the validation and test sets into 10 CSV files each.

In [8]:
def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.path.join("datasets", "housing")
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")

    filepaths = []
    m = len(data)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filepaths.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths
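The helper relies on two small building blocks: `np.array_split()`, which cuts the row indices into nearly equal parts even when the number of rows is not divisible by the number of parts, and a format string that zero-pads the file index. A minimal sketch with 10 rows and 3 parts (numbers chosen only for the illustration):

```python
import numpy as np

# Split 10 row indices into 3 parts; the first parts get the extra rows.
parts = np.array_split(np.arange(10), 3)
sizes = [len(part) for part in parts]
print(sizes)             # [4, 3, 3]
print(parts[0].tolist()) # [0, 1, 2, 3]

# Same naming pattern as in the helper: prefix plus a two-digit index.
file_name = "my_{}_{:02d}.csv".format("train", 3)
print(file_name)         # my_train_03.csv
```

Unlike `np.split()`, `np.array_split()` does not raise an error when the split is uneven, which is what we want here.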

In [9]:
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_multiple_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts=10)

Let's look at one of the files.

In [10]:
import pandas as pd

pd.read_csv(train_filepaths[1]).head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedianHouseValue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.6875 | 44.0 | 4.524476 | 0.993007 | 457.0 | 3.195804 | 34.04 | -118.15 | 1.625 |
| 1 | 3.9917 | 25.0 | 5.576364 | 1.034545 | 1602.0 | 2.912727 | 38.33 | -122.82 | 2.441 |
| 2 | 1.69 | 18.0 | 4.423529 | 1.145882 | 994.0 | 2.338824 | 34.14 | -116.32 | 0.542 |
| 3 | 3.9643 | 52.0 | 4.79798 | 1.020202 | 467.0 | 2.358586 | 37.84 | -122.26 | 1.888 |
| 4 | 6.6704 | 25.0 | 6.829317 | 1.022088 | 1503.0 | 3.018072 | 33.88 | -117.99 | 2.406 |


In [11]:
train_filepaths

['datasets/housing/my_train_00.csv',
 'datasets/housing/my_train_01.csv',
 'datasets/housing/my_train_02.csv',
 'datasets/housing/my_train_03.csv',
 'datasets/housing/my_train_04.csv',
 'datasets/housing/my_train_05.csv',
 'datasets/housing/my_train_06.csv',
 'datasets/housing/my_train_07.csv',
 'datasets/housing/my_train_08.csv',
 'datasets/housing/my_train_09.csv',
 'datasets/housing/my_train_10.csv',
 'datasets/housing/my_train_11.csv',
 'datasets/housing/my_train_12.csv',
 'datasets/housing/my_train_13.csv',
 'datasets/housing/my_train_14.csv',
 'datasets/housing/my_train_15.csv',
 'datasets/housing/my_train_16.csv',
 'datasets/housing/my_train_17.csv',
 'datasets/housing/my_train_18.csv',
 'datasets/housing/my_train_19.csv']

Now let's build an input pipeline and add basic preprocessing.

In [12]:
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

Now we will use the `interleave()` method to read the files. Five files are read at a time, and the first row of each file (the header) is skipped.

In [13]:
n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers)

In [14]:
for line in dataset.take(5):
    print(line.numpy())

b'4.7361,7.0,7.464968152866242,1.1178343949044587,846.0,2.694267515923567,34.49,-117.27,1.745'
b'3.6641,17.0,5.577142857142857,1.1542857142857144,511.0,2.92,40.85,-121.07,0.808'
b'4.5909,16.0,5.475877192982456,1.0964912280701755,1357.0,2.9758771929824563,33.63,-117.71,2.418'
b'3.6875,44.0,4.524475524475524,0.993006993006993,457.0,3.195804195804196,34.04,-118.15,1.625'
b'2.3,25.0,5.828178694158075,0.9587628865979382,909.0,3.1237113402061856,36.25,-119.4,1.328'


* For interleaving to work best, it is preferable to have files of identical length; otherwise the ends of the longest files will not be interleaved. - [Géron, A. (2019)](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)

* `interleave()` does not use parallelism by default; it reads one line at a time from each file, sequentially. To read files in parallel, we can set the `num_parallel_calls` argument to the number of threads we want, or to `tf.data.experimental.AUTOTUNE` (in which case Tensorflow chooses the right number of threads dynamically based on the available CPU). - [Géron, A. (2019)](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)
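Below is a minimal, self-contained sketch of parallel interleaving. The three tiny files (`a.csv`, `b.csv`, `c.csv`) are made up for the illustration and written to a temporary directory. `tf.data.AUTOTUNE` is the newer alias of `tf.data.experimental.AUTOTUNE`, so I am assuming Tensorflow 2.4 or later here.

```python
import os
import tempfile

import tensorflow as tf

# Write three small files, each with a header and two data lines.
tmp_dir = tempfile.mkdtemp()
paths = []
for name in ["a", "b", "c"]:
    path = os.path.join(tmp_dir, name + ".csv")
    with open(path, "wt", encoding="utf-8") as f:
        f.write("header\n")
        f.write("{0}1\n{0}2\n".format(name))
    paths.append(path)

files = tf.data.Dataset.from_tensor_slices(paths)

# Read the three files at once, skipping each header, and let
# Tensorflow pick the number of reader threads.
lines = files.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=3,
    num_parallel_calls=tf.data.AUTOTUNE)

result = [line.numpy().decode("utf-8") for line in lines]
print(result)
```

The headers never show up in `result` because `skip(1)` is applied to each file separately, before the lines are interleaved.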

Now let's define a function that we will use for basic preprocessing.

In [15]:
n_inputs = 8 # X_train.shape[-1]

@tf.function
def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y
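The `record_defaults` argument deserves a closer look: it tells `tf.io.decode_csv()` how many columns to expect, their types, and what to do with missing values. A float such as `0.` means "float column, use 0 if the value is missing", while an empty float tensor means "float column, value required". Here is a minimal sketch on a made-up three-column line (the values are only for illustration):

```python
import tensorflow as tf

# Two optional float columns (default 0.0) followed by one required column.
defs = [0., 0., tf.constant([], dtype=tf.float32)]

full = tf.io.decode_csv("1.5,2.5,3.5", record_defaults=defs)
missing = tf.io.decode_csv(",2.5,3.5", record_defaults=defs)

# decode_csv returns one scalar tensor per column; stack them into
# a feature vector and a one-element target vector, as in preprocess().
x = tf.stack(full[:-1])
y = tf.stack(full[-1:])

print(x.numpy())          # [1.5 2.5]
print(y.numpy())          # [3.5]
print(float(missing[0]))  # 0.0, the default replaced the missing value
```

Making the target column required is a deliberate choice: a missing label raises an error instead of silently becoming 0.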

Now it is time to combine everything into one function.

In [16]:
def csv_reader_dataset(filepaths, repeat=1, n_readers=5,
                       n_read_threads=None, shuffle_buffer_size=10000,
                       n_parse_threads=5, batch_size=32):
    dataset = tf.data.Dataset.list_files(filepaths).repeat(repeat)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers, num_parallel_calls=n_read_threads)
    dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)

At the end, we call `prefetch(1)`. This creates a dataset that always tries to stay one batch ahead: while the training algorithm is working on one batch, the dataset is already preparing the next one in parallel. This can improve performance dramatically.

Let's use the function on the training set.

In [17]:
tf.random.set_seed(42)

train_set = csv_reader_dataset(train_filepaths, batch_size=3)
for X_batch, y_batch in train_set.take(2):
    print("X =", X_batch)
    print("y =", y_batch)
    print()

X = tf.Tensor(
[[ 0.5804519  -0.20762321  0.05616303 -0.15191229  0.01343246  0.00604472
   1.2525111  -1.3671792 ]
 [ 5.818099    1.8491895   1.1784915   0.28173092 -1.2496178  -0.3571987
   0.7231292  -1.0023477 ]
 [-0.9253566   0.5834586  -0.7807257  -0.28213993 -0.36530012  0.27389365
  -0.76194876  0.72684526]], shape=(3, 8), dtype=float32)
y = tf.Tensor(
[[1.752]
 [1.313]
 [1.535]], shape=(3, 1), dtype=float32)

X = tf.Tensor(
[[-0.8324941   0.6625668  -0.20741376 -0.18699841 -0.14536144  0.09635526
   0.9807942  -0.67250353]
 [-0.62183803  0.5834586  -0.19862501 -0.3500319  -1.1437552  -0.3363751
   1.107282   -0.8674123 ]
 [ 0.8683102   0.02970133  0.3427381  -0.29872298  0.7124906   0.28026953
  -0.72915536  0.86178064]], shape=(3, 8), dtype=float32)
y = tf.Tensor(
[[0.919]
 [1.028]
 [2.182]], shape=(3, 1), dtype=float32)



Let's use the function on all the sets.

In [18]:
train_set = csv_reader_dataset(train_filepaths, repeat=None)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)

Let's train a simple model

In [19]:
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential([
    keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]),
    keras.layers.Dense(1),
])

In [20]:
model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))

batch_size = 32
model.fit(train_set, steps_per_epoch=len(X_train) // batch_size, epochs=10,
          validation_data=valid_set)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7fe015779e10>

In [21]:
model.evaluate(test_set, steps=len(X_test) // batch_size)



0.4787752032279968

In [22]:
new_set = test_set.map(lambda X, y: X) # we could instead just pass test_set, Keras would ignore the labels
X_new = X_test
model.predict(new_set, steps=len(X_new) // batch_size)



array([[3.8256302],
       [2.41031  ],
       [1.0489031],
       ...,
       [3.1952968],
       [1.4562888],
       [3.159451 ]], dtype=float32)

As we did previously, we can also build our own training loop.

In [23]:
optimizer = keras.optimizers.Nadam(learning_rate=0.01)
loss_fn = keras.losses.mean_squared_error

@tf.function
def train(model, n_epochs, batch_size=32,
          n_readers=5, n_read_threads=5, shuffle_buffer_size=10000, n_parse_threads=5):
    train_set = csv_reader_dataset(train_filepaths, repeat=n_epochs, n_readers=n_readers,
                       n_read_threads=n_read_threads, shuffle_buffer_size=shuffle_buffer_size,
                       n_parse_threads=n_parse_threads, batch_size=batch_size)
    n_steps_per_epoch = len(X_train) // batch_size
    total_steps = n_epochs * n_steps_per_epoch
    global_step = 0
    for X_batch, y_batch in train_set.take(total_steps):
        global_step += 1
        if tf.equal(global_step % 100, 0):
            tf.print("\rGlobal step", global_step, "/", total_steps)
        with tf.GradientTape() as tape:
            y_pred = model(X_batch)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

train(model, 5)

Global step 100 / 1810
Global step 200 / 1810
Global step 300 / 1810
Global step 400 / 1810
Global step 500 / 1810
Global step 600 / 1810
Global step 700 / 1810
Global step 800 / 1810
Global step 900 / 1810
Global step 1000 / 1810
Global step 1100 / 1810
Global step 1200 / 1810
Global step 1300 / 1810
Global step 1400 / 1810
Global step 1500 / 1810
Global step 1600 / 1810
Global step 1700 / 1810
Global step 1800 / 1810


Lastly, there is a short description of each method in the `Dataset` class which we can easily access:

In [24]:
for m in dir(tf.data.Dataset):
    if not (m.startswith("_") or m.endswith("_")):
        func = getattr(tf.data.Dataset, m)
        if hasattr(func, "__doc__"):
            print("● {:21s}{}".format(m + "()", func.__doc__.split("\n")[0]))

● apply()              Applies a transformation function to this dataset.
● as_numpy_iterator()  Returns an iterator which converts all elements of the dataset to numpy.
● batch()              Combines consecutive elements of this dataset into batches.
● bucket_by_sequence_length()A transformation that buckets elements in a `Dataset` by length.
● cache()              Caches the elements in this dataset.
● cardinality()        Returns the cardinality of the dataset, if known.
● choose_from_datasets()Creates a dataset that deterministically chooses elements from `datasets`.
● concatenate()        Creates a `Dataset` by concatenating the given dataset with this dataset.
● element_spec()       The type specification of an element of this dataset.
● enumerate()          Enumerates the elements of this dataset.
● filter()             Filters this dataset according to `predicate`.
● flat_map()           Maps `map_func` across this dataset and flattens the result.
● from_generator()     Create

## Preprocessing Layers

I generally prefer to do preprocessing before constructing the model, but in some cases it can be useful to use preprocessing layers instead. They are especially handy (for me at least) for data augmentation.

In [25]:
import pandas as pd

def load_housing_data():
    csv_path = os.path.join("/content/housing.csv")
    return pd.read_csv(csv_path)

In [26]:
housing = load_housing_data()
housing.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


We can perform normalization by using Tensorflow. I will outline two ways here, the first one is to use `tf.feature_column.numeric_column()` and the other one is using `keras.layers.Lambda()`.

In [27]:
age_mean, age_std = X_mean[1], X_std[1]  # The median age is in column 1
housing_median_age = tf.feature_column.numeric_column(
    "housing_median_age", normalizer_fn=lambda x: (x - age_mean) / age_std)

In [28]:
means = np.mean(X_train, axis=0, keepdims=True)
stds = np.std(X_train, axis=0, keepdims=True)
eps = keras.backend.epsilon()
model = keras.models.Sequential([
 keras.layers.Lambda(lambda inputs: (inputs - means) / (stds + eps)),
])

We can also perform one-hot encoding easily with Tensorflow.

In [29]:
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype=tf.int64)

table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)


The code is quite simple. 

1. We started by defining our vocabulary (the list of possible categories) and then created a tensor with the corresponding indices (0 to 4).
2. We then passed the list of categories and their corresponding indices to create an initializer for the lookup table. Note: `KeyValueTensorInitializer()` is used when we already have the categories in memory; when the categories are listed in a text file, we should use `TextFileInitializer()` instead.
3. In the last two lines, we created the lookup table, specifying the number of out-of-vocabulary (OOV) buckets.

**Out-of-Vocabulary Buckets:** These are extra index slots reserved for categories that do not exist in the vocabulary. When the lookup table meets such a category, it computes a hash of the category and uses it to assign the category to one of the out-of-vocabulary buckets.

* In the case when we have a large number of categories, it may not be convenient to get the full list of categories. A good solution to that problem is to define the vocabulary based on a data sample rather than the whole training set and add some oov buckets for the other categories that were not in the data sample. The more unknown categories you expect to find during training, the more oov buckets you should use. Indeed, if there are not enough oov buckets, there will be collisions: different categories will end up in the same bucket, so the neural network will not be able to distinguish them (at least not based on this feature). - [Géron, A. (2019)](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)
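Collisions are easy to see with a tiny table. In the sketch below, the two-word vocabulary and the helper `make_table` are made up for the illustration. With a single oov bucket, every unknown category necessarily lands on the same index; with many buckets, the unknown categories are spread by their hash over the range reserved for oov indices.

```python
import tensorflow as tf

small_vocab = ["INLAND", "NEAR BAY"]

def make_table(num_oov_buckets):
    # Build a fresh initializer for each table to keep them independent.
    init = tf.lookup.KeyValueTensorInitializer(
        small_vocab, tf.range(len(small_vocab), dtype=tf.int64))
    return tf.lookup.StaticVocabularyTable(init, num_oov_buckets)

unknown = tf.constant(["DESERT", "FOREST", "ISLAND"])

# One bucket: all unknown categories collide on index len(small_vocab) = 2.
one_bucket_ids = make_table(1).lookup(unknown).numpy().tolist()
print(one_bucket_ids)  # [2, 2, 2]

# Many buckets: indices fall somewhere in [2, 102), chosen by hashing.
many_bucket_ids = make_table(100).lookup(unknown).numpy().tolist()
print(many_bucket_ids)

# Known categories keep their vocabulary index in both cases.
known_id = int(make_table(1).lookup(tf.constant(["INLAND"]))[0])
print(known_id)  # 0
```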

Let's do another example for one-hot encoding using `tf.one_hot()` to see oov buckets.

In [30]:
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)
cat_indices


<tf.Tensor: shape=(4,), dtype=int64, numpy=array([3, 5, 1, 1])>

The unknown category "DESERT" was mapped to one of the two oov buckets. Let's use `tf.one_hot()` to encode the indices now.

In [31]:
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)
cat_one_hot

<tf.Tensor: shape=(4, 7), dtype=float32, numpy=
array([[0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.]], dtype=float32)>

* As a rule of thumb, if the number of categories is lower than 10, then one-hot encoding is generally the way to go (but your mileage may vary!). If the number of categories is greater than 50 (which is often the case when you use hash buckets), then embeddings are usually preferable. In between 10 and 50 categories, you may want to experiment with both options and see which one works best for your use case.

### Embeddings

Embeddings are an important subject that we will cover in Sequence Models, but let's have a quick introduction to them here since this section is about encoding.

* An embedding is a trainable dense vector that represents a category. By default, embeddings are initialized randomly, so for example the "NEAR BAY" category could be represented initially by a random vector such as [0.131, 0.890], while the "NEAR OCEAN" category might be represented by another random vector such as [0.631, 0.791]. In this example, we use 2D embeddings, but the number of dimensions is a hyperparameter you can tweak.

What matters most is that embeddings are trainable: they gradually improve during training, so the network can find a better representation of each category, which in turn makes it easier to make accurate predictions. This is a subject of representation learning that we will talk more about later.

Word Embeddings: Word embeddings became very popular after Google researchers published the paper "Distributed Representations of Words and Phrases and Their Compositionality". In this paper, they trained a neural network to predict the words near any given word, and obtained astounding word embeddings. These word embeddings were so powerful that they even captured the concept of gender. The famous example: if you compute King - Man + Woman (adding and subtracting the embedding vectors of these words), the result will be very close to the embedding of the word Queen. On the other hand, word embeddings can also sometimes capture our worst biases (see this related [paper](https://arxiv.org/abs/1905.09866)).

Let's implement a simple embedding.

We first need to create an embedding matrix that has one row per category and per oov bucket, and one column per embedding dimension.

In [32]:
embedding_dim = 2
embed_init = tf.random.uniform([len(vocab) + num_oov_buckets, embedding_dim])
embedding_matrix = tf.Variable(embed_init)


In this example, we have a 2D embedding, so the embedding matrix is a random 7 x 2 matrix (5 categories plus 2 oov buckets).

In [33]:
embedding_matrix

<tf.Variable 'Variable:0' shape=(7, 2) dtype=float32, numpy=
array([[0.7413678 , 0.62854624],
       [0.01738465, 0.3431449 ],
       [0.51063764, 0.3777541 ],
       [0.07321596, 0.02137029],
       [0.2871771 , 0.4710616 ],
       [0.6936141 , 0.07321334],
       [0.93251204, 0.20843053]], dtype=float32)>

Now let’s encode the same batch of categorical features as we did earlier, but using these embeddings.

In [34]:
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)
cat_indices

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([3, 5, 1, 1])>

In [35]:
tf.nn.embedding_lookup(embedding_matrix, cat_indices)

<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[0.07321596, 0.02137029],
       [0.6936141 , 0.07321334],
       [0.01738465, 0.3431449 ],
       [0.01738465, 0.3431449 ]], dtype=float32)>

The `tf.nn.embedding_lookup()` function looks up the rows of the embedding matrix at the given indices and returns the embedding for each index.

Instead of doing what we did above, we can also use `keras.layers.Embedding`, which manages the embedding matrix for us; it is randomly initialized and trainable by default. When it is called with some category indices, it returns the rows at those indices in the embedding matrix.

In [36]:
embedding = keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets,
                                   output_dim=embedding_dim)
embedding(cat_indices)


<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[ 0.01289039,  0.0191061 ],
       [ 0.01639661, -0.01945841],
       [ 0.00692506, -0.00518861],
       [ 0.00692506, -0.00518861]], dtype=float32)>

Let's put everything together.

In [37]:
regular_inputs = keras.layers.Input(shape=[8])
categories = keras.layers.Input(shape=[], dtype=tf.string)
cat_indices = keras.layers.Lambda(lambda cats: table.lookup(cats))(categories)
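# input_dim must cover every possible index: 5 categories + 2 oov buckets = 7
cat_embed = keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets,
                                   output_dim=2)(cat_indices)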
encoded_inputs = keras.layers.concatenate([regular_inputs, cat_embed])
outputs = keras.layers.Dense(1)(encoded_inputs)
model = keras.models.Model(inputs=[regular_inputs, categories],
 outputs=[outputs])


This model takes two inputs: a regular input containing eight numerical features per instance, and a categorical input containing one categorical feature per instance. It uses a Lambda layer to look up each category's index, then looks up the embeddings for these indices. To build the encoded input, we concatenate the embeddings with the regular inputs. And lastly, we add our dense output layer.


* One-hot encoding followed by a Dense layer (with no activation function and no biases) is equivalent to an Embedding layer. However, the Embedding layer uses way fewer computations (the performance difference becomes clear when the size of the embedding matrix grows). The Dense layer’s weight matrix plays the role of the embedding matrix. For example, using one-hot vectors of size 20 and a Dense layer with 10 units is equivalent to using an Embedding layer with input_dim=20 and output_dim=10. As a result, it would be wasteful to use more embedding dimensions than the number of units in the layer that follows the Embedding layer. - [Géron, A. (2019)](https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/)
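The equivalence is easy to verify with plain NumPy. In this minimal sketch, the random `weights` matrix plays the role of both the Dense layer's kernel and the embedding matrix, using the sizes from the example (20 categories, 10 units); the category indices are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_categories, embedding_dim = 20, 10

# Shared weights: Dense kernel (no bias, no activation) = embedding matrix.
weights = rng.normal(size=(n_categories, embedding_dim))
cat_ids = np.array([3, 5, 1, 1])

# Route 1: one-hot encode, then multiply by the weight matrix.
one_hot = np.eye(n_categories)[cat_ids]
dense_output = one_hot @ weights

# Route 2: simply pick the rows of the matrix at the given indices.
lookup_output = weights[cat_ids]

print(np.allclose(dense_output, lookup_output))  # True
```

The first route performs a full matrix multiplication in which almost every term is a multiplication by zero, whereas the second only reads four rows; this is where the performance difference comes from.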

## Keras Preprocessing Layers

Keras provides layers that we can use for preprocessing. The API includes layers such as `keras.layers.Normalization()` and `keras.layers.TextVectorization()`, which we used previously, as well as `keras.layers.Discretization()`, which we have not used before.
The Discretization layer chops continuous data into bins and can encode each bin as a one-hot vector; for instance, values split into (low, medium, high) would be encoded as [1, 0, 0], [0, 1, 0], and [0, 0, 1].
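To show the idea without depending on a particular Keras version, here is a sketch of the same binning logic in plain NumPy. This is not the Keras layer itself; the sample values and the two bin boundaries (100 and 300) are made up for the illustration.

```python
import numpy as np

values = np.array([50.0, 150.0, 400.0, 99.0])
boundaries = [100.0, 300.0]  # low < 100 <= medium < 300 <= high

# Index of the bin each value falls into: 0 = low, 1 = medium, 2 = high.
bin_ids = np.digitize(values, boundaries)
print(bin_ids.tolist())  # [0, 1, 2, 0]

# One-hot encode the bin indices: one column per bin.
one_hot = np.eye(len(boundaries) + 1)[bin_ids]
print(one_hot.tolist())
# [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```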


Additionally, we can chain multiple preprocessing layers by using the `PreprocessingStage` class.

The TextVectorization layer can also output word-count vectors instead of word indices. This representation is known as a bag of words, because the order of the words is lost. For instance, if the vocabulary is ["and", "basketball", "more"], then the text "more and more" is mapped to the vector [1, 0, 2], since "and" appears once, "basketball" does not appear, and "more" appears twice.

We can also use TF-IDF, a common technique that divides each word count by the log of the total number of training instances in which the word appears. This reduces the importance of frequent words: words like "and" appear in most texts even though they are the least interesting, so they should not be perceived as unnecessarily important by the algorithm.
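The bag-of-words example can be reproduced in a few lines of plain Python. The TF-IDF part follows the simplified description given in this notebook (count divided by the log of the number of training instances containing the word); the document frequencies in `doc_freq` are hypothetical numbers, and real implementations use somewhat different formulas.

```python
import math
from collections import Counter

vocabulary = ["and", "basketball", "more"]
text = "more and more"

# Bag of words: count each vocabulary word, ignoring word order.
counts = Counter(text.split())
bag_of_words = [counts[word] for word in vocabulary]
print(bag_of_words)  # [1, 0, 2]

# Hypothetical number of training instances in which each word appears.
doc_freq = {"and": 200, "basketball": 10, "more": 100}

# Simplified TF-IDF: divide each count by the log of the document frequency.
tf_idf = [count / math.log(doc_freq[word])
          for count, word in zip(bag_of_words, vocabulary)]
print(tf_idf)

# Weight of a single occurrence: frequent words count for less.
weight_per_occurrence = {word: 1 / math.log(df) for word, df in doc_freq.items()}
print(weight_per_occurrence["and"] < weight_per_occurrence["basketball"])  # True
```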

It is important to note that these layers are not differentiable and will be frozen during training. More information can be found on [GitHub](https://github.com/keras-team/governance/blob/master/rfcs/20190502-preprocessing-layers.md).


Tensorflow has many layers that we can use for a wide range of operations. See the [documentation](https://www.tensorflow.org/api_docs/python/tf/keras/layers).

## Tensorflow Datasets

We can easily load datasets that are available through Tensorflow Datasets. Let's look at the list of available datasets.

In [38]:
import tensorflow_datasets as tfds
print(tfds.list_builders())

['abstract_reasoning', 'accentdb', 'aeslc', 'aflw2k3d', 'ag_news_subset', 'ai2_arc', 'ai2_arc_with_ir', 'amazon_us_reviews', 'anli', 'answer_equivalence', 'arc', 'asqa', 'asset', 'assin2', 'bair_robot_pushing_small', 'bccd', 'beans', 'bee_dataset', 'beir', 'big_patent', 'bigearthnet', 'billsum', 'binarized_mnist', 'binary_alpha_digits', 'ble_wind_field', 'blimp', 'booksum', 'bool_q', 'c4', 'caltech101', 'caltech_birds2010', 'caltech_birds2011', 'cardiotox', 'cars196', 'cassava', 'cats_vs_dogs', 'celeb_a', 'celeb_a_hq', 'cfq', 'cherry_blossoms', 'chexpert', 'cifar10', 'cifar100', 'cifar10_1', 'cifar10_corrupted', 'citrus_leaves', 'cityscapes', 'civil_comments', 'clevr', 'clic', 'clinc_oos', 'cmaterdb', 'cnn_dailymail', 'coco', 'coco_captions', 'coil100', 'colorectal_histology', 'colorectal_histology_large', 'common_voice', 'coqa', 'cos_e', 'cosmos_qa', 'covid19', 'covid19sum', 'crema_d', 'criteo', 'cs_restaurants', 'curated_breast_imaging_ddsm', 'cycle_gan', 'd4rl_adroit_door', 'd4rl_ad

Let's download MNIST. 

In [39]:
datasets = tfds.load(name="mnist",data_dir='C:/')
mnist_train, mnist_test = datasets["train"], datasets["test"]
mnist_train = mnist_train.repeat(5).batch(32)
mnist_train = mnist_train.map(lambda items: (items["image"], items["label"]))
mnist_train = mnist_train.prefetch(1)
for images, labels in mnist_train.take(1):
    print(images.shape)
    print(labels.numpy())

(32, 28, 28, 1)
[4 1 0 7 8 1 2 7 1 6 6 4 7 7 3 3 7 9 9 1 0 6 6 9 9 4 8 9 4 7 3 3]


Items in the dataset are dictionaries containing both the features and the labels, but Keras expects each item to be a tuple. We used the `map()` method to transform each dictionary into a tuple. Alternatively, we can set the `as_supervised` argument to `True` so that this dictionary-to-tuple transformation is not needed.
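The dictionary-to-tuple step can be tried without downloading anything. In this sketch, `fake_split` is a made-up stand-in for a TFDS split: four blank 28 x 28 images with arbitrary labels, stored as dictionaries with the same `"image"` and `"label"` keys that MNIST uses.

```python
import tensorflow as tf

# Stand-in for a TFDS split: each element is a dict of features.
fake_split = tf.data.Dataset.from_tensor_slices({
    "image": tf.zeros([4, 28, 28, 1], dtype=tf.uint8),
    "label": tf.constant([4, 1, 0, 7], dtype=tf.int64),
})

# Turn each dict into an (inputs, labels) tuple, which is what Keras expects.
pairs = fake_split.map(lambda item: (item["image"], item["label"])).batch(2)

first_images, first_labels = next(iter(pairs))
print(first_images.shape)            # (2, 28, 28, 1)
print(first_labels.numpy().tolist()) # [4, 1]
```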

In [40]:
datasets = tfds.load(name="mnist", batch_size=32, as_supervised=True,data_dir='C:/')
mnist_train = datasets["train"].repeat().prefetch(1)
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28, 1]),
    keras.layers.Lambda(lambda images: tf.cast(images, tf.float32)),
    keras.layers.Dense(10, activation="softmax")])
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])
model.fit(mnist_train, steps_per_epoch=60000 // 32, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fe014e860d0>

## Tensorflow Hub

Tensorflow Hub is a repository of pretrained machine learning models that we can reuse in our own models.

Let's use a pretrained text embedding from Tensorflow Hub.

In [41]:
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

In [42]:
import tensorflow_hub as hub

hub_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim50/1",
                           output_shape=[50], input_shape=[], dtype=tf.string)

model = keras.Sequential()
model.add(hub_layer)
model.add(keras.layers.Dense(16, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 keras_layer (KerasLayer)    (None, 50)                48190600  
                                                                 
 dense (Dense)               (None, 16)                816       
                                                                 
 dense_1 (Dense)             (None, 1)                 17        
                                                                 
Total params: 48,191,433
Trainable params: 833
Non-trainable params: 48,190,600
_________________________________________________________________


In [43]:
sentences = tf.constant(["It was a great movie", "The actors were amazing"])
embeddings = hub_layer(sentences)

In [44]:
embeddings

<tf.Tensor: shape=(2, 50), dtype=float32, numpy=
array([[ 7.45939985e-02,  2.76720114e-02,  9.38646123e-02,
         1.25124469e-01,  5.40293928e-04, -1.09435350e-01,
         1.34755149e-01, -9.57818255e-02, -1.85177118e-01,
        -1.69703495e-02,  1.75612606e-02, -9.06603858e-02,
         1.12110220e-01,  1.04646273e-01,  3.87700424e-02,
        -7.71859884e-02, -3.12189370e-01,  6.99466765e-02,
        -4.88970093e-02, -2.99049795e-01,  1.31183028e-01,
        -2.12630898e-01,  6.96169436e-02,  1.63592950e-01,
         1.05169769e-02,  7.79720694e-02, -2.55230188e-01,
        -1.80790052e-01,  2.93739915e-01,  1.62875261e-02,
        -2.80566931e-01,  1.60284728e-01,  9.87277832e-03,
         8.44555616e-04,  8.39456245e-02,  3.24002892e-01,
         1.53253034e-01, -3.01048346e-02,  8.94618109e-02,
        -2.39153411e-02, -1.50188789e-01, -1.81733668e-02,
        -1.20483577e-01,  1.32937476e-01, -3.35325629e-01,
        -1.46504581e-01, -1.25251599e-02, -1.64428815e-01,
       