##### Copyright 2019 The TensorFlow Authors.

Licensed under the Apache License, Version 2.0 (the "License");

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Load a pandas DataFrame


This tutorial provides examples of how to load <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html" class="external">pandas DataFrames</a> into TensorFlow.

You will use a small <a href="https://archive.ics.uci.edu/ml/datasets/heart+Disease" class="external">heart disease dataset</a> provided by the UCI Machine Learning Repository. There are several hundred rows in the CSV. Each row describes a patient, and each column describes an attribute. You will use this information to predict whether a patient has heart disease, which is a binary classification task.

## Read data using pandas

In [2]:
import pandas as pd
import tensorflow as tf

SHUFFLE_BUFFER = 500
BATCH_SIZE = 2



Download the CSV file containing the heart disease dataset:

In [3]:
# csv_file = tf.keras.utils.get_file('heart.csv', 'https://storage.googleapis.com/download.tensorflow.org/data/heart.csv')


Read the CSV file using pandas:

In [4]:
# url = '/content/sample_data/64_hand_labled_pairs.csv'
path = '../data/question_pairs/train.csv'
df = pd.read_csv(path)

This is what the data looks like:

In [5]:
df.head()

Unnamed: 0,id,qid1,qid2,question1,question2,is_duplicate
0,0,1,2,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...,0
1,1,3,4,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...,0
2,2,5,6,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...,0
3,3,7,8,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...,0
4,4,9,10,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?,0


In [6]:
df.dtypes

id               int64
qid1             int64
qid2             int64
question1       object
question2       object
is_duplicate     int64
dtype: object

You will build models to predict the label contained in the `is_duplicate` column.

In [7]:
label = df.pop('is_duplicate')

## A DataFrame as an array

If your data has a uniform datatype, or `dtype`, it's possible to use a pandas DataFrame anywhere you could use a NumPy array. This works because the `pandas.DataFrame` class supports the `__array__` protocol, and TensorFlow's `tf.convert_to_tensor` function accepts objects that support the protocol.

Select the two question columns from the dataset. Despite the `numeric_` prefix on the variable names, these columns contain text, which is tokenized into integers further below:

In [8]:
numeric_feature_names = ['question1', 'question2']
numeric_features = df[numeric_feature_names]
numeric_features.head()

Unnamed: 0,question1,question2
0,What is the step by step guide to invest in sh...,What is the step by step guide to invest in sh...
1,What is the story of Kohinoor (Koh-i-Noor) Dia...,What would happen if the Indian government sto...
2,How can I increase the speed of my internet co...,How can Internet speed be increased by hacking...
3,Why am I mentally very lonely? How can I solve...,Find the remainder when [math]23^{24}[/math] i...
4,"Which one dissolve in water quikly sugar, salt...",Which fish would survive in salt water?


The DataFrame can be converted to a NumPy array using the `DataFrame.values` property or `numpy.array(df)`. To convert it to a tensor, use `tf.convert_to_tensor`:

In [9]:
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

ans1 = []
ans2 = []
# num_words=256 keeps token ids below 256; rarer words map to the "<OOV>" id.
tokenizer = Tokenizer(num_words=256, oov_token="<OOV>")
# TODO: fit_on_texts only with the training data, not the test data.
# NOTE: fit_on_texts runs inside the loops, so the word index is rebuilt after
# every row and the same word can receive different ids in different rows.
for row in numeric_features['question1'].values[:15000]:
    tokenizer.fit_on_texts([row])
    tokend = tokenizer.texts_to_sequences([row])
    # Pad with zeros (or truncate) at the end so every question has 256 ids.
    ans1.append(pad_sequences(tokend, maxlen=256, padding='post', truncating='post'))
for row in numeric_features['question2'].values[:15000]:
    tokenizer.fit_on_texts([row])
    tokend = tokenizer.texts_to_sequences([row])
    ans2.append(pad_sequences(tokend, maxlen=256, padding='post', truncating='post'))
# Tokenize all strings
# training_text1 = ans1[0:40]
# testint_text1 = ans1[40:]
# training_text2 = ans2[0:40]
# testint_text2 = ans2[40:]
training_text1 = ans1
training_text2 = ans2
# model = Model()
texts = []
for i, j in zip(training_text1, training_text2):
    # np.append flattens both (1, 256) arrays into one 512-element vector.
    texts.append(np.append(i, j))

# One row per question pair, one column per token position (512 in total).
new_df = pd.DataFrame(texts)
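In the cell above, `fit_on_texts` runs once per row, so the vocabulary keeps changing while rows are being encoded. That is why the first three columns of rows 0 and 1 in the `new_df.head()` output differ (`4, 5, 6` versus `2, 3, 4`) even though both questions start with "What is the". The following is a minimal pure-Python sketch of the alternative, building the vocabulary once and encoding afterwards. The helpers `build_vocab` and `encode` are hypothetical stand-ins for illustration, not part of Keras, and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def build_vocab(texts, oov_token="<OOV>"):
    """Assign ids by descending frequency; id 0 is padding, id 1 is OOV."""
    counts = Counter(word for text in texts for word in text.lower().split())
    vocab = {oov_token: 1}
    for word, _ in counts.most_common():
        vocab[word] = len(vocab) + 1
    return vocab


def encode(text, vocab, max_len, oov_token="<OOV>"):
    """Map words to ids, then truncate or zero-pad at the end to max_len."""
    ids = [vocab.get(word, vocab[oov_token]) for word in text.lower().split()]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


# Made-up example questions.
questions = ["what is the step", "what is the story", "why is it so"]
vocab = build_vocab(questions)  # fit once, on every text
encoded = [encode(q, vocab, max_len=6) for q in questions]

# The shared prefix "what is the" now gets the same ids in both rows.
print(encoded[0][:3] == encoded[1][:3])
```

With the vocabulary fixed before encoding, a given id always refers to the same word, which is what a downstream model needs in order to learn anything from the ids.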


In [10]:
texts

[array([ 4,  5,  6,  2,  7,  2,  8,  9, 10,  3, 11, 12,  3, 13,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0, 

In [11]:
new_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,502,503,504,505,506,507,508,509,510,511
0,4,5,6,2,7,2,8,9,10,3,...,0,0,0,0,0,0,0,0,0,0
1,2,3,4,14,15,16,17,18,19,20,...,0,0,0,0,0,0,0,0,0,0
2,21,22,8,23,2,24,7,25,26,27,...,0,0,0,0,0,0,0,0,0,0
3,32,33,2,34,35,36,9,10,2,37,...,0,0,0,0,0,0,0,0,0,0
4,39,40,41,4,42,43,44,45,46,47,...,0,0,0,0,0,0,0,0,0,0


In [12]:
tf.convert_to_tensor(new_df)

<tf.Tensor: shape=(15000, 512), dtype=int32, numpy=
array([[ 4,  5,  6, ...,  0,  0,  0],
       [ 2,  3,  4, ...,  0,  0,  0],
       [21, 22,  8, ...,  0,  0,  0],
       ...,
       [12, 45, 95, ...,  0,  0,  0],
       [ 6, 84,  1, ...,  0,  0,  0],
       [17, 10, 55, ...,  0,  0,  0]], dtype=int32)>

In general, if an object can be converted to a tensor with `tf.convert_to_tensor` it can be passed anywhere you can pass a `tf.Tensor`.

### Quora Siamese model

In [None]:
import tensorflow as tf
from sklearn.metrics import roc_auc_score
from tensorflow.keras.layers import (Concatenate, Dense, Dropout, Embedding,
                                     Flatten, Input, Lambda, LSTM, Multiply,
                                     Subtract)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam


def cosine_distance(vests):
    # L2-normalize both vectors, then return the negative mean of their
    # elementwise product (the negative cosine similarity scaled by 1/dim).
    x, y = vests
    x = tf.math.l2_normalize(x, axis=-1)
    y = tf.math.l2_normalize(y, axis=-1)
    return -tf.reduce_mean(x * y, axis=-1, keepdims=True)


def cos_dist_output_shape(shapes):
    # The Lambda layer emits one value per example.
    shape1, shape2 = shapes
    return (shape1[0], 1)


def auroc(y_true, y_pred):
    # Wrap scikit-learn's roc_auc_score so it can serve as a Keras metric;
    # tf.numpy_function is the TensorFlow 2 counterpart of tf.py_func.
    return tf.numpy_function(roc_auc_score, (y_true, y_pred), tf.double)
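`cosine_distance` above L2-normalizes both vectors and then takes the negative *mean* of their elementwise product. Cosine similarity is the *sum* of that product, so the function returns the negative cosine similarity divided by the vector length. A minimal pure-Python sketch of that relationship follows; the helper names are hypothetical and only illustrate the arithmetic for a single pair of non-zero vectors.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def cosine_similarity(x, y):
    x, y = l2_normalize(x), l2_normalize(y)
    return sum(a * b for a, b in zip(x, y))


def negative_mean_product(x, y):
    """Mirrors the notebook's cosine_distance for one pair of vectors."""
    x, y = l2_normalize(x), l2_normalize(y)
    return -sum(a * b for a, b in zip(x, y)) / len(x)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # parallel to u

similarity = cosine_similarity(u, v)
scaled_distance = negative_mean_product(u, v)
# The two quantities differ only by the sign and the factor 1/len(u).
print(math.isclose(scaled_distance, -similarity / len(u)))
```

Because the scaling factor is constant for a fixed vector length, the mean-based variant still ranks pairs in the same order as the true cosine similarity; only the magnitude of the feature changes.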


In [50]:
input_1 = Input(shape=(training_text1.shape[1],))
input_2 = Input(shape=(training_text2.shape[1],))


common_embed = Embedding(name="synopsis_embedd",input_dim =len(t.word_index)+1, 
                       output_dim=len(embeddings_index['no']),weights=[embedding_matrix], 
                       input_length=training_text1.shape[1],trainable=False) 
lstm_1 = common_embed(input_1)
lstm_2 = common_embed(input_2)


common_lstm = LSTM(64,return_sequences=True, activation="relu")
vector_1 = common_lstm(lstm_1)
vector_1 = Flatten()(vector_1)

vector_2 = common_lstm(lstm_2)
vector_2 = Flatten()(vector_2)


NameError: name 'Input' is not defined

In [None]:

x3 = Subtract()([vector_1, vector_2])
x3 = Multiply()([x3, x3])

x1_ = Multiply()([vector_1, vector_1])
x2_ = Multiply()([vector_2, vector_2])
x4 = Subtract()([x1_, x2_])
    
# Cosine-distance Lambda layer, see https://stackoverflow.com/a/51003359/10650182
x5 = Lambda(cosine_distance, output_shape=cos_dist_output_shape)([vector_1, vector_2])

conc = Concatenate(axis=-1)([x5,x4, x3])

x = Dense(100, activation="relu", name='conc_layer')(conc)
x = Dropout(0.01)(x)
out = Dense(1, activation="sigmoid", name = 'out')(x)

quora_model = Model([input_1, input_2], out)

quora_model.compile(loss="binary_crossentropy", metrics=['acc',auroc], optimizer=Adam(0.00001))

### My Model

In [13]:
model = tf.keras.Sequential()
# Each example is the two padded 256-token questions concatenated.
model.add(tf.keras.Input(shape=(256*2,)))
model.add(tf.keras.layers.Dense(256*2, activation='relu'))
model.add(tf.keras.layers.Dense(256*2, activation='relu'))
model.add(tf.keras.layers.Dense(256*2, activation='relu'))
# model.add(tf.keras.layers.Dense(4, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# model.build(input_shape=(1,256))
# plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)



In [42]:
# new_df.shape
model.fit(new_df[:9000], label[:9000], epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7f13437ec4f0>

### Evaluate my Model

In [45]:
# new_df.values[4].shape
# new_df.values[4]

# with open('model_summary.txt', mode='w') as file:
#     model.summary(print_fn=lambda x: file.write(x + '\n'))
# from keras.utils.vis_utils import plot_model
# plot_model(model, to_file='model_plot.png', show_shapes=True, show_layer_names=True)
# numeric_dataset = tf.data.Dataset.from_tensor_slices((new_df, label))

# for row in numeric_dataset.take(3):
#   print(row) 
# texts[1][0:]
model.evaluate(new_df[9000:], label[9000:15000])
predict = model.predict(new_df[9000:15000])

#TODO evaluate 64_hand_labled_pairs
# plt.show()

# print(label)
# zero = 0
# one = 0
# for row in label.values[:9000]:
#     if row == 0:
#         zero = zero +1
#     elif row == 1:
#         one = one +1
# print(f'training_data: zeros = {zero}; ones = {one}')
# zero = 0
# one = 0
# for row in label.values[9000:10000]:
#     if row == 0:
#         zero = zero +1
#     elif row == 1:
#         one = one +1
# print(f'test_data: zeros = {zero}; ones = {one}')




### Reproduce the binary accuracy

In [41]:
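# Score each prediction on its own with BinaryAccuracy (default threshold 0.5).
m = tf.keras.metrics.BinaryAccuracy()
correct = 0   # correctly classified examples (avoids shadowing the built-in sum)
zeros = 0     # correct predictions whose label is 0
ones = 0      # correct predictions whose label is 1
for p, l in zip(predict, label[9000:15000]):
    m.update_state([[l]], [[p]])
    print(f"p: {p}; l: {l}; accuracy: {m.result().numpy()}")
    if m.result().numpy():
        correct = correct + 1
        if l:
            ones = ones + 1
        else:
            zeros = zeros + 1
    # Reset so the next iteration scores a single example again.
    m.reset_state()
print(f"total accuracy: {correct/(15000-9000)}")
print(f"number positive accuracy through zeros (round off): {zeros}")
print(f"number positive accuracy through ones (round up): {ones}")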

p: [0.00017923]; l: 0; accuracy: 1.0
p: [0.5362032]; l: 0; accuracy: 0.0
p: [3.6756433e-06]; l: 1; accuracy: 0.0
p: [0.00399564]; l: 1; accuracy: 0.0
p: [0.74822414]; l: 0; accuracy: 0.0
p: [0.40415087]; l: 0; accuracy: 1.0
p: [0.82156754]; l: 0; accuracy: 0.0
p: [0.22954358]; l: 0; accuracy: 1.0
p: [0.6750477]; l: 1; accuracy: 1.0
p: [0.94885343]; l: 0; accuracy: 0.0
p: [0.83541536]; l: 1; accuracy: 1.0
p: [0.20874076]; l: 1; accuracy: 0.0
p: [1.2291295e-08]; l: 0; accuracy: 1.0
p: [0.46788636]; l: 0; accuracy: 1.0
p: [5.0503513e-14]; l: 1; accuracy: 0.0
p: [6.0075658e-05]; l: 1; accuracy: 0.0
p: [0.9835388]; l: 0; accuracy: 0.0
p: [0.24652924]; l: 0; accuracy: 1.0
p: [0.6132551]; l: 1; accuracy: 1.0
p: [0.00535353]; l: 0; accuracy: 1.0
p: [0.00054233]; l: 0; accuracy: 1.0
p: [0.3858947]; l: 0; accuracy: 1.0
p: [0.95910954]; l: 1; accuracy: 1.0
p: [0.3630363]; l: 0; accuracy: 1.0
p: [0.45862484]; l: 0; accuracy: 1.0
p: [0.06735966]; l: 0; accuracy: 1.0
p: [6.2872907e-10]; l: 0; accura
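The per-example loop above can be checked without TensorFlow. `BinaryAccuracy` counts a prediction as class 1 when it is greater than the threshold (0.5 by default) and compares the result with the label. The sketch below applies that rule in plain Python to the first nine `(p, l)` pairs printed above; `binary_accuracy` is a hypothetical helper written for this illustration.

```python
def binary_accuracy(labels, probabilities, threshold=0.5):
    """Return (accuracy, correct zeros, correct ones) for thresholded predictions."""
    predicted = [1 if p > threshold else 0 for p in probabilities]
    correct_zeros = sum(1 for y, y_hat in zip(labels, predicted) if y == y_hat == 0)
    correct_ones = sum(1 for y, y_hat in zip(labels, predicted) if y == y_hat == 1)
    accuracy = (correct_zeros + correct_ones) / len(labels)
    return accuracy, correct_zeros, correct_ones


# The first nine predictions and labels from the printed output.
probabilities = [0.00017923, 0.5362032, 3.6756433e-06, 0.00399564, 0.74822414,
                 0.40415087, 0.82156754, 0.22954358, 0.6750477]
labels = [0, 0, 1, 1, 0, 0, 0, 0, 1]

accuracy, correct_zeros, correct_ones = binary_accuracy(labels, probabilities)
print(accuracy, correct_zeros, correct_ones)
```

Splitting the correct predictions by label, as the notebook does, shows whether the accuracy comes mostly from the majority class.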

### Plot the predictions

In [38]:
# print(predict)
%matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
plt.figure(figsize = (10,5))
plt.plot(predict)
# plt.savefig('predictions_of_test_data.jpg')
# step_size = np.arange(0.05,0.95,0.05)
# step_size[1]
# # plt.bar(step_size, np.round(predict,2))
# for i in range(0,10):
#     np.round(predict,2)

Using matplotlib backend: TkAgg


<Figure size 720x360 with 0 Axes>

### Save/Load our trained model

In [49]:
import os.path
path_to_model = '../data/model/trained_simple_model'
# if os.path.isfile(path_to_model) is False:
#     model.save(path_to_model)
from tensorflow.keras.models import load_model
model = load_model(path_to_model)

### With Model.fit

A DataFrame, interpreted as a single tensor, can be used directly as an argument to the `Model.fit` method.

Below is an example of training a model on the numeric features of the dataset.

The first step is to normalize the input ranges. Use a `tf.keras.layers.Normalization` layer for that.

To set the layer's mean and standard deviation before running it, be sure to call the `Normalization.adapt` method:

In [None]:
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(numeric_features)

Call the layer on the first three rows of the DataFrame to visualize an example of the output from this layer:

In [None]:
normalizer(numeric_features.iloc[:3])
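Conceptually, `adapt` records the mean and variance of each numeric feature, and calling the layer then shifts and scales each value with those statistics. The following pure-Python sketch shows the idea for one column. The helper names and the sample values are made up for illustration, and the use of the population variance is an assumption of this sketch.

```python
import math


def adapt(column):
    """Compute the mean and (population) variance of one feature column."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    return mean, variance


def normalize(column, mean, variance):
    """Shift to zero mean and scale to unit variance."""
    std = math.sqrt(variance)
    return [(x - mean) / std for x in column]


ages = [63.0, 67.0, 37.0, 41.0]  # made-up values
mean, variance = adapt(ages)
scaled = normalize(ages, mean, variance)
print(mean, variance, scaled)
```

After normalization the column has zero mean and unit variance, which keeps features with large raw ranges from dominating the first `Dense` layer.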

Use the normalization layer as the first layer of a simple model:

In [None]:
def get_basic_model():
  model = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
  ])

  model.compile(optimizer='adam',
                loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                metrics=['accuracy'])
  return model
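The last layer of `get_basic_model` is `Dense(1)` without an activation, so the model outputs a logit rather than a probability. Passing `from_logits=True` tells the loss to apply the sigmoid itself. The sketch below checks, in plain Python, that a numerically stable logit-based formulation gives the same value as applying the sigmoid first; the function names are hypothetical and the stable formula is the commonly documented one, shown here as an illustration rather than as the exact Keras implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_probability(y, p):
    """Binary cross-entropy for a label y and a probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(y, z):
    """Stable form for a logit z: max(z, 0) - z * y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


logit, label = 1.5, 1
direct = bce_from_logit(label, logit)
via_sigmoid = bce_from_probability(label, sigmoid(logit))
print(math.isclose(direct, via_sigmoid))
```

To turn such a model's output into a probability at prediction time, apply a sigmoid to the logit.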

When you pass the DataFrame as the `x` argument to `Model.fit`, Keras treats the DataFrame as it would a NumPy array:

In [None]:
model = get_basic_model()
model.fit(numeric_features, target, epochs=15, batch_size=BATCH_SIZE)

### With tf.data

If you want to apply `tf.data` transformations to a DataFrame of a uniform `dtype`, the `Dataset.from_tensor_slices` method will create a dataset that iterates over the rows of the DataFrame. Each row is initially a vector of values. To train a model, you need `(inputs, labels)` pairs, so pass `(features, labels)` and `Dataset.from_tensor_slices` will return the needed pairs of slices:

In [None]:
numeric_dataset = tf.data.Dataset.from_tensor_slices((numeric_features, target))

for row in numeric_dataset.take(3):
  print(row)

In [None]:
numeric_batches = numeric_dataset.shuffle(1000).batch(BATCH_SIZE)

model = get_basic_model()
model.fit(numeric_batches, epochs=15)

## A DataFrame as a dictionary

When you start dealing with heterogeneous data, it is no longer possible to treat the DataFrame as if it were a single array. TensorFlow tensors require that all elements have the same `dtype`.

So, in this case, you need to start treating it as a dictionary of columns, where each column has a uniform `dtype`. A DataFrame is a lot like a dictionary of arrays, so typically all you need to do is cast the DataFrame to a Python dict. Many important TensorFlow APIs support (nested-)dictionaries of arrays as inputs.

`tf.data` input pipelines handle this quite well. All `tf.data` operations handle dictionaries and tuples automatically. So, to make a dataset of dictionary-examples from a DataFrame, just cast it to a dict before slicing it with `Dataset.from_tensor_slices`:

In [None]:
numeric_dict_ds = tf.data.Dataset.from_tensor_slices((dict(numeric_features), target))

Here are the first three examples from that dataset:

In [None]:
for row in numeric_dict_ds.take(3):
  print(row)
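As a mental model, slicing a `(dict_of_columns, labels)` tuple walks over the rows and yields one `(row_dict, label)` pair per example, keeping the column names as keys. The plain-Python sketch below mimics that behavior. `slice_rows` is a hypothetical helper, and the column names and values are made up for illustration.

```python
def slice_rows(columns, labels):
    """Yield (row_dict, label) pairs from a dict of equal-length columns."""
    names = list(columns)
    for i, label in enumerate(labels):
        yield {name: columns[name][i] for name in names}, label


columns = {"age": [63, 67, 37], "thalach": [150, 108, 187]}  # made-up values
labels = [0, 1, 0]

rows = list(slice_rows(columns, labels))
for features, label in rows:
    print(features, label)
```

Each element keeps the dictionary structure, which is why a model that accepts a dictionary of inputs can consume such a dataset directly.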

### Dictionaries with Keras

Typically, Keras models and layers expect a single input tensor, but these classes can accept and return nested structures of dictionaries, tuples and tensors. These structures are known as "nests" (refer to the `tf.nest` module for details).

There are two equivalent ways you can write a Keras model that accepts a dictionary as input.

#### 1. The Model-subclass style

You write a subclass of `tf.keras.Model` (or `tf.keras.layers.Layer`). You directly handle the inputs, and create the outputs:

In [None]:
def stack_dict(inputs, fun=tf.stack):
  # Cast every column to float32 and combine them in sorted-key order.
  values = []
  for key in sorted(inputs.keys()):
    values.append(tf.cast(inputs[key], tf.float32))

  return fun(values, axis=-1)

In [None]:
#@title
class MyModel(tf.keras.Model):
  def __init__(self):
    # Create all the internal layers in init.
    super().__init__()

    self.normalizer = tf.keras.layers.Normalization(axis=-1)

    self.seq = tf.keras.Sequential([
      self.normalizer,
      tf.keras.layers.Dense(10, activation='relu'),
      tf.keras.layers.Dense(10, activation='relu'),
      tf.keras.layers.Dense(1)
    ])

  def adapt(self, inputs):
    # Stack the inputs and `adapt` the normalization layer.
    inputs = stack_dict(inputs)
    self.normalizer.adapt(inputs)

  def call(self, inputs):
    # Stack the inputs
    inputs = stack_dict(inputs)
    # Run them through all the layers.
    result = self.seq(inputs)

    return result

model = MyModel()

model.adapt(dict(numeric_features))

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'],
              run_eagerly=True)

This model can accept either a dictionary of columns or a dataset of dictionary-elements for training:

In [None]:
model.fit(dict(numeric_features), target, epochs=5, batch_size=BATCH_SIZE)

In [None]:
numeric_dict_batches = numeric_dict_ds.shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE)
model.fit(numeric_dict_batches, epochs=5)

Here are the predictions for the first three examples:

In [None]:
model.predict(dict(numeric_features.iloc[:3]))

#### 2. The Keras functional style

In [None]:
inputs = {}
for name, column in numeric_features.items():
  inputs[name] = tf.keras.Input(
      shape=(1,), name=name, dtype=tf.float32)

inputs

In [None]:
x = stack_dict(inputs, fun=tf.concat)

normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(stack_dict(dict(numeric_features)))

x = normalizer(x)
x = tf.keras.layers.Dense(10, activation='relu')(x)
x = tf.keras.layers.Dense(10, activation='relu')(x)
x = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(inputs, x)

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'],
              run_eagerly=True)

In [None]:
tf.keras.utils.plot_model(model, rankdir="LR", show_shapes=True)

You can train the functional model the same way as the model subclass:

In [None]:
model.fit(dict(numeric_features), target, epochs=5, batch_size=BATCH_SIZE)

In [None]:
numeric_dict_batches = numeric_dict_ds.shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE)
model.fit(numeric_dict_batches, epochs=5)

## Full example

If you're passing a heterogeneous DataFrame to Keras, each column may need unique preprocessing. You could do this preprocessing directly in the DataFrame, but for a model to work correctly, inputs always need to be preprocessed the same way. So, the best approach is to build the preprocessing into the model. [Keras preprocessing layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) cover many common tasks.

### Build the preprocessing head

In this dataset some of the "integer" features in the raw data are actually categorical indices. These indices are not really ordered numeric values (refer to <a href="https://archive.ics.uci.edu/ml/datasets/heart+Disease" class="external">the dataset description</a> for details). Because these are unordered, they are inappropriate to feed directly to the model; the model would interpret them as being ordered. To use these inputs you'll need to encode them, either as one-hot vectors or embedding vectors. The same applies to string-categorical features.

Note: If you have many features that need identical preprocessing it's more efficient to concatenate them together before applying the preprocessing.

Binary features, on the other hand, do not generally need to be encoded or normalized.

Start by creating a list of the features that fall into each group:

In [None]:
binary_feature_names = ['sex', 'fbs', 'exang']

In [None]:
categorical_feature_names = ['cp', 'restecg', 'slope', 'thal', 'ca']

The next step is to build a preprocessing model that will apply appropriate preprocessing to each input and concatenate the results.

This section uses the [Keras Functional API](https://www.tensorflow.org/guide/keras/functional) to implement the preprocessing. You start by creating one `tf.keras.Input` for each column of the DataFrame:

In [None]:
inputs = {}
for name, column in df.items():
  if type(column[0]) == str:
    dtype = tf.string
  elif (name in categorical_feature_names or
        name in binary_feature_names):
    dtype = tf.int64
  else:
    dtype = tf.float32

  inputs[name] = tf.keras.Input(shape=(), name=name, dtype=dtype)

In [None]:
inputs

For each input you'll apply some transformations using Keras layers and TensorFlow ops. Each feature starts as a batch of scalars (`shape=(batch,)`). The output for each feature should be a batch of `tf.float32` vectors (`shape=(batch, n)`). The last step will concatenate all those vectors together.


#### Binary inputs

Since the binary inputs don't need any preprocessing, just add the vector axis, cast them to `float32` and add them to the list of preprocessed inputs:

In [None]:
preprocessed = []

for name in binary_feature_names:
  inp = inputs[name]
  inp = inp[:, tf.newaxis]
  float_value = tf.cast(inp, tf.float32)
  preprocessed.append(float_value)

preprocessed

#### Numeric inputs

As in the earlier section, you'll want to run these numeric inputs through a `tf.keras.layers.Normalization` layer before using them. The difference is that this time they're input as a dict. The code below collects the numeric features from the DataFrame, stacks them together, and passes the result to the `Normalization.adapt` method.

In [None]:
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(stack_dict(dict(numeric_features)))

The code below stacks the numeric features and runs them through the normalization layer.

In [None]:
numeric_inputs = {}
for name in numeric_feature_names:
  numeric_inputs[name]=inputs[name]

numeric_inputs = stack_dict(numeric_inputs)
numeric_normalized = normalizer(numeric_inputs)

preprocessed.append(numeric_normalized)

preprocessed

#### Categorical features

To use categorical features you'll first need to encode them into either binary vectors or embeddings. Since these features only contain a small number of categories, convert the inputs directly to one-hot vectors using the `output_mode='one_hot'` option, supported by both the `tf.keras.layers.StringLookup` and `tf.keras.layers.IntegerLookup` layers.

Here is an example of how these layers work:

In [None]:
vocab = ['a','b','c']
lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')
lookup(['c','a','a','b','zzz'])

In [None]:
vocab = [1,4,7,99]
lookup = tf.keras.layers.IntegerLookup(vocabulary=vocab, output_mode='one_hot')

lookup([-1,4,1])
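The two lookup examples above can be mimicked in plain Python to see what the one-hot output looks like. The sketch assumes the layers' default of a single out-of-vocabulary slot at index 0, so a vocabulary of `n` entries produces vectors of depth `n + 1`. `one_hot_lookup` is a hypothetical helper, not a Keras function.

```python
def one_hot_lookup(values, vocabulary):
    """One-hot encode values; slot 0 is reserved for out-of-vocabulary items."""
    index = {token: i + 1 for i, token in enumerate(vocabulary)}
    depth = len(vocabulary) + 1
    encoded = []
    for value in values:
        row = [0] * depth
        row[index.get(value, 0)] = 1
        encoded.append(row)
    return encoded


letters = one_hot_lookup(['c', 'a', 'a', 'b', 'zzz'], ['a', 'b', 'c'])
numbers = one_hot_lookup([-1, 4, 1], [1, 4, 7, 99])

for row in letters:
    print(row)
```

The unknown string `'zzz'` and the unknown integer `-1` both land in slot 0, so unseen categories at inference time do not raise an error.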

Determine the vocabulary for each input, then create a layer that uses that vocabulary to convert the input to one-hot vectors:

In [None]:
for name in categorical_feature_names:
  vocab = sorted(set(df[name]))
  print(f'name: {name}')
  print(f'vocab: {vocab}\n')

  if type(vocab[0]) is str:
    lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')
  else:
    lookup = tf.keras.layers.IntegerLookup(vocabulary=vocab, output_mode='one_hot')

  x = inputs[name][:, tf.newaxis]
  x = lookup(x)
  preprocessed.append(x)

#### Assemble the preprocessing head

At this point `preprocessed` is just a Python list of all the preprocessing results; each result has a shape of `(batch_size, depth)`:

In [None]:
preprocessed

Concatenate all the preprocessed features along the `depth` axis, so each dictionary-example is converted into a single vector. The vector contains binary features, numeric features, and categorical one-hot features:

In [None]:
preprocessed_result = tf.concat(preprocessed, axis=-1)
preprocessed_result

Now create a model out of that calculation so it can be reused:

In [None]:
preprocessor = tf.keras.Model(inputs, preprocessed_result)

In [None]:
tf.keras.utils.plot_model(preprocessor, rankdir="LR", show_shapes=True)

To test the preprocessor, use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html" class="external">DataFrame.iloc</a> accessor to slice the first example from the DataFrame. Then convert it to a dictionary and pass the dictionary to the preprocessor. The result is a single vector containing the binary features, normalized numeric features and the one-hot categorical features, in that order:

In [None]:
preprocessor(dict(df.iloc[:1]))

### Create and train a model

Now build the main body of the model. Use the same configuration as in the previous example: a couple of `Dense` rectified-linear layers and a `Dense(1)` output layer for the classification.

In [None]:
body = tf.keras.Sequential([
  tf.keras.layers.Dense(10, activation='relu'),
  tf.keras.layers.Dense(10, activation='relu'),
  tf.keras.layers.Dense(1)
])

Now put the two pieces together using the Keras functional API.

In [None]:
inputs

In [None]:
x = preprocessor(inputs)
x

In [None]:
result = body(x)
result

In [None]:
model = tf.keras.Model(inputs, result)

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

This model expects a dictionary of inputs. The simplest way to pass it the data is to convert the DataFrame to a dict and pass that dict as the `x` argument to `Model.fit`:

In [None]:
history = model.fit(dict(df), target, epochs=5, batch_size=BATCH_SIZE)

Using `tf.data` works as well:

In [None]:
ds = tf.data.Dataset.from_tensor_slices((
    dict(df),
    target
))

ds = ds.batch(BATCH_SIZE)

In [None]:
import pprint

for x, y in ds.take(1):
  pprint.pprint(x)
  print()
  print(y)

In [None]:
history = model.fit(ds, epochs=5)