<br>

<div align=center><font color=maroon size=6><b>Load a pandas DataFrame</b></font></div>

<br>

<font size=4><b>References:</b></font>
1. TF2 official tutorials: <a href="https://www.tensorflow.org/tutorials" style="text-decoration:none;">TensorFlow Tutorials</a> 
    * `TensorFlow > Learn > TensorFlow Core > Tutorials` > <a href="https://www.tensorflow.org/tutorials/load_data/pandas_dataframe" style="text-decoration:none;">Load a pandas DataFrame</a>
        * Run in <a href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/load_data/pandas_dataframe.ipynb" style="text-decoration:none;">Google Colab</a>

<br>
<br>
<br>

This tutorial provides examples of how to load <a href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html" class="external">pandas DataFrames</a> into TensorFlow.

You will use a small <a href="https://archive.ics.uci.edu/ml/datasets/heart+Disease" class="external">heart disease dataset</a> provided by the UCI Machine Learning Repository. There are several hundred rows in the CSV. Each row describes a patient, and each column describes an attribute. You will use this information to predict whether a patient has heart disease, which is a binary classification task.

<br>

## Read data using pandas

In [1]:
import numpy as np   # used by the one-hot encoding cells later in the notebook
import pandas as pd
import tensorflow as tf

In [2]:
print(tf.__version__)

2.5.0


In [3]:
SHUFFLE_BUFFER = 500
BATCH_SIZE = 2

<br>

Download the CSV file containing the heart disease dataset:

In [4]:
# help(tf.keras.utils.get_file)

In [5]:
csv_file = tf.keras.utils.get_file('heart.csv', 
                                   'https://storage.googleapis.com/download.tensorflow.org/data/heart.csv',
                                   cache_dir="D:/KeepStudy/0_Coding",
                                   cache_subdir="0_dataset")

<br>

Read the CSV file using pandas:

In [6]:
df = pd.read_csv(csv_file)

This is what the data looks like:

In [7]:
df.head()

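|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | fixed | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | normal | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | reversible | 0 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | normal | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | normal | 0 |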


In [8]:
df.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object

<br>

You will build models to predict the label contained in the `target` column.

In [9]:
target = df.pop('target')

<br>
<br>
<br>

## A DataFrame as an array

<font size=3 color=maroon>If your data has a uniform datatype, or `dtype`, it's possible to use a pandas DataFrame anywhere you could use a NumPy array. This works because the `pandas.DataFrame` class supports the `__array__` protocol, and TensorFlow's `tf.convert_to_tensor` function accepts objects that support the protocol.</font>
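
To make the `__array__` protocol concrete, here is a minimal sketch. The `TinyFrame` class is hypothetical (it is not part of pandas or TensorFlow) and only stands in for a DataFrame with a uniform dtype; the example assumes NumPy is installed.

```python
import numpy as np


class TinyFrame:
    """Hypothetical stand-in for a uniform-dtype DataFrame."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of numbers

    def __array__(self, dtype=None, copy=None):
        # Place the columns side by side: one row per record.
        data = np.column_stack(list(self.columns.values()))
        return data if dtype is None else data.astype(dtype)


frame = TinyFrame({"age": [63, 67, 67], "oldpeak": [2.3, 1.5, 2.6]})

# np.asarray finds and calls `__array__`, so the object behaves like an array.
arr = np.asarray(frame)
print(arr.shape, arr.dtype)
```

Any function that starts by converting its argument to an array can accept such an object, which is the same mechanism that lets a DataFrame be passed where an array is expected.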

Take the numeric features from the dataset (skip the categorical features for now):

In [10]:
numeric_feature_names = ['age', 'thalach', 'trestbps',  'chol', 'oldpeak']
numeric_features = df[numeric_feature_names]
numeric_features.head()

|   | age | thalach | trestbps | chol | oldpeak |
|---|---|---|---|---|---|
| 0 | 63 | 150 | 145 | 233 | 2.3 |
| 1 | 67 | 108 | 160 | 286 | 1.5 |
| 2 | 67 | 129 | 120 | 229 | 2.6 |
| 3 | 37 | 187 | 130 | 250 | 3.5 |
| 4 | 41 | 172 | 130 | 204 | 1.4 |


<br>

<font size=3 color=maroon>The DataFrame can be converted to a NumPy array using the `DataFrame.values` property or `numpy.array(df)`. To convert it to a tensor, use `tf.convert_to_tensor`:</font>

In [11]:
tf.convert_to_tensor(numeric_features)

<tf.Tensor: shape=(303, 5), dtype=float64, numpy=
array([[ 63. , 150. , 145. , 233. ,   2.3],
       [ 67. , 108. , 160. , 286. ,   1.5],
       [ 67. , 129. , 120. , 229. ,   2.6],
       ...,
       [ 65. , 127. , 135. , 254. ,   2.8],
       [ 48. , 150. , 130. , 256. ,   0. ],
       [ 63. , 154. , 150. , 407. ,   4. ]])>

In [12]:
type(numeric_features)

pandas.core.frame.DataFrame

In [13]:
numeric_features.shape

(303, 5)

<br>

In general, if an object can be converted to a tensor with `tf.convert_to_tensor` it can be passed anywhere you can pass a `tf.Tensor`.

<br>
<br>

### With Model.fit

<font size=3 color=maroon>A DataFrame, interpreted as a single tensor, can be used directly as an argument to the `Model.fit` method.</font>

Below is an example of training a model on the numeric features of the dataset.

<br>

<font size=3 color=maroon>The first step is to normalize the input ranges. Use a `tf.keras.layers.Normalization` layer for that (in TensorFlow 2.5, the version used here, the layer lives at `tf.keras.layers.experimental.preprocessing.Normalization`).

To set the layer's mean and standard deviation before running it, be sure to call the `Normalization.adapt` method:</font>

In [14]:
normalizer = tf.keras.layers.experimental.preprocessing.Normalization(axis=-1)
normalizer.adapt(numeric_features)   # this effectively "trains" the normalizer: it learns each feature's mean and variance

In [15]:
type(numeric_features)

pandas.core.frame.DataFrame

<br>

Call the layer on the first three rows of the DataFrame to visualize an example of the output from this layer:

In [16]:
normalizer(numeric_features.iloc[:3])

<tf.Tensor: shape=(3, 5), dtype=float32, numpy=
array([[ 0.93383914,  0.03480718,  0.74578077, -0.26008663,  1.0680453 ],
       [ 1.3782105 , -1.7806165 ,  1.5923285 ,  0.7573877 ,  0.38022864],
       [ 1.3782105 , -0.87290466, -0.6651321 , -0.33687714,  1.3259765 ]],
      dtype=float32)>
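
As a rough picture of what `adapt` followed by a call does, the NumPy sketch below standardizes each feature with its own mean and variance. It is an approximation under stated assumptions, not the layer's implementation: it uses only the five rows shown by `numeric_features.head()`, so its numbers differ from the output above, which was adapted on all 303 rows.

```python
import numpy as np

# First five rows of the numeric features
# (columns: age, thalach, trestbps, chol, oldpeak).
sample = np.array([
    [63, 150, 145, 233, 2.3],
    [67, 108, 160, 286, 1.5],
    [67, 129, 120, 229, 2.6],
    [37, 187, 130, 250, 3.5],
    [41, 172, 130, 204, 1.4],
], dtype=np.float32)

# "adapt": learn one mean and one variance per feature (per column).
mean = sample.mean(axis=0)
variance = sample.var(axis=0)

# "call": shift and scale every feature.
normalized = (sample - mean) / np.sqrt(variance)

print(normalized.mean(axis=0))  # every entry is close to 0
print(normalized.std(axis=0))   # every entry is close to 1
```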

<br>

Use the normalization layer as the first layer of a simple model:

In [17]:
def get_basic_model():
    model = tf.keras.Sequential([normalizer,
                                 tf.keras.layers.Dense(10, activation='relu'),
                                 tf.keras.layers.Dense(10, activation='relu'),
                                 tf.keras.layers.Dense(1)
                                ])

  
    model.compile(optimizer='adam',
                  loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
                  metrics=['accuracy'])
    
    return model

<br>

<font size=3 color=maroon>When you pass the DataFrame as the `x` argument to `Model.fit`, Keras treats the DataFrame as it would a NumPy array:</font>

In [18]:
type(numeric_features)

pandas.core.frame.DataFrame

In [19]:
model = get_basic_model()

model.fit(numeric_features, target, epochs=15, batch_size=BATCH_SIZE)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x1e422f85ca0>

<br>
<br>

### With tf.data

If you want to apply `tf.data` transformations to a DataFrame of a uniform `dtype`, <font size=3 color=maroon>the `Dataset.from_tensor_slices` method will create a dataset that **iterates over the rows of the DataFrame. Each row is initially a vector of values**.</font> 

To train a model, you need `(inputs, labels)` pairs, so pass `(features, labels)` and `Dataset.from_tensor_slices` will return the needed pairs of slices:

In [20]:
numeric_dataset = tf.data.Dataset.from_tensor_slices((numeric_features, target))

for row in numeric_dataset.take(3):
    print(row)

(<tf.Tensor: shape=(5,), dtype=float64, numpy=array([ 63. , 150. , 145. , 233. ,   2.3])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(5,), dtype=float64, numpy=array([ 67. , 108. , 160. , 286. ,   1.5])>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)
(<tf.Tensor: shape=(5,), dtype=float64, numpy=array([ 67. , 129. , 120. , 229. ,   2.6])>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
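
The slicing behaviour can be pictured with plain NumPy. In the sketch below, `slice_rows` is a hypothetical helper written for illustration; it is an analogy for how the inputs are cut along their first axis, not how `tf.data` is implemented.

```python
import numpy as np

features = np.array([[63., 150., 145., 233., 2.3],
                     [67., 108., 160., 286., 1.5],
                     [67., 129., 120., 229., 2.6]])
labels = np.array([0, 1, 0])


def slice_rows(*arrays):
    """Hypothetical helper: yield one tuple of row slices per example."""
    lengths = {len(a) for a in arrays}
    if len(lengths) != 1:
        raise ValueError("all inputs must share the same first dimension")
    for i in range(lengths.pop()):
        yield tuple(a[i] for a in arrays)


pairs = list(slice_rows(features, labels))
for x, y in pairs:
    print(x.shape, y)  # a feature vector of shape (5,) and a scalar label
```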


In [21]:
# numeric_dataset.numpy()
#
# Error: AttributeError: 'TensorSliceDataset' object has no attribute 'numpy'



# type(numeric_dataset.take(3))
#
# tensorflow.python.data.ops.dataset_ops.TakeDataset



# tf.convert_to_tensor(numeric_dataset.take(3))
#
# Error: ValueError: Attempt to convert a value (<TakeDataset shapes: ((5,), ()), types: (tf.float64, tf.int64)>) 
# with an unsupported type (<class 'tensorflow.python.data.ops.dataset_ops.TakeDataset'>) to a Tensor.

In [22]:
numeric_batches = numeric_dataset.shuffle(1000).batch(BATCH_SIZE)

model = get_basic_model()
model.fit(numeric_batches, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x1e43d9ec5b0>
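
Because the dataset has 303 rows, a shuffle buffer of 1000 covers all of it, so each epoch sees the examples in a fresh random order before they are grouped into batches. The NumPy sketch below illustrates that idea on seven example indices; the seed and sizes are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for repeatability
num_examples, batch_size = 7, 2
indices = np.arange(num_examples)

# With a buffer at least as large as the dataset, shuffling amounts to
# drawing a random permutation of all examples.
shuffled = rng.permutation(indices)

# Consecutive groups of `batch_size`; the final batch may be smaller.
batches = [shuffled[i:i + batch_size]
           for i in range(0, num_examples, batch_size)]
print([len(b) for b in batches])
```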

<br>
<br>
<br>

## A DataFrame as a dictionary

<font size=3 color=maroon>When you start dealing with heterogeneous data (data made up of several different types), it is no longer possible to treat the DataFrame as if it were a single array. TensorFlow tensors require that all elements have the same `dtype`.

So, in this case, you need to start treating it as **a dictionary of columns, where each column has a uniform `dtype`**. <br><br>
A DataFrame is a lot like a dictionary of arrays, so typically all you need to do is cast the DataFrame to a Python dict. Many important TensorFlow APIs support (nested-)dictionaries of arrays as inputs.</font>

`tf.data` input pipelines handle this quite well. <font size=3 color=maroon>All `tf.data` operations handle dictionaries and tuples automatically.</font> So, to make a dataset of dictionary-examples from a DataFrame, just cast it to a dict before slicing it with `Dataset.from_tensor_slices`:

In [23]:
dict(numeric_features)

{'age': 0      63
 1      67
 2      67
 3      37
 4      41
        ..
 298    52
 299    43
 300    65
 301    48
 302    63
 Name: age, Length: 303, dtype: int64,
 'thalach': 0      150
 1      108
 2      129
 3      187
 4      172
       ... 
 298    190
 299    136
 300    127
 301    150
 302    154
 Name: thalach, Length: 303, dtype: int64,
 'trestbps': 0      145
 1      160
 2      120
 3      130
 4      130
       ... 
 298    118
 299    132
 300    135
 301    130
 302    150
 Name: trestbps, Length: 303, dtype: int64,
 'chol': 0      233
 1      286
 2      229
 3      250
 4      204
       ... 
 298    186
 299    341
 300    254
 301    256
 302    407
 Name: chol, Length: 303, dtype: int64,
 'oldpeak': 0      2.3
 1      1.5
 2      2.6
 3      3.5
 4      1.4
       ... 
 298    0.0
 299    3.0
 300    2.8
 301    0.0
 302    4.0
 Name: oldpeak, Length: 303, dtype: float64}

<br>

In [24]:
numeric_dict_ds = tf.data.Dataset.from_tensor_slices((dict(numeric_features), target))

Here are the first three examples from that dataset:

In [25]:
for row in numeric_dict_ds.take(3):
    print(row)
    print()

({'age': <tf.Tensor: shape=(), dtype=int64, numpy=63>, 'thalach': <tf.Tensor: shape=(), dtype=int64, numpy=150>, 'trestbps': <tf.Tensor: shape=(), dtype=int64, numpy=145>, 'chol': <tf.Tensor: shape=(), dtype=int64, numpy=233>, 'oldpeak': <tf.Tensor: shape=(), dtype=float64, numpy=2.3>}, <tf.Tensor: shape=(), dtype=int64, numpy=0>)

({'age': <tf.Tensor: shape=(), dtype=int64, numpy=67>, 'thalach': <tf.Tensor: shape=(), dtype=int64, numpy=108>, 'trestbps': <tf.Tensor: shape=(), dtype=int64, numpy=160>, 'chol': <tf.Tensor: shape=(), dtype=int64, numpy=286>, 'oldpeak': <tf.Tensor: shape=(), dtype=float64, numpy=1.5>}, <tf.Tensor: shape=(), dtype=int64, numpy=1>)

({'age': <tf.Tensor: shape=(), dtype=int64, numpy=67>, 'thalach': <tf.Tensor: shape=(), dtype=int64, numpy=129>, 'trestbps': <tf.Tensor: shape=(), dtype=int64, numpy=120>, 'chol': <tf.Tensor: shape=(), dtype=int64, numpy=229>, 'oldpeak': <tf.Tensor: shape=(), dtype=float64, numpy=2.6>}, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
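
A dictionary of columns is sliced the same way as a single array: every column is cut at the same row index, and each column keeps its own dtype. The sketch below shows the idea with NumPy; `slice_dict_rows` is a hypothetical helper, not a `tf.data` function.

```python
import numpy as np

columns = {
    "age": np.array([63, 67, 67]),          # integer column
    "oldpeak": np.array([2.3, 1.5, 2.6]),   # float column
}
labels = np.array([0, 1, 0])


def slice_dict_rows(columns, labels):
    """Hypothetical helper: yield (row_dict, label) pairs."""
    for i in range(len(labels)):
        row = {name: values[i] for name, values in columns.items()}
        yield row, labels[i]


examples = list(slice_dict_rows(columns, labels))
print(examples[0])
```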

<br>
<br>

### Dictionaries with Keras

<font size=3 color=maroon>Typically, Keras models and layers expect a single input tensor, but these classes can accept and return nested structures of dictionaries, tuples and tensors. These structures are known as "nests" (refer to the `tf.nest` module for details).</font>
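
To give a feel for what a "nest" is, the pure-Python sketch below walks a nested structure of dictionaries and tuples and lists its leaves. The `flatten` function is a hypothetical helper written for this note; it only mimics the idea behind the `tf.nest` utilities and is not their implementation.

```python
def flatten(structure):
    """Hypothetical helper: list the leaves of a nest, depth-first.

    Dictionary entries are visited in sorted-key order here, so the
    result does not depend on insertion order.
    """
    if isinstance(structure, dict):
        return [leaf for key in sorted(structure)
                for leaf in flatten(structure[key])]
    if isinstance(structure, (list, tuple)):
        return [leaf for item in structure for leaf in flatten(item)]
    return [structure]


nest = ({"thalach": 150, "age": 63}, 0)   # (features dict, label)
leaves = flatten(nest)
print(leaves)
```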

There are two equivalent ways you can write a Keras model that accepts a dictionary as input.

#### The Model-subclass style

You write a subclass of `tf.keras.Model` (or `tf.keras.layers.Layer`). You directly handle the inputs, and create the outputs:

In [26]:
def stack_dict(inputs, fun=tf.stack):
    values = []
    for key in sorted(inputs.keys()):
        values.append(tf.cast(inputs[key], tf.float32))
    
    return fun(values, axis=-1)

In [27]:
class MyModel(tf.keras.Model):
    def __init__(self):
        # Create all the internal layers in init.
        super().__init__()
        
        self.normalizer = tf.keras.layers.experimental.preprocessing.Normalization(axis=-1)
        
        self.seq = tf.keras.Sequential([self.normalizer,
                                        tf.keras.layers.Dense(10, activation='relu'),
                                        tf.keras.layers.Dense(10, activation='relu'),
                                        tf.keras.layers.Dense(1)
                                       ])
        
    def adapt(self, inputs):
        # Stack the inputs and `adapt` the normalization layer.
        inputs = stack_dict(inputs)
        self.normalizer.adapt(inputs)
    
    def call(self, inputs):
        # Stack the inputs
        inputs = stack_dict(inputs)
        # Run them through all the layers.
        result = self.seq(inputs)

        return result

In [28]:
model = MyModel()

model.adapt(dict(numeric_features))

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'],
              run_eagerly=True)

<br>

This model can accept either a dictionary of columns or a dataset of dictionary-elements for training:

In [29]:
model.fit(dict(numeric_features), target, epochs=5, batch_size=BATCH_SIZE)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1e4252d17f0>

In [30]:
numeric_dict_batches = numeric_dict_ds.shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE)
model.fit(numeric_dict_batches, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1e6c1700a30>

<br>

Here are the predictions for the first three examples:

In [31]:
model.predict(dict(numeric_features.iloc[:3]))

array([[[-0.05580579]],

       [[ 0.7148901 ]],

       [[ 0.1741641 ]]], dtype=float32)
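
The model's last layer is a `Dense(1)` without an activation and the loss is configured with `from_logits=True`, so the values above are logits rather than probabilities. A sigmoid converts them, as in this NumPy sketch (the 0.5 decision threshold is an assumption made for illustration):

```python
import numpy as np

# Logits copied from the prediction output above.
logits = np.array([[[-0.05580579]], [[0.7148901]], [[0.1741641]]],
                  dtype=np.float32)

# The sigmoid maps any real-valued logit into the interval (0, 1).
probabilities = 1.0 / (1.0 + np.exp(-logits))

# Assumed threshold: predict class 1 when the probability exceeds 0.5.
predicted_labels = (probabilities > 0.5).astype(int).reshape(-1)
print(probabilities.reshape(-1), predicted_labels)
```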

<br>

#### The Keras functional style

In [32]:
inputs = {}
for name, col in numeric_features.items():
    inputs[name] = tf.keras.Input(shape=(1,),
                                  name=name,
                                  dtype=tf.float32)

inputs

{'age': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'age')>,
 'thalach': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'thalach')>,
 'trestbps': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'trestbps')>,
 'chol': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'chol')>,
 'oldpeak': <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'oldpeak')>}

In [33]:
x = stack_dict(inputs, fun=tf.concat)
x

<KerasTensor: shape=(None, 5) dtype=float32 (created by layer 'tf.concat')>

In [34]:
x2 = stack_dict(inputs)
x2

<KerasTensor: shape=(None, 1, 5) dtype=float32 (created by layer 'tf.stack')>
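
The two shapes above differ because stacking adds a new axis while concatenating joins along an existing one. The sketch below reproduces the shape arithmetic with `np.stack` and `np.concatenate`, used here as NumPy analogues of `tf.stack` and `tf.concat`; the batch size of 4 is an arbitrary stand-in for `None`.

```python
import numpy as np

batch = 4
# Five feature columns, each shaped (batch, 1) like the Keras inputs above.
columns_2d = [np.zeros((batch, 1), dtype=np.float32) for _ in range(5)]

stacked = np.stack(columns_2d, axis=-1)              # adds a trailing axis
concatenated = np.concatenate(columns_2d, axis=-1)   # joins the existing axis
print(stacked.shape, concatenated.shape)

# Columns shaped (batch,), like the columns of a DataFrame, stack to (batch, 5).
columns_1d = [np.zeros(batch, dtype=np.float32) for _ in range(5)]
stacked_1d = np.stack(columns_1d, axis=-1)
print(stacked_1d.shape)
```

This is why the functional model concatenates its `(None, 1)` inputs, while stacking suits the 1-D columns obtained from `dict(numeric_features)`.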

In [35]:
normalizer = tf.keras.layers.experimental.preprocessing.Normalization(axis=1)
normalizer.adapt(stack_dict(dict(numeric_features)))

x = normalizer(x)
x = tf.keras.layers.Dense(10, activation='relu')(x)
x = tf.keras.layers.Dense(10, activation='relu')(x)
x = tf.keras.layers.Dense(1)(x)

model = tf.keras.Model(inputs, x)

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'],
              run_eagerly=True)

In [36]:
tf.keras.utils.plot_model(model, rankdir="LR", show_shapes=True)

('You must install pydot (`pip install pydot`) and install graphviz (see instructions at https://graphviz.gitlab.io/download/) ', 'for plot_model/model_to_dot to work.')


<br>

The model plot below was downloaded from a Colab run:


<img src="./images/dataframe_plot_model.png" width=800px>

<br>

You can train the functional model the same way as the model subclass:

In [37]:
model.fit(dict(numeric_features), target, epochs=5, batch_size=BATCH_SIZE)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1e6c91a57f0>

In [38]:
numeric_dict_batches = numeric_dict_ds.shuffle(SHUFFLE_BUFFER).batch(BATCH_SIZE)
model.fit(numeric_dict_batches, epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<tensorflow.python.keras.callbacks.History at 0x1e6c1700a90>

<br>
<br>
<br>

## Full example

If you're passing a heterogeneous DataFrame to Keras, each column may need unique preprocessing. You could do this preprocessing directly in the DataFrame, but for a model to work correctly, inputs always need to be preprocessed the same way. So, the best approach is to build the preprocessing into the model. [Keras preprocessing layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) cover many common tasks.

### Build the preprocessing head

In this dataset, some of the "integer" features in the raw data are actually categorical indices. These indices are not really ordered numeric values (refer to <a href="https://archive.ics.uci.edu/ml/datasets/heart+Disease" class="external">the dataset description</a> for details). 

<font size=3 color=maroon>Because these are unordered they are inappropriate to feed directly to the model; the model would interpret them as being ordered. To use these inputs you'll need to encode them, either as one-hot vectors or embedding vectors. The same applies to string-categorical features.</font>
<br>
<br>
<br>
<font size=4 color=maroon>**Note**:</font> 

* If you have many features that need identical preprocessing it's more efficient to concatenate them together before applying the preprocessing.

* Binary features on the other hand do not generally need to be encoded or normalized.
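
As a small illustration of one-hot encoding, the NumPy sketch below encodes the `cp` values from the first five rows of the dataset. It assumes the category codes run from 0 to 4; each code becomes a vector with a single 1, so no ordering between categories is implied.

```python
import numpy as np

cp = np.array([1, 4, 4, 3, 2])   # `cp` values from the first five rows
num_categories = 5               # assumption: the codes are 0, 1, 2, 3, 4

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(num_categories, dtype=np.float32)[cp]
print(one_hot.shape)
```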

<br>

Start by creating a list of the features that fall into each group:

In [39]:
binary_feature_names = ['sex', 'fbs', 'exang']

In [40]:
categorical_feature_names = ['cp', 'restecg', 'slope', 'thal', 'ca']

<br>

The next step is to build a preprocessing model that will apply appropriate preprocessing to each input and concatenate the results.

This section uses the [Keras Functional API](https://www.tensorflow.org/guide/keras/functional) to implement the preprocessing. You start by creating one `tf.keras.Input` for each column of the DataFrame:

In [41]:
inputs = {}
for name, column in df.items():
    if type(column[0]) == str:
        dtype = tf.string
    elif (name in categorical_feature_names or name in binary_feature_names):
        dtype = tf.int64
    else:
        dtype = tf.float32

    inputs[name] = tf.keras.Input(shape=(), name=name, dtype=dtype)

In [42]:
inputs

{'age': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'age')>,
 'sex': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'sex')>,
 'cp': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'cp')>,
 'trestbps': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'trestbps')>,
 'chol': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'chol')>,
 'fbs': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'fbs')>,
 'restecg': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'restecg')>,
 'thalach': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'thalach')>,
 'exang': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'exang')>,
 'oldpeak': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'oldpeak')>,
 'slope': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'slope')>,
 'ca': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'ca')>,
 'thal': <KerasTensor: shape=(None,) dtype=string 

<br>

<font color=maroon size=3>For each input you'll apply some transformations using Keras layers and TensorFlow ops. Each feature starts as a batch of scalars (`shape=(batch,)`). The output for each feature should be a batch of `tf.float32` vectors (`shape=(batch, n)`). The last step will concatenate all those vectors together.</font>

<br>

#### Binary inputs

<font color=maroon size=3>Since the binary inputs don't need any preprocessing, just add the vector axis, cast them to `float32` and add them to the list of preprocessed inputs:</font>

In [43]:
preprocessed = []

for name in binary_feature_names:
    inp = inputs[name]
    inp = inp[:, tf.newaxis]
    float_value = tf.cast(inp, tf.float32)
    preprocessed.append(float_value)

preprocessed

[<KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_10')>,
 <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_11')>,
 <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_12')>]

<br>

#### Numeric inputs

Like in the earlier section you'll want to run these numeric inputs through a `tf.keras.layers.Normalization` layer before using them. The difference is that this time they're input as a dict. The code below collects the numeric features from the DataFrame, stacks them together and passes those to the `Normalization.adapt` method.

In [44]:
normalizer = tf.keras.layers.experimental.preprocessing.Normalization(axis=-1)
normalizer.adapt(stack_dict(dict(numeric_features)))

<br>

The code below stacks the numeric features and runs them through the normalization layer.

In [45]:
numeric_inputs = {}
for name in numeric_feature_names:
    numeric_inputs[name]=inputs[name]

numeric_inputs = stack_dict(numeric_inputs)
numeric_normalized = normalizer(numeric_inputs)

preprocessed.append(numeric_normalized)

preprocessed

[<KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_10')>,
 <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_11')>,
 <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_12')>,
 <KerasTensor: shape=(None, 5) dtype=float32 (created by layer 'normalization_3')>]
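
Every entry in the list is a float32 tensor of shape `(batch, n)`, which is what makes the final concatenation possible. The NumPy sketch below walks through the shape bookkeeping for one binary feature and a stand-in for the five normalized numeric features; the zero-filled array is a placeholder, not real data.

```python
import numpy as np

batch = 3
sex = np.array([1, 1, 0])                          # binary feature, shape (batch,)
numeric = np.zeros((batch, 5), dtype=np.float32)   # placeholder for normalized numerics

# Add a trailing axis and cast, so the feature becomes a float32 (batch, 1) array.
sex_vector = sex[:, np.newaxis].astype(np.float32)

# Pieces that agree on the batch axis can be joined along the last axis.
combined = np.concatenate([sex_vector, numeric], axis=-1)
print(combined.shape)
```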

<br>

#### Categorical features

To use categorical features you'll first need to encode them into either binary vectors or embeddings. Since these features only contain a small number of categories, convert the inputs directly to one-hot vectors using the `output_mode='one_hot'` option, supported by both the `tf.keras.layers.StringLookup` and `tf.keras.layers.IntegerLookup` layers.

Here is an example of how these layers work:

In [46]:
# vocab = ['a','b','c']
# lookup = tf.keras.layers.StringLookup(vocabulary=vocab, 
#                                       output_mode='one_hot')
#
# lookup(['c','a','a','b','zzz'])
# Error: AttributeError: module 'tensorflow.keras.layers' has no attribute 'StringLookup'

<br>

In [47]:
# help(tf.keras.layers.experimental.preprocessing.StringLookup)

In [48]:
# vocab = ['a','b','c']
# lookup = tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=vocab, 
#                                                                  output_mode='one_hot')
#
# lookup(['c','a','a','b','zzz'])
# Error: ValueError: The output_mode argument of layer StringLookup received an invalid value one_hot. 
# Allowed values are: or one of the following values: ('int', 'binary', 'count', 'tf-idf').
# Note that none of the allowed output_mode values listed above is a one-hot mode.

<br>

In [49]:
# Reference for this cell:
# https://www.tensorflow.org/guide/migrate/migrating_feature_columns#one-hot_encoding_string_data_with_a_vocabulary

import tensorflow.compat.v1 as tf1


def call_feature_columns(feature_columns, inputs):
    # This is a convenient way to call a `feature_column` outside of an estimator to display its output.
    feature_layer = tf1.keras.layers.DenseFeatures(feature_columns)
    
    return feature_layer(inputs)


vocab_col = tf1.feature_column \
               .categorical_column_with_vocabulary_list('my_try',
                                                        vocabulary_list=['a','b','c'],
                                                        num_oov_buckets=2)

indicator_col = tf1.feature_column.indicator_column(vocab_col)
call_feature_columns(indicator_col, {'my_try': ['c','a','a','b','zzz']})

<tf.Tensor: shape=(5, 5), dtype=float32, numpy=
array([[0., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 0., 1., 0.]], dtype=float32)>

In [50]:
# help(tf1.feature_column.categorical_column_with_vocabulary_list)

<br>

In [51]:
# vocab = [1,4,7,99]
# lookup = tf.keras.layers.IntegerLookup(vocabulary=vocab, output_mode='one_hot')
# lookup([-1,4,1])

# Same problem as above

In [52]:
# help(tf1.feature_column.categorical_column_with_identity)

In [53]:
# Reference for this cell:
# https://www.tensorflow.org/guide/migrate/migrating_feature_columns#one-hot_encoding_integer_ids


categorical_col = tf1.feature_column \
                     .categorical_column_with_identity('my_try', num_buckets=10, default_value=9)

indicator_col = tf1.feature_column.indicator_column(categorical_col)
call_feature_columns(indicator_col, {'my_try': [1,4,7,99]})

#vocab = [1,4,7,99]
#lookup = tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=vocab, output_mode='one_hot')
#
#lookup([-1,4,1])

<tf.Tensor: shape=(4, 10), dtype=float32, numpy=
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]], dtype=float32)>

<br>

```python
for name in categorical_feature_names:
    vocab = sorted(set(df[name]))
    print(f'name: {name}')
    print(f'vocab: {vocab}\n')

    if type(vocab[0]) is str:
        lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')
    else:
        lookup = tf.keras.layers.IntegerLookup(vocabulary=vocab, output_mode='one_hot')

    x = inputs[name][:, tf.newaxis]
    x = lookup(x)
    preprocessed.append(x)
```

In [54]:
# My own attempt (written by Pingping)
p = []
for name in categorical_feature_names:
    vocab = sorted(set(df[name]))
    print(f'name: {name}')
    print(f'vocab: {vocab}')

    if type(vocab[0]) is str:
        print("entering string branch")
        # lookup = tf.keras.layers.StringLookup(vocabulary=vocab, output_mode='one_hot')
        
        key = 'string'
        string_col = tf1.feature_column \
                       .categorical_column_with_vocabulary_list('string',
                                                                vocabulary_list=vocab,
                                                                num_oov_buckets=1)

        indicator_col = tf1.feature_column.indicator_column(string_col)
        # call_feature_columns(indicator_col, {'string': vocab})
    else:
        print("entering integer branch")
        # lookup = tf.keras.layers.IntegerLookup(vocabulary=vocab, output_mode='one_hot')
        
        key = 'integer'
        integer_col = tf1.feature_column \
                             .categorical_column_with_identity('integer',
                                                               num_buckets=len(vocab)+1)

        indicator_col = tf1.feature_column.indicator_column(integer_col)
        # call_feature_columns(indicator_col, {'integer': vocab})


    x = inputs[name][:, tf.newaxis]
    # x = lookup(x)
    # x = call_feature_columns(indicator_col, {key: vocab})
    # NOTE: this creates a DenseFeatures layer object; it is never called on `x`, so no tensor is produced.
    x = tf1.keras.layers.DenseFeatures(indicator_col)
    p.append(x)
    print()


name: cp
vocab: [0, 1, 2, 3, 4]
entering integer branch

name: restecg
vocab: [0, 1, 2]
entering integer branch

name: slope
vocab: [1, 2, 3]
entering integer branch

name: thal
vocab: ['1', '2', 'fixed', 'normal', 'reversible']
entering string branch

name: ca
vocab: [0, 1, 2, 3]
entering integer branch



In [55]:
p

[<tensorflow.python.keras.feature_column.dense_features.DenseFeatures at 0x1e6e09bd0d0>,
 <tensorflow.python.keras.feature_column.dense_features.DenseFeatures at 0x1e6e0925ca0>,
 <tensorflow.python.keras.feature_column.dense_features.DenseFeatures at 0x1e43d9c92b0>,
 <tensorflow.python.keras.feature_column.dense_features.DenseFeatures at 0x1e6e08dc6d0>,
 <tensorflow.python.keras.feature_column.dense_features.DenseFeatures at 0x1e6e0936f10>]

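<font color=red>Judging from the output above, the elements collected in `p` (the ones that would have been appended to the end of `preprocessed`) are `DenseFeatures` layer objects rather than tensors, so they cannot go through the subsequent processing. One-hot encoding via `tf1.feature_column` is therefore not suitable for the problem this notebook is trying to solve.

Another approach is needed, as follows:</font>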

<br>
<br>

In [56]:
vocab = ['a','b','c']
lookup = tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=vocab, 
                                                                 output_mode='int')

lookup(['c','a','a','b','zzz'])

<tf.Tensor: shape=(5,), dtype=int64, numpy=array([4, 2, 2, 3, 1], dtype=int64)>

In [57]:
vocab = ['a','b','c']
lookup = tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=vocab, 
                                                                 output_mode='int')

# arr = lookup(['c','a','a','b','zzz']).numpy()    # this also works
arr = lookup(np.array(['c','a','a','b','zzz'])).numpy()
d1 = arr.shape[0]
d2 = len(vocab) + 1    # one column per vocabulary entry, plus one for out-of-vocabulary values
z = np.zeros((d1, d2))
z[range(d1), arr-1] = 1
tf.convert_to_tensor(z, dtype=tf.float32)

<tf.Tensor: shape=(5, 4), dtype=float32, numpy=
array([[0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [1., 0., 0., 0.]], dtype=float32)>

In [58]:
# help(tf.keras.layers.experimental.preprocessing.StringLookup)
# help(tf.keras.layers.experimental.preprocessing.IntegerLookup)

In [59]:
# vocab = [1,4,7,99]
# lookup = tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=vocab,
#                                                                   output_mode='int')
# 
# lookup([-1,4,1])
#
# The call lookup([-1,4,1]) raises: AttributeError: 'list' object has no attribute 'dtype'

In [60]:
vocab = [1,4,7,99]
# vocab = np.array(vocab)
lookup = tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=vocab,
                                                                  output_mode='int')

# lookup(np.array([-1,4,1]))

arr = lookup(np.array([-1,4,1])).numpy()

d1 = arr.shape[0]
d2 = len(vocab) + 1    # one column per vocabulary entry, plus one for out-of-vocabulary values
z = np.zeros((d1, d2))
z[range(d1), arr-1] = 1
tf.convert_to_tensor(z, dtype=tf.float32)

<tf.Tensor: shape=(3, 5), dtype=float32, numpy=
array([[1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]], dtype=float32)>
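
The two workarounds above follow the same recipe: map each value to an integer id, reserve one id for out-of-vocabulary values, and turn the ids into one-hot rows. The sketch below collects that recipe into one NumPy-only function. `one_hot_lookup` is a hypothetical helper written for this note, not a Keras API, and placing the out-of-vocabulary slot in column 0 is an assumption chosen to match the layout of the two outputs above.

```python
import numpy as np


def one_hot_lookup(values, vocab):
    """Hypothetical helper: one-hot encode `values` against `vocab`.

    Column 0 is reserved for out-of-vocabulary values;
    column i + 1 corresponds to vocab[i].
    """
    index = {token: i + 1 for i, token in enumerate(vocab)}
    ids = np.array([index.get(v, 0) for v in values])
    return np.eye(len(vocab) + 1, dtype=np.float32)[ids]


string_one_hot = one_hot_lookup(['c', 'a', 'a', 'b', 'zzz'], ['a', 'b', 'c'])
integer_one_hot = one_hot_lookup([-1, 4, 1], [1, 4, 7, 99])
print(string_one_hot)
print(integer_one_hot)
```

Unlike the lookup layers, this helper runs eagerly on concrete values only, so it cannot be applied to symbolic `tf.keras.Input` tensors inside a functional model.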

<br>
<br>
<br>

In [96]:
c = tf.constant([[1,2],[3,4]])

c_ki = tf.keras.Input(tensor=c, name='c_ki')
c_ki

<KerasTensor: shape=(2, 2) dtype=int32 (created by layer 'c_ki')>

In [97]:
c_ki(c)

TypeError: 'KerasTensor' object is not callable

In [93]:
categorical_feature_names

['cp', 'restecg', 'slope', 'thal', 'ca']

In [94]:
inputs2 = {}
for name, column in df.items():
    if type(column[0]) == str:
        dtype = tf.string
    elif (name in categorical_feature_names or name in binary_feature_names):
        dtype = tf.int64
    else:
        dtype = tf.float32

    inputs2[name] = tf.keras.Input(shape=(), name=name, dtype=dtype)

inputs2

{'age': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'age')>,
 'sex': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'sex')>,
 'cp': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'cp')>,
 'trestbps': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'trestbps')>,
 'chol': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'chol')>,
 'fbs': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'fbs')>,
 'restecg': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'restecg')>,
 'thalach': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'thalach')>,
 'exang': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'exang')>,
 'oldpeak': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'oldpeak')>,
 'slope': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'slope')>,
 'ca': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'ca')>,
 'thal': <KerasTensor: shape=(None,) dtype=string 

In [103]:
p = []
for name in categorical_feature_names:
    vocab = sorted(set(df[name]))
    print(f'name: {name}')
    print(f'vocab: {vocab}')

    
    if type(vocab[0]) is str:
        print("Entering str branch")
        lookup = tf.keras.layers.experimental.preprocessing\
                                .StringLookup(vocabulary=vocab, output_mode='int')
    else:
        print("Entering int branch")
        # vocab.insert(1, -1)
        lookup = tf.keras.layers.experimental.preprocessing.IntegerLookup(vocabulary=vocab,
                                                                          mask_token=None,
                                                                          # oov_token=-1,
                                                                          output_mode='int')
    
    print(inputs[name])
    x = inputs[name][:, tf.newaxis]
    x = lookup(x)
    print(x)
    print(x)
    # preprocessed.append(x)
    p.append(x)
    print()


#arr = lookup(np.array([-1,4,1])).numpy()
#
#d1 = arr.shape[0]
#d2 = max(len(set(vocab)), len(set(vocab)|set([-1,4,1])))
#z = np.zeros((d1, d2))
#z[range(d1), arr-1] = 1
#tf.convert_to_tensor(z, dtype=tf.float32)

p

name: cp
vocab: [0, 1, 2, 3, 4]
Entering int branch
KerasTensor(type_spec=TensorSpec(shape=(None,), dtype=tf.int64, name='cp'), name='cp', description="created by layer 'cp'")
<dtype: 'int64'>
KerasTensor(type_spec=TensorSpec(shape=(None, 1), dtype=tf.int64, name=None), name='integer_lookup_43/None_lookup_table_find/LookupTableFindV2:0', description="created by layer 'integer_lookup_43'")

name: restecg
vocab: [0, 1, 2]
Entering int branch
KerasTensor(type_spec=TensorSpec(shape=(None,), dtype=tf.int64, name='restecg'), name='restecg', description="created by layer 'restecg'")
<dtype: 'int64'>
KerasTensor(type_spec=TensorSpec(shape=(None, 1), dtype=tf.int64, name=None), name='integer_lookup_44/None_lookup_table_find/LookupTableFindV2:0', description="created by layer 'integer_lookup_44'")

name: slope
vocab: [1, 2, 3]
Entering int branch
KerasTensor(type_spec=TensorSpec(shape=(None,), dtype=tf.int64, name='slope'), name='slope', description="created by layer 'slope'")
<dtype: 'int64'>
KerasTensor(type_spec=TensorSpec(shape=(

[<KerasTensor: shape=(None, 1) dtype=int64 (created by layer 'integer_lookup_43')>,
 <KerasTensor: shape=(None, 1) dtype=int64 (created by layer 'integer_lookup_44')>,
 <KerasTensor: shape=(None, 1) dtype=int64 (created by layer 'integer_lookup_45')>,
 <KerasTensor: shape=(None, 1) dtype=int64 (created by layer 'string_lookup_12')>,
 <KerasTensor: shape=(None, 1) dtype=int64 (created by layer 'integer_lookup_46')>]

<br>

#### Assemble the preprocessing head

At this point `preprocessed` is just a Python list of all the preprocessing results. Each result has a shape of `(batch_size, depth)`:


In [90]:
preprocessed

[<KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_10')>,
 <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_11')>,
 <KerasTensor: shape=(None, 1) dtype=float32 (created by layer 'tf.cast_12')>,
 <KerasTensor: shape=(None, 5) dtype=float32 (created by layer 'normalization_3')>]

<br>

Concatenate all the preprocessed features along the `depth` axis, so each dictionary-example is converted into a single vector. The vector contains binary features, numeric features, and categorical one-hot features:

In [None]:
preprocesssed_result = tf.concat(preprocessed, axis=-1)
preprocesssed_result

<br>
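
To make the concatenation along the `depth` axis concrete, here is a minimal sketch with toy tensors. The names `binary`, `numeric`, `one_hot`, and `toy_result` are illustrative stand-ins, not the tutorial's actual preprocessed features, and the values are made up:

```python
import tensorflow as tf

# Toy stand-ins for preprocessed features; each has shape (batch_size, depth).
binary = tf.constant([[1.0], [0.0]])                        # shape (2, 1)
numeric = tf.constant([[0.5, -1.2], [1.1, 0.3]])            # shape (2, 2)
one_hot = tf.constant([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])   # shape (2, 3)

# Concatenating along the last axis keeps the batch dimension and
# adds up the depths: 1 + 2 + 3 = 6.
toy_result = tf.concat([binary, numeric, one_hot], axis=-1)
print(toy_result.shape)
```

Each row of `toy_result` is one example's full feature vector, in the same order as the list passed to `tf.concat`.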

Now create a model out of that calculation so it can be reused:

In [106]:
# help(tf.keras.Model)

In [None]:
preprocessor = tf.keras.Model(inputs, preprocesssed_result)

In [None]:
tf.keras.utils.plot_model(preprocessor, rankdir="LR", show_shapes=True)

The following figure shows the result of running this cell in Colab:


<img src="./images/dataframe_plot_model_preprocessor.png" width=1000px>

<br>

To test the preprocessor, use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html" class="external">DataFrame.iloc</a> accessor to slice the first example from the DataFrame. Then convert it to a dictionary and pass the dictionary to the preprocessor. The result is a single vector containing the binary features, normalized numeric features and the one-hot categorical features, in that order:

In [None]:
preprocessor(dict(df.iloc[:1]))

<br>
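
The `dict(df.iloc[:1])` call above may look unusual, so here is a small sketch of what it produces. `toy_df` is a hypothetical two-row DataFrame used only for illustration; it is not the heart disease data:

```python
import pandas as pd

# Hypothetical miniature DataFrame with one numeric and one string column.
toy_df = pd.DataFrame({'age': [63, 67], 'thal': ['fixed', 'normal']})

# Slicing with iloc[:1] keeps a one-row DataFrame (the batch dimension
# survives), and dict() maps each column name to a pandas Series.
first_example = dict(toy_df.iloc[:1])

print(list(first_example.keys()))
print(type(first_example['age']), len(first_example['age']))
```

Using `iloc[:1]` rather than `iloc[0]` matters here: the latter would return a single row as a Series indexed by column name, dropping the batch dimension that the Keras inputs expect.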

Now build the main body of the model. Use the same configuration as in the previous example: a couple of `Dense` rectified-linear layers and a `Dense(1)` output layer for the classification.

In [None]:
body = tf.keras.Sequential([tf.keras.layers.Dense(10, activation='relu'),
                            tf.keras.layers.Dense(10, activation='relu'),
                            tf.keras.layers.Dense(1)
                           ])

<br>

Now put the two pieces together using the Keras functional API.

In [108]:
inputs

{'age': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'age')>,
 'sex': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'sex')>,
 'cp': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'cp')>,
 'trestbps': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'trestbps')>,
 'chol': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'chol')>,
 'fbs': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'fbs')>,
 'restecg': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'restecg')>,
 'thalach': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'thalach')>,
 'exang': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'exang')>,
 'oldpeak': <KerasTensor: shape=(None,) dtype=float32 (created by layer 'oldpeak')>,
 'slope': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'slope')>,
 'ca': <KerasTensor: shape=(None,) dtype=int64 (created by layer 'ca')>,
 'thal': <KerasTensor: shape=(None,) dtype=string 

In [None]:
x = preprocessor(inputs)
x

In [None]:
result = body(x)
result

In [None]:
model = tf.keras.Model(inputs, result)

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

<br>
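
Because the final `Dense(1)` layer has no activation, the model emits raw logits, which is why the loss above is configured with `from_logits=True`. The pure-Python sketch below illustrates that relationship using a common numerically stable formulation of binary cross-entropy. The helper names are hypothetical, and this is not claimed to be the exact code path Keras runs internally:

```python
import math

def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

def bce_from_probability(label, prob):
    """Binary cross-entropy computed from a probability."""
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

def bce_from_logit(label, logit):
    """Same loss computed directly from the logit, in a stable form:
    max(z, 0) - z * y + log(1 + exp(-|z|))."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

logit, label = 1.5, 1.0
loss_a = bce_from_probability(label, sigmoid(logit))
loss_b = bce_from_logit(label, logit)
print(abs(loss_a - loss_b) < 1e-9)
```

A practical consequence: to turn this model's predictions into probabilities, apply a sigmoid to the output of `model.predict`.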

This model expects a dictionary of inputs. The simplest way to pass it the data is to convert the DataFrame to a dict and pass that dict as the `x` argument to `Model.fit`:

In [None]:
history = model.fit(dict(df), target, epochs=5, batch_size=BATCH_SIZE)

<br>

Using `tf.data` works as well:

In [None]:
ds = tf.data.Dataset.from_tensor_slices((dict(df), target))

ds = ds.batch(BATCH_SIZE)

In [None]:
import pprint

for x, y in ds.take(1):
    pprint.pprint(x)
    print()
    print(y)

In [None]:
history = model.fit(ds, epochs=5)

<br>
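
As a rough illustration of what the `tf.data` pipeline above yields, the sketch below builds a dataset from a plain dictionary of lists. `toy_features` and `toy_target` are made-up stand-ins for `dict(df)` and `target`:

```python
import tensorflow as tf

# Hypothetical features and labels, standing in for dict(df) and target.
toy_features = {
    'age': [63.0, 67.0, 41.0],
    'thal': ['fixed', 'normal', 'reversible'],
}
toy_target = [0, 1, 0]

# Each element is a (features_dict, label) pair; batching stacks the
# values under every key, so each batch is still a dictionary of tensors.
toy_ds = tf.data.Dataset.from_tensor_slices((toy_features, toy_target)).batch(2)

for features, labels in toy_ds.take(1):
    print(features['age'].shape, features['thal'].dtype, labels.shape)
```

Because every batch keeps the dictionary structure, its keys line up with the names of the model's `tf.keras.Input` layers, which is what lets `Model.fit` route each column to the right input.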
<br>
<br>

```python
# MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
```

<br>
<br>
<br>