# A class version of the interactive feature columns mapping builder

Why do I want a class version instead of a Jupyter Notebook?

Because I eventually want to build an end-to-end time series solution, and an estimator class suits that goal better.

This tool focuses on mapping the columns of a Pandas dataframe to TensorFlow feature columns, which are then used to train a model.

TensorFlow provides many types of feature columns.
See https://www.tensorflow.org/tutorials/structured_data/feature_columns for details.

Please note that this tool supports only a limited set of feature column types. The generated code may be limited as well, but you can modify it. A lambda statement can also be used for data preprocessing.
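To make the lambda idea concrete, here is a minimal sketch in plain Python of a min-max normalizer lambda and its inverse. The bounds 29 and 77 are made-up illustrative values; in the builder, a lambda of this shape would be passed as `normalizer_fn` to `numeric_column`.

```python
# Hypothetical bounds for an "age" column; real values would come from the data.
AGE_MIN, AGE_MAX = 29.0, 77.0

# Min-max normalizer: maps [AGE_MIN, AGE_MAX] onto [0, 1].
normalize = lambda v: (v - AGE_MIN) / (AGE_MAX - AGE_MIN)

# Inverse mapping, useful for turning scaled values back into original units.
denormalize = lambda v: v * (AGE_MAX - AGE_MIN) + AGE_MIN

scaled = normalize(53.0)
print(scaled)               # 0.5
print(denormalize(scaled))  # 53.0
```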

I am working on Deep Time (https://github.com/MRYingLEE/DeepTime-Deep-Learning-Framework-for-Time-Series-Forecasting). This tool is part of my research work.

TensorFlow 2.x is used.


# Import TensorFlow and other libraries

Support for the sklearn preprocessing functions (https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) may be added later.

So far, only `train_test_split` from sklearn is used.

In [0]:
!pip install scikit-learn

In [0]:
import numpy as np
import pandas as pd

import tensorflow as tf

from tensorflow import feature_column
from tensorflow.feature_column import *
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

from io import StringIO
# Import ipywidgets

# ipywidgets (https://github.com/jupyter-widgets/ipywidgets) makes the Jupyter Notebook interactive.

from ipywidgets import GridspecLayout
import ipywidgets as widgets
from ipywidgets import Button, Layout, jslink, IntText, IntSlider


# Useful helper functions for feature columns

I created some helper functions to keep the generated code short and easy to read.

In [0]:
# A function to generate a one-hot column by the vocabulary list.
def categorical_strings(column,vocabulary_list):
  sparse_column = feature_column.categorical_column_with_vocabulary_list(
      column, vocabulary_list)
  one_hot_column = feature_column.indicator_column(sparse_column)
  return one_hot_column

# A function to generate an embedding column by the vocabulary list.
def categorical_strings_embedding(column,vocabulary_list, embedding_dim=8):
  sparse_column = feature_column.categorical_column_with_vocabulary_list(
      column, vocabulary_list)
  embedding_column = feature_column.embedding_column(sparse_column, dimension=embedding_dim)
  return embedding_column

# A function to generate a one-hot column from a hashed categorical column (vocabulary_list is currently unused).
def categorical_hash(column,vocabulary_list, bucket_size=1000):
  hashed = feature_column.categorical_column_with_hash_bucket(
      column, hash_bucket_size=bucket_size)
  hashed=feature_column.indicator_column(hashed)
  return hashed

# A function to generate a one-hot column by the vocabulary list for an integer column.
def categorical_identitys(column,vocabulary_list):
  min_int=np.min(vocabulary_list)
  max_int=np.max(vocabulary_list)
  count_v=len(vocabulary_list)

  if ((min_int<0) or (max_int>20)):
    sparse_column = feature_column.categorical_column_with_hash_bucket(
      column, count_v, dtype=tf.dtypes.int32)
  else:
    # num_buckets must exceed the largest id, because valid ids lie in [0, num_buckets)
    sparse_column = feature_column.categorical_column_with_identity(
      column, int(max_int)+1)
    
  one_hot_column = feature_column.indicator_column(sparse_column)
  return one_hot_column
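
The branching in `categorical_identitys` can be hard to read at a glance. The sketch below restates the decision in plain Python, without TensorFlow, so it can be checked quickly. `choose_int_encoding` is a hypothetical helper, not part of the tool; it assumes an identity column needs `max + 1` buckets because valid ids run from 0 up to, but not including, the bucket count.

```python
def choose_int_encoding(vocabulary_list, identity_limit=20):
    """Return (strategy, bucket_count) for an integer categorical column."""
    min_int = min(vocabulary_list)
    max_int = max(vocabulary_list)
    if min_int < 0 or max_int > identity_limit:
        # Ids fall outside a small non-negative range: hash them,
        # using one bucket per listed value.
        return "hash", len(vocabulary_list)
    # Ids are small and non-negative: use them directly.
    # Ids 0..max_int need max_int + 1 buckets.
    return "identity", max_int + 1

print(choose_int_encoding([0, 1, 2, 3]))   # ('identity', 4)
print(choose_int_encoding([-1, 5, 1000]))  # ('hash', 3)
```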


# Class of an Estimator

In [0]:
class TsEstimator:
  # The feature column types
  # Here is a full list of the built-in feature columns of TensorFlow 2,
  # but not all of them are supported by this tool.
  feature_kinds={
      "bucketized_column(...)":"Represents discretized dense input bucketed by boundaries.",
      "categorical_column_with_hash_bucket(...)":"Represents sparse feature where ids are set by hashing.",
      "categorical_column_with_identity(...)":"A CategoricalColumn that returns identity values.",
      "categorical_column_with_vocabulary_file(...)":"A CategoricalColumn with a vocabulary file.",
      "categorical_column_with_vocabulary_list(...)":"A CategoricalColumn with in-memory vocabulary.",
      "crossed_column(...)":"Returns a column for performing crosses of categorical features.",
      "embedding_column(...)":"DenseColumn that converts from sparse, categorical input.",
      "indicator_column(...)":"Represents multi-hot representation of given categorical column.",
      "make_parse_example_spec(...)":"Creates parsing spec dictionary from input feature_columns.",
      "numeric_column(...)":"Represents real valued or numerical features.",
      "sequence_categorical_column_with_hash_bucket(...)":"A sequence of categorical terms where ids are set by hashing.",
      "sequence_categorical_column_with_identity(...)":"Returns a feature column that represents sequences of integers.",
      "sequence_categorical_column_with_vocabulary_file(...)":"A sequence of categorical terms where ids use a vocabulary file.",
      "sequence_categorical_column_with_vocabulary_list(...)":"A sequence of categorical terms where ids use an in-memory list.",
      "sequence_numeric_column(...)":"Returns a feature column that represents sequences of numeric data.",
      "shared_embeddings(...)":"List of dense columns that convert from sparse, categorical input.",
      "weighted_categorical_column(...)":"Applies weight values to a CategoricalColumn.",
      "?":"Unknown"
    }
  # The default feature kind for each Pandas dtype

  # For every Pandas dtype, a default feature kind is assigned.

  dtype_default_feature={
      "object":"?",
      "int64":"numeric_column(...)",
      "float64":"numeric_column(...)",
      "bool":"numeric_column(...)",
      "datetime64":"?",
      "timedelta[ns]":"?",
      "category":"categorical_strings(...)"
    }

  # The available feature columns for each Pandas dtype

  # This table records which feature kinds can be used with which dtype.

  # Some advanced feature kinds are disabled here.


  dtype_features_cross = StringIO("""Kind,object,int64,float64,bool,datetime64,timedelta[ns],category,cat_int64,cat_string
    bucketized_column(...),FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
    categorical_column_with_hash_bucket(...),FALSE,TRUE,FALSE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE
    categorical_column_with_identity(...),FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE
    categorical_column_with_vocabulary_file(...),FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
    categorical_column_with_vocabulary_list(...),FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE
    crossed_column(...),FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
    embedding_column(...),TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE
    indicator_column(...),FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
    make_parse_example_spec(...),FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
    numeric_column(...),FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE
    sequence_categorical_column_with_hash_bucket(...),TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE
    sequence_categorical_column_with_identity(...),TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE
    sequence_categorical_column_with_vocabulary_file(...),TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE
    sequence_categorical_column_with_vocabulary_list(...),TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE
    sequence_numeric_column(...),TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE
    shared_embeddings(...),FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE
    weighted_categorical_column(...),FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,FALSE
    categorical_identitys,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE
    categorical_strings,TRUE,FALSE,FALSE,FALSE,FALSE,FALSE,TRUE,FALSE,TRUE
        """)
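  # skipinitialspace strips the indentation in front of each row, so the Kind values carry no leading spaces
  df_dtype_features_cross = pd.read_csv(dtype_features_cross, sep=",", skipinitialspace=True)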

  def __init__(self, df_all, df_train, df_val=None, df_test=None, categories_limit=20):
    self._df_all=df_all

    assert(df_all is not None)

    if (df_train is None):
      self._df_train=df_all
    else:
      self._df_train=df_train
    self._df_val=df_val
    self._df_test=df_test

    self.input_features=[]
    self.label_features=[]

    self.global_normalizers={} # Not used so far 
    self.categorical_columns=[]
    # If a column has fewer than this number (20 by default) of unique values, I will treat it as a category column.
    self.categories_limit=categories_limit
    self.grid=None
    self.category_lists= self.__df_desc()

  @classmethod
  def get_available_features(cls,col_dtype):
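    if col_dtype not in cls.df_dtype_features_cross.columns:
      return set() # a dtype the table does not list, e.g. datetime64[ns]
    return set(cls.df_dtype_features_cross[["Kind",col_dtype]][cls.df_dtype_features_cross[col_dtype]]["Kind"].unique())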

  # To generate a normalizer lambda and the matching denormalizer

  # So far, only 2 kinds of normalizer and denormalizer are supported:

  # min-max  : (value-min)/(max-min)
  # To generate min-max normalizer and denormalizer lambda statements
  @staticmethod
  def min_max_normalizer(min_v,max_v, v_str="by_train",is_int64=False):
    if is_int64:
      ext_v_str="tf.cast("+v_str+",tf.float32)"
    else:
      ext_v_str=v_str
    
    return "lambda "+v_str+": ("+ext_v_str+ " -"+str(min_v)+")/("+str(max_v)+"-"+str(min_v)+")","lambda "+v_str+": "+ext_v_str+ " *("+str(max_v)+"-"+str(min_v)+")+"+str(min_v)

  # mean-std  : (value-mean)/std
  # To generate mean-std normalizer and denormalizer lambda statements
  @staticmethod
  def std_normalizer(v_mean,v_std, v_str="by_train",is_int64=False):
    if is_int64:
      ext_v_str="tf.cast("+v_str+",tf.float32)"
    else:
      ext_v_str=v_str

    return "lambda "+v_str+": ("+ext_v_str+ " -"+str(v_mean)+")/"+str(v_std),"lambda "+v_str+": "+ext_v_str+ " *"+str(v_std)+"+"+str(v_mean)

  # To generate min-max/mean-std normalizer and denormalizer lambda statements from the given statistics
  @staticmethod
  def create_local_normalizers(col_name,df_statistics, v_str="by_train",is_int64=False):
    v_min=df_statistics.loc[col_name]["min"]
    v_max=df_statistics.loc[col_name]["max"]
    v_mean=df_statistics.loc[col_name]["mean"]
    v_std=df_statistics.loc[col_name]["std"]

    n1,d1=TsEstimator.min_max_normalizer(v_min,v_max,v_str,is_int64=is_int64)
    n2,d2=TsEstimator.std_normalizer(v_mean,v_std,v_str,is_int64=is_int64)

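    normalizers={n1:d1,n2:d2} # maps each normalizer lambda to its denormalizer; avoids shadowing the built-in locals()
    return normalizers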

  # To generate a suitable string for an integer list
  @staticmethod
  def int_list_as_string(a):
    s = [str(i) for i in a]
    return  "["+",".join(s)+"]"

  # To generate a suitable string for a string list
  @staticmethod
  def string_list_as_string(s):
    return  "['"+"','".join(s)+"']"

  # To generate the available feature kinds and suitable normalizer lambda statements for every column.

  # Please note the whole dataframe and the train part are both required.

  # The whole dataframe is used to decide the vocabulary list for each column.

  # Both the whole dataframe and the train part are used to generate lambda statements for NUMERIC columns. So normalizing can be based on the whole data or only the train part. It's up to the data scientist.

  def __df_desc(self):
    df_all=self._df_all
    df_train=self._df_train

    df_statistics_train=df_train.describe().T # Statistics of the train part only
    df_statistics_all=df_all.describe().T # Statistics of the whole dataframe
    
    category_lists={}
    
    for c in df_train.columns:
      dtype_name=df_train[c].dtype.name

      availables=self.get_available_features(dtype_name)

      if availables is None:
        availables=set() # must be a set, because add() is called on it below

      feature="numeric_column('"+c+"')"

      local_normalizers={}

      if ((dtype_name=="int64") or (dtype_name=="object")):
        is_int64=(dtype_name=="int64")

        values_unique=df_all[c].unique()
        f=len(values_unique)   # I use all rows to decide the category list
        if f<self.categories_limit: #Category
          if is_int64:
            feature=categorical_identitys.__name__+"('"+c+"',"+self.int_list_as_string(values_unique)+")"
          else:
            feature=categorical_strings.__name__+"('"+c+"',"+self.string_list_as_string(values_unique)+")"
          self.categorical_columns.append(c)
        else:
          if is_int64:
            feature="numeric_column('"+c+"')"
            local_normalizers=self.create_local_normalizers(c,df_statistics_train,v_str="by_train", is_int64=True)
            self.global_normalizers.update(local_normalizers)
            local_normalizers1=self.create_local_normalizers(c,df_statistics_all,v_str="by_all", is_int64=True)
            self.global_normalizers.update(local_normalizers1)
            local_normalizers.update(local_normalizers1)
          else:
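            # Same text form as the entries in "available". embedding_column expects a categorical column and a dimension, so edit the generated code before use.
            feature="embedding_column('"+c+"')"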
      else:
        if (dtype_name=="float64"):
            feature="numeric_column('"+c+"')"
            local_normalizers=self.create_local_normalizers(c,df_statistics_train,v_str="by_train", is_int64=False)
            self.global_normalizers.update(local_normalizers)
            local_normalizers1=self.create_local_normalizers(c,df_statistics_all,v_str="by_all", is_int64=False)
            self.global_normalizers.update(local_normalizers1)
            local_normalizers.update(local_normalizers1)
        elif  (dtype_name=="bool"):
            feature="numeric_column('"+c+"')"
        elif (dtype_name=="category"):
          feature="categorical_column_with_vocabulary_list('"+c+"')"
          self.categorical_columns.append(c)
        else:
          feature=self.dtype_default_feature.get(dtype_name,"?") # fall back to "?" for dtypes without a default
      
      availables.add(feature)

      availables={s.replace("(...)","('"+c+"')") for s in availables}
      category_lists[c]={"default":feature,"available":availables,"normalizers": local_normalizers}

    return category_lists


  def get_feature_grid(self,default_inputs=[], default_labels=[]):
    if self.grid is not None:
      return self.grid

    # category_lists=df_desc(df_all,df_train)
    df_all=self._df_all
    df_train=self._df_train

    cols=len(df_train.columns)
    grid = GridspecLayout(cols+1, 12)
    # To add a header at row 0
    grid[0,0]= widgets.Label(value="Column")
    grid[0,1]= widgets.Label(value="dtype")
    grid[0,2]= widgets.Label(value="Input?")
    grid[0,3]= widgets.Label(value="Label?")
    grid[0,4:7]= widgets.Label(value="Feature Kind")
    grid[0,8:]= widgets.Label(value="Numeric Normalizer")

    for i in range(cols):
      feature_option=self.category_lists[df_train.columns[i]]
      grid[i+1,0]= widgets.Label(value=df_train.columns[i])
      grid[i+1,1]= widgets.Label(value=df_train.dtypes.iloc[i].name) # positional access; dtypes is indexed by column name
      grid[i+1,2]=widgets.Checkbox(value=(df_train.columns[i] in default_inputs),description='',indent=False,layout=Layout(height='auto', width='auto'))
      grid[i+1,3]=widgets.Checkbox(value=(df_train.columns[i] in default_labels),indent=False,description='',layout=Layout(height='auto', width='auto'))
      
      grid[i+1,4:7]= widgets.Dropdown(
        options=list(feature_option['available']),
        value=feature_option['default'],
        description="",
        layout=Layout(height='auto', width='auto')
        )
      
      if len(feature_option['normalizers'])>0:
        grid[i+1,8:]=widgets.Dropdown(
          options=list(feature_option['normalizers'].keys()),
          value=list(feature_option['normalizers'].keys())[0],
          layout=Layout(height='auto', width='auto'),
          description=""
          )
    
    self.grid=grid

    return grid

    # To generate code based on interactive grid
  def __generate_code(self):
    code_generator=[]
    feature_inputs=[]
    feature_labels=[]

    grid=self.grid
    for i in range(1,grid.n_rows):
      f_col=grid[i,4].value
      # print(f_col)
      if (grid[i,4].value.startswith("numeric_column(") and (grid[i,1].value !="bool")):
        f_col=f_col[:-1]

        if (grid[i,2].value==True):
          code_generator.append("input_features.append("+f_col+",normalizer_fn="+grid[i,8].value+"))")
          feature_inputs.append(grid[i,0].value)
        if (grid[i,3].value==True):
          code_generator.append("label_features.append("+f_col+",normalizer_fn="+grid[i,8].value+"))")
          feature_labels.append(grid[i,0].value)
      else:
        if (grid[i,2].value==True):
          code_generator.append("input_features.append("+f_col+")")
          feature_inputs.append(grid[i,0].value)
        if (grid[i,3].value==True):
          code_generator.append("label_features.append("+f_col+")")
          feature_labels.append(grid[i,0].value)
    return code_generator, feature_inputs, feature_labels

  def __run_generated_code(self, code_generator):
    code=';'.join(code_generator)
    # print(code)

    try:
      self.input_features.clear()
      self.label_features.clear()
      exec(code,None, {'input_features':self.input_features,'label_features':self.label_features})
      print("The feature_columns have been generated!")
    except Exception as e: # report the error itself; the sys module was never imported
      print("Please check the generated code:", e)
    # print(code_generator)

  def update_by_grid(self):
    # Keep the selected column names so that later cells can build datasets from them
    code, self.feature_inputs, self.feature_labels = self.__generate_code()
    # print("code:",code)
    self.__run_generated_code(code)
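
The generate-then-execute pattern used by `update_by_grid` can be tried without a notebook or TensorFlow. In this minimal sketch, `numeric_column` is a stand-in defined locally (an assumption for illustration, not the TensorFlow function), the `selections` list plays the role of the grid, and the constants inside the lambdas are made up.

```python
def numeric_column(key, normalizer_fn=None):
    """Stand-in for a feature column constructor: it just records its arguments."""
    return {"key": key, "normalizer_fn": normalizer_fn}

# (column name, normalizer lambda source) pairs, as a user might pick them in the grid.
selections = [("age", "lambda v: (v - 29.0) / (77.0 - 29.0)"),
              ("chol", "lambda v: (v - 246.0) / 51.0")]

# Build one statement per selected column, then join them into a single code string.
statements = ["input_features.append(numeric_column('%s', normalizer_fn=%s))" % (name, fn)
              for name, fn in selections]
code = ";".join(statements)

input_features = []
# Run the generated code in an explicit namespace so its effects are easy to trace.
exec(code, {"numeric_column": numeric_column, "input_features": input_features})

print([f["key"] for f in input_features])        # ['age', 'chol']
print(input_features[0]["normalizer_fn"](53.0))  # 0.5
```

Passing an explicit namespace to `exec` keeps the generated statements from touching unrelated variables, which makes failures easier to diagnose.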


## <font color=red> Your Dataframe here</font>
Typically, this is the <font color=red>ONLY</font> place for you to type.


In [0]:
csvURL = '' # the csv data file or web path

In [0]:

Testing=True # INTERNAL for me to debug. You don't need to care about this.

In [0]:
dataframe=None
default_inputs=[]  # The default features list for inputs
default_labels=[] # The default features list for labels

if (csvURL!=''):
  dataframe = pd.read_csv(csvURL)
  dataframe.head()

## A demo dataframe if you don't create one

I will use a small [dataset](https://archive.ics.uci.edu/ml/datasets/heart+Disease) provided by the Cleveland Clinic Foundation for Heart Disease. There are several hundred rows in the CSV. Each row describes a patient, and each column describes an attribute.<br>
Notice there are both numeric (including bool) and categorical columns.

>Column| Description| Feature Type | Data Type
>------------|--------------------|----------------------|-----------------
>Age | Age in years | Numerical | integer
>Sex | (1 = male; 0 = female) | Categorical | integer
>CP | Chest pain type (0, 1, 2, 3, 4) | Categorical | integer
>Trestbps | Resting blood pressure (in mm Hg on admission to the hospital) | Numerical | integer
>Chol | Serum cholesterol in mg/dl | Numerical | integer
>FBS | (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) | Categorical | integer
>RestECG | Resting electrocardiographic results (0, 1, 2) | Categorical | integer
>Thalach | Maximum heart rate achieved | Numerical | integer
>Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical | integer
>Oldpeak | ST depression induced by exercise relative to rest | Numerical | float
>Slope | The slope of the peak exercise ST segment | Numerical | integer
>CA | Number of major vessels (0-3) colored by fluoroscopy | Numerical | integer
>Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical | string
>Target | Diagnosis of heart disease (1 = true; 0 = false) | Classification | integer
>is_male | Whether a person is male (true or false) | Numerical | bool

In [0]:

if (dataframe is None):
  csvURL = 'https://storage.googleapis.com/applied-dl/heart.csv'
  labels='target'
  dataframe = pd.read_csv(csvURL)
  dataframe['is_male']=(dataframe['sex']==1) # A demo bool column; in this dataset sex is 1 for male
  default_inputs=['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca','is_male']  # The default features list for inputs
  default_labels=['target'] # The default features list for labels

dataframe.head()

## Split data into train, validation, and test sets

In [0]:
dataframe_train, dataframe_test = train_test_split(dataframe, test_size=0.2)
dataframe_train, dataframe_val = train_test_split(dataframe_train, test_size=0.2)
print(len(dataframe_train), 'train examples')
print(len(dataframe_val), 'validation examples')
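
Because the second split is taken from the remaining 80%, the final proportions are roughly 64% train, 16% validation, and 20% test. The arithmetic can be checked with a short sketch. The row count of 303 is an assumed example, and the rounding (held-out size rounded up) reflects, to my understanding, how `train_test_split` treats a fractional `test_size`.

```python
import math

def split_sizes(n_rows, test_size=0.2, val_size=0.2):
    """Return (train, validation, test) row counts for two successive splits."""
    n_test = math.ceil(n_rows * test_size)  # first split: hold out the test set
    n_rest = n_rows - n_test
    n_val = math.ceil(n_rest * val_size)    # second split: validation from the remainder
    return n_rest - n_val, n_val, n_test

print(split_sizes(303))  # (193, 49, 61)
```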

## Create an estimator

In [0]:
estimator=TsEstimator(dataframe, dataframe_train, dataframe_val, dataframe_test)
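
When the estimator is created, it scans every column and proposes a default feature kind. The rule of thumb it applies to integer and string columns can be sketched in plain Python as follows. `default_kind` is a hypothetical helper written for this illustration; it simplifies the class logic and leaves out float, bool, and category dtypes.

```python
def default_kind(values, is_int, categories_limit=20):
    """Propose a feature kind from the number of distinct values in a column."""
    n_unique = len(set(values))
    if n_unique < categories_limit:
        # Few distinct values: treat the column as categorical.
        return "categorical_identitys" if is_int else "categorical_strings"
    # Many distinct values: integers stay numeric, strings get an embedding.
    return "numeric_column" if is_int else "embedding_column"

print(default_kind([0, 1, 1, 0], is_int=True))                   # categorical_identitys
print(default_kind(["fixed", "normal", "fixed"], is_int=False))  # categorical_strings
print(default_kind(range(100), is_int=True))                     # numeric_column
```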

## Inspect the data by categorical columns

In [0]:
# Use seaborn for pairplot
!pip install -q seaborn
# import matplotlib.pyplot as plt
import seaborn as sns
# plt.figure(figsize=(20,5))
sns.pairplot(dataframe[estimator.categorical_columns], diag_kind="kde")

## To create an interactive grid

You may try the builder INTERACTIVELY.

In [0]:
grid=estimator.get_feature_grid(default_inputs, default_labels)
grid

<font color=red>**RERUN** the following cells once you change the above settings.</font>

In [0]:
estimator.update_by_grid()

# assert(len(estimator.code_generator)>0)

# code_generator
estimator.input_features

## Create an input pipeline using tf.data

Next, I will wrap the dataframes with [tf.data](https://www.tensorflow.org/guide/datasets).

In [0]:
# A utility method to create a tf.data dataset from a Pandas Dataframe
def df_to_dataset(dataframe,input_cols, label_cols, shuffle=True, batch_size=32):
  labels = dataframe[label_cols]
  dataframe= dataframe[input_cols]

  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  return ds

In [0]:
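batch_size = 5 # A small batch size is used for demonstration purposes
# Collect the column names ticked as inputs and labels in the grid
feature_inputs=[grid[i,0].value for i in range(1,grid.n_rows) if grid[i,2].value]
feature_labels=[grid[i,0].value for i in range(1,grid.n_rows) if grid[i,3].value]
train_ds = df_to_dataset(dataframe_train,feature_inputs,feature_labels, batch_size=batch_size)
val_ds = df_to_dataset(dataframe_val,feature_inputs, feature_labels,shuffle=False, batch_size=batch_size)
test_ds = df_to_dataset(dataframe_test,feature_inputs, feature_labels,shuffle=False, batch_size=batch_size)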

## Understand the input pipeline

Now that I have created the input pipeline, let's call it to see the format of the data it returns. I have used a small batch size to keep the output readable.

In [0]:
for feature_batch, label_batch in train_ds.take(1):
  print('Every feature:', list(feature_batch.keys()))
  print('A batch of ages:', feature_batch['age'])
  print('A batch of targets:', label_batch )

The dataset returns a dictionary of column names (from the dataframe) that map to column values from rows in the dataframe.
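
For readers without TensorFlow at hand, the shape of one batch can be imitated in plain Python. This is only an analogy for what `tf.data` yields, not its implementation; `iter_batches` and the sample values are made up for illustration.

```python
def iter_batches(columns, labels, batch_size):
    """Yield (feature_dict, label_list) pairs, batch_size rows at a time."""
    n_rows = len(labels)
    for start in range(0, n_rows, batch_size):
        stop = start + batch_size
        # Each batch keeps the column names as keys and slices the values.
        features = {name: values[start:stop] for name, values in columns.items()}
        yield features, labels[start:stop]

columns = {"age": [63, 67, 37, 41, 56], "chol": [233, 286, 250, 204, 236]}
labels = [0, 1, 0, 0, 0]

first_features, first_labels = next(iter_batches(columns, labels, batch_size=2))
print(first_features)  # {'age': [63, 67], 'chol': [233, 286]}
print(first_labels)    # [0, 1]
```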

## Test the features mapping

In [0]:
next(iter(train_ds))

In [0]:
# I will use this batch to demonstrate several types of feature columns
example_batch = next(iter(train_ds))[0]

In [0]:
example_batch

## Every feature mapping can be tested

In [0]:
if Testing:
  for f in estimator.input_features:
    print(f)
    feature_layer = layers.DenseFeatures(f,dtype='float64' )
    print(feature_layer(example_batch).numpy())

## Create a feature layer
Now that I have defined the feature columns, I will use a [DenseFeatures](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/DenseFeatures) layer to feed them to the Keras model.

In [0]:
feature_layer = tf.keras.layers.DenseFeatures(estimator.input_features)

## Create, compile, and train the model

In [0]:
model = tf.keras.Sequential([
  feature_layer,
  layers.Dense(128, activation='relu'),
  layers.Dense(128, activation='relu'),
  layers.Dense(1)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(train_ds,
          validation_data=val_ds,
          epochs=5)
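
Because the last layer has no activation and the loss is built with `from_logits=True`, the model outputs raw logits rather than probabilities. A minimal sketch of the conversion, using only the standard library (the logit values are arbitrary examples):

```python
import math

def sigmoid(logit):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))

for logit in (-2.0, 0.0, 2.0):
    # A positive logit corresponds to a probability above 0.5.
    print(logit, round(sigmoid(logit), 3))
# -2.0 0.119
# 0.0 0.5
# 2.0 0.881
```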