# Classify Structured Data


This is a draft of a tutorial that shows how to classify structured data (e.g. tabular data that you might find in a CSV). We will use [Keras](https://www.tensorflow.org/guide/keras) to define our model, and [feature columns](https://www.tensorflow.org/guide/feature_columns) to describe how each column from the CSV should be represented. In the process we will:
* Load a CSV file using Pandas
* Explore the format of the dataset
* Build an input pipeline with tf.data
* Demonstrate how to use several different types of feature columns
* Build and train a model with Keras
* Evaluate our accuracy

## Overview

Using [census data](https://archive.ics.uci.edu/ml/datasets/Census+Income), which contains a person's age, education, marital status, and occupation (the *features*), we will try to predict whether the person earns more than 50,000 dollars a year (the *label*). We will train a neural network that, given an individual's information, outputs a number between 0 and 1. This can be interpreted as the probability that the individual has an annual income of over 50,000 dollars.

Key Point: As a developer, think about how this data is used and the potential benefits and harm a model's predictions can cause. A model built on a dataset like this one could reinforce societal biases and disparities. Is each feature relevant to the problem you want to solve or will it introduce bias? For more information, read about [ML fairness](https://developers.google.com/machine-learning/fairness-overview/).

In [0]:
!pip install tf-nightly-2.0-preview

## Import TensorFlow and other libraries

In [0]:
from __future__ import absolute_import, division, print_function

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow.feature_column as fc

from sklearn.model_selection import train_test_split

keras = tf.keras

## Download the Census dataset

We will use a version of this dataset that has been lightly preprocessed (for consistent formatting), to minimize the code in this notebook.

In [0]:
URL = 'https://storage.googleapis.com/applied-dl/uci_census_cleaned.csv'
data = keras.utils.get_file('uci_census_cleaned.csv', URL)

## Use Pandas to load and preprocess the data

[Pandas](https://pandas.pydata.org/) is a Python library with many helpful utilities for loading and working with structured data. We will use Pandas in this tutorial to load and preprocess the dataset before classifying it with TensorFlow.

In [0]:
# get_file returned the local path of the downloaded CSV in `data`
dataframe = pd.read_csv(data)
dataframe.head()

The last column in the above output (income_bracket) is the label we will predict. Notice it is represented as a string. We will use Pandas to convert it to a number (0.0 or 1.0).

In [0]:
dataframe['income_bracket'] = dataframe['income_bracket'].map(lambda x: x == '>50K')
dataframe['income_bracket'] = dataframe['income_bracket'].astype(float)
dataframe.head()

## Split the dataset into train, validation, and test

The dataset we downloaded was a single CSV file. We will split this into train, validation, and test sets.

In [0]:
train, test = train_test_split(dataframe, test_size=0.1)
train, val = train_test_split(train, test_size=0.1)
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')
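
Because the second split is taken from the 90% of rows left after the first, the three sets end up at roughly 81%, 9%, and 10% of the data. The following is a minimal sketch of that arithmetic. It assumes the held-out count is rounded up, and the function name `split_sizes` is hypothetical.

```python
import math

def split_sizes(n_rows, test_fraction=0.1, val_fraction=0.1):
    """Return (train, val, test) counts for two successive splits."""
    # First split: hold out the test set from the full dataset.
    n_test = math.ceil(n_rows * test_fraction)
    remaining = n_rows - n_test
    # Second split: hold out the validation set from what is left.
    n_val = math.ceil(remaining * val_fraction)
    n_train = remaining - n_val
    return n_train, n_val, n_test

sizes = split_sizes(1000)
print(sizes)  # (810, 90, 100)
```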

## Create an input pipeline using tf.data

Next, we will wrap the dataframes with [tf.data](https://www.tensorflow.org/guide/datasets). This enables us to use feature columns as a bridge that maps the columns in the Pandas dataframe to the features used to train our Keras model.

In [0]:
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
  dataframe = dataframe.copy()
  labels = dataframe.pop('income_bracket')
  ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.repeat().batch(batch_size)
  return ds
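
To make the shape of the data concrete, here is a rough NumPy analogue of a single pass through such a pipeline. It shuffles row indices and yields `(features, labels)` pairs, where `features` is a dictionary keyed by column name. The helper `iterate_batches` and the toy columns are hypothetical. This is not how tf.data is implemented, and unlike `df_to_dataset` it does not repeat.

```python
import numpy as np

def iterate_batches(columns, labels, batch_size=32, shuffle=True, seed=0):
    """Yield (feature_dict, label_batch) pairs for one pass over the data."""
    n_rows = len(labels)
    order = np.arange(n_rows)
    if shuffle:
        # Shuffle row indices so each batch is a random sample of rows.
        np.random.default_rng(seed).shuffle(order)
    for start in range(0, n_rows, batch_size):
        idx = order[start:start + batch_size]
        # Every column is sliced with the same indices to keep rows aligned.
        features = {name: values[idx] for name, values in columns.items()}
        yield features, labels[idx]

# Toy data standing in for the dataframe columns (illustrative values only).
toy_columns = {
    'age': np.array([39, 50, 38, 53, 28, 37, 49]),
    'education': np.array(['Bachelors', 'Bachelors', 'HS-grad', '11th',
                           'Bachelors', 'Masters', '9th']),
}
toy_labels = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0])

batches = list(iterate_batches(toy_columns, toy_labels, batch_size=5))
first_features, first_labels = batches[0]
print(sorted(first_features.keys()), first_labels.shape)
```

The last batch is smaller than `batch_size` when the row count is not an exact multiple of it.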

In [0]:
# We will use a small batch size at first in order to demo
# how this code works
batch_size = 5
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

Now that we have created our input pipeline, let's explore what it returns.

In [0]:
for feature_batch, label_batch in train_ds.take(1):
  print('All features:', list(feature_batch.keys()))
  print('A batch of ages:', feature_batch['age'])
  print('A batch of labels:', label_batch)

We can see that the dataset returns a dictionary of column names (from the dataframe) that map to column values. In a moment, we will use feature columns to represent these in different ways. First, let's retrieve a batch of data and keep it in memory. We will use this batch to demonstrate each type of feature column.

In [0]:
example_batch = list(train_ds.take(1))[0][0]

## Demonstrate each type of feature column
TensorFlow has several different types of feature columns you can use. Next, we will create one of each, and demonstrate how it is used to represent a batch of data.

In [0]:
def demo(feature_column):
  feature_layer = keras.layers.DenseFeatures(feature_column)
  print(feature_layer(example_batch).numpy())

### Numeric columns
A [numeric column](https://www.tensorflow.org/api_docs/python/tf/feature_column/numeric_column) is the simplest type of column. It's used to represent real-valued features. When using this column, your model will receive the column value unchanged.

In [0]:
age = tf.feature_column.numeric_column("age")
demo(age)

### Bucketized columns
Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. To do so, create a [bucketized column](https://www.tensorflow.org/api_docs/python/tf/feature_column/bucketized_column). For example, consider raw data that represents a person's age. Instead of representing age as a numeric column, we could split the age into several buckets.

In [0]:
age_buckets = tf.feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
demo(age_buckets)
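
The bucketing itself can be sketched in plain NumPy. The snippet below is an illustration of the idea, not TensorFlow's implementation: ten boundaries produce eleven buckets, and each age becomes a one-hot row marking the range it falls into. The helper name `bucketize_one_hot` is hypothetical.

```python
import numpy as np

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def bucketize_one_hot(values, boundaries):
    """One-hot encode each value by the numeric range it falls into."""
    # With right=False, bucket i holds boundaries[i-1] <= value < boundaries[i];
    # bucket 0 is everything below the first boundary.
    bucket_ids = np.digitize(values, boundaries, right=False)
    one_hot = np.zeros((len(values), len(boundaries) + 1), dtype=np.float32)
    one_hot[np.arange(len(values)), bucket_ids] = 1.0
    return bucket_ids, one_hot

ids, encoded = bucketize_one_hot(np.array([17, 18, 39, 65]), boundaries)
print(ids.tolist())   # [0, 1, 4, 10]
print(encoded.shape)  # (4, 11)
```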

### Categorical columns
In the census dataset, education is represented as a string (e.g. `Bachelors`). We cannot feed strings directly to a model. Instead, we must first map them to numeric or categorical values. Categorical vocabulary columns provide a way to represent strings as a one-hot vector. The vocabulary can be loaded from a list using [categorical_column_with_vocabulary_list](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_list), or from a file using [categorical_column_with_vocabulary_file](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_vocabulary_file).

In [0]:
education = tf.feature_column.categorical_column_with_vocabulary_list(
      'education', [
          'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
          'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
          '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])

education_one_hot = tf.feature_column.indicator_column(education)
demo(education_one_hot)
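
As a rough illustration of what a vocabulary lookup followed by one-hot encoding does, here is a NumPy sketch. The vocabulary is shortened for readability, the helper `one_hot_from_vocabulary` is hypothetical, and encoding unknown strings as all-zero rows is an assumption made for this sketch.

```python
import numpy as np

vocabulary = ['Bachelors', 'HS-grad', 'Masters', 'Doctorate']  # shortened list

def one_hot_from_vocabulary(strings, vocabulary):
    """Map each string to a one-hot row; unknown strings give an all-zero row."""
    index_of = {word: i for i, word in enumerate(vocabulary)}
    encoded = np.zeros((len(strings), len(vocabulary)), dtype=np.float32)
    for row, word in enumerate(strings):
        if word in index_of:
            encoded[row, index_of[word]] = 1.0
    return encoded

rows = one_hot_from_vocabulary(['Masters', 'Bachelors', 'Unknown'], vocabulary)
print(rows.tolist())
# [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
```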

### Embedding columns
Now, suppose instead of having just a few possible strings, we have a million (or more). For a number of reasons, as the number of categories grows large, it becomes infeasible to train a neural network using one-hot encodings. We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an [embedding column](https://www.tensorflow.org/api_docs/python/tf/feature_column/embedding_column) represents that data as a lower-dimensional, dense vector in which each cell can contain any number, not just 0 or 1. The size of the embedding (8, in the example below) is a parameter that must be tuned.

In [0]:
education_embedding = tf.feature_column.embedding_column(education, dimension=8)
demo(education_embedding)
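
Conceptually, an embedding is a lookup table with one row per category. The sketch below shows only the lookup, using a shortened vocabulary and random values in place of learned weights. The names `embedding_table` and `embed` are hypothetical, and no training is shown.

```python
import numpy as np

vocabulary = ['Bachelors', 'HS-grad', 'Masters', 'Doctorate']  # shortened list
embedding_dim = 8

rng = np.random.default_rng(0)
# One row per category; random values stand in for weights learned in training.
embedding_table = rng.normal(scale=0.1, size=(len(vocabulary), embedding_dim))

def embed(strings):
    """Look up the dense vector for each category string."""
    ids = np.array([vocabulary.index(s) for s in strings])
    return embedding_table[ids]

vectors = embed(['Masters', 'Bachelors', 'Masters'])
print(vectors.shape)  # (3, 8)
```

The same category always maps to the same vector, so the first and third rows above are identical.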

### Hashed feature columns

Another way to represent a categorical column with a large number of values is to use a [categorical_column_with_hash_bucket](https://www.tensorflow.org/api_docs/python/tf/feature_column/categorical_column_with_hash_bucket). This feature column enables you to specify the number of categories in advance (instead of providing a vocabulary file or list). It calculates a hash value of the input, then selects one of the `hash_bucket_size` buckets to encode a string.

An important downside of this technique is that there may be collisions, in which different strings are mapped to the same bucket. In practice, this can work well for some datasets regardless.

In [0]:
occupation = tf.feature_column.categorical_column_with_hash_bucket(
      'occupation', hash_bucket_size=1000)
demo(tf.feature_column.indicator_column(occupation))
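
The hashing idea, and why collisions happen, can be sketched with the standard library. SHA-256 is used here only as a stable stand-in, since TensorFlow uses its own hash function, so the bucket ids will not match what the feature column produces. The helper `hash_bucket` and the occupation strings are illustrative.

```python
import hashlib

def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket id in [0, hash_bucket_size)."""
    # SHA-256 is a stand-in for a stable hash, not TensorFlow's hash function.
    digest = hashlib.sha256(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

occupations = ['Tech-support', 'Sales', 'Exec-managerial', 'Craft-repair',
               'Adm-clerical']
ids_large = [hash_bucket(o, 1000) for o in occupations]
ids_small = [hash_bucket(o, 3) for o in occupations]
# With only 3 buckets and 5 strings, at least two strings must collide.
print(ids_large, ids_small)
```

A larger `hash_bucket_size` makes collisions less likely at the cost of a wider encoding.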

### Crossed feature columns
Combining features into a single feature, better known as a [feature cross](https://developers.google.com/machine-learning/glossary/#feature_cross), enables a model to learn separate weights for each combination of features. Here, we will create a new feature that is the cross of age and education. Because a feature cross can produce a very large number of combinations, the crossed column is hashed into `hash_bucket_size` buckets for efficiency.

In [0]:
crossed_feature = tf.feature_column.crossed_column([age_buckets, education], hash_bucket_size=1000)
demo(tf.feature_column.indicator_column(crossed_feature))
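
One way to picture a hashed cross is to join the age bucket and the education string into a single key, then hash that key into a fixed number of buckets. The sketch below does this with the standard library. The key format, the helpers `cross_key` and `cross_bucket`, and the use of SHA-256 are all assumptions for illustration and do not reproduce TensorFlow's internals.

```python
import bisect
import hashlib

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]

def cross_key(age, education):
    """Combine the age bucket and the education string into one key."""
    # bisect_right places an age equal to a boundary into the higher bucket.
    age_bucket = bisect.bisect_right(boundaries, age)
    return f'{age_bucket}_X_{education}'

def cross_bucket(age, education, hash_bucket_size=1000):
    """Hash the combined key into one of hash_bucket_size buckets."""
    # SHA-256 is a stand-in for a stable hash, not TensorFlow's hash function.
    digest = hashlib.sha256(cross_key(age, education).encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

print(cross_key(39, 'Bachelors'))  # 4_X_Bachelors
print(cross_bucket(39, 'Bachelors') == cross_bucket(36, 'Bachelors'))  # True
```

Ages 36 and 39 fall into the same age bucket, so with the same education they share one crossed feature.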

## Train a model
We have seen how to use many types of feature columns. Now we will use them to train a model. We have chosen the features used below somewhat arbitrarily (they have not been tuned to build an accurate model). If your aim is to build an accurate model, try a dataset of your own, and think carefully about which features are the most meaningful to include.

In [0]:
age = tf.feature_column.numeric_column('age')
education_num = tf.feature_column.numeric_column('education_num')
capital_gain = tf.feature_column.numeric_column('capital_gain')
capital_loss = tf.feature_column.numeric_column('capital_loss')
hours_per_week = tf.feature_column.numeric_column('hours_per_week')

In [0]:
education = tf.feature_column.categorical_column_with_vocabulary_list(
    'education', [
        'Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
        'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
        '5th-6th', '10th', '1st-4th', 'Preschool', '12th'])

marital_status = tf.feature_column.categorical_column_with_vocabulary_list(
    'marital_status', [
        'Married-civ-spouse', 'Divorced', 'Married-spouse-absent',
        'Never-married', 'Separated', 'Married-AF-spouse', 'Widowed'])

relationship = tf.feature_column.categorical_column_with_vocabulary_list(
    'relationship', [
        'Husband', 'Not-in-family', 'Wife', 'Own-child', 'Unmarried',
        'Other-relative'])

workclass = tf.feature_column.categorical_column_with_vocabulary_list(
    'workclass', [
        'Self-emp-not-inc', 'Private', 'State-gov', 'Federal-gov',
        'Local-gov', '?', 'Self-emp-inc', 'Without-pay', 'Never-worked'])

In [0]:
occupation = tf.feature_column.categorical_column_with_hash_bucket(
    'occupation', hash_bucket_size=1000)

In [0]:
age_buckets = tf.feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])

education_occupation = tf.feature_column.crossed_column(
    ['education', 'occupation'], hash_bucket_size=1000)

age_education_occupation = tf.feature_column.crossed_column(
    [age_buckets, 'education', 'occupation'], hash_bucket_size=1000)

### Create a feature layer
Now that we have defined our feature columns, we will use a [DenseFeatures](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/keras/layers/DenseFeatures) layer to input them to our Keras model.

In [0]:
all_columns = [
    age,
    education_num,
    capital_gain,
    capital_loss,
    hours_per_week,
    tf.feature_column.indicator_column(workclass),
    tf.feature_column.indicator_column(education),
    tf.feature_column.indicator_column(marital_status),
    tf.feature_column.indicator_column(relationship),
    tf.feature_column.embedding_column(education_occupation, dimension=8),
    tf.feature_column.embedding_column(age_education_occupation, dimension=8),
    tf.feature_column.embedding_column(occupation, dimension=8),
]

feature_layer = keras.layers.DenseFeatures(all_columns)

Earlier, we used a small batch size to demonstrate how the feature columns worked. We will now create a new input pipeline with a larger batch size for training.

In [0]:
batch_size = 256
train_ds = df_to_dataset(train, batch_size=batch_size)
val_ds = df_to_dataset(val, batch_size=batch_size)
test_ds = df_to_dataset(test, shuffle=False, batch_size=batch_size)

### Create, compile, and train the model

In [0]:
model = tf.keras.Sequential([
  feature_layer,
  tf.keras.layers.Dense(128, activation=tf.nn.relu),
  tf.keras.layers.Dense(128, activation=tf.nn.relu),
  tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
])
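
To see why the final sigmoid layer yields a value between 0 and 1 for each example, here is a NumPy sketch of the same forward pass with random, untrained weights. The input width `n_inputs` is a hypothetical stand-in for the width of the feature layer's output, and the helper names are illustrative.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(features, params):
    """Two ReLU hidden layers followed by a single sigmoid output unit."""
    h1 = relu(features @ params['w1'] + params['b1'])
    h2 = relu(h1 @ params['w2'] + params['b2'])
    return sigmoid(h2 @ params['w3'] + params['b3'])

rng = np.random.default_rng(0)
n_inputs = 20  # hypothetical width of the feature layer's output
params = {
    'w1': rng.normal(scale=0.1, size=(n_inputs, 128)), 'b1': np.zeros(128),
    'w2': rng.normal(scale=0.1, size=(128, 128)), 'b2': np.zeros(128),
    'w3': rng.normal(scale=0.1, size=(128, 1)), 'b3': np.zeros(1),
}

batch = rng.normal(size=(4, n_inputs))  # four made-up examples
probabilities = forward(batch, params)
print(probabilities.shape)  # (4, 1)
```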

In [0]:
model.compile(optimizer='adam',
              loss=tf.keras.losses.binary_crossentropy,
              metrics=['accuracy'])
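
For intuition about the loss and metric chosen here, the sketch below computes binary cross-entropy and accuracy by hand in NumPy. It assumes a 0.5 decision threshold for accuracy. The function names and example values are illustrative, and the clipping constant is a choice made for this sketch.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(np.mean(losses))

def accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    return float(np.mean((y_pred > threshold) == (y_true == 1.0)))

labels = np.array([1.0, 0.0, 1.0, 0.0])
probs = np.array([0.9, 0.2, 0.4, 0.1])

print(round(binary_crossentropy(labels, probs), 4))  # 0.3375
print(accuracy(labels, probs))                       # 0.75
```

The third example is predicted at 0.4 for a positive label, so it counts as an error for accuracy and contributes the largest share of the loss.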

In [0]:
model.fit(train_ds, 
          steps_per_epoch=len(train)//batch_size,
          validation_data=val_ds, 
          validation_steps=len(val)//batch_size,
          epochs=2)

In [0]:
loss, accuracy = model.evaluate(test_ds, steps=len(test) // batch_size)
print("Accuracy", accuracy)

## Next steps

The best way to learn more about classifying structured data is to try it yourself. We suggest finding another dataset to work with, and training a model to classify it, using code similar to the above. To improve accuracy, think carefully about which features to include in your model, and how they should be represented.