![UCI](http://mlr.cs.umass.edu/ml/assets/logo.gif)

# Loading in the Data
In this example, we will use crossed columns and embedding columns inside a TensorFlow model built with the `contrib.learn` library.

We start by loading a dataset that mixes categorical and numeric data. It is an old dataset that has appeared in many machine learning examples: census data from the 1990s. We will use it to predict whether a person earns more or less than 50k per year.

- https://archive.ics.uci.edu/ml/datasets/Census-Income+(KDD)

In [1]:
import pandas as pd

headers = ['age','workclass','fnlwgt','education','edu_num','marital_status',
           'occupation','relationship','race','sex','cap_gain','cap_loss','work_hrs_weekly','country','income']
df_train_orig = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',names=headers)
df_test_orig = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.test',names=headers)
df_test_orig = df_test_orig.iloc[1:]  # skip the first line of adult.test, which is not a data row
print(df_train_orig.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age                32561 non-null int64
workclass          32561 non-null object
fnlwgt             32561 non-null int64
education          32561 non-null object
edu_num            32561 non-null int64
marital_status     32561 non-null object
occupation         32561 non-null object
relationship       32561 non-null object
race               32561 non-null object
sex                32561 non-null object
cap_gain           32561 non-null int64
cap_loss           32561 non-null int64
work_hrs_weekly    32561 non-null int64
country            32561 non-null object
income             32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None


In [2]:
from copy import deepcopy
df_train = deepcopy(df_train_orig)
df_test = deepcopy(df_test_orig)

The data is organized as follows: 

|Variable | Description|
|----|--------|
|age | continuous|
|workclass | Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, ...|
|fnlwgt | continuous|
|education | Bachelors, Some-college, 11th, HS-grad, Prof-school, ...|
|education-num | continuous|
|marital-status | Married-civ-spouse, Divorced, Never-married, Separated, Widowed, ...|
|occupation | Tech-support, Craft-repair, Other-service, ...|
|relationship | Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried|
|race | White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black|
|sex | Female, Male|
|capital-gain | continuous|
|capital-loss | continuous|
|hours-per-week | continuous|
|native-country | United-States, Cambodia, England, ...|

In [3]:

import numpy as np

# let's just get rid of rows with any missing data
# note: reset_index() returns a new dataframe, so the calls below leave
# the original row labels in place (as the output of head() shows)
df_train.replace(to_replace=' ?',value=np.nan, inplace=True)
df_train.dropna(inplace=True)
df_train.reset_index()

df_test.replace(to_replace=' ?',value=np.nan, inplace=True)
df_test.dropna(inplace=True)
df_test.reset_index()

df_test.head()

| |age|workclass|fnlwgt|education|edu_num|marital_status|occupation|relationship|race|sex|cap_gain|cap_loss|work_hrs_weekly|country|income|
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
|1|25|Private|226802.0|11th|7.0|Never-married|Machine-op-inspct|Own-child|Black|Male|0.0|0.0|40.0|United-States|<=50K.|
|2|38|Private|89814.0|HS-grad|9.0|Married-civ-spouse|Farming-fishing|Husband|White|Male|0.0|0.0|50.0|United-States|<=50K.|
|3|28|Local-gov|336951.0|Assoc-acdm|12.0|Married-civ-spouse|Protective-serv|Husband|White|Male|0.0|0.0|40.0|United-States|>50K.|
|4|44|Private|160323.0|Some-college|10.0|Married-civ-spouse|Machine-op-inspct|Husband|Black|Male|7688.0|0.0|40.0|United-States|>50K.|
|6|34|Private|198693.0|10th|6.0|Never-married|Other-service|Not-in-family|White|Male|0.0|0.0|30.0|United-States|<=50K.|


## Processing
For preprocessing, we are going to fix a few issues in the dataset. 

- First, the test set uses "50K." instead of "50K" in its labels, so we strip the trailing period.
- Next, we will encode the categorical features as integers (later on we will one-hot encode them).
- Finally, we will standardize all of the continuous features (zero mean, unit variance).
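
To make the last two steps concrete, here is a minimal NumPy sketch of what integer encoding and standardization do. The toy arrays and variable names are assumptions for illustration only; the notebook itself uses `LabelEncoder` and `StandardScaler` from scikit-learn. The point to notice is that the vocabulary, mean, and standard deviation are learned on the training data and then reused on the test data.

```python
import numpy as np

# Hypothetical toy columns standing in for one categorical and one numeric feature.
train_cat = np.array(["Private", "Local-gov", "Private", "Self-emp-inc"])
test_cat = np.array(["Local-gov", "Private"])
train_num = np.array([25.0, 38.0, 28.0, 44.0])
test_num = np.array([34.0, 29.0])

# Integer encoding: learn the sorted vocabulary on the training data only.
classes = np.unique(train_cat)                  # sorted unique labels
train_int = np.searchsorted(classes, train_cat)
test_int = np.searchsorted(classes, test_cat)   # assumes no unseen labels in the test data

# Standardization: reuse the training mean and standard deviation on the test data.
mu, sigma = train_num.mean(), train_num.std()
train_scaled = (train_num - mu) / sigma
test_scaled = (test_num - mu) / sigma

print(classes, train_int, test_int)
print(train_scaled.round(3), test_scaled.round(3))
```

Only the training columns are guaranteed to end up with zero mean and unit variance; the test columns are shifted and scaled by the same training statistics.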

In [4]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

if df_test.income.dtype=='object':
    df_test.income.replace(to_replace=[' <=50K.',' >50K.'],value=['<=50K','>50K'],inplace=True)
    print(df_test.income.value_counts())

encoders = dict() 
categorical_headers = ['workclass','education','marital_status',
                       'occupation','relationship','race','sex','country']

for col in categorical_headers+['income']:
    df_train[col] = df_train[col].str.strip()
    df_test[col] = df_test[col].str.strip()
    
    if col=="income":
        tmp = LabelEncoder()
        df_train[col] = tmp.fit_transform(df_train[col])
        df_test[col] = tmp.transform(df_test[col])
    else:
        encoders[col] = LabelEncoder()
        df_train[col+'_int'] = encoders[col].fit_transform(df_train[col])
        df_test[col+'_int'] = encoders[col].transform(df_test[col])


numeric_headers = ["age", "edu_num", "cap_gain", "cap_loss","work_hrs_weekly"]

for col in numeric_headers:
    df_train[col] = df_train[col].astype(np.float64)  # np.float was removed from recent NumPy
    df_test[col] = df_test[col].astype(np.float64)
    
    ss = StandardScaler()
    df_train[col] = ss.fit_transform(df_train[col].values.reshape(-1, 1))
    df_test[col] = ss.transform(df_test[col].values.reshape(-1, 1))

<=50K    11360
>50K      3700
Name: income, dtype: int64


In [9]:
df_test['marital_status']

1             Never-married
2        Married-civ-spouse
3        Married-civ-spouse
4        Married-civ-spouse
6             Never-married
8        Married-civ-spouse
9             Never-married
10       Married-civ-spouse
11       Married-civ-spouse
12       Married-civ-spouse
13            Never-married
15       Married-civ-spouse
16       Married-civ-spouse
17            Never-married
18       Married-civ-spouse
19                  Widowed
21       Married-civ-spouse
22            Never-married
24            Never-married
25       Married-civ-spouse
26       Married-civ-spouse
27            Never-married
28                Separated
29       Married-civ-spouse
30            Never-married
31       Married-civ-spouse
32                  Widowed
33            Never-married
34       Married-civ-spouse
35                 Divorced
                ...        
16249              Divorced
16250         Never-married
16251    Married-civ-spouse
16253    Married-civ-spouse
16254         Never-

___

![tf](https://wiki.tum.de/download/attachments/25009442/tensor-flow_opengraph_h.png?version=1&modificationDate=1485888308193&api=v2)

# Starting Tensorflow
Now that we have processed the data, let's grab numpy matrices of the features we would like to predict with. In particular, we will convert everything to 32-bit floats, since `float32` is the default floating-point type for most of TensorFlow.

In [5]:
# let's start as simply as possible, without any feature preprocessing
categorical_headers_ints = [x+'_int' for x in categorical_headers]

# we will forego one-hot encoding right now and instead just scale all inputs
feature_columns = categorical_headers_ints+numeric_headers
X_train =  ss.fit_transform(df_train[feature_columns].values).astype(np.float32)
X_test =  ss.transform(df_test[feature_columns].values).astype(np.float32)

y_train = df_train['income'].values.astype(np.int64)  # np.int was removed from recent NumPy
y_test = df_test['income'].values.astype(np.int64)

print(feature_columns)

['workclass_int', 'education_int', 'marital_status_int', 'occupation_int', 'relationship_int', 'race_int', 'sex_int', 'country_int', 'age', 'edu_num', 'cap_gain', 'cap_loss', 'work_hrs_weekly']


Now let's import the TensorFlow contrib libraries. These libraries are the workhorses for TensorFlow's simplified interface. Note that this interface was previously called "SKFlow" and might one day be replaced by the simplified syntax of Keras. Keras will receive native TF support and be included in tf.contrib, but as of the creation of this notebook that is not the case. As such, we will use the syntax of the contrib library as of Rev 1.0 of TensorFlow.

- Please also note that Keras is being added to the core TensorFlow contribution library so that this command will work:
 - `from tensorflow.contrib import keras`
- As such, if you want to use the Keras API, it will soon be supported! Even so, it's unclear what syntax changes and additions will be made to the existing Keras tools, so we will forge ahead using the "learn" API.

When using the learn API, we will still use many of the native TensorFlow functions, so we need the following imports:
- `import tensorflow as tf` to get access to the entire API when needed
- `from tensorflow.contrib import learn` to get access to many of the wrappers and simplifications for using TensorFlow. This library really helps with creating models and using some out-of-the-box networks (like a shallow/deep MLP with fully connected layers)
- `from tensorflow.contrib import layers` to get access to some common neural network layer types and activations
- `from tensorflow.contrib.learn.python import SKCompat` to give us a very familiar interface, like the one used in sklearn

In [6]:
import tensorflow as tf
from tensorflow.contrib import learn
from tensorflow.contrib import layers
from tensorflow.contrib.learn.python import SKCompat
from tensorflow.contrib.learn.python.learn.estimators import model_fn as model_fn_lib
tf.logging.set_verbosity(tf.logging.WARN) # control the verbosity of tensor flow


![tflearn](http://www.kdnuggets.com/wp-content/uploads/skflow.jpg)

## An example similar to Sklearn
We will start with creating a model that is similar to what we have seen in scikit-learn. 

In [7]:
%%time
# we need to tell tensorflow how many inputs to expect and what the data types will be
# for this early example, everything is just numeric, real valued
features_tf = [layers.real_valued_column('', dimension=X_train.shape[1])]
clf = SKCompat(# wrap with SKCompat for easy usage like sklearn
            learn.DNNClassifier(hidden_units=[50], feature_columns=features_tf)
        )

clf.fit(X_train,y_train,steps=100)

Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
CPU times: user 4.81 s, sys: 107 ms, total: 4.91 s
Wall time: 4.89 s


In [8]:
from sklearn import metrics as mt

yhat = clf.predict(X_test)
# notice that the output needs some interpretation
# as it's not completely the same as sklearn
yhat = yhat['classes']
print(mt.confusion_matrix(y_test,yhat),
      mt.accuracy_score(y_test,yhat))

[[10627   733]
 [ 1991  1709]] 0.819123505976


In [9]:
%%time
# we can also customize the classifier somewhat:
clf = SKCompat(# wrap with SKCompat for easy usage like sklearn
            learn.DNNClassifier(hidden_units=[50], 
                                feature_columns=features_tf,
                                activation_fn=tf.nn.sigmoid 
                                # tf.tanh, tf.sigmoid, tf.nn.relu, tf.nn.softmax etc.
                                )
        )

clf.fit(X_train,y_train,steps=1000)

yhat = clf.predict(X_test)['classes']
print(mt.confusion_matrix(y_test,yhat),
      mt.accuracy_score(y_test,yhat))

Instructions for updating:
Please switch to tf.summary.scalar. Note that tf.summary.scalar uses the node name instead of the tag. This means that TensorFlow will automatically de-duplicate summary names based on the scope they are created in. Also, passing a tensor or list of tags to a scalar summary op is no longer supported.
[[10451   909]
 [ 1878  1822]] 0.814940239044
CPU times: user 6.65 s, sys: 443 ms, total: 7.1 s
Wall time: 6.41 s


___
## Adding more customization
This type of architecture is fine to use, but it suffers from a major limitation: if something is not implemented by the `learn` API, then we can't use that architecture or customize it easily. To get around this, we will instead use the `Estimator` class. To use this class, we need to define a "model function" that creates the neural network architecture. The output of this function needs to be very specific: a three-tuple of
- (1) predictions in a format that the end user will need to use
- (2) loss function implemented as tf graph 
- (3)  the optimization operation to perform on the graph. 

For this particular optimization, let's go beyond SGD and use Adagrad. There are many excellent explanations of different optimizers, for instance:
- http://sebastianruder.com/optimizing-gradient-descent/
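
As a rough illustration of why Adagrad makes the learning rate less critical, here is a minimal NumPy sketch of the update rule on a toy quadratic. The function name `adagrad_step`, the toy objective, and the `eps` constant are assumptions for illustration; this is not the TensorFlow implementation used in the cell that follows.

```python
import numpy as np

def adagrad_step(w, grad, cache, learning_rate=0.1, eps=1e-8):
    """One Adagrad update; returns the new weights and the updated cache."""
    cache = cache + grad ** 2                           # running sum of squared gradients
    w = w - learning_rate * grad / (np.sqrt(cache) + eps)
    return w, cache

# Toy problem (an assumption for illustration): minimize f(w) = sum(w ** 2).
w = np.array([5.0, -3.0])
cache = np.zeros_like(w)
for _ in range(500):
    grad = 2.0 * w                                      # gradient of sum(w ** 2)
    w, cache = adagrad_step(w, grad, cache)

print(w)
```

Each coordinate is divided by the root of its own accumulated squared gradients, so the very first step has size close to `learning_rate` for every weight regardless of the gradient's magnitude, and later steps shrink automatically.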

There is a lot of code here that might be new, but essentially we are starting to get down to the core functionality of tensorflow (even though still using the learn API).

In [10]:

# let's start by just using the tflearn library out of the box on the data we have
def my_model(features, targets, mode):
    # the prototype for this function is as follows
    # input:  (features, targets, mode)
    # output: ModelFnOps wrapping (predictions, loss, train_op)
    
    # =====SETUP ARCHITECTURE=====
    # we can use functions from learn to add layers and complexity to the model
    # pass features through one hidden layer with relu activation
    features = layers.relu(features, num_outputs=50) 
    # now pass the features through a fully connected layer
    features = layers.fully_connected(features, num_outputs=1) 
    # and pass them through a sigmoid activation
    output_layer = tf.sigmoid(features) 
    # reshape the output to be one dimensional
    predictions = tf.reshape(output_layer, [-1])
    
    # depending on the mode, we may not want to evaluate these
    loss_mse = None
    train_op = None
    
    # Calculate Loss (for both TRAIN and EVAL modes)
    if mode != learn.ModeKeys.INFER:
        # =====LOSS=======
        # we want to use MSE as our loss function, but could also choose 
        # cross entropy, or other objective functions here
        loss_mse = tf.losses.mean_squared_error(targets, predictions) 
    
    # Configure the Training Op (for TRAIN mode)
    if mode == learn.ModeKeys.TRAIN:
        # =====OPTIMIZER PARAMS========
        # now let's set up how we want things to optimize
        train_op = layers.optimize_loss(
            loss=loss_mse, 
            global_step=tf.contrib.framework.get_global_step(),
            optimizer='Adagrad', # adaptive gradient, so that the learning rate is not SO important
            learning_rate=0.1)
    
    # what format to have the output in when calling clf.predict?
    predictions_out = predictions>0.5
    return model_fn_lib.ModelFnOps(
      mode=mode, predictions=predictions_out, loss=loss_mse, train_op=train_op)


Now that we have defined the model, it's as simple as using it with a very familiar syntax:

In [11]:
%%time

from sklearn.metrics import confusion_matrix, accuracy_score

# now let's train the estimator
# we can wrap the tf estimator in the SKCompat class
# so that we can use similar syntax to SKLearn
clf = SKCompat(learn.Estimator(model_fn=my_model))
clf.fit(X_train, y_train, steps=5000, batch_size=32)

yhat = clf.predict(X_test)
print(confusion_matrix(y_test,yhat), 
      accuracy_score(y_test,yhat))

[[11359     1]
 [ 3700     0]] 0.754249667995
CPU times: user 8.86 s, sys: 1.94 s, total: 10.8 s
Wall time: 7.91 s

Note that this model assigns almost every test row to the `<=50K` class, so its accuracy of about 0.754 is essentially the majority-class rate (11360 of 15060 rows).


___

## Handling different feature types

You may have noticed that we were not handling the categorical features correctly. Really, we need to one-hot encode the categorical features. Unfortunately, this requires using a part of the API that breaks some conventions of SKLearn. In this example, we will use the integer categorical labels to get one-hot encoded examples.

The advantage of this method is that we can write our custom preprocessing steps and use them straight in the pipeline of the classifier. However, the syntax for this is verbose and fairly ugly. So, let's get started!

First, we need a process-input function that performs all the TensorFlow preparation of the data and returns a dictionary mapping the name of each feature column to its values. It should also return the target column as a `tf.constant`.
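
Before looking at the TensorFlow version, here is a small NumPy sketch of the same idea: integer labels become one-hot rows, numeric values become column vectors, and everything is collected in a dictionary keyed by column name. The helper `one_hot` and the toy values are assumptions for illustration only.

```python
import numpy as np

def one_hot(indices, depth):
    """Return a (len(indices), depth) float32 matrix with a single 1 per row."""
    return np.eye(depth, dtype=np.float32)[indices]

# Hypothetical integer-encoded column with 3 classes, plus one numeric column.
workclass_int = np.array([1, 0, 1, 2])
age = np.array([-0.9, 0.4, -0.6, 1.1])

feature_cols = {
    "workclass_int": one_hot(workclass_int, depth=3),
    "age": age.astype(np.float32).reshape(-1, 1),  # column vector, like tf.expand_dims(..., 1)
}

# Concatenating along axis 1 gives one dense row per example.
dense = np.concatenate([feature_cols["age"], feature_cols["workclass_int"]], axis=1)
print(dense.shape)
```

The one-hot depth must be the number of classes seen by the encoder, not the number of distinct values in the current batch, which is why the notebook reads it from `encoders[...]`.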

In [12]:
# Let's start with the TF example (manipulated to work with new syntax)
# https://www.tensorflow.org/tutorials/wide_and_deep
def process_input(df, label_header, categ_headers, numeric_headers):
    # input: whatever you need it to be
    # output: (dict of feature columns as tensors), (labels as tensors)
    
    # ========Process Inputs=========
    # Creates a dictionary mapping from each continuous feature column name (k) to
    # the values of that column stored in a constant Tensor.
    continuous_cols = {k: tf.expand_dims( # make it a column vector
                            tf.cast( # cast to a float32
                                tf.constant(df[k].values), 
                                tf.float32), 
                            1)
                       for k in numeric_headers}
    
    # Creates a dictionary mapping from each categorical feature column name (k)
    # to the values of that column stored as constant Tensors (numeric)
    # then use tensor flow to one hot encode them using the given number of classes 
    categorical_cols = {k: tf.one_hot(indices=tf.constant(df[k].values),
                                      depth=len(encoders[k[:-4]].classes_)) 
                        for k in categ_headers}
    
    # Merges the two dictionaries into one.
    feature_cols = dict(continuous_cols)
    feature_cols.update(categorical_cols)
    
    # Convert the label column into a constant Tensor.
    label = None
    if label_header is not None:
        label = tf.constant(df[label_header].values)
        
    return feature_cols, label


Now, when we want to use the classifier, we tell the model how it needs to parse the inputs. For each numeric feature, we simply convert the tensors into column vectors (done in `process_input` above). For the categorical features, TF has a convenient `one_hot` function that we can use. The remainder of the model stays the same.

In [13]:
# update the model to take input features as a dictionary
def my_model(dict_features, targets, mode):
    # the prototype for this function is as follows
    # input:  (dict_features, targets, mode)
    # output: ModelFnOps wrapping (predictions, loss, train_op)
    
    #=======DECODE FEATURES================
    # now let's combine the tensors from the input dictionary
    # into a list of the feature columns
    features = []
    for col in numeric_headers:
        features.append(dict_features[col])
    
    # also add in the one hot encoded features
    for col in categorical_headers_ints:
        features.append(dict_features[col])
    
    # now we can just combine all the features together
    features = tf.concat(values=features,axis=1)
    
    # =====SETUP ARCHITECTURE=====
    # we can use functions from learn to add layers and complexity to the model
    # pass features through one hidden layer with relu activation
    features = layers.relu(features, num_outputs=50) 
    # now pass the features through a fully connected layer
    features = layers.fully_connected(features, num_outputs=1) 
    # and pass them through a sigmoid activation
    output_layer = tf.sigmoid(features) 
    # reshape the output to be one dimensional
    predictions = tf.reshape(output_layer, [-1])
    
    # depending on the mode, we may not want to evaluate these
    loss_mse = None
    train_op = None
    
    # Calculate Loss (for both TRAIN and EVAL modes)
    if mode != learn.ModeKeys.INFER:
        # =====LOSS=======
        # we want to use MSE as our loss function
        loss_mse = tf.losses.mean_squared_error(targets, predictions) 
    
    if mode == learn.ModeKeys.TRAIN:
        # =====OPTIMIZER PARAMS========
        # now let's set up how we want things to optimize
        train_op = layers.optimize_loss(
            loss=loss_mse, 
            global_step=tf.contrib.framework.get_global_step(),
            optimizer='Adagrad', # adaptive gradient, so that the learning rate is not SO important 
            learning_rate=0.1)
    
    # what format to have the output in when calling clf.predict?
    predictions_out = predictions>0.5
    
    return model_fn_lib.ModelFnOps(
      mode=mode, predictions={'incomes':predictions_out}, loss=loss_mse, train_op=train_op)

Now to call the estimator with our custom pre-processing step we need to wrap the preprocessing inside another function. This is easy to do with a simple lambda, but you could also use another wrapper function that takes no inputs. 
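
The wrapping itself is plain Python and does not depend on TensorFlow. The sketch below shows three equivalent ways to build a callable that takes no arguments; `process_input_stub` and `rows` are hypothetical stand-ins for the real `process_input` and dataframe.

```python
from functools import partial

def process_input_stub(rows, label_key):
    """Hypothetical stand-in for process_input: returns (features, labels)."""
    features = {"n_rows": len(rows)}
    labels = None if label_key is None else [r[label_key] for r in rows]
    return features, labels

rows = [{"income": 0}, {"income": 1}]

# Three equivalent ways to get a callable that takes no arguments.
as_lambda = lambda: process_input_stub(rows, "income")
as_partial = partial(process_input_stub, rows, "income")

def as_function():
    return process_input_stub(rows, "income")

print(as_lambda() == as_partial() == as_function())
```

In all three cases the data preparation is deferred until the callable is invoked, which lets the estimator build the input tensors inside its own graph.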

In [14]:
%%time
clf = learn.Estimator(model_fn=my_model)

# when we provide the process function, they expect us to control the mini-batch
clf.fit(input_fn=lambda:process_input(df_train,'income',categorical_headers_ints, numeric_headers), 
        steps=500)

CPU times: user 2min 8s, sys: 10.2 s, total: 2min 18s
Wall time: 48 s


In [15]:
yhat = clf.predict(input_fn=lambda:process_input(df_test,None,categorical_headers_ints, numeric_headers))
# the output is now an iterable value, so we need to step over it
yhat = [x['incomes'] for x in yhat]
print(confusion_matrix(y_test,yhat),
      accuracy_score(y_test,yhat))

[[10154  1206]
 [ 1247  2453]] 0.837118193891


So the classifier is doing pretty well according to the confusion matrix! But we are still just using an MLP with one hidden layer. We really want to take advantage of the crossed columns and embeddings that are possible with TensorFlow. That takes yet another syntax change in how we create our TensorFlow object.
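
To read more out of the confusion matrix printed above than accuracy alone, the following sketch derives precision and recall for the `>50K` class, plus the accuracy of always predicting `<=50K`. It assumes the scikit-learn convention that rows are true classes and columns are predicted classes, with `<=50K` encoded as 0.

```python
import numpy as np

# Confusion matrix printed above: rows are true classes, columns are predictions.
cm = np.array([[10154, 1206],
               [1247, 2453]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)           # of the rows predicted >50K, how many really are
recall = tp / (tp + fn)              # of the true >50K rows, how many were found
base_rate = cm[0].sum() / cm.sum()   # accuracy of always predicting <=50K

print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(base_rate, 4))
```

Comparing `accuracy` against `base_rate` is a quick check that a model has learned more than the class imbalance.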


![Wide and deep model](https://www.tensorflow.org/images/wide_n_deep.svg)
____
# Adding Crossed Columns
For this example, we are going to combine our `learn` syntax with the syntax we have just gone over. 

In [16]:
# Now lets create a wide model 
# https://www.tensorflow.org/tutorials/wide_and_deep
def process_input(df, label_header, categ_headers, numeric_headers):
    # input: whatever you need it to be
    # output: (dict of feature columns as tensors), (labels as tensors)
    
    # ========Process Inputs=========
    # not much changes here, except we leave the numerics as plain tf.constant tensors
    continuous_cols = {k: tf.constant(df[k].values) for k in numeric_headers}
      
    # and we shift these tensors to be sparse one-hot encoded values
    # Creates a dictionary mapping from each categorical feature column name (k)
    # to the values of that column stored in a tf.SparseTensor.
    categorical_cols = {k: tf.SparseTensor(
                              indices=[[i, 0] for i in range(df[k].size)],
                              values=df[k].values,
                              dense_shape=[df[k].size, 1])
                        for k in categ_headers}
    
    # Merges the two dictionaries into one.
    feature_cols = dict(categorical_cols)
    feature_cols.update(continuous_cols)
    
    # Convert the label column into a constant Tensor.
    label = None
    if label_header is not None:
        label = tf.constant(df[label_header].values)
        
    return feature_cols, label


Let's start simple with just a linear classifier that takes the categorical features as input and makes one set of crossed columns. 
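
Conceptually, a crossed column treats each combination of values as one new categorical value and hashes it into a fixed number of buckets. The sketch below illustrates that idea in plain Python. It is only an illustration: `cross_bucket` is a hypothetical helper, and MD5 is a stand-in for whatever hash function TensorFlow uses internally.

```python
import hashlib

def cross_bucket(values, hash_bucket_size=10_000):
    """Map a tuple of categorical values to one bucket id in [0, hash_bucket_size)."""
    key = "_X_".join(values)                      # e.g. "Bachelors_X_Tech-support"
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size

a = cross_bucket(("Bachelors", "Tech-support"))
b = cross_bucket(("Bachelors", "Tech-support"))
c = cross_bucket(("HS-grad", "Tech-support"))
print(a == b, 0 <= c < 10_000)
```

The same combination always lands in the same bucket, so a linear model can learn one weight per bucket. Different combinations can collide, which is the price paid for not enumerating every possible pair.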

In [17]:
# build the feature columns for the wide (linear) model
def setup_wide_columns():
    # let's create the column structure that the learn API can expect
    
    wide_columns = []
    # add in each of the categorical columns
    for col in categorical_headers:
        wide_columns.append(layers.sparse_column_with_keys(col, keys=encoders[col].classes_))
        
    # also add in some specific crossed columns
    # each name must be a key of `encoders`; these pairs follow the wide-and-deep tutorial linked above
    cross_columns = [('education', 'occupation'), ('country', 'occupation')]
    for tup in cross_columns:
        wide_columns.append(
            layers.crossed_column(
                [layers.sparse_column_with_keys(tup[0], keys=encoders[tup[0]].classes_),
                 layers.sparse_column_with_keys(tup[1], keys=encoders[tup[1]].classes_)],
            hash_bucket_size=int(1e4))
        )
                        
    return wide_columns

In [18]:
%%time

# ignore all the deprecations that the learn API needs to deal with... ugh
tf.logging.set_verbosity(tf.logging.ERROR)

# setup 
wide_columns = setup_wide_columns()
input_wrapper = lambda:process_input(df_train,'income',categorical_headers, numeric_headers)
output_wrapper = lambda:process_input(df_test,None,categorical_headers, numeric_headers)

clf = learn.LinearClassifier(feature_columns=wide_columns)

# when we provide the process function, they expect us to control the mini-batch
clf.fit(input_fn=input_wrapper, steps=300)

yhat = clf.predict(input_fn=output_wrapper)
# the output is now an iterable value, so we need to step over it
yhat = [x for x in yhat]
print(confusion_matrix(y_test,yhat),accuracy_score(y_test,yhat))

[[10474   886]
 [ 1753  1947]] 0.824767596282
CPU times: user 1min 17s, sys: 4.72 s, total: 1min 22s
Wall time: 50 s


Wow! That is just using crossed columns and a one-layer linear classifier! So memorization works fairly well here.

Now let's try a deeper architecture with dense embeddings for the categorical features, as described in lecture.

___

## Using Dense embeddings in a deeper network
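
An embedding column can be pictured as a trainable lookup table with one row per category. The NumPy sketch below shows the lookup and its equivalence to multiplying a one-hot matrix by the table. The sizes and random values are assumptions for illustration; in the real model the table entries are learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

n_classes, dimension = 5, 8   # hypothetical vocabulary size and embedding size
embedding_table = rng.normal(scale=0.1, size=(n_classes, dimension)).astype(np.float32)

# Integer ids for a mini-batch of four rows of one categorical column.
ids = np.array([0, 3, 3, 1])

# An embedding lookup is just row indexing into the table.
dense = embedding_table[ids]

# It is equivalent to multiplying the one-hot encoding by the table.
one_hot = np.eye(n_classes, dtype=np.float32)[ids]
print(dense.shape, np.allclose(dense, one_hot @ embedding_table))
```

Compared with one-hot inputs, the network sees `dimension` dense values per category instead of `n_classes` sparse ones, which matters most for high-cardinality columns.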

In [19]:
# build the feature columns for the deep (DNN) model
def setup_deep_columns():
    # now make up the deep columns
    
    deep_columns = []
    # add an embedding column for each of the categorical columns
    for col in categorical_headers:
        
        tmp = layers.sparse_column_with_keys(col, keys=encoders[col].classes_)
        
        deep_columns.append(
            layers.embedding_column(tmp, dimension=8)
        )
        
        
    # and add in the regular dense features 
    for col in numeric_headers:
        deep_columns.append(
            layers.real_valued_column(col)
        )
                    
    return deep_columns

In [20]:
%%time 

# setup deep columns
deep_columns = setup_deep_columns()
clf = learn.DNNClassifier(feature_columns=deep_columns, hidden_units=[100, 50])

clf.fit(input_fn=input_wrapper, steps=300)

yhat = clf.predict(input_fn=output_wrapper)
# the output is now an iterable value, so we need to step over it
yhat = [x for x in yhat]
print(confusion_matrix(y_test,yhat),
      accuracy_score(y_test,yhat))

[[10513   847]
 [ 1440  2260]] 0.848140770252
CPU times: user 3min 23s, sys: 35.6 s, total: 3min 58s
Wall time: 1min 42s


So the DNN with feature embeddings is also fairly capable (best thus far). For the final example, let's combine the classifiers together like we talked about in lecture.
___

## Combining Crossed Linear Classifier and Deep Embeddings
Now it's just a matter of setting up the wide and deep columns for TensorFlow, after which we can use the combined classifier, which is already implemented: `learn.DNNLinearCombinedClassifier`.
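
As a rough picture of what the combined model computes, the sketch below adds the logit of a linear (wide) part to the logit of a small network (deep part) before applying a single sigmoid, which is how the wide-and-deep tutorial linked in this notebook describes the architecture. All sizes and weights are hypothetical, and biases are left out for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_rows, n_wide, n_deep, n_hidden = 4, 6, 5, 3   # hypothetical sizes

x_wide = rng.integers(0, 2, size=(n_rows, n_wide)).astype(float)  # sparse/crossed indicators
x_deep = rng.normal(size=(n_rows, n_deep))                        # embeddings + numeric inputs

# Wide part: a single linear layer straight to one logit.
w_wide = rng.normal(size=(n_wide, 1))
wide_logit = x_wide @ w_wide

# Deep part: one ReLU hidden layer, then a linear layer to one logit.
w1 = rng.normal(size=(n_deep, n_hidden))
w2 = rng.normal(size=(n_hidden, 1))
deep_logit = np.maximum(x_deep @ w1, 0.0) @ w2

# The two logits are summed before a single sigmoid, so both parts train jointly.
prob = sigmoid(wide_logit + deep_logit).ravel()
print(prob.shape)
```

Because the loss is computed on the summed logit, the wide part only needs to correct what the deep part gets wrong, and vice versa.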

In [21]:
# build the wide (linear) and deep (DNN) feature columns together
def setup_wide_deep_columns():
    # input:  none (uses categorical_headers, numeric_headers, and encoders)
    # output: (wide_columns, deep_columns)
    
    
    wide_columns = []
    deep_columns = []
    # add in each of the categorical columns to both wide and deep features
    for col in categorical_headers:
        wide_columns.append(
            layers.sparse_column_with_keys(col, keys=encoders[col].classes_)
        )
        
        dim = int(round(np.log2(len(encoders[col].classes_))))  # embedding size ~ log2(number of classes)
        deep_columns.append(
            layers.embedding_column(wide_columns[-1], dimension=dim)
        )
        
    # also add in some specific crossed columns
    # each name must be a key of `encoders`; these pairs follow the wide-and-deep tutorial linked above
    cross_columns = [('education', 'occupation'), ('country', 'occupation')]
    for tup in cross_columns:
        feature_columns = []
        for element in tup:
            feature_columns.append(layers.sparse_column_with_keys(element, keys=encoders[element].classes_))
        wide_columns.append(
            layers.crossed_column(feature_columns, hash_bucket_size=int(1e4)))
        
        
    # and add in the regular dense features 
    for col in numeric_headers:
        deep_columns.append(
            layers.real_valued_column(col)
        )
                    
    return wide_columns, deep_columns

In [22]:
%%time

wide_columns, deep_columns = setup_wide_deep_columns()
clf = learn.DNNLinearCombinedClassifier(
                        linear_feature_columns=wide_columns,
                        dnn_feature_columns=deep_columns,
                        dnn_hidden_units=[100, 50])


clf.fit(input_fn=input_wrapper, steps=2500)

yhat = clf.predict(input_fn=output_wrapper)
# the output is now an iterable value, so we need to step over it
yhat = [x for x in yhat]
print(confusion_matrix(y_test,yhat),accuracy_score(y_test,yhat))

[[10512   848]
 [ 1391  2309]] 0.851328021248
CPU times: user 12min 54s, sys: 2min 15s, total: 15min 9s
Wall time: 5min 21s


So we have excellent performance by manipulating the wide and deep architectures in the census data! Excellent!!

Wide and deep models can have really interesting and useful properties so they are great to keep in mind when selecting an architecture. Some of the hyperparameters that are specific to this are:
- which features to cross together. Typically you only want to cross columns that you think are important to connect, because their combination might carry information that neither column has alone.
- the size of the dense feature embeddings. This can be difficult to set, but one common setting is $\log_2(N)$, where $N$ is the total number of unique values.
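
As a quick illustration of the $\log_2(N)$ rule of thumb, the sketch below computes an embedding size for a few columns. The cardinalities in the dictionary are hypothetical numbers, not values read from the dataset.

```python
import math

# Hypothetical numbers of unique values per categorical column.
cardinalities = {"sex": 2, "race": 5, "education": 16, "country": 41}

embedding_dims = {
    col: max(1, int(round(math.log2(n))))   # keep at least one dimension
    for col, n in cardinalities.items()
}
print(embedding_dims)
```

The rule grows slowly: going from 16 to 41 categories adds only one embedding dimension, so it is best treated as a starting point for tuning.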

Also, here are some other references that use many of the same steps as I do:
- using custom models via `learn.Estimator`: https://www.tensorflow.org/extend/estimators
- many of the same things we went over: https://www.tensorflow.org/tutorials/wide_and_deep 
 - and the Github: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/ops/sparse_feature_cross_op.py 
- optimizers: http://sebastianruder.com/optimizing-gradient-descent/



___

# Bonus: Working with monitors
Much of this example is taken from: 

- https://www.tensorflow.org/get_started/monitors

On my machine, this logging makes the training extremely slow. I haven't taken the time to really track down why that is. However, the outputs are extremely useful. For instance, you can look at the graph structure and loss over time:

![tb](data/tensorboard.png)

![tb](data/ts_loss.png)


In [23]:
# tf.logging.set_verbosity(tf.logging.ERROR)

# input_wrapper = lambda:process_input(df_train,'income',categorical_headers, numeric_headers)
# output_wrapper = lambda:process_input(df_test,'income',categorical_headers, numeric_headers)

# # for tensorboard this means you can use the following:
# # !python -m tensorflow.tensorboard --logdir=large_data/tmp/

# wide_columns, deep_columns = setup_wide_deep_columns()
# clf = learn.DNNLinearCombinedClassifier(
#                         linear_feature_columns=wide_columns,
#                         dnn_feature_columns=deep_columns,
#                         dnn_hidden_units=[100, 50], 
#                         model_dir="large_data/tmp/")

# validation_metrics = {
#     "accuracy":
#         learn.MetricSpec(
#             metric_fn=tf.contrib.metrics.streaming_accuracy,
#             prediction_key=learn.PredictionKey.
#             CLASSES),
#     "precision":
#         learn.MetricSpec(
#             metric_fn=tf.contrib.metrics.streaming_precision,
#             prediction_key=learn.PredictionKey.
#             CLASSES),
#     "recall":
#         learn.MetricSpec(
#             metric_fn=tf.contrib.metrics.streaming_recall,
#             prediction_key=learn.PredictionKey.
#             CLASSES)
# }

# validation_monitor = learn.monitors.ValidationMonitor(
#     input_fn= lambda:process_input(df_test,'income',categorical_headers, numeric_headers),
#     every_n_steps=50,
#     metrics=validation_metrics)

# clf.fit(input_fn=input_wrapper, 
#         steps=500, 
#         monitors=[validation_monitor])

____
____

![keras](https://blog.keras.io/img/keras-tensorflow-logo.jpg)


# [Optional, Bonus] Using Keras In place of TF Contrib
This section is here because we have no idea how TensorFlow and its contrib library will change once Keras is added to the core TensorFlow library.

This example does not use the wide and deep embeddings because Keras does not currently have an interface for creating them (although you can likely find a way to incorporate the crossed columns and embeddings via the TensorFlow backend). However, if the data fits in memory, then the manipulations can be done in sklearn alone.

In [24]:
# but we were dealing with the data incorrectly because we didn't one hot encode the 
#   categorical features
from sklearn.preprocessing import OneHotEncoder

# now let's one hot encode the integer-coded categorical features
ohe = OneHotEncoder()
X_train_ohe = ohe.fit_transform(df_train[categorical_headers_ints].values)
X_test_ohe = ohe.transform(df_test[categorical_headers_ints].values)

# the ohe instance will help us to organize our encoded matrix
print(ohe.feature_indices_)
print(X_train_ohe.shape)
print(type(X_train_ohe))

[ 0  7 23 30 44 50 55 57 98]
(30162, 98)
<class 'scipy.sparse.csr.csr_matrix'>
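
To read the `feature_indices_` output above: consecutive entries mark where each original categorical feature starts and stops among the 98 one hot columns. Here is a small sketch that uses only the printed boundaries. The helper name `owning_feature` is hypothetical, and newer scikit-learn releases expose `categories_` rather than `feature_indices_`, so treat this as an illustration of the layout rather than of the current API.

```python
from bisect import bisect_right

# boundaries copied from the printed ohe.feature_indices_
feature_indices = [0, 7, 23, 30, 44, 50, 55, 57, 98]

# number of one hot columns produced by each categorical feature
widths = [stop - start
          for start, stop in zip(feature_indices[:-1], feature_indices[1:])]
print(widths)  # [7, 16, 7, 14, 6, 5, 2, 41]


def owning_feature(col):
    """Return the position (0-7) of the categorical feature that owns column `col`."""
    return bisect_right(feature_indices, col) - 1


print(owning_feature(22), owning_feature(23))  # 1 2
```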


In [25]:
from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers import Embedding, Flatten, Merge

Using TensorFlow backend.


In [26]:
%%time

# combine the one hot encoded and numeric features into a single dense matrix
X_train = np.hstack((X_train_ohe.todense(),df_train[numeric_headers].values))
X_test = np.hstack((X_test_ohe.todense(),df_test[numeric_headers].values))

model = Sequential() # this defines a generic empty model
# let's first define the inputs
model.add(Dense(input_dim=X_train.shape[1], units=10, activation='relu')) 
model.add(Dense(1,activation='sigmoid'))

model.compile(optimizer='sgd',
              loss='mean_squared_error',
              metrics=['accuracy'])

model.fit(X_train,y_train, epochs=10, batch_size=50, verbose=0)

CPU times: user 8.61 s, sys: 3.05 s, total: 11.7 s
Wall time: 7.86 s


In [27]:
from sklearn import metrics as mt
yhat = np.round(model.predict(X_test))
print(mt.confusion_matrix(y_test,yhat),mt.accuracy_score(y_test,yhat))

[[10510   850]
 [ 1587  2113]] 0.83818061089


# Keras with Embeddings
Now let's add some embeddings in Keras. Right now Keras lacks support for "sparse" inputs, so it can't yet take advantage of the sparse representation in the TensorFlow backend (it eventually will). As such, we need to perform our own one hot encoding and then store the matrix as "dense." This wastes memory, but is okay for smaller problems like this one (we only have about 30,000 training examples).
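
To put a rough number on the wasted memory, the sketch below compares a dense one hot matrix to a CSR sparse one for the shape printed earlier, (30162, 98), where the 8 categorical features give 8 nonzeros per row. The 8-byte values and 4-byte indices are assumptions about the default dtypes, not measurements.

```python
n_rows, n_cols = 30162, 98   # shape of the one hot encoded training matrix
nonzeros_per_row = 8         # one active column per categorical feature

# assumed storage sizes: float64 values, int32 column indices and row pointers
value_bytes, index_bytes = 8, 4

dense_bytes = n_rows * n_cols * value_bytes

nnz = n_rows * nonzeros_per_row
# CSR stores one value and one column index per nonzero,
# plus n_rows + 1 row pointers
csr_bytes = nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes

print("dense: %.1f MB" % (dense_bytes / 1e6))
print("CSR:   %.1f MB" % (csr_bytes / 1e6))
print("ratio: %.1fx" % (dense_bytes / csr_bytes))
```

Under these assumptions the dense matrix is roughly 24 MB against roughly 3 MB for CSR, which is tolerable here but grows quickly with more rows or more categories.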

In [31]:
# we need to create separate sequential models for each embedding
embed_branches = []
X_ohe_train = []
X_ohe_test = []
for col in categorical_headers_ints:
    # one hot encode (sparse output), then convert to a dense matrix
    ohe = OneHotEncoder()
    X_ohe_train.append( ohe.fit_transform(df_train[col].values.reshape((-1,1))).todense() )
    X_ohe_test.append(  ohe.transform(df_test[col].values.reshape((-1,1))).todense() )
    
    # get the number of categories
    N = int(X_ohe_train[-1].shape[1]) # number of distinct values in df_train[col], i.e. max + 1 for 0-based codes
    
    # create embedding branch from the number of categories
    embed_branches.append(Sequential())
    embed_branches[-1].add( Embedding(N*2, int(np.sqrt(N)), input_length=N) )
    embed_branches[-1].add( Flatten() )

# also get a dense branch of the numeric features
numeric_branch = Sequential()
numeric_branch.add( Dense(input_dim=df_train[numeric_headers].values.shape[1], units=20, activation='relu') ) 
numeric_branch.add( Dense(units=10,activation='relu') )

# merge the branches together
final_branch = Sequential()
final_branch.add( Merge(embed_branches+[numeric_branch], mode='concat') )
final_branch.add( Dense(units=1,activation='sigmoid') )

final_branch.compile(optimizer='adagrad',
              loss='mean_squared_error',
              metrics=['accuracy'])

final_branch.fit(X_ohe_train + [df_train[numeric_headers].values],
        y_train, epochs=10, batch_size=32, verbose=1)



Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x128254e48>

In [32]:
yhat = np.round(final_branch.predict(X_ohe_test+[df_test[numeric_headers].values]))
print(mt.confusion_matrix(y_test,yhat),mt.accuracy_score(y_test,yhat))

[[10635   725]
 [ 1506  2194]] 0.851859229748



# Keras crossed columns
Unfortunately, this is where Keras does not have good support: there is no automatic way to create the crossed embeddings. Building them from a non-sparse representation would hurt memory footprint and run time to the extent that it's not worth doing. Keras will eventually get support for sparse tensors and will probably implement a crossed column based on TensorFlow, but that is not yet the case.

If you really want to implement it in Keras, then look into sklearn's `PolynomialFeatures` preprocessing step to create the dense crossings.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

You will want to set the `interaction_only` parameter so that you get only the crossed columns.
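
As a minimal sketch of what those dense crossings look like, the pure-Python helper below (a hypothetical `pairwise_interactions`, not part of sklearn) builds the pairwise products that a degree-2, interaction-only expansion adds. On one hot inputs, a product is 1 only when both categories are active, which is the dense analogue of a crossed column.

```python
from itertools import combinations


def pairwise_interactions(row):
    """Return the original features followed by all pairwise products.

    This mirrors a degree-2, interaction-only polynomial expansion without
    the bias column; no squared terms are produced.
    """
    crosses = [a * b for a, b in combinations(row, 2)]
    return list(row) + crosses


# toy one hot row: columns 0 and 2 are active
row = [1, 0, 1, 0]
expanded = pairwise_interactions(row)
print(expanded)  # [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
```

The number of added columns grows as $n(n-1)/2$, so crossing all 98 one hot columns from this dataset would add 4,753 dense columns, most of them always zero. That is the memory cost described above.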