<a href="https://colab.research.google.com/github/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_04_1_feature_encode.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# T81-558: Applications of Deep Neural Networks
**Module 4: Training for Tabular Data**
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), McKelvey School of Engineering, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

# Module 4 Material

* **Part 4.1: Encoding a Feature Vector for Keras Deep Learning** [[Video]](https://www.youtube.com/watch?v=Vxz-gfs9nMQ&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_04_1_feature_encode.ipynb)
* Part 4.2: Keras Multiclass Classification for Deep Neural Networks with ROC and AUC [[Video]](https://www.youtube.com/watch?v=-f3bg9dLMks&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_04_2_multi_class.ipynb)
* Part 4.3: Keras Regression for Deep Neural Networks with RMSE [[Video]](https://www.youtube.com/watch?v=wNhBUC6X5-E&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_04_3_regression.ipynb)
* Part 4.4: Backpropagation, Nesterov Momentum, and ADAM Neural Network Training [[Video]](https://www.youtube.com/watch?v=VbDg8aBgpck&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_04_4_backprop.ipynb)
* Part 4.5: Neural Network RMSE and Log Loss Error Calculation from Scratch [[Video]](https://www.youtube.com/watch?v=wmQX1t2PHJc&list=PLjy4p-07OYzulelvJ5KVaT2pDlxivl_BN) [[Notebook]](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class_04_5_rmse_logloss.ipynb)

# Google CoLab Instructions

The following code ensures that Google CoLab is running the correct version of TensorFlow.

In [1]:
try:
    %tensorflow_version 2.x
    COLAB = True
    print("Note: using Google CoLab")
except:
    print("Note: not using Google CoLab")
    COLAB = False

Note: not using Google CoLab


# Part 4.1: Encoding a Feature Vector for Keras Deep Learning

Neural networks can accept many types of data.  We will begin with tabular data, where there are well-defined rows and columns.  This is the sort of data you would typically see in Microsoft Excel.  An example of tabular data is shown below.

Neural networks require numeric input.  This numeric form is called a feature vector.  Each row of training data typically becomes one vector.  The individual input neurons each receive one feature (or column) from this vector.  In this section, we will see how to encode the following tabular data into a feature vector.

In [2]:
import pandas as pd

# Treat 'NA' and '?' as missing values when loading the CSV.
df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?'])

# Limit how much of the data frame is displayed.
pd.set_option('display.max_columns', 9)
pd.set_option('display.max_rows', 5)

display(df)

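| | id | job | area | income | ... | pop_dense | retail_dense | crime | product |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | vv | c | 50876.0 | ... | 0.885827 | 0.492126 | 0.071100 | b |
| 1 | 2 | kd | c | 60369.0 | ... | 0.874016 | 0.342520 | 0.400809 | c |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1998 | 1999 | qp | c | 67949.0 | ... | 0.909449 | 0.598425 | 0.117803 | c |
| 1999 | 2000 | pe | c | 61467.0 | ... | 0.925197 | 0.539370 | 0.451973 | c |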


The following observations can be made from the above data:
* The target column is the column that you seek to predict.  There are several candidates here; we will initially use **product**, which specifies what product someone bought.
* There is an **id** column.  This column should not be fed into the neural network because it contains no information useful for prediction.
* Many of these fields are numeric and might not require any further processing.
* The **income** column has some missing values.
* There are categorical values: **job**, **area**, and **product**.
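
These observations can also be checked in code.  The following is a minimal sketch on a small, hypothetical frame (the rows and the name `sample` are made up for illustration and are not the course dataset); it counts missing values per column and lists the non-numeric columns that will need encoding.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset; values are made up for illustration.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "job": ["vv", "kd", "pe", "vv"],
    "area": ["c", "c", "d", "a"],
    "income": [50876.0, np.nan, 28347.0, 61467.0],
    "product": ["b", "c", "a", "c"],
})

# Count missing values per column.
missing_counts = sample.isnull().sum()
print(missing_counts)

# Non-numeric columns are the candidates for categorical encoding.
categorical_columns = list(sample.select_dtypes(exclude="number").columns)
print(categorical_columns)
```

The same two calls should work on the real `df`, assuming it has been loaded as shown earlier.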

To begin with, we will convert the job code into dummy variables.

In [3]:
# Create one dummy column per distinct job code.
dummies = pd.get_dummies(df['job'],prefix="job")
print(dummies.shape)

pd.set_option('display.max_columns', 9)
pd.set_option('display.max_rows', 10)

display(dummies)

(2000, 33)


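| | job_11 | job_al | job_am | job_ax | ... | job_rn | job_sa | job_vv | job_zz |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 |
| 1996 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 |
| 1997 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 |
| 1998 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 |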


Because there are 33 different job codes, there are 33 dummy variables.  We also specified a prefix, because the job codes (such as "ax") are not that meaningful by themselves.  A name such as "job_ax" also tells us the origin of the field.
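
To make the one-column-per-code behavior concrete, here is a small sketch on made-up job codes (the series `jobs` is hypothetical).  It passes `dtype=int` explicitly because recent pandas versions return boolean columns when `dtype` is omitted, whereas the output above shows 0/1 values.

```python
import pandas as pd

# Hypothetical job codes, not the real dataset.
jobs = pd.Series(["vv", "kd", "vv", "ax"], name="job")

# One column per distinct code; dtype=int requests 0/1 integers
# (recent pandas versions return booleans when dtype is omitted).
job_dummies = pd.get_dummies(jobs, prefix="job", dtype=int)
print(job_dummies)

# Each row contains exactly one 1, marking that row's job code.
row_sums = job_dummies.sum(axis=1).tolist()
print(row_sums)
```

Three distinct codes produce three columns, just as 33 codes produce 33 columns in the full data.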

Next, we must merge these dummies back into the main data frame.  We also drop the original "job" field, as it is now represented by the dummies. 

In [4]:
# Append the dummy columns, then drop the original categorical column.
df = pd.concat([df,dummies],axis=1)
df.drop('job', axis=1, inplace=True)

pd.set_option('display.max_columns', 9)
pd.set_option('display.max_rows', 10)

display(df)

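| | id | area | income | aspect | ... | job_rn | job_sa | job_vv | job_zz |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | c | 50876.0 | 13.100000 | ... | 0 | 0 | 1 | 0 |
| 1 | 2 | c | 60369.0 | 18.625000 | ... | 0 | 0 | 0 | 0 |
| 2 | 3 | c | 55126.0 | 34.766667 | ... | 0 | 0 | 0 | 0 |
| 3 | 4 | c | 51690.0 | 15.808333 | ... | 0 | 0 | 0 | 0 |
| 4 | 5 | d | 28347.0 | 40.941667 | ... | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 1996 | c | 51017.0 | 38.233333 | ... | 0 | 0 | 1 | 0 |
| 1996 | 1997 | d | 26576.0 | 33.358333 | ... | 0 | 0 | 0 | 0 |
| 1997 | 1998 | d | 28595.0 | 39.425000 | ... | 0 | 0 | 0 | 0 |
| 1998 | 1999 | c | 67949.0 | 5.733333 | ... | 0 | 0 | 0 | 0 |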


We also introduce dummy variables for the area column.

In [5]:
# Encode the area column the same way: append the dummies, drop the original.
df = pd.concat([df,pd.get_dummies(df['area'],prefix="area")],axis=1)
df.drop('area', axis=1, inplace=True)

pd.set_option('display.max_columns', 9)
pd.set_option('display.max_rows', 10)
display(df)

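| | id | income | aspect | subscriptions | ... | area_a | area_b | area_c | area_d |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 50876.0 | 13.100000 | 1 | ... | 0 | 0 | 1 | 0 |
| 1 | 2 | 60369.0 | 18.625000 | 2 | ... | 0 | 0 | 1 | 0 |
| 2 | 3 | 55126.0 | 34.766667 | 1 | ... | 0 | 0 | 1 | 0 |
| 3 | 4 | 51690.0 | 15.808333 | 1 | ... | 0 | 0 | 1 | 0 |
| 4 | 5 | 28347.0 | 40.941667 | 3 | ... | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 1996 | 51017.0 | 38.233333 | 1 | ... | 0 | 0 | 1 | 0 |
| 1996 | 1997 | 26576.0 | 33.358333 | 2 | ... | 0 | 0 | 0 | 1 |
| 1997 | 1998 | 28595.0 | 39.425000 | 3 | ... | 0 | 0 | 0 | 1 |
| 1998 | 1999 | 67949.0 | 5.733333 | 0 | ... | 0 | 0 | 1 | 0 |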
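
The concat-and-drop steps used for job and area can also be collapsed into a single call, since `pd.get_dummies` accepts a whole data frame together with a `columns` list.  The sketch below is an alternative shown on hypothetical rows (the frame `sample` is made up), not the approach used in the cells above.

```python
import pandas as pd

# Hypothetical rows for illustration only.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "job": ["vv", "kd", "vv"],
    "area": ["c", "d", "c"],
    "income": [50876.0, 60369.0, 28347.0],
})

# Encode both categorical columns in one call; the originals are dropped
# and each new column is prefixed with its source column name.
encoded = pd.get_dummies(sample, columns=["job", "area"], dtype=int)
print(list(encoded.columns))
```

The step-by-step version makes each transformation visible, while the single call is shorter once the pattern is familiar.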


The last remaining transformation is to fill in missing income values. 

In [6]:
# Replace missing income values with the median of the observed incomes.
med = df['income'].median()
df['income'] = df['income'].fillna(med)

There are more advanced ways of filling in missing values, but they require more analysis.  The idea is to see whether another field might give a hint as to what the missing income was.  For example, it might be beneficial to calculate a median income for each of the areas or job categories.  This is something to keep in mind for the class Kaggle competition.

At this point, the Pandas dataframe is ready to be converted to NumPy for neural network training.  We need a list of the columns that will make up *x* (the predictors or inputs) and *y* (the target).

The complete list of columns is:

In [7]:
print(list(df.columns))

['id', 'income', 'aspect', 'subscriptions', 'dist_healthy', 'save_rate', 'dist_unhealthy', 'age', 'pop_dense', 'retail_dense', 'crime', 'product', 'job_11', 'job_al', 'job_am', 'job_ax', 'job_bf', 'job_by', 'job_cv', 'job_de', 'job_dz', 'job_e2', 'job_f8', 'job_gj', 'job_gv', 'job_kd', 'job_ke', 'job_kl', 'job_kp', 'job_ks', 'job_kw', 'job_mm', 'job_nb', 'job_nn', 'job_ob', 'job_pe', 'job_po', 'job_pq', 'job_pz', 'job_qp', 'job_qw', 'job_rn', 'job_sa', 'job_vv', 'job_zz', 'area_a', 'area_b', 'area_c', 'area_d']


This includes both the target and predictors.  We need a list with the target removed.  We also remove **id** because it is not useful for prediction.

In [8]:
x_columns = df.columns.drop('product').drop('id')
print(list(x_columns))

['income', 'aspect', 'subscriptions', 'dist_healthy', 'save_rate', 'dist_unhealthy', 'age', 'pop_dense', 'retail_dense', 'crime', 'job_11', 'job_al', 'job_am', 'job_ax', 'job_bf', 'job_by', 'job_cv', 'job_de', 'job_dz', 'job_e2', 'job_f8', 'job_gj', 'job_gv', 'job_kd', 'job_ke', 'job_kl', 'job_kp', 'job_ks', 'job_kw', 'job_mm', 'job_nb', 'job_nn', 'job_ob', 'job_pe', 'job_po', 'job_pq', 'job_pz', 'job_qp', 'job_qw', 'job_rn', 'job_sa', 'job_vv', 'job_zz', 'area_a', 'area_b', 'area_c', 'area_d']
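
The chained `.drop('product').drop('id')` can also be written as a single call, because `Index.drop` accepts a list of labels.  A minimal sketch on a hypothetical, empty frame that reuses a few of the column names above:

```python
import pandas as pd

# Hypothetical frame with a few of the column names used above.
sample = pd.DataFrame(columns=["id", "income", "crime", "product", "area_c"])

# Index.drop accepts a list, so both names are removed in one call.
sample_x_columns = sample.columns.drop(["product", "id"])
print(list(sample_x_columns))
```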


### Generate X and Y for a Classification Neural Network

We can now generate *x* and *y*.  Note that this is how we generate *y* for a classification problem.  Regression would not use dummies and would simply use the numeric value of the target.

In [9]:
# Convert to NumPy - Classification
x_columns = df.columns.drop('product').drop('id')
x = df[x_columns].values
dummies = pd.get_dummies(df['product']) # Classification
products = dummies.columns  # class labels, in the same order as the columns of y
y = dummies.values

We can display the *x* and *y* matrices.

In [10]:
print(x)
print(y)

[[5.08760000e+04 1.31000000e+01 1.00000000e+00 ... 0.00000000e+00
  1.00000000e+00 0.00000000e+00]
 [6.03690000e+04 1.86250000e+01 2.00000000e+00 ... 0.00000000e+00
  1.00000000e+00 0.00000000e+00]
 [5.51260000e+04 3.47666667e+01 1.00000000e+00 ... 0.00000000e+00
  1.00000000e+00 0.00000000e+00]
 ...
 [2.85950000e+04 3.94250000e+01 3.00000000e+00 ... 0.00000000e+00
  0.00000000e+00 1.00000000e+00]
 [6.79490000e+04 5.73333333e+00 0.00000000e+00 ... 0.00000000e+00
  1.00000000e+00 0.00000000e+00]
 [6.14670000e+04 1.68916667e+01 0.00000000e+00 ... 0.00000000e+00
  1.00000000e+00 0.00000000e+00]]
[[0 1 0 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 [0 1 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 1 0]
 [0 0 1 ... 0 0 0]
 [0 0 1 ... 0 0 0]]
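
Because **products** holds the class names in the same order as the columns of *y*, it can translate one-hot rows back into labels.  The sketch below shows this on hypothetical labels (all `demo_` names are made up); the same `argmax` step should apply to the probability rows a classification network outputs.

```python
import numpy as np
import pandas as pd

# Hypothetical product labels for illustration.
labels = pd.Series(["b", "c", "a", "c"], name="product")
demo_dummies = pd.get_dummies(labels, dtype=int)
demo_products = demo_dummies.columns
demo_y = demo_dummies.values

# argmax finds the position of the 1 in each row; indexing the saved
# column names with those positions gives back the original labels.
decoded = demo_products[np.argmax(demo_y, axis=1)]
print(list(decoded))
```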


The *x* and *y* values are now ready for a neural network.  Make sure that you construct the neural network for a classification problem.  Specifically:

* Classification neural networks have an output neuron count equal to the number of classes.
* Classification neural networks should use the **categorical_crossentropy** loss function and a **softmax** activation function on the output layer.
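
To show why these two choices fit a one-hot *y*, here is a from-scratch NumPy sketch of softmax followed by categorical cross-entropy.  The helper functions and the numbers are hypothetical illustrations; Keras's own implementation may differ in details such as clipping.

```python
import numpy as np

def softmax(logits):
    """Convert each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-probability assigned to the true class."""
    return float(-np.mean(np.sum(y_true * np.log(y_prob + eps), axis=1)))

# Hypothetical raw outputs for 2 rows and 3 classes (one output neuron per class).
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
# One-hot targets, in the same layout that pd.get_dummies(...).values produces.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0]])

probs = softmax(logits)
loss = categorical_crossentropy(y_true, probs)
print(probs.sum(axis=1))
print(loss)
```

The output has one column per class, so its shape matches the one-hot *y*, and the loss shrinks as more probability is assigned to the true class.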

### Generate X and Y for a Regression Neural Network

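For a regression neural network, the *x* values are generated the same way.  However, *y* does not use dummies.  Make sure to replace **income** with your actual target, and to remove that target column from *x* so that the network does not receive the answer as an input.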

In [11]:
y = df['income'].values
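
A minimal sketch of the regression case on hypothetical, already-encoded rows, with **income** standing in for the target (all names other than the column names are made up).  The target is dropped from the predictors alongside **id**.

```python
import pandas as pd

# Hypothetical, already-encoded rows; 'income' plays the regression target.
sample = pd.DataFrame({
    "id": [1, 2, 3],
    "income": [50876.0, 60369.0, 28347.0],
    "crime": [0.07, 0.40, 0.12],
    "product": ["b", "c", "a"],
    "area_c": [1, 1, 0],
})

target = "income"
# Drop the target (and id) from the predictors so the answer is not an input.
# 'product' is also dropped here because it is still a string column in this
# sketch; it could instead be dummy-encoded and kept as a predictor.
reg_columns = sample.columns.drop([target, "id", "product"])
x_reg = sample[reg_columns].values.astype("float32")
y_reg = sample[target].values.astype("float32")
print(x_reg.shape, y_reg.shape)
```

Here *y* is a plain one-dimensional array of numbers rather than a one-hot matrix.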

# Module 4 Assignment

You can find the first assignment here: [assignment 4](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class1.ipynb)