# Encoding feature vectors
A neural network can operate on almost any kind of data, provided it can be encoded into a 'feature vector', i.e. numerical input. Exactly how these vectors are structured can vary.

For tabular data, each row becomes a feature vector, with a given column mapping to a specific input neuron.
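As a minimal sketch of this row-to-vector mapping, the snippet below builds a small hypothetical table (the `toy` frame and its values are made up for illustration) and converts it to a numeric array, where each row is one feature vector and the column order decides which input neuron receives which value.

```python
import pandas as pd

# Hypothetical toy table: three rows (samples), two numeric columns (features).
toy = pd.DataFrame({
    "age": [49, 51, 44],
    "income": [50876.0, 60369.0, 55126.0],
})

# Each row becomes one feature vector; column 0 feeds input neuron 0, and so on.
vectors = toy.astype("float32").to_numpy()

print(vectors.shape)  # (samples, features)
print(vectors[0])     # the feature vector for the first row
```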

We will use a simple example data set.

In [25]:
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?']
)

In [26]:
df

Unnamed: 0,id,job,area,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,crime,product
0,1,vv,c,50876.0,13.100000,1,9.017895,35,11.738935,49,0.885827,0.492126,0.071100,b
1,2,kd,c,60369.0,18.625000,2,7.766643,59,6.805396,51,0.874016,0.342520,0.400809,c
2,3,pe,c,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,0.724409,0.207723,b
3,4,11,c,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,0.444882,0.361216,b
4,5,kl,d,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,0.661417,0.068033,a
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1996,vv,c,51017.0,38.233333,1,5.454545,34,14.013489,41,0.881890,0.744094,0.104838,b
1996,1997,kl,d,26576.0,33.358333,2,3.632069,20,8.380497,38,0.944882,0.877953,0.063851,a
1997,1998,kl,d,28595.0,39.425000,3,7.168218,99,4.626950,36,0.759843,0.744094,0.098703,f
1998,1999,qp,c,67949.0,5.733333,0,8.936292,26,3.281439,46,0.909449,0.598425,0.117803,c


We can already begin to scrutinize this data, and consider which column we wish to *predict* with the NN. We also note the `id` column, which we want to drop for the neural network, since it carries no relevant information.

We (may) have missing values, which we will have to deal with.

We also note that there is some non-numerical data -- `job`, `area` and `product` -- which we will have to encode.

## Encoding categorical data
To begin with, let us encode jobs as a dummy matrix.

In [27]:
dummies = pd.get_dummies(
    df['job'], 
    prefix='job'
)

print(dummies.shape)

pd.set_option('display.max_columns', 14)
dummies

(2000, 33)


Unnamed: 0,job_11,job_al,job_am,job_ax,job_bf,job_by,job_cv,...,job_pz,job_qp,job_qw,job_rn,job_sa,job_vv,job_zz
0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0
1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0
1996,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0
1997,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0
1998,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0
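To make the dummy matrix easier to inspect, here is a small sketch on a hypothetical four-row series (the `jobs` values are borrowed from the data set, but the series itself is made up). It passes `dtype=int` explicitly, since newer pandas versions return boolean columns by default rather than the 0/1 integers shown above.

```python
import pandas as pd

# Hypothetical miniature of the `job` column.
jobs = pd.Series(["vv", "kd", "vv", "pe"], name="job")

# One column per distinct category; dtype=int forces 0/1 instead of booleans.
encoded = pd.get_dummies(jobs, prefix="job", dtype=int)

print(list(encoded.columns))
print(encoded.iloc[0].tolist())

# Exactly one column is "hot" in every row.
print(encoded.sum(axis=1).tolist())
```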


We will now merge this data back into the original data frame, and drop the `job` column.

In [28]:
df = pd.concat([df, dummies], axis=1)
df.drop(
    'job', 
    axis=1,
    errors='ignore', # do not raise if the column has already been dropped
    inplace=True # modify the original frame instead of returning a new one
)

df

Unnamed: 0,id,area,income,aspect,subscriptions,dist_healthy,save_rate,...,job_pz,job_qp,job_qw,job_rn,job_sa,job_vv,job_zz
0,1,c,50876.0,13.100000,1,9.017895,35,...,0,0,0,0,0,1,0
1,2,c,60369.0,18.625000,2,7.766643,59,...,0,0,0,0,0,0,0
2,3,c,55126.0,34.766667,1,3.632069,6,...,0,0,0,0,0,0,0
3,4,c,51690.0,15.808333,1,5.372942,16,...,0,0,0,0,0,0,0
4,5,d,28347.0,40.941667,3,3.822477,20,...,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1996,c,51017.0,38.233333,1,5.454545,34,...,0,0,0,0,0,1,0
1996,1997,d,26576.0,33.358333,2,3.632069,20,...,0,0,0,0,0,0,0
1997,1998,d,28595.0,39.425000,3,7.168218,99,...,0,0,0,0,0,0,0
1998,1999,c,67949.0,5.733333,0,8.936292,26,...,0,1,0,0,0,0,0


We will do the same for the area column:

In [29]:
df = pd.concat(
    [
        df, 
        pd.get_dummies(
            df['area'],
            prefix='area'
        )
    ],
    axis=1
)

df.drop('area', axis=1, errors='ignore', inplace=True)
df

Unnamed: 0,id,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,...,job_sa,job_vv,job_zz,area_a,area_b,area_c,area_d
0,1,50876.0,13.100000,1,9.017895,35,11.738935,...,0,1,0,0,0,1,0
1,2,60369.0,18.625000,2,7.766643,59,6.805396,...,0,0,0,0,0,1,0
2,3,55126.0,34.766667,1,3.632069,6,13.671772,...,0,0,0,0,0,1,0
3,4,51690.0,15.808333,1,5.372942,16,4.333286,...,0,0,0,0,0,1,0
4,5,28347.0,40.941667,3,3.822477,20,5.967121,...,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,1996,51017.0,38.233333,1,5.454545,34,14.013489,...,0,1,0,0,0,1,0
1996,1997,26576.0,33.358333,2,3.632069,20,8.380497,...,0,0,0,0,0,0,1
1997,1998,28595.0,39.425000,3,7.168218,99,4.626950,...,0,0,0,0,0,0,1
1998,1999,67949.0,5.733333,0,8.936292,26,3.281439,...,0,0,0,0,0,1,0


## Dealing with missing data
There is no 'correct' method for handling missing data, and often the solution should be evaluated on an individual basis, depending on data type, amount, and nature.

First, we'll ascertain how much missing data we have:

In [30]:
df.isna().sum()

id                 0
income            59
aspect             0
subscriptions      0
dist_healthy       0
save_rate          0
dist_unhealthy     0
age                0
pop_dense          0
retail_dense       0
crime              0
product            0
job_11             0
job_al             0
job_am             0
job_ax             0
job_bf             0
job_by             0
job_cv             0
job_de             0
job_dz             0
job_e2             0
job_f8             0
job_gj             0
job_gv             0
job_kd             0
job_ke             0
job_kl             0
job_kp             0
job_ks             0
job_kw             0
job_mm             0
job_nb             0
job_nn             0
job_ob             0
job_pe             0
job_po             0
job_pq             0
job_pz             0
job_qp             0
job_qw             0
job_rn             0
job_sa             0
job_vv             0
job_zz             0
area_a             0
area_b             0
area_c       

We're only missing data in the `income` column, so we'll consider some options:
- drop the data that is missing
- use an average across the data set for missing values
- see if we have any indicators for what the missing value could be

This is arguably non-trivial -- we could calculate a median income, given a specific area, and use that as the missing value.
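The three options can be sketched on a small hypothetical frame. Note the assumption that the `area` column is still available as a single categorical column, which in this notebook would mean imputing *before* the dummy encoding; the `toy` data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing incomes.
toy = pd.DataFrame({
    "area": ["c", "c", "d", "d", "d"],
    "income": [50000.0, np.nan, 28000.0, 30000.0, np.nan],
})

# Option 1: drop rows where income is missing.
dropped = toy.dropna(subset=["income"])

# Option 2: fill with the median of the whole column.
global_fill = toy["income"].fillna(toy["income"].median())

# Option 3: fill with the median income of the row's own area.
area_median = toy.groupby("area")["income"].transform("median")
area_fill = toy["income"].fillna(area_median)

print(len(dropped))
print(global_fill.tolist())
print(area_fill.tolist())
```

The group-wise variant keeps more of the structure in the data, but it only helps if the grouping column is actually related to the missing value.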

In this case, out of laziness, we will just assign the median value of the column, as we have 12 other data columns to work with.

In [31]:
med = df['income'].median()
df['income'] = df['income'].fillna(med)

## Final data manipulation
We can get an overview of our columns to see what we have available:

In [36]:
print(list(df.columns))

['id', 'income', 'aspect', 'subscriptions', 'dist_healthy', 'save_rate', 'dist_unhealthy', 'age', 'pop_dense', 'retail_dense', 'crime', 'product', 'job_11', 'job_al', 'job_am', 'job_ax', 'job_bf', 'job_by', 'job_cv', 'job_de', 'job_dz', 'job_e2', 'job_f8', 'job_gj', 'job_gv', 'job_kd', 'job_ke', 'job_kl', 'job_kp', 'job_ks', 'job_kw', 'job_mm', 'job_nb', 'job_nn', 'job_ob', 'job_pe', 'job_po', 'job_pq', 'job_pz', 'job_qp', 'job_qw', 'job_rn', 'job_sa', 'job_vv', 'job_zz', 'area_a', 'area_b', 'area_c', 'area_d']


We still have our prediction target, `product`, and the `id` column, neither of which we want to feed to the NN as input -- we will create a list with just the columns we want to give the NN:

In [37]:
x_cols = df.columns.drop('product').drop('id')
print(list(x_cols))

['income', 'aspect', 'subscriptions', 'dist_healthy', 'save_rate', 'dist_unhealthy', 'age', 'pop_dense', 'retail_dense', 'crime', 'job_11', 'job_al', 'job_am', 'job_ax', 'job_bf', 'job_by', 'job_cv', 'job_de', 'job_dz', 'job_e2', 'job_f8', 'job_gj', 'job_gv', 'job_kd', 'job_ke', 'job_kl', 'job_kp', 'job_ks', 'job_kw', 'job_mm', 'job_nb', 'job_nn', 'job_ob', 'job_pe', 'job_po', 'job_pq', 'job_pz', 'job_qp', 'job_qw', 'job_rn', 'job_sa', 'job_vv', 'job_zz', 'area_a', 'area_b', 'area_c', 'area_d']


## Generate `x`, `y` variables for NN
We may consider classification or regression for this data. We will use `x_cols` to predict our target.

The `x` variable, in either case, is the same:

In [38]:
x = df[x_cols].values
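Depending on the pandas version, the dummy columns may be booleans rather than integers, in which case mixing them with float columns can yield a non-numeric array. A hedged sketch of one way to guard against this is to cast to a single float type before extracting the array (the `toy` frame here is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing a float column with a boolean dummy column.
toy = pd.DataFrame({
    "income": [50876.0, 60369.0],
    "job_vv": [True, False],
})

# Cast everything to float32 first, then pull out the NumPy array.
x_toy = toy.astype(np.float32).to_numpy()

print(x_toy.dtype)
print(x_toy.shape)
```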

### Classification
If we are considering a classification problem, we will encode the `product` (the target) column as dummies:

In [39]:
dummies = pd.get_dummies(df['product'])
products = dummies.columns # keep index of names

y = dummies.values
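Keeping `products` around is what lets us translate network output back into labels. The sketch below illustrates this on hypothetical data: the `scores` array stands in for a network's per-class output and is made up, as is the four-row `product` series.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the `product` column.
product = pd.Series(["b", "c", "b", "a"], name="product")

dummies = pd.get_dummies(product, dtype=int)
products = dummies.columns  # class names, in column order
y = dummies.values

# Made-up network output: one score per class for two samples.
scores = np.array([
    [0.1, 0.7, 0.2],
    [0.2, 0.2, 0.6],
])

# The index of the highest score maps back to a class name.
predicted = products[np.argmax(scores, axis=1)]
print(predicted.tolist())

# The same trick recovers the original labels from the one-hot targets.
print(products[np.argmax(y, axis=1)].tolist())
```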

### Regression
For regression, we could simply use:

In [40]:
y = df['product'].values

*Note:* since `product` is categorical, this would not actually work for regression in a NN -- a regression target must be numerical.
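As an illustration of a regression setup that would work, the sketch below picks a numerical column as the target instead. Choosing `income` is purely an assumption for the sake of the example, and the `toy` frame is hypothetical; the point is only that the target is removed from the inputs and kept as a plain numeric vector.

```python
import pandas as pd

# Hypothetical frame with numeric columns only.
toy = pd.DataFrame({
    "income": [50876.0, 60369.0, 55126.0],
    "age": [49, 51, 44],
    "crime": [0.071100, 0.400809, 0.207723],
})

target = "income"  # assumption: chosen only for illustration

# Inputs are every column except the target.
x_cols = toy.columns.drop(target)
x = toy[x_cols].values

# A regression target is a single numeric value per row, with no dummy encoding.
y = toy[target].values

print(list(x_cols))
print(x.shape, y.shape)
```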