# Part 2.2: Categorical and Continuous Values

Neural networks require their input to be a fixed number of columns.  This input format is very similar to spreadsheet data.  This input must be entirely numeric.  

It is essential to represent the data in a way that the neural network can train on it.  In class 6, we will see even more ways to preprocess data.  For now, we will look at several of the most basic ways to transform data for a neural network.

Before we look at specific ways to preprocess data, it is important to consider the four basic types of data defined by [[Cite:stevens1946theory]](http://psychology.okstate.edu/faculty/jgrice/psyc3214/Stevens_FourScales_1946.pdf).  Statisticians commonly refer to these as the [levels of measurement](https://en.wikipedia.org/wiki/Level_of_measurement):

* Character Data (strings)
    * **Nominal** - Individual discrete items, no order. For example, color, zip code, shape.
    * **Ordinal** - Individual distinct items that have an implied order.  For example, grade level, job title, Starbucks(tm) coffee size (tall, grande, venti).
* Numeric Data
    * **Interval** - Numeric values, no defined start.  For example, temperature.  You would never say, "yesterday was twice as hot as today."
    * **Ratio** - Numeric values, clearly defined start.  For example, speed.  You would say that "The first car is going twice as fast as the second."
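
The difference between nominal and ordinal data can be sketched with pandas categoricals.  This is a minimal illustration, not part of the original notebook; the sample values and variable names are hypothetical.

```python
import pandas as pd

# Ordinal: the categories carry an order, which we declare explicitly.
sizes = pd.Categorical(
    ["grande", "tall", "venti", "tall"],
    categories=["tall", "grande", "venti"],
    ordered=True,
)

# Each value's position in the declared order.
size_codes = sizes.codes.tolist()
print(size_codes)                 # [1, 0, 2, 0]
print(sizes.min(), sizes.max())   # tall venti

# Nominal: the categories are distinct labels with no order.
colors = pd.Categorical(["red", "green", "blue", "green"])
print(colors.ordered)             # False
```

Because the order is declared, comparisons such as minimum and maximum are meaningful for the sizes but not for the colors.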

### Encoding Continuous Values

One common transformation is to normalize the inputs.  It is sometimes valuable to put numeric inputs into a standard form so that the program can easily compare values.  Consider a friend who tells you that he received a 10 dollar discount.  Is this a good deal?  Maybe.  But the discount is not normalized against the purchase price.  If your friend purchased a car, then the discount is not that good.  If your friend bought dinner, this is an excellent discount!
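
To make the comparison concrete, the same 10 dollar discount can be divided by the purchase price.  The prices below are hypothetical and chosen only for illustration.

```python
# Hypothetical purchase prices, for illustration only.
discount = 10.0
purchases = {"car": 25_000.0, "dinner": 40.0}

# Express the discount as a fraction of each purchase price.
relative = {name: discount / price for name, price in purchases.items()}

for name, frac in relative.items():
    print(f"{name}: {frac:.2%} off")
```

With these assumed prices, the discount is a tiny fraction of the car's price but a quarter of the dinner's price.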

Percentages are a prevalent form of normalization.  If your friend tells you they got 10% off, you know that this is a better discount than 5%, regardless of the purchase price.  One widespread machine learning normalization is the Z-Score:

$z = \frac{x - \mu}{\sigma}$

To calculate the Z-Score, you also need to calculate the mean ($\mu$) and the standard deviation ($\sigma$).  The mean is calculated as follows:

$\mu = \bar{x} = \frac{x_1+x_2+\cdots +x_N}{N}$

The standard deviation is calculated as follows:

$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}, {\rm \ \ where\ \ } \mu = \frac{1}{N} \sum_{i=1}^N x_i$
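
The formulas above can be checked by hand on a small array.  This is a minimal sketch with hypothetical sample values.  It uses the population standard deviation (division by $N$), which is also the default of `scipy.stats.zscore`; note that pandas `Series.std()` divides by $N-1$ by default, so it gives slightly different values.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
x = np.array([18.0, 15.0, 36.0, 24.0, 27.0])
n = len(x)

mu = x.sum() / n                               # mean
sigma = np.sqrt(((x - mu) ** 2).sum() / n)     # population standard deviation
z = (x - mu) / sigma                           # z-score of every value

# np.std uses the same 1/N formula by default (ddof=0).
print(np.isclose(sigma, np.std(x)))
print(z)
```

After the transformation, the values have a mean of zero and a standard deviation of one.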

The following Python code replaces the mpg with a z-score.  Cars with average MPG will be near zero; values above zero are above average, and values below zero are below average.  Z-Scores above 3 or below -3 are very rare; these values are outliers.

In [None]:
import pandas as pd
from scipy.stats import zscore

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/auto-mpg.csv",
    na_values=['NA','?'])

pd.set_option('display.max_columns', 7)
pd.set_option('display.max_rows', 5)

df['mpg'] = zscore(df['mpg'])
display(df)

Unnamed: 0,mpg,cylinders,displacement,...,year,origin,name
0,-0.706439,8,307.0,...,70,1,chevrolet chevelle malibu
1,-1.090751,8,350.0,...,70,1,buick skylark 320
...,...,...,...,...,...,...,...
396,0.574601,4,120.0,...,82,1,ford ranger
397,0.958913,4,119.0,...,82,1,chevy s-10


### Encoding Categorical Values as Dummies
The traditional means of encoding categorical values is to make them dummy variables.  This technique is also called one-hot encoding.  Consider the following data set.

In [2]:
import pandas as pd

df = pd.read_csv(
    "https://data.heatonresearch.com/data/t81-558/jh-simple-dataset.csv",
    na_values=['NA','?'])

areas = list(df['area'].unique())
print(f'Number of areas: {len(areas)}')
print(f'Areas: {areas}')
df.head()

Number of areas: 4
Areas: ['c', 'd', 'a', 'b']


Unnamed: 0,id,job,area,income,aspect,subscriptions,dist_healthy,save_rate,dist_unhealthy,age,pop_dense,retail_dense,crime,product
0,1,vv,c,50876.0,13.1,1,9.017895,35,11.738935,49,0.885827,0.492126,0.0711,b
1,2,kd,c,60369.0,18.625,2,7.766643,59,6.805396,51,0.874016,0.34252,0.400809,c
2,3,pe,c,55126.0,34.766667,1,3.632069,6,13.671772,44,0.944882,0.724409,0.207723,b
3,4,11,c,51690.0,15.808333,1,5.372942,16,4.333286,50,0.889764,0.444882,0.361216,b
4,5,kl,d,28347.0,40.941667,3,3.822477,20,5.967121,38,0.744094,0.661417,0.068033,a


In [3]:
dummies = pd.get_dummies(df['area'],prefix='area')
df = pd.concat([df,dummies],axis=1)
display(df[['id','job','area','income','area_a',
                  'area_b','area_c','area_d']])

Unnamed: 0,id,job,area,income,area_a,area_b,area_c,area_d
0,1,vv,c,50876.0,0,0,1,0
1,2,kd,c,60369.0,0,0,1,0
2,3,pe,c,55126.0,0,0,1,0
3,4,11,c,51690.0,0,0,1,0
4,5,kl,d,28347.0,0,0,0,1
...,...,...,...,...,...,...,...,...
1995,1996,vv,c,51017.0,0,0,1,0
1996,1997,kl,d,26576.0,0,0,0,1
1997,1998,kl,d,28595.0,0,0,0,1
1998,1999,qp,c,67949.0,0,0,1,0
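
The same encoding can be tried on a small in-memory frame.  This is a minimal sketch; the `toy` frame is hypothetical and stands in for the `area` column.  Passing `dtype=int` requests 0/1 integers, since recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Hypothetical toy frame standing in for the 'area' column.
toy = pd.DataFrame({"id": [1, 2, 3, 4], "area": ["c", "d", "a", "b"]})

# One dummy column per distinct area value, encoded as 0/1 integers.
dummies = pd.get_dummies(toy["area"], prefix="area", dtype=int)
encoded = pd.concat([toy, dummies], axis=1)

print(encoded)

# Each row has exactly one 1 across the dummy columns.
print(dummies.sum(axis=1).tolist())  # [1, 1, 1, 1]
```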



### Encoding Categorical Values as Ordinal

Typically, categoricals are encoded as dummy variables.  However, other techniques can also convert categoricals to numeric values.  Whenever the categories have an order, a single number can capture that order.  Consider a categorical that describes the current education level of an individual.

* Kindergarten (0)
* First Grade (1)
* Second Grade (2)
* Third Grade (3)
* Fourth Grade (4)
* Fifth Grade (5)
* Sixth Grade (6)
* Seventh Grade (7)
* Eighth Grade (8)
* High School Freshman (9)
* High School Sophomore (10)
* High School Junior (11)
* High School Senior (12)
* College Freshman (13)
* College Sophomore (14)
* College Junior (15)
* College Senior (16)
* Graduate Student (17)
* PhD Candidate (18)
* Doctorate (19)
* Post Doctorate (20)

The above list has 21 levels, which would take 21 dummy variables.  However, simply encoding these levels as dummies would lose the order information.  Perhaps the easiest approach is to number the categories and assign each one the single number shown in parentheses above.  However, we might be able to do even better.  A graduate student likely spends more than a year at that level, so you might increase the value by more than one for that step.
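
Both ideas can be sketched with a dictionary lookup.  The frame, the column names, and the `level_years` spacing below are hypothetical; in particular, the gaps assigned to the graduate levels are assumptions for illustration, not values from the text.

```python
import pandas as pd

# Subset of the levels listed above, using the numbers in parentheses.
level_rank = {
    "High School Senior": 12,
    "College Senior": 16,
    "Graduate Student": 17,
    "PhD Candidate": 18,
}

# Hypothetical alternative with wider gaps for the graduate levels
# (the exact spacing is an assumption).
level_years = {
    "High School Senior": 12,
    "College Senior": 16,
    "Graduate Student": 18,
    "PhD Candidate": 21,
}

people = pd.DataFrame(
    {"education": ["College Senior", "PhD Candidate", "High School Senior"]}
)

# Replace each category with its number; the order is preserved.
people["edu_rank"] = people["education"].map(level_rank)
people["edu_years"] = people["education"].map(level_years)

print(people)
```

Either column keeps the order of the levels in a single numeric input, whereas dummy variables would treat every level as unrelated to the others.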