# Categorical Features
There are two types of categorical features:<br>
Ordinal variable: discrete values that can be ordered. <mark>Example: small \< medium \< large</mark><br>
Nominal variable: discrete values that have no ordering. <mark>Example: Brown, Blue, Green</mark><br>

**Definition: Cardinality is the number of distinct elements in a set. For our purposes, it is the number of unique values in a column.**<br>
This notebook uses concepts outlined in Chapter 4 of _Python Machine Learning_ by Sebastian Raschka.
    

In [78]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
import pandas as pd
import numpy as np

# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

from IPython import get_ipython
ipython = get_ipython()

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# Set max rows and columns displayed in jupyter
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

# autoreload extension
if 'autoreload' not in ipython.extension_manager.loaded:
    %load_ext autoreload

%autoreload 2

## Generate a t-shirt order
Each row of the order has a name, a t-shirt size, a t-shirt color and a weight (in pounds).<br>
Uses the <a href="https://pypi.org/project/names/">names</a> module to generate random names.

In [79]:
import utils as ut
df = ut.generate_tshirt_order()
df

Unnamed: 0,weight,t_shirt_size,t_shirt_color,name
0,92.555867,small,blue,Frank Notter
1,110.861400,small,blue,Daisy Burns
2,86.717474,small,green,Vanessa Diaz
3,84.761938,small,blue,Justin Herman
4,83.808216,small,orange,Deborah Gregory
...,...,...,...,...
295,161.188878,large,black,Gabriel Bazile
296,183.012031,large,orange,Dick Klein
297,174.493988,large,green,Michael Moyer
298,206.972934,large,red,Israel Williams
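
The `utils` module is not shown in this notebook, so the cell above cannot be rerun without it. The following is a hypothetical stand-in, not the author's `generate_tshirt_order`: the function name `make_tshirt_order`, the mean weights per size, the spread, and the placeholder names are all assumptions chosen only to produce a frame with the same four columns.

```python
import numpy as np
import pandas as pd


def make_tshirt_order(n_per_size=100, seed=0):
    """Hypothetical stand-in for ut.generate_tshirt_order (not the original)."""
    rng = np.random.default_rng(seed)
    # Assumed mean weight in pounds for each size; not taken from the source
    mean_weight = {"small": 90.0, "medium": 140.0, "large": 180.0}
    colors = ["black", "blue", "green", "orange", "red"]

    frames = []
    for size, mean in mean_weight.items():
        frames.append(
            pd.DataFrame(
                {
                    "weight": rng.normal(mean, 12.0, n_per_size),
                    "t_shirt_size": size,
                    "t_shirt_color": rng.choice(colors, n_per_size),
                }
            )
        )
    out = pd.concat(frames, ignore_index=True)
    # Placeholder names instead of the random names produced by the names module
    out["name"] = [f"Person {i}" for i in range(len(out))]
    return out


demo_df = make_tshirt_order()
```

With the defaults this yields 300 rows, 100 per size, which matches the row count reported by `df.nunique()` later in the notebook.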


## Get a list of all categorical variables
Categorical variables are usually strings (shown as object in dtypes) or booleans (shown as bool in dtypes).

In [73]:
df.dtypes

weight           float64
t_shirt_size      object
t_shirt_color     object
name              object
dtype: object

In [74]:
# Number of unique values (the cardinality) of each column
df.nunique()

weight           300
t_shirt_size       3
t_shirt_color      5
name             300
dtype: int64
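
The two checks above can be combined into a rough filter for candidate categorical columns. This is a minimal sketch on a toy frame: the `max_unique_ratio` threshold is an assumed heuristic, not a rule from the notebook, and it only flags candidates for manual review.

```python
import pandas as pd

# Toy frame with the same column names as the t-shirt order
toy = pd.DataFrame(
    {
        "weight": [92.5, 110.8, 140.2, 151.3, 161.1, 183.0],
        "t_shirt_size": ["small", "small", "medium", "medium", "large", "large"],
        "t_shirt_color": ["blue", "green", "blue", "black", "green", "black"],
        "name": ["A", "B", "C", "D", "E", "F"],
    }
)

# Non-numeric columns (strings, bools, and also datetimes if present) are candidates
candidates = toy.select_dtypes(exclude="number").columns

# Assumed heuristic: a column where nearly every value is unique (such as name)
# behaves like an identifier rather than a category
max_unique_ratio = 0.5
categorical = [
    col for col in candidates if toy[col].nunique() / len(toy) <= max_unique_ratio
]
print(categorical)
```

Here `name` is dropped because every value is unique, which mirrors the reasoning for the full dataset, where `name` has 300 unique values in 300 rows.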

## We have 3 objects, of which t_shirt_size and t_shirt_color are categorical

### Ordinal Categorical values
Ordering matters for t_shirt_size, given that <br>
<mark> small \< medium \< large </mark><br>
So treat t_shirt_size as ordinal and map its strings to numbers that respect the above inequality. This also lets an ML model that consumes the data interpret the ordering correctly. <br>
Replace the values in the t_shirt_size column with the following: <br>
small:0, medium:1, large:2 <br><br>
Advantages
<ul>
    <li>Establishes a numerical order</li>
    <li>Does not add new columns to the DataFrame</li>
    <li>Works with tree-based models (Random Forest, Boosted Trees), although they will probably work without this as well</li>
</ul>



In [75]:
# Get the set of all values that appear in the column
vals = set(df.t_shirt_size)
print(sorted(vals))

# If there is an order, you have to specify it by hand: easy with 3 values, harder with 30.
# The keys below were copied from the printed set and each was given a rank based on domain expertise.
vals = {'small': 0, 'medium': 1, 'large': 2}

['large', 'medium', 'small']


In [76]:
df.t_shirt_size = df.t_shirt_size.map(vals)
df

Unnamed: 0,weight,t_shirt_size,t_shirt_color,name
0,80.887497,0,blue,Annette Thompson
1,93.967084,0,black,Tammy Hogan
2,85.506761,0,black,Dora Blount
3,110.621576,0,orange,Lola Payne
4,87.717764,0,black,Marcus Keegan
...,...,...,...,...
295,162.318522,2,orange,Stephen Ellwanger
296,157.197792,2,red,William Thon
297,171.990514,2,red,Jonathan Reitz
298,203.830502,2,blue,Norman Dyer


Notice that we do **not** increase the total number of columns when we do this.
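
As an alternative to a hand-written dictionary, pandas can carry the order in the column's dtype. The sketch below is a variation on the approach above, not the method used in this notebook; it runs on a small toy Series and assumes the same small, medium, large ordering.

```python
import pandas as pd

sizes = pd.Series(["small", "large", "medium", "small"])

# Declare the order once; the position in `categories` becomes the integer code
size_dtype = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
as_cat = sizes.astype(size_dtype)

# Integer codes that respect small < medium < large
codes = as_cat.cat.codes

# Ordered categoricals also support comparisons against a category label
below_large = as_cat < "large"
```

A value that is not listed in `categories` becomes missing and receives the code -1, so unexpected labels are easy to spot after the conversion.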

In [45]:
# To reverse the above mapping, build the inverse dictionary and map it onto the column
# reverse_mapping = {v: k for k, v in vals.items()}
# df.t_shirt_size = df.t_shirt_size.map(reverse_mapping)
# df

### Nominal Categorical values
Ordering does not matter for t_shirt_color. If we did what we did for the ordinal feature, that is, map each unique value to a number, we would be establishing an order. Like this:<br>
'green':0,'blue':1,'orange':2,'red':3,'black':4<br>
To an ML algorithm this may look like 'green'<'blue'<'orange'<'red'<'black', which is nonsense.<br>

One way to solve this is <mark>one-hot encoding</mark>, a technique where a new column is created for each possible value of the nominal variable. This operation **will** increase the number of features (columns) in your dataset: the original column is replaced by one column per unique value, a net gain of (cardinality - 1) columns.<br>
To implement it, use the pandas built-in get_dummies.<br><br>
Advantages
<ul>
    <li>Guarantees that an ML model will not deduce an ordering</li>
</ul>
Disadvantages
<ul>
    <li>Expands the feature space (a net gain of n-1 columns if the nominal variable has n unique values), so high-cardinality columns can dramatically expand it</li>
    <li>Does not work as well with tree-based models (Random Forest, Boosted Trees)</li>
</ul>


In [77]:
df=pd.get_dummies(df,columns=['t_shirt_color'])
df

Unnamed: 0,weight,t_shirt_size,name,t_shirt_color_black,t_shirt_color_blue,t_shirt_color_green,t_shirt_color_orange,t_shirt_color_red
0,80.887497,0,Annette Thompson,0,1,0,0,0
1,93.967084,0,Tammy Hogan,1,0,0,0,0
2,85.506761,0,Dora Blount,1,0,0,0,0
3,110.621576,0,Lola Payne,0,0,0,1,0
4,87.717764,0,Marcus Keegan,1,0,0,0,0
...,...,...,...,...,...,...,...,...
295,162.318522,2,Stephen Ellwanger,0,0,0,1,0
296,157.197792,2,William Thon,0,0,0,0,1
297,171.990514,2,Jonathan Reitz,0,0,0,0,1
298,203.830502,2,Norman Dyer,0,1,0,0,0


Notice that the t_shirt_color column has been replaced with 5 columns: t_shirt_color_black, and so on.<br>
Note also that in each row exactly 1 of these 5 columns is 1 and the rest are 0, and that the colors no longer carry any order to infer.
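
The two observations above can be checked directly. This is a minimal sketch on a toy frame with 3 colors rather than the notebook's 5. It passes `dtype=int` to get 0/1 values like the output shown here, since recent pandas versions return booleans by default. The `drop_first=True` variant is an optional extra that is not used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "weight": [90.0, 150.0, 180.0, 95.0],
        "t_shirt_color": ["blue", "red", "green", "blue"],
    }
)

# One indicator column per color; the original column is replaced
full = pd.get_dummies(toy, columns=["t_shirt_color"], dtype=int)

# Exactly one indicator is set in every row
color_cols = [c for c in full.columns if c.startswith("t_shirt_color_")]
row_sums = full[color_cols].sum(axis=1)

# Net column gain is (number of colors - 1)
net_gain = full.shape[1] - toy.shape[1]

# Optional: drop the first indicator, which the remaining ones imply
reduced = pd.get_dummies(toy, columns=["t_shirt_color"], drop_first=True, dtype=int)
```

With `drop_first=True`, a row whose remaining indicators are all 0 stands for the dropped category, so no information is lost while one column is saved.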