# Categorical Features
There are two types of categorical features:<br>
Ordinal Variable: Discrete values that can be ordered. <mark>Example: small \< medium \< large</mark><br>
Nominal Variable: Discrete values that have no ordering. <mark>Example: Brown, Blue, Green</mark><br><br>
This notebook uses concepts outlined in Chapter 4 of _Python Machine Learning_ by Sebastian Raschka.
    

In [30]:
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
import pandas as pd
import numpy as np

# Display all cell outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

from IPython import get_ipython
ipython = get_ipython()

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

# Set max rows and columns displayed in jupyter
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 100)

# autoreload extension
if 'autoreload' not in ipython.extension_manager.loaded:
    %load_ext autoreload

%autoreload 2

## Generate a t-shirt order
Each order has a name, a t-shirt size, a t-shirt color and a weight (in pounds).<br>
Uses the <a href="https://pypi.org/project/names/">names</a> module to generate random names.

In [41]:
import utils as ut
df = ut.generate_tshirt_order()
df

| | weight | t_shirt_size | t_shirt_color | name |
|---|---|---|---|---|
| 0 | 87.478379 | small | black | Timothy Bunch |
| 1 | 101.982078 | small | black | Miguel Williams |
| 2 | 114.504086 | small | orange | Tommy Jennings |
| 3 | 95.567857 | small | red | Willie Ledet |
| 4 | 109.106926 | small | orange | David Smith |
| ... | ... | ... | ... | ... |
| 295 | 149.039786 | large | green | Irene Glover |
| 296 | 189.241702 | large | orange | Theresa Tomlin |
| 297 | 173.061783 | large | red | Rebekah Millar |
| 298 | 178.617007 | large | red | Melinda Bonner |
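
The ut.generate_tshirt_order helper is not shown in this notebook. The sketch below is a hypothetical stand-in, not the author's code: the function name make_tshirt_orders, the mean weights per size, and the placeholder customer names are all assumptions, chosen only so that the output has the same columns and row count as the table above.

```python
import numpy as np
import pandas as pd

def make_tshirt_orders(n_per_size=100, seed=0):
    """Hypothetical stand-in for ut.generate_tshirt_order (not the author's code)."""
    rng = np.random.default_rng(seed)
    colors = ['black', 'blue', 'green', 'orange', 'red']
    # Assumed mean weight in pounds per size; the real helper may differ.
    mean_weight = {'small': 100, 'medium': 140, 'large': 175}
    frames = []
    for size, mu in mean_weight.items():
        frames.append(pd.DataFrame({
            'weight': rng.normal(mu, 12, n_per_size),
            't_shirt_size': size,  # scalar is broadcast to every row
            't_shirt_color': rng.choice(colors, n_per_size),
        }))
    orders = pd.concat(frames, ignore_index=True)
    # Placeholder names; the notebook uses the names package instead.
    orders['name'] = [f'customer_{i}' for i in range(len(orders))]
    return orders

demo = make_tshirt_orders()
```

The rows come out grouped by size (small, then medium, then large), which matches the ordering visible in the notebook's output.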


## Get a list of all categorical variables
Categorical variables are usually stored as strings (shown as object in dtypes). They can also be booleans (shown as bool in dtypes).

In [42]:
df.dtypes

weight           float64
t_shirt_size      object
t_shirt_color     object
name              object
dtype: object

In [43]:
# Count the unique entries in each column
df.nunique()

weight           300
t_shirt_size       3
t_shirt_color      5
name             300
dtype: int64
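
The two checks above (dtype and unique count) can be combined into one rough filter. This is a minimal sketch on a made-up four-row sample; the rule "non-numeric and not unique per row" is an assumed heuristic, not a general definition of a categorical column.

```python
import pandas as pd

# Small made-up sample with the same columns as the notebook's DataFrame.
sample = pd.DataFrame({
    'weight': [87.5, 102.0, 149.0, 173.1],
    't_shirt_size': ['small', 'small', 'large', 'large'],
    't_shirt_color': ['black', 'black', 'green', 'red'],
    'name': ['Ann A', 'Bob B', 'Cy C', 'Di D'],
})

# Step 1: keep the non-numeric columns (strings, bools, categoricals).
non_numeric = sample.select_dtypes(exclude='number').columns

# Step 2: drop columns where every row is unique (identifiers such as name).
categorical = [c for c in non_numeric if sample[c].nunique() < len(sample)]
```

Here name is excluded because every value is distinct, which mirrors its 300 unique values in the output above.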

## We have 3 object columns, of which t_shirt_size and t_shirt_color are categorical

### Ordinal Categorical values
Ordering matters for t_shirt_size, given that<br>
<mark>small \< medium \< large</mark><br>
So treat t_shirt_size as ordinal and map its strings to numbers that respect the above inequality. This lets a machine learning algorithm that uses this data interpret the sizes in the correct order.<br>
Replace the values in the t_shirt_size column with the following:<br>
small:1, medium:2, large:3<br>



In [44]:
size_mapping = {'small':1, 'medium':2, 'large':3}
df.t_shirt_size = df.t_shirt_size.map(size_mapping)
df

| | weight | t_shirt_size | t_shirt_color | name |
|---|---|---|---|---|
| 0 | 87.478379 | 1 | black | Timothy Bunch |
| 1 | 101.982078 | 1 | black | Miguel Williams |
| 2 | 114.504086 | 1 | orange | Tommy Jennings |
| 3 | 95.567857 | 1 | red | Willie Ledet |
| 4 | 109.106926 | 1 | orange | David Smith |
| ... | ... | ... | ... | ... |
| 295 | 149.039786 | 3 | green | Irene Glover |
| 296 | 189.241702 | 3 | orange | Theresa Tomlin |
| 297 | 173.061783 | 3 | red | Rebekah Millar |
| 298 | 178.617007 | 3 | red | Melinda Bonner |


Notice that we do **not** increase the total number of columns when we do this.
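
One caveat with map is that any label missing from the dictionary silently becomes NaN. Below is a minimal sketch of a check for that, alongside an ordered pandas Categorical as an alternative way to encode the same order. The sample values, including the deliberately misspelled 'Large', are made up for illustration.

```python
import pandas as pd

size_mapping = {'small': 1, 'medium': 2, 'large': 3}

# Made-up sample; 'Large' is a deliberate typo that is not in the mapping.
sizes = pd.Series(['small', 'large', 'medium', 'Large'])

encoded = sizes.map(size_mapping)
# Labels that had no entry in size_mapping end up as NaN after map().
unmapped = sizes[encoded.isna()].unique().tolist()

# Alternative: an ordered Categorical keeps the labels and knows their order.
ordered = pd.Categorical(
    ['small', 'large', 'medium'],
    categories=['small', 'medium', 'large'],
    ordered=True,
)
codes = ordered.codes + 1  # codes start at 0, so shift to match size_mapping
```

Checking unmapped before overwriting the column avoids losing rows to NaN without noticing.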

In [45]:
# To undo the mapping above, build a reverse mapping and apply it to the column
# reverse_mapping = {v:k for k,v in size_mapping.items()}
# df.t_shirt_size = df.t_shirt_size.map(reverse_mapping)
# df

### Nominal Categorical values
Ordering does not matter for t_shirt_color. If we did what we did for the ordinal feature, that is, map each unique value to a number, we would impose an order, like this:<br>
'green':0, 'blue':1, 'orange':2, 'red':3, 'black':4<br>
To an ML algorithm this would look like 'green' \< 'blue' \< 'orange' \< 'red' \< 'black', which is nonsense.<br>
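
To make the problem concrete, here is a small sketch using the integer codes listed above. The helper code_distance is hypothetical and only illustrates what a distance-based model would effectively see.

```python
# Arbitrary integer codes for a nominal feature, as listed above.
color_codes = {'green': 0, 'blue': 1, 'orange': 2, 'red': 3, 'black': 4}

def code_distance(a, b):
    """Absolute difference between two color codes (hypothetical helper)."""
    return abs(color_codes[a] - color_codes[b])

near = code_distance('green', 'blue')   # codes 0 and 1
far = code_distance('green', 'black')   # codes 0 and 4
```

With these codes, green appears four times further from black than from blue, although nothing about the colors justifies that.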

One way to solve this is one-hot encoding: create a new column for each possible value of the nominal variable.<br>
This operation **will** increase the number of features (columns) in your dataset by the number of distinct nominal values minus 1, because the original column is replaced by one column per value.<br>
Use the pandas built-in get_dummies function.

In [46]:
df = pd.get_dummies(df, columns=['t_shirt_color'])
df

| | weight | t_shirt_size | name | t_shirt_color_black | t_shirt_color_blue | t_shirt_color_green | t_shirt_color_orange | t_shirt_color_red |
|---|---|---|---|---|---|---|---|---|
| 0 | 87.478379 | 1 | Timothy Bunch | 1 | 0 | 0 | 0 | 0 |
| 1 | 101.982078 | 1 | Miguel Williams | 1 | 0 | 0 | 0 | 0 |
| 2 | 114.504086 | 1 | Tommy Jennings | 0 | 0 | 0 | 1 | 0 |
| 3 | 95.567857 | 1 | Willie Ledet | 0 | 0 | 0 | 0 | 1 |
| 4 | 109.106926 | 1 | David Smith | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 295 | 149.039786 | 3 | Irene Glover | 0 | 0 | 1 | 0 | 0 |
| 296 | 189.241702 | 3 | Theresa Tomlin | 0 | 0 | 0 | 1 | 0 |
| 297 | 173.061783 | 3 | Rebekah Millar | 0 | 0 | 0 | 0 | 1 |
| 298 | 178.617007 | 3 | Melinda Bonner | 0 | 0 | 0 | 0 | 1 |


Notice that the t_shirt_color column has been replaced with 5 columns: t_shirt_color_black, t_shirt_color_blue, t_shirt_color_green, t_shirt_color_orange and t_shirt_color_red.<br>
Note also that in each row exactly 1 of these 5 columns is 1 and the rest are 0, so there is no longer any order to infer among the colors.
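
These properties can be checked directly. The sketch below uses a made-up six-row color column. It passes dtype=int so the indicators are 0/1 as in the table above, since recent pandas versions return booleans by default. It also shows the drop_first option, which drops one indicator column because it is fully determined by the others.

```python
import pandas as pd

# Made-up sample containing all five colors.
colors = pd.DataFrame(
    {'t_shirt_color': ['black', 'blue', 'green', 'orange', 'red', 'red']}
)

one_hot = pd.get_dummies(colors, columns=['t_shirt_color'], dtype=int)

# One column is replaced by five, so the net increase is 5 - 1 = 4.
added = one_hot.shape[1] - colors.shape[1]

# Exactly one indicator is set in every row.
row_sums = one_hot.sum(axis=1)

# drop_first=True removes the first category's column (black here);
# a row of all zeros then stands for black.
reduced = pd.get_dummies(
    colors, columns=['t_shirt_color'], drop_first=True, dtype=int
)
```

Dropping one column is a common choice for linear models, where five indicators that always sum to 1 are redundant. For other models, keeping all five is usually fine.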