# Categorical Features
There are two types of categorical features:<br>
Ordinal Variable: Discrete values that can be ordered. <mark>Example: small \< medium \< large</mark><br>
Nominal Variable: Discrete values that have no ordering. <mark>Example: Brown, Blue, Green</mark><br>

**Definition: Cardinality is the number of distinct elements in a set. For our purposes, it is the number of unique values in a column.**<br>
This notebook uses concepts outlined in Chapter 4 of _Python Machine Learning_ by Sebastian Raschka.
    

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np
import utils as ut

## Generate a t-shirt order
Each order has a name, a t-shirt size, a t-shirt color and a weight (in pounds).<br>
Uses the <a href="https://pypi.org/project/names/">names</a> module to generate random names.

In [2]:
dir(ut)

['PROCESSED_DATA',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'generate_tshirt_order',
 'names',
 'np',
 'pd']

In [4]:
import utils as ut
df = ut.generate_tshirt_order()
df

Unnamed: 0,weight,t_shirt_size,t_shirt_color,name
0,89.997833,small,green,Thomas Kempker
1,114.179140,small,black,James Faulkner
2,113.689670,small,orange,Ryan Murphy
3,102.687225,small,orange,Constance Brassard
4,110.344519,small,black,Christina Yearout
...,...,...,...,...
295,170.395980,large,orange,Janis Askew
296,162.185715,large,orange,Phillip Boysen
297,219.890920,large,green,Monica Olewine
298,147.486465,large,red,Sarah Greer


## Get a list of all categorical variables
These are usually strings (shown as object in dtypes) and booleans (shown as bool in dtypes).
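
As a rough sketch of that rule of thumb, the candidate columns can be picked out by dtype. The `toy` DataFrame and its `is_gift` column are made up for illustration and are not part of the t-shirt data; the string column is stored with `dtype=object` on purpose so it matches the `object` dtype shown in this notebook.

```python
import pandas as pd

# Hypothetical toy frame standing in for the t-shirt order data
toy = pd.DataFrame({
    "weight": [120.5, 180.2, 150.0],
    # stored explicitly as object, the dtype pandas reports for strings in this notebook
    "t_shirt_size": pd.Series(["small", "large", "medium"], dtype=object),
    "is_gift": [True, False, True],  # hypothetical boolean column
})

# Keep only the columns whose dtype is object or bool
categorical_cols = toy.select_dtypes(include=["object", "bool"]).columns.tolist()
print(categorical_cols)
```

This only finds candidates. A column like `name` is also an object column, so the dtype alone does not tell you whether a column is a useful categorical feature.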

In [5]:
len(df)

300

In [6]:
df.dtypes

weight           float64
t_shirt_size      object
t_shirt_color     object
name              object
dtype: object

In [7]:
# and how many unique entries there are in each column
df.nunique()

weight           300
t_shirt_size       3
t_shirt_color      5
name             300
dtype: int64

## We have 3 object columns, of which t_shirt_size and t_shirt_color are low-cardinality categorical variables
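
One possible way to find the low-cardinality columns automatically is to compare `nunique()` against a cutoff. This is only a sketch: the `toy` data and the `max_unique` cutoff are assumptions chosen for illustration, and a sensible cutoff depends on your dataset.

```python
import pandas as pd

# Hypothetical toy frame: one low-cardinality column, one column that is unique per row
toy = pd.DataFrame({
    "t_shirt_size": ["small", "large", "small", "medium"],
    "name": ["Ann Lee", "Bo Chen", "Cy Diaz", "Di Evans"],
})

max_unique = 3  # hypothetical cutoff, pick one that suits your data

# Count distinct values per column, then keep the columns at or below the cutoff
counts = toy.nunique()
low_cardinality = counts[counts <= max_unique].index.tolist()
print(low_cardinality)
```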

### Ordinal Categorical values
Ordering matters for t_shirt_size given that <br>
<mark> small \< medium \< large </mark><br>
So make t_shirt_size ordinal, and map these strings to numbers that respect the above inequality.  This will also help any ML algorithm using this data to interpret it correctly. <br>
Replace the values in the t_shirt_size column with the following <br>
small:0, medium:1, large:2 <br><br>
Advantages
<ul>
    <li>Establishes a numerical order
    <li>Does not add new columns to DataFrame 
   </ul>
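
An alternative to a hand-written dictionary, sketched here on a made-up `sizes` Series, is an ordered pandas categorical. You still state the order yourself, but pandas assigns the integer codes. This is not the approach used in the rest of this notebook, just another option.

```python
import pandas as pd

# Hypothetical sample of sizes
sizes = pd.Series(["small", "large", "medium", "small"])

# Declare the order explicitly, smallest first
size_dtype = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)

# .cat.codes gives each value its position in the declared category order
codes = sizes.astype(size_dtype).cat.codes
print(codes.tolist())  # [0, 2, 1, 0]
```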



In [8]:
# let's get the set of all possible values
vals = set(df.t_shirt_size)  # list(df.t_shirt_size.unique()) works too
print(f'original t shirt sizes {vals}')

# If there is an order, you generally have to specify it by hand: easy with 3 values, harder with 30.
# I copied the set printed above and gave each member a value based on my domain expertise.
# What would you do if the sizes were small, medium, mediumplus and large? You don't have to use integers.
# Maybe something like vals={'large':2.0, 'mediumplus':1.2, 'medium':1.0, 'small':0.0}
vals={'large':2, 'medium':1, 'small':0}
print(f'mapping used to convert original t_shirt sizes to numbers {vals}')

original t shirt sizes {'large', 'small', 'medium'}
mapping used to convert original t_shirt sizes to numbers {'large': 2, 'medium': 1, 'small': 0}


In [9]:
#map the vals dict to the t_shirt_size column, this is quite fast
df.t_shirt_size = df.t_shirt_size.map(vals)

#can do the same thing above this way
# df['t_shirt_size'] = df['t_shirt_size'].map(vals)
df

Unnamed: 0,weight,t_shirt_size,t_shirt_color,name
0,89.997833,0,green,Thomas Kempker
1,114.179140,0,black,James Faulkner
2,113.689670,0,orange,Ryan Murphy
3,102.687225,0,orange,Constance Brassard
4,110.344519,0,black,Christina Yearout
...,...,...,...,...
295,170.395980,2,orange,Janis Askew
296,162.185715,2,orange,Phillip Boysen
297,219.890920,2,green,Monica Olewine
298,147.486465,2,red,Sarah Greer


Notice that we do **not** increase the total number of columns when we do this.
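
One thing to watch for when mapping with a dictionary: any value that is not a key in the dictionary becomes NaN. Here is a small sketch of a check for that, where `'x-large'` is a made-up value that is deliberately missing from the mapping.

```python
import pandas as pd

vals = {"large": 2, "medium": 1, "small": 0}

# 'x-large' is a hypothetical size that the mapping does not cover
sizes = pd.Series(["small", "medium", "x-large"])

mapped = sizes.map(vals)

# Values missing from the dict come back as NaN, so list the offenders
unmapped = sizes[mapped.isna()].unique().tolist()
print(unmapped)
```

Running a check like this before overwriting the original column makes it easier to spot typos or unexpected categories.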

In [21]:
df.dtypes

weight           float64
t_shirt_size       int64
t_shirt_color     object
name              object
dtype: object

In [22]:
vals

{'large': 2, 'medium': 1, 'small': 0}

In [23]:
# If you want to reverse the above mapping create a reverse mapping and map to df
reverse_mapping = {v:k for k,v in vals.items()}
print(f'reverse mapping used to convert numbers back to original t_shirt sizes {reverse_mapping}')

#apply reverse mapping to get back original values
# df.t_shirt_size = df.t_shirt_size.map(reverse_mapping)
# df

reverse mapping used to convert numbers back to original t_shirt sizes {2: 'large', 1: 'medium', 0: 'small'}
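
A quick round-trip sketch on a made-up `original` Series shows that mapping forward and then back recovers the starting values. This only works because every size maps to a different number.

```python
import pandas as pd

vals = {"large": 2, "medium": 1, "small": 0}
reverse_mapping = {v: k for k, v in vals.items()}

# Hypothetical sample of sizes
original = pd.Series(["small", "large", "medium"])

# Encode to numbers, then decode back to strings
round_trip = original.map(vals).map(reverse_mapping)
print(round_trip.tolist())
```

If two sizes shared the same number, the reversed dictionary would keep only one of them and the original values could not be recovered.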


### Nominal Categorical values
Ordering does not matter for t_shirt_color. If we did the same thing that we did for ordinal features, that is map each unique value to a number, we would be establishing an order. Like this:<br>
'green':0,'blue':1,'orange':2,'red':3,'black':4<br>
To an ML algorithm this may look like 'green' \< 'blue' \< 'orange' \< 'red' \< 'black', which is nonsense.<br>

One way to solve this is <mark>one-hot encoding</mark>, a technique where a new column is created for each possible value of the nominal variable. This operation **will** increase the number of features (columns) in your dataset: a column with n unique values is replaced by n dummy columns, a net gain of n-1 columns (or n-2 if you pass drop_first=True, which drops one redundant dummy column).<br>
To implement, use the pandas built-in get_dummies.<br><br>
Advantages
<ul>
    <li>Guarantees an ML model will not deduce an ordering
   </ul>
Disadvantages
<ul>
    <li>Expands the feature space (adds n-1 columns if the nominal variable has n unique values).  So high cardinality columns can dramatically expand feature space. 
    <li>Does not work as well with tree-based models (Random Forest, Boosted Trees).
   </ul>
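
To make the column counts concrete, here is a small sketch on a made-up `toy` DataFrame with 3 colors. It compares plain one-hot encoding with the `drop_first=True` variant.

```python
import pandas as pd

# Hypothetical toy frame with n = 3 unique colors
toy = pd.DataFrame({
    "weight": [120.0, 150.0, 180.0, 160.0],
    "t_shirt_color": ["green", "blue", "red", "blue"],
})

# One dummy column per color; the original t_shirt_color column is removed
full = pd.get_dummies(toy, columns=["t_shirt_color"])

# Same, but the first color (alphabetically) gets no column of its own
reduced = pd.get_dummies(toy, columns=["t_shirt_color"], drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

With 3 colors, `full` goes from 2 columns to 4 and `reduced` goes from 2 columns to 3.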


In [12]:
df

Unnamed: 0,weight,t_shirt_size,t_shirt_color,name
0,108.765395,0,blue,Karen Wainwright
1,85.564958,0,red,Scott Roache
2,99.466720,0,green,Helen Cole
3,86.386705,0,blue,Steven Logan
4,71.223125,0,orange,Jackie Martinez
...,...,...,...,...
295,216.391622,2,orange,Todd Hughes
296,254.026741,2,red,Patrick Long
297,191.267698,2,red,Alma Barr
298,180.305148,2,orange,Charlotte Turnbull


In [11]:
pd.get_dummies?

Signature:
pd.get_dummies(
    data,
    prefix=None,
    prefix_sep: 'str | Iterable[str] | dict[str, str]' = '_',
    dummy_na: 'bool' = False,
    columns=None,
    sparse: 'bool' = False,
    drop_first: 'bool' = False,
    dtype: 'NpDtype | None' = None,
) -> 'DataFrame'


In [12]:
# get_dummies returns a new DataFrame, so df itself is left unchanged
df2 = pd.get_dummies(df, drop_first=True, columns=['t_shirt_color'])
df2

Unnamed: 0,weight,t_shirt_size,name,t_shirt_color_blue,t_shirt_color_green,t_shirt_color_orange,t_shirt_color_red
0,89.997833,0,Thomas Kempker,False,True,False,False
1,114.179140,0,James Faulkner,False,False,False,False
2,113.689670,0,Ryan Murphy,False,False,True,False
3,102.687225,0,Constance Brassard,False,False,True,False
4,110.344519,0,Christina Yearout,False,False,False,False
...,...,...,...,...,...,...,...
295,170.395980,2,Janis Askew,False,False,True,False
296,162.185715,2,Phillip Boysen,False,False,True,False
297,219.890920,2,Monica Olewine,False,True,False,False
298,147.486465,2,Sarah Greer,False,False,False,True


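Notice that the t_shirt_color column has been replaced with 4 columns: t_shirt_color_blue, t_shirt_color_green, t_shirt_color_orange and t_shirt_color_red. Because drop_first=True, the first category (black) gets no column of its own; a row where all 4 columns are False is black.<br>
Note also that at most 1 of these 4 columns will be True in any row; the rest will be False. There is no longer any order to infer among the colors.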
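
If you would rather see 0/1 than True/False, `get_dummies` accepts a `dtype` argument. The sketch below uses a made-up `colors` Series and also shows how the dropped first category can be spotted: it is the row where every dummy column is 0.

```python
import pandas as pd

# Hypothetical sample; 'black' sorts first, so drop_first removes its column
colors = pd.Series(["black", "green", "red"], name="t_shirt_color")

# dtype=int gives 0/1 columns instead of booleans
dummies = pd.get_dummies(colors, prefix="t_shirt_color", drop_first=True, dtype=int)
print(dummies.columns.tolist())

# A row with all zeros belongs to the dropped category (black here)
baseline_rows = dummies.sum(axis=1) == 0
print(baseline_rows.tolist())
```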

In [13]:
df2.dtypes

weight                  float64
t_shirt_size              int64
name                     object
t_shirt_color_blue         bool
t_shirt_color_green        bool
t_shirt_color_orange       bool
t_shirt_color_red          bool
dtype: object

In [32]:
df2.memory_usage(deep=True)/len(df2)

Index                    0.440000
weight                   8.000000
t_shirt_size             8.000000
name                    70.083333
t_shirt_color_blue       1.000000
t_shirt_color_green      1.000000
t_shirt_color_orange     1.000000
t_shirt_color_red        1.000000
dtype: float64