# Categorical Features
There are two types of categorical features:<br>
Ordinal Variable: Discrete values that can be ordered. <mark>Example: small \< medium \< large</mark><br>
Nominal Variable: Discrete values that have no ordering. <mark>Example: Brown, Blue, Green</mark><br>

**Definition: Cardinality is the number of distinct elements in a set. For our purposes, it is the number of unique values in a column.**<br>
This notebook uses concepts outlined in Chapter 4 of _Python Machine Learning_ by Sebastian Raschka.
    

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns

import pandas as pd
import numpy as np
import utils as ut

## Generate a t-shirt order<br>
Each order has a name, a t-shirt size, a t-shirt color, and a weight (in pounds).<br>
Uses the <a href="https://pypi.org/project/names/">names</a> module to generate random names.

In [2]:
dir(ut)

['PROCESSED_DATA',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'generate_tshirt_order',
 'names',
 'np',
 'pd']

In [3]:
import utils as ut
df = ut.generate_tshirt_order()
df

Unnamed: 0,weight,t_shirt_size,t_shirt_color,name
0,104.223342,small,black,Joshua Kern
1,79.555350,small,red,Annie Labine
2,99.122399,small,red,Betty Leonard
3,100.165451,small,green,Lillie Mccarley
4,96.266007,small,red,Joel Stern
...,...,...,...,...
295,160.484796,large,black,Frances Camacho
296,156.060992,large,green,David Livesay
297,125.988215,large,black,Shawn Barraclough
298,173.009852,large,green,Heather Giordano


## Get a list of all categorical variables
These are usually strings (which show up as object in dtypes) and booleans (which show up as bool in dtypes).
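The dtype-based selection described above can be sketched with `select_dtypes`. This is a minimal illustration on a made-up frame, not the notebook's data; the `is_gift` column is hypothetical and exists only to show a bool column being picked up. The string columns are given an explicit object dtype so the result does not depend on the pandas version's default string dtype.

```python
import pandas as pd

# Toy frame standing in for the t-shirt order (all values are made up)
toy = pd.DataFrame({
    "weight": [104.2, 79.6, 150.3],
    "t_shirt_size": pd.Series(["small", "small", "large"], dtype="object"),
    "t_shirt_color": pd.Series(["black", "red", "green"], dtype="object"),
    "is_gift": [True, False, True],  # hypothetical bool column
})

# Keep only the columns whose dtype is object (strings) or bool
categorical_cols = toy.select_dtypes(include=["object", "bool"]).columns.tolist()
print(categorical_cols)
```

A column selected this way is only a candidate: a high-cardinality object column such as name is not a useful categorical feature, which is why the unique counts are checked next.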

In [4]:
len(df)

300

In [5]:
df.dtypes

weight           float64
t_shirt_size      object
t_shirt_color     object
name              object
dtype: object

In [6]:
# and how many unique values each column has
df.nunique()

weight           300
t_shirt_size       3
t_shirt_color      5
name             300
dtype: int64

## We have 3 object columns, of which t_shirt_size and t_shirt_color are low-cardinality categorical variables

### Ordinal Categorical values
Ordering matters for t_shirt_size given that <br>
<mark> small \< medium \< large </mark><br>
So make t_shirt_size ordinal, and map these strings to numbers that respect the above inequality.  This will also help any ML algorithm using this data to interpret it correctly. <br>
Replace the values in the t_shirt_size column with the following <br>
small:0, medium:1, large:2 <br><br>
Advantages
<ul>
    <li>Establishes a numerical order
    <li>Does not add new columns to DataFrame 
   </ul>
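As an alternative to a hand-written dict, pandas can carry the order itself through an ordered categorical dtype. The sketch below is not the approach this notebook takes; it uses a small made-up Series and only illustrates that the declared order produces the same 0, 1, 2 codes.

```python
import pandas as pd

# Made-up sample of sizes, not the notebook's DataFrame
sizes = pd.Series(["small", "large", "medium", "small"])

# Declare the order explicitly: small < medium < large
size_dtype = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes_cat = sizes.astype(size_dtype)

# .cat.codes gives each value's integer position in the declared order
codes = sizes_cat.cat.codes
print(codes.tolist())  # [0, 2, 1, 0]

# Ordered categoricals also support comparisons against a category
smaller_than_large = sizes_cat < "large"
print(smaller_than_large.tolist())  # [True, False, True, True]
```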



In [7]:
# let's get a set of all possible values
vals = set(df.t_shirt_size)  # list(df.t_shirt_size.unique()) works too
print(f'original t shirt sizes {vals}')

# if there is an order you generally have to specify it by hand; easy when there are 3 values, harder when there are 30
# I copied the resulting set from above and gave each member a value based on my domain expertise
# What would you do if they were small, medium, mediumplus and large?  You don't have to use integers.
# Maybe something like vals={'large':2.0, 'mediumplus':1.2, 'medium':1.0, 'small':0.0}
vals={'large':2, 'medium':1, 'small':0}
print(f'mapping used to convert original t_shirt sizes to numbers {vals}')

original t shirt sizes {'small', 'medium', 'large'}
mapping used to convert original t_shirt sizes to numbers {'large': 2, 'medium': 1, 'small': 0}


In [8]:
#map the vals dict to the t_shirt_size column, this is quite fast
df.t_shirt_size = df.t_shirt_size.map(vals)

#can do the same thing above this way
# df['t_shirt_size'] = df['t_shirt_size'].map(vals)
df

Unnamed: 0,weight,t_shirt_size,t_shirt_color,name
0,104.223342,0,black,Joshua Kern
1,79.555350,0,red,Annie Labine
2,99.122399,0,red,Betty Leonard
3,100.165451,0,green,Lillie Mccarley
4,96.266007,0,red,Joel Stern
...,...,...,...,...
295,160.484796,2,black,Frances Camacho
296,156.060992,2,green,David Livesay
297,125.988215,2,black,Shawn Barraclough
298,173.009852,2,green,Heather Giordano


Notice that we do **not** increase the total number of columns when we do this.
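One thing to watch with a dict-based map: any value missing from the dict becomes NaN rather than raising an error. The sketch below uses a made-up Series with an invented "x-large" entry to show one way to surface such values before they propagate.

```python
import pandas as pd

vals = {"large": 2, "medium": 1, "small": 0}

# "x-large" is a made-up value that the mapping does not cover
sizes = pd.Series(["small", "medium", "x-large"])

mapped = sizes.map(vals)

# Rows that came back as NaN point at values missing from the mapping
unmapped = sizes[mapped.isna()].unique().tolist()
print(unmapped)  # ['x-large']
```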

In [10]:
df.dtypes

weight           float64
t_shirt_size       int64
t_shirt_color     object
name              object
dtype: object

In [11]:
vals

{'large': 2, 'medium': 1, 'small': 0}

In [12]:
# If you want to reverse the above mapping create a reverse mapping and map to df
reverse_mapping = {v:k for k,v in vals.items()}
print(f'reverse mapping used to convert numbers back to original t_shirt sizes {reverse_mapping}')

#apply reverse mapping to get back original values
# df.t_shirt_size = df.t_shirt_size.map(reverse_mapping)
# df

reverse mapping used to convert numbers back to original t_shirt sizes {2: 'large', 1: 'medium', 0: 'small'}
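A quick way to confirm the reverse mapping is correct is a round trip: encode, decode, and compare with the starting values. This is a minimal sketch on a made-up Series rather than on df.

```python
import pandas as pd

vals = {"large": 2, "medium": 1, "small": 0}
reverse_mapping = {v: k for k, v in vals.items()}

# Made-up sample of sizes
sizes = pd.Series(["small", "large", "medium"])

encoded = sizes.map(vals)              # strings -> numbers
decoded = encoded.map(reverse_mapping) # numbers -> strings

print(encoded.tolist())  # [0, 2, 1]
print(decoded.tolist() == sizes.tolist())  # True
```

The round trip only works because the mapping is one-to-one; if two sizes shared a number, the dict comprehension would silently keep just one of them.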


### Nominal Categorical values
Ordering does not matter for t_shirt_color. If we did what we did for the ordinal feature, that is, map each unique value to a number, we would be establishing an order. Like this:<br>
'green':0,'blue':1,'orange':2,'red':3,'black':4<br>
To an ML algorithm this may look like 'green' \< 'blue' \< 'orange' \< 'red' \< 'black', which is nonsense.<br>

One way to solve this is <mark>one-hot encoding</mark>, a technique where a new column is created for each possible value of the nominal variable. This operation **will** increase the number of features (columns) in your dataset by the cardinality of the column minus 1: a column with n unique values adds n dummy features and removes the original feature.<br>
To implement it, use the pandas built-in get_dummies function.<br><br>
Advantages
<ul>
    <li>Ensures an ML model will not infer an ordering
   </ul>
Disadvantages
<ul>
    <li>Expands the feature space (adds n-1 columns if the nominal variable has n unique values).  So high cardinality columns can dramatically expand feature space. 
    <li>Often works less well with tree-based models (Random Forest, Boosted Trees), since each sparse dummy column can only split off one category at a time.
   </ul>
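The column arithmetic above (n dummy columns added, one original column removed, for a net gain of n-1) can be checked on a small made-up frame. This is only a sketch with invented values.

```python
import pandas as pd

# Made-up frame with one numeric and one nominal column
toy = pd.DataFrame({
    "weight": [104.2, 79.6, 150.3, 120.0],
    "t_shirt_color": ["black", "red", "green", "red"],
})

n_unique = toy["t_shirt_color"].nunique()  # 3 distinct colors

encoded = pd.get_dummies(toy, columns=["t_shirt_color"])

# Net change in column count: n dummies added, 1 original column dropped
added = encoded.shape[1] - toy.shape[1]
print(n_unique, added)  # 3 2
```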


In [34]:
df

Unnamed: 0,weight,t_shirt_size,t_shirt_color,name
0,81.179952,0,red,Mary Mcniel
1,103.493753,0,blue,Rodney Vick
2,94.204191,0,red,Elmer Hickman
3,106.881369,0,red,Mary Snyder
4,95.933206,0,orange,Catherine Ishak
...,...,...,...,...
295,192.876132,2,black,Lester Malin
296,104.387189,2,green,Janice Scott
297,168.322953,2,blue,Lawanda Parker
298,103.020549,2,green,Ethel Grice


In [13]:
# get_dummies returns a new DataFrame, so df itself is left unchanged
df2 = pd.get_dummies(df, columns=['t_shirt_color'])
df2

Unnamed: 0,weight,t_shirt_size,name,t_shirt_color_black,t_shirt_color_blue,t_shirt_color_green,t_shirt_color_orange,t_shirt_color_red
0,104.223342,0,Joshua Kern,True,False,False,False,False
1,79.555350,0,Annie Labine,False,False,False,False,True
2,99.122399,0,Betty Leonard,False,False,False,False,True
3,100.165451,0,Lillie Mccarley,False,False,True,False,False
4,96.266007,0,Joel Stern,False,False,False,False,True
...,...,...,...,...,...,...,...,...
295,160.484796,2,Frances Camacho,True,False,False,False,False
296,156.060992,2,David Livesay,False,False,True,False,False
297,125.988215,2,Shawn Barraclough,True,False,False,False,False
298,173.009852,2,Heather Giordano,False,False,True,False,False


Notice that the t_shirt_color column has been replaced with 5 columns: t_shirt_color_black...<br>
Note also that in each row exactly 1 of these 5 columns is True (1) and the rest are False (0), and that there is no longer any order to infer among the colors.
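The "exactly one column is set per row" property can be verified by summing the dummy columns across each row. The sketch below uses a made-up frame rather than df2.

```python
import pandas as pd

# Made-up nominal column
toy = pd.DataFrame({"t_shirt_color": ["black", "red", "green", "red"]})
dummies = pd.get_dummies(toy, columns=["t_shirt_color"])

# Summing across a row counts how many dummy columns are set for that row
hot_per_row = dummies.sum(axis=1)
print(hot_per_row.tolist())  # [1, 1, 1, 1]
```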