
# Categorical Variables and One-hot Encoding



In [0]:
import pandas as pd

In [3]:
# Create an example dataframe

df = pd.DataFrame({'ascites' : [0,1, 0, 1 ],
                   'edema' : [0.5, 0, 1, 0.5],
                   'stage' : [3, 4, 3, 4],
                   'cholesterol' : [200.5,180.2,190.5,210.3]}
                  )
df

Unnamed: 0,ascites,edema,stage,cholesterol
0,0,0.5,3,200.5
1,1,0.0,4,180.2
2,0,1.0,3,190.5
3,1,0.5,4,210.3


In this small example dataframe we can see three categorical variables:
*   'ascites' -- its values are either 0 or 1
*   'edema' -- its values are 0, 0.5, or 1
*   'stage' -- its values are either 3 or 4

'cholesterol' is not a categorical variable; it is a continuous variable that can take any value greater than 0.
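
As a rough check of the reading above, the number of distinct values in each column can be counted with `DataFrame.nunique()`. This is only a heuristic sketch: a small number of distinct values suggests a categorical variable but does not prove it, and with just four rows the counts say little about a real dataset. The variable name `unique_counts` is introduced here for illustration.

```python
import pandas as pd

# Same example dataframe as above
df = pd.DataFrame({'ascites': [0, 1, 0, 1],
                   'edema': [0.5, 0, 1, 0.5],
                   'stage': [3, 4, 3, 4],
                   'cholesterol': [200.5, 180.2, 190.5, 210.3]})

# Count the distinct values in each column (hypothetical helper variable)
unique_counts = df.nunique()
print(unique_counts)
```

Here 'ascites' and 'stage' each have two distinct values and 'edema' has three, while every 'cholesterol' value is different.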

### Which categorical variables to one-hot encode?
We can see that:
*    Ascites' values are already either 1 or 0, so there is no need to one-hot encode it. 1 means the condition is present and 0 means it is absent.
*   Edema (swelling in any part of the body) has 3 categories, represented by the values 0, 0.5 and 1. We can one-hot encode this variable so that there is one feature column for each of the possible values:
   *   0 -- no edema
   *   0.5 -- patient has edema but has not received diuretic treatment
   *   1 -- patient has edema despite receiving diuretic treatment, which indicates that the condition may be more severe
*   Stage has only the values 3 and 4 in this exercise, so it can be one-hot encoded into columns whose values are 0 and 1. Normally the values range from 0 to 4 and they represent:
   *   Stage 0 -- patient has no cancer
   *   Stage 1 -- patient has cancer that is limited to a small area of the body, "early stage cancer"
   *   Stage 2 -- patient's cancer has spread to nearby tissues
   *   Stage 3 -- patient's cancer has spread more aggressively
   *   Stage 4 -- patient's cancer has spread to distant parts of the body, "metastatic cancer"

To one-hot encode a feature variable we use pandas' `get_dummies()` function, passing the dataframe as `data` and a list of the column names to encode as `columns`.


In [5]:
# One-hot encode the 'stage' feature

df_stage = pd.get_dummies(data=df, columns=['stage'])
df_stage

Unnamed: 0,ascites,edema,cholesterol,stage_3,stage_4
0,0,0.5,200.5,1,0
1,1,0.0,180.2,0,1
2,0,1.0,190.5,1,0
3,1,0.5,210.3,0,1


In [6]:
df_stage[['stage_3', 'stage_4']]

Unnamed: 0,stage_3,stage_4
0,1,0
1,0,1
2,1,0
3,0,1
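
The same call should work for the three-valued 'edema' feature, producing one column per category. The sketch below is not part of the original cells; the name `df_edema` is introduced for illustration, and `dtype=int` is passed as an assumption so the new columns hold 0/1 integers regardless of the pandas version (newer versions default to booleans).

```python
import pandas as pd

# Same example dataframe as above
df = pd.DataFrame({'ascites': [0, 1, 0, 1],
                   'edema': [0.5, 0, 1, 0.5],
                   'stage': [3, 4, 3, 4],
                   'cholesterol': [200.5, 180.2, 190.5, 210.3]})

# One-hot encode 'edema'; dtype=int keeps the indicator columns as 0/1 integers
df_edema = pd.get_dummies(data=df, columns=['edema'], dtype=int)

# The indicator columns are named '<column>_<value>'
edema_cols = [c for c in df_edema.columns if c.startswith('edema_')]
print(df_edema[edema_cols])
```

Each row has exactly one of the three 'edema' indicator columns set to 1.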


### Multicollinearity of one-hot encoded features

Looking at the result of the one-hot encoding, it becomes clear (because there are only two possible values, 0 and 1) that if the value of `stage_3` is $1$ the value of `stage_4` is $0$, and if the value of `stage_4` is $1$ the value of `stage_3` is $0$. Each column is fully determined by the other, so one of them is redundant and can be dropped from the dataframe without losing information. We drop this redundant column to prevent multicollinearity (where one feature can be predicted from the other features).
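
The redundancy can be checked directly on the example data. The following is a minimal sketch; the names `row_sums` and `is_redundant` are introduced here for illustration, and `dtype=int` is an assumption made so that the arithmetic works on 0/1 integers in any pandas version.

```python
import pandas as pd

# Same example dataframe as above
df = pd.DataFrame({'ascites': [0, 1, 0, 1],
                   'edema': [0.5, 0, 1, 0.5],
                   'stage': [3, 4, 3, 4],
                   'cholesterol': [200.5, 180.2, 190.5, 210.3]})

df_stage = pd.get_dummies(data=df, columns=['stage'], dtype=int)

# Every row belongs to exactly one stage, so the two indicators sum to 1
row_sums = df_stage['stage_3'] + df_stage['stage_4']

# Equivalently, stage_3 can be reconstructed from stage_4 alone
is_redundant = (df_stage['stage_3'] == 1 - df_stage['stage_4']).all()
print(row_sums.tolist(), is_redundant)
```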
There are two ways to drop the column:
1.   By passing the argument `drop_first=True`   
2.   Or by using `.drop(columns='column_name')`



In [7]:
# Drop column using method 1

df_stage_drop_mthd1 = pd.get_dummies(data=df,
                                     columns=['stage'],
                                     drop_first=True)
df_stage_drop_mthd1

Unnamed: 0,ascites,edema,cholesterol,stage_4
0,0,0.5,200.5,0
1,1,0.0,180.2,1
2,0,1.0,190.5,0
3,1,0.5,210.3,1


In [8]:
# Drop column using method 2

df_stage_drop_mthd2 = df_stage.drop(columns='stage_3')
df_stage_drop_mthd2

Unnamed: 0,ascites,edema,cholesterol,stage_4
0,0,0.5,200.5,0
1,1,0.0,180.2,1
2,0,1.0,190.5,0
3,1,0.5,210.3,1
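
Both methods give the same table here, which can be confirmed with `DataFrame.equals()`. This is a sketch under the same assumptions as the cells above; the names `df_method1`, `df_method2` and `same` are introduced for illustration, and `dtype=int` is passed so both results use the same integer dtype.

```python
import pandas as pd

# Same example dataframe as above
df = pd.DataFrame({'ascites': [0, 1, 0, 1],
                   'edema': [0.5, 0, 1, 0.5],
                   'stage': [3, 4, 3, 4],
                   'cholesterol': [200.5, 180.2, 190.5, 210.3]})

# Method 1: let get_dummies drop the first category
df_method1 = pd.get_dummies(data=df, columns=['stage'],
                            drop_first=True, dtype=int)

# Method 2: encode every category, then drop one column explicitly
df_method2 = pd.get_dummies(data=df, columns=['stage'],
                            dtype=int).drop(columns='stage_3')

# equals() compares shape, column labels, dtypes and values
same = df_method1.equals(df_method2)
print(same)
```

One practical difference: `drop_first=True` always removes the first category of each encoded column, whereas `.drop()` lets you choose which column to remove.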
