<p align="center">
<img src='../../img/VerticaMLPython.png' width="180px">
</p>

# Vertica ML Python Exercise 5

During this exercise, we will:
<ul>
    <li> Encode data using different methods
    <li> Learn when each encoding method is suitable
</ul>

## Initialization

Let's create a cursor using the vertica_cursor function.

In [1]:
from vertica_ml_python.utilities import vertica_cursor
cur = vertica_cursor("VerticaDSN")

During this study, we will work with the well-known Titanic dataset, which is used in many exercises.

In [2]:
from vertica_ml_python import vDataframe
titanic = vDataframe('titanic', cursor = cur)
print(titanic)

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
,age,body,survived,ticket,home.dest,cabin,sex,pclass,embarked,parch,fare,name,boat,sibsp
0.0,2.000,,0,113781,"Montreal, PQ / Chesterville, ON",C22 C26,female,1,S,2,151.55000,"Allison, Miss. Helen Loraine",,1
1.0,30.000,135,0,113781,"Montreal, PQ / Chesterville, ON",C22 C26,male,1,S,2,151.55000,"Allison, Mr. Hudson Joshua Creighton",,1
2.0,25.000,,0,113781,"Montreal, PQ / Chesterville, ON",C22 C26,female,1,S,2,151.55000,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",,1
3.0,39.000,,0,112050,"Belfast, NI",A36,male,1,S,0,0.00000,"Andrews, Mr. Thomas Jr",,0
4.0,71.000,22,0,PC 17609,"Montevideo, Uruguay",,male,1,C,0,49.50420,"Artagaveytia, Mr. Ramon",,0
,...,...,...,...,...,...,...,...,...,...,...,...,...,...


<object>  Name: titanic, Number of rows: 1234, Number of columns: 14


We already saw how to clean this dataset by handling its missing values. We used the following commands.

In [3]:
titanic["fare"].dropna()
titanic["embarked"].dropna()
titanic["boat"].fillna(method = "0ifnull")
titanic["age"].fillna(method = "median", by = ["sex", "pclass"])
titanic.drop(["body", "cabin"])
titanic["home.dest"].fillna(method = "mode")

1 element was dropped
2 elements were dropped
794 elements were filled
237 elements were filled
vColumn '"body"' deleted from the vDataframe.
vColumn '"cabin"' deleted from the vDataframe.
526 elements were filled


0,1,2,3,4,5,6,7,8,9,10,11,12
,age,survived,ticket,home.dest,sex,pclass,embarked,parch,fare,name,boat,sibsp
0.0,36.0,1,17421,"New York, NY",female,1,C,0,110.88330,"Fleming, Miss. Margaret",1,0
1.0,36.0,1,PC 17611,"New York, NY",female,1,S,0,133.65000,"Frauenthal, Mrs. Henry William (Clara Heinsheimer)",1,1
2.0,36.0,1,PC 17604,"New York, NY",female,1,C,0,82.17080,"Meyer, Mrs. Edgar Joseph (Leila Saks)",1,1
3.0,36.0,1,17453,"Paris, France / New York, NY",female,1,C,0,89.10420,"Goldenberg, Mrs. Samuel L (Edwiga Grabowska)",1,1
4.0,36.0,1,19996,"London / East Orange, NJ",female,1,S,0,52.00000,"Taylor, Mrs. Elmer Zebley (Juliet Cummins Wright)",1,1
,...,...,...,...,...,...,...,...,...,...,...,...


<object>  Name: titanic, Number of rows: 1231, Number of columns: 12
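The call `fillna(method = "median", by = ["sex", "pclass"])` fills each missing age with the median age of the passenger's (sex, pclass) group. The sketch below illustrates that idea in plain Python on a few invented rows. The helper `fill_median_by_group` is hypothetical and is not how vertica_ml_python implements the operation.

```python
from collections import defaultdict
from statistics import median

# Toy rows invented for illustration (not taken from the Titanic table).
rows = [
    {"sex": "female", "pclass": 1, "age": 30.0},
    {"sex": "female", "pclass": 1, "age": 40.0},
    {"sex": "female", "pclass": 1, "age": None},
    {"sex": "male", "pclass": 3, "age": 20.0},
    {"sex": "male", "pclass": 3, "age": None},
]

def fill_median_by_group(rows, column, by):
    """Return copies of the rows with missing `column` values set to the group median."""
    # Collect the known values of each group.
    groups = defaultdict(list)
    for row in rows:
        if row[column] is not None:
            groups[tuple(row[k] for k in by)].append(row[column])
    medians = {key: median(values) for key, values in groups.items()}

    filled = []
    for row in rows:
        new_row = dict(row)
        if new_row[column] is None:
            # Stays None when the whole group has no known value.
            new_row[column] = medians.get(tuple(row[k] for k in by))
        filled.append(new_row)
    return filled

filled = fill_median_by_group(rows, "age", by=["sex", "pclass"])
print([row["age"] for row in filled])
```

Grouping before imputing keeps the filled values closer to what is plausible for each kind of passenger than a single global median would.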

It is also possible to extract the title of the Titanic passengers from the name.

In [4]:
titanic["name"].str_extract(r' ([A-Za-z]+)\.')

0,1,2,3,4,5,6,7,8,9,10,11,12
,age,survived,ticket,home.dest,sex,pclass,embarked,parch,fare,name,boat,sibsp
0.0,36.0,1,17421,"New York, NY",female,1,C,0,110.88330,Miss.,1,0
1.0,36.0,1,PC 17611,"New York, NY",female,1,S,0,133.65000,Mrs.,1,1
2.0,36.0,1,PC 17604,"New York, NY",female,1,C,0,82.17080,Mrs.,1,1
3.0,36.0,1,17453,"Paris, France / New York, NY",female,1,C,0,89.10420,Mrs.,1,1
4.0,36.0,1,19996,"London / East Orange, NJ",female,1,S,0,52.00000,Mrs.,1,1
,...,...,...,...,...,...,...,...,...,...,...,...


<object>  Name: titanic, Number of rows: 1231, Number of columns: 12
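To see what the pattern matches, it can be tried locally with Python's `re` module on a few names from the table above. This is only a sketch: it assumes that the extraction keeps the whole match (leading space and trailing period included), which is consistent with the keys passed to `decode` in Question 4. The helper `extract_title` is hypothetical.

```python
import re

# Raw string so that "\." reaches the regex engine as an escaped period.
TITLE_PATTERN = re.compile(r" ([A-Za-z]+)\.")

def extract_title(name):
    """Return the first ' Word.' match in the name, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(0) if match else None

names = [
    "Allison, Miss. Helen Loraine",
    "Andrews, Mr. Thomas Jr",
    "Artagaveytia, Mr. Ramon",
]
titles = [extract_title(name) for name in names]
print(titles)
```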

## Encoding

Let's explore the data by displaying descriptive statistics of all the columns.

In [5]:
titanic.describe(method = "categorical")

0,1,2,3,4,5
,dtype,unique,count,top,top_percent
"""age""",float,96,1231,25,13.972
"""survived""",int,2,1231,0,63.607
"""ticket""",varchar(36),885,1231,CA. 2343,0.812
"""home.dest""",varchar(100),358,1231,"New York, NY",47.766
"""sex""",varchar(20),2,1231,male,66.044
"""pclass""",int,3,1231,3,53.777
"""embarked""",varchar(20),3,1231,S,70.837
"""parch""",int,8,1231,0,76.848
"""fare""","numeric(10,5)",276,1231,8.05000,4.712


<object>
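The columns of this summary can be reproduced on a small list to make their meaning concrete. The sketch below is plain Python on invented values. The helper `describe_categorical` is hypothetical, and ignoring missing values in `count` is an assumption rather than documented library behavior.

```python
from collections import Counter

def describe_categorical(values):
    """Compute unique, count, top and top_percent for a list of values."""
    present = [v for v in values if v is not None]  # assumption: nulls are ignored
    counts = Counter(present)
    top, top_count = counts.most_common(1)[0]
    return {
        "unique": len(counts),
        "count": len(present),
        "top": top,
        # Share of rows that hold the most frequent value, in percent.
        "top_percent": round(100 * top_count / len(present), 3),
    }

# Toy values invented for illustration.
embarked = ["S", "S", "C", "S", "Q", None]
summary = describe_categorical(embarked)
print(summary)
```

A low `unique` value points to a feature that is cheap to encode, while a very high one (such as 'ticket') signals that a direct encoding would create many sparse categories.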

Many features are categorical and need to be encoded first. Let's start with the simplest ones.

<b>Question 1: </b>The feature 'sex' has two non-numerical categories. Use a label encoding to encode it.

In [6]:
titanic["sex"].label_encode()

0,1,2,3,4,5,6,7,8,9,10,11,12
,age,survived,ticket,home.dest,sex,pclass,embarked,parch,fare,name,boat,sibsp
0.0,36.0,1,17421,"New York, NY",0,1,C,0,110.88330,Miss.,1,0
1.0,36.0,1,PC 17611,"New York, NY",0,1,S,0,133.65000,Mrs.,1,1
2.0,36.0,1,PC 17604,"New York, NY",0,1,C,0,82.17080,Mrs.,1,1
3.0,36.0,1,17453,"Paris, France / New York, NY",0,1,C,0,89.10420,Mrs.,1,1
4.0,36.0,1,19996,"London / East Orange, NJ",0,1,S,0,52.00000,Mrs.,1,1
,...,...,...,...,...,...,...,...,...,...,...,...


<object>  Name: titanic, Number of rows: 1231, Number of columns: 12
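Label encoding replaces each category with an integer code. The following plain-Python sketch assumes that codes are assigned in alphabetical order, which matches the output above ('female' becomes 0) but is not confirmed as the library's rule. The helper `label_encode` is hypothetical.

```python
def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order assumed)."""
    categories = sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[v] for v in values], mapping

# Toy values invented for illustration.
sex = ["female", "male", "female", "male", "male"]
codes, mapping = label_encode(sex)
print(mapping)
print(codes)
```

With only two categories the integer codes carry no misleading order, which is why label encoding is a safe choice for 'sex'.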

<b>Question 2: </b>The feature 'embarked' has 3 categories. Why is One Hot Encoding a better choice than label encoding here? Encode this feature.

In [7]:
titanic["embarked"].get_dummies()

3 new features: "embarked_C", "embarked_Q", "embarked_S"


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,age,survived,ticket,home.dest,sex,pclass,embarked,parch,fare,name,boat,sibsp,embarked_C,embarked_Q,embarked_S
0.0,36.0,1,17421,"New York, NY",0,1,C,0,110.88330,Miss.,1,0,1,0,0
1.0,36.0,1,PC 17611,"New York, NY",0,1,S,0,133.65000,Mrs.,1,1,0,0,1
2.0,36.0,1,PC 17604,"New York, NY",0,1,C,0,82.17080,Mrs.,1,1,1,0,0
3.0,36.0,1,17453,"Paris, France / New York, NY",0,1,C,0,89.10420,Mrs.,1,1,1,0,0
4.0,36.0,1,19996,"London / East Orange, NJ",0,1,S,0,52.00000,Mrs.,1,1,0,0,1
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


<object>  Name: titanic, Number of rows: 1231, Number of columns: 15

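<p style="color:red">The cardinality of this feature is small and its categories have no natural order, so a label encoding would impose an arbitrary ranking on them. One Hot Encoding is ideal for this type of feature.</p>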
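As a rough illustration, One Hot Encoding can be sketched in plain Python as one 0/1 indicator per category. The column names follow the `embarked_C`, `embarked_Q`, `embarked_S` pattern shown in the output above. The helper `one_hot` is hypothetical and is not the library's implementation.

```python
def one_hot(values, prefix):
    """Return one dict of 0/1 indicator columns per input value."""
    categories = sorted(set(values))
    return [
        {f"{prefix}_{category}": int(value == category) for category in categories}
        for value in values
    ]

# Toy values invented for illustration.
embarked = ["C", "S", "Q"]
encoded = one_hot(embarked, "embarked")
for row in encoded:
    print(row)
```

Each new column answers a yes/no question, so no category is treated as larger or smaller than another. The cost is one extra column per category, which is why the technique suits low-cardinality features.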

<b>Question 3: </b>The feature 'name' now represents the title of each passenger. Describe it and find the most frequent categories.

In [8]:
titanic["name"].describe()

0,1
,value
name,"""name"""
dtype,varchar(164)
unique,16
Mr.,733
Miss.,227
Mrs.,184
Master.,56
Others,11
Dr.,8


<object>

<p style="color:red">The main categories are 'Mr', 'Miss', 'Mrs' and 'Master'. It is important to merge all the other categories together because ML models look for patterns. If a category is rare and its few rows all share the same response, a model will tend to always predict that response for the category.</p>

<b>Question 4: </b>Machine learning models tend to handle features with many rare categories poorly. Encode the data by merging all the rare categories into one.

In [9]:
titanic["name"].decode({" Mr.": "Mr", " Miss.": "Miss", " Mrs.": "Mrs", " Master.": "Master"}, "Rare")

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,age,survived,ticket,home.dest,sex,pclass,embarked,parch,fare,name,boat,sibsp,embarked_C,embarked_Q,embarked_S
0.0,36.0,1,17421,"New York, NY",0,1,C,0,110.88330,Miss,1,0,1,0,0
1.0,36.0,1,PC 17611,"New York, NY",0,1,S,0,133.65000,Mrs,1,1,0,0,1
2.0,36.0,1,PC 17604,"New York, NY",0,1,C,0,82.17080,Mrs,1,1,1,0,0
3.0,36.0,1,17453,"Paris, France / New York, NY",0,1,C,0,89.10420,Mrs,1,1,1,0,0
4.0,36.0,1,19996,"London / East Orange, NJ",0,1,S,0,52.00000,Mrs,1,1,0,0,1
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


<object>  Name: titanic, Number of rows: 1231, Number of columns: 15
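The mapping used above amounts to a dictionary lookup with a default value for everything that is not listed. A minimal plain-Python sketch of that idea, with a hypothetical helper named `decode` and a few invented input titles, could look like this:

```python
# Same mapping as in the cell above; the keys keep the leading space and the period.
TITLE_MAP = {" Mr.": "Mr", " Miss.": "Miss", " Mrs.": "Mrs", " Master.": "Master"}

def decode(values, mapping, default):
    """Replace each value by its mapped label, or by `default` when it is not listed."""
    return [mapping.get(value, default) for value in values]

# Toy input invented for illustration.
titles = [" Mr.", " Dr.", " Miss.", " Rev.", " Master."]
merged = decode(titles, TITLE_MAP, "Rare")
print(merged)
```

Listing only the frequent titles and sending the rest to a single bucket means that a new, unseen title also lands in 'Rare' instead of creating another category.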

<b>Question 5: </b>As we need numerical values, mean encoding can be a way to encode the result. We want to predict the passengers' survival, so the response is the feature 'survived'. Use a mean encoding to encode the feature obtained in the previous question.

In [10]:
titanic["name"].mean_encode("survived")

The mean encoding was successfully done.


0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,age,survived,ticket,home.dest,sex,pclass,embarked,parch,fare,name,boat,sibsp,embarked_C,embarked_Q,embarked_S
0.0,14.5,0,CA. 2343,"New York, NY",1,3,S,2,69.55000,0.482142857142857,0,8,0,0,1
1.0,13.0,0,C.A. 2673,"East Providence, RI",1,3,S,2,20.25000,0.482142857142857,0,0,0,0,1
2.0,13.0,0,347077,"Sweden Worcester, MA",1,3,S,2,31.38750,0.482142857142857,0,4,0,0,1
3.0,12.0,1,2651,"New York, NY",1,3,C,0,11.24170,0.482142857142857,1,1,1,0,0
4.0,11.5,0,A/5. 851,"New York, NY",1,3,S,1,14.50000,0.482142857142857,0,1,0,0,1
,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...


<object>  Name: titanic, Number of rows: 1231, Number of columns: 15
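Mean encoding replaces each category with the average of the response over the rows of that category. In the output above, the value 0.482142857142857 equals 27/56, which appears to correspond to the 56 'Master' passengers counted in Question 3. The sketch below shows the computation in plain Python on invented data. The helper `mean_encode` is hypothetical and is not the library's implementation.

```python
from collections import defaultdict

def mean_encode(categories, response):
    """Replace each category by the mean of the response within that category."""
    totals = defaultdict(lambda: [0.0, 0])  # category -> [sum of response, row count]
    for category, y in zip(categories, response):
        totals[category][0] += y
        totals[category][1] += 1
    means = {category: total / n for category, (total, n) in totals.items()}
    return [means[category] for category in categories], means

# Toy data invented for illustration.
titles = ["Mr", "Mr", "Mrs", "Mrs", "Mr", "Mr"]
survived = [0, 1, 1, 1, 0, 0]
encoded, means = mean_encode(titles, survived)
print(means)
print(encoded)
```

Because the encoded value is computed from the response itself, it keeps the feature in a single numerical column, whatever the number of categories.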