<p align="center">
<img src='../../img/VerticaMLPython.png' width="180px">
</p>

# Vertica ML Python Utilities

During this exercise, we will see how to:
<ul>
    <li> save the current state of a Virtual Dataframe</li>
    <li> store a Virtual Dataframe in a vdf file</li>
    <li> load a Virtual Dataframe</li>
    <li> display the generated SQL</li>
</ul>

## Initialization

Let's create a cursor to the database using a DSN.

In [1]:
from vertica_ml_python.utilities import vertica_cursor
cur = vertica_cursor("VerticaDSN")

Let's use the Amazon dataset introduced in Exercise 0.

In [2]:
from vertica_ml_python.learn.datasets import load_amazon
amazon = load_amazon(cur)

## Utilities

Let's apply some transformations to the dataset.

In [3]:
amazon.cumsum("cum_sum", "number", by = ["state"], order_by = ["date"])["state"].label_encode()

The new vColumn "cum_sum" was added to the vDataframe.


| | year | number | date | month | state | cum_sum |
|---|---|---|---|---|---|---|
| 0 | 1998 | 509 | 1998-01-01 | September | 0 | 509 |
| 1 | 1998 | 130 | 1998-01-01 | August | 0 | 639 |
| 2 | 1998 | 44 | 1998-01-01 | October | 0 | 683 |
| 3 | 1998 | 37 | 1998-01-01 | July | 0 | 720 |
| 4 | 1998 | 7 | 1998-01-01 | December | 0 | 727 |
| | ... | ... | ... | ... | ... | ... |


<object>  Name: amazon, Number of rows: 6454, Number of columns: 6
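To make the `cumsum` step concrete, here is a minimal pure-Python sketch of a running total computed per group and in a given order. It is not the library's implementation (Vertica ML Python pushes the computation to the database); the helper `grouped_cumsum` is hypothetical, the first three rows reuse values from the output above with the state written out as a name, and the last row is made up for illustration.

```python
from collections import defaultdict

def grouped_cumsum(rows, value, by, order_by):
    """Return copies of the rows with a running total of `value` inside each `by` group."""
    totals = defaultdict(int)
    out = []
    # Sort by group, then by the ordering column. sorted() is stable,
    # so rows with identical keys keep their input order.
    for row in sorted(rows, key=lambda r: (r[by], r[order_by])):
        totals[row[by]] += row[value]
        out.append({**row, "cum_sum": totals[row[by]]})
    return out

rows = [
    {"state": "Acre", "date": "1998-01-01", "number": 509},
    {"state": "Acre", "date": "1998-01-01", "number": 130},
    {"state": "Acre", "date": "1998-01-01", "number": 44},
    {"state": "Alagoas", "date": "1998-01-01", "number": 10},  # made-up row
]
result = grouped_cumsum(rows, "number", "state", "date")
print([r["cum_sum"] for r in result])  # [509, 639, 683, 10]
```

The running total restarts for each state, which is what the `by = ["state"]` argument expresses in the cell above.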

You can display the generated SQL code by toggling it with the 'sql_on_off' method. Let's turn it on and describe the Virtual Dataframe.

In [4]:
amazon.sql_on_off()
amazon.describe()

| | count | mean | std | min | 25% | 50% | 75% | max | unique |
|---|---|---|---|---|---|---|---|---|---|
| cum_sum | 6454 | 85027.7595289746 | 145749.250695705 | 12.0 | 6498.66666666667 | 28841.25 | 85335.75 | 818902.0 | 5652 |
| number | 6454 | 553.687325689495 | 1592.64987447327 | 0.0 | 9.0 | 55.0 | 282.5 | 25963.0 | 1475 |
| state | 6454 | 11.4057948559033 | 6.27449760565551 | 0.0 | 6.0 | 12.0 | 16.0 | 22.0 | 23 |
| year | 6454 | 2007.46172916021 | 5.74665355968707 | 1998.0 | 2002.0 | 2007.0 | 2012.0 | 2017.0 | 20 |


<object>
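The idea behind a describe call that can also show its SQL can be sketched with Python's built-in `sqlite3` module standing in for Vertica. Everything here is an assumption made for illustration: the table name `fires`, the `describe` helper, its `show_sql` flag, and the query text are hypothetical and are not the SQL that Vertica ML Python actually generates. The five values are the `number` column from the first rows shown earlier.

```python
import sqlite3

# In-memory database used as a stand-in for a Vertica connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fires (state TEXT, number INTEGER)")
conn.executemany(
    "INSERT INTO fires VALUES (?, ?)",
    [("Acre", 509), ("Acre", 130), ("Acre", 44), ("Acre", 37), ("Acre", 7)],
)

def describe(conn, table, column, show_sql=False):
    """Compute a few summary statistics in the database, optionally printing the query."""
    query = (
        f'SELECT COUNT("{column}"), AVG("{column}"), '
        f'MIN("{column}"), MAX("{column}") FROM "{table}"'
    )
    if show_sql:
        print(query)  # plays the role of turning SQL display on
    count, mean, lowest, highest = conn.execute(query).fetchone()
    return {"count": count, "mean": mean, "min": lowest, "max": highest}

stats = describe(conn, "fires", "number", show_sql=True)
print(stats)
```

The aggregation runs inside the database and only the small result comes back to Python, which is the general pattern a virtual dataframe relies on.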

You can review all the modifications made to the object with the 'info' method:

In [5]:
amazon.info()

The vDataframe was modified many times: 
 * {Thu Dec  5 20:06:25 2019} [Eval]: A new vColumn '"cum_sum"' was added to the vDataframe.
 * {Thu Dec  5 20:06:25 2019} [Label Encoding]: Label Encoding was applied to the vColumn '"state"' using the following mapping:
	Acre => 0	Alagoas => 1	Amapa => 2	Amazonas => 3	Bahia => 4	Ceara => 5	Distrito Federal => 6	Espirito Santo => 7	Goias => 8	Maranhao => 9	Mato Grosso => 10	Minas Gerais => 11	Para => 12	Paraiba => 13	Pernambuco => 14	Piau => 15	Rio => 16	Rondonia => 17	Roraima => 18	Santa Catarina => 19	Sao Paulo => 20	Sergipe => 21	Tocantins => 22


| | year | number | date | month | state | cum_sum |
|---|---|---|---|---|---|---|
| 0 | 1998 | 509 | 1998-01-01 | September | 0 | 509 |
| 1 | 1998 | 130 | 1998-01-01 | August | 0 | 639 |
| 2 | 1998 | 44 | 1998-01-01 | October | 0 | 683 |
| 3 | 1998 | 37 | 1998-01-01 | July | 0 | 720 |
| 4 | 1998 | 7 | 1998-01-01 | December | 0 | 727 |
| | ... | ... | ... | ... | ... | ... |


<object>  Name: amazon, Number of rows: 6454, Number of columns: 6
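The mapping printed by 'info' assigns integers to the state names in alphabetical order. A minimal sketch of that kind of label encoding is shown below. The `label_encode` function is a hypothetical helper, and sorting the distinct values is an assumption that happens to match the mapping above; the library's actual rule is not stated in this exercise.

```python
def label_encode(values):
    """Map each distinct label to an integer code, in sorted order."""
    mapping = {label: code for code, label in enumerate(sorted(set(values)))}
    return mapping, [mapping[v] for v in values]

mapping, codes = label_encode(["Bahia", "Acre", "Amapa", "Acre", "Alagoas"])
print(mapping)  # {'Acre': 0, 'Alagoas': 1, 'Amapa': 2, 'Bahia': 3}
print(codes)    # [3, 0, 2, 0, 1]
```

The codes depend on which distinct values are present: with only four states in this toy input, "Bahia" gets code 3 rather than the 4 it receives in the full dataset.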

You can also save the current state of the vDataframe to load it later. For example, let's save it and then filter the data.

In [6]:
amazon.save()
amazon.filter("state = 0")

6215 elements were filtered


| | year | number | date | month | state | cum_sum |
|---|---|---|---|---|---|---|
| 0 | 1998 | 509 | 1998-01-01 | September | 0 | 509 |
| 1 | 1998 | 130 | 1998-01-01 | August | 0 | 639 |
| 2 | 1998 | 44 | 1998-01-01 | October | 0 | 683 |
| 3 | 1998 | 37 | 1998-01-01 | July | 0 | 720 |
| 4 | 1998 | 7 | 1998-01-01 | December | 0 | 727 |
| | ... | ... | ... | ... | ... | ... |


<object>  Name: amazon, Number of rows: 239, Number of columns: 6

Finally, let's go back to the previously saved state.

In [7]:
amazon = amazon.load()
print(amazon)

| | year | number | date | month | state | cum_sum |
|---|---|---|---|---|---|---|
| 0 | 1998 | 509 | 1998-01-01 | September | 0 | 509 |
| 1 | 1998 | 130 | 1998-01-01 | August | 0 | 639 |
| 2 | 1998 | 44 | 1998-01-01 | October | 0 | 683 |
| 3 | 1998 | 37 | 1998-01-01 | July | 0 | 720 |
| 4 | 1998 | 7 | 1998-01-01 | December | 0 | 727 |
| | ... | ... | ... | ... | ... | ... |


<object>  Name: amazon, Number of rows: 6454, Number of columns: 6
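The save, filter, load sequence above behaves like a snapshot: the filter is undone once the saved state is loaded again. The toy class below sketches that behavior under stated assumptions. `Snapshotting` is a hypothetical stand-in that only tracks a list of filter conditions; it says nothing about how the vDataframe stores its state internally.

```python
import copy

class Snapshotting:
    """Hypothetical object with save()/load() semantics similar to the cells above."""

    def __init__(self, filters=None):
        self.filters = list(filters or [])
        self._saved = None

    def filter(self, condition):
        # Record the condition; a real virtual dataframe would add it to its SQL.
        self.filters.append(condition)
        return self

    def save(self):
        # Keep an independent copy so later changes do not alter the snapshot.
        self._saved = copy.deepcopy(self.filters)
        return self

    def load(self):
        if self._saved is None:
            raise ValueError("nothing was saved")
        # Return a new object built from the snapshot, mirroring `amazon = amazon.load()`.
        return Snapshotting(self._saved)

toy = Snapshotting()
toy.save()
toy.filter("state = 0")
restored = toy.load()
print(toy.filters, restored.filters)  # ['state = 0'] []
```

Because `load` returns a new object, the result has to be assigned back to a variable, as the cell above does.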


You can also write the vDataframe to a vdf file to share it with your team or to resume your work in a later session.

In [8]:
amazon.to_vdf("my_vdf")

| | year | number | date | month | state | cum_sum |
|---|---|---|---|---|---|---|
| 0 | 1998 | 509 | 1998-01-01 | September | 0 | 509 |
| 1 | 1998 | 130 | 1998-01-01 | August | 0 | 639 |
| 2 | 1998 | 44 | 1998-01-01 | October | 0 | 683 |
| 3 | 1998 | 37 | 1998-01-01 | July | 0 | 720 |
| 4 | 1998 | 7 | 1998-01-01 | December | 0 | 727 |
| | ... | ... | ... | ... | ... | ... |


<object>  Name: amazon, Number of rows: 6454, Number of columns: 6

In [9]:
from vertica_ml_python import read_vdf
amazon = read_vdf("my_vdf.vdf", cur)
print(amazon)

| | year | number | date | month | state | cum_sum |
|---|---|---|---|---|---|---|
| 0 | 1998 | 509 | 1998-01-01 | September | 0 | 509 |
| 1 | 1998 | 130 | 1998-01-01 | August | 0 | 639 |
| 2 | 1998 | 44 | 1998-01-01 | October | 0 | 683 |
| 3 | 1998 | 37 | 1998-01-01 | July | 0 | 720 |
| 4 | 1998 | 7 | 1998-01-01 | December | 0 | 727 |
| | ... | ... | ... | ... | ... | ... |


<object>  Name: amazon, Number of rows: 6454, Number of columns: 6
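The file round trip above can be illustrated with a generic sketch that writes an object's description to disk and reads it back. This exercise does not describe what a vdf file contains, so the example uses Python's `pickle` module and a made-up `state` dictionary purely as an assumption; `write_state`, `read_state`, and the file name are hypothetical.

```python
import os
import pickle
import tempfile

def write_state(state, path):
    """Serialize a plain Python description of the object to a file."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def read_state(path):
    """Read the description back from the file."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Made-up description: a table name plus the transformations applied to it.
state = {"table": "amazon", "transformations": ["cum_sum", "label_encode(state)"]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "my_state.pkl")
    write_state(state, path)
    restored = read_state(path)

print(restored == state)  # True
```

A live database cursor cannot be written to a file this way, which is presumably why `read_vdf` takes `cur` as a second argument when the file is read back.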
