# Getting Started

Welcome to your interactive development environment!
Here you can create documents that combine live code and visualizations for data cleaning and transformation, statistical modeling, machine learning, and much more.

---

# Upload your files

You can upload files from your local system by dragging and dropping them onto the file browser, or by clicking the “Upload Files” button at the top of the file browser.

---

# Access data from your Datalake

Let's create the connection between this environment and your Datalake.

```python
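# Install the Presto client (the leading "!" runs a shell command from the notebook)
!pip install pyhive
from pyhive import presto
import pandas as pd

# The host must be passed as a string
conn = presto.connect(host='datalake-1', port=18080, username='root')
cursor = conn.cursor()
cursor.execute("<Your query goes here>")
rows = cursor.fetchall()  # all result rows of the query
cursor.close()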
```

---

# Using Pandas DataFrame

Once connected to your Datalake, you can use the pandas library for high-performance, easy-to-use data structures and data analysis tools.

You can transform the output of your query into two-dimensional tabular data (a DataFrame).

```python
df1 = pd.DataFrame(rows)
df1.columns = [<list of your columns>]
df1
```
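
Instead of typing the column list by hand, the column names can usually be read from the cursor itself: DB-API 2.0 cursors expose a `description` attribute whose entries start with the column name. The sketch below is only an illustration under stated assumptions. It uses an in-memory SQLite database as a stand-in for the Datalake connection so that it runs anywhere, and the table `deals`, its columns, and its values are hypothetical.

```python
import sqlite3
import pandas as pd

# Stand-in connection: an in-memory SQLite database replaces the Datalake here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deals (item TEXT, amount INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO deals VALUES (?, ?)",
    [("Produto 1", 95), ("Produto 2", 63)],  # hypothetical rows
)

cursor = conn.cursor()
cursor.execute("SELECT item, amount FROM deals ORDER BY amount DESC")
rows = cursor.fetchall()
# Read the column names before closing the cursor:
# the first element of each description entry is the column name.
columns = [col[0] for col in cursor.description]
cursor.close()
conn.close()

# Build the DataFrame with the column names taken from the query result.
df_example = pd.DataFrame(rows, columns=columns)
```

With a real Datalake connection, the same pattern should apply as long as `description` is read before `cursor.close()` is called.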

You can export your DataFrame to a table in your Datalake, which can then be used to create dashboards with Superset.

```python
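from sqlalchemy import create_engine

# Connect to the 'hive' catalog on the Datalake
engine = create_engine('presto://datalake-1:18080/hive', echo=False)
# Write df1 (the DataFrame built above) to the table raw.deals, replacing it if it already exists
df1.to_sql('deals', con=engine, schema='raw', if_exists='replace', index=False, method='multi')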
```
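
To see what `if_exists='replace'` and `index=False` do without touching the Datalake, the export can be tried against a local database first. This is a minimal sketch, assuming an in-memory SQLite connection as a stand-in for the Presto engine; the DataFrame `sample` and its values are hypothetical, and the `schema` argument is omitted because the stand-in has no `raw` schema.

```python
import sqlite3
import pandas as pd

# Hypothetical sample data standing in for the DataFrame to export.
sample = pd.DataFrame({"item": ["Produto 1", "Produto 2"], "total": [257, 234]})

# Stand-in target: pandas accepts a sqlite3 connection directly.
conn = sqlite3.connect(":memory:")

# index=False keeps the DataFrame index out of the table.
sample.to_sql("deals", con=conn, if_exists="replace", index=False)
# Writing again with if_exists='replace' drops and recreates the table
# instead of appending, so the rows are not duplicated.
sample.to_sql("deals", con=conn, if_exists="replace", index=False)

# Read the table back to check what was stored.
round_trip = pd.read_sql_query("SELECT * FROM deals", conn)
conn.close()
```

After the second write, `round_trip` still holds two rows and only the `item` and `total` columns.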

## Example

To demonstrate some common operations on a pandas DataFrame, the cells below generate an example dataset.

In [None]:
!pip install pandas
import pandas as pd
import numpy as np

In [2]:
# Generate a 5x4 DataFrame of random integers from 0 to 99 (the fixed seed makes the values reproducible)
np.random.seed(101)
df = pd.DataFrame(np.random.randint(0,100,size=(5, 4)))
df.columns = ['Loja 1', 'Loja 2', 'Loja 3', 'Loja 4']
df

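|   | Loja 1 | Loja 2 | Loja 3 | Loja 4 |
|---|--------|--------|--------|--------|
| 0 | 95 | 11 | 81 | 70 |
| 1 | 63 | 87 | 75 | 9 |
| 2 | 77 | 40 | 4 | 63 |
| 3 | 40 | 60 | 92 | 64 |
| 4 | 5 | 12 | 93 | 40 |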


In [3]:
# Adding a new column at a specific position (here position 0, the first column)
df.insert(0, "Item", ['Produto 1', 'Produto 2', 'Produto 3', 'Produto 4', 'Produto 5'])
df

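|   | Item | Loja 1 | Loja 2 | Loja 3 | Loja 4 |
|---|------|--------|--------|--------|--------|
| 0 | Produto 1 | 95 | 11 | 81 | 70 |
| 1 | Produto 2 | 63 | 87 | 75 | 9 |
| 2 | Produto 3 | 77 | 40 | 4 | 63 |
| 3 | Produto 4 | 40 | 60 | 92 | 64 |
| 4 | Produto 5 | 5 | 12 | 93 | 40 |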


In [4]:
# Adding a new column as the last column
df['Total'] = [0,0,0,0,0]
df

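|   | Item | Loja 1 | Loja 2 | Loja 3 | Loja 4 | Total |
|---|------|--------|--------|--------|--------|-------|
| 0 | Produto 1 | 95 | 11 | 81 | 70 | 0 |
| 1 | Produto 2 | 63 | 87 | 75 | 9 | 0 |
| 2 | Produto 3 | 77 | 40 | 4 | 63 | 0 |
| 3 | Produto 4 | 40 | 60 | 92 | 64 | 0 |
| 4 | Produto 5 | 5 | 12 | 93 | 40 | 0 |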


In [5]:
# Modifying cell values: set each row's 'Total' to the sum of its four store columns
for i in range(0, 5):
    df.at[i, 'Total'] = df['Loja 1'][i] + df['Loja 2'][i] + df['Loja 3'][i] + df['Loja 4'][i]
df

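|   | Item | Loja 1 | Loja 2 | Loja 3 | Loja 4 | Total |
|---|------|--------|--------|--------|--------|-------|
| 0 | Produto 1 | 95 | 11 | 81 | 70 | 257 |
| 1 | Produto 2 | 63 | 87 | 75 | 9 | 234 |
| 2 | Produto 3 | 77 | 40 | 4 | 63 | 184 |
| 3 | Produto 4 | 40 | 60 | 92 | 64 | 256 |
| 4 | Produto 5 | 5 | 12 | 93 | 40 | 150 |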
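
The cell-by-cell loop shows how `df.at` works, but the same totals can also be computed in one step with a row-wise sum. The sketch below is one possible alternative, not a replacement for the cell above. It re-enters the values from the example table by hand so that it runs on its own; the names `sales` and `stores` are introduced here for illustration.

```python
import pandas as pd

# Same values as the example table, entered by hand so the snippet is standalone.
stores = ['Loja 1', 'Loja 2', 'Loja 3', 'Loja 4']
sales = pd.DataFrame(
    [
        [95, 11, 81, 70],
        [63, 87, 75, 9],
        [77, 40, 4, 63],
        [40, 60, 92, 64],
        [5, 12, 93, 40],
    ],
    columns=stores,
)

# axis=1 sums across the columns of each row, so no explicit loop is needed.
sales['Total'] = sales[stores].sum(axis=1)
```

Selecting the store columns explicitly keeps non-numeric columns such as `Item` out of the sum.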


In [6]:
# Return a specific row by integer-location index
df.iloc[0, :]

Item      Produto 1
Loja 1           95
Loja 2           11
Loja 3           81
Loja 4           70
Total           257
Name: 0, dtype: object

In [7]:
# Return a specific row by labels
df.loc[[0], ['Item', 'Loja 1', 'Loja 2', 'Loja 3', 'Loja 4', 'Total']]

|   | Item | Loja 1 | Loja 2 | Loja 3 | Loja 4 | Total |
|---|------|--------|--------|--------|--------|-------|
| 0 | Produto 1 | 95 | 11 | 81 | 70 | 257 |
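
In the example, the row labels happen to be the integers 0 to 4, so `iloc` and `loc` look interchangeable. The difference shows up once the index holds other labels. The following sketch assumes a small hypothetical DataFrame, `items`, indexed by product name, with a subset of the values used above.

```python
import pandas as pd

# Hypothetical DataFrame whose index holds product names instead of 0..n-1.
items = pd.DataFrame(
    {'Loja 1': [95, 63, 77], 'Loja 2': [11, 87, 40]},
    index=['Produto 1', 'Produto 2', 'Produto 3'],
)

# iloc selects by position: the second row, whatever its label is.
by_position = items.iloc[1]

# loc selects by label: the row whose index label is 'Produto 2'.
by_label = items.loc['Produto 2']

# Passing lists of labels returns a DataFrame instead of a Series.
subset = items.loc[['Produto 1', 'Produto 3'], ['Loja 2']]
```

Here `by_position` and `by_label` refer to the same row, while `subset` keeps the tabular shape because both selectors are lists.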
