# Using FugueSQL on Coiled Dask Clusters

In this notebook we discuss [fugue-sql](https://docs.dask.org/en/latest/dataframe-sql.html#does-dask-implement-sql), an abstraction layer that lets users run SQL queries on top of Pandas, Spark, and Dask dataframes. fugue-sql is part of the broader [fugue project](https://github.com/fugue-project/fugue), which aims to be an abstraction layer for distributed compute workflows. Fugue has both a Python and a SQL interface, and users choose the execution engine simply by naming it (for example, `%%fsql dask` later in this notebook).

<img src="https://raw.githubusercontent.com/fugue-project/fugue/master/images/logo.svg" align="left" width="250"/>

In [1]:
import coiled

cluster = coiled.Cluster(
    n_workers=10,
    software="kvnkho/fugue-sql",
)
cluster

# cluster = coiled.Cluster(n_workers=10)
# cluster


Checking environment images
Valid environment image found



In [2]:
from dask.distributed import Client

client = Client(cluster)
client


+-------------+-----------+-----------+-----------+
| Package     | client    | scheduler | workers   |
+-------------+-----------+-----------+-----------+
| dask        | 2021.04.0 | 2021.03.1 | 2021.03.1 |
| distributed | 2021.04.0 | 2021.03.1 | 2021.03.1 |
+-------------+-----------+-----------+-----------+


Client
  Scheduler: tls://ec2-3-16-180-175.us-east-2.compute.amazonaws.com:8786
  Dashboard: http://ec2-3-16-180-175.us-east-2.compute.amazonaws.com:8787
Cluster
  Workers: 10
  Cores: 40
  Memory: 160.00 GiB


## Setup

fugue-sql is enabled in notebooks by calling the `fugue_notebook.setup` function. This provides syntax highlighting for fugue-sql cells and lets us use the `%%fsql` cell magic.

At the moment, the notebook extension is only available for classic Jupyter (IPython) notebooks, so syntax highlighting will fail in JupyterLab environments.

In [3]:
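from fugue_notebook import setup

try:
    # Enables syntax highlighting and the %%fsql cell magic
    setup()
except Exception:
    # Catch Exception rather than using a bare except, so that
    # KeyboardInterrupt and SystemExit are not swallowed
    print("Syntax highlighting not yet available for JupyterLab")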

## Initial Look



In [49]:
import dask.dataframe as dd

df = dd.read_csv(
    "s3://nyc-tlc/trip data/yellow_tripdata_2019-07.csv",
    dtype={'RatecodeID': 'float64',
           'VendorID': 'float64',
           'passenger_count': 'float64',
           'payment_type': 'float64'},
    storage_options={"anon": True},
    blocksize="16 MiB",
).persist()

In [66]:
%%fsql dask
tempdf = SELECT passenger_count, AVG(tip_amount) AS average_tip
           FROM df
       GROUP BY passenger_count

  SELECT *
    FROM tempdf
ORDER BY passenger_count DESC
   LIMIT 15
   PRINT

Unnamed: 0,passenger_count,average_tip
0,0.0,2.083597
1,1.0,2.200508
2,2.0,2.218763
3,3.0,2.125355
4,4.0,1.992989
5,5.0,2.235327
6,6.0,2.228765
7,8.0,6.118182
8,9.0,10.244375
9,7.0,6.391034


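The query above demonstrates three features: variable assignment (`tempdf = SELECT ...`), `GROUP BY`, and `ORDER BY`. The final `PRINT` displays the result of the last `SELECT`.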
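For readers more familiar with standard SQL, the `tempdf = SELECT ...` assignment plays a role similar to a common table expression. The sketch below expresses the same query shape with a `WITH` clause using Python's built-in `sqlite3` module. The rows are hypothetical sample values chosen for illustration, not taken from the taxi dataset, and SQLite stands in for the engine only to show the query structure.

```python
import sqlite3

# Hypothetical sample rows (passenger_count, tip_amount); illustration only.
rows = [(1.0, 2.0), (1.0, 4.0), (2.0, 1.0), (2.0, 2.0), (3.0, 5.0)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (passenger_count REAL, tip_amount REAL)")
conn.executemany("INSERT INTO df VALUES (?, ?)", rows)

# The CTE named tempdf corresponds to the FugueSQL variable assignment.
query = """
WITH tempdf AS (
    SELECT passenger_count, AVG(tip_amount) AS average_tip
      FROM df
  GROUP BY passenger_count
)
  SELECT *
    FROM tempdf
ORDER BY passenger_count DESC
   LIMIT 15
"""
result = conn.execute(query).fetchall()
conn.close()

for passenger_count, average_tip in result:
    print(passenger_count, average_tip)
```

In FugueSQL the intermediate name can be reused by any later statement in the same cell, which tends to read more like a sequence of dataframe operations than a single nested query.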

## Basics

### Load and Save

In [70]:
import pandas as pd
example = pd.DataFrame({'a':[1,2,3],'b':[1,2,3]})

In [80]:
%%fsql
SELECT * FROM example
-- SAVE OVERWRITE "/work/test.parquet" (header=true)

-- loaded_example = LOAD "/work/test.parquet" (header=true)
-- PRINT 5 ROWS from loaded_example

### Jinja Templating

In [75]:
n = 1

In [77]:
%%fsql
SELECT *
  FROM example
 WHERE a = {{n}}
 PRINT

Unnamed: 0,a,b
0,1,1


In [81]:
### Persist

