# Welcome to BlazingSQL Notebooks!

BlazingSQL Notebooks is a fully managed, high-performance JupyterLab environment. 

**No setup required.** Just log in and start writing code immediately.

Every Notebooks environment has:   
- An attached CUDA GPU
- Pre-installed GPU data science packages ([BlazingSQL](https://github.com/BlazingDB/blazingsql), [RAPIDS](https://github.com/rapidsai), [Dask](https://github.com/dask), and many more)

Start running GPU-accelerated code below!

## The GPU DataFrame
The RAPIDS ecosystem is built around a single GPU DataFrame that all of its libraries and packages share: the `cudf.DataFrame`, which is itself built on [Apache Arrow](http://arrow.apache.org/).

There are two libraries specific to data manipulation:
- **BlazingSQL**:  SQL commands on a `cudf.DataFrame`
- **cuDF**: pandas-like commands on a `cudf.DataFrame`

### BlazingSQL (BSQL) 
[GitHub](https://github.com/BlazingDB/blazingsql) | [Intro Notebook](intro_notebooks/the_dataframe.ipynb)

BlazingSQL is a distributed SQL engine built on top of cuDF. Easily run SQL on files and DataFrames.

We start with a BlazingContext, which acts like a session of the SQL engine.

In [None]:
import dask
from dask.distributed import Client
dask_scheduler_ip_port = 'localhost:8786'
client = Client(dask_scheduler_ip_port)
client

In [None]:
from blazingsql import BlazingContext
network_interface = 'ens5'
bc = BlazingContext(dask_client=client, network_interface=network_interface)

With `.create_table('table_name', 'file_path')` you can create tables from many formats. Here we infer the schema from a CSV file.

In [None]:
bc.create_table('taxi', 'data/sample_taxi.csv', header=0)
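
What "inferring the schema" means can be illustrated on the CPU with the standard library alone. The sketch below is not BlazingSQL's implementation: `infer_schema`, `infer_type`, and the three-column sample are hypothetical, and the logic only distinguishes integers, floats, and strings by trying to parse every value in a column.

```python
import csv
import io


def infer_type(values):
    """Return the narrowest of 'int64', 'float64', 'str' that parses every value."""
    for name, cast in (("int64", int), ("float64", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue  # this type does not fit; try the next, wider one
    return "str"


def infer_schema(csv_text, header=0):
    """Guess a {column: type} mapping from CSV text whose header is on row `header`."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    names, data = rows[header], rows[header + 1:]
    columns = zip(*data)  # transpose rows into columns
    return {name: infer_type(col) for name, col in zip(names, columns)}


# Hypothetical sample, not the real taxi file.
sample = "passenger_count,trip_distance,vendor\n1,2.5,A\n3,10.0,B\n"
schema = infer_schema(sample)
print(schema)
```

A real engine would also handle dates, nulls, and sampling of large files; this sketch only shows the basic idea of deriving column types from the data.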

Now, we can run a SQL query directly on that CSV file with `.sql()`.

In [None]:
bc.sql('SELECT * FROM taxi')

Learn more about [creating](https://docs.blazingdb.com/docs/creating-tables) and [querying](https://docs.blazingdb.com/docs/single-gpu) BlazingSQL tables, or the [BlazingContext API](https://docs.blazingdb.com/docs/methods-arguments).

BlazingSQL returns each query's results as a cuDF DataFrame, which makes it easy to hand results off to other GPU or CPU libraries.

In [None]:
type(bc.sql('select * from taxi limit 10'))

### cuDF
[GitHub](https://github.com/rapidsai/cudf) | [Intro Notebook](intro_notebooks/the_dataframe.ipynb)

cuDF is a GPU DataFrame Library similar to pandas.

In [None]:
import cudf
s = cudf.Series([3, 5, 0.01, None, 4])
s

You can create a `cudf.DataFrame` from a SQL query; each column of the result is a `cudf.Series`.

In [None]:
df = bc.sql('select * from taxi where trip_distance < 10')

Use DataFrame methods like `.head()`, `.tail()`, or `.describe()`.

In [None]:
df.tail(2)

In [None]:
df.describe()

You can also filter cuDF DataFrames just like pandas DataFrames.

In [None]:
df.loc[(df['passenger_count'] != 1) & (df['trip_distance'] < 10)]
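
The boolean-mask pattern is the same on CPU and GPU. The sketch below uses a tiny hand-made table (the values are invented, not taken from the taxi data) and falls back to pandas when cuDF cannot be imported, so it should run on a machine without a GPU as well.

```python
try:
    import cudf as xdf  # GPU DataFrame library, if available
except ImportError:
    import pandas as xdf  # CPU fallback; same API for this example

# Invented sample rows standing in for the taxi table.
trips = xdf.DataFrame({
    "passenger_count": [1, 2, 1, 4],
    "trip_distance": [3.2, 12.0, 0.8, 5.5],
})

# Each comparison produces a boolean Series; `&` is an element-wise AND.
# The parentheses are required because `&` binds tighter than `!=` and `<`.
mask = (trips["passenger_count"] != 1) & (trips["trip_distance"] < 10)
filtered = trips.loc[mask]

# Normalize to pandas so the result can be inspected the same way either way.
rows = filtered.to_pandas() if hasattr(filtered, "to_pandas") else filtered
print(rows)
```

Only the row with four passengers passes both conditions: the two-passenger trip is dropped by the distance check, and the single-passenger trips are dropped by the passenger check.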

For interoperability, you can convert a cuDF DataFrame to pandas with `.to_pandas()`. This gives you access to every pandas method; the example below uses `.sample()`.

In [None]:
df.to_pandas().sample(3)

Learn more about [BlazingSQL + cuDF](intro_notebooks/the_dataframe.ipynb).

## Data Visualization
cuDF DataFrames plug into both existing CPU-based visualization libraries and GPU-accelerated ones.


### Matplotlib

[GitHub](https://github.com/matplotlib/matplotlib) | [Intro Notebook](intro_notebooks/data_visualization.ipynb#Matplotlib)

By calling `.to_pandas()`, we can convert a `cudf.DataFrame` into a `pandas.DataFrame` and hand it off to Matplotlib or other CPU visualization packages.

In [None]:
bc.sql('SELECT passenger_count, tip_amount FROM taxi').to_pandas().plot(kind='scatter', x='passenger_count', y='tip_amount')

### Datashader

[GitHub](https://github.com/holoviz/datashader) | [Intro Notebook](intro_notebooks/data_visualization.ipynb#Datashader)

Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data.
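
Rasterization here means binning points into a fixed grid of pixels and aggregating per pixel. As a rough CPU illustration (not Datashader's implementation; `rasterize_counts` and the sample points are made up for this sketch), counting points per cell looks like this:

```python
def rasterize_counts(points, width, height, x_range, y_range):
    """Count how many (x, y) points fall into each cell of a width x height grid."""
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # skip points outside the canvas
        # Scale to a column/row index; clamp so the max edge lands in the last cell.
        col = min(int((x - x_min) / (x_max - x_min) * width), width - 1)
        row = min(int((y - y_min) / (y_max - y_min) * height), height - 1)
        grid[row][col] += 1
    return grid


# Made-up points on a 2 x 2 canvas covering [0, 10] in both dimensions.
points = [(1, 1), (2, 3), (9, 9), (10, 10), (6, 2), (50, 50)]
grid = rasterize_counts(points, width=2, height=2, x_range=(0, 10), y_range=(0, 10))
print(grid)
```

A separate shading step then maps each per-pixel count to a color, so the size of the image depends on the canvas and not on the number of input rows.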

Datashader is one of the first visualization tools to support GPU DataFrames, so we can directly pass in `cudf.DataFrame` query results.

In [None]:
from datashader import Canvas, transfer_functions
from colorcet import fire

We execute a query and pass its result, a GPU DataFrame, directly to Datashader to render taxi dropoff locations.

In [None]:
nyc = Canvas().points(bc.sql('SELECT dropoff_x, dropoff_y FROM taxi'), 'dropoff_x', 'dropoff_y')

transfer_functions.set_background(transfer_functions.shade(nyc, cmap=fire), "black")

## Machine Learning
### cuML 
[GitHub](https://github.com/rapidsai/cuml) | [Intro Notebook](intro_notebooks/machine_learning.ipynb)

cuML is a GPU-accelerated machine learning library with an API similar to scikit-learn's.

Let's use a linear regression model to predict the fare amount for trips in the `taxi` table we've been querying.

In [None]:
%%time
from cuml import LinearRegression
from cuml.preprocessing.model_selection import train_test_split

Pull the feature (X) and target (y) values.

In [None]:
X = bc.sql('SELECT trip_distance, tolls_amount, pickup_x, pickup_y, dropoff_x, dropoff_y FROM taxi')
y = bc.sql('SELECT fare_amount FROM taxi')['fare_amount']

Split the data into train and test sets (80:20).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
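
Conceptually, an 80:20 split shuffles the row indices and cuts them at the 80% mark. The following standard-library sketch shows that idea; `split_indices` is a hypothetical helper, not cuML's implementation, and the seed is only there to make the example repeatable.

```python
import random


def split_indices(n_rows, train_size=0.8, seed=0):
    """Shuffle range(n_rows) and split it into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    cut = int(n_rows * train_size)        # number of rows in the train set
    return indices[:cut], indices[cut:]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Every row lands in exactly one of the two sets, so the model is never scored on rows it was trained on.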

Fit a linear regression model and make predictions on the test set.

In [None]:
%%time
# instantiate the linear regression model
lr = LinearRegression()

# train the model
lr.fit(X_train, y_train)

# make predictions for test X values
y_pred = lr.predict(X_test)
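
For intuition, fitting a linear regression amounts to solving a least-squares problem for the coefficients and the intercept. The NumPy sketch below uses synthetic data and `numpy.linalg.lstsq`; it illustrates ordinary least squares on the CPU and is not a description of how cuML solves the problem internally. The `_demo` names are ours and are chosen so they do not overwrite the notebook's `X` and `y`.

```python
import numpy as np

# Synthetic data: y = 2*x0 - 3*x1 + 5, with no noise so the fit is exact.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + 5.0

# Append a column of ones so the last solved weight acts as the intercept.
X_aug = np.column_stack([X_demo, np.ones(len(X_demo))])
weights, *_ = np.linalg.lstsq(X_aug, y_demo, rcond=None)
coef, intercept = weights[:-1], weights[-1]

# Predictions are a matrix-vector product plus the intercept.
y_hat = X_demo @ coef + intercept
print(coef, intercept)
```

Because the synthetic target has no noise, the recovered coefficients and intercept match the generating values up to floating-point error.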

Evaluate the model's predictions with scikit-learn's `r2_score`.

In [None]:
from sklearn.metrics import r2_score
r2_score(y_true=y_test.to_pandas(), y_pred=y_pred.to_pandas())
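
The score is the coefficient of determination, `1 - SS_res / SS_tot`, where `SS_res` is the sum of squared residuals and `SS_tot` is the sum of squared deviations of the true values from their mean. A plain-Python sketch makes the definition concrete; the helper name `r2` is ours, and it assumes the true values are not all identical.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # zero if y_true is constant
    return 1.0 - ss_res / ss_tot


perfect = r2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # predictions match exactly
baseline = r2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always predict the mean
print(perfect, baseline)
```

A score of 1.0 means the predictions are exact, 0.0 means the model does no better than always predicting the mean, and negative values mean it does worse.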

## That is the Quick Tour!
Many more packages integrate with the GPU DataFrame, which makes them interoperable with the rest of the stack.

Some of those not mentioned here are:
- **cuGraph**: a graph analytics library similar to NetworkX 
- **cuSignal**: a signal analytics library similar to SciPy Signal 
- **CLX**: a collection of cyber security use cases with the RAPIDS stack 

[Continue to The DataFrame introductory Notebook](intro_notebooks/the_dataframe.ipynb)