# Welcome to BlazingSQL Notebooks!

BlazingSQL Notebooks is a fully managed, high-performance Jupyter notebook environment. 

**No setup required.** Just log in and start writing code immediately.

Every Notebooks environment has:   
- An attached CUDA GPU
- Pre-installed GPU data science packages ([BlazingSQL](https://github.com/BlazingDB/blazingsql), [RAPIDS](https://github.com/rapidsai), [Dask](https://github.com/dask), and many more)

Start running GPU-accelerated code below!

## The DataFrame
The RAPIDS ecosystem is built around a GPU DataFrame that is shared across all of its libraries and packages.

There are two libraries specific to data manipulation:
- **BlazingSQL**: SQL commands on a GPU DataFrame
- **cuDF**: Pandas-like commands on a GPU DataFrame

### BlazingSQL (BSQL) 
[GitHub](https://github.com/BlazingDB/blazingsql) | [Intro Notebook](intro_notebooks/blazingcontext.ipynb)

Let's see how easy it is to run a SQL query on a CSV file with a GPU.

In [None]:
from blazingsql import BlazingContext

# Create a BlazingContext to launch a BSQL session
bc = BlazingContext()

In [None]:
import os
# Create a BlazingSQL table from any file w/ .create_table(table_name, file_path)
bc.create_table('taxi', f'{os.getcwd()}/data/sample_taxi.csv', header=0)

In [None]:
# Query table with the `.sql()` function
bc.sql('SELECT * FROM taxi')

Learn more about [creating](intro_notebooks/create_tables.ipynb) and [querying](intro_notebooks/query_tables.ipynb) BlazingSQL tables, or the [BlazingContext API](intro_notebooks/blazingcontext.ipynb).

BlazingSQL returns each query's results as a cuDF DataFrame, making for easy handoff to GPU or non-GPU solutions.

### cuDF
[GitHub](https://github.com/rapidsai/cudf) | [Intro Notebook](intro_notebooks/bsql_cudf.ipynb)

cuDF is a GPU DataFrame Library similar to Pandas.

In [None]:
type(bc.sql('select * from taxi'))

In [None]:
# cuDF works much like Pandas: e.g. select a single column from the query result
bc.sql('select * from taxi')['fare_amount']

In [None]:
# tell me about this taxi data
bc.sql('select * from taxi').describe()

Learn more about [BlazingSQL + cuDF](intro_notebooks/bsql_cudf.ipynb).

## Data Visualization

Use your favorite Python visualization packages by converting a GPU DataFrame to a Pandas DataFrame with `.to_pandas()`, or use visualization packages that are GPU accelerated.

### Non-GPU Visualization Package

In [None]:
# convert the query result with .to_pandas() so matplotlib/seaborn can plot it
bc.sql('select count(*) from taxi group by day(date_time)').to_pandas()

### GPU Visualization Package

In [None]:
import datashader as ds
from colorcet import fire

# execute query & lay out a canvas w/ dropoff locations 
nyc = ds.Canvas().points(bc.sql('SELECT dropoff_x, dropoff_y FROM taxi'), 'dropoff_x', 'dropoff_y')

# shade in the picture w/ fire & display
ds.transfer_functions.set_background(ds.transfer_functions.shade(nyc, cmap=fire), "black")

## Machine Learning
### cuML 

cuML is a GPU-accelerated machine learning library similar to scikit-learn but made to run on the GPU DataFrame.

Let's use a linear regression model to predict the fare amount in the `taxi` table we've been querying.

In [None]:
%%time
from cuml import LinearRegression
from cuml.preprocessing.model_selection import train_test_split

# pull feature (X) and target (y) values
X = bc.sql('SELECT trip_distance, tolls_amount FROM taxi')
y = bc.sql('SELECT fare_amount FROM taxi')['fare_amount']

# split data into train and test sets (80:20)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

In [None]:
%%time
# call Linear Regression model
lr = LinearRegression()

# train the model
lr.fit(X_train, y_train)

# make predictions for test X values
y_pred = lr.predict(X_test)

In [None]:
from sklearn.metrics import r2_score

# convert test & predicted values .to_pandas() & find the model's r2_score
r2_score(y_true=y_test.to_pandas(), y_pred=y_pred.to_pandas())
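
For reference, the R² score (coefficient of determination) reported above is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it in plain Python on a handful of made-up fares (the numbers and the `r2_score_sketch` name are illustrative assumptions, not taken from the `taxi` data or from scikit-learn's source).

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # squared error of the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # squared error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# hypothetical fares, for illustration only
y_true = [10.0, 12.5, 7.0, 20.0]
y_pred = [9.5, 13.0, 8.0, 19.0]

score = r2_score_sketch(y_true, y_pred)
print(score)
```

A score of 1.0 means the predictions match the true values exactly; a score near 0 means the model does no better than predicting the mean.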

## Graph Analytics

### cuGraph - RAPIDS Graph Analytics Library

Run graph analytics on GPU DataFrames with cuGraph, which aims to provide a NetworkX-like API on GPU DataFrames.

Pending resolution of [rapidsai/cugraph#744](https://github.com/rapidsai/cugraph/issues/744).

In [None]:
# import cugraph

# # assuming that data has been loaded into a cuDF (using read_csv) Dataframe
# bc.create_table('karate', f'{os.getcwd()}/data/karate.csv', names=["src", "dst"], delimiter='\t', dtype=["float", "float"])

# # create a Graph using the source (src) and destination (dst) vertex pairs from the GDF
# G = cugraph.Graph()
# G.add_edge_list(gdf, source='src', destination='dst')  # ERROR

# # Call cugraph.pagerank to get the pagerank scores
# gdf_page = cugraph.pagerank(G)

# for i in range(len(gdf_page)):
#     print("vertex " + str(gdf_page['vertex'][i]) + " PageRank is " + str(gdf_page['pagerank'][i]))  

### cuSignal - GPU-Accelerated Signal Processing

cuSignal is a direct port of SciPy Signal built to leverage GPU compute resources through CuPy and Numba.

<details><summary>...</summary>

The RAPIDS cuSignal project leverages CuPy, Numba, and the RAPIDS ecosystem for GPU accelerated signal processing. 
    
In some cases, cuSignal is a direct port of SciPy Signal that leverages GPU compute resources via CuPy, but it also contains Numba CUDA kernels that provide additional speedups for selected functions.
    
cuSignal achieves its best gains on large signals and compute-intensive functions, but it also emphasizes online processing with zero-copy memory (pinned, mapped) between CPU and GPU.

[GitHub](https://github.com/rapidsai/cusignal) | [Intro Notebook](intro_notebooks/cusignal.ipynb)

</details>

In [None]:
import cusignal
import cupy as cp

start = 0
stop = 10
num_samps = int(1e8)
resample_up = 2
resample_down = 3

gx = cp.linspace(start, stop, num_samps, endpoint=False) 
gy = cp.cos(-gx**2/6.0)

gf = cusignal.resample_poly(gy, resample_up, resample_down, window=('kaiser', 0.5))

In [None]:
gf
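
To sanity-check the size of `gf`, here is a small sketch of the output-length rule for polyphase resampling. It assumes cuSignal mirrors SciPy's `resample_poly`, where the output length is the input length times `up / down`, rounded up; `resample_poly_length` is a hypothetical helper, not part of either library.

```python
def resample_poly_length(n_in, up, down):
    """Expected output length: ceil(n_in * up / down), using integer math."""
    n_out = n_in * up
    return n_out // down + bool(n_out % down)

num_samps = int(1e8)
resample_up = 2
resample_down = 3

out_len = resample_poly_length(num_samps, resample_up, resample_down)
print(out_len)
```

Under that assumption, `len(gf)` should equal `out_len`.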

#### Storage Plugins - Scale Your Data

<details><summary>...</summary>
    
We think you should let data rest wherever it likes. Don't worry about syncing; query files directly wherever they reside.

With the BlazingSQL Filesystem API, you can register and connect to multiple storage solutions. 

- [AWS](https://docs.blazingdb.com/docs/s3) 
- [Google Storage](https://docs.blazingdb.com/docs/google-cloud-storage)
- [HDFS](https://docs.blazingdb.com/docs/hdfs)

Once a filesystem is registered, you can reference its user-defined path when creating a new table from a file.
    
[Docs](https://docs.blazingdb.com/docs/connecting-data-sources) | [Intro notebook](intro_notebooks/storage_plugins.ipynb)
    
</details>

In [None]:
# register AWS S3 storage bucket 
bc.s3('bsql_data', bucket_name='blazingsql-colab')

# tag S3 {s3://} file path to specific data directory within 'bsql_data'
tpch_sf10 = 's3://bsql_data/tpch_sf10/'

# create 'orders' table from list of 10 orders files
bc.create_table('orders', [f'{tpch_sf10}orders/0_0_{i}.parquet' for i in range(10)])
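
To see exactly which files the list comprehension above hands to `.create_table()`, you can build the same list on its own. This sketch only formats strings, so it does not touch S3; note that `bsql_data` in the path is the name registered with `bc.s3()`, not the bucket name.

```python
# path prefix under the registered 'bsql_data' filesystem
tpch_sf10 = 's3://bsql_data/tpch_sf10/'

# the same 10 file paths passed to create_table above
order_files = [f'{tpch_sf10}orders/0_0_{i}.parquet' for i in range(10)]

for file_path in order_files:
    print(file_path)
```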

#### BlazingSQL Logs

<details><summary>...</summary>
    
BlazingSQL keeps an internal log that records events from every node for every query run. The events include runtime query step execution information, performance timings, errors and warnings. 

The logs table is called `bsql_logs`. You can query it like any other table, except that you use the `.log()` function instead of the `.sql()` function.
    
[Docs](https://docs.blazingdb.com/docs/blazingsql-logs) | [Intro Notebook](intro_notebooks/bsql_logs.ipynb)
    
</details>

In [None]:
# how long did each successfully run query take?
bc.log("SELECT log_time, query_id, duration FROM bsql_logs WHERE info = 'Query Execution Done' ORDER BY log_time DESC")

#### Cyber Log Accelerators 

RAPIDS Cyber Log Accelerators (CLX)

<details><summary>...</summary>
    
CLX ("clicks") provides a collection of RAPIDS examples for security analysts, data scientists, and engineers to quickly get started applying RAPIDS and GPU acceleration to real-world cybersecurity use cases.

The goal of CLX is to:

- Allow cyber data scientists and SecOps teams to generate workflows, using cyber-specific GPU-accelerated primitives and methods, that let them interact with code using security language,
- Make available pre-built use cases that demonstrate CLX and RAPIDS functionality that are ready to use in a Security Operations Center (SOC),
- Accelerate log parsing in a flexible, non-regex method, and
- Provide SIEM integration with GPU compute environments via RAPIDS and effectively extend the SIEM environment.
    
[GitHub](https://github.com/rapidsai/clx) | [Intro Notebook](intro_notebooks/clx.ipynb)
    
</details>

In [None]:
import cudf
import s3fs
from os import path

# download data
if not path.exists("./splunk_faker_raw4"):
    fs = s3fs.S3FileSystem(anon=True)
    fs.get("rapidsai-data/cyber/clx/splunk_faker_raw4", "./splunk_faker_raw4")

# read in alert data
gdf = cudf.read_csv('./splunk_faker_raw4')
gdf.columns = ['raw']

# parse the alert data using CLX built-in parsers
from clx.parsers.splunk_notable_parser import SplunkNotableParser

snp = SplunkNotableParser()
parsed_gdf = snp.parse(gdf, 'raw')

# define function to round time to the day
def round2day(epoch_time):
    return int(epoch_time/86400)*86400

# aggregate alerts by day
parsed_gdf['time'] = parsed_gdf['time'].astype(int)
parsed_gdf['day'] = parsed_gdf.time.applymap(round2day)
day_rule_gdf = parsed_gdf[['search_name','day','time']].groupby(['search_name', 'day']).count().reset_index()
day_rule_gdf.columns = ['rule', 'day', 'count']

# import the rolling z-score function from CLX statistics
from clx.analytics.stats import rzscore

# pivot the alert data so each rule is a column
def pivot_table(gdf, index_col, piv_col, v_col):
    index_list = gdf[index_col].unique()
    piv_gdf = cudf.DataFrame()
    piv_gdf[index_col] = index_list
    for group in gdf[piv_col].unique():
        
        temp_df = gdf[gdf[piv_col] == group]
        temp_df = temp_df[[index_col, v_col]]
        temp_df.columns = [index_col, group]
        piv_gdf = piv_gdf.merge(temp_df, on=[index_col], how='left')
        
    piv_gdf = piv_gdf.set_index(index_col)
    return piv_gdf.sort_index()

alerts_per_day_piv = pivot_table(day_rule_gdf, 'day', 'rule', 'count').fillna(0)

# create a new cuDF with the rolling z-score values calculated
r_zscores = cudf.DataFrame()
for rule in alerts_per_day_piv.columns:
    x = alerts_per_day_piv[rule]
    r_zscores[rule] = rzscore(x, 7) #7 day window

In [None]:
r_zscores
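
If you're curious what a rolling z-score looks like under the hood, here is a plain-Python sketch. It assumes a trailing window and a population standard deviation (ddof = 0); the CLX `rzscore` implementation may differ in its details, and both `rolling_zscore` and the alert counts below are hypothetical.

```python
import math

def rolling_zscore(values, window):
    """Z-score of each value against the trailing `window` values (inclusive)."""
    scores = []
    for i, x in enumerate(values):
        if i + 1 < window:
            scores.append(None)  # not enough history to fill the window yet
            continue
        win = values[i + 1 - window : i + 1]
        mean = sum(win) / window
        std = math.sqrt(sum((v - mean) ** 2 for v in win) / window)
        # a flat window has no spread, so the z-score is undefined
        scores.append((x - mean) / std if std > 0 else None)
    return scores

# hypothetical daily alert counts for one rule; the last day spikes
daily_alerts = [3, 4, 3, 5, 4, 3, 4, 20]

scores = rolling_zscore(daily_alerts, 7)  # 7 day window
print(scores)
```

The spike on the last day gets a much larger z-score than the ordinary day before it, which is what makes this useful for flagging unusual alert volumes.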

In [None]:
# pivoted alert counts: one row per day, one column per rule
alerts_per_day_piv

In [None]:
day_rule_gdf
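
The per-day aggregation behind `day_rule_gdf` comes down to two steps: floor each epoch timestamp to midnight UTC, then count alerts per (rule, day) pair. Here is a CPU-only sketch of that logic using the same `round2day` function; the rule names and timestamps are made up for illustration.

```python
from collections import Counter

SECONDS_PER_DAY = 86400

def round2day(epoch_time):
    """Floor an epoch timestamp (in seconds) to the start of its UTC day."""
    return int(epoch_time / SECONDS_PER_DAY) * SECONDS_PER_DAY

# hypothetical (rule, epoch seconds) alerts
alerts = [
    ("Rule A", 1546300800),  # 2019-01-01 00:00:00 UTC
    ("Rule A", 1546344000),  # 2019-01-01 12:00:00 UTC
    ("Rule B", 1546387199),  # 2019-01-01 23:59:59 UTC
    ("Rule A", 1546387200),  # 2019-01-02 00:00:00 UTC
]

# count alerts per (rule, day), like the groupby(...).count() above
counts = Counter((rule, round2day(t)) for rule, t in alerts)

for (rule, day), count in sorted(counts.items()):
    print(rule, day, count)
```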

BlazingSQL is built on top of the RAPIDS AI ecosystem. RAPIDS is based on the Apache Arrow columnar memory format, and cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.

BlazingSQL is a SQL interface for cuDF, with various features to support large scale data science workflows and enterprise datasets.

- **Query Data Stored Externally**: a single line of code can register remote storage solutions, such as Amazon S3.
- **Simple SQL**: incredibly easy to use; run a SQL query and the results are GPU DataFrames (GDFs).
- **Interoperable**: GDFs are immediately accessible to any RAPIDS library for data science workloads.
    
BlazingContext is the Python API of BlazingSQL. 
    
Initializing a BlazingContext allows you to create tables, run queries, and utilize the power of GPU-accelerated SQL.


### Datashader

Datashader is a data visualization library that quickly and accurately renders even the largest datasets.

<details><summary>...</summary>
    
Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data. Datashader breaks the creation of images of data into 3 main steps:

1. Projection

  - Each record is projected into zero or more bins of a nominal plotting grid shape, based on a specified glyph.

2. Aggregation

  - Reductions are computed for each bin, compressing the potentially large dataset into a much smaller aggregate array.

3. Transformation

  - These aggregates are then further processed, eventually creating an image.

Using this very general pipeline, many interesting data visualizations can be created in a performant and scalable way. Datashader contains tools for easily creating these pipelines in a composable manner, using only a few lines of code. Datashader can be used on its own, but it is also designed to work as a pre-processing stage in a plotting library, allowing that library to work with much larger datasets than it would otherwise.
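
The three steps above can be sketched in plain Python. This is a toy illustration of the idea only, not Datashader's implementation: it handles point glyphs, a count reduction, and a linear grayscale mapping, and the function names and sample points are hypothetical.

```python
def project(points, x_range, y_range, width, height):
    """Projection: map each (x, y) point to a (row, col) bin, skipping points out of range."""
    (x0, x1), (y0, y1) = x_range, y_range
    for x, y in points:
        if x0 <= x < x1 and y0 <= y < y1:
            col = int((x - x0) / (x1 - x0) * width)
            row = int((y - y0) / (y1 - y0) * height)
            yield row, col

def aggregate(bins, width, height):
    """Aggregation: reduce the projected points to a count per bin."""
    grid = [[0] * width for _ in range(height)]
    for row, col in bins:
        grid[row][col] += 1
    return grid

def transform(grid):
    """Transformation: scale counts linearly to 0-255 pixel intensities."""
    peak = max(max(row) for row in grid)
    return [[255 * count // peak if peak else 0 for count in row] for row in grid]

# hypothetical points; the last one falls outside the canvas and lands in zero bins
points = [(0.1, 0.1), (0.15, 0.2), (0.9, 0.9), (0.6, 0.4), (1.5, 0.5)]

grid = aggregate(project(points, (0, 1), (0, 1), 2, 2), 2, 2)
image = transform(grid)
print(grid)
print(image)
```

However many points go in, the aggregate is always a fixed `width` by `height` grid, which is why this approach scales to very large datasets.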
    
Datashader is part of the [HoloViz](https://github.com/holoviz) ecosystem for making browser-based data visualization in Python easier to use, easier to learn, and more powerful. See [holoviz.org](https://holoviz.org/) for related packages that you can use with Datashader and [status.holoviz.org](http://status.holoviz.org/) for the current status of each HoloViz project.

Datashader is supported and maintained by [Anaconda](https://anaconda.com/).
    
[GitHub](https://github.com/holoviz/datashader/) | [Intro Notebook](intro_notebooks/cuml.ipynb)

</details>

In [None]:

<details><summary>...</summary>

cuML is a suite of libraries that implement machine learning algorithms and mathematical primitive functions, sharing compatible APIs with other RAPIDS projects.

cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API from scikit-learn.

For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the cuML Benchmarks Notebook.
    
[GitHub](https://github.com/rapidsai/cuml) | [Intro Notebook](intro_notebooks/cuml.ipynb)

</details>

In [None]:
<details><summary>...</summary>

The RAPIDS cuGraph library is a collection of graph analytics that process data found in GPU DataFrames (see cuDF).

cuGraph aims to provide a NetworkX-like API that will be familiar to data scientists, so they can now build GPU-accelerated workflows more easily.
    
[GitHub](https://github.com/rapidsai/cugraph) | [Intro Notebook](intro_notebooks/cugraph.ipynb)

</details>