# Machine Learning

## cuML
[GitHub](https://github.com/rapidsai/cuml)

cuML is a suite of libraries that implement machine learning algorithms and mathematical primitive functions, with APIs that are compatible with other RAPIDS projects.

cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API from scikit-learn.

For large datasets, these GPU-based implementations can complete **10-50x** faster than their CPU equivalents.

We'll show off a few examples to demonstrate the ease of using cuML. 

Parts of this were borrowed and lightly adapted from the cuML GitHub repository.

#### DBSCAN
Here is an example of computing DBSCAN clusters entirely on GPUs.

In [None]:
import cudf
from cuml.cluster import DBSCAN

# create and populate a GPU DataFrame
gdf_float = cudf.DataFrame()
gdf_float['0'] = [1.0, 2.0, 5.0]
gdf_float['1'] = [4.0, 2.0, 1.0]
gdf_float['2'] = [4.0, 2.0, 1.0]

# setup and fit clusters
dbscan_float = DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(gdf_float)

dbscan_float.labels_
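
To see what the labels mean, here is a minimal CPU-only sketch of the DBSCAN idea in plain Python. It is not cuML's implementation, and the helper name `dbscan` is made up for illustration. It uses the same three rows, `eps`, and `min_samples` as the cell above.

```python
import math

def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # neighbors[i] holds every index within eps of point i, including i itself
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [-1] * n
    cluster_id = 0
    for i in range(n):
        # skip points that are already assigned or that are not core points
        if labels[i] != -1 or len(neighbors[i]) < min_samples:
            continue
        labels[i] = cluster_id
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] != -1:
                continue
            labels[j] = cluster_id
            # only core points extend the cluster further
            if len(neighbors[j]) >= min_samples:
                frontier.extend(neighbors[j])
        cluster_id += 1
    return labels

# the same three rows as the GPU DataFrame (columns '0', '1', '2')
points = [(1.0, 4.0, 4.0), (2.0, 2.0, 2.0), (5.0, 1.0, 1.0)]
labels = dbscan(points, eps=1.0, min_samples=1)
print(labels)
```

Every pair of rows is more than `eps=1.0` apart, and `min_samples=1` makes every point a core point, so in this sketch each row ends up in its own cluster.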

#### Linear Regression on Data in AWS S3
To demonstrate how easily the RAPIDS stack integrates into a data pipeline, we're going to run a simple workflow.

We are going to predict the fare of a [NYC Yellow Taxi](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) cab ride. We are going to do this by running a `LinearRegression()` on a `cudf.DataFrame`. 

This DataFrame will be generated from a SQL query on an *Apache Parquet* dataset that resides in *AWS S3*.

For more information on BlazingSQL and cuDF see [The DataFrame Notebook](the_dataframe.ipynb).

In [None]:
from blazingsql import BlazingContext

bc = BlazingContext()

In [None]:
bc.s3('blazingsql-colab', bucket_name='blazingsql-colab')

In [None]:
bc.create_table('taxi', 's3://blazingsql-colab/yellow_taxi/1_0_0.parquet')

Extract the desired features with `.sql()`, and then split the data into training and test sets using cuML's `train_test_split()` function.

In [None]:
%%time
from cuml import LinearRegression
from cuml.preprocessing.model_selection import train_test_split

X = bc.sql('SELECT trip_distance, tolls_amount FROM taxi')
y = bc.sql('SELECT fare_amount FROM taxi')['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)
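
As a rough illustration of what an 80/20 split does, here is a plain-Python sketch. The helper `split_rows`, its `seed` argument, and the toy rows are assumptions made for this example; cuML's `train_test_split()` operates on GPU data and may shuffle differently.

```python
import random

def split_rows(X, y, train_size=0.8, seed=0):
    """Shuffle row indices, then cut them into a training and a test part."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_size)
    train_idx, test_idx = indices[:cut], indices[cut:]
    return (
        [X[i] for i in train_idx], [X[i] for i in test_idx],
        [y[i] for i in train_idx], [y[i] for i in test_idx],
    )

# made-up rows of (trip_distance, tolls_amount) with made-up fares
X = [(float(d), 0.0) for d in range(10)]
y = [2.5 + 2.0 * d for d in range(10)]
X_train, X_test, y_train, y_test = split_rows(X, y, train_size=0.8)
print(len(X_train), len(X_test))
```

The key property is that features and targets are indexed by the same shuffled positions, so each row in `X_train` still lines up with its fare in `y_train`.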

Then we call `.fit()` to train the linear regression on the taxi training data and `.predict()` to estimate fares for the test set.

In [None]:
%%time
# create the LinearRegression model
lr = LinearRegression()

# train the model
lr.fit(X_train, y_train)

# make predictions for test X values
y_pred = lr.predict(X_test)
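
For intuition about what fitting and predicting involve, here is a small CPU-only sketch of ordinary least squares solved through the normal equations. It is one textbook way to fit a linear model, not necessarily the solver cuML uses. The helpers `fit_linear` and `predict` and the toy data are invented for illustration.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    rows = [[1.0, *x] for x in X]  # prepend 1.0 so weights[0] is the intercept
    k = len(rows[0])
    # build the augmented matrix [X^T X | X^T y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]  # [intercept, coef_1, ...]

def predict(weights, X):
    """Apply intercept + dot(coefficients, features) to every row."""
    return [weights[0] + sum(w * v for w, v in zip(weights[1:], x)) for x in X]

# made-up data that follows fare = 2.5 + 2.0 * distance + 1.0 * tolls exactly
X = [(1.0, 0.0), (2.0, 0.0), (3.0, 5.0), (4.0, 0.0), (5.0, 5.0)]
y = [2.5 + 2.0 * d + 1.0 * t for d, t in X]
weights = fit_linear(X, y)
y_hat = predict(weights, [(10.0, 5.0)])
print([round(w, 6) for w in weights], round(y_hat[0], 6))
```

Because the toy fares are an exact linear function of the two features, the fitted weights recover the intercept and both coefficients up to floating-point error.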

We can convert the test and predicted values to pandas with `.to_pandas()` and then compute the model's `r2_score()` with scikit-learn.

In [None]:
from sklearn.metrics import r2_score

r2_score(y_true=y_test.to_pandas(), y_pred=y_pred.to_pandas())
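
The score itself is the coefficient of determination, `1 - SS_res / SS_tot`. A minimal plain-Python sketch of that formula, with a hypothetical helper `r2` and made-up values, looks like this:

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot

# made-up fares and predictions
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]
score = r2(y_true, y_pred)
print(round(score, 4))
```

A perfect prediction scores 1.0, and always predicting the mean of `y_true` scores 0.0, so values closer to 1.0 indicate a better fit.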

## Distributed Machine Learning

cuML also supports multi-GPU and multi-node, multi-GPU operation, using Dask, for a growing list of algorithms.

The following Python snippet reads input from a Parquet file and performs a NearestNeighbors query across a cluster of Dask workers, using multiple GPUs on a single node:

In [None]:
import dask_cudf

# read Parquet file in parallel across workers
df = dask_cudf.read_parquet("../data/blobs.parquet")

In [None]:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

# start a local cluster with one worker per GPU and connect a client for cuML to use
cluster = LocalCUDACluster()
client = Client(cluster)

In [None]:
from cuml.dask.neighbors import NearestNeighbors

# fit a NearestNeighbors model and query it
nn = NearestNeighbors(client=client)

nn.fit(df)

distances, indices = nn.kneighbors(df, n_neighbors=5)

In [None]:
distances.compute()
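
To make the shape of the result concrete, here is a brute-force, CPU-only sketch of a k-nearest-neighbors query in plain Python. It only illustrates what the two returned tables contain. The helper `kneighbors` and the toy points are assumptions for this example and say nothing about how cuML distributes the work across GPUs.

```python
import math

def kneighbors(points, queries, n_neighbors):
    """For each query, return the distances to and indices of its nearest points."""
    distances, indices = [], []
    for q in queries:
        # rank every fitted point by its Euclidean distance to the query
        ranked = sorted((math.dist(q, p), i) for i, p in enumerate(points))
        nearest = ranked[:n_neighbors]
        distances.append([d for d, _ in nearest])
        indices.append([i for _, i in nearest])
    return distances, indices

# made-up 2-D points; we query the same points we "fit" on, as in the cell above
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
distances, indices = kneighbors(points, points, n_neighbors=2)
print(indices)
```

When the query set is the same as the fitted set, each point's first neighbor is itself at distance 0, so the first column of distances is all zeros in this sketch.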

## That is the Machine Learning Tour
Those are just a few simple examples of the algorithms supported by cuML.

There are many more supported algorithms, as you can see on the [cuML GitHub](https://github.com/rapidsai/cuml#supported-algorithms).

That's all we have for now! Get coding your own workloads!