## Machine Learning

### cuML

cuML is a suite of libraries that implement machine learning algorithms and mathematical primitives, with APIs that are compatible with other RAPIDS projects.

cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API from scikit-learn.

For large datasets, these GPU-based implementations can complete 10-50x faster than their CPU equivalents. For details on performance, see the cuML Benchmarks Notebook.
    
[GitHub](https://github.com/rapidsai/cuml) | [Welcome Notebook](../welcome.ipynb#Machine-Learning)

In [1]:
from blazingsql import BlazingContext

# connect to BlazingSQL w/ BlazingContext API
bc = BlazingContext()

BlazingContext ready


In [2]:
import os

# identify path to data directory
data_dir = f'{os.getcwd().split("/intro_notebooks")[0]}/data'

# create a BlazingSQL table from any file w/ .create_table(table_name, file_path)
bc.create_table('taxi', f'{data_dir}/sample_taxi.csv', header=0)

<pyblazing.apiv2.context.BlazingTable at 0x7fe885b023d0>

#### Linear Regression

In [3]:
%%time
from cuml import LinearRegression
# note: newer cuML releases expose this as cuml.model_selection.train_test_split
from cuml.preprocessing.model_selection import train_test_split

# pull feature (X) and target (y) values
X = bc.sql('SELECT trip_distance, tolls_amount FROM taxi')
y = bc.sql('SELECT fare_amount FROM taxi')['fare_amount']

# split data into train and test sets (80:20)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

CPU times: user 1.82 s, sys: 343 ms, total: 2.16 s
Wall time: 1.63 s


In [4]:
%%time
# call Linear Regression model
lr = LinearRegression()

# train the model
lr.fit(X_train, y_train)

# make predictions for test X values
y_pred = lr.predict(X_test)

CPU times: user 421 ms, sys: 97 ms, total: 518 ms
Wall time: 541 ms


In [5]:
from sklearn.metrics import r2_score

# convert test & predicted values .to_pandas() & find the model's r2_score
r2_score(y_true=y_test.to_pandas(), y_pred=y_pred.to_pandas())

0.2145465529140157
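
For reference, the score above is the coefficient of determination, R² = 1 - SS_res / SS_tot. The following is a minimal pure-Python sketch of that formula, not the scikit-learn implementation; the helper name `r2` and the toy fare values are hypothetical and only illustrate how to read the number.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    # residual sum of squares: error left over after the model's predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# hypothetical toy fares, not taken from the taxi sample
fares = [5.0, 10.0, 15.0]
perfect = r2(fares, fares)                # predictions equal the targets
baseline = r2(fares, [10.0, 10.0, 10.0])  # always predict the mean fare
print(perfect, baseline)
```

Perfect predictions score 1.0 and always predicting the mean scores 0.0, so a value near 0.21 means the two features explain roughly a fifth of the variance in `fare_amount`.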

#### DBSCAN

In [None]:
import cudf
from cuml.cluster import DBSCAN

# Create and populate a GPU DataFrame
gdf_float = cudf.DataFrame()
gdf_float['0'] = [1.0, 2.0, 5.0]
gdf_float['1'] = [4.0, 2.0, 1.0]
gdf_float['2'] = [4.0, 2.0, 1.0]

# Setup and fit clusters
dbscan_float = DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(gdf_float)
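
The cell above fits the model but does not show the resulting clusters. As a rough illustration of what `eps=1.0` and `min_samples=1` mean for these three rows, here is a brute-force, pure-Python sketch of the DBSCAN idea. It is not cuML's GPU implementation, and the helper name `dbscan` is hypothetical.

```python
import math


def dbscan(points, eps, min_samples):
    """Brute-force DBSCAN sketch; returns one label per point, -1 for noise."""
    n = len(points)
    # neighbors within eps of each point (a point counts as its own neighbor)
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None marks an unvisited point
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins it
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # only core points expand the cluster
        cluster += 1
    return labels


# the same three rows as gdf_float, written row-wise
rows = [(1.0, 4.0, 4.0), (2.0, 2.0, 2.0), (5.0, 1.0, 1.0)]
labels = dbscan(rows, eps=1.0, min_samples=1)
print(labels)
```

Every pair of rows is more than 1.0 apart, and with `min_samples=1` each point is a core point on its own, so each row ends up in a separate cluster. In cuML, the fitted labels are available as `dbscan_float.labels_`.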


#### Multi-GPU 

cuML also supports multi-GPU and multi-node, multi-GPU operation through Dask for a growing list of algorithms.

The following Python snippet reads input from a CSV file and performs a NearestNeighbors query across a cluster of Dask workers, using multiple GPUs on a single node:

In [None]:
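# Create a Dask CUDA cluster w/ one worker per device
from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster()

# Connect a Dask client so the cuML Dask estimators can reach the workers
from dask.distributed import Client
client = Client(cluster)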

# Read CSV file in parallel across workers
import dask_cudf
df = dask_cudf.read_csv("/path/to/csv")

# Fit a NearestNeighbors model and query it
from cuml.dask.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=10)
nn.fit(df)
neighbors = nn.kneighbors(df)