# Machine Learning

## cuML
[GitHub](https://github.com/rapidsai/cuml)

cuML is a suite of libraries that implement machine learning algorithms and mathematical primitive functions, with APIs that are compatible with other RAPIDS projects.

cuML enables data scientists, researchers, and software engineers to run traditional tabular ML tasks on GPUs without going into the details of CUDA programming. In most cases, cuML's Python API matches the API from scikit-learn.

For large datasets, these GPU-based implementations can complete **10-50x** faster than their CPU equivalents.

We'll show off a few examples to demonstrate the ease of using cuML. 

Parts of this were borrowed and lightly adapted from the cuML GitHub repository.

#### DBSCAN
Here is an example of computing DBSCAN clusters entirely on the GPU.

In [1]:
import cudf
from cuml.cluster import DBSCAN

# Create and populate a GPU DataFrame
gdf_float = cudf.DataFrame()
gdf_float['0'] = [1.0, 2.0, 5.0]
gdf_float['1'] = [4.0, 2.0, 1.0]
gdf_float['2'] = [4.0, 2.0, 1.0]

# Set up the model and fit clusters
dbscan_float = DBSCAN(eps=1.0, min_samples=1)
dbscan_float.fit(gdf_float)
print(dbscan_float.labels_)

0    0
1    1
2    2
dtype: int32
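
The labels above put each point in its own cluster. One way to see why is to check the pairwise Euclidean distances: with `eps=1.0`, two points are neighbors only if they lie within 1.0 of each other, and with `min_samples=1` every point can seed a cluster by itself. The following is a plain-Python, CPU-only sketch for illustration (it is not part of cuML, and the helper name `pairwise_distances` is our own) that recomputes those distances for the three rows.

```python
import math
from itertools import combinations

# The same three rows as the GPU DataFrame above (columns '0', '1', '2').
points = [
    (1.0, 4.0, 4.0),
    (2.0, 2.0, 2.0),
    (5.0, 1.0, 1.0),
]
eps = 1.0

def pairwise_distances(rows):
    """Return {(i, j): Euclidean distance} for every pair of rows."""
    return {
        (i, j): math.dist(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }

distances = pairwise_distances(points)
for (i, j), d in distances.items():
    print(f"rows {i} and {j}: distance {d:.3f}, within eps: {d <= eps}")
```

Every pair is farther apart than `eps`, so no two points share a neighborhood, which is consistent with the three distinct labels.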


#### Linear Regression on Data in AWS S3
To demonstrate how the pieces of the RAPIDS stack integrate into a data pipeline, we'll run a simple workflow.

We are going to predict the fare of a [NYC Yellow Taxi](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page) cab ride by fitting a `LinearRegression()` model on a `cudf.DataFrame`.

This DataFrame will be generated from a SQL query on an *Apache Parquet* dataset that resides in *AWS S3*.

For more information on BlazingSQL and cuDF see [The DataFrame Notebook](the_dataframe.ipynb).

In [2]:
from blazingsql import BlazingContext
bc = BlazingContext()
bc.s3('blazingsql-colab', bucket_name='blazingsql-colab')

BlazingContext ready


(True,
 '',
 OrderedDict([('type', 's3'),
              ('bucket_name', 'blazingsql-colab'),
              ('access_key_id', ''),
              ('secret_key', ''),
              ('session_token', ''),
              ('encryption_type', <S3EncryptionType.NONE: 1>),
              ('kms_key_amazon_resource_name', '')]))

In [3]:
bc.create_table('taxi','s3://blazingsql-colab/yellow_taxi/1_0_0.parquet')

ParseSchemaError: [ParseSchema Error] Path 's3://blazingsql-colab/yellow_taxi/1_0_0.parquet' does not exist. File or directory paths are expected to be in one of the following formats: For local file paths: '/folder0/folder1/fileName.extension'    For local file paths with wildcard: '/folder0/folder1/*fileName*.*'    For local directory paths: '/folder0/folder1/'    For s3 file paths: 's3://registeredFileSystemName/folder0/folder1/fileName.extension'    For s3 file paths with wildcard: '/folder0/folder1/*fileName*.*'    For s3 directory paths: 's3://registeredFileSystemName/folder0/folder1/'    For gs file paths: 'gs://registeredFileSystemName/folder0/folder1/fileName.extension'    For gs file paths with wildcard: '/folder0/folder1/*fileName*.*'    For gs directory paths: 'gs://registeredFileSystemName/folder0/folder1/'    For HDFS file paths: 'hdfs://registeredFileSystemName/folder0/folder1/fileName.extension'    For HDFS file paths with wildcard: '/folder0/folder1/*fileName*.*'    For HDFS directory paths: 'hdfs://registeredFileSystemName/folder0/folder1/'

<pyblazing.apiv2.context.BlazingTable at 0x7f12c0372590>

Extract the desired features with `.sql()`, and then split the data into training and test sets using cuML's `train_test_split()` function.

In [4]:
%%time
from cuml import LinearRegression
from cuml.preprocessing.model_selection import train_test_split

X = bc.sql('SELECT trip_distance, tolls_amount FROM taxi')
y = bc.sql('SELECT fare_amount FROM taxi')['fare_amount']

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8)

com.blazingdb.calcite.application.SqlValidationException: com.blazingdb.calcite.application.SqlValidationException: SqlValidationException

SELECT trip_distance, tolls_amount FROM taxi
       ^^^^^^^^^^^^^

From line 1, column 8 to line 1, column 20: Column 'trip_distance' not found in any table
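
Conceptually, `train_test_split(X, y, train_size=0.8)` shuffles the rows and keeps 80% of them for training and the rest for testing, with features and targets kept aligned. The sketch below illustrates that idea on plain Python lists. It is not cuML's implementation; `simple_train_test_split`, its `seed` parameter, and the toy values are hypothetical and exist only for this example.

```python
import random

def simple_train_test_split(X, y, train_size=0.8, seed=0):
    """Shuffle row indices, then cut them into train and test portions."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    cut = int(len(indices) * train_size)
    train_idx, test_idx = indices[:cut], indices[cut:]
    # Index X and y with the same shuffled indices so rows stay paired.
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy stand-ins for (trip_distance, tolls_amount) rows and fare_amount values.
X = [(float(i), 0.0) for i in range(10)]
y = [2.5 * i for i in range(10)]

X_train, X_test, y_train, y_test = simple_train_test_split(X, y, train_size=0.8)
print(len(X_train), len(X_test))
```

Shuffling a single list of indices, rather than shuffling `X` and `y` separately, is what keeps each feature row matched to its target.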

Then we call `.fit()` to train the linear regression model on the taxi training data and `.predict()` to estimate fares for the test set.

In [5]:
%%time
# instantiate the Linear Regression model
lr = LinearRegression()

# train the model
lr.fit(X_train, y_train)

# make predictions for test X values
y_pred = lr.predict(X_test)

NameError: name 'X_train' is not defined
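
To make the roles of `.fit()` and `.predict()` concrete: fitting estimates coefficients that minimize the squared error on the training rows, and predicting applies those coefficients to new rows. For a single feature, ordinary least squares has a simple closed form, sketched below in plain Python. This is an illustration under simplifying assumptions (one feature instead of the two selected above, made-up values), not cuML's implementation; the function names are hypothetical.

```python
def fit_simple_ols(x, y):
    """Return (slope, intercept) minimizing squared error for one feature."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_simple_ols(x, slope, intercept):
    """Apply the fitted line to new feature values."""
    return [intercept + slope * xi for xi in x]

# Made-up data that follows fare = 2.5 + 2.5 * distance exactly.
trip_distances = [1.0, 2.0, 3.0, 4.0]
fares = [5.0, 7.5, 10.0, 12.5]

slope, intercept = fit_simple_ols(trip_distances, fares)
predicted = predict_simple_ols([5.0], slope, intercept)
print(slope, intercept, predicted)
```

Because the toy data lies exactly on a line, the fitted slope and intercept recover the values used to generate it.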

We can convert the test and predicted values to pandas with `.to_pandas()` and compute the model's `r2_score()`.

In [6]:
from sklearn.metrics import r2_score

r2_score(y_true=y_test.to_pandas(), y_pred=y_pred.to_pandas())

NameError: name 'y_test' is not defined
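
For reference, the coefficient of determination that `r2_score()` reports is 1 minus the ratio of the residual sum of squares to the total sum of squares. The plain-Python sketch below computes it by hand on a few made-up values; the function name `r2_by_hand` and the data are hypothetical and are not results from the taxi dataset.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up true and predicted values, for illustration only.
y_true_demo = [3.0, -0.5, 2.0, 7.0]
y_pred_demo = [2.5, 0.0, 2.0, 8.0]

score = r2_by_hand(y_true_demo, y_pred_demo)
print(round(score, 3))
```

A score of 1.0 means the predictions match the true values exactly, while a score near 0 means the model does no better than always predicting the mean.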

## That is the Machine Learning Tour
Those are but two simple examples of the algorithms supported by cuML.

There are many more supported algorithms, as you can see on the [cuML GitHub](https://github.com/rapidsai/cuml#supported-algorithms).

That's all we have for now! Get coding your own workloads!