# RAPIDS Hands-on Example Notebook

This notebook provides a hands-on demonstration of an ETL and ML workflow using the RAPIDS ecosystem. We will use the California Housing Prices dataset from scikit-learn to build a model that predicts median housing prices. We will also perform some light analysis, such as computing feature significance and visualizing clusters.

We first load the California Housing Prices dataset into Pandas dataframes on the CPU.

In [None]:
from sklearn.datasets import fetch_california_housing

dataset = fetch_california_housing(as_frame=True)

X_cpu = dataset['data']
y_cpu = dataset['target']

Let's print our features and labels.

In [None]:
X_cpu

In [None]:
y_cpu

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(20,15))
for i, col in enumerate(X_cpu.columns):
    sns.kdeplot(X_cpu[col], ax=axes[i//3,i%3])

## Move data from CPU to GPU

Next, we convert the Pandas dataframes into cuDF dataframes on the GPU. This is done using `cudf.DataFrame.from_pandas()` and `cudf.Series.from_pandas()`.

In [None]:
import cudf

In [None]:
X = 
y = 

## Filter

Next, let's perform some filtering of the geo-coordinates. Filter `X` to keep only the rows with Latitude in (36, 40) and Longitude in (-123, -120). Also make sure to filter `y` so that it keeps only the rows which have been kept in `X`.
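
As a hint, here is a minimal sketch of boolean-mask filtering on a made-up toy frame. It uses pandas on the CPU so it runs anywhere; cuDF offers a pandas-like API, so the same masking pattern is expected to carry over. The names `X_toy`, `y_toy`, `X_kept`, and `y_kept` are illustrative, and the bounds are read as exclusive, which is an assumption.

```python
import pandas as pd

# Toy stand-in for the feature matrix and target (values are made up).
X_toy = pd.DataFrame({
    "Latitude":  [34.0, 37.5, 38.2, 41.0],
    "Longitude": [-118.0, -122.0, -121.5, -122.5],
})
y_toy = pd.Series([1.0, 2.0, 3.0, 4.0])

# Boolean mask: True for rows inside the bounding box (exclusive bounds assumed).
mask = (
    (X_toy["Latitude"] > 36) & (X_toy["Latitude"] < 40)
    & (X_toy["Longitude"] > -123) & (X_toy["Longitude"] < -120)
)

# Apply the same mask to both objects so rows and labels stay aligned.
X_kept = X_toy[mask]
y_kept = y_toy[mask]
```

Building the mask once and reusing it for both objects keeps the features and labels in step without relying on a later index join.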

In [None]:
X =
y =

## Convert Features into Z-Scores

Let's perform some light feature engineering by shifting and rescaling each feature column. We can do this by subtracting the column mean and dividing by the column standard deviation.
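
The z-score of a value x in a column with mean m and standard deviation s is (x - m) / s. A minimal sketch on a made-up pandas frame follows; `X_toy` is an illustrative name, and the standard deviation is the sample standard deviation (pandas' default `ddof=1`). The same expression is expected to work on a cuDF column, though that is not verified here.

```python
import pandas as pd

# Toy frame with two columns on very different scales (values are made up).
X_toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                      "b": [10.0, 20.0, 30.0, 40.0]})

for col in X_toy.columns:
    # Shift by the column mean, then rescale by the column standard deviation.
    X_toy[col] = (X_toy[col] - X_toy[col].mean()) / X_toy[col].std()
```

After this loop every column has mean 0 and standard deviation 1, which is what `describe()` should confirm on the real data.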

In [None]:
for col in X.columns:
    X[col] = 

In [None]:
X.describe()

## Cluster Analysis

In [None]:
import cuml

Let's use `cuml.cluster.KMeans` to cluster our feature matrix into 5 clusters. Store the cuDF object holding the cluster assignments in `clusters`.
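
To make "cluster assignments" concrete, here is a rough NumPy sketch of the textbook Lloyd's iteration behind k-means. It is not cuML's implementation; `kmeans_sketch` is a hypothetical helper and the data is made up. It only illustrates that the output is one integer label per row.

```python
import numpy as np

def kmeans_sketch(points, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's iterations; returns one integer label per row."""
    rng = np.random.default_rng(seed)
    # Start from n_clusters distinct rows chosen at random.
    centroids = points[rng.choice(len(points), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members):  # leave an empty cluster's centroid where it was
                centroids[k] = members.mean(axis=0)
    return labels

# Two well-separated toy blobs.
blobs = np.array([[0, 0], [0, 1], [1, 0],
                  [10, 10], [10, 11], [11, 10]], dtype=float)
labels = kmeans_sketch(blobs, n_clusters=2)
```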

In [None]:

clusters = 

In [None]:
import matplotlib.pyplot as plt
plt.scatter(X['Latitude'].to_pandas(),
            X['Longitude'].to_pandas(), 
            c=clusters.to_pandas(), s=0.4)

## Feature Significance

Let's start by using `cuml.linear_model.Lasso` to get a rough idea of the significance of each of our features. Assign the trained LASSO model to a variable named `lasso`.

In [None]:
lasso = 

Now `argsort` and reverse the absolute values of the trained LASSO coefficients. This can be done with `cudf.Series`. Assign the resulting `Series` object to the `feature_importance` variable.
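
One way to read this step, sketched with pandas and made-up coefficients: sorting the absolute values in descending order yields a `Series` whose index holds the original feature positions, which is the same ordering an argsort followed by a reversal would give. The names `coef`, `importance`, and `order` are illustrative; a cuDF `Series` is expected to behave similarly.

```python
import numpy as np
import pandas as pd

# Made-up coefficients standing in for a fitted model's `coef_`.
coef = np.array([0.8, -0.05, 0.0, -1.2])

# Sort |coef| in descending order; the index keeps each feature's original position.
importance = pd.Series(np.abs(coef)).sort_values(ascending=False)

# Positions of the features, most influential first.
order = importance.index.to_numpy()

# The same ordering via argsort followed by a reversal.
order_via_argsort = np.argsort(np.abs(coef))[::-1]
```

Keeping the positions in the index is what lets a later cell look up column names from `feature_importance.index`.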

In [None]:
feature_importance = 

In [None]:
feature_importance

In [None]:
import numpy as np
plt.scatter(np.arange(8), feature_importance.to_pandas())

In [None]:
X.columns

When we order our columns by their feature importance scores, we see that income, location, house age, and average number of bedrooms appear to have the largest effect on the resulting predictions.

In [None]:
X.columns[feature_importance.index.values.get()]

## Create Train & Test Datasets

Let's split our data into training and testing sets using `cuml.train_test_split`.
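
For intuition, a shuffled train/test split amounts to permuting the row indices and cutting them at a chosen fraction. The NumPy sketch below is a hypothetical stand-in (`split_sketch` is not a cuML function) and uses made-up data; it returns its four pieces in the same order the cell below unpacks them.

```python
import numpy as np

def split_sketch(X, y, test_size=0.2, seed=0):
    """Shuffle row indices, then cut them into test and train parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = order[:n_test], order[n_test:]
    # Index X and y with the same positions so rows and labels stay paired.
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: row i of X_toy is [2*i, 2*i + 1] and its label is i.
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10, dtype=float)
X_tr, X_te, y_tr, y_te = split_sketch(X_toy, y_toy, test_size=0.2)
```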

In [None]:
X_train, X_test, y_train, y_test = 

## Predict Median House Value

In [None]:
def compute_mean_avg_error(y_test, y_hat):
    # Mean absolute error: the average of |y_test - y_hat| over the test set.
    return ((y_test.reset_index(drop=True) - cudf.Series(y_hat)).abs().sum()) / len(y_test)

### Basic Linear Model

Create a `cuml.linear_model.ElasticNet` model below, train it, and assign the predicted results to a variable named `y_hat`. You can play around with different hyperparameters, such as the L1/L2 mixing ratio (`l1_ratio`) and the amount of weight to give the penalty terms (`alpha`), and see how they affect the mean absolute error and the R2 score.
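
The two metrics used to compare models in this section are the mean absolute error (computed by `compute_mean_avg_error` above) and the coefficient of determination, R2. A small NumPy sketch with made-up numbers shows how each is defined; the variable names are illustrative.

```python
import numpy as np

# Made-up targets and predictions.
y_true = np.array([3.0, 1.0, 2.0, 4.0])
y_pred = np.array([2.5, 1.0, 2.0, 5.0])

# Mean absolute error: average size of the residuals.
mae = np.abs(y_true - y_pred).mean()

# R2 = 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot
```

Lower is better for the mean absolute error, while R2 is at most 1 and equals 1 only for perfect predictions.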

In [None]:


y_hat = 

In [None]:
y_hat

In [None]:
y_test

In [None]:
compute_mean_avg_error(y_test, y_hat)

Use the `score()` function on the trained model to use the default scorer, which is an R2 score.

In [None]:
r2_score = 

In [None]:
r2_score

### Random Forest Model

Let's train a non-linear, non-parametric model and see how it performs. Train a `cuml.ensemble.RandomForestRegressor` model below and assign the resulting predictions to the `y_hat` variable. You can play around with hyperparameters like the maximum decision tree depth (`max_depth`) and the size of the forest (`n_estimators`).

In [None]:


y_hat = 

In [None]:
compute_mean_avg_error(y_test, y_hat)

Use the `score()` function on the trained model to use the default scorer, which is an R2 score.

In [None]:
r2_score = 

In [None]:
r2_score

### Nearest Neighbors Model

A really simple and surprisingly effective non-linear and non-parametric regression model is the `cuml.neighbors.KNeighborsRegressor`. Train one below and assign the resulting predictions to `y_hat`. You can play around with the number of neighbors to include in the regression computation (`n_neighbors`) or the distance metric that is used to compute the neighborhoods (`metric`) and see how they affect the resulting predictions.
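
To show what the regressor computes, here is a rough NumPy sketch of nearest-neighbor regression with a Euclidean metric and unweighted averaging. It is a stand-in for intuition, not cuML's implementation; `knn_regress_sketch` is a hypothetical helper and the data is made up.

```python
import numpy as np

def knn_regress_sketch(X_train, y_train, X_query, n_neighbors=3):
    # Euclidean distance from each query row to each training row.
    dists = np.sqrt(((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2))
    # Indices of the n_neighbors closest training rows for each query.
    nearest = np.argsort(dists, axis=1)[:, :n_neighbors]
    # The prediction is the unweighted mean of those neighbors' targets.
    return y_train[nearest].mean(axis=1)

X_tr = np.array([[0.0], [1.0], [2.0], [10.0]])
y_tr = np.array([0.0, 1.0, 2.0, 10.0])
pred = knn_regress_sketch(X_tr, y_tr, np.array([[1.1]]), n_neighbors=3)
```

Here the three training points nearest to 1.1 have targets 1, 2, and 0, so the prediction is their mean. Raising `n_neighbors` to 4 would pull in the distant point and shift the estimate, which is the trade-off this hyperparameter controls.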

In [None]:


y_hat =

In [None]:
compute_mean_avg_error(y_test, y_hat)

Use the `score()` function on the trained model to use the default scorer, which is an R2 score.

In [None]:
r2_score =

In [None]:
r2_score

### SVM Model

Finally, let's train a kernel-based estimator. Train a support vector regressor using `cuml.svm.SVR` and use the hyperparameter `kernel='rbf'` for a non-linear decision function. As with the models above, assign the output predictions to a variable named `y_hat`.
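
For reference, the RBF kernel measures similarity as exp(-gamma * ||x - x'||^2), so nearby points score close to 1 and distant points close to 0. The NumPy sketch below evaluates it on made-up points; `rbf_kernel_sketch` is a hypothetical helper and the `gamma` value is arbitrary.

```python
import numpy as np

def rbf_kernel_sketch(A, B, gamma=0.5):
    # Squared Euclidean distance between every row of A and every row of B.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

pts = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
K = rbf_kernel_sketch(pts, pts)
```

The resulting matrix is symmetric with ones on the diagonal, since every point is at distance zero from itself.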

In [None]:
y_hat = 

In [None]:
compute_mean_avg_error(y_test, y_hat)

Use the `score()` function on the trained model to use the default scorer, which is an R2 score.

In [None]:
r2_score = 

In [None]:
r2_score

Let's inspect the resulting predictions and see how our SVM model performed.

In [None]:
y_test

In [None]:
y_hat