In [1]:
!nvidia-smi

Sun Jan 26 03:21:19 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   44C    P8     9W /  70W |      0MiB / 15079MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|  No ru

In [2]:
import pynvml
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
device_name = pynvml.nvmlDeviceGetName(handle)
if (device_name != b'Tesla T4') and (device_name != b'Tesla P100-PCIE-16GB'):
  raise Exception("""
    Unfortunately this instance does not have a T4 or P100 GPU.
    
    Please make sure you've configured Colab to request a GPU instance type.
    
    Sometimes Colab allocates a Tesla K80 instead of a T4 or P100.
    If you get a K80 GPU, try Runtime -> Reset all runtimes...
  """)
else:
  print(f'Yes, you got the right kind of GPU to work and it is a ({device_name}) GPU.')

Yes, you got the right kind of GPU to work and it is a (b'Tesla T4') GPU.


In [3]:
!wget -nc https://github.com/rapidsai/notebooks-extended/raw/master/utils/rapids-colab.sh
!bash rapids-colab.sh

import sys, os

sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'

--2020-01-26 03:21:46--  https://github.com/rapidsai/notebooks-extended/raw/master/utils/rapids-colab.sh
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://github.com/rapidsai/notebooks-contrib/raw/master/utils/rapids-colab.sh [following]
--2020-01-26 03:21:47--  https://github.com/rapidsai/notebooks-contrib/raw/master/utils/rapids-colab.sh
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/rapidsai/notebooks-contrib/master/utils/rapids-colab.sh [following]
--2020-01-26 03:21:47--  https://raw.githubusercontent.com/rapidsai/notebooks-contrib/master/utils/rapids-colab.sh
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.co

# **Truncated Singular Value Decomposition (TSVD)**
TSVD is a linear dimensionality reduction algorithm that works well for datasets in which samples are correlated in large groups. Unlike PCA, TSVD does not center the data before computation.
The TSVD model is implemented in the cuML library and accepts the following parameters:
1. n_components: the number of top K singular vectors to keep in the output. n_components must be <= the number of columns.
2. algorithm: the solver to use, either 'full' or 'jacobi'. Jacobi is much faster because it corrects iteratively, but it is less accurate.
3. n_iter: the number of iterations to run when algorithm = 'jacobi'.
4. tol: the convergence tolerance to use when algorithm = 'jacobi'.
5. random_state: set a random state if the results should be reproducible across multiple runs.
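
To make `n_components` concrete, here is a minimal NumPy sketch of what a truncated SVD computes. It is not cuML's implementation: the helper name `truncated_svd` and the toy data are assumptions made purely for illustration, and the full SVD is computed first only because that is the simplest way to show the idea.

```python
import numpy as np

def truncated_svd(X, n_components):
    """Hypothetical helper: top-k SVD of X without centering (illustration only)."""
    # Thin SVD: X = U @ diag(s) @ Vt, with singular values sorted in descending order
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    components = Vt[:n_components]        # top-k right singular vectors, shape (k, n_features)
    singular_values = s[:n_components]    # shape (k,)
    X_reduced = X @ components.T          # projected data, shape (n_samples, k)
    return X_reduced, singular_values, components

rng = np.random.default_rng(42)
X = rng.random((100, 8))                  # toy data, NOT the mortgage dataset

# PCA would run the same steps on X - X.mean(axis=0); TSVD uses X as it is
X_reduced, singular_values, components = truncated_svd(X, n_components=3)
print(X_reduced.shape, components.shape)
```

Because the components are orthonormal, each column of `X_reduced` has a Euclidean norm equal to the corresponding singular value.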

The following methods are available on the TSVD model:
1. fit: fits the TSVD model to the dataset
2. fit_transform: fits the TSVD model to the dataset and returns the dimensionally reduced data
3. transform: performs dimensionality reduction on the dataset using the fitted model
4. inverse_transform: maps dimensionally reduced data back to the original feature space, giving an approximation of the original dataset
5. get_params: returns the values of the parameters set for the TSVD model
6. set_params: allows the user to set the parameters of the TSVD model
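
The relationship between transform and inverse_transform can be sketched with plain NumPy. This is an illustrative assumption about the underlying linear algebra (projection onto the kept components and back), not a call into cuML; the `reconstruct` helper and the rank-2 toy matrix are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy matrix of exact rank 2: five features built from two latent factors
X = rng.random((50, 2)) @ rng.random((2, 5))

_, _, Vt = np.linalg.svd(X, full_matrices=False)

def reconstruct(X, Vt, k):
    """Hypothetical helper: project onto the top-k components and map back."""
    components = Vt[:k]
    X_reduced = X @ components.T      # the 'transform' step in this sketch
    return X_reduced @ components     # the 'inverse_transform' step in this sketch

err_k1 = np.abs(X - reconstruct(X, Vt, 1)).max()
err_k2 = np.abs(X - reconstruct(X, Vt, 2)).max()
print(f"k=1 max reconstruction error: {err_k1:.3e}")
print(f"k=2 max reconstruction error: {err_k2:.3e}")
```

With one component the reconstruction is lossy; with two components (the true rank of this toy matrix) it matches the input up to floating-point error. On real data the reconstruction is generally only an approximation.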

The model accepts only numpy arrays or cudf dataframes as input. To convert your dataset to cudf format, see the cudf documentation at https://rapidsai.github.io/projects/cudf/en/latest/. For additional information on the TSVD model, refer to the documentation at https://rapidsai.github.io/projects/cuml/en/0.6.0/api.html#truncated-svd

In [0]:
import os

import numpy as np
import pandas as pd

from sklearn.decomposition import TruncatedSVD as skTSVD

import cudf
from cuml import TruncatedSVD as cumlTSVD

In [0]:
# load the mortgage dataset if it is present, otherwise raise an error
import gzip
# change the path of the mortgage dataset if you have saved it in a different directory
def load_data(nrows, ncols, cached = 'mortgage.npy.gz'):
    if os.path.exists(cached):
        print('use mortgage data')
        with gzip.open(cached) as f:
            X = np.load(f)
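        # sample nrows random rows (with replacement) and keep the first ncols columns;
        # randint's upper bound is exclusive, so X.shape[0] lets every row be drawn
        X = X[np.random.randint(0, X.shape[0], nrows), :ncols]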
    else:
        # raise an exception if the dataset is not present
        raise FileNotFoundError('Please download the required dataset or check the path')
    df = pd.DataFrame({'fea%d'%i:X[:,i] for i in range(X.shape[1])})
    return df

In [0]:
# this function checks whether the results obtained from two different libraries (sklearn and cuml) agree,
# i.e. whether the mean squared error between them is below the threshold
from sklearn.metrics import mean_squared_error
def array_equal(a,b,threshold=5e-3,with_sign=True):
    a = to_nparray(a)
    b = to_nparray(b)
    # with_sign=False compares magnitudes only, ignoring sign differences
    if not with_sign:
        a,b = np.abs(a),np.abs(b)
    error = mean_squared_error(a,b)
    res = error<threshold
    return res

# the function converts a variable from ndarray or dataframe format to numpy array
def to_nparray(x):
    if isinstance(x,np.ndarray) or isinstance(x,pd.DataFrame):
        return np.array(x)
    elif isinstance(x,np.float64):
        return np.array([x])
    elif isinstance(x,cudf.DataFrame) or isinstance(x,cudf.Series):
        return x.to_pandas().values
    return x
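
The `with_sign` option above matters because singular vectors are only defined up to sign: two correct solvers may return the same component multiplied by -1. The NumPy sketch below illustrates this on toy data; the variable names are assumptions for illustration and nothing here calls sklearn or cuML.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((200, 6))                 # toy data, NOT the mortgage dataset

_, _, Vt = np.linalg.svd(X, full_matrices=False)
components_a = Vt[:3]
# Flip the sign of the second component: still a valid set of singular vectors
components_b = components_a * np.array([[1.0], [-1.0], [1.0]])

# Signed comparison penalizes the flipped row; magnitude comparison does not
mse_signed = np.mean((components_a - components_b) ** 2)
mse_unsigned = np.mean((np.abs(components_a) - np.abs(components_b)) ** 2)
print(mse_signed, mse_unsigned)
```

Here the flipped row is a unit vector, so the signed error is 4 / 18 (squared norm of twice the row, averaged over 3 x 6 entries), while the magnitude-only error is exactly zero.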

## **Loading Data**

In [8]:
from google.colab import files
uploaded = files.upload()

Saving mortgage.npy.gz to mortgage.npy.gz


In [9]:
%%time
# nrows = number of samples
# ncols = number of features of each sample

nrows = 2**22
ncols = 40

X = load_data(nrows, ncols)
print('data', X.shape)

use mortgage data
data (4194304, 40)
CPU times: user 7.61 s, sys: 1.67 s, total: 9.28 s
Wall time: 9.33 s


In [11]:
X

Unnamed: 0,fea0,fea1,fea2,fea3,fea4,fea5,fea6,fea7,fea8,fea9,fea10,fea11,fea12,fea13,fea14,fea15,fea16,fea17,fea18,fea19,fea20,fea21,fea22,fea23,fea24,fea25,fea26,fea27,fea28,fea29,fea30,fea31,fea32,fea33,fea34,fea35,fea36,fea37,fea38,fea39
0,0.486111,0.196900,0.031532,0.751953,0.861111,0.000000,0.0,0.044486,0.059951,0.061698,0.716319,0.083032,0.231624,0.113612,0.212412,0.00678,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.652778,0.483108,0.049550,0.744141,0.969444,0.531162,0.0,0.044486,0.059951,0.061698,0.716319,0.083032,0.231624,0.113612,0.212412,0.00678,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.750000,0.196900,0.022523,0.755859,0.988889,0.000000,0.0,0.044486,0.059951,0.061698,0.716319,0.083032,0.231624,0.113612,0.212412,0.00678,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.513889,0.173428,0.292793,0.287109,0.261111,0.000000,0.0,0.044486,0.059951,0.061698,0.716319,0.083032,0.231624,0.113612,0.212412,0.00678,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.652778,0.165342,0.198198,0.679688,0.880556,0.486932,0.0,0.044486,0.059951,0.061698,0.716319,0.083032,0.231624,0.113612,0.212412,0.00678,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4194299,0.541667,0.414940,0.126126,0.710938,0.925000,0.341375,0.0,0.044486,0.059951,0.061698,0.716319,0.083032,0.231624,0.113612,0.212412,0.00678,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4194300,0.500000,0.259399,0.360360,0.609375,0.666667,0.806996,0.0,0.044486,0.059951,0.061698,0.716319,0.083032,0.231624,0.113612,0.212412,0.00678,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4194301,0.527778,0.452519,0.076577,0.263672,0.000000,0.638118,0.0,0.044486,0.059951,0.061698,0.716319,0.083032,0.231624,0.113612,0.212412,0.00678,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4194302,0.569444,0.051263,0.391892,0.126953,0.091667,0.000000,0.0,0.044486,0.059951,0.061698,0.716319,0.083032,0.231624,0.113612,0.212412,0.00678,0.0,0.5,0.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## **Model Parameters**

In [0]:
# define the value of some of the model parameters
n_components = 10
random_state = 42

## **Scikit-Learn Implementation**

In [14]:
%%time
# use the sklearn tsvd to reduce the dimensionality of the dataset
algorithm = 'arpack'
tsvd_sk = skTSVD(n_components = n_components,
                 algorithm = algorithm, 
                 random_state = random_state)
# fit the sklearn tsvd model to the dataset and return the dimensionally reduced dataset
result_sk = tsvd_sk.fit_transform(X)

CPU times: user 7.4 s, sys: 249 ms, total: 7.64 s
Wall time: 4.29 s


## **cuML Implementation**

In [15]:
%%time
# convert pandas dataframe to cudf dataframe
X_cdf = cudf.DataFrame.from_pandas(X)

CPU times: user 635 ms, sys: 173 ms, total: 808 ms
Wall time: 1.08 s


In [16]:
%%time
# use the cuml tsvd model to reduce the dimensionality of the dataset
algorithm = 'full'
tsvd_cuml = cumlTSVD(n_components = n_components,
                     algorithm = algorithm, 
                     random_state = random_state)
# fit the cuml tsvd model to the dataset and return the dimensionally reduced dataset
result_cuml = tsvd_cuml.fit_transform(X_cdf)

CPU times: user 1.38 s, sys: 269 ms, total: 1.65 s
Wall time: 6.17 s


In [17]:
# obtain attributes of the sklearn and cuml tsvd and check to see if they are equal
for attr in ['singular_values_',
             'components_']:
    passed = array_equal(getattr(tsvd_sk, attr), getattr(tsvd_cuml, attr),threshold=0.1)
    # larger error margin due to different algorithms: arpack vs full
    message = 'compare tsvd: cuml vs sklearn {:>25} {}'.format(attr,'equal' if passed else 'NOT equal')
    print(message)

compare tsvd: cuml vs sklearn          singular_values_ equal
compare tsvd: cuml vs sklearn               components_ equal


In [18]:
# compare the reduced matrix
passed = array_equal(result_sk,
                     result_cuml,
                     threshold=0.1)
# larger error margin due to different algorithms: arpack vs full
message = 'compare tsvd: cuml vs sklearn transformed results %s'%('equal' if passed else 'NOT equal')
print(message)

compare tsvd: cuml vs sklearn transformed results equal
