# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Check the output of `!nvidia-smi` to make sure you've been allocated a Tesla T4, P4 or P100.

In [0]:
!nvidia-smi
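The same check can be done programmatically instead of reading the `nvidia-smi` table by eye. The following is a minimal sketch, not part of the original notebook: the helper names `is_supported_gpu` and `query_gpu_name` are hypothetical, and it assumes the GPU name reported by `nvidia-smi --query-gpu=name` contains the model token (for example `Tesla T4`).

```python
import subprocess

# GPU models named in the text above.
SUPPORTED_GPUS = ("T4", "P4", "P100")

def is_supported_gpu(name):
    """Return True if the GPU name contains one of the supported model tokens."""
    # Split on spaces and hyphens so "Tesla P100-PCIE-16GB" yields the token "P100".
    tokens = name.replace("-", " ").split()
    return any(model in tokens for model in SUPPORTED_GPUS)

def query_gpu_name():
    """Return the first GPU name reported by nvidia-smi, or None if unavailable."""
    try:
        completed = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
            check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    lines = completed.stdout.strip().splitlines()
    return lines[0].strip() if lines else None

name = query_gpu_name()
if name is None:
    print("No GPU detected; check the runtime type.")
elif is_supported_gpu(name):
    print("OK:", name)
else:
    print("Not one of T4/P4/P100:", name)
```

If the last branch is reached, resetting the runtime and reconnecting may allocate a different GPU.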

# Setup RAPIDS
The setup script:
1. Installs the most recent Miniconda release compatible with Google Colab's Python install (3.6.7)
1. Removes incompatible files
1. Installs the RAPIDS libraries
1. Sets the necessary environment variables
1. Copies the RAPIDS .so files into the current working directory, a workaround for conda/Colab interactions

In [2]:
# Install RAPIDS
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!bash rapidsai-csp-utils/colab/rapids-colab.sh

import sys, os

# Put conda's site-packages directly ahead of Colab's dist-packages on the import path
dist_package_index = sys.path.index('/usr/local/lib/python3.6/dist-packages')
sys.path = sys.path[:dist_package_index] + ['/usr/local/lib/python3.6/site-packages'] + sys.path[dist_package_index:]
# Run the helper script shipped with rapidsai-csp-utils
exec(open('rapidsai-csp-utils/colab/update_modules.py').read(), globals())

fatal: destination path 'rapidsai-csp-utils' already exists and is not an empty directory.
PLEASE READ
********************************************************************************************************
Changes:
1. Default stable version is now 0.14.  Nightly is now 0.15.  We have fixed the long conda install.  Hooray!
2. For stable releases, we now use static yml files, in case of incompatible dependancy changes later.
3. You can now declare your RAPIDSAI version as a CLI option and skip the user prompts (ex: '0.14' or '0.15', between 0.13 to 0.15, without the quotes): 
        "!bash rapidsai-csp-utils/colab/rapids-colab.sh <version/label>"
        Examples: '!bash rapidsai-csp-utils/colab/rapids-colab.sh 0.14', or '!bash rapidsai-csp-utils/colab/rapids-colab.sh stable', or '!bash rapidsai-csp-utils/colab/rapids-colab.sh s'
                  '!bash rapidsai-csp-utils/colab/rapids-colab.sh 0.15, or '!bash rapidsai-csp-utils/colab/rapids-colab.sh nightly', or '!bash rapidsai-csp-u
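The install cell rebuilds `sys.path` so that the conda `site-packages` directory sits immediately before Colab's `dist-packages`. The slicing pattern is easier to follow on a plain list. This is an illustrative sketch only: `insert_before` is a hypothetical helper, and the example list is made up rather than taken from a real Colab session.

```python
def insert_before(paths, marker, new_entry):
    """Return a copy of paths with new_entry placed immediately before marker."""
    idx = paths.index(marker)  # raises ValueError if marker is not in the list
    return paths[:idx] + [new_entry] + paths[idx:]

# Made-up search path, for illustration only.
example = [
    "/content",
    "/usr/local/lib/python3.6/dist-packages",
    "/usr/lib/python3/dist-packages",
]
result = insert_before(
    example,
    "/usr/local/lib/python3.6/dist-packages",
    "/usr/local/lib/python3.6/site-packages",
)
print(result)
```

Because Python searches `sys.path` in order, a package present in both directories is imported from the entry that comes first.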

# Setup Spark

In [0]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.0.0-preview2/spark-3.0.0-preview2-bin-hadoop3.2.tgz
!tar -xvf spark-3.0.0-preview2-bin-hadoop3.2.tgz
!pip install -q findspark

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.0-preview2-bin-hadoop3.2"
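A wrong `JAVA_HOME` or `SPARK_HOME` usually surfaces later as a confusing error when the Spark session starts, so it can help to verify both directories right after setting them. A minimal sketch, assuming only the standard library; `missing_dirs` is a hypothetical helper name.

```python
import os

def missing_dirs(env, keys):
    """Return {key: value} for keys that are unset or do not point to a directory."""
    problems = {}
    for key in keys:
        value = env.get(key)
        if value is None or not os.path.isdir(value):
            problems[key] = value
    return problems

problems = missing_dirs(os.environ, ["JAVA_HOME", "SPARK_HOME"])
for key, value in problems.items():
    print("{} is unset or not a directory: {!r}".format(key, value))
```

An empty result means both variables point at existing directories; it does not confirm that the installations inside them are complete.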

In [0]:
!wget -O test.csv -q "https://zenodo.org/record/2595588/files/all-sorted-2018-01-21-to-2019-02-04.csv?download=1"

# cuDF Compared to pandas DataFrame #

Now you can run code! 

What follows are basic examples where all processing takes place on the GPU.

# [cuDF](https://github.com/rapidsai/cudf) #

Load a dataset into a GPU-memory-resident DataFrame, then write it back to disk as a CSV file.

_Note_: You must import nvstrings and nvcategory before cudf, else you'll get errors.

In [0]:
import cudf
import time

# Time a CSV round trip on the GPU: read the downloaded test.csv, then write it back to disk
start = time.time()
df = cudf.read_csv('/content/test.csv')
df.to_csv('/content/testdde.csv')
print('seconds: {}'.format(time.time()-start))

In [0]:
import pandas as pd
import time

start = time.time()
df = pd.read_csv('test.csv')
df.to_csv('pandasdf.csv')
print('seconds: {}'.format(time.time()-start))
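The cuDF and pandas cells repeat the same start/stop timing pattern. One way to make the comparison less error-prone is to wrap it in a small helper. This is a sketch, not part of the original notebook: `timed` is a hypothetical name, and it uses `time.perf_counter` rather than `time.time` because the former is intended for measuring intervals.

```python
import time

def timed(label, func, *args, **kwargs):
    """Call func(*args, **kwargs), print the elapsed seconds, return (result, seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print("{}: {:.3f} seconds".format(label, elapsed))
    return result, elapsed

# Stand-in workload so the sketch runs anywhere.
total, secs = timed("sum", sum, range(1000000))
```

In the notebook, the read step could be timed as `timed("pandas read", pd.read_csv, "test.csv")`, and the cuDF equivalent in the same way, so that reading and writing are measured separately.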

# Start Spark Session

In [0]:
# Remove the Spark output folder if it already exists
!rm -rf sparkdf

In [0]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
import time

spark = SparkSession.builder.master("local[*]").config('spark.executor.memory','8g').getOrCreate()

# Time a CSV round trip with Spark: read test.csv, then write it to the sparkdf directory
start = time.time()
df = spark.read.format('csv').options(header='true', inferSchema='true').load('test.csv')
df.write.option("header","true").csv("sparkdf")
print('seconds: {}'.format(time.time()-start))
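Unlike the pandas and cuDF cells, the Spark writer produces a directory (`sparkdf`) containing one or more part files rather than a single CSV, which is worth keeping in mind when comparing outputs. The sketch below lists those files and their combined size. It assumes the usual `part-*.csv` naming for the part files; the helper names are hypothetical.

```python
import glob
import os

def list_part_files(output_dir):
    """Return the sorted part-file paths inside a Spark CSV output directory."""
    return sorted(glob.glob(os.path.join(output_dir, "part-*.csv")))

def total_size_bytes(paths):
    """Return the combined size in bytes of the given files."""
    return sum(os.path.getsize(p) for p in paths)

parts = list_part_files("sparkdf")
print(len(parts), "part files,", total_size_bytes(parts), "bytes")
```

If the directory does not exist yet, the list is simply empty and the reported size is zero.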