## 1. Data Ingestion

The `diamonds` dataset is widely used in data science and machine learning. We use it to demonstrate Snowflake's native data science transformers, both in terms of database functionality and Spark and pandas compatibility, with real (non-synthetic), statistically appropriate data that is well known to the ML community.



### Establish Secure Connection to Snowflake

*Other connection options include Username/Password, MFA, OAuth, Okta, SSO. For more information, refer to the [Python Connector](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-example) documentation.*
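The session cell later in this section reads its parameters from a `connection.json` file. As a minimal sketch, assuming username/password authentication, the file could be validated before a session is created. The helper name `load_connection_parameters`, the set of required keys, and the placeholder values are illustrative assumptions, not part of the original notebook.

```python
import json
import os
import tempfile

# Keys assumed to be required for this lab with username/password authentication.
# Adjust this set for other methods (e.g. OAuth or SSO).
REQUIRED_KEYS = {"account", "user", "password", "role", "warehouse", "database", "schema"}


def load_connection_parameters(path):
    """Load a JSON file of connection parameters and fail early on missing keys."""
    with open(path) as f:
        params = json.load(f)
    missing = REQUIRED_KEYS - params.keys()
    if missing:
        raise KeyError(f"{path} is missing keys: {sorted(missing)}")
    return params


# Demonstration with placeholder values written to a temporary file.
example = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "role": "ACCOUNTADMIN",
    "warehouse": "ML_HOL_WH",
    "database": "ML_HOL_DB",
    "schema": "ML_HOL_SCHEMA",
}
with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "connection.json")
    with open(example_path, "w") as f:
        json.dump(example, f)
    loaded = load_connection_parameters(example_path)

print(sorted(loaded))
```

The returned dictionary has the shape that `Session.builder.configs(...)` accepts, so a missing key surfaces as a clear error before any connection attempt is made.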

### Import Libraries

In [1]:
# Snowpark for Python
from snowflake.snowpark import Session
from snowflake.snowpark.version import VERSION
from snowflake.snowpark.types import StructType, StructField, FloatType, StringType, IntegerType
import snowflake.snowpark.functions as F

# data science libs
import numpy as np

# misc
import json

In [2]:
# Make a Snowpark Connection

################################################################################################################
#  You can also use the SnowSQL Client to configure your connection params:
#  https://docs.snowflake.com/en/user-guide/snowsql-install-config.html
#
#  >>> from snowflake.ml.utils import connection_params
#  >>> session = Session.builder.configs(connection_params.SnowflakeLoginOptions()
#  >>> ).create()   
#
#  NOTE: If you have named connection params then specify the connection name
#  Example:
#  
#  >>> session = Session.builder.configs(
#  >>> connection_params.SnowflakeLoginOptions(connection_name='connections.snowml')
#  >>> ).create()
#
#################################################################################################################

# Create Snowflake Session object
with open('connection.json') as f:
    connection_parameters = json.load(f)
session = Session.builder.configs(connection_parameters).create()
session.sql_simplifier_enabled = True

snowflake_environment = session.sql('SELECT current_user(), current_version()').collect()
snowpark_version = VERSION

# Current Environment Details
print('User                        : {}'.format(snowflake_environment[0][0]))
print('Role                        : {}'.format(session.get_current_role()))
print('Database                    : {}'.format(session.get_current_database()))
print('Schema                      : {}'.format(session.get_current_schema()))
print('Warehouse                   : {}'.format(session.get_current_warehouse()))
print('Snowflake version           : {}'.format(snowflake_environment[0][1]))
print('Snowpark for Python version : {}.{}.{}'.format(snowpark_version[0],snowpark_version[1],snowpark_version[2]))

User                        : SIKHADAS
Role                        : "ACCOUNTADMIN"
Database                    : "ML_HOL_DB"
Schema                      : "ML_HOL_SCHEMA"
Warehouse                   : "ML_HOL_WH"
Snowflake version           : 7.21.1
Snowpark for Python version : 1.4.0


### Stage the `diamonds` CSV file to be read into the Snowpark DataFrame Reader

For more information on loading data, see the documentation for [snowflake.snowpark.DataFrameReader](https://docs.snowflake.com/ko/developer-guide/snowpark/reference/python/api/snowflake.snowpark.DataFrameReader.html).

First, download the `diamonds` data from https://github.com/tidyverse/ggplot2/blob/882584f915b23cda5091fb69e88f19e8200811bf/data-raw/diamonds.csv and save it as `diamonds.csv` in this repo's folder.

Once it's downloaded, run the rest of the cells in order to stage the file in Snowflake.
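The link above points to GitHub's HTML view of the file, so saving that page directly would not yield a CSV. One way to fetch the raw file is sketched below, assuming GitHub's usual `raw.githubusercontent.com` URL layout; the helper names `to_raw_url` and `download_diamonds` are hypothetical.

```python
import urllib.request

BLOB_URL = (
    "https://github.com/tidyverse/ggplot2/blob/"
    "882584f915b23cda5091fb69e88f19e8200811bf/data-raw/diamonds.csv"
)


def to_raw_url(blob_url):
    """Convert a github.com 'blob' page URL into its raw-content URL."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError(f"not a GitHub blob URL: {blob_url}")
    owner_repo, ref_and_path = blob_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{ref_and_path}"


def download_diamonds(dest="diamonds.csv"):
    """Download the CSV next to the notebook (requires network access)."""
    urllib.request.urlretrieve(to_raw_url(BLOB_URL), dest)
    return dest


# Uncomment to download: download_diamonds()
print(to_raw_url(BLOB_URL))
```

The download call is left commented out so the cell has no side effects until it is explicitly enabled.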



In [3]:
# Upload the diamonds CSV file to the stage we created earlier
session.file.put("diamonds.csv", "@DIAMONDS_ASSETS", auto_compress=False)

[PutResult(source='diamonds.csv', target='diamonds.csv', source_size=2814963, target_size=0, source_compression='NONE', target_compression='NONE', status='SKIPPED', message='')]

In [4]:
# Define the schema for the data in the CSV file
diamonds_schema = StructType([StructField("row", IntegerType()), 
                              StructField("carat", FloatType()), 
                              StructField("cut", StringType()),
                              StructField("color", StringType()),
                              StructField("clarity", StringType()),
                              StructField("depth", FloatType()),
                              StructField("table", FloatType()),
                              StructField("price", FloatType()),
                              StructField("x", FloatType()),
                              StructField("y", FloatType()),
                              StructField("z", FloatType())
                              ])

# Create a Snowpark DataFrame that is configured to load data from the CSV file
diamonds_df = session.read.options({"field_delimiter": ",", "skip_header": 1}).schema(diamonds_schema).csv("@DIAMONDS_ASSETS/diamonds.csv")
diamonds_df.show()

# Look at descriptive stats on the DataFrame
diamonds_df.describe().show()

--------------------------------------------------------------------------------------------------------
|"ROW"  |"CARAT"  |"CUT"      |"COLOR"  |"CLARITY"  |"DEPTH"  |"TABLE"  |"PRICE"  |"X"   |"Y"   |"Z"   |
--------------------------------------------------------------------------------------------------------
|1      |0.23     |Ideal      |E        |SI2        |61.5     |55.0     |326.0    |3.95  |3.98  |2.43  |
|2      |0.21     |Premium    |E        |SI1        |59.8     |61.0     |326.0    |3.89  |3.84  |2.31  |
|3      |0.23     |Good       |E        |VS1        |56.9     |65.0     |327.0    |4.05  |4.07  |2.31  |
|4      |0.29     |Premium    |I        |VS2        |62.4     |58.0     |334.0    |4.2   |4.23  |2.63  |
|5      |0.31     |Good       |J        |SI2        |63.3     |58.0     |335.0    |4.34  |4.35  |2.75  |
|6      |0.24     |Very Good  |J        |VVS2       |62.8     |57.0     |336.0    |3.94  |3.96  |2.48  |
|7      |0.24     |Very Good  |I        |VVS1       |62

### Data cleaning

First, we standardize the category formatting for `CUT` using Snowpark DataFrame operations.

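Standardizing the values has two benefits. The category values no longer contain spaces or mixed case (for example, `Very Good` becomes `VERY_GOOD`), which avoids inconsistencies when the data is written to a Snowflake table and read back into a Snowpark DataFrame. The later feature transformations on categorical columns are also easier to encode.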
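To see what this standardization does before running it in Snowflake, the same rule can be previewed in plain Python. This is a local sketch using the standard `re` module, not the Snowpark code path; `fix_value_local` is a hypothetical name, and the regular expression mirrors the one used in the cell that follows.

```python
import re


def fix_value_local(value):
    """Replace each run of non-alphanumeric characters with '_' and uppercase."""
    return re.sub(r"[^a-zA-Z0-9]+", "_", value).upper()


# Cut grades that appear in the sample output of this section
for cut in ["Ideal", "Premium", "Good", "Very Good"]:
    print(f"{cut!r} -> {fix_value_local(cut)!r}")
```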

In [5]:
def fix_values(column):
    # Replace each run of non-alphanumeric characters with '_' and uppercase the result
    return F.upper(F.regexp_replace(F.col(column), '[^a-zA-Z0-9]+', '_'))

for col in ["CUT"]:
    diamonds_df = diamonds_df.with_column(col, fix_values(col))

diamonds_df.show()

--------------------------------------------------------------------------------------------------------
|"ROW"  |"CARAT"  |"COLOR"  |"CLARITY"  |"DEPTH"  |"TABLE"  |"PRICE"  |"X"   |"Y"   |"Z"   |"CUT"      |
--------------------------------------------------------------------------------------------------------
|1      |0.23     |E        |SI2        |61.5     |55.0     |326.0    |3.95  |3.98  |2.43  |IDEAL      |
|2      |0.21     |E        |SI1        |59.8     |61.0     |326.0    |3.89  |3.84  |2.31  |PREMIUM    |
|3      |0.23     |E        |VS1        |56.9     |65.0     |327.0    |4.05  |4.07  |2.31  |GOOD       |
|4      |0.29     |I        |VS2        |62.4     |58.0     |334.0    |4.2   |4.23  |2.63  |PREMIUM    |
|5      |0.31     |J        |SI2        |63.3     |58.0     |335.0    |4.34  |4.35  |2.75  |GOOD       |
|6      |0.24     |J        |VVS2       |62.8     |57.0     |336.0    |3.94  |3.96  |2.48  |VERY_GOOD  |
|7      |0.24     |I        |VVS1       |62.3     |57.0

Second, we drop the `row` column, force the headers to uppercase, and rename `TABLE` to `TABLE_PCT`, all using Snowpark DataFrame operations.
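The renaming rule can be written as a small pure-Python function, which makes it easy to check before applying it with `with_column_renamed`. This is a sketch; `plan_renames` is a hypothetical helper, and the input list is copied from the column order shown in the previous output, without `ROW`.

```python
def plan_renames(columns):
    """Map each column name to its uppercase form, with TABLE becoming TABLE_PCT."""
    renames = {}
    for name in columns:
        upper = name.upper()
        renames[name] = "TABLE_PCT" if upper == "TABLE" else upper
    return renames


columns = ["CARAT", "COLOR", "CLARITY", "DEPTH", "TABLE", "PRICE", "X", "Y", "Z", "CUT"]
renames = plan_renames(columns)

# Only show the columns whose name actually changes
print({old: new for old, new in renames.items() if old != new})
```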


In [6]:
# Drop 'ROW'
diamonds_df = diamonds_df.drop('ROW')

# Force headers to uppercase and rename TABLE to TABLE_PCT
for colname in diamonds_df.columns:
    if colname.upper() == "TABLE":
        new_colname = "TABLE_PCT"
    else:
        new_colname = colname.upper()

    diamonds_df = diamonds_df.with_column_renamed(colname, new_colname)

diamonds_df.show()

----------------------------------------------------------------------------------------------------
|"CARAT"  |"COLOR"  |"CLARITY"  |"DEPTH"  |"TABLE_PCT"  |"PRICE"  |"X"   |"Y"   |"Z"   |"CUT"      |
----------------------------------------------------------------------------------------------------
|0.23     |E        |SI2        |61.5     |55.0         |326.0    |3.95  |3.98  |2.43  |IDEAL      |
|0.21     |E        |SI1        |59.8     |61.0         |326.0    |3.89  |3.84  |2.31  |PREMIUM    |
|0.23     |E        |VS1        |56.9     |65.0         |327.0    |4.05  |4.07  |2.31  |GOOD       |
|0.29     |I        |VS2        |62.4     |58.0         |334.0    |4.2   |4.23  |2.63  |PREMIUM    |
|0.31     |J        |SI2        |63.3     |58.0         |335.0    |4.34  |4.35  |2.75  |GOOD       |
|0.24     |J        |VVS2       |62.8     |57.0         |336.0    |3.94  |3.96  |2.48  |VERY_GOOD  |
|0.24     |I        |VVS1       |62.3     |57.0         |336.0    |3.95  |3.98  |2.47  |VER

### Write cleaned data to a Snowflake table

In [7]:
diamonds_df.write.mode('overwrite').save_as_table('diamonds')

In [8]:
session.close()