## Set Up Notebook: 

Run this notebook first to set up the datasets you'll require for the `01_Custom_Preprocessing_Pipeline.ipynb` notebook. 

In [1]:
#Imports: 
import json
import pandas as pd
import numpy as np

#Snowflake Imports: 
from snowflake.snowpark import Session

In [2]:
#Authenticate to Snowflake, here using a local json file with credentials:
with open('/Users/hapatel/.config/creds.json') as f:
    conn_params = json.load(f)
session = Session.builder.configs(conn_params).create()

#Use the appropriate database context (I created my own database/schema ahead of time, so yours may look different)
session.sql('USE ROLE ML_ENGINEER').collect()
session.sql('USE WAREHOUSE TEST').collect()
session.sql('USE DATABASE DEMO').collect()
session.sql('USE SCHEMA CUSTOMER_EXAMPLES').collect()

[Row(status='Statement executed successfully.')]
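
The authentication cell above assumes the credentials file exists and already holds everything Snowpark needs. As a minimal sketch, a hypothetical helper (`load_conn_params` is not part of Snowpark) could fail early when a key is missing. The set of required keys is an assumption; adjust it to match your own credentials file.

```python
import json
from pathlib import Path

# Assumed minimum set of keys; your file may use other fields
# (for example, key-pair authentication instead of a password).
REQUIRED_KEYS = ("account", "user", "password")


def load_conn_params(path):
    """Read connection parameters from a JSON file and check required keys."""
    with Path(path).expanduser().open() as f:
        params = json.load(f)
    missing = [key for key in REQUIRED_KEYS if key not in params]
    if missing:
        raise KeyError(f"Missing connection parameters: {missing}")
    return params
```

The returned dictionary can be passed to `Session.builder.configs(...)` in the same way as `conn_params` above.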

### Setup Dataset:

We will be making use of the [NYC Taxi Trip Dataset](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page). Specifically, we will be using a sample from January 2016 that records information about each taxi trip. The dataset is included in this directory so that you can experiment with it. 

In [4]:
taxi_df = pd.read_csv('taxi_sample.csv')

In [5]:
taxi_df.head()

| | VENDORID | PASSENGER_COUNT | TRIP_DISTANCE | RATECODEID | STORE_AND_FWD_FLAG | PULOCATIONID | DOLOCATIONID | PAYMENT_TYPE | FARE_AMOUNT | EXTRA | MTA_TAX | TIP_AMOUNT | TOLLS_AMOUNT | IMPROVEMENT_SURCHARGE | TOTAL_AMOUNT | CONGESTION_SURCHARGE | AIRPORT_FEE | TPEP_PICKUP_DATETIME | TPEP_DROPOFF_DATETIME | TRIP_ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 3.2 | 1 | N | 48 | 262 | 1 | 14.0 | 0.5 | 0.5 | 3.06 | 0.0 | 0.3 | 18.36 | NaN | NaN | 2016-01-01 00:12:22 | 2016-01-01 00:29:14 | 0 |
| 1 | 1 | 2 | 1.0 | 1 | N | 162 | 48 | 2 | 9.5 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 10.8 | NaN | NaN | 2016-01-01 00:41:31 | 2016-01-01 00:55:10 | 1 |
| 2 | 1 | 1 | 0.9 | 1 | N | 246 | 90 | 2 | 6.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 7.3 | NaN | NaN | 2016-01-01 00:53:37 | 2016-01-01 00:59:57 | 2 |
| 3 | 1 | 1 | 0.8 | 1 | N | 170 | 162 | 2 | 5.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 6.3 | NaN | NaN | 2016-01-01 00:13:28 | 2016-01-01 00:18:07 | 3 |
| 4 | 1 | 1 | 1.8 | 1 | N | 161 | 140 | 2 | 11.0 | 0.5 | 0.5 | 0.0 | 0.0 | 0.3 | 12.3 | NaN | NaN | 2016-01-01 00:33:04 | 2016-01-01 00:47:14 | 4 |


In [6]:
#confirm datatypes: 
taxi_df.dtypes

VENDORID                   int64
PASSENGER_COUNT            int64
TRIP_DISTANCE            float64
RATECODEID                 int64
STORE_AND_FWD_FLAG        object
PULOCATIONID               int64
DOLOCATIONID               int64
PAYMENT_TYPE               int64
FARE_AMOUNT              float64
EXTRA                    float64
MTA_TAX                  float64
TIP_AMOUNT               float64
TOLLS_AMOUNT             float64
IMPROVEMENT_SURCHARGE    float64
TOTAL_AMOUNT             float64
CONGESTION_SURCHARGE     float64
AIRPORT_FEE              float64
TPEP_PICKUP_DATETIME      object
TPEP_DROPOFF_DATETIME     object
TRIP_ID                    int64
dtype: object

As seen above, each record details the pickup and dropoff times, the fare amount calculated by the meter, and the total amount charged to the customer. There are also some engineered features that measure the rolling averages of the fare amounts in the past 1/10 hours. For more details on the columns and what they mean, refer to the [data dictionary](https://www.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf). 
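
To make the relationship between the metered fare and the total charge concrete, here is a small sketch using the first two sample rows shown earlier. It assumes the total is the sum of the fare and the itemized charges. That holds for these rows, where `CONGESTION_SURCHARGE` and `AIRPORT_FEE` are null, but it is not guaranteed for every record. The helper names `charges_total` and `trip_seconds` are illustrative only.

```python
import math
from datetime import datetime

# Values copied from the first two rows of taxi_df.head().
trips = [
    {"FARE_AMOUNT": 14.0, "EXTRA": 0.5, "MTA_TAX": 0.5, "TIP_AMOUNT": 3.06,
     "TOLLS_AMOUNT": 0.0, "IMPROVEMENT_SURCHARGE": 0.3, "TOTAL_AMOUNT": 18.36,
     "TPEP_PICKUP_DATETIME": "2016-01-01 00:12:22",
     "TPEP_DROPOFF_DATETIME": "2016-01-01 00:29:14"},
    {"FARE_AMOUNT": 9.5, "EXTRA": 0.5, "MTA_TAX": 0.5, "TIP_AMOUNT": 0.0,
     "TOLLS_AMOUNT": 0.0, "IMPROVEMENT_SURCHARGE": 0.3, "TOTAL_AMOUNT": 10.8,
     "TPEP_PICKUP_DATETIME": "2016-01-01 00:41:31",
     "TPEP_DROPOFF_DATETIME": "2016-01-01 00:55:10"},
]

# Assumption: these itemized columns add up to TOTAL_AMOUNT.
CHARGE_COLUMNS = ["FARE_AMOUNT", "EXTRA", "MTA_TAX", "TIP_AMOUNT",
                  "TOLLS_AMOUNT", "IMPROVEMENT_SURCHARGE"]
TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def charges_total(trip):
    """Sum the itemized charges of one trip."""
    return sum(trip[col] for col in CHARGE_COLUMNS)


def trip_seconds(trip):
    """Trip duration in seconds, parsed from the timestamp strings."""
    pickup = datetime.strptime(trip["TPEP_PICKUP_DATETIME"], TS_FORMAT)
    dropoff = datetime.strptime(trip["TPEP_DROPOFF_DATETIME"], TS_FORMAT)
    return (dropoff - pickup).total_seconds()


for trip in trips:
    # Compare with a tolerance because the amounts are floats.
    matches = math.isclose(charges_total(trip), trip["TOTAL_AMOUNT"], abs_tol=1e-6)
    print(trip["TOTAL_AMOUNT"], matches, trip_seconds(trip))
```

The timestamp columns are read as plain strings (`object` dtype), which is why the sketch parses them explicitly before computing a duration.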

We will load this dataset into a Snowflake Table to simulate a realistic example. 

In [7]:
#Create the table definition - REPLACE THE FULLY QUALIFIED PATH WITH YOUR OWN DATABASE/SCHEMA! 
session.sql("""
create or replace TABLE DEMO.CUSTOMER_EXAMPLES.NYC_YELLOW_TRIPS (
	VENDORID NUMBER(38,0),
	PASSENGER_COUNT NUMBER(38,0),
	TRIP_DISTANCE FLOAT,
	RATECODEID NUMBER(38,0),
	STORE_AND_FWD_FLAG VARCHAR(16777216),
	PULOCATIONID NUMBER(38,0),
	DOLOCATIONID NUMBER(38,0),
	PAYMENT_TYPE NUMBER(38,0),
	FARE_AMOUNT FLOAT,
	EXTRA FLOAT,
	MTA_TAX FLOAT,
	TIP_AMOUNT FLOAT,
	TOLLS_AMOUNT FLOAT,
	IMPROVEMENT_SURCHARGE FLOAT,
	TOTAL_AMOUNT FLOAT,
	CONGESTION_SURCHARGE NUMBER(38,0),
	AIRPORT_FEE NUMBER(38,0),
	TPEP_PICKUP_DATETIME TIMESTAMP_NTZ(9),
	TPEP_DROPOFF_DATETIME TIMESTAMP_NTZ(9),
	TRIP_ID NUMBER(38,0) NOT NULL
);

""").collect()

[Row(status='Table NYC_YELLOW_TRIPS successfully created.')]
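
Because the fully qualified path has to be replaced by hand, one alternative is to build the statement from parameters. The sketch below is a hypothetical helper, not a Snowpark feature, and the dtype-to-column-type mapping is an assumption. It will not reproduce every choice in the hand-written definition above; for example, it maps every `float64` column to `FLOAT` and omits constraints such as `NOT NULL`.

```python
# Assumed mapping from pandas dtype names to Snowflake column types.
DTYPE_TO_SNOWFLAKE = {
    "int64": "NUMBER(38,0)",
    "float64": "FLOAT",
    "object": "VARCHAR",
    "datetime64[ns]": "TIMESTAMP_NTZ(9)",
}


def build_create_table(database, schema, table, dtypes, overrides=None):
    """Return a CREATE OR REPLACE TABLE statement for the given columns.

    dtypes maps column name -> pandas dtype (or its string name).
    overrides maps column name -> explicit Snowflake type.
    """
    overrides = overrides or {}
    columns = []
    for name, dtype in dtypes.items():
        # An explicit override wins over the default mapping.
        col_type = overrides.get(name, DTYPE_TO_SNOWFLAKE[str(dtype)])
        columns.append(f"    {name} {col_type}")
    body = ",\n".join(columns)
    return f"CREATE OR REPLACE TABLE {database}.{schema}.{table} (\n{body}\n)"


# Small illustration with three of the columns from the dataset.
example_dtypes = {
    "VENDORID": "int64",
    "TRIP_DISTANCE": "float64",
    "TPEP_PICKUP_DATETIME": "object",
}
print(build_create_table(
    "DEMO", "CUSTOMER_EXAMPLES", "NYC_YELLOW_TRIPS", example_dtypes,
    overrides={"TPEP_PICKUP_DATETIME": "TIMESTAMP_NTZ(9)"},
))
```

With the real DataFrame, `taxi_df.dtypes.to_dict()` could serve as the `dtypes` argument, with overrides for the two timestamp columns, and the resulting string could be passed to `session.sql(...).collect()`.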

In [8]:
session.write_pandas(taxi_df, table_name="NYC_YELLOW_TRIPS", database="DEMO",
                     schema="CUSTOMER_EXAMPLES", quote_identifiers=False,
                     overwrite=True)

<snowflake.snowpark.table.Table at 0x167f11810>
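
After the upload, it can be useful to confirm that the table holds as many rows as the DataFrame. The following is a minimal sketch of such a check. `verify_row_count` is a hypothetical helper that relies only on `session.table(...)` and the `count()` method of the returned object.

```python
def verify_row_count(session, table_name, expected_rows):
    """Compare a table's row count with the number of rows that were written."""
    actual_rows = session.table(table_name).count()
    if actual_rows != expected_rows:
        raise ValueError(
            f"{table_name}: expected {expected_rows} rows, found {actual_rows}"
        )
    return actual_rows
```

In this notebook it could be called as `verify_row_count(session, "NYC_YELLOW_TRIPS", len(taxi_df))`.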

In [9]:
taxi_sdf = session.table("nyc_yellow_trips")
taxi_sdf.show()

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"VENDORID"  |"PASSENGER_COUNT"  |"TRIP_DISTANCE"  |"RATECODEID"  |"STORE_AND_FWD_FLAG"  |"PULOCATIONID"  |"DOLOCATIONID"  |"PAYMENT_TYPE"  |"FARE_AMOUNT"  |"EXTRA"  |"MTA_TAX"  |"TIP_AMOUNT"  |"TOLLS_AMOUNT"  |"IMPROVEMENT_SURCHARGE"  |"TOTAL_AMOUNT"  |"CONGESTION_SURCHARGE"  |"AIRPORT_FEE"  |"TPEP_PICKUP_DATETIME"  |"TPEP_DROPOFF_DATETIME"  |"TRIP_ID"  |
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------