Welcome to Snowflake! This entry-level guide, designed for database and data warehouse administrators and architects, will help you navigate the Snowflake interface and introduce you to some of our core capabilities. [Sign up for a free 30-day trial of Snowflake](https://signup.snowflake.com/) and follow along with this lab exercise. Once we cover the basics, you'll be ready to start processing your own data and diving into Snowflake's more advanced features like a pro.

You can refer to the full quickstart guide corresponding to this tutorial [here](https://quickstarts.snowflake.com/guide/getting_started_with_snowflake/index.html#0).

## Preparing to Load Data

Let's start by preparing to load the structured Citi Bike rider transaction data into Snowflake.

This section walks you through the steps to:

- Create a database and table.
- Create an external stage.
- Create a file format for the data.

The data we will be using is bike share data provided by Citi Bike NYC. The data has been exported and pre-staged for you in an Amazon S3 bucket in the US-EAST region. The data consists of information about trip times, locations, user type, gender, age, etc. In S3, the data comprises 61.5M rows, 377 objects, and 1.9GB compressed.

Below is a snippet from one of the Citi Bike CSV data files:

```raw
"tripduration","starttime","stoptime","start station id","start station name","start station latitude",
"start station longitude","end station id","end station name","end station latitude","end station longitude",
"bikeid","name_localizedValue0","usertype","birth year","gender"
196,"2018-01-01 00:01:51","2018-01-01 00:05:07",315,"South St & Gouverneur Ln",
40.70355377,-74.00670227,259,"South St & Whitehall St",
40.70122128,-74.01234218,18534,"Annual Membership","Subscriber",1997,1
207,"2018-01-01 00:02:44","2018-01-01 00:06:11",3224,"W 13 St & Hudson St",
40.73997354103409,-74.00513872504234,470,"W 20 St & 8 Ave",
40.74345335,-74.00004031,19651,"Annual Membership","Subscriber",1978,1 
613,"2018-01-01 00:03:15","2018-01-01 00:13:28",386,"Centre St & Worth St",
40.71494807,-74.00234482,2008,"Little West St & 1 Pl", 
40.70569254,-74.01677685,21678,"Annual Membership","Subscriber",1982,1 
```

It is in comma-delimited format with a single header line and double quotes enclosing all string values, including the field headings in the header line. This will come into play later in this section as we configure the Snowflake table to store this data.
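
To make the quoting rule concrete, here is a minimal sketch that parses the first data row from the snippet above with Python's standard `csv` module. It only illustrates the file layout; Snowflake's own loader is configured through the file format object created later in this section, not through Python.

```python
import csv
import io

# First data row from the snippet above, joined onto one line.
sample = (
    '196,"2018-01-01 00:01:51","2018-01-01 00:05:07",315,'
    '"South St & Gouverneur Ln",40.70355377,-74.00670227,259,'
    '"South St & Whitehall St",40.70122128,-74.01234218,18534,'
    '"Annual Membership","Subscriber",1997,1\n'
)

# The csv module strips the enclosing double quotes from string fields,
# which is the same job field_optionally_enclosed_by does in the file format.
reader = csv.reader(io.StringIO(sample), delimiter=",", quotechar='"')
row = next(reader)

print(len(row))  # number of fields in the row
print(row[4])    # start station name, without the surrounding quotes
```

The row splits into 16 fields, one for each column of the `TRIPS` table defined in the next step.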

### Create a Database and Table

First, let's create a database called `CITIBIKE_TUTORIAL` to use for loading the structured data.

In [None]:
USE ROLE SYSADMIN;
CREATE OR REPLACE DATABASE CITIBIKE_TUTORIAL;

In [None]:
USE SCHEMA CITIBIKE_TUTORIAL.PUBLIC;

> **Data Definition Language (DDL) operations are free!**
> All the DDL operations we have done so far do not require compute resources, so we can create all our objects for free.


Next we create a table called `TRIPS` to use for loading the comma-delimited data. Instead of using the UI, we run the following DDL that creates the table. 

In [None]:
create or replace table CITIBIKE_TUTORIAL.PUBLIC.trips
(tripduration integer,
starttime timestamp,
stoptime timestamp,
start_station_id integer,
start_station_name string,
start_station_latitude float,
start_station_longitude float,
end_station_id integer,
end_station_name string,
end_station_latitude float,
end_station_longitude float,
bikeid integer,
membership_type string,
usertype string,
birth_year integer,
gender integer);

Verify that your TRIPS table has been created. You should see the returned query status displaying a "Table TRIPS successfully created" message.

> **Many Options to Run Commands.**
> SQL commands can be executed through the UI, via the **Worksheets** tab, using our SnowSQL command line tool, with a SQL editor of your choice via ODBC/JDBC, or through our other connectors (Python, Spark, etc.).
> As mentioned earlier, to save time, we are performing most of the operations in this lab via pre-written SQL executed in the worksheet as opposed to using the UI.


### Create an External Stage

We are working with structured, comma-delimited data that has already been staged in a public, external S3 bucket. Before we can use this data, we first need to create a Stage that specifies the location of our external bucket.

> For this lab we are using an AWS-East bucket. To prevent data egress/transfer costs in the future, you should select a staging location from the same cloud provider and region as your Snowflake account.


In [None]:
CREATE STAGE CITIBIKE_TUTORIAL.PUBLIC.citibike_trips 
	URL = 's3://snowflake-workshop-lab/citibike-trips-csv/';

> The S3 bucket for this lab is public so you can leave the credentials options in the statement empty. In a real-world scenario, the bucket used for an external stage would likely require key information.


Now let's take a look at the contents of the `citibike_trips` stage. 

In [None]:
LIST @CITIBIKE_TUTORIAL.PUBLIC.CITIBIKE_TRIPS;

In the results table, you should see the list of files in the stage.

In [None]:
USE SCHEMA CITIBIKE_TUTORIAL.public;

### Create a File Format

Before we can load the data into Snowflake, we have to create a file format that matches the data structure.


In [None]:
--create file format

create or replace file format CITIBIKE_TUTORIAL.PUBLIC.CSV type='csv' 
  compression = 'auto' field_delimiter = ',' record_delimiter = '\n'
  skip_header = 0 field_optionally_enclosed_by = '\042' trim_space = false
  error_on_column_count_mismatch = false escape = 'none' escape_unenclosed_field = '\134'
  date_format = 'auto' timestamp_format = 'auto' null_if = ('') comment = 'file format for ingesting data for zero to snowflake';

Verify that the file format has been created with the correct settings by executing the following command:

In [None]:
--verify file format is created
show file formats in database CITIBIKE_TUTORIAL;

## Loading Data

In this section, we will use a virtual warehouse and the COPY command to initiate bulk loading of structured data into the Snowflake table we created in the last section.

### Resize and Use a Warehouse for Data Loading

Compute resources are needed for loading data. Snowflake's compute nodes are called virtual warehouses and they can be dynamically sized up or out according to workload, whether you are loading data, running a query, or performing a DML operation. Each workload can have its own warehouse so there is no resource contention.


> If this account isn't using Snowflake Enterprise Edition (or higher), you will not see the **Mode** or **Clusters** options when creating or editing a warehouse in the UI. The multi-cluster warehouses feature is not used in this lab, but we will discuss it as a key capability of Snowflake.


### Load the Data

Now we can run a COPY command to load the data into the `TRIPS` table we created earlier.

Execute the following statements to load the staged data into the table. This may take up to 30 seconds.



In [None]:
copy into CITIBIKE_TUTORIAL.public.trips from @CITIBIKE_TUTORIAL.PUBLIC.CITIBIKE_TRIPS file_format=CITIBIKE_TUTORIAL.public.csv PATTERN = '.*csv.*' ;

In the result table, you should see the status of each file that was loaded. 
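
The `PATTERN = '.*csv.*'` option in the COPY statement above is a regular expression that selects which staged files to load. The sketch below shows the same expression applied with Python's `re` module. The file names are hypothetical, and the sketch assumes the pattern has to match the whole name, as the leading and trailing `.*` suggest. Exactly which part of the staged path Snowflake matches against is not covered here.

```python
import re

# The pattern from the COPY statement above.
pattern = re.compile(r".*csv.*")

# Hypothetical staged file names, invented for illustration.
staged = [
    "trips_2018_0_0_0.csv.gz",
    "notes.txt",
]

# fullmatch requires the entire name to match the expression, so the
# ".*" on both sides of "csv" is what allows any prefix and suffix.
selected = [name for name in staged if pattern.fullmatch(name)]
print(selected)
```

Only the name containing `csv` is selected; the text file is skipped.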

Next, navigate to the **Query History** tab by clicking the **Home** icon and then **Activity** > **Query History**. Select the query at the top of the list, which should be the COPY INTO statement that was last executed. Select the **Query Profile** tab and note the steps taken by the query to execute, query details, most expensive nodes, and additional statistics.

In [None]:
import streamlit as st
st.image("https://quickstarts.snowflake.com/guide/getting_started_with_snowflake/img/ba7874d9fe5cb2b7.png",width=1000)

Now let's use the TRUNCATE TABLE command to clear the table of all data and metadata:

In [None]:
truncate table trips;

Verify that the table is empty by running the following command:


In [None]:
--verify table is clear
select * from CITIBIKE_TUTORIAL.public.trips limit 10;

The result should show "Query produced no results".

We can use Snowpark to get the name of the current warehouse used in this session.

In [None]:
from snowflake.snowpark.context import get_active_session
session = get_active_session()
current_warehouse_name = session.get_current_warehouse()
print(current_warehouse_name)

Then change the warehouse size to `large` using the following ALTER WAREHOUSE command.

Note that the ``{{python-variable}}`` syntax allows us to use the value of a Python variable inside a SQL query.

In [None]:
--change current warehouse size from small to large (4x)
alter warehouse {{current_warehouse_name}} set warehouse_size='large';

Verify the change using the following SHOW WAREHOUSES command:

In [None]:
--verify the new warehouse size
show warehouses;

Execute the same COPY INTO statement as before to load the same data again:

In [None]:
copy into trips from @citibike_trips
file_format=CSV;

Once the load is done, navigate back to the **Queries** page (**Home** icon > **Activity** > **Query History**). Compare the times of the two COPY INTO commands. The load using the `Large` warehouse should be significantly faster.

Note that you can also see the query runtime on the top right hand corner of each of the SQL cells under "View run details".
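
The "(4x)" in the resize comment reflects that each step up in warehouse size doubles the compute resources, and with them the credits billed per hour. Here is a small sketch of that arithmetic. The helper name `relative_compute` is ours for illustration and is not a Snowflake API.

```python
# Warehouse sizes in ascending order; each step up doubles the compute.
SIZES = ["XSMALL", "SMALL", "MEDIUM", "LARGE", "XLARGE"]


def relative_compute(from_size: str, to_size: str) -> float:
    """Return how many times more compute `to_size` has than `from_size`."""
    steps = SIZES.index(to_size.upper()) - SIZES.index(from_size.upper())
    return 2.0 ** steps


print(relative_compute("small", "large"))   # 4.0
print(relative_compute("xsmall", "large"))  # 8.0
```

Load times do not necessarily shrink by the same factor, so compare the two COPY INTO runs in Query History to see the real effect.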

In [None]:
-- Changing this back to an XSMALL warehouse
alter warehouse {{current_warehouse_name}} set warehouse_size='XSMALL';

### Create a New Warehouse for Data Analytics

Going back to the lab story, let's assume the Citi Bike team wants to eliminate resource contention between their data loading/ETL workloads and the analytical end users using BI tools to query Snowflake. As mentioned earlier, Snowflake can easily do this by assigning different, appropriately-sized warehouses to various workloads. Since Citi Bike already has a warehouse for data loading, let's create a new warehouse for the end users running analytics. We will use this warehouse to perform analytics in the next section.


In [None]:
CREATE OR REPLACE WAREHOUSE ANALYTICS_WH WITH WAREHOUSE_SIZE = 'LARGE';

## Working with Queries, the Results Cache, & Cloning

In the previous exercises, we loaded data into the `TRIPS` table using Snowflake's COPY bulk loader command and a virtual warehouse. Now we are going to take on the role of the analytics users at Citi Bike who need to query that data using the worksheet and the second warehouse, `ANALYTICS_WH`.

> **Real World Roles and Querying**
> Within a real company, analytics users would likely have a different role than SYSADMIN. To keep the lab simple, we are going to stay with the SYSADMIN role for this section.
> Additionally, querying would typically be done with a business intelligence product like Tableau, Looker, Power BI, etc. For more advanced analytics, data science tools like DataRobot, Dataiku, Amazon SageMaker, or many others can query Snowflake. Any technology that leverages JDBC/ODBC, Spark, Python, or any of the other supported programmatic interfaces can run analytics on the data in Snowflake. To keep this lab simple, all queries are being executed via the Snowflake worksheet.

### Execute Some Queries

Change the warehouse to use the new warehouse you created in the last section. 

In [None]:
USE WAREHOUSE ANALYTICS_WH;
USE ROLE SYSADMIN;
USE DATABASE CITIBIKE_TUTORIAL;

Run the following query to see a sample of the `trips` data:


In [None]:
select * from trips limit 20;

Now, let's look at some basic hourly statistics on Citi Bike usage. Run the query below in the worksheet. For each hour, it shows the number of trips, average trip duration, and average trip distance.

In [None]:
select date_trunc('hour', starttime) as "date",
count(*) as "num trips",
avg(tripduration)/60 as "avg duration (mins)",
avg(haversine(start_station_latitude, start_station_longitude, end_station_latitude, end_station_longitude)) as "avg distance (km)"
from trips
group by 1 order by 1;

### Use the Result Cache

Snowflake has a result cache that holds the results of every query executed in the past 24 hours. These are available across warehouses, so query results returned to one user are available to any other user on the system who executes the same query, provided the underlying data has not changed. Not only do these repeated queries return extremely fast, but they also use no compute credits.
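
As a rough mental model only, the behavior described above can be sketched as a cache keyed by the exact query text, with a 24-hour lifetime and a check that the data has not changed. This is a toy illustration and not how Snowflake implements its result cache. The `data_version` argument is a hypothetical stand-in for "the underlying data has not changed".

```python
import time


class ResultCache:
    """Toy cache keyed by exact query text, with a time-to-live."""

    def __init__(self, ttl_seconds=24 * 60 * 60):
        self.ttl = ttl_seconds
        self._store = {}  # query text -> (stored_at, data_version, result)

    def run(self, query, data_version, execute):
        """Return (result, served_from_cache)."""
        now = time.time()
        entry = self._store.get(query)
        if entry is not None:
            stored_at, version, result = entry
            # Reuse only if the entry is fresh and the data is unchanged.
            if now - stored_at < self.ttl and version == data_version:
                return result, True
        result = execute(query)
        self._store[query] = (now, data_version, result)
        return result, False


calls = []


def fake_execute(query):
    # Stand-in for real query execution; records each call.
    calls.append(query)
    return 42


cache = ResultCache()
first = cache.run("select count(*) from trips", 1, fake_execute)
second = cache.run("select count(*) from trips", 1, fake_execute)
third = cache.run("select count(*) from trips", 2, fake_execute)
print(first, second, third, len(calls))
```

The second call is served from the cache without executing anything, while the third call executes again because the (hypothetical) data version changed.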

Let's see the result cache in action by running the exact same query again.


In [None]:
select date_trunc('hour', starttime) as "date",
count(*) as "num trips",
avg(tripduration)/60 as "avg duration (mins)",
avg(haversine(start_station_latitude, start_station_longitude, end_station_latitude, end_station_longitude)) as "avg distance (km)"
from trips
group by 1 order by 1;

In the query runtime displayed on the top right of the cell, note that the second query runs significantly faster because the results have been cached.
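
The `haversine` call in the query above computes the great-circle distance in kilometers between the start and end stations. For readers who want to see the formula, here is a minimal Python sketch. The Earth radius of 6371 km is an assumption on our part, so results may differ slightly from Snowflake's function.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Start and end stations of the first trip in the CSV snippet.
d = haversine_km(40.70355377, -74.00670227, 40.70122128, -74.01234218)
print(round(d, 2))
```

For that trip the two stations are roughly half a kilometer apart, which is plausible for a ride of 196 seconds.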

### Execute Another Query

Next, let's run the following query to see which months are the busiest:


In [None]:
select
monthname(starttime) as "month",
count(*) as "num trips"
from trips
group by 1 order by 2 desc;

### Clone a Table

Snowflake allows you to create clones, also known as "zero-copy clones", of tables, schemas, and databases in seconds. When a clone is created, Snowflake takes a snapshot of data present in the source object and makes it available to the cloned object. The cloned object is writable and independent of the clone source. Therefore, changes made to either the source object or the clone object are not included in the other.

A popular use case for zero-copy cloning is to clone a production environment so that Development & Testing teams can test and experiment without adversely impacting production, and without the need to set up and manage two separate environments.
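
One way to picture a clone that is both cheap to create and independent of its source is a table that holds references to immutable data blocks. The sketch below is a toy model under that assumption, not a description of Snowflake's storage layer. The `Table` class and its methods are invented for illustration.

```python
class Table:
    """Toy table: an ordered list of references to immutable data blocks."""

    def __init__(self, blocks=None):
        self.blocks = list(blocks or [])

    def clone(self):
        # Copy only the list of references; the blocks themselves are shared.
        return Table(self.blocks)

    def append(self, rows):
        # New data goes into a new block owned by this table only.
        self.blocks.append(tuple(rows))

    def rows(self):
        return [row for block in self.blocks for row in block]


trips = Table([("trip-1", "trip-2")])
trips_dev = trips.clone()

# Immediately after cloning, both tables point at the same block object.
shared = trips.blocks[0] is trips_dev.blocks[0]

# A write to the clone does not touch the source.
trips_dev.append(["trip-3"])
print(shared, trips.rows(), trips_dev.rows())
```

Creating the clone copies no row data, and the later write to `trips_dev` leaves `trips` unchanged.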

> **Zero-Copy Cloning**
> A massive benefit of zero-copy cloning is that the underlying data is not copied. Only the metadata and pointers to the underlying data change. Hence, clones are "zero-copy" and storage requirements are not doubled when the data is cloned. Most data warehouses cannot do this, but for Snowflake it is easy!

Run the following command in the worksheet to create a development (dev) table clone of the `trips` table:


In [None]:
create table trips_dev clone trips;

Navigate to the TRIPS_DEV table on the Object Explorer on the left pane. Click the three dots (**...**) in the left pane and select **Refresh**. Expand the object tree under the `CITIBIKE_TUTORIAL` database and verify that you see a new table named `trips_dev`. Your Development team now can do whatever they want with this table, including updating or deleting it, without impacting the `trips` table or any other object.

In [None]:
import streamlit as st
st.image("https://quickstarts.snowflake.com/guide/getting_started_with_snowflake/img/adae9f4ec4cec092.png",width=500)

## Working with Semi-Structured Data, Views, & Joins

> This section requires loading additional data and, therefore, provides a review of data loading while also introducing loading semi-structured data.

Going back to the lab's example, the Citi Bike analytics team wants to determine how weather impacts ride counts. To do this, in this section, we will:

- Load weather data in semi-structured JSON format held in a public S3 bucket.
- Create a view and query the JSON data using SQL dot notation.
- Run a query that joins the JSON data to the previously loaded `TRIPS` data.
- Analyze the weather and ride count data to determine their relationship.

The JSON data consists of weather information provided by *MeteoStat* detailing the historical conditions of New York City from 2016-07-05 to 2019-06-25. It is also staged on AWS S3 where the data consists of 75k rows, 36 objects, and 1.1MB compressed. If viewed in a text editor, the raw JSON in the GZ files looks like:


In [None]:
import streamlit as st
st.image("https://quickstarts.snowflake.com/guide/getting_started_with_snowflake/img/c025f1200b524e26.png",width=1000)

> **SEMI-STRUCTURED DATA**
> Snowflake can easily load and query semi-structured data such as JSON, Parquet, or Avro without transformation. This is a key Snowflake feature because an increasing amount of business-relevant data being generated today is semi-structured, and many traditional data warehouses cannot easily load and query such data. Snowflake makes it easy!

### Create a New Database and Table for the Data

First, in the worksheet, let's create a database named `WEATHER` to use for storing the semi-structured JSON data.


In [None]:
CREATE DATABASE IF NOT EXISTS weather;

Execute the following USE commands to set the worksheet context appropriately:

In [None]:
use role sysadmin;

use warehouse compute_wh;

use database weather;

use schema public;

Next, let's create a table named `JSON_WEATHER_DATA` to use for loading the JSON data. In the worksheet, execute the following CREATE TABLE command:

In [None]:
create table if not exists weather.public.json_weather_data (v variant);

Note that Snowflake has a special column data type called `VARIANT` that allows you to store an entire JSON object in a single row and later query the object directly.

> **Semi-Structured Data Magic**
> The VARIANT data type allows Snowflake to ingest semi-structured data without having to predefine the schema.

In the results table, verify that your table, `JSON_WEATHER_DATA`, was created.


### Create Another External Stage

Use the following command to create a stage that points to the bucket where the semi-structured JSON data is stored on AWS S3:

In [None]:
create stage if not exists weather.public.nyc_weather
url = 's3://snowflake-workshop-lab/zero-weather-nyc';

Now let's take a look at the contents of the `nyc_weather` stage. Execute the following LIST command to display the list of files:

In [None]:
list @weather.public.nyc_weather;

In the results table, you should see a list of `.gz` files from S3.

### Load and Verify the Semi-structured Data

In this section, we will use a warehouse to load the data from the S3 bucket into the `JSON_WEATHER_DATA` table we created earlier.

In the following cell, execute the COPY command below to load the data.

Note that you can specify a `FILE FORMAT` object inline in the command. In the previous section, where we loaded structured data in CSV format, we had to define a file format to support the CSV structure. Because the JSON data here is well-formed, we are able to simply specify the JSON type, set `strip_outer_array`, and leave the remaining options at their defaults:


In [None]:
copy into weather.public.json_weather_data
from @weather.public.nyc_weather 
    file_format = (type = json strip_outer_array = true);

Verify that each file has a status of `LOADED`.
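
The `strip_outer_array = true` option in the COPY command above tells Snowflake to remove an enclosing JSON array and load each element as its own row, instead of loading the whole array into one row. The sketch below is a rough Python analogy. The sample content is hypothetical, and `to_rows` is a helper invented for illustration.

```python
import json

# Hypothetical file content: one outer array holding two records.
raw = '[{"station": "72502", "temp": 1.5}, {"station": "72502", "temp": 2.0}]'


def to_rows(text, strip_outer_array):
    """Return the list of values that would each become one table row."""
    doc = json.loads(text)
    if strip_outer_array and isinstance(doc, list):
        return doc   # one row per array element
    return [doc]     # the whole document as a single row


print(len(to_rows(raw, strip_outer_array=True)))   # 2
print(len(to_rows(raw, strip_outer_array=False)))  # 1
```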


Now, let's take a look at the data that was loaded:

In [None]:
select * from weather.public.json_weather_data limit 10;

Click any of the rows to display the formatted JSON string.


### Create a View and Query Semi-Structured Data

Next, let's look at how Snowflake allows us to create a view and also query the JSON data directly using SQL.

> **Views & Materialized Views**
> A view allows the result of a query to be accessed as if it were a table. Views can help present data to end users in a cleaner manner, limit what end users can view in a source table, and write more modular SQL.
> Snowflake also supports materialized views in which the query results are stored as though the results are a table. This allows faster access, but requires storage space. Materialized views can be created and queried if you are using Snowflake Enterprise Edition (or higher).

Run the following command to create a columnar view of the semi-structured JSON weather data so it is easier for analysts to understand and query. The ``72502`` value for ``station_id`` corresponds to Newark Airport, the closest station that has weather conditions for the whole period.


In [None]:
create or replace view weather.public.json_weather_data_view as
select
    v:obsTime::timestamp as observation_time,
    v:station::string as station_id,
    v:name::string as city_name,
    v:country::string as country,
    v:latitude::float as city_lat,
    v:longitude::float as city_lon,
    v:weatherCondition::string as weather_conditions,
    v:coco::int as weather_conditions_code,
    v:temp::float as temp,
    v:prcp::float as rain,
    v:tsun::float as tsun,
    v:wdir::float as wind_dir,
    v:wspd::float as wind_speed,
    v:dwpt::float as dew_point,
    v:rhum::float as relative_humidity,
    v:pres::float as pressure
from
    weather.public.json_weather_data
where
    station_id = '72502';


SQL dot notation `v:temp` is used in this command to pull out values at lower levels within the JSON object hierarchy. This allows us to treat each field as if it were a column in a relational table.
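
For readers more familiar with Python, each `v:field::type` expression in the view is roughly "look up a key, then cast the value". The sketch below mirrors a few of the view's columns. The record is hypothetical: the field names come from the view definition, but the values are made up for illustration.

```python
import json
from datetime import datetime

# Hypothetical record; field names follow the view, values are invented.
v = json.loads(
    '{"obsTime": "2018-01-01T00:00:00", "station": "72502", '
    '"temp": "1.7", "coco": 3}'
)

row = {
    "observation_time": datetime.fromisoformat(v["obsTime"]),  # v:obsTime::timestamp
    "station_id": str(v["station"]),                           # v:station::string
    "temp": float(v["temp"]),                                  # v:temp::float
    "weather_conditions_code": int(v["coco"]),                 # v:coco::int
}
print(row["temp"], row["station_id"])
```

One difference to keep in mind: a missing path in Snowflake yields NULL, whereas `v["..."]` raises `KeyError` in Python, so `v.get(...)` is the closer analogue.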

The new view should appear as `JSON_WEATHER_DATA_VIEW` under `WEATHER` > `PUBLIC` > **Views** in the object browser on the left. You may need to expand or refresh the object browser in order to see it.

Verify the view with the following query: 


In [None]:
select * from weather.public.json_weather_data_view
where date_trunc('month',observation_time) = '2018-01-01'
limit 20;

Notice the results look just like a regular structured data source.

### Use a Join Operation to Correlate Against Data Sets

We will now join the JSON weather data to our `CITIBIKE_TUTORIAL.PUBLIC.TRIPS` data to answer our original question of how weather impacts the number of rides.

Run the query below to join `WEATHER` to `TRIPS` and count the number of trips associated with certain weather conditions:

> Because we are still in the worksheet, the `WEATHER` database is still in use. You must, therefore, fully qualify the reference to the `TRIPS` table by providing its database and schema name.

In [None]:
select weather_conditions as conditions
,count(*) as num_trips
from CITIBIKE_TUTORIAL.public.trips
left outer join weather.public.json_weather_data_view
on date_trunc('hour', observation_time) = date_trunc('hour', starttime)
where conditions is not null
group by 1 order by 2 desc;

Note that we can export the result from this SQL query by referencing the cell name directly and calling `to_pandas` to get the dataframe.

In [None]:
df = cell80.to_pandas()
df

Then, we can plot the results with Altair. 

In [None]:
import altair as alt
alt.Chart(df).mark_bar().encode(
    x=alt.X('CONDITIONS',sort ='-y'),
    y=alt.Y('NUM_TRIPS')
)

The initial goal was to determine if there was any correlation between the number of bike rides and the weather by analyzing both ridership and weather data. Per the results above we have a clear answer. As one would imagine, the number of trips is significantly higher when the weather is good!
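
The join above works by truncating both timestamps to the hour and matching on the result. Here is a minimal sketch of the same idea in plain Python. It assumes one weather observation per hour, and all sample values are invented for illustration.

```python
from collections import Counter
from datetime import datetime


def trunc_hour(ts):
    """Equivalent in spirit to date_trunc('hour', ts)."""
    return ts.replace(minute=0, second=0, microsecond=0)


# Hypothetical observations and trip start times.
weather = {
    datetime(2018, 1, 1, 0): "Clear",
    datetime(2018, 1, 1, 1): "Rain",
}
trip_starts = [
    datetime(2018, 1, 1, 0, 1, 51),
    datetime(2018, 1, 1, 0, 2, 44),
    datetime(2018, 1, 1, 1, 30, 0),
    datetime(2018, 1, 1, 5, 0, 0),  # no matching observation
]

counts = Counter()
for start in trip_starts:
    conditions = weather.get(trunc_hour(start))
    if conditions is not None:  # mirrors "where conditions is not null"
        counts[conditions] += 1

print(counts.most_common())
```

The trip with no matching observation is dropped, just as the `where conditions is not null` filter discards unmatched rows from the left outer join.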


## Using Time Travel

Snowflake's powerful Time Travel feature enables accessing historical data, as well as the objects storing the data, at any point within a period of time. The default window is 24 hours and, if you are using Snowflake Enterprise Edition (or higher), it can be increased up to 90 days. Most data warehouses cannot offer this functionality, but - you guessed it - Snowflake makes it easy!

Some useful applications include:

- Restoring data-related objects such as tables, schemas, and databases that may have been deleted.
- Duplicating and backing up data from key points in the past.
- Analyzing data usage and manipulation over specified periods of time.
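
To make the idea of "the table as it was before a given statement" concrete, here is a toy model in Python. It says nothing about how Snowflake actually stores history. The `VersionedTable` class, its methods, and the sample data are all invented for illustration.

```python
import copy


class VersionedTable:
    """Toy table that keeps a snapshot taken before each statement."""

    def __init__(self, rows):
        self.rows = list(rows)
        self._before = {}  # statement id -> rows as they were before it ran
        self._next_id = 1

    def execute(self, change):
        """Apply `change` to the rows and return the statement id."""
        statement_id = self._next_id
        self._next_id += 1
        self._before[statement_id] = copy.deepcopy(self.rows)
        self.rows = change(self.rows)
        return statement_id

    def before(self, statement_id):
        """Return the rows as they were just before the given statement."""
        return copy.deepcopy(self._before[statement_id])


stations = VersionedTable(["W 20 St & 8 Ave", "Centre St & Worth St"])
bad = stations.execute(lambda rows: ["UNKNOWN" for _ in rows])
damaged = list(stations.rows)

# Roll back to the state just before the bad statement.
stations.rows = stations.before(bad)
print(damaged, stations.rows)
```

Looking up the state by statement id, rather than by timestamp, means you do not need to know exactly when the mistake happened.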

### Drop and Undrop a Table

First, let's see how we can restore data objects that have been accidentally or intentionally deleted. Run the following DROP command to remove the `JSON_WEATHER_DATA` table:


In [None]:
drop table weather.public.json_weather_data;

Now run a query against the table. You should see an error because the underlying table has been dropped.

In [None]:
select * from json_weather_data limit 10;

Now, restore the table:

In [None]:
undrop table weather.public.json_weather_data;


The json_weather_data table should be restored. Verify by running the following query:

In [None]:
select * from weather.public.json_weather_data limit 10;

### Roll Back a Table

Let's roll back the `TRIPS` table in the `CITIBIKE_TUTORIAL` database to a previous state to fix an unintentional DML error that replaces all the station names in the table with the word "oops".

First set the session context, then run the UPDATE statement in the second cell to replace all of the station names in the table with the word "oops":

In [None]:
use role sysadmin;

use warehouse compute_wh;

use database CITIBIKE_TUTORIAL;

use schema public;

In [None]:
update CITIBIKE_TUTORIAL.public.trips set start_station_name = 'oops';

Now, run a query that returns the top 20 stations by number of rides. Notice that the station names result contains only one row:

In [None]:
select
start_station_name as "station",
count(*) as "rides"
from CITIBIKE_TUTORIAL.public.trips
group by 1
order by 2 desc
limit 20;

Normally we would need to scramble and hope we have a backup lying around.

In Snowflake, we can simply run a command to find the query ID of the last UPDATE command and store it in a variable named `$QUERY_ID`.

In [None]:
set query_id = (select query_id from table(information_schema.query_history_by_session (result_limit=>25)) where query_text like 'update%' order by start_time desc limit 1);

In [None]:
SELECT $query_id;

Use Time Travel to recreate the table with the correct station names:

In [None]:
create or replace table CITIBIKE_TUTORIAL.public.trips as
(select * from CITIBIKE_TUTORIAL.public.trips before (statement => $query_id));



Run the previous query again to verify that the station names have been restored:


In [None]:
select
start_station_name as "station",
count(*) as "rides"
from CITIBIKE_TUTORIAL.public.trips
group by 1
order by 2 desc
limit 20;

For the last two sections, we recommend following along with the quickstart for the UI walkthrough instructions:
- [Working with Roles, Account Admin, & Account Usage](https://quickstarts.snowflake.com/guide/getting_started_with_snowflake/index.html#8)
- [Sharing Data Securely & the Data Marketplace](https://quickstarts.snowflake.com/guide/getting_started_with_snowflake/index.html#9)

## Resetting Your Snowflake Environment

If you would like to reset your environment by deleting all the objects created as part of this lab, run the following SQL statements in a worksheet.

First, ensure you are using the ACCOUNTADMIN role:


In [None]:
use role accountadmin;

Then, run the following SQL commands to drop all the objects we created in the lab:


In [None]:
drop database if exists CITIBIKE_TUTORIAL;
drop database if exists weather;
drop warehouse if exists analytics_wh;

## Conclusion & Next Steps

Congratulations on completing this introductory lab exercise! You've mastered the Snowflake basics and are ready to apply these fundamentals to your own data. Be sure to reference this guide if you ever need a refresher.

We encourage you to continue with your free trial by loading your own sample or production data and by using some of the more advanced capabilities of Snowflake not covered in this lab.

### Additional Resources:

- Learn more in the [Snowsight](https://docs.snowflake.com/en/user-guide/ui-snowsight.html#using-snowsight) docs.
- Read the [Definitive Guide to Maximizing Your Free Trial](https://www.snowflake.com/test-driving-snowflake-the-definitive-guide-to-maximizing-your-free-trial/) document.
- Attend a [Snowflake virtual or in-person event](https://www.snowflake.com/about/events/) to learn more about our capabilities and customers.
- Join the [Snowflake Community](https://community.snowflake.com/s/topic/0TO0Z000000wmFQWAY/getting-started-with-snowflake).
- Sign up for [Snowflake University](https://community.snowflake.com/s/article/Getting-Access-to-Snowflake-University).
- Contact our [Sales Team](https://www.snowflake.com/free-trial-contact-sales/) to learn more.

### What we've covered:

- How to create stages, databases, tables, views, and virtual warehouses.
- How to load structured and semi-structured data.
- How to perform analytical queries on data in Snowflake, including joins between tables.
- How to clone objects.
- How to undo user errors using Time Travel.
- How to create roles and users, and grant them privileges.
- How to securely and easily share data with other accounts.
- How to consume datasets in the Snowflake Data Marketplace.