# Introduction to Snowflake tutorial

## What is Snowflake?

If someone asked me to describe Snowflake in as few words as possible, I would choose these:
- Data warehouses
- Large-scale data
- Multi-cloud
- Separation
- Scalable
- Flexible
- Simple

If they asked me to elaborate, I would put together the words like this:

Snowflake is a massively popular _cloud-based_ _data warehouse_ management platform. It distinguishes itself from competitors through its ability to handle _large-scale data_ and workloads faster and more efficiently. Its enhanced performance comes from its underlying architecture which uses _separate_ storage and compute layers, allowing it to be very _flexible_ and _scalable_. As a bonus, it natively integrates with _multiple cloud_ providers. Despite all these features, it manages to stay _simple_ to learn and implement. 

If they asked me to give even more details, well, then I would write this tutorial.

## 1. Why use Snowflake?

Snowflake serves more than 8900 customers worldwide and processes 3.9 billion queries every day. Usage at that scale isn't a coincidence.

Below are the main benefits behind that appeal:

#### 1. Cloud-based architecture
Snowflake operates in the cloud, allowing companies to scale resources up and down based on demand without worrying about physical infrastructure (hardware). The platform also handles routine maintenance tasks such as software updates, hardware management, and performance tuning. This removes the maintenance overhead, allowing organizations to focus on what matters: deriving value from data.

#### 2. Elasticity and scalability
Snowflake separates storage and compute layers, allowing users to scale their computing resources independently of their storage needs. This elasticity enables efficient handling of diverse workloads with optimal performance and without unnecessary costs.

#### 3. Concurrency and performance
Snowflake easily handles high concurrency: multiple users can access and query the data without performance loss. 

#### 4. Data sharing
Snowflake's security safeguards make it possible to share data with other organizations, internal departments, external partners, customers, or other stakeholders. No need for complex data transfers.

#### 5. Time travel
Snowflake uses a fancy term, "Time Travel", for data versioning. When data changes, Snowflake retains the previous version for a configurable retention period. This allows users to access historical data as it existed at various points in time.
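
As a rough illustration, Time Travel is exposed in SQL through the `AT` clause and the `UNDROP` command. The sketch below assumes a hypothetical table named `orders`; the offset and the timestamp are arbitrary placeholders, and the queries only work for points in time that fall within the table's retention period.

```SQL
-- Query a hypothetical "orders" table as it looked five minutes ago
-- (OFFSET is given in seconds relative to the current time).
SELECT * FROM orders AT(OFFSET => -60 * 5);

-- Query the table as of a specific point in time (placeholder timestamp).
SELECT * FROM orders AT(TIMESTAMP => '2024-01-15 09:00:00'::TIMESTAMP_LTZ);

-- Restore a table that was dropped within the retention period.
DROP TABLE orders;
UNDROP TABLE orders;
```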

#### 6. Cost efficiency
Snowflake offers a pay-as-you-go model due to its ability to scale resources dynamically. You will only pay for what you use. 


All these benefits combined make Snowflake a highly desirable data warehouse management tool. 

Now, let's take a look at the underlying architecture of Snowflake that unlocks these features. 

## 2. What is a data warehouse?

Before we dive into Snowflake architecture, let's review data warehouses to ensure we are all on the same page.

A data warehouse is a centralized repository that stores large amounts of structured and organized data from various sources for a company. Different personas (employees) in organizations use the data within to derive different insights.

For example, data analysts in collaboration with the marketing team may run an A/B test for a new marketing campaign using the sales table. HR specialists may query the employee information to track performance. 

These are just some examples of how companies around the world use data warehouses to drive growth. But without proper implementation and management, a data warehouse remains little more than an elaborate concept.

## 3. Snowflake architecture

Snowflake's unique architecture, designed for faster analytical queries, comes from its separation of the storage and compute layers. This distinction contributes to the benefits we've mentioned earlier.

### Storage layer

In Snowflake, the storage layer is a critical component in storing data in an efficient and scalable manner. Here are some key features of the layer:

1. __Cloud-based__: Snowflake seamlessly integrates with major cloud providers such as AWS, GCP, and Microsoft Azure.
2. __Columnar format__: Snowflake stores data in a columnar format, which is optimized for analytical queries. Unlike traditional row-based formats used by tools like Postgres, the columnar format is well-suited for aggregating data. In columnar storage, queries only access the specific columns they need, making it more efficient. On the other hand, row-based formats have to read entire rows, including columns the query doesn't need, even for simple operations like calculating an average.
3. __Micro-partitioning__: Snowflake employs a technique called micro-partitioning, which stores tables as small, contiguous chunks called micro-partitions. Each micro-partition is immutable and relatively small (on the order of tens to hundreds of megabytes of data before compression). Snowflake also keeps metadata about each one, which lets queries skip chunks that cannot contain relevant rows and makes query optimization and execution much faster.
4. __Zero-copy cloning__: Snowflake has a unique feature that allows it to create virtual clones of data. Cloning is near-instantaneous and doesn't take up additional storage until changes are made to the new copy.
5. __Scale and elasticity__: The storage layer scales horizontally, which means it can handle increasing data volumes by adding more servers to distribute the load. Also, the scaling happens independently of compute resources, which is ideal when you want to store large volumes of data but analyze only a small fraction.
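
To make zero-copy cloning from the list above more concrete, here is a minimal sketch. The `sales` table, its `amount` column, and the `analytics` database are hypothetical names used only for illustration.

```SQL
-- Clone a hypothetical "sales" table; the clone initially shares the
-- original's underlying storage instead of duplicating it.
CREATE TABLE sales_dev CLONE sales;

-- Changes to the clone are stored separately and do not affect the original.
UPDATE sales_dev SET amount = 0 WHERE amount < 0;

-- Entire schemas and databases can be cloned the same way.
CREATE DATABASE analytics_dev CLONE analytics;
```

A common use for this is giving developers a full-size copy of production data to experiment on without paying for a second copy up front.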

Now, let's look at the compute layer. 

### Compute layer

As the name suggests, the compute layer is the engine that executes your queries. It works together with the storage layer to process the data and perform various computational tasks. Below are some more details of how the layer works:

1. Virtual warehouses: You can think of virtual warehouses (VWs) as teams of computers (compute nodes) designed to handle query processing. Each member of the team handles a different part of the query, which makes execution parallel and crazy fast. VWs in Snowflake come in different sizes and, consequently, at different prices (the sizes include XS, S, M, L, XL, and larger).
2. Multi-cluster, multi-node architecture: The compute layer uses multiple clusters with multiple nodes for high concurrency, allowing multiple users to access and query the data.
3. Automatic query optimization: Snowflake's system analyzes all queries and finds patterns to optimize using historical data. Common optimizations include pruning unnecessary data, using metadata, and choosing the most efficient execution path.
4. Results cache: The compute layer includes a cache that stores the results of frequently executed queries (FEQs 😃). When the same query is run again and the underlying data has not changed, the results are returned almost instantaneously.
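
As a hedged sketch of how virtual warehouses are managed in practice, the statements below create, resize, and select a warehouse. The name `tutorial_wh` and the chosen sizes and timeout are assumptions made for this example, not recommendations.

```SQL
-- Create a small warehouse that suspends itself after 60 idle seconds
-- and resumes automatically when a query arrives.
CREATE WAREHOUSE IF NOT EXISTS tutorial_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Resize the warehouse for a heavier workload; stored data is unaffected.
ALTER WAREHOUSE tutorial_wh SET WAREHOUSE_SIZE = 'MEDIUM';

-- Use the warehouse for the queries in the current session.
USE WAREHOUSE tutorial_wh;
```

Because a suspended warehouse does not consume compute credits, the auto-suspend setting is one of the simplest ways to benefit from the separation of storage and compute.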

These design principles of the compute layer all contribute to Snowflake's ability to handle diverse and demanding workloads in the cloud.

### Cloud services layer

The final layer is cloud services. As this layer integrates into every component of Snowflake's architecture, there are many details on how it operates. On top of the features related to other layers, it has the following additional responsibilities:

1. Security and access control: The layer enforces security measures, including authentication, authorization and encryption. Administrators use Role-Based Access Control (RBAC) to define and manage user roles and permissions.
2. Data sharing: The layer implements secure data sharing protocols across different accounts and even third-party organizations. Data consumers can access the data without the need for data movement, promoting collaboration and data monetization.
3. Semi-structured data support: Another benefit of Snowflake is its ability to handle semi-structured data such as JSON and Parquet despite being a data warehouse management platform. It can easily query semi-structured data and integrate the results with existing tables. This flexibility is less common in traditional RDBMS tools.
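
Two of the responsibilities above, access control and semi-structured data support, can be sketched in SQL. All object names here (`analytics`, `analyst`, `some_user`, `events`, `payload`) and the JSON document are hypothetical and only serve the example.

```SQL
-- Access control: create a read-only role and grant it to a user.
CREATE ROLE IF NOT EXISTS analyst;
GRANT USAGE ON DATABASE analytics TO ROLE analyst;
GRANT USAGE ON SCHEMA analytics.public TO ROLE analyst;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.public TO ROLE analyst;
GRANT ROLE analyst TO USER some_user;

-- Semi-structured data: store JSON in a VARIANT column and query it.
CREATE TABLE events (payload VARIANT);
INSERT INTO events
  SELECT PARSE_JSON('{"customer": {"id": 42, "country": "DE"}, "action": "login"}');

-- Use path notation to reach into the JSON and cast values to SQL types.
SELECT
  payload:customer.id::INTEGER AS customer_id,
  payload:action::STRING       AS action
FROM events;
```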

Now that we have a high-level picture of Snowflake's architecture, let's write some SQL on the platform.

## 4. Setting up SnowflakeSQL

Snowflake has its own flavor of SQL called SnowflakeSQL (big surprise). The difference between it and other SQL dialects is like the difference between English accents. 

So, most of the analytical queries you perform in dialects like PostgreSQL don't change but there are some discrepancies in DDL (Data Definition Language) commands. 

Snowflake offers two interfaces to run SnowflakeSQL:
- Snowsight: Web interface to interact with the platform
- SnowSQL: A CLI client to manage and query databases

We will see how to set up both and run some queries!

### Snowsight: Web interface

![image.png](attachment:43c330bb-3176-4300-b2b0-d893908fbbd9.png)

To get started with Snowsight, go to the [Snowflake homepage](https://www.snowflake.com/en/) and click "Start for free". Enter your personal details and choose any cloud provider. It doesn't matter which one, because the free trial includes $400 worth of credits for any of the options (you don't have to set up the cloud credentials yourself).

Once you verify your email, you will be directed to the Worksheets page. Worksheets are interactive, live-coding environments where you can write, run, and see the results of your SQL queries. 

![image.png](attachment:2a365c69-e0a3-43df-af13-3a083956ff4d.png)

To run some queries, we need a database and a table (we won't be using the sample data in Snowsight). The GIF below shows how you can create a new database named "test_db" and a table named "diamonds" from a local CSV file. You can grab the CSV file by running the code in [this GitHub gist](https://gist.github.com/BexTuychiev/74235b154573953fe6e6b8ce2a785c4b) in your terminal.

![](images/web_create_table.gif)

In the GIF, Snowsight tells us that there is a problem with one of the column names. Since the word "table" is a reserved keyword, I wrapped it inside double quotes.

Afterwards, you will be directed to a new worksheet where you can run any SQL query you want. As shown in the GIF, the worksheet interface is quite straightforward and highly functional. Take a few minutes to familiarize yourself with the panels and buttons and what goes where.

### SnowSQL: CLI

Nothing beats the thrill of managing and querying a full-fledged database on your terminal. That's why SnowSQL exists!

But getting it up and running takes a few more steps than getting started with Snowsight.

As a first step, download the SnowSQL installer for your platform from the [Snowflake Developers Download](https://developers.snowflake.com/snowsql/) page. I am on WSL2, so I will choose a Linux version:

![](images/copy_download_link.gif)

In the terminal, I download the file using the copied link and execute it with `bash`:

```shell
$ curl -O https://sfc-repo.snowflakecomputing.com/snowsql/bootstrap/1.2/linux_x86_64/snowsql-1.2.31-linux_x86_64.bash
$ bash snowsql-1.2.31-linux_x86_64.bash
```

For other platforms, you can follow the installation steps from [this page of Snowflake docs](https://docs.snowflake.com/en/user-guide/snowsql-install-config).

Once installed successfully, you should get the following message:

![image.png](attachment:996896c7-5a22-4567-afe6-08ca6bae97ec.png)

> Note: On Unix-like systems, to make the `snowsql` command available in all terminal sessions, add the `/home/<your-username>/bin` directory to `$PATH`. You can do this by adding the following line to your `.bashrc`, `.bash_profile`, or `.zshrc` file: `export PATH=/home/<your-username>/bin:$PATH`.

The message is prompting us to configure the account settings to connect to Snowflake. There are two ways to do this:
1. Passing the account details interactively in the terminal.
2. Configuring the credentials in a global Snowflake configuration file.

Since it is more permanent and secure, we will proceed with option two. For platform-specific instructions, read the [Connecting through SnowSQL](https://docs.snowflake.com/en/user-guide/snowsql-start) page of the docs. Instructions below are for Unix-like systems.

First, open your inbox and find the welcome email from Snowflake. It contains your account name inside the login link: `account-name.snowflakecomputing.com`. Copy it.

![image.png](attachment:dffeb297-9af5-4b76-bca1-b7cc0722c5bf.png)

Then, open the `~/.snowsql/config` file with a text editor like Vim or VS Code. Under the `connections` section, uncomment three fields:
- Account name
- Username
- Password

Replace the default values with the account name you copied and the username and password you provided during sign up. Save and close the file.
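
For reference, the edited section might look roughly like the fragment below. The values in angle brackets are placeholders, and this sketch assumes that only the part of the login link before `.snowflakecomputing.com` is used as the account name.

```ini
[connections]
# Placeholder values; replace them with your own details.
accountname = <account-name>
username = <your-username>
password = <your-password>
```

Since the file holds your password in plain text, it is worth making sure that only your own user account can read it.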

Then, move back to your terminal and type in `snowsql`. The client should automatically connect and provide you with an SQL editor with code highlighting, tab completion and all! Here is what it looks like:

![image.png](attachment:d94652fa-1ca4-4c2e-8b16-6bd6bc9d3f70.png)

#### Connecting to an existing database in Snowflake

Right now, we aren't connected to any databases. Let's fix that by connecting to the `test_db` database we've created with Snowsight. First, check available databases with `SHOW DATABASES`:

```SQL
-- List the databases your account can access
SHOW DATABASES;
-- Make TEST_DB the active database for the session
USE DATABASE TEST_DB;
```

The `USE DATABASE` command specifies that you will be using the `test_db` database from now on (unquoted identifiers are case-insensitive). Then, you can run any SQL query on the tables of the connected database.

```SQL
-- Count the rows in the table created from the CSV file
SELECT COUNT(*) FROM DIAMONDS;
```

![image.png](attachment:f17e174b-0a21-406c-a167-005248397da9.png)
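
From here, ordinary analytical queries work as you would expect. The sketch below assumes the table has `CUT` and `PRICE` columns, as in the commonly used diamonds dataset; adjust the column names if your CSV file differs.

```SQL
-- Average price and number of diamonds per cut quality
-- (CUT and PRICE are assumed column names).
SELECT
  CUT,
  COUNT(*)             AS NUM_DIAMONDS,
  ROUND(AVG(PRICE), 2) AS AVG_PRICE
FROM DIAMONDS
GROUP BY CUT
ORDER BY AVG_PRICE DESC;
```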

## Conclusion