# Friday, April 5

# Introduction to Data Warehousing

The concept of Data Warehousing originated at IBM in the 1980s. The goal of the initial research was to provide a framework for transferring data from operational systems to business intelligence departments while avoiding the cost and technical challenges of high redundancy.

## What will you learn in this course? 🧐🧐

This lecture introduces the concept of data warehousing and explains why we need it. Here's the outline:

* Why can't analysts work directly on business databases?
* Data Warehouse VS Data Lake
* Data Warehouse VS traditional databases
    * Key differences
* Cloud vendors
* Amazon Redshift
    * Set up your own Redshift cluster
    * Tear down your Redshift cluster when you are done
* Using Redshift in PySpark
    * Writing to Redshift from PySpark DataFrame
    * Reading from Redshift onto a PySpark DataFrame

## Why can't analysts work directly on business databases? 🤔🤔

Business databases must stay clean at all costs: giving data analysts or data scientists direct access to them puts operational data at risk.

Moreover, most of the time, unstructured data (i.e., data not stored in any kind of database) is required for a thorough analysis.

A warehousing solution lets the company aggregate and store the data it needs for analysis without altering the databases used for operations.

## Data Warehouse VS Data Lake 🗄️🆚🌊

You often hear both terms when discussing Big Data, but they are very different.

A Data Lake is a large pool of raw data with no defined purpose: we store this unstructured data in anticipation of future use.

A Data Warehouse holds **processed** and **structured** data, ready to be used for advanced analytics.

Most of the time, data that ends up in the Warehouse was previously stored in the Lake. 

- Step 1: Data is collected and stored in its raw form in a Data Lake,
- Step 2: Data is extracted from the Lake, cleaned and processed,
- Step 3: Data is loaded in the warehouse, ready to be queried.
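
The three steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the sample records, the `orders` table and the function names are hypothetical, and an in-memory SQLite database stands in for a real warehouse.

```python
import json
import sqlite3

# Step 1: raw, heterogeneous records as they might sit in a Data Lake
# (hypothetical sample data; a real lake would hold files in object storage).
raw_lake = [
    '{"order_id": 1, "amount": "19.90", "country": "fr"}',
    '{"order_id": 2, "amount": null, "country": "FR"}',
    '{"order_id": 3, "amount": "5.00", "country": "de"}',
]


def extract_and_clean(lines):
    """Step 2: parse raw records, drop incomplete ones, normalize types."""
    rows = []
    for line in lines:
        record = json.loads(line)
        if record["amount"] is None:
            continue  # skip records that cannot be analyzed
        rows.append(
            (record["order_id"], float(record["amount"]), record["country"].upper())
        )
    return rows


def load(rows, connection):
    """Step 3: load the structured rows into a queryable table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, country TEXT)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()


# An in-memory SQLite database stands in for the warehouse.
warehouse = sqlite3.connect(":memory:")
load(extract_and_clean(raw_lake), warehouse)

totals = warehouse.execute(
    "SELECT country, SUM(amount) FROM orders GROUP BY country ORDER BY country"
).fetchall()
print(totals)
```

Once the data is loaded, the analysis itself is a short SQL query, because the cleaning happened before the load.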

## Data Warehouse VS traditional databases 🗄️🗄️🗄️🆚🗄️

Roughly, a Data Warehouse **is** a relational database. It's just a little more than that.

### Key differences 🔑

1. The Warehouse can hold data from many databases.
2. Any data stored in the Warehouse is stored for **analytics purposes only**.
3. Data within a warehouse has been processed to simplify analysis and avoid the need for SQL queries that span 300 lines.
4. Whereas traditional databases are optimized for reading and writing whole rows (or observations), data warehouses are optimized for operations on columns (or fields), such as aggregations.

In a nutshell: warehouses are optimized for performant analysis.
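
To make the row versus column distinction concrete, here is a toy sketch in plain Python. The sample data and variable names are hypothetical, and real engines store data on disk in far more elaborate ways; the point is only which layout makes which access pattern cheap.

```python
# The same three records in a row-oriented layout (one tuple per observation)...
rows = [
    (1, "apple", 2.5),
    (2, "banana", 1.0),
    (3, "orange", 4.0),
]

# ...and in a column-oriented layout (one list per field).
columns = {
    "id": [1, 2, 3],
    "fruit": ["apple", "banana", "orange"],
    "price": [2.5, 1.0, 4.0],
}

# Fetching one full observation is a single lookup in the row layout.
second_row = rows[1]

# Aggregating one field only touches one list in the column layout,
# whereas the row layout has to walk through every record.
total_from_columns = sum(columns["price"])
total_from_rows = sum(row[2] for row in rows)

print(second_row, total_from_columns, total_from_rows)
```

Both layouts give the same total, but the column layout reads only the `price` values, which is the access pattern analytics queries rely on.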

**A warehouse is the perfect candidate for `LOAD` destination in ETL pipelines.**

## Cloud vendors ☁️☁️

- BigQuery, owned by Google, and part of the Google Cloud Platform,
- Redshift, owned by Amazon and part of the AWS platform,
- Snowflake,
- ...

As always when choosing between vendors, the cost structure is one of the most important aspects to check. For instance, BigQuery storage is **much** cheaper than Redshift storage, but querying data on a provisioned Redshift cluster carries **no per-query charge** (you pay for the cluster itself), whereas on-demand queries cost about 5 dollars per TB scanned on BigQuery. Depending on your needs, one solution might be more suitable than the other.
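
As a rough illustration of how scan-based billing adds up, the sketch below uses the figure of about 5 dollars per TB quoted above. The function name and the workload (200 queries a month, 0.05 TB each) are hypothetical; check the vendor's current price list before relying on any number.

```python
def on_demand_query_cost(terabytes_scanned, price_per_tb=5.0):
    """Cost in dollars of a query billed by the volume of data it scans."""
    if terabytes_scanned < 0:
        raise ValueError("terabytes_scanned must be non-negative")
    return terabytes_scanned * price_per_tb


# Hypothetical workload: 200 queries per month, each scanning 0.05 TB.
monthly_cost = 200 * on_demand_query_cost(0.05)
print(f"{monthly_cost:.2f} dollars per month")
```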

## Amazon Redshift 🔴🔴

Redshift is the Data Warehousing solution from Amazon Web Services. Like every service in the AWS family, Redshift is **Cloud-based**: you only pay for compute and storage, and you don't have to worry about maintenance costs or about scaling the hardware to support an increasing load.

Amazon Redshift Serverless automatically provisions data warehouse capacity and intelligently scales the underlying resources. Amazon Redshift Serverless adjusts capacity in seconds to deliver consistently high performance and simplified operations for even the most demanding and volatile workloads.

**If you have never used Amazon Redshift Serverless before, you are eligible for a $300 credit, which can be used within 90 days of sign-up toward your compute and storage usage!**

### Creating a Redshift Serverless cluster

Go to your AWS Console and look for Redshift. Check that you are in the right region (here, `Paris`). Click on _Try Redshift Serverless free trial_!

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/new/redshift1.png"/>

In the configuration page, click on "Manage IAM roles" > "Create IAM role".

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/new/redshift2.png"/>

Select "Any S3 bucket"  and click on "Create IAM role as default":

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/new/redshift3.png"/>

At the bottom of the page, confirm and wait for the cluster initialization to complete (it may take up to 10 minutes).

Then, you can click on "default-workgroup", and you should see "Available" in the status:

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/new/redshift4.png"/>

Scroll down until you see the "Network and security" settings, and click on the "Edit" button:

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/new/redshift5.png"/>

In the configuration page, check "Turn on Publicly accessible":

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/new/redshift6.png"/>

The changes may take a few minutes.

Congratulations, your first Redshift Serverless cluster is ready!

Now, let's create a connection between our notebook (on Databricks) and the Redshift data warehouse in order to read and write some data.

To do so, you'll need some connection information:

* The JDBC URL is in the "default-workgroup" page:

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/new/redshift8.png"/>

* The username is in the "default-namespace" page:

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/new/redshift9.png"/>

## Using Redshift in PySpark

### Writing to Redshift from PySpark DataFrame ✨➡🔴

Let's see how to use Redshift with PySpark. First, we create a simple DataFrame:

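In [None]:
import numpy as np
import pandas as pd

# Small sample dataset; column 'd' contains a missing value
data_dict = {'a': [1, 2, 3], 'b': [2, 3, 4], 'c': [3, 4, 5], 'd': [np.nan, 0, 1], 'e': ["apple", "banana", "orange"]}

pandas_df = pd.DataFrame.from_dict(data_dict)

# `spark` is the SparkSession that Databricks notebooks provide by default
df = spark.createDataFrame(pandas_df)

df.show()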

Then you need to fill in some information:

> The `REDSHIFT_FULL_PATH` is the JDBC URL from the workgroup panel we mentioned above 👆. Remember? 🙂
You'll have to modify it by replacing "redshift" with "postgresql" in the URL's `jdbc:redshift://` prefix.

> The `REDSHIFT_USER` is the admin user name from the namespace panel.

> The `REDSHIFT_PASSWORD` is the password you set when you created the cluster.
If you forgot your password, you can change it by clicking on the "Actions" button in the default-namespace panel.
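
The URL change can also be done programmatically. The helper below is a minimal sketch: its name and the example endpoint are hypothetical, and it assumes the URL copied from the console starts with `jdbc:redshift://`. It swaps only that prefix, because a blanket search-and-replace would also alter the "redshift" part of the host name.

```python
def to_postgresql_jdbc_url(redshift_jdbc_url):
    """Swap only the JDBC sub-protocol, leaving the host name untouched."""
    prefix = "jdbc:redshift://"
    if not redshift_jdbc_url.startswith(prefix):
        raise ValueError(f"Expected a URL starting with {prefix!r}")
    return "jdbc:postgresql://" + redshift_jdbc_url[len(prefix):]


# Hypothetical endpoint, for illustration only.
example_url = "jdbc:redshift://my-workgroup.123456789012.eu-west-3.redshift-serverless.amazonaws.com:5439/dev"
converted_url = to_postgresql_jdbc_url(example_url)
print(converted_url)
```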

In [None]:
REDSHIFT_USER = 'YOUR_REDSHIFT_USERNAME'
REDSHIFT_PASSWORD = 'YOUR_REDSHIFT_PASSWORD'

REDSHIFT_FULL_PATH = "URL_JDBC" # don't forget to replace "redshift" with "postgresql" in the "jdbc:redshift://" prefix
                                # for example it'll look like:
                                # "jdbc:postgresql://redshift-cluster-1.csssws1edn9m.eu-west-3.redshift.amazonaws.com:5439/dev"
REDSHIFT_TABLE = 'NAME_OF_THE_TABLE'

We can then write to our Redshift warehouse:

In [None]:
mode = "overwrite"

properties = {"user": REDSHIFT_USER, "password": REDSHIFT_PASSWORD, "driver": "org.postgresql.Driver"}

df.write.jdbc(url=REDSHIFT_FULL_PATH, table=REDSHIFT_TABLE, mode=mode, properties=properties)

The four `mode` values to choose from are:

- `overwrite`: drop the table if it exists, then load the data into a new one,
- `append`: create the table if it does not exist, else append the data to the existing table,
- `error` (default): create the table, or raise an error if it already exists,
- `ignore`: create the table if it does not exist, else silently do nothing.
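
To build intuition for these modes without touching a real cluster, here is a toy model in plain Python. It is not Spark code: the `write_table` function and the dictionary standing in for the warehouse are hypothetical, and the sketch only mimics what happens when the target table does or does not exist.

```python
def write_table(warehouse, table, rows, mode="error"):
    """Toy model of the save modes; `warehouse` maps table names to lists of rows."""
    exists = table in warehouse
    if mode == "overwrite":
        warehouse[table] = list(rows)  # replace whatever was there
    elif mode == "append":
        warehouse.setdefault(table, []).extend(rows)
    elif mode == "ignore":
        if not exists:
            warehouse[table] = list(rows)  # otherwise silently do nothing
    elif mode == "error":
        if exists:
            raise ValueError(f"Table {table!r} already exists")
        warehouse[table] = list(rows)
    else:
        raise ValueError(f"Unknown mode: {mode!r}")


warehouse = {}
write_table(warehouse, "fruits", [("apple",)], mode="error")    # creates the table
write_table(warehouse, "fruits", [("banana",)], mode="append")  # adds a row
write_table(warehouse, "fruits", [("cherry",)], mode="ignore")  # no-op: table exists
before_overwrite = list(warehouse["fruits"])
write_table(warehouse, "fruits", [("orange",)], mode="overwrite")  # replaces the content
print(before_overwrite, warehouse["fruits"])
```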

Once you've executed the cells above, you can check that the data has been written to Redshift by clicking on the "query data" button in the namespace panel. A window like the one below should appear:

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/new/redshiftquery.png"/>

On the left side, you should find a table in Serverless > dev > public > Tables.
You can also use the SQL query editor on the right to query the table.

### Reading from Redshift onto a PySpark DataFrame 🔴➡✨

We can read from our Redshift warehouse in a few lines:

In [None]:
properties = {"user": REDSHIFT_USER, "password": REDSHIFT_PASSWORD, "driver": "org.postgresql.Driver"}

# Reuse the JDBC URL and table name defined in the writing section
table = spark.read.jdbc(url=REDSHIFT_FULL_PATH, table=REDSHIFT_TABLE, properties=properties)

table.show()

Although this can be useful, it is also possible to query your database using the Redshift query editor directly, which is most likely what data analysts and business analysts would be doing in a real-life context.
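
When only part of a table is needed in the notebook, Spark's JDBC reader also accepts a parenthesized subquery with an alias in place of a table name, so the filtering runs in the warehouse rather than in Spark. The helper below is a small sketch that builds such a string; its name, the alias and the sample query are hypothetical.

```python
def as_jdbc_subquery(sql, alias="subquery"):
    """Wrap a SELECT statement so it can be passed where a table name is expected."""
    return f"({sql.strip().rstrip(';')}) AS {alias}"


# Hypothetical query against the table written earlier.
pushed_down = as_jdbc_subquery("SELECT a, e FROM NAME_OF_THE_TABLE WHERE a > 1;")
print(pushed_down)

# In the notebook, this string would replace the table name, for example:
# filtered = spark.read.jdbc(url=REDSHIFT_FULL_PATH, table=pushed_down, properties=properties)
```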

Congrats! 👏 You just created your first data warehouse using Redshift! Do not forget to 👉 **[tear down your Redshift cluster](#how-to-tear-down-your-redshift)** 👈 or you run the risk of being charged.

### How to tear down your Redshift?

When you have finished working with your Redshift cluster, we advise you to ⚠️ **tear down your Redshift cluster so as to avoid unnecessary costs.** ⚠️

It is easy; just follow these steps:

Go to the default-workgroup panel and click on Actions > Delete workgroup:

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/new/redshiftdelete1.png"/>

On the confirmation page, enter the word "delete", then check "Delete the associated namespace default-namespace", uncheck "Create final snapshot" and type "delete" again at the bottom of the page:

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/new/redshiftdelete3.png"/>

It may happen that the default-namespace can't be deleted at this point. In that case, an error message will pop up. Then, just go to the default-namespace panel and click on "Actions" > "Delete namespace":

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/new/redshiftdelete4.png"/>

Uncheck "Create final snapshot" and type "delete" to confirm:

<img src="https://full-stack-assets.s3.eu-west-3.amazonaws.com/M05-D03-Redshift/new/redshiftdelete5.png"/>

You're done! The deletion is complete when no workgroup and no namespace appear in your Redshift dashboard.


## Resources 📚📚

- [A nice article on Alooma's blog](https://www.alooma.com/blog/database-vs-data-warehouse)
- [Databricks documentation on Amazon Redshift](https://docs.databricks.com/data/data-sources/aws/amazon-redshift.html#setting-a-custom-column-type)