# From Data Lake to Data Warehouse - ETL 

Building robust pipelines that let your data flow seamlessly has become one of the most important parts of a well-made infrastructure. These pipelines are called ETL processes. In this course, we will cover the basics of an ETL process and build some using Google Big Query 🪄

## What you'll learn in this course 🧐🧐

* What is an ETL 
* What is Google Big Query
* Build your first ETL

## What is an ETL? 

ETL stands for **E**xtract **T**ransform **L**oad. As barbaric as it sounds, it is simply the process of extracting data from a source such as your Data Lake, transforming it into a clean, structured format, and loading it into your Data Warehouse.

You can build an ETL using SQL, Python, or even no-code solutions offered by cloud services.
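To make the three steps concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the `RAW_CSV` content is invented, and an in-memory SQLite database merely stands in for a real Data Warehouse such as Big Query.

```python
import csv
import io
import sqlite3

# Hypothetical raw export, standing in for a file sitting in a Data Lake.
RAW_CSV = """order_id,amount,country
1,19.99,fr
2,5.00,us
3,,fr
"""


def extract(raw: str) -> list[dict]:
    """Extract: read the raw rows exactly as they are."""
    return list(csv.DictReader(io.StringIO(raw)))


def transform(rows: list[dict]) -> list[tuple]:
    """Transform: drop rows without an amount, cast types, upper-case country codes."""
    return [
        (int(r["order_id"]), float(r["amount"]), r["country"].upper())
        for r in rows
        if r["amount"]
    ]


def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    """Load: write the cleaned rows into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


# In-memory SQLite plays the role of the Data Warehouse here.
warehouse = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), warehouse)
print(warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall())
```

Keeping each step in its own function makes it easy to swap the source or the destination later without touching the transformation logic.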

## What is Google Big Query 🥊

[Big Query](https://cloud.google.com/bigquery) is Google's cloud Data Warehouse. A lot of companies are using it these days as it is easy to use, reliable and relatively cheap.

With Big Query, you will be able to: 

* Build simple ETLs 
* Query your data using SQL 

Let's first open Big Query by going to the [Google Cloud Console](https://cloud.google.com) and searching for *Big Query*:

![Searching for Big Query in the Google Cloud Console](https://essentials-assets.s3.eu-west-3.amazonaws.com/M02-SQL/Introduction_to_SQL_and_cloud_computing/big_query_search.png)

On the left side of the screen, you will see the name of your project (in the above picture *fresh-desk-324610*). This is where your data is going to live. 

For the moment, it is empty 🕳️, as it should be, but we will learn how to add data in the next section of this course.

## Build your first ETL 📩

There are two ways you can transfer data to Big Query:

1. From a Data Lake using the *Data Transfer* service
2. From a Cloud SQL database using *Big Query* built-in functionalities

Before diving into each option, let's prepare Big Query. First we will need to create a dataset, which is Big Query's equivalent of a database:

* **Create Dataset**

![Creating a dataset in Big Query](https://essentials-assets.s3.eu-west-3.amazonaws.com/M02-SQL/Introduction_to_SQL_and_cloud_computing/create_dataset_bq.gif)
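If you prefer code over clicks, the same step can be sketched in Python. This assumes the `google-cloud-bigquery` package is installed and that you are authenticated against your project; the dataset name `my_dataset` and the `EU` location are hypothetical choices, not values from this course.

```python
def dataset_id(project: str, dataset_name: str) -> str:
    """Big Query identifies a dataset as `project.dataset`."""
    return f"{project}.{dataset_name}"


def create_dataset(project: str, dataset_name: str, location: str = "EU"):
    """Create the dataset if it does not exist yet and return it."""
    # Imported lazily so the helper above works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=project)
    dataset = bigquery.Dataset(dataset_id(project, dataset_name))
    dataset.location = location  # assumption: pick the region closest to your data
    # exists_ok=True makes the call safe to run more than once.
    return client.create_dataset(dataset, exists_ok=True)


# Example call (needs credentials, so it is left commented out):
# create_dataset("fresh-desk-324610", "my_dataset")
```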

We are all set, let's populate our tables with data. 💪


👋 As a final note, **make sure you have some data within your Data Lake** (i.e. Google Cloud Storage)


### From Data Lake using Data Transfer 💦


Now that we have Big Query ready to receive data, let's add some. There are two ways of doing it depending on how often you need to transfer data.


#### If you need to transfer data once 

If you only need to transfer your data from your Data Lake to Big Query once, it's relatively simple. You only need to create a table in your dataset and specify where to get the data:

![Creating a Big Query table from a file in the Data Lake](https://essentials-assets.s3.eu-west-3.amazonaws.com/M02-SQL/Introduction_to_SQL_and_cloud_computing/create_table_from_datalake.gif)

Google will automatically detect the type of file you want to import to Big Query. 😮

👋 Select `auto detect` to automatically detect your file's schema. It will save you some time; otherwise you will need to specify each column of your table manually 😰
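The one-off load described above can also be sketched in Python. This is only an illustration under assumptions: it relies on the `google-cloud-bigquery` package, on a CSV file with a header row, and on hypothetical bucket, file and table names.

```python
def gcs_uri(bucket: str, path: str) -> str:
    """Build the gs:// address of a file stored in Google Cloud Storage."""
    return f"gs://{bucket}/{path}"


def load_csv_once(uri: str, table_id: str) -> int:
    """Load one CSV file into a Big Query table and return the table's row count."""
    # Imported lazily so the helper above works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,  # assumption: the first line is a header
        autodetect=True,      # same idea as ticking `auto detect` in the console
    )
    load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    load_job.result()  # block until the load job has finished
    return client.get_table(table_id).num_rows


# Example call (hypothetical names, needs credentials):
# load_csv_once(gcs_uri("my-bucket", "orders.csv"), "fresh-desk-324610.my_dataset.orders")
```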


#### If you need to transfer data periodically 

If you need to periodically import data from your Data Lake to Big Query, you will need to set it up using *Data Transfer*. Here is how to do it:

![Creating a periodic Data Transfer in Big Query](https://essentials-assets.s3.eu-west-3.amazonaws.com/M02-SQL/Introduction_to_SQL_and_cloud_computing/create_a_data_transfer.gif)

⚠️ You need to already have a table created with the right schema to use *Data Transfer* ⚠️ Therefore, best practice is to first do a manual transfer, so that you don't have to waste time building your schema, and then create a periodic data transfer.
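For reference, a scheduled transfer can also be sketched in Python. This assumes the separate `google-cloud-bigquery-datatransfer` package and that the destination table already exists; the bucket, file pattern, table name, display name and daily schedule are all hypothetical. Treat the parameter names as an assumption to check against Google's documentation for Cloud Storage transfers.

```python
def gcs_transfer_params(bucket: str, pattern: str, table_name: str) -> dict:
    """Parameters describing which files to pick up and where to write them."""
    return {
        "data_path_template": f"gs://{bucket}/{pattern}",
        "destination_table_name_template": table_name,
        "file_format": "CSV",
        "skip_leading_rows": "1",  # assumption: files start with a header row
    }


def create_daily_transfer(project_id: str, dataset_name: str, params: dict):
    """Register a transfer that runs every 24 hours."""
    # Imported lazily so the helper above works without the library installed.
    from google.cloud import bigquery_datatransfer

    client = bigquery_datatransfer.DataTransferServiceClient()
    config = bigquery_datatransfer.TransferConfig(
        destination_dataset_id=dataset_name,
        display_name="Daily load from Cloud Storage",  # hypothetical name
        data_source_id="google_cloud_storage",
        params=params,
        schedule="every 24 hours",
    )
    return client.create_transfer_config(
        parent=client.common_project_path(project_id),
        transfer_config=config,
    )


# Example call (hypothetical names, needs credentials):
# create_daily_transfer(
#     "fresh-desk-324610", "my_dataset",
#     gcs_transfer_params("my-bucket", "exports/*.csv", "orders"),
# )
```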

### Import data directly from Cloud SQL ☁️

When you are not dealing with CSV or Excel files but SQL databases, you can directly import them to Big Query without using Google Cloud Storage. To do so you will need to find the following information:

* **Cloud SQL Instance ID**

![Where to find the Cloud SQL instance ID](https://essentials-assets.s3.eu-west-3.amazonaws.com/M02-SQL/Introduction_to_SQL_and_cloud_computing/cloud_sql_instance_id.png)

* **Database name**

![Where to find the Cloud SQL database name](https://essentials-assets.s3.eu-west-3.amazonaws.com/M02-SQL/Introduction_to_SQL_and_cloud_computing/cloud_sql_database.png)
    
* **Database username** 

![Where to find the Cloud SQL database users](https://essentials-assets.s3.eu-west-3.amazonaws.com/M02-SQL/Introduction_to_SQL_and_cloud_computing/cloud_sql_users.png)

* **Database password**
    * You should have stored it when you created your database


Once you have gathered this information, go to the *SQL Workspace* section of Google Big Query and click on *+ ADD DATA > External Data Source* at the top:

![Adding a Cloud SQL external data source in Big Query](https://essentials-assets.s3.eu-west-3.amazonaws.com/M02-SQL/Introduction_to_SQL_and_cloud_computing/Cloud_sql_add_external_data_source.png)

Then simply fill out the requested information. In addition to the items above, you will need to provide a *Connection ID*. You 👊 create it yourself, so you don't need to look for it anywhere in GCP 😉
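Once the connection exists, Big Query can read from Cloud SQL through its `EXTERNAL_QUERY` function. The sketch below only builds the SQL text; the connection name `my_connection`, the `eu` location, the `customers` table and the destination table are hypothetical, and the inner query is assumed not to contain triple double quotes.

```python
def external_query_sql(project: str, location: str, connection_id: str, inner_sql: str) -> str:
    """Wrap a Cloud SQL query so that Big Query runs it through the connection."""
    connection = f"{project}.{location}.{connection_id}"
    return f'SELECT * FROM EXTERNAL_QUERY("{connection}", """{inner_sql}""")'


def import_table_sql(destination_table: str, federated_query: str) -> str:
    """Materialise the result of a federated query as a Big Query table."""
    return f"CREATE OR REPLACE TABLE `{destination_table}` AS {federated_query}"


query = external_query_sql(
    "fresh-desk-324610", "eu", "my_connection", "SELECT * FROM customers"
)
print(import_table_sql("fresh-desk-324610.my_dataset.customers", query))
```

You can paste the printed statement into the Big Query editor to copy the Cloud SQL table into your dataset.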

## Resources 📚📚

* Google Big Query - [https://bit.ly/ckdaXc](https://cloud.google.com/bigquery)