# Session 9 - ETL Concepts and The Resolution of Data

## EXTRACT
- Where is data coming from?
- How quickly is it moving?
- What level of granularity is it coming through with?

## TRANSFORM
- What do I want to change about the data?
- What am I adding to this data from elsewhere?
- How do we want to aggregate or shape this data?

## LOAD
- Where is the data going?
- When should I send it/how quickly?
- Is the data I'm sending accurate and useful?

# EXTRACT - "How do I ingest this data?"
#### The first stage of the ETL process is Extracting the data... This is the point when you actually pull the underlying data out of some source. Naturally, and as we've talked about in the past, there are many different datasources that we can pull from, and depending on what they are, there may be different steps that we have to accomplish down the line. Typically, some initial data-validation is done at this stage to make sure that things are going smoothly, and some filtering of the data may happen at this point as well if needed. This is analogous to a restaurant manager arriving at the market in the morning, sampling and buying only the best ingredients, validating that they'll be a good fit in the menu that night and bringing them back to the restaurant.

# TRANSFORM - "How do I want to change or restructure this data?"
#### The second step of this process is where a lot of the magic happens. We've been talking about this in some fashion since you've started this class. We talk about how to clean up data, restructure it, aggregate and disaggregate it, concatenate columns, etc. etc. There will also likely be more validation happening at this stage as well. This part of the process is incredibly valuable and you can think about it as the difference between a pile of ingredients on a counter and a finished plate of food cooked by a professional chef who has been crafting their recipe. The outcome of proper transformation is worth far more than the sum of its parts.

# LOAD - "How do I want to deliver this data?"
#### At the end of the process (however simple or complex you want that to be), you have to ship the finished product off somewhere. Where? That will obviously change, but you should still be delivering at the right level of aggregation with the proper changes made to it or new signal that has been added from another datasource. You can think about this as the server (or servers) that are tasked with bringing the food out safely and with the proper timing. You can have the best chef in the world but if you're trying to lob the food across the dining room or serving it up cold every single time, your restaurant won't be winning any awards.
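To make the three stages concrete, here is a minimal sketch of a whole pipeline using only the Python standard library. Everything in it is an assumption made for illustration: the `RAW_CSV` sample data, the `customer_totals` table, and the function names are hypothetical, and an in-memory SQLite database stands in for whatever destination you would really load into.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for a flat-file source.
RAW_CSV = """order_id,customer,amount
1,alice,10.50
2,bob,not_a_number
3,alice,4.25
"""

def extract(text):
    """Pull rows out of the source and drop the ones that fail a basic check."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        try:
            row["amount"] = float(row["amount"])
        except ValueError:
            continue  # rudimentary validation: skip malformed amounts
        rows.append(row)
    return rows

def transform(rows):
    """Aggregate order amounts up to one total per customer."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return sorted(totals.items())

def load(totals, conn):
    """Write the finished product to its destination table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_totals (customer TEXT PRIMARY KEY, total REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO customer_totals VALUES (?, ?)", totals)
    conn.commit()

conn = sqlite3.connect(":memory:")  # stand-in for the real destination
load(transform(extract(RAW_CSV)), conn)
result = conn.execute(
    "SELECT customer, total FROM customer_totals ORDER BY customer"
).fetchall()
print(result)
```

Keeping each stage as its own function makes it easier to swap the source or the destination later without touching the transformation logic.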



## We're going to talk about all stages of this process and which actions typically fall under each, but the truth is that nowadays the arrival of Enterprise Data Warehouses is sort of muddying the waters of what should fall into each category. Regardless of which part of the process you're in, you're going to want to keep quite a few main thoughts in mind as you go... 



### WHERE FROM? - The first thing to think about is where the data is actually going to be coming from. I'd say it is very likely that you will get your data from one of a few places...

- Standard relational database. This is what you've been working with a little bit already, such as Postgres, MySQL, etc.

- Document-based database/NoSQL database such as MongoDB.

- Live datastream such as logs from a website... If you have JavaScript tags firing on a page, you can be tasked with picking up information coming in from those tag-fires, processing it in some way, and then sending it on to another system.

- Flat files such as JSON or CSV files (this may be a common practice for reporting between companies, since nearly everybody has access to software that can read and show these at least at a basic level).

- Enterprise Data Warehouse (such as Google BigQuery or Amazon Redshift), though this may be a slightly different process.

- API/middleware is always a possibility. APIs can be incredibly important to lots of different aspects of your organization, so you may run into data that is fed like this at several points. This would involve holding data from that source in memory or writing it somewhere as part of the extraction process.

- Some Other Datastream (We will talk a bit about streams later)

#### It is important to note that you will have ETL processes happening in many different parts of your organization, from processing raw data as it comes into your system initially, to a nightly process that picks data up from a production database and moves it to a "colder" form of storage, to a process that pulls data from a materialized view in a database and plops it in an analytics environment for quick visualization.

### HOW FAST? - You'll want to always keep in mind how quickly the underlying data is moving in and out of that system, as well as how taxed that system is at any point in time. This is also mirrored on the "load" side when you're done with the data. You're going to want to keep in mind the limitations of each system that is participating in the process. If you have 3 databases of different kinds and in physically different locations feeding

- This is very system-dependent. You can have a server that is set up so that it is always running at 99% capacity just bringing the data in, if it is used strictly for that and never for other operations. If instead the datasource is set up more like an analytics environment for faster querying, the CPU and memory allotment may be completely different.

#### EX: You have a powerful 32 core machine with 64 GB of RAM and an arbitrarily large amount of disk space and you have something feeding this database between 2 and 5 gigs of data per minute. With all of the processing it is doing, it regularly runs at 95% of its capacity... 

- If you wanted to extract the data in "real-time" (or close to real-time) for an analytics front-end, you may run out of available bandwidth in this system and cause the output to be slow or worse. But if you scaled back to only needing aggregations, or small snippets of the data to be moved as part of an ETL framework, you may be just fine without changing the specs of your machine.

- If you doubled the specs of this system with everything else held equal, it may allow you to run queries against your database for an analytics front-end (HTML/CSS/JS, Tableau, etc.) more quickly. Even though the same exact database and datastream exist, you have a larger amount of computing power available to you in your sandbox.

- The reason that I bring up this example is to let you know that you have to keep in mind just how much you are asking of a system when you will be extracting data from it. If you treat every server like it's going to have a bunch of extra resources available to move data out of it, you're probably going to realize very quickly that hardware limitations (both physical and cloud) do exist and can throw a wrench in your plans.

- There are a ton of ways of handling this: you may have to upgrade the hardware on the main server, find a way to spin up a second server (again, we'll cover this later) to take on some of the workload, have a clever stream-processor working to only pull data from the first system when that system has a free moment to "pass the baton", etc.

- Chances are, an at-scale ETL process would be a lot more involved than the pipelines that you built/will build out during your projects, but don't let that be discouraging. You learned some of the exact steps that you'd have to do, just at a much smaller scale.
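One common way to ask less of a busy source system is to pull rows in fixed-size batches rather than all at once. The sketch below assumes a hypothetical `events` table and uses an in-memory SQLite database as a stand-in for the real source; `extract_in_batches` is an illustrative helper, not part of any library.

```python
import sqlite3

def extract_in_batches(conn, query, batch_size=1000):
    """Yield rows in fixed-size batches instead of pulling everything at once."""
    cursor = conn.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch

# Hypothetical source table with 2,500 rows.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
source.executemany(
    "INSERT INTO events (payload) VALUES (?)",
    [(f"event-{i}",) for i in range(2500)],
)

batch_sizes = [
    len(batch)
    for batch in extract_in_batches(source, "SELECT id, payload FROM events", 1000)
]
print(batch_sizes)
```

Because the function is a generator, the caller decides the pace: it can process one batch, pause, and only then ask for the next, which keeps memory use flat and gives the source room to breathe.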

#### Actually serving up the data during the "load" stage will also be heavily dependent on the systems in play. For example, if you have an analytics server that is being used to power all of your company's Tableau dashboards, then you would have to be very careful not to push data to that system in a way that will be overly taxing, because that can lead to all of the dashboards in your company running slowly for the day, or worse, not loading at all! There is nothing that will end with a coworker hovering at your desk or sending emails to alert you faster than a real-time dashboard breaking or being slow to the point that it becomes unusable.
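The same idea applies on the load side. As a rough sketch, assuming a hypothetical `dashboard_metrics` table on the analytics server (again simulated with in-memory SQLite), you could write in small transactions and pause between them. The batch size and pause length here are placeholders you would tune for your own system.

```python
import sqlite3
import time

def load_in_batches(conn, rows, batch_size=500, pause_seconds=0.0):
    """Insert rows in small transactions, pausing between them to ease the load."""
    loaded = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        with conn:  # each batch commits (or rolls back) as its own transaction
            conn.executemany(
                "INSERT INTO dashboard_metrics (name, value) VALUES (?, ?)", batch
            )
        loaded += len(batch)
        time.sleep(pause_seconds)  # give the target system room to serve its readers
    return loaded

# Hypothetical analytics target and data.
analytics = sqlite3.connect(":memory:")
analytics.execute("CREATE TABLE dashboard_metrics (name TEXT, value REAL)")
rows = [(f"metric_{i}", float(i)) for i in range(1200)]

loaded = load_in_batches(analytics, rows, batch_size=500, pause_seconds=0.0)
row_count = analytics.execute("SELECT COUNT(*) FROM dashboard_metrics").fetchone()[0]
print(loaded, row_count)
```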

### HOW OFTEN? - This is an important part of the process. Getting data to move every few minutes or even in real-time is a completely different animal than pulling data from someplace nightly (or 2-3 times during some other time of low-usage)...

- Depending on what you're extracting and how important it is to have in real-time, you may be able to make a quick and easy pull either once or a few times per day (this can be set up with a "cron job"), or in a situation where the data actually needs to be used in realtime, you may have a streaming engine that is really good at processing this data and pulling it (more on this later).

#### You may have to build event-driven systems where ETL processes will occur when certain conditions are met such as... 

- Every 1,000,000 rows added to a table, we will want to extract a copy of those million rows to a CSV and save that to some directory off-site...

- If there are over x number of events in the last hour (which, for the sake of this example we are considering to be anomalous) send a filtered and sorted list of the most active users to our security team.

- If a system down the line (for example an analytics server) fails at 12:05AM and comes back online at 12:15AM, we may have an ETL job set up for the system that feeds it (possibly a staging area for data brought in from an API) to do a full dump of the old data in a trickle alongside the new data that is coming in in real-time. That way you end up with complete data after some time and a recovered machine without a loss of data at the end of the day. Granted, if multiple systems fail at once then you may still be out of luck and lose that data.

- It is very important to keep in mind the same system limitations that we talked about in the last section. Just because you are pulling a relatively small amount of data doesn't necessarily mean that things will go cleanly if you are trying to pull the data 1,000 times per minute.
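The row-count condition from the first bullet above can be sketched as a small job that only acts once a full block of new rows exists. This is an illustration under assumptions: the `events` table, the file naming, and `export_full_blocks` are hypothetical, the block size is 10 instead of 1,000,000 so the example runs instantly, and a temporary directory stands in for the off-site location.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

def export_full_blocks(conn, out_dir, last_exported_id, block_size):
    """Export each complete block of `block_size` new rows to its own CSV file.

    Returns the id of the last exported row (so the caller can persist it)
    and the names of the files that were written.
    """
    written = []
    while True:
        rows = conn.execute(
            "SELECT id, payload FROM events WHERE id > ? ORDER BY id LIMIT ?",
            (last_exported_id, block_size),
        ).fetchall()
        if len(rows) < block_size:
            break  # condition not met yet: wait for more rows to arrive
        path = Path(out_dir) / f"events_{rows[0][0]}_{rows[-1][0]}.csv"
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["id", "payload"])
            writer.writerows(rows)
        written.append(path.name)
        last_exported_id = rows[-1][0]
    return last_exported_id, written

# Hypothetical table holding 25 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO events (payload) VALUES (?)", [(f"e{i}",) for i in range(25)]
)

out_dir = tempfile.mkdtemp()  # stand-in for the off-site directory
last_id, files = export_full_blocks(conn, out_dir, last_exported_id=0, block_size=10)
print(last_id, files)
```

The remaining 5 rows stay put until the next full block is available, which is the "when certain conditions are met" behavior described above.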

### HOW ACCURATELY? - During extraction, it is common to do some rudimentary error-checking, filtering, and data-validation. You'd do a lot more during the actual transformation part of the process, and then a final round as you load the data at the end...

- You may want to make sure that all of the data types are what you'd expect them to be and sanitize them in the case that they aren't. Depending on what you need done, you may want to drop malformed data or backfill it with some logic (for example a mean or zeroes).

- Check for duplicates and otherwise anomalous data. See if there are keys missing or if there is duplication in your keys (obviously either of these can raise questions, and you may want to create some sort of alert for when it is happening).

- You may have logic for filtering the data depending on where the final landing place is, so you will want to make sure that only the relevant data is getting routed in each direction.

- There will be some validation done at other steps in the process as well, whenever you are applying functions to change your data during the transformation, you'll want to keep in mind to validate the results of those functions before you send the data any further, and you should be subsequently monitoring that everything was done properly and checking the data for completeness during the load process. You will be able to check out
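The checks in this list can be sketched in a few lines of plain Python. The field names (`id`, `duration`), the sample records, and the choice to backfill missing values with the mean are all assumptions for illustration; your own rules will depend on the data and where it is headed.

```python
from statistics import mean

def validate(records):
    """Split records into clean rows and a list of (record, reason) rejects."""
    clean, rejects, seen_keys = [], [], set()
    for rec in records:
        key = rec.get("id")
        if key is None:
            rejects.append((rec, "missing key"))
            continue
        if key in seen_keys:
            rejects.append((rec, "duplicate key"))
            continue
        seen_keys.add(key)
        value = rec.get("duration")
        if value is not None:
            try:
                value = float(value)  # enforce the expected data type
            except (TypeError, ValueError):
                rejects.append((rec, "bad type"))
                continue
        clean.append({"id": key, "duration": value})

    # Backfill missing durations with the mean of the known ones (or zero).
    known = [r["duration"] for r in clean if r["duration"] is not None]
    fill = mean(known) if known else 0.0
    for r in clean:
        if r["duration"] is None:
            r["duration"] = fill
    return clean, rejects

# Hypothetical incoming records.
records = [
    {"id": 1, "duration": "30"},
    {"id": 2, "duration": None},
    {"id": 2, "duration": "45"},
    {"id": None, "duration": "10"},
    {"id": 3, "duration": "abc"},
    {"id": 4, "duration": 50},
]
clean, rejects = validate(records)
reasons = [reason for _, reason in rejects]
print(clean)
print(reasons)
```

Returning the rejects alongside a reason, rather than silently dropping them, gives you something to alert on when missing or duplicated keys start showing up.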

#### EX: If I have a stream of phonecall data constantly coming in every second as the calls are being made, I may be feeding 2-3 other systems from that data. While every call is being logged in my source database...

- I may only be passing data related to calls from a certain state to a datacenter physically located in that state in a stream.

- I could be sending data on customers who log a very large amount of call-time to the database used by the marketing team as a single batch job done overnight so that they can analyze those behaviors and send out emails offering a sale or a higher-tier plan with more minutes.

- Finally, I could be collecting information on calls that were dropped by my customers from the raw data stream and passing every log into a container running an ML algorithm to try to sniff out problems or inadequacies with my network due to high-usage or broken hardware. 
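The phonecall example above boils down to a routing decision per record. Here is a minimal sketch of that logic; the destination names, the choice of "TX" as the state, and the `heavy_users` set are hypothetical placeholders, and in practice each destination would be a stream, a nightly batch table, or a container rather than a string.

```python
def route_call(call, heavy_users):
    """Return the names of the destinations this call record should be sent to."""
    destinations = ["source_db"]  # every call is logged in the source database
    if call["state"] == "TX":
        destinations.append("tx_datacenter_stream")
    if call["caller"] in heavy_users:
        destinations.append("marketing_nightly_batch")
    if call["dropped"]:
        destinations.append("ml_dropped_calls")
    return destinations

# Hypothetical call records and heavy-usage list.
calls = [
    {"caller": "555-0100", "state": "TX", "dropped": False},
    {"caller": "555-0101", "state": "NY", "dropped": True},
    {"caller": "555-0102", "state": "TX", "dropped": True},
]
heavy_users = {"555-0102"}

routes = [route_call(call, heavy_users) for call in calls]
for call, destinations in zip(calls, routes):
    print(call["caller"], destinations)
```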

### WHERE TO? - Just as you had to take a lot of things into consideration when you were extracting the data, you're going to have to think about the limitations and intricacies of whatever tool the data is going to land in.

- Typically, the load part of the process is going to be sending data towards the data warehouse, but with more and more of these steps existing in enterprise data warehouses with each passing day, this really blurs the lines as to what the needs are.

- One thing that will be different with a solution like BigQuery or Redshift is that you may not necessarily be worried about the specific limitations of the hardware, but rather about the costs that may be associated with working with a specific data warehouse.

- Something to keep in mind is that for very large processes in EDWs, you may either have to wait in a queue to have your job run (this shouldn't matter for most small applications, as queries in the 50-100 GB range can still take mere seconds, but becomes increasingly important as you get into the Terabyte and Petabyte scale), or have to have reserved spots in Google or Amazon's queues (though this is expensive, and is likely an option that is being used by only the biggest of names).

- Something else that I wanted to mention with modern data warehouses is that they have increasingly made it easier to support querying against raw datasets, which can alter the way that you think about ETL. As it becomes cheaper to store everything, and as parallel processing and map-reduce make queries faster to run than ever before, you may not have to clean the data as much on its way into the system (as broken data can be a great lead for how to fix a problem), and you may be able to vacuum up databases less frequently.

### WHAT CHANGES TO MAKE?

- As I said before, the transform part of this pipeline is likely the most familiar to you at this point, but I just want to point out that you'll be learning how to transform data more accurately, efficiently, and with fewer errors for quite a while. There are many best-practices that you'll learn and many habits that you may have to break when making changes to source data, as well as norms that will change with the software that you're pulling data out of or pushing to.

- The changes that would be made are possible cleaning functions, joining against data from another source, localization of timestamps, etc. etc. Think about this as the feature-engineering portion of the standard data scientist workflow. This is, in my opinion, the most fluid portion of this whole process.
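As a small illustration of those three kinds of changes (cleaning, joining against another source, and localizing timestamps), here is a sketch using only the standard library. The `STATE_UTC_OFFSET` lookup and the record fields are hypothetical, and the fixed UTC offsets are a simplifying assumption that ignores daylight saving time; a real pipeline would more likely use named time zones.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical lookup from another source: state -> fixed UTC offset in hours.
# Fixed offsets ignore daylight saving time; this keeps the sketch simple.
STATE_UTC_OFFSET = {"NY": -5, "CA": -8}

def transform(record):
    """Clean the state code, join the offset lookup, and localize the timestamp."""
    state = record["state"].strip().upper()   # cleaning
    offset = STATE_UTC_OFFSET.get(state)      # join against the lookup
    if offset is None:
        return None                           # no match in the lookup: drop the row
    utc_time = datetime.fromisoformat(record["ts_utc"]).replace(tzinfo=timezone.utc)
    local_time = utc_time.astimezone(timezone(timedelta(hours=offset)))
    return {
        "state": state,
        "local_ts": local_time.isoformat(),
        "local_hour": local_time.hour,        # a simple engineered feature
    }

# Hypothetical input rows with messy state codes.
rows = [
    {"state": " ny ", "ts_utc": "2024-01-15T14:30:00"},
    {"state": "ca", "ts_utc": "2024-01-15T14:30:00"},
    {"state": "zz", "ts_utc": "2024-01-15T14:30:00"},
]
transformed = [t for t in map(transform, rows) if t is not None]
for row in transformed:
    print(row)
```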

#### Another really important thing to think about at each point in your pipeline is the level of detail/granularity of your data. You'll find yourself aggregating and disaggregating data and changing the resolution of it often throughout different parts of a standard ETL or data science workflow. Let's discuss some possible combinations from the data below. I got a ton of questions during this ETL project about how you can join data at different resolutions and wanted to make sure to dig into this with some open conversation. As long as you can confirm that you're aggregating things to the same level properly, this is an extremely valuable tool in your kit.

1. Phonecall data by the second

- Call start time (timestamp to the second)
- Call stop time (timestamp to the second)
- Caller A Phone number
- Caller B Phone number
- Caller A latitude, longitude
- Caller B latitude, longitude

2. A lookup table of all latitudes and longitudes as well as the city, zip code, and state that they correspond to.

- Latitude
- Longitude
- City
- Zip Code
- State

3. A rollup of all yearly tax data by zip code

- Zip Code
- Year
- % of residents in tax brackets 1-5

4. Weather Data to the second

- Latitude
- Longitude
- Timestamp (to the second)
- Weather Descriptors (temperature, weather type, precipitation %)
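One possible combination, sketched below, is joining the by-the-second phonecall data (dataset 1) to the yearly tax rollup (dataset 3). The two are at different resolutions, so the calls are first mapped to a zip code through the lookup (dataset 2) and then aggregated up to zip code and year before the join. All values here are made up for illustration, the field names are hypothetical, and matching on exact latitude/longitude pairs is a simplification; real coordinates would usually need rounding or a nearest-neighbor match.

```python
from collections import defaultdict
from datetime import datetime

# 1. Phonecall data by the second (made-up sample rows, caller A only).
calls = [
    {"start": "2023-03-01T10:00:00", "stop": "2023-03-01T10:05:00", "a_latlon": (40.71, -74.01)},
    {"start": "2023-07-04T18:30:00", "stop": "2023-07-04T18:31:30", "a_latlon": (40.71, -74.01)},
    {"start": "2023-09-09T09:00:00", "stop": "2023-09-09T09:10:00", "a_latlon": (34.05, -118.24)},
]

# 2. Lookup from (latitude, longitude) to zip code.
latlon_to_zip = {(40.71, -74.01): "10007", (34.05, -118.24): "90012"}

# 3. Yearly tax rollup keyed by (zip code, year); the percentages are invented.
tax_by_zip_year = {
    ("10007", 2023): {"pct_bracket_5": 12.0},
    ("90012", 2023): {"pct_bracket_5": 7.5},
}

# Aggregate the second-level calls up to the (zip code, year) level.
usage = defaultdict(lambda: {"calls": 0, "seconds": 0.0})
for call in calls:
    start = datetime.fromisoformat(call["start"])
    stop = datetime.fromisoformat(call["stop"])
    zip_code = latlon_to_zip[call["a_latlon"]]
    bucket = usage[(zip_code, start.year)]
    bucket["calls"] += 1
    bucket["seconds"] += (stop - start).total_seconds()

# Both sides are now at the same resolution, so the join is a key match.
joined = [
    {"zip": z, "year": y, **stats, **tax_by_zip_year[(z, y)]}
    for (z, y), stats in sorted(usage.items())
    if (z, y) in tax_by_zip_year
]
for row in joined:
    print(row)
```

The key step is the aggregation: once both datasets share the same key (zip code and year), the join itself is trivial. Joining before aggregating would instead repeat the yearly tax values on every single call row.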

### Some links
- What is an enterprise data warehouse? https://www.stitchdata.com/resources/enterprise-data-warehouse/
- ETL Process https://www.guru99.com/etl-extract-load-process.html
- ETL Process Overview https://www.stitchdata.com/etldatabase/etl-process/
- ETL or ELT? https://www.matillion.com/what-is-etl-the-ultimate-guide/