# Data Warehouse Intro

### Data warehouse:
A relational database (RDBMS) that stores data extracted, transformed, and loaded (ETL) from transactional (OLTP) systems in a centralized repository, so that business decision makers can query it for analysis (OLAP).

#### Data warehouses are used for the following:
- store __enterprise-wide__ current and historical transactional data from one or more sources
- query the data for analysis and decision making (OLAP)
- serve read-intensive workloads

#### Data warehouse characteristics:
- subject oriented: organized around business subjects and stores only the data relevant to them
- integrated: integrates data from different sources and formats while keeping the data format uniform, e.g. the same naming conventions
- time variant: holds both up-to-date and historical data, unlike transactional systems that keep only the most current data
- non-volatile: physically separated from the source systems and hosts transformed data; once stored, the data is not changed.

### Data warehouse building blocks and basic architecture:

#### general (most common) Architecture:
- data source(s) > staging area and ETL > data storage (raw transformed data / metadata / summary data) > data marts > end-user tools
- depending on the needs, a data warehouse can have both a staging area and data marts, a staging area only, or neither

![image](https://static.javatpoint.com/tutorial/datawarehouse/images/data-warehouse-architecture4.png)
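
The last hops of the flow above (data storage > data marts > end-user tools) can be sketched with an in-memory SQLite database standing in for the warehouse. This is only an illustrative sketch: the table names `warehouse_orders` and `mart_sales_by_region`, the columns, and the sample rows are all hypothetical.

```python
import sqlite3


def build_sales_mart():
    """Build a tiny warehouse table, derive a data mart, and query the mart."""
    conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
    conn.executescript(
        """
        -- Warehouse storage: detailed rows for several subjects (hypothetical data).
        CREATE TABLE warehouse_orders (
            order_id INTEGER PRIMARY KEY,
            department TEXT,
            region TEXT,
            amount REAL
        );
        INSERT INTO warehouse_orders VALUES
            (1, 'sales', 'EU', 120.0),
            (2, 'sales', 'EU', 80.0),
            (3, 'sales', 'US', 50.0),
            (4, 'support', 'EU', 10.0);

        -- Data mart: a summarized subset covering one subject only (sales).
        CREATE TABLE mart_sales_by_region AS
            SELECT region, COUNT(*) AS orders, SUM(amount) AS total_amount
            FROM warehouse_orders
            WHERE department = 'sales'
            GROUP BY region;
        """
    )
    # End-user tool: a report query that reads only from the mart.
    return conn.execute(
        "SELECT region, orders, total_amount FROM mart_sales_by_region ORDER BY region"
    ).fetchall()


if __name__ == "__main__":
    print(build_sales_mart())
```

With the sample rows above, the mart holds one summary row per region for the sales subject, and the support row never reaches it.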

#### Building blocks:
- data sources: includes production, internal, archived, external data
- staging area and ETL:
    - staging area: used to store raw data before processing it and moving it to the warehouse storage component
        - holding the data for longer periods of time in the staging area can help with recoverability, backup, and auditing
    - extraction: copy data from the different sources and load it into the staging area
    - transform: clean, format, standardize, and merge the data, and delete irrelevant data
    - load: batch or stream load the data into the warehouse
- data storage: data is stored in a de-normalized form
- information delivery: delivers the stored data to the end-user tools
- metadata component: keeps data about the data, such as structure, indexes, records, and addresses
- data mart: a subset of the data warehouse containing a summary of a specific subject

![image](https://static.javatpoint.com/tutorial/datawarehouse/images/data-warehouse-components.png)
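
The extract, transform, and load steps listed above can be sketched as three small functions. This is a minimal sketch under stated assumptions: `RAW_ORDERS`, the `orders_wh` table, and the cleaning rules are hypothetical, and an in-memory SQLite database stands in for the warehouse storage.

```python
import sqlite3

# Hypothetical raw rows as they might arrive from a source system.
RAW_ORDERS = [
    {"order_id": "1", "customer": " Alice ", "amount": "10.50", "country": "us"},
    {"order_id": "2", "customer": "BOB", "amount": "n/a", "country": "US"},
    {"order_id": "3", "customer": "carol", "amount": "7", "country": "de"},
]


def extract():
    """Extract: copy the raw rows from the source into a staging list."""
    return list(RAW_ORDERS)


def transform(staged_rows):
    """Transform: clean and standardize rows, dropping ones that cannot be parsed."""
    cleaned = []
    for row in staged_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # unusable record is dropped
        cleaned.append(
            (
                int(row["order_id"]),
                row["customer"].strip().title(),  # uniform name format
                amount,
                row["country"].upper(),  # uniform country code format
            )
        )
    return cleaned


def load(conn, rows):
    """Load: batch-insert the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders_wh "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, country TEXT)"
    )
    conn.executemany("INSERT INTO orders_wh VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_etl():
    """Run the three steps in order and return what ended up in the warehouse."""
    conn = sqlite3.connect(":memory:")
    load(conn, transform(extract()))
    return conn.execute("SELECT * FROM orders_wh ORDER BY order_id").fetchall()


if __name__ == "__main__":
    for row in run_etl():
        print(row)
```

Here the staging area is just a Python list. A real pipeline would more likely persist staged data, which is what makes the recoverability and auditing benefits mentioned above possible.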

#### Properties of Data Warehouse Architectures: __too much generalization i think!__
- separation: analytical and transactional processing should be kept as separate as possible to avoid performance loss.
- scalability: the system should be easy to upgrade to handle higher data volumes
- extensibility: the architecture should be flexible, allowing new operations and technologies to be added without redesigning the whole system
- security: the architecture should allow access to data to be monitored
- manageability: managing the warehouse shouldn't be complex

#### Operational database vs data warehouse:

|Database|Data warehouse|
|--------|--------------|
|focused on current data|focused on historical data|
|data is updated regularly|data is entered regularly but rarely changes once entered|
|optimized for simple transactions (fast inserts and updates on small volumes of data)|optimized for complex queries that read large volumes of data|
|data is normalized to save storage space, so joins are more complex and slower|data is de-normalized or only partially normalized for fast reads|
|uses ER data modelling and application-oriented database design|uses a star or snowflake model and subject-oriented database design|
|designed for OLTP|designed for OLAP|
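
To make the "star model" row more concrete, here is a small sketch of a star schema: one fact table surrounded by dimension tables, queried with an OLAP-style aggregate. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and rows are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3


def build_star_schema():
    """Create a hypothetical star schema: one fact table and two dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
        CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE fact_sales (
            product_id INTEGER REFERENCES dim_product(product_id),
            date_id INTEGER REFERENCES dim_date(date_id),
            quantity INTEGER,
            revenue REAL
        );
        INSERT INTO dim_product VALUES
            (1, 'Laptop', 'Hardware'),
            (2, 'Mouse', 'Hardware'),
            (3, 'IDE license', 'Software');
        INSERT INTO dim_date VALUES (1, 2023, 12), (2, 2024, 1);
        INSERT INTO fact_sales VALUES
            (1, 1, 2, 2000.0),
            (2, 1, 5, 100.0),
            (3, 2, 1, 300.0),
            (1, 2, 1, 1000.0);
        """
    )
    return conn


def revenue_by_category_and_year(conn):
    """OLAP-style read: aggregate the facts along two dimensions."""
    return conn.execute(
        """
        SELECT p.category, d.year, SUM(f.revenue)
        FROM fact_sales f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_date d ON d.date_id = f.date_id
        GROUP BY p.category, d.year
        ORDER BY p.category, d.year
        """
    ).fetchall()


if __name__ == "__main__":
    print(revenue_by_category_and_year(build_star_schema()))
```

Every join goes from the fact table directly to a dimension, which is what keeps this kind of analytical query simple compared with walking a fully normalized OLTP schema.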

### Data warehouse Architectures 

- __1 tier architecture:__
    - data sources __>__ middleware __>__ end-user tools
    - fails to separate OLAP and OLTP processes (OLAP processes affect OLTP performance).
    - ![image](https://static.javatpoint.com/tutorial/datawarehouse/images/data-warehouse-architecture7.png)
    
    <br/>
- __2 tier architecture:__
    - data sources __>__ staging area __>__ warehouse layer __>__ end-user tools
    - separates OLAP and OLTP processes
    - ![image](https://static.javatpoint.com/tutorial/datawarehouse/images/data-warehouse-architecture8.png)

    <br/>
- __3 tier architecture:__ 
    - data sources __>__ staging area (ETL) __>__ reconciliation layer __>__ loading (ETL) __>__ data warehouse layer __>__ end-user tools
    - the data reconciliation layer* not only performs data cleaning and processing, just like the staging area, but also compares the same data from different sources, ensuring data accuracy and consistency
    - the reconciliation layer also merges the data from multiple sources when needed
    - the reconciliation layer costs extra storage space due to redundancy.
    - ![image](https://static.javatpoint.com/tutorial/datawarehouse/images/data-warehouse-architecture9.png)
    
    <br/>
- __Reconciliation layer*__: the reconciliation layer is specifically designed to compare data from various source systems and identify any discrepancies or inconsistencies. Its primary goal is to ensure the accuracy and consistency of the data being loaded into the data warehouse. This layer often includes data quality checks, data matching, and data cleansing processes to resolve any discrepancies found.
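
The comparison step of a reconciliation layer can be sketched as a function that matches records from two sources by key and reports discrepancies. This is a simplified illustration, not a full data-quality pipeline: the `reconcile` function, the `customer_id` key, and the sample `crm` and `billing` records are all hypothetical.

```python
def reconcile(source_a, source_b, key="customer_id"):
    """Compare records from two sources by key and report discrepancies."""
    a_by_key = {rec[key]: rec for rec in source_a}
    b_by_key = {rec[key]: rec for rec in source_b}

    report = {"only_in_a": [], "only_in_b": [], "mismatched": [], "matched": []}
    for k in sorted(a_by_key.keys() | b_by_key.keys()):
        if k not in b_by_key:
            report["only_in_a"].append(k)
        elif k not in a_by_key:
            report["only_in_b"].append(k)
        else:
            rec_a, rec_b = a_by_key[k], b_by_key[k]
            # Fields whose values differ between the two sources.
            diff = sorted(
                f for f in rec_a.keys() | rec_b.keys() if rec_a.get(f) != rec_b.get(f)
            )
            if diff:
                report["mismatched"].append((k, diff))
            else:
                report["matched"].append(k)
    return report


if __name__ == "__main__":
    # Hypothetical extracts of the same customers from two source systems.
    crm = [
        {"customer_id": 1, "email": "a@example.com"},
        {"customer_id": 2, "email": "b@example.com"},
        {"customer_id": 3, "email": "c@example.com"},
    ]
    billing = [
        {"customer_id": 1, "email": "a@example.com"},
        {"customer_id": 2, "email": "bob@example.com"},
        {"customer_id": 4, "email": "d@example.com"},
    ]
    print(reconcile(crm, billing))
```

Only the records listed under `matched` agree in both sources. How the others are resolved (which source wins, or whether they are held back) is a policy decision this sketch leaves open.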


### Operational data Store

- a subject-oriented*, integrated, volatile, current-valued data store containing only detailed corporate data. By contrast, a data warehouse holds relatively recent as well as historical information and may also include aggregate data.

- an operational data store is used by operational-level users for near-real-time reporting on current, detailed data (no summary data is stored). It is fed by streaming data from OLTP data sources, or by frequent updates whenever a transaction occurs.

- an ODS extracts and refreshes data from OLTP data sources and performs data validation frequently.
- an ODS is read-only apart from the regular refreshes from the OLTP systems (customers should not be allowed to update ODS information)
- an ODS is detailed enough for operational management staff, though not necessarily as detailed as the OLTP systems (it doesn't have to have the same granularity)
- when building a new ODS, data and performance should be validated during the ETL process
- ![image](https://static.javatpoint.com/tutorial/datawarehouse/images/what-is-operational-data-stores.png)

subject-oriented*: organized around the significant information subjects of an enterprise. In a university, the subjects may be students, lecturers, and courses, while in a company they might be users, salespersons, and products.
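
The difference between a volatile, current-valued ODS and a non-volatile, time-variant warehouse can be sketched by applying the same stream of changes to both. This is an assumption-laden illustration: the `ods_account` and `wh_account_history` tables and the sample changes are hypothetical, and SQLite's `INSERT OR REPLACE` stands in for whatever refresh mechanism a real ODS uses.

```python
import sqlite3


def apply_changes(changes):
    """Apply (account_id, balance, as_of) changes to an ODS table and a warehouse table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE ods_account (account_id INTEGER PRIMARY KEY, balance REAL);
        CREATE TABLE wh_account_history (account_id INTEGER, balance REAL, as_of TEXT);
        """
    )
    for account_id, balance, as_of in changes:
        # ODS: volatile and current-valued, so the newest value overwrites the old one.
        conn.execute(
            "INSERT OR REPLACE INTO ods_account VALUES (?, ?)", (account_id, balance)
        )
        # Warehouse: non-volatile and time-variant, so every change is appended.
        conn.execute(
            "INSERT INTO wh_account_history VALUES (?, ?, ?)",
            (account_id, balance, as_of),
        )
    ods = conn.execute("SELECT * FROM ods_account ORDER BY account_id").fetchall()
    history = conn.execute(
        "SELECT * FROM wh_account_history ORDER BY account_id, as_of"
    ).fetchall()
    return ods, history


if __name__ == "__main__":
    sample = [(1, 100.0, "2024-01-01"), (2, 50.0, "2024-01-01"), (1, 80.0, "2024-01-02")]
    ods_rows, history_rows = apply_changes(sample)
    print("ODS:", ods_rows)
    print("Warehouse:", history_rows)
```

After the three sample changes, the ODS table holds one current row per account, while the warehouse table keeps all three rows, including the superseded balance.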



### ETL VS ELT ???
|ETL|ELT|
|---|---|
|data is transferred to the ETL server and back to the database (requires high network bandwidth)|data remains in the database (except for cross-database loads, e.g. source object)|
|transformations are performed in the ETL server|transformations are performed in the source or in the target server|
|requires high maintenance, as data selection needs to be performed|low maintenance, as data is always available|
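
The core difference in the table, where the transformation runs, can be sketched side by side. In this hedged example the ETL variant transforms in application code (playing the role of the ETL server) before loading, while the ELT variant loads raw data first and transforms with SQL inside the target database. `RAW`, `raw_sales`, and `sales` are hypothetical names, and in-memory SQLite stands in for the target.

```python
import sqlite3

# Hypothetical raw rows: (customer, amount as text).
RAW = [("alice", "10.5"), ("BOB", "3")]


def etl(raw):
    """ETL: transform outside the database, then load the clean rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
    # Transformation happens here, in the "ETL server" (this Python process).
    transformed = [(name.upper(), float(amount)) for name, amount in raw]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", transformed)
    return conn.execute("SELECT customer, amount FROM sales ORDER BY customer").fetchall()


def elt(raw):
    """ELT: load the raw rows first, then transform inside the target database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw)
    # Transformation happens here, as SQL executed by the target database.
    conn.execute(
        "CREATE TABLE sales AS "
        "SELECT UPPER(customer) AS customer, CAST(amount AS REAL) AS amount "
        "FROM raw_sales"
    )
    return conn.execute("SELECT customer, amount FROM sales ORDER BY customer").fetchall()


if __name__ == "__main__":
    print(etl(RAW))
    print(elt(RAW))
```

Both functions end with the same `sales` table. In the ELT variant the raw data also stays available in `raw_sales`, which matches the "data is always available" point in the table.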


Reference:
[Javatpoint: introduction to data warehousing](https://www.javatpoint.com/data-warehouse)