## Review - History of DBMS - Discussions

https://www.youtube.com/watch?v=LWS8LEQAUVc&t=1652s

Data storage was a complex problem that early developers struggled with. Without a standard, they initially designed systems that stored unrelated files and required complex procedures to access data. The resulting systems were slow, error-prone, and difficult to maintain, with poor data integrity. Memory and machines were expensive, whereas developers were inexpensive.

For this reason, developers aimed to make consistent systems that maximized utility with present memory constraints. This led to the creation of *intermediate layers* between applications & data, otherwise known as *database models*. These were logical structures that explain how data is represented in a digital system (https://mariadb.com/kb/en/exploring-early-database-models/).

These [notes (PDF)](https://15721.courses.cs.cmu.edu/spring2018/notes/01-intro.pdf) accompany the lecture.

* **Hierarchical Database Models** (Pre-Relational)
    * Advantages: 
        * One parent, many children; Similar to our OS file-system 
        * Functional standardization of data
    * Disadvantages: 
        * many-to-many relationships were complex
        * Adding new relationships results in the wholesale change of a database
        * Accessing hierarchical data requires you to know the entire "chain of command"
    * https://mariadb.com/kb/en/understanding-the-hierarchical-database-model/
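
The "chain of command" point can be sketched in a few lines of Python. This is only an illustration, not how any real hierarchical DBMS stored data: the `tree` contents and the `fetch` helper are made up for this example.

```python
# Hypothetical hierarchical store: every record has exactly one parent.
tree = {
    "company": {
        "engineering": {
            "alice": {"title": "Engineer"},
        },
        "sales": {
            "bob": {"title": "Account Manager"},
        },
    }
}


def fetch(root, path):
    """Walk from the root through every intermediate parent in `path`."""
    node = root
    for key in path:
        node = node[key]  # raises KeyError if any link in the chain is wrong
    return node


# The caller has to spell out the whole chain to reach one record.
alice = fetch(tree, ["company", "engineering", "alice"])
```

If `alice` moves to another department, every program that hard-coded the old path breaks, which is the maintenance problem described above.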

* **Network Database Models** (Pre-Relational)
    * Advantages: 
        * Solves the many-to-many complexities of hierarchical database models; Allows for many parents, many children
    * Disadvantages:  
        * Still complex to implement and maintain
        * Programmers still had to understand the entire data structure to have efficient queries
    * https://mariadb.com/kb/en/understanding-the-network-database-model/
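
A rough sketch of the "many parents, many children" idea, assuming a made-up student/course example (the names `enrolled_in`, `roster`, and `link` are not from the lecture):

```python
from collections import defaultdict

# Hypothetical records: students and courses linked in both directions.
enrolled_in = defaultdict(set)  # student -> courses
roster = defaultdict(set)       # course  -> students


def link(student, course):
    """Record one many-to-many link by updating both pointer sets."""
    enrolled_in[student].add(course)
    roster[course].add(student)


link("alice", "databases")
link("alice", "networks")
link("bob", "databases")

# Navigation is still manual: the program follows pointers record by record.
classmates = {s for c in enrolled_in["alice"] for s in roster[c]} - {"alice"}
```

The many-to-many link is easy to represent, but the query is still a hand-written traversal that depends on knowing how the records are wired together.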

In June 1970, Edgar F. Codd published an academic paper proposing a "[relational model of data for large shared data banks](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf)." This paper focused on developing a logical model that allowed for consistent syntax and user activity even when the internal state of the database changed. This led to:

* **Relational Database Models**
    * Advantages: 
        * Store database in simple data structures
        * Access data via high-level language
        * Developers do not have to worry about details of physical storage 
        * Models many kinds of relationships
    * Disadvantages:
        * Roughly 20 years later, object-oriented programming was taking off, and objects did not map cleanly onto flat relations.
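
To make the "high-level language" advantage concrete, here is a small sketch using Python's built-in `sqlite3` module. The tables and data are invented for illustration; the point is that the query states *what* we want and leaves storage details and access paths to the DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course  (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE enrollment (
        student_id INTEGER REFERENCES student(id),
        course_id  INTEGER REFERENCES course(id)
    );
    INSERT INTO student VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO course  VALUES (1, 'databases'), (2, 'networks');
    INSERT INTO enrollment VALUES (1, 1), (1, 2), (2, 1);
""")

# Declarative query: no pointers to follow, no path to hard-code.
rows = conn.execute("""
    SELECT s.name
    FROM student s
    JOIN enrollment e ON e.student_id = s.id
    JOIN course c     ON c.id = e.course_id
    WHERE c.title = 'databases'
    ORDER BY s.name
""").fetchall()
```

The many-to-many relationship is just a third table, and adding a new relationship later means adding another table rather than restructuring the existing ones.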

In tandem with this logical model, implementations of a structured query language (read only the first letter of those 3 words) also began to spring up. This query language was briefly named "SEQUEL" before a trademark dispute caused the developers to rename it "SQL."

As mentioned in the disadvantages of the relational database model, this period also saw major adoption of object-oriented programming languages such as C++. If we compare how conceptual objects & relations behave, some issues start to appear:

* Objects rely on encapsulation & private variables, relations reveal all information
* Objects provide interfaces, relations utilize views to provide varying perspectives of data

This is called the [object-relational impedance mismatch](https://en.wikipedia.org/wiki/Object%E2%80%93relational_impedance_mismatch).
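
A small sketch of the encapsulation side of this mismatch, using `sqlite3` and a made-up `Account` class (both the class and the table are assumptions for illustration):

```python
import sqlite3


class Account:
    """Object view: the balance is hidden behind methods that enforce a rule."""

    def __init__(self, owner, balance=0):
        self.owner = owner
        self._balance = balance  # private by convention

    def withdraw(self, amount):
        if amount > self._balance:
            raise ValueError("insufficient funds")
        self._balance -= amount

    def to_row(self):
        # Mapping step: the object must expose its state to be stored.
        return (self.owner, self._balance)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (owner TEXT, balance INTEGER)")

acct = Account("alice", 100)
acct.withdraw(30)
conn.execute("INSERT INTO account VALUES (?, ?)", acct.to_row())

# Relational view: any query can read or change the column directly,
# bypassing the withdraw() rule entirely.
conn.execute("UPDATE account SET balance = -500 WHERE owner = 'alice'")
stored = conn.execute("SELECT owner, balance FROM account").fetchone()
```

The object guards its invariant through its interface, while the relation exposes the same value to anyone with table access. Object-relational tooling exists largely to paper over this gap.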

To solve some of these issues, various object-relational solutions that we still use today came to the forefront of enterprise. Note: we are now discussing implementations of DBMS, as opposed to logical data models.

* **Postgres**
    * Advantages: 
        * Automatic support of objects, classes, and inheritance in database schemas and query language
        * Supports function overloading (shadowing)
        * Each table is defined as a data type
    * Disadvantages:
        * ...

During this time, enterprises realized that they didn't only want transactional databases, but also databases for analytics. For this reason, [data cubes](https://en.wikipedia.org/wiki/Data_cube) were made, which were databases that contained pre-computed aggregates optimized for analytical queries.
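
As a toy illustration of "pre-computed aggregates", the sketch below builds a two-dimensional cube in a dictionary. The sales data and the `ALL` marker are invented here; real data cubes are far more elaborate.

```python
from collections import defaultdict

# Hypothetical sales facts: (region, product, amount)
sales = [
    ("east", "widget", 10),
    ("east", "gadget", 5),
    ("west", "widget", 7),
]

ALL = "*"  # marker for "aggregated over this dimension"

cube = defaultdict(int)
for region, product, amount in sales:
    # Pre-compute every roll-up combination once, at load time.
    for r in (region, ALL):
        for p in (product, ALL):
            cube[(r, p)] += amount

# Analytical questions become dictionary lookups instead of scans.
east_total = cube[("east", ALL)]
widget_total = cube[(ALL, "widget")]
grand_total = cube[(ALL, ALL)]
```

The trade-off is that the aggregates have to be recomputed or maintained whenever the underlying facts change.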

As more organizations went online and began collecting larger quantities of data, they realized that they would prefer two types of databases: one for transactional processes (OLTP), and another for analytics (OLAP). Forks of `Postgres` came to the forefront to implement [column-oriented DBMS](https://en.wikipedia.org/wiki/Column-oriented_DBMS) which allowed for efficient analytical queries. As a result of this progression, data cubes were no longer in use as [columnar stores were much faster](https://dataschool.com/data-modeling-101/row-vs-column-oriented-databases/).

* **DATAllegro**
    * Advantages:
        * Columnar storage allows for faster analytical queries than row-storage
    * Disadvantages:
        * The name sounds like an allergy medicine

DATAllegro was eventually bought by Microsoft and served as the (inspiration?) foundation for [Microsoft's Parallel Data Warehouse](https://stackoverflow.com/questions/45426780/differences-between-azure-data-warehouse-and-microsoft-parallel-datawarehouse-p).

Column stores later became an integral part of various SQL solutions.
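
The row-versus-column trade-off can be sketched with plain Python containers. This is a conceptual model only, with a made-up table; real column stores add compression and vectorized execution on top of the layout.

```python
# The same hypothetical table in two layouts.
row_store = [
    {"id": 1, "region": "east", "amount": 10},
    {"id": 2, "region": "west", "amount": 7},
    {"id": 3, "region": "east", "amount": 5},
]

column_store = {
    "id": [1, 2, 3],
    "region": ["east", "west", "east"],
    "amount": [10, 7, 5],
}

# Row layout: the aggregate has to visit every full record.
total_from_rows = sum(row["amount"] for row in row_store)

# Column layout: the aggregate reads one contiguous list and nothing else.
total_from_columns = sum(column_store["amount"])

# Fetching one whole record is the opposite trade-off: trivial in the row
# layout, one lookup per column in the columnar one.
record_2 = {name: values[1] for name, values in column_store.items()}
```

This is why transactional (OLTP) systems tend to favor rows and analytical (OLAP) systems tend to favor columns.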

At the same time, Google developed a programming paradigm called [MapReduce](https://research.google/pubs/pub62/) which allowed for [efficient analytical functions](https://www.ibm.com/topics/mapreduce) on large datasets. Companies would then build open-source implementations of this paradigm.

* **Hadoop**
    * Advantages:
        * High-speed data processing
    * Disadvantages:
        * Once again, developers were writing code to interact with a database. Taking [a major step backward](https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html) in terms of managing complexity.

Early implementations of Hadoop required developers to develop a working knowledge of Java. These days, map-reduce functions can be written in a variety of languages.
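
For a feel of the paradigm, here is a single-process word count written in the map/shuffle/reduce shape. It is a sketch of the idea only: it does not use Hadoop, and the function names are our own.

```python
from collections import defaultdict
from functools import reduce

documents = ["sql is back", "nosql is not sql"]


def map_phase(doc):
    """Emit a (key, value) pair for every word."""
    return [(word, 1) for word in doc.split()]


def shuffle(pairs):
    """Group values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """Fold the grouped values into one result per key."""
    return reduce(lambda a, b: a + b, values)


pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = {key: reduce_phase(values) for key, values in shuffle(pairs).items()}
```

In a real cluster the map and reduce calls run on many machines in parallel. Compare this with the one-line `GROUP BY` it replaces, and the "step backward" criticism becomes easier to see.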

Furthermore, as companies went online, they also created ["NoSQL" database solutions](https://www.couchbase.com/resources/why-nosql/) that were "highly scalable", "highly available", and perhaps didn't abide by the strict norms of relational databases. That is, corrupted data could potentially exist on a database without the entire system failing. For example, if you are Amazon, you are probably keeping track of millions if not billions of items in a shopping cart. Some might be turned into orders, some just might sit there, whereas others will be deleted. 

* **Apache Cassandra**
    * Advantages
        * Allows for new attributes "on-demand"
        * Highly scalable, if one server goes down, the entire system isn't compromised
        * Schema-less (or schema last)
    * Disadvantages
        * Custom APIs instead of SQL. Stonebraker weeps.

Many NoSQL systems now support SQL or a SQL-like query language. Notice a pattern?
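
The "schema-less" and "new attributes on demand" points can be sketched with a plain dictionary standing in for a key-value store. This is not Cassandra's API; the cart records are made up, echoing the shopping-cart example above.

```python
# Hypothetical shopping-cart store keyed by cart id. No schema is declared:
# each record is just a dictionary.
carts = {}

carts["cart-1"] = {"user": "alice", "items": ["book"]}

# A new attribute appears on one record without any migration step.
carts["cart-2"] = {"user": "bob", "items": [], "coupon": "SPRING"}

# The cost moves to read time ("schema last"): readers must tolerate
# records that lack the field.
coupons = {cid: cart.get("coupon") for cid, cart in carts.items()}
```

Writes stay cheap and flexible, but every reader now carries the burden of handling whatever shapes of record exist.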

To introduce the same scalability as NoSQL systems, [NewSQL](https://en.wikipedia.org/wiki/NewSQL) systems were developed to support OLTP workloads without giving up ACIDity. These were also known as "distributed DBMS."

* **CockroachDB**
    * Advantages
        * Highly scalable and resilient like NoSQL systems
        * Supports OLTP-heavy operations for data storage
    * Disadvantages
        * Unfortunate name

Many of the traditional RDBMS models of today have solved issues of scalability.
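
To recall what "ACIDity" buys us, here is a single-node sketch of atomicity using `sqlite3`. It says nothing about how CockroachDB or any distributed DBMS achieves the same guarantee across machines; the `account` table and `transfer` helper are assumptions for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "owner TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back on an exception
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE owner = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE owner = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


# The debit violates the CHECK constraint, so the earlier credit is undone.
ok = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT owner, balance FROM account"))
```

Early NoSQL systems often relaxed guarantees like this one in exchange for scale; NewSQL systems aim to keep them while scaling out.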

These days, whenever companies want to integrate database solutions, they no longer have to provision physical machines, hardware, and employees. Instead, we have the *cloud*. This is when database-as-a-service solutions (DBaaS) start to emerge. 

* **Azure SQL Database**
    * Advantages
        * All the perks of a comprehensive database solution, for cheap!
    * Disadvantages
        * ...

As these cloud solutions proliferated, [shared disk architectures](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/) likewise began to be used. Whereas a "shared nothing" architecture gives each node autonomy over its own subset of the data, a "shared disk" architecture lets every node access the same underlying storage, without adding another layer of complexity for the end user. Systems built this way are often referred to as [data lakes](https://en.wikipedia.org/wiki/Data_lake).

* **Azure Data Lakes**
    * Advantages
        * Flat architecture
        * Great for storing BLOBs (binary large objects) & files
        * Prevent ["data silos"](https://www.dataeaze.io/from-data-silo-to-data-lake-the-path-to-business-insights/).
    * Disadvantages
        * Poorly managed data lakes become ["data swamps"](https://en.wikipedia.org/wiki/Data_lake#Criticism)

Within the past couple of years, systems that organize data in [graph logical models](https://en.wikipedia.org/wiki/Graph_theory) also were developed. This was similar in principle to the [CODASYL network data model](https://en.wikipedia.org/wiki/CODASYL), where data is represented via [nodes & edges](https://en.wikipedia.org/wiki/Graph_database).

* **Neo4j**
    * Advantages
        * Great for querying relationships & networks
        * Optimized for data retrieval 
        * Can model data with irregular or inconsistent structure
    * Disadvantages
        * RDBMSs are better for storage or quick look-up

[Research](https://www.cidrdb.org/cidr2023/papers/p66-wolde.pdf) has shown that RDBMSs with recent updates can outperform graph databases.
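
The kind of relationship query graph systems are built for can be sketched as a breadth-first traversal over an adjacency list. This is plain Python rather than Neo4j or its query language, and the `follows` graph is invented for illustration.

```python
from collections import deque

# Hypothetical social graph stored as an adjacency list.
follows = {
    "alice": ["bob"],
    "bob": ["carol"],
    "carol": ["dave"],
    "dave": [],
}


def within_hops(graph, start, max_hops):
    """Breadth-first search for nodes reachable in at most max_hops edges."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}


reachable = within_hops(follows, "alice", 2)
```

In a relational schema the same question is a self-join per hop (or a recursive query), which is where the graph-versus-RDBMS performance debate comes from.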

And to conclude our exploration of databases, we also have [time series databases](https://en.wikipedia.org/wiki/Time_series_database), which are systems optimized for querying & storing time-bound data. These databases utilize compression algorithms to manage data efficiently.

* **InfluxDB**
    * Advantages
        * Optimized for fast retrieval of time-bound data
    * Disadvantages
        * Complex to set up
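
One simple compression idea that suits time-bound data is delta encoding: store the gap between consecutive timestamps instead of the timestamps themselves. The sketch below illustrates that general technique only and is not a description of InfluxDB's actual storage format; the readings are made up.

```python
def delta_encode(timestamps):
    """Keep the first value, then store only the gap to the previous one."""
    if not timestamps:
        return []
    deltas = [timestamps[0]]
    for prev, curr in zip(timestamps, timestamps[1:]):
        deltas.append(curr - prev)
    return deltas


def delta_decode(deltas):
    """Rebuild the original values with a running sum."""
    values, total = [], 0
    for d in deltas:
        total += d
        values.append(total)
    return values


# Hypothetical sensor readings taken every 10 seconds.
readings = [1700000000, 1700000010, 1700000020, 1700000030]
encoded = delta_encode(readings)
```

Regularly spaced samples turn into long runs of the same small number, which are much cheaper to store than full timestamps.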

**Conclusion**

Almost all database solutions were eventually integrated into the collective of SQL. What was once considered a clear demarcation between NoSQL, NewSQL, and graph databases is now all mixed under the umbrella of a consistent structured query language (SQL).

To see how many databases there are truly out there, check out the big book of [database management systems](https://dbdb.io/browse).

### Discussion Questions

1) What would you say is a common pattern in the motivation of developing new database management systems?

2) Which database management solution mentioned in the video could have been helpful in the implementation of your capstone project?

3) How does Azure Cloud Computing fit into this history of DBMS?

4) Define the following terms:

**Relational**:

**Non-Relational**:

**Graph**:

**Transactional**:

**Analytical**:

**Database Models**:

## Review - DP-900 Exam

The DP-900: Microsoft Azure Data Fundamentals exam leads to a certification that shows fundamental "data literacy" when it comes to working with:

* Core Data Concepts
    * Data Representation: Structured, semi-structured, unstructured data
    * Data Storage: data file formats, types of databases
    * Data Workloads: Transactional & analytical workloads
    * Data Roles: DB Admins, data engineers, data analysts
* Relational Data 
    * Relational Concepts: Basic SQL, features of relational data, normalization, common database objects
    * Azure Data Services: Azure SQL DB, Azure SQL, SQL Server, Azure Virtual Machine, Azure services for open source db
* Non-Relational Data
    * Azure Data Storage: Blob, file, table
    * Azure Cosmos DB: Use-cases, DB API
* Analytics workload
    * Large Scale Analytics: Data ingestion & processing, analytical data stores, data warehousing
    * Real-Time Analytics: Batch vs streaming data, technologies for real-time analytics
    * Visualization: Power BI

All of these topics are covered in the context of the Azure Cloud Computing Environment. They could be considered an individual's first foray into data concepts, which they will then use to either secure a job that utilizes the Azure Cloud Computing environment or go on to achieve another Azure specialized certification.

This exam is 60 minutes long and has anywhere from 40-60 questions worth 1000 points in total. A passing grade is 700/1000.

### Review Questions

1) What is the passing score for this exam?

2) What are the skills measured by this exam?

3) What are the 3 data-intensive roles explored in this exam?

4) What level of SQL queries will this exam ask you to implement?

5) What is the Azure data solution mentioned in the context of storing non-relational data?

6) What is the data visualization tool of choice in the Azure Cloud Computing environment?

7) What are the 3 ways to represent data mentioned in the section "Core Data Concepts"?

## Review - Job Roles

Within the DP-900 exam, we explore 3 discrete roles:

* Database administrators
    * *Supervises* & *maintains* the database. Responsible for availability, user permissions (security), & operational aspects of on-premises & cloud solutions.
    * Manages databases, assigns permissions to users, stores backup copies of data, and restores data in the event of a failure.
* Data engineers
    * *Engineers* and *monitors* data workflows to ingest, clean, and transform data for analytical workloads. Responsible for ensuring data privacy.
    * Manages infrastructure and processes for data integration across the organization, applies data cleaning routines, identifies data governance rules, and implements pipelines to transfer and transform data between systems.
* Data analysts
    * *Analyzes* and *explores* data. Builds analytical models and visualizes data trends.
    * Explores and analyzes data to create visualizations and charts that enable organizations to make informed decisions.

Note, while this exam discusses specific responsibilities for specific roles, technologists (especially on smaller teams) are often asked to wear multiple hats.

### Review Questions

1) The database server goes down at 3 AM and the night crew isn't able to get it back up. Who's getting woken up?

2) A new developer is added to the team and needs permission to access the database. Who is responsible for granting these permissions?

3) Which role is most likely to use Azure Data Factory to define a data pipeline for an ETL process?

## Next Actions

Complete discussions & exercises from the previous week if not done.

View content in `exercises.ipynb`.