# Introduction to Databases: Start Working with Databases


In [15]:
%load_ext sql
%sql mysql+pymysql://root:sh01dan5@localhost/lahman2016

%sql select * from master where playerid='willite01'

The sql extension is already loaded. To reload it, use:
  %reload_ext sql
1 rows affected.


playerID,birthYear,birthMonth,birthDay,birthCountry,birthState,birthCity,deathYear,deathMonth,deathDay,deathCountry,deathState,deathCity,nameFirst,nameLast,nameGiven,weight,height,bats,throws,debut,finalGame,retroID,bbrefID
willite01,1918,8,30,USA,CA,San Diego,2002,7,5,USA,FL,Inverness,Ted,Williams,Theodore Samuel,205,75,L,R,1939-04-20,1960-09-28,willt103,willite01


## Motivation and Approach

### Motivation

- This is an _introduction_ to databases. I want to cover
    - Several types of data models, and the database systems that implement the models.
    - Best practices for modeling and implementing data-centric solutions.
    - Different types of application scenarios that rely on databases.
    - Interesting topics and challenges in the design and implementation of databases.


- There are several (too many) ways to classify applications and scenarios that use databases. Some examples
    - One classification approach
        - _Batch processing systems,_ where you submit a bunch of files, a "job" containing multiple subprograms, and later receive output in the form of a file.

        - <span style="color: red;"> _Real-time (online) systems,_ where you submit requests to do a small amount of work that has to be done before some very short deadline, or interactively while the user waits.</span>
        - _Data warehouse systems,_ where reporting programs and ad hoc queries access data that is integrated from multiple data sources
    - Another classification approach: A data processing system may involve some combination of
        - Conversion – converting data to another format.
        - Validation – Ensuring that supplied data is "clean, correct and useful."
        - Sorting – "arranging items in some sequence and/or in different sets."
        - Summarization – reducing detail data to its main points.
        - Aggregation – combining multiple pieces of data.
        - Analysis – the "collection, organization, analysis, interpretation and presentation of data."
        - Reporting – list detail or summary data or computed information.
    - A third classification approach
        - <span style="color: red;">Operational Database</span>
        - <span style="color: red;">External Database</span>
        - End User Database
        - Distributed Database
        - Data Warehouse Database
        - Analytical Database
        - Hypermedia Database


        
### Approach

- We discuss application scenarios and implement simple applications.
- In each scenario, we will study
    - One or two types of database models and engines.
    - Some of the design and implementation technology for database systems.
- We will start with a simple web application. <span style="color: red;">Using the terms above, this application is online, operational and external.</span>




## First Application Scenario

### Set Up

1. Someone has given you two comma separated value (CSV) files
    1. Master -- Information about everyone who has ever played Major League Baseball.
    1. Batting -- Information about every player's batting performance for every year.
<br><br>
1. Answer some interesting and potentially unanticipated questions based on the data. For example
    1. Find a player by last name and first name, and display information.
    1. Show me all the teams a player played for.
    1. Which player who ever played for the Boston Red Sox has the highest career batting average?
    1. etc.
<br><br>
1. Since you have written all this interesting code to _query_ the data, why not make it available on the web?
<br><br>
1. Oh. And people are still playing MLB.
    1. We need to update the data.
    1. Let's allow authorized people to update the files.
    
### Application Design Methodology

There are two basic approaches to building many types of applications:
1. "Data Model Out"
1. "User Experience In"


Since this is a database class, we will typically start with and focus on "Data Model Out."
<br><br>
<img src="../images/appdesignmodels.jpeg">

<br><br>
System architecture
- There are many, many, many application topologies and topologies for supporting infrastructure (server, storage, networking).
- We will start with a simple 3-Tier Architecture.
- We will start with the data in files, and then use a database system to understand the benefits.
- The diagram below shows two "servers."
    - There are HW servers, e.g. computers with disks attached to the network.
    - There are SW servers, i.e. long-running software programs executing in OS processes and listening on network connections.
    - In our scenarios, there will typically be one HW server (your laptop) and multiple SW servers. In the real world, there are dozens, hundreds or thousands of HW servers and complex mappings of SW servers to HW.
    
<br><br>   
<img src="../images/apptop1.jpeg">


### Data

#### Master

1. 19,106 rows in the CSV file.
1. File size is 3.2MB
1. The columns are
    - playerID:       A unique code assigned to each player.  The playerID links the data in this file with records in the other files.
    - birthYear:      Year player was born
    - birthMonth:     Month player was born
    - birthDay:       Day player was born
    - birthCountry:   Country where player was born
    - birthState:     State where player was born
    - birthCity:      City where player was born
    - deathYear:      Year player died
    - deathMonth:     Month player died
    - deathDay:       Day player died
    - deathCountry:   Country where player died
    - deathState:     State where player died
    - deathCity:      City where player died
    - nameFirst:      Player's first name
    - nameLast:       Player's last name
    - nameGiven:      Player's given name (typically first and middle)
    - weight:         Player's weight in pounds
    - height:         Player's height in inches
    - bats:           Player's batting hand (left, right, or both)         
    - throws:         Player's throwing hand (left or right)
    - debut:          Date that player made first major league appearance
    - finalGame:      Date that player made last major league appearance (blank if still active)
    - retroID:        ID used by retrosheet
    - bbrefID:        ID used by Baseball Reference website
<br><br>


<img src="../images/bbmaster.jpeg" width="90%">

#### Batting

1. 102,817 rows in the CSV file
1. File size is 8.2MB
1. Columns are
    - playerID:       Player ID code
    - yearID:        Year
    - stint:          player's stint (order of appearances within a season)
    - teamID:         Team
    - lgID:           League
    - G:              Games
    - AB:             At Bats
    - R:              Runs
    - H:              Hits
    - 2B:             Doubles
    - 3B:             Triples
    - HR:             Homeruns
    - RBI:            Runs Batted In
    - SB:             Stolen Bases
    - CS:             Caught Stealing
    - BB:             Base on Balls
    - SO:             Strikeouts
    - IBB:            Intentional walks
    - HBP:            Hit by pitch
    - SH:             Sacrifice hits
    - SF:             Sacrifice flies
    - GIDP:           Grounded into double plays
    

<img src="../images/bbmaster.jpeg" width="90%">
<br><br>


## Data Modeling and Data(base) Design -- I

### Overview

"_Data modeling_ in software engineering is the process of creating a data model for an information system by applying certain formal techniques."(https://en.wikipedia.org/wiki/Data_modeling)

Reference: Ramakrishnan and Gehrke, section 2.1, 2.2, 2.3

There are six steps in data modeling and database design
1. Requirements Analysis
2. Conceptual Design
3. Logical Design
4. Schema Refinement
5. Physical Design
6. Application, Security and Infrastructure Design

We will start with (2), (3) and (4).

### Entity-Relationship Modeling

"An entity–relationship model (ER model) describes inter-related things of interest in a specific domain of knowledge. An ER model is composed of entity types (which classify the things of interest) and specifies relationships that can exist between instances of those entity types." (https://en.wikipedia.org/wiki/Entity%E2%80%93relationship_model)

"The ER model defines the conceptual view of a database. It works around real-world entities and the associations among them. At view level, the ER model is considered a good option for designing databases." (https://www.tutorialspoint.com/dbms/er_model_basic_concepts.htm)

<img src="../images/simpledb.jpg" width="50%">

- Entity – Data about a “Thing,” e.g.
    - Person
    - Web click
    - Product
- Attributes (Fields, Properties) – The data describing, defining an entity. Often named and typed, e.g.
    - (Height, Integer)
    - (Last name, String)
- Set/Collection/Table
    - A group of things.
    - Usually the same “kind of entity.”
- Relationships/Associations – Links between entities, which convey semantic information, e.g.
    - Don IsA Professor
    - Don Teaches {COMS4111, COMSE6998}
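
The vocabulary above can be made concrete with a minimal Python sketch. The class names `Person` and `Course`, the `teaches` set and the height value are all hypothetical, chosen only to mirror the examples in the list; they are not part of the course data.

```python
from dataclasses import dataclass

# Entity types: each dataclass classifies a kind of "thing".
# Attributes are the named, typed fields.
@dataclass(frozen=True)
class Person:
    first_name: str   # (First name, String)
    height: int       # (Height, Integer); inches assumed
    role: str         # models "Don IsA Professor" as a simple attribute

@dataclass(frozen=True)
class Course:
    number: str

# Entity instances (the height is a made-up value).
don = Person(first_name="Don", height=72, role="Professor")
coms4111 = Course("COMS4111")
comse6998 = Course("COMSE6998")

# Sets/collections: groups of the same kind of entity.
people = {don}
courses = {coms4111, comse6998}

# Relationship "Teaches": a set of (Person, Course) links.
teaches = {(don, coms4111), (don, comse6998)}

def courses_taught_by(person):
    """Follow the Teaches relationship from a person to course numbers."""
    return sorted(c.number for (p, c) in teaches if p == person)

print(courses_taught_by(don))
```

Modeling IsA as a plain attribute is only one possible choice; a separate `Professor` entity type would be another.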
    
### Conceptual, Logical, Physical Model
<br><br>
<img src="../images/conceptuallogicalphysical.jpeg" width="90%">

From the Master and Batting files (first row), we have part of a _logical data model_:
- Entity names
- Attributes

We will flesh out other aspects as we proceed.


## Scenario I -- Basic Player Information

### Logical Model

<img src="../images/masterlogical.jpeg" width="90%">



### Tell Me about Players based on Last Name

#### Implementation

In [5]:
import csv

def query_by_last_name(lname):
    r = []
    with open('../Data/People.csv', 'r') as csvfile:
        player_reader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
        for row in player_reader:
            if (row['nameLast'] == lname):
                r.append(row)
    return r

lname = input("Please enter a last name: ")
print("You are looking for players with last name = ", lname)

answer = query_by_last_name(lname)
print("The following players have last name = ", lname, ":\n", answer)



Please enter a last name: Williams
You are looking for players with last name =  Williams
The following players have last name =  Williams :
 [OrderedDict([('playerID', 'williac01'), ('birthYear', '1917'), ('birthMonth', '3'), ('birthDay', '18'), ('birthCountry', 'USA'), ('birthState', 'NJ'), ('birthCity', 'Montclair'), ('deathYear', '1999'), ('deathMonth', '9'), ('deathDay', '16'), ('deathCountry', 'USA'), ('deathState', 'FL'), ('deathCity', 'Fort Myers'), ('nameFirst', 'Ace'), ('nameLast', 'Williams'), ('nameGiven', 'Robert Fulton'), ('weight', '174'), ('height', '74'), ('bats', 'R'), ('throws', 'L'), ('debut', '1940-07-15'), ('finalGame', '1946-04-22'), ('retroID', 'willa103'), ('bbrefID', 'williac01')]), OrderedDict([('playerID', 'willial02'), ('birthYear', '1914'), ('birthMonth', '5'), ('birthDay', '11'), ('birthCountry', 'USA'), ('birthState', 'AL'), ('birthCity', 'Valhermosa Springs'), ('deathYear', '1969'), ('deathMonth', '7'), ('deathDay', '19'), ('deathCountry', 'USA'), ('dea

#### Comments

- Pretty simple and it works.
- There are limitations
    - Specific to People and look up by last name.
        - A complex system will have dozens of files, each with many columns.
        - The set of files and the "schema" of individual files will evolve.
    - Maybe the user wants
        - More sophisticated queries, e.g. "Left-handed batters with last name Williams."
        - A subset of the column values.
- A web application models these requirements with a URL of the form "../resourcetype?f1=aaa&f7=bbb&f4=ccc&fields=f1,f2,f7"
    - resourcetype is the "file"
    - There are query parameters specifying equality matches on fields.
    - There is a special query parameter that enumerates the fields the user requests.
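
As a rough sketch of how such a URL could be interpreted, the code below splits it into the resource type, the equality template and the requested fields, then applies them to a list of rows. The function names `parse_query_url` and `run_query` are hypothetical, and the two sample rows are trimmed-down versions of rows shown earlier in this notebook.

```python
from urllib.parse import urlsplit, parse_qsl

def parse_query_url(url):
    """Split a URL into (resource type, equality template, requested fields)."""
    parts = urlsplit(url)
    resource = parts.path.rstrip("/").split("/")[-1]
    template = dict(parse_qsl(parts.query))
    # "fields" is the special parameter; everything else is an equality match.
    raw_fields = template.pop("fields", None)
    fields = raw_fields.split(",") if raw_fields else None
    return resource, template, fields

def run_query(rows, template, fields):
    """Keep rows matching every template pair, then keep only the requested fields."""
    matched = [r for r in rows
               if all(r.get(k) == v for k, v in template.items())]
    if fields is None:
        return matched
    return [{f: r[f] for f in fields} for r in matched]

# Two sample rows, reduced to a few columns.
people = [
    {"playerID": "willite01", "nameFirst": "Ted", "nameLast": "Williams", "bats": "L", "throws": "R"},
    {"playerID": "williac01", "nameFirst": "Ace", "nameLast": "Williams", "bats": "R", "throws": "L"},
]

resource, template, fields = parse_query_url(
    "../People?nameLast=Williams&bats=L&fields=playerID,nameFirst")
result = run_query(people, template, fields)
print(resource, template, fields)
print(result)
```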
    

### More General Query Solution

#### Implementation


In [4]:
import csv
import pandas as pd
import json

# This function takes two dictionary inputs:
# 1. row is a row from the CSV file.
# 2. template contains a set of (field name, value) pairs.
# The function returns True if, for every (field name, value) pair
# in the template, the row has the same value for that field.
#
def matches_template(row, template):
    for field, value in template.items():
        if row[field] != value:
            return False
    return True

# The function has the following parameters
# 1. The "name" of a collection of data
# 2. A template that is a dictionary of (name, value) pairs.
# The function returns a list containing all rows from the file
# that match the template.
def query_collection(c,t):
    r = []
    f = '../Data/' + c + '.csv'
    with open(f, 'r') as csvfile:
        player_reader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
        for row in player_reader:
            if (matches_template(row,t)):
                r.append(row)
    return r
        
# { "collection": "Master", "nameLast" : "Williams", "throws" : "R", "bats" : "L"}
# { "collection" : "Batting", "playerID" : "addybo01"}

t = input("Please enter your query template: ")
t = json.loads(t)
print("Template = ",t)
print("\n\nThe players matching template t = ",t," are:\n")
collection = t["collection"]
del t["collection"]
result = query_collection(collection,t)
print("Result = ", result)
print("\n\n")


Please enter your query template: { "collection": "People", "nameLast" : "Williams", "throws" : "R", "bats" : "L"}
Template =  {'collection': 'People', 'nameLast': 'Williams', 'throws': 'R', 'bats': 'L'}


The players matching template t =  {'collection': 'People', 'nameLast': 'Williams', 'throws': 'R', 'bats': 'L'}  are:

Result =  [OrderedDict([('playerID', 'williar01'), ('birthYear', '1877'), ('birthMonth', '8'), ('birthDay', '24'), ('birthCountry', 'USA'), ('birthState', 'MA'), ('birthCity', 'Somerville'), ('deathYear', '1941'), ('deathMonth', '5'), ('deathDay', '16'), ('deathCountry', 'USA'), ('deathState', 'VA'), ('deathCity', 'Arlington'), ('nameFirst', 'Art'), ('nameLast', 'Williams'), ('nameGiven', 'Arthur Frank'), ('weight', ''), ('height', ''), ('bats', 'L'), ('throws', 'R'), ('debut', '1902-05-07'), ('finalGame', '1902-09-01'), ('retroID', 'willa104'), ('bbrefID', 'williar01')]), OrderedDict([('playerID', 'willibi01'), ('birthYear', '1938'), ('birthMonth', '6'), ('birthDay'

#### Comments

- Reasonably cool and a good start.


- Pretty clear how to add the additional function to choose the subset of the fields.


- The query language is primitive and limiting
    - The template only checks "==".
    - AND is the only operator to combine terms.


- Even if I were going to continue down this path and write code, adding query expressiveness
    - I am doing it the hard way.
    - There are frameworks that implement some of what we need, e.g. 
        - [Pandas](https://pandas.pydata.org/pandas-docs/stable/index.html) for Python: "pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “__relational__” or “labeled” data both easy and intuitive." 
        - .NET Language Integrated Query [LINQ](https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/concepts/linq/): "You can write LINQ queries in C# for SQL Server databases, XML documents, ADO.NET Datasets, and any collection of objects that supports IEnumerable or the generic IEnumerable<T> interface." 
    - Even if I use the frameworks, I will start to experience additional problems in performance, reliability, etc.
        - I have to load the entire file(s)
            - The files are less than 10MB.
            - But there are application scenarios that will have GB files.
        - Some queries need to combine data from multiple files, which means I have to "join" data, e.g.
            - Getting a player's personal and batting information requires combining rows from Master and Batting.
            - There are other files, e.g. Appearances, Pitching, Teams, ...
        - Some scenarios will be common
            - Find by last name and/or first name.
            - I should speed these queries up by building helper indexes, search trees, etc.
        - _Insert_, _Update_ and _Delete_ open a new can of worms
            - Validating input, e.g. "bats" must be "L", "R" or "B".
            - PlayerID must be unique.
            - A row in Batting can have playerID="ferdo01" only if there is a corresponding row in Master.
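
To get a feel for how much code the last few bullets imply, here is a minimal in-memory sketch of a helper index on last name plus the three validation rules. The class `PlayerFile` and its method names are hypothetical, and the rejected rows use made-up values.

```python
from collections import defaultdict

class PlayerFile:
    """A toy in-memory stand-in for the Master and Batting files."""

    def __init__(self):
        self.master = {}                       # playerID -> row (keeps playerID unique)
        self.by_last_name = defaultdict(list)  # helper index: nameLast -> playerIDs
        self.batting = []

    def insert_master(self, row):
        if row["bats"] not in ("L", "R", "B"):
            raise ValueError("bats must be L, R or B")
        if row["playerID"] in self.master:
            raise ValueError("duplicate playerID " + row["playerID"])
        self.master[row["playerID"]] = row
        self.by_last_name[row["nameLast"]].append(row["playerID"])

    def insert_batting(self, row):
        # A batting row must refer to an existing player.
        if row["playerID"] not in self.master:
            raise ValueError("unknown playerID " + row["playerID"])
        self.batting.append(row)

    def find_by_last_name(self, lname):
        # Index lookup instead of scanning every row.
        return [self.master[pid] for pid in self.by_last_name.get(lname, [])]

db = PlayerFile()
db.insert_master({"playerID": "willite01", "nameLast": "Williams", "bats": "L"})
db.insert_batting({"playerID": "willite01", "yearID": "1939", "teamID": "BOS"})

rejected = []
for action, row in [
    (db.insert_master, {"playerID": "willite01", "nameLast": "Williams", "bats": "L"}),  # duplicate ID
    (db.insert_master, {"playerID": "x01", "nameLast": "Nobody", "bats": "Q"}),          # invalid bats
    (db.insert_batting, {"playerID": "ferdo01", "yearID": "2016", "teamID": "BOS"}),     # no such player
]:
    try:
        action(row)
    except ValueError as e:
        rejected.append(str(e))

print(rejected)
```

This still ignores update, delete, persistence and concurrent users, and every new rule means more hand-written code.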
    
__This is going to get out of hand!__

## The Relational Model -- I

### Overview

References: Ramakrishnan and Gehrke, section 3.1

There are two perspectives on the relational model:
- Formal language and algebra.
- A standard implementation language, Structured Query Language [(SQL)](https://en.wikipedia.org/wiki/SQL)

__Note__: I am going to switch to a smaller table, for the time being.

_AllStarFull_:
- playerID:       Player ID code
- yearID:         Year
- gameNum:        Game number (zero if only one All-Star game played that season)
- gameID:         Retrosheet ID for the game
- teamID:         Team
- lgID:           League
- GP:             1 if Played in the game
- startingPos:    If player was game starter, the position played

### Relational Model

- _Relational Schema_ defines and provides metadata describing what is/can be in "the file" (called a _relation_), e.g.<br>


```
AllStarFull(
    playerID: string,
    yearID: integer,
    gameNum: integer,
    gameID: string,
    teamID: string,
    lgID: string,
    GP: integer,
    startingPos: integer
    )
```

- A _Relational Instance_ is a "table" of data that conforms to a relation schema, e.g.<br><br>

<img src="../images/allstarrelation.jpeg">

### SQL

- Structured Query Language (SQL) is a well-defined programming language that realizes a superset of the relational model.


- There are two sub-languages
    - Data Definition Language (DDL) corresponds to relational schema
    - Data Manipulation Language (DML) corresponds to relational algebra (covered later)
    
    
- Defining a table in SQL: The default you get if you just import the CSV file. This is a subset of a _physical model_
```
CREATE TABLE `AllstarFull` (
  `playerID` varchar(255) DEFAULT NULL,
  `yearID` int(11) DEFAULT NULL,
  `gameNum` int(11) DEFAULT NULL,
  `gameID` varchar(255) DEFAULT NULL,
  `teamID` varchar(255) DEFAULT NULL,
  `lgID` varchar(255) DEFAULT NULL,
  `GP` int(11) DEFAULT NULL,
  `startingPos` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```

- SQL has a superset of the relational model's schema definition capability. An improved table definition could be

```
CREATE TABLE `AllStarFullBetter` (
  `playerID` varchar(16) NOT NULL,
  `yearID` int(6) NOT NULL,
  `gameNum` enum('0','1','2') NOT NULL,
  `gameID` char(12) NOT NULL,
  `teamID` char(3) NOT NULL,
  `lgID` enum('NL','AL') NOT NULL,
  `GP` enum('0','1') NOT NULL,
  `startingPos` enum('1','2','3','4','5','6','7','8','9') DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```

- This definition adds two benefits
    1. More precise definition of attribute sizes
        - Allows the database engine to more precisely allocate disk/storage space and place records on disk.
        - This is not important for a relation with a few thousand rows requiring a couple of MB.
        - But, some scenarios have millions or billions of rows requiring GBs of storage.
    1. Integrity Constraints: The database engine will not allow creates or updates that produce obviously invalid data, e.g.
        - There are 9 field positions in baseball and each has a code.
        - There are two leagues: American League (AL) and National League (NL).
        - Some attributes can be NULL but others cannot.
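
The effect of integrity constraints can be tried without a MySQL server. The sketch below is an approximation, not the MySQL definition above: it uses Python's built-in `sqlite3` module, and because SQLite has no `enum` type, `CHECK` constraints stand in for the enums. The helper `try_insert` and the first two rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE AllStarFullBetter (
        playerID    TEXT NOT NULL,
        yearID      INTEGER NOT NULL,
        gameNum     INTEGER NOT NULL CHECK (gameNum IN (0, 1, 2)),
        gameID      TEXT NOT NULL,
        teamID      TEXT NOT NULL,
        lgID        TEXT NOT NULL CHECK (lgID IN ('NL', 'AL')),
        GP          INTEGER NOT NULL CHECK (GP IN (0, 1)),
        startingPos INTEGER CHECK (startingPos BETWEEN 1 AND 9)
    )
""")

def try_insert(row):
    """Return None on success, or the engine's error message on failure."""
    try:
        conn.execute(
            "INSERT INTO AllStarFullBetter VALUES (?, ?, ?, ?, ?, ?, ?, ?)", row)
        return None
    except sqlite3.IntegrityError as e:
        return str(e)

# A plausible row, a row with a NULL startingPos, and an obviously invalid row.
ok = try_insert(("player01", 1960, 1, "GAME196001", "BOS", "AL", 1, 7))
null_pos = try_insert(("player02", 1960, 1, "GAME196001", "BOS", "AL", 0, None))
loony = try_insert(("ferdo05", -17, 1000000, "Canary", "BOS", "EPL", -2, "Quarter Back"))

print(ok, null_pos, loony)
```

The exact error text differs between engines and versions; the point is only that the engine, not the application, rejects the bad row.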

### Execution


In [6]:
import pymysql.cursors
import pandas as pd

# The database server is running somewhere in the network.
# I must specify the host (HW server) and port number
# (connection that the SW server is listening on).
# Also, I do not want to allow just anyone to access the database,
# and different people have different permissions. So, the
# client must log on.

# Connect. The port defaults to 3306.
cnx = pymysql.connect(host='localhost',
                      user='dbuser',
                      password='dbuser',
                      db='lahman2016',
                      charset='utf8mb4',
                      cursorclass=pymysql.cursors.DictCursor)


# I manually created the tables (not shown).
# Let's see what the database thinks a table is like.
def describe_table(t):
    q = "show columns from " + t + ";"
    df_mysql = pd.read_sql(q, cnx)
    return df_mysql

# Describe the AllStarFullBetter table.
print("\n\n")
print("AllStarFullBetter table is \n", describe_table("AllStarFullBetter"))




AllStarFullBetter table is 
          Field                                       Type Null  Key Default  \
0     playerID                                varchar(16)   NO  PRI    None   
1       yearID                                     int(6)   NO  PRI    None   
2      gameNum                          enum('0','1','2')   NO  PRI    None   
3       gameID                                   char(12)   NO         None   
4       teamID                                    char(3)   NO         None   
5         lgID                            enum('NL','AL')   NO         None   
6           GP                              enum('0','1')   NO         None   
7  startingPos  enum('1','2','3','4','5','6','7','8','9')  YES         None   

  Extra  
0        
1        
2        
3        
4        
5        
6        
7        


### What Happened (Reminder)?

- [Three-Tier Architecture](https://en.wikipedia.org/wiki/Multitier_architecture#Three-tier_architecture) is a common system architecture for combining
    - Application logic
    - Data
    - User interface logic
<br><br>
<img src="../images/tier3all.png" width="100%">
<br><br>
- In this simple example
    - The user interface is a web browser (Chrome)
    - The application server is the Jupyter Notebook application running locally
        - Serving web content on localhost:8889.
        - Running the Python code and sending the SQL statements to MySQL and receiving responses.
    - The database server is the mysqld executable running locally and listening for commands on localhost:3306.
<br><br>
<img src="../images/local3tier.jpeg" width="100%">

### Data Manipulation -- Let's Test the Table Definition

#### Code

In [9]:
import pymysql.cursors
import pandas as pd

# The connection cnx and the describe_table() function are defined
# in the earlier cell (In [6]) and are reused here.

def create_all_star(table, playerID, yearID, gameNum, gameID, teamID, lgID, GP, startingPos):
    q = "INSERT INTO " + table + " VALUES(" \
        + "'" + playerID + "'," + str(yearID) + "," + str(gameNum) + ",'" + gameID + "','" \
        + teamID + "','" + lgID + "'," + str(GP) + ",'" + str(startingPos) + "');"
    print("INSERT statement = ", q)
    cursor=cnx.cursor()
    cursor.execute(q)
    cnx.commit()
    result=cursor.rowcount
    print("INSERT create ", result, " rows")
    
    

"""    
print("\n\n")
print("AllStarFull table is \n", describe_table("AllStarFull"))
print("\n\n")
print("AllStarFullBetter table is \n", describe_table("AllStarFullBetter"))
"""
print("\nInserting into AllStarFull")
create_all_star("AllStarFull","ferdo05", -17, 1000000, "Canary", "BOS", "EPL", -2, "Quarter Back")
#print("Seemed OK")
#print("\nInserting into AllStarFull")
#create_all_star("AllStarFullBetter","ferdo05", -17, 1000000, "Canary", "BOS", "EPL", -2, "Quarter Back")



Inserting into AllStarFull
INSERT statement =  INSERT INTO AllStarFull VALUES('ferdo05',-17,1000000,'Canary','BOS','EPL',-2,'Quarter Back');
INSERT create  1  rows


- Second test: A table with constraints.

In [14]:
print("\nInserting into AllStarFull")
try:
    create_all_star("AllStarFullBetter","ferdo05", -17, 1000000, "Canary", "BOS", "EPL", -2, "Quarter Back")
except Exception as e:
    print("e = ", e);


Inserting into AllStarFull
INSERT statement =  INSERT INTO AllStarFullBetter VALUES('ferdo05',-17,1000000,'Canary','BOS','EPL',-2,'Quarter Back');
e =  (1265, "Data truncated for column 'gameNum' at row 1")


In [12]:


%sql INSERT INTO AllStarFullBetter VALUES('ferdo05',-17,1000000,'Canary','BOS','EPL',-2,'Quarter Back');

1 rows affected.


DataError: (pymysql.err.DataError) (1265, "Data truncated for column 'gameNum' at row 1") [SQL: "INSERT INTO AllStarFullBetter VALUES('ferdo05',-17,1000000,'Canary','BOS','EPL',-2,'Quarter Back');"]

#### Comments

- I could insert loony data into AllStarFull, e.g.
    - ferdo05 played in the All Star Game in the year 17 BCE.
    - This is a baseball database, but ferdo05 played in the All Star Game in the English Premier League.
    - "Quarter Back" is not a valid position in baseball or Association Football.
    
    
- The database engine enforced the constraints and prevented the loony data for AllStarFullBetter
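
One more comment on the test code: `create_all_star` builds the INSERT statement by string concatenation. That is fine for a demo with values I typed myself, but it breaks on data containing quotes and is open to SQL injection. A hedged sketch of an alternative is to build a statement with `%s` placeholders and pass the values separately to `cursor.execute`, which PyMySQL supports. The helper `build_insert` and the `ALLOWED_TABLES` set are hypothetical; table names cannot be passed as parameters, so the sketch checks them against a fixed set instead.

```python
ALLOWED_TABLES = {"AllStarFull", "AllStarFullBetter"}

def build_insert(table, values):
    """Return (statement, params) for a parameterized INSERT."""
    if table not in ALLOWED_TABLES:
        raise ValueError("unexpected table name: " + table)
    placeholders = ", ".join(["%s"] * len(values))
    statement = "INSERT INTO " + table + " VALUES (" + placeholders + ")"
    return statement, tuple(values)

q, params = build_insert(
    "AllStarFullBetter",
    ["ferdo05", -17, 1000000, "Canary", "BOS", "EPL", -2, "Quarter Back"])
print(q)
print(params)

# With an open PyMySQL connection cnx (as in the cells above), the call would be:
#     cursor = cnx.cursor()
#     cursor.execute(q, params)   # the driver escapes each value
#     cnx.commit()
```

The constraints on AllStarFullBetter would still reject this row; parameterization only changes how the values reach the engine.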


## The Relational Model -- II

### Why Start with Relational?

<img src="../images/ranking2.jpeg"><br>
(https://db-engines.com/en/ranking_categories)
<br><br>




### Relational and Common Database Concepts

- Almost all database engines and models have the concepts of
    - Objects that are some form of array of (name, value) pairs.
    - Sets of similar or related objects.
    - Four basic (CRUD) operations on a set
        - CREATE a new object and add to a set.
        - RETRIEVE an object in a set based on a criteria.
        - UPDATE an object in a set, e.g. change the data in the object.
        - DELETE an object from a set, specifying the object(s) by some criteria.
        
        
- In the file systems/CSV model
    - A set is a file, e.g. students.csv.
    - Each object is a row in the file.
    - The header row gives the names of each column.
    - The CRUD processing involves writing a program that reads the file, changes the two-dimensional array and writes the file.
        - CREATE: Append a row and save the file.
        - RETRIEVE: Scan the table and apply some kind of IF statement.
        - UPDATE: Change a row in the two dimensional array.
        - DELETE: Remove a row from the array.
        

- In the "pure" relational model
    - A set is a _relation_.
    - An object is a _row_ or _tuple_.
    - There is no support for CREATE, UPDATE or DELETE.
    - There is an _algebra_ and language for producing a new relation from existing relations, which implements a superset of RETRIEVE.
    
- In SQL,
    - A set is a _table_.
    - An object is a _row_ or _tuple_.
    - INSERT is the create operation.
    - UPDATE is the update operation.
    - DELETE is the delete operation.
    - SELECT is the statement that realizes the relational _algebra_.
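
The mapping from CRUD to SQL statements can be seen in a few lines. This is a small sketch using Python's built-in `sqlite3` module instead of MySQL so that it runs without a server; the `students` table echoes the students.csv example above, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id TEXT PRIMARY KEY, name TEXT)")

# CREATE -> INSERT
conn.execute("INSERT INTO students VALUES (?, ?)", ("s1", "Ada"))
conn.execute("INSERT INTO students VALUES (?, ?)", ("s2", "Grace"))

# RETRIEVE -> SELECT
after_insert = conn.execute(
    "SELECT id, name FROM students ORDER BY id").fetchall()

# UPDATE -> UPDATE
conn.execute("UPDATE students SET name = ? WHERE id = ?", ("Ada L.", "s1"))

# DELETE -> DELETE
conn.execute("DELETE FROM students WHERE id = ?", ("s2",))

after_changes = conn.execute(
    "SELECT id, name FROM students ORDER BY id").fetchall()

print(after_insert)
print(after_changes)
```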
    

### Algebraic Query Language

Reference: Ramakrishnan and Gehrke, section 2.4.

#### Algebra

"... abstract algebra (occasionally called modern algebra) is the study of algebraic structures." (https://en.wikipedia.org/wiki/Abstract_algebra) 

"In mathematics, and more specifically in abstract algebra, an algebraic structure is a set (called carrier set or underlying set) with one or more operations defined on it that satisfies a list of axioms." (https://en.wikipedia.org/wiki/Algebraic_structure)

__Group is an Example__ (http://mathworld.wolfram.com/Group.html)

"A group G is a finite or infinite set of elements together with a binary operation (called the group operation) that together satisfy the four fundamental properties of closure, associativity, the identity property, and the inverse property. The operation with respect to which a group is defined is often called the "group operation," and a set is said to be a group "under" this operation. Elements A, B, C, ... with binary operation between A and B denoted AB form a group if

1. Closure: If A and B are two elements in G, then the product AB is also in G.

2. Associativity: The defined multiplication is associative, i.e., for all A,B,C in G, (AB)C=A(BC).

3. Identity: There is an identity element I (a.k.a. 1, E, or e) such that IA=AI=A for every element A in G.

4. Inverse: There must be an inverse (a.k.a. reciprocal) of each element. Therefore, for each element A of G, the set contains an element B=A<sup>-1</sup> such that AA<sup>-1</sup>=A<sup>-1</sup>A=I."
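
For a small finite set, the four properties can be checked by brute force. The sketch below is only an illustration of the definition; the function name `is_group` is my own. It tests the integers {0, ..., 4} under addition modulo 5 and under multiplication modulo 5.

```python
from itertools import product

def is_group(elements, op):
    """Brute-force check of the four group axioms on a finite set."""
    elements = list(elements)
    # 1. Closure
    if any(op(a, b) not in elements for a, b in product(elements, repeat=2)):
        return False
    # 2. Associativity
    if any(op(op(a, b), c) != op(a, op(b, c))
           for a, b, c in product(elements, repeat=3)):
        return False
    # 3. Identity
    identities = [i for i in elements
                  if all(op(i, a) == a and op(a, i) == a for a in elements)]
    if not identities:
        return False
    identity = identities[0]
    # 4. Inverse
    return all(any(op(a, b) == identity and op(b, a) == identity for b in elements)
               for a in elements)

add_mod5 = is_group(range(5), lambda a, b: (a + b) % 5)
mul_mod5 = is_group(range(5), lambda a, b: (a * b) % 5)
print(add_mod5, mul_mod5)
```

Addition modulo 5 passes all four checks. Multiplication modulo 5 fails the inverse check because 0 has no inverse.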

#### Why a Special Query Language? Why a Formal Language?

- Computing has a formal mathematical model.

- Programming languages derive from the model and have their own formal definition.

- Almost every time someone publishes a new language, my reaction is, "What? Why another language? Can't we just pick one, use it and get it right?"


- Relational algebra is less powerful and expressive than Java, C, ... and other programming languages.


- The simplicity and constrained capabilities are actually the core of the value, enabling
    - Vastly simplified programming and supporting tools that yield increased productivity.
    - Development of algorithms that process the data definitions and query statements to automatically produce optimal execution plans, which are better than what a programmer can directly code.
    
    
- We will see these benefits in coming lectures.


_Simple Tool Example_ that enables "citizen programmers" to manipulate data.

<img src="../images/ss_1.jpeg">

<br><br>
_Simple Tool Example_ for enabling business professionals to analyze and report on data.
<img src="../images/quicksite.jpeg" width="85%">


#### Relational Algebra

- There are two notations or representations of the algebra:
    - The original, formal theory.
    - SQL
    
__Original, Formal Notation__

- The "set" in the relational algebra is the set of _relations_.


- The operations are:
    - Common set operations:
        - Union: $\cup$
        - Intersection: $\cap$
        - Difference: $-$
    - Projection: $\pi$
    - Selection: $\sigma$
    - Cartesian Product: $\times$
    - Join: $\bowtie$
    - Rename/Alias
    

- The formal notation does not support create, update or delete. You could emulate the operations by
    - Defining new relations containing the created, updated or deleted tuples.
    - Using $\cup$, $\cap$, $-$ on the original relation and created/deleted/updated tuple relation.
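
A minimal sketch of these ideas in Python, treating a relation as a set of tuples. The helpers `row`, `select` and `project` are hypothetical names, and the player IDs are made up. Union, intersection and difference are simply Python's set operators.

```python
def row(**attrs):
    """A tuple: an immutable, hashable set of (attribute, value) pairs."""
    return frozenset(attrs.items())

def select(relation, predicate):       # sigma
    return {t for t in relation if predicate(dict(t))}

def project(relation, attributes):     # pi
    return {frozenset((a, dict(t)[a]) for a in attributes) for t in relation}

all_star = {
    row(playerID="p1", yearID=1960, startingPos=1),
    row(playerID="p2", yearID=1960, startingPos=2),
    row(playerID="p3", yearID=1961, startingPos=1),
}

# Selection followed by projection produces a new relation.
pitchers_1960 = project(
    select(all_star, lambda t: t["yearID"] == 1960 and t["startingPos"] == 1),
    ["playerID"])

# Emulating create, delete and update with set operations.
created = all_star | {row(playerID="p4", yearID=1961, startingPos=3)}
deleted = all_star - {row(playerID="p2", yearID=1960, startingPos=2)}
old = row(playerID="p3", yearID=1961, startingPos=1)
new = row(playerID="p3", yearID=1961, startingPos=9)
updated = (all_star - {old}) | {new}

print(pitchers_1960)
print(len(created), len(deleted), len(updated))
```

Every operation returns a new relation and leaves `all_star` unchanged, which is the closure property at work.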
    
    
__SQL Notation__

- The "set" in the relational algebra is the catalog of _tables_.

- The operations are:
    - Common set operations:
        - Union: UNION
        - Intersection: INTERSECT
        - Difference: EXCEPT
    - Selection, Projection, Cartesian Product, Rename/Alias and Join are clauses within a SELECT statement.
    
    ```SELECT <project clause> FROM <table> [JOIN <table> [ON <join condition>]] [WHERE <select condition>]```
    <br><br>
    
- SQL supports additional operations, e.g.
    - GROUP BY
    - ORDER BY
    


## Selection

- MySQL SELECT statement

<img src="../images/sqlselect.jpeg">

<br><br>
- Other relational database engines have similar statements.


- Simple, starting table reminder

```
CREATE TABLE `AllStarFullBetter` (
  `playerID` varchar(16) NOT NULL,
  `yearID` int(6) NOT NULL,
  `gameNum` enum('0','1','2') NOT NULL,
  `gameID` char(12) NOT NULL,
  `teamID` char(3) NOT NULL,
  `lgID` enum('NL','AL') NOT NULL,
  `GP` enum('0','1') NOT NULL,
  `startingPos` enum('1','2','3','4','5','6','7','8','9') DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

```

- __NOTE:__ All of the single back quotes are:
    - MySQL specific
    - Optional in most cases, and only required if a table or column name contains a space or is a reserved word, e.g. ``` `First Name` ```
    
    
__Example 1:__ SELECT all columns and rows in the table ```AllStarFull``` in the database (schema) ```lahman2016```

```
SELECT * FROM lahman2016.AllstarFull
```

- ```*``` means all columns.
- ```lahman2016``` is the database (schema) on the database server.
- ```AllStarFull``` is the table.

Result screen in MySQLWorkbench looks something like:
<img src="../images/sqlselect1.jpeg">
<br><br>


<br><br>
__Example 2:__ Projection

- I only want the playerID, yearID and position.

```
SELECT
	playerID, yearID, startingPos
FROM
	lahman2016.AllstarFull;
```

- Note:
    - Capitalization of SQL keywords is optional.
    - Line breaks and indentation are optional.

Result screen in MySQLWorkbench looks something like:
<img src="../images/select2.jpeg">
<br><br>

__Example 3:__ Selection (and Projection)

- I only want rows where the year is 1960 and the startingPos is '1', i.e. pitcher.

```
SELECT
	playerID, yearID, startingPos
FROM
	lahman2016.AllstarFull
WHERE
	yearID=1960 AND startingPos='1';
```

The structure of the where clause (for now) is

```
WHERE column1=value1 AND ... AND columnN=valueN
```
Result screen in MySQLWorkbench looks something like:
<img src="../images/select3.jpeg">
<br><br>
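
This WHERE structure is exactly the template idea from the CSV code earlier: a dictionary of (column, value) pairs combined with AND. The sketch below makes the connection explicit. It is an approximation that uses Python's built-in `sqlite3` module and a tiny table with made-up rows instead of the real `lahman2016.AllstarFull` table, and `select_where` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE AllstarFull (playerID TEXT, yearID INTEGER, startingPos TEXT)")
conn.executemany(
    "INSERT INTO AllstarFull VALUES (?, ?, ?)",
    [("p1", 1960, "1"), ("p2", 1960, "2"), ("p3", 1961, "1"), ("p4", 1960, None)])

def select_where(table, columns, template):
    """Build and run SELECT <columns> FROM <table> WHERE c1=? AND ... AND cN=?"""
    # Table and column names are concatenated, so they must come from trusted
    # code; only the values are passed as parameters.
    where = " AND ".join(c + "=?" for c in template)
    q = "SELECT " + ", ".join(columns) + " FROM " + table + " WHERE " + where
    return conn.execute(q, tuple(template.values())).fetchall()

rows = select_where("AllstarFull",
                    ["playerID", "yearID", "startingPos"],
                    {"yearID": 1960, "startingPos": "1"})
print(rows)
```

The column list is the projection and the template is the selection; the database engine does the scanning and matching that `query_collection` did by hand.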



## What is Next? -- Homework 1

