## Introduction to SQL

Author: Greg Wray  
2024-FEB-26

### Set-up   
We will be using **DuckDB** as our relational database management system (RDBMS) and **JupySQL** as an interface between DuckDB and JupyterLab. Make sure you've correctly installed the necessary software before you use this notebook (see the separate notebook with instructions and tests). 

Note: you will get a warning about Pyarrow if you don't have it installed. This won't affect anything you do with this notebook.  

In [4]:
# load the libraries we will need
import numpy as np
import pandas as pd
import duckdb

# use a magic to load the JupySQL extension
%load_ext sql

The sql extension is already loaded. To reload it, use:
  %reload_ext sql


### Build a database

First, we will create an empty database. To do this, we use a magic to indicate that the rest of the line is SQL code. In this particular case, it is actually a meta-statement specific to JupySQL that instructs DuckDB to create a new in-memory database.

In [5]:
%sql duckdb://

Now we're ready to define the tables that will make up the relational database and read in the data.

Make sure you have all 7 `s_*.csv` files in your working directory. 

We will use the cell magic `%%sql` to tell Jupyter that the entire code block should be interpreted as SQL. Note that cell magics must be the first line of code in a block and consist of `%%` immediately followed by the name of the magic. 

Important: the database will persist only as long as the notebook is open or until the kernel is reset. If you re-open the notebook or re-start the kernel, you'll need to run this code block again. We'll go over how to save a database to disk in the second SQL session.

Note that comments in SQL are indicated by `--` (double dash sign); as with R and Python, comments can apply to an entire line or to the remainder of a line.

In [6]:
%%sql

-- create 7 tables and read in data for each from .csv files
-- we will examine table creation next week, but for now comments provide a brief explanation
    
DROP TABLE IF EXISTS orders;             -- allows us to re-install a table if we mistakenly mess it up
CREATE TABLE orders(                     -- creates and names the table
    order_ioc VARCHAR PRIMARY KEY,       -- specifies 1st column: name, data type, and primary key
    seq SMALLINT NOT NULL,               -- specifies 2nd column: name, data type, and required value
    familiar_order VARCHAR,              -- specifies 3rd column: name, data type
    taxonomy VARCHAR                     -- specifies 4th column: name, data type
    );                                   -- closes column definitions 
COPY orders FROM 's_orders.csv';         -- reads in data from file (column order must match!)

DROP TABLE IF EXISTS families;
CREATE TABLE families(
    family_ioc VARCHAR PRIMARY KEY,
    seq SMALLINT NOT NULL,
    order_ioc VARCHAR NOT NULL,
    familiar_family VARCHAR,
    niche VARCHAR,
    taxonomy VARCHAR,
    num_gen SMALLINT NOT NULL,
    num_spp SMALLINT NOT NULL,
    num_spp_x SMALLINT NOT NULL,
    num_threat SMALLINT NOT NULL
    );
COPY families FROM 's_families.csv';

DROP TABLE IF EXISTS genera;
CREATE TABLE genera(
    genus_ioc VARCHAR PRIMARY KEY,
    seq SMALLINT NOT NULL,
    family_ioc VARCHAR NOT NULL,
    familiar_genus VARCHAR,
    taxonomy VARCHAR,
    num_spp SMALLINT NOT NULL
    );
COPY genera FROM 's_genera.csv';

DROP TABLE IF EXISTS species;
CREATE TABLE species(
    seq SMALLINT PRIMARY KEY,
    genus_ioc VARCHAR NOT NULL,
    species_ioc VARCHAR NOT NULL,
    num_spp SMALLINT NOT NULL,
    familiar_ioc VARCHAR,
    conservation VARCHAR,
    endemic VARCHAR
    );
COPY species FROM 's_species.csv';

DROP TABLE IF EXISTS observations;
CREATE TABLE observations(
    seq SMALLINT PRIMARY KEY,
    genus_ioc VARCHAR NOT NULL,
    species_ioc VARCHAR NOT NULL,
    subspecies_ioc VARCHAR NOT NULL,
    date_obs DATE,
    time_obs VARCHAR,
    location_name VARCHAR NOT NULL,
    trip_name VARCHAR NOT NULL,
    notes VARCHAR
    );
COPY observations FROM 's_observations.csv';

DROP TABLE IF EXISTS trips;
CREATE TABLE trips(
    trip_name VARCHAR PRIMARY KEY,
    start_date DATE NOT NULL,
    end_date DATE NOT NULL
    );
COPY trips FROM 's_trips.csv';

DROP TABLE IF EXISTS locations;
CREATE TABLE locations(
    location_name VARCHAR PRIMARY KEY,
    province VARCHAR,
    country_name VARCHAR,
    bioregion_name VARCHAR,
    climate VARCHAR,
    protection VARCHAR,
    earliest DATE NOT NULL,
    latest DATE NOT NULL
    );
COPY locations FROM 's_locations.csv';



Next, we will remove the default limit of 10 on the maximum number of rows returned when we make a query. To do this, we will use a different magic, this time to indicate we want to change a configuration setting. The statement itself consists of the variable name and the new value we want to assign. Somewhat counter-intuitively, the value 0 corresponds to "no limit"; any positive integer will set the corresponding limit.  

In [7]:
%config SqlMagic.displaylimit = 0

### Test the database

It's good practice to check that your newly constructed database is installed correctly. One way to do that is to make sure all the rows are present in individual tables. The expected numbers of records for three of the tables are as follows:   
* orders: 44
* families: 252
* observations: 13510

YOUR TURN: substitute the table names below as needed to check that other tables have been correctly imported.

In [8]:
%%sql 
-- return the number of rows
SELECT COUNT(*) FROM orders;

count_star()
44
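To check several tables in one go, the same `COUNT(*)` query can be wrapped in a short Python loop. The sketch below is only an illustration: it uses Python's built-in `sqlite3` module as a stand-in for DuckDB so that it runs without the course files, and the two toy tables, their contents, the expected counts, and the `row_counts` helper are all made up for the example.

```python
import sqlite3

# sqlite3 stands in for DuckDB here; the tables and counts are toy values
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_ioc TEXT PRIMARY KEY)")
con.execute("CREATE TABLE trips (trip_name TEXT PRIMARY KEY)")
con.executemany("INSERT INTO orders VALUES (?)", [("Strigiformes",), ("Anseriformes",)])
con.executemany("INSERT INTO trips VALUES (?)", [("Kenya",)])

expected = {"orders": 2, "trips": 1}  # made-up expected row counts


def row_counts(connection, tables):
    """Return {table name: number of rows} for each named table."""
    # table names cannot be bound as query parameters, so only pass trusted names
    return {
        t: connection.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
        for t in tables
    }


counts = row_counts(con, expected)
# keep only the tables whose actual count differs from the expected one
mismatches = {t: (counts[t], n) for t, n in expected.items() if counts[t] != n}
print(counts)
print(mismatches)  # an empty dict means every table has the expected size
```

With the real database, the dictionary of expected counts would hold the numbers listed above (for example 44 for `orders`).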


We can also check whether all of the columns are present. The expected numbers of columns in three of the tables are as follows:
* genera: 6
* locations: 8   
* observations: 9

YOUR TURN: substitute the table names below as needed to check that other tables have been correctly imported. Note that the output has a separate row for each column in the table.

In [None]:
%%sql 
-- return information about the columns in a table
SELECT * FROM pragma_table_info ('genera');

### Simple queries

`SELECT` is the workhorse of data retrieval in SQL. It can be used in simple constructions to filter and process rows in a single table or in complex statements that involve grouping, aliases, functions, nested queries, and joins between multiple tables. We will start by working with data in a single table to introduce some of the basic filtering and output capabilities.  

SQL keywords and, in DuckDB, identifiers are case-insensitive (string values inside quotes are not). However, the convention is to use upper case for keywords and functions, and lower case for identifiers (table and column names). This will make your SQL code more readable.

A simple `SELECT` statement is shown below. The `*` character means *return all columns*, while the default behavior of `SELECT` is to return all rows. Thus, this query returns *all* of the data in the `orders` table. 

In [None]:
%%sql 
-- return the entire contents of the orders table
SELECT * FROM orders;

For queries that return many rows, it is convenient to specify a limit on the number of rows using a `LIMIT` clause. 

YOUR TURN: The `orders` table contains only 44 rows, but the `genera` table is much longer. Try substituting `genera` in place of orders to get a feel for why including a `LIMIT` is helpful. 

In [None]:
%%sql 
-- return the first 10 rows of the genera table
SELECT  * FROM genera LIMIT 10;

Note above that rows are returned in a seemingly random order. We'll cover how to order output later.

Often, we want to filter rows based on a condition. Conditions apply to values in a particular column or set of columns. Include a `WHERE` clause to specify the condition. When constructing a condition, logical, arithmetic, and set operators work as expected; SQL also provides some additional operators, including `LIKE` and `BETWEEN`. Parentheses can be used to group compound conditions. 

Important: SQL requires that clauses appear in a particular order! `WHERE` clauses must follow `FROM` and table name(s). `LIMIT` clauses are last. 

In [None]:
%%sql 
-- return the records for every species observed within a given genus
SELECT * FROM species WHERE genus_ioc = 'Bubo';

In [None]:
%%sql
-- return the rows where a family contains greater than five recently extinct species
SELECT * FROM families WHERE num_spp_x > 5;

In [None]:
%%sql
-- return rows containing genera with names that alphabetically follow a specified word
SELECT * FROM genera WHERE genus_ioc > 'Vireo';

In [None]:
%%sql 
-- return rows containing genera that belong to any family in a specified set
SELECT * FROM genera WHERE family_ioc IN ('Todidae', 'Momotidae', 'Meropidae');

In [None]:
%%sql 
-- return rows containing families that have more than 2 extinct species and contain fewer than 10 genera
SELECT * FROM families WHERE num_spp_x > 2 AND num_gen < 10;

In [None]:
%%sql
-- return rows with family names that start with 'J' or 'K' (BETWEEN is inclusive, but names like 'Laridae' sort after 'L')
SELECT * FROM families WHERE family_ioc BETWEEN 'J' AND 'L';

YOUR TURN: to understand how `BETWEEN` works, try changing the first or second string.
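A toy example may help show how `BETWEEN` treats strings. The sketch below uses Python's built-in `sqlite3` module as a stand-in for DuckDB (an assumption made so the code runs without the course files), and the short family "names" are made-up strings chosen to sit on either side of the bounds.

```python
import sqlite3

# sqlite3 stands in for DuckDB; the names are made-up strings, not real families
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE families (family_ioc TEXT PRIMARY KEY)")
con.executemany(
    "INSERT INTO families VALUES (?)",
    [("Ia",), ("Ja",), ("Kz",), ("L",), ("La",)],
)

# BETWEEN includes both bounds, and strings are compared alphabetically
between_rows = [
    r[0]
    for r in con.execute(
        "SELECT family_ioc FROM families "
        "WHERE family_ioc BETWEEN 'J' AND 'L' "
        "ORDER BY family_ioc"
    )
]
print(between_rows)  # 'L' itself is kept, but 'La' sorts after 'L' and is dropped
```

To capture every name beginning with 'L' as well, the upper bound has to sort after all of them, for example `'M'`.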

To search for a substring within a specific column, use the `LIKE` keyword and the wildcard `%` to indicate "anything". The wildcard can precede or follow the search substring (or both). Filtering with `LIKE` is case-sensitive with DuckDB, but be aware that this is not the case with some other implementations of SQL.

In [None]:
%%sql 
-- return rows where genus name starts with 'Str'
SELECT * FROM genera WHERE genus_ioc LIKE 'Str%';

In [None]:
%%sql 
-- return rows where genus name contains 'ng'
SELECT * FROM genera WHERE genus_ioc LIKE '%ng%';
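The caveat about case sensitivity is easy to see in SQLite, which is one of the implementations where `LIKE` ignores case for ASCII letters by default. The sketch below uses Python's built-in `sqlite3` module rather than DuckDB, with a small hand-typed `genera` table that is an assumption for illustration only.

```python
import sqlite3

# toy genera table; sqlite3 is used here precisely because its LIKE differs from DuckDB's
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE genera (genus_ioc TEXT PRIMARY KEY)")
con.executemany(
    "INSERT INTO genera VALUES (?)",
    [("Strix",), ("Struthio",), ("Bubo",), ("Anhinga",), ("Fringilla",)],
)


def matching(pattern):
    """Return the genus names that match a LIKE pattern, in alphabetical order."""
    query = "SELECT genus_ioc FROM genera WHERE genus_ioc LIKE ? ORDER BY genus_ioc"
    return [r[0] for r in con.execute(query, (pattern,))]


starts_with = matching("Str%")   # wildcard after the substring: prefix match
contains = matching("%ng%")      # wildcard on both sides: match anywhere
lower_case = matching("str%")    # SQLite ignores case here; per the text, DuckDB would not
print(starts_with, contains, lower_case)
```

In SQLite the lower-case pattern returns the same two genera as `'Str%'`, so it is worth testing a pattern with the "wrong" case whenever you move a query to a different database engine.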

An important feature of relational databases is that the rows (records) are not stored in a predictable or even stable order. This allows for faster filter and sort operations, but it does mean that you need to be explicit when you want results returned in a particular order.  

To sort output, we can include an `ORDER BY` clause: the keyword followed by the name of the column(s) to sort on. By default, rows are returned in ascending order, but this behavior can be changed by including the keyword `DESC`.

In [None]:
%%sql 
-- return rows in alphabetical order by genus name (note: ascending order is default)
SELECT * FROM genera ORDER BY genus_ioc LIMIT 5;

In [None]:
%%sql
-- return the five families with the most extinct species (note: sort in descending order)
SELECT * FROM families ORDER BY num_spp_x DESC LIMIT 5;
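`ORDER BY` also accepts several columns, each with its own direction, which is handy for breaking ties. The following sketch uses Python's built-in `sqlite3` module as a stand-in for DuckDB; the family names and extinct-species counts are made up for illustration.

```python
import sqlite3

# toy table with made-up names and counts; sqlite3 stands in for DuckDB
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE families (family_ioc TEXT PRIMARY KEY, num_spp_x INTEGER NOT NULL)"
)
con.executemany(
    "INSERT INTO families VALUES (?, ?)",
    [("Aaa", 2), ("Bbb", 7), ("Ccc", 7), ("Ddd", 0)],
)

# sort by count from high to low, then alphabetically within ties
top = con.execute(
    "SELECT family_ioc, num_spp_x FROM families "
    "ORDER BY num_spp_x DESC, family_ioc "
    "LIMIT 3"
).fetchall()
print(top)
```

`DESC` applies only to the column it follows, so the two tied families still come back in ascending alphabetical order.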

In many cases, we only need to see some of the information in the rows returned by a query. Replace the `*` with the name(s) of the columns to retrieve a subset. (Note that the `*` is also modeled on the expansions used in Unix and stands for "all".)

In [None]:
%%sql 
-- return family name and niche only and sort by IOC ordination (note: uses the seq column to sort)
SELECT family_ioc, familiar_family, niche FROM families ORDER BY seq LIMIT 5;

### Summarizing and grouping data

SQL offers powerful summarizing and grouping capabilities that have inspired packages in other languages, including dplyr and Pandas. We will explore these using a single table for simplicity, but bear in mind that they can also be applied to the output of more complex queries, such as those covered in later sections of this notebook. 

Note that SQL automatically assigns a column name for computed output (i.e., anything other than column values).

In [None]:
%%sql
-- return the largest number of genera in a single family
SELECT MAX(num_gen) FROM families;

In [None]:
%%sql
-- return the mean number of genera per family
SELECT MEAN (num_gen) FROM families;

To aggregate rows by value, use a `GROUP BY` clause and specify the column or columns to use. This clause follows the `FROM` or `WHERE` clause (if present) and precedes the `ORDER BY` and `LIMIT` clauses (if present). In the example below, we group by location name. We can then get tallies of the number of observations at each location using the `COUNT` function.  

As queries become more complex, it is common to use multiple lines and indenting to improve readability. 

In [None]:
%%sql
-- return the number of records at each location
SELECT
  location_name,
  COUNT(*)
FROM
  observations
GROUP BY
  location_name
ORDER BY
  COUNT(*) DESC          -- sort by tally, largest first
LIMIT 
  10;                    -- just the top 10 locations
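For readers more familiar with Python, it may help to see that `GROUP BY` with `COUNT(*)` does the same job as tallying values with `collections.Counter`. This is a sketch on a made-up list of location names, using Python's built-in `sqlite3` module as a stand-in for DuckDB.

```python
import sqlite3
from collections import Counter

# made-up location names, one per observation
locations = ["Park", "Park", "Park", "Marsh", "Marsh", "Ridge"]

con = sqlite3.connect(":memory:")  # sqlite3 stands in for DuckDB
con.execute("CREATE TABLE observations (location_name TEXT NOT NULL)")
con.executemany("INSERT INTO observations VALUES (?)", [(name,) for name in locations])

# tally observations per location and keep the two busiest locations
sql_tally = con.execute(
    "SELECT location_name, COUNT(*) AS n "
    "FROM observations "
    "GROUP BY location_name "
    "ORDER BY n DESC "
    "LIMIT 2"
).fetchall()

# the same tally computed in plain Python
py_tally = Counter(locations).most_common(2)
print(sql_tally, py_tally)
```

The advantage of doing this in SQL is that the grouping happens inside the database, so only the small summary table is returned.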

Another useful keyword is `DISTINCT`. Use it to return only the unique entries in a column; wrapped in `COUNT`, it returns the number of unique entries. `DISTINCT` can be combined with `COUNT` and `GROUP BY` for powerful summarizing queries.  

In [None]:
%%sql
-- returns the number of locations where a given species has been observed
SELECT
  genus_ioc,
  species_ioc,
  COUNT(DISTINCT (location_name))
FROM
  observations
WHERE
  genus_ioc = 'Falco'
  AND species_ioc = 'tinnunculus'
GROUP BY
  genus_ioc,
  species_ioc;
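The difference between `COUNT(*)` and `COUNT(DISTINCT ...)` is easiest to see side by side. Below is a minimal sketch on four made-up observation rows, using Python's built-in `sqlite3` module as a stand-in for DuckDB.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # sqlite3 stands in for DuckDB
con.execute(
    "CREATE TABLE observations ("
    "genus_ioc TEXT NOT NULL, species_ioc TEXT NOT NULL, location_name TEXT NOT NULL)"
)
con.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [  # made-up sightings: the first species is seen twice at the same place
        ("Falco", "tinnunculus", "Park"),
        ("Falco", "tinnunculus", "Park"),
        ("Falco", "tinnunculus", "Marsh"),
        ("Falco", "peregrinus", "Ridge"),
    ],
)

# for each species: number of sightings versus number of different locations
summary = con.execute(
    "SELECT species_ioc, COUNT(*), COUNT(DISTINCT location_name) "
    "FROM observations "
    "WHERE genus_ioc = 'Falco' "
    "GROUP BY genus_ioc, species_ioc "
    "ORDER BY species_ioc"
).fetchall()
print(summary)
```

Here `tinnunculus` has three sightings but only two distinct locations, because the repeated location is counted once.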

### Aliases   
Looking at the results of the last few queries above, it's clear that the automatically assigned column labels aren't ideal. Using more intuitive labels makes output easier to interpret. To do this, we can assign an alias using the keyword `AS` followed by a name. Aliases without any spaces do not need quotes. 

A second important use for aliases is to simplify table names. We will see examples later, when we get to queries that involve multiple tables.
                                                                                                                                    
Note that aliases are not saved in memory; they apply only to the current query.

In [None]:
%%sql 
-- return the number of species observed within each genus ranked by number observed
--     note that an alias defined in the SELECT clause can be reused in the ORDER BY clause 
--     also note the location of the DESC keyword (try placing it after genus and see what happens)
SELECT
  genus_ioc AS genus,           -- define the aliases here and next line
  COUNT(*) AS species_seen
FROM
  species
GROUP BY
  genus_ioc
ORDER BY
  species_seen DESC,            -- use the aliases here and next line
  genus
LIMIT
  10;

Another helpful way to label output is to use the function `CONCAT` to create strings that can be assigned to an alias. Strings can contain a mix of column values and whatever you provide. Enclose your text in quotes and separate items with commas.

With the current database, genus and species names are stored separately, so `CONCAT` is particularly useful for combining them into a single column when returning the results of a query.

In [None]:
%%sql
-- returns the first observation of each species in a given genus and labels the output
--     note the use of CONCAT to join the genus and species output and insert a space for readability
--     note the quotes in the second alias, which are needed since the alias contains spaces 
SELECT
  CONCAT(genus_ioc, ' ', species_ioc) AS species,
  MIN(date_obs) AS 'first seen'
FROM
  observations
WHERE
  genus_ioc = 'Falco'
GROUP BY
  genus_ioc,
  species_ioc;

By combining aliases and grouping functions, it is possible to create reports that summarize several different kinds of information. Building on the previous query, we can ask for additional details about sightings of each species in a genus.

In [None]:
%%sql
-- returns a summary of the sightings of each species within a given genus and labels the output
--     note the use of CONCAT to join the genus and species output and insert a space for readability
--     note the quotes around the aliases, which are needed since they contain spaces or symbols 
SELECT
  CONCAT (genus_ioc, ' ', species_ioc) AS species,
  COUNT(seq) AS '# times seen',
  MIN(date_obs) AS 'first seen',
  MAX(date_obs) AS 'last seen',
  COUNT(DISTINCT (location_name)) AS '# places',
  COUNT(DISTINCT (subspecies_ioc)) AS '# subspecies'
FROM
  observations
WHERE
  genus_ioc = 'Falco'
GROUP BY
  genus_ioc,
  species_ioc;

### Nested queries

Nested queries are useful when we want to first narrow a search, then query the results in a different way. With nested queries, the "inner" query is executed first, followed by a second "outer" query that only considers the rows returned by the first query. 

The example below returns the name of every species that has been observed within a given family. This information is stored in multiple tables: the `genera` table has a column that assigns every genus to a family. We will first query that table to retrieve a list of all genera in the family of interest. Next we will query the `species` table to retrieve the genus and species names of every entry in the list returned by the first query. 

In [1]:
%%sql
-- return the names of all species observed in a given family
--    first, filter the genera table for entries in that family and return the genus names
--    then, use the results of that query as a condition to query the species table 
SELECT -- this SELECT returns rows that match the set returned by the nested query
  CONCAT (genus_ioc, ' ', species_ioc) AS species,
  familiar_ioc AS 'common name'
FROM
  species
WHERE
  genus_ioc IN (
    SELECT genus_ioc                   -- beginning of the nested query (executes first)
    FROM genera
    WHERE family_ioc = 'Musophagidae'
    ORDER BY seq
  )                                    -- end of the nested query
GROUP BY
  genus_ioc,
  species_ioc,
  familiar_ioc
ORDER BY
  genus_ioc,
  species_ioc
LIMIT
  10;
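One way to get comfortable with a nested query is to run the inner query on its own first and then confirm that the outer query only keeps rows matching that list. The sketch below does this on a few hand-typed rows, using Python's built-in `sqlite3` module as a stand-in for DuckDB. Because older SQLite versions lack `CONCAT`, it uses the `||` string operator instead, which is a substitution made for this example.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # sqlite3 stands in for DuckDB
con.execute("CREATE TABLE genera (genus_ioc TEXT PRIMARY KEY, family_ioc TEXT NOT NULL)")
con.execute("CREATE TABLE species (genus_ioc TEXT NOT NULL, species_ioc TEXT NOT NULL)")
con.executemany(
    "INSERT INTO genera VALUES (?, ?)",
    [("Tauraco", "Musophagidae"), ("Crinifer", "Musophagidae"), ("Bubo", "Strigidae")],
)
con.executemany(
    "INSERT INTO species VALUES (?, ?)",
    [("Tauraco", "persa"), ("Crinifer", "piscator"), ("Bubo", "bubo"), ("Bubo", "scandiacus")],
)

inner_sql = "SELECT genus_ioc FROM genera WHERE family_ioc = 'Musophagidae'"

# step 1: the inner query alone returns the list of genera in the family
inner_result = sorted(r[0] for r in con.execute(inner_sql))

# step 2: the outer query keeps only species whose genus is in that list
outer_result = [
    r[0]
    for r in con.execute(
        "SELECT genus_ioc || ' ' || species_ioc AS species "
        "FROM species "
        f"WHERE genus_ioc IN ({inner_sql}) "
        "ORDER BY genus_ioc, species_ioc"
    )
]
print(inner_result)
print(outer_result)
```

Testing the inner query separately like this is also a practical debugging habit when a nested query returns nothing or too much.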



Nested queries can be useful for tabulating counts from multiple tables into a single report. The query below introduces two new features. First, it demonstrates the use of `SELECT` simply as a wrapper for several nested queries. Note that there is no table name in the outer query. Second, it demonstrates how to access a built-in variable using `SELECT`. Retrieving the value of `CURRENT_DATE` provides a date stamp on output. If we want to save this output, we now have a record of when the query was made.

In [None]:
%%sql
-- return the number of species observed and the total number recognized
SELECT
  (SELECT CURRENT_DATE) AS 'as of',
  (SELECT COUNT(DISTINCT (genus_ioc, species_ioc)) FROM species) AS 'species observed',
  (SELECT SUM(num_spp) FROM families) AS 'out of a total of';

### Joins

The real power of SQL is its ability to query data from multiple tables at once using (relatively) simple statements. This can be done by including a `JOIN` clause that specifies *how* to merge results and specifies one or more *relations* that link one or more columns between two or more tables.

The most common kind of join is a *left join*, where *every* row from the first table is returned, attached to *any* row from the second table where a value matches a row in the relation column(s) of the first table. If there is no match in the second table, `NULL` values will be inserted. Left joins allow for a one-to-many relationship between rows in the first table and those in the second table: a row from the first table might appear multiple times if more than one row in the second table contains matching values in the related column. This is perfect for joining taxonomic tables, which have an inherently nested structure with a variable number of contents at each level and where every higher taxon contains at least one lower taxon (e.g., a family always contains at least one genus but it might contain more than one). Left joins are also useful for our `observations` table, where multiple records may be related to a single species or trip or location.     

It is often the case that the same column name appears in multiple tables within a relational database. To make explicit which column of the same name you wish to reference, simply append the table name and a `.` before the column name. Columns specified in this way are said to be *fully qualified*. When writing queries that involve multiple tables, you only need to fully qualify if a column name exists in more than one table. However, fully qualifying is always allowed even when not needed, and doing so can make the logic of a query clearer. 

When dealing with multiple tables, using aliases for table names can help to simplify a query. A common convention is to use a single letter corresponding to the table's full name. 

Most versions of SQL accept more than one syntax for indicating a join. We'll use the newer and recommended syntax, but be aware that other ways to specify joins exist and are commonly used. Also be aware that it is possible (but not recommended) to create *implicit joins* by listing the tables in the `FROM` clause without the `JOIN` keyword and specifying one or more relations in a `WHERE` clause. Under the hood this is a *cross join* (every row of one table paired with every row of the other) that is then filtered by the `WHERE` condition; it can work in simple queries, but if the condition is missing or incomplete the result is often not what you actually want. You will almost certainly encounter both of these alternative approaches if you refer to code from ChatGPT or StackOverflow. Best practice is to always use explicit joins and the syntax introduced below.
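To see why implicit joins deserve caution, compare the number of rows returned when two tables are simply listed together with and without a matching condition. The sketch below runs on toy tables with made-up location and country values, using Python's built-in `sqlite3` module as a stand-in for DuckDB.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # sqlite3 stands in for DuckDB
con.execute("CREATE TABLE observations (location_name TEXT NOT NULL)")
con.execute("CREATE TABLE locations (location_name TEXT PRIMARY KEY, country_name TEXT)")
con.executemany("INSERT INTO observations VALUES (?)", [("Park",), ("Park",), ("Marsh",)])
con.executemany("INSERT INTO locations VALUES (?, ?)", [("Park", "Kenya"), ("Marsh", "Peru")])


def n_rows(sql):
    """Return the single number produced by a COUNT(*) query."""
    return con.execute(sql).fetchone()[0]


# tables listed with no condition: every observation paired with every location
cross = n_rows("SELECT COUNT(*) FROM observations, locations")

# implicit join: the same cross join, filtered by a WHERE condition
implicit = n_rows(
    "SELECT COUNT(*) FROM observations AS o, locations AS l "
    "WHERE o.location_name = l.location_name"
)

# explicit join: the relation is stated in the JOIN ... ON clause
explicit = n_rows(
    "SELECT COUNT(*) FROM observations AS o "
    "JOIN locations AS l ON o.location_name = l.location_name"
)
print(cross, implicit, explicit)
```

The implicit and explicit versions agree here, but forgetting the `WHERE` condition silently turns 3 rows into 6; with the real `observations` table the blow-up would be far larger.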

Let's start by constructing a query that returns the number of species that have been observed within each country. The `observations` table only stores the name of a location, not the country it is located in; that information is stored in the `locations` table. We'll create a left join of these tables to combine the information and then count the distinct species within each country. 

In [2]:
%%sql
-- return the number of species observed within each country visited
-- this requires:
--    (1) grouping by country and counting distinct species in the observations table
--    (2) a left join of the locations table onto the observations table based on location name  
SELECT
  country_name AS country,
  COUNT(DISTINCT (genus_ioc, species_ioc)) AS num_species
FROM
  observations AS o                         -- we need every row from this table
  LEFT JOIN locations AS l                  -- we only need rows that match in this table
  ON o.location_name = l.location_name      -- this is the column we will use to match rows
GROUP BY
  country
ORDER BY
  num_species DESC                          -- order output by number of species observed, high to low
LIMIT
  10;



The join appears to be successful, but there is a problem: the fourth row lists the country "None". What is going on? This is an indication that there are rows in the `observations` table where the value in the `location_name` column can't find a match in the corresponding column of the `locations` table. But which ones? Fortunately, SQL has a mechanism for figuring this out automatically, so we don't have to scroll through the two tables trying to figure out how to find typos or missing values. We'll learn how to do this next week.
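A tiny example shows how a single mismatched value produces a "None" after a left join. In this sketch the location 'Prak' is a deliberate typo for 'Park'; the tables and values are made up, and Python's built-in `sqlite3` module stands in for DuckDB.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # sqlite3 stands in for DuckDB
con.execute("CREATE TABLE observations (location_name TEXT NOT NULL, genus_ioc TEXT NOT NULL)")
con.execute("CREATE TABLE locations (location_name TEXT PRIMARY KEY, country_name TEXT)")
con.executemany(
    "INSERT INTO observations VALUES (?, ?)",
    [("Park", "Falco"), ("Marsh", "Anser"), ("Prak", "Bubo")],  # 'Prak' is a typo
)
con.executemany("INSERT INTO locations VALUES (?, ?)", [("Park", "Kenya"), ("Marsh", "Peru")])

# every observation is kept; unmatched ones get NULL (shown as None in Python)
joined = con.execute(
    "SELECT o.location_name, l.country_name "
    "FROM observations AS o "
    "LEFT JOIN locations AS l ON o.location_name = l.location_name "
    "ORDER BY o.location_name"
).fetchall()
print(joined)
```

The row for 'Prak' survives the join because it comes from the first table, but it has no country to attach, so grouping by country lumps it into a "None" group.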

The next query reports how many species have been observed from each genus relative to the total. It joins the `genera` and `species` tables and returns three numbers, each derived in a different way: (1) a tally of the number of species observed, using `GROUP BY` and the `COUNT` function; (2) the total number of species in that genus, read directly from the `genera.num_spp` column; and (3) a computed value (yes, you can do math with SQL!). 

In [36]:
%%sql
-- returns the proportion of species observed within each genus
SELECT
  species.genus_ioc,
  COUNT(species_ioc) AS num_obs,
  genera.num_spp AS num_total,
  ROUND(((COUNT(species_ioc) + 0.0) / genera.num_spp), 3) * 100 AS percent
FROM species
  LEFT JOIN genera 
  ON species.genus_ioc = genera.genus_ioc
GROUP BY
  genera.seq,
  species.genus_ioc,
  genera.num_spp
ORDER BY
  genera.seq
LIMIT
  20;

genus_ioc,num_obs,num_total,percent
Struthio,1,2,50.0
Tinamus,1,5,20.0
Crypturellus,2,21,9.5
Anhima,1,1,100.0
Anseranas,1,1,100.0
Dendrocygna,5,8,62.5
Branta,4,6,66.7
Anser,5,11,45.5
Cygnus,3,6,50.0
Merganetta,1,1,100.0
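The `+ 0.0` in the percentage calculation guards against integer division: in some SQL engines, dividing one integer by another discards the fractional part. Whether DuckDB does this depends on the version, so treat the following as a sketch of the general pitfall rather than a statement about DuckDB. It uses Python's built-in `sqlite3` module, where integer division does truncate, and the numbers from the Crypturellus row above (2 observed out of 21).

```python
import sqlite3

con = sqlite3.connect(":memory:")  # SQLite truncates integer division


def scalar(sql):
    """Run a query that returns a single value and return that value."""
    return con.execute(sql).fetchone()[0]


integer_ratio = scalar("SELECT 2 / 21")            # both operands are integers
float_ratio = scalar("SELECT (2 + 0.0) / 21")      # adding 0.0 forces a decimal result
percent = scalar("SELECT ROUND((2 + 0.0) / 21, 3) * 100")
print(integer_ratio, float_ratio, percent)
```

In SQLite the first ratio comes back as 0, which would make every percentage below 100 disappear; the second form gives the expected fraction.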


### Views

What if we want to left join the genera and species tables to the output of the orders and families left join? We could do this with two levels of nested joins, but that would be difficult to read and even more difficult to debug. Instead, we can save intermediate queries under a name and treat them like tables; these are called *views*.

Views persist after they are created and can be subsequently queried by name; however they only persist for the *current* session and must be reconstituted every session. In both regards, creating a view is analogous to an assignment statement in R or Python that creates a data object.

Something to be aware of: if you update the contents of an actual table, any view based on it is automatically updated as well.
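The automatic updating can be checked directly: create a view, add a row to one of its underlying tables, and query the view again. The sketch below does this on two toy tables, using Python's built-in `sqlite3` module as a stand-in for DuckDB. The view name `tax_demo` is made up for this example, and because SQLite does not support `CREATE OR REPLACE VIEW`, the sketch uses a plain `CREATE VIEW`.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # sqlite3 stands in for DuckDB
con.execute("CREATE TABLE orders (order_ioc TEXT PRIMARY KEY)")
con.execute("CREATE TABLE families (family_ioc TEXT PRIMARY KEY, order_ioc TEXT NOT NULL)")
con.executemany("INSERT INTO orders VALUES (?)", [("Anseriformes",), ("Galliformes",)])
con.execute("INSERT INTO families VALUES ('Anatidae', 'Anseriformes')")

# the view stores the query, not a copy of the data
con.execute(
    "CREATE VIEW tax_demo AS "
    "SELECT o.order_ioc AS orders, f.family_ioc AS families "
    "FROM orders AS o "
    "LEFT JOIN families AS f ON o.order_ioc = f.order_ioc"
)

view_sql = "SELECT * FROM tax_demo ORDER BY orders"
before = con.execute(view_sql).fetchall()

# add a family to the underlying table, then query the same view again
con.execute("INSERT INTO families VALUES ('Cracidae', 'Galliformes')")
after = con.execute(view_sql).fetchall()
print(before)
print(after)
```

The second result picks up the new family without the view being recreated, because the stored query is re-run each time the view is used.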

In [37]:
%%sql
-- creates a view that preserves the output of the query that follows
-- think of this as assigning the output to a variable name
CREATE
OR REPLACE VIEW tax1 AS
SELECT
  orders.order_ioc AS orders,
  families.family_ioc AS families
FROM orders 
  LEFT JOIN families                          
  ON orders.order_ioc = families.order_ioc    
ORDER BY
  families.seq;



Now we can treat `tax1` as if it were a table.

In [38]:
%%sql
SELECT * FROM tax1 LIMIT 20;

orders,families
Struthioniformes,Struthionidae
Rheiformes,Rheidae
Apterygiformes,Apterygidae
Casuariiformes,Casuariidae
Tinamiformes,Tinamidae
Anseriformes,Anhimidae
Anseriformes,Anseranatidae
Anseriformes,Anatidae
Galliformes,Megapodiidae
Galliformes,Cracidae


This means we can use `tax1` in a left join with the `genera` table. 

In [39]:
%%sql
-- create a new view that contains the following query
--     return a list of every genus and the family and order to which it belongs
CREATE
OR REPLACE VIEW tax2 AS
SELECT
  tax1.orders,
  tax1.families,
  genera.genus_ioc AS genera
FROM tax1
  LEFT JOIN genera 
  ON tax1.families = genera.family_ioc
ORDER BY
  genera.seq;



In [40]:
%%sql
SELECT * FROM tax2 LIMIT 20;

orders,families,genera
Struthioniformes,Struthionidae,Struthio
Tinamiformes,Tinamidae,Tinamus
Tinamiformes,Tinamidae,Crypturellus
Anseriformes,Anhimidae,Anhima
Anseriformes,Anseranatidae,Anseranas
Anseriformes,Anatidae,Dendrocygna
Anseriformes,Anatidae,Branta
Anseriformes,Anatidae,Anser
Anseriformes,Anatidae,Cygnus
Anseriformes,Anatidae,Merganetta


Now, we can join the `species` table to `tax2` to create a giant table that includes every species with a full enumeration of the higher-level taxa it belongs to. 

In [41]:
%%sql
-- create a new view that contains the following query
--     return a list of every species and the genus, family, and order to which it belongs
CREATE
OR REPLACE VIEW tax3 AS
SELECT
  tax2.orders,
  tax2.families,
  tax2.genera,
  species.species_ioc AS species
FROM tax2
  LEFT JOIN species 
  ON tax2.genera = species.genus_ioc
ORDER BY
  species.seq;



In [42]:
%%sql
SELECT * FROM tax3 LIMIT 30;

orders,families,genera,species
Struthioniformes,Struthionidae,Struthio,camelus
Tinamiformes,Tinamidae,Tinamus,major
Tinamiformes,Tinamidae,Crypturellus,cinnamomeus
Tinamiformes,Tinamidae,Crypturellus,boucardi
Anseriformes,Anseranatidae,Anseranas,semipalmata
Anseriformes,Anatidae,Dendrocygna,viduata
Anseriformes,Anatidae,Dendrocygna,autumnalis
Anseriformes,Anatidae,Dendrocygna,arborea
Anseriformes,Anatidae,Dendrocygna,bicolor
Anseriformes,Anatidae,Dendrocygna,javanica


Note how much **redundancy** is present in `tax3`. This is why relational databases are so much more efficient in terms of memory and computation. In the present case, well over half of the cells in the table contain repeated information! 

Just imagine your frustration if the IOC split the family Anatidae into two families. This would require updating **a lot** of individual cells in the giant table. But updating would only require a few changes in a relational structure. This not only makes updates simpler, but it is a huge advantage for maintaining the integrity of data.
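The difference in update effort can be counted. The sketch below builds a small "giant table" and a relational equivalent, then moves one genus to a new family in each and reports how many rows had to change. The rows are hand-typed, the family name 'Dendrocygnidae' is a hypothetical split invented for this example, and Python's built-in `sqlite3` module stands in for DuckDB.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # sqlite3 stands in for DuckDB

# one wide table: the family name is repeated on every species row
con.execute("CREATE TABLE flat (family TEXT, genus TEXT, species TEXT)")
con.executemany(
    "INSERT INTO flat VALUES (?, ?, ?)",
    [
        ("Anatidae", "Dendrocygna", "viduata"),
        ("Anatidae", "Dendrocygna", "autumnalis"),
        ("Anatidae", "Dendrocygna", "arborea"),
        ("Anatidae", "Anser", "anser"),
    ],
)

# relational version: the family is stored once per genus
con.execute("CREATE TABLE genera (genus_ioc TEXT PRIMARY KEY, family_ioc TEXT NOT NULL)")
con.executemany(
    "INSERT INTO genera VALUES (?, ?)",
    [("Dendrocygna", "Anatidae"), ("Anser", "Anatidae")],
)

# hypothetical split: move Dendrocygna into a new family in both designs
flat_changed = con.execute(
    "UPDATE flat SET family = 'Dendrocygnidae' WHERE genus = 'Dendrocygna'"
).rowcount
relational_changed = con.execute(
    "UPDATE genera SET family_ioc = 'Dendrocygnidae' WHERE genus_ioc = 'Dendrocygna'"
).rowcount
print(flat_changed, relational_changed)  # rows touched in each design
```

Even in this toy case the wide table needs three rows changed where the relational design needs one, and every extra row touched is another chance for the copies to fall out of sync.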