# Table Design That Works For You

Obsession with order & detail can be a good thing. When you're running out the door, it's reassuring to see your keys hanging on the hook where you *always* leave them. The same is true for database design. When you need to excavate a nugget of information from dozens of tables & millions of rows, you appreciate a dose of that same detail obsession. With data organised into a finely tuned, smartly named set of tables, the analysis experience becomes much more manageable.

---

# Following Naming Conventions

Programming languages tend to have their own style patterns & even various factions of SQL coders prefer certain conventions when naming tables, columns, & other objects (called *identifiers*). Some like *camel case* as in `berrySmoothie`, where words are strung together & the first letter of each word is capitalised except for the first word. *Pascal case*, as in `BerrySmoothie`, follows a similar pattern but capitalises the first letter too. With *snake case*, as in `berry_smoothie`, all the words are lowercase & separated by underscores.

Mixing styles or following none generally leads to a mess. For example, imagine connecting to a database & finding the following collection of tables:

```
Customers
customers
custBackup
customer_analysis
customer_test2
customer_testMarch2012
customeranalysis
```
You would have questions. For one, which table actually holds the current data on customers? A disorganised naming scheme -- & a general lack of tidiness -- makes it hard for others to dive into your data & makes it challenging for you to pick up where you left off. 

## Quoting Identifiers Enables Mixed Case

Regardless of any capitalisation you supply, PostgreSQL treats identifiers as lowercase unless you place double quotes around the identifier. Consider these two `CREATE TABLE` statements for PostgreSQL:

```
CREATE TABLE customers (
    customer_id text,
    ...
);

CREATE TABLE Customers (
    customer_id text,
    ...
);
```

When you execute these statements in order, the first command creates a table called `customers`. The second statement, rather than creating a separate table called `Customers`, will throw an error: `relation "customers" already exists`. Because you didn't quote the identifier, PostgreSQL treats `customers` & `Customers` as the same identifier, disregarding the case. To preserve the uppercase letter & create a separate table named `Customers`, we must surround the identifier with quotes, like so:

```
CREATE TABLE "Customers" (
    customer_id serial,
    ...
);
```

However, to query `Customers` rather than `customers`, we now have to quote its name in the `SELECT` statement as well:

```
SELECT * FROM "Customers";
```

That can be a chore to remember & makes users vulnerable to a mix-up. Make sure your tables have names that are clear & distinct from other tables in the database.

## Pitfalls with Quoting Identifiers

Quoting identifiers also allows us to use characters not otherwise allowed, including spaces. That may appeal to some folks, but there are negatives. You may want to throw quotes around `"trees planted"` as a column name in a reforestation database, but then all users will have to provide quotes on every reference to that column. Omit the quotes in a query, & the database will respond with an error, identifying `trees` & `planted` as separate columns & responding that `trees` does not exist. The more readable & reliable option is to use snake case, as in `trees_planted`.
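As a minimal sketch of that pitfall, the statements below assume a hypothetical `reforestation` table (the table name & its two columns are illustrative only). The quoted column works only when every reference repeats the quotes, while the snake case column needs none.

```sql
-- Hypothetical table for illustration only
CREATE TABLE reforestation (
    "trees planted" integer,   -- quoted: the space becomes part of the name
    trees_planted integer      -- snake case: no quoting needed
);

-- Works: the quotes match the column definition
SELECT "trees planted" FROM reforestation;

-- Works without any quotes
SELECT trees_planted FROM reforestation;

-- Fails: without quotes, PostgreSQL looks for a column named trees
-- SELECT trees planted FROM reforestation;
```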

Quotes also let us use SQL *reserved keywords*, which are words that have special meaning in SQL. We've already encountered several, such as `TABLE`, `WHERE`, or `SELECT`. Most database developers frown on using reserved keywords as identifiers. At a minimum, it's confusing & at worst, neglecting or forgetting to quote that keyword later may result in an error because the database will interpret the word as a command instead of an identifier.

## Guidelines for Naming Identifiers

Given the extra burden of quoting & its potential problems, it's best to keep your identifier names simple, unquoted, & consistent. Here are some recommendations:

1. **Use snake case.** Snake case is reliable, as shown in the earlier `trees_planted` example.
2. **Make names easy to understand & avoid cryptic abbreviations.** If you're building a database related to travel, `arrival_time` is a clearer column name than `arv_tm`.
3. **For table names, use plurals.** Tables hold rows, & each row represents one instance of an entity. So, use plural names for tables, such as `teachers`, `vehicles`, & `departments`.
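4. **Mind the length.** By default, PostgreSQL allows identifiers of up to 63 bytes & truncates anything longer; some other database systems allow fewer. If you're writing code that may get reused in another database system, lean toward shorter identifier names.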
5. **When making copies of tables, use names that will help you manage them later.** One method is to append a `_YYYY_MM_DD` date to the table name when you create a copy, such as `vehicle_parts_2024_11_13`. An additional benefit is that the table names will sort in date order.
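The dated-copy recommendation could be sketched as follows, assuming a hypothetical `vehicle_parts` table already exists in the database:

```sql
-- Assumes a hypothetical vehicle_parts table already exists.
-- The date suffix records when the copy was made.
CREATE TABLE vehicle_parts_2024_11_13 AS
SELECT * FROM vehicle_parts;
```

Keep in mind that `CREATE TABLE ... AS` copies the columns & rows but not constraints such as a primary key, so treat the result as a snapshot rather than a full clone.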

---

# Controlling Column Values with Constraints

You can maintain further control over the data a column will accept by using certain constraints. A column's data type broadly defines the kind of data it will accept: integers versus characters, for example. Additional constraints let us further specify acceptable values based on rules & logical tests. With constraints, we can avoid the "garbage in, garbage out" phenomenon, which happens when poor-quality data results in inaccurate or incomplete analysis. Well-designed constraints help maintain the quality of the data & ensure the integrity of the relationships among tables.

Previously, we learned about *primary* & *foreign keys*, which are two of the most commonly used constraints. SQL also has the following constraint types:

* **CHECK:** allows only those rows where a supplied Boolean expression evaluates to `true`
* **UNIQUE:** ensures the values in a column or group of columns are unique in each row in the table
* **NOT NULL:** prevents `NULL` values in a column

We can add constraints in two ways: as a *column constraint* or as a *table constraint*. A column constraint applies only to that column. We declare it with the column name & data type in the `CREATE TABLE` statement, & it gets checked whenever a change is made to the column. With a table constraint, we can supply criteria that apply to one or more columns. We declare it in the `CREATE TABLE` statement immediately after defining all the table columns, & it gets checked whenever a change is made to a row in the table.

## Primary Keys: Natural vs Surrogate

A *primary key* is a column or collection of columns whose values uniquely identify each row in a table. A primary key is a constraint, & it imposes two rules on the column or columns that make up the key:

1. Values must be unique for each row.
2. No column can have missing values.

In a table of products stored in a warehouse, the primary key could be a column of unique product codes. In the simple primary key examples in previous lessons, our tables had a primary key made from a single ID column with an integer inserted by us, the user. Often, the data will suggest the best path & help us decide whether to use a *natural key* or a *surrogate key* as the primary key.

### Using Existing Columns for Natural Keys

A natural key uses one or more of the table's existing columns that meet the criteria for a primary key: unique for every row & never empty. Values in the columns can change as long as the new value doesn't cause a violation of the constraint.

A natural key might be a driver's license identification number issued by a local DMV. Within a single government jurisdiction, such as a state in the United States, we'd reasonably expect that all drivers would receive a unique ID on their licenses, which we could store as `driver_id`. However, if we were compiling a national driver's license database, we might not be able to make that assumption: several states could independently issue the same ID code. In that case, the `driver_id` column may not have unique values & cannot be used as the natural key. As a solution, we could create a *composite primary key* by combining `driver_id` with a column holding the state name, which would give us a unique combination for each row. For example, both rows in this table have a unique combination of the `driver_id` & `st` columns:

|driver_id|st|first_name|last_name|
|:---|:---|:---|:---|
|10302019|NY|Patrick|Corbin|
|10302019|FL|Howard|Kendrick|

We'll visit both approaches in this lesson, & as we work with data, we'll keep an eye out for values suitable for natural keys. A part number, a serial number, or a book's ISBN are all good examples.
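As a rough sketch of the national driver's license table shown above, a composite key on `driver_id` & `st` might be declared like this. The table & constraint names are assumptions made for illustration, & the table constraint syntax used here is explained later in this lesson.

```sql
-- Hypothetical table; names are illustrative
CREATE TABLE national_licenses (
    driver_id text,
    st text,
    first_name text,
    last_name text,
    -- Neither column is unique on its own, but the pair is
    CONSTRAINT national_license_key PRIMARY KEY (driver_id, st)
);

-- Both rows are accepted: same driver_id, different state
INSERT INTO national_licenses (driver_id, st, first_name, last_name)
VALUES ('10302019', 'NY', 'Patrick', 'Corbin'),
       ('10302019', 'FL', 'Howard', 'Kendrick');
```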

### Introducing Columns for Surrogate Keys

A *surrogate key* is a single column that you fill with artificial values; we might use it when a table doesn't have data that supports creating a natural primary key. The surrogate key might be a sequential number autogenerated by the database. We've already done this with the serial data types & the `IDENTITY` syntax. A table using an autogenerated integer for a surrogate key might look like this:

|id|first_name|last_name|
|:---|:---|:---|
|1|Patrick|Corbin|
|2|Howard|Kendrick|
|3|David|Martinez|

Some developers like to use a *universally unique identifier* (UUID), which is a code comprised of 32 hexadecimal digits in groups separated by hyphens. Often UUIDs are used to identify computer hardware or software & look like the following:

```
2911d8a8-6dea-4a46-af23-d64175a08237
```

PostgreSQL offers a UUID data type as well as two modules that generate UUIDs: `uuid-ossp` & `pgcrypto`. The [PostgreSQL documentation](https://www.postgresql.org/docs/current/datatype-uuid.html) is a good starting point for diving deeper.
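For a sense of how a UUID surrogate key might look in practice, here is a minimal sketch. The table & constraint names are made up. It relies on `gen_random_uuid()`, which is built in from PostgreSQL 13 onward; on older versions the function comes from the `pgcrypto` module.

```sql
-- Hypothetical table; names are illustrative.
-- gen_random_uuid() is built in from PostgreSQL 13 onward;
-- on older versions it is provided by the pgcrypto module.
CREATE TABLE uuid_key_example (
    id uuid DEFAULT gen_random_uuid(),
    first_name text,
    last_name text,
    CONSTRAINT uuid_key PRIMARY KEY (id)
);

-- Omit id so the database fills it with a generated UUID
INSERT INTO uuid_key_example (first_name, last_name)
VALUES ('Patrick', 'Corbin');

SELECT * FROM uuid_key_example;
```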

### Evaluating the Pros & Cons of Key Types

There are well-reasoned arguments for using either type of primary key, but both have drawbacks. Points to consider about natural keys include the following:

1. The data already exists in the table, so you don't need to add a column to create a key.
2. Because the natural key data has meaning, it can reduce the need to join tables when querying.
3. If your data changes in a way that violates the requirements for a key -- sudden appearance of duplicate values, for instance -- you'll be forced to change the setup of the table.

Here are points to consider about surrogate keys:

1. Because a surrogate key doesn't have any meaning in itself & its values are independent of the data in the table, you're not limited by the key structure if your data changes later.
2. Key values are guaranteed to be unique.
3. Adding a column for a surrogate key requires more space.

In a perfect world, a table should have one or more columns that can serve as a natural key, such as a unique product code in a table of products. But real-world limitations arise all the time. In a table of employees, it might be difficult to find any single column, or even multiple columns, that would be unique on a row-by-row basis to serve as a primary key. In such cases where you can't reconsider the table structure, you may need to use a surrogate key.

### Creating a Single-Column Primary Key

Let's work through several primary key examples. Previously, we created primary keys on the `district_2020` & `district_2035` tables to try `JOIN` types. In fact, these were surrogate keys: in both tables, we created columns called `id` to use as the key & used the keywords `CONSTRAINT key_name PRIMARY KEY` to declare them as primary keys.

There are two ways to declare constraints: as a column constraint or as a table constraint. In the below code, we try both methods, declaring a primary key on a table similar to the driver's license example mentioned earlier. Because we expect the driver's license IDs to always be unique, we'll use that column as a natural key.

```
CREATE TABLE natural_key_example (
    license_id text CONSTRAINT license_key PRIMARY KEY,
    first_name text,
    last_name text
);

DROP TABLE natural_key_example;

CREATE TABLE natural_key_example (
    license_id text,
    first_name text,
    last_name text,
    CONSTRAINT license_key PRIMARY KEY (license_id)
);
```

We first create a table called `natural_key_example` & use the column constraint syntax to declare `license_id` as the primary key: the `CONSTRAINT` keyword, followed by a name for the constraint & the keywords `PRIMARY KEY`. This syntax makes it easy to understand at a glance which column is designated as the primary key. Note that you can omit the `CONSTRAINT` keyword & name for the key & simply use `PRIMARY KEY`:

```
license_id text PRIMARY KEY
```

In that case, PostgreSQL will name the primary key on its own, using the convention of the table name followed by `_pkey`.
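One way to confirm the generated name is to look it up in the `pg_constraint` system catalog. The sketch below uses a throwaway table, `pkey_name_example`, invented for this illustration.

```sql
-- Hypothetical table whose primary key is left unnamed
CREATE TABLE pkey_name_example (
    license_id text PRIMARY KEY
);

-- List the constraints PostgreSQL recorded for the table;
-- contype 'p' marks a primary key
SELECT conname, contype
FROM pg_constraint
WHERE conrelid = 'pkey_name_example'::regclass;
```

The `conname` column should show `pkey_name_example_pkey`, following the table-name-plus-`_pkey` convention.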

Next, we delete the table from the database with `DROP TABLE` to prepare for the table constraint example.

To add a table constraint, we declare the `CONSTRAINT` after listing all the columns, with the column we want to use as the key in parentheses. (Again, we can omit the `CONSTRAINT` keyword & key name.) In this example, we end up with the same `license_id` column for the primary key. We must use the table constraint syntax when we want to create a primary key using more than one column; in that case, we would list the columns in the parentheses, separated by commas.

Let's look at how the qualities of a primary key -- unique for every row & no `NULL` values -- protect us from harming our data's integrity. The below code has two `INSERT` statements.

```
INSERT INTO natural_key_example (
    license_id, first_name, last_name
)
VALUES ('T229901', 'Gem', 'Godfrey');

INSERT INTO natural_key_example (
    license_id, first_name, last_name
)
VALUES ('T229901', 'John', 'Mitchell');
```

When you execute the first `INSERT` statement on its own, the server loads a row into the `natural_key_example` table without any issue. When you attempt to execute the second, the server replies with an error:

<img src = "Primary Key Violation.png" width = "600" style = "margin:auto"/>

Before adding the row, the server checked whether a `license_id` of `T229901` was already present in the table. Because it was & because a primary key by definition must be unique for each row, the server rejected the operation. The rules of the fictional DMV state that no two drivers can have the same license ID, so checking for & rejecting duplicate data is one way for the database to enforce that rule.
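The other rule of a primary key, no missing values, can be tried in the same way. This sketch uses made-up names & should be rejected with a not-null violation:

```sql
-- Fails: a primary key column cannot hold NULL
INSERT INTO natural_key_example (
    license_id, first_name, last_name
)
VALUES (NULL, 'Ada', 'Lopez');
```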

### Creating a Composite Primary Key

If a single column doesn't meet the requirements for a primary key, we can create a *composite primary key*. 

We'll make a table that tracks student school attendance. The combination of `student_id` & `school_day` columns gives us a unique value for each row, which records whether a student was in school on that day in a column called `present`. To create a composite primary key, we must declare it using the table constraint syntax.

```
CREATE TABLE natural_key_composite_example (
    student_id text,
    school_day date,
    present boolean,
    CONSTRAINT student_key PRIMARY KEY (student_id, school_day)
);
```

Here, we pass two (or more) columns as arguments rather than one. We'll simulate a key violation by attempting to insert a row where the combination of values in the two key columns -- `student_id` & `school_day` -- is not unique to the table. Run the `INSERT` statements below one at a time.

```
INSERT INTO natural_key_composite_example (
    student_id, school_day, present
)
VALUES (775, '2022-01-22', 'Y');

INSERT INTO natural_key_composite_example (
    student_id, school_day, present
)
VALUES (775, '2022-01-23', 'Y');

INSERT INTO natural_key_composite_example (
    student_id, school_day, present
)
VALUES (775, '2022-01-23', 'N');
```

The first two `INSERT` statements execute fine because there's no duplication of values in the combination of key columns. But the third statement causes an error because the `student_id` & `school_day` values it contains match a combination that already exists in the table:

<img src = "Composite Primary Key Violation.png" width = "600" style = "margin:auto"/>

You can create composite keys with more than two columns. The limit to the number of columns you can use depends on your database.

### Creating an Auto-Incrementing Surrogate Key

As we've learned previously, there are two ways to have a PostgreSQL database add an automatically increasing unique value to a column. The first is to set the column to one of the PostgreSQL-specific serial data types: `smallserial`, `serial`, & `bigserial`. The second is to use the `IDENTITY` syntax; because it is part of the ANSI SQL standard, we'll employ it for our examples.

Use `IDENTITY` with one of the integer types `smallint`, `integer`, & `bigint`. For a primary key, it may be tempting to try to save disk space by using `integer`, which handles numbers as large as 2,147,483,647. But many a database developer has received a late-night call from a user frantic to know why an application is broken, only to discover that the database is trying to generate a number one greater than the data type's maximum. So, if it's remotely possible that your table will grow past 2.147 billion rows, it's wise to use `bigint`, which accepts numbers as high as 9.2 quintillion. You can set it & forget it.

```
CREATE TABLE surrogate_key_example (
    order_number bigint GENERATED ALWAYS AS IDENTITY,
    product_name text,
    order_time timestamp with time zone,
    CONSTRAINT order_number_key PRIMARY KEY (order_number)
);

INSERT INTO surrogate_key_example (
    product_name, order_time
)
VALUES ('Beachball Polish', '2020-03-15 09:21-07'),
       ('Wrinkle De-Atomiser', '2017-05-22 14:00-07'),
       ('Flux Capacitor', '1985-10-26 01:18:00-07');

SELECT * FROM surrogate_key_example;
```

The code above shows how to declare an auto-incrementing `bigint` column called `order_number` using the `IDENTITY` syntax & then set the column as the primary key. When you insert data into the table, you omit `order_number` from the list of columns & values. The database will create a new value for that column as each row is inserted, & that value will be one greater than the largest already created for the column.

<img src = "bigint Column as Surrogate Key Using IDENTITY.png" width = "600" style = "margin:auto"/>

We see these sorts of auto-incrementing order numbers reflected in the receipts for the purchases we make every day. 

A few details worth noting: if you delete a row, the database won't fill the gap in the `order_number` sequence, nor will it change any of the existing values in that column. It will generally add one to the largest existing value in the sequence (though there are exceptions related to operations, including restoring a database from a backup). Also, we used the syntax `GENERATED ALWAYS AS IDENTITY`. This prevents a user from inserting a value in `order_number` without manually overriding this setting. Generally, you want to prevent such meddling to avoid problems. Let's say a user were to manually insert a value of `4` into the `order_number` column of your existing `surrogate_key_example` table. That manual insert will not increment the `IDENTITY` sequence for the `order_number` column; that occurs only when the database generates a new value. Thus, on the next row insert, the database also would try to insert a `4`, as that's the next number in the sequence. The result will be an error, because a duplicate value violates the primary key constraint.
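To see the guard that `GENERATED ALWAYS` provides, here is a sketch of a plain insert that supplies its own `order_number`; the product name & timestamp are made up. PostgreSQL should refuse it with an error stating that it cannot insert a non-DEFAULT value into the column.

```sql
-- Fails: order_number is GENERATED ALWAYS, so a user-supplied
-- value is refused unless OVERRIDING SYSTEM VALUE is given
INSERT INTO surrogate_key_example (
    order_number, product_name, order_time
)
VALUES (4, 'Garden Gnome', '2021-09-01 08:00-06');
```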

You can, however, allow manual insertions by restarting the `IDENTITY` sequence. You might allow this in case you need to insert a row that was mistakenly deleted. The below code shows how to add a row to the table that has an `order_number` of `4`, which is the next value in the sequence.

```
INSERT INTO surrogate_key_example
OVERRIDING SYSTEM VALUE
VALUES (4, 'Chicken Coop', '2021-09-03 10:33-06');

ALTER TABLE surrogate_key_example ALTER COLUMN order_number
RESTART WITH 5;

INSERT INTO surrogate_key_example (
    product_name, order_time
)
VALUES ('Aloe Plant', '2020-03-15 10:09-07');
```

You start with an `INSERT` statement that includes the keywords `OVERRIDING SYSTEM VALUE`. Next, we include the `VALUES` clause & specify the integer `4` for the first column, `order_number`, in the `VALUES` list, which overrides the `IDENTITY` restriction. We're using `4`, but we could choose any number that's not already present in the column.

After the insert, you need to reset the `IDENTITY` sequence so that it begins at a number larger than the `4` you just inserted. To do this, use an `ALTER TABLE ... ALTER COLUMN` statement that includes the keywords `RESTART WITH 5`. An `ALTER TABLE` modifies tables & columns in various ways. Here, we use it to change the beginning number of the `IDENTITY` sequence; so, when the next row gets added to the table, the value for `order_number` will be `5`. Finally, insert a new row & omit a value for the `order_number`.

If we select all rows again from the `surrogate_key_example` table, we'll see that the `order_number` column is populated as intended:

<img src = "Restarting an IDENTITY Sequence.png" width = "600" style = "margin:auto"/>
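Hard-coding the restart value works when we know the next number. As an alternative sketch, not part of the example above, the sequence can be aligned with whatever the largest existing value is by using PostgreSQL's `pg_get_serial_sequence()` & `setval()` functions. This assumes the table already holds at least one row.

```sql
-- Point the identity sequence at the current maximum order_number;
-- the next generated value will then be one greater than that maximum
SELECT setval(
    pg_get_serial_sequence('surrogate_key_example', 'order_number'),
    (SELECT max(order_number) FROM surrogate_key_example)
);
```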

## Foreign Keys

We use *foreign keys* to establish relationships between tables. A foreign key is one or more columns whose values must match those in the primary key, or another unique key, of the table it references. If a value has no match, it is rejected. With this constraint, SQL enforces *referential integrity* -- ensuring that data in related tables doesn't end up unrelated, or orphaned.

We won't end up with rows in one table that have no relation to rows in the other table we can join them to.

The below code shows two tables from a hypothetical database tracking motor vehicle activity.

```
CREATE TABLE licenses (
    license_id text,
    first_name text,
    last_name text,
    CONSTRAINT licenses_key PRIMARY KEY (license_id)
);

CREATE TABLE registrations (
    registration_id text,
    registration_date timestamp with time zone,
    license_id text REFERENCES licenses (license_id),
    CONSTRAINT registration_key PRIMARY KEY (registration_id,
        license_id)
);

INSERT INTO licenses (
    license_id, first_name, last_name
)
VALUES ('T229901', 'Steve', 'Rothery');

INSERT INTO registrations (
    registration_id, registration_date, license_id
)
VALUES ('A203391', '2022-03-17', 'T229901');

INSERT INTO registrations (
    registration_id, registration_date, license_id
)
VALUES ('A75772', '2022-03-17', 'T000001');
```

The first table, `licenses`, uses a driver's unique `license_id` as a natural primary key. The second table, `registrations`, is for tracking vehicle registrations. A single license ID might be connected to multiple vehicle registrations, because each licensed driver can register multiple vehicles -- this is called a *one-to-many relationship*.

Here's how that relationship is expressed via SQL: in the `registrations` table, we designate the column `license_id` as a foreign key by adding the `REFERENCES` keyword, followed by the table name & column for it to reference.

Now, when we insert a row into `registrations`, the database will test whether the value inserted into `license_id` already exists in the `license_id` primary key column of the `licenses` table. If it doesn't, the database returns an error, which is important. If any rows in `registrations` didn't correspond to a row in `licenses`, we'd have no way to write a query to find the person who registered the vehicle.

To see this constraint in action, create the two tables & execute the `INSERT` statements one at a time. The first adds a row to `licenses` that includes the value `T229901` for the `license_id`. The second adds a row to `registrations` where the foreign key contains the same value. So far, so good, because the value exists in both tables. But, we encounter an error with the third insert, which tries to add a row to `registrations` with a value for `license_id` that's not in `licenses`:

<img src = "Foreign Key Example.png" width = "600" style = "margin:auto"/>

The resulting error is actually helpful: the database is enforcing referential integrity by preventing a registration for a nonexistent license holder. But it also indicates a few practical implications. First, it affects the order in which we insert data. We cannot add data to a table that contains a foreign key before the other table referenced by the key has the related records, or we'll get an error. In this example, we'd have to create a driver's license record before inserting a related registration record (if you think about it, that's what your local DMV probably does).

Second, the reverse applies when we delete data. To maintain referential integrity, the foreign key constraint prevents us from deleting a row from `licenses` before removing any related rows in `registrations`, because doing so would leave an orphaned record. We would have to delete the related row in `registrations` first & then delete the row in `licenses`. However, the ANSI SQL standard provides a way to handle this order of operations automatically using the `ON DELETE CASCADE` keywords.
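Without a cascade, the manual order of operations described above might look like this sketch, which assumes the two tables & the rows inserted earlier in this section:

```sql
-- 1. Remove the dependent rows first
DELETE FROM registrations WHERE license_id = 'T229901';

-- 2. The license row is no longer referenced, so this succeeds
DELETE FROM licenses WHERE license_id = 'T229901';
```

Running the second statement first would fail with a foreign key violation as long as a matching row remains in `registrations`.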

## How to Automatically Delete Related Records with CASCADE

To delete a row in `licenses` & have that action automatically delete any related rows in `registrations`, we can specify that behaviour by adding `ON DELETE CASCADE` when defining the foreign key constraint.

Here's how we would modify the `CREATE TABLE` statement for `registrations`, adding the keywords at the end of the definition of the `license_id` column:

```
CREATE TABLE registrations (
    registration_id text,
    registration_date date,
    license_id text REFERENCES licenses (license_id)
        ON DELETE CASCADE,
    CONSTRAINT registration_key PRIMARY KEY (
        registration_id, license_id
    )
);
```

Deleting a row in `licenses` should also delete all related rows in `registrations`. This allows us to delete a driver's license without first having to manually remove any registrations linked to it. It also maintains data integrity by ensuring deleting a license doesn't leave orphaned rows in `registrations`.
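A quick way to watch the cascade happen is sketched below. It assumes `registrations` has been dropped & recreated with `ON DELETE CASCADE`; the IDs & names are invented for this illustration.

```sql
-- Assumes registrations was dropped & recreated with ON DELETE CASCADE.
-- The IDs & names below are made up for this sketch.
INSERT INTO licenses (license_id, first_name, last_name)
VALUES ('T555001', 'Pete', 'Trewavas');

INSERT INTO registrations (registration_id, registration_date, license_id)
VALUES ('B100001', '2022-03-18', 'T555001');

-- Deleting the license also removes the registration that references it
DELETE FROM licenses WHERE license_id = 'T555001';

-- Should return zero rows
SELECT * FROM registrations WHERE license_id = 'T555001';
```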

## The CHECK Constraint

A check constraint evaluates whether data added to a column meets the expected criteria, which we specify with a logical test. If the criteria aren't met, the database returns an error. The `CHECK` constraint is extremely valuable because it can prevent columns from getting loaded with nonsensical data. For example, a baseball player's total number of hits shouldn't be negative, so you should limit the data to values of zero or greater. Or, in schools, `Z` isn't a valid letter grade for a course, so we might add a constraint that accepts only the values A-F.

As with primary keys, we can implement a `CHECK` constraint at the column or table level. For a column constraint, declare it in the `CREATE TABLE` statement after the column name & data type: `CHECK (logical expression)`. As a table constraint, use the syntax `CONSTRAINT constraint_name CHECK (logical expression)` after all columns are defined.

The below code shows a `CHECK` constraint applied on two columns in a table we might use to track the user role & salary of employees within an organisation. It uses the table constraint syntax for the primary key & the `CHECK` constraint.

```
CREATE TABLE check_constraint_example (
    user_id bigint GENERATED ALWAYS AS IDENTITY,
    user_role text,
    salary numeric(10, 2),
    CONSTRAINT user_id_key PRIMARY KEY (user_id),
    CONSTRAINT check_role_in_list CHECK (user_role IN('Admin', 'Staff')),
    CONSTRAINT check_salary_not_below_zero CHECK (salary >= 0)
);
```

We create the table & set the `user_id` column as an auto-incrementing surrogate primary key. The first `CHECK` tests whether values entered into the `user_role` column match one of two predefined strings, `Admin` or `Staff`, by using the SQL `IN` operator. The second `CHECK` tests whether values entered in the `salary` column are greater than or equal to 0, because a negative amount wouldn't make sense. Both tests are examples of a *Boolean expression*, a statement that evaluates as either true or false. If a value tested by the constraint evaluates as `true`, the check passes.

When values are inserted or updated, the database checks them against the constraint. If the values in either column violate the constraint -- or, for that matter, if the primary key constraint is violated -- the database will reject the change.
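To watch the checks at work, we could try inserts like the following, run one at a time. The role & salary values are made up for this sketch.

```sql
-- Passes both checks
INSERT INTO check_constraint_example (user_role, salary)
VALUES ('Admin', 52000.00);

-- Fails check_role_in_list: 'Manager' is not in the allowed list
INSERT INTO check_constraint_example (user_role, salary)
VALUES ('Manager', 52000.00);

-- Fails check_salary_not_below_zero: the salary is negative
INSERT INTO check_constraint_example (user_role, salary)
VALUES ('Staff', -10000.00);
```

One detail worth knowing: PostgreSQL also lets a row through when the expression evaluates to `NULL`, so a missing `salary` would not trip the check. Pair the check with `NOT NULL` if the value is required.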

If we use the table constraint syntax, we also can combine more than one test in a single `CHECK` statement. Say we have a table related to student achievement. We could add the following:

```
CONSTRAINT grad_check CHECK (credits >= 120 AND tuition = 'Paid')
```

Notice that we combine two logical tests by enclosing them in parentheses & connecting them with `AND`. Here, both Boolean expressions must evaluate to `true` for the entire check to pass. You can also test values across columns, as in the following example where we want to make sure an item's sales price is a discount on the original, assuming we have columns for both values:

```
CONSTRAINT sale_check CHECK (sale_price < retail_price)
```

Inside the parentheses, the logical expression checks that the sale price is less than the retail price.
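Putting the two styles side by side, a table could carry a column-level `CHECK` on one column & the cross-column `sale_check` as a table constraint. The table name, the `item_id` column, & the `retail_price > 0` rule are assumptions added for this sketch.

```sql
-- Hypothetical table; names & the retail_price rule are illustrative
CREATE TABLE sale_check_example (
    item_id bigint GENERATED ALWAYS AS IDENTITY,
    -- Column constraint: declared with the column it tests
    retail_price numeric(10, 2) CHECK (retail_price > 0),
    sale_price numeric(10, 2),
    CONSTRAINT item_id_key PRIMARY KEY (item_id),
    -- Table constraint: compares two columns in the same row
    CONSTRAINT sale_check CHECK (sale_price < retail_price)
);
```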

## The UNIQUE Constraint

We can also ensure that a column has a unique value in each row by using the `UNIQUE` constraint. If ensuring unique values sounds similar to the purpose of a primary key, it is. But `UNIQUE` has one important difference. In a primary key, no values can be `NULL`, but a `UNIQUE` constraint permits multiple `NULL` values in a column. This is useful in cases where we won't always have values but want to ensure that the ones we do have are unique.

To show the usefulness of `UNIQUE`, look at the code below, which is a table for tracking contact info.

```
CREATE TABLE unique_constraint_example (
    contact_id bigint GENERATED ALWAYS AS IDENTITY,
    first_name text,
    last_name text,
    email text,
    CONSTRAINT contact_id_key PRIMARY KEY (contact_id),
    CONSTRAINT email_unique UNIQUE (email)
);

INSERT INTO unique_constraint_example (
    first_name, last_name, email
)
VALUES ('Samantha', 'Lee', 'slee@example.org');

INSERT INTO unique_constraint_example (
    first_name, last_name, email
)
VALUES ('Betty', 'Diaz', 'bdiaz@example.org');

INSERT INTO unique_constraint_example (
    first_name, last_name, email
)
VALUES ('Sasha', 'Lee', 'slee@example.org');
```

In this table, `contact_id` serves as a surrogate primary key, uniquely identifying each row. But we also have an `email` column, the main point of contact with each person. We'd expect this column to contain only unique email addresses, but those addresses might change over time. So, we use `UNIQUE` to ensure that any time we add or update a contact's email, we're not providing one that already exists. If we try to insert an email that already exists, the database will return an error:

<img src = "UNIQUE Constraint Example.png" width = "600" style = "margin:auto"/>

Again, the error shows the database is working for us.
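The difference from a primary key shows up with missing values. In this sketch, which uses invented names, both inserts should succeed because PostgreSQL by default does not treat two `NULL` values as equal:

```sql
-- Both inserts succeed: by default, NULLs are not considered duplicates
INSERT INTO unique_constraint_example (first_name, last_name, email)
VALUES ('Ravi', 'Patel', NULL);

INSERT INTO unique_constraint_example (first_name, last_name, email)
VALUES ('Mina', 'Okafor', NULL);
```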

## The NOT NULL Constraint

Previously, we learned about `NULL`, a special SQL value that represents missing data or unknown values. We know that `NULL` is not allowed for primary key values because they need to uniquely identify each row in a table. But there may be other times when you'll want to disallow empty values in a column. For example, in a table listing each student in a school, requiring that columns containing first & last names be filled for each row makes sense. To require a value in a column, SQL provides the `NOT NULL` constraint, which simply prevents a column from accepting empty values.

The below code demonstrates the `NOT NULL` syntax.

```
CREATE TABLE not_null_example (
    student_id bigint GENERATED ALWAYS AS IDENTITY,
    first_name text NOT NULL,
    last_name text NOT NULL,
    CONSTRAINT student_id_key PRIMARY KEY (student_id)
);
```

Here, we declare `NOT NULL` for the `first_name` & `last_name` columns because it's likely we'd require those pieces of information in a table tracking student information. If we attempt an `INSERT` on the table & don't include values for those columns, the database will notify us of the violation.
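
As a quick illustration, assuming the `not_null_example` table above exists, the first `INSERT` below omits `first_name` & should be rejected with an error naming that column. The second supplies both names & should succeed. The sample names are made up for the example.

```
-- Omits first_name, so the column would be NULL: expected to fail
-- with a not-null constraint violation.
INSERT INTO not_null_example (last_name)
VALUES ('Lee');

-- Supplies both required columns: expected to succeed.
INSERT INTO not_null_example (first_name, last_name)
VALUES ('Samantha', 'Lee');
```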

## How to Remove Constraints or Add Them Later

You can remove a constraint or later add one to an existing table using `ALTER TABLE`, the command we used earlier to reset the `IDENTITY` sequence.

To remove a primary key, foreign key, or `UNIQUE` constraint, we write an `ALTER TABLE` statement in this format:

```
ALTER TABLE table_name DROP CONSTRAINT constraint_name;
```

To drop a `NOT NULL` constraint, the statement operates on the column, so we must use the additional `ALTER COLUMN` keywords, like so:

```
ALTER TABLE table_name ALTER COLUMN column_name DROP NOT NULL;
```

Let's use these statements to modify the `not_null_example` table we just made.

```
ALTER TABLE not_null_example
    DROP CONSTRAINT student_id_key;

ALTER TABLE not_null_example
    ADD CONSTRAINT student_id_key
    PRIMARY KEY (student_id);

ALTER TABLE not_null_example ALTER COLUMN first_name
    DROP NOT NULL;

ALTER TABLE not_null_example ALTER COLUMN first_name
    SET NOT NULL;
```

Execute the statements one at a time. Each time, you can view the changes to the table definition in pgAdmin by clicking the table name once & then clicking the **SQL** tab above the query window.

With the first `ALTER TABLE` statement, we use `DROP CONSTRAINT` to remove the primary key named `student_id_key`. We then add the primary key back using `ADD CONSTRAINT`. We'd use that same syntax to add a constraint to any existing table.

In the third statement, `ALTER COLUMN` & `DROP NOT NULL` remove the `NOT NULL` constraint from the `first_name` column. Finally, `SET NOT NULL` adds the constraint back.
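
If you'd rather check the results with a query than in pgAdmin, the same information can be read from the database itself. The sketch below assumes the `not_null_example` table from this section. `pg_constraint` is a PostgreSQL system catalog & `information_schema.columns` is a standard SQL view, though which constraint types appear in `pg_constraint` can vary by PostgreSQL version.

```
-- List the named constraints on the table along with their definitions.
-- contype 'p' marks a primary key and 'u' a UNIQUE constraint.
SELECT conname, contype, pg_get_constraintdef(oid) AS definition
FROM pg_constraint
WHERE conrelid = 'not_null_example'::regclass;

-- Show whether each column currently accepts NULL values.
SELECT column_name, is_nullable
FROM information_schema.columns
WHERE table_name = 'not_null_example'
ORDER BY ordinal_position;
```

Running both queries after each `ALTER TABLE` statement should show the primary key disappearing & returning, & `is_nullable` for `first_name` switching between `YES` & `NO`.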

---

# Speeding Up Queries with Indexes

In the same way that a book's index helps us find information more quickly, we can speed up queries by adding an *index* -- a separate data structure the database manages -- to one or more columns in a table. The database uses the index as a shortcut rather than scanning each row to find data. Here, we'll offer general guidance on using indexes & a PostgreSQL-specific example that demonstrates their benefits.

## B-Tree: PostgreSQL's Default Index

In PostgreSQL, the default index type is the *B-tree index*. It's created automatically on the columns designated for the primary key or a `UNIQUE` constraint, & it's also the type created by default with the `CREATE INDEX` statement. B-tree, short for *balanced tree*, is so named because when you search for a value, the structure looks from the top of the tree down through branches until it locates the value. A B-tree index is useful for data that can be ordered & searched using equality & range operators, such as `<`, `<=`, `=`, `>=`, `>`, & `BETWEEN`. It also works with `LIKE` if the search pattern doesn't begin with a wildcard, as in `WHERE chips LIKE 'Dorito%'`.

PostgreSQL also supports additional index types, such as the *Generalised Inverted Index* (GIN) & the *Generalised Search Tree* (GiST). For now, let's see a B-tree index speed up a simple search query. For this exercise, we'll use a large dataset comprising more than 900,000 New York City street addresses, compiled by the OpenAddresses project. The file with the data, `city_of_new_york.csv`, is available along with all the resources downloaded for this course.

Use the code below to create a `new_york_addresses` table & import the address data. The import will take longer than the tiny datasets we've loaded so far, because the CSV file is about 50MB.

```
CREATE TABLE new_york_addresses (
    longitude numeric(9, 6),
    latitude numeric(9, 6),
    street_number text,
    street text,
    unit text,
    postcode text,
    id integer CONSTRAINT new_york_key PRIMARY KEY
);

COPY new_york_addresses
FROM '/YourDirectory/city_of_new_york.csv'
WITH (FORMAT CSV, HEADER);
```

When the data loads, run a quick `SELECT` query to visually check that you have 940,374 rows & 7 columns. 

<img src = "Importing New York City Address Data.png" width = "600" style = "margin:auto"/>
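
A count & a small sample make the check concrete. This sketch assumes the import above completed without errors; the `LIMIT` value of 5 is arbitrary.

```
-- Should return 940,374 if every row in the CSV file loaded.
SELECT count(*) FROM new_york_addresses;

-- Preview a handful of rows to confirm the seven columns look right.
SELECT * FROM new_york_addresses
LIMIT 5;
```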

A common use of this data might be to search for matches in the `street` column, so we'll use that example for exploring index performance.

### Benchmarking Query Performance with EXPLAIN

We'll measure the performance before & after adding an index by using the PostgreSQL-specific `EXPLAIN` command, which lists the *query plan* for a specific database query. The query plan might include how the database plans to scan the table, whether or not it will use indexes, & so on. When we add the `ANALYZE` keyword, `EXPLAIN` will carry out the query & show the actual execution time.
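
To make the distinction concrete, here's a brief sketch using the `new_york_addresses` table we just loaded. Plain `EXPLAIN` shows only the planner's estimate & does not run the query, whereas `EXPLAIN ANALYZE` runs it. Because of that, a common precaution when analysing a statement that changes data is to wrap it in a transaction & roll it back. The `UPDATE` here is purely hypothetical.

```
-- Plan only: the query is not executed, so no actual timing is reported.
EXPLAIN SELECT * FROM new_york_addresses
WHERE street = 'BROADWAY';

-- EXPLAIN ANALYZE really executes the statement, so roll back
-- the transaction to discard the change after viewing the plan.
BEGIN;
EXPLAIN ANALYZE UPDATE new_york_addresses
SET unit = NULL
WHERE street = 'ZWICKY AVENUE';
ROLLBACK;
```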

### Recording Some Control Execution Times

We'll use the three queries below to analyse query performance before & after adding an index. We're using typical `SELECT` queries with a `WHERE` clause, with `EXPLAIN ANALYZE` included at the beginning. These keywords tell the database to execute the query & display statistics about the query process & how long it took to execute, rather than show the results.

```
EXPLAIN ANALYZE SELECT * FROM new_york_addresses
WHERE street = 'BROADWAY';

EXPLAIN ANALYZE SELECT * FROM new_york_addresses
WHERE street = '52 STREET';

EXPLAIN ANALYZE SELECT * FROM new_york_addresses
WHERE street = 'ZWICKY AVENUE';
```

On my system, the first query returns these stats in the pgAdmin output pane:

<img src = "Benchmark Queries for Index Performance.png" width = "600" style = "margin:auto"/>

Not all the output is relevant here, so I won't decode it all, but two lines are pertinent. The first indicates that to find any rows where `street = 'BROADWAY'`, the database will conduct a sequential scan of the table. That's a synonym for a full table scan: the database will examine each row & remove any where `street` doesn't match `BROADWAY`. The execution time is how long the query took to run. Your time will depend on factors including your computer hardware.

For the test, run each query several times & record the fastest execution time for each. You'll notice that execution times for the same query will vary slightly on each run. That can be the result of several factors, from other processes running on the server to the effect of data being held in memory after a prior run of the query.

### Adding the Index

Now, let's see how adding an index changes the query's search method & execution time. The code below shows a SQL statement for creating the index with PostgreSQL:

```
CREATE INDEX street_idx
ON new_york_addresses (street);
```

Notice that it's similar to the commands for creating constraints. We give the `CREATE INDEX` keywords followed by a name we choose for the index, in this case `street_idx`. Then `ON` is added, followed by the target table & column.

Execute the `CREATE INDEX` statement, & PostgreSQL will scan the values in the `street` column & build the index from them. We need to create the index only once. When the task finishes, rerun each of the three queries from before & record the execution times reported by `EXPLAIN ANALYZE`.

<img src = "Creating B-Tree Index on new_york_addresses Table.png" width = "600" style = "margin:auto"/>

Do you notice a change? First, we see that the database is now using an index scan on `street_idx` instead of visiting each row in a sequential scan. Also, the query speed is markedly faster.
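
The same index can serve more than exact matches. As a hedged sketch, the queries below try a range comparison & a prefix pattern against `street`; the search values are arbitrary. Whether the planner picks `street_idx` for each depends on how many rows it expects to match &, for `LIKE`, on the database's collation. Outside the C locale, PostgreSQL's documentation describes building the index with the `text_pattern_ops` operator class instead. The second index name, `street_pattern_idx`, is made up for this example.

```
-- Range search: B-tree indexes support ordering comparisons.
EXPLAIN ANALYZE SELECT * FROM new_york_addresses
WHERE street BETWEEN 'ZEBRA' AND 'ZWICKY AVENUE';

-- Prefix search: the wildcard comes at the end of the pattern.
EXPLAIN ANALYZE SELECT * FROM new_york_addresses
WHERE street LIKE 'ZWICKY%';

-- Optional, for databases that don't use the C locale: an index
-- built with text_pattern_ops can support LIKE prefix searches.
CREATE INDEX street_pattern_idx
ON new_york_addresses (street text_pattern_ops);
```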

If you ever need to remove an index from a table -- perhaps if you're testing the performance of several index types -- use the `DROP INDEX` command followed by the name of the index to remove.
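
For instance, assuming the `street_idx` index created above, the following sketch lists the table's indexes using the `pg_indexes` system view & then removes the one we added. The `IF EXISTS` clause simply avoids an error if the index has already been dropped.

```
-- List every index on the table, including the one backing the primary key.
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'new_york_addresses';

-- Remove the index we created on the street column.
DROP INDEX IF EXISTS street_idx;
```

If you run the `DROP INDEX` statement, recreate the index with the earlier `CREATE INDEX` statement before continuing to experiment.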

## Considerations When Using Indexes

We've seen that indexes have significant performance benefits, so does that mean we should add an index to every column in a table? Not so fast! Indexes are valuable, but they're not always needed. In addition, they do enlarge the database & impose a maintenance cost on writing data. Here are a few tips for judging when to use indexes:

1. Consult the documentation for the database system you're using to learn about the kinds of indexes available & which to use on particular data types. PostgreSQL, for example, has 5 more index types in addition to B-tree. One, called GiST, is particularly suited to geometry data types. Full-text search also benefits from indexing.
2. An index on a foreign key will help avoid an expensive sequential scan during a cascading delete.
3. Add indexes to columns that will frequently end up in a query `WHERE` clause. As we've seen, search performance is significantly improved via indexes.
4. Use `EXPLAIN ANALYZE` to test the performance under a variety of configurations. Optimisation is a process! If an index isn't being used by the database -- & it's not backing up a primary key or other constraints -- we can drop it to reduce the size of our database & speed up inserts, updates, & deletes.
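
The last tip can be checked with PostgreSQL's statistics views. This is a minimal sketch, assuming the `new_york_addresses` table & its indexes from earlier. `pg_stat_user_indexes` reports how many times each index has been scanned since statistics were last reset, so a count of zero is only a hint, not proof, that an index is unused.

```
-- Show scan counts and on-disk size for each index on the table.
SELECT indexrelname AS index_name,
       idx_scan AS times_used,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE relname = 'new_york_addresses'
ORDER BY idx_scan;
```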

---

# Wrapping Up

We're ready to ensure that the databases we build or inherit are best suited for collection & exploration of data. It's crucial to define constraints that match the data & the expectation of users by not allowing values that don't make sense, making sure values are filled in, & setting up proper relationships between tables. We also learned how to make our queries run faster & how to consistently organise our database objects.