# Inspecting & Modifying Data

Dirty data can have multiple origins. Converting data from one file type to another or giving a column the wrong data type can cause information to be lost. People can also be careless when inputting or editing data, leaving behind typos & spelling inconsistencies. Whatever the cause may be, dirty data is the bane of the data analyst.

In this lesson, we'll learn how to examine data to assess its quality & how to modify data & tables to make analysis easier. The ability to make changes to data & tables gives us options for updating or adding new information to our database as it becomes available, elevating our database from a static collection to a living record.

---

# Importing Data on Meat, Poultry, & Egg Producers

For this lesson, we'll use a directory of US meat, poultry, & egg producers. The data we'll use comes from [https://data.gov/](https://data.gov/), a website run by the US federal government that catalogs thousands of datasets from various federal agencies. You'll find the file along with the other resources downloaded for this course.

To import the file into PostgreSQL, we'll create a table called `meat_poultry_egg_establishments` & use `COPY` to add the CSV file to the table.

```
CREATE TABLE meat_poultry_egg_establishments (
    establishment_number text
        CONSTRAINT est_number_key PRIMARY KEY,
    company text,
    street text,
    city text,
    st text,
    zip text,
    phone text,
    grant_date date,
    activities text,
    dbas text
);

COPY meat_poultry_egg_establishments
FROM '/YourDirectory/MPI_Directory_by_Establishment_Name.csv'
WITH (FORMAT CSV, HEADER);

CREATE INDEX company_idx
ON meat_poultry_egg_establishments (company);
```

The table has 10 columns. We add a natural primary key constraint to the `establishment_number` column, which will hold unique values that identify each establishment. Most of the remaining columns relate to the company's name & location. We set most columns to `text`. We import the CSV & then create an index on the `company` column to speed up searches for particular companies.

For practice, let's use the `count()` aggregate function to check how many rows are in the `meat_poultry_egg_establishments` table:

```
SELECT count(*)
FROM meat_poultry_egg_establishments;
```

<img src = "Row Count Check on meat_poultry_egg_establishments Table.png" width = "600" style = "margin:auto"/>

The result should show 6,287 rows.

---

# Interviewing the Dataset

Let's interview the dataset to discover its details. The `meat_poultry_egg_establishments` table's rows describe food producers. At first glance, we might assume that each company in each row operates at a distinct address. But it's never safe to assume in data analysis, so let's run some checks.

```
SELECT company, street, city, st,
       count(*) AS address_count
FROM meat_poultry_egg_establishments
GROUP BY company, street, city, st
HAVING count(*) > 1
ORDER BY company, street, city, st;
```

Here, we group companies by unique combinations of the `company`, `street`, `city`, & `st` columns. Then we use `count(*)`, which returns the number of rows for each combination of those columns & gives it the alias `address_count`. Using the `HAVING` clause, we filter the results to show only cases where more than one row has the same combination of values. This should return all duplicate addresses for a company.

The query returns 23 rows, which means there are close to two dozen cases where the same company is listed multiple times at the same address:

<img src = "Finding Multiple Companies at the Same Address.png" width = "600" style = "margin:auto"/>

This is not necessarily a problem. There may be valid reasons for a company to appear multiple times at the same address. For example, two types of processing plants could exist with the same name. On the other hand, we may have found data entry errors. Either way, it's wise to eliminate concerns about the validity of a dataset before relying on it. This result should prompt us to investigate individual cases before we draw conclusions. However, this dataset has other issues that we need to look at before we get meaningful data from it.
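To investigate individual cases, one option is to pull the full rows behind each flagged combination. The following is a minimal sketch on a temporary table with made-up rows (the `demo_establishments` table & its values are hypothetical, not taken from the dataset):

```sql
-- Hypothetical sample rows, for illustration only.
CREATE TEMPORARY TABLE demo_establishments (
    establishment_number text PRIMARY KEY,
    company text,
    street text,
    city text,
    st text,
    activities text
);

INSERT INTO demo_establishments VALUES
    ('M1', 'Acme Meats', '1 Main St', 'Springfield', 'IL', 'Meat Processing'),
    ('P1', 'Acme Meats', '1 Main St', 'Springfield', 'IL', 'Poultry Processing'),
    ('M2', 'Best Eggs',  '9 Oak Ave', 'Dover',       'DE', 'Egg Product');

-- Return every row whose company & address combination appears more than once.
SELECT e.*
FROM demo_establishments AS e
JOIN (
    SELECT company, street, city, st
    FROM demo_establishments
    GROUP BY company, street, city, st
    HAVING count(*) > 1
) AS dupes
  ON  e.company = dupes.company
  AND e.street  = dupes.street
  AND e.city    = dupes.city
  AND e.st      = dupes.st
ORDER BY e.company, e.establishment_number;
```

Seeing the other columns side by side (here, `activities`) can help us judge whether the repeats are legitimate or data entry errors. Because the join compares with `=`, rows with a `NULL` in any of the four columns won't be matched.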

## Check for Missing Values

We'll also check whether we have values from all states & whether any rows are missing a state code. We'll use the aggregate function `count()` along with `GROUP BY` to determine this.

```
SELECT st,
       count(*) AS st_count
FROM meat_poultry_egg_establishments
GROUP BY st
ORDER BY st;
```

The query is a simple count that tallies the number of times each state postal code (`st`) appears in the table. Your result should include 57 rows, grouped by the state postal code in the column `st`. Why more than 50 US states? Because the data includes Puerto Rico & other unincorporated US territories, such as Guam & American Samoa. Alaska (`AK`) is at the top of the results with a count of `17` establishments:

<img src = "Grouping & Counting States.png" width = "600" style = "margin:auto"/>

However, the row at the bottom of the list has a `NULL` value in the `st` column & a `3` in `st_count`. That means three rows have a `NULL` in `st`. To see the details of those facilities, let's query those rows.

We'll add a `WHERE` clause with the `st` column & the `IS NULL` keywords to find which rows are missing a state code.

```
SELECT establishment_number,
       company,
       city,
       st,
       zip
FROM meat_poultry_egg_establishments
WHERE st IS NULL;
```

This query returns three rows that don't have a value in the `st` column:

<img src = "Using IS NULL to Find Missing Values in the st Column.png" width = "600" style = "margin:auto"/>

This is a problem, because any counts that include the `st` column will be incorrect, such as the number of establishments per state. When we spot an error such as this, it's worth making a quick visual check of the original file. We can confirm that there is no state listed in those rows in the file, so the error is organic to the data, not one introduced during import.

We'll have to add missing values to the `st` column to clean up this table.
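As a quick cross-check on missing values, we can compare `count(*)`, which counts every row, with `count(column)`, which skips `NULL`s. This is a small sketch on a hypothetical temporary table (`demo_states` & its rows are made up):

```sql
-- Hypothetical sample rows, two of them missing a state code.
CREATE TEMPORARY TABLE demo_states (st text);

INSERT INTO demo_states VALUES ('AL'), ('AK'), (NULL), ('WI'), (NULL);

-- count(*) counts all rows; count(st) ignores rows where st is NULL.
SELECT count(*)             AS total_rows,
       count(st)            AS rows_with_st,
       count(*) - count(st) AS rows_missing_st
FROM demo_states;
```

For these sample rows, the three columns come out as 5, 3, & 2. The same pattern applied to the `st` column of our real table would be expected to report the three missing values found above.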

## Checking for Inconsistent Data Values

Inconsistent data is another factor that can hamper our analysis. We can check for inconsistently entered data within a column by using `GROUP BY` with `count()`. When you scan the unduplicated values in the results, you might be able to spot variations in the spelling of names or other attributes.

For example, many of the 6,200 companies in our table are multiple locations owned by just a few multinational food corporations, such as Cargill or Tyson Foods. To find out how many locations each company owns, we count the values in the `company` column. Let's see what happens when we do.

```
SELECT company,
       count(*) AS company_count
FROM meat_poultry_egg_establishments
GROUP BY company
ORDER BY company ASC;
```

At least four different spellings are shown for seven establishments that are likely owned by the same company. It would help to standardise the names so all the items counted or summed are grouped properly.

<img src = "Using GROUP BY & count() to Find Inconsistent Company Names.png" width = "600" style = "margin:auto"/>

## Checking for Malformed Values Using length()

It's a good idea to check for unexpected values in a column that should be consistently formatted. For example, each entry in the `zip` column in the `meat_poultry_egg_establishments` table should be formatted in the style of US ZIP codes with five digits. However, that's not what is in our dataset.

ZIP codes are five-digit values, but some ZIP codes in our data have lost their leading zeros. We store `zip` as `text`, which preserves leading zeros, so the loss most likely happened upstream, when the values were handled as numbers before reaching the CSV file (an `integer` can't retain a leading zero). We can see this error when we run the code below. The example introduces `length()`, a *string function* that counts the number of characters in a string. We combine `length()` with `count()` & `GROUP BY` to determine how many rows have five characters in the `zip` field & how many have a value other than five. To make it easy to scan the results, we use `length()` in the `ORDER BY` clause.

```
SELECT length(zip),
       count(*) AS length_count
FROM meat_poultry_egg_establishments
GROUP BY length(zip)
ORDER BY length(zip) ASC;
```

The results confirm the formatting error. As you can see, `496` of the ZIP codes are four characters long, & `86` are three characters long, which likely means the four-character values lost one leading zero & the three-character values lost two.

<img src = "Using length() & count() to Test the zip Column.png" width = "600" style = "margin:auto"/>

Using the `WHERE` clause, we can see which states these shortened ZIP codes correspond to.

```
SELECT st,
       count(*) AS st_count
FROM meat_poultry_egg_establishments
WHERE length(zip) < 5
GROUP BY st
ORDER BY st ASC;
```

We use the `length()` function inside the `WHERE` clause to return a count of rows where the zip code is less than five characters for each state code. The result is what we would expect. The states are largely in the Northeast region of the United States where zip codes often start with zero.

<img src = "Filtering with length() to Find Short zip Values.png" width = "600" style = "margin:auto"/>

Obviously, we don't want this error to persist, so we'll add it to our list of items to correct. So far, we need to correct the following issues in our dataset:

1. Missing values for three rows in the `st` column
2. Inconsistent spelling of at least one company's name
3. Inaccurate ZIP codes

We'll look at how to use SQL to fix these issues by modifying the data.

---

# Modifying Tables, Columns, & Data

Almost nothing in a database, from tables to columns & the data types & values they contain, is set in concrete after it's created. As your needs change, you can use SQL to add columns to a table, change data types on existing columns, & edit values. Given the issues we discovered in the `meat_poultry_egg_establishments` table, being able to modify our database will come in handy.

We'll use two SQL commands. The first, `ALTER TABLE`, is part of the ANSI SQL standard & provides options to `ADD COLUMN`, `ALTER COLUMN`, & `DROP COLUMN`, among others.

The second command, `UPDATE`, also included in the SQL standard, allows you to change values in a table's columns. You can supply criteria using `WHERE` to choose which rows to update.

We'll explore the basic syntax & options for both commands & use them to fix the issues in our dataset.

## Modifying Tables with ALTER TABLE

We can use the `ALTER TABLE` statement to modify the structure of tables. The following examples show standard ANSI SQL syntax for common operations, starting with the code for adding a column to a table:

```
ALTER TABLE table ADD COLUMN column data_type;
```

We can remove a column with the following syntax:

```
ALTER TABLE table DROP COLUMN column;
```

To change the data type of a column, we would use this code:

```
ALTER TABLE table ALTER COLUMN column
    SET DATA TYPE data_type;
```

We add a `NOT NULL` constraint to a column like so:

```
ALTER TABLE table ALTER COLUMN column SET NOT NULL;
```

Note that in PostgreSQL & some other systems, adding a constraint to the table causes all rows to be checked to see whether they comply with the constraint. If the table has millions of rows, this could take a while.

Removing the `NOT NULL` constraint looks like this:

```
ALTER TABLE table ALTER COLUMN column DROP NOT NULL;
```

When you execute `ALTER TABLE` with the placeholders filled in, you should see a message that reads `ALTER TABLE` in the pgAdmin output screen. If an operation violates a constraint or if you attempt to change a column's data type & the existing values in the column won't conform to the new data type, PostgreSQL returns an error. But PostgreSQL won't give you any warning about deleting data when you drop a column, so use extra caution before dropping a column.
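To see these statements with the placeholders filled in, here is a minimal sketch that runs each one against a throwaway temporary table (`demo_alter` & its columns are hypothetical). The `USING` clause in the type change is a PostgreSQL addition not covered above; it's needed here because PostgreSQL won't implicitly cast `text` to `integer`:

```sql
-- Hypothetical scratch table, for illustration only.
CREATE TEMPORARY TABLE demo_alter (
    id integer,
    amount text
);

INSERT INTO demo_alter VALUES (1, '10'), (2, '25');

-- Add a column.
ALTER TABLE demo_alter ADD COLUMN note text;

-- Change a column's data type; USING tells PostgreSQL how to convert existing values.
ALTER TABLE demo_alter ALTER COLUMN amount
    SET DATA TYPE integer USING amount::integer;

-- Add, then remove, a NOT NULL constraint.
ALTER TABLE demo_alter ALTER COLUMN id SET NOT NULL;
ALTER TABLE demo_alter ALTER COLUMN id DROP NOT NULL;

-- Remove a column along with its data.
ALTER TABLE demo_alter DROP COLUMN note;
```

If `amount` held a value such as `'ten'`, the type change would fail with an error rather than silently discarding data.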

## Modifying Values with UPDATE

The `UPDATE` statement, part of the ANSI SQL standard, modifies the data in a column that meets a condition. It can be applied to all rows or a subset of rows. Its basic syntax for updating the data in every row in a column follows this form:

```
UPDATE table
SET column = value;
```

We first pass `UPDATE` the name of the table. Then to `SET`, we pass the column we want to update. The new `value` to place in the column can be a string, number, the name of another column, or even a query or expression that generates a value. The new value must be compatible with the column data type.

We can update values in multiple columns by adding additional columns & source values & separating each with a comma:

```
UPDATE table
SET column_a = value,
    column_b = value;
```

To restrict the update to particular rows, we add a `WHERE` clause with some criteria that must be met before the update can happen, such as rows where values equal a certain value or match a string:

```
UPDATE table
SET column = value
WHERE criteria;
```

We can also update one table with values from another table. Standard ANSI SQL requires that we write a *subquery*, a query inside a query, to specify which values & rows to update.

```
UPDATE table
SET column = (SELECT column
              FROM table_b
              WHERE table.column = table_b.column)
WHERE EXISTS (SELECT column
              FROM table_b
              WHERE table.column = table_b.column);
```

The value portion of `SET`, inside the parentheses, is a subquery. A `SELECT` statement inside parentheses generates the value for the update by joining columns in both tables on matching row values. Similarly, the `WHERE EXISTS` clause uses a `SELECT` statement to ensure that we only update rows where both tables have matching values. If we didn't use `WHERE EXISTS`, we might inadvertently set some values to `NULL` without planning to.

Some database managers offer additional syntax for updating across tables. PostgreSQL supports the ANSI standard but also a similar syntax using a `FROM` clause:

```
UPDATE table
SET column = table_b.column
FROM table_b
WHERE table.column = table_b.column;
```

When you execute an `UPDATE` statement, you'll get a message stating `UPDATE` along with the number of rows affected.
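The two cross-table forms can be compared on a pair of small temporary tables. This is a sketch only; `demo_target`, `demo_source`, & their rows are hypothetical:

```sql
-- Hypothetical tables: demo_source has new labels for ids 1 & 2 only.
CREATE TEMPORARY TABLE demo_target (id integer PRIMARY KEY, label text);
CREATE TEMPORARY TABLE demo_source (id integer PRIMARY KEY, label text);

INSERT INTO demo_target VALUES (1, 'old'), (2, 'old'), (3, 'old');
INSERT INTO demo_source VALUES (1, 'new one'), (2, 'new two');

-- ANSI form: a correlated subquery supplies the value,
-- & WHERE EXISTS keeps id 3 from being set to NULL.
UPDATE demo_target
SET label = (SELECT demo_source.label
             FROM demo_source
             WHERE demo_target.id = demo_source.id)
WHERE EXISTS (SELECT 1
              FROM demo_source
              WHERE demo_target.id = demo_source.id);

-- PostgreSQL form: the same update written with FROM.
UPDATE demo_target
SET label = demo_source.label
FROM demo_source
WHERE demo_target.id = demo_source.id;

SELECT id, label
FROM demo_target
ORDER BY id;
```

Each statement reports `UPDATE 2`, & the row with `id` 3 keeps its `old` label because it has no match in `demo_source`.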

## Viewing Modified Data with RETURNING

If you add an optional `RETURNING` clause to `UPDATE`, you can view the values that were modified without having to run a second, separate query. The syntax of the clause uses the `RETURNING` keyword followed by a list of columns or a wildcard in the same manner that we name columns following `SELECT`. Here's an example:

```
UPDATE table
SET column_a = value
RETURNING column_a, column_b, column_c;
```

Instead of just noting the number of rows modified, `RETURNING` directs the database to show the columns you specify for the rows modified. This is a PostgreSQL-specific implementation that we can also use with `INSERT` & `DELETE FROM`.
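Here is a small sketch of `RETURNING` with both `UPDATE` & `DELETE FROM`, run on a hypothetical temporary table (`demo_prices` & its rows are made up):

```sql
-- Hypothetical sample rows.
CREATE TEMPORARY TABLE demo_prices (item text, price numeric);

INSERT INTO demo_prices VALUES ('eggs', 2.00), ('ham', 5.00);

-- Show the modified row's new values without a follow-up SELECT.
UPDATE demo_prices
SET price = price * 1.10
WHERE item = 'eggs'
RETURNING item, price;

-- With DELETE FROM, RETURNING shows the rows as they were just before removal.
DELETE FROM demo_prices
WHERE item = 'ham'
RETURNING *;
```

With `DELETE FROM`, this doubles as a record of exactly what was removed.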

## Creating Backup Tables

Before modifying a table, it's a good idea to make a copy for reference & backup in case we accidentally destroy some data. We can use a variation of the familiar `CREATE TABLE` statement to make a new table from the table we want to duplicate.

```
CREATE TABLE meat_poultry_egg_establishments_backup AS
SELECT *
FROM meat_poultry_egg_establishments;
```

The result should be a pristine copy of your table with the new specified name. You can confirm this by counting the number of records in both tables at once:

```
SELECT (SELECT count(*)
        FROM meat_poultry_egg_establishments) AS original,
       (SELECT count(*)
        FROM meat_poultry_egg_establishments_backup) AS backup;
```

The results should return the same count from both tables.

<img src = "Backing Up a Table.png" width = "600" style = "margin:auto"/>

If the counts match, you can be sure your backup table contains every row of the original. Keep in mind that `CREATE TABLE ... AS` copies the columns & their data but not the original table's indexes or constraints, such as the primary key. As an added measure & for easy reference, we'll use `ALTER TABLE` to make copies of column data within the table we're updating.
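A backup made with `CREATE TABLE ... AS` holds the columns & rows but, in PostgreSQL, not the original's primary key or indexes. If those matter for the backup, one option is PostgreSQL's `LIKE ... INCLUDING ALL` clause followed by an `INSERT`. The following is a minimal sketch on hypothetical temporary tables (`demo_original` & `demo_backup` are made-up names):

```sql
-- Hypothetical original table with a primary key & a NOT NULL constraint.
CREATE TEMPORARY TABLE demo_original (
    id integer PRIMARY KEY,
    name text NOT NULL
);

INSERT INTO demo_original VALUES (1, 'a'), (2, 'b');

-- LIKE ... INCLUDING ALL copies the column definitions, constraints,
-- & indexes, but no rows.
CREATE TEMPORARY TABLE demo_backup (LIKE demo_original INCLUDING ALL);

-- Copy the rows separately.
INSERT INTO demo_backup
SELECT * FROM demo_original;
```

For a quick safety copy before an edit, the simpler `CREATE TABLE ... AS` form used in this lesson is usually enough.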

## Restoring Missing Column Values

Earlier, we revealed that three rows in the `meat_poultry_egg_establishments` table don't have a value in the `st` column. 

<img src = "Using IS NULL to Find Missing Values in the st Column.png" width = "600" style = "margin:auto"/>

To get a complete count of establishments in each state, we need to fill those missing values using an `UPDATE` statement.

### Creating a Column Copy

Even though we've backed up this table, let's take extra caution & make a copy of the `st` column within the table so we still have the original data if we make some dire error somewhere. Let's create the copy & fill it with the existing `st` column values.

```
ALTER TABLE meat_poultry_egg_establishments
ADD COLUMN st_copy text;

UPDATE meat_poultry_egg_establishments
SET st_copy = st;
```

The `ALTER TABLE` statement adds a column called `st_copy` using the same `text` data type as the original `st` column. Next, the `SET` clause in `UPDATE` fills our new `st_copy` column with the values in column `st`. Because we don't specify any criteria using `WHERE`, values in every row are updated, & PostgreSQL returns the message `UPDATE 6287`. Again, it's worth noting that on a very large table, this operation could take some time & also substantially increases the table's size. Making a column copy in addition to a table backup isn't entirely necessary, but if you're the patient, cautious type, it can be worthwhile.

We can confirm the values were copied properly with a simple `SELECT` query on both columns.

```
SELECT st, st_copy
FROM meat_poultry_egg_establishments
WHERE st IS DISTINCT FROM st_copy
ORDER BY st;
```

To check for differences between values in the columns, we use `IS DISTINCT FROM` in the `WHERE` clause. We've used `DISTINCT` before to find unique values in a column; in this context, `IS DISTINCT FROM` tests whether values in `st` & `st_copy` are different, treating `NULL` as a comparable value so that two `NULL`s count as the same. This keeps us from having to scan every row ourselves. Running this query will return zero rows, meaning the values match throughout the table.

<img src = "Checking Values in the st & st_copy Columns.png" width = "600" style = "margin:auto"/> 
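The reason to prefer `IS DISTINCT FROM` over `<>` here is its handling of `NULL`, which matters because three of our `st` values are `NULL`. A short sketch using literal values (no table required):

```sql
-- <> returns NULL when either side is NULL, so a WHERE clause using it
-- would silently skip those rows. IS DISTINCT FROM always returns true or false.
SELECT 'AL' <> 'AK'                           AS ne_two_values,         -- true
       NULL::text <> 'AK'                     AS ne_null_vs_value,      -- NULL
       NULL::text IS DISTINCT FROM 'AK'       AS distinct_null_vs_value, -- true
       NULL::text IS DISTINCT FROM NULL::text AS distinct_null_vs_null;  -- false
```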

Now, with our original data safely stored, we can update the three rows with missing state codes. This is now our in-table backup, so if something goes drastically wrong while we're updating the original column, we can easily copy the original data back in.

### Updating Rows Where Values Are Missing

To update those rows' missing values, we first find the values we need with a quick online search: Atlas Inspection is located in Minnesota; Hall-Namie Packing is in Alabama; & Jones Dairy is in Wisconsin. We add those states to the appropriate rows below.

```
UPDATE meat_poultry_egg_establishments
SET st = 'MN'
WHERE establishment_number = 'V18677A'
RETURNING establishment_number, company, city, st, zip;

UPDATE meat_poultry_egg_establishments
SET st = 'AL'
WHERE establishment_number = 'M45319+P45319'
RETURNING establishment_number, company, city, st, zip;

UPDATE meat_poultry_egg_establishments
SET st = 'WI'
WHERE establishment_number = 'M263A+P263A+V263A'
RETURNING establishment_number, company, city, st, zip;
```

Because we want each `UPDATE` statement to affect a single row, we include a `WHERE` clause for each that identifies the company's unique `establishment_number`, which is the table's primary key. When we run each query, the `RETURNING` clause directs the database to show several columns from the row that was updated, along with a message stating the number of rows affected:

<img src = "Updating the st Column for Three Establishments.png" width = "600" style = "margin:auto"/>

If we rerun our code from before to find rows where `st` is `NULL`, the query should return nothing:

```
SELECT establishment_number,
       company,
       city,
       st,
       zip
FROM meat_poultry_egg_establishments
WHERE st IS NULL;
```

<img src = "Rerunning IS NULL to Find Missing Values in the st Column.png" width = "600" style = "margin:auto"/>

### Restoring Original Values

What happens if we botch an update by providing the wrong values or updating the wrong rows? We'll just copy the data back from either the full table backup or the column backup.

```
UPDATE meat_poultry_egg_establishments
SET st = st_copy;

UPDATE meat_poultry_egg_establishments AS original
SET st = backup.st
FROM meat_poultry_egg_establishments_backup AS backup
WHERE original.establishment_number =
    backup.establishment_number;
```

To restore the values from the backup column in `meat_poultry_egg_establishments`, run the first `UPDATE` query, which sets `st` to the values in `st_copy`. Both columns should again have the identical original values. Alternatively, run the second `UPDATE`, which sets `st` to values in the `st` column from the `meat_poultry_egg_establishments_backup` table we made before. Either approach will also undo the fixes we made to add the missing state values.

## Updating Values for Consistency

Earlier, we discovered several cases where a single company's name was entered inconsistently. These inconsistencies will hinder us if we want to aggregate data by company name, so we'll fix them.

Here are the spelling variations of Armour-Eckrich Meats:

<img src = "Using GROUP BY & count() to Find Inconsistent Company Names.png" width = "600" style = "margin:auto"/>

We can standardise the spelling using an `UPDATE` statement. To protect our data, we'll create a new column for the standardised spellings, copy names in `company` into the new column, & work in the new column.

```
ALTER TABLE meat_poultry_egg_establishments
ADD COLUMN company_standard text;

UPDATE meat_poultry_egg_establishments
SET company_standard = company;
```

Now, let's say we want any name in `company` that starts with the string `Armour` to appear in `company_standard` as `Armour-Eckrich Meats`. (This assumes we've checked all Armour entries & want to standardise them.) The code below updates all rows matching the string `Armour` using `WHERE`.

```
UPDATE meat_poultry_egg_establishments
SET company_standard = 'Armour-Eckrich Meats'
WHERE company LIKE 'Armour%'
RETURNING company, company_standard;
```

The important piece of this query is the `WHERE` clause that uses the `LIKE` keyword for case-sensitive pattern matching. Including the wildcard syntax `%` at the end of the string `Armour` updates all rows that start with those characters regardless of what comes after them. The clause lets us target all the varied spellings used for the company's name. The `RETURNING` clause causes the statement to provide the results of the updated `company_standard` column next to the original `company` column:

<img src = "Creating & Filling the company_standard Column.png" width = "600" style = "margin:auto"/>

The values for Armour-Eckrich in `company_standard` are now standardised with consistent spelling. To standardise other company names in the table, we would create an `UPDATE` statement for each case. We would also keep the original `company` column for reference.
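Because a pattern such as `'Armour%'` can match more than we intend, it can help to preview the affected rows with a `SELECT` that uses the same `WHERE` clause before running the `UPDATE`. This sketch uses a hypothetical temporary table (`demo_names` & its rows, including the unrelated `Armour Star Foods`, are made up):

```sql
-- Hypothetical sample rows; 'Armour Star Foods' stands in for an unrelated company.
CREATE TEMPORARY TABLE demo_names (company text);

INSERT INTO demo_names VALUES
    ('Armour-Eckrich Meats LLC'),
    ('Armour - Eckrich Meats, Inc.'),
    ('Armour Star Foods'),
    ('Tyson Foods');

-- Preview: the broad pattern matches three rows, one of them unintended.
SELECT company
FROM demo_names
WHERE company LIKE 'Armour%'
ORDER BY company;

-- A narrower pattern matches only the two Eckrich spellings.
SELECT company
FROM demo_names
WHERE company LIKE 'Armour%Eckrich%'
ORDER BY company;
```

Once the preview lists only the rows we expect, the same `WHERE` clause can be reused in the `UPDATE`.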

## Repairing ZIP Codes Using Concatenation

Our final fix repairs values in the `zip` column that lost leading zeros. ZIP codes in Puerto Rico & the US Virgin Islands begin with two zeros, so we need to restore two leading zeros to the values in `zip`. For the other states, located mostly in New England, we'll restore a single leading zero.

We'll use `UPDATE` in conjunction with the double-pipe *string concatenation operator* (`||`). Concatenation combines two string values into one (it will also combine a string & a number into a string). For example, inserting `||` between the strings `abc` & `xyz` results in `abcxyz`. The double-pipe operator is a SQL standard for concatenation supported by PostgreSQL. You can use it in many contexts, such as `UPDATE` queries & `SELECT`, to provide custom output from existing as well as new data.
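A quick sketch of the operator on literal values, which can be run without any table:

```sql
-- || joins its operands into a single string.
SELECT 'abc' || 'xyz' AS two_strings,        -- abcxyz
       '00' || 123    AS string_and_number,  -- 00123
       '0' || '2134'  AS padded_zip,         -- 02134
       '0' || NULL    AS with_null;          -- NULL
```

Note the last column: concatenating anything with `NULL` yields `NULL`, so a row with a missing `zip` would stay `NULL` after the update rather than becoming `'0'`.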

First, let's make a backup copy of the `zip` column.

```
ALTER TABLE meat_poultry_egg_establishments
ADD COLUMN zip_copy text;

UPDATE meat_poultry_egg_establishments
SET zip_copy = zip;
```

Next, we'll perform the first update.

```
UPDATE meat_poultry_egg_establishments
SET zip = '00' || zip
WHERE st IN ('PR', 'VI') AND length(zip) = 3;
```

We use `SET` to set the value in the `zip` column to the result of the concatenation of `00` & the existing value. We limit the `UPDATE` to only those rows where the `st` column has the state code `PR` or `VI` using the `IN` comparison operator & add a test for rows where the length of `zip` is `3`. This entire statement will then only update the `zip` values for Puerto Rico & the Virgin Islands.

Let's repair the remaining ZIP codes using a similar query.

```
UPDATE meat_poultry_egg_establishments
SET zip = '0' || zip
WHERE st IN ('CT', 'MA', 'ME', 'NH', 'NJ', 'RI', 'VT')
    AND length(zip) = 4;
```

Now let's check our progress. Earlier, when we aggregated rows in the `zip` column by length, we found `86` rows with three characters & `496` with four. Using the same query now returns a more desirable result: all the rows have a five-digit ZIP code.

```
SELECT length(zip),
       count(*) AS length_count
FROM meat_poultry_egg_establishments
GROUP BY length(zip)
ORDER BY length(zip) ASC;
```

<img src = "Modifying Codes in the zip Column Missing One Leading Zero.png" width = "600" style = "margin:auto"/>
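As an alternative to concatenation, PostgreSQL's `lpad()` string function pads a value on the left to a target length. It isn't the approach used in this lesson; the sketch below, on a hypothetical temporary table (`demo_zips` & its rows are made up), only previews what it would produce:

```sql
-- Hypothetical sample rows with three-, four-, & five-character values.
CREATE TEMPORARY TABLE demo_zips (st text, zip text);

INSERT INTO demo_zips VALUES
    ('PR', '901'),
    ('MA', '2134'),
    ('CA', '90210');

-- lpad(string, length, fill) adds the fill character on the left
-- until the string reaches the requested length.
SELECT st,
       zip,
       lpad(zip, 5, '0') AS zip_padded
FROM demo_zips
ORDER BY st;
```

Two cautions apply. `lpad()` pads every short value regardless of state, so it assumes each one merely lost leading zeros. It also truncates values longer than the target length, so a guard such as `WHERE length(zip) < 5` is advisable in an `UPDATE`.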

## Updating Values Across Tables

Earlier in this lesson, in "Modifying Values with UPDATE", we saw the standard ANSI SQL & PostgreSQL-specific syntax for updating values in one table based on values in another. This syntax is particularly valuable in a relational database where primary keys & foreign keys establish table relationships. In those cases, we may need information in one table to update values in another table.

Let's say we're setting an inspection deadline for each of the companies in our table. We want to do this by US regions, such as Northeast, Pacific, & so on, but those regional designations don't exist in our table. However, they do exist in the file *state_regions.csv*, which contains matching `st` state codes. Once we load that file into a table, we can use that data in an `UPDATE` statement. Let's begin with the New England region to see how this works.

The below code contains the SQL statements to create a `state_regions` table & fill the table with data:

```
CREATE TABLE state_regions (
    st text CONSTRAINT st_key PRIMARY KEY,
    region text NOT NULL
);

COPY state_regions
FROM '/YourDirectory/state_regions.csv'
WITH (FORMAT CSV, HEADER);

SELECT * FROM state_regions;
```

We'll create two columns in a `state_regions` table: one containing the two-character state code `st` & the other containing the `region` name. We set a primary key constraint named `st_key` on the `st` column, which holds a unique code identifying each state. In the data we're importing, each state is present & assigned to a census region, & territories outside the United States are labeled as outlying areas. We'll update the table one region at a time.

<img src = "Creating & Filling a state_regions Table.png" width = "600" style = "margin:auto"/>

Next, let's return to the `meat_poultry_egg_establishments` table, add a column for inspection dates, & then fill in that column for the New England states.

```
ALTER TABLE meat_poultry_egg_establishments
ADD COLUMN inspection_deadline
    timestamp with time zone;

UPDATE meat_poultry_egg_establishments
    AS establishments
SET inspection_deadline = '2022-12-01 00:00 EST'
WHERE EXISTS (
    SELECT state_regions.region
    FROM state_regions
    WHERE establishments.st = state_regions.st
      AND state_regions.region = 'New England'
);
```

The `ALTER TABLE` statement creates the `inspection_deadline` column in the `meat_poultry_egg_establishments` table. In the `UPDATE` statement, we give the table an alias of `establishments` to make the code easier to read. Next, `SET` assigns a timestamp value of `2022-12-01 00:00 EST` to the new `inspection_deadline` column. Finally, `WHERE EXISTS` includes a subquery that connects the `meat_poultry_egg_establishments` table to the `state_regions` table we created & specifies which rows to update. The subquery (in parentheses, beginning with `SELECT`) looks for rows in the `state_regions` table where the `region` column matches the string `New England`. At the same time, it joins the `meat_poultry_egg_establishments` table with the `state_regions` table using the `st` column from both sides. In effect, the query is telling the database to find all the `st` codes that correspond to the New England region & use those codes to filter the update.

We can see the effect of the change with the below code:

```
SELECT st, inspection_deadline
FROM meat_poultry_egg_establishments
GROUP BY st, inspection_deadline
ORDER BY st;
```

The results should show the updated inspection deadlines for all New England companies. The top of the output shows Connecticut has received a deadline timestamp, for example, but states outside New England remain `NULL` because we haven't updated them yet:

<img src = "Viewing Updated inspection_date Values.png" width = "600" style = "margin:auto"/>

To fill in deadlines for additional regions, substitute a different region for `New England` in the `UPDATE` statement from before. The `inspection_deadline` column already exists, so we don't repeat the `ALTER TABLE` statement; running it again would return an error.

```
UPDATE meat_poultry_egg_establishments
    AS establishments
SET inspection_deadline = '2022-12-01 00:00 EST'
WHERE EXISTS (
    SELECT state_regions.region
    FROM state_regions
    WHERE establishments.st = state_regions.st
      AND state_regions.region = 'DifferentRegion'
);
```
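The `WHERE EXISTS` update in this section can also be written with the PostgreSQL-specific `FROM` form described earlier. This is a minimal sketch on hypothetical temporary tables (`demo_plants`, `demo_regions`, & their rows are made up) so that it runs on its own:

```sql
-- Hypothetical stand-ins for the establishments & state_regions tables.
CREATE TEMPORARY TABLE demo_plants (
    id integer PRIMARY KEY,
    st text,
    inspection_deadline timestamp with time zone
);

CREATE TEMPORARY TABLE demo_regions (
    st text PRIMARY KEY,
    region text NOT NULL
);

INSERT INTO demo_plants (id, st) VALUES (1, 'CT'), (2, 'VT'), (3, 'CA');

INSERT INTO demo_regions VALUES
    ('CT', 'New England'),
    ('VT', 'New England'),
    ('CA', 'Pacific');

-- Join the two tables on st & update only the New England rows.
UPDATE demo_plants AS plants
SET inspection_deadline = '2022-12-01 00:00 EST'
FROM demo_regions AS regions
WHERE plants.st = regions.st
  AND regions.region = 'New England';

SELECT id, st, inspection_deadline
FROM demo_plants
ORDER BY id;
```

The rows for `CT` & `VT` receive the deadline, while the `CA` row stays `NULL`. This form is shorter, but the `WHERE EXISTS` version is the one that follows the ANSI standard.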

---

# Deleting Unneeded Data

The most irrevocable way to modify data is to remove it entirely. SQL includes options to remove rows & columns along with options to delete an entire table or database. We want to perform these operations with caution, removing only data or tables we don't need. Without a backup, our data is gone for good.

## Deleting Rows from a Table

To remove rows from a table, we can use either `DELETE FROM` or `TRUNCATE`, both of which are part of the ANSI SQL standard. Each offers options that are useful depending on our goals.

Using `DELETE FROM`, we can remove all rows from a table, or we can add a `WHERE` clause to delete only the portion that matches an expression we supply. To delete all rows from a table, use the following syntax:

```
DELETE FROM table_name;
```

To remove only selected rows, add a `WHERE` clause along with the matching value or pattern to specify which ones you want to delete:

```
DELETE FROM table_name WHERE expression;
```

For example, to exclude US territories from our processors table, we can remove the companies in those locations using the code below.

```
DELETE FROM meat_poultry_egg_establishments
WHERE st IN ('AS', 'GU', 'MP', 'PR', 'VI');
```

PostgreSQL should return the message `DELETE 105`. This means the 105 rows where the `st` column held any of the codes designating a territory you supplied via the `IN` keyword have been removed from the table.

<img src = "Deleting Rows Matching an Expression.png" width = "600" style = "margin:auto"/>
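Since deleted rows are gone for good without a backup, one cautious habit is to count the matching rows with a `SELECT` before running the `DELETE FROM` with the same `WHERE` clause. This sketch uses a hypothetical temporary table (`demo_delete` & its rows are made up):

```sql
-- Hypothetical sample rows: two territories, one state, one missing code.
CREATE TEMPORARY TABLE demo_delete (id integer, st text);

INSERT INTO demo_delete VALUES (1, 'PR'), (2, 'GU'), (3, 'TX'), (4, NULL);

-- Count first, using the same WHERE clause the DELETE will use.
SELECT count(*) AS rows_to_delete
FROM demo_delete
WHERE st IN ('AS', 'GU', 'MP', 'PR', 'VI');

-- Then delete; the reported number should match the count above.
DELETE FROM demo_delete
WHERE st IN ('AS', 'GU', 'MP', 'PR', 'VI');
```

Here the count is 2 & PostgreSQL reports `DELETE 2`. The row with a `NULL` state code is left alone, because `NULL` never matches an `IN` list.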

With large tables, using `DELETE FROM` to remove all rows can be inefficient because it scans the entire table as part of the process. In that case, you can use `TRUNCATE`, which skips the scan. To empty the table using `TRUNCATE`, use the following syntax:

```
TRUNCATE table_name;
```

A handy feature of `TRUNCATE` is the ability to reset an `IDENTITY` sequence, such as one you may have created to serve as a surrogate primary key, as part of the operation. To do that, add the `RESTART IDENTITY` keywords to the statement:

```
TRUNCATE table_name RESTART IDENTITY;
```

We'll skip truncating any tables for now.
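For reference, here is a sketch of how the two forms of `TRUNCATE` treat an identity column, run on a throwaway temporary table (`demo_identity` is a hypothetical name; identity columns assume PostgreSQL 10 or later):

```sql
-- Hypothetical scratch table with an auto-generated id.
CREATE TEMPORARY TABLE demo_identity (
    id integer GENERATED ALWAYS AS IDENTITY,
    name text
);

INSERT INTO demo_identity (name) VALUES ('first'), ('second');

-- Plain TRUNCATE empties the table but the sequence keeps counting.
TRUNCATE demo_identity;
INSERT INTO demo_identity (name) VALUES ('third');
SELECT id, name FROM demo_identity;  -- id is 3

-- RESTART IDENTITY also resets the sequence.
TRUNCATE demo_identity RESTART IDENTITY;
INSERT INTO demo_identity (name) VALUES ('fourth');
SELECT id, name FROM demo_identity;  -- id is 1
```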

## Deleting a Column from a Table

Earlier, we created a backup `zip` column called `zip_copy`. Now that we've finished working on fixing the issues in `zip`, we no longer need `zip_copy`. We can remove the backup column, including all the data within the column, from the table using the `DROP COLUMN` keywords in the `ALTER TABLE` statement.

The syntax for removing a column is similar to other `ALTER TABLE` statements.

```
ALTER TABLE table_name
DROP COLUMN column_name;
```

The below code removes the `zip_copy` column:

```
ALTER TABLE meat_poultry_egg_establishments
DROP COLUMN zip_copy;
```

PostgreSQL returns the message `ALTER TABLE`, & the `zip_copy` column should be deleted. The database doesn't actually rewrite the table to remove the column; it just marks the column as deleted in its internal catalog & no longer shows it or adds data to it when new rows are added.

<img src = "Removing a Column From a Table Using DROP.png" width = "600" style = "margin:auto"/>
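One way to confirm the column is gone is to list the columns PostgreSQL still reports for the table. This is a sketch that queries the standard `information_schema.columns` view & assumes only one schema in the database contains a table with this name.

```
-- List the table's remaining columns in their defined order;
-- zip_copy should no longer appear in the result
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'meat_poultry_egg_establishments'
ORDER BY ordinal_position;
```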

## Deleting a Table from a Database

The `DROP TABLE` statement is a standard ANSI SQL feature that deletes a table from the database. This statement might come in handy if, for example, we have a collection of backups or *working tables* that have outlived their usefulness. It's also useful when we need to change the structure of a table significantly; in that case, rather than running a long series of `ALTER TABLE` statements, we can just remove the table & create a fresh one by running a new `CREATE TABLE` statement & re-importing the data.

The syntax for the `DROP TABLE` command is simple:

```
DROP TABLE table_name;
```

For example, we can delete the backup version of the `meat_poultry_egg_establishments` table.

```
DROP TABLE meat_poultry_egg_establishments_backup;
```

PostgreSQL should respond with the message `DROP TABLE` to indicate the table has been removed.

<img src = "Removing a Table From a Database Using DROP.png" width = "600" style = "margin:auto"/>
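If we aren't sure whether the backup table still exists (for instance, when rerunning a script), a plain `DROP TABLE` stops with an error. A hedged alternative is the `IF EXISTS` option, which is a PostgreSQL extension rather than part of the standard:

```
-- Drops the table if it is present; if it is absent, PostgreSQL
-- issues a notice instead of an error
DROP TABLE IF EXISTS meat_poultry_egg_establishments_backup;
```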

---

# Using Transactions to Save or Revert Changes

So far, our alterations have been final. That is, after we run a `DELETE` or `UPDATE` query (or any other query that alters our data or database structure), the only way to undo the change is to restore from a backup. However, there is a way to check our changes before finalising them & cancel the change if it's not what we intended. We do this by enclosing the SQL statement within a *transaction*, which includes keywords that allow us to commit our changes if they are successful or roll them back if not. We define a transaction using the following keywords at the beginning & end of the query:

**START TRANSACTION** signals the start of the transaction block. In PostgreSQL, you can also use the non-ANSI SQL `BEGIN` keyword.

**COMMIT** signals the end of the block & saves all changes.

**ROLLBACK** signals the end of the block & reverts all changes.

You can include multiple statements between `BEGIN` & `COMMIT` to define a sequence of operations that perform one unit of work in a database. An example is when you buy concert tickets, which might involve two steps: charging your credit card & reserving your seats so someone else can't buy them. A database programmer would want either both steps in the transaction to happen (say, when your card charge goes through) or neither to happen (if you cancel at checkout). Defining both steps as one transaction (also called a *transaction block*) keeps them as a unit; if one step is canceled or throws an error, the other gets canceled too.
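The concert-ticket scenario could be sketched roughly as follows. Everything here is hypothetical & exists only for illustration: the `payments` & `seats` tables, their columns, & the values are not part of the lesson's data.

```
-- Hypothetical tables & values, for illustration only
CREATE TABLE payments (
    payment_id integer GENERATED ALWAYS AS IDENTITY,
    customer_id integer,
    amount numeric(8,2)
);

CREATE TABLE seats (
    seat_id text PRIMARY KEY,
    reserved_by integer
);

INSERT INTO seats (seat_id)
VALUES ('A12');

START TRANSACTION;

-- Step 1: record the card charge
INSERT INTO payments (customer_id, amount)
VALUES (42, 150.00);

-- Step 2: reserve the seat, but only if no one else holds it
UPDATE seats
SET reserved_by = 42
WHERE seat_id = 'A12'
  AND reserved_by IS NULL;

-- Save both changes together; running ROLLBACK here instead
-- would discard both the charge & the reservation
COMMIT;
```

If either statement raises an error, PostgreSQL aborts the transaction & rejects further statements until the block ends, so a `COMMIT` issued at that point rolls the work back rather than saving half of it.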

We can use a transaction block to review changes a query makes & then decide whether to keep or discard them. In our table, let's say we're cleaning dirty data related to the company AGRO Merchants Oakland LLC. The table has three rows listing the company, but one row has an extra comma in the name:

<img src = "Demonstrating a Transaction Block 1.png" width = "600" style = "margin:auto"/>

We want that name to be consistent, so we'll remove the comma from the third row using an `UPDATE` query. But this time, we'll check the result of our update before we make it final (we'll purposely make a mistake we want to discard).

```
START TRANSACTION;

UPDATE meat_poultry_egg_establishments
SET company = 'AGRO Merchantss Oakland LLC'
WHERE company = 'AGRO Merchants Oakland, LLC';

SELECT company
FROM meat_poultry_egg_establishments
WHERE company LIKE 'AGRO%'
ORDER BY company;

ROLLBACK;
```

Beginning with `START TRANSACTION`, we'll run each statement separately. The database responds with the message `START TRANSACTION`, letting us know that any subsequent changes we make to the data will not be made permanent unless we issue a `COMMIT` command. Next, we run the `UPDATE` statement, which changes the company name in the row where it has an extra comma. We intentionally added an extra `s` to the name used in the `SET` clause to introduce a mistake.

When we view the names of companies starting with the letters `AGRO` using the `SELECT` statement, we see that, oops, one company name is misspelled now.

<img src = "Demonstrating a Transaction Block 2.png" width = "600" style = "margin:auto"/>

Instead of rerunning the `UPDATE` statement to fix the typo, we can simply discard the change by running the `ROLLBACK` command. When we rerun the `SELECT` statement to view the company names, we're back to where we started:

<img src = "Demonstrating a Transaction Block 1.png" width = "600" style = "margin:auto"/>

From here, you can correct your `UPDATE` statement by removing the extra `s` & rerun it, beginning with the `START TRANSACTION` statement again. If you're happy with the changes, run `COMMIT` to make them permanent.
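Put together, the corrected run might look like this sketch, which is the same block as before with the extra `s` removed & `COMMIT` in place of `ROLLBACK`:

```
START TRANSACTION;

-- Remove the extra comma from the company name
UPDATE meat_poultry_egg_establishments
SET company = 'AGRO Merchants Oakland LLC'
WHERE company = 'AGRO Merchants Oakland, LLC';

-- Check the result before making it permanent
SELECT company
FROM meat_poultry_egg_establishments
WHERE company LIKE 'AGRO%'
ORDER BY company;

-- Keep the change; use ROLLBACK instead if the names still look wrong
COMMIT;
```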

In practice, transaction blocks are more often used to group several related statements into one unit of work than to check simple changes like this one.

---

# Improving Performance When Updating Large Tables

With PostgreSQL, adding a column to a table & filling it with values can quickly inflate the table's size because the database creates a new version of the existing row each time a value is updated, but it doesn't delete the old row version. That essentially doubles the table's size. For small datasets, the increase is negligible, but for tables with hundreds of thousands or millions of rows, the time required to update rows & the resulting extra disk usage can be substantial.
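To observe this growth, we can check the table's size before & after a large update. The queries below are a sketch that uses PostgreSQL's built-in size functions & statistics view; the figures in `pg_stat_user_tables` are estimates & may lag slightly behind recent changes.

```
-- Total on-disk size of the table, including its indexes,
-- formatted in readable units
SELECT pg_size_pretty(
           pg_total_relation_size('meat_poultry_egg_establishments')
       ) AS total_size;

-- Estimated counts of live rows & dead (superseded) row versions
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'meat_poultry_egg_establishments';
```

A high `n_dead_tup` value after an update reflects the old row versions described above.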

Instead of adding a column & filling it with values, we can save disk space by copying the entire table & adding a populated column during the operation. Then, we rename the tables so the copy replaces the original, & the original becomes a backup. The result is a fresh table without the leftover old row versions.

The below code shows how to copy `meat_poultry_egg_establishments` into a new table while adding a populated column. To do this, if you didn't already drop the `meat_poultry_egg_establishments_backup` table, go ahead & drop it. Then run a `CREATE TABLE` statement.

```
CREATE TABLE meat_poultry_egg_establishments_backup
AS (SELECT *,
           '2023-02-14 00:00 EST'::timestamp with time zone
               AS reviewed_date
    FROM meat_poultry_egg_establishments);
```

The query is a modified version of the backup script. In addition to selecting all the columns using the asterisk wildcard, we also add a column called `reviewed_date` by providing a value cast as a `timestamp with time zone` data type & the `AS` keyword. That syntax adds & fills `reviewed_date`, which we might use to track the last time we checked the status of each plant.

Then, we use the below code to swap table names.

```
ALTER TABLE meat_poultry_egg_establishments
RENAME TO meat_poultry_egg_establishments_temp;

ALTER TABLE meat_poultry_egg_establishments_backup
RENAME TO meat_poultry_egg_establishments;

ALTER TABLE meat_poultry_egg_establishments_temp
RENAME TO meat_poultry_egg_establishments_backup;
```

Here, we use `ALTER TABLE` with a `RENAME TO` clause to change a table name. The first statement changes the original table name to one that ends with `_temp`. The second statement renames the copy we made to the original name of the table. Finally, we rename the table that ends with `_temp` so that it ends with `_backup`. The original table is now called `meat_poultry_egg_establishments_backup`, & the copy with the added column is called `meat_poultry_egg_establishments`. This process avoids updating rows & thus inflating the table.

<img src = "Swapping Table Names Using ALTER TABLE.png" width = "600" style = "margin:auto"/>
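Two refinements are worth considering. First, as an alternative to running the three renames one by one, they can be wrapped in a transaction so the swap happens as a single unit. Second, `CREATE TABLE ... AS` copies only the data, so the new table starts without the primary key & index defined on the original. The sketch below shows one way to handle both; the index name `company_copy_idx` is a hypothetical name chosen to avoid clashing with the index that still exists on the backup table.

```
-- Alternative to running the three renames separately:
-- perform the swap as one unit of work
START TRANSACTION;

ALTER TABLE meat_poultry_egg_establishments
RENAME TO meat_poultry_egg_establishments_temp;

ALTER TABLE meat_poultry_egg_establishments_backup
RENAME TO meat_poultry_egg_establishments;

ALTER TABLE meat_poultry_egg_establishments_temp
RENAME TO meat_poultry_egg_establishments_backup;

COMMIT;

-- The copy has no constraints or indexes yet, so re-create them.
-- PostgreSQL picks a default name for the primary key constraint.
ALTER TABLE meat_poultry_egg_establishments
    ADD PRIMARY KEY (establishment_number);

-- Hypothetical index name; any name unused in the schema works
CREATE INDEX company_copy_idx
ON meat_poultry_egg_establishments (company);
```

If you have already run the three renames, skip the transaction block (running it again would swap the tables back) & run only the last two statements.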

---

# Wrapping Up

Gleaning useful information from data sometimes requires modifying the data to remove inconsistencies, fix errors, & make it more suitable for supporting an accurate analysis. In a perfect world, datasets would arrive with everything clean & complete. But such a perfect world doesn't exist, so the ability to alter, update, & delete data is indispensable.

Be sure to back up your tables before you start making changes. Make copies of your columns too, for an extra level of protection.