# Importing & Exporting Data

So far, we've been adding a handful of rows to tables using the SQL `INSERT` statement. This is useful for making quick test tables or adding a few rows to an existing table. But it's more likely that you'll need to load hundreds, thousands, possibly even millions of rows, & no one wants to write `INSERT` statements in those situations.

If your data exists in a *delimited* text file, with one table row per line of text & each column separated by a comma or other character, PostgreSQL can import the data in bulk via its `COPY` command. This command is a PostgreSQL-specific implementation with options for including or excluding columns & handling various delimited text types.

In the opposite direction, `COPY` will also *export* data from PostgreSQL tables or from the result of a query to a delimited text file. This technique is handy when you want to share data with colleagues or move it into another format, such as an Excel file.

For importing, we'll start by introducing the *Annual US Census Population Estimates by County* dataset. Three steps form the outline of most of the imports you'll do:

* Obtain the source data in the form of a delimited text file.
* Create a table to store the data.
* Write a `COPY` statement to perform the import.

After the import is done, we'll check the data & look at additional options for importing & exporting. We'll focus on delimited text files, since they are the most common file format that's portable across proprietary & open source systems. If you want to transfer data from another database program's proprietary format directly to PostgreSQL -- for example, from Microsoft Access or MySQL -- you'll need to use a third-party tool. Check out the PostgreSQL wiki at [https://wiki.postgresql.org/wiki/Main_Page](https://wiki.postgresql.org/wiki/Main_Page) & search for "Converting from other databases to PostgreSQL" for a list of tools & options.

---

# Working with Delimited Text Files

A delimited text file contains rows of data, each of which represents one row in a table. In each row, each data column is separated, or delimited, by a particular character. There are all kinds of characters used as delimiters, from ampersands to pipes, but the comma is most commonly used; hence the name of the file type you'll see most often is *comma-separated values* (CSV). The terms *CSV* & *comma-delimited* are interchangeable.

Here's a typical data row you might see in a comma-delimited file:

```
John,Doe,123 Main St.,Hyde Park,NY,845-555-1212
```

Notice that a comma separates each piece of data -- first name, last name, street, town, state, & phone -- without any spaces. The commas tell the software to treat each item as a separate column, upon either import or export. Simple enough.

## Handling Header Rows

A feature you'll often find inside a delimited text file is a *header row*. As the name implies, it's a single row at the top, or *head*, of the file that lists the name of each data column. Often, a header is added when data is exported from a database or a spreadsheet. Here is an example with the delimited row I've been using. Each item in a header row corresponds to its respective column:

```
FIRSTNAME,LASTNAME,STREET,CITY,STATE,PHONE
John,Doe,123 Main St.,Hyde Park,NY,845-555-1212
```

Header rows serve a few purposes. For one, the values in the header row identify the data in each column, which is particularly useful when you're deciphering a file's contents. Second, some database managers (although not PostgreSQL) use the header row to map columns in the delimited text file to the correct columns in the import table. PostgreSQL doesn't use the header row, so we don't want to import that row to a table. We use the `HEADER` option in the `COPY` command to exclude it.

## Quoting Columns That Contain Delimiters

Using commas as a column delimiter leads to a potential dilemma: what if the value in a column includes a comma? For example, some people combine an apartment number with a street address, as in 123 Main St., Apartment 200. Unless the system for delimiting accounts for that extra comma during import, the line will appear to have an extra column & cause the import to fail.

To handle such cases, delimited files use an arbitrary character called a *text qualifier* to enclose a column that includes the delimiter character. This acts as a signal to ignore that delimiter & treat everything between the text qualifiers as a single column. Most of the time, in comma-delimited files, the text qualifier used is the double quote. Here's the example data again, but with the street name column surrounded by double quotes:

```
FIRSTNAME,LASTNAME,STREET,CITY,STATE,PHONE
John,Doe,"123 Main St., Apartment 200",Hyde Park,NY,845-555-1212
```

On import, the database will recognize that double quotes signify one column regardless of whether it finds a delimiter within the quotes. When importing CSV files, PostgreSQL by default ignores delimiters inside double-quoted columns, but you can specify a different text qualifier if your import requires it.

Finally, in CSV mode, if PostgreSQL finds two consecutive text qualifiers inside a double-quoted column, it will remove one. For example, let's say PostgreSQL finds this:

```
"123 Main St."" Apartment 200"
```

If so, it will treat that text as a single column upon import, leaving just one of the qualifiers:

```
123 Main St." Apartment 200
```

A situation like this could indicate an error in the formatting of your CSV file, which is why it's always a good idea to review your data before importing.
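One way to see these quoting rules in action is to watch PostgreSQL apply them in the export direction. The following is a minimal sketch, assuming you run it in `psql` (sending `COPY` output to `STDOUT` may not work in every client). The two street values are made-up test strings, not rows from a real file.

```sql
-- Export two literal values in CSV format to see how the text qualifier
-- is applied. No table is needed; the values are for illustration only.
COPY (
    SELECT street
    FROM (VALUES
        ('123 Main St., Apartment 200'::text),   -- contains the delimiter
        ('123 Main St." Apartment 200'::text)    -- contains the qualifier
    ) AS demo (street)
) TO STDOUT WITH (FORMAT CSV, HEADER);
```

The first value should come back wrapped in double quotes because it contains a comma. The second should come back wrapped in double quotes with its embedded double quote doubled, which is the same form PostgreSQL collapses back to a single quote on import.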

---

# Using COPY to Import Data

To import data from an external file into our database, we first create a table in our database that matches the columns & data types in our source file. Once that's done, the `COPY` statement for the import is just three lines of code:

```
COPY table_name
FROM 'C:/YourDirectory/your_file.csv'
WITH (FORMAT CSV, HEADER);
```

We start the block of code with the `COPY` keyword, followed by the name of the target table, which must already exist in your database. Think of this syntax as meaning, "Copy data to my table called `table_name`".

The `FROM` keyword identifies the full path to the source file, & we enclose the path in single quotes. For example, to import a file located on my desktop, the `FROM` line would read as follows:

```
FROM '/Users/jiehengyu/Desktop/my_file.csv'
```

The `WITH` keyword lets you specify options, surrounded by parentheses, that you use to tailor your input or output file. Here, we specify that the external file should be comma-delimited & that we should exclude the file's header row in the import. It's worth examining all the options in the official [PostgreSQL documentation](https://www.postgresql.org/docs/current/sql-copy.html), but here is a list of the options you'll commonly use.

## Input & Output File Format

Use the `FORMAT format_name` option to specify the type of file you're reading or writing. Format names are `CSV`, `TEXT` or `BINARY`. Very often, you'll work with standard CSV files. In the `TEXT` format, a *tab* character is the delimiter by default (although you can specify another character). You'll rarely use the `BINARY` format, unless you're deep into building technical systems.

## Presence of a Header Row

On import, use `HEADER` to specify that the source file has a header row that you want to exclude. The database will start importing with the second line of the file so that the column names in the header don't become part of the data in the table. Be sure to check your source CSV to make sure this is what you want; not every CSV comes with a header row. On export, using `HEADER` tells the database to include the column names as a header row in the output file, which helps a user understand the file's contents.
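As a rough sketch of the export direction, the statement below writes a table to a CSV file with the column names as its first line. The table name & the output path are placeholders in the same spirit as the import example, so adjust both to your setup.

```sql
-- Export every row of a table to a CSV file.
-- HEADER writes the column names as the first line of the output.
-- table_name and the file path are placeholders, not real objects.
COPY table_name
TO 'C:/YourDirectory/your_export_file.csv'
WITH (FORMAT CSV, HEADER);
```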

## Delimiter

The `DELIMITER 'character'` option lets you specify which character your import or export file uses as a delimiter. The delimiter must be a single character & cannot be a carriage return. If you use `FORMAT CSV`, the assumed delimiter is a comma. I include `DELIMITER` here to show that you have the option to specify a different delimiter if that's how your data arrived. For example, if you received pipe-delimited data, you would treat the option this way: `DELIMITER '|'`.
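Put together with the other options, a pipe-delimited import might look like the following sketch. The table name & file path are placeholders.

```sql
-- Import a pipe-delimited file that has a header row.
-- FORMAT CSV keeps CSV quoting rules; DELIMITER overrides the default comma.
COPY table_name
FROM 'C:/YourDirectory/your_file.txt'
WITH (FORMAT CSV, HEADER, DELIMITER '|');
```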

## Quote Character

Earlier, we learned that in a CSV file, commas inside a single column value will mess up your import unless the column value is surrounded by a character that serves as a text qualifier, telling the database to handle the value within as one column. By default, PostgreSQL uses the double quote, but if the CSV you're importing uses a different character for the text qualifier, you can specify it with the `QUOTE 'quote_character'` option.
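For instance, if a file used single quotes as the text qualifier, the import could be sketched as follows. The table name & path are placeholders, & the choice of a single quote is only an assumption for illustration.

```sql
-- Import a CSV whose text qualifier is a single quote instead of a double quote.
-- '''' is a SQL string literal that holds one single-quote character:
-- the outer quotes delimit the string, and the doubled quote inside is the
-- escaped character itself.
COPY table_name
FROM 'C:/YourDirectory/your_file.csv'
WITH (FORMAT CSV, HEADER, QUOTE '''');
```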

Now that you better understand delimited files, you're ready to import one.

---

# Importing Census Data Describing Counties

The dataset we'll work with in this import exercise is considerably larger than the `teachers` table we created in our previous lessons. It contains census population estimates for every county in the United States & is 3,142 rows deep & 16 columns wide. (Census counties include some geographies with other names: parishes in Louisiana, boroughs & census areas in Alaska, & cities, particularly in Virginia.)

To understand the data, it helps to know a little about the US Census Bureau, a federal agency that tracks the nation's demographics. Its best-known program is a full count of the population it undertakes every 10 years, most recently in 2020. That data, which enumerates the age, gender, race, & ethnicity of each person in the country, is used to determine how many members from each state make up the 435-member US House of Representatives. In recent decades, faster-growing states such as Texas & Florida have gained seats, while slower-growing states such as New York & Ohio have lost representatives in the House.

The data we'll work with are the census' annual population estimates. These use the most recent 10-year census count as a base, & they factor in births, deaths, & domestic & international migration to produce population estimates each year for the nation, states, counties, & other geographies. In lieu of an annual physical count, it's the best way to get an updated measure on how many people live where in the United States. For this exercise, we have compiled select columns from the 2019 US Census county-level population estimates (plus a few descriptive columns from census geographic data) into a file named *us_counties_pop_est_2019.csv*. You should have this file on your computer if you downloaded the course data in our first lesson.

Open the file with a text editor. You should see a header row that begins with these columns:

```
state_fips,county_fips,region,state_name,county_name,area_land,area_water,internal_point_lat,internal_point_lon,pop_est_2018,pop_est_2019,...
```

Let's explore the columns by examining the code for creating the import table.

## Creating the us_counties_pop_est_2019 Table

The SQL code below shows the `CREATE TABLE` script. In pgAdmin, click the `analysis` database that you created, then select **Tools -> Query Tool** from the menu bar. 
Run the script below in the Query Tool window.

```
CREATE TABLE us_counties_pop_est_2019 (
    state_fips text,
    county_fips text,
    region smallint,
    state_name text,
    county_name text,
    area_land bigint,
    area_water bigint,
    internal_point_lat numeric(10, 7),
    internal_point_lon numeric(10, 7),
    pop_est_2018 integer,
    pop_est_2019 integer,
    births_2019 integer,
    deaths_2019 integer,
    international_migr_2019 integer,
    domestic_migr_2019 integer,
    residual_2019 integer,
    CONSTRAINT counties_2019_key 
        PRIMARY KEY (state_fips, county_fips)
);
```

Return to the main pgAdmin window, & in the object browser, right-click & refresh the `analysis` database. Choose **Schemas -> public -> Tables** to see the new table. Although it's empty, you can see the structure by running a basic `SELECT` query in pgAdmin's Query Tool:

```
SELECT * FROM us_counties_pop_est_2019;
```

<img src = "CREATE TABLE Statement for Census County Population Estimates.png" width = "600" style = "margin:auto"/>

When you run the `SELECT` query, you'll see the columns of the table you created in the pgAdmin Data Output pane. No data rows exist yet. We need to import them.

## Understanding Census Columns & Data Types

Before we import the CSV file into the table, let's walk through several of the columns & data types we chose. In this set of census data, each row displays the population estimates & components of annual change (births, deaths, migration) for one county. The first two columns are the county's `state_fips` & `county_fips`, which are the standard federal codes for these entities. We use `text` for both because the codes can contain leading zeros that would be lost if we stored the values as integers. For example, Alaska's `state_fips` is `02`. If we use an integer type, that leading `0` would be stripped on import, leaving `2`, which is the wrong code for the state. Also, we won't be doing any math with this value, so we don't need integers. It's always important to distinguish codes from numbers; these state & county values are actually labels as opposed to numbers used for math.
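A quick way to see the leading-zero problem is to cast the same code both ways. This is just an illustrative query; the column aliases are made up.

```sql
-- Cast Alaska's FIPS code to integer and to text to compare the results.
SELECT '02'::integer AS fips_as_integer,  -- the leading zero is dropped
       '02'::text    AS fips_as_text;     -- the code is kept exactly as written
```

The integer column should display `2`, while the text column keeps `02`.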

Numbers from 1 to 4 in `region` represent the general location of a county in the United States: Northeast, Midwest, South, & West. No number is higher than 4, so we define the column with type `smallint`. The `state_name` & `county_name` columns contain the complete name of both the state & county, stored as `text`.

The number of square meters for land & water in the county are recorded in `area_land` & `area_water`, respectively. The two, combined, make up a county's total area. In certain places -- such as Alaska, where there's lots of land to go with all that snow -- some values easily surpass the `integer` type's maximum of 2,147,483,647. For that reason, we're using `bigint`, which will handle the 377,038,836,685 square meters of land in the Yukon-Koyukuk census area with room to spare.
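To check the size limits yourself, you can cast the Yukon-Koyukuk figure to each type. This is a small sketch; the alias is illustrative.

```sql
-- The land area of the Yukon-Koyukuk census area fits comfortably in bigint.
SELECT 377038836685::bigint AS area_land;

-- Uncommenting the next line should raise an "integer out of range" error,
-- because the value exceeds the integer maximum of 2,147,483,647.
-- SELECT 377038836685::integer;
```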

The latitude & longitude of a point near the center of the county, called an *internal point*, are specified in `internal_point_lat` & `internal_point_lon`, respectively. The Census Bureau -- along with many mapping systems -- expresses latitude & longitude coordinates using a *decimal degrees* system. *Latitude* represents positions north & south on the globe, with the equator at 0 degrees, the North Pole at 90 degrees, & the South Pole at -90 degrees. *Longitude* represents locations east & west, with the *Prime Meridian* that passes through Greenwich in London at 0 degrees longitude. From there, longitude increases both east & west (positive numbers to the east & negative to the west) until the two meet at 180 degrees on the opposite side of the globe. The location there, known as the *antimeridian*, is used as the basis for the *International Date Line*.

When reporting interior points, the Census Bureau uses up to seven decimal places. With a value up to 180 to the left of the decimal, we need to account for a maximum of 10 digits total. So we're using `numeric` with a precision of `10` & a scale of `7`.
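The effect of that precision & scale can be sketched with a few casts. The coordinate values here are invented for illustration & are not taken from the census file.

```sql
-- numeric(10, 7) allows 10 digits in total, 7 of them after the decimal
-- point, which leaves 3 digits to the left (enough for values up to 180).
SELECT 179.1234567::numeric(10, 7)    AS three_digit_lon,
       (-66.9876543)::numeric(10, 7)  AS negative_lon,
       64.12345678::numeric(10, 7)    AS rounded_to_scale;  -- 8th decimal is rounded away

-- Uncommenting the next line should raise a "numeric field overflow" error,
-- because four digits to the left of the decimal do not fit.
-- SELECT 1000.1234567::numeric(10, 7);
```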

Next, we reach a series of columns that contain the county's population estimates & components of change. The table lists their definitions:

|Column name|Description|
|:---|:---|
|`pop_est_2018`|Estimated population on July 1, 2018|
|`pop_est_2019`|Estimated population on July 1, 2019|
|`births_2019`|Number of births from July 1, 2018 to June 30, 2019|
|`deaths_2019`|Number of deaths from July 1, 2018 to June 30, 2019|
|`international_migr_2019`|Net international migration from July 1, 2018 to June 30, 2019|
|`domestic_migr_2019`|Net domestic migration from July 1, 2018 to June 30, 2019|
|`residual_2019`|Number used to adjust estimates for consistency|

Finally, the `CREATE TABLE` statement ends with a `CONSTRAINT` clause specifying that the columns `state_fips` & `county_fips` will serve as the table's primary key. This means that the combination of those columns is unique for every row in the table. Let's run the import.
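To see what a two-column primary key enforces, here is a small sketch using a throwaway temporary table. The table name `fips_key_demo`, its constraint name, & the sample codes are hypothetical & exist only for this demonstration.

```sql
-- A temporary table with the same style of composite primary key.
CREATE TEMPORARY TABLE fips_key_demo (
    state_fips text,
    county_fips text,
    CONSTRAINT fips_key_demo_key
        PRIMARY KEY (state_fips, county_fips)
);

-- Allowed: the county code repeats, but the state code differs,
-- so each combination is still unique.
INSERT INTO fips_key_demo (state_fips, county_fips) VALUES ('01', '001');
INSERT INTO fips_key_demo (state_fips, county_fips) VALUES ('02', '001');

-- Uncommenting the next line should fail with a duplicate key error,
-- because the combination ('01', '001') already exists.
-- INSERT INTO fips_key_demo (state_fips, county_fips) VALUES ('01', '001');
```

The same rule applies during `COPY`: if the source file contained two rows with the same state & county codes, the import would stop with an error rather than load a duplicate.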

## Performing the Census Import with COPY

Now we're ready to bring the census data into the table. Run the SQL code below, remembering to change the path to the file to match the location of the data on your computer.

```
COPY us_counties_pop_est_2019
FROM '/YourDirectory/us_counties_pop_est_2019.csv'
WITH (FORMAT CSV, HEADER);
```
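Once the `COPY` statement finishes, a couple of quick queries can confirm that the rows arrived. This is one possible check rather than a required step; the choice of columns & the `LIMIT` are arbitrary.

```sql
-- Count the imported rows. If the whole file loaded, the result should
-- match the 3,142 rows described earlier.
SELECT count(*) FROM us_counties_pop_est_2019;

-- Peek at a few rows to confirm the values landed in the expected columns.
SELECT state_fips, county_fips, county_name, state_name, pop_est_2019
FROM us_counties_pop_est_2019
ORDER BY state_fips, county_fips
LIMIT 5;
```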

