# Basic Math & Stats with SQL

If your data contains any of the number data types discussed before -- integers, decimals, or floating points -- sooner or later your analysis will need to include some calculations. SQL can handle calculations from basic math through to advanced statistics.

---

# Understanding Math Operators & Functions

The table below shows nine math operators you'll use most often in your calculations. The first four (addition, subtraction, multiplication, & division) are part of the ANSI SQL standard & are implemented in all database systems. The others are PostgreSQL-specific operators, although most other database managers likely have functions or operators to perform those operations too. For example, the modulo operator (`%`) works in Microsoft SQL Server & MySQL as well as in PostgreSQL. If you're using another database system, check its documentation.

|Operator|Description|
|:---|:---|
|+|Addition|
|-|Subtraction|
|*|Multiplication|
|/|Division (returns quotient only)|
|%|Modulo (returns remainder only)|
|^|Exponentiation|
|&#124;/|Square root|
|&#124;&#124;/|Cube root|
|!|Factorial (PostgreSQL 13 & earlier; use the `factorial()` function in version 14 & later)|

## Adding, Subtracting, & Multiplying

Let's start with simple integer addition, subtraction, & multiplication. Here are three examples, each with the `SELECT` keyword followed by the math formula.

```
SELECT 2 + 2;
SELECT 9 - 1;
SELECT 3 * 4;
```

None of these statements is rocket science, so you shouldn't be surprised that `SELECT 2 + 2;` in the Query Tool shows a result of `4`. Similarly, the examples for subtraction & multiplication yield what you'd expect: `8` & `12`. The output displays in a column, as with any query result. But because we're not querying a table or specifying a column, the results appear beneath a `?column?` name, signifying an unknown column:

<img src = "Addition.png" width = "600" style = "margin:auto"/>
<img src = "Subtraction.png" width = "600" style = "margin:auto"/>
<img src = "Multiplication.png" width = "600" style = "margin:auto"/>

That's okay. We're not affecting any data in a table, just displaying a result. If you want to display a column name, you can provide an alias, as in `SELECT 3 * 4 AS result;`.

## Performing Division & Modulo

Division with SQL gets a little trickier because of the difference between math with integers & math with decimals. Add in *modulo*, an operator that returns just the *remainder* of the division operation, & the results can be confusing.

```
SELECT 11 / 6;
SELECT 11 % 6;
SELECT 11.0 / 6;
SELECT CAST (11 AS numeric(3, 1)) / 6;
```

The first statement uses the `/` operator to divide the integer `11` by another integer, `6`. If you do that math in your head, you know the answer is `1` with a remainder of `5`. However, running this query yields only `1`, which is how SQL handles division of one integer by another: it reports the integer *quotient* without any remainder.

<img src = "Integer Division in SQL 1.png" width = "600" style = "margin:auto"/>

If you want to retrieve the *remainder* as an integer, you must perform the same calculation using the modulo operator `%`. That statement returns just the remainder, in this case `5`.

<img src = "Integer Division in SQL 2.png" width = "600" style = "margin:auto"/>

No single operation today will provide you with both the quotient & remainder as integers. 

Modulo is useful for more than just fetching the remainder: you can use it as a test condition. For example, to check whether a number is even, you can test it using the `% 2` operation. If the result is `0` with no remainder, the number is even.
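The even-number test described above can be sketched as a short query. This is a minimal illustration, assuming PostgreSQL's `generate_series()` function to supply the integers 1 through 6; the column aliases `remainder` & `is_even` are made up for this example.

```sql
-- generate_series(1, 6) produces one row for each integer from 1 to 6.
SELECT n,
       n % 2 AS remainder,      -- 0 for even numbers, 1 for odd numbers
       n % 2 = 0 AS is_even     -- true when dividing by 2 leaves no remainder
FROM generate_series(1, 6) AS n;
```

The same `n % 2 = 0` expression could be placed in a `WHERE` clause to keep only the even rows.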

There are two ways to divide two numbers & have the result return as a `numeric` type. First, if one or both of the numbers is a `numeric`, the result will by default be expressed as `numeric`. That happens when we divide `11.0` by `6`. The result is `1.83333`. The number of decimal digits displayed may vary according to your PostgreSQL & system settings.

<img src = "Decimal Division in SQL 1.png" width = "600" style = "margin:auto"/>

Second, if you're working with data stored only as integers & need to force decimal division, you can use `CAST` to convert one of the integers to a `numeric` type. Executing this also returns `1.83333`.

<img src = "Decimal Division in SQL 2.png" width = "600" style = "margin:auto"/>
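As a small sketch of the casting approach, the query below forces decimal division in two equivalent ways & then trims the result with `round()`. The aliases are illustrative only, & the `::` shorthand is PostgreSQL-specific rather than standard SQL.

```sql
SELECT CAST(11 AS numeric) / 6 AS cast_division,       -- standard SQL cast
       11::numeric / 6 AS shorthand_division,          -- PostgreSQL double-colon cast
       round(11::numeric / 6, 2) AS rounded;           -- keeps two decimal places: 1.83
```

Casting either the dividend or the divisor is enough; once one side is `numeric`, the whole division is carried out as `numeric`.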

## Using Exponents, Roots, & Factorials

Beyond the basics, PostgreSQL also provides operators & functions to square, cube or otherwise raise a base number to an exponent as well as find roots or the factorial of a number.

```
SELECT 3 ^ 4;
SELECT |/ 10;
SELECT sqrt(10);
SELECT ||/ 10;
SELECT factorial(4);
```

The exponentiation operator (`^`) allows you to raise a given base number to an exponent, where `3 ^ 4` returns `81`.

<img src = "Exponents in SQL.png" width = "600" style = "margin:auto"/>

You can find the square root of a number in two ways: using the `|/` operator or the `sqrt(n)` function. 

<img src = "Square Root in SQL 1.png" width = "600" style = "margin:auto"/>
<img src = "Square Root in SQL 2.png" width = "600" style = "margin:auto"/>

For a cube root, use the `||/` operator. Both are *prefix operators*, because they come before a single value.

<img src = "Cube Root in SQL.png" width = "600" style = "margin:auto"/>

To find the *factorial* of a number, you can use the `factorial(n)` function. 

<img src = "Factorial in SQL.png" width = "600" style = "margin:auto"/>

Again, these operators are specific to PostgreSQL; they're not part of the SQL standard. 
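If you prefer function calls to symbolic operators, PostgreSQL also offers `power()` & `cbrt()` alongside `sqrt()` & `factorial()`. The following is a brief sketch with illustrative aliases, not a replacement for the operator examples above.

```sql
SELECT power(3, 4) AS exponent_fn,     -- same calculation as 3 ^ 4
       sqrt(10) AS square_root_fn,     -- same calculation as |/ 10
       cbrt(27) AS cube_root_fn,       -- same calculation as ||/ 27
       factorial(5) AS factorial_fn;   -- 5 * 4 * 3 * 2 * 1 = 120
```

Function names like these tend to be easier to search for in the documentation of other database systems than the PostgreSQL-specific operators.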

## Minding the Order of Operations

You may recall the order of operations for a mathematical expression. SQL follows the established math standard. For the PostgreSQL operators discussed so far, the order is as follows:

* Exponents & roots
* Multiplication, division, modulo
* Addition & subtraction

Given these rules, you'll need to encase an operation in parentheses if you want to calculate it in a different order. For example, the following two expressions yield different results:

```
SELECT 7 + 8 * 9;
SELECT (7 + 8) * 9;
```

The first expression returns `79` because the multiplication operation takes precedence & is processed before the addition. The second returns `135` because the parentheses force the addition operation to occur first.

<img src = "PEMDAS 1.png" width = "600" style = "margin:auto"/>
<img src = "PEMDAS 2.png" width = "600" style = "margin:auto"/>

Keep operator precedence in mind to avoid having to correct your analysis later.
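The same rule applies to exponents, which sit at the top of the precedence list. A short pair of statements makes the difference visible:

```sql
SELECT 3 ^ 3 - 1;    -- exponent is evaluated first: 27 - 1 = 26
SELECT 3 ^ (3 - 1);  -- parentheses are evaluated first: 3 ^ 2 = 9
```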

---

# Doing Math Across Census Table Columns

Let's use our newly learned SQL math operators on real data by digging into the 2019 US Census population estimates table, `us_counties_pop_est_2019`. Let's retrieve a subset of the dataset that we will use for our upcoming calculations.

```
SELECT county_name AS county,
       state_name AS state,
       pop_est_2019 AS pop,
       births_2019 AS births,
       deaths_2019 AS deaths,
       international_migr_2019 AS int_migr,
       domestic_migr_2019 AS dom_migr,
       residual_2019 AS residual
FROM us_counties_pop_est_2019;
```

## Adding & Subtracting Columns

Let's try a simple calculation using two of the columns. Subtract the number of deaths from the number of births in each county, a measure the census refers to as natural increase. Let's see what this shows.

```
SELECT county_name AS county,
       state_name AS state,
       births_2019 AS births,
       deaths_2019 AS deaths,
       births_2019 - deaths_2019 AS natural_increase
FROM us_counties_pop_est_2019
ORDER BY state_name, county_name;
```

Providing `births_2019 - deaths_2019` as one of the columns in the `SELECT` statement handles the calculation. We use the `AS` keyword to provide a readable alias for the column. If you don't provide an alias, PostgreSQL uses the label `?column?`, which is far less helpful.

Run the query to see the results.

<img src = "Subtracting Two Columns in us_counties_pop_est_2019.png" width = "600" style = "margin:auto"/>

A quick check confirms that the `natural_increase` column equals the difference between the two columns we subtracted. Notice as you scroll through the output that some counties have more births than deaths, while others have the opposite. Typically, counties with a younger mix of residents see births outpace deaths; those with an older set of people -- think rural areas & retirement hotspots -- tend to see a greater number of deaths than births.
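To look only at the counties where deaths outnumber births, one option is to repeat the subtraction in a `WHERE` clause. This is a sketch built on the same columns as the query above; note that PostgreSQL does not allow a column alias in `WHERE`, so the expression is written out again there.

```sql
SELECT county_name AS county,
       state_name AS state,
       births_2019 - deaths_2019 AS natural_increase
FROM us_counties_pop_est_2019
WHERE births_2019 - deaths_2019 < 0   -- keep only counties with more deaths than births
ORDER BY natural_increase;            -- largest natural decrease first
```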

Let's build on this to test our data & validate that we imported columns correctly. The population estimate for 2019 should equal the sum of the 2018 estimate, births, deaths, migration, & residual factor.

```
SELECT county_name AS county,
       state_name AS state,
       pop_est_2019 AS pop,
       pop_est_2018 + births_2019 - deaths_2019 +
           international_migr_2019 + domestic_migr_2019 +
           residual_2019 AS components_total,
       pop_est_2019 - (pop_est_2018 + births_2019 -
           deaths_2019 + international_migr_2019 +
           domestic_migr_2019 + residual_2019) AS difference
FROM us_counties_pop_est_2019
ORDER BY difference DESC;
```

This query includes the 2019 population estimate, followed by a calculation adding the components to the 2018 population estimate as `components_total`. The 2018 estimate plus the components should equal the 2019 estimate. Rather than manually check, we add a column that subtracts the components total from the 2019 estimate. That column, named `difference`, should contain a zero in each row if all the data is in the right place. To avoid having to scan all 3,142 rows, we add an `ORDER BY` clause on the named column. Any rows showing a difference should appear at the top or bottom of the query result.

<img src = "Checking Census Totals.png" width = "600" style = "margin:auto"/>

With the `difference` column showing zeros, we can be confident that our import was clean.
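Another way to run the same validation, sketched here as an alternative rather than the lesson's own method, is to count the rows where the difference is not zero. The alias `mismatched_rows` is made up for this example; a count of `0` would mean every row balances.

```sql
SELECT count(*) AS mismatched_rows
FROM us_counties_pop_est_2019
WHERE pop_est_2019 - (pop_est_2018 + births_2019 -
          deaths_2019 + international_migr_2019 +
          domestic_migr_2019 + residual_2019) <> 0;  -- rows whose components don't add up
```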

## Finding Percentages of the Whole

One way to spot differences in the items in a dataset is to calculate the percentage of the whole that a particular data point represents. Then you can glean meaningful insights -- & sometimes surprises -- by comparing that percentage across all the items in your dataset.

We'll try this on the census population estimates using the two columns that represent the size of each county's geographical features. The columns `area_land` & `area_water` show a county's land & water measurement in square meters. Using these two columns, we can calculate for each county the percentage of its area that is made up of water.

```
SELECT county_name AS county,
       state_name AS state,
       area_water::numeric / (area_land + area_water)
          * 100 AS pct_water
FROM us_counties_pop_est_2019
ORDER BY pct_water DESC;
```

The key piece of this query divides `area_water` by the sum of `area_land` & `area_water`, which together represent the total area of the county.

If we use the data as their original integer types, we won't get the fractional result we need: every row will display a result of 0, the quotient. Instead, we force decimal division by casting one of the integers to the `numeric` type. Here, for brevity, we use the PostgreSQL-specific double-colon notation after the first reference to `area_water`, but we could have used the ANSI SQL standard `CAST` function as well. Finally, we multiply the result by 100 to present it as a fraction of 100 -- the way most people understand percentages.

By sorting from highest to lowest percentage, the top of the output is as follows:

<img src = "Calculating Percent of a County's Area that is Water.png" width = "600" style = "margin:auto"/>

If you check the Wikipedia entry for Keweenaw County, you'll discover the reason why its total area is more than 90 percent water: its land area includes an island in Lake Superior, & the lake's waters are included in the total reported by the census.
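For comparison, here is a sketch of the same percentage query written with the standard `CAST` function instead of the double-colon shorthand. The `round()` call limiting the output to two decimal places is an addition for readability, not part of the original query.

```sql
SELECT county_name AS county,
       state_name AS state,
       round(
           CAST(area_water AS numeric) / (area_land + area_water) * 100,
           2                              -- keep two decimal places
       ) AS pct_water
FROM us_counties_pop_est_2019
ORDER BY pct_water DESC;
```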

## Tracking Percent Change

Another key indicator in data analysis is percent change. Percent change calculations are often employed when analysing change over time, & they're particularly useful for comparing change among similar items.

Let's try this with a small collection of test data related to spending in departments of a hypothetical local government. We'll calculate which departments had the greatest percentage increase or decrease.

```
CREATE TABLE percent_change (
    department text,
    spend_2019 numeric(10, 2),
    spend_2022 numeric(10, 2)
);

INSERT INTO percent_change
VALUES ('Assessor', 178556, 179500),
       ('Building', 250000, 289000),
       ('Clerk', 451980, 650000),
       ('Library', 87777, 90001),
       ('Parks', 250000, 223000),
       ('Water', 199000, 195000);

SELECT department,
       spend_2019,
       spend_2022,
       round((spend_2022 - spend_2019) /
           spend_2019 * 100, 1) AS pct_change
FROM percent_change;
```

We create a small table called `percent_change` & insert six rows with data on department spending for the years 2019 & 2022. The percent change formula subtracts `spend_2019` from `spend_2022` & then divides by `spend_2019`. We multiply by 100 to express the result as a portion of 100.

To simplify the output, we add a `round()` function to remove all but one decimal place. The function takes two arguments: the column or expression to be rounded & the number of decimal places to display. Since both numbers are type `numeric`, the result will also be `numeric`.

<img src = "Calculating Percent Change.png" width = "600" style = "margin:auto"/>

Now, it's just a matter of finding out why the Clerk department's spending has outpaced others in the town.
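To rank the departments instead of reading down the list, one possibility is to sort on the calculated column. This sketch reuses the `percent_change` table & the same formula, (new value - old value) / old value * 100, & only adds an `ORDER BY` clause.

```sql
SELECT department,
       round((spend_2022 - spend_2019) /
           spend_2019 * 100, 1) AS pct_change
FROM percent_change
ORDER BY pct_change DESC;   -- largest increase first, largest decrease last
```

For the Clerk row, the arithmetic works out to (650000 - 451980) / 451980 * 100, or roughly 43.8 percent.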

---

# Using Aggregate Functions for Averages & Sums

SQL lets you calculate a result from multiple values within the same column using *aggregate functions*. You can see a full list of PostgreSQL aggregates, which calculate a single result from multiple inputs, at [https://www.postgresql.org/docs/current/functions-aggregate.html](https://www.postgresql.org/docs/current/functions-aggregate.html). Two of the most-used aggregate functions in data analysis are `avg()` & `sum()`.

Returning to the `us_counties_pop_est_2019` census table, it's reasonable to want to calculate the total population of all counties plus the average population of all counties. Using `avg()` & `sum()` on column `pop_est_2019` (the population estimate for 2019) makes it easy, as shown below. Again, we use the `round()` function to remove numbers after the decimal point in the average calculation.

```
SELECT sum(pop_est_2019) AS county_sum,
       round(avg(pop_est_2019), 0) AS county_average
FROM us_counties_pop_est_2019;
```

This calculation produces the following result:

<img src = "Using sum() & avg() Aggregate Functions.png" width = "600" style = "margin:auto"/>

The estimated population for all counties in the United States in 2019 added up to approximately 328.2 million, & the average of the county population estimates was 104,468.
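To see how the average relates to the sum, the sketch below divides the total by the number of counties & places it next to `avg()`. The `count()` aggregate & the aliases `county_count` & `manual_average` are additions for this illustration; the two average columns should agree.

```sql
SELECT count(pop_est_2019) AS county_count,       -- number of non-null population values
       sum(pop_est_2019) AS county_sum,
       round(sum(pop_est_2019)::numeric /
           count(pop_est_2019), 0) AS manual_average,   -- sum divided by count
       round(avg(pop_est_2019), 0) AS county_average    -- built-in average
FROM us_counties_pop_est_2019;
```

The cast to `numeric` is there for the same reason as in the division examples earlier: without it, dividing one integer by another would discard the fractional part before rounding.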

---

# Finding the Median

The *median* value in a set of numbers is as important an indicator as the average, if not more so, especially when the data is skewed rather than normally distributed.

## Finding the Median with Percentiles

PostgreSQL (as with most relational databases) does not have a built-in `median()` function. It's also not included in the ANSI SQL standard. Instead, we have to use a SQL *percentile* function to find the median & use *quantiles* or *cut points* to divide a group of numbers into equal sizes. Percentile functions are part of standard ANSI SQL.

The median is equivalent to the 50th percentile -- half the values are below & half above. There are two versions of the percentile function, `percentile_cont(n)` & `percentile_disc(n)`. Both functions are part of the ANSI SQL standard & are present in PostgreSQL, Microsoft SQL Server, & other databases.

The `percentile_cont(n)` function calculates percentiles as *continuous* values. That is, the result does not have to be one of the values in the dataset, but can be a decimal value in between two numbers. This follows the methodology for calculating medians on an even number of values, where the median is the average of the two middle numbers. The `percentile_disc(n)` function returns only *discrete* values, meaning the result will be rounded to one of the numbers in the set.

We can see how this works by creating a test table with six numbers to find the percentiles.

```
CREATE TABLE percentile_test (numbers integer);

INSERT INTO percentile_test
VALUES (1), (2), (3), (4), (5), (6);

SELECT percentile_cont(0.5)
           WITHIN GROUP (ORDER BY numbers),
       percentile_disc(0.5)
           WITHIN GROUP (ORDER BY numbers)
FROM percentile_test;
```

In both the continuous & discrete percentile functions, we enter `0.5` to represent the 50th percentile, equivalent to the median. Running the code returns this:

<img src = "Testing SQL Percentile Functions.png" width = "600" style = "margin:auto"/>

The `percentile_cont()` function returned what we'd expect the median to be, `3.5`. But because `percentile_disc()` calculates discrete values, it reports `3`, the last value in the first 50 percent of numbers. Because the accepted method of calculating medians is to average the two middle values in an even-numbered set, use `percentile_cont(0.5)` to find the median.
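The two functions only disagree when the median falls between two values. As a quick sketch, the query below runs both on an odd-sized set of five numbers supplied inline with a `VALUES` list (the names `t` & `n` are placeholders), where there is a single middle value for both to return.

```sql
SELECT percentile_cont(0.5)
           WITHIN GROUP (ORDER BY n) AS median_cont,   -- 3
       percentile_disc(0.5)
           WITHIN GROUP (ORDER BY n) AS median_disc    -- 3
FROM (VALUES (1), (2), (3), (4), (5)) AS t(n);
```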

## Finding Median & Percentiles with Census Data

Our census data can show how a median tells a different story than an average. The below SQL adds `percentile_cont()` alongside the `sum()` & `avg()` aggregates we've used so far to find the sum, average, & median population of all counties.

```
SELECT sum(pop_est_2019) AS county_sum,
       round(avg(pop_est_2019), 0) AS county_average,
       percentile_cont(0.5)
           WITHIN GROUP (ORDER BY pop_est_2019)
           AS county_median
FROM us_counties_pop_est_2019;
```

Your result should be:

<img src = "Using sum(), avg(), & percentile_cont() Aggregate Function.png" width = "600" style = "margin:auto"/>

The median & average are far apart, which shows that averages can mislead. As of 2019 estimates, half the counties in America had fewer than 25,726 people, whereas half had more. If you gave a presentation on US demographics & told the audience that the "average county in America has 104,468 people", they'd walk away with a skewed picture of reality. More than 40 counties were estimated to have a million or more people in 2019, & Los Angeles County had more than 10 million. That pushed the average higher.
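You can check the claim about very large counties yourself. The following sketch counts the counties at or above one million people; the alias `million_plus_counties` is made up for this example.

```sql
SELECT count(*) AS million_plus_counties
FROM us_counties_pop_est_2019
WHERE pop_est_2019 >= 1000000;   -- counties with a million or more residents
```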

## Finding Other Quantiles with Percentile Functions

You can also slice data into smaller equal groups for analysis. Most common are *quartiles* (four equal groups), *quintiles* (five groups), & *deciles* (10 groups). To find any individual value, you can just plug it into a percentile function. To find the value marking the first quartile or the lowest 25 percent of data, you'd use a value of `0.25`.

However, entering values one at a time is laborious if you want to generate multiple cut points. Instead, you can pass values into `percentile_cont()` using an *array*, a list of items.

The code below shows how to calculate all three quartile cut points at once.

```
SELECT percentile_cont(ARRAY[0.25, 0.5, 0.75])
           WITHIN GROUP (ORDER BY pop_est_2019)
           AS quartiles
FROM us_counties_pop_est_2019;
```

In this example, we create our cut points by enclosing values in an *array constructor* called `ARRAY[]`. An array constructor is an expression that builds an array from the elements included between the square brackets. Inside the brackets, we provide comma-separated values representing the three points at which to cut to create four quartiles. Run the query to see the output.

<img src = "Passing Array of Values to percentile_cont().png" width = "600" style = "margin:auto"/>

Because we passed an array, PostgreSQL returns an array, denoted in the results by curly brackets. The cut points are separated by commas. The first quartile is 10,902.5, which means that 25 percent of counties have a population that is equal to or lower than this value. The second quartile is the same as the median: 25,726. The third quartile is 68,072.75, meaning the largest 25 percent of counties have a population at least this large.

Arrays are defined in the ANSI SQL standard, & our use here is just one of several ways you can work with arrays in PostgreSQL. You can, for example, define a table column as an array of a particular data type. That's useful if you want to store multiple values in a single database column, such as a collection of tags for a blog post, instead of storing them in a separate table. See the [PostgreSQL documentation](https://www.postgresql.org/docs/current/arrays.html) for examples of declaring, searching, & modifying arrays.
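As a rough sketch of the blog-tag idea, the statements below define an array column & search it. The `blog_posts` table, its columns, & its rows are hypothetical & exist only for this illustration.

```sql
-- Hypothetical table: each post stores its tags in a single text[] column.
CREATE TABLE blog_posts (
    title text,
    tags text[]
);

INSERT INTO blog_posts
VALUES ('Math with SQL', ARRAY['sql', 'math']),
       ('Importing data', ARRAY['sql', 'csv']);

-- Find posts tagged 'math' & report how many tags each one has.
SELECT title,
       array_length(tags, 1) AS tag_count   -- length of the array's first dimension
FROM blog_posts
WHERE 'math' = ANY (tags);                  -- true if any element equals 'math'
```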

Arrays also come with a host of functions that allow you to perform tasks such as adding or removing values, or counting the elements. A handy function for working with the result of our previous query is `unnest()`, which makes the array easier to read by turning it into rows.

```
SELECT unnest(
    percentile_cont(ARRAY[0.25, 0.5, 0.75])
        WITHIN GROUP (ORDER BY pop_est_2019)
) AS quartiles
FROM us_counties_pop_est_2019;
```

Now the output should be in rows:

<img src = "Using unnest() to Turn Arrays into Rows.png" width = "600" style = "margin:auto"/>

If we're computing deciles, pulling them from the resulting array & displaying them in rows would be very helpful.
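A decile version might look like the sketch below, which passes nine cut points to `percentile_cont()` & uses `unnest()` to return one row per cut point. Only the array values & the `deciles` alias differ from the quartile query.

```sql
SELECT unnest(
    percentile_cont(ARRAY[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
        WITHIN GROUP (ORDER BY pop_est_2019)
) AS deciles   -- nine cut points divide the counties into ten groups
FROM us_counties_pop_est_2019;
```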

---

# Finding the Mode

We can find the *mode*, the value that appears most often, using the PostgreSQL `mode()` function. The function is not part of standard SQL & has a syntax similar to the percentile functions. The below code shows a `mode()` calculation on `births_2019`, the column showing the number of babies born.

```
SELECT mode() WITHIN GROUP (ORDER BY births_2019)
FROM us_counties_pop_est_2019;
```

The result is `86`, a number of births shared by 16 counties.

<img src = "Finding the Most Frequent Value with mode().png" width = "600" style = "margin:auto"/>
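One way to double-check the mode, & to see how many counties share it, is to count the occurrences of each value yourself. This sketch relies on `GROUP BY` & `LIMIT`, which this lesson has not covered; the alias `county_count` is made up for the example.

```sql
SELECT births_2019,
       count(*) AS county_count             -- how many counties report this number of births
FROM us_counties_pop_est_2019
GROUP BY births_2019
ORDER BY county_count DESC, births_2019     -- most frequent value first
LIMIT 5;
```

The top row should match the value returned by `mode()`, & its count shows whether any other values come close.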

---

# Wrapping Up

Working with numbers is a key step in acquiring meaning from your data, & with the math skills covered in this lesson, we're ready to handle the foundations of numerical analysis with SQL. We also learned how a median can be a fairer assessment of a group of values than an average. That alone can help you avoid inaccurate conclusions.