# Statistical Functions in SQL

Statistics, as a subject, is worthy of multiple books, so we'll only skim the surface in SQL. We'll learn how to apply high-level statistical concepts to derive meaning from our data using data from the US Census Bureau. We'll also learn to use SQL to create rankings, calculate rates & smooth out time-series data using rolling averages & sums.

---

# Creating a Census Stats Table

For this lesson, we'll be using county data from the 2014-2018 American Community Survey (ACS) 5-Year Estimates.

We'll create a table `acs_2014_2018_stats` & import the CSV file *acs_2014_2018_stats.csv*. The data is available with the course's downloadable resources.

```
CREATE TABLE acs_2014_2018_stats (
    geoid text CONSTRAINT geoid_key PRIMARY KEY,
    county text NOT NULL,
    st text NOT NULL,
    pct_travel_60_min numeric(5, 2),
    pct_bachelors_higher numeric(5, 2),
    pct_masters_higher numeric(5, 2),
    median_hh_income integer,
    CHECK (pct_masters_higher <= pct_bachelors_higher)
);

COPY acs_2014_2018_stats
FROM '/YourDirectory/acs_2014_2018_stats.csv'
WITH (FORMAT CSV, HEADER);

SELECT * FROM acs_2014_2018_stats;
```

<img src = "Creating a 2014-18 ACS 5-Year Estimates Table.png" width = "600" style = "margin:auto"/>

The `acs_2014_2018_stats` table has seven columns. The first three are the unique `geoid` that serves as the primary key, the name of the `county`, & the state name `st`. Both `county` & `st` carry the `NOT NULL` constraint because each row should contain a value. The next four columns hold three percentages for each county plus one economic indicator:

* **pct_travel_60_min**: the percentage of workers ages 16 & older who commute more than 60 minutes to work.
* **pct_bachelors_higher**: the percentage of people ages 25 & older whose level of education is a bachelor's degree or higher. (In the United States, a bachelor's degree is usually awarded upon completing a four-year college education.)
* **pct_masters_higher**: the percentage of people ages 25 & older whose level of education is a master's degree or higher. (In the United States, a master's degree is the first advanced degree earned after completing a bachelor's degree.)
* **median_hh_income**: the county's median household income in 2018 inflation-adjusted dollars.

We include a `CHECK` constraint to ensure that the figures for the bachelor's degree are equal to or higher than those for the master's degree, because in the United States, a bachelor's degree is earned before or concurrently with a master's degree. A county showing the opposite could indicate data imported incorrectly or a column mislabeled. Our data should check out; upon import, there should be no errors showing a violation of the `CHECK` constraint.
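To see how such a constraint behaves, here is a minimal sketch on a scratch table. The table name `check_demo` & the inserted figures are hypothetical, chosen only for illustration; the `CHECK` expression is the same one used above.

```sql
-- Hypothetical scratch table that mirrors the lesson's CHECK constraint
CREATE TEMPORARY TABLE check_demo (
    pct_bachelors_higher numeric(5, 2),
    pct_masters_higher numeric(5, 2),
    CHECK (pct_masters_higher <= pct_bachelors_higher)
);

-- Accepted: the master's figure doesn't exceed the bachelor's figure
INSERT INTO check_demo VALUES (30.50, 12.25);

-- Accepted: a NULL makes the comparison NULL, which CHECK treats as passing
INSERT INTO check_demo VALUES (30.50, NULL);

-- Rejected if uncommented: 30.50 <= 12.25 is false
-- INSERT INTO check_demo VALUES (12.25, 30.50);

SELECT count(*) AS rows_accepted
FROM check_demo;
```

Note that a `CHECK` constraint only rejects rows for which the expression is false, so a missing value in either column would not trigger a violation.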

We use the `SELECT` statement to view all 3,142 rows imported, each corresponding to a county surveyed in this census release.

## Measuring Correlation with corr(Y, X)

*Correlation* describes the statistical relationship between two variables, measuring the extent to which a change in one is associated with a change in the other. We'll use the SQL `corr(Y, X)` function to measure what relationship exists, if any, between the percentage of people in a county who've attained a bachelor's degree & the median household income in that county. We'll also determine whether, according to our data, a better-educated population typically equates to higher income &, if it does, the strength of that relationship.

The table below provides general guidelines for interpreting *r*, the correlation coefficient, which ranges from -1 to 1: the sign gives the direction of the relationship & the absolute value its strength. Different statisticians may offer different interpretations.

|Correlation coefficient (+/-)|What it could mean|
|:---|:--|
|0|No relationship|
|0.01 to 0.29|Weak relationship|
|0.3 to 0.59|Moderate relationship|
|0.6 to 0.99|Strong to nearly perfect relationship|
|1|Perfect relationship|

In standard ANSI SQL & PostgreSQL, we calculate the Pearson correlation coefficient using `corr(Y, X)`. It's one of several *binary aggregate functions* in SQL, so named because these functions accept two inputs. The input `Y` is the *dependent variable* whose variation depends on the value of another variable, & `X` is the *independent variable* whose value doesn't depend on another variable.
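
For reference, the Pearson coefficient follows the standard textbook formula below. It is a general definition rather than anything specific to this dataset; $x_i$ & $y_i$ denote the paired values, & $\bar{x}$ & $\bar{y}$ their means:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Because the formula is symmetric in $x$ & $y$, swapping the two inputs to `corr(Y, X)` leaves the result unchanged. The order matters only for the regression functions that share the same `(Y, X)` convention.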

We'll use `corr(Y, X)` to discover the relationship between education level & income, with income as our dependent variable & education as our independent variable. We'll use `corr(Y, X)` with `median_hh_income` & `pct_bachelors_higher` as inputs.

```
SELECT corr(median_hh_income, pct_bachelors_higher)
           AS bachelors_income_r
FROM acs_2014_2018_stats;
```

Your result should be an *r* value of about 0.7, given as a floating-point `double precision` data type.

<img src = "Using corr(Y, X) to Measure Relationships.png" width = "600" style = "margin:auto"/>

The positive *r* value indicates that as a county's educational attainment increases, so too does the median household income. The relationship isn't perfect, but the *r* value shows the relationship is fairly strong. We can visualise the pattern by plotting the variables on a scatter plot using Excel. Each point represents one US county; the data point's position on the x-axis shows the percentage of the population ages 25 & older that has a bachelor's degree or higher. The data point's position on the y-axis represents the county's median household income.

<img src = "Scatterplot Showing Relationship Between Education & Income.png" width = "500" style = "margin:auto"/>

Notice that although most of the data points are grouped together in the bottom-left corner of the visualisation, they do generally slope upward from left to right. 

## Checking Additional Correlations

Let's calculate the correlation coefficients for the remaining variable pairs.

```
SELECT round(corr(median_hh_income,
                  pct_bachelors_higher)::numeric, 2)
           AS bachelors_income_r,
       round(corr(pct_travel_60_min,
                  median_hh_income)::numeric, 2)
           AS income_travel_r,
       round(corr(pct_travel_60_min,
                  pct_bachelors_higher)::numeric, 2)
           AS bachelors_travel_r
FROM acs_2014_2018_stats;
```

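This time, we round off the decimal values to make the output more readable by wrapping the `corr(Y, X)` function in SQL's `round()` function. Because PostgreSQL's two-argument `round()` accepts only the `numeric` type, we first cast the `double precision` result with `::numeric`.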

<img src = "Using corr(Y, X) on Additional Variables.png" width = "600" style = "margin:auto"/>

The `bachelors_income_r` value is `0.70`, which is the same as our first run but rounded to two decimal places. Compared to `bachelors_income_r`, the other two correlations are weak.

The `income_travel_r` value shows that the correlation between income & the percentage of those who commute more than an hour to work is practically zero. This indicates that a county's median household income bears little connection to how long it takes people to get to work.

The `bachelors_travel_r` value shows that the correlation of bachelor's degrees & lengthy commutes is also low at `-0.14`. The negative value indicates an inverse relationship: as education increases, the percentage of the population that travels more than an hour to work decreases. Although this is interesting, a correlation coefficient that is this close to zero indicates a weak relationship.

When testing for correlation, we need to note some caveats. The first is that even a strong correlation does not imply causality. We can't say that a change in one variable causes a change in the other, only that the changes move together. The second is that correlations should be subject to testing to determine whether they're statistically significant, but we won't cover that in this course.

Nevertheless, the SQL `corr(Y, X)` function is a handy tool for checking correlations between variables.
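
A quick way to build intuition for the sign & size of *r* is to run `corr(Y, X)` on a tiny dataset where the answer is known in advance. The sketch below uses hypothetical inline values (the names `sample_points`, `x`, `y_up`, & `y_down` are made up for illustration): one column rises in lockstep with `x`, the other falls in lockstep.

```sql
-- Hypothetical sample data: y_up rises with x, y_down falls as x rises
WITH sample_points (x, y_up, y_down) AS (
    VALUES (1, 2, 8),
           (2, 4, 6),
           (3, 6, 4),
           (4, 8, 2)
)
SELECT round(corr(y_up, x)::numeric, 2) AS r_up,
       round(corr(y_down, x)::numeric, 2) AS r_down
FROM sample_points;
```

Since both relationships are perfectly linear, the query should return `1.00` for `r_up` & `-1.00` for `r_down`, the two extremes of the scale.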

## Predicting Values with Regression Analysis

Researchers also want to be able to predict values using available data. We can do this with *linear regression*. Simply put, the regression method finds the best linear equation, or straight line, that describes the relationship between an independent variable (such as education) & a dependent variable (such as income). We can then look at points along this line to predict values where we don't have observations. Standard ANSI SQL & PostgreSQL include functions that perform linear regression.

<img src = "Scatterplot with OLS Regression.png" width = "500" style = "margin:auto"/>

The straight line running through the middle of all the data points is called the *least squares regression line*, which approximates the "best fit" for a straight line describing the relationship between the variables. The equation for the regression line is like the *slope-intercept* formula, written as $Y = bX + a$. Here are the formula's components:

**Y** is the predicted value, which is also the value on the y-axis, or dependent variable.

**b** is the slope of the line. It measures how many units the y-axis will increase or decrease for each unit of the x-axis value.

**X** represents a value on the x-axis, or independent variable.

**a** is the y-intercept, the value at which the line crosses the y-axis when the X value is zero.
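
For readers curious how the line is chosen, these are the standard least-squares formulas for the two coefficients. They are textbook definitions, not taken from this lesson's data; $X_i$ & $Y_i$ are the observed pairs, & $\bar{X}$ & $\bar{Y}$ are their means:

$$b = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad a = \bar{Y} - b\bar{X}$$

The resulting line minimises the sum of squared vertical distances between the observed points & the line, which is where the name *least squares* comes from.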

Let's apply this formula using SQL. Suppose we want to know the expected median household income in a county where 30 percent of the population has a bachelor's degree or higher. In our scatterplot, the percentage with bachelor's degrees falls along the x-axis, represented by $X$ in the calculation. Let's plug that value into the regression line formula in place of $X$.

$$Y = b(30) + a$$

To calculate $Y$, which represents the predicted median household income, we need the line's slope, $b$, & the y-intercept, $a$. To get these values, we'll use the SQL functions `regr_slope(Y, X)` & `regr_intercept(Y, X)`.

```
SELECT round(regr_slope(median_hh_income,
                        pct_bachelors_higher)::numeric,
             2) AS slope,
       round(regr_intercept(median_hh_income,
                            pct_bachelors_higher)::numeric,
             2) AS y_intercept
FROM acs_2014_2018_stats;
```

Using the `median_hh_income` & `pct_bachelors_higher` variables as inputs for both functions, we'll set the resulting value of the `regr_slope(Y, X)` function as `slope` & the output for the `regr_intercept(Y, X)` function as `y_intercept`.

The result should show the following:

<img src = "Regression Slope & Intercept Functions.png" width = "600" style = "margin:auto"/>

Let's plug these values into the equation to get our predicted $Y$:

$$Y = 1016.55(30) + 29651.42$$
$$Y = 60147.92$$

Based on our calculation, in a county in which 30 percent of people ages 25 & older have a bachelor's degree or higher, we can expect the median household income to be about $60,148. Of course, our data includes counties whose median income falls above & below that predicted value, but we expect this to be the case because our data points in the scatterplot don't line up perfectly along the regression line. Recall that the correlation coefficient we calculated was 0.7, indicating a strong but not perfect relationship between education & income. Other factors likely contribute to variations in income, such as the types of jobs available in each county.
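
The slope, intercept, & prediction can also be combined in a single query. The sketch below is one way to do it, shown on hypothetical inline points that lie exactly on the line $Y = 2X + 1$ so the answer is easy to verify; the names `sample_points`, `x`, & `y` are made up for illustration.

```sql
-- Hypothetical sample points that lie exactly on y = 2x + 1
WITH sample_points (x, y) AS (
    VALUES (1, 3),
           (2, 5),
           (3, 7)
)
SELECT round(
           (regr_slope(y, x) * 4
            + regr_intercept(y, x))::numeric, 2
       ) AS predicted_y   -- prediction at x = 4
FROM sample_points;
```

This should return `9.00`. To apply the same pattern to the census data, we would swap in `median_hh_income` for `y`, `pct_bachelors_higher` for `x`, `30` for `4`, & `acs_2014_2018_stats` for `sample_points`. The result may differ slightly from the hand calculation because the slope & intercept aren't rounded before they're combined.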

## Finding the Effect of an Independent Variable with r-Squared

Beyond determining the direction & strength of the relationship between two variables, we can also calculate the extent to which the variation in the *x* (independent) variable explains the variation in the *y* (dependent) variable. To do this, we square the *r* value to find the *coefficient of determination*, better known as *r-squared*. An *r*-squared value indicates the percentage of the variation that is explained by the independent variable, & is a value between zero & one. For example, if *r*-squared equals 0.1, we would say that the independent variable explains 10 percent of the variation in the dependent variable, or not much at all.

To find *r*-squared, we use the `regr_r2(Y, X)` function in SQL. Let's apply it to our education & income variables using the code below.

```
SELECT round(
           regr_r2(median_hh_income,
                   pct_bachelors_higher)::numeric, 3)
       AS r_squared
FROM acs_2014_2018_stats;
```

This time, we round off the output to the nearest thousandth & alias the result to `r_squared`. The query should return the following result:

<img src = "Calculating the Coefficient of Determination.png" width = "600" style = "margin:auto"/>

The *r*-squared value of `0.490` indicates that about 49 percent of the variation in median household income among counties can be explained by the percentage of people with a bachelor's degree or higher in that county. Any number of factors could explain the other 51 percent, & statisticians will typically test numerous combinations of variables to determine what they are.
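
As a sanity check on the relationship between the two statistics, the sketch below computes *r*, its square, & `regr_r2(Y, X)` side by side. It uses hypothetical inline points (`sample_points`, `x`, & `y` are illustrative names) with an imperfect upward trend.

```sql
-- Hypothetical sample points with an imperfect upward trend
WITH sample_points (x, y) AS (
    VALUES (1, 2),
           (2, 3),
           (3, 5),
           (4, 4)
)
SELECT round(corr(y, x)::numeric, 2) AS r,
       round(power(corr(y, x), 2)::numeric, 2) AS r_times_r,
       round(regr_r2(y, x)::numeric, 2) AS r_squared
FROM sample_points;
```

For these points the query should return `0.80`, `0.64`, & `0.64`: squaring *r* & calling `regr_r2(Y, X)` agree. The census figures follow the same pattern, since an *r* of 0.7 squared is 0.49.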

Before we use these numbers in a headline or presentation, it's worth understanding the following points:

Correlation doesn't prove causality. Also, statisticians apply additional tests to data before accepting the results of a regression analysis, including whether the variables follow the standard bell curve distribution. They usually also perform significance testing on the results to make sure values are not simply the result of randomness.

## Finding Variance & Standard Deviation

*Variance* & *standard deviation* describe the degree to which a set of values varies from the average of those values. Variance, often used in finance, is the average of each number's squared distance from the average. The more dispersion in a set of values, the greater the variance. A stock market trader can use variance to measure the volatility of a particular stock -- how much its daily closing values tend to vary from the average. That could indicate how risky an investment the stock might be.

Standard deviation is the square root of the variance & is most useful for assessing data whose values form a normal distribution, usually visualised as a symmetrical bell curve. The standard deviation helps us understand how close most of our values are to the average.

When calculating variance & standard deviation, note that they report different units. Standard deviation is expressed in the same units as the values, while variance is not -- it's expressed in squared units, on a scale of its own.

These are the functions for calculating variance:

**var_pop(numeric)** calculates the population variance of the input values. *Population* refers to a dataset that contains all possible values, as opposed to a sample that just contains a portion of all possible values.

**var_samp(numeric)** calculates the sample variance of the input values. Use this with data that is sampled from a population, as in a random sample survey.

For calculating standard deviation, we use these:

**stddev_pop(numeric)** calculates the population standard deviation.

**stddev_samp(numeric)** calculates the sample standard deviation.
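
Here is a minimal sketch that runs all four functions on the same inputs so the population & sample versions can be compared. The eight inline values are hypothetical (`sample_values` & `val` are illustrative names); their average is 5.

```sql
-- Hypothetical sample values with an average of 5
WITH sample_values (val) AS (
    VALUES (2), (4), (4), (4), (5), (5), (7), (9)
)
SELECT round(var_pop(val), 2) AS population_variance,
       round(var_samp(val), 2) AS sample_variance,
       round(stddev_pop(val), 2) AS population_stddev,
       round(stddev_samp(val), 2) AS sample_stddev
FROM sample_values;
```

The query should return `4.00`, `4.57`, `2.00`, & `2.14`. The sample versions come out slightly larger because they divide the summed squared distances by one less than the number of values rather than by the number of values.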

With functions covering correlation, regression, & other descriptive statistics, we now have a basic toolkit for obtaining a preliminary survey of our data before doing more rigorous analysis.

---

# Creating Rankings with SQL

With SQL, we can create numbered rankings in our query results, which are useful for tasks such as tracking changes over several years. 

## Ranking with rank() & dense_rank()

Standard ANSI SQL includes several ranking functions, but we'll just focus on two: `rank()` & `dense_rank()`. Both are *window functions*, which are defined as functions that perform calculations across a set of rows relative to the current row. Unlike aggregate functions, which combine rows to calculate values, window functions work in two steps: the query first generates a set of rows, & then the window function runs across the result set to calculate the value it will return.

The difference between `rank()` & `dense_rank()` is the way they handle the next rank value after a tie: `rank()` includes a gap in the rank order, but `dense_rank()` does not. This concept is easier to understand in action. Consider a Wall Street analyst who covers the highly competitive widget manufacturing market. The analyst wants to rank companies by their annual output. The SQL statements below create & fill a table with data, then rank the companies by widget output.

```
CREATE TABLE widget_companies (
    id integer PRIMARY KEY GENERATED ALWAYS AS IDENTITY,
    company text NOT NULL,
    widget_output integer NOT NULL
);

INSERT INTO widget_companies (company, widget_output)
VALUES ('Dom Widgets', 125000),
       ('Ariadne Widget Masters', 143000),
       ('Saito Widget Co.', 201000),
       ('Mal Inc.', 133000),
       ('Dream Widget Inc.', 196000),
       ('Miles Amalgamated', 620000),
       ('Arthur Industries', 244000),
       ('Fischer Worldwide', 201000);

SELECT company,
       widget_output,
       rank() OVER (ORDER BY widget_output DESC),
       dense_rank() OVER (ORDER BY widget_output DESC)
FROM widget_companies
ORDER BY widget_output DESC;
```

Notice the syntax in the `SELECT` statement that includes `rank()` & `dense_rank()`. After the function names, we use the `OVER` clause & in parentheses place an expression that specifies the "window" of rows the function should operate on. The *window* is the set of rows relative to the current row, & in this case, we want both functions to work on all rows of the `widget_output` column, sorted in descending order. 

<img src = "Using the rank() & dense_rank() Window Functions.png" width = "600" style = "margin:auto"/>

The columns produced by `rank()` & `dense_rank()` show each company's ranking based on the `widget_output` value from highest to lowest, with Miles Amalgamated at number one. To see how `rank()` & `dense_rank()` differ, check the fifth-row listing, Dream Widget Inc.

Saito Widget Co. & Fischer Worldwide tie for third place with 201,000 widgets each. With `rank()`, the next company, Dream Widget Inc., is ranked fifth: because `rank()` leaves a gap in the order after a tie, Dream placing fifth tells us there are four companies with more output. In contrast, `dense_rank()` doesn't allow a gap in the rank order, so it places Dream Widget Inc. in fourth place. This reflects the fact that Dream has the fourth-highest widget output regardless of how many companies produced more.

Both ways of handling ties have merit, but in practice, `rank()` is used most often. It more accurately reflects the total number of companies ranked, shown by the fact that Dream Widget Inc. has four companies ahead of it in total output, not three.

## Ranking Within Subgroups with PARTITION BY

The ranking we just did is a simple overall ranking based on widget output. But sometimes, we'll want to produce ranks within groups of rows in a table. For example, we might want to rank government employees by salary within each department or rank movies by box-office earnings within each genre.

To use window functions in this way, we'll add `PARTITION BY` to the `OVER` clause. A `PARTITION BY` clause divides table rows according to values in a column we specify.

Here's an example using made-up data about grocery stores.

```
CREATE TABLE store_sales (
    store text NOT NULL,
    category text NOT NULL,
    unit_sales bigint NOT NULL,
    CONSTRAINT store_category_key PRIMARY KEY (store, category)
);

INSERT INTO store_sales (
    store, category, unit_sales
)
VALUES ('Broders', 'Cereal', 1104),
       ('Wallace', 'Ice Cream', 1863),
       ('Broders', 'Ice Cream', 2517),
       ('Cramers', 'Ice Cream', 2112),
       ('Broders', 'Beer', 641),
       ('Cramers', 'Cereal', 1003),
       ('Cramers', 'Beer', 640),
       ('Wallace', 'Cereal', 980),
       ('Wallace', 'Beer', 988);

SELECT category, store, unit_sales,
       rank() OVER (PARTITION BY category
           ORDER BY unit_sales DESC)
FROM store_sales
ORDER BY category, rank() OVER (PARTITION BY
    category ORDER BY unit_sales DESC);
```

In the table, each row includes a store's product category & sales for that category. The final `SELECT` statement creates a result set showing how each store's sales rank within each category. The new element is the addition of `PARTITION BY` in the `OVER` clause. In effect, the clause tells the program to create rankings one category at a time, using the store's unit sales in descending order.

To display the results by category & rank, we add an `ORDER BY` clause that includes the `category` column & the same `rank()` function syntax.

<img src = "Applying rank() Within Groups Using PARTITION BY.png" width = "600" style = "margin:auto"/>

Within each category, rows are ordered by unit sales, with the `rank` column displaying the ranking.

Using this table, we can see at a glance how each store ranks in a food category. For instance, Broders tops sales for cereal & ice cream, but Wallace wins in the beer category.
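
A common follow-up is to keep only the top-ranked row in each group. Window functions can't appear in a `WHERE` clause, so one approach is to compute the rank in a common table expression & filter on it afterward. The sketch below inlines a subset of the `store_sales` rows so it runs on its own; the names `ranked` & `category_rank` are illustrative.

```sql
-- A subset of the store_sales rows, inlined so the query runs on its own
WITH store_sales (store, category, unit_sales) AS (
    VALUES ('Broders', 'Beer', 641),
           ('Cramers', 'Beer', 640),
           ('Wallace', 'Beer', 988),
           ('Broders', 'Cereal', 1104),
           ('Cramers', 'Cereal', 1003),
           ('Wallace', 'Cereal', 980)
),
ranked AS (
    -- Rank stores within each category, highest unit sales first
    SELECT category, store, unit_sales,
           rank() OVER (PARTITION BY category
               ORDER BY unit_sales DESC) AS category_rank
    FROM store_sales
)
SELECT category, store, unit_sales
FROM ranked
WHERE category_rank = 1
ORDER BY category;
```

With these rows, the query should return Wallace for beer (988) & Broders for cereal (1104). Against the full table, dropping the first common table expression would let `ranked` read from the real `store_sales` table instead.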

---

# Calculating Rates for Meaningful Comparisons

Ranks based on raw counts aren't always meaningful; in fact, they can be misleading. Consider birth statistics: the US National Center for Health Statistics (NCHS) reported that in 2019, there were 377,599 babies born in the state of Texas & 46,826 born in the state of Utah. So, women in Texas are more likely to have babies, right? Not so fast. In 2019, Texas' estimated population was about nine times Utah's. Given that context, comparing the raw number of births in the two states isn't very meaningful.

A more accurate way to compare these numbers is to convert them to rates. Analysts often calculate a rate per 1,000 people, or some multiple of that number, to allow an apples-to-apples comparison. For example, the fertility rate -- the number of births per 1,000 women ages 15 to 44 -- was 62.5 for Texas in 2019 & 66.7 for Utah, according to the NCHS. So, despite the smaller number of births, on a per-1,000 rate, women in Utah actually had more children.

The math behind this is simple. Let's say our town had 115 births & a population of 2,200 women ages 15 to 44. We can find the per-1,000 rate as follows:

$$(115 / 2{,}200) \times 1{,}000 \approx 52.3$$

In our town, there were 52.3 births per 1,000 women ages 15 to 44, which we can now compare to other places regardless of their size.
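
The same arithmetic in SQL needs one precaution: dividing one integer by another in PostgreSQL discards the fractional part. The sketch below shows the pitfall next to the fix, which is to cast one operand to `numeric` before dividing. The column aliases are illustrative.

```sql
SELECT 115 / 2200 * 1000 AS integer_math,   -- integer division truncates to 0
       round((115::numeric / 2200) * 1000, 1)
           AS births_per_1000;              -- numeric division keeps the fraction
```

The first column should return `0` & the second `52.3`.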

## Finding Rates of Tourism-Related Businesses

Let's try calculating rates using SQL & census data. We'll join two tables: the census population estimates plus data compiled about tourism-related businesses from the census' County Business Patterns program.

The code below creates & fills the business patterns table.

```
CREATE TABLE cbp_naics_72_establishments (
    state_fips text,
    county_fips text,
    county text NOT NULL,
    st text NOT NULL,
    naics_2017 text NOT NULL,
    naics_2017_label text NOT NULL,
    year smallint NOT NULL,
    establishments integer NOT NULL,
    CONSTRAINT cbp_fips_key PRIMARY KEY (
        state_fips, county_fips)
);

COPY cbp_naics_72_establishments
FROM '/YourDirectory/cbp_naics_72_establishments.csv'
WITH (FORMAT CSV, HEADER);

SELECT *
FROM cbp_naics_72_establishments
ORDER BY state_fips, county_fips
LIMIT 5;
```

Once the data is imported, run the final `SELECT` statement to view the first few rows of the table. Each row contains descriptive information about a county along with the number of business establishments that fall under code 72 of the North American Industry Classification System (NAICS). Code 72 covers "Accommodation & Food Services" establishments, mainly hotels, inns, bars, & restaurants. The number of those businesses in a county is a good proxy for the amount of tourist & recreation activity in the area.

<img src = "Creating & Filling a Table For Census County Business Pattern Data.png" width = "600" style = "margin:auto"/>

Let's find out which counties have the highest concentration of such businesses per 1,000 population.

```
SELECT cbp.county,
       cbp.st,
       cbp.establishments,
       pop.pop_est_2018,
       round((cbp.establishments::numeric /
           pop.pop_est_2018) * 1000, 1)
           AS estabs_per_1000
FROM cbp_naics_72_establishments AS cbp
JOIN us_counties_pop_est_2019 AS pop
    ON cbp.state_fips = pop.state_fips
    AND cbp.county_fips = pop.county_fips
WHERE pop.pop_est_2018 >= 50000
ORDER BY cbp.establishments::numeric /
    pop.pop_est_2018 DESC;
```

We limited our results to counties with 50,000 or more people. That's an arbitrary value that lets us see how rates compare within a group of more-populous, better-known counties. Here are the results, sorted with highest rates at the top:

<img src = "Business Rates per Thousand Population in Counties with 50,000 or More People.png" width = "600" style = "margin:auto"/>

The counties that have the highest rates make sense. Cape May County, New Jersey, is home to numerous beach resort towns on the Atlantic Ocean & Delaware Bay. Worcester County, Maryland, contains Ocean City & other beach attractions. Monroe County, Florida, is known for its vacation hotspot, the Florida Keys.

---

# Smoothing Uneven Data

A *rolling average* is an average calculated for each time period in a dataset, using a moving window of rows as input each time. Think of a hardware store: it might sell 20 hammers on Monday, 15 hammers on Tuesday, & just a few the rest of the week. The next week, hammer sales might spike on Friday. To find the big-picture story in such uneven data, we can smooth numbers by calculating the rolling average, sometimes called a *moving average*.

Here are two weeks of hammer sales at that hypothetical hardware store:

|Date|Hammer Sales|Seven-Day Average|
|:---:|:---:|:---:|
|2022-05-01|0||
|2022-05-02|20||
|2022-05-03|15||
|2022-05-04|3||
|2022-05-05|6||
|2022-05-06|1||
|2022-05-07|1|6.6|
|2022-05-08|2|6.9|
|2022-05-09|18|6.6|
|2022-05-10|13|6.3|
|2022-05-11|2|6.1|
|2022-05-12|4|5.9|
|2022-05-13|12|7.4|
|2022-05-14|2|7.6|

Let's say that for every day we want to know the average sales over the last seven days (we can choose any period, but a week is an intuitive unit). Once we have seven days of data, we calculate the average of sales over the seven-day period that includes the current day. The average of hammer sales from May 1 to May 7, 2022, is `6.6` per day.

The next day, we again average sales over the most recent seven days, from May 2 to May 8, 2022. The result is `6.9` per day. As we continue each day, despite the ups & downs in the daily sales, the seven-day average remains fairly steady. Over a long period of time, we'll be able to better discern a trend.
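
The hammer sales table can be reproduced in SQL with a window that covers the current row & the six rows before it. The sketch below inlines the sales figures so it runs on its own. The names `hammer_sales`, `sale_date`, `hammers`, & the window name `w` are illustrative, & the `WINDOW` clause is simply a way to define the window once & reuse it.

```sql
-- The two weeks of hammer sales from the table above
WITH hammer_sales (sale_date, hammers) AS (
    VALUES ('2022-05-01'::date, 0),
           ('2022-05-02', 20),
           ('2022-05-03', 15),
           ('2022-05-04', 3),
           ('2022-05-05', 6),
           ('2022-05-06', 1),
           ('2022-05-07', 1),
           ('2022-05-08', 2),
           ('2022-05-09', 18),
           ('2022-05-10', 13),
           ('2022-05-11', 2),
           ('2022-05-12', 4),
           ('2022-05-13', 12),
           ('2022-05-14', 2)
)
SELECT sale_date,
       hammers,
       -- Show an average only once a full seven days are available
       CASE WHEN count(*) OVER w = 7
            THEN round(avg(hammers) OVER w, 1)
       END AS seven_day_avg
FROM hammer_sales
WINDOW w AS (ORDER BY sale_date
             ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
ORDER BY sale_date;
```

The first six rows should show no average, & the remaining rows should match the table: `6.6`, `6.9`, `6.6`, `6.3`, `6.1`, `5.9`, `7.4`, & `7.6`. Without the `CASE` expression, the early rows would instead show an average over however many rows exist so far.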

Let's use the window function syntax again to calculate a rolling average on US exports data. We'll create a table & use `COPY` to insert data from *us_exports.csv*. This file contains data showing the monthly dollar value of US exports of citrus fruit & soybeans, two commodities whose sales are tied to the growing season. The data comes from the US Census Bureau's international trade division at [https://usatrade.census.gov/](https://usatrade.census.gov/).

```
CREATE TABLE us_exports (
    year smallint,
    month smallint,
    citrus_export_value bigint,
    soybeans_export_value bigint
);

COPY us_exports
FROM '/YourDirectory/us_exports.csv'
WITH (FORMAT CSV, HEADER);

SELECT year, month, citrus_export_value
FROM us_exports
ORDER BY year, month;

SELECT year, month, citrus_export_value,
       round(avg(citrus_export_value)
           OVER (ORDER BY year, month
           ROWS BETWEEN 11 PRECEDING AND CURRENT ROW),
           0) AS twelve_month_avg
FROM us_exports
ORDER BY year, month;
```

The first `SELECT` statement lets us view the monthly citrus export data, which covers every month from 2002 through summer 2020. 

<img src = "Creating a Rolling Average for Export Data.png" width = "600" style = "margin:auto"/>

Notice the pattern: the value of citrus fruit exports is higher in winter months, when the growing season is paused in the northern hemisphere & countries need imports to meet demand. We'll use the second `SELECT` statement to compute a 12-month rolling average so we can see, for each month, the annual trend in exports.

In the `SELECT` values list, we place an `avg()` function to calculate the average of the values in the `citrus_export_value` column. We follow the function with an `OVER` clause that has two elements in parentheses: an `ORDER BY` clause that sorts the data for the period we plan to average, & the number of rows to average, set with the keywords `ROWS BETWEEN 11 PRECEDING AND CURRENT ROW`. This tells PostgreSQL to limit the window to the current row & the 11 rows before it -- 12 total.

We wrap the entire statement, from the `avg()` function through the `OVER` clause, in a `round()` function to limit the output to whole numbers.

<img src = "Creating a Rolling Average for Export Data 2.png" width = "600" style = "margin:auto"/>

Notice the 12-month average is far more consistent. If we want to see the trend, it's helpful to graph the results using Excel or a stats program. The figure below shows the monthly totals from 2015 through August 2020 in bars, with the 12-month average as a line.

<img src = "Monthly Citrus Exports with 12-Month Rolling Average.png" width = "500" style = "margin:auto"/>

Based on the rolling average, citrus fruit exports were generally steady until 2019 & then trended down before recovering slightly in 2020. It's difficult to discern that movement from the monthly data, but the rolling average makes it apparent.

The window function syntax offers multiple options for analysis. For example, instead of calculating a rolling average, we could substitute the `sum()` function to find the rolling total over a time period. If we calculated a seven-day rolling sum, we'd know the weekly total ending on any day in our dataset.
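
As a sketch of that substitution, the query below computes a seven-day rolling sum over the first eight days of the hammer sales data, inlined here so it runs on its own. The names `hammer_sales`, `sale_date`, `hammers`, & `seven_day_sum` are illustrative.

```sql
-- The first eight days of hammer sales from earlier in the lesson
WITH hammer_sales (sale_date, hammers) AS (
    VALUES ('2022-05-01'::date, 0),
           ('2022-05-02', 20),
           ('2022-05-03', 15),
           ('2022-05-04', 3),
           ('2022-05-05', 6),
           ('2022-05-06', 1),
           ('2022-05-07', 1),
           ('2022-05-08', 2)
)
SELECT sale_date,
       hammers,
       sum(hammers)
           OVER (ORDER BY sale_date
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)
           AS seven_day_sum
FROM hammer_sales
ORDER BY sale_date;
```

The sum for May 7 should be `46` & for May 8 `48`. Rows before May 7 total however many days are available so far, because the window simply contains fewer than seven rows at the start of the data.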

SQL offers additional window functions. Check out the official [PostgreSQL documentation](https://www.postgresql.org/docs/current/tutorial-window.html) for an overview of window functions & [https://www.postgresql.org/docs/current/functions-window.html](https://www.postgresql.org/docs/current/functions-window.html) for a list of window functions.

---

# Wrapping Up

Our SQL analysis toolkit now includes ways to find relationships among variables using statistical functions, create rankings from ordered data, smooth spiky data to find trends, & properly compare raw numbers by turning them into rates.