# Advanced Aggregation Functions

Many database management systems (DBMSs) provide aggregation functions that go beyond the basics.
In this lab, we will look at a few of the advanced aggregation functions available in PostgreSQL.


## Connect to PostgreSQL

We will again connect to the PostgreSQL database to view the data, using the command:

In [1]:
%load_ext sql
%sql postgres://dsa_ro_user:readonly@pgsql.dsa.lan/dsa_ro

'Connected: dsa_ro_user@dsa_ro'

For these functions, we will use the `houses` table, which contains housing sales data.

```SQL
dsa_ro=# \d houses
                             Table "public.houses"
    Column     |  Type   |                      Modifiers                      
---------------+---------+-----------------------------------------------------
 id            | integer | not null default nextval('houses_id_seq'::regclass)
 date          | text    | 
 price         | real    | 
 bedrooms      | integer | 
 bathrooms     | real    | 
 sqft_living   | integer | 
 sqft_lot      | integer | 
 floors        | real    | 
 waterfront    | integer | 
 view          | integer | 
 condition     | integer | 
 grade         | integer | 
 sqft_above    | integer | 
 sqft_basement | integer | 
 yr_built      | integer | 
 yr_renovated  | integer | 
 zipcode       | integer | 
 lat           | real    | 
 long          | real    | 
 sqft_living15 | integer | 
 sqft_lot15    | integer | 
Indexes:
    "houses_pkey" PRIMARY KEY, btree (id)


dsa_ro=# select count(*) from houses;
 count 
-------
 21613
(1 row)

```


## Advanced Statistical Aggregates

<style>
table.CALSTABLE th {font-weight:900; font-family: verdana,sans-serif; background: blue;}
</style>

<table class="CALSTABLE" border="1">
<colgroup><col>
<col>
<col>
<col>

</colgroup><thead>
<tr>
<th>Function</th>

<th>Argument Type</th>

<th>Return Type</th>

<th>Description</th>
</tr>
</thead>

<tbody>
<tr>
<td><code class="FUNCTION">corr(<tt class="REPLACEABLE c3">Y</tt>, <tt class="REPLACEABLE c3">X</tt>)</code></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="TYPE">double precision</tt></td>

<td>correlation coefficient</td>
</tr>

<tr>
<td><code class="FUNCTION">covar_pop(<tt class="REPLACEABLE c3">Y</tt>, <tt class="REPLACEABLE c3">X</tt>)</code></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="TYPE">double precision</tt></td>

<td>population covariance</td>
</tr>

<tr>
<td><code class="FUNCTION">covar_samp(<tt class="REPLACEABLE c3">Y</tt>, <tt class="REPLACEABLE c3">X</tt>)</code></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="TYPE">double precision</tt></td>

<td>sample covariance</td>
</tr>

<tr>
<td><code class="FUNCTION">regr_avgx(<tt class="REPLACEABLE c3">Y</tt>, <tt class="REPLACEABLE c3">X</tt>)</code></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="TYPE">double precision</tt></td>

<td>average of the independent variable (<tt class="LITERAL">sum(<tt class="REPLACEABLE c3">X</tt>)/<tt class="REPLACEABLE c3">N</tt></tt>)</td>
</tr>

<tr>
<td><code class="FUNCTION">regr_avgy(<tt class="REPLACEABLE c3">Y</tt>, <tt class="REPLACEABLE c3">X</tt>)</code></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="TYPE">double precision</tt></td>

<td>average of the dependent variable (<tt class="LITERAL">sum(<tt class="REPLACEABLE c3">Y</tt>)/<tt class="REPLACEABLE c3">N</tt></tt>)</td>
</tr>

<tr>
<td><code class="FUNCTION">regr_count(<tt class="REPLACEABLE c3">Y</tt>, <tt class="REPLACEABLE c3">X</tt>)</code></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="TYPE">bigint</tt></td>

<td>number of input rows in which both expressions are
nonnull</td>
</tr>

<tr>
<td><code class="FUNCTION">regr_intercept(<tt class="REPLACEABLE c3">Y</tt>, <tt class="REPLACEABLE c3">X</tt>)</code></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="TYPE">double precision</tt></td>

<td>y-intercept of the least-squares-fit linear equation
determined by the (<tt class="REPLACEABLE c3">X</tt>,
<tt class="REPLACEABLE c3">Y</tt>) pairs</td>
</tr>

<tr>
<td><code class="FUNCTION">regr_r2(<tt class="REPLACEABLE c3">Y</tt>, <tt class="REPLACEABLE c3">X</tt>)</code></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="TYPE">double precision</tt></td>

<td>square of the correlation coefficient</td>
</tr>

<tr>
<td><code class="FUNCTION">regr_slope(<tt class="REPLACEABLE c3">Y</tt>, <tt class="REPLACEABLE c3">X</tt>)</code></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="TYPE">double precision</tt></td>

<td>slope of the least-squares-fit linear equation
determined by the (<tt class="REPLACEABLE c3">X</tt>,
<tt class="REPLACEABLE c3">Y</tt>) pairs</td>
</tr>

<tr>
<td><code class="FUNCTION">regr_sxx(<tt class="REPLACEABLE c3">Y</tt>, <tt class="REPLACEABLE c3">X</tt>)</code></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="LITERAL">sum(<tt class="REPLACEABLE c3">X</tt>^2) - sum(<tt class="REPLACEABLE c3">X</tt>)^2/<tt class="REPLACEABLE c3">N</tt></tt> (<span class="QUOTE">"sum of
squares"</span> of the independent variable)</td>
</tr>

<tr>
<td><code class="FUNCTION">regr_sxy(<tt class="REPLACEABLE c3">Y</tt>, <tt class="REPLACEABLE c3">X</tt>)</code></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="LITERAL">sum(<tt class="REPLACEABLE c3">X</tt>*<tt class="REPLACEABLE c3">Y</tt>) - sum(<tt class="REPLACEABLE c3">X</tt>) * sum(<tt class="REPLACEABLE c3">Y</tt>)/<tt class="REPLACEABLE c3">N</tt></tt> (<span class="QUOTE">"sum of
products"</span> of independent times dependent
variable)</td>
</tr>

<tr>
<td><code class="FUNCTION">regr_syy(<tt class="REPLACEABLE c3">Y</tt>, <tt class="REPLACEABLE c3">X</tt>)</code></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="TYPE">double precision</tt></td>

<td><tt class="LITERAL">sum(<tt class="REPLACEABLE c3">Y</tt>^2) - sum(<tt class="REPLACEABLE c3">Y</tt>)^2/<tt class="REPLACEABLE c3">N</tt></tt> (<span class="QUOTE">"sum of
squares"</span> of the dependent variable)</td>
</tr>

<tr>
<td><code class="FUNCTION">stddev(<tt class="REPLACEABLE c3">expression</tt>)</code></td>

<td><tt class="TYPE">smallint</tt>, <tt class="TYPE">int</tt>, <tt class="TYPE">bigint</tt>, <tt class="TYPE">real</tt>, <tt class="TYPE">double precision</tt>,
or <tt class="TYPE">numeric</tt></td>

<td><tt class="TYPE">double precision</tt> for
floating-point arguments, otherwise <tt class="TYPE">numeric</tt></td>

<td>historical alias for <code class="FUNCTION">stddev_samp</code></td>
</tr>

<tr>
<td><code class="FUNCTION">stddev_pop(<tt class="REPLACEABLE c3">expression</tt>)</code></td>

<td><tt class="TYPE">smallint</tt>, <tt class="TYPE">int</tt>, <tt class="TYPE">bigint</tt>, <tt class="TYPE">real</tt>, <tt class="TYPE">double precision</tt>,
or <tt class="TYPE">numeric</tt></td>

<td><tt class="TYPE">double precision</tt> for
floating-point arguments, otherwise <tt class="TYPE">numeric</tt></td>

<td>population standard deviation of the input
values</td>
</tr>

<tr>
<td><code class="FUNCTION">stddev_samp(<tt class="REPLACEABLE c3">expression</tt>)</code></td>

<td><tt class="TYPE">smallint</tt>, <tt class="TYPE">int</tt>, <tt class="TYPE">bigint</tt>, <tt class="TYPE">real</tt>, <tt class="TYPE">double precision</tt>,
or <tt class="TYPE">numeric</tt></td>

<td><tt class="TYPE">double precision</tt> for
floating-point arguments, otherwise <tt class="TYPE">numeric</tt></td>

<td>sample standard deviation of the input values</td>
</tr>

<tr>
<td><code class="FUNCTION">variance</code>(<tt class="REPLACEABLE c3">expression</tt>)</td>

<td><tt class="TYPE">smallint</tt>, <tt class="TYPE">int</tt>, <tt class="TYPE">bigint</tt>, <tt class="TYPE">real</tt>, <tt class="TYPE">double precision</tt>,
or <tt class="TYPE">numeric</tt></td>

<td><tt class="TYPE">double precision</tt> for
floating-point arguments, otherwise <tt class="TYPE">numeric</tt></td>

<td>historical alias for <code class="FUNCTION">var_samp</code></td>
</tr>

<tr>
<td><code class="FUNCTION">var_pop</code>(<tt class="REPLACEABLE c3">expression</tt>)</td>

<td><tt class="TYPE">smallint</tt>, <tt class="TYPE">int</tt>, <tt class="TYPE">bigint</tt>, <tt class="TYPE">real</tt>, <tt class="TYPE">double precision</tt>,
or <tt class="TYPE">numeric</tt></td>

<td><tt class="TYPE">double precision</tt> for
floating-point arguments, otherwise <tt class="TYPE">numeric</tt></td>

<td>population variance of the input values (square of
the population standard deviation)</td>
</tr>

<tr>
<td><code class="FUNCTION">var_samp</code>(<tt class="REPLACEABLE c3">expression</tt>)</td>

<td><tt class="TYPE">smallint</tt>, <tt class="TYPE">int</tt>, <tt class="TYPE">bigint</tt>, <tt class="TYPE">real</tt>, <tt class="TYPE">double precision</tt>,
or <tt class="TYPE">numeric</tt></td>

<td><tt class="TYPE">double precision</tt> for
floating-point arguments, otherwise <tt class="TYPE">numeric</tt></td>

<td>sample variance of the input values (square of the
sample standard deviation)</td>
</tr>
</tbody>
</table>
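
The two-argument functions in the table work on (`Y`, `X`) pairs, and rows in which either value is NULL are left out. The sketch below illustrates this with `regr_count` on a made-up four-row `VALUES` list. The alias `t` and the column names `y` and `x` are placeholders for illustration only; they are not part of the `houses` table.

```sql
-- Hypothetical inline data: two complete (y, x) pairs and two rows containing a NULL.
SELECT count(*)         AS all_rows,        -- every row, NULLs included: 4
       regr_count(y, x) AS complete_pairs,  -- rows where both y and x are non-null: 2
       avg(x)           AS avg_x_all_rows,  -- average of every non-null x: (2 + 3 + 5) / 3
       regr_avgx(y, x)  AS avg_x_of_pairs   -- average of x over complete pairs: (2 + 5) / 2 = 3.5
FROM (VALUES (1.0, 2.0),
             (2.0, NULL),
             (NULL, 3.0),
             (4.0, 5.0)) AS t(y, x);
```

The gap between `avg_x_all_rows` and `avg_x_of_pairs` is worth keeping in mind when a column has missing values: the single-argument and two-argument aggregates may be looking at different sets of rows.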


In [2]:
%%sql
SELECT avg(price), variance(price), stddev(price)
FROM houses;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


avg,variance,stddev
540088.141766529,134782378397.246,367127.196482699
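
The table above lists `variance` and `stddev` as historical aliases for the sample versions. As a quick check, and assuming the same `houses` table, the query below puts the aliases next to `var_samp`, `var_pop`, and `stddev_samp`. The column aliases are our own names, chosen for readability.

```sql
SELECT variance(price)    AS variance_alias,       -- should match sample_variance
       var_samp(price)    AS sample_variance,      -- divides by N - 1
       var_pop(price)     AS population_variance,  -- divides by N
       -- Rescaling the population variance by N / (N - 1) should recover the sample variance.
       var_pop(price) * count(price) / (count(price) - 1) AS rescaled_population_variance,
       stddev(price)      AS stddev_alias,         -- should match sample_stddev
       stddev_samp(price) AS sample_stddev,
       sqrt(var_samp(price)) AS sqrt_sample_variance  -- standard deviation is the square root of variance
FROM houses;
```

With more than 21,000 rows, the sample and population values will be very close. The difference matters more for small groups.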


More commonly, we will use factors (categorical variables) to create statistical groupings.

In [3]:
%%sql
SELECT grade, count(*), avg(price)::bigint, variance(price)::bigint, stddev(price)::bigint
FROM houses
GROUP BY grade
HAVING count(*) > 30
ORDER BY grade

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
8 rows affected.


grade,count,avg,variance,stddev
5,242,248524,13947676133,118100
6,2038,301920,15121689876,122970
7,8981,402590,24297614822,155877
8,6068,542853,47294666885,217473
9,2615,773513,99931906084,316120
10,1134,1071771,233815854928,483545
11,399,1496842,497165018439,705099
12,90,2191222,1056411130266,1027819


### Some bivariate statistics


In [4]:
%%sql
SELECT grade, count(*), covar_pop(price,bedrooms), corr(price,bedrooms),corr(price,bathrooms)
FROM houses
GROUP BY grade
ORDER BY grade

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
12 rows affected.


grade,count,covar_pop,corr,corr_1
1,1,0.0,,
3,3,0.0,,0.42976535357982
4,29,-6011.5338882283,-0.0960058221867791,0.0116341728457079
5,242,4688.03189672837,0.0399857233120915,0.0851140756469383
6,2038,7718.69182008955,0.0754546183386198,0.212692970850097
7,8981,17301.2449266977,0.121667816782839,0.11303389241814
8,6068,32504.5607135827,0.17681730008073,0.0986407456886358
9,2615,47042.8471164666,0.192674784791592,0.232168814554075
10,1134,79354.3798745836,0.208779082478747,0.335073236711695
11,399,64107.4924403741,0.114156395512748,0.305024394652586
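
The empty `corr` cells for grades 1 and 3 are NULLs. The most likely reason is that one of the variables does not vary within such a small group, so the correlation is undefined. To see where `corr` comes from, the sketch below rebuilds it from the sums of squares and products listed in the table (`regr_sxy`, `regr_sxx`, `regr_syy`). This is only an illustration of the formula, not a replacement for `corr`, and the column aliases are our own.

```sql
SELECT grade,
       corr(price, bedrooms) AS corr_builtin,
       -- Pearson correlation: Sxy / sqrt(Sxx * Syy).
       -- NULLIF turns a zero denominator into NULL instead of raising a division error.
       regr_sxy(price, bedrooms)
         / sqrt(NULLIF(regr_sxx(price, bedrooms) * regr_syy(price, bedrooms), 0)) AS corr_by_hand
FROM houses
GROUP BY grade
HAVING count(*) > 30
ORDER BY grade;
```

The two columns should agree up to small floating-point differences.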


### Bivariate Regression Analysis
Note that the `regr_*` functions above fit a line with a single independent variable (bivariate regression); they do not support multiple regression.

In [5]:
%%sql
SELECT grade, count(*), regr_intercept(price,bedrooms), regr_slope(price,bedrooms), regr_r2(price,bedrooms)
FROM houses
GROUP BY grade
HAVING count(*) > 30
ORDER BY grade

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
8 rows affected.


grade,count,regr_intercept,regr_slope,regr_r2
5,242,237209.471288564,4737.21271309269,0.0015988580687911
6,2038,271975.228681038,11148.4663770632,0.0056933994286267
7,8981,334915.431595848,20786.882411768,0.0148030576407025
8,6068,384577.925545356,45482.654375392,0.0312643576078389
9,2615,476067.507606642,78830.4903829566,0.0371235726944861
10,1134,570373.099738572,128319.860730413,0.0435887052806674
11,399,1077432.83168708,100809.728407744,0.013031682636463
12,90,718896.913151365,348708.573200993,0.115822999355101
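
The regression functions are related to each other through the quantities in the table above. The sketch below, again assuming the `houses` table, recomputes the slope, intercept, and r-squared by hand, and then uses the fitted line to estimate a price. The value of 3 bedrooms is an arbitrary choice for illustration, and the column aliases are our own.

```sql
SELECT grade,
       regr_slope(price, bedrooms)     AS slope_builtin,
       -- slope = Sxy / Sxx (NULLIF guards against a group where bedrooms never varies)
       regr_sxy(price, bedrooms) / NULLIF(regr_sxx(price, bedrooms), 0) AS slope_by_hand,
       regr_intercept(price, bedrooms) AS intercept_builtin,
       -- intercept = avg(Y) - slope * avg(X)
       regr_avgy(price, bedrooms)
         - regr_slope(price, bedrooms) * regr_avgx(price, bedrooms)     AS intercept_by_hand,
       regr_r2(price, bedrooms)        AS r2_builtin,
       -- r-squared is the square of the correlation coefficient
       power(corr(price, bedrooms), 2) AS r2_by_hand,
       -- fitted line evaluated at bedrooms = 3 (hypothetical input)
       regr_intercept(price, bedrooms) + regr_slope(price, bedrooms) * 3 AS predicted_price_3_bedrooms
FROM houses
GROUP BY grade
HAVING count(*) > 30
ORDER BY grade;
```

Given how small the `regr_r2` values in the output above are, the number of bedrooms explains little of the price variation within a grade, so a prediction like this is rough at best.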


#### NOTE: Built-in statistical analysis is often limited to what is shown above. More advanced statistical analysis is done in one of two ways: through database extensions, or by pulling data from the database into a statistical analysis environment such as R. This will be covered in much greater detail in your Stat Math course next semester!

## <span style="background:yellow">Your Turn</span>

Write a query to find the average, variance, and standard deviation of the number of bedrooms.



In [8]:
%%sql
SELECT  avg(bedrooms)
        ,variance(bedrooms)
        ,stddev(bedrooms)
FROM    houses;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


avg,variance,stddev
3.37084162309721,0.8650150097573506,0.9300618311474516


Write a query to find the covariance and correlation between the year of renovation and the number of bathrooms.

In [9]:
%%sql
SELECT  covar_pop(yr_renovated, bathrooms)
        ,corr(yr_renovated, bathrooms)
FROM    houses;

 * postgres://dsa_ro_user:***@pgsql.dsa.lan/dsa_ro
1 rows affected.


covar_pop,corr
15.6958103846152,0.0507389776480596
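
The exercise asks for "the covariance" without saying whether the population or the sample version is meant, and the answer above uses `covar_pop`. As a sketch of the alternative, the query below shows `covar_samp` alongside it, together with the N / (N - 1) rescaling that should connect the two. The column aliases are our own.

```sql
SELECT covar_pop(yr_renovated, bathrooms)  AS population_covariance,  -- divides by N
       covar_samp(yr_renovated, bathrooms) AS sample_covariance,      -- divides by N - 1
       -- Rescaling by N / (N - 1), where N is the number of complete pairs,
       -- should recover the sample covariance.
       covar_pop(yr_renovated, bathrooms)
         * regr_count(yr_renovated, bathrooms)
         / (regr_count(yr_renovated, bathrooms) - 1) AS rescaled_population_covariance
FROM houses;
```

The correlation coefficient is the same in both cases, since the N or N - 1 factor cancels out.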


# Save your notebook, then `File > Close and Halt`