# 1. Introduction

In the "Summary Statistics" lesson, we computed summary statistics across columns using SQL. Often, however, we will want even more information, which we can get by computing summary statistics per group to answer questions like the following:

* What is the total number of graduates by major category?
* What major category pays the most?
* How does the sample size vary by major category?

In this lesson, we'll answer questions like these by learning how to calculate summary statistics for groups.

We'll continue working with the recent_grads table of jobs.db. Recall that each row represents a single college major, and it contains information about post-graduation employment of students who studied the major. Here are some descriptions for just a few of the 21 total columns:

* Rank — the major's rank according to median earnings.
* Major — the name of the major.
* Major_category — the broader category to which the major belongs.
* Total — the total number of people who graduated with the major.
* Sample_size — the size of the sample.
* Men — the number of graduates who are men.
* Women — the number of graduates who are women.
* ShareWomen — women as a proportion of the total number of graduates (a number ranging from 0 to 1).
* Employed — the number of employed graduates.

Here are the first few rows and columns in the dataset:

    Rank	Major	Major_category	Total	Sample_size	Men	Women	ShareWomen	Employed
    1	PETROLEUM ENGINEERING	Engineering	2339	36	2057	282	0.120564	1976
    2	MINING AND MINERAL ENGINEERING	Engineering	756	7	679	77	0.101852	640
    3	METALLURGICAL ENGINEERING	Engineering	856	3	725	131	0.153037	648
    4	NAVAL ARCHITECTURE AND MARINE ENGINEERING	Engineering	1258	16	1123	135	0.107313	758
    5	CHEMICAL ENGINEERING	Engineering	32260	289	21239	11021	0.341631	25694

Let's begin by learning how to code if/then logic in SQL.

# 2. If/Then in SQL

In [2]:
%load_ext sql
%sql sqlite://

%sql sqlite:////home/mohammeds/datasets/jobs.db


In [15]:
%%sql

SELECT CASE
        WHEN Sample_size<200 THEN 'Small'
        WHEN Sample_size<1000 THEN 'Medium'
        WHEN Sample_size>=1000 THEN 'Large'
        END AS Sample_category
    FROM recent_grads
LIMIT 20;

   sqlite://
 * sqlite:////home/mohammeds/datasets/jobs.db
Done.


Sample_category
Small
Small
Small
Small
Medium
Small
Small
Small
Large
Medium


# 3. Dissecting CASE

Here is the general syntax of the CASE expression we used on the previous screen, this time with an ELSE branch:

    CASE
    WHEN <condition_1> THEN <value_1>
    WHEN <condition_2> THEN <value_2>
    ELSE <value_3>
    END AS <new_column_name>

Note the following:

* It starts with CASE to indicate that we'll be creating values by cases.
* It ends with END to indicate where the CASE expression terminates.
* The reserved word WHEN signals each explicit case.
* The value for each case follows the reserved word THEN.
* The reserved word ELSE introduces a fallback value for rows that match none of the cases.

Here are some important observations:

* Anything you can use in WHERE for filtering, you can also use in place of the conditions above.
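* There can be one or more WHEN lines. The query on the previous screen used three, but any number works. The conditions are checked from top to bottom and the first one that matches wins, which is why Sample_size<1000 only labels sample sizes from 200 to 999 as Medium.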
* The ELSE line is optional — without it, rows that don't match any WHEN will get a missing value (NULL).
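
To see the effect of leaving out ELSE, here is a minimal sketch that uses Python's built-in sqlite3 module instead of the %%sql magic. The in-memory table demo is hypothetical; its three sample sizes are copied from rows of recent_grads shown in this lesson. The query is an illustration under those assumptions, not part of the lesson's own code.

```python
import sqlite3

# Hypothetical in-memory table; the three Sample_size values also appear in recent_grads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (Major TEXT, Sample_size INTEGER)")
conn.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", 36),
        ("CHEMICAL ENGINEERING", 289),
        ("MECHANICAL ENGINEERING", 1029),
    ],
)

# No ELSE branch: a row that matches neither WHEN gets NULL.
rows = conn.execute(
    """
    SELECT Major,
           CASE
           WHEN Sample_size < 200 THEN 'Small'
           WHEN Sample_size < 1000 THEN 'Medium'
           END AS Sample_category
      FROM demo
     ORDER BY Sample_size;
    """
).fetchall()
conn.close()

for row in rows:
    print(row)
```

The row with a sample size of 1029 matches neither condition, so its Sample_category comes back as NULL, which Python displays as None.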

In [21]:
%%sql

SELECT Major, Sample_size,
        CASE
        WHEN Sample_size<200 THEN 'Small'
        WHEN Sample_size<1000 THEN 'Medium'
        ELSE 'Large'
        END AS Sample_category
    FROM recent_grads
LIMIT 20;


Major,Sample_size,Sample_category
PETROLEUM ENGINEERING,36,Small
MINING AND MINERAL ENGINEERING,7,Small
METALLURGICAL ENGINEERING,3,Small
NAVAL ARCHITECTURE AND MARINE ENGINEERING,16,Small
CHEMICAL ENGINEERING,289,Medium
NUCLEAR ENGINEERING,17,Small
ACTUARIAL SCIENCE,51,Small
ASTRONOMY AND ASTROPHYSICS,10,Small
MECHANICAL ENGINEERING,1029,Large
ELECTRICAL ENGINEERING,631,Medium


# 4. Calculating Group-Level Summary Statistics

In [25]:
%%sql

SELECT Major_category, SUM(Total) AS Total_graduates
    FROM recent_grads
GROUP BY Major_category;


Major_category,Total_graduates
Agriculture & Natural Resources,79981
Arts,357130
Biology & Life Science,453862
Business,1302376
Communications & Journalism,392601
Computers & Mathematics,299008
Education,559129
Engineering,537583
Health,463230
Humanities & Liberal Arts,713468


# 5. GROUP BY Visual Breakdown

Here's how the query works. The GROUP BY clause divides the rows of the table into groups, one for each unique value in the Major_category column, and then calculates the sum for each group. The following diagram shows how GROUP BY divides the data (the diagram uses a small sample from the recent_grads table):

![](https://dq-content.s3.amazonaws.com/254/5.1-group.gif)

For each group, the SQL engine evaluates every column and every aggregation function that we list after SELECT:

![](https://dq-content.s3.amazonaws.com/254/5.2-sum.gif)

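For a selected column that isn't wrapped in an aggregation function, SQLite takes the value from a single row of the group. Its documentation describes that row as arbitrarily chosen, so don't rely on it being the last one. Because every row in a group shares the same Major_category, this makes no difference for the grouping column itself. For a selected aggregation function, the engine computes the value across all rows in the group.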
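
The grouping step described above can be sketched outside the notebook with Python's built-in sqlite3 module. Everything in this example is an assumption made for illustration: the table grads_demo, the placeholder major names, and the totals are made up and are not taken from jobs.db. The sketch runs the GROUP BY query and then repeats the same computation by hand, by bucketing rows per category and summing each bucket.

```python
import sqlite3

# Hypothetical toy data: (Major, Major_category, Total). All values are made up.
toy_rows = [
    ("MAJOR A", "Engineering", 100),
    ("MAJOR B", "Arts", 50),
    ("MAJOR C", "Engineering", 200),
    ("MAJOR D", "Arts", 70),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE grads_demo (Major TEXT, Major_category TEXT, Total INTEGER)"
)
conn.executemany("INSERT INTO grads_demo VALUES (?, ?, ?)", toy_rows)

# One output row per distinct Major_category; SUM runs over each group's rows.
sql_totals = conn.execute(
    """
    SELECT Major_category, SUM(Total) AS Total_graduates
      FROM grads_demo
     GROUP BY Major_category
     ORDER BY Major_category;
    """
).fetchall()
conn.close()

# The same computation by hand: bucket rows by category, then sum each bucket.
manual_totals = {}
for _major, category, total in toy_rows:
    manual_totals[category] = manual_totals.get(category, 0) + total

print(sql_totals)
print(sorted(manual_totals.items()))
```

Both print statements show the same pairs, one per category. The ORDER BY is only there to make the output order predictable; GROUP BY on its own does not promise any particular order.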

In [27]:
%%sql

SELECT Major_category, AVG(ShareWomen) AS Average_women
    FROM recent_grads
GROUP BY Major_category;


Major_category,Average_women
Agriculture & Natural Resources,0.6179384232
Arts,0.56185119575
Biology & Life Science,0.584518475857143
Business,0.4050631853076923
Communications & Journalism,0.64383484025
Computers & Mathematics,0.5127519954545455
Education,0.6749855163125
Engineering,0.2571578951034483
Health,0.6168565694166667
Humanities & Liberal Arts,0.6761934042


# 6. Multiple Summary Statistics by Group

In addition to computing multiple summary statistics for the whole table, as we did in the "Summary Statistics" lesson, we can also compute multiple summary statistics by groups.

Working from the same example, here's a query that finds the following for each major category:

* The total number of employed graduates
* The average number of employed graduates per major
* The maximum number of employed graduates in a major
* The minimum number of employed graduates in a major

    SELECT Major_category,
           SUM(Employed), AVG(Employed), MAX(Employed), MIN(Employed)
      FROM recent_grads
     GROUP BY Major_category;
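
As a rough illustration of what this kind of query returns, the sketch below runs it against a small in-memory table with Python's built-in sqlite3 module. The table employed_demo and its Employed values are made up, and the aliases (Total_employed and so on) are hypothetical names added here to give the result columns readable headers; none of them come from jobs.db.

```python
import sqlite3

# Hypothetical toy data: (Major_category, Employed). All values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employed_demo (Major_category TEXT, Employed INTEGER)")
conn.executemany(
    "INSERT INTO employed_demo VALUES (?, ?)",
    [
        ("Arts", 4),
        ("Arts", 8),
        ("Arts", 12),
        ("Engineering", 10),
        ("Engineering", 30),
    ],
)

# Four aggregates per group, each with a hypothetical alias for its column header.
cursor = conn.execute(
    """
    SELECT Major_category,
           SUM(Employed) AS Total_employed,
           AVG(Employed) AS Average_employed,
           MAX(Employed) AS Max_employed,
           MIN(Employed) AS Min_employed
      FROM employed_demo
     GROUP BY Major_category
     ORDER BY Major_category;
    """
)
column_names = [col[0] for col in cursor.description]
stats = cursor.fetchall()
conn.close()

print(column_names)
for row in stats:
    print(row)
```

Each category produces exactly one row that holds all four statistics. In SQLite, AVG returns a floating-point value even when the inputs are integers, so the averages here appear as 8.0 and 20.0.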