<a href="https://colab.research.google.com/github/Rossel/DataQuest_Courses/blob/master/059__Summary_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# COURSE 1/5: SQL FUNDAMENTALS

# MISSION 2: Summary Statistics

In this mission, we:

- Explore how to calculate summary statistics in SQL.
- Learn about different types of functions.
- Learn about data types in SQL.

## 1. Introduction

In the last mission, we wrote queries that filtered rows and columns in a database table. Each of the queries we ran returned multiple rows of values. What if instead we wanted to just calculate the sum, average, minimum, or maximum of these results?

In this mission, we'll learn how to calculate [summary statistics](https://en.wikipedia.org/wiki/Summary_statistics) on subsets of a database table. We'll continue working with data on job outcomes, compiled by FiveThirtyEight.

In [None]:
# Import functions from Google modules into Colaboratory
!pip install -U -q PyDrive
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [None]:
# Insert file id from Google Drive shareable link:
# https://drive.google.com/file/d/1dGtg_Gx7VZgH4ZVydd2sFly3tNLsvhds/view?usp=sharing
file_id = '1dGtg_Gx7VZgH4ZVydd2sFly3tNLsvhds'

In [None]:
# Download the dataset
downloaded = drive.CreateFile({'id': file_id})
downloaded.GetContentFile('jobs.db')

In [None]:
# Import SQLite3 and pandas library
import sqlite3
import pandas as pd

In [None]:
%%capture
%load_ext sql
%sql sqlite:///jobs.db

In [None]:
%%sql

  SELECT * 
  FROM recent_grads
  LIMIT 5

 * sqlite:///jobs.db
Done.


index,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,Employed,Full_time,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564344,1976,1849,270,1207,37,0.018380527,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,Engineering,756,7,679,77,0.1018518519999999,640,556,170,388,85,0.117241379,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,Engineering,856,3,725,131,0.153037383,648,558,133,340,16,0.024096386,73000,50000,105000,456,176,0
3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1258,16,1123,135,0.107313196,758,1069,150,692,40,0.050125313,70000,43000,80000,529,102,0
4,5,2405,CHEMICAL ENGINEERING,Engineering,32260,289,21239,11021,0.341630502,25694,23170,5180,16697,1672,0.061097712,65000,50000,75000,18314,4440,972


Let's start with some motivating questions we want to answer:

- How many majors had mostly female students? How many had mostly male students? What proportion of majors had mostly female students?
- Which category of majors had the lowest unemployment rates? Which category of majors had the highest female representation?
- Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?

By the end of this mission, we will have covered the techniques to answer them. Let's move on to the next screen to start learning!

## 2. A Simple Question

Before we tackle questions like the ones we saw on the previous screen, let's start with this: What is the lowest proportion of women in the `recent_grads` table?

One way of thinking about this question is that we want to "determine the minimum value of `ShareWomen`" (recall that `ShareWomen` gives us the proportion of women graduates). To help us grasp the content taught on this screen, we'll focus on the three majors with the lowest proportions of women.

The three lowest `ShareWomen` values are roughly `0.09`, `0.07` and `0`, so the minimum is `0`. Here's how we can use SQL to answer this question:


In [None]:
%%sql

SELECT MIN(ShareWomen) 
  FROM recent_grads;

 * sqlite:///jobs.db
Done.


MIN(ShareWomen)
0.0


One takeaway from the result is that there is at least one major that had no women graduating from it.

It doesn't actually tell us what those majors are; we only know them because we singled out the three majors with the lowest proportions of women. In a typical table there are too many rows for us to tell just by looking.

We'll learn how to determine which rows the lowest value corresponds to later in the course.

Note that instead of just returning a single value, SQLite returned a table with a column (`MIN(ShareWomen)`) and the lowest value of `ShareWomen` as a row in that column (`0`).

A key idea in SQL is that every *result* is a table. One advantage of this simplification is that it is a common, visual representation that makes SQL approachable for a much wider audience. The disadvantage is that datasets and calculations that aren't well suited for this representation must be converted to be used in a SQL environment.
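
To make the "every result is a table" idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes a tiny in-memory stand-in for `recent_grads`: the major names are hypothetical, and the three `ShareWomen` values are the rounded ones quoted above.

```python
import sqlite3

# Hypothetical three-row stand-in for recent_grads (not the real dataset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.09), ("MAJOR B", 0.07), ("MAJOR C", 0.0)],
)

cursor = conn.execute("SELECT MIN(ShareWomen) FROM recent_grads;")

# The result has column metadata and a list of rows, just like a table.
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(column_names)
print(rows)  # [(0.0,)]
conn.close()
```

Even though we asked for a single number, we get back a list of rows (here, one row) with one column each.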

Let's practice!



Instructions:

- Write a query that returns the lowest unemployment rate.
  - For answer checking purposes, make sure the aggregate function is all uppercase, and that the column name is spelled exactly like this: `Unemployment_rate`.

## 3. Aggregate Functions

On the last screen, we used `MIN(ShareWomen)` to compute the minimum value in the `ShareWomen` column. We introduced the syntax `MIN(column_name)`. This is an example of a broader syntax, that of [functions](https://en.wikipedia.org/wiki/Function_(mathematics)#In_computer_science). A **function** is an entity that takes in one or multiple inputs and produces something, often called the **output** of the function.

In `MIN(ShareWomen)`, we:

- Used:
  - The input `ShareWomen`
  - The function `MIN`
- Got the output `0.0`.

Here's an animation depicting this operation:
![alt text](https://dq-content.s3.amazonaws.com/253/m253-3.gif)


This function is one of a particular type of functions called [aggregate functions](https://en.wikipedia.org/wiki/Aggregate_function). Aggregate functions are functions that are applied over columns of values and return a single value. The `MIN` and `MAX` functions, for example, calculate and return the minimum and maximum values in a column.

Some other commonly used aggregate functions are:

- `AVG`, which returns the mean of its input.
- `COUNT`, which counts the number of values in its input.
- `SUM`, which sums the values in its input.

Note the use of capital letters for the functions' names. This isn't necessary for the query to run, but it's a widely used convention that functions, just like other reserved words, should be typed in capital letters.

For the answer checker to correctly validate your answer, you should use capital letters for aggregate functions. Also, coding style suggests we use capital letters, so make sure you do from now on!
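
As a quick illustration of the aggregate functions listed above, the sketch below runs each one over the same column. The table `toy` and its three values are made up for this example; they are not part of `jobs.db`.

```python
import sqlite3

# Hypothetical one-column table with three values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy (value INTEGER)")
conn.executemany("INSERT INTO toy VALUES (?)", [(2,), (4,), (9,)])

results = {}
for func in ("MIN", "MAX", "AVG", "COUNT", "SUM"):
    # Each aggregate collapses the whole column into a single value.
    (results[func],) = conn.execute(f"SELECT {func}(value) FROM toy;").fetchone()

print(results)  # {'MIN': 2, 'MAX': 9, 'AVG': 5.0, 'COUNT': 3, 'SUM': 15}
conn.close()
```

Note that `AVG` returns a decimal number even though every input is an integer.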



Instructions:

- Write a query that computes the sum of the `Total` column.
  - For answer checking purposes, make sure the aggregate function is all uppercase, and that the column name is spelled exactly like this: `Total`.

## 4. Order of Execution

At the beginning of the mission, two of the questions we set forth were:

- How many majors had mostly female students?
- How many had mostly male students?

On this screen, we'll answer them.

In the last mission, we learned how to return all majors that contained majority female students:



```
SELECT Major
  FROM recent_grads
 WHERE ShareWomen > 0.5;
```



To answer the first question above, we'd like to count them. On the last screen, we learned that there is an aggregate function that counts the number of values in the input: `COUNT`.

A query comes to mind as a way to answer the question:


```
SELECT COUNT(Major)
  FROM recent_grads
 WHERE ShareWomen > 0.5;
```


But does it really work? You may be wondering if `COUNT(Major)` counts the rows before or after the `WHERE` clause is executed.

The answer is that it runs after! So this query will find the answer. Here's the result:


In [None]:
%%sql

SELECT COUNT(Major)
  FROM recent_grads
 WHERE ShareWomen > 0.5;

 * sqlite:///jobs.db
Done.


COUNT(Major)
97


Let's quickly summarize what a query looks like using all the clauses we learned so far:



```
SELECT *
  FROM some_table
 WHERE some_condition
 ORDER BY some_column
 LIMIT some_limit;
```


Here is the order in which the clauses run:

1. `FROM`
2. `WHERE`
3. `SELECT`
4. `ORDER BY`
5. `LIMIT`

Since aggregate functions are part of `SELECT`, the calculation happens after `WHERE` acts.
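
We can check this ordering on a small example. The sketch below assumes a hypothetical four-row version of `recent_grads` with made-up values, and compares `COUNT(Major)` with and without the `WHERE` clause.

```python
import sqlite3

# Hypothetical four-row stand-in for recent_grads (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.12), ("MAJOR B", 0.65), ("MAJOR C", 0.34), ("MAJOR D", 0.81)],
)

# Without WHERE, COUNT sees every row.
(all_majors,) = conn.execute("SELECT COUNT(Major) FROM recent_grads;").fetchone()

# With WHERE, rows are filtered first, and COUNT only sees the survivors.
(mostly_female,) = conn.execute(
    "SELECT COUNT(Major) FROM recent_grads WHERE ShareWomen > 0.5;"
).fetchone()

print(all_majors, mostly_female)  # 4 2
conn.close()
```

If `COUNT` ran before `WHERE`, both queries would return `4`.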

Instructions:

- Write a query that returns the number of majors with mostly male students.
  - For answer checking purposes, make sure the aggregate function is all uppercase, and that the column name is spelled exactly like this: `Major`.

## 5. Missing Values

Sometimes, for various reasons, tables contain no values in certain cells (a **cell** is a location in a table given by specifying a row and a column).

When this happens, we say any of the following sentences (or variations of them):

- The value is missing
- It's a missing value
- The value is `NULL`
  - `NULL` is a special entity in SQL that exists to capture the concept of missing value.

Something that is important to keep in mind when using aggregate functions is that most of them ignore missing values.

For example, if we were to select `COUNT(Primes)` in the table below, the result would be `3` due to the missing value in the third row.

| Primes |
| --- |
| 2 |
| 3 |
|  |
| 7 |

Here's an animation depicting this operation:

![alt text](https://dq-content.s3.amazonaws.com/253/m253-5.gif)

A consequence of this is that we must know a column has no null values before we can use it to count the number of rows.

To circumvent this, we can resort to the `*` and pass it into `COUNT` as if it were a column name. So, in the table above, we'd use `COUNT(*)` instead of `COUNT(Primes)`.
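
The `Primes` example can be reproduced with a short sketch. It assumes a hypothetical table named `numbers` holding the four rows shown above, with Python's `None` standing in for the missing value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table name; the column matches the Primes example above.
conn.execute("CREATE TABLE numbers (Primes INTEGER)")
# None is stored as NULL, i.e. a missing value.
conn.executemany("INSERT INTO numbers VALUES (?)", [(2,), (3,), (None,), (7,)])

non_null, total_rows = conn.execute(
    "SELECT COUNT(Primes), COUNT(*) FROM numbers;"
).fetchone()

print(non_null, total_rows)  # 3 4
conn.close()
```

`COUNT(Primes)` skips the `NULL`, while `COUNT(*)` counts every row regardless of its contents.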

In this screen's exercise, we'll ask you to use `COUNT` to find a column with at least one missing value. You can do this by running the query below, replacing `<column_name>` with the name of each column in `recent_grads` in turn until you find a suspicious result. Here's the query:


```
SELECT COUNT(<column_name>)
  FROM recent_grads;
```

For your convenience, here are the names of all the columns: `index`, `Rank`, `Major_code`, `Major`, `Major_category`, `Total`, `Sample_size`, `Men`, `Women`, `ShareWomen`, `Employed`, `Full_time`, `Part_time`, `Full_time_year_round`, `Unemployed`, `Unemployment_rate`, `Median`, `P25th`, `P75th`, `College_jobs`, `Non_college_jobs`, `Low_wage_jobs`.



**Instructions:**

1. Use the query above to find a column that has at least one missing value.
- Here's a hint: it's one of the columns whose values lie between 0 and 1.
- Make sure you don't submit an answer while experimenting.
2. Write a query that counts the number of rows in `recent_grads`, followed by the number of non-missing values in the column you identified in step 1.
- For answer checking purposes, make sure the aggregate functions are all uppercase, and that the column name is spelled exactly as it shows in the output of the query above.

## 6. Combining Multiple Aggregation Functions

Instead of writing an individual query for each specific question we want to answer, we can actually write queries that answer multiple questions at once. Let's take the following questions:

- What's the lowest median salary?
- What's the highest median salary?
- What's the total number of students?

Recall that we can select multiple columns by including their names with commas, like so:



```
SELECT Major, Major_category
  FROM recent_grads;
```


We can apply the same principle to combine multiple aggregation functions into a single query:



In [None]:
%%sql

SELECT MIN(Median), MAX(Median), SUM(Total)
  FROM recent_grads;

 * sqlite:///jobs.db
Done.


MIN(Median),MAX(Median),SUM(Total)
22000,110000,6776015


**Instructions:**

1. Write a query that computes the average of the `Total` column, the minimum of the `Men` column, and the maximum of the `Women` column, in that specific order.
- For answer checking purposes, make sure the aggregate functions are all uppercase, and that all column names are spelled exactly as above.

## 7. Customizing the Results

All of the queries we wrote so far have had somewhat unpleasant column names in the results, like `AVG(Total)` and `MIN(Men)`.

Many companies use SQL environments and tools that can run your query, turn the results into a plot of your choosing, and then create a PDF report containing multiple plots (and some additional explanation from the user).

Given that others may interpret and understand the results of your SQL queries, it's helpful to be able to specify custom names for the columns in our results.

We can do just that using `AS`:


```
SELECT SUM(Total) AS num_students
  FROM recent_grads;
```

This is known as an [alias](https://www.tutorialspoint.com/sqlite/sqlite_alias_syntax.htm). The alias applies only to our results table; the table in the database isn't renamed.

If the alias contains certain characters, like spaces, we need to put it in quotes. More generally, we can specify an arbitrary phrase as a string using quotation marks:


```
SELECT SUM(Total) AS 'Total Students'
  FROM recent_grads;
```

We can also drop `AS` entirely and just add the name next to the original column:


```
SELECT SUM(Total) 'Total Students'
  FROM recent_grads;
```
We'll keep using `AS`, though; it's a matter of [style](https://www.sqlstyle.guide/#aliasing-or-correlations). Lastly, we can reference renamed columns when writing longer queries to make our code more compact:



```
SELECT Major AS m, Major_category AS mc, Unemployment_rate AS ur
  FROM recent_grads
 WHERE (mc = 'Engineering') AND (ur > 0.04 AND ur < 0.08)
 ORDER BY ur DESC;
```
Later you'll also learn how to use aliases for both database tables and results tables!
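
To see that an alias only affects the results table, here is a small sketch. It assumes a hypothetical two-row version of `recent_grads` with made-up totals, and compares the column names of the result with the column names stored in the database.

```python
import sqlite3

# Hypothetical two-row stand-in for recent_grads (values are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Total INTEGER)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 100), ("MAJOR B", 250)],
)

cursor = conn.execute("SELECT SUM(Total) AS num_students FROM recent_grads;")
# Column names of the results table.
result_columns = [col[0] for col in cursor.description]
(num_students,) = cursor.fetchone()

# Column names of the table itself (second field of each table_info row).
table_columns = [row[1] for row in conn.execute("PRAGMA table_info(recent_grads);")]

print(result_columns, num_students)  # ['num_students'] 350
print(table_columns)                 # ['Major', 'Total']
conn.close()
```

The result carries the alias, while the underlying table still has its original column names.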



**Instructions:**

1. Write a query that returns, in the following order:
- The number of rows as `Number of Majors`
- The maximum value of `Unemployment_rate` as `Highest Unemployment Rate`

## 8. Counting Unique Values

Let's explore `Major_category`. It's a column with only a few unique values. What if we want to obtain a list of the values in this column without repetitions? Or find out how many distinct values there are in this column?

We can return all of the unique values in a column using the `DISTINCT` keyword.

In [None]:
%%sql

SELECT DISTINCT Major_category
  FROM recent_grads;

 * sqlite:///jobs.db
Done.


Major_category
Engineering
Business
Physical Sciences
Law & Public Policy
Computers & Mathematics
Agriculture & Natural Resources
Industrial Arts & Consumer Services
Arts
Health
Social Science


As with the other SQL clauses we learned, we can use the `DISTINCT` statement with multiple columns to return unique pairings of those columns:

In [None]:
%%sql

SELECT DISTINCT Major, Major_category
  FROM recent_grads
 LIMIT 5;

 * sqlite:///jobs.db
Done.


Major,Major_category
PETROLEUM ENGINEERING,Engineering
MINING AND MINERAL ENGINEERING,Engineering
METALLURGICAL ENGINEERING,Engineering
NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering
CHEMICAL ENGINEERING,Engineering


Lastly, we can count the number of unique values in a column by placing the `DISTINCT` keyword inside the `COUNT()` function:

In [None]:
%%sql

SELECT COUNT(DISTINCT Major_category) AS unique_major_categories
  FROM recent_grads;

 * sqlite:///jobs.db
Done.


unique_major_categories
16
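
The difference between counting values and counting unique values is easy to check on a small example. The sketch below assumes a hypothetical five-row version of `recent_grads`; the major names are placeholders.

```python
import sqlite3

# Hypothetical five-row stand-in for recent_grads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Major_category TEXT)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("MAJOR A", "Engineering"),
        ("MAJOR B", "Engineering"),
        ("MAJOR C", "Business"),
        ("MAJOR D", "Arts"),
        ("MAJOR E", "Business"),
    ],
)

# DISTINCT removes repeated values from the result.
distinct_categories = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT Major_category FROM recent_grads ORDER BY Major_category;"
    )
]

# COUNT(col) counts every non-null value; COUNT(DISTINCT col) counts unique ones.
n_all, n_distinct = conn.execute(
    "SELECT COUNT(Major_category), COUNT(DISTINCT Major_category) FROM recent_grads;"
).fetchone()

print(distinct_categories)  # ['Arts', 'Business', 'Engineering']
print(n_all, n_distinct)    # 5 3
conn.close()
```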


**Instructions:**

1. Write a query that returns the number of unique values in the `Major`, `Major_category`, and `Major_code` columns. Use the following aliases in the following order:
- For the unique value count of the `Major` column, use the alias `unique_majors`.
- For the unique value count of the `Major_category` column, use the alias `unique_major_categories`.
- For the unique value count of the `Major_code` column, use the alias `unique_major_codes`.


## 9. Data Types

Let's run a query and look at the result:

In [None]:
%%sql

SELECT Major, Total, Men, Women, Unemployment_rate
  FROM recent_grads
 ORDER BY Unemployment_rate DESC
 LIMIT 3;

 * sqlite:///jobs.db
Done.


Major,Total,Men,Women,Unemployment_rate
NUCLEAR ENGINEERING,2573,2200,373,0.177226407
PUBLIC ADMINISTRATION,5629,2947,2682,0.1594905999999999
COMPUTER NETWORKING AND TELECOMMUNICATIONS,7613,5291,2322,0.151849807


Let's focus on the different kinds of values we got:

- In the `Major` column we see text.
- In the `Total`, `Men` and `Women` columns we see integers.
- In the `Unemployment_rate` column we see decimal numbers.

Each of the above is called a [data type](https://en.wikipedia.org/wiki/Data_type). As a rule, each column holds values of exactly one type; types aren't mixed within a column. (SQLite itself doesn't strictly enforce this, but well-designed tables follow it.)

You can read more about the SQLite data types [here](https://www.sqlite.org/datatype3.html). We'll have a chance to explore them from the point of view of the database when we learn how to create tables.
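
If we want to ask SQLite which type a value has, one option is its built-in `typeof` function. The sketch below assumes a one-row table whose declared column types (`TEXT`, `INTEGER`, `REAL`) are our own guess; the actual schema of `jobs.db` may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed column types; the real recent_grads schema isn't shown in this mission.
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Total INTEGER, Unemployment_rate REAL)"
)
conn.execute(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    ("NUCLEAR ENGINEERING", 2573, 0.177226407),
)

# typeof() reports the storage type of each value.
types = conn.execute(
    "SELECT typeof(Major), typeof(Total), typeof(Unemployment_rate) FROM recent_grads;"
).fetchone()

print(types)  # ('text', 'integer', 'real')
conn.close()
```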

For now we'll focus on some of the things we can and may want to do with different data types.



## 10. String Functions and Operations

We previously learned about aggregate functions. Aggregate functions take a column as input and return one value for the whole column. We'll now learn about functions that take a column as input and return another column, in which each value is a transformation of the corresponding input value. The values of text columns are typically called strings, hence this screen's title.

We'll start with the `LENGTH` function. Given a text column, the `LENGTH` function returns the number of characters in each input string.

![alt text](https://dq-content.s3.amazonaws.com/253/m253-10.gif)

Let's build on the query we saw on the previous screen to see it in action.

In [None]:
%%sql

SELECT Major,
       Total, Men, Women, Unemployment_rate,
       LENGTH(Major) AS Length_of_name
  FROM recent_grads
 ORDER BY Unemployment_rate DESC
 LIMIT 3;

 * sqlite:///jobs.db
Done.


Major,Total,Men,Women,Unemployment_rate,Length_of_name
NUCLEAR ENGINEERING,2573,2200,373,0.177226407,19
PUBLIC ADMINISTRATION,5629,2947,2682,0.1594905999999999,21
COMPUTER NETWORKING AND TELECOMMUNICATIONS,7613,5291,2322,0.151849807,42


Another thing we can do with strings is concatenate them by using the `||` operator. Here's a usage example:

In [None]:
%%sql

SELECT 'Data' || 'quest' AS 'e-learning';

 * sqlite:///jobs.db
Done.


e-learning
Dataquest


It's important to point out that, although we marked strings with single quotes (`'`), SQLite also accepts double quotes (`"`). However, single quotes are the safer choice: in standard SQL, and in many other databases, double quotes denote identifiers such as column names rather than strings.

In the same way that we can compare columns with both constant numbers and other columns in `WHERE` clauses, we can also mix columns and constant strings when concatenating. For example:



In [None]:
%%sql

SELECT 'Cat: ' || Major_category
  FROM recent_grads
 LIMIT 2;

 * sqlite:///jobs.db
Done.


'Cat: ' || Major_category
Cat: Engineering
Cat: Engineering
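
String functions and concatenation can be combined in a single query. The sketch below assumes a hypothetical two-row version of `recent_grads` and uses `UPPER`, which works like `LENGTH` but returns the string in uppercase; the aliases `category_upper` and `label` are our own.

```python
import sqlite3

# Hypothetical two-row stand-in for recent_grads.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Major_category TEXT)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("NUCLEAR ENGINEERING", "Engineering"),
        ("PUBLIC ADMINISTRATION", "Law & Public Policy"),
    ],
)

query = """
SELECT UPPER(Major_category) AS category_upper,
       LENGTH(Major) AS Length_of_name,
       Major_category || ': ' || Major AS label
  FROM recent_grads
 ORDER BY Major;
"""
rows = conn.execute(query).fetchall()

for row in rows:
    print(row)
# ('ENGINEERING', 19, 'Engineering: NUCLEAR ENGINEERING')
# ('LAW & PUBLIC POLICY', 21, 'Law & Public Policy: PUBLIC ADMINISTRATION')
conn.close()
```

Each function is applied row by row, so the result has as many rows as the input.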


In the following exercise, you'll use the `LOWER` function to replace the `Major` column with one where all values are written in lowercase. It is used just like the `LENGTH` function. Here's what the output of the exercise should look like:


| Major | Total | Men | Women | Unemployment_rate | Length_of_name |
| --- | --- | --- | --- | --- | --- |
| Major: nuclear engineering | 2573 | 2200 | 373 | 0.177226 | 19 |
| Major: public administration | 5629 | 2947 | 2682 | 0.159491 | 21 |
| Major: computer networking and telecommunications | 7613 | 5291 | 2322 | 0.151850 | 42 |



It's a tough one, so don't feel discouraged if it takes you some tries.

**Instructions:**

To save yourself some typing, feel free to copy and paste the query at the top of the screen and work from it.

Write a query that:

- Selects in order
  - The values in the `Major` column in lowercase, preceded by the string `Major: `. Use the alias `Major`.
  - `Total`.
  - `Men`.
  - `Women`.
  - `Unemployment_rate`.
  - `LENGTH(Major)` as `Length_of_name`.
- Ordered in descending order by the unemployment rate.

## 11. Performing Arithmetic in SQL

Let's revisit one of the questions from the beginning of the mission:

- Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?

In the same way that we can use string functions and operators, we can also perform arithmetic on the columns in a table. SQL supports the standard arithmetic operators: `*`, `+`, `-`, and `/`, and we can use them like any other operator:

In [None]:
%%sql
SELECT P75th - P25th AS quartile_spread
  FROM recent_grads
 LIMIT 10;

 * sqlite:///jobs.db
Done.


quartile_spread
30000
35000
55000
37000
25000
52000
19000
77500
22000
27000


You can also add, subtract, multiply, or divide columns by individual values:

In [None]:
%%sql
SELECT ShareWomen * 100 AS percent_female
  FROM recent_grads
 LIMIT 10;

 * sqlite:///jobs.db
Done.


percent_female
12.0564344
10.1851852
15.3037383
10.7313196
34.1630502
14.4966965
53.5714286
44.1355573
13.9792801
43.7846874


One thing to note is that an arithmetic operation involving at least one floating point value (or a column of floating point values) returns a floating point value, while an operation on two integers returns an integer:

- Two floats: returns a float.
  - `SELECT 100.0 / 100.0` returns `1.0`.
- A float and an integer: returns a float.
  - `SELECT 100 / 1.0` returns `100.0`.
- Two integers: returns an integer.
  - `SELECT 100 / 10` returns `10`.
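
The integer case deserves extra care when the division isn't exact, because SQLite then drops the fractional part. The sketch below runs the three expressions above plus two extra ones (`5 / 2` and `5 / 2.0`, added here for illustration) against an empty in-memory database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

expressions = ["100.0 / 100.0", "100 / 1.0", "100 / 10", "5 / 2", "5 / 2.0"]

# Evaluate each expression with a FROM-less SELECT.
results = {
    expr: conn.execute(f"SELECT {expr};").fetchone()[0] for expr in expressions
}

for expr, value in results.items():
    print(f"{expr} -> {value!r}")
# 100.0 / 100.0 -> 1.0
# 100 / 1.0 -> 100.0
# 100 / 10 -> 10
# 5 / 2 -> 2
# 5 / 2.0 -> 2.5
conn.close()
```

When we want a decimal result from two integer columns, one common approach is to multiply one of them by `1.0` first so that the operation involves a float.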


**Instructions:**

Write a query that computes the difference between the 25th and 75th percentile of salaries for all majors.
- Return the `Major` column first, using the default column name.
- Return the `Major_category` column second, using the default column name.
- Return the computed difference between the 25th and 75th percentile third, using the alias `quartile_spread`.
- Order the results from lowest to highest and only return the first 20 results.

## 12. Next Steps

In this mission, we:

- Explored how to calculate summary statistics in SQL.
- Learned about different types of functions.
- Learned about data types in SQL.

In the next mission, we'll learn how to calculate statistics within specific subgroups using the `GROUP BY` statement.