# 1. Introduction

In this mission, we'll learn how to calculate [summary statistics](https://en.wikipedia.org/wiki/Summary_statistics) on subsets of a database table.

Let's start with some motivating questions we want to answer:

* How many majors had mostly female students? How many had mostly male students? What proportion of majors had mostly female students?
* Which category of majors had the lowest unemployment rates? Which category of majors had the highest female representation?
* Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?

By the end of this mission, we will have covered the techniques to answer them. Let's move on to the next screen to start learning!

# 2. A Simple Question

**What is the lowest proportion of women on the recent_grads table?**

In [1]:
%load_ext sql

In [2]:
%sql sqlite:///jobs.db

In [3]:
%%sql

SELECT MIN(ShareWomen)
FROM recent_grads

 * sqlite:///jobs.db
Done.


MIN(ShareWomen)
0.0


**Write a query that returns the lowest unemployment rate.**

In [4]:
%%sql

SELECT MIN(Unemployment_rate)
FROM recent_grads

 * sqlite:///jobs.db
Done.


MIN(Unemployment_rate)
0.0


# 3. Aggregate Functions

The MIN function belongs to a class of functions called [aggregate functions](https://en.wikipedia.org/wiki/Aggregate_function). `Aggregate functions are functions that are applied over columns of values and return a single value.`
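To make the definition concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `toy_grads` table and its three rows are made up for illustration and are not part of jobs.db.

```python
import sqlite3

# Hypothetical toy table; the values are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_grads (Major TEXT, Total INTEGER)")
conn.executemany(
    "INSERT INTO toy_grads VALUES (?, ?)",
    [("A", 100), ("B", 250), ("C", 50)],
)

# Each aggregate function reads the whole Total column and returns one value,
# so the query yields a single row no matter how many rows the table has.
agg_row = conn.execute(
    "SELECT MIN(Total), MAX(Total), SUM(Total), AVG(Total), COUNT(Total) "
    "FROM toy_grads"
).fetchone()
print(agg_row)
```

Three input rows collapse into one output row, with one value per aggregate function.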

**Write a query that computes the sum of the Total column.**

In [5]:
%%sql

SELECT SUM(Total)
FROM recent_grads

 * sqlite:///jobs.db
Done.


SUM(Total)
6776015


# 4. Order of Execution

Here is the order in which the clauses run:

* FROM
* WHERE
* SELECT
* ORDER BY
* LIMIT
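The order above can be observed on a small example. This is a sketch with Python's sqlite3 module; the `toy_grads` table and its values are hypothetical.

```python
import sqlite3

# Hypothetical toy table; the values are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO toy_grads VALUES (?, ?)",
    [("A", 0.2), ("B", 0.7), ("C", 0.4), ("D", 0.9)],
)

# FROM picks the table, WHERE drops rows B and D,
# and only then does SELECT compute COUNT over what is left.
mostly_male = conn.execute(
    "SELECT COUNT(Major) FROM toy_grads WHERE ShareWomen < 0.5"
).fetchone()[0]
print(mostly_male)  # 2

# ORDER BY sorts the selected rows; LIMIT trims the sorted result last.
top_two = conn.execute(
    "SELECT Major FROM toy_grads ORDER BY ShareWomen DESC LIMIT 2"
).fetchall()
print(top_two)  # [('D',), ('B',)]
```

If LIMIT ran before ORDER BY, the second query would return two arbitrary rows instead of the two with the highest ShareWomen.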

**Write a query that returns the number of majors with mostly male students.**

In [6]:
%%sql

SELECT COUNT(Major)
FROM recent_grads
WHERE ShareWomen<0.5;

 * sqlite:///jobs.db
Done.


COUNT(Major)
76


# 5. Missing Values

Sometimes, for various reasons, tables contain no values in certain cells (a cell is a location in a table given by specifying a row and a column).

When this happens, we say any of the following (or variations of them):

* The value is missing.
* It's a missing value.
* The value is NULL.

`NULL is a special entity in SQL that exists to capture the concept of a missing value`.

**`Something that is important to keep in mind when using aggregate functions is that most of them ignore missing values`.**

A consequence of this is that COUNT applied to a column only counts that column's non-null values, so we must know the column has no null values before we use it to count the number of rows. COUNT(*), by contrast, counts every row.
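The following sketch illustrates how aggregate functions treat NULL, using Python's sqlite3 module. The `toy_grads` table is hypothetical; Python's `None` is stored as NULL.

```python
import sqlite3

# Hypothetical toy table with one missing unemployment rate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_grads (Major TEXT, Unemployment_rate REAL)")
conn.executemany(
    "INSERT INTO toy_grads VALUES (?, ?)",
    [("A", 0.05), ("B", None), ("C", 0.15)],  # None becomes NULL
)

# COUNT(*) counts rows; COUNT(column) and AVG(column) skip NULL values.
total_rows, non_null_rates, avg_rate = conn.execute(
    "SELECT COUNT(*), COUNT(Unemployment_rate), AVG(Unemployment_rate) "
    "FROM toy_grads"
).fetchone()
print(total_rows, non_null_rates)  # 3 2
print(avg_rate)  # averaged over the two non-null values only
```

Note that the average is computed over two values, not three: the NULL is neither counted nor treated as zero.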


**Write a query that counts the number of rows in recent_grads, followed by the number of non-null values in the Unemployment_rate column.**

In [7]:
%%sql

SELECT COUNT(*),COUNT(Unemployment_rate)
FROM recent_grads

 * sqlite:///jobs.db
Done.


COUNT(*),COUNT(Unemployment_rate)
173,172


# 6. Combining Multiple Aggregation Functions

Instead of writing an individual query for each specific question we want to answer, we can actually write queries that answer multiple questions at once. Let's take the following questions:

* What's the lowest median salary?
* What's the highest median salary?
* What's the total number of students?

In [8]:
%%sql 

SELECT MIN(Median),MAX(Median),SUM(Total)
FROM recent_grads;

 * sqlite:///jobs.db
Done.


MIN(Median),MAX(Median),SUM(Total)
22000,110000,6776015


**Write a query that computes the average of the Total column, the minimum of the Men column, and the maximum of the Women column, in that specific order.**

In [9]:
%%sql 

SELECT AVG(Total),MIN(Men),MAX(Women)
FROM recent_grads

 * sqlite:///jobs.db
Done.


AVG(Total),MIN(Men),MAX(Women)
39167.71676300578,119,307087


# 7. Customizing the Results

All of the queries we wrote so far have had somewhat unpleasant column names in the results, like AVG(Total) and MIN(Men).

Many companies use SQL environments and tools that can run your query, turn the results into a plot of your choosing, and then create a PDF report containing multiple plots (and some additional explanation from the user).

`Given that others may interpret and understand the results of your SQL queries, it's helpful to be able to specify custom names for the columns in our results`.

We can do just that **using AS**, which gives a column in the result an alias.

`We can also drop AS entirely and just add the name next to the original column`.
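Both ways of naming a result column can be sketched as follows with Python's sqlite3 module. The `toy_grads` table and the alias `highest_rate` are hypothetical; `cursor.description` is used here only to read back the column names of the result.

```python
import sqlite3

# Hypothetical toy table; the values are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_grads (Major TEXT, Unemployment_rate REAL)")
conn.executemany(
    "INSERT INTO toy_grads VALUES (?, ?)",
    [("A", 0.05), ("B", 0.12)],
)

# First column: alias given with AS (quoted because it contains spaces).
# Second column: alias given without AS, placed right after the expression.
cur = conn.execute(
    """
    SELECT COUNT(*) AS "Number of Majors",
           MAX(Unemployment_rate) highest_rate
    FROM toy_grads
    """
)
column_names = [col[0] for col in cur.description]
alias_row = cur.fetchone()
print(column_names)  # ['Number of Majors', 'highest_rate']
print(alias_row)     # (2, 0.12)
```

Writing AS explicitly is optional, but it tends to make long SELECT lists easier to read.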

## TODO:
* Write a query that returns, in the following order:
  * The number of rows as Number of Majors
  * The maximum value of Unemployment_rate as Highest Unemployment Rate

In [10]:
%%sql

SELECT COUNT(*) AS 'Number of Majors',MAX(Unemployment_rate) 'Highest Unemployment Rate'
FROM recent_grads

 * sqlite:///jobs.db
Done.


Number of Majors,Highest Unemployment Rate
173,0.177226407


# 8. Counting Unique Values

**`We can return all of the unique values in a column using the DISTINCT keyword.`**

As with the other SQL keywords we learned, we can use DISTINCT with multiple columns to return the unique pairings of those columns.
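Here is a short sketch of DISTINCT on one column, on a pair of columns, and inside COUNT, using Python's sqlite3 module. The `toy_grads` table and its rows are hypothetical.

```python
import sqlite3

# Hypothetical toy table containing one duplicated row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_grads (Major TEXT, Major_category TEXT)")
conn.executemany(
    "INSERT INTO toy_grads VALUES (?, ?)",
    [("MATH", "Science"), ("PHYSICS", "Science"),
     ("MATH", "Science"), ("ART", "Arts")],
)

# Unique values of a single column.
categories = conn.execute(
    "SELECT DISTINCT Major_category FROM toy_grads ORDER BY Major_category"
).fetchall()
print(categories)  # [('Arts',), ('Science',)]

# Unique pairings of two columns: the duplicated MATH row appears once.
pairs = conn.execute(
    "SELECT DISTINCT Major, Major_category FROM toy_grads ORDER BY Major"
).fetchall()
print(pairs)  # [('ART', 'Arts'), ('MATH', 'Science'), ('PHYSICS', 'Science')]

# Counting unique values by placing DISTINCT inside COUNT.
unique_categories = conn.execute(
    "SELECT COUNT(DISTINCT Major_category) FROM toy_grads"
).fetchone()[0]
print(unique_categories)  # 2
```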

## TODO:
* Write a query that returns the number of unique values in the Major, Major_category, and Major_code columns. Use the following aliases in the following order:
  * For the unique value count of the Major column, use the alias unique_majors.
  * For the unique value count of the Major_category column, use the alias unique_major_categories.
  * For the unique value count of the Major_code column, use the alias unique_major_codes.

In [11]:
%%sql

SELECT COUNT(DISTINCT Major) AS unique_majors,
       COUNT(DISTINCT Major_category) AS unique_major_categories,
       COUNT(DISTINCT Major_code) AS unique_major_codes
FROM recent_grads

 * sqlite:///jobs.db
Done.


unique_majors,unique_major_categories,unique_major_codes
173,16,173


# 9. Data Types

In [12]:
%%sql

SELECT Major, Total, Men, Women, Unemployment_rate
  FROM recent_grads
 ORDER BY Unemployment_rate DESC
 LIMIT 3;

 * sqlite:///jobs.db
Done.


Major,Total,Men,Women,Unemployment_rate
NUCLEAR ENGINEERING,2573,2200,373,0.177226407
PUBLIC ADMINISTRATION,5629,2947,2682,0.1594905999999999
COMPUTER NETWORKING AND TELECOMMUNICATIONS,7613,5291,2322,0.151849807


* In the Major column we see text.
* In the Total, Men and Women columns we see integers.
* In the Unemployment_rate column we see decimal numbers.

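* Each of the above is self-descriptively called a [data type](https://en.wikipedia.org/wiki/Data_type). Each column is declared with exactly one type, and its values are not meant to be mixed. (SQLite enforces this more loosely than most databases, through what it calls type affinity.)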

* You can read more about the SQLite data types [here](https://www.sqlite.org/datatype3.html). We'll have a chance to explore them from the point of view of the database when we learn how to create tables.
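One way to inspect the type of a stored value is SQLite's built-in `typeof()` function. The sketch below uses Python's sqlite3 module and a hypothetical `toy_grads` table; its single row copies the first row of the output above.

```python
import sqlite3

# Hypothetical table holding one row copied from the output above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE toy_grads (Major TEXT, Total INTEGER, Unemployment_rate REAL)"
)
conn.execute(
    "INSERT INTO toy_grads VALUES (?, ?, ?)",
    ("NUCLEAR ENGINEERING", 2573, 0.177226407),
)

# typeof() reports the storage class of each value.
types = conn.execute(
    "SELECT typeof(Major), typeof(Total), typeof(Unemployment_rate) "
    "FROM toy_grads"
).fetchone()
print(types)  # ('text', 'integer', 'real')
```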

# 10. String Functions and Operations

We previously learned about aggregate functions. Aggregate functions take a column as input and return one value for the column. We'll now learn about `functions that, when we pass them a column as input, return (a transformation of the input in) another column.` The values of text columns are typically called strings, hence the present screen's title.
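As a sketch of the difference, the functions LOWER and LENGTH below return one value per row rather than one value per column. The example uses Python's sqlite3 module and a hypothetical `toy_grads` table with two major names taken from this mission.

```python
import sqlite3

# Hypothetical toy table with two major names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_grads (Major TEXT)")
conn.executemany(
    "INSERT INTO toy_grads VALUES (?)",
    [("NUCLEAR ENGINEERING",), ("GEOGRAPHY",)],
)

# LOWER and LENGTH are applied to each row separately,
# so the result has as many rows as the table.
string_rows = conn.execute(
    "SELECT Major, LOWER(Major), LENGTH(Major) FROM toy_grads ORDER BY Major"
).fetchall()
for string_row in string_rows:
    print(string_row)
# ('GEOGRAPHY', 'geography', 9)
# ('NUCLEAR ENGINEERING', 'nuclear engineering', 19)
```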

**`Another thing we can do with strings is concatenate them by using the || operator`.**

In [13]:
%%sql

SELECT 'Data' || 'quest' AS 'e-learning';

 * sqlite:///jobs.db
Done.


e-learning
Dataquest


Although we wrote the strings above with single quotes ('), SQLite also accepts double quotes ("). Single quotes are the safer choice, though: standard SQL reserves double quotes for identifiers such as column names, so many other databases won't treat a double-quoted value as a string.

## TODO:
* Write a query that:
  * Selects, in order:
    * The values in the Major column in lowercase, preceded by the string 'Major: '. Use the alias Major.
    * Total.
    * Men.
    * Women.
    * Unemployment_rate.
    * LENGTH(Major) as Length_of_name.
  * Orders the results in descending order by the unemployment rate.

In [14]:
%%sql

SELECT 'Major: ' || LOWER(Major) Major,Total,Men,Women,Unemployment_rate,LENGTH(Major) Length_of_name
FROM recent_grads
ORDER BY Unemployment_rate DESC

 * sqlite:///jobs.db
Done.


Major,Total,Men,Women,Unemployment_rate,Length_of_name
Major: nuclear engineering,2573,2200,373,0.177226407,19
Major: public administration,5629,2947,2682,0.1594905999999999,21
Major: computer networking and telecommunications,7613,5291,2322,0.151849807,42
Major: clinical psychology,2838,568,2270,0.149048198,19
Major: public policy,5978,2695,905,0.128426299,13
Major: communication technologies,18035,2563,16346,0.119511469,26
Major: mining and mineral engineering,756,679,77,0.117241379,30
Major: computer programming and data processing,4168,3046,1122,0.11398259,40
Major: geography,18480,173809,156118,0.113458628,9
Major: architecture,46420,9658,4582,0.113331949,12


# 11. Performing Arithmetic in SQL

Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?

`In the same way that we can use string functions and operators, we can also perform arithmetic on the columns in a table. SQL supports the standard arithmetic operators: *, +, -, and /, and we can use them like any other operator`.

You can also add, subtract, multiply, or divide columns by individual values.
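Both kinds of arithmetic can be sketched with Python's sqlite3 module. The `toy_grads` table, its values, and the alias `p75_in_thousands` are hypothetical. One detail to keep in mind: in SQLite, dividing one integer by another discards the remainder, so the example divides by 1000.0 rather than 1000.

```python
import sqlite3

# Hypothetical toy table; the salary values are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_grads (Major TEXT, P25th INTEGER, P75th INTEGER)")
conn.executemany(
    "INSERT INTO toy_grads VALUES (?, ?, ?)",
    [("A", 30000, 45000), ("B", 40000, 42000)],
)

# Column minus column, and column divided by an individual value.
spreads = conn.execute(
    """
    SELECT Major,
           P75th - P25th AS quartile_spread,
           P75th / 1000.0 AS p75_in_thousands
    FROM toy_grads
    ORDER BY quartile_spread
    """
).fetchall()
print(spreads)  # [('B', 2000, 42.0), ('A', 15000, 45.0)]

# Integer division truncates; using a decimal value keeps the fraction.
int_div, real_div = conn.execute("SELECT 7 / 2, 7 / 2.0").fetchone()
print(int_div, real_div)  # 3 3.5
```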

## TODO:
* Write a query that computes the difference between the 25th and 75th percentile of salaries for all majors.
  * Return the Major column first, using the default column name.
  * Return the Major_category column second, using the default column name.
  * Return the computed difference between the 25th and 75th percentile third, using the alias quartile_spread.
  * Order the results by quartile_spread from lowest to highest and only return the first 20 results.

In [15]:
%%sql

SELECT Major,Major_category,(P75th-P25th) quartile_spread
FROM recent_grads
ORDER BY quartile_spread
LIMIT 20

 * sqlite:///jobs.db
Done.


Major,Major_category,quartile_spread
MILITARY TECHNOLOGIES,Industrial Arts & Consumer Services,0
SCHOOL STUDENT COUNSELING,Education,2000
LIBRARY SCIENCE,Education,2000
COURT REPORTING,Law & Public Policy,4000
PHARMACOLOGY,Biology & Life Science,5000
EDUCATIONAL ADMINISTRATION AND SUPERVISION,Education,6000
COUNSELING PSYCHOLOGY,Psychology & Social Work,6800
SPECIAL NEEDS EDUCATION,Education,10000
MATHEMATICS TEACHER EDUCATION,Education,10000
SOCIAL WORK,Psychology & Social Work,10000


In this mission, we:

* Explored how to calculate summary statistics in SQL.
* Learned about different types of functions.
* Learned about data types in SQL.