# Intro

We've loaded the dataset on job outcome statistics into a database. A database usually consists of multiple, related tables of data. Each table contains rows and columns, just like a CSV file. We'll be working with the database file jobs.db, which contains a single table named recent_grads. In later courses, we'll learn how to work with a database containing multiple tables.

![sql_table.svg](attachment:sql_table.svg)


## Dummy Query
```SQL
SELECT [column1, column2,...] FROM [table1]
WHERE [condition1] AND ([condition2] OR [condition3])
ORDER BY [column1] DESC LIMIT [limit]
```

![sql_components.svg](attachment:sql_components.svg)
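
A minimal sketch of the clause order, run through Python's built-in `sqlite3` module. The in-memory table and its four rows are hypothetical stand-ins for `jobs.db`, which is assumed not to be at hand here.

```python
import sqlite3

# Hypothetical sample rows standing in for the real jobs.db data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("A", 0.2), ("B", 0.9), ("C", 0.6), ("D", 0.7)],
)

# WHERE filters first, ORDER BY sorts what is left, LIMIT trims the result.
top_two = conn.execute(
    """
    SELECT Major, ShareWomen FROM recent_grads
    WHERE ShareWomen > 0.5
    ORDER BY ShareWomen DESC
    LIMIT 2
    """
).fetchall()
print(top_two)  # [('B', 0.9), ('D', 0.7)]

# Writing LIMIT before ORDER BY is rejected by SQLite as a syntax error.
try:
    conn.execute("SELECT Major FROM recent_grads LIMIT 2 ORDER BY Major")
    limit_first_ok = True
except sqlite3.OperationalError:
    limit_first_ok = False
print(limit_first_ok)  # False
```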

## More SQL Intro

- https://www.w3schools.com/SQL/default.asp
- https://sqlzoo.net/


# Summary Statistics

## Aggregate functions

Aggregate functions are applied over columns of values and return a single value. MIN() and MAX(), for example, calculate and return the minimum and maximum values in a column.

### COUNT()

Instead of returning all of the rows, we want SQLite to count the number of rows and return just that value. While we don't need to change the subset of data we're working with, we do need to change how it's presented to us. To return just the count, we need to use the SQL function COUNT():

```SQL
SELECT COUNT(Major) FROM recent_grads WHERE ShareWomen > 0.5
```
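
One detail worth seeing in action: COUNT() on a column skips NULL values, while COUNT(*) counts rows. The sketch below uses Python's `sqlite3` module with a hypothetical four-row table (one row has no Major), not the real data.

```python
import sqlite3

# Hypothetical rows; the real recent_grads table lives in jobs.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("Nursing", 0.9), ("Physics", 0.3), ("Biology", 0.6), (None, 0.8)],
)

# COUNT(Major) counts only the filtered rows where Major is not NULL.
count_major = conn.execute(
    "SELECT COUNT(Major) FROM recent_grads WHERE ShareWomen > 0.5"
).fetchone()[0]

# COUNT(*) counts every row that passes the filter, NULLs included.
count_rows = conn.execute(
    "SELECT COUNT(*) FROM recent_grads WHERE ShareWomen > 0.5"
).fetchone()[0]

print(count_major, count_rows)  # 2 3
```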

### MIN() / MAX()

```SQL
SELECT Major, MIN(ShareWomen) FROM recent_grads
```

![image.png](attachment:image.png)
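
The query above mixes a plain column (Major) with an aggregate. SQLite handles this by taking Major from the row that holds the minimum (or maximum); other databases may reject such a query. A small sketch with hypothetical rows:

```python
import sqlite3

# Hypothetical rows, not the real dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("Nursing", 0.9), ("Physics", 0.3), ("Biology", 0.6)],
)

# Major is read from the same row that supplies the MIN / MAX value.
lowest = conn.execute(
    "SELECT Major, MIN(ShareWomen) FROM recent_grads"
).fetchone()
highest = conn.execute(
    "SELECT Major, MAX(ShareWomen) FROM recent_grads"
).fetchone()

print(lowest)   # ('Physics', 0.3)
print(highest)  # ('Nursing', 0.9)
```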


### AVG(), SUM(), TOTAL()

Applying the SUM() function adds all of the values in a column, while AVG() computes the average. The TOTAL() function also returns the sum, but always as a floating point value (even if the column contains integers). The two also differ when there are no non-NULL values to add: SUM() returns NULL, while TOTAL() returns 0.0. TOTAL() is a good fit when working with a column containing floating point values, or whenever a floating point result is wanted.
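
A short sketch of how the three functions differ in the type they return, using Python's `sqlite3` module. The three-row integer column is hypothetical sample data.

```python
import sqlite3

# Hypothetical integer column standing in for recent_grads.Total.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Total INTEGER)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("A", 10), ("B", 20), ("C", 30)],
)

# SUM keeps the integer type; AVG and TOTAL return floats.
sums = conn.execute(
    "SELECT SUM(Total), AVG(Total), TOTAL(Total) FROM recent_grads"
).fetchone()
print(sums)  # (60, 20.0, 60.0)

# With no matching rows, SUM gives NULL (None in Python), TOTAL gives 0.0.
empty = conn.execute(
    "SELECT SUM(Total), TOTAL(Total) FROM recent_grads WHERE Total > 100"
).fetchone()
print(empty)  # (None, 0.0)
```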

## Combining aggregate functions

```SQL
SELECT AVG(Total), MIN(Men), MAX(Women) 
FROM recent_grads

-- Output: 1 row x 3 columns
-- AVG(Total)          MIN(Men)   MAX(Women)
-- 39167.71676300578   119        307087
```

## Customising output

### AS operator


The AS operator lets us give a column in the results a custom name. This is known as an alias, and it applies only to our results table (the table in the database won't be renamed). The alias can be a plain name, or an arbitrary phrase written as a string in quotation marks:

```SQL
SELECT COUNT(*) as num_students FROM recent_grads
```

Even better, we can **drop AS entirely** and just add the name next to the original column:

```SQL
SELECT COUNT(*) "Number of Students", MAX(Unemployment_rate) "Highest Unemployment Rate" FROM recent_grads 
```

Lastly, we can **reference renamed columns** when writing longer queries to make our code more compact:

```SQL
SELECT Major m, Major_category mc, Unemployment_rate ur
FROM recent_grads
WHERE (mc = 'Engineering') AND (ur > 0.04 and ur < 0.08)
ORDER BY ur DESC
```
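
Referencing an alias in WHERE works in SQLite, but many other databases do not allow it, so treat it as SQLite-specific. The sketch below runs the same query on a hypothetical four-row table and also reads back the column names SQLite reports.

```python
import sqlite3

# Hypothetical rows standing in for the real recent_grads table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads "
    "(Major TEXT, Major_category TEXT, Unemployment_rate REAL)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("Mech Eng", "Engineering", 0.05),
        ("Civil Eng", "Engineering", 0.07),
        ("Nuclear Eng", "Engineering", 0.17),
        ("Art", "Arts", 0.06),
    ],
)

cursor = conn.execute(
    """
    SELECT Major m, Major_category mc, Unemployment_rate ur
    FROM recent_grads
    WHERE (mc = 'Engineering') AND (ur > 0.04 AND ur < 0.08)
    ORDER BY ur DESC
    """
)
# The result columns carry the aliases, not the original names.
column_names = [d[0] for d in cursor.description]
rows = cursor.fetchall()

print(column_names)  # ['m', 'mc', 'ur']
print(rows)  # [('Civil Eng', 'Engineering', 0.07), ('Mech Eng', 'Engineering', 0.05)]
```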

## Counting Distinct Values

```SQL
SELECT DISTINCT Major_category FROM recent_grads
```

As with the other SQL clauses we've learned, we can use the DISTINCT keyword with multiple columns to return the unique pairings of those columns:

```SQL
SELECT DISTINCT Major, Major_category FROM recent_grads limit 5
```

We can count the number of unique values in a column by placing the DISTINCT keyword inside the COUNT() function. DISTINCT is a keyword rather than a function, so the inner parentheses around the column name are optional:

```SQL
SELECT COUNT(DISTINCT(Major_category)) unique_major_categories FROM recent_grads
```
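
A quick check, on hypothetical rows, that the count of distinct values is the same with or without parentheses around the column name:

```python
import sqlite3

# Hypothetical rows: five majors spread over three categories.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Major_category TEXT)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("Mech Eng", "Engineering"),
        ("Civil Eng", "Engineering"),
        ("Art", "Arts"),
        ("Music", "Arts"),
        ("Zoology", "Biology"),
    ],
)

# Parentheses around the column are optional.
with_parens = conn.execute(
    "SELECT COUNT(DISTINCT(Major_category)) FROM recent_grads"
).fetchone()[0]
without_parens = conn.execute(
    "SELECT COUNT(DISTINCT Major_category) FROM recent_grads"
).fetchone()[0]

print(with_parens, without_parens)  # 3 3
```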


## Performing Arithmetic in SQL

SQL supports the standard arithmetic operators \*, +, -, and /, and we can use them in a query like any other operator. For example, we can show the spread between the 75th and 25th percentiles:

```SQL
SELECT Major, Major_category, P75th - P25th quartile_spread 
FROM recent_grads 
ORDER BY quartile_spread ASC
LIMIT 20
```
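
The same computed-column pattern on a hypothetical three-row table, as a sketch using Python's `sqlite3` module (the salary figures are invented for illustration):

```python
import sqlite3

# Hypothetical percentile values, not taken from the real dataset.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, P25th INTEGER, P75th INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [("A", 30000, 50000), ("B", 40000, 45000), ("C", 20000, 60000)],
)

# The subtraction is evaluated per row and exposed under the alias.
spreads = conn.execute(
    """
    SELECT Major, P75th - P25th quartile_spread
    FROM recent_grads
    ORDER BY quartile_spread ASC
    LIMIT 2
    """
).fetchall()

print(spreads)  # [('B', 5000), ('A', 20000)]
```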

# Group Summary Statistics

## Group level summary statistics

The GROUP BY SQL statement allows us to compute summary statistics by "group," or unique value. When we use this statement, SQL creates a group for each unique value in a column or set of columns (the same values we get when we use the DISTINCT statement), and then does the calculations for them. To illustrate, we can find the total number of people employed in each major category with the following query:

```SQL
SELECT Major_category, SUM(Employed)
FROM recent_grads 
GROUP BY Major_category;
```

Here's how the query works. The GROUP BY clause splits the rows of the table into groups (one group for each unique value in the Major_category column), then calculates the sum for each group. The following diagram shows how GROUP BY splits the data. (The diagram uses a small sample from the recent_grads table.)
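
The split-then-aggregate idea can also be sketched in code. The example below runs a GROUP BY query on hypothetical rows and then repeats the same grouping by hand in Python, as an illustration of the concept rather than of how SQLite implements it.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample: (Major_category, Employed).
sample = [
    ("Engineering", 100),
    ("Engineering", 200),
    ("Arts", 50),
    ("Arts", 25),
    ("Biology", 10),
]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major_category TEXT, Employed INTEGER)")
conn.executemany("INSERT INTO recent_grads VALUES (?, ?)", sample)

# One output row per unique category, each with its own sum.
totals = conn.execute(
    """
    SELECT Major_category, SUM(Employed)
    FROM recent_grads
    GROUP BY Major_category
    ORDER BY Major_category
    """
).fetchall()
print(totals)  # [('Arts', 75), ('Biology', 10), ('Engineering', 300)]

# The same split by hand: one bucket per category, then a sum per bucket.
groups = defaultdict(list)
for category, employed in sample:
    groups[category].append(employed)
by_hand = sorted((category, sum(values)) for category, values in groups.items())
print(by_hand == totals)  # True
```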


## Virtual columns

Sometimes we want to select a subset of rows after performing a GROUP BY query. For instance, we may want to compute the share of people employed in each major category, share_employed, and select only those rows where it is greater than .8. We can't use the WHERE clause to do this because share_employed isn't a column in recent_grads; it's a virtual column generated by the GROUP BY query, and WHERE filters rows before the groups are formed.

When we want to filter on a column generated by a GROUP BY query, we can use the HAVING clause.

```SQL
SELECT Major_category, AVG(Employed) / AVG(Total) share_employed 
FROM recent_grads 
GROUP BY Major_category 
HAVING share_employed > .8;
```
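
A sketch contrasting HAVING with WHERE on hypothetical rows. The second query is deliberately wrong, to show that SQLite refuses an aggregate inside WHERE.

```python
import sqlite3

# Hypothetical rows: (Major_category, Employed, Total).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads "
    "(Major_category TEXT, Employed INTEGER, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("Engineering", 90, 100),
        ("Engineering", 80, 100),
        ("Arts", 50, 100),
        ("Arts", 70, 100),
    ],
)

# HAVING runs after grouping, so it can see the per-group value.
kept = conn.execute(
    """
    SELECT Major_category, AVG(Employed) / AVG(Total) share_employed
    FROM recent_grads
    GROUP BY Major_category
    HAVING share_employed > .8
    """
).fetchall()
print(kept)  # only the Engineering group (share 0.85) passes

# WHERE runs before grouping, so an aggregate there is an error.
try:
    conn.execute(
        """
        SELECT Major_category FROM recent_grads
        WHERE AVG(Employed) / AVG(Total) > .8
        GROUP BY Major_category
        """
    )
    where_failed = False
except sqlite3.OperationalError:
    where_failed = True
print(where_failed)  # True
```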

## Rounding results with ROUND()

```SQL
SELECT Major_category, ROUND(ShareWomen, 2) AS rounded_share_women 
FROM recent_grads;
```
The query will round the ShareWomen column to two decimal places.
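
A small sketch of ROUND() on two hypothetical values, using Python's `sqlite3` module:

```python
import sqlite3

# Hypothetical ShareWomen values with more than two decimals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major_category TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("Engineering", 0.12345), ("Arts", 0.6789)],
)

# The second argument of ROUND is the number of decimal places to keep.
rounded = conn.execute(
    """
    SELECT Major_category, ROUND(ShareWomen, 2) AS rounded_share_women
    FROM recent_grads
    ORDER BY Major_category
    """
).fetchall()

print(rounded)  # [('Arts', 0.68), ('Engineering', 0.12)]
```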

## Casting

We can use the `PRAGMA TABLE_INFO()` statement by itself, with the table name in the parentheses (e.g. `PRAGMA TABLE_INFO(recent_grads)`), to return the type, along with some other information, for each column:

![image.png](attachment:image.png)

Dividing two integers returns an integer, with the fractional part dropped (truncated, not rounded). To avoid that, we can cast the integers to floats:

```SQL
SELECT CAST(Women as Float) / CAST(Total as Float) women_ratio FROM recent_grads limit 5
```
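
The sketch below shows both steps on a hypothetical one-row table: reading the column types with PRAGMA TABLE_INFO, then comparing integer division with the cast version.

```python
import sqlite3

# Hypothetical single row: 7 women out of 10 graduates.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Women INTEGER, Total INTEGER)")
conn.execute("INSERT INTO recent_grads VALUES (7, 10)")

# Each PRAGMA row is (cid, name, type, notnull, dflt_value, pk).
column_types = [
    (row[1], row[2])
    for row in conn.execute("PRAGMA TABLE_INFO(recent_grads)")
]
print(column_types)  # [('Women', 'INTEGER'), ('Total', 'INTEGER')]

# Integer / integer drops the fraction; casting keeps it.
ratios = conn.execute(
    """
    SELECT Women / Total,
           CAST(Women AS Float) / CAST(Total AS Float)
    FROM recent_grads
    """
).fetchone()
print(ratios)  # (0, 0.7)
```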


# Subqueries

The SQL operations we've learned so far enable us to answer questions with only one source of uncertainty. Many times, we want to answer questions that have 2 or more levels of unknowns. For example:

Which rows are above the average for the ShareWomen column?

We need to learn how to break up a question we want to answer into a series of queries that can be combined.

```SQL
SELECT Major, ShareWomen FROM recent_grads
WHERE ShareWomen > (subquery that returns the average value for ShareWomen)
```

- The query that replaces the placeholder subquery needs to be a full query (containing SELECT and FROM clauses, etc.) that works even if it's run separately.
- In addition, the inner query should return a table with only a single row and column because of where it fits in the outer query (... WHERE ShareWomen > ?). If it returns a table with multiple columns, for example, SQLite returns an error.
- Lastly, a subquery must always be contained within parentheses (), or SQLite returns a syntax error.

**EXAMPLE**

```SQL
SELECT Major, Unemployment_rate
FROM recent_grads
WHERE Unemployment_rate < (SELECT AVG(Unemployment_rate) FROM recent_grads)
ORDER BY Unemployment_rate ASC
```
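
A sketch of the same pattern on hypothetical rows, followed by the two failure cases described above. The exact error text varies between SQLite versions, so the example only checks that an error is raised.

```python
import sqlite3

# Hypothetical rows; the average unemployment rate here is about 0.05.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, Unemployment_rate REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("A", 0.02), ("B", 0.04), ("C", 0.09)],
)

below_avg = conn.execute(
    """
    SELECT Major, Unemployment_rate
    FROM recent_grads
    WHERE Unemployment_rate < (SELECT AVG(Unemployment_rate) FROM recent_grads)
    ORDER BY Unemployment_rate ASC
    """
).fetchall()
print(below_avg)  # [('A', 0.02), ('B', 0.04)]


def fails(sql):
    """Return True if SQLite rejects the statement."""
    try:
        conn.execute(sql).fetchall()
        return False
    except sqlite3.OperationalError:
        return True


# A subquery returning two columns cannot be compared with a single value.
two_columns_fail = fails(
    "SELECT Major FROM recent_grads WHERE Unemployment_rate < "
    "(SELECT AVG(Unemployment_rate), MIN(Unemployment_rate) FROM recent_grads)"
)
# Leaving out the parentheses is a syntax error.
no_parens_fail = fails(
    "SELECT Major FROM recent_grads WHERE Unemployment_rate < "
    "SELECT AVG(Unemployment_rate) FROM recent_grads"
)
print(two_columns_fail, no_parens_fail)  # True True
```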


## Subquery in SELECT

To dynamically calculate the total number of rows in recent_grads and use it in another SQL statement, we can put a subquery in the SELECT clause. The following query displays the proportion of majors where the share of women is above the average.

```SQL
SELECT CAST(COUNT(*) AS Float) / CAST((SELECT COUNT(*) FROM recent_grads) AS Float) proportion_abv_avg 
FROM recent_grads
WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)
```
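
To make the arithmetic concrete, here is the same query on four hypothetical rows whose average ShareWomen is 0.5. Two of the four rows are above it, so the expected proportion is 0.5.

```python
import sqlite3

# Hypothetical rows chosen so the average ShareWomen is 0.5.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("A", 0.2), ("B", 0.4), ("C", 0.6), ("D", 0.8)],
)

# Outer COUNT(*) sees only the filtered rows (2); the subquery in SELECT
# counts the whole table (4).
proportion = conn.execute(
    """
    SELECT CAST(COUNT(*) AS Float)
           / CAST((SELECT COUNT(*) FROM recent_grads) AS Float) proportion_abv_avg
    FROM recent_grads
    WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)
    """
).fetchone()[0]

print(proportion)  # 0.5
```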

## Returning multiple results in subqueries

So far, the subqueries we've used have computed an aggregate value of some kind and returned that value to the outer query to use for filtering. This is because we only worked with the < and > operators, which, by definition, expect a single value to compare against in a filter. The IN operator, in contrast, tests whether a value belongs to a set of values, so a subquery used with IN can return multiple rows. As we learned earlier in this course from the documentation, SQLite contains all of the following operators:

### SQLite operators:
![sqlite_operators.png](attachment:sqlite_operators.png)

```SQL
SELECT Major, Major_category FROM recent_grads
WHERE Major_category IN 
   (SELECT Major_category 
    FROM recent_grads
    GROUP BY Major_category
    ORDER BY SUM(Total) DESC
    LIMIT 5)
```
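
A sketch of the IN pattern on a hypothetical five-row table. Because the sample is so small, the subquery keeps the top 2 categories instead of the top 5.

```python
import sqlite3

# Hypothetical rows: (Major, Major_category, Total).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Major_category TEXT, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("Mech Eng", "Engineering", 100),
        ("Civil Eng", "Engineering", 200),
        ("Art", "Arts", 50),
        ("Music", "Arts", 20),
        ("Zoology", "Biology", 500),
    ],
)

# The subquery returns several rows (one per top category);
# IN keeps the majors whose category is among them.
majors = conn.execute(
    """
    SELECT Major, Major_category FROM recent_grads
    WHERE Major_category IN
       (SELECT Major_category
        FROM recent_grads
        GROUP BY Major_category
        ORDER BY SUM(Total) DESC
        LIMIT 2)
    ORDER BY Major
    """
).fetchall()

# Category totals: Biology 500, Engineering 300, Arts 70.
print(majors)
# [('Civil Eng', 'Engineering'), ('Mech Eng', 'Engineering'), ('Zoology', 'Biology')]
```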

## Complex subqueries

In the last few sections, we nested subqueries in the WHERE and SELECT clauses, where they were evaluated before the outer query. We can nest subqueries within subqueries many times, but this makes our SQL code more complex and harder to debug. In the next course, we'll explore other techniques of composing SQL statements that make nested logic easier.

```SQL
SELECT Major, Major_category, (cast(Sample_size as float) / cast(Total as float)) ratio 
FROM recent_grads
WHERE ratio > (SELECT AVG(cast(Sample_size as float)/cast(Total as float)) avg_ratio FROM recent_grads)
```

