# Introduction to SQL
---

## Introduction to Databases

When the data changes frequently, requires shared access, doesn't fit in memory, and security is critical, a database is a great solution. A database is a data representation that lives on disk that can be queried, accessed, and updated without using much memory. We primarily interact with a database using a [database management system](https://en.wikipedia.org/wiki/Database) or **DBMS** for short.

In the pandas workflow, we spend most of our time thinking about which functions and methods to use, which intermediate results to store in variables, and how to juggle all of these. To work with data stored in a database, we instead use a language called **SQL** (Structured Query Language). In SQL, we express each request (whether it's fetching a subset of the data or editing values in it) as a single query, and then ask the DBMS to run the query and display any results.

For example, to fetch a specific subset of the data from a database, we would:

* write the SQL query: `SELECT * FROM salaries`
* ask the DBMS to run the query and display the results to us

Because the data lives on disk, we can work with datasets that consume multiple terabytes of disk space. Many data science teams in industry have servers and setups in cloud environments like Microsoft Azure or Amazon Web Services that let team members work with this scale of data. Robust and popular DBMS tools like [Postgres](https://www.postgresql.org/) and [MySQL](https://www.mysql.com/) include powerful features for managing user credentials, security, and high data throughput (quickly changing data). We'll learn the fundamentals of SQL using a small, portable DBMS called [SQLite](https://www.sqlite.org/). SQLite is often described as the most widely deployed database engine in the world, and it is lightweight enough that the SQLite DBMS is included as a [module in Python](https://docs.python.org/3.6/library/sqlite3.html). Later we'll dive into production systems like Postgres.

We'll explore data from the American Community Survey on job outcome statistics based on college majors. While the original CSV version can be found on [FiveThirtyEight's Github](https://github.com/fivethirtyeight/data/tree/master/college-majors), we'll be using a slightly modified version of the data that's stored as a database. We'll be working with the subset that contains the 2010-2012 data for recent college grads only. We'll learn how to write SQL queries to explore and start to understand the dataset.

## Previewing SQLite Module

We'll be working with the sqlite3 Python module, which was developed to work with [SQLite version 3](https://www.sqlite.org/version3.html).

First we'll need to import the module:

In [8]:
import sqlite3

Once we've imported the module, we connect to the database we want to query using the [`connect()` function](https://docs.python.org/3/library/sqlite3.html#sqlite3.connect). This function requires a single parameter, which is the database we want to connect to. Because the database we're working with exists as a file on disk, we need to pass in the file name.

The `connect()` function returns a [Connection instance](https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection), which maintains the connection to the database we want to work with. SQLite coordinates access by locking the database file: several processes can read from the database at the same time, but only one process can write to it at a time, and other processes may have to wait while a write is in progress. The SQLite team made this design decision to keep the database lightweight and to avoid much of the complexity that arises when multiple processes modify the same database.

Let's connect to the `jobs.db` database.

In [9]:
conn = sqlite3.connect("jobs.db")
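
As a minimal sketch of the same connection workflow, the snippet below uses SQLite's special `:memory:` name so that it runs without `jobs.db`. The `demo` table and its single row are made up purely for illustration, and `demo_conn` is a separate name so the `conn` connection above is left untouched.

```python
import sqlite3

# ":memory:" asks SQLite for a temporary in-memory database, so no file is needed.
# (For the real data we pass the file name "jobs.db" instead.)
demo_conn = sqlite3.connect(":memory:")

# Hypothetical one-column table, only here so there is something to query.
demo_conn.execute("CREATE TABLE demo (greeting TEXT)")
demo_conn.execute("INSERT INTO demo VALUES ('hello')")

# Connection.execute() returns a cursor; fetchall() collects the rows as tuples.
rows = demo_conn.execute("SELECT * FROM demo").fetchall()
print(rows)

# Close the connection once we're done with it.
demo_conn.close()
```

Each result row comes back as a tuple, even when the table has only one column.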

To display a table from the database, we'll use the pandas [read_sql_query](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_query.html) function. Its basic syntax looks like this:

```python
import pandas as pd
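pd.read_sql_query("SQL QUERY",        # the SQL query to run, as a string
                  conn,               # a Connection instance (or a database URL string)
                  index_col="index")  # optional: column to use as the DataFrame index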
```

Now let's get back to the SQL syntax.

In [10]:
import pandas as pd

## Previewing A Table Using SELECT

A database usually consists of multiple, related tables of data. Each table contains rows and columns, just like a CSV file. We'll be working with the database file `jobs.db`, which contains a single table named `recent_grads`.

To display the first 5 rows from the recent_grads table, we need to:

* write SQL code that expresses this request
* ask the SQLite DBMS software to run the code and display the results

Like other programming languages, code in SQL has to adhere to a defined structure and vocabulary. To specify that we want to return the first 5 rows from `recent_grads`, we need to run the following SQL query:

```SQL
SELECT * FROM recent_grads LIMIT 5
```

Here's what's returned when the query is run:

In [11]:
pd.read_sql_query("SELECT * FROM recent_grads LIMIT 5", conn, index_col = "index")

index,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564,1976,...,270,1207,37,0.018381,110000,95000,125000,1534,364,193
1,2,2416,MINING AND MINERAL ENGINEERING,Engineering,756,7,679,77,0.101852,640,...,170,388,85,0.117241,75000,55000,90000,350,257,50
2,3,2415,METALLURGICAL ENGINEERING,Engineering,856,3,725,131,0.153037,648,...,133,340,16,0.024096,73000,50000,105000,456,176,0
3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1258,16,1123,135,0.107313,758,...,150,692,40,0.050125,70000,43000,80000,529,102,0
4,5,2405,CHEMICAL ENGINEERING,Engineering,32260,289,21239,11021,0.341631,25694,...,5180,16697,1672,0.061098,65000,50000,75000,18314,4440,972


In this query, we specified:

* the columns we wanted using `SELECT *`
* the table we wanted to query using `FROM recent_grads`
* the number of rows we wanted using `LIMIT 5`

Here's the logic behind it:

```

   SELECT *         FROM recent_grads       LIMIT 5

Which columns?      From which table?    How many rows?
     All              recent_grads             5
     
     
```
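
The same three-part structure can be tried outside the notebook. The sketch below is only an illustration: it builds a tiny in-memory table named `toy_grads` with made-up rows (not the real `recent_grads` data) and uses the standard library's `sqlite3` cursor directly instead of pandas.

```python
import sqlite3

# In-memory database with a made-up table (not the real recent_grads data).
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE toy_grads (Major TEXT, Median INTEGER)")
toy.executemany(
    "INSERT INTO toy_grads VALUES (?, ?)",
    [("MAJOR A", 50000), ("MAJOR B", 40000), ("MAJOR C", 30000)],
)

# SELECT * -> all columns, FROM toy_grads -> which table, LIMIT 2 -> at most 2 rows.
cursor = toy.execute("SELECT * FROM toy_grads LIMIT 2")
column_names = [col[0] for col in cursor.description]
first_rows = cursor.fetchall()

print(column_names)     # the column names picked up by SELECT *
print(len(first_rows))  # number of rows allowed through by LIMIT
toy.close()
```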

## Filtering Rows Using WHERE

Let's answer these questions using the data we have:
* Which majors had mostly female students? Which ones had mostly male students?
* Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?
* Which engineering majors had the highest full time employment rates?

Let's start by focusing on the first question. The SQL workflow revolves around translating the question we want to answer to the subset of data we want from the database. To determine which majors had mostly female students, we want the following subset:

* only the Major column
* only the rows where ShareWomen is at least 0.5 (corresponding to 50%)

To return only the `Major` column, we need to name that column in the `SELECT` part of the query (instead of using the `*` wildcard to return all columns):

```SQL
SELECT Major FROM recent_grads
```

This will return all of the values in the `Major` column. We can specify multiple columns this way as well and the results table will preserve the order of the columns:

```SQL
SELECT Major, Major_category FROM recent_grads
```

To return only the rows where `ShareWomen` is greater than or equal to `0.5`, we need to add a `WHERE` clause:

```SQL
SELECT Major FROM recent_grads
WHERE ShareWomen >= 0.5
```

Finally, we can limit the number of rows returned using `LIMIT`:

```SQL
SELECT Major FROM recent_grads
WHERE ShareWomen >= 0.5
LIMIT 5
```

Running this query will return the following results table:

In [12]:
pd.read_sql_query("SELECT Major FROM recent_grads WHERE ShareWomen >= 0.5 LIMIT 5",
                 conn)

Unnamed: 0,Major
0,ACTUARIAL SCIENCE
1,COMPUTER SCIENCE
2,ENVIRONMENTAL ENGINEERING
3,NURSING
4,INDUSTRIAL PRODUCTION TECHNOLOGIES


While in the `SELECT` part of the query, we express the specific **column** we want, in the `WHERE` part we express the specific **rows** we want. The beauty of SQL is that these can be independent.
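
To make that independence concrete, here is a small sketch on a made-up in-memory table (`toy_grads` and its three rows are assumptions for illustration only). The query filters on `ShareWomen` even though that column is never returned, and it also shows how `>=` and `>` differ for a value of exactly `0.5`.

```python
import sqlite3

# Made-up rows for illustration; one major sits exactly on the 0.5 boundary.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE toy_grads (Major TEXT, ShareWomen REAL)")
toy.executemany(
    "INSERT INTO toy_grads VALUES (?, ?)",
    [("MAJOR A", 0.25), ("MAJOR B", 0.5), ("MAJOR C", 0.75)],
)

# Only Major is selected, but the rows are chosen by ShareWomen.
at_least_half = sorted(
    toy.execute("SELECT Major FROM toy_grads WHERE ShareWomen >= 0.5").fetchall()
)
more_than_half = sorted(
    toy.execute("SELECT Major FROM toy_grads WHERE ShareWomen > 0.5").fetchall()
)

print(at_least_half)   # includes the boundary row
print(more_than_half)  # excludes the boundary row
toy.close()
```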

Now, let's answer the second half of the first question (which majors had mostly male students):

```SQL
SELECT Major, ShareWomen FROM recent_grads WHERE ShareWomen < 0.5
```

In [13]:
pd.read_sql_query("SELECT Major, ShareWomen FROM recent_grads WHERE ShareWomen < 0.5", conn)

Unnamed: 0,Major,ShareWomen
0,PETROLEUM ENGINEERING,0.120564
1,MINING AND MINERAL ENGINEERING,0.101852
2,METALLURGICAL ENGINEERING,0.153037
3,NAVAL ARCHITECTURE AND MARINE ENGINEERING,0.107313
4,CHEMICAL ENGINEERING,0.341631
5,NUCLEAR ENGINEERING,0.144967
6,ASTRONOMY AND ASTROPHYSICS,0.441356
7,MECHANICAL ENGINEERING,0.139793
8,ELECTRICAL ENGINEERING,0.437847
9,COMPUTER ENGINEERING,0.199413


## Expressing Multiple Filter Criteria Using AND, OR and ()

To filter rows by specific criteria, we need to use the `WHERE` statement. A simple `WHERE` statement requires three things:

* The column we want the database to filter on: `ShareWomen`
* A comparison operator that specifies how we want to compare a value in a column: `>`
* The value we want the database to compare each value to: `0.5`

Here are the comparison operators we can use:

* Less than: `<`
* Less than or equal to: `<=`
* Greater than: `>`
* Greater than or equal to: `>=`
* Equal to: `=`
* Not equal to: `!=`

The comparison value after the operator must be either text or a number, depending on the field. Because ShareWomen is a numeric column, we don't need to enclose the number `0.5` in quotes. Finally, **most database systems require that the `SELECT` and `FROM` statements come first, before `WHERE` or any other statements**.
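
The sketch below tries a few of these operators on a made-up in-memory table (`toy_grads`, its rows, and the `majors_where` helper are all hypothetical). It also shows that text values need quotes and that, with SQLite's default settings, `=` compares text case-sensitively.

```python
import sqlite3

# Made-up rows for illustration only.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE toy_grads (Major TEXT, Major_category TEXT, ShareWomen REAL)")
toy.executemany(
    "INSERT INTO toy_grads VALUES (?, ?, ?)",
    [
        ("MAJOR A", "Engineering", 0.25),
        ("MAJOR B", "Engineering", 0.75),
        ("MAJOR C", "Arts", 0.75),
    ],
)

def majors_where(condition):
    """Return the sorted Major values for rows matching a hard-coded WHERE condition."""
    # String concatenation is acceptable here only because the conditions are
    # fixed literals written in this file, never user input.
    query = "SELECT Major FROM toy_grads WHERE " + condition
    return sorted(row[0] for row in toy.execute(query))

engineering = majors_where("Major_category = 'Engineering'")       # text needs quotes
not_engineering = majors_where("Major_category != 'Engineering'")
lowercase = majors_where("Major_category = 'engineering'")         # case matters
at_most_quarter = majors_where("ShareWomen <= 0.25")               # numbers do not

print(engineering, not_engineering, lowercase, at_most_quarter)
toy.close()
```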

We can use the `AND` operator to combine multiple filter criteria. For example, to determine which engineering majors had a majority of female students, we'd need to specify 2 filtering criteria:

```SQL
SELECT Major FROM recent_grads
WHERE Major_category = 'Engineering' AND ShareWomen > 0.5
```

In [14]:
pd.read_sql_query("SELECT Major FROM recent_grads WHERE Major_category = 'Engineering' AND ShareWomen > 0.5",
                 conn)

Unnamed: 0,Major
0,ENVIRONMENTAL ENGINEERING
1,INDUSTRIAL PRODUCTION TECHNOLOGIES


It looks like only 2 majors met these criteria. If we want to "zoom" back out and look at all of the columns for both of these majors to see if they share some other common attributes, we can modify the `SELECT` statement and use the symbol `*` to represent all columns:

In [15]:
pd.read_sql_query("SELECT * FROM recent_grads WHERE Major_category = 'Engineering' AND ShareWomen > 0.5",
                 conn,
                 index_col = "index")

index,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
30,31,2410,ENVIRONMENTAL ENGINEERING,Engineering,4047,26,2639,3339,0.558548,2983,...,930,1951,308,0.093589,50000,42000,56000,2028,830,260
38,39,2503,INDUSTRIAL PRODUCTION TECHNOLOGIES,Engineering,4631,73,528,1588,0.750473,4428,...,597,3242,129,0.028308,46000,35000,65000,1394,2454,480


The ability to quickly iterate on queries as you think of new questions is the appeal of SQL. The SQL workflow lets data professionals focus on asking and answering questions, instead of lower level programming concepts. There's a clear separation of concerns between the engine that stores, organizes, and retrieves the data and the language that lets people interface with the data easily without having to worry about the underlying mechanics.

As the scale of data has increased, engineers have maintained the interface of SQL while swapping out the database engine underneath. This allows people who need to ask and answer questions to easily transfer their SQL experience, even as database technologies change. For example, the [Presto project](https://en.wikipedia.org/wiki/Presto_%28SQL_query_engine%29) lets you query using SQL but use data from database systems like MySQL, from a distributed file system like HDFS, and more.

In [16]:
# Example with AND operator
pd.read_sql_query("SELECT Major, Major_category, Median, ShareWomen FROM recent_grads WHERE ShareWomen > 0.5 AND Median > 50000",
                 conn)

Unnamed: 0,Major,Major_category,Median,ShareWomen
0,ACTUARIAL SCIENCE,Business,62000,0.535714
1,COMPUTER SCIENCE,Computers & Mathematics,53000,0.578766


In [17]:
# Example with OR operator
pd.read_sql_query("SELECT Major, Median, Unemployed FROM recent_grads WHERE Median >= 10000 OR Unemployed <= 1000 LIMIT 20",
                  conn)

Unnamed: 0,Major,Median,Unemployed
0,PETROLEUM ENGINEERING,110000,37
1,MINING AND MINERAL ENGINEERING,75000,85
2,METALLURGICAL ENGINEERING,73000,16
3,NAVAL ARCHITECTURE AND MARINE ENGINEERING,70000,40
4,CHEMICAL ENGINEERING,65000,1672
5,NUCLEAR ENGINEERING,65000,400
6,ACTUARIAL SCIENCE,62000,308
7,ASTRONOMY AND ASTROPHYSICS,62000,33
8,MECHANICAL ENGINEERING,60000,4650
9,ELECTRICAL ENGINEERING,60000,3895


In [18]:
# Example with AND, OR and PARENTHESES
pd.read_sql_query("SELECT Major, Major_category, ShareWomen, Unemployment_rate FROM recent_grads WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)",
                 conn)

Unnamed: 0,Major,Major_category,ShareWomen,Unemployment_rate
0,PETROLEUM ENGINEERING,Engineering,0.120564,0.018381
1,METALLURGICAL ENGINEERING,Engineering,0.153037,0.024096
2,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,0.107313,0.050125
3,MATERIALS SCIENCE,Engineering,0.31082,0.023043
4,ENGINEERING MECHANICS PHYSICS AND SCIENCE,Engineering,0.183985,0.006334
5,INDUSTRIAL AND MANUFACTURING ENGINEERING,Engineering,0.343473,0.042876
6,MATERIALS ENGINEERING AND MATERIALS SCIENCE,Engineering,0.292607,0.027789
7,ENVIRONMENTAL ENGINEERING,Engineering,0.558548,0.093589
8,INDUSTRIAL PRODUCTION TECHNOLOGIES,Engineering,0.750473,0.028308
9,ENGINEERING AND INDUSTRIAL MANAGEMENT,Engineering,0.174123,0.033652
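
Parentheses matter because SQL evaluates `AND` before `OR`. The sketch below, which uses a made-up in-memory table (`toy_grads` and its rows are assumptions, not the survey data), runs the same criteria with and without parentheses to show how the results can differ.

```python
import sqlite3

# Made-up rows chosen so that the two queries disagree.
toy = sqlite3.connect(":memory:")
toy.execute(
    "CREATE TABLE toy_grads "
    "(Major TEXT, Major_category TEXT, ShareWomen REAL, Unemployment_rate REAL)"
)
toy.executemany(
    "INSERT INTO toy_grads VALUES (?, ?, ?, ?)",
    [
        ("ENG LOW UNEMP", "Engineering", 0.2, 0.03),
        ("ENG HIGH UNEMP", "Engineering", 0.2, 0.09),
        ("ARTS LOW UNEMP", "Arts", 0.6, 0.03),
    ],
)

# Engineering majors that also satisfy at least one of the two other criteria.
with_parens = sorted(row[0] for row in toy.execute(
    "SELECT Major FROM toy_grads "
    "WHERE (Major_category = 'Engineering') "
    "AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)"
))

# Without parentheses this reads as:
# (Major_category = 'Engineering' AND ShareWomen > 0.5) OR Unemployment_rate < 0.051
without_parens = sorted(row[0] for row in toy.execute(
    "SELECT Major FROM toy_grads "
    "WHERE Major_category = 'Engineering' "
    "AND ShareWomen > 0.5 OR Unemployment_rate < 0.051"
))

print(with_parens)
print(without_parens)  # the Arts major slips in through the OR branch
toy.close()
```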


## Ordering Results Using ORDER BY

The results of every query we've written so far have come back in the order the rows are stored in the table, which happens to follow the `Rank` column. (Without an explicit ordering, SQL makes no guarantee about the order of the results.) Recall a query from earlier, which returned all of the columns and didn't filter rows on any specific criteria (`SELECT * FROM recent_grads LIMIT 5`):

In [19]:
pd.read_sql_query("SELECT * \
FROM recent_grads \
LIMIT 5",
                  conn)

Unnamed: 0,index,Rank,Major_code,Major,Major_category,Total,Sample_size,Men,Women,ShareWomen,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,0,1,2419,PETROLEUM ENGINEERING,Engineering,2339,36,2057,282,0.120564,...,270,1207,37,0.018381,110000,95000,125000,1534,364,193
1,1,2,2416,MINING AND MINERAL ENGINEERING,Engineering,756,7,679,77,0.101852,...,170,388,85,0.117241,75000,55000,90000,350,257,50
2,2,3,2415,METALLURGICAL ENGINEERING,Engineering,856,3,725,131,0.153037,...,133,340,16,0.024096,73000,50000,105000,456,176,0
3,3,4,2417,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,1258,16,1123,135,0.107313,...,150,692,40,0.050125,70000,43000,80000,529,102,0
4,4,5,2405,CHEMICAL ENGINEERING,Engineering,32260,289,21239,11021,0.341631,...,5180,16697,1672,0.061098,65000,50000,75000,18314,4440,972


As the questions we want to answer get more complex, we want more control over how the results are ordered. We can specify the order using the [`ORDER BY`](https://sqlite.org/lang_select.html#orderby) clause. For example, we may want to understand which majors that met the criteria in the `WHERE` statement had the lowest unemployment rate. The following query returns the results in ascending (increasing) order by the `Unemployment_rate` column:

In [20]:
pd.read_sql_query("SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate \
FROM recent_grads \
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051) \
ORDER BY Unemployment_rate",
                 conn)

Unnamed: 0,Rank,Major,Major_category,ShareWomen,Unemployment_rate
0,15,ENGINEERING MECHANICS PHYSICS AND SCIENCE,Engineering,0.183985,0.006334
1,1,PETROLEUM ENGINEERING,Engineering,0.120564,0.018381
2,14,MATERIALS SCIENCE,Engineering,0.31082,0.023043
3,3,METALLURGICAL ENGINEERING,Engineering,0.153037,0.024096
4,24,MATERIALS ENGINEERING AND MATERIALS SCIENCE,Engineering,0.292607,0.027789
5,39,INDUSTRIAL PRODUCTION TECHNOLOGIES,Engineering,0.750473,0.028308
6,51,ENGINEERING AND INDUSTRIAL MANAGEMENT,Engineering,0.174123,0.033652
7,17,INDUSTRIAL AND MANUFACTURING ENGINEERING,Engineering,0.343473,0.042876
8,4,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,0.107313,0.050125
9,31,ENVIRONMENTAL ENGINEERING,Engineering,0.558548,0.093589


If we instead want the results ordered by the same column but in descending order, we can add the `DESC` keyword:

In [21]:
pd.read_sql_query("SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate \
FROM recent_grads \
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051) \
ORDER BY Unemployment_rate DESC",
                 conn)

Unnamed: 0,Rank,Major,Major_category,ShareWomen,Unemployment_rate
0,31,ENVIRONMENTAL ENGINEERING,Engineering,0.558548,0.093589
1,4,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,0.107313,0.050125
2,17,INDUSTRIAL AND MANUFACTURING ENGINEERING,Engineering,0.343473,0.042876
3,51,ENGINEERING AND INDUSTRIAL MANAGEMENT,Engineering,0.174123,0.033652
4,39,INDUSTRIAL PRODUCTION TECHNOLOGIES,Engineering,0.750473,0.028308
5,24,MATERIALS ENGINEERING AND MATERIALS SCIENCE,Engineering,0.292607,0.027789
6,3,METALLURGICAL ENGINEERING,Engineering,0.153037,0.024096
7,14,MATERIALS SCIENCE,Engineering,0.31082,0.023043
8,1,PETROLEUM ENGINEERING,Engineering,0.120564,0.018381
9,15,ENGINEERING MECHANICS PHYSICS AND SCIENCE,Engineering,0.183985,0.006334
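
As a small extension of the two queries above, `ORDER BY` also accepts several columns, each with its own direction. The sketch below uses a made-up in-memory table (`toy_grads` and its rows are assumptions for illustration) to sort by category first and then by unemployment rate, highest first, within each category.

```python
import sqlite3

# Made-up rows, inserted in a deliberately scrambled order.
toy = sqlite3.connect(":memory:")
toy.execute(
    "CREATE TABLE toy_grads (Major TEXT, Major_category TEXT, Unemployment_rate REAL)"
)
toy.executemany(
    "INSERT INTO toy_grads VALUES (?, ?, ?)",
    [
        ("ENG 1", "Engineering", 0.05),
        ("ARTS 1", "Arts", 0.09),
        ("ENG 2", "Engineering", 0.02),
        ("ARTS 2", "Arts", 0.04),
    ],
)

# Ascending is the default direction, so ASC could be omitted here.
by_rate = [row[0] for row in toy.execute(
    "SELECT Major FROM toy_grads ORDER BY Unemployment_rate ASC"
)]

# Category A-Z first, then the highest unemployment rate first within each category.
by_category_then_rate = [row[0] for row in toy.execute(
    "SELECT Major FROM toy_grads "
    "ORDER BY Major_category, Unemployment_rate DESC"
)]

print(by_rate)
print(by_category_then_rate)
toy.close()
```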


In [22]:
pd.read_sql_query("SELECT Major, ShareWomen, Unemployment_rate \
FROM recent_grads \
WHERE (ShareWomen > 0.3) AND (Unemployment_rate < 0.1) \
ORDER BY ShareWomen DESC",
                  conn)

Unnamed: 0,Major,ShareWomen,Unemployment_rate
0,EARLY CHILDHOOD EDUCATION,0.967998,0.040105
1,MATHEMATICS AND COMPUTER SCIENCE,0.927807,0.000000
2,ELEMENTARY EDUCATION,0.923745,0.046586
3,ANIMAL SCIENCES,0.910933,0.050862
4,PHYSIOLOGY,0.906677,0.069163
5,MISCELLANEOUS PSYCHOLOGY,0.905590,0.051908
6,HUMAN SERVICES AND COMMUNITY ORGANIZATION,0.904075,0.037819
7,NURSING,0.896019,0.044863
8,GEOSCIENCES,0.881294,0.024374
9,MASS MEDIA,0.877228,0.089837


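To approximate which engineering majors had the highest full-time employment rates, let's sort the Engineering majors (and, for comparison, the Physical Sciences majors) by unemployment rate, lowest first: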

In [23]:
pd.read_sql_query("SELECT Major_category, Major, Unemployment_rate \
FROM recent_grads \
WHERE (Major_category = 'Engineering') OR (Major_category = 'Physical Sciences') \
ORDER BY Unemployment_rate",
                 conn)

Unnamed: 0,Major_category,Major,Unemployment_rate
0,Engineering,ENGINEERING MECHANICS PHYSICS AND SCIENCE,0.006334
1,Engineering,PETROLEUM ENGINEERING,0.018381
2,Physical Sciences,ASTRONOMY AND ASTROPHYSICS,0.021167
3,Physical Sciences,ATMOSPHERIC SCIENCES AND METEOROLOGY,0.022229
4,Engineering,MATERIALS SCIENCE,0.023043
5,Engineering,METALLURGICAL ENGINEERING,0.024096
6,Physical Sciences,GEOSCIENCES,0.024374
7,Engineering,MATERIALS ENGINEERING AND MATERIALS SCIENCE,0.027789
8,Engineering,INDUSTRIAL PRODUCTION TECHNOLOGIES,0.028308
9,Engineering,ENGINEERING AND INDUSTRIAL MANAGEMENT,0.033652


## Introduction to Summary Statistics with COUNT

Now we'll learn how to calculate summary statistics on subsets of a database table. We'll continue working with data on job outcomes, compiled by FiveThirtyEight.

Let's start with some motivating questions we want to answer:

* How many majors had mostly female students? How many had mostly male students? What proportion of majors had mostly female students?
* Which category of majors had the lowest unemployment rates? Which category of majors had the highest female representation?
* Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?

Let's focus on the first set of questions around gender representation:

Instead of returning all of the rows, we want SQLite to count the number of rows and return just that value. While we don't need to change the subset of data we're working with, we do need to change how it's presented to us. To return just the count, we need to use the SQL function [`COUNT()`](https://sqlite.org/lang_aggfunc.html#count):

In [24]:
pd.read_sql_query("SELECT COUNT(Major) \
FROM recent_grads \
WHERE ShareWomen > 0.5",
                 conn)

Unnamed: 0,COUNT(Major)
0,97


Instead of just returning a single value, SQLite returned a table with a column (`COUNT(Major)`) and the count as a row in that column (`97`).

A key idea in SQL is that **everything is a table**. One advantage of this simplification is that it's a common, visual representation that makes SQL approachable for a much wider audience. The disadvantage is that datasets and calculations that aren't well suited for this representation must be converted to be used in a SQL database environment.
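
The sketch below illustrates both points on a made-up in-memory table (`toy_grads` and its rows are assumptions): the count comes back as a one-row, one-column table, and two counts can be combined in Python to get a proportion. It also shows that `COUNT(column)` skips NULL values while `COUNT(*)` counts every row.

```python
import sqlite3

# Made-up rows; the last major has no ShareWomen value (NULL).
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE toy_grads (Major TEXT, ShareWomen REAL)")
toy.executemany(
    "INSERT INTO toy_grads VALUES (?, ?)",
    [("MAJOR A", 0.2), ("MAJOR B", 0.6), ("MAJOR C", 0.8), ("MAJOR D", None)],
)

# The result is still a table: a list with one row holding one value.
result_table = toy.execute(
    "SELECT COUNT(Major) FROM toy_grads WHERE ShareWomen > 0.5"
).fetchall()
mostly_female = result_table[0][0]

total_rows = toy.execute("SELECT COUNT(*) FROM toy_grads").fetchone()[0]
non_null_shares = toy.execute("SELECT COUNT(ShareWomen) FROM toy_grads").fetchone()[0]

# Proportion of all majors in the toy table that are mostly female.
proportion = mostly_female / total_rows

print(result_table, total_rows, non_null_shares, proportion)
toy.close()
```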

## Aggregate Functions

Functions like `COUNT()` are known as [aggregate functions](https://sqlite.org/lang_aggfunc.html). Aggregate functions are applied over columns of values and return a single value. [MIN()](https://sqlite.org/lang_corefunc.html#minoreunc) and [MAX()](https://sqlite.org/lang_corefunc.html#maxoreunc), for example, calculate and return the minimum and maximum values in a column.

We can use these functions to find the lowest value in the `ShareWomen` column, along with the major it belongs to.

In [25]:
pd.read_sql_query("SELECT Major, MIN(ShareWomen) \
FROM recent_grads",
                 conn)

Unnamed: 0,Major,MIN(ShareWomen)
0,MISCELLANEOUS ENGINEERING TECHNOLOGIES,0.0


If you think about it, `MIN(ShareWomen)` acts as a row filter in some way. While the query `SELECT Major FROM recent_grads` returns all of the values in the `Major` column, the query `SELECT Major, MIN(ShareWomen) FROM recent_grads` only returned the `Major` column value corresponding to the row with the minimum value in the `ShareWomen` column. Note that this is SQLite-specific behavior: many other database systems reject a query that mixes a plain column with an aggregate function without a `GROUP BY` clause.

One thing to note is that `COUNT()` can be used on any column, because it just counts the number of non-NULL values. In SQLite, `MIN()` and `MAX()` also work on text columns (text values are compared in sort order), while arithmetic aggregates such as `SUM()` and `AVG()` are only meaningful on numeric columns.

Let's write a query that returns the Engineering major with the lowest median salary:

In [26]:
pd.read_sql_query("SELECT Major, Major_category, MIN(Median) \
FROM recent_grads \
WHERE Major_category = 'Engineering'",
                 conn)

Unnamed: 0,Major,Major_category,MIN(Median)
0,ARCHITECTURE,Engineering,40000
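
Selecting a plain column next to `MIN()` relies on SQLite's handling of such queries. A more portable way to ask the same question is to sort and keep the first row. The sketch below compares both approaches on a made-up in-memory table (`toy_grads` and its rows are assumptions for illustration).

```python
import sqlite3

# Made-up rows for illustration only.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE toy_grads (Major TEXT, Major_category TEXT, Median INTEGER)")
toy.executemany(
    "INSERT INTO toy_grads VALUES (?, ?, ?)",
    [
        ("ENG A", "Engineering", 60000),
        ("ENG B", "Engineering", 40000),
        ("ART A", "Arts", 30000),
    ],
)

# SQLite-style: the plain Major column comes from the row holding the minimum.
with_min = toy.execute(
    "SELECT Major, MIN(Median) FROM toy_grads WHERE Major_category = 'Engineering'"
).fetchone()

# Portable alternative: sort ascending by Median and keep only the first row.
with_order_by = toy.execute(
    "SELECT Major, Median FROM toy_grads "
    "WHERE Major_category = 'Engineering' "
    "ORDER BY Median LIMIT 1"
).fetchone()

print(with_min, with_order_by)
toy.close()
```

If several majors tie for the minimum, either query returns just one of them.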


## Calculating Sums and Averages in SQL

The final two aggregation functions we'll look at are `SUM()` and `AVG()`. `SUM()` adds up all of the values in a column, while `AVG()` computes their average. SQLite also provides `TOTAL()`, which returns the sum as a floating point value (even if the column contains integers) and returns `0.0` rather than `NULL` when there are no non-NULL values to add. `TOTAL()` is a good choice when working with a column containing floating point values. You can read more in the [SQLite documentation on aggregate functions](https://sqlite.org/lang_aggfunc.html).

This time around, we're going to skip showing sample code since these functions are used the same way as `COUNT()`, `MIN()`, and `MAX()`. This is good practice working with new functions, as SQL contains many functions that you'll end up using down the road that you haven't been taught explicitly.

Let's write a query that computes the sum of the `Total` column:

In [27]:
pd.read_sql_query("SELECT SUM(Total) \
FROM recent_grads",
                 conn)

Unnamed: 0,SUM(Total)
0,6776015
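
To see how `SUM()`, `TOTAL()`, and `AVG()` differ, the sketch below runs all three on a made-up in-memory table (`toy_grads` and its three values are assumptions), first over every row and then over a filter that matches no rows at all.

```python
import sqlite3

# Made-up integer column for illustration only.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE toy_grads (Total INTEGER)")
toy.executemany("INSERT INTO toy_grads VALUES (?)", [(10,), (20,), (30,)])

# SUM keeps integers as integers; TOTAL and AVG return floating point values.
all_rows = toy.execute(
    "SELECT SUM(Total), TOTAL(Total), AVG(Total) FROM toy_grads"
).fetchone()

# No row has Total > 100: SUM yields NULL (None in Python), TOTAL yields 0.0.
no_rows = toy.execute(
    "SELECT SUM(Total), TOTAL(Total) FROM toy_grads WHERE Total > 100"
).fetchone()

print(all_rows)
print(no_rows)
toy.close()
```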


## Combining Multiple Aggregation Functions

Instead of writing an individual query for each specific question we want to answer, we can actually write queries that answer multiple questions at once. Let's take the following questions:

* What's the lowest median salary?
* What's the highest median salary?
* What's the total number of students?

We can select multiple columns by including their names with commas and we can apply the same principle to combine multiple aggregation functions into a single query. Let's write a query that computes the average of the `Total` column, the minimum of the `Men` column, and the maximum of the `Women` column, in that specific order:

In [28]:
pd.read_sql_query("SELECT AVG(Total), MIN(Men), MAX(Women) \
FROM recent_grads",
                  conn)

Unnamed: 0,AVG(Total),MIN(Men),MAX(Women)
0,39167.716763,119,307087


## Customizing The Results

All of the queries we've written so far have had somewhat unpleasant column names in the results, like `AVG(Total)` and `MIN(Men)`. Many companies use SQL environments and tools that can run your query, turn the results into a plot of your choosing, and then create a PDF report containing multiple plots (and some additional explanation from the user). Given that others may interpret and understand the results of your SQL queries, it's helpful to be able to specify custom names for the columns in our results.

We can do just that using the `AS` operator. This is known as an [alias](https://www.tutorialspoint.com/sqlite/sqlite_alias_syntax.htm) and the alias is restricted to just our results table (the table in the database won't be renamed). We can specify an arbitrary phrase as a string using quotation marks or, even better, we can drop AS entirely and just add the name next to the original column:

```SQL
SELECT COUNT(*) AS num_students FROM recent_grads;
SELECT COUNT(*) AS "Total Students" FROM recent_grads;
SELECT COUNT(*) "Total Students" FROM recent_grads;
```

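Lastly, we can reference renamed columns when writing longer queries to make our code more compact (SQLite accepts aliases in the `WHERE` clause; many other database systems, such as PostgreSQL, do not):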

```SQL
SELECT Major m, Major_category mc, Unemployment_rate ur
FROM recent_grads
WHERE (mc = 'Engineering') AND (ur > 0.04 AND ur < 0.08)
ORDER BY ur DESC
```
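
One way to check what an alias does is to inspect the column names a query reports. The sketch below does this with the standard library's `cursor.description` on a made-up in-memory table (`toy_grads`, its rows, and the `num_majors` alias are assumptions for illustration).

```python
import sqlite3

# Made-up rows for illustration only.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE toy_grads (Major TEXT, Unemployment_rate REAL)")
toy.executemany(
    "INSERT INTO toy_grads VALUES (?, ?)",
    [("MAJOR A", 0.02), ("MAJOR B", 0.09), ("MAJOR C", 0.05)],
)

# Single quotes around the Python string let us use double quotes for the alias.
cursor = toy.execute(
    'SELECT COUNT(*) AS num_majors, '
    'MAX(Unemployment_rate) AS "Highest Unemployment Rate" '
    'FROM toy_grads'
)
aliased_names = [col[0] for col in cursor.description]
values = cursor.fetchone()

print(aliased_names)
print(values)

# The alias only renames the result column; the table itself is unchanged.
table_columns = [col[0] for col in toy.execute("SELECT * FROM toy_grads").description]
print(table_columns)
toy.close()
```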

Let's write a query that returns, in the following order:

* the number of rows as `Number of Students`
* the maximum value of `Unemployment_rate` as `Highest Unemployment Rate`

In [29]:
pd.read_sql_query('SELECT \
COUNT(*) AS "Number of Students", \
MAX(Unemployment_rate) AS "Highest Unemployment Rate" \
FROM recent_grads',
                 conn)

Unnamed: 0,Number of Students,Highest Unemployment Rate
0,173,0.177226


## Counting Unique Values

We've been working with the `Major_category` column a decent amount in our queries, and it's a column with only a few unique values. What if we want to return just the unique values in this column? Or the number of unique values in this column?

We can return all of the unique values in a column using the `DISTINCT` statement:

In [30]:
pd.read_sql_query("SELECT DISTINCT Major_category FROM recent_grads", conn)

Unnamed: 0,Major_category
0,Engineering
1,Business
2,Physical Sciences
3,Law & Public Policy
4,Computers & Mathematics
5,Agriculture & Natural Resources
6,Industrial Arts & Consumer Services
7,Arts
8,Health
9,Social Science


As with the other SQL clauses we've learned, we can use the `DISTINCT` statement with multiple columns to return unique pairings of those columns:

In [31]:
pd.read_sql_query("SELECT DISTINCT Major, Major_category FROM recent_grads LIMIT 5", conn)

Unnamed: 0,Major,Major_category
0,PETROLEUM ENGINEERING,Engineering
1,MINING AND MINERAL ENGINEERING,Engineering
2,METALLURGICAL ENGINEERING,Engineering
3,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering
4,CHEMICAL ENGINEERING,Engineering


In this case, the `Major_category` column has far fewer unique values (only 16 unique values for `Major_category` compared to 173 for `Major`), so the same category value is repeated across many unique values of `Major`.

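Lastly, we can count the number of unique values in a column by placing the `DISTINCT` keyword inside the `COUNT()` function. `DISTINCT` is a keyword rather than a function, so `COUNT(DISTINCT Major_category)` works as well; the extra parentheses used below are optional: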

In [32]:
pd.read_sql_query("SELECT COUNT(DISTINCT(Major_category)) unique_major_categories \
FROM recent_grads",
                  conn)

Unnamed: 0,unique_major_categories
0,16


Now, let's write a query that returns the number of unique values in the `Major`, `Major_category`, and `Major_code` columns:

In [33]:
pd.read_sql_query("SELECT COUNT(DISTINCT(Major)) unique_majors, \
COUNT(DISTINCT(Major_category)) unique_major_categories, \
COUNT(DISTINCT(Major_code)) unique_major_codes \
FROM recent_grads",
                 conn)

Unnamed: 0,unique_majors,unique_major_categories,unique_major_codes
0,173,16,173
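
The sketch below contrasts a plain `COUNT()` with `COUNT(DISTINCT ...)` on a made-up in-memory table (`toy_grads` and its rows are assumptions for illustration), writing `DISTINCT` without the extra parentheses.

```python
import sqlite3

# Made-up rows: three majors spread over two categories.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE toy_grads (Major TEXT, Major_category TEXT)")
toy.executemany(
    "INSERT INTO toy_grads VALUES (?, ?)",
    [("ENG A", "Engineering"), ("ENG B", "Engineering"), ("ART A", "Arts")],
)

# DISTINCT on its own returns the unique values as rows.
unique_categories = sorted(
    row[0] for row in toy.execute("SELECT DISTINCT Major_category FROM toy_grads")
)

# COUNT counts every non-NULL value; COUNT(DISTINCT ...) counts each value once.
all_values, unique_values = toy.execute(
    "SELECT COUNT(Major_category), COUNT(DISTINCT Major_category) FROM toy_grads"
).fetchone()

print(unique_categories, all_values, unique_values)
toy.close()
```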


## Performing Arithmetic in SQL

Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?

To answer this question, we need to be able to perform arithmetic on the columns in a table to compute the difference. SQL supports the standard arithmetic operators: `*`, `+`, `-`, and `/`, and we can use them like any other operator:

In [34]:
pd.read_sql_query("SELECT P75th - P25th quartile_spread FROM recent_grads LIMIT 10",
                 conn)

Unnamed: 0,quartile_spread
0,30000
1,35000
2,55000
3,37000
4,25000
5,52000
6,19000
7,77500
8,22000
9,27000


One thing to note is that multiplying or dividing columns with a floating point value (or a column with floating point values) will result in floating point values:

* Two floats - Returns a float.
    * `SELECT 100.0 / 100.0` returns 1.0.
* A float and an integer - Returns a float
    * `SELECT 100 / 1.0` returns 100.0.
* Two integers - Returns an integer
    * `SELECT 100 / 10` returns 10
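
The integer case deserves extra care when the division is not exact, because SQLite drops the fractional part. The sketch below checks this with literal values only (no table is needed); `calc` is just a throwaway in-memory connection used for illustration.

```python
import sqlite3

# A throwaway in-memory connection is enough to evaluate literal expressions.
calc = sqlite3.connect(":memory:")

# Two integers: the fractional part is discarded.
int_division = calc.execute("SELECT 7 / 2").fetchone()[0]

# Making either operand a float gives a floating point result.
float_division = calc.execute("SELECT 7 / 2.0").fetchone()[0]

# CAST converts an integer value (or column) to a float before dividing.
cast_division = calc.execute("SELECT CAST(7 AS REAL) / 2").fetchone()[0]

print(int_division, float_division, cast_division)
calc.close()
```

The same `CAST` approach applies when both operands are integer columns and a fractional result is wanted.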
    
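Now let's write a query that computes the difference between the 25th and 75th percentile of salaries for all majors and returns the 20 majors with the smallest spread (adding `DESC` after `ORDER BY quartile_spread` would show the largest spreads instead):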

In [35]:
pd.read_sql_query("SELECT Major, Major_category, P75th - P25th quartile_spread \
FROM recent_grads \
ORDER BY quartile_spread \
LIMIT 20",
                 conn)

Unnamed: 0,Major,Major_category,quartile_spread
0,MILITARY TECHNOLOGIES,Industrial Arts & Consumer Services,0
1,SCHOOL STUDENT COUNSELING,Education,2000
2,LIBRARY SCIENCE,Education,2000
3,COURT REPORTING,Law & Public Policy,4000
4,PHARMACOLOGY,Biology & Life Science,5000
5,EDUCATIONAL ADMINISTRATION AND SUPERVISION,Education,6000
6,COUNSELING PSYCHOLOGY,Psychology & Social Work,6800
7,SPECIAL NEEDS EDUCATION,Education,10000
8,MATHEMATICS TEACHER EDUCATION,Education,10000
9,SOCIAL WORK,Psychology & Social Work,10000


## Group Summary Statistics

In many cases, we want to drill down even more and compute summary statistics per group. We'll explore how to calculate more granular summary statistics using groups.

We'll drill down and compute summary statistics by group to answer questions like:

* What's the share of women in each major category?
* Which major categories have the greatest numbers of employed graduates?
* What percentage of people in each major category end up in low-wage jobs?

The `GROUP BY` SQL statement allows us to compute summary statistics by "group," or unique value. When we use this statement, SQL creates a group for each unique value in a column or set of columns (the same values we get when we use the `DISTINCT` statement), and then does the calculations for them. To illustrate, we can find the total number of people employed in each major category with the following query:

In [38]:
pd.read_sql_query("SELECT Major_category, SUM(Employed) \
FROM recent_grads \
GROUP BY Major_category",
                 conn)

Unnamed: 0,Major_category,SUM(Employed)
0,Agriculture & Natural Resources,66943
1,Arts,288114
2,Biology & Life Science,302797
3,Business,1088742
4,Communications & Journalism,330660
5,Computers & Mathematics,237894
6,Education,479839
7,Engineering,420372
8,Health,372147
9,Humanities & Liberal Arts,544118


This gives us the total number of employed graduates for each major category. The `GROUP BY` clause groups the rows by their `Major_category` value (one group per unique major category), then calculates the sum of `Employed` within each group.
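
To see the grouping on data small enough to check by hand, here is a minimal sketch using a hypothetical in-memory table. The table name `demo_grads`, the major names, and the numbers are made up for illustration and are not taken from `recent_grads`.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE demo_grads (Major TEXT, Major_category TEXT, Employed INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO demo_grads VALUES (?, ?, ?)",
    [
        ("MAJOR A", "Engineering", 100),
        ("MAJOR B", "Engineering", 250),
        ("MAJOR C", "Arts", 40),
        ("MAJOR D", "Arts", 60),
        ("MAJOR E", "Health", 500),
    ],
)

# One output row per unique Major_category, with Employed summed inside each group.
grouped = demo_conn.execute(
    "SELECT Major_category, SUM(Employed) "
    "FROM demo_grads "
    "GROUP BY Major_category "
    "ORDER BY Major_category"
).fetchall()
demo_conn.close()

print(grouped)  # [('Arts', 100), ('Engineering', 350), ('Health', 500)]
```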

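If we select a plain column that is neither aggregated nor listed in `GROUP BY`, SQLite returns that column's value from a single, arbitrarily chosen row in each group, so the result shouldn't be relied on (many other DBMSs reject such a query outright). If we select an aggregation function, the SQL engine computes its value across the whole group: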

In [39]:
pd.read_sql_query("SELECT Employed, Major_category, SUM(Employed) \
FROM recent_grads \
GROUP BY Major_category",
                 conn)

Unnamed: 0,Employed,Major_category,SUM(Employed)
0,3149,Agriculture & Natural Resources,66943
1,2914,Arts,288114
2,1144,Biology & Life Science,302797
3,2912,Business,1088742
4,179633,Communications & Journalism,330660
5,102087,Computers & Mathematics,237894
6,730,Education,479839
7,1976,Engineering,420372
8,180903,Health,372147
9,2787,Humanities & Liberal Arts,544118
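
SQLite documents one case where the value of a plain (non-aggregated) column is well defined: when the query contains a single `MIN()` or `MAX()` aggregate, plain columns take their values from the row that holds the minimum or maximum. The sketch below illustrates this on a hypothetical in-memory table (`demo_grads` and its rows are made up). This behavior is specific to SQLite and shouldn't be assumed for other DBMSs.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE demo_grads (Major TEXT, Major_category TEXT, Employed INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO demo_grads VALUES (?, ?, ?)",
    [
        ("MAJOR A", "Engineering", 100),
        ("MAJOR B", "Engineering", 250),
        ("MAJOR C", "Arts", 40),
        ("MAJOR D", "Arts", 60),
        ("MAJOR E", "Health", 500),
    ],
)

# With a single MAX(), the plain column Major comes from the row holding the maximum.
top_major_per_category = demo_conn.execute(
    "SELECT Major_category, Major, MAX(Employed) "
    "FROM demo_grads "
    "GROUP BY Major_category "
    "ORDER BY Major_category"
).fetchall()
demo_conn.close()

print(top_major_per_category)
# [('Arts', 'MAJOR D', 60), ('Engineering', 'MAJOR B', 250), ('Health', 'MAJOR E', 500)]
```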


Now let's practice by finding the share of graduates who are employed in each major category:

In [40]:
pd.read_sql_query("SELECT Major_category, AVG(Employed) / AVG(Total) share_employed \
FROM recent_grads \
GROUP BY Major_category",
                 conn)

Unnamed: 0,Major_category,share_employed
0,Agriculture & Natural Resources,0.836986
1,Arts,0.806748
2,Biology & Life Science,0.667157
3,Business,0.835966
4,Communications & Journalism,0.842229
5,Computers & Mathematics,0.795611
6,Education,0.85819
7,Engineering,0.781967
8,Health,0.803374
9,Humanities & Liberal Arts,0.762638


## Querying Virtual Columns With the HAVING Statement

Sometimes we want to select a subset of rows after performing a `GROUP BY` query. In the previous query, for instance, we may have wanted to keep only those major categories where `share_employed` is greater than `.8`. We can't use the `WHERE` clause to do this because `WHERE` filters individual rows before they're grouped, and `share_employed` isn't a column in `recent_grads`; it's a virtual column computed from the groups that `GROUP BY` creates.

When we want to filter on a column generated by a `GROUP BY` query, we can use the `HAVING` statement. Here's an example:

```SQL
SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed 
FROM recent_grads 
GROUP BY Major_category 
HAVING share_employed > .8;
```

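Note that we used the same column name in the `HAVING` statement that we originally specified with `AS`. SQLite allows us to refer to a column alias in later clauses such as `HAVING` (and, for non-aggregate expressions, `WHERE`). This is a SQLite convenience rather than standard SQL, and some other DBMSs (Postgres, for example) require repeating the full expression instead. The statement above will result in the following output: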

In [41]:
pd.read_sql_query("SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed \
FROM recent_grads \
GROUP BY Major_category \
HAVING share_employed > .8;",
                 conn)

Unnamed: 0,Major_category,share_employed
0,Agriculture & Natural Resources,0.836986
1,Arts,0.806748
2,Business,0.835966
3,Communications & Journalism,0.842229
4,Education,0.85819
5,Health,0.803374
6,Industrial Arts & Consumer Services,0.82267
7,Law & Public Policy,0.808399


Note that the results only include categories where share_employed is greater than `.8`. That's because the `HAVING` statement filters out the other rows. Let's find all of the major categories where the share of graduates with low-wage jobs is greater than `.1`:

In [42]:
pd.read_sql_query("SELECT Major_category, AVG(Low_wage_jobs) / AVG(Total) AS share_low_wage \
FROM recent_grads \
GROUP BY Major_category \
HAVING share_low_wage > 0.1",
                 conn)

Unnamed: 0,Major_category,share_low_wage
0,Arts,0.168331
1,Communications & Journalism,0.126324
2,Humanities & Liberal Arts,0.132087
3,Industrial Arts & Consumer Services,0.115713
4,Law & Public Policy,0.115685
5,Psychology & Social Work,0.116934
6,Social Science,0.102233
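
The difference between `WHERE` and `HAVING` comes down to when the filter runs: `WHERE` removes individual rows before grouping, while `HAVING` removes whole groups after the aggregates are computed. Here is a minimal sketch of both on a hypothetical in-memory table (`demo_grads` and all of its values are made up for illustration).

```python
import sqlite3

# Hypothetical in-memory table with made-up rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE demo_grads (Major_category TEXT, Total INTEGER, Low_wage_jobs INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO demo_grads VALUES (?, ?, ?)",
    [
        ("Arts", 100, 20),
        ("Arts", 300, 40),
        ("Engineering", 200, 10),
        ("Engineering", 50, 1),
        ("Health", 400, 60),
    ],
)

# HAVING only: every row is grouped, then groups at or below 0.1 are dropped.
having_only = demo_conn.execute(
    "SELECT Major_category, AVG(Low_wage_jobs) / AVG(Total) AS share_low_wage "
    "FROM demo_grads "
    "GROUP BY Major_category "
    "HAVING share_low_wage > 0.1 "
    "ORDER BY Major_category"
).fetchall()

# WHERE + HAVING: rows with Total < 200 are removed first, which changes
# the per-group averages before HAVING is applied.
where_then_having = demo_conn.execute(
    "SELECT Major_category, AVG(Low_wage_jobs) / AVG(Total) AS share_low_wage "
    "FROM demo_grads "
    "WHERE Total >= 200 "
    "GROUP BY Major_category "
    "HAVING share_low_wage > 0.1 "
    "ORDER BY Major_category"
).fetchall()
demo_conn.close()

print(having_only)
print(where_then_having)
```

In this made-up data, both queries keep the Arts and Health groups, but the Arts share differs between them because `WHERE` dropped one Arts row before the averages were taken.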


## Rounding Results With the ROUND() Function

The percentages in our results were very long and hard to read (e.g., 0.16833085991095678). We can use the SQL `ROUND` function in our query to round them.

```SQL
SELECT Major_category, ROUND(ShareWomen, 2) AS rounded_share_women 
FROM recent_grads;
```

The query will round the `ShareWomen` column to two decimal places. Here's a truncated view of the results:

In [44]:
pd.read_sql_query("SELECT Major_category, ROUND(ShareWomen, 2) AS rounded_share_women \
FROM recent_grads \
LIMIT 10",
                 conn)

Unnamed: 0,Major_category,rounded_share_women
0,Engineering,0.12
1,Engineering,0.1
2,Engineering,0.15
3,Engineering,0.11
4,Engineering,0.34
5,Engineering,0.14
6,Business,0.54
7,Physical Sciences,0.44
8,Engineering,0.14
9,Engineering,0.44


Now let's combine `ROUND()`, `GROUP BY`, and `HAVING` to find the share of employed graduates in each major category:

In [45]:
pd.read_sql_query("SELECT Major_category, ROUND(AVG(Employed) / AVG(Total), 3) AS share_employed \
FROM recent_grads \
GROUP BY Major_category \
HAVING share_employed > .8;",
                 conn)

Unnamed: 0,Major_category,share_employed
0,Agriculture & Natural Resources,0.837
1,Arts,0.807
2,Business,0.836
3,Communications & Journalism,0.842
4,Education,0.858
5,Health,0.803
6,Industrial Arts & Consumer Services,0.823
7,Law & Public Policy,0.808


## Casting

So far, our divisions operated on float values (`AVG()` returns a float), which gave us float results that we could round using the `ROUND()` function. To check how each column is stored, we can use the [`PRAGMA TABLE_INFO()`](https://sqlite.org/pragma.html#pragma_table_info) statement by itself to return the type, along with some other information, for each column:

In [46]:
pd.read_sql_query("PRAGMA TABLE_INFO(recent_grads)",
                 conn)

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,index,INTEGER,0,,0
1,1,Rank,INTEGER,0,,0
2,2,Major_code,INTEGER,0,,0
3,3,Major,TEXT,0,,0
4,4,Major_category,TEXT,0,,0
5,5,Total,INTEGER,0,,0
6,6,Sample_size,INTEGER,0,,0
7,7,Men,INTEGER,0,,0
8,8,Women,INTEGER,0,,0
9,9,ShareWomen,REAL,0,,0


If we try to divide 2 integer columns (`Women` and `Total`), SQL will perform integer division, discarding the fractional part and returning integer values. To get a float result, we need to use the `CAST()` function to convert the values to the Float type first:

In [47]:
pd.read_sql_query("SELECT Major_category, \
CAST(SUM(Women) AS Float) / CAST(SUM(Total) AS Float) AS SW \
FROM recent_grads \
GROUP BY Major_category \
ORDER BY SW",
                 conn)

Unnamed: 0,Major_category,SW
0,Law & Public Policy,0.030585
1,Business,0.084743
2,Industrial Arts & Consumer Services,0.160249
3,Computers & Mathematics,0.209356
4,Engineering,0.219596
5,Communications & Journalism,0.250325
6,Arts,0.393327
7,Humanities & Liberal Arts,0.490051
8,Health,0.673588
9,Interdisciplinary,0.800911


## Writing More Complex Queries

The SQL operations we've learned so far enable us to answer questions with only one source of uncertainty. Many times, we want to answer questions that have 2 or more levels of unknowns. For example:

* Which rows are above the average for the ShareWomen column?

Using the SQL techniques we've learned so far, there's no way to write a query that answers this question. So far, we've only used aggregate functions such as `AVG()` in the `SELECT` clause. They're also valid in the `HAVING` and `ORDER BY` clauses, but not in the `WHERE` clause. For example, the following query:

```SQL
SELECT * FROM recent_grads
WHERE ShareWomen > AVG(ShareWomen)
```

will return an error:

```python
(sqlite3.OperationalError) misuse of aggregate function AVG() [SQL: 'SELECT * FROM recent_grads WHERE ShareWomen > AVG(ShareWomen)']
```
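
The error can be reproduced without `jobs.db`. This is a minimal sketch on a hypothetical in-memory table (`demo_grads` and its rows are made up); the exact wording of the message may differ between SQLite versions, so the code only captures and prints it.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_grads (Major TEXT, ShareWomen REAL)")
demo_conn.executemany(
    "INSERT INTO demo_grads VALUES (?, ?)",
    [("MAJOR A", 0.2), ("MAJOR B", 0.8)],
)

error_message = None
try:
    # Aggregate functions are not allowed directly inside WHERE.
    demo_conn.execute("SELECT * FROM demo_grads WHERE ShareWomen > AVG(ShareWomen)")
except sqlite3.OperationalError as exc:
    error_message = str(exc)
finally:
    demo_conn.close()

print(error_message)
```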

We need to instead learn how to break up a question we want to answer into a series of queries that can be combined.

To determine which majors are above the average for the `ShareWomen` column, we need to:

* first determine the average value for the `ShareWomen` column
* then select and filter the rows that are greater than the average value

If we had to do this using Python and pandas, we would compute and store the average value in `ShareWomen` as a variable and then use the variable in a table filter. While variables dominate how we express logic in programming languages like Python and Java, SQL queries don't use variables in this way. The designers of SQL, a [declarative programming language](https://en.wikipedia.org/wiki/Declarative_programming), want its users to focus on expressing computations rather than explicitly defining, setting, and juggling variables.
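
For comparison, the variable-based approach can be sketched with the `sqlite3` module alone (no pandas): run one query to get the average, keep it in a Python variable, and pass it into a second query through a `?` placeholder. The table `demo_grads` and its rows below are hypothetical and made up for illustration.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_grads (Major TEXT, ShareWomen REAL)")
demo_conn.executemany(
    "INSERT INTO demo_grads VALUES (?, ?)",
    [("MAJOR A", 0.2), ("MAJOR B", 0.4), ("MAJOR C", 0.6), ("MAJOR D", 0.9)],
)

# Step 1: compute the average and store it in a Python variable.
(avg_share,) = demo_conn.execute("SELECT AVG(ShareWomen) FROM demo_grads").fetchone()

# Step 2: use the stored value in a second query via a ? placeholder.
above_avg = demo_conn.execute(
    "SELECT Major, ShareWomen FROM demo_grads WHERE ShareWomen > ? ORDER BY Major",
    (avg_share,),
).fetchall()
demo_conn.close()

print(above_avg)  # [('MAJOR C', 0.6), ('MAJOR D', 0.9)]
```

This works, but it takes two round trips to the database and moves part of the logic out of SQL.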

What would the query look like if we already knew the average value for the `ShareWomen` column?

```SQL
SELECT Major, ShareWomen FROM recent_grads
WHERE ShareWomen > 0.5225502029537575
```



Now, how do we make the computed average value, 0.5225502029537575, dynamic?

Let's introduce the SQL way to solve this problem -- **subqueries**. A subquery is a query nested within another query. Here's a template for a SQL statement where the subquery resides in the `WHERE` clause:

```SQL
SELECT Major, ShareWomen FROM recent_grads
WHERE ShareWomen > (subquery that returns the average value for ShareWomen)
```

The subquery is run first and returns the average value for the `ShareWomen` column (which happens to be 0.5225502029537575). Based on the result of the subquery, SQL will replace the subquery with this value dynamically. Note that SQL will ignore the column name (`AVG(ShareWomen)`) and is smart enough to just use the actual row value.

The query that replaces the placeholder `subquery` needs to be a full query (containing `SELECT` and `FROM` clauses, etc.) that works even if it's run separately. In addition, the inner query should return a table with only a single row and a single column, because the outer query compares each row against one value (`... WHERE ShareWomen > ?`).

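A subquery must always be enclosed in parentheses `()`. For example, the following query applies the same pattern to find the majors with a below-average unemployment rate: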

In [48]:
pd.read_sql_query("SELECT Major, Unemployment_rate \
FROM recent_grads \
WHERE Unemployment_rate < (SELECT AVG(Unemployment_rate) FROM recent_grads) \
ORDER BY Unemployment_rate",
                 conn)

Unnamed: 0,Major,Unemployment_rate
0,MATHEMATICS AND COMPUTER SCIENCE,0.000000
1,BOTANY,0.000000
2,SOIL SCIENCE,0.000000
3,EDUCATIONAL ADMINISTRATION AND SUPERVISION,0.000000
4,ENGINEERING MECHANICS PHYSICS AND SCIENCE,0.006334
5,COURT REPORTING,0.011690
6,MATHEMATICS TEACHER EDUCATION,0.016203
7,PETROLEUM ENGINEERING,0.018381
8,GENERAL AGRICULTURE,0.019642
9,ASTRONOMY AND ASTROPHYSICS,0.021167
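
Returning to the `ShareWomen` question, the same subquery pattern can be checked on data small enough to verify by hand. This is a minimal sketch on a hypothetical in-memory table (`demo_grads` and its rows are made up); it runs the inner query on its own first, then the full query.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_grads (Major TEXT, ShareWomen REAL)")
demo_conn.executemany(
    "INSERT INTO demo_grads VALUES (?, ?)",
    [("MAJOR A", 0.2), ("MAJOR B", 0.4), ("MAJOR C", 0.6), ("MAJOR D", 0.9)],
)

# The inner query is a complete query that returns one row and one column.
inner_result = demo_conn.execute("SELECT AVG(ShareWomen) FROM demo_grads").fetchall()

# The same inner query, wrapped in parentheses, supplies the threshold.
above_avg = demo_conn.execute(
    "SELECT Major, ShareWomen FROM demo_grads "
    "WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM demo_grads) "
    "ORDER BY Major"
).fetchall()
demo_conn.close()

print(len(inner_result), len(inner_result[0]))  # 1 1
print(above_avg)  # [('MAJOR C', 0.6), ('MAJOR D', 0.9)]
```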


## Subquery in SELECT

What if we wanted to understand the proportion of majors that are above the average for a given column? We'd need to divide the number of rows that meet the filter criteria by the total number of rows in the table.

Using the `COUNT()` aggregate function, we can return the number of rows the result set contains:

In [49]:
pd.read_sql_query("SELECT COUNT(*) FROM recent_grads \
WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)",
                 conn)

Unnamed: 0,COUNT(*)
0,91


To return the proportion, we need to divide this value by the total number of rows in `recent_grads`. The challenge, however, is that we don't know the total number of rows, and we don't want to hard code a value that could go out of date.

To dynamically calculate the number of total rows in `recent_grads` and be able to use it in another SQL statement, we can use a subquery in the `SELECT` clause:

In [50]:
pd.read_sql_query("SELECT COUNT(*), (SELECT COUNT(*) FROM recent_grads) FROM recent_grads \
WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)",
                 conn)

Unnamed: 0,COUNT(*),(SELECT COUNT(*) FROM recent_grads)
0,91,173


Now, let's find the proportion:

In [51]:
pd.read_sql_query("SELECT CAST(COUNT(*) AS Float) / \
CAST((SELECT COUNT(*) FROM recent_grads) AS Float) \
AS proportion_abv_avg FROM recent_grads \
WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)",
                 conn)

Unnamed: 0,proportion_abv_avg
0,0.526012
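
The same calculation can be verified on a table where the answer is known in advance. In this minimal sketch, the hypothetical in-memory table `demo_grads` has four made-up rows, two of which are above the average, so the proportion should come out to 0.5.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_grads (Major TEXT, ShareWomen REAL)")
demo_conn.executemany(
    "INSERT INTO demo_grads VALUES (?, ?)",
    [("MAJOR A", 0.2), ("MAJOR B", 0.4), ("MAJOR C", 0.6), ("MAJOR D", 0.9)],
)

# Subquery in SELECT gives the total row count; subquery in WHERE gives the average.
counts_and_proportion = demo_conn.execute(
    "SELECT COUNT(*), "
    "(SELECT COUNT(*) FROM demo_grads), "
    "CAST(COUNT(*) AS Float) / (SELECT COUNT(*) FROM demo_grads) "
    "FROM demo_grads "
    "WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM demo_grads)"
).fetchone()
demo_conn.close()

print(counts_and_proportion)  # (2, 4, 0.5)
```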


## Returning Multiple Results In Subqueries

So far, the subqueries we've used have computed an aggregate value of some kind and returned that value to the outer query to use for filtering. This is because we only worked with the `<` and `>` operators, which, by definition, expect a single value to compare against in a filter.

SQLite understands the following binary operators, in order from highest to lowest precedence: 

``` SQL
||
* / % 
+ - 
<< >> & | 
< <= > >= 
= == != <> IS IS NOT IN LIKE GLOB MATCH REGEXP 
AND 
OR
```

And supported unary prefix operators are these:

```SQL
- + ~ NOT
```

Using the `IN` operator, we can specify a list of values that we want to match against in the `WHERE` clause. All rows that match exactly will be returned. The following query returns the rows where `Major_category` equals either `Business` or `Engineering`:

In [52]:
pd.read_sql_query("SELECT Major, Major_category FROM recent_grads \
WHERE Major_category IN ('Business', 'Engineering') \
LIMIT 7",
                 conn)

Unnamed: 0,Major,Major_category
0,PETROLEUM ENGINEERING,Engineering
1,MINING AND MINERAL ENGINEERING,Engineering
2,METALLURGICAL ENGINEERING,Engineering
3,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering
4,CHEMICAL ENGINEERING,Engineering
5,NUCLEAR ENGINEERING,Engineering
6,ACTUARIAL SCIENCE,Business


Opportunities like this, where we've hard coded values, are usually good candidates for converting to a subquery. Instead of returning the rows where `Major_category` equals one of 2 specific values, we can write a subquery that returns the 5 `Major_category` values with the highest group-level sums for the `Total` column:

In [53]:
pd.read_sql_query("SELECT Major_category FROM recent_grads \
GROUP BY Major_category \
ORDER BY SUM(Total) DESC \
LIMIT 5",
                 conn)

Unnamed: 0,Major_category
0,Business
1,Humanities & Liberal Arts
2,Education
3,Engineering
4,Social Science


Now let's write the whole query:

In [54]:
pd.read_sql_query("SELECT Major, Major_category \
FROM recent_grads \
WHERE Major_category IN (SELECT Major_category FROM recent_grads \
GROUP BY Major_category \
ORDER BY SUM(Total) DESC \
LIMIT 5)",
                 conn)

Unnamed: 0,Major,Major_category
0,PETROLEUM ENGINEERING,Engineering
1,MINING AND MINERAL ENGINEERING,Engineering
2,METALLURGICAL ENGINEERING,Engineering
3,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering
4,CHEMICAL ENGINEERING,Engineering
5,NUCLEAR ENGINEERING,Engineering
6,ACTUARIAL SCIENCE,Business
7,MECHANICAL ENGINEERING,Engineering
8,ELECTRICAL ENGINEERING,Engineering
9,COMPUTER ENGINEERING,Engineering
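
Here is the same `IN` plus subquery pattern on a table small enough to check by hand. This is a minimal sketch on a hypothetical in-memory table (`demo_grads` and its rows are made up), keeping only the majors that belong to the 2 categories with the largest `Total` sums.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE demo_grads (Major TEXT, Major_category TEXT, Total INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO demo_grads VALUES (?, ?, ?)",
    [
        ("MAJOR A", "Engineering", 100),
        ("MAJOR B", "Engineering", 300),
        ("MAJOR C", "Arts", 50),
        ("MAJOR D", "Business", 500),
        ("MAJOR E", "Health", 120),
    ],
)

# The subquery returns one column with several rows, which is what IN expects.
majors_in_top_categories = demo_conn.execute(
    "SELECT Major, Major_category FROM demo_grads "
    "WHERE Major_category IN ("
    "    SELECT Major_category FROM demo_grads "
    "    GROUP BY Major_category "
    "    ORDER BY SUM(Total) DESC "
    "    LIMIT 2"
    ") "
    "ORDER BY Major"
).fetchall()
demo_conn.close()

print(majors_in_top_categories)
# [('MAJOR A', 'Engineering'), ('MAJOR B', 'Engineering'), ('MAJOR D', 'Business')]
```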


## Building Complex Subqueries

We can actually nest subqueries within subqueries many times, but this makes our SQL code more complex and harder to debug. Later we'll explore other techniques of composing SQL statements that make nested logic easier.

When you have a SQL statement you want to write that will end up using many subqueries, it can be overwhelming at first to know how to start. In general, you want to start with the inner queries first and work your way outwards. Let's say we're interested in understanding the ratio of the `Sample_size` column to the `Total` column.

Specifically, let's say we're interested in:

* computing this ratio for every major
* understanding which majors are above the average for this ratio
* understanding how many majors are above the average for this ratio

We'll start by writing a query that computes the ratio for every major and then the average of all of these ratios:

In [55]:
pd.read_sql_query("SELECT AVG(CAST(Sample_size AS Float) / CAST(Total AS Float)) \
AS avg_ratio \
FROM recent_grads",
                  conn)

Unnamed: 0,avg_ratio
0,0.009086


Now that we have a subquery that calculates the average ratio (of `Sample_size` to `Total`), we can return the rows that exceed this average.

In [56]:
pd.read_sql_query("SELECT Major, Major_category, \
CAST(Sample_size AS Float) / CAST(Total AS Float) AS ratio \
FROM recent_grads \
WHERE ratio > (SELECT AVG(CAST(Sample_size AS Float) / CAST(Total AS Float)) \
AS avg_ratio \
FROM recent_grads)",
                 conn)

Unnamed: 0,Major,Major_category,ratio
0,PETROLEUM ENGINEERING,Engineering,0.015391
1,MINING AND MINERAL ENGINEERING,Engineering,0.009259
2,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,0.012719
3,ACTUARIAL SCIENCE,Business,0.013503
4,MECHANICAL ENGINEERING,Engineering,0.011280
5,COMPUTER ENGINEERING,Engineering,0.009605
6,AEROSPACE ENGINEERING,Engineering,0.009762
7,INDUSTRIAL AND MANUFACTURING ENGINEERING,Engineering,0.009648
8,ARCHITECTURAL ENGINEERING,Engineering,0.009204
9,COURT REPORTING,Law & Public Policy,0.012195
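
The last question in the list, how many majors are above the average ratio, can be answered by wrapping the same filter in `COUNT(*)`. The following is a minimal sketch on a hypothetical in-memory table (`demo_grads` and its rows are made up), not the result for `recent_grads`.

```python
import sqlite3

# Hypothetical in-memory table with made-up rows.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE demo_grads (Major TEXT, Sample_size INTEGER, Total INTEGER)"
)
demo_conn.executemany(
    "INSERT INTO demo_grads VALUES (?, ?, ?)",
    [
        ("MAJOR A", 20, 1000),  # ratio 0.02
        ("MAJOR B", 5, 1000),   # ratio 0.005
        ("MAJOR C", 30, 1000),  # ratio 0.03
        ("MAJOR D", 1, 1000),   # ratio 0.001
    ],
)

# Average ratio here is 0.014, so MAJOR A and MAJOR C are above it.
(count_above_avg,) = demo_conn.execute(
    "SELECT COUNT(*) FROM demo_grads "
    "WHERE CAST(Sample_size AS Float) / CAST(Total AS Float) > "
    "(SELECT AVG(CAST(Sample_size AS Float) / CAST(Total AS Float)) FROM demo_grads)"
).fetchone()
demo_conn.close()

print(count_above_avg)  # 2
```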


## Back to Querying SQLite from Python

Before we can execute a query, we need to express our SQL query as a string. While we use the Connection class to represent the database we're working with, we use the [Cursor class](https://docs.python.org/3/library/sqlite3.html#cursor-objects) to:

* Run a query against the database
* Parse the results from the database
* Convert the results to native Python objects
* Store the results within the Cursor instance as a local variable

After running a query and converting the results to a list of **tuples**, the Cursor instance stores the list as a local variable. A tuple is a core Python data structure that represents an ordered, immutable sequence of values: we can index it like a list, but we can't modify it after it's created.
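
As a quick illustration of how tuples behave, here is a minimal sketch. The row below is hypothetical and only shaped like a row that a Cursor might return; its values are made up.

```python
# A hypothetical result row with made-up values.
row = ("MAJOR A", "Engineering", 1000)

# Tuples are indexed like lists.
first_value = row[0]

# Tuples can be unpacked into separate variables.
major, category, total = row

# A one-element tuple needs a trailing comma, which is why single-column
# query results print as ('SOME VALUE',).
single_column_row = ("MAJOR A",)

# Tuples are immutable: assigning to an element raises a TypeError.
try:
    row[0] = "MAJOR B"
    is_immutable = False
except TypeError:
    is_immutable = True

print(first_value, category, total, len(single_column_row), is_immutable)
```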

We need to use the Connection instance method `cursor()` to return a Cursor instance corresponding to the database we want to query.

```python
cursor = conn.cursor()
```

In the following code block, we:

* Write a basic `SELECT` query that will return all of the values from the `recent_grads` table, and store this query as a string named `query`
* Use the Cursor method `execute()` to run the query against our database
* Return the full results set and store it as `results`
* Print the first three tuples in the list `results`

```python
# SQL Query as a string
query = "SELECT * FROM recent_grads;"
# Execute the query, convert the results to tuples, and store as a local variable
cursor.execute(query)
# Fetch the full results set as a list of tuples
results = cursor.fetchall()
# Display the first three results
print(results[0:3])
```

Let's have some practice:

In [57]:
import sqlite3
conn = sqlite3.connect("jobs.db")
cursor = conn.cursor()

query = "SELECT Major FROM recent_grads;"
cursor.execute(query)
majors = cursor.fetchall()
print(majors[0:2])

[('PETROLEUM ENGINEERING',), ('MINING AND MINERAL ENGINEERING',)]


So far, we've been running queries by creating a Cursor instance and then calling the `execute` method on the instance. The `sqlite3` library actually allows us to skip creating a Cursor altogether by using the [`execute` method](https://docs.python.org/3/library/sqlite3.html#sqlite3.Connection.execute) of the Connection object itself. Under the hood, `sqlite3` still creates a Cursor instance for us and runs our query against the database, but this shortcut saves us a step. Here's what the code looks like:

In [60]:
conn = sqlite3.connect("jobs.db")
query = "SELECT Major FROM recent_grads;"
print(conn.execute(query).fetchall()[0:2])

[('PETROLEUM ENGINEERING',), ('MINING AND MINERAL ENGINEERING',)]


Notice that we didn't explicitly create a separate Cursor instance ourselves in this code example.

Now let's learn how to fetch a specific number of results after we run a query.

To make it easier to work with large results sets, the Cursor class allows us to control the number of results we want to retrieve at any given time. To return a single result (as a tuple), we use the Cursor method [`fetchone()`](https://docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.fetchone). To return `n` results, we use the Cursor method [`fetchmany()`](https://docs.python.org/3/library/sqlite3.html#sqlite3.Cursor.fetchmany).

Each Cursor instance contains an internal counter that updates every time we retrieve results. When we call the `fetchone()` method, the Cursor instance will return a single result, and then increment its internal counter by 1. This means that if we call `fetchone()` again, the Cursor instance will actually return the second tuple in the results set (and increment by 1 again).

The `fetchmany()` method takes in an integer (`n`) and returns the corresponding results, starting from the current position. It then increments the Cursor instance's counter by `n`. In the following code, we return the first two results using the `fetchone()` method, then the next five results using the `fetchmany()` method.

```python
first_result = cursor.fetchone()
second_result = cursor.fetchone()
next_five_results = cursor.fetchmany(5)
```
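
To watch the cursor position advance, the snippet above can be run against a small table with known contents. This minimal sketch assumes a hypothetical in-memory table `demo_numbers` holding the integers 1 through 8; `demo_conn` and `demo_cursor` are illustrative names, not the notebook's `conn` and `cursor`.

```python
import sqlite3

# Hypothetical in-memory table holding the integers 1..8.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo_numbers (n INTEGER)")
demo_conn.executemany(
    "INSERT INTO demo_numbers VALUES (?)", [(i,) for i in range(1, 9)]
)

demo_cursor = demo_conn.cursor()
demo_cursor.execute("SELECT n FROM demo_numbers ORDER BY n")

first_result = demo_cursor.fetchone()         # (1,)
second_result = demo_cursor.fetchone()        # (2,)
next_five_results = demo_cursor.fetchmany(5)  # [(3,), (4,), (5,), (6,), (7,)]
remaining = demo_cursor.fetchall()            # [(8,)]
exhausted = demo_cursor.fetchone()            # None once nothing is left
demo_conn.close()

print(first_result, second_result, next_five_results, remaining, exhausted)
```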

Now let's write and run a query that returns the `Major` and `Major_category` columns from `recent_grads`. Then, fetch the first five results and store them as `five_results`.

In [62]:
import sqlite3
conn = sqlite3.connect("jobs.db")
cursor = conn.cursor()

query = "SELECT Major, Major_category FROM recent_grads;"
cursor.execute(query)
five_results = cursor.fetchmany(5)
print(five_results)

[('PETROLEUM ENGINEERING', 'Engineering'), ('MINING AND MINERAL ENGINEERING', 'Engineering'), ('METALLURGICAL ENGINEERING', 'Engineering'), ('NAVAL ARCHITECTURE AND MARINE ENGINEERING', 'Engineering'), ('CHEMICAL ENGINEERING', 'Engineering')]


Because SQLite can lock the database file while a connection is working with it (most notably while it's writing), we need to close the connection when we're done working with it. Closing the connection releases those locks so other processes can access the database, which is important when you're in a production environment and working with other team members.

To close a connection to a database, use the Connection instance method `close()`. When we're working with multiple databases and multiple Connection instances, we want to make sure we call the `close()` method on the correct instance.

```python
conn.close()
```

Here's an example of a typical workflow:

```python
import sqlite3
conn = sqlite3.connect("jobs2.db")
cursor = conn.cursor()

query = "SELECT Major FROM recent_grads ORDER BY Major DESC"
cursor.execute(query)
reverse_alphabetical = cursor.fetchall()

conn.close()
```
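
One way to make sure `close()` is called even if a query raises an error is to wrap the connection in `contextlib.closing` from the standard library. The sketch below is one option rather than the required pattern, and it uses a hypothetical in-memory table (`demo_grads`, with made-up rows) instead of `jobs2.db`. Note that using a Connection directly in a `with` statement manages transactions (commit or rollback) and does not close the connection.

```python
import sqlite3
from contextlib import closing

# closing() calls demo_conn.close() when the block exits, even on an error.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    # Hypothetical table with made-up rows.
    demo_conn.execute("CREATE TABLE demo_grads (Major TEXT)")
    demo_conn.executemany(
        "INSERT INTO demo_grads VALUES (?)", [("MAJOR A",), ("MAJOR B",)]
    )
    reverse_alphabetical = demo_conn.execute(
        "SELECT Major FROM demo_grads ORDER BY Major DESC"
    ).fetchall()

print(reverse_alphabetical)  # [('MAJOR B',), ('MAJOR A',)]
```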

In [63]:
conn.close()