# SQL - Intro P2

In [None]:
import pandas as pd
import sqlite3

conn = sqlite3.connect('../data/mtcars.sqlite')
df = pd.read_sql_query("SELECT * FROM results", conn)
df

## GROUP BY

GROUP BY summarizes the values in a table with an aggregate function (SUM, AVG, COUNT, etc.). To use GROUP BY correctly, the SELECT statement should list the columns you are grouping by, followed by the column to aggregate, wrapped in the function you want to apply. The rest of the query follows the standard pattern: FROM names the table to read from.

Any row filters go in the WHERE clause. GROUP BY comes after WHERE and lists one or more grouping columns, separated by commas.

<code> SELECT var1, var2, sum(var3) as sum_var3 FROM table GROUP BY var1, var2 </code>

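Keep in mind that the syntax below will return an error or incorrect results, because var2 is selected without being aggregated or grouped. SQLite runs the query but takes var2 from an arbitrary row in each group; stricter databases such as PostgreSQL reject it.

<code> SELECT var1, var2, sum(var3) as sum_var3 FROM table GROUP BY var1 </code> **var2 needs to be in the GROUP BY**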

In [None]:
# Without GROUP BY, SUM collapses the whole table into a single row
pd.read_sql_query("""SELECT cylinders, SUM(weight) AS agg_weight 
                  FROM results""", conn)

In [None]:
pd.read_sql_query("""SELECT cylinders, SUM(weight) AS agg_weight 
                  FROM results
                  GROUP BY cylinders""", conn)
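
To make the rule about non-aggregated columns concrete, here is a minimal sketch on a made-up in-memory table. The `sales` table, its columns, and its values are assumptions for illustration only and are not part of the mtcars data. It compares a query that groups by every selected column with one that leaves a column out of GROUP BY.

```python
import sqlite3

# Hypothetical toy table; names and values are invented for this example.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE sales (region TEXT, product TEXT, amount INTEGER)")
toy.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("east", "apple", 10),
        ("east", "apple", 5),
        ("east", "pear", 7),
        ("west", "apple", 3),
    ],
)

# Every non-aggregated column in SELECT also appears in GROUP BY.
full = toy.execute(
    """SELECT region, product, SUM(amount) AS sum_amount
       FROM sales
       GROUP BY region, product
       ORDER BY region, product"""
).fetchall()
print(full)  # [('east', 'apple', 15), ('east', 'pear', 7), ('west', 'apple', 3)]

# product is selected but not grouped: SQLite returns one row per region
# and fills product from an arbitrary row of that group.
partial = toy.execute(
    """SELECT region, product, SUM(amount) AS sum_amount
       FROM sales
       GROUP BY region
       ORDER BY region"""
).fetchall()
print([(row[0], row[2]) for row in partial])  # [('east', 22), ('west', 3)]
```

The second query silently mixes apples and pears into one total for `east`, which is why it is safer to treat a missing GROUP BY column as a bug even when the database does not complain.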

What if I want only the cylinder groups whose agg_weight is at least 10,000?

In [None]:
# Not what we want: WHERE filters individual rows on weight before they are summed
pd.read_sql_query("""SELECT cylinders, SUM(weight) AS agg_weight 
                  FROM results
                  WHERE weight >= 10000
                  GROUP BY cylinders""", conn)

In [None]:
# This will break! WHERE runs before aggregation, so it cannot filter on SUM(weight) or its alias agg_weight

pd.read_sql_query("""SELECT cylinders, SUM(weight) AS agg_weight 
                  FROM results
                  WHERE agg_weight >= 10000
                  GROUP BY cylinders""", conn)
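
The two attempts above can be reproduced without the notebook's database. This is a small sketch using a hypothetical in-memory `cars` table (the rows are invented, not taken from `results`): filtering on the raw column returns nothing useful, and filtering on the aggregate alias in WHERE is rejected by SQLite.

```python
import sqlite3

# Hypothetical stand-in for the results table; values are made up.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE cars (cylinders INTEGER, weight INTEGER)")
toy.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [(4, 2000), (4, 2500), (8, 4000), (8, 4500), (8, 3500)],
)

# WHERE on the raw column: no single car weighs 10,000, so every row is
# dropped before SUM ever runs and the result is empty.
row_filtered = toy.execute(
    """SELECT cylinders, SUM(weight) AS agg_weight
       FROM cars
       WHERE weight >= 10000
       GROUP BY cylinders"""
).fetchall()
print(row_filtered)  # []

# WHERE on the aggregate alias: SQLite refuses to prepare the statement.
try:
    toy.execute(
        """SELECT cylinders, SUM(weight) AS agg_weight
           FROM cars
           WHERE agg_weight >= 10000
           GROUP BY cylinders"""
    ).fetchall()
    failed = False
except sqlite3.OperationalError as exc:
    failed = True
    print("query rejected:", exc)
```

When the same failing query goes through `pd.read_sql_query`, pandas surfaces the problem as an exception of its own, so the cell stops with a traceback rather than returning a DataFrame.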

## HAVING

HAVING filters on aggregated results, which WHERE cannot do because WHERE is applied to individual rows before they are grouped. HAVING must come **AFTER** GROUP BY.



In [None]:
# Keep only the cylinder groups whose total weight is at least 10,000
pd.read_sql_query("""SELECT cylinders, SUM(weight) AS agg_weight 
                  FROM results
                  GROUP BY cylinders
                  HAVING SUM(weight) >= 10000""", conn)
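
WHERE and HAVING can appear in the same query: WHERE removes rows first, then GROUP BY forms the groups, then HAVING removes groups. The sketch below shows that order on a hypothetical in-memory `cars` table; the table name, rows, and thresholds are assumptions chosen to keep the arithmetic easy to check by hand.

```python
import sqlite3

# Hypothetical toy data, not the notebook's results table.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE cars (cylinders INTEGER, weight INTEGER)")
toy.executemany(
    "INSERT INTO cars VALUES (?, ?)",
    [(4, 2000), (4, 2500), (6, 3000), (8, 4000), (8, 4500), (8, 3500)],
)

# HAVING alone: group totals are 4 -> 4500, 6 -> 3000, 8 -> 12000.
having_only = toy.execute(
    """SELECT cylinders, SUM(weight) AS agg_weight
       FROM cars
       GROUP BY cylinders
       HAVING SUM(weight) >= 4000
       ORDER BY cylinders"""
).fetchall()
print(having_only)  # [(4, 4500), (8, 12000)]

# WHERE + HAVING: the 2000-weight car is dropped first, so the
# 4-cylinder total falls to 2500 and no longer passes HAVING.
where_and_having = toy.execute(
    """SELECT cylinders, SUM(weight) AS agg_weight
       FROM cars
       WHERE weight >= 2500
       GROUP BY cylinders
       HAVING SUM(weight) >= 4000
       ORDER BY cylinders"""
).fetchall()
print(where_and_having)  # [(8, 12000)]
```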

## ORDER BY
To sort the rows returned by a SELECT statement, use ORDER BY. It does not change the order of records stored in the table; it only affects the output of the statement. The default order is ascending; add the DESC keyword after a column name to sort that column in descending order. You can also sort by multiple columns by separating them with commas.

In [None]:
pd.read_sql_query("""SELECT *
                  FROM results
                  ORDER BY cylinders""", conn)

In [None]:
pd.read_sql_query("""SELECT *
                  FROM results
                  ORDER BY cylinders DESC""", conn)

In [None]:
pd.read_sql_query("""SELECT *
                  FROM results
                  ORDER BY cylinders DESC, year DESC""", conn)
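
Each column in ORDER BY carries its own direction, and ORDER BY can also sort on an aggregate alias because it runs after grouping. The following sketch illustrates both on a hypothetical in-memory `cars` table whose columns and values are invented for the example.

```python
import sqlite3

# Hypothetical toy data; the name column is just a row label.
toy = sqlite3.connect(":memory:")
toy.execute("CREATE TABLE cars (cylinders INTEGER, year INTEGER, name TEXT)")
toy.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [(4, 70, "a"), (8, 72, "b"), (4, 73, "c"), (8, 70, "d"), (6, 71, "e")],
)

# Mixed directions: cylinders descending, then year ascending within ties.
mixed = toy.execute(
    """SELECT cylinders, year, name
       FROM cars
       ORDER BY cylinders DESC, year ASC"""
).fetchall()
print(mixed)
# [(8, 70, 'd'), (8, 72, 'b'), (6, 71, 'e'), (4, 70, 'a'), (4, 73, 'c')]

# Sorting on an aggregate alias, with cylinders as the tie-breaker.
by_count = toy.execute(
    """SELECT cylinders, COUNT(*) AS n
       FROM cars
       GROUP BY cylinders
       ORDER BY n DESC, cylinders"""
).fetchall()
print(by_count)  # [(4, 2), (8, 2), (6, 1)]
```

Adding an explicit tie-breaker column, as in the second query, keeps the output order reproducible when several rows share the same sort value.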