## Data Journalism

Data engineering and analytics are not reserved for purely “business” applications.

With the proliferation of the internet & massive tech companies, journalism has also begun to benefit from “data-oriented” individuals.

Data journalism is the practice of documenting & presenting real-world events using data science & visualizations.

## Subqueries

A very useful feature of Postgres, and by now of most flavors of SQL, is the subquery: a query nested inside another query. A subquery that references columns of the outer query is called a correlated subquery.

Subqueries allow us to use information pulled from an inner query in an outer query.

```sql
SELECT …                -- This can also be an UPDATE or DELETE statement; for now we will only use SELECT
FROM outer_table p      -- Conceptually, loop through each row of outer_table "p"
WHERE EXPR (            -- EXPR is a subquery expression (EXISTS, IN, ANY/SOME, ALL) that evaluates to a boolean
    SELECT … FROM inner_table i WHERE …    -- Keep only the outer rows for which the inner query satisfies EXPR
);
```

The expressions we can use with our subqueries include:

* EXISTS : Select the row if the subquery returns at least one row
* IN : Select the row if the attribute's value appears in the single column returned by the subquery, or in a static list
* ANY or SOME : Select the row if the comparison (equal, not equal, greater than, etc.) holds for at least one row from the subquery
* ALL : Select the row if the comparison (equal, not equal, greater than, etc.) holds for every row from the subquery

https://www.postgresql.org/docs/current/functions-subquery.html#FUNCTIONS-SUBQUERY-NOTIN 

## EXISTS

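The EXISTS expression selects a row from the outer table if the subquery produces at least one row for it. The query below keeps each flight that has a matching (airline, destination) pair in `airline_city`.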

In [None]:
SELECT *
FROM flight f
WHERE EXISTS
    (
        SELECT *
        FROM airline_city ac
        WHERE f.destination = ac.code AND f.airlineid = ac.airlineid
    )
;

Alternatively, `NOT EXISTS` selects the outer rows for which the subquery produces no rows:

In [None]:
SELECT *
FROM flight f
WHERE NOT EXISTS
    (
        SELECT *
        FROM airline_city ac
        WHERE f.destination = ac.code AND f.airlineid = ac.airlineid
    )
;
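
To see EXISTS and NOT EXISTS side by side without a running Postgres server, here is a minimal sketch that uses Python's built-in `sqlite3` module as a stand-in. The table layout, the `flightid` column, and the sample rows are assumptions made up for illustration; they are not the real `flights` database.

```python
import sqlite3

# In-memory SQLite database standing in for the Postgres "flights" database
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE airline_city (airlineid TEXT, code TEXT);
    CREATE TABLE flight (flightid TEXT, airlineid TEXT, destination TEXT);

    -- Hypothetical sample rows, for illustration only
    INSERT INTO airline_city VALUES ('AA', 'JFK'), ('DL', 'ATL');
    INSERT INTO flight VALUES
        ('F1', 'AA', 'JFK'),   -- (AA, JFK) is listed in airline_city
        ('F2', 'AA', 'ATL'),   -- (AA, ATL) is not listed
        ('F3', 'DL', 'ATL');   -- (DL, ATL) is listed
""")

# Correlated subquery: it refers to the outer row through the alias "f"
correlated = """
    SELECT 1
    FROM airline_city ac
    WHERE f.destination = ac.code AND f.airlineid = ac.airlineid
"""

matched = conn.execute(
    f"SELECT f.flightid FROM flight f WHERE EXISTS ({correlated}) ORDER BY f.flightid"
).fetchall()
unmatched = conn.execute(
    f"SELECT f.flightid FROM flight f WHERE NOT EXISTS ({correlated}) ORDER BY f.flightid"
).fetchall()

print(matched)    # flights with a matching airline_city row
print(unmatched)  # flights without one
conn.close()
```

Every flight lands in exactly one of the two results, because for each outer row the subquery either returns at least one row or returns none.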

## IN 

The IN expression selects a row from the outer table if the attribute's value appears among the values returned by the subquery.

The query below searches through the `airline_city` table and selects only the rows whose airline appears in the `flight` table.

In [None]:
SELECT *
FROM airline_city ac
WHERE ac.airlineid IN ( 
    SELECT  f.airlineid FROM flight f
) ;

The IN expression can also test an attribute against a static list of values.

In [None]:
SELECT *
FROM airline_city ac
WHERE ac.airlineid IN ( 'AA', 'WN' ) ;
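
Both forms of IN can be tried from Python. The sketch below again uses the built-in `sqlite3` module with made-up sample rows as a stand-in for Postgres, and passes the static list as query parameters instead of writing the values into the SQL string. Note that `sqlite3` uses `?` placeholders, whereas psycopg2 uses `%s`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE airline_city (airlineid TEXT, code TEXT);
    CREATE TABLE flight (airlineid TEXT, destination TEXT);

    -- Hypothetical sample rows, for illustration only
    INSERT INTO airline_city VALUES ('AA', 'JFK'), ('WN', 'AUS'), ('DL', 'ATL');
    INSERT INTO flight VALUES ('AA', 'JFK'), ('AA', 'ATL');
""")

# IN with a subquery: keep airline_city rows whose airline has at least one flight
in_subquery = conn.execute("""
    SELECT ac.airlineid, ac.code
    FROM airline_city ac
    WHERE ac.airlineid IN (SELECT f.airlineid FROM flight f)
    ORDER BY ac.airlineid
""").fetchall()

# IN with a static list: build one placeholder per value, then bind the values
wanted = ["AA", "WN"]
placeholders = ", ".join("?" for _ in wanted)
in_list = conn.execute(
    "SELECT ac.airlineid, ac.code FROM airline_city ac "
    f"WHERE ac.airlineid IN ({placeholders}) ORDER BY ac.airlineid",
    wanted,
).fetchall()

print(in_subquery)
print(in_list)
conn.close()
```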

## ANY or SOME

The ANY and SOME expressions (SOME is a synonym for ANY) are used together with a comparison operator to check whether an attribute from the outer table is equal to, not equal to, greater than, less than, etc. at least one row returned by the inner query.

This query accomplishes exactly the same thing as the IN subquery example from before:

In [None]:
SELECT *
FROM airline_city ac
WHERE ac.airlineid = ANY ( 
    SELECT f.airlineid FROM flight f
) ;
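
The semantics of ANY map closely onto Python's built-in `any()`. The helper `sql_any` below is a hypothetical illustration, not part of any library, and it deliberately ignores SQL's NULL handling. The lists are made-up stand-ins for query results.

```python
import operator

def sql_any(value, op, subquery_values):
    """Mimic `value <op> ANY (subquery)`; NULL handling is deliberately ignored."""
    return any(op(value, v) for v in subquery_values)

# Hypothetical result of: SELECT airlineid FROM flight
flight_airlines = ["AA", "AA", "DL"]
# Hypothetical airlineid values found in airline_city
airline_city_ids = ["AA", "WN", "DL"]

# "= ANY (subquery)" keeps the same values as "IN (subquery)"
eq_any = [a for a in airline_city_ids if sql_any(a, operator.eq, flight_airlines)]
via_in = [a for a in airline_city_ids if a in flight_airlines]
print(eq_any == via_in)

# Other comparison operators work the same way: is 5 greater than at least one value?
gt_any = sql_any(5, operator.gt, [3, 9])

# With an empty subquery result there is nothing to match, so ANY is false
gt_any_empty = sql_any(5, operator.gt, [])
```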

## ALL

The ALL expression is used together with a comparison operator to check whether an attribute from the outer table is equal to, not equal to, greater than, less than, etc. every row returned by the inner query. With `=`, the query below returns rows only if every row in `flight` has the same `airlineid`.

In [None]:
SELECT *
FROM airline_city ac
WHERE ac.airlineid = ALL ( 
    SELECT f.airlineid FROM flight f
) ;
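
ALL behaves like Python's built-in `all()`. The helper `sql_all` below is a hypothetical illustration that ignores SQL's NULL handling, and the lists are made-up stand-ins for query results. It shows why `= ALL` is rarely what you want, how `<> ALL` relates to `NOT IN`, and what happens when the subquery returns no rows.

```python
import operator

def sql_all(value, op, subquery_values):
    """Mimic `value <op> ALL (subquery)`; NULL handling is deliberately ignored."""
    return all(op(value, v) for v in subquery_values)

# Hypothetical airlineid values found in airline_city
airline_city_ids = ["AA", "WN", "DL"]

# Two hypothetical results of: SELECT airlineid FROM flight
mixed_flights = ["AA", "AA", "DL"]
only_aa_flights = ["AA", "AA"]

# "= ALL" matches only when every subquery row holds the same value
eq_all_mixed = [a for a in airline_city_ids if sql_all(a, operator.eq, mixed_flights)]
eq_all_single = [a for a in airline_city_ids if sql_all(a, operator.eq, only_aa_flights)]

# "<> ALL (subquery)" keeps the same values as "NOT IN (subquery)"
ne_all = [a for a in airline_city_ids if sql_all(a, operator.ne, mixed_flights)]
not_in = [a for a in airline_city_ids if a not in mixed_flights]

# With an empty subquery result, ALL is true for every outer row
all_on_empty = sql_all("AA", operator.eq, [])

print(eq_all_mixed, eq_all_single, ne_all, all_on_empty)
```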


To connect to the database from Python, you need the port your Postgres server listens on. You can find your port number via:
```sql
SELECT *
FROM pg_settings
WHERE name = 'port';
```

In [2]:
import psycopg2
import pandas as pd

params_dic = {
    "host"      : "localhost",
    "dbname"    : "flights",
    "user"      : "postgres",
    "password"  : "password",
    "port" : "5434"     
}

conn = psycopg2.connect(**params_dic)

cursor = conn.cursor()
cursor.execute("SELECT * FROM real_flight ")
rows = cursor.fetchall()
print(rows)

cursor.close()

df = pd.DataFrame(rows, columns=["airlineid", "ap_name", "code"])
df.head()

[('AA', 'Louisville International airport', 'SDF'), ('AA', 'John F. Kennedy International airport', 'JFK'), ('AA', 'LaGuardia Airport', 'LGA'), ('AA', 'George Bush Intercontinental airport', 'IAH'), ('AA', 'Tampa International airport', 'TPA'), ('AA', 'Austin-Bergstrom International Airport', 'AUS'), ('AA', 'Hartsfield-Jackson Atlanta International Airport', 'ATL'), ('UAL', 'Louisville International airport', 'SDF'), ('UAL', 'John F. Kennedy International airport', 'JFK'), ('UAL', 'LaGuardia Airport', 'LGA'), ('UAL', 'George Bush Intercontinental airport', 'IAH'), ('UAL', 'San Francisco International airport', 'SFO'), ('UAL', 'Tampa International airport', 'TPA'), ('UAL', 'Austin-Bergstrom International Airport', 'AUS'), ('UAL', 'Hartsfield-Jackson Atlanta International Airport', 'ATL'), ('UAL', 'Daniel K. Inouye International Airport', 'HNL'), ('DL', 'Louisville International airport', 'SDF'), ('DL', 'Dallas/Fort Worth International Airport', 'DFW'), ('DL', 'John F. Kennedy Internatio

| | airlineid | ap_name | code |
|---|---|---|---|
| 0 | AA | Louisville International airport | SDF |
| 1 | AA | John F. Kennedy International airport | JFK |
| 2 | AA | LaGuardia Airport | LGA |
| 3 | AA | George Bush Intercontinental airport | IAH |
| 4 | AA | Tampa International airport | TPA |


The same code, with a comprehensive explanation in the comments:

In [None]:
import psycopg2
import pandas as pd

# set up login information
params_dic = {
    "host"      : "localhost",
    "dbname"    : "flights",
    "user"      : "postgres",
    "password"  : "password",
    "port"      : "..."      # the port chosen during your Postgres installation (5432 is the Postgres default)
}
# we can run "SELECT * FROM pg_settings WHERE name = 'port';" in pgAdmin to discover our port number
# learn more about ports here: https://www.cloudflare.com/learning/network-layer/what-is-a-computer-port/

# connect to your database using the dictionary above
# the ** operator unpacks all your settings into their appropriate params
# without the ** operator, we would have to manually set these params via
# psycopg2.connect(host="localhost", dbname="flights", user="postgres", password="password", port="...")
conn = psycopg2.connect(**params_dic)

# create a cursor by calling conn.cursor()
# for our purposes, a cursor is an object that lets us execute queries and fetch their results
# learn more about cursors here: https://www.geeksforgeeks.org/what-is-cursor-in-sql/
cursor = conn.cursor()

# execute a query using the cursor
# docs of cursor here: https://www.psycopg.org/docs/cursor.html
cursor.execute("SELECT * FROM airline_city")

# pull all rows from the query you just executed
rows = cursor.fetchall()
print(rows)

# close the cursor and the connection once you are done interacting with the database
cursor.close()
conn.close()

# save your list of tuples (representing rows) into a dataframe!
df = pd.DataFrame(rows, columns=["airlineid", "ap_name", "code"])

# print out the first 5 rows of your dataframe!
df.head()
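
The column names above are typed by hand. Cursors that follow Python's DB-API, which includes both psycopg2 and the built-in `sqlite3`, expose the result's column names through `cursor.description`. The sketch below shows the idea with an in-memory SQLite table as a stand-in for Postgres, so it runs without a database server. The two sample rows are copied from the output shown earlier.

```python
import sqlite3

# In-memory SQLite table standing in for the Postgres airline_city table
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE airline_city (airlineid TEXT, ap_name TEXT, code TEXT);
    INSERT INTO airline_city VALUES
        ('AA', 'LaGuardia Airport', 'LGA'),
        ('DL', 'Dallas/Fort Worth International Airport', 'DFW');
""")

cursor = conn.cursor()
cursor.execute("SELECT * FROM airline_city ORDER BY airlineid")
rows = cursor.fetchall()

# cursor.description has one entry per result column; item 0 of each entry is the column name
columns = [col[0] for col in cursor.description]

# pair each row with the column names to get one dictionary per row
records = [dict(zip(columns, row)) for row in rows]
print(columns)
print(records[0])

cursor.close()
conn.close()
```

With the names collected this way, `pd.DataFrame(rows, columns=columns)` builds the same dataframe without hardcoding the column list.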

## Extra Problems & Resources

* (Easy) https://leetcode.com/problems/combine-two-tables/ 
* (Medium) https://leetcode.com/problems/rank-scores/ 
* (Medium) https://leetcode.com/problems/exchange-seats/ 
* SQL Lab Optional Challenge 
* SQL Optional Coding Challenge Q6 - Q9 
* Kaggle SQL BigQuery : https://www.kaggle.com/learn/intro-to-sql (Google Standard SQL)
    * Tutorials & Exercises 1, 3, 5, 6