# Joins

![schema.svg](attachment:schema.svg)

The line in the schema diagram clearly shows the link between the id column in the facts table and the facts_id column in the cities table. You may need to refer back to this schema diagram throughout the mission.


## Inner joins

The most common way to join data using SQL is using an inner join. The syntax for an inner join is:

```SQL
SELECT [column_names] FROM [table_name_one]
INNER JOIN [table_name_two] ON [join_constraint];
```

The inner join clause is made up of two parts:

* `INNER JOIN`, which tells the SQL engine the name of the table you wish to join in your query, and that you wish to use an inner join.
* `ON`, which tells the SQL engine what columns to use to join the two tables.

![inner_join.svg](attachment:inner_join.svg)

**Our inner join will include:**

* Rows from the cities table that have a cities.facts_id that matches a facts.id from facts.


**Our inner join will not include:**
* Rows from the cities table that have a cities.facts_id that doesn't match any facts.id from facts.
* Rows from the facts table that have a facts.id that doesn't match any cities.facts_id from cities.

![venn_inner.svg](attachment:venn_inner.svg)
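The include/exclude rules above can be checked on a toy database. The sketch below is only an illustration, not the factbook data: it builds hypothetical three-row versions of `facts` and `cities` in an in-memory SQLite database (the rows, including the unmatched city "Atlantis", are invented for the demo) and uses the standard-library `sqlite3` module directly rather than pandas.

```python
import sqlite3

# Hypothetical miniature versions of the facts and cities tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, facts_id INTEGER);
    INSERT INTO facts VALUES (1, 'Afghanistan'), (2, 'Albania'), (3, 'Monaco');
    -- 'Atlantis' points at a facts.id that does not exist (invented row).
    INSERT INTO cities VALUES (1, 'Kabul', 1), (2, 'Tirana', 2), (3, 'Atlantis', 99);
""")

rows = conn.execute("""
    SELECT f.name, c.name
    FROM facts f
    INNER JOIN cities c ON c.facts_id = f.id
    ORDER BY f.id
""").fetchall()

# Monaco (no city) and Atlantis (no country) are both dropped.
print(rows)  # [('Afghanistan', 'Kabul'), ('Albania', 'Tirana')]
```

Only the rows with a match on both sides survive, which is the overlapping region of the Venn diagram.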


**Example**
```SQL
SELECT * FROM facts
INNER JOIN cities ON cities.facts_id = facts.id
LIMIT 5;
```



In [13]:
import sqlite3
import pandas as pd

conn = sqlite3.connect("factbook.db")
pd.read_sql_query("SELECT * FROM facts INNER JOIN cities ON cities.facts_id = facts.id LIMIT 10", conn)


Unnamed: 0,id,code,name,area,area_land,area_water,population,population_growth,birth_rate,death_rate,migration_rate,id.1,name.1,population.1,capital,facts_id
0,216,aa,Aruba,180,180,0,112162,1.33,12.56,8.18,8.92,1,Oranjestad,37000,1,216
1,6,ac,Antigua and Barbuda,442,442,0,92436,1.24,15.85,5.69,2.21,2,Saint John'S,27000,1,6
2,184,ae,United Arab Emirates,83600,83600,0,5779760,2.58,15.43,1.97,12.36,3,Abu Dhabi,942000,1,184
3,184,ae,United Arab Emirates,83600,83600,0,5779760,2.58,15.43,1.97,12.36,4,Dubai,1978000,0,184
4,184,ae,United Arab Emirates,83600,83600,0,5779760,2.58,15.43,1.97,12.36,5,Sharjah,983000,0,184
5,1,af,Afghanistan,652230,652230,0,32564342,2.32,38.57,13.89,1.51,6,Kabul,3097000,1,1
6,3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92,7,Algiers,2916000,1,3
7,3,ag,Algeria,2381741,2381741,0,39542166,1.84,23.67,4.31,0.92,8,Oran,783000,0,3
8,11,aj,Azerbaijan,86600,82629,3971,9780780,0.96,16.64,7.07,0.0,9,Baku,2123000,1,11
9,2,al,Albania,28748,27398,1350,3029278,0.3,12.92,6.58,3.3,10,Tirana,419000,1,2


## Selecting columns

When working with aliases, we can use the dot selector to select only certain columns:

**Example:**
Select all columns from the cities table and only the country name as "country_name".

```SQL
SELECT c.*, f.name country_name FROM facts f
INNER JOIN cities c ON c.facts_id = f.id
LIMIT 5
```


In [14]:
pd.read_sql_query("SELECT c.*, f.name country_name FROM facts f INNER JOIN cities c ON c.facts_id = f.id LIMIT 5", conn)


Unnamed: 0,id,name,population,capital,facts_id,country_name
0,1,Oranjestad,37000,1,216,Aruba
1,2,Saint John'S,27000,1,6,Antigua and Barbuda
2,3,Abu Dhabi,942000,1,184,United Arab Emirates
3,4,Dubai,1978000,0,184,United Arab Emirates
4,5,Sharjah,983000,0,184,United Arab Emirates
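One reason to select columns explicitly is that `facts` and `cities` share column names such as `id` and `name`. The sketch below, which assumes a hypothetical cut-down schema in an in-memory database, compares the column names SQLite reports for `SELECT *` with those for the aliased query. The `AS` keyword is written out here; in SQLite it is optional, so `f.name country_name` is equivalent.

```python
import sqlite3

# Hypothetical cut-down schema; only the columns needed for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, facts_id INTEGER);
    INSERT INTO facts VALUES (184, 'United Arab Emirates');
    INSERT INTO cities VALUES (3, 'Abu Dhabi', 184), (4, 'Dubai', 184);
""")


def column_names(sql):
    """Return the result-set column names SQLite reports for a query."""
    cursor = conn.execute(sql)
    return [col[0] for col in cursor.description]


# SELECT * returns every column from both tables, so names repeat.
star_names = column_names(
    "SELECT * FROM facts f INNER JOIN cities c ON c.facts_id = f.id"
)

# Table aliases plus a column alias give each output column a distinct name.
aliased_names = column_names(
    "SELECT c.*, f.name AS country_name "
    "FROM facts f INNER JOIN cities c ON c.facts_id = f.id"
)

print(star_names)     # contains 'id' and 'name' twice
print(aliased_names)  # ['id', 'name', 'facts_id', 'country_name']
```

The repeated names in the first result are what pandas renders as `id.1` and `name.1` in the earlier output.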


In [15]:
pd.read_sql_query("SELECT f.name country, c.name capital_city FROM cities c INNER JOIN facts f ON c.facts_id = f.id WHERE c.capital = 1 LIMIT 10", conn)



Unnamed: 0,country,capital_city
0,Aruba,Oranjestad
1,Antigua and Barbuda,Saint John'S
2,United Arab Emirates,Abu Dhabi
3,Afghanistan,Kabul
4,Algeria,Algiers
5,Azerbaijan,Baku
6,Albania,Tirana
7,Armenia,Yerevan
8,Andorra,Andorra La Vella
9,Angola,Luanda


## Left joins

As we mentioned earlier, an inner join will not include any rows where there is not a mutual match from both tables. This means there could be information we are not seeing in our query where rows don't match. It's important whenever you use inner joins to be mindful that you might be excluding important data, especially if you are joining based on columns that aren't linked in the database schema.

Let's look at how we can create a query to explore the missing data using a new type of join: the left join.

A left join includes all the rows that an inner join will select, plus any rows from the first (or left) table that don't have a match in the second table. We can see this represented as a Venn diagram.

![venn_left.svg](attachment:venn_left.svg)
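To make the difference from an inner join concrete, here is a small sketch on hypothetical data (three invented `facts` rows, two invented `cities` rows, in an in-memory database). Unmatched rows from the left table are kept, and their right-hand columns come back as `NULL`, which Python's `sqlite3` module returns as `None`.

```python
import sqlite3

# Hypothetical toy tables: Monaco has no row in cities.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, facts_id INTEGER);
    INSERT INTO facts VALUES (1, 'Afghanistan'), (2, 'Albania'), (3, 'Monaco');
    INSERT INTO cities VALUES (1, 'Kabul', 1), (2, 'Tirana', 2);
""")

# Every facts row appears; missing city columns are filled with NULL.
all_rows = conn.execute("""
    SELECT f.name, c.name
    FROM facts f
    LEFT JOIN cities c ON c.facts_id = f.id
    ORDER BY f.id
""").fetchall()

# Filtering on IS NULL keeps only the facts rows that had no match.
unmatched = conn.execute("""
    SELECT f.name
    FROM facts f
    LEFT JOIN cities c ON c.facts_id = f.id
    WHERE c.facts_id IS NULL
""").fetchall()

print(all_rows)   # [('Afghanistan', 'Kabul'), ('Albania', 'Tirana'), ('Monaco', None)]
print(unmatched)  # [('Monaco',)]
```

The `WHERE ... IS NULL` filter is the same pattern used in the factbook query that follows.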



In [19]:
# Let's use a left join to explore the countries that don't exist in the cities table.
pd.read_sql_query("SELECT f.name country, f.population FROM facts f LEFT JOIN cities c ON c.facts_id = f.id WHERE c.facts_id IS NULL", conn)



Unnamed: 0,country,population
0,Kosovo,1870981.0
1,Monaco,30535.0
2,Nauru,9540.0
3,San Marino,33020.0
4,Singapore,5674472.0
5,Holy See (Vatican City),842.0
6,Taiwan,23415130.0
7,European Union,513949400.0
8,Ashmore and Cartier Islands,
9,Christmas Island,1530.0


## Right join

A right join, as the name indicates, is exactly the opposite of a left join. Where the left join includes all rows in the table before the JOIN clause, the right join includes all rows in the new table in the JOIN clause. We can see a right join in the Venn diagram below:




![venn_right.svg](attachment:venn_right.svg)

The following two queries, one using a left join and one using a right join, produce identical results.


```SQL
SELECT f.name country, c.name city
FROM facts f
LEFT JOIN cities c ON c.facts_id = f.id
LIMIT 5;

SELECT f.name country, c.name city
FROM cities c
RIGHT JOIN facts f ON f.id = c.facts_id
LIMIT 5;
```
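The equivalence between a left join and a right join with the tables swapped can be checked on hypothetical data. One caveat: SQLite only gained `RIGHT JOIN` support in version 3.39.0, so the sketch below runs the right-join half only when the linked SQLite library is new enough. The table contents are invented for the demo.

```python
import sqlite3

# Hypothetical toy tables: Monaco has no row in cities.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, facts_id INTEGER);
    INSERT INTO facts VALUES (1, 'Afghanistan'), (2, 'Albania'), (3, 'Monaco');
    INSERT INTO cities VALUES (1, 'Kabul', 1), (2, 'Tirana', 2);
""")

left_rows = conn.execute("""
    SELECT f.name country, c.name city
    FROM facts f
    LEFT JOIN cities c ON c.facts_id = f.id
    ORDER BY f.id, c.id
""").fetchall()

# RIGHT JOIN needs SQLite 3.39.0 or later; skip it on older libraries.
right_rows = None
if sqlite3.sqlite_version_info >= (3, 39, 0):
    right_rows = conn.execute("""
        SELECT f.name country, c.name city
        FROM cities c
        RIGHT JOIN facts f ON f.id = c.facts_id
        ORDER BY f.id, c.id
    """).fetchall()

print(left_rows)
print(right_rows if right_rows is not None else "RIGHT JOIN not supported here")
```

Because any right join can be rewritten this way, many people write only left joins and reorder the tables as needed.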



## Full outer joins

Full outer joins are reasonably uncommon, and similar results can be achieved using a union clause (which we will teach in the next mission). The standard SQL syntax for a full outer join is:

```SQL
SELECT f.name country, c.name city
FROM cities c
FULL OUTER JOIN facts f ON f.id = c.facts_id
LIMIT 5;
```

![venn_full.svg](attachment:venn_full.svg)
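As a rough illustration of the union approach mentioned above, the sketch below combines two left joins, one in each direction, with `UNION`. It runs on hypothetical data, including an invented city "Atlantis" whose `facts_id` matches no country. Note that `UNION` removes duplicate rows, so the result is only similar to a true full outer join, not always identical.

```python
import sqlite3

# Hypothetical toy tables with an unmatched row on each side.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (id INTEGER PRIMARY KEY, name TEXT, facts_id INTEGER);
    INSERT INTO facts VALUES (1, 'Afghanistan'), (2, 'Albania'), (3, 'Monaco');
    INSERT INTO cities VALUES (1, 'Kabul', 1), (2, 'Tirana', 2), (3, 'Atlantis', 99);
""")

rows = conn.execute("""
    -- All countries, with or without a city.
    SELECT f.name country, c.name city
    FROM facts f
    LEFT JOIN cities c ON c.facts_id = f.id
    UNION
    -- All cities, with or without a country.
    SELECT f.name country, c.name city
    FROM cities c
    LEFT JOIN facts f ON f.id = c.facts_id
""").fetchall()

# Row order is not specified without ORDER BY, so compare as a set.
print(set(rows))
```

The result contains the two matched pairs plus `('Monaco', None)` and `(None, 'Atlantis')`, covering both circles of the Venn diagram.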


## Combining JOINS and subqueries

![explain_subquery.svg](attachment:explain_subquery.svg)

The important thing to remember is that the results of any subqueries are always calculated first, so we read from the inside out.

* The subquery, in the red box, is calculated first. This simple query selects all columns from cities, filtering rows that are marked as capital cities by having a value for capital of 1.
* The INNER JOIN joins the subquery result, aliased as c, to the facts table based on the ON clause.
* Two columns are selected from the results of the join:
 * f.name, aliased as country.
 * c.name, aliased as capital_city.
* The results are limited to the first 10 rows.
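The steps above can be sketched as a runnable query. The SQL below is reconstructed from the bullet points rather than copied from the diagram, and it runs against a hypothetical four-row `cities` table in an in-memory database. An `ORDER BY` is added only so the demo output is deterministic.

```python
import sqlite3

# Hypothetical toy tables; capital = 1 marks a capital city.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (
        id INTEGER PRIMARY KEY, name TEXT, capital INTEGER, facts_id INTEGER
    );
    INSERT INTO facts VALUES (1, 'United Arab Emirates'), (2, 'Algeria');
    INSERT INTO cities VALUES
        (1, 'Abu Dhabi', 1, 1), (2, 'Dubai', 0, 1),
        (3, 'Algiers', 1, 2), (4, 'Oran', 0, 2);
""")

rows = conn.execute("""
    SELECT f.name country, c.name capital_city
    FROM facts f
    INNER JOIN (
        -- Evaluated first: keep only the capital cities.
        SELECT * FROM cities
        WHERE capital = 1
    ) c ON c.facts_id = f.id
    ORDER BY f.id  -- added for a stable demo result
    LIMIT 10
""").fetchall()

print(rows)  # [('United Arab Emirates', 'Abu Dhabi'), ('Algeria', 'Algiers')]
```

Dubai and Oran never reach the join because the subquery has already filtered them out.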


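**Example:** Using a join and a subquery, write a query that returns capital cities with populations of over 10 million, ordered from largest to smallest. Include the following columns:

* `capital_city`: the name of the city.
* `country`: the name of the country the city belongs to.
* `population`: the population of the city.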

In [24]:
query = """
SELECT c.name capital_city, f.name country, c.population population
FROM facts f
INNER JOIN (
    SELECT * FROM cities
    WHERE population > 10000000 AND capital = 1
) c ON c.facts_id = f.id
ORDER BY population DESC
"""

pd.read_sql_query(query, conn)



Unnamed: 0,capital_city,country,population
0,Tokyo,Japan,37217000
1,New Delhi,India,22654000
2,Mexico City,Mexico,20446000
3,Beijing,China,15594000
4,Dhaka,Bangladesh,15391000
5,Buenos Aires,Argentina,13528000
6,Manila,Philippines,11862000
7,Moscow,Russia,11621000
8,Cairo,Egypt,11169000


## Complex SQL queries
When you're writing complex queries with joins and subqueries, it helps to follow this process:

* Think about what data you need in your final output.
* Work out which tables you'll need to join, and whether you will need to join to a subquery.
* If you need to join to a subquery, write the subquery first.
* Then start writing your SELECT clause, followed by the join and any other clauses you will need.
* Don't be afraid to write your query in steps, running it as you go. For instance, you can run your subquery as a stand-alone query first to make sure it looks the way you want before writing the outer query.
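The step-by-step process can be sketched with the capital-cities example from the previous section. The code below is a minimal illustration on a hypothetical four-row dataset (city figures are borrowed from the outputs earlier in this notebook, but the `id` values are invented): it runs the subquery on its own first, then embeds the same text in the outer query.

```python
import sqlite3

# Hypothetical toy tables; ids are invented for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE cities (
        id INTEGER PRIMARY KEY, name TEXT, population INTEGER,
        capital INTEGER, facts_id INTEGER
    );
    INSERT INTO facts VALUES
        (1, 'Japan'), (2, 'Egypt'), (3, 'Albania'), (4, 'United Arab Emirates');
    INSERT INTO cities VALUES
        (1, 'Tokyo', 37217000, 1, 1),
        (2, 'Cairo', 11169000, 1, 2),
        (3, 'Tirana', 419000, 1, 3),
        (4, 'Dubai', 1978000, 0, 4);
""")

subquery = "SELECT * FROM cities WHERE population > 10000000 AND capital = 1"

# Step 1: run the subquery on its own and inspect the rows.
step_one = conn.execute(subquery).fetchall()
print(step_one)

# Step 2: once it looks right, join the outer query to it.
outer_query = f"""
    SELECT c.name capital_city, f.name country, c.population population
    FROM facts f
    INNER JOIN ({subquery}) c ON c.facts_id = f.id
    ORDER BY c.population DESC
"""
step_two = conn.execute(outer_query).fetchall()
print(step_two)  # [('Tokyo', 'Japan', 37217000), ('Cairo', 'Egypt', 11169000)]
```

Checking `step_one` first makes it easier to tell whether a surprising final result comes from the filter or from the join.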

For example, the query below lists countries where the combined population of their cities in the cities table is more than half of the country's total population:

```SQL
SELECT DISTINCT
    f.name country,
    (SELECT SUM(population)
     FROM cities
     WHERE facts_id = f.id) urban_pop,
    f.population total_pop,
    (CAST((SELECT SUM(population)
           FROM cities
           WHERE facts_id = f.id) as FLOAT)
        / CAST(f.population as FLOAT)) urban_pct
FROM facts f
INNER JOIN cities c
ON c.facts_id = f.id
WHERE urban_pct > 0.5
ORDER BY urban_pct ASC
```

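A better solution sums the city populations once in a grouped subquery and joins to it, instead of repeating the same correlated subquery: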

```SQL
SELECT
    f.name country,
    c.urban_pop,
    f.population total_pop,
    (c.urban_pop / CAST(f.population as FLOAT)) urban_pct
FROM facts f
INNER JOIN (
    SELECT
        facts_id,
        SUM(population) urban_pop
    FROM cities
    GROUP BY 1
    ) c ON c.facts_id = f.id
WHERE urban_pct > 0.5
ORDER BY urban_pct ASC
```
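Both versions cast `f.population` to `FLOAT` before dividing. A likely reason, assuming the population columns hold integers, is that SQLite performs integer division when both operands are integers, which would turn every `urban_pct` below 1 into 0. The sketch below shows the difference with literal values chosen for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Integer / integer: SQLite truncates the result to an integer.
int_division = conn.execute("SELECT 7 / 10").fetchone()[0]

# Casting one operand to FLOAT forces floating-point division.
float_division = conn.execute("SELECT 7 / CAST(10 AS FLOAT)").fetchone()[0]

print(int_division)    # 0
print(float_division)  # 0.7
```

Casting only one side is enough, which is why the second query casts just the denominator.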