In [2]:
import pandas as pd
import numpy as np

# Tabular operations in `pandas`
We've already seen a simple _table join_ operation when we used the SA1 codes as an index to select only SA1 data for the Wellington urban area. In this notebook we look a bit more closely at the various ways that tables can be combined in `pandas`.

## Joining tables with `merge()`
As an example, here are data on airports sourced from https://ourairports.com/data/.

In [39]:
airports = pd.read_csv(
    "https://davidmegginson.github.io/ourairports-data/airports.csv")
airports = airports[(airports.iso_country == "NZ") &
                    (airports.iata_code.notna())]
airports = airports[["name", "latitude_deg", "longitude_deg", "iata_code"]]
airports = airports.rename(columns = {"latitude_deg": "lat", "longitude_deg": "lon"})
airports

|  | name | lat | lon | iata_code |
|---|---|---|---|---|
| 50020 | Fox Glacier Aerodrome | -43.46111 | 170.01702 | FGL |
| 50084 | Auckland International Airport | -37.01199 | 174.786331 | AKL |
| 50089 | Taupo Airport | -38.7397 | 176.084 | TUO |
| 50090 | Ardmore Airport | -37.029701 | 174.973007 | AMZ |
| 50091 | Ashburton Airport | -43.903301 | 171.796997 | ASG |
| 50098 | Christchurch International Airport | -43.489029 | 172.532065 | CHC |
| 50099 | Chatham Islands / Tuuta Airport | -43.81189 | -176.46514 | CHT |
| 50101 | Coromandel Airport | -36.791698 | 175.509003 | CMV |
| 50102 | Dargaville Aerodrome | -35.939701 | 173.893997 | DGR |
| 50105 | Dunedin International Airport | -45.928101 | 170.197998 | DUD |


Rather out of date (but free!) information about scheduled flights is available via https://openflights.org/data.php.

In [40]:
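# NOTE: routes.dat appears to have no header row, so read_csv() as called here
# uses the first route as column names; passing header=None would keep that row
schedule = pd.read_csv("https://raw.githubusercontent.com/jpatokal/openflights/master/data/routes.dat")
# keep only the origin and destination airport codes (columns 2 and 4)
schedule = schedule.iloc[:, [2, 4]]
schedule.columns = ["iata_code_1", "iata_code_2"]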
schedule

|  | iata_code_1 | iata_code_2 |
|---|---|---|
| 0 | ASF | KZN |
| 1 | ASF | MRV |
| 2 | CEK | KZN |
| 3 | CEK | OVB |
| 4 | DME | KZN |
| ... | ... | ... |
| 67657 | WYA | ADL |
| 67658 | DME | FRU |
| 67659 | FRU | DME |
| 67660 | FRU | OSS |


Say we want to attach to each scheduled flight within New Zealand the latitude and longitude of its origin and destination airports. Before doing that, we might want to reduce the `schedule` dataset to only those flights between two airports in New Zealand.

In [41]:
schedule_nz = schedule[(schedule.iata_code_1.isin(airports.iata_code)) &
                       (schedule.iata_code_2.isin(airports.iata_code))]
schedule_nz

|  | iata_code_1 | iata_code_2 |
|---|---|---|
| 34911 | AKL | CHC |
| 34912 | AKL | DUD |
| 34917 | AKL | WLG |
| 34918 | AKL | ZQN |
| 34938 | CHC | AKL |
| ... | ... | ... |
| 44072 | WRE | WLG |
| 44073 | WSZ | WLG |
| 44075 | ZQN | AKL |
| 44076 | ZQN | CHC |
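
To see what the `isin()` filter is doing, here is a minimal sketch on made-up data. The `toy_airports` and `toy_routes` tables are hypothetical stand-ins for `airports` and `schedule`, not the real datasets.

```python
import pandas as pd

# Hypothetical toy data standing in for `airports` and `schedule`
toy_airports = pd.DataFrame({"iata_code": ["AKL", "WLG", "CHC"]})
toy_routes = pd.DataFrame({
    "iata_code_1": ["AKL", "AKL", "SYD", "WLG"],
    "iata_code_2": ["WLG", "SYD", "AKL", "CHC"],
})

# isin() returns one boolean per row; combining the two tests with &
# keeps a route only when BOTH of its ends are in the airports table
both_known = (toy_routes.iata_code_1.isin(toy_airports.iata_code) &
              toy_routes.iata_code_2.isin(toy_airports.iata_code))
toy_routes_nz = toy_routes[both_known]
print(toy_routes_nz)
```

Only the AKL to WLG and WLG to CHC routes survive, and they keep their original index labels, which is why the filtered `schedule_nz` table above has gaps in its index.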


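Now we can attach the airport details using `merge()`. By default `merge()` joins on whatever column names the two tables share, and keeps only rows that have a match in both tables (an _inner join_). Adding a suffix to every column of `airports` makes `iata_code_1` (and then `iata_code_2`) the only shared column, so the first merge attaches the origin airport's details and the second attaches the destination's. Because non-matching rows are dropped, merging the full `schedule` table gives the same New Zealand-only result as starting from `schedule_nz`.

In [37]: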
schedule \
    .merge(airports.add_suffix("_1")) \
    .merge(airports.add_suffix("_2"))

|  | iata_code_1 | iata_code_2 | name_1 | lat_1 | lon_1 | name_2 | lat_2 | lon_2 |
|---|---|---|---|---|---|---|---|---|
| 0 | AKL | CHC | Auckland International Airport | -37.011990 | 174.786331 | Christchurch International Airport | -43.489029 | 172.532065 |
| 1 | AKL | DUD | Auckland International Airport | -37.011990 | 174.786331 | Dunedin International Airport | -45.928101 | 170.197998 |
| 2 | AKL | WLG | Auckland International Airport | -37.011990 | 174.786331 | Wellington International Airport | -41.327202 | 174.804993 |
| 3 | AKL | ZQN | Auckland International Airport | -37.011990 | 174.786331 | Queenstown International Airport | -45.021099 | 168.738998 |
| 4 | CHC | AKL | Christchurch International Airport | -43.489029 | 172.532065 | Auckland International Airport | -37.011990 | 174.786331 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 109 | WRE | WLG | Whangarei Airport | -35.769253 | 174.363713 | Wellington International Airport | -41.327202 | 174.804993 |
| 110 | WSZ | WLG | Westport Airport | -41.737111 | 171.579033 | Wellington International Airport | -41.327202 | 174.804993 |
| 111 | ZQN | AKL | Queenstown International Airport | -45.021099 | 168.738998 | Auckland International Airport | -37.011990 | 174.786331 |
| 112 | ZQN | CHC | Queenstown International Airport | -45.021099 | 168.738998 | Christchurch International Airport | -43.489029 | 172.532065 |
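
The merge above relies on matching column names. When the key columns are named differently in the two tables, `merge()` also accepts explicit `left_on` and `right_on` arguments, and the `how` and `indicator` arguments control and report which rows are kept. The sketch below uses small made-up tables (`toy_airports`, `toy_routes`, and the `origin`/`destination` column names are all hypothetical) rather than the real data.

```python
import pandas as pd

# Hypothetical toy tables; the names and values are for illustration only
toy_airports = pd.DataFrame({
    "iata_code": ["AKL", "WLG"],
    "name": ["Auckland", "Wellington"],
})
toy_routes = pd.DataFrame({
    "origin": ["AKL", "WLG", "SYD"],
    "destination": ["WLG", "AKL", "AKL"],
})

# how="left" keeps every route, even if its origin has no matching airport;
# indicator=True adds a `_merge` column recording where each row came from
joined = toy_routes.merge(
    toy_airports,
    how="left",
    left_on="origin",
    right_on="iata_code",
    indicator=True,
)
print(joined)

# The default inner join drops the unmatched SYD route instead
inner = toy_routes.merge(toy_airports, left_on="origin", right_on="iata_code")
print(len(inner))
```

In the left join, the SYD row is kept with missing values for `iata_code` and `name` and is flagged `left_only` in `_merge`, which is a handy way to check for keys that failed to match.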


## Concatenating tables with `concat()`
Sometimes you have data from more than one source that are organised identically; in particular, they have the same or shared column names. Less often, you may have data from two different sources that record different attributes for the same set of objects, arranged in the same order. In either case you can 'sticky-tape' the tables or series together using `concat()`. We already saw this in action [back here](02-navigating-pandas.ipynb#dataframe).

As usual, you can combine data row-wise or column-wise. To make up a `DataFrame` from a bunch of `Series`, use `pd.concat(<list of Series>, axis = "columns")`. Sometimes, though, you just want to extend a `Series` by appending another to it, which is what `concat()` does by default.

In [4]:
s1 = pd.Series(range(5), index = list("abcde"))
s2 = pd.Series(range(6, 11), index = list("fghij"))
print(pd.concat([s1, s2]))

a     0
b     1
c     2
d     3
e     4
f     6
g     7
h     8
i     9
j    10
dtype: int64
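
The same row-wise stacking works for `DataFrame`s with matching column names. Here is a minimal sketch with two made-up tables (`north` and `south` are hypothetical); it also shows the `ignore_index` argument, which is useful when the original row labels carry no meaning.

```python
import pandas as pd

# Two hypothetical tables with identical columns
north = pd.DataFrame({"iata_code": ["AKL", "WLG"], "island": "North"})
south = pd.DataFrame({"iata_code": ["CHC", "DUD"], "island": "South"})

# By default each table keeps its own row labels, so labels repeat
stacked = pd.concat([north, south])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([north, south], ignore_index=True)
print(renumbered)
```

Repeated index labels are legal in `pandas`, but they make label-based selection ambiguous, so renumbering is often the safer choice when stacking tables from different sources.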


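If we concatenate `s1` and `s2` by columns instead, there will be many missing values. `concat()` aligns rows on their index labels, and these two series have no labels in common, so every row has a value in only one column. Note also that the integers have become floats, because `NaN` is a floating-point value.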

In [5]:
print(pd.concat([s1, s2], axis = "columns"))

     0     1
a  0.0   NaN
b  1.0   NaN
c  2.0   NaN
d  3.0   NaN
e  4.0   NaN
f  NaN   6.0
g  NaN   7.0
h  NaN   8.0
i  NaN   9.0
j  NaN  10.0
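
When the series do share index labels, column-wise concatenation lines the values up by label rather than by position. The sketch below assumes two made-up series, `lat` and `lon`, holding rounded illustrative coordinates. It also uses the `keys` argument to give the resulting columns names.

```python
import pandas as pd

# Hypothetical series with the same labels, deliberately in different orders
lat = pd.Series([-37.0, -41.3, -43.5], index=["AKL", "WLG", "CHC"])
lon = pd.Series([172.5, 174.8, 174.8], index=["CHC", "AKL", "WLG"])

# Values are matched on index label, so the differing order does not matter;
# keys= supplies the column names in place of the default 0 and 1
coords = pd.concat([lat, lon], axis="columns", keys=["lat", "lon"])
print(coords)
```

Each airport ends up with its own latitude and longitude, and because every label appears in both series, there are no missing values.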
