# DATA SCIENCE SESSIONS VOL. 3
### A Foundational Python Data Science Course
## TaskList 07: Relational Structure + Pivot

[&larr; Back to course webpage](https://datakolektiv.com/)

Feedback should be sent to [goran.milovanovic@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com).

These notebooks accompany the DATA SCIENCE SESSIONS VOL. 3 :: A Foundational Python Data Science Course.

### Lecturers

[Goran S. Milovanović, PhD, DataKolektiv, Chief Scientist & Owner](https://www.linkedin.com/in/gmilovanovic/)

[Aleksandar Cvetković, PhD, DataKolektiv, Consultant](https://www.linkedin.com/in/alegzndr/)

[Ilija Lazarević, MA, DataKolektiv, Consultant](https://www.linkedin.com/in/ilijalazarevic/)

***

Hey, I found it very helpful to watch this video, [https://www.youtube.com/watch?v=iYWKfUOtGaw](https://www.youtube.com/watch?v=iYWKfUOtGaw), to get a handle on the merge and join functions. I recommend watching it!

### Intro

In this TaskList you'll load several tables from the `nycflights13` dataset and search for the required information by joining them. In the last few exercises you'll also construct pivot tables and switch between wide and long format. Good luck!

**00.** Import all the necessary Python libraries for this work, and then load the following datasets from the `_data` directory in the `session07` folder:

- `flights.csv`
- `airlines.csv`
- `airports.csv`
- `planes.csv`

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

flights = pd.read_csv('_data/flights.csv', index_col=0)
airlines = pd.read_csv('_data/airlines.csv', index_col=0)
airports = pd.read_csv('_data/airports.csv', index_col=0)
planes = pd.read_csv('_data/planes.csv', index_col=0)

In [3]:
planes.head()


Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
1,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
2,N102UW,1998.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
3,N103US,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
4,N104UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
5,N10575,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan


In [4]:
flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
1,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01 05:00:00
2,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01 05:00:00
3,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01 05:00:00
4,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01 05:00:00
5,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01 06:00:00


**01.** By joining `flights` and `airlines` datasets, count how many flights have been carried out by every listed airline company? What's the distribution of flights in percentages?

In [5]:
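# Left join: every flight is kept and gets the airline name matching its carrier code
flight_count = pd.merge(flights,airlines,how='left',on='carrier')

number = flight_count[['carrier','name']]
number.value_counts()  # flights per (carrier, name) pair; the result is discarded unless assigned or displayed
display(number)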

Unnamed: 0,carrier,name
0,UA,United Air Lines Inc.
1,UA,United Air Lines Inc.
2,AA,American Airlines Inc.
3,B6,JetBlue Airways
4,DL,Delta Air Lines Inc.
...,...,...
336771,9E,Endeavor Air Inc.
336772,9E,Endeavor Air Inc.
336773,MQ,Envoy Air
336774,MQ,Envoy Air


### `how='left'`

- You keep all rows from the flights DataFrame (the "left" DataFrame), even if there’s no matching row in the airlines DataFrame.

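#### Do you always need to specify `on`?

Not always. When `on` is omitted, `pd.merge()` joins on every column name that the two DataFrames share. `flights` and `airlines` share only `carrier`, so leaving out `on='carrier'` gives the same result here.

#### When should you specify `on`?

If the DataFrames share several column names, specify the key column(s) explicitly. Otherwise `pd.merge()` silently uses all of the shared columns as keys, which can produce an unintended join. It raises an error only when the DataFrames have no column names in common and no keys are given.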
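
To make the role of `on` and `how='left'` concrete, here is a minimal sketch on made-up toy frames (`left` and `right` are hypothetical names, not the course data). It assumes only standard `pd.merge()` behavior: with no `on`, the shared column names become the keys.

```python
import pandas as pd

# Toy data (made up for illustration, not taken from the course files)
left = pd.DataFrame({"carrier": ["UA", "AA", "ZZ"], "flight": [1, 2, 3]})
right = pd.DataFrame({"carrier": ["UA", "AA"], "name": ["United", "American"]})

# No `on`: pandas joins on the columns the two frames share (here only `carrier`)
implicit = pd.merge(left, right, how="left")
explicit = pd.merge(left, right, how="left", on="carrier")

print(implicit.equals(explicit))
print(implicit)  # the "ZZ" row is kept by the left join, with NaN in `name`

# No shared column names and no keys given: pandas raises MergeError
try:
    pd.merge(left, right.rename(columns={"carrier": "code"}))
except pd.errors.MergeError as err:
    print("MergeError:", err)
```

Spelling out `on='carrier'` is still a good habit, because it keeps the join stable if either table later gains another shared column name.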

In [6]:
# Proportions per airline, scaled to percentages
number['name'].value_counts(normalize=True) * 100

name
United Air Lines Inc.          17.419590
JetBlue Airways                16.222949
ExpressJet Airlines Inc.       16.085766
Delta Air Lines Inc.           14.285460
American Airlines Inc.          9.718329
Envoy Air                       7.838148
US Airways Inc.                 6.097822
Endeavor Air Inc.               5.481388
Southwest Airlines Co.          3.644856
Virgin America                  1.532770
AirTran Airways Corporation     0.968002
Alaska Airlines Inc.            0.212010
Frontier Airlines Inc.          0.203399
Mesa Airlines Inc.              0.178457
Hawaiian Airlines Inc.          0.101551
SkyWest Airlines Inc.           0.009502
Name: proportion, dtype: float64

***
Just to refresh our memory:
The `normalize=True` parameter in `.value_counts()` means that instead of returning the counts (absolute frequencies) of unique values, it returns their relative frequencies (proportions).
These proportions represent how frequently each unique value occurs, expressed as a fraction of the total number of non-missing elements.

__normalize = True__

For example, if JetBlue Airways occurs 3 times among N non-missing values, its proportion is 3/N.

- This is built into pandas, so we do not have to calculate the proportions by hand; multiplying by 100 gives percentages.

- By default `.value_counts()` also excludes NaN values (`dropna=True`), so there is no need to call `dropna()` first.

[link](https://pandas.pydata.org/pandas-docs/version/0.25.1/reference/api/pandas.Series.value_counts.html)
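
A small sketch of the difference between counts and proportions, using a made-up Series (`names` is a hypothetical example, not the course data):

```python
import pandas as pd

# Toy data (made up): 3 of the 5 non-missing values are "JetBlue Airways"
names = pd.Series(["JetBlue Airways", "JetBlue Airways", "JetBlue Airways",
                   "Envoy Air", "Envoy Air", None])

counts = names.value_counts()                # absolute counts, missing value dropped
shares = names.value_counts(normalize=True)  # counts divided by the number of non-missing values
percent = shares * 100                       # proportions expressed as percentages

print(counts["JetBlue Airways"])   # 3
print(shares["JetBlue Airways"])   # 0.6, i.e. 3/5
print(percent)
```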

***

**02.** By joining `flights` and `airports` datasets, compute average delay of departure for flights across different origin airports. Include the standard deviation of the delay too.

After several attempts to merge these two DataFrames, I noticed that they have no column name in common. In that case I cannot use the `on` argument and must use `left_on` and `right_on` instead:

- When the key columns have different names in the two DataFrames, we must name the key column from each DataFrame explicitly.

- `left_on` = the key column from the left DataFrame

- `right_on` = the key column from the right DataFrame
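
As a rough sketch of how `left_on` and `right_on` pair up differently named key columns, the example below uses small made-up frames (`trips` and `fields` are hypothetical names, not part of the course data) and then summarizes the delay per airport name with mean and standard deviation:

```python
import pandas as pd

# Toy data (made up); the key is called `origin` on the left and `faa` on the right
trips = pd.DataFrame({"origin": ["EWR", "JFK", "EWR"], "dep_delay": [2.0, 4.0, 10.0]})
fields = pd.DataFrame({"faa": ["EWR", "JFK"], "name": ["Newark", "Kennedy"]})

joined = pd.merge(trips, fields, how="left", left_on="origin", right_on="faa")
print(joined.columns.tolist())  # both key columns, `origin` and `faa`, are kept

# Mean and standard deviation of the delay per airport name
summary = joined.groupby("name")["dep_delay"].agg(["mean", "std"])
print(summary)
```

Note that a group with a single observation has an undefined sample standard deviation, so pandas reports NaN for it.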

So in my case: 

In [7]:
flights_and_airports = pd.merge(flights,airports,how='left', left_on='origin', right_on='faa')
flights_and_airports

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,...,minute,time_hour,faa,name,lat,lon,alt,tz,dst,tzone
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,...,15,2013-01-01 05:00:00,EWR,Newark Liberty Intl,40.692500,-74.168667,18,-5,A,America/New_York
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,...,29,2013-01-01 05:00:00,LGA,La Guardia,40.777245,-73.872608,22,-5,A,America/New_York
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,...,40,2013-01-01 05:00:00,JFK,John F Kennedy Intl,40.639751,-73.778925,13,-5,A,America/New_York
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,...,45,2013-01-01 05:00:00,JFK,John F Kennedy Intl,40.639751,-73.778925,13,-5,A,America/New_York
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,...,0,2013-01-01 06:00:00,LGA,La Guardia,40.777245,-73.872608,22,-5,A,America/New_York
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336771,2013,9,30,,1455,,,1634,,9E,...,55,2013-09-30 14:00:00,JFK,John F Kennedy Intl,40.639751,-73.778925,13,-5,A,America/New_York
336772,2013,9,30,,2200,,,2312,,9E,...,0,2013-09-30 22:00:00,LGA,La Guardia,40.777245,-73.872608,22,-5,A,America/New_York
336773,2013,9,30,,1210,,,1330,,MQ,...,10,2013-09-30 12:00:00,LGA,La Guardia,40.777245,-73.872608,22,-5,A,America/New_York
336774,2013,9,30,,1159,,,1344,,MQ,...,59,2013-09-30 11:00:00,LGA,La Guardia,40.777245,-73.872608,22,-5,A,America/New_York


In [8]:
delay_departure = flights_and_airports.groupby('name')['dep_delay'].agg(['mean','std'])
delay_departure

name,mean,std
John F Kennedy Intl,12.112159,39.035071
La Guardia,10.346876,39.993021
Newark Liberty Intl,15.107954,41.323704


**03.** By joining `flights` and `planes`, count how many flights were carried out using planes produced strictly before year 2000. 

*Note:* You'll notice that both datasets have a column named `year`. Even though the columns share a name, they contain completely different data: one is the year when a flight was carried out, the other is the year when a plane was produced. To resolve this ambiguity, you may use the `suffixes` argument of the `.merge()` Data Frame method.
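
A minimal sketch of what `suffixes` does, on made-up frames (`f` and `p` are hypothetical stand-ins for flights and planes). Without the argument, pandas falls back to its default suffixes `_x` and `_y`:

```python
import pandas as pd

# Toy data (made up); both frames have a `year` column with different meanings
f = pd.DataFrame({"tailnum": ["N1", "N2"], "year": [2013, 2013]})
p = pd.DataFrame({"tailnum": ["N1", "N2"], "year": [1999, 2004]})

default = pd.merge(f, p, how="left", on="tailnum")
named = pd.merge(f, p, how="left", on="tailnum", suffixes=("_flight", "_plane"))

print(default.columns.tolist())  # ['tailnum', 'year_x', 'year_y']
print(named.columns.tolist())    # ['tailnum', 'year_flight', 'year_plane']
```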

In [9]:
flights_planes = pd.merge(flights,planes,how='left',on='tailnum',suffixes=('_flight','_planes123'))
flights_planes.head()

Unnamed: 0,year_flight,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,...,minute,time_hour,year_planes123,type,manufacturer,model,engines,seats,speed,engine
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,...,15,2013-01-01 05:00:00,1999.0,Fixed wing multi engine,BOEING,737-824,2.0,149.0,,Turbo-fan
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,...,29,2013-01-01 05:00:00,1998.0,Fixed wing multi engine,BOEING,737-824,2.0,149.0,,Turbo-fan
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,...,40,2013-01-01 05:00:00,1990.0,Fixed wing multi engine,BOEING,757-223,2.0,178.0,,Turbo-fan
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,...,45,2013-01-01 05:00:00,2012.0,Fixed wing multi engine,AIRBUS,A320-232,2.0,200.0,,Turbo-fan
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,...,0,2013-01-01 06:00:00,1991.0,Fixed wing multi engine,BOEING,757-232,2.0,178.0,,Turbo-fan


In [10]:
planes.head()
# Here `year` is the year the plane was produced, whereas `year` in flights is the year the flight was carried out.

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
1,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
2,N102UW,1998.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
3,N103US,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
4,N104UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
5,N10575,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan


In [11]:
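flights_and_planes = pd.merge(flights,planes, how='left',on='tailnum',suffixes=('_flight','_plane'))
# Caution: this keeps planes built AFTER 2000; the task asks for year_plane < 2000,
# whose flight count would be (flights_and_planes['year_plane'] < 2000).sum()
flights_and_planes[flights_and_planes['year_plane']>2000]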

Unnamed: 0,year_flight,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,...,minute,time_hour,year_plane,type,manufacturer,model,engines,seats,speed,engine
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,...,45,2013-01-01 05:00:00,2012.0,Fixed wing multi engine,AIRBUS,A320-232,2.0,200.0,,Turbo-fan
5,2013,1,1,554.0,558,-4.0,740.0,728,12.0,UA,...,58,2013-01-01 05:00:00,2012.0,Fixed wing multi engine,BOEING,737-924ER,2.0,191.0,,Turbo-fan
8,2013,1,1,557.0,600,-3.0,838.0,846,-8.0,B6,...,0,2013-01-01 06:00:00,2004.0,Fixed wing multi engine,AIRBUS,A320-232,2.0,200.0,,Turbo-fan
10,2013,1,1,558.0,600,-2.0,849.0,851,-2.0,B6,...,0,2013-01-01 06:00:00,2011.0,Fixed wing multi engine,AIRBUS,A320-232,2.0,200.0,,Turbo-fan
11,2013,1,1,558.0,600,-2.0,853.0,856,-3.0,B6,...,0,2013-01-01 06:00:00,2007.0,Fixed wing multi engine,AIRBUS,A320-232,2.0,200.0,,Turbo-fan
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
336765,2013,9,30,2240.0,2245,-5.0,2334.0,2351,-17.0,B6,...,45,2013-09-30 22:00:00,2013.0,Fixed wing multi engine,EMBRAER,ERJ 190-100 IGW,2.0,20.0,,Turbo-fan
336766,2013,9,30,2240.0,2250,-10.0,2347.0,7,-20.0,B6,...,50,2013-09-30 22:00:00,2007.0,Fixed wing multi engine,EMBRAER,ERJ 190-100 IGW,2.0,20.0,,Turbo-fan
336767,2013,9,30,2241.0,2246,-5.0,2345.0,1,-16.0,B6,...,46,2013-09-30 22:00:00,2011.0,Fixed wing multi engine,EMBRAER,ERJ 190-100 IGW,2.0,20.0,,Turbo-fan
336768,2013,9,30,2307.0,2255,12.0,2359.0,2358,1.0,B6,...,55,2013-09-30 22:00:00,2003.0,Fixed wing multi engine,AIRBUS,A320-232,2.0,200.0,,Turbo-fan
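
Task 03 asks for a count of flights on planes produced strictly before 2000. A minimal sketch of that counting step on made-up data (`merged` is a hypothetical stand-in for the merged flights and planes table):

```python
import pandas as pd

# Toy data (made up): one row per flight, with the plane's production year
merged = pd.DataFrame({"year_plane": [1999.0, 2000.0, 2012.0, 1990.0, None]})

# Boolean mask; a missing year compares as False, so that flight is not counted
before_2000 = merged["year_plane"] < 2000
n_flights = int(before_2000.sum())
print(n_flights)  # 2
```

Summing a boolean mask counts its `True` entries, so no intermediate filtered frame is needed.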


**04.** Using the merged dataset obtained in the Task 03, list all the flights carried out by the oldest airplane models. 

In [12]:
oldest_plane_year = flights_and_planes['year_plane'].min()
oldest_plane_year

1956.0

In [13]:
# Reuse the computed minimum instead of hard-coding the year
flights_and_planes[flights_and_planes['year_plane'] == oldest_plane_year]

Unnamed: 0,year_flight,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,...,minute,time_hour,year_plane,type,manufacturer,model,engines,seats,speed,engine
25294,2013,1,30,741.0,745,-4.0,1059.0,1125,-26.0,AA,...,45,2013-01-30 07:00:00,1956.0,Fixed wing multi engine,DOUGLAS,DC-7BF,4.0,102.0,232.0,Reciprocating
33097,2013,10,7,1525.0,1530,-5.0,1915.0,1845,30.0,AA,...,30,2013-10-07 15:00:00,1956.0,Fixed wing multi engine,DOUGLAS,DC-7BF,4.0,102.0,232.0,Reciprocating
34270,2013,10,8,1737.0,1735,2.0,2052.0,2055,-3.0,AA,...,35,2013-10-08 17:00:00,1956.0,Fixed wing multi engine,DOUGLAS,DC-7BF,4.0,102.0,232.0,Reciprocating
61564,2013,11,7,817.0,745,32.0,1140.0,1100,40.0,AA,...,45,2013-11-07 07:00:00,1956.0,Fixed wing multi engine,DOUGLAS,DC-7BF,4.0,102.0,232.0,Reciprocating
66532,2013,11,12,1528.0,1530,-2.0,1837.0,1845,-8.0,AA,...,30,2013-11-12 15:00:00,1956.0,Fixed wing multi engine,DOUGLAS,DC-7BF,4.0,102.0,232.0,Reciprocating
98216,2013,12,17,1043.0,1030,13.0,1416.0,1355,21.0,AA,...,30,2013-12-17 10:00:00,1956.0,Fixed wing multi engine,DOUGLAS,DC-7BF,4.0,102.0,232.0,Reciprocating
99023,2013,12,18,808.0,800,8.0,1146.0,1135,11.0,AA,...,0,2013-12-18 08:00:00,1956.0,Fixed wing multi engine,DOUGLAS,DC-7BF,4.0,102.0,232.0,Reciprocating
111827,2013,2,1,1526.0,1530,-4.0,1915.0,1910,5.0,AA,...,30,2013-02-01 15:00:00,1956.0,Fixed wing multi engine,DOUGLAS,DC-7BF,4.0,102.0,232.0,Reciprocating
113111,2013,2,3,1036.0,1030,6.0,1411.0,1355,16.0,AA,...,30,2013-02-03 10:00:00,1956.0,Fixed wing multi engine,DOUGLAS,DC-7BF,4.0,102.0,232.0,Reciprocating
116573,2013,2,7,742.0,745,-3.0,1114.0,1125,-11.0,AA,...,45,2013-02-07 07:00:00,1956.0,Fixed wing multi engine,DOUGLAS,DC-7BF,4.0,102.0,232.0,Reciprocating


**05.** By semi-joining `flights` and `planes` datasets, compute the average flight distance of all the flights carried out by planes with Turbo-jet engines.

In [14]:
# Caution: this filters on 'Turbo-fan', while the task asks for 'Turbo-jet' engines
turbo_jet_engine = planes[planes['engine']=='Turbo-fan']
turbo_jet_engine

Unnamed: 0,tailnum,year,type,manufacturer,model,engines,seats,speed,engine
1,N10156,2004.0,Fixed wing multi engine,EMBRAER,EMB-145XR,2,55,,Turbo-fan
2,N102UW,1998.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
3,N103US,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
4,N104UW,1999.0,Fixed wing multi engine,AIRBUS INDUSTRIE,A320-214,2,182,,Turbo-fan
5,N10575,2002.0,Fixed wing multi engine,EMBRAER,EMB-145LR,2,55,,Turbo-fan
...,...,...,...,...,...,...,...,...,...
3316,N996AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3317,N996DL,1991.0,Fixed wing multi engine,MCDONNELL DOUGLAS AIRCRAFT CO,MD-88,2,142,,Turbo-fan
3318,N997AT,2002.0,Fixed wing multi engine,BOEING,717-200,2,100,,Turbo-fan
3319,N997DL,1992.0,Fixed wing multi engine,MCDONNELL DOUGLAS AIRCRAFT CO,MD-88,2,142,,Turbo-fan


Basically, this is the `planes` data frame filtered down to the rows whose `engine` is labeled 'Turbo-fan'.

- Instead of merging this data frame with the `flights` data frame, we only check a condition: is the flight's `tailnum` present among the `tailnum` values of the filtered planes?
- That is, we select only the rows in `flights` where this condition is met, without adding any columns from `planes`.
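
The semi-join idea described above can be sketched with `.isin()` on made-up frames (`fl`, `pl` and `wanted` are hypothetical names; the engine labels are only illustrative):

```python
import pandas as pd

# Toy data (made up)
fl = pd.DataFrame({"tailnum": ["N1", "N2", "N1", "N3"],
                   "distance": [100, 200, 300, 400]})
pl = pd.DataFrame({"tailnum": ["N1", "N2"],
                   "engine": ["Turbo-jet", "Turbo-fan"]})

# Tail numbers of the planes that satisfy the condition
wanted = pl.loc[pl["engine"] == "Turbo-jet", "tailnum"]

# Semi-join: keep rows of `fl` that have a match in `wanted`, add no columns
semi = fl[fl["tailnum"].isin(wanted)]
print(semi.columns.tolist())     # ['tailnum', 'distance']
print(semi["distance"].mean())   # 200.0
```

Unlike a merge, this filter can never duplicate rows of the left table, even if a key appeared more than once on the right.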

In [15]:
flights_planes_semi = flights[flights['tailnum'].isin(turbo_jet_engine['tailnum'])]
flights_planes_semi['distance'].mean()

994.9218687088808

**06.** Calculate median departure delay by grouping `flights` dataset by both `carrier` and `origin` of the flight. Then, using `pd.pivot()` function on that result, create a pivot table showing median departure delays for each `carrier`-`origin` combination.

In [25]:
# Caution: .mean() computes the average delay; the task asks for the median (.median())
median_dep_delay= flights.groupby(['origin','carrier'])['dep_delay'].mean().reset_index()
median_dep_delay

Unnamed: 0,origin,carrier,dep_delay
0,EWR,9E,5.951667
1,EWR,AA,10.035419
2,EWR,AS,5.804775
3,EWR,B6,13.100262
4,EWR,DL,12.084592
5,EWR,EV,20.164931
6,EWR,MQ,17.467268
7,EWR,OO,20.833333
8,EWR,UA,12.522869
9,EWR,US,3.735104


In [27]:
pd.pivot(median_dep_delay,index='carrier',columns='origin',values='dep_delay')

carrier,EWR,JFK,LGA
9E,5.951667,19.001517,8.894182
AA,10.035419,10.302155,6.705769
AS,5.804775,,
B6,13.100262,12.757453,14.805738
DL,12.084592,8.333188,9.572997
EV,20.164931,18.520362,19.1255
F9,,,20.215543
FL,,,18.726075
HA,,4.900585,
MQ,17.467268,13.199971,8.528569


**07.** The previous result could also be accomplished with the `.pivot_table()` Data Frame method, without first grouping and aggregating the data: `pivot_table()` does that for you.

Now, using `.pivot_table()` method on `flights` dataset display longest `air_time`s for each `origin`-`dest`ination combination.

In [50]:
flights.pivot_table(index='dest',columns='origin', values='air_time',aggfunc='max')\
.reset_index()\
.rename_axis('Improvise',axis=1)


Improvise,dest,EWR,JFK,LGA
0,ABQ,,318.0,
1,ACK,,141.0,
2,ALB,50.0,,
3,ANC,434.0,,
4,ATL,176.0,172.0,175.0
...,...,...,...,...
99,TPA,204.0,200.0,191.0
100,TUL,250.0,,
101,TVC,108.0,,110.0
102,TYS,138.0,,131.0
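
To illustrate the claim that `.pivot_table()` groups and aggregates in one step, here is a minimal sketch on made-up data (`df` is a hypothetical frame) comparing it with the two-step `groupby` plus `pivot` route, using the median as the aggregate:

```python
import pandas as pd

# Toy data (made up)
df = pd.DataFrame({
    "origin": ["EWR", "EWR", "EWR", "JFK", "JFK"],
    "carrier": ["UA", "UA", "AA", "UA", "UA"],
    "dep_delay": [1.0, 3.0, 10.0, 2.0, 8.0],
})

# Two steps: aggregate first, then reshape
grouped = df.groupby(["origin", "carrier"])["dep_delay"].median().reset_index()
two_step = grouped.pivot(index="carrier", columns="origin", values="dep_delay")

# One step: pivot_table aggregates and reshapes together
one_step = df.pivot_table(index="carrier", columns="origin",
                          values="dep_delay", aggfunc="median")

print(two_step)
print(one_step)
```

Combinations that never occur in the data (here AA from JFK) show up as NaN in both results.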


**08.** Using `.melt()` method change the result table from Task 07 from wide to long format.

In [54]:
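# Caution: this melts the raw `flights` table; the task asks to melt the Task 07 pivot (id_vars='dest', var_name='origin')
pd.melt(flights,id_vars='origin',var_name='dest', value_name='max_air_time')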

Unnamed: 0,origin,dest,max_air_time
0,EWR,year,2013
1,LGA,year,2013
2,JFK,year,2013
3,JFK,year,2013
4,LGA,year,2013
...,...,...,...
6061963,JFK,time_hour,2013-09-30 14:00:00
6061964,LGA,time_hour,2013-09-30 22:00:00
6061965,LGA,time_hour,2013-09-30 12:00:00
6061966,LGA,time_hour,2013-09-30 11:00:00
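
A minimal sketch of going from wide to long with `.melt()`, assuming a wide table shaped like a pivot result after `reset_index()` (the `wide` frame and its values are made up for illustration):

```python
import pandas as pd

# Toy wide table (made up): one row per destination, one column per origin
wide = pd.DataFrame({
    "dest": ["ATL", "ALB"],
    "EWR": [176.0, 50.0],
    "JFK": [172.0, None],
})

# Wide to long: one row per (dest, origin) pair
long = wide.melt(id_vars="dest", var_name="origin", value_name="max_air_time")
long = long.dropna(subset=["max_air_time"])  # optional: drop pairs with no flights
print(long)
```

Here `id_vars` names the column to keep as an identifier, while the remaining column headers move into `origin` and their cell values into `max_air_time`.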


***

DataKolektiv, 2022/23.

[hello@datakolektiv.com](mailto:goran.milovanovic@datakolektiv.com)

![](../img/DK_Logo_100.png)

<font size=1>License: [GPLv3](https://www.gnu.org/licenses/gpl-3.0.txt) This Notebook is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This Notebook is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this Notebook. If not, see http://www.gnu.org/licenses/.</font>