<H1>Data cleaning, aggregation and analysis</H1>
<br>
<br>
This file reads the previously consolidated dataset of flight and vaccination information, cleans the data, and displays it in various aggregated graphs used to answer the project's main questions.
<br>
Workbook contains: <br>

## Sections

- <a href="#cleaning">Data cleaning and aggregation</a>
- <a href="#graph1">Graph 1: Global flight levels over time</a>
- <a href="#graph2">Graph 2: Comparing flight levels vs. vaccination rates per country for 06/2021</a>
- <a href="#graph3">Graph 3a,b: Comparing flight levels vs. vaccination rates per country over time, and calculating correlation</a>
- <a href="#graph4">Graph 4: Comparing flight levels vs. vaccination rates across countries over time, and calculating correlation</a>

In [1]:
import numpy as np
import pandas as pd
import pycountry
from datetime import datetime
import os

import requests
from bs4 import BeautifulSoup
import csv
import re

import plotly.express as px

import plotly
# connected=True loads the plotly javascript library from an online CDN instead of embedding it in the notebook.
plotly.offline.init_notebook_mode(connected = True)

from matplotlib import pyplot as plt
plt.style.use('ggplot')
import seaborn as sns

%matplotlib inline


<p><a name="cleaning"></a></p>
<H2>Data cleaning and aggregation</H2>
<br>
<br>
The data consolidated in the 'data_consolidation' Jupyter workbook is loaded. We remove all flights without an assigned country, which almost exclusively filters out flights without an origin airport. A review of these flights shows that their number over time correlates very strongly with that of all other flights, indicating that removing them should not skew the data much in any direction.
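
The review mentioned above is not part of this workbook. A minimal sketch of how such a check could look is shown below, on made-up toy records rather than the real flight table; the names `flights`, `monthly` and `corr_unmapped` are illustrative assumptions.

```python
import pandas as pd

# Toy records (hypothetical): per month, some flights have a country, some do not.
rows = []
for day, n_mapped, n_unmapped in [("2019-01-15", 4, 2), ("2019-02-15", 2, 1), ("2019-03-15", 6, 3)]:
    rows += [{"date_num": day, "iso_country": "DE"}] * n_mapped
    rows += [{"date_num": day, "iso_country": None}] * n_unmapped
flights = pd.DataFrame(rows)
flights["date_num"] = pd.to_datetime(flights["date_num"])

# Count flights per calendar month, separately for mapped and unmapped flights.
is_unmapped = flights["iso_country"].isnull()
period = flights["date_num"].dt.to_period("M")
monthly = pd.DataFrame({
    "mapped": period[~is_unmapped].value_counts().sort_index(),
    "unmapped": period[is_unmapped].value_counts().sort_index(),
}).fillna(0)

# Pearson correlation of the two monthly series; a value close to 1 means
# both groups rise and fall together over time.
corr_unmapped = monthly["mapped"].corr(monthly["unmapped"])
print(monthly)
print(corr_unmapped)
```

In the toy data the unmapped flights are exactly half of the mapped flights each month, so the two series move in lockstep. On the real data the same comparison would only indicate, not prove, that dropping unmapped flights is harmless.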

In addition, we filter out a few very small countries for which we do not have flight data for each of the 30 months in the observation timeframe.
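
As a sketch of this completeness filter, the toy example below keeps only countries that have a row for every month. It assumes a monthly table with one row per country and month; `expected_months` is 3 here only to keep the toy data small, where the workbook uses 30.

```python
import pandas as pd

# Toy monthly table (hypothetical values): country BB is missing month 2.
monthly = pd.DataFrame({
    "year": [2019, 2019, 2019, 2019, 2019],
    "month": [1, 2, 3, 1, 3],
    "iso_country": ["AA", "AA", "AA", "BB", "BB"],
    "callsign": [10, 12, 11, 3, 4],
})

expected_months = 3  # 30 in the workbook's observation timeframe

# For every row, look up how many monthly rows its country has in total.
months_per_country = monthly.groupby("iso_country")["month"].transform("size")

# Keep only rows of countries that are present in every month.
complete = monthly[months_per_country == expected_months]
print(complete)
```

Using `transform` returns one value per row, so the result can be used directly as a boolean mask without building an intermediate list of countries.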

Finally, we consolidate the flight data by year, month and country, as these are the key dimensions considered in the following analyses.
<br>

In [2]:
#Filter out unmapped data and group by year, month and country

#low_memory=False avoids the mixed-type warning pandas raises when reading this file
flight_data_completed=pd.read_csv('final_flight_table.csv', low_memory=False)

#.copy() makes the filtered table independent, so columns can be added without a SettingWithCopyWarning
flight_data_filtered=flight_data_completed[~flight_data_completed['iso_country'].isnull()].copy()

flight_data_filtered['date_num']=pd.to_datetime(flight_data_filtered['date_num'])

#absolute number of fully vaccinated people, derived from the rate per hundred
flight_data_filtered['people_fully_vaccinated']=flight_data_filtered['people_fully_vaccinated_per_hundred']*flight_data_filtered['population']/100
flight_data_filtered['month']=flight_data_filtered['date_num'].dt.month
flight_data_filtered['year']=flight_data_filtered['date_num'].dt.year

#number of flights (callsign count) and mean vaccination/population figures per year, month and country
flight_data_monthly=flight_data_filtered.groupby(['year','month','iso_country']).agg({'people_fully_vaccinated_per_hundred': 'mean','callsign':'count','population':'mean','people_fully_vaccinated':'mean'})
flight_data_monthly=flight_data_monthly.reset_index()

#missing vaccination values are treated as 0
flight_data_monthly['people_fully_vaccinated_per_hundred']=flight_data_monthly['people_fully_vaccinated_per_hundred'].fillna(0)
flight_data_monthly['people_fully_vaccinated']=flight_data_monthly['people_fully_vaccinated'].fillna(0)

#keep only countries with a row for each of the 30 months in the observation timeframe
months_per_country=flight_data_monthly['iso_country'].value_counts()
country_count=list(months_per_country[months_per_country==30].index)

flight_data_monthly=flight_data_monthly[flight_data_monthly['iso_country'].isin(country_count)]







<p><a name="graph1"></a></p>
<H2>Graph 1: Global flight levels over time</H2>
<br>
<br>
The graph shows the global number of commercial passenger flights per month from 01/2019 to 06/2021 (the full available period), based on the monthly flight data prepared and filtered above.
<br>

In [3]:
#Aggregate monthly dataset into a view of all flights per month per year
global_monthly_flights=flight_data_monthly.groupby(['year','month'])['callsign'].agg('sum')
global_monthly_flights=global_monthly_flights.reset_index()

#New column year-month as basis for graph x-axis
global_monthly_flights['year-month']=global_monthly_flights.agg(lambda x: f"{x['year']} - {x['month']}", axis=1)

#plot using plotly express
x_axis = global_monthly_flights['year-month']
monthly_flights = global_monthly_flights['callsign']

fig = px.bar(x=x_axis, y=monthly_flights,
             labels=dict(x="<B>Year and month</B>", y="<B># of global commercial passenger flights (Top 100 airlines)</B>"))

fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})

fig.update_xaxes(showline=True, linewidth=1, linecolor='grey')
fig.update_yaxes(showline=True, linewidth=1, linecolor='grey')

fig.show()
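
The x-axis labels above are built row by row with an f-string. As an alternative sketch, the labels can be derived from a proper date column, which also gives zero-padded months such as `2019-01`. The table `monthly_totals` below holds made-up values and is an assumption for illustration only.

```python
import pandas as pd

# Toy monthly totals (hypothetical values).
monthly_totals = pd.DataFrame({
    "year": [2019, 2019, 2019],
    "month": [1, 2, 10],
    "callsign": [120, 110, 130],
})

# pd.to_datetime can assemble dates from year/month/day columns;
# day=1 is added so that every row maps to the first of its month.
first_of_month = pd.to_datetime(monthly_totals[["year", "month"]].assign(day=1))

# Zero-padded labels sort correctly even when treated as plain strings.
monthly_totals["year-month"] = first_of_month.dt.strftime("%Y-%m")
print(monthly_totals)
```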


<p><a name="graph2"></a></p>
<H2>Graph 2: Comparing flight levels vs. vaccination rates per country for 06/2021</H2>
<br>
<br>
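The graph shows, per country, the number of flights in 06/2021 as a share of the 06/2019 level, plotted against the number of fully vaccinated people per hundred inhabitants. Bubble size reflects population. Only countries with at least 800 flights in both months and a return level below 1.5 are shown.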
<br>
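
The recovery level described above is the ratio of flights in June 2021 to flights in June 2019. The sketch below shows the calculation on made-up numbers. It uses the `suffixes` parameter of `merge` to get explicit column names (`callsign_2021`, `callsign_2019`), which is a naming choice of this sketch and not what the workbook cell uses (it keeps the default `_x`/`_y`).

```python
import pandas as pd

# Toy monthly table (hypothetical values) with the same column names as the workbook.
monthly = pd.DataFrame({
    "year": [2019, 2019, 2021, 2021],
    "month": [6, 6, 6, 6],
    "iso_country": ["DE", "US", "DE", "US"],
    "callsign": [1000, 4000, 600, 3600],
    "people_fully_vaccinated_per_hundred": [0.0, 0.0, 35.0, 46.0],
})

def june(df, year):
    """Return the June rows of the given year."""
    return df[(df["year"] == year) & (df["month"] == 6)]

# One row per country, with the 2021 and 2019 flight counts side by side.
recovery = june(monthly, 2021).merge(
    june(monthly, 2019)[["iso_country", "callsign"]],
    on="iso_country", how="inner", suffixes=("_2021", "_2019"),
)

# Share of the pre-Covid flight level reached in June 2021.
recovery["flight_return_level"] = recovery["callsign_2021"] / recovery["callsign_2019"]
print(recovery[["iso_country", "flight_return_level"]])
```

In this toy data DE reaches 0.6 and US reaches 0.9 of the 2019 level.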

In [4]:

#create tables, filtered for June 2021 and June 2019
june2021=flight_data_monthly[(flight_data_monthly['year']==2021) & (flight_data_monthly['month']==6)]
june2019=flight_data_monthly[(flight_data_monthly['year']==2019) & (flight_data_monthly['month']==6)]

#merge tables, to show June data in single row comparing 2019 and 2021
june_merged=june2021.merge(june2019, on='iso_country', how='inner')

#calculate relative number of flights 2021 compared to 2019 baseline
june_merged['flight_return_level']=june_merged['callsign_x']/june_merged['callsign_y']

#Filter to show only countries with significant number of flights, and exclude outliers
june_merged_filtered=june_merged[(june_merged['callsign_x']>=800) & (june_merged['callsign_y']>=800) & (june_merged['flight_return_level']<1.5)]


#plot data in plotly express bubble chart, with population as bubble size
fig = px.scatter(june_merged_filtered, x="flight_return_level", y="people_fully_vaccinated_per_hundred_x",
                 size="population_x",
                 hover_name="iso_country", log_x=False, size_max=60, labels={'flight_return_level':"<B>% of flights vs. pre-Covid</B>", 'people_fully_vaccinated_per_hundred_x':"<B>People fully vaccinated per 100</B>"})

fig.update_layout({'plot_bgcolor': 'rgba(240, 240, 240, 1)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})

fig.update_xaxes(showline=True, linewidth=1, linecolor='grey')
fig.update_yaxes(showline=True, linewidth=1, linecolor='grey')

fig.show()






<p><a name="graph3"></a></p>
<H2>Graph 3a,b: Comparing flight levels vs. vaccination rates per country over time, and calculating correlation</H2>
<br>
<br>
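Graph 3a shows, for the 4 countries with the highest number of flights in 2021, the number of flights in each of the first 6 months of 2021 as a share of the same month in 2019, plotted against the share of fully vaccinated people in the country. Graph 3b shows, for every country in the dataset, the correlation between flight recovery and vaccination rate over these months.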
<br>

In [5]:

#create tables, filtered for all months of 2021 and 2019
all2021=flight_data_monthly[flight_data_monthly['year']==2021]
all2019=flight_data_monthly[flight_data_monthly['year']==2019]

#merge tables, to show each month in a single row per country comparing 2019 and 2021
year_merged=all2021.merge(all2019, on=['iso_country', 'month'], how='inner')

#calculate relative number of flights 2021 compared to 2019 baseline
year_merged['flight_return_level']=year_merged['callsign_x']/year_merged['callsign_y']
year_merged_filtered=year_merged #no filters applied







In [49]:

#plot using plotly express 4 graphs showing flight recovery vs. vaccination rates over time per country 
#for the 4 countries with the highest number of flights in 2021

def newchart(country):
    #select the rows of the given country once and reuse them for all series
    country_data = year_merged_filtered[year_merged_filtered['iso_country']==country]
    x_axis = country_data['month']
    monthly_flights = country_data['flight_return_level']
    monthly_vacc = country_data['people_fully_vaccinated_per_hundred_x']
    
    #vaccination rate is divided by 100 so that line and bars share the same 0 to 1 scale
    fig = px.line(x=x_axis, y=monthly_vacc/100, color=px.Constant("Completed vaccination percent"),
                 labels=dict(x="<b>Month 2021</b>", y="<B>% of flights vs. pre-Covid</B>"), title=country)
    fig.add_bar(x=x_axis, y=monthly_flights, name="Flights vs pre-COVID in %")
    
    fig.update_yaxes(range=[0, 1.1])
    
    fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)',
    'paper_bgcolor': 'rgba(0, 0, 0, 0)',
    })

    fig.update_xaxes(showline=True, linewidth=1, linecolor='grey')
    fig.update_yaxes(showline=True, linewidth=1, linecolor='grey')
     
    fig.show()

#Filter for top4 countries by number of flights in 2021
countries=year_merged_filtered.groupby('iso_country')['callsign_x'].agg('sum').sort_values(ascending=False).iloc[:4]

#create plot for each country
for country in list(countries.index):
    newchart(country)



In [56]:

#calculate correlations between vaccination rates and flight recovery for each country in dataset
corr_dict={}
for country in year_merged_filtered['iso_country'].unique():
    country_data=year_merged_filtered[year_merged_filtered['iso_country']==country]
    #Series.corr compares only the two columns of interest and skips the non-numeric iso_country column
    corr_dict[country]=country_data['flight_return_level'].corr(country_data['people_fully_vaccinated_per_hundred_x'])

#map correlation data into dataframe
corr_data=pd.DataFrame({'Country':corr_dict.keys(),'Correlation':corr_dict.values()}).sort_values(by='Correlation', ascending=False)


#show correlation data using plotly express for all countries, sorted from highest to lowest correlation

x_axis = corr_data['Country']
corr = corr_data['Correlation']

fig = px.bar(x=x_axis, y=corr,
             labels=dict(x="<b>Country</b>", y="<b>Correlation of flight recovery to vaccination levels</b>"))

fig.update_layout({'plot_bgcolor': 'rgba(0, 0, 0, 0)',
'paper_bgcolor': 'rgba(0, 0, 0, 0)',
})

fig.update_xaxes(showline=True, linewidth=1, linecolor='grey', dtick=1)
fig.update_yaxes(showline=True, linewidth=1, linecolor='grey')

fig.show()
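
To make the per-country correlation concrete, the sketch below computes it on made-up data for two hypothetical countries, `AA` and `BB`. It uses the Pearson correlation, which is the default of `Series.corr`, and the column names are taken from the workbook.

```python
import pandas as pd

# Toy data (hypothetical): in AA flights recover as vaccination rises, in BB they fall.
toy = pd.DataFrame({
    "iso_country": ["AA"] * 4 + ["BB"] * 4,
    "flight_return_level": [0.4, 0.5, 0.6, 0.7, 0.8, 0.7, 0.6, 0.5],
    "people_fully_vaccinated_per_hundred_x": [0.0, 10.0, 20.0, 30.0, 0.0, 10.0, 20.0, 30.0],
})

# One Pearson correlation per country, computed on that country's monthly rows.
corr_by_country = {
    country: group["flight_return_level"].corr(group["people_fully_vaccinated_per_hundred_x"])
    for country, group in toy.groupby("iso_country")
}
print(corr_by_country)
```

Here AA yields a correlation close to +1 and BB close to -1. With only six monthly observations per country, as in the workbook, such values are very sensitive to single months, and a country whose vaccination rate does not change over the period yields `NaN`.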

    

<p><a name="graph4"></a></p>
<H2>Graph 4: Comparing flight levels vs. vaccination rates across countries over time, and calculating correlation</H2>
<br>
<br>
The graph shows, across all countries in the dataset, the number of flights in each of the first 6 months of 2021 as a share of the same month in 2019, plotted against the share of fully vaccinated people in the combined population of these countries.
<br>

In [8]:

#sum the population of all countries included in the dataset (June 2021 rows)
total_considered_population=flight_data_monthly[(flight_data_monthly['year']==2021) & (flight_data_monthly['month']==6)]['population'].sum()

#create tables for 2021 and 2019, summing flights and fully vaccinated people per month across all countries
total2021=flight_data_monthly[flight_data_monthly['year']==2021].groupby('month')[['callsign','people_fully_vaccinated']].sum()
total2019=flight_data_monthly[flight_data_monthly['year']==2019].groupby('month')[['callsign','people_fully_vaccinated']].sum()

#merge tables, to show each month in a single row comparing 2019 and 2021
total_merged=total2021.merge(total2019, on='month', how='inner')

#calculate relative number of flights 2021 compared to 2019 baseline, and total vaccinated levels
total_merged['flight_return_level']=total_merged['callsign_x']/total_merged['callsign_y']
total_merged['vaccination_percent']=total_merged['people_fully_vaccinated_x']/total_considered_population


#plot as bar chart, showing vaccination levels as a line, using plotly express
x_axis = total_merged.index
monthly_flights = total_merged['flight_return_level']
monthly_vacc = total_merged['vaccination_percent']
fig = px.line(x=x_axis, y=monthly_vacc, color=px.Constant("Completed vaccination percent"),
             labels=dict(x="Month 2021", y="0 to 1"))
fig.add_bar(x=x_axis, y=monthly_flights, name="Flights vs pre-COVID in %")
fig.show()
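
The vaccination share above is computed from absolute numbers (vaccinated people divided by total population) rather than by averaging the per-hundred rates of the countries. The sketch below, on made-up numbers for two hypothetical countries, shows why the two approaches differ when populations are unequal.

```python
import pandas as pd

# Toy data (hypothetical): a small, highly vaccinated country and a large one with a low rate.
june = pd.DataFrame({
    "iso_country": ["AA", "BB"],
    "population": [1_000_000, 9_000_000],
    "people_fully_vaccinated_per_hundred": [50.0, 10.0],
})

# Absolute number of fully vaccinated people per country.
june["people_fully_vaccinated"] = (
    june["people_fully_vaccinated_per_hundred"] * june["population"] / 100
)

# Population-weighted share: vaccinated people over total population.
weighted_share = june["people_fully_vaccinated"].sum() / june["population"].sum()

# Unweighted mean of the country rates, for comparison.
simple_mean = june["people_fully_vaccinated_per_hundred"].mean() / 100

print(weighted_share, simple_mean)
```

In this toy case the weighted share is 0.14 while the unweighted mean is 0.30, because the small country counts as much as the large one in the simple mean.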
