# Airline Dataset
This is a personal project to practice data visualisation and dashboarding using real-life scenario data. In the scenario, I am a data scientist tasked with visualizing data in graphs and creating a dashboard that takes a year as input and outputs 5 graphs of delay caused by weather, security, national air system, carrier, and late aircraft. In a real-life commercial situation, this project would complement a bigger data analysis project. The final product, the dashboard, could be used by other members of the data science project team as they mine the data for more inferences. Goals for the overall project might be for a carrier or airline to understand flight delays and how to handle them, or for an insurance company to predict risk caused by flight delays.
The dataset and more details can be found on the IBM data exchange asset here: [Airline Reporting Carrier On-Time Performance Dataset](https://developer.ibm.com/exchanges/data/all/airline/).

## Table of Contents
* Import libraries
* Import dataset
* Visualization
    - Scatter plot
    - Line plot
    - Bar chart
    - Histogram
    - Bubble chart
    - Pie chart
    - Sunburst chart
* Dashboarding
* Conclusion
* Author credits

## Import libraries

In [21]:
import pandas as pd
import seaborn as sns
import numpy as np
import urllib
import plotly.graph_objects as go
import plotly.express as px

%matplotlib inline 
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

## Import Dataset

In [23]:
import requests
import tarfile
from os import path

fname = 'airline_2m.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-airline/1.0.1/' + fname
r = requests.get(url)
open(fname , 'wb').write(r.content)

151681776

In [24]:
tar = tarfile.open(fname)
tar.extractall()
tar.close()

data_path = "airline_2m.csv"
path.exists(data_path)

True
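The download above fetches the full archive (about 150 MB, going by the byte count printed earlier) every time the cell runs. A minimal sketch of a helper that skips the download when a local copy already exists is shown below. The function name `fetch_and_extract` and its parameters are hypothetical, and it uses the standard library's `urllib.request` rather than `requests`.

```python
import tarfile
import urllib.request
from pathlib import Path


def fetch_and_extract(url, archive_path, extract_dir="."):
    """Download `url` to `archive_path` unless it already exists, then unpack it.

    Returns the list of member names found in the archive.
    """
    archive = Path(archive_path)
    if not archive.exists():
        # Only hit the network when there is no local copy.
        urllib.request.urlretrieve(url, str(archive))
    # The context manager closes the archive even if extraction fails.
    with tarfile.open(archive) as tar:
        names = tar.getnames()
        tar.extractall(extract_dir)
    return names
```

With the variables from the cell above, this could be called as `fetch_and_extract(url, fname)`.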

In [25]:
df = pd.read_csv(data_path, encoding = "ISO-8859-1",
                 dtype={'Div1Airport': str, 'Div1TailNum': str, 'Div2Airport': str, 'Div2TailNum': str})
df.head()

Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,Tail_Number,...,Div4WheelsOff,Div4TailNum,Div5Airport,Div5AirportID,Div5AirportSeqID,Div5WheelsOn,Div5TotalGTime,Div5LongestGTime,Div5WheelsOff,Div5TailNum
0,1998,1,1,2,5,1998-01-02,NW,19386,NW,N297US,...,,,,,,,,,,
1,2009,2,5,28,4,2009-05-28,FL,20437,FL,N946AT,...,,,,,,,,,,
2,2013,2,6,29,6,2013-06-29,MQ,20398,MQ,N665MQ,...,,,,,,,,,,
3,2010,3,8,31,2,2010-08-31,DL,19790,DL,N6705Y,...,,,,,,,,,,
4,2006,1,1,15,7,2006-01-15,US,20355,US,N504AU,...,,,,,,,,,,


In [2]:
URL = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DV0101EN-SkillsNetwork/Data%20Files/airline_data.csv'

airline_data =  pd.read_csv(URL)

print('Data downloaded and read into a dataframe!')

Data downloaded and read into a dataframe!


In [3]:
airline_data.head()

Unnamed: 0.1,Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,...,Div4WheelsOff,Div4TailNum,Div5Airport,Div5AirportID,Div5AirportSeqID,Div5WheelsOn,Div5TotalGTime,Div5LongestGTime,Div5WheelsOff,Div5TailNum
0,1295781,1998,2,4,2,4,1998-04-02,AS,19930,AS,...,,,,,,,,,,
1,1125375,2013,2,5,13,1,2013-05-13,EV,20366,EV,...,,,,,,,,,,
2,118824,1993,3,9,25,6,1993-09-25,UA,19977,UA,...,,,,,,,,,,
3,634825,1994,4,11,12,6,1994-11-12,HP,19991,HP,...,,,,,,,,,,
4,1888125,2017,3,8,17,4,2017-08-17,UA,19977,UA,...,,,,,,,,,,


In [26]:
airline_data.shape

(27000, 110)

The dataset is very large for the scope of this project, and so we randomly pick 500 entries.

In [27]:
# Randomly sample 500 data points. Setting the random state to be 42 so that we get same result.
data = airline_data.sample(n=500, random_state=42)

In [28]:
data.shape

(500, 110)
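To illustrate what `random_state` does in the sampling step, here is a small sketch on a made-up frame (the `toy` data is an assumption, not the airline data). Sampling twice with the same seed returns the same rows, and the sampled rows keep their original index labels.

```python
import pandas as pd

# Toy stand-in for the airline data (values are made up for illustration).
toy = pd.DataFrame({"Year": range(2000, 2020), "ArrDelay": range(20)})

first = toy.sample(n=5, random_state=42)
second = toy.sample(n=5, random_state=42)

# Same seed: identical rows in identical order.
print(first.equals(second))

# sample() keeps the original index labels, which is why the sampled
# airline data shows labels such as 18946 instead of 0..499.
print(first.index.tolist())
```

If a clean 0..n-1 index is preferred, `reset_index(drop=True)` can be chained after `sample`.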

In [29]:
data.tail()

Unnamed: 0.1,Unnamed: 0,Year,Quarter,Month,DayofMonth,DayOfWeek,FlightDate,Reporting_Airline,DOT_ID_Reporting_Airline,IATA_CODE_Reporting_Airline,...,Div4WheelsOff,Div4TailNum,Div5Airport,Div5AirportID,Div5AirportSeqID,Div5WheelsOn,Div5TotalGTime,Div5LongestGTime,Div5WheelsOff,Div5TailNum
18946,61420,2005,3,7,6,3,2005-07-06,WN,19393,WN,...,,,,,,,,,,
16291,458237,2019,2,6,1,6,2019-06-01,UA,19977,UA,...,,,,,,,,,,
21818,557936,1999,1,3,4,4,1999-03-04,HP,19991,HP,...,,,,,,,,,,
24116,1268298,2017,2,4,14,5,2017-04-14,DL,19790,DL,...,,,,,,,,,,
16705,1496740,2019,1,1,26,6,2019-01-26,AA,19805,AA,...,,,,,,,,,,


## Visualization

### 1. Scatter Plot
First, let us use a scatter plot to represent departure time with respect to flight distance.

In [30]:
fig = go.Figure()
fig.add_trace(go.Scatter(x=data['Distance'], y=data['DepTime'], mode='markers', marker=dict(color = 'red')))
fig.update_layout(xaxis_title = 'Distance', yaxis_title = 'Departure Time', title = 'Distance vs Departure Time')
fig.show()

We can infer that there are more flights around the clock for short distances compared to long distances.
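The y-axis of this scatter plot is easier to read as an hour of the day. The sketch below assumes `DepTime` follows the hhmm encoding used in on-time performance data (for example, 1430 for 2:30 pm); the helper name `hhmm_to_hours` and the toy values are hypothetical.

```python
import pandas as pd


def hhmm_to_hours(hhmm):
    """Convert hhmm-encoded clock times (e.g. 1430) to fractional hours (14.5)."""
    hhmm = pd.to_numeric(hhmm, errors="coerce")
    hours = hhmm // 100      # leading digits are the hour
    minutes = hhmm % 100     # last two digits are the minutes
    return hours + minutes / 60


# Made-up rows; the missing DepTime stays missing after conversion.
toy = pd.DataFrame({"Distance": [200, 950, 2400], "DepTime": [615.0, 1430.0, None]})
toy["DepHour"] = hhmm_to_hours(toy["DepTime"])
print(toy)
```

The converted column could then be passed as `y` in place of `DepTime`.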

### 2. Line Plot
We shall use a line plot to visualize the average monthly arrival delay time and see how it changes over the year.

In [9]:
line_data = data.groupby('Month')['ArrDelay'].mean().reset_index()
line_data

Unnamed: 0,Month,ArrDelay
0,1,2.232558
1,2,2.6875
2,3,10.868421
3,4,6.229167
4,5,-0.27907
5,6,17.310345
6,7,5.088889
7,8,3.121951
8,9,9.081081
9,10,1.2


In [10]:
fig = go.Figure()
fig.add_trace(go.Scatter(x = line_data['Month'], y=line_data['ArrDelay'], mode = 'lines', marker = dict(color = 'green')))
fig.update_layout(title = 'Month vs Average Flight Delay Time', xaxis_title = 'Month', yaxis_title = 'Average Flight Delay Time')

From the line graph, we can infer that average arrival delay is lower early in the year and towards its end, and highest mid-year (June in this sample).
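With only 500 sampled flights, each monthly average rests on a few dozen rows, so it can help to report the number of flights next to the mean. A minimal sketch using pandas named aggregation follows; the toy data and the column names `mean_delay` and `flights` are assumptions for illustration.

```python
import pandas as pd

# Made-up delays; the missing value is ignored by both mean and count.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 6, 6, 12],
    "ArrDelay": [2.0, -4.0, 8.0, 30.0, None, 5.0],
})

monthly = (
    toy.groupby("Month")["ArrDelay"]
    .agg(mean_delay="mean", flights="count")
    .reset_index()
)
print(monthly)
```

A month whose mean comes from very few flights deserves less weight when reading the line plot.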

### 3. Bar Chart
Let us use a bar chart to show the total number of flights to each destination state.

In [11]:
bar_data = data.groupby('DestState')['Flights'].sum().reset_index()
bar_data.head()

Unnamed: 0,DestState,Flights
0,AK,4.0
1,AL,3.0
2,AZ,8.0
3,CA,68.0
4,CO,20.0


In [12]:
fig = px.bar(x = bar_data['DestState'], y = bar_data['Flights'], title = 'Total number of flights to each destination state')
fig.show()

#### Inferences
The destination states with the highest number of flights are CA and TX.
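The top destinations can also be read off the aggregated frame directly instead of from the chart. This is a small sketch on made-up rows; `bar_toy` and `top` are hypothetical names.

```python
import pandas as pd

# Made-up flights: one row per flight, as in the sampled data.
toy = pd.DataFrame({
    "DestState": ["CA", "TX", "CA", "AK", "TX", "CA"],
    "Flights": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
})

bar_toy = toy.groupby("DestState")["Flights"].sum().reset_index()

# Sort descending and keep the two busiest destination states.
top = bar_toy.sort_values("Flights", ascending=False).head(2)
print(top)
```

Passing the sorted frame to `px.bar` would also order the bars from busiest to quietest.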

### 4. Histogram
Let us represent arrival delay using a histogram.

In [13]:
data['ArrDelay'] = data['ArrDelay'].fillna(0)

In [14]:
fig = px.histogram(x = data['ArrDelay'], title = 'Distribution of arrival delay (minutes)')
fig.show()

#### Inference
The distribution is roughly bell-shaped and centred near zero: most flights arrive within a few minutes of their scheduled time.
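One way to check the shape of the delay distribution numerically, rather than by eye, is to compare the mean with the median and look at the skewness. The sketch below uses made-up delays and drops missing values instead of filling them with zero, since filling adds artificial on-time flights; the 15-minute threshold is an assumption chosen for illustration.

```python
import pandas as pd

# Made-up arrival delays in minutes, including one missing value.
delays = pd.Series([-5.0, -2.0, 0.0, 1.0, 3.0, 4.0, 45.0, 120.0, None])

observed = delays.dropna()
summary = {
    "share_within_15_min": (observed.abs() <= 15).mean(),
    "median": observed.median(),
    "mean": observed.mean(),
    "skewness": observed.skew(),
}
print(summary)
```

A mean well above the median, together with positive skewness, points to a long right tail of heavily delayed flights rather than a symmetric distribution.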

### 5. Bubble Chart
Let us use a bubble chart to represent the number of flights as per the reporting airline

In [15]:
bub_data = data.groupby('Reporting_Airline')['Flights'].sum().reset_index()
bub_data.head()

Unnamed: 0,Reporting_Airline,Flights
0,9E,5.0
1,AA,57.0
2,AS,14.0
3,B6,10.0
4,CO,12.0


In [16]:
fig = px.scatter(bub_data, x = 'Reporting_Airline', y = 'Flights', size = 'Flights', hover_name = 'Reporting_Airline', title = 'Reporting Airlines vs Number of Flights')
fig.show()


#### Inference
The airline WN has the highest number of flights

### 6. Pie Chart
Let us use a pie chart to represent the proportion of flights by Distance Group

In [17]:
fig = px.pie(values = data['Flights'], names = data['DistanceGroup'], title = 'Flight Proportion by Distance Group')
fig.show()
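The slices of the pie chart correspond to each distance group's share of the total flights. A minimal sketch of that calculation on made-up rows (the `toy` frame and `share` name are assumptions):

```python
import pandas as pd

# Made-up flights, one row per flight.
toy = pd.DataFrame({
    "DistanceGroup": [1, 1, 2, 3, 1, 2, 2, 1],
    "Flights": [1.0] * 8,
})

# Sum flights per group, then divide by the grand total to get proportions.
share = toy.groupby("DistanceGroup")["Flights"].sum() / toy["Flights"].sum()
print(share)
```

Printing these proportions next to the chart gives exact figures for slices that are hard to compare visually.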

### 7. Sunburst Charts
Let us represent a hierarchical view of the number of flights, ordered by month and then by destination state.

In [18]:
fig = px.sunburst(data, path=['Month', 'DestStateName'], values='Flights',title='Flight Distribution Hierarchy')
fig.show()
# This is an interactive graph; hover over or click on a part of the chart for more details.
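The hierarchy shown by the sunburst chart corresponds to a two-level aggregation: flights summed per month for the inner ring, and per month and destination state for the outer ring. The sketch below reproduces those totals with pandas on made-up rows; `rings` and `leaves` are hypothetical names.

```python
import pandas as pd

# Made-up flights, one row per flight.
toy = pd.DataFrame({
    "Month": [1, 1, 1, 2, 2],
    "DestStateName": ["Texas", "Texas", "Ohio", "Ohio", "Ohio"],
    "Flights": [1.0] * 5,
})

# Outer ring: one total per (month, state) pair.
leaves = toy.groupby(["Month", "DestStateName"])["Flights"].sum()
# Inner ring: one total per month, equal to the sum of its states.
rings = toy.groupby("Month")["Flights"].sum()

print(leaves)
print(rings)
```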

## Dashboarding
#### Dashboard Components
* Monthly average carrier delay by reporting airline for the given year.
* Monthly average weather delay by reporting airline for the given year.
* Monthly average national air system delay by reporting airline for the given year.
* Monthly average security delay by reporting airline for the given year.
* Monthly average late aircraft delay by reporting airline for the given year.

NOTE: The year should be between 2010 and 2020.

#### Dashboard Layout
The dashboard application consists of three components:

1. Title of the application
2. Component to enter input year
3. Charts conveying the different types of flight delay. Chart section is divided into three segments.
    * Carrier and Weather delay in the first segment
    * National air system and Security delay in the second segment
    * Late aircraft delay in the third segment
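The year-range note above can be enforced before any computation runs, because the input box may be empty or hold an out-of-range value while the user is typing. A minimal sketch of such a check follows; the helper name `parse_year` and its parameters are hypothetical.

```python
def parse_year(value, first=2010, last=2020):
    """Return int(value) if it lies in [first, last], otherwise None."""
    try:
        year = int(value)
    except (TypeError, ValueError):
        # None (empty input box) or non-numeric text.
        return None
    return year if first <= year <= last else None


print(parse_year("2015"))
print(parse_year(None))
print(parse_year(1999))
```

Inside a Dash callback, a `None` result could be handled by raising `dash.exceptions.PreventUpdate` so that the existing charts are left unchanged.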

In [33]:
# Import required libraries
import tarfile

import dash
import pandas as pd
import plotly.express as px
import requests
from dash import dcc, html
from dash.dependencies import Input, Output

# Download and unpack the airline data
fname = 'airline_2m.tar.gz'
url = 'https://dax-cdn.cdn.appdomain.cloud/dax-airline/1.0.1/' + fname
r = requests.get(url)
with open(fname, 'wb') as f:
    f.write(r.content)
with tarfile.open(fname) as tar:
    tar.extractall()

# Read the airline data into a pandas dataframe.
# It is named airline_data because the callback below reads that name.
airline_data = pd.read_csv("airline_2m.csv", encoding = "ISO-8859-1",
                 dtype={'Div1Airport': str, 'Div1TailNum': str, 'Div2Airport': str, 'Div2TailNum': str})


# Create a dash application
app = dash.Dash(__name__)

# Build dash app layout
app.layout = html.Div(children=[ html.H1('Flight Delay Time Statistics', 
                                style={'textAlign': 'center', 'color': '#503D40',
                                'font-size': 30}),
                                html.Div(["Input Year: ", dcc.Input(id='input-year', value='2010', 
                                type='number', style={'height':'35px', 'font-size': 30}),], 
                                style={'font-size': 30}),
                                html.Br(),
                                html.Br(), 
                                # Segment 1
                                html.Div([
                                        html.Div(dcc.Graph(id='carrier-plot')),
                                        html.Div(dcc.Graph(id='weather-plot'))
                                ], style={'display': 'flex'}),
                                # Segment 2
                                html.Div([
                                        html.Div(dcc.Graph(id='nas-plot')),
                                        html.Div(dcc.Graph(id='security-plot'))
                                ], style={'display': 'flex'}),
                                # Segment 3
                                html.Div(dcc.Graph(id='late-plot'), style={'width':'65%'})
                                ])

def compute_info(airline_data, entered_year):
    """Compute monthly average delays per reporting airline for the selected year.

    Arguments:
        airline_data: Input airline data.
        entered_year: Input year for which computation needs to be performed.

    Returns:
        Computed average dataframes for carrier delay, weather delay, NAS delay, security delay, and late aircraft delay.
    """
    # Select data
    df =  airline_data[airline_data['Year']==int(entered_year)]
    # Compute delay averages
    avg_car = df.groupby(['Month','Reporting_Airline'])['CarrierDelay'].mean().reset_index()
    avg_weather = df.groupby(['Month','Reporting_Airline'])['WeatherDelay'].mean().reset_index()
    avg_NAS = df.groupby(['Month','Reporting_Airline'])['NASDelay'].mean().reset_index()
    avg_sec = df.groupby(['Month','Reporting_Airline'])['SecurityDelay'].mean().reset_index()
    avg_late = df.groupby(['Month','Reporting_Airline'])['LateAircraftDelay'].mean().reset_index()
    return avg_car, avg_weather, avg_NAS, avg_sec, avg_late

# Callback decorator
@app.callback( [
               Output(component_id='carrier-plot', component_property='figure'),
               Output(component_id='weather-plot', component_property='figure'),
               Output(component_id='nas-plot', component_property='figure'),
               Output(component_id='security-plot', component_property='figure'),
               Output(component_id='late-plot', component_property='figure')
               ],
               Input(component_id='input-year', component_property='value'))
def get_graph(entered_year):
    """Callback function that returns figures for the provided input year.

    Arguments:
        entered_year: Input year provided by the user.

    Returns:
        List of figures computed using the provided helper function `compute_info`.
    """
    
    # Compute required information for creating graph from the data
    avg_car, avg_weather, avg_NAS, avg_sec, avg_late = compute_info(airline_data, entered_year)
            
    # Line plot for carrier delay
    carrier_fig = px.line(avg_car, x='Month', y='CarrierDelay', color='Reporting_Airline', title='Average carrier delay time (minutes) by airline')
    # Line plot for weather delay
    weather_fig = px.line(avg_weather, x='Month', y='WeatherDelay', color='Reporting_Airline', title='Average weather delay time (minutes) by airline')
    # Line plot for nas delay
    nas_fig = px.line(avg_NAS, x='Month', y='NASDelay', color='Reporting_Airline', title='Average NAS delay time (minutes) by airline')
    # Line plot for security delay
    sec_fig = px.line(avg_sec, x='Month', y='SecurityDelay', color='Reporting_Airline', title='Average security delay time (minutes) by airline')
    # Line plot for late aircraft delay
    late_fig = px.line(avg_late, x='Month', y='LateAircraftDelay', color='Reporting_Airline', title='Average late aircraft delay time (minutes) by airline')
            
    return [carrier_fig, weather_fig, nas_fig, sec_fig, late_fig]

# Run the app
if __name__ == '__main__':
    app.run_server()
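`compute_info` repeats the same group-by for each of the five delay columns. A possible alternative, sketched below under the assumption that the column names match those used above, loops over the columns and returns a dictionary keyed by column name. The names `DELAY_COLUMNS` and `compute_delay_averages` are hypothetical, and the toy frame is made up so the helper can be tried without downloading the dataset.

```python
import pandas as pd

# Delay columns used by the dashboard, in display order.
DELAY_COLUMNS = ["CarrierDelay", "WeatherDelay", "NASDelay",
                 "SecurityDelay", "LateAircraftDelay"]


def compute_delay_averages(airline_data, entered_year):
    """Monthly mean of each delay column per airline for one year, keyed by column."""
    year_df = airline_data[airline_data["Year"] == int(entered_year)]
    grouped = year_df.groupby(["Month", "Reporting_Airline"])
    return {col: grouped[col].mean().reset_index() for col in DELAY_COLUMNS}


# Made-up rows: three flights in 2010 and one in 2011.
toy = pd.DataFrame({
    "Year": [2010, 2010, 2010, 2011],
    "Month": [1, 1, 2, 1],
    "Reporting_Airline": ["AA", "AA", "WN", "AA"],
    "CarrierDelay": [10.0, 20.0, 5.0, 99.0],
    "WeatherDelay": [0.0, 4.0, None, 1.0],
    "NASDelay": [1.0, 3.0, 2.0, 0.0],
    "SecurityDelay": [0.0, 0.0, 0.0, 0.0],
    "LateAircraftDelay": [6.0, 8.0, 7.0, 0.0],
})

averages = compute_delay_averages(toy, "2010")
print(averages["CarrierDelay"])
```

Testing the aggregation on a small frame like this makes it easier to confirm the year filter and the averages before wiring the function into the callback.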

## Conclusion
This project created a few graphs from the airline data, and a dashboard. These are useful for understanding the data and mining it for more insights. The results of this project would be shared only with other stakeholders within the analytics department, for their use in making a more detailed analysis.

# Authored By:
## Mwenda Kinoti