<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/content/02_data/06_project_data_processing/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2019 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dataset Exploration


For this project you will be given a dataset and an associated problem. Over the course of the day, you will explore the dataset and train the best model you can in order to solve the problem. At the end of the day, you will give a short presentation about your model and solution.

### Deliverables

1. A **copy of this Colab notebook** containing your code and responses to the ethical considerations below.
1. At the end of the day, we will ask you and your group to stand in front of the class and give a brief **presentation about what you have done**. 

## Team

Please enter your team members' names in the placeholders in this text area:

*   *Alejandra Berroso*
*   *A'Darius Lee*
*   *Sam Lefforge*



# Exercises

## Exercise 1: Coding

[Kaggle](http://www.kaggle.com) hosts a [dataset containing US airline on-time statistics and delay data](https://www.kaggle.com/giovamata/airlinedelaycauses) from the [US Department of Transportation's Bureau of Transportation Statistics (BTS)](https://www.bts.gov/). In this project, we will **use flight statistics data to gain insights into US airports' and airlines' flights in 2008.**

You are free to use any toolkit we've covered in class to solve the problem (e.g. Pandas, Matplotlib, Seaborn).

Demonstrations of competency:
1. Get the data into a Python object.
1. Inspect the data for each column's data type and summary statistics.
1. Explore the data programmatically and visually.
1. Produce an answer and visualization, where applicable, for at least three questions from the list below, and discuss any relevant insights. Feel free to generate and answer some of your own questions. 

  * Which U.S. airport is the busiest airport? You can decide how you'd like to measure "busyness" (e.g., annually, monthly, daily).
  * Of the 2008 flights that are *actually delayed*, think about:
    * Which 10 U.S. airlines have the most delays?
    * Which 10 U.S. airlines have the longest average delay time?
    * Which 10 U.S. airports have the most delays?
    * Which 10 U.S. airports have the longest average delay time?
  * More analysis:
    * Are there patterns on how flight delays are distributed across different hours of the day?
    * How about across months or seasons? Can you think of any reasons for these seasonal delays?
    * If you look at average delay time or number of delays by airport, does the data show linearity? Does any subset of the data show linearity?
    * Add reason for delay to your delay analysis above.
    * Examine flight frequencies, delays, time of day or year, etc. for a specific airport, airline or origin-arrival airport pair.
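
One way to approach the linearity question in the list above is to build one row per airport and look at the correlation and a least-squares line between the number of delays and the average delay. The sketch below is only an illustration: `toy_df` and its values are made up, and it assumes the real data has `Origin` and `DepDelay` columns like the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for flights_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    'Origin':   ['ATL', 'ATL', 'ATL', 'ORD', 'ORD', 'DEN', 'DEN', 'DEN', 'DEN', 'SFO'],
    'DepDelay': [10, 20, 30, 40, 60, 5, 15, 25, 35, 90],
})

# One row per airport: number of delayed flights and average delay in minutes.
per_airport = toy_df.groupby('Origin')['DepDelay'].agg(num_delays='count', avg_delay='mean')

# Pearson correlation: values near +1 or -1 suggest a linear relationship.
r = per_airport['num_delays'].corr(per_airport['avg_delay'])

# Least-squares line: avg_delay = slope * num_delays + intercept.
slope, intercept = np.polyfit(per_airport['num_delays'], per_airport['avg_delay'], deg=1)

print(per_airport)
print(f'r = {r:.2f}, slope = {slope:.2f}, intercept = {intercept:.2f}')
```

A scatter plot of `num_delays` against `avg_delay` is still worth drawing, since a single correlation value can hide a subset of airports that behaves differently.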

### Student Solution

Get the data into a Python object.

In [None]:
import pandas as pd
import zipfile

# Get the data into a Python object.
!chmod 600 kaggle.json && (ls ~/.kaggle 2>/dev/null || mkdir ~/.kaggle) && mv kaggle.json ~/.kaggle/ && echo 'Done'
!kaggle datasets download giovamata/airlinedelaycauses

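# Extract the archive. The context manager closes the file afterwards, and the
# name `archive` avoids shadowing Python's built-in zip().
with zipfile.ZipFile('airlinedelaycauses.zip') as archive:
  archive.extractall()
flights_df = pd.read_csv('DelayedFlights.csv')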

Inspect the data for each column's data type and summary statistics.

In [None]:
# Inspect the data for each column's data type and summary statistics.
print(flights_df.dtypes)
print(flights_df.describe())
flights_df.tail(20)

Explore the data programmatically and visually.

In [None]:
# Explore the data programmatically and visually.

print(flights_df.tail(20))
# Visually, it is clear that ArrTime, ActualElapsedTime, CRSElapsedTime, 
# AirTime, ArrDelay, TaxiIn, CarrierDelay, WeatherDelay, NASDelay, 
# SecurityDelay, and LateAircraftDelay have missing data

print(flights_df.columns[flights_df.isnull().any()]) # Confirmation

# In the columns recording some type of delay, most values are 0, so I'm going
# to fill in their missing values with 0
flights_df['ArrDelay'] = flights_df['ArrDelay'].fillna(0)
flights_df['CarrierDelay'] = flights_df['CarrierDelay'].fillna(0)
flights_df['WeatherDelay'] = flights_df['WeatherDelay'].fillna(0)
flights_df['NASDelay'] = flights_df['NASDelay'].fillna(0)
flights_df['SecurityDelay'] = flights_df['SecurityDelay'].fillna(0)
flights_df['LateAircraftDelay'] = flights_df['LateAircraftDelay'].fillna(0)

# Now the missing data is pretty sparse, so I'm just going to drop the
# remaining rows that have missing values
flights_df.dropna(inplace=True)
# drop=True keeps the old index from being added as an extra 'index' column
flights_df = flights_df.reset_index(drop=True)
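
The cleaning steps above can also be written more compactly. The following is a minimal sketch on a made-up frame (`toy_df` and its values are hypothetical); it counts missing values per column first, fills several delay columns in one step, and then drops what is left. Treating a missing delay as 0 minutes is an assumption carried over from the cell above, not something the dataset guarantees.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of flights_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    'ArrDelay':     [12.0, np.nan, 30.0, 5.0],
    'WeatherDelay': [np.nan, 0.0, 15.0, np.nan],
    'AirTime':      [100.0, 95.0, np.nan, 80.0],
})

# Count missing values per column before deciding how to handle them.
missing_before = toy_df.isnull().sum()

# Fill all delay columns with 0 in a single step (assumption: missing = no delay).
delay_cols = ['ArrDelay', 'WeatherDelay']
toy_df[delay_cols] = toy_df[delay_cols].fillna(0)

# Drop rows that still have a missing value and renumber the rows from 0.
cleaned = toy_df.dropna().reset_index(drop=True)

print(missing_before)
print(cleaned)
```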

Produce an answer and visualization, where applicable, for at least three questions from the list below, and discuss any relevant insights. Feel free to generate and answer some of your own questions.

  * Which U.S. airport is the busiest airport? You can decide how you'd like to measure "busyness" (e.g., annually, monthly, daily).

In [None]:
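import pandas as pd
import matplotlib.pyplot as plt

# Measure busyness as the number of flights departing from each origin airport
# over the year. UniqueCarrier identifies airlines, so group by Origin instead.
# Note: this file lists delayed flights, so these are delayed departures.
departures = flights_df.groupby('Origin').size().sort_values(ascending=False)

departures.head(10).plot(kind='bar')
plt.ylabel('Number of flights')
plt.show()
print(f'{departures.index[0]} is the busiest airport.')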

  * Of the 2008 flights that are *actually delayed*, think about:
    * Which 10 U.S. airlines have the most delays?
    * Which 10 U.S. airlines have the longest average delay time?
    * Which 10 U.S. airports have the most delays?
    * Which 10 U.S. airports have the longest average delay time?
    

In [None]:
# Which 10 U.S. airlines have the most delays?
import pandas as pd

# Keep only flights that were delayed on both arrival and departure
# (removing any numbers that are 0 or negative)
delay_df = flights_df[(flights_df['ArrDelay'] > 0) & (flights_df['DepDelay'] > 0)]

# Count the delayed flights per airline and keep the 10 largest counts
top_airlines = delay_df['UniqueCarrier'].value_counts().head(10)

print("Here are the top 10 airlines by number of delays: ")
top_airlines

In [None]:
# Which 10 U.S. airlines have the longest average delay time?
import pandas as pd
import matplotlib.pyplot as plt

# Keep only flights that were delayed on both arrival and departure
# (removing any numbers that are 0 or negative). The copy avoids writing
# into a view of flights_df.
delay_df = flights_df[(flights_df['ArrDelay'] > 0) & (flights_df['DepDelay'] > 0)].copy()
delay_df['SumofDelays'] = delay_df['ArrDelay'] + delay_df['DepDelay']
delay = delay_df.groupby('UniqueCarrier')[['SumofDelays']].mean().sort_values(by='SumofDelays', ascending=False).head(10)
delay.plot(kind='bar')

print("The 10 U.S. airlines that have the longest average delay time are shown below:")
plt.show()

In [None]:
# Which 10 U.S. airports have the most delays?
import pandas as pd

# Keep only flights that departed late (removing any numbers that are 0 or negative)
delay_df = flights_df[flights_df['DepDelay'] > 0]

# Count the delayed departures per origin airport and keep the 10 largest counts
top_airports = delay_df['Origin'].value_counts().head(10)

print("Here are the top 10 airports by number of delays: ")
top_airports


In [None]:
# Which 10 U.S. airports have the longest average delay time?
import pandas as pd
import matplotlib.pyplot as plt

# Keep only flights that departed late (removing any numbers that are 0 or negative)
delay_df = flights_df[flights_df['DepDelay'] > 0]
delay = delay_df.groupby('Origin')[['DepDelay']].mean().sort_values(by='DepDelay', ascending=False).head(10)
delay.plot(kind='bar')

print("The 10 U.S. airports that have the longest average delay time are shown below:")
plt.show()

  * Are there patterns on how flight delays are distributed across different months of the year?
  
*Yes, as you can see, the delays are pretty consistent throughout the year, but right around Christmas time, the frequency of long delays increases substantially.*

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Question 3
months = flights_df['Month'].values
days = flights_df['DayofMonth'].values

# this is a dictionary that stores how many days occur before the first of
# any given month in a leap year (2008)
daysBeforeMonth = {
    1 : 0,
    2 : 31,
    3 : 60,
    4 : 91,
    5 : 121,
    6 : 152,
    7 : 182,
    8 : 213,
    9 : 244,
    10 : 274,
    11 : 305,
    12 : 335,
    13 : 366
}

# Day of the year (1-366) for each flight, i.e. its position within 2008
dayOfYear = np.array([daysBeforeMonth[m] for m in months]) + days
delays = flights_df['DepDelay'].values

plt.scatter(dayOfYear, delays, s=1) # Plot the data

for i in range(1, 14): # draw lines to represent the months of the year
  plt.plot([daysBeforeMonth[i],daysBeforeMonth[i]],[delays.min(),delays.max()],
           color='black', linestyle='dashed')
  
plt.title("How Long Flights are Delayed on Different Days of the Year")
plt.xlabel("Day of the Year")
plt.ylabel("How Long the Flight was Delayed")
plt.show()
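
A scatter plot with this many points can hide how typical delays change from month to month, so a per-month summary is a useful companion to the figure. The sketch below uses a made-up `toy_df` and an assumed threshold of 120 minutes for a "long" delay; both are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Hypothetical sample; real values would come from flights_df.
toy_df = pd.DataFrame({
    'Month':    [1, 1, 6, 6, 12, 12, 12],
    'DepDelay': [10, 20, 15, 25, 30, 200, 250],
})

LONG_DELAY_MINUTES = 120  # assumed cutoff for what counts as a "long" delay

# Median delay per month, and the share of flights whose delay exceeds the cutoff.
is_long = toy_df['DepDelay'] > LONG_DELAY_MINUTES
monthly = pd.DataFrame({
    'median_delay': toy_df.groupby('Month')['DepDelay'].median(),
    'long_delay_share': is_long.groupby(toy_df['Month']).mean(),
})

print(monthly)
```

If the share of long delays really does rise around the holidays, it should show up as a larger `long_delay_share` for December than for the other months.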

## Exercise 2: Ethical Implications

Even the most basic of data manipulations has the potential to affect segments of the population in different ways. It is important to consider how your code might positively and negatively affect different types of users.

In this section of the project, you will reflect on the ethical implications of your analysis.

### Student Solution

**Positive Impact**

Your analysis is trying to solve a problem. Think about who will benefit if the problem is solved, and write a brief narrative about how the model will help.

*The main entities that would benefit from this kind of analysis are airlines and airports. This kind of data is great for discovering how and why delays occur, and in turn how to prevent them in the future. For example, long delays are more frequent during the holidays, so it might be wise to increase the number of flights during that time period if possible.*

**Negative Impact**

Solutions usually don't have a universal benefit. Think about who might be negatively impacted by your analysis. This person or persons might not be directly considered in the analysis, but they might be impacted indirectly.

*A group that might be harmed by this data analysis is flight staff, like pilots and flight attendants. In my example above, I talked about how this data might lead to the conclusion that there should be more flights available during the Christmas/New Year's season. A natural consequence of this would be making more flight staff work over the holidays, which isn't fair to them.*

**Bias**

Data analysis can be biased for many reasons. The bias can come from the data itself (e.g. sampling, data collection methods, available sources), and from the interpretation of the analysis outcome.

Think of at least two ways that bias might have been introduced to your analysis and explain them below.

*One source of bias we ran into came from our month-by-month analysis. The months in a year have differing numbers of days, so longer months should have more delays. This can cause longer months to look like they have a disproportionately bad delay frequency, when in reality they just have more days on which delays can occur.*

*Another source of bias in our analysis comes from an overly narrow view of the data. When looking at the top 10 airports and airlines in terms of delays, the airports and airlines not in the top 10 essentially just get thrown away. We may be missing some general trend in the rest of the data with our current method of analysis.*
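
The first source of bias above (months of different lengths) can be reduced by comparing delays per day instead of delays per month. This is a minimal sketch with invented counts in `toy_df`; it uses `calendar.monthrange` from the standard library to look up the number of days in each month of 2008.

```python
import calendar
import pandas as pd

# Hypothetical data: one row per delayed flight, with only the Month column.
toy_df = pd.DataFrame({'Month': [1] * 62 + [2] * 58 + [4] * 60})

# Raw number of delays per month.
delays_per_month = toy_df['Month'].value_counts().sort_index()

# Number of days in each month of 2008 (a leap year, so February has 29).
days_in_month = pd.Series(
    {month: calendar.monthrange(2008, month)[1] for month in delays_per_month.index}
)

# Delays per day puts months of different lengths on the same footing.
delays_per_day = delays_per_month / days_in_month

print(delays_per_month)
print(delays_per_day)
```

In this made-up example the raw counts differ (62, 58, and 60), but every month has the same rate of 2 delays per day.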

**Changing the Dataset to Mitigate Bias**

The most common way that an analysis is biased is when the dataset itself is biased. Look back at the input data that you used for your analysis. Think about how you might change something about the data to reduce bias in your model.

What changes could you make to make your dataset less biased? Consider the data that you have, how and where that data was collected, and what other sources of data might be used to reduce bias.

Write a summary of the changes that could be made to your input data.

*A bias in the analysis could be group attribution bias. The data only includes big airlines. This can cause small airlines to appear as if they never have delays. For example, small airports in this dataset might appear to have next to no delays, but that could be because the planes that fly there aren't included in this dataset. A simple change to fix this would be to include a wide variety of airline sizes in the data collection process.*

**Changing the Analysis Questions to Mitigate Bias**

Are there any ways to reduce bias by changing the analysis itself? This could include modifying the choice of questions you ask, the approach you take to answer the questions, etc.

Write a brief summary of any changes that you could make to help reduce bias in your analysis.

*Since the analysis has an overly narrow view of the dataset, we could modify the questions to ask about general trends or patterns in the data instead of just the top ten airports.*
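
As a rough illustration of looking at general trends rather than a top-ten list, the sketch below summarizes the average delay across every airport in a made-up `toy_df`. The column names match the ones used earlier in this notebook, but the values are invented.

```python
import pandas as pd

# Hypothetical stand-in for flights_df; the values are invented for illustration.
toy_df = pd.DataFrame({
    'Origin':   ['ATL', 'ATL', 'ORD', 'ORD', 'DEN', 'DEN', 'SFO', 'SFO', 'BOS', 'BOS'],
    'DepDelay': [10, 30, 40, 60, 5, 15, 80, 100, 20, 20],
})

# Average delay for every airport, not just the worst ten.
avg_by_airport = toy_df.groupby('Origin')['DepDelay'].mean()

# Summary statistics and quartiles describe the whole distribution.
summary = avg_by_airport.describe()
quartiles = avg_by_airport.quantile([0.25, 0.5, 0.75])

print(summary)
print(quartiles)
```

A histogram of `avg_by_airport` would show the same information visually and make it easier to see whether the top ten are outliers or part of a gradual trend.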

**Mitigating Bias Downstream**

While analysis can point to suggestions, it is people who make decisions based on them. What processes and/or rules should be in place for people and systems interpreting and acting on the results of your analysis to reduce the bias? Describe these below.

*One important detail about this data is that it is from 2008, so it might not be indicative of current trends. Another important detail is that it only includes delayed flights, which anyone interpreting the results needs to keep in mind. Additionally, it only includes data from large air carriers, so that needs to be accounted for.*