<a href="https://colab.research.google.com/github/google/applied-machine-learning-intensive/blob/master/content/02_data/06_project_data_processing/colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### Copyright 2019 Google LLC.

In [None]:
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Dataset Exploration


For this project you will be given a dataset and an associated problem. Over the course of the day, you will explore the dataset and train the best model you can in order to solve the problem. At the end of the day, you will give a short presentation about your model and solution.

### Deliverables

1. A **copy of this Colab notebook** containing your code and responses to the ethical considerations below.
1. At the end of the day, we will ask you and your group to stand in front of the class and give a brief **presentation about what you have done**. 

## Team

Please enter your team members' names in the placeholders in this text area:

*   *Jermaine Lennon*
*   *Nasir Barnes*
*   *Samuel Adeleye*
*   *Antoine Teague*





# Exercises

## Exercise 1: Coding

[Kaggle](http://www.kaggle.com) hosts a [dataset containing US airline on-time statistics and delay data](https://www.kaggle.com/giovamata/airlinedelaycauses) from the [US Department of Transportation's Bureau of Transportation Statistics (BTS)](https://www.bts.gov/). In this project, we will **use flight statistics data to gain insights into US airports' and airlines' flights in 2008.**

You are free to use any toolkit we've covered in class to solve the problem (e.g. Pandas, Matplotlib, Seaborn).

Demonstrations of competency:
1. Get the data into a Python object.
1. Inspect the data for each column's data type and summary statistics.
1. Explore the data programmatically and visually.
1. Produce an answer and visualization, where applicable, for at least three questions from the list below, and discuss any relevant insights. Feel free to generate and answer some of your own questions. 

  * Which U.S. airport is the busiest airport? You can decide how you'd like to measure "busyness" (e.g., annually, monthly, daily).
  * Of the 2008 flights that are *actually delayed*, think about:
    * Which 10 U.S. airlines have the most delays?
    * Which 10 U.S. airlines have the longest average delay time?
    * Which 10 U.S. airports have the most delays?
    * Which 10 U.S. airports have the longest average delay time?
  * More analysis:
    * Are there patterns on how flight delays are distributed across different hours of the day?
    * How about across months or seasons? Can you think of any reasons for these seasonal delays?
    * If you look at average delay time or number of delays by airport, does the data show linearity? Does any subset of the data show linearity?
    * Add reason for delay to your delay analysis above.
    * Examine flight frequencies, delays, time of day or year, etc. for a specific airport, airline or origin-arrival airport pair.

### Student Solution

In [None]:
# Use as many text and code blocks as you need to create your solution.
# Make sure to take notes and add lots of code comments, so your instructor
# understands what you are doing!

print("Good luck!")

#### Cleaning Data

In [None]:
#Importing Data Files
import pandas as pd

df = pd.read_csv('DelayedFlights.csv')

df

In [None]:
#Describing Data
df.describe(include = 'all')

In [None]:
df.columns

In [None]:
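# Rename the columns to readable labels.
# NOTE: assigning to df.columns is positional. 'Airtime Arr Delay' looks like
# two names fused by a missing comma; if so, every label after it is shifted
# by one column (the later cells treat 'Distance' as holding airport codes).
df.columns = ['Unnamed' , 'Year', 'Month', 'Date of Month', 'Day of Week', 
              'Dep Time', 'CRS Dep Time', 'Arr Time', 'CRS Arr Time', 
              'Unique Carrier', 'Flight Num' , 'Trail Num', 
              'Actual Elapsed Time', 'CRS Elapsed Time', 'Airtime Arr Delay', 
              'Dep Delay', 'Origin', 'Dest', 'Distance',
              'Taxi in', 'Taxi out', 'Cancelled', 'Cancelled Code', 
              'Diverted', 'Carrier Delay', 'Weather Delay', 'NAS Delay', 
              'Security Delay', 'Late Aircraft Delay', '--'] 
df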
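
Assigning a full list to `df.columns` is positional, so a single missing name or comma silently shifts every later label. A minimal sketch of a name-based alternative, using `DataFrame.rename` with a mapping on a small made-up frame. The source names `DayofMonth`, `AirTime`, and `ArrDelay` are assumptions about the raw CSV header, not taken from this notebook.

```python
import pandas as pd

# Toy frame standing in for the raw CSV; the header names are assumed.
raw = pd.DataFrame({
    "DayofMonth": [3, 4],
    "AirTime": [116.0, 113.0],
    "ArrDelay": [-14.0, 2.0],
})

# Map each old name to its new label explicitly, so column order does not
# matter and an omitted column keeps its old name instead of shifting others.
new_names = {
    "DayofMonth": "Date of Month",
    "AirTime": "Air Time",
    "ArrDelay": "Arr Delay",
}

# Guard against typos in the mapping keys before renaming.
unknown = set(new_names) - set(raw.columns)
assert not unknown, f"Unknown columns in mapping: {unknown}"

renamed = raw.rename(columns=new_names)
print(renamed.columns.tolist())
```

Because the mapping is keyed by name, the number of entries no longer has to match the number of columns in the file.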

In [None]:
#Inspecting Year column
print(df['Year'].isna().any())
print(df['Year'].unique().shape)
for location in sorted(df['Year'].unique()):
  print(location)

#This tells us that we're looking at the year of 2008
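
The per-column checks in this section repeat the same pattern (any missing values, how many unique values). As a sketch only, a small helper can report the same facts for every column at once, along with each column's data type. The function name `summarize_columns` and the toy values are hypothetical.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_missing": frame.isna().sum(),
        "n_unique": frame.nunique(dropna=True),
    })


# Made-up rows standing in for the flights table.
toy = pd.DataFrame({
    "Year": [2008, 2008, 2008],
    "Month": [1, 1, 2],
    "Dep Time": [2003.0, None, 628.0],
})

summary = summarize_columns(toy)
print(summary)
```

Calling `summarize_columns(df)` on the full table would give a single overview to scan before deciding which columns need a closer look.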

In [None]:
#Inspecting Month column
print(df['Month'].isna().any())
print(df['Month'].unique().shape)
for location in sorted(df['Month'].unique()):
  print(location)


In [None]:
print(df['Date of Month'].isna().any())
print(df['Date of Month'].unique().shape)
for location in sorted(df['Date of Month'].unique()):
  print(location)
#The Days range from 1-31


In [None]:
print(df['Day of Week'].isna().any())
print(df['Day of Week'].unique().shape)
for location in sorted(df['Day of Week'].unique()):
  print(location)

In [None]:
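print(df['Dep Time'].isna().any())
# Show the rows that are missing a departure time, then drop them.
# dropna is safe to re-run, unlike dropping a hard-coded row index.
print(df[df['Dep Time'].isna()])
df = df.dropna(subset=['Dep Time'])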


In [None]:
print(df['Dep Time'].isna().any())


In [None]:
print(df['CRS Dep Time'].isna().any())
print(df['CRS Dep Time'].unique().shape)
for location in sorted(df['CRS Dep Time'].unique()):
  print(location)

In [None]:
#We remove Arr Time because there was not enough data
print(df['Arr Time'].isna().any())
print(df[df['Arr Time'].isna()].count())
df[df['Arr Time'].isna()]
df.drop("Arr Time", axis=1, inplace=True)

In [None]:
print(df['CRS Arr Time'].isna().any())
print(df['CRS Arr Time'].unique().shape)
for location in sorted(df['CRS Arr Time'].unique()):
  print(location)

In [None]:
print(df['Unique Carrier'].isna().any())
print(df['Unique Carrier'].unique().shape)


In [None]:
print(df['Flight Num'].isna().any())
print(df['Flight Num'].unique().shape)
#Too many unique values to look through, but this column tells us the
# flight numbers


In [None]:
print(df['Trail Num'].isna().any())
print(df['Trail Num'].unique().shape)

In [None]:
print(df['Actual Elapsed Time'].isna().any())
print(df[df['Actual Elapsed Time'].isna()].count())
df['Actual Elapsed Time'].isna()
df.drop("Actual Elapsed Time", axis=1, inplace=True)

In [None]:
print(df['CRS Elapsed Time'].isna().any())
print(df['CRS Elapsed Time'].unique().shape)
for location in sorted(df['CRS Elapsed Time'].unique()):
  print(location)

#FOR FUTURE REFERENCE: there's a NaN in the list

In [None]:
df.columns

In [None]:
print(df['Airtime Arr Delay'].isna().any())
print(df[df['Airtime Arr Delay'].isna()].count())
df['Airtime Arr Delay'].isna()
df.drop("Airtime Arr Delay", axis=1, inplace=True)


In [None]:
#Removing Dep Delay Column
print(df['Dep Delay'].isna().any())
df[df['Dep Delay'].isna()].count()
print(df['Dep Delay'].isna())
df.drop("Dep Delay", axis=1, inplace=True)

In [None]:
df.columns

In [None]:
print(df['Origin'].isna().any())
print(df['Origin'].unique().shape)
for location in sorted(df['Origin'].unique()):
  print(location)

In [None]:
print(df['Dest'].isna().any())
print(df[df['Dest'].isna()].count())
df.drop("Dest", axis=1, inplace=True)

In [None]:
df.columns

In [None]:
print(df['Distance'].isna().any())
print(df['Distance'].unique().shape)


In [None]:
print(df['Taxi in'].isna().any())
print(df['Taxi in'].unique().shape)


In [None]:
print(df['Taxi out'].isna().any())
print(df[df['Taxi out'].isna()].count())
df.drop("Taxi out", axis=1, inplace=True)

In [None]:
df.columns

In [None]:
print(df['Cancelled'].isna().any())
print(df[df['Cancelled'].isna()].count())
df.drop('Cancelled', axis=1, inplace=True)

In [None]:
df.columns

In [None]:
print(df['Cancelled Code'].isna().any())
print(df['Cancelled Code'].unique().shape)
for location in sorted (df['Cancelled Code'].unique()):
  print(location)

In [None]:
print(df['Diverted'].isna().any())
print(df['Diverted'].unique().shape)


In [None]:
print(df['Carrier Delay'].isna().any())
print(df['Carrier Delay'].unique().shape)
for location in sorted (df['Carrier Delay'].unique()):
  print(location)
df.dropna(subset=['Carrier Delay'], inplace=True)
df.drop('Carrier Delay', axis=1, inplace=True)

In [None]:
print(df['Weather Delay'].isna().any())
print(df[df['Weather Delay'].isna()].count())
df.dropna(subset=['Weather Delay'], inplace=True)
df

In [None]:
df.columns

In [None]:
print(df['NAS Delay'].isna().any())
print(df['NAS Delay'].unique().shape)
for location in sorted (df['NAS Delay'].unique()):
  print(location)

In [None]:
print(df['Security Delay'].isna().any())
print(df['Security Delay'].unique().shape)
for location in sorted (df['Security Delay'].unique()):
  print(location)

In [None]:
print(df['Late Aircraft Delay'].isna().any())
print(df['Late Aircraft Delay'].unique().shape)
for location in sorted (df['Late Aircraft Delay'].unique()):
  print(location)

In [None]:
print(df['--'].isna().any())
print(df['--'].unique().shape)
for location in sorted (df['--'].unique()):
  print(location)

In [None]:
df = df.drop(columns=['Unnamed', '--'])
df

In [None]:
#In this code we combine the delay columns into a single 'Delays' column
delay_columns = ['Weather Delay', 'NAS Delay', 'Late Aircraft Delay',
                 'Security Delay']
df['Delays'] = (df['Weather Delay'] + df['NAS Delay']
                + df['Late Aircraft Delay'] + df['Security Delay'])
#The individual cause columns are no longer needed once they are summed
df.drop(columns=delay_columns, inplace=True)
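
Adding columns with `+` yields `NaN` for any row where one component is missing, and a later `dropna` then removes that row entirely. The sketch below contrasts this with `DataFrame.sum(axis=1)`, which skips missing values by default. The values are made up, and whether a missing cause should count as zero minutes is an assumption this notebook does not settle.

```python
import numpy as np
import pandas as pd

# Made-up delay minutes with a few missing entries.
toy = pd.DataFrame({
    "Weather Delay": [0.0, 10.0, np.nan],
    "NAS Delay": [5.0, 0.0, 7.0],
    "Security Delay": [0.0, 0.0, 0.0],
    "Late Aircraft Delay": [20.0, np.nan, 3.0],
})
delay_cols = list(toy.columns)

# Strict total: NaN whenever any component is missing (same as chained +).
strict_total = toy[delay_cols].sum(axis=1, skipna=False)

# Lenient total: missing components are skipped, i.e. treated as 0 minutes.
lenient_total = toy[delay_cols].sum(axis=1)

print(pd.DataFrame({"strict": strict_total, "lenient": lenient_total}))
```

The strict version keeps only flights with every cause recorded; the lenient version keeps more rows but may understate the delay for flights with gaps.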


In [None]:
df.dropna(inplace=True)
df

In [None]:
#Examining Delays Column
df['Delays'].isna().any()
print(df['Delays'].unique().shape)
for location in sorted (df['Delays'].unique()):
  print(location)

####  Question 1

**Which 10 U.S. airports have the most delays?**


In [None]:
#Sum the delay minutes for each value of 'Distance', which the results below
# treat as the airport code, then sort ascending so the largest totals are last
aggregation_functions = {'Delays': 'sum'}
df_new = df.groupby('Distance').aggregate(aggregation_functions)
df_new = df_new.reset_index()
df_new = df_new.sort_values(by='Delays')
df_new

In [None]:
#Inspect New Data
for Delays in sorted(df_new['Delays'].unique()):
  print(Delays)

In [None]:
import matplotlib.pyplot as plt

#df_new is sorted ascending, so the last 10 rows hold the largest totals
Delays = df_new['Delays'].tail(10)
Distance = df_new['Distance'].tail(10)

plt.pie(Delays, labels=Distance)
plt.show()

#We used the pie chart to show the ten airports with the largest delay totals.

> *According to the data in the pie chart, the airports ORD, ATL, SFO, EWR, DFW, LAX, LGA, DEN, LAS, and IAH had the largest delay totals in 2008.*
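
"Most delays" can mean either the number of delayed flights or the total minutes of delay; the aggregation above sums minutes. As a sketch on made-up rows, both measures can be computed per airport and ranked separately. The column names `airport` and `delay_minutes` are illustrative, not the notebook's.

```python
import pandas as pd

# Made-up delayed flights: one row per flight.
toy = pd.DataFrame({
    "airport": ["ORD", "ORD", "ORD", "ATL", "ATL", "SFO"],
    "delay_minutes": [10, 15, 5, 60, 45, 20],
})

# Count, total, and mean delay per airport via named aggregation.
per_airport = toy.groupby("airport")["delay_minutes"].agg(
    n_delays="count",
    total_minutes="sum",
    mean_minutes="mean",
)

# The two rankings can disagree: many short delays vs. few long ones.
top_by_count = per_airport["n_delays"].nlargest(2)
top_by_minutes = per_airport["total_minutes"].nlargest(2)

print(per_airport)
```

In this toy data ORD leads by count while ATL leads by total minutes, which is why the measure should be stated alongside the ranking.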

#### Question 2

**Which U.S. airport is the busiest airport?** 

In [None]:
df['Distance'].value_counts()


In [None]:
df_distance = df.groupby(['Distance']).size().sort_values(ascending=False)
df_distance.head(1)

> *According to the dataset, ORD is the busiest U.S. airport.*
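
The cells above count rows for a single airport column. One possible alternative measure of busyness, shown here as a sketch on made-up routes, adds departures and arrivals per airport. The column names `origin` and `dest` are illustrative, and treating total movements as "busyness" is an assumption.

```python
import pandas as pd

# Made-up flights: one row per flight with its two endpoints.
toy = pd.DataFrame({
    "origin": ["ORD", "ORD", "ATL", "SFO", "ATL"],
    "dest": ["ATL", "SFO", "ORD", "ORD", "SFO"],
})

departures = toy["origin"].value_counts()
arrivals = toy["dest"].value_counts()

# Align on airport code; an airport missing from one side counts as 0 there.
traffic = (
    departures.add(arrivals, fill_value=0)
    .astype(int)
    .sort_values(ascending=False)
)
busiest = traffic.idxmax()

print(traffic)
print("Busiest airport:", busiest)
```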

#### Question 3

**If you look at average delay time or number of delays by airport, does the data show linearity?**

In [None]:
df_new

In [None]:
import matplotlib.pyplot as plt
import numpy as np


x = df_new['Distance'].head(10)

y = df_new['Delays'].head(10)


plt.scatter(x, y, color='blue');
plt.show()

In [None]:
x = df_new['Distance'].tail(10)

y = df_new['Delays'].tail(10)


plt.scatter(x, y, color='blue');
plt.show()

> *According to the dataset, the number of delays by airport does not show a linear relationship.*
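
Linearity is a relationship between two numeric quantities, so a scatter plot with airport codes on one axis cannot show it. If the x-axis here holds airport codes, as the answers above suggest, one option is to compare two per-airport numbers instead, such as flight count against total delay. A sketch with made-up values, using the Pearson correlation from `np.corrcoef` and a straight-line fit from `np.polyfit`:

```python
import numpy as np

# Made-up per-airport figures: number of flights and total delay minutes.
flights = np.array([120, 340, 560, 800, 1500], dtype=float)
total_delay = np.array([2600, 7100, 11900, 16500, 30800], dtype=float)

# Pearson correlation: values near +1 or -1 indicate a near-linear relation.
r = np.corrcoef(flights, total_delay)[0, 1]

# Least-squares straight line: total_delay ~ slope * flights + intercept.
slope, intercept = np.polyfit(flights, total_delay, deg=1)

print(f"r = {r:.3f}, slope = {slope:.2f} minutes per flight")
```

A correlation close to 1 in the real data would suggest that total delay mostly tracks traffic volume, in which case average delay per flight may be the more informative comparison.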

## Exercise 2: Ethical Implications

Even the most basic of data manipulations has the potential to affect segments of the population in different ways. It is important to consider how your code might positively and negatively affect different types of users.

In this section of the project, you will reflect on the ethical implications of your analysis.

### Student Solution

**Positive Impact**

Your analysis is trying to solve a problem. Think about who will benefit if the problem is solved, and write a brief narrative about how the model will help.

> *\[Airlines and airports\] will benefit because the data serves as a report card that tells companies how well or poorly they have been doing in each area. For example, ORD has the largest amount of delay, so to improve, ORD can see where its delays are coming from and try to fix them. Customers who are interested in flying also benefit because they can see how likely their flight is to be delayed.*

**Negative Impact**

Solutions usually don't have a universal benefit. Think about who might be negatively impacted by your analysis. This person or persons might not be directly considered in the analysis, but they might be impacted indirectly.

> *\[Airlines with a high number of delays\] will be negatively impacted because in the short run they may take a hit in business. While airlines and airports may have the means to work on their delays, customers may still choose the airports and airlines that have a lower chance of being delayed.*

**Bias**

Data analysis can be biased for many reasons. The bias can come from the data itself (e.g. sampling, data collection methods, available sources), and from the interpretation of the analysis outcome.

Think of at least two ways that bias might have been introduced to your analysis and explain them below.

> *One source of bias in the analysis could be the year in which the data was collected. Since the data was collected over ten years ago, much of the analysis may no longer hold true. One factor that could lead to more delays today but was not relevant in 2008 is COVID-19. Since the start of the pandemic there have been a few instances of major delays due to a lack of staff, and recently there has been an influx of delays. Another source of bias could be the time of day at which delays are more likely to happen. According to "If You Want to Avoid Delays, This Is the Best Time of the Day to Fly", an article written by Brooke Nelson, delays are more likely to happen around 6-7 p.m. This could introduce bias because those delays usually mean the plane departs a few minutes after its scheduled time, and they shouldn't be compared with delays such as weather delays. Weather delays also depend on the weather, which airports and airlines cannot control. That is another source of delay that could heavily shift the dataset.*

>> *Source: https://www.rd.com/article/avoid-delays-best-time-day-to-fly/*

**Changing the Dataset to Mitigate Bias**

The most common way that an analysis is biased is when the dataset itself is biased. Look back at the input data that you used for your analysis. Think about how you might change something about the data to reduce bias in your model.

What changes could you make to make your dataset less biased? Consider the data that you have, how and where that data was collected, and what other sources of data might be used to reduce bias.

Write a summary of the changes that could be made to your input data.

> *Since the data has potential bias from the year in which it was collected, we can adjust the dataset by collecting data from airports and airlines over the last 5-7 years. Doing this reduces the bias of the data being less relevant. If the airport that had the fewest delays saw an increase in delays within the past 5-7 years, then the airport and customers would have misleading information. When it comes to the afternoon delays, it would be up to the data scientist collecting the data to decide whether to treat those delays as being as important as other delays in the dataset. Lastly, I wouldn't collect weather delays at all, because they are nothing the airports or airlines can control.*

**Changing the Analysis Questions to Mitigate Bias**

Are there any ways to reduce bias by changing the analysis itself? This could include modifying the choice of questions you ask, the approach you take to answer the questions, etc.

Write a brief summary of any changes that you could make to help reduce bias in your analysis.

> *Since the analysis has potential bias from the year in which the data was collected, we can use data collected at a later time. This allows companies to view a more relevant analysis of that year's flights and delays. We can also pay more attention to weather, which is out of the airports' control: although it may cause a delay, it is not something airlines, airports, or even customers can affect or predict.*

**Mitigating Bias Downstream**

While analysis can point to suggestions, it is people who make decisions based on them. What processes and/or rules should be in place for people and systems interpreting and acting on the results of your analysis to reduce the bias? Describe these below.

> *Since the analysis has potential biases, including the year in which the data was collected, the time of day at which delays take place, and the delays caused by weather, we can implement processes to reduce them. For example, we would suggest collecting the data within a reasonable time period, perhaps every 1-3 years. This way we have a relevant dataset to look at when asking for changes. We suggest not counting delays that took place in the middle of the day and shifted the departure time by less than 20 minutes, as those delays are not as important as others. Lastly, we would suggest not counting delays caused by weather at all, because airlines and airports can't affect them.*