# 0. Import our libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# 1. Collecting data
- What subject is your data about? What is the source of your data?  
- Do the authors of this data allow you to use it like this? You can check the data license.  
- How did the authors collect the data?


## About dataset
### Subject: Commercial flight operations in the United States (2024)

Although air travel is essential for bringing people and economies together, passenger flights are frequently disrupted by delays and cancellations. With over 1 million flights nationwide, this dataset offers a thorough understanding of US airline performance in 2024.

The dataset includes comprehensive statistics on departure times, cancellations, flight distances, weather delays, and late aircraft delays, which makes it well suited to investigating flight delays and cancellations.

### Sources
The On-Time Performance Data (2024) from the US Bureau of Transportation Statistics (BTS) is the source of this dataset. The source offers comprehensive details on airline on-time performance, including departure times, weather-related disruptions, delays, and cancellations.

## Data License

**No Copyright** 

The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law. 

You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission. See Other Information via the link below. 

[CC0: Public Domain](https://creativecommons.org/publicdomain/zero/1.0/)



## Collection Methodology

The information was gathered from the 2024 On-Time Performance reports published by the US Bureau of Transportation Statistics (BTS). A single CSV file containing important features, such as delays, cancellations, distances, and weather impacts, was created by merging, cleaning, and curating monthly reports.
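
The merge step described above can be sketched as follows. This is only an illustration, not the dataset authors' actual pipeline: the two in-memory "monthly reports", their contents, and the chosen columns are assumptions made for the example.

```python
import io

import pandas as pd

# Two tiny in-memory stand-ins for monthly BTS reports (hypothetical contents).
january_csv = io.StringIO(
    "fl_date,origin,cancelled\n1/1/2024,JFK,0\n1/2/2024,MSP,1\n"
)
february_csv = io.StringIO(
    "fl_date,origin,cancelled\n2/1/2024,DTW,0\n"
)

# Read each monthly report into its own DataFrame.
monthly_frames = [pd.read_csv(buffer) for buffer in (january_csv, february_csv)]

# Stack the monthly tables and rebuild a clean 0..n-1 index.
merged = pd.concat(monthly_frames, ignore_index=True)

# Keep only the columns of interest before saving a single CSV.
curated = merged[["fl_date", "origin", "cancelled"]]
print(curated.shape)
```

With real files, the `io.StringIO` buffers would be replaced by the paths of the downloaded monthly reports.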

# 2. Exploring data 
- How many rows and how many columns?  
- What is the meaning of each row?  
- Are there **<span style="color:red">duplicated rows</span>**?  
- What is the meaning of each column?  
- What is the current data type of each column? Are there columns with **<span style="color:red">inappropriate data types</span>**?  
- For each numerical column, how are the values distributed?  
  - What is the percentage of **<span style="color:red">missing values</span>**?  
  - What are the min and max? Are they **<span style="color:red">abnormal</span>**?  
- For each categorical column, how are the values distributed?  
  - What is the percentage of **<span style="color:red">missing values</span>**?  
  - How many different values are there? Show a few.  
  - Are they **<span style="color:red">abnormal</span>**?

In [2]:
df = pd.read_csv("data/flight_data_2024.csv")
df.head()

|   | year | month | day_of_month | day_of_week | fl_date | origin | origin_city_name | origin_state_nm | dep_time | taxi_out | wheels_off | wheels_on | taxi_in | cancelled | air_time | distance | weather_delay | late_aircraft_delay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | 1 | 1 | 1 | 1/1/2024 | JFK | New York, NY | New York | 1247.0 | 31.0 | 1318.0 | 1442.0 | 7.0 | 0 | 84.0 | 509 | 0 | 0 |
| 1 | 2024 | 1 | 1 | 1 | 1/1/2024 | MSP | Minneapolis, MN | Minnesota | 1001.0 | 20.0 | 1021.0 | 1249.0 | 6.0 | 0 | 88.0 | 622 | 0 | 0 |
| 2 | 2024 | 1 | 1 | 1 | 1/1/2024 | JFK | New York, NY | New York | 1411.0 | 21.0 | 1432.0 | 1533.0 | 8.0 | 0 | 61.0 | 288 | 0 | 0 |
| 3 | 2024 | 1 | 1 | 1 | 1/1/2024 | RIC | Richmond, VA | Virginia | 1643.0 | 13.0 | 1656.0 | 1747.0 | 12.0 | 0 | 51.0 | 288 | 0 | 0 |
| 4 | 2024 | 1 | 1 | 1 | 1/1/2024 | DTW | Detroit, MI | Michigan | 1010.0 | 21.0 | 1031.0 | 1016.0 | 4.0 | 0 | 45.0 | 237 | 0 | 0 |
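
The questions about table size and duplicated rows could be answered along these lines. This is a minimal sketch on a small hand-built frame: the first three rows reuse values from the preview above, and the fourth row is an artificial copy of the first, added only so that the check has something to find.

```python
import pandas as pd

# Small stand-in for the flight table; the last row is an artificial duplicate.
sample = pd.DataFrame({
    "fl_date": ["1/1/2024", "1/1/2024", "1/1/2024", "1/1/2024"],
    "origin": ["JFK", "MSP", "JFK", "JFK"],
    "dep_time": [1247.0, 1001.0, 1411.0, 1247.0],
    "distance": [509, 622, 288, 509],
})

# Number of rows and columns.
n_rows, n_cols = sample.shape

# duplicated() flags every repeat of an earlier, fully identical row.
n_duplicates = int(sample.duplicated().sum())

# Drop the repeats and renumber the index.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(n_rows, n_cols, n_duplicates, len(deduplicated))
```

The same three calls (`shape`, `duplicated`, `drop_duplicates`) apply unchanged to the full `df`. Whether two identical rows really describe the same flight is a judgment call, since the table has no flight identifier column.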


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 18 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   year                 1048575 non-null  int64  
 1   month                1048575 non-null  int64  
 2   day_of_month         1048575 non-null  int64  
 3   day_of_week          1048575 non-null  int64  
 4   fl_date              1048575 non-null  object 
 5   origin               1048575 non-null  object 
 6   origin_city_name     1048575 non-null  object 
 7   origin_state_nm      1048575 non-null  object 
 8   dep_time             1026022 non-null  float64
 9   taxi_out             1025450 non-null  float64
 10  wheels_off           1025450 non-null  float64
 11  wheels_on            1024898 non-null  float64
 12  taxi_in              1024898 non-null  float64
 13  cancelled            1048575 non-null  int64  
 14  air_time             1022824 non-null  float64
 15
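
The non-null counts printed by `df.info()` already determine the percentage of missing values per column. The sketch below redoes that arithmetic from the counts shown above, using only the columns whose counts are visible in the output. On the loaded frame, `df.isna().mean() * 100` should give the same figures directly.

```python
import pandas as pd

# Total number of rows, taken from the RangeIndex line of df.info().
total_rows = 1_048_575

# Non-null counts copied from the df.info() output for the columns with gaps.
non_null_counts = pd.Series({
    "dep_time": 1_026_022,
    "taxi_out": 1_025_450,
    "wheels_off": 1_025_450,
    "wheels_on": 1_024_898,
    "taxi_in": 1_024_898,
    "air_time": 1_022_824,
})

# Missing share = (rows without a value) / (all rows), as a percentage.
missing_pct = ((total_rows - non_null_counts) / total_rows * 100).round(2)
print(missing_pct)
```

Each of these columns is missing a little over 2% of its values, with `air_time` the most affected.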

# 3. Asking meaningful questions

- Your group needs to give ≥ [the-number-of-group-members](#) questions which can be answered with this data.  
- Each question should be [meaningful](#) (what are the benefits of finding the answer?) and [not too easy](#) to answer (e.g., it’s too easy if we just need one line of code to get the answer).  
- Your group should focus more on the [quality of questions](#) than the quantity.  
- In the notebook file, for each question, your group needs to present:  
  - What is the question?  
  - What are the benefits of finding the answer?


# 4. Preprocessing and analyzing data to answer each question

- For each question:  
  - Does it need a preprocessing step and, if so, how does your group preprocess the data?  
    - **Text:** sketch the steps [clearly](#) so that readers can understand how your group preprocesses the data even without reading the code  
    - **Code:** implement the [sketched](#) steps. Your group should also try to write clear code (choose good variable names, comment where comments are needed, don’t let lines get too long)  
  - How does your group analyze the data to answer the question?  
    - **Text:** similar to above  
    - **Code:** similar to above
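
As an illustration of the text-then-code pattern, a preprocessing step might first be sketched in words: (1) parse `fl_date` from a string into a timestamp, (2) convert `dep_time` into minutes after midnight. The code below implements those two steps on a tiny stand-in frame. It assumes that `dep_time` is stored as an HHMM number, which is consistent with the preview rows (for example 1247 plus 31 minutes of taxi-out gives 1318), and the column name `dep_minutes` is made up for this example.

```python
import pandas as pd

# Tiny stand-in for the flight table; None stands in for a missing departure time.
flights = pd.DataFrame({
    "fl_date": ["1/1/2024", "1/1/2024", "1/1/2024"],
    "dep_time": [1247.0, 1001.0, None],
})

# Step 1: parse the month/day/year strings into real timestamps.
flights["fl_date"] = pd.to_datetime(flights["fl_date"], format="%m/%d/%Y")

# Step 2: split HHMM into hours and minutes, then express it as
# minutes after midnight. Missing values stay missing (NaN).
hours = flights["dep_time"] // 100
minutes = flights["dep_time"] % 100
flights["dep_minutes"] = hours * 60 + minutes

print(flights.dtypes)
print(flights)
```

Keeping the original `dep_time` column next to the derived one makes it easy to spot-check a few rows by hand.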


# 5. Reflection
- **Each member:** What difficulties have you encountered?  
- **Each member:** What have you learned?  
- **Your group:** If you had more time, what would you do?


# 6. References
- To finish this project, what materials have you consulted?