# ✈️ Flight Delays Analysis – Exploratory Data Analysis (EDA)

This notebook is part of the **`flight-delays-analysis`** project. The goal is to analyze real-world flight delay data from the United States to uncover meaningful patterns and develop a basic predictive model.

---

## 🎯 Notebook Objectives

- Load the raw dataset
- Understand the structure and quality of the data
- Detect missing values, duplicates, and irrelevant features
- Identify initial patterns related to delays and cancellations

---

## 📁 Dataset Overview

The dataset contains flight records with variables such as:

- Airline, origin, destination
- Scheduled departure and arrival times
- Delay causes (weather, security, etc.)
- Cancellation information

Source: [Kaggle – Airline Delay Causes](https://www.kaggle.com/datasets/giovamata/airlinedelaycauses)

---

## 🧠 Notes

- This EDA focuses solely on **delay behavior**, not external economic or meteorological data.
- Key findings will be used later as a foundation for modeling and reporting.

---

📌 Author: **Josekawa** – 2025  
🔗 GitHub Repository: [github.com/Josekawa/flight-delays-analysis](https://github.com/Josekawa/flight-delays-analysis)



## 🧭 Project Workflow

To structure the analysis and keep things focused, I’m following a clear workflow that covers everything from data loading to exporting a cleaned dataset for modeling. Here's the step-by-step plan:

1. **Load the dataset**  
   Import the CSV file and confirm it loads correctly using Pandas.

2. **Initial inspection**  
   Get a general sense of the structure and contents of the dataset using `.info()`, `.describe()`, and other quick checks.

3. **Data cleaning**  
   Drop irrelevant or redundant columns, handle missing values, and create a proper `Date` column for time-based analysis.

4. **Feature documentation**  
   Record what each remaining column means, which ones are useful, and why I’ve kept them.

5. **Exploratory visuals**  
   Use charts to explore patterns in delays, cancellations, carriers, days of the week, and other relevant dimensions.

6. **Correlation analysis**  
   Compare different delay causes and examine their relationship with arrival delay using a correlation matrix.

7. **Save the cleaned dataset**  
   Export a cleaned version of the data to the `/data/processed/` folder for use in the modeling phase.



In [22]:
# 📦 Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Setup for visualizations
# ('seaborn-whitegrid' was renamed in Matplotlib 3.6, so seaborn's theme API is used instead)
sns.set_theme(style='whitegrid', palette='pastel')
%matplotlib inline

# 📄 Load the dataset
file_path = '../data/raw/DelayedFlights.csv'
df = pd.read_csv(file_path)

# 🔍 Quick preview
print(f"Shape of dataset: {df.shape}")
df.head()



Shape of dataset: (1936758, 30)


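| | Unnamed: 0 | Year | Month | DayofMonth | DayOfWeek | DepTime | CRSDepTime | ArrTime | CRSArrTime | UniqueCarrier | ... | TaxiIn | TaxiOut | Cancelled | CancellationCode | Diverted | CarrierDelay | WeatherDelay | NASDelay | SecurityDelay | LateAircraftDelay |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2008 | 1 | 3 | 4 | 2003.0 | 1955 | 2211.0 | 2225 | WN | ... | 4.0 | 8.0 | 0 | N | 0 | NaN | NaN | NaN | NaN | NaN |
| 1 | 1 | 2008 | 1 | 3 | 4 | 754.0 | 735 | 1002.0 | 1000 | WN | ... | 5.0 | 10.0 | 0 | N | 0 | NaN | NaN | NaN | NaN | NaN |
| 2 | 2 | 2008 | 1 | 3 | 4 | 628.0 | 620 | 804.0 | 750 | WN | ... | 3.0 | 17.0 | 0 | N | 0 | NaN | NaN | NaN | NaN | NaN |
| 3 | 4 | 2008 | 1 | 3 | 4 | 1829.0 | 1755 | 1959.0 | 1925 | WN | ... | 3.0 | 10.0 | 0 | N | 0 | 2.0 | 0.0 | 0.0 | 0.0 | 32.0 |
| 4 | 5 | 2008 | 1 | 3 | 4 | 1940.0 | 1915 | 2121.0 | 2110 | WN | ... | 4.0 | 10.0 | 0 | N | 0 | NaN | NaN | NaN | NaN | NaN |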


In [None]:
# General information about the dataset
df.info()

# Summary statistics for numeric columns
df.describe()

# Column names
df.columns

# Dataset size
df.shape

# Missing values per column
df.isnull().sum()

# Number of fully duplicated rows
# (in a notebook cell, only this last expression's value is displayed)
df.duplicated().sum()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1936758 entries, 0 to 1936757
Data columns (total 30 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Unnamed: 0         int64  
 1   Year               int64  
 2   Month              int64  
 3   DayofMonth         int64  
 4   DayOfWeek          int64  
 5   DepTime            float64
 6   CRSDepTime         int64  
 7   ArrTime            float64
 8   CRSArrTime         int64  
 9   UniqueCarrier      object 
 10  FlightNum          int64  
 11  TailNum            object 
 12  ActualElapsedTime  float64
 13  CRSElapsedTime     float64
 14  AirTime            float64
 15  ArrDelay           float64
 16  DepDelay           float64
 17  Origin             object 
 18  Dest               object 
 19  Distance           int64  
 20  TaxiIn             float64
 21  TaxiOut            float64
 22  Cancelled          int64  
 23  CancellationCode   object 
 24  Diverted           int64  
 25  CarrierDelay      

0

The `ArrDelay` column ranges from -89 to over 1400 minutes; negative values are flights that arrived ahead of schedule. Most delays seem mild, but there are extreme outliers, which could distort plot scales or model performance later if not handled properly.
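
One way to put a number on "extreme" is the interquartile-range (IQR) rule. The sketch below is only an illustration: the `arr_delay` values are made up, the 1.5 × IQR threshold is a common convention rather than something this project has settled on, and clipping is shown as an option for plotting, not as a cleaning decision.

```python
import pandas as pd

# Toy stand-in for df['ArrDelay']; the values are invented for illustration.
arr_delay = pd.Series([-12, -3, 0, 5, 8, 15, 22, 40, 95, 1400], dtype="float64")

# IQR rule: flag values more than 1.5 * IQR below Q1 or above Q3.
q1, q3 = arr_delay.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (arr_delay < lower) | (arr_delay > upper)

# Capped copy for plotting only; the original series is left untouched.
arr_delay_clipped = arr_delay.clip(lower=lower, upper=upper)

print(f"IQR bounds: [{lower:.1f}, {upper:.1f}]")
print(f"Outliers flagged: {int(is_outlier.sum())} of {len(arr_delay)}")
```

On the real column, the same lines would run with `df['ArrDelay'].dropna()` in place of the toy series.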


## ✍️ First impressions from the initial inspection

- The dataset is massive — almost 2 million rows and 30 columns. That gives me a lot to work with, but also means I need to stay focused so I don't get overwhelmed.
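- Time columns like `DepTime` and `ArrTime` are stored as floats, while the scheduled times (`CRSDepTime`, `CRSArrTime`) are integers. pandas stores an integer column as float when it contains `NaN`, so the actual-time columns most likely have missing values that I’ll need to clean or convert.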
- Several delay-related columns are likely to be NaN when no delay occurred, which makes sense logically — but I’ll need to validate that before deciding how to handle them.
- `Cancelled` and `Diverted` are binary and clearly important. I’ll want to isolate non-cancelled flights later to avoid bias in the delay analysis.
- A few fields like `FlightNum`, `TailNum`, and `CRSElapsedTime` might not be useful for modeling directly but could give interesting insights during EDA.
- `Unnamed: 0` looks like an autogenerated index column — not useful, and I’ll drop it in the next step.

So far, the dataset seems rich, structured, and manageable with the right filters and cleanup.
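
The idea that missing delay-cause values go together with short or no delays can be checked before filling anything. The following is a minimal sketch on a made-up frame (`sample`, its values, and the three-column subset in `cause_cols` are assumptions for illustration); it compares `ArrDelay` for rows where all cause columns are missing against rows where they are present.

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the relevant columns; values are invented for illustration.
sample = pd.DataFrame({
    "ArrDelay":          [-14.0, 2.0, 14.0, 34.0, 11.0, 60.0],
    "CarrierDelay":      [np.nan, np.nan, np.nan, 2.0, np.nan, 0.0],
    "WeatherDelay":      [np.nan, np.nan, np.nan, 0.0, np.nan, 45.0],
    "LateAircraftDelay": [np.nan, np.nan, np.nan, 32.0, np.nan, 15.0],
})
cause_cols = ["CarrierDelay", "WeatherDelay", "LateAircraftDelay"]

# True for flights where every cause column is missing.
all_causes_missing = sample[cause_cols].isna().all(axis=1)

# If NaN really means "no attributed delay", these two ranges should not overlap much.
max_when_missing = sample.loc[all_causes_missing, "ArrDelay"].max()
min_when_present = sample.loc[~all_causes_missing, "ArrDelay"].min()

print(f"Largest ArrDelay with causes missing:  {max_when_missing}")
print(f"Smallest ArrDelay with causes present: {min_when_present}")
```

If the two ranges overlap heavily on the real data, filling with 0 would deserve a second look.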


## 🧹 Data cleaning

To simplify the dataset and make it easier to work with, I dropped several columns that were either redundant, irrelevant for analysis, or not usable in a predictive context. This includes things like flight numbers, raw time values, and columns with constant values (like `Year`, which is always 2008).

Instead of keeping separate columns for `Year`, `Month`, and `DayofMonth`, I created a single `Date` column using `pd.to_datetime()`. This gives me much more flexibility for filtering, grouping, and time-based visualizations — especially later on when working with tools like Power BI or Tableau.

The goal here is to reduce noise and keep only what I need for meaningful analysis and modeling.
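
Because `Cancelled` and `Diverted` are among the columns dropped in this step, any filtering on them has to happen first. The sketch below shows one way to keep only completed flights; the `flights` frame and its values are hypothetical, and whether to exclude these flights at all is a judgment call for this analysis.

```python
import pandas as pd

# Toy frame; values are invented for illustration.
flights = pd.DataFrame({
    "Cancelled": [0, 1, 0, 0],
    "Diverted":  [0, 0, 1, 0],
    "ArrDelay":  [12.0, None, None, 45.0],
})

# Keep flights that were neither cancelled nor diverted.
# .copy() avoids chained-assignment warnings on later edits.
completed = flights[(flights["Cancelled"] == 0) & (flights["Diverted"] == 0)].copy()

print(f"Kept {len(completed)} of {len(flights)} flights")
```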



In [24]:
# Build a single Date column from Year, Month, and DayofMonth, and move it to the front
df['Date'] = pd.to_datetime(dict(year=df['Year'], month=df['Month'], day=df['DayofMonth']))
cols = ['Date'] + [col for col in df.columns if col != 'Date']
df = df[cols]

# Drop columns that are irrelevant, redundant, or not needed for the current analysis
df.drop(columns=[
    "Unnamed: 0",         # Index column carried over from the CSV
    "Year",               # Constant (always 2008)
    "DayofMonth",         # Replaced by Date
    "DepTime",            # Actual departure time; keeping only the scheduled time (CRSDepTime)
    "DepDelay",           # Departure delay; this analysis focuses on ArrDelay
    "ArrTime",            # Actual arrival time
    "CRSArrTime",         # Scheduled arrival time
    "ActualElapsedTime",  # Actual elapsed time
    "CRSElapsedTime",     # Scheduled elapsed time
    "Diverted",           # Diversion flag; filter on it before this step if diverted flights should be excluded
    "Cancelled",          # Cancellation flag; filter on it before this step if cancelled flights should be excluded
    "Distance",           # Not used in this analysis
    "FlightNum",          # Flight ID, not useful analytically
    "TailNum",            # Aircraft ID, not useful analytically
], inplace=True)



## 📄 Column Definitions (for reference)

Before diving deeper, I want to document what each column represents — both for myself and anyone else reading this later. I've already dropped a number of columns that were redundant, constant, or not relevant for the type of analysis I want to do (like raw timestamps, flight IDs, and distance). 

Here’s a quick overview of the key variables I’ve kept in the dataset:

- `Date`: Combined from Year, Month, and DayofMonth
- `Month`: Month of the year (1 to 12), kept alongside `Date`
- `DayOfWeek`: 1 = Monday, 7 = Sunday
- `CRSDepTime`: Scheduled departure time
- `UniqueCarrier`: Airline code
- `Origin`: Origin airport (IATA)
- `Dest`: Destination airport (IATA)
- `AirTime`: Time in flight (min)
- `ArrDelay`: Arrival delay in minutes (negative = early arrival)
- `TaxiIn`: Time spent taxiing after landing (min)
- `TaxiOut`: Time spent taxiing before takeoff (min)
- `CancellationCode`: Reason (A = carrier, B = weather, C = NAS, D = security); `N` appears on flights that were not cancelled, as in the preview rows
- `CarrierDelay`: Airline responsibility (e.g. maintenance, crew)
- `WeatherDelay`: Due to hazardous weather
- `NASDelay`: NAS-related (air traffic, runways, etc.)
- `SecurityDelay`: Security-related issues
- `LateAircraftDelay`: Caused by late arrival of aircraft from a previous flight

This list will help me stay focused on the features that matter most for EDA, modeling, and visual storytelling.
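
Two of these columns are encoded rather than directly readable. The sketch below shows how they could be decoded for charts; it assumes `CRSDepTime` is an HHMM integer (consistent with values like 1955 and 735 in the preview), and the `sample` frame and the new column names `DayName`, `DepHour`, and `DepMinute` are hypothetical.

```python
import pandas as pd

# Toy frame; values are invented for illustration.
sample = pd.DataFrame({
    "DayOfWeek":  [1, 4, 7],
    "CRSDepTime": [735, 1955, 5],   # HHMM integers: 07:35, 19:55, 00:05
})

# Map the 1-7 coding to weekday names (1 = Monday, 7 = Sunday).
day_names = {1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
             5: "Friday", 6: "Saturday", 7: "Sunday"}
sample["DayName"] = sample["DayOfWeek"].map(day_names)

# Integer division and modulo split an HHMM integer into hour and minute.
sample["DepHour"] = sample["CRSDepTime"] // 100
sample["DepMinute"] = sample["CRSDepTime"] % 100

print(sample)
```

Grouping by `DepHour` or `DayName` then gives delay summaries by time of day or weekday without parsing full timestamps.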


## 🔍 Handling Missing Values

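Most missing values in this dataset appear in the delay cause columns (`CarrierDelay`, `WeatherDelay`, `NASDelay`, `SecurityDelay`, `LateAircraftDelay`). In the preview rows, they are populated only for the one flight that arrived 34 minutes late (`ArrTime` 1959 vs. `CRSArrTime` 1925), where `CarrierDelay` (2) and `LateAircraftDelay` (32) add up to the arrival delay. I therefore treat `NaN` in these columns as “no delay attributed to that cause” and replace it with 0.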

Other columns with inconsistent or incomplete data (like raw time stamps, aircraft IDs, and internal indexes) were dropped earlier during the cleaning step. This keeps the dataset clean and focused on the variables I actually plan to use.


In [25]:
delay_cols = [
    'CarrierDelay', 'WeatherDelay', 'NASDelay',
    'SecurityDelay', 'LateAircraftDelay'
]
df[delay_cols] = df[delay_cols].fillna(0)

df.isnull().sum().sort_values(ascending=False)

AirTime              8387
ArrDelay             8387
TaxiIn               7110
TaxiOut               455
Date                    0
SecurityDelay           0
NASDelay                0
WeatherDelay            0
CarrierDelay            0
CancellationCode        0
Dest                    0
Month                   0
Origin                  0
UniqueCarrier           0
CRSDepTime              0
DayOfWeek               0
LateAircraftDelay       0
dtype: int64
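
The output above still shows missing values in `AirTime`, `ArrDelay`, `TaxiIn`, and `TaxiOut`. One possible way to deal with them, assuming `ArrDelay` is the quantity of interest and rows without it carry no usable signal, is to drop those rows and re-check what remains. This is a sketch on a made-up `sample` frame, not a decision the notebook has made.

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration.
sample = pd.DataFrame({
    "ArrDelay": [34.0, np.nan, -5.0, 12.0],
    "AirTime":  [116.0, np.nan, 77.0, 95.0],
    "TaxiIn":   [3.0, np.nan, 4.0, np.nan],
    "TaxiOut":  [10.0, 12.0, 8.0, 9.0],
})

# Drop only the rows where the target column is missing.
cleaned = sample.dropna(subset=["ArrDelay"]).copy()

# Re-check what is still missing after the drop.
remaining = cleaned.isnull().sum().sort_values(ascending=False)
print(remaining)
```

Columns that still have gaps afterwards (here `TaxiIn`) can then be handled case by case.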