# AirFly Insights — Week 2: Preprocessing & Feature Engineering

**Dataset:** NYC Flights 2013 (`flights.csv`)

**Goals for Week 2:**
- Handle nulls (cancelled flights)
- Parse and format datetime columns
- Create derived features for analysis
- Save the preprocessed dataset for fast reuse
- Produce a feature dictionary

## 1. Imports & Setup

In [1]:
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)

print('Setup complete.')

Setup complete.


## 2. Load Raw Data

In [2]:
df_raw = pd.read_csv('../data/raw/flights.csv')
print(f'Raw data loaded: {df_raw.shape[0]:,} rows × {df_raw.shape[1]} columns')
print(f'Columns: {list(df_raw.columns)}')

Raw data loaded: 336,776 rows × 21 columns
Columns: ['id', 'year', 'month', 'day', 'dep_time', 'sched_dep_time', 'dep_delay', 'arr_time', 'sched_arr_time', 'arr_delay', 'carrier', 'flight', 'tailnum', 'origin', 'dest', 'air_time', 'distance', 'hour', 'minute', 'time_hour', 'name']


## 3. Handle Nulls — Cancelled Flights

In [3]:
df = df_raw.copy()

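# A null dep_time is treated as a cancellation, so it drives the is_cancelled flag.
# arr_time, arr_delay, and air_time have additional nulls: flights that departed
# but have no recorded arrival (likely diversions or incomplete records).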
df['is_cancelled'] = df['dep_time'].isnull().astype(int)

cancelled_count = df['is_cancelled'].sum()
total = len(df)
print(f'Total flights: {total:,}')
print(f'Cancelled:     {cancelled_count:,} ({cancelled_count/total*100:.2f}%)')
print(f'Completed:     {total - cancelled_count:,} ({(total - cancelled_count)/total*100:.2f}%)')

Total flights: 336,776
Cancelled:     8,255 (2.45%)
Completed:     328,521 (97.55%)


In [4]:
# No imputation is applied: nulls in the time and delay columns stay as NaN.
# Cancelled rows are kept in the dataset for cancellation-rate analysis.
print('Null counts in time and delay columns:')
print(df[['dep_time','dep_delay','arr_time','arr_delay','air_time']].isnull().sum())

Null counts in time and delay columns:
dep_time     8255
dep_delay    8255
arr_time     8713
arr_delay    9430
air_time     9430
dtype: int64
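
The counts above show more nulls in `arr_time`, `arr_delay`, and `air_time` than in `dep_time`, so some flights departed but have no arrival data. A minimal sketch of how that gap could be isolated, using a small hypothetical frame in place of `df` (the column names match the dataset; the values are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: one cancelled flight, one departed flight
# with no recorded arrival, and two fully recorded flights.
toy = pd.DataFrame({
    'dep_time':  [517.0, np.nan, 1805.0, 930.0],
    'arr_delay': [11.0,  np.nan, np.nan, -4.0],
})

cancelled = toy['dep_time'].isna()
# Departed, but the arrival delay was never recorded
departed_no_arrival = toy['dep_time'].notna() & toy['arr_delay'].isna()

null_breakdown = pd.Series({
    'cancelled': int(cancelled.sum()),
    'departed_no_arrival': int(departed_no_arrival.sum()),
})
print(null_breakdown)
```

On the real data, the second count would be expected to be 9,430 − 8,255 = 1,175, assuming every cancelled flight is also missing `arr_delay`.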


## 4. Parse & Format Datetime Columns

In [5]:
# Parse time_hour into a proper datetime
df['datetime'] = pd.to_datetime(df['time_hour'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Extract date
df['date'] = df['datetime'].dt.date

# Day of week (0=Monday, 6=Sunday)
df['day_of_week'] = df['datetime'].dt.dayofweek
df['day_name'] = df['datetime'].dt.day_name()

# Week number
df['week_of_year'] = df['datetime'].dt.isocalendar().week.astype(int)

print('Datetime columns created successfully.')
print(df[['datetime', 'date', 'day_of_week', 'day_name', 'week_of_year']].head())

Datetime columns created successfully.
             datetime        date  day_of_week day_name  week_of_year
0 2013-01-01 05:00:00  2013-01-01            1  Tuesday             1
1 2013-01-01 05:00:00  2013-01-01            1  Tuesday             1
2 2013-01-01 05:00:00  2013-01-01            1  Tuesday             1
3 2013-01-01 05:00:00  2013-01-01            1  Tuesday             1
4 2013-01-01 06:00:00  2013-01-01            1  Tuesday             1
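
One property of ISO week numbers is worth keeping in mind when grouping by `week_of_year`: the last days of December can belong to week 1 of the following ISO year. The sketch below illustrates this with three 2013 dates chosen for the example:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2013-01-01', '2013-12-29', '2013-12-30']))
iso = dates.dt.isocalendar()  # DataFrame with columns: year, week, day

weeks = iso['week'].astype(int).tolist()
iso_years = iso['year'].astype(int).tolist()
print(weeks)      # [1, 52, 1]
print(iso_years)  # [2013, 2013, 2014]
```

Because December 30 and 31, 2013 share week number 1 with early January, a weekly aggregate on `week_of_year` alone would merge them into the first week of the year. Grouping on the ISO year and week together, or on `date`, avoids that.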


## 5. Feature Engineering

### 5a. Route

In [6]:
# Route = origin-destination pair
df['route'] = df['origin'] + '-' + df['dest']

print(f'Unique routes: {df["route"].nunique()}')
print('Top 5 routes:')
print(df['route'].value_counts().head())

Unique routes: 224
Top 5 routes:
route
JFK-LAX    11262
LGA-ATL    10263
LGA-ORD     8857
JFK-SFO     8204
LGA-CLT     6168
Name: count, dtype: int64


### 5b. Departure Hour Bin

In [7]:
# Bucket scheduled departure hour into time-of-day periods
def hour_bin(h):
    if pd.isna(h): return 'Unknown'
    if 5 <= h < 12: return 'Morning'
    elif 12 <= h < 17: return 'Afternoon'
    elif 17 <= h < 21: return 'Evening'
    else: return 'Night'

df['dep_hour_bin'] = df['hour'].apply(hour_bin)

print('Departure time period distribution:')
print(df['dep_hour_bin'].value_counts())

Departure time period distribution:
dep_hour_bin
Morning      131020
Afternoon    106733
Evening       84389
Night         14634
Name: count, dtype: int64
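
The row-wise `apply` above is easy to read. For larger frames, the same bucketing could be expressed in vectorized form. This is a sketch using `np.select` on a hypothetical `hours` Series that stands in for `df['hour']`, with the same boundaries as `hour_bin`:

```python
import numpy as np
import pandas as pd

# Hypothetical hours covering every boundary, plus one missing value
hours = pd.Series([0, 4, 5, 11, 12, 16, 17, 20, 21, 23, np.nan])

conditions = [
    hours.isna().to_numpy(),
    ((hours >= 5) & (hours < 12)).to_numpy(),
    ((hours >= 12) & (hours < 17)).to_numpy(),
    ((hours >= 17) & (hours < 21)).to_numpy(),
]
choices = ['Unknown', 'Morning', 'Afternoon', 'Evening']

# Anything not matched above (hours 0-4 and 21-23) falls through to 'Night'
bins = pd.Series(np.select(conditions, choices, default='Night'), index=hours.index)
print(bins.tolist())
```

The missing-value check comes first so that a null hour is labelled 'Unknown' rather than falling through to the 'Night' default.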


### 5c. Delay Flags

In [8]:
# is_delayed: departure delay above the commonly used 15-minute threshold.
# NaN > 15 evaluates to False, so cancelled flights get 0 here, not a null.
df['is_delayed'] = (df['dep_delay'] > 15).astype('Int64')

# is_early: departed more than 5 minutes early (cancelled flights also get 0)
df['is_early'] = (df['dep_delay'] < -5).astype('Int64')

# total_delay: combined departure + arrival delay
df['total_delay'] = df['dep_delay'] + df['arr_delay']

delayed_count = df['is_delayed'].sum()
completed = (df['is_cancelled'] == 0).sum()
print(f'Flights delayed >15 min: {delayed_count:,} ({delayed_count/completed*100:.2f}% of completed flights)')
print(f'Flights departed >5 min early: {df["is_early"].sum():,}')
print(f'Avg total delay (both delays recorded): {df["total_delay"].mean():.1f} min')

Flights delayed >15 min: 70,774 (21.54% of completed flights)
Flights departed >5 min early: 69,588
Avg total delay (both delays recorded): 19.5 min
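
In the cell above, a comparison against `NaN` evaluates to `False`, so cancelled flights receive 0 in `is_delayed` and `is_early` even though the columns use the nullable `Int64` dtype. If a true null for cancelled flights is wanted, one possible approach is sketched below on a hypothetical `dep_delay` Series:

```python
import numpy as np
import pandas as pd

# Hypothetical delays; the third flight is cancelled (no dep_delay)
dep_delay = pd.Series([2.0, 47.0, np.nan, -6.0, 15.0])

# Behaviour of the notebook cell: NaN > 15 is False, so the cancelled row becomes 0
flag_as_in_notebook = (dep_delay > 15).astype('Int64')

# Nullable alternative: compute on floats, blank out rows with no delay value,
# then cast to the nullable integer dtype
flag_nullable = (
    dep_delay.gt(15)
    .astype('float')
    .where(dep_delay.notna())
    .astype('Int64')
)

print(flag_as_in_notebook.tolist())
print(flag_nullable.tolist())
```

With the nullable variant, `.sum()` still counts only the delayed flights, but a null summary would then also list the flag columns.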


### 5d. Speed

In [9]:
# Speed in miles per hour (only for completed flights with valid air_time)
df['speed_mph'] = np.where(
    df['air_time'].notna() & (df['air_time'] > 0),
    (df['distance'] / (df['air_time'] / 60)).round(1),
    np.nan
)

print('Speed stats (mph):')
print(df['speed_mph'].describe().round(1))

Speed stats (mph):
count   327346.00
mean       394.30
std         60.60
min         76.80
25%        358.10
50%        404.20
75%        438.80
max        703.40
Name: speed_mph, dtype: float64
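
`np.where` computes the division for every row before choosing between the result and `NaN`. An alternative that masks invalid `air_time` values before dividing is sketched below. The frame is hypothetical and the values were chosen for easy arithmetic:

```python
import numpy as np
import pandas as pd

# Hypothetical flights: two valid, one with zero air_time, one with missing air_time
toy = pd.DataFrame({
    'distance': [1000, 760, 200, 500],
    'air_time': [150.0, 120.0, 0.0, np.nan],
})

# Replace non-positive air_time with NaN first, so no division by zero occurs
valid_air_time = toy['air_time'].where(toy['air_time'] > 0)
toy['speed_mph'] = (toy['distance'] / (valid_air_time / 60)).round(1)
print(toy)
```

Here 1000 miles in 150 minutes is 1000 / 2.5 = 400.0 mph, and the rows with zero or missing `air_time` end up as `NaN`.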


### 5e. Delay Severity Category

In [10]:
# Classify departure delay into severity buckets
def delay_severity(d):
    if pd.isna(d): return 'Cancelled'
    if d <= 0: return 'On Time / Early'
    elif d <= 15: return 'Minor (1-15 min)'
    elif d <= 60: return 'Moderate (16-60 min)'
    elif d <= 180: return 'Severe (1-3 hrs)'
    else: return 'Extreme (>3 hrs)'

df['delay_severity'] = df['dep_delay'].apply(delay_severity)

print('Delay severity distribution:')
print(df['delay_severity'].value_counts())

Delay severity distribution:
delay_severity
On Time / Early         200089
Minor (1-15 min)         57658
Moderate (16-60 min)     44193
Severe (1-3 hrs)         22688
Cancelled                 8255
Extreme (>3 hrs)          3893
Name: count, dtype: int64
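
The same severity buckets could also be produced with `pd.cut`, which yields a categorical column and keeps the bin edges in one place. This is a sketch on a hypothetical `dep_delay` Series covering each boundary, not a replacement for the function above:

```python
import numpy as np
import pandas as pd

dep_delay = pd.Series([-5, 0, 1, 15, 16, 60, 61, 180, 181, np.nan])

labels = ['On Time / Early', 'Minor (1-15 min)', 'Moderate (16-60 min)',
          'Severe (1-3 hrs)', 'Extreme (>3 hrs)']
# Right-closed intervals: (-inf, 0], (0, 15], (15, 60], (60, 180], (180, inf]
severity = pd.cut(dep_delay, bins=[-np.inf, 0, 15, 60, 180, np.inf], labels=labels)

# NaN delays fall outside every bin; label them 'Cancelled' as the notebook does
severity = severity.cat.add_categories('Cancelled').fillna('Cancelled')
print(severity.tolist())
```

The right-closed default of `pd.cut` matches the `<=` comparisons in `delay_severity`, so a delay of exactly 15 minutes stays in the Minor bucket.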


## 6. Final Cleaned Dataset Overview

In [11]:
# Summary of new columns added
new_cols = ['is_cancelled', 'datetime', 'date', 'day_of_week', 'day_name',
            'week_of_year', 'route', 'dep_hour_bin', 'is_delayed', 'is_early',
            'total_delay', 'speed_mph', 'delay_severity']

print(f'Original columns: {df_raw.shape[1]}')
print(f'New columns added: {len(new_cols)}')
print(f'Total columns: {df.shape[1]}')
print(f'\nFinal dataset shape: {df.shape}')
print('\nNew columns preview:')
df[new_cols].head()

Original columns: 21
New columns added: 13
Total columns: 34

Final dataset shape: (336776, 34)

New columns preview:


| | is_cancelled | datetime | date | day_of_week | day_name | week_of_year | route | dep_hour_bin | is_delayed | is_early | total_delay | speed_mph | delay_severity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2013-01-01 05:00:00 | 2013-01-01 | 1 | Tuesday | 1 | EWR-IAH | Morning | 0 | 0 | 13.0 | 370.0 | Minor (1-15 min) |
| 1 | 0 | 2013-01-01 05:00:00 | 2013-01-01 | 1 | Tuesday | 1 | LGA-IAH | Morning | 0 | 0 | 24.0 | 374.3 | Minor (1-15 min) |
| 2 | 0 | 2013-01-01 05:00:00 | 2013-01-01 | 1 | Tuesday | 1 | JFK-MIA | Morning | 0 | 0 | 35.0 | 408.4 | Minor (1-15 min) |
| 3 | 0 | 2013-01-01 05:00:00 | 2013-01-01 | 1 | Tuesday | 1 | JFK-BQN | Morning | 0 | 0 | -19.0 | 516.7 | On Time / Early |
| 4 | 0 | 2013-01-01 06:00:00 | 2013-01-01 | 1 | Tuesday | 1 | LGA-ATL | Morning | 0 | 1 | -31.0 | 394.1 | On Time / Early |


In [12]:
# Check null summary on engineered dataset
null_final = df[new_cols].isnull().sum()
print('Null counts in engineered features:')
print(null_final[null_final > 0])

Null counts in engineered features:
total_delay    9430
speed_mph      9430
dtype: int64


## 7. Save Preprocessed Data

In [13]:
# Create processed directory if it doesn't exist
os.makedirs('../data/processed', exist_ok=True)

# Save full preprocessed dataset
output_path = '../data/processed/flights_processed.csv'
df.to_csv(output_path, index=False)
print(f'Preprocessed data saved to: {output_path}')
print(f'Shape: {df.shape}')

# Verify saved file
df_check = pd.read_csv(output_path)
print(f'Verification load shape: {df_check.shape}')
print('Save successful!')

Preprocessed data saved to: ../data/processed/flights_processed.csv
Shape: (336776, 34)
Verification load shape: (336776, 34)
Save successful!
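
The verification above checks the shape only. A CSV stores plain text, so a default `pd.read_csv` re-infers dtypes and the `datetime` column comes back as strings. The sketch below shows one way to restore dtypes on load, using an in-memory buffer and a hypothetical two-row frame in place of the real file:

```python
import io
import pandas as pd

# Hypothetical miniature of the processed frame
toy = pd.DataFrame({
    'datetime': pd.to_datetime(['2013-01-01 05:00:00', '2013-01-01 06:00:00']),
    'is_delayed': pd.array([0, None], dtype='Int64'),
})

buffer = io.StringIO()          # in-memory stand-in for the CSV file
toy.to_csv(buffer, index=False)

buffer.seek(0)
plain = pd.read_csv(buffer)     # default load: dtypes are re-inferred from text

buffer.seek(0)
typed = pd.read_csv(buffer, parse_dates=['datetime'], dtype={'is_delayed': 'Int64'})

print(plain.dtypes)
print(typed.dtypes)
```

Passing `parse_dates` and `dtype` when reloading `flights_processed.csv` keeps downstream notebooks from having to repeat the parsing step. A binary format such as Parquet would preserve dtypes without these arguments, at the cost of an extra dependency.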


## 8. Feature Dictionary

### Original Features

| Feature | Type | Description |
|---------|------|-------------|
| `id` | int | Row index |
| `year`, `month`, `day` | int | Flight date components |
| `dep_time` | float | Actual departure time (HHMM); null = cancelled |
| `sched_dep_time` | int | Scheduled departure time (HHMM) |
| `dep_delay` | float | Departure delay in minutes (negative = early) |
| `arr_time` | float | Actual arrival time (HHMM) |
| `sched_arr_time` | int | Scheduled arrival time (HHMM) |
| `arr_delay` | float | Arrival delay in minutes |
| `carrier` | str | IATA 2-letter carrier code |
| `flight` | int | Flight number |
| `tailnum` | str | Aircraft registration number |
| `origin` | str | Origin airport code (EWR, LGA, JFK) |
| `dest` | str | Destination airport code (105 unique) |
| `air_time` | float | Flight duration in minutes |
| `distance` | int | Distance in miles |
| `hour`, `minute` | int | Scheduled departure time components |
| `time_hour` | str | Scheduled departure timestamp string, resolved to the hour |
| `name` | str | Full airline name |

### Engineered Features

| Feature | Type | Description |
|---------|------|-------------|
| `is_cancelled` | int (0/1) | 1 if flight was cancelled (`dep_time` is null) |
| `datetime` | datetime | Parsed `time_hour` as datetime object |
| `date` | date | Flight date (YYYY-MM-DD) |
| `day_of_week` | int (0–6) | Day of week (0=Monday, 6=Sunday) |
| `day_name` | str | Day name (e.g., 'Monday') |
| `week_of_year` | int | ISO week number (1–52 for this data; Dec 30–31, 2013 fall in ISO week 1 of 2014) |
| `route` | str | `origin-dest` pair (e.g., 'EWR-IAH') |
| `dep_hour_bin` | str | Time-of-day bucket from scheduled `hour`: Morning (05–11), Afternoon (12–16), Evening (17–20), Night (21–04) |
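| `is_delayed` | Int64 | 1 if `dep_delay > 15`, else 0; cancelled flights are 0 because a null `dep_delay` compares as False |
| `is_early` | Int64 | 1 if `dep_delay < -5`, else 0; cancelled flights are 0 for the same reason |
| `total_delay` | float | `dep_delay + arr_delay` (can be negative); null when either delay is missing |
| `speed_mph` | float | `distance / (air_time / 60)` in mph; null when `air_time` is missing |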
| `delay_severity` | str | Categorical: On Time/Minor/Moderate/Severe/Extreme/Cancelled |
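
The type and null information in the tables above can be cross-checked against the DataFrame itself. A minimal sketch, using a hypothetical two-column frame in place of `df`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df with one integer flag and one float feature
toy = pd.DataFrame({
    'is_cancelled': [0, 1, 0],
    'total_delay': [13.0, np.nan, -19.0],
})

# One row per column: dtype, null count, and number of distinct non-null values
summary = pd.DataFrame({
    'dtype': toy.dtypes.astype(str),
    'n_null': toy.isna().sum(),
    'n_unique': toy.nunique(),
})
print(summary)
```

Running the same three lines on `df` gives a quick way to spot rows of the dictionary that have drifted from the data.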

## 9. Week 2 Summary

| Task | Status |
|------|--------|
| Handle nulls (cancelled flights) | Flagged with `is_cancelled` |
| Parse `time_hour` to datetime | `datetime`, `date` columns created |
| Extract temporal features | `day_of_week`, `day_name`, `week_of_year` |
| Create `route` feature | origin-destination pairs |
| Departure time bucketing | `dep_hour_bin` (Morning/Afternoon/Evening/Night) |
| Delay flags | `is_delayed`, `is_early`, `total_delay`, `delay_severity` |
| Speed feature | `speed_mph` |
| Save preprocessed data | `data/processed/flights_processed.csv` |
| Feature dictionary | Documented above |