# Predicting Flight Delays With Machine Learning in Python


This example uses classification models to predict flight cancellations (the label is the `Cancelled` column) with Bodo, an HPC-like platform. Flight data from 1988 is read from Bodo's public S3 bucket, then cleaned and processed. Some exploratory analysis follows to extract insights. Every step is **parallelized across multiple cores using Bodo**, which can be a straightforward way to make Python code run faster without many code changes.
The original example can be found [here](https://github.com/frenchlam/dask_CDSW/blob/master/03_Dask_ML-LargeDS.ipynb).

You can run the large-scale example on the [Bodo platform](https://platform.bodo.ai/account/login).

The results shown here come from running on 2 `c5.2xlarge` instances (8 cores, 32 GiB memory).

The dataset was downloaded from [here](https://github.com/frenchlam/dask_CDSW/tree/master/data) and saved in Parquet format in an S3 bucket (`s3://bodo-example-data/flights/flights_1988.pq/`).

An example using a CSV dataset can be found in the [Bodo-Examples Git repository](https://github.com/Bodo-inc/Bodo-examples/blob/master/ml/credit-card-fraud.ipynb).

Bodo parallelizes code that is marked in two ways: the `%%px` magic at the start of a cell runs that cell on every engine, and the `@bodo.jit` decorator compiles a function for parallel execution. Removing both and restarting the kernel runs the code without Bodo.

## Importing the Packages

These are the main packages we are going to work with:
 - Bodo to parallelize Python code automatically
 - Pandas to work with data
 - Scikit-learn to build and evaluate classification models

In [1]:
%%px
import bodo
import pandas as pd
import time
import numpy as np
print(f"Hello World from rank {bodo.get_rank()}. Total ranks={bodo.get_size()}")

Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>



[stdout:7] Hello World from rank 7. Total ranks=8


[stdout:3] Hello World from rank 3. Total ranks=8


[stdout:4] Hello World from rank 4. Total ranks=8


[stdout:2] Hello World from rank 2. Total ranks=8


[stdout:0] Hello World from rank 0. Total ranks=8


[stdout:1] Hello World from rank 1. Total ranks=8


[stdout:6] Hello World from rank 6. Total ranks=8


[stdout:5] Hello World from rank 5. Total ranks=8


## Part 1. Pre-processing in Pandas

### 1. Read flights dataset

In [2]:
%%px
@bodo.jit(distributed=["flight_df"], cache=True)
def read_flights(input_file):
    flight_df = pd.read_parquet(input_file, columns=['Month', 'DayofMonth', 'DayOfWeek', 'CRSDepTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'Origin', 'Dest','Cancelled'])        
    return flight_df

input_file = "s3://bodo-example-data/flights/flights_1988.pq/"
flight_df = read_flights(input_file)
if bodo.get_rank() == 0:
    display(flight_df.head())


[output:0]

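| | Month | DayofMonth | DayOfWeek | CRSDepTime | CRSArrTime | UniqueCarrier | FlightNum | Origin | Dest | Cancelled |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 9 | 6 | 1331 | 1435 | PI | 942 | SYR | BWI | 0 |
| 1 | 1 | 10 | 7 | 1331 | 1435 | PI | 942 | SYR | BWI | 0 |
| 2 | 1 | 11 | 1 | 1331 | 1435 | PI | 942 | SYR | BWI | 0 |
| 3 | 1 | 12 | 2 | 1331 | 1435 | PI | 942 | SYR | BWI | 0 |
| 4 | 1 | 13 | 3 | 1331 | 1435 | PI | 942 | SYR | BWI | 0 |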



### 2. Feature Engineering
1. Create routes from origin and destination

In [3]:
%%px
@bodo.jit(distributed=["flight_df"], cache=True)
def create_routes(flight_df):
    flight_df['route'] = flight_df['Origin'] + "_" + flight_df['Dest']
    # show the top 10 routes, ranked by number of flights
    top_routes = flight_df['route'].value_counts(ascending=False)
    print(top_routes.head(10))
    # focus on the 50 biggest routes, ranked by number of flights
    route_lst = top_routes.head(50)
    flight_df = flight_df[flight_df['route'].isin(route_lst.index)]
    return flight_df

flight_df = create_routes(flight_df)

[stdout:6] Series([], Name: route, dtype: int64)


[stdout:1] Series([], Name: route, dtype: int64)


[stdout:3] Series([], Name: route, dtype: int64)


[stdout:5] Series([], Name: route, dtype: int64)


[stdout:7] Series([], Name: route, dtype: int64)


[stdout:4] Series([], Name: route, dtype: int64)


[stdout:2] Series([], Name: route, dtype: int64)


[stdout:0] LAX_SFO    20750
SFO_LAX    20658
LAX_PHX    13461
PHX_LAX    13273
LAX_LAS    12175
LGA_BOS    12027
LAS_LAX    11801
SJC_LAX    11535
LAX_SJC    11292
BOS_LGA    11141
Name: route, dtype: int64
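
The route-building and top-N filtering step can be tried in plain pandas, without Bodo. The sketch below is a minimal illustration on a hypothetical toy frame (the rows and the `top_n` value are made up for the example), not the notebook's actual data.

```python
import pandas as pd

# Hypothetical toy data standing in for the flights dataset.
flights = pd.DataFrame({
    "Origin": ["LAX", "LAX", "SFO", "LAX", "BOS", "SFO"],
    "Dest":   ["SFO", "SFO", "LAX", "PHX", "LGA", "LAX"],
})

# Concatenate origin and destination into a single route key.
flights["route"] = flights["Origin"] + "_" + flights["Dest"]

# value_counts() sorts by frequency, most common first.
route_counts = flights["route"].value_counts()

# Keep only rows whose route is among the top_n most frequent ones.
top_n = 2  # the notebook uses 50
top_routes = route_counts.head(top_n).index
filtered = flights[flights["route"].isin(top_routes)]

print(filtered)
```

Filtering with `isin` on the index of the counts keeps every flight on a selected route, so the row count after filtering is the sum of the selected routes' counts.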


2. Look at their cancellations

In [4]:
%%px
@bodo.jit(distributed=["flight_df"], cache=True)
def check_cancelations(flight_df):
    res = flight_df[['route', 'Cancelled', 'Month']].groupby(by='route')\
         .agg({'Month':'size', 'Cancelled':'sum'})\
        .rename(columns={'Month':'count','Cancelled':'nb_cancelled'}) \
        .reset_index()\
        .sort_values(['count'], ascending=False)
    # head() returns the first 5 rows, i.e. the 5 busiest routes
    return res.head()

top_routes_cancellations = check_cancelations(flight_df)

if bodo.get_rank() == 0:
    display(top_routes_cancellations)


[output:0]

| | route | count | nb_cancelled |
|---|---|---|---|
| 0 | LAX_SFO | 20750 | 228 |
| 32 | SFO_LAX | 20658 | 206 |
| 43 | LAX_PHX | 13461 | 78 |
| 29 | PHX_LAX | 13273 | 71 |
| 35 | LAX_LAS | 12175 | 58 |
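
Raw counts are easier to compare as a rate: for `LAX_SFO`, 228 cancellations out of 20750 flights is about 1.1%. The following sketch shows one way to compute such a rate per route in plain pandas, using named aggregation. The toy rows and the `cancel_rate` column are assumptions added for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per flight, Cancelled is 0 or 1.
flights = pd.DataFrame({
    "route": ["LAX_SFO", "LAX_SFO", "LAX_SFO", "SFO_LAX", "SFO_LAX"],
    "Cancelled": [0, 1, 0, 0, 0],
})

summary = (
    flights.groupby("route")
    # Named aggregation: output column = (input column, aggregation).
    .agg(count=("Cancelled", "size"), nb_cancelled=("Cancelled", "sum"))
    .reset_index()
    .sort_values("count", ascending=False)
)

# Share of flights on each route that were cancelled.
summary["cancel_rate"] = summary["nb_cancelled"] / summary["count"]

print(summary)
```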


In [5]:
%%px
@bodo.jit(distributed=["flight_df"])
def print_info(flight_df):
    print(flight_df.shape)
print_info(flight_df)

[stdout:0] (487253, 11)


3. Quick sanity check: count the number of null values

In [6]:
%%px
@bodo.jit(distributed=["flight_df"])
def check_count(flight_df):
    
    print(flight_df.isnull().sum())
    
check_count(flight_df)


[stdout:0] Month            0
DayofMonth       0
DayOfWeek        0
CRSDepTime       0
CRSArrTime       0
UniqueCarrier    0
FlightNum        0
Origin           0
Dest             0
Cancelled        0
route            0
dtype: int64


### 3. Feature and label encoding

#### 1. Encode Labels using Cancelled column

In [7]:
%%px
@bodo.jit(distributed=["flight_df"], cache=True)
def encode_labels(flight_df):
    flight_df.Cancelled = pd.Categorical(flight_df.Cancelled)
    flight_df['Label'] = flight_df.Cancelled.cat.codes
    flight_df.drop(['Cancelled'], axis=1, inplace=True)
    return flight_df

flight_df = encode_labels(flight_df)
if bodo.get_rank() == 0:
    display(flight_df.head())


[output:0]

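| | Month | DayofMonth | DayOfWeek | CRSDepTime | CRSArrTime | UniqueCarrier | FlightNum | Origin | Dest | route | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 786 | 1 | 1 | 5 | 955 | 1035 | PS | 1400 | LAX | SAN | LAX_SAN | 0 |
| 787 | 1 | 2 | 6 | 955 | 1035 | PS | 1400 | LAX | SAN | LAX_SAN | 0 |
| 788 | 1 | 4 | 1 | 955 | 1035 | PS | 1400 | LAX | SAN | LAX_SAN | 0 |
| 789 | 1 | 5 | 2 | 955 | 1035 | PS | 1400 | LAX | SAN | LAX_SAN | 0 |
| 790 | 1 | 6 | 3 | 955 | 1035 | PS | 1400 | LAX | SAN | LAX_SAN | 1 |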
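
To see what `pd.Categorical(...).codes` produces, here is a small standalone sketch in plain pandas on hypothetical values. Categories are ordered by sorted value and numbered from 0, so a column that already holds only 0 and 1 keeps the same values after encoding.

```python
import pandas as pd

# Hypothetical 0/1 cancellation flags.
cancelled = pd.Series([0, 1, 0, 0, 1])
label_codes = pd.Categorical(cancelled).codes

# The same mechanism applied to string values: sorted categories get codes 0, 1, ...
status_codes = pd.Categorical(["no", "yes", "no"]).codes

print(list(label_codes), list(status_codes))
```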


#### 2. Feature Encoding

This step is needed because scikit-learn estimators require numerical input.

a. Get airport unique values

b. Encode origin, destination, and route features

In [8]:
%%px

@bodo.jit(distributed=["flight_df"], cache=True)
def get_airport_list(flight_df):
    airport_list = np.sort((pd.concat((flight_df['Origin'], flight_df['Dest']))).unique())
    return airport_list

airport_list = get_airport_list(flight_df)
if bodo.get_rank() == 0:
    display(airport_list)


[output:0]

array(['ATL', 'BOS', 'DAL', 'DCA', 'DEN', 'DFW', 'DTW', 'EWR', 'HOU',
       'IAH', 'LAS', 'LAX', 'LGA', 'MCO', 'MIA', 'MSP', 'ORD', 'PDX',
       'PHX', 'SAN', 'SAT', 'SEA', 'SFO', 'SJC', 'STL'], dtype=object)
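
The reason for collecting one combined airport list is that origin and destination should share a single code table, so that `LAX` maps to the same integer in both columns. `LabelEncoder` numbers its classes in sorted order. The sketch below mimics that with a plain dictionary, using the airport list shown above. The `encode` helper is a hypothetical name introduced for the illustration.

```python
# Airport codes copied from the array displayed above.
airport_list = [
    'ATL', 'BOS', 'DAL', 'DCA', 'DEN', 'DFW', 'DTW', 'EWR', 'HOU',
    'IAH', 'LAS', 'LAX', 'LGA', 'MCO', 'MIA', 'MSP', 'ORD', 'PDX',
    'PHX', 'SAN', 'SAT', 'SEA', 'SFO', 'SJC', 'STL',
]

# Map each distinct value to its position in sorted order,
# which is how LabelEncoder orders its classes.
airport_to_code = {
    airport: code for code, airport in enumerate(sorted(set(airport_list)))
}

def encode(values, mapping):
    """Translate a list of airport codes into integers (hypothetical helper)."""
    return [mapping[value] for value in values]

# One shared mapping keeps origin and destination codes consistent.
origin_codes = encode(["LAX", "ORD"], airport_to_code)
dest_codes = encode(["SAN", "LAX"], airport_to_code)

print(origin_codes, dest_codes)
```

Fitting separate encoders on `Origin` and `Dest` would only give matching codes if both columns happened to contain exactly the same set of airports.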

In [9]:
%%px
from sklearn.preprocessing import LabelEncoder
@bodo.jit(distributed=["flight_df", "airport_list"], cache=True)
def encode_features(flight_df, airport_list):
    t1 = time.time()    
    # encode airlines 
    le_carrier = LabelEncoder()
    flight_df['Carrier_encoded'] = pd.Series(le_carrier.fit_transform(flight_df['UniqueCarrier'].values))
    # Encode airports : Using same encoder for both origin and dest ( consistent encoding of airports )
    le_airport = LabelEncoder()
    le_airport.fit(airport_list)
    flight_df['Origin_encoded'] = pd.Series(le_airport.transform(flight_df['Origin']))
    flight_df['Dest_encoded'] = pd.Series(le_airport.transform(flight_df['Dest']))
    # Encode routes 
    le_route = LabelEncoder()
    flight_df['route_encoded'] = pd.Series(le_route.fit_transform(flight_df['route'].values))
    print("Encoding time: ", (time.time()-t1), " sec")
    return flight_df

flight_df = encode_features(flight_df, airport_list)
if bodo.get_rank() == 0:
    display(flight_df.head())

[stdout:0] Encoding time:  0.5405858317162711  sec


[output:0]

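| | Month | DayofMonth | DayOfWeek | CRSDepTime | CRSArrTime | UniqueCarrier | FlightNum | Origin | Dest | route | Label | Carrier_encoded | Origin_encoded | Dest_encoded | route_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 786 | 1 | 1 | 5 | 955 | 1035 | PS | 1400 | LAX | SAN | LAX_SAN | 0 | 9 | 11 | 19 | 18 |
| 787 | 1 | 2 | 6 | 955 | 1035 | PS | 1400 | LAX | SAN | LAX_SAN | 0 | 9 | 11 | 19 | 18 |
| 788 | 1 | 4 | 1 | 955 | 1035 | PS | 1400 | LAX | SAN | LAX_SAN | 0 | 9 | 11 | 19 | 18 |
| 789 | 1 | 5 | 2 | 955 | 1035 | PS | 1400 | LAX | SAN | LAX_SAN | 0 | 9 | 11 | 19 | 18 |
| 790 | 1 | 6 | 3 | 955 | 1035 | PS | 1400 | LAX | SAN | LAX_SAN | 1 | 9 | 11 | 19 | 18 |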


In [10]:
%%px
@bodo.jit(distributed=["flight_df"], cache=True)
def sample(flight_df):
    print(flight_df[['UniqueCarrier','Carrier_encoded','Origin','Origin_encoded',
           'Dest', 'Dest_encoded', 'route', 'route_encoded' ]].sample(10))
    
sample(flight_df)

[stdout:0]         UniqueCarrier  Carrier_encoded Origin  Origin_encoded Dest  \
749691             AA                0    ORD              16  STL   
1183420            AA                0    IAH               9  DFW   
1768911            UA               11    DEN               4  ORD   
2237141            UA               11    LAX              11  SFO   
2316550            NW                6    ORD              16  MSP   
3167804            HP                5    PHX              18  LAS   
3070695            UA               11    ORD              16  LGA   
4374513            UA               11    SFO              22  LAX   
4637595            EA                4    LGA              12  DCA   
4642272            HP                5    SAN              19  PHX   

         Dest_encoded    route  route_encoded  
749691             24  ORD_STL             33  
1183420             5  IAH_DFW             13  
1768911            16  DEN_ORD              8  
2237141            22  LAX

In [14]:
%%px
from sklearn.model_selection import train_test_split
@bodo.jit(distributed=["flight_df", "X_train", "X_test", "y_train", "y_test"], cache=True)
def split_data(flight_df):
    t1 = time.time()
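    # NOTE: 'Label' is not in the drop list, so the target column stays among
    # the features in X_train/X_test; add 'Label' to the list to avoid this leak.
    X_train, X_test, y_train, y_test = train_test_split(flight_df.drop(['UniqueCarrier','Origin','Dest','route'],axis=1),
                                                    flight_df['Label'], 
                                                    test_size=0.3, train_size=0.7,
                                                    random_state=100)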
    print("Data splitting time: ", (time.time()-t1), " sec")    

    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = split_data(flight_df)

[stdout:0] Data splitting time:  0.05209195460520277  sec
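
When splitting, the feature matrix should not contain the target column, otherwise a model can simply read the answer. The sketch below shows a split that removes `Label` from the features, using plain scikit-learn without Bodo. The toy frame and its values are hypothetical, and only `Label` and `route_encoded` reuse column names from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical encoded frame with the same kind of columns as flight_df.
df = pd.DataFrame({
    "Month": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "route_encoded": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "Label": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Features exclude the target; the target is kept as a separate Series.
X = df.drop(columns=["Label"])
y = df["Label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)

print(X_train.columns.tolist(), len(X_train), len(X_test))
```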


## Part 2: Model Training - Using Scikit-learn

### 1. RandomForestClassifier

In [16]:
%%px
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score # evaluation metric
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def rf_model(X_train, X_test, y_train, y_test):
    start = time.time()
    rf = RandomForestClassifier()
    rf.fit(X_train.to_numpy(), y_train.values)
    y_pred = rf.predict(X_test.values)
    print("RandomForestClassifier fit and predict time: ", time.time()-start)    
    print('Accuracy score {}'.format(accuracy_score(y_test, y_pred)))

rf_model(X_train, X_test, y_train, y_test)


[stdout:0] RandomForestClassifier fit and predict time:  6.080242309267305
Accuracy score 1.0

A perfect score here points to target leakage rather than model quality: `split_data` drops the string columns but keeps `Label` among the features, so the forest can read the answer directly from its input.


### 2. Logistic Regression

In [18]:
%%px

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score # evaluation metric
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def lr_model(X_train, X_test, y_train, y_test):
    start = time.time()
    lr = LogisticRegression()
    lr.fit(X_train.to_numpy(), y_train.values)
    y_pred = lr.predict(X_test.to_numpy())
    print("Logistic Regression fit and predict time: ", time.time()-start)    
    print('Accuracy score {}'.format(accuracy_score(y_test, y_pred)))

lr_model(X_train, X_test, y_train, y_test)


[stdout:0] Logistic Regression fit and predict time:  0.1333144736495342
Accuracy score 0.9815770030647986
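
Accuracy alone can be misleading when one class is rare. The route summary earlier in this notebook shows roughly 1% of flights cancelled on the busiest routes, so a model that always predicts "not cancelled" would already score very high. The sketch below computes that majority-class baseline on hypothetical labels. The function name and the 98/2 split are assumptions for illustration, not values from the notebook.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical test labels: 98 flights not cancelled, 2 cancelled.
y_test_example = [0] * 98 + [1] * 2
baseline = majority_baseline_accuracy(y_test_example)

print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Comparing a model's accuracy against this baseline, or reporting precision and recall for the cancelled class, gives a clearer picture of whether the model has learned anything useful.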


