# Predicting Flight Delays

This example shows how to use classification models to predict flight delays. The prediction target is the `Cancelled` column, i.e. whether a flight was cancelled.
The original example can be found [here](https://github.com/frenchlam/dask_CDSW/blob/master/03_Dask_ML-LargeDS.ipynb) (the dataset is [here](https://github.com/frenchlam/dask_CDSW/blob/master/data/1988.csv.bz2)).

### Start an IPyParallel cluster 
Run the following code in a cell to start an IPyParallel cluster. This example uses up to 8 physical cores.

In [1]:
import os
if os.environ.get("BODO_PLATFORM_WORKSPACE_UUID",'NA') == 'NA':
    import ipyparallel as ipp
    import psutil
    # use the number of physical cores, capped at 8
    n = min(psutil.cpu_count(logical=False), 8)
    rc = ipp.Cluster(engines='mpi', n=n).start_and_connect_sync(activate=True)

Starting 8 engines with <class 'ipyparallel.cluster.launcher.MPIEngineSetLauncher'>
100%|██████████| 8/8 [00:07<00:00,  1.13engine/s]


### Verifying your setup
Run the following code to verify that your IPyParallel cluster is set up correctly:

In [2]:
%%px
import bodo
print(f"Hello World from rank {bodo.get_rank()}. Total ranks={bodo.get_size()}")

[stdout:3] Hello World from rank 3. Total ranks=8


[stdout:6] Hello World from rank 6. Total ranks=8


[stdout:5] Hello World from rank 5. Total ranks=8


[stdout:0] Hello World from rank 0. Total ranks=8


[stdout:7] Hello World from rank 7. Total ranks=8


[stdout:4] Hello World from rank 4. Total ranks=8


[stdout:1] Hello World from rank 1. Total ranks=8


[stdout:2] Hello World from rank 2. Total ranks=8


## Importing the Packages

These are the main packages we are going to work with:
 - Bodo to parallelize Python code automatically
 - Pandas to work with data
 - Scikit-learn to build and evaluate classification models

In [3]:
%%px
import pandas as pd
import time
import numpy as np

## Part 1. Pre-processing in Pandas

### 1. Read flights dataset

In [4]:
%%px
@bodo.jit(cache=True)
def read_flights(input_file):
    flight_df = pd.read_csv(input_file, sep=',', header=0,
        usecols=['Month', 'DayofMonth', 'DayOfWeek', 'CRSDepTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'Origin', 'Dest','Cancelled'])    
    print(flight_df.head())    
    return flight_df

input_file = "s3://bodo-example-data/flights/1988.csv.bz2"
flight_df = read_flights(input_file)


%px:   0%|          | 0/8 [01:18<?, ?tasks/s]

[stdout:0]    Month  DayofMonth  DayOfWeek  CRSDepTime  CRSArrTime UniqueCarrier  \
0      1           9          6        1331        1435            PI   
1      1          10          7        1331        1435            PI   
2      1          11          1        1331        1435            PI   
3      1          12          2        1331        1435            PI   
4      1          13          3        1331        1435            PI   

   FlightNum Origin Dest  Cancelled  
0        942    SYR  BWI          0  
1        942    SYR  BWI          0  
2        942    SYR  BWI          0  
3        942    SYR  BWI          0  
4        942    SYR  BWI          0  


%px: 100%|██████████| 8/8 [01:19<00:00,  9.90s/tasks]


### 2. Feature Engineering
1. Create routes from origin and destination

In [5]:
%%px
@bodo.jit(cache=True)
def create_routes(flight_df):
    flight_df['route'] = flight_df['Origin'] + "_" + flight_df['Dest']
    # show the top 10 routes, as defined by number of flights
    top_routes = flight_df['route'].value_counts(ascending=False)
    print(top_routes.head(10))
    # focus on the 50 biggest routes, as defined by number of flights
    route_lst=top_routes.head(50)
    flight_df = flight_df[flight_df['route'].isin(route_lst.index)]
    return flight_df

flight_df = create_routes(flight_df)

%px:   0%|          | 0/8 [00:15<?, ?tasks/s]

[stdout:0] LAX_SFO    20750
SFO_LAX    20658
LAX_PHX    13461
PHX_LAX    13273
LAX_LAS    12175
LGA_BOS    12027
LAS_LAX    11801
SJC_LAX    11535
LAX_SJC    11292
BOS_LGA    11141
Name: route, dtype: int64


%px: 100%|██████████| 8/8 [00:15<00:00,  1.90s/tasks]
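
The route-filtering step can be illustrated on a toy frame in plain pandas (no Bodo decorator). This is a minimal sketch: the rows and the `top_n` variable are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data; these rows are illustrative, not taken from the 1988 dataset.
flights = pd.DataFrame({
    "Origin": ["LAX", "LAX", "SFO", "LAX", "BOS", "SFO"],
    "Dest":   ["SFO", "SFO", "LAX", "PHX", "LGA", "LAX"],
})

# Build a route key by concatenating origin and destination.
flights["route"] = flights["Origin"] + "_" + flights["Dest"]

# Count flights per route; value_counts sorts in descending order by default.
route_counts = flights["route"].value_counts()

# Keep only the flights that belong to the top_n busiest routes.
top_n = 2  # hypothetical cutoff; the cell above uses 50
top_routes = route_counts.head(top_n).index
filtered = flights[flights["route"].isin(top_routes)]
```

Filtering with `isin` on the index of the counts keeps every row of the selected routes, so the result is still one row per flight rather than one row per route.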


2. Look at their cancellations

In [6]:
%%px
@bodo.jit(cache=True)
def check_cancelations(flight_df):
    res = flight_df[['route', 'Cancelled', 'Month']].groupby(by='route')\
         .agg({'Month':'size', 'Cancelled':'sum'})\
        .rename(columns={'Month':'count','Cancelled':'nb_cancelled'}) \
        .reset_index()\
        .sort_values(['count'], ascending=False)
    print(res.head(10))

check_cancelations(flight_df)

%px:   0%|          | 0/8 [00:03<?, ?tasks/s]

[stdout:0]       route  count  nb_cancelled
0   LAX_SFO  20750           228
32  SFO_LAX  20658           206
43  LAX_PHX  13461            78
29  PHX_LAX  13273            71
35  LAX_LAS  12175            58
41  LGA_BOS  12027           287


[stdout:1]       route  count  nb_cancelled
19  LAS_LAX  11801            47
10  SJC_LAX  11535            71
42  LAX_SJC  11292            71
24  BOS_LGA  11141           243


%px: 100%|██████████| 8/8 [00:03<00:00,  2.38tasks/s]
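
The same per-route summary can be sketched in plain pandas with named aggregation, which avoids the separate `rename` step. The toy rows and the extra `cancel_rate` column are assumptions added for illustration; this variant has not been checked under `bodo.jit`.

```python
import pandas as pd

# Toy data; values are illustrative only.
flights = pd.DataFrame({
    "route":     ["LAX_SFO", "LAX_SFO", "LAX_SFO", "SFO_LAX", "SFO_LAX"],
    "Cancelled": [0, 1, 0, 0, 0],
})

# Number of flights and number of cancellations per route, busiest first.
summary = (
    flights.groupby("route")
    .agg(count=("Cancelled", "size"), nb_cancelled=("Cancelled", "sum"))
    .reset_index()
    .sort_values("count", ascending=False)
)

# Hypothetical extra column: share of cancelled flights on each route.
summary["cancel_rate"] = summary["nb_cancelled"] / summary["count"]
```

A rate column makes routes of different sizes easier to compare than raw cancellation counts.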


In [7]:
%%px
@bodo.jit
def print_info(flight_df):
    print(flight_df.shape)
print_info(flight_df)

[stdout:0] (487253, 11)


3. Quick sanity check: count the number of null values

In [8]:
%%px
@bodo.jit
def check_count(flight_df):
    
    print(flight_df.isnull().sum())
    
check_count(flight_df)

%px:   0%|          | 0/8 [00:10<?, ?tasks/s]

[stdout:0] Month            0
DayofMonth       0
DayOfWeek        0
CRSDepTime       0
CRSArrTime       0
UniqueCarrier    0
FlightNum        0
Origin           0
Dest             0
Cancelled        0
route            0
dtype: int64


%px: 100%|██████████| 8/8 [00:10<00:00,  1.36s/tasks]


### 3. Feature and label encoding

#### 1. Encode Labels using Cancelled column

In [9]:
%%px
@bodo.jit(cache=True)
def encode_labels(flight_df):
    flight_df.Cancelled = pd.Categorical(flight_df.Cancelled)
    flight_df['Label'] = flight_df.Cancelled.cat.codes
    flight_df.drop(['Cancelled'], axis=1, inplace=True)
    return flight_df

flight_df = encode_labels(flight_df)

%px: 100%|██████████| 8/8 [00:08<00:00,  1.05s/tasks]
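
To see what `pd.Categorical(...).codes` produces, here is a small plain-pandas sketch on made-up values. The categories are the sorted distinct values, and each code is the position of a value in that list.

```python
import pandas as pd

# For a 0/1 column the codes coincide with the original values.
cancelled = pd.Series([0, 1, 0, 0, 1], name="Cancelled")
cat = pd.Categorical(cancelled)
labels = pd.Series(cat.codes, name="Label")

# For non-numeric values the mapping is easier to see.
status = pd.Categorical(["no", "yes", "no"])
status_codes = list(status.codes)
```

Because `Cancelled` already holds only 0 and 1 in the toy data, the encoding leaves the values unchanged and mainly fixes the column name and dtype.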


#### 2. Feature Encoding

This is needed because scikit-learn only supports numerical values.

a. Get airport unique values

b. Encode origin, destination, and route features

In [10]:
%%px
import numpy as np

@bodo.jit(cache=True)
def get_airport_list(flight_df):
    airport_list = np.sort((pd.concat((flight_df['Origin'], flight_df['Dest']))).unique())
    return airport_list

airport_list = get_airport_list(flight_df)

%px: 100%|██████████| 8/8 [00:05<00:00,  1.56tasks/s]


In [11]:
%%px
from sklearn.preprocessing import LabelEncoder
@bodo.jit(cache=True)
def encode_features(flight_df, airport_list):
    t1 = time.time()    
    # encode airlines 
    le_carrier = LabelEncoder()
    flight_df['Carrier_encoded'] = pd.Series(le_carrier.fit_transform(flight_df['UniqueCarrier'].values))
    # encode airports: use the same encoder for both origin and dest (consistent encoding of airports)
    le_airport = LabelEncoder()
    le_airport.fit(airport_list)
    flight_df['Origin_encoded'] = pd.Series(le_airport.transform(flight_df['Origin']))
    flight_df['Dest_encoded'] = pd.Series(le_airport.transform(flight_df['Dest']))
    # Encode routes 
    le_route = LabelEncoder()
    flight_df['route_encoded'] = pd.Series(le_route.fit_transform(flight_df['route'].values))
    print("Encoding time: ", (time.time()-t1), " sec")
    return flight_df

flight_df = encode_features(flight_df, airport_list)

%px:   0%|          | 0/8 [00:14<?, ?tasks/s]

[stdout:0] Encoding time:  0.12577081399922463  sec


%px: 100%|██████████| 8/8 [00:14<00:00,  1.82s/tasks]
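
The reason for fitting one airport encoder on the combined list is that a given airport then gets the same code in both columns. The sketch below shows the idea with a plain dictionary instead of `LabelEncoder`; the toy rows and the `airport_code` name are assumptions. The codes follow sorted order, which is also how `LabelEncoder` orders its `classes_`.

```python
import pandas as pd

# Toy data; rows are illustrative only.
flights = pd.DataFrame({
    "Origin": ["SFO", "LAX", "BOS"],
    "Dest":   ["LAX", "PHX", "LGA"],
})

# One sorted vocabulary covering both columns.
airports = sorted(set(flights["Origin"]) | set(flights["Dest"]))

# Hypothetical lookup table: airport -> integer code.
airport_code = {airport: i for i, airport in enumerate(airports)}

# Apply the same mapping to both columns.
flights["Origin_encoded"] = flights["Origin"].map(airport_code)
flights["Dest_encoded"] = flights["Dest"].map(airport_code)
```

Fitting separate encoders per column would instead assign codes independently, so `LAX` as an origin and `LAX` as a destination could end up with different integers.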


In [12]:
%%px
@bodo.jit(cache=True)
def sample(flight_df):
    print(flight_df[['UniqueCarrier','Carrier_encoded','Origin','Origin_encoded',
           'Dest', 'Dest_encoded', 'route', 'route_encoded' ]].sample(10))
    
sample(flight_df)

%px:   0%|          | 0/8 [00:08<?, ?tasks/s]

[stdout:0]         UniqueCarrier  Carrier_encoded Origin  Origin_encoded Dest  \
504561             UA               11    SFO              22  LAX   
89184              WN               13    HOU               8  DAL   
896267             UA               11    ORD              16  DTW   
703945             DL                3    SAN              19  LAX   
2510421            AA                0    DEN               4  DFW   
2396774            CO                2    DFW               5  IAH   
2211575            UA               11    ORD              16  DEN   
2198174            UA               11    LGA              12  ORD   
3182400            HP                5    LAX              11  PHX   
4801132            UA               11    SFO              22  SEA   

         Dest_encoded    route  route_encoded  
504561             11  SFO_LAX             45  
89184               2  HOU_DAL             12  
896267              6  ORD_DTW             29  
703945             11  SAN

%px: 100%|██████████| 8/8 [00:08<00:00,  1.08s/tasks]


In [13]:
%%px
from sklearn.model_selection import train_test_split
@bodo.jit(cache=True)
def split_data(flight_df):
    t1 = time.time()
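    # NOTE: 'Label' is not in the drop list, so the target column stays in the
    # feature matrix. This likely explains the perfect RandomForest accuracy
    # reported below; adding 'Label' to the drop list would remove the leak.
    X_train, X_test, y_train, y_test = train_test_split(flight_df.drop(['UniqueCarrier','Origin','Dest','route'],axis=1),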
                                                    flight_df['Label'], 
                                                    test_size=0.3, train_size=0.7,
                                                    random_state=100)
    print("Data splitting time: ", (time.time()-t1), " sec")    

    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = split_data(flight_df)

%px:   0%|          | 0/8 [00:20<?, ?tasks/s]

[stdout:0] Data splitting time:  0.18990181499975733  sec


%px: 100%|██████████| 8/8 [00:20<00:00,  2.55s/tasks]
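
In the cell above, `Label` is not among the dropped columns, so it remains in the feature matrix. A minimal sketch of a split that keeps the label out of the features is shown below. It uses `DataFrame.sample` as a stand-in for `train_test_split`, and the toy frame and the `non_features` name are assumptions for illustration.

```python
import pandas as pd

# Toy data; values are illustrative only.
df = pd.DataFrame({
    "Month": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "Origin": ["LAX", "SFO"] * 5,
    "Origin_encoded": [0, 1] * 5,
    "Label": [0, 0, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Drop the raw string column and the label from the features.
non_features = ["Origin", "Label"]  # hypothetical name
X = df.drop(columns=non_features)
y = df["Label"]

# 70/30 split: sample 70% of the row index for training, keep the rest for testing.
train_idx = X.sample(frac=0.7, random_state=100).index
X_train, y_train = X.loc[train_idx], y.loc[train_idx]
X_test, y_test = X.drop(index=train_idx), y.drop(index=train_idx)
```

With the label excluded, a model can only score well by learning from the remaining columns.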


## Part 2. Model Training Using Scikit-learn

### 1. RandomForestClassifier

In [14]:
%%px
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score # evaluation metric
@bodo.jit(cache=True)
def rf_model(X_train, X_test, y_train, y_test):
    start = time.time()
    rf = RandomForestClassifier()
    rf.fit(X_train.to_numpy(), y_train.values)
    y_pred = rf.predict(X_test)
    print("RandomForestClassifier fit and predict time: ", time.time()-start)    
    print('Accuracy score {}'.format(accuracy_score(y_test, y_pred)))

rf_model(X_train, X_test, y_train, y_test)

%px:   0%|          | 0/8 [00:15<?, ?tasks/s]

%px: 100%|██████████| 8/8 [00:15<00:00,  1.91s/tasks]


[stdout:0] RandomForestClassifier fit and predict time:  6.684258414000396
Accuracy score 1.0


### 2. Logistic Regression

In [15]:
%%px

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score  # evaluation metric
@bodo.jit(cache=True)
def lr_model(X_train, X_test, y_train, y_test):
    start = time.time()
    lr = LogisticRegression()
    lr.fit(X_train, y_train.values)
    y_pred = lr.predict(X_test)
    print("Logistic Regression fit and predict time: ", time.time()-start)    
    print('Accuracy score {}'.format(accuracy_score(y_test, y_pred)))

lr_model(X_train, X_test, y_train, y_test)

%px:   0%|          | 0/8 [00:03<?, ?tasks/s]


[stdout:0] Logistic Regression fit and predict time:  0.451817635000225
Accuracy score 0.9815770030647986


%px: 100%|██████████| 8/8 [00:03<00:00,  2.03tasks/s]
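
Accuracy can be hard to interpret when one class is rare. In the cancellation table above, for example, 228 of 20750 flights on `LAX_SFO` were cancelled (about 1.1%). The sketch below computes the accuracy of a trivial model that always predicts the majority class; the label counts are hypothetical and chosen only to illustrate the calculation.

```python
import numpy as np

# Hypothetical test labels: 2 cancelled flights out of 100.
y_test = np.array([1] * 2 + [0] * 98)

# A trivial "model" that always predicts the most frequent class.
majority_class = np.bincount(y_test).argmax()
y_pred = np.full_like(y_test, majority_class)

# Fraction of correct predictions for this baseline.
baseline_accuracy = (y_pred == y_test).mean()
```

Comparing a model's accuracy with this kind of baseline gives a rough sense of how much it adds beyond the class imbalance itself.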


In [16]:
# To stop the cluster run the following command. 
rc.cluster.stop_cluster_sync()

Stopping controller
Controller stopped: {'exit_code': 0, 'pid': 49007, 'identifier': 'ipcontroller-1652912636-2mh3-48843'}
Stopping engine(s): 1652912637
engine set stopped 1652912637: {'exit_code': 0, 'pid': 49019, 'identifier': 'ipengine-1652912636-2mh3-1652912637-48843'}
