# Predicting Flight Delays

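This example uses scikit-learn classifiers (random forest and logistic regression) to predict whether a flight is cancelled, using the `Cancelled` column as the label.
The original example can be found [here](https://github.com/frenchlam/dask_CDSW/blob/master/03_Dask_ML-LargeDS.ipynb).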

### Notes on running this example:

By default, the example runs with Bodo, so the data is distributed in chunks across processes.

The current results are based on running on one **c5.18xlarge** instance (36 cores, 144GiB memory).

The dataset can be downloaded from [here](https://github.com/frenchlam/dask_CDSW/blob/master/data/1988.csv.bz2) or from the S3 bucket `s3://bodo-examples-data/flights/1988.csv.bz2`.

To run the code:
1. Make sure you add your AWS account credentials to access the data (if using data from the S3 bucket).
2. If you want to run the example using pandas only (without Bodo):
    1. Comment out the magic expression (`%%px`) and the Bodo decorator (`@bodo.jit`) in all the code cells.
    2. Then, re-run cells from the beginning.
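
As a rough illustration of the pandas-only variant, a cell becomes the same function with the `%%px` line and the `@bodo.jit` decorator removed. The sketch below assumes a small in-memory CSV (`sample_csv` is made up for illustration and carries only a few of the real columns) in place of the actual dataset.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for 1988.csv.bz2
sample_csv = io.StringIO(
    "Month,DayofMonth,Origin,Dest,Cancelled\n"
    "1,9,SYR,BWI,0\n"
    "1,10,SYR,BWI,1\n"
)


# No %%px magic and no @bodo.jit decorator: plain pandas in a single process
def read_flights(input_file):
    flight_df = pd.read_csv(
        input_file,
        sep=",",
        header=0,
        usecols=["Month", "Origin", "Dest", "Cancelled"],
    )
    return flight_df


flight_df = read_flights(sample_csv)
print(flight_df.shape)
```

The function body is unchanged from what a Bodo cell would contain; only the magic and the decorator differ.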

## Importing the Packages

These are the main packages we are going to work with:
 - Bodo to parallelize Python code automatically
 - Pandas to work with data
 - Scikit-learn to build and evaluate classification models

In [1]:
%%px
import pandas as pd
import bodo
import time
import numpy as np

In [2]:
%%px
import os

os.environ["AWS_ACCESS_KEY_ID"] = "your_aws_access_key_id"
os.environ["AWS_SECRET_ACCESS_KEY"] = "your_aws_secret_access_key"
os.environ["AWS_DEFAULT_REGION"] = "us-east-2"


## Part 1. Pre-processing in Pandas

### 1. Read flights dataset

In [3]:
%%px
@bodo.jit(distributed=["flight_df"], cache=True)
def read_flights(input_file):
    flight_df = pd.read_csv(input_file, sep=',', header=0, usecols=['Month', 'DayofMonth', 'DayOfWeek', 'CRSDepTime', 'CRSArrTime', 'UniqueCarrier', 'FlightNum', 'Origin', 'Dest','Cancelled'])    
    print(flight_df.head())    
    return flight_df

input_file = "s3://bodo-examples-data/flights/1988.csv.bz2"
flight_df = read_flights(input_file)


[stdout:0] 
   Month  DayofMonth  DayOfWeek  CRSDepTime  CRSArrTime UniqueCarrier  \
0      1           9          6        1331        1435            PI   
1      1          10          7        1331        1435            PI   
2      1          11          1        1331        1435            PI   
3      1          12          2        1331        1435            PI   
4      1          13          3        1331        1435            PI   

   FlightNum Origin Dest  Cancelled  
0        942    SYR  BWI          0  
1        942    SYR  BWI          0  
2        942    SYR  BWI          0  
3        942    SYR  BWI          0  
4        942    SYR  BWI          0  


### 2. Feature Engineering
1. Create routes from origin and destination

In [4]:
%%px
@bodo.jit(distributed=["flight_df"], cache=True)
def create_routes(flight_df):
    flight_df['route'] = flight_df['Origin'] + "_" + flight_df['Dest']
    # show the top 10 routes, as defined by number of flights
    top_routes = flight_df['route'].value_counts(ascending=False)
    print(top_routes.head(10))
    # focus on the 50 biggest routes, as defined by number of flights
    route_lst=top_routes.head(50)
    flight_df = flight_df[flight_df['route'].isin(route_lst.index)]
    return flight_df

flight_df = create_routes(flight_df)

[stdout:0] 
LAX_SFO    20750
SFO_LAX    20658
LAX_PHX    13461
PHX_LAX    13273
LAX_LAS    12175
LGA_BOS    12027
LAS_LAX    11801
SJC_LAX    11535
LAX_SJC    11292
BOS_LGA    11141
Name: route, dtype: int64
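
The route filter above combines `value_counts` with `isin`. A minimal plain-pandas sketch of the same pattern, on toy data made up for illustration (`toy_df` and `top_n` are hypothetical names, and the sketch is not run through Bodo):

```python
import pandas as pd

# Toy data, made up for illustration
toy_df = pd.DataFrame({
    "Origin": ["LAX", "LAX", "SFO", "LAX", "BOS", "SFO"],
    "Dest":   ["SFO", "SFO", "LAX", "SFO", "LGA", "LAX"],
})
toy_df["route"] = toy_df["Origin"] + "_" + toy_df["Dest"]

# value_counts sorts by frequency, most frequent route first
route_counts = toy_df["route"].value_counts()

# Keep only the rows whose route is among the top_n most frequent
top_n = 2
top_routes = route_counts.head(top_n).index
filtered = toy_df[toy_df["route"].isin(top_routes)]
print(filtered["route"].unique())
```

Here the single `BOS_LGA` flight is dropped because its route is not among the two most frequent ones.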


2. Look at their cancellations

In [5]:
%%px
@bodo.jit(distributed=["flight_df"], cache=True)
def check_cancelations(flight_df):
    res = flight_df[['route', 'Cancelled', 'Month']].groupby(by='route')\
         .agg({'Month':'size', 'Cancelled':'sum'})\
        .rename(columns={'Month':'count','Cancelled':'nb_cancelled'}) \
        .reset_index()\
        .sort_values(['count'], ascending=False)
    print(res.head(10))

check_cancelations(flight_df)

[stdout:0] 
      route  count  nb_cancelled
36  LAX_SFO  20750           228
6   SFO_LAX  20658           206
23  LAX_PHX  13461            78
42  PHX_LAX  13273            71
1   LAX_LAS  12175            58
14  LGA_BOS  12027           287
37  LAS_LAX  11801            47
0   SJC_LAX  11535            71
5   LAX_SJC  11292            71
49  BOS_LGA  11141           243
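
The per-route counts become easier to compare as a cancellation rate. The sketch below is a plain-pandas illustration on made-up data; it uses named aggregation instead of `agg` followed by `rename`, and the `cancel_rate` column is a hypothetical addition that the original cell does not compute.

```python
import pandas as pd

# Toy data, made up for illustration
toy_df = pd.DataFrame({
    "route": ["LAX_SFO", "LAX_SFO", "SFO_LAX", "LAX_SFO", "SFO_LAX"],
    "Cancelled": [0, 1, 0, 0, 1],
})

# One row per route: number of flights and number of cancelled flights
summary = (
    toy_df.groupby("route")
    .agg(count=("Cancelled", "size"), nb_cancelled=("Cancelled", "sum"))
    .reset_index()
    .sort_values("count", ascending=False)
)

# Fraction of flights cancelled on each route
summary["cancel_rate"] = summary["nb_cancelled"] / summary["count"]
print(summary)
```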


In [6]:
%%px
@bodo.jit(distributed=["flight_df"])
def print_info(flight_df):
    print(flight_df.shape)
print_info(flight_df)

[stdout:0] (487253, 11)


3. Quick sanity check: count the number of null values

In [7]:
%%px
@bodo.jit(distributed=["flight_df"])
def check_count(flight_df):
    
    print(flight_df.isnull().sum())
    
check_count(flight_df)

[stdout:0] 
Month            0
DayofMonth       0
DayOfWeek        0
CRSDepTime       0
CRSArrTime       0
UniqueCarrier    0
FlightNum        0
Origin           0
Dest             0
Cancelled        0
route            0
dtype: int64


### 3. Feature and label encoding

#### 1. Encode Labels using Cancelled column

In [8]:
%%px
@bodo.jit(distributed=["flight_df"], cache=True)
def encode_labels(flight_df):
    flight_df.Cancelled = pd.Categorical(flight_df.Cancelled)
    flight_df['Label'] = flight_df.Cancelled.cat.codes
    flight_df.drop(['Cancelled'], axis=1, inplace=True)
    return flight_df

flight_df = encode_labels(flight_df)
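
To see what `cat.codes` produces, here is a minimal plain-pandas sketch on a made-up series (`status` is a hypothetical stand-in for the `Cancelled` column). Categories are sorted, and each value is replaced by the integer position of its category.

```python
import pandas as pd

# Made-up values standing in for the Cancelled column
status = pd.Series(["no", "yes", "no"])

# Convert to a categorical; categories are the sorted unique values
status_cat = pd.Series(pd.Categorical(status))
categories = list(status_cat.cat.categories)

# Each value maps to the index of its category
codes = status_cat.cat.codes
print(categories, list(codes))
```

Since `Cancelled` already holds 0 and 1 in the data shown above, the resulting codes coincide with the original values.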

#### 2. Feature Encoding

This is needed because scikit-learn estimators only accept numerical feature values.

a. Get airport unique values

b. Encode origin, destination, and route features

In [9]:
%%px
import numpy as np

@bodo.jit(distributed=["flight_df"], cache=True)
def get_airport_list(flight_df):
    airport_list = np.sort((pd.concat((flight_df['Origin'], flight_df['Dest']))).unique())
    return airport_list

airport_list = get_airport_list(flight_df)

In [10]:
%%px
from sklearn.preprocessing import LabelEncoder
@bodo.jit(distributed=["flight_df", "airport_list"], cache=True)
def encode_features(flight_df, airport_list):
    t1 = time.time()    
    # encode airlines 
    le_carrier = LabelEncoder()
    flight_df['Carrier_encoded'] = pd.Series(le_carrier.fit_transform(flight_df['UniqueCarrier'].values))
    # Encode airports : Using same encoder for both origin and dest ( consistent encoding of airports )
    le_airport = LabelEncoder()
    le_airport.fit(airport_list)
    flight_df['Origin_encoded'] = pd.Series(le_airport.transform(flight_df['Origin']))
    flight_df['Dest_encoded'] = pd.Series(le_airport.transform(flight_df['Dest']))
    # Encode routes 
    le_route = LabelEncoder()
    flight_df['route_encoded'] = pd.Series(le_route.fit_transform(flight_df['route'].values))
    print("Encoding time: ", (time.time()-t1), " sec")
    return flight_df

flight_df = encode_features(flight_df, airport_list)

[stdout:0] Encoding time:  0.14142894744873047  sec
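
Fitting one encoder on the union of origins and destinations gives each airport the same code in both columns. A minimal sketch of that idea in plain scikit-learn, on toy data made up for illustration (not run through Bodo):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data, made up for illustration
toy_df = pd.DataFrame({
    "Origin": ["SFO", "LAX", "BOS"],
    "Dest":   ["LAX", "SFO", "LGA"],
})

# Sorted unique airports across both columns
airports = np.sort(pd.concat((toy_df["Origin"], toy_df["Dest"])).unique())

# Fit once, then reuse the same encoder for both columns
le_airport = LabelEncoder()
le_airport.fit(airports)
toy_df["Origin_encoded"] = le_airport.transform(toy_df["Origin"])
toy_df["Dest_encoded"] = le_airport.transform(toy_df["Dest"])
print(toy_df)
```

With two separate encoders, `LAX` could receive different codes as an origin and as a destination, so the model could not relate the two columns.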


In [11]:
%%px
@bodo.jit(distributed=["flight_df"], cache=True)
def sample(flight_df):
    print(flight_df[['UniqueCarrier','Carrier_encoded','Origin','Origin_encoded',
           'Dest', 'Dest_encoded', 'route', 'route_encoded' ]].sample(10))
    
sample(flight_df)

[stdout:0] 
        UniqueCarrier  Carrier_encoded Origin  Origin_encoded Dest  \
749697             AA                0    ORD              16  STL   
1183428            AA                0    IAH               9  DFW   
1768923            UA               11    DEN               4  ORD   
2237156            UA               11    LAX              11  SFO   
2319146            NW                6    ORD              16  MSP   
3167937            HP                5    SAN              19  PHX   
3070716            UA               11    LGA              12  ORD   
4374544            UA               11    LAX              11  SFO   
4637742            HP                5    PHX              18  LAS   
4642304            HP                5    LAS              10  LAX   

         Dest_encoded    route  route_encoded  
749697             24  ORD_STL             33  
1183428             5  IAH_DFW             13  
1768923            16  DEN_ORD              8  
2237156            22  LA

In [12]:
%%px
from sklearn.model_selection import train_test_split
@bodo.jit(distributed=["flight_df", "X_train", "X_test", "y_train", "y_test"], cache=True)
def split_data(flight_df):
    t1 = time.time()
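    # NOTE: 'Label' is not in the drop list, so the target stays among the features (label leakage).
    # Add 'Label' to the list below to evaluate on features only; the outputs shown were produced as written.
    X_train, X_test, y_train, y_test = train_test_split(flight_df.drop(['UniqueCarrier','Origin','Dest','route'],axis=1),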
                                                    flight_df['Label'], 
                                                    test_size=0.3, train_size=0.7,
                                                    random_state=100)
    print("Data splitting time: ", (time.time()-t1), " sec")    

    return X_train, X_test, y_train, y_test

X_train, X_test, y_train, y_test = split_data(flight_df)

[stdout:0] Data splitting time:  0.09323906898498535  sec
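
A minimal sketch of a split in which the target column is kept out of the feature matrix, using plain scikit-learn on toy data made up for illustration (`toy_df`, `X`, and `y` are hypothetical names):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration
toy_df = pd.DataFrame({
    "Month": range(1, 11),
    "route_encoded": [0, 1] * 5,
    "Label": [0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
})

# Features exclude the target; the target is passed separately
X = toy_df.drop(["Label"], axis=1)
y = toy_df["Label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=100
)
print(X_train.columns.tolist(), len(X_train), len(X_test))
```

Keeping `Label` out of `X` means any accuracy measured later reflects the other features rather than the target itself.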


## Part 2: Model Training - Using Scikit-learn

### 1. RandomForestClassifier

In [13]:
%%px
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score # evaluation metric
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def rf_model(X_train, X_test, y_train, y_test):
    start = time.time()
    rf = RandomForestClassifier()
    rf.fit(X_train.to_numpy(), y_train.values)
    y_pred = rf.predict(X_test.to_numpy())
    print("RandomForestClassifier fit and predict time: ", time.time()-start)    
    print('Accuracy score {}'.format(accuracy_score(y_test, y_pred)))

rf_model(X_train, X_test, y_train, y_test)

[stdout:0] 
RandomForestClassifier fit and predict time:  1.2791459560394287
Accuracy score 1.0


### 2. Logistic Regression

In [14]:
%%px

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score # evaluation metric
@bodo.jit(distributed=['X_train', 'y_train', 'X_test', 'y_test'], cache=True)
def lr_model(X_train, X_test, y_train, y_test):
    start = time.time()
    lr = LogisticRegression()
    lr.fit(X_train, y_train.values)
    y_pred = lr.predict(X_test)
    print("Logistic Regression fit and predict time: ", time.time()-start)    
    print('Accuracy score {}'.format(accuracy_score(y_test, y_pred)))

lr_model(X_train, X_test, y_train, y_test)

[stdout:0] 
Logistic Regression fit and predict time:  0.129410982131958
Accuracy score 0.9815770030647986
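
Cancellations are rare in this dataset (for example, 228 of 20750 `LAX_SFO` flights in the route table above), so an accuracy score is easier to interpret next to a majority-class baseline. The sketch below is an illustration with made-up labels, not a result from the flights data.

```python
import numpy as np

# Made-up test labels: one cancellation in ten flights
y_test = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 0])

# Always predict the most frequent class
majority_class = np.bincount(y_test).argmax()
baseline_pred = np.full_like(y_test, majority_class)

# Fraction of labels the constant prediction gets right
baseline_accuracy = (baseline_pred == y_test).mean()
print(majority_class, baseline_accuracy)
```

A model is only informative on such data if its accuracy clearly exceeds this baseline.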
