# EDSA Sendy Logistics Challenge

## Contents

 1. Required Imports & Libraries
 2. Import the Data
 3. Data Cleaning & Formatting
 4. Exploratory Data Analysis
 5. Feature Engineering and selection
 6. Building & Evaluating Models
 7. Training the Model and making a prediction
 8. Generating a Submission file for ZINDI
 9. References

### 1. Required Imports & Libraries

In [17]:
## Required Libraries & imports for the model

import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, KFold, GridSearchCV 
from sklearn.model_selection import cross_val_score, learning_curve
from sklearn.metrics import mean_squared_error

from sklearn.svm import SVR
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

import xgboost as xgb
import lightgbm as lgb

sns.set(style='white', context='notebook', palette='deep')

### 2. Import the Data

In [19]:
##Importing Training Data CSV file
train_url = 'https://raw.githubusercontent.com/AksharJ47/regression-predict-api-template/master/utils/data/Train_Zindi.csv'
train_df = pd.read_csv(train_url)

##Importing Test Data
test_url = 'https://raw.githubusercontent.com/AksharJ47/regression-predict-api-template/master/utils/data/Test_Zindi.csv'
test_df = pd.read_csv(test_url)

##Importing Riders Data CSV file
riders_url = 'https://raw.githubusercontent.com/AksharJ47/regression-predict-api-template/master/utils/data/Riders_Zindi.csv'
riders_df = pd.read_csv(riders_url)

print(train_df.shape, test_df.shape, riders_df.shape)
train_df.head()

(21201, 29) (7068, 25) (960, 5)


Unnamed: 0,Order No,User Id,Vehicle Type,Platform Type,Personal or Business,Placement - Day of Month,Placement - Weekday (Mo = 1),Placement - Time,Confirmation - Day of Month,Confirmation - Weekday (Mo = 1),...,Arrival at Destination - Time,Distance (KM),Temperature,Precipitation in millimeters,Pickup Lat,Pickup Long,Destination Lat,Destination Long,Rider Id,Time from Pickup to Arrival
0,Order_No_4211,User_Id_633,Bike,3,Business,9,5,9:35:46 AM,9,5,...,10:39:55 AM,4,20.4,,-1.317755,36.83037,-1.300406,36.829741,Rider_Id_432,745
1,Order_No_25375,User_Id_2285,Bike,3,Personal,12,5,11:16:16 AM,12,5,...,12:17:22 PM,16,26.4,,-1.351453,36.899315,-1.295004,36.814358,Rider_Id_856,1993
2,Order_No_1899,User_Id_265,Bike,3,Business,30,2,12:39:25 PM,30,2,...,1:00:38 PM,3,,,-1.308284,36.843419,-1.300921,36.828195,Rider_Id_155,455
3,Order_No_9336,User_Id_1402,Bike,3,Business,15,5,9:25:34 AM,15,5,...,10:05:27 AM,9,19.2,,-1.281301,36.832396,-1.257147,36.795063,Rider_Id_855,1341
4,Order_No_27883,User_Id_1737,Bike,1,Personal,13,1,9:55:18 AM,13,1,...,10:25:37 AM,9,15.4,,-1.266597,36.792118,-1.295041,36.809817,Rider_Id_770,1214


In [20]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21201 entries, 0 to 21200
Data columns (total 29 columns):
Order No                                     21201 non-null object
User Id                                      21201 non-null object
Vehicle Type                                 21201 non-null object
Platform Type                                21201 non-null int64
Personal or Business                         21201 non-null object
Placement - Day of Month                     21201 non-null int64
Placement - Weekday (Mo = 1)                 21201 non-null int64
Placement - Time                             21201 non-null object
Confirmation - Day of Month                  21201 non-null int64
Confirmation - Weekday (Mo = 1)              21201 non-null int64
Confirmation - Time                          21201 non-null object
Arrival at Pickup - Day of Month             21201 non-null int64
Arrival at Pickup - Weekday (Mo = 1)         21201 non-null int64
Arrival at Pickup - Time   

### 3. Data Cleaning & Formatting

In [21]:
#Drop data not available in test, Pickup Time + label = Arrival times

train_df = train_df.drop(['Arrival at Destination - Day of Month', 'Arrival at Destination - Weekday (Mo = 1)',
                          'Arrival at Destination - Time'], axis=1)

In [22]:
# Combine train & test to create a full df
train_end = train_df.shape[0]
test_df['Time from Pickup to Arrival'] = [np.nan]*test_df.shape[0]
full_df = pd.concat([train_df, test_df], axis=0, ignore_index=True)
full_riders_df = pd.merge(full_df, riders_df, how='left',
                          left_on='Rider Id',
                          right_on='Rider Id',
                          left_index=True)
train_df.shape, test_df.shape, full_df.shape, full_riders_df.shape

((21201, 26), (7068, 26), (28269, 26), (28269, 30))

In [23]:
# Check features and their data types
# Found missing values in Temperature and Precipitation
full_riders_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28269 entries, 27 to 785
Data columns (total 30 columns):
Order No                                28269 non-null object
User Id                                 28269 non-null object
Vehicle Type                            28269 non-null object
Platform Type                           28269 non-null int64
Personal or Business                    28269 non-null object
Placement - Day of Month                28269 non-null int64
Placement - Weekday (Mo = 1)            28269 non-null int64
Placement - Time                        28269 non-null object
Confirmation - Day of Month             28269 non-null int64
Confirmation - Weekday (Mo = 1)         28269 non-null int64
Confirmation - Time                     28269 non-null object
Arrival at Pickup - Day of Month        28269 non-null int64
Arrival at Pickup - Weekday (Mo = 1)    28269 non-null int64
Arrival at Pickup - Time                28269 non-null object
Pickup - Day of Month          

In [24]:
# Shorten Column Names, remove whitespace to make them easier to work with
feature_names = {"Order No": "Order_No",
                 "User Id": "User_Id",
                 "Vehicle Type": "Vehicle_Type",
                 "Personal or Business": "Personal_Business",
                 "Placement - Day of Month": "Pla_Mon",
                 "Placement - Weekday (Mo = 1)": "Pla_Weekday",
                 "Placement - Time": "Pla_Time",
                 "Confirmation - Day of Month": "Con_Day_Mon",
                 "Confirmation - Weekday (Mo = 1)": "Con_Weekday",
                 "Confirmation - Time": "Con_Time",
                 "Arrival at Pickup - Day of Month": "Arr_Pic_Mon",
                 "Arrival at Pickup - Weekday (Mo = 1)": "Arr_Pic_Weekday",
                 "Arrival at Pickup - Time": "Arr_Pic_Time",
                 "Platform Type": "Platform_Type",
                 "Pickup - Day of Month": "Pickup_Mon",
                 "Pickup - Weekday (Mo = 1)": "Pickup_Weekday",
                 "Pickup - Time": "Pickup_Time",
                 "Distance (KM)": "Distance(km)",
                 "Precipitation in millimeters": "Precipitation(mm)",
                 "Pickup Lat": "Pickup_Lat",
                 "Pickup Long": "Pickup_Lon",
                 "Destination Lat": "Destination_Lat",
                 "Destination Long": "Destination_Lon",
                 "Rider Id": "Rider_Id",
                 "Time from Pickup to Arrival": "Time_Pic_Arr"}
renamed_df = full_riders_df.rename(columns=feature_names)
renamed_df.columns


Index(['Order_No', 'User_Id', 'Vehicle_Type', 'Platform_Type',
       'Personal_Business', 'Pla_Mon', 'Pla_Weekday', 'Pla_Time',
       'Con_Day_Mon', 'Con_Weekday', 'Con_Time', 'Arr_Pic_Mon',
       'Arr_Pic_Weekday', 'Arr_Pic_Time', 'Pickup_Mon', 'Pickup_Weekday',
       'Pickup_Time', 'Distance(km)', 'Temperature', 'Precipitation(mm)',
       'Pickup_Lat', 'Pickup_Lon', 'Destination_Lat', 'Destination_Lon',
       'Rider_Id', 'Time_Pic_Arr', 'No_Of_Orders', 'Age', 'Average_Rating',
       'No_of_Ratings'],
      dtype='object')

In [25]:
renamed_df.shape

(28269, 30)

In [26]:
# Convert time to Seconds after midnight
def time_conv(input_df):
    '''Converts time format %H:%M:%S to seconds past midnight(00:00) of
       the same day rounded to the nearest second.
       ------------------------------
       12:00:00 PM --> 43200
       01:30:00 AM --> 5400
       02:35:30 PM --> 9330
     '''    
    input_df_1 = input_df.copy()

    def timetosecs(x):
        if len(x) == 10:
            if x[-2:] == 'AM':
                x = (float(x[0])*3600) + (float(x[2:4])*60) + float(x[5:7])
            else:
                x = (float(x[0])*43200) + (float(x[2:4])*60) + float(x[5:7])
        else:
            if x[-2:] == 'AM':
                x = (float(x[0:2])*3600) + (float(x[3:5])*60) + float(x[6:8])
            else:
                x = (float(x[0:2])*43200) + (float(x[3:5])*60) + float(x[6:8])
        return x
    input_df_1['Pla_Time'] = input_df_1['Pla_Time'].apply(timetosecs)
    input_df_1['Con_Time'] = input_df_1['Con_Time'].apply(timetosecs)
    input_df_1['Arr_Pic_Time'] = input_df_1['Arr_Pic_Time'].apply(timetosecs)
    input_df_1['Pickup_Time'] = input_df_1['Pickup_Time'].apply(timetosecs)
    return input_df_1

time_conv_df = time_conv(renamed_df)


In [27]:
# Add Columns for time differences
def time_diffs(input_df):
    df = input_df.copy()
    df['Conf_Pla_dif'] = df['Con_Time'] - df['Pla_Time']
    df['Arr_Con_dif'] = df['Arr_Pic_Time'] - df['Con_Time']
    df['Pic_Arr_dif'] = df['Pickup_Time'] - df['Arr_Pic_Time']

    return df


In [28]:
time_conv_df = time_diffs(time_conv_df)

### 4. Exploratory Data Analysis (EDA)

### 5.  Feature Engineering & Selection

In [29]:
# Add Rider Experience based on Age Column - Low - Medium - High

time_conv_df['Rider_Exp'] = pd.qcut(time_conv_df['Age'],
                                    q=[0, .25, .75, 1],
                                    labels=['low', 'medium', 'high'])


In [30]:
# Filling Missing Values for Temperature and Precipitation - used the Mean

time_conv_df['Temperature'] = time_conv_df['Temperature'].fillna(
    time_conv_df['Temperature'].mean())
time_conv_df['Precipitation(mm)'].fillna(
    time_conv_df['Precipitation(mm)'].mean(), inplace=True)


In [31]:
# Create Temperature band Column - 3 categories - low, mid, high

time_conv_df['Temp_Band'] = pd.qcut(time_conv_df['Temperature'],
                                    q=[0, .25, .75, 1],
                                    labels=['low', 'medium', 'high'])

In [32]:
#Manhattan distance
def manhattan_distance(lat1, lng1, lat2, lng2):
    a = np.abs(lat2 -lat1)
    b = np.abs(lng1 - lng2)
    return a + b


In [33]:
##Add Manhattan to DF
def added_manhattan(input_df):
    df = input_df.copy()
    df['distance_manhattan'] = manhattan_distance(df['Pickup_Lat'].values,
                                                  df['Pickup_Lon'].values,
                                                  df['Destination_Lat'].values,
                                                  df['Destination_Lon'].values)
    return df

time_conv_df = added_manhattan(time_conv_df)

## Add Reference for this

In [34]:
#Haversine distance
def haversine_array(lat1, lng1, lat2, lng2):
    lat1, lng1, lat2, lng2 = map(np.radians, (lat1, lng1, lat2, lng2))
    AVG_EARTH_RADIUS = 6371  # in km
    lat = lat2 - lat1
    lng = lng2 - lng1
    d = np.sin(lat * 0.5) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(lng * 0.5) ** 2
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    return h

## Add Reference for this

In [35]:
##Add Haversine Distance to DF

def add_haversine(input_df):
    input_df_1 = input_df.copy()
    input_df_1['distance_haversine'] = haversine_array(input_df_1['Pickup_Lat'].values,
                                                       input_df_1['Pickup_Lon'].values,
                                                       input_df_1['Destination_Lat'].values,
                                                       input_df_1['Destination_Lon'].values)
    return input_df_1

time_conv_df = add_haversine(time_conv_df)

In [36]:
time_conv_df['distance_haversine']

27      1.930333
739    11.339849
851     1.880079
806     4.943458
159     3.724829
         ...    
611     3.631752
119    16.407578
134     5.398648
205    15.688753
785     7.294294
Name: distance_haversine, Length: 28269, dtype: float64

In [37]:
# This is to check if there is any difference between the columns with Days of Month or Weekday of Month

month_cols = [col for col in time_conv_df.columns if col.endswith('Mon')]
weekday_cols = [col for col in time_conv_df.columns if col.endswith('Weekday')]

count = 0
instances_of_different_days = [];
for i, row in time_conv_df.iterrows():
    if len(set(row[month_cols].values)) > 1:
        print(count+1, end='\r')
        count = count + 1
        instances_of_different_days.append(list(row[month_cols].values))
instances_of_different_days

2

[[17, 18, 18, 18], [11, 13, 13, 13]]

In [39]:
# Drop columns based on:
   #Days of Month or Weekday of Month are the same except for 2 rows. The delivery service is same day
   #All Vehicle types are Bikes, Vehicle Type is not necessary.

time_conv_df['Day_of_Month'] = time_conv_df[month_cols[0]]
time_conv_df['Day_of_Week'] = time_conv_df[weekday_cols[0]]

time_conv_df.drop(month_cols+weekday_cols, axis=1, inplace=True)
time_conv_df.drop('Vehicle_Type', axis=1, inplace=True)

time_conv_df.head(3)


Unnamed: 0,Order_No,User_Id,Platform_Type,Personal_Business,Pla_Time,Con_Time,Arr_Pic_Time,Pickup_Time,Distance(km),Temperature,...,No_of_Ratings,Conf_Pla_dif,Arr_Con_dif,Pic_Arr_dif,Rider_Exp,Temp_Band,distance_manhattan,distance_haversine,Day_of_Month,Day_of_Week
27,Order_No_4211,User_Id_633,3,Business,34546.0,34810.0,36287.0,37650.0,4,20.4,...,549,264.0,1477.0,1363.0,high,low,0.017978,1.930333,9,5
739,Order_No_25375,User_Id_2285,3,Personal,40576.0,41001.0,42022.0,42249.0,16,26.4,...,69,425.0,1021.0,227.0,low,high,0.141406,11.339849,12,5
851,Order_No_1899,User_Id_265,3,Business,520765.0,520964.0,521374.0,521583.0,3,23.255689,...,114,199.0,410.0,209.0,low,medium,0.022588,1.880079,30,2


In [40]:
#Convert Personal_Business Temp_Band using LabelEncoding

le = LabelEncoder()
le.fit(time_conv_df['Personal_Business'])
time_conv_df['Personal_Business'] = le.transform(time_conv_df['Personal_Business'])
time_conv_df['Personal_Business'][:2]


27     0
739    1
Name: Personal_Business, dtype: int32

In [41]:
# Rider_Exp convert Label Encoding

le.fit(time_conv_df['Rider_Exp'])
time_conv_df['Rider_Exp'] = le.transform(time_conv_df['Rider_Exp'])
time_conv_df['Rider_Exp'][:2]

27     0
739    1
Name: Rider_Exp, dtype: int32

In [42]:
## Convert Temp_Band using LabelEncoding

le.fit(time_conv_df['Temp_Band'])
time_conv_df['Temp_Band'] = le.transform(time_conv_df['Temp_Band'])
time_conv_df['Temp_Band'][:2]

27     1
739    0
Name: Temp_Band, dtype: int32

In [43]:
## This function splits Columns into Data types - this makes it easier to select & plot numeric features
## against the Target Variable

numeric_cols = []
object_cols = []
time_cols = []
for k, v in time_conv_df.dtypes.items():
    if (v != object):
        if (k != "Time_Pic_Arr"):
            numeric_cols.append(k)
    elif k.endswith("Time"):
        time_cols.append(k)
    else:
        object_cols.append(k)
time_conv_df[numeric_cols].head(3) 

Unnamed: 0,Platform_Type,Personal_Business,Pla_Time,Con_Time,Arr_Pic_Time,Pickup_Time,Distance(km),Temperature,Precipitation(mm),Pickup_Lat,...,No_of_Ratings,Conf_Pla_dif,Arr_Con_dif,Pic_Arr_dif,Rider_Exp,Temp_Band,distance_manhattan,distance_haversine,Day_of_Month,Day_of_Week
27,3,0,34546.0,34810.0,36287.0,37650.0,4,20.4,7.573502,-1.317755,...,549,264.0,1477.0,1363.0,0,1,0.017978,1.930333,9,5
739,3,1,40576.0,41001.0,42022.0,42249.0,16,26.4,7.573502,-1.351453,...,69,425.0,1021.0,227.0,1,0,0.141406,11.339849,12,5
851,3,0,520765.0,520964.0,521374.0,521583.0,3,23.255689,7.573502,-1.308284,...,114,199.0,410.0,209.0,1,2,0.022588,1.880079,30,2


In [44]:
## Feature Selection & Dropping of the Target Variable

features = numeric_cols 

data_df = time_conv_df[features]

y = time_conv_df[:train_end]['Time_Pic_Arr']
train = data_df[:train_end]
test = data_df[train_end:]

test.head()

Unnamed: 0,Platform_Type,Personal_Business,Pla_Time,Con_Time,Arr_Pic_Time,Pickup_Time,Distance(km),Temperature,Precipitation(mm),Pickup_Lat,...,No_of_Ratings,Conf_Pla_dif,Arr_Con_dif,Pic_Arr_dif,Rider_Exp,Temp_Band,distance_manhattan,distance_haversine,Day_of_Month,Day_of_Week
183,3,0,175450.0,175469.0,175984.0,216407.0,8,23.255689,7.573502,-1.333275,...,171,19.0,515.0,40423.0,0,2,0.076451,6.220125,27,3
826,3,0,521855.0,521957.0,44427.0,44737.0,5,23.255689,7.573502,-1.272639,...,45,102.0,-477530.0,310.0,1,2,0.033551,3.280436,17,5
650,3,0,40094.0,41105.0,41600.0,43074.0,5,22.8,7.573502,-1.290894,...,67,1011.0,495.0,1474.0,1,2,0.042714,3.535344,27,4
561,3,0,46295.0,46407.0,86561.0,87412.0,5,24.5,7.573502,-1.290503,...,44,112.0,40154.0,851.0,2,2,0.031867,2.550774,17,1
203,3,0,41428.0,41685.0,42439.0,42964.0,6,24.4,7.573502,-1.281081,...,1010,257.0,754.0,525.0,2,2,0.036875,2.960588,11,2


In [48]:
y

27      745.0
739    1993.0
851     455.0
806    1341.0
159    1214.0
        ...  
712       9.0
851     770.0
642    2953.0
41     1380.0
801    2128.0
Name: Time_Pic_Arr, Length: 21201, dtype: float64

### 6. Building & Evaluating Models

In [45]:
## Splitting the Data into Train & Test sets, ratio of 80:20 & inspecting the shape

X_train, X_test, y_train, y_test = train_test_split(train, y, test_size=0.2, shuffle=True,random_state = 42)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(16960, 26) (4241, 26) (16960,) (4241,)


### K-fold Cross Validation

##### The general procedure is as follows:

Shuffle the dataset randomly.
Split the dataset into k groups
For each unique group:
Take the group as a hold out or test data set
Take the remaining groups as a training data set
Fit a model on the training set and evaluate it on the test set
Retain the evaluation score and discard the model
Summarize the skill of the model using the sample of model evaluation scores
Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the hold out set 1 time and used to train the model k-1 times.

In [None]:
##  k-fold cross-validation procedure for estimating the skill of machine learning models
from sklearn.metrics import mean_squared_error,make_scorer
rs = 42
kfold = KFold(n_splits=10, random_state=rs, shuffle=True)

regressors = []
regressors.append(SVR(random_state = rs))
regressors.append(GradientBoostingRegressor(random_state=rs))
regressors.append(ExtraTreesRegressor(n_estimators=rs))
regressors.append(RandomForestRegressor(random_state=rs))
regressors.append(xgb.XGBRegressor(random_state=rs, objective="reg:squarederror"))
regressors.append(lgb.LGBMRegressor(random_state=rs))

cv_results = []
rmse_scorer = make_scorer(mean_squared_error)
for regressor in regressors:     #scores to be minimised are negated (neg)
    cv_results.append(np.sqrt(cross_val_score(estimator = regressor,X = X_train, y = y_train, cv = kfold, scoring = rmse_scorer)))

cv_means = []
cv_stds = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_stds.append(cv_result.std())
    
cv_res = pd.DataFrame({ 
    "Algorithm": ["SVR", "GBR", "EXR", "RFR", "XGBR", "LGBM"],
    "CrossValMeans": cv_means, "CrossValErrors": cv_stds
                       })
cv_res = cv_res.sort_values("CrossValMeans", ascending=True)
print(cv_res)

##### Based on the above LGBM

In [None]:
params = {
    'n_estimators':[75], # [75, 95],
    'num_leaves': [15], #[12,15, 17],
    'reg_alpha': [0.02], #[0.02, 0.05],
    'min_data_in_leaf': [300],  #[250, 280, 300]
    'learning_rate': [0.1], #[0.05, 0.1, 0.25],
    'objective': ['regression'] #['regression', None]
    }

lsearch = GridSearchCV(estimator = lgb.LGBMRegressor(random_state=rs), cv=kfold,scoring=rmse_scorer, param_grid=params)
lgbm = lsearch.fit(X_train, y_train)

l_params = lgbm.best_params_
l_score = np.sqrt(abs(lgbm.best_score_))
print(lgbm.best_params_, np.sqrt(abs(lgbm.best_score_)))


In [None]:
RFC = RandomForestRegressor(random_state=rs)
rf_param = {"max_depth":[None], "max_features":[3], "min_samples_split":[10],
           "min_samples_leaf": [3], "n_estimators":[300]}
rsearch = GridSearchCV(RFC, cv=kfold, scoring=rmse_scorer,param_grid=rf_param)
rfm = rsearch.fit(X_train, y_train)

r_score = np.sqrt(abs(rfm.best_score_))
r_params = rfm.best_params_
print(r_score, r_params)

#### Plotting the Learning curves

In [None]:
def plot_learning_curve(estimator, title, X, y, ylim=None, n_jobs=-1, cv=None, train_sizes=np.linspace(.1, 1.0, 5)):
    """Generating a plot of test and training learning curve"""
    plt.figure()
    plt.title(title)
    
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes, scoring=rmse_scorer, shuffle=True)
    
    #scores - 5 runs, each with 10 fold
    train_scores_mean = np.mean(train_scores, axis=1) #5 means (each size)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    plt.grid()
    
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color='r' )
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color='g')
    
    plt.plot(train_sizes, train_scores_mean, 'o-', color='r', label='Training score')
    plt.plot(train_sizes, test_scores_mean, 'o-', color='g', label='Cross-validation score')
    
    plt.legend(loc='best')
    return plt


In [None]:
#Learning Curves

g = plot_learning_curve(lgbm.best_estimator_, "lgbm learning curves", X_train, y_train, cv=kfold)
g = plot_learning_curve(rfm.best_estimator_, "random forest_learning_curve", X_train, y_train, cv=kfold)

#lgbm: mse error comment here
#rf: mse error comment here

##### From the above learning curves we can see that the LGBM curve seems to continue to decrease, at a decreasing rate, as the number of samples increases. The RFC however, seems to have levelled off just above 8000 samples.

#### We can now plot the Feature Importance of the two models, which will enable us to fine tune them further

In [15]:
vals = lgbm.best_estimator_.feature_importances_
l_importance = np.array([ val/sum(vals) for val in vals ])
r_importance  = rfm.best_estimator_.feature_importances_
feats = np.array(features)

fig,axes = plt.subplots(1,2, figsize=(12, 8))
plt.subplots_adjust(top=0.6, bottom=0.2, hspace=.6, wspace=0.8)

indices = np.argsort(l_importance)[::-1]
g = sns.barplot(y=feats[indices], x=l_importance[indices], orient='h', ax=axes[0])
g.set_xlabel("Relative importances", fontsize=12)
g.set_ylabel("Features", fontsize=12)
g.tick_params(labelsize=9)
g.set_title(" LGBM feature importance")

index = np.argsort(r_importance)[::-1]
g = sns.barplot(y=feats[index], x=r_importance[index], orient='h', ax=axes[1])
g.set_xlabel("Relative importances", fontsize=12)
g.set_ylabel("Features", fontsize=12)
g.tick_params(labelsize=9)
g.set_title(" Random Forest feature importance")
plt.show()

NameError: name 'lgbm' is not defined

### 7. Training the Model and making a prediction

In [None]:
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

lparams = {
           'learning_rate': 0.1, 'min_data_in_leaf': 300, 
           'n_estimators': 75, 'num_leaves': 20, 'random_state':rs,
           'objective': 'regression', 'reg_alpha': 0.02,
          'feature_fraction': 0.9, 'bagging_fraction':0.9}


lgbm = lgb.train(lparams, lgb_train, valid_sets=lgb_eval, num_boost_round=20, early_stopping_rounds=20)

lpred = lgbm.predict(X_test, num_iteration=lgbm.best_iteration)

print("The RMSE of prediction is ", mean_squared_error(y_test, lpred)**0.5)


# Generating insights from data

In [None]:
plt.hist2d(x = train['Day_of_Month'],y = train['Pla_Time']/3600,bins=(10,20),\
          range=((0,31),(0,24)))
plt.xlabel('Day_of_placement')
plt.ylabel('Hour_of_placement')
plt.xticks(np.arange(0, 31, step=2))
plt.yticks(np.arange(0, 24, step=2))
plt.colorbar()
plt.show

In [None]:
platform_types = train['Platform_Type'].value_counts()
plt.bar(platform_types.index,platform_types)
plt.xlim(0, 5)
plt.xlabel('Platform_type')
plt.ylabel('Number of orders')
plt.xticks(np.arange(0, 5, step=1))
plt.show()

In [None]:
sns.distplot(train['Day_of_Month'], kde = False,bins = 4)
plt.ylabel('No. of orders')

In [None]:
train['Rider_Exp']
train_df['Time from Pickup to Arrival']/train_df['Distance (KM)']
plt.scatter(train['Rider_Exp'],train_df['Time from Pickup to Arrival']/train_df['Distance (KM)'])
plt.xticks(np.arange(0, 3, step=1))

Displays effect of experience on driver delivery time, this shows that the time take by a 

In [None]:
train_df.info()

### 8. Generating a Submission File for ZINDI

In [None]:
lgbm_y = lgbm.predict(test, num_iteration=lgbm.best_iteration)
lgbm_output = pd.DataFrame({"Order No":test_df['Order No'], 
                           "Time from Pickup to Arrival": lgbm_y })
lgbm_output.to_csv(r"C:\Users\TERENCE.VENGATASS\Desktop\art\submission_2.csv", index=False)

### 9. References

1. A Gentle Introduction to k-fold Cross-Validation
by Jason Brownlee on May 23, 2018 in Statistics : https://machinelearningmastery.com/k-fold-cross-validation/


2. How to Implement Resampling Methods From Scratch In Python
by Jason Brownlee on October 17, 2016 in Code Algorithms From Scratch : https://machinelearningmastery.com/implement-resampling-methods-scratch-python/

3. What is the Difference Between Test and Validation Datasets?
by Jason Brownlee on July 14, 2017 in Machine Learning Process : https://machinelearningmastery.com/difference-test-validation-datasets/

4. ZINDI Discussion Board - Orginal Competition:
https://zindi.africa/competitions/sendy-logistics-challenge/discussions