# Recruit Restaurant Visitor Forecasting
---

<h3 style='color: red'> Problem Statement: </h3>

<p>
    Running a thriving local restaurant isn't always as charming as first impressions appear. There are often all sorts of unexpected troubles popping up that could hurt business.

<strong>One common predicament is that restaurants need to know how many customers to expect each day to effectively purchase ingredients and schedule staff members.</strong> This forecast isn't easy to make because many unpredictable factors affect restaurant attendance, like weather and local competition. It's even harder for newer restaurants with little historical data.

Recruit Holdings has unique access to key datasets that could make automated future customer prediction possible. Specifically, Recruit Holdings owns Hot Pepper Gourmet (a restaurant review service), AirREGI (a restaurant point of sales service), and Restaurant Board (reservation log management software).

<strong>In this competition, you're challenged to use reservation and visitation data to predict the total number of visitors to a restaurant for future dates. This information will help restaurants be much more efficient and allow them to focus on creating an enjoyable dining experience for their customers.</strong>
</p>

<br>

<h3 style='color:red'> Loss function: RMSLE </h3>

<p>
RMSLE penalizes underprediction more heavily than overprediction, which matters in this problem because we don't want restaurants to be underprepared, especially small ones. Being overprepared has less effect on the business, since the extra resources can be stored.
</p>

A few other advantages:

    * RMSLE is less sensitive to outliers than RMSE, because the log transform compresses large values
    * RMSLE measures the relative (ratio) difference between predicted and actual values, while RMSE measures the absolute difference between them.

[read more](https://medium.com/analytics-vidhya/root-mean-square-log-error-rmse-vs-rmlse-935c6cc1802a)
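
The asymmetry can be checked numerically. The sketch below is a minimal illustration and not part of the original pipeline: the helper names `rmsle` and `rmse` and the visitor counts are made up for the example. It compares an underprediction and an overprediction of the same absolute size.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on raw (non-log) counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


def rmse(y_true, y_pred):
    """Root mean squared error on raw counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))


actual = [100.0]   # hypothetical true visitor count
under = [60.0]     # 40 visitors too few
over = [140.0]     # 40 visitors too many

# RMSE treats both errors identically
rmse_under, rmse_over = rmse(actual, under), rmse(actual, over)

# RMSLE is larger for the underprediction
rmsle_under, rmsle_over = rmsle(actual, under), rmsle(actual, over)

print(f"RMSE  under={rmse_under:.2f}  over={rmse_over:.2f}")
print(f"RMSLE under={rmsle_under:.4f} over={rmsle_over:.4f}")
```

With these numbers RMSE is 40 in both cases, while RMSLE is higher for the underprediction.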
    



## Data Description (Kaggle)
---



1. **air_visit_data.csv**: This data contains the historical visits to AIR registered restaurants

    * **air_store_id** : Unique ID for AIR registered restaurants
    * **visit_date**   : The date of the visit
    * **visitors**     : Number of customers who visited the restaurant on that date
    
    
2. **air_reserve.csv**: This data contains the reservations made through the AIR reservation system
    
    * **air_store_id**       : Unique ID for AIR registered restaurants
    * **visit_datetime**     : The date and time of the reserved visit
    * **reserve_datetime**   : The date and time at which the reservation was made
    * **reserve_visitors**   : Number of visitors for the reservation
    
    
3. **air_store_info.csv**: This data contains the location and restaurant type information for AIR restaurants

    * **air_store_id**       : Unique ID for AIR registered restaurants
    * **air_genre_name**     : Type of restaurant
    * **air_area_name**      : area name of the restaurant
    * **latitude**           : lat of the restaurant
    * **longitude**          : long of the restaurant


4. **date_info.csv**: This data contains calendar information for the visit dates

    * **calendar_date**      : The date
    * **day_of_week**        : Day of the week
    * **holiday_flg**        : Whether the day is a holiday (1: yes, 0: no)


5. **hpg_reserve.csv**: This data contains the reservations made through the HPG reservation system

    * **hpg_store_id**       : Unique ID for the restaurants in HPG database
    * **visit_datetime**     : The date and time of the reserved visit
    * **reserve_datetime**   : The date and time at which the reservation was made
    * **reserve_visitors**   : Number of visitors for the reservation


6. **hpg_store_info.csv**: This data contains the location and restaurant type information for HPG

    * **hpg_store_id**       : Unique ID for HPG registered restaurants
    * **hpg_genre_name**     : Type of restaurant
    * **hpg_area_name**      : area name of the restaurant
    * **latitude**           : lat of the restaurant
    * **longitude**          : long of the restaurant


7. **store_id_relation.csv**: This data contains the mapping for HPG restaurants ID to AIR restaurant ID

    * **hpg_store_id**       : HPG unique ID
    * **air_store_id**       : AIR unique ID


8. **sample_submission.csv**: This is the test data for which predictions have to be made.
    
    * **id**                 : The id is formed by concatenating the air_store_id and visit_date with an underscore
    * **visitors**           : The number of visitors forecasted for the store and date combination
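
Because the `air_store_id` itself contains an underscore, the submission `id` has to be split on its last underscore to recover the store and the date. The snippet below is a small sketch of one way to do that with pandas; the two rows are illustrative and only follow the `id` layout described above.

```python
import pandas as pd

# Two illustrative rows in the sample_submission.csv layout
submission = pd.DataFrame({
    "id": ["air_00a91d42b08b08d9_2017-04-23", "air_00a91d42b08b08d9_2017-04-24"],
    "visitors": [0, 0],
})

# Split on the last underscore only: store ID on the left, date on the right
parts = submission["id"].str.rsplit("_", n=1, expand=True)
submission["air_store_id"] = parts[0]
submission["visit_date"] = pd.to_datetime(parts[1])

print(submission[["air_store_id", "visit_date"]])
```

Splitting from the right avoids hard-coding how many underscores the store ID contains.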
    

## Training and Implementation
---

In [109]:
# imports
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os
import pickle
from sklearn.feature_selection import RFECV
import xgboost as xgb
from sklearn.metrics import make_scorer
import unidecode

data_path = os.path.join(os.path.curdir, "Kaggle_Data")
processed_data_path = os.path.join(os.path.curdir, "Processed_Data")
results_path = os.path.join(os.path.curdir, 'results')

class RRVF():
    
    def __init__(self):
        
        self.data = self.prepare_process_data()
        self.features = self.feature_selection_RFE(self.data)
        
        # load the xgboost model (the context manager closes the file handle)
        with open(os.path.join(results_path, "xgb_regression_0.47827.pickle"), 'rb') as model_file:
            self.xgb_model = pickle.load(model_file)
        
    def get_data(self, train_only=False, test_only=False, both=False, feature_selected=False):
        """
        This function will fetch the required data from the whole processed data
        """
        if not feature_selected:
            if train_only:
                print("fetching full train data...done")
                return self.data.loc[self.data.is_train == True, :] 
            elif test_only:
                print("fetching full test data...done")
                return self.data.loc[self.data.is_train == False, :]
            else:
                print("fetching full train+test data...done")
                return self.data
        else:
            
            if train_only:
                print("fetching feature_selected train data...done")
                temp = self.data.loc[self.data.is_train == True, self.features] 
                temp = pd.get_dummies(temp, columns=['day_of_week', 
                                             'area_prefecture', 
                                             'area_sub_prefecture', 
                                             'air_genre_name'])
                return temp
            
            elif test_only:
                print("fetching feature_selected test data...done")
                temp = self.data.loc[self.data.is_train == False, self.features]
                temp = pd.get_dummies(temp, columns=['day_of_week', 
                                             'area_prefecture', 
                                             'area_sub_prefecture', 
                                             'air_genre_name'])
                return temp
            else:
                print("fetching feature_selected train+test data...done")
                temp = self.data.loc[:, self.features]
                temp = pd.get_dummies(temp, columns=['day_of_week', 
                                             'area_prefecture', 
                                             'area_sub_prefecture', 
                                             'air_genre_name'])
                return temp
            
            
    def prepare_process_data(self):
        """
        This function will read multiple data sources, preprocess, featurize and return the data

        """

        # read the required data:
        # air
        air_visit_data = pd.read_csv(data_path + "/air_visit_data.csv")
        air_visit_data.visit_date = pd.to_datetime(air_visit_data.visit_date)
        air_reservation_data = pd.read_csv(data_path + "/air_reserve.csv")
        air_store_info = pd.read_csv(data_path + "/air_store_info.csv")

        # date info
        date_info = pd.read_csv(data_path + "/date_info.csv")
        date_info.calendar_date = pd.to_datetime(date_info.calendar_date)

        # hpg
        hpg_reservation_data = pd.read_csv(data_path + "/hpg_reserve.csv")
        store_id_relation = pd.read_csv(data_path + "/store_id_relation.csv")

        # submission
        submission = pd.read_csv(data_path + "/sample_submission.csv")

        # weather information
        air_store_weather_info = pd.read_csv(data_path + "/processed_air_store_weather_info.csv")
        
        print("Reading data from multiple sources...Done \n")


        # prepare the test data
        submission['air_store_id'] = submission.id.apply(lambda x: "_".join(x.split("_")[:2]))
        submission['visit_date'] = submission.id.apply(lambda x: x.split("_")[2])
        submission.drop("id", axis=1, inplace=True)
        submission = submission[['air_store_id', 'visit_date', 'visitors']]
        submission.visit_date = pd.to_datetime(submission.visit_date)
        
        print("Test data prepared...Done")

        # add train and test tags
        air_visit_data['is_train'] = True
        submission['is_train'] = False

        # combine the train and test data into one
        data = pd.concat([air_visit_data, submission])

        # add the store information (genre, area, coordinates) to the combined data
        data = pd.merge(data, air_store_info, how='left', on='air_store_id')
        
        
        # add the date information from the date_info data
        date_info['non_working'] = np.where(date_info.day_of_week.isin(['Saturday', 'Sunday']) | (date_info.holiday_flg == 1), 1, 0)
        date_info['prev_day_holiday'] = date_info['non_working'].shift().fillna(0)
        date_info['next_day_holiday'] = date_info['non_working'].shift(-1).fillna(0)
        date_info.rename(columns={'calendar_date': 'visit_date'}, inplace=True)
        data = pd.merge(data, date_info, how='left', on='visit_date')
        
        print("Date information collection...Done")
    
        
        ## -------------weather-information-store------------------------
        # add the weather information using air store weather information data
        air_store_weather_info.rename(columns={'calendar_date': 'visit_date'}, inplace=True)
        air_store_weather_info.visit_date = pd.to_datetime(air_store_weather_info.visit_date)
        data = pd.merge(data, air_store_weather_info, how='left', on=['air_store_id', 'visit_date'])
        ## -------------weather-information-store------------------------
        
        
        print("Weather information collection...Done")
        
        # handle outliers
        data = self.handle_outliers(data)
        
        print("Outliers Detection and handling...Done")
        
        
        ## -------------restaurant-count-feature------------------------
        # area wise restaurant count
        area_wise_store_count = air_store_info[['air_store_id', 'air_area_name']].groupby('air_area_name').count()
        area_wise_store_count.rename(columns={'air_store_id': 'area_store_count'}, inplace=True)
        data = pd.merge(data, area_wise_store_count, how='left', on='air_area_name')
        
        # area wise genre count
        area_wise_genre_count = air_store_info[['air_store_id', 'air_genre_name', 'air_area_name']].groupby(['air_area_name', 'air_genre_name']).count()
        area_wise_genre_count.rename(columns={'air_store_id': 'area_genre_count'}, inplace=True)
        data = pd.merge(data, area_wise_genre_count, how='left', on=['air_area_name', 'air_genre_name'])
        ## -------------restaurant-count-feature------------------------
        
        print("Area wise resturant, genre's count ... Done")
        
        ## ------------- reservation feature ----------------------------
        # get the reservations done on HPG
        hpg_reserve = pd.merge(store_id_relation, hpg_reservation_data, how='left', on='hpg_store_id')
        hpg_reserve.drop("hpg_store_id", axis=1, inplace=True)

        # concat both AIR and HPG reservations
        reservation_data = pd.concat([air_reservation_data, hpg_reserve])

        # add the gap in hours between the reservation time and the visit time as a feature
        reservation_data.visit_datetime = pd.to_datetime(reservation_data.visit_datetime)
        reservation_data.reserve_datetime = pd.to_datetime(reservation_data.reserve_datetime)
        reservation_data['reservation_gap'] = reservation_data.visit_datetime - reservation_data.reserve_datetime
        reservation_data['reservation_gap'] = reservation_data.reservation_gap / np.timedelta64(1,'h')


        # encode the visitors based on the reservation gap
        # 6th place solution features
        # if reservation gap is under 12 hr
        reservation_data['reserve_visitor_lt_12hr'] = np.where(reservation_data.reservation_gap < 12, reservation_data.reserve_visitors, 0)

        # if reservation gap is between 12-36 hr
        reservation_data['reserve_visitor_bt_12_36'] = np.where((reservation_data.reservation_gap >= 12) & (reservation_data.reservation_gap < 36), 
                                                                reservation_data.reserve_visitors, 0)

        # if reservation gap is between 36-59 hr
        reservation_data['reserve_visitor_bt_36_59'] = np.where((reservation_data.reservation_gap >= 36) & (reservation_data.reservation_gap < 59), 
                                                                reservation_data.reserve_visitors, 0)

        # if reservation gap is between 59-85 hr
        reservation_data['reserve_visitor_bt_59_85'] = np.where((reservation_data.reservation_gap >= 59) & (reservation_data.reservation_gap < 85), 
                                                                reservation_data.reserve_visitors, 0)

        # if reservation gap is 85 hr or more
        reservation_data['reserve_visitor_gt_85'] = np.where(reservation_data.reservation_gap >= 85, reservation_data.reserve_visitors, 0)
        reservation_data['visit_date'] = reservation_data.visit_datetime.dt.date

        # group by per store on visit date = reservation visitors
        reservation_features = reservation_data.groupby(['air_store_id', 'visit_date'], as_index=False)[['reserve_visitors', 'reserve_visitor_lt_12hr', 
                                                                                                         'reserve_visitor_bt_12_36', 'reserve_visitor_bt_36_59',
                                                                                                        'reserve_visitor_bt_59_85', 'reserve_visitor_gt_85']].sum()
        # log transform the reservation features
        reservation_columns = reservation_features.columns[2:]
        for column in reservation_columns:
            reservation_features[column] = np.log1p(reservation_features[column])
            
        reservation_features.visit_date = pd.to_datetime(reservation_features.visit_date)
        data = pd.merge(data, reservation_features, how='left', on=['air_store_id', 'visit_date']).fillna(0)
        ## ------------- reservation feature --------------------------
        
        print("Reservation guest number ....Done")
        
        ## --------------date features --------------------------------
        data['visit_day'] = data.visit_date.dt.day
        data['visit_month'] = data.visit_date.dt.month
        data['visit_year'] = data.visit_date.dt.year
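        # Series.dt.weekofyear was removed in pandas 2.0; isocalendar().week gives the same ISO week number
        data['visit_week'] = data.visit_date.dt.isocalendar().week.astype(int)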
        ## --------------date features --------------------------------
        
        print("Data Features...Done")
        
        ##---------------monthly-visitor-statistics----------------------
        # calculating monthly visitors mean for each restaurants
        month_wise_mean= data.groupby(['air_store_id','visit_month'],as_index=False)['visitors_capped'].mean()
        month_wise_mean = month_wise_mean.pivot(index='air_store_id', columns='visit_month', values='visitors_capped').reset_index()
        month_wise_mean.columns = ['air_store_id','m1','m2','m3','m4','m5','m6','m7','m8','m9','m10','m11','m12']

        month_wise_mean_2= data.groupby(['air_store_id','visit_month'],as_index=False)['visitors_capped_log1p'].mean()
        month_wise_mean_2 = month_wise_mean_2.pivot(index='air_store_id', columns='visit_month', values='visitors_capped_log1p').reset_index()
        month_wise_mean_2.columns = ['air_store_id','m1_log1p','m2_log1p','m3_log1p','m4_log1p','m5_log1p','m6_log1p','m7_log1p','m8_log1p','m9_log1p',
                                   'm10_log1p','m11_log1p','m12_log1p']
        # merging with data
        data = data.merge(month_wise_mean,on='air_store_id',how='left')
        data = data.merge(month_wise_mean_2,on='air_store_id',how='left')
        ##---------------monthly-visitor-statistics-------------------------
        
        print("Monthly visitors statistics... Done")
    
        ## --------------visitors-statistics----------------------------------
        store_non_working = self.visitor_statistics(data, group_by=['air_store_id', 'non_working'], on='visitors_capped')
        store_dow = self.visitor_statistics(data, group_by=['air_store_id', 'day_of_week'], on='visitors_capped')
        store = self.visitor_statistics(data, group_by=['air_store_id'], on='visitors_capped')
        
        store_non_working_1 = self.visitor_statistics(data, group_by=['air_store_id', 'non_working'], on='visitors_capped_log1p')
        store_dow_1 = self.visitor_statistics(data, group_by=['air_store_id', 'day_of_week'], on='visitors_capped_log1p')
        store_1 = self.visitor_statistics(data, group_by=['air_store_id'], on='visitors_capped_log1p')
        
        data = pd.merge(data, store_non_working, how='left', on=['air_store_id', 'non_working'])
        data = pd.merge(data, store_dow, how='left', on=['air_store_id', 'day_of_week'])
        data = pd.merge(data, store, how='left', on=['air_store_id'])
    
        data = pd.merge(data, store_non_working_1, how='left', on=['air_store_id', 'non_working'])
        data = pd.merge(data, store_dow_1, how='left', on=['air_store_id', 'day_of_week'])
        data = pd.merge(data, store_1, how='left', on=['air_store_id'])
        ## --------------visitors-statistics----------------------------------
        
        print("Store wise, working, day of week visitors statistics...Done")
        
        ##------------------prefectures-feature--------------------------------
        data['area_prefecture'] = data.air_area_name.apply(lambda x: x.split()[0])
        data['area_sub_prefecture'] = data.air_area_name.apply(lambda x: x.split()[1])
        ##------------------prefectures-feature--------------------------------
        
        print("Area prefectures...Done")
        
        ##------------------weekly_restaurants_count-----------------------------
        weekly_count = self.weekly_restaurants_count(data)
        data = pd.merge(data, weekly_count, how='left', on=['visit_year', 'visit_week'])
        ##------------------weekly_restaurants_count-----------------------------
        
        print("Weekly resturant open..Done")
        
        ##------------------nb-reservation-visit-date----------------------------
        reservation_count_feature = reservation_data.groupby(['air_store_id', 'visit_date'], as_index=False).visit_datetime.count()
        reservation_count_feature.visit_date = pd.to_datetime(reservation_count_feature.visit_date)
        reservation_count_feature.rename(columns={'visit_datetime' : 'reservation_count'}, inplace=True)
        data = pd.merge(data, reservation_count_feature, how='left', on=['air_store_id', 'visit_date']).fillna(0)
        data.reservation_count = data.reservation_count.apply(lambda x: np.log1p(x))
        ##------------------nb-reservation-visit-date----------------------------
        
        print("Reservations...Done")
        
        ##-------------------random-features--------------------------------------
        data['lat_plus_long'] = data.latitude + data.longitude
        data['lat_max_diff'] = data.latitude.max() - data.latitude 
        data['long_max_diff'] = data.longitude.max() - data.longitude 
        ##-------------------random-features--------------------------------------
        
        print("Random features...Done\n")
        
        # fix accented characters
        data['area_prefecture'] = data['area_prefecture'].apply(lambda x: unidecode.unidecode(x))
        data['area_sub_prefecture'] = data['area_sub_prefecture'].apply(lambda x: unidecode.unidecode(x))
        
        # set air_id+ date as index
        data.sort_values(['air_store_id', 'visit_date'], inplace=True)
        data['id'] = data.air_store_id + "_" + data.visit_date.astype('str')
        data.set_index('id', inplace=True)
        data.drop(['air_store_id', 'visit_date'], axis=1, inplace=True)
        
        return data


    def weekly_restaurants_count(self, df: pd.DataFrame):
        """
        This function will return the count of restaurants that are open on the given week
        """
        total_data = df.copy()
        total_data.loc[total_data.visit_week == 53, 'visit_week'] = 0
        week_list = list(total_data['visit_week'].unique())
        year_list = list(total_data['visit_year'].unique())

        year_week_count = []
        for year in year_list:
            for week_num in week_list:

                count = len(list(total_data.loc[(total_data.visit_year ==year) & 
                                                (total_data.visit_week == week_num),'air_store_id'].unique()))
                # the data only runs up to May 2017
                if (year == 2017) and (week_num>22):
                    break

                year_week_count.append([year, week_num, count])

        columns = ['visit_year', 'visit_week', 'open_resturant_count']
        #restaurant_weekly_open = pd.DataFrame(year_week_count, columns=columns)
        return pd.DataFrame(year_week_count, columns=columns)
        
        
    def visitor_statistics(self, df, group_by, on):
        """
        This function will compute visitor statistics (mean, median, min, max, count) for the given grouping
        """
        
        temp = df.groupby(group_by, as_index=False)
        
        # mean
        stats = temp[on].mean()
        stats.rename(columns={on: f'mean_{on}_{"_".join(group_by)}'}, inplace=True)
        
        # median
        stats = pd.merge(stats, temp[on].median(), how='left', on=group_by)
        stats.rename(columns={on: f'median_{on}_{"_".join(group_by)}'}, inplace=True)
        
        # minimum
        stats = pd.merge(stats, temp[on].min(), how='left', on=group_by)
        stats.rename(columns={on: f'min_{on}_{"_".join(group_by)}'}, inplace=True)
        
        # maximum
        stats = pd.merge(stats, temp[on].max(), how='left', on=group_by)
        stats.rename(columns={on: f'max_{on}_{"_".join(group_by)}'}, inplace=True)
        
        # count
        stats = pd.merge(stats, temp[on].count(), how='left', on=group_by)
        stats.rename(columns={on: f'count_{on}_{"_".join(group_by)}'}, inplace=True)
        
        return stats
        
        
    def calulate_IQR_outlier_range(self, df):
        
        """
        This function will calculate the upper IQR fence (Q3 + 1.5 * IQR) for the given values,
        limited to their observed maximum
        """

        q1, q3 = np.quantile(df.values, [0.25, 0.75])
        higher_IQR = q3 + (1.5 * (q3 - q1))
        
        # return the smaller of the upper fence and the observed maximum
        return min(higher_IQR, df.max())

    
    def handle_outliers(self, df: pd.DataFrame):
        
        """
        This function will handle the outliers, by capping them to their max outlier range
        """
        
        # group the data by stores
        stores = df[['air_store_id', 'visitors']].groupby('air_store_id')
        
        # calculate the higher IQR range for each store(No negative visitors)
        max_store_vistors_capped = stores.apply(lambda x: self.calulate_IQR_outlier_range(x.visitors))
        max_store_vistors_capped.name = 'max_visitor'
        
        # add the max_capped_data for each store
        df = pd.merge(df, max_store_vistors_capped, how='left', on='air_store_id')
        
        # minimum of max data and the original 
        df['visitors_capped'] = np.where(df.visitors < df.max_visitor, 
                                                df.visitors, df.max_visitor)
        
        # add the log transformation of the visitors too
        df['visitors_log1p'] = df.visitors.apply(lambda x: np.log1p(x))
        df['visitors_capped_log1p'] = df.visitors_capped.apply(lambda x: np.log1p(x))
        
        # drop the max visitor
        df.drop('max_visitor', axis=1, inplace=True)
        
        return df
    
    
    def rmsle(self, y_true, y_pred):
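        """
        This function will calculate the RMSLE score given the prediction and true values.
        Both inputs are expected in log1p space (model output and log1p-transformed target).
        """

        # map the log1p values back to visitor counts
        y_pred = np.expm1(y_pred)
        y_true = np.expm1(y_true)

        # calculate the rmsle:
        # sqrt(mean((log1p(y_true) - log1p(y_pred))^2))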
        return np.sqrt(np.mean(np.square(np.log1p(y_true) - np.log1p(y_pred))))
    
    
    def feature_selection_RFE(self, df):
        """
        This function will return the column names of the features selected by RFE.
        The list is read from the header of the saved feature-eliminated train file, so RFE is not rerun here.
        """
        
        # feature with rank 1
        selected_features = list(pd.read_csv(processed_data_path + '/feature_eliminated_train.csv').columns)
        
        print("Feature selection process running...Done")
        print("No of selected features: ", len(selected_features))
        
        return selected_features
        
    # train
    def function1(self, train_data, train_target):
        
        """
        This function will take training data, training targets and calculate the RMSLE Score
        """
        
        # convert to xgb matrix 
        x_train_matrix = xgb.DMatrix(data=train_data, label=train_target)
        
        # predict the train matrix outputs
        xgb_train_predictions = self.xgb_model.predict(x_train_matrix)
        
        train_rmsle_score = self.rmsle(train_target.values, xgb_train_predictions)
        
        return train_rmsle_score
        
    # prediction
    def function2(self, test_data):
        """
        This function will take a single query id (str) or a Series/numpy array of query ids
        and return the predicted visitors
        """
        
        if isinstance(test_data, pd.Series):
            id_values = test_data.values
    
        elif isinstance(test_data, str):
            id_values = [test_data]
            
        elif isinstance(test_data, np.ndarray):
            id_values = test_data
        else:
            return "Please send the queries in Series/numpy/Single string format"
        
        
        # fetch the featurized vector for the test data
        featurized_test_data = self.data.loc[:, self.features]
        featurized_test_data = pd.get_dummies(featurized_test_data, columns=['day_of_week', 
                                                                             'area_prefecture', 
                                                                             'area_sub_prefecture', 
                                                                             'air_genre_name'])
        
        # get the processed data for query points
        featurized_test_data = featurized_test_data.loc[featurized_test_data.index.isin(id_values), :]
        
        # convert it to Dmatrix for prediction
        x_test_matrix = xgb.DMatrix(data=featurized_test_data.drop('visitors_capped_log1p', axis=1))
        
        return np.expm1(self.xgb_model.predict(x_test_matrix))
        

In [110]:
# instantiate the class; this will prepare, process, and feature-select the data
rv = RRVF()

Reading data from multiple sources...Done 

Test data prepared...Done
Date information collection...Done
Weather information collection...Done
Outliers Detection and handling...Done
Area wise resturant, genre's count ... Done
Reservation guest number ....Done
Data Features...Done
Monthly visitors statistics... Done
Store wise, working, day of week visitors statistics...Done
Area prefectures...Done
Weekly resturant open..Done
Reservations...Done
Random features...Done

Feature selection process running...Done
No of selected features:  66
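
The outlier step logged above caps each store's daily visitors at the upper IQR fence, Q3 + 1.5 * (Q3 - Q1). A minimal sketch of that idea on made-up data follows. The store IDs, the counts, and the helper name `upper_fence` are hypothetical, and the sketch uses `groupby(...).transform` instead of the merge-based approach in `handle_outliers`.

```python
import numpy as np
import pandas as pd

# Made-up daily visitor counts for two hypothetical stores
visits = pd.DataFrame({
    "air_store_id": ["store_a"] * 6 + ["store_b"] * 6,
    "visitors": [10, 12, 11, 13, 12, 90, 30, 32, 31, 29, 33, 30],
})


def upper_fence(values: pd.Series) -> float:
    """Upper IQR fence, limited to the observed maximum."""
    q1, q3 = np.quantile(values, [0.25, 0.75])
    return float(min(q3 + 1.5 * (q3 - q1), values.max()))


# One fence per store, broadcast back to every row of that store
fence = visits.groupby("air_store_id")["visitors"].transform(upper_fence)

# Cap each day's visitors at its store's fence
visits["visitors_capped"] = np.minimum(visits["visitors"], fence)

print(visits)
```

In this toy data only the 90-visitor day of `store_a` is changed; every other row is already below its store's fence.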


In [111]:
import time
# get the processed train data
train = rv.get_data(train_only=True, feature_selected=True)

start_time = time.time()
# use function1 to get the RMSLE score for the train data
score = rv.function1(train.drop('visitors_capped_log1p', axis=1), train['visitors_capped_log1p'])
end_time = time.time()

print("\nThe RMSLE score on train data and its target: ", score)
print(f"\nTime taken to predict train_data({train.shape[0]}): {(end_time - start_time):.2f} secs", )

fetching feature_selected train data...done

The RMSLE score on train data and its target:  0.4131801507919498

Time taken to predict train_data(252108): 68.60 secs
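
The score above is computed on a target that is already log1p-transformed (`visitors_capped_log1p`), so the `rmsle` method maps both arrays back with `expm1` before taking logs again. The short check below, using made-up visitor counts, illustrates that this round trip gives the same value as a plain RMSE taken directly in log1p space. It is only a sanity check, not a replacement for the method.

```python
import numpy as np

# Made-up raw visitor counts and predictions
visitors_true = np.array([3.0, 20.0, 45.0])
visitors_pred = np.array([5.0, 18.0, 60.0])

# The model and target live in log1p space
log_true = np.log1p(visitors_true)
log_pred = np.log1p(visitors_pred)

# Plain RMSE computed directly on the log1p values
rmse_in_log_space = np.sqrt(np.mean((log_true - log_pred) ** 2))

# Round trip: back to counts with expm1, then RMSLE on the counts
rmsle_roundtrip = np.sqrt(np.mean(
    (np.log1p(np.expm1(log_true)) - np.log1p(np.expm1(log_pred))) ** 2
))

print(rmse_in_log_space, rmsle_roundtrip)
```

The two printed values agree up to floating-point error.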


In [112]:
# get the submission data to get predictions
sub = pd.read_csv(data_path + "/sample_submission.csv")

# send subset of queries in single str datapoint
test = sub.iloc[1, 0]

start_time = time.time()
predicition = rv.function2(test)
end_time = time.time()

print(f"Time taken to predict test_data: {(end_time - start_time):.2f} secs" )
print("prediction: ")
result = pd.DataFrame(dict(query=[test], visitors=predicition))
result

Time taken to predict test_data: 1.07 secs
prediction: 


| | query | visitors |
|---|---|---|
| 0 | air_00a91d42b08b08d9_2017-04-24 | 21.351002 |


In [113]:
# get the submission data to get predictions
sub = pd.read_csv(data_path + "/sample_submission.csv")

# send subset of queries in Series format
test = sub.iloc[:5, 0]

start_time = time.time()
predicition = rv.function2(test)
end_time = time.time()

print(f"Time taken to predict test_data: {(end_time - start_time):.2f} secs" )
print("prediction: ")
result = pd.DataFrame(dict(query=test.values, visitors=predicition))
result

Time taken to predict test_data: 0.93 secs
prediction: 


| | query | visitors |
|---|---|---|
| 0 | air_00a91d42b08b08d9_2017-04-23 | 1.817007 |
| 1 | air_00a91d42b08b08d9_2017-04-24 | 21.351002 |
| 2 | air_00a91d42b08b08d9_2017-04-25 | 25.375101 |
| 3 | air_00a91d42b08b08d9_2017-04-26 | 29.888779 |
| 4 | air_00a91d42b08b08d9_2017-04-27 | 30.829861 |


In [114]:
# get the submission data to get predictions
sub = pd.read_csv(data_path + "/sample_submission.csv")

# send subset of queries in single array datapoint
test = sub.iloc[1, 0]

start_time = time.time()
predicition = rv.function2(np.array([test]))
end_time = time.time()

print(f"Time taken to predict test_data: {(end_time - start_time):.2f} secs" )
print("prediction: ")
result = pd.DataFrame(dict(query=[test], visitors=predicition))
result

Time taken to predict test_data: 0.96 secs
prediction: 


| | query | visitors |
|---|---|---|
| 0 | air_00a91d42b08b08d9_2017-04-24 | 21.351002 |


In [115]:
# get the submission data to get predictions
sub = pd.read_csv(data_path + "/sample_submission.csv")

# send a subset of queries as a numpy array
test = sub.iloc[:5, 0]

start_time = time.time()
predicition = rv.function2(test.values)
end_time = time.time()

print(f"Time taken to predict test_data: {(end_time - start_time):.2f} secs" )
print("prediction: ")
result = pd.DataFrame(dict(query=test.values, visitors=predicition))
result

Time taken to predict test_data: 0.98 secs
prediction: 


| | query | visitors |
|---|---|---|
| 0 | air_00a91d42b08b08d9_2017-04-23 | 1.817007 |
| 1 | air_00a91d42b08b08d9_2017-04-24 | 21.351002 |
| 2 | air_00a91d42b08b08d9_2017-04-25 | 25.375101 |
| 3 | air_00a91d42b08b08d9_2017-04-26 | 29.888779 |
| 4 | air_00a91d42b08b08d9_2017-04-27 | 30.829861 |
