# 01_simple_model_feature_engineering

* Split air_visit_data into a training set and a test set.
* Build a few features:
    * previous visits
    * holiday/weekday features
* Train different model types.
* Evaluate the expected error (RMSLE).
* Submit.

## Imports

In [1]:
import pandas as pd
#https://www.kaggle.com/irinaabdullaeva/welcome-recruit-restaurant-visitor-forecasting

## Load Data

In [2]:
data = {
    'air_reserve': pd.read_csv('data/air_reserve.csv'),
    'air_store_info': pd.read_csv('data/air_store_info.csv'),
    'air_visit_data': pd.read_csv('data/air_visit_data.csv'),
    'date_info': pd.read_csv('data/date_info.csv'),
    'hpg_reserve': pd.read_csv('data/hpg_reserve.csv'),
    'hpg_store_info': pd.read_csv('data/hpg_store_info.csv'),
    'sample_submission': pd.read_csv('data/sample_submission.csv'),
    'store_id_relation': pd.read_csv('data/store_id_relation.csv'),
}

## Split Training and Test Datasets

In [21]:
sumbission_df = data['sample_submission'].copy()
# Each id has the form '<store_id>_<visit_date>'; everything before the last underscore is the store id.
sumbission_df['store_id'] = sumbission_df.apply(lambda x: '_'.join(x['id'].split('_')[:-1]),axis=1)
sumbission_df['visit_date'] = sumbission_df.apply(lambda x: x['id'].split('_')[-1],axis=1)
print('The submission dataset contains data between {min_date} and {max_date}'.format(min_date=min(sumbission_df['visit_date']), max_date=max(sumbission_df['visit_date'])))
print('The submission dataset contains {unique_air_stores} unique air stores'.format(unique_air_stores=len(sumbission_df['store_id'].unique())))

The submission dataset contains data between 2017-04-23 and 2017-05-31
The submission dataset contains 821 unique air stores
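
The row-wise `apply` calls above can also be written with pandas' vectorized string methods. This is a minimal sketch, assuming every `id` has the form `<store_id>_<visit_date>` as in the sample submission. The two example ids and the helper name `split_submission_id` are made up for illustration.

```python
import pandas as pd

def split_submission_id(df):
    """Return a copy of df with store_id and visit_date parsed from the id column."""
    out = df.copy()
    # Split once, from the right, so underscores inside the store id are preserved.
    parts = out['id'].str.rsplit('_', n=1, expand=True)
    out['store_id'] = parts[0]
    out['visit_date'] = parts[1]
    return out

# Hypothetical ids in the same format as sample_submission.csv
example_df = pd.DataFrame({'id': ['air_0001_2017-04-23', 'air_0002_2017-05-31']})
parsed_df = split_submission_id(example_df)
print(parsed_df[['store_id', 'visit_date']])
```

Splitting from the right avoids the `'_'.join(...)` step, since only the final underscore separates the date from the store id.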


In [4]:
air_visit_df = data['air_visit_data'].copy()
print('The air_visit_data dataset contains data between {min_date} and {max_date}'.format(min_date=min(air_visit_df['visit_date']), max_date=max(air_visit_df['visit_date'])))
print('The air_visit_data dataset contains {unique_air_stores} unique air stores'.format(unique_air_stores=len(air_visit_df['air_store_id'].unique())))

The air_visit_data dataset contains data between 2016-01-01 and 2017-04-22
The air_visit_data dataset contains 829 unique air stores


In [5]:
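air_visit_df = air_visit_df.rename(columns={'air_store_id':'store_id'})
# visit_date is an ISO-formatted string (YYYY-MM-DD), so string comparison orders dates correctly.
train_test_split_date = '2017-01-01'
# .copy() makes each split an independent DataFrame, so adding feature columns later
# does not trigger pandas' SettingWithCopyWarning.
train_df = air_visit_df[air_visit_df['visit_date'] < train_test_split_date].copy()
test_df = air_visit_df[air_visit_df['visit_date'] >= train_test_split_date].copy()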

## Feature Engineering

In [6]:
def calc_instance_features(df):
    """Add calendar features derived from the visit_date column (modifies df in place)."""
    df['visit_datetime'] = pd.to_datetime(df['visit_date'])
    df['year'] = df['visit_datetime'].dt.year
    df['month'] = df['visit_datetime'].dt.month
    df['day'] = df['visit_datetime'].dt.day
    df['weekday'] = df['visit_datetime'].dt.weekday  # Monday=0, Sunday=6
    return df

train_df = calc_instance_features(train_df)
test_df = calc_instance_features(test_df)

# Per-store visitor averages, computed on the training period (2016) only.
store_mean_2016 = train_df[['store_id','visitors']].groupby('store_id',as_index=False).mean().rename(columns={'visitors':'store_visitors_mean'})
store_weekday_mean_2016 = train_df[['store_id','visitors','weekday']].groupby(['store_id','weekday'], as_index=False).mean().rename(columns={'visitors':'store_visitors_weekday_mean'})

# Attach the averages as features; rows without a matching training average get 0.
train_df = pd.merge(train_df, store_mean_2016, on=['store_id'],how='left').fillna(0)
train_df = pd.merge(train_df, store_weekday_mean_2016, on=['store_id','weekday'],how='left').fillna(0)

test_df = pd.merge(test_df, store_mean_2016, on=['store_id'],how='left').fillna(0)
test_df = pd.merge(test_df, store_weekday_mean_2016, on=['store_id','weekday'],how='left').fillna(0)

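
The outline lists holiday features, but the cell above only derives calendar fields and per-store averages. One possible way to add a holiday flag is to join `date_info` on the visit date. This is a sketch, not the notebook's method: it assumes `date_info` has `calendar_date` and `holiday_flg` columns, and the tiny frames and the helper name `add_holiday_flag` are invented for illustration.

```python
import pandas as pd

def add_holiday_flag(df, date_info_df):
    """Return a copy of df with a holiday_flg column joined on the visit date."""
    # Assumed column names: calendar_date (YYYY-MM-DD string) and holiday_flg (0/1).
    holidays = date_info_df[['calendar_date', 'holiday_flg']].rename(
        columns={'calendar_date': 'visit_date'})
    out = pd.merge(df, holidays, on='visit_date', how='left')
    # Dates missing from date_info are treated as non-holidays.
    out['holiday_flg'] = out['holiday_flg'].fillna(0).astype(int)
    return out

# Made-up rows, for illustration only
visits_example = pd.DataFrame({
    'store_id': ['air_0001', 'air_0001', 'air_0002'],
    'visit_date': ['2016-01-01', '2016-01-02', '2016-01-05'],
    'visitors': [10, 20, 5],
})
date_info_example = pd.DataFrame({
    'calendar_date': ['2016-01-01', '2016-01-02'],
    'holiday_flg': [1, 0],
})
with_holidays = add_holiday_flag(visits_example, date_info_example)
print(with_holidays)
```

If the flag proves useful, `holiday_flg` could be appended to the feature list used for training.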

### Train logistic regression

In [7]:
target_col = 'visitors'
features_cols = ['month','day','weekday','store_visitors_mean','store_visitors_weekday_mean']

In [14]:
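from sklearn.linear_model import LogisticRegression
# Train on a random sample of 10,000 rows.
train_sample_df = train_df.sample(n = 10000)
# LogisticRegression is a classifier, so each distinct visitor count is treated as a separate class.
clf = LogisticRegression(random_state=0).fit(X=train_sample_df[features_cols], y=train_sample_df[target_col])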
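
Because `LogisticRegression` predicts class labels, one alternative worth comparing is a regressor fitted on `log1p(visitors)`, which lines up with the logarithmic error metric. The following is a sketch of that swapped-in technique, not the model trained in this notebook. The synthetic data and the names `fit_log_linear_model` and `predict_visitors` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def fit_log_linear_model(df, feature_cols, target_col):
    """Fit ordinary least squares on log1p(target)."""
    model = LinearRegression()
    model.fit(df[feature_cols], np.log1p(df[target_col]))
    return model

def predict_visitors(model, df, feature_cols):
    """Map predictions back to visitor counts and clip them at zero."""
    log_pred = model.predict(df[feature_cols])
    return np.clip(np.expm1(log_pred), 0, None)

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
example_df = pd.DataFrame({
    'weekday': rng.integers(0, 7, size=200),
    'store_visitors_mean': rng.uniform(5, 50, size=200),
})
example_df['visitors'] = rng.poisson(example_df['store_visitors_mean'].to_numpy())

example_features = ['weekday', 'store_visitors_mean']
model = fit_log_linear_model(example_df, example_features, 'visitors')
example_df['predicted'] = predict_visitors(model, example_df, example_features)
print(example_df[['visitors', 'predicted']].head())
```

Clipping at zero keeps the predictions valid for a logarithmic metric, which is undefined for values below -1.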



In [19]:
from sklearn import metrics
test_df['predicted'] = clf.predict(test_df[features_cols])
# mean_squared_log_error returns the MSLE; the RMSLE from the outline is its square root.
msle = metrics.mean_squared_log_error(y_true=test_df['visitors'], y_pred=test_df['predicted'])
msle

0.4466748875752145
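
`metrics.mean_squared_log_error` returns the mean squared logarithmic error, so the RMSLE named in the outline is its square root. Below is a small NumPy sketch of the definition, the square root of the mean of (log(1 + predicted) - log(1 + actual))², using made-up values. The function name `rmsle` is only for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    squared_log_diff = (np.log1p(y_pred) - np.log1p(y_true)) ** 2
    return float(np.sqrt(squared_log_diff.mean()))

# Made-up values, for illustration only
actual = [10, 20, 0]
predicted = [10, 20, 0]
print(rmsle(actual, predicted))  # identical inputs give 0.0
```

Applied to the score above, the square root of 0.4467 is roughly 0.668.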

### Predict on submission_df

In [23]:
sumbission_df = calc_instance_features(sumbission_df) 

In [25]:
sumbission_df = pd.merge(sumbission_df, store_mean_2016, on=['store_id'],how='left').fillna(0)
sumbission_df = pd.merge(sumbission_df, store_weekday_mean_2016, on=['store_id','weekday'],how='left').fillna(0)
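
`fillna(0)` silently assigns zero averages to any submission store that has no rows in the training period. Here is a hedged sketch of how such stores could be listed with the `indicator` option of `pd.merge`. The frames and the helper name `stores_without_history` are invented for illustration.

```python
import pandas as pd

def stores_without_history(target_df, store_stats_df):
    """Return the sorted store_ids in target_df that are absent from store_stats_df."""
    stores = target_df[['store_id']].drop_duplicates()
    known = store_stats_df[['store_id']].drop_duplicates()
    merged = pd.merge(stores, known, on='store_id', how='left', indicator=True)
    # '_merge' is 'left_only' for stores that found no match in store_stats_df.
    return sorted(merged.loc[merged['_merge'] == 'left_only', 'store_id'])

# Made-up rows, for illustration only
submission_example = pd.DataFrame(
    {'store_id': ['air_0001', 'air_0002', 'air_0003', 'air_0001']})
stats_example = pd.DataFrame(
    {'store_id': ['air_0001', 'air_0003'], 'store_visitors_mean': [12.5, 30.0]})
missing = stores_without_history(submission_example, stats_example)
print(missing)
```

Stores reported this way might be better served by a global average than by a zero.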

In [28]:
sumbission_df['visitors'] = clf.predict(sumbission_df[features_cols])

In [30]:
sumbission_df[['id','visitors']].to_csv('01_simple_submission.csv',index=False)

In [32]:
!kaggle competitions submit -c recruit-restaurant-visitor-forecasting -f 01_simple_submission.csv -m 01_simple_submission

100%|███████████████████████████████████████| 1.06M/1.06M [00:05<00:00, 221kB/s]
Successfully submitted to Recruit Restaurant Visitor Forecasting