_Lambda School Data Science, Unit 2_
 
# Regression 2 Sprint Challenge: Predict drugstore sales 🏥

For your Sprint Challenge, you'll use real-world sales data from a German drugstore chain, from Jan 2, 2013 — July 31, 2015.

You are given three dataframes:

- `train`: historical sales data for 100 stores
- `test`: historical sales data for 100 different stores
- `store`: supplemental information about the stores


The train and test set do _not_ have different date ranges. But they _do_ have different store ids. Your task is _not_ to forecast future sales from past sales. **Your task is to predict sales at unknown stores, from sales at known stores.**

In [49]:
import pandas as pd
train = pd.read_csv('https://drive.google.com/uc?export=download&id=1E9rgiGf1f_WL2S4-V6gD7ZhB8r8Yb_lE')
test = pd.read_csv('https://drive.google.com/uc?export=download&id=1vkaVptn4TTYC9-YPZvbvmfDNHVR8aUml')
store = pd.read_csv('https://drive.google.com/uc?export=download&id=1rZD-V1mWydeytptQfr-NL7dBqre6lZMo')
assert train.shape == (78400, 7)
assert test.shape == (78400, 7)
assert store.shape == (200, 10)

In [50]:
import numpy as np
import category_encoders as ce
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score
from sklearn.impute import SimpleImputer
import xgboost as xgb
from xgboost import XGBRegressor
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)



The dataframes have a variety of columns:

- **Store** - a unique Id for each store
- **DayOfWeek** - integer, 1-6
- **Date** - the date, from Jan 2, 2013 — July 31, 2015.
- **Sales** - the units of inventory sold on a given date (this is the target you are predicting)
- **Customers** - the number of customers on a given date
- **Promo** - indicates whether a store is running a promo on that day
- **SchoolHoliday** - indicates the closure of public schools
- **StoreType** - differentiates between 4 different store models: a, b, c, d
- **Assortment** - describes an assortment level: a = basic, b = extra, c = extended
- **CompetitionDistance** - distance in meters to the nearest competitor store
- **CompetitionOpenSince[Month/Year]** - gives the approximate year and month of the time the nearest competitor was opened
- **Promo2** - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
- **Promo2Since[Year/Week]** - describes the year and calendar week when the store started participating in Promo2
- **PromoInterval** - describes the consecutive intervals Promo2 is started, naming the months the promotion is started anew. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

This Sprint Challenge has three parts. To demonstrate mastery on each part, do all the required instructions. To earn a score of "3" for the part, also do the stretch goals.

## 1. Wrangle relational data, Log-transform the target
- Merge the `store` dataframe with the `train` and `test` dataframes. 
- Arrange the X matrix and y vector for the train and test sets.
- Log-transform the target for the train and test set.
- Plot the target's distribution for the train set, before and after the transformation.

#### Stretch goals
- Engineer 3+ more features.

In [51]:
#traintest = pd.concat([train,test])

train = train.merge(store)
test = test.merge(store)

#traintest = traintest.merge(store)


In [60]:
features = ['Store', 'DayOfWeek', 'Customers', 'Promo',
        'SchoolHoliday', 'CompetitionDistance',
        'CompetitionOpenSinceMonth', 'CompetitionOpenSinceYear', 'Promo2',
        'Promo2SinceWeek', 'Promo2SinceYear', 'PromoInterval']

target = 'Sales'

ytrain = np.log1p(train[target])
ytest = np.log1p(test[target])

xtrain = train[features]

xtest =test[features]

#xtraintest = traintest[features]
#ytraintest = traintest[target]

## 2. Fit and validate your model
- **Use Gradient Boosting** or any type of regression model.
- **Beat the baseline:** The estimated baseline Root Mean Squared Logarithmic Error is 0.90, if we guessed the mean sales for every prediction. Remember that RMSE with the log-transformed target is equivalent to RMSLE with the original target. Try to get your error below 0.20.
- **To validate your model, choose any one of these options:**
  - Split the train dataframe into train and validation sets. Put all rows for a given store into the same set. Use xgboost `early_stopping_rounds` with the validation set. 
  - Or, use scikit-learn `cross_val_score`. Put all rows for a given store into the same fold.
  - Or, use scikit-learn `RandomizedSearchCV` for hyperparameter optimization. Put all rows for a given store into the same fold.
- **Get the Validation Error** (multiple times if you try multiple iterations) **and Test Error** (one time, at the end).
  
#### Stretch goal
- Optimize 3+ hyperparameters by searching 10+ "candidates" (possible combinations of hyperparameters). 

In [53]:
#set(train['Store']) & set(test['Store'])

#trainval_store = train['Store'].unique()
#xtrain, xval, = train_test_split(trainval_store)

#xtrain = train[train.Store.isin(xtrain)]
#xval = train[train.Store.isin(xval)]

In [80]:
%%time

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
   #RandomForestRegressor(n_estimators=200, n_jobs=-1)
    XGBRegressor(n_estimators=100, n_jobs=-1)

)

folds=5
groups = train['Store']

scores = cross_val_score(pipeline, xtrain, ytrain, cv=folds, 
                         scoring='neg_mean_squared_error', groups=groups);
print(f'RMSLE for {folds} folds:', np.sqrt(-scores))
print('Mean folds :',np.mean(np.sqrt(-scores)))

pipeline.fit(xtrain, ytrain)
ypred = pipeline.predict(xtest)

print("mean Squared Error: ",mean_squared_error(ytest, ypred))



RMSLE for 5 folds: [0.15736142 0.15038773 0.16592383 0.18577861 0.17726637]
Mean folds : 0.1673435947775183
mean Squared Error:  0.022480376897081204
CPU times: user 52.5 s, sys: 1.01 s, total: 53.5 s
Wall time: 3.45 s


## 3. Plot model interpretation visualizations
- Choose any one of these options:
  - Permutation Importances plot
  - Partial Dependency Plot, 1 feature isolation
  - Partial Dependency Plot, 2 feature interaction
  
#### Stretch goals
- Plot 2+ visualizations.
- Use permutation importances for feature selection. 

In [81]:
permuter = PermutationImportance(pipeline, scoring='neg_mean_squared_error', 
                                 cv='prefit', n_iter=2)

permuter.fit(xtrain, ytrain)
feature_names = xtrain.columns.tolist()
eli5.show_weights(permuter, top=None, feature_names=feature_names)

NameError: name 'pipeline' is not defined