<a href="https://colab.research.google.com/github/ElisabethShah/DS-Unit-2-Applied-Modeling/blob/master/DS_Sprint_Challenge_8_Regression_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science, Unit 2_
 
# Regression 2 Sprint Challenge: Predict drugstore sales 🏥

For your Sprint Challenge, you'll use real-world sales data from a German drugstore chain, covering Jan 2, 2013 to July 31, 2015.

You are given three dataframes:

- `train`: historical sales data for 100 stores
- `test`: historical sales data for 100 different stores
- `store`: supplemental information about the stores


The train and test sets cover the _same_ date range, but they contain _different_ store ids. Your task is _not_ to forecast future sales from past sales. **Your task is to predict sales at unknown stores, from sales at known stores.**

The dataframes have a variety of columns:

- **Store** - a unique Id for each store
- **DayOfWeek** - integer, 1-6
- **Date** - the date, from Jan 2, 2013 to July 31, 2015.
- **Sales** - the units of inventory sold on a given date (this is the target you are predicting)
- **Customers** - the number of customers on a given date
- **Promo** - indicates whether a store is running a promo on that day
- **SchoolHoliday** - indicates the closure of public schools
- **StoreType** - differentiates between 4 different store models: a, b, c, d
- **Assortment** - describes an assortment level: a = basic, b = extra, c = extended
- **CompetitionDistance** - distance in meters to the nearest competitor store
- **CompetitionOpenSince[Month/Year]** - gives the approximate year and month of the time the nearest competitor was opened
- **Promo2** - Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating
- **Promo2Since[Year/Week]** - describes the year and calendar week when the store started participating in Promo2
- **PromoInterval** - lists the months in which a new round of Promo2 starts. E.g. "Feb,May,Aug,Nov" means each round starts in February, May, August, November of any given year for that store

This Sprint Challenge has three parts. To demonstrate mastery on each part, do all the required instructions. To earn a score of "3" for the part, also do the stretch goals.

## Setup

### Import libraries

In [0]:
import numpy as np
import pandas as pd

from sklearn.metrics import mean_squared_error, mean_squared_log_error
from sklearn.model_selection import train_test_split

### Define utility functions

In [0]:
def rmse(y_true, y_pred):
  """
  Calculate root mean squared error.
  """
  return np.sqrt(mean_squared_error(y_true, y_pred))

def rmsle(y_true, y_pred):
  """
  Calculate root mean squared log error.
  """
  return np.sqrt(mean_squared_log_error(y_true, y_pred))

### Load data

In [0]:
trainval = pd.read_csv('https://drive.google.com/uc?export=download'
                       '&id=1E9rgiGf1f_WL2S4-V6gD7ZhB8r8Yb_lE')
test = pd.read_csv('https://drive.google.com/uc?export=download'
                   '&id=1vkaVptn4TTYC9-YPZvbvmfDNHVR8aUml')
store = pd.read_csv('https://drive.google.com/uc?export=download'
                    '&id=1rZD-V1mWydeytptQfr-NL7dBqre6lZMo')

# Verify data dimensions.
assert trainval.shape == (78400, 7)
assert test.shape == (78400, 7)
assert store.shape == (200, 10)

## Split into training and validation sets

In [4]:
trainval['Store'].nunique()

100

In [0]:
train_stores, val_stores = train_test_split(trainval['Store'].unique(), random_state=0)

In [6]:
train_stores.shape, val_stores.shape

((75,), (25,))

In [7]:
train = trainval[trainval['Store'].isin(train_stores)]
val = trainval[trainval['Store'].isin(val_stores)]

train.shape, val.shape

((58800, 7), (19600, 7))

## Data Exploration

#### Sales table

In [8]:
# Preview data.
train.head()

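| | Store | DayOfWeek | Date | Sales | Customers | Promo | SchoolHoliday |
|---|---|---|---|---|---|---|---|
| 0 | 4 | 5 | 2015-07-31 | 13995 | 1498 | 1 | 1 |
| 1 | 8 | 5 | 2015-07-31 | 8492 | 833 | 1 | 1 |
| 4 | 34 | 5 | 2015-07-31 | 11144 | 1162 | 1 | 1 |
| 5 | 44 | 5 | 2015-07-31 | 6670 | 665 | 1 | 1 |
| 6 | 48 | 5 | 2015-07-31 | 3874 | 390 | 1 | 1 |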


In [9]:
# Check data types.
train.dtypes

Store             int64
DayOfWeek         int64
Date             object
Sales             int64
Customers         int64
Promo             int64
SchoolHoliday     int64
dtype: object

In [10]:
# Examine summary statistics.
train.describe(include='all')

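| | Store | DayOfWeek | Date | Sales | Customers | Promo | SchoolHoliday |
|---|---|---|---|---|---|---|---|
| count | 58800.0 | 58800.0 | 58800 | 58800.0 | 58800.0 | 58800.0 | 58800.0 |
| unique | NaN | NaN | 784 | NaN | NaN | NaN | NaN |
| top | NaN | NaN | 2014-12-10 | NaN | NaN | NaN | NaN |
| freq | NaN | NaN | 75 | NaN | NaN | NaN | NaN |
| mean | 557.226667 | 3.506378 | NaN | 7013.920357 | 825.035561 | 0.450255 | 0.195595 |
| std | 320.263225 | 1.710564 | NaN | 2849.244395 | 314.173058 | 0.497524 | 0.396662 |
| min | 4.0 | 1.0 | NaN | 1712.0 | 208.0 | 0.0 | 0.0 |
| 25% | 270.0 | 2.0 | NaN | 4997.0 | 600.0 | 0.0 | 0.0 |
| 50% | 551.0 | 3.0 | NaN | 6374.0 | 759.0 | 0.0 | 0.0 |
| 75% | 839.0 | 5.0 | NaN | 8360.0 | 989.0 | 1.0 | 0.0 |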


#### Store table

In [11]:
# Preview data.
store.head()

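| | Store | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | c | c | 620.0 | 9.0 | 2009.0 | 0 | NaN | NaN | NaN |
| 1 | 8 | a | a | 7520.0 | 10.0 | 2014.0 | 0 | NaN | NaN | NaN |
| 2 | 10 | a | a | 3160.0 | 9.0 | 2009.0 | 0 | NaN | NaN | NaN |
| 3 | 11 | a | c | 960.0 | 11.0 | 2011.0 | 1 | 1.0 | 2012.0 | Jan,Apr,Jul,Oct |
| 4 | 12 | a | c | 1070.0 | NaN | NaN | 1 | 13.0 | 2010.0 | Jan,Apr,Jul,Oct |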


In [12]:
# Check data types.
store.dtypes

Store                          int64
StoreType                     object
Assortment                    object
CompetitionDistance          float64
CompetitionOpenSinceMonth    float64
CompetitionOpenSinceYear     float64
Promo2                         int64
Promo2SinceWeek              float64
Promo2SinceYear              float64
PromoInterval                 object
dtype: object

In [13]:
# Examine summary statistics.
store.describe(include='all')

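| | Store | StoreType | Assortment | CompetitionDistance | CompetitionOpenSinceMonth | CompetitionOpenSinceYear | Promo2 | Promo2SinceWeek | Promo2SinceYear | PromoInterval |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 200.0 | 200 | 200 | 199.0 | 163.0 | 163.0 | 200.0 | 36.0 | 36.0 | 36 |
| unique | NaN | 3 | 2 | NaN | NaN | NaN | NaN | NaN | NaN | 3 |
| top | NaN | a | c | NaN | NaN | NaN | NaN | NaN | NaN | Mar,Jun,Sept,Dec |
| freq | NaN | 127 | 112 | NaN | NaN | NaN | NaN | NaN | NaN | 14 |
| mean | 538.75 | NaN | NaN | 4971.758794 | 7.319018 | 2010.748466 | 0.18 | 23.916667 | 2011.666667 | NaN |
| std | 318.711977 | NaN | NaN | 7828.182796 | 3.165605 | 2.477911 | 0.385152 | 13.64525 | 0.956183 | NaN |
| min | 4.0 | NaN | NaN | 30.0 | 1.0 | 2005.0 | 0.0 | 1.0 | 2010.0 | NaN |
| 25% | 259.25 | NaN | NaN | 890.0 | 4.0 | 2009.0 | 0.0 | 14.0 | 2011.0 | NaN |
| 50% | 534.5 | NaN | NaN | 2180.0 | 9.0 | 2011.0 | 0.0 | 22.0 | 2012.0 | NaN |
| 75% | 819.5 | NaN | NaN | 4655.0 | 9.0 | 2013.0 | 0.0 | 35.0 | 2012.0 | NaN |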


In [14]:
# List possible promo2 schedules.
store['PromoInterval'].unique()

array([nan, 'Jan,Apr,Jul,Oct', 'Feb,May,Aug,Nov', 'Mar,Jun,Sept,Dec'],
      dtype=object)

## 1. Wrangle relational data, Log-transform the target
- Merge the `store` dataframe with the `train` and `test` dataframes. 
- Arrange the X matrix and y vector for the train and test sets.
- Log-transform the target for the train and test set.
- Plot the target's distribution for the train set, before and after the transformation.

#### Stretch goals
- Engineer 3+ more features.

### Merge the `store` dataframe with the `train` and `test` dataframes
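
One way to approach the merge is a left join on `Store`, so that every sales row is kept and gains the store-level columns. The sketch below uses small made-up frames (`sales` and `store_info` are hypothetical stand-ins for `train`/`test` and `store`), not the real data.

```python
import pandas as pd

# Toy stand-ins for the real frames (values are illustrative only).
sales = pd.DataFrame({
    'Store': [4, 4, 8],
    'Date': ['2015-07-30', '2015-07-31', '2015-07-31'],
    'Sales': [9000, 13995, 8492],
})
store_info = pd.DataFrame({
    'Store': [4, 8, 10],
    'StoreType': ['c', 'a', 'a'],
    'CompetitionDistance': [620.0, 7520.0, 3160.0],
})

# how='left' keeps every sales row. validate='many_to_one' raises an error
# if a store id appears more than once in store_info, which would otherwise
# silently duplicate sales rows.
merged = sales.merge(store_info, on='Store', how='left',
                     validate='many_to_one')

print(merged.shape)
```

The same call would presumably be applied to each of the train, validation, and test frames, so that all of them end up with identical columns.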

### Arrange the X matrix and y vector for the train and test sets
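
A minimal sketch of separating features from the target, assuming `Sales` is the target column as described above. The helper name `split_features_target` and the toy frame are hypothetical.

```python
import pandas as pd

TARGET = 'Sales'

def split_features_target(df, target=TARGET):
    """Return (X, y): all columns except the target, and the target itself."""
    X = df.drop(columns=[target])
    y = df[target]
    return X, y

# Toy frame with a subset of the real columns (values are illustrative).
toy = pd.DataFrame({
    'Store': [4, 8],
    'DayOfWeek': [5, 5],
    'Promo': [1, 1],
    'Sales': [13995, 8492],
})
X_toy, y_toy = split_features_target(toy)

print(list(X_toy.columns), y_toy.name)
```

Non-numeric columns such as `Date`, `StoreType`, `Assortment`, and `PromoInterval` would still need to be encoded or dropped before most regressors can use them.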

### Log-transform the target for the train and test set
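
A sketch of the transform using `np.log1p`, which computes log(1 + y) and therefore matches the convention scikit-learn's `mean_squared_log_error` uses. Predictions made on the log scale can be mapped back with `np.expm1`. The values in `y_toy` are illustrative.

```python
import numpy as np
import pandas as pd

y_toy = pd.Series([1712, 6374, 13995], name='Sales')

# Forward transform: log(1 + y).
y_toy_log = np.log1p(y_toy)

# Inverse transform, for turning log-scale predictions back into sales.
y_toy_back = np.expm1(y_toy_log)

print(y_toy_log.round(3).tolist())
```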

### Plot the target's distribution for the train set, before and after the transformation
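
One possible way to draw both distributions side by side with matplotlib. The function name `plot_target_distributions` is hypothetical, and the sample is synthetic log-normal data rather than the real `Sales` column.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_target_distributions(y, bins=50):
    """Histogram of the target before and after a log1p transform."""
    fig, (ax_raw, ax_log) = plt.subplots(1, 2, figsize=(10, 4))
    ax_raw.hist(y, bins=bins)
    ax_raw.set_title('Original target')
    ax_raw.set_xlabel('Sales')
    ax_log.hist(np.log1p(y), bins=bins)
    ax_log.set_title('Log-transformed target')
    ax_log.set_xlabel('log1p(Sales)')
    fig.tight_layout()
    return fig

# Synthetic right-skewed sample standing in for the real target.
rng = np.random.default_rng(0)
toy_sales = rng.lognormal(mean=8.8, sigma=0.4, size=1000)
fig = plot_target_distributions(toy_sales)
```

With the real data, the call would presumably be `plot_target_distributions(train['Sales'])`.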

#### Original target distribution

#### Log-transformed target distribution

### Stretch: Engineer 3+ more features
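
A sketch of a few features that could be derived once the sales and store tables are merged: calendar parts of `Date`, months since the nearest competitor opened, and whether a new Promo2 round starts in the current month. The function name and the new column names are hypothetical. Note that `PromoInterval` spells September as `Sept`, so the month names are mapped by hand rather than with `strftime('%b')`.

```python
import numpy as np
import pandas as pd

# Month abbreviations as they appear in PromoInterval ('Sept', not 'Sep').
MONTH_ABBR = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jul', 'Aug', 'Sept', 'Oct', 'Nov', 'Dec']

def engineer_features(df):
    """Return a copy of df with Date replaced by derived features."""
    df = df.copy()
    date = pd.to_datetime(df['Date'])
    df['Year'] = date.dt.year
    df['Month'] = date.dt.month

    # Months since the nearest competitor opened; NaN when unknown.
    df['CompetitionOpenMonths'] = (
        (df['Year'] - df['CompetitionOpenSinceYear']) * 12
        + (df['Month'] - df['CompetitionOpenSinceMonth'])
    )

    # 1 if a new Promo2 round starts in this row's month, else 0.
    month_name = df['Month'].map(lambda m: MONTH_ABBR[m - 1])
    intervals = df['PromoInterval'].fillna('').str.split(',')
    df['Promo2StartsThisMonth'] = [
        int(name in months) for name, months in zip(month_name, intervals)
    ]
    return df.drop(columns=['Date'])

# Toy rows (illustrative values only).
toy = pd.DataFrame({
    'Date': ['2015-07-31', '2015-07-31'],
    'CompetitionOpenSinceMonth': [9.0, np.nan],
    'CompetitionOpenSinceYear': [2009.0, np.nan],
    'PromoInterval': [np.nan, 'Jan,Apr,Jul,Oct'],
})
features = engineer_features(toy)
print(features[['CompetitionOpenMonths', 'Promo2StartsThisMonth']])
```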

## 2. Fit and validate your model
- **Use Gradient Boosting** or any type of regression model.
- **Beat the baseline:** Guessing the mean sales for every prediction gives an estimated baseline Root Mean Squared Logarithmic Error (RMSLE) of 0.90. Remember that RMSE with the log-transformed target is equivalent to RMSLE with the original target. Try to get your error below 0.20.
- **To validate your model, choose any one of these options:**
  - Split the train dataframe into train and validation sets. Put all dates for a given store into the same set. Use xgboost `early_stopping_rounds` with the validation set. 
  - Or, use scikit-learn `cross_val_score`. Put all dates for a given store into the same fold.
  - Or, use scikit-learn `RandomizedSearchCV` for hyperparameter optimization. Put all dates for a given store into the same fold.
- **Get the Validation Error** (multiple times if you try multiple iterations) **and Test Error** (one time, at the end).
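
The equivalence between RMSE on the log-transformed target and RMSLE on the original target can be checked numerically. This sketch assumes the `log1p` convention used by scikit-learn's `mean_squared_log_error`; the five sales values are illustrative, and the prediction is a mean baseline.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Illustrative sales values and a "guess the mean" baseline prediction.
y_true = np.array([1712.0, 4997.0, 6374.0, 8360.0, 13995.0])
y_pred = np.full_like(y_true, y_true.mean())

# RMSLE computed on the original scale.
rmsle_original = np.sqrt(mean_squared_log_error(y_true, y_pred))

# RMSE computed after applying log1p to both arrays.
rmse_logged = np.sqrt(
    mean_squared_error(np.log1p(y_true), np.log1p(y_pred))
)

print(np.isclose(rmsle_original, rmse_logged))
```

This means a model trained on `log1p(Sales)` can be scored with plain RMSE, and the result can be read as RMSLE on the original sales.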
  
#### Stretch goal
- Optimize 3+ hyperparameters by searching 10+ "candidates" (possible combinations of hyperparameters). 
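
A sketch of the third validation option: `RandomizedSearchCV` with `GroupKFold`, so that all rows for a given store land in the same fold. It searches 10 candidates over 3 hyperparameters. The data is synthetic, and scikit-learn's `GradientBoostingRegressor` is used here as a stand-in for xgboost; the search ranges are arbitrary assumptions, not tuned recommendations.

```python
import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

# Synthetic data: 12 "stores" with 30 "days" each, target on a log scale.
rng = np.random.default_rng(0)
n_stores, n_days = 12, 30
groups = np.repeat(np.arange(n_stores), n_days)  # store id per row
X = rng.normal(size=(n_stores * n_days, 3))
y_log = 8.5 + 0.3 * X[:, 0] + rng.normal(scale=0.1, size=len(X))

# Three hyperparameters, sampled at random (ranges are assumptions).
param_distributions = {
    'n_estimators': randint(20, 60),
    'max_depth': randint(2, 5),
    'learning_rate': uniform(0.05, 0.25),  # samples from [0.05, 0.30]
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,                              # 10 candidates
    scoring='neg_root_mean_squared_error',  # RMSE on the log scale
    cv=GroupKFold(n_splits=3),              # folds never split a store
    random_state=0,
)
# groups= is forwarded to GroupKFold so each store stays in one fold.
search.fit(X, y_log, groups=groups)

print('Best params:', search.best_params_)
print('Validation RMSE (log scale):', -search.best_score_)
```

With the real data, `groups` would presumably be the `Store` column of the training frame, and the test error would be computed once at the end with the refit `search.best_estimator_`.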

## 3. Plot model interpretation visualizations
- Choose any one of these options:
  - Permutation Importances plot
  - Partial Dependence Plot, 1 feature isolation
  - Partial Dependence Plot, 2 feature interaction
  
#### Stretch goals
- Plot 2+ visualizations.
- Use permutation importances for feature selection. 
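
A sketch of permutation importances using `sklearn.inspection.permutation_importance`, computed on held-out rows. The two features (`signal`, `noise`) and the model are synthetic placeholders; with the real data the fitted model and the validation set would take their place.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: the target depends on 'signal' but not on 'noise'.
rng = np.random.default_rng(0)
X_all = pd.DataFrame({
    'signal': rng.normal(size=300),
    'noise': rng.normal(size=300),
})
y_all = 2.0 * X_all['signal'] + rng.normal(scale=0.1, size=300)

# Simple holdout: first 200 rows for fitting, last 100 for scoring.
X_fit, X_val = X_all.iloc[:200], X_all.iloc[200:]
y_fit, y_val = y_all.iloc[:200], y_all.iloc[200:]

model = GradientBoostingRegressor(random_state=0).fit(X_fit, y_fit)

# Shuffle one column at a time and measure how much the score drops.
result = permutation_importance(
    model, X_val, y_val,
    scoring='neg_root_mean_squared_error',
    n_repeats=5,
    random_state=0,
)
importances = pd.Series(
    result.importances_mean, index=X_val.columns
).sort_values(ascending=False)

print(importances)
```

Calling `importances.plot.barh()` would turn the series into a bar chart, and features whose importance is near zero or negative are candidates to drop for feature selection.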