# Groceries - Big Data

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
%matplotlib inline

from fastai.imports import *
from fastai.structured import *
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display
from sklearn import metrics

Corporacion Favorita Grocery Sales Forecasting

The task is to forecast the dependent variable: how many units of each type of product were sold in each store on each day, over a two-week period. For each date we also have metadata, such as oil prices.

Key things to understand are the 
 - dependent variable
 - independent variable
 - and time frame

This is a star-schema, data-warehouse-style data set. There is a central table of transactions (items sold, by id), and from there you can join all kinds of metadata from other tables, such as information about the stores.
Sometimes you see a snowflake schema instead, where the metadata tables are themselves joined to further tables that describe them in more detail.
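
As a rough illustration of the star-schema idea, here is a minimal pandas sketch. The three tiny tables and their values are made up for the example; only the column names (`store_nbr`, `item_nbr`, `unit_sales`) follow the competition files, and `city` and `family` are assumed lookup columns.

```python
import pandas as pd

# Hypothetical miniature tables, invented for illustration only
transactions = pd.DataFrame({
    "id": [0, 1, 2],
    "store_nbr": [1, 1, 2],
    "item_nbr": [10, 11, 10],
    "unit_sales": [3.0, 1.0, 5.0],
})
stores = pd.DataFrame({"store_nbr": [1, 2], "city": ["Quito", "Guayaquil"]})
items = pd.DataFrame({"item_nbr": [10, 11], "family": ["GROCERY I", "DAIRY"]})

# Left joins keep every transaction even if a lookup table has no match
joined = (transactions
          .merge(stores, on="store_nbr", how="left")
          .merge(items, on="item_nbr", how="left"))
print(joined)
```

The central table stays the unit of analysis; each merge only adds descriptive columns to it.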

In [3]:
# start as before

PATH = "./data/groceries/"
!ls {PATH}

holidays_events.csv.7z	sample_submission.csv.7z  test.csv.7z
items.csv.7z		stores.csv.7z		  train.csv
oil.csv.7z		test.csv		  train.csv.7z


# Read Data

In [4]:
# over 100 million rows
# create a dictionary giving the dtype of each column
# this way pandas does not have to read the entire csv to figure out the types
#
# why ints of different sizes?
# use the smallest number of bits that can represent the column
types = {'id': 'int64',
         'item_nbr': 'int32',
         'store_nbr': 'int8',
         'unit_sales': 'float32',
         'onpromotion': 'object'}
# onpromotion is a boolean, but it has missing values that we have to deal with
# before turning it into a boolean. So it is read as object (generally strings),
# and a later cell fills in the missing values with False and replaces the strings.

In [5]:
# Reading the data takes about 2 minutes (see the timing below)

%time df_all = pd.read_csv(f'{PATH}train.csv', parse_dates = ['date'], dtype=types, infer_datetime_format = True)

CPU times: user 1min 39s, sys: 7.02 s, total: 1min 46s
Wall time: 1min 57s


In [None]:
df_all.head()

In [8]:
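# exploratory data analysis indicates that a missing value corresponds to False
# map the "True"/"False" strings to booleans first; missing values stay NaN
df_all.onpromotion = df_all.onpromotion.map({'False' : False, 'True' : True})
# then replace missing with False (filling before mapping would turn the
# filled values back into NaN, and astype(bool) would then make them True)
df_all.onpromotion = df_all.onpromotion.fillna(False)
# convert to boolean type
df_all.onpromotion = df_all.onpromotion.astype(bool)

# to_feather fails if the target directory does not exist, so create it first
import os
os.makedirs('tmp', exist_ok=True)
%time df_all.to_feather('tmp/raw_groceries')

# this file, 125 million rows, takes something like 2.5 GB of memory
# save to feather format in under 5 s

# the original run, without the makedirs call, failed with the error below
# (most likely because the tmp/ directory did not exist):
ArrowIOError: Failed to open local file: tmp/raw_groceries , error: Success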

In [None]:
# 125.5 million rows
# Because pandas is pretty fast, it can summarize all 125 million rows, 20 cols, in
# about 23 s
%time df_all.describe(include='all')

When people say Python is slow, they probably don't know how to use it properly. Python is slow only if you use it incorrectly: here 125 million CSV records are read in under 2 minutes. In practice the work is done in C code; Python itself isn't very fast. Almost everything we want to do in data science runs in heavily optimized code such as Cython and pandas. If we wrote our own CSV reader in pure Python, it would be thousands of times slower.

**Need to tell it two things**
- dates
- datatypes

**Why int64, int32, int8?**
Use the smallest number of bits that can represent each column. The purpose here is to avoid running out of RAM. In addition, when working with large data sets the slow part is reading from and writing to RAM, so as a rule of thumb smaller data sets run faster. This is especially true for SIMD (single instruction, multiple data) vectorized code, which can pack more numbers into a single vector and process them all at once.
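
To make the memory argument concrete, a small sketch with NumPy: the same one million integers stored at three widths. The array length is arbitrary and chosen only for the demonstration.

```python
import numpy as np

n = 1_000_000  # arbitrary length for the demonstration

# Bytes needed to hold n integers at each width
sizes = {np.dtype(d).name: np.zeros(n, dtype=d).nbytes
         for d in (np.int64, np.int32, np.int8)}

for name, nbytes in sizes.items():
    print(f"{name}: {nbytes / 1e6:.0f} MB")
```

Note that int8 only holds values from -128 to 127, so it suits a column only if every value fits in that range.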

**Start with a sample**
Tip: when you start, don't read in the whole thing. Use the Unix `shuf` command to get a random sample at the command prompt (search the forum for `shuf`). This is a good way to get started and do some exploring, for example to understand the data types. Generally, do the work on a sample to understand the data before moving on.
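
If `shuf` is not available, something similar can be done from pandas. This is a swapped-in alternative, not the `shuf` approach from the lesson: a minimal sketch that reads the file in chunks and keeps a random fraction of each chunk. The helper name `read_csv_sample` and the demo data are hypothetical.

```python
import io
import pandas as pd

def read_csv_sample(source, frac, chunksize=100_000, seed=0, **kwargs):
    """Read a CSV in chunks and keep a random fraction of every chunk."""
    chunks = pd.read_csv(source, chunksize=chunksize, **kwargs)
    # A different seed per chunk, so the kept positions differ between chunks
    parts = [chunk.sample(frac=frac, random_state=seed + i)
             for i, chunk in enumerate(chunks)]
    return pd.concat(parts).sort_index()

# Made-up 100-row CSV standing in for train.csv
demo_csv = "id,unit_sales\n" + "\n".join(f"{i},{i % 7}" for i in range(100))
sample = read_csv_sample(io.StringIO(demo_csv), frac=0.2, chunksize=25)
print(len(sample))
```

Only one chunk is held in memory at a time, so the full file never has to fit in RAM; the `dtype` and `parse_dates` arguments can be passed through `**kwargs`.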

**models on large data sets**
We'll talk about 

## Understand dates
The first thing to look at is the dates.

Dates will be very important.

In this Kaggle competition the dates don't overlap between train and test:
- the last date in the training set is Aug 15, 2017
- the test set starts one day later, Aug 16, 2017

We have 4 years of data and are trying to predict roughly the next two weeks.

How to sample? Take rows from the bottom, the most recent data, since the bet is that the period to predict will look most like the most recent data. There is some useful information from 4 years ago, so we don't want to throw it away. But perhaps start with easy models on the most recent dates.
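
Taking the most recent data can be sketched as a simple date filter. The helper `most_recent` and the demo frame are hypothetical; in the notebook the frame would be `df_all` before `add_datepart` is applied, since that step replaces the `date` column.

```python
import pandas as pd

def most_recent(df, date_col, days):
    """Return the rows from the last `days` days present in the frame."""
    cutoff = df[date_col].max() - pd.Timedelta(days=days)
    return df[df[date_col] > cutoff]

# Made-up daily data ending on the last training date
demo = pd.DataFrame({"date": pd.date_range("2017-07-01", "2017-08-15", freq="D")})
demo["unit_sales"] = range(len(demo))

recent = most_recent(demo, "date", days=14)
print(recent["date"].min(), recent["date"].max())
```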

In [None]:
# Look at bottom of the data set
#   store number, item no. , onpromotion (on sale)
df_all.tail()

## Training Metric Log RMSE (also review lesson 1 for reasoning)
Take the log, because we want to predict something according to ratios. The competition details also say that negative sales should be considered returns, so clip them to 0.
Take the log of the value plus 1, as specified in the Kaggle competition, because the log of zero is undefined.
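
A small worked example of the transform, using made-up sales values. It shows the clipping of returns, the log-plus-one step, and `np.expm1` as the inverse for turning predictions back into units.

```python
import numpy as np

unit_sales = np.array([-2.0, 0.0, 9.0])  # made-up values; the negative one is a return

clipped = np.clip(unit_sales, 0, None)   # returns become 0
logged = np.log1p(clipped)               # log(1 + x), which is defined at x = 0

def rmse(pred, actual):
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# expm1 is the inverse of log1p: it maps log-space values back to units
restored = np.expm1(logged)
print(logged, restored)
```

Because the model is trained on the logged values, an ordinary RMSE computed on its predictions is already the log RMSE.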

In [None]:
df_all= pd.read_feather('tmp/raw_groceries')

In [None]:
df_all.unit_sales = np.log1p(np.clip(df_all.unit_sales,0,None))

In [None]:
# add datepart as usual
%time add_datepart(df_all,'date')
# takes 1min 53 s ... run it on a sample first so it takes about 10 s
# ... Jeremy's rule of thumb: sample so that everything takes less than 10 s during exploration,
# to make sure everything looks reasonable

The code is almost identical to the bulldozers competition.

In [None]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy()

In [None]:
# split out a validation set with the same number of rows as the test set
# training set will contain everything else
# most recent rows will be our validation set.
# Generally speaking, the top half of Kaggle is pretty good.
# ... straight off the block, with no thinking, in the top 25%


# df_test is assumed to hold the test set; this notebook never loads it
n_valid = len(df_test)
n_trn = len(df_all) - n_valid

train, valid = split_vals(df_all, n_trn)
train.shape, valid.shape

Note: the variable names are a bit messed up.
In Lesson 1 Jeremy defines df_raw = pd.read ...; in this notebook it is df_all = pd.read ...
Several cells use names that are never defined here (df_test, trn, y, X_train, ...), so all this needs sorting out before the notebook will run.

In [None]:
# all columns are already numeric, so we don't need to do this
#   train_cats turns string columns into the pandas category class
#train_cats(raw_train)
#   apply_cats applies the same category codes from the training set to the validation set
#apply_cats(raw_valid, raw_train)



# Models

In [None]:
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

In [None]:
# this is key ... we have over 120 million records
# we don't want each tree built on that much data ... who knows how long it would take
# start with 10K or 100K to see how long it takes and make sure it works
# found that with 1 million, m.fit runs in less than 1 minute

set_rf_samples(1_000_000)
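
`set_rf_samples` comes from the old fastai library. In scikit-learn 0.22 and later, the `max_samples` argument of the forest has a similar effect, so a rough equivalent can be sketched as below. This is an assumption about how to reproduce the idea, not the code used in the lesson, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(5_000, 4).astype(np.float32)            # synthetic stand-in data
y = X[:, 0] * 3 + rng.normal(scale=0.1, size=len(X))

# Each tree is grown on a bootstrap sample of 500 rows rather than all 5,000
m_demo = RandomForestRegressor(n_estimators=10, max_samples=500,
                               min_samples_leaf=5, n_jobs=1, random_state=0)
m_demo.fit(X, y)
print(m_demo.score(X, y))
```

Each tree sees a different subsample, so across many trees the forest can still draw on the whole data set while each individual tree stays cheap to build.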

In [None]:
# another key step: convert the input data to an array of floats. Why?
#   internally the RF code does this conversion anyway; doing it once myself saves 1 min 37 s
# if you run the code and it takes a long time, use %prun to understand what is
# taking so long. In this example m.fit took 2.5 minutes, so I investigated
# and then pulled this line out
# (trn, y and the model m must already be defined; m is created in the next cell)
%time x = np.array(trn, dtype=np.float32)
%prun m.fit(x, y)

# the profiler tells you how much time each function call takes
#   the np.array conversion was taking most of the time
# software engineers appreciate this
# data scientists often underappreciate it
# try running %prun and see if you can interpret and use the profile output


# oob score ... noticed in the profile that we cannot use the oob score here:
# it will try to use the other 124 million rows to calculate oob,
# which will take forever
# so we need a proper validation set
# DON'T USE OOB ON A LARGE DATASET
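
Outside a notebook, the same kind of profile can be produced with the standard-library `cProfile` module. A minimal sketch follows; the function `convert_then_sum` and its input are made up to give the profiler something to measure.

```python
import cProfile
import io
import pstats

import numpy as np

def convert_then_sum(rows):
    # The deliberate hot spot: converting a list of lists to a float32 array
    arr = np.array(rows, dtype=np.float32)
    return float(arr.sum())

rows = [[1, 2, 3]] * 50_000  # stand-in data

profiler = cProfile.Profile()
profiler.enable()
total = convert_then_sum(rows)
profiler.disable()

# Sort by cumulative time so the most expensive calls come first
report = io.StringIO()
pstats.Stats(profiler, stream=report).sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

The report lists one row per function, with call counts and cumulative time, which is the same kind of table that `%prun` displays.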

In [None]:
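# n_jobs=8 uses 8 cores (a negative value such as -8 would mean all cores but 7)
# oob_score is off, because OOB scoring is too slow on a data set this large
m = RandomForestRegressor(n_estimators=20, min_samples_leaf=100, max_features=0.5,
                          n_jobs=8, oob_score=False)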

In [None]:
# How long does it take to build the random forest?
#   It scales with no. of estimators x sample size,
#   not with the size of the data set

# n_jobs
#    the number of cores to use; before, we set it to -1
#    -1 uses all cores, like on your PC
#    this computer had 60 cores, and spinning up that many jobs was taking too long,
#    so use 8 cores instead
#
# oob_score is left off, since OOB scoring is too slow on this much data
m = RandomForestRegressor(n_estimators=20, min_samples_leaf=100, max_features=0.5,
                          n_jobs=8, oob_score=False)
%time m.fit(x, y)

In [None]:
print_score(m)
# gets a 0.76588 log RMSE
# try reducing min_samples_leaf:
#   from 100 to 10 ... takes the log RMSE from 0.76 to 0.71
#   down to 3 ... gets down to 0.70
# this is a reasonable RF

However, 0.71 isn't so great on the leaderboard.

Look at the columns we are predicting with:
- date, store, item no., onpromotion, etc.

So most of the insight into how much you expect to sell is wrapped up in where the store is, what category the item is, and the day of the week.

An RF can only create binary splits, so it has to work out from the raw numbers things like which one represents gasoline or which store is in the center of the city. Its ability to understand what's going on is somewhat limited.

We will need to use the 4 years of data. There is a Kaggle kernel that takes the last two weeks of sales and averages them by store, item, and onpromotion. Submitting that mean gets you to about 30th place.

So your job is then to start with that model and make it a little bit better. On Kaggle, many people started from this kernel and improved on it.
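
The mean baseline described above can be sketched in a few lines of pandas. This is not the actual kernel, only an illustration of the idea under assumed column names; the helper `mean_baseline` and the demo rows are made up.

```python
import pandas as pd

def mean_baseline(train, keys, days=14, date_col="date", target="unit_sales"):
    """Average the target over the last `days` days for each combination of keys."""
    cutoff = train[date_col].max() - pd.Timedelta(days=days)
    recent = train[train[date_col] > cutoff]
    return recent.groupby(keys, as_index=False)[target].mean()

# Made-up rows; the first one is too old to be included in the average
demo = pd.DataFrame({
    "date": pd.to_datetime(["2017-07-01", "2017-08-10", "2017-08-15", "2017-08-15"]),
    "store_nbr": [1, 1, 1, 2],
    "item_nbr": [10, 10, 10, 10],
    "onpromotion": [False, False, False, True],
    "unit_sales": [100.0, 2.0, 4.0, 7.0],
})
baseline = mean_baseline(demo, ["store_nbr", "item_nbr", "onpromotion"])
print(baseline)
```

Predictions for the test set would then come from merging this table onto the test rows by the same keys.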

Create a scatter plot with the mean model on one axis and the new model on the other. The points should roughly form a line; if not, you have probably screwed something up.

Pull in data from other sources, like weather data. This kind of thing is done very often; weather is metadata about a date. Most competitions have a rule that you can use external data, but you have to post about it and it should be publicly available. Outside of Kaggle you should always be looking for other data.

This is Ecuador's largest grocery chain, so look for Ecuador's holidays. In this case that information is provided.

Create lots of new columns, for example:
- average number of sales on holidays
- average % change from Jan to Feb, and so on and so forth


Also, look at similar competitions:
- like Rossmann (Germany)

The person who won created lots of columns based on what's useful for making predictions.

The third-place team did almost no feature engineering.


**Tune Validation Set**
- if you don't have a good validation set, it is hard, almost impossible, to create a good model
- if the goal is next month's sales, does the validation set tell you whether the model is good at predicting next month?
- you need a validation set that reliably tells you whether the model will be good when you put it into production
- you should usually only use the test set at the end, but you can use it to calibrate the validation set
- submit 4 models to Kaggle and plot the Kaggle score on the x axis against the validation set score on the y axis to see if the validation set is any good. If it is, the points should fall on a straight line as close to y = x as possible. If not, your validation set is not good. A good validation set predicts the leaderboard score well.
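
The calibration check can also be done numerically instead of by eye. A minimal sketch, assuming four submissions; all scores below are hypothetical, invented purely to show the calculation.

```python
import numpy as np

# Hypothetical scores for four submissions, not real results
validation_scores = np.array([0.76, 0.71, 0.70, 0.66])
leaderboard_scores = validation_scores + 0.01

# Fit a straight line: leaderboard = slope * validation + intercept
slope, intercept = np.polyfit(validation_scores, leaderboard_scores, deg=1)
correlation = np.corrcoef(validation_scores, leaderboard_scores)[0, 1]
print(f"slope={slope:.2f} intercept={intercept:.2f} r={correlation:.2f}")
```

A slope near 1 with a high correlation suggests the validation set tracks the leaderboard; a scattered or flat relationship suggests the validation set needs rethinking.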


New column examples
- Date range of the test set: about two weeks (the test set is 16 days). The test set begins the day after pay day and ends on a pay day, which is one of the bits of metadata they told us. Draw the time series and make sure you have some number of pay-day spikes in your test set period.

### Grocery store notebook not in GitHub, but after the competition has finished it will be on GitHub