<a href="https://colab.research.google.com/github/Giffy/fast.ai/blob/master/Machine%20Learning/lesson3_grocery.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Important: This notebook will only work with fastai-0.7.x. Do not try to run any fastai-1.x code from this path in the repository because it will load fastai-0.7.x**

# Intro to Random Forests - Favorita Grocery Sales Forecasting 

Notebook based in kaggle competition [Can you accurately predict sales for a large grocery chain?](https://www.kaggle.com/c/favorita-grocery-sales-forecasting)

Corporación Favorita has challenged the Kaggle community to build a model that more accurately forecasts product sales.

They’re excited to see how machine learning could better ensure they please customers by having just enough of the right products at the right time.

In this competition, you will be predicting the unit sales for thousands of items sold at different Favorita stores located in Ecuador. The training data includes dates, store and item information, whether that item was being promoted, as well as the unit sales. Additional files include supplementary information that may be useful in building your models.

##File Descriptions and Data Field Information

**train.csv**

* Training data, which includes the target *unit_sales by date, store_nbr*, and *item_nbr* and a unique *id* to label rows.
* The target unit_sales can be integer (e.g., a bag of chips) or float (e.g., 1.5 kg of cheese).
* Negative values of unit_sales represent returns of that particular item.
* The onpromotion column tells whether that item_nbr was on promotion for a specified date and store_nbr.
* Approximately 16% of the onpromotion values in this file are NaN.
* **NOTE:** The training data does not include rows for items that had zero unit_sales for a store/date combination. There is no information as to whether or not the item was in stock for the store on the date, and teams will need to decide the best way to handle that situation. Also, there are a small number of items seen in the training data that aren't seen in the test data.


**test.csv**

* Test data, with the date, store_nbr, item_nbr combinations that are to be predicted, along with the onpromotion information.
* **NOTE:** The test data has a small number of items that are not contained in the training data. Part of the exercise will be to predict a new item sales based on similar products..
* The public / private leaderboard split is based on time. All items in the public split are also included in the private split.


# Google Colab setup
Installs fast.ai 0.7.0 and the required libraries to run the notebook.

Also downloads the required datasets.

Train.csv dataset has been reduced in order to store the data sample in Github. 

Original data file can be downloaded in [kaggle](https://www.kaggle.com/c/favorita-grocery-sales-forecasting/data) (size of compressed file:  474Mb) 

In [1]:
print (" Installing FastAI libraries ... (takes 2 min)")
!pip install fastai==0.7.0 > /dev/null
print ("\n Clonning FastAI repository locally ...")
!git clone https://github.com/fastai/fastai.git fastai_ml
!ln -s fastai_ml/courses/ml1/fastai/ fastai

 Installing FastAI libraries ... (takes 2 min)
[31mtorchvision 0.2.1 has requirement pillow>=4.1.1, but you'll have pillow 4.0.0 which is incompatible.[0m
[31mmizani 0.5.3 has requirement pandas>=0.23.4, but you'll have pandas 0.22.0 which is incompatible.[0m
[31mplotnine 0.5.1 has requirement pandas>=0.23.4, but you'll have pandas 0.22.0 which is incompatible.[0m

 Clonning FastAI repository locally ...
Cloning into 'fastai_ml'...
remote: Enumerating objects: 99, done.[K
remote: Counting objects: 100% (99/99), done.[K
remote: Compressing objects: 100% (60/60), done.[K
remote: Total 23512 (delta 43), reused 67 (delta 38), pack-reused 23413[K
Receiving objects: 100% (23512/23512), 374.89 MiB | 31.05 MiB/s, done.
Resolving deltas: 100% (16540/16540), done.
Checking out files: 100% (761/761), done.


In [2]:
print ("\n Installing required libraries...")
!pip install --upgrade setuptools > /dev/null
!pip install feather > /dev/null
!pip install scikit-misc==0.1.0 > /dev/null
!pip install pdpbox==0.2.0 > /dev/null
!pip install treeinterpreter==0.2.2 > /dev/null
print ("\n Downloading datasets...")
!wget https://raw.githubusercontent.com/Giffy/Personal_dataset_repository/master/grocery-sales-train.7z
!7z x -y grocery-sales-train.7z > /dev/null
!wget https://raw.githubusercontent.com/Giffy/Personal_dataset_repository/master/grocery-sales-test.tar.gz
!tar xvf grocery-sales-test.tar.gz > /dev/null
print ("\n Importing libraries")
import pandas as pd
import os
import numpy as np


 Installing required libraries...
[31mCommand "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-g48mk3m1/feather/[0m

 Downloading datasets...
--2019-02-15 16:13:05--  https://raw.githubusercontent.com/Giffy/Personal_dataset_repository/master/grocery-sales-train.7z
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23818196 (23M) [application/octet-stream]
Saving to: ‘grocery-sales-train.7z’


2019-02-15 16:13:06 (186 MB/s) - ‘grocery-sales-train.7z’ saved [23818196/23818196]

--2019-02-15 16:13:10--  https://raw.githubusercontent.com/Giffy/Personal_dataset_repository/master/grocery-sales-test.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.github

# 1 Imports 

In [0]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [0]:
from fastai.imports import *
from fastai.structured import *

from sklearn.ensemble import RandomForestRegressor
from IPython.display import display

from sklearn import metrics

In [5]:
PATH = 'grocery-sales/'
!ls {PATH}

test.csv  train_basic.csv


# 2 Read data

In [0]:
types = {'id': 'int64',
        'item_nbr': 'int32',
        'store_nbr': 'int8',
        'unit_sales': 'float32',
        'onpromotion': 'object'}

In [7]:
%%time
df_all = pd.read_csv(f'{PATH}train_basic.csv', parse_dates=['date'], dtype=types, 
                     infer_datetime_format=True)#, skiprows=range(1,100000000))

CPU times: user 2.55 s, sys: 159 ms, total: 2.71 s
Wall time: 2.72 s


In [8]:
df_all.onpromotion.fillna(False, inplace=True)
df_all.onpromotion = df_all.onpromotion.map({'False': False, 'True': True})
df_all.onpromotion = df_all.onpromotion.astype(bool)

os.makedirs('tmp', exist_ok=True)
%time df_all.to_feather('tmp/raw_groceries')

CPU times: user 64 ms, sys: 75.1 ms, total: 139 ms
Wall time: 402 ms


In [20]:
df_all.drop('Unnamed: 0', axis=1, inplace=True)
%time df_all.describe(include='all')

CPU times: user 751 ms, sys: 739 µs, total: 752 ms
Wall time: 751 ms


Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion
count,1997040.0,1997040,1997040.0,1997040.0,1997040.0,1997040
unique,,1684,,,,2
top,,2017-05-07 00:00:00,,,,False
freq,,1908,,,,1528448
first,,2013-01-01 00:00:00,,,,
last,,2017-08-15 00:00:00,,,,
mean,62717230.0,,27.45692,972862.1,8.551911,
std,36197720.0,,16.33284,520294.0,20.57796,
min,206.0,,1.0,96995.0,-1768.0,
25%,31386010.0,,12.0,522721.0,2.0,


In [21]:
df_test = pd.read_csv(f'{PATH}test.csv',parse_dates=['date'], dtype=types, infer_datetime_format=True)

df_test.onpromotion.fillna(False, inplace=True)
df_test.onpromotion = df_all.onpromotion.map({'False': False, 'True': True})
df_test.onpromotion = df_all.onpromotion.astype(bool)
df_test.describe(include='all')

Unnamed: 0,id,date,store_nbr,item_nbr,onpromotion
count,3370464.0,3370464,3370464.0,3370464.0,1997040
unique,,16,,,2
top,,2017-08-27 00:00:00,,,False
freq,,210654,,,1528448
first,,2017-08-16 00:00:00,,,
last,,2017-08-31 00:00:00,,,
mean,127182300.0,,27.5,1244798.0,
std,972969.3,,15.58579,589836.2,
min,125497000.0,,1.0,96995.0,
25%,126339700.0,,14.0,805321.0,


In [22]:
df_all.tail()

Unnamed: 0,id,date,store_nbr,item_nbr,unit_sales,onpromotion
1997035,73734366,2016-03-18,54,1464238,22.0,False
1997036,101679247,2016-12-31,49,1489899,29.0,True
1997037,124420242,2017-08-05,38,849080,2.0,False
1997038,35980557,2014-11-25,31,1463798,3.0,False
1997039,85369289,2016-07-18,28,362035,14.0,False


In [0]:
df_all.unit_sales = np.log1p(np.clip(df_all.unit_sales, 0, None))

In [24]:
%time add_datepart(df_all, 'date')

CPU times: user 1.77 s, sys: 597 ms, total: 2.37 s
Wall time: 2.37 s


In [0]:
def split_vals(a, n): return a[:n].copy(), a[n:].copy()

In [26]:
n_valid = len(df_test)
n_trn = len(df_all) - n_valid
train, valid = split_vals(df_all, n_trn)
train.shape, valid.shape

((623616, 18), (1373424, 18))

In [0]:
# train_cats(raw_train)
# apply_cats(raw_valid, raw_train)

In [28]:
%%time
trn, y, nas = proc_df(train, 'unit_sales')
val, y_val, nas = proc_df(valid, 'unit_sales')

CPU times: user 1.05 s, sys: 558 ms, total: 1.61 s
Wall time: 1.61 s


# 3 Models

In [0]:
def rmse(x, y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(x), y), rmse(m.predict(val), y_val),
          m.score(x, y), m.score(val, y_val)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

In [0]:
set_rf_samples(1000000)

In [31]:
%time x = np.array(trn, dtype=np.float32)

CPU times: user 558 ms, sys: 35.8 ms, total: 594 ms
Wall time: 593 ms


In [36]:
# set n_jobs=-1 to use all CPU cores available
m = RandomForestRegressor(n_estimators=20, min_samples_leaf=100, n_jobs=-1) 
%time m.fit(x, y)

CPU times: user 2min 50s, sys: 71.5 ms, total: 2min 50s
Wall time: 1min 26s


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=100, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [33]:
print_score(m)

[0.776009538167972, 0.7925213761629737, 0.22717611402275195, 0.19108538656522367]


In [38]:
m = RandomForestRegressor(n_estimators=20, min_samples_leaf=10, n_jobs=-1)
%time m.fit(x, y)

CPU times: user 3min 38s, sys: 92.1 ms, total: 3min 38s
Wall time: 1min 49s


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=10, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [39]:
print_score(m)

[0.6081776615524892, 0.7598760754, 0.525312990622674, 0.25635398795611275]


In [40]:
m = RandomForestRegressor(n_estimators=20, min_samples_leaf=3, n_jobs=-1)
%time m.fit(x, y)

CPU times: user 4min 3s, sys: 112 ms, total: 4min 3s
Wall time: 2min 2s


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=3, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [41]:
%time print_score(m)

[0.39489960188817574, 0.7700224882615785, 0.7998664911736166, 0.236362004759179]
CPU times: user 1min 12s, sys: 391 ms, total: 1min 12s
Wall time: 38.4 s
