# ASHRAE with fast.ai, Part 3: Inference

This kernel leverages the convenient fast.ai API to prepare the test set for inference in just a few lines of code.

To reconcile the large size of the ASHRAE dataset and the overhead of fast.ai's objects with the limited memory of Kaggle sessions, the work is split across a series of kernels. The other parts are:

- https://www.kaggle.com/michelezoccali/ashrae-with-fast-ai-part-1 (preprocessing)
- https://www.kaggle.com/michelezoccali/ashrae-with-fast-ai-part-2 (training)

# Imports

In [None]:
import os
import gc
import sys
import pickle

import numpy as np
import pandas as pd

from tqdm.notebook import tqdm
from fastai.tabular.all import *

# plotting
import seaborn as sns

In [None]:
data_path = '../input/ashrae-with-fast-ai-part-1/'
model_path = '../input/ashrae-with-fast-ai-part-2/'

for path in [data_path, model_path]:
    for dirname, _, filenames in os.walk(path):
        for filename in filenames:
            print(os.path.join(dirname, filename))

# Prepare test dataset

In [None]:
#%%time
X_test = pd.read_hdf(data_path + 'preprocessing_no_lag.h5', 'test')
X_test.info()

In [None]:
row_ids = X_test.row_id # for submission file
X_test = X_test.drop(columns='row_id')

gc.collect()

In [None]:
procs_nn = [Categorify, Normalize]
cont = ['building_id','square_feet','year_built','floor_count','air_temperature','cloud_coverage',
       'dew_temperature','precip_depth_1_hr']
cat = ['meter','site_id','primary_use','hour','weekday']

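Let's create a TabularPandas instance with the same transforms as the training set. Note that the processors are fitted here on the test data itself, so the category codes and normalization statistics match the training ones only insofar as the two sets share the same categories and similar distributions.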

In [None]:
test = TabularPandas(X_test, procs_nn, cat, cont, inplace=True, reduce_memory=True)

del X_test, procs_nn, cat, cont
gc.collect()
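
To illustrate what "the same transforms as the training set" means, here is a minimal, library-agnostic sketch in plain pandas. It is not how fast.ai implements `Categorify` or `Normalize`; the toy frames, the `apply_train_transforms` helper and the convention of reserving code 0 for unseen categories are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real training and test frames (hypothetical values)
train = pd.DataFrame({'primary_use': ['Office', 'Education', 'Office'],
                      'square_feet': [1000.0, 2000.0, 3000.0]})
test = pd.DataFrame({'primary_use': ['Education', 'Retail'],
                     'square_feet': [2000.0, 4000.0]})

# "Fit" step: learn the vocabulary and the statistics from the training data only
categories = sorted(train['primary_use'].unique())
mean = train['square_feet'].mean()
std = train['square_feet'].std(ddof=0)

def apply_train_transforms(df):
    """Encode and scale df using what was learned from the training set."""
    out = pd.DataFrame(index=df.index)
    # Known categories get codes 1..n; 0 is reserved for values unseen in training
    out['primary_use'] = pd.Categorical(df['primary_use'], categories=categories).codes + 1
    # Scale with the training mean and standard deviation, not the test ones
    out['square_feet'] = (df['square_feet'] - mean) / std
    return out

test_encoded = apply_train_transforms(test)
print(test_encoded)
```

The point of the sketch is that the test rows never influence the vocabulary or the scaling constants, so a given input value always maps to the same model input at training and inference time.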

We can now load our trained neural network back in...

In [None]:
with open(f'{model_path}/tabular_nn.pickle', mode='rb') as f:
    learn = pickle.load(f)

...and predict with it.

In [None]:
n_iterations = 30
# Round the chunk size up so that no trailing rows are dropped
batch_size = -(-len(test) // n_iterations)

preds = []
for i in tqdm(range(n_iterations)):
    start = i * batch_size
    if start >= len(test): break  # nothing left to predict
    test_batch = test.iloc[start:start + batch_size]
    test_dl = TabDataLoader(test_batch, bs=batch_size, shuffle=False, drop_last=False)

    del test_batch; gc.collect()

    batch_preds, _ = learn.get_preds(dl=test_dl)
    # Flatten to 1-D (also safe for a single-row chunk), then map back from the log1p scale
    batch_preds = to_np(batch_preds).reshape(-1)
    preds.extend(np.expm1(batch_preds))

    del test_dl, batch_preds; gc.collect()
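
The chunking logic of this loop can be checked in isolation. The sketch below uses NumPy only, with a hypothetical `fake_predict` standing in for the trained network and a hypothetical `chunk_bounds` helper; it rounds the chunk size up so that every row is covered exactly once, even when the row count is not a multiple of the number of chunks.

```python
import numpy as np

def chunk_bounds(n_rows, n_chunks):
    """Yield (start, stop) pairs that cover range(n_rows) exactly once."""
    if n_rows == 0:
        return
    chunk_size = -(-n_rows // n_chunks)  # ceiling division
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

def fake_predict(x):
    # Hypothetical stand-in for the model: returns predictions on the log1p scale
    return np.log1p(x)

values = np.arange(10, dtype=float)
bounds = list(chunk_bounds(len(values), 3))

# Predict chunk by chunk and map each chunk back to the original scale
preds = np.concatenate([np.expm1(fake_predict(values[a:b])) for a, b in bounds])

print(bounds)
print(len(preds) == len(values))
```

With 10 rows and 3 chunks, floor division would give a chunk size of 3 and leave the last row without a prediction, whereas ceiling division gives chunks of 4, 4 and 2 rows.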

Finally, we can save our predictions (clipped from below at 0, since negative meter readings make no physical sense) and inspect their distribution.

In [None]:
submission = pd.DataFrame({'row_id':row_ids, 'meter_reading':np.clip(preds, 0, a_max=None)})
submission.to_csv('submission.csv', index=False)

del preds
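
As a small worked example of the two numerical steps involved here: `np.expm1` inverts `np.log1p`, and clipping removes the slightly negative readings that appear whenever the network outputs a negative value on the log scale. The sketch assumes the model was trained on `log1p(meter_reading)`, which is what the use of `np.expm1` above suggests; the arrays are made-up values.

```python
import numpy as np

# Round trip: log1p followed by expm1 recovers the original readings
readings = np.array([0.0, 1.5, 250.0])
log_targets = np.log1p(readings)
recovered = np.expm1(log_targets)

# Hypothetical raw network outputs on the log scale, one of them negative
raw_preds = np.array([-0.2, 0.0, 3.0])
unclipped = np.expm1(raw_preds)                 # the first entry is below zero
meter_readings = np.clip(unclipped, 0, None)    # clip from below at 0

print(recovered)
print(unclipped)
print(meter_readings)
```

Only negative log-scale outputs are affected by the clip; non-negative ones pass through unchanged.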

In [None]:
logs = np.log1p(submission.meter_reading)
print(logs.shape)
logs.hist(bins=100);

In [None]:
#sns.displot(logs);

In [None]:
submission.meter_reading.min(), submission.meter_reading.max(), submission.meter_reading.mean()

And now we are done!

If anyone has tips on managing memory better (with or without fast.ai classes), for instance so that all of this fits in a single kernel, they'd be greatly appreciated! 😉