## Microbusiness Density Forecasting with KerasTuner
In this notebook, I build a single DNN model to forecast microbusiness density.
For feature engineering, I add the lag-1 target as a feature and compute the monthly change rate of several time-series features.
I also add county and state embeddings to the DNN.
I search for the best hyperparameters with KerasTuner, using the last 8 months of the training data as the validation set.
The CV score varies a lot across hyperparameters.
KerasTuner makes it possible to run more experiments in a short time and find a good configuration with less manual effort.
Within 30 trials, I get a CV score of about 1.5 and an LB score of 1.3 to 1.5. A better DNN architecture, better feature engineering, a better hyperparameter search space, and post-processing of the predictions can improve both the CV and LB scores. Happy new year to everyone, and happy Kaggling.
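
The feature engineering and validation split described above can be sketched on a toy frame. This is only an illustration under assumptions: the values are invented, `density_change_rate` is a hypothetical column name, and the holdout uses 2 months instead of 8 so that it fits the toy data. The column names `cfips`, `first_day_of_month`, `microbusiness_density`, and `microbusiness_density_shift_1` are the ones used in this notebook.

```python
import pandas as pd

# Toy frame: column names mirror train.csv, the values are invented.
df = pd.DataFrame({
    "cfips": [1001] * 4 + [1003] * 4,
    "first_day_of_month": pd.to_datetime(
        ["2022-07-01", "2022-08-01", "2022-09-01", "2022-10-01"] * 2
    ),
    "microbusiness_density": [3.0, 3.3, 3.3, 3.63, 8.0, 8.0, 7.6, 7.6],
})
df = df.sort_values(["cfips", "first_day_of_month"]).reset_index(drop=True)

# Lag-1 target: the previous month's density within the same county.
# Grouping by cfips keeps one county's values from leaking into the next.
df["microbusiness_density_shift_1"] = (
    df.groupby("cfips")["microbusiness_density"].shift(1)
)

# Month-over-month change rate (hypothetical column name).
df["density_change_rate"] = (
    df["microbusiness_density"] / df["microbusiness_density_shift_1"] - 1
)

# Hold out the last N months as validation (8 in the notebook, 2 here).
n_valid_months = 2
months = df["first_day_of_month"].drop_duplicates().sort_values()
cutoff = months.iloc[-n_valid_months]
train_part = df[df["first_day_of_month"] < cutoff]
valid_part = df[df["first_day_of_month"] >= cutoff]
```

The first month of each county has no lag value, so those rows carry `NaN` and would need to be dropped or filled before training.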

#### 1.0 Configuration

In [3]:
import numpy as np
import pandas as pd
import tensorflow as tf
import keras_tuner as kt
import math
from tensorflow.keras.callbacks import LearningRateScheduler

In [4]:
class CFG:
    epochs = 200
    max_trials = 30
    tuning_epochs = 10
    batch_size = 512

#### 2.0 Utility Functions

In [5]:
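def diff_month(dt):
    """Months elapsed since August 2019, the first month in train.csv."""
    return (dt.year - 2019) * 12 + dt.month - 8

def smape(y_true, y_pred):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    numerator = tf.abs(y_true - y_pred)
    denominator = tf.abs(y_true) + tf.abs(y_pred)
    # divide_no_nan yields 0 where both values are 0 instead of NaN.
    return 200.0 * tf.reduce_mean(tf.math.divide_no_nan(numerator, denominator))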

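def get_cosine_decay_learning_rate_scheduler(epochs, cycles=1, lr_start=0.001, lr_end=1e-6):
    """Return a Keras callback that decays the learning rate along a cosine curve."""
    def cosine_decay(epoch):
        # Hold the initial learning rate for epochs 0 through 10.
        if epoch <= 10:
            return lr_start
        epochs_per_cycle = epochs // cycles
        epoch_in_cycle = epoch % epochs_per_cycle
        if epochs_per_cycle > 1:
            # w goes from 1 at the start of a cycle to 0 at its last epoch.
            w = (1 + math.cos(epoch_in_cycle / (epochs_per_cycle - 1) * math.pi)) / 2
        else:
            w = 1
        # Interpolate between lr_start (w = 1) and lr_end (w = 0).
        return w * lr_start + (1 - w) * lr_end
    return LearningRateScheduler(cosine_decay, verbose=0)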

def not_valid_number(num):
    """True for NaN, zero, or infinite values (np.isinf also catches -inf)."""
    return pd.isna(num) or num == 0 or np.isinf(num)

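def preprocess_data(features):
    """Split off the target and attach per-county numeric features."""
    target = features.pop("microbusiness_density")
    # numeric_lookup_layers is a module-level dict (column name -> lookup keyed
    # by cfips); it must be defined before make_dataset() is called.
    for column, lookup in numeric_lookup_layers.items():
        features[column] = lookup[features["cfips"]]
    return features, target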

def make_dataset(df, shuffle=True):
    """Build a batched tf.data pipeline of (features, target) from a DataFrame."""
    columns = [
        "x", "year", "month", "cfips", "county_id", "state_id",
        "microbusiness_density", "microbusiness_density_shift_1",
    ]
    ds = tf.data.Dataset.from_tensor_slices({name: df[name] for name in columns})
    # Cache right after the map so the lookups run once; caching after shuffle()
    # would freeze the first epoch's order for every later epoch.
    ds = ds.map(preprocess_data).cache()
    if shuffle:
        ds = ds.shuffle(CFG.batch_size * 4)
    ds = ds.batch(CFG.batch_size).prefetch(tf.data.AUTOTUNE)
    return ds
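
To see what the metric and the learning rate schedule actually produce, the arithmetic can be reproduced in plain Python without TensorFlow. The sketch below is an illustration only: `smape_py` and `cosine_lr` are hypothetical names that mirror the formulas in `smape` and the inner `cosine_decay` function, and the sample inputs are invented.

```python
import math

def smape_py(y_true, y_pred):
    """Plain-Python SMAPE in percent, same formula as smape() above."""
    terms = [abs(t - p) / (abs(t) + abs(p)) for t, p in zip(y_true, y_pred)]
    return 200.0 * sum(terms) / len(terms)

def cosine_lr(epoch, epochs, cycles=1, lr_start=0.001, lr_end=1e-6):
    """Same arithmetic as cosine_decay(), without the Keras callback."""
    if epoch <= 10:
        return lr_start
    epochs_per_cycle = epochs // cycles
    epoch_in_cycle = epoch % epochs_per_cycle
    if epochs_per_cycle > 1:
        w = (1 + math.cos(epoch_in_cycle / (epochs_per_cycle - 1) * math.pi)) / 2
    else:
        w = 1
    return w * lr_start + (1 - w) * lr_end

# One prediction is off by 10 on a true value of 100, the other is exact.
score = smape_py([100.0, 50.0], [110.0, 50.0])

# Learning rate for every epoch of a 200-epoch, single-cycle run.
schedule = [cosine_lr(e, epochs=200) for e in range(200)]

print(f"SMAPE: {score:.4f}")
for e in (0, 10, 11, 100, 199):
    print(f"epoch {e:3d}: lr = {schedule[e]:.6g}")
```

With a single cycle, the rate stays at `lr_start` through epoch 10, then falls along the cosine curve and reaches `lr_end` at the final epoch.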

#### 3.0 Load Data

In [6]:
train = pd.read_csv(r"C:\kaggle\go_daddy\train.csv")
train.head()

| | row_id | cfips | county | state | first_day_of_month | microbusiness_density | active |
|---|---|---|---|---|---|---|---|
| 0 | 1001_2019-08-01 | 1001 | Autauga County | Alabama | 2019-08-01 | 3.007682 | 1249 |
| 1 | 1001_2019-09-01 | 1001 | Autauga County | Alabama | 2019-09-01 | 2.88487 | 1198 |
| 2 | 1001_2019-10-01 | 1001 | Autauga County | Alabama | 2019-10-01 | 3.055843 | 1269 |
| 3 | 1001_2019-11-01 | 1001 | Autauga County | Alabama | 2019-11-01 | 2.993233 | 1243 |
| 4 | 1001_2019-12-01 | 1001 | Autauga County | Alabama | 2019-12-01 | 2.993233 | 1243 |


In [8]:
test = pd.read_csv(r"C:\kaggle\go_daddy\test.csv")
test.head()

| | row_id | cfips | first_day_of_month |
|---|---|---|---|
| 0 | 1001_2022-11-01 | 1001 | 2022-11-01 |
| 1 | 1003_2022-11-01 | 1003 | 2022-11-01 |
| 2 | 1005_2022-11-01 | 1005 | 2022-11-01 |
| 3 | 1007_2022-11-01 | 1007 | 2022-11-01 |
| 4 | 1009_2022-11-01 | 1009 | 2022-11-01 |
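
Each `row_id` combines the county code and the month, so the time features consumed by `make_dataset` can be derived from it. The sketch below is a guess at that step, not the notebook's own code: it assumes that `x` is the output of `diff_month`, and that `year` and `month` are the plain calendar fields.

```python
from datetime import date

def diff_month(dt):
    # Months elapsed since August 2019, as in the utility function above.
    return (dt.year - 2019) * 12 + dt.month - 8

# row_id values taken from the train and test previews above.
row_ids = ["1001_2019-08-01", "1001_2019-12-01", "1001_2022-11-01"]

features = []
for row_id in row_ids:
    cfips, day = row_id.split("_")
    dt = date.fromisoformat(day)
    features.append({
        "cfips": int(cfips),
        "year": dt.year,
        "month": dt.month,
        "x": diff_month(dt),  # assumed meaning of the "x" column
    })

for row in features:
    print(row)
```

Under this assumption, the first training month maps to `x = 0` and the first test month, November 2022, maps to `x = 39`.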
