# 🌇 GoDaddy - Microbusiness Density Forecasting
### Forecast Next Month’s Microbusiness Density


<img src='https://miro.medium.com/max/1400/1*gsUixexI9DsFfKsS-ZZqng.webp' width = 750>

### Overview
The main goal is to develop a simple notebook that uses GBDTs (gradient-boosted decision trees) to build a machine learning model that forecasts microbusiness density.

---

### Dataset Description
Your challenge in this competition is to forecast microbusiness activity across the United States, as measured by the density of microbusinesses in US counties. Microbusinesses are often too small or too new to show up in traditional economic data sources, but microbusiness activity may be correlated with other economic indicators of general interest.

As historic economic data are widely available, this is a forecasting competition. The forecasting phase public leaderboard and final private leaderboard will be determined using data gathered after the submission period closes. You will make static forecasts that can only incorporate information available before the end of the submission period.

**Files**

A great deal of data is publicly available about counties and we have not attempted to gather it all here. You are strongly encouraged to use external data sources for features.

**train.csv**

* row_id - An ID code for the row.
* cfips - A unique identifier for each county using the Federal Information Processing Standards (FIPS). The first two digits correspond to the state FIPS code, while the following 3 represent the county.
* county_name - The written name of the county.
* state_name - The name of the state.
* first_day_of_month - The date of the first day of the month.
* microbusiness_density - Microbusinesses per 100 people over the age of 18 in the given county. This is the target variable. The population figures used to calculate the density are on a two-year lag due to the pace of update provided by the U.S. Census Bureau, which provides the underlying population data annually. 2021 density figures are calculated using 2019 population figures, etc.
* active - The raw count of microbusinesses in the county. Not provided for the test set.
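
The density definition above can be illustrated with a short calculation. This is only a sketch: the helper names and the population figures are made up for illustration; only the "per 100 adults" formula and the two-year population lag come from the description.

```python
def microbusiness_density(active: int, adult_population: int) -> float:
    """Microbusinesses per 100 people over 18 (hypothetical helper)."""
    if adult_population <= 0:
        raise ValueError("adult_population must be positive")
    return 100.0 * active / adult_population

# Made-up adult population figures for one county, keyed by census year.
adult_population_by_year = {2019: 40_000, 2020: 41_000}

def density_for(active: int, data_year: int) -> float:
    # Population figures lag the density year by two years (2021 uses 2019).
    return microbusiness_density(active, adult_population_by_year[data_year - 2])

example = density_for(active=1_200, data_year=2021)
print(f"{example:.2f}")  # 100 * 1200 / 40000
```

Because of the lag, a jump in density between December and January can reflect a population update rather than a change in the number of microbusinesses.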

**sample_submission.csv** 

A valid sample submission. This file will remain unchanged throughout the competition.

* row_id - An ID code for the row.
* microbusiness_density - The target variable.

**test.csv** 

Metadata for the submission rows. This file will remain unchanged throughout the competition.

* row_id - An ID code for the row.
* cfips - A unique identifier for each county using the Federal Information Processing Standards (FIPS). The first two digits correspond to the state FIPS code, while the following 3 represent the county.
* first_day_of_month - The date of the first day of the month.

**revealed_test.csv**

During the submission period, only the most recent month of data will be used for the public leaderboard. Any test set data older than that will be published in revealed_test.csv, closely following the usual data release cycle for the microbusiness report. We expect to publish one copy of revealed_test.csv in mid February. This file's schema will match train.csv.
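
The `cfips` layout described for train.csv and test.csv (two state digits followed by three county digits) can be unpacked as in the sketch below. The helper name `split_cfips` is hypothetical. It assumes the code may arrive as an integer, in which case a leading zero has been lost and needs to be restored.

```python
def split_cfips(cfips):
    """Return (state_fips, county_fips) as zero-padded strings."""
    code = str(cfips).zfill(5)  # restore a leading zero lost by integer parsing
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"not a 5-digit FIPS code: {cfips!r}")
    return code[:2], code[2:]

print(split_cfips(1001))   # ('01', '001')
print(split_cfips(56045))  # ('56', '045')
```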

**census_starter.csv**

Examples of useful columns from the Census Bureau's American Community Survey (ACS) at data.census.gov. The percentage fields were derived from the raw counts provided by the ACS. All fields have a two-year lag to match what information was available at the time a given microbusiness data update was published.

* pct_bb_[year] - The percentage of households in the county with access to broadband of any type. Derived from ACS table B28002: PRESENCE AND TYPES OF INTERNET SUBSCRIPTIONS IN HOUSEHOLD.
* cfips - The CFIPS code.
* pct_college_[year] - The percent of the population in the county over age 25 with a 4-year college degree. Derived from ACS table S1501: EDUCATIONAL ATTAINMENT.
* pct_foreign_born_[year] - The percent of the population in the county born outside of the United States. Derived from ACS table DP02: SELECTED SOCIAL CHARACTERISTICS IN THE UNITED STATES.
* pct_it_workers_[year] - The percent of the workforce in the county employed in information related industries. Derived from ACS table S2405: INDUSTRY BY OCCUPATION FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER.
* median_hh_inc_[year] - The median household income in the county. Derived from ACS table S1901: INCOME IN THE PAST 12 MONTHS (IN 2021 INFLATION-ADJUSTED DOLLARS).

---

# 1.0 Loading All Necessary Libraries

In [None]:
%%time
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
%%time
# Loading more libraries for the model...
from pathlib import Path # OS path library...
from sklearn.preprocessing import LabelEncoder # Label encoding...

In [None]:
%%time
# Load model libraries...
from xgboost import XGBRegressor # GBDT library, XGBoost regressor
from catboost import CatBoostRegressor # GBDT library, CatBoost regressor

# Load sklearn libraries...
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor # Regressors

from sklearn.metrics import mean_squared_error # Load metrics
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit, GroupKFold, KFold # Load CV strategies

from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder # Load scalers and the label encoder
from sklearn.pipeline import Pipeline # Load sklearn pipelines, in case they are needed in the CV loop
from sklearn.compose import ColumnTransformer # Apply different transformers to different columns

import holidays # Holiday libraries
import matplotlib.pyplot as plt # Visualization libraries

---

# 2.0 Notebook Configuration
This section silences warnings, sets the notebook-wide constants, and configures how pandas displays DataFrames.

In [None]:
%%time
# I like to disable my Notebook Warnings.
import warnings
warnings.filterwarnings('ignore')

In [None]:
%%time
# Notebook Configuration...

# Number of data rows we want to load into the model...
DATA_ROWS = None
# DataFrame display: the number of rows and columns to show...
NROWS = 50
NCOLS = 15
# Main data location path...
BASE_PATH = '...'

In [None]:
%%time
# Configure the notebook display settings to use only 2 decimal places, so tables look nicer.
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option('display.max_columns', NCOLS) 
pd.set_option('display.max_rows', NROWS)

---

# 3.0 Loading the Datasets into a Pandas DataFrame
This section reads the competition CSV files and takes a first look at their contents, shapes, and date ranges.

In [None]:
%%time
# Load the CSV information into a Pandas DataFrame...
input_path = Path('/kaggle/input/godaddy-microbusiness-density-forecasting/')

train_df = pd.read_csv(input_path / 'train.csv')
census_df = pd.read_csv(input_path / 'census_starter.csv')
test_df = pd.read_csv(input_path / 'test.csv')

submission = pd.read_csv(input_path / 'sample_submission.csv')

In [None]:
%%time
train_df.head()

In [None]:
%%time
print('Min:',train_df['first_day_of_month'].min())
print('Max:',train_df['first_day_of_month'].max())

In [None]:
# Inspect the census columns and dtypes before unpivoting them...
census_df.info()

In [None]:
%%time
variables = [var for var in census_df.columns if var not in ['cfips']]
census_unpivot = pd.melt(census_df, id_vars = 'cfips', value_vars = variables)

In [None]:
%%time
census_unpivot
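
After the melt, the year is still embedded in the `variable` column (for example `pct_bb_2017`). The sketch below shows one way to split it out and get one row per county and year. The frame `census_demo` and its values are made up; the column naming pattern follows the census_starter.csv description.

```python
import pandas as pd

# Tiny stand-in for census_starter.csv (values are made up).
census_demo = pd.DataFrame({
    "cfips": [1001, 1003],
    "pct_bb_2017": [76.6, 74.5],
    "pct_bb_2018": [78.9, 78.1],
    "median_hh_inc_2017": [55317, 52562],
    "median_hh_inc_2018": [58786, 55962],
})

long_demo = census_demo.melt(id_vars="cfips", var_name="variable", value_name="value")

# Split 'pct_bb_2017' into feature='pct_bb' and year=2017.
parts = long_demo["variable"].str.rsplit("_", n=1, expand=True)
long_demo["feature"] = parts[0]
long_demo["year"] = parts[1].astype(int)

# One row per (cfips, year), one column per census feature.
tidy_demo = (long_demo
             .pivot(index=["cfips", "year"], columns="feature", values="value")
             .reset_index())
tidy_demo.columns.name = None
print(tidy_demo)
```

A table in this shape can be joined on both `cfips` and a (lagged) year, instead of attaching every year's column to every row.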

In [None]:
%%time
census_df.head()

In [None]:
%%time
census_df.shape

In [None]:
%%time
test_df.head()

In [None]:
%%time
test_df.shape

In [None]:
print('Min:',test_df['first_day_of_month'].min())
print('Max:',test_df['first_day_of_month'].max())

In [None]:
%%time
submission.head()

---

# 4.0 Preparing the Information for Analysis
This section attaches the census features to the train and test sets by joining on `cfips`, and adds the county and state names to the test set.

In [None]:
%%time
aux_info = train_df[['cfips', 'county','state']]
aux_info = aux_info.drop_duplicates()
aux_info.head()

In [None]:
%%time
def merge_df(first_df, second_df, join_field = 'cfips') -> pd.DataFrame:
    '''
    Left-join second_df onto first_df using the join_field column(s),
    keeping every row of first_df.
    '''
    merged = first_df.merge(second_df, how = 'left', on = join_field)
    merged.reset_index(inplace = True, drop = True)
    return merged

trn_data = merge_df(train_df, census_df)
tst_data = merge_df(test_df, census_df)
tst_data = merge_df(tst_data, aux_info)
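
A left join can silently duplicate rows (if the right table repeats a key) or leave rows without a match. The sketch below, on made-up frames, uses the `validate` and `indicator` options of `DataFrame.merge` to check for both; the frame names and values are illustrative only.

```python
import pandas as pd

left = pd.DataFrame({"cfips": [1001, 1001, 1003, 9999],
                     "microbusiness_density": [3.0, 3.1, 7.2, 1.5]})
right = pd.DataFrame({"cfips": [1001, 1003],
                      "pct_bb_2019": [80.6, 81.0]})

merged = left.merge(right, how="left", on="cfips",
                    validate="many_to_one",  # raises if 'right' repeats a key
                    indicator=True)          # adds a '_merge' column

# Keys present on the left but missing from the right table.
unmatched = merged.loc[merged["_merge"] == "left_only", "cfips"].unique()
print(len(left), len(merged))
print(unmatched)
```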

In [None]:
%%time
print(trn_data.shape, tst_data.shape)

In [None]:
%%time
trn_data

In [None]:
%%time
tst_data

---

# 5.0 Feature Engineering
This section derives calendar features from `first_day_of_month`.

In [None]:
%%time
# Create some simple features based on the date field...

def create_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create features based on the date variable; the idea is to extract as much
    information as possible from the date components.
    Args
        df: Input data to create the features.
    Returns
        df: A DataFrame with the new time-based features.
    """
    
    df['date'] = pd.to_datetime(df['first_day_of_month']) # Convert the date to datetime.
    
    # Start the feature creation process.
    df['year'] = df['date'].dt.year
    df['quarter'] = df['date'].dt.quarter
    df['month'] = df['date'].dt.month
    df['day'] = df['date'].dt.day
    df['dayofweek'] = df['date'].dt.dayofweek
    df['days_in_month'] = df['date'].dt.days_in_month # Number of days in the month.
    df['dayofyear'] = df['date'].dt.dayofyear
    df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int) # Series.dt.weekofyear was removed in pandas 2.0.
    df['is_weekend'] = np.where((df['dayofweek'] == 5) | (df['dayofweek'] == 6), 1, 0)
    
    return df

# Apply the function 'create_time_features' to the dataset...
trn_data = create_time_features(trn_data)
tst_data = create_time_features(tst_data)
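
Since every date in this dataset is the first day of a month, some date-derived columns carry no information (the day of the month is always 1). A minimal sketch, on made-up dates, of how such constant columns could be detected before training:

```python
import pandas as pd

dates = pd.DataFrame({"first_day_of_month": ["2019-08-01", "2019-09-01",
                                             "2019-10-01", "2020-01-01"]})
dates["date"] = pd.to_datetime(dates["first_day_of_month"])
dates["year"] = dates["date"].dt.year
dates["month"] = dates["date"].dt.month
dates["day"] = dates["date"].dt.day

# A column with a single distinct value cannot help a tree-based model.
candidate_cols = ["year", "month", "day"]
constant_cols = [col for col in candidate_cols if dates[col].nunique() == 1]
print(constant_cols)  # ['day']
```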

---

# 6.0 Data Pre-Processing
This section label-encodes the text columns and inspects the distribution of the target.

In [None]:
%%time
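# Label-encode the text columns...
def encode_labels(df, text_features = ('county', 'state')):
    '''
    Add an integer-encoded '<column>_enc' copy of each column in text_features.
    Note: a new LabelEncoder is fitted on every call, so the codes only agree
    across DataFrames that contain the same set of values.
    '''
    
    for categ_col in text_features:
        encoder = LabelEncoder()
        df[categ_col + '_enc'] = encoder.fit_transform(df[categ_col])
    return df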

trn_data = encode_labels(trn_data, text_features = ['county','state'])
tst_data = encode_labels(tst_data, text_features = ['county','state'])
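
Fitting a separate `LabelEncoder` on the train and test frames only yields matching codes when both contain exactly the same values. The sketch below, on made-up frames, shows the mismatch and one possible alternative: fitting a single encoder on the union of values. This is an illustration, not the approach used in the cell above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_demo = pd.DataFrame({"state": ["Alabama", "Texas", "Ohio"]})
test_demo = pd.DataFrame({"state": ["Texas", "Ohio"]})  # 'Alabama' is missing here

# Fit once on the union of values so both frames share one mapping.
encoder = LabelEncoder()
encoder.fit(pd.concat([train_demo["state"], test_demo["state"]]))
train_demo["state_enc"] = encoder.transform(train_demo["state"])
test_demo["state_enc"] = encoder.transform(test_demo["state"])

# For comparison: a separate encoder fitted on the test frame alone.
separate = LabelEncoder().fit_transform(test_demo["state"])
print(test_demo["state_enc"].tolist(), separate.tolist())
```

With the shared encoder, 'Texas' maps to the same integer in both frames; with the separate encoder it does not.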

In [None]:
%%time
trn_data['microbusiness_density'].describe()

---

# 7.0 Features & Data Selection
This section picks the feature columns, splits the data into train and validation sets by date, and defines the SMAPE metric.

In [None]:
%%time
# Extract features and avoid certain columns from the dataframe for training purposes...
target = 'microbusiness_density'
avoid = ['row_id', 'first_day_of_month','cfips', 'microbusiness_density', 'active', 'county', 'state', 'date']
features = [feat for feat in trn_data.columns if feat not in avoid]

# Print a list of all the features created...
print(features)

In [None]:
%%time
# Create the train and validation sets used to fit the model...
# Define a cutoff date to split the datasets
cutoff_date = '2021-10-01'

# Split by date rather than at random, which is better suited to time series...
# (ISO-formatted date strings compare correctly as plain strings.)
X_train = trn_data[trn_data['first_day_of_month'] < cutoff_date][features]
y_train = trn_data[trn_data['first_day_of_month'] < cutoff_date][target]

X_val = trn_data[trn_data['first_day_of_month'] >= cutoff_date][features]
y_val = trn_data[trn_data['first_day_of_month'] >= cutoff_date][target]
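
A single cutoff gives one validation estimate. As a possible extension, the `TimeSeriesSplit` class imported earlier can produce several expanding-window folds. The sketch below splits on unique months so that all counties for a given month land in the same fold; the panel data is made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Made-up panel: 2 counties x 8 monthly observations.
months = pd.date_range("2021-01-01", periods=8, freq="MS")
panel = pd.DataFrame({
    "cfips": np.repeat([1001, 1003], len(months)),
    "date": list(months) * 2,
})

unique_months = pd.DatetimeIndex(panel["date"].unique()).sort_values()
folds = []
splitter = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in splitter.split(np.arange(len(unique_months))):
    # Select whole months, so no month is shared between train and validation.
    train_part = panel[panel["date"].isin(unique_months[train_idx])]
    val_part = panel[panel["date"].isin(unique_months[val_idx])]
    folds.append((train_part, val_part))
    print(f"train through {train_part['date'].max().date()}, "
          f"validate from {val_part['date'].min().date()}")
```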

In [None]:
%%time
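# Symmetric mean absolute percentage error...
def SMAPE(y_true, y_pred):
    '''
    Return the SMAPE in percent (0 to 200).
    Pairs where both the true and predicted values are 0 contribute 0.
    '''
    y_true = np.asarray(y_true, dtype = float)
    y_pred = np.asarray(y_pred, dtype = float)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 200.0
    diff = np.abs(y_true - y_pred)
    safe = np.where(denominator == 0, 1.0, denominator) # avoid division by zero
    return np.mean(np.where(denominator == 0, 0.0, diff / safe))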
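
A few hand-checkable inputs help confirm how the metric behaves. The standalone `smape` below is a re-implementation for illustration, written so the block runs on its own; it follows the usual definition, 100 times the mean of |y - ŷ| / ((|y| + |ŷ|) / 2).

```python
import numpy as np

def smape(y_true, y_pred) -> float:
    """SMAPE in percent (0 to 200); pairs where both values are 0 count as 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    diff = np.abs(y_true - y_pred)
    ratio = np.divide(diff, denominator,
                      out=np.zeros_like(diff), where=denominator != 0)
    return float(100.0 * ratio.mean())

print(smape([2.0, 4.0], [2.0, 4.0]))  # perfect forecast: 0.0
print(smape([1.0, 0.0], [3.0, 0.0]))  # (2 / 2 + 0) / 2 * 100 = 50.0
print(smape([1.0], [0.0]))            # worst case for one pair: 200.0
```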

---

# 8.0 Development of a GBDT Model (XGBoost)

In [None]:
%%time
# Defines a really simple XGBoost Regressor...

xgboost_params = {'eta'              : 0.01,
                  'n_estimators'     : 8192,
                  'max_depth'        : 16,
                  'max_leaves'       : 256,
                  'colsample_bylevel': 0.95,
                  'colsample_bytree' : 0.95,
                  'subsample'        : 0.95, # Fraction of the training rows XGBoost samples before growing each tree
                  'min_child_weight' : 256,
                  'min_split_loss'   : 0.002,
                  'alpha'            : 0.08,
                  'lambda'           : 64,
                  'objective'        : 'reg:squarederror',
                  'eval_metric'      : 'mae', # Originally using RMSE, trying new functions...
                  'tree_method'      : 'hist',
                  'seed'             :  42
                  }

In [None]:
%%time
# Create an instance of the XGBRegressor and set the model parameters...
regressor = XGBRegressor(**xgboost_params)

In [None]:
%%time
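# Train the XGBRegressor on the training set and monitor the validation set.
# Early stopping limits overfitting; recent XGBoost releases expect
# early_stopping_rounds on the estimator rather than as a fit() argument.
regressor.set_params(early_stopping_rounds = 250)
regressor.fit(X_train,
              y_train,
              eval_set=[(X_val, y_val)],
              verbose = 250)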

---

# 9.0 Machine Learning Explainability

In [None]:
%%time
# Collect the fitted model's feature importances and plot them, largest first...
# Note: XGBoost importances are not Gini importances; the measure depends on
# the model's importance_type setting.
importances = pd.Series(regressor.feature_importances_, index = features, name = 'importance')
importances.sort_values(ascending = False).plot(kind = 'bar', rot = 90, figsize = (10, 5))
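
Built-in importances are computed from the training process. A complementary, model-agnostic check is permutation importance on held-out data, available in scikit-learn as `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data, and a `RandomForestRegressor` stands in for the XGBoost model so the block runs without extra dependencies; it is an illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))  # column 0 is informative, column 1 is noise
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=300)

# Hold out the last 100 rows so importance is measured on unseen data.
X_fit, X_holdout = X_demo[:200], X_demo[200:]
y_fit, y_holdout = y_demo[:200], y_demo[200:]

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_fit, y_fit)
result = permutation_importance(model, X_holdout, y_holdout,
                                n_repeats=5, random_state=0)

# Mean drop in the model's score when each column is shuffled.
for name, mean in zip(["informative", "noise"], result.importances_mean):
    print(f"{name}: {mean:.3f}")
```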

---

# 10.0 Generating Model Predictions
This section scores the model on the validation set and writes the test predictions to a submission file.

In [None]:
%%time
val_pred = regressor.predict(X_val[features])

score = np.sqrt(mean_squared_error(y_val, val_pred))
print(f'RMSE: {score} / SMAPE: {SMAPE(y_val, val_pred)}')

In [None]:
%%time
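# Use the trained model to predict microbusiness density for the test rows...
# (This assumes test.csv and sample_submission.csv list the rows in the same order.)
predictions = regressor.predict(tst_data[features])
submission['microbusiness_density'] = predictions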

# Creates a submission file for Kaggle...
submission.to_csv('submission.csv',index = False)
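
Assigning predictions by position relies on the test rows and the sample submission being in the same order. A more defensive sketch is to align on `row_id` instead. The frames and `row_id` values below are made up for illustration.

```python
import pandas as pd

test_demo = pd.DataFrame({"row_id": ["1001_2022-11-01", "1003_2022-11-01"]})
test_demo["prediction"] = [3.2, 7.9]  # stand-in for regressor.predict(...)

# Sample submission listing the same rows in a different order.
sample_demo = pd.DataFrame({"row_id": ["1003_2022-11-01", "1001_2022-11-01"],
                            "microbusiness_density": [0.0, 0.0]})

# Keep the submission's row order and pull each prediction by its row_id.
aligned = (sample_demo[["row_id"]]
           .merge(test_demo, on="row_id", how="left", validate="one_to_one")
           .rename(columns={"prediction": "microbusiness_density"}))
assert aligned["microbusiness_density"].notna().all(), "some rows have no prediction"
print(aligned)
```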

In [None]:
%%time
submission.head()

---

# 11.0 Development of a Linear Regression Model

---