# Initial Modeling: Linear Models

_By [Michael Rosenberg](mailto:rosenberg.michael.m@gmail.com)._

_**Description**: Contains my initial linear modeling for the [Corporacion Favorita Grocery Sales Forecasting Competition](https://www.kaggle.com/c/favorita-grocery-sales-forecasting)._

_**Language Used**: [Python](https://www.python.org)._

_Last Updated: 10/22/2017 9:43 AM EST._

In [13]:
#imports
import pandas as pd
import numpy as np
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator as h2oglm
from sklearn import preprocessing as pp
import scipy as sp
import scipy.sparse #load the sparse submodule explicitly so sp.sparse resolves

#initialize h2o
whoami = 24601
h2o.init(port = whoami)

#helpers
sigLev = 3
pd.options.display.precision = sigLev

Checking whether there is an H2O instance running at http://localhost:24601. connected.


| Property | Value |
|---|---|
| H2O cluster uptime | 28 mins 28 secs |
| H2O cluster version | 3.10.4.8 |
| H2O cluster version age | 5 months !!! |
| H2O cluster name | H2O_from_python_michaelrosenberg_cjn4ec |
| H2O cluster total nodes | 1 |
| H2O cluster free memory | 3.540 Gb |
| H2O cluster total cores | 4 |
| H2O cluster allowed cores | 4 |
| H2O cluster status | locked, healthy |
| H2O connection url | http://localhost:24601 |


In [3]:
#load in data
trainFrame = pd.read_csv("../data/preprocessed/train_splitObs.csv")
validationFrame = pd.read_csv("../data/preprocessed/validation_splitObs.csv")
testFrame = pd.read_csv("../data/preprocessed/test.csv")

# Data preprocessing

Before building our initial models, we need two small preprocessing steps on the training and validation sets: setting negative unit sales (returns) to zero, and taking the log of unit sales.

In [5]:
#zero out observations with negative unit sales (returns)
trainFrame.loc[trainFrame["unit_sales"] < 0,"unit_sales"] = 0
validationFrame.loc[validationFrame["unit_sales"] < 0,"unit_sales"] = 0

In [6]:
#then get log sales
trainFrame["logUnitSales"] = np.log(trainFrame["unit_sales"] + 1)
validationFrame["logUnitSales"] = np.log(validationFrame["unit_sales"] + 1)
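
As a quick check on the two steps above, here is a minimal sketch on a hypothetical toy array (the values and variable names are illustrative, not taken from the competition data). It assumes that predictions made on the log scale will later be mapped back with `np.expm1`, which is the inverse of `np.log1p`.

```python
import numpy as np

# Hypothetical toy sales values; the negative entry stands in for a return.
unit_sales = np.array([-2.0, 0.0, 1.0, 9.0])

# Zero out negative values, then take log(1 + x), mirroring the two cells above.
clipped = np.clip(unit_sales, 0, None)
log_sales = np.log1p(clipped)

# expm1 inverts log1p, so log-scale values map back to the unit-sales scale.
recovered = np.expm1(log_sales)
```

`np.log1p(x)` computes the same quantity as `np.log(x + 1)` and is slightly more accurate when `x` is close to zero.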

# Predict with fixed effects

Let's start with the simplest model: one with store fixed effects and item fixed effects.
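
One way to write this model down is sketched below. The notation is an assumption for illustration, not something defined in this notebook: $y_n$ is the unit sales of observation $n$, $s(n)$ and $i(n)$ are its store and item, $\alpha$ and $\gamma$ are the store and item effects, and $\varepsilon_n$ is the error term.

$$
\log(1 + y_n) = \alpha_{s(n)} + \gamma_{i(n)} + \varepsilon_n
$$

Under this reading, each store and each item contributes a single additive shift to log sales, which is what one-hot encoding the two ID columns expresses in matrix form.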

In [11]:
#get store encodings
storeEncoder = pp.OneHotEncoder()
storeEncodings = storeEncoder.fit_transform(
                                trainFrame["store_nbr"].values.reshape(-1,1))
#get names
featureNameList = ["store_nbr_" + str(i) 
                   for i in storeEncoder.active_features_]
#initialize feature matrix
featureMat = storeEncodings

In [14]:
#then get item encodings
itemEncoder = pp.OneHotEncoder()
itemEncodings = itemEncoder.fit_transform(
                            trainFrame["item_nbr"].values.reshape(-1,1))
itemNameList = ["item_nbr_" + str(i) for i in itemEncoder.active_features_]
#add information to features
featureNameList.extend(itemNameList)
featureMat = sp.sparse.hstack((featureMat,itemEncodings))
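
To make the shape of the stacked feature matrix concrete, here is a small dense sketch in plain NumPy. The `one_hot` helper and the toy store and item IDs are hypothetical and exist only for illustration; the cells above use scikit-learn's `OneHotEncoder` and a sparse matrix instead, which matters at the scale of the real data.

```python
import numpy as np

def one_hot(values):
    """Return (sorted unique levels, dense 0/1 indicator matrix)."""
    levels, codes = np.unique(values, return_inverse=True)
    mat = np.zeros((len(values), len(levels)), dtype=int)
    # Put a 1 in each row at the column of that row's level.
    mat[np.arange(len(values)), codes] = 1
    return levels, mat

# Hypothetical toy IDs, one entry per observation.
store_nbr = np.array([1, 1, 2, 3])
item_nbr = np.array([105, 208, 105, 208])

store_levels, store_mat = one_hot(store_nbr)
item_levels, item_mat = one_hot(item_nbr)

# Name the columns the same way as above, then stack the blocks side by side.
feature_names = (["store_nbr_" + str(s) for s in store_levels]
                 + ["item_nbr_" + str(i) for i in item_levels])
feature_mat = np.hstack((store_mat, item_mat))
```

Every row of `feature_mat` contains exactly two ones, one in the store block and one in the item block. Because each block sums to one across its columns, the two blocks are linearly dependent, so an unregularized fit would need a reference level dropped or some form of regularization.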

In [None]:
#then append our log-sales to the regression
logSalesMat = sp.sparse.csc_