# Initial Modeling: Linear Models

_By [Michael Rosenberg](mailto:rosenberg.michael.m@gmail.com)._

_**Description**: Contains my initial modeling related to Linear Models for the [Corporacion Favorita Grocery Sales Forecasting Competition](https://www.kaggle.com/c/favorita-grocery-sales-forecasting)._

_**Language Used**: [Python](https://www.python.org)._

_Last Updated: 10/22/2017 9:43 AM EST._

In [11]:
#imports
import pandas as pd
import numpy as np
#import h2o
#from h2o.estimators.glm import H2OGeneralizedLinearEstimator as h2oglm
from sklearn import preprocessing as pp
import scipy as sp
import scipy.sparse #explicitly load the submodule used by sp.sparse.hstack
from sklearn import linear_model as lm

#initialize h2o
#whoami = 24601
#h2o.init(port = whoami)

#helpers
sigLev = 3
pd.options.display.precision = sigLev

In [5]:
#load in data
trainFrame = pd.read_csv("../data/preprocessed/train_splitObs.csv")
validationFrame = pd.read_csv("../data/preprocessed/validation_splitObs.csv")
testFrame = pd.read_csv("../data/preprocessed/test.csv")

# Data preprocessing

Before building our initial models, we need to lightly preprocess our training and validation sets: we zero out negative sales (returns) and then take the log of unit sales.

In [6]:
#0 out observations with negative returns
trainFrame.loc[trainFrame["unit_sales"] < 0,"unit_sales"] = 0
validationFrame.loc[validationFrame["unit_sales"] < 0,"unit_sales"] = 0

In [7]:
#then get log sales
trainFrame["logUnitSales"] = np.log(trainFrame["unit_sales"] + 1)
validationFrame["logUnitSales"] = np.log(validationFrame["unit_sales"] + 1)
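The two steps above can also be written more compactly. The following is a minimal sketch on a made-up toy frame (`toyFrame` and its values are illustrative assumptions, not competition data); it uses `clip` for the zeroing step and `np.log1p`, which computes $\log(1 + x)$.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the unit_sales column (values are made up for illustration)
toyFrame = pd.DataFrame({"unit_sales": [-2.0, 0.0, 1.0, 9.0]})

# clip(lower=0) sets negative sales (returns) to zero in one step
clippedSales = toyFrame["unit_sales"].clip(lower=0)

# log1p(x) computes log(1 + x), matching np.log(x + 1) above
toyFrame["logUnitSales"] = np.log1p(clippedSales)
```

Both forms should give the same `logUnitSales` column; `np.log1p` is simply more accurate when `unit_sales` is close to zero.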

# Predict with fixed effects

Let's try the simplest model: One with store fixed effects and item ID fixed effects.
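
One way to write this model down, using notation introduced here for illustration only ($y_{sit}$ for `unit_sales` of item $i$ at store $s$ in observation $t$, $\alpha_s$ for the store effect, $\gamma_i$ for the item effect):

$$\log(1 + y_{sit}) = \beta_0 + \alpha_s + \gamma_i + \varepsilon_{sit}$$

Each one-hot column built below corresponds to one $\alpha_s$ or one $\gamma_i$, and the regression intercept plays the role of $\beta_0$.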

In [12]:
#get store encodings
storeEncoder = pp.OneHotEncoder()
storeEncodings = storeEncoder.fit_transform(
                                trainFrame["store_nbr"].values.reshape(-1,1))
#get names
featureNameList = ["store_nbr_" + str(i) 
                   for i in storeEncoder.active_features_]
#initialize feature matrix
featureMat = storeEncodings

In [13]:
#then get item encodings
itemEncoder = pp.OneHotEncoder()
itemEncodings = itemEncoder.fit_transform(
                            trainFrame["item_nbr"].values.reshape(-1,1))
itemNameList = ["item_nbr_" + str(i) for i in itemEncoder.active_features_]
#add information to features
featureNameList.extend(itemNameList)
featureMat = sp.sparse.hstack((featureMat,itemEncodings))

In [17]:
#then get our log-sales as the regression target
logSalesMat = trainFrame["logUnitSales"].values

In [18]:
initLinearReg = lm.SGDRegressor(alpha = 0,n_iter = 5)
initLinearReg.fit(featureMat,logSalesMat)

SGDRegressor(alpha=0, average=False, epsilon=0.1, eta0=0.01,
       fit_intercept=True, l1_ratio=0.15, learning_rate='invscaling',
       loss='squared_loss', n_iter=5, penalty='l2', power_t=0.25,
       random_state=None, shuffle=True, verbose=0, warm_start=False)
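
The cells above rely on the scikit-learn version available at the time. In more recent releases, `n_iter` on `SGDRegressor` has been replaced by `max_iter`, and `active_features_` on `OneHotEncoder` is no longer available, with `categories_` serving a similar purpose. The following is a rough sketch of the same pipeline on a made-up toy frame, assuming a recent scikit-learn; the `toy*` names, the data, and the `max_iter` value are illustrative assumptions rather than the settings used in this notebook.

```python
import numpy as np
import pandas as pd
import scipy.sparse
from sklearn import linear_model as lm
from sklearn import preprocessing as pp

# Made-up data: 3 stores x 2 items with an additive structure in log-sales
toyFrame = pd.DataFrame({
    "store_nbr": [1, 1, 2, 2, 3, 3],
    "item_nbr": [10, 20, 10, 20, 10, 20],
    "logUnitSales": [0.5, 1.5, 1.0, 2.0, 1.5, 2.5],
})

# One-hot encode each identifier and stack the sparse blocks side by side
toyStoreEncoder = pp.OneHotEncoder()
toyItemEncoder = pp.OneHotEncoder()
toyFeatureMat = scipy.sparse.hstack((
    toyStoreEncoder.fit_transform(toyFrame[["store_nbr"]]),
    toyItemEncoder.fit_transform(toyFrame[["item_nbr"]]),
)).tocsr()

# categories_ holds the levels seen during fit, one array per input column
toyFeatureNames = (
    ["store_nbr_" + str(c) for c in toyStoreEncoder.categories_[0]]
    + ["item_nbr_" + str(c) for c in toyItemEncoder.categories_[0]]
)

# alpha=0 switches off regularization; max_iter is the number of passes
# over the data (an assumed value), and tol=None disables early stopping
toyReg = lm.SGDRegressor(alpha=0, max_iter=1000, tol=None, random_state=0)
toyReg.fit(toyFeatureMat, toyFrame["logUnitSales"].values)
toyPredictions = toyReg.predict(toyFeatureMat)
```

Pairing `toyFeatureNames` with `toyReg.coef_` gives one fitted coefficient per store and per item, which is the fixed-effects reading of the model.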

In [24]:
#get features for test frame
testStoreEncodings = storeEncoder.transform(testFrame["store_nbr"].values.reshape(-1,1))
featureMat = testStoreEncodings
#items in the test frame that never appear in training cannot be encoded,
#so map them to an arbitrary training item (the first unique one)
alteredTestFrame = testFrame.copy()
alteredTestFrame.loc[~(alteredTestFrame["item_nbr"].isin(
                        list(trainFrame["item_nbr"].unique()))),"item_nbr"] = trainFrame["item_nbr"].unique()[0]
testItemEncodings = itemEncoder.transform(alteredTestFrame["item_nbr"].values.reshape(-1,1))
featureMat = sp.sparse.hstack((featureMat,testItemEncodings))
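
Mapping unseen items to the first training item means those rows borrow an unrelated item's effect. A possible alternative, sketched here on made-up item numbers and not what this notebook does, is to let the encoder handle unknown levels itself via `handle_unknown="ignore"`, which encodes an unseen item as an all-zero row so that only the intercept and store effect contribute.

```python
import numpy as np
from sklearn import preprocessing as pp

# Made-up item numbers: 99 appears in "test" but not in "train"
toyTrainItems = np.array([10, 20, 30]).reshape(-1, 1)
toyTestItems = np.array([20, 99]).reshape(-1, 1)

# handle_unknown="ignore" encodes unseen categories as all zeros
# instead of raising an error at transform time
toyItemEncoder = pp.OneHotEncoder(handle_unknown="ignore")
toyItemEncoder.fit(toyTrainItems)
toyTestEncodings = toyItemEncoder.transform(toyTestItems).toarray()
```

Whether a zero item effect is a better guess than an arbitrary item's effect is a modeling choice; it at least avoids tying the prediction to whichever item happens to come first.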

In [27]:
#then predict
testFrame["logPredictions"] = initLinearReg.predict(featureMat)
testFrame["unit_sales"] = np.exp(testFrame["logPredictions"]) - 1
exportFrame = testFrame[["id","unit_sales"]]
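
Because the model is fit on $\log(1 + y)$, a negative log-prediction back-transforms to a negative `unit_sales` value. The following is a small sketch, on made-up log-predictions, of one way to guard against that; the clipping step is an assumption added here and is not part of the cell above.

```python
import numpy as np

# Made-up log-scale predictions, including one below zero
toyLogPredictions = np.array([-0.2, 0.0, 1.0, 2.5])

# expm1(x) computes exp(x) - 1, the inverse of log1p;
# clipping at 0 keeps the back-transformed sales non-negative
toyUnitSales = np.clip(np.expm1(toyLogPredictions), 0, None)
```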

In [28]:
exportFrame.to_csv("../data/processed/initPredictions.csv",index = False)

Submitting these predictions puts us in the top $78\%$ of the leaderboard. This isn't awful, but we could do better!

# Replace item fixed effects