Compare elastic net prediction with $demand$ as the dependent variable against elastic net prediction with $\log(demand+1)$ as the dependent variable.
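A minimal sketch of the two setups being compared, on synthetic data rather than the competition data. One model is fit on the raw target; the other is fit on log(demand+1) and its predictions are mapped back with `expm1`. The data, the fixed `alpha=0.1`, and the helper name `rmsle_arrays` are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
# Synthetic, skewed, non-negative stand-in for demand (not real data)
y_demo = np.floor(np.exp(1.0 + 0.5 * X_demo[:, 0] + 0.3 * rng.normal(size=200)))

def rmsle_arrays(y_pred, y_true):
    # Clip negative predictions to 0 before taking log(1 + y)
    y_pred = np.clip(y_pred, 0.0, None)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Model 1: fit on the raw target
raw_model = ElasticNet(alpha=0.1).fit(X_demo, y_demo)
# Model 2: fit on log(1 + target), then invert the transform when predicting
log_model = ElasticNet(alpha=0.1).fit(X_demo, np.log1p(y_demo))

pred_raw = raw_model.predict(X_demo)
pred_log = np.expm1(log_model.predict(X_demo))

score_raw = rmsle_arrays(pred_raw, y_demo)
score_log = rmsle_arrays(pred_log, y_demo)
print(score_raw, score_log)
```

Both scores are on the same scale because the second model's predictions are converted back to demand units before scoring.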

In [1]:
import numpy as np
import pandas as pd
from math import ceil
import pickle
import random
with open('datapath.txt','r') as f:
    datapath=f.readlines()[0].rstrip()

In [2]:
#open in binary mode; pickle requires this under Python 3
with open('Cliente_ID_Stats.pkl','rb') as f:
    cuenta_Cliente=pickle.load(f)

In [3]:
cuenta_Cliente.sort_values(ascending=False,inplace=True)

In [4]:
myclient=cuenta_Cliente.index[0]

In [7]:
traindtypes=pd.read_csv(datapath+'train.csv',nrows=10,parse_dates=False).dtypes.to_dict()
traindtypes['Semana']=np.uint8

In [8]:
traincols=pd.read_csv(datapath+'train.csv',nrows=10,parse_dates=False).columns

From an earlier look at the data, these are the only columns that take more than one value for this client ID and are also present in the test file.

In [9]:
useful_features='Agencia_ID Producto_ID Ruta_SAK Semana'.split()
useful_columns='Agencia_ID Producto_ID Ruta_SAK Semana Demanda_uni_equil'.split()
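The column check described above could be sketched roughly like this. The toy frame, its values, and the names `client_rows`, `test_columns`, and `varying` are hypothetical; the real check would run on this client's rows from train.csv and on the header of the test file.

```python
import pandas as pd

# Toy stand-in for one client's rows (values are made up)
client_rows = pd.DataFrame({
    'Semana': [3, 4, 5, 6],
    'Agencia_ID': [1110, 1110, 1111, 1111],
    'Canal_ID': [7, 7, 7, 7],            # constant for this client
    'Producto_ID': [1212, 1216, 1212, 1238],
    'Venta_uni_hoy': [3, 4, 1, 2],       # varies, but assumed absent from the test file
})
# Assumed list of columns available at prediction time
test_columns = ['Semana', 'Agencia_ID', 'Canal_ID', 'Producto_ID']

# Keep columns with more than one distinct value that also exist in the test file
n_unique = client_rows.nunique()
varying = [c for c in client_rows.columns if n_unique[c] > 1 and c in test_columns]
print(varying)
```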

In [10]:
train=pd.DataFrame(columns=useful_columns+['Cliente_ID'])

In [11]:
chunksz=400000

In [12]:
trainit=pd.read_csv(datapath+'train.csv',parse_dates=False,dtype=traindtypes,chunksize=chunksz,iterator=True,usecols=useful_columns+['Cliente_ID'])

In [13]:
%%time
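#collect this client's rows chunk by chunk, then concatenate once
#(DataFrame.append was removed in pandas 2.0)
pieces=[]
for i,chunk in enumerate(trainit):
    if i%50==0:
        print(i)
    pieces.append(chunk[chunk['Cliente_ID']==myclient])
train=pd.concat(pieces,ignore_index=True)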

0
50
100
150
CPU times: user 40 s, sys: 4.07 s, total: 44.1 s
Wall time: 47 s


In [14]:
y=train['Demanda_uni_equil']

In [15]:
#convert to categorical variables
for i in useful_features:
    cats=train[i].unique().tolist()
    train[i]=pd.Categorical(train[i],categories=cats)

In [16]:
#get rid of Cliente_ID
train=train[useful_columns]

Dummifying the variables is the step that takes a long time when there are many distinct categories.

I'm not sure whether R would do this faster. I happen to know off the top of my head that R's randomForest can't deal with a factor that has more than 32 levels (newer releases of the package raise this limit to 53).
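To see why dummifying grows with the number of categories, here is a small sketch on a made-up frame (the IDs are illustrative, not taken from the data). Each distinct level becomes its own 0/1 column, so the width of the result is the total number of levels across all features.

```python
import pandas as pd

# Toy categorical features (values are made up)
toy = pd.DataFrame({
    'Agencia_ID': pd.Categorical([1110, 1111, 1110, 1112]),
    'Producto_ID': pd.Categorical([1212, 1216, 1212, 1238]),
})

# One indicator column per (feature, level) pair, named "feature:level"
dummies = pd.get_dummies(toy, prefix_sep=':')

n_levels = sum(toy[c].nunique() for c in toy.columns)
print(dummies.columns.tolist())
print(dummies.shape[1], n_levels)
```

With thousands of products and routes, the same call produces thousands of columns, which is where the time and memory go.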

In [17]:
%%time
X=pd.get_dummies(train[useful_features],sparse=True,prefix_sep=':')

CPU times: user 1.73 s, sys: 271 µs, total: 1.73 s
Wall time: 1.72 s


In [18]:
def standardize(s):
    if s.std()>0.0:
        return ((s-s.mean())/s.std())
    else:
        return s

In [19]:
X=X.apply(standardize,axis='index')

In [20]:
from sklearn.linear_model import ElasticNetCV

In [21]:
enet=ElasticNetCV(n_alphas=20)

In [22]:
%time enet.fit(X,y)

CPU times: user 13.9 s, sys: 12.6 s, total: 26.5 s
Wall time: 11.5 s


ElasticNetCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
       l1_ratio=0.5, max_iter=1000, n_alphas=20, n_jobs=1, normalize=False,
       positive=False, precompute='auto', random_state=None,
       selection='cyclic', tol=0.0001, verbose=0)

In [23]:
enet.alpha_

0.13207133562491277

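Note that `alpha_` is the overall penalty strength, not the L1/L2 mix. The mix is set by `l1_ratio`, which was left at its default of 0.5 (see the output above), so the penalty is an even blend of L1 and L2.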
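If the L1/L2 mix itself should be tuned, `ElasticNetCV` accepts a list of candidate values for `l1_ratio` and reports the selected one as `l1_ratio_`. A minimal sketch, assuming synthetic data and arbitrary candidate grids (neither comes from the run above):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 8))
# Synthetic target that depends on only two of the eight features
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.1 * rng.normal(size=120)

candidate_ratios = [0.1, 0.5, 0.9]    # assumed grid for the L1/L2 mix
candidate_alphas = [0.01, 0.1, 1.0]   # assumed grid for the penalty strength

enet_demo = ElasticNetCV(l1_ratio=candidate_ratios, alphas=candidate_alphas, cv=3)
enet_demo.fit(X_demo, y_demo)

# alpha_ is the chosen strength, l1_ratio_ the chosen mix (1.0 would be pure L1)
print(enet_demo.alpha_, enet_demo.l1_ratio_)
```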

In [26]:
from sklearn.metrics import mean_squared_error

In [38]:
def positize(x):
    if x<0.0:
        return 0.0
    else:
        return x

In [42]:
vpositize=np.vectorize(positize)

In [45]:
def rmsle(estimator, X, y):
    """Scorer: RMSLE of the estimator's predictions, with negative predictions clipped to 0."""
    y_pred=estimator.predict(X)
    y_positive=vpositize(y_pred)
    return np.sqrt(mean_squared_error(np.log(y_positive+1.0),np.log(y+1.0)))

In [46]:
#rmsle takes log(1+y) of both predictions and targets, so this is on the same scale as the Kaggle score
rmsle(enet,X,y)

1.0806249793808369
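As a sanity check on the metric, here is a tiny worked example with hand-picked numbers. The array-based helper `rmsle_arrays` is a hypothetical stand-in that applies the same clipping and log transform directly to arrays instead of going through an estimator.

```python
import numpy as np

def rmsle_arrays(y_pred, y_true):
    # Negative predictions are clipped to 0, as in positize
    y_pred = np.maximum(y_pred, 0.0)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

y_true = np.array([0.0, 0.0])
y_pred = np.array([np.e - 1.0, -5.0])

# First row: log(1 + (e - 1)) - log(1 + 0) = 1
# Second row: -5 is clipped to 0, so the error is 0
# RMSLE = sqrt((1**2 + 0**2) / 2) = sqrt(0.5)
print(rmsle_arrays(y_pred, y_true))
```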

Ok, that could be better. But we don't have the client id as a predictor, so it's not a great comparison.

In [47]:
#sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score

In [48]:
%time cvscore_enet=cross_val_score(enet,X,y,scoring=rmsle,cv=5)

CPU times: user 55.3 s, sys: 52.3 s, total: 1min 47s
Wall time: 46 s


In [49]:
cvscore_enet

array([ 1.03786978,  1.09000293,  1.1039588 ,  1.09835236,  1.16863783])

In [50]:
#CV scores of the log(demand+1) model, pasted in rather than computed in this notebook
cvscore_logenet=np.array([ 0.95151178,  0.99752119,  0.96322927,  0.9714726 ,  1.12439721])

In [51]:
from scipy.stats import ttest_ind

In [53]:
cvscore_enet.mean()-cvscore_logenet.mean()

0.098137929568478999

In [52]:
ttest_ind(cvscore_enet,cvscore_logenet)

Ttest_indResult(statistic=2.5918392973511737, pvalue=0.032021418963489305)
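`ttest_ind` treats the two sets of scores as unrelated samples. If both models were scored on the same five folds (an assumption; the notebook does not confirm it), a paired test on the per-fold differences is the more natural choice, since it removes fold-to-fold variation. A sketch using the scores as printed above:

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

# Per-fold scores as printed in the notebook
cvscore_enet = np.array([1.03786978, 1.09000293, 1.1039588, 1.09835236, 1.16863783])
cvscore_logenet = np.array([0.95151178, 0.99752119, 0.96322927, 0.9714726, 1.12439721])

independent = ttest_ind(cvscore_enet, cvscore_logenet)   # unpaired, as in the cell above
paired = ttest_rel(cvscore_enet, cvscore_logenet)        # paired by fold

print(independent.pvalue, paired.pvalue)
```

Either way, with only five folds that share most of their training data, the p-value is a rough guide rather than a rigorous test.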

For a better idea of how well this is doing, I'm going to compare against a baseline that always predicts the overall mean demand, scored on each fold.

In [56]:
def rmslep(y_pred,y_true):
    return np.sqrt(mean_squared_error(np.log(y_pred+1.0),np.log(y_true+1.0)))

In [57]:
from sklearn.model_selection import KFold

In [69]:
kf=KFold(n_splits=5)

In [66]:
demand_mean=train['Demanda_uni_equil'].mean()

In [70]:
#use names other than train/test so the train DataFrame is not overwritten
[rmslep(np.full(test_idx.shape[0],demand_mean),y.iloc[test_idx]) for train_idx, test_idx in kf.split(y)]

[1.3303300627402943,
 1.3821851006058314,
 1.40279877097426,
 1.371450300457385,
 1.3582077407053041]
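The baseline above uses the mean over all rows, including the rows of the fold being scored. A variant that computes the mean from the other folds only can be sketched as follows. The synthetic `y_demo` and the contiguous, unshuffled split via `np.array_split` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic non-negative stand-in for demand (not real data)
y_demo = rng.poisson(lam=4.0, size=1000).astype(float)

def rmslep(y_pred, y_true):
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# Five contiguous folds over the row positions
folds = np.array_split(np.arange(y_demo.shape[0]), 5)

baseline_scores = []
for test_idx in folds:
    # Everything outside the current fold is the training part
    train_mask = np.ones(y_demo.shape[0], dtype=bool)
    train_mask[test_idx] = False
    fold_mean = y_demo[train_mask].mean()
    # Predict the training-part mean for every row of the held-out fold
    y_pred = np.full(test_idx.shape[0], fold_mean)
    baseline_scores.append(rmslep(y_pred, y_demo[test_idx]))

print(baseline_scores)
```

With many rows per fold the two variants should give nearly the same numbers, but the out-of-fold mean keeps the baseline honest in the same way the cross-validated model scores are.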