## Zeros in expenditure data



### Introduction



Consider the following data from Uganda, collected at the household
level.  The data itself is *recall* data; the respondent is asked to
recall the value, the quantity, and the price of consumption out of
expenditures over the past week, for a rather long list of possible
non-durable expenditure items.  I've organized the data as an array,
with each row corresponding to a household, and each column
corresponding to a different consumption item.



In [None]:
import pandas as pd

x = pd.read_pickle('uganda_expenditures.pickle')

One thing to note about these data is the large number of "zeros".
  This may reflect the fact that few households consume all different
  kinds of consumption goods every week, or could reflect "missing"
  data on non-zero expenditures (e.g., if the respondent forgot).
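
A toy example may help keep the two cases apart.  The sketch below uses a small made-up DataFrame (the household labels, item names, and values are purely illustrative and are not taken from the Uganda data) in which recorded zeros and missing values (`NaN`) are tallied separately.

```python
import numpy as np
import pandas as pd

# Hypothetical expenditures: rows are households, columns are items.
toy = pd.DataFrame({'Matoke': [12.0, 0.0, np.nan, 5.0],
                    'Beans':  [0.0, 0.0, 3.0, np.nan]},
                   index=['hh1', 'hh2', 'hh3', 'hh4'])

share_zero = (toy == 0).mean()      # Share of recorded zeros, by item
share_missing = toy.isna().mean()   # Share of missing values, by item
n_reported = toy.count()            # count() skips NaN but counts zeros

print(pd.DataFrame({'zero': share_zero,
                    'missing': share_missing,
                    'reported': n_reported}))
```

Note that `count()` treats a recorded zero as an observation, so how the zeros are stored in the actual data determines what a count of "non-missing" values means.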



In [None]:
# Count of non-missing observations by year (t) and market (mkt) (transposed)
# FIXME: there is only one value for mkt?
x.groupby(['t','mkt']).count().T

Missing data can cause serious problems in a demand analysis,
   depending on how and why data might be missing.  If observations
   are "missing at random" (MAR) then it may be an easy issue to
   address, but if the probability of being missing is related to the
   disturbance term in the demand equation this becomes a sort of
   selection problem that will complicate estimation and inference.
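
A small simulation can illustrate the difference.  This is only a sketch under stated assumptions: the data are simulated rather than drawn from the survey, the regressor `v` and the linear outcome equation are hypothetical, and the true slope is set to one.  In the first case observations go missing purely by chance; in the second, an observation is missing whenever the outcome is non-positive, so missingness depends on the disturbance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
v = rng.normal(size=n)          # Hypothetical regressor
u = rng.normal(size=n)          # Disturbance
y = 1.0 * v + u                 # Latent outcome with true slope 1

# Case 1: about half the observations go missing purely by chance.
keep_random = rng.random(n) < 0.5
# Case 2: an observation is missing whenever the latent outcome is
# non-positive, so missingness depends on the disturbance u.
keep_selected = y > 0

# Least-squares slope using only the observations that remain.
slope_random = np.polyfit(v[keep_random], y[keep_random], 1)[0]
slope_selected = np.polyfit(v[keep_selected], y[keep_selected], 1)[0]
print(slope_random, slope_selected)
```

In the first case the estimated slope should stay close to one; in the second it is pulled well below one, which is the selection problem described above.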



### Household characteristics



One class of variables that may help to explain zeros is
   "household characteristics".  This includes household size and
   composition (both because this affects demand and perhaps because
   there are more potential shoppers), whether a household is urban or
   rural, and perhaps other characteristics.

Here are some characteristics for the households in Uganda:



In [None]:
z = pd.read_pickle('uganda_hh_characteristics.pickle')
z

### Data mining



Unfortunately, demand theory doesn't offer much guidance on
   how household characteristics should be related to the
   probability of a good's consumption being positive in a given week;
   this is a case where a certain amount of "data mining" may be a
   reasonable approach.

We'll use tools we've discussed in class, relying on an
implementation given by the `scikit-learn` project.  In the first
instance, let's consider estimating a logit, where the
dependent variable is a dummy indicating that the
expenditure on a given good $i$ for a household $j$ at time $t$ is
positive, and where the right-hand-side variables are all the
household characteristics in `z`, combined with a collection of
time dummies (which we can think of as picking up the influence of
prices, among other things):



In [None]:
from sklearn.linear_model import LogisticRegression

time_effects = pd.get_dummies(z.reset_index()[['t']].set_index(z.index),columns=['t'])
# make dummies out of year values; reset_index()[['t']] makes a new df with the t index into a column, 
# while setting the index to be the same as the old df

X = pd.concat([z,time_effects],axis=1).dropna(how='any') # Drop missing data
# note that axis=1 on concat means we glued the time effects on similar to a merge or join; horiz. not vertically
x = x.dropna(how='all',axis=1)

# Here's a good place to limit the number of dependent variables
# if we want to save time.  We select just the first few (5) columns (and all rows):
x = x.iloc[:,:5]

Ests = {}
for item in x: # Iterate over expenditure items (columns of x)
    y = (x>0)[item]  # Dummy for positive item expenditure (Series of True/False)
    # penalty=None (scikit-learn >= 1.2) replaces the older penalty='none'
    Ests[item] = LogisticRegression(fit_intercept=False,penalty=None).fit(X,y)
    # save logit results per item to the dictionary
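
In symbols, the specification just estimated can be sketched as follows.  The notation is introduced here only for illustration: let $d_{ijt}$ equal one if household $j$ reports positive expenditure on good $i$ at time $t$, let $z_{jt}$ be the vector of household characteristics, let $\beta_i$ be the coefficients for good $i$, and let $\delta_{it}$ be the coefficient on the dummy for period $t$.  Then

$$
\Pr(d_{ijt}=1\mid z_{jt}) = \frac{e^{z_{jt}'\beta_i + \delta_{it}}}{1+e^{z_{jt}'\beta_i+\delta_{it}}},
\qquad
\frac{\Pr(d_{ijt}=1\mid z_{jt})}{\Pr(d_{ijt}=0\mid z_{jt})} = e^{z_{jt}'\beta_i+\delta_{it}}.
$$

Because a dummy is included for every period, the $\delta_{it}$ act as period-specific intercepts, which is presumably why `fit_intercept=False` is passed.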

#### Coefficients



This gives us a vector of coefficients for each good, which we can
re-arrange into a pandas DataFrame.  Recall that in the logit model
$e^{X\beta}$ is interpreted as the *odds*.  Thus, for a variable in
$X$ which is itself a logarithm, like log HSize, the associated
coefficient can be interpreted as an elasticity.  Accordingly, if the
coefficient on log HSize in the regression involving Matoke is 0.6,
then we can say that for every one percent increase in household size
(other things equal) there&rsquo;s roughly a 0.6% increase in the odds of
observing positive Matoke consumption.  
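
The arithmetic behind this reading can be checked directly.  The sketch below takes the coefficient of 0.6 from the example above and computes the exact change in the odds implied by a one percent increase in household size, holding everything else fixed.

```python
import math

beta = 0.6                    # Coefficient on log HSize (value from the text)
hsize_ratio = 1.01            # A one percent increase in household size

# Odds are exp(X beta), so scaling HSize by r scales the odds by r**beta.
odds_ratio = math.exp(beta * math.log(hsize_ratio))
pct_change_odds = 100 * (odds_ratio - 1)
print(round(pct_change_odds, 3))
```

The result is very slightly below 0.6, which is why the elasticity interpretation holds only "roughly".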

Coefficients associated with variables in levels have the
interpretation of *semi-elasticities*; thus, the odds of a rural
household consuming Matoke are approximately 53% less than those for
an otherwise similar urban household.  What is the interpretation of
the coefficients associated with discrete counts of different
household members?
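
The word "approximately" matters here: reading a coefficient on a variable in levels as a percentage change in the odds is a first-order approximation, and it is accurate only when the coefficient is small.  A minimal sketch, using a hypothetical coefficient `b` that is not taken from the estimates:

```python
import math

b = -0.5   # Hypothetical coefficient on a 0/1 variable, for illustration only

approx_pct = 100 * b                    # Semi-elasticity reading
exact_pct = 100 * (math.exp(b) - 1)     # Exact change in the odds
print(approx_pct, round(exact_pct, 1))
```

For a coefficient of this size the two readings differ by roughly ten percentage points, so for large coefficients it is safer to report $e^{b}-1$.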



In [None]:
Coefs = pd.DataFrame({i:Ests[i].coef_.squeeze() for i in Ests.keys()},index=X.columns)
# make  dataframe where each column name is the key from Ests; values are the coefficients extracted from the results object
# make the index the column names of X (which were our regressors)
Coefs

#### Cross-Validation & Lasso



Interpreting the coefficients above allows us to think about how
differences in household characteristics affect the odds of consuming
a particular good, but our original concern was that the data might
not be *missing at random*, which could complicate subsequent
estimation of a demand system.  

Here we use the Lasso, with its penalty parameter tuned by
cross-validation, to check which (if any) of our regressors are
useful for out-of-sample prediction.  

We again use a canned routine from sklearn, `LogisticRegressionCV`.
This bundles both the Lasso penalty criterion and cross-validation
together for us, and searches over a list of penalty parameters to
minimize the EMSE, computed via $K$-fold cross-validation.
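
One detail worth keeping in mind: with a 0/1 outcome and predictions that are class labels (which appears to be what the `neg_mean_squared_error` scorer sees when applied to a classifier), the mean squared error is just the share of misclassified observations.  The sketch below illustrates this with a hand-rolled $K$-fold loop on made-up data; the "model" is a deliberately trivial stand-in that predicts the majority class, not a logit.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.random(100) < 0.3            # Hypothetical 0/1 outcomes
K = 5
folds = np.array_split(rng.permutation(len(y)), K)

mse, err = [], []
for k in range(K):
    test = folds[k]
    train = np.concatenate([folds[j] for j in range(K) if j != k])
    # Stand-in "model": predict the majority class of the training folds.
    yhat = np.full(len(test), y[train].mean() > 0.5)
    mse.append(np.mean((y[test].astype(float) - yhat.astype(float))**2))
    err.append(np.mean(y[test] != yhat))

cv_mse = np.mean(mse)   # K-fold estimate of the expected MSE
print(cv_mse, np.mean(err))
```

The two printed numbers coincide, so minimizing this criterion amounts to minimizing the cross-validated misclassification rate.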



In [None]:
from sklearn.linear_model import LogisticRegressionCV
import numpy as np

Lambdas = np.logspace(-5,5,11) # 11 points evenly spaced on a log scale, from 1e-5 to 1e5

CVEsts = {}
for item in x: # Iterate over dummies indicating positive expenditure
    print(item)
    y = (x>0)[item]  # Dummy for positive item expenditure

    # Use 5-fold cross-validation in computing CV statistics; using
    # penalty 'l1' implies a lasso estimator.
    CVEsts[item] = LogisticRegressionCV(fit_intercept=False,
                                        Cs = 1/Lambdas,        # Penalty 1/lambdas to search over
                                        cv=5,                 # K folds
                                        penalty='l1',         # Lasso penalty
                                        solver='liblinear',
                                        scoring='neg_mean_squared_error', # (minus) our CV statistic
                                        n_jobs=-1             # Number of cores to use (-1=all)
                                       ).fit(X,y)

CVCoefs = pd.DataFrame({i:CVEsts[i].coef_.squeeze() for i in CVEsts.keys()},index=X.columns)
CVCoefs

We can see how the estimated coefficients vary with different choices
of the penalty parameter $\lambda$ ($=1/C$).  Consider just the
coefficients associated with estimation of the Matoke logit: If we try
$P$ different values of the penalty parameter using $K$-fold
cross-validation this will be $KP$ different estimates for every
parameter.  We can average over the $K$ different folds to get a
clearer picture of how coefficients vary with $\lambda$.
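
The bookkeeping can be sketched with a made-up array.  The layout used here (folds, then penalties, then coefficients) is assumed to match what `coefs_paths_` stores for each class, and the numbers themselves are random placeholders.

```python
import numpy as np

K, P, n_coef = 5, 11, 3                  # Folds, penalties, coefficients
rng = np.random.default_rng(2)
paths = rng.normal(size=(K, P, n_coef))  # Made-up stand-in for the estimates

avg_path = paths.mean(axis=0)            # Average over the K folds
print(paths.shape[0] * paths.shape[1], avg_path.shape)
```

Averaging over the first axis leaves one row per penalty value and one column per coefficient.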



In [None]:
pd.DataFrame(CVEsts['Matoke'].coefs_paths_[True].mean(axis=0),index=Lambdas.tolist(),columns=X.columns).T
# coefs_paths_ is a dict keyed by class label; [True] picks the positive class.
# mean(axis=0) averages over the K folds, leaving one row per penalty value.
# Rows are labelled by lambda and columns by regressor;
# the result is then transposed.

We can also see how the EMSE varies with $\lambda$:



In [None]:
EMSEs={k:-e.scores_[True].mean(axis=0).ravel() for k,e in CVEsts.items()} 
# For each item, take the scores for the positive class (key True), average over
# folds, and flip the sign (scores are minus the MSE); ravel flattens to 1-D.

EMSEs = pd.DataFrame(EMSEs,index=np.log(Lambdas).tolist()).T
# Collect into a DataFrame indexed by log(lambda), then transpose
EMSEs

Plotting these versus $\log\lambda$:



In [None]:
EMSEs.T.plot()

Finding the minima of these curves gives estimates of the optimal
$\lambda$:



In [None]:
lambda_star = pd.Series({k:1/e.C_[0] for k,e in CVEsts.items()})
lambda_star

Large values of $\lambda$ encourage parsimony in the selection of
regressors, so it's not surprising to find that consumption items with
large values of $\lambda^*$  also have few regressors (this is the
magic of Lasso):



In [None]:
Lasso_outcomes = pd.DataFrame({'#Regressors':(np.abs(CVCoefs)>1e-5).sum(),
                               'λ*':lambda_star})
Lasso_outcomes
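
To see why a larger penalty leaves fewer regressors, it may help to look at a simpler setting than the logit.  In a linear regression with orthonormal regressors and objective $\tfrac12\|y-Xb\|^2+\lambda\|b\|_1$, the Lasso solution is obtained by "soft-thresholding" each unpenalized estimate.  The sketch below uses that closed form with hypothetical coefficient values; it is an analogy for intuition, not the estimator used above.

```python
def soft_threshold(b, lam):
    """Lasso solution for one coefficient when regressors are orthonormal."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

b_ols = [2.0, -0.8, 0.3, -0.05]   # Hypothetical unpenalized estimates

counts = {}
for lam in [0.01, 0.1, 0.5, 1.0]:
    b_lasso = [soft_threshold(b, lam) for b in b_ols]
    counts[lam] = sum(b != 0 for b in b_lasso)   # Regressors that survive
    print(lam, counts[lam])
```

Each increase in the penalty sets one more coefficient exactly to zero, which is the same pattern that appears in the table of regressor counts.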