# Discrete choice theory: logit model of user decision making


## Introduction

In this lab, we apply discrete choice theory to a real dataset. Specifically, we use logistic regression to estimate the representative utility functions of a set of users. Once these utility functions are estimated from data, we can interpret the coefficients to understand how much influence each feature has on user decision making. For more details, we refer the reader to the textbook *Discrete Choice Methods with Simulation* by Kenneth Train, which is available for free at: https://eml.berkeley.edu/books/choice2.html. This lab uses content from Chapters 2 and 3 of the book.

To run this lab, you will need Python 3, Jupyter Notebook, NumPy, scikit-learn, and a few other packages. The easiest way to get everything installed and set up on your computer is to install Anaconda, which bundles these packages. It can be found at https://www.anaconda.com/download/.

For the purposes of this lab, we will be using a dataset collected by the National Renewable Energy Laboratory (NREL). This dataset consists of a set of survey questions sent out to users to understand the context for their decision whether or not to adopt solar energy. The dataset is called *Understanding the Evolution of Customer Motivations and Adoption Barriers in Residential Solar Markets: Survey Data*, and can be found at https://data.nrel.gov/submissions/68.

## Dataset

This dataset consists of surveys sent to energy consumers about their decision to adopt solar generation. Consumers are segmented into 3 types: ADOPTER, who have adopted photovoltaics (PV); CONSIDERER, who have seriously considered PV but have not adopted it; and GPS, the general population who do not have PV. Each type of consumer was sent a different survey.

There are different files for each of the 3 types of consumer, so we will refer to the consumer type as `---`. The survey responses can be found in `--- Survey Data.zip`, where the data is provided in a csv file. How the survey was delivered can be found in `--- Survey Delivery.pdf`, and it describes how the survey was deployed and consumers were targeted. In `--- Instrument.pdf`, one can view the survey sent to users, as well as the logical flow of the survey, e.g. "if respondent says no, go to question 5". In `--- Codebook.pdf`, one can find detailed information on how to interpret the results of the csv file.


In [None]:
# Load packages.

import numpy as np
from sklearn import linear_model

import csv

### Decide which features to use.

In the code below, you will load the survey responses from csv files and choose which survey answers ('features') to incorporate into your utility function. This is a design decision: which features are most salient for estimating a user's decision? Including more features generally improves the fit to the training data, but it also increases the risk of overfitting.

(Note that, as per common convention, each row is one data point, and each column is one feature.)

Note that this data will require some care to use a discrete choice formulation. In this situation, the user decision determines which survey they are given, so we do not collect the same data for each user. To learn these models, we will have to use the features that are common between all 3 surveys.

Also, care must be taken when loading this data, as not all survey questions are answered by all participants. In addition to common cases of randomly missing data, we can see from the 'Instrument' pdfs that the questions a participant sees are determined by their previous answers.

**Important:** This code is designed to load data from `./data/`, so when saving the dataset on your computer, please put it in the correct folder location.

If you wish to find the features common to all 3 surveys, just run this code first, and it will be calculated in the variable `shared_fieldnames`.

In [None]:
# Initialize and load data.

X = []
y = []

fieldnames = {}

# TODO: Modify which fieldnames are used here!
# This argument is used when pulling survey results from all 3 types.
fieldnames_used = ['EDUC_BINNED','INCOME_BINNED']

with open('data/ADOPTER.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    fieldnames['ADOPTER'] = reader.fieldnames
    for row in reader:
        
#         This is where you decide which features to load into your dataset for ADOPTER.
#         If the features will be type-dependent, you will have to modify the code here.
        row_to_add = []
        for f in fieldnames_used:
            row_to_add.append(row[f])
        X.append(row_to_add)
        y.append(row['GPS_NAC_ADOPTER'])

with open('data/CONSIDERER.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    fieldnames['CONSIDERER'] = reader.fieldnames
    for row in reader:
#         This is where you decide which features to load into your dataset for CONSIDERER.
#         If the features will be type-dependent, you will have to modify the code here.
        row_to_add = []
        for f in fieldnames_used:
            row_to_add.append(row[f])
        X.append(row_to_add)
        y.append(row['GPS_NAC_ADOPTER'])

with open('data/GPS.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    fieldnames['GPS'] = reader.fieldnames
    for row in reader:
#         This is where you decide which features to load into your dataset for GPS.
#         If the features will be type-dependent, you will have to modify the code here.
        row_to_add = []
        for f in fieldnames_used:
            row_to_add.append(row[f])
        X.append(row_to_add)
        y.append(row['GPS_NAC_ADOPTER'])

shared_fieldnames = set(fieldnames['ADOPTER']).intersection(fieldnames['CONSIDERER']).intersection(fieldnames['GPS'])
print(shared_fieldnames)

### Preprocess the data and set the model.

For this dataset, we have $J = 2$ alternatives: either the users adopted, or they are considerers or general population. Note that we consider considerers and the general population the same: they both opted out of solar.

As discussed in the lecture, we use the **random utility model**. In this model, user $n$ chooses the option $j$ that maximizes $U_{nj}$. Under some assumptions, we can decompose the utility as $U_{nj} = V_{nj} + \epsilon_{nj}$. As discussed, $V_{nj}$ is the representative utility, and in this lab we will assume $V_{nj} = \beta^\top x_{nj}$, where $x_{nj}$ are the features chosen above, and $\beta$ is an unknown parameter vector we wish to estimate. The error terms $\epsilon_{nj}$ are assumed to be independent and identically distributed, each following a Gumbel distribution with location $0$ and scale $1$.
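To see why Gumbel errors lead to a logit model, the following minimal sketch simulates the random utility model for $J = 2$ alternatives and compares the simulated adoption share with the closed-form logit probability $e^{V_{n1}} / (e^{V_{n0}} + e^{V_{n1}})$. The utility values in `V` are made up for illustration and do not come from the survey data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical representative utilities for J = 2 alternatives
# (index 0 = do not adopt, index 1 = adopt); illustrative numbers only.
V = np.array([0.0, 0.5])

# Draw i.i.d. standard Gumbel errors (location 0, scale 1) for many simulated users.
n_users = 200_000
eps = rng.gumbel(loc=0.0, scale=1.0, size=(n_users, V.size))

# Each simulated user picks the alternative with the largest utility U = V + eps.
choices = np.argmax(V + eps, axis=1)
simulated_share = np.mean(choices == 1)

# Closed-form logit probability of choosing alternative 1.
logit_prob = np.exp(V[1]) / np.sum(np.exp(V))

print(simulated_share, logit_prob)
```

The two printed numbers should agree up to simulation noise, and only the difference $V_{n1} - V_{n0}$ affects the result.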

To put this into a discrete choice framework, we need to reformat the data a bit. For a categorical feature such as `EDUC_BINNED`, the answer codes are labels rather than quantities: we can't treat "Less than high school" as half of "Some college". We therefore convert each categorical answer into a vector of binary indicator variables, one per answer. Most of these questions also have a "Missing Values" option; values outside the listed answers are encoded as all zeros, so we estimate one coefficient for each non-omitted answer.
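As a minimal sketch of this encoding, the helper below turns a single answer code into a binary indicator vector. The function name `one_hot` is hypothetical, and the sketch assumes answers are coded as integers starting at 1, which is how the preprocessing code in this lab treats them.

```python
def one_hot(code, num_levels):
    """Return a 0/1 list of length num_levels with a 1 at position code - 1.

    Codes outside 1..num_levels (for example, a "missing" code) map to all zeros.
    """
    vec = [0] * num_levels
    if 1 <= code <= num_levels:
        vec[code - 1] = 1
    return vec


print(one_hot(2, 4))  # [0, 1, 0, 0]
print(one_hot(5, 4))  # [0, 0, 0, 0]
```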

In [None]:
# Preprocess the data.

X_cleaned = []
y_cleaned = []

# Encode each categorical answer (4 or 5 possible values) as a binary vector of length 4 or 5.
# TODO: This will need to be changed if your feature set changes!
num_data_omitted = 0
for xx, yy in zip(X,y):
    row_to_add1 = [0,0,0,0]
    row_to_add2 = [0,0,0,0,0]
    try:
        index = int(xx[0]) - 1
        # Guard the lower bound too, so a code of 0 or below cannot wrap around the list.
        if 0 <= index < 4:
            row_to_add1[index] = 1
    except ValueError:
#         print('Warning: missing value in data point.')
#         print('Omitting data...')
        num_data_omitted += 1
        continue
    try:
        index = int(xx[1]) - 1
        # Guard the lower bound too, so a code of 0 or below cannot wrap around the list.
        if 0 <= index < 5:
            row_to_add2[index] = 1
    except ValueError:
#         print('Warning: missing value in data point.')
#         print('Omitting data...')
        num_data_omitted += 1
        continue
    row_to_add = row_to_add1 + row_to_add2
    
    X_cleaned.append(row_to_add)
    
#     Treat ADOPT as 1, other as 0.
    if int(yy) == 3:
        y_cleaned.append(1)
    else:
        y_cleaned.append(0)

print('{} incomplete data points omitted.'.format(num_data_omitted))
# print(X_cleaned)
# print(y_cleaned)

### Send the data into a solver.

Now we can use logistic regression to estimate the parameters of the logit model, using the `linear_model.LogisticRegression` class from scikit-learn. Many input arguments can be used to modify this object; more details can be found here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. Note that, by default, this class applies L2 regularization (`C=1.0`), so the fitted coefficients are shrunk toward zero relative to the unregularized maximum likelihood estimates.

In [None]:
# Do logistic regression!

logreg = linear_model.LogisticRegression()

logreg.fit(X_cleaned,y_cleaned)

In [None]:
# Output results of logistic regression!

S = logreg.score(X_cleaned,y_cleaned)

print('accuracy on training data: ')
print(S)

print('parameters of logistic regression.')
print('intercept: ')
print(logreg.intercept_)
print('parameters: ')
print(logreg.coef_)
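The score printed above is measured on the same data used for fitting, so it cannot reveal overfitting. One common check is to hold out part of the data and compare against a majority-class baseline. The sketch below illustrates this on synthetic stand-in data; `X_demo`, `y_demo`, and `true_beta` are invented for the example and are not the survey data.

```python
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the NREL survey): 500 rows, 3 binary features.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(500, 3))
true_beta = np.array([1.0, -1.0, 0.5])  # made-up coefficients
p = 1.0 / (1.0 + np.exp(-(X_demo @ true_beta - 0.2)))
y_demo = (rng.random(500) < p).astype(int)

# Hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = linear_model.LogisticRegression()
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# Accuracy of always predicting the most common class in the training set.
majority = int(np.mean(y_train) >= 0.5)
baseline_acc = np.mean(y_test == majority)

print(train_acc, test_acc, baseline_acc)
```

If the held-out accuracy is far below the training accuracy, or barely above the baseline, the chosen features may be adding little predictive value.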

### Interpret the results; modify the model

Great! Now you've fit a logistic model and calculated a representative utility function $V_{nj} = \beta^\top x_{nj}$. Can we interpret these results?

What does each coefficient mean? How can we interpret them?
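One possible starting point: in a binary logit model, the fitted linear function plays the role of the difference in representative utility between adopting and not adopting, and $e^{\beta_k}$ is the factor by which the odds of adopting change when binary feature $k$ switches from 0 to 1. The sketch below uses hypothetical values for `intercept` and `coef`, not results from the survey data.

```python
import numpy as np

# Hypothetical fitted values, for illustration only (not results from the survey).
intercept = -1.0
coef = np.array([0.4, -0.3])

# exp(coefficient) is the factor by which the odds of adopting are multiplied
# when the corresponding 0/1 feature switches from 0 to 1, other features fixed.
odds_ratios = np.exp(coef)


def adoption_probability(x):
    """Logit probability of adopting for feature vector x."""
    v = intercept + coef @ np.asarray(x, dtype=float)
    return 1.0 / (1.0 + np.exp(-v))


p0 = adoption_probability([0, 0])
p1 = adoption_probability([1, 0])
print(odds_ratios, p0, p1)
```

Here the odds $p/(1-p)$ at `p1` equal the odds at `p0` multiplied by `odds_ratios[0]`.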

Classically, if we were to publish these results, we would also want to report p-values for these parameters. Can you calculate the p-values or confidence intervals for these coefficients?
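One possible approach, sketched here under stated assumptions, is to fit an unregularized logit model by Newton's method and compute Wald standard errors from the inverse of the information matrix $X^\top W X$. The function name `fit_logit_with_wald` and the synthetic data are hypothetical. This sketch does not reproduce scikit-learn's default fit, which is L2-regularized, and it requires a design matrix without perfectly collinear columns (for example, a full set of indicator columns that always sum to 1 alongside an intercept would fail).

```python
import math
import numpy as np


def fit_logit_with_wald(X, y, n_iter=25):
    """Unregularized logistic regression via Newton's method, with Wald statistics.

    X must already contain a column of ones if an intercept is wanted, and its
    columns must not be perfectly collinear.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    # Recompute the information matrix at the final estimate.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    hessian = X.T @ (X * (p * (1.0 - p))[:, None])
    std_err = np.sqrt(np.diag(np.linalg.inv(hessian)))
    z = beta / std_err
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_values = np.array([math.erfc(abs(zi) / math.sqrt(2.0)) for zi in z])
    # Approximate 95% confidence intervals.
    ci_95 = np.column_stack([beta - 1.96 * std_err, beta + 1.96 * std_err])
    return beta, std_err, p_values, ci_95


# Synthetic stand-in data (NOT the survey data), with made-up true coefficients.
rng = np.random.default_rng(0)
n = 2000
features = rng.integers(0, 2, size=(n, 2)).astype(float)
X_demo = np.column_stack([np.ones(n), features])
true_beta = np.array([-0.5, 1.0, 0.0])
prob = 1.0 / (1.0 + np.exp(-X_demo @ true_beta))
y_demo = (rng.random(n) < prob).astype(float)

beta_hat, std_err, p_values, ci_95 = fit_logit_with_wald(X_demo, y_demo)
print(beta_hat, std_err, p_values, ci_95)
```

A small p-value for a coefficient suggests that the data are unlikely under the hypothesis that the coefficient is zero; these are large-sample approximations.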

Can we modify the features used? Which features do you intuitively think should affect solar adoption?

Finally, this was meant to be a quick lab to train you in estimation of logit models from real data of human decision making. What are the shortcomings of this learning? What problems are there with the estimation done? What could be done better? If you could modify the survey to get better results, how would you?