# Marketing Data Science Modeling
## Predicting Commuter Transportation Choices

To predict consumer choice, we use explanatory variables from the marketing mix, such as product characteristics, advertising and promotion, or the type of distribution channel. We also record consumer characteristics, observable behaviors, survey responses, and demographic data. From these inputs we build discrete choice models from economics and generalized linear models from statistics.

To demonstrate choice methods, we begin with the Sydney Transportation Study. Commuters in Sydney can travel into the city by car or by train. Because the response is binary, we can use logistic regression, a generalized linear model with a logit link. The logit is the natural logarithm of the odds, p/(1 - p), where p is the probability of choosing the train.
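
As a small illustration of the logit link, the sketch below converts a probability to log-odds and back. The helper names `logit` and `inv_logit` are chosen here for illustration and are not part of the original notebook.

```python
import math

def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def inv_logit(eta):
    """Map a log-odds value (linear predictor) back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

p = 0.75                      # example probability: odds of 3 to 1
eta = logit(p)                # natural log of the odds
print('log-odds:', round(eta, 4))
print('back to probability:', round(inv_logit(eta), 4))
```

Logistic regression models the log-odds as a linear function of the explanatory variables, so `inv_logit` is what turns a fitted linear predictor into a predicted probability.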

In the Sydney Transportation Study, 150 out of 333 commuters (45 percent) use the train. Suppose public administrators set a goal to increase public transportation usage by 10 percent. How much lower would train ticket prices have to be to achieve this goal, keeping all other variables constant? We can use the fitted logistic regression model to answer this question.

### Libraries

In [1]:
# import packages into the workspace for this program
from __future__ import division, print_function
import numpy as np
import pandas as pd
import statsmodels.api as sm

### Read Data

In [2]:
sydney = pd.read_csv('data/sydney.csv')
sydney.head()

   cartime  carcost  traintime  traincost choice
0       70       50         64         39  TRAIN
1       50      230         60         32  TRAIN
2       50       70         58         40    CAR
3       60      108         93         62    CAR
4       70       60         68         26  TRAIN


In [3]:
sydney.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 333 entries, 0 to 332
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   cartime    333 non-null    int64 
 1   carcost    333 non-null    int64 
 2   traintime  333 non-null    int64 
 3   traincost  333 non-null    int64 
 4   choice     333 non-null    object
dtypes: int64(4), object(1)
memory usage: 13.1+ KB


### Data Cleaning

In [4]:
# dictionary object to convert string to binary integer 
response_to_binary = {'TRAIN':1, 'CAR':0}

y = sydney['choice'].map(response_to_binary)
cartime = sydney['cartime']
carcost = sydney['carcost']
traintime = sydney['traintime']
traincost = sydney['traincost']

### Logistic Regression Model

In [5]:
# define design matrix for the linear predictor
Intercept = np.array([1] * len(y))
x = np.array([Intercept, cartime, carcost, traintime, traincost]).T

# generalized linear model for logistic regression
logistic_regression = sm.GLM(y, x, family=sm.families.Binomial())
sydney_fit = logistic_regression.fit()
print(sydney_fit.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                 choice   No. Observations:                  333
Model:                            GLM   Df Residuals:                      328
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -136.32
Date:                Mon, 11 Jul 2022   Deviance:                       272.63
Time:                        21:26:01   Pearson chi2:                     326.
No. Iterations:                     6                                         
Covariance Type:            nonrobust                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.4440      0.585     -2.468      0.0
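
The summary reports `Method: IRLS` (iteratively reweighted least squares). To give a rough sense of what that procedure does for a logit link, here is a minimal NumPy sketch run on simulated data. The function name `fit_logistic_irls`, the simulated data, and the stopping rule are assumptions made for illustration; this is not the statsmodels implementation.

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression coefficients by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta                        # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))       # fitted probabilities
        w = mu * (1.0 - mu)                   # binomial variance weights
        z = eta + (y - mu) / w                # working response
        XtW = X.T * w                         # X' W with W diagonal
        beta_new = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Simulated data (hypothetical, not the Sydney study)
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
X = np.column_stack([np.ones(n), x1])         # intercept column plus one predictor
true_beta = np.array([-0.5, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta))))

beta_hat = fit_logistic_irls(X, y)
print('estimated coefficients:', np.round(beta_hat, 3))
```

Each pass solves a weighted least squares problem, which is why the summary lists a small iteration count rather than a generic optimizer.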

In [6]:
# Coefficients copied from the fitted model summary above

print('Coefficient Car time:', 0.0565)
print('Coefficient Car cost:', 0.0298)
print('Coefficient Train time:', 0.0149)
print('Coefficient Train cost:', -0.1113)
print()

# Odds ratio for train cost: exp(-0.1113) is about 0.895, so each one-unit
# increase in train cost lowers the odds of choosing the train by about 10.5 percent
ortc = 1 - np.exp(-0.1113)
print('Decrease in odds of choosing the train per unit increase in train cost:', round(ortc*100,2),'%')

# Inverse logit of the intercept (-1.4440): predicted probability of choosing the
# train when every explanatory variable equals zero (an extrapolation, not a real commuter)
pc = np.exp(-1.4440)/(1 + np.exp(-1.4440))
print('Train probability with all explanatory variables at zero:', round(pc*100,2),'%')

Coefficient Car time: 0.0565
Coefficient Car cost: 0.0298
Coefficient Train time: 0.0149
Coefficient Train cost: -0.1113

Decrease in odds of choosing the train per unit increase in train cost: 10.53 %
Train probability with all explanatory variables at zero: 19.09 %
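
The question posed at the start, how far train fares would have to fall to raise train usage, can be approached with these coefficients. The sketch below searches by bisection for the fare cut that lifts the average predicted train probability to a target. Several things here are assumptions: the helper names `mean_train_share` and `price_cut_for_target` are hypothetical, the "10 percent" goal is read as 10 percentage points, and only the five commuters shown by `sydney.head()` are used so the example stays self-contained. With the full data, all 333 rows would be used and the answer would differ.

```python
import math

# Coefficients from the fitted model: intercept, car time, car cost, train time, train cost
BETA = (-1.4440, 0.0565, 0.0298, 0.0149, -0.1113)

# First five commuters from sydney.head(): (cartime, carcost, traintime, traincost).
# Assumption: a tiny illustrative subset, not the full sample.
ROWS = [(70, 50, 64, 39), (50, 230, 60, 32), (50, 70, 58, 40),
        (60, 108, 93, 62), (70, 60, 68, 26)]

def mean_train_share(rows, beta, price_cut=0.0):
    """Average predicted probability of TRAIN after cutting every train fare by price_cut."""
    b0, b_cartime, b_carcost, b_traintime, b_traincost = beta
    total = 0.0
    for cartime, carcost, traintime, traincost in rows:
        eta = (b0 + b_cartime * cartime + b_carcost * carcost
               + b_traintime * traintime + b_traincost * (traincost - price_cut))
        total += 1.0 / (1.0 + math.exp(-eta))
    return total / len(rows)

def price_cut_for_target(rows, beta, target, upper=100.0, tol=1e-6):
    """Bisection search for the smallest fare cut whose predicted share reaches target."""
    if mean_train_share(rows, beta, upper) < target:
        raise ValueError('target share not reachable within the search range')
    lower = 0.0
    while upper - lower > tol:
        mid = (lower + upper) / 2.0
        if mean_train_share(rows, beta, mid) < target:
            lower = mid       # cut too small, search larger cuts
        else:
            upper = mid       # cut sufficient, search smaller cuts
    return upper

baseline = mean_train_share(ROWS, BETA)
target = baseline + 0.10      # assumption: goal read as +10 percentage points
cut = price_cut_for_target(ROWS, BETA, target)
print('baseline share:', round(baseline, 3))
print('required fare cut (in traincost units):', round(cut, 2))
```

Bisection works here because the train cost coefficient is negative, so the predicted share rises monotonically as the fare cut grows while all other variables are held constant.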


### Model Accuracy

In [7]:
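# predicted probability of choosing the train for each commuter (in-sample);
# predict() with no arguments returns fitted probabilities, not the linear predictor
sydney['train_prob'] = sydney_fit.predict()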

# function to convert probability to choice prediction
def prob_to_response(response_prob, cutoff):
    if response_prob > cutoff:
        return 'TRAIN'
    return 'CAR'
            
# add binary predictions to DataFrame sydney using cutoff value for the case
sydney['choice_pred'] = \
    sydney['train_prob'].apply(lambda d: prob_to_response(d, cutoff = 0.50))
    
# evaluate performance of logistic regression model 
# obtain confusion matrix and proportion of observations correctly predicted    
cmat = pd.crosstab(sydney['choice_pred'], sydney['choice']) 
a = float(cmat.iloc[0,0])
b = float(cmat.iloc[0,1])
c = float(cmat.iloc[1,0]) 
d = float(cmat.iloc[1,1])

n = a + b + c + d
predictive_accuracy = (a + d)/n  

print(cmat)
print('\nProportion Correctly Predicted',\
     round(predictive_accuracy, 3), "\n")

choice       CAR  TRAIN
choice_pred            
CAR          155     30
TRAIN         28    120

Proportion Correctly Predicted 0.826 



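The confusion matrix (a four-fold table) shows that, with a 0.50 cutoff, the model correctly predicts the transportation choice of 275 of the 333 commuters, or 82.6 percent. Because the model is evaluated on the same data used to fit it, this is an in-sample measure and likely overstates accuracy on new commuters.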
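
Overall accuracy hides how the model performs on each class. As a short follow-up, the sketch below derives per-class rates from the four counts in the confusion matrix and compares accuracy with a naive rule that always predicts the more common choice. The variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix above (rows = predicted, columns = actual)
pred_car_actual_car = 155
pred_car_actual_train = 30
pred_train_actual_car = 28
pred_train_actual_train = 120

actual_car = pred_car_actual_car + pred_train_actual_car        # 183 car commuters
actual_train = pred_car_actual_train + pred_train_actual_train  # 150 train commuters
n = actual_car + actual_train

accuracy = (pred_car_actual_car + pred_train_actual_train) / n
train_recall = pred_train_actual_train / actual_train   # share of train users identified
car_recall = pred_car_actual_car / actual_car           # share of car users identified
train_precision = pred_train_actual_train / (pred_train_actual_train + pred_train_actual_car)
majority_baseline = max(actual_car, actual_train) / n   # always predict the common class

print('accuracy:', round(accuracy, 3))
print('train recall:', round(train_recall, 3))
print('car recall:', round(car_recall, 3))
print('train precision:', round(train_precision, 3))
print('majority-class baseline:', round(majority_baseline, 3))
```

Comparing accuracy with the majority-class baseline indicates how much the explanatory variables add beyond simply guessing the more common choice.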