# Additional information: Logistic Regression with interactions

This extra notebook shows how to do hypothesis testing with logistic regression in statsmodels when you want to check for interactions (e.g., moderation).

**IMPORTANT:** The discussions about moderation are a lot more complex than what we're showing below. These steps are only valid for the Digital Analytics course, and as a preliminary set of steps so you can use statsmodels/logistic regression. For your thesis (or other work for the master), please follow the general instructions from the methods courses.

Here we will be using a simulated dataset that you can [download directly from dropbox](https://www.dropbox.com/s/ayv08411gt51uk9/web_campaign_simulated.xlsx?dl=0). 


In [1]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## 1. Data cleaning
(Note: same steps as in the tutorial)

In [2]:
webdata = pd.read_excel('web_campaign_simulated.xlsx')

In [3]:
def check_referral(referral, site):
    if referral == site:
        return 1
    return 0

In [4]:
webdata['google'] = webdata['referral'].apply(check_referral, args=('google',))
webdata['facebook'] = webdata['referral'].apply(check_referral, args=('facebook',))
webdata['news_a'] = webdata['referral'].apply(check_referral, args=('newsletter A',))
webdata['news_b'] = webdata['referral'].apply(check_referral, args=('newsletter B',))
webdata['nyt'] = webdata['referral'].apply(check_referral, args=('nyt',))
webdata['tumblr'] = webdata['referral'].apply(check_referral, args=('tumblr',))
webdata['twitter'] = webdata['referral'].apply(check_referral, args=('twitter',))
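
The same 0/1 columns can also be built without calling a helper function row by row. The sketch below uses a small made-up DataFrame (the `toy` data and the column names `google_apply` and `google_vec` are assumptions for illustration, not part of the course dataset) and checks that a vectorized comparison gives the same result as the `apply` approach.

```python
import pandas as pd

# Toy data standing in for webdata (values are made up for illustration).
toy = pd.DataFrame({'referral': ['google', 'facebook', 'newsletter A', 'google', 'nyt']})

def check_referral(referral, site):
    """Return 1 when the referral matches the given site, else 0."""
    if referral == site:
        return 1
    return 0

# Approach used in the notebook: apply the helper to each value.
toy['google_apply'] = toy['referral'].apply(check_referral, args=('google',))

# Vectorized alternative: compare the whole column at once, then cast True/False to 1/0.
toy['google_vec'] = (toy['referral'] == 'google').astype(int)

# Both columns hold the same values.
same = toy['google_apply'].tolist() == toy['google_vec'].tolist()
print(toy)
print(same)
```

Either version works for the steps that follow; the vectorized form is simply shorter when there are many dummy variables to create.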


In [5]:
webdata.columns

Index(['id', 'age', 'female', 'used_search', 'referral', 'time_spent',
       'campaign_1', 'campaign_2', 'click', 'sell', 'spent', 'google',
       'facebook', 'news_a', 'news_b', 'nyt', 'tumblr', 'twitter'],
      dtype='object')

In [24]:
webdata.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 9010.0 | 2141.001887 | 670.144769 | 1.0 | 2253.25 | 2483.0 | 2483.0 | 2483.0 |
| age | 9010.0 | 28.344728 | 11.573714 | 18.0 | 23.0 | 23.0 | 23.0 | 67.0 |
| female | 9010.0 | 0.140622 | 0.34765 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| used_search | 9010.0 | 0.862486 | 0.344408 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| time_spent | 9010.0 | 188.782464 | 108.709242 | 1.0 | 96.0 | 187.0 | 282.0 | 380.0 |
| campaign_1 | 9010.0 | 0.863929 | 0.342883 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| campaign_2 | 9010.0 | 0.863263 | 0.343589 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| click | 9010.0 | 0.386903 | 0.487068 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| sell | 9010.0 | 0.700222 | 0.458186 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| spent | 6309.0 | 80.705183 | 41.088255 | 10.0 | 45.0 | 81.0 | 117.0 | 150.0 |


## 2. Hypothesis testing with statsmodels - using interactions


Let's say that we want to check whether coming from Google interacts with the effect of the variable sell on click, while also controlling for time_spent. If we just wanted a model with these variables as separate effects (no interaction) in statsmodels, we'd do the following steps:

In [76]:
import statsmodels.api as sm


In [77]:
import statsmodels.formula.api as smf

In [78]:
features = ['google', 'sell', 'time_spent']

In [79]:
logit_model = sm.Logit(webdata['click'], sm.add_constant(webdata[features]))




In [80]:
result = logit_model.fit()

Optimization terminated successfully.
         Current function value: 0.634819
         Iterations 5


In [81]:
print(result.summary())


                           Logit Regression Results                           
Dep. Variable:                  click   No. Observations:                 9010
Model:                          Logit   Df Residuals:                     9006
Method:                           MLE   Df Model:                            3
Date:                Tue, 24 Oct 2017   Pseudo R-squ.:                 0.04874
Time:                        17:26:09   Log-Likelihood:                -5719.7
converged:                       True   LL-Null:                       -6012.8
                                        LLR p-value:                1.050e-126
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.5510      0.062    -24.828      0.000      -1.673      -1.429
google        -0.0013      0.055     -0.024      0.981      -0.109       0.106
sell           1.1863      0.054     22.030      0.0
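
The coefficients in a logit summary are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. A minimal sketch using the `sell` and `google` coefficients printed above (the helper name `odds_ratio` is an assumption for illustration):

```python
import math

def odds_ratio(coef):
    """Convert a log-odds coefficient into an odds ratio."""
    return math.exp(coef)

# Coefficients copied from the main-effects summary above.
coef_sell = 1.1863
coef_google = -0.0013

or_sell = odds_ratio(coef_sell)
or_google = odds_ratio(coef_google)

print(round(or_sell, 2))    # roughly 3.27
print(round(or_google, 3))  # roughly 0.999
```

Read this as: holding the other variables constant, the odds of a click are about 3.3 times higher when sell = 1, while coming from google leaves the odds practically unchanged. With a fitted model, the same conversion can be applied to all coefficients at once with `np.exp(result.params)`.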

Here we have the effects of each independent variable separately. But we can also check for interactions. 

To do so, we use statsmodels in a slightly different manner. We'll first see how to get the same results (above) using a formula. In the formula, the dependent variable (click) comes first, followed by a ~ and then the predictors.

Note that you need to import some additional modules.

In [82]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [85]:
model1_maineffects = smf.glm(formula='click ~ google + sell + time_spent', data=webdata, family=sm.families.Binomial()).fit()

In [86]:
print(model1_maineffects.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                  click   No. Observations:                 9010
Model:                            GLM   Df Residuals:                     9006
Model Family:                Binomial   Df Model:                            3
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -5719.7
Date:                Tue, 24 Oct 2017   Deviance:                       11439.
Time:                        17:26:24   Pearson chi2:                 9.01e+03
No. Iterations:                     4                                         
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -1.5510      0.062    -24.828      0.000      -1.673      -1.429
google        -0.0013      0.055     -0.024      0.9

The results are the same as with the approach we covered in class. However, now we can also modify the formula to check for interactions. So if I want to see the interaction between sell and google, instead of putting a plus sign between google and sell, I put an asterisk to ask for both main effects and their interaction. The other variable (time_spent) is still separated by a plus sign.

In [87]:
model2_interactions = smf.glm(formula='click ~ google * sell + time_spent', data=webdata, family=sm.families.Binomial()).fit()

In [88]:
print(model2_interactions.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                  click   No. Observations:                 9010
Model:                            GLM   Df Residuals:                     9005
Model Family:                Binomial   Df Model:                            4
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -5716.5
Date:                Tue, 24 Oct 2017   Deviance:                       11433.
Time:                        17:26:51   Pearson chi2:                 9.01e+03
No. Iterations:                     4                                         
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      -1.6063      0.067    -24.080      0.000      -1.737      -1.476
google          0.2532      0.113      2.249     

Above I can see the effect of time_spent, and also of sell and google (and their interaction). 

In the table, then:
* sell is the effect of having a sale (sell = 1) when not coming from google (google = 0)
* google is the effect of coming from google (google = 1) when there is no sale (sell = 0)
* google:sell is the extra effect, on top of the two above, when someone both comes from google (google = 1) and has a sale (sell = 1)
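
To see how the three terms combine, the sketch below computes the predicted log-odds and click probability for each of the four google/sell combinations. All coefficient values here are hypothetical round numbers chosen for illustration (they are not the fitted values from the summary above), and time_spent is left out as if it were held fixed.

```python
import math

# Hypothetical coefficients on the log-odds scale (placeholders, not the fitted values).
coefs = {'Intercept': -1.6, 'google': 0.25, 'sell': 1.3, 'google:sell': -0.3}

def log_odds(google, sell):
    """Linear predictor for one google/sell combination (time_spent held fixed and omitted)."""
    return (coefs['Intercept']
            + coefs['google'] * google
            + coefs['sell'] * sell
            + coefs['google:sell'] * google * sell)

def probability(value):
    """Inverse logit: turn log-odds into a probability."""
    return 1 / (1 + math.exp(-value))

# Print google, sell, log-odds and probability for each combination.
for google in (0, 1):
    for sell in (0, 1):
        lo = log_odds(google, sell)
        print(google, sell, round(lo, 2), round(probability(lo), 3))

# The effect of google depends on sell: it is the google coefficient when sell = 0,
# and the google coefficient plus the interaction coefficient when sell = 1.
google_effect_no_sale = log_odds(1, 0) - log_odds(0, 0)
google_effect_sale = log_odds(1, 1) - log_odds(0, 1)
print(round(google_effect_no_sale, 2), round(google_effect_sale, 2))
```

With these placeholder numbers, coming from google raises the log-odds by 0.25 when there is no sale, but changes them by 0.25 - 0.3 = -0.05 when there is a sale. The interaction coefficient is the difference between those two effects.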

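What we actually see in this model is that having a sale without coming from google (sell) has a positive effect on clicks, whereas the interaction (google:sell) is negative: for a visitor who comes from google and has a sale, the combined effect is smaller than the sum of the two separate effects. Interestingly, google was not significant in the first model (just with main effects) but is significant once the interaction is included. Keep in mind that in the second model the google coefficient refers only to visitors without a sale (sell = 0).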
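
One way to check whether adding the interaction improves the model as a whole is a likelihood-ratio test between the two fitted models. The sketch below uses the log-likelihoods reported in the two GLM summaries above (-5719.7 and -5716.5). The models differ by one parameter, so the statistic is compared against a chi-square distribution with one degree of freedom. This is a common textbook check rather than a step from the course tutorial, and the function name `lr_test_one_df` is an assumption for illustration.

```python
import math

def lr_test_one_df(llf_restricted, llf_full):
    """Likelihood-ratio test for nested models that differ by one parameter."""
    statistic = 2 * (llf_full - llf_restricted)
    # For 1 degree of freedom the chi-square survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Log-likelihoods copied from the two GLM summaries above.
statistic, p_value = lr_test_one_df(llf_restricted=-5719.7, llf_full=-5716.5)
print(round(statistic, 1), round(p_value, 4))
```

The statistic is 6.4 with a p-value of roughly 0.011, which points in the same direction as the coefficient table. With the fitted objects, the same inputs should be available as `model1_maineffects.llf` and `model2_interactions.llf`.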

**Important:** In the tests you'd normally do in SPSS to check for moderation (for example, for your thesis), you'd probably have to go a lot deeper than this, and may have different assumptions (e.g., you may not test for interactions if the main effects are not significant, or may want to use PROCESS). That said, *specifically for the digital analytics course, and for logistic regressions done in Pandas/Statsmodels*, doing the above is sufficient to check for the interactions and see if you can speak of any moderation effects.

