In [2]:
import pandas as pd 
import numpy as np
import scipy
import sklearn 
from sklearn import linear_model

In [4]:
df=pd.read_csv('final_data_3.csv')

In [3]:
def compute_correlation(column_name):
    '''Returns the Pearson correlation between attendance_count and the given column.'''
    return df['attendance_count'].corr(df[column_name])

Calculating the correlation between attendance_count and each of the other count columns.

We have omitted the 'declined' column because all of its entries are 0. A constant column has zero variance, so its correlation with anything is undefined.
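As a quick illustration of why a constant column has to be dropped, here is a minimal sketch on a made-up frame (the `toy` data and its values are invented for this example and are not taken from final_data_3.csv). It shows that the Pearson correlation against an all-zero column comes back as NaN rather than a number.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; not taken from final_data_3.csv.
toy = pd.DataFrame({
    "attendance_count": [12, 30, 7, 55],
    "declined": [0, 0, 0, 0],  # constant column, like 'declined' in this notebook
})

# A constant column has zero standard deviation, so Pearson's r is 0/0.
# errstate only silences the NumPy warning that this division triggers.
with np.errstate(invalid="ignore", divide="ignore"):
    r = toy["attendance_count"].corr(toy["declined"])

# nunique() == 1 is a cheap way to spot constant columns before correlating.
declined_is_constant = toy["declined"].nunique() == 1
print(declined_is_constant, r)
```

Checking `nunique()` on each candidate column before calling `compute_correlation` would catch this case up front.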


In [4]:
corr_total=compute_correlation('total_invited_count')

corr_maybe= compute_correlation('maybe_count')

corr_no_reply= compute_correlation('no_reply_count')

corr_interested_count = compute_correlation('interested_count')

In [18]:
corr_df=(corr_total, corr_maybe, corr_no_reply, corr_interested_count) 
column_title = ['Total Invited', 'Maybe', 'No Replies', 'Interested']

dictionary_correlation = dict(zip(column_title, corr_df)) # creates dictionary for dataframe

In [20]:
corr_results = pd.DataFrame(dictionary_correlation, index=[0])
corr_results.round(3) # round to 3 dp

   Interested  Maybe  No Replies  Total Invited
0       0.768  0.768       0.676          0.866


In [23]:
# we omit maybe_count because it turns out to be exactly the same as interested_count
X = df[['interested_count','no_reply_count','total_invited_count']]
y = df[['attendance_count']]
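The comment in the cell above says that maybe_count duplicates interested_count. One way such a claim could be checked is sketched below on a made-up frame (the `toy` values are invented for illustration; on the real data the same call would be made on `df`).

```python
import pandas as pd

# Toy data, invented for illustration; not taken from final_data_3.csv.
toy = pd.DataFrame({
    "interested_count": [5, 12, 0, 40],
    "maybe_count": [5, 12, 0, 40],
    "no_reply_count": [80, 150, 9, 300],
})

# Series.equals is True only when both columns hold the same values
# in the same positions; the column names themselves are not compared.
same_as_interested = toy["maybe_count"].equals(toy["interested_count"])
same_as_no_reply = toy["no_reply_count"].equals(toy["interested_count"])
print(same_as_interested, same_as_no_reply)
```

Keeping two identical columns in X would make the regressors perfectly collinear, which is why only one of them is used.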

In [24]:
import statsmodels.api as sm  # using statsmodels rather than sklearn for its regression summary output

# note: sm.OLS does not add an intercept; X is used as given (no sm.add_constant)
model1 = sm.OLS(y, X)
results1 = model1.fit()
print(results1.summary())

                            OLS Regression Results                            
Dep. Variable:       attendance_count   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                  1.000
Method:                 Least Squares   F-statistic:                 3.849e+30
Date:                Mon, 12 Feb 2018   Prob (F-statistic):               0.00
Time:                        14:20:10   Log-Likelihood:                 18036.
No. Observations:                 694   AIC:                        -3.607e+04
Df Residuals:                     691   BIC:                        -3.605e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
interested_count       -2.0000   1

What these results tell us (Model 2): 

In the previous model we 'controlled' for the total number of invitations to each event; in this model we do not, and we see something interesting. Whereas previously we estimated inverse relationships between the 'going' variable and the 'no-reply'/'interested' variables, in this model they are positive.

What does this mean? 

Broadly, this means that the total invites variable contributes a substantial amount of predictive power: R-squared is 1.000 with it and 0.763 without it.

Why the change in signs?

This occurs when variables are highly correlated with each other. In a model where we do not control for the total number of invites, some of its predictive power is 'absorbed' by the interested and no-reply variables. This makes intuitive sense: the more people you invite, the more people will click 'interested' and the more people will not reply. If we don't control for total invites, the relationship between, for example, the number of people who click 'interested' and the number who click 'going' will be misspecified, because it may simply be that more invitations are driving the 'going' numbers rather than there being some direct relationship between those who click 'interested' and those who click 'going'.
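The sign change described above can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: the event sizes, the response shares and the rule that every invitee is exactly one of going / interested / maybe / no reply (with maybe equal to interested) are all invented here and are not confirmed by the dataset. It uses plain least squares from NumPy with no intercept, which mirrors calling sm.OLS without sm.add_constant.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic events, invented for illustration. A common "event size"
# drives every count, so the counts are strongly correlated.
size = rng.uniform(50, 500, n)
interested = 0.1 * size * rng.uniform(0.5, 1.5, n)
no_reply = 0.6 * size * rng.uniform(0.5, 1.5, n)
going = 0.2 * size * rng.uniform(0.5, 1.5, n)

# ASSUMPTION: each invitee is exactly one of going / interested / maybe /
# no reply, and maybe is identical to interested, hence the factor 2.
total = going + 2 * interested + no_reply

def ols_no_intercept(y, *columns):
    """Least-squares coefficients of y on the given columns, no constant term."""
    X = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Order of coefficients: interested, no_reply, (total)
with_total = ols_no_intercept(going, interested, no_reply, total)
without_total = ols_no_intercept(going, interested, no_reply)
print("with total:   ", with_total.round(3))
print("without total:", without_total.round(3))
```

Under these assumptions the model that includes the total recovers negative coefficients on interested and no-reply, while the model without it gives both a positive coefficient, because each then stands in for the size of the event.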

In [25]:
model2 = sm.OLS(y, df[['interested_count','no_reply_count']])
results2 = model2.fit()
print(results2.summary())

                            OLS Regression Results                            
Dep. Variable:       attendance_count   R-squared:                       0.763
Model:                            OLS   Adj. R-squared:                  0.762
Method:                 Least Squares   F-statistic:                     1113.
Date:                Mon, 12 Feb 2018   Prob (F-statistic):          6.33e-217
Time:                        14:20:17   Log-Likelihood:                -4014.8
No. Observations:                 694   AIC:                             8034.
Df Residuals:                     692   BIC:                             8043.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
interested_count     0.3813      0.019  

Can we explain our results?

Let's assume that the only people who could potentially come to these events are the ones invited on Facebook. So for each event, some part of the student body is 'eligible' to come. Within that body there are two subsets of people: those who are 'favourable' to the event and those who aren't. Those who aren't are the non-repliers, who never really consider going, whereas the 'favourable' group comprises the people who click 'going' or 'interested'; these are the people who are actually thinking about going to these events. You can only click either 'going' or 'interested' for an event on Facebook, so it is something like a zero-sum game: the more people in this subgroup who are 'interested', the fewer can be 'going', and vice versa.

What can we draw from this?

This isn't so much a hypothesis as it is a (potential) explanation of why we have that counter-intuitive result. Our model suggests that the way to increase the number of people going to events is to increase the total number of people invited, that is, to increase your exposure broadly, but this is nothing new. So is there anything else we can find?