### End-to-end Experiment Design and Analysis

In this notebook, we will look at an example of analyzing and preparing a set of users to run an experiment on, and subsequently analyzing the results.

First we will import pandas and the stats module from scipy.

In [1]:
import pandas as pd
from scipy import stats

Now we're going to import a data file. Assuming this data file relates to a mobile game, our main objectives would be to increase the amount of time users spend playing, to increase the amount of money they spend in the game, and to increase the average time they spend watching ads. In this data file we have several variables:
* User ID - unique user identifier
* Age - the age of a user
* Device Type - the type of device they have (e.g., android, iphone)
* Location - where they live (e.g., Canada, US)
* Average Spend - how much money they spend in the product
* Average Play Time - how much time they spend playing a game
* Average Time Watching Ads - how much time they spend watching ads

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/delinai/schulich_ds1/main/Datasets/experiment_demo_dataset.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,User ID,Age,Device Type,Location,Average Spend,Average Play Time,Average Time Watching Ads
0,0,1,62,Device3,Location3,54.608937,0.234102,2.262157
1,1,2,65,Device1,Location1,36.428457,0.802944,1.44016
2,2,3,18,Device1,Location2,84.843851,4.349362,0.374079
3,3,4,21,Device3,Location1,30.963386,5.966833,2.430972
4,4,5,21,Device3,Location2,76.392755,9.401406,1.041962


We should remove the Unnamed: 0 column and set User ID as the index. Let's also check df.info() to make sure the data is clean (no missing values and sensible data types).

In [4]:
df.drop('Unnamed: 0', axis=1, inplace = True)
df.set_index('User ID', inplace=True)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 1 to 1000
Data columns (total 6 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   Age                        1000 non-null   int64  
 1   Device Type                1000 non-null   object 
 2   Location                   1000 non-null   object 
 3   Average Spend              1000 non-null   float64
 4   Average Play Time          1000 non-null   float64
 5   Average Time Watching Ads  1000 non-null   float64
dtypes: float64(3), int64(1), object(2)
memory usage: 54.7+ KB


Now let's do some analysis to understand differences between users. Before we design the experiment, we need to make sure the users are similar in behaviour. This means that users shouldn't have significant differences between the amounts of money they spend, time they spend playing games, or time they spend watching ads. We should compare users by Location, Device Type and Age. We can use statistical tests to support our analysis.

In [6]:
# By location
df.groupby('Location')[['Average Spend','Average Play Time','Average Time Watching Ads']].mean()

Unnamed: 0_level_0,Average Spend,Average Play Time,Average Time Watching Ads
Location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Location1,51.144352,5.09333,1.503598
Location2,51.752892,5.062008,1.453953
Location3,50.563795,4.952144,1.498298


This looks mostly the same. Let's use a one-way ANOVA test to check each metric. The null hypothesis is that the group means are equal, so a HIGH p-value means we have no evidence of a difference between locations.

In [7]:
# checking spend
loc_1 = df[df['Location'] == 'Location1']['Average Spend']
loc_2 = df[df['Location'] == 'Location2']['Average Spend']
loc_3 = df[df['Location'] == 'Location3']['Average Spend']

p_val = stats.f_oneway(loc_1, loc_2, loc_3)
print(p_val)


F_onewayResult(statistic=0.1464947745048961, pvalue=0.8637488343476756)


In [8]:
# checking Play Time
loc_1 = df[df['Location'] == 'Location1']['Average Play Time']
loc_2 = df[df['Location'] == 'Location2']['Average Play Time']
loc_3 = df[df['Location'] == 'Location3']['Average Play Time']

p_val = stats.f_oneway(loc_1, loc_2, loc_3)
print(p_val)

F_onewayResult(statistic=0.219950898335729, pvalue=0.8025971367411683)


In [9]:
# checking Time Watching Ads
loc_1 = df[df['Location'] == 'Location1']['Average Time Watching Ads']
loc_2 = df[df['Location'] == 'Location2']['Average Time Watching Ads']
loc_3 = df[df['Location'] == 'Location3']['Average Time Watching Ads']

p_val = stats.f_oneway(loc_1, loc_2, loc_3)
print(p_val)

F_onewayResult(statistic=0.3246211420994766, pvalue=0.7228775087524879)


For all 3 metrics, users in the 3 locations seem to behave the same. All p-values are well above 0.05, so there is no statistically significant difference between locations - this means we can comfortably assign users from any location to any variant of our experiment.
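The three location cells above repeat the same steps for each metric. As a minimal sketch, the check could be wrapped in a helper that loops over the metrics. The helper name anova_by_group and the synthetic demo frame are assumptions made for illustration; they are not part of the original notebook, and the demo values are randomly generated rather than taken from the real data set.

```python
import numpy as np
import pandas as pd
from scipy import stats

METRICS = ['Average Spend', 'Average Play Time', 'Average Time Watching Ads']

def anova_by_group(data, group_col, metrics=METRICS):
    """Run a one-way ANOVA for each metric across the levels of group_col."""
    rows = []
    for metric in metrics:
        # One array of metric values per group level (e.g. per location)
        samples = [g[metric].to_numpy() for _, g in data.groupby(group_col)]
        f_stat, p_value = stats.f_oneway(*samples)
        rows.append({'metric': metric, 'F': f_stat, 'p_value': p_value})
    return pd.DataFrame(rows).set_index('metric')

# Synthetic stand-in for df (hypothetical values, NOT the real data set)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Location': rng.choice(['Location1', 'Location2', 'Location3'], size=300),
    'Average Spend': rng.normal(50, 20, size=300),
    'Average Play Time': rng.uniform(0, 10, size=300),
    'Average Time Watching Ads': rng.uniform(0, 3, size=300),
})

location_anova = anova_by_group(demo, 'Location')
print(location_anova)
```

With the real data, the same call would be anova_by_group(df, 'Location'), and passing 'Device Type' instead would cover the device comparison.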

Let's check the same for device type.

In [10]:
# checking spend
loc_1 = df[df['Device Type'] == 'Device1']['Average Spend']
loc_2 = df[df['Device Type'] == 'Device2']['Average Spend']
loc_3 = df[df['Device Type'] == 'Device3']['Average Spend']

p_val = stats.f_oneway(loc_1, loc_2, loc_3)
print(p_val)

F_onewayResult(statistic=0.03056354414431916, pvalue=0.9698997074210355)


In [11]:
# checking play time
loc_1 = df[df['Device Type'] == 'Device1']['Average Play Time']
loc_2 = df[df['Device Type'] == 'Device2']['Average Play Time']
loc_3 = df[df['Device Type'] == 'Device3']['Average Play Time']

p_val = stats.f_oneway(loc_1, loc_2, loc_3)
print(p_val)

F_onewayResult(statistic=2.25245551853566, pvalue=0.10567552404042554)


In [12]:
# checking Time Watching Ads
loc_1 = df[df['Device Type'] == 'Device1']['Average Time Watching Ads']
loc_2 = df[df['Device Type'] == 'Device2']['Average Time Watching Ads']
loc_3 = df[df['Device Type'] == 'Device3']['Average Time Watching Ads']

p_val = stats.f_oneway(loc_1, loc_2, loc_3)
print(p_val)

F_onewayResult(statistic=0.008244524732634912, pvalue=0.9917894357700617)


Again we see that behaviour is mostly similar across device types, with the exception of Play Time. Its p-value (about 0.11) is much lower than the others, although it is still above 0.05. To be safe, though, perhaps we won't run an experiment which seeks to influence Play Time. Why? If behaviour already differs between people who have different devices, then after the experiment we won't be able to tell whether a change in behaviour came from the experiment or from the mix of devices in each group. Our results won't be reliable. So, with this user group we should focus on feature changes which could impact Average Time Watching Ads or Average Spend.
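Age was listed as a third dimension to compare, but because it is numeric it has to be grouped into bands before an ANOVA can be applied. The sketch below shows one way to do that with pd.cut. The band edges, the assumed age range of 18 to 65, and the synthetic demo frame are all assumptions chosen for illustration, not values taken from the data set.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for df (hypothetical values, NOT the real data set)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'Age': rng.integers(18, 66, size=400),          # ages 18..65 inclusive (assumed range)
    'Average Spend': rng.normal(50, 20, size=400),
})

# Hypothetical age bands; pd.cut intervals are right-closed: (17, 30], (30, 45], (45, 65]
demo['Age Group'] = pd.cut(
    demo['Age'], bins=[17, 30, 45, 65], labels=['18-30', '31-45', '46-65']
)

# One sample of spend values per age band, then a one-way ANOVA across the bands
samples = [g['Average Spend'] for _, g in demo.groupby('Age Group', observed=True)]
age_result = stats.f_oneway(*samples)
print(age_result)
```

As with location and device type, a high p-value here would mean there is no evidence that spend differs between age bands.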

## Experiment Design

Now that we have confirmed our users have similar characteristics for the most part, and we have identified the risk of experimenting with Play Time (we do see SOME difference by device type), we can go ahead and design an experiment. We can pick between increasing Average Spend or increasing Average Time Watching Ads.

Let's take Average Spend. Suppose we introduced a new notification type which offered users a % discount on their next in-app purchase. Perhaps we want to offer some users 10% off and other users 20% off to see which discount works best. We should now divide our data set into 3 sections: Variant 1 - 10% off; Variant 2 - 20% off, and Control - no discount. After the experiment, we will review the results.

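We can use pandas' sample() function to select random subsets of users, or simply select the first 300 for Variant 1, the second 300 for Variant 2 and the last 400 for Control. The simple approach is only safe if the row order is unrelated to user behaviour; random selection avoids that assumption. For this demo we will take the simple approach.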
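As a sketch of the sample()-based option, the rows can be shuffled once and then labelled by position, which yields the same 300 / 300 / 400 split with random membership. The demo frame below is a synthetic stand-in for df, and the random_state value is an arbitrary choice made only so the split is reproducible.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df: 1000 users indexed by User ID (hypothetical values)
demo = pd.DataFrame(
    {'Average Spend': np.random.default_rng(2).normal(50, 20, size=1000)},
    index=pd.RangeIndex(1, 1001, name='User ID'),
)

# Shuffle all rows once; frac=1 returns every row in random order
shuffled = demo.sample(frac=1, random_state=42)

# Label by position in the shuffled order: 300 / 300 / 400
labels = ['Variant 1'] * 300 + ['Variant 2'] * 300 + ['Control'] * 400
assigned = shuffled.assign(Variant=labels).sort_index()

print(assigned['Variant'].value_counts())
```

Because each user appears exactly once in the shuffled frame, no user can end up in two groups.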

In [13]:
# Variant 1 - 10% off - 30% of users
# .copy() makes an independent DataFrame, so adding a column later
# does not trigger pandas' SettingWithCopyWarning
variant1 = df.iloc[0:300].copy()

In [14]:
# Variant 2 - 20% off - 30% of users
variant2 = df.iloc[300:600].copy()

In [15]:
# Control group - no discount - remaining 40% of users
control = df.iloc[600:].copy()

In [16]:
# Create a new column to add the variant identifier
variant1['Variant'] = 'Variant 1'

In [17]:
variant2['Variant'] = 'Variant 2'
control['Variant'] = 'Control'


In [18]:
# Concatenate all 3 data sets
final_dataset = pd.concat([variant1, variant2, control], axis=0)

In [19]:
final_dataset

Unnamed: 0_level_0,Age,Device Type,Location,Average Spend,Average Play Time,Average Time Watching Ads,Variant
User ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1,62,Device3,Location3,54.608937,0.234102,2.262157,Variant 1
2,65,Device1,Location1,36.428457,0.802944,1.440160,Variant 1
3,18,Device1,Location2,84.843851,4.349362,0.374079,Variant 1
4,21,Device3,Location1,30.963386,5.966833,2.430972,Variant 1
5,21,Device3,Location2,76.392755,9.401406,1.041962,Variant 1
...,...,...,...,...,...,...,...
996,54,Device1,Location3,77.823025,7.265630,1.881264,Control
997,19,Device1,Location3,27.422427,9.863676,1.798216,Control
998,47,Device2,Location1,28.934192,2.449896,2.218613,Control
999,23,Device1,Location3,39.857764,8.261137,2.130031,Control


Once we have confirmed that the users have similar behaviour, and we have identified users for each variant, we can give this new user list to the Engineering Team or the appropriate team which can deploy the experiment.

## Analyzing Experiment Results

Now let's suppose that we ran the experiment over 2 weeks, and after two weeks we decided to analyze the results. We now have a new data set with 3 new columns:
* Post_Test_Avg_Spend = this looks at the average spend per user after the discount test
* Post_Test_Avg_Play_Time = this looks at the average time spent playing games by users after the discount test
* Post_Test_Avg_Time_Watching_Ads = this looks at the average time spent watching ads after the discount test

Let's import the data.

In [20]:
results = pd.read_csv('https://raw.githubusercontent.com/delinai/schulich_ds1/main/Datasets/experiment_demo_dataset_results.csv')

In [21]:
results.head()

Unnamed: 0,User ID,Age,Device Type,Location,Average Spend,Average Play Time,Average Time Watching Ads,Variant,Post_Test_Avg_Spend,Post_Test_Avg_Play_Time,Post_Test_Avg_Time_Watching_Ads
0,1,62,Device3,Location3,54.608937,0.234102,2.262157,Variant 1,88,9,0
1,2,65,Device1,Location1,36.428457,0.802944,1.44016,Variant 1,78,3,1
2,3,18,Device1,Location2,84.843851,4.349362,0.374079,Variant 1,102,5,3
3,4,21,Device3,Location1,30.963386,5.966833,2.430972,Variant 1,112,6,4
4,5,21,Device3,Location2,76.392755,9.401406,1.041962,Variant 1,81,4,3


Now we should evaluate the results. We can use an ANOVA test to compare all 3 groups (Variant 1, Variant 2, and Control), and also t-tests to compare individual variants to the Control, and to each other. Let's review the results.

In [22]:
# checking spend
var1 = results[results['Variant']=='Variant 1']['Post_Test_Avg_Spend']
var2 = results[results['Variant']=='Variant 2']['Post_Test_Avg_Spend']
control = results[results['Variant']=='Control']['Post_Test_Avg_Spend']

p_val = stats.f_oneway(var1, var2, control)
print(p_val)

F_onewayResult(statistic=721.5888481862228, pvalue=1.659365119376383e-194)


In [23]:
# checking play time
var1 = results[results['Variant']=='Variant 1']['Post_Test_Avg_Play_Time']
var2 = results[results['Variant']=='Variant 2']['Post_Test_Avg_Play_Time']
control = results[results['Variant']=='Control']['Post_Test_Avg_Play_Time']

p_val = stats.f_oneway(var1, var2, control)
print(p_val)

F_onewayResult(statistic=2.8910397415994877, pvalue=0.055984033047406054)


In [24]:
# checking ad watching time
var1 = results[results['Variant']=='Variant 1']['Post_Test_Avg_Time_Watching_Ads']
var2 = results[results['Variant']=='Variant 2']['Post_Test_Avg_Time_Watching_Ads']
control = results[results['Variant']=='Control']['Post_Test_Avg_Time_Watching_Ads']

p_val = stats.f_oneway(var1, var2, control)
print(p_val)

F_onewayResult(statistic=175.98453563485413, pvalue=3.495772779084361e-66)


Based on the ANOVA, we see that both the Spend and Ad Watching time have statistically significant results! This means that at least one of the three groups differs from the others in its Spending and Ad Watching time after the discount was offered. However, the difference in Play Time between the groups is not significant at the 0.05 level (p is about 0.056, just above the threshold). This makes sense, since we were not targeting or incentivizing users to change the amount of time they play a game.

Let's do an independent t-test next for each variant compared to the control, and then the variants compared to each other.

In [25]:
# Compare Spend

var1 = results[results['Variant']=='Variant 1']['Post_Test_Avg_Spend']
var2 = results[results['Variant']=='Variant 2']['Post_Test_Avg_Spend']
control = results[results['Variant']=='Control']['Post_Test_Avg_Spend']

p_val = stats.ttest_ind(var1, control)
p_val_2 = stats.ttest_ind(var2, control)
p_val_3 = stats.ttest_ind(var1, var2)
print(p_val)
print(p_val_2)
print(p_val_3)

Ttest_indResult(statistic=20.852764656657413, pvalue=1.939954903041814e-75)
Ttest_indResult(statistic=34.71741287396566, pvalue=3.4380838932331946e-154)
Ttest_indResult(statistic=-17.552071233010796, pvalue=6.134968318898046e-56)


When comparing Spend, we see there are significant differences between each variant and the control, and also between the two variants. This means we should have one variant that performs better than the others - we'll get back to this shortly.
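One caveat: by default, stats.ttest_ind assumes that the two groups have equal variances. The groups here differ in size (300 users per variant versus 400 in Control), and a discount may change how spread out spending is, so Welch's t-test (equal_var=False) is a reasonable alternative. The sketch below compares the two on synthetic data; the arrays are randomly generated stand-ins, not the experiment results.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins with different sizes and spreads (hypothetical values)
rng = np.random.default_rng(3)
variant_demo = rng.normal(90, 15, size=300)
control_demo = rng.normal(50, 25, size=400)

# Student's t-test: assumes both groups share the same variance
student = stats.ttest_ind(variant_demo, control_demo)

# Welch's t-test: does not assume equal variances
welch = stats.ttest_ind(variant_demo, control_demo, equal_var=False)

print('Student:', student.statistic, student.pvalue)
print('Welch:  ', welch.statistic, welch.pvalue)
```

With differences as large as the ones in this experiment, both tests are likely to agree; the choice matters more when p-values sit close to 0.05.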

In [26]:
# checking ad watching time
var1 = results[results['Variant']=='Variant 1']['Post_Test_Avg_Time_Watching_Ads']
var2 = results[results['Variant']=='Variant 2']['Post_Test_Avg_Time_Watching_Ads']
control = results[results['Variant']=='Control']['Post_Test_Avg_Time_Watching_Ads']

p_val = stats.ttest_ind(var1, control)
p_val_2 = stats.ttest_ind(var2, control)
p_val_3 = stats.ttest_ind(var1, var2)
print(p_val)
print(p_val_2)
print(p_val_3)

Ttest_indResult(statistic=-13.386879454840882, pvalue=1.5310600393512453e-36)
Ttest_indResult(statistic=-14.568723666360519, pvalue=3.571429475888476e-42)
Ttest_indResult(statistic=0.8071301318967642, pvalue=0.419912416689849)


Based on the ad-watching time, we can see that both Variant 1 and Variant 2 are statistically significantly different from the Control group! However, they are not different from each other. This means that whichever variant we select (10% discount or 20% discount) to implement permanently, we should expect a similar effect on ad-watching time. The negative t-statistics tell us that both variants are lower than Control - we will look at the group means shortly.

In [27]:
# checking play time
var1 = results[results['Variant']=='Variant 1']['Post_Test_Avg_Play_Time']
var2 = results[results['Variant']=='Variant 2']['Post_Test_Avg_Play_Time']
control = results[results['Variant']=='Control']['Post_Test_Avg_Play_Time']

p_val = stats.ttest_ind(var1, control)
p_val_2 = stats.ttest_ind(var2, control)
p_val_3 = stats.ttest_ind(var1, var2)
print(p_val)
print(p_val_2)
print(p_val_3)

Ttest_indResult(statistic=1.9707612533248144, pvalue=0.04914591976859942)
Ttest_indResult(statistic=2.1031563575673995, pvalue=0.03580947626027853)
Ttest_indResult(statistic=-0.12640047334730473, pvalue=0.8994574139012586)


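We can see that both Variant 1 and Variant 2 resulted in statistically significantly different play time compared to Control, even though that wasn't the intention! Both p-values are only just below 0.05, though, and the ANOVA above fell just short of significance (p of about 0.056), so this is a weak result. There is no difference between the two variants - we will check if they perform better than Control and then make a decision.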
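Running three pairwise t-tests per metric raises the chance of a false positive. A common, conservative adjustment is the Bonferroni correction, which divides the significance level by the number of comparisons. The sketch below applies it to the play-time p-values printed above; the choice of Bonferroni is an assumption for illustration, not a step from the original analysis.

```python
# p-values copied from the play-time t-test output above
p_values = {
    'Variant 1 vs Control': 0.04914591976859942,
    'Variant 2 vs Control': 0.03580947626027853,
    'Variant 1 vs Variant 2': 0.8994574139012586,
}

alpha = 0.05
# Bonferroni: split the significance level evenly across the comparisons
adjusted_alpha = alpha / len(p_values)

still_significant = {name: p < adjusted_alpha for name, p in p_values.items()}

print(f'Adjusted threshold: {adjusted_alpha:.4f}')
for name, keep in still_significant.items():
    print(f'{name}: p = {p_values[name]:.4f}, significant after correction: {keep}')
```

Under this stricter threshold neither play-time comparison with Control remains significant, which is consistent with the ANOVA p-value of about 0.056 for Play Time.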

Now that we know the discount worked well in improving spend, let's see which variant worked better.

In [28]:
results.groupby('Variant')[['Post_Test_Avg_Spend']].mean()

Unnamed: 0_level_0,Post_Test_Avg_Spend
Variant,Unnamed: 1_level_1
Control,50.4675
Variant 1,88.66
Variant 2,109.683333


It looks like Variant 2 (20% discount) resulted in much higher average spend than both Control and Variant 1. And since the t-tests showed these differences are statistically significant, and therefore unlikely to be due to chance, we should implement Variant 2 to drive higher spend in our users.

In [29]:
results.groupby('Variant')[['Post_Test_Avg_Play_Time']].mean()

Unnamed: 0_level_0,Post_Test_Avg_Play_Time
Variant,Unnamed: 1_level_1
Control,4.555
Variant 1,4.986667
Variant 2,5.016667


It looks like the average Play Time for both variants is slightly higher than Control. The t-tests showed no difference between the two variants, so for Play Time it doesn't really matter which one we choose. So far, since Variant 2 is better at Average Spend, we may as well make it our top choice for now. But first, let's check the Ad Watching time.

In [30]:
results.groupby('Variant')[['Post_Test_Avg_Time_Watching_Ads']].mean()

Unnamed: 0_level_0,Post_Test_Avg_Time_Watching_Ads
Variant,Unnamed: 1_level_1
Control,4.545
Variant 1,2.106667
Variant 2,2.03


Uh oh... it looks like both variants performed WORSE for ad watching, and actually Control performed the best. We know the results here are statistically significant - unfortunately in the wrong direction.
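To put the three tables of means on a common scale, each variant can be expressed as a relative change against Control, computed as (variant mean - control mean) / control mean. The sketch below uses the group means printed above; the helper name relative_lift is an assumption made for illustration.

```python
# Group means copied from the three groupby outputs above
means = {
    'Post_Test_Avg_Spend': {'Control': 50.4675, 'Variant 1': 88.66, 'Variant 2': 109.683333},
    'Post_Test_Avg_Play_Time': {'Control': 4.555, 'Variant 1': 4.986667, 'Variant 2': 5.016667},
    'Post_Test_Avg_Time_Watching_Ads': {'Control': 4.545, 'Variant 1': 2.106667, 'Variant 2': 2.03},
}

def relative_lift(variant_mean, control_mean):
    """Relative change of a variant versus control (0.10 means +10%)."""
    return (variant_mean - control_mean) / control_mean

lifts = {
    metric: {
        variant: relative_lift(value, groups['Control'])
        for variant, value in groups.items()
        if variant != 'Control'
    }
    for metric, groups in means.items()
}

for metric, by_variant in lifts.items():
    for variant, lift in by_variant.items():
        print(f'{metric} | {variant}: {lift:+.1%}')
```

Seen this way, spend rises by well over half in both variants while ad-watching time roughly halves, which is the tradeoff that the business decision has to weigh.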

#### Business Strategy
Now that we have our results, we can consider the pros and cons and make a strategic decision. We know that Variant 1 and 2 both drive in-app purchases, and increase users' Average Spend. This is good for our business. On the other hand, both variants result in lower Ad Watching time, which is bad for our business because it reduces ad revenue. There is a clear tradeoff to weigh when deciding between these variants and Control.

From a business perspective, we need to select the variant based on our goals - in our case, we want to increase Revenue so it makes sense to select Variant 2. Ideally, we would compare the margin we get from in-app spend by users vs. the margin we get from Ad Watching (hint: usually direct spend is better for the business than ad revenue). 

In this example, we should decide to deploy Variant 2 to all users, acknowledging our ad revenue will take a hit. We could also recommend further testing before making a final decision, perhaps a discount + incentive to watch ads. 