## Experimental Design

+ A process
+ An objective, controlled way to draw specific conclusions about a hypothesis

## Forming robust statements

+ Use precise, quantified language
  + Lacking precision: *X probably had an effect on Y. There is likely some small risk of error.*
  + More precise: *P-value analysis indicates that X had an effect on Y, with a 10% risk of Type I error* (Type I error: incorrectly rejecting a true null hypothesis)
+ Goal: arrive at statements of the second kind through experimental design and statistical analysis
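
As a rough illustration of how a quantified statement can be produced, the sketch below runs a two-sample t-test and reports the result against a 10% Type I error threshold. The data are synthetic, and the names `control`, `treated`, and `alpha` are assumptions made for this example, not part of the original material.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical outcome measurements for a control group and a treated group
control = rng.normal(loc=50.0, scale=5.0, size=100)
treated = rng.normal(loc=55.0, scale=5.0, size=100)

alpha = 0.10  # accepted risk of a Type I error (rejecting a true null hypothesis)

# Two-sample t-test; the null hypothesis is that both groups share the same mean
t_stat, p_value = stats.ttest_ind(treated, control)

if p_value < alpha:
    statement = f"Reject the null hypothesis at alpha = {alpha:.2f} (p = {p_value:.3g})."
else:
    statement = f"Fail to reject the null hypothesis at alpha = {alpha:.2f} (p = {p_value:.3g})."

print(statement)
```

The threshold `alpha` is chosen before looking at the data. It is the error rate we agree to tolerate, not something estimated from the sample.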

## Non-random assignment of subjects

***Scenario***: An agricultural firm is conducting an experiment to measure how feeding sheep different types of grass affects their weight. They have asked for your help to properly set up the experiment. One of their managers has said you can perform the subject assignment by taking the top 250 rows from the DataFrame and that should be fine.

We can clearly demonstrate why this might not be a good idea. Assign the subjects to two groups using non-random assignment (the first 250 rows) and observe the differences in descriptive statistics.

In [1]:
import numpy as np
import pandas as pd

weights = pd.read_fwf('./data/weights.txt')
weights.set_index('idx', inplace=True)
weights.reset_index(drop=True, inplace=True)
weights.head()

    id  weight
0  262   39.07
1   74   44.04
2  471   46.58
3  382   48.01
4  442   48.46


In [2]:
# Non-random assignment
group1_non_rand = weights.iloc[:250]
group2_non_rand = weights.iloc[250:]

# Compare descriptive statistics of groups
compare_df_non_rand = pd.concat([group1_non_rand['weight'].describe(), group2_non_rand['weight'].describe()], axis=1)
compare_df_non_rand.columns = ['group1', 'group2']

# Print to assess
print(compare_df_non_rand)

           group1      group2
count  250.000000  250.000000
mean    58.821560   71.287480
std      4.503791    5.019958
min     39.070000   65.100000
25%     56.200000   67.490000
50%     59.390000   70.115000
75%     62.527500   73.770000
max     65.100000   95.820000

The two groups barely overlap: the maximum of group1 equals the minimum of group2 (65.10), and the means differ by about 12.5. The rows appear to be ordered by weight, so taking the first 250 rows simply selects the lightest sheep.


In [3]:
# Randomly assign half
group1_random = weights.sample(n=250, random_state=42, replace=False)

# Create second assignment
group2_random = weights.drop(group1_random.index)

# Compare assignments
compare_df_random = pd.concat([group1_random['weight'].describe(), group2_random['weight'].describe()], axis=1)
compare_df_random.columns = ['group1', 'group2']
print(compare_df_random)

           group1      group2
count  250.000000  250.000000
mean    64.499960   65.609080
std      8.073025    7.596346
min     39.070000   44.040000
25%     58.660000   60.312500
50%     64.620000   65.540000
75%     69.957500   70.442500
max     86.760000   95.820000
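
With random assignment, the group means (64.50 and 65.61) and standard deviations (8.07 and 7.60) are close to each other, unlike the first-250-rows split. Because `weights.txt` may not be available, here is a minimal sketch that reproduces the contrast on synthetic data. The `demo` DataFrame and its distribution parameters are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in for the sheep data: 500 weights stored in ascending order
demo = pd.DataFrame({"weight": np.sort(rng.normal(loc=65.0, scale=8.0, size=500))})

# Non-random assignment: first 250 rows versus the remaining rows
gap_non_random = demo.iloc[250:]["weight"].mean() - demo.iloc[:250]["weight"].mean()

# Random assignment: sample half without replacement, the rest forms group 2
g1 = demo.sample(n=250, random_state=42)
g2 = demo.drop(g1.index)
gap_random = abs(g2["weight"].mean() - g1["weight"].mean())

print(f"difference in means, non-random split: {gap_non_random:.2f}")
print(f"difference in means, random split:     {gap_random:.2f}")
```

When the rows are ordered by the outcome, the positional split builds a large difference into the groups before any treatment is applied. The random split keeps the difference small.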


## Experimental data setup

Dataset for the e-commerce example (`econ`) in video 2 is not provided.

### Blocking on productivity

Randomly split the 100 subjects into two equal-sized blocks of 50 each.

In [4]:
productivity_subjects = pd.read_csv('./data/productivity_subjects.csv', index_col='index')
productivity_subjects.head()

       subject_id
index
0               1
1               2
2               3
3               4
4               5


In [5]:
# Randomly assign half
block_1 = productivity_subjects.sample(n=50, random_state=42, replace=False)

# Set the block column
block_1['block'] = 1

# Create second assignment and label
block_2 = productivity_subjects.drop(block_1.index)
block_2['block'] = 2

# Concatenate and print
productivity_combined = pd.concat([block_1, block_2], axis=0)
print(productivity_combined['block'].value_counts())

block
1    50
2    50
Name: count, dtype: int64


### Stratifying an experiment

***Scenario:*** You are working with a government organization that wants to undertake an experiment around how some particular government policies impact the net wealth of individuals in a number of areas.

They have approached you to help set up the experimental design. They have warned you that there is likely to be a small group of users who already have high net wealth and are concerned that this group might overshadow any experimental outcome observed. We know just what to do!

In [6]:
wealth_data = pd.read_fwf('./data/wealth_data.txt', index_col='index')
wealth_data.head()

       net_wealth  service_involvement  high_wealth
index
0      112820.451                   19            0
1      106850.675                   19            0
2      118590.331                   19            0
3      184323.773                   12            0
4      166439.516                   10            0


In [7]:
# Create the first stratum (high net wealth) and label it as block 1
strata_1 = wealth_data[wealth_data['high_wealth'] == 1].copy()
strata_1['Block'] = 1

# Within the stratum, randomly assign half to Treatment and the rest to Control
# (random_state fixed for reproducibility, as in the earlier cells)
strata_1_g1 = strata_1.sample(n=100, replace=False, random_state=42)  # could use frac=0.5 instead of n=100
strata_1_g1['T_C'] = 'T'
strata_1_g2 = strata_1.drop(strata_1_g1.index)
strata_1_g2['T_C'] = 'C'

# Create the second stratum (everyone else) and assign groups the same way
strata_2 = wealth_data[wealth_data['high_wealth'] == 0].copy()
strata_2['Block'] = 2

strata_2_g1 = strata_2.sample(n=900, replace=False, random_state=42)  # could use frac=0.5 instead of n=900
strata_2_g1['T_C'] = 'T'
strata_2_g2 = strata_2.drop(strata_2_g1.index)
strata_2_g2['T_C'] = 'C'

# Concatenate the grouping work
wealth_data_stratified = pd.concat([strata_1_g1, strata_1_g2, strata_2_g1, strata_2_g2])
print(wealth_data_stratified.groupby(['Block','T_C', 'high_wealth']).size())

Block  T_C  high_wealth
1      C    1              100
       T    1              100
2      C    0              900
       T    0              900
dtype: int64
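
The per-stratum steps can also be written more compactly. The sketch below is one possible alternative, not the approach used in the cell above: it relies on `DataFrameGroupBy.sample` (available in pandas 1.1 and later) with `frac=0.5`, and it runs on a synthetic `demo` DataFrame whose size and `net_wealth` values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Hypothetical stand-in for wealth_data: 200 high-wealth rows and 1800 others
demo = pd.DataFrame({"high_wealth": np.repeat([1, 0], [200, 1800])})
demo["net_wealth"] = rng.lognormal(mean=11.5, sigma=0.4, size=len(demo))

# Draw half of every stratum for treatment in a single call
treated = demo.groupby("high_wealth").sample(frac=0.5, random_state=42)

# Everyone who was not drawn becomes control
demo["T_C"] = np.where(demo.index.isin(treated.index), "T", "C")

counts = demo.groupby(["high_wealth", "T_C"]).size()
print(counts)
```

Using `frac=0.5` avoids hard-coding the stratum sizes, so the same code keeps working if the number of high-wealth subjects changes.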


## Visual normality in an agricultural experiment

***Scenario:*** We have been contracted by an agricultural firm conducting an experiment on 50 chickens, divided into four groups, each fed a different diet. Weight measurements were taken every second day for 20 days.

We'll analyze `chicken_data` to assess normality, which determines whether parametric statistical tests are suitable, beginning with a visual examination of the data distribution.

In [8]:
chicken_data = pd.read_csv('./data/chick_weight.csv')
chicken_data.head()

   weight  Time  Chick  Diet
0      42     0      1     1
1      51     2      1     1
2      59     4      1     1
3      64     6      1     1
4      76     8      1     1
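
A visual normality check usually combines a histogram with a normal Q-Q plot. The sketch below shows one way to draw both with Matplotlib and `scipy.stats.probplot`. It uses a synthetic `sample` array as a stand-in for a numeric column such as `chicken_data['weight']`; the array and its parameters are assumptions, and the plotting choices are illustrative rather than the notebook's own.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical stand-in for a numeric column such as chicken_data['weight']
sample = rng.normal(loc=120.0, scale=30.0, size=300)

fig, (ax_hist, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram with a normal curve fitted to the sample mean and standard deviation
ax_hist.hist(sample, bins=30, density=True, alpha=0.6)
grid = np.linspace(sample.min(), sample.max(), 200)
ax_hist.plot(grid, stats.norm.pdf(grid, sample.mean(), sample.std(ddof=1)))
ax_hist.set_title("Histogram with fitted normal curve")

# Q-Q plot: points close to the straight line suggest approximate normality
(osm, osr), (slope, intercept, r) = stats.probplot(sample, dist="norm", plot=ax_qq)
ax_qq.set_title("Normal Q-Q plot")

fig.tight_layout()
# In a script, fig.savefig("normality_check.png") would write the figure to disk
```

Systematic curvature in the Q-Q plot, or a clearly skewed histogram, would argue against relying on parametric tests without a transformation or a non-parametric alternative.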
