## Experimental Design

+ A process for planning how an experiment is run
+ An objective and controlled way to draw specific conclusions about a hypothesis

## Forming robust statements

Lacking precision: *X probably had an effect on Y. There is likely some small risk of error*

+ Precise and quantified language
  + More precise: *P-value analysis indicates X had an effect on Y, with a 10% risk of Type I error* (Type I error: incorrectly rejecting a true null hypothesis)
+ Goal: use experimental design and statistical analysis to make statements this precise
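
As a rough illustration of where such a quantified statement could come from, the sketch below runs a two-sided permutation test on made-up measurements and compares the resulting p-value with a significance level of 0.10. The data, the function name `permutation_p_value`, and the choice of a permutation test are all assumptions for illustration; these notes do not say which test is used.

```python
import random
import statistics


def permutation_p_value(a, b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means (illustrative helper)."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle the pooled values and split them back into two groups
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements, not taken from the sheep data
control = [58.2, 61.5, 59.9, 60.3, 57.8, 62.1, 60.0, 59.4]
treated = [63.0, 64.2, 61.8, 65.5, 62.7, 66.1, 63.9, 64.4]

alpha = 0.10  # accepted risk of a Type I error
p_value = permutation_p_value(control, treated)
reject_null = p_value < alpha

print(f"p-value: {p_value:.4f}, reject null at alpha={alpha}: {reject_null}")
```

Fixing `alpha` before looking at the data is what allows the conclusion to be stated with a known Type I error risk.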

## Non-random assignment of subjects

***Scenario***: An agricultural firm is conducting an experiment to measure how feeding sheep different types of grass affects their weight. They have asked for your help to properly set up the experiment. One of their managers has said you can perform the subject assignment by taking the top 250 rows from the DataFrame and that should be fine.

We can clearly demonstrate why this might not be a good idea. Assign the subjects to two groups using non-random assignment (the first 250 rows) and observe the differences in descriptive statistics.

import numpy as np
import pandas as pd

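# Read the fixed-width text file of sheep weights
weights = pd.read_fwf('./data/weights.txt')
# Discard the file's 'idx' column and use a fresh 0..n-1 integer index
weights.set_index('idx', inplace=True)
weights.reset_index(drop=True, inplace=True)
weights.head()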

    id  weight
0  262   39.07
1   74   44.04
2  471   46.58
3  382   48.01
4  442   48.46


# Non-random assignment
group1_non_rand = weights.iloc[:250]
group2_non_rand = weights.iloc[250:]

# Compare descriptive statistics of groups
compare_df_non_rand = pd.concat([group1_non_rand['weight'].describe(), group2_non_rand['weight'].describe()], axis=1)
compare_df_non_rand.columns = ['group1', 'group2']

# Print to assess
print(compare_df_non_rand)

           group1      group2
count  250.000000  250.000000
mean    58.821560   71.287480
std      4.503791    5.019958
min     39.070000   65.100000
25%     56.200000   67.490000
50%     59.390000   70.115000
75%     62.527500   73.770000
max     65.100000   95.820000
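
In the output above, the maximum of group1 equals the minimum of group2 (65.10), which is what a split of weight-sorted rows would produce. The sketch below uses synthetic data in place of `weights.txt` (the `demo` DataFrame and its distribution parameters are assumptions for illustration) to show one way to check row order before trusting a positional split.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weights data (hypothetical values, not the real file)
rng = np.random.default_rng(0)
demo = pd.DataFrame({'weight': np.sort(rng.normal(loc=65, scale=8, size=500))})

# If the column is sorted, row position carries information about the outcome
is_sorted = demo['weight'].is_monotonic_increasing

# Positional split: first 250 rows vs the remaining 250 rows
first_half = demo.iloc[:250]
second_half = demo.iloc[250:]
positional_gap = second_half['weight'].mean() - first_half['weight'].mean()

print(f"sorted by weight: {is_sorted}")
print(f"mean gap from positional split: {positional_gap:.2f}")
```

When the check reports a sorted column, taking the top rows puts the lightest animals in one group before any treatment has been applied.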


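Random assignment avoids this imbalance. Sampling 250 rows without replacement and placing the remaining rows in the second group gives two groups with similar descriptive statistics (means of 64.50 and 65.61, compared with 58.82 and 71.29 for the first-250-rows split):
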
# Randomly assign half
group1_random = weights.sample(n=250, random_state=42, replace=False)

# Create second assignment
group2_random = weights.drop(group1_random.index)

# Compare assignments
compare_df_random = pd.concat([group1_random['weight'].describe(), group2_random['weight'].describe()], axis=1)
compare_df_random.columns = ['group1', 'group2']
print(compare_df_random)

           group1      group2
count  250.000000  250.000000
mean    64.499960   65.609080
std      8.073025    7.596346
min     39.070000   44.040000
25%     58.660000   60.312500
50%     64.620000   65.540000
75%     69.957500   70.442500
max     86.760000   95.820000
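
A single seed could be a lucky draw, so it can help to repeat the random assignment a few times. The sketch below does this on synthetic data (the `demo` DataFrame, the helper `random_split_gap`, and the choice of 20 seeds are assumptions for illustration) and compares the largest gap in group means with the gap from a positional split.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the weights data, sorted to mimic the ordered file
rng = np.random.default_rng(0)
demo = pd.DataFrame({'weight': np.sort(rng.normal(loc=65, scale=8, size=500))})


def random_split_gap(df, seed):
    """Randomly assign half the rows to group1 and return the absolute gap in means."""
    group1 = df.sample(n=len(df) // 2, random_state=seed, replace=False)
    group2 = df.drop(group1.index)
    return abs(group1['weight'].mean() - group2['weight'].mean())


# Repeat the random assignment with 20 different seeds
gaps = [random_split_gap(demo, seed) for seed in range(20)]

# Gap produced by taking the first 250 rows instead
positional_gap = abs(demo.iloc[:250]['weight'].mean() - demo.iloc[250:]['weight'].mean())

print(f"largest gap over 20 random splits: {max(gaps):.2f}")
print(f"gap from positional split: {positional_gap:.2f}")
```

Random assignment does not make the groups identical. It makes any remaining difference the result of chance rather than of row order.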
