# Experiment Design and Analysis

The objective of this repo is to apply the theory of experiment design and knowledge of analysis techniques to real experiment data.

## Local files:

Set the paths below to match the locations of your local files.

We'll be using the following dataset for notebook exploration:
https://www.sciencedirect.com/science/article/abs/pii/S0022053107000178


In [None]:
DATA_PATH = ""
README_PATH = ""


In [None]:
import pandas as pd
import statsmodels.api as sm
import statsmodels.stats.api as sms
from scipy import stats


data = pd.read_csv(DATA_PATH)



In [None]:
# If applicable:
# uncomment the line below to view the README file for this dataset (it explains the variable names).
# The braces make IPython substitute the value of the Python variable README_PATH.
#!cat {README_PATH}


# Uncomment the line below to view the first rows of the CSV file.
# data.head()

In [None]:
def stats_calculator(provided_data, variables):
    """
    Calculate basic summary statistics for the given variables in a DataFrame.

    Args:
        provided_data (pd.DataFrame): DataFrame containing the data.
        variables (list): Names of the columns to summarize.

    Returns:
        pd.DataFrame: One row per variable (indexed by 'variable') with the
            rounded mean, standard deviation, max, and min.
    """
    stats_df = pd.DataFrame(columns=['variable','mean','std. dev.','max','min'])
    stats_df['variable'] = variables
    for var in variables:
        col = provided_data[var].dropna()
        stats_df.loc[stats_df['variable']==var, 'mean']    = round(col.mean(), 2)
        stats_df.loc[stats_df['variable']==var, 'std. dev.'] = round(col.std(),  2)
        stats_df.loc[stats_df['variable']==var, 'max']     = round(col.max(),  2)
        stats_df.loc[stats_df['variable']==var, 'min']     = round(col.min(),  2)

    stats_df.set_index('variable', inplace=True)
    return stats_df

We can use this function to check that the summary statistics for the control group are similar to those of the treatment group. To do that, we need one DataFrame per group; both are created in the next cell.

In [None]:
control = data[data['treatment'] == 'k1_8_lot_exp']
treatment = data[data['treatment'] == 'k1_8_exp_lot']

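# Demographic variables to compare (assumes these column names exist in the dataset)
variables = ['female', 'age', 'hispanic']

display(stats_calculator(control, variables))
display(stats_calculator(treatment, variables))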
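
As a cross-check on the two summary tables, the same four statistics can be computed for both groups in a single call with pandas `groupby`. The sketch below is only illustrative: the `toy` DataFrame and its values are made up and are not taken from the dataset; only the `treatment` labels mirror the ones used in this notebook.

```python
import pandas as pd

# Toy data for illustration only; these values are NOT from the real dataset.
toy = pd.DataFrame({
    "treatment": ["k1_8_lot_exp", "k1_8_lot_exp", "k1_8_exp_lot", "k1_8_exp_lot"],
    "age": [20, 24, 21, 25],
    "female": [1, 0, 0, 1],
})

# One row per treatment group, one block of columns per variable.
summary = (
    toy.groupby("treatment")[["age", "female"]]
       .agg(["mean", "std", "max", "min"])
       .round(2)
)
print(summary)
```

Having both groups in one table makes it easier to compare the rows by eye, which is all this first balance check is meant to do.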

We can also use a more objective measure to check whether our treatment groups were properly randomized: a two-sample t-test on each variable.



In [None]:
def objective_randomization(provided_data, variables):
    """
    Perform Welch's t-tests (unequal variances) for the given variables
    between the two treatment groups in a DataFrame.

    Args:
        provided_data (pd.DataFrame): DataFrame containing the data.
        variables (list): List of variable names to perform t-tests on.

    Returns:
        pd.DataFrame: DataFrame containing t-statistics and p-values for each variable.
    """

    ttest_df = pd.DataFrame(columns=['variable','t-statistic','p-value'])
    ttest_df['variable'] = variables
    
    g1 = provided_data[provided_data['treatment']=='k1_8_lot_exp']
    g2 = provided_data[provided_data['treatment']=='k1_8_exp_lot']

    for var in ttest_df['variable']:
        arr1 = g1[var].dropna()
        arr2 = g2[var].dropna()
        t_stat, p_val = stats.ttest_ind(arr1, arr2, equal_var=False)
        ttest_df.loc[ttest_df['variable']==var, 't-statistic'] = round(t_stat, 2)
        ttest_df.loc[ttest_df['variable']==var, 'p-value']     = round(p_val, 2)

    ttest_df.set_index('variable', inplace=True)
    return ttest_df
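
To make the `equal_var=False` argument concrete, here is a minimal sketch on synthetic data (the samples `arm_a` and `arm_b` are generated for illustration and are not drawn from the dataset). With that argument, `stats.ttest_ind` runs Welch's t-test, which does not assume equal variances in the two groups; the sketch recomputes the t-statistic by hand to show that the two agree.

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only; not drawn from the real dataset.
rng = np.random.default_rng(0)
arm_a = rng.normal(loc=21.0, scale=2.0, size=200)
arm_b = rng.normal(loc=21.0, scale=3.0, size=150)

# equal_var=False selects Welch's t-test (no equal-variance assumption).
t_stat, p_val = stats.ttest_ind(arm_a, arm_b, equal_var=False)

# Recompute the statistic by hand: difference in means over its standard error.
se = np.sqrt(arm_a.var(ddof=1) / arm_a.size + arm_b.var(ddof=1) / arm_b.size)
t_manual = (arm_a.mean() - arm_b.mean()) / se

print(round(t_stat, 4), round(t_manual, 4), round(p_val, 4))
```

In a balance check, a large p-value means the test found no evidence of a difference between the groups on that variable. It does not prove that the groups are identical.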


Let's analyze the differences between the two treatment groups (k1_8_exp_lot and k1_8_lot_exp) for the female, age, and hispanic demographic variables using the objective_randomization function defined above.

In [None]:
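# Pass the variables to test (assumes these column names exist in the dataset)
objective_randomization(data, ['female', 'age', 'hispanic'])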
