In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Experimental Design Preliminaries
Building knowledge in experimental design allows you to test hypotheses with best-practice analytical tools and quantify the risk of your work. You’ll begin your journey by setting the foundations of what experimental design is and different experimental design setups such as blocking and stratification. You’ll then learn and apply visual and analytical tests for normality in experimental data.

# Non-random assignment of subjects
An agricultural firm is conducting an experiment to measure how feeding sheep different types of grass affects their weight. They have asked for your help to properly set up the experiment. One of their managers has said you can perform the subject assignment by taking the top 250 rows from the DataFrame and that should be fine.

Your task is to use your analytical skills to demonstrate why this might not be a good idea. Assign the subjects to two groups using non-random assignment (the first 250 rows) and observe the differences in descriptive statistics.

You have received the DataFrame weights, which has a column containing the weight of each sheep and a unique id column.

numpy and pandas have been imported as np and pd, respectively.

In [2]:
# Non-random assignment
# Use DataFrame slicing to put the first 250 rows of weights into group1_non_rand and the remaining into group2_non_rand.
group1_non_rand = weights.iloc[0:250,:]
group2_non_rand = weights.iloc[250:,:]

# Compare descriptive statistics of groups
# Generate descriptive statistics of the two groups and concatenate them into a single DataFrame.
compare_df_non_rand = pd.concat([group1_non_rand['weight'].describe(), group2_non_rand['weight'].describe()], axis=1)
compare_df_non_rand.columns = ['group1', 'group2']

# Print to assess
print(compare_df_non_rand)

NameError: name 'weights' is not defined

Those two datasets have a much greater difference in means. It may be that the dataset was sorted before you received it. Presenting these results to the firm will help them understand best-practice group assignment. Hopefully you can now work with them to set up the experiment properly.
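Because weights is not defined in this notebook (hence the NameError above), the effect can be reproduced on simulated data. The sketch below assumes a hypothetical 500-row stand-in, demo_weights, that was sorted by weight before it was handed over; the column names mirror the exercise, but every value is generated.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `weights`: 500 simulated sheep, sorted by weight
rng = np.random.default_rng(42)
demo_weights = pd.DataFrame({
    'id': np.arange(500),
    'weight': np.sort(rng.normal(loc=65, scale=8, size=500)),
})

# Non-random assignment: the first 250 rows versus the rest
group1_non_rand = demo_weights.iloc[:250]
group2_non_rand = demo_weights.iloc[250:]

# On sorted data, group1 holds the lightest half, so the group means diverge
mean_gap_non_rand = (
    group2_non_rand['weight'].mean() - group1_non_rand['weight'].mean()
)
print(round(mean_gap_non_rand, 2))
```

The gap appears before any treatment is applied, which is exactly the kind of pre-existing difference that would later be mistaken for a treatment effect.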

# Random assignment of subjects
Having built trust from your last work with the agricultural firm, you have been given the task of properly setting up the experiment.

Use your knowledge of best practice experimental design set up to assign the sheep to two even groups of 250 each.

In [None]:
# Randomly select 250 subjects from the weights DataFrame into a new DataFrame, group1_random, without replacement.
group1_random = weights.sample(frac=.5, random_state=42, replace=False)  # frac=.5 is equivalent to n=250 for 500 rows

# Create second assignment
group2_random = weights.drop(group1_random.index)

# Compare assignments
compare_df_random = pd.concat([group1_random['weight'].describe(), group2_random['weight'].describe()], axis=1)
compare_df_random.columns = ['group1', 'group2']
print(compare_df_random)

While there are some differences in these datasets, you can clearly see the mean of the two sets are very close. This best-practice setup will ensure the experiment is on the right path from the beginning. Let's continue building foundational experimental design skills by learning about experimental design setup.
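To check this claim without the course data, the following sketch repeats the random split on a simulated, deliberately sorted DataFrame. The name demo_weights and its values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `weights`, sorted to mimic a worst-case row order
rng = np.random.default_rng(42)
demo_weights = pd.DataFrame({
    'id': np.arange(500),
    'weight': np.sort(rng.normal(loc=65, scale=8, size=500)),
})

# Random assignment ignores the row order entirely
group1_random = demo_weights.sample(n=250, random_state=42, replace=False)
group2_random = demo_weights.drop(group1_random.index)

# The two groups should not share subjects, and their means should be close
no_overlap = group1_random.index.intersection(group2_random.index).empty
mean_gap_random = abs(
    group1_random['weight'].mean() - group2_random['weight'].mean()
)
print(no_overlap, round(mean_gap_random, 2))
```

Dropping group1_random.index from the original frame is what guarantees that each subject lands in exactly one group.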

# Your recent learnings
Recap of Experimental Design Preliminaries, chapter 1 of the course Experimental Design in Python. Here is what you covered in your last lesson:
You learned about the basics of setting up experiments and the importance of experimental design. Experimental design is crucial for making precise and objective conclusions about hypotheses. Here are the key points you covered:
•	Terminology:
o	Subjects: The entities being experimented on (e.g., people, animals).
o	Treatment Group: The group receiving the intervention.
o	Control Group: The group not receiving the intervention, often given a placebo.
•	Assignment Methods:
o	Non-Random Assignment: Splitting subjects into groups without randomization can lead to significant differences between groups, making it harder to attribute changes to the treatment.
o	Random Assignment: Using randomization to assign subjects to groups ensures that any observed changes are likely due to the treatment rather than inherent differences.
•	Practical Application:
o	You used pandas to perform both non-random and random assignments of subjects in a DataFrame and compared their descriptive statistics to observe differences.
Here's a code snippet demonstrating random assignment:

# Randomly assign half
group1_random = weights.sample(n=250, random_state=42, replace=False)

# Create second assignment
group2_random = weights.drop(group1_random.index)

# Compare assignments
compare_df_random = pd.concat([group1_random['weight'].describe(), group2_random['weight'].describe()], axis=1)
compare_df_random.columns = ['group1', 'group2']
print(compare_df_random)

This exercise highlighted the importance of random assignment in experimental design to ensure reliable and valid results.


# Blocking experimental data
You are working with a manufacturing firm that wants to conduct some experiments on worker productivity. Their dataset only contains 100 rows, so it's important that experimental groups are balanced.

This sounds like a great opportunity to use your knowledge of blocking to assist them. They have provided a productivity_subjects DataFrame. Split the provided dataset into two even groups of 50 entries each.

In [None]:
# Randomly select 50 subjects from the productivity_subjects DataFrame into a new DataFrame block_1 without replacement.
block_1 = productivity_subjects.sample(50, random_state=42, replace=False)

# Set a new column, block to 1 for the block_1 DataFrame.
block_1['block'] = 1

# Assign the remaining subjects to a DataFrame called block_2 and set the block column to 2 for this DataFrame.
block_2 = productivity_subjects.drop(block_1.index)
block_2['block'] = 2

# Concatenate the blocks together into a single DataFrame, and print the count of each value in the block column to confirm the blocking worked.
productivity_combined = pd.concat([block_1, block_2], axis=0)
print(productivity_combined['block'].value_counts())

Balanced groups are especially important when the dataset is small, as is the case here. Let's consider an example where there may be a confounding variable and use stratification to assist.
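As a minimal sketch of stratification, suppose the subjects carry a possible confounder. The DataFrame demo_subjects and its 'shift' column are hypothetical and not part of productivity_subjects; the idea is to sample half of each stratum so that both blocks contain the same mix of the confounder.

```python
import numpy as np
import pandas as pd

# Hypothetical subjects with a possible confounder, 'shift'
demo_subjects = pd.DataFrame({
    'subject_id': np.arange(100),
    'shift': np.repeat(['day', 'night'], [60, 40]),
})

# Sample half of each stratum for block 1; everyone else goes to block 2
block_1_idx = (
    demo_subjects.groupby('shift').sample(frac=0.5, random_state=42).index
)
demo_subjects['block'] = np.where(
    demo_subjects.index.isin(block_1_idx), 1, 2
)

# Each shift should be split evenly between the two blocks
balance = pd.crosstab(demo_subjects['shift'], demo_subjects['block'])
print(balance)
```

A plain random split would only balance the strata on average; sampling within each stratum balances them by construction.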

# Visual normality in an agricultural experiment
You have been contracted by an agricultural firm conducting an experiment on 50 chickens, divided into four groups, each fed a different diet. Weight measurements were taken every second day for 20 days.

You'll analyze chicken_data to assess normality, which will determine the suitability of parametric statistical tests, beginning with a visual examination of the data distribution. The necessary packages for analysis have been imported for you:

In [3]:
import seaborn as sns
import pandas as pd
from statsmodels.graphics.gofplots import qqplot
from scipy.stats.distributions import norm

In [None]:
# Plot the distribution of the chickens' weight using the kernel density estimation (KDE) to visualize normality.

sns.displot(data=chicken_data, x='weight', kind="kde")  # displot does not take a dist argument
plt.show()

In [None]:
# Create a qq plot with a standard line of the chickens' weight to assess normality visually.
qqplot(data=chicken_data["weight"], line='s')
plt.show()

In [None]:
# Subset chicken_data for a 'Time' of 2, and plot the KDE of 'weight' from subset_data to check if data is normal across time.
subset_data = chicken_data[chicken_data['Time'] == 2]

sns.displot(data=subset_data, x='weight', kind="kde")
plt.show()

Your first distribution plot looked a bit normal, but the qq plot was not aligned, and tapered off at the top and bottom. This indicates the data may have tails that affect normality. It also looked a bit more normal at the second time point. Let's confirm some of our thoughts using analytical methods.
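A qq plot can also be summarized numerically. The sketch below uses scipy.stats.probplot on simulated samples (not chicken_data) and reads off the correlation r of the least-squares line through the qq points; values close to 1 mean the points hug the line, while heavy or skewed tails pull r down.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(loc=100, scale=15, size=500)   # simulated, normal
skewed_sample = rng.exponential(scale=100, size=500)      # simulated, right-skewed

# probplot returns the qq points plus a fitted line (slope, intercept, r)
_, (_, _, r_normal) = stats.probplot(normal_sample, dist='norm')
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist='norm')

print(round(r_normal, 4), round(r_skewed, 4))
```

This is a descriptive aid rather than a formal test, so it complements the plots instead of replacing the analytical checks that follow.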

# Analytical normality in an agricultural experiment
Carrying on from your previous work, your visual inspections of the data indicate it may not be a normal dataset overall, but that the initial time point may be.

Build on your previous work by using analytical methods to determine the normality of the dataset.

# Note: when the p-value is less than alpha in a Shapiro-Wilk test, you would not assume normality.

In [6]:
import pandas as pd
from scipy.stats import shapiro
from scipy.stats import anderson

In [None]:
# Run a Shapiro-Wilk test of normality on the 'weight' column and print the test statistic and p-value.
test_statistic, p_value = shapiro(chicken_data["weight"])

print(f"p: {round(p_value, 4)} test stat: {round(test_statistic, 4)}")

<script.py> output:

    p: 0.0 test stat: 0.9154

In [None]:
# Run an Anderson-Darling test for normality and print out the test statistic, significance levels, and critical values from the returned object.
result = anderson(chicken_data['weight'], dist="norm")

print(f"Test statistic: {round(result.statistic, 4)}")
print(f"Significance Levels: {result.significance_level}")
print(f"Critical Values: {result.critical_values}")

<script.py> output:

    Test statistic: 12.5451
    Significance Levels: [15.  10.   5.   2.5  1. ]
    Critical Values: [0.572 0.651 0.781 0.911 1.084]

The critical value that matches the 5% significance level is 0.781. The Anderson-Darling test statistic (12.5451) is far larger than this critical value, so we reject the null hypothesis and conclude that the data is unlikely to have been drawn from a normal distribution.
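The comparison described above can be done in code rather than by eye. This is a sketch on a simulated, clearly non-normal sample (not chicken_data); it assumes the result object exposes statistic, critical_values, and significance_level as in the cell above.

```python
import numpy as np
from scipy.stats import anderson

rng = np.random.default_rng(2)
skewed_sample = rng.exponential(scale=100, size=500)  # simulated, non-normal

result = anderson(skewed_sample, dist='norm')

# Pick the critical value reported for the 5% significance level
idx_5pct = int(np.argmin(np.abs(result.significance_level - 5.0)))
critical_5pct = result.critical_values[idx_5pct]

# Reject normality when the statistic exceeds the critical value
reject_normality = bool(result.statistic > critical_5pct)
print(round(result.statistic, 4), critical_5pct, reject_normality)
```

Unlike the Shapiro-Wilk test, the decision here compares the statistic with a critical value, so a larger statistic means stronger evidence against normality.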

# Chapter 2: Experimental Design Techniques
Delve into sophisticated experimental design techniques, focusing on factorial designs, randomized block designs, and covariate adjustments. These methodologies are instrumental in enhancing the accuracy, efficiency, and interpretability of experimental results. Through a combination of theoretical insights and practical applications, you'll acquire the skills needed to design, implement, and analyze complex experiments in various fields of research.

# Understanding marketing campaign effectiveness
Imagine you're a digital marketer analyzing data from a recent campaign to understand what messaging style and time of day yield the highest conversions. This analysis is crucial for guiding future marketing strategies, ensuring that your messages reach potential customers when they're most likely to engage. In this exercise, you're working with a dataset giving the outcomes of different messaging styles ('Casual' versus 'Formal') and times of day ('Morning' versus 'Evening') on conversion rates, a common scenario in marketing data analysis.

The data has been loaded for you as a DataFrame named marketing_data, and pandas is loaded as pd.

In [None]:
# Create a pivot table with 'Messaging_Style' as the index and 'Time_of_Day' as the columns, computing the mean of Conversions.
marketing_pivot = marketing_data.pivot_table(
  values='Conversions', 
  index='Messaging_Style', 
  columns='Time_of_Day', 
  aggfunc='mean')

# View the pivoted results
print(marketing_pivot)


<script.py> output:

    Time_of_Day      Evening  Morning
    Messaging_Style
    Casual           402.329  401.134
    Formal           432.913  411.096

In [None]:
# Visualize interactions between Messaging_Style and Time_of_Day with respect to conversions by creating an annotated cool-warm heatmap of 
# marketing_pivot.
sns.heatmap(marketing_pivot, 
            annot=True, 
            cmap='coolwarm',
            fmt='g')

plt.show()
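The heatmap shows the interaction visually; the same information can be read off as a difference of differences. The sketch below rebuilds the pivot from the means printed above (marketing_pivot_demo is a hypothetical name) and is purely descriptive, not a significance test.

```python
import pandas as pd

# Mean conversions copied from the pivot output above
marketing_pivot_demo = pd.DataFrame(
    {'Evening': [402.329, 432.913], 'Morning': [401.134, 411.096]},
    index=pd.Index(['Casual', 'Formal'], name='Messaging_Style'),
)

# Simple effect of time of day within each messaging style
evening_minus_morning = (
    marketing_pivot_demo['Evening'] - marketing_pivot_demo['Morning']
)

# Interaction contrast: how much the time-of-day effect differs between styles
interaction = evening_minus_morning['Formal'] - evening_minus_morning['Casual']
print(evening_minus_morning.round(3))
print(round(interaction, 3))
```

With these means, the Evening minus Morning gap is about 1.2 conversions for Casual messages and about 21.8 for Formal ones, which suggests that time of day matters mainly for the Formal style.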

# Factorial designs and randomized block designs
Select the three correct statements regarding factorial designs and randomized block designs.

Factorial designs require each experimental unit to be exposed to all possible combinations of treatment levels

Randomized block designs enhance experimental precision by controlling for variability within groups of similar subjects

Factorial designs are particularly useful in complex scenarios with multiple factors influencing the outcome

# Your recent learnings
Recap of Experimental Design Techniques, chapter 2 of the course Experimental Design in Python. Here is what you covered in your last lesson:

You learned about factorial designs, which allow you to examine multiple variables simultaneously and their interactions. This method tests every possible combination of factor levels, providing insights into complex dynamics that simpler setups might miss. Here's a recap of the key points:

Factorial Designs: These designs test all combinations of factor levels to measure both direct effects and interactions. For example, in a plant growth experiment, you can test the effects of light conditions and fertilizer types simultaneously.
Pivot Tables: You created a pivot table using pandas to aggregate data. For instance, you used pivot_table to calculate the mean growth for each combination of light condition and fertilizer type.

plant_pivot = plant_data.pivot_table(
    values='Growth_cm',
    index='Light_Condition',
    columns='Fertilizer_Type',
    aggfunc='mean'
)

print(plant_pivot)

Heatmaps: You visualized interactions using Seaborn's heatmap function, which helps identify how different factors interact by showing the intensity of their effects.

Comparing Designs: Factorial designs explore multiple treatments and their interactions, while randomized block designs group similar subjects to minimize confounding impacts and enhance precision.

These techniques are crucial for designing, implementing, and analyzing complex experiments effectively.

The goal of the next lesson is to understand how to use randomized block design to control for variance and improve the precision of experimental results.

# Implementing a randomized block design
The manufacturing firm you worked with earlier is still interested in conducting some experiments on worker productivity. Previously, the two blocks were set randomly. While this can work, it can be better to group subjects based on similar characteristics.

The same employees are again loaded but this time in a DataFrame called productivity including 1200 other colleagues. It also includes a worker 'productivity_score' column based on units produced per hour. This column was binned into three groups to generate blocks based on similar productivity values. The firm would like to apply a new incentive program with three options ('Bonus', 'Profit Sharing' and 'Work from Home') throughout the firm with treatment applied randomly.

In [None]:
# Shuffle the workers within each block to create a new DataFrame called prod_df.
prod_df = productivity.groupby('block').apply(
  lambda x: x.sample(frac=1)
)

# Reset the index so that block is not both an index and a column.
prod_df = prod_df.reset_index(drop=True)

# Randomly assign the three treatment values in the 'Treatment' column.
prod_df['Treatment'] = np.random.choice(
  ['Bonus', 'Profit Sharing', 'Work from Home'],
  size=len(prod_df)
)

You've efficiently shuffled workers within their blocks, streamlined the DataFrame by resetting the index, and randomly assigned treatments, setting a solid foundation for analyzing incentive effects on productivity. Your skills in preparing data for experimental analysis are on point!
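Drawing each worker's treatment independently, as np.random.choice does, leaves the group sizes within a block to chance. One possible alternative, sketched here on simulated data, is to deal out the treatments evenly within each block and then shuffle them. The DataFrame demo and the helper assign_balanced are hypothetical and not part of the exercise.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
treatments = ['Bonus', 'Profit Sharing', 'Work from Home']

# Hypothetical workers already binned into three blocks of unequal size
demo = pd.DataFrame({
    'block': np.repeat([1, 2, 3], [40, 41, 39]),
    'productivity_score': rng.normal(50, 10, size=120),
})

def assign_balanced(block_df):
    """Cycle the treatment labels to the block's length, then shuffle them."""
    labels = np.resize(treatments, len(block_df))
    return pd.Series(rng.permutation(labels), index=block_df.index)

# Assign within each block; the index alignment puts labels on the right rows
demo['Treatment'] = pd.concat(
    [assign_balanced(block_df) for _, block_df in demo.groupby('block')]
)

counts = pd.crosstab(demo['block'], demo['Treatment'])
print(counts)
```

With this approach the treatment counts inside a block differ by at most one, while which worker receives which treatment is still random.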

# Visualizing productivity within blocks by incentive
Continuing with the worker productivity example, you'll explore whether the productivity scores are distributed throughout the data as one would expect with random assignment of treatment. Note that this is a precautionary step: the treatments have not been applied yet, so there are no follow-up results on their impact.

seaborn and matplotlib.pyplot as sns and plt respectively are loaded.

In [None]:
# Visualize the productivity scores within blocks by treatment using a boxplot with 'block' for x, 'productivity_score' for y, and 'Treatment' for hue.
sns.boxplot(x='block', 
            y='productivity_score', 
            hue='Treatment', 
            data=prod_df)

plt.show()

You've successfully created a visualization that illustrates how the 'productivity_score' varies within different blocks, with the additional layer of treatment differentiation. Notice that the 'productivity_score' values vary greatly across blocks—that's how you set up the blocks to start! You, therefore, won't test for variability across blocks, but will you see significant variability within blocks? Time to find out!

# ANOVA within blocks of employees
Building on your previous analyses with the manufacturing firm, where worker productivity was examined across different blocks and an incentive program was introduced, you're now delving deeper into the data. The firm, equipped with a more comprehensive dataset in the productivity DataFrame, including 1200 additional employees and their productivity_score, has structured the workforce into three blocks based on productivity levels. Each employee has been randomly assigned one of three incentive options: 'Bonus', 'Profit Sharing', or 'Work from Home'.

Before assessing the full impact of these incentive treatments on productivity, it's crucial to verify that the initial treatment assignment was indeed random and equitable across the different productivity blocks. This step ensures that any observed differences in productivity post-treatment can be confidently attributed to the incentive programs themselves, rather than pre-existing disparities in the blocks.

In [1]:
from scipy.stats import f_oneway

In [None]:
# Group prod_df by the appropriate column that represents different blocks in your data.
within_block_anova = prod_df.groupby('block').apply(
# Use a lambda function to apply the ANOVA test within each block, specifying the lambda function's argument.
# For each treatment group within the blocks, filter prod_df based on the 'Treatment' column values and select the 'productivity_score' column.
  lambda x: f_oneway(
    # Filter Treatment values based on outcome
    x[x['Treatment'] == 'Bonus']['productivity_score'], 
    x[x['Treatment'] == 'Profit Sharing']['productivity_score'],
    x[x['Treatment'] == 'Work from Home']['productivity_score'])
)
print(within_block_anova)


<script.py> output:

    block
    1    (1.009243027139357, 0.36539191608501026)
    2    (0.09964675193646039, 0.905177667445047)
    3    (0.2983940138555918, 0.7421589537946647)
    dtype: object

You've conducted an ANOVA within each block, comparing the productivity scores for the three treatment groups. Each of the three p-values is large, so there is no evidence of pre-existing differences between the treatment groups within any block, and you can feel confident in how this randomized block design is set up as an experiment.

# Importance of covariates
Why is it important to include covariates in statistical analyses?

To account for potential variability and reduce confounding in the analysis. (Covariates help refine your analysis by accounting for additional variability and reducing confounding effects.)

# Covariate adjustment with chick growth
Imagine you are an agricultural scientist studying the growth patterns of chicks under various dietary regimens. The data from this study sheds light on the relationship between diet and weight. It includes weight measurements of chicks at different ages, allowing for an exploration of covariate adjustment. Age serves as a covariate, potentially influencing the outcome variable: the weight of the chicks.

DataFrames exp_chick_data, the experimental data, and cov_chick_data, the covariate data, have been loaded, along with the following libraries:

In [2]:
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Join the experimental and covariate data based on common column(s), and print this merged data.
merged_chick_data = pd.merge(exp_chick_data, 
                            cov_chick_data, on='Chick')

# Print the merged data
print(merged_chick_data)

In [None]:
# Perform ANCOVA with Diet and Time as predictors(Produce an ANCOVA predicting 'weight' based on 'Diet' and 'Time')
model = ols('weight ~ Diet + Time', data=merged_chick_data).fit()

# Print a summary of the ANCOVA model
print(model.summary())


<script.py> output:
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                 weight   R-squared:                       0.040
    Model:                            OLS   Adj. R-squared:                  0.039
    Method:                 Least Squares   F-statistic:                     140.9
    Date:                Sun, 16 Nov 2025   Prob (F-statistic):           1.12e-60
    Time:                        05:09:32   Log-Likelihood:                -38608.
    No. Observations:                6818   AIC:                         7.722e+04
    Df Residuals:                    6815   BIC:                         7.724e+04
    Df Model:                           2                                         
    Covariance Type:            nonrobust                                         
    ==============================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
    ------------------------------------------------------------------------------
    Intercept     94.0678      2.275     41.342      0.000      89.607      98.528
    Diet          12.2022      0.729     16.747      0.000      10.774      13.631
    Time           0.1238      0.125      0.990      0.322      -0.121       0.369
    ==============================================================================
    Omnibus:                      694.766   Durbin-Watson:                   0.055
    Prob(Omnibus):                  0.000   Jarque-Bera (JB):              922.241
    Skew:                           0.886   Prob(JB):                    5.47e-201
    Kurtosis:                       3.326   Cond. No.                         35.8
    ==============================================================================
    
    Notes:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
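The summary above reports a single Diet coefficient, which suggests the formula treated Diet as a number rather than as a set of diet labels. If the diets are categories, one option is to wrap the term in C() so that each diet gets its own coefficient relative to a baseline. The sketch below uses simulated data (demo_chicks and its effect sizes are assumptions, not the course data).

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

rng = np.random.default_rng(3)
n = 200

# Hypothetical chicks: Diet holds labels 1 to 4, Time is age in days
demo_chicks = pd.DataFrame({
    'Diet': rng.integers(1, 5, size=n),
    'Time': rng.integers(0, 22, size=n),
})
diet_effect = demo_chicks['Diet'].map({1: 0.0, 2: 10.0, 3: 25.0, 4: 15.0})
demo_chicks['weight'] = (
    40 + 8 * demo_chicks['Time'] + diet_effect + rng.normal(0, 5, size=n)
)

# C(Diet) dummy-codes Diet instead of fitting one slope across the labels
cat_model = ols('weight ~ C(Diet) + Time', data=demo_chicks).fit()
print(cat_model.params.round(2))
```

In this simulated setup the diet effects are not ordered by label, so a single numeric slope would misrepresent them, whereas the dummy-coded model estimates each diet separately while still adjusting for Time.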

In [None]:
# Design an lmplot to see hue='Diet' effects on y='weight' adjusted for x='Time'.(Visualize Diet effects with Time adjustment)
sns.lmplot(x='Time', y='weight', 
         hue='Diet', 
         data=merged_chick_data)
plt.show()

# Your recent learnings
Recap of Experimental Design Techniques, chapter 2 of the course Experimental Design in Python. Here is what you covered in your last lesson:

You learned about covariate adjustment in experimental design and its importance in minimizing confounding effects. Covariates are variables related to the outcome variable but not of primary interest. By including them in the analysis, you can isolate the effect of the independent variable on the outcome. Here are the key points you covered:

Covariates: These are additional variables that can influence the outcome variable. They are included in the analysis to control for their effects and reduce confounding.

ANCOVA (Analysis of Covariance): This technique evaluates treatment effects while controlling for covariates. It helps in isolating the true effect of the independent variable on the dependent variable.

Combining DataFrames: You used pandas' merge function to combine experimental data with covariate data, ensuring each subject's data is aligned.

Modeling with ANCOVA: You employed the ols model from statsmodels to adjust for covariates. For example:

model = ols('Growth_cm ~ Fertilizer_Type + Watering_Days_Per_Week', data=exp_data).fit()
print(model.summary())

Interpreting Results: You learned to interpret the summary output, focusing on p-values to determine the significance of covariates and treatment effects.
Armed with this understanding, you're now ready to apply covariate adjustment in your own analyses.

The goal of the next lesson is to understand how to interpret the results of different statistical tests to make informed decisions based on data analysis.

# Choosing the right test: petrochemicals
In a chemistry research lab, scientists are examining the efficiency of three well-known catalysts—Palladium (Pd), Platinum (Pt), and Nickel (Ni)—in facilitating a particular reaction. Each catalyst is used in a set of identical reactions under controlled conditions, and the time taken for each reaction to reach completion is meticulously recorded. Your goal is to compare the mean reaction times across the three catalyst groups to identify which catalyst, if any, has a significantly different reaction time.

The data is available in the chemical_reactions DataFrame. pandas as pd, numpy as np, and the following functions have been loaded as well:

What type of hypothesis test should be performed in this scenario?

One-way ANOVA


In [2]:
from scipy.stats import ttest_ind
from scipy.stats import f_oneway
from scipy.stats import chi2_contingency

In [None]:
catalyst_types = ['Palladium', 'Platinum', 'Nickel']

# Use a list comprehension to filter into groups iterating over the catalyst_types and each of their 'Reaction_Time's.
groups = [chemical_reactions[chemical_reactions['Catalyst'] == catalyst]['Reaction_Time'] for catalyst in catalyst_types]

In [None]:
# Perform the one-way ANOVA across the three groups
f_stat, p_val = f_oneway(*groups)
print(p_val)

<script.py> output:

    4.710677600047866e-151

Assume a significance level of 0.01. What is the appropriate conclusion to glean from the P-value in comparison with this alpha value?

The P-value is substantially smaller than the alpha value, indicating a significant difference in reaction times across the catalysts.
The extremely small P-value strongly suggests significant differences among the catalysts.
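To see what f_oneway computes, the F statistic can be rebuilt by hand as the ratio of between-group to within-group variability. The sketch below uses simulated reaction times for three hypothetical groups rather than chemical_reactions.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(4)

# Simulated reaction times for three hypothetical catalyst groups
groups_demo = [
    rng.normal(loc=30, scale=3, size=40),
    rng.normal(loc=32, scale=3, size=40),
    rng.normal(loc=36, scale=3, size=40),
]

all_values = np.concatenate(groups_demo)
grand_mean = all_values.mean()
k, n_total = len(groups_demo), all_values.size

# Between-group and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups_demo)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups_demo)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_scipy = f_oneway(*groups_demo)
print(round(f_manual, 4), round(f_scipy, 4))
```

A large F means the group means are spread out relative to the noise inside each group, which is what drives the p-value toward zero.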

# Choosing the right test: human resources
In human resources, it's essential to understand the relationships between different variables that might influence employee satisfaction or turnover. Consider a scenario where an HR department is interested in understanding the association between the department in which employees work and their participation in a new workplace wellness program. The HR team has compiled this data over the past two years and has asked you if there's any significant association between an employee's department and their enrolling in the wellness program.

The data is available in the hr_wellness DataFrame. pandas as pd, numpy as np, and the following functions have been loaded:

What type of hypothesis test should be performed in this scenario?

Chi-square test of association

In [4]:
from scipy.stats import ttest_ind
from scipy.stats import f_oneway
from scipy.stats import chi2_contingency

In [None]:
# Create a contingency table comparing 'Department' and 'Wellness_Program_Status'.
contingency_table = pd.crosstab(
  hr_wellness['Department'], 
  hr_wellness['Wellness_Program_Status']
)

In [None]:
# Perform a chi-square test of association on the contingency table and print the p-value.
chi2_stat, p_val, dof, expected = chi2_contingency(contingency_table)
print(p_val)

<script.py> output:

    0.17573344450112738

Assume a significance level of 0.05. Given the P-value, what is the appropriate conclusion?

answer:

There's no significant association between department and enrollment in the wellness program, as the P-value is larger than 0.05.
The P-value being greater than 0.05 suggests no significant association between the variables.
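The chi-square test compares the observed counts with the counts expected if department and enrollment were independent. The sketch below uses a small hypothetical table (not hr_wellness) to show where those expected counts come from.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are departments, columns are enrollment status
table_demo = pd.DataFrame(
    [[30, 20], [25, 25], [20, 30]],
    index=['Engineering', 'Marketing', 'Sales'],
    columns=['Enrolled', 'Not Enrolled'],
)

chi2_stat, p_val, dof, expected = chi2_contingency(table_demo)

# Expected count under independence = row total * column total / grand total
row_totals = table_demo.sum(axis=1).to_numpy()
col_totals = table_demo.sum(axis=0).to_numpy()
expected_manual = np.outer(row_totals, col_totals) / table_demo.to_numpy().sum()

print(expected_manual)
print(round(chi2_stat, 4), dof)
```

The degrees of freedom equal (rows - 1) * (columns - 1), so a 3 by 2 table gives 2.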

# Choosing the right test: finance
In the realm of finance, investment strategists are continually evaluating different approaches to maximize returns. Consider a scenario where a financial firm wishes to assess the effectiveness of two investment strategies: "Quantitative Analysis" and "Fundamental Analysis". The firm has applied each strategy to a separate set of investment portfolios for a year and now asks you to compare the annual returns to determine if there is any difference in strategy returns by comparing the mean returns of the two groups.

The data is available in the investment_returns DataFrame. pandas as pd, numpy as np, and the following functions have been loaded as well:


What type of hypothesis test should be performed in this scenario?

answers:

Independent samples t-test

In [1]:
from scipy.stats import ttest_ind
from scipy.stats import f_oneway
from scipy.stats import chi2_contingency

In [None]:
# Filter 'Strategy_Type' on 'Quantitative' to retrieve their 'Annual_Return' and do the same for 'Fundamental' strategies.
quantitative_returns = investment_returns[investment_returns['Strategy_Type'] == 'Quantitative']['Annual_Return']
fundamental_returns = investment_returns[investment_returns['Strategy_Type'] == 'Fundamental']['Annual_Return']

print(quantitative_returns, fundamental_returns)

In [None]:
# Perform the independent samples t-test between the two groups(Complete for the two groups an independent samples t-test and print the p-value.)
t_stat, p_val = ttest_ind(quantitative_returns, fundamental_returns)
print(t_stat,p_val)

Assume a significance level of 0.1. What is the appropriate conclusion to glean from the P-value in comparison with this alpha value?

<script.py> output:

    7.784788496693728 2.0567003424807146e-14

answers:

The P-value is much smaller than alpha, suggesting a significant difference in returns between the two strategies.
Given the very small p-value of around 2.06e-14, we have evidence of a difference in returns for any reasonable choice of alpha.
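By default, ttest_ind assumes the two groups share the same variance. When that is doubtful, a common alternative is Welch's t-test, requested with equal_var=False. The sketch below uses simulated returns for two hypothetical strategies (not investment_returns) and rebuilds the statistic by hand.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(5)

# Simulated annual returns (%) for two hypothetical strategies, unequal spread
returns_a = rng.normal(loc=8.0, scale=2.0, size=60)
returns_b = rng.normal(loc=6.0, scale=5.0, size=80)

# equal_var=False requests Welch's t-test
t_welch, p_welch = ttest_ind(returns_a, returns_b, equal_var=False)

# Welch statistic by hand: mean difference over the unpooled standard error
se = np.sqrt(
    returns_a.var(ddof=1) / returns_a.size
    + returns_b.var(ddof=1) / returns_b.size
)
t_manual = (returns_a.mean() - returns_b.mean()) / se
print(round(t_welch, 4), round(t_manual, 4))
```

Whether Welch's version is appropriate for the exercise data is a judgment call that depends on how similar the two groups' variances are.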

# Post hoc analysis following ANOVA

# Anxiety treatments ANOVA
Psychologists conducted a study to compare the effectiveness of three types of therapy on reducing anxiety levels: Cognitive Behavioral Therapy (CBT), Dialectical Behavior Therapy (DBT), and Acceptance and Commitment Therapy (ACT). Participants were randomly assigned to one of the three therapy groups, and their anxiety levels were measured before and after the therapy sessions. The psychologists have asked you to determine if there are any significant differences in the effectiveness of these therapies.

The therapy_outcomes DataFrame containing this experiment data has been loaded along with pandas as pd and from scipy.stats import f_oneway

In [None]:
# Pivot to view the mean anxiety reduction for each therapy (Create a pivot table to calculate the mean 'Anxiety_Reduction' value across groups of 'Therapy_Type' in this data)
pivot_table = therapy_outcomes.pivot_table(
    values='Anxiety_Reduction', 
    index='Therapy_Type', 
    aggfunc="mean")
print(pivot_table)

In [None]:
# Filter groups of therapy types and their 'Anxiety_Reduction' values by first creating a list of the three therapy types: 'CBT', 'DBT', and 'ACT'.
# (Create groups to prepare the data for ANOVA)
therapy_types = ['CBT', 'DBT', 'ACT']
groups = [therapy_outcomes[therapy_outcomes['Therapy_Type'] == therapy]['Anxiety_Reduction'] for therapy in therapy_types]

# Conduct a one-way ANOVA
f_stat, p_val = f_oneway(*groups)
print(p_val)


<script.py> output:

                  Anxiety_Reduction
    Therapy_Type
    ACT                      14.929
    CBT                      14.962
    DBT                      15.729
    0.019580062979016804

By analyzing the data with ANOVA, you've taken an important step in comparing the effectiveness of different therapies. Assuming an alpha of 0.05, the P-value indicates significant differences in therapy effectiveness.

# Applying Tukey's HSD
Following the ANOVA analysis which suggested significant differences in the effectiveness of the three types of therapy, the psychologists are keen to delve deeper. They wish for you to explain exactly which therapy types differ from each other in terms of reducing anxiety levels. This is where Tukey's Honest Significant Difference (HSD) test comes into play. It's a post-hoc test used to make pairwise comparisons between group means after an ANOVA has shown a significant difference. Tukey's HSD test helps in identifying specific pairs of groups that have significant differences in their means.

The therapy_outcomes DataFrame containing this experiment data has again been loaded along with pandas as pd and from statsmodels.stats.multicomp import pairwise_tukeyhsd.

In [None]:
# At a significance level of 0.05, perform Tukey's HSD test to compare the mean anxiety reduction across the three therapy groups.
tukey_results = pairwise_tukeyhsd(
    therapy_outcomes['Anxiety_Reduction'], 
    therapy_outcomes['Therapy_Type'], 
    alpha=0.05
)
print(tukey_results)

<script.py> output:
    Multiple Comparison of Means - Tukey HSD, FWER=0.05
    ===================================================
    group1 group2 meandiff p-adj   lower  upper  reject
    ---------------------------------------------------
       ACT    CBT    0.033 0.9941 -0.7136 0.7795  False
       ACT    DBT   0.8001 0.0358  0.0418 1.5583   True
       CBT    DBT   0.7671 0.0433  0.0181 1.5161   True
    ---------------------------------------------------


The Tukey HSD test provided clear insights into which therapy types significantly differ in reducing anxiety: DBT differs from both ACT and CBT, while ACT and CBT don't differ significantly in this experiment. These findings can guide psychologists in refining treatment approaches.
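
If you need the pairwise results programmatically rather than as a printed table, the result object exposes them as arrays. The following is a minimal sketch on synthetic data (the DataFrame `demo`, its group means, and the seed are assumptions, not the course dataset). It relies on the pairs being reported in the order of combinations of the sorted group labels, which matches the table printed above.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in for therapy_outcomes (assumed values)
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    "Therapy_Type": np.repeat(["ACT", "CBT", "DBT"], 100),
    "Anxiety_Reduction": np.concatenate([
        rng.normal(15.0, 3.0, 100),
        rng.normal(15.0, 3.0, 100),
        rng.normal(16.5, 3.0, 100),
    ]),
})

tukey_demo = pairwise_tukeyhsd(
    demo["Anxiety_Reduction"], demo["Therapy_Type"], alpha=0.05
)

# Pair each comparison with its mean difference (group2 - group1) and decision
pairs = list(combinations(tukey_demo.groupsunique, 2))
for (g1, g2), diff, reject in zip(pairs, tukey_demo.meandiffs, tukey_demo.reject):
    print(f"{g1} vs {g2}: meandiff={diff:.3f}, reject={reject}")
```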

# Applying Bonferroni correction
After identifying significant differences between therapy groups with Tukey's HSD, we want to confirm our findings with the Bonferroni correction. The Bonferroni correction is a conservative statistical adjustment used to counteract the problem of multiple comparisons. It reduces the chance of obtaining false-positive results by adjusting the significance level. In the context of your study on the effectiveness of CBT, DBT, and ACT, applying the Bonferroni correction will help ensure that the significant differences you observe between therapy groups are not due to chance.

The therapy_outcomes DataFrame has again been loaded along with pandas as pd, from scipy.stats import ttest_ind, and from statsmodels.sandbox.stats.multicomp import multipletests.

In [None]:
# Conduct independent t-tests between all pairs of therapy groups in therapy_pairs and append the p-values (p_val) to the p_values list.
p_values = []

therapy_pairs = [('CBT', 'DBT'), ('CBT', 'ACT'), ('DBT', 'ACT')]

for pair in therapy_pairs:
    # Select the anxiety reduction values for each therapy in the pair
    group1 = therapy_outcomes[therapy_outcomes['Therapy_Type'] == pair[0]]['Anxiety_Reduction']
    group2 = therapy_outcomes[therapy_outcomes['Therapy_Type'] == pair[1]]['Anxiety_Reduction']
    t_stat, p_val = ttest_ind(group1, group2)
    p_values.append(p_val)

# Apply the Bonferroni correction to adjust the p-values from the multiple tests and print them.
print(multipletests(p_values, alpha=0.05, method='bonferroni')[1])

You've adeptly applied the Bonferroni correction to adjust the P-values for multiple comparisons. This step is critical to control for Type I error, ensuring the reliability of your findings. Here, you again see that ACT and CBT don't differ significantly in this experiment, given their corrected P-value of 1.
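
The adjustment itself is simple arithmetic: each raw p-value is multiplied by the number of tests and capped at 1, which is why a large raw p-value ends up as exactly 1. A minimal sketch with hypothetical raw p-values (not the values from the therapy data) is shown below. It imports multipletests from statsmodels.stats.multitest, the documented home of the same function that the exercise loads from the sandbox module.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Hypothetical raw p-values from three pairwise tests (illustrative only)
raw_p = np.array([0.012, 0.40, 0.03])

# Bonferroni by hand: multiply by the number of tests, cap at 1
manual = np.minimum(raw_p * raw_p.size, 1.0)

reject, adjusted, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

print("manual:     ", manual)
print("statsmodels:", adjusted)
print("reject:     ", reject)
```

Only the first comparison survives the correction here, because 0.012 * 3 = 0.036 is still below 0.05 while 0.03 * 3 = 0.09 is not.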

# Your recent learnings
When you left 20 hours ago, you worked on Analyzing Experimental Data: Statistical Tests and Power, chapter 3 of the course Experimental Design in Python. Here is what you covered in your last lesson:

You learned about post-hoc analysis following ANOVA, which helps identify specific differences between groups after ANOVA indicates significant differences. Here's a recap of the key points:

Post-hoc Analysis: Essential for understanding pairwise differences after ANOVA.

Tukey's HSD: Robust for multiple comparisons, useful for broader comparisons.
Bonferroni Correction: Adjusts p-values to control for Type I errors, ideal for focused tests.
Practical Application:

ANOVA: Used to assess significant differences in Click_Through_Rates among different Ad campaigns.
Tukey's HSD Test:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

tukey_results = pairwise_tukeyhsd(
    therapy_outcomes['Anxiety_Reduction'], 
    therapy_outcomes['Therapy_Type'], 
    alpha=0.05
)

print(tukey_results)
Bonferroni Correction:
from scipy.stats import ttest_ind
from statsmodels.sandbox.stats.multicomp import multipletests

p_values = []
therapy_pairs = [('CBT', 'DBT'), ('CBT', 'ACT'), ('DBT', 'ACT')]

for pair in therapy_pairs:
    group1 = therapy_outcomes[therapy_outcomes['Therapy_Type'] == pair[0]]['Anxiety_Reduction']
    group2 = therapy_outcomes[therapy_outcomes['Therapy_Type'] == pair[1]]['Anxiety_Reduction']
    t_stat, p_val = ttest_ind(group1, group2)
    p_values.append(p_val)

print(multipletests(p_values, alpha=0.05, method='bonferroni')[1])
These methods ensure you can accurately identify and confirm significant differences between groups in your data.

The goal of the next lesson is to learn how to conduct power analysis to determine the sample size needed for detecting a significant effect in your experiments.

# Analyzing toy durability
In product development within the toy industry, it's crucial to understand the durability of toys, particularly when comparing educational toys to recreational ones. Durability can significantly impact customer satisfaction and repeat business. Researchers in a toy manufacturing company have asked you to conduct the analysis of a study comparing the durability of educational toys versus recreational toys. The toy_durability DataFrame contains the results of these tests, with durability scores assigned based on rigorous testing protocols.

The data is available in the toy_durability DataFrame. pandas as pd and from scipy.stats import ttest_ind have been loaded.

In [None]:
# Calculate the mean 'Durability_Score' for both 'Educational' and 'Recreational' toys using a pivot table.
mean_durability = toy_durability.pivot_table(
  values='Durability_Score', index='Toy_Type', aggfunc="mean")
print(mean_durability)

# Perform an independent samples t-test to compare the durability of 'Educational' and 'Recreational' toys by first separating durability scores by Toy_Type.
educational_durability = toy_durability[toy_durability['Toy_Type'] == 'Educational']['Durability_Score']
recreational_durability = toy_durability[toy_durability['Toy_Type'] == 'Recreational']['Durability_Score']
t_stat, p_val = ttest_ind(recreational_durability, educational_durability)

print(p_val)

The P-value suggests that there's a statistically significant difference in durability between 'Educational' and 'Recreational' toys, assuming an alpha of 0.05. This insight could be crucial for product development and marketing strategies.

# Visualizing durability differences
Following the analysis of toy durability, the research team would like you to visualize the distribution of durability scores for both Educational and Recreational toys. Such visualizations can offer intuitive insights into the data, potentially highlighting the range and variability of scores within each category. This step is essential for presenting findings to non-technical stakeholders and guiding further product development decisions.

The data is available in the toy_durability DataFrame, and seaborn and matplotlib.pyplot as sns and plt respectively are loaded.

In [None]:
# Visualize the distribution of 'Durability_Score' for Educational and Recreational toys using a Kernel Density Estimate (KDE) plot,
# highlighting differences by using the 'Toy_Type' column to color the distributions differently.
sns.displot(data=toy_durability, x="Durability_Score", 
         hue="Toy_Type", kind="kde")
plt.title('Durability Score Distribution by Toy Type')
plt.xlabel('Durability Score')
plt.ylabel('Density')
plt.show()

The KDE plot visually illustrates the differences in durability between Educational and Recreational toys. You can see that the center of both distributions is near 80 for the durability score, but Recreational seems more variable than Educational.
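
Because the plot suggests the two toy types may not share the same variance, one option worth considering is Welch's t-test, which drops the equal-variance assumption that ttest_ind makes by default. The sketch below runs both versions on synthetic scores. The means, spreads, sample sizes, and seed are assumptions for illustration, not values from toy_durability.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic scores: same center, different spread and group size (assumed)
rng = np.random.default_rng(2)
educational = rng.normal(loc=80, scale=4, size=150)
recreational = rng.normal(loc=80, scale=9, size=60)

# Student's t-test pools the variances (the default, equal_var=True)
t_student, p_student = ttest_ind(recreational, educational)

# Welch's t-test estimates each group's variance separately
t_welch, p_welch = ttest_ind(recreational, educational, equal_var=False)

print("Student: t =", t_student, "p =", p_student)
print("Welch:   t =", t_welch, "p =", p_welch)
```

When group sizes and variances both differ, the two versions can disagree noticeably, so it is worth checking which assumption fits the data before reporting a p-value.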

Alpha levels set the probability threshold for rejecting the null hypothesis, reflecting the risk of committing a Type I error.

# Effect size purpose
What is the primary purpose of estimating effect size (such as Cohen's d) in the context of power analysis?

To quantify how big the expected difference or relationship is, so you can determine how many participants (sample size) you need to reliably detect that effect.
Why effect size is essential in power analysis

Power analysis links four quantities; fixing any three lets you solve for the fourth:

Effect size (e.g., Cohen’s d)

Sample size (N)

Alpha level (e.g., 0.05)

Power (usually 0.80)

If you don’t know the effect size, you cannot determine:

how large your sample should be

how likely you are to detect a real effect

how strong the difference between groups is

Effect size tells you how big the difference is expected to be, which then directly determines how much data you need.
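
To see how strongly the expected effect size drives the required sample size, the sketch below solves for the per-group n at Cohen's conventional small, medium, and large effect sizes. The alpha of 0.05 and power of 0.80 are the typical values mentioned above; the dictionary name `required` is just an illustrative choice.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Per-group sample size for small, medium, and large effects
required = {}
for d in (0.2, 0.5, 0.8):
    required[d] = analysis.solve_power(
        effect_size=d, alpha=0.05, power=0.8, ratio=1
    )
    print(f"d = {d}: about {required[d]:.1f} participants per group")
```

The smaller the effect you expect, the more participants you need, and the growth is steep: halving the effect size roughly quadruples the required sample.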

# Estimating required sample size for energy study
In the energy sector, researchers are often tasked with evaluating the effectiveness of new technologies or initiatives to enhance energy efficiency or reduce consumption. A study is being designed to compare the impact of two energy-saving measures: "Smart Thermostats" and "LED Lighting". To ensure the study has sufficient power to detect a meaningful difference in energy savings between these two measures, you'll conduct a power analysis.

In [None]:
import pandas as pd
import numpy as np
from statsmodels.stats.power import TTestIndPower

In [None]:
# Instantiate a TTestIndPower object
power_analysis = TTestIndPower()

# Conduct the power analysis to estimate the required sample size for each group (Smart Thermostats and LED Lighting) to achieve a power of 0.9, assuming a moderate effect size (Cohen's d = 0.5), an alpha of 0.05, and equal-sized groups (ratio=1).
required_n = power_analysis.solve_power(
    effect_size=0.5, 
    alpha=0.05, 
    power=0.9, 
    ratio=1)

print(required_n)

Excellent! By conducting a power analysis, you've determined that approximately 85 participants are required in each group to achieve a power of 0.9, assuming a Cohen's d effect size of 0.5. This information is crucial for planning a sufficiently powered study to compare the energy-saving effectiveness of Smart Thermostats versus LED Lighting.
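
Since solve_power returns a fractional sample size, a common follow-up is to round up to a whole number of participants and confirm the power that the rounded design actually achieves. A minimal sketch of that check follows; the variable names are illustrative, and the conversion through NumPy is only a precaution in case the solver returns a one-element array rather than a float.

```python
import math

import numpy as np
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

required_n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.9, ratio=1)

# Coerce to a plain float, then round up to whole participants per group
required_n = float(np.asarray(required_n).ravel()[0])
n_per_group = math.ceil(required_n)

# Power actually achieved with the rounded-up group size
achieved_power = analysis.power(
    effect_size=0.5, nobs1=n_per_group, alpha=0.05, ratio=1
)

print("fractional n:", required_n)
print("n per group: ", n_per_group)
print("achieved power:", achieved_power)
```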

# Your recent learnings
When you left 22 hours ago, you worked on Analyzing Experimental Data: Statistical Tests and Power, chapter 3 of the course Experimental Design in Python. Here is what you covered in your last lesson:

You learned about power analysis, focusing on understanding effect size and its influence on sample size. Effect size quantifies the magnitude of the difference between groups, beyond just statistical significance. Cohen's d is a common measure, calculated as the difference in means divided by a pooled standard deviation.

Key points covered:

Effect Size: Quantifies the magnitude of the difference between groups. For example, Cohen's d is calculated as:
def cohen_d(group1, group2):
    diff_means = np.mean(group1) - np.mean(group2)
    pooled_std = np.sqrt((np.var(group1) + np.var(group2)) / 2)
    return diff_means / pooled_std

Power Analysis: Determines the probability that a test will correctly reject a false null hypothesis (avoiding Type II errors). Power is 1 minus beta (Type II error rate).

Sample Size Calculation: Helps determine the necessary sample size to achieve a desired power level. For example, using TTestIndPower to calculate required sample size:

from statsmodels.stats.power import TTestIndPower

power_analysis = TTestIndPower()

required_n = power_analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.9, ratio=1)

print(required_n)

Balancing Power and Sample Size: Larger sample sizes increase the power of an experiment, enhancing the likelihood of detecting true effects.
You also applied these concepts to a video game study, calculating the necessary sample size to achieve 99% power with an assumed effect size.
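
The cohen_d function in the recap averages the two variances and uses np.var with its default ddof=0, which is fine for equal, reasonably large groups. As a hedged alternative, here is a sketch of the textbook pooled standard deviation that uses sample variances (ddof=1) and weights them by degrees of freedom, so it also handles unequal group sizes. The function name `cohen_d_pooled` and the tiny example arrays are illustrative assumptions.

```python
import numpy as np

def cohen_d_pooled(group1, group2):
    """Cohen's d using a degrees-of-freedom-weighted pooled standard deviation."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = g1.size, g2.size

    # Pool the sample variances, weighting each by its degrees of freedom
    pooled_var = (
        (n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1)
    ) / (n1 + n2 - 2)

    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

# Tiny worked example: both sample variances are 4, the means differ by 1
d_example = cohen_d_pooled([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(d_example)  # 0.5
```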

The goal of the next lesson is to learn how to integrate and analyze data from multiple sources to draw meaningful conclusions.

# Visualizing loan approval yield
In the realm of financial services, understanding the factors that influence loan approval rates is crucial for both lenders and borrowers. A financial institution has conducted a study and collected data on loan applications, detailing the amount requested, the applicant's credit score, employment status, and the ultimate yield of the approval process. This rich dataset offers a window into the nuanced dynamics at play in loan decision-making. You have been asked to dive into the loan_approval_yield dataset to understand how loan amounts and credit scores influence approval yields.

The loan_approval_yield DataFrame, seaborn as sns, and matplotlib.pyplot as plt have been loaded for you.

In [None]:
# Use Seaborn to create a side-by-side bar graph, setting the x-axis to 'LoanAmount', the y-axis to 'ApprovalYield', and differentiating the bars with hues for 'CreditScore'.
sns.catplot(x="LoanAmount", 
            y="ApprovalYield", 
            hue="CreditScore", 
            kind="bar", 
            data=loan_approval_yield)
plt.title("Loan Approval Yield by Amount and Credit Score")
plt.show()

What does the analysis of approval yields across different credit scores and loan amounts reveal?

Answer

The data shows that Poor credit scores tend to have similar approval yields across various loan amounts, while Good credit scores exhibit more variability, reflecting different lending criteria based on the loan size.

# Exploring customer satisfaction

Merging datasets is a crucial skill in data analysis, especially when dealing with related data from different sources. You're working on a project for a financial institution to understand the relationship between loan approval rates and customer satisfaction. Two separate studies have been conducted: one focusing on loan approval yield based on various factors, and another on customer satisfaction under different conditions. Your task is to analyze how approval yield correlates with customer satisfaction, considering another variable such as interest rates.

The loan_approval_yield and customer_satisfaction DataFrames, pandas as pd, numpy as np, seaborn as sns, and matplotlib.pyplot as plt have been loaded for you.

In [None]:
# Merge loan_approval_yield with customer_satisfaction datasets
merged_data = pd.merge(loan_approval_yield, 
                      customer_satisfaction, 
                      on='ApplicationID')

# Use Seaborn to create a scatter plot to compare 'SatisfactionQuality' versus 'ApprovalYield', coloring the points by 'InterestRate'.
sns.relplot(x="ApprovalYield", 
            y="SatisfactionQuality", 
            hue="InterestRate", 
            kind="scatter", 
            data=merged_data)
plt.title("Satisfaction Quality by Approval Yield and Interest Rate")
plt.show()

What does the scatterplot of Customer Satisfaction versus Approval Yield, including Interest Rate as a variable, indicate about their relationship in the experimental data?

Answer

There isn't a strong relationship between Customer Satisfaction and Approval Yield in this experimental data. The resulting scatterplot looks similar to white noise scattered all about even when including Interest Rate.
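
A visual impression of "white noise" can be backed up with a number. The sketch below merges two synthetic tables on a shared key and computes the Pearson correlation between the two measures. The DataFrames `loans` and `satisfaction` are made-up stand-ins that reuse the column names from the exercise, and the `how` and `validate` arguments are optional safeguards added here, not part of the original solution.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins with independent values (assumed, not the course data)
rng = np.random.default_rng(3)
n = 200
loans = pd.DataFrame({
    "ApplicationID": np.arange(n),
    "ApprovalYield": rng.uniform(0, 100, n),
})
satisfaction = pd.DataFrame({
    "ApplicationID": np.arange(n),
    "SatisfactionQuality": rng.uniform(0, 100, n),
})

# validate raises an error if either table has duplicate keys
merged_demo = pd.merge(
    loans, satisfaction, on="ApplicationID", how="inner", validate="one_to_one"
)

# Pearson correlation between the two measures
r = merged_demo["ApprovalYield"].corr(merged_demo["SatisfactionQuality"])
print("rows after merge:", len(merged_demo))
print("correlation:", r)
```

A coefficient close to 0 is consistent with a shapeless scatterplot, although it only rules out a linear relationship.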

# Effectively communicating experimental data
You're participating in a research seminar where the latest findings from a neuroscience study are being discussed. The presenter uses a dense slide filled with raw electroencephalogram (EEG) data outputs, complex visualizations, and a small font size, making it difficult for the audience to follow.

Given the scenario, what is a key benefit of effectively communicating experimental data?


It transforms complex data into engaging, relatable content, enhancing audience understanding. The presented scenario underscores the necessity of turning complex data into engaging, relatable content. Simplification and effective visualization could significantly enhance comprehension and keep the audience engaged.



# Check for heteroscedasticity in shelf life
When examining food preservation methods, it's crucial to understand how the variance of one variable, such as shelf life, might change across the range of another variable like nutrient retention. Identifying such patterns, known as heteroscedasticity, can provide insights into the consistency of preservation effects. The food_preservation dataset encapsulates the outcomes of various preservation methods on different food types, specifically highlighting the balance between nutrient retention and resultant shelf life.

The food_preservation DataFrame, pandas as pd, numpy as np, seaborn as sns, and matplotlib.pyplot as plt have been loaded for you.


In [None]:
# Check for heteroscedasticity with a residual plot (use an appropriate plot to check for heteroscedasticity between 'NutrientRetention' and 'ShelfLife').
sns.residplot(x='NutrientRetention', y='ShelfLife', 
         data=food_preservation, lowess=True)
plt.title('Residual Plot of Shelf Life and Nutrient Retention')
plt.xlabel('Nutrient Retention (%)')
plt.ylabel('Residuals')
plt.show()

The residual plot allows you to visually assess the heteroscedasticity between nutrient retention and shelf life, showing if the spread of residuals changes across nutrient retention levels. You can see some deviation away from the 0 line, so there may be some concerns about heteroscedasticity.

# Exploring and transforming shelf life data
Understanding the distribution of different variables in our data is a key aspect of any data work including experimental analysis. The food_preservation dataset captures various food preservation methods and their impact on nutrient retention and shelf life. A crucial aspect of this data involves the shelf life of preserved foods, which can vary significantly across different preservation methods and food types.

The food_preservation DataFrame, from scipy.stats import boxcox, pandas as pd, numpy as np, seaborn as sns, and matplotlib.pyplot as plt have been loaded for you.

In [None]:
# Visualize the original ShelfLife distribution
sns.displot(food_preservation['ShelfLife'])
plt.title('Original Shelf Life Distribution')
plt.show()

# Apply a Box-Cox transformation to the 'ShelfLife' column (the second return value is the fitted lambda).
ShelfLifeTransformed, _ = boxcox(food_preservation['ShelfLife'])

# Visualize the transformed ShelfLife distribution
plt.clf()
sns.displot(ShelfLifeTransformed)
plt.title('Transformed Shelf Life Distribution')
plt.show()

Visualizing the original and transformed distributions provides valuable insights into the data's structure. The Box-Cox transformation helps stabilize variance, making the data more suitable for further statistical analysis by helping to make the ShelfLife follow a more normal shape.
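
The effect of the transformation can also be checked numerically. The sketch below applies boxcox to synthetic right-skewed data, compares the skewness before and after, and shows that the fitted lambda lets you map values back to the original scale with scipy.special.inv_boxcox. The lognormal data and seed are assumptions. Note that Box-Cox requires strictly positive input.

```python
import numpy as np
from scipy.special import inv_boxcox
from scipy.stats import boxcox, skew

# Synthetic, strictly positive, right-skewed data (assumed)
rng = np.random.default_rng(5)
shelf_life_demo = rng.lognormal(mean=3.0, sigma=0.6, size=1000)

# boxcox returns the transformed values and the fitted lambda
transformed, fitted_lambda = boxcox(shelf_life_demo)

print("skew before:", skew(shelf_life_demo))
print("skew after: ", skew(transformed))
print("fitted lambda:", fitted_lambda)

# The transformation is invertible given the same lambda
recovered = inv_boxcox(transformed, fitted_lambda)
```

Keeping the fitted lambda matters if results need to be reported back in the original units.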

# Applying nonparametric tests in experimental analysis

# Visualizing and testing preservation methods
As a food scientist, you're tasked with evaluating the effectiveness of different preservation methods on nutrient retention and how these methods impact shelf life. You have been provided with a dataset, food_preservation, that includes various types of food preserved by methods such as freezing and canning. Each entry in the dataset captures the nutrient retention and calculated shelf life for these foods, providing a unique opportunity to analyze the impacts of preservation techniques on food quality.

The following imports have been loaded for you in addition to food_preservation:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import mannwhitneyu

In [None]:
# Filter the DataFrame to include only Freezing and Canning rows.
condensed_food_data = food_preservation[food_preservation['PreservationMethod'].isin(['Freezing', 'Canning'])]

# Create a violin plot to visualize the distribution of nutrient retention for the two preservation methods.
sns.violinplot(data=condensed_food_data, 
     x="PreservationMethod", 
     y="NutrientRetention")
plt.show()

# Separate nutrient retention for Freezing and Canning methods
freezing = food_preservation[food_preservation['PreservationMethod'] == 'Freezing']['NutrientRetention']
canning = food_preservation[food_preservation['PreservationMethod'] == 'Canning']['NutrientRetention']

# Perform Mann Whitney U test
u_stat, p_val = mannwhitneyu(freezing, canning)

# Print the p-value
print("Mann Whitney U test p-value:", p_val)

The violin plot shows that the distribution and median values are similar across Freezing and Canning. Given the large p-value, we fail to reject the null hypothesis, so there is no evidence of a difference in nutrient retention between the freezing and canning preservation methods.
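
The Mann-Whitney U test works on ranks rather than raw values, which is what makes it nonparametric. The sketch below computes the U statistic by hand from the pooled ranks and compares it with SciPy. The two samples are synthetic (assumed means, spreads, and sizes). SciPy reports one of the two possible U values, and which one has varied across versions, so both are computed.

```python
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

# Synthetic samples drawn from the same distribution (assumed)
rng = np.random.default_rng(6)
freezing_demo = rng.normal(70, 8, 40)
canning_demo = rng.normal(70, 8, 50)

n1, n2 = freezing_demo.size, canning_demo.size

# Rank all observations together, then sum the ranks of the first sample
ranks = rankdata(np.concatenate([freezing_demo, canning_demo]))
rank_sum_1 = ranks[:n1].sum()

# U for each sample; the two always add up to n1 * n2
u1 = rank_sum_1 - n1 * (n1 + 1) / 2
u2 = n1 * n2 - u1

u_stat, p_val = mannwhitneyu(freezing_demo, canning_demo, alternative="two-sided")
print("manual U values:", u1, u2)
print("scipy U:", u_stat, "p-value:", p_val)
```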

# Further analyzing food preservation techniques
In your role as a food scientist, you're exploring the comparative effects of various food preservation methods on nutrient retention, utilizing a food_preservation dataset that includes measurements from freezing, canning, and drying methods. This dataset has been crafted to incorporate variations in shelf life that depend on the nutrient retention values, reflecting real-world scenarios where preservation efficacy varies significantly. Your analysis will involve visually exploring these differences using advanced plotting techniques and nonparametric tests.

The following imports have been loaded for you in addition to food_preservation:

In [3]:
from scipy.stats import kruskal

In [None]:
# Create a boxen plot to explore the distribution of nutrient retention across the three different preservation methods.
sns.boxenplot(data=food_preservation, 
     x="PreservationMethod", 
     y="NutrientRetention")
plt.show()

# Separate nutrient retention for each preservation method
freezing = food_preservation[food_preservation['PreservationMethod'] == 'Freezing']['NutrientRetention']
canning = food_preservation[food_preservation['PreservationMethod'] == 'Canning']['NutrientRetention']
drying = food_preservation[food_preservation['PreservationMethod'] == 'Drying']['NutrientRetention']

# Perform Kruskal-Wallis test to compare nutrient retention across all preservation methods.
k_stat, k_pval = kruskal(freezing, canning, drying)
print("Kruskal-Wallis test p-value:", k_pval)

By effectively visualizing and statistically analyzing the nutrient retention across different preservation methods, you've gained insights into how these methods impact food quality. The boxen plot provided a deeper understanding of the data's distribution, and the Kruskal-Wallis test helped you assess the statistical differences between groups. With such a large p-value, we fail to reject the null hypothesis, so we cannot conclude that median nutrient retention differs across the three preservation methods.
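
Like Mann-Whitney, the Kruskal-Wallis test is built on pooled ranks. The sketch below computes the H statistic by hand with the standard formula H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1) and compares it with SciPy. The three samples are synthetic (assumed values and sizes), and because they are continuous there are no ties, so no tie correction is needed.

```python
import numpy as np
from scipy.stats import kruskal, rankdata

# Three synthetic samples from the same distribution (assumed)
rng = np.random.default_rng(7)
samples = [
    rng.normal(70, 8, 40),
    rng.normal(70, 8, 45),
    rng.normal(70, 8, 50),
]
sizes = [s.size for s in samples]
n_total = sum(sizes)

# Rank everything together, then split the ranks back into their groups
ranks = rankdata(np.concatenate(samples))
rank_groups = np.split(ranks, np.cumsum(sizes)[:-1])

# H statistic from the per-group rank sums
h_manual = (
    12 / (n_total * (n_total + 1))
    * sum(r.sum() ** 2 / r.size for r in rank_groups)
    - 3 * (n_total + 1)
)

h_stat, h_pval = kruskal(*samples)
print("manual H:", h_manual)
print("scipy H: ", h_stat, "p-value:", h_pval)
```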