In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Analysis of Variance (ANOVA) is a statistical method used to compare means across multiple groups. 
However, ANOVA relies on several key assumptions for its validity. 
Violating these assumptions can lead to inaccurate or misleading results. 
The main assumptions for ANOVA are:
Normality: 
    The dependent variable should follow a normal distribution within each group. This assumption is more critical with smaller sample sizes but tends to be robust with larger samples due to the Central Limit Theorem.
    Violation Example: If the data in one or more groups significantly deviates from a normal distribution, it can affect the ANOVA results. Outliers or skewed distributions may impact the test's accuracy.
Homogeneity of Variance (Homoscedasticity): 
    The variance of the dependent variable should be approximately equal across all groups. Homogeneity of variance ensures that the groups being compared have similar levels of variability.
    Violation Example: If one group has much larger variability than another (for example, a standard deviation several times larger), the pooled error variance no longer represents every group. The F-test can then be too liberal or too conservative, particularly when the group sizes are also unequal, potentially leading to incorrect conclusions.
Independence of Observations: 
    Observations within each group should be independent of each other. This means that the value of one observation should not be influenced by or related to the value of another observation.
    Violation Example: If data points within a group are somehow related or dependent, it can lead to pseudoreplication. For example, repeated measurements on the same subjects or correlated data violate the assumption of independence.
Homogeneity of Regression Slopes (Interaction in the case of ANCOVA): 
    In the context of ANOVA with covariates (ANCOVA), the relationship between the covariate and the dependent variable should be consistent across all groups.
    Violation Example: If the interaction between the covariate and the grouping variable is significant, it suggests that the slopes of the regression lines are not equal across groups. This violates the assumption, and the ANCOVA results may be affected.
    
It's essential to note that ANOVA is relatively robust to violations of the normality assumption, especially with larger sample sizes. However, violations of homogeneity of variance can have more serious consequences. If assumptions are seriously violated, alternative methods like Welch's ANOVA or non-parametric tests may be more appropriate.
Researchers should always check for these assumptions before interpreting the results of ANOVA and consider alternative methods if the assumptions are not met. Graphical methods (e.g., residual plots) and statistical tests (e.g., Levene's test for homogeneity of variance) can be used to assess these assumptions.
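
As a rough sketch of how these checks might look in Python, the snippet below applies the Shapiro-Wilk test to each group and Levene's test across groups using scipy.stats, then runs Kruskal-Wallis as one possible non-parametric fallback. The three samples are simulated purely for illustration and are not results from any study.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three independent groups (illustrative only)
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=50, scale=5, size=30),
    "B": rng.normal(loc=52, scale=5, size=30),
    "C": rng.normal(loc=55, scale=5, size=30),
}

# Normality within each group (Shapiro-Wilk); a small p-value suggests non-normality
normality_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    normality_p[name] = p
    print(f"Shapiro-Wilk, group {name}: p = {p:.3f}")

# Homogeneity of variance across groups (Levene); a small p-value suggests unequal variances
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene's test: p = {levene_p:.3f}")

# If either check looks doubtful, a rank-based alternative such as
# Kruskal-Wallis can be reported alongside the classical F-test
kruskal_stat, kruskal_p = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis: H = {kruskal_stat:.3f}, p = {kruskal_p:.3f}")
```

These tests complement, rather than replace, a visual look at residual plots; with large samples they can flag departures that are too small to matter in practice.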

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?

Analysis of Variance (ANOVA) can be categorized into three main types based on the number of independent variables and their levels. These are:

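1. One-Way ANOVA:
  Use Case: One-Way ANOVA is used when there is one independent variable (factor) with two or more levels (groups). In practice it is applied to three or more groups, since with exactly two groups it is equivalent to an independent two-sample t-test with equal variances assumed.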
  Example: Suppose you want to compare the mean scores of three different teaching methods (A, B, and C) to determine if they have a statistically significant impact on student performance. The independent variable is the teaching method, and there are three levels (A, B, C).
2. Two-Way ANOVA:
  Use Case: Two-Way ANOVA is used when there are two independent variables, and each variable may have multiple levels.
  Example: Consider a study examining the effects of both drug dosage (low, medium, high) and gender (male, female) on a medical outcome. In this case, there are two independent variables (dosage and gender), and each has multiple levels.
3. Repeated Measures ANOVA:
  Use Case: Repeated Measures ANOVA is used when the same subjects are measured under each treatment (repeated measurements), and it is often applied to assess changes over time or under different conditions.
  Example: Suppose you measure the blood pressure of the same group of individuals after each of three different exercise programs (A, B, C). Repeated Measures ANOVA would be appropriate to determine if there are significant differences in blood pressure across the three exercise programs.
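
The repeated measures case can be sketched with statsmodels' AnovaRM class. The data below are simulated, and the column names (subject, program, bp) are made up for this illustration; the sketch assumes a balanced design in which every subject is measured exactly once under every program.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: 6 subjects, each measured once under
# every exercise program (values are simulated, not real measurements)
rng = np.random.default_rng(1)
subjects = np.repeat(np.arange(1, 7), 3)
programs = np.tile(["A", "B", "C"], 6)
blood_pressure = rng.normal(loc=120, scale=8, size=18)

rm_df = pd.DataFrame({
    "subject": subjects,
    "program": programs,
    "bp": blood_pressure,
})

# AnovaRM expects balanced data: one observation per subject per condition
rm_result = AnovaRM(data=rm_df, depvar="bp", subject="subject", within=["program"]).fit()
print(rm_result.anova_table)
```

The table reports the F value together with its numerator and denominator degrees of freedom, which for k conditions and n subjects are k - 1 and (k - 1)(n - 1).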

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in Analysis of Variance (ANOVA) refers to the process of breaking down the total variability in the data into different components, each attributable to specific sources. 
Understanding this concept is crucial because it helps researchers and statisticians assess the relative contributions of different factors to the overall variation observed in the dependent variable. 
The partitioning of variance is a fundamental aspect of ANOVA and is typically represented by the total sum of squares (SST), the between-group sum of squares (SSB or SSBetween), and the within-group sum of squares (SSW or SSWithin), which satisfy SST = SSB + SSW.

Understanding the partitioning of variance is important for the following reasons:
Identification of Sources of Variation: 
    It helps researchers identify and quantify the sources of variation in the data, distinguishing between variation due to group differences and random variability within groups.
Assessment of Treatment Effects: 
    Researchers can assess whether the differences between groups are statistically significant, providing insight into the effectiveness of experimental treatments or the impact of different factors.
Interpretation of Results: 
    It enables a more nuanced interpretation of ANOVA results, allowing researchers to understand the relative importance of different factors influencing the dependent variable.
Basis for F-Statistic: 
    The F-statistic, which is central to ANOVA, is calculated as the ratio of the between-group mean square to the within-group mean square: F = MSB / MSW, where MSB = SSB / (k - 1) and MSW = SSW / (N - k) for k groups and N observations in total. Understanding the partitioning of variance helps in interpreting the F-statistic and assessing its significance.
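
To make the partition concrete, here is a minimal sketch that computes the three sums of squares by hand, checks that SST = SSB + SSW, and compares the resulting F-statistic with scipy.stats.f_oneway. It reuses the small illustrative groups from the Q4 cells; the variable names are chosen for this example only.

```python
import numpy as np
from scipy.stats import f_oneway

# The same small illustrative groups used in the Q4 cells
groups = [
    np.array([23, 25, 27, 30, 32]),
    np.array([18, 20, 22, 25, 28]),
    np.array([15, 18, 20, 22, 24]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)             # number of groups
n_total = all_values.size   # total number of observations

# Partition the total variability
sst = np.sum((all_values - grand_mean) ** 2)
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Mean squares and the F-statistic
msb = ssb / (k - 1)
msw = ssw / (n_total - k)
f_manual = msb / msw

f_scipy, p_scipy = f_oneway(*groups)
print(f"SST = {sst:.4f}, SSB + SSW = {ssb + ssw:.4f}")
print(f"F (manual) = {f_manual:.4f}, F (scipy) = {f_scipy:.4f}, p = {p_scipy:.4f}")
```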

In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np

In [2]:
group1 = np.array([23, 25, 27, 30, 32])
group2 = np.array([18, 20, 22, 25, 28])
group3 = np.array([15, 18, 20, 22, 24])

In [3]:
group1

array([23, 25, 27, 30, 32])

In [4]:
group2

array([18, 20, 22, 25, 28])

In [5]:
group3

array([15, 18, 20, 22, 24])

In [6]:
# Combine data from all groups
all_data = np.concatenate([group1, group2, group3])


In [7]:
all_data

array([23, 25, 27, 30, 32, 18, 20, 22, 25, 28, 15, 18, 20, 22, 24])

In [8]:
# Overall mean
overall_mean = np.mean(all_data)

In [9]:
overall_mean

23.266666666666666

In [10]:
# Calculate Total Sum of Squares (SST)
SST = np.sum((all_data - overall_mean)**2)

In [11]:
SST

312.9333333333333

In [12]:
# Calculate Explained Sum of Squares (SSE), i.e. the between-group sum of squares
group_means = [np.mean(group) for group in [group1, group2, group3]]
n_per_group = [len(group) for group in [group1, group2, group3]]
# Built-in sum() is used because calling np.sum() on a generator is deprecated
SSE = sum(n * (group_mean - overall_mean)**2 for n, group_mean in zip(n_per_group, group_means))


In [13]:
group_means

[27.4, 22.6, 19.8]

In [14]:
n_per_group

[5, 5, 5]

In [15]:
SSE

147.73333333333323

In [16]:
# Calculate Residual Sum of Squares (SSR), i.e. the within-group sum of squares, as SST - SSE
SSR = SST - SSE

In [17]:
SSR

165.20000000000005

In [18]:
# Print results
print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)

Total Sum of Squares (SST): 312.9333333333333
Explained Sum of Squares (SSE): 147.73333333333323
Residual Sum of Squares (SSR): 165.20000000000005


In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [28]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy.stats import f_oneway

In [29]:
data = {'A': [10, 15, 20, 25, 30],
        'B': [5, 10, 15, 20, 25],
        'Y': [45, 60, 75, 90, 105]}

In [30]:
df = pd.DataFrame(data)

In [31]:
df

    A   B    Y
0  10   5   45
1  15  10   60
2  20  15   75
3  25  20   90
4  30  25  105


In [32]:
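# Fit the ANOVA model
# Caution: A and B are numeric, so the formula treats them as continuous covariates, and here
# B = A - 5 and Y = 3*A + 15 exactly, so the fit is perfect and the residual sum of squares is
# numerically zero. The F and p-values in the resulting table are therefore not meaningful.
model = ols('Y ~ A * B', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)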

In [33]:
model

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f0bee180e50>

In [34]:
anova_table

                sum_sq   df             F        PR(>F)
A             870.0937  1.0  3.792272e+27  2.636942e-28
B             113.1779  1.0  4.932818e+26  2.027239e-27
A:B       1.818321e-26  1.0    0.07925088     0.8047691
Residual  4.588772e-25  2.0           NaN           NaN


In [35]:
# Extract the mean square (sum_sq / df) for each term.
# The tests of the main effects and the interaction are in the F and PR(>F) columns.
main_effect_A = anova_table['sum_sq']['A'] / anova_table['df']['A']
main_effect_B = anova_table['sum_sq']['B'] / anova_table['df']['B']
interaction_effect = anova_table['sum_sq']['A:B'] / anova_table['df']['A:B']

In [36]:
# Print the results
print(f"Mean Square A: {main_effect_A}")
print(f"Mean Square B: {main_effect_B}")
print(f"Mean Square A:B: {interaction_effect}")

Mean Square A: 870.0936663693159
Mean Square B: 113.17788839402476
Mean Square A:B: 1.818321241851911e-26
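
For comparison, a two-way ANOVA is normally fitted with categorical factors and several observations per cell. The sketch below uses a made-up balanced design (3 dosages x 2 sexes x 3 replicates) with a simulated response; the column names and data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Hypothetical balanced design: 3 dosages x 2 sexes x 3 replicates
rng = np.random.default_rng(2)
dosage = np.repeat(["low", "medium", "high"], 6)
sex = np.tile(np.repeat(["female", "male"], 3), 3)
outcome = rng.normal(loc=10, scale=2, size=18)  # simulated response

design_df = pd.DataFrame({"dosage": dosage, "sex": sex, "outcome": outcome})

# C(...) marks each factor as categorical; '*' expands to both main
# effects plus their interaction
two_way_model = ols("outcome ~ C(dosage) * C(sex)", data=design_df).fit()
two_way_table = anova_lm(two_way_model, typ=2)
print(two_way_table)
```

Each main effect and the interaction gets its own row, and the F and PR(>F) columns of that row are what test the effect.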


In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?


In a one-way ANOVA (Analysis of Variance), the F-statistic is used to test whether there are significant differences among the means of three or more groups. The p-value associated with the F-statistic helps you determine whether these differences are statistically significant or not.
In your case, the obtained F-statistic is 5.23, and the associated p-value is 0.02. To interpret these results:
Null Hypothesis (H0): 
    The null hypothesis in ANOVA is that all group population means are equal. Mathematically, it is stated as 
      H0: μ1 = μ2 = μ3 = ... = μk, where μ1, μ2, μ3, ..., μk are the population means of the k groups.
Alternative Hypothesis (H1): 
    The alternative hypothesis is that at least one group's population mean differs from the others.

Given the p-value of 0.02, which is less than the typical significance level of 0.05, you would reject the null hypothesis. This suggests that there is enough evidence to conclude that there are significant differences among the group means.
In practical terms, you can interpret this as follows:
Significant Differences: 
    The groups in your study are not all the same. At least one group differs significantly from the others in terms of the variable you measured.
Further Analysis: 
    Since the overall ANOVA test is significant, you may want to perform post-hoc tests (e.g., Tukey's HSD, Bonferroni) to identify which specific groups are different from each other.
Effect Size: 
    While the ANOVA tells you if there are differences, it doesn't tell you the size of those differences. You may want to calculate and report effect size measures (e.g., eta-squared) to quantify the practical significance of the observed differences.

In summary, with an F-statistic of 5.23 and a p-value of 0.02, you have evidence to reject the null hypothesis and conclude that there are significant differences among the group means.
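
The link between the F-statistic, the p-value, and eta-squared can be sketched as follows. The question does not give the degrees of freedom, so the values used here (3 groups and 15 observations) are purely hypothetical.

```python
from scipy.stats import f

f_statistic = 5.23
# Hypothetical degrees of freedom (not given in the question):
# 3 groups -> df_between = 2; 15 observations -> df_within = 12
df_between, df_within = 2, 12

# The upper-tail probability of the F distribution gives the p-value
p_value = f.sf(f_statistic, df_between, df_within)

# Eta-squared recovered from F and the degrees of freedom:
# eta^2 = SSB / SST = (F * df_between) / (F * df_between + df_within)
eta_squared = (f_statistic * df_between) / (f_statistic * df_between + df_within)

print(f"p-value: {p_value:.4f}")
print(f"eta-squared: {eta_squared:.3f}")
```

With these assumed degrees of freedom the p-value comes out close to the 0.02 quoted in the question; other degrees of freedom would give a different p-value for the same F.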

In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?


Handling missing data in repeated measures ANOVA is crucial to ensure the validity and reliability of your analysis. There are several methods to address missing data, and the choice of method can have implications for the results and interpretations. 
Here are some common approaches:
Complete Case Analysis (Listwise Deletion): 
    This method involves excluding any participant with missing data on any variable included in the analysis. While this is straightforward, it reduces the sample size and therefore statistical power, and it yields biased estimates if the data are not missing completely at random.
Pairwise Deletion: 
    This method includes all available data for each analysis, so participants with missing data on some variables are still included in the analysis. However, this can lead to biased estimates of the variances and covariances between variables, affecting the accuracy of the results.
Mean Imputation: 
    Missing values are replaced with the mean of the observed values for that variable. While this is simple and preserves the sample size, it can distort the distribution of the data and lead to an underestimation of standard errors. Additionally, it assumes that the missing values are missing completely at random, which may not be the case.
Last Observation Carried Forward (LOCF): 
    Missing values are replaced with the last observed value. This method assumes that the most recent measurement is a good estimate of the missing value, but it may not be suitable if the pattern of missing data is related to the variable's trajectory over time.
Multiple Imputation: 
    This advanced technique involves creating multiple sets of imputed values for missing data, allowing for the incorporation of uncertainty related to missing data. This method provides more accurate standard errors and parameter estimates compared to single imputation methods. However, it requires more sophisticated statistical software and may be computationally intensive.

Consequences of Using Different Methods:
Bias: 
    Certain methods, such as mean imputation or LOCF, can introduce bias into the analysis by artificially altering the distribution of the data.
Reduced Power: 
    Methods that result in a reduction of the effective sample size, such as complete case analysis, may lead to reduced statistical power, making it more difficult to detect true effects.
Invalid Inferences: 
    Using inappropriate methods for handling missing data can lead to invalid inferences and incorrect conclusions about the relationships between variables.
Generalizability: 
    The method chosen may impact the generalizability of the findings, especially if the pattern of missing data is related to certain participant characteristics.

It's essential to carefully consider the nature of the missing data and choose a method that aligns with the assumptions of the analysis while minimizing potential biases. Multiple imputation is generally considered a robust approach, but it requires careful implementation and consideration of the underlying missing data mechanism.
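
The deletion and single-imputation approaches described above can be sketched with pandas. The wide-format table below (one row per subject, one column per time point) is invented for illustration; multiple imputation is not shown because it needs a dedicated modelling step.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format repeated-measures data: one row per subject,
# one column per time point; NaN marks a missed measurement
wide = pd.DataFrame(
    {
        "t1": [120.0, 130.0, 125.0, 140.0, 118.0],
        "t2": [118.0, np.nan, 124.0, 138.0, 117.0],
        "t3": [115.0, 127.0, np.nan, 135.0, np.nan],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="subject"),
)

# Complete case analysis: keep only subjects observed at every time point
complete_cases = wide.dropna()

# Mean imputation: fill each time point with that column's observed mean
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry each subject's previous measurement forward along the row
locf_imputed = wide.ffill(axis=1)

print(f"Subjects retained after listwise deletion: {len(complete_cases)} of {len(wide)}")
print("Per-time-point variance, observed vs. mean-imputed:")
print(pd.DataFrame({"observed": wide.var(), "mean_imputed": mean_imputed.var()}))
```

Comparing the two variance columns shows how mean imputation shrinks the variability at time points with missing values, which is the mechanism behind the underestimated standard errors mentioned above.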

In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are used in analysis of variance (ANOVA) to explore specific group differences when the overall ANOVA test indicates that there are significant differences among groups. These tests help identify which groups differ from each other. 
Here are some common post-hoc tests:
Tukey's Honestly Significant Difference (HSD):
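  When to use: Tukey's HSD is appropriate when group variances are roughly equal and you want to compare all possible pairs of group means. It was designed for equal group sizes (the Tukey-Kramer variant extends it to unequal sizes) and controls the overall Type I error rate.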
  Example: You conduct an ANOVA to compare the average scores of three teaching methods (A, B, and C). The ANOVA indicates a significant difference, and Tukey's HSD can be used to identify which pairs of teaching methods differ significantly.
Bonferroni Correction:
  When to use: Bonferroni correction is suitable when you are conducting multiple pairwise comparisons, and it adjusts the significance level to control the familywise error rate.
  Example: You perform an ANOVA to analyze the effect of three different diets on weight loss. Post-hoc tests using Bonferroni correction can be applied to determine if there are significant differences between specific pairs of diets.
Scheffé's Test:
  When to use: Scheffé's test is a conservative approach suitable for unequal sample sizes. It controls the familywise error rate across all possible contrasts, including complex comparisons such as one group against the average of several others, not just pairwise comparisons.
  Example: You are comparing the performance of four different training programs, and the ANOVA indicates a significant difference. Scheffé's test can help identify which specific pairs of training programs have significantly different effects.
Dunnett's Test:
  When to use: Dunnett's test is employed when you have one control group and want to compare all other groups to the control group.
  Example: You conduct an ANOVA to assess the effectiveness of three different medications compared to a placebo. Dunnett's test can help identify which medications, if any, result in significantly different outcomes compared to the placebo.
Games-Howell Test:
  When to use: Games-Howell is a robust alternative to Tukey's HSD when sample sizes are unequal, and variances are not assumed to be equal.
  Example: You perform an ANOVA to analyze the impact of different work schedules on productivity. If the ANOVA indicates significant differences, Games-Howell can be used to identify specific pairs of work schedules with significantly different effects.
Example Scenario:
   Imagine you are conducting a study to compare the mean scores of students' performance across four different teaching methods: A, B, C, and D. After conducting a one-way ANOVA, you find that the p-value is less than your chosen significance level (e.g., 0.05), indicating that there are significant differences among the group means.
   In this situation, you would proceed to perform post-hoc tests to determine which specific pairs of teaching methods differ significantly. Depending on the assumptions and characteristics of your data (e.g., equal or unequal group sizes, homogeneity of variances), you might choose a post-hoc test such as Tukey's HSD, Bonferroni correction, Scheffé's test, or another appropriate test to make pairwise comparisons and draw conclusions about the specific teaching methods that lead to significantly different student performance.
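
The scenario above might be sketched as follows, using statsmodels' pairwise_tukeyhsd after the omnibus F-test. The scores are simulated, and the group means, spread, and sample size of 20 per method are assumptions chosen only to make the example run.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical scores for four teaching methods (simulated, 20 students each)
rng = np.random.default_rng(3)
scores = {
    "A": rng.normal(loc=70, scale=8, size=20),
    "B": rng.normal(loc=72, scale=8, size=20),
    "C": rng.normal(loc=78, scale=8, size=20),
    "D": rng.normal(loc=80, scale=8, size=20),
}

# Omnibus test first
f_stat, p_val = f_oneway(*scores.values())
print(f"F = {f_stat:.3f}, p = {p_val:.4f}")

# Tukey's HSD needs one flat array of values and a matching array of labels
values = np.concatenate(list(scores.values()))
labels = np.repeat(list(scores.keys()), [len(v) for v in scores.values()])

tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())
```

The summary lists one row per pair of methods (six pairs for four groups) with the mean difference, an adjusted p-value, a confidence interval, and a reject flag.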

In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [37]:
import numpy as np
from scipy.stats import f_oneway

In [38]:
# Generate example weight loss data for three diets (A, B, C)
# Note: this simulates 50 participants per diet (150 in total) for illustration
np.random.seed(42)  # For reproducibility
diet_A = np.random.normal(loc=5, scale=2, size=50)
diet_B = np.random.normal(loc=4, scale=2, size=50)
diet_C = np.random.normal(loc=6, scale=2, size=50)

In [39]:
diet_A

array([5.99342831, 4.7234714 , 6.29537708, 8.04605971, 4.53169325,
       4.53172609, 8.15842563, 6.53486946, 4.06105123, 6.08512009,
       4.07316461, 4.06854049, 5.48392454, 1.17343951, 1.55016433,
       3.87542494, 2.97433776, 5.62849467, 3.18395185, 2.1753926 ,
       7.93129754, 4.5484474 , 5.13505641, 2.15050363, 3.91123455,
       5.22184518, 2.69801285, 5.75139604, 3.79872262, 4.4166125 ,
       3.79658678, 8.70455637, 4.97300555, 2.88457814, 6.64508982,
       2.5583127 , 5.41772719, 1.08065975, 2.3436279 , 5.39372247,
       6.47693316, 5.34273656, 4.76870344, 4.39779261, 2.04295602,
       3.56031158, 4.07872246, 7.11424445, 5.68723658, 1.47391969])

In [40]:
diet_B

array([ 4.64816794,  3.22983544,  2.646156  ,  5.22335258,  6.06199904,
        5.86256024,  2.32156495,  3.38157525,  4.66252686,  5.95109025,
        3.04165152,  3.62868205,  1.78733005,  1.60758675,  5.62505164,
        6.71248006,  3.85597976,  6.0070658 ,  4.72327205,  2.70976049,
        4.72279121,  7.07607313,  3.92834792,  7.12928731, -1.23949021,
        5.64380501,  4.17409414,  3.4019853 ,  4.18352155,  0.02486217,
        3.56065622,  4.71422514,  6.95578809,  2.96345956,  2.38301279,
        2.99648591,  5.83080424,  4.65750222,  2.94047959,  5.02653487,
        4.1941551 ,  5.93728998,  2.59589381,  3.34467571,  3.21578369,
        1.0729701 ,  4.59224055,  4.52211054,  4.01022691,  3.53082573])

In [41]:
diet_C

array([ 3.16925852,  5.15870935,  5.31457097,  4.39544546,  5.67742858,
        6.80810171,  9.7723718 ,  6.34915563,  6.51510078,  5.85110817,
        2.16245757,  5.94697225,  6.12046042, 10.92648422,  5.61527807,
        6.60309468,  5.93057646,  3.66264392,  8.28564563,  7.50386607,
        7.58206389,  4.18122509,  8.80558862,  3.19629787,  7.17371419,
       10.38091125,  4.01892735,  4.86740454,  6.19930273,  4.99304869,
        2.89867314,  6.13712595,  3.87539257,  6.94718486,  4.16115153,
        9.09986881,  4.43349342,  5.35587697,  7.62703443,  3.53827137,
        6.45491987,  8.61428551,  2.78503353,  6.36926772,  6.51976559,
        7.56364574,  3.52609858,  3.35908677,  7.04388313,  6.59396935])

In [42]:
# Combine the data
all_data = [diet_A, diet_B, diet_C]

In [43]:
all_data

[array([5.99342831, 4.7234714 , 6.29537708, 8.04605971, 4.53169325,
        4.53172609, 8.15842563, 6.53486946, 4.06105123, 6.08512009,
        4.07316461, 4.06854049, 5.48392454, 1.17343951, 1.55016433,
        3.87542494, 2.97433776, 5.62849467, 3.18395185, 2.1753926 ,
        7.93129754, 4.5484474 , 5.13505641, 2.15050363, 3.91123455,
        5.22184518, 2.69801285, 5.75139604, 3.79872262, 4.4166125 ,
        3.79658678, 8.70455637, 4.97300555, 2.88457814, 6.64508982,
        2.5583127 , 5.41772719, 1.08065975, 2.3436279 , 5.39372247,
        6.47693316, 5.34273656, 4.76870344, 4.39779261, 2.04295602,
        3.56031158, 4.07872246, 7.11424445, 5.68723658, 1.47391969]),
 array([ 4.64816794,  3.22983544,  2.646156  ,  5.22335258,  6.06199904,
         5.86256024,  2.32156495,  3.38157525,  4.66252686,  5.95109025,
         3.04165152,  3.62868205,  1.78733005,  1.60758675,  5.62505164,
         6.71248006,  3.85597976,  6.0070658 ,  4.72327205,  2.70976049,
         4.72279121,  7.07

In [44]:
# Perform one-way ANOVA
f_statistic, p_value = f_oneway(*all_data)

In [45]:
# Display results
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")


F-statistic: 13.364811099020901
P-value: 4.646161222392871e-06


In [46]:
# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("The one-way ANOVA is statistically significant.")
    print("There are significant differences in the mean weight loss between at least two diets.")
else:
    print("The one-way ANOVA is not statistically significant.")
    print("There is not enough evidence to conclude significant differences in mean weight loss between the diets.")

The one-way ANOVA is statistically significant.
There are significant differences in the mean weight loss between at least two diets.


In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [69]:
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

In [70]:
# Set a random seed for reproducibility
np.random.seed(42)


In [71]:
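# Generate example data (30 employees per program here, 90 in total)
n_employees_per_group = 30
programs = np.repeat(['Program A', 'Program B', 'Program C'], n_employees_per_group)
# Alternate experience levels so that each program gets 15 novices and 15 experienced employees
experience_levels = np.tile(['Novice', 'Experienced'], len(programs) // 2)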

In [72]:
programs

array(['Program A', 'Program A', 'Program A', 'Program A', 'Program A',
       'Program A', 'Program A', 'Program A', 'Program A', 'Program A',
       'Program A', 'Program A', 'Program A', 'Program A', 'Program A',
       'Program A', 'Program A', 'Program A', 'Program A', 'Program A',
       'Program A', 'Program A', 'Program A', 'Program A', 'Program A',
       'Program A', 'Program A', 'Program A', 'Program A', 'Program A',
       'Program B', 'Program B', 'Program B', 'Program B', 'Program B',
       'Program B', 'Program B', 'Program B', 'Program B', 'Program B',
       'Program B', 'Program B', 'Program B', 'Program B', 'Program B',
       'Program B', 'Program B', 'Program B', 'Program B', 'Program B',
       'Program B', 'Program B', 'Program B', 'Program B', 'Program B',
       'Program B', 'Program B', 'Program B', 'Program B', 'Program B',
       'Program C', 'Program C', 'Program C', 'Program C', 'Program C',
       'Program C', 'Program C', 'Program C', 'Program C', 'Prog

In [73]:
experience_levels

array(['Novice', 'Experienced', 'Novice', 'Experienced', 'Novice',
       'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced',
       'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice',
       'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced',
       'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice',
       'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced',
       'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice',
       'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced',
       'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice',
       'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced',
       'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice',
       'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced',
       'Novice', 'Experienced', 'Novice', 'Experienced', 'Novice',
       'Experienced', 'Novice', 'Experienced', 'Novice', 'Experienced',
       'Novice', 'Experienc

In [74]:
# Simulate time data
time_data = np.random.normal(loc=20, scale=5, size=n_employees_per_group * 3)


In [76]:
time_data

array([22.48357077, 19.30867849, 23.23844269, 27.61514928, 18.82923313,
       18.82931522, 27.89606408, 23.83717365, 17.65262807, 22.71280022,
       17.68291154, 17.67135123, 21.20981136, 10.43359878, 11.37541084,
       17.18856235, 14.9358444 , 21.57123666, 15.45987962, 12.93848149,
       27.32824384, 18.8711185 , 20.33764102, 12.87625907, 17.27808638,
       20.55461295, 14.24503211, 21.87849009, 16.99680655, 18.54153125,
       16.99146694, 29.26139092, 19.93251388, 14.71144536, 24.11272456,
       13.89578175, 21.04431798, 10.20164938, 13.35906976, 20.98430618,
       23.6923329 , 20.85684141, 19.42175859, 18.49448152, 12.60739005,
       16.40077896, 17.69680615, 25.28561113, 21.71809145, 11.18479922,
       21.62041985, 18.0745886 , 16.61539   , 23.05838144, 25.15499761,
       24.6564006 , 15.80391238, 18.45393812, 21.65631716, 24.87772564,
       17.60412881, 19.07170512, 14.46832513, 14.01896688, 24.06262911,
       26.78120014, 19.63994939, 25.01766449, 21.80818013, 16.77

In [80]:
# Create a DataFrame
df = pd.DataFrame({'Program': programs, 'Experience': experience_levels, 'Time': time_data})

In [81]:
df

      Program   Experience       Time
0   Program A       Novice  22.483571
1   Program A  Experienced  19.308678
2   Program A       Novice  23.238443
3   Program A  Experienced  27.615149
4   Program A       Novice  18.829233
..        ...          ...        ...
85  Program C  Experienced  17.491215
86  Program C       Novice  24.577011
87  Program C  Experienced  21.643756
88  Program C       Novice  17.351199


In [82]:
# Perform two-way ANOVA
formula = 'Time ~ C(Program) + C(Experience) + C(Program):C(Experience)'
model = ols(formula, df).fit()
anova_table = anova_lm(model, typ=2)

In [83]:
model

<statsmodels.regression.linear_model.RegressionResultsWrapper at 0x7f0bf0b7ec80>

In [84]:
anova_table

                              sum_sq    df         F    PR(>F)
C(Program)                 15.717327   2.0  0.344485  0.709581
C(Experience)               2.994142   1.0  0.131248  0.718051
C(Program):C(Experience)    9.952457   2.0  0.218133  0.804472
Residual                  1916.27349  84.0       NaN       NaN


In [85]:
# Interpret the results
alpha = 0.05
p_value_program = anova_table.loc['C(Program)', 'PR(>F)']
p_value_experience = anova_table.loc['C(Experience)', 'PR(>F)']
p_value_interaction = anova_table.loc['C(Program):C(Experience)', 'PR(>F)']

In [86]:
p_value_program

0.7095813882358291

In [87]:
p_value_experience

0.7180509539617208

In [88]:
p_value_interaction

0.8044721993691825

In [89]:
if p_value_program < alpha:
    print("There is a significant main effect of software programs on task completion time.")
else:
    print("There is no significant main effect of software programs on task completion time.")

if p_value_experience < alpha:
    print("There is a significant main effect of employee experience on task completion time.")
else:
    print("There is no significant main effect of employee experience on task completion time.")

if p_value_interaction < alpha:
    print("There is a significant interaction effect between software programs and employee experience.")
else:
    print("There is no significant interaction effect between software programs and employee experience.")

There is no significant main effect of software programs on task completion time.
There is no significant main effect of employee experience on task completion time.
There is no significant interaction effect between software programs and employee experience.


In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [90]:
import numpy as np
from scipy.stats import ttest_ind
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [91]:
# Set a random seed for reproducibility
np.random.seed(42)

In [92]:
# Generate example test scores (replace with your actual data)
# Note: this simulates 100 students per group for illustration
control_group = np.random.normal(loc=70, scale=10, size=100)
experimental_group = np.random.normal(loc=75, scale=10, size=100)

In [93]:
control_group

array([74.96714153, 68.61735699, 76.47688538, 85.23029856, 67.65846625,
       67.65863043, 85.79212816, 77.67434729, 65.30525614, 75.42560044,
       65.36582307, 65.34270246, 72.41962272, 50.86719755, 52.75082167,
       64.37712471, 59.8716888 , 73.14247333, 60.91975924, 55.87696299,
       84.65648769, 67.742237  , 70.67528205, 55.75251814, 64.55617275,
       71.1092259 , 58.49006423, 73.75698018, 63.9936131 , 67.0830625 ,
       63.98293388, 88.52278185, 69.86502775, 59.42289071, 78.22544912,
       57.7915635 , 72.08863595, 50.40329876, 56.71813951, 71.96861236,
       77.3846658 , 71.71368281, 68.84351718, 66.98896304, 55.2147801 ,
       62.80155792, 65.39361229, 80.57122226, 73.4361829 , 52.36959845,
       73.24083969, 66.1491772 , 63.23078   , 76.11676289, 80.30999522,
       79.31280119, 61.60782477, 66.90787624, 73.31263431, 79.75545127,
       65.20825762, 68.14341023, 58.93665026, 58.03793376, 78.12525822,
       83.56240029, 69.27989878, 80.03532898, 73.61636025, 63.54

In [94]:
experimental_group

array([ 60.84629258,  70.79354677,  71.57285483,  66.97722731,
        73.38714288,  79.04050857,  93.86185901,  76.74577813,
        77.57550391,  74.25554084,  55.81228785,  74.73486125,
        75.6023021 ,  99.63242112,  73.07639035,  78.01547342,
        74.6528823 ,  63.31321962,  86.42822815,  82.51933033,
        82.91031947,  65.90612545,  89.02794311,  60.98148937,
        80.86857094,  96.90455626,  65.09463675,  69.3370227 ,
        75.99651365,  69.96524346,  59.49336569,  75.68562975,
        64.37696286,  79.73592431,  65.80575766,  90.49934405,
        67.16746708,  71.77938484,  83.13517217,  62.69135684,
        77.27459935,  88.07142754,  58.92516765,  76.84633859,
        77.59882794,  82.81822872,  62.63049289,  61.79543387,
        80.21941566,  77.96984673,  77.5049285 ,  78.46448209,
        68.19975278,  77.32253697,  77.93072473,  67.85648582,
        93.65774511,  79.73832921,  63.08696503,  81.56553609,
        65.2531833 ,  82.87084604,  86.58595579,  66.79

In [95]:
# Perform an independent two-sample t-test.
# By default ttest_ind assumes equal variances in both groups (Student's t-test).
t_statistic, p_value = ttest_ind(control_group, experimental_group)

In [96]:
# Display the results of the t-test
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

T-statistic: -4.754695943505281
P-value: 3.819135262679478e-06
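
A p-value says whether a difference is detectable, not how large it is. As a hedged supplement, the sketch below computes Cohen's d (the mean difference divided by the pooled standard deviation) for the same simulated scores. The helper name cohens_d is an illustration and not part of the original notebook.

```python
import numpy as np

# Recreate the simulated scores exactly as in the cells above
np.random.seed(42)
control_group = np.random.normal(loc=70, scale=10, size=100)
experimental_group = np.random.normal(loc=75, scale=10, size=100)

def cohens_d(a, b):
    """Standardized mean difference (b minus a) using the pooled standard deviation.

    Hypothetical helper added for illustration.
    """
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * np.var(a, ddof=1) + (n_b - 1) * np.var(b, ddof=1)) / (n_a + n_b - 2)
    return (np.mean(b) - np.mean(a)) / np.sqrt(pooled_var)

effect_size = cohens_d(control_group, experimental_group)
print(f"Cohen's d: {effect_size:.3f}")
```

By a common rule of thumb, values near 0.2, 0.5, and 0.8 are read as small, medium, and large effects.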


In [97]:
# Interpret the results of the t-test
alpha = 0.05
if p_value < alpha:
    print("The two-sample t-test is statistically significant.")
    print("There is a significant difference in test scores between the control and experimental groups.")
else:
    print("The two-sample t-test is not statistically significant.")
    print("There is not enough evidence to conclude a significant difference in test scores between the groups.")

The two-sample t-test is statistically significant.
There is a significant difference in test scores between the control and experimental groups.
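
The default two-sample t-test assumes that both groups have equal variances. A minimal sketch of how that assumption could be checked with Levene's test, with Welch's t-test as a fallback that does not rely on it, is shown below. It reuses the simulated data from above; with real data the conclusions may differ.

```python
import numpy as np
from scipy.stats import levene, ttest_ind

# Recreate the simulated scores exactly as in the cells above
np.random.seed(42)
control_group = np.random.normal(loc=70, scale=10, size=100)
experimental_group = np.random.normal(loc=75, scale=10, size=100)

# Levene's test: the null hypothesis is that the groups have equal variances
levene_stat, levene_p = levene(control_group, experimental_group)
print(f"Levene statistic: {levene_stat:.4f}, p-value: {levene_p:.4f}")

# Welch's t-test drops the equal-variance assumption
welch_t, welch_p = ttest_ind(control_group, experimental_group, equal_var=False)
print(f"Welch t-statistic: {welch_t:.4f}, p-value: {welch_p:.3g}")
```

A small Levene p-value would suggest unequal variances, in which case the Welch result is the safer one to report.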


In [98]:
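# Perform post-hoc test (Tukey's HSD) if the t-test is significant.
# Note: with only two groups there is a single pairwise comparison, so this
# step confirms the t-test result rather than adding new information.
if p_value < alpha:
    # Combine data for post-hoc test
    all_data = np.concatenate([control_group, experimental_group])

    # Create group labels sized from the data instead of hard-coding 100
    group_labels = ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group)

    # Perform Tukey's HSD post-hoc test
    tukey_results = pairwise_tukeyhsd(all_data, group_labels)

    # Display the results of the post-hoc test
    print(tukey_results.summary())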

  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   6.2615   0.0 3.6645 8.8585   True
--------------------------------------------------------
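
With exactly two groups, a post-hoc test has only one pair to compare. The sketch below illustrates why it adds little: a one-way ANOVA on two groups yields an F-statistic equal to the square of the pooled-variance t-statistic, with the same p-value. It reuses the simulated data from above.

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Recreate the simulated scores exactly as in the cells above
np.random.seed(42)
control_group = np.random.normal(loc=70, scale=10, size=100)
experimental_group = np.random.normal(loc=75, scale=10, size=100)

# Pooled-variance t-test and one-way ANOVA on the same two groups
t_stat, t_p = ttest_ind(control_group, experimental_group)
f_stat, f_p = f_oneway(control_group, experimental_group)

print(f"t squared: {t_stat ** 2:.6f}")
print(f"F:         {f_stat:.6f}")
print(f"p-values:  {t_p:.6g} (t-test) vs {f_p:.6g} (ANOVA)")
```

Post-hoc comparisons become informative once there are three or more groups, because the omnibus test no longer says which pairs differ.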


In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant
differences in sales between the three stores. If the results are significant, follow up with a post-hoc
test to determine which store(s) differ significantly from each other.

In [99]:
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [100]:
# Set a random seed for reproducibility
np.random.seed(42)

In [101]:
# Generate example daily sales data 
sales_store_A = np.random.normal(loc=500, scale=50, size=30)
sales_store_B = np.random.normal(loc=550, scale=50, size=30)
sales_store_C = np.random.normal(loc=600, scale=50, size=30)

In [102]:
sales_store_A

array([524.83570765, 493.08678494, 532.38442691, 576.15149282,
       488.29233126, 488.29315215, 578.96064078, 538.37173646,
       476.5262807 , 527.12800218, 476.82911536, 476.71351232,
       512.09811358, 404.33598777, 413.75410837, 471.88562354,
       449.35844398, 515.71236663, 454.59879622, 429.38481493,
       573.28243845, 488.71118498, 503.37641023, 428.76259069,
       472.78086377, 505.54612949, 442.45032113, 518.78490092,
       469.9680655 , 485.41531251])

In [103]:
sales_store_B

array([519.91466939, 642.61390923, 549.32513876, 497.11445355,
       591.12724561, 488.9578175 , 560.44317975, 452.01649381,
       483.59069756, 559.84306179, 586.923329  , 558.56841406,
       544.21758588, 534.94481522, 476.07390048, 514.00778958,
       526.96806145, 602.85611131, 567.18091448, 461.84799223,
       566.20419847, 530.74588598, 516.15389998, 580.58381444,
       601.54997612, 596.56400596, 508.03912384, 534.53938121,
       566.56317157, 598.77725636])

In [104]:
sales_store_C

array([576.04128811, 590.71705117, 544.6832513 , 540.1896688 ,
       640.62629112, 667.81200143, 596.39949392, 650.17664489,
       618.08180125, 567.74401227, 618.06978028, 676.90182832,
       598.20869804, 678.23218279, 469.0127448 , 641.09512522,
       604.35235341, 585.04963248, 604.58803883, 500.62155427,
       589.01640561, 617.85562858, 673.89470224, 574.08648909,
       559.57531986, 574.91214782, 645.77010589, 616.43755548,
       573.51198981, 625.66337166])

In [105]:
# Combine the data
all_sales_data = np.concatenate([sales_store_A, sales_store_B, sales_store_C])

In [106]:
all_sales_data

array([524.83570765, 493.08678494, 532.38442691, 576.15149282,
       488.29233126, 488.29315215, 578.96064078, 538.37173646,
       476.5262807 , 527.12800218, 476.82911536, 476.71351232,
       512.09811358, 404.33598777, 413.75410837, 471.88562354,
       449.35844398, 515.71236663, 454.59879622, 429.38481493,
       573.28243845, 488.71118498, 503.37641023, 428.76259069,
       472.78086377, 505.54612949, 442.45032113, 518.78490092,
       469.9680655 , 485.41531251, 519.91466939, 642.61390923,
       549.32513876, 497.11445355, 591.12724561, 488.9578175 ,
       560.44317975, 452.01649381, 483.59069756, 559.84306179,
       586.923329  , 558.56841406, 544.21758588, 534.94481522,
       476.07390048, 514.00778958, 526.96806145, 602.85611131,
       567.18091448, 461.84799223, 566.20419847, 530.74588598,
       516.15389998, 580.58381444, 601.54997612, 596.56400596,
       508.03912384, 534.53938121, 566.56317157, 598.77725636,
       576.04128811, 590.71705117, 544.6832513 , 540.18

In [107]:

# Create group labels sized from the data instead of hard-coding 30
group_labels = ['Store A'] * len(sales_store_A) + ['Store B'] * len(sales_store_B) + ['Store C'] * len(sales_store_C)

In [108]:
group_labels

['Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store A',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store B',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'Store C',
 'St

In [109]:
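# Perform one-way ANOVA.
# Note: f_oneway treats the three stores as independent groups. It does not
# account for the same 30 days being measured at every store, as a repeated
# measures design would.
f_statistic, p_value = f_oneway(sales_store_A, sales_store_B, sales_store_C)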

In [110]:
# Display the results of the ANOVA
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")

F-statistic: 40.97563597701801
P-value: 2.893768135071658e-13


In [111]:
# Interpret the results of the ANOVA
alpha = 0.05
if p_value < alpha:
    print("The one-way ANOVA is statistically significant.")
    print("There are significant differences in daily sales between at least two stores.")
else:
    print("The one-way ANOVA is not statistically significant.")
    print("There is not enough evidence to conclude significant differences in daily sales between the stores.")

The one-way ANOVA is statistically significant.
There are significant differences in daily sales between at least two stores.
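
The question asks for a repeated measures ANOVA, whereas f_oneway above ignores the fact that all three stores are observed on the same 30 days. The sketch below is one way the repeated measures version could be computed by hand: the total variation is split into a store effect, a day (subject) effect, and residual error, and the store effect is tested against the residual. The function name repeated_measures_anova is hypothetical, and the code assumes that element i of each array belongs to the same day. statsmodels also offers AnovaRM for this kind of analysis.

```python
import numpy as np
from scipy.stats import f as f_dist

def repeated_measures_anova(data):
    """One-way repeated measures ANOVA (hypothetical helper for illustration).

    data: 2-D array with one row per subject (here: day) and one column
    per condition (here: store).
    Returns (F, df_conditions, df_error, p_value).
    """
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    grand_mean = data.mean()

    # Partition the total sum of squares
    ss_conditions = n * np.sum((data.mean(axis=0) - grand_mean) ** 2)
    ss_subjects = k * np.sum((data.mean(axis=1) - grand_mean) ** 2)
    ss_total = np.sum((data - grand_mean) ** 2)
    ss_error = ss_total - ss_conditions - ss_subjects

    df_conditions = k - 1
    df_error = (k - 1) * (n - 1)
    f_stat = (ss_conditions / df_conditions) / (ss_error / df_error)
    p = f_dist.sf(f_stat, df_conditions, df_error)
    return f_stat, df_conditions, df_error, p

# Recreate the simulated sales exactly as in the cells above
np.random.seed(42)
sales_store_A = np.random.normal(loc=500, scale=50, size=30)
sales_store_B = np.random.normal(loc=550, scale=50, size=30)
sales_store_C = np.random.normal(loc=600, scale=50, size=30)

# Rows are days, columns are stores
sales = np.column_stack([sales_store_A, sales_store_B, sales_store_C])
rm_f, rm_df1, rm_df2, rm_p = repeated_measures_anova(sales)
print(f"Repeated measures F({rm_df1}, {rm_df2}) = {rm_f:.4f}, p-value = {rm_p:.3g}")
```

Because the simulated stores are generated independently, there is no real day-to-day effect here. With actual sales data, removing the day effect from the error term typically makes the test more sensitive. This sketch also assumes sphericity, which it does not check.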


In [112]:
# Perform post-hoc test (Tukey's HSD) if the ANOVA is significant.
# Note: Tukey's HSD, like the one-way ANOVA, assumes the groups are independent.
if p_value < alpha:
    # Perform Tukey's HSD post-hoc test
    tukey_results = pairwise_tukeyhsd(all_sales_data, group_labels)

    # Display the results of the post-hoc test
    print(tukey_results.summary())

  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1  group2 meandiff p-adj   lower   upper   reject
-------------------------------------------------------
Store A Store B  53.3492 0.0001 24.3572  82.3413   True
Store A Store C 110.0516    0.0 81.0595 139.0437   True
Store B Store C  56.7024    0.0 27.7103  85.6944   True
-------------------------------------------------------
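
Tukey's HSD assumes independent groups. For a repeated measures design, a common alternative is pairwise paired t-tests with a Bonferroni correction. The sketch below assumes that element i of each store's array belongs to the same day. The dictionary layout and variable names are illustrative choices, not part of the original notebook.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_rel

# Recreate the simulated sales in the same order as the cells above
np.random.seed(42)
sales = {
    'Store A': np.random.normal(loc=500, scale=50, size=30),
    'Store B': np.random.normal(loc=550, scale=50, size=30),
    'Store C': np.random.normal(loc=600, scale=50, size=30),
}

alpha = 0.05
pairs = list(combinations(sales, 2))
posthoc = {}
for first, second in pairs:
    # Paired t-test: compares the two stores day by day
    t_stat, p_raw = ttest_rel(sales[first], sales[second])
    # Bonferroni correction: multiply by the number of comparisons, cap at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    posthoc[(first, second)] = (t_stat, p_adj, p_adj < alpha)
    print(f"{first} vs {second}: t = {t_stat:.4f}, adjusted p = {p_adj:.3g}, reject = {p_adj < alpha}")
```

Bonferroni is conservative. Holm's step-down procedure is a frequently used, less strict alternative.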
