In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.
Ans.: ANOVA (Analysis of Variance) is a statistical method used to compare the means of two or more groups to determine if there are significant 
differences between them. To use ANOVA effectively, certain assumptions need to be met. These assumptions are as follows:

Independence: The observations should be independent of one another, both within each group and across groups. In other words, no data point should 
be influenced by or dependent on any other data point.

Normality: The distribution of the dependent variable should be approximately normal within each group. Equivalently, the residuals 
(the differences between the observed values and their group means) should follow a normal distribution.

Homogeneity of variances: The variances of the dependent variable should be approximately equal across all groups. This assumption is known as
homoscedasticity. It means that the spread of the data should be similar for each group.

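Homogeneity of group sizes: Strictly speaking, this is a design recommendation rather than a formal assumption. ANOVA can be run with unequal sample 
sizes, but roughly equal group sizes make the F-test more robust to violations of the homogeneity of variances assumption.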

Violations of these assumptions can impact the validity of the ANOVA results. Here are examples of violations for each assumption:

Independence: Violations of independence occur when there is a dependence between observations within or between groups. For example, the 
same individuals may be included in multiple groups, or observations may be clustered (such as students sampled from the same classroom or repeated 
measurements taken on the same person), so that observations within a cluster are more similar to each other than to the rest of their group.

Normality: If the data does not follow a normal distribution within each group, the ANOVA results may be affected. Violations can occur when the 
data is highly skewed or has heavy tails. This can happen, for instance, when dealing with count data that often exhibit a skewed distribution.

Homogeneity of variances: When the variances of the dependent variable differ substantially across groups, the assumption of homogeneity 
of variances is violated. This is known as heteroscedasticity. It arises when one group has much larger variability than the others, for example when 
the spread of the data in one group is several times wider than the spread in the other groups.

Homogeneity of group sizes: Unequal sample sizes do not invalidate ANOVA on their own, but they reduce statistical power and make the F-test more 
sensitive to unequal variances. When the smaller groups also have the larger variances, the Type I error rate can be inflated.
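
The checks described above can be sketched with descriptive diagnostics. This is a minimal illustration using only the standard library, with made-up 
groups; the `skewness` helper and the group names are assumptions for the example. Formal tests such as `scipy.stats.shapiro` (normality) and 
`scipy.stats.levene` (equal variances) would normally be used alongside plots.

```python
from statistics import mean, variance


def skewness(values):
    """Moment-based sample skewness (0 for perfectly symmetric data)."""
    m = mean(values)
    n = len(values)
    m2 = sum((x - m) ** 2 for x in values) / n
    m3 = sum((x - m) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Hypothetical groups chosen to show a symmetric, a skewed, and a high-variance sample
groups = {
    "symmetric": [4, 5, 6, 5, 4, 6, 5],
    "skewed": [1, 1, 2, 1, 3, 1, 12],
    "wide": [-10, 0, 10, 20, -5, 15, 5],
}

# Per-group sample variance and skewness
variances = {name: variance(values) for name, values in groups.items()}
skews = {name: skewness(values) for name, values in groups.items()}

# Ratio of the largest to the smallest group variance; a large ratio hints at heteroscedasticity
variance_ratio = max(variances.values()) / min(variances.values())

for name in groups:
    print(name, "variance:", round(variances[name], 2), "skewness:", round(skews[name], 2))
print("Largest / smallest variance:", round(variance_ratio, 1))
```

A skewness far from zero points to a possible normality problem, and a large variance ratio points to a possible homogeneity problem; neither number 
is a formal test.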

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?
Ans.: The three types of ANOVA are:

One-Way ANOVA: One-Way ANOVA is used when you have one categorical independent variable (also known as a factor) and one continuous dependent 
variable. It compares the means of the groups defined by that factor, usually three or more, since with only two groups it is equivalent to an
independent two-sample t-test. For example, you might use One-Way ANOVA to compare the average test scores of 
students from different schools or to compare the effectiveness of different medications on a particular condition.

Two-Way ANOVA: Two-Way ANOVA is used when you have two categorical independent variables (factors) and one continuous dependent variable. It allows 
you to examine the interaction effect between the two independent variables and their main effects on the dependent variable. Two-Way ANOVA is 
appropriate when you want to analyze the effects of two independent variables simultaneously. For example, you might use Two-Way ANOVA to investigate
the impact of both gender and age group on the performance of individuals in a cognitive test.

Repeated Measures ANOVA: Repeated Measures ANOVA (also known as Within-Subjects ANOVA) is used when you have one group of participants measured on 
the same dependent variable under different conditions or at different time points. It compares the means of the dependent variable across the 
repeated measures while accounting for the fact that the measurements come from the same participants. It is appropriate when you want to analyze 
change within the same group over time or under different conditions. For example, you might use Repeated Measures ANOVA to examine how students'
test scores change by measuring the same students before, during, and after a teaching intervention.

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
Ans.: The partitioning of variance in ANOVA refers to the decomposition of the total variance observed in a dataset into different sources of 
variation. It is an essential concept in ANOVA as it allows us to understand and quantify the contributions of various factors to the overall 
variability in the data.

In a one-way ANOVA, the total variability (the total sum of squares) is partitioned into two components: the between-group variance and the within-group variance.

Between-group variance: This component of variance represents the variability among the group means. It indicates the differences between the group 
means and reflects the influence of the independent variable or factors being studied. A larger between-group variance suggests greater differences
between the groups.

Within-group variance: This component of variance represents the variability within each group. It accounts for the individual differences within the 
groups and the random variation in the data. A smaller within-group variance indicates less variability within the groups.

By partitioning the total variance into these two components, ANOVA provides a way to assess the relative importance of the independent variable(s) 
in explaining the variability in the dependent variable.

Understanding the partitioning of variance is important for several reasons:

Identification of significant effects: ANOVA helps determine whether the between-group variance is significantly larger than the within-group 
variance. If the between-group variance is significantly larger, it suggests that the independent variable(s) have a significant effect on the 
dependent variable.

Comparison of group means: ANOVA allows for the comparison of group means by analyzing the differences between the means of the groups relative 
to the variability within the groups. A significant result indicates that at least one group mean differs from the others; identifying which specific 
groups differ requires follow-up post-hoc tests.

Interpretation of results: By understanding the partitioning of variance, researchers can provide a more nuanced interpretation of their results. 
They can quantify the amount of variance explained by the independent variable(s) and determine the proportion of variance that is due to random 
variation or other unexplained factors.

Design optimization: Partitioning the variance can also aid in experimental design optimization. It helps identify which factors contribute the 
most to the overall variability, allowing researchers to focus their efforts on the most influential variables and potentially improve the efficiency 
of future studies.
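
The partition described above can be written out for a one-way design. The notation is an assumption introduced here for illustration: k groups, 
n_i observations in group i, N observations in total, y_ij the j-th observation in group i, \bar{y}_i the mean of group i, and \bar{y} the grand mean.

```latex
\begin{align*}
SS_{\text{total}}   &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2 \\
SS_{\text{between}} &= \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2 \\
SS_{\text{within}}  &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \\
SS_{\text{total}}   &= SS_{\text{between}} + SS_{\text{within}} \\
F &= \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)},
\qquad
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}
\end{align*}
```

The F-ratio compares the two mean squares, and eta squared gives the proportion of the total variability that is attributable to group membership.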

In [1]:
"""Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?
Ans.:"""

import scipy.stats as stats

# Data for the groups
group1 = [1, 2, 3, 4, 5]
group2 = [2, 4, 6, 8, 10]
group3 = [3, 6, 9, 12, 15]

# Combining the groups
data = [group1, group2, group3]

# Computing the necessary sums of squares
n_groups = len(data)
n_total = sum(len(group) for group in data)
grand_mean = sum(sum(group) for group in data) / n_total

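# Calculate SST (total variation of every observation around the grand mean)
sst = sum(sum((x - grand_mean) ** 2 for x in group) for group in data)

# Calculate SSE (explained variation: group means around the grand mean, weighted by group size)
group_means = [sum(group) / len(group) for group in data]
sse = sum(len(group) * (group_mean - grand_mean) ** 2 for group, group_mean in zip(data, group_means))

# Calculate SSR (residual variation: observations around their own group mean)
ssr = sum(sum((x - group_mean) ** 2 for x in group) for group, group_mean in zip(data, group_means))

# Printing the results (SST = SSE + SSR)
print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)


Total Sum of Squares (SST): 230.0
Explained Sum of Squares (SSE): 90.0
Residual Sum of Squares (SSR): 140.0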
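
As a follow-up sketch, the sums of squares can be turned into the F-statistic by dividing each by its degrees of freedom. The block below repeats the 
same three groups so that it runs on its own; the names `ss_between` and `ss_within` are chosen here to make explicit which component is the 
between-group (explained) part and which is the within-group (residual) part.

```python
# Same illustrative groups as in the cell above
groups = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [3, 6, 9, 12, 15]]

k = len(groups)                           # number of groups
n_total = sum(len(g) for g in groups)     # total number of observations
grand_mean = sum(sum(g) for g in groups) / n_total
group_means = [sum(g) / len(g) for g in groups]

# Between-group (explained) and within-group (residual) sums of squares
ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_statistic = ms_between / ms_within
print("F-statistic:", f_statistic)
```

The p-value would then come from the F distribution with (k - 1, N - k) degrees of freedom, for example via `scipy.stats.f.sf`.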


In [3]:
"""Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?
Ans.: """
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Data for the groups
group1 = [1, 2, 3, 4, 5]
group2 = [2, 4, 6, 8, 10]
group3 = [3, 6, 9, 12, 15]

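# Combining the groups and creating factor variables
# Note: Factor2 labels exactly the same three groups as Factor1, so the two factors are
# perfectly confounded. A genuine two-way ANOVA needs crossed factors, where every level
# of Factor1 occurs together with every level of Factor2.
data = group1 + group2 + group3
factor1 = ['Group1'] * len(group1) + ['Group2'] * len(group2) + ['Group3'] * len(group3)
factor2 = ['FactorA'] * len(group1) + ['FactorB'] * len(group2) + ['FactorC'] * len(group3)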

# Creating a DataFrame from the data
df = pd.DataFrame({'Data': data, 'Factor1': factor1, 'Factor2': factor2})

# Creating the ANOVA model
model = ols('Data ~ Factor1 + Factor2 + Factor1:Factor2', data=df).fit()
anova_table = sm.stats.anova_lm(model)

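# Extracting the sums of squares for the main effects and the interaction
# (the F-statistics and p-values are in the 'F' and 'PR(>F)' columns of anova_table).
# Because the factors are confounded in this toy data, only the Factor1 value is
# meaningful; the Factor2 and interaction values should not be interpreted.
main_effect_factor1 = anova_table['sum_sq']['Factor1']
main_effect_factor2 = anova_table['sum_sq']['Factor2']
interaction_effect = anova_table['sum_sq']['Factor1:Factor2']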

# Printing the results
print("Main Effect for Factor 1:", main_effect_factor1)
print("Main Effect for Factor 2:", main_effect_factor2)
print("Interaction Effect:", interaction_effect)


Main Effect for Factor 1: 90.0
Main Effect for Factor 2: 1.0087082248128232
Interaction Effect: 15.648878391991158
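
As a complement to the statsmodels call above, the following sketch computes the main-effect and interaction sums of squares by hand for a balanced, 
fully crossed design. The data, the factor levels (`A1`, `A2`, `B1` to `B3`), and the variable names are hypothetical; the formulas used here are 
only valid when every cell holds the same number of observations.

```python
from statistics import mean

# Hypothetical balanced 2 x 3 design: two observations in every (A, B) cell
cells = {
    ("A1", "B1"): [1, 3], ("A1", "B2"): [2, 4], ("A1", "B3"): [3, 5],
    ("A2", "B1"): [4, 6], ("A2", "B2"): [7, 9], ("A2", "B3"): [10, 12],
}
levels_a = sorted({a for a, _ in cells})
levels_b = sorted({b for _, b in cells})
n_per_cell = 2

all_values = [x for values in cells.values() for x in values]
grand_mean = mean(all_values)
cell_mean = {key: mean(values) for key, values in cells.items()}

# Marginal means for each level of factor A and factor B
mean_a = {a: mean([x for (ka, _), v in cells.items() if ka == a for x in v]) for a in levels_a}
mean_b = {b: mean([x for (_, kb), v in cells.items() if kb == b for x in v]) for b in levels_b}

# Main effects: marginal means around the grand mean
ss_a = n_per_cell * len(levels_b) * sum((mean_a[a] - grand_mean) ** 2 for a in levels_a)
ss_b = n_per_cell * len(levels_a) * sum((mean_b[b] - grand_mean) ** 2 for b in levels_b)

# Interaction: what is left of each cell mean after removing both main effects
ss_ab = n_per_cell * sum(
    (cell_mean[(a, b)] - mean_a[a] - mean_b[b] + grand_mean) ** 2
    for a in levels_a for b in levels_b
)

# Residual: observations around their own cell mean
ss_within = sum((x - cell_mean[key]) ** 2 for key, values in cells.items() for x in values)
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

print("SS A:", ss_a, "SS B:", ss_b, "SS AxB:", ss_ab, "SS within:", ss_within, "SS total:", ss_total)
```

In a balanced design the four components add up to the total sum of squares, which is a useful sanity check on any two-way ANOVA table.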


In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?
Ans.: In a one-way ANOVA, the F-statistic is used to test the null hypothesis that the means of the groups are equal. The associated p-value 
indicates the probability of obtaining the observed F-statistic (or a more extreme value) under the assumption that the null hypothesis is true.

In this case, the obtained F-statistic is 5.23 and the p-value is 0.02.

Based on these results, we can conclude the following:

Differences between the groups: Since the p-value (0.02) is less than the commonly used significance level of 0.05, we reject the null hypothesis. 
This suggests that there are statistically significant differences between the means of the groups.

Interpretation: The F-statistic of 5.23 means that the variability between the group means is about 5.23 times the variability within the groups. 
The p-value of 0.02 means that, if the null hypothesis were true, an F-statistic at least this large would be expected only about 2% of the time. 
We therefore have evidence that not all group means are equal. The test does not show which groups differ or how large the differences are, so a 
post-hoc test and an effect size (such as eta squared) are needed to complete the interpretation.
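
The meaning of the p-value can be illustrated with a permutation sketch: shuffle the group labels many times and count how often the shuffled data 
give an F-statistic at least as large as the observed one. This is not how the ANOVA p-value is normally computed (that uses the F distribution); 
the data, the function names, the number of permutations, and the seed are all assumptions made for the illustration.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F: between-group mean square over within-group mean square."""
    values = [x for g in groups for x in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (len(groups) - 1)) / (ss_within / (len(values) - len(groups)))


def permutation_p_value(groups, n_permutations=2000, seed=0):
    """Share of label shufflings whose F is at least as large as the observed F."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Cut the shuffled values back into groups of the original sizes
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero
    return observed, (hits + 1) / (n_permutations + 1)


# Hypothetical, clearly separated groups
groups = [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15], [21, 22, 23, 24, 25]]
observed_f, p_value = permutation_p_value(groups)
print("Observed F:", observed_f)
print("Permutation p-value:", p_value)
```

A small p-value here says the same thing as in the answer above: group labels assigned at random would rarely produce differences this large.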

In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?
Ans.: Handling missing data in a repeated measures ANOVA is an important consideration to ensure the validity and reliability of the results. There 
are several approaches to handle missing data in this context, and the choice of method can have consequences on the analysis and interpretation. 
Here are some common methods and their potential consequences:

Complete Case Analysis (Listwise deletion): This method involves excluding cases with missing data from the analysis. It analyzes only the cases with 
complete data. The potential consequences of this approach are reduced sample size and potential bias if the missing data are not missing completely 
at random (MCAR). It can lead to inefficient use of available data and loss of statistical power.

Pairwise deletion (Available Case Analysis): In this method, each analysis uses all available data points for that particular comparison, excluding 
only the specific cases with missing data for that particular comparison. The potential consequence is that different comparisons use different sample
sizes, which can affect the precision and power of the analysis. However, it retains more information compared to complete case analysis.

Mean imputation: Mean imputation replaces missing values with the mean of the available data for that variable. It is only defensible when the 
values are missing completely at random (MCAR), and even then it underestimates the variability because every imputed value sits exactly at the 
mean. It also weakens the relationships between variables and produces standard errors that are too small, which can lead to overly optimistic 
p-values.

Multiple imputation: Multiple imputation generates several plausible values for each missing observation based on the observed data, which 
produces multiple complete datasets. Each dataset is analyzed with the repeated measures ANOVA, and the results are pooled (for example, with 
Rubin's rules). Multiple imputation accounts for the uncertainty associated with missing data and provides unbiased estimates under the assumption 
that data are missing at random (MAR). It yields valid standard errors and allows for valid statistical inferences. However, it can be computationally 
intensive and requires careful consideration of the imputation model.

Maximum likelihood estimation: This method estimates the model parameters by maximizing the likelihood of the observed data, typically by fitting a 
linear mixed-effects model instead of a classical repeated measures ANOVA, so that participants with some missing measurements still contribute 
their available data. It provides unbiased estimates when the data are MAR and the model is correctly specified, but it requires more complex model 
specifications and can be computationally demanding.
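
The consequences of the two simplest methods can be seen in a small sketch. The wide-format scores below (one row per subject, one column per time 
point, `None` for a missing value) are hypothetical and only meant to show the loss of subjects under listwise deletion and the shrinking variance 
under mean imputation.

```python
from statistics import mean, variance

# Hypothetical repeated measures data: subject -> scores at three time points
scores = {
    "s1": [10, 12, 14],
    "s2": [11, None, 15],
    "s3": [9, 11, 13],
    "s4": [12, 14, None],
    "s5": [10, 13, 16],
}
n_times = 3

# Listwise deletion: keep only the subjects with no missing value at any time point
complete_cases = {s: row for s, row in scores.items() if None not in row}

# Mean imputation: replace each missing value with the mean of its time point
column_means = [
    mean([row[t] for row in scores.values() if row[t] is not None])
    for t in range(n_times)
]
imputed = {
    s: [column_means[t] if row[t] is None else row[t] for t in range(n_times)]
    for s, row in scores.items()
}

# Compare the spread of the second time point before and after imputation
observed_t2 = [row[1] for row in scores.values() if row[1] is not None]
imputed_t2 = [row[1] for row in imputed.values()]

print("Subjects kept by listwise deletion:", len(complete_cases), "of", len(scores))
print("Variance at time 2, observed values only:", variance(observed_t2))
print("Variance at time 2, after mean imputation:", variance(imputed_t2))
```

The mean at each time point is unchanged by mean imputation, but the variance drops, which is the source of the overly narrow standard errors 
mentioned above.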

In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.
Ans.: After conducting an ANOVA and finding a significant difference among groups, post-hoc tests are often employed to determine which specific group
differences are significant. Some common post-hoc tests used after ANOVA include:

Tukey's Honestly Significant Difference (HSD) test: This test is used when you have conducted a one-way ANOVA and want to perform pairwise comparisons 
between all possible combinations of groups. It controls the familywise error rate and provides simultaneous confidence intervals for all pairwise 
comparisons.

Bonferroni correction: This method is a conservative approach to control the familywise error rate. It divides the desired significance level 
(e.g., 0.05) by the number of comparisons to determine the adjusted significance level for each comparison. It is simple and works with any kind of 
test, and it is best suited to a small number of planned comparisons, because it loses power quickly as the number of comparisons grows.

Dunnett's test: This test is used when you have one control group and want to compare it to multiple treatment groups. It controls the familywise
error rate while allowing for multiple comparisons against a single control group.

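Scheffé's test: This test is the most conservative of the common approaches and can be used for all possible contrasts after ANOVA, not only pairwise
comparisons. It provides wider confidence intervals to maintain the familywise error rate at the desired level. Scheffé's test is especially useful
for complex or unplanned contrasts (for example, comparing the average of two groups against a third); for pairwise comparisons alone, Tukey's HSD is
more powerful.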

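Fisher's Least Significant Difference (LSD) test: This test is used when you have conducted a one-way ANOVA and want to perform pairwise comparisons.
It is less conservative than other post-hoc tests but does not control the familywise error rate. It is often used for exploratory purposes or when
the number of comparisons is small.

Example: Suppose a one-way ANOVA comparing four fertilizers finds a significant difference in mean crop yield. The ANOVA only shows that at least one
fertilizer differs, so a post-hoc test such as Tukey's HSD is needed to determine which specific pairs of fertilizers differ.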
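
The Bonferroni adjustment described above can be sketched in a few lines. The group names and the unadjusted p-values are hypothetical placeholders; 
in practice the p-values would come from the individual pairwise tests.

```python
from itertools import combinations

groups = ["A", "B", "C", "D"]
alpha = 0.05

# All pairwise comparisons: k * (k - 1) / 2 of them for k groups
pairs = list(combinations(groups, 2))
adjusted_alpha = alpha / len(pairs)

# Hypothetical unadjusted p-values, one per pairwise comparison
raw_p_values = {
    ("A", "B"): 0.004, ("A", "C"): 0.030, ("A", "D"): 0.0005,
    ("B", "C"): 0.200, ("B", "D"): 0.012, ("C", "D"): 0.008,
}

# A comparison is declared significant only if it beats the adjusted threshold
significant = [pair for pair in pairs if raw_p_values[pair] < adjusted_alpha]

print("Number of comparisons:", len(pairs))
print("Adjusted significance level:", adjusted_alpha)
print("Significant pairs:", significant)
```

With four groups there are six comparisons, so each one is tested at 0.05 / 6; comparisons such as A vs C (p = 0.030) that would pass at 0.05 no 
longer count as significant.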

In [5]:
"""Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.
Ans.: To conduct a one-way ANOVA in Python and determine if there are any significant differences between the mean weight loss of three diets 
(A, B, and C), you can use the scipy.stats module. Here's an example code snippet:
"""

import scipy.stats as stats

# Weight loss data for the three diets
diet_A = [2.5, 3.1, 1.8, 2.9, 3.5, 2.3, 1.9, 2.6, 3.2, 2.7, 2.0, 1.5, 2.8, 2.4, 3.3, 3.0, 2.1, 2.2, 2.6, 3.4,
          2.7, 2.9, 2.3, 2.6, 2.1, 2.5, 2.8, 3.1, 1.9, 2.2, 2.7, 2.4, 2.0, 3.2, 2.6, 2.3, 2.7, 3.1, 2.9, 2.5,
          1.8, 2.0, 2.4, 2.8, 2.6, 3.0, 2.2, 2.1, 2.7, 2.5]
diet_B = [2.0, 2.1, 2.2, 1.9, 2.3, 1.8, 2.4, 2.5, 2.6, 1.7, 2.7, 2.8, 2.9, 2.5, 1.5, 2.3, 1.6, 1.9, 2.2, 2.0,
          2.1, 2.4, 1.7, 2.3, 2.6, 1.9, 2.2, 2.7, 2.0, 1.8, 1.6, 2.5, 2.8, 1.5, 1.7, 2.4, 2.6, 2.3, 1.9, 2.2,
          2.1, 2.0, 2.9, 2.5, 1.8, 2.6, 1.9, 1.6]
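diet_C = [1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 1.6, 1.7, 1.8, 1.9,
          2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.8]

# Conducting the one-way ANOVA (f_oneway accepts groups of unequal size)
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
# Interpreting the result at the 0.05 significance level
print("At least one diet mean differs" if p_value < 0.05 else "No significant difference detected")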


In [7]:
"""Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.
Ans.: """

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Time data for the three software programs and two experience levels
program_A_novice = [12, 15, 14, 16, 13, 11, 17, 15, 12, 14, 13, 16, 18, 14, 15, 16, 13, 15, 11, 12, 14, 13, 16, 15, 14, 13, 12, 15, 13, 14]
program_B_novice = [14, 16, 15, 17, 13, 12, 18, 16, 14, 15, 13, 16, 19, 15, 16, 17, 14, 16, 12, 13, 15, 14, 17, 16, 15, 14, 13, 16, 14, 15]
program_C_novice = [13, 15, 14, 16, 12, 11, 17, 15, 13, 14, 12, 15, 18, 14, 15, 16, 13, 15, 11, 12, 14, 13, 16, 15, 14, 12, 11, 13, 12, 14]
program_A_experienced = [11, 13, 12, 14, 10, 9, 15, 13, 11, 12, 10, 14, 16, 12, 13, 14, 11, 13, 9, 10, 12, 11, 14, 13, 12, 11, 10, 13, 11, 12]
program_B_experienced = [13, 15, 14, 16, 12, 11, 17, 15, 13, 14, 12, 15, 18, 14, 15, 16, 13, 15, 11, 12, 14, 13, 16, 15, 14, 12, 11, 13, 12, 14]
program_C_experienced = [10, 12, 11, 13, 9, 8, 14, 12, 10, 11, 9, 13, 15, 11, 12, 13, 10, 12, 8, 9, 11, 10, 13, 12, 11, 10, 9, 12, 10, 11]

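# Combining the data and creating factor variables (labels follow the order of concatenation)
import pandas as pd

time_data = program_A_novice + program_B_novice + program_C_novice + program_A_experienced + program_B_experienced + program_C_experienced
software_program = (['Program A'] * 30 + ['Program B'] * 30 + ['Program C'] * 30) * 2
experience_level = ['Novice'] * 90 + ['Experienced'] * 90

df = pd.DataFrame({'Time': time_data, 'Program': software_program, 'Experience': experience_level})

# Fitting the two-way ANOVA model with both main effects and their interaction
model = ols('Time ~ C(Program) * C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Reporting the F-statistics ('F') and p-values ('PR(>F)'); an effect with a p-value below 0.05 is significant
print(anova_table)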


In [9]:
"""Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other."""

import scipy.stats as stats

# Test scores for the control group and experimental group
control_group = [75, 82, 90, 78, 80, 73, 88, 79, 84, 87, 76, 81, 85, 77, 83, 79, 75, 78, 82, 80, 84, 81, 76, 83, 79, 75, 82, 80, 81, 78, 87, 84, 79, 76, 85, 83, 80, 82, 88, 78, 83, 79, 76, 81, 75, 80, 84, 77, 83, 79, 75, 82]
experimental_group = [83, 88, 92, 85, 86, 81, 90, 85, 87, 90, 84, 89, 92, 86, 88, 85, 83, 85, 88, 86, 90, 89, 84, 88, 85, 83, 87, 86, 89, 88, 92, 90, 85, 84, 91, 89, 86, 88, 90, 84, 87, 85, 83, 88, 86, 82, 87, 85, 83, 89]

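# Conducting a two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Reporting the results
print("T-statistic:", t_statistic)
print("p-value:", p_value)

# Interpretation: a p-value below 0.05 indicates that the mean scores of the two groups differ, and a
# negative t-statistic means the control group's mean is lower than the experimental group's mean.
# With only two groups, the t-test already identifies the pair that differs, so no post-hoc test is needed.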


T-statistic: -9.243553168829445
p-value: 4.50521919361475e-15
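
The reason no post-hoc test is needed with two groups can be shown with a short sketch: for two groups, the one-way ANOVA F-statistic equals the 
square of the pooled-variance t-statistic, so both tests answer the same question. The scores below are hypothetical and the calculation uses only 
the standard library.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical scores for two small groups
control = [70, 72, 68, 75, 71]
treatment = [78, 80, 77, 83, 79]
n1, n2 = len(control), len(treatment)

# Pooled-variance (equal variances assumed) two-sample t-statistic
pooled_var = ((n1 - 1) * variance(control) + (n2 - 1) * variance(treatment)) / (n1 + n2 - 2)
t_statistic = (mean(control) - mean(treatment)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# One-way ANOVA F-statistic for the same two groups
grand_mean = mean(control + treatment)
ss_between = n1 * (mean(control) - grand_mean) ** 2 + n2 * (mean(treatment) - grand_mean) ** 2
ss_within = (n1 - 1) * variance(control) + (n2 - 1) * variance(treatment)
f_statistic = (ss_between / 1) / (ss_within / (n1 + n2 - 2))

print("t-statistic:", t_statistic)
print("t squared:", t_statistic ** 2)
print("F-statistic:", f_statistic)
```

Because there is only one pair of groups, a significant result already says which groups differ; post-hoc tests only become necessary with three or 
more groups.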


In [None]:
"""Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) 
differ significantly from each other."""

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sales data for the three stores
store_A = [100, 120, 110, 130, 105, 115, 125, 135, 120, 130, 110, 100, 105, 115, 125, 135, 120, 130, 110, 100, 105, 115, 125, 135, 120, 130, 110, 100, 105]
store_B = [90, 95, 85, 100, 105, 95, 90, 100, 85, 95, 100, 105, 95, 90, 100, 85, 95, 100, 105, 95, 90, 100, 85, 95, 100, 105, 95, 90, 100, 85]
store_C = [80, 85, 75, 80, 90, 85, 80, 90, 75, 85, 80, 90, 85, 80, 90, 75, 85, 80, 90, 85, 80, 90, 75, 85, 80, 90, 85, 80, 90, 75]

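# Creating a data frame for the repeated measures ANOVA
import pandas as pd

# store_A holds 29 values while the other stores hold 30, so only the days
# recorded for all three stores are used to keep the design complete
n_days = min(len(store_A), len(store_B), len(store_C))

data = pd.DataFrame({
    'Sales': store_A[:n_days] + store_B[:n_days] + store_C[:n_days],
    'Store': ['Store A'] * n_days + ['Store B'] * n_days + ['Store C'] * n_days,
    'Day': list(range(1, n_days + 1)) * 3
})

# Conducting the repeated measures ANOVA: treating Day as a blocking factor removes
# day-to-day variation, so the Store row is the repeated measures F-test
# (no sphericity correction is applied)
model = ols('Sales ~ Store + C(Day)', data=data).fit()
anova_table = sm.stats.anova_lm(model)

# Reporting the results
print(anova_table)

# Post-hoc: paired t-tests between stores with a Bonferroni correction
from itertools import combinations
from scipy import stats

sales = {'Store A': store_A[:n_days], 'Store B': store_B[:n_days], 'Store C': store_C[:n_days]}
pairs = list(combinations(sales, 2))
for first, second in pairs:
    t_statistic, p_value = stats.ttest_rel(sales[first], sales[second])
    print(first, 'vs', second, '- t:', t_statistic, 'adjusted p:', min(p_value * len(pairs), 1.0))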
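
To make the structure of the model above more concrete, here is a hand calculation of a one-way repeated measures ANOVA on a tiny hypothetical 
dataset (4 days, 3 stores). The numbers and variable names are made up for illustration; the point is that the day-to-day variation is removed from 
the error term before the store effect is tested.

```python
from statistics import mean

# Hypothetical sales: one row per day, one column per store (A, B, C)
sales_by_day = [
    [10, 8, 6],
    [12, 9, 6],
    [11, 10, 9],
    [15, 13, 11],
]
n_days = len(sales_by_day)
n_stores = len(sales_by_day[0])

grand_mean = mean([x for row in sales_by_day for x in row])
store_means = [mean([row[s] for row in sales_by_day]) for s in range(n_stores)]
day_means = [mean(row) for row in sales_by_day]

# Variation between stores (the effect of interest)
ss_store = n_days * sum((m - grand_mean) ** 2 for m in store_means)
# Variation between days (the "subjects" in this design), removed from the error
ss_day = n_stores * sum((m - grand_mean) ** 2 for m in day_means)
# Total variation and the remaining error
ss_total = sum((x - grand_mean) ** 2 for row in sales_by_day for x in row)
ss_error = ss_total - ss_store - ss_day

df_store = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)
f_statistic = (ss_store / df_store) / (ss_error / df_error)

print("SS store:", ss_store, "SS day:", ss_day, "SS error:", ss_error)
print("F for store:", f_statistic)
```

If the day variation were left in the error term, as in an ordinary one-way ANOVA, the error sum of squares would be 46 instead of 4 and the store 
effect would be much harder to detect.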
