# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

ANOVA (Analysis of Variance) is a statistical test used to compare means of three or more groups simultaneously. It is an extension of the t-test, which is used for comparing means of two groups. ANOVA is based on several assumptions, and violating these assumptions can impact the validity of the results.

Assumptions for using ANOVA:

1. Independence: The observations are assumed to be independent of each other, both within each group and across groups. This means that no data point is influenced by or related to any other data point.

2. Normality: The data in each group are assumed to follow a normal distribution. ANOVA is relatively robust to moderate departures from normality, especially with larger sample sizes. However, severe departures from normality can affect the validity of the results.

3. Homogeneity of variance (Homoscedasticity): The variances of the data in each group are assumed to be equal. This means that the spread of data points around the group means should be similar for all groups.

4. Random sampling: The data should be obtained from random sampling from the respective populations.

Examples of violations and their impact on ANOVA results:

1. Non-independence: If the observations in one group are somehow related or dependent on the observations in another group, the independence assumption is violated. This can lead to increased Type 1 errors (false positives) and biased results. For example, in a study comparing the effects of different teaching methods on student performance, if the same students are assigned to multiple teaching methods, the independence assumption would be violated.

2. Non-normality: When the data in one or more groups deviate significantly from a normal distribution, the validity of ANOVA results can be compromised. Skewed or heavy-tailed distributions may lead to inaccurate p-values and incorrect conclusions. This can happen when sample sizes are small or if outliers are present in the data.

3. Heteroscedasticity: Violation of the assumption of equal variances across groups can lead to imprecise estimates of the group means and inflated Type 1 error rates. In other words, if the variances differ significantly between groups, the ANOVA results may be less reliable. This is especially critical when sample sizes are unequal.

4. Non-random sampling: If the data is collected in a non-random or biased manner, the generalizability of the ANOVA results may be limited. Biased sampling can lead to inaccurate population inferences.

When these assumptions are not met, researchers may need to consider alternative statistical tests or apply data transformations to address the issues. Non-parametric tests, such as the Kruskal-Wallis test, can be used when the normality assumption is violated, and Welch's ANOVA is a common choice when the variances are unequal. Additionally, bootstrapping methods and robust statistical techniques can be helpful in situations with non-normal data or heteroscedasticity. Careful consideration of the data and its characteristics is essential to ensure the validity and reliability of ANOVA results.
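
The normality and equal-variance assumptions can be checked before running ANOVA. The following is a minimal sketch, assuming three small made-up groups (the data and variable names are hypothetical); it uses `scipy.stats.shapiro`, `scipy.stats.levene`, and `scipy.stats.kruskal`.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three independent groups (illustrative only).
group_a = np.array([23.1, 25.4, 22.8, 26.0, 24.3, 25.1, 23.7, 24.9])
group_b = np.array([27.2, 26.5, 28.1, 25.9, 27.8, 26.9, 28.4, 27.0])
group_c = np.array([24.8, 30.2, 21.5, 33.0, 19.9, 29.4, 22.7, 31.1])
groups = {"A": group_a, "B": group_b, "C": group_c}

# Normality: Shapiro-Wilk test per group (null hypothesis: the data are normal).
normality_p = {name: stats.shapiro(values)[1] for name, values in groups.items()}

# Equal variances: Levene's test (null hypothesis: all group variances are equal).
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)

# Rank-based alternative to one-way ANOVA if normality looks doubtful.
kruskal_stat, kruskal_p = stats.kruskal(group_a, group_b, group_c)

print("Shapiro-Wilk p-values:", normality_p)
print("Levene p-value:", levene_p)
print("Kruskal-Wallis p-value:", kruskal_p)
```

A small Levene p-value, as expected here because group C is far more spread out than the others, suggests that the equal-variance assumption is doubtful for these data.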

# Q2. What are the three types of ANOVA, and in what situations would each be used?

ANOVA (Analysis of Variance) can be categorized into three main types based on the number of factors or independent variables involved:

1. One-Way ANOVA:
One-Way ANOVA is used when there is only one factor (independent variable) with three or more levels or groups. It is used to compare means among multiple groups and determine if there are significant differences between the group means. One-Way ANOVA is appropriate when you want to assess the effect of a single categorical variable on a continuous dependent variable.

Example situations for One-Way ANOVA:
- A study comparing the mean test scores of students from different schools (e.g., public, private, charter).
- Evaluating the effectiveness of three different teaching methods on student performance (e.g., traditional lecture, active learning, online modules).

2. Two-Way ANOVA:
Two-Way ANOVA is used when there are two factors (independent variables) with two or more levels each. It allows you to simultaneously examine the main effects of each factor and their interaction effect on the dependent variable. Two-Way ANOVA is appropriate when you want to study the combined effects of two categorical variables on a continuous dependent variable.

Example situations for Two-Way ANOVA:
- A drug trial comparing the effects of different drug dosages (Factor 1) and gender (Factor 2) on blood pressure.
- Studying the influence of two factors, such as temperature and humidity, on plant growth.

3. N-Way ANOVA (N-Way refers to three or more factors):
N-Way ANOVA extends the concept of Two-Way ANOVA to include three or more factors. It allows you to study the interactions and main effects of multiple independent variables on a continuous dependent variable simultaneously.

Example situations for N-Way ANOVA:
- Investigating the effects of temperature, pH level, and different fertilizer types on crop yield.
- Analyzing the impact of education level, income level, and region on consumer spending behavior.

It's essential to choose the appropriate type of ANOVA based on the research design and the number of factors under investigation. The choice of ANOVA depends on the specific research question and the experimental or observational setup. Each type of ANOVA provides valuable insights into the relationships between variables and helps researchers draw meaningful conclusions from their data.

# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Partitioning of variance in ANOVA refers to the process of decomposing the total variability in the data into different sources of variation. ANOVA achieves this decomposition by dividing the total sum of squares (SS) into several components, each representing a specific source of variability. Understanding this concept is crucial as it allows researchers to gain insights into the contributions of different factors to the overall variability in the data, facilitating the interpretation of results and helping draw meaningful conclusions from the analysis.

The partitioning of variance in ANOVA is typically done using the following components:

1. Total Sum of Squares (SST): SST represents the total variability in the data and is the sum of the squared differences between each data point and the overall mean. It measures how much the data points vary from the overall mean.

2. Between-Group Sum of Squares (SSB): SSB represents the variability between the group means. It is the sum of the squared differences between each group mean and the overall mean, with each squared difference weighted by the number of observations in that group. SSB measures how much the group means differ from each other.

3. Within-Group Sum of Squares (SSW): SSW represents the variability within each group. It is the sum of the squared differences between each data point and its respective group mean. SSW measures how much the data points within each group deviate from their group mean.

The relationship among these components can be expressed as:

\[ SST = SSB + SSW \]

The F-ratio is the mean square between groups (SSB divided by its degrees of freedom, k - 1) divided by the mean square within groups (SSW divided by N - k), where k is the number of groups and N is the total number of observations. It is used to test the null hypothesis that all group means are equal. If the group means differ substantially relative to the within-group variability, the F-ratio will be large, leading to rejection of the null hypothesis.
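
Written out for a one-way design, and assuming k groups with n_i observations in group i and N observations in total (this notation is introduced here for illustration), the components and the F-ratio take the following form:

\[ SSB = \sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2, \qquad SSW = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2 \]

\[ F = \frac{SSB / (k - 1)}{SSW / (N - k)} \]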

The importance of understanding the partitioning of variance in ANOVA includes:

1. Identifying sources of variation: By partitioning the variance, ANOVA allows researchers to identify which factors contribute significantly to the variability in the data. This helps in understanding the impact of different factors on the dependent variable.

2. Assessing significance: ANOVA allows researchers to test the statistical significance of the differences between group means. By comparing the between-group variability to the within-group variability, researchers can determine whether the observed differences are likely due to chance or are statistically significant.

3. Informing further analysis: Understanding the partitioning of variance can guide researchers in conducting post-hoc tests or follow-up analyses to explore specific group differences.

4. Comparing multiple groups simultaneously: ANOVA is particularly useful when comparing means among three or more groups, as it avoids the problem of conducting multiple pairwise comparisons, which can lead to an increased risk of Type 1 errors (false positives).

In conclusion, partitioning of variance in ANOVA provides valuable insights into the data, helps in making informed decisions, and allows researchers to draw meaningful conclusions about the relationships between variables under study.

# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [6]:
import numpy as np

group1 = [10,12,14,16,18]
group2 = [11,13,15,17,19]
group3 = [10,13,14,17,20]

all_data = np.concatenate((group1,group2,group3))
overall_mean = np.mean(all_data)

# Total sum of squares: SST = Σ(yᵢⱼ - ȳ)²
SST = np.sum((all_data - overall_mean) ** 2)

g1_mean = np.mean(group1)
g2_mean = np.mean(group2)
g3_mean = np.mean(group3)

# Explained (between-group) sum of squares: SSE = Σ nᵢ(ȳᵢ - ȳ)²
SSE = len(group1) * (g1_mean - overall_mean) ** 2 + \
      len(group2) * (g2_mean - overall_mean) ** 2 + \
      len(group3) * (g3_mean - overall_mean) ** 2

# Residual (within-group) sum of squares: SSR = SST - SSE
SSR = SST - SSE

# Rounded to suppress floating-point noise in the printed values
print("SST:", round(SST, 4))
print("SSE:", round(SSE, 4))
print("SSR:", round(SSR, 4))

SST: 141.6
SSE: 2.8
SSR: 138.8


In [7]:
import numpy as np
import scipy.stats as stats

group1 = [10,12,14,16,18]
group2 = [11,13,15,17,19]
group3 = [10,13,14,17,20]

one_way_ANOVA = stats.f_oneway(group1,group2,group3)
print("One-Way-Anova: " ,one_way_ANOVA)

One-Way-Anova:  F_onewayResult(statistic=0.12103746397694526, pvalue=0.887068742991468)
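
As a cross-check, the F-statistic reported by `f_oneway` can be rebuilt from the sums of squares. This is a minimal sketch on the same three groups; the names `ss_between`, `ss_within`, and `f_manual` are introduced here for illustration.

```python
import numpy as np
from scipy import stats

group1 = [10, 12, 14, 16, 18]
group2 = [11, 13, 15, 17, 19]
group3 = [10, 13, 14, 17, 20]
groups = [np.asarray(g, dtype=float) for g in (group1, group2, group3)]

all_data = np.concatenate(groups)
grand_mean = all_data.mean()
k = len(groups)          # number of groups
n_total = all_data.size  # total number of observations

# Between-group and within-group sums of squares.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares, F-ratio, and upper-tail p-value from the F distribution.
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print("Manual F:", f_manual, "p:", p_manual)
print("SciPy  F:", f_scipy, "p:", p_scipy)
```

The two F-statistics should agree up to floating-point error, which confirms that the F-ratio is the between-group mean square divided by the within-group mean square.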


# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [3]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = np.array([[10, 15, 20],
                 [12, 18, 24],
                 [14, 21, 28],
                 [16, 24, 32]])

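flattened_data = data.flatten()
factor_a = np.repeat([1, 2], 6)   # rows 0-1 -> level 1, rows 2-3 -> level 2
factor_b = np.tile([1, 2, 3], 4)  # columns -> levels 1, 2, 3
df = pd.DataFrame({'data': flattened_data, 'factor_a': factor_a, 'factor_b': factor_b})

# Note: factor_a and factor_b enter this formula as numeric covariates, so each
# term gets 1 degree of freedom; wrapping them as C(factor_a) and C(factor_b)
# would treat them as categorical factors instead.
formula = 'data ~ factor_a + factor_b + factor_a:factor_b'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)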

print(anova_table)


                   sum_sq   df          F    PR(>F)
factor_a            108.0  1.0  29.793103  0.000603
factor_b            338.0  1.0  93.241379  0.000011
factor_a:factor_b     8.0  1.0   2.206897  0.175699
Residual             29.0  8.0        NaN       NaN


In [4]:
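# These are the fitted regression coefficients (slopes) of the numeric model,
# used here as estimates of each effect.
main_effect_a = model.params['factor_a']
main_effect_b = model.params['factor_b']
interaction_effect = model.params['factor_a:factor_b']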

print("Main Effect of A:", main_effect_a)
print("Main Effect of B:", main_effect_b)
print("Interaction Effect:", interaction_effect)


Main Effect of A: 2.000000000000007
Main Effect of B: 3.5000000000000018
Interaction Effect: 1.999999999999993
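
For comparison, the sums of squares can also be computed directly when both factors are treated as categorical. This is a minimal sketch that assumes the balanced 2 x 3 layout used above with two replicates per cell; the variable names are introduced here for illustration.

```python
import numpy as np
from scipy import stats

data = np.array([[10, 15, 20],
                 [12, 18, 24],
                 [14, 21, 28],
                 [16, 24, 32]], dtype=float)

# Rows 0-1 belong to level 1 of factor A and rows 2-3 to level 2;
# columns are the three levels of factor B.
# Resulting axes: (A level, replicate, B level).
y = data.reshape(2, 2, 3)
n_a, n_rep, n_b = y.shape

grand = y.mean()
a_means = y.mean(axis=(1, 2))  # one mean per level of A
b_means = y.mean(axis=(0, 1))  # one mean per level of B
cell_means = y.mean(axis=1)    # shape (n_a, n_b)

# Sums of squares for a balanced two-way design with replication.
ss_a = n_b * n_rep * np.sum((a_means - grand) ** 2)
ss_b = n_a * n_rep * np.sum((b_means - grand) ** 2)
ss_ab = n_rep * np.sum((cell_means - a_means[:, None] - b_means[None, :] + grand) ** 2)
ss_resid = np.sum((y - cell_means[:, None, :]) ** 2)

df_a, df_b = n_a - 1, n_b - 1
df_ab = df_a * df_b
df_resid = n_a * n_b * (n_rep - 1)
ms_resid = ss_resid / df_resid

for name, ss, df in [("A", ss_a, df_a), ("B", ss_b, df_b), ("A:B", ss_ab, df_ab)]:
    f_value = (ss / df) / ms_resid
    p_value = stats.f.sf(f_value, df, df_resid)
    print(f"{name}: SS={ss:.1f}, df={df}, F={f_value:.3f}, p={p_value:.4f}")
```

The sums of squares match the table above, but factor B and the interaction now carry 2 degrees of freedom each and the residual carries 6, so the F-statistics and p-values differ from the numeric-covariate fit.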


# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

In a one-way ANOVA, the F-statistic is used to test whether there are statistically significant differences in means among three or more groups. The p-value associated with the F-statistic is the probability of obtaining an F-statistic at least this large if all group means were truly equal. A small p-value suggests that the differences between the groups are unlikely to be solely due to random variation.

In your case, you obtained an F-statistic of 5.23 and a p-value of 0.02. Here's how you can interpret these results:

1. **F-Statistic (5.23):** The F-statistic represents the ratio of the variation between group means to the variation within groups. A larger F-statistic indicates a larger difference between group means compared to the variability within each group. In your case, the calculated F-statistic suggests that there are differences between the group means.

2. **P-Value (0.02):** The p-value indicates the probability of observing differences between group means at least as extreme as those observed if the null hypothesis is true. The null hypothesis states that all group means are equal. A p-value of 0.02 means that, if the null hypothesis were true, there would be a 2% chance of obtaining an F-statistic of 5.23 or larger.

Interpretation:

Since the p-value (0.02) is less than a typical significance level (such as 0.05), you would reject the null hypothesis. This suggests that there are statistically significant differences between at least one pair of groups. In other words, the data provide enough evidence to conclude that the means of at least two groups are not equal.

However, the ANOVA itself does not tell you which specific groups are different from each other. To identify which groups are different, you may need to perform post hoc tests (e.g., Tukey's Honestly Significant Difference test) or examine confidence intervals for group means.

Keep in mind that a significant result in ANOVA only indicates the presence of differences; it doesn't provide information about the nature or direction of those differences. Further analyses are often needed to understand the specific group differences and their implications.
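
The link between the F-statistic and its p-value can be sketched with the F distribution in SciPy. The question does not state the degrees of freedom, so the values below (2 and 14, corresponding to 3 groups and 17 observations) are hypothetical, chosen only because they give a p-value close to 0.02.

```python
from scipy import stats

f_statistic = 5.23
# Hypothetical degrees of freedom (not given in the question):
# 3 groups -> 2 between-group df; 17 observations -> 14 within-group df.
df_between, df_within = 2, 14
alpha = 0.05

# Upper-tail probability of the F distribution at the observed statistic.
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value that the F-statistic must exceed to reject at level alpha.
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print("p-value:", p_value)
print("Critical F at alpha = 0.05:", f_critical)
print("Reject H0:", f_statistic > f_critical)
```

Comparing the F-statistic with the critical value and comparing the p-value with alpha are two views of the same decision.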

# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in repeated measures ANOVA can be challenging because each participant contributes data at multiple time points, and missing data can occur at different time points for different participants. There are several methods to handle missing data in this context, and the choice of method can have consequences on the validity and interpretation of the results.

1. Listwise deletion (Complete Case Analysis):
In listwise deletion, any participant with missing data at any time point is entirely excluded from the analysis. This method reduces the sample size and can introduce bias if the missing data are not missing completely at random (MCAR). It may lead to less precise estimates and reduced statistical power.

2. Pairwise deletion:
With pairwise deletion, participants with missing data for specific time points are excluded only for those time points. This method retains more data than listwise deletion, but it can still introduce bias if the missing data are not MCAR. The downside is that it may lead to different sample sizes for different time points, potentially affecting the balance of the repeated measures design.

3. Mean imputation:
Mean imputation involves replacing missing data with the mean of the available data for that time point. While this method allows retention of all participants and time points, it can introduce artificial relationships and underestimate the true variability in the data. It may also distort the correlation structure among time points.

4. Last observation carried forward (LOCF):
LOCF imputes missing values with the last observed value for that participant. It can lead to biased estimates, especially if the missing data are non-random or if there is a trend in the data over time. LOCF may not be appropriate if the missing data are due to dropouts or intervention effects.

5. Multiple imputation:
Multiple imputation is a more sophisticated method that generates multiple plausible imputed datasets, incorporating uncertainty about the missing values. It is based on the assumption that the data are missing at random (MAR) rather than MCAR. Multiple imputation can produce more accurate parameter estimates and standard errors compared to single imputation methods. However, it requires additional computational effort.

The potential consequences of using different methods to handle missing data include biased parameter estimates, underestimated standard errors, and inflated Type I error rates. Additionally, the choice of method may affect the power of the statistical test and the interpretation of the results.

To handle missing data appropriately, it is essential to understand the mechanism of missingness and the assumptions underlying the chosen imputation method. Multiple imputation is generally considered one of the best approaches when the data are missing at random, but it requires careful consideration and validation of the imputation model. Alternatively, sensitivity analyses can be performed to assess the robustness of the results to different methods of handling missing data.
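
Three of the simpler strategies can be sketched with pandas. This is a minimal illustration, assuming a small made-up wide-format table with one row per participant and one column per time point; the data and names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated-measures data: rows are participants, columns are time points.
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 14.0],
        "t2": [11.0, np.nan, 13.0, 15.0],
        "t3": [13.0, 15.0, np.nan, 17.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Listwise deletion: drop every participant with any missing value.
listwise = scores.dropna()

# Mean imputation: fill each gap with the mean of that time point.
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each participant's last observed value forward across time points.
locf = scores.ffill(axis=1)

print("Participants kept by listwise deletion:", list(listwise.index))
print("Variance of t2 before imputation:", scores["t2"].var())
print("Variance of t2 after mean imputation:", mean_imputed["t2"].var())
print(locf)
```

In this toy table, listwise deletion discards half of the participants, and mean imputation shrinks the variance of the imputed time point. These are the loss of power and the underestimated variability described above.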

# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

After performing an ANOVA and finding a statistically significant difference among the group means, post-hoc tests are conducted to determine which specific groups differ significantly from each other. Post-hoc tests adjust for the number of comparisons being made, which keeps the overall Type I error rate from inflating when many pairwise comparisons are conducted.

Here are some common post-hoc tests used after ANOVA, along with situations where each one is appropriate:

1. Tukey's Honestly Significant Difference (HSD) test:
Tukey's HSD test controls the family-wise error rate (FWER) across all pairwise comparisons of group means. It is exact when sample sizes are equal; the Tukey-Kramer variant extends it to unequal sample sizes. It is suitable when you want to compare all possible pairs of group means.

Example situation: Suppose you conduct a study comparing the effectiveness of three different diets (low-carb, Mediterranean, and low-fat) on weight loss. After performing ANOVA and finding a significant difference in weight loss among the groups, you can use Tukey's HSD test to identify which diets have significantly different effects on weight loss.

2. Bonferroni correction:
The Bonferroni correction is a more stringent method that controls the overall Type I error rate. It divides the desired significance level (usually 0.05) by the number of comparisons, which is equivalent to multiplying each p-value by that number. It works best for a small number of planned comparisons; with many comparisons it becomes overly conservative and loses power.

Example situation: In a medical trial, you want to compare the effectiveness of a new drug to a placebo in treating various symptoms. After conducting ANOVA and finding a significant difference among the groups, you decide to use the Bonferroni correction to compare the drug to the placebo for each symptom individually.

3. Dunnett's test:
Dunnett's test is useful when you have one control group and want to compare it with multiple treatment groups. It is less conservative than Bonferroni correction and is appropriate when you are mainly interested in comparing each treatment group to the control group.

Example situation: In a study evaluating the effects of different exercise regimes, you have one control group (no exercise) and several treatment groups (e.g., aerobic exercise, strength training, yoga). After ANOVA reveals a significant difference in a health-related measure, you can use Dunnett's test to compare each exercise group to the control group.

4. Scheffe's method:
Scheffe's method is the most conservative of these tests for simple pairwise comparisons, but it controls the overall Type I error rate for any set of contrasts, including complex ones such as one group against the average of two others. It can be used with unequal sample sizes. Like ANOVA itself, it assumes equal variances; when variances differ, the Games-Howell test is a common alternative.

Example situation: In a social science study comparing the performance of three groups of students (small class, medium class, and large class) on an exam, you find a significant difference among the groups. Since the sample sizes are not equal and you also want to compare the small class against the average of the medium and large classes, you opt for Scheffe's method.

It's important to choose an appropriate post-hoc test based on the specific research question, the design of the study, and the characteristics of the data. Each post-hoc test has its strengths and weaknesses, and selecting the right one will depend on the context of your analysis.
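
The Bonferroni approach can be sketched as pairwise t-tests with adjusted p-values. This is a minimal illustration on made-up weight-loss values for three diets (the data and names are hypothetical); it uses `scipy.stats.ttest_ind` for each pair.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical weight-loss values for three diets (illustrative only).
diets = {
    "A": np.array([10, 12, 14, 16, 12]),
    "B": np.array([11, 13, 15, 17, 11]),
    "C": np.array([0, 2, 4, 6, 8]),
}

pairs = list(combinations(diets, 2))  # ('A','B'), ('A','C'), ('B','C')
alpha = 0.05
adjusted = {}

for first, second in pairs:
    t_stat, p_raw = stats.ttest_ind(diets[first], diets[second])
    # Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1.
    adjusted[(first, second)] = min(1.0, p_raw * len(pairs))

for pair, p_adj in adjusted.items():
    print(pair, "adjusted p =", round(p_adj, 4), "reject:", p_adj < alpha)
```

With these values, diet C differs from both A and B after correction, while A and B do not differ from each other.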

# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [8]:
import scipy.stats as stats
import numpy as np

# Illustrative sample: 5 participants per diet (not the full 50 described above)
diet_a = np.array([10,12,14,16,12])
diet_b = np.array([11,13,15,17,11])
diet_c = np.array([0,2,4,6,8])

f_statistic , p_value = stats.f_oneway(diet_a,diet_b,diet_c)
alpha = 0.05

print("F-statistic: ",f_statistic)
print("P-value: ",p_value)

print("Mean weight loss for diet A:",np.mean(diet_a))
print("Mean weight loss for diet B:",np.mean(diet_b))
print("Mean weight loss for diet C:",np.mean(diet_c))

# Interpretation: compare the p-value with the chosen significance level
if p_value < alpha:
    print("Reject the null hypothesis: at least one diet mean differs")
else:
    print("Fail to reject the null hypothesis")

F-statistic:  18.881818181818193
P-value:  0.00019661417657625268
Mean weight loss for diet A: 12.8
Mean weight loss for diet B: 13.4
Mean weight loss for diet C: 4.0
Reject the null hypothesis: at least one diet mean differs


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm


data = {'time': [12, 15, 17, 14, 16, 18, 13, 19, 20, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
       'software': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C',
                   'C', 'C', 'C', 'C', 'C', 'C', 'C'],
       'experience': ['novice', 'novice', 'novice', 'novice', 'novice', 'novice',
                      'novice', 'novice', 'novice', 'novice','experienced', 'experienced', 'experienced', 'experienced','experienced',
                      'experienced', 'experienced', 'experienced', 'experienced','experienced']}

df = pd.DataFrame(data)

model = ols('time ~ C(software) + C(experience) + C(software) : C(experience)', data=df).fit()

anova_table = anova_lm(model , typ = 2)
print(anova_table)

                           sum_sq    df         F    PR(>F)
C(software)                  81.0   2.0  4.300437  0.053610
C(experience)                 NaN   1.0       NaN       NaN
C(software):C(experience)     9.8   2.0  0.520300  0.480517
Residual                    160.1  17.0       NaN       NaN


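The C(experience) row is NaN because, in this sample, software and experience are completely confounded: every Program A and Program B user is a novice and every Program C user is experienced, so the experience and interaction effects cannot be separated from the software effect. The software row (p ≈ 0.054) does not reach significance at the 0.05 level, and it should be read with caution given the confounding. A fully crossed design, with both experience levels observed under each program, is needed to test the main effects and the interaction.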
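
The NaN entries in the table above can be traced to empty cells in the design. A quick way to check for them is to cross-tabulate the two factors before fitting. This is a minimal sketch that rebuilds only the factor labels from the cell above; `cell_counts` and `empty_cells` are names introduced here.

```python
import pandas as pd

# Same software / experience labels as in the data above.
df = pd.DataFrame({
    "software": ["A"] * 5 + ["B"] * 5 + ["C"] * 10,
    "experience": ["novice"] * 10 + ["experienced"] * 10,
})

# Number of observations in each software x experience cell.
cell_counts = pd.crosstab(df["software"], df["experience"])
print(cell_counts)

# Any zero count means that combination was never observed.
empty_cells = int((cell_counts == 0).sum().sum())
print("Empty cells:", empty_cells)
```

Half of the six cells are empty here, which is why the experience effect cannot be estimated from these data.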


# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [10]:
import numpy as np
import scipy.stats as stats

control = np.array([75, 80, 85, 90, 95, 100, 105, 110, 115, 120])
experimental = np.array([76, 81, 86, 91, 96, 101, 106, 111, 116, 121])

# two-sample t-test

t_stat , p_value = stats.ttest_ind(control,experimental)

print("T-Statistics: ",t_stat)
print("P-Value: ",p_value)

alpha = 0.05

if p_value < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
    
    
# With more than two groups, a significant result would be followed by a post-hoc test
# such as Tukey's HSD to find which groups differ. With only two groups, as here,
# the single Tukey comparison simply repeats the t-test above (same p-value).


import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd


data = {'score': np.concatenate([control,experimental]),
        'group': np.repeat(['control','experimental'],len(control))
       }
print("Post-hoc-test\n")
df = pd.DataFrame(data)

tukey = pairwise_tukeyhsd(endog=df['score'] , groups=df['group'] , alpha=0.05)

print(tukey)

T-Statistics:  -0.14770978917519928
P-Value:  0.8842138175489671
Fail to reject the null hypothesis
Post-hoc-test

    Multiple Comparison of Means - Tukey HSD, FWER=0.05     
 group1    group2    meandiff p-adj   lower    upper  reject
------------------------------------------------------------
control experimental      1.0 0.8842 -13.2233 15.2233  False
------------------------------------------------------------
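
Two common companions to the two-sample t-test are Welch's version, which does not assume equal variances, and an effect size such as Cohen's d. This is a minimal sketch on the same two arrays; the pooled standard deviation below assumes equal group sizes, and `cohens_d` is a name introduced here.

```python
import numpy as np
from scipy import stats

control = np.array([75, 80, 85, 90, 95, 100, 105, 110, 115, 120])
experimental = np.array([76, 81, 86, 91, 96, 101, 106, 111, 116, 121])

# Welch's t-test: does not assume equal variances in the two groups.
welch_t, welch_p = stats.ttest_ind(control, experimental, equal_var=False)

# Cohen's d with a pooled standard deviation (valid as written for equal group sizes).
pooled_sd = np.sqrt((control.var(ddof=1) + experimental.var(ddof=1)) / 2)
cohens_d = (experimental.mean() - control.mean()) / pooled_sd

print("Welch t:", welch_t, "p:", welch_p)
print("Cohen's d:", cohens_d)
```

Because the two groups here have identical variances and sizes, Welch's test matches the standard t-test, and the very small effect size is consistent with the non-significant result.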


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [6]:
import pandas as pd
import numpy as np
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

np.random.seed(42)

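data = {
    'store':['A','B','C'] * 30,
    'day' : np.tile(range(1,31),3), # np.tile to repeat the array
    'sales': np.random.randint(50,200,size=90)
}
# Note: with this labelling every day is paired with a single store (rows d-1, d+29
# and d+59 all share one store label); np.repeat(['A','B','C'], 30) would instead
# give each store one record per day.

df = pd.DataFrame(data)

# Repeated measures ANOVA, approximated by OLS with day as a blocking factor

formula = 'sales ~ store + C(day)'
results = ols(formula,data=df).fit()

# Note: fvalue and f_pvalue are the overall model F-test (store and day terms
# jointly), not a test of the store effect alone.
print("Repeated Measures ANOVA:")
print(f"F-statistic: {results.fvalue}")
print(f"P-value: {results.f_pvalue}")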

alpha = 0.05

if results.f_pvalue < alpha:
    
    posthoc = pairwise_tukeyhsd(df['sales'] , df['store'] , alpha = 0.05)
    
    print("\nPost-Hoc Tukey's HSD Test")
    print(posthoc)
else:
    print("\nNo significant differences found")

Repeated Measures ANOVA:
F-statistic: 1.523852525725498
P-value: 0.08462571015455761

No significant differences found
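
A test of the store effect alone can be sketched by partitioning the variance by hand for a one-way repeated measures design. This is a minimal illustration, assuming a small made-up table in which every day has one sales figure for each store; the data and names are hypothetical, and sphericity is not checked.

```python
import numpy as np
from scipy import stats

# Hypothetical sales: rows are days (the repeated units), columns are stores A, B, C.
sales = np.array([[10, 12, 15],
                  [11, 13, 17],
                  [ 9, 12, 14],
                  [12, 14, 18]], dtype=float)
n_days, n_stores = sales.shape
grand = sales.mean()

# Partition the total variability into store, day, and error components.
ss_total = np.sum((sales - grand) ** 2)
ss_store = n_days * np.sum((sales.mean(axis=0) - grand) ** 2)
ss_day = n_stores * np.sum((sales.mean(axis=1) - grand) ** 2)
ss_error = ss_total - ss_store - ss_day

# F-test for the store effect, using the store-by-day error term.
df_store = n_stores - 1
df_error = (n_days - 1) * (n_stores - 1)
f_store = (ss_store / df_store) / (ss_error / df_error)
p_store = stats.f.sf(f_store, df_store, df_error)

print("F (store):", f_store)
print("p (store):", p_store)
```

Removing the day-to-day variability from the error term is what distinguishes this from an ordinary one-way ANOVA on the same numbers.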
