Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

A1: Assumptions of ANOVA:

* Independence: Observations must be independent of one another, both within and across groups.
* Normality: The dependent variable within each group (equivalently, the model residuals) should be approximately normally distributed.
* Homogeneity of Variance: The population variance of the dependent variable should be equal across all groups (homoscedasticity).

Examples of Violations:

* Independence: If data points in one group are related to or dependent on data points in another group, the assumption of independence is violated. For example, if the same individuals are measured several times and those measurements are placed in different groups, the observations are correlated and a standard one-way ANOVA is not appropriate.
* Normality: If the data within each group deviate substantially from a normal distribution, the normality assumption is violated. Typical causes are strongly skewed data or extreme outliers. The violation matters most when sample sizes are small, because the F-test is fairly robust to non-normality in large, balanced samples.
* Homogeneity of Variance: If the variance is not equal across all groups, the homogeneity of variance assumption is violated. The F-test can then be too liberal or too conservative, particularly when the group sizes are also unequal.
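
The normality and equal-variance assumptions listed above can be screened before running the ANOVA. The following is a minimal sketch, assuming the Shapiro-Wilk and Levene tests from `scipy.stats` are acceptable checks; the three groups and their values are hypothetical and serve only as an illustration.

```python
from scipy.stats import levene, shapiro

# Hypothetical measurements for three independent groups (illustration only).
groups = {
    "A": [5.1, 4.9, 5.6, 5.8, 5.0, 5.3, 4.7, 5.5],
    "B": [6.2, 6.8, 5.9, 6.5, 7.1, 6.0, 6.4, 6.6],
    "C": [7.9, 8.4, 7.2, 8.8, 7.5, 8.1, 9.0, 7.7],
}

# Shapiro-Wilk: the null hypothesis is that a sample comes from a normal distribution.
normality_p = {name: shapiro(values).pvalue for name, values in groups.items()}

# Levene: the null hypothesis is that all groups share the same variance.
levene_stat, levene_p = levene(*groups.values())

for name, p in normality_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test is a warning sign rather than proof of a problem. Independence cannot be tested this way; it has to follow from the study design.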


Q2. What are the three types of ANOVA, and in what situations would each be used?

A2:

1. One-Way ANOVA: Used to compare the means of independent groups defined by a single categorical factor on a continuous dependent variable. It is typically applied to three or more groups, since two groups reduce to a t-test.
2. Two-Way ANOVA: Used when two categorical independent variables (factors) may affect a single continuous dependent variable. It tests the main effect of each factor and the interaction effect between the two factors.
3. Repeated Measures ANOVA: Used when comparing means of the same group under different conditions or at different time points. It is used for dependent data where each participant is measured under multiple conditions.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

A3: The partitioning of variance in ANOVA involves breaking down the total variance observed in the data into different components related to the factors being analyzed. In a one-way ANOVA, the total variance (SST) is partitioned into two components:

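1. Explained Variance (SSE): The variation explained by the differences between group means (the between-group sum of squares).
2. Residual Variance (SSR): The variation remaining after accounting for the differences between group means (the within-group sum of squares), often referred to as the error variance. The two components add up to the total, SST = SSE + SSR. Some textbooks use SSE for the error sum of squares and SSR for the regression sum of squares; here SSE means explained and SSR means residual throughout.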

Understanding the partitioning of variance is crucial because it allows us to quantify the proportion of variance in the data that is explained by the factors under consideration. This information is essential in interpreting the significance of the ANOVA results and understanding the contribution of the factors to the overall variability in the data.
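
The partition can be verified by hand. The sketch below reuses the toy data from Q4 and computes each component directly from the group means with pandas; the variable names and the `eta_squared` ratio (explained share of total variation) are choices made for this illustration.

```python
import pandas as pd

# Same toy data as the one-way example in Q4.
data = pd.DataFrame({"Group": ["A", "A", "B", "B", "C", "C"],
                     "Values": [5, 6, 8, 7, 10, 12]})

grand_mean = data["Values"].mean()
# Each observation is paired with the mean of its own group.
group_means = data.groupby("Group")["Values"].transform("mean")

sst = ((data["Values"] - grand_mean) ** 2).sum()    # total variation
sse = ((group_means - grand_mean) ** 2).sum()       # explained (between groups)
ssr = ((data["Values"] - group_means) ** 2).sum()   # residual (within groups)
eta_squared = sse / sst                             # share of variation explained

print("SST:", sst, "SSE:", sse, "SSR:", ssr)
print("Eta squared:", round(eta_squared, 3))
```

For this data the components are SST = 34, SSE = 31 and SSR = 3, so about 91% of the total variation is attributable to group membership.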

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [5]:
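import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Toy data: three groups with two observations each
data = pd.DataFrame({'Group': ['A', 'A', 'B', 'B', 'C', 'C'],
                     'Values': [5, 6, 8, 7, 10, 12]})

# Fit the one-way ANOVA as a linear model; the fitted values are the group means
model = ols('Values ~ Group', data=data).fit()

# Total sum of squares: squared deviations of each observation from the grand mean
SST = np.sum((data['Values'] - np.mean(data['Values']))**2)

# Explained sum of squares: squared deviations of the fitted group means from the grand mean
SSE = np.sum((model.fittedvalues - np.mean(data['Values']))**2)

# Residual sum of squares: the remainder, equal to the sum of squared residuals
SSR = SST - SSE

print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)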


Total Sum of Squares (SST): 34.0
Explained Sum of Squares (SSE): 31.000000000000014
Residual Sum of Squares (SSR): 2.999999999999986


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [6]:
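import pandas as pd
from statsmodels.formula.api import ols

# Toy data with one observation per Group1 x Group2 cell
data = pd.DataFrame({'Group1': ['A', 'A', 'B', 'B', 'C', 'C'],
                     'Group2': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
                     'Values': [5, 6, 8, 7, 10, 12]})

# 'Group1 * Group2' expands to both main effects plus their interaction
model = ols('Values ~ Group1 * Group2', data=data).fit()

# Treatment-coded coefficients: each is a difference from the reference cell (A, X),
# not an F-test of the main effect. With one observation per cell the model is
# saturated, so no residual degrees of freedom remain for significance tests.
main_effects = model.params[['Group1[T.B]', 'Group1[T.C]', 'Group2[T.Y]']]
# Only the C:Y interaction coefficient is extracted here; the B:Y coefficient is
# available as model.params['Group1[T.B]:Group2[T.Y]']
interaction_effect = model.params['Group1[T.C]:Group2[T.Y]']

print("Main Effects:")
print(main_effects)
print("Interaction Effect:")
print(interaction_effect)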


Main Effects:
Group1[T.B]    3.0
Group1[T.C]    5.0
Group2[T.Y]    1.0
dtype: float64
Interaction Effect:
0.9999999999999982


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

A6: The F-statistic is the ratio of the between-group mean square to the within-group mean square. A value of 5.23 means the variability between group means is about 5.23 times the variability that would be expected from within-group variation alone if all population means were equal.

With a p-value of 0.02, which is below the conventional significance level of 0.05, you can reject the null hypothesis that all group means are equal. The result indicates that at least one population mean differs from the others. It does not show which groups differ or how large the differences are, so a post-hoc test and an effect size measure are needed to answer those questions.
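
The link between an F-statistic and its p-value can be sketched with the F distribution in `scipy.stats`. The question does not state the degrees of freedom, so the values below (3 groups and 23 observations, giving 2 and 20 degrees of freedom) are assumptions chosen only for illustration.

```python
from scipy.stats import f

f_statistic = 5.23   # value from the question
df_between = 2       # assumed: k - 1 with k = 3 groups
df_within = 20       # assumed: N - k with N = 23 observations

# Upper-tail probability P(F >= f_statistic) under the null hypothesis.
p_value = f.sf(f_statistic, df_between, df_within)

alpha = 0.05
print("p-value:", p_value)
print("Reject the null hypothesis:", p_value < alpha)
```

With these assumed degrees of freedom the p-value falls between 0.01 and 0.05, so the decision at the 0.05 level matches the one described above. Different degrees of freedom would give a different p-value for the same F-statistic.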

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

A7: In a repeated measures ANOVA, missing data can be handled in various ways:

* Complete Case Analysis: You can remove cases with missing data from the analysis, which may lead to a reduction in sample size and potential loss of information.

* Mean Imputation: You can replace missing values with the mean of the available data for the corresponding group or condition. However, this may introduce bias and underestimate the true variability.

* Multiple Imputation: Use statistical techniques to impute missing data multiple times based on the available information, which provides more robust estimates.

Potential Consequences of Handling Missing Data Differently:

Different methods of handling missing data can lead to different results and interpretations. Complete case analysis removes every participant with any missing measurement, which reduces power and introduces bias unless the data are missing completely at random (MCAR). Mean imputation keeps the sample size but shrinks the variance and distorts the correlations between repeated measures, which can make effects look more significant than they are. Multiple imputation is valid under the weaker missing at random (MAR) assumption and reflects the uncertainty of the imputed values, but it depends on a well-specified imputation model. Because the choice of method can change conclusions about the significance of effects, the method and its assumptions should be reported with the results.
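
The difference between complete case analysis and mean imputation can be seen on a small example. This is a minimal sketch with hypothetical wide-format data (one row per participant, one column per condition); the participant labels and scores are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical scores: 4 participants measured under 3 conditions, 2 values missing.
scores = pd.DataFrame(
    {"cond1": [10.0, 12.0, 11.0, 14.0],
     "cond2": [11.0, np.nan, 13.0, 15.0],
     "cond3": [13.0, 14.0, np.nan, 17.0]},
    index=["p1", "p2", "p3", "p4"],
)

# Complete case analysis: any participant with a missing value is dropped entirely.
complete_cases = scores.dropna()

# Mean imputation: each gap is filled with the mean of its condition.
mean_imputed = scores.fillna(scores.mean())

print("Participants kept (complete cases):", len(complete_cases), "of", len(scores))
print("cond2 std before imputation:", scores["cond2"].std())
print("cond2 std after imputation: ", mean_imputed["cond2"].std())
```

Complete case analysis keeps only two of the four participants, while mean imputation keeps all four but lowers the standard deviation of the imputed condition.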

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

A8: Post-hoc tests are used after ANOVA to compare group means pairwise when there are three or more groups. Some common post-hoc tests include:

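1. Tukey's Honestly Significant Difference (HSD) test: It controls the family-wise error rate across all pairwise comparisons and is used when every pair of groups is of interest and the assumption of homogeneity of variances is met.

2. Bonferroni correction: It divides the significance level by the number of comparisons. It is simple and general, but becomes very conservative as the number of comparisons grows, so it suits a small number of planned comparisons.

3. Scheffé's test: It is the most conservative of these tests and controls the family-wise error rate for all possible contrasts, not just pairwise comparisons, so it suits complex or unplanned contrasts. Like Tukey's HSD, it assumes homogeneity of variances; when variances are unequal, the Games-Howell test is a common alternative.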

4. Dunnett's test: It is used for comparing all treatment groups with a control group in a one-way ANOVA.

Example of a situation where a post-hoc test might be necessary:

Suppose you conducted a one-way ANOVA with four groups and found a significant overall difference. To determine which specific groups differ from each other, you can conduct post-hoc tests. For instance, Tukey's HSD or Bonferroni correction can help identify which group means are significantly different from each other. Post-hoc tests provide valuable information about pairwise comparisons and help pinpoint the sources of significant differences identified in the ANOVA.
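
A four-group follow-up like the one described above could be run with Tukey's HSD from statsmodels. The sketch below assumes `pairwise_tukeyhsd` from `statsmodels.stats.multicomp`; the group labels and scores are hypothetical.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical scores for four groups of five observations each.
values = np.array([20, 22, 19, 21, 23,    # G1
                   24, 25, 23, 26, 27,    # G2
                   30, 31, 29, 32, 33,    # G3
                   21, 20, 22, 23, 19])   # G4
labels = np.array(["G1"] * 5 + ["G2"] * 5 + ["G3"] * 5 + ["G4"] * 5)

# All 6 pairwise comparisons, with the family-wise error rate held at 0.05.
result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)

print(result.summary())
```

The summary lists each pair with its mean difference, adjusted p-value, confidence interval, and a reject flag, which shows which pairs drive the overall ANOVA result.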

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [9]:
import numpy as np
from scipy.stats import f_oneway

weight_loss_A = [5, 7, 6, 8, 9, 4, 6, 5, 8, 7, 6, 5, 9, 7, 5, 6, 8, 6, 7, 5,
                 6, 8, 7, 9, 5, 6, 7, 8, 9, 6, 7, 5, 6, 8, 9, 7, 6, 5, 7, 8,
                 6, 5, 9, 7, 6, 5, 8, 7, 6, 5]

weight_loss_B = [3, 2, 1, 4, 2, 3, 4, 3, 2, 3, 1, 4, 2, 3, 2, 1, 4, 3, 2, 4,
                 3, 2, 4, 3, 1, 3, 2, 4, 2, 3, 1, 4, 3, 2, 3, 4, 1, 2, 3, 4,
                 3, 2, 4, 1, 3, 2, 4, 2, 3, 1]

weight_loss_C = [1, 3, 2, 4, 1, 2, 3, 1, 4, 3, 2, 4, 1, 2, 3, 1, 4, 2, 3, 1,
                 4, 2, 3, 1, 4, 2, 3, 1, 4, 2, 3, 1, 4, 2, 3, 1, 4, 2, 3, 1,
                 4, 3, 2, 4, 1, 3, 2, 4, 1, 3]

f_statistic, p_value = f_oneway(weight_loss_A, weight_loss_B, weight_loss_C)

print("F-Statistic:", f_statistic)
print("P-value:", p_value)


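F-Statistic: 198.67029972752044
P-value: 1.6267681192634228e-42

Interpretation: With F ≈ 198.67 and p ≈ 1.6e-42, far below 0.05, we reject the null hypothesis that the three diets produce the same mean weight loss; at least one diet differs from the others. A post-hoc test such as Tukey's HSD would be needed to say which diets differ. Note that the sample lists above contain 50 values per diet (150 in total), whereas the question describes 50 participants overall.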


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [10]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = pd.DataFrame({
    'Software': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
    'Experience': ['Novice', 'Experienced'] * 6,
    'Time': [25, 22, 30, 28, 35, 33, 23, 32, 38, 24, 27, 34]
})

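# Two-way model with both main effects and their interaction
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()

# Type II sums of squares: each main effect is adjusted for the other main effect
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)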


                               sum_sq   df          F    PR(>F)
C(Software)                264.500000  2.0  36.906977  0.000425
C(Experience)                2.083333  1.0   0.581395  0.474664
C(Software):C(Experience)   10.166667  2.0   1.418605  0.312974
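Residual                    21.500000  6.0        NaN       NaN

Interpretation: Software has a significant main effect on completion time (F(2, 6) ≈ 36.91, p ≈ 0.0004). Experience level has no significant main effect (F(1, 6) ≈ 0.58, p ≈ 0.47), and the Software:Experience interaction is not significant (F(2, 6) ≈ 1.42, p ≈ 0.31), so there is no evidence that the effect of the software depends on experience. The example data contain 12 observations rather than the 30 employees described in the question, so the tests have limited power.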


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [11]:
import numpy as np
from scipy.stats import ttest_ind

control_group = [85, 82, 78, 88, 90, 84, 86, 80, 87, 89, 81, 83, 85, 88, 82, 80, 86, 84, 89, 85]
experimental_group = [92, 88, 95, 89, 90, 94, 91, 93, 87, 96, 93, 92, 94, 90, 91, 95, 92, 89, 93, 91]

t_statistic, p_value = ttest_ind(control_group, experimental_group)

print("T-Statistic:", t_statistic)
print("P-value:", p_value)


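T-Statistic: -7.666184363521894
P-value: 3.144088678985931e-09

Interpretation: With t ≈ -7.67 and p ≈ 3.1e-09, we reject the null hypothesis of equal mean test scores. The negative sign reflects that the control group mean (84.6) is lower than the experimental group mean (91.75). With only two groups, no post-hoc test is needed, because the t-test already identifies the single pair that differs. The example uses 20 scores per group rather than the 100 students described in the question.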


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant
differences in sales between the three stores. If the results are significant, follow up with a post-hoc
test to determine which store(s) differ significantly from each other.

In [25]:
import numpy as np
import pandas as pd
from scipy.stats import f_oneway
data = pd.DataFrame({
    'Day': [i for i in range(1, 31)] * 3,
    'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    'Sales': [500, 520, 480, 510, 490, 510, 480, 500, 510, 490,
              520, 490, 480, 510, 500, 520, 510, 500, 480, 510,
              500, 490, 510, 480, 500, 520, 510, 500, 480, 510,
              520, 490, 480, 500, 510, 520, 490, 510, 500, 480,
              510, 490, 520, 500, 510, 480, 500, 510, 490, 520,
              510, 500, 520, 500, 510, 520, 510, 490, 520, 500,
              500, 480, 510, 520, 490, 480, 510, 510, 480, 500,
              490, 520, 500, 510, 500, 520, 510, 490, 510, 480,
              500, 510, 490, 520, 510, 520, 510, 490, 480, 510]
})

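# Check for missing values in every column
print(data.isna().sum())

# Check for infinite values in the sales figures
print(np.isinf(data['Sales']).sum())

# Drop any rows with missing or infinite values before testing
data.dropna(inplace=True)
data = data[~np.isinf(data['Sales'])]

# f_oneway runs a one-way ANOVA that treats the three stores as independent samples;
# it does not use the pairing of observations by Day
f_statistic, p_value = f_oneway(data[data['Store'] == 'A']['Sales'],
                                data[data['Store'] == 'B']['Sales'],
                                data[data['Store'] == 'C']['Sales'])

print("F-Statistic:", f_statistic)
print("P-value:", p_value)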




Day      0
Store    0
Sales    0
dtype: int64
0
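F-Statistic: 0.4706384796070894
P-value: 0.6261841937083128

Interpretation: With F ≈ 0.47 and p ≈ 0.63, there is no significant difference in mean daily sales between the three stores, so no post-hoc test is needed. Note that f_oneway performs a one-way ANOVA on independent samples. It ignores the fact that the same 30 days are observed for every store, so this result is not a repeated measures ANOVA.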
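
Since the code above calls `f_oneway`, the pairing of sales by day is never used. A repeated measures analysis could be sketched with `AnovaRM` from statsmodels, treating each day as the repeated "subject" and the store as the within-subject factor. The data below are a small hypothetical sample of 5 days, not the 30-day data set above.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical long-format data: 5 days, each observed at all three stores.
data = pd.DataFrame({
    "Day": list(range(1, 6)) * 3,
    "Store": ["A"] * 5 + ["B"] * 5 + ["C"] * 5,
    "Sales": [500, 520, 480, 510, 490,
              515, 530, 495, 520, 505,
              490, 515, 470, 500, 485],
})

# Day is the repeated "subject"; Store is the within-subject factor.
# AnovaRM needs exactly one observation per Day and Store combination.
result = AnovaRM(data, depvar="Sales", subject="Day", within=["Store"]).fit()

print(result.anova_table)
```

The resulting table reports the F value, the numerator and denominator degrees of freedom, and the p-value for the Store effect. Because day-to-day variation is removed from the error term, this test can detect differences that the independent-samples ANOVA misses.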
