In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.


In [None]:
Assumptions of ANOVA and Examples of Violations:

Assumptions of ANOVA are:

1. Normality: The dependent variable (more precisely, the residuals) should be approximately normally distributed within each group.
2. Homogeneity of variance: The variance of the dependent variable should be approximately equal across all groups.
3. Independence: The observations should be independent of each other, both within and between groups.

Examples of Violations are:

Violation of Normality: If the dependent variable is strongly skewed or heavy-tailed within groups,
the p-value from the F-test may be inaccurate, especially when samples are small or unequal in size.
For example, if a study compares the weight of people across different age groups
and weight is strongly right-skewed in some age groups, the ANOVA results may not be valid.

Violation of Homogeneity of Variance: If the variance of the dependent variable differs substantially across groups,
the F-test can produce too many or too few false positives, particularly when group sizes are unequal.
For example, if a study compares the salary of people across different industries
and the variance of salary is much higher in some industries, the ANOVA results may not be valid.

Violation of Independence: If the observations are not independent of each other,
the effective sample size is smaller than it appears and the F-test tends to produce too many false positives.
For example, if a study compares the exam scores of students in different classes
and students within a class study together or are influenced by the same teacher, their scores are correlated and the ANOVA results may not be valid.
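
The first two assumptions can be checked numerically before running ANOVA. The sketch below is only an illustration: the three groups are simulated (hypothetical data, not taken from any study in this notebook), normality is checked with a Shapiro-Wilk test on the residuals, and equal variances with Levene's test. Independence generally has to be judged from the study design rather than from a test.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups simulated with equal variance (illustration only)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (5.0, 5.5, 6.0)]

# Normality: Shapiro-Wilk test on the residuals (each value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p = {shapiro_p:.3f}")
print(f"Levene p = {levene_p:.3f}")
```

A small p-value in either test suggests that the corresponding assumption is doubtful for the data at hand.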

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?


In [None]:
Types of ANOVA and their Applications:

There are three types of ANOVA:

One-Way ANOVA: One-way ANOVA is used to compare the means of three or more independent groups defined by a single independent variable (with only two groups it is equivalent to a t-test).
    It is used when we have one independent variable and one dependent variable. For example, comparing the mean score of students in different classes.

Two-Way ANOVA: Two-way ANOVA is used to examine the effect of two independent variables on one dependent variable, including the main effect of each variable and their interaction.
    It is used when we have two independent variables and one dependent variable. For example, comparing the mean score of students across different classes and different schools.

Repeated Measures ANOVA: Repeated measures ANOVA is used to compare means when the same subjects are measured under every condition or at every time point, so the measurements are not independent.
    It is used when we have one within-subject independent variable and one dependent variable measured multiple times. For example, comparing the mean weight of the same people before and after a diet.

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?


In [None]:
Partitioning of Variance in ANOVA:

Partitioning of variance in ANOVA is a method of dividing the total variation of the dependent variable into different components to estimate the contribution of each independent variable.
It is important to understand this concept because it helps us to identify the sources of variation and to determine the significance of the independent variables.

In a one-way ANOVA, the total sum of squares (SST) is divided into two components: the explained (between-group) sum of squares (SSE) and the residual (within-group) sum of squares (SSR), so that SST = SSE + SSR.
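
For a one-way design the partition can be written out explicitly. The index notation below is introduced here for illustration only: there are k groups, group j has n_j observations y_ij and mean \bar{y}_j, and \bar{y} is the overall mean. SSE denotes the explained sum of squares and SSR the residual sum of squares, as in this notebook.

```latex
\[
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}\right)^2}_{\text{SST}}
= \underbrace{\sum_{j=1}^{k} n_j\left(\bar{y}_j-\bar{y}\right)^2}_{\text{SSE}}
+ \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}_j\right)^2}_{\text{SSR}},
\qquad
F = \frac{\text{SSE}/(k-1)}{\text{SSR}/(N-k)}, \quad N=\sum_{j=1}^{k} n_j
\]
```

The F-statistic compares the explained variation per degree of freedom with the residual variation per degree of freedom.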

In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
Calculation of Total Sum of Squares, Explained Sum of Squares, and Residual Sum of Squares in One-Way ANOVA using Python:

The total sum of squares (SST) is the sum of squared deviations of all observations from the overall mean.
The explained sum of squares (SSE) is the sum of squared deviations of the group means from the overall mean, with each group weighted by its number of observations.
The residual sum of squares (SSR) is the sum of squared deviations of individual observations from their group means, so that SST = SSE + SSR.

In [1]:
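import pandas as pd
import numpy as np

# create a data frame (hypothetical values, for illustration only)
df = pd.DataFrame({'group': ['A'] * 3 + ['B'] * 3 + ['C'] * 3,
                   'value': [4, 5, 6, 6, 7, 8, 9, 10, 11]})

# SST = SSE (between groups) + SSR (within groups)
grand_mean = df['value'].mean()
group_means = df.groupby('group')['value'].transform('mean')
sst = np.sum((df['value'] - grand_mean) ** 2)
sse = np.sum((group_means - grand_mean) ** 2)
ssr = np.sum((df['value'] - group_means) ** 2)
print(f'SST = {sst:.2f}, SSE = {sse:.2f}, SSR = {ssr:.2f}')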

In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?


In [2]:
import numpy as np
from scipy.stats import f

# create a 2x2 array of data: rows are the levels of factor A,
# columns are the levels of factor B (one observation per cell)
data = np.array([[5, 7], [8, 6]])
n_rows, n_cols = data.shape
grand_mean = np.mean(data)

# calculate the total sum of squares
total_ss = np.sum((data - grand_mean)**2)

# calculate the sum of squares for the first factor (rows);
# each row mean is based on n_cols observations
row_ss = n_cols * np.sum((np.mean(data, axis=1) - grand_mean)**2)

# calculate the sum of squares for the second factor (columns);
# each column mean is based on n_rows observations
col_ss = n_rows * np.sum((np.mean(data, axis=0) - grand_mean)**2)

# calculate the remaining sum of squares; with one observation per cell the
# interaction cannot be separated from the error, so it serves as the error term
interaction_ss = total_ss - row_ss - col_ss

# calculate the degrees of freedom
df_row = n_rows - 1
df_col = n_cols - 1
df_interaction = df_row * df_col

# calculate the mean squares
ms_row = row_ss / df_row
ms_col = col_ss / df_col
ms_interaction = interaction_ss / df_interaction

# calculate the F-statistics for the main effects (error term = interaction mean square)
f_row = ms_row / ms_interaction
f_col = ms_col / ms_interaction

# calculate the p-values (upper tail of the F distribution)
p_row = f.sf(f_row, df_row, df_interaction)
p_col = f.sf(f_col, df_col, df_interaction)

# print the results
print(f'Main effect of rows (factor A): F = {f_row:.3f}, p = {p_row:.3f}')
print(f'Main effect of columns (factor B): F = {f_col:.3f}, p = {p_col:.3f}')
print(f'Interaction (factor A*B) sum of squares = {interaction_ss:.3f} (not testable without replication)')


Main effect of rows (factor A): F = 0.250, p = 0.705
Main effect of columns (factor B): F = 0.000, p = 1.000
Interaction (factor A*B) sum of squares = 4.000 (not testable without replication)


In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?


In [None]:
Interpretation of One-Way ANOVA Results:

In one-way ANOVA, the F-statistic is the ratio of between-group variability to within-group variability and is used to test whether the group means differ.
The p-value is the probability of obtaining an F-statistic at least as large as the one observed
if the null hypothesis (i.e., all group means are equal) is true.

In this case, the p-value of 0.02 is less than the significance level of 0.05,
so we reject the null hypothesis and conclude that at least one group mean differs significantly from the others. However, ANOVA alone cannot tell us which specific groups differ from each other; a post-hoc test is needed for that.

In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?


In [None]:
Handling Missing Data in Repeated Measures ANOVA:

To handle missing data in repeated measures ANOVA, we can use different methods such as:

Listwise deletion: We can exclude any participant who has a missing value at any measurement occasion.
    This reduces the sample size and statistical power, and it can bias the results unless the data are missing completely at random.

Pairwise deletion: We can use only the data available for each pair of variables used in the analysis.
    Different statistics are then based on different subsets of participants, which can lead to biased or inconsistent estimates of the variances and covariances.

Imputation: We can estimate the missing values using methods such as mean imputation, regression imputation, or multiple imputation.
    Mean imputation shrinks the variance and distorts the correlations between time points, whereas multiple imputation carries the uncertainty about the missing values into the analysis.
    In every case, the validity of the results depends on the assumptions made about the missing data mechanism.

The consequences of using different methods to handle missing data are that the results may differ depending on the method used, and the validity of the results may be compromised if the assumptions made about the missing data mechanism are not valid.
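
To make the difference between the methods concrete, here is a minimal sketch using pandas. The data frame is hypothetical (four participants measured at three time points, with two values missing) and only listwise deletion and mean imputation are shown.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 9.0],
    "t2": [11.0, np.nan, 12.0, 10.0],
    "t3": [13.0, 14.0, np.nan, 11.0],
})

# Listwise deletion: keep only participants with complete data
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with the mean of its column
mean_imputed = scores.fillna(scores.mean())

print(f"participants before: {len(scores)}, after listwise deletion: {len(complete_cases)}")
print(mean_imputed)
```

Listwise deletion discards half of the participants here, while mean imputation keeps all of them but fills in values that carry no individual variation.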

In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

In [None]:
Common Post-Hoc Tests in ANOVA and Examples:

Post-hoc tests are used after ANOVA to determine which specific groups are significantly different from each other.
Some common post-hoc tests are:

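Tukey's HSD test: This test compares all possible pairs of group means while controlling the family-wise error rate. It is the usual choice when all pairwise comparisons are of interest.

Bonferroni test: This test divides the significance level by the number of comparisons (or, equivalently, multiplies each p-value by that number) to control the family-wise error rate. It is simple and suits a small number of planned comparisons, but it becomes conservative as the number of comparisons grows.

Scheffe's test: This test controls the family-wise error rate for all possible contrasts among the group means, not only pairwise differences. It is the most conservative of the three and suits complex or unplanned contrasts.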

An example situation where a post-hoc test might be necessary is when conducting a study to compare
the effectiveness of three different types of treatments for a medical condition. ANOVA might indicate that there is a significant difference between the treatments, but we need to use a post-hoc test to determine which specific treatments are significantly different from each other.
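
The treatment example can be sketched with Tukey's HSD from statsmodels. The outcome values below are hypothetical (five patients per treatment, invented for illustration), chosen so that treatments A and B are similar and treatment C is clearly different.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical outcome scores for three treatments, five patients each
values = np.array([10.0, 11.0, 12.0, 11.0, 10.0,    # treatment A
                   10.5, 11.5, 11.0, 10.0, 12.0,    # treatment B
                   20.0, 21.0, 19.0, 20.0, 22.0])   # treatment C
labels = np.repeat(["A", "B", "C"], 5)

# Tukey's HSD compares every pair of treatments at a family-wise alpha of 0.05
result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(result.summary())
```

The `reject` column of the summary shows, for each pair (A-B, A-C, B-C), whether the difference in means is significant after the adjustment.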

In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.


In [None]:
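The output below shows the F-statistic and p-value. Here F is about 297.20 and p is about 1.6e-35, far below the significance level of 0.05,
so we reject the null hypothesis that the mean weight loss is equal across the three diets: at least one diet differs from the others.
A post-hoc test (e.g., Tukey's HSD) would be needed to identify which pairs of diets differ.
Note that the sample arrays contain 25 values per diet (75 observations in total), not the 50 participants stated in the question.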



In [3]:
import numpy as np
from scipy.stats import f_oneway

# define the data for each group
diet_a = np.array([5.6, 4.2, 6.7, 3.8, 7.1, 5.9, 6.2, 4.8, 5.2, 4.1,
                   6.8, 5.3, 7.3, 6.0, 5.5, 4.9, 5.7, 6.1, 4.7, 5.8,
                   6.3, 5.0, 4.6, 7.2, 6.5])
diet_b = np.array([2.1, 1.6, 2.8, 2.5, 3.0, 2.0, 1.8, 3.1, 1.9, 2.3,
                   2.4, 2.6, 1.5, 2.2, 2.7, 2.9, 1.7, 3.2, 2.8, 2.0,
                   2.1, 1.8, 3.3, 2.6, 2.2])
diet_c = np.array([0.9, 1.2, 0.8, 1.5, 1.1, 1.6, 1.3, 1.0, 0.7, 1.4,
                   0.6, 1.7, 1.2, 0.8, 1.3, 1.0, 1.1, 1.6, 0.9, 1.4,
                   1.7, 1.2, 1.0, 1.5, 1.3])

# conduct the one-way ANOVA
f_statistic, p_value = f_oneway(diet_a, diet_b, diet_c)

# print the results
print('F-statistic =', f_statistic)
print('p-value =', p_value)


F-statistic = 297.19931806389445
p-value = 1.6201017360786905e-35


In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [13]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a sample dataset
data = pd.DataFrame({
    'program': ['A', 'B', 'C'] * 10,
    'experience': ['novice'] * 15 + ['experienced'] * 15,
    'time': [10.2, 9.5, 10.7, 11.5, 10.1, 9.9, 12.1, 11.5, 12.3, 11.9,
             8.6, 9.1, 8.8, 8.7, 9.2, 7.9, 8.8, 9.3, 9.6, 8.3,
             12.3, 11.9, 12.8, 11.6, 13.2, 12.5, 12.1, 11.8, 13.1, 12.9]
})

# conduct two-way ANOVA
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# print the ANOVA table
print(anova_table)


                             sum_sq    df         F    PR(>F)
C(program)                 1.850000   2.0  0.327530  0.723872
C(experience)              6.533333   1.0  2.313367  0.141333
C(program):C(experience)   3.408667   2.0  0.603482  0.554994
Residual                  67.780000  24.0       NaN       NaN
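
Reading the table above: each row is tested against the residual, and an effect is significant only if its PR(>F) value is below the chosen significance level. The short sketch below applies that rule to the three p-values copied from the table; the 0.05 threshold and the variable names are assumptions made for illustration.

```python
ALPHA = 0.05  # assumed significance level

# p-values copied from the PR(>F) column of the ANOVA table above
p_values = {
    "C(program)": 0.723872,
    "C(experience)": 0.141333,
    "C(program):C(experience)": 0.554994,
}

for effect, p in p_values.items():
    verdict = "significant" if p < ALPHA else "not significant"
    print(f"{effect}: p = {p:.3f} -> {verdict}")

significant_effects = [effect for effect, p in p_values.items() if p < ALPHA]
```

For this sample dataset none of the effects reaches significance: there is no evidence of a main effect of program, a main effect of experience, or an interaction between them.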


In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [8]:
import numpy as np
from scipy.stats import ttest_ind

# create the data
control_scores = np.array([65, 72, 68, 80, 77, 73, 75, 81, 69, 71, 78, 74, 76, 70, 82, 79, 66, 67, 83, 84,
                           72, 68, 76, 79, 81, 73, 75, 70, 77, 78, 74, 71, 80, 82, 67, 83, 69, 65, 84, 72,
                           70, 81, 68, 77, 73, 76, 75, 74, 79, 78, 66, 82, 83, 71, 80, 72, 84, 69, 67, 65,
                           70, 76, 75, 77, 68, 79, 74, 73, 81, 72, 78, 66, 82, 84, 71, 67, 83, 69, 80, 65,
                           77, 76, 70, 75, 68, 72, 81, 73, 74, 78, 79, 66, 82, 84, 83, 67, 71, 69, 80, 65])

experimental_scores = np.array([73, 85, 81, 90, 84, 79, 82, 88, 78, 80, 83, 86, 89, 87, 91, 76, 77, 75, 92, 74,
                                82, 81, 78, 87, 85, 77, 88, 74, 86, 84, 83, 91, 89, 75, 92, 76, 73, 90, 85, 81,
                                88, 77, 84, 83, 78, 89, 80, 76, 86, 73, 90, 91, 75, 79, 74, 92, 87, 82, 88, 84,
                                85, 77, 83, 78, 89, 76, 80, 81, 73, 90, 91, 75, 74, 92, 87, 82, 88, 85, 83, 79,
                                84, 78, 89, 80, 76, 86, 73, 90, 91, 75, 74, 92, 87, 82, 88, 85, 83, 79, 84, 78])

# perform the two-sample t-test
t_stat, p_value = ttest_ind(control_scores, experimental_scores)

# print the results
print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.3f}")


T-statistic: -9.925
P-value: 0.000
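
Because only two groups are compared, a significant t-test already identifies the pair that differs, so no separate post-hoc test is needed. The printed p-value of 0.000 is a rounding artifact of the `.3f` format; scientific notation shows how small it is. The sketch below uses short hypothetical arrays in place of the full score arrays, Welch's variant of the t-test (`equal_var=False`), and Cohen's d as an effect size; all three choices are illustrative assumptions rather than part of the original analysis.

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical short samples standing in for the full score arrays
control = np.array([65, 72, 68, 80, 77, 73, 75, 81], dtype=float)
experimental = np.array([73, 85, 81, 90, 84, 79, 82, 88], dtype=float)

# Welch's t-test does not assume equal variances in the two groups
t_stat, p_value = ttest_ind(control, experimental, equal_var=False)

# Cohen's d: difference in means divided by the pooled standard deviation
n1, n2 = len(control), len(experimental)
pooled_sd = np.sqrt(((n1 - 1) * control.var(ddof=1) + (n2 - 1) * experimental.var(ddof=1))
                    / (n1 + n2 - 2))
cohens_d = (experimental.mean() - control.mean()) / pooled_sd

print(f"T-statistic: {t_stat:.3f}")
print(f"P-value: {p_value:.3e}")
print(f"Cohen's d: {cohens_d:.2f}")
```

The negative t-statistic reflects that the control mean is lower than the experimental mean, and Cohen's d expresses that difference in units of the pooled standard deviation.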


In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [12]:
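import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

# create the data (one sales value per store for each of the 30 days)
store_a = [10, 12, 8, 14, 11, 9, 13, 15, 12, 11, 13, 14, 12, 10, 11, 9, 12, 14, 15, 11, 10, 13, 12, 14, 11, 12, 10, 9, 13, 15]
store_b = [8, 10, 7, 11, 9, 6, 10, 12, 11, 9, 12, 11, 10, 8, 9, 7, 11, 12, 13, 9, 8, 12, 10, 13, 8, 10, 7, 6, 11, 13]
store_c = [12, 15, 10, 17, 13, 11, 16, 18, 14, 12, 16, 14, 13, 11, 12, 10, 14, 16, 18, 12, 11, 15, 13, 17, 12, 14, 10, 9, 16, 18]

# reshape to long format: each day is the repeated "subject", store is the within-subject factor
n_days = len(store_a)
sales = pd.DataFrame({
    'day': np.tile(np.arange(n_days), 3),
    'store': np.repeat(['A', 'B', 'C'], n_days),
    'sales': store_a + store_b + store_c
})

# perform repeated measures ANOVA
# (f_oneway would treat the stores as independent samples and ignore the pairing by day)
rm_results = AnovaRM(sales, depvar='sales', subject='day', within=['store']).fit()
p_value = rm_results.anova_table.loc['store', 'Pr > F']

if p_value < 0.05:
    print("The results are significant (p < 0.05)")
else:
    print("The results are not significant (p >= 0.05)")

# perform post-hoc paired t-tests with a Bonferroni correction for the three comparisons
n_comparisons = 3
t_value_ab, p_value_ab = ttest_rel(store_a, store_b)
t_value_ac, p_value_ac = ttest_rel(store_a, store_c)
t_value_bc, p_value_bc = ttest_rel(store_b, store_c)

# compare each Bonferroni-adjusted p-value (raw p-value times the number of comparisons) with 0.05
if p_value_ab * n_comparisons < 0.05:
    print("There is a significant difference between Store A and Store B (p < 0.05)")
if p_value_ac * n_comparisons < 0.05:
    print("There is a significant difference between Store A and Store C (p < 0.05)")
if p_value_bc * n_comparisons < 0.05:
    print("There is a significant difference between Store B and Store C (p < 0.05)")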


The results are significant (p < 0.05)
There is a significant difference between Store A and Store B (p < 0.05)
There is a significant difference between Store A and Store C (p < 0.05)
There is a significant difference between Store B and Store C (p < 0.05)
