#### Answer_1

ANOVA (Analysis of Variance) is a statistical method used to compare means between two or more groups. In order to use ANOVA, there are certain assumptions that need to be met, and violations of these assumptions could impact the validity of the results. The assumptions are:

* Normality assumption: The data within each group should follow a normal distribution. Violations of this assumption could occur when the data is skewed or has extreme outliers.

* Homogeneity of variances assumption: The variance of the data within each group should be roughly equal. Violations of this assumption could occur when the variance is much larger in one group than in others, which can affect the F-statistic and the p-value.

* Independence assumption: The observations within each group should be independent of each other. This means that the data points should not be related or dependent on each other. Violations of this assumption could occur when there is a relationship between the observations within a group.

Examples of violations that could impact the validity of the results:

* Non-normality: If the data within each group is not normally distributed, then the results of ANOVA may not be reliable. For example, if the data is skewed or has extreme outliers, then the results may not accurately reflect the true differences between the groups.

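* Heteroscedasticity: If the variance of the data within each group is not equal, then the results of ANOVA may not be reliable. For example, if the variance is much larger in one group than in others, the pooled within-group variance no longer describes every group well, so the actual Type I error rate of the F-test can differ from the nominal level, especially when group sizes are unequal. In this case, an alternative that does not assume equal variances, such as Welch's ANOVA, may be more appropriate.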

* Dependence: If the observations within a group are not independent, then the results of ANOVA may not be reliable. For example, if the data is collected from a repeated measures design, where each participant is measured more than once, then the observations within each participant may be dependent on each other, violating the independence assumption. In this case, a different statistical analysis, such as a repeated measures ANOVA, would be more appropriate.
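The normality and equal-variance assumptions above can be checked before running the ANOVA. The following is a minimal sketch using `scipy.stats.shapiro` and `scipy.stats.levene`; the three groups are simulated, hypothetical data used only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups simulated from normal distributions (illustration only)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (10.0, 11.0, 12.0)]

# Normality: Shapiro-Wilk test within each group (H0: the sample is normally distributed)
shapiro_p = [stats.shapiro(g)[1] for g in groups]

# Homogeneity of variances: Levene's test (H0: all group variances are equal)
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_p, 3))
print("Levene p-value:", round(float(levene_p), 3))
```

A small p-value in either test suggests the corresponding assumption may be violated. Independence cannot be tested this way; it has to be judged from how the data were collected.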

#### Answer_2

The three types of ANOVA are:

* One-way ANOVA: This type of ANOVA is used when there is one categorical independent variable and one continuous dependent variable. The goal is to determine whether the means of the dependent variable are equal across the different levels of the independent variable. For example, a one-way ANOVA could be used to determine whether there are differences in the mean exam scores between three different classes (Class A, B, and C) of students.

* Two-way ANOVA: This type of ANOVA is used when there are two categorical independent variables and one continuous dependent variable. The goal is to determine whether the means of the dependent variable are equal across the different levels of the two independent variables and whether there is an interaction effect between the two independent variables. For example, a two-way ANOVA could be used to determine whether there are differences in the mean exam scores between three different classes (Class A, B, and C) of students who were taught by two different teachers (Teacher X and Y).

* Three-way ANOVA: This type of ANOVA is used when there are three categorical independent variables and one continuous dependent variable. The goal is to determine whether the means of the dependent variable are equal across the different levels of the three independent variables and whether there are any interaction effects between the three independent variables. For example, a three-way ANOVA could be used to determine whether there are differences in the mean exam scores between three different classes (Class A, B, and C) of students who were taught by two different teachers (Teacher X and Y) in two different schools (School 1 and 2).

In summary, one-way ANOVA is used when there is one categorical independent variable, two-way ANOVA is used when there are two categorical independent variables, and three-way ANOVA is used when there are three categorical independent variables. The goal of each type of ANOVA is to determine whether there are significant differences in the means of the dependent variable across the different levels of the independent variables, and to identify any interaction effects between the independent variables.

#### Answer_3

Partitioning of variance is a fundamental concept in ANOVA (Analysis of Variance) that refers to the division of the total variance in a dataset into different components that are associated with different sources of variation. The partitioning of variance in ANOVA is important because it allows us to determine how much of the total variance in the data can be attributed to the independent variable(s) being studied, and how much of the variance is due to random error or other factors that are not of interest. This information is crucial in determining whether there are statistically significant differences between groups or conditions.

In ANOVA, the total variance in the data is divided into two components: the between-group variance and the within-group variance. The between-group variance represents the differences between the means of the groups being compared and is attributable to the independent variable(s). The within-group variance represents the variability within each group and is attributable to random error or other factors that are not of interest.

By comparing the between-group variance to the within-group variance, ANOVA allows us to determine whether the differences between the means of the groups are statistically significant or whether they could be due to chance. If the between-group variance is large relative to the within-group variance, then we can conclude that the differences between the means of the groups are statistically significant and are likely due to the independent variable(s) being studied.

In summary, understanding the partitioning of variance in ANOVA is important because it allows us to determine the sources of variation in the data and to determine whether the differences between the means of the groups are statistically significant.
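The partition described above can be verified by hand. This is a minimal sketch with three small, hypothetical groups; it computes the total, between-group, and within-group sums of squares with NumPy and then forms the F-statistic from them.

```python
import numpy as np

# Hypothetical scores for three groups (illustration only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([6.0, 7.0, 8.0]),
    np.array([9.0, 10.0, 11.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: squared deviations of every observation from the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

# Between-group variation: squared deviations of group means from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group variation: squared deviations of observations from their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# The F-statistic is the ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print("SS total:", ss_total)
print("SS between + SS within:", ss_between + ss_within)
print("F-statistic:", f_stat)
```

The first two printed values agree, which is the partitioning identity: total variation equals between-group plus within-group variation.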

#### Answer_4

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# load data into a pandas DataFrame
data = pd.read_csv('data.csv')

# fit one-way ANOVA model, treating `group` as a categorical factor
model = ols('response ~ C(group)', data=data).fit()

# build the ANOVA table once; its rows are the group effect and the residual
anova_table = sm.stats.anova_lm(model, typ=1)

# explained (between-group) sum of squares (SSE)
sse = anova_table.loc['C(group)', 'sum_sq']

# residual (within-group) sum of squares (SSR)
ssr = anova_table.loc['Residual', 'sum_sq']

# total sum of squares (SST) is the sum of the two components
sst = sse + ssr

print("Total sum of squares (SST):", sst)
print("Explained sum of squares (SSE):", sse)
print("Residual sum of squares (SSR):", ssr)

#### Answer_5

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# load data into a pandas DataFrame
data = pd.read_csv('data.csv')

# fit two-way ANOVA model with both main effects and their interaction
model = ols('response ~ C(factorA) + C(factorB) + C(factorA):C(factorB)', data=data).fit()

# build the ANOVA table and extract the F-statistics and p-values by row label
anova_table = sm.stats.anova_lm(model, typ=2)
effects = ['C(factorA)', 'C(factorB)', 'C(factorA):C(factorB)']
f_values = anova_table.loc[effects, 'F']
p_values = anova_table.loc[effects, 'PR(>F)']

print("Main effect of factor A: F =", f_values.iloc[0], ", p =", p_values.iloc[0])
print("Main effect of factor B: F =", f_values.iloc[1], ", p =", p_values.iloc[1])
print("Interaction effect: F =", f_values.iloc[2], ", p =", p_values.iloc[2])

#### Answer_6

If we conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, we can conclude that there is statistically significant evidence that at least one of the groups is different from the others in terms of the mean value of the response variable.

The F-statistic is a ratio of the variance between the groups to the variance within the groups, so a higher F-value indicates that the differences between the groups are relatively larger compared to the differences within the groups. In this case, the F-value of 5.23 suggests that the differences between the groups are moderately larger than the differences within the groups.

The p-value of 0.02 indicates that the probability of obtaining an F-statistic as extreme or more extreme than the one we observed, assuming that there is no difference between the groups, is only 0.02. This is below the commonly used threshold of 0.05, so we can reject the null hypothesis of no difference between the groups and conclude that there is statistically significant evidence that at least one of the groups is different from the others.

However, it's important to note that the one-way ANOVA only tells us that there is a difference between the groups, but not which specific groups are different from each other. To determine this, we would need to conduct post-hoc tests, such as Tukey's HSD or Bonferroni correction, to compare pairs of groups and identify the significant differences.

#### Answer_7

In a repeated measures ANOVA, missing data can occur if one or more measurements are not available for some of the participants. There are different methods to handle missing data, and the choice of method can have an impact on the validity and precision of the results.

Here are some common methods for handling missing data in a repeated measures ANOVA:

* Complete case analysis: This method involves excluding all participants who have missing data on any of the variables used in the ANOVA. The advantage of this method is that it is straightforward to implement and gives unbiased estimates if the data are missing completely at random. However, this method reduces the sample size, and therefore statistical power, and can introduce bias if the missingness is related to the outcome or the covariates.

* Pairwise deletion: This method involves including all participants who have at least one complete measurement on the variables used in the ANOVA. The advantage of this method is that it can retain more data than complete case analysis, but it can also introduce bias if the missing data are related to the outcome or the covariates.

* Imputation: This method involves estimating the missing values based on the observed values and the characteristics of the data. There are different types of imputation methods, such as mean imputation, regression imputation, and multiple imputation. The advantage of imputation is that it can retain more data and potentially reduce bias if the imputation model is correctly specified. However, imputation can also introduce bias if the imputation model is misspecified or if the assumptions of the imputation method are violated.

It is important to note that the choice of method for handling missing data should depend on the mechanism of missingness (i.e., whether the missing data are missing completely at random, missing at random, or missing not at random) and the amount and pattern of missing data. In general, multiple imputation is recommended over simpler methods when the amount of missing data is substantial and the data are plausibly missing at random. If the missingness is non-ignorable (missing not at random), standard imputation methods can still give biased results, and the missingness mechanism itself needs to be modeled or examined through sensitivity analyses.

In addition, different methods for handling missing data can lead to different estimates of the standard errors and the p-values, which can affect the conclusions of the ANOVA. Therefore, it is important to report the method used for handling missing data and to conduct sensitivity analyses to assess the robustness of the results to different methods.
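To make the difference between the first and third approaches concrete, here is a minimal sketch with pandas. The wide-format table (one row per participant, columns `t1` to `t3`) is hypothetical, and mean imputation is used only because it is the simplest form of imputation; it understates variability and is not a substitute for multiple imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format repeated measures: one row per participant
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 9.0],
    "t2": [11.0, np.nan, 12.0, 10.0],
    "t3": [13.0, 14.0, np.nan, 11.0],
})

# Complete case analysis: drop every participant with any missing measurement
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with the mean of its time point
mean_imputed = scores.fillna(scores.mean())

print("Participants kept by complete case analysis:", len(complete_cases))
print(mean_imputed)
```

Complete case analysis keeps only the participants measured at every time point, while imputation keeps all participants at the cost of relying on the imputation model.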

#### Answer_8

After conducting an ANOVA, post-hoc tests can be used to compare pairs of groups and identify which groups differ significantly from each other. Here are some common post-hoc tests used after ANOVA and when to use each one:

* Tukey's Honestly Significant Difference (HSD) test: This test is used when there are more than two groups and all pairwise comparisons are of interest. It controls the family-wise Type I error rate, which is the probability of rejecting at least one true null hypothesis across all comparisons, at a specified level, and it is typically less conservative than the Bonferroni correction when every pair of groups is compared. For example, a researcher may conduct a one-way ANOVA to compare the mean scores of three different teaching methods on a math test. If the ANOVA shows a significant difference among the three groups, the researcher may use Tukey's HSD test to compare the mean scores of each pair of teaching methods.

* Bonferroni correction: This method is used when there are multiple comparisons, and it adjusts the p-values (multiplying each by the number of comparisons) to control the family-wise Type I error rate at a specified level. It is generally more conservative than Tukey's HSD test and is best suited to a small number of planned comparisons, because it loses power as the number of comparisons grows. For example, if a researcher conducts a two-way ANOVA to compare the effects of gender and age on a cognitive task, and the ANOVA shows a significant interaction effect, the researcher may use Bonferroni correction to compare the mean scores of a few selected combinations of gender and age.

* Dunnett's test: This test is used when there is a control group and multiple treatment groups, and it compares each treatment group with the control group while controlling the overall Type I error rate. It is appropriate when the research question is whether any of the treatment groups differ significantly from the control group. For example, a researcher may conduct a one-way ANOVA to compare the mean scores of four different medications on a symptom scale, with a placebo group as the control. If the ANOVA shows a significant difference among the five groups, the researcher may use Dunnett's test to compare the mean scores of each medication group with the placebo group.

It's important to note that making many pairwise comparisons without any adjustment increases the probability of making a Type I error, which is the risk of rejecting a null hypothesis when it is true. Post-hoc tests exist to control this risk, so it is recommended to report their results with the adjusted p-values and effect sizes. A post-hoc test might be necessary when a researcher conducts an ANOVA and finds a significant difference among the groups or conditions but wants to determine which specific groups or conditions differ significantly from each other.
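As an illustration of adjusted p-values, the sketch below runs all pairwise t-tests among three groups and applies a Bonferroni correction with `statsmodels.stats.multitest.multipletests`. The group names and simulated scores are assumptions for this example only.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical test scores for three teaching methods (simulated, illustration only)
rng = np.random.default_rng(42)
groups = {
    "method_1": rng.normal(70, 8, 25),
    "method_2": rng.normal(75, 8, 25),
    "method_3": rng.normal(82, 8, 25),
}

# Unadjusted p-value for every pair of groups
pairs = list(combinations(groups, 2))
raw_p = [stats.ttest_ind(groups[a], groups[b])[1] for a, b in pairs]

# Bonferroni: each p-value is multiplied by the number of comparisons (capped at 1)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p_raw, p_adj, rej in zip(pairs, raw_p, adj_p, reject):
    print(f"{a} vs {b}: raw p = {p_raw:.4f}, adjusted p = {p_adj:.4f}, reject = {rej}")
```

The adjusted p-values are never smaller than the raw ones, which is how the correction keeps the family-wise error rate at the chosen level.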

#### Answer_9

In [31]:
import pandas as pd
import numpy as np
import scipy.stats as stat


data = pd.DataFrame({
    "weight loss by diet A in kg" : [5.2, 3.8, 4.5, 2.3, 6.1, 3.9, 4.8, 5.3, 3.2, 2.5, 7.2, 8.1, 6.9,7.5, 8.8, 7.1],
    "weight loss by diet B in kg" : [6.4, 7.9, 9.1, 8.5, 4.2, 2.9, 3.6, 5.1, 3.3, 4.1, 2.8, 5.5, 3.4, 4.7,5.2, 3.8],
    "weight loss by diet C in kg" : [3.2, 2.5, 7.2, 8.1, 6.9,7.5, 8.8, 7.1, 3.3, 4.1, 2.8, 5.5, 3.4, 4.7,5.2, 3.8]
})

F_stat,p_value = stat.f_oneway(data["weight loss by diet A in kg"],
                               data["weight loss by diet B in kg"], 
                               data["weight loss by diet C in kg"])
print("F-Statistic :{:.2f}".format(F_stat))
print("p_value : {:.3f}".format(p_value))


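F-Statistic :0.17
p_value : 0.843

Since the p-value (0.843) is well above the 0.05 significance level, we fail to reject the null hypothesis: these data provide no evidence that mean weight loss differs among diets A, B, and C.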


#### Answer_10

In [16]:
import pandas as pd

data = pd.DataFrame({
    'program': ['A']*20 + ['B']*20 + ['C']*20,
    'experience': ['experienced']*10 + ['novice']*10 + ['experienced']*10 + ['novice']*10 + ['experienced']*10 + ['novice']*10,
    'time': [24,26,25,27,28,26,27,29,25,28, # program A experienced
             24,37,28,32,29,31,30,28,31,33, # program A novice
             31,29,30,28,32,30,28,29,33,31, # program B experienced
             35,36,33,38,37,34,36,39,35,33, # program B novice
             17,19,20,18,21,19,18,20,21,19, # program C experienced
             22,23,24,20,21,22,23,22,24,20] # program C novice
})

In [18]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [20]:
model = ols('time ~ program + experience + program:experience' , data).fit()

In [21]:
anova_table = sm.stats.anova_lm(model, typ=2)

In [22]:
anova_table

| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| program | 1524.700000 | 2.0 | 181.832597 | 1.028650e-24 |
| experience | 248.066667 | 1.0 | 59.167845 | 3.103420e-10 |
| program:experience | 17.433333 | 2.0 | 2.079064 | 1.349430e-01 |
| Residual | 226.400000 | 54.0 | NaN | NaN |


From the ANOVA table, we can see that there are significant main effects of program (F(2,54) = 181.83, p = 1.03e-24) and experience (F(1,54) = 59.17, p = 3.10e-10), but the interaction effect is not significant (F(2,54) = 2.08, p = 0.135).

The main effect of program indicates that there are differences in the average time it takes to complete the task using the different software programs. However, we cannot say which programs are significantly different from each other without conducting post-hoc tests.

The main effect of experience indicates that there are differences in the average time it takes to complete the task between novice and experienced employees, regardless of the software program used.

Because the interaction p-value (0.135) is above 0.05, there is no evidence that the effect of software program on task completion time depends on the employee's experience level, or vice versa.
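The main effect of program can be followed up with a post-hoc test to see which programs differ. The sketch below rebuilds the same data and applies `pairwise_tukeyhsd` to the three programs. It pools over experience level, which is a simplifying assumption of this example rather than part of the analysis above.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Same task-completion times as in the two-way ANOVA above
data = pd.DataFrame({
    'program': ['A'] * 20 + ['B'] * 20 + ['C'] * 20,
    'time': [24, 26, 25, 27, 28, 26, 27, 29, 25, 28,  # program A experienced
             24, 37, 28, 32, 29, 31, 30, 28, 31, 33,  # program A novice
             31, 29, 30, 28, 32, 30, 28, 29, 33, 31,  # program B experienced
             35, 36, 33, 38, 37, 34, 36, 39, 35, 33,  # program B novice
             17, 19, 20, 18, 21, 19, 18, 20, 21, 19,  # program C experienced
             22, 23, 24, 20, 21, 22, 23, 22, 24, 20]  # program C novice
})

# Tukey's HSD across programs, ignoring experience (simplifying assumption)
tukey = pairwise_tukeyhsd(endog=data['time'], groups=data['program'], alpha=0.05)

print(tukey.summary())
```

Each row of the summary compares one pair of programs and reports the mean difference, the adjusted p-value, and whether the null hypothesis of equal means is rejected.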

#### Answer_11

In [23]:
import numpy as np
from scipy import stats

# Generate some sample data 
control_scores = np.random.normal(loc=70, scale=10, size=100)
experimental_scores = np.random.normal(loc=75, scale=10, size=100)

# Conduct the t-test
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)

# Print the results
print("t-statistic:", t_statistic)
print("p-value:", p_value)


t-statistic: -4.12895470878696
p-value: 5.369597482929837e-05


In [24]:
import statsmodels.stats.multicomp as mc

# Create a list of group labels
groups = ['control']*100 + ['experimental']*100

# Create a list of all test scores
all_scores = np.concatenate((control_scores, experimental_scores))

# Perform the Tukey's HSD test
tukey_results = mc.MultiComparison(all_scores, groups).tukeyhsd()

# Print the results
print(tukey_results)


   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
control experimental   5.3468 0.0001 2.7931 7.9004   True
---------------------------------------------------------


#### Answer_12

In [25]:
import numpy as np
import pandas as pd

np.random.seed(1)

store_a = np.random.normal(100, 20, 30)
store_b = np.random.normal(110, 25, 30)
store_c = np.random.normal(120, 30, 30)

sales_data = pd.DataFrame({
    'Store A': store_a,
    'Store B': store_b,
    'Store C': store_c
})


In [26]:
sales_data 

| | Store A | Store B | Store C |
|---|---|---|---|
| 0 | 132.486907 | 92.708481 | 97.368062 |
| 1 | 87.764872 | 100.081162 | 157.586045 |
| 2 | 89.436565 | 92.820682 | 135.387895 |
| 3 | 78.540628 | 88.869859 | 111.057215 |
| 4 | 117.308153 | 93.218847 | 134.655544 |
| 5 | 53.969226 | 109.683385 | 117.732849 |
| 6 | 134.896235 | 82.067241 | 153.948882 |
| 7 | 84.775862 | 115.860392 | 165.594504 |
| 8 | 106.380782 | 151.495054 | 185.567262 |
| 9 | 95.012592 | 128.551104 | 78.105110 |


In [38]:
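import statsmodels.api as sm
from statsmodels.formula.api import ols

# reshape from wide (one column per store) to long (one row per store-day observation)
sales_melt = pd.melt(sales_data.reset_index(), id_vars=['index'], value_vars=['Store A', 'Store B', 'Store C'])
sales_melt.columns = ['day', 'store', 'sales']

# note: `day` is numeric here, so it enters the model as a continuous covariate (1 df);
# this is an ordinary least-squares model with a store-by-day interaction,
# not a repeated measures ANOVA in which each day is treated as a subject
rm_anova = ols('sales ~ store + day + store:day', data=sales_melt).fit()
sm.stats.anova_lm(rm_anova, typ=2)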


| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| store | 10225.685477 | 2.0 | 9.675720 | 0.000165 |
| day | 513.902230 | 1.0 | 0.972526 | 0.326883 |
| store:day | 1128.773737 | 2.0 | 1.068065 | 0.348293 |
| Residual | 44387.271403 | 84.0 | NaN | NaN |
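In the model above, `day` has a single degree of freedom, so it is treated as a numeric covariate rather than as a repeated-measures subject. One way to run a repeated measures ANOVA on the same data is the `AnovaRM` class from statsmodels, sketched below with each day as the subject and store as the within-subject factor. The data are regenerated with the same seed so the block runs on its own; the variable names `sales_long` and `rm_result` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Regenerate the same simulated sales data (same seed and order as above)
np.random.seed(1)
sales_data = pd.DataFrame({
    'Store A': np.random.normal(100, 20, 30),
    'Store B': np.random.normal(110, 25, 30),
    'Store C': np.random.normal(120, 30, 30),
})

# Long format: one row per (day, store) pair, with day acting as the subject identifier
sales_long = (
    sales_data.reset_index()
    .melt(id_vars='index', var_name='store', value_name='sales')
    .rename(columns={'index': 'day'})
)

# Repeated measures ANOVA: each day is measured once under each store
rm_result = AnovaRM(sales_long, depvar='sales', subject='day', within=['store']).fit()

print(rm_result.anova_table)
```

With 30 days and 3 stores, the store effect is tested on 2 and 58 degrees of freedom, because day-to-day variation is removed from the error term.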


In [39]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Perform Tukey's HSD test for post-hoc comparisons
tukey = pairwise_tukeyhsd(endog=sales_melt['sales'], groups=sales_melt['store'], alpha=0.05)

print(tukey.summary())

 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1  group2 meandiff p-adj   lower   upper  reject
------------------------------------------------------
Store A Store B   13.113 0.0754 -1.0485 27.2744  False
Store A Store C  26.1095 0.0001 11.9481  40.271   True
Store B Store C  12.9966 0.0788 -1.1649  27.158  False
------------------------------------------------------
