# Assignment

Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

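Ans1: To use the ANOVA test we make the following assumptions:

- Each group sample is drawn from a normally distributed population (and therefore one that is symmetrical and unimodal).

- All populations have a common variance (homogeneity of variance).

- All samples are drawn independently of each other.

- Within each sample, the observations are sampled randomly and independently of each other.

- Factor effects are additive.

- Equal group sizes are not strictly required, but a balanced design with reasonably large groups makes the test more robust to departures from normality and equal variance.

Examples of violations that could affect the validity of the results:

- Normality: strongly skewed outcomes such as incomes or reaction times, or extreme outliers in small samples, can distort the F-test.

- Equal variance: if one group is far more spread out than the others, especially when group sizes are unequal, the Type I error rate can be inflated or deflated.

- Independence: treating repeated measurements on the same subject, or students taught in the same classroom, as independent observations understates the error variance and inflates the F-statistic.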
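The normality and equal-variance assumptions can be checked before running ANOVA. The sketch below is one possible approach, not the only one: it uses `scipy.stats.shapiro` for each group and `scipy.stats.levene` across groups. The data, group means, and seed are hypothetical and exist only to make the example runnable.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups simulated from normal populations
# with different means and a common standard deviation
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (10.0, 11.0, 12.0)]

# Shapiro-Wilk test per group: H0 = the sample comes from a normal population
shapiro_pvalues = [stats.shapiro(g).pvalue for g in groups]

# Levene's test across groups: H0 = all groups share a common variance
levene_stat, levene_p = stats.levene(*groups)

for i, p in enumerate(shapiro_pvalues, start=1):
    print(f"group {i}: Shapiro-Wilk p = {p:.3f}")
print(f"Levene p = {levene_p:.3f}")
```

A small p-value in either test is evidence against the corresponding assumption; a large p-value does not prove the assumption holds, so plots of the residuals are still worth inspecting.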

Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans2: The use of ANOVA depends on the research design. 
Commonly, ANOVAs are used in three ways: one-way ANOVA, two-way ANOVA, and N-way ANOVA.

***One-Way ANOVA***

A one-way ANOVA has just one independent variable. For example, differences in IQ can be assessed by Country,
and Country can have 2, 20, or more different categories to compare.

***Two-Way ANOVA***

A two-way ANOVA (a type of factorial ANOVA) refers to an ANOVA using two independent variables. Expanding the example above, a two-way ANOVA can examine differences in IQ scores (the dependent variable) by Country (independent variable 1) and Gender (independent variable 2). A two-way ANOVA can also be used to examine the interaction between the two independent variables. Interactions indicate that differences are not uniform across all categories of the independent variables. For example, females may have higher IQ scores overall compared to males, but this difference could be greater (or smaller) in European countries compared to North American countries.

***N-Way ANOVA***

A researcher can also use more than two independent variables; this is an n-way ANOVA, with n being the number of independent variables. For example, potential differences in IQ scores can be examined by Country, Gender, Age group, Ethnicity, and so on, simultaneously.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans3: ANOVA (Analysis of Variance) is a statistical method used to compare the means of two or more groups. The variance of scores is partitioned into components attributable to different sources of variation. In a one-way design the total variation is divided into two parts: variation between groups and variation within groups. The variation between groups measures how far the group means lie from the overall mean, while the variation within groups measures how far individual observations lie from their own group mean.

Partitioning of variance is important in ANOVA because it helps to determine whether the differences between the means of different groups are statistically significant or not. It also helps to identify which sources of variation are significant and which are not. By partitioning the variance, we can determine the proportion of variance that is due to the treatment effect and the proportion that is due to random error. This information is useful in interpreting the results of an ANOVA test and in making decisions based on those results.
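The partition can be verified numerically. The following is a minimal sketch on hypothetical scores for three groups (the values are made up for illustration): it computes the between-group and within-group sums of squares by hand, confirms they add up to the total, and compares the resulting F-ratio with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three groups
groups = [np.array([4.0, 5.0, 6.0]),
          np.array([6.0, 7.0, 8.0]),
          np.array([9.0, 10.0, 11.0])]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Between groups: distance of each group mean from the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within groups: distance of each observation from its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total: distance of each observation from the grand mean
ss_total = ((all_scores - grand_mean) ** 2).sum()

# F is the ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)
f_scipy = stats.f_oneway(*groups).statistic

print(ss_between, ss_within, ss_total)
print(f_manual, f_scipy)
```

For these values the between-group and within-group sums of squares are 38 and 6, which add up to the total of 44.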

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

Ans4: To calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python, you can follow these steps:

1. First, fit a regression model with the group as a categorical predictor, using the `statsmodels` library in Python.
2. Next, use the following formulas to calculate the SST, SSE, and SSR values of the model:

    - SST = $\sum_i (y_i - \bar{y})^2$
    - SSE = $\sum_i (\hat{y}_i - \bar{y})^2$
    - SSR = $\sum_i (y_i - \hat{y}_i)^2$
    - SST = SSE + SSR

Here, $\hat{y}_i$ is the predicted value of the response variable for the ith observation (in a one-way ANOVA, the mean of the group that observation belongs to), $y_i$ is the actual value of the response variable for the ith observation, and $\bar{y}$ is the mean of the response variable. Some textbooks swap the abbreviations, using SSR for the regression (explained) sum of squares and SSE for the error (residual) sum of squares, so check which convention is in use.

3. Finally, use the `numpy` library in Python to evaluate these sums.
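The steps above can be sketched as follows. The data frame, its column names (`group`, `y`), and its values are hypothetical. The abbreviations follow the question's convention (SSE = explained, SSR = residual); for cross-checking, a fitted `statsmodels` OLS result exposes the explained and residual sums of squares as `ess` and `ssr`.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical data: one response measured in three groups of four
df = pd.DataFrame({
    'group': np.repeat(['a', 'b', 'c'], 4),
    'y': [4.0, 5.0, 6.0, 5.0, 7.0, 8.0, 6.0, 7.0, 10.0, 9.0, 11.0, 10.0],
})

# Step 1: fit the one-way model; each fitted value is the observation's group mean
model = ols('y ~ C(group)', data=df).fit()
y = df['y'].to_numpy()
y_hat = model.fittedvalues.to_numpy()
y_bar = y.mean()

# Steps 2 and 3: evaluate the formulas with numpy
sst = np.sum((y - y_bar) ** 2)      # total
sse = np.sum((y_hat - y_bar) ** 2)  # explained (between groups)
ssr = np.sum((y - y_hat) ** 2)      # residual (within groups)

print("SST:", sst)
print("SSE (explained):", sse)
print("SSR (residual):", ssr)
```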







Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

Ans5: To calculate the main effects and interaction effects in a two-way ANOVA using Python, you can use the `statsmodels` library. Here are the steps:

1. First, fit a regression model using the `ols` function from `statsmodels.formula.api`.
2. Next, use the `anova_lm` function (available as `sm.stats.anova_lm`) to calculate the main effects and interaction effects.

Here’s an example code snippet that demonstrates how to calculate the main effects and interaction effects of a two-way ANOVA model in Python:

```python
import numpy as np
import pandas as pd

#create data
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

#view first ten rows of data
df[:10]
```



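|   | water | sun | height |
|---|-------|-----|--------|
| 0 | daily | low | 6 |
| 1 | daily | low | 6 |
| 2 | daily | low | 6 |
| 3 | daily | low | 5 |
| 4 | daily | low | 6 |
| 5 | daily | med | 5 |
| 6 | daily | med | 5 |
| 7 | daily | med | 6 |
| 8 | daily | med | 4 |
| 9 | daily | med | 5 |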


```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

#perform two-way ANOVA
model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()
sm.stats.anova_lm(model, typ=2)
```

|   | sum_sq | df | F | PR(>F) |
|---|--------|----|---|--------|
| C(water) | 8.533333 | 1.0 | 16.0 | 0.000527 |
| C(sun) | 24.866667 | 2.0 | 23.3125 | 2e-06 |
| C(water):C(sun) | 2.466667 | 2.0 | 2.3125 | 0.120667 |
| Residual | 12.8 | 24.0 |  |  |


We can see the following p-values for each of the factors in the table:

- water: p-value = .000527
- sun: p-value = .000002
- water*sun: p-value = .120667

Since the p-values for water and sun are both less than .05, this means that both factors have a statistically significant effect on plant height.

And since the p-value for the interaction effect (.120667) is not less than .05, this tells us that there is no significant interaction effect between sunlight exposure and watering frequency.
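To see what these effects look like in the data, the group means can be tabulated. This is a small supplementary sketch on the same plant data, using `pandas` `pivot_table` and `groupby`; the variable names `cell_means`, `water_means`, and `sun_means` are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Same plant-height data as in the example above
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

# Mean height in each water-by-sun cell (basis of the interaction effect)
cell_means = df.pivot_table(values='height', index='water',
                            columns='sun', aggfunc='mean')

# Marginal means (basis of the two main effects)
water_means = df.groupby('water')['height'].mean()
sun_means = df.groupby('sun')['height'].mean()

print(cell_means)
print(water_means)
print(sun_means)
```

A main effect shows up as a difference between marginal means; an interaction would show up as the daily-versus-weekly gap changing noticeably from one sunlight level to another.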

Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

Ans6: With an F-statistic of 5.23 and a p-value of 0.02, the result is statistically significant at the 0.05 level. The p-value means that, if the null hypothesis (all group means are equal) were true, an F-statistic at least this large would occur only about 2% of the time. You can therefore reject the null hypothesis and conclude that at least one group mean differs from the others.

The F-statistic is the ratio of the between-group mean square to the within-group mean square, so a value of 5.23 means the variation between the sample means is about five times the variation within the samples. This suggests that the differences between the groups are unlikely to be due to random chance alone. However, the F-statistic does not tell us which group means differ from each other. To determine this, you can perform post-hoc tests such as the Tukey test, Bonferroni test, or Scheffe test.
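The link between an F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are hypothetical, and the resulting p-value is not expected to equal the 0.02 quoted above.

```python
from scipy import stats

f_stat = 5.23
df_between = 2   # hypothetical: 3 groups - 1
df_within = 27   # hypothetical: 30 observations - 3 groups

# p-value: probability of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_stat, df_between, df_within)

# critical value: the F that would be exceeded 5% of the time under the null
f_crit = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = 0.05: {f_crit:.3f}")
print("reject H0:", f_stat > f_crit)
```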

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

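Ans7: In a repeated measures ANOVA, missing data can be handled in several ways. One common approach is listwise deletion, which removes every subject with a missing value at any time point. Because a repeated measures ANOVA needs a complete set of measurements for each subject, a single missing value discards all of that subject's data, which reduces statistical power and can introduce bias if the data are not missing completely at random. Another approach is to impute the missing values. Mean imputation is simple but shrinks the variance and can make effects look more precise than they are. Regression imputation uses the other variables to predict the missing value but still treats the imputed values as if they were observed. Multiple imputation creates several completed data sets and pools the results, which reflects the uncertainty in the imputed values and generally gives less biased estimates when the data are missing at random. A further option is to fit a linear mixed-effects model instead, which uses all available observations without imputation.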

The choice of method for handling missing data in a repeated measures ANOVA depends on several factors, including the amount and pattern of missing data, the assumptions of the imputation method, and the research question being addressed. It is important to carefully consider the potential consequences of each method and to choose the method that is most appropriate for the specific research question and data set.
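The consequences of two of these methods can be illustrated with `pandas`. The wide-format table below (one row per subject, columns `t1` to `t3`) is hypothetical. The sketch compares listwise deletion with mean imputation and prints how each changes the sample size and the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame({
    'subject': [1, 2, 3, 4, 5],
    't1': [10.0, 12.0, 11.0, 13.0, 9.0],
    't2': [11.0, np.nan, 12.0, 14.0, 10.0],
    't3': [12.0, 13.0, np.nan, 15.0, 11.0],
}).set_index('subject')

# Listwise deletion: drop every subject with at least one missing value
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with its column mean
mean_imputed = scores.fillna(scores.mean())

print("subjects before / after listwise deletion:",
      len(scores), "/", len(complete_cases))
print("variance of t2, observed values only:", scores['t2'].var())
print("variance of t2 after mean imputation:", mean_imputed['t2'].var())
```

Listwise deletion keeps only three of the five subjects here, while mean imputation keeps all five but lowers the variance of `t2`, which is the variance shrinkage mentioned above.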

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Ans8: Some common post-hoc tests used after ANOVA include:

- Tukey’s HSD test: This test compares all possible pairs of means while controlling the family-wise error rate. It was designed for equal sample sizes; the Tukey-Kramer variant extends it to unequal sizes.

- Bonferroni correction: This method divides the significance level by the number of comparisons. It works with equal or unequal sample sizes and suits a small number of planned comparisons, because it becomes very conservative as the number of comparisons grows.

- Dunn’s test: This is a non-parametric pairwise procedure, typically used after a Kruskal-Wallis test when the data violate the assumption of normality.

- Scheffe’s test: This is a conservative test that controls the family-wise error rate for all possible contrasts, not only pairwise ones. It is appropriate when complex comparisons, such as one group against the average of two others, are of interest.

- Holm’s test: This is a step-down version of the Bonferroni correction that is never less powerful than Bonferroni, so it can be used wherever Bonferroni would be used.

The choice of post-hoc test depends on several factors, including the sample sizes, the number and type of comparisons, and the assumptions of the test. For example, Tukey’s test is often used when all pairwise comparisons are of interest and the data meet the assumptions of normality and equal variance. Scheffe’s test is preferred when complex contrasts are planned, and Dunn’s test is an option when the normality assumption does not hold.

A situation where a post-hoc test might be necessary is when an ANOVA produces a statistically significant result, indicating that at least one group mean is different from the others. In this case, a post-hoc test can be used to determine which group means are different from each other. For example, suppose a researcher wants to compare the effectiveness of three different treatments for a medical condition. After conducting an ANOVA, the researcher finds that there is a statistically significant difference between the means of the three groups. To determine which treatments are different from each other, the researcher can use a post-hoc test such as Tukey’s test or Bonferroni correction.
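The three-treatment scenario could be sketched as below, using `scipy.stats.f_oneway` followed by `scipy.stats.tukey_hsd`. The outcome values are hypothetical. Note that `tukey_hsd` takes one array per group as separate arguments.

```python
from scipy.stats import f_oneway, tukey_hsd

# Hypothetical outcome scores for three treatments
treatment_a = [24.5, 23.5, 26.4, 27.1, 29.9]
treatment_b = [28.4, 34.2, 29.5, 32.2, 30.1]
treatment_c = [26.1, 28.3, 24.3, 26.2, 27.8]

# Omnibus test: is there any difference among the three means?
f_stat, p_value = f_oneway(treatment_a, treatment_b, treatment_c)
print("F-statistic:", f_stat)
print("p-value:", p_value)

# Post-hoc test: which pairs differ? Pass one array-like per group.
result = tukey_hsd(treatment_a, treatment_b, treatment_c)
print(result)

# result.pvalue[i, j] is the adjusted p-value for group i versus group j
p_a_vs_b = result.pvalue[0, 1]
p_a_vs_c = result.pvalue[0, 2]
```

In the printed table, groups are numbered in the order they were passed, so comparison `(0 - 1)` is treatment A versus treatment B.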

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

Ans9: To conduct a one-way ANOVA using Python, you can use the `scipy.stats` module. The question does not supply data, so the lists below are illustrative values, with 25 observations per diet:



```python
import scipy.stats as stats

# weight loss data for the three diets
diet_a = [3.2, 2.9, 3.1, 2.8, 3.5, 2.7, 2.6, 2.9, 3.0, 2.8,
          3.1, 2.7, 3.0, 3.2, 2.9, 2.8, 2.7, 3.1, 2.8, 3.0,
          2.9, 2.8, 3.2, 3.0, 3.1]
diet_b = [2.7, 2.1, 2.7, 2.6, 2.2, 2.3, 2.6, 2.1, 2.3, 2.5,
          2.2, 2.4, 2.5, 2.3, 2.2, 2.4, 2.5, 2.1, 2.4, 2.2,
          2.3, 2.4, 2.5, 2.1, 2.2]
diet_c = [1.3, 2.0, 1.8, 1.7, 1.9, 1.2, 2.1, 2.0, 2.2, 2.1,
          1.8, 2.0, 1.9, 2.1, 1.8, 1.9, 1.7, 2.0, 1.8, 2.1,
          2.2, 1.7, 1.9, 1.8, 2.0]

# perform one-way ANOVA
f_stat, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

# print results
print("F-statistic: ", f_stat)
print("p-value: ", p_value)
```

```
F-statistic:  159.14922813035997
p-value:  3.7470014787571026e-27
```


The F-statistic is 159.15 and the p-value is approximately 3.75e-27.
Since the p-value is far below the significance level of 0.05, we reject the null hypothesis that the mean weight loss is the same for all three diets and conclude that at least one diet differs from the others. The ANOVA itself does not say which diets differ; a post-hoc test such as Tukey’s HSD would be needed for that.

Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

Ans10: The question does not supply data, so the values below are illustrative.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a dataframe with the time data and employee experience level
data = {'Time':[11, 22, 28, 25, 20, 23, 16, 27, 15, 12, 13, 22, 26, 29, 19, 29, 21, 14, 17,
                18, 25, 23, 17, 19, 28, 27, 20, 26, 11, 12],
        'Program': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 
                    'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
        'Experience': ['Experienced', 'Experienced', 'Experienced', 'Experienced', 'Experienced', 
                       'Experienced', 'Experienced', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 
                       'Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced', 
                       'Experienced', 'Experienced', 'Experienced', 'Experienced', 'Novice', 
                       'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice', 'Novice']}

df = pd.DataFrame(data)

# perform two-way ANOVA
model = ols('Time ~ Program + Experience + Program:Experience', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# print results
print(anova_table)
```

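```
                          sum_sq    df             F    PR(>F)
Program             6.666667e-02   2.0  8.978177e-04  0.976325
Experience         -3.197306e-15   1.0 -8.611795e-17  1.000000
Program:Experience  4.163095e+00   2.0  5.606551e-02  0.945591
Residual            9.653036e+02  26.0           NaN       NaN
```

None of the p-values is below 0.05 (Program: 0.976, Experience: 1.000, Program:Experience: 0.946), so with these data there is no evidence of a main effect of software program, a main effect of experience level, or an interaction between them. The sum of squares for Experience (about -3.2e-15) is zero up to floating-point rounding.

These results should be read with caution. In this data set every Program A employee is experienced and every Program B employee is a novice, so two of the six Program-by-Experience cells are empty and the effects of program and experience cannot be fully separated.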
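Before trusting a two-way table like this, it helps to count the observations in each Program-by-Experience cell. The sketch below rebuilds only the design layout with `pandas.crosstab`. The Program A and Program B rows follow the data above, while the exact split inside Program C (7 experienced, 8 novice) is an assumption made for illustration.

```python
import pandas as pd

# Design layout only (no response values).
# Assumption: Program C is split into 7 experienced and 8 novice employees.
design = pd.DataFrame({
    'Program': ['A'] * 7 + ['B'] * 8 + ['C'] * 15,
    'Experience': (['Experienced'] * 7 + ['Novice'] * 8
                   + ['Experienced'] * 7 + ['Novice'] * 8),
})

# Count observations in each Program x Experience cell
counts = pd.crosstab(design['Program'], design['Experience'])
print(counts)

# Empty cells mean the corresponding effects cannot all be estimated
empty_cells = int((counts == 0).to_numpy().sum())
print("empty cells:", empty_cells)
```

A balanced or at least fully crossed design, in which every program is used by both novice and experienced employees, would avoid this problem.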




Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

Ans11: The question does not supply data, so the code below simulates scores for 50 students per group.

```python
import numpy as np
from scipy.stats import ttest_ind, tukey_hsd

# generate example data (no seed is set, so the values change from run to run)
control_scores = np.random.normal(loc=70, scale=10, size=50)
experimental_scores = np.random.normal(loc=75, scale=10, size=50)

# conduct two-sample t-test
t_statistic, p_value = ttest_ind(control_scores, experimental_scores)

# print results
print("t-statistic: ", t_statistic)
print("p-value: ", p_value)

# conduct post-hoc Tukey test if significant differences found
if p_value < 0.05:
    # scipy's tukey_hsd expects one array per group, not values plus labels
    tukey_results = tukey_hsd(control_scores, experimental_scores)
    print(tukey_results)
```

One run produced:

```
t-statistic:  -2.2103820834581014
p-value:  0.029405151645282155
```

Since the p-value (about 0.029) is below 0.05, we reject the null hypothesis that the two groups have the same mean test score. The negative t-statistic indicates that the control group mean is lower than the experimental group mean. With only two groups, the t-test already identifies the pair that differs, so the Tukey test adds no new information here; it simply repeats the single control-versus-experimental comparison.

Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

Ans12: The question does not supply data, so the code below simulates 30 days of sales for each store.

```python
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

# generate example data (no seed is set, so the values change from run to run)
store_a_sales = np.random.normal(loc=900, scale=100, size=30)
store_b_sales = np.random.normal(loc=1000, scale=100, size=30)
store_c_sales = np.random.normal(loc=1100, scale=100, size=30)

# conduct one-way ANOVA
f_statistic, p_value = f_oneway(store_a_sales, store_b_sales, store_c_sales)

# print results
print("F-statistic: ", f_statistic)
print("p-value: ", p_value)

# conduct post-hoc Tukey test if significant differences found
if p_value < 0.05:
    # scipy's tukey_hsd expects one array per group, not values plus labels
    tukey_results = tukey_hsd(store_a_sales, store_b_sales, store_c_sales)
    print(tukey_results)
```

One run produced:

```
F-statistic:  23.163159834474857
p-value:  8.614820417858605e-09
```

Since the p-value (about 8.6e-09) is far below 0.05, we reject the null hypothesis that the three stores have the same mean daily sales, and the Tukey test then shows which pairs of stores differ. Note that `f_oneway` treats the three stores as independent groups. Because each store is measured on the same 30 days, the observations are paired by day, and a repeated measures ANOVA that treats the day as the subject is the analysis that matches this design.
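A repeated measures version of this analysis could be sketched with `AnovaRM` from `statsmodels`, treating each day as the subject and the store as the within-subject factor. Everything about the data here is hypothetical: the seed, the shared `day_effect`, and the store levels are simulated for illustration. The follow-up uses paired t-tests with a Holm correction, which is one common choice for repeated measures and not the only one.

```python
import numpy as np
import pandas as pd
from itertools import combinations
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multitest import multipletests

# Hypothetical sales: a shared day effect plus a store-specific level
rng = np.random.default_rng(42)
n_days = 30
day_effect = rng.normal(0, 50, n_days)
sales = {
    'Store A': 900 + day_effect + rng.normal(0, 100, n_days),
    'Store B': 1000 + day_effect + rng.normal(0, 100, n_days),
    'Store C': 1100 + day_effect + rng.normal(0, 100, n_days),
}

# Long format: one row per (day, store) pair, as AnovaRM expects
long_df = pd.DataFrame({
    'day': np.tile(np.arange(n_days), len(sales)),
    'store': np.repeat(list(sales), n_days),
    'sales': np.concatenate(list(sales.values())),
})

# Repeated measures ANOVA: day is the subject, store is the within factor
rm_result = AnovaRM(long_df, depvar='sales', subject='day',
                    within=['store']).fit()
print(rm_result.anova_table)

# Post-hoc: paired t-tests for every pair of stores, Holm-adjusted
pairs = list(combinations(sales, 2))
raw_p = [ttest_rel(sales[a], sales[b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='holm')
for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4g}, reject H0 = {r}")
```

`AnovaRM` requires balanced data, with exactly one observation per day and store, which this simulated layout satisfies.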