Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

ans:

There are 4 primary assumptions in ANOVA:

1. Interval data for the dependent variable: the outcome is measured on an interval or ratio scale.

2. Normality: ANOVA assumes that the residuals from the ANOVA model follow a normal distribution.

3. Homoscedasticity: Homoscedasticity, or homogeneity of variances, is the assumption that the groups being compared have equal or similar variances. ANOVA pools the group variances into a single error term, so markedly unequal variances distort the F-statistic and its p-value, especially when the group sizes also differ.

4. No multicollinearity: Since a factorial ANOVA includes two or more independent variables, the model should contain little or no multicollinearity. Multicollinearity occurs when the independent variables are intercorrelated rather than independent of each other.



If the homogeneity of variance assumption is violated, the ANOVA can yield a p-value that is too small (for example, when the smaller groups have the larger variances), leading us to falsely reject a true null hypothesis.


If the independence assumption is violated, there is also a higher chance of getting an inaccurate p-value.



If the populations from which the data were sampled violate one or more of the ANOVA assumptions, the results of the analysis may be incorrect or misleading.

*Potential assumption violations include:*

1. Implicit factors: lack of independence within a sample

2. Lack of independence: lack of independence between samples

3. Outliers: apparent nonnormality by a few data points

4. Nonnormality: nonnormality of entire samples

5. Unequal population variances

6. Patterns in plots of data: detecting assumption violations graphically

7. Special problems with small sample sizes

8. Special problems with unbalanced sample sizes

9. Multiple comparisons: effects of assumption violations on multiple comparison tests
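
Two of these assumptions can be checked numerically before running the ANOVA. The sketch below is a minimal illustration, assuming three small made-up samples (the values are hypothetical and not taken from any dataset in this notebook). It uses `scipy.stats.levene` for equality of variances and `scipy.stats.shapiro` for normality of the residuals.

```python
import numpy as np
from scipy import stats

# Hypothetical samples, invented purely for illustration
group_a = [23.1, 25.4, 24.8, 26.0, 22.7, 25.1]
group_b = [27.9, 26.5, 28.3, 29.1, 27.2, 28.8]
group_c = [24.0, 31.5, 19.8, 35.2, 22.1, 29.9]  # visibly more spread out

# Levene's test: H0 = all groups have equal variances
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)

# Residuals in a one-way ANOVA are observations minus their own group mean
residuals = np.concatenate([
    np.asarray(g) - np.mean(g) for g in (group_a, group_b, group_c)
])

# Shapiro-Wilk test: H0 = the residuals come from a normal distribution
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print(f"Levene p-value:  {levene_p:.4f}")
print(f"Shapiro p-value: {shapiro_p:.4f}")
```

A p-value below the chosen alpha from either test suggests the corresponding assumption is doubtful. With small samples these tests have little power, so plots of the residuals are a useful complement.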

Q2. What are the three types of ANOVA, and in what situations would each be used?

ans:

**One-Way ANOVA**:

A one-way ANOVA has just one independent variable. For example, differences in IQ can be assessed by Country, and Country can have 2, 20, or more different categories to compare.

**Two-Way ANOVA**:

A two-way ANOVA (a type of factorial ANOVA) uses two independent variables. For example, a two-way ANOVA can examine differences in IQ scores (the dependent variable) by Country (independent variable 1) and Gender (independent variable 2).

Two-way ANOVA can be used to examine the interaction between the two independent variables. Interactions indicate that differences are not uniform across all categories of the independent variables. For example, females may have higher IQ scores overall compared to males, but this difference could be greater (or less) in European countries compared to North American countries.

**N-Way ANOVA**:

A researcher can also use more than two independent variables; this is an n-way ANOVA, with n being the number of independent variables. For example, potential differences in IQ scores can be examined by Country, Gender, Age group, Ethnicity, etc., simultaneously.

**MANOVA (Multivariate Analysis of Variance)**:

This type of ANOVA is used when there are two or more dependent variables (i.e., outcome variables) and one or more independent variables. MANOVA is used to determine if there are any significant differences between the means of the dependent variables across the different levels of the independent variable. For example, MANOVA can be used to compare the average scores on multiple personality traits (e.g., extroversion, agreeableness, neuroticism) between different age groups (young, middle-aged, and old).

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

ans:

*Partitioning of variance in ANOVA*

Partitioning of variance in ANOVA (Analysis of Variance) refers to the process of decomposing the total variation in the data into different sources of variation. In other words, ANOVA breaks down the total variation in the dependent variable into the variation explained by the independent variable(s) and the variation that is not explained by the independent variable(s). This is important because it helps us understand how much of the variation in the dependent variable can be attributed to the independent variable(s).

The partitioning of variance in ANOVA is typically represented by the following equation:

Total variation = Variation explained by independent variable(s) + Variation not explained by independent variable(s)

The variation explained by the independent variable(s) is referred to as the "between-group" variation, while the variation not explained by the independent variable(s) is referred to as the "within-group" variation.

Understanding the partitioning of variance is important in ANOVA because it helps us determine if there is a significant difference between the means of the different groups being compared. If the variation explained by the independent variable(s) (i.e., the between-group variation) is much larger than the variation not explained by the independent variable(s) (i.e., the within-group variation), then it suggests that there is a significant difference between the means of the groups being compared.
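
For a one-way layout the decomposition above can be written out explicitly. The notation below is introduced here for illustration and is an assumption, not part of the original answer: k groups, n_j observations in group j, N observations in total, group means \bar{y}_j and grand mean \bar{y}.

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}\right)^2}_{\text{total variation}}
=
\underbrace{\sum_{j=1}^{k} n_j\left(\bar{y}_j-\bar{y}\right)^2}_{\text{between-group variation}}
+
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}_j\right)^2}_{\text{within-group variation}}

F = \frac{\text{between-group variation}/(k-1)}{\text{within-group variation}/(N-k)}
```

The F-statistic compares the two pieces after each is divided by its degrees of freedom, which is why a large between-group share relative to the within-group share points towards a difference in means.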

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

ans:
    
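1. Total sum of squares (SST): the sum of squared differences between each observation (yij) and the grand mean (ȳ).

SST = ΣΣ(yij – ȳ)^2

2. Explained sum of squares (SSE): the between-group variation, i.e. the sum over groups of the group size (nj) times the squared difference between the group mean (ȳj) and the grand mean (ȳ). In a one-way ANOVA the predicted value for every observation is its group mean.

SSE = Σ nj(ȳj – ȳ)^2

3. Residual sum of squares (SSR): the within-group variation, i.e. the sum of squared differences between each observation (yij) and its group mean (ȳj).

SSR = ΣΣ(yij – ȳj)^2

These satisfy SST = SSE + SSR. Note that some texts swap the abbreviations, using SSR for the regression (explained) sum of squares and SSE for the error (residual) sum of squares.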

In [1]:
import scipy.stats as stats

# create three sample groups
group1 = [10, 12, 14, 16, 18]
group2 = [8, 11, 14, 17, 20]
group3 = [9, 12, 15, 18, 21]

# concatenate the groups
data = group1 + group2 + group3

# calculate the mean of the data
mean = sum(data) / len(data)

# calculate the total sum of squares (SST)
SST = sum((x - mean)**2 for x in data)

# calculate the explained (between-group) sum of squares (SSE):
# each group's size times the squared distance of its mean from the grand mean
SSE = 0
for group in (group1, group2, group3):
    group_mean = sum(group) / len(group)
    SSE += len(group) * (group_mean - mean)**2

# calculate the residual (within-group) sum of squares (SSR)
SSR = SST - SSE

print(f"SST = {SST:.4f}")
print(f"SSE = {SSE:.4f}")
print(f"SSR = {SSR:.4f}")

SST = 223.3333
SSE = 3.3333
SSR = 220.0000
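
The sums of squares feed directly into the F-statistic. The following is a minimal sketch that reuses the three groups from the cell above and computes the between-group term from the group means; the variable names are illustrative only.

```python
groups = [
    [10, 12, 14, 16, 18],
    [8, 11, 14, 17, 20],
    [9, 12, 15, 18, 21],
]

all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)

# Between-group (explained) sum of squares: group size times the squared
# distance between the group mean and the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Within-group (residual) sum of squares: squared distance of each value
# from its own group mean
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

k = len(groups)        # number of groups
n = len(all_values)    # total number of observations

# F = mean square between / mean square within
f_statistic = (ss_between / (k - 1)) / (ss_within / (n - k))

print(f"SS between = {ss_between:.4f}")
print(f"SS within  = {ss_within:.4f}")
print(f"F = {f_statistic:.4f}")
```

Because the three group means (14, 14 and 15) are close together relative to the spread inside each group, the F-statistic here is far below 1.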


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

ans:

A two-way ANOVA is used to determine whether or not there is a statistically significant difference between the means of three or more independent groups that have been split on two factors.

The purpose of a two-way ANOVA is to determine how two factors impact a response variable, and to determine whether or not there is an interaction between the two factors on the response variable.

example:
    
Step 1: Enter the data.

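First, we’ll create a pandas DataFrame that contains three variables: water (watering frequency: daily or weekly), sun (sunlight exposure: low, med, or high), and height (plant height).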

In [13]:
import numpy as np
import pandas as pd

#create data
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

In [14]:
df[:10]

   water  sun  height
0  daily  low       6
1  daily  low       6
2  daily  low       6
3  daily  low       5
4  daily  low       6
5  daily  med       5
6  daily  med       5
7  daily  med       6
8  daily  med       4
9  daily  med       5


Step 2: Perform the two-way ANOVA.

Next, we’ll perform the two-way ANOVA using the anova_lm() function from the statsmodels library:

In [15]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

#perform two-way ANOVA
model = ols('height ~ C(water) + C(sun) + C(water):C(sun)', data=df).fit()
sm.stats.anova_lm(model, typ=2)

                    sum_sq    df        F    PR(>F)
C(water)          8.533333   1.0  16.0000  0.000527
C(sun)           24.866667   2.0  23.3125  0.000002
C(water):C(sun)   2.466667   2.0   2.3125  0.120667
Residual         12.800000  24.0      NaN       NaN


Step 3: Interpret the results.

We can see the following p-values for each of the factors in the table:

water: p-value = .000527

sun: p-value = .000002

water*sun: p-value = .120667

Since the p-values for water and sun are both less than .05, this means that both factors have a statistically significant effect on plant height.

And since the p-value for the interaction effect (.120667) is not less than .05, this tells us that there is no significant interaction effect between sunlight exposure and watering frequency.
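
A table of cell means is a useful companion to the ANOVA table: roughly parallel rows suggest little interaction. The sketch below rebuilds the same plant data and summarises it with the standard pandas calls `pivot_table` and `groupby`; the name `cell_means` is just for illustration.

```python
import numpy as np
import pandas as pd

# Same plant data as in Step 1
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

# Mean height for every water x sun combination
cell_means = df.pivot_table(index='water', columns='sun',
                            values='height', aggfunc='mean')
print(cell_means)

# Main effects correspond to differences between these marginal means
water_means = df.groupby('water')['height'].mean()
sun_means = df.groupby('sun')['height'].mean()
print(water_means)
print(sun_means)
```

Daily watering gives taller plants at every sunlight level, and high sunlight gives the tallest plants under both watering schedules, which matches the significant main effects and the non-significant interaction.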

Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

ans:
    
F-statistic = 5.23

p-value = 0.02

Let's take the significance level (alpha) to be 0.05.

Since the p-value (0.02) is less than alpha (0.05), we reject the null hypothesis that all group means are equal. This gives sufficient evidence that at least one group mean differs from the others.

The same decision can be reached by comparing the F-statistic with a critical value, but computing the critical value requires the between-group and within-group degrees of freedom, which depend on the number of groups and the total sample size. Note that the ANOVA itself does not tell us which groups differ; a post-hoc test is needed for that.
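
As a sketch of the critical-value approach: the question does not give the degrees of freedom, so the values below (3 groups and 17 observations) are assumptions chosen only for illustration. `scipy.stats.f.ppf` returns the critical value and `scipy.stats.f.sf` the upper-tail p-value.

```python
from scipy import stats

f_statistic = 5.23
alpha = 0.05

# Assumed degrees of freedom (not given in the question):
# 3 groups -> dfn = 3 - 1 = 2; 17 observations -> dfd = 17 - 3 = 14
dfn, dfd = 2, 14

# Critical value: the F above which H0 is rejected at level alpha
critical_value = stats.f.ppf(1 - alpha, dfn, dfd)

# p-value: probability of an F at least this large if H0 is true
p_value = stats.f.sf(f_statistic, dfn, dfd)

print(f"critical value = {critical_value:.3f}")
print(f"p-value = {p_value:.3f}")
print("reject H0" if f_statistic > critical_value else "fail to reject H0")
```

Under these assumed degrees of freedom the p-value comes out close to the 0.02 quoted in the question, and the critical-value rule and the p-value rule lead to the same decision.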

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

ans:

In a repeated measures ANOVA, missing data can be handled in several ways:

1. Listwise deletion: This method involves excluding any cases with missing data from the analysis. This can be done using the dropna() function in pandas. While this approach is simple, it may result in a loss of statistical power if a large amount of data is missing.

2. Mean imputation: This method involves replacing missing values with the mean of the non-missing values. This can be done using the fillna() function in pandas. While this approach is simple and easy to implement, it may underestimate the variability of the data and result in biased estimates.

3. Last observation carried forward (LOCF): This method involves imputing missing values with the last observed value. This can be done using the ffill() method in pandas (fillna(method='ffill') in older versions, now deprecated). While this approach is useful for data with a temporal order, it assumes that nothing changed after the last observation, so it may not be appropriate for all situations and may result in biased estimates.

4. Multiple imputation: This method involves imputing missing values multiple times using a statistical model, and then combining the results to obtain estimates and standard errors. This can be done using the fancyimpute library in Python. While this approach is more sophisticated and can produce more accurate estimates than mean imputation or LOCF, it is computationally intensive and requires careful consideration of the underlying assumptions.

The potential consequences of using different methods to handle missing data in a repeated measures ANOVA include bias in the estimated means, standard errors, and effect sizes, as well as a loss of statistical power. It's important to carefully consider the underlying assumptions and potential limitations of each method and choose the approach that is most appropriate for the specific dataset and research question. Additionally, it may be beneficial to conduct sensitivity analyses to assess the robustness of the results to different methods of handling missing data.
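
The first three approaches can be sketched in a few lines of pandas. The DataFrame below is hypothetical (four subjects measured at three time points, with two gaps) and exists only to show how the results differ.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame({
    'subject': [1, 2, 3, 4],
    't1': [10.0, 12.0, 9.0, 11.0],
    't2': [11.0, np.nan, 10.0, 12.0],
    't3': [13.0, 14.0, np.nan, 15.0],
}).set_index('subject')

# 1. Listwise deletion: drop every subject with any missing measurement
listwise = scores.dropna()

# 2. Mean imputation: replace each gap with the mean of that time point
mean_imputed = scores.fillna(scores.mean())

# 3. LOCF: carry each subject's last observed value forward across time points
locf = scores.ffill(axis=1)

print(listwise)
print(mean_imputed)
print(locf)
```

Listwise deletion keeps only two of the four subjects here, while the two imputation methods keep all four but fill the gaps with different values, which is exactly why a sensitivity analysis across methods is worthwhile.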


Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

ans:

Post-hoc tests are used after ANOVA to make pairwise comparisons between groups when the overall ANOVA result is statistically significant. The purpose of post-hoc tests is to determine which specific groups differ from each other and to control for the familywise error rate, which is the probability of making at least one Type I error (false positive) across all the pairwise comparisons. Here are some common post-hoc tests used after ANOVA, along with an example of a situation where each one might be necessary:

Tukey's HSD (honestly significant difference) test: This test compares every pair of group means and is most commonly used when the sample sizes are equal across groups (the Tukey-Kramer variant handles unequal sizes). It controls the familywise error rate by basing every pairwise comparison on the studentized range distribution. For example, if we have four groups (A, B, C, D) and the overall ANOVA result is significant, we might use Tukey's HSD test to check all six pairs and find, say, that group A differs significantly from groups B and C but not from group D.

Bonferroni correction: This is a simple and commonly used method that adjusts the significance level for each comparison by dividing the overall significance level (usually 0.05) by the number of comparisons being made. It is most useful when only a small number of planned comparisons is of interest, because it becomes very conservative as the number of comparisons grows. For example, with four groups (A, B, C, D) there are six pairwise comparisons, so each one is tested at 0.05 / 6 ≈ 0.0083.

Dunnett's test: This test is used when we have one control group and several treatment groups. It compares each treatment group to the control group only, while controlling the overall familywise error rate. For example, if we have one control group and three treatment groups (A, B, C) and the overall ANOVA result is significant, we might use Dunnett's test and find that group A differs significantly from the control group while groups B and C do not.

Scheffe's test: This is a conservative post-hoc test that can be used when the sample sizes are unequal across groups. It controls the familywise error rate across all possible contrasts among the group means, not only pairwise comparisons, which makes it flexible but less powerful than Tukey's HSD for simple pairwise comparisons. For example, with four groups (A, B, C, D) and a significant overall ANOVA result, we might use Scheffe's test to compare the average of groups A and B with the average of groups C and D.
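
The arithmetic behind the Bonferroni correction is easy to show directly. In the sketch below the unadjusted p-values are hypothetical numbers made up for illustration; in practice they would come from the individual pairwise tests.

```python
from itertools import combinations

groups = ['A', 'B', 'C', 'D']
alpha = 0.05

# Every unordered pair of groups is one comparison
pairs = list(combinations(groups, 2))
adjusted_alpha = alpha / len(pairs)

# Hypothetical unadjusted p-values, made up for illustration only
raw_p_values = {
    ('A', 'B'): 0.001,
    ('A', 'C'): 0.004,
    ('A', 'D'): 0.300,
    ('B', 'C'): 0.020,
    ('B', 'D'): 0.012,
    ('C', 'D'): 0.045,
}

# A pair is declared different only if its p-value beats the adjusted alpha
significant_pairs = [pair for pair in pairs
                     if raw_p_values[pair] < adjusted_alpha]

print(f"{len(pairs)} comparisons, per-comparison alpha = {adjusted_alpha:.4f}")
print("Significant after correction:", significant_pairs)
```

Four of the made-up p-values fall below 0.05 but above the adjusted threshold, which illustrates how the correction trades power for control of the familywise error rate.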

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

ans:

Null hypothesis: the mean weight loss is the same for all three diets.

Alternate hypothesis: at least one diet has a mean weight loss that differs from the others.

In [16]:
from scipy.stats import f_oneway
import scipy.stats as stat

In [17]:
import numpy as np

In [18]:
Diet_A = np.random.rand(50)

In [19]:
Diet_B = np.random.rand(50)
Diet_C = np.random.rand(50)

In [20]:
f_statistics, p_value = f_oneway(Diet_A,Diet_B,Diet_C)

In [21]:
f_statistics,p_value

(1.0884129750677392, 0.3394485582781596)

In [22]:
#consider significance value =0.05
significance_value  = 0.05
# degrees of freedom must match the data passed to f_oneway (3 groups of 50 values each)
total_population  = len(Diet_A) + len(Diet_B) + len(Diet_C)
category = 3
df_between = category -1
df_within = total_population-category
df_total =total_population-1

In [23]:
critical_value = stat.f.ppf(q=1-significance_value, dfn=df_between,dfd=df_within)
print(round(critical_value, 4))

3.0576

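The F-statistic and p-value turn out to be approximately 1.0884 and 0.3394, respectively. Because the p-value is greater than 0.05, there is no evidence of a difference in mean weight loss between the three diets.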

In [24]:
# the p-value rule and the critical-value rule give the same decision, so one check is enough
if p_value < significance_value:
    print("Reject the Null hypothesis")
else:
    print("Fail to reject the Null hypothesis")

Fail to reject the Null hypothesis
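
The question describes 50 participants in total, so one possible variation is to simulate exactly 50 observations with a seeded generator. This is only a sketch: the group sizes (17, 17, 16), the means and the spread are assumptions invented for illustration, and no real weight-loss data is implied.

```python
import numpy as np
from scipy.stats import f_oneway

# Seeded generator so the simulated numbers are reproducible
rng = np.random.default_rng(42)

# Assumed group sizes: 50 participants split as evenly as possible.
# Assumed means and spread (in kg) are made up for illustration.
diet_a = rng.normal(loc=3.0, scale=1.5, size=17)
diet_b = rng.normal(loc=3.5, scale=1.5, size=17)
diet_c = rng.normal(loc=4.0, scale=1.5, size=16)

f_statistic, p_value = f_oneway(diet_a, diet_b, diet_c)

# Degrees of freedom derived from the data actually analysed
n_total = len(diet_a) + len(diet_b) + len(diet_c)
df_between = 3 - 1
df_within = n_total - 3

print(f"F({df_between}, {df_within}) = {f_statistic:.3f}, p = {p_value:.4f}")
```

Deriving the degrees of freedom from the arrays themselves keeps them in step with whatever data is passed to `f_oneway`.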


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [5]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

dataframe = pd.DataFrame({'Program' : np.repeat(['A','B','C'],10),
                          'Experience' : np.repeat(['Novice','Experience'],15),
                          'Time' : [10,14,12,11,14,15,16,17,18,19,14,13,12,11,10,9,7,5,8,9,5,8,9,7,8,7,7,9,9,9]})

In [6]:
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data = dataframe).fit()

In [8]:
anova_table = sm.stats.anova_lm(model,typ=2)
print(anova_table)

                              sum_sq    df         F    PR(>F)
C(Program)                 14.049736   2.0  1.558418  0.223022
C(Experience)                    NaN   1.0       NaN       NaN
C(Program):C(Experience)   38.880000   2.0  4.312628  0.047852
Residual                  117.200000  26.0       NaN       NaN




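Since the p-value for the interaction effect (.047852) is less than .05, the table suggests a significant interaction between program and employee experience level, while the main effect of program (p = .223022) is not significant. However, the NaN row for C(Experience) shows that this design cannot separate the two factors: in the DataFrame above every Program A employee is a novice and every Program C employee is experienced, so two of the six Program × Experience cells are empty. These results should therefore be treated with caution until the data are arranged in a fully crossed layout.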
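
A NaN row like the one for C(Experience) in the table above typically appears when some factor combinations contain no observations. The sketch below shows one way to build a fully crossed layout and check the cell counts with `pd.crosstab`; the split of 5 employees per cell is an assumption, and no task times are invented here.

```python
import numpy as np
import pandas as pd

# Fully crossed layout: every program contains both experience levels
# (assumed 5 employees per Program x Experience cell, 30 in total)
design = pd.DataFrame({
    'Program': np.repeat(['A', 'B', 'C'], 10),
    'Experience': np.tile(np.repeat(['Novice', 'Experienced'], 5), 3),
})

# Count employees in each cell; a zero marks a combination with no data
cell_counts = pd.crosstab(design['Program'], design['Experience'])
empty_cells = int((cell_counts == 0).sum().sum())

print(cell_counts)
print("Empty cells:", empty_cells)
```

Running the same `pd.crosstab` check on any two-factor DataFrame before fitting the model makes empty cells visible early.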

Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

Null Hypothesis: There is no difference in mean test scores between the two groups.

Alternate Hypothesis: There is a difference in mean test scores between the two groups.

In [13]:
import numpy as np
import scipy.stats as stats
alpha = 0.05

In [5]:
group_control = np.random.randint(40,100,100)

In [7]:
np.var(group_control)

278.1696

In [6]:
experiment_group = np.random.randint(20,100,100)

In [8]:
np.var(experiment_group)

578.5475000000001

In [12]:
statistics, p_value = stats.ttest_ind(a = group_control , b = experiment_group, equal_var= False)

In [15]:
if p_value < alpha:
    print("We reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
    

We reject the null hypothesis


With only two groups there is a single pairwise comparison, and the t-test above has already made it, so a post-hoc test is not needed here. For completeness, Tukey's HSD can still be run by passing all scores together with their group labels:

In [16]:
from statsmodels.stats.multicomp import MultiComparison

In [17]:
# MultiComparison takes one array with all scores and a matching array of group labels
scores = np.concatenate([group_control, experiment_group])
labels = ['control'] * len(group_control) + ['experiment'] * len(experiment_group)
comp = MultiComparison(scores, labels)

In [22]:
tukey_result = comp.tukeyhsd()

In [23]:
# the summary has a single row: control vs. experiment
print(tukey_result.summary())

Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [36]:
import pandas as pd
import numpy as np
import statsmodels as sm
from statsmodels.formula.api import ols

sales_data = {'Store': ['A']*30 + ['B']*30 + ['C']*30, 
              'Day': list(range(1, 31))*3, 
              'Sales': [100, 120, 130, 110, 130, 140, 90, 110, 120, 80,
                        100, 130, 140, 120, 140, 150, 110, 130, 140, 100,
                        90, 110, 120, 100, 120, 130, 80, 100, 110, 70,
                        80, 100, 110, 90, 110, 120, 70, 90, 100, 60,
                        70, 90, 100, 80, 100, 110, 60, 80, 90, 50,
                        60, 80, 90, 70, 90, 100, 50, 70, 80, 40,
                        50, 70, 80, 60, 80, 90, 40, 60, 70, 30,
                        40, 60, 70, 50, 70, 80, 30, 50, 60, 20,
                        30, 50, 60, 40, 60, 70, 20, 40, 50, 0]}

In [37]:
sales_df = pd.DataFrame(sales_data)

In [38]:
sales_df

   Store  Day  Sales
0      A    1    100
1      A    2    120
2      A    3    130
3      A    4    110
4      A    5    130
..   ...  ...    ...
85     C   26     70
86     C   27     20
87     C   28     40
88     C   29     50


In [39]:
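# Conduct one-way repeated measures ANOVA
# (import AnovaRM explicitly rather than relying on statsmodels submodules being loaded)
from statsmodels.stats.anova import AnovaRM

rm_anova = AnovaRM(data=sales_df, depvar='Sales', subject='Day',
                   within=['Store'], aggregate_func='mean')
result = rm_anova.fit()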

# Print ANOVA table
print(result.summary())


               Anova
      F Value  Num DF  Den DF Pr > F
------------------------------------
Store 777.6103 2.0000 58.0000 0.0000



In [40]:
# Conduct post-hoc tests using Tukey's HSD test
from statsmodels.stats.multicomp import MultiComparison

comp = MultiComparison(sales_df['Sales'], sales_df['Store'])
tukey_result = comp.tukeyhsd()

# Print post-hoc results
print(tukey_result.summary())

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj  lower    upper   reject
-----------------------------------------------------
     A      B    -31.0   0.0 -43.5477 -18.4523   True
     A      C -61.3333   0.0  -73.881 -48.7857   True
     B      C -30.3333   0.0  -42.881 -17.7857   True
-----------------------------------------------------
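
Tukey's HSD as used above treats the three stores as independent samples, whereas in this design the same 30 days are measured for every store. One possible alternative, sketched here as a different technique rather than the method used above, is to run paired t-tests with `scipy.stats.ttest_rel` and apply a Bonferroni adjustment. The sales figures are the same ones entered earlier; the variable names are illustrative.

```python
from itertools import combinations
from scipy import stats

# Same sales figures as above: 30 days each for stores A, B and C
sales = [100, 120, 130, 110, 130, 140, 90, 110, 120, 80,
         100, 130, 140, 120, 140, 150, 110, 130, 140, 100,
         90, 110, 120, 100, 120, 130, 80, 100, 110, 70,
         80, 100, 110, 90, 110, 120, 70, 90, 100, 60,
         70, 90, 100, 80, 100, 110, 60, 80, 90, 50,
         60, 80, 90, 70, 90, 100, 50, 70, 80, 40,
         50, 70, 80, 60, 80, 90, 40, 60, 70, 30,
         40, 60, 70, 50, 70, 80, 30, 50, 60, 20,
         30, 50, 60, 40, 60, 70, 20, 40, 50, 0]

store_sales = {'A': sales[:30], 'B': sales[30:60], 'C': sales[60:]}
pairs = list(combinations(store_sales, 2))
alpha = 0.05

adjusted_p_values = {}
for first, second in pairs:
    # Paired t-test: both stores are observed on the same 30 days
    t_stat, p_value = stats.ttest_rel(store_sales[first], store_sales[second])
    # Bonferroni adjustment: multiply by the number of comparisons, cap at 1
    adjusted_p_values[(first, second)] = min(p_value * len(pairs), 1.0)

for pair, p_adj in adjusted_p_values.items():
    verdict = "differ" if p_adj < alpha else "do not differ"
    print(pair, f"adjusted p = {p_adj:.3g}", verdict)
```

For this dataset the paired comparisons point the same way as the Tukey table: all three pairs of stores differ.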
