# Assignment

# Q1

ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups. The assumptions required for using ANOVA are as follows:

Independence: The observations in each group must be independent of each other.

Normality: The distribution of the response variable in each group must be approximately normal.

Homogeneity of variances: The variances of the response variable in each group must be approximately equal.

Violations of these assumptions could impact the validity of the ANOVA results. For example:

Violation of independence: If observations are correlated with one another, whether within a group or across groups, the analysis treats them as more informative than they really are. This is the problem of pseudoreplication, and it typically understates the error variance and inflates the Type I error rate. For example, if data are collected from multiple sites, but some sites are more similar to each other than others, the assumption of independence may be violated.

Violation of normality: If the distribution of the response variable is far from normal in one or more groups, the p-value from the F-test may be inaccurate, particularly when the samples are small. ANOVA is fairly robust to mild departures from normality when the groups are large and of similar size, but heavily skewed data or outliers can still distort the result.

Violation of homogeneity of variances: If the variances of the response variable differ substantially between groups, the F-test can give misleading results, especially when the group sizes are also unequal. For example, if a small group has a much larger variance than the other groups, the test can falsely detect a significant difference.

In general, violations of these assumptions could lead to either Type I errors (false positive) or Type II errors (false negative). Therefore, it is important to check the assumptions before performing ANOVA and consider alternative methods if the assumptions are not met.
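The checks mentioned above can be sketched with `scipy.stats`. This is a minimal illustration on simulated data, not a prescribed workflow: the group names, sample sizes, and random seed are all assumptions. It runs a Shapiro-Wilk test per group for normality, Levene's test for equal variances, and a Kruskal-Wallis test as one possible rank-based alternative when the assumptions look doubtful.

```python
import numpy as np
from scipy import stats

# Simulated data for illustration only (hypothetical groups, assumed seed)
rng = np.random.default_rng(42)
groups = {
    "a": rng.normal(10.0, 2.0, 30),
    "b": rng.normal(11.0, 2.0, 30),
    "c": rng.normal(12.0, 2.0, 30),
}

# Normality: Shapiro-Wilk test within each group
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p

# Homogeneity of variances: Levene's test across all groups
levene_stat, levene_p = stats.levene(*groups.values())

# One possible alternative if the assumptions look doubtful: Kruskal-Wallis
kruskal_stat, kruskal_p = stats.kruskal(*groups.values())

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
print("Kruskal-Wallis p-value:", kruskal_p)
```

A small p-value from Shapiro-Wilk or Levene's test is evidence against the corresponding assumption. Independence cannot be tested this way and has to be judged from the study design.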

# Q2

The three types of ANOVA are:

One-way ANOVA: This type of ANOVA is used to analyze the differences between the means of three or more independent groups (or levels) on a single dependent variable. It is called "one-way" because there is only one independent variable being studied.

Two-way ANOVA: This type of ANOVA is used to analyze how two independent variables (or factors), each with two or more levels, affect a single dependent variable. It tests the main effect of each factor as well as the interaction between them. It is called "two-way" because there are two independent variables being studied.

Repeated-measures ANOVA: This type of ANOVA is used to analyze the differences between the means of three or more related groups (or levels) on a single dependent variable, where each participant is measured on the dependent variable under all conditions (or levels) of the independent variable.

# Q3

The partitioning of variance in ANOVA refers to the process of dividing the total variability in a dataset into different sources of variation, which can then be used to evaluate the significance of different factors or treatments. This partitioning is important because it allows researchers to determine the extent to which different factors or treatments are contributing to the observed variability in the data.

In ANOVA, the total variability in the data is partitioned into two components: the between-group variability and the within-group variability. The between-group variability represents the variation in the data that can be explained by the differences between the group means, while the within-group variability represents the variation that cannot be explained by these differences.

The partition also provides the F-statistic, which is used to test the null hypothesis that there is no difference between the group means. Each sum of squares is divided by its degrees of freedom to give a mean square, and F is the ratio of the between-group mean square to the within-group mean square. By comparing the F-statistic to a critical value based on the degrees of freedom and the chosen significance level, researchers can determine whether the observed differences between the group means are statistically significant.
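As a small worked example, the partition can be computed by hand with NumPy and checked against `scipy.stats.f_oneway`. The three groups below are made-up numbers chosen only to keep the arithmetic simple.

```python
import numpy as np
from scipy import stats

# Made-up data: three groups of three observations each
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([6.0, 7.0, 8.0]),
    np.array([9.0, 10.0, 11.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group, and within-group sums of squares
ss_total = ((all_values - grand_mean) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom and the F-statistic (ratio of mean squares)
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print("SS total:", ss_total, "= SS between + SS within:", ss_between + ss_within)
print("F (manual):", f_manual, "F (scipy):", f_scipy)
```

Here the between-group sum of squares is 38 and the within-group sum of squares is 6, which add up to the total of 44, and the F-statistic is (38 / 2) / (6 / 6) = 19.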

# Q4

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# read in data
data = pd.read_csv('data.csv')

# set up one-way ANOVA model
model = ols('y ~ group', data=data).fit()

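# build the ANOVA table once; for a one-way model it has two rows:
# the group (explained) row followed by the Residual row
anova_table = sm.stats.anova_lm(model, typ=1)

# calculate SSR (explained), SSE (residual), and SST (total)
SSR = anova_table['sum_sq'].iloc[0]
SSE = anova_table['sum_sq'].iloc[1]
SST = SSR + SSE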


# Q5

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# read in data
data = pd.read_csv('data.csv')

# create model formula
formula = 'y ~ C(factor_a) + C(factor_b) + C(factor_a):C(factor_b)'

# fit ANOVA model
model = ols(formula, data).fit()

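# build the ANOVA table; each row holds the F-test for one effect
anova_table = sm.stats.anova_lm(model, typ=2)

# print main effects
print(anova_table.loc[['C(factor_a)', 'C(factor_b)']])

# print interaction effect
print(anova_table.loc[['C(factor_a):C(factor_b)']])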


# Q6

An F-statistic of 5.23 with a p-value of 0.02 indicates a statistically significant difference among the group means at the conventional 0.05 significance level. The F-statistic is the ratio of the between-group mean square to the within-group mean square, so a value of 5.23 means the variability between the group means is about five times what the within-group variability alone would produce. Therefore, we can reject the null hypothesis that all group means are equal.

In terms of interpretation, this means that there is evidence that at least one group mean differs from the others. However, it does not provide information on which specific group means are different from each other. Further analysis, such as post-hoc tests or confidence intervals, would be needed to identify which group means are significantly different.
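The decision rule can be sketched with `scipy.stats.f`. The question does not state the degrees of freedom, so the values below (3 groups and 30 observations, giving 2 and 27) are purely an assumption for illustration. The p-value this produces will therefore not match the reported 0.02 exactly.

```python
from scipy import stats

f_observed = 5.23
alpha = 0.05

# Assumed degrees of freedom (not given in the question): 3 groups, 30 observations
df_between, df_within = 2, 27

# Critical value: the F value that cuts off the upper alpha tail
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

# p-value: probability of an F at least this large if the null hypothesis is true
p_value = stats.f.sf(f_observed, df_between, df_within)

reject_null = f_observed > f_critical
print("critical F:", f_critical)
print("p-value:", p_value)
print("reject H0:", reject_null)
```

Comparing the observed F with the critical value and comparing the p-value with alpha are equivalent ways of reaching the same decision.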

# Q7

In a repeated measures ANOVA, missing data can occur due to various reasons such as participant dropout, equipment failure, or other unforeseen circumstances. It is important to handle missing data appropriately as it can affect the validity and reliability of the results.

There are different methods to handle missing data in a repeated measures ANOVA, including:

Pairwise deletion: This method uses all available data for each individual comparison, excluding a case only from the calculations that involve the variable it is missing. This approach retains more data than listwise deletion, but different comparisons end up based on different subsets of participants, and it can lead to biased results if the data are not missing completely at random.

Listwise deletion: This method involves excluding any case that has missing data on any of the variables involved in the analysis, so only complete cases are analyzed. It is simple and keeps every comparison on the same set of participants, but it discards more data than pairwise deletion, which reduces power, and it can also lead to biased results if the data are not missing completely at random.

Imputation: This method involves estimating missing data using a statistical model or algorithm. Imputation can help to reduce bias and increase power compared to deletion methods, but the validity of the results depends on the accuracy of the imputation model and assumptions made about the missing data.

Mixed-effects models: This method involves modeling the data using a hierarchical model that accounts for both within-subject and between-subject variability. This approach does not need to fill in the missing values; it uses every observation that is available, including those from participants with incomplete data, and can provide unbiased estimates of the treatment effects when data are missing at random.
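To make the contrast between listwise deletion and a mixed-effects model concrete, here is a minimal sketch using `statsmodels`. Everything about the data is an assumption for illustration: the column names (`subject`, `condition`, `score`), the simulated effects, and the eight values removed at random.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-format data: 20 hypothetical subjects, 3 conditions each
rng = np.random.default_rng(0)
n_subjects = 20
conditions = ["t1", "t2", "t3"]
subject_effect = rng.normal(0, 2, n_subjects)
rows = []
for s in range(n_subjects):
    for j, c in enumerate(conditions):
        score = 10 + j + subject_effect[s] + rng.normal(0, 1)
        rows.append({"subject": s, "condition": c, "score": score})
long_df = pd.DataFrame(rows)

# Remove 8 scores at random to mimic missing data
missing_idx = rng.choice(long_df.index, size=8, replace=False)
long_df.loc[missing_idx, "score"] = np.nan

# Listwise deletion would keep only subjects with all three scores
is_complete = long_df.groupby("subject")["score"].apply(lambda s: s.notna().all())
n_complete = int(is_complete.sum())

# The mixed model instead uses every observed row, with a random intercept per subject
observed = long_df.dropna(subset=["score"])
mixed = smf.mixedlm("score ~ C(condition)", observed, groups=observed["subject"]).fit()

print("subjects kept by listwise deletion:", n_complete, "of", n_subjects)
print("rows used by the mixed model:", len(observed), "of", len(long_df))
print(mixed.params)
```

The mixed model keeps the partial data from subjects who miss one or two conditions, whereas a complete-case analysis would drop those subjects entirely.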

# Q8

After conducting an ANOVA, post-hoc tests can be used to determine which specific group means differ significantly from each other. Some common post-hoc tests include Tukey's HSD, Bonferroni correction, Scheffe's method, and Dunnett's test.

Tukey's HSD (honestly significant difference) test is commonly used to compare all pairs of group means when the variances are approximately equal across all groups. It was designed for equal sample sizes, and the Tukey-Kramer variant extends it to unequal ones. The test controls the familywise error rate, which is the probability of making at least one Type I error across all pairwise comparisons.

Bonferroni correction is a simple and general approach that divides the alpha level (significance level) by the number of comparisons being made. It works well for a small number of planned comparisons, but it becomes increasingly conservative, and loses power, as the number of comparisons grows.

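Scheffe's method is the most conservative of these approaches. It controls the familywise error rate across all possible contrasts among the group means, not only pairwise comparisons, and it allows for unequal sample sizes, although it still assumes equal variances across groups.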

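Dunnett's test is used when comparing several groups to a control group. It controls the familywise error rate across only the comparisons with the control, which makes it more powerful than testing all pairwise comparisons when the control comparisons are the only ones of interest.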

A situation where a post-hoc test might be necessary is when an ANOVA shows that there is a significant difference between groups, but it is unclear which specific groups differ significantly from each other. For example, in a study comparing the effects of three different treatments on blood pressure, an ANOVA might show a significant difference between groups, but a post-hoc test would be needed to determine which specific treatments have different effects on blood pressure.
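The blood pressure scenario can be sketched with `pairwise_tukeyhsd` from `statsmodels`. The readings below are simulated, with the treatment labels, group means, spread, and seed all assumed for illustration; treatment C is deliberately given a higher mean so that the post-hoc test has something to find.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated systolic blood pressure for three hypothetical treatments
rng = np.random.default_rng(1)
bp = np.concatenate([
    rng.normal(120, 5, 20),  # treatment A
    rng.normal(121, 5, 20),  # treatment B
    rng.normal(135, 5, 20),  # treatment C (assumed to differ)
])
treatment = np.repeat(["A", "B", "C"], 20)

# Tukey's HSD compares every pair while controlling the familywise error rate
tukey = pairwise_tukeyhsd(endog=bp, groups=treatment, alpha=0.05)
print(tukey.summary())
```

Each row of the summary reports one pair of treatments, the difference in means, an adjusted p-value, and whether the null hypothesis of equal means is rejected for that pair.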

# Q9

In [2]:
import scipy.stats as stats

# Define the data
diet_a = [5, 6, 7, 4, 5, 6, 7, 8, 9, 6, 7, 5, 6, 7, 8, 9, 10, 6, 7, 8, 9, 7, 6, 5, 4, 6, 7, 8, 9, 7, 6, 5, 4, 6, 7, 8, 9, 7, 6, 5, 4, 6, 7, 8, 9, 7, 6, 5, 4]
diet_b = [4, 5, 6, 7, 8, 5, 4, 5, 6, 7, 4, 5, 6, 7, 8, 5, 4, 5, 6, 7, 4, 5, 6, 7, 8, 5, 4, 5, 6, 7, 4, 5, 6, 7, 8, 5, 4, 5, 6, 7, 4, 5, 6, 7, 8, 5, 4, 5, 6, 7]
diet_c = [3, 4, 5, 6, 7, 8, 4, 5, 6, 7, 3, 4, 5, 6, 7, 8, 4, 5, 6, 7, 3, 4, 5, 6, 7, 8, 4, 5, 6, 7, 3, 4, 5, 6, 7, 8, 4, 5, 6, 7, 3, 4, 5, 6, 7, 8, 4, 5, 6, 7]

# Conduct the one-way ANOVA
f_stat, p_val = stats.f_oneway(diet_a, diet_b, diet_c)

# Print the results
print("F-statistic:", f_stat)
print("p-value:", p_val)


F-statistic: 7.883247731412348
p-value: 0.0005609089293995041


# Q10

In [5]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create data
n = 30
programs = ['A', 'B', 'C']
levels = ['novice', 'experienced']
data = pd.DataFrame({'Program': np.random.choice(programs, n),
                     'Experience': np.random.choice(levels, n),
                     'Time': np.random.normal(loc=10, scale=2, size=n)})

# Fit two-way ANOVA model
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print ANOVA table
print(anova_table)


                             sum_sq    df         F    PR(>F)
C(Program)                 3.638022   2.0  0.547440  0.585482
C(Experience)              7.185994   1.0  2.162658  0.154390
C(Program):C(Experience)  21.516958   2.0  3.237814  0.056899
Residual                  79.746240  24.0       NaN       NaN


# Q11

In [3]:
import pandas as pd
from scipy import stats

# create a DataFrame with test scores and group assignments
data = pd.DataFrame({
    'score': [80, 85, 90, 75, 95, 85, 70, 80, 85, 90, 80, 75, 85, 90, 75, 95, 85, 70, 80, 85],
    'group': ['control'] * 10 + ['experimental'] * 10
})

# separate the data into two groups
control_scores = data[data['group'] == 'control']['score']
experimental_scores = data[data['group'] == 'experimental']['score']

# perform a two-sample t-test
t, p = stats.ttest_ind(control_scores, experimental_scores)

# print the results
print(f"t = {t:.3f}, p = {p:.3f}")


t = 0.447, p = 0.660


# Q12

In [7]:
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# create a DataFrame with sales data for three stores
data = pd.DataFrame({
    'store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    'day': list(range(1, 31)) * 3,
    'sales': [100, 120, 90, 110, 95, 105, 115, 105, 120, 95,
              110, 100, 105, 115, 120, 105, 95, 110, 100, 120,
              105, 95, 110, 100, 115, 105, 120, 95, 100, 110] * 3
})

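# wide-format view (one row per day, one column per store), for inspection only;
# AnovaRM below works on the long-format DataFrame
wide_data = data.pivot_table(values='sales', index='day', columns='store')

# perform a repeated measures ANOVA, treating each day as the repeated unit
# (the "subject") that is observed at all three stores
aovrm = AnovaRM(data, 'sales', 'day', within=['store'])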
res = aovrm.fit()

# print the results
print(res.summary())

# perform a post-hoc test (note: Tukey's HSD treats the stores as independent
# groups and ignores the pairing by day)
tukey = pairwise_tukeyhsd(data['sales'], data['store'])
print(tukey.summary())


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
store  0.0000 2.0000 58.0000 1.0000

Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower  upper  reject
--------------------------------------------------
     A      B      0.0   1.0 -5.5383 5.5383  False
     A      C      0.0   1.0 -5.5383 5.5383  False
     B      C      0.0   1.0 -5.5383 5.5383  False
--------------------------------------------------
