In [None]:
# Q 1 Answer :
"""
ANOVA (Analysis of Variance) is a statistical technique used to test for differences in means among two or more groups. 
The technique relies on certain assumptions that must be met in order for the results to be valid.

The assumptions required to use ANOVA are:

1)Normality: The residuals (equivalently, the observations within each group) should be approximately normally distributed. 
Normality means that the values in each group follow a bell-shaped distribution around that group's mean.

2)Homogeneity of variances: The variance of the data should be approximately equal across all groups.
Homogeneity of variances means that the variability in the data is roughly the same across all groups.

3)Independence: The observations must be independent of one another, both within and across groups. 
Independence means that the value of one observation does not influence or predict the value of any other observation.

Examples of violations that could impact the validity of the results are:

1)Non-normality: If the data are not normally distributed, the ANOVA results may be biased or inaccurate. For example, 
if the data are skewed, the mean may not be an accurate representation of the central tendency of the data.

2)Heteroscedasticity: If the variances of the groups are not equal, the pooled within-group variance no longer describes every group, so the F-test can be too liberal or too conservative. 
For example, if a small group has a much larger variance than the larger groups, the Type I error rate can rise above the nominal alpha level.

3)Dependence: If the observations within each group are not independent, the ANOVA results may be biased or inaccurate.
For example, if the observations within a group are correlated, the variance of the group may be underestimated, 
which could lead to a false conclusion that the group means are significantly different.

It is important to check these assumptions before using ANOVA and to address any violations that may impact the validity of the results. 
There are statistical tests and methods available to correct for violations of the assumptions, 
or alternative non-parametric tests can be used that do not require these assumptions.

"""

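As a rough illustration of checking the homogeneity-of-variances assumption, the sketch below compares the sample variances of three groups using only the standard library. The group names and measurements are made up for illustration. The largest-to-smallest variance ratio is only an informal screen, and a formal test such as Levene's test would normally be used alongside it.

```python
from statistics import mean, variance

# Hypothetical measurements for three groups (made-up numbers for illustration).
groups = {
    "A": [4.1, 5.0, 5.5, 4.7, 5.2],
    "B": [6.3, 5.9, 6.8, 6.1, 6.4],
    "C": [5.0, 7.5, 3.9, 8.2, 5.4],
}

# Sample variance (n - 1 denominator) for each group.
group_variances = {name: variance(values) for name, values in groups.items()}

# Ratio of the largest to the smallest group variance.
variance_ratio = max(group_variances.values()) / min(group_variances.values())

for name, var in group_variances.items():
    print(f"group {name}: mean = {mean(groups[name]):.2f}, variance = {var:.3f}")
print(f"largest / smallest variance = {variance_ratio:.1f}")
```

Here group C is far more spread out than group B, so the ratio is large, which would be a reason to look more closely before trusting a standard ANOVA on data like these.
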
In [None]:
# Q 2 Answer :
"""
The three types of ANOVA most commonly distinguished are:

1)One-way ANOVA: This type of ANOVA is used when there is one independent variable, also called a factor, and one dependent variable. 
The independent variable can have two or more levels, which represent different groups or conditions. 
One-way ANOVA is used to test whether the means of the dependent variable differ significantly across the groups or 
conditions of the independent variable. For example, a one-way ANOVA can be used to compare the mean weight of apples from three different orchards.

2)Two-way ANOVA: This type of ANOVA is used when there are two independent variables, or factors, and one dependent variable. 
The factors are categorical variables (a continuous predictor would instead be treated as a covariate, as in ANCOVA).
Two-way ANOVA is used to test whether the means of the dependent variable differ significantly across the levels of both factors,
and whether there is an interaction effect between the factors. For example, 
a two-way ANOVA can be used to compare the mean scores of students in different majors across two different teaching styles.

3)Three-way ANOVA: This type of ANOVA is used when there are three independent variables, or factors, and one dependent variable. 
As in the two-way case, the factors are categorical variables.
Three-way ANOVA is used to test whether the means of the dependent variable differ significantly across the levels of all three factors,
and whether there are any interaction effects between the factors. For example, 
a three-way ANOVA can be used to investigate the effects of gender, age, and ethnicity on job satisfaction.

The choice of ANOVA type depends on the research question, study design, and the number of factors and levels involved. 
In general, one-way ANOVA is used when there is only one factor of interest,
two-way ANOVA is used when there are two factors, and three-way ANOVA is used when there are three factors. However, 
sometimes a higher-order ANOVA may be used if more factors need to be considered.
"""

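To make the difference between the designs concrete, the sketch below lists the effects that a full factorial ANOVA tests for a given set of factors: one main effect per factor plus every interaction. The helper `anova_effects` is a hypothetical function written for this illustration, and the factor names are taken from the examples above.

```python
from itertools import combinations


def anova_effects(factors):
    """List every main effect and interaction in a full factorial design."""
    effects = []
    for size in range(1, len(factors) + 1):
        for combo in combinations(factors, size):
            # Interactions are written with ':' between the factor names.
            effects.append(":".join(combo))
    return effects


one_way = anova_effects(["orchard"])
two_way = anova_effects(["major", "teaching_style"])
three_way = anova_effects(["gender", "age", "ethnicity"])

print(len(one_way), one_way)
print(len(two_way), two_way)
print(len(three_way), three_way)
```

With n factors there are 2**n - 1 effects, so a one-way design tests 1 effect, a two-way design 3, and a three-way design 7.
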
In [None]:
# Q 3 Answer :
"""
Partitioning of variance is the process of breaking down the total variance in a data set into different components that are attributable 
to specific sources of variation. 
In ANOVA, the variance is partitioned into two types of variance: between-group variance and within-group variance.

Between-group variance is the variation in the means of the groups or conditions of the independent variable. 
This variance represents the extent to which the group means differ from one another. 
Within-group variance, on the other hand, is the variation within each group or condition. 
This variance represents the extent to which the individual observations within a group differ from the group mean.

The importance of understanding partitioning of variance in ANOVA lies in the fact that it helps to identify the sources of variation in the data. 
By understanding how the variance is partitioned, we can determine whether there are significant differences between the groups or 
conditions of the independent variable. 
This information can be used to draw conclusions about the underlying population from which the sample was drawn.

Partitioning of variance also allows us to calculate different effect size measures, 
such as eta-squared and partial eta-squared. 
Eta-squared is the proportion of the total variance in the dependent variable that is attributable to an effect (SS_effect / SS_total), 
while partial eta-squared sets aside the variance explained by the other effects (SS_effect / (SS_effect + SS_error)).

In summary, partitioning of variance is important in ANOVA because it helps to identify the sources of variation in the data, 
determine whether there are significant differences between the groups or conditions, and calculate effect size measures.
"""

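The partitioning described above can be checked by hand on a small example. The sketch below uses made-up values for three groups and only the standard library. It computes the total, between-group, and within-group sums of squares and confirms that the first equals the sum of the other two.

```python
from statistics import mean

# Hypothetical data: three groups of three observations each.
groups = [[2, 3, 4], [5, 6, 7], [8, 9, 10]]

all_values = [x for g in groups for x in g]
grand_mean = mean(all_values)

# Total sum of squares: each observation against the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# Between-group sum of squares: group means against the grand mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: observations against their own group mean.
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

# Eta-squared: share of the total variation explained by group membership.
eta_squared = ss_between / ss_total

print("SS total:  ", ss_total)     # 60
print("SS between:", ss_between)   # 54
print("SS within: ", ss_within)    # 6
print("eta-squared:", eta_squared) # 0.9
```
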
In [None]:
# Q 4 Answer :

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the data into a pandas DataFrame
data = pd.read_csv('data.csv')

# Fit the one-way ANOVA model; C() marks the predictor as categorical
model = ols('dependent_variable ~ C(independent_variable)', data=data).fit()

# The ANOVA table has two rows: the factor and the residual
anova_table = sm.stats.anova_lm(model, typ=1)

# Explained (between-group) sum of squares (SSE)
SSE = anova_table['sum_sq'].iloc[0]

# Residual (within-group) sum of squares (SSR)
SSR = anova_table['sum_sq'].iloc[1]

# Total sum of squares (SST) is the sum of the two components
SST = SSE + SSR

print('SST:', SST)
print('SSE:', SSE)
print('SSR:', SSR)


In [None]:
# Q 5 Answer :

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the data into a pandas DataFrame
data = pd.read_csv('data.csv')

# Fit the two-way ANOVA model; a * b expands to both main effects plus their interaction
model = ols('dependent_variable ~ C(independent_variable_1) * C(independent_variable_2)', data=data).fit()

# Build the ANOVA table once (rows: factor 1, factor 2, interaction, residual)
anova_table = sm.stats.anova_lm(model, typ=2)

# Sum of squares for the main effect of independent_variable_1
main_effect_1 = anova_table['sum_sq'].iloc[0]

# Sum of squares for the main effect of independent_variable_2
main_effect_2 = anova_table['sum_sq'].iloc[1]

# Sum of squares for the interaction between the two variables
interaction_effect = anova_table['sum_sq'].iloc[2]

# The F-statistics and p-values for each effect are in the same table
print(anova_table)

print('Main effect of independent_variable_1:', main_effect_1)
print('Main effect of independent_variable_2:', main_effect_2)
print('Interaction effect:', interaction_effect)



In [None]:
# Q 6 Answer :
"""
If you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, 
you can conclude that there is evidence of statistically significant differences between the groups.

The F-statistic in ANOVA compares the variance between groups to the variance within groups, 
and a high F-value suggests that the differences between group means are larger than what would be expected by chance. 
The p-value measures the probability of obtaining the observed F-statistic or a more extreme value 
if the null hypothesis (i.e., no differences between the groups) is true. In this case, 
the p-value of 0.02 means that, if the null hypothesis were true, an F-statistic of 5.23 or
higher would occur only about 2% of the time, which is below the conventional 0.05 significance level. 
Therefore, we can reject the null hypothesis and conclude that there are statistically significant differences between the groups.

To interpret these results, we can compare the means of the groups and conduct post-hoc tests to determine which groups differ significantly
from each other. We can also calculate effect size measures such as eta-squared or Cohen's d to estimate the magnitude of 
the differences between the groups. Additionally, 
we should consider the assumptions of ANOVA (e.g., normality, 
equal variances) and check for any violations that may impact the validity of the results.

"""

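To show where an F-statistic and its p-value come from, the sketch below computes both for a made-up data set with three groups. The p-value line relies on a closed form of the F distribution that holds only when the numerator degrees of freedom equal 2 (that is, exactly three groups). For any other design a library routine such as `scipy.stats.f_oneway` would be the usual choice.

```python
from statistics import mean

# Hypothetical data: three groups of three observations each.
groups = [[4, 5, 6], [5, 6, 7], [7, 8, 9]]

all_values = [x for g in groups for x in g]
grand_mean = mean(all_values)

ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

df_between = len(groups) - 1               # number of groups minus 1
df_within = len(all_values) - len(groups)  # observations minus groups

# F is the ratio of the between-group to the within-group mean square.
f_statistic = (ss_between / df_between) / (ss_within / df_within)

# Upper-tail probability of F; this closed form is valid ONLY when df_between == 2.
assert df_between == 2
p_value = (1 + 2 * f_statistic / df_within) ** (-df_within / 2)

print(f"F({df_between}, {df_within}) = {f_statistic:.2f}, p = {p_value:.3f}")
# F(2, 6) = 7.00, p = 0.027
```
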
In [None]:
# Q 7 Answer :
"""
Handling missing data in repeated measures ANOVA can be challenging because the repeated measures design assumes that 
each participant has complete data for all levels of the independent variable. However, 
it is common for some participants to have missing data due to various reasons such as dropouts, technical problems, or participant noncompliance.

There are several methods to handle missing data in repeated measures ANOVA, including:

1)Pairwise deletion: This method involves analyzing only the available data for each pair of variables. 
This method is easy to implement, but it can lead to a loss of power and biased results if the data are not missing at random.

2)Listwise deletion: This method involves analyzing only the cases with complete data for all variables. 
The estimates remain unbiased if the data are missing completely at random, 
but power drops as more cases are discarded, and the results can be biased if the missingness is related to the outcome.

3)Imputation: This method involves estimating the missing values using various techniques such as mean imputation,
regression imputation, or multiple imputation. Imputation methods can help to reduce bias and increase power,
but the validity of the results depends on the assumptions of the imputation model.

The potential consequences of using different methods to handle missing data in repeated measures ANOVA can be substantial.
For example, pairwise and listwise deletion can lead to biased estimates and reduced power,
especially when the amount of missing data is large or the missing data are not missing completely at random. 
Imputation methods can improve the validity of the results, 
but the choice of the imputation model can affect the accuracy of the estimates and the significance of the results. 
Therefore, it is important to carefully consider the assumptions of the missing data model
and to use sensitivity analyses to assess the robustness of the results to different methods of handling missing data.
"""

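The difference between listwise deletion and simple mean imputation can be seen on a tiny example. The sketch below uses made-up scores for four hypothetical participants measured at three time points, with `None` marking a missing value. It is meant only to show the mechanics, not to recommend mean imputation, which tends to understate variability.

```python
from statistics import mean

# Hypothetical repeated-measures scores; None marks a missing value.
scores = {
    "p1": [10, 12, 14],
    "p2": [9, None, 13],
    "p3": [11, 13, 15],
    "p4": [8, 10, None],
}
n_times = 3

# Listwise deletion: keep only participants with complete data.
complete_cases = {pid: vals for pid, vals in scores.items() if None not in vals}

# Mean imputation: fill each gap with the mean of the observed values
# at the same time point.
time_means = [
    mean(vals[t] for vals in scores.values() if vals[t] is not None)
    for t in range(n_times)
]
imputed = {
    pid: [v if v is not None else time_means[t] for t, v in enumerate(vals)]
    for pid, vals in scores.items()
}

print("participants kept after listwise deletion:", sorted(complete_cases))
print("participants kept after imputation:", sorted(imputed))
print("imputed row for p2:", imputed["p2"])
```

Listwise deletion halves the sample here, while imputation keeps all four participants at the cost of filling in values that were never observed.
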
In [None]:
# Q 8 Answer :
"""

Post-hoc tests are used to compare group means after a significant F-test in ANOVA. 
They are necessary because ANOVA only tells us if there is a statistically significant difference between groups, 
but not which specific groups differ from each other. Some common post-hoc tests include:

1. Tukey's HSD (Honestly Significant Difference) test: This test compares all possible pairs of group means and 
controls the family-wise error rate (i.e., the probability of making at least one type I error across all comparisons) at a pre-specified alpha level.
It is commonly used when there are equal group sizes and variances.

2. Bonferroni test: This test adjusts the alpha level for each comparison by dividing it by the number of comparisons. 
It makes no assumption about equal group sizes and works with any set of tests, but it becomes conservative as the number of comparisons grows.

3. Dunnett's test: This test compares each group mean to a control group mean and controls the family-wise error rate
at a pre-specified alpha level. It is commonly used in situations where there is a control group and multiple treatment groups.

4. Scheffe's test: This test is a conservative test that controls the family-wise error rate at
a pre-specified alpha level for all possible contrasts among the groups. It is commonly used for complex contrasts (not just pairwise ones) and can handle unequal group sizes.

For example, suppose a researcher conducted a study to compare the effectiveness of three different treatments 
(A, B, and C) for reducing anxiety levels. The researcher conducted a one-way ANOVA and found a significant difference between the groups 
(F(2, 87) = 4.92, p = 0.01). The researcher wants to determine which specific groups differ significantly from each other. 
In this case, the researcher could conduct a post-hoc test such as Tukey's HSD or
Bonferroni test to compare the group means and determine which pairs of groups differ significantly from each other.
This would help the researcher to make more specific conclusions about the relative 
effectiveness of the different treatments for reducing anxiety levels.
"""

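For the three-treatment example above, the sketch below counts the pairwise comparisons and shows why a correction is needed. It assumes a significance level of 0.05 and, for the unadjusted family-wise error rate, treats the comparisons as independent, which is a simplification.

```python
from itertools import combinations

treatments = ["A", "B", "C"]
alpha = 0.05  # assumed significance level

# Every pair of treatments that a post-hoc test would compare.
pairs = list(combinations(treatments, 2))
n_comparisons = len(pairs)

# Chance of at least one false positive if each pair is tested at alpha
# without correction (assuming independent comparisons).
unadjusted_fwer = 1 - (1 - alpha) ** n_comparisons

# Bonferroni: test each pair at alpha divided by the number of comparisons.
bonferroni_alpha = alpha / n_comparisons

print("pairs:", pairs)
print(f"family-wise error rate without correction: {unadjusted_fwer:.3f}")
print(f"Bonferroni-adjusted alpha per comparison: {bonferroni_alpha:.4f}")
```

With three comparisons the uncorrected family-wise error rate is about 0.143, and the Bonferroni threshold for each comparison is about 0.0167.
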
In [None]:
# Q 9 Answer :

"""
Assuming the data is stored in a Pandas dataframe called df with the following columns:

diet - categorical variable indicating the diet (A, B, or C)
weight_loss - continuous variable indicating the weight loss (in pounds) for each participant

Here's how to conduct a one-way ANOVA using Python:
"""
import pandas as pd
import scipy.stats as stats

# Load data into a pandas dataframe
df = pd.read_csv('data.csv')

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(df[df['diet'] == 'A']['weight_loss'],
                                      df[df['diet'] == 'B']['weight_loss'],
                                      df[df['diet'] == 'C']['weight_loss'])

# Print results
print("F-statistic: ", f_statistic)
print("p-value: ", p_value)


In [None]:
# Q 10 Answer :
"""
Assuming the data is stored in a Pandas dataframe called df with the following columns:

program - categorical variable indicating the software program used (A, B, or C)
experience - categorical variable indicating the employee's experience level (novice or experienced)
time - continuous variable indicating the time it took to complete the task (in minutes) for each employee

Here's how to conduct a two-way ANOVA using Python:
"""
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data into a pandas dataframe
df = pd.read_csv('data.csv')

# Create the ANOVA model
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=df).fit()

# Perform the two-way ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Print results
print(anova_table)


In [None]:
# Q 11 Answer :
"""
Assuming the data is stored in a Pandas dataframe called df with the following columns:

group - categorical variable indicating the group (control or experimental)
score - continuous variable indicating the test score for each student
Here's how to conduct a two-sample t-test and post-hoc test using Python:

"""
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Load data into a pandas dataframe
df = pd.read_csv('data.csv')

# Separate the two groups into separate arrays
control = df[df['group'] == 'control']['score']
experimental = df[df['group'] == 'experimental']['score']

# Conduct a two-sample t-test
t, p = stats.ttest_ind(control, experimental)

# Print the results
print('Two-sample t-test results:')
print(f't = {t:.2f}, p = {p:.3f}')

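# Follow up with a post-hoc test (Tukey's HSD) only if the t-test is significant.
# With just two groups it repeats the same comparison, but it also reports
# the mean difference and its confidence interval.
if p < 0.05:
    tukey_results = pairwise_tukeyhsd(df['score'], df['group'])
    print('\nPost-hoc test results:')
    print(tukey_results)
else:
    print('\nNo significant difference; post-hoc test not needed.')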


In [None]:
# Q 12 Answer :
import pandas as pd
import pingouin as pg

sales_data = pd.DataFrame({
    'store': ['A']*30 + ['B']*30 + ['C']*30,
    'day': list(range(1, 31))*3,
    'sales': [10, 15, 14, 12, 11, 13, 17, 16, 18, 14, 15, 12, 13, 11, 12, 13, 15, 16, 11, 14, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]*3
})

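# Each day is measured at all three stores, so 'day' is the subject and
# 'store' is the within-subject (repeated) factor.
# Note: the placeholder data above repeats the same 30 values for every store,
# so substitute real sales figures before interpreting the output.
rm = pg.rm_anova(dv='sales', within='store', subject='day', data=sales_data)
print(rm)

# Post-hoc: Bonferroni-corrected paired comparisons between stores
# (pg.pairwise_tukey only supports between-subject designs)
posthoc = pg.pairwise_tests(data=sales_data, dv='sales', within='store', subject='day', padjust='bonf')
print(posthoc)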