## Assignment Questions

__Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.__

__Ans)__ The ANOVA (Analysis of Variance) test is a parametric statistical test that makes several assumptions about the data being analyzed. Violations of these assumptions can affect the validity of the ANOVA results. The main assumptions required for the ANOVA are as follows:

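* Independence: Observations should be independent of each other, both within and across groups. Placing the same participants in more than one group, or sampling students who all share one classroom, violates this assumption and typically understates the error variance.

* Normality: The data within each group (equivalently, the residuals) should be approximately normally distributed. Strongly skewed outcomes such as income or reaction times, or data with extreme outliers, violate this assumption, and the effect is most serious with small samples.

* Homogeneity of variance: The variances of the groups being compared should be approximately equal. A treatment group whose scores are far more spread out than those of the control group violates this assumption, and the F-test can then become too liberal or too conservative, particularly when group sizes are unequal.

If any of these assumptions is seriously violated, the p-value of the F-test may not be reliable, and a data transformation, Welch's ANOVA, or a non-parametric alternative such as the Kruskal-Wallis test may be more appropriate.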
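The normality and equal-variance assumptions can be checked before running the test. The sketch below is one possible approach, assuming SciPy's `shapiro` and `levene` tests; the three groups are simulated, hypothetical data in which group C is deliberately given a much larger spread.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical samples: A and B have a similar spread, C has a much larger one
groups = {
    "A": rng.normal(loc=50, scale=5, size=40),
    "B": rng.normal(loc=52, scale=5, size=40),
    "C": rng.normal(loc=51, scale=25, size=40),
}

# Normality: Shapiro-Wilk test on each group's deviations from its own mean
normality_p = {}
for name, values in groups.items():
    _, p_value = stats.shapiro(values - values.mean())
    normality_p[name] = p_value
    print(f"Shapiro-Wilk p-value for group {name}: {p_value:.3f}")

# Homogeneity of variance: Levene's test across all groups
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene p-value: {levene_p:.4g}")
```

A small Levene p-value, as expected here because of group C, suggests that the equal-variance assumption does not hold. Independence cannot be tested this way; it has to be judged from how the data were collected.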

__Q2. What are the three types of ANOVA, and in what situations would each be used?__

__Ans)__ There are three types of ANOVA: one-way ANOVA, two-way ANOVA, and repeated measures ANOVA. The situations in which each type would be used are as follows:

* One-way ANOVA: This type of ANOVA is used when there is only one independent variable, and the dependent variable is continuous. One-way ANOVA is used to test whether there is a significant difference between the means of three or more groups. For example, a one-way ANOVA could be used to test whether there is a significant difference in the average scores of three or more groups on a test.

* Two-way ANOVA: This type of ANOVA is used when there are two independent variables (factors), and the dependent variable is continuous. Two-way ANOVA tests the main effect of each factor and whether the two factors interact, that is, whether the effect of one factor on the dependent variable depends on the level of the other. For example, a two-way ANOVA could be used to test whether the type of treatment and the severity of the disease each affect the recovery time of patients, and whether the effect of the treatment depends on the severity.

* Repeated Measures ANOVA: This type of ANOVA is used when the same subjects are measured multiple times, and the dependent variable is continuous. Repeated measures ANOVA is used to test whether there is a significant difference between the means of three or more time points or conditions measured on the same subjects. For example, a repeated measures ANOVA could be used to test whether there is a significant difference in the average scores of the same group of students before, during, and after a training program.

__Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?__

__Ans)__ Partitioning of variance is the process of decomposing the total variability in a dataset into components, each of which represents a source of variation in the data. In a one-way ANOVA, the total sum of squares of the dependent variable is split into two parts: between-group variation (differences among the group means) and within-group variation (differences among the observations inside each group). The total sum of squares equals the between-group sum of squares plus the within-group sum of squares.

Understanding this partition is important because it is the basis of the hypothesis test. Dividing each sum of squares by its degrees of freedom gives a mean square, and the ratio of the between-group mean square to the within-group mean square is the F-statistic. By comparing the F-statistic to a critical value based on the chosen significance level, researchers can determine whether the observed differences between the group means are larger than chance alone would explain.
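The partition can be verified by hand on a small example. The sketch below uses hypothetical scores for three groups, computes the three sums of squares with NumPy, and compares the resulting F-statistic with the one returned by SciPy's `f_oneway`.

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical scores for three groups (illustrative values only)
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([7.0, 8.0, 9.0, 8.0]),
    np.array([1.0, 2.0, 3.0, 2.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: every observation against the grand mean
ss_total = np.sum((all_values - grand_mean) ** 2)
# Between-group variation: group means against the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: observations against their own group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Mean squares and the F-statistic
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

f_scipy, p_scipy = f_oneway(*groups)
print("total:", ss_total, "between + within:", ss_between + ss_within)
print("F (manual):", f_manual, "F (scipy):", f_scipy)
```

The between-group and within-group sums of squares add up to the total, and the hand-computed F-statistic agrees with the library result.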

__Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?__

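In [3]:
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols

# load Titanic dataset from Seaborn
titanic = sns.load_dataset('titanic')

# fit the model with pclass as independent variable and fare as dependent variable
# (pclass is numeric here, so it enters as a single 1-df term; use C(pclass) to treat the classes as categorical groups)
model = ols('fare ~ pclass', data=titanic).fit()

# compute the ANOVA table once and read the sums of squares by row label
anova_table = sm.stats.anova_lm(model, typ=1)

# explained sum of squares (called SSE in the question)
sse = anova_table.loc['pclass', 'sum_sq']

# residual sum of squares (called SSR in the question)
ssr = anova_table.loc['Residual', 'sum_sq']

# total sum of squares is the sum of the explained and residual parts
sst = sse + ssr

print(f'SST: {sst:.2f}')
print(f'SSE: {sse:.2f}')
print(f'SSR: {ssr:.2f}')

SST: 2197798.79
SSE: 663624.98
SSR: 1534173.82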


__Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?__

In [2]:
import seaborn as sns
import statsmodels.api as sm
from statsmodels.formula.api import ols

# load Titanic dataset from Seaborn
titanic = sns.load_dataset('titanic')

# define ANOVA model with pclass and sex as independent variables and fare as dependent variable
# ('pclass * sex' expands to pclass + sex + pclass:sex; pclass is numeric here, use C(pclass) to treat it as categorical)
model = ols('fare ~ pclass * sex', data=titanic).fit()

# compute the Type II ANOVA table once
anova_table = sm.stats.anova_lm(model, typ=2)

# sums of squares for the main effects
main_effects = anova_table.loc[['sex', 'pclass'], 'sum_sq']

# sum of squares for the interaction effect
interaction_effect = anova_table.loc['pclass:sex', 'sum_sq']

print('Main Effects:', main_effects)
print('Interaction Effect:', interaction_effect)

Main Effects: sex        26992.212049
pclass    617550.791669
Name: sum_sq, dtype: float64
Interaction Effect: 45934.370324601216


__Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?__

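__Ans)__ The p-value of 0.02 is below the conventional 0.05 significance level, so we reject the null hypothesis that all group means are equal. The F-statistic of 5.23 means that the variation between the group means is about five times the variation within the groups, which is more than chance alone would typically produce. The test shows only that at least one group mean differs from the others, not which ones, so a post-hoc test is needed to identify the specific differences. It is also important to consider the practical size of the differences, for example through an effect size such as eta squared, in addition to their statistical significance.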

__Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?__

__Ans)__ In a repeated measures ANOVA, missing data can be handled by exclusion or imputation methods such as mean imputation, regression imputation, or multiple imputation. However, the potential consequences of using different methods to handle missing data should be carefully considered, as different methods rely on different assumptions and may introduce bias or inaccuracies in the results. The choice of method should be based on the nature and extent of the missing data and the assumptions underlying the imputation method, and sensitivity analyses should be conducted to evaluate the robustness of the results to different assumptions.
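The difference between exclusion and imputation can be seen on a small example. The sketch below uses hypothetical wide-format data (rows are subjects, columns are time points) and compares listwise deletion with mean imputation using NumPy only; multiple imputation is not shown.

```python
import numpy as np

# Hypothetical wide-format data: rows are subjects, columns are three time points
scores = np.array([
    [10.0, 12.0, 14.0],
    [11.0, np.nan, 15.0],
    [9.0, 11.0, 13.0],
    [12.0, 14.0, np.nan],
    [10.0, 13.0, 15.0],
])

# Option 1: listwise deletion keeps only subjects observed at every time point
complete_rows = ~np.isnan(scores).any(axis=1)
listwise = scores[complete_rows]

# Option 2: mean imputation fills each gap with that time point's observed mean
column_means = np.nanmean(scores, axis=0)
imputed = np.where(np.isnan(scores), column_means, scores)

print("Subjects kept by listwise deletion:", listwise.shape[0])
print("Variance per time point, observed values only:", np.nanvar(scores, axis=0, ddof=1))
print("Variance per time point, after mean imputation:", imputed.var(axis=0, ddof=1))
```

Listwise deletion discards two of the five subjects, which costs statistical power, while mean imputation keeps every subject but shrinks the variance of the imputed time points, which can make differences look more certain than they are.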

__Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.__

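__Ans)__ Post-hoc tests are used after ANOVA to determine which specific groups differ significantly from each other. Some common post-hoc tests are:

* Tukey's Honestly Significant Difference (HSD): used to compare every pair of group means while controlling the family-wise error rate. It is a common default when all pairwise comparisons are of interest.

* Bonferroni correction: used when only a small number of planned comparisons are made. It is simple to apply but becomes conservative as the number of comparisons grows.

* Dunnett's test: used when each treatment group is compared against a single control group rather than against every other group.

* Scheffé's test: used when complex contrasts (for example, the average of two groups against a third) are of interest as well as pairwise ones. It is the most conservative of the four.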

An example of a situation where a post-hoc test might be necessary is a study comparing the effectiveness of four different treatments for a medical condition. After conducting an ANOVA, it may be found that there is a significant difference between the means of the four treatment groups. However, the ANOVA does not tell us which specific groups are different from each other. In this case, a post-hoc test such as Tukey's HSD or Bonferroni correction could be used to compare the means of each pair of treatment groups and determine which groups are significantly different from each other.

It is important to note that post-hoc tests should only be conducted if the ANOVA result is significant. If the ANOVA result is not significant, post-hoc tests are not appropriate because there is no evidence of a difference between the groups. Additionally, it is important to choose an appropriate post-hoc test based on the nature of the data and the specific research question being addressed.
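The four-treatment scenario can be sketched with Bonferroni-corrected pairwise t-tests. The data below are hypothetical, and the correction is applied by hand (each raw p-value multiplied by the number of comparisons and capped at 1) rather than through a dedicated library routine.

```python
from itertools import combinations

import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Hypothetical outcome scores for four treatments (illustrative values only)
treatments = {
    "T1": np.array([20.0, 22.0, 19.0, 21.0, 23.0, 20.0]),
    "T2": np.array([21.0, 20.0, 22.0, 19.0, 22.0, 21.0]),
    "T3": np.array([30.0, 31.0, 29.0, 32.0, 30.0, 31.0]),
    "T4": np.array([25.0, 26.0, 24.0, 27.0, 25.0, 26.0]),
}

# Omnibus test: is there any difference among the four means?
f_stat, p_omnibus = f_oneway(*treatments.values())
print(f"Omnibus ANOVA: F = {f_stat:.2f}, p = {p_omnibus:.3g}")

# Pairwise independent-samples t-tests for all six pairs
pairs = list(combinations(treatments, 2))
raw_p = np.array([ttest_ind(treatments[a], treatments[b]).pvalue for a, b in pairs])

# Bonferroni correction: multiply by the number of comparisons, cap at 1
adjusted_p = np.minimum(raw_p * len(pairs), 1.0)

for (a, b), p_adj in zip(pairs, adjusted_p):
    verdict = "different" if p_adj < 0.05 else "not clearly different"
    print(f"{a} vs {b}: Bonferroni-adjusted p = {p_adj:.4f} ({verdict})")
```

In this made-up data T1 and T2 have identical means, so that pair is not flagged, while the pairs involving T3 are.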

__Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.__

In [4]:
import numpy as np
from scipy.stats import f_oneway

# define illustrative weight loss data for each diet
# (25 values per diet, 75 in total, whereas the prompt describes 50 participants)
diet_A = np.array([5.6, 6.1, 4.8, 7.2, 5.4, 6.8, 4.9, 7.1, 5.8, 6.4,
                   4.7, 5.9, 6.3, 5.5, 4.8, 7.4, 6.2, 5.1, 4.9, 6.7,
                   5.3, 6.0, 4.6, 5.2, 7.0])
diet_B = np.array([4.2, 3.8, 2.9, 5.1, 4.9, 3.5, 3.9, 5.6, 3.1, 4.5,
                   5.0, 4.3, 3.7, 4.0, 4.8, 3.4, 4.1, 3.3, 3.6, 4.7,
                   4.4, 3.6, 4.1, 4.2, 4.9])
diet_C = np.array([2.8, 2.1, 1.9, 2.6, 3.0, 2.5, 2.7, 1.8, 3.3, 3.5,
                   2.2, 2.9, 3.1, 2.6, 3.3, 3.0, 2.8, 2.1, 2.7, 2.2,
                   2.5, 2.9, 2.6, 2.3, 2.1])

# conduct one-way ANOVA
F, p = f_oneway(diet_A, diet_B, diet_C)

# report results
print('F-statistic:', F)
print('p-value:', p)
if p < 0.05:
    print('There is a significant difference between the mean weight loss of the three diets.')
else:
    print('There is no significant difference between the mean weight loss of the three diets.')

F-statistic: 135.79364818938228
p-value: 3.6866706447631974e-25
There is a significant difference between the mean weight loss of the three diets.


__Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.__

In [5]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a pandas dataframe with completion time, software program, and employee experience level
data = pd.DataFrame({'completion_time': [35, 40, 32, 33, 41, 38, 36, 39, 29, 31, 30, 33, 34, 32, 38, 36, 40, 42, 37, 31, 27, 29, 30, 28, 34, 37, 39, 35, 38, 40],
                     'software_program': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                     'employee_experience': ['novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced', 'experienced', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'novice', 'experienced', 'experienced', 'experienced']})

# conduct two-way ANOVA
model = ols('completion_time ~ C(software_program) + C(employee_experience) + C(software_program):C(employee_experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# report results
print(anova_table)

                                                sum_sq    df         F  \
C(software_program)                          16.800000   2.0  0.488234   
C(employee_experience)                       18.050000   1.0  1.049122   
C(software_program):C(employee_experience)   75.033333   2.0  2.180585   
Residual                                    412.916667  24.0       NaN   

                                              PR(>F)  
C(software_program)                         0.619673  
C(employee_experience)                      0.315923  
C(software_program):C(employee_experience)  0.134848  
Residual                                         NaN  


__Ans)__ The ANOVA table reports the F-statistic and p-value for each main effect and for the interaction. For this data, neither the main effect of software program (F = 0.49, p = 0.62) nor the main effect of employee experience (F = 1.05, p = 0.32) is significant at the 0.05 level, and the interaction between them is not significant either (F = 2.18, p = 0.13). There is therefore no evidence that completion time differs between the three programs or between novice and experienced employees, or that the effect of the program depends on experience level.

__Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.__

In [6]:
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# define test scores for control and experimental groups
control = np.array([72, 80, 68, 75, 78, 85, 70, 79, 73, 81,
                    74, 77, 82, 76, 79, 71, 83, 75, 76, 80,
                    69, 78, 72, 77, 81, 70, 73, 79, 74, 82,
                    76, 80, 74, 77, 69, 72, 73, 78, 75, 81,
                    68, 79, 82, 77, 75, 80, 74, 71, 83, 76])
experimental = np.array([85, 88, 78, 90, 83, 91, 87, 89, 86, 84,
                         81, 92, 79, 89, 84, 88, 87, 81, 84, 82,
                         90, 87, 86, 84, 83, 89, 88, 91, 86, 82,
                         87, 89, 91, 85, 84, 82, 87, 86, 85, 89,
                         90, 83, 82, 86, 84, 91, 85, 88, 89, 90])

# conduct two-sample t-test
t, p = ttest_ind(control, experimental)

# report results
print('t-statistic:', t)
print('p-value:', p)
if p < 0.05:
    print('There is a significant difference in test scores between the control and experimental groups.')
    # conduct post-hoc test (Tukey's HSD); with only two groups it repeats the t-test comparison
    data = np.concatenate([control, experimental])
    group_labels = np.concatenate([np.repeat('control', len(control)), np.repeat('experimental', len(experimental))])
    tukey_results = pairwise_tukeyhsd(data, group_labels, 0.05)
    print(tukey_results)
else:
    print('There is no significant difference in test scores between the control and experimental groups.')

t-statistic: -12.83475228455051
p-value: 1.0424949919654563e-22
There is a significant difference in test scores between the control and experimental groups.
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1    group2    meandiff p-adj lower upper reject
------------------------------------------------------
control experimental     9.96  -0.0  8.42  11.5   True
------------------------------------------------------


__Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant
differences in sales between the three stores. If the results are significant, follow up with a post-hoc
test to determine which store(s) differ significantly from each other.__

In [7]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# create a pandas dataframe with daily sales data for each store
data = pd.DataFrame({'day': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
                             11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
                             21, 22, 23, 24, 25, 26, 27, 28, 29, 30] * 3,
                     'store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
                     'sales': [156, 132, 159, 142, 136, 148, 145, 157, 164, 140,
                               131, 135, 141, 147, 143, 148, 152, 162, 155, 158,
                               148, 146, 152, 149, 163, 160, 150, 155, 149, 157,
                               135, 141, 138, 142, 129, 137, 146, 133, 131, 144,
                               163, 158, 165, 169, 166, 171, 175, 170, 172, 168,
                               184, 180, 178, 182, 187, 190, 189, 194, 193, 191,
                               154, 143, 141, 147, 139, 150, 157, 146, 139, 149,
                               162, 170, 163, 169, 160, 172, 174, 170, 168, 167,
                               144, 142, 146, 138, 149, 150, 146, 154, 155, 157]})

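# conduct repeated measures ANOVA: each day is the repeated unit (subject) and store is the within factor
model = AnovaRM(data, depvar='sales', subject='day', within=['store']).fit()
print(model.summary())

# conduct post-hoc test (Tukey's HSD)
# note: Tukey's HSD treats the stores as independent groups and ignores the pairing by day
tukey_results = pairwise_tukeyhsd(data['sales'], data['store'], 0.05)
print(tukey_results)

               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
store  9.1059 2.0000 58.0000 0.0004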

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B     14.7 0.0007   5.5842 23.8158   True
     A      C      4.7 0.4392  -4.4158 13.8158  False
     B      C    -10.0  0.028 -19.1158 -0.8842   True
-----------------------------------------------------
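Because the same days are observed for every store, a post-hoc comparison that respects this pairing is an alternative to Tukey's HSD. The sketch below is one such option, assuming paired t-tests (`ttest_rel`) with a hand-applied Bonferroni correction; it uses a small hypothetical dataset of eight days rather than the 30 days above.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_rel

# Hypothetical sales for three stores on the same eight days (illustrative values only)
sales = {
    "A": np.array([100.0, 110.0, 105.0, 120.0, 115.0, 108.0, 112.0, 118.0]),
    "B": np.array([109.0, 121.0, 113.0, 131.0, 124.0, 118.0, 120.0, 129.0]),
    "C": np.array([101.0, 109.0, 106.0, 119.0, 116.0, 107.0, 113.0, 117.0]),
}

# Paired t-test for each pair of stores: differences are taken day by day
pairs = list(combinations(sales, 2))
raw_p = np.array([ttest_rel(sales[a], sales[b]).pvalue for a, b in pairs])

# Bonferroni correction: multiply by the number of comparisons, cap at 1
adjusted_p = np.minimum(raw_p * len(pairs), 1.0)

for (a, b), p_adj in zip(pairs, adjusted_p):
    print(f"{a} vs {b}: Bonferroni-adjusted p = {p_adj:.4f}")
```

Pairing removes the day-to-day variation that all stores share, so consistent differences between stores are easier to detect than with a test that treats the stores as independent groups.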


-------------------------------------------------------------------------------------------- __End__----------------------------------------------------------------------------------------------------------------