## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Ans:
Assumptions required to use ANOVA:
1. Independence: The observations within and across groups are independent of each other.
2. Normality: The data in each group (equivalently, the model residuals) should be approximately normally distributed.
3. Homogeneity of variances: The variance of the data in each group should be roughly equal.
4. No extreme outliers: Each group should be free of extreme values, which distort group means and inflate variances.

Examples of violations that could impact the validity of ANOVA results are:
1. Outliers: If one or more data points are significantly different from the others in a group, it can impact the normality assumption.
2. Non-normal distribution: If the data in one or more groups are not normally distributed, it can affect the validity of the ANOVA results.
3. Unequal variances: If the variances of the data in one or more groups are significantly different from the others, it can affect the homogeneity of variances assumption.
4. Correlated observations: If the observations in one or more groups are not independent of each other, it can violate the independence assumption.
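
The normality and equal-variance assumptions can be checked numerically before running ANOVA. The following is a minimal sketch, assuming simulated data for three hypothetical groups; it uses the Shapiro-Wilk test for normality and Levene's test for homogeneity of variances. Small p-values are evidence against the corresponding assumption. Independence cannot be tested this way and has to follow from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal distributions with equal variance
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=5.0, scale=1.0, size=30),
    "B": rng.normal(loc=5.5, scale=1.0, size=30),
    "C": rng.normal(loc=6.0, scale=1.0, size=30),
}

# Normality: Shapiro-Wilk test per group (H0: the sample comes from a normal distribution)
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variances: Levene's test across all groups (H0: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```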




## Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans:
    
The three types of ANOVA are:
1. One-way ANOVA: One-way ANOVA is used when there is one factor with two or more levels. For example, a one-way ANOVA could be used to compare the average income levels of people in three different cities.
2. Two-way ANOVA: Two-way ANOVA is used when there are two factors, each with two or more levels. For example, a two-way ANOVA could be used to compare the effects of two different diets and two different exercise programs on weight loss.
3. Three-way ANOVA: Three-way ANOVA is used when there are three factors, each with two or more levels. For example, a three-way ANOVA could be used to compare the effects of different types of fertilizers, different irrigation methods, and different planting times on crop yield.

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans:
    
The partitioning of variance in ANOVA refers to the division of the total variance of the data into different components, each of which can be attributed to a specific source or factor. 

This partitioning is important because it helps to understand the relative contributions of different factors to the overall variability of the data, and to determine if the differences between groups are statistically significant or due to chance.
In ANOVA, the total variance of the data is divided into two components: the variance between groups (also known as the "treatment" variance) and the variance within groups (also known as the "error" variance). The variance between groups represents the variability of the means of each group, while the variance within groups represents the variability of the individual data points within each group.

By comparing the variance between groups to the variance within groups, ANOVA can determine if there is a statistically significant difference between the means of two or more groups. If the variance between groups is significantly larger than the variance within groups, then it suggests that the differences between groups are not due to chance and that there is a significant effect of the factor being studied.

Understanding the partitioning of variance is important because it allows researchers to determine which factors are contributing to the differences between groups and to quantify the magnitude of these effects. This information can then be used to make more informed decisions and to draw more accurate conclusions about the underlying relationships in the data.
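
The partition can be verified by hand on a small data set. The following is a minimal sketch, assuming three hypothetical groups with two observations each; the variable names are illustrative. It shows that the between-group and within-group sums of squares add up to the total sum of squares.

```python
import numpy as np

# Hypothetical data: three groups with two observations each
groups = {"A": np.array([1.0, 2.0]), "B": np.array([3.0, 4.0]), "C": np.array([5.0, 6.0])}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()

# Total variability: squared deviations of every observation from the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()
# Between-group variability: squared deviations of group means from the grand mean,
# weighted by group size
ss_between = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in groups.values())
# Within-group variability: squared deviations of observations from their own group mean
ss_within = sum(((v - v.mean()) ** 2).sum() for v in groups.values())

print("SS total:  ", ss_total)
print("SS between:", ss_between)
print("SS within: ", ss_within)
```

Here the total of 17.5 splits into 16.0 between groups and 1.5 within groups, so almost all of the variability is attributable to group membership.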

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [3]:
## Ans:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
# create a sample data frame
df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'C', 'C'], 'value': [1, 2, 3, 4, 5, 6]})
# fit the one-way ANOVA model
model = ols('value ~ group', data=df).fit()
# calculate SST (total), SSE (error, i.e. residual) and SSR (regression, i.e. explained) sums of squares
SST = sum((df['value'] - df['value'].mean())**2)
SSE = sum(model.resid**2)
SSR = SST - SSE
print('SST:', SST)
print('SSE:', SSE)
print('SSR:', SSR)



SST: 17.5
SSE: 1.5
SSR: 16.0


In this example, we first create a sample data frame df with three groups (A, B, and C) and their corresponding values. We then use the ols function to fit a one-way ANOVA model to the data, with value as the dependent variable and group as the independent variable. SST is the sum of squared differences between each value and the overall mean. SSE is the sum of squared residuals (the differences between the observed values and the group means predicted by the model). SSR is the difference between SST and SSE. Note that the code follows the common regression convention in which SSE is the error (residual) sum of squares and SSR is the regression (explained) sum of squares. Under the naming used in the question, the explained sum of squares is therefore 16.0 and the residual sum of squares is 1.5.
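
The sums of squares lead directly to the F-statistic once each is divided by its degrees of freedom. The following is a minimal sketch on the same toy values as the sample data frame; the variable names are illustrative, and `scipy.stats.f_oneway` is used only as a cross-check.

```python
import numpy as np
from scipy.stats import f, f_oneway

# Same toy values as the sample data frame: groups A, B, C with two values each
a, b, c = [1, 2], [3, 4], [5, 6]
values = np.array(a + b + c, dtype=float)
k, n = 3, values.size  # number of groups, number of observations

ss_total = ((values - values.mean()) ** 2).sum()
ss_within = sum(((np.array(g) - np.mean(g)) ** 2).sum() for g in (a, b, c))
ss_between = ss_total - ss_within

# Mean squares divide each sum of squares by its degrees of freedom
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n - k)
f_manual = ms_between / ms_within
p_manual = f.sf(f_manual, k - 1, n - k)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = f_oneway(a, b, c)
print("F (manual):", f_manual, "F (scipy):", f_scipy)
print("p (manual):", p_manual, "p (scipy):", p_scipy)
```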

## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Create the example dataset
data = {
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
    'age': [25, 32, 41, 29, 38, 47, 31, 28, 35, 39, 26, 33, 44, 27, 36, 42, 30, 37, 34, 40],
    'gender': ['M', 'F', 'F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F', 'M', 'F'],
    'treatment': ['drugA', 'drugB', 'drugA', 'drugB', 'drugA', 'drugB', 'drugA', 'drugB', 'drugA', 'drugB', 'drugA', 'drugB', 'drugA', 'drugB', 'drugA', 'drugB', 'drugA', 'drugB', 'drugA', 'drugB'],
    'outcome': [14, 19, 21, 15, 17, 20, 16, 18, 19, 22, 13, 17, 18, 16, 20, 23, 15, 19, 16, 21]
}
df = pd.DataFrame(data)
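# Fit a linear model with gender and treatment as factors, age as a continuous covariate,
# and an age-by-treatment interaction
# (a two-way ANOVA on the two categorical factors alone would use 'outcome ~ gender * treatment')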
model = ols('outcome ~ age + gender + treatment + age:treatment', data=df).fit()
# Print the ANOVA table
table = sm.stats.anova_lm(model, typ=2)
print(table)


                  sum_sq    df          F    PR(>F)
gender          9.141723   1.0   3.870079  0.067931
treatment       0.097600   1.0   0.041318  0.841657
age            48.889316   1.0  20.696922  0.000384
age:treatment   0.128376   1.0   0.054347  0.818816
Residual       35.432309  15.0        NaN       NaN


In [10]:
#another example
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Create a pandas dataframe with two independent variables and a dependent variable
independent_variable_1 = np.array(['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'])
independent_variable_2 = np.array(['X', 'X', 'X', 'Y', 'Y', 'Y', 'X', 'X', 'X', 'Y', 'Y', 'Y'])
dependent_variable = np.array([10, 12, 8, 14, 16, 12, 18, 20, 16, 22, 24, 20])
df = pd.DataFrame({'independent_variable_1': independent_variable_1, 'independent_variable_2': independent_variable_2, 'dependent_variable': dependent_variable})
# Define the ANOVA model with interaction
model_interaction = ols('dependent_variable ~ independent_variable_1 * independent_variable_2', data=df).fit()
# Compute the ANOVA table once and look up each effect by its row label.
# Row 0 is the first factor and the last row is the residual, so the
# positional indices 1, 2, 3 would return the wrong rows.
anova_table = sm.stats.anova_lm(model_interaction, typ=1)
main_effect_1 = anova_table.loc['independent_variable_1', 'sum_sq']
main_effect_2 = anova_table.loc['independent_variable_2', 'sum_sq']
interaction_effect = anova_table.loc['independent_variable_1:independent_variable_2', 'sum_sq']
print('Main effect of independent variable 1:', main_effect_1)
print('Main effect of independent variable 2:', main_effect_2)
print('Interaction effect:', interaction_effect)

For this data the sums of squares are approximately 32 for independent variable 1, 48 for independent variable 2, and 0 (up to floating-point error) for the interaction; the residual sum of squares is 192.


## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

Ans:
    
A significant F-statistic in a one-way ANOVA suggests that at least one group mean differs from the others.
The p-value of 0.02 means that, if all group means were truly equal, an F-statistic of 5.23 or larger would be expected only about 2% of the time. Since 0.02 is below the usual significance level of 0.05, we reject the null hypothesis of equal means.

Based on these results, we can conclude that there are significant differences among the groups being compared. However, we cannot determine which specific groups are different from each other without further analysis (such as post-hoc tests).

Furthermore, the F-statistic is the ratio of between-group to within-group variability: a value of 5.23 means that the between-group mean square is about five times the within-group mean square. However, F also grows with sample size, so it does not measure the magnitude of the effect on its own. An effect size measure such as eta squared is a better metric for that.
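
The question does not state the degrees of freedom, so the following is only a sketch under assumed values (3 groups and 30 observations in total, which are hypothetical). It shows how a p-value is obtained from an F-statistic and how eta squared can be recovered from F and the degrees of freedom. With other degrees of freedom the p-value will differ, which is why it need not equal the 0.02 given in the question.

```python
from scipy.stats import f

f_stat = 5.23
# Hypothetical degrees of freedom: 3 groups and 30 observations in total
df_between, df_within = 2, 27

# p-value: probability of an F at least this large if all group means are equal
p_value = f.sf(f_stat, df_between, df_within)

# Eta squared recovered from F and the degrees of freedom:
# eta^2 = SS_between / SS_total = (F * df_between) / (F * df_between + df_within)
eta_squared = (f_stat * df_between) / (f_stat * df_between + df_within)

print(f"p-value: {p_value:.4f}")
print(f"eta squared: {eta_squared:.3f}")
```

Eta squared is the share of the total variability explained by group membership, so unlike F it can be compared across studies with different sample sizes.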

## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Ans:
    
In a repeated measures ANOVA, missing data can occur when some participants do not complete all measurements or when some data points are lost due to technical errors or other reasons. The handling of missing data depends on the extent of missingness, the underlying pattern of missingness, and the chosen method for handling missing data. Here are some common approaches to handling missing data in a repeated measures ANOVA:

1. Complete case analysis: This involves excluding all participants with any missing data from the analysis. This method is simple, but it reduces the sample size and power, and it can lead to biased estimates if the missing data are not missing completely at random (MCAR).

2. Pairwise deletion: This uses every available observation for each comparison, so a participant is left out only of the comparisons that involve their missing measurements. This method can provide unbiased estimates if the missing data are MCAR, but different comparisons then rest on different subsets of participants, and it can lead to biased estimates if the missing data are not MCAR.

3. Imputation methods: These methods involve estimating the missing data based on the observed data and incorporating the imputed values into the analysis. There are several imputation methods available, including mean imputation, regression imputation, multiple imputation, and others. Single-value methods such as mean imputation understate the variability of the data, while multiple imputation carries the uncertainty of the imputed values into the results. Imputation can provide unbiased estimates and increase power if the assumptions of the imputation model are met, but it can introduce bias if the imputation model is misspecified.

The potential consequences of using different methods to handle missing data in a repeated measures ANOVA are that the estimates of the means, variances, and standard errors can vary depending on the method used. This can affect the statistical significance of the results, the power of the test, and the precision of the estimates. Therefore, it is important to carefully consider the extent and pattern of missing data and choose an appropriate method for handling missing data that minimizes bias and maximizes the validity of the analysis.
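
The effect of the chosen method can be seen on a small example. The following is a minimal sketch, assuming hypothetical scores for five participants measured at three time points; it compares complete case analysis with mean imputation using pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 14.0, 13.0],
        "t2": [11.0, np.nan, 12.0, 15.0, 14.0],
        "t3": [13.0, 14.0, np.nan, 17.0, 15.0],
    },
    index=pd.Index(["p1", "p2", "p3", "p4", "p5"], name="participant"),
)

# Complete case analysis: drop every participant with at least one missing value
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with the mean of its time point
mean_imputed = scores.fillna(scores.mean())

print("Participants kept (complete case):", len(complete_cases))
print("Column means, complete case:")
print(complete_cases.mean())
print("Column standard deviations, observed vs. mean-imputed:")
print(pd.DataFrame({"observed": scores.std(), "imputed": mean_imputed.std()}))
```

Complete case analysis keeps only three of the five participants, while mean imputation keeps all five but shrinks the standard deviation of the columns that had missing values.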


## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Ans:
    
Post-hoc tests are used to compare the means of different groups in an ANOVA when the overall F-test is significant. These tests allow us to identify which specific groups are significantly different from each other.

Some common post-hoc tests include:

1. Tukey's HSD (honestly significant difference): This test compares all possible pairs of group means and controls the family-wise error rate. It assumes equal variances across groups and was designed for equal sample sizes (the Tukey-Kramer variant handles unequal sizes). It is the usual choice when all pairwise comparisons are of interest.

2. Bonferroni correction: This is a general-purpose adjustment that controls the family-wise type-I error rate by dividing the overall alpha level by the number of comparisons. It can be applied to any set of tests, but it becomes increasingly conservative as the number of comparisons grows, so it is best suited to a small number of planned comparisons.

3. Scheffe's test: This test controls the family-wise error rate for all possible contrasts, including complex ones such as the average of two groups against a third. It is more conservative than Tukey's HSD for pairwise comparisons, so it is best used when complex contrasts, and not just pairs, are tested. Like Tukey's HSD, it assumes equal variances.

4. Dunn's test: This test is appropriate when the data are clearly non-normal. Dunn's test is a non-parametric, rank-based test that compares all possible pairs of groups and is typically used after a Kruskal-Wallis test.

A post-hoc test might be necessary when the overall F-test in the ANOVA is significant, and we want to determine which specific groups differ significantly from each other. 
For example, suppose we want to compare the effectiveness of four different teaching methods on student performance in a particular subject. After conducting a one-way ANOVA, we find a significant difference among the four groups. A post-hoc test such as Tukey's HSD or Bonferroni correction can help us identify which specific teaching methods lead to significantly different student performance.
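
The Bonferroni approach for the teaching-method example can be sketched as follows. This is a minimal illustration on simulated, hypothetical scores (the group names, means, and sample sizes are assumptions), using pairwise two-sample t-tests from SciPy with a Bonferroni-adjusted significance level.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

# Hypothetical scores for four teaching methods (simulated)
rng = np.random.default_rng(42)
methods = {
    "method_1": rng.normal(70, 5, 25),
    "method_2": rng.normal(72, 5, 25),
    "method_3": rng.normal(78, 5, 25),
    "method_4": rng.normal(80, 5, 25),
}

pairs = list(combinations(methods, 2))
alpha = 0.05
# Bonferroni: each comparison is tested at alpha divided by the number of comparisons
alpha_per_test = alpha / len(pairs)

results = {}
for g1, g2 in pairs:
    t_stat, p_val = ttest_ind(methods[g1], methods[g2])
    results[(g1, g2)] = (p_val, p_val < alpha_per_test)
    print(f"{g1} vs {g2}: p = {p_val:.4f}, reject at adjusted alpha: {p_val < alpha_per_test}")
```

With four groups there are six pairwise comparisons, so each one is tested at 0.05 / 6 instead of 0.05.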

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [12]:
import numpy as np
from scipy.stats import f_oneway
# Generate simulated data assuming normal distribution with same variance
# (this simulation uses 50 participants per diet, i.e. 150 in total)
np.random.seed(1)
diet_A = np.random.normal(5, 1, 50)
diet_B = np.random.normal(4, 1, 50)
diet_C = np.random.normal(3, 1, 50)
# Perform one-way ANOVA
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)
# Set significance level
alpha = 0.05
# Null hypothesis: The mean weight loss is the same for all three diets.
# Alternative hypothesis: The mean weight loss is different for at least one diet.
null_hypothesis = "The mean weight loss is the same for all three diets."
alternate_hypothesis = "The mean weight loss is different for at least one diet."
print("F-statistic:", f_statistic)
print("p-value:", p_value)
if p_value < alpha:
    print("We reject the null hypothesis.")
    print(f"Conclusion : {alternate_hypothesis}")
else:
    print("We fail to reject the null hypothesis.")
    print(f"Conclusion : {null_hypothesis}")

F-statistic: 57.06379442059458
p-value: 4.5619061215783055e-19
We reject the null hypothesis.
Conclusion : The mean weight loss is different for at least one diet.


## Performing Tukey's test for mean difference

In [15]:
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd
#create DataFrame to hold data
df = pd.DataFrame({'weight_loss': list(diet_A) + list(diet_B) + list(diet_C),
                   'group': np.repeat(['A', 'B', 'C'], repeats=50)})
# perform Tukey's test
tukey = pairwise_tukeyhsd(endog=df['weight_loss'],
                          groups=df['group'],
                          alpha=0.05)
# Print results
print(tukey)

Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower   upper  reject
---------------------------------------------------
     A      B  -0.8278   0.0 -1.2477 -0.4079   True
     A      C  -1.8898   0.0 -2.3097 -1.4699   True
     B      C   -1.062   0.0 -1.4819 -0.6421   True
---------------------------------------------------


Interpretation: reject is True for all three pairs, so all three diet means differ significantly from one another. statsmodels reports meandiff as the mean of group2 minus the mean of group1:
1. Mean difference between diet_A and diet_B (B - A) is -0.8278
2. Mean difference between diet_A and diet_C (C - A) is -1.8898
3. Mean difference between diet_B and diet_C (C - B) is -1.062

The largest mean difference is between diet_A and diet_C.

Both differences involving diet_A are negative, so diet_A shows the highest mean weight loss, followed by diet_B and then diet_C.

## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.
Significance level=0.05.

In [16]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Setting random seed for reproducibility
np.random.seed(123)
# Generating 2 random time samples for novice and expert
time_novice = np.random.normal(loc=15, scale=2, size=30)
time_expert = np.random.normal(loc=10, scale=2, size=30)
# Generate simulated data
data = pd.DataFrame({
    'Software': ['A']*20 + ['B']*20 + ['C']*20,
    'Experience': ['Novice']*30 + ['Experienced']*30,
    'Time': list(time_novice)+list(time_expert)
})
# Print the simulated data head 
print('Simulated Data example :')
print(data.head())
print('\n======================================================================================\n')
# Fit the two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
table = sm.stats.anova_lm(model, typ=1)
# Set significance level
alpha = 0.05
# Main effects and interaction effect
print(table)
print('\n')
# Look up each p-value by row label instead of by position
if table.loc['C(Software)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant main effect of software.")
else:
    print("Conclusion: There is no significant main effect of software.")
if table.loc['C(Experience)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant main effect of experience.")
else:
    print("Conclusion: There is no significant main effect of experience.")
if table.loc['C(Software):C(Experience)', 'PR(>F)'] < alpha:
    print("Conclusion: There is a significant interaction effect between software and experience.")
else:
    print("Conclusion: There is no significant interaction effect between software and experience.")

Simulated Data example :
  Software Experience       Time
0        A     Novice  12.828739
1        A     Novice  16.994691
2        A     Novice  15.565957
3        A     Novice  11.987411
4        A     Novice  13.842799


                             df      sum_sq     mean_sq          F  \
C(Software)                 2.0  204.881181  102.440590  18.135666   
C(Experience)               1.0  165.079097  165.079097  29.224933   
C(Software):C(Experience)   2.0   17.481552    8.740776   1.547431   
Residual                   56.0  316.319953    5.648571        NaN   

                                 PR(>F)  
C(Software)                8.460472e-07  
C(Experience)              1.375177e-06  
C(Software):C(Experience)  2.217544e-01  
Residual                            NaN  


Conclusion: There is a significant main effect of software.
Conclusion: There is a significant main effect of experience.
Conclusion: There is no significant interaction effect between software and experience.


## Here are the interpretations of the three conclusions:
1. "There is a significant main effect of software": Completion time differs significantly across the software programs. In this simulated data, however, software is confounded with experience: all Software A users are novices and all Software C users are experienced. Because the Type I (sequential) table tests software first, this effect largely reflects the novice/experienced gap built into the simulation and should not be read as evidence that the choice of software matters. A real study would need every program to be used by both experience levels.
2. "There is a significant main effect of experience": The experience level of the employees has a significant effect on completion time after accounting for software. The simulation draws novice times around 15 and experienced times around 10, so experienced employees complete the task faster. This finding can help the company assign employees to tasks and plan training for new employees.
3. "There is NO significant interaction effect between software and experience": There is no evidence that the effect of software on completion time depends on the experience level of the employees, or vice versa. Because two of the six software-by-experience combinations contain no observations here, the interaction cannot be fully assessed from this data, and a non-significant result is not proof that no interaction exists.

## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [17]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
# Setting numpy random seed
np.random.seed(45)
# Generating normal test scores with the same variance for the control and experimental groups
test_score_control = np.random.normal(loc=70, scale=3, size=50)
test_score_experimental = np.random.normal(loc=85, scale=3, size=50)
# Creating the dataframe
df = pd.DataFrame({'test_score':list(test_score_control)+list(test_score_experimental),
                   'group':['control']*50 + ['experimental']*50})
# printing the sample dataframe
print('Simulated data for test_scores:')
print(df.head())
print('\n===============================\n')
null_hypothesis = "There is NO difference in test scores between the control and experimental groups."
alt_hypothesis = "There is SIGNIFICANT difference in test scores between the control and experimental groups."
# Conduct the two-sample t-test
control_scores = df[df['group'] == 'control']['test_score']
experimental_scores = df[df['group'] == 'experimental']['test_score']
t_stat, p_val = ttest_ind(control_scores, experimental_scores, equal_var=True)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_val}")
print('\n')
# Significance value 
alpha = 0.05
if p_val<alpha:
    print('Reject the Null Hypothesis')
    print(f'Conclusion : {alt_hypothesis}')
else:
    print('Failed to reject the Null Hypothesis')
    print(f'Conclusion : {null_hypothesis}')

Simulated data for test_scores:
   test_score    group
0   70.079124  control
1   70.780965  control
2   68.814563  control
3   69.387097  control
4   66.185102  control


t-statistic: -28.5074, p-value: 3.096206271894725e-49


Reject the Null Hypothesis
Conclusion : There is SIGNIFICANT difference in test scores between the control and experimental groups.


## Tukey HSD test:

In [18]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# Conduct post-hoc Tukey's test
tukey_results = pairwise_tukeyhsd(df['test_score'], df['group'], 0.05)
print(tukey_results)

   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj  lower   upper  reject
----------------------------------------------------------
control experimental  15.8829   0.0 14.7773 16.9886   True
----------------------------------------------------------


Tukey's Results Interpretation:

1. reject = True indicates a significant difference between the control and experimental groups; the adjusted p-value is effectively 0.
2. The mean test score of the experimental group is approximately 15.88 points higher than that of the control group.
3. The 95% confidence interval for this mean difference is (14.78, 16.99).
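
Because there are only two groups, the t-test itself already shows which groups differ, and a one-way ANOVA on the same data gives an equivalent result: the F-statistic equals the square of the pooled-variance t-statistic, and the p-values match. The following is a minimal sketch on simulated, hypothetical scores (the group means are assumptions and differ from the data above).

```python
import numpy as np
from scipy.stats import f_oneway, ttest_ind

# Hypothetical scores for two groups (simulated)
rng = np.random.default_rng(45)
control = rng.normal(loc=70, scale=3, size=50)
experimental = rng.normal(loc=72, scale=3, size=50)

# Pooled-variance two-sample t-test and one-way ANOVA on the same two groups
t_stat, t_p = ttest_ind(control, experimental, equal_var=True)
f_stat, f_p = f_oneway(control, experimental)

# With two groups, the ANOVA F-statistic equals the squared t-statistic
print(f"t^2 = {t_stat**2:.4f}, F = {f_stat:.4f}")
print(f"t-test p = {t_p:.3e}, ANOVA p = {f_p:.3e}")
```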

## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [19]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd
# set random seed for reproducibility
np.random.seed(456)
# generate sales data for Store A, B, and C
sales_a = np.random.normal(loc=1000, scale=100, size=(30,))
sales_b = np.random.normal(loc=1050, scale=150, size=(30,))
sales_c = np.random.normal(loc=800, scale=80, size=(30,))
# create a DataFrame to store the sales data
sales_df = pd.DataFrame({'Store A': sales_a, 'Store B': sales_b, 'Store C': sales_c})
# reshape the DataFrame for repeated measures ANOVA
sales_melted = pd.melt(sales_df.reset_index(), id_vars=['index'], value_vars=['Store A', 'Store B', 'Store C'])
sales_melted.columns = ['Day', 'Store', 'Sales']
# Printing top 5 rows of generated data
print('Generated data top 5 rows : ')
print(sales_melted.head())
print('\n================================================\n')
# perform repeated measures ANOVA
rm_anova = AnovaRM(sales_melted, 'Sales', 'Day', within=['Store'])
rm_results = rm_anova.fit()
print(rm_results)
# check if null hypothesis should be rejected based on p-value
if rm_results.anova_table.loc['Store', 'Pr > F'] < 0.05:
    # perform post-hoc Tukey test
    print('Reject the Null Hypothesis : Atleast one of the group has different mean.\n')
    print('Tukey HSD posthoc test:')
    tukey_results = pairwise_tukeyhsd(sales_melted['Sales'], sales_melted['Store'])
    print(tukey_results)
else:
    print('NO significant difference between groups.')

Generated data top 5 rows : 
   Day    Store        Sales
0    0  Store A   933.187150
1    1  Store A   950.179048
2    2  Store A  1061.857582
3    3  Store A  1056.869225
4    4  Store A  1135.050948


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store 51.5040 2.0000 58.0000 0.0000

Reject the Null Hypothesis : Atleast one of the group has different mean.

Tukey HSD posthoc test:
    Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1  group2  meandiff p-adj    lower     upper   reject
-----------------------------------------------------------
Store A Store B   21.2439 0.6945   -40.881   83.3688  False
Store A Store C -207.8078    0.0 -269.9328 -145.6829   True
Store B Store C -229.0517    0.0 -291.1766 -166.9268   True
-----------------------------------------------------------


Interpretation of above:
1. The repeated measures ANOVA gives a p-value (Pr > F) of 0.0000, which is less than 0.05, so we reject the null hypothesis: at least one store's mean sales differ from the others.
2. Tukey's post-hoc test shows no significant difference between the sales of Store A and Store B (reject=False). Store B's mean is about 21.24 dollars higher than Store A's, but this difference is not statistically significant.
3. There is a significant difference between the sales of Store A and Store C (reject=True). Store C's mean is approximately 207.8 dollars lower than Store A's.
4. There is a significant difference between the sales of Store B and Store C (reject=True). Store C's mean is approximately 229.0 dollars lower than Store B's.
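
Tukey's HSD treats the three stores as independent groups and does not use the fact that the same days were measured for every store. A common alternative for a repeated measures design is a set of paired t-tests with a Bonferroni correction. The following is a minimal sketch on simulated, hypothetical sales (generated separately from the data above), using `scipy.stats.ttest_rel`.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_rel

# Hypothetical daily sales: position i in each array is the same day for all stores
rng = np.random.default_rng(456)
sales = {
    "Store A": rng.normal(1000, 100, 30),
    "Store B": rng.normal(1050, 150, 30),
    "Store C": rng.normal(800, 80, 30),
}

pairs = list(combinations(sales, 2))
adjusted_alpha = 0.05 / len(pairs)  # Bonferroni correction for three comparisons

paired_results = {}
for s1, s2 in pairs:
    # Paired t-test: compares the two stores day by day
    t_stat, p_val = ttest_rel(sales[s1], sales[s2])
    paired_results[(s1, s2)] = p_val
    print(f"{s1} vs {s2}: t = {t_stat:.2f}, p = {p_val:.4g}, reject: {p_val < adjusted_alpha}")
```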