#### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

#### Q2. What are the three types of ANOVA, and in what situations would each be used?

#### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

#### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

#### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

#### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

#### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

#### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

#### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

#### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

#### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

#### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

## Answers

### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Assumptions of ANOVA:

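#### Independence:

The observations within each group and between groups are assumed to be independent of each other: the value of one observation does not influence the value of another. Example of a violation: measuring the same participants several times, or sampling students from the same classroom, and then treating the measurements as independent. This can understate the error variance and inflate the Type I error rate.

#### Normality:

The data within each group (equivalently, the residuals) should be approximately normally distributed, i.e. roughly bell-shaped. Example of a violation: strongly skewed outcomes such as income or reaction times, or a few extreme outliers. ANOVA is fairly robust to mild non-normality when group sizes are large and similar, but with small samples the p-values can be misleading.

#### Homogeneity of variance (Homoscedasticity):

The variability (variance) of the data should be roughly the same across all groups. Example of a violation: one group's variance is several times larger than another's. When the group sizes are also unequal, the F-test can become too liberal or too conservative.

#### Random sampling:

The data should be obtained through a random sampling process, so the sample is representative of the population. Example of a violation: convenience samples or self-selected participants, in which case the results may not generalize to the population of interest.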
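
As a rough sketch, the normality and equal-variance assumptions can be screened with `scipy.stats.shapiro` and `scipy.stats.levene`. The three groups below are made-up illustrative values, not data from this assignment, and a non-significant test does not prove that an assumption holds.

```python
from scipy.stats import shapiro, levene

# Hypothetical measurements for three groups (illustrative values only)
groups = {
    "A": [23.1, 25.4, 24.8, 22.9, 26.0, 24.2, 25.1, 23.7],
    "B": [27.3, 26.8, 28.9, 27.5, 29.1, 26.4, 28.0, 27.7],
    "C": [21.0, 22.4, 20.8, 23.1, 21.9, 22.7, 20.5, 21.6],
}

# Shapiro-Wilk per group: a small p-value suggests a departure from normality
normality_p = {}
for name, values in groups.items():
    _, p = shapiro(values)
    normality_p[name] = p
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")

# Levene's test across groups: a small p-value suggests unequal variances
_, levene_p = levene(*groups.values())
print(f"Levene p-value: {levene_p:.3f}")
```

Independence and random sampling cannot be tested this way; they have to be justified by how the data were collected.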

### Q2. What are the three types of ANOVA, and in what situations would each be used?

#### One-Way ANOVA:
- One-Way ANOVA is used when you have one categorical independent variable (also known as a factor) and one continuous dependent variable. 

- The independent variable is divided into two or more levels (groups), most often three or more, and you want to determine if there are significant differences in the means of the dependent variable across these groups. With exactly two groups the test is equivalent to an independent two-sample t-test. It is useful when you have a single factor and want to test for differences among multiple groups.

#### Example:

You might use One-Way ANOVA to compare the average test scores of students from three different schools.

#### Two-Way ANOVA:

- Two-Way ANOVA is an extension of One-Way ANOVA and is used when you have two categorical independent variables (factors) and one continuous dependent variable. 

- This design allows you to simultaneously examine the effects of two independent variables on the dependent variable and their interaction. It is suitable when you want to investigate how two factors influence the outcome and whether their interaction has a significant impact. 

#### Example: 
For instance, you might use Two-Way ANOVA to study how the type of fertilizer (factor 1) and the amount of water (factor 2) affect the growth of plants.

#### Three-Way (or higher) ANOVA:
- Three-Way ANOVA is an extension of Two-Way ANOVA and can be used when you have three or more categorical independent variables and one continuous dependent variable. 

- It allows you to investigate the simultaneous effects of multiple factors and their interactions on the outcome variable. While Three-Way ANOVA is less common than One-Way or Two-Way ANOVA, it becomes necessary when you have several factors that might influence the response variable. 

#### Example:
An example could be studying the effects of different treatments (factor 1), different temperatures (factor 2), and different humidity levels (factor 3) on the growth rate of plants.

### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the process of breaking down the total variation observed in the data into different components that can be attributed to different sources of variation. In ANOVA, the total variability in the dependent variable is divided into two main components: variation between groups and variation within groups.

#### Example: 
Let's consider One-Way ANOVA with three groups (A, B, and C) and 'Y' as the dependent variable:

#### 1. Total Variation (Total Sum of Squares, SS_total): 
--------------------------------------------------------
This represents the overall variability in the 'Y' values across all groups and is calculated as the sum of squared deviations of each data point from the overall mean of 'Y.'

#### 2. Between-Group Variation (Between-Group Sum of Squares, SS_between):
------------------------------------------------------------------------
This component measures the variability between the group means. It is calculated as the sum of squared deviations of each group mean from the overall mean of 'Y,' weighted by the number of observations in each group.

#### 3. Within-Group Variation (Within-Group Sum of Squares, SS_within or SS_error):
----------------------------------------------------------------------------------
This component measures the variability within each group. It is calculated as the sum of squared deviations of individual data points from their respective group means.
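
The three components can be checked numerically. This is a minimal sketch in plain Python on made-up values for groups A, B, and C (the numbers and variable names are illustrative assumptions); it shows that the between-group and within-group sums of squares add up to the total.

```python
# Hypothetical 'Y' values for three groups (illustrative values only)
groups = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 8.0, 9.0],
    "C": [1.0, 2.0, 3.0],
}

all_values = [y for values in groups.values() for y in values]
grand_mean = sum(all_values) / len(all_values)

# Total variation: squared deviations of every point from the overall mean
ss_total = sum((y - grand_mean) ** 2 for y in all_values)

ss_between = 0.0
ss_within = 0.0
for values in groups.values():
    group_mean = sum(values) / len(values)
    # Between-group: squared deviation of the group mean, weighted by group size
    ss_between += len(values) * (group_mean - grand_mean) ** 2
    # Within-group: squared deviations of points from their own group mean
    ss_within += sum((y - group_mean) ** 2 for y in values)

print("SS_total:  ", ss_total)    # 60.0
print("SS_between:", ss_between)  # 54.0
print("SS_within: ", ss_within)   # 6.0
```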

#### Importance:

- Identifying Group Differences: By partitioning the total variance into between-group and within-group components, ANOVA helps determine whether the observed differences between groups are statistically significant. Each component is divided by its degrees of freedom to give a mean square, and the F-statistic is the ratio of the between-group mean square to the within-group mean square. A large ratio suggests that the group means are not all equal.

- Assessing the Model Fit: ANOVA allows us to assess how well the model explains the variability in the data. If most of the total variation is accounted for by the between-group variation, it indicates that the model is a good fit for the data.

- Understanding the Contribution of Factors: In designs involving multiple factors (e.g., Two-Way or Three-Way ANOVA), partitioning of variance helps to understand the individual and combined effects of each factor on the outcome variable.

- Estimating Error Variance: The within-group variation, also known as error variance, is used to estimate the variability in the data that cannot be attributed to the effects of the independent variable(s). This information is crucial for hypothesis testing and calculating p-values.

### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [6]:
import pandas as pd
import numpy as np

In [7]:
data=pd.DataFrame({"15mg":[9,8,7,8,8,9,8],
                   "30mg":[7,6,6,7,8,7,6],
                   "45mg":[4,3,2,3,4,3,2]
})

In [8]:
data.head()

   15mg  30mg  45mg
0     9     7     4
1     8     6     3
2     7     6     2
3     8     7     3
4     8     8     4


In [9]:
N=21  # total number of observations
a=3   # number of groups (dosages)
n=7   # observations per group

In [10]:
df_between=a-1
df_within=N-a
df_total=N-1

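#### Calculating SSE (explained, between-group sum of squares)

In [14]:
S_15=data['15mg'].sum()
S_15

57

In [15]:
S_30=data['30mg'].sum()
S_30

47

In [17]:
S_45=data['45mg'].sum()
S_45

21

In [20]:
# Sum of squared group totals, divided by the group size
S=(S_15**2+S_30**2+S_45**2)/n
S

842.7142857142857

In [21]:
# Correction term: squared grand total divided by the total number of observations
T=((S_15+S_30+S_45)**2)/N
T

744.047619047619

In [23]:
SSE=S-T
SSE

98.66666666666663

#### Calculating SSR (residual, within-group sum of squares)

In [33]:
# Sum of all squared observations
Y=sum(data['15mg']**2) + sum(data['30mg']**2) +sum(data['45mg']**2)

In [36]:
SSR=Y-S
SSR

10.285714285714334

In [38]:
# Total sum of squares is the explained part plus the residual part
SST=SSR+SSE
SST

108.95238095238096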
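
A natural next step is to turn the sums of squares into mean squares and an F-statistic. The sketch below recomputes the between-group and within-group sums of squares from the same dosage data so that it runs on its own; the variable names `ss_between`, `ss_within`, and so on are my own and not part of the original cells.

```python
# Same dosage data as above, as plain lists so the block is self-contained
groups = {
    "15mg": [9, 8, 7, 8, 8, 9, 8],
    "30mg": [7, 6, 6, 7, 8, 7, 6],
    "45mg": [4, 3, 2, 3, 4, 3, 2],
}

all_values = [y for values in groups.values() for y in values]
N = len(all_values)  # total observations (21)
a = len(groups)      # number of groups (3)
grand_mean = sum(all_values) / N

# Between-group (explained) sum of squares
ss_between = sum(
    len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values()
)
# Within-group (residual) sum of squares
ss_within = sum(
    (y - sum(v) / len(v)) ** 2 for v in groups.values() for y in v
)

# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = ss_between / (a - 1)
ms_within = ss_within / (N - a)
f_statistic = ms_between / ms_within

print("MS between:", ms_between)
print("MS within: ", ms_within)
print("F-statistic:", f_statistic)
```

For this data the F-statistic comes out at roughly 86.3 on (2, 18) degrees of freedom, which reflects how far apart the three dosage means are relative to the spread inside each group.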

### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [39]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

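# Sample data (replace this with your actual data).
# Note: factor1 and factor2 are numeric here, so the formula below treats them
# as continuous predictors; wrap them in C(...) to treat them as categorical factors.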
data = pd.DataFrame({
    'factor1': [10, 12, 8, 15, 9, 11, 7, 14],
    'factor2': [5, 8, 6, 9, 7, 10, 4, 12],
    'dependent_variable': [20, 22, 18, 25, 19, 21, 17, 24]
})

# Fit the two-way ANOVA model
model = ols('dependent_variable ~ factor1 * factor2', data=data).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

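# Mean squares (sum of squares / degrees of freedom) for each term.
# These are not effect estimates; the F and PR(>F) columns of anova_table
# are what test the main effects and the interaction.
mean_sq_factor1 = anova_table.loc['factor1', 'sum_sq'] / anova_table.loc['factor1', 'df']
mean_sq_factor2 = anova_table.loc['factor2', 'sum_sq'] / anova_table.loc['factor2', 'df']
mean_sq_interaction = anova_table.loc['factor1:factor2', 'sum_sq'] / anova_table.loc['factor1:factor2', 'df']

print("Mean square for factor1:", mean_sq_factor1)
print("Mean square for factor2:", mean_sq_factor2)
print("Mean square for interaction:", mean_sq_interaction)


Mean square for factor1: 19.709273182957254
Mean square for factor2: 1.0686919676035439e-28
Mean square for interaction: 1.0915376258270672e-29

In this sample data `dependent_variable` equals `factor1 + 10` exactly, so the fit is perfect. The values of order 1e-28 for factor2 and the interaction are floating-point noise, not real effects.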
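
To make "main effect" and "interaction effect" concrete, here is a minimal NumPy sketch for a balanced design with two categorical factors. It reuses the fertilizer and water example from Q2 with made-up growth values; the array layout and all names are illustrative assumptions, and the formulas only hold as written when every cell has the same number of replicates.

```python
import numpy as np

# Hypothetical plant growth (cm): axis 0 = fertilizer (X, Y),
# axis 1 = water (low, high), axis 2 = replicate plants
growth = np.array([
    [[10.0, 11.0, 12.0], [14.0, 15.0, 16.0]],  # fertilizer X
    [[12.0, 13.0, 14.0], [22.0, 23.0, 24.0]],  # fertilizer Y
])

cell_means = growth.mean(axis=2)
grand_mean = growth.mean()

# Main effects: deviation of each marginal mean from the grand mean
fertilizer_effect = cell_means.mean(axis=1) - grand_mean
water_effect = cell_means.mean(axis=0) - grand_mean

# Interaction: what remains of each cell mean after removing both main effects
interaction = (cell_means - grand_mean
               - fertilizer_effect[:, None] - water_effect[None, :])

n_fert, n_water, n_rep = growth.shape
ss_fertilizer = n_rep * n_water * np.sum(fertilizer_effect ** 2)
ss_water = n_rep * n_fert * np.sum(water_effect ** 2)
ss_interaction = n_rep * np.sum(interaction ** 2)
ss_error = np.sum((growth - cell_means[:, :, None]) ** 2)
ss_total = np.sum((growth - grand_mean) ** 2)

print("Fertilizer main effects:", fertilizer_effect)
print("Water main effects:", water_effect)
print("Interaction effects:\n", interaction)
print("SS (fertilizer, water, interaction, error):",
      ss_fertilizer, ss_water, ss_interaction, ss_error)
```

A non-zero interaction here means the benefit of extra water is larger under fertilizer Y than under fertilizer X, which is exactly what the interaction term in a two-way ANOVA tests.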


### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

#### F-statistic (5.23):
The F-statistic represents the ratio of variance between the groups to variance within the groups. A larger F-statistic suggests that the variation between the group means is relatively large compared to the variation within the individual groups.

#### p-value (0.02):
The p-value is the probability of observing the obtained F-statistic (or a more extreme one) under the assumption that there are no true differences between the group means (i.e., the null hypothesis is true). A low p-value (typically below the chosen significance level, e.g., 0.05) indicates that the result is statistically significant, and we reject the null hypothesis.

#### Conclusion:
With an F-statistic of 5.23 and a p-value of 0.02, we reject the null hypothesis that all group means are equal at the 0.05 significance level (though not at the stricter 0.01 level). The conclusion is that at least one group mean differs from at least one other. The F-test alone does not say which groups differ or how large the differences are; a post-hoc test is needed for that.
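
The link between an F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not give the degrees of freedom, so the design below (3 groups, 17 observations in total) is purely an assumption chosen for illustration.

```python
from scipy.stats import f

f_statistic = 5.23

# Assumed design (not given in the question): 3 groups, 17 observations in total
df_between = 3 - 1
df_within = 17 - 3

# Survival function: probability of an F at least this large under the null hypothesis
p_value = f.sf(f_statistic, df_between, df_within)

# Critical value that F must exceed to be significant at the 5% level
critical_value = f.ppf(0.95, df_between, df_within)

print("p-value:", p_value)
print("Critical F at alpha = 0.05:", critical_value)
print("Reject the null hypothesis:", f_statistic > critical_value)
```

Under these assumed degrees of freedom the p-value is close to 0.02, and the observed F exceeds the 5% critical value, which is the same decision as comparing the p-value with 0.05.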
    

### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?


#### Complete Case Analysis (Listwise Deletion):
This method removes every participant who has a missing value at any time point. It is straightforward to implement, but it discards the observed data of those participants and reduces statistical power. It gives unbiased estimates only if the data are missing completely at random (MCAR); if the missingness is related to the outcome or to other variables in the analysis, listwise deletion introduces bias.

#### Mean Imputation:
In this approach, the missing values for a specific variable are replaced with the mean value of that variable across all participants. While it preserves the sample size, it can artificially reduce variability, leading to biased estimates and underestimated standard errors. It also does not account for the uncertainty introduced by the imputation process.

#### Last Observation Carried Forward (LOCF):
LOCF involves carrying forward the last observed value for a participant with missing data throughout the study. This method assumes that the missing values remain constant over time, which might not be valid in all cases. LOCF can lead to biased results, especially if the pattern of missingness is non-random.

#### Multiple Imputation:
Multiple imputation is a more sophisticated approach that generates multiple plausible imputed datasets based on the observed data and the missing data mechanism. The analysis is then conducted separately on each imputed dataset, and the results are combined to obtain valid estimates and standard errors. Multiple imputation is generally preferred when dealing with missing data, as it accounts for the uncertainty associated with imputed values and provides less biased estimates compared to simpler imputation methods.

#### Maximum Likelihood Estimation (MLE):
Repeated measures data can be analyzed with likelihood-based methods, typically a linear mixed-effects model fitted by maximum likelihood or restricted maximum likelihood, which most statistical packages provide. These methods use all available observations from every participant rather than dropping incomplete cases. They give valid estimates when the data are missing at random (MAR) or missing completely at random (MCAR) and the model is correctly specified; if the data are missing not at random, the estimates can still be biased.
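
The three simplest strategies can be compared on a toy dataset. This is a minimal pandas sketch on made-up wide-format data (one row per participant, one column per time point); the values and names are illustrative assumptions. Multiple imputation and likelihood-based models need dedicated tooling and are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: rows = participants, columns = time points
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 14.0, 13.0],
        "t2": [11.0, np.nan, 12.0, 15.0, 14.0],
        "t3": [13.0, 14.0, np.nan, 17.0, np.nan],
    },
    index=["p1", "p2", "p3", "p4", "p5"],
)

# Listwise deletion: drop any participant with a missing time point
complete_cases = scores.dropna()

# Mean imputation: fill each time point with that column's observed mean
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each participant's last observed value forward across time
locf = scores.ffill(axis=1)

print("Participants kept by listwise deletion:", len(complete_cases))
print("t3 variance, observed values only:", scores["t3"].var())
print("t3 variance, after mean imputation:", mean_imputed["t3"].var())
print(locf)
```

Listwise deletion keeps only two of the five participants here, and mean imputation visibly shrinks the variance at `t3`, which illustrates the loss of power and the understated variability described above.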


### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

#### Post-hoc tests:
After obtaining a significant result in an ANOVA, post-hoc tests are conducted to determine which specific groups differ significantly from each other. Post-hoc tests help identify pairwise differences between group means when there are three or more groups.

#### Some common post-hoc tests:

#### 1. Tukey's Honestly Significant Difference (HSD) test:
---------------------------------------------------------------------------
Tukey's HSD test compares all possible pairs of group means while controlling the familywise Type I error rate, and it reports adjusted p-values. It is most appropriate when you are interested in all pairwise differences. It was designed for balanced designs (equal sample sizes in all groups); the Tukey-Kramer variant extends it to unequal sample sizes.

#### 2. Bonferroni correction:
--------------------------------------------
The Bonferroni correction is a straightforward method that adjusts the significance level (alpha) for multiple comparisons. It divides the desired alpha level by the number of comparisons to control the familywise error rate. While it is conservative, it can be used when the number of pairwise comparisons is relatively small.

#### 3. Scheffé's method:
-------------------------------------
Scheffé's method controls the familywise error rate for all possible contrasts among the group means, not only pairwise comparisons, and it can be used with unequal sample sizes. The price is that it is the most conservative of these tests for simple pairwise comparisons. Like ANOVA itself, it assumes homogeneous variances; when variances are unequal, a test such as Games-Howell is usually preferred.

#### 4. Dunnett's test:
----------------------------------
Dunnett's test is used when comparing multiple treatment groups to a control group. It controls the familywise error rate while allowing for comparisons against a single reference group.



#### Example:
Suppose you conducted an experiment to compare the effectiveness of three different teaching methods (A, B, and C) in improving students' test scores. After performing a one-way ANOVA, you obtained a significant result, indicating that at least one teaching method had a significant effect on test scores.

To identify which specific teaching methods are significantly different from each other, you would conduct post-hoc tests. For example, you might perform Tukey's HSD test to obtain adjusted p-values for all possible pairwise comparisons (A vs. B, A vs. C, and B vs. C). This would allow you to determine which teaching methods significantly differ in terms of their impact on test scores.
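
As one possible follow-up, the sketch below runs Bonferroni-adjusted pairwise t-tests for the three teaching methods using `scipy.stats.ttest_ind`. The scores are made up for illustration, and Bonferroni is used here instead of Tukey's HSD only because it is easy to write out by hand.

```python
from itertools import combinations
from scipy.stats import ttest_ind

# Hypothetical test scores for three teaching methods (illustrative values only)
scores = {
    "A": [78, 82, 75, 80, 79, 84, 77, 81],
    "B": [85, 88, 90, 86, 84, 91, 87, 89],
    "C": [79, 83, 76, 82, 80, 78, 81, 77],
}

pairs = list(combinations(scores, 2))  # (A, B), (A, C), (B, C)
alpha = 0.05

results = {}
for g1, g2 in pairs:
    _, p_raw = ttest_ind(scores[g1], scores[g2])
    # Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1
    p_adj = min(p_raw * len(pairs), 1.0)
    results[(g1, g2)] = p_adj
    print(f"{g1} vs {g2}: adjusted p = {p_adj:.4f}, reject H0: {p_adj < alpha}")
```

Multiplying the p-values by the number of comparisons is equivalent to dividing alpha by that number, as described for the Bonferroni correction above.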

### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [1]:
import numpy as np
from scipy.stats import f_oneway

# Sample weight loss data for each diet (replace with your actual data)
diet_A = [3.5, 4.2, 5.1, 4.8, 3.9, 5.2, 4.7, 4.0, 3.8, 5.0]
diet_B = [2.8, 3.0, 3.7, 3.3, 2.9, 3.5, 2.6, 3.2, 3.4, 3.1]
diet_C = [4.0, 3.8, 4.3, 3.9, 4.1, 3.7, 4.2, 4.5, 3.6, 4.1]

# Perform one-way ANOVA
F_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

print("F-statistic:", F_statistic)
print("p-value:", p_value)


F-statistic: 22.422887532007096
p-value: 1.8280756068237668e-06


#### Conclusion:
With an F-statistic of about 22.42 and a p-value of about 1.8e-06, we can conclude that there are significant differences between the mean weight loss of the three diets. The p-value is far below the common significance level of 0.05, so the result is statistically significant. We therefore reject the null hypothesis that all three diets have the same mean weight loss, in favor of the alternative that at least one diet's mean weight loss differs from the others. Note that the sample data above contains 30 values (10 per diet) rather than the 50 participants described in the question.

### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.


In [9]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data (replace with your actual data)
data = pd.DataFrame({
    'Software': ['A', 'B', 'C'] * 10,
    'Experience': ['Novice'] * 15 + ['Experienced'] * 15,
    'Time': [12.3, 13.5, 14.2, 10.9, 11.5, 10.7, 15.2, 16.1, 14.9, 12.8,
             11.7, 13.0, 13.9, 12.5, 14.0, 9.8, 10.5, 9.9, 14.7, 15.3,
             12.4, 13.2, 11.8, 10.3, 11.9, 10.1, 11.3, 12.7, 13.8, 15.0]
})

# Fit the two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

print("Two-way ANOVA results:")
print(anova_table)


Two-way ANOVA results:
                              sum_sq    df         F    PR(>F)
C(Software)                 0.148667   2.0  0.021084  0.979154
C(Experience)               7.008333   1.0  1.987898  0.171389
C(Software):C(Experience)   1.460667   2.0  0.207157  0.814330
Residual                   84.612000  24.0       NaN       NaN

#### Conclusion:
None of the effects is statistically significant at the 0.05 level for this sample data. The main effect of software (F of about 0.02, p of about 0.98) and the main effect of experience (F of about 1.99, p of about 0.17) are both non-significant, and so is the software-by-experience interaction (F of about 0.21, p of about 0.81). We therefore fail to reject the null hypotheses: there is no evidence that completion time depends on the program, on experience level, or on their combination.


### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [5]:
import numpy as np
from scipy.stats import ttest_ind

# Sample test scores for control and experimental groups (replace with your actual data)
control_group = [75, 82, 68, 90, 79, 85, 77, 73, 80, 71]
experimental_group = [85, 89, 76, 92, 81, 87, 78, 83, 88, 90]

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group, experimental_group)

print("Two-sample t-test:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

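# Perform post-hoc tests (if results are significant)
if p_value < 0.05:
    # With only two groups, a significant t-test already tells us which groups
    # differ, so a post-hoc test is not needed. Tukey's HSD is run here only
    # because the question asks for a follow-up; with two groups its adjusted
    # p-value should match the t-test p-value.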
    from statsmodels.stats.multicomp import MultiComparison

    data = np.array(control_group + experimental_group)
    groups = ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group)

    posthoc = MultiComparison(data, groups)
    result = posthoc.tukeyhsd()

    print("\nPost-hoc Tukey's HSD test:")
    print(result)


Two-sample t-test:
t-statistic: -2.565743336590385
p-value: 0.019448365139160973

Post-hoc Tukey's HSD test:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1    group2    meandiff p-adj  lower upper reject
-------------------------------------------------------
Control Experimental      6.9 0.0194  1.25 12.55   True
-------------------------------------------------------
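
The default `ttest_ind` call pools the two group variances. If equal variances are in doubt, Welch's t-test is a common alternative; the sketch below reruns the comparison on the same sample scores with `equal_var=False`.

```python
from scipy.stats import ttest_ind

# Same sample scores as in the cell above
control_group = [75, 82, 68, 90, 79, 85, 77, 73, 80, 71]
experimental_group = [85, 89, 76, 92, 81, 87, 78, 83, 88, 90]

# equal_var=False requests Welch's t-test, which does not pool the variances
t_welch, p_welch = ttest_ind(control_group, experimental_group, equal_var=False)

print("Welch t-statistic:", t_welch)
print("Welch p-value:", p_welch)
```

Because both groups have 10 observations, the Welch t-statistic equals the pooled one; only the degrees of freedom, and therefore the p-value, change slightly.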


### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [6]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import MultiComparison

# Sample data (replace with your actual data)
data = pd.DataFrame({
    'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    'Sales': np.random.randint(100, 300, size=90)  # Replace with your actual sales data
})

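# Fit a one-way ANOVA model. Note: this treats the 90 sales figures as
# independent and ignores that the same 30 days were measured for every store,
# so it is not a true repeated measures ANOVA. A repeated measures analysis
# also needs a day identifier (for example with statsmodels' AnovaRM).
model = ols('Sales ~ Store', data=data).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

print("One-way ANOVA results (day pairing ignored):")
print(anova_table)

# Perform post-hoc test (e.g., Tukey's HSD) if the ANOVA results are significant
if anova_table.loc['Store', 'PR(>F)'] < 0.05:
    posthoc = MultiComparison(data['Sales'], data['Store'])
    result = posthoc.tukeyhsd()

    print("\nPost-hoc Tukey's HSD test:")
    print(result)


One-way ANOVA results (day pairing ignored):
                 sum_sq    df         F    PR(>F)
Store      15384.066667   2.0  2.535932  0.085029
Residual  263889.933333  87.0       NaN       NaN

#### Conclusion:
With an F-statistic of about 2.54 and a p-value of about 0.085, the differences in mean sales between the three stores are not significant at the 0.05 level, so the post-hoc test is skipped. Because the sales figures are random numbers generated without a seed, rerunning the cell gives different values.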
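
A repeated measures analysis treats each day as a "subject" measured under all three stores and removes day-to-day variation from the error term. The sketch below works this out by hand with NumPy on simulated sales; the seed, the simulated day effect, and all variable names are assumptions for illustration. statsmodels also provides an `AnovaRM` class for this design.

```python
import numpy as np
from scipy.stats import f

rng = np.random.default_rng(0)  # arbitrary fixed seed for reproducibility

n_days, n_stores = 30, 3
# Hypothetical sales: rows = days (the repeated "subjects"), columns = stores A, B, C
day_effect = rng.normal(0, 30, size=(n_days, 1))  # shared busy/slow-day effect
sales = 200 + day_effect + rng.normal(0, 10, size=(n_days, n_stores))

grand_mean = sales.mean()
ss_total = np.sum((sales - grand_mean) ** 2)
ss_stores = n_days * np.sum((sales.mean(axis=0) - grand_mean) ** 2)
ss_days = n_stores * np.sum((sales.mean(axis=1) - grand_mean) ** 2)
# Residual after removing both store and day effects
ss_error = ss_total - ss_stores - ss_days

df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)
f_stat = (ss_stores / df_stores) / (ss_error / df_error)
p_value = f.sf(f_stat, df_stores, df_error)

print("F-statistic:", f_stat)
print("p-value:", p_value)
print("Share of total variation due to days:", ss_days / ss_total)
```

The key design choice is that `ss_days` is subtracted before forming the error term, so a busy or slow day shared by all stores no longer counts as noise. If the result is significant, paired comparisons between stores with a multiple-comparison correction would be the matching follow-up.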
