# Question.1

## Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

### ANOVA (Analysis of Variance) is a statistical test used to compare the means of three or more groups. To ensure the validity of the results, several assumptions need to be met. These assumptions include:
1. Independence: The observations should be independent of each other, both within each group and across groups. No measurement should influence, or be correlated with, any other measurement, which is typically achieved by assigning each subject to exactly one group and measuring them once.
2. Normality: The distribution of the dependent variable (the variable being measured) should be approximately normally distributed within each group. This assumption is particularly important when the sample sizes are small. Violations of this assumption can lead to inaccurate p-values and confidence intervals.
3. Homogeneity of variances (homoscedasticity): The variances of the dependent variable should be equal across all groups. This assumption implies that the spread or variability of the data should be roughly the same for each group. Violations of this assumption can affect the accuracy of the F-test and lead to incorrect conclusions.
4. Random sampling: The observations should be randomly selected from the population of interest. Random sampling helps to ensure that the sample is representative of the population and allows for generalization of the results.

Examples of violations that could impact the validity of ANOVA results:
1. Non-independence: If observations within groups are not independent, such as in a repeated measures design where the same individuals are measured multiple times, the assumption of independence is violated. In such cases, specialized analyses like repeated measures ANOVA or mixed-effects models should be used instead.
2. Non-normality: If the distribution of the dependent variable within groups is significantly non-normal, ANOVA results may not be valid. Non-normality can be assessed using graphical methods (e.g., histograms, Q-Q plots) or statistical tests (e.g., Shapiro-Wilk test). Transformations or non-parametric alternatives may be considered in cases of severe non-normality.
3. Heterogeneity of variances: If the assumption of equal variances across groups is violated, it can affect the F-test results and lead to incorrect conclusions. This can be assessed using statistical tests such as Levene's test or by inspecting graphical representations of the data (e.g., boxplots). In such cases, alternative analyses like Welch's ANOVA or non-parametric tests may be more appropriate.
4. Non-random sampling: If the sample is not randomly selected, the generalizability of the results may be compromised. Biased sampling can introduce systematic errors and limit the extent to which the findings can be applied to the target population.
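
The normality and equal-variance checks mentioned above can be sketched with `scipy.stats`. This is a minimal illustration rather than a full diagnostic workflow: the three samples are hypothetical, generated from a seeded random generator, and `group_c` is deliberately given a larger spread so that Levene's test has something to detect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical samples (assumption: not from the original text);
# group_c is drawn with a larger standard deviation on purpose
group_a = rng.normal(50, 5, size=30)
group_b = rng.normal(52, 5, size=30)
group_c = rng.normal(55, 15, size=30)
groups = [group_a, group_b, group_c]

# Normality within each group: Shapiro-Wilk test, one p-value per group
shapiro_p = [stats.shapiro(g).pvalue for g in groups]

# Homogeneity of variances: Levene's test across all groups
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_p, 3))
print("Levene p-value:", round(levene_p, 4))
```

A small p-value from either test suggests the corresponding assumption may not hold, in which case Welch's ANOVA or a non-parametric test is worth considering.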

# Question.2

## What are the three types of ANOVA, and in what situations would each be used?

### The three types of ANOVA (Analysis of Variance) are:
1. One-Way ANOVA: One-Way ANOVA is used when comparing the means of three or more independent groups on a single dependent variable. It is the most basic form of ANOVA and is appropriate when there is only one factor or independent variable of interest. For example, a One-Way ANOVA can be used to determine if there are significant differences in test scores among students from three different schools.
2. Two-Way ANOVA: Two-Way ANOVA is used when comparing the means of two or more independent groups on a single dependent variable, considering the effects of two independent variables. It allows for the examination of main effects (individual effects of each independent variable) and interaction effects (combined effects of the independent variables). Two-Way ANOVA is suitable when there are two factors or independent variables of interest, and we want to explore their individual and combined effects on the dependent variable. For example, a Two-Way ANOVA can be used to analyze the effects of both gender and age group on the scores of a cognitive test.
3. Repeated Measures ANOVA: Repeated Measures ANOVA (also known as within-subjects ANOVA or ANOVA for correlated samples) is used when comparing the means of three or more related or matched groups on a single dependent variable. In this type of ANOVA, the same subjects or participants are measured under different conditions or at different time points. It is appropriate when the dependent variable is measured repeatedly on the same individuals, such as in longitudinal designs or when participants undergo multiple experimental conditions. For example, a Repeated Measures ANOVA can be used to analyze how the performance of the same individuals changes across three time points: before a training program, immediately after it, and at a follow-up.

# Question.3

## What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

### Partitioning of variance in ANOVA refers to the breakdown of the total variance in the data into different components, each representing a different source of variation. This breakdown allows for a comprehensive understanding of how much of the total variation can be attributed to different factors or sources in the analysis.
The partitioning of variance in ANOVA consists of three main components:
1. Between-Group Variance: This component of variance represents the differences in the means of the groups being compared. It measures the variability between the groups and indicates whether there are significant differences among the group means.
2. Within-Group Variance: Also known as the error or residual variance, this component represents the variation within each group. It captures the variability that cannot be explained by the factors under investigation. It includes random variability, measurement error, and any other sources of variation that are not accounted for by the independent variables.
3. Total Variance: The total variance represents the overall variability in the data, regardless of the groups or conditions being compared. Strictly speaking, it is the sums of squares that add up: the total sum of squares equals the between-group sum of squares plus the within-group sum of squares.

Understanding the partitioning of variance is important for several reasons:
1. Identification of Significant Effects: By examining the relative magnitudes of the between-group and within-group variances, ANOVA helps determine if there are significant differences among the groups being compared. If the between-group variance is significantly larger than the within-group variance, it suggests that the independent variable(s) have a significant effect on the dependent variable.
2. Assessment of Model Fit: The partitioning of variance helps evaluate how well the ANOVA model fits the data. If the model accounts for a large proportion of the total variance (i.e., high between-group variance and low within-group variance), it indicates a good fit and suggests that the model explains a substantial amount of the variability in the data.
3. Interpretation of Results: Partitioning of variance allows for the interpretation of the relative contributions of different factors to the overall variability. It helps researchers understand the importance of each factor in explaining the differences among the groups or conditions being studied.
4. Planning Follow-up Analyses: Understanding the partitioning of variance can guide researchers in planning post-hoc or follow-up analyses, such as pairwise comparisons or further exploration of interaction effects. By identifying the sources of variation, researchers can focus on specific group comparisons or delve deeper into the factors influencing the observed differences.
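
The partition described above can be written out for a one-way design. The notation is an assumption introduced here for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, group means $\bar{y}_i$, and grand mean $\bar{y}$.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2}_{SS_{\text{total}}}
= \underbrace{\sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2}_{SS_{\text{between}}}
+ \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}_{SS_{\text{within}}}
$$

$$
F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
$$

The F-statistic is therefore the ratio of the between-group mean square to the within-group mean square, which is why a large between-group component relative to the within-group component leads to a large F.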

# Question.4

##  How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

To calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python, you can compute them directly from their definitions with NumPy; `scipy.stats.f_oneway` supplies the F-statistic and p-value. Note that naming conventions vary: some textbooks use SSE for the error (residual) sum of squares and SSR for the regression (explained) sum of squares. The example below follows the naming used in the question:
```python
import numpy as np
from scipy.stats import f_oneway

# Define the data for each group
group1 = [1, 2, 3, 4, 5]
group2 = [2, 4, 6, 8, 10]
group3 = [3, 6, 9, 12, 15]
samples = [np.asarray(g, dtype=float) for g in (group1, group2, group3)]
data = np.concatenate(samples)

# One-way ANOVA F-test
f_statistic, p_value = f_oneway(group1, group2, group3)

# Total sum of squares: squared deviations of every value from the grand mean
grand_mean = data.mean()
sst = np.sum((data - grand_mean) ** 2)

# Explained (between-group) sum of squares: squared deviations of each
# group mean from the grand mean, weighted by group size
group_means = np.array([g.mean() for g in samples])
group_sizes = np.array([g.size for g in samples])
sse = np.sum(group_sizes * (group_means - grand_mean) ** 2)

# Residual (within-group) sum of squares
ssr = sst - sse

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
```
In this example, we have three groups (`group1`, `group2`, and `group3`) with their corresponding data. We concatenate the data into a single array to obtain the grand mean, and we compute each group's mean and size. We run the one-way ANOVA with the `f_oneway` function from `scipy.stats`. Finally, we calculate SST from the deviations around the grand mean, SSE from the size-weighted deviations of the group means, and SSR as their difference, and print the results.

# Question.5

## In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

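One approach is to fit a linear model containing both factors and their interaction with `statsmodels`, then read each term's sum of squares from the ANOVA table. In this example `A` and `B` are numeric columns, so the formula treats them as continuous predictors; for categorical factors, wrap them as `C(A)` and `C(B)`. Note also that in this toy data `B` is exactly twice `A`, so the two terms are perfectly collinear and their separate sums of squares should not be interpreted substantively.
```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Toy data: two predictors A and B and a response Y
data = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6],
    'B': [2, 4, 6, 8, 10, 12],
    'Y': [5, 10, 15, 20, 25, 30]
})

# Fit a model with both main effects and their interaction
model = ols('Y ~ A + B + A:B', data=data).fit()

# Type II ANOVA table: one row per term
anova_table = sm.stats.anova_lm(model, typ=2)

# Sum of squares attributed to each term
main_effect_A = anova_table['sum_sq']['A']
main_effect_B = anova_table['sum_sq']['B']
interaction_effect = anova_table['sum_sq']['A:B']

print("Main Effect A:", main_effect_A)
print("Main Effect B:", main_effect_B)
print("Interaction Effect:", interaction_effect)
```

```
Main Effect A: 437.5000000000019
Main Effect B: 437.50000000000153
Interaction Effect: 7.362701782062772e-30
```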


# Question.6

##  Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

### In the given scenario, where a one-way ANOVA was conducted and an F-statistic of 5.23 and a p-value of 0.02 were obtained, we can make the following conclusions and interpretations:
1. Conclusions:
   - There is evidence of a statistically significant difference between the means of the groups being compared.
   - The null hypothesis, which assumes that there are no differences among the group means, can be rejected.
   - The alternative hypothesis, which suggests that at least one group mean is different from the others, is supported.

2. Interpretations:
   - The F-statistic of 5.23 is the ratio of the between-group mean square to the within-group mean square: the variability among the group means is about 5.23 times the variability expected from within-group noise alone.
   - The p-value of 0.02 is the probability of obtaining an F-statistic at least as large as 5.23 if the null hypothesis were true. Since the p-value is less than the significance level (commonly set at 0.05), we reject the null hypothesis in favor of the alternative hypothesis.
   - This suggests that the differences observed in the sample means are unlikely to have occurred due to random chance alone. Instead, they are likely due to genuine differences among the groups being compared.
   - It is important to note that the ANOVA itself does not tell us which specific group means are different from each other. Post-hoc tests or pairwise comparisons are typically conducted to determine which group(s) differ significantly from the others.
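
The link between an F-statistic and its p-value can be sketched with the F distribution's survival function in `scipy.stats`. The degrees of freedom below are hypothetical, since the question does not state the number of groups or observations, so the resulting p-value will not match 0.02 exactly.

```python
from scipy import stats

f_statistic = 5.23

# Hypothetical degrees of freedom (assumption: 3 groups, 30 observations)
df_between = 2    # number of groups - 1
df_within = 27    # total observations - number of groups

# p-value: probability of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)
reject_null = p_value < 0.05

print(f"p-value: {p_value:.4f}, reject H0 at 0.05: {reject_null}")
```

Changing the degrees of freedom changes the p-value for the same F-statistic, which is why both should be reported together.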

# Question.7

##  In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

### Handling missing data in a repeated measures ANOVA is an important consideration to ensure valid and reliable results. Here are some common approaches to handle missing data in this context:
1. Complete Case Analysis (Listwise Deletion): This method involves excluding any participant with missing data on any of the variables involved in the repeated measures ANOVA. Only participants with complete data across all variables are included in the analysis. While this approach is straightforward, it can lead to reduced sample size and potentially biased results if the missingness is related to the outcome or the variables of interest.
2. Pairwise Deletion: This method allows for the inclusion of participants with missing data on some variables, but each analysis is conducted only on the available data for each specific comparison. This approach maximizes the use of available data but can lead to different sample sizes for different comparisons, potentially affecting statistical power and making it challenging to interpret and compare results across analyses.
3. Mean Imputation: In this method, missing values are replaced with the mean value of the available data for that variable. This approach preserves the sample size and may maintain the overall mean of the variable, but it can underestimate the variability in the data, bias the estimates, and artificially reduce the standard errors. Additionally, imputing with the mean assumes that the missing values are missing completely at random (MCAR), which may not be the case in practice.
4. Multiple Imputation: Multiple imputation involves creating multiple plausible imputations for the missing values based on observed data and using these imputed datasets to conduct the repeated measures ANOVA. This approach accounts for the uncertainty associated with missing data and allows for the estimation of appropriate standard errors. Standard multiple imputation assumes that the data are missing at random (MAR); if the data are missing not at random (MNAR), the imputation model must explicitly account for the missingness mechanism. Under MAR it generally yields less biased results than the methods above. However, it requires appropriate imputation models and may be computationally intensive.

The consequences of using different methods to handle missing data can vary:
- Complete case analysis and pairwise deletion may result in biased estimates and reduced statistical power, especially if the missing data are not missing completely at random (MCAR).
- Mean imputation can lead to underestimation of variability, biased estimates, and artificial reduction in standard errors, which can affect the accuracy of hypothesis tests and confidence intervals.
- Multiple imputation, when implemented appropriately, can produce more valid and reliable results by accounting for the uncertainty associated with missing data. However, it requires careful consideration of the missing data mechanism and appropriate imputation models.
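
The effect of the two simplest strategies can be illustrated with `pandas`. This is a minimal sketch on a hypothetical wide-format table (one row per subject, one column per time point); the column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data: rows are subjects, columns are time points
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 14.0, 13.0, 9.0],
    "t2": [11.0, np.nan, 13.0, 15.0, np.nan, 10.0],
    "t3": [12.0, 14.0, np.nan, 17.0, 15.0, 11.0],
})

# Complete case analysis: drop every subject with any missing value
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with its column mean
mean_imputed = scores.fillna(scores.mean())

n_total = len(scores)
n_complete = len(complete_cases)
var_observed = scores["t2"].var()       # variance of observed t2 values only
var_imputed = mean_imputed["t2"].var()  # variance after mean imputation

print("Subjects kept by listwise deletion:", n_complete, "of", n_total)
print("t2 variance (observed):", round(var_observed, 3))
print("t2 variance (mean-imputed):", round(var_imputed, 3))
```

Listwise deletion discards half of the subjects here, while mean imputation keeps all of them but shrinks the variance, which is the mechanism behind the artificially small standard errors noted above.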

# Question.8

## What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

After conducting an ANOVA and finding a significant overall effect, post-hoc tests are often used to examine pairwise group differences. Some common post-hoc tests include:
1. Tukey's Honestly Significant Difference (HSD): Tukey's HSD test is widely used and compares all possible pairwise group differences. It controls the family-wise error rate across the full set of pairwise comparisons, making it a common default when every pair of groups is of interest and the group variances are roughly equal.
2. Bonferroni Correction: The Bonferroni correction is a simple adjustment to the significance level for each individual comparison to maintain the overall family-wise error rate. It divides the desired significance level (e.g., 0.05) by the number of comparisons. Bonferroni correction is conservative and is appropriate when the number of comparisons is relatively small.
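3. Scheffé's Test: Scheffé's test controls the family-wise error rate for all possible contrasts, including complex ones (e.g., the average of two groups versus a third), not just pairwise comparisons. This makes it more conservative, and therefore less powerful, than Tukey's HSD when only pairwise comparisons are of interest. It is most appropriate when complex or unplanned contrasts are being tested.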
4. Dunnett's Test: Dunnett's test compares each group mean to a control or reference group mean. It is used when there is a specific control group, and the interest lies in comparing other groups to this reference group.
5. Games-Howell Test: The Games-Howell test is used when the assumption of equal variances is violated. It does not require equal variances or equal sample sizes, performs pairwise comparisons using Welch-type degrees of freedom, and adjusts for multiple comparisons. Like Tukey's HSD, it still assumes approximately normal data within each group.

Example situation:
Suppose a researcher conducts an experiment to compare the effectiveness of three different treatment methods for reducing anxiety levels. The ANOVA results indicate a significant overall effect, suggesting that at least one treatment method differs from the others. In this case, a post-hoc test would be necessary to determine which specific treatment methods differ significantly.
The researcher decides to use Tukey's HSD test as the post-hoc test to compare all possible pairwise differences between the treatment groups. This test will allow them to identify which treatment methods are significantly different from each other in terms of their impact on anxiety levels.
By conducting Tukey's HSD test, the researcher can obtain specific information about the significant pairwise differences, enabling a more comprehensive understanding of the treatment effects and informing subsequent analysis or decision-making processes.
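
The scenario above can be sketched in Python. The anxiety scores are hypothetical, and the example assumes a SciPy release recent enough to provide `scipy.stats.tukey_hsd`; `pairwise_tukeyhsd` from `statsmodels` is an alternative.

```python
from scipy import stats

# Hypothetical anxiety scores after each treatment (assumption: invented data)
treatment_1 = [22, 25, 24, 27, 23, 26, 25, 24]
treatment_2 = [20, 21, 19, 22, 20, 23, 21, 20]
treatment_3 = [15, 17, 16, 14, 18, 16, 15, 17]

# Omnibus one-way ANOVA
f_stat, p_val = stats.f_oneway(treatment_1, treatment_2, treatment_3)
print("F-statistic:", f_stat, "p-value:", p_val)

# Tukey's HSD: pairwise comparisons with family-wise error control
result = stats.tukey_hsd(treatment_1, treatment_2, treatment_3)
print(result)

# result.pvalue[i, j] is the adjusted p-value for group i versus group j
pairwise_p = result.pvalue
```

Each off-diagonal entry of `pairwise_p` below 0.05 marks a pair of treatments whose mean anxiety scores differ significantly.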

# Question.9

## A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

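```python
import scipy.stats as stats

# Illustrative weight-loss values for each diet
diet_A = [3, 4, 5, 4, 3, 6, 7, 2, 4, 5, 3, 4, 5, 4, 3, 6, 7, 2, 4, 5, 3, 4, 5, 4, 3, 6, 7, 2, 4, 5]
diet_B = [2, 3, 4, 2, 3, 4, 5, 2, 3, 4, 2, 3, 4, 5, 2, 3, 4, 2, 3, 4, 5, 2, 3, 4, 2, 3, 4, 5, 2, 3]
diet_C = [1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2]

# One-way ANOVA across the three diets
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)
print("F-Statistic:", f_statistic)
print("p-value:", p_value)
```

```
F-Statistic: 22.659441885667835
p-value: 1.1982238136397129e-08
```

With F ≈ 22.66 and p ≈ 1.2e-08, far below 0.05, we reject the null hypothesis that the three diets produce the same mean weight loss. At least one diet differs from the others; a post-hoc test such as Tukey's HSD would be needed to identify which pairs differ.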


# Question.10

##  A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a DataFrame with illustrative data
data = {
    'Software': ['A', 'A', 'B', 'B', 'C', 'C', 'A', 'A', 'B', 'B', 'C', 'C'],
    'Experience': ['Novice', 'Experienced'] * 6,
    'Time': [10, 12, 14, 16, 20, 22, 11, 13, 15, 17, 19, 21]
}

df = pd.DataFrame(data)

# Perform two-way ANOVA with both main effects and their interaction
model = ols('Time ~ Software + Experience + Software:Experience', data=df).fit()
anova_table = sm.stats.anova_lm(model)

# Print the results
print(anova_table)
```

```
                      df        sum_sq       mean_sq             F    PR(>F)
Software             2.0  1.626667e+02  8.133333e+01  1.626667e+02  0.000006
Experience           1.0  1.200000e+01  1.200000e+01  2.400000e+01  0.002714
Software:Experience  2.0  3.234330e-29  1.617165e-29  3.234330e-29  1.000000
Residual             6.0  3.000000e+00  5.000000e-01           NaN       NaN
```

Both main effects are significant at the 0.05 level: Software (F ≈ 162.67, p ≈ 0.000006) and Experience (F = 24.0, p ≈ 0.0027). The Software:Experience interaction is not significant (F ≈ 0, p = 1.0), so in this data the effect of the software program on completion time does not depend on experience level.


# Question.11

## An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

```python
import scipy.stats as stats

# Test scores for control group and experimental group
control_group = [78, 82, 75, 88, 92, 79, 81, 85, 90, 86, 80, 83, 87, 84, 77, 91, 89, 76, 82, 85,
                 80, 88, 81, 84, 79, 83, 86, 90, 77, 75, 89, 82, 86, 80, 78, 83, 87, 91, 88, 85,
                 80, 82, 79, 84, 90, 83, 77, 86, 81, 88, 89]

experimental_group = [85, 89, 92, 78, 83, 90, 88, 81, 86, 84, 82, 79, 88, 83, 90, 87, 84, 79, 82,
                      85, 81, 77, 89, 86, 83, 90, 88, 82, 79, 85, 87, 80, 83, 88, 81, 90, 84, 82,
                      79, 86, 83, 89, 88, 77, 81, 85, 83, 90, 84, 82, 86]

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Print the results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# With only two groups, the t-test is already the single pairwise comparison,
# so a significant result needs no post-hoc test (those apply to 3+ groups)
significant = p_value < 0.05
```

```
t-statistic: -0.9778766189127441
p-value: 0.33049511614004234
```

With t ≈ -0.98 and p ≈ 0.33, which is greater than 0.05, we fail to reject the null hypothesis: there is no evidence of a difference in mean test scores between the two teaching methods. Because only two groups are compared, no post-hoc test is required in either case.


# Question.12

## A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Create a long-format DataFrame: one row per (day, store) pair
data = {
    'Day': list(range(1, 31)) * 3,
    'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    # Placeholder: replace with all 90 sales values (30 days x 3 stores)
    'Sales': [100, 110, 120, 105, 115, 125, ...]  # Sales data for each day and store
}

df = pd.DataFrame(data)

# Repeated measures ANOVA via a linear model: Store is the within factor,
# and C(Day) treats each day as a block (the repeated unit)
model = ols('Sales ~ Store + C(Day)', data=df).fit()
anova_table = sm.stats.anova_lm(model)

# Print the ANOVA results
print(anova_table)

# Perform post-hoc test (Tukey's HSD) only if the Store effect is significant
if anova_table.loc['Store', 'PR(>F)'] < 0.05:
    posthoc = pairwise_tukeyhsd(df['Sales'], df['Store'])
    print(posthoc.summary())
```
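
The repeated measures F-test for the store effect can also be computed by hand, which makes the role of the day-to-day variation explicit. This is a sketch on synthetic data: the sales figures are generated from a seeded random generator with invented store and day effects, and are not the researcher's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_days, n_stores = 30, 3

# Synthetic sales (assumption): rows are days, columns are stores A, B, C
day_effect = rng.normal(0, 10, size=(n_days, 1))   # shared by all stores on a day
store_effect = np.array([100.0, 105.0, 115.0])     # invented store means
sales = store_effect + day_effect + rng.normal(0, 5, size=(n_days, n_stores))

grand_mean = sales.mean()

# Partition the total sum of squares into store, day, and error components
ss_total = np.sum((sales - grand_mean) ** 2)
ss_store = n_days * np.sum((sales.mean(axis=0) - grand_mean) ** 2)
ss_day = n_stores * np.sum((sales.mean(axis=1) - grand_mean) ** 2)
ss_error = ss_total - ss_store - ss_day

# F-test for the store effect, with day-to-day variation removed from the error
df_store = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)
f_stat = (ss_store / df_store) / (ss_error / df_error)
p_val = stats.f.sf(f_stat, df_store, df_error)

print("F-statistic:", f_stat)
print("p-value:", p_val)
```

Removing the day sum of squares from the error term is what distinguishes this analysis from a one-way ANOVA on the same numbers, and it usually gives a more sensitive test when days differ substantially from one another.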
