Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact  the validity of the results.

Answer--> ANOVA (Analysis of Variance) is a statistical technique used to compare means across two or more groups or conditions. It is based on several assumptions, and violating these assumptions can affect the validity of the results. The assumptions for ANOVA include:

    1 Normality: The data within each group or condition are assumed to be normally distributed. 

    2 Homogeneity of variances (homoscedasticity): The variability of scores within each group or condition is assumed to be approximately equal across all groups. 

    3 Absence of outliers: ANOVA is sensitive to extreme values, so any outliers in the data should be identified and handled (for example, removed if they are errors) before performing ANOVA.

    4 Independence: The observations within each group or condition are assumed to be independent of each other and randomly sampled.

Violations of these assumptions can impact the validity of ANOVA results. Here are examples of violations and their consequences:

    1 Violation of independence: If the observations within groups are not independent, it can lead to biased results. For example, if participants in one group influence each other's responses, such as in a group discussion setting, the assumption of independence is violated.

    2 Violation of normality: If the data within groups deviate significantly from a normal distribution, it can affect the validity of ANOVA results. Non-normal distributions may lead to inaccurate p-values and confidence intervals. For example, if the scores in a group are highly skewed or have extreme outliers, the normality assumption is violated.

    3 Violation of homogeneity of variance: When the variances of the groups being compared are unequal, it can lead to incorrect conclusions. If one group has much larger variability than the others, it can dominate the overall variance, potentially masking significant differences between the group means. Violation of this assumption can inflate or deflate the F-statistic and affect the p-value.
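
The normality and homogeneity assumptions above can be checked before running ANOVA. The following is a minimal sketch, assuming three simulated groups (the group names, sizes, and parameters are made up for illustration); it uses the Shapiro-Wilk test for normality and Levene's test for equal variances from `scipy.stats`.

```python
import numpy as np
from scipy import stats

# Hypothetical samples, simulated only to illustrate the checks
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10, scale=2, size=30)
group_b = rng.normal(loc=11, scale=2, size=30)
group_c = rng.normal(loc=12, scale=6, size=30)  # deliberately larger spread

# Normality: Shapiro-Wilk test per group (null hypothesis: the data are normal)
for name, sample in [("A", group_a), ("B", group_b), ("C", group_c)]:
    w_stat, p_normal = stats.shapiro(sample)
    print(f"Group {name}: Shapiro-Wilk W={w_stat:.3f}, p={p_normal:.3f}")

# Homogeneity of variances: Levene's test (null hypothesis: equal variances)
levene_stat, p_levene = stats.levene(group_a, group_b, group_c)
print(f"Levene statistic={levene_stat:.3f}, p={p_levene:.4f}")
```

A small p-value in either test suggests the corresponding assumption may not hold. Independence cannot be tested this way; it has to come from the study design.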

Q2. What are the three types of ANOVA, and in what situations would each be used?

Answer--> The three types of ANOVA are:

1 One-way ANOVA: One-way ANOVA is used when there is one independent variable (factor) with two or more levels and the groups are independent, i.e., each subject belongs to exactly one group.
    
    One-way ANOVA is appropriate in situations where you want to compare the means of multiple groups or conditions. For example, you may use one-way ANOVA to compare the effectiveness of three different treatments on blood pressure levels.
    
2 Repeated measures ANOVA: Repeated measures ANOVA is used when there is one independent variable (factor) with two or more levels and the same subjects are measured under every level, so the observations are dependent.

    For example, you may use repeated measures ANOVA to assess the effects of different time points (e.g., before, during, and after treatment) on participants' anxiety levels.
    
3 Factorial ANOVA: Factorial ANOVA is used when there are two or more independent variables (factors), each with at least two levels. The factors can be between-subjects (independent groups), within-subjects (repeated measures), or a mix of both.

    Factorial ANOVA is suitable when you want to investigate the combined effects of multiple independent variables on the dependent variable. For example, you may use a 2x2 factorial ANOVA to study the effects of both gender (male vs. female) and treatment type (drug A vs. drug B) on participants' pain perception.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Answer--> 
The partitioning of variance in ANOVA refers to the division of the total variability in the dependent variable into components attributed to different sources of variation. In a one-way ANOVA, the total sum of squares splits into a between-group (explained) part and a within-group (residual) part: SS_total = SS_between + SS_within.

Understanding this partition is important because the F-statistic is built from it: it is the ratio of the between-group mean square to the within-group mean square. The same decomposition is used to assess the significance of each factor, examine interactions, and estimate effect sizes, so it underpins valid interpretation of the results.
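
The partition can be verified numerically. Below is a minimal sketch, assuming three small made-up groups of scores; it computes each sum of squares independently, checks that the between-group and within-group parts add up to the total, and forms the F-statistic from them.

```python
import numpy as np

# Hypothetical scores for three groups (illustration only)
groups = [
    np.array([4.0, 6.0, 8.0, 9.0, 5.0]),
    np.array([2.0, 3.0, 1.0, 5.0, 4.0]),
    np.array([7.0, 6.0, 9.0, 8.0, 10.0]),
]
all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Total variability around the grand mean
ss_total = np.sum((all_scores - grand_mean) ** 2)
# Variability of the group means around the grand mean (explained)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variability of the scores around their own group mean (residual)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# The F-statistic is the ratio of the two mean squares
df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_statistic = (ss_between / df_between) / (ss_within / df_within)

print("SS total:", round(ss_total, 4))
print("SS between + SS within:", round(ss_between + ss_within, 4))
print("F-statistic:", round(f_statistic, 4))
```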

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
from scipy import stats

# Generate example data
group1 = np.array([4, 6, 8, 9, 5])
group2 = np.array([2, 3, 1, 5, 4])
group3 = np.array([7, 6, 9, 8, 10])

# Combine data into a single array
data = np.concatenate([group1, group2, group3])

# Calculate the mean of the entire dataset
mean_total = np.mean(data)

# Calculate the total sum of squares (SST)
sst = np.sum((data - mean_total) ** 2)

# Calculate the group means
mean_group1 = np.mean(group1)
mean_group2 = np.mean(group2)
mean_group3 = np.mean(group3)

# Calculate the residual (within-group) sum of squares (SSR)
ssr = np.sum((group1 - mean_group1) ** 2) + np.sum((group2 - mean_group2) ** 2) + np.sum((group3 - mean_group3) ** 2)

# Calculate the explained (between-group) sum of squares (SSE) as the remainder
sse = sst - ssr

# Print the results
print("SST:", sst)
print("SSE:", sse)
print("SSR:", ssr)

SST: 102.39999999999999
SSE: 65.19999999999999
SSR: 37.2


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a DataFrame with the data for the two-way ANOVA
data = pd.DataFrame({
    'Factor1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Factor2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
    'Response': [10, 12, 14, 16, 20, 18, 22, 24]
})

# Perform two-way ANOVA
model = ols('Response ~ Factor1 + Factor2 + Factor1:Factor2', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# Extract the sums of squares attributed to each main effect and to the interaction
main_effect_factor1 = anova_table['sum_sq']['Factor1']
main_effect_factor2 = anova_table['sum_sq']['Factor2']
interaction_effect = anova_table['sum_sq']['Factor1:Factor2']

print("Main Effect of Factor 1:", main_effect_factor1)
print("Main Effect of Factor 2:", main_effect_factor2)
print("Interaction Effect:", interaction_effect)

                 sum_sq   df          F    PR(>F)
Factor1           128.0  1.0  14.222222  0.019584
Factor2             2.0  1.0   0.222222  0.661914
Factor1:Factor2     2.0  1.0   0.222222  0.661914
Residual           36.0  4.0        NaN       NaN
Main Effect of Factor 1: 128.00000000000017
Main Effect of Factor 2: 2.000000000000013
Interaction Effect: 2.0000000000000044
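
The sums of squares in an ANOVA table measure how much variability each effect explains, but the effects themselves can also be read directly from the group means. The following is a minimal sketch, assuming a balanced 2x2 design with illustrative values: each main effect is a difference between marginal means, and the interaction is a difference of differences between cell means.

```python
import pandas as pd

# Illustrative balanced 2x2 design (two observations per cell)
data = pd.DataFrame({
    'Factor1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Factor2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
    'Response': [10, 12, 14, 16, 20, 18, 22, 24]
})

# Marginal means for each factor and the mean of every cell
factor1_means = data.groupby('Factor1')['Response'].mean()
factor2_means = data.groupby('Factor2')['Response'].mean()
cell_means = data.pivot_table(index='Factor1', columns='Factor2',
                              values='Response', aggfunc='mean')

# Main effects: difference between the marginal means of the two levels
main_effect_factor1 = factor1_means['B'] - factor1_means['A']
main_effect_factor2 = factor2_means['Y'] - factor2_means['X']

# Interaction: does the Y - X difference change between levels of Factor1?
interaction_contrast = (
    (cell_means.loc['B', 'Y'] - cell_means.loc['B', 'X'])
    - (cell_means.loc['A', 'Y'] - cell_means.loc['A', 'X'])
)

print(cell_means)
print("Main effect of Factor1 (B - A):", main_effect_factor1)
print("Main effect of Factor2 (Y - X):", main_effect_factor2)
print("Interaction contrast:", interaction_contrast)
```

An interaction contrast of zero would mean the effect of Factor2 is the same at both levels of Factor1.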


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

Answer--> In a one-way ANOVA, the F-statistic is used to test whether there are significant differences between the means of the groups being compared. The p-value associated with the F-statistic helps determine the statistical significance of these differences. 

In the given scenario, with an F-statistic of 5.23 and a p-value of 0.02, we can make the following conclusions:

1. Significant differences: The p-value of 0.02 is below the conventional significance level of 0.05, so we reject the null hypothesis that all group means are equal. In other words, at least one group mean differs significantly from the others.

2. Interpretation: Based on these results, we can conclude that there are statistically significant differences in the means of the groups being compared. However, the one-way ANOVA does not provide specific information about which particular group means differ from each other. To determine the specific group differences, post hoc tests or pairwise comparisons can be conducted.

3. Further analysis: Given the significant overall difference, further analyses can be performed to investigate pairwise comparisons between group means. Post hoc tests, such as Tukey's HSD (honestly significant difference) or Bonferroni corrections, can be used to identify which specific group comparisons are significantly different from each other.

It is important to note that the interpretation of the results should also consider the context and the specific research question being addressed. Additionally, other factors like effect size, sample size, and practical significance should be considered when interpreting the results of the one-way ANOVA.
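
To illustrate the point about effect size, eta-squared can be recovered from the F-statistic once the degrees of freedom are known. The scenario above does not state them, so the sketch below assumes hypothetical values (3 groups, 30 observations in total); both the degrees of freedom and the resulting numbers are illustrative only.

```python
from scipy import stats

f_statistic = 5.23

# Hypothetical degrees of freedom (not given in the scenario):
# 3 groups and 30 observations in total
df_between = 3 - 1
df_within = 30 - 3

# Eta-squared: share of total variability explained by group membership
eta_squared = (f_statistic * df_between) / (f_statistic * df_between + df_within)

# Critical F value at alpha = 0.05 for these degrees of freedom
f_critical = stats.f.ppf(0.95, df_between, df_within)

print("Eta-squared:", round(eta_squared, 3))
print("Critical F at alpha=0.05:", round(f_critical, 3))
print("F exceeds critical value:", f_statistic > f_critical)
```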

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Answer--> Here are some common methods to handle missing data in repeated measures ANOVA and their potential consequences:

    Complete Case Analysis (Listwise deletion)
    Pairwise Deletion
    Mean Substitution
    Multiple Imputation
    Maximum Likelihood Estimation
    
The consequences of using different methods to handle missing data can vary:

    Complete case analysis and pairwise deletion can result in reduced statistical power, biased estimates if the missingness is related to the variables, and potentially limit generalizability.

    Mean substitution can introduce bias if the missingness is related to the variable itself or other factors influencing the variable.

    Multiple imputation and maximum likelihood estimation can provide more accurate estimates by accounting for uncertainty due to missing data. However, they rely on assumptions about the missing data mechanism, and inappropriate assumptions can lead to biased results.
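
The trade-off between the two simplest methods can be seen on a small example. The following is a minimal sketch, assuming hypothetical wide-format data with one row per participant and one column per time point (the column names and values are made up): listwise deletion loses participants, while mean substitution keeps them but shrinks the variability.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame({
    "before": [20.0, 22.0, 19.0, 24.0, 21.0],
    "during": [18.0, np.nan, 17.0, 21.0, 20.0],
    "after":  [15.0, 16.0, np.nan, 18.0, 17.0],
})

# Complete case analysis: drop every participant with any missing time point
complete_cases = scores.dropna()

# Mean substitution: fill each gap with that time point's observed mean
mean_filled = scores.fillna(scores.mean())

print("Participants kept (listwise deletion):", len(complete_cases))
print("Participants kept (mean substitution):", len(mean_filled))
print("Std of 'during', observed values only:", scores["during"].std())
print("Std of 'during', after mean substitution:", mean_filled["during"].std())
```

The smaller standard deviation after mean substitution is one reason that method can distort the F-test.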

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

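Answer--> After conducting an ANOVA and finding a significant overall effect, post-hoc tests are used to compare specific group differences and determine which groups significantly differ from each other. Here are some common post-hoc tests used after ANOVA and when each is typically used:

    Tukey's Honestly Significant Difference (HSD): all pairwise comparisons between group means, especially when group sizes are equal or similar.
    Bonferroni correction: a small number of planned comparisons; simple, but conservative as the number of comparisons grows.
    Sidak correction: similar to Bonferroni but slightly less conservative; it assumes the comparisons are independent.
    Dunnett's test: comparing each treatment group against a single control group.
    Scheffé's test: any contrast among the means, including complex ones (e.g., the average of two groups against a third); the most conservative of these tests.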

An example situation where a post-hoc test might be necessary is when conducting a one-way ANOVA to compare the effectiveness of different treatments on pain relief. After obtaining a significant overall effect, you may want to conduct a post-hoc test to determine which specific treatment groups significantly differ from each other. This would help identify the specific treatments that are more effective than others and provide valuable insights for decision-making in clinical or research settings.
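
The pain-relief scenario can be sketched as follows. This is a minimal example, assuming simulated scores for three hypothetical treatments (the means, spread, and group size are made up); it runs the overall one-way ANOVA and follows up with Tukey's HSD only if the overall result is significant.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical pain-relief scores, simulated for illustration
rng = np.random.default_rng(7)
treatment_a = rng.normal(loc=5.0, scale=1.0, size=20)
treatment_b = rng.normal(loc=5.2, scale=1.0, size=20)
treatment_c = rng.normal(loc=7.0, scale=1.0, size=20)

# Overall test: is there any difference among the three means?
f_statistic, p_value = stats.f_oneway(treatment_a, treatment_b, treatment_c)
print("F-statistic:", f_statistic, "p-value:", p_value)

# Long format: one score per entry with a matching group label
scores = np.concatenate([treatment_a, treatment_b, treatment_c])
labels = np.repeat(["A", "B", "C"], 20)

# Post-hoc test only after a significant overall effect
if p_value < 0.05:
    tukey = pairwise_tukeyhsd(scores, labels, alpha=0.05)
    print(tukey)
```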

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [3]:
import numpy as np
from scipy import stats

# Generate example data (simulated for illustration: 50 observations per diet, 150 in total)
np.random.seed(42)  # Set a seed for reproducibility
diet_A = np.random.normal(loc=5, scale=1, size=50)
diet_B = np.random.normal(loc=6, scale=1, size=50)
diet_C = np.random.normal(loc=4, scale=1, size=50)

# Combine data into a single array
data = np.concatenate([diet_A, diet_B, diet_C])

# Create a group indicator array
groups = np.array(['A'] * 50 + ['B'] * 50 + ['C'] * 50)

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Print the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
if p_value < 0.05:
    print("There is a significant difference between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")

F-statistic: 60.35724557746856
p-value: 7.310396587520461e-20
There is a significant difference between the mean weight loss of the three diets.


Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a sample dataset with task completion time, software program, and experience level
data = pd.DataFrame({
    'TaskTime': [12, 14, 10, 15, 13, 11, 16, 18, 20, 14, 16, 12, 8, 10, 13, 11, 9, 10, 12, 14, 16, 14, 17, 19, 15, 12, 14, 13, 16, 18],
    'Software': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'B', 'C'],
    'Experience': ['Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced', 'Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced', 'Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced', 'Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced', 'Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced']
})

# Fit the ANOVA model
model = ols('TaskTime ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model)

# Print the ANOVA table
print(anova_table)

                             df      sum_sq    mean_sq         F    PR(>F)
C(Software)                 2.0   96.266667  48.133333  7.515242  0.002922
C(Experience)               1.0    4.226087   4.226087  0.659835  0.424603
C(Software):C(Experience)   2.0    9.659627   4.829814  0.754097  0.481258
Residual                   24.0  153.714286   6.404762       NaN       NaN

Answer--> The software program has a significant main effect on task completion time (F(2, 24) = 7.52, p = 0.003). Neither experience level (F(1, 24) = 0.66, p = 0.425) nor the software-by-experience interaction (F(2, 24) = 0.75, p = 0.481) is significant at the 0.05 level, so there is no evidence that the effect of the software depends on experience.


Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [5]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate random test scores for the control and experimental groups
np.random.seed(1)  # Set a random seed for reproducibility
control_scores = np.random.normal(loc=70, scale=10, size=100)
experimental_scores = np.random.normal(loc=75, scale=12, size=100)

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)

# Check if the results are significant
if p_value < 0.05:
    print("The t-test result is significant. There is a significant difference in test scores between the two groups.")

    # Follow up with Tukey's HSD only after a significant result.
    # With just two groups this repeats the t-test comparison; it is kept because the question asks for it.
    all_scores = np.concatenate([control_scores, experimental_scores])
    group_labels = ['Control'] * len(control_scores) + ['Experimental'] * len(experimental_scores)

    posthoc = pairwise_tukeyhsd(all_scores, group_labels)
    print(posthoc)
else:
    print("The t-test result is not significant. There is no significant difference in test scores between the two groups.")

The t-test result is significant. There is a significant difference in test scores between the two groups.
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   6.2277   0.0 3.4009 9.0545   True
--------------------------------------------------------


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [6]:
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Prepare the data
data = pd.DataFrame({
    'Day': range(1, 31),
    'Store_A': [100, 95, 105, 110, 90, 115, 95, 105, 100, 105, 95, 105, 100, 95, 105, 110, 90, 115, 95, 105, 100, 105, 95, 105, 100, 95, 105, 110, 90, 115],
    'Store_B': [120, 85, 105, 110, 95, 100, 105, 95, 105, 100, 95, 105, 120, 85, 105, 110, 95, 100, 105, 95, 105, 100, 95, 105, 120, 85, 105, 110, 95, 100],
    'Store_C': [110, 95, 85, 115, 100, 105, 95, 105, 100, 95, 105, 110, 95, 85, 115, 100, 105, 95, 105, 100, 95, 105, 110, 95, 85, 115, 100, 105, 95, 105]
})

# Reshape the data from wide to long format
data_long = pd.melt(data, id_vars='Day', var_name='Store', value_name='Sales')

# Perform repeated measures ANOVA, treating each day as the repeated "subject"
# (stats.f_oneway would treat the three stores as independent samples)
from statsmodels.stats.anova import AnovaRM

anova_result = AnovaRM(data_long, depvar='Sales', subject='Day', within=['Store']).fit()

# Extract the p-value for the Store effect from the ANOVA table
p_value = anova_result.anova_table.loc['Store', 'Pr > F']

# Check if the results are significant
if p_value < 0.05:
    print("The ANOVA result is significant. There are significant differences in sales between the stores.")
else:
    print("The ANOVA result is not significant. There are no significant differences in sales between the stores.")

# Perform Tukey's HSD post-hoc test
# (note: Tukey's HSD treats the stores as independent groups, so it ignores the pairing by day)
posthoc = pairwise_tukeyhsd(data_long['Sales'], data_long['Store'], alpha=0.05)

# Print the post-hoc test summary
print(posthoc)

The ANOVA result is not significant. There are no significant differences in sales between the stores.
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
 group1  group2 meandiff p-adj   lower  upper  reject
-----------------------------------------------------
Store_A Store_B      0.0    1.0 -5.1419 5.1419  False
Store_A Store_C  -0.8333 0.9211 -5.9752 4.3086  False
Store_B Store_C  -0.8333 0.9211 -5.9752 4.3086  False
-----------------------------------------------------
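
Because the same days are observed for every store, a post-hoc procedure that respects the pairing may be more appropriate than Tukey's HSD. The following is a minimal sketch, assuming simulated sales figures (the store effects and noise levels are made up); it runs paired t-tests for every pair of stores and applies a Bonferroni correction with `multipletests` from statsmodels.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical daily sales: the same 30 days are observed for every store
rng = np.random.default_rng(3)
day_effect = rng.normal(loc=100, scale=8, size=30)  # shared day-to-day variation
sales = {
    "Store_A": day_effect + rng.normal(0, 3, size=30),
    "Store_B": day_effect + rng.normal(2, 3, size=30),
    "Store_C": day_effect + rng.normal(6, 3, size=30),
}

# Paired t-test for every pair of stores (pairs share the same days)
pairs = list(combinations(sales, 2))
raw_p_values = [stats.ttest_rel(sales[a], sales[b]).pvalue for a, b in pairs]

# Bonferroni correction for the three comparisons
reject, adjusted_p, _, _ = multipletests(raw_p_values, alpha=0.05, method="bonferroni")

for (a, b), p_adj, is_significant in zip(pairs, adjusted_p, reject):
    print(f"{a} vs {b}: adjusted p = {p_adj:.4f}, reject H0 = {is_significant}")
```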
