# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results

Assumptions for using ANOVA include:

Independence: Observations are independent within and between groups.
Normality: The residuals (differences between observed and predicted values) are normally distributed within each group.
Homogeneity of Variance: The variances of the residuals are approximately equal across all groups.

Violations of these assumptions can impact the validity of ANOVA results:

Independence violation: If observations are not independent, the F-test treats correlated observations as separate pieces of evidence, which typically understates the error variance and inflates the Type I error rate. Example: repeated measurements on the same subjects analyzed as if they came from different subjects.
Normality violation: If residuals are not normally distributed, p-values and confidence intervals can be inaccurate, especially with small samples. This can occur when the data are heavily skewed or contain outliers, for example reaction times with a long right tail.
Homogeneity of Variance violation: Unequal variances make the F-test unreliable, particularly when group sizes also differ, and can lead to incorrect conclusions. Example: one treatment group whose scores are far more spread out than the others'.
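
The normality and equal-variance assumptions can be screened with standard tests. The sketch below is a minimal illustration on three made-up groups (the data and variable names are assumptions, not part of the original answer). It applies the Shapiro-Wilk test to the pooled residuals and Levene's test across groups, both from `scipy.stats`. Independence cannot be checked this way; it has to come from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three independent groups (values are made up for illustration)
groups = [
    np.array([23.0, 25.0, 21.0, 22.0, 24.0, 26.0]),
    np.array([30.0, 28.0, 31.0, 29.0, 27.0, 32.0]),
    np.array([20.0, 35.0, 15.0, 40.0, 10.0, 45.0]),  # deliberately much more spread out
]

# Residuals are the deviations of each observation from its own group mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk: a small p-value suggests the residuals are not normal
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: a small p-value suggests the group variances differ
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-value:", shapiro_p)
print("Levene p-value:", levene_p)
```

With such small samples these tests have little power, so a plot of the residuals is usually worth a look as well.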

# Q2. What are the three types of ANOVA, and in what situations would each be used?

There are three types of ANOVA:

One-Way ANOVA: Used when comparing means across independent groups defined by a single factor (typically three or more groups).
Two-Way ANOVA: Used when there are two independent variables (factors) affecting the dependent variable, and we want to study the main effects and interaction effects.
Repeated Measures ANOVA: Used when the same subjects are used for all treatment conditions (within-subject design).

# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

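The total variation in the data (SST, the total sum of squares) is split into two components: the variation explained by group membership (SSE in this document's notation, the between-group sum of squares) and the residual variation within groups (SSR), so that SST = SSE + SSR. Understanding this partition matters because the F-statistic is the ratio of the explained mean square to the residual mean square, and SSE / SST gives the proportion of variability explained by the model. Note that many textbooks use the opposite labels (SSE for the error sum of squares and SSR for the regression sum of squares).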
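
In symbols, using this document's labels (SSE for the explained, between-group part and SSR for the residual, within-group part), the partition for $k$ groups with $n_j$ observations in group $j$ and $N$ observations overall can be written as below. The index notation ($x_{ij}$, $\bar{x}_j$, $\bar{x}$) is introduced here for illustration only.

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2}_{\text{SST}}
= \underbrace{\sum_{j=1}^{k} n_j\,(\bar{x}_j - \bar{x})^2}_{\text{SSE (explained)}}
+ \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}_{\text{SSR (residual)}},
\qquad
F = \frac{\text{SSE}/(k-1)}{\text{SSR}/(N-k)}
```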

# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:

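import numpy as np

def one_way_anova(groups):
    """Return (SST, SSE, SSR), with SSE = explained and SSR = residual."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()

    # Total sum of squares: every observation against the grand mean
    SST = np.sum((all_values - grand_mean) ** 2)

    # Explained (between-group) sum of squares: group means against the grand mean
    SSE = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

    # Residual (within-group) sum of squares: observations against their own group mean
    SSR = sum(np.sum((g - g.mean()) ** 2) for g in groups)

    return SST, SSE, SSR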

data = [[5, 10, 15], [20, 30, 40], [10, 20, 30]]

SST, SSE, SSR = one_way_anova(data)

print("SST:", SST)
print("SSE:", SSE)
print("SSR:", SSR)

# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

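SciPy has no built-in two-way ANOVA, so main effects and the interaction are usually estimated by fitting a linear model with statsmodels: build the model with the formula interface (`ols` with a formula such as `y ~ C(a) * C(b)`, where `y`, `a`, and `b` stand for the response and the two factors) and pass the fitted model to `anova_lm`. The resulting table has one row per main effect and one for the interaction, each with its sum of squares, F-statistic, and p-value.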
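
For a balanced design (the same number of observations in every cell), the main-effect and interaction sums of squares can also be computed directly, which shows what a two-way ANOVA table contains. The following is a minimal sketch under that balance assumption, with made-up data and illustrative variable names; for unbalanced data a model-based routine is the safer choice.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced data: y[i, j, r] is replicate r for
# factor A level i (3 levels) and factor B level j (2 levels)
y = np.array([
    [[12.0, 14.0], [8.0, 10.0]],
    [[15.0, 17.0], [9.0, 11.0]],
    [[20.0, 22.0], [10.0, 12.0]],
])
a, b, n = y.shape

grand = y.mean()
mean_a = y.mean(axis=(1, 2))   # one mean per level of factor A
mean_b = y.mean(axis=(0, 2))   # one mean per level of factor B
cell = y.mean(axis=2)          # one mean per A-by-B cell

# Sums of squares for the two main effects, the interaction, and the residual
ss_a = b * n * np.sum((mean_a - grand) ** 2)
ss_b = a * n * np.sum((mean_b - grand) ** 2)
ss_ab = n * np.sum((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2)
ss_resid = np.sum((y - cell[:, :, None]) ** 2)

df_resid = a * b * (n - 1)
ms_resid = ss_resid / df_resid

# Each effect's F is its mean square divided by the residual mean square
for name, ss, df in [("A", ss_a, a - 1), ("B", ss_b, b - 1), ("A x B", ss_ab, (a - 1) * (b - 1))]:
    F = (ss / df) / ms_resid
    p = stats.f.sf(F, df, df_resid)
    print(f"{name}: F = {F:.3f}, p = {p:.4f}")
```

In a balanced design the four sums of squares add up to the total sum of squares, which is a useful sanity check on the arithmetic.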

# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

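Because the p-value of 0.02 is below the conventional 0.05 significance level, we reject the null hypothesis that all group means are equal: the F-statistic of 5.23 says the variation between group means is about five times what the within-group variation alone would suggest. At least one group mean differs from the others. However, the test doesn't say which groups differ or how large the differences are, so a post-hoc test and an effect size are needed to complete the picture.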
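
The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not give. As a minimal sketch, assume 3 groups and 15 observations in total (hypothetical values chosen only for illustration), so the F-distribution has 2 and 12 degrees of freedom:

```python
from scipy import stats

F_observed = 5.23
# Hypothetical degrees of freedom: 3 groups (dfn = 2) and 15 observations (dfd = 12)
dfn, dfd = 2, 12
alpha = 0.05

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(F_observed, dfn, dfd)

# Critical value: the F that cuts off the upper alpha tail
F_critical = stats.f.ppf(1 - alpha, dfn, dfd)

reject_null = p_value < alpha
print("p-value:", p_value)
print("critical F:", F_critical)
print("reject H0:", reject_null)
```

Comparing the p-value with alpha and comparing the observed F with the critical F are two views of the same decision.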

# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

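Common options are listwise deletion (dropping every subject with any missing measurement), imputation (mean imputation or interpolation between a subject's neighboring time points), and mixed-effects models, which use all available observations without filling anything in. The consequences differ: listwise deletion reduces the sample size and power and is unbiased only if the data are missing completely at random; mean imputation artificially shrinks the variance, which can make effects look more significant than they are; mixed-effects models remain valid under the weaker missing-at-random assumption and are generally preferred.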
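
The effect of two of these choices can be seen on a small example. The sketch below uses made-up scores (rows are subjects, columns are repeated conditions; all names are illustrative) to compare how many subjects survive listwise deletion and how mean imputation changes the per-condition variance.

```python
import numpy as np

# Hypothetical scores: rows are subjects, columns are three repeated conditions
scores = np.array([
    [10.0, 12.0, 15.0],
    [11.0, np.nan, 16.0],
    [9.0, 11.0, 13.0],
    [12.0, 14.0, np.nan],
    [8.0, 10.0, 12.0],
])

# Listwise deletion: keep only subjects with every condition observed
complete = scores[~np.isnan(scores).any(axis=1)]

# Mean imputation: replace each missing value with its condition (column) mean
col_means = np.nanmean(scores, axis=0)
imputed = np.where(np.isnan(scores), col_means, scores)

print("Subjects kept after listwise deletion:", complete.shape[0])
print("Observed variance per condition:", np.nanvar(scores, axis=0, ddof=1))
print("Variance after mean imputation: ", imputed.var(axis=0, ddof=1))
```

The imputed columns never have a larger sample variance than the observed values, because the filled-in points sit exactly at the mean.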

# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

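Post-hoc tests identify which specific groups differ after a significant ANOVA result while controlling the family-wise Type I error rate across the multiple comparisons. Tukey's HSD is the usual choice for comparing every pair of group means. The Bonferroni correction suits a small number of planned comparisons; it is simple but becomes conservative as the number of comparisons grows. Scheffe's method is the most conservative and is used when testing complex contrasts, such as one group against the average of two others. Example: if a one-way ANOVA shows that mean weight loss differs across three diets, a post-hoc test is needed to tell whether A differs from B, A from C, or B from C.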
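
As a minimal sketch of one such procedure, the code below runs pairwise t-tests with a Bonferroni correction on the small diet samples that appear in Q9. This is one simple way to do a post-hoc comparison, not the only one; the dictionary layout and variable names are illustrative.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Small illustrative weight-loss samples for three diets
groups = {
    "A": np.array([10, 12, 15, 18, 20]),
    "B": np.array([5, 7, 9, 11, 13]),
    "C": np.array([0, 2, 4, 6, 8]),
}

alpha = 0.05
pairs = list(combinations(groups, 2))
adjusted = {}
for name_1, name_2 in pairs:
    t_stat, p_raw = stats.ttest_ind(groups[name_1], groups[name_2])
    # Bonferroni: multiply each raw p-value by the number of comparisons, cap at 1
    adjusted[(name_1, name_2)] = min(p_raw * len(pairs), 1.0)

for pair, p_adj in adjusted.items():
    verdict = "differ" if p_adj < alpha else "no clear difference"
    print(pair, round(p_adj, 4), verdict)
```

Only pairs whose adjusted p-value falls below alpha are reported as different, which keeps the overall chance of a false positive at or below alpha.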

# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [13]:
import numpy as np
import scipy.stats as stats

# Create the data
diet_a = np.array([10, 12, 15, 18, 20])
diet_b = np.array([5, 7, 9, 11, 13])
diet_c = np.array([0, 2, 4, 6, 8])

# Conduct the ANOVA
F, p = stats.f_oneway(diet_a, diet_b, diet_c)

# Print the results
print("F-statistic:", F)
print("p-value:", p)



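F-statistic: 12.297297297297296
p-value: 0.0012433283447848308

Because the p-value (about 0.0012) is below 0.05, we reject the null hypothesis that the three diets have the same mean weight loss: at least one diet's mean differs. The sample here is a small illustrative one (five values per diet), not the 50 participants described in the question, and the ANOVA alone does not say which diets differ.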


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
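import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create the data (small illustrative sample, not the full 30 employees)
program = np.array(["A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"])
experience = np.array(["novice", "novice", "novice", "experienced", "experienced", "experienced", "novice", "experienced", "experienced", "novice", "experienced", "experienced"])
time = np.array([10, 12, 15, 18, 20, 22, 5, 7, 9, 11, 13, 15])
df = pd.DataFrame({"program": program, "experience": experience, "time": time})

# Conduct the ANOVA: fit a linear model with both factors and their interaction
# (scipy.stats has no two-way ANOVA function, so statsmodels is used here)
model = ols("time ~ C(program) * C(experience)", data=df).fit()
table = sm.stats.anova_lm(model, typ=2)  # typ=2 requests Type II sums of squares

# Print the results; an effect is significant at the 5% level when its p-value is below 0.05
effects = [("program", "C(program)"), ("experience", "C(experience)"), ("interaction", "C(program):C(experience)")]
for label, row in effects:
    print(f"F-statistic for {label}:", table.loc[row, "F"])
    print(f"p-value for {label}:", table.loc[row, "PR(>F)"])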


# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [None]:
import numpy as np
import scipy.stats as stats

# Create the data
control_group_scores = np.array([75, 80, 85, 90, 95])
experimental_group_scores = np.array([65, 70, 75, 80, 85])

# Conduct the t-test
t, p = stats.ttest_ind(control_group_scores, experimental_group_scores)

# Print the results
print("t-statistic:", t)
print("p-value:", p)

# With only two groups, the t-test is already the single pairwise comparison,
# so no separate post-hoc test is needed; report the direction of the difference instead
if p < 0.05:
    mean_difference = control_group_scores.mean() - experimental_group_scores.mean()
    print("Mean difference (control - experimental):", mean_difference)
else:
    print("No significant difference between the two groups at the 5% level.")


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
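import numpy as np
import pandas as pd
import scipy.stats as stats
from itertools import combinations
from statsmodels.stats.anova import AnovaRM

# Create the data (small illustrative sample; assumes the i-th value of each store is the same day)
store = np.array(["A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C", "C"])
sales = np.array([100, 120, 150, 180, 200, 220, 50, 70, 90, 110, 130, 150, 60, 80, 100, 120, 140, 160])
day = np.tile(np.arange(1, 7), 3)
df = pd.DataFrame({"day": day, "store": store, "sales": sales})

# Conduct the repeated measures ANOVA: day is the repeated unit, store the within factor
# (scipy.stats has no repeated measures ANOVA, so statsmodels' AnovaRM is used)
table = AnovaRM(df, depvar="sales", subject="day", within=["store"]).fit().anova_table
F = table["F Value"].iloc[0]
p = table["Pr > F"].iloc[0]

# Print the results
print("F-statistic:", F)
print("p-value:", p)

# If the results are significant, conduct a post-hoc test
if p < 0.05:
    # Paired t-tests with a Bonferroni correction (Tukey's test assumes independent groups)
    pairs = list(combinations(["A", "B", "C"], 2))
    print("Post-hoc results:")
    for s1, s2 in pairs:
        t, p_raw = stats.ttest_rel(sales[store == s1], sales[store == s2])
        # In this toy data B and C differ by a constant 10, so that pair is degenerate
        print(s1, "vs", s2, "adjusted p-value:", min(p_raw * len(pairs), 1.0))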
