## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.
## Ans1- ANOVA (Analysis of Variance) is a statistical method used to test for differences in means between two or more groups. However, ANOVA relies on several assumptions to ensure that the results are valid. Violations of these assumptions can lead to inaccurate or misleading conclusions. The key assumptions of ANOVA are:

## 1.Independence: The observations in each group must be independent of each other. This means that the value of one observation does not depend on the value of any other observation. Violations of this assumption can occur when data is collected from a hierarchical or clustered structure, such as students in the same classroom, repeated measures over time, or participants nested within groups.

## 2.Normality: The response variable within each group (equivalently, the model residuals) should be approximately normally distributed. This assumption matters because the F-test's p-values are derived under normality. The F-test is fairly robust to mild departures when group sizes are large and similar, but outliers, strongly skewed distributions, or bimodal distributions can distort the results, especially in small samples.

## 3.Homogeneity of variance: The variance of the response variable should be equal in all groups. This assumption is important because ANOVA pools the within-group variances into a single error estimate. When variances are unequal, the F-test can be too liberal or too conservative, so ANOVA may mistakenly conclude that there are significant differences between groups. Unequal group sizes do not cause unequal variances, but they make the test much more sensitive to them; extreme values in one or more groups can inflate that group's variance.

## 4.Random Sampling: The observations should be collected using a random sampling technique so that the results are generalizable to the population.

# Examples of violations that could impact the validity of the results include:

## 1.Violation of independence: This can occur when the data are collected from a hierarchical structure, such as students in the same classroom, repeated measures over time, or participants nested within groups. The non-independence of the data can result in the clustering of the data, which can lead to the overestimation of the statistical significance.

## 2.Violation of normality: Non-normality in the data can result from outliers, skewed distributions, or bimodal distributions. Outliers in particular can pull group means and inflate variances, which makes the F-statistic and its p-value unreliable, especially in small samples.

## 3.Violation of homogeneity of variance: This can occur when one group is far more variable than the others, for example because of extreme values in that group. The problem is most serious when the group sizes are also unequal. Unequal variances can result in the underestimation or overestimation of the statistical significance of the results.

## 4.Violation of random sampling: Non-random sampling techniques can lead to biased estimates of the population parameters. This can result in the overestimation or underestimation of the statistical significance of the results.
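The normality and homogeneity-of-variance assumptions above can be checked before running the ANOVA. The following is a minimal sketch, assuming hypothetical simulated data in which the third group is deliberately given a much larger spread; it uses the Shapiro-Wilk test on the residuals and Levene's test across groups, which are common choices but not the only ones.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: three groups, the third with a much larger spread
groups = [
    rng.normal(loc=50, scale=5, size=40),
    rng.normal(loc=52, scale=5, size=40),
    rng.normal(loc=51, scale=20, size=40),
]

# Normality: Shapiro-Wilk test on the residuals (value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.4f}")
print(f"Levene p-value:       {levene_p:.4f}")
```

A small p-value from either test suggests the corresponding assumption is doubtful. Independence and random sampling cannot be tested this way; they have to be judged from how the data were collected.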

## Q2. What are the three types of ANOVA, and in what situations would each be used?
## Ans2- ANOVA (Analysis of Variance) is a statistical method used to compare the means of two or more groups. There are three types of ANOVA: One-way ANOVA, Two-way ANOVA, and N-way ANOVA.

## 1.One-way ANOVA:
## One-way ANOVA is used when there is only one independent variable or factor in the study. It is used to compare the means of three or more groups to determine whether there is a significant difference between them. For example, it can be used to compare the effectiveness of three different types of pain relievers. One-way ANOVA tests the null hypothesis that all means are equal, and the alternative hypothesis that at least one mean is different from the others.

## 2.Two-way ANOVA:
## Two-way ANOVA is used when there are two independent variables or factors in the study. It is used to determine the effect of each independent variable on the dependent variable and to test whether there is an interaction effect between the two independent variables. For example, it can be used to determine the effect of a new drug on blood pressure in both men and women, with treatment and sex as the two factors. Two-way ANOVA tests three null hypotheses: that the means do not differ across the levels of the first factor, that they do not differ across the levels of the second factor, and that there is no interaction effect between the two factors.

## 3.N-way ANOVA:
## N-way ANOVA is used when there are more than two independent variables or factors in the study. It is used to analyze the effects of multiple independent variables on a single dependent variable. For example, it can be used to determine the effect of diet, exercise, and medication on blood sugar levels in diabetic patients. N-way ANOVA tests the null hypothesis that there is no difference in means between the groups defined by all the factors and no interaction effect between any of the factors.

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
## Ans3- Partitioning of variance in ANOVA refers to the process of dividing the total variation in the data into different sources of variation. The total variation in the data is the sum of the variation between the groups and the variation within the groups.

## Understanding the partitioning of variance is important because it allows researchers to determine the sources of variation in their data, and to estimate the effect size of the independent variable(s) on the dependent variable. By partitioning the variance, researchers can determine whether the differences between the groups are due to chance or to the independent variable(s) being studied.

## In ANOVA, the total variance is decomposed into two components: between-group variance and within-group variance. The between-group variance measures the variation between the means of the groups, while the within-group variance measures the variation within each group. The ratio of the between-group mean square to the within-group mean square is the F-statistic. Effect size is measured separately, for example by eta squared, the between-group sum of squares divided by the total sum of squares.

## This partitioning of variance also enables researchers to calculate different statistical values such as F-ratio, p-value, and effect size, which are used to assess the significance of the differences between the means of the groups. These values help researchers to make statistical inferences and conclusions about the data and the relationship between the independent variable(s) and the dependent variable.
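The partitioning described above can be worked through by hand. This is a minimal sketch, assuming three hypothetical simulated groups; it computes the total, between-group, and within-group sums of squares with NumPy, derives the F-ratio and eta squared from them, and compares the F-ratio with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)  # fixed seed for reproducibility

# Hypothetical data: three groups of 20 observations with different means
groups = [rng.normal(loc=m, scale=3, size=20) for m in (10, 12, 15)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: every observation against the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()
# Between-group variation: group means against the grand mean, weighted by size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: every observation against its own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

k = len(groups)       # number of groups
n = len(all_values)   # total number of observations

# Mean squares are sums of squares divided by their degrees of freedom
f_ratio = (ss_between / (k - 1)) / (ss_within / (n - k))
eta_squared = ss_between / ss_total

f_scipy, p_scipy = stats.f_oneway(*groups)

print("SS total:", ss_total)
print("SS between + SS within:", ss_between + ss_within)
print("F (manual):", f_ratio, "F (scipy):", f_scipy)
print("eta squared:", eta_squared)
```

The two sums of squares add up to the total, and the manual F-ratio agrees with the library value, which is the partitioning in practice.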

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data
df = pd.read_csv('data.csv')

# Fit the one-way ANOVA model; C() marks the independent variable as categorical
model = ols('dependent_variable ~ C(independent_variable)', data=df).fit()

# The ANOVA table has one row for the factor and one row for the residual
anova_table = sm.stats.anova_lm(model, typ=1)

# Explained sum of squares (SSE): variation between the groups
sse = anova_table['sum_sq'].iloc[0]

# Residual sum of squares (SSR): variation within the groups
ssr = anova_table['sum_sq'].iloc[1]

# Total sum of squares (SST) is the sum of the two components
sst = sse + ssr

print('SST:', sst)
print('SSE:', sse)
print('SSR:', ssr)


FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data
df = pd.read_csv('data.csv')

# Fit the two-way ANOVA model; C() marks each factor as categorical and
# '*' expands to both main effects plus their interaction
model = ols('dependent_variable ~ C(independent_variable1) * C(independent_variable2)', data=df).fit()

# The ANOVA table holds the sum of squares, F-statistic, and p-value per effect
anova_table = sm.stats.anova_lm(model, typ=2)

# Main effects
main_effect1 = anova_table.loc['C(independent_variable1)']
main_effect2 = anova_table.loc['C(independent_variable2)']

# Interaction effect
interaction_effect = anova_table.loc['C(independent_variable1):C(independent_variable2)']

print('Main effect 1:\n', main_effect1)
print('Main effect 2:\n', main_effect2)
print('Interaction effect:\n', interaction_effect)


FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?
## Ans6- If we conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, we can conclude that there is evidence of significant differences between the groups.

## The F-statistic is a ratio of the variance between groups to the variance within groups. If the F-statistic is large and the p-value is small, it suggests that the variance between groups is significantly larger than the variance within groups, indicating that there are significant differences between the groups.

## The p-value of 0.02 means that, if the null hypothesis were true and all group means were equal, there would be only a 2% chance of observing an F-statistic at least this large. This is below the typical significance level of 0.05, so we reject the null hypothesis that the means of all groups are equal.

## Therefore, we can conclude that at least one of the groups differs significantly from the others in terms of the variable being measured. However, we cannot determine which specific group or groups are different from the others based on the ANOVA alone. Post-hoc tests or pairwise comparisons would be necessary to make these comparisons.
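The link between an F-statistic and its p-value can be sketched with the F distribution in SciPy. The question does not state the degrees of freedom, so the values below (3 groups and 30 observations) are assumptions, and the resulting p-value will not necessarily equal the 0.02 quoted in the question.

```python
from scipy import stats

f_statistic = 5.23
df_between = 2   # assumed: 3 groups, so k - 1 = 2
df_within = 27   # assumed: 30 observations, so n - k = 27

# p-value: probability of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value: the F above which the null is rejected at alpha = 0.05
critical_value = stats.f.ppf(0.95, df_between, df_within)

print("p-value:", p_value)
print("critical F at alpha = 0.05:", critical_value)
```

Rejecting the null hypothesis when the p-value is below 0.05 is equivalent to the observed F-statistic exceeding the critical value.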

## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?
## Ans7- In a repeated measures ANOVA, missing data can be handled in several ways:

## 1.Listwise deletion (complete-case analysis): This approach involves analyzing only the cases that have complete data for all variables of interest. It can lead to a loss of statistical power, and to biased results if the data are not missing completely at random (MCAR). Pairwise deletion, by contrast, uses every case that is available for each individual comparison.

## 2.Mean imputation: This approach involves replacing the missing values with the mean value for that variable across all cases. It shrinks the variance of the variable and weakens its relationship with other variables even when the data are MCAR, and it introduces further bias when they are not.

## 3.Regression imputation: This approach involves predicting the missing values based on the observed values and other covariates using a regression model. This approach can improve the accuracy of the results compared to mean imputation, but because the imputed values fall exactly on the regression line it understates variability, and it can still introduce bias if the imputation model is wrong.

## 4.Multiple imputation: This approach involves creating multiple plausible values for each missing value based on the observed data and statistical models, analyzing each completed data set, and then pooling the results to estimate the ANOVA parameters. It generally gives less biased estimates and more realistic standard errors than the previous approaches, provided the data are missing at random (MAR) and the imputation model is reasonable.

## The potential consequences of using different methods to handle missing data include biased results, reduced power, and inaccurate estimates of standard errors. The choice of method depends on the amount and pattern of missing data, as well as the assumptions of the statistical model being used. It is important to carefully consider the potential biases and limitations of each method and to choose the approach that is most appropriate for the data at hand.
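Two of the consequences mentioned above, lost cases and understated variability, can be illustrated on simulated data. This is a minimal sketch, assuming a hypothetical wide-format table of 20 subjects measured at three time points (`t1`, `t2`, `t3`) with five `t3` values removed completely at random.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # fixed seed for reproducibility

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame(
    rng.normal(loc=[50, 52, 55], scale=5, size=(20, 3)),
    columns=["t1", "t2", "t3"],
)

# Remove 5 of the t3 measurements completely at random
missing_rows = rng.choice(wide.index, size=5, replace=False)
wide.loc[missing_rows, "t3"] = np.nan

# Listwise deletion: drop every subject with any missing measurement
listwise = wide.dropna()

# Mean imputation: replace each missing value with its column mean
mean_imputed = wide.fillna(wide.mean())

print("Subjects kept after listwise deletion:", len(listwise))
print("t3 standard deviation, observed values:", wide["t3"].std())
print("t3 standard deviation, mean-imputed:   ", mean_imputed["t3"].std())
```

Listwise deletion discards a quarter of the subjects here, while mean imputation keeps them all but reports a smaller standard deviation than the observed values support.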

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.
## Ans8- Post-hoc tests are used to determine which specific groups differ significantly from each other after a significant result is obtained from an ANOVA. Some common post-hoc tests include:

## 1.Tukey's honestly significant difference (HSD) test: This test controls the familywise error rate and compares all possible pairs of group means. This test is commonly used when the group sample sizes are equal (the Tukey-Kramer variant handles unequal sizes) and the variances are homogeneous.

## 2.Bonferroni correction: This method adjusts for multiple comparisons by dividing the significance level by the number of comparisons. It is commonly used when the number of comparisons is small, because it becomes very conservative as that number grows.

## 3.Scheffé's test: This test is a conservative approach that controls the familywise error rate across all possible contrasts, not only pairwise comparisons. It is commonly used when complex comparisons are of interest, such as the average of two groups against a third. It accommodates unequal sample sizes but still assumes homogeneous variances.

## 4.Dunnett's test: This test compares each group mean to a control group mean. This test is commonly used when there is a control group and the primary interest is in comparing the other groups to the control group.

## 5.Games-Howell test: This test does not assume equal variances or sample sizes and compares all possible pairs of group means. This test is commonly used when the assumptions of equal variances or sample sizes are violated.

## A situation where a post-hoc test might be necessary is when conducting a one-way ANOVA and obtaining a significant result, indicating that there are differences between the groups. To determine which specific groups differ significantly from each other, a post-hoc test would be conducted. For example, suppose we conducted a study to compare the effectiveness of four different medications for reducing blood pressure. After conducting a one-way ANOVA, we obtained a significant result. To determine which specific medications are different from each other, we would conduct a post-hoc test, such as Tukey's HSD test, to compare all possible pairs of group means. This would help us identify the specific medication(s) that are significantly different from each other in terms of their effectiveness for reducing blood pressure.
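The medication example can be sketched with `pairwise_tukeyhsd` from statsmodels. The data below are simulated, and the medication labels and true means are assumptions chosen so that only the fourth medication differs from the others.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(4)  # fixed seed for reproducibility

# Hypothetical blood pressure reductions: 25 patients per medication
labels = np.repeat(["med_1", "med_2", "med_3", "med_4"], 25)
true_means = np.repeat([10.0, 10.0, 10.0, 18.0], 25)
reduction = rng.normal(loc=true_means, scale=3.0)

# Tukey's HSD compares every pair of groups at a familywise alpha of 0.05
result = pairwise_tukeyhsd(endog=reduction, groups=labels, alpha=0.05)

# One row per pair: mean difference, adjusted p-value, interval, reject flag
print(result.summary())
```

With four groups there are six pairwise comparisons, and the `reject` column of the summary marks the pairs whose means differ significantly.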

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.


In [3]:
import pandas as pd
import scipy.stats as stats

# Load data
data = pd.read_csv('diet_data.csv')
# One-way ANOVA
f_stat, p_value = stats.f_oneway(data[data['diet'] == 'A']['weight_loss'],
                                 data[data['diet'] == 'B']['weight_loss'],
                                 data[data['diet'] == 'C']['weight_loss'])
# Print results
print('F-statistic:', f_stat)
print('p-value:', p_value)

# Interpretation
if p_value < 0.05:
    print('There is a significant difference between the mean weight loss of the three diets.')
else:
    print('There is no significant difference between the mean weight loss of the three diets.')




FileNotFoundError: [Errno 2] No such file or directory: 'diet_data.csv'

## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load data
data = pd.read_csv('software_data.csv')
# Two-way ANOVA
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=data).fit()
table = sm.stats.anova_lm(model, typ=2)
# Print ANOVA table
print(table)

# Interpretation
if table.loc['C(program)', 'PR(>F)'] < 0.05:
    print('There is a significant main effect of software program.')
else:
    print('There is no significant main effect of software program.')
    
if table.loc['C(experience)', 'PR(>F)'] < 0.05:
    print('There is a significant main effect of experience level.')
else:
    print('There is no significant main effect of experience level.')
    
if table.loc['C(program):C(experience)', 'PR(>F)'] < 0.05:
    print('There is a significant interaction effect between program and experience.')
else:
    print('There is no significant interaction effect between program and experience.')




FileNotFoundError: [Errno 2] No such file or directory: 'software_data.csv'

## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [5]:
import numpy as np
from scipy.stats import ttest_ind

# generate some sample data
control_scores = np.random.normal(70, 10, size=100)
experimental_scores = np.random.normal(75, 10, size=100)

# conduct the t-test
t_stat, p_value = ttest_ind(control_scores, experimental_scores)

# print the results
print("t-statistic:", t_stat)
print("p-value:", p_value)


t-statistic: -4.275142811766609
p-value: 2.968165305220173e-05
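The output above comes from unseeded random data with 100 scores per group, so it changes on every run. The following is a minimal sketch of a reproducible variant that splits the 100 students into 50 per group, as the question describes. The seed, the population means, and the use of Welch's version of the test (`equal_var=False`) are assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(5)  # fixed seed so the result is reproducible

# Hypothetical test scores: 50 students per group, 100 in total
control = rng.normal(loc=70, scale=10, size=50)
experimental = rng.normal(loc=75, scale=10, size=50)

# Welch's t-test does not assume equal variances in the two groups
t_stat, p_value = ttest_ind(experimental, control, equal_var=False)
mean_difference = experimental.mean() - control.mean()

print("t-statistic:", t_stat)
print("p-value:", p_value)
print("mean difference (experimental - control):", mean_difference)
```

With only two groups, a significant t-test already identifies which groups differ, so no separate post-hoc test is needed; the sign of the mean difference shows which group scored higher.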


## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [6]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a dataframe with the sales data
data = pd.DataFrame({
    'store': ['A']*30 + ['B']*30 + ['C']*30,
    'day': list(range(1, 31))*3,
    'sales': [10, 12, 15, 13, 11, 14, 9, 12, 13, 10, 11, 13, 8, 10, 12, 14, 16, 12, 10, 11, 14, 16, 13, 12, 14, 17, 15, 11, 12, 14,
              11, 15, 16, 14, 13, 15, 12, 9, 11, 12, 14, 13, 10, 11, 12, 13, 14, 15, 17, 15, 12, 11, 14, 13, 12, 11, 10, 9, 11, 13, 12,
              18, 15, 13, 16, 14, 17, 12, 13, 14, 15, 16, 13, 14, 15, 12, 13, 11, 12, 10, 9, 10, 11, 12, 15, 13, 11, 10, 12, 14, 15, 16,
              10, 12, 11, 14, 13, 16, 15, 14, 13, 12, 11, 10, 9, 11, 13, 12, 14, 15]
})

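# conduct the repeated measures ANOVA: each day is the repeated "subject"
# and store is the within-subject factor (ols with C(store) alone would be
# an ordinary one-way ANOVA that ignores the pairing by day)
from statsmodels.stats.anova import AnovaRM

results = AnovaRM(data, depvar='sales', subject='day', within=['store']).fit()

# print the results
print(results)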


ValueError: All arrays must be of the same length
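The `ValueError` above arises because the `sales` list does not contain exactly the 90 values that the `store` and `day` columns require. The following is a minimal sketch of the full analysis on simulated data instead. The store means, the shared day-to-day effect, and the choice of Bonferroni-adjusted paired t-tests as the post-hoc step are all assumptions.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(6)  # fixed seed for reproducibility

# Hypothetical sales: 30 days, each observed at all three stores
days = np.arange(1, 31)
day_effect = rng.normal(0, 2, size=30)  # fluctuation shared by all stores
store_means = {"A": 12.0, "B": 12.5, "C": 16.0}

frames = []
for store, mean in store_means.items():
    sales = mean + day_effect + rng.normal(0, 1.5, size=30)
    frames.append(pd.DataFrame({"store": store, "day": days, "sales": sales}))
data = pd.concat(frames, ignore_index=True)

# Repeated measures ANOVA: day is the subject, store the within-subject factor
results = AnovaRM(data, depvar="sales", subject="day", within=["store"]).fit()
print(results.anova_table)

# Post-hoc: paired t-tests between stores with a Bonferroni adjustment
wide = data.pivot(index="day", columns="store", values="sales")
pairs = list(combinations(wide.columns, 2))
for first, second in pairs:
    t_stat, p = ttest_rel(wide[first], wide[second])
    p_adjusted = min(p * len(pairs), 1.0)
    print(f"{first} vs {second}: t = {t_stat:.2f}, adjusted p = {p_adjusted:.4f}")
```

Treating the day as the repeated unit removes the shared day-to-day fluctuation from the error term, which is what distinguishes this analysis from an ordinary one-way ANOVA on the same numbers.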