# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Ans=Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more groups to determine whether they differ significantly. Its p-values are only trustworthy when certain assumptions hold, and violating them can undermine the results. The key assumptions for ANOVA are:

Independence: The observations are assumed to be independent of each other, both within a group and across groups. Knowing one observation should tell you nothing about another. For example, in a medical study comparing the effectiveness of different treatments on patients, it's essential that the outcomes for one patient do not affect the outcomes for another.

Normality: The data within each group (equivalently, the model residuals) should follow an approximately normal distribution. This assumption underpins the accuracy of the p-values and confidence intervals, and it matters most when samples are small; with large, balanced samples ANOVA is fairly robust to moderate departures. You can check normality using graphical methods like histograms and Q-Q plots, or statistical tests like the Shapiro-Wilk test.

Homogeneity of Variance (Homoscedasticity): The variances within each group or treatment should be approximately equal. In other words, the spread of data points around the group means should be similar for all groups. Heteroscedasticity, where variances are not equal, can lead to inaccurate p-values and confidence intervals, especially when group sizes also differ. You can check for homoscedasticity using tests like Levene's test or by visually inspecting side-by-side boxplots or residual plots.

Examples of Violations and Their Impacts on ANOVA Results:

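Non-Normal Data: If the data within groups do not follow a normal distribution, the p-values and confidence intervals generated by ANOVA may not be reliable. For example, with strongly skewed data and small samples, ANOVA may indicate significant differences that do not exist or fail to detect real differences.

Non-Independent Observations: If the same participants are measured several times, or if patients are clustered within the same clinic, the observations are correlated. Treating them as independent understates the variability and tends to inflate the Type I error rate.

Unequal Variances: If one group is far more variable than the others, particularly when that group is also the smallest, the F-test can produce misleading p-values. In that case a Welch-type ANOVA is usually a safer choice.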
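The normality and equal-variance checks mentioned above can be sketched with SciPy. The three groups below are made-up illustrative values, and the 0.05 cutoff is only a conventional choice, not a rule from this assignment.

```python
import numpy as np
from scipy import stats

# Hypothetical example data (not from the assignment)
groups = {
    "A": np.array([20, 25, 30, 35, 40]),
    "B": np.array([15, 18, 25, 30, 35]),
    "C": np.array([10, 15, 20, 25, 30]),
}

# Shapiro-Wilk test per group: a small p-value suggests non-normality
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")

# Levene's test across groups: a small p-value suggests unequal variances
levene_stat, levene_p = stats.levene(*groups.values())
print(f"Levene p-value: {levene_p:.3f}")
```

With samples this small, both tests have little power, so it is sensible to read them alongside a Q-Q plot or boxplot rather than on their own.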

# Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans=Analysis of Variance (ANOVA) is a statistical technique used to compare the means of three or more groups or treatments. There are three primary types of ANOVA, each designed for specific situations:

One-Way ANOVA:

Situation: One-Way ANOVA is used when you have one categorical independent variable with three or more levels (groups or treatments) and you want to determine if there are statistically significant differences in the means of a continuous dependent variable among these groups.

Example: You have four different types of fertilizer and you want to test if they have different effects on plant growth. The independent variable is the type of fertilizer (four levels), and the dependent variable is the plant's growth height.

Two-Way ANOVA:

Situation: Two-Way ANOVA is used when you have two categorical independent variables (factors), each with multiple levels, and you want to examine their individual and interactive effects on a continuous dependent variable. This allows you to assess the impact of each factor independently as well as the combined effect of the two factors.

Example: You are studying the effects of both temperature (hot and cold) and humidity (low and high) on the growth of plants. You have two independent variables (temperature and humidity) with two levels each, and the dependent variable is plant growth.

Three-Way ANOVA:

Situation: Three-Way ANOVA is an extension of Two-Way ANOVA, but it involves three independent variables (factors) and examines their individual and interactive effects on a continuous dependent variable. This is used when you have more complex experimental designs with three categorical independent variables.

Example: In a psychological study, you are investigating the effects of three factors: gender (male and female), age group (young, middle-aged, and elderly), and education level (high school, college, and postgraduate) on memory performance. The dependent variable is memory score.

# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans=The partitioning of variance in Analysis of Variance (ANOVA) refers to the division of the total variance in a dataset into different components or sources of variation, typically the variation between groups and the variation within groups. It is the fundamental idea behind ANOVA: it shows the relative importance of the factors you are investigating and is what makes valid statistical inference across multiple groups possible. Understanding this concept is crucial for several reasons:

Identification of Sources of Variation: ANOVA allows you to determine how much of the total variation in the data can be attributed to different sources or factors. By partitioning the variance, you can identify which factors (independent variables) contribute significantly to the observed differences in the dependent variable.

Hypothesis Testing: ANOVA uses the partitioning of variance to perform hypothesis tests. It helps you test whether the means of the groups or treatments are significantly different from each other. The variance components are used to calculate F-statistics and p-values to make informed statistical decisions.

Interpretation of Results: Understanding how the variance is partitioned allows you to interpret the results of an ANOVA. You can assess which factors are statistically significant and have a substantial impact on the dependent variable. This information is essential for drawing meaningful conclusions from your analysis.
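For a one-way design the partition can be written compactly. The notation follows the labels used in Q4 (SST for total, SSE for explained or between-group, SSR for residual or within-group); the symbols k (number of groups), n_j (size of group j) and N (total number of observations) are introduced here only for illustration.

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}\right)^2}_{SST}
  = \underbrace{\sum_{j=1}^{k} n_j \left(\bar{x}_j - \bar{x}\right)^2}_{SSE}
  + \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} \left(x_{ij} - \bar{x}_j\right)^2}_{SSR},
\qquad
F = \frac{SSE / (k - 1)}{SSR / (N - k)}
```

Here \bar{x} is the grand mean and \bar{x}_j is the mean of group j, so the F-statistic is simply the ratio of the two mean squares obtained from the partition.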

# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
import scipy.stats as stats

In [6]:
group1 = np.array([20,25,30,35,40])
group2 = np.array([15,18,25,30,35])
group3 = np.array([10,15,20,25,30])

data = np.concatenate([group1,group2,group3])

grand_mean = np.mean(data)

# calculate the total sum of squares (SST)
squared_deviations = np.sum((data - grand_mean) ** 2)
SST = squared_deviations

# calculate the group means
mean_group1 = np.mean(group1)
mean_group2 = np.mean(group2)
mean_group3 = np.mean(group3)

# calculate the explained (between-group) sum of squares (SSE)
SSE = len(group1) * (mean_group1 - grand_mean) ** 2 + \
       len(group2) * (mean_group2 - grand_mean) ** 2 + \
       len(group3) * (mean_group3 - grand_mean) ** 2

# calculate the Residual Sum of Squares (SSR)
SSR = SST - SSE

# degrees of freedom
df_total = len(data) - 1
df_groups = 3 - 1  # number of groups - 1
df_residual = df_total - df_groups

# Calculate the Mean Squares
MS_groups = SSE / df_groups
MS_residual = SSR / df_residual

# calculate the F-statistic
F_statistic = MS_groups / MS_residual
p_value = 1 - stats.f.cdf(F_statistic, df_groups, df_residual)

print(f"Total Sum of Squares (SST): {SST}")
print(f"Explained Sum of squares (SSE): {SSE}")
print(f"Residual Sum of Squares (SSR) : {SSR}")
print(f"F-statistic: {F_statistic}")
print(f"P-value: {p_value}")


Total Sum of Squares (SST): 1023.7333333333331
Explained Sum of squares (SSE): 250.53333333333333
Residual Sum of Squares (SSR) : 773.1999999999998
F-statistic: 1.944128297982411
P-value: 0.18562223853669657
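As a sanity check, the manual calculation can be compared with SciPy's built-in one-way ANOVA on the same three groups. This is only a cross-check sketch; the names `f_check` and `p_check` are chosen here for illustration.

```python
import numpy as np
from scipy import stats

group1 = np.array([20, 25, 30, 35, 40])
group2 = np.array([15, 18, 25, 30, 35])
group3 = np.array([10, 15, 20, 25, 30])

# f_oneway performs the same between/within partition internally
f_check, p_check = stats.f_oneway(group1, group2, group3)

print(f"F-statistic: {f_check:.4f}")
print(f"P-value: {p_check:.4f}")
```

The two values should agree with the F-statistic and p-value printed by the manual computation, up to floating-point rounding.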


# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [16]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

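# create a sample dataset (replace this with your data)
# NOTE: in this sample A1 always occurs with B1 and A2 with B2, so the two factors
# are confounded and their main effects and interaction cannot be separated.
# Real data should cover every Factor_A x Factor_B combination.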

data = {
    'Factor_A': ['A1','A2','A1','A2','A1','A2','A1','A2'],
    'Factor_B': ['B1','B2','B1','B2','B1','B2','B1','B2'],
    'DV': [10,12,15,14,20,22,18,16]
    
}

df = pd.DataFrame(data)

# fit the two-way ANOVA model (main effects plus interaction)
model = ols('DV ~ C(Factor_A) * C(Factor_B)', data=df).fit()

# print the ANOVA table
print(sm.stats.anova_lm(model,typ=2))


                               sum_sq   df             F    PR(>F)
C(Factor_A)             -4.610102e-17  1.0 -2.453269e-18  1.000000
C(Factor_B)              2.936198e-01  1.0  1.562500e-02  0.904606
C(Factor_A):C(Factor_B)  1.250000e-01  1.0  6.651885e-03  0.937650
Residual                 1.127500e+02  6.0           NaN       NaN


# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

Ans=In a one-way ANOVA, the F-statistic and its associated p-value are used to assess whether there are statistically significant differences among the means of three or more groups. Let's interpret the results you've obtained:

The F-Statistic (F = 5.23): This statistic quantifies the variation between the group means relative to the variation within the groups. A higher F-statistic suggests that there is more variation between the group means compared to within the groups. In your case, F = 5.23 means the between-group mean square is about 5.23 times the within-group mean square.

The P-Value (p = 0.02): The p-value represents the probability of observing such a result (or more extreme) if there were no true differences between the groups. In other words, it tests the null hypothesis that all group means are equal. A low p-value (in this case, p = 0.02) indicates that the observed differences between groups are statistically significant at the specified significance level (e.g., α = 0.05).

Interpretation:

Given the results you provided (F = 5.23 and p = 0.02), you can make the following conclusions:

Statistical Significance: The p-value (p = 0.02) is less than the chosen significance level (e.g., α = 0.05). This gives sufficient evidence, at the 5% level, to reject the null hypothesis that all group means are equal. Note that the result would not be significant at a stricter level such as α = 0.01.

Differences Between Groups: Since you've rejected the null hypothesis, you can conclude that at least one group mean differs from the others. In other words, not all group means are equal. The F-test alone does not tell you which groups differ.

Post-hoc Testing: After finding a significant result in the one-way ANOVA, it's common practice to conduct post-hoc tests (e.g., Tukey's HSD, Bonferroni) to determine which specific groups are different from each other. These tests help identify which groups are driving the observed differences.

Effect Size: It's also useful to calculate and report an effect size measure (e.g., eta-squared or partial eta-squared) to quantify the practical significance of the differences. This can help in understanding the magnitude of the group differences.
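A minimal sketch of the eta-squared calculation follows. The Q6 scenario does not provide sums of squares, so the numbers plugged in here are the explained and total sums of squares printed in Q4; the helper name `eta_squared` is hypothetical.

```python
def eta_squared(ss_between: float, ss_total: float) -> float:
    """Proportion of total variance explained by group membership."""
    return ss_between / ss_total

# Sums of squares taken from the Q4 output (explained SS and total SS)
eta_sq = eta_squared(250.53333333333333, 1023.7333333333331)
print(f"eta-squared: {eta_sq:.3f}")
```

A value near 0.24 would mean that roughly a quarter of the total variation is associated with group membership, even though the F-test in Q4 was not significant with such small groups.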

# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Ans=Handling missing data in a repeated measures ANOVA is an important aspect of data analysis, as missing data can lead to biased or inaccurate results. There are several methods to handle missing data, and the choice of method can impact the validity of your analysis. Here's how you can handle missing data and the potential consequences of different methods:

Listwise Deletion (Complete Case Analysis): This approach involves removing cases (participants or data points) with missing values on any variable used in the analysis. It simplifies the analysis but may lead to reduced sample size, loss of statistical power, and potential bias if the data is not missing completely at random (MCAR). This method is not recommended when missingness is related to the outcome or predictors.

Pairwise Deletion (Available Case Analysis): In this method, you use all available data for each analysis, so different comparisons can be based on different sample sizes. While it makes more efficient use of the available data than listwise deletion, it can lead to bias and inaccurate results if the data are not missing completely at random (MCAR), and the differing sample sizes can make the results hard to combine and interpret.

Potential Consequences of Using Different Methods:

Loss of Statistical Power: Listwise deletion discards every participant with any missing value, and pairwise deletion still drops the missing observations from each comparison, so both reduce statistical power. Imputation and maximum likelihood estimation (MLE) mitigate this by using all observed data.

Bias: Listwise and pairwise deletion can introduce bias unless the data are missing completely at random (MCAR). Imputation and MLE generally require only the weaker missing at random (MAR) assumption, but they can still be biased if the imputation or analysis model is misspecified or if the data are missing not at random (MNAR).

Reduced Precision: Imputation methods can provide more precise estimates, while deletion methods may lead to less precise results due to the reduced sample size.

Generalizability: The choice of method can impact the generalizability of the results. Deletion methods may result in a sample that is not representative of the population, while imputation methods aim to preserve generalizability.
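The difference in retained sample size can be sketched with pandas. The wide-format table below is entirely hypothetical (five subjects, three time points), and simple mean imputation is used only because it is short to show; it understates variability and is not a recommended default.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data with two missing values
wide = pd.DataFrame({
    'subject': [1, 2, 3, 4, 5],
    'time1': [5.0, 6.0, np.nan, 7.0, 6.5],
    'time2': [5.5, np.nan, 6.0, 7.5, 7.0],
    'time3': [6.0, 6.5, 6.5, 8.0, 7.5],
})
measure_cols = ['time1', 'time2', 'time3']

# Listwise deletion: drop any subject with a missing measurement
listwise = wide.dropna(subset=measure_cols)

# Mean imputation: fill each gap with the mean of that time point
mean_imputed = wide.copy()
mean_imputed[measure_cols] = wide[measure_cols].fillna(wide[measure_cols].mean())

print("Subjects kept after listwise deletion:", len(listwise))
print("Subjects kept after mean imputation:", len(mean_imputed))
```

Listwise deletion removes two of the five subjects here, while imputation keeps all five at the cost of relying on the imputation model.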

# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Ans=Post-hoc tests are used in the context of analysis of variance (ANOVA) to further examine and compare group means after finding a significant overall difference among groups. ANOVA helps determine if there are statistically significant differences among multiple groups, but it doesn't specify which groups differ from each other. Post-hoc tests are used to identify these specific group differences. Common post-hoc tests include:

Tukey's Honestly Significant Difference (Tukey's HSD):

Use when you have performed a one-way ANOVA and found a significant difference among multiple groups.
It controls the family-wise error rate, making it suitable for situations where you want to maintain a low risk of Type I errors. This means it's less likely to result in false-positive findings.
Example: You conducted an ANOVA to compare the performance of three different teaching methods in improving student test scores. Tukey's HSD can help you determine which specific teaching methods are significantly different from each other.

Bonferroni Correction:

Use when you have a small number of planned pairwise comparisons and want to control the family-wise error rate by testing each comparison at α divided by the number of comparisons.
It is more conservative than Tukey's HSD when many comparisons are made, which lowers the risk of Type I errors at the cost of statistical power.
Example: In a medical study, you want to compare the effectiveness of four different medications in treating a specific condition. The Bonferroni correction can help you identify which medications have significantly different effects.

Scheffé's Method:

Use when you want to test complex contrasts (for example, the average of two groups against a third) in addition to pairwise comparisons; it also accommodates unequal sample sizes. It still assumes equal variances, so when variances differ a test such as Games-Howell is more appropriate.
It is the most conservative of the three, making it suitable when you want to minimize the risk of false positives across every possible contrast.
Example: You are comparing the productivity of employees in five different departments with unequal sample sizes, and you also want to compare groups of departments against each other. Scheffé's method can help you determine which departments, or combinations of departments, have significantly different levels of productivity.
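Tukey's HSD for the teaching-method example can be sketched with statsmodels' `pairwise_tukeyhsd`. The scores and the labels `Method_A`, `Method_B`, `Method_C` are invented for illustration and are deliberately well separated.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical test scores, five students per teaching method
scores = np.array([70, 72, 68, 71, 69,
                   80, 82, 78, 81, 79,
                   90, 92, 88, 91, 89])
methods = np.repeat(['Method_A', 'Method_B', 'Method_C'], 5)

# All pairwise comparisons with family-wise error rate held at alpha
tukey = pairwise_tukeyhsd(endog=scores, groups=methods, alpha=0.05)
print(tukey.summary())
```

Each row of the summary is one pair of methods, with the mean difference, the adjusted p-value, and a `reject` flag indicating whether that pair differs at the chosen alpha.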

# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.


In [22]:
import numpy as np
import scipy.stats as stats

diet_A =[2.1, 1.8, 2.0, 2.4, 2.6, 1.9, 2.2, 2.0, 2.3, 2.1, 1.8, 2.2, 2.5, 2.3, 2.1, 2.0, 2.4, 2.6, 2.2, 2.1, 2.4, 2.7, 2.1, 2.0, 2.3]
diet_B = [1.9, 1.7, 1.8, 2.0, 1.7, 1.9, 2.1, 1.8, 2.0, 1.7, 1.6, 1.9, 2.1, 2.0, 1.8, 1.7, 1.9, 2.2, 1.9, 1.8, 1.6, 2.0, 2.1, 1.8, 1.7]
diet_C = [1.5, 1.4, 1.3, 1.6, 1.5, 1.6, 1.7, 1.4, 1.6, 1.5, 1.4, 1.6, 1.7, 1.5, 1.4, 1.6, 1.8, 1.7, 1.5, 1.6, 1.4, 1.7, 1.6, 1.4, 1.5]

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

print("F-statistic:", f_statistic)
print("p-value:", p_value)

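F-statistic: 80.52467532467527
p-value: 4.323600643785061e-19

Interpretation: The p-value is far below 0.05, so we reject the null hypothesis that the three diets produce the same mean weight loss. For this illustrative sample (25 values per diet), the sample means are about 2.20 for Diet A, 1.87 for Diet B, and 1.54 for Diet C. The ANOVA shows that at least one diet differs; a post-hoc test such as Tukey's HSD would be needed to confirm which pairs differ.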


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [23]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

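# Create a DataFrame with simulated data (90 rows, 15 per Software x Experience cell)
# No random seed is set, so the numbers will differ on each run
data = pd.DataFrame({
    'Software': ['A', 'B', 'C'] * 30,  # Software programs
    'Experience': ['Novice', 'Experienced'] * 45,  # Employee experience levels
    'Time': np.random.normal(10, 2, 90)  # Random times, unrelated to either factor
})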

# Fit the ANOVA model
model = ols('Time ~ C(Software) * C(Experience)', data=data).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)


print(anova_table)

                               sum_sq    df         F    PR(>F)
C(Software)                  7.473529   2.0  0.975030  0.381409
C(Experience)                5.927139   1.0  1.546561  0.217103
C(Software):C(Experience)    4.711471   2.0  0.614679  0.543229
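Residual                   321.926871  84.0       NaN       NaN

Interpretation: None of the p-values is below 0.05, so there is no significant main effect of software (F = 0.98, p = 0.38), no significant main effect of experience (F = 1.55, p = 0.22), and no significant interaction (F = 0.61, p = 0.54). This is expected here, because the times were drawn from a single normal distribution regardless of software or experience.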


# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [24]:
import numpy as np
import scipy.stats as stats

# Sample data for the control and experimental groups
control_group = [80, 85, 90, 75, 78, 82, 88, 79, 84, 77, 81, 76, 83, 87, 72, 89, 74, 79, 81, 86, 75, 78, 82, 90, 74, 89, 84, 76, 83, 87]
experimental_group = [85, 89, 91, 78, 80, 86, 92, 81, 88, 79, 82, 77, 84, 90, 75, 91, 76, 80, 83, 87, 79, 82, 88, 92, 77, 91, 86, 78, 84, 90]

# Perform the two-sample (independent) t-test
t_statistic, p_value = stats.ttest_ind(experimental_group, control_group)

print("t-statistic:", t_statistic)
print("p-value:", p_value)

# With only two groups, a significant t-test already identifies which groups differ,
# so no separate post-hoc test is needed; the sample means show the direction.
print("Control mean:", np.mean(control_group))
print("Experimental mean:", np.mean(experimental_group))


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [26]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data with daily sales for three stores (Store A, Store B, Store C)
data = pd.DataFrame({
    'Day': range(1, 31),  # 30 days
    'Store_A': [100, 110, 95, 105, 102, 98, 101, 107, 103, 99, 108, 106, 97, 104, 110, 99, 103, 101, 96, 102, 98, 100, 105, 103, 108, 98, 106, 105, 101, 100],
    'Store_B': [105, 100, 98, 102, 103, 101, 104, 107, 100, 99, 105, 103, 96, 102, 109, 98, 102, 103, 97, 101, 99, 104, 108, 105, 110, 100, 108, 102, 101, 106],
    'Store_C': [110, 108, 105, 104, 102, 100, 109, 111, 103, 99, 112, 107, 98, 103, 110, 100, 105, 101, 98, 103, 97, 108, 110, 105, 111, 99, 107, 105, 102, 100]
})



In [None]:
# Reshape the data for repeated measures ANOVA
data_melted = pd.melt(data, id_vars=['Day'], value_vars=['Store_A', 'Store_B', 'Store_C'], var_name='Store', value_name='Sales')

# Fit the repeated measures ANOVA: each Day is the repeated "subject"
# and Store is the within-subject factor
from statsmodels.stats.anova import AnovaRM

repeated_measures_anova = AnovaRM(data_melted, depvar='Sales', subject='Day', within=['Store']).fit()

# Output the results (F value, degrees of freedom and p-value for Store)
print(repeated_measures_anova)
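If the repeated measures ANOVA is significant, one possible follow-up is a set of paired t-tests with a Bonferroni adjustment. The sketch below uses only the first six days of the sales data above to stay short, and the variable names are chosen here for illustration; it is one reasonable approach, not the only valid post-hoc procedure.

```python
from itertools import combinations
from scipy import stats

# First six days of sales for each store (subset used for brevity)
sales = {
    'Store_A': [100, 110, 95, 105, 102, 98],
    'Store_B': [105, 100, 98, 102, 103, 101],
    'Store_C': [110, 108, 105, 104, 102, 100],
}

pairs = list(combinations(sales, 2))
adjusted = {}
for first, second in pairs:
    # Paired test, because the same days are observed for both stores
    t_stat, p_raw = stats.ttest_rel(sales[first], sales[second])
    # Bonferroni: multiply by the number of comparisons, capped at 1
    adjusted[(first, second)] = min(p_raw * len(pairs), 1.0)

for pair, p_adj in adjusted.items():
    print(pair, f"adjusted p-value: {p_adj:.4f}")
```

A pair of stores would be declared different when its adjusted p-value falls below the chosen significance level.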