In [None]:
#Q1):-
ANOVA (Analysis of Variance) is a statistical technique used to compare means between two or more groups. To use ANOVA effectively and obtain valid results, certain assumptions need to be met. These assumptions include:

Independence: The observations should be independent of one another, both within each group and across groups. This means that the value of one observation should not be influenced by or related to the value of any other observation.

Normality: The data within each group (equivalently, the model residuals) should be approximately normally distributed, that is, roughly symmetric around the mean with a bell-shaped curve. This assumption matters most when the group sizes are small; with larger samples the F-test is fairly robust to moderate departures from normality.

Homogeneity of variances (also known as homoscedasticity): The variances of the groups being compared should be approximately equal, meaning that the spread of data within each group is similar across all groups. When the group variances differ substantially, the F-statistic and its p-value can be distorted.

Equal sample sizes (optional): Equal group sizes are not a formal assumption of ANOVA, but a balanced design simplifies the analysis and interpretation and makes the test more robust to unequal variances.

Violations of these assumptions can impact the validity of ANOVA results. Some examples of violations and their impact include:

Violation of independence: If the observations within groups are not independent, it can lead to biased and unreliable results. For example, if repeated measurements are taken on the same individuals over time, the assumption of independence is violated.

Violation of normality: If the data within groups are not normally distributed, it can affect the accuracy of p-values, confidence intervals, and hypothesis tests. Non-normality may lead to incorrect conclusions or biased estimates. For instance, if the data is highly skewed or has outliers, it can violate the normality assumption.

Violation of homogeneity of variances: When the variances across groups are not equal, the F-statistic no longer follows its assumed distribution. Depending on how the variances and sample sizes line up, the Type I error rate can be inflated or the test can lose power (more Type II errors). The problem is most serious when sample sizes are small or greatly imbalanced.

Violation of equal sample sizes: Although ANOVA can still be conducted with unequal sample sizes, it can complicate the interpretation of results. Unequal sample sizes may affect the precision and power of the analysis. Additionally, unequal sample sizes can interact with violations of other assumptions, such as homogeneity of variances.
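
The normality and equal-variance assumptions can be screened with standard tests before running ANOVA. The sketch below is one possible check, using the Shapiro-Wilk and Levene tests from SciPy. The three small groups are illustrative values, and the 0.05 cut-off is a conventional choice rather than a rule.

```python
from scipy import stats

# Illustrative data: three small independent groups
groups = {
    "group1": [4, 6, 5, 3, 7],
    "group2": [9, 11, 10, 8, 12],
    "group3": [15, 13, 14, 16, 17],
}

alpha = 0.05  # conventional threshold, an assumption of this sketch

# Shapiro-Wilk: the null hypothesis is that the sample comes from a normal distribution
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p
    verdict = "no evidence against normality" if p > alpha else "normality is doubtful"
    print(f"{name}: Shapiro-Wilk p = {p:.3f} ({verdict})")

# Levene: the null hypothesis is that all groups have equal variances
levene_stat, levene_p = stats.levene(*groups.values())
verdict = "no evidence of unequal variances" if levene_p > alpha else "variances may differ"
print(f"Levene p = {levene_p:.3f} ({verdict})")
```

With very small groups these tests have little power, so plots of the residuals are a useful complement.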

In [None]:
#Q2):-
The three types of ANOVA are:

One-Way ANOVA: This type of ANOVA is used when comparing the means of three or more independent groups. It investigates whether there are any statistically significant differences among the group means. One-Way ANOVA is appropriate when there is one independent variable (also known as a factor) with three or more levels, and the response variable is continuous.
Example: Suppose you want to compare the average scores of students from different schools (School A, School B, and School C) in a mathematics test. One-Way ANOVA can be used to determine if there are significant differences in the mean scores of the three schools.

Two-Way ANOVA: Two-Way ANOVA is used when there are two independent variables (factors) influencing the response variable. It examines the effects of these two factors on the response variable simultaneously and assesses if there are any significant interactions between the factors. Each factor can have two or more levels.
Example: Consider a study investigating the effects of both gender (male, female) and treatment (treatment A, treatment B, treatment C) on blood pressure. Two-Way ANOVA can be used to determine if there are significant main effects of gender and treatment and if there is an interaction effect between gender and treatment.

Repeated Measures ANOVA: Repeated Measures ANOVA is used when measurements are taken on the same subjects or units over multiple time points or conditions. It is specifically designed to analyze within-subject designs, where each subject serves as their own control, thereby reducing the influence of individual differences.
Example: Suppose you are studying the effect of a new medication on anxiety levels, and you measure anxiety in the same group of participants before medication (baseline), after one week of medication, and after two weeks of medication. Repeated Measures ANOVA can be used to analyze the change in anxiety levels across the three time points.
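
A repeated measures design like the anxiety example can be analysed with `AnovaRM` from statsmodels. The sketch below uses made-up anxiety scores for five hypothetical participants; the column names `subject`, `time`, and `anxiety` are placeholders chosen for illustration.

```python
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Hypothetical anxiety scores, one list per time point (five participants each)
scores = {
    "baseline": [30, 28, 35, 32, 27],
    "week1": [26, 25, 30, 29, 25],
    "week2": [22, 23, 24, 27, 20],
}

# Reshape to long format: one row per participant per time point
rows = [
    {"subject": subject, "time": time, "anxiety": value}
    for time, values in scores.items()
    for subject, value in enumerate(values, start=1)
]
df = pd.DataFrame(rows)

# 'time' is the within-subject factor; every subject must appear once per level
result = AnovaRM(df, depvar="anxiety", subject="subject", within=["time"]).fit()
table = result.anova_table
print(table)
```

`AnovaRM` expects balanced data, so participants with a missing time point would have to be handled before fitting.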

In [None]:
#Q3):-
The partitioning of variance in ANOVA refers to the division of the total variance observed in a dataset into different components or sources of variation. It decomposes the total variation into distinct parts associated with different factors or sources, allowing for a more detailed understanding of the variability in the data.

In ANOVA, the total variance observed in the response variable is divided into two main components: the between-group variance and the within-group variance.

Between-group variance: This component represents the variability among the group means or levels of the independent variable. It captures the differences between groups and indicates whether there are significant group effects. If the between-group variance is large relative to the within-group variance, it suggests that the groups differ significantly.

Within-group variance: This component represents the variability within each group. It accounts for the individual differences or random variability within the groups. It serves as a measure of the random error or noise present in the data. If the within-group variance is small relative to the between-group variance, it suggests that the groups are homogeneous, and the differences observed are likely due to the effect of the independent variable.

Understanding the partitioning of variance in ANOVA is important for several reasons:

Hypothesis testing: The partitioning of variance allows for the calculation of the F-statistic, which is used to test the significance of the group effects. By comparing the between-group variance with the within-group variance, ANOVA determines if the observed differences among the groups are statistically significant.

Effect size estimation: The partitioning of variance provides information about the magnitude of the effect of the independent variable. Effect sizes such as eta-squared (the between-group sum of squares divided by the total sum of squares) or partial eta-squared can be calculated directly from the partition. These effect sizes help in interpreting the practical significance of the results.

Experimental design optimization: Understanding the partitioning of variance helps researchers in designing more efficient experiments. By minimizing the within-group variance and maximizing the between-group variance, the power of the analysis can be increased, leading to more reliable and informative results.

Identifying potential sources of variation: The partitioning of variance helps identify the main sources of variation in the data. It allows researchers to investigate the contributions of different factors or variables to the overall variability. This knowledge can guide further analysis and help identify potential covariates or confounding factors that may need to be accounted for.
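
The partition and the quantities derived from it can be checked numerically. The following is a minimal sketch on three illustrative groups: it confirms that the total sum of squares equals the between-group plus within-group sums of squares, then derives the F-statistic and eta-squared from those pieces. The variable names are chosen for this example only.

```python
import numpy as np

# Illustrative data: three groups of five observations
groups = [
    np.array([4, 6, 5, 3, 7]),
    np.array([9, 11, 10, 8, 12]),
    np.array([15, 13, 14, 16, 17]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: every observation around the grand mean
ss_total = np.sum((all_values - grand_mean) ** 2)
# Between-group variation: group means around the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: every observation around its own group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares; eta-squared is the share of total variation
f_manual = (ss_between / df_between) / (ss_within / df_within)
eta_squared = ss_between / ss_total

print("SS total:", ss_total, "= SS between + SS within:", ss_between + ss_within)
print("F:", f_manual)
print("eta-squared:", eta_squared)
```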

In [None]:
#Q4):-
import numpy as np
import scipy.stats as stats

group1 = [4, 6, 5, 3, 7]
group2 = [9, 11, 10, 8, 12]
group3 = [15, 13, 14, 16, 17]
groups = [np.array(g) for g in (group1, group2, group3)]
data = np.concatenate(groups)

# One-way ANOVA F-test
fvalue, pvalue = stats.f_oneway(group1, group2, group3)

# SST: total sum of squares, each value around the grand mean
mean_total = np.mean(data)
sst = np.sum((data - mean_total) ** 2)
# SSE: within-group (error) sum of squares, each value around its own group mean
sse = sum(np.sum((g - g.mean()) ** 2) for g in groups)
# SSR: between-group sum of squares, since SST = SSR + SSE
ssr = sst - sse

print("SST:", sst)
print("SSE:", sse)
print("SSR:", ssr)
print("F-value:", fvalue)
print("p-value:", pvalue)


In [None]:
#Q5):-
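import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

data = {
    'A': [1, 1, 2, 2, 3, 3, 4, 4],
    'B': [1, 2, 1, 2, 1, 2, 1, 2],
    'response': [2, 3, 4, 5, 6, 7, 8, 9]
}
df = pd.DataFrame(data)

# A and B enter the formula as numeric predictors. Wrapping them in C() would treat
# them as categorical factors, but with one observation per cell that leaves no
# residual degrees of freedom once the interaction is included.
model = ols('response ~ A + B + A:B', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Look up each effect's sum of squares by row label rather than by position
print("Main effect for A:", anova_table.loc['A', 'sum_sq'])
print("Main effect for B:", anova_table.loc['B', 'sum_sq'])
print("Interaction effect:", anova_table.loc['A:B', 'sum_sq'])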


In [None]:
#Q6):-
Based on the obtained F-statistic of 5.23 and a p-value of 0.02 in a one-way ANOVA, you can make the following conclusions about the differences between the groups:

Significant differences: The p-value of 0.02 is below the conventional 0.05 significance level, so there is a statistically significant difference among the group means. In other words, if all group means were truly equal, the probability of observing an F-statistic of 5.23 or larger would be only 0.02 (or 2%). Therefore, we reject the null hypothesis of no group differences.

Group means: Since the one-way ANOVA indicates a significant difference, you can conclude that at least one of the groups has a mean significantly different from the others. However, the ANOVA test itself does not provide information about which specific group(s) differ. To determine the specific differences, you may need to conduct further post-hoc tests or pairwise comparisons.

When interpreting these results, it is important to consider the context of the study and the practical significance of the differences. Here's an example interpretation statement:

"The one-way ANOVA revealed a significant difference among the groups (F(df_between, df_within) = 5.23, p = 0.02). This suggests that at least one group mean differs from the others. Further analysis, such as post-hoc tests, is needed to determine the specific group differences. Whether these differences are large enough to matter in [context] should be judged from an effect size, not from the p-value alone."

In the interpretation statement, df_between is the numerator degrees of freedom (number of groups minus 1) and df_within is the denominator degrees of freedom (total number of observations minus the number of groups). The question does not state the number of groups or observations, so substitute the actual values when reporting. Including the degrees of freedom provides additional information about the sample size and the precision of the analysis.
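
The link between the F-statistic, its degrees of freedom, and the p-value can be sketched as follows. The question does not state the number of groups or observations, so the design below (3 groups, 17 observations) is a hypothetical one, chosen only because it yields a p-value close to the reported 0.02.

```python
from scipy import stats

f_statistic = 5.23

# Hypothetical design: 3 groups and 17 observations in total (not stated in the question)
n_groups, n_total = 3, 17
df_between = n_groups - 1      # numerator degrees of freedom
df_within = n_total - n_groups  # denominator degrees of freedom

# Upper-tail probability of the F distribution: P(F >= observed value) under H0
p_value = stats.f.sf(f_statistic, df_between, df_within)

alpha = 0.05
print(f"F({df_between}, {df_within}) = {f_statistic}, p = {p_value:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The same F-statistic gives a different p-value under different degrees of freedom, which is why both should be reported together.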

In [None]:
#Q7):-
Handling missing data in a repeated measures ANOVA requires careful consideration, as it can impact the validity and reliability of the results. Here are a few common approaches for handling missing data in repeated measures ANOVA:

Complete Case Analysis (Listwise deletion): This approach involves excluding participants or cases with any missing data from the analysis. Only the complete cases are used for the analysis, and any participant with missing data on any variable is excluded. The advantage of this method is that it is straightforward to implement. However, it can result in reduced sample size, potential bias, and loss of statistical power if missingness is related to the variables of interest.

Pairwise Deletion: With pairwise deletion, you use all available data for each specific analysis. Each analysis uses only the available data for that particular pair of variables. This method maximizes the use of data by including all participants with at least some data, but it can introduce bias if the missingness is not random.

Imputation: Imputation involves filling in or estimating missing values based on observed data. Various imputation methods can be used, such as mean imputation, regression imputation, or multiple imputation. Imputation preserves the sample size, and multiple imputation can reduce bias when the data are missing at random. Simple methods such as mean imputation, however, understate the variability in the data and can distort standard errors and p-values, and any imputed value only approximates the true missing value.

The potential consequences of using different methods to handle missing data in repeated measures ANOVA include:

Bias: The choice of missing-data method can introduce bias if the missingness is related to the variables being analyzed. Complete case analysis and pairwise deletion may exclude participants with specific missing patterns, leading to biased estimates if the data are not missing completely at random (MCAR).

Reduced Statistical Power: Excluding participants or cases with missing data reduces the effective sample size, leading to reduced statistical power. This can result in an increased likelihood of Type II errors, i.e., failing to detect true effects or differences.

Precision and Generalizability: Different missing data handling methods can lead to different estimates of means, standard errors, and p-values. This can impact the precision of the results and their generalizability to the population.

Assumptions: The assumptions underlying the statistical tests in repeated measures ANOVA, such as sphericity or compound symmetry, may be violated when missing data are present. Different methods of handling missing data may have different implications for meeting these assumptions.

When deciding how to handle missing data, it is crucial to carefully consider the missing data mechanism, the amount and pattern of missingness, and the potential implications of different approaches. Additionally, consulting with a statistician or using methods designed for incomplete data, such as multiple imputation, can provide more robust and reliable analyses.
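
The trade-off between listwise deletion and simple imputation can be seen on a small example. The sketch below uses pandas with made-up scores for five hypothetical participants; the column names are placeholders, and mean imputation is shown only because it is the simplest method to illustrate, not because it is recommended.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
df = pd.DataFrame({
    "baseline": [30, 28, 35, 32, 27],
    "week1": [26, np.nan, 30, 29, 25],
    "week2": [22, 23, np.nan, 27, 20],
})

# Complete case analysis: drop every participant with any missing value
complete_cases = df.dropna()

# Mean imputation: replace each missing value with the mean of its column
mean_imputed = df.fillna(df.mean())

print("Participants in the original data:", len(df))
print("Participants kept by listwise deletion:", len(complete_cases))
print("week1 SD, observed values only:", df["week1"].std())
print("week1 SD, after mean imputation:", mean_imputed["week1"].std())
```

Listwise deletion loses two of the five participants here, while mean imputation keeps everyone but shrinks the standard deviation, which is how it can make results look more precise than they are.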

In [None]:
#Q8):-

After conducting an ANOVA and obtaining a significant result indicating a difference among the groups, post-hoc tests can be used to determine which specific group differences are statistically significant. Here are some common post-hoc tests used after ANOVA, along with their respective use cases:

Tukey's Honestly Significant Difference (HSD) test: This test compares all possible pairs of group means while controlling for the familywise error rate. It is used when you want to determine the specific pairwise differences among all groups. The standard form assumes equal sample sizes, and the Tukey-Kramer variant extends it to unequal sizes.

Bonferroni correction: Bonferroni correction adjusts the significance level for multiple comparisons. It is a conservative approach that divides the desired alpha level by the number of comparisons. Bonferroni correction is commonly used when you have planned comparisons or a small number of specific pairwise comparisons to test.

Sidak correction: Similar to Bonferroni correction, Sidak correction also adjusts the significance level for multiple comparisons but is slightly less conservative. It is a reasonable alternative to Bonferroni when the comparisons can be treated as independent.

Scheffé's test: Scheffé's test is a conservative post-hoc test that provides control over the familywise error rate for all possible contrasts, not only pairwise differences. It is used when you want to compare combinations of group means (for example, one group against the average of two others), and it accommodates unequal sample sizes.

Dunnett's test: Dunnett's test compares each group mean to a control group mean. It is used when you have a control group and want to determine whether other groups differ significantly from the control group.

Games-Howell test: The Games-Howell test is a post-hoc test that relaxes the assumption of equal variances among groups. It is used when the assumption of equal variances is violated, and you want to compare group means.

An example situation where a post-hoc test might be necessary is in a study comparing the effects of three different treatments on blood pressure. After conducting a one-way ANOVA and finding a significant difference among the treatments, you would use a post-hoc test to determine which specific treatment groups differ significantly from each other. This would provide insights into the effectiveness of the different treatments and guide further analysis or interventions.

Remember, the choice of post-hoc test depends on factors such as the design of the study, the assumptions being met or violated, the number of pairwise comparisons, and the specific research questions of interest. It is important to select a post-hoc test that is appropriate for your specific situation and ensures valid and meaningful comparisons among the groups.
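
As one concrete case, Bonferroni-corrected pairwise comparisons can be sketched with SciPy alone. The three treatment groups below hold illustrative values, not real blood pressure data, and the labels are placeholders. Each pairwise p-value is multiplied by the number of comparisons, which is equivalent to dividing alpha by that number.

```python
from itertools import combinations
from scipy import stats

# Illustrative outcome values for three hypothetical treatments
groups = {
    "treatment_A": [4, 6, 5, 3, 7],
    "treatment_B": [9, 11, 10, 8, 12],
    "treatment_C": [15, 13, 14, 16, 17],
}

alpha = 0.05
pairs = list(combinations(groups, 2))
n_comparisons = len(pairs)

adjusted = {}
for a, b in pairs:
    # Independent two-sample t-test for this pair of groups
    _, p = stats.ttest_ind(groups[a], groups[b])
    # Bonferroni adjustment: scale the p-value by the number of comparisons, capped at 1
    adjusted[(a, b)] = min(p * n_comparisons, 1.0)
    verdict = "significant" if adjusted[(a, b)] < alpha else "not significant"
    print(f"{a} vs {b}: adjusted p = {adjusted[(a, b)]:.4f} ({verdict})")
```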

In [None]:
#Q9):-
import scipy.stats as stats
diet_A = [3.2, 2.8, 4.1, 3.9, 3.5, 3.4, 2.9, 3.1, 3.7, 3.3, 3.6, 3.8, 3.2, 3.5, 3.0, 2.7, 3.6, 2.9, 3.1, 3.3, 3.8, 3.4, 3.5, 2.9, 3.3, 3.2, 3.6, 3.7, 3.9, 4.0, 3.3, 3.1, 3.6, 3.8, 3.5, 3.2, 3.4, 3.0, 3.3, 3.1, 3.9, 3.6, 3.2, 3.4, 3.5, 2.9, 3.7, 3.3, 3.1, 3.8]
diet_B = [2.9, 2.7, 2.5, 2.4, 2.6, 2.8, 2.9, 2.7, 2.6, 2.8, 2.7, 2.6, 2.9, 2.5, 2.4, 2.6, 2.8, 2.7, 2.9, 2.6, 2.7, 2.5, 2.8, 2.7, 2.6, 2.9, 2.5, 2.4, 2.6, 2.8, 2.7, 2.9, 2.6, 2.7, 2.5, 2.8, 2.7, 2.6, 2.9, 2.5, 2.4, 2.6, 2.8, 2.7, 2.9, 2.6, 2.7, 2.5, 2.8, 2.7]
diet_C = [4.0, 4.1, 3.9, 4.2, 4.3, 4.1, 4.0, 3.8, 4.2, 4.3, 4.1, 4.0, 3.8, 4.2, 4.3, 4.1, 4.0, 3.8, 4.2, 4.3, 4.1, 4.0, 3.8, 4.2, 4.3, 4.1, 4.0, 3.8, 4.2, 4.3, 4.1, 4.0, 3.8, 4.2, 4.3, 4.1, 4.0, 3.8, 4.2, 4.3, 4.1, 4.0, 3.8, 4.2, 4.3, 4.1, 4.0, 3.8]
# One-way ANOVA across the three diets (f_oneway accepts groups of unequal size)
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)
print("F-Statistic:", f_statistic)
print("p-value:", p_value)


In [None]:
#Q10):-
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

times = [14.2, 15.1, 16.0, 16.5, 17.3, 18.0, 17.8, 18.5, 19.2, 20.1,
         21.0, 21.5, 20.5, 19.9, 21.2, 20.8, 19.5, 20.0, 22.1, 23.0,
         17.9, 18.8, 19.5, 20.0, 21.0, 21.9, 22.5, 23.8, 24.0, 24.5,
         19.0, 19.8, 20.5, 21.2, 22.0, 22.5, 23.8, 24.5, 25.0, 25.9,
         18.5, 19.2, 20.0, 21.0, 22.2, 23.0, 23.5, 24.9, 25.5, 26.3]

# Only 50 completion times are listed, so the cycled factor labels are cut to the
# same length; otherwise pd.DataFrame raises a length-mismatch error.
n = len(times)
data = {
    'Software': (['A', 'B', 'C'] * 20)[:n],
    'Experience': (['Novice', 'Experienced'] * 30)[:n],
    'Time': times,
}

df = pd.DataFrame(data)
# Two-way ANOVA: both main effects plus the Software x Experience interaction
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

In [None]:
#Q11):-
import scipy.stats as stats
import numpy as np
control_scores = [85, 78, 92, 80, 87, 88, 91, 83, 86, 89, 82, 84, 79, 81, 90, 87, 86, 88, 84, 89,
                  80, 85, 88, 83, 86, 92, 80, 87, 89, 81, 85, 84, 83, 82, 86, 88, 85, 89, 87, 83,
                  86, 84, 80, 82, 88, 90, 86, 84, 81, 88, 85]
experimental_scores = [92, 95, 87, 98, 93, 96, 94, 97, 88, 94, 90, 91, 95, 92, 93, 96, 94, 97, 89,
                       91, 93, 92, 95, 98, 94, 97, 93, 92, 96, 90, 89, 91, 94, 97, 93, 95, 90, 96,
                       94, 92, 91, 98, 89, 95, 97, 93, 96, 92, 90, 94]
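# Independent two-sample t-test (assumes equal variances by default;
# pass equal_var=False for Welch's t-test if the variances differ)
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)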
print("t-statistic:", t_statistic)
print("p-value:", p_value)


In [None]:
#Q12):-
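import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

sales = [150, 180, 165, 140, 155, 175, 160, 145, 170, 155, 165, 185, 175, 160, 150,
         140, 155, 180, 165, 145, 170, 155, 165, 185, 160, 150, 140, 175, 160, 145,
         170, 155, 180, 165, 140, 155, 175, 160, 145, 170, 155, 165, 185, 175, 160,
         150, 140, 155, 180, 165, 145, 170]

# Assumption: the values are listed day by day, one value per store (A, B, C).
# Only 52 values are given, so the 17 complete days (51 values) are analysed.
n_days = len(sales) // 3
data = {
    'Store': ['A', 'B', 'C'] * n_days,
    'Day': [day for day in range(1, n_days + 1) for _ in range(3)],
    'Sales': sales[:n_days * 3],
}

df = pd.DataFrame(data)
# Day is a blocking factor: each day is measured once in every store. With a single
# value per Store-Day cell there are no residual degrees of freedom left for a
# Store:Day interaction, so the model contains main effects only.
model = ols('Sales ~ C(Store) + C(Day)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)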
