# **ASSIGNMENT**

**Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.**

ANOVA (Analysis of Variance) is a statistical technique used to compare means between two or more groups. It is based on several assumptions that need to be met for the results to be valid. These assumptions are:

1. Independence: The observations within each group are assumed to be independent of each other. Violations of this assumption occur when there is dependence or correlation between the observations. For example, if the data collected from different groups are paired or matched in some way, such as measuring the same individuals before and after an intervention, the independence assumption is violated.

2. Normality: The distribution of the dependent variable within each group should be approximately normally distributed. Violations of this assumption occur when the data deviate significantly from a normal distribution. For instance, if the data are strongly skewed or have heavy tails, the normality assumption is violated.

3. Homogeneity of variances (homoscedasticity): The variances of the dependent variable should be approximately equal across all groups. Violations of this assumption occur when the variability differs substantially between groups, for example when the scores in one group are far more spread out than in the others. Unequal variances can distort the F-test, particularly when the group sizes are also unequal.

Violations of these assumptions can impact the validity of the ANOVA results. Here are some examples of violations and their impact:

1. Violation of independence: If observations are correlated, the within-group variability is typically underestimated, so standard errors are too small and the Type I error rate is inflated. For example, in a study that measures the same individuals multiple times but treats each measurement as a separate observation, the assumption of independence is violated, and the ANOVA results may be unreliable. A repeated measures design is more appropriate in that case.

2. Violation of normality: When the data within groups deviate substantially from a normal distribution, the p-values from the F-test may be inaccurate, especially with small or unequal group sizes. With large, balanced samples, ANOVA is fairly robust to moderate non-normality. Transformations or non-parametric alternatives may be considered to address this violation.

3. Violation of homogeneity of variances: If the group variances are unequal, the F-test can become too liberal or too conservative, particularly when the group sizes differ. In that case it may be necessary to employ alternative methods such as Welch's ANOVA or, with some caution, non-parametric tests like the Kruskal-Wallis test.

It is important to assess these assumptions before applying ANOVA and consider appropriate alternatives or adjustments if any violations are detected.
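Two of these assumptions can be screened numerically. The following is a minimal sketch, assuming three hypothetical groups generated for illustration (they are not part of the assignment data). It applies the Shapiro-Wilk test for normality within each group and Levene's test for equal variances across groups. Independence cannot be checked this way; it has to be argued from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three independent groups (illustrative only)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)
group_c = rng.normal(loc=12.0, scale=2.0, size=30)
groups = {"A": group_a, "B": group_b, "C": group_c}

# Normality: Shapiro-Wilk test within each group (H0: the data are normal)
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p

# Homogeneity of variances: Levene's test across groups (H0: equal variances)
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value, group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test is a signal to inspect the data more closely (for example with a Q-Q plot), not an automatic reason to abandon ANOVA.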

**Q2. What are the three types of ANOVA, and in what situations would each be used?**

The three types of ANOVA are:

1. One-Way ANOVA: One-Way ANOVA is used when there is one categorical independent variable (also known as a factor) with two or more groups, and the dependent variable is continuous. It is most commonly applied with three or more groups, since with exactly two groups it is equivalent to an independent-samples t-test. It is used to determine if there are any significant differences between the means of the groups. For example, a researcher may use One-Way ANOVA to compare the mean scores of students from different schools (groups) on a standardized test.

2. Two-Way ANOVA: Two-Way ANOVA is used when there are two categorical independent variables (factors) and a continuous dependent variable. It allows us to examine the effect of each factor separately (the main effects) as well as their interaction effect on the dependent variable. If a continuous predictor is included alongside the factors, the analysis becomes an ANCOVA rather than a two-way ANOVA. For example, in a study investigating the effects of both gender and age group on a measure of job satisfaction, Two-Way ANOVA can be used to analyze the data.

3. Multivariate ANOVA (MANOVA): Multivariate ANOVA is used when there are two or more continuous dependent variables and one or more categorical independent variables. It is an extension of ANOVA that allows for the analysis of multiple dependent variables simultaneously. MANOVA is used to determine if there are any significant differences between the groups on a combination of dependent variables. For example, in a study comparing the effects of different teaching methods on academic performance, MANOVA can be used to analyze multiple academic outcome measures simultaneously, such as test scores in multiple subjects.

These three types of ANOVA provide different levels of complexity and analysis depending on the research design and objectives. One-Way ANOVA is used when there is only one categorical independent variable, Two-Way ANOVA is used when there are two independent variables and their interaction, and MANOVA is used when there are multiple dependent variables.

**Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?**

The partitioning of variance in ANOVA refers to the process of decomposing the total variance observed in the data into different components, each representing a different source of variation. Understanding this concept is important because it helps us understand how much of the total variance is due to different factors or sources, and it allows us to quantify the relative importance of these factors in explaining the variability in the data.

In ANOVA, the total variance is divided into two main components:

1. Between-group variance: This component represents the variation between the group means. It indicates the extent to which the groups differ from each other. If there are significant differences between the group means, the between-group variance will be larger.

2. Within-group variance: This component represents the variation within each group. It reflects the individual differences or random variability within the groups. If the within-group variance is large relative to the between-group variance, the observed differences between the group means could plausibly be due to chance, and the test is unlikely to be significant.

By comparing the between-group variance with the within-group variance, ANOVA determines whether the observed differences between the group means are statistically significant. The ratio of between-group variance to within-group variance, known as the F-ratio, is used for hypothesis testing in ANOVA.
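Written out for a one-way design, the partition and the resulting F-ratio take the following form. The notation is an assumption introduced here for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $y_{ij}$ the $j$-th observation in group $i$, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the grand mean.

```latex
\begin{aligned}
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{SS_{\text{total}}}
&= \underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{SS_{\text{between}}}
 + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{SS_{\text{within}}} \\[4pt]
F &= \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
\end{aligned}
```

Strictly speaking, it is the sum of squares that is partitioned. Dividing each component by its degrees of freedom gives the mean squares whose ratio is the F-statistic.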

Understanding the partitioning of variance helps researchers and analysts in several ways:

1. Hypothesis testing: ANOVA uses the partitioning of variance to test whether the observed differences between group means are statistically significant. By quantifying the between-group and within-group variances, ANOVA provides a statistical measure (F-value) to determine the significance of these differences.

2. Identifying influential factors: By partitioning the total variance, ANOVA helps identify which factors or sources of variation contribute the most to the observed differences. It allows researchers to identify the primary factors that explain the variability in the data and understand their relative importance.

3. Design and analysis of experiments: Understanding the partitioning of variance helps in the design and analysis of experiments. It assists researchers in determining the appropriate sample sizes, allocating resources efficiently, and optimizing the experimental design to maximize the sensitivity of detecting differences between groups.

Therefore, the partitioning of variance in ANOVA provides insights into the sources of variation in the data and facilitates hypothesis testing, identification of influential factors, and effective experimental design and analysis.

**Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?**

In [1]:
import numpy as np
from scipy.stats import f_oneway

# Create sample data for three groups
group1 = np.array([4, 5, 6, 7, 8])
group2 = np.array([3, 4, 5, 6, 7])
group3 = np.array([2, 3, 4, 5, 6])

# Concatenate the data into a single array
data = np.concatenate((group1, group2, group3))

# Create group labels
groups = np.array(['Group 1'] * len(group1) + ['Group 2'] * len(group2) + ['Group 3'] * len(group3))

# Perform one-way ANOVA
f_value, p_value = f_oneway(group1, group2, group3)

# Calculate the group means
group_means = np.array([np.mean(group1), np.mean(group2), np.mean(group3)])

# Calculate the total mean
total_mean = np.mean(data)

# Calculate the total sum of squares (SST)
sst = np.sum((data - total_mean) ** 2)

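# Calculate the explained (between-group) sum of squares (SSE),
# weighting each group by its own size so the formula also holds for unequal groups
group_sizes = np.array([len(group1), len(group2), len(group3)])
sse = np.sum(group_sizes * (group_means - total_mean) ** 2)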

# Calculate the residual sum of squares (SSR)
ssr = sst - sse

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)
print("F-value:", f_value)
print("p-value:", p_value)


Total Sum of Squares (SST): 40.0
Explained Sum of Squares (SSE): 10.0
Residual Sum of Squares (SSR): 30.0
F-value: 2.0
p-value: 0.177978515625


**Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?**

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create sample data for two factors
factor1 = np.array([1, 1, 1, 2, 2, 2, 3, 3, 3])
factor2 = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])
response = np.array([4, 5, 6, 7, 8, 9, 10, 11, 12])

# Create a DataFrame with the data
df = pd.DataFrame({'Factor1': factor1, 'Factor2': factor2, 'Response': response})

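# Fit the two-way ANOVA model
# Note: Factor1 and Factor2 are numeric columns, so this formula treats them as
# continuous predictors (1 df each). Wrapping them as C(Factor1) and C(Factor2)
# treats them as categorical, but testing the interaction then needs more than
# one observation per cell.
model = ols('Response ~ Factor1 + Factor2 + Factor1:Factor2', data=df).fit()
anova_table = sm.stats.anova_lm(model)

# Extract the sums of squares for the main effects and the interaction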
main_effect_factor1 = anova_table['sum_sq']['Factor1']
main_effect_factor2 = anova_table['sum_sq']['Factor2']
interaction_effect = anova_table['sum_sq']['Factor1:Factor2']

print("Main Effect of Factor 1:", main_effect_factor1)
print("Main Effect of Factor 2:", main_effect_factor2)
print("Interaction Effect:", interaction_effect)


Main Effect of Factor 1: 54.00000000000003
Main Effect of Factor 2: 6.0000000000000036
Interaction Effect: 3.1554436208840472e-30

The values reported are the sums of squares attributed to each term. The interaction sum of squares is zero up to floating-point error because, in this sample data, Response equals 3 × Factor1 + Factor2 exactly, so the two factors act purely additively.


**Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?**

In a one-way ANOVA, the F-statistic is used to test the null hypothesis that the means of all the groups are equal. The p-value associated with the F-statistic indicates the probability of obtaining the observed F-value (or a more extreme value) if the null hypothesis is true.

In this case, we obtained an F-statistic of 5.23 and a p-value of 0.02. The p-value of 0.02 indicates that the probability of obtaining an F-statistic as extreme as 5.23 (or more extreme) under the assumption of equal group means is 0.02.

Based on the obtained results, we can conclude that there are statistically significant differences between the groups. The low p-value (less than the conventional significance level of 0.05) suggests that the observed differences in means are unlikely to be due to random chance alone. 

To interpret these results further, it is necessary to conduct post hoc tests or examine the group means directly. These additional analyses can provide insights into which specific groups differ significantly from each other and the direction of those differences (i.e., which groups have higher or lower means compared to others). Post hoc tests, such as Tukey's test or pairwise t-tests, allow for comparisons between individual groups while controlling for the overall experiment-wise error rate.

Therefore, based on an F-statistic of 5.23 and a p-value of 0.02 in a one-way ANOVA, we can conclude that there are statistically significant differences between the groups. Further post hoc tests or examination of the group means will provide more specific information about the nature and direction of these differences.

**Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?**

Handling missing data in a repeated measures ANOVA requires careful consideration, as missing data can potentially introduce bias and affect the validity of the results. Here are some common methods for handling missing data in a repeated measures ANOVA and their potential consequences:

1. Complete Case Analysis (Listwise deletion): This approach involves analyzing only the cases that have complete data for all time points. The main advantage is that it is straightforward to implement. However, it may lead to a loss of statistical power and potential bias if the missing data are not missing completely at random (MCAR). Additionally, if the missingness is related to the outcome or predictors, the analysis may not accurately reflect the population.

2. Pairwise Deletion (Available Case Analysis): This method involves including all available cases in the analysis for each time point separately. It maximizes the use of available data but can lead to biased estimates if the missing data are not MCAR. The results for different time points may be based on different subsets of participants, making comparisons across time points potentially problematic.

3. Mean Imputation: Mean imputation involves replacing missing values with the mean of the observed values for that variable. It maintains the sample size, but it does not remove bias due to missingness: it artificially reduces the variability of the data and weakens the correlations between time points, so standard errors are underestimated and statistical significance may be inflated.

4. Last Observation Carried Forward (LOCF): LOCF imputes missing values by carrying forward the last observed value for each participant. This method assumes that missing values are similar to the last observed value. It can introduce bias if the missing values are systematically different from the carried forward values, leading to inaccurate estimates of change over time.

5. Multiple Imputation: Multiple imputation involves creating several plausible completed datasets by imputing the missing values multiple times based on the observed data. Each imputed dataset is then analyzed separately, and the results are combined using specific rules. Multiple imputation is generally considered the preferred approach as it accounts for the uncertainty due to missing data and preserves the variability in the data. It provides valid statistical inference when the data are missing at random (MAR) and the imputation model is reasonable.

It is important to note that the consequences of using different methods to handle missing data can vary depending on the characteristics of the missing data and the underlying assumptions of the analysis. Researchers should carefully consider the missing data mechanism, potential biases, and the suitability of the chosen method to their specific research question and data structure. Consulting with a statistician or using specialized software that accommodates missing data, such as multiple imputation routines, can help ensure appropriate handling of missing data in a repeated measures ANOVA.
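The simpler strategies above can be compared on a toy example. This is a minimal sketch, assuming a small hypothetical wide-format table (one row per participant, one column per time point) invented for illustration. It shows listwise deletion, mean imputation (here using the mean of each time point), and LOCF with pandas. Multiple imputation is omitted because it requires a dedicated routine.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 13.0, 9.0],
    "t2": [11.0, np.nan, 12.0, 14.0, 10.0],
    "t3": [12.0, 14.0, np.nan, 15.0, np.nan],
})

# Complete case analysis: keep only participants observed at every time point
complete_cases = scores.dropna()

# Mean imputation: fill each time point's gaps with that time point's observed mean
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each participant's last observed value forward across time points
locf = scores.ffill(axis=1)

print("Participants kept by listwise deletion:", len(complete_cases))
print("Variance at t3, observed values only:", scores["t3"].var())
print("Variance at t3, after mean imputation:", mean_imputed["t3"].var())
print(locf)
```

In this toy table, listwise deletion discards most participants, and mean imputation shrinks the variance at the third time point, which illustrates the loss of power and the understated variability described above.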

**Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.**

After conducting an ANOVA and finding a significant difference among group means, post-hoc tests are often performed to determine which specific group differences are significant. Some common post-hoc tests used after ANOVA include:

1. Tukey's Honestly Significant Difference (HSD) test: This test compares all possible pairs of group means and controls the experiment-wise error rate. It is used when we want to determine which specific groups differ significantly from each other and the variances are roughly equal. It was designed for equal group sizes, and the Tukey-Kramer variant extends it to unequal group sizes.

2. Bonferroni correction: This method adjusts the significance level for multiple pairwise comparisons. The p-value threshold is divided by the number of comparisons to maintain an overall alpha level. It is a conservative approach that reduces the risk of Type I errors but may lead to decreased power.

3. Dunnett's test: This test compares multiple treatment groups to a control group. It is useful when you have a control group and want to determine which treatment groups differ significantly from the control group.

4. Scheffe's test: This test allows for comparisons of any linear combination of group means. It is a more conservative test that can be used in situations where the number and nature of comparisons are not predetermined.

5. Fisher's Least Significant Difference (LSD) test: This test compares pairs of group means using unadjusted t-tests, performed only after a significant overall F-test. It is less conservative than the other post-hoc tests and has more power, but it does not control the family-wise error rate when there are more than three groups.

6. Games-Howell test: This test relaxes the assumption of equal variances among groups and performs pairwise comparisons with different variances. It is used when the assumption of equal variances is violated.

Example situation where a post-hoc test might be necessary:
Suppose you conducted a study to compare the effectiveness of three different treatments (Treatment A, Treatment B, and Treatment C) on reducing pain levels in patients. After performing an ANOVA on the pain scores, you find a significant difference among the treatment groups. The ANOVA alone does not say which treatments differ, so you would conduct a post-hoc test. For example, you could use Tukey's HSD test to compare all possible pairs of treatment means and identify the significant differences. This would allow you to conclude which treatments are more effective than others in reducing pain levels.

Therefore, post-hoc tests are used after ANOVA to identify significant group differences. 
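As an illustration of the Bonferroni approach described above, the following sketch runs all pairwise t-tests for three treatments and compares each p-value with the adjusted threshold. The pain scores are hypothetical values invented for this example, and plain pairwise t-tests are used instead of Tukey's HSD to keep the sketch short.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

# Hypothetical pain scores for three treatments (illustrative values only)
pain_scores = {
    "A": np.array([6.1, 5.8, 6.4, 6.0, 5.9, 6.3]),
    "B": np.array([5.9, 6.2, 5.7, 6.1, 6.0, 5.8]),
    "C": np.array([3.2, 3.5, 3.1, 3.4, 3.0, 3.3]),
}

alpha = 0.05
pairs = list(combinations(pain_scores, 2))
# Bonferroni: divide the significance level by the number of comparisons
adjusted_alpha = alpha / len(pairs)

significant = {}
for first, second in pairs:
    t_stat, p_val = ttest_ind(pain_scores[first], pain_scores[second])
    significant[(first, second)] = bool(p_val < adjusted_alpha)
    print(f"{first} vs {second}: t = {t_stat:.2f}, p = {p_val:.4g}, "
          f"significant at alpha/{len(pairs)}: {significant[(first, second)]}")
```

With these made-up values, Treatment C separates clearly from A and B, while A and B are not distinguishable from each other after the correction.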

**Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.**

In [3]:
import numpy as np
from scipy.stats import f_oneway

# Weight loss data for each diet group
diet_A = np.array([2.1, 1.8, 2.5, 1.9, 1.5, 2.3, 1.7, 1.6, 2.0, 2.2, 1.9, 2.1, 1.8, 2.0, 2.2, 2.4, 1.9, 2.3, 2.1, 2.2, 1.8, 2.3, 2.1, 2.0, 1.9, 1.7, 2.0, 1.8, 2.2, 1.9, 2.3, 2.1, 1.8, 2.4, 2.0, 1.9, 2.2, 2.1, 1.7, 1.9, 2.3, 2.2, 2.4, 1.8, 2.1, 2.0, 1.9, 2.3, 2.2, 1.7, 1.9, 2.4])
diet_B = np.array([1.5, 1.7, 1.6, 1.8, 1.9, 2.1, 1.4, 1.9, 1.7, 1.6, 1.8, 1.5, 2.0, 1.6, 1.5, 1.9, 1.7, 1.8, 1.6, 1.9, 2.0, 1.8, 1.7, 1.6, 1.4, 1.9, 2.0, 1.8, 1.6, 1.5, 1.9, 1.7, 1.8, 1.6, 1.4, 1.9, 2.0, 1.8, 1.6, 1.5, 1.9, 1.7, 1.8, 1.6, 1.4, 1.9, 2.0, 1.8, 1.6, 1.5, 1.9])
diet_C = np.array([1.2, 1.0, 1.5, 1.3, 1.1, 1.4, 1.3, 1.5, 1.2, 1.1, 1.0, 1.3, 1.4, 1.3, 1.1, 1.2, 1.0, 1.4, 1.3, 1.2, 1.1, 1.5, 1.3, 1.4, 1.3, 1.2, 1.0, 1.4, 1.3, 1.2, 1.1, 1.5, 1.3, 1.4, 1.3, 1.2, 1.0, 1.4, 1.3, 1.2, 1.1, 1.5, 1.3, 1.4, 1.3, 1.2, 1.0, 1.4, 1.3])

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(diet_A, diet_B, diet_C)

# Print the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)


F-statistic: 201.87682973502424
p-value: 3.83504817681073e-43


Interpreting the results:<br>
The one-way ANOVA gives F ≈ 201.88 with a p-value of about 3.8e-43. Since the p-value is far below the conventional significance level of 0.05, we reject the null hypothesis and conclude that there are significant differences between the mean weight loss of the three diets (A, B, and C). In other words, the choice of diet has a statistically significant effect on weight loss in this study. However, to determine the specific pairwise differences between the diets, you would need to conduct post-hoc tests (e.g., Tukey's HSD test) to make detailed comparisons between the groups.

**Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.**

In [4]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate sample data
np.random.seed(0)
n = 30
programs = ['A', 'B', 'C']
experience_levels = ['novice', 'experienced']

data = pd.DataFrame({
    'Program': np.random.choice(programs, n),
    'Experience': np.random.choice(experience_levels, n),
    'Time': np.random.normal(10, 2, n)
})

# Fit the two-way ANOVA model
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model)

# Print the ANOVA table
print(anova_table)


                            df     sum_sq   mean_sq         F    PR(>F)
C(Program)                 2.0  11.306452  5.653226  2.145100  0.138964
C(Experience)              1.0   2.102143  2.102143  0.797652  0.380665
C(Program):C(Experience)   2.0   6.013261  3.006630  1.140857  0.336272
Residual                  24.0  63.249921  2.635413       NaN       NaN


The ANOVA table shows the degrees of freedom (df), sum of squares (sum_sq), mean sum of squares (mean_sq), F-statistic (F), and p-value (PR(>F)) for each factor and the interaction term, as well as the residual.

Interpreting the results:
- Software Program (C(Program)): F = 2.145, p = 0.139, which is greater than the conventional significance level of 0.05. There is no significant main effect of the software program on the task completion time.
- Employee Experience Level (C(Experience)): F = 0.798, p = 0.381, which is greater than 0.05. There is no significant main effect of employee experience level on the task completion time.
- Interaction between Software Program and Employee Experience (C(Program):C(Experience)): F = 1.141, p = 0.336, which is greater than 0.05. There is no significant interaction effect between the software program and employee experience level on the task completion time.

In summary, the two-way ANOVA results show no significant main effect of software program, no significant main effect of employee experience level, and no significant interaction between them. This is the expected outcome for this sample data, because the Time values were simulated independently of both factors.

**Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.**

In [5]:
import numpy as np
from scipy.stats import ttest_ind

# Test scores for control group and experimental group
control_scores = np.array([80, 85, 75, 90, 78, 82, 86, 79, 88, 81, 85, 84, 77, 83, 87, 80, 84, 76, 82, 86, 79, 88, 81, 85, 84, 77, 83, 87, 80])
experimental_scores = np.array([85, 88, 79, 91, 77, 83, 89, 82, 90, 85, 86, 82, 76, 81, 87, 84, 79, 81, 88, 84, 83, 89, 82, 90, 85, 86, 82, 76, 81])

# Conduct two-sample t-test
t_statistic, p_value = ttest_ind(control_scores, experimental_scores)

# Print the results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

t-statistic: -1.2570371814357912
p-value: 0.21395787054884707


Interpreting the results:
The test gives t ≈ -1.257 with a p-value of about 0.214. Since the p-value is greater than the conventional significance level of 0.05, we fail to reject the null hypothesis. This means that there is not sufficient evidence to conclude that there is a significant difference in test scores between the control group and the experimental group.

Because only two groups are compared, no post-hoc test is required: the t-test is itself the single pairwise comparison. Had the result been significant (p-value less than 0.05), the sign of the t-statistic, or a direct comparison of the two group means, would show which group scored higher. Post-hoc procedures such as Tukey's Honestly Significant Difference (HSD) test become relevant only when an overall test compares three or more groups.
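With two groups, the follow-up amounts to looking at the direction and size of the mean difference. The sketch below reuses the scores from the cell above and, as an optional robustness check that is not part of the original analysis, repeats the comparison with Welch's t-test, which does not assume equal variances.

```python
import numpy as np
from scipy.stats import ttest_ind

# Same test scores as in the cell above
control_scores = np.array([80, 85, 75, 90, 78, 82, 86, 79, 88, 81, 85, 84, 77, 83, 87,
                           80, 84, 76, 82, 86, 79, 88, 81, 85, 84, 77, 83, 87, 80])
experimental_scores = np.array([85, 88, 79, 91, 77, 83, 89, 82, 90, 85, 86, 82, 76, 81, 87,
                                84, 79, 81, 88, 84, 83, 89, 82, 90, 85, 86, 82, 76, 81])

# Direction of the difference: positive means the experimental group scored higher
mean_difference = experimental_scores.mean() - control_scores.mean()

# Welch's t-test: same comparison without assuming equal variances
welch_t, welch_p = ttest_ind(control_scores, experimental_scores, equal_var=False)

print("Mean difference (experimental - control):", mean_difference)
print("Welch t-statistic:", welch_t)
print("Welch p-value:", welch_p)
```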

**Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.**

In [6]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import warnings
warnings.filterwarnings("ignore")
from statsmodels.formula.api import ols

# Generate sample data
np.random.seed(0)
n = 30
days = range(1, n+1)
stores = ['A', 'B', 'C']

data = pd.DataFrame({
    'Day': np.repeat(days, len(stores)),
    'Store': np.tile(stores, n),
    'Sales': np.random.normal(1000, 100, n*len(stores))
})

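# Fit a two-way OLS model with Store, Day, and their interaction
# Note: with one observation per Store x Day cell, this model is saturated (0 residual df)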
model = ols('Sales ~ C(Store) + C(Day) + C(Store):C(Day)', data=data).fit()
anova_table = sm.stats.anova_lm(model)

# Print the ANOVA table
print(anova_table)


                   df        sum_sq       mean_sq    F  PR(>F)
C(Store)          2.0  2.939774e+04  1.469887e+04  0.0     NaN
C(Day)           29.0  3.824792e+05  1.318894e+04  0.0     NaN
C(Store):C(Day)  58.0  5.410130e+05  9.327811e+03  0.0     NaN
Residual          0.0  6.247669e-22           inf  NaN     NaN


Interpreting the results:

The table above does not provide a valid test. Because there is exactly one observation for each store-day combination, a model that includes the C(Store):C(Day) interaction is saturated: it uses all 89 available degrees of freedom and leaves 0 residual degrees of freedom, so the F column shows 0.0 and the p-values are NaN for every term. No conclusion about significance can be drawn from the F and PR(>F) columns.

In a repeated measures design, Day plays the role of the subject (blocking) factor and Store is the within-subject factor, so the Store effect is tested against the Store × Day mean square rather than against a separate residual. Using the mean squares in the table, F = 14698.87 / 9327.81 ≈ 1.58 on 2 and 58 degrees of freedom, which is below the 5% critical value of roughly 3.16. There is therefore no evidence of a difference in average daily sales between the three stores, which is expected because the sales figures were simulated from a single normal distribution.

Since the overall test is not significant, no post-hoc comparison is needed. The same test can be obtained directly from a dedicated routine such as the AnovaRM class in statsmodels, with Day as the subject and Store as the within-subject factor.
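For a one-factor repeated measures design, the F-test for Store can be computed by hand from the sums of squares. The following is a minimal sketch, assuming the same simulated data as the cell above (same seed and layout), with Day treated as the subject and the Store × Day variation used as the error term. It assumes sphericity and applies no correction for violations of it.

```python
import numpy as np
from scipy.stats import f as f_dist

# Same simulated data as the cell above: 30 days (rows) x 3 stores (columns)
np.random.seed(0)
n_days, n_stores = 30, 3
sales = np.random.normal(1000, 100, n_days * n_stores).reshape(n_days, n_stores)

grand_mean = sales.mean()
store_means = sales.mean(axis=0)  # one mean per store
day_means = sales.mean(axis=1)    # one mean per day (the repeated-measures "subject")

# Partition the total sum of squares
ss_total = np.sum((sales - grand_mean) ** 2)
ss_store = n_days * np.sum((store_means - grand_mean) ** 2)
ss_day = n_stores * np.sum((day_means - grand_mean) ** 2)
ss_error = ss_total - ss_store - ss_day  # Store x Day variation serves as the error term

df_store = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)

# F-ratio for Store and its upper-tail p-value
f_stat = (ss_store / df_store) / (ss_error / df_error)
p_value = f_dist.sf(f_stat, df_store, df_error)

print(f"F({df_store}, {df_error}) = {f_stat:.3f}, p = {p_value:.3f}")
```

If this test were significant, paired comparisons between stores (for example paired t-tests with a Bonferroni correction) would be a reasonable follow-up.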

-------------------