## Statistics Advance Assignment - 6
***By Shahequa Modabbera***

### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

Ans) ANOVA (Analysis of Variance) is a statistical method used to compare the means of two or more groups. ANOVA assumes the following:

1. Independence: Each observation within a group is independent of every other observation in that group.
2. Normality: The response variable is normally distributed in each group.
3. Homogeneity of variance: The variances of the response variable in each group are equal.

Violations of these assumptions can impact the validity of the ANOVA results. For example:

1. Independence violation: If observations are not independent, the data carry less information than the sample size suggests, so standard errors are underestimated and the Type I error rate is inflated. For example, if a study measures both members of married couples, the two observations within each couple are likely to be correlated and are therefore not independent.
2. Normality violation: If the response variable is far from normally distributed within each group (for example, strongly skewed or containing extreme outliers), the p-value of the F-test may be inaccurate, especially when the samples are small. For example, if a study measures the intelligence level of people but the sample is small and contains a few extreme scores, the distribution within a group can be far from normal.
3. Homogeneity of variance violation: If the variances of the response variable differ substantially across the groups, the F-test can become too liberal or too conservative. Unequal group sizes do not cause unequal variances, but they make the test much less robust to them. For example, if a study compares the salaries of employees in different departments, and a small department has a much wider salary range than a large one, the ANOVA result may be misleading.

It is important to check these assumptions before conducting an ANOVA test to ensure the validity of the results.
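As a rough sketch of how the normality and equal-variance assumptions might be checked in practice, the snippet below applies the Shapiro-Wilk test to each group and Levene's test across groups using `scipy.stats`. The data, group names, and seed are made up for illustration; independence cannot be tested this way and has to be judged from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 observations each (illustrative only).
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10, scale=2, size=30),
    "B": rng.normal(loc=12, scale=2, size=30),
    "C": rng.normal(loc=11, scale=2, size=30),
}

# Normality: Shapiro-Wilk test within each group (null hypothesis: normal data).
for name, values in groups.items():
    w_stat, p_norm = stats.shapiro(values)
    print(f"Shapiro-Wilk, group {name}: W = {w_stat:.3f}, p = {p_norm:.3f}")

# Homogeneity of variance: Levene's test across groups (null hypothesis: equal variances).
lev_stat, p_lev = stats.levene(*groups.values())
print(f"Levene: W = {lev_stat:.3f}, p = {p_lev:.3f}")
```

A small p-value in either test suggests the corresponding assumption may be violated; a large p-value does not prove the assumption holds, so plots of the residuals are a useful complement.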

### Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans) There are three main types of ANOVA:

1. One-Way ANOVA: This is used when there is only one independent variable and one dependent variable. It is used to compare the means of three or more groups.

2. Two-Way ANOVA: This is used when there are two independent variables and one dependent variable. It is used to determine the main effects and interactions between the two independent variables.

3. MANOVA (Multivariate Analysis of Variance): This is used when there are multiple dependent variables and one or more independent variables. It is used to determine if there are differences between groups across multiple dependent variables.

The choice of ANOVA type depends on the research question, study design, and the number of independent variables and dependent variables involved. One-Way ANOVA is the most commonly used type, as it is useful for comparing the means of multiple groups. Two-Way ANOVA is useful for examining the effects of two independent variables on a dependent variable, and can also identify interactions between the independent variables. MANOVA is useful for examining differences between groups across multiple dependent variables, and can provide more comprehensive results than a series of one-way ANOVAs.

### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans) The partitioning of variance in ANOVA refers to the process of breaking down the total variation in the data into different sources of variation, which are then used to estimate the statistical significance of the factors being analyzed. The total variation in the data is divided into two components: systematic variation and random variation. Systematic variation is due to the factors being studied, such as treatment groups or levels of a predictor variable, while random variation is due to chance factors that are not controlled or measured.

By partitioning the variance, ANOVA allows researchers to determine the extent to which the variation in the outcome variable is due to the factors being studied versus random variation. This is important because it helps to identify the most significant sources of variation, which can then be used to explain and predict the outcome variable. It also enables researchers to test the statistical significance of each factor, which is necessary for making valid conclusions about the relationship between the factors and the outcome variable. Overall, understanding the partitioning of variance in ANOVA is critical for accurately interpreting the results and drawing valid conclusions from the analysis.
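For a one-way ANOVA, the partition can be written out explicitly. The notation here is an assumption introduced for illustration: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, group means $\bar{x}_j$, and grand mean $\bar{x}$.

$$
SS_{total} = SS_{between} + SS_{within}
$$

$$
\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}\right)^2 = \sum_{j=1}^{k} n_j\left(\bar{x}_j-\bar{x}\right)^2 + \sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}_j\right)^2
$$

The F-statistic then compares the two components after dividing each by its degrees of freedom:

$$
F = \frac{SS_{between}/(k-1)}{SS_{within}/(N-k)}
$$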

### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

Ans) To calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python, we can use the `ols()` function from the `statsmodels` module.

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a DataFrame with the data
data = {'group': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
        'value': [10, 12, 14, 8, 9, 11, 15, 16, 17]}
df = pd.DataFrame(data)

# fit the ANOVA model
model = ols('value ~ group', data=df).fit()

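# calculate the sums of squares
# Note: the names below follow the regression convention, which is the
# reverse of the labels used in the question:
#   SSE = sum of squared errors (residual, within-group)
#   SSR = regression sum of squares (explained, between-group)
SST = sum((df['value'] - df['value'].mean())**2)
SSE = sum(model.resid**2)
SSR = SST - SSE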

print('SST:', SST)
print('SSE:', SSE)
print('SSR:', SSR)

SST: 82.22222222222221
SSE: 14.666666666666668
SSR: 67.55555555555554


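In this example, we create a DataFrame with the data and fit a one-way ANOVA model using the `ols()` function. SST is the sum of squared deviations of the values from the overall mean. Note that the variable names in the code follow the regression convention, which is the reverse of the labels used in the question: `SSE` (14.67) is the sum of squared residuals, that is, the residual (within-group) sum of squares, and `SSR` (67.56), obtained by subtracting `SSE` from `SST`, is the explained (between-group) sum of squares.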

### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

Ans) 
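In a two-way ANOVA, the variation between cells is split into a main effect for each factor and an interaction effect. In practice this is usually done with `ols` and `anova_lm` from `statsmodels`, as in Q10. The sketch below instead works through the sums of squares by hand for a balanced design, to show what those table entries mean. The factor names, levels, and values are hypothetical, and the formulas as written assume an equal number of observations per cell.

```python
from statistics import mean

# Hypothetical balanced 2x2 design with two observations per cell.
cells = {
    ("a1", "b1"): [4, 6],
    ("a1", "b2"): [8, 10],
    ("a2", "b1"): [6, 8],
    ("a2", "b2"): [14, 16],
}
levels_a = sorted({a for a, _ in cells})
levels_b = sorted({b for _, b in cells})
n_rep = 2  # observations per cell (balanced design assumed)

all_values = [v for vals in cells.values() for v in vals]
grand_mean = mean(all_values)

def level_mean(index, level):
    """Mean of all observations at one level of factor A (index 0) or B (index 1)."""
    return mean(v for key, vals in cells.items() if key[index] == level for v in vals)

# Main effects: spread of the level means around the grand mean.
ss_a = n_rep * len(levels_b) * sum((level_mean(0, a) - grand_mean) ** 2 for a in levels_a)
ss_b = n_rep * len(levels_a) * sum((level_mean(1, b) - grand_mean) ** 2 for b in levels_b)

# Interaction: between-cell variation not accounted for by the two main effects.
ss_cells = n_rep * sum((mean(vals) - grand_mean) ** 2 for vals in cells.values())
ss_ab = ss_cells - ss_a - ss_b

# Residual (within-cell) variation.
ss_within = sum((v - mean(vals)) ** 2 for vals in cells.values() for v in vals)

# Degrees of freedom and F-statistics.
df_a, df_b = len(levels_a) - 1, len(levels_b) - 1
df_ab = df_a * df_b
df_within = len(all_values) - len(cells)
ms_within = ss_within / df_within

f_a = (ss_a / df_a) / ms_within
f_b = (ss_b / df_b) / ms_within
f_ab = (ss_ab / df_ab) / ms_within
print(f"F(A) = {f_a:.2f}, F(B) = {f_b:.2f}, F(A:B) = {f_ab:.2f}")
```

Each F-statistic would then be compared against an F distribution with the corresponding degrees of freedom to obtain a p-value. For unbalanced designs the sums of squares depend on the order or type of decomposition, which is one reason to rely on a library routine rather than these hand formulas.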

### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

Ans) With an F-statistic of 5.23 and a p-value of 0.02, the result is statistically significant at the 0.05 level, so we reject the null hypothesis that all group means are equal. This means that at least one group mean differs from the others. The F-test alone does not tell us which groups differ; a post-hoc test is needed for that.

To further interpret these results, you can also look at the effect size. One commonly used effect size measure for ANOVA is eta-squared (η²), which represents the proportion of variance in the dependent variable that can be explained by the group variable. The larger the value of η², the stronger the effect of the group variable on the dependent variable. 

You can calculate the effect size using the formula:

η² = SS_between / SS_total

where SS_between is the sum of squares between groups and SS_total is the total sum of squares. 

If the effect size is small (e.g., η² < 0.06), then the practical significance of the differences between the groups may be limited. However, if the effect size is large (e.g., η² > 0.14), then the differences between the groups are not only statistically significant but also practically important.
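As a small illustration of the formula, the sketch below computes η² for the three-group data set used in Q4. The variable names are chosen here for illustration only.

```python
from statistics import mean

# Data reused from the Q4 example.
groups = {"A": [10, 12, 14], "B": [8, 9, 11], "C": [15, 16, 17]}
all_values = [v for vals in groups.values() for v in vals]
grand_mean = mean(all_values)

# Between-group and total sums of squares.
ss_between = sum(len(vals) * (mean(vals) - grand_mean) ** 2 for vals in groups.values())
ss_total = sum((v - grand_mean) ** 2 for v in all_values)

eta_squared = ss_between / ss_total
print(f"eta squared = {eta_squared:.3f}")
```

For these data η² is about 0.82, meaning that roughly 82% of the variation in the values is associated with group membership, which would count as a large effect under the thresholds above.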

In summary, the F-statistic and p-value indicate that there are significant differences between the groups, and the effect size provides further information about the practical significance of these differences.

### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Ans) In a repeated measures ANOVA, missing data can occur when a participant is unable or unwilling to provide data for one or more time points. Missing data can occur for various reasons, including equipment malfunction, subject dropout, and incomplete responses. Handling missing data is crucial for obtaining accurate results and minimizing bias in the analysis.

One common approach to handling missing data in repeated measures ANOVA is imputation, which replaces missing values with plausible estimates based on other available data. Commonly used imputation methods include mean imputation, last observation carried forward, and regression imputation. The alternative is listwise deletion, which drops every participant with any missing time point. A classical repeated measures ANOVA needs complete data for each participant, so this is often what happens by default, but it reduces the sample size and power and can bias the results if the data are not missing completely at random.

However, the choice of method can have a significant impact on the results of the analysis. Different methods can lead to different estimates of the mean, variance, and covariance of the variables. For example, mean imputation tends to understate the variance, and last observation carried forward assumes that nothing changes after a participant's last recorded value. The choice can also affect the standard errors and statistical significance of the results. Therefore, it is important to carefully consider the method of handling missing data and to compare the results of different methods to assess their impact on the conclusions of the analysis.
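To make the effect of the choice concrete, here is a minimal sketch that applies listwise deletion, mean imputation, and last observation carried forward to the same small data set and compares the resulting mean and variance at the final time point. The scores are invented for illustration.

```python
from statistics import mean, variance

# Hypothetical scores for five participants at three time points; None marks a missing value.
scores = [
    [10, 12, 14],
    [11, None, 15],
    [9, 10, None],
    [12, 13, 16],
    [8, None, None],
]

def column(rows, j):
    """Return the j-th time point for every participant."""
    return [row[j] for row in rows]

# 1) Listwise deletion: keep only participants with complete data.
complete = [row for row in scores if None not in row]

# 2) Mean imputation: replace each missing value with the observed mean of its time point.
col_means = [mean(v for v in column(scores, j) if v is not None) for j in range(3)]
mean_imputed = [[col_means[j] if v is None else v for j, v in enumerate(row)] for row in scores]

# 3) Last observation carried forward: reuse the participant's previous observed value.
locf = []
for row in scores:
    filled, last = [], None
    for v in row:
        last = v if v is not None else last
        filled.append(last)
    locf.append(filled)

for name, rows in [("listwise", complete), ("mean", mean_imputed), ("LOCF", locf)]:
    t3 = column(rows, 2)
    print(f"{name:>8}: n = {len(rows)}, time-3 mean = {mean(t3):.2f}, variance = {variance(t3):.2f}")
```

In this toy data set, listwise deletion keeps only two of the five participants, mean imputation preserves the observed mean but shrinks the variance, and last observation carried forward pulls the final mean down because scores rise over time. The same analysis could therefore reach different conclusions depending on the method.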

### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Ans) Post-hoc tests are used after ANOVA to determine which groups are significantly different from each other when the overall F-test is significant. There are several post-hoc tests available, including Tukey's HSD (Honestly Significant Difference), Bonferroni correction, and Scheffe's test, among others.

Tukey's HSD is a widely used post-hoc test and is often the default choice. It compares all possible pairs of means and determines the minimum significant difference that needs to be present between two means to conclude that they are different from each other.

Bonferroni correction is a conservative post-hoc test that adjusts the significance level to account for multiple comparisons. It divides the original alpha level by the number of comparisons to control the family-wise error rate.

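Scheffe's test is a more conservative post-hoc test that controls the error rate for all possible contrasts simultaneously, not only pairwise comparisons. It is most useful when complex comparisons are of interest (for example, one group against the average of two others), and it remains valid when the sample sizes are unequal.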

An example of when a post-hoc test might be necessary is in a study comparing the effectiveness of three different medications for treating a specific condition. After conducting an ANOVA, if the overall F-test is significant, a post-hoc test can be used to determine which medications are significantly different from each other. For instance, Tukey's HSD could be used to compare all possible pairs of means and determine the minimum significant difference required to conclude that two means are different from each other.
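The Bonferroni adjustment described above is simple enough to sketch directly. In the snippet below, the three groups stand for the three medications, and the unadjusted p-values are hypothetical numbers chosen for illustration, not results from any real study.

```python
from itertools import combinations

groups = ["A", "B", "C"]  # three medications, as in the example above
pairs = list(combinations(groups, 2))  # all pairwise comparisons

alpha = 0.05
alpha_adjusted = alpha / len(pairs)  # Bonferroni: divide alpha by the number of comparisons

# Hypothetical unadjusted p-values from pairwise tests (illustrative numbers only).
raw_p_values = dict(zip(pairs, [0.004, 0.030, 0.400]))

decisions = {pair: p < alpha_adjusted for pair, p in raw_p_values.items()}
for (g1, g2), p in raw_p_values.items():
    verdict = "significant" if decisions[(g1, g2)] else "not significant"
    print(f"{g1} vs {g2}: p = {p:.3f} -> {verdict} at adjusted alpha {alpha_adjusted:.4f}")
```

Note that the A vs C comparison (p = 0.030) would be significant at the unadjusted 0.05 level but not after the correction, which illustrates why Bonferroni is described as conservative.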

### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [2]:
import numpy as np
from scipy.stats import f_oneway

# Generate some fake weight loss data for the three diets
np.random.seed(123)
diet_A = np.random.normal(loc=10, scale=2, size=50)
diet_B = np.random.normal(loc=8, scale=2, size=50)
diet_C = np.random.normal(loc=6, scale=2, size=50)

# Conduct one-way ANOVA
F, p = f_oneway(diet_A, diet_B, diet_C)

# Report results
print("F-statistic:", F)
print("p-value:", p)
if p < 0.05:
    print("There is significant evidence of a difference in mean weight loss between the diets.")
else:
    print("There is not significant evidence of a difference in mean weight loss between the diets.")

F-statistic: 37.03885406173804
p-value: 9.413909285242866e-14
There is significant evidence of a difference in mean weight loss between the diets.


In this example, we generate fake weight loss data for the three diets using normal distributions with different means (10, 8, and 6) and the same standard deviation of 2. Note that the simulation draws 50 observations per diet (150 in total), whereas the question describes 50 participants split across the three diets. We then use the `f_oneway` function to conduct the one-way ANOVA and obtain the F-statistic and p-value. Finally, we report the results and interpret them.

Since the p-value is very small (less than 0.05), we reject the null hypothesis of equal mean weight loss between the diets and conclude that there is significant evidence of a difference in mean weight loss between the diets.

### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

Ans) To conduct a two-way ANOVA using Python, we can use the `ols` function from the `statsmodels` module. Here's an example:

In [3]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate some fake data for the task completion time
np.random.seed(123)

# Software programs
programs = np.repeat(['A', 'B', 'C'], 30)

# Employee experience levels
experience = np.tile(['Novice', 'Experienced'], 45)

# Task completion time
time = np.random.normal(loc=10, scale=2, size=90)

# Create a DataFrame
data = pd.DataFrame({'Program': programs, 'Experience': experience, 'Time': time})

# Conduct two-way ANOVA
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model)

# Report results
print(anova_table)

                            df      sum_sq   mean_sq         F    PR(>F)
C(Program)                 2.0    2.926855  1.463428  0.264784  0.768009
C(Experience)              1.0    3.094719  3.094719  0.559941  0.456374
C(Program):C(Experience)   2.0    3.334259  1.667130  0.301641  0.740401
Residual                  84.0  464.256575  5.526864       NaN       NaN


In this example, we generate fake task completion times for 90 employees: 30 per software program, with 15 novice and 15 experienced employees in each. The question describes 30 employees in total, so the simulation is larger than the study it stands in for. All times are drawn from the same normal distribution (mean 10, standard deviation 2), so the data contain no true program, experience, or interaction effect. We then create a DataFrame with the data and use the `ols` function to fit a two-way ANOVA model with the factors "Program" and "Experience" and their interaction term. The `anova_lm` function from `statsmodels` is used to obtain the ANOVA table, which includes the F-statistics and p-values.

The output will display the ANOVA table with the F-statistics, degrees of freedom, sum of squares, mean squares, and p-values for each factor and interaction term. The p-values can be used to assess the significance of the main effects and interaction effects.

From the ANOVA table, we can interpret the results as follows:

- Program: The p-value for the "Program" factor is 0.768009, which is greater than the significance level of 0.05. Therefore, there is no evidence of a significant main effect of the software program on the task completion time.
- Experience: The p-value for the "Experience" factor is 0.456374, which is greater than the significance level of 0.05. Therefore, we do not have sufficient evidence to conclude a significant main effect of employee experience level on the task completion time.
- Program:Experience Interaction: The p-value for the interaction term "Program:Experience" is 0.740401, which is greater than the significance level of 0.05. Hence, we cannot conclude that there is a significant interaction effect between the software programs and employee experience level on the task completion time.

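In summary, based on the two-way ANOVA results, there is no significant main effect of the software programs, no significant main effect of employee experience level, and no significant interaction between the two. This is expected here, because the simulated times were all drawn from the same distribution.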

### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

Ans) To conduct a two-sample t-test and post-hoc test using Python, we can use the `scipy.stats` module. Here's an example:

In [4]:
import numpy as np
from scipy import stats

# Generate some fake data for test scores
np.random.seed(123)

control_scores = np.random.normal(loc=70, scale=10, size=100)
experimental_scores = np.random.normal(loc=75, scale=12, size=100)

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)

# Report results
print("Two-sample t-test results:")
print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")

# Repeat the comparison with Welch's t-test (equal_var=False);
# with only two groups, no separate post-hoc test is actually needed
pairwise_ttests = stats.ttest_ind(control_scores, experimental_scores, equal_var=False)
posthoc_p_values = pairwise_ttests.pvalue

# Report post-hoc results
print("\nPost-hoc test results:")
print("p-values for pairwise t-tests:")
print(f"Control vs Experimental: {posthoc_p_values}")

Two-sample t-test results:
t-statistic: -2.7585883908860698
p-value: 0.006349134276947244

Post-hoc test results:
p-values for pairwise t-tests:
Control vs Experimental: 0.006349676420454233


In this example, we generate fake test scores for a control group (traditional teaching method) and an experimental group (new teaching method). Note that the code draws 100 scores per group (200 in total), whereas the question describes 100 students split between the two groups. We use the `numpy.random.normal` function to generate normally distributed test scores for each group.

We then use the `ttest_ind` function from `scipy.stats` to perform a two-sample t-test. The function takes the test scores of the control group and experimental group as inputs and returns the t-statistic and p-value.

The output will display the t-statistic and p-value of the two-sample t-test.

Since the p-value (0.0063) is less than the significance level of 0.05, we can conclude that there is a significant difference in test scores between the control group and the experimental group.

A post-hoc test is only needed when more than two groups are compared, because its purpose is to find out which pairs of groups differ. With two groups, the t-test itself already answers that question. The second call to `ttest_ind` in the code, labeled as a post-hoc test, is in fact Welch's t-test (`equal_var=False`), which does not assume equal variances. Its p-value (0.0063) agrees with the first test.

Therefore, since the t-statistic is negative (control minus experimental) and the p-value is below 0.05, we can conclude that scores in the simulated experimental group are significantly higher than in the control group, which is consistent with the new teaching method improving test scores.
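One way to see why a two-group comparison needs no follow-up test is that the pooled-variance t-test and a one-way ANOVA on the same two groups are equivalent: the F-statistic equals the square of the t-statistic. The sketch below checks this by hand on a small set of hypothetical scores.

```python
from math import isclose, sqrt
from statistics import mean, variance

# Hypothetical test scores (illustrative only).
control = [70, 72, 68, 75, 71]
experimental = [78, 74, 80, 77, 76]
n1, n2 = len(control), len(experimental)

# Pooled-variance two-sample t-statistic.
ss_within = (n1 - 1) * variance(control) + (n2 - 1) * variance(experimental)
pooled_var = ss_within / (n1 + n2 - 2)
t_stat = (mean(control) - mean(experimental)) / sqrt(pooled_var * (1 / n1 + 1 / n2))

# One-way ANOVA F-statistic for the same two groups (1 degree of freedom between groups).
grand_mean = mean(control + experimental)
ss_between = (n1 * (mean(control) - grand_mean) ** 2
              + n2 * (mean(experimental) - grand_mean) ** 2)
f_stat = (ss_between / 1) / (ss_within / (n1 + n2 - 2))

print(f"t^2 = {t_stat ** 2:.3f}, F = {f_stat:.3f}")
print("equal:", isclose(t_stat ** 2, f_stat))
```

The equivalence holds for the equal-variance t-test; Welch's version uses a different standard error and degrees of freedom, so its result can differ slightly.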

### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

Ans) To conduct a repeated measures ANOVA and post-hoc test using Python, we can use the `statsmodels` library. Here's an example:

In [5]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate some fake data for daily sales
np.random.seed(123)

store_a_sales = np.random.normal(loc=500, scale=100, size=30)
store_b_sales = np.random.normal(loc=550, scale=120, size=30)
store_c_sales = np.random.normal(loc=600, scale=110, size=30)

# Create a pandas DataFrame for the data
data = pd.DataFrame({
    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Sales': np.concatenate([store_a_sales, store_b_sales, store_c_sales])
})

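# Fit a one-way ANOVA model (note: this treats all 90 observations as
# independent and does not model the repeated measurement of the same 30 days)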
model = ols('Sales ~ Store', data=data).fit()
anova_table = sm.stats.anova_lm(model)

# Report results
print("Repeated measures ANOVA results:")
print(anova_table)

# Perform post-hoc test (Tukey's HSD)
posthoc_results = sm.stats.multicomp.pairwise_tukeyhsd(data['Sales'], data['Store'])

# Report post-hoc results
print("\nPost-hoc test results:")
print(posthoc_results)

Repeated measures ANOVA results:
            df        sum_sq       mean_sq         F    PR(>F)
Store      2.0  1.204407e+05  60220.353984  3.639995  0.030328
Residual  87.0  1.439335e+06  16544.076803       NaN       NaN

Post-hoc test results:
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
group1 group2 meandiff p-adj   lower    upper   reject
------------------------------------------------------
     A      B  62.5068   0.15  -16.683 141.6966  False
     A      C  86.8565 0.0281   7.6667 166.0463   True
     B      C  24.3497 0.7445 -54.8401 103.5395  False
------------------------------------------------------


In this example, we generate fake data for daily sales of three retail stores: Store A, Store B, and Store C. We use the `numpy.random.normal` function to generate normally distributed sales values for each store.

We then create a pandas DataFrame to store the data, with columns for "Store" and "Sales". Each row represents the sales of a particular store on a specific day.

We use the `statsmodels` library to fit the model. We specify the formula "Sales ~ Store" to indicate that we want to analyze the sales variable based on the store factor. The `ols` function is used to fit the model, and the `anova_lm` function is used to generate the ANOVA table. Note that this is an ordinary one-way ANOVA that treats the 90 observations as independent; a true repeated measures ANOVA would also account for the fact that the same 30 days are observed for every store.

The output displays the ANOVA table, which includes the F-statistic, p-value, and other relevant statistics.

Since the p-value (0.03) is less than the significance level of 0.05, we reject the null hypothesis and conclude that average daily sales differ significantly between the three stores (F = 3.64). The Tukey HSD post-hoc test shows that only Store A and Store C differ significantly (mean difference 86.86, adjusted p = 0.0281); the A vs. B and B vs. C comparisons are not significant.
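Because the same days are observed for every store, a repeated measures analysis would treat each day as a "subject" and remove day-to-day variation from the error term. As a rough sketch of what that changes, the snippet below computes the repeated measures F-statistic by hand for a small hypothetical data set and compares it with the ordinary one-way F-statistic that ignores the pairing by day. The sales figures are invented, and only the F-statistics are computed, not p-values.

```python
from statistics import mean

# Hypothetical sales for 4 days (rows) at stores A, B, C (columns).
sales = [
    [10, 12, 14],
    [20, 21, 25],
    [30, 33, 33],
    [40, 42, 44],
]
n_days, n_stores = len(sales), len(sales[0])

grand_mean = mean(v for row in sales for v in row)
store_means = [mean(row[j] for row in sales) for j in range(n_stores)]
day_means = [mean(row) for row in sales]

# Partition the total variation into store, day (subject), and error components.
ss_total = sum((v - grand_mean) ** 2 for row in sales for v in row)
ss_store = n_days * sum((m - grand_mean) ** 2 for m in store_means)
ss_day = n_stores * sum((m - grand_mean) ** 2 for m in day_means)
ss_error = ss_total - ss_store - ss_day  # store-by-day residual

# Repeated measures F: store effect against the residual after removing day effects.
df_store = n_stores - 1
df_error = (n_days - 1) * (n_stores - 1)
f_repeated = (ss_store / df_store) / (ss_error / df_error)

# For comparison: ordinary one-way ANOVA that ignores the pairing by day.
df_within = n_days * n_stores - n_stores
f_one_way = (ss_store / df_store) / ((ss_total - ss_store) / df_within)

print(f"repeated measures F = {f_repeated:.2f}")
print(f"one-way F           = {f_one_way:.2f}")
```

In this toy example most of the variation comes from differences between days, so the repeated measures F (about 24) is far larger than the one-way F (about 0.1). For real data, `statsmodels` provides an `AnovaRM` class for this kind of design, which would require adding a day identifier column to the DataFrame.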