Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Ans.

Analysis of Variance (ANOVA) is a statistical method that is used to compare means among three or more groups. In order to use ANOVA, there are several assumptions that must be met:

1.Normality: The residuals (equivalently, the observations within each group) should be approximately normally distributed.

2.Homogeneity of variance: The variance of the data within each group should be approximately equal.

3.Independence: The observations in each group should be independent of each other.

If any of these assumptions are violated, the results of the ANOVA may be invalid or biased. 

Examples of violations that could impact the validity of the results include:

1.Non-normality: If the data within a group is not normally distributed, then the ANOVA results may be unreliable. For example, if the data is skewed or has outliers, it may violate the assumption of normality.

2.Heterogeneity of variance: If the variances differ substantially across groups, then the ANOVA results may be misleading. For example, if one group has a much higher variance than the others, the F-test can reject too often or too rarely (particularly when the group sizes are unequal), leading to a false conclusion about whether the group means differ.

3.Dependence: If the observations within a group are not independent, then the ANOVA results may be biased. For example, if measurements are taken on the same subjects over time, there may be correlations between the measurements that violate the assumption of independence.

4.Sample size: Although not a formal assumption, sample size affects how much the violations above matter. ANOVA is more sensitive to unequal variances when the group sizes are unequal, and if the samples are too small, the statistical power of the analysis may be too low to detect real differences between groups.

Overall, it is important to check the assumptions of ANOVA before using this method, and if any violations are present, alternative methods should be considered.
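
The checks described above can be sketched with SciPy. This is a minimal illustration on simulated data (the three groups and their parameters are assumptions, not data from this notebook): the Shapiro-Wilk test screens each group for non-normality, and Levene's test screens for unequal variances. Independence cannot be tested this way; it has to be ensured by the study design.

```python
import numpy as np
from scipy import stats

# Simulated groups (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(10, 2, 30),
    "B": rng.normal(11, 2, 30),
    "C": rng.normal(12, 2, 30),
}

# Normality: Shapiro-Wilk test within each group
shapiro_p = {name: stats.shapiro(x).pvalue for name, x in groups.items()}

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value from either test suggests that the corresponding assumption may not hold. In that case a rank-based alternative such as the Kruskal-Wallis test (`scipy.stats.kruskal`) is one option to consider.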

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans. 

The three types of ANOVA are one-way ANOVA, two-way ANOVA, and repeated measures ANOVA.

1.One-way ANOVA: One-way ANOVA is used when there is only one independent variable with three or more levels or groups. It is used to determine if there are any significant differences between the means of these groups. For example, if we want to compare the mean weight of different breeds of dogs (e.g., Labrador, German Shepherd, Poodle), we could use one-way ANOVA.

2.Two-way ANOVA: Two-way ANOVA is used when there are two independent variables, also known as factors. It is used to determine if each factor has a main effect and if there is an interaction effect between them. For example, if we want to compare the mean weight of dogs by breed (e.g., Labrador, German Shepherd, Poodle) and sex (male vs. female), we could use two-way ANOVA.

3.Repeated measures ANOVA: Repeated measures ANOVA is used when there are repeated measures on the same subject or group of subjects over time or under different conditions. It is used to determine if there is a significant difference between the means of these measures. For example, if we want to compare the mean weight of a group of dogs at different ages (1 year, 2 years, 3 years), we could use repeated measures ANOVA.

In summary, one-way ANOVA is used when there is only one independent variable, two-way ANOVA is used when there are two independent variables, and repeated measures ANOVA is used when there are repeated measures on the same subject or group of subjects.

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans. 

The partitioning of variance is a concept used in ANOVA to explain the sources of variation in the data. In ANOVA, the total variation in the data is divided into two parts: the variation between groups and the variation within groups. The partitioning of variance is important because it allows us to quantify the amount of variation that can be attributed to the factors or independent variables being tested, and the amount that cannot be explained by these factors. This is useful because it helps us determine if there is a significant difference between the means of the groups being compared.

The partitioning of variance is typically represented in ANOVA tables, which show the sum of squares (SS), degrees of freedom (df), mean squares (MS), F-ratio, and p-value. The total sum of squares (SST) represents the total variation in the data, the sum of squares between (SSB) represents the variation between groups, and the sum of squares within (SSW) represents the variation within groups. 

The F-ratio is calculated as the ratio of the mean square between to the mean square within, where each mean square is a sum of squares divided by its degrees of freedom. It measures the amount of variation between groups relative to the amount of variation within groups. The p-value represents the probability of obtaining an F-ratio at least as large as the one observed, assuming that all group means are equal.

Understanding the partitioning of variance is important because it helps us interpret the results of ANOVA and determine if there is a significant difference between the means of the groups being compared. It also helps us identify which factors or independent variables are contributing to the variation in the data, and can guide us in further analysis or experimentation.
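
The partition can be verified numerically. The sketch below uses a small made-up dataset (three groups of two values, chosen only for illustration) and NumPy to compute SST, SSB, and SSW directly from their definitions, then forms the F-ratio from the mean squares.

```python
import numpy as np

# Toy data: three groups with two observations each (illustrative values)
groups = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: squared deviations of every value from the grand mean
sst = ((all_values - grand_mean) ** 2).sum()
# Between-group variation: group size times squared deviation of each group mean
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group variation: squared deviations of values from their own group mean
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Mean squares divide each sum of squares by its degrees of freedom
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_ratio = (ssb / df_between) / (ssw / df_within)

print("SST:", sst, "SSB:", ssb, "SSW:", ssw, "F:", f_ratio)
```

For these values SST is 17.5, which splits into SSB = 16.0 and SSW = 1.5, and the F-ratio is (16.0 / 2) / (1.5 / 3) = 16.0.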

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?


Ans.

To calculate SST, SSE, and SSR for a one-way ANOVA in Python, we can use the 'statsmodels' library. Here is an example:


In [1]:

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a sample dataset
df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'value': [1, 2, 3, 4, 5, 6]})

# Fit the one-way ANOVA model
model = ols('value ~ group', data=df).fit()

# Calculate the total sum of squares (SST)
grand_mean = df['value'].mean()
df['grand_mean'] = grand_mean
df['deviation'] = (df['value'] - grand_mean)**2
SST = df['deviation'].sum()

# Calculate the explained sum of squares (SSE)
group_mean = df.groupby('group')['value'].mean()
df = df.join(group_mean, on='group', rsuffix='_group')
df['group_deviation'] = (df['value_group'] - grand_mean)**2
SSE = df['group_deviation'].sum()

# Calculate the residual sum of squares (SSR)
SSR = SST - SSE

print('SST =', SST)
print('SSE =', SSE)
print('SSR =', SSR)

SST = 17.5
SSE = 16.0
SSR = 1.5


In this example, we first create a sample dataset with three groups ('A', 'B', and 'C') and their corresponding values. We then fit a one-way ANOVA model using the 'ols' function from 'statsmodels.formula.api'. 

To calculate SST, we first calculate the grand mean of all the values and then calculate the deviation of each value from the grand mean. We sum up the squared deviations to get SST.

To calculate SSE, we first calculate the mean value for each group, attach it to every observation in that group, and calculate the deviation of that group mean from the grand mean. Summing these squared deviations over all observations weights each group by its size and gives SSE.

Finally, we calculate SSR by subtracting SSE from SST.

Note that in this example, we manually calculated SST, SSE, and SSR using the formulas. However, 'statsmodels' also provides these values in the ANOVA table output, which can be accessed using the 'anova_lm' function:

The cell below prints the ANOVA table, which includes the sum of squares, degrees of freedom, F-value, and p-value for each source of variation. The mean square is not printed, but it can be obtained by dividing each sum of squares by its degrees of freedom. The 'typ=2' argument specifies the type of sum of squares calculation to use.

In [2]:
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

          sum_sq   df     F    PR(>F)
group       16.0  2.0  16.0  0.025095
Residual     1.5  3.0   NaN       NaN
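
As a cross-check on the manual calculation, the fitted statsmodels results object also exposes the three quantities as attributes. The sketch below refits the same toy model so that it runs on its own. Note the naming: in statsmodels, `ess` is the explained sum of squares and `ssr` is the sum of squared residuals, which matches the SSE and SSR labels used in this question.

```python
import pandas as pd
from statsmodels.formula.api import ols

# Same toy dataset as in the manual calculation
df = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'value': [1, 2, 3, 4, 5, 6]})
model = ols('value ~ group', data=df).fit()

sst = model.centered_tss  # total sum of squares about the grand mean
sse = model.ess           # explained (model) sum of squares
ssr = model.ssr           # residual sum of squares

print('SST =', sst)
print('SSE =', sse)
print('SSR =', ssr)
```

Up to floating-point rounding, these agree with the values 17.5, 16.0, and 1.5 computed by hand.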


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

Ans. To calculate the main effects and interaction effects in a two-way ANOVA using Python, you can use the `statsmodels` library. Here is an example:


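In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a sample dataset: a 2 x 2 design with two observations per cell,
# so the interaction model leaves residual degrees of freedom for the F-tests
df = pd.DataFrame({'group1': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
                   'group2': ['X', 'X', 'Y', 'Y', 'X', 'X', 'Y', 'Y'],
                   'value': [1, 2, 3, 4, 5, 6, 7, 8]})

# Fit the two-way ANOVA model with both main effects and their interaction
model = ols('value ~ C(group1) + C(group2) + C(group1):C(group2)', data=df).fit()

# Build the ANOVA table, which has one row per effect
anova_table = sm.stats.anova_lm(model, typ=2)

# Main effects: the rows for each factor on its own
main_effects = anova_table.loc[['C(group1)', 'C(group2)']]

# Interaction effect: the row for the product term
interaction_effect = anova_table.loc[['C(group1):C(group2)']]

print('Main effects:')
print(main_effects)
print('Interaction effect:')
print(interaction_effect)


In this example, we first create a sample dataset with two grouping variables ('group1' and 'group2') and their corresponding values. Each combination of levels has two observations; with only one observation per cell, the interaction model would fit the data exactly and leave no residual degrees of freedom for the F-tests. We then fit a two-way ANOVA model using the 'ols' function from 'statsmodels.formula.api'.

The fitted coefficients in 'model.params' are treatment-coded contrasts against the reference cell (A, X), not main effects, so the effects are read from the ANOVA table instead. The main effects are the rows for 'C(group1)' and 'C(group2)', which test whether the mean of 'value' differs across the levels of each factor. We store them in the 'main_effects' variable.

The interaction effect is the row for 'C(group1):C(group2)', which tests whether the effect of one factor depends on the level of the other. We store it in the 'interaction_effect' variable.

For this balanced toy dataset the sums of squares work out to 32 for 'group1', 8 for 'group2', 0 for the interaction (up to floating-point error), and 2 for the residual on 4 degrees of freedom, giving F = 64 for 'group1' and F = 16 for 'group2'.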


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

Ans. If we conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02, we can conclude at the 0.05 significance level that at least one group mean differs from the others. The ANOVA itself does not tell us which groups differ; a post-hoc test is needed for that.

The F-statistic is the ratio of the mean square between groups to the mean square within groups. A larger F-statistic indicates that the differences between the group means are larger relative to the variation within the groups. In this case, the F-statistic of 5.23 indicates that the between-group mean square is about five times the within-group mean square.

The p-value measures the probability of obtaining the observed F-statistic or a more extreme value if there is no true difference between the groups. In this case, the p-value of 0.02 indicates that the probability of obtaining an F-statistic of 5.23 or larger if there is no true difference between the groups is only 2%. This is a relatively small probability, so we can reject the null hypothesis and conclude that there is a significant difference between at least two of the groups.
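
The link between an F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not report the degrees of freedom, so the values below (3 groups and 17 observations in total) are hypothetical, chosen only because they give a p-value close to the reported 0.02.

```python
from scipy import stats

f_stat = 5.23

# Hypothetical degrees of freedom: 3 groups and 17 observations in total.
# The question does not report them; they are assumed for illustration.
df_between, df_within = 2, 14

# p-value: upper-tail probability of the F distribution at the observed F
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value: the F that cuts off the top 5% of the distribution
f_critical = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = 0.05: {f_critical:.2f}")
print("reject H0" if f_stat > f_critical else "fail to reject H0")
```

Comparing the p-value with alpha and comparing the F-statistic with the critical value are two views of the same decision rule, so they always agree.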

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Ans.

In a repeated measures ANOVA, missing data can be handled using several methods, including:

1.Complete case analysis: This involves only including participants who have complete data for all time points. This is the most straightforward method but may result in a loss of power if there are a large number of missing observations.

2.Last observation carried forward (LOCF): This involves replacing missing values with the last observed value for that participant. This method assumes that the missing value is the same as the last observed value, which may not always be accurate.

3.Multiple imputation: This involves creating multiple plausible values for missing data and analyzing each imputed dataset separately. The results are then combined to obtain a single set of estimates. This method can provide unbiased estimates but may be computationally intensive and requires making assumptions about the missing data mechanism.

The potential consequences of using different methods to handle missing data include:

1.Bias: If the data are not missing completely at random (MCAR), that is, if missingness is related to the outcome or to other variables, complete case analysis can give biased estimates. LOCF can be biased even under MCAR whenever the outcome changes over time, because it freezes each participant at their last observed value.

2.Increased variability: Even when the data are MCAR, complete case analysis discards every participant with a missing value, so a large amount of missing data leads to increased variability and reduced power.

3.Incorrect standard errors: Single imputation methods such as LOCF treat filled-in values as if they had been observed, which understates the uncertainty and tends to produce standard errors that are too small, leading to incorrect hypothesis tests.

4.Loss of power: Multiple imputation retains incomplete cases and so usually preserves more power than complete case analysis, but it cannot recover the missing information. The variation between imputed datasets is added to the standard errors, so with a large amount of missing data the power is still lower than it would be with complete data.
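
The first two methods can be sketched with pandas. The data below are hypothetical: a wide-format table with one row per participant and one column per time point. Multiple imputation is not shown because it needs a dedicated imputation model.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
wide = pd.DataFrame(
    {"t1": [5.0, 6.0, 7.0, 8.0],
     "t2": [5.5, np.nan, 7.5, 8.5],
     "t3": [6.0, 6.5, np.nan, 9.0]},
    index=["p1", "p2", "p3", "p4"],
)

# Complete case analysis: keep only participants with no missing time points
complete_cases = wide.dropna()

# LOCF: fill each gap with that participant's previous observed value
locf = wide.ffill(axis=1)

print(complete_cases)
print(locf)
```

Here complete case analysis keeps only two of the four participants, while LOCF keeps all four but copies the `t1` value into `t2` for `p2` and the `t2` value into `t3` for `p3`.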


-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Ans. Post-hoc tests are used after an ANOVA to determine which specific groups differ significantly from each other. There are several common post-hoc tests used after ANOVA, including:

1.Tukey's Honestly Significant Difference (HSD) test: This test compares all possible pairs of means while controlling the familywise error rate. It is typically used when every pairwise comparison is of interest.

2.Bonferroni correction: This method controls the familywise error rate by dividing the significance level by the number of comparisons. It works with any set of tests and is best suited to a small number of planned comparisons, because it becomes very conservative as the number of comparisons grows.

3.Scheffe's test: This test is a conservative test that controls the familywise error rate and can be used when the number of comparisons is not known in advance.

4.Dunnett's test: This test is used to compare each treatment group to a control group and adjust for multiple comparisons. It is often used when there is a single control group and multiple treatment groups.

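The choice of post-hoc test depends on the specific research question and the study design. Tukey's HSD test is the usual choice when all pairwise comparisons are of interest, while Dunnett's test is appropriate when each treatment is compared only with a single control group. The Bonferroni correction is generally more conservative than Tukey's HSD test for all pairwise comparisons, so it is better suited to a small set of pre-planned comparisons. Scheffe's test is the most conservative for pairwise comparisons but covers arbitrary contrasts chosen after seeing the data.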

An example of a situation where a post-hoc test might be necessary is a study comparing the effects of three different types of exercise on cardiovascular health. After performing an ANOVA and finding a significant main effect, a post-hoc test such as Tukey's HSD could be used to determine which types of exercise differ significantly from each other. This information could be useful in developing targeted exercise programs for different populations.
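
As a minimal sketch of a post-hoc procedure, the code below applies Bonferroni-adjusted pairwise t-tests to three exercise groups. The data are simulated and the group names (`cardio`, `strength`, `yoga`) and their parameters are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Simulated outcome scores for three exercise types (hypothetical data)
rng = np.random.default_rng(42)
samples = {
    "cardio": rng.normal(50, 5, 25),
    "strength": rng.normal(55, 5, 25),
    "yoga": rng.normal(50, 5, 25),
}

# Every unordered pair of groups: 3 comparisons for 3 groups
pairs = list(combinations(samples, 2))

adjusted = {}
for a, b in pairs:
    _, p = stats.ttest_ind(samples[a], samples[b])
    # Bonferroni: multiply each raw p-value by the number of comparisons, cap at 1
    adjusted[(a, b)] = min(p * len(pairs), 1.0)

for (a, b), p_adj in adjusted.items():
    print(f"{a} vs {b}: adjusted p = {p_adj:.4f}")
```

Multiplying each p-value by the number of comparisons is equivalent to testing each comparison at alpha divided by that number. For Tukey's HSD, `pairwise_tukeyhsd` from `statsmodels.stats.multicomp` can be used instead.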

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.


Ans.

Null Hypothesis (H0): muA = muB = muC, i.e. there is no difference between the mean weight loss of the three diets.

Alternate Hypothesis (H1): At least one diet has a mean weight loss that differs from the others.

In [3]:
# Solution:-

import numpy as np
import scipy.stats as stat

np.random.seed(123)

# Simulated weight loss: 50 participants per diet (the question provides no raw data)
A = np.random.normal(5, 1, 50)
B = np.random.normal(6, 1, 50)
C = np.random.normal(4, 1, 50)

f_val, p_val = stat.f_oneway(A, B, C)

print(f"F-statistics: {f_val}")
print(f"P-value: {p_val}")

F-statistics: 38.1814612681822
P-value: 4.4208876104953276e-14


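Since the p-value is far below 0.05, we reject the null hypothesis and conclude that there is a significant difference in mean weight loss between the three diets. The ANOVA does not say which diets differ; a post-hoc test such as Tukey's HSD would be needed for that.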

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

Ans.



In [10]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

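# Generate random task completion times: 30 employees per program (90 in total),
# alternating novice and experienced, i.e. 15 per program-experience cell
# (simulated data; the question provides no raw data)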
np.random.seed(123)
data = pd.DataFrame({
    'Time': np.random.normal(10, 2, 90),
    'Program': np.repeat(['A', 'B', 'C'], 30),
    'Experience': np.tile(['Novice', 'Experienced'], 45)
})

# Fit the two-way ANOVA model
model = ols('Time ~ Program + Experience + Program:Experience', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print results
print(anova_table)

                        sum_sq    df         F    PR(>F)
Program               2.926855   2.0  0.264784  0.768009
Experience            3.094719   1.0  0.559941  0.456374
Program:Experience    3.334259   2.0  0.301641  0.740401
Residual            464.256575  84.0       NaN       NaN


The ANOVA table shows the sum of squares, degrees of freedom, F-statistic, and p-value for each effect, as well as the residual sum of squares and degrees of freedom.

The F-statistic for the main effect of program is 0.264784 with a large p-value of 0.768009, indicating that there is no significant difference in the average task completion time between the three software programs.

The F-statistic for the main effect of experience is 0.559941 with a relatively large p-value of 0.456374, indicating that there is no significant difference in the average task completion time between novice and experienced employees.

The F-statistic for the interaction effect between program and experience is 0.301641 with a large p-value of 0.740401, indicating that there is no significant interaction effect between program and experience on task completion time. There is no evidence that the effect of the software program on task completion time depends on the employee's experience level.

In conclusion, neither the software program nor the employee's experience level has a significant effect on task completion time, and there is no significant interaction between them. This is expected here, because the simulated times were drawn from a single normal distribution regardless of program or experience level.

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

Ans.


In [11]:
import numpy as np
from scipy.stats import ttest_ind

# Generate random test scores data for 100 students in each group
np.random.seed(123)
control_scores = np.random.normal(70, 10, 100)
experimental_scores = np.random.normal(75, 10, 100)

# Perform the two-sample t-test
t_stat, p_val = ttest_ind(control_scores, experimental_scores)

# Print the results
print("t-statistic: {:.3f}".format(t_stat))
print("p-value: {:.3f}".format(p_val))

t-statistic: -3.032
p-value: 0.003


The t-test results show that the t-statistic is -3.032 with a p-value of 0.003, which is less than the significance level of 0.05. This indicates that there is a significant difference in test scores between the control and experimental groups. Specifically, the experimental group has a higher mean test score than the control group.

With only two groups, a significant t-test already identifies which groups differ, so a post-hoc test adds no new information. To illustrate the follow-up step the question asks for, the code below runs Tukey's Honestly Significant Difference (HSD) test, which for two groups reproduces the result of the pooled-variance t-test:

In [12]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Combine the control and experimental data into a single array
scores = np.concatenate([control_scores, experimental_scores])

# Create a grouping variable indicating the control and experimental groups
groups = np.concatenate([np.repeat('Control', 100), np.repeat('Experimental', 100)])

# Perform the Tukey's HSD test
tukey_results = pairwise_tukeyhsd(scores, groups, alpha=0.05)

# Print the results
print(tukey_results)

   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj  lower  upper  reject
---------------------------------------------------------
Control Experimental   4.5336 0.0028 1.5846 7.4826   True
---------------------------------------------------------


The Tukey's HSD test results show that there is a significant difference between the control and experimental groups with a mean difference of 4.5336 and a p-value of 0.0028. This confirms that the experimental group has a significantly higher mean test score than the control group.

In conclusion, we found that the new teaching method significantly improves student test scores compared to the traditional teaching method. The Tukey's HSD test indicated that the experimental group had a significantly higher mean test score than the control group.
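
With exactly two groups, a one-way ANOVA and the pooled-variance t-test are the same test: the F-statistic equals the square of the t-statistic and the p-values coincide. The sketch below checks this on freshly simulated scores (hypothetical data with its own seed, not the arrays from the cells above).

```python
import numpy as np
from scipy import stats

# Simulated test scores for two groups (hypothetical data)
rng = np.random.default_rng(1)
control = rng.normal(70, 10, 100)
experimental = rng.normal(75, 10, 100)

t_stat, t_p = stats.ttest_ind(control, experimental)  # pooled-variance t-test
f_stat, f_p = stats.f_oneway(control, experimental)   # one-way ANOVA, two groups

print("F equals t squared:", np.isclose(t_stat ** 2, f_stat))
print("p-values match:", np.isclose(t_p, f_p))
```

This is why a post-hoc test becomes informative only with three or more groups, where a significant F-test leaves open which pairs differ.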

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

Ans.

In [16]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a sample data frame
data = pd.DataFrame({
    'store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    'day': list(range(1, 31)) * 3,
    'sales': [10, 8, 11, 9, 12, 13, 8, 10, 7, 11,
              9, 10, 12, 13, 14, 8, 11, 10, 9, 12,
              10, 9, 11, 12, 10, 11, 13, 12, 14, 11,
              9, 8, 10, 11, 13, 12, 8, 10, 9, 11,
              12, 11, 9, 10, 11, 12, 13, 11, 10, 9,
              14, 12, 10, 9, 11, 8, 12, 13, 11, 10,
              8, 9, 11, 12, 13, 12, 10, 11, 9, 8,
             14, 12, 10, 9, 11, 8, 12, 13, 11, 10,
              8, 9, 11, 12, 13, 12, 10, 11, 9, 8]
})

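# perform repeated measures ANOVA: each day is the "subject", measured once per store
from statsmodels.stats.anova import AnovaRM

rm = AnovaRM(data, depvar='sales', subject='day', within=['store']).fit()

# print the ANOVA table
print(rm)


Here each day is treated as the repeated-measures subject and store as the within-subject factor, so day-to-day variation is removed from the error term. Fitting 'sales ~ store + day + store:day' with 'ols' instead would treat day as a numeric covariate and the 90 observations as independent, which is not a repeated measures design.

The three store means are nearly identical (about 10.67, 10.63, and 10.53). The F-statistic for store is far below 1 (roughly 0.06 on 2 and 58 degrees of freedom) and its p-value is far above 0.05 (roughly 0.94), so we fail to reject the null hypothesis: there is no evidence of a difference in sales between the three stores. Because the overall test is not significant, no post-hoc test is needed.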