# Statistics Advance-6 Assignment

# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

# Answer-1-Analysis of Variance (ANOVA) is a statistical technique used to analyze the differences among group means in a study involving multiple groups or treatments. To use ANOVA effectively, several assumptions must be met. Violations of these assumptions can impact the validity of the results. Here are the key assumptions and examples of violations:

# Assumption 1: Independence: Observations within and between groups must be independent. This means that the values within one group should not influence or be related to the values in another group. Violations occur when there is autocorrelation, or when data points within a group are not independent. For example, in a repeated measures design, where the same subjects are measured over time, violations of independence may occur if the measurements are correlated over time.

# Assumption 2: Normality: The residuals (the differences between observed values and the group means) should be normally distributed. Violations occur when the residuals do not follow a normal distribution. This can be detected using methods like the Shapiro-Wilk test or visual inspection of a Q-Q plot. For example, if you have a group with highly skewed or leptokurtic data, it may violate the normality assumption.

# Assumption 3: Homogeneity of Variance (Homoscedasticity): The variances of the groups should be roughly equal. Violations occur when some groups have significantly larger or smaller variances compared to others. This can be detected using statistical tests like Levene's test or by examining scatterplots of the residuals. For example, if one group has much higher variability in its data compared to the other groups, it violates homogeneity of variance.

# Assumption 4: Independence of Errors: The errors (residuals) should be independent of each other. This means that there should be no systematic patterns or correlations in the residuals. Violations may occur when the residuals show a pattern over time, or if there are correlations between the residuals within or between groups.

# Assumption 5: Balanced Design (desirable, not strictly required): Equal sample sizes are not a formal assumption of one-way ANOVA, but a roughly balanced design makes the F-test more robust. With very unequal group sizes, the test becomes more sensitive to unequal variances and can lose power. When unequal sample sizes occur together with unequal variances, an alternative such as Welch's ANOVA is usually preferred.

# Assumption 6: Independence of Groups: The groups being compared should be independent of each other. Violations occur when there is overlap or dependence between groups. For example, in a paired design, where the same subjects are used in both groups, the independence of groups assumption is violated.
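
# The normality and equal-variance checks mentioned above can be sketched as follows. This is a minimal illustration, not a full diagnostic workflow: the three groups and their values are hypothetical, and the tests used are `scipy.stats.shapiro` (Shapiro-Wilk on the residuals) and `scipy.stats.levene` (Levene's test across groups).

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three independent groups (illustrative values only).
groups = {
    "lecture": np.array([72, 75, 78, 71, 74, 77, 73, 76]),
    "video": np.array([80, 82, 79, 85, 81, 83, 78, 84]),
    "discussion": np.array([68, 90, 75, 60, 95, 70, 88, 64]),
}

# Residuals: each observation minus its own group mean.
residuals = np.concatenate([g - g.mean() for g in groups.values()])

# Shapiro-Wilk: a small p-value suggests the residuals are not normally distributed.
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene: a small p-value suggests the group variances are not equal.
levene_stat, levene_p = stats.levene(*groups.values())

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

# In this made-up data the "discussion" group is far more spread out than the other two, so Levene's test should flag unequal variances. Independence cannot be tested this way; it has to come from the study design.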

# Q2. What are the three types of ANOVA, and in what situations would each be used?

# Answer-2-Analysis of Variance (ANOVA) is a statistical technique used to test for differences among group means. There are three main types of ANOVA, each of which is used in different situations:

# One-Way ANOVA:
# When to Use: One-way ANOVA is used when you have one categorical independent variable (factor) and one continuous dependent variable. It is used to determine whether there are statistically significant differences in the means of two or more independent (unrelated) groups or levels of the factor.
# Example: You want to compare the mean test scores of students who studied under different teaching methods (e.g., traditional lecture, online video lectures, and group discussions).
# Two-Way ANOVA:
# When to Use: Two-way ANOVA is used when you have two categorical independent variables (factors) and one continuous dependent variable. It is used to examine the influence of two independent variables, individually and in combination, on the dependent variable.
# Example: You want to study the effects of both gender (male/female) and treatment (A, B, C) on the blood pressure of patients.
# Repeated Measures ANOVA (or within-subjects ANOVA):
# When to Use: Repeated Measures ANOVA is used when you have one group of subjects and you measure the same subjects under different conditions or time points. It is used to test for differences in means when the same subjects are tested under multiple conditions.
# Example: You are studying the effects of a new drug on patients' blood pressure, and you measure their blood pressure before treatment, 1 week after treatment, and 4 weeks after treatment. Since the same individuals are measured at multiple time points, you would use a repeated measures ANOVA.
# In addition to these primary types of ANOVA, there are variations and extensions like mixed-design ANOVA, MANOVA (Multivariate Analysis of Variance), and more complex designs for specific research questions and data structures.



# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

# Answer-3-The partitioning of variance in Analysis of Variance (ANOVA) is a fundamental concept that helps researchers understand how the total variance in a dataset is divided into different components. This concept is essential for several reasons:

# Explaining Variability: ANOVA decomposes the total variability in the data into two main components: variability due to the differences between groups (or treatments) and variability within groups. This partitioning allows researchers to understand how much of the overall variance in the data is attributable to group differences and how much is due to random variability within the groups.

# Hypothesis Testing: ANOVA is a hypothesis testing technique that compares the means of multiple groups or treatments. By partitioning the variance, it provides a systematic way to test whether the observed differences between groups are statistically significant. The partitioning of variance allows researchers to calculate an F-statistic and determine whether the group differences are greater than what would be expected by chance.

# Effect Size Estimation: Understanding the partitioning of variance helps researchers assess the practical significance of group differences. By comparing the variability between groups to the variability within groups, you can calculate effect size measures like eta-squared or partial eta-squared. These measures quantify the proportion of variance in the dependent variable that can be attributed to the independent variable(s).

# Model Evaluation: Researchers can use the partitioning of variance to assess the goodness of fit of their ANOVA model. A good model should account for a substantial portion of the total variance, indicating that it is capturing meaningful sources of variation. Researchers can use the explained variance to evaluate the model's quality.

# Post Hoc Testing: In cases where ANOVA detects significant differences between groups, post hoc tests (e.g., Tukey's HSD, Bonferroni) can be used to determine which specific group means differ from each other. A thorough understanding of the partitioning of variance can aid in interpreting the results of these post hoc tests.

# Assumption Checking: Understanding the partitioning of variance can help researchers identify potential issues with the ANOVA assumptions, such as homogeneity of variance. If there are substantial differences in the variances between groups, this can indicate a violation of the assumption and prompt further investigation.
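
# The partition described above can be checked numerically. The sketch below reuses the three groups from the Q4 answer and verifies that the total sum of squares equals the between-group part plus the within-group part; the variable names (`ss_total`, `ss_between`, `ss_within`, `eta_squared`) are chosen here for illustration.

```python
import numpy as np

# The three groups used in the Q4 answer.
groups = [
    np.array([30, 25, 27, 24, 29]),
    np.array([40, 35, 38, 36, 39]),
    np.array([50, 45, 48, 47, 49]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variability: every observation against the grand mean.
ss_total = np.sum((all_values - grand_mean) ** 2)

# Between-group variability: each group mean against the grand mean, weighted by group size.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group variability: every observation against its own group mean.
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Eta-squared: share of the total variability attributable to group membership.
eta_squared = ss_between / ss_total

print(f"SS total   = {ss_total:.3f}")
print(f"SS between = {ss_between:.3f}")
print(f"SS within  = {ss_within:.3f}")
print(f"eta-squared = {eta_squared:.3f}")
```

# The first printed value should equal the sum of the next two, which is the partitioning identity that the F-statistic and eta-squared are built on.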

# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

# Answer-4-

In [1]:
import numpy as np
import scipy.stats as stats

In [2]:
group1 = np.array([30, 25, 27, 24, 29])
group2 = np.array([40, 35, 38, 36, 39])
group3 = np.array([50, 45, 48, 47, 49])

In [3]:
data = np.concatenate([group1, group2, group3])
overall_mean = np.mean(data)
squared_deviations_total = np.sum((data - overall_mean) ** 2)
SST = squared_deviations_total

In [4]:
group1_mean = np.mean(group1)
group2_mean = np.mean(group2)
group3_mean = np.mean(group3)

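In [5]:
# Naming follows the question: SSE = explained (between groups), SSR = residual (within groups).
# The residual part is the sum of squared deviations from each group's own mean.
squared_deviations_group1 = np.sum((group1 - group1_mean) ** 2)
squared_deviations_group2 = np.sum((group2 - group2_mean) ** 2)
squared_deviations_group3 = np.sum((group3 - group3_mean) ** 2)
SSR = squared_deviations_group1 + squared_deviations_group2 + squared_deviations_group3
# The explained part is whatever remains of the total.
SSE = SST - SSR

In [6]:
df_total = len(data) - 1
df_groups = 3 - 1
df_error = df_total - df_groups

In [7]:
# F is the explained mean square divided by the residual mean square.
MS_explained = SSE / df_groups
MS_residual = SSR / df_error
F_statistic = MS_explained / MS_residual
# sf (survival function) is 1 - cdf, but more accurate for very small p-values.
p_value = stats.f.sf(F_statistic, df_groups, df_error)

In [8]:
print("SST:", SST)
print("SSE:", SSE)
print("SSR:", SSR)
print(f"F-statistic: {F_statistic:.2f}")
print(f"p-value: {p_value:.2e}")

SST: 1139.7333333333331
SSE: 1081.7333333333331
SSR: 58.0
F-statistic: 111.90
p-value: 1.74e-08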
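
# As a cross-check on a manual calculation, `scipy.stats.f_oneway` computes the one-way F-statistic (between-group mean square divided by within-group mean square) directly from the raw groups. A minimal sketch using the same three groups:

```python
import numpy as np
from scipy import stats

# Same three groups as in the manual calculation.
group1 = np.array([30, 25, 27, 24, 29])
group2 = np.array([40, 35, 38, 36, 39])
group3 = np.array([50, 45, 48, 47, 49])

# f_oneway returns the F-statistic and its p-value for a one-way ANOVA.
f_stat, p_val = stats.f_oneway(group1, group2, group3)

print(f"F-statistic: {f_stat:.2f}")
print(f"p-value: {p_val:.2e}")
```

# A correct manual computation should reproduce this F-statistic. Since the group means here (27, 37.6, 47.8) are far apart relative to the spread inside each group, a large F and a very small p-value are expected.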


# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

# Answer-5-In a two-way ANOVA, you can calculate the main effects and interaction effects using Python. Main effects represent the individual effects of each independent variable, while the interaction effect assesses whether the combined effect of the two independent variables is significant. Here's how you can calculate these effects using Python and the statsmodels library:

In [9]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [10]:
data = pd.DataFrame({
    'A': [1, 1, 2, 2, 3, 3, 4, 4],
    'B': [1, 2, 1, 2, 1, 2, 1, 2],
    'Y': [10, 12, 15, 18, 20, 25, 30, 36]
})

In [11]:
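# A and B are numeric columns, so this formula treats them as continuous predictors
# (1 df each). To model them as categorical factors, wrap them as C(A) and C(B);
# that needs more than one observation per A-B cell to leave residual degrees of freedom.
model = ols('Y ~ A * B', data=data).fit()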
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

          sum_sq   df           F    PR(>F)
A          518.4  1.0  146.028169  0.000269
B           32.0  1.0    9.014085  0.039850
A:B          4.9  1.0    1.380282  0.305217
Residual    14.2  4.0         NaN       NaN


# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

# Answer-6-In a one-way ANOVA, the F-statistic and its associated p-value are used to determine whether there are statistically significant differences among the group means. Let's interpret the results you provided:

# F-Statistic: The F-statistic is a measure of the ratio of the variability between groups to the variability within groups. It tells you whether the differences between group means are greater than what you would expect by random chance. In this case, you obtained an F-statistic of 5.23.

# P-Value: The p-value associated with the F-statistic is used to assess the statistical significance of the results. It tells you the probability of observing the obtained F-statistic (or a more extreme one) if there were no true differences between the groups. In this case, the p-value is 0.02.

# Now, let's interpret these results: An F-statistic of 5.23 means the variability between the group means is about five times the variability within the groups. The p-value of 0.02 is less than the typical significance level of 0.05 (5%). This means that if there were no true differences between the groups (i.e., if the null hypothesis were true), you would expect to see an F-statistic of 5.23 or larger in only about 2% of cases.

# Conclusion: Based on the results of your one-way ANOVA, there are statistically significant differences among the group means at the 0.05 level. This suggests that at least one group mean is different from the others. To determine which specific groups are different from each other, you would typically conduct post hoc tests (e.g., Tukey's HSD, Bonferroni) or perform pairwise comparisons.

# It's important to consider the practical significance of the differences in addition to the statistical significance. The significance of the differences should be evaluated in the context of your research question and the magnitude of the effect.
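
# The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not state. The sketch below assumes 2 and 12 degrees of freedom (for example, 3 groups and 15 observations) purely for illustration; with other degrees of freedom the p-value for F = 5.23 would differ.

```python
from scipy import stats

f_observed = 5.23
# Assumed degrees of freedom; the question does not give them.
df_between, df_within = 2, 12

# p-value: probability of an F at least this large if the null hypothesis is true.
p_value = stats.f.sf(f_observed, df_between, df_within)

# Critical value: the F that cuts off the top 5% of the null distribution.
f_critical = stats.f.ppf(0.95, df_between, df_within)

reject_null = f_observed > f_critical

print(f"p-value: {p_value:.3f}")
print(f"critical F at alpha = 0.05: {f_critical:.3f}")
print(f"reject the null hypothesis: {reject_null}")
```

# Comparing the observed F against the critical value and comparing the p-value against 0.05 are two views of the same decision.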

# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

# Answer-7-Handling missing data in a repeated measures ANOVA is essential for obtaining valid and reliable results. Missing data can occur for various reasons, and how you handle it can impact your analysis and the conclusions you draw. Here are some methods to handle missing data in repeated measures ANOVA and the potential consequences of using these methods:

# Listwise Deletion (Complete Case Analysis):

# Method: You remove any cases or subjects with missing data. This is the simplest approach.
# Consequences: It can lead to a loss of data and statistical power, especially if a large proportion of data is missing. This method can introduce bias if the missing data is not completely random. The remaining sample may not be representative of the population.
# Pairwise Deletion (Available Case Analysis):

# Method: You perform the analysis for each pair of variables, considering all subjects with data for those variables. This means you don't exclude subjects entirely due to missing data.
# Consequences: It retains more data than listwise deletion, but it can lead to unequal sample sizes across comparisons, making it difficult to interpret overall patterns and complicating post hoc testing. It may also produce results that are difficult to combine or report.
# Mean Imputation:

# Method: You replace missing values with the mean value of the non-missing data in that variable.
# Consequences: While it retains the sample size, it can introduce bias by artificially reducing the variability in the data, potentially inflating the significance of effects. It doesn't capture the true variability of the data, and it may not be appropriate if the data is not missing completely at random.
# Last Observation Carried Forward (LOCF) or Next Observation Carried Backward (NOCB):

# Method: You carry forward the last available data point for missing data or carry backward the next available data point.
# Consequences: This method can lead to an overestimation of treatment effects, especially if data are missing non-randomly. It may not be appropriate if the assumption of data continuity is violated.
# Multiple Imputation:

# Method: You use statistical techniques to create multiple datasets with imputed values, incorporating uncertainty about the missing data.
# Consequences: Multiple imputation is a principled approach that accounts for the uncertainty introduced by missing data. However, it can be computationally intensive and may require specific software or expertise. It is generally recommended when data are missing at random (MAR) or missing completely at random (MCAR); when data are missing not at random (MNAR), the imputation model must also account for the missingness mechanism.
# Model-Based Imputation:

# Method: You use statistical models to estimate missing values based on observed data.
# Consequences: This method can provide more accurate imputations when the missing data are related to other variables in the dataset. However, it relies on modeling assumptions that should be carefully considered.
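
# Two of the simpler strategies above, listwise deletion and mean imputation, can be compared in a few lines of pandas. This is a minimal sketch on hypothetical data: six subjects measured at three time points (`t1`, `t2`, `t3`), with three values missing.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measurements in wide format, one row per subject.
wide = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5, 6],
    "t1": [120, 132, 128, np.nan, 140, 125],
    "t2": [118, 130, np.nan, 135, 138, 122],
    "t3": [115, np.nan, 124, 131, 133, 120],
}).set_index("subject")

# Listwise deletion: keep only subjects observed at every time point.
listwise = wide.dropna()

# Mean imputation: replace each missing value with that time point's observed mean.
mean_imputed = wide.fillna(wide.mean())

print(f"Subjects before deletion: {len(wide)}, after: {len(listwise)}")
print("Variance per time point (observed values only):")
print(wide.var())
print("Variance per time point (after mean imputation):")
print(mean_imputed.var())
```

# Listwise deletion discards half of the subjects here, while mean imputation keeps all of them but shrinks the variance at every time point, which is the artificial reduction in variability described above.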

# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

# Answer-8-Post-hoc tests are used after conducting an Analysis of Variance (ANOVA) to make pairwise comparisons between group means when the ANOVA indicates that there are significant differences among groups. They help identify which specific groups differ from each other. Common post-hoc tests include:

# Tukey's Honestly Significant Difference (HSD) Test:
# When to Use: Tukey's HSD is widely used when you have conducted an ANOVA with the assumption of equal variances and you want to compare all pairs of group means. It controls the familywise error rate.
# Example: In a one-way ANOVA comparing the exam scores of students taught by three different teachers, the ANOVA shows that there are significant differences among the groups. Tukey's HSD can be used to identify which specific pairs of teachers have significantly different mean scores.
# Bonferroni Correction:
# When to Use: The Bonferroni correction controls the familywise error rate by dividing the significance level by the number of comparisons. It is simple and conservative, and works best for a small number of planned comparisons; with many comparisons it loses considerable power.
# Example: In a clinical trial comparing the effectiveness of multiple drug treatments, you may have conducted an ANOVA and want to compare the efficacy of each drug with all others. The Bonferroni correction is applied to control the overall Type I error rate.
# Sidak Correction:
# When to Use: Similar to the Bonferroni correction, the Sidak correction is used when conducting multiple pairwise comparisons after ANOVA. It is slightly less conservative than Bonferroni and is appropriate when you want to control the familywise error rate.
# Example: In a marketing study, you may have conducted a two-way ANOVA to determine if there are significant differences in sales revenue due to different advertising strategies. The Sidak correction can be applied to compare the advertising strategies while controlling for multiple comparisons.
# Dunnett's Test:
# When to Use: Dunnett's test is used when you have a control group and you want to compare all other treatment groups to the control group. It is commonly used in situations where you are interested in the effects of different treatments relative to a control condition.
# Example: In a pharmaceutical study, you may have a control group and several experimental drug groups. Dunnett's test is appropriate to compare each drug group to the control group to see if any drugs have a significant effect.
# Fisher's LSD (Least Significant Difference) Test:
# When to Use: Fisher's LSD is less conservative than Tukey's HSD because it runs unadjusted pairwise comparisons after a significant overall F-test. It uses the pooled error variance from the ANOVA, so it still assumes equal variances, and it does not control the familywise error rate when there are more than three groups. When variances are unequal, a procedure such as Games-Howell is the usual choice.
# Example: In a manufacturing study, you may have tested three machine settings to see which one produces the highest product quality. After a significant ANOVA, Fisher's LSD can be used to compare the settings pairwise.
# Holm-Bonferroni Method:
# When to Use: The Holm-Bonferroni method controls the familywise error rate like the Bonferroni correction, but it is more powerful because it adjusts the p-values step by step, from smallest to largest. It is appropriate when you have several pairwise comparisons to make.
# Example: In a social science survey, you may have multiple pairwise comparisons to assess differences between age groups in terms of their attitudes and behaviors. The Holm-Bonferroni method can help control the overall Type I error rate.
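
# Two of these procedures can be sketched with statsmodels. The example assumes hypothetical exam scores for three teachers and uses `pairwise_tukeyhsd` for Tukey's HSD and `multipletests(..., method="holm")` for Holm-adjusted pairwise t-tests; it is an illustration, not a recommendation of one method over another.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

# Hypothetical exam scores for three teachers (illustrative values only).
scores = {
    "A": [70, 72, 68, 71, 69],
    "B": [80, 82, 78, 81, 79],
    "C": [71, 73, 69, 72, 70],
}
values = np.concatenate([scores[name] for name in scores])
labels = np.repeat(list(scores), [len(v) for v in scores.values()])

# Tukey's HSD: all pairwise comparisons with familywise error control.
tukey = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(tukey.summary())

# Holm-Bonferroni: unadjusted pairwise t-tests, then step-down adjustment.
pairs = list(combinations(scores, 2))
raw_p = [stats.ttest_ind(scores[a], scores[b]).pvalue for a, b in pairs]
reject_holm, p_holm, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

for (a, b), p_adj, rejected in zip(pairs, p_holm, reject_holm):
    print(f"{a} vs {b}: Holm-adjusted p = {p_adj:.4f}, reject = {rejected}")
```

# In this made-up data teacher B's scores are about 10 points higher than the other two, so both procedures should flag A vs B and B vs C while leaving A vs C non-significant.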

# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

# Answer-9-

In [25]:
import pandas as pd
import scipy.stats as stats
from statsmodels.formula.api import ols
import statsmodels.api as sm

In [31]:
# Illustrative data: 10 participants per diet (the unused 'experience' column is dropped).
data = pd.DataFrame({
    'diet': ['A'] * 10 + ['B'] * 10 + ['C'] * 10,
    'weight_loss': [20, 25, 23, 30, 22, 28, 19, 26, 21, 29,
                    18, 24, 22, 29, 20, 27, 19, 25, 23, 30,
                    20, 25, 23, 30, 22, 29, 19, 26, 21, 28]
})

In [33]:
model = ols('weight_loss ~ diet', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
F_statistic = anova_table['F']['diet']
p_value = anova_table['PR(>F)']['diet']

print(f'F-statistic: {F_statistic}')
print(f'p-value: {p_value}')

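F-statistic: 0.07636106528399635
p-value: 0.9266809820711994

# Interpretation: The p-value (about 0.93) is far above 0.05, so we fail to reject the null hypothesis. This sample provides no evidence that mean weight loss differs between diets A, B, and C. Note that the illustrative dataset contains 30 observations (10 per diet), not the 50 participants described in the question.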
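
# A per-group summary helps explain an F-statistic this small. The sketch below rebuilds the same weight-loss values and uses pandas `groupby` to report the count, mean, and standard deviation for each diet.

```python
import pandas as pd

# Same weight-loss values as in the ANOVA above.
data = pd.DataFrame({
    'diet': ['A'] * 10 + ['B'] * 10 + ['C'] * 10,
    'weight_loss': [20, 25, 23, 30, 22, 28, 19, 26, 21, 29,
                    18, 24, 22, 29, 20, 27, 19, 25, 23, 30,
                    20, 25, 23, 30, 22, 29, 19, 26, 21, 28]
})

# Count, mean, and sample standard deviation of weight loss per diet.
summary = data.groupby('diet')['weight_loss'].agg(['count', 'mean', 'std'])
print(summary)
```

# The three diet means differ by well under one unit while the values within each diet spread over roughly ten units, which is why the between-group variability is so small relative to the within-group variability.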


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

# Answer-10-

In [22]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [23]:
data = pd.DataFrame({
    'software': ['A'] * 10 + ['B'] * 10 + ['C'] * 10,
    'experience': ['novice'] * 5 + ['experienced'] * 5 + ['novice'] * 5 + ['experienced'] * 5 + ['novice'] * 5 + ['experienced'] * 5,
    'time_to_complete': [20, 25, 23, 30, 22, 28, 19, 26, 21, 29,
                        18, 24, 22, 29, 20, 27, 19, 25, 23, 30,
                        20, 25, 23, 30, 22, 29, 19, 26, 21, 28]
})

In [24]:
model = ols('time_to_complete ~ software * experience', data=data).fit()

# Print the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

                         sum_sq    df         F    PR(>F)
software               2.400000   2.0  0.070175  0.932421
experience             9.633333   1.0  0.563353  0.460209
software:experience    4.266667   2.0  0.124756  0.883281
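Residual             410.400000  24.0       NaN       NaN

# Interpretation: None of the effects is statistically significant at the 0.05 level. The main effect of software (F = 0.07, p = 0.93), the main effect of experience (F = 0.56, p = 0.46), and the software-by-experience interaction (F = 0.12, p = 0.88) all have p-values far above 0.05, so this sample gives no evidence that completion time depends on the program, on experience level, or on their combination.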
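
# The cell means behind the ANOVA table can be laid out with a pivot table. This sketch rebuilds the same data and uses `pandas.DataFrame.pivot_table` to show the mean completion time for every software-by-experience combination.

```python
import pandas as pd

# Same data as in the two-way ANOVA above.
data = pd.DataFrame({
    'software': ['A'] * 10 + ['B'] * 10 + ['C'] * 10,
    'experience': (['novice'] * 5 + ['experienced'] * 5) * 3,
    'time_to_complete': [20, 25, 23, 30, 22, 28, 19, 26, 21, 29,
                         18, 24, 22, 29, 20, 27, 19, 25, 23, 30,
                         20, 25, 23, 30, 22, 29, 19, 26, 21, 28]
})

# Mean completion time for each combination of software and experience level.
cell_means = data.pivot_table(index='software', columns='experience',
                              values='time_to_complete', aggfunc='mean')
print(cell_means)
```

# An interaction would show up as the novice-versus-experienced gap changing noticeably from one program to another. Here all six cell means sit within a couple of units of each other, consistent with the non-significant F-tests.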


# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

# Answer-11-

In [34]:
import numpy as np
import scipy.stats as stats
control_group = np.array([85, 88, 90, 82, 86, 91, 87, 84, 89, 83, 87, 84, 85, 88, 81, 82, 86, 80, 84, 89, 85, 88, 87, 83, 86, 82, 84, 80, 82, 81])
experimental_group = np.array([92, 95, 94, 91, 96, 93, 90, 91, 94, 92, 95, 92, 94, 93, 90, 91, 92, 95, 93, 96, 90, 91, 93, 92, 94, 95, 92, 96, 91, 94])

In [35]:
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

In [36]:
print(f'T-statistic: {t_statistic}')
print(f'p-value: {p_value}')

T-statistic: -12.236934691630083
p-value: 1.0419700578692412e-17


In [37]:
mean_control = np.mean(control_group)
mean_experimental = np.mean(experimental_group)

std_control = np.std(control_group, ddof=1)
std_experimental = np.std(experimental_group, ddof=1)

n_control = len(control_group)
n_experimental = len(experimental_group)
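# Standard error of the difference in means, without pooling the two variances.
std_error_diff = np.sqrt((std_control**2 / n_control) + (std_experimental**2 / n_experimental))
# 1.96 is the normal-approximation critical value for a 95% interval.
margin_of_error = 1.96 * std_error_diff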
conf_int_lower = (mean_control - mean_experimental) - margin_of_error
conf_int_upper = (mean_control - mean_experimental) + margin_of_error

In [38]:
print(f"Mean test score for the control group: {mean_control}")
print(f"Mean test score for the experimental group: {mean_experimental}")
print(f"95% Confidence Interval for the difference in means: ({conf_int_lower}, {conf_int_upper})")

Mean test score for the control group: 84.96666666666667
Mean test score for the experimental group: 92.9
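95% Confidence Interval for the difference in means: (-9.204021926937493, -6.662644739729181)

# Interpretation: The p-value is far below 0.05 and the confidence interval for the difference (control minus experimental) excludes zero, so the experimental group scored significantly higher, by roughly 8 points on average. Because only two groups are compared, the t-test itself identifies which group differs and no separate post-hoc test is needed. Note that the illustrative data contain 30 scores per group, not the 100 students described in the question.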
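
# Two common complements to the pooled t-test are Welch's t-test, which does not assume equal variances, and Cohen's d as a standardized effect size. The sketch below applies both to the same scores; `equal_var=False` is the `scipy.stats.ttest_ind` option that selects Welch's test, and the pooled-standard-deviation form of Cohen's d is one of several conventions.

```python
import numpy as np
from scipy import stats

control_group = np.array([85, 88, 90, 82, 86, 91, 87, 84, 89, 83, 87, 84, 85, 88, 81,
                          82, 86, 80, 84, 89, 85, 88, 87, 83, 86, 82, 84, 80, 82, 81])
experimental_group = np.array([92, 95, 94, 91, 96, 93, 90, 91, 94, 92, 95, 92, 94, 93, 90,
                               91, 92, 95, 93, 96, 90, 91, 93, 92, 94, 95, 92, 96, 91, 94])

# Welch's t-test: does not assume the two groups share a common variance.
t_welch, p_welch = stats.ttest_ind(control_group, experimental_group, equal_var=False)

# Cohen's d using the pooled sample standard deviation.
n1, n2 = len(control_group), len(experimental_group)
pooled_var = ((n1 - 1) * control_group.var(ddof=1)
              + (n2 - 1) * experimental_group.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = (experimental_group.mean() - control_group.mean()) / np.sqrt(pooled_var)

print(f"Welch t-statistic: {t_welch:.3f}")
print(f"Welch p-value: {p_welch:.3e}")
print(f"Cohen's d: {cohens_d:.2f}")
```

# With equal group sizes the Welch and pooled t-statistics coincide and only the degrees of freedom differ, so the conclusion here should not change. Cohen's d expresses the gap between the groups in units of their common standard deviation.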


# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

# Answer-12-A repeated measures ANOVA is used when the same subjects are measured under several conditions. Here the 30 days play the role of subjects: each day contributes one sales figure per store, so the three store values are paired by day and a repeated measures ANOVA fits the design as described. The code below is a simplification. It simulates each store's sales independently and compares the stores with a one-way (independent samples) ANOVA, which ignores the day-to-day pairing.

# Answer-12-

In [52]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

In [53]:
np.random.seed(0)
store_a_sales = np.random.normal(200, 20, 30)
store_b_sales = np.random.normal(220, 15, 30)
store_c_sales = np.random.normal(210, 25, 30)

In [54]:
data = pd.DataFrame({
    'store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    'sales': np.concatenate((store_a_sales, store_b_sales, store_c_sales))
})

In [55]:
f_statistic, p_value = stats.f_oneway(data[data['store'] == 'A']['sales'],
                                      data[data['store'] == 'B']['sales'],
                                      data[data['store'] == 'C']['sales'])

In [56]:
print(f'F-statistic: {f_statistic}')
print(f'p-value: {p_value}')

F-statistic: 1.579433100797434
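p-value: 0.2119435870921493

# Interpretation: The p-value (about 0.21) is above 0.05, so we fail to reject the null hypothesis; these simulated data show no statistically significant difference in mean daily sales between the three stores. Because the overall test is not significant, no post-hoc test is run, and the imported pairwise_tukeyhsd goes unused.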
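
# For the design as stated in the question, where the same 30 days are observed for every store, a repeated measures ANOVA can be sketched with statsmodels' `AnovaRM`. The example below is an assumption-laden illustration: it reuses the same simulation settings, adds a hypothetical `day` column as the subject identifier, and the simulated sales have no built-in day-to-day correlation, so it demonstrates the mechanics rather than a realistic result.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

np.random.seed(0)
n_days = 30

# Long format: one row per (day, store) pair; 'day' identifies the repeated unit.
long_data = pd.DataFrame({
    "day": np.tile(np.arange(n_days), 3),
    "store": np.repeat(["A", "B", "C"], n_days),
    "sales": np.concatenate([
        np.random.normal(200, 20, n_days),
        np.random.normal(220, 15, n_days),
        np.random.normal(210, 25, n_days),
    ]),
})

# 'store' is the within-subject factor; every day must appear once per store.
result = AnovaRM(data=long_data, depvar="sales", subject="day", within=["store"]).fit()
print(result.anova_table)
```

# The table reports the F value for the store effect with 2 numerator and 58 denominator degrees of freedom. With real sales data, days with generally high or low sales would be accounted for by the subject term, which is the main reason to prefer this design over a one-way ANOVA when observations are paired.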


# Assignment Completed 