## ASSIGNMENT ON ADVANCED STATISTICS

Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

ANOVA (Analysis of Variance) is a statistical test used to compare the means of three or more groups. To ensure the validity of ANOVA results, several assumptions must be met. Here are the key assumptions for conducting ANOVA:

Independence: The observations within each group must be independent of each other. Violations of independence may occur when data points are correlated, such as in repeated measures or clustered designs.

Normality: The residuals (the differences between the observed values and the group means) should follow a normal distribution within each group. Departures from normality may affect the accuracy and reliability of ANOVA results. Violations can occur when the data is heavily skewed or has outliers.

Homogeneity of variances: The variances of the residuals should be approximately equal across all groups. This assumption is known as homoscedasticity. When it is violated, the F-test's Type I error rate can be distorted, and the distortion is most severe when the group sample sizes are also unequal.

If these assumptions are violated, the validity of ANOVA results may be compromised. Here are examples of violations for each assumption:

Independence:

Violation example: Using a repeated measures design where the same subjects are measured under different conditions, so their responses are likely to be correlated.

Normality:

Violation example: Strongly skewed data, such as measurements with a long right tail, or data containing extreme outliers. Small samples rarely look perfectly normal even when the assumption holds, so mild departures matter less than severe skewness or heavy tails.

Homogeneity of variances:

Violation example: Different treatments lead to different levels of variability, so one group's values are far more spread out than another's. The problem is aggravated when the groups also have different sample sizes.

When these assumptions are violated, alternative statistical tests or transformations may be necessary. For example, non-parametric tests like the Kruskal-Wallis test can be used instead of ANOVA when the normality assumption is violated. Alternatively, data transformations (e.g., logarithmic or square root transformation) can sometimes address violations of normality or homogeneity of variances.

It is important to assess and address these assumptions to ensure the validity and reliability of ANOVA results.
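
The checks described above can be sketched with SciPy. This is a minimal illustration, not a complete diagnostic workflow: it reuses the three small groups from Q4, tests the pooled residuals with Shapiro-Wilk, compares variances with Levene's test, and runs Kruskal-Wallis as the non-parametric fallback. Independence cannot be verified this way; it has to follow from the study design.

```python
import numpy as np
from scipy import stats

# The three groups used in Q4 of this assignment
group1 = [10, 12, 15, 11, 14]
group2 = [8, 6, 9, 7, 5]
group3 = [18, 20, 17, 19, 16]
groups = [np.asarray(g, dtype=float) for g in (group1, group2, group3)]

# Normality: test the residuals (each value minus its own group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances: Levene's test (median-centered by default)
levene_stat, levene_p = stats.levene(*groups)

# Non-parametric alternative when normality looks doubtful
kruskal_stat, kruskal_p = stats.kruskal(*groups)

print("Shapiro-Wilk p-value:", shapiro_p)
print("Levene p-value:", levene_p)
print("Kruskal-Wallis H:", kruskal_stat, "p-value:", kruskal_p)
```

A small p-value from Shapiro-Wilk or Levene's test suggests that the corresponding assumption is doubtful. With only five observations per group these tests have little power, so plots of the residuals are a useful complement.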

Q2. What are the three types of ANOVA, and in what situations would each be used?

The three types of ANOVA (Analysis of Variance) are:

One-way ANOVA: One-way ANOVA is used when comparing the means of three or more groups based on a single independent variable (factor). It tests whether there are any statistically significant differences among the group means. This type of ANOVA is appropriate when you have one categorical independent variable and a continuous dependent variable.
Example: A study comparing the average test scores of students from three different schools (School A, School B, and School C).

Two-way ANOVA: Two-way ANOVA is used when you want to analyze the effects of two independent variables (factors) simultaneously on a continuous dependent variable. It examines the main effects of each independent variable and their interaction effect. This type of ANOVA is appropriate when you have two categorical independent variables and a continuous dependent variable.
Example: A study investigating the effects of both age group (young, middle-aged, and elderly) and gender (male, female) on blood pressure.

Repeated Measures ANOVA: Repeated Measures ANOVA (also known as within-subjects ANOVA) is used when measuring the same subjects under different conditions or at multiple time points. It is used to determine if there are any significant differences between the conditions or time points. This type of ANOVA is appropriate when you have a single group of subjects measured on the same continuous dependent variable across multiple conditions or time points.
Example: A study measuring participants' anxiety levels before, during, and after exposure to different stressors.

In summary, one-way ANOVA is used when comparing means across multiple groups with one independent variable, two-way ANOVA is used when analyzing the effects of two independent variables, and repeated measures ANOVA is used when measuring the same subjects under different conditions or at multiple time points. The appropriate choice of ANOVA depends on the research design and the specific hypotheses being tested.

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the decomposition of the total variance in a dataset into different sources or components of variation. It involves dividing the total sum of squares (SS) into distinct parts that represent different factors or sources of variation, allowing for a better understanding of the contributions of these factors to the overall variation in the data.

The partitioning of variance is important in ANOVA for several reasons:

Understanding the sources of variation: By partitioning the total variance, ANOVA helps identify and quantify the different sources of variation in the data. This allows researchers to assess the relative importance of each factor and understand their contributions to the overall variability observed in the dependent variable.

Hypothesis testing: ANOVA uses the partitioned variance to perform hypothesis tests on the effects of different factors. It compares the magnitude of variation between groups (explained variation) with the variation within groups (unexplained variation) to determine if there are significant differences among the groups.

Assessing the significance of factors: ANOVA provides information about the significance of the factors being analyzed. By comparing the variance attributed to each factor (explained variance) with the residual variance (unexplained variance), it helps determine if the observed differences among groups are statistically significant.

Estimating effect sizes: Partitioning the variance allows for the estimation of effect sizes, such as the proportion of variance explained by each factor. This information provides insights into the practical significance or importance of the factors being examined.

Design optimization: Understanding the partitioning of variance helps researchers in experimental design optimization. By identifying the most significant sources of variation, researchers can allocate resources more efficiently, control or reduce variability, and enhance the precision of their experiments.
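
As a rough sketch of the decomposition, the function below splits the total sum of squares into a between-group and a within-group part and reports the proportion of variance explained (eta squared). The function name `partition_variance` and the example scores are hypothetical, chosen only for illustration.

```python
import numpy as np

def partition_variance(groups):
    """Split the total sum of squares into between- and within-group parts."""
    arrays = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(arrays)
    grand_mean = all_values.mean()

    # Total variation of every observation around the grand mean
    ss_total = np.sum((all_values - grand_mean) ** 2)
    # Variation of the group means around the grand mean (explained)
    ss_between = sum(len(a) * (a.mean() - grand_mean) ** 2 for a in arrays)
    # Variation of observations around their own group mean (unexplained)
    ss_within = sum(np.sum((a - a.mean()) ** 2) for a in arrays)

    return {
        "ss_total": ss_total,
        "ss_between": ss_between,
        "ss_within": ss_within,
        "eta_squared": ss_between / ss_total,
    }

# Hypothetical scores for three groups
parts = partition_variance([[4, 5, 6], [7, 8, 9], [1, 2, 3]])
print(parts)
```

For any grouping, `ss_between + ss_within` equals `ss_total` up to floating-point error, which is the identity that ANOVA relies on.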

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import numpy as np
from scipy import stats

# Data for one-way ANOVA
group1 = [10, 12, 15, 11, 14]
group2 = [8, 6, 9, 7, 5]
group3 = [18, 20, 17, 19, 16]

# Combine the data into a single array
data = np.concatenate([group1, group2, group3])

# Calculate the overall mean
overall_mean = np.mean(data)

# Calculate the sum of squares total (SST)
sst = np.sum((data - overall_mean) ** 2)

# Calculate the sum of squares explained (SSE)
group_means = [np.mean(group1), np.mean(group2), np.mean(group3)]
sse = np.sum([len(group) * (mean - overall_mean) ** 2 for group, mean in zip([group1, group2, group3], group_means)])

# Calculate the sum of squares residual (SSR)
ssr = sst - sse

print("SST:", sst)
print("SSE:", sse)
print("SSR:", ssr)


SST: 339.7333333333333
SSE: 302.5333333333333
SSR: 37.19999999999999
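
The sums of squares above lead directly to the F-statistic. The sketch below repeats the calculation (using this assignment's naming, where SSE is the explained and SSR the residual sum of squares), divides each by its degrees of freedom, and cross-checks the result against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

group1 = [10, 12, 15, 11, 14]
group2 = [8, 6, 9, 7, 5]
group3 = [18, 20, 17, 19, 16]
groups = [group1, group2, group3]

data = np.concatenate(groups)
overall_mean = data.mean()
sst = np.sum((data - overall_mean) ** 2)
sse = sum(len(g) * (np.mean(g) - overall_mean) ** 2 for g in groups)  # explained
ssr = sst - sse  # residual

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(data) - len(groups)

# F is the ratio of the two mean squares
f_manual = (sse / df_between) / (ssr / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print("Manual F:", f_manual, "p-value:", p_manual)
print("SciPy  F:", f_scipy, "p-value:", p_scipy)
```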


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [2]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Data for two-way ANOVA
group1 = [10, 12, 15, 11, 14]
group2 = [8, 6, 9, 7, 5]
group3 = [18, 20, 17, 19, 16]

factor1 = ['A', 'A', 'B', 'B', 'B']
factor2 = ['X', 'Y', 'X', 'Y', 'X']

# Combine the data into a dataframe
data = pd.DataFrame({'Response': group1 + group2 + group3,
                     'Factor1': factor1 * 3,
                     'Factor2': factor2 * 3})

# Fit the two-way ANOVA model
model = ols('Response ~ Factor1 + Factor2 + Factor1:Factor2', data=data).fit()
anova_table = sm.stats.anova_lm(model)

# Extract the sum of squares attributed to each main effect and to the interaction
# (anova_lm defaults to Type I, i.e. sequential, sums of squares)
main_effect_factor1 = anova_table['sum_sq']['Factor1']
main_effect_factor2 = anova_table['sum_sq']['Factor2']
interaction_effect = anova_table['sum_sq']['Factor1:Factor2']

print("Main Effect (Factor 1):", main_effect_factor1)
print("Main Effect (Factor 2):", main_effect_factor2)
print("Interaction Effect:", interaction_effect)


Main Effect (Factor 1): 0.1777777777777766
Main Effect (Factor 2): 0.03174603174603146
Interaction Effect: 0.8571428571428497
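
The values printed above are sums of squares. To see the interaction on the scale of the response, one option is to compare cell means directly. The sketch below uses the same data and factor labels; the helper `cell_mean` is a hypothetical name introduced for this illustration. It measures how much the Factor2 difference (Y minus X) changes between the two levels of Factor1.

```python
import numpy as np

response = np.array([10, 12, 15, 11, 14,
                     8, 6, 9, 7, 5,
                     18, 20, 17, 19, 16], dtype=float)
factor1 = np.array(['A', 'A', 'B', 'B', 'B'] * 3)
factor2 = np.array(['X', 'Y', 'X', 'Y', 'X'] * 3)

def cell_mean(level1, level2):
    """Mean response for one combination of factor levels."""
    mask = (factor1 == level1) & (factor2 == level2)
    return response[mask].mean()

# Effect of Factor2 (Y minus X) within each level of Factor1
effect_at_a = cell_mean('A', 'Y') - cell_mean('A', 'X')
effect_at_b = cell_mean('B', 'Y') - cell_mean('B', 'X')

# If there were no interaction, these two differences would be equal
interaction_contrast = effect_at_b - effect_at_a

print("Factor2 effect at A:", effect_at_a)
print("Factor2 effect at B:", effect_at_b)
print("Interaction contrast:", interaction_contrast)
```

A contrast near zero means the effect of one factor is about the same at every level of the other. Whether a non-zero contrast is statistically significant is what the interaction row of the ANOVA table tests.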


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

In a one-way ANOVA, the F-statistic and the associated p-value are used to assess whether there are significant differences among the group means. In the given scenario, where the F-statistic is 5.23 and the p-value is 0.02, we can make the following conclusions:

Significance of Differences: The F-statistic is the ratio of the between-group mean square to the within-group mean square. A value of 5.23 means the variation among the group means is roughly five times the variation within groups, whereas a ratio close to 1 would be expected if all group means were equal.

Rejecting the Null Hypothesis: The p-value of 0.02 is less than the chosen significance level (usually 0.05). Thus, we reject the null hypothesis, which states that all group means are equal. The results suggest that at least one group mean differs from the others, although the F-test alone does not identify which groups differ.

Interpretation of the p-value: The p-value of 0.02 is the probability of obtaining an F-statistic at least as large as the observed one, assuming the null hypothesis is true. In other words, if there were no true differences among the groups, we would expect to see an F-statistic of 5.23 or larger only about 2% of the time.

Practical Significance: While the statistical analysis suggests significant differences, it is also important to consider the practical or substantive significance of the findings. The magnitude of the differences among the groups should be examined to determine their real-world importance.
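
The link between the F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not state the degrees of freedom, so the values below (3 groups of 5 observations, giving 2 and 12 degrees of freedom) are an assumption made purely for illustration; under that assumption the p-value comes out close to the reported 0.02.

```python
from scipy import stats

f_statistic = 5.23

# Assumed design (not given in the question): 3 groups, 15 observations in total
df_between = 2   # number of groups - 1
df_within = 12   # total observations - number of groups

# Survival function: probability of an F at least this large under H0
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Smallest F that would be significant at the 0.05 level for these df
critical_f = stats.f.ppf(0.95, df_between, df_within)

print("p-value:", p_value)
print("Critical F at alpha = 0.05:", critical_f)
print("Reject H0:", f_statistic > critical_f)
```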

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA is an important consideration to ensure accurate and unbiased results. There are several methods commonly used to handle missing data, each with its own potential consequences. Here are a few commonly used approaches:

Complete Case Analysis (Listwise deletion): This method involves excluding any participants with missing data from the analysis. While it is straightforward to implement, it can lead to reduced sample size and potential loss of statistical power. If the missing data are not missing completely at random (MCAR), this approach can introduce bias.

Mean Imputation: Mean imputation replaces missing values with the mean value of the available data for that variable. It is a simple method but may result in underestimation of variances and correlations since it does not account for the uncertainty associated with the missing values. Additionally, mean imputation assumes that the missing values have the same mean as the observed values, which may not be appropriate in all cases.

Last Observation Carried Forward (LOCF): LOCF imputes missing values with the last observed value for that participant. This method assumes that missing data remain constant over time. It can introduce bias if the missingness pattern is related to the underlying variable being measured.

Multiple Imputation: Multiple imputation involves creating multiple plausible imputed datasets, accounting for the uncertainty of the missing values. The analysis is then performed on each imputed dataset, and the results are combined using appropriate rules. When the data are missing at random (MAR) and the imputation model is correctly specified, multiple imputation gives approximately unbiased estimates and properly accounts for the uncertainty associated with missing data. However, it can be computationally intensive and requires careful implementation.

The potential consequences of using different methods to handle missing data include biased estimates, inflated or deflated standard errors, incorrect p-values, and distorted conclusions. The choice of method should be guided by the underlying missing data mechanism, assumptions about the missingness, and consideration of the potential biases introduced by each method.
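
The first three approaches can be sketched in a few lines of NumPy. The scores below are hypothetical (rows are subjects, columns are three time points), and the example is only meant to show the mechanics and the variance shrinkage caused by mean imputation; multiple imputation needs a dedicated library and is not shown.

```python
import numpy as np

# Hypothetical scores: 5 subjects measured at 3 time points, NaN = missing
scores = np.array([
    [5.0, 6.0, 7.0],
    [4.0, np.nan, 6.0],
    [6.0, 7.0, np.nan],
    [5.0, 5.0, 8.0],
    [7.0, 8.0, 9.0],
])

# Complete case analysis: keep only subjects with no missing values
complete_cases = scores[~np.isnan(scores).any(axis=1)]

# Mean imputation: fill each gap with the observed mean of that time point
column_means = np.nanmean(scores, axis=0)
mean_imputed = np.where(np.isnan(scores), column_means, scores)

# LOCF: carry each subject's previous observation forward in time
locf = scores.copy()
for col in range(1, locf.shape[1]):
    gaps = np.isnan(locf[:, col])
    locf[gaps, col] = locf[gaps, col - 1]

# Mean imputation leaves the means unchanged but shrinks the variances
observed_var = np.nanvar(scores, axis=0, ddof=1)
imputed_var = mean_imputed.var(axis=0, ddof=1)

print("Subjects kept by listwise deletion:", len(complete_cases))
print("Observed variances:    ", observed_var)
print("Mean-imputed variances:", imputed_var)
```

Listwise deletion discards two of the five subjects here, and the mean-imputed variances are smaller than the observed ones at every time point that had a gap, which is the underestimation described above.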

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

After conducting an ANOVA and finding a significant overall effect, post-hoc tests are often performed to examine pairwise comparisons between groups and determine which specific group means differ significantly from each other. Some common post-hoc tests used after ANOVA include:

Tukey's Honestly Significant Difference (HSD): Tukey's HSD test is widely used and controls the family-wise error rate, making it suitable for multiple comparisons. It compares all possible pairs of group means and identifies significant differences while taking into account the overall Type I error rate. Tukey's HSD assumes homogeneous variances and was derived for equal sample sizes; the Tukey-Kramer variant extends it to unequal sample sizes.

Bonferroni correction: The Bonferroni correction adjusts the significance level for each individual comparison to control the family-wise error rate. It divides the desired significance level (e.g., 0.05) by the number of comparisons to obtain a more stringent alpha level for each test. The Bonferroni correction is conservative but provides strong control over the Type I error rate. It is suitable when there are a small number of planned comparisons.

Scheffé's method: Scheffé's method is a conservative post-hoc test that provides control over the family-wise error rate, making it appropriate for a large number of comparisons, including complex contrasts that go beyond simple pairs. It accommodates unequal sample sizes but still assumes homogeneous variances. However, Scheffé's method tends to be less powerful than other post-hoc tests when only pairwise comparisons are of interest.

Fisher's Least Significant Difference (LSD): Fisher's LSD test is the least conservative post-hoc test. It compares pairs of group means with t-tests that use the pooled error variance from the ANOVA, so it does not require equal sample sizes but still assumes homogeneous variances. However, Fisher's LSD test does not control the family-wise error rate, which increases the chance of Type I errors as the number of comparisons grows.

The choice of post-hoc test depends on factors such as sample sizes, variances, and the desired balance between controlling the family-wise error rate and maximizing power. It is important to select a post-hoc test that is appropriate for the specific research question and study design.

Example:
Suppose a researcher conducts a study examining the effects of different exercise interventions on cardiovascular fitness. The researcher measures the fitness level (continuous dependent variable) in four groups: Group A (control), Group B (intervention 1), Group C (intervention 2), and Group D (intervention 3). After conducting an ANOVA and finding a significant overall effect, a post-hoc test would be necessary to compare the specific group means. For example, Tukey's HSD test could be used to determine if there are significant differences between any of the intervention groups or between the intervention groups and the control group. This would provide a more detailed understanding of which interventions lead to significantly different fitness levels compared to others.
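
A pairwise follow-up for this four-group example might look like the sketch below. It uses Bonferroni-adjusted pairwise t-tests rather than Tukey's HSD, because they need only `scipy.stats.ttest_ind`; the fitness scores are hypothetical values invented for illustration.

```python
from itertools import combinations
from scipy import stats

# Hypothetical fitness scores for the four groups in the example
fitness = {
    "A (control)": [30, 32, 31, 29, 33, 30],
    "B": [34, 36, 35, 33, 37, 35],
    "C": [31, 33, 30, 32, 34, 31],
    "D": [38, 40, 39, 37, 41, 39],
}

pairs = list(combinations(fitness, 2))  # 6 pairwise comparisons
alpha = 0.05
adjusted_alpha = alpha / len(pairs)     # Bonferroni-adjusted threshold

results = {}
for first, second in pairs:
    t_stat, p_value = stats.ttest_ind(fitness[first], fitness[second])
    significant = p_value < adjusted_alpha
    results[(first, second)] = (p_value, significant)
    print(f"{first} vs {second}: p = {p_value:.4g}, significant = {significant}")
```

Each comparison is judged against 0.05 / 6 rather than 0.05, which keeps the family-wise error rate at or below 0.05 at the cost of some power.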

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [3]:
from scipy import stats
import numpy as np

# Illustrative data for one-way ANOVA: 50 weight-loss values per diet
diet_A = [2.1, 3.2, 4.5, 1.9, 2.8, 3.1, 2.6, 2.7, 2.2, 2.9,
          3.7, 3.3, 2.8, 3.0, 2.5, 2.6, 3.2, 2.9, 2.4, 3.5,
          3.2, 2.7, 2.8, 2.6, 2.9, 3.1, 2.4, 2.7, 2.5, 2.6,
          2.9, 3.2, 2.8, 2.7, 3.1, 2.8, 2.6, 2.7, 3.0, 2.8,
          2.9, 2.5, 3.1, 2.8, 2.6, 3.3, 3.1, 3.2, 2.6, 2.8]
diet_B = [2.6, 3.8, 4.2, 3.1, 2.5, 3.2, 3.6, 3.9, 2.7, 3.3,
          2.9, 3.1, 3.7, 3.4, 3.0, 3.5, 3.3, 3.1, 2.8, 3.2,
          3.6, 3.4, 2.9, 3.2, 3.1, 3.5, 3.8, 3.2, 3.6, 3.7,
          3.4, 3.1, 3.2, 3.5, 3.4, 3.6, 3.8, 3.2, 3.5, 3.1,
          3.4, 3.2, 3.6, 3.1, 3.4, 3.7, 3.5, 3.3, 3.2, 3.6]
diet_C = [3.9, 4.2, 4.5, 3.5, 3.2, 4.0, 3.8, 4.3, 4.4, 3.7,
          4.0, 3.8, 4.1, 3.9, 4.2, 4.0, 4.4, 3.8, 4.3, 4.1,
          3.7, 4.2, 4.0, 4.3, 4.1, 3.9, 4.0, 4.4, 4.2, 4.3,
          4.0, 3.7, 4.1, 3.8, 4.4, 4.2, 4.0, 4.3, 3.9, 4.1,
          4.2, 4.0, 3.7, 4.1, 4.4, 4.3, 3.9, 4.2, 4.0, 4.3]

# Combine the data into a single array
data = np.concatenate([diet_A, diet_B, diet_C])

# Create a corresponding group variable
groups = ['A'] * len(diet_A) + ['B'] * len(diet_B) + ['C'] * len(diet_C)

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Print the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

if p_value < 0.05:
    print("There are significant differences between the mean weight loss of the three diets.")
else:
    print("There are no significant differences between the mean weight loss of the three diets.")


F-statistic: 153.1891368604358
p-value: 1.1167663555664372e-36
There are significant differences between the mean weight loss of the three diets.


Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [12]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a sample dataset with task completion times, software programs, and experience levels
data = {
    'Time': [12, 10, 8, 9, 11, 13, 14, 15, 13, 12, 9, 8, 10, 11, 12, 13, 14, 11, 10, 9, 10, 12, 11, 13, 10, 9, 8, 11, 12, 13],
    'Program': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'A', 'A', 'A'],
    'Experience': ['Novice', 'Experienced'] * 15
}

df = pd.DataFrame(data)

# Convert the Experience column to categorical
df['Experience'] = pd.Categorical(df['Experience'])

# Perform two-way ANOVA
model = ols('Time ~ Program + Experience + Program:Experience', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)


                       sum_sq    df         F    PR(>F)
Program             13.347595   2.0  1.734706  0.197853
Experience           1.036484   1.0  0.269411  0.608481
Program:Experience   0.185738   2.0  0.024139  0.976173
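Residual            92.333333  24.0       NaN       NaN

Interpretation: None of the effects is significant at the 0.05 level: the main effect of Program (F = 1.73, p = 0.198), the main effect of Experience (F = 0.27, p = 0.608), and the Program:Experience interaction (F = 0.02, p = 0.976). For this sample dataset there is no evidence that task completion time differs between the software programs or between experience levels, or that the effect of the program depends on experience.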


Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [6]:
from scipy import stats
import numpy as np

# Data for two-sample t-test
control_group = [75, 80, 78, 72, 85, 82, 79, 88, 76, 81,
                 77, 83, 79, 86, 84, 79, 82, 80, 75, 78,
                 83, 81, 80, 79, 77, 85, 78, 82, 79, 76,
                 81, 80, 83, 77, 79, 75, 82, 78, 80, 81,
                 76, 83, 78, 80, 82, 75, 77, 79, 81, 83]

experimental_group = [84, 86, 82, 88, 92, 85, 83, 90, 87, 89,
                      82, 85, 86, 88, 87, 90, 83, 85, 84, 86,
                      88, 85, 87, 90, 82, 85, 86, 84, 83, 88,
                      89, 85, 83, 84, 87, 90, 86, 85, 88, 89,
                      83, 82, 85, 87, 88, 84, 86, 89, 83, 85]

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Print the results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

if p_value < 0.05:
    print("There is a significant difference in test scores between the two groups.")
else:
    print("There is no significant difference in test scores between the two groups.")



t-statistic: -10.679040278962674
p-value: 4.089365508069097e-18
There is a significant difference in test scores between the two groups.
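
Because only two groups are compared, the significant t-test already identifies which groups differ, so no separate post-hoc test is needed; the negative t-statistic shows that the control group scored lower than the experimental group. What remains is to gauge how large the difference is. The sketch below is one possible follow-up: it uses only the first ten scores of each group from the cell above to stay short, runs Welch's t-test (which does not assume equal variances), and computes Cohen's d with a pooled standard deviation.

```python
import numpy as np
from scipy import stats

# First ten scores of each group from the cell above (a subset, for brevity)
control = np.array([75, 80, 78, 72, 85, 82, 79, 88, 76, 81], dtype=float)
experimental = np.array([84, 86, 82, 88, 92, 85, 83, 90, 87, 89], dtype=float)

# Welch's t-test: does not assume equal variances in the two groups
t_welch, p_welch = stats.ttest_ind(control, experimental, equal_var=False)

# Cohen's d: mean difference divided by the pooled standard deviation
n1, n2 = len(control), len(experimental)
pooled_var = ((n1 - 1) * control.var(ddof=1)
              + (n2 - 1) * experimental.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = (experimental.mean() - control.mean()) / np.sqrt(pooled_var)

print("Welch t-statistic:", t_welch)
print("Welch p-value:", p_welch)
print("Cohen's d:", cohens_d)
```

By the usual rule of thumb, a Cohen's d above 0.8 is considered a large effect.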


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [9]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM

# Data for repeated measures ANOVA
store_a = [100, 110, 105, 120, 115, 105, 110, 125, 130, 120,
           115, 110, 125, 115, 120, 130, 125, 110, 105, 120,
           110, 105, 125, 120, 115, 110, 125, 130, 120, 115]
store_b = [90, 95, 100, 105, 110, 115, 100, 95, 90, 105,
           110, 115, 100, 105, 90, 95, 100, 105, 110, 115,
           100, 95, 90, 105, 110, 115, 100, 105, 90, 95]
store_c = [80, 85, 90, 95, 100, 85, 80, 90, 95, 100,
           105, 80, 85, 90, 95, 100, 85, 80, 90, 95,
           100, 105, 80, 85, 90, 95, 100, 85, 80, 90]

# Create a pandas DataFrame
data = pd.DataFrame({'Day': list(range(30)) * 3,
                     'Sales': store_a + store_b + store_c,
                     'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30})

# Fit the repeated measures ANOVA model
model = AnovaRM(data, 'Sales', 'Day', within=['Store']).fit()

# Print the ANOVA table
print(model.summary())



               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store 74.0602 2.0000 58.0000 0.0000
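
The table reports F(2, 58) = 74.06 with a p-value that rounds to 0.0000, so mean daily sales differ significantly between the stores. The question also asks for a post-hoc test. One simple option, sketched below, is to run paired t-tests for each pair of stores (paired because all stores are observed on the same 30 days) and apply a Bonferroni correction. This is an illustrative choice rather than the only valid follow-up.

```python
from itertools import combinations
from scipy import stats

# Same daily sales as in the cell above
sales = {
    "A": [100, 110, 105, 120, 115, 105, 110, 125, 130, 120,
          115, 110, 125, 115, 120, 130, 125, 110, 105, 120,
          110, 105, 125, 120, 115, 110, 125, 130, 120, 115],
    "B": [90, 95, 100, 105, 110, 115, 100, 95, 90, 105,
          110, 115, 100, 105, 90, 95, 100, 105, 110, 115,
          100, 95, 90, 105, 110, 115, 100, 105, 90, 95],
    "C": [80, 85, 90, 95, 100, 85, 80, 90, 95, 100,
          105, 80, 85, 90, 95, 100, 85, 80, 90, 95,
          100, 105, 80, 85, 90, 95, 100, 85, 80, 90],
}

pairs = list(combinations(sales, 2))   # A-B, A-C, B-C
adjusted_alpha = 0.05 / len(pairs)     # Bonferroni-adjusted threshold

posthoc = {}
for first, second in pairs:
    # Paired test: each day contributes one value per store
    t_stat, p_value = stats.ttest_rel(sales[first], sales[second])
    posthoc[(first, second)] = p_value
    verdict = "differ" if p_value < adjusted_alpha else "do not differ"
    print(f"Store {first} vs Store {second}: t = {t_stat:.3f}, "
          f"p = {p_value:.4g} -> {verdict} significantly")
```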

