## Assignment of Statistics Advance-6

#### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

#### Answer:

Analysis of Variance (ANOVA) is a statistical method used to compare means of three or more groups to determine if there are any statistically significant differences between them. For ANOVA to provide valid results, certain assumptions must be met. These assumptions include:

1. **Normality:** The dependent variable within each group should be normally distributed. This is important because ANOVA relies on the normal distribution assumption when making inferences about population means.

   **Violation Example:** If the data within groups are strongly skewed or heavy-tailed, the p-values from the F-test can be inaccurate, especially when sample sizes are small.

2. **Homogeneity of Variances (Homoscedasticity):** The variances of the groups being compared should be approximately equal. Homogeneity of variances is essential for the F-test used in ANOVA to be valid.

   **Violation Example:** If one group has a much larger variance than the others, the F-test may become unreliable. The Type I error rate is distorted most when the group sizes are also unequal.

3. **Independence:** Observations within each group must be independent of each other. This means that the value of the dependent variable for one observation should not be influenced by the value of any other observation.

   **Violation Example:** If observations are not independent (e.g., repeated measurements on the same subjects), it may lead to pseudo-replication and affect the accuracy of ANOVA results.

4. **Random Sampling:** The data should be collected through a random sampling process. This helps ensure that the sample is representative of the population.

   **Violation Example:** If the sampling is not random, it might introduce biases, and the sample may not accurately represent the larger population.

5. **Interval or Ratio Data:** ANOVA assumes that the dependent variable is measured on an interval or ratio scale. This allows for meaningful calculations of means and variances.

   **Violation Example:** If the dependent variable is measured on a nominal or ordinal scale, using ANOVA would be inappropriate. In such cases, alternative non-parametric tests may be considered.

It's important to note that while violations of assumptions can impact the validity of ANOVA results, the robustness of ANOVA allows it to still provide meaningful insights in many cases, especially when sample sizes are large. However, researchers should be cautious and interpret results with consideration to the specific circumstances and data characteristics. If assumptions are seriously violated, alternative methods or transformations may be considered.
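The normality and equal-variance assumptions above can be screened before running ANOVA. The following is a minimal sketch, assuming SciPy is available and using simulated groups (the group names, means, and sizes are hypothetical). It applies the Shapiro-Wilk test within each group and Levene's test across groups.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn with a fixed seed for reproducibility
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=50, scale=5, size=30),
    "B": rng.normal(loc=52, scale=5, size=30),
    "C": rng.normal(loc=55, scale=5, size=30),
}

# Normality: Shapiro-Wilk test within each group (H0: the data are normal)
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variances: Levene's test across groups (H0: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value for group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value from either test suggests the corresponding assumption may not hold. These tests are a screening aid rather than a verdict, and they are usually read alongside plots such as histograms or Q-Q plots.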

#### Q2. What are the three types of ANOVA, and in what situations would each be used?

#### Answer:

Analysis of Variance (ANOVA) can be categorized into three main types based on the number of factors (independent variables) and the levels or groups within those factors. These types are:

1. **One-Way ANOVA:**
   - **Description:** This is the most basic form of ANOVA, where there is only one independent variable or factor.
   - **Use Cases:**
     - Comparing means of more than two groups to determine if there are any significant differences.
     - Example: Testing whether the mean scores of students from three different teaching methods are significantly different.

2. **Two-Way ANOVA:**
   - **Description:** Involves two independent variables or factors. It examines how two factors impact a dependent variable simultaneously, as well as the interaction effect between the factors.
   - **Use Cases:**
     - Examining the main effects of two factors on a response variable.
     - Assessing if there is an interaction effect between the two factors.
     - Example: Investigating the effects of both diet and exercise on weight loss.

3. **Repeated Measures ANOVA:**
   - **Description:** Also known as within-subjects ANOVA, it is used when the same subjects are used for each treatment or measurement.
   - **Use Cases:**
     - Examining changes within the same subjects over time or under different conditions.
     - Example: Measuring the blood pressure of individuals before and after receiving different doses of a drug.

Each type of ANOVA is suited to answer different research questions and is applied in various experimental designs. The choice of which ANOVA to use depends on the study design, the number of independent variables, and the nature of the data. Researchers need to carefully consider the experimental setup and the hypotheses they want to test to determine the most appropriate type of ANOVA for their analysis.


#### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

#### Answer: 

The partitioning of variance in Analysis of Variance (ANOVA) refers to the process of decomposing the total variation observed in a dataset into components associated with different sources of variation. Understanding this concept is crucial for gaining insights into the contributions of various factors to the overall variability in the data. In a one-way ANOVA, the total sum of squares (SST) is split into two components, the between-group (SSB) and within-group (SSW) sums of squares, so that SST = SSB + SSW. The three quantities are:

1. **Between-Group Sum of Squares (SSB):**
   - **Definition:** This component represents the variability among the group means. It measures how much the group means differ from the overall mean.
   - **Calculation:** It is calculated as the sum of squared differences between each group mean and the overall mean, each weighted by the number of observations in the group.

2. **Within-Group Sum of Squares (SSW, also called the error sum of squares):**
   - **Definition:** This component represents the variability within each group. It measures how much individual observations deviate from their group mean.
   - **Calculation:** It is calculated as the sum of squared differences between each individual observation and its group mean, summed across all groups.

3. **Total Sum of Squares (SST):**
   - **Definition:** This is the overall variability in the data, considering all observations across all groups.
   - **Calculation:** It is the sum of squared differences between each individual observation and the overall mean.

The partitioning of variance is expressed in terms of sums of squares (SS), and each component has its own degrees of freedom (df): k - 1 between groups, N - k within groups, and N - 1 in total, for k groups and N observations. Dividing each sum of squares by its degrees of freedom gives the mean squares that form the F-statistic.

Understanding the partitioning of variance is important for the following reasons:

- **Identification of Sources of Variation:** It helps identify and quantify the sources of variation in the data, distinguishing between variation due to differences among groups and variation within groups.

- **Interpretation of F-Statistic:** The F-statistic in ANOVA is the ratio of the between-group mean square (SSB divided by its degrees of freedom) to the within-group mean square (SSW divided by its degrees of freedom). Understanding the partitioning of variance helps in interpreting the F-statistic and assessing the significance of the observed differences among group means.

- **Insight into Experimental Design:** By understanding how much of the total variance is attributed to different sources, researchers can gain insights into the effectiveness of the experimental design and the impact of independent variables on the dependent variable.

In summary, the partitioning of variance is a fundamental concept in ANOVA that provides a structured approach to understanding and analyzing the sources of variability in experimental data. It helps researchers draw meaningful conclusions about the effects of different factors on the observed outcomes.
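The decomposition described above can be written compactly. The notation is an assumption of this sketch: $k$ is the number of groups, $n_j$ the size of group $j$, $N$ the total number of observations, $x_{ij}$ the $i$-th observation in group $j$, $\bar{x}_j$ the mean of group $j$, and $\bar{x}$ the overall mean.

```latex
\begin{aligned}
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}\right)^2}_{SST}
  &= \underbrace{\sum_{j=1}^{k} n_j\left(\bar{x}_j-\bar{x}\right)^2}_{SSB}
   + \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}_j\right)^2}_{SSW} \\[4pt]
F &= \frac{SSB/(k-1)}{SSW/(N-k)}
\end{aligned}
```

The first line is the partition itself, and the second shows how the two components, each divided by its degrees of freedom, form the F-statistic.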

#### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

#### Answer:

In [46]:
import numpy as np

# Sample data (replace this with your actual data)
data = {
    'Group1': np.array([12, 14, 16, 18, 20]),
    'Group2': np.array([10, 13, 16, 19, 22]),
    'Group3': np.array([8, 12, 16, 20, 24])
}

# Flatten the data and calculate overall mean
all_data = np.concatenate(list(data.values()))
overall_mean = np.mean(all_data)

# Calculate SST
sst = np.sum((all_data - overall_mean)**2)

# Calculate SSE (explained, i.e. between-group, sum of squares)
sse = np.sum([len(group) * (np.mean(group) - overall_mean)**2 for group in data.values()])

# Calculate SSR (residual, i.e. within-group, sum of squares)
ssr = np.sum([np.sum((group - np.mean(group))**2) for group in data.values()])

# Print results
print(f'Total Sum of Squares (SST): {sst}')
print(f'Explained Sum of Squares (SSE): {sse}')
print(f'Residual Sum of Squares (SSR): {ssr}')


Total Sum of Squares (SST): 290.0
Explained Sum of Squares (SSE): 0.0
Residual Sum of Squares (SSR): 290.0
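The sums of squares lead directly to the F-statistic. As a minimal sketch, assuming hypothetical data whose group means differ (unlike the sample above, where all three groups share the same mean), the code below checks that the explained and residual parts add up to the total and cross-checks the hand-computed F against `scipy.stats.f_oneway`. The variable names `ss_between` and `ss_within` are used here for the explained and residual sums of squares.

```python
import numpy as np
from scipy import stats

# Hypothetical data with different group means (not the data used above)
groups = [
    np.array([12.0, 14.0, 16.0, 18.0, 20.0]),
    np.array([15.0, 17.0, 19.0, 21.0, 23.0]),
    np.array([20.0, 22.0, 24.0, 26.0, 28.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n_total = len(groups), all_values.size

# Sums of squares: total, between groups (explained), within groups (residual)
ss_total = np.sum((all_values - grand_mean) ** 2)
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Mean squares and the F-statistic
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"SST = {ss_total:.2f}, explained + residual = {ss_between + ss_within:.2f}")
print(f"F (manual) = {f_manual:.4f}, F (scipy) = {f_scipy:.4f}")
```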


#### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

#### Answer:

In [51]:
import pandas as pd

# Sample data (replace this with your actual data)
data = {
    'Factor1': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Factor2': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
    'Response': [10, 12, 14, 8, 10, 12, 16, 18, 20]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Grand mean, marginal means for each factor, and cell means
grand_mean = df['Response'].mean()
factor1_means = df.groupby('Factor1')['Response'].mean()
factor2_means = df.groupby('Factor2')['Response'].mean()
cell_means = df.pivot_table(index='Factor1', columns='Factor2', values='Response', aggfunc='mean')

# Main effects: marginal mean minus grand mean
main_effect_factor1 = factor1_means - grand_mean
main_effect_factor2 = factor2_means - grand_mean

# Interaction effects: cell mean minus both marginal means plus grand mean
interaction_effect = cell_means.sub(factor1_means, axis=0).sub(factor2_means, axis=1) + grand_mean
interaction_ss = float((interaction_effect ** 2).to_numpy().sum())

# Note: with one observation per cell there are no residual degrees of freedom,
# so the interaction can be estimated but not tested with an F-statistic.

# Print main effects and interaction sum of squares
print(f'\nMain Effect of Factor1:\n{main_effect_factor1}')
print(f'\nMain Effect of Factor2:\n{main_effect_factor2}')
print(f'\nInteraction Sum of Squares: {round(interaction_ss, 6)}')



Main Effect of Factor1:
Factor1
A   -1.333333
B   -3.333333
C    4.666667
Name: Response, dtype: float64

Main Effect of Factor2:
Factor2
X   -2.0
Y    0.0
Z    2.0
Name: Response, dtype: float64

Interaction Sum of Squares: 0.0
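Testing main and interaction effects with F-statistics requires more than one observation per cell. The sketch below assumes a hypothetical balanced 2 x 2 design with three replicates per cell (the data are invented for illustration). It partitions the total sum of squares by hand into the two main effects, the interaction, and the error.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical balanced design: 2 levels x 2 levels, 3 replicates per cell
df = pd.DataFrame({
    "Factor1": ["A"] * 6 + ["B"] * 6,
    "Factor2": (["X"] * 3 + ["Y"] * 3) * 2,
    "Response": [10, 11, 12, 14, 15, 16, 13, 14, 15, 23, 24, 25],
})

grand_mean = df["Response"].mean()
n_rep = 3                        # replicates per cell
a = df["Factor1"].nunique()      # levels of Factor1
b = df["Factor2"].nunique()      # levels of Factor2

mean_1 = df.groupby("Factor1")["Response"].mean()
mean_2 = df.groupby("Factor2")["Response"].mean()
cell = df.groupby(["Factor1", "Factor2"])["Response"].mean().unstack()

# Sums of squares for a balanced design
ss_1 = b * n_rep * ((mean_1 - grand_mean) ** 2).sum()
ss_2 = a * n_rep * ((mean_2 - grand_mean) ** 2).sum()
interaction = cell.sub(mean_1, axis=0).sub(mean_2, axis=1) + grand_mean
ss_12 = n_rep * (interaction ** 2).to_numpy().sum()
ss_total = ((df["Response"] - grand_mean) ** 2).sum()
ss_error = ss_total - ss_1 - ss_2 - ss_12

# F-statistics: each effect's mean square over the error mean square
df_error = a * b * (n_rep - 1)
ms_error = ss_error / df_error
f_1 = (ss_1 / (a - 1)) / ms_error
f_2 = (ss_2 / (b - 1)) / ms_error
f_12 = (ss_12 / ((a - 1) * (b - 1))) / ms_error
p_12 = stats.f.sf(f_12, (a - 1) * (b - 1), df_error)

print(f"F(Factor1) = {f_1:.2f}, F(Factor2) = {f_2:.2f}, F(interaction) = {f_12:.2f}")
print(f"p-value for interaction = {p_12:.4f}")
```

These formulas hold only for balanced designs. For unbalanced data, a model-based routine is the usual choice.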




#### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

#### Answer:

In a one-way ANOVA, the F-statistic is used to test the null hypothesis that the means of multiple groups are equal. The associated p-value helps determine the statistical significance of the observed differences. Let's interpret the given values:

1. **F-Statistic: 5.23**
   - The F-statistic is a ratio of the variance between groups to the variance within groups. A larger F-statistic suggests a greater difference between group means relative to the variation within each group.

2. **p-value: 0.02**
   - The p-value is the probability of observing an F-statistic as extreme as, or more extreme than, the one calculated from the sample data, assuming the null hypothesis is true.

Interpretation:

- **Null Hypothesis (H0):** The null hypothesis assumes that the means of all groups are equal.
- **Alternative Hypothesis (H1):** The alternative hypothesis suggests that at least one group mean is different.

Conclusions:

1. **Statistical Significance:**
   - The p-value of 0.02 is less than the typical significance level of 0.05. Therefore, we reject the null hypothesis.

2. **Differences Exist:**
   - The rejection of the null hypothesis indicates that there are statistically significant differences among the group means.

3. **Post-hoc Analysis:**
   - If the ANOVA indicates significant differences, it is common to conduct post-hoc tests (e.g., Tukey's HSD) to identify which specific groups differ from each other.

4. **Effect Size:**
   - While statistical significance is established, it's also important to consider the effect size to assess the practical significance of the differences.

In summary, based on the F-statistic and p-value:

- There are statistically significant differences among the group means.
- Further post-hoc analysis may be needed to identify specific groups with differing means.
- The observed differences are unlikely to be due to chance alone, but effect size should be considered when interpreting the results.
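The link between an F-statistic, its p-value, and an effect size can be sketched in a few lines. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are purely hypothetical, and the resulting p-value will not necessarily equal the reported 0.02.

```python
from scipy import stats

f_statistic = 5.23   # from the question
df_between = 2       # hypothetical: 3 groups
df_within = 27       # hypothetical: 30 observations in total

# p-value: upper-tail probability of the F distribution
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Eta squared recovered from F and the degrees of freedom
eta_squared = (f_statistic * df_between) / (f_statistic * df_between + df_within)

alpha = 0.05
print(f"p-value = {p_value:.4f}, reject H0 at alpha={alpha}: {p_value < alpha}")
print(f"eta squared = {eta_squared:.3f}")
```

Eta squared is the share of total variation attributable to group membership, which complements the p-value when judging practical significance.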

#### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

#### Answer:

Handling missing data in a repeated measures ANOVA is crucial to ensure valid and reliable results. Here are common methods to handle missing data and potential consequences of using different methods:

### Methods to Handle Missing Data:

1. **Complete Case Analysis (CCA):**
   - **Approach:** Exclude cases with missing data from the analysis.
   - **Consequences:** Loss of information, and potential bias unless the data are missing completely at random (MCAR).

2. **Pairwise Deletion (PD):**
   - **Approach:** Analyze each pair of variables with available data.
   - **Consequences:** May lead to biased results if missing data is not MCAR. Variability in sample sizes across pairs can also affect statistical power.

3. **Imputation Methods:**
   - **Approach:** Estimate missing values based on observed data.
   - **Consequences:** Imputation methods (mean imputation, regression imputation, etc.) introduce uncertainty. Results may be biased if the imputation model is misspecified.

4. **Last Observation Carried Forward (LOCF):**
   - **Approach:** Carry the last observed value forward for missing values.
   - **Consequences:** Assumes that missing values remain constant over time. May not accurately represent the true trajectory of the variable.

5. **Multiple Imputation (MI):**
   - **Approach:** Generate several imputed datasets, analyze each one, and pool the results (e.g., with Rubin's rules).
   - **Consequences:** Reduces bias compared to single imputation methods. Provides more accurate standard errors and confidence intervals.

### Potential Consequences:

1. **Bias:**
   - **Issue:** Using methods like LOCF or mean imputation may introduce bias if missing data is related to unobserved characteristics.

2. **Reduced Power:**
   - **Issue:** Complete case analysis or pairwise deletion may result in reduced statistical power due to a smaller sample size.

3. **Inaccurate Variance Estimation:**
   - **Issue:** Ignoring missing data or using simple imputation methods can lead to underestimated standard errors and confidence intervals.

4. **Misleading Results:**
   - **Issue:** Different methods may yield different results, making it challenging to draw accurate conclusions.

5. **Assumptions Violation:**
   - **Issue:** Some imputation methods assume data is missing completely at random (MCAR), and violating this assumption can impact results.

### Recommendations:

- **Understand the Nature of Missingness:**
  - Investigate whether missing data is completely at random, at random, or not at random. This understanding helps in selecting appropriate methods.

- **Consider Multiple Imputation:**
  - If possible, use multiple imputation to account for uncertainty associated with imputed values and to improve the accuracy of standard errors.

- **Perform Sensitivity Analysis:**
  - Evaluate the robustness of results by comparing outcomes using different missing data handling methods.

- **Document and Justify:**
  - Clearly document the missing data handling approach and justify the chosen method based on the nature of the data and the analysis.

Handling missing data is a complex issue, and the choice of method depends on the assumptions about the missing data mechanism and the goals of the analysis. It's essential to carefully consider the potential consequences of different approaches and their implications for the validity of the study results.
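Three of the simpler approaches listed above can be compared side by side with pandas. This is a minimal sketch on a hypothetical wide-format table (one row per subject, one column per time point). The subject labels and scores are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 14.0],
        "t2": [11.0, np.nan, 12.0, 15.0],
        "t3": [13.0, 14.0, np.nan, 17.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Complete case analysis: drop every subject with any missing time point
complete_cases = scores.dropna()

# Mean imputation: replace each gap with that time point's observed mean
mean_imputed = scores.fillna(scores.mean())

# LOCF: carry each subject's last observed value forward across time points
locf = scores.ffill(axis=1)

print(f"Subjects kept by complete case analysis: {len(complete_cases)} of {len(scores)}")
print(mean_imputed)
print(locf)
```

Running the same repeated measures analysis on each version is one simple form of the sensitivity analysis recommended above.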

#### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

#### Answer:

Post-hoc tests are conducted after an Analysis of Variance (ANOVA) to determine which specific group means differ significantly from each other when the ANOVA indicates that there are overall differences among groups. Common post-hoc tests include:

1. **Tukey's Honestly Significant Difference (HSD):**
   - **Purpose:** Identifies which pairs of group means are significantly different from each other.
   - **Use Case:** Suitable when comparing all possible pairs of groups, and the number of comparisons is not too large.

2. **Bonferroni Correction:**
   - **Purpose:** Adjusts significance levels to control the familywise error rate when making multiple comparisons.
   - **Use Case:** Appropriate when conducting numerous pairwise comparisons, but it can be conservative and increase the likelihood of Type II errors.

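3. **Scheffe's Test:**
   - **Purpose:** Controls the familywise error rate across all possible contrasts among group means, not only pairwise comparisons.
   - **Use Case:** Useful when complex comparisons (e.g., one group versus the average of two others) are of interest. It is more conservative than Tukey's HSD when only pairwise comparisons are needed.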

4. **Dunnett's Test:**
   - **Purpose:** Used when comparing multiple treatment groups to a control group.
   - **Use Case:** Useful in situations where there is a designated control group, and the interest is in comparing other groups to this control group.

5. **Games-Howell Test:**
   - **Purpose:** A pairwise comparison procedure that does not assume equal variances or equal sample sizes.
   - **Use Case:** Appropriate when the assumption of homogeneity of variances is violated, and sample sizes are unequal.

6. **Holm's Method:**
   - **Purpose:** A sequential (step-down) version of the Bonferroni correction that is less conservative while still controlling the familywise error rate.
   - **Use Case:** When you need to control the familywise error rate, and Bonferroni is too stringent.

### Example Situation:

Let's say you conducted a one-way ANOVA to compare the effectiveness of three different teaching methods on student test scores. The ANOVA results indicate a significant difference among the groups. Now, you want to determine which specific pairs of teaching methods lead to significantly different mean test scores.

**Post-hoc Test Application:**

- **Tukey's HSD:** Use Tukey's HSD when you want to compare all possible pairs of teaching methods to identify which pairs have significantly different mean test scores. This method provides simultaneous confidence intervals for every pairwise difference.

The output of Tukey's HSD will indicate which pairs of teaching methods have significantly different mean test scores and provide confidence intervals for the differences. The choice of post-hoc test depends on the characteristics of your data and the specific goals of your analysis.
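The Bonferroni and Holm adjustments described above can be applied by hand to a set of pairwise p-values. The sketch below assumes simulated test scores for three hypothetical teaching methods and uses ordinary two-sample t-tests for the pairwise comparisons. The adjustment logic is written out explicitly rather than taken from a library.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical test scores for three teaching methods
rng = np.random.default_rng(1)
scores = {
    "Method1": rng.normal(loc=70, scale=8, size=25),
    "Method2": rng.normal(loc=75, scale=8, size=25),
    "Method3": rng.normal(loc=80, scale=8, size=25),
}

# Unadjusted pairwise two-sample t-tests
pairs = list(combinations(scores, 2))
raw_p = np.array([stats.ttest_ind(scores[a], scores[b]).pvalue for a, b in pairs])
m = len(raw_p)

# Bonferroni: multiply every p-value by the number of comparisons
bonferroni_p = np.minimum(raw_p * m, 1.0)

# Holm: sort p-values, scale the i-th smallest (0-based) by (m - i), enforce monotonicity
order = np.argsort(raw_p)
scaled = raw_p[order] * (m - np.arange(m))
holm_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
holm_p = np.empty(m)
holm_p[order] = holm_sorted

for (a, b), p_raw, p_bonf, p_holm in zip(pairs, raw_p, bonferroni_p, holm_p):
    print(f"{a} vs {b}: raw={p_raw:.4f}, Bonferroni={p_bonf:.4f}, Holm={p_holm:.4f}")
```

Holm's adjusted p-values are never larger than Bonferroni's, which is why Holm is the less conservative of the two.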

#### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results. Create sample data for the analysis.

#### Answer:

In [55]:
import numpy as np
import scipy.stats as stats
import pandas as pd

# Set a random seed for reproducibility
np.random.seed(42)

# Generate sample data for weight loss in each diet
data = {
    'Diet': np.repeat(['A', 'B', 'C'], 50),
    'WeightLoss': np.concatenate([
        np.random.normal(loc=70, scale=5, size=50),  # Diet A
        np.random.normal(loc=75, scale=5, size=50),  # Diet B
        np.random.normal(loc=72, scale=5, size=50)   # Diet C
    ])
}

# Create a DataFrame
df = pd.DataFrame(data)

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(
    df[df['Diet'] == 'A']['WeightLoss'],
    df[df['Diet'] == 'B']['WeightLoss'],
    df[df['Diet'] == 'C']['WeightLoss']
)

# Print results
print(f'F-Statistic: {f_statistic}')
print(f'p-value: {p_value}')

# Interpretation
alpha = 0.05
if p_value < alpha:
    print("There are significant differences between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")


F-Statistic: 21.754981595687298
p-value: 5.295093828387118e-09
There are significant differences between the mean weight loss of the three diets.


#### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

#### Answer:

In [44]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Set a random seed for reproducibility
np.random.seed(42)

# Generate sample data: 30 employees, 5 per Program x Experience cell
df = pd.DataFrame({
    'Program': np.repeat(['A', 'B', 'C'], 10),
    'Experience': np.tile(np.repeat(['Novice', 'Experienced'], 5), 3),
    'Time': np.random.normal(loc=30, scale=5, size=30)  # Simulated task time
})

# Fit a two-way ANOVA model with both main effects and their interaction
model = ols('Time ~ C(Program) * C(Experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table (F-statistics and p-values for each effect)
print(anova_table)

# Make a decision for each effect based on the significance level
alpha = 0.05  # Significance level
for effect in ['C(Program)', 'C(Experience)', 'C(Program):C(Experience)']:
    p_value = anova_table.loc[effect, 'PR(>F)']
    if p_value < alpha:
        print(f"Reject the null hypothesis: {effect} has a significant effect on completion time.")
    else:
        print(f"Fail to reject the null hypothesis: {effect} has no significant effect on completion time.")



#### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

#### Answer:

In [56]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Set a random seed for reproducibility
np.random.seed(42)

# Generate sample data
control_group = np.random.normal(loc=70, scale=5, size=50)  # Traditional teaching method
experimental_group = np.random.normal(loc=75, scale=5, size=50)  # New teaching method

# Create a DataFrame
data = pd.DataFrame({
    'Group': np.repeat(['Control', 'Experimental'], 50),
    'TestScores': np.concatenate([control_group, experimental_group])
})

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Print results
print(f'Two-Sample T-Test:')
print(f'T-statistic: {t_statistic}')
print(f'p-value: {p_value}')

# Check for significance
alpha = 0.05
if p_value < alpha:
    print("There is a significant difference in test scores between the two groups.")
    
    # Perform Tukey's HSD post-hoc test
    tukey_results = pairwise_tukeyhsd(data['TestScores'], data['Group'])
    print('\nTukey\'s HSD Post-Hoc Test:')
    print(tukey_results)
else:
    print("There is no significant difference in test scores between the two groups.")


Two-Sample T-Test:
T-statistic: -6.872731683285833
p-value: 5.877565294167974e-10
There is a significant difference in test scores between the two groups.

Tukey's HSD Post-Hoc Test:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   6.2163   0.0 4.4214 8.0112   True
--------------------------------------------------------


#### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

#### Answer:


In [57]:
import numpy as np
import pandas as pd
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Set a random seed for reproducibility
np.random.seed(42)

# Generate sample data
sales_store_A = np.random.normal(loc=500, scale=50, size=30)
sales_store_B = np.random.normal(loc=550, scale=50, size=30)
sales_store_C = np.random.normal(loc=520, scale=50, size=30)

# Create a DataFrame
data = pd.DataFrame({
    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Sales': np.concatenate([sales_store_A, sales_store_B, sales_store_C])
})

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(
    data[data['Store'] == 'A']['Sales'],
    data[data['Store'] == 'B']['Sales'],
    data[data['Store'] == 'C']['Sales']
)

# Print results
print(f'One-Way ANOVA:')
print(f'F-statistic: {f_statistic}')
print(f'p-value: {p_value}')

# Check for significance
alpha = 0.05
if p_value < alpha:
    print("There is a significant difference in sales between the three stores.")
    
    # Perform Tukey's HSD post-hoc test
    tukey_results = pairwise_tukeyhsd(data['Sales'], data['Store'])
    print('\nTukey\'s HSD Post-Hoc Test:')
    print(tukey_results)
else:
    print("There is no significant difference in sales between the three stores.")


One-Way ANOVA:
F-statistic: 9.67762780725552
p-value: 0.00016035145881099311
There is a significant difference in sales between the three stores.

Tukey's HSD Post-Hoc Test:
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B  53.3492 0.0001  24.3572 82.3413   True
     A      C  30.0516 0.0404   1.0595 59.0437   True
     B      C -23.2976 0.1402 -52.2897  5.6944  False
-----------------------------------------------------
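Because the same 30 days are observed for every store, the design in the question is a repeated measures one, with the day as the repeated unit. The sketch below computes a repeated measures ANOVA by hand on simulated data. The day effect, store means, and noise levels are hypothetical, so the numbers differ from the output above. The post-hoc step uses paired t-tests with a Bonferroni correction.

```python
import numpy as np
from scipy import stats

# Hypothetical sales: rows are days (the repeated unit), columns are stores A, B, C
rng = np.random.default_rng(42)
day_effect = rng.normal(loc=0, scale=40, size=(30, 1))   # shared by all stores on a day
store_means = np.array([500.0, 550.0, 520.0])
sales = store_means + day_effect + rng.normal(loc=0, scale=20, size=(30, 3))

n_days, n_stores = sales.shape
grand_mean = sales.mean()

# Partition the total sum of squares into stores, days, and residual error
ss_total = np.sum((sales - grand_mean) ** 2)
ss_stores = n_days * np.sum((sales.mean(axis=0) - grand_mean) ** 2)
ss_days = n_stores * np.sum((sales.mean(axis=1) - grand_mean) ** 2)
ss_error = ss_total - ss_stores - ss_days

# F-test for the store effect, with day-to-day variation removed from the error term
df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)
f_statistic = (ss_stores / df_stores) / (ss_error / df_error)
p_value = stats.f.sf(f_statistic, df_stores, df_error)
print(f"Repeated measures ANOVA: F = {f_statistic:.3f}, p = {p_value:.3g}")

# Post-hoc: paired t-tests between stores with a Bonferroni correction
labels = ["A", "B", "C"]
pairs = [(0, 1), (0, 2), (1, 2)]
for i, j in pairs:
    p_pair = stats.ttest_rel(sales[:, i], sales[:, j]).pvalue
    p_adjusted = min(p_pair * len(pairs), 1.0)
    print(f"{labels[i]} vs {labels[j]}: Bonferroni-adjusted p = {p_adjusted:.3g}")
```

Removing the day-to-day variation from the error term is what distinguishes this from a one-way ANOVA on the same numbers. For long-format data, `AnovaRM` in `statsmodels.stats.anova` fits the same kind of model.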
