QTS.1

Assumptions of ANOVA (Analysis of Variance):

1. **Homogeneity of Variances (Homoscedasticity):**
   - *Assumption:* Variances of the dependent variable are equal across all groups.
   - *Violation Example:* Unequal variances can lead to inaccurate F-statistics and affect the reliability
of ANOVA results.

2. **Independence of Observations:**
   - *Assumption:* Observations within and between groups are independent.
   - *Violation Example:* Correlated observations may lead to inflated Type I error rates.

3. **Normality of Residuals:**
   - *Assumption:* Residuals (the differences between observed and predicted values) are normally distributed.
   - *Violation Example:* Skewed or non-normal residuals can impact the accuracy of p-values and 
confidence intervals.

4. **Random Sampling:**
   - *Assumption:* Data is collected through random sampling.
   - *Violation Example:* Non-random sampling can introduce bias and affect the generalizability of results.

Violations of these assumptions can lead to biased or inefficient estimates and compromise the 
validity of ANOVA results.
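
Two of these assumptions can be checked numerically before running the ANOVA. The following is a minimal sketch, assuming simulated data for three hypothetical groups; the group means, the seed, and the choice of Levene's test and the Shapiro-Wilk test are illustrative assumptions, not part of the original answer.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three independent groups drawn from normal distributions
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=2.0, size=30) for m in (5.0, 6.0, 7.0)]

# Homogeneity of variances: Levene's test (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups)

# Normality of residuals: in a one-way ANOVA the residuals are the
# deviations of each observation from its own group mean
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print("Levene p-value:", levene_p)
print("Shapiro-Wilk p-value:", shapiro_p)
```

A small p-value in either test is evidence against the corresponding assumption. Independence and random sampling cannot be tested this way; they depend on how the data were collected.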

QTS.2

There are three main types of ANOVA:

1. **One-Way ANOVA:**
   - **Use:** Used when comparing means across two or more independent groups
    or levels of a single independent variable. It tests whether there are any 
    statistically significant differences among the group means.

2. **Two-Way ANOVA:**
   - **Use:** Appropriate when there are two independent variables (factors) 
    and you want to examine both their separate effects and their interaction on 
    the dependent variable. It assesses whether there are significant main effects and interaction effects.

3. **Repeated Measures ANOVA:**
   - **Use:** Employed when the same subjects are used for each treatment
    (within-subject design). It analyzes the differences between the means 
    of repeated measurements taken under different conditions. Common in 
    experimental designs where the same subjects are exposed to multiple treatments.

In summary:
- **One-Way ANOVA:** Used for comparing means across multiple independent groups.
- **Two-Way ANOVA:** Used when there are two independent variables, to analyze 
their main effects and their interaction.
- **Repeated Measures ANOVA:** Appropriate for within-subject designs, analyzing
repeated measurements under different conditions.

QTS.3

The partitioning of variance in ANOVA involves breaking down the total variability 
in the data into components attributed to different sources, such as 
treatment effects and random error. This breakdown helps in understanding the 
contributions of different factors to the overall variability in the dependent 
variable. It includes:

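1. **Between-Group Variance:** Variability of the group means around the overall (grand) mean.
2. **Within-Group Variance:** Variability of individual observations around their own group mean.
3. **Total Variance:** Variability of all observations around the grand mean. In terms of sums of squares, it equals the between-group part plus the within-group part.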

Understanding this concept is important because it allows researchers to assess the 
proportion of total variability that can be attributed to the treatment effects and 
determine the statistical significance of these effects. It provides insights into 
whether the observed differences between groups are likely due to the treatments or 
if they could be explained by random variability.
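
The partition can be written as an identity on sums of squares. The notation below is an assumption introduced for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $y_{ij}$ the $j$-th observation in group $i$, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the grand mean.

```latex
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2}_{\text{total}}
=
\underbrace{\sum_{i=1}^{k} n_i\,(\bar{y}_i - \bar{y})^2}_{\text{between groups}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}_{\text{within groups}}

F = \frac{\text{between-group sum of squares} / (k - 1)}
         {\text{within-group sum of squares} / (N - k)}
```

The F-statistic compares the two components after each is divided by its degrees of freedom.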

QTS.4

In [2]:
import scipy.stats as stats
import numpy as np

# Sample data for three groups
group1 = np.array([12, 14, 16, 18])
group2 = np.array([20, 22, 24, 26])
group3 = np.array([8, 10, 12, 14])

# Combine the data into a single array
data = np.concatenate([group1, group2, group3])

# Calculate ANOVA
f_statistic, p_value = stats.f_oneway(group1, group2, group3)

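# Calculate the sums of squares.
# Naming used here: SST = total, SSE = explained (between-group),
# SSR = residual (within-group). Many textbooks use the opposite
# abbreviations (SSE = error, SSR = regression), so check the labels.
overall_mean = np.mean(data)
sst = np.sum((data - overall_mean)**2)
sse = np.sum([(np.mean(group) - overall_mean)**2 * len(group) for group in [group1, group2, group3]])
ssr = sst - sse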

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)


Total Sum of Squares (SST): 358.6666666666667
Explained Sum of Squares (SSE): 298.66666666666663
Residual Sum of Squares (SSR): 60.00000000000006
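
The sums of squares above are enough to rebuild the F-statistic by hand. The following sketch reuses the same three groups and checks the manual result against `stats.f_oneway`. The variable names (`ss_between`, `ss_within`, and so on) are introduced here for illustration.

```python
import numpy as np
from scipy import stats

# Same three groups as in the cell above
groups = [
    np.array([12, 14, 16, 18]),
    np.array([20, 22, 24, 26]),
    np.array([8, 10, 12, 14]),
]
data = np.concatenate(groups)
grand_mean = data.mean()

# Between-group (explained) and within-group (residual) sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(data) - len(groups)

# F is the ratio of the two mean squares; the p-value is the upper tail of F
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

f_scipy, p_scipy = stats.f_oneway(*groups)
print("F (manual):", f_manual, "| F (scipy):", f_scipy)
print("p (manual):", p_manual, "| p (scipy):", p_scipy)
```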


QTS.5

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a sample dataset
data = {'A': [1, 1, 2, 2, 3, 3],
        'B': ['X', 'Y', 'X', 'Y', 'X', 'Y'],
        'Value': [10, 12, 14, 16, 18, 20]}

df = pd.DataFrame(data)

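# Fit the two-way ANOVA model.
# Note: 'A' holds numbers, so the formula treats it as a continuous predictor
# (1 degree of freedom). Wrapping it as C(A) would treat it as a categorical
# factor, but with one observation per cell that model leaves no residual
# degrees of freedom.
model = ols('Value ~ A + B + A:B', data=df).fit()

# Get the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)

# Mean square (sum of squares / degrees of freedom) for each term.
# These are variance estimates, not p-values or effect sizes.
main_effect_A = anova_table['sum_sq']['A'] / anova_table['df']['A']
main_effect_B = anova_table['sum_sq']['B'] / anova_table['df']['B']
interaction_effect = anova_table['sum_sq']['A:B'] / anova_table['df']['A:B']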

print("Main Effect A:", main_effect_A)
print("Main Effect B:", main_effect_B)
print("Interaction Effect:", interaction_effect)

Main Effect A: 63.999999999999915
Main Effect B: 5.999999999999989
Interaction Effect: 2.1742978700154158e-29


QTS.6

In a one-way ANOVA, the F-statistic is used to test whether there are significant
differences among the means of the groups. The associated p-value helps determine the 
statistical significance of the F-statistic. Here's how to interpret the results:

1. **F-Statistic (5.23):**
   - The F-statistic is the ratio of the variance between groups to the 
    variance within groups. A value well above 1, as here, suggests that 
    at least one group mean differs from the others.

2. **P-Value (0.02):**
   - The p-value is the probability of observing an F-statistic at least as 
    extreme as the one calculated, assuming the null hypothesis is true 
    (i.e., assuming there are no true differences between group means).

3. **Conclusion:**
   - With a p-value of 0.02, which is less than the commonly chosen 
    significance level of 0.05, there is evidence to reject the null 
    hypothesis. This suggests that there are significant differences between at least two of the groups.

4. **Interpretation:**
   - The differences between group means are statistically significant at 
    the 0.05 level, indicating that there are likely real differences in the 
    population means. However, the p-value alone doesn't provide information 
    about which specific groups differ.

In summary, you can conclude that there are significant differences between 
groups based on the given F-statistic and p-value.
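
The link between an F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are assumptions for illustration only, and the resulting p-value need not equal the 0.02 quoted above.

```python
from scipy import stats

f_statistic = 5.23
df_between = 2   # assumed: 3 groups, so k - 1 = 2
df_within = 27   # assumed: 30 observations in total, so N - k = 27
alpha = 0.05

# p-value: probability of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value: the F that would be exactly significant at alpha
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print("p-value:", p_value)
print("critical F at alpha = 0.05:", f_critical)
print("reject null hypothesis:", p_value < alpha)
```

Comparing the p-value with alpha and comparing the F-statistic with the critical value are equivalent decisions.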

QTS.7

Handling missing data in repeated measures ANOVA is crucial for obtaining unbiased
and valid results. There are several methods to deal with missing data, each with 
its potential consequences:

1. **Listwise Deletion:**
   - **Method:** Exclude cases with missing data on any variable.
   - **Consequences:** Reduces sample size, potentially leading to loss of statistical
power and biased results if missingness is related to the outcome.

2. **Pairwise Deletion:**
   - **Method:** Analyze all available data for each pairwise comparison.
   - **Consequences:** Preserves more data than listwise deletion but can introduce 
bias if missingness is not completely at random. Results can be based on different 
subsets of data for different comparisons.

3. **Imputation:**
   - **Method:** Replace missing values with estimated values based on observed data.
   - **Consequences:** Preserves sample size, but the choice of imputation method can 
introduce bias if the missing data mechanism is not accurately represented. Common 
imputation methods include mean imputation, regression imputation, or multiple imputation.

4. **Missing Completely at Random (MCAR) vs. Missing at Random (MAR) vs. Missing Not at Random (MNAR):**
   - **Consequences:** The appropriate handling method depends on why data are missing. 
    If data are missing completely at random (MCAR), complete case analysis or imputation 
    methods can be suitable. If missingness depends only on observed variables (MAR), 
    methods such as imputation conditioned on the observed data may be appropriate. If 
    missingness depends on the unobserved values themselves (MNAR), imputation methods might not fully address bias.

Choosing the appropriate method depends on the nature and pattern of missing data,
as well as the assumptions about the missing data mechanism. It's essential to 
report how missing data are handled and, if possible, perform sensitivity analyses
to assess the impact of different approaches on the results.
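
The difference between listwise deletion and mean imputation can be sketched with pandas. The wide-format table below (one row per subject, one column per condition) and all its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data: 5 subjects, 3 conditions, 2 missing values
wide = pd.DataFrame(
    {
        "cond_1": [10.0, 12.0, np.nan, 11.0, 13.0],
        "cond_2": [11.0, np.nan, 14.0, 12.0, 15.0],
        "cond_3": [13.0, 14.0, 15.0, 13.0, 16.0],
    },
    index=pd.Index(["s1", "s2", "s3", "s4", "s5"], name="subject"),
)

# Listwise deletion: keep only subjects observed under every condition
listwise = wide.dropna()

# Mean imputation: fill each gap with the mean of its condition (column)
mean_imputed = wide.fillna(wide.mean())

print("Subjects before:", len(wide))
print("Subjects after listwise deletion:", len(listwise))
print("Variance of cond_1, observed values only:", wide["cond_1"].var())
print("Variance of cond_1, after mean imputation:", mean_imputed["cond_1"].var())
```

Listwise deletion drops two of the five subjects here, while mean imputation keeps all five but shrinks the variance of the imputed column. This shows the trade-off between lost power and distorted variability described above.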

QTS.8

Common post-hoc tests used after ANOVA are employed to determine which specific 
group differences are significant when the overall ANOVA indicates that there 
are differences between groups. Here are some common post-hoc tests and when you might use each:

1. **Tukey's Honestly Significant Difference (HSD):**
   - **Use:** Compares every pair of group means while controlling the familywise error rate. It is designed for equal sample sizes; the Tukey-Kramer variant handles unequal ones.
   - **Example:** If you have conducted an ANOVA with three or more groups and find a
significant difference, Tukey's HSD can help identify which specific pairs of groups differ significantly.

2. **Bonferroni Correction:**
   - **Use:** Controls the familywise error rate by adjusting the significance level for individual tests.
   - **Example:** If you conduct multiple pairwise comparisons after ANOVA, 
the Bonferroni correction would be appropriate to avoid inflated Type I error rates.

3. **Sidak Correction:**
   - **Use:** Similar to Bonferroni, but often less conservative.
   - **Example:** When you have multiple pairwise comparisons, and you want to 
adjust the significance level to maintain an overall alpha level, Sidak correction
can be a less stringent alternative to Bonferroni.

4. **Duncan's Multiple Range Test:**
   - **Use:** A stepwise procedure that sorts group means into homogeneous subsets. It has more power than Tukey's HSD but controls the familywise error rate less strictly.
   - **Example:** After conducting ANOVA on several groups, Duncan's test can show which 
groups cluster together and which differ, when some extra risk of false positives is acceptable.

5. **Scheffé's Method:**
   - **Use:** Suitable for unequal sample sizes and for testing any contrast among 
    the group means, not only pairwise differences. It controls the familywise error rate strictly and is therefore conservative.
   - **Example:** If you want to compare combinations of groups (for example, the average 
of two groups against a third) and accept lower power in exchange for strict control of Type I errors, Scheffé's method might be appropriate.

**Example Situation:**
Suppose you conducted an ANOVA comparing the mean scores of three teaching methods.
The ANOVA indicates a significant difference. To pinpoint which specific teaching 
methods differ from each other, you would use a post-hoc test. For example, 
Tukey's HSD or Bonferroni correction could be employed to perform pairwise 
comparisons and identify the significant differences between individual teaching methods.
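
The Bonferroni and Sidak corrections from the list above reduce to simple arithmetic on the significance level. The following is a minimal sketch for the three-teaching-methods situation, assuming simulated scores; the group names, means, spread, and seed are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical exam scores for three teaching methods (25 students each)
rng = np.random.default_rng(1)
methods = {
    "method_1": rng.normal(70, 8, 25),
    "method_2": rng.normal(75, 8, 25),
    "method_3": rng.normal(82, 8, 25),
}

alpha = 0.05
pairs = list(combinations(methods, 2))
m = len(pairs)  # number of pairwise comparisons

# Per-comparison significance levels that keep the familywise rate at alpha
alpha_bonferroni = alpha / m
alpha_sidak = 1 - (1 - alpha) ** (1 / m)

print("Bonferroni per-test alpha:", alpha_bonferroni)
print("Sidak per-test alpha:", alpha_sidak)

# Pairwise two-sample t-tests judged against the corrected levels
for a, b in pairs:
    t_stat, p = stats.ttest_ind(methods[a], methods[b])
    print(f"{a} vs {b}: p={p:.4f}, "
          f"reject (Bonferroni)={p < alpha_bonferroni}, "
          f"reject (Sidak)={p < alpha_sidak}")
```

The Sidak threshold is always slightly larger than the Bonferroni one, which is why it is described as less conservative.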

QTS.9

In [13]:
import scipy.stats as stats
import numpy as np

# Generate example data for three diets
# Note: no random seed is set here, so the numbers in the output vary between runs
diet_A = np.random.normal(loc=5, scale=2, size=50)
diet_B = np.random.normal(loc=7, scale=2, size=50)
diet_C = np.random.normal(loc=4, scale=2, size=50)

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Report results
print("F-Statistic:", f_statistic)
print("P-Value:", p_value)

# Interpretation

if p_value < 0.05:
    print("The one-way ANOVA indicates a significant difference between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")

F-Statistic: 13.865897121724863
P-Value: 3.0442952520825844e-06
The one-way ANOVA indicates a significant difference between the mean weight loss of the three diets.


QTS.10

In [9]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a sample dataset
np.random.seed(42)
data = {
    'Software': np.random.choice(['A', 'B', 'C'], size=90),
    'Experience': np.random.choice(['Novice', 'Experienced'], size=90),
    'Time': np.random.normal(loc=20, scale=5, size=90)  # Adjust loc and scale as needed
}

df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Time ~ Software * Experience', data=df).fit()

# Get the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

# Mean square (sum of squares / degrees of freedom) for each term.
# These are variance estimates, not p-values.
main_effect_software = anova_table['sum_sq']['Software'] / anova_table['df']['Software']
main_effect_experience = anova_table['sum_sq']['Experience'] / anova_table['df']['Experience']
interaction_effect = anova_table['sum_sq']['Software:Experience'] / anova_table['df']['Software:Experience']

print("Main Effect Software:", main_effect_software)
print("Main Effect Experience:", main_effect_experience)
print("Interaction Effect:", interaction_effect)

# Interpretation: compare each term's p-value (not its mean square) with alpha
alpha = 0.05
p_software = anova_table['PR(>F)']['Software']
p_experience = anova_table['PR(>F)']['Experience']
p_interaction = anova_table['PR(>F)']['Software:Experience']

# Check for main effects
if p_software < alpha:
    print("There is a significant main effect of Software.")
else:
    print("There is no significant main effect of Software.")

if p_experience < alpha:
    print("There is a significant main effect of Experience.")
else:
    print("There is no significant main effect of Experience.")

# Check for interaction effect
if p_interaction < alpha:
    print("There is a significant interaction effect between Software and Experience.")
else:
    print("There is no significant interaction effect between Software and Experience.")


                          sum_sq    df         F    PR(>F)
Software                8.337633   2.0  0.193670  0.824297
Experience             31.851905   1.0  1.479736  0.227223
Software:Experience    52.479686   2.0  1.219018  0.300694
Residual             1808.132913  84.0       NaN       NaN
Main Effect Software: 4.168816505478778
Main Effect Experience: 31.851905396812533
Interaction Effect: 26.23984303358177
There is no significant main effect of Software.
There is no significant main effect of Experience.
There is no significant interaction effect between Software and Experience.


QTS.11

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

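# Simulated scores (random integers from 1 to 100) for 50 students per method.
# No seed is set, so the output below varies between runs.
data = np.random.randint(1, 101, (50, 2))
new_column_names = ['traditional teaching method', 'new teaching method']
df = pd.DataFrame(data, columns=new_column_names)

alpha = 0.05

# Independent two-sample t-test (assumes equal variances by default)
t_stat, p_value = stats.ttest_ind(df['traditional teaching method'], df['new teaching method'])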

print(f"t_statistics:{t_stat}")
print(f"p_value:{p_value}")

# Interpretation
if p_value < alpha:
    print("The two-sample t-test indicates a significant difference between the two teaching methods.")
else:
    print("There is no significant difference between the two teaching methods.")

t_statistics:1.786027695784668
p_value:0.07718756438056777
There is no significant difference between the two teaching methods.


QTS.12

In [10]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Generate example data
np.random.seed(42)  # for reproducibility
data = {
    'Store_A': np.random.normal(50, 10, 30),
    'Store_B': np.random.normal(55, 12, 30),
    'Store_C': np.random.normal(48, 8, 30)
}

df = pd.DataFrame(data)

# Melt the DataFrame to long format for repeated measures ANOVA
df_melted = pd.melt(df, var_name='Store', value_name='Sales')

# Add a column for the day or time point
df_melted['Day'] = np.tile(np.arange(1, 31), len(data.keys()))

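# Fit repeated measures ANOVA model.
# AnovaRM takes (data, depvar, subject, within): here each 'Day' plays the role
# of the subject and is measured once in every store.
rm_anova_model = AnovaRM(df_melted, 'Sales', 'Day', within=['Store'])
rm_results = rm_anova_model.fit()

# Print ANOVA table
print(rm_results.anova_table)

# Perform post hoc test (Tukey HSD).
# Note: pairwise_tukeyhsd treats the stores as independent groups and ignores
# the pairing by day, so it can disagree with the repeated measures result.
posthoc = pairwise_tukeyhsd(df_melted['Sales'], df_melted['Store'], alpha=0.05)
print(posthoc)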


        F Value  Num DF  Den DF    Pr > F
Store  3.450612     2.0    58.0  0.038378
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1  group2 meandiff p-adj   lower   upper  reject
------------------------------------------------------
Store_A Store_B   5.4275 0.0733 -0.4005 11.2555  False
Store_A Store_C  -0.0155    1.0 -5.8434  5.8125  False
Store_B Store_C   -5.443 0.0723 -11.271   0.385  False
------------------------------------------------------
