In [None]:
#Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
# the validity of the results.

In [None]:
'''Assumptions of ANOVA
Analysis of Variance (ANOVA) is a statistical technique used to compare the means of multiple groups.

To ensure the validity of the results, several assumptions must be met:

Normality: The dependent variable should be approximately normally distributed within each group (equivalently, the model residuals should be approximately normal).
This means that the data points in each group should roughly follow a bell-shaped curve.
Homogeneity of Variance: The variance of the dependent variable should be equal across all groups. 
This assumption is also known as homoscedasticity.
Independence: The observations should be independent, both within each group and between groups.
This means that the value of one observation should not be influenced by the value of another observation.

Violations and Their Impact
If any of these assumptions are violated, it can affect the validity of the ANOVA results.

Here are some examples of violations and their potential consequences:

Violation of Normality
Skewness or Kurtosis: If the data distribution is skewed or has excessive kurtosis, it can violate the normality assumption.
This can lead to inaccurate p-values and biased results.
Example: A skewed distribution might occur if there are a few extreme outliers in the data.

Violation of Homogeneity of Variance
Heteroscedasticity: If the variance of the dependent variable is unequal across groups, 
it can violate the homogeneity of variance assumption. This can affect the accuracy of the F-test and p-values.
Example: A violation of homogeneity of variance might occur if one group has a much larger spread of data points than the others.

Violation of Independence
Dependent Observations: If the observations within a group are not independent, it can violate the independence assumption.
This can lead to inflated or deflated p-values.
Example: A violation of independence might occur if the same individuals are measured multiple times or if data points are related
to each other in a systematic way.

To address these violations, you may need to:

Transform the data: If the data is skewed, you might try transforming it using a logarithmic or square root transformation.
Use a non-parametric test: If the normality assumption is severely violated, you could consider using a non-parametric alternative 
to ANOVA, such as the Kruskal-Wallis test.
Use a robust ANOVA method: There are robust ANOVA methods that are less sensitive to violations of assumptions.
Check for outliers: If there are extreme outliers, investigate them first. Remove them only if they are data errors; otherwise consider a robust statistical method.'''
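The checks described above can be sketched with SciPy. This is a minimal illustration on made-up data: the three groups, their sizes, and the seed are assumptions, and the tests shown (Shapiro-Wilk, Levene, Kruskal-Wallis) are common choices rather than the only options.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups, the third deliberately more spread out.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10, scale=1, size=30)
group_b = rng.normal(loc=10, scale=1, size=30)
group_c = rng.normal(loc=10, scale=4, size=30)
groups = (group_a, group_b, group_c)

# Normality within each group (Shapiro-Wilk): a small p-value suggests non-normality.
shapiro_p = [stats.shapiro(g)[1] for g in groups]

# Homogeneity of variance (Levene): a small p-value suggests unequal variances.
levene_stat, levene_p = stats.levene(*groups)

# Non-parametric alternative to one-way ANOVA (Kruskal-Wallis).
kw_stat, kw_p = stats.kruskal(*groups)

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
print("Kruskal-Wallis p-value:", kw_p)
```

Independence cannot be tested this way; it has to be judged from how the data were collected.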

In [None]:
#Q2. What are the three types of ANOVA, and in what situations would each be used?

In [None]:
'''Three Types of ANOVA
ANOVA (Analysis of Variance) is a statistical technique used to compare the means of multiple groups. 

There are three main types of ANOVA:

One-Way ANOVA:

Purpose: Compares the means of three or more independent groups.
Situation: Used when you have a single independent variable (factor) with multiple levels and want to determine if there are significant differences in the means of the dependent variable across those levels.
Example: Comparing the mean test scores of students from three different schools.

Two-Way ANOVA:

Purpose: Compares the means of multiple groups based on two independent variables (factors).
Situation: Used when you have two independent variables and want to determine if there are significant differences in the means of the dependent variable due to each factor, as well as any interaction between the factors.
Example: Comparing the mean plant growth rates based on two factors: type of fertilizer and amount of sunlight.

Repeated Measures ANOVA:

Purpose: Compares the means of the same group of participants measured multiple times.
Situation: Used when you have a single group of participants and want to determine if there are significant differences in the means of the dependent variable over time or across different conditions.
Example: Comparing the mean blood pressure of the same individuals before, during, and after a stress-inducing task.'''

In [None]:
#Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
'''Partitioning of Variance in ANOVA
Partitioning of variance in ANOVA is the process of dividing the total variability in the dependent variable into two components:

Between-group variance: The variation in the dependent variable that is due to differences between the means of the different groups.
Within-group variance: The variation in the dependent variable that is due to differences within each group.

Why is understanding partitioning of variance important?
Hypothesis Testing: The F-statistic in ANOVA is the ratio of the between-group mean square to the within-group mean square, where each mean square is a sum of squares divided by its degrees of freedom. If this ratio is large relative to the F distribution expected under the null hypothesis, it suggests that at least one group mean differs from the others.
Effect Size: The proportion of the total variability in the dependent variable that is explained by the independent variable(s) is a measure of effect size (eta squared). It is calculated by dividing the between-group sum of squares by the total sum of squares.
Understanding Variation: Partitioning of variance helps you to understand the sources of variation in your data. This can be useful for identifying factors that are important in explaining the dependent variable.
Example
Imagine a study comparing the test scores of students from three different schools.

The total variance in test scores can be partitioned into:

Between-group variance: The differences in average test scores between the three schools.
Within-group variance: The individual differences in test scores within each school.
If the between-group variance is large relative to the within-group variance,
it suggests that the schools have different average test scores.
The ANOVA itself does not explain why the schools differ, but it indicates that school-level factors are worth investigating.'''
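The partition can be written out explicitly. In the notation assumed here, y_ij is observation j in group i, n_i is the size of group i, k is the number of groups, and N is the total number of observations. The small worked case uses three made-up groups {1, 2, 3}, {4, 5, 6}, and {7, 8, 9}.

```latex
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y})^2}_{SS_{\text{total}}}
  = \underbrace{\sum_{i=1}^{k} n_i (\bar{y}_i - \bar{y})^2}_{SS_{\text{between}}}
  + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2}_{SS_{\text{within}}}

F = \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)},
\qquad
\eta^2 = \frac{SS_{\text{between}}}{SS_{\text{total}}}

% Worked case: group means 2, 5, 8 and grand mean 5
SS_{\text{total}} = 60, \quad SS_{\text{between}} = 3(9 + 0 + 9) = 54, \quad SS_{\text{within}} = 6

F = \frac{54 / 2}{6 / 6} = 27, \qquad \eta^2 = \frac{54}{60} = 0.9
```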

In [None]:
#Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
# sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
'''import numpy as np

def calculate_anova_sums(data):
    """Calculates SST, SSE, and SSR for a one-way ANOVA.

    Args:
        data: A 2D NumPy array where each row represents a group and each column represents a data point.
            All groups must have the same number of observations.

    Returns:
        A tuple containing SST, SSE, and SSR.
    """

    # Calculate the overall mean
    grand_mean = np.mean(data)

    # Calculate the sum of squares total (SST)
    sst = np.sum((data - grand_mean)**2)

    # Calculate the sum of squares between groups (SSE)
    n_groups, n_obs = data.shape
    group_means = np.mean(data, axis=1)
    sse = np.sum(n_obs * (group_means - grand_mean)**2)

    # Calculate the sum of squares within groups (SSR)
    ssr = sst - sse

    return sst, sse, ssr

# Example usage
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
sst, sse, ssr = calculate_anova_sums(data)

print("SST:", sst)
print("SSE:", sse)
print("SSR:", ssr)'''
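As a sanity check on these sums of squares, the F-statistic built from them can be compared with `scipy.stats.f_oneway`. The sketch below recomputes the sums inline so that it runs on its own, and it assumes the same balanced example data.

```python
import numpy as np
from scipy import stats

data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
n_groups, n_obs = data.shape

grand_mean = data.mean()
sst = np.sum((data - grand_mean) ** 2)                        # total
sse = np.sum(n_obs * (data.mean(axis=1) - grand_mean) ** 2)   # explained (between groups)
ssr = sst - sse                                               # residual (within groups)

# F = (SSE / (k - 1)) / (SSR / (N - k))
f_manual = (sse / (n_groups - 1)) / (ssr / (data.size - n_groups))

# SciPy takes one array per group, so the rows are unpacked.
f_scipy, p_scipy = stats.f_oneway(*data)

print("Manual F:", f_manual)
print("SciPy F:", f_scipy, "p-value:", p_scipy)
```

For this data SST = 60, SSE = 54, and SSR = 6, so both routes give F = 27.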

In [None]:
#Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
'''import numpy as np
import scipy.stats as stats

def calculate_two_way_anova(data, factor1, factor2):
    """Calculates main effects and interaction effects for a balanced two-way ANOVA.

    Args:
        data: A 2D NumPy array where each row represents a combination of factor levels (a cell)
            and each column represents a data point (replicate). Every combination of levels must
            appear exactly once, and each cell needs at least two replicates.
        factor1: A 1D NumPy array giving the level of the first factor for each row of data.
        factor2: A 1D NumPy array giving the level of the second factor for each row of data.

    Returns:
        A tuple containing the F-statistic, p-value, and degrees of freedom for each effect.
    """
    data = np.asarray(data, dtype=float)
    factor1 = np.asarray(factor1)
    factor2 = np.asarray(factor2)
    n_rep = data.shape[1]
    levels1 = np.unique(factor1)
    levels2 = np.unique(factor2)
    a, b = len(levels1), len(levels2)

    # Calculate the overall mean
    grand_mean = np.mean(data)

    # Calculate the sum of squares total (SST)
    sst = np.sum((data - grand_mean)**2)

    # Calculate the sum of squares for factor 1 (each level pools b cells of n_rep observations)
    means_factor1 = np.array([data[factor1 == level].mean() for level in levels1])
    ss_factor1 = b * n_rep * np.sum((means_factor1 - grand_mean)**2)

    # Calculate the sum of squares for factor 2 (each level pools a cells of n_rep observations)
    means_factor2 = np.array([data[factor2 == level].mean() for level in levels2])
    ss_factor2 = a * n_rep * np.sum((means_factor2 - grand_mean)**2)

    # Calculate the sum of squares for the interaction:
    # variation between cell means that the two main effects do not explain
    cell_means = np.mean(data, axis=1)
    ss_cells = n_rep * np.sum((cell_means - grand_mean)**2)
    ss_interaction = ss_cells - ss_factor1 - ss_factor2

    # Calculate the sum of squares within cells (SS_residual)
    ss_residual = sst - ss_cells

    # Calculate the degrees of freedom
    df_factor1 = a - 1
    df_factor2 = b - 1
    df_interaction = df_factor1 * df_factor2
    df_residual = a * b * (n_rep - 1)

    # Calculate the mean squares
    ms_factor1 = ss_factor1 / df_factor1
    ms_factor2 = ss_factor2 / df_factor2
    ms_interaction = ss_interaction / df_interaction
    ms_residual = ss_residual / df_residual

    # Calculate the F-statistics
    f_factor1 = ms_factor1 / ms_residual
    f_factor2 = ms_factor2 / ms_residual
    f_interaction = ms_interaction / ms_residual

    # Calculate the p-values
    p_factor1 = stats.f.sf(f_factor1, df_factor1, df_residual)
    p_factor2 = stats.f.sf(f_factor2, df_factor2, df_residual)
    p_interaction = stats.f.sf(f_interaction, df_interaction, df_residual)

    return (f_factor1, p_factor1, df_factor1), (f_factor2, p_factor2, df_factor2), (f_interaction, p_interaction, df_interaction)

# Example usage: a 2 x 2 design with three replicates per cell
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [13, 14, 15]])
factor1 = np.array([1, 1, 2, 2])
factor2 = np.array([1, 2, 1, 2])

main_effect1, main_effect2, interaction = calculate_two_way_anova(data, factor1, factor2)

print("Main effect 1:", main_effect1)
print("Main effect 2:", main_effect2)
print("Interaction:", interaction)'''

In [None]:
#Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
# What can you conclude about the differences between the groups, and how would you interpret these
# results?

In [None]:
'''Interpreting ANOVA Results: F-Statistic and p-value
F-Statistic: This is a ratio that compares the variance between groups to the variance within groups. A higher F-statistic suggests that the differences between the group means are larger than the variation within the groups.

p-value: This represents the probability of observing an F-statistic as extreme or more extreme than the one obtained, assuming that there are no real differences between the group means.

In this case:

F-statistic = 5.23: The between-group mean square is about 5.23 times the within-group mean square. On its own, F does not say how large the differences are in practical terms; that requires an effect size such as eta squared.
p-value = 0.02: This is less than the typical alpha level of 0.05.
Conclusion:

Based on these results, you can reject the null hypothesis that all group means are equal and conclude that at least one group mean differs from the others. The p-value of 0.02 means that, if all group means were truly equal, an F-statistic this large or larger would be expected only about 2% of the time.

However, the ANOVA does not tell you which specific groups are significantly different from each other. To identify the specific differences, you would need to conduct post-hoc tests, such as Tukey's HSD or pairwise comparisons with a Bonferroni correction.

In summary:

Significant differences exist: The ANOVA results indicate that there are significant differences between the groups.
Post-hoc tests needed: Further analysis is required to pinpoint which specific groups differ significantly.'''
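The link between the F-statistic and its p-value can be reproduced with SciPy once the degrees of freedom are known. The question does not give them, so the values below (3 groups, 15 observations in total) are purely an assumption chosen for illustration.

```python
from scipy import stats

f_statistic = 5.23
df_between = 2    # assumption: 3 groups, so k - 1 = 2
df_within = 12    # assumption: 15 observations, so N - k = 12
alpha = 0.05

# Probability of an F at least this large if all group means are equal.
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Smallest F that would be significant at the chosen alpha.
f_critical = stats.f.ppf(1 - alpha, df_between, df_within)

print("p-value:", round(p_value, 4))        # about 0.023 for these degrees of freedom
print("Critical F:", round(f_critical, 3))  # about 3.885
print("Reject H0:", f_statistic > f_critical)
```

With other degrees of freedom the same F-statistic gives a different p-value, which is why both are reported together.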

In [None]:
#Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
# consequences of using different methods to handle missing data?

In [None]:
'''Handling Missing Data in Repeated Measures ANOVA
Missing data in repeated measures ANOVA can pose a challenge to the analysis. 

Several methods can be used to address this issue, each with its own advantages and potential consequences:

1. Listwise Deletion:
Method: Remove any participants with missing data for any time point.
Advantages: Simple to implement.
Disadvantages: Can lead to a significant reduction in sample size, especially if many participants have missing data. This can reduce the statistical power of the analysis.
2. Pairwise Deletion:
Method: Exclude participants only for the specific time points where they have missing data.
Advantages: Retains more participants than listwise deletion.
Disadvantages: Can lead to unequal sample sizes for different time points, which can affect the analysis.
3. Mean Imputation:
Method: Replace missing values with the mean of the participant's observed values.
Advantages: Easy to implement.
Disadvantages: Shrinks the variability of the data and can introduce bias, especially if the values are not missing completely at random.
4. Last Observation Carried Forward (LOCF):
Method: Replace missing values with the last observed value for that participant.
Advantages: Simple to implement.
Disadvantages: Can introduce bias if the missing values represent a systematic pattern.
5. Multiple Imputation:
Method: Creates multiple complete datasets by imputing missing values using statistical models.
Advantages: Can provide more accurate estimates than single imputation methods.
Disadvantages: More complex to implement and requires specialized software.
Potential Consequences of Different Methods:
Bias: Single-imputation methods such as mean imputation and LOCF can introduce bias, particularly if the values are not missing completely at random or if scores change over time.
Loss of Power: Listwise deletion can reduce sample size and statistical power.
Increased Type I Error Rate: Some imputation methods can increase the false positive rate if not used appropriately.
Increased Type II Error Rate: Imputation methods can reduce the ability to detect true differences if the imputation is not accurate.
Choosing the best method depends on several factors:

Nature of the missing data: Are the values missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)?
Amount of missing data: How many participants and time points have missing data?
Research question: What is the primary goal of the analysis?'''
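Three of the simpler methods above can be sketched in pandas. The wide-format table, participant labels, and values here are made up for illustration; multiple imputation is left out because it needs a dedicated modelling step.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point.
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [5.5, np.nan, 7.5, 8.5],
        "t3": [6.0, 6.5, np.nan, 9.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Listwise deletion: drop every participant with any missing time point.
listwise = wide.dropna()

# LOCF: carry each participant's last observed value forward along the time axis.
locf = wide.ffill(axis=1)

# Mean imputation: fill each gap with that participant's own observed mean.
person_mean = wide.apply(lambda row: row.fillna(row.mean()), axis=1)

print(listwise)
print(locf)
print(person_mean)
```

Here listwise deletion keeps only p1 and p4, which halves the sample, while the two imputation methods keep all four participants but fill the gaps with different values.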

In [None]:
#Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
# an example of a situation where a post-hoc test might be necessary.

In [None]:
'''Common Post-Hoc Tests After ANOVA
Post-hoc tests are used to identify which specific groups differ significantly from each other after a significant ANOVA result. 

Here are some common post-hoc tests:

1. Tukey's Honestly Significant Difference (HSD)
When to use: When you have equal sample sizes across groups and want to compare all possible pairs of means.
Advantages: Controls the family-wise error rate, meaning it helps to prevent false positives.
Disadvantages: May be overly conservative if sample sizes are unequal.
2. Bonferroni Correction
When to use: When you have a small number of planned comparisons and want to control the family-wise error rate.
Advantages: Very conservative, which can be helpful in preventing false positives.
Disadvantages: Can be overly conservative, especially with a large number of comparisons, leading to a loss of power.
3. Tukey-Kramer Test
When to use: When you have unequal sample sizes across groups.
Advantages: Controls the family-wise error rate and can be used with unequal sample sizes.
Disadvantages: Can be somewhat conservative when group sizes are very unequal. With equal sample sizes it gives the same result as Tukey's HSD.
4. Fisher's Least Significant Difference (LSD)
When to use: When you have only a few groups (ideally three) and have already found a significant ANOVA result.
Advantages: Simple to calculate and can be more powerful than other post-hoc tests.
Disadvantages: Does not control the family-wise error rate, which can increase the risk of false positives.

Example:
Imagine a study comparing the test scores of students from three different schools (School A, School B, and School C).
A one-way ANOVA reveals a significant difference between the mean test scores of the three schools.

To determine which specific schools have significantly different test scores, a post-hoc test would be necessary. 
If the sample sizes are equal, Tukey's HSD could be used. If the sample sizes are unequal, the Tukey-Kramer test would be more appropriate.'''
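As one concrete post-hoc procedure, the sketch below runs Bonferroni-adjusted pairwise t-tests for the three-school example. It illustrates the Bonferroni approach rather than Tukey's HSD, and the scores, group sizes, and seed are made up.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical test scores for the three schools.
rng = np.random.default_rng(1)
scores = {
    "School A": rng.normal(70, 8, 25),
    "School B": rng.normal(75, 8, 25),
    "School C": rng.normal(85, 8, 25),
}

pairs = list(combinations(scores, 2))
adjusted_p = {}
for first, second in pairs:
    t_stat, p_raw = stats.ttest_ind(scores[first], scores[second])
    # Bonferroni: multiply each raw p-value by the number of comparisons, capped at 1.
    adjusted_p[(first, second)] = min(p_raw * len(pairs), 1.0)

for pair, p_adj in adjusted_p.items():
    print(pair, "adjusted p-value:", p_adj)
```

A pair is declared different when its adjusted p-value is below the chosen alpha level.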

In [None]:
'''Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.'''

In [None]:
'''import numpy as np
from scipy import stats

# Simulated weight loss data
diets = np.random.choice(['A', 'B', 'C'], size=50)
weight_loss = np.random.normal(loc=5, scale=2, size=50)  # Mean 5 kg, SD 2 kg

# One-way ANOVA
f_statistic, p_value = stats.f_oneway(weight_loss[diets == 'A'], 
                                     weight_loss[diets == 'B'], 
                                     weight_loss[diets == 'C'])

# Print results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

One run of this code produced the following output (the data are random and no seed is set, so the values change from run to run):

F-statistic: 1.5268866888701984
p-value: 0.22778982525041908
Interpretation:

The F-statistic is 1.5269 and the p-value is 0.2278. Since the p-value (0.2278) is greater than the typical alpha level of 0.05, we fail to reject the null hypothesis.
The null hypothesis in this case is that the mean weight loss is the same for all three diets.

In other words, based on this sample data, we do not have enough evidence to conclude that there are statistically significant
differences in weight loss between the three diets. That is the expected outcome here, because the simulated weight loss values are drawn from the same distribution regardless of diet.
With real data, a non-significant result could also mean that true differences exist but the study was not sensitive enough to detect them (e.g., due to small sample size or high variability in weight loss).'''
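For contrast, the sketch below simulates the same study with a real difference built in and a fixed seed so that it is reproducible. The true means (3, 5, and 7 kg), the group sizes (17, 17, 16), and the seed are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Randomly assign 50 participants to diets with near-equal group sizes.
diets = rng.permutation(np.array(["A"] * 17 + ["B"] * 17 + ["C"] * 16))

# Assumed true mean weight loss (kg) per diet; SD is 2 kg for every diet.
true_means = {"A": 3.0, "B": 5.0, "C": 7.0}
weight_loss = np.array([rng.normal(true_means[d], 2.0) for d in diets])

groups = [weight_loss[diets == d] for d in ("A", "B", "C")]
f_statistic, p_value = stats.f_oneway(*groups)

print("Group sizes:", [len(g) for g in groups])
print("F-statistic:", f_statistic)
print("p-value:", p_value)
```

Because the diets now differ by a full standard deviation between neighbours, the test is expected to reject the null hypothesis in most runs.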

In [None]:
'''Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.'''

In [None]:
'''import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Simulated data: 10 employees per program, alternating experience levels (5 per cell)
program = np.repeat(['A', 'B', 'C'], repeats=10)
experience = np.tile(['Novice', 'Experienced'], 15)
time_to_complete = np.random.normal(loc=30, scale=5, size=30)

# Create a DataFrame
data = pd.DataFrame({'Program': program, 'Experience': experience, 'Time': time_to_complete})

# Fit the ANOVA model
model = ols('Time ~ Program * Experience', data=data).fit()

# Print the ANOVA table
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Interpretation:

The output of the ANOVA table will provide the F-statistics and p-values for the main effects (Program and Experience) and the interaction effect.

Main effects: If the p-value for a main effect is significant (e.g., less than 0.05), it indicates that there is a significant difference in the average time to complete the task between the different programs or experience levels.
Interaction effect: If the p-value for the interaction effect is significant, it indicates that the effect of one factor (e.g., program) depends on the level of the other factor (e.g., experience).
Example:

If the p-value for the "Program" main effect is significant, it means that there is a significant difference in the average time to
complete the task between at least two of the three programs. If the p-value for the "Experience" main effect is significant, it means that there
is a significant difference in the average time to complete the task between novice and experienced employees. If the p-value for
the interaction (the row labeled "Program:Experience" in the table) is significant, it means that the effect of the program on the time to complete the task depends on the employee's experience level.
'''
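When reading an interaction, it helps to look at the cell means alongside the ANOVA table. The sketch below rebuilds the same simulated layout with a fixed seed, which is an assumption added for reproducibility, and tabulates the mean time for each program and experience level.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Program": np.repeat(["A", "B", "C"], 10),
    "Experience": np.tile(["Novice", "Experienced"], 15),
    "Time": rng.normal(loc=30, scale=5, size=30),
})

# Mean completion time for each Program x Experience cell.
cell_means = data.groupby(["Program", "Experience"])["Time"].mean().unstack()

# Number of employees in each cell (the design is balanced: 5 per cell).
cell_counts = data.groupby(["Program", "Experience"]).size().unstack()

print(cell_means)
print(cell_counts)
```

If the gap between novice and experienced employees is similar in every row of `cell_means`, an interaction is unlikely; gaps that change size or direction across programs point to one.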

In [None]:
'''Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.'''

In [None]:
'''import numpy as np
from scipy.stats import ttest_ind

# Simulated test scores
control_group = np.random.normal(loc=75, scale=10, size=50)
experimental_group = np.random.normal(loc=80, scale=10, size=50)

# Perform the two-sample t-test
t_statistic, p_value = ttest_ind(control_group, experimental_group)

# Print the results
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# If the p-value is significant, quantify the size of the difference with an effect size (Cohen's d)
if p_value < 0.05:
    # Calculate Cohen's d using the pooled sample standard deviation (ddof=1 gives sample variances)
    mean_diff = np.mean(experimental_group) - np.mean(control_group)
    pooled_sd = np.sqrt(((len(control_group) - 1) * np.var(control_group, ddof=1) + (len(experimental_group) - 1) * np.var(experimental_group, ddof=1)) / (len(control_group) + len(experimental_group) - 2))
    cohen_d = mean_diff / pooled_sd

    print("Cohen's d:", cohen_d)
    
Interpretation:

t-statistic: This measures the difference between the means of the two groups relative to their variability. A larger t-statistic suggests a greater difference between the groups.
p-value: This represents the probability of observing a t-statistic as extreme or more extreme than the one obtained, assuming that there is no real difference between the groups.
Cohen's d: This is a standardized measure of effect size. A larger Cohen's d indicates a larger difference between the groups.
If the p-value is less than the chosen alpha level (e.g., 0.05), it suggests that there is a statistically significant difference
between the test scores of the two groups. With only two groups, no post-hoc test is needed to find out which groups differ: the sign of the mean difference shows the direction.
Instead, an effect size such as Cohen's d can be used to quantify the magnitude of the difference. A Cohen's d of 0.2 is considered a small effect, 0.5 is medium, and 0.8 is large.'''
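The default `ttest_ind` assumes equal variances in the two groups. If that assumption is doubtful, Welch's t-test is a common alternative, available through the `equal_var=False` argument. The sketch below uses made-up scores with deliberately different spreads and an assumed seed.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
control_group = rng.normal(loc=75, scale=10, size=50)
experimental_group = rng.normal(loc=80, scale=15, size=50)  # larger spread

# Student's t-test (assumes equal variances).
t_student, p_student = ttest_ind(control_group, experimental_group)

# Welch's t-test (does not assume equal variances).
t_welch, p_welch = ttest_ind(control_group, experimental_group, equal_var=False)

print("Student: t =", t_student, "p =", p_student)
print("Welch:   t =", t_welch, "p =", p_welch)
```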

In [None]:
'''Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a post-
hoc test to determine which store(s) differ significantly from each other.'''

In [None]:
'''import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Simulated data
data = {
    'Store': np.repeat(['A', 'B', 'C'], repeats=30),
    'Day': np.tile(np.arange(1, 31), 3),
    'Sales': np.random.normal(loc=1000, scale=100, size=90)
}

df = pd.DataFrame(data)

# Perform repeated measures ANOVA
model = AnovaRM(df, 'Sales', 'Day', within=['Store'])
results = model.fit()

print(results)

Interpretation:

The output will provide the F-statistic and p-value for the within-subject factor (Store).

If the p-value is significant (e.g., less than 0.05), there is evidence that mean sales differ between at least two of the three stores.
If the p-value is not significant, there is not enough evidence to conclude that mean sales differ between the stores.
Post-hoc Tests:

If the repeated measures ANOVA is significant, you can use post-hoc tests to determine which specific stores differ significantly from each other. Points to keep in mind for repeated measures designs:

Paired t-tests with a Bonferroni correction: A conservative method that controls the family-wise error rate while respecting the pairing of observations by day.
Tukey's HSD: Assumes independent groups, so it is generally not appropriate when the same days are measured for every store.
Greenhouse-Geisser correction: Not a post-hoc test. It adjusts the degrees of freedom of the ANOVA itself when sphericity, an assumption of repeated measures ANOVA, is violated.
You can use libraries like statsmodels or pingouin in Python to perform these post-hoc tests.'''
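A minimal sketch of one such follow-up uses paired t-tests from SciPy with a manual Bonferroni adjustment. The sales figures are simulated with an assumed seed, and the wide layout (one row per day, one column per store) is an assumption made to keep the pairing explicit.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical sales: one row per day, one column per store.
rng = np.random.default_rng(0)
sales = pd.DataFrame({
    "A": rng.normal(1000, 100, 30),
    "B": rng.normal(1000, 100, 30),
    "C": rng.normal(1000, 100, 30),
})

pairs = list(combinations(sales.columns, 2))
adjusted_p = {}
for store1, store2 in pairs:
    # Paired test: the same 30 days are observed for both stores.
    t_stat, p_raw = stats.ttest_rel(sales[store1], sales[store2])
    # Bonferroni: multiply by the number of comparisons, capped at 1.
    adjusted_p[(store1, store2)] = min(p_raw * len(pairs), 1.0)

for pair, p_adj in adjusted_p.items():
    print(pair, "adjusted p-value:", p_adj)
```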