### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

In [None]:
'''
Analysis of Variance (ANOVA) is a statistical technique used to compare the means of three or more groups to determine if there 
are statistically significant differences between them. To apply ANOVA correctly and interpret the results accurately, several 
assumptions need to be met. Violations of these assumptions can impact the validity of the ANOVA results. Here are the key 
assumptions for ANOVA:

Independence of Observations: The observations within each group should be independent of each other. This means that the data 
points in one group should not be influenced by or related to the data points in other groups. Violation of this assumption can 
lead to inflated Type I error rates.

Example of violation: If multiple measurements are taken from the same subject or if there is any form of clustering or 
dependence between observations in different groups.

Normality: The residuals (the differences between observed values and group means) should follow a normal distribution. This 
assumption matters most for small samples. With sufficiently large samples the F-test is robust to mild departures from 
normality because, by the Central Limit Theorem, the group means are approximately normally distributed.

Example of violation: If the residuals are skewed or have heavy tails, indicating a departure from normality.

Homogeneity of Variance (Homoscedasticity): The variance of the residuals should be approximately equal across all groups. This 
assumption is crucial because ANOVA assumes that the populations from which the samples are drawn have the same variance.

Example of violation: If some groups have much larger variances than others, it can lead to inaccurate p-values and decreased 
power.

Balanced Group Sizes: Equal group sizes are not a formal assumption of ANOVA, but a balanced design is desirable. With unequal 
sample sizes the F-test loses power and becomes much more sensitive to unequal variances.

Example of violation: If one group is much larger than the others and the group variances also differ, the p-value can be too 
small or too large, depending on whether the larger variance belongs to the smaller or the larger group.

Random Sampling: The data should be obtained through random sampling or an appropriate experimental design. This ensures that 
the sample is representative of the population from which it is drawn.

Example of violation: If the sample is not random and is, for instance, subject to selection bias, the results may not be 
generalizable to the broader population.

Independence of Errors: The errors (residuals) should be independent of each other. This assumption means that the error for one
observation should not be related to the error for any other observation.

Example of violation: If there is a temporal or spatial autocorrelation between errors in time series or spatial data, this 
assumption is violated.

It's important to note that ANOVA is relatively robust to violations of the normality assumption, especially with larger sample 
sizes. However, violations of the other assumptions can seriously impact the validity of ANOVA results. In practice, when 
assumptions are violated, non-parametric alternatives or transformations of the data may be considered, or different 
statistical tests may be more appropriate.

'''
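The normality and equal-variance assumptions above can be checked numerically. The sketch below is one possible approach, not a prescribed procedure: it uses the Shapiro-Wilk test on the pooled residuals and Levene's test across groups. The three groups are synthetic data generated only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic example data (assumption: three groups of 20 observations each)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=2.0, size=20) for m in (10.0, 12.0, 11.0)]

# Residuals: each observation minus its own group mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk test: the null hypothesis is that the residuals are normally distributed
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene's test: the null hypothesis is that all groups have equal variance
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-value:", shapiro_p)
print("Levene p-value:", levene_p)
```

A small p-value in either test signals a possible violation. Neither test can detect dependence between observations, which has to be judged from the study design.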

### Q2. What are the three types of ANOVA, and in what situations would each be used?

In [None]:
'''
Analysis of Variance (ANOVA) comes in several forms, each suited to different experimental or research designs and objectives. 
The three primary types of ANOVA are:

One-Way ANOVA:

Use Case: One-Way ANOVA is used when you have one independent variable (a factor) with more than two levels or groups and you 
want to determine if there are significant differences in the means of these groups.
Example: You want to test if there is a significant difference in the test scores of students who received three different 
types of teaching methods (A, B, and C).

Two-Way ANOVA:

Use Case: Two-Way ANOVA is used when you have two independent variables (factors) and you want to determine if there are 
interactions between these factors and if they have a significant effect on the dependent variable.
Example: You want to analyze whether both the type of diet and the type of exercise have an impact on weight loss.

Repeated Measures ANOVA:

Use Case: Repeated Measures ANOVA is used when you have a single group of participants, and each participant is measured under 
all conditions or at multiple time points.
Example: You want to analyze how the performance of a group of athletes changes over time under three different training 
regimes.

Each type of ANOVA answers a different research question and fits a different experimental design.

'''

### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
'''
The partitioning of variance in Analysis of Variance (ANOVA) refers to the process of decomposing the total variance in a 
dataset into different sources or components of variance. Understanding this concept is crucial in ANOVA because it helps 
researchers assess whether there are significant differences between groups or conditions and provides insights into the 
sources of these differences.

In a one-way ANOVA, the total sum of squares is split into two components:

Between-Group Sum of Squares (SSB): This component represents the variation between the groups or conditions being compared. 
It measures how much the group means differ from the overall mean. A large SSB relative to the within-group variation suggests 
that the group means are not all equal.

Within-Group Sum of Squares (SSW): Also known as the error sum of squares, this component represents the variation within each 
group or condition. It measures the extent to which individual data points deviate from their own group mean. A small SSW 
indicates that data points within each group are relatively consistent and close to their respective group means.

Total Sum of Squares (SST): This is the overall variability in the entire dataset, measured around the overall mean. It is the 
sum of the two components above: SST = SSB + SSW.

The partitioning of variance matters because the F-statistic is built from it. Each sum of squares is divided by its degrees of 
freedom to give a mean square, and F is the ratio of the between-group mean square to the within-group mean square:

If the between-group mean square is much larger than the within-group mean square, F is large, which is evidence against the 
null hypothesis (that all group means are equal).

If the between-group mean square is similar to or smaller than the within-group mean square, the observed differences are 
consistent with random variability, and there is no evidence to reject the null hypothesis.

By understanding the partitioning of variance, researchers can interpret ANOVA results, identify which groups or conditions 
differ significantly (if any), and draw meaningful conclusions about their research hypotheses.
'''
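Written out for a one-way design, the decomposition described above takes the following form. The symbols are introduced here for illustration: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, group means $\bar{x}_j$, and overall mean $\bar{x}$.

$$
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij} - \bar{x})^2}_{SST}
=
\underbrace{\sum_{j=1}^{k} n_j (\bar{x}_j - \bar{x})^2}_{SSB}
+
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2}_{SSW},
\qquad
F = \frac{SSB / (k - 1)}{SSW / (N - k)}
$$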

### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
'''
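In a one-way ANOVA, you can calculate the Total Sum of Squares (SST), Explained Sum of Squares (SSE), and Residual Sum of 
Squares (SSR) using Python by following these steps. The abbreviations follow the question; be aware that many textbooks swap 
them and use SSE for the error (residual) sum of squares.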

Calculate the Total Sum of Squares (SST):

SST measures the total variability in the data and is the sum of squared differences between each data point and the overall
mean.
'''

In [2]:
import numpy as np

# Example data for three groups
group1 = np.array([12, 14, 15, 16, 18])
group2 = np.array([20, 22, 24, 26, 28])
group3 = np.array([10, 11, 12, 13, 15])

# Combine data from all groups
all_data = np.concatenate([group1, group2, group3])

# Calculate overall mean
overall_mean = np.mean(all_data)

# Calculate SST
sst = np.sum((all_data - overall_mean)**2)
print(sst)

454.93333333333334


In [None]:
'''
Calculate the Explained Sum of Squares (SSE):

SSE measures the variability explained by the group means and is the sum of squared differences between each group mean and the 
overall mean, weighted by the number of data points in each group.
'''

In [3]:
# Calculate group means
group1_mean = np.mean(group1)
group2_mean = np.mean(group2)
group3_mean = np.mean(group3)

# Calculate SSE
sse = (len(group1) * (group1_mean - overall_mean)**2) + \
      (len(group2) * (group2_mean - overall_mean)**2) + \
      (len(group3) * (group3_mean - overall_mean)**2)

print(sse)

380.1333333333333


In [None]:
'''
Calculate the Residual Sum of Squares (SSR):

SSR measures the unexplained variability or variability within each group and is the sum of squared differences between each 
data point and its respective group mean.
'''

In [5]:
# Calculate SSR for each group
ssr_group1 = np.sum((group1 - group1_mean)**2)
ssr_group2 = np.sum((group2 - group2_mean)**2)
ssr_group3 = np.sum((group3 - group3_mean)**2)

# Calculate total SSR
ssr = ssr_group1 + ssr_group2 + ssr_group3 
print(ssr)

74.8
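As a check on the three values above, the sketch below recomputes them in one place, confirms that SST = SSE + SSR, and builds the F-statistic from the mean squares. The comparison against `scipy.stats.f_oneway` is an addition for verification and not part of the original steps.

```python
import numpy as np
from scipy import stats

# Same example data as in the cells above
group1 = np.array([12, 14, 15, 16, 18])
group2 = np.array([20, 22, 24, 26, 28])
group3 = np.array([10, 11, 12, 13, 15])
groups = [group1, group2, group3]

all_data = np.concatenate(groups)
overall_mean = all_data.mean()

# Total, explained (between-group) and residual (within-group) sums of squares
sst = np.sum((all_data - overall_mean) ** 2)
sse = sum(len(g) * (g.mean() - overall_mean) ** 2 for g in groups)
ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
k = len(groups)
n_total = all_data.size
f_manual = (sse / (k - 1)) / (ssr / (n_total - k))

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print("SST equals SSE + SSR:", np.isclose(sst, sse + ssr))
print("F (manual):", f_manual)
print("F (scipy): ", f_scipy)
```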


### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
'''
In a two-way ANOVA, you can calculate the main effects and interaction effects using Python by following these steps:

Prepare Your Data:
Make sure you have a dataset that includes at least three variables: two categorical independent variables (factors) and one 
dependent variable (response variable).

Perform the Two-Way ANOVA:
You can use the statsmodels library to fit the model and build the two-way ANOVA table. (scipy.stats provides a one-way ANOVA 
through f_oneway but has no two-way ANOVA function.) Here's a general outline:
'''

In [None]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Assuming 'df' is your DataFrame
formula = 'dependent_variable ~ C(factor1) + C(factor2) + C(factor1):C(factor2)'
model = ols(formula, data=df).fit()
results = sm.stats.anova_lm(model, typ=2)
print(results)

In [None]:
'''
C(factor1) and C(factor2) indicate that factor1 and factor2 are categorical variables.
C(factor1):C(factor2) represents the interaction between factor1 and factor2.

Extract Main Effects and Interaction Effects:
After performing the ANOVA, you can extract the F-statistics for the main effects and the interaction from the results table:
'''

In [None]:
main_effect_factor1 = results.loc['C(factor1)', 'F']
main_effect_factor2 = results.loc['C(factor2)', 'F']
interaction_effect = results.loc['C(factor1):C(factor2)', 'F']

In [None]:
'''
main_effect_factor1 and main_effect_factor2 hold the F-statistics for the main effects of factor1 and factor2.
interaction_effect holds the F-statistic for the interaction between factor1 and factor2.
The matching p-values are in the 'PR(>F)' column of the same table.

Interpret the Results:
You can now interpret the effects based on their significance levels (p-values) and magnitudes (F-statistics). A significant 
effect indicates that the corresponding factor or interaction is associated with a difference in the dependent variable.

Keep in mind that this is a simplified example, and the actual implementation may vary depending on your specific dataset and 
requirements. Additionally, you may want to perform post hoc tests or pairwise comparisons to further analyze significant 
effects and interactions.
'''

### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

In [None]:
'''
When conducting a one-way ANOVA and obtaining an F-statistic and p-value, you can draw conclusions based on the significance 
level (alpha) typically set at 0.05. Here's how to interpret the results:

F-Statistic: The F-statistic measures the ratio of the variance between groups (explained variance) to the variance within 
groups (unexplained variance). In this case, you obtained an F-statistic of 5.23.

P-Value: The p-value associated with the F-statistic is the probability of obtaining the observed F-statistic (or a larger 
value) if the null hypothesis that all group means are equal is true.

Now, let's interpret these results:

Null Hypothesis (H0): The null hypothesis in ANOVA is that all population group means are equal.

Alternative Hypothesis (Ha): The alternative hypothesis is that there is at least one group with a different mean from the 
others.

Based on the results:

If the p-value is less than your chosen significance level (alpha), typically 0.05, you reject the null hypothesis (H0). In 
this case, the p-value is 0.02, which is less than 0.05.

Therefore, you conclude that there is sufficient evidence to suggest that at least one group mean is significantly different 
from the others.

However, the F-statistic alone doesn't tell you which group or groups are different. To identify which groups are different, 
you may need to perform post hoc tests or pairwise comparisons.

In summary, with an F-statistic of 5.23 and a p-value of 0.02, you can conclude that there are significant differences between 
the groups' means. Further analyses are needed to determine which specific groups differ from each other.
'''
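The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not state. The sketch below assumes, purely for illustration, 3 groups and 15 observations (degrees of freedom 2 and 12) and computes the upper-tail probability of the F distribution with `scipy.stats.f.sf`.

```python
from scipy import stats

f_statistic = 5.23

# Hypothetical degrees of freedom (not given in the question):
# 3 groups -> dfn = 3 - 1 = 2; 15 observations -> dfd = 15 - 3 = 12
dfn, dfd = 2, 12

# Survival function: P(F >= f_statistic) under the null hypothesis
p_value = stats.f.sf(f_statistic, dfn, dfd)
print("p-value:", p_value)

alpha = 0.05
print("Reject H0:", p_value < alpha)
```

With these assumed degrees of freedom the p-value is close to the 0.02 quoted in the question. Other degrees of freedom would give a different p-value for the same F.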

### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

In [None]:
'''
Handling missing data in a repeated measures ANOVA is essential for obtaining accurate and reliable results. There are several 
methods for dealing with missing data, each with its advantages and potential consequences:

Listwise Deletion (Complete Case Analysis):

Method: Remove all cases with any missing data from the analysis.
Pros: Simple to implement; every analysis uses the same set of complete cases.
Cons: Reduces statistical power, may introduce bias if missing data is not completely random, and can result in loss of 
valuable information.

Pairwise Deletion (Available Case Analysis):

Method: Analyze each pairwise combination of variables with available data, ignoring missing data.
Pros: Retains more data than listwise deletion, simple to implement.
Cons: May lead to different sample sizes for different comparisons, can result in biased estimates if data is not missing at 
random.

Imputation Methods:

Method: Replace missing values with estimated values based on observed data or statistical methods (e.g., mean imputation, 
median imputation, regression imputation).
Pros: Retains all cases, preserves statistical power, and can provide unbiased estimates if imputation model is appropriate.
Cons: Choice of imputation method can impact results, may introduce bias if imputation model is misspecified, and standard 
errors should be adjusted to account for imputation uncertainty.

Maximum Likelihood Estimation (MLE):

Method: Use advanced statistical techniques to estimate parameters while accounting for missing data directly in the analysis.
Pros: Provides unbiased parameter estimates and valid statistical tests if the missing data mechanism is correctly modeled.
Cons: Complex to implement, may require specialized software, and assumptions about the missing data mechanism must be met.

The choice of method depends on the nature of the data, the pattern of missingness, and the goals of the analysis. It's 
essential to carefully consider the potential consequences of each method:

Biased Estimates: Using listwise deletion or inappropriate imputation methods can lead to biased parameter estimates.

Loss of Power: Listwise deletion and pairwise deletion reduce the effective sample size, resulting in reduced statistical power.

Invalid Tests: Ignoring missing data or using inappropriate methods can lead to invalid statistical tests and incorrect 
conclusions.

Assumptions: Imputation methods and MLE may require assumptions about the missing data mechanism, which should be assessed and 
tested.

Impact on Generalizability: The method chosen can affect the generalizability of the results to the population.

In summary, the handling of missing data in repeated measures ANOVA should be done thoughtfully, taking into account the data's 
characteristics and the potential impact on the validity and reliability of the analysis. Consulting with a statistician or 
data analyst experienced in missing data handling is advisable when dealing with complex missing data scenarios.
'''
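To make the trade-off between deletion and imputation concrete, the sketch below applies listwise deletion and mean imputation to a small made-up dataset in wide format (one row per subject, one column per time point). The column names and values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data: 5 subjects measured at 3 time points
wide = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5],
    "t1": [5.0, 6.0, np.nan, 7.0, 8.0],
    "t2": [5.5, np.nan, 6.5, 7.5, 8.5],
    "t3": [6.0, 6.5, 7.0, 8.0, 9.0],
})
measure_cols = ["t1", "t2", "t3"]

# Listwise deletion: drop every subject with at least one missing measurement
complete_cases = wide.dropna()
print("Subjects kept after listwise deletion:", len(complete_cases))

# Mean imputation: replace each missing value with the mean of its time point
imputed = wide.copy()
imputed[measure_cols] = imputed[measure_cols].fillna(imputed[measure_cols].mean())
print(imputed)
```

Here listwise deletion discards two of the five subjects, while mean imputation keeps all five but shrinks the variance at the imputed time points, which tends to make standard errors too small.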

### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

In [None]:
'''
Post-hoc tests are used after ANOVA (Analysis of Variance) to make pairwise comparisons between groups when the omnibus test 
indicates a significant difference among three or more groups. Common post-hoc tests include:

Tukey's Honestly Significant Difference (Tukey's HSD):

When to Use: Tukey's HSD is used when you have performed a one-way ANOVA and found a significant difference among groups. It's 
a conservative test that controls the familywise error rate.
Example: Suppose you conducted an ANOVA to compare the exam scores of students who received different types of coaching 
(A, B, C), and the ANOVA result showed a significant difference. Tukey's HSD can be used to determine which specific coaching 
types led to significantly different scores.

Bonferroni Correction:

When to Use: Bonferroni correction is used when you want to control the experimentwise error rate by adjusting the significance 
level for multiple comparisons.
Example: If you have multiple pairwise comparisons to make (e.g., comparing the effects of different drug treatments on various 
symptoms), you can apply Bonferroni correction to ensure that the overall Type I error rate is controlled.

Scheffé's Method:

When to Use: Scheffé's method is a very conservative post-hoc test that controls the experimentwise error rate over all 
possible contrasts, not only pairwise comparisons. It works with unequal sample sizes and is most useful when you want to test 
complex contrasts, such as one group against the average of two others.
Example: Imagine you conducted an ANOVA to compare the performance of students across schools with different numbers of 
students, and you want to compare one school against the average of the others. Scheffé's method keeps the error rate 
controlled for such contrasts.

Dunnett's Test:

When to Use: Dunnett's test is used when you have one control group and want to compare it to multiple treatment groups. It's 
suitable for situations where you have a control group for reference.
Example: If you're conducting a drug trial with one control group and several experimental groups receiving different 
medications, Dunnett's test can help you determine which experimental groups differ significantly from the control.

Fisher's Least Significant Difference (LSD):

When to Use: Fisher's LSD is a less conservative post-hoc test used when you want to make pairwise comparisons without 
controlling for experimentwise error. It's best used when you have a specific reason not to control for multiple comparisons.
Example: In some exploratory analyses, you may want to quickly identify groups that appear different from one another. Fisher's 
LSD can be used for this purpose.

Games-Howell Test:

When to Use: Games-Howell is used when you have unequal variances and sample sizes among groups and want to perform post-hoc 
pairwise comparisons.
Example: In a study comparing the effects of different fertilizers on plant growth, you observe unequal variances and sample 
sizes among the fertilizer groups. Games-Howell can be applied to make comparisons while accommodating these differences.

The choice of post-hoc test depends on your specific research questions, the nature of your data, and your desired control over 
Type I error rates. It's essential to carefully consider the appropriate post-hoc test for your study design to draw valid 
conclusions from your ANOVA results.
'''
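As an illustration of the coaching example above, the sketch below runs Tukey's HSD with `pairwise_tukeyhsd` from statsmodels. The exam scores are made up, with coaching types A and C given the same mean so that only the comparisons involving B should come out significant.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical exam scores for three coaching types (5 students each)
scores = np.array([
    70, 72, 68, 71, 69,   # coaching A
    80, 82, 78, 81, 79,   # coaching B
    70, 71, 69, 72, 68,   # coaching C
])
coaching = np.array(["A"] * 5 + ["B"] * 5 + ["C"] * 5)

# Tukey's HSD compares every pair of groups while controlling the familywise error rate
result = pairwise_tukeyhsd(endog=scores, groups=coaching, alpha=0.05)
print(result.summary())

# One boolean per pair, in the order (A, B), (A, C), (B, C)
print(result.reject)
```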

### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [8]:
import scipy.stats as stats

# Data for weight loss in each diet group
diet_a = [2.3, 1.8, 3.2, 2.5, 2.7]  # Replace with your data for Diet A
diet_b = [1.9, 2.0, 1.5, 2.1, 1.8]  # Replace with your data for Diet B
diet_c = [2.8, 2.7, 3.0, 2.6, 2.9]  # Replace with your data for Diet C

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

# Print the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05  # Set your significance level
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the mean weight loss of the three diets.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the mean weight loss of the three diets.")

F-statistic: 10.081632653061222
p-value: 0.002697284008268878
Reject the null hypothesis. There is a significant difference between the mean weight loss of the three diets.


### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a DataFrame with your data
data = pd.read_csv('your_data.csv')  # Replace 'your_data.csv' with your data file

# Define a linear model
model = ols('Time ~ Program + Experience + Program:Experience', data=data).fit()

# Perform two-way ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

# Interpret the results
alpha = 0.05  # Set your significance level

# Check for main effects of Program and Experience
if anova_table['PR(>F)']['Program'] < alpha:
    print("There is a significant main effect of Program.")
else:
    print("There is no significant main effect of Program.")

if anova_table['PR(>F)']['Experience'] < alpha:
    print("There is a significant main effect of Experience.")
else:
    print("There is no significant main effect of Experience.")

# Check for interaction effect
if anova_table['PR(>F)']['Program:Experience'] < alpha:
    print("There is a significant interaction effect between Program and Experience.")
else:
    print("There is no significant interaction effect between Program and Experience.")

# Note: replace 'your_data.csv' with the filename of your data file, and ensure that the file has
# columns named 'Time', 'Program', and 'Experience' (or adjust the column names in the code accordingly).

### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [None]:
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Create a DataFrame with your data
data = pd.read_csv('your_data.csv')  # Replace 'your_data.csv' with your data file

# Split the data into control and experimental groups
control_group = data[data['Group'] == 'Control']['Test_Scores']
experimental_group = data[data['Group'] == 'Experimental']['Test_Scores']

# Perform a two-sample t-test
t_stat, p_value = stats.ttest_ind(control_group, experimental_group)

# Interpret the results
alpha = 0.05  # Set your significance level

if p_value < alpha:
    print("There is a significant difference in test scores between the control and experimental groups.")
else:
    print("There is no significant difference in test scores between the groups.")

# With only two groups, the t-test above is already the single pairwise comparison,
# so a post-hoc test adds no new information. Tukey's HSD is run here only because
# the question asks for a follow-up; it compares the same two groups.
if p_value < alpha:
    posthoc = pairwise_tukeyhsd(data['Test_Scores'], data['Group'], alpha=alpha)

    # Print the post-hoc summary table
    print(posthoc.summary())

### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post- hoc test to determine which store(s) differ significantly from each other. 

In [None]:
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import MultiComparison

# Create a DataFrame with your data in long format: one row per (day, store) pair
data = pd.read_csv('your_data.csv')  # Replace 'your_data.csv' with your data file

# Perform the repeated measures ANOVA.
# Assumption: the file has a 'Day' column identifying the repeated unit (each day is
# measured once in every store); 'Store' is the within-subject factor.
anova = AnovaRM(data, depvar='Sales', subject='Day', within=['Store']).fit()

# Interpret the results of the repeated measures ANOVA
print(anova.summary())

# If the results are significant, perform a post-hoc test (e.g., Tukey's HSD)
p_store = anova.anova_table.loc['Store', 'Pr > F']
if p_store < 0.05:
    # Note: Tukey's HSD treats the stores as independent groups and ignores the pairing by day
    mc = MultiComparison(data['Sales'], data['Store'])
    posthoc = mc.tukeyhsd()
    print(posthoc.summary())

    # Plot the simultaneous confidence intervals for the store means
    posthoc.plot_simultaneous()
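
The cell above depends on an external CSV file. The sketch below is a self-contained variant on synthetic data: 30 days, each observed in all three stores, with a hypothetical `Day` column as the repeated unit. For the follow-up it uses paired t-tests with a Bonferroni adjustment, which is one option that respects the pairing by day; it is swapped in here as an alternative to Tukey's HSD, not taken from the original answer.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Synthetic long-format data (all numbers are made up for illustration)
rng = np.random.default_rng(42)
n_days = 30
day_effect = rng.normal(0, 20, n_days)            # shared day-to-day variation
store_means = {"A": 200.0, "B": 210.0, "C": 230.0}
rows = [
    {"Day": d, "Store": s, "Sales": m + day_effect[d] + rng.normal(0, 10)}
    for d in range(n_days)
    for s, m in store_means.items()
]
data = pd.DataFrame(rows)

# Repeated measures ANOVA: Day is the subject, Store is the within-subject factor
res = AnovaRM(data, depvar="Sales", subject="Day", within=["Store"]).fit()
table = res.anova_table
print(table)
p_store = table.loc["Store", "Pr > F"]

# Paired post-hoc comparisons with Bonferroni adjustment
wide = data.pivot(index="Day", columns="Store", values="Sales")
pairs = list(combinations(wide.columns, 2))
for a, b in pairs:
    t_stat, p_raw = stats.ttest_rel(wide[a], wide[b])
    p_adj = min(p_raw * len(pairs), 1.0)
    print(f"{a} vs {b}: t = {t_stat:.2f}, Bonferroni-adjusted p = {p_adj:.4f}")
```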