### Que - 1 :  Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validation of the results.

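1. Normality: within each group, the dependent variable (equivalently, the model residuals) is approximately normally distributed. Example of a violation: strongly right-skewed data such as incomes or reaction times, especially with small samples.
2. No extreme outliers: ANOVA relies on means and variances, which are sensitive to extreme values, so outliers should be investigated rather than removed automatically. Example of a violation: a single data-entry error that inflates one group's variance and masks a real difference.
3. Homogeneity of variance: the population variances at the different levels of each independent variable are equal. Example of a violation: one group has a standard deviation several times larger than the others, which distorts the Type I error rate, particularly when group sizes are unequal.
4. Independence: observations are independent of one another and randomly sampled. Example of a violation: repeated measurements on the same person, or students clustered in the same classroom, treated as separate independent observations.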
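
The normality and equal-variance assumptions can be checked before running the ANOVA. The sketch below is one possible way to do it, using the Shapiro-Wilk test on pooled residuals and Levene's test across groups; the three groups are hypothetical simulated data, not data from this notebook.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: three groups of 30 observations each (assumption for illustration)
groups = [rng.normal(loc=m, scale=2.0, size=30) for m in (10.0, 11.0, 12.0)]

# Normality: subtract each group's own mean, pool the residuals, run Shapiro-Wilk
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test (median-centered by default in SciPy)
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may not hold; independence cannot be tested this way and has to come from the study design.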

### Que - 2 :  What are the three types of ANOVA, and in what situations would each be used?

### There are three types of ANOVA :
#### 1. One-Way ANOVA
#### 2. Two-Way ANOVA
#### 3. MANOVA (Multivariate Analysis of Variance)

### 1. One-Way ANOVA :
#### One-Way ANOVA is used when we have one categorical independent variable (also called a factor) and one continuous dependent variable. It is used to determine if there are any significant differences in the means of the dependent variable across the different levels of the independent variable. For example, we may use One-Way ANOVA to determine if there are significant differences in the test scores of students from different schools.

### 2. Two-Way ANOVA : 
#### Two-Way ANOVA is used when we have two categorical independent variables and one continuous dependent variable. It is used to determine if there are any significant main effects of each independent variable and if there is any interaction effect between the two independent variables on the dependent variable. For example, we may use Two-Way ANOVA to analyze the effects of two different factors (e.g., gender and age group) on a health-related outcome.

### 3. MANOVA (Multivariate Analysis of Variance) :
#### MANOVA is used when we have two or more continuous dependent variables and one or more categorical independent variables. It is an extension of ANOVA that allows us to analyze the effects of the independent variables on multiple dependent variables simultaneously. MANOVA is used when we want to determine if there are any significant differences in the mean vectors of the dependent variables across the different levels of the independent variables. For example, we may use MANOVA to analyze the effects of different treatments on multiple physiological measures.
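
As a rough illustration of the MANOVA case, the sketch below fits two dependent variables against one categorical factor with `statsmodels`. The column names (`treatment`, `heart_rate`, `blood_pressure`) and the simulated values are hypothetical and only serve to show the shape of the call.

```python
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(1)
n = 15  # hypothetical number of participants per treatment

# Hypothetical data: two physiological measures under three treatments
df = pd.DataFrame({
    "treatment": np.repeat(["A", "B", "C"], n),
    "heart_rate": np.concatenate([rng.normal(m, 5, n) for m in (70, 74, 78)]),
    "blood_pressure": np.concatenate([rng.normal(m, 8, n) for m in (120, 125, 130)]),
})

# Both dependent variables go on the left-hand side of the formula
manova = MANOVA.from_formula("heart_rate + blood_pressure ~ treatment", data=df)
result = manova.mv_test()
print(result)
```

The printed summary reports multivariate statistics such as Wilks' lambda and Pillai's trace for the `treatment` term.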

### Que - 3 :  What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the process of decomposing the total variance observed in the data into different components associated with various sources of variation. These sources of variation include the treatment effects, error or residual variation, and potentially other factors such as interactions.

The partitioning of variance is crucial in ANOVA because it helps in understanding the relative contributions of different sources of variation to the total variation in the data. By quantifying the amount of variation attributable to each source, we can determine the significance of the treatment effects and assess the overall model fit.
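
For the one-way case, the decomposition can be written out explicitly. The notation here is an assumption introduced for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, group means $\bar{y}_i$, and grand mean $\bar{y}$.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{SS_{\text{within}}}
$$

$$
F = \frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
$$

The F-statistic is the ratio of the two mean squares, so a large value means that the between-group component is large relative to the within-group (error) component.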

Understanding the partitioning of variance is important for several reasons:

Hypothesis Testing: ANOVA tests the null hypothesis that there are no significant differences among the group means. The partitioning of variance provides information on the variability between groups (treatment effects) and within groups (random error). By comparing these variances, ANOVA determines if the observed differences between groups are statistically significant.

Assessing Treatment Effects: By examining the variation explained by the treatment effects, we can determine the extent to which the independent variables (factors or treatments) influence the dependent variable. This helps in evaluating the effectiveness of different treatments or experimental conditions.

Model Evaluation: The partitioning of variance allows for evaluating the overall fit of the ANOVA model. If the treatment effects explain a significant portion of the variance compared to the residual error, it suggests that the model adequately captures the relationship between the independent and dependent variables.

Determining Effect Sizes: ANOVA allows for the calculation of effect sizes, which quantify the magnitude of the treatment effects. Effect sizes provide a standardized measure of the practical significance of the observed differences between groups.

Future Research and Interpretation: Understanding the partitioning of variance can guide future research by identifying factors or sources of variation that contribute significantly to the outcomes. It also helps in interpreting the results by providing insights into the relative importance of different factors in explaining the observed variability.

### Que - 4 : How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [2]:
import numpy as np

def one_way_anova_sums_of_squares(data):
    # Convert each group to a float array so groups of unequal size also work
    groups = [np.asarray(group, dtype=float) for group in data]
    all_values = np.concatenate(groups)

    # Calculate the overall (grand) mean across all observations
    overall_mean = all_values.mean()

    # Calculate the total sum of squares (SST)
    sst = np.sum((all_values - overall_mean) ** 2)

    # Calculate the group means
    group_means = [group.mean() for group in groups]

    # Calculate the explained (between-group) sum of squares (SSE)
    sse = sum(len(group) * (group_mean - overall_mean) ** 2 for group, group_mean in zip(groups, group_means))

    # Calculate the residual (within-group) sum of squares (SSR)
    ssr = sst - sse

    return sst, sse, ssr

# Example usage
data = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
]

sst, sse, ssr = one_way_anova_sums_of_squares(data)
print(f"Total Sum of Squares (SST): {sst:.2f}")
print(f"Explained Sum of Squares (SSE): {sse:.2f}")
print(f"Residual Sum of Squares (SSR): {ssr:.2f}")


Total Sum of Squares (SST): 60.00
Explained Sum of Squares (SSE): 54.00
Residual Sum of Squares (SSR): 6.00
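
The sums of squares lead directly to the F-statistic and to an effect size. The sketch below recomputes them for the same three groups, derives F and eta-squared by hand, and cross-checks against `scipy.stats.f_oneway`; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

groups = [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9])]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group (explained) and within-group (residual) sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = all_values.size - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

# Eta-squared: proportion of total variance explained by group membership
eta_squared = ss_between / (ss_between + ss_within)

print(f"F (manual): {f_manual:.2f}, F (scipy): {f_scipy:.2f}")  # both 27.00
print(f"p-value: {p_manual:.4f}")                               # 0.0010
print(f"eta-squared: {eta_squared:.2f}")                        # 0.90
```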


### Que - 5 :  In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [4]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd

# Create a DataFrame with the data
data = {
    'A': [1, 2, 3, 4, 5, 6],
    'B': [2, 4, 6, 8, 10, 12],
    'Y': [10, 15, 20, 12, 18, 22]
}
df = pd.DataFrame(data)

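# Fit the model. Note: A and B are numeric columns here, so the formula treats
# them as continuous predictors, and B is exactly 2 * A (perfectly collinear).
# For a true two-way ANOVA, encode each factor as categorical, e.g. C(A) and C(B).
model = ols('Y ~ A + B + A:B', data=df).fit()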
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract the sum of squares attributed to each main effect and to the interaction
main_effect_a = anova_table.loc['A', 'sum_sq']
main_effect_b = anova_table.loc['B', 'sum_sq']
interaction_effect = anova_table.loc['A:B', 'sum_sq']

# Print the main effects and interaction effect
print(f"Main Effect A: {main_effect_a:.2f}")
print(f"Main Effect B: {main_effect_b:.2f}")
print(f"Interaction Effect: {interaction_effect:.2f}")


Main Effect A: 53.16
Main Effect B: 53.16
Interaction Effect: 0.01


### Que - 6 : Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

Based on the results of the one-way ANOVA with an F-statistic of 5.23 and a p-value of 0.02, we can draw the following conclusions and interpretations:

1. Concluding the Differences between Groups: 
   The obtained p-value (0.02) is less than the significance level of 0.05, indicating that there is sufficient evidence to reject the null hypothesis. Therefore, we can conclude that there are statistically significant differences between the groups.

2. Interpreting the Results:
   - F-Statistic: The F-statistic of 5.23 indicates the ratio of the variability between the groups' means to the variability within the groups. A larger F-statistic suggests a larger difference between group means relative to the variation within the groups.
   - P-value: The p-value of 0.02 indicates the probability of observing the obtained F-statistic (or a more extreme value) assuming that the null hypothesis is true. A p-value of 0.02 indicates that there is only a 2% chance of observing such a large F-statistic if there were no real differences between the groups.

   Therefore, with a statistically significant F-statistic and a low p-value, we can conclude that at least one group mean differs from the others. However, the one-way ANOVA does not indicate which groups differ. To identify the group means that differ significantly, post-hoc pairwise comparisons (such as Tukey's HSD or Bonferroni-adjusted t-tests) can be conducted.

   Additionally, the effect size and the magnitude of the differences between groups should be considered for practical significance. Effect sizes like eta-squared or omega-squared can provide information about the proportion of variance explained by the group differences.

In summary, based on an F-statistic of 5.23 and a p-value of 0.02, we conclude that there are statistically significant differences between the groups. Further analysis is needed to determine which specific group means differ significantly from each other, and additional consideration should be given to effect sizes and practical significance.
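
The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not state. The sketch below assumes, purely for illustration, 3 groups and 18 observations in total, giving degrees of freedom (2, 15).

```python
from scipy import stats

f_statistic = 5.23

# Hypothetical degrees of freedom: k - 1 = 2 and N - k = 15 (assumption, not given in the question)
df_between, df_within = 2, 15

# p-value: probability of an F at least this large under the null hypothesis
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value at alpha = 0.05 for the same degrees of freedom
critical_f = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = 0.05: {critical_f:.2f}")
print(f"reject H0: {f_statistic > critical_f}")
```

Under these assumed degrees of freedom the p-value comes out close to the reported 0.02; with other degrees of freedom the same F-statistic would give a different p-value.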

### Que - 7 : In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Handling missing data in a repeated measures ANOVA is an important consideration. Here are some common approaches to handle missing data in a repeated measures ANOVA and the potential consequences of using different methods:

Complete Case Analysis (Listwise Deletion):
This approach involves excluding any participant or case with missing data from the analysis. Only complete cases (participants with data for all time points or conditions) are used for the analysis. The main advantage is its simplicity, but it can result in reduced sample size and potential bias if the missing data mechanism is not completely random.

Pairwise Deletion:
Pairwise deletion involves using all available data for each pair of time points or conditions. This means that participants with missing data for specific time points or conditions are still included in the analysis for the remaining time points or conditions. This approach retains more data but can result in different sample sizes for different comparisons, potentially leading to biased estimates of the variances and degrees of freedom.

Mean Substitution:
Mean substitution involves replacing missing values with the mean of the available data for that variable or condition. This method preserves the sample size and leaves the estimated means unchanged, which is unbiased only if the data are missing completely at random (MCAR). However, it always shrinks the variability of the imputed variable and distorts correlations between conditions, which can inflate the Type I error rate even under MCAR, and it biases the means as well when the data are not MCAR.

Multiple Imputation:
Multiple imputation involves estimating the missing values multiple times based on a model that accounts for the observed data. The missing values are imputed multiple times, creating multiple complete datasets. Each complete dataset is then analyzed separately, and the results are combined to obtain an overall estimate. Multiple imputation can provide more accurate estimates and standard errors if the imputation model is appropriate. However, it requires additional computational efforts and assumptions about the missing data mechanism.
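
The first and third approaches are easy to compare on a toy example. The sketch below uses a hypothetical wide-format table (one row per participant, one column per time point) to show how listwise deletion shrinks the sample and how mean substitution shrinks the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
wide = pd.DataFrame({
    "t1": [5.0, 6.0, 7.0, 8.0, 6.5],
    "t2": [5.5, np.nan, 7.5, 8.5, 7.0],
    "t3": [6.0, 6.5, np.nan, 9.0, 7.5],
})

# Listwise deletion: keep only participants observed at every time point
complete_cases = wide.dropna()

# Mean substitution: fill each missing value with that column's observed mean
mean_imputed = wide.fillna(wide.mean())

print(f"Participants before: {len(wide)}, after listwise deletion: {len(complete_cases)}")
print(f"SD of t2, observed values only: {wide['t2'].std():.3f}")
print(f"SD of t2, after mean substitution: {mean_imputed['t2'].std():.3f}")
```

Two of the five participants are lost under listwise deletion, and the standard deviation of `t2` drops after mean substitution because the filled-in value sits exactly at the mean.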

### Que - 8 : What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

After conducting an ANOVA and finding a significant main effect or interaction effect, post-hoc tests are often performed to identify specific pairwise differences between groups. Here are some common post-hoc tests used after ANOVA and when to use each one:

1. Tukey's Honestly Significant Difference (HSD) Test:
   Tukey's HSD test is widely used when the number of pairwise comparisons is relatively large. It controls the familywise error rate, allowing for simultaneous comparisons between all pairs of groups. Tukey's HSD is appropriate when you want to determine which specific pairs of group means are significantly different from each other.

2. Bonferroni Correction:
   The Bonferroni correction is a conservative method that adjusts the significance level (alpha) for each comparison to control the familywise error rate. It is useful when you have a small number of specific comparisons to make and want to maintain a more stringent control over Type I error.

3. Fisher's Least Significant Difference (LSD) Test:
   Fisher's LSD test performs ordinary pairwise t-tests (using the pooled error variance from the ANOVA) and is run only after a significant omnibus F-test. It does not adjust the significance level for multiple comparisons, so it controls the familywise error rate only in the special case of three groups. It is the most powerful but least conservative option, best reserved for a small number of planned comparisons.

4. Sidak Correction:
   The Sidak correction is similar to the Bonferroni correction, but it provides a less conservative adjustment to the significance level. It is suitable when you have a moderate number of pairwise comparisons and want to control the familywise error rate with less stringency than the Bonferroni correction.

5. Dunnett's Test:
   Dunnett's test is used when you have a control group and want to compare each treatment group to the control group. It adjusts the significance level to account for multiple comparisons while controlling the familywise error rate.

6. Scheffe's Test:
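   Scheffe's test is a conservative post-hoc test that controls the familywise error rate across all possible contrasts among the group means, not just pairwise differences. It is appropriate when you want to test complex contrasts (for example, the average of two groups against a third) or contrasts chosen after looking at the data. For purely pairwise comparisons it is less powerful than Tukey's HSD, and like the other tests listed here it assumes equal variances.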

Example:
Suppose you conduct an experiment to compare the effectiveness of three different treatments (Treatment A, Treatment B, and Treatment C) in reducing pain levels. After conducting an ANOVA, you find a significant main effect of treatment. To determine which specific treatments differ significantly from each other, you would perform a post-hoc test.

In this scenario, you could use Tukey's HSD test to compare all possible pairs of treatment means. Tukey's HSD would allow you to determine if there are significant differences between Treatment A and Treatment B, Treatment A and Treatment C, and Treatment B and Treatment C.

Using a post-hoc test in this situation would provide more detailed information about the specific treatment comparisons, helping you understand which treatments show statistically significant differences in pain reduction.
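
The pain-treatment scenario can be sketched in code as follows. The pain scores are simulated (hypothetical), and the block shows two of the options listed above: Tukey's HSD from `statsmodels`, and pairwise t-tests with a Bonferroni adjustment.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(42)

# Hypothetical pain scores for 20 patients per treatment
pain = {
    "A": rng.normal(5.0, 1.0, 20),
    "B": rng.normal(4.0, 1.0, 20),
    "C": rng.normal(3.0, 1.0, 20),
}
scores = np.concatenate(list(pain.values()))
labels = np.repeat(list(pain.keys()), 20)

# Option 1: Tukey's HSD over all pairs of treatments
tukey = pairwise_tukeyhsd(scores, labels, alpha=0.05)
print(tukey)

# Option 2: independent-samples t-tests with a Bonferroni adjustment
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
raw_p = [stats.ttest_ind(pain[a], pain[b]).pvalue for a, b in pairs]
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, reject H0 = {r}")
```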

### Que - 9 : A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [11]:
import scipy.stats as stats
import numpy as np

# Weight loss data for each diet (illustrative values, 50 observations per diet)
diet_A = [2, 4, 6, 3, 5, 7, 4, 3, 6, 5,
          2, 4, 6, 3, 5, 7, 4, 3, 6, 5,
          2, 4, 6, 3, 5, 7, 4, 3, 6, 5,
          2, 4, 6, 3, 5, 7, 4, 3, 6, 5,
          2, 4, 6, 3, 5, 7, 4, 3, 6, 5]
diet_B = [1, 3, 5, 2, 4, 6, 3, 2, 5, 4,
          1, 3, 5, 2, 4, 6, 3, 2, 5, 4,
          1, 3, 5, 2, 4, 6, 3, 2, 5, 4,
          1, 3, 5, 2, 4, 6, 3, 2, 5, 4,
          1, 3, 5, 2, 4, 6, 3, 2, 5, 4]
diet_C = [3, 6, 9, 4, 8, 12, 6, 4, 9, 8,
          3, 6, 9, 4, 8, 12, 6, 4, 9, 8,
          3, 6, 9, 4, 8, 12, 6, 4, 9, 8,
          3, 6, 9, 4, 8, 12, 6, 4, 9, 8,
          3, 6, 9, 4, 8, 12, 6, 4, 9, 8]

# Combine the data into a single array
data = np.concatenate((diet_A, diet_B, diet_C))

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Print the results
print(f"F-statistic: {f_statistic:.2f}")
print(f"p-value: {p_value:.4f}")

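F-statistic: 38.73
p-value: 0.0000

Interpretation: the p-value prints as 0.0000, meaning it is below 0.0001 and far under the 0.05 significance level, so we reject the null hypothesis that the three diets produce the same mean weight loss. In this sample the group means are 4.5 (Diet A), 3.5 (Diet B), and 6.9 (Diet C); a post-hoc test such as Tukey's HSD is needed to establish which pairs of diets differ.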


### Que - 10 : A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [12]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Time data for each combination of software program and employee experience level
program_a_novice = [10, 12, 15, 11, 14, 13, 16, 15, 12, 11, 13, 14, 12, 15, 11, 13, 15, 14, 12, 10, 13, 14, 11, 12, 15, 14, 16, 13, 15, 11]
program_a_experienced = [9, 11, 13, 10, 12, 11, 14, 13, 10, 9, 12, 10, 11, 12, 10, 9, 12, 11, 10, 9, 11, 12, 9, 10, 13, 12, 14, 11, 13, 10]
program_b_novice = [11, 14, 13, 12, 15, 13, 16, 14, 13, 12, 14, 13, 12, 15, 11, 12, 14, 15, 13, 11, 14, 15, 12, 13, 16, 14, 17, 13, 15, 12]
program_b_experienced = [10, 12, 13, 11, 14, 12, 15, 13, 11, 10, 13, 11, 12, 13, 10, 11, 13, 14, 12, 10, 13, 14, 11, 12, 15, 13, 16, 12, 14, 11]
program_c_novice = [13, 15, 14, 12, 16, 14, 17, 15, 14, 12, 15, 13, 13, 16, 12, 13, 15, 16, 14, 12, 15, 16, 13, 14, 17, 15, 18, 14, 16, 13]
program_c_experienced = [12, 14, 15, 13, 16, 14, 17, 15, 13, 12, 15, 13, 14, 15, 12, 13, 15, 16, 13, 12, 15, 16, 12, 13, 16, 14, 17, 13, 15, 12]

# Create a DataFrame with the data
import pandas as pd

df = pd.DataFrame({
    'Time': program_a_novice + program_a_experienced + program_b_novice + program_b_experienced + program_c_novice + program_c_experienced,
    'Program': ['A'] * 60 + ['B'] * 60 + ['C'] * 60,
    'Experience': ['Novice'] * 30 + ['Experienced'] * 30 + ['Novice'] * 30 + ['Experienced'] * 30 + ['Novice'] * 30 + ['Experienced'] * 30
})

# Fit the two-way ANOVA model
model = ols('Time ~ Program + Experience + Program:Experience', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract the F-statistics and p-values
f_program = anova_table.loc['Program', 'F']
p_program = anova_table.loc['Program', 'PR(>F)']
f_experience = anova_table.loc['Experience', 'F']
p_experience = anova_table.loc['Experience', 'PR(>F)']
f_interaction = anova_table.loc['Program:Experience', 'F']
p_interaction = anova_table.loc['Program:Experience', 'PR(>F)']

# Print the results
print(f"F-statistic Program: {f_program:.2f}, p-value: {p_program:.4f}")
print(f"F-statistic Experience: {f_experience:.2f}, p-value: {p_experience:.4f}")
print(f"F-statistic Interaction: {f_interaction:.2f}, p-value: {p_interaction:.4f}")


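F-statistic Program: 27.14, p-value: 0.0000
F-statistic Experience: 22.25, p-value: 0.0000
F-statistic Interaction: 3.86, p-value: 0.0230

Interpretation: all three effects are significant at the 0.05 level. Software program has a significant main effect on completion time (F = 27.14, p < 0.0001), as does experience level (F = 22.25, p < 0.0001). The Program:Experience interaction is also significant (F = 3.86, p = 0.023), so the effect of experience is not the same for every program and the main effects should be interpreted in light of the interaction.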


In [14]:
anova_table

                        sum_sq     df          F        PR(>F)
Program             141.011111    2.0  27.137517  5.519536e-11
Experience           57.800000    1.0  22.247161  4.899601e-06
Program:Experience   20.033333    2.0   3.855405  2.299631e-02
Residual            452.066667  174.0        NaN           NaN


### Que - 11 : An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [21]:
import scipy.stats as stats
import numpy as np
import statsmodels as stml

# Test scores for the control group
control_scores = [70, 75, 68, 72, 80, 82, 78, 76, 73, 77,
                  74, 81, 79, 83, 75, 72, 80, 76, 78, 77,
                  79, 74, 75, 73, 71, 72, 77, 80, 76, 78,
                  79, 81, 82, 78, 76, 73, 77, 74, 75, 72,
                  75, 76, 80, 78, 79, 74, 77, 73, 72, 75]

# Test scores for the experimental group
experimental_scores = [85, 82, 89, 88, 93, 85, 84, 86, 90, 87,
                       89, 92, 85, 83, 86, 90, 87, 88, 85, 89,
                       87, 86, 84, 88, 85, 89, 83, 86, 90, 87,
                       89, 92, 85, 83, 86, 90, 87, 88, 85, 89,
                       87, 86, 84, 88, 85, 89, 83, 86, 90, 87]

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)

# Print the results
print(f"T-statistic: {t_statistic:.2f}")
print(f"p-value: {p_value:.4f}")

# Perform post-hoc test (e.g., Tukey's HSD) if results are significant.
# With only two groups this is equivalent to the t-test above; it is included
# because the question asks for it.
if p_value < 0.05:
    # Combine the scores into a single array
    scores = np.concatenate((control_scores, experimental_scores))
    
    # Create a group array to indicate the group for each observation
    groups = np.array(['Control'] * len(control_scores) + ['Experimental'] * len(experimental_scores))
    
    # Import the function explicitly: a bare `import statsmodels` does not
    # load the stats.multicomp submodule
    from statsmodels.stats.multicomp import pairwise_tukeyhsd
    posthoc_res = pairwise_tukeyhsd(scores, groups, alpha=0.05)
    
    # Print the post-hoc test results
    print(posthoc_res)


T-statistic: -18.05
p-value: 0.0000
   Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower   upper  reject
---------------------------------------------------------
Control Experimental     10.8   0.0 9.6128 11.9872   True
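---------------------------------------------------------

Interpretation: the experimental group scored significantly higher than the control group (t = -18.05, p < 0.0001). The Tukey output reports the same comparison: the experimental mean is 10.8 points higher, with a 95% confidence interval of 9.61 to 11.99, so the new teaching method is associated with higher test scores in this sample.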


### Que - 12 : A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [22]:
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM
import pandas as pd

# Sales data for each store on each day
store_a_sales = [100, 110, 120, 105, 108, 130, 122, 118, 105, 112,
                 103, 115, 118, 123, 110, 115, 108, 106, 118, 122,
                 110, 120, 116, 105, 128, 130, 113, 121, 119, 125]
store_b_sales = [115, 112, 123, 118, 120, 130, 135, 112, 118, 115,
                 125, 128, 130, 135, 122, 115, 108, 113, 128, 123,
                 120, 118, 110, 115, 128, 125, 130, 135, 118, 122]
store_c_sales = [105, 108, 100, 118, 112, 120, 125, 130, 110, 115,
                 108, 103, 128, 120, 115, 122, 118, 112, 105, 108,
                 120, 125, 130, 115, 110, 118, 123, 128, 120, 118]

# Create a DataFrame with the data
df = pd.DataFrame({
    'Sales': store_a_sales + store_b_sales + store_c_sales,
    'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    'Day': list(range(30)) * 3
})

# Convert Day column to categorical variable
df['Day'] = df['Day'].astype('category')

# Fit the repeated measures ANOVA model
model = AnovaRM(df, 'Sales', 'Day', within=['Store']).fit()

# Print the results
print(model.summary())

# Perform post-hoc test (e.g., Tukey's HSD) if results are significant.
# Note: pairwise_tukeyhsd treats the stores as independent groups and ignores the pairing by day.
if model.anova_table['Pr > F']['Store'] < 0.05:
    from statsmodels.stats.multicomp import pairwise_tukeyhsd
    
    # Perform the post-hoc test
    posthoc_res = pairwise_tukeyhsd(df['Sales'], df['Store'], alpha=0.05)
    
    # Print the post-hoc test results
    print(posthoc_res)


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store  7.5002 2.0000 58.0000 0.0013

 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B   6.3667  0.008   1.4157 11.3176   True
     A      C   1.1333 0.8489  -3.8176  6.0843  False
     B      C  -5.2333 0.0358 -10.1843 -0.2824   True
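-----------------------------------------------------

Interpretation: the repeated measures ANOVA shows a significant effect of store on daily sales (F(2, 58) = 7.50, p = 0.0013). In the Tukey comparisons, Store B's mean daily sales are higher than Store A's (difference 6.37, adjusted p = 0.008) and higher than Store C's (difference 5.23, adjusted p = 0.036), while Stores A and C do not differ significantly (adjusted p = 0.849).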
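
Since the same 30 days are measured at every store, a post-hoc procedure that respects the pairing is often preferred over Tukey's HSD, which assumes independent groups. One possible sketch is paired t-tests with a Holm adjustment; the sales figures below are simulated (hypothetical) rather than the data from the cell above, and the choice of Holm over Bonferroni is an assumption.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(7)
n_days = 30

# Hypothetical data: a shared day-to-day effect plus store-specific noise
day_effect = rng.normal(0, 8, n_days)
store_means = {"A": 115.0, "B": 121.0, "C": 116.0}
sales = {s: m + day_effect + rng.normal(0, 4, n_days) for s, m in store_means.items()}

# Paired t-test for every pair of stores (observations are matched by day)
pairs = list(combinations(sales, 2))
raw_p = [stats.ttest_rel(sales[a], sales[b]).pvalue for a, b in pairs]

# Holm step-down adjustment controls the familywise error rate
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"Store {a} vs Store {b}: adjusted p = {p:.4f}, reject H0 = {r}")
```

Because the paired test removes the shared day-to-day variation, it can detect differences that an independent-groups comparison would miss.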
