Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

In [None]:
#Ans1.

# ANOVA (Analysis of Variance) is a statistical method used to compare means among three or more groups to determine
→ if at least one group mean is significantly different from the others. To validly use ANOVA, several key assumptions must be met:

# Assumptions of ANOVA :

# 1.Independence of Observations:

# Each sample or group must be composed of independent observations.
→  Example Violation: If the data points within a group are correlated (e.g., repeated measures on the same subjects), the independence assumption is violated.

# 2.Normality:

# The data within each group should be approximately normally distributed.
→  Example Violation: If the data are heavily skewed or have outliers, this assumption is violated. This can be checked using normality tests like the Shapiro-Wilk test or by visualizing data with Q-Q plots.

# 3.Homogeneity of Variances (Homoscedasticity):

→ The variances among the groups should be approximately equal.
→  Example Violation: If one group has a much larger variance than the others, this assumption is violated. Levene's test or Bartlett's test can be used to assess this.

# Examples of Violations and Their Impact

# 1.Independence Violation:

→ If observations are not independent, for instance, if measurements are taken from the same subjects multiple times without accounting for the repeated measures,
→  it could lead to underestimating the variability within groups. This underestimation can result in a higher Type I error rate (incorrectly rejecting the null hypothesis).

# 2.Normality Violation:

→  If the data are not normally distributed, especially with small sample sizes, the ANOVA results might not be reliable.
→  Non-normal data can lead to incorrect conclusions because the F-statistic may not follow the expected distribution, increasing the chances of both Type I and Type II errors.
→  Example: In a study comparing blood pressure levels across different age groups, if one group's data are highly skewed due to an outlier, the normality assumption is violated.

# 3.Homogeneity of Variances Violation:

→  Unequal variances can affect the F-statistic, making it more difficult to determine if observed differences are significant. This can lead to inaccurate p-values, impacting the validity of the results.
→ Example: In a test comparing the effectiveness of different diets on weight loss, if one diet group has a much higher variance in weight loss due to varied adherence to the diet, the homogeneity assumption is violated.

# Addressing Violations  :
→ Transformations: Applying a transformation to the data (e.g., log transformation) can help meet the normality and homogeneity of variances assumptions.
→ Alternative Tests: Using non-parametric tests like the Kruskal-Wallis test, which do not require the normality assumption, can be a solution when normality is violated.
→ Mixed-Effects Models: When independence is violated due to repeated measures or hierarchical data structures, mixed-effects models or repeated measures ANOVA can be used to appropriately account for the dependency.
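
The checks and the non-parametric fallback described above can be sketched with SciPy. This is a minimal illustration, not a full diagnostic workflow: the three samples are made-up numbers, and the group names are placeholders. `shapiro`, `levene`, and `kruskal` are functions in `scipy.stats`.

```python
from scipy.stats import shapiro, levene, kruskal

# Hypothetical weight-loss values (kg) for three diet groups (illustrative only)
groups = {
    "A": [2.1, 3.4, 2.9, 3.8, 2.5, 3.1],
    "B": [4.0, 4.6, 3.9, 5.2, 4.4, 4.8],
    "C": [1.0, 6.5, 2.2, 7.9, 0.4, 5.6],  # deliberately more spread out
}

# Normality: Shapiro-Wilk test within each group (index 1 of the result is the p-value)
normality_p = {name: shapiro(values)[1] for name, values in groups.items()}

# Homogeneity of variances: Levene's test across the groups
levene_stat, levene_p = levene(*groups.values())

# Non-parametric alternative to one-way ANOVA: Kruskal-Wallis test
kruskal_stat, kruskal_p = kruskal(*groups.values())

for name, p in normality_p.items():
    print(f"Shapiro-Wilk p-value, group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
print(f"Kruskal-Wallis p-value: {kruskal_p:.3f}")
```

A small p-value from Shapiro-Wilk or Levene's test signals a possible violation of normality or equal variances; with samples this small, these tests have little power, so plots remain useful alongside them.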



Q2. What are the three types of ANOVA, and in what situations would each be used?

In [None]:
#Ans2.

# ANOVA (Analysis of Variance) can be categorized into three main types: One-Way ANOVA, Two-Way ANOVA, and Repeated Measures ANOVA. Each type is used in different situations based on the study design and the number of factors being analyzed.

# 1. One-Way ANOVA

→ Description: One-Way ANOVA is used to compare the means of three or more independent (unrelated) groups based on one independent variable (factor).

→  Use Case: When you have one categorical independent variable with more than two levels (groups) and one continuous dependent variable.

→  Example: A researcher wants to compare the average test scores of students from three different teaching methods (traditional, online, and hybrid).
→  The independent variable is the teaching method (with three levels), and the dependent variable is the test scores.

# 2. Two-Way ANOVA

→  Description: Two-Way ANOVA is used to compare the means among groups that are split on two independent variables (factors).
→  It can also evaluate the interaction effect between the two factors.

→  Use Case: When you have two categorical independent variables and one continuous dependent variable.
→  This type of ANOVA helps in understanding both the main effects of each factor and the interaction effect between them.

→  Example: A researcher wants to study the effects of diet type (vegetarian, non-vegetarian) and exercise frequency (none, moderate, high) on weight loss.
→  The independent variables are diet type and exercise frequency, and the dependent variable is weight loss.

# 3. Repeated Measures ANOVA

→ Description: Repeated Measures ANOVA is used when the same subjects are measured multiple times under different conditions or at different time points.
→  This type accounts for the correlation between the repeated measures on the same subjects.

→  Use Case: When you have one categorical independent variable with more than two levels, but the same subjects are used in each level.
→  This is common in longitudinal studies or crossover designs.

→  Example: A researcher wants to compare the effect of a drug on blood pressure measured at three different time points (baseline, 1 month, 3 months) in the same group of patients.
→  The independent variable is the time point, and the dependent variable is blood pressure.


# Summary :

→ One-Way ANOVA: Used for comparing means across multiple independent groups based on one factor.
→ Example: Comparing test scores of students from three different teaching methods.
→  Two-Way ANOVA: Used for comparing means across groups based on two factors and understanding their interaction.
→  Example: Studying the effect of diet type and exercise frequency on weight loss.
→  Repeated Measures ANOVA: Used for comparing means when the same subjects are measured multiple times under different conditions.
→  Example: Measuring the effect of a drug on blood pressure at different time points in the same patients.



Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
#Ans3.

# Partitioning of variance in ANOVA is a fundamental concept that involves dividing the total variability in the data into components associated with different sources.
#  This helps in understanding how much of the total variability is explained by the factors being studied and how much is due to random error. Understanding this concept is crucial for interpreting the results of an ANOVA test.

# Partitioning of Variance :

# 1.Total Sum of Squares (SST):

→ Represents the total variability in the data.
→  Calculated as the sum of the squared differences between each observation and the overall mean.

Formula :

→  SST = ∑(x - x̄ )^2

# 2.Between-Group Sum of Squares (SSB):

→  Represents the variability between the group means.
→  Calculated as the sum of the squared differences between each group mean and the overall mean, weighted by the number of observations in each group.

Formula :

→  SSB = ∑ n_i (x̄_i - x̄)^2

Where :
→ n_i is the number of observations in group i,
→ x̄_i is the mean of group i,
→ x̄ is the overall mean

# 3.Within-Group Sum of Squares (SSW):

→  Represents the variability within each group.
→  Calculated as the sum of the squared differences between each observation and its respective group mean.

Formula :

→  SSW = ∑∑(X_ij - x̄_i)^2

→ Where X_ij is observation j in group i. The three quantities satisfy SST = SSB + SSW.

##  Importance of Partitioning Variance

# 1.Understanding Sources of Variability:

→ Partitioning variance helps identify the contributions of different sources (between groups and within groups) to the total variability.
→  This provides insights into the factors influencing the dependent variable.

# 2.Calculating the F-Statistic:

→  The F-statistic is used to determine if the observed variability between groups is significantly greater than the variability within groups.

# Formula for F-statistic :

→  F = MSB / MSW

→  Where MSB (Mean square between) is SSB/df_B , and MSW (Mean square within) is SSW/df_W.

# Degrees of freedom : df_B = k - 1 (k is the number of groups) and df_W = N - k (N is the total number of observations).

# 3. Hypothesis Testing :

→ By partitioning the variance, we can perform hypothesis testing to determine if there are significant differences between group means.
→  Null hypothesis (Ho) : All group means are equal.
→ Alternative hypothesis (Ha) : At least one group mean is different.
→  The F-statistic and the corresponding p-value are used to decide whether to reject the null hypothesis.

# 4.Effect Size Calculation :
→ Partitioning variance allows for the calculation of effect size measures, such as eta-squared (η^2), which indicates the proportion of total variance explained by the independent variable.

→ Formula for eta-squared : η^2 = SSB/SST

# Example :
→  Consider a study comparing the test scores of students taught using three different methods (A, B, and C). The steps involved in partitioning variance would be:

→ Calculate the overall mean test score.
→ Calculate the mean test score for each teaching method.
→ Compute the SST, SSB, and SSW using the formulas above.
→ Calculate the F-statistic and compare it to a critical value from the F-distribution to determine if the differences between teaching methods are statistically significant.
→ By partitioning the variance, researchers can understand how much of the variability in test scores is due to differences between teaching methods and how much is due to random variation within each method.
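
The steps in this example can be traced with a few lines of plain Python. This is a sketch under stated assumptions: the scores for methods A, B, and C are invented, and the variable names are illustrative. It computes the total, between-group, and within-group sums of squares, then the F-statistic and eta-squared.

```python
from statistics import mean

# Hypothetical test scores for three teaching methods (illustrative only)
scores = {"A": [4, 5, 6], "B": [5, 6, 7], "C": [6, 7, 8]}

all_values = [x for values in scores.values() for x in values]
grand_mean = mean(all_values)
group_means = {g: mean(values) for g, values in scores.items()}

# Total, between-group, and within-group sums of squares
sst = sum((x - grand_mean) ** 2 for x in all_values)
ssb = sum(len(values) * (group_means[g] - grand_mean) ** 2
          for g, values in scores.items())
ssw = sum((x - group_means[g]) ** 2
          for g, values in scores.items() for x in values)

# Degrees of freedom, mean squares, F-statistic, and effect size
k = len(scores)
n_total = len(all_values)
msb = ssb / (k - 1)
msw = ssw / (n_total - k)
f_stat = msb / msw
eta_squared = ssb / sst

print(f"SST = {sst}, SSB = {ssb}, SSW = {ssw}")
print(f"F = {f_stat}, eta-squared = {eta_squared}")
```

For this data the between-group and within-group parts add up to the total, which is the partition the section describes.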


# Summary :

#→ Partitioning of variance in ANOVA is essential for:

→ Identifying sources of variability.
→ Calculating the F-statistic for hypothesis testing.
→ Understanding the contributions of different factors to the total variability.
→ Evaluating the significance and effect size of the factors being studied.
→ This process allows researchers to make informed decisions based on the data and the relationships between variables.


Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
#Ans4.

# Here is a detailed explanation and corresponding Python code to perform these calculations.

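# Calculate SST , SSE and SSR
# Note: following the question, SSE here denotes the explained (between-group) sum of squares and SSR the residual (within-group) sum of squares. Some textbooks use these abbreviations the other way round.

# 1.Total Sum of Squares (SST) :
# SST =  ∑(X_i - x̄ )^2

# 2.Explained Sum of Squares (SSE) :
# SSE =  ∑ n_i(x̄_i  - x̄ )^2
# n_i is the number of observations in group i,
# x̄_i is the mean of group i.

# 3.Residual Sum of Squares (SSR) :
# SSR =  ∑∑(X_ij - x̄_i )^2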

# X_ij is each observation in group i.
# x̄_i is the mean of group i.


## Python Implementation Code :

import numpy as np
import pandas as pd
from scipy.stats import f_oneway

# Example data :
data = {'Group' : ['A','A','A','B','B','B','C','C','C'],
        'Value' : [4,5,6,5,6,7,6,7,8]}

# Create Dataframe
df = pd.DataFrame(data)

# Calculate the overall mean
overall_mean = df['Value'].mean()

# Calculate the group means
group_means = df.groupby('Group')['Value'].mean()

# Total Sum of Squares (SST)
df['SST'] = (df['Value'] - overall_mean) ** 2
SST = df['SST'].sum()

# Explained Sum of Squares (SSE): each group's size times the squared deviation of its mean from the overall mean
group_sizes = df.groupby('Group')['Value'].count()
SSE = (group_sizes * (group_means - overall_mean) ** 2).sum()

#Residual Sum of Squares (SSR)
df = df.merge(group_means, on = 'Group', suffixes = ('','_group_mean'))
df['SSR'] = (df['Value']- df['Value_group_mean'])**2
SSR = df['SSR'].sum()

print(f"Total Sum of Squares (SST) : {SST}")
print(f"Explained Sum of Squares (SSE) : {SSE}")
print(f"Residual Sum of Squares (SSR) : {SSR}")


# Explanation of the Code

# 1.Data Preparation:

#Create a DataFrame df with the example data.

#2.Overall Mean:

→ Calculate the overall mean of all observations.

#3.Group Means:

→ Calculate the mean for each group.

# 4.Total Sum of Squares (SST):

→ Calculate the squared differences between each observation and the overall mean.
→ Sum these squared differences to get SST.

# 5.Explained Sum of Squares (SSE):

→ Calculate the squared differences between each group mean and the overall mean, weighted by the number of observations in each group.
→ Sum these weighted squared differences to get SSE.

# 6.Residual Sum of Squares (SSR):

→ Calculate the squared differences between each observation and its respective group mean.
→ Sum these squared differences to get SSR.

# Results :
→ When you run the code, it will output the values of SST, SSE, and SSR, which represent the total variability,
→  the variability explained by the groups, and the residual variability within the groups, respectively.


Total Sum of Squares (SST) : 12.0
Explained Sum of Squares (SSE) : 6.0
Residual Sum of Squares (SSR) : 6.0


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
#Ans5.

# To calculate the main effects and interaction effect in a two-way ANOVA using Python, you can use libraries such as 'statsmodels' and 'pandas'.
# Here is a detailed explanation and corresponding Python code.

# Steps to Calculate Main Effects and Interaction Effects

# 1. Main Effects :
#The main effects are the effects of each independent variable on the dependent variable, averaged over the levels of the other independent variable.
#For example, if you have two factors, 'A' and 'B', the main effect of 'A' is the effect of 'A' averaged over all levels of 'B', and vice versa.

# 2. Interaction Effects :

#The interaction effect is the combined effect of two factors, indicating whether the effect of one factor depends on the level of the other factor.

## Python Implementation : How to perform a two-way ANOVA using 'statsmodels' :


import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
data = {
    'Factor_A': ['Low', 'Low', 'Low', 'High', 'High', 'High', 'Low', 'Low', 'Low', 'High', 'High', 'High'],
    'Factor_B': ['Low', 'Low', 'High', 'Low', 'Low', 'High', 'High', 'High', 'Low', 'High', 'High', 'Low'],
    'Value': [4, 5, 6, 5, 6, 7, 6, 7, 8, 8, 7, 6]
}

# Create DataFrame
df = pd.DataFrame(data)

# Fit the model (both main effects plus their interaction)
model = ols('Value ~ C(Factor_A) + C(Factor_B) + C(Factor_A):C(Factor_B)', data = df).fit()

# Perform  ANOVA :
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)


## Explanation of the Code :-

#1.Data Preparation:

→ Create a DataFrame df with the example data, including two factors (Factor_A and Factor_B) and the dependent variable (Value).

#2.Fitting the Model:

→ Use the ols (Ordinary Least Squares) function from statsmodels.formula.api to define the model. The formula Value ~ C(Factor_A) + C(Factor_B) + C(Factor_A):C(Factor_B) specifies the main effects of Factor_A and Factor_B, and their interaction effect (C(Factor_A):C(Factor_B)).
→ The C() function is used to indicate that Factor_A and Factor_B are categorical variables.

# 3.Performing ANOVA:

→ Use sm.stats.anova_lm to perform the ANOVA on the fitted model. The typ=2 argument specifies the type of sums of squares to use (Type II).

# 4.Results:

→ The anova_table contains the ANOVA results, including the sum of squares, degrees of freedom, F-statistic, and p-values for the main effects and interaction effect.

→ sum_sq: Sum of squares for each source of variation (Factor_A, Factor_B, interaction, and residuals).
→ df: Degrees of freedom associated with each source of variation.

→ F: F-statistic for each source of variation.
→  PR(>F): p-value associated with each F-statistic.
→  This table provides the information needed to assess the significance of the main effects and interaction effect.
→  If the p-value for a factor or interaction is below a chosen significance level (e.g., 0.05), it indicates a statistically significant effect.
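
Because the example data are balanced (three observations in each of the four cells), the sums of squares in the ANOVA table can also be reproduced by hand from the marginal and cell means. The sketch below is only valid for balanced designs, and its variable names are illustrative; for unbalanced data the model-based table from statsmodels is the appropriate route.

```python
from collections import defaultdict
from statistics import mean

factor_a = ['Low', 'Low', 'Low', 'High', 'High', 'High',
            'Low', 'Low', 'Low', 'High', 'High', 'High']
factor_b = ['Low', 'Low', 'High', 'Low', 'Low', 'High',
            'High', 'High', 'Low', 'High', 'High', 'Low']
values = [4, 5, 6, 5, 6, 7, 6, 7, 8, 8, 7, 6]

# Collect observations per level of A, per level of B, and per (A, B) cell
by_a, by_b, cells = defaultdict(list), defaultdict(list), defaultdict(list)
for a, b, v in zip(factor_a, factor_b, values):
    by_a[a].append(v)
    by_b[b].append(v)
    cells[(a, b)].append(v)

grand_mean = mean(values)

def between_ss(groups):
    """Sum over groups of size * (group mean - grand mean)^2."""
    return sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())

ss_a = between_ss(by_a)                     # main effect of Factor_A
ss_b = between_ss(by_b)                     # main effect of Factor_B
ss_ab = between_ss(cells) - ss_a - ss_b     # interaction (balanced design only)
ss_resid = sum((x - mean(g)) ** 2 for g in cells.values() for x in g)

print(f"SS(A) = {ss_a:.6f}, SS(B) = {ss_b:.6f}")
print(f"SS(A:B) = {ss_ab:.6f}, SS(Residual) = {ss_resid:.6f}")
```

Dividing each sum of squares by its degrees of freedom and then by the residual mean square gives the F column of the table.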

# Summary :

→ By following these steps and using the provided Python code, you can calculate and interpret the main effects and interaction effects in a two-way ANOVA,
→  providing valuable insights into the relationships between the factors and the dependent variable.


                            sum_sq   df       F    PR(>F)
C(Factor_A)               0.750000  1.0  0.5625  0.474731
C(Factor_B)               4.083333  1.0  3.0625  0.118233
C(Factor_A):C(Factor_B)   0.750000  1.0  0.5625  0.474731
Residual                 10.666667  8.0     NaN       NaN


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In [None]:
#Ans6.

# When conducting a one-way ANOVA, the goal is to determine if there are any statistically significant differences between the means of three or more independent (unrelated) groups.
#The F-statistic and the p-value obtained from the ANOVA test are used to make this determination.

# Given data :
→  F-statistic = 5.23
→  p-value = 0.02

# Hypothesis :
# Null Hypothesis (Ho) : All group means are equal. There is no significant difference between the groups.
# Alternative Hypothesis (Ha) : At least one group mean is different from the others.

# Significance Level (α) :
→  Commonly used significance levels are 0.05, 0.01, 0.10. In this case, let's assume α = 0.05

# Compare p-value to α :
# If the p-value is less than α, we reject the null hypothesis.
# If the p-value is greater than or equal to α, we fail to reject the null hypothesis.

# Conclusion Based on the Given Results:

# F-statistic: 5.23

# The F-statistic indicates the ratio of the variance between the group means to the variance within the groups.
# A higher F-statistic generally indicates a greater degree of variation between the groups relative to the variation within the groups.

# p-value: 0.02

#The p-value indicates the probability of obtaining an F-statistic at least as extreme as 5.23, assuming the null hypothesis is true.
#  A p-value of 0.02 means that, if all group means were truly equal, an F-statistic of 5.23 or larger would occur only about 2% of the time. It is not the probability that the null hypothesis is true.

→ Since the p-value (0.02) is less than the significance level (α = 0.05),
→ we reject the null hypothesis. This means there is statistically significant evidence to suggest that not all group means are equal.
→  In other words, there is a significant difference between the groups.
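
The question does not state the degrees of freedom, so the sketch below assumes a hypothetical design with 3 groups and 15 observations in total (2 and 12 degrees of freedom). When the numerator has exactly two degrees of freedom, the upper tail of the F-distribution has the closed form (1 + 2F/df_within)^(-df_within/2), which shows where a p-value of roughly 0.02 can come from without any library. For other designs, `scipy.stats.f.sf(F, df_between, df_within)` gives the same tail probability.

```python
# Hypothetical design (not given in the question): 3 groups, 15 observations in total
k, n_total = 3, 15
df_between, df_within = k - 1, n_total - k

f_statistic = 5.23
alpha = 0.05

def f_upper_tail_two_numerator_df(f_value, df_den):
    """P(F >= f_value) for an F-distribution with 2 and df_den degrees of freedom."""
    return (1 + 2 * f_value / df_den) ** (-df_den / 2)

p_value = f_upper_tail_two_numerator_df(f_statistic, df_within)
reject_null = p_value < alpha

print(f"df = ({df_between}, {df_within}), p-value = {p_value:.4f}")
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```

Under these assumed degrees of freedom the tail probability falls between 0.02 and 0.03, in line with the reported p-value, and the decision at α = 0.05 is to reject the null hypothesis.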


# Interpretation of the Results:

# Statistical Significance:

→ The test results indicate that there are significant differences between the means of the groups.
→  This suggests that the factor being studied has a significant effect on the dependent variable.

# Practical Significance:

→ While statistical significance indicates that a difference exists, it does not provide information about the magnitude or practical significance of the differences.
→  Further analysis, such as post-hoc tests, can help determine which specific groups are significantly different from each other and the size of these differences.

# Post-Hoc Tests:

# To identify which specific groups differ, you can conduct post-hoc tests (e.g., Tukey's HSD, Bonferroni correction).
# These tests control for the Type I error rate and provide pairwise comparisons between group means.


## Example of Post-Hoc Test in Python :
# Perform a post-hoc test using Tukey's HSD in Python

## Python Code :

import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data :
data = {
    'Group' : ['A','A','A','B','B','B','C','C','C'],
    'Value' : [4,5,6,5,6,7,6,7,8]
}

# Create DataFrame
df = pd.DataFrame (data)

# Perform Tukey's HSD test
tukey = pairwise_tukeyhsd (endog=df['Value'], groups= df['Group'],alpha = 0.05)

print(tukey)

# Conclusion:

→ The one-way ANOVA indicates significant differences between the groups, as evidenced by the F-statistic of 5.23 and a p-value of 0.02.
→ We reject the null hypothesis that all group means are equal.
→ Post-hoc tests should be conducted to determine which specific groups have significant differences in their means and to understand the practical significance of these differences.


Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj   lower  upper  reject
---------------------------------------------------
     A      B      1.0 0.4827 -1.5052 3.5052  False
     A      C      2.0 0.1089 -0.5052 4.5052  False
     B      C      1.0 0.4827 -1.5052 3.5052  False
---------------------------------------------------


Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

In [19]:
#Ans7.

# Handling missing data in a repeated measures ANOVA is crucial because missing data can bias the results and reduce the power of the analysis.
# Several methods can be used to address missing data, each with its advantages and potential consequences.

#Methods for Handling Missing Data:

#1.Listwise Deletion (Complete Case Analysis)
#2.Pairwise Deletion
#3.Mean Substitution
#4.Last Observation Carried Forward (LOCF)
#5.Multiple Imputation
#6.Mixed-Effects Models

# 1. Listwise Deletion (Complete Case Analysis)

# Description:

# Exclude any participant who has any missing data.

# Consequences:

# Advantages: Simple to implement and ensures that all analyses are performed on the same data set.
# Disadvantages: Can lead to a significant loss of data, reducing statistical power and potentially introducing bias if the missing data are not completely random.

# 2. Pairwise Deletion

# Description:

#Use all available data for each analysis, excluding only the missing values.
#Consequences:

#Advantages: Retains more data compared to listwise deletion.
#Disadvantages: Can lead to inconsistencies in the sample size across different analyses and may complicate interpretation.

# 3. Mean Substitution
# Description:

# Replace missing values with the mean of the observed values for that variable.
#Consequences:

# Advantages: Simple and retains all cases.
# Disadvantages: Underestimates the variability and can bias the results by reducing the variability in the data.

# 4. Last Observation Carried Forward (LOCF)
# Description:

# Replace missing values with the last observed value for that participant.
# Consequences:

# Advantages: Retains all participants in the analysis.
# Disadvantages: Can introduce bias if the last observation is not representative of the missing values.

# 5. Multiple Imputation
# Description:

# Replace missing values with a set of plausible values based on the observed data, and perform the analysis on each completed data set, combining the results.
# Consequences:

# Advantages: Accounts for the uncertainty about the missing data and provides more accurate estimates and standard errors.
# Disadvantages: Computationally intensive and requires assumptions about the missing data mechanism.

# 6. Mixed-Effects Models
# Description:

# Use all available data by modeling the data with random effects to account for the within-subject correlation.
# Consequences:

# Advantages: Handles unbalanced data and missing values without requiring imputation, providing unbiased estimates if the missing data mechanism is random.
# Disadvantages: More complex to implement and interpret compared to traditional ANOVA.
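
The consequences of the simpler methods can be seen on a tiny data set. The sketch below uses made-up scores for three subjects at three time points, with `None` marking a missing value; the helper `locf` is a hypothetical name. It shows how listwise deletion discards subjects, how mean substitution shrinks the sample variance, and how LOCF fills gaps.

```python
from statistics import mean, variance

# Hypothetical scores per subject at time points T1, T2, T3 (None = missing)
scores = {1: [5, 6, None], 2: [7, 8, 9], 3: [4, None, 5]}

observed = [x for row in scores.values() for x in row if x is not None]

# Listwise deletion: keep only subjects with no missing values
complete_cases = {s: row for s, row in scores.items() if None not in row}

# Mean substitution: replace every missing value with the mean of the observed values
fill_value = mean(observed)
mean_imputed = [x if x is not None else fill_value
                for row in scores.values() for x in row]

def locf(row):
    """Carry the last observed value forward over missing entries."""
    filled, last = [], None
    for x in row:
        last = x if x is not None else last
        filled.append(last)
    return filled

locf_scores = {s: locf(row) for s, row in scores.items()}

print(f"Subjects kept by listwise deletion: {list(complete_cases)}")
print(f"Variance of observed values: {variance(observed):.3f}")
print(f"Variance after mean substitution: {variance(mean_imputed):.3f}")
print(f"LOCF-filled scores: {locf_scores}")
```

Only one of the three subjects survives listwise deletion here, and the variance after mean substitution is smaller than the variance of the observed values, which is the underestimation described above.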


## Implementing Mixed-Effects Models in Python :-

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import mixedlm

#Example data
data = {'Subject' : [1,1,1,2,2,2,3,3,3],
        'Time' : ['T1','T2','T3','T1','T2','T3','T1','T2','T3'],
        'Score' : [5,6,None, 7,8,9,4, None,5]

}

#Create DataFrame
df = pd.DataFrame(data)

#Drop rows with a missing Score so that the data and the groups column have the same length.
#(The formula interface drops rows with missing values, so a full-length groups column no longer lines up and raises an IndexError.)
df_complete = df.dropna(subset=['Score'])

#Fit the mixed-effects model (random intercept per subject) on all available observations
#With so few observations, statsmodels may emit convergence warnings.
model = mixedlm("Score ~ Time", df_complete, groups=df_complete["Subject"])
result = model.fit()

print(result.summary())


# Consequences of Using Different Methods :

# Listwise Deletion:
# →  Loss of data leads to reduced power and potential bias.

# Pairwise Deletion:
# → Inconsistency in sample sizes can lead to problems in interpretation.

# Mean Substitution:
# →  Reduces variability and can bias results, leading to inaccurate estimates of the effects.

# LOCF:
# → Can introduce bias if the last observation is not a good estimate of the missing data.

# Multiple Imputation:
# →  Provides more accurate estimates but is computationally intensive and requires assumptions about the missing data mechanism.

# Mixed-Effects Models:
# → Handles missing data well and is unbiased if the data are missing at random (MAR), but is more complex to implement and interpret.

# Conclusion :

# → Choosing the appropriate method to handle missing data in a repeated measures ANOVA is crucial for maintaining the integrity of your analysis.
# →  By carefully assessing the extent and pattern of missing data and selecting the most suitable method, you can limit bias and preserve statistical power.


Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

In [16]:
#Ans8.

# After performing an ANOVA and finding a significant result, indicating that there are differences between group means, it is often necessary to determine which specific groups are different from each other. This is where post-hoc tests come into play. Here are some common post-hoc tests and when to use each one, along with an example scenario.

# Common Post-Hoc Tests :

#1. Tukey's Honestly Significant Difference (HSD) Test
#2. Bonferroni Correction
# 3.Scheffé's Test
# 4.Dunnett's Test
# 5.Fisher's Least Significant Difference (LSD) Test
# 6.Holm-Bonferroni Method

# 1. Tukey's Honestly Significant Difference (HSD) Test
# Use:

#→  Best for pairwise comparisons when you want to control the family-wise error rate.
#→ Appropriate when you have equal sample sizes in each group, but can also be used with unequal sample sizes.

# Example:
#→ Comparing the average test scores of students from four different teaching methods to identify which methods differ from each other.

# 2. Bonferroni Correction
# Use:

#→ Controls the family-wise error rate by adjusting the significance level for each individual test.
#→ Suitable for a small number of comparisons due to its conservative nature.

# 3. Scheffé's Test
# Use:

#→ More flexible than Tukey’s HSD as it can be used for both pairwise and complex comparisons.
#→ Suitable for unplanned comparisons.

#Example:
#→ Comparing multiple treatment effects in a clinical trial where some treatments might be combined into one group.

# 4. Dunnett's Test
# Use:

#→ Compare each treatment group to a single control group.
#→ Suitable when you have a control group and several treatment groups.

# Example:
#→ Testing the effectiveness of new drugs compared to a placebo.

# 5. Fisher's Least Significant Difference (LSD) Test
# Use:

#→ Simple pairwise comparisons without adjusting for multiple comparisons.
#→ Suitable for exploratory analysis, but not recommended due to high Type I error rate.

# Example:
#→ Initial comparison of different fertilizers on plant growth.

# 6. Holm-Bonferroni Method
# Use:

#→ Adjusts p-values to control the family-wise error rate while maintaining more power than the Bonferroni correction.
#→ Suitable for multiple comparisons with a higher likelihood of detecting significant differences.
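
The difference between the Bonferroni and Holm-Bonferroni adjustments is easiest to see on a few numbers. The sketch below implements both by hand for three hypothetical pairwise p-values; the function names are illustrative. In practice, `multipletests` in `statsmodels.stats.multitest` offers both methods.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def holm(p_values):
    """Holm step-down adjustment: the smallest p-value is multiplied by m,
    the next by m - 1, and so on, keeping the adjusted values non-decreasing."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values from three pairwise comparisons
raw_p = [0.01, 0.04, 0.03]

bonf_p = bonferroni(raw_p)
holm_p = holm(raw_p)

print("Bonferroni:", [round(p, 4) for p in bonf_p])
print("Holm:      ", [round(p, 4) for p in holm_p])
```

Each Holm-adjusted p-value is no larger than its Bonferroni counterpart, which is why Holm keeps the same family-wise error control with more power.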


# Example Scenario
# Situation:

#→ A researcher wants to compare the effectiveness of four different study techniques on students' test scores.

# Step:
#→ Conduct a one-way ANOVA to determine if there are any overall differences between the techniques.
#→ If the ANOVA is significant, use Tukey's HSD test to identify which specific techniques differ from each other.

# Conclusion :

#→ Post-hoc tests are crucial for understanding specific group differences after finding a significant ANOVA result.
#→  Each test has its strengths and weaknesses, and the choice of test depends on the specific context and research questions.


# Compare the average test scores of students from four different teaching methods to identify which methods differ from each other
## Use  Python Code

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data
data = {
    'Method': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'D', 'D', 'D'],
    'Score': [82, 85, 88, 78, 80, 84, 90, 92, 94, 76, 78, 79]
}

df =pd.DataFrame(data)

# Perform Tukey's HSD test
tukey = pairwise_tukeyhsd(endog=df['Score'], groups=df['Method'], alpha=0.05)
print(tukey)


## Scheffé's Test : Python Code
# Comparing multiple treatment effects in a clinical trial where some treatments might be combined into one group.



from scipy.stats import f

# Example F-statistic of a contrast (contrast sum of squares / within-group mean square) and degrees of freedom
f_statistic = 4.35
df_between = 3
df_within = 24
alpha = 0.05

# Scheffé's critical value: (k - 1) times the upper-alpha quantile of F(k - 1, N - k)
critical_value = df_between * f.ppf(1 - alpha, df_between, df_within)
print(critical_value)

# The contrast is significant under Scheffé's criterion only if its F-statistic exceeds the critical value
print(f_statistic > critical_value)


 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B  -4.3333 0.2206 -10.8264  2.1597  False
     A      C      7.0 0.0352    0.507  13.493   True
     A      D  -7.3333 0.0281 -13.8264 -0.8403   True
     B      C  11.3333 0.0023   4.8403 17.8264   True
     B      D     -3.0 0.4906   -9.493   3.493  False
     C      D -14.3333 0.0005 -20.8264 -7.8403   True
-----------------------------------------------------


Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [None]:
#Ans9.

# Let's perform a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of three diets.
# Here are the steps to conduct this analysis:

# 1.Generate or load the data.
# 2.Conduct the one-way ANOVA.
# 3.Report the F-statistic and p-value.
# 4.Interpret the results.


# Step 1: Generate or Load the Data :

# Assume we have the weight loss data for 50 participants assigned to one of the three diets (A, B, or C). Here’s an example dataset:

# Python Code :

import pandas as pd
import numpy as np

# Random seed for reproducibility
np.random.seed(42)

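# Simulated weight loss data for 50 participants
diet_means = {'A': 5, 'B': 7, 'C': 6}  # mean weight loss around 5, 7, 6 kg respectively
diets = np.random.choice(['A', 'B', 'C'], 50)
data = {
    'Diet': diets,
    # One draw per participant, centred on the mean of that participant's diet
    'WeightLoss': np.random.normal(loc=[diet_means[d] for d in diets], scale=2)
}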

df = pd.DataFrame(data)
df.loc[df['Diet'] == 'A', 'WeightLoss'] += np.random.normal(0, 1, df[df['Diet'] == 'A'].shape[0])
df.loc[df['Diet'] == 'B', 'WeightLoss'] += np.random.normal(0, 1, df[df['Diet'] == 'B'].shape[0])
df.loc[df['Diet'] == 'C', 'WeightLoss'] += np.random.normal(0, 1, df[df['Diet'] == 'C'].shape[0])

# Display the first few rows of the data
print(df.head())


# Step 2: Conduct the One-way ANOVA -

# Use the 'statsmodels' library to perform the one-way ANOVA.
## Python Code :

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Perform the ANOVA
model = ols('WeightLoss ~ Diet', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Display the ANOVA table
print(anova_table)


## Report the F-statistic and p-value (Python Code) :

# Select the 'Diet' row by label rather than by position
f_statistic = anova_table.loc['Diet', 'F']
p_value = anova_table.loc['Diet', 'PR(>F)']
print(f"F-statistic: {f_statistic:.2f}")
print(f"p-value: {p_value:.4f}")

# Interpretation of results
if p_value < 0.05:
    print("There is a significant difference in mean weight loss between the diets.")
else:
    print("There is no significant difference in mean weight loss between the diets.")


# Step 3: Report the F-Statistic and P-Value :

→ The output of the ANOVA table will include the F-statistic and p-value.

# Step 4: Interpret the Results :
→ Interpret the F-statistic and p-value to determine if there are significant differences between the diets.


# Interpretation :

→ If the p-value is less than 0.05, it suggests that there are significant differences in mean weight loss between at least two of the diets.
→ The F-statistic provides a measure of the ratio of the variance between the group means to the variance within the groups. A higher F-statistic generally indicates that there is more variability between the groups compared to within the groups, supporting the conclusion of significant differences.
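
# The ratio described above can be made concrete with a small worked example. The sketch below computes the
# F-statistic by hand from the between-group and within-group sums of squares and cross-checks it against
# scipy.stats.f_oneway. The weight-loss values and the variable names (groups, f_manual, ...) are hypothetical
# and chosen only for illustration; they are not the simulated data used above.

```python
import numpy as np
from scipy import stats

# Hypothetical weight-loss values (kg) for three diets; illustrative only.
groups = {
    "A": np.array([4.0, 5.0, 6.0, 5.0]),
    "B": np.array([7.0, 8.0, 6.0, 7.0]),
    "C": np.array([6.0, 5.0, 7.0, 6.0]),
}

all_values = np.concatenate(list(groups.values()))
grand_mean = all_values.mean()
k = len(groups)            # number of groups
n_total = all_values.size  # total number of observations

# Between-group sum of squares: distance of each group mean from the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups.values())
# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())

# Mean squares divide each sum of squares by its degrees of freedom.
ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)

f_manual = ms_between / ms_within
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)  # upper-tail probability of the F distribution

# Cross-check against SciPy's one-way ANOVA.
f_scipy, p_scipy = stats.f_oneway(*groups.values())

print(f"manual F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy  F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```

# Both routes should agree, which shows that the F value in the ANOVA table is the between-group mean square
# divided by the within-group mean square.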


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
#Ans10.

# To conduct a two-way ANOVA using Python, we need to consider two factors: the software program (Program A, Program B, Program C) and the employee experience level (novice vs. experienced). We will generate a dataset, perform the ANOVA, and interpret the results.

# Step-by-Step Approach :

#1.Generate the data.
#2.Conduct the two-way ANOVA.
#3.Report the F-statistics and p-values.
#4.Interpret the results.

# Step 1: Generate the Data :
# Let's create a synthetic dataset of 30 employees, with each assigned to one of the programs and labeled as either novice or experienced.

# Python Code :

import pandas as pd
import numpy as np

# Random seed for reproducibility
np.random.seed(42)

# Generate data
n = 30
data = {
    'Employee': np.arange(n),
    'Program': np.random.choice(['A', 'B', 'C'], n),
    'Experience': np.random.choice(['Novice', 'Experienced'], n),
    'Time': np.random.normal(loc=0, scale=1, size=n)  # to be adjusted based on groups
}

df = pd.DataFrame(data)

# Adjust mean task completion time based on program and experience
# (same cell order and same random draws as writing out one statement per cell)
group_means = {
    ('A', 'Novice'): 25, ('A', 'Experienced'): 20,
    ('B', 'Novice'): 30, ('B', 'Experienced'): 25,
    ('C', 'Novice'): 28, ('C', 'Experienced'): 22,
}
for (program, experience), mean_time in group_means.items():
    mask = (df['Program'] == program) & (df['Experience'] == experience)
    df.loc[mask, 'Time'] += np.random.normal(mean_time, 5, mask.sum())

# Display the first few rows of the data
print(df.head())

# Step 2: Conduct the Two-Way ANOVA :-
# We will use the statsmodels library to perform the two-way ANOVA.

import statsmodels.api as sm
from statsmodels.formula.api import ols

# Perform the two-way ANOVA
model = ols('Time ~ Program * Experience', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Display the ANOVA table
print(anova_table)


# Report F-statistics and p-values

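# Select rows by label rather than by position
f_statistic_program = anova_table.loc['Program', 'F']
p_value_program = anova_table.loc['Program', 'PR(>F)']

f_statistic_experience = anova_table.loc['Experience', 'F']
p_value_experience = anova_table.loc['Experience', 'PR(>F)']

f_statistic_interaction = anova_table.loc['Program:Experience', 'F']
p_value_interaction = anova_table.loc['Program:Experience', 'PR(>F)']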

print(f"F-statistic for Program: {f_statistic_program:.2f}, p-value: {p_value_program:.4f}")
print(f"F-statistic for Experience: {f_statistic_experience:.2f}, p-value: {p_value_experience:.4f}")
print(f"F-statistic for Interaction: {f_statistic_interaction:.2f}, p-value: {p_value_interaction:.4f}")

# Interpretation of results
if p_value_program < 0.05:
    print("There is a significant main effect of the Program on task completion time.")
else:
    print("There is no significant main effect of the Program on task completion time.")

if p_value_experience < 0.05:
    print("There is a significant main effect of Experience on task completion time.")
else:
    print("There is no significant main effect of Experience on task completion time.")

if p_value_interaction < 0.05:
    print("There is a significant interaction effect between Program and Experience on task completion time.")
else:
    print("There is no significant interaction effect between Program and Experience on task completion time.")



# Step 3: Report the F-Statistics and P-Values :
→ The output of the ANOVA table will include the F-statistics and p-values for the main effects (Program and Experience) and their interaction.

# Step 4: Interpret the Results :
→ We interpret the F-statistics and p-values to determine if there are significant main effects or interaction effects.


# Interpretation :

→ Main Effect of Program: If the p-value for the Program is less than 0.05, it indicates that there is a significant difference in the average task completion time among the different software programs.
→ Main Effect of Experience: If the p-value for Experience is less than 0.05, it suggests that the average task completion time differs significantly between novice and experienced employees.
→ Interaction Effect: If the p-value for the interaction between Program and Experience is less than 0.05, it means that the effect of the software program on task completion time depends on the experience level of the employees.
→ This analysis helps the company understand not only if there are differences between the software programs and experience levels but also if the impact of the software programs varies depending on the experience level of the employees.
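
# To see what an interaction looks like in practice, it helps to compare cell means. The sketch below builds a
# small table of mean times per Program x Experience cell and the novice-minus-experienced gap within each program.
# If the gap were the same for every program, the two factors would act additively (no interaction); a gap that
# changes from program to program is the pattern an interaction term picks up. The data frame `toy` and its
# values are hypothetical, not the simulated data used above.

```python
import pandas as pd

# Hypothetical task times (minutes); illustrative values only.
toy = pd.DataFrame({
    "Program":    ["A", "A", "B", "B", "C", "C"] * 2,
    "Experience": ["Novice"] * 6 + ["Experienced"] * 6,
    "Time":       [25, 27, 30, 32, 28, 30,   # novices
                   20, 22, 25, 27, 18, 20],  # experienced
})

# Mean time for each Program x Experience cell.
toy_means = toy.pivot_table(index="Program", columns="Experience",
                            values="Time", aggfunc="mean")

# Novice-minus-experienced gap within each program.
gap = toy_means["Novice"] - toy_means["Experienced"]

print(toy_means)
print(gap)
```

# In this toy table the gap is equal for Programs A and B but larger for Program C, which is the kind of
# pattern a significant Program:Experience term would point to.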



   Employee Program   Experience       Time
0         0       C  Experienced  21.858165
1         1       A  Experienced  23.770935
2         2       C  Experienced  11.460449
3         3       C  Experienced  22.753919
4         4       A  Experienced  26.767703
                        sum_sq    df         F    PR(>F)
Program             125.055930   2.0  2.314721  0.120443
Experience           77.086828   1.0  2.853675  0.104117
Program:Experience   71.278863   2.0  1.319335  0.286012
Residual            648.316159  24.0       NaN       NaN
F-statistic for Program: 2.31, p-value: 0.1204
F-statistic for Experience: 2.85, p-value: 0.1041
F-statistic for Interaction: 1.32, p-value: 0.2860
There is no significant main effect of the Program on task completion time.
There is no significant main effect of Experience on task completion time.
There is no significant interaction effect between Program and Experience on task completion time.


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [None]:
#Ans11.

# To conduct a two-sample t-test in Python to determine if there are any significant differences in test scores between the control group (traditional teaching method) and
# the experimental group (new teaching method), follow these steps:

# 1.Generate or load the data.
# 2.Conduct the two-sample t-test.
# 3.Report the t-statistic and p-value.
# 4.Interpret the results.
# 5.If significant, follow up with a post-hoc test.


# Step 1: Generate or Load the Data
# Let's assume we have test scores for 100 students, with 50 students in each group.

import pandas as pd
import numpy as np

# Random seed for reproducibility
np.random.seed(42)

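# Generate data
n = 100
data = {
    # Note: the labels are drawn at random, independently of the order of the scores below,
    # so the simulated means (75 vs. 80) are not tied to the labels and the group sizes are
    # not exactly 50 each. To tie them together, use np.repeat(['Control', 'Experimental'], n // 2).
    'Group' : np.random.choice(['Control', 'Experimental'], n),
    'TestScore' : np.concatenate([
        np.random.normal(75, 10, n//2),  # scores simulated with mean 75
        np.random.normal(80, 10, n//2)   # scores simulated with mean 80
    ])
}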

df = pd.DataFrame(data)

# Display the first rows of the data
print(df.head())


# Step 2: Conduct the Two-Sample T-Test
# We will use the 'scipy.stats' library to perform the t-test

from scipy import stats

# Separate the data into two groups
control_group = df[df['Group']== 'Control']['TestScore']
experimental_group = df[df['Group'] == 'Experimental']['TestScore']

# Perform the Two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Display the results
print(f"T-statistic : {t_statistic:.2f}")
print(f"p-value: {p_value:.4f}")


# Interpretation of results
if p_value < 0.05:
    print("There is a significant difference in test scores between the control and experimental groups.")
else:
    print("There is no significant difference in test scores between the control and experimental groups.")


# Interpretation :

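→ T-Statistic: The t-statistic measures the size of the difference between the group means relative to the variation in the sample data.
→ P-Value: The p-value is the probability of observing a difference at least as extreme as the one in the sample if the null hypothesis (no difference between the groups) were true. A p-value less than 0.05 is typically considered statistically significant.
→ If the p-value is less than 0.05, you can conclude that there is a significant difference in test scores between the control and experimental groups,
→ suggesting that the new teaching method has a different effect compared to the traditional method.
→ Post-Hoc Test: With only two groups, a significant t-test already identifies which groups differ, so no separate post-hoc test is needed. Here the p-value (0.4739) is above 0.05, so no follow-up is required in any case.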
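
# Two common companions to the two-sample t-test are Welch's variant, which does not assume equal variances,
# and Cohen's d, which expresses the mean difference in units of the pooled standard deviation. The sketch below
# shows both. The score arrays and variable names are hypothetical and serve only as an illustration; they are
# not the simulated data used above.

```python
import numpy as np
from scipy import stats

# Hypothetical test scores; illustrative values only.
control = np.array([70.0, 75.0, 80.0, 72.0, 78.0])
experimental = np.array([82.0, 85.0, 79.0, 88.0, 81.0])

# Student's t-test (assumes equal variances) and Welch's t-test (does not).
t_student, p_student = stats.ttest_ind(control, experimental)
t_welch, p_welch = stats.ttest_ind(control, experimental, equal_var=False)

# Cohen's d: mean difference divided by the pooled standard deviation.
n1, n2 = control.size, experimental.size
pooled_var = ((n1 - 1) * control.var(ddof=1)
              + (n2 - 1) * experimental.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = (experimental.mean() - control.mean()) / np.sqrt(pooled_var)

print(f"Student t = {t_student:.3f}, p = {p_student:.4f}")
print(f"Welch   t = {t_welch:.3f}, p = {p_welch:.4f}")
print(f"Cohen's d = {cohens_d:.3f}")
```

# With equal group sizes the two t-statistics coincide and only the degrees of freedom (and hence the p-values)
# can differ; with unequal sizes and unequal variances, Welch's version is usually the safer choice.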



          Group  TestScore
0       Control  82.384666
1  Experimental  76.713683
2       Control  73.843517
3       Control  71.988963
4       Control  60.214780
T-statistic : 0.72
p-value: 0.4739
There is no significant difference in test scores between the control and experimental groups.


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
#Ans12.

# To conduct a repeated measures ANOVA using Python, we need to consider that the same days are used to record sales for all three stores.
# This implies that the sales data are dependent, making repeated measures ANOVA appropriate.

# Step-by-Step Approach :

#1.Generate or load the data.
#2.Conduct the repeated measures ANOVA.
#3.Report the results.
#4.If significant, follow up with a post-hoc test.


# Step 1: Generate or Load the Data:
# Let's create a synthetic dataset of sales for 30 days for three stores (Store A, Store B, Store C).


# Python Code :

import pandas as pd
import numpy as np

# Random seed for reproducibility
np.random.seed(42)

# Generate data
days = 30
sales_data = {
    'Day' : np.arange(1, days + 1),
    'StoreA' : np.random.normal(200,20, days),
    'StoreB' : np.random.normal(220,20, days),
    'StoreC' : np.random.normal(220, 20, days)
}

df = pd.DataFrame(sales_data)

# Display the first few rows of the data
print(df.head())


# Conduct the Repeated Measures ANOVA
# We will use the 'statsmodels' library to perform the repeated measures ANOVA.

import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM

# Melt the dataframe to long format
df_long = pd.melt(df, id_vars= ['Day'], value_vars= ['StoreA', 'StoreB', 'StoreC'],
                  var_name = 'Store', value_name = 'Sales')

# Display the first few rows of the long format data
print (df_long.head())

# Perform the repeated measures ANOVA
aovrm = AnovaRM (df_long, 'Sales', 'Day', within= ['Store'])
res = aovrm.fit()

# Display the results
print(res)


#  Report the Results :
# The output will include the F-statistic and p-value for the main effect of the store.

# Post-Hoc Test :
# If the ANOVA results are significant, we can perform pairwise comparisons using a post-hoc test like Tukey's HSD.
# statsmodels does not provide a Tukey's HSD variant for repeated measures, so we use the pairwise_tukeyhsd function from statsmodels.stats.multicomp.
# Note that pairwise_tukeyhsd treats the three stores as independent groups and ignores the pairing by day, so its results are only approximate for this design.

# Here is the code to interpret the repeated measures ANOVA and run the post-hoc test if necessary.

# Interpretation of results (Python Code) :
from statsmodels.stats.multicomp import pairwise_tukeyhsd  # required for the post-hoc test

# Select the 'Store' row by label rather than by position
p_value_store = res.anova_table.loc['Store', 'Pr > F']

if p_value_store < 0.05:
    print("There is a significant difference in sales between the stores.")
else:
    print("There is no significant difference in sales between the stores.")

# Post-hoc test if significant
if p_value_store < 0.05:
    posthoc = pairwise_tukeyhsd(df_long['Sales'], df_long['Store'])
    print(posthoc)


# Interpretation :

→ ANOVA Results : The F-statistic and p-value will tell us if there are significant differences in sales between the three stores.
→ Post-Hoc Test : If the ANOVA results are significant, the post-hoc test will indicate which stores differ significantly from each other.

→ AnovaRM : The 'AnovaRM' class is used to perform the repeated measures ANOVA.
# 'pairwise_tukeyhsd' : It is used for the post-hoc analysis. The output shows whether there are significant differences in sales between the stores and which pairs of stores differ significantly.
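
# Because every store is measured on the same days, a post-hoc procedure that respects the pairing is often
# preferred over one that treats the stores as independent groups. One common option is paired t-tests for each
# pair of stores with a Bonferroni correction for the number of comparisons. The sketch below is one possible
# way to do this; the data frame `wide`, the seed, and the table name `posthoc_table` are assumptions made for
# illustration, and the simulated sales are not the values shown in this notebook.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical daily sales in wide format: one row per day, one column per store.
rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
days = 30
wide = pd.DataFrame({
    "StoreA": rng.normal(200, 20, days),
    "StoreB": rng.normal(220, 20, days),
    "StoreC": rng.normal(220, 20, days),
})

pairs = list(combinations(wide.columns, 2))  # (A, B), (A, C), (B, C)
rows = []
for first, second in pairs:
    # Paired t-test: compares the two stores day by day.
    t_stat, p_raw = stats.ttest_rel(wide[first], wide[second])
    # Bonferroni correction: multiply by the number of comparisons, cap at 1.
    p_adj = min(p_raw * len(pairs), 1.0)
    rows.append({"pair": f"{first} vs {second}", "t": t_stat,
                 "p_raw": p_raw, "p_bonferroni": p_adj})

posthoc_table = pd.DataFrame(rows)
print(posthoc_table)
```

# A pair is then reported as significantly different when its Bonferroni-adjusted p-value is below 0.05.
# Bonferroni is conservative; other corrections such as Holm's method are also commonly used.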


   Day      StoreA      StoreB      StoreC
0    1  209.934283  207.965868  210.416515
1    2  197.234714  257.045564  216.286820
2    3  212.953771  219.730056  197.873301
3    4  230.460597  198.845781  196.075868
4    5  195.316933  236.450898  236.250516
   Day   Store       Sales
0    1  StoreA  209.934283
1    2  StoreA  197.234714
2    3  StoreA  212.953771
3    4  StoreA  230.460597
4    5  StoreA  195.316933
               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store 15.2367 2.0000 58.0000 0.0000

There is a significant difference in sales between the stores.

