In [None]:
# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
# the validity of the results.

# Analysis of Variance (ANOVA) is a statistical technique used to compare means across multiple groups. However, ANOVA comes with certain
# assumptions that need to be met in order to ensure the validity and reliability of the results. Violating these assumptions can lead to incorrect
# conclusions and interpretations. Here are the key assumptions of ANOVA:

# 1. Independence of Observations:
# Each observation should be independent of every other observation, both within its own group and across groups. The value recorded for one
# subject should not influence, or be related to, the value recorded for another.

# 2. Normality:
# The distribution of the residuals (the differences between individual data points and their respective group means) within each group should be 
# approximately normally distributed. The p-value of the F-test is derived under normally distributed errors, so strong departures (heavy skew,
# extreme outliers) can make it unreliable, especially with small samples.

# 3. Homogeneity of Variances (Homoscedasticity):
# The variability (or spread) of the residuals should be roughly equal across all groups. This is known as homogeneity of variances. Violating this 
# assumption can push the actual Type I error rate above or below the nominal level, particularly when group sizes are unequal.

# 4. Equal Sample Sizes (preferred, not required):
# Equal group sizes are not a formal assumption of ANOVA. Balanced designs are preferred because they improve the power of the test and make the
# F-test more robust to unequal variances. If sample sizes differ widely, check the homogeneity of variances assumption with extra care.

# Now, let's look at examples of violations for each assumption:

# 1. Independence of Observations:
# Violation Example: In a study comparing test scores of students from different schools, if students within the same school are given similar 
# preparation, the scores within each school might be correlated, violating independence.

# 2. Normality:
# Violation Example: In a study comparing reaction times of three different age groups, if the data is heavily skewed or has extreme outliers, the 
# normality assumption might be violated.

# 3. Homogeneity of Variances:
# Violation Example: In a study comparing the yields of different types of crops, if the variance of crop yields in one group is much larger than 
# in another group, the assumption of equal variance might be violated.

# 4. Equal Sample Sizes:
# Example: In a study comparing the effectiveness of three different teaching methods, if one method has a much larger number of students than
# the other methods, the design is unbalanced. The ANOVA can still be run, but it becomes more sensitive to unequal variances between the groups.
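
# The normality and equal-variance assumptions above can be screened before running the ANOVA. The following is a minimal sketch, assuming three
# small made-up groups. The Shapiro-Wilk test (scipy.stats.shapiro) on the pooled residuals and Levene's test (scipy.stats.levene) are one
# common choice of checks, not the only one.

```python
import numpy as np
from scipy import stats

# Hypothetical example data (illustrative values only)
groups = {
    "a": np.array([68, 72, 75, 70, 74]),
    "b": np.array([80, 85, 78, 82, 87]),
    "c": np.array([60, 62, 65, 59, 64]),
}

# Residuals: each observation minus its own group mean
residuals = np.concatenate([values - values.mean() for values in groups.values()])

# Normality of residuals (null hypothesis: the residuals are normally distributed)
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances (null hypothesis: all groups share the same variance)
levene_stat, levene_p = stats.levene(*groups.values())

print(f"Shapiro-Wilk: W = {shapiro_stat:.3f}, p = {shapiro_p:.3f}")
print(f"Levene:       W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

# A small p-value from either test suggests the corresponding assumption may not hold. Independence cannot be checked this way; it has to be
# ensured by the study design.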



In [None]:
# Q2. What are the three types of ANOVA, and in what situations would each be used?
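# Types of ANOVA:

# 1. One-way ANOVA: one factor with at least 2 levels, where the levels are independent (different subjects in each group). Used to compare
#    the means of independent groups on a single factor.

# 2. Repeated measures ANOVA: one factor with at least 2 levels, where the levels are dependent (the same subjects are measured under every
#    level). Used when each subject is measured several times, for example before, during, and after a treatment.

# 3. Factorial ANOVA: two or more factors (each with at least 2 levels); the levels can be independent, dependent, or a mix of both. Used when
#    the effects of several factors, and the interactions between them, are of interest.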

In [None]:
# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

# The partitioning of variance in ANOVA refers to the breakdown of the total variability observed in the data into different components that can be
# attributed to specific sources or factors. 

# In ANOVA, the total variability is divided into two main components:

# 1. Between-Group Variability: This component represents the variation in the dependent variable that is due to differences between the various 
# groups or conditions being compared. In other words, it measures how much the group means differ from each other. This is also known as 
# the "treatment effect" or "factor effect."

# 2. Within-Group Variability (or Residual Variability): This component represents the variation within each group or condition. It measures the 
# variation of individual data points around their respective group means. It represents the variability that cannot be explained by the factors 
# being studied.

# The total variability in the data can then be mathematically decomposed into these two components:

# Total Variability = Between-Group Variability + Within-Group Variability

# Understanding the partitioning of variance is important for several reasons:

# 1. Identifying Sources of Variation: By partitioning the total variance, ANOVA allows you to attribute the observed differences in means to 
# different factors. This helps in understanding which factors are contributing to the observed effects.

# 2. Assessing Significance: ANOVA helps assess whether the between-group variability is statistically significant compared to the within-group 
# variability. If the between-group variability is large relative to the within-group variability, it suggests that the factor being studied has 
# a significant effect.

# 3. Interpreting Results: The partitioning of variance provides a clear framework for interpreting ANOVA results. It helps you quantify the 
# relative impact of different factors on the outcome variable.

# 4. Study Design and Improvement: Understanding how different factors contribute to variability can guide future research and experimental
# design. It can help identify which factors might be worth investigating further or controlling for in future studies.
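
# The decomposition "Total = Between-Group + Within-Group" can be verified numerically. The sketch below uses three tiny made-up groups
# (an assumption for illustration) and computes each sum of squares directly from its definition, rather than by subtraction.

```python
import numpy as np

# Hypothetical data: three groups of three observations each
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([1.0, 2.0, 3.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total: every observation against the grand mean
ss_total = np.sum((all_values - grand_mean) ** 2)

# Between: each group mean against the grand mean, weighted by group size
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within: every observation against its own group mean
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

print("Total:", ss_total)
print("Between + Within:", ss_between + ss_within)
print("Share of variability explained by the grouping:", ss_between / ss_total)
```

# The last line is the proportion of the total variability attributable to group membership, which is one way to quantify the relative
# impact of a factor.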




In [6]:
# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
# sum of squares (SSR) in a one-way ANOVA using Python?

import numpy as np
import scipy.stats as stats

# Example data for three groups
group_a = np.array([68, 72, 75, 70, 74])
group_b = np.array([80, 85, 78, 82, 87])
group_c = np.array([60, 62, 65, 59, 64])

# Combine all data into a single array
all_data = np.concatenate((group_a, group_b, group_c))

# Calculate overall mean
overall_mean = np.mean(all_data)

# Calculate Total Sum of Squares (SST)
sst = np.sum((all_data - overall_mean)**2)

# Calculate group means
mean_a = np.mean(group_a)
mean_b = np.mean(group_b)
mean_c = np.mean(group_c)

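# Calculate Explained Sum of Squares (SSE), i.e. the between-group sum of squares.
# Naming follows the question; some textbooks use SSE for the error (residual) sum of squares instead.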
sse = (len(group_a) * (mean_a - overall_mean)**2 +
       len(group_b) * (mean_b - overall_mean)**2 +
       len(group_c) * (mean_c - overall_mean)**2)

# Calculate Residual Sum of Squares (SSR)
ssr = sst - sse

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)


Total Sum of Squares (SST): 1152.9333333333332
Explained Sum of Squares (SSE): 1040.9333333333338
Residual Sum of Squares (SSR): 111.99999999999932
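
# The sums of squares lead directly to the F-statistic once each is divided by its degrees of freedom. The sketch below repeats the same three
# groups, computes F and its p-value by hand, and compares them with scipy.stats.f_oneway. The variable names are illustrative.

```python
import numpy as np
from scipy import stats

group_a = np.array([68, 72, 75, 70, 74])
group_b = np.array([80, 85, 78, 82, 87])
group_c = np.array([60, 62, 65, 59, 64])
groups = [group_a, group_b, group_c]

all_data = np.concatenate(groups)
overall_mean = all_data.mean()

# Between-group (explained) and within-group (residual) sums of squares
ss_between = sum(len(g) * (g.mean() - overall_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(all_data) - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against SciPy's one-way ANOVA
f_scipy, p_scipy = stats.f_oneway(*groups)

print("Manual F:", f_manual, "p:", p_manual)
print("SciPy  F:", f_scipy, "p:", p_scipy)
```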


In [7]:
# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

# In a two-way ANOVA, you can calculate the main effects of each independent variable (factor) and the interaction effect between the two 
# independent variables. The main effects represent the influence of each factor on the dependent variable, while the interaction effect examines 
# whether the combined influence of the factors is different from what would be expected based on their individual effects. 

# Here's how you can calculate main effects and interaction effects using Python:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data for two-way ANOVA
data = {'FactorA': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
        'FactorB': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
        'Values': [10, 12, 9, 11, 20, 22, 18, 21]}

df = pd.DataFrame(data)

# Perform two-way ANOVA
model = ols('Values ~ FactorA * FactorB', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

                  sum_sq   df           F    PR(>F)
FactorA          190.125  1.0  217.285714  0.000123
FactorB           10.125  1.0   11.571429  0.027235
FactorA:FactorB    0.125  1.0    0.142857  0.724659
Residual           3.500  4.0         NaN       NaN
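
# The ANOVA table reports whether each effect is significant, but not its direction or size. One way to see the effects themselves is to look
# at marginal and cell means. This is a sketch on the same example data; the "difference of differences" contrast is a simple illustration
# of an interaction in a 2x2 design, not a replacement for the F-test.

```python
import pandas as pd

df = pd.DataFrame({
    "FactorA": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "FactorB": ["X", "Y", "X", "Y", "X", "Y", "X", "Y"],
    "Values": [10, 12, 9, 11, 20, 22, 18, 21],
})

# Main effects: compare the marginal means of each factor
means_a = df.groupby("FactorA")["Values"].mean()
means_b = df.groupby("FactorB")["Values"].mean()

# Cell means: one mean per combination of the two factors
cell_means = df.pivot_table(values="Values", index="FactorA", columns="FactorB", aggfunc="mean")

# Interaction in a 2x2 design: does the X-to-Y change differ between levels of FactorA?
interaction_contrast = (
    (cell_means.loc["B", "Y"] - cell_means.loc["B", "X"])
    - (cell_means.loc["A", "Y"] - cell_means.loc["A", "X"])
)

print(means_a)
print(means_b)
print(cell_means)
print("Difference of differences:", interaction_contrast)
```

# A contrast close to zero means the effect of FactorB is about the same at both levels of FactorA, which matches the small interaction sum
# of squares in the table.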


In [None]:
# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
# What can you conclude about the differences between the groups, and how would you interpret these
# results?

# In a one-way ANOVA, the F-statistic and the associated p-value are used to determine whether there are statistically significant differences 
# among the means of the groups being compared. Let's interpret the results you provided:

# F-Statistic: 5.23
# The F-statistic is the ratio of the variability between group means to the variability within groups. An F of 5.23 means the between-group
# mean square is about 5.2 times as large as the within-group mean square.

# P-Value: 0.02
# The p-value indicates the probability of observing the obtained F-statistic (or a more extreme value) if the null hypothesis is true. In your 
# case, the p-value is 0.02.

# Interpretation:

# Since the p-value (0.02) is less than the typical significance level of 0.05 (or 5%), we can conclude the following:

# Reject the Null Hypothesis: The null hypothesis in this case states that all group means are equal. Since the p-value is below 0.05, we have
# enough evidence to reject the null hypothesis.

# Conclude Significant Differences: We can conclude that at least one group mean differs from the others. The F-test does not tell us which
# groups differ or by how much; a post-hoc test is needed to identify the specific pairs.
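
# The link between the F-statistic and the p-value depends on the degrees of freedom, which the question does not give. The sketch below
# ASSUMES 3 groups and 15 observations in total (df = 2 and 12), chosen only for illustration, and shows how the p-value and the critical
# value would be obtained with scipy.stats.f.

```python
from scipy import stats

f_statistic = 5.23
alpha = 0.05

# Assumed degrees of freedom (not stated in the question)
df_between = 2    # hypothetical: 3 groups -> 3 - 1
df_within = 12    # hypothetical: 15 observations -> 15 - 3

# p-value: probability of an F at least this large if all group means are equal
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Critical value: the F that would have to be exceeded to reject at the chosen alpha
critical_f = stats.f.ppf(1 - alpha, df_between, df_within)

reject_null = f_statistic > critical_f

print("p-value:", p_value)
print("Critical F at alpha = 0.05:", critical_f)
print("Reject the null hypothesis:", reject_null)
```

# Comparing F with the critical value and comparing the p-value with alpha are two views of the same decision.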

In [None]:
# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
# consequences of using different methods to handle missing data?

# Handling missing data in a repeated measures ANOVA is crucial to ensure the accuracy and reliability of your results. Missing data can arise for
# various reasons, such as participant dropout, technical errors, or incomplete responses. There are several methods to handle missing data, each
# with its own advantages and potential consequences. Here are some common methods and their potential consequences:

# 1. Listwise Deletion (Complete Case Analysis):
#    This involves excluding participants with missing data from the analysis. While it's straightforward, it can lead to loss of statistical 
#    power and potential bias if the missing data are not completely random (missing completely at random, MCAR).

#    Consequence: Reduced sample size, biased results if missing data are not MCAR.

# 2. Mean Imputation:
#    Missing values are replaced with the mean value of the observed data for that variable. It's simple but can underestimate the variability 
#    and distort relationships.

#    Consequence: Underestimation of variability, standard errors that are too small, biased results.

# 3. Last Observation Carried Forward (LOCF) or Next Observation Carried Backward (NOCB):
#    Missing data are replaced with the last observed value (LOCF) or the next observed value (NOCB). These methods assume that the missing value 
#    is similar to the last or next observed value, which may not be accurate.

#    Consequence: Distortion of data patterns, may not accurately reflect changes over time.

# 4. Linear Interpolation:
#    Missing values are estimated based on linear interpolation between adjacent observed values. This method assumes a linear relationship between 
#    data points and may not be appropriate for non-linear data.

#    Consequence: May introduce artificial trends or patterns, particularly in non-linear data.

# 5. Multiple Imputation:
#    Multiple imputation generates multiple datasets, each with different imputed values. These datasets are analyzed separately, and the results 
#    are combined to account for uncertainty due to missing data. This method is statistically rigorous but computationally intensive.

#    Consequence: Requires more computational resources, complexity in implementation.

# 6. Model-Based Imputation:
#    Impute missing data using a predictive model based on other variables. This can be more accurate if relationships among variables are 
#    well-understood.

#    Consequence: Relies on the chosen model's accuracy, risk of propagating model errors.

# 7. Missing Data Indicator Variable:
#    Include a binary variable indicating whether data is missing. This allows the ANOVA to treat missing data as a separate group. This can be 
#    informative if missingness is related to a specific factor.

#    Consequence: Adds complexity to analysis, requires additional assumptions about missing data mechanism.
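
# The first two methods can be compared on a small example. The sketch below assumes a hypothetical wide-format table (one row per subject, one
# column per time point; all values are made up) and shows how listwise deletion shrinks the sample while mean imputation shrinks the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data: 6 subjects, 3 time points, 2 missing values
wide = pd.DataFrame(
    {
        "t1": [10, 12, 11, 13, 9, 14],
        "t2": [11, np.nan, 12, 15, 10, 15],
        "t3": [13, 14, np.nan, 16, 11, 17],
    },
    index=pd.Index(range(1, 7), name="subject"),
)

# Listwise deletion: keep only subjects observed at every time point
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with the mean of its column
mean_imputed = wide.fillna(wide.mean())

print("Subjects before / after listwise deletion:", len(wide), "/", len(complete_cases))
print("Variance of t2, observed values only:", wide["t2"].var())
print("Variance of t2, after mean imputation:", mean_imputed["t2"].var())
```

# The imputed column has the same mean but a smaller variance, because the filled-in value sits exactly at the mean and adds no spread.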

In [None]:
# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
# an example of a situation where a post-hoc test might be necessary.

# After conducting an Analysis of Variance (ANOVA) and finding a significant difference among group means, post-hoc tests are used to determine 
# which specific groups are significantly different from each other. These tests help avoid the problem of inflated Type I error rates that can 
# occur when conducting multiple pairwise comparisons. Here are some common post-hoc tests and situations where they might be used:

# 1. Tukey's Honestly Significant Difference (HSD):
#    Tukey's HSD test is widely used and compares all possible pairs of group means. It controls the familywise error rate, making it suitable 
#    when you're comparing multiple groups and want to maintain an overall Type I error rate.

#    Example: In a study comparing the effectiveness of three different treatments on pain relief, after conducting an ANOVA and finding a 
#    significant difference, you use Tukey's HSD to determine which specific pairs of treatments have significantly different effects.

# 2. Bonferroni Correction:
#    The Bonferroni correction involves adjusting the significance level (alpha) for each individual comparison to control the overall familywise 
#    error rate. It's conservative but helpful when you want to reduce the risk of false positives.

#    Example: If you're comparing the preferences for multiple flavors of ice cream, and you don't want to risk falsely concluding that two 
#    flavors are different when they might not be, you can use Bonferroni correction for pairwise comparisons.

# 3. Scheffe's Method:
#    Scheffe's method is the most conservative of the common post-hoc procedures. It controls the familywise error rate across all possible
#    contrasts, including complex ones (for example, one group against the average of two others), so it is best suited to analyses that go
#    beyond simple pairwise comparisons. It can be used with unequal sample sizes, but it still assumes equal variances.

#    Example: In an educational study comparing the performance of students from different schools, Scheffe's method could be used to compare
#    one school against the average of all the others while controlling the familywise error rate.

# 4. Dunn's Test:
#    Dunn's test is a non-parametric post-hoc test. It is typically run after a significant Kruskal-Wallis test (the rank-based alternative to
#    one-way ANOVA) and is often combined with a Bonferroni adjustment. Because it works on ranks, it does not require normally distributed data.

#    Example: If you're comparing the performance of different software systems on various metrics and the data is not normally distributed, 
#    Dunn's test can be used to make pairwise comparisons.

# 5. Games-Howell Test:
#    The Games-Howell test is used when the assumption of equal variances is violated. It's more robust in such cases compared to other tests 
#    that assume equal variances.

#    Example: Suppose you're comparing the reaction times of participants under different experimental conditions, and the variances are not 
#    equal. In this case, you can use the Games-Howell test for pairwise comparisons.

# Remember that the choice of post-hoc test depends on factors like sample size, distribution of data, and assumptions being met. It's essential 
# to consider the characteristics of your data and the goals of your analysis to select an appropriate post-hoc test.
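
# As an illustration of the Bonferroni correction described above, the sketch below runs all pairwise t-tests on three made-up groups and
# multiplies each raw p-value by the number of comparisons (capped at 1). The data and labels are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Hypothetical scores for three treatments
groups = {
    "A": np.array([68, 72, 75, 70, 74]),
    "B": np.array([80, 85, 78, 82, 87]),
    "C": np.array([60, 62, 65, 59, 64]),
}

alpha = 0.05
pairs = list(combinations(groups, 2))  # ('A', 'B'), ('A', 'C'), ('B', 'C')

adjusted_p_values = {}
for first, second in pairs:
    t_stat, p_raw = stats.ttest_ind(groups[first], groups[second])
    # Bonferroni: scale the raw p-value by the number of comparisons
    p_adj = min(p_raw * len(pairs), 1.0)
    adjusted_p_values[(first, second)] = p_adj
    verdict = "differ" if p_adj < alpha else "do not differ"
    print(f"{first} vs {second}: adjusted p = {p_adj:.4f} -> means {verdict} at alpha = {alpha}")
```

# Multiplying the p-values is equivalent to dividing alpha by the number of comparisons. Tukey's HSD is usually less conservative when all
# pairs are compared.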

In [8]:
# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
# 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
# to determine if there are any significant differences between the mean weight loss of the three diets.
# Report the F-statistic and p-value, and interpret the results.

import numpy as np
import scipy.stats as stats

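# Example data for weight loss of three diets (A, B, C).
# Note: these illustrative arrays hold 50 values per diet (150 in total), whereas the question describes 50 participants overall.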
diet_a = np.array([3.5, 2.8, 4.2, 2.0, 3.9, 2.5, 2.7, 3.3, 3.8, 2.2,
                   3.0, 3.7, 3.1, 3.6, 3.4, 2.9, 2.3, 3.8, 2.8, 3.2,
                   2.5, 3.4, 2.7, 3.1, 2.9, 3.2, 2.8, 2.6, 3.7, 3.4,
                   2.7, 2.9, 3.5, 2.3, 3.1, 2.8, 3.6, 2.4, 3.3, 2.6,
                   3.0, 2.8, 3.5, 2.7, 2.9, 3.1, 3.3, 2.5, 2.8, 3.2])

diet_b = np.array([2.1, 1.9, 1.8, 2.0, 1.7, 2.3, 1.6, 2.2, 2.0, 1.9,
                   1.8, 1.7, 2.1, 2.4, 2.3, 1.9, 2.2, 1.8, 2.1, 1.7,
                   2.0, 2.3, 1.6, 2.2, 2.1, 1.9, 2.0, 2.4, 1.7, 2.3,
                   1.6, 2.2, 2.1, 1.9, 2.0, 2.3, 1.8, 2.2, 1.6, 2.1,
                   2.4, 2.3, 1.7, 1.9, 2.0, 2.2, 1.8, 2.1, 2.3, 2.4])

diet_c = np.array([0.9, 1.0, 0.8, 0.6, 0.7, 0.8, 1.1, 0.9, 0.7, 0.8,
                   1.0, 0.6, 0.7, 0.9, 1.1, 0.8, 0.6, 1.0, 0.7, 0.9,
                   0.8, 1.1, 0.9, 0.7, 0.8, 1.0, 0.6, 1.1, 0.9, 0.7,
                   0.8, 1.0, 0.8, 0.6, 0.7, 0.9, 1.1, 0.8, 0.6, 1.0,
                   0.9, 0.7, 0.8, 1.1, 0.9, 0.7, 0.8, 1.0, 0.6, 0.7])

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_a, diet_b, diet_c)

print("F-Statistic:", f_statistic)
print("P-Value:", p_value)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("There are significant differences between the mean weight loss of the three diets.")
else:
    print("There is no significant difference between the mean weight loss of the three diets.")


F-Statistic: 572.5521283395123
P-Value: 4.1443235024109274e-70
There are significant differences between the mean weight loss of the three diets.


In [9]:
# Q10. A company wants to know if there are any significant differences in the average time it takes to
# complete a task using three different software programs: Program A, Program B, and Program C. They
# randomly assign 30 employees to one of the programs and record the time it takes each employee to
# complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
# interaction effects between the software programs and employee experience level (novice vs.
# experienced). Report the F-statistics and p-values, and interpret the results.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

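# Create example data (simulated: Time is drawn independently of Program and Experience,
# so no real effect is built in and non-significant results are expected)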
np.random.seed(42)
n = 30
programs = np.random.choice(['A', 'B', 'C'], n)
experience = np.random.choice(['novice', 'experienced'], n)
times = np.random.normal(loc=10, scale=2, size=n)

# Create a DataFrame
data = pd.DataFrame({'Program': programs, 'Experience': experience, 'Time': times})

# Perform two-way ANOVA
model = ols('Time ~ Program * Experience', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)


                       sum_sq    df         F    PR(>F)
Program              1.035327   2.0  0.136986  0.872659
Experience           0.521940   1.0  0.138118  0.713420
Program:Experience   2.683910   2.0  0.355113  0.704716
Residual            90.694755  24.0       NaN       NaN
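
# The question also asks for an interpretation of the table. The sketch below uses a small hypothetical helper, summarize_effects, that reads
# the F and PR(>F) columns and states a verdict for each effect. It is applied here to the values printed above, re-entered by hand.

```python
import pandas as pd


def summarize_effects(anova_table, alpha=0.05):
    """Return one interpretation line per effect in a statsmodels-style ANOVA table."""
    lines = []
    effects = anova_table.drop(index="Residual", errors="ignore")
    for effect, row in effects.iterrows():
        verdict = "significant" if row["PR(>F)"] < alpha else "not significant"
        lines.append(
            f"{effect}: F = {row['F']:.3f}, p = {row['PR(>F)']:.3f} "
            f"({verdict} at alpha = {alpha})"
        )
    return lines


# F and p-values copied from the table above
reported = pd.DataFrame(
    {
        "F": [0.136986, 0.138118, 0.355113],
        "PR(>F)": [0.872659, 0.713420, 0.704716],
    },
    index=["Program", "Experience", "Program:Experience"],
)

summary_lines = summarize_effects(reported)
for line in summary_lines:
    print(line)
```

# All three p-values are well above 0.05, so there is no evidence of a main effect of program, a main effect of experience, or an interaction
# between them in this simulated data.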


In [10]:
# Q11. An educational researcher is interested in whether a new teaching method improves student test
# scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
# experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
# two-sample t-test using Python to determine if there are any significant differences in test scores
# between the two groups. If the results are significant, follow up with a post-hoc test to determine which
# group(s) differ significantly from each other.

import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data for control and experimental groups
control_scores = np.array([78, 82, 85, 75, 92, 88, 80, 84, 79, 87,
                           90, 76, 82, 89, 83, 85, 81, 88, 86, 77,
                           79, 83, 80, 82, 81, 85, 87, 88, 84, 79,
                           81, 86, 90, 78, 79, 81, 82, 88, 85, 83,
                           76, 87, 80, 82, 89, 84, 85, 86, 90, 82])

experimental_scores = np.array([85, 88, 92, 79, 94, 89, 82, 87, 90, 91,
                                95, 81, 87, 93, 89, 91, 86, 90, 88, 83,
                                86, 90, 85, 87, 88, 91, 89, 90, 92, 85,
                                88, 93, 95, 82, 84, 87, 89, 90, 88, 86,
                                80, 92, 87, 89, 93, 88, 89, 90, 93, 87])

# Perform two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_scores, experimental_scores)

print("T-Statistic:", t_statistic)
print("P-Value:", p_value)

# Interpret the t-test results
alpha = 0.05
if p_value < alpha:
    print("There is a significant difference in test scores between the two groups.")
else:
    print("There is no significant difference in test scores between the two groups.")

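# Perform post-hoc test (Tukey's HSD).
# With only two groups there is a single pairwise comparison, so this step is redundant with the t-test;
# it is included because the question asks for it and it reports the mean difference with a confidence interval.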
if p_value < alpha:
    all_scores = np.concatenate((control_scores, experimental_scores))
    group_labels = ['Control'] * len(control_scores) + ['Experimental'] * len(experimental_scores)
    posthoc = pairwise_tukeyhsd(all_scores, group_labels, alpha=0.05)
    print(posthoc)


T-Statistic: -6.156404161485698
P-Value: 1.6369218868074588e-08
There is a significant difference in test scores between the two groups.
 Multiple Comparison of Means - Tukey HSD, FWER=0.05  
 group1    group2    meandiff p-adj lower upper reject
------------------------------------------------------
Control Experimental     4.88   0.0 3.307 6.453   True
------------------------------------------------------


In [11]:
# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
# retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
# on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
# significant differences in sales between the three stores. If the results are significant, follow up with a
# post-hoc test to determine which store(s) differ significantly from each other.

import numpy as np
import scipy.stats as stats

# Example data for daily sales of three stores (A, B, C) over 30 days
store_a_sales = np.random.normal(loc=1000, scale=100, size=30)
store_b_sales = np.random.normal(loc=1100, scale=120, size=30)
store_c_sales = np.random.normal(loc=900, scale=110, size=30)

# Combine data from all stores
all_sales = np.concatenate((store_a_sales, store_b_sales, store_c_sales))

# Create corresponding labels for store identification
store_labels = ['A'] * 30 + ['B'] * 30 + ['C'] * 30

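# Perform one-way ANOVA.
# Note: this treats the three stores as independent groups. A repeated measures ANOVA, as the question asks for,
# would instead treat each day as the "subject" measured under all three stores.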
f_statistic, p_value = stats.f_oneway(store_a_sales, store_b_sales, store_c_sales)

print("F-Statistic:", f_statistic)
print("P-Value:", p_value)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("There are significant differences in daily sales between the three stores.")
else:
    print("There is no significant difference in daily sales between the three stores.")


F-Statistic: 25.32276112538273
P-Value: 2.1524678060485753e-09
There are significant differences in daily sales between the three stores.
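
# For the repeated measures design in the question, each day is observed under all three stores, so the day plays the role of the subject.
# The sketch below is one possible approach using statsmodels' AnovaRM on long-format data, followed by Bonferroni-adjusted paired t-tests as
# a post-hoc step. The simulated sales figures, the shared day effect, and the seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(0)
n_days = 30

# Hypothetical data: a day-level effect shared by all stores, plus store-specific noise
day_effect = rng.normal(0, 50, n_days)
sales = {
    "A": 1000 + day_effect + rng.normal(0, 60, n_days),
    "B": 1100 + day_effect + rng.normal(0, 60, n_days),
    "C": 900 + day_effect + rng.normal(0, 60, n_days),
}

# Long format: one row per (day, store) combination
long_df = pd.DataFrame({
    "day": np.tile(np.arange(n_days), len(sales)),
    "store": np.repeat(list(sales), n_days),
    "sales": np.concatenate(list(sales.values())),
})

# Repeated measures ANOVA with store as the within-subject factor
rm_result = AnovaRM(long_df, depvar="sales", subject="day", within=["store"]).fit()
print(rm_result.anova_table)

# Post-hoc: paired t-tests on the same days, Bonferroni-adjusted
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
posthoc = {}
for first, second in pairs:
    t_stat, p_raw = stats.ttest_rel(sales[first], sales[second])
    posthoc[(first, second)] = min(p_raw * len(pairs), 1.0)
    print(f"{first} vs {second}: adjusted p = {posthoc[(first, second)]:.4g}")
```

# Pairing by day removes the day-to-day variation shared by all stores from the error term, which is what distinguishes this analysis from
# a one-way ANOVA on the same numbers.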
