In [30]:
import numpy as np
import pandas as pd
from scipy.stats import f
from scipy.stats import f_oneway
from scipy.stats import t
import statsmodels.api as sm
from statsmodels.formula.api import ols
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd
from statsmodels.stats.anova import AnovaRM

# Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results 


These are the assumptions required to use ANOVA:
1. Normality: the residuals (equivalently, the observations within each group) are approximately normally distributed.
2. Homogeneity of variance: the variance within each group is approximately equal.
3. Independence: the observations are independent of each other.

Violations of these assumptions can impact the validity of the ANOVA results. For example:
1. Violation of independence: if the same subjects are measured in more than one group, or observations are clustered (e.g., students within the same classroom), the standard errors are underestimated and the Type I error rate is inflated.
2. Violation of normality: if the data in one or more groups are heavily skewed or contain extreme outliers, the p-value of the F-test can be inaccurate, especially with small samples.
3. Violation of homogeneity of variance: if the variance in one group is much larger than in the others, particularly with unequal group sizes, the F-test can become too liberal or too conservative.
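
The normality and equal-variance assumptions can be checked before running ANOVA. The cell below is a minimal sketch on hypothetical simulated data (the group means, sizes and seed are assumptions made for illustration), using the Shapiro-Wilk test per group and Levene's test across groups.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal distributions
rng = np.random.default_rng(0)
groups = [rng.normal(loc=m, scale=1.0, size=30) for m in (5.0, 5.5, 6.0)]

# Normality: Shapiro-Wilk test per group (H0: the sample comes from a normal distribution)
shapiro_p = [stats.shapiro(g).pvalue for g in groups]

# Homogeneity of variance: Levene's test (H0: all groups have equal variance)
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_p, 3))
print("Levene p-value:", round(float(levene_p), 3))
```

Small p-values would point to a violated assumption. Independence cannot be tested this way; it has to be judged from the study design.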

# Q2. What are the three types of ANOVA, and in what situations would each be used?



There are three types of ANOVA test:
1. One-Way ANOVA | Completely Randomized Design: This is used when there is only one independent variable (factor) with three or more levels (groups). It is used to test whether there is a significant difference between the means of the groups.

2. Two-Way ANOVA | Factorial ANOVA: This is used when there are two independent variables (factors) and one dependent variable. It tests the main effect of each factor and whether the two factors interact, that is, whether the effect of one factor depends on the level of the other. For example, if we want to test the effect of two different treatments on a group of patients and whether the effect depends on the gender of the patients, we can use a two-way ANOVA.

3. Repeated Measures ANOVA: This is used when the same subjects are measured under three or more conditions or at three or more time points. Because the measurements from one subject are correlated, the test removes between-subject variability from the error term before comparing the condition means. For example, measuring the blood pressure of the same patients before, during, and after a treatment calls for a repeated measures ANOVA.

# Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?



The partitioning of variance in ANOVA refers to the division of the total variance in the data into different components based on the sources of variation. There are two main sources of variation in ANOVA: 
1. the variation between groups 
2. the variation within groups

The partitioning of variance helps to determine the relative contribution of each source of variation to the total variance in the data.

The total variance in the data can be represented as the sum of squares total (SST), which is calculated as the sum of the squared differences between each observation and the overall mean. The SST can then be partitioned into two components: the sum of squares between (SSB) and the sum of squares within (SSW). The SSB represents the variation between groups, while the SSW represents the variation within groups.

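The partitioning of variance is important because it allows us to determine whether there is a significant difference between the means of the groups. Each sum of squares is divided by its degrees of freedom to give a mean square: MSB = SSB / (k - 1) and MSW = SSW / (N - k), where k is the number of groups and N is the total number of observations. The F-statistic is the ratio MSB / MSW. If MSB is much larger than MSW, the F-statistic is large, which suggests that there is a significant difference between the means of the groups. If MSB is close to or smaller than MSW, the F-statistic is near 1 or below, and there is no evidence of a difference between the group means.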

Understanding the partitioning of variance is also important because it helps to identify potential sources of error in the study design or data collection process. For example, if there is a large amount of variation within groups, it may indicate that there is a problem with the measurement instrument or that there are confounding variables that need to be controlled for in the analysis.
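
The partition can be verified numerically. The cell below is a small sketch on toy data (three groups of three values, chosen only for illustration): it checks that SST equals SSB + SSW and that the F-statistic built from the mean squares agrees with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy.stats import f_oneway

# Toy data, chosen only for illustration
groups = [np.array([1.0, 2.0, 3.0]),
          np.array([4.0, 5.0, 6.0]),
          np.array([7.0, 8.0, 9.0])]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group and within-group sums of squares
sst = np.sum((all_values - grand_mean) ** 2)
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F is the ratio of the mean squares
k, n = len(groups), len(all_values)
f_manual = (ssb / (k - 1)) / (ssw / (n - k))

f_scipy, p_scipy = f_oneway(*groups)
print("SST:", sst, "SSB + SSW:", ssb + ssw)
print("F (manual):", f_manual, "F (scipy):", f_scipy)
```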

# Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?



In [3]:

def one_way_anova(data):
    """
    Calculates the total sum of squares (SST), explained sum of squares (SSE),
    and residual sum of squares (SSR) in a one-way ANOVA.

    Args:
    data: A list of lists, where each inner list contains the data for one group.

    Returns:
    A tuple of (SST, SSE, SSR).
    """
    # Convert each group to an array so that groups of unequal size also work
    groups = [np.asarray(group, dtype=float) for group in data]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()

    # Total sum of squares: every observation vs. the grand mean
    SST = np.sum((all_values - grand_mean)**2)

    # Explained (between-group) sum of squares: group means vs. the grand mean
    SSE = sum(len(g) * (g.mean() - grand_mean)**2 for g in groups)

    # Residual (within-group) sum of squares: observations vs. their own group mean
    SSR = sum(np.sum((g - g.mean())**2) for g in groups)

    return SST, SSE, SSR


In [4]:
# Generate some data
data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Calculate the ANOVA
SST, SSE, SSR = one_way_anova(data)

# Print the results
print("SST:", SST)
print("SSE:", SSE)
print("SSR:", SSR)

SST: 60.0
SSE: 54.0
SSR: 6.0


# Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?



In [48]:
# Create a sample dataframe
data = pd.DataFrame({
    'factor_a': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'factor_b': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
    'y': [10, 12, 8, 9, 11, 13, 7, 6]
})

# Fit the ANOVA model
model = ols('y ~ C(factor_a) + C(factor_b) + C(factor_a):C(factor_b)', data=data).fit()

# Print the ANOVA table
print(sm.stats.anova_lm(model, typ=2))

                         sum_sq   df          F   PR(>F)
C(factor_a)                32.0  1.0  21.333333  0.00989
C(factor_b)                 2.0  1.0   1.333333  0.31250
C(factor_a):C(factor_b)     2.0  1.0   1.333333  0.31250
Residual                    6.0  4.0        NaN      NaN


Main effect of factor_a: The sum of squares for factor_a is 32.0 with 1 degree of freedom, which results in an F-statistic of 21.333 and a p-value of 0.00989. This indicates that there is a significant main effect of factor_a.

Main effect of factor_b: The sum of squares for factor_b is 2.0 with 1 degree of freedom, which results in an F-statistic of 1.333 and a p-value of 0.31250. This indicates that there is no significant main effect of factor_b.

Interaction effect between factor_a and factor_b: The sum of squares for the interaction between factor_a and factor_b is 2.0 with 1 degree of freedom, which results in an F-statistic of 1.333 and a p-value of 0.31250. This indicates that there is no significant interaction effect between factor_a and factor_b.
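
The effects in the ANOVA table can also be read directly from the group means. The cell below is a short sketch that rebuilds the same sample dataframe and computes the marginal means for each factor (main effects) and a difference-of-differences on the cell means (interaction); the variable names are chosen here for illustration.

```python
import pandas as pd

# Same sample data as in the cell above
data = pd.DataFrame({
    'factor_a': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
    'factor_b': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
    'y': [10, 12, 8, 9, 11, 13, 7, 6],
})

# Main effects: compare the marginal means of each factor
mean_a = data.groupby('factor_a')['y'].mean()
mean_b = data.groupby('factor_b')['y'].mean()

# Interaction: does the X-to-Y change differ between levels of factor_a?
cell_means = data.pivot_table(index='factor_a', columns='factor_b',
                              values='y', aggfunc='mean')
interaction_contrast = ((cell_means.loc['A', 'Y'] - cell_means.loc['A', 'X'])
                        - (cell_means.loc['B', 'Y'] - cell_means.loc['B', 'X']))

print(mean_a)
print(mean_b)
print(cell_means)
print("Difference of differences:", interaction_contrast)
```

A large gap between marginal means corresponds to a large main-effect sum of squares, and a difference-of-differences far from zero corresponds to a large interaction sum of squares.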

# Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results? 
 

If the alpha value is 0.05 and the p-value is 0.02, then we can REJECT the null hypothesis at the 0.05 level of significance. This means that there is sufficient evidence to conclude that there is a statistically significant difference between the groups.

The F-statistic is the ratio of the between-group mean square to the within-group mean square, so F = 5.23 means that the variability between the group means is 5.23 times the variability expected from the within-group variation alone. The test only tells us that at least one group mean differs; post-hoc tests or further analyses are necessary to determine which groups are significantly different from each other.

Post-hoc methods such as Tukey's Honestly Significant Difference (HSD), the Bonferroni correction, and Scheffe's method can be used for further analysis of the means.
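
The link between an F-statistic and its p-value can be sketched with the F distribution in SciPy. The question does not state the degrees of freedom, so the values below (3 groups, 30 observations) are purely hypothetical and the resulting p-value will not match 0.02 exactly.

```python
from scipy.stats import f

f_stat = 5.23
# Hypothetical degrees of freedom: k - 1 = 2 and N - k = 27 (not given in the question)
df_between, df_within = 2, 27

# p-value: probability of an F at least this large under the null hypothesis
p_value = f.sf(f_stat, df_between, df_within)

# Critical value at alpha = 0.05
f_crit = f.ppf(0.95, df_between, df_within)

print("p-value:", p_value)
print("F critical:", f_crit)
print("Reject Ho:", f_stat > f_crit)
```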

# Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?



In a repeated measures ANOVA, missing data can be handled through different methods, such as pairwise deletion, listwise deletion, or imputation.

Pairwise deletion uses all available observations for each pair of conditions, so different comparisons are based on different subsets of subjects, which can produce inconsistent and biased estimates. Listwise deletion drops every subject that has at least one missing measurement, which reduces the sample size and statistical power and biases the results unless the data are missing completely at random. Imputation estimates the missing values from the available data; it keeps all subjects but can lead to biased estimates if the imputation model is misspecified, and simple methods such as mean imputation understate the variability.

The potential consequences of using different methods to handle missing data are that the results of the analysis may differ depending on the method used. Therefore, it is important to carefully consider the missing data mechanism and choose an appropriate method to handle missing data.
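
The difference between the methods can be sketched on a small hypothetical wide-format table (one row per subject, one column per time point; all values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point
wide = pd.DataFrame({
    't1': [5.0, 6.0, 7.0, 8.0],
    't2': [5.5, np.nan, 7.5, 8.5],
    't3': [6.0, 6.5, np.nan, 9.0],
}, index=['s1', 's2', 's3', 's4'])

# Listwise deletion: keep only subjects with complete data
listwise = wide.dropna()

# Simple imputation: replace each missing value with its column mean
imputed = wide.fillna(wide.mean())

print("Subjects kept by listwise deletion:", list(listwise.index))
print(imputed)
```

Listwise deletion halves the sample here, while mean imputation keeps all four subjects at the cost of pulling the imputed values toward the column average.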

# Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.



There are several post-hoc tests that can be used after ANOVA, including Tukey's HSD, Bonferroni correction, Scheffe's method, and Dunnett's test.

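* Tukey's HSD is commonly used when comparing all possible pairs of means to determine which pairs are significantly different from each other, while controlling the family-wise error rate.
* Bonferroni correction is used to control the Type I error rate when a small, pre-planned set of comparisons is made: each p-value is compared against alpha divided by the number of comparisons. It is simple but conservative when there are many comparisons.
* Scheffe's method is used when testing arbitrary contrasts, including complex ones such as the average of two groups against a third. It is the most conservative of these methods for simple pairwise comparisons.
* Dunnett's test is used when comparing multiple treatments to a single control group.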

A situation where a post-hoc test might be necessary is when conducting a study with multiple groups and a significant difference is found between the means. A post-hoc test can be used to determine which specific groups have significantly different means from each other.
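
As an illustration of one of these methods, the cell below sketches Bonferroni-corrected pairwise t-tests using `multipletests` from statsmodels. The three groups are hypothetical simulated data (means, sizes and seed are assumptions), so the printed values carry no meaning beyond the example.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Hypothetical simulated groups
rng = np.random.default_rng(1)
groups = {
    'A': rng.normal(5.0, 1.0, 30),
    'B': rng.normal(5.2, 1.0, 30),
    'C': rng.normal(6.5, 1.0, 30),
}

# Raw p-values for every pair of groups
pairs = list(combinations(groups, 2))
raw_p = [ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]

# Bonferroni adjustment: each p-value is multiplied by the number of comparisons (capped at 1)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method='bonferroni')

for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"{a} vs {b}: adjusted p = {p:.4f}, reject = {r}")
```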

# Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.
  


In [17]:
Ho = "All three diet means are equal"
Ha = "At least one diet mean differs from the others"

# Simulated weight loss for each diet (50 values per diet)
A = np.random.normal(5, 1, 50)
B = np.random.normal(4.5, 1, 50)
C = np.random.normal(4.7, 1, 50)

# F test statistic and p-value
f_stat, p_value = f_oneway(A, B, C)

# significance level
alpha = 0.05

print("F-statistic:", f_stat)
print("p-value:", p_value)
if p_value < alpha:
    print("Reject Ho.", Ha)
else:
    print("Fail to reject Ho.", Ho)

F-statistic: 3.4089499921915496
p-value: 0.035712362822675564
Reject Ho. At least one diet mean differs from the others


# Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.
    


In [18]:
# Generate two random time samples, one for novices and one for experienced employees
time_novice = np.random.normal(loc=15, scale=2, size=45)
time_expert = np.random.normal(loc=10, scale=2, size=45)

# Generate simulated data
# Note: Software A is all novices, C is all experienced, and B is split 15/15
data = pd.DataFrame({
    'Software': ['A']*30 + ['B']*30 + ['C']*30,
    'Experience': ['Novice']*45 + ['Experienced']*45,
    'Time': list(time_novice)+list(time_expert)
})

# Fit the ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data).fit()
# Type I (sequential) sums of squares, matching the table below
anova_table = sm.stats.anova_lm(model, typ=1)

print(anova_table)


                             df      sum_sq     mean_sq          F  \
C(Software)                 2.0  232.569522  116.284761  39.015083   
C(Experience)               1.0  245.672030  245.672030  82.426231   
C(Software):C(Experience)   2.0    4.160624    2.080312   0.697972   
Residual                   86.0  256.323678    2.980508        NaN   

                                 PR(>F)  
C(Software)                8.744065e-13  
C(Experience)              3.399439e-14  
C(Software):C(Experience)  5.003897e-01  
Residual                            NaN  


* For Software, the F-statistic is 39.015 and the p-value is 8.744065e-13, which is much smaller than the significance level of 0.05, indicating that there is a significant difference in the average time it takes to complete the task using different software programs.
* For Experience, the F-statistic is 82.426 and the p-value is 3.399439e-14, which is much smaller than the significance level of 0.05, indicating that there is a significant difference in the average completion time between novice and experienced employees.
* For the interaction between Software and Experience, the F-statistic is 0.698 and the p-value is 0.500, which is much larger than the significance level of 0.05, indicating that there is no significant interaction effect between software program and employee experience level.

Note that in this simulated data set Software A contains only novices and Software C contains only experienced employees, and the times were generated to depend on experience alone. The apparent software effect therefore reflects this confounding rather than a real difference between the programs.

# Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other

In [25]:
Ho = "There is no difference in students' scores"
Ha = "There is a significant difference in students' scores"

# Simulated test scores for the two groups
groupA = np.random.normal(loc=80, scale=10, size=100)
groupB = np.random.normal(loc=75, scale=10, size=100)

# significance level
alpha = 0.05

# t statistic and p-value
t_stat, p_value = ttest_ind(groupA, groupB)

# two-sided t critical values, df = n1 + n2 - 2 = 198
t_crit1 = t.ppf(alpha/2, 198)
t_crit2 = t.ppf(1-alpha/2, 198)

print(f"t_stat = {t_stat}")
print(f"t critical 1 = {t_crit1}")
print(f"t critical 2 = {t_crit2}")
print(f"p value = {p_value}")

if p_value < alpha:
    print("Reject Ho.", Ha)
else:
    print("Fail to reject Ho.", Ho)



t_stat = 3.335852168325221
t critical 1 = -1.972017477833896
t critical 2 = 1.9720174778338955
p value = 0.001015401276676933
Reject Ho. There is a significant difference in students' scores


In [29]:
# Combine the two groups of test scores into one array
all_scores = np.concatenate((groupA, groupB))

# Create a list of group labels corresponding to each score in the combined array
group_labels = ['A'] * len(groupA) + ['B'] * len(groupB)

# Conduct a post-hoc Tukey's HSD test
tukey_results = pairwise_tukeyhsd(all_scores, group_labels)

# Print the post-hoc results
print("\nPost-hoc Tukey's HSD test results:")
print(tukey_results)


Post-hoc Tukey's HSD test results:
Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower   upper  reject
---------------------------------------------------
     A      B  -4.6196 0.001 -7.3505 -1.8887   True
---------------------------------------------------


Based on the post-hoc Tukey's HSD test results, there is a statistically significant difference in test scores between group A and group B. The reported mean difference is group B minus group A, so the mean test score for group B is significantly lower (mean difference of -4.6196) than for group A. With only two groups, the post-hoc test repeats the comparison already made by the t-test and leads to the same conclusion.

With a p-value of 0.001 (p-adj), which is less than the significance level of 0.05, we can reject the null hypothesis that there is no difference in test scores between the two groups. Taking group A as the new teaching method and group B as the traditional method, the simulated data suggest that the new teaching method improves student test scores compared to the traditional teaching method.

# Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a posthoc test to determine which store(s) differ significantly from each other.

In [31]:
# sales data for Store A, B, and C
sales_a = np.random.normal(loc=500, scale=50, size=(30,))
sales_b = np.random.normal(loc=650, scale=60, size=(30,))
sales_c = np.random.normal(loc=600, scale=55, size=(30,))

# create a DataFrame
sales_df = pd.DataFrame({'Store A': sales_a, 'Store B': sales_b, 'Store C': sales_c})
sales_df

Unnamed: 0,Store A,Store B,Store C
0,555.948914,705.129859,570.691259
1,537.873554,706.762923,699.211108
2,553.648378,599.053916,674.40457
3,501.773959,724.606701,520.446802
4,589.422214,666.055742,564.243875
5,482.359056,641.970916,545.58101
6,441.689131,640.131157,663.249186
7,580.681406,637.584827,551.160223
8,526.154582,630.776774,612.341665
9,529.558091,670.772723,624.623479


In [46]:
# reshape the DataFrame for repeated measures ANOVA
sales_melted = pd.melt(sales_df.reset_index( names = "Day" ), id_vars=['Day'], value_vars=['Store A', 'Store B', 'Store C'], var_name = "Store", value_name = "Sales")
rm_anova = AnovaRM(sales_melted, 'Sales', 'Day', within=['Store'])
rm_results = rm_anova.fit()
print(rm_results)


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store 44.2664 2.0000 58.0000 0.0000



In [47]:
# check if null hypothesis should be rejected based on p-value
if rm_results.anova_table['Pr > F'].iloc[0] < 0.05:
    # perform post-hoc Tukey test
    # (note: this treats the stores as independent groups and ignores the pairing by day)
    print('Reject the null hypothesis: at least one of the groups has a different mean.\n')
    print('Tukey HSD posthoc test:')
    tukey_results = pairwise_tukeyhsd(sales_melted['Sales'], sales_melted['Store'])
    print(tukey_results)
else:
    print('No significant difference between groups.')

Reject the null hypothesis: at least one of the groups has a different mean.

Tukey HSD posthoc test:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1  group2 meandiff p-adj   lower    upper   reject
--------------------------------------------------------
Store A Store B 145.2211    0.0 111.0682  179.374   True
Store A Store C  93.3176    0.0  59.1647 127.4705   True
Store B Store C -51.9035 0.0014 -86.0564 -17.7506   True
--------------------------------------------------------
