In [1]:
#Ans 1

In [2]:
# ANOVA (Analysis of Variance) is a statistical technique used to compare the means of two or more groups. It makes several assumptions about the data to ensure the validity of the results. Here are the key assumptions required for conducting ANOVA:

# 1. Independence: The observations should be independent of each other, both within each group and across groups. The value of one observation should not influence, or depend on, the value of any other observation.

# 2. Normality: The data within each group (equivalently, the model residuals) should be approximately normally distributed. The F-distribution used to compute the p-value is derived under this assumption, although ANOVA is fairly robust to moderate departures when the groups are large and of similar size.

# 3. Homogeneity of variances: The variability of scores (variances) in each group should be approximately equal. Homogeneity of variances ensures that the groups have a similar spread of values, allowing for meaningful comparisons.

# Violations of these assumptions can impact the validity of ANOVA results. Here are examples of violations that could impact the validity:

# 1. Violation of independence: If the observations within the groups are not independent, it can lead to biased results. For example, if the same subjects are included in multiple groups or if there is a dependency between the observations due to clustering or repeated measures, the assumption of independence is violated.

# 2. Violation of normality: If the data within the groups deviate significantly from a normal distribution, the ANOVA results may not be reliable. This is particularly important when the sample sizes are small. Violations can occur when there are extreme outliers, skewed distributions, or heavy tails in the data.

# 3. Violation of homogeneity of variances: ANOVA pools the within-group variances into a single error term. When the variances differ across groups (heteroscedasticity), that pooled estimate no longer describes every group, which can inflate or deflate the Type I error rate and reduce the power of the analysis. The problem is most serious when the group sizes are also unequal.

# When these assumptions are violated, alternative statistical tests or adjustments can be used. Non-parametric tests, such as the Kruskal-Wallis test, can be used when the normality assumption is violated. Robust versions of ANOVA, such as Welch's ANOVA, can be used when the assumption of homogeneity of variances is violated.
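
# The checks described above can be sketched in code. This is a minimal illustration, not a full diagnostic workflow: the three groups are hypothetical simulated samples, and the choice of Shapiro-Wilk (normality), Levene (equal variances), and Kruskal-Wallis (non-parametric fallback) is one common combination among several.

```python
import numpy as np
from scipy import stats

# Hypothetical data (assumption): three groups drawn from normal distributions
rng = np.random.default_rng(0)
group_a = rng.normal(loc=5.0, scale=1.0, size=30)
group_b = rng.normal(loc=5.5, scale=1.0, size=30)
group_c = rng.normal(loc=6.0, scale=1.0, size=30)
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk test per group (a small p-value suggests non-normality)
shapiro_pvalues = [stats.shapiro(g)[1] for g in groups]

# Homogeneity of variances: Levene's test across all groups
levene_stat, levene_p = stats.levene(*groups)

# Non-parametric alternative to one-way ANOVA if normality looks doubtful
kruskal_stat, kruskal_p = stats.kruskal(*groups)

print('Shapiro-Wilk p-values:', np.round(shapiro_pvalues, 4))
print('Levene p-value:', round(levene_p, 4))
print('Kruskal-Wallis p-value:', round(kruskal_p, 4))
```

# Independence cannot be tested this way; it has to be judged from how the data were collected.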

In [3]:
#Ans 2

In [4]:
# The three types of ANOVA are:

# 1. One-Way ANOVA: One-Way ANOVA is used when there is a single categorical independent variable (also known as a factor) and a continuous dependent variable. It is used to determine if there are any statistically significant differences between the means of three or more groups. For example, a One-Way ANOVA can be used to compare the average test scores of students from different schools (where the schools are the groups) to see if there are any significant differences in performance.

# 2. Two-Way ANOVA: Two-Way ANOVA is used when there are two categorical independent variables (factors) and a continuous dependent variable. It allows for examining the main effects of each factor as well as their interaction effect on the dependent variable. Two-Way ANOVA is suitable when you want to analyze the effects of two independent variables simultaneously. For instance, you might use a Two-Way ANOVA to examine the effects of both gender and age group on the response time of participants in a cognitive task.

# 3. Factorial ANOVA: Factorial ANOVA is an extension of the Two-Way ANOVA and is used when there are two or more categorical independent variables (factors) and a continuous dependent variable. It allows for investigating the main effects of each factor as well as their interactions. Factorial ANOVA can be used when you want to analyze the combined effects of multiple independent variables on the dependent variable. For example, in a study on the effects of both diet and exercise on weight loss, a Factorial ANOVA can be used to assess the impact of each factor separately and their interaction.

# In summary, One-Way ANOVA is used when there is one independent variable, Two-Way ANOVA is used when there are two independent variables, and Factorial ANOVA is used when there are two or more independent variables. The choice of which ANOVA to use depends on the research question, the number of factors being studied, and the desired level of analysis.

In [5]:
#Ans 3

In [6]:
# The partitioning of variance in ANOVA refers to the decomposition of the total variation in the data into different sources or components of variation. It helps to understand how much of the total variation in the data is due to the differences between groups and how much is due to random variation within the groups. This concept is essential in ANOVA because it allows us to quantify and attribute the sources of variability, providing insights into the significance of the group differences being analyzed.

# The partitioning of variance in ANOVA involves three key components:

# Between-Groups Variation: This component represents the variability between the group means. It measures the differences among the group means and assesses whether these differences are statistically significant. If the between-groups variation is large relative to the within-group variation, it suggests that the group means are significantly different from each other.

# Within-Groups Variation: This component represents the variability within each group. It captures the random variation or noise that is inherent within the groups. It reflects the individual differences or measurement errors within the groups. If the within-groups variation is high, it indicates that there is a substantial amount of random variation, making it difficult to distinguish the true group differences from the noise.

# Total Variation: This component represents the overall variability in the data, regardless of group membership. It equals the sum of the between-groups and within-groups sums of squares (SST = SSB + SSW) and reflects the dispersion of all scores around the grand mean. The ratio of the between-groups variation to the total variation (SSB / SST, known as eta-squared) gives the proportion of the total variation that can be attributed to group differences.

# Understanding the partitioning of variance helps in assessing the significance of the group differences being studied. By comparing the magnitude of the between-groups variation to the within-groups variation, ANOVA determines whether the observed differences among the group means are larger than what would be expected by chance alone. It provides a statistical basis for evaluating the importance of the independent variable(s) in explaining the variation in the dependent variable(s).

# Moreover, the partitioning of variance allows for additional analyses, such as calculating effect sizes, estimating power, and conducting post hoc tests. It provides a structured framework for interpreting the results of ANOVA and understanding the relative contributions of different sources of variation, ultimately aiding in drawing valid conclusions from the analysis.
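
# The decomposition can be verified by hand. The sketch below uses hypothetical simulated groups (an assumption, not data from this notebook), computes each sum of squares directly with NumPy, and compares the resulting F-statistic with scipy.stats.f_oneway.

```python
import numpy as np
from scipy import stats

# Hypothetical data (assumption): three groups of 20 observations each
rng = np.random.default_rng(1)
samples = [rng.normal(loc=m, scale=1.0, size=20) for m in (4.0, 5.0, 6.0)]

all_values = np.concatenate(samples)
grand_mean = all_values.mean()

# Between-groups: squared distance of each group mean from the grand mean, weighted by group size
ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
# Within-groups: squared distance of each observation from its own group mean
ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)
# Total: squared distance of each observation from the grand mean
ss_total = ((all_values - grand_mean) ** 2).sum()

k = len(samples)          # number of groups
n_total = len(all_values)  # total number of observations

# F is the ratio of the between-groups mean square to the within-groups mean square
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_scipy = stats.f_oneway(*samples)

# Proportion of total variation attributable to group membership
eta_squared = ss_between / ss_total

print('SSB + SSW:', round(ss_between + ss_within, 4), '| SST:', round(ss_total, 4))
print('F (manual):', round(f_manual, 4), '| F (scipy):', round(f_scipy, 4))
print('eta-squared:', round(eta_squared, 4))
```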

In [7]:
#Ans 4

In [10]:
import pandas as pd
from statsmodels.formula.api import ols
import seaborn as sns
from statsmodels.stats.anova import anova_lm

# Loading Iris dataset from seaborn
df_iris = sns.load_dataset('iris')
print('Top 5 rows of IRIS dataset : ')
print(df_iris.head())
print('\n===================================================================\n')

# Fit the one-way ANOVA model (sepal length vs Species)
model = ols('sepal_length ~ species', data=df_iris).fit()

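# Calculate the sums of squares from the fitted model.
# Note: in statsmodels, model.ess is the explained (between-groups) sum of squares
# and model.ssr is the sum of squared residuals (within-groups). The labels below use
# "SSE = explained, SSR = residual"; many textbooks use these two names the other way round.
print('Values for Sepal Length vs Species:')
SSE = model.ess  # explained (between-groups) sum of squares
SSR = model.ssr  # residual (within-groups) sum of squares
SST = SSE + SSR  # total sum of squares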

print('SSE:', round(SSE,4))
print('SSR:', round(SSR,4))
print('SST:', round(SST,4))

print('\n===================================================================\n')
# Print the ANOVA table
print(anova_lm(model))

Top 5 rows of IRIS dataset : 
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


Values for Sepal Length vs Species:
SSE: 63.2121
SSR: 38.9562
SST: 102.1683


             df     sum_sq    mean_sq           F        PR(>F)
species     2.0  63.212133  31.606067  119.264502  1.669669e-31
Residual  147.0  38.956200   0.265008         NaN           NaN


In [11]:
#Ans 5

In [12]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generate sample data
np.random.seed(0)
n = 50
factor1 = np.repeat(['A', 'B'], n)
factor2 = np.tile(['X', 'Y'], n)
response = np.random.randn(2 * n)

# Create a DataFrame
df = pd.DataFrame({'Factor1': factor1, 'Factor2': factor2, 'Response': response})

# Fit the two-way ANOVA model
model = ols('Response ~ Factor1 + Factor2 + Factor1:Factor2', data=df).fit()
anova_table = sm.stats.anova_lm(model)

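# Extract the sum of squares attributed to each main effect and to the interaction
# (significance is judged from the F and PR(>F) columns of anova_table, not from these values)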
main_effect_factor1 = anova_table['sum_sq']['Factor1']
main_effect_factor2 = anova_table['sum_sq']['Factor2']
interaction_effect = anova_table['sum_sq']['Factor1:Factor2']

print('Main effect of Factor1:', main_effect_factor1)
print('Main effect of Factor2:', main_effect_factor2)
print('Interaction effect:', interaction_effect)


Main effect of Factor1: 0.6520765471325478
Main effect of Factor2: 0.18491592889441688
Interaction effect: 0.6178948963395011


In [13]:
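# In this example, we generate sample data with two factors (Factor1, Factor2) and the corresponding response variable (Response). We then create a DataFrame df to store the data. Next, we fit the two-way ANOVA model using ols from statsmodels.formula.api and calculate the ANOVA table using sm.stats.anova_lm. From the ANOVA table, we extract the sum of squares for each main effect (main_effect_factor1, main_effect_factor2) and for the interaction (interaction_effect) and print them. These values measure how much variation each term accounts for; to decide whether an effect is statistically significant, inspect the F and PR(>F) columns of anova_table. Because the response here is pure random noise, no effect is expected to be significant.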

# Note: Make sure you have the statsmodels library installed (pip install statsmodels) before running this code. Also, adjust the data and factor names to match your specific dataset.

In [14]:
#Ans 6

In [15]:
# In a one-way ANOVA, the F-statistic is used to test the null hypothesis that the means of all the groups are equal. The p-value associated with the F-statistic helps determine the statistical significance of the observed differences between the groups. In this scenario, an F-statistic of 5.23 and a p-value of 0.02 suggest the following conclusions:

# 1. Differences between the groups: The F-statistic is the ratio of the between-groups mean square to the within-groups mean square. Under the null hypothesis this ratio is expected to be close to 1, so a value of 5.23 means the group means vary about five times more than within-group noise alone would predict. A higher F-value implies larger differences between the group means relative to that noise.

# 2. Statistical significance: The p-value of 0.02 is below the common significance level of 0.05, so at that level we reject the null hypothesis that all group means are equal. At the stricter level of 0.01 the result would not be significant, because 0.02 > 0.01.

# Interpretation: Based on these results, we can conclude that there are statistically significant differences between the groups. However, it's important to note that the one-way ANOVA does not provide information about which specific groups differ from each other. To identify the specific group differences, further post hoc tests (e.g., Tukey's HSD, Bonferroni, or pairwise t-tests) can be conducted.

# It is also important to consider the effect size and practical significance in addition to statistical significance. The effect size measures the magnitude of the differences between groups and provides a measure of the practical significance or importance of the observed differences. Additionally, interpreting the results should take into account the context and relevant domain knowledge to draw meaningful conclusions about the differences between the groups.
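
# The decision rule and an effect size can be sketched as follows. The F-statistic and p-value come from the scenario above; the degrees of freedom (2 and 12, i.e. three groups of five observations) are NOT given in the question and are assumed here purely for illustration.

```python
from scipy import stats

f_statistic = 5.23  # from the scenario
p_value = 0.02      # from the scenario

# The same p-value leads to different decisions at different significance levels
decisions = {}
for alpha in (0.05, 0.01):
    decisions[alpha] = 'reject H0' if p_value < alpha else 'fail to reject H0'
    print(f'alpha = {alpha}: {decisions[alpha]}')

# Hypothetical degrees of freedom (assumption): 3 groups, 15 observations in total
df_between = 2
df_within = 12

# p-value implied by the F-statistic under the assumed degrees of freedom
p_from_f = stats.f.sf(f_statistic, df_between, df_within)

# Eta-squared recovered from F: SSB / SST = (F * df_between) / (F * df_between + df_within)
eta_squared = (f_statistic * df_between) / (f_statistic * df_between + df_within)

print('p-value implied by F under the assumed df:', round(p_from_f, 4))
print('eta-squared under the assumed df:', round(eta_squared, 4))
```

# With different degrees of freedom the same F-statistic would give a different p-value and effect size, which is why both should be reported alongside F.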

In [16]:
#Ans 7

In [17]:
# Handling missing data in a repeated measures ANOVA requires careful consideration to maintain the integrity and validity of the analysis. Here are some common approaches to handling missing data in a repeated measures ANOVA and their potential consequences:

# 1. Complete Case Analysis (Listwise deletion): This approach involves excluding any participant with missing data on any variable included in the analysis. The consequence of this method is a reduction in sample size, which can lead to loss of statistical power and potentially biased results if the missingness is related to the variables under study.

# 2. Pairwise Deletion (Available Case Analysis): With this approach, participants with missing data are excluded only from analyses involving the variables with missing data, while including them in analyses for variables with complete data. This method retains more data compared to complete case analysis, but it can introduce bias if the missingness is not completely random and is related to the variables being analyzed.

# 3. Mean Imputation: In mean imputation, missing values are replaced with the mean value of the variable. This approach assumes that the missing values have the same mean as the observed values. However, mean imputation can underestimate the standard errors and lead to inflated Type I error rates, as it does not account for the uncertainty introduced by imputing the missing values.

# 4. Last Observation Carried Forward (LOCF): LOCF involves replacing missing values with the last observed value from the same participant. This method assumes that missing values remain the same as the last observed value, which may not be accurate. LOCF can result in biased estimates and distort the patterns of change over time.

# 5. Multiple Imputation: Multiple imputation is a more sophisticated approach that generates multiple plausible imputations for missing values, taking into account the uncertainty of the missing data. This approach creates a set of complete datasets with imputed values; the analysis is performed on each dataset and the results are pooled using Rubin's rules. When the data are missing at random and the imputation model is reasonable, multiple imputation gives approximately unbiased estimates, preserves variability, and accounts for the uncertainty introduced by imputing missing values.

# It is crucial to note that no imputation method can guarantee accurate results, and the choice of handling missing data should depend on the nature of the missingness, the assumptions made, and the specific research context. Sensitivity analyses or exploring multiple approaches can provide insights into the robustness of the findings to different missing data strategies. Consulting with a statistician or data analyst is recommended to determine the most appropriate approach for handling missing data in a repeated measures ANOVA.
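
# Three of the simpler strategies above can be compared on a toy example. The DataFrame below is hypothetical (five participants, three time points, in wide format); it only illustrates the mechanics and the sample-size and variance consequences, not a recommended analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format repeated measures data (assumption): one row per participant
scores = pd.DataFrame(
    {
        't1': [10.0, 12.0, 11.0, 9.0, 13.0],
        't2': [11.0, np.nan, 12.0, 10.0, 14.0],
        't3': [12.0, 14.0, np.nan, np.nan, 15.0],
    },
    index=['p1', 'p2', 'p3', 'p4', 'p5'],
)

# 1. Listwise deletion: keep only participants with no missing time point
listwise = scores.dropna()

# 2. Mean imputation: replace each missing value with that time point's observed mean
mean_imputed = scores.fillna(scores.mean())

# 3. LOCF: carry each participant's last observed value forward across time points
locf = scores.ffill(axis=1)

print('Participants kept by listwise deletion:', len(listwise), 'of', len(scores))
print('Variance of t3, observed values only:', round(scores['t3'].var(), 4))
print('Variance of t3 after mean imputation:', round(mean_imputed['t3'].var(), 4))
print(locf)
```

# The drop in variance after mean imputation is the mechanism behind the underestimated standard errors mentioned above.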

In [18]:
#Ans 8

In [19]:
# After conducting an analysis of variance (ANOVA) and obtaining a significant result, post-hoc tests are often performed to determine which specific group differences are significant. Some common post-hoc tests used after ANOVA include:

# 1. Tukey's Honestly Significant Difference (HSD): Tukey's HSD is widely used to compare all possible pairwise group differences. It controls the familywise error rate, making it suitable when you have several groups and want to identify which specific pairs of groups differ significantly.

# 2. Bonferroni correction: The Bonferroni correction adjusts the significance level for each comparison to maintain an overall desired alpha level. It is a conservative approach that divides the desired significance level (e.g., 0.05) by the number of pairwise comparisons. Bonferroni correction is commonly used when there are a small number of planned comparisons.

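# 3. Scheffe's test: Scheffe's test is a conservative post-hoc test that allows for any contrast among group means, not only pairwise comparisons. For pairwise comparisons it is more conservative (less powerful) than Tukey's HSD. It can be used with unequal sample sizes, but like ANOVA it assumes equal variances; when variances are unequal, a procedure such as the Games-Howell test is more appropriate.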

# 4. Dunnett's test: Dunnett's test is used when comparing multiple treatment groups to a control group. It controls the overall error rate, making it suitable for situations where you have a control group and want to determine which treatment groups differ significantly from the control.

# Example: Let's consider a scenario where a researcher investigates the effectiveness of three different teaching methods (A, B, and C) on student performance. The researcher conducts a one-way ANOVA and finds a significant overall difference among the groups. To identify which specific pairs of teaching methods differ significantly, a post-hoc test would be necessary.

# For instance, the researcher might use Tukey's HSD post-hoc test to compare the mean performance between all possible pairs of teaching methods (A vs. B, A vs. C, and B vs. C). Tukey's HSD will provide adjusted p-values for each pairwise comparison, indicating which specific pairs of teaching methods have significant differences in student performance.

# Using a post-hoc test in this situation is essential because the ANOVA only tells us that there are significant differences among the groups, but it does not specify which pairs of groups differ significantly. The post-hoc test allows for a more detailed analysis by identifying the specific group differences, aiding in the interpretation of the results and providing more meaningful insights.
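
# The teaching-methods example can be sketched in code. The scores below are hypothetical simulated data (the means, spread, and group size are assumptions), and scipy.stats.tukey_hsd is used here, which requires SciPy 1.8 or later; statsmodels offers a comparable function.

```python
import numpy as np
from scipy import stats

# Hypothetical student scores for three teaching methods (assumption)
rng = np.random.default_rng(42)
method_a = rng.normal(loc=70, scale=5, size=30)
method_b = rng.normal(loc=75, scale=5, size=30)
method_c = rng.normal(loc=80, scale=5, size=30)

# Step 1: omnibus one-way ANOVA
f_stat, p_value = stats.f_oneway(method_a, method_b, method_c)
print(f'ANOVA: F = {f_stat:.3f}, p = {p_value:.3g}')

# Step 2: Tukey's HSD for all pairwise comparisons (familywise error rate controlled)
result = stats.tukey_hsd(method_a, method_b, method_c)

# result.pvalue is a symmetric matrix; entry [i, j] compares group i with group j
labels = ['A', 'B', 'C']
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f'{labels[i]} vs {labels[j]}: adjusted p = {result.pvalue[i, j]:.4f}')
```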

In [20]:
#Ans 9

In [21]:
import numpy as np
from scipy import stats

# Generate sample data
np.random.seed(0)
diet_A = np.random.normal(loc=5, scale=2, size=50)
diet_B = np.random.normal(loc=7, scale=2, size=50)
diet_C = np.random.normal(loc=4, scale=2, size=50)

# Concatenate the data
weight_loss_data = np.concatenate([diet_A, diet_B, diet_C])

# Create the group labels
groups = np.repeat(['A', 'B', 'C'], 50)

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Report the results
print('One-way ANOVA Results:')
print('F-statistic:', f_statistic)
print('p-value:', p_value)


One-way ANOVA Results:
F-statistic: 18.630408350295305
p-value: 6.144679136842842e-08


In [22]:
# In this example, we generate sample weight loss data for each diet (A, B, C) using the numpy.random.normal function. We then concatenate the data and create corresponding group labels; these are not needed by f_oneway itself but are the form that long-format tools and post-hoc tests typically expect. Next, we use the stats.f_oneway function from scipy to perform the one-way ANOVA analysis. The function takes the weight loss data for each diet as separate arguments.

# Finally, we report the results, including the F-statistic and the p-value.

# Interpretation: The null hypothesis is that the mean weight loss is the same for all three diets. Here the F-statistic is about 18.63 and the p-value is about 6.1e-08, far below the 0.05 significance level, so we reject the null hypothesis and conclude that at least one diet's mean weight loss differs from the others.

# Had the p-value been greater than 0.05, we would have failed to reject the null hypothesis. That would mean the data provide no evidence of a difference, not that the means are proven equal. A significant result also does not say which diets differ; a post-hoc test is needed for that.

# Note: Adjust the data generation process to match your specific scenario or use your own weight loss dataset in the code.
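
# To see which diets differ, one option is pairwise t-tests with a Bonferroni correction. The sketch below regenerates the same simulated diets (same seed and generation order as above); the choice of Bonferroni over Tukey's HSD here is for illustration only.

```python
from itertools import combinations

import numpy as np
from scipy import stats

# Regenerate the simulated diets with the same seed and order as the ANOVA example
np.random.seed(0)
diets = {
    'A': np.random.normal(loc=5, scale=2, size=50),
    'B': np.random.normal(loc=7, scale=2, size=50),
    'C': np.random.normal(loc=4, scale=2, size=50),
}

# Bonferroni: divide the familywise alpha by the number of comparisons
pairs = list(combinations(diets, 2))
alpha = 0.05
alpha_adjusted = alpha / len(pairs)

results = {}
for first, second in pairs:
    t_stat, p_val = stats.ttest_ind(diets[first], diets[second])
    results[(first, second)] = p_val
    verdict = 'significant' if p_val < alpha_adjusted else 'not significant'
    print(f'{first} vs {second}: p = {p_val:.3g} ({verdict} at adjusted alpha {alpha_adjusted:.4f})')
```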

In [24]:
#Ans 10

In [26]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Setting random seed for reproducibility
np.random.seed(123)

# Generating 2 random time samples for novice and expert
time_novice = np.random.normal(loc=15, scale=2, size=30)
time_expert = np.random.normal(loc=10, scale=2, size=30)

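# Generate simulated data
# Caveat: with this layout Software A is used only by novices and Software C only by
# experienced employees, so the two factors are partly confounded and two of the six
# Software x Experience cells are empty. A real study should observe every combination.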
data = pd.DataFrame({
    'Software': ['A']*20 + ['B']*20 + ['C']*20,
    'Experience': ['Novice']*30 + ['Experienced']*30,
    'Time': list(time_novice)+list(time_expert)
})

# Print the simulated data head 
print('Simulated Data example :')
print(data.head())

print('\n======================================================================================\n')

# Fit the two-way ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
table = sm.stats.anova_lm(model, typ=1)

# Set significance level
alpha = 0.05

# Main effects and interaction effect
print(table)
print('\n')

# Look up each p-value by row label rather than by position
p_software = table.loc['C(Software)', 'PR(>F)']
p_experience = table.loc['C(Experience)', 'PR(>F)']
p_interaction = table.loc['C(Software):C(Experience)', 'PR(>F)']

if p_software < alpha:
    print("Conclusion: There is a significant main effect of software.")
else:
    print("Conclusion: There is no significant main effect of software.")

if p_experience < alpha:
    print("Conclusion: There is a significant main effect of experience.")
else:
    print("Conclusion: There is no significant main effect of experience.")

if p_interaction < alpha:
    print("Conclusion: There is a significant interaction effect between software and experience.")
else:
    print("Conclusion: There is no significant interaction effect between software and experience.")


Simulated Data example :
  Software Experience       Time
0        A     Novice  12.828739
1        A     Novice  16.994691
2        A     Novice  15.565957
3        A     Novice  11.987411
4        A     Novice  13.842799


                             df      sum_sq     mean_sq          F  \
C(Software)                 2.0  204.881181  102.440590  18.135666   
C(Experience)               1.0  165.079097  165.079097  29.224933   
C(Software):C(Experience)   2.0   17.481552    8.740776   1.547431   
Residual                   56.0  316.319953    5.648571        NaN   

                                 PR(>F)  
C(Software)                8.460472e-07  
C(Experience)              1.375177e-06  
C(Software):C(Experience)  2.217544e-01  
Residual                            NaN  


Conclusion: There is a significant main effect of software.
Conclusion: There is a significant main effect of experience.
Conclusion: There is no significant interaction effect between software and experience.


In [27]:
# "There is a significant main effect of software": This means that mean completion time differs between at least two of the software programs. The ANOVA table alone does not say which program is fastest; that requires comparing the group means or running a post-hoc test.

# "There is a significant main effect of experience": This means that mean completion time differs between novice and experienced employees. In the simulated data, novice times were drawn around a mean of 15 and experienced times around a mean of 10, so experienced employees complete the task faster. This finding can help the company assign employees to tasks and plan training for new employees.

# "There is NO significant interaction effect between software and experience": This means the data provide no evidence that the effect of software on completion time depends on the experience level, or vice versa. It is not proof that the programs perform identically for both groups. In this simulation the conclusion deserves extra caution, because Software A was used only by novices and Software C only by experienced employees, so the effects of software and experience cannot be fully separated.
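
# Before interpreting main effects and interactions, it helps to check that every Software x Experience combination was actually observed. The sketch below rebuilds only the factor layout used in the simulation above and counts the observations per cell.

```python
import pandas as pd

# Same factor layout as the simulated data above (the Time column is not needed here)
design = pd.DataFrame({
    'Software': ['A'] * 20 + ['B'] * 20 + ['C'] * 20,
    'Experience': ['Novice'] * 30 + ['Experienced'] * 30,
})

# Number of observations in each Software x Experience cell
cell_counts = pd.crosstab(design['Software'], design['Experience'])
print(cell_counts)

# Cells with zero observations mean the factors cannot be fully separated
empty_cells = int((cell_counts == 0).sum().sum())
print('Empty cells:', empty_cells)
```

# A balanced design, with the same number of novices and experienced employees for each program, avoids empty cells and makes the main effects and the interaction separately estimable.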

In [28]:
#Ans 11

In [30]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind

# Setting numpy random seed
np.random.seed(45)

# Generating normally distributed test scores with the same variance for the control and experimental groups
test_score_control = np.random.normal(loc=70, scale=3, size=50)
test_score_experimental = np.random.normal(loc=85, scale=3, size=50)

# Creating the dataframe
df = pd.DataFrame({'test_score':list(test_score_control)+list(test_score_experimental),
                   'group':['control']*50 + ['experimental']*50})

# printing the sample dataframe
print('Simulated data for test_scores:')
print(df.head())
print('\n===============================\n')

null_hypothesis = "There is NO difference in test scores between the control and experimental groups."
alt_hypothesis = "There is SIGNIFICANT difference in test scores between the control and experimental groups."

# Conduct the two-sample t-test
control_scores = df[df['group'] == 'control']['test_score']
experimental_scores = df[df['group'] == 'experimental']['test_score']
t_stat, p_val = ttest_ind(control_scores, experimental_scores, equal_var=True)
print(f"t-statistic: {t_stat:.4f}, p-value: {p_val}")
print('\n')

# Significance value 
alpha = 0.05
if p_val<alpha:
    print('Reject the Null Hypothesis')
    print(f'Conclusion : {alt_hypothesis}')
else:
    print('Failed to reject the Null Hypothesis')
    print(f'Conclusion : {null_hypothesis}')

Simulated data for test_scores:
   test_score    group
0   70.079124  control
1   70.780965  control
2   68.814563  control
3   69.387097  control
4   66.185102  control


t-statistic: -28.5074, p-value: 3.096206271894725e-49


Reject the Null Hypothesis
Conclusion : There is SIGNIFICANT difference in test scores between the control and experimental groups.
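
# A very small p-value says the difference is unlikely to be due to chance, not how large it is. As a rough sketch, Cohen's d (an effect size not computed in the cell above) can be added using the same seed and simulated scores; the pooled-standard-deviation form shown here assumes equal variances, matching equal_var=True.

```python
import numpy as np
from scipy.stats import ttest_ind

# Regenerate the simulated scores with the same seed and order as above
np.random.seed(45)
control = np.random.normal(loc=70, scale=3, size=50)
experimental = np.random.normal(loc=85, scale=3, size=50)

n1, n2 = len(control), len(experimental)

# Pooled variance across the two groups (equal-variance assumption)
pooled_var = ((n1 - 1) * control.var(ddof=1) + (n2 - 1) * experimental.var(ddof=1)) / (n1 + n2 - 2)

# Cohen's d: mean difference expressed in pooled standard deviations
cohens_d = (experimental.mean() - control.mean()) / np.sqrt(pooled_var)

# Cross-check against the pooled t-statistic: d = -t * sqrt(1/n1 + 1/n2)
t_stat, p_val = ttest_ind(control, experimental, equal_var=True)

print(f"Cohen's d: {cohens_d:.4f}")
print(f'd recovered from t: {-t_stat * np.sqrt(1 / n1 + 1 / n2):.4f}')
```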


In [31]:
#Ans 12