In [None]:
#Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.
ANOVA (Analysis of Variance) is a statistical technique used to compare means across two or more groups. To use ANOVA and interpret its results correctly, several assumptions must be met:
Independence: Observations must be independent of each other, both within and between groups. This means that the value of one observation should not influence or predict the value of any other. Violations occur when observations are correlated, such as in repeated measures designs or nested data structures.
Normality: The residuals (the differences between the observed values and the predicted values) should be normally distributed for each group. While ANOVA is robust to violations of normality when sample sizes are large, departures from normality can affect the accuracy of p-values and confidence intervals, especially with small sample sizes.
Homogeneity of variances (Homoscedasticity): The variability of the dependent variable should be approximately equal across all groups. This means that the spread of data points around the group means should be similar for all groups. Violations of homogeneity of variances can lead to inaccurate p-values and confidence intervals, particularly when sample sizes are unequal.

Examples of violations that could impact the validity of ANOVA results:
Non-independence: In a study where individuals are grouped by family or household, the assumption of independence may be violated because observations within the same family or household may be correlated.
Non-normality: Strongly skewed or heavy-tailed outcomes, such as reaction times, tend to produce residuals that are not normally distributed within each group. This can affect the accuracy of the p-values and confidence intervals, particularly with small sample sizes.
Non-homogeneity of variances: In a study comparing the effectiveness of different teaching methods, if test scores vary far more under one method than under the others, the assumption of homogeneity of variances may be violated.
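
The normality and equal-variance assumptions above can be screened numerically. The following is a minimal sketch, assuming three small hypothetical groups (the values are illustrative only); it uses the Shapiro-Wilk test on the residuals and Levene's test across groups. Independence cannot be tested this way and has to be judged from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three independent groups (illustrative values only)
group_a = np.array([23, 25, 27, 30, 32], dtype=float)
group_b = np.array([20, 22, 25, 28, 30], dtype=float)
group_c = np.array([18, 21, 24, 26, 28], dtype=float)
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk on the residuals (each value minus its own group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variances: Levene's test (median-centred by default)
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may not hold. With samples this small the tests have little power, so a residual plot is a useful complement.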

In [None]:
#Q2. What are the three types of ANOVA, and in what situations would each be used?
Each type of ANOVA is used in different situations based on the number of independent variables and their levels:

One-way ANOVA: Used when comparing the means of three or more independent groups or levels of a single categorical variable. For example, comparing the mean test scores of students in three different teaching methods (Group 1: Traditional, Group 2: Online, Group 3: Blended).
Two-way ANOVA: Used when examining the effects of two independent categorical variables (factors) on a continuous dependent variable. For example, examining the effects of both gender (Factor 1: Male vs. Female) and age group (Factor 2: Young Adults vs. Middle-aged Adults) on blood pressure.
N-way ANOVA: Used when examining the effects of more than two independent categorical variables (factors) on a continuous dependent variable. Each factor can have two or more levels. For example, examining the effects of education level (Factor 1: High School, Bachelor's, Master's, Ph.D.), gender (Factor 2: Male vs. Female), and age group (Factor 3: Young Adults, Middle-aged Adults, Seniors) on income.

In [None]:
#Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?
#The partitioning of variance in ANOVA refers to splitting the total variability in the dependent variable (SST) into a between-group component and a within-group component, so that SST = SS_between + SS_within. It's important because the F-statistic is the ratio of the between-group mean square to the within-group mean square. Understanding the partition shows how much of the variability is due to differences between groups and how much is due to differences within groups, which aids the interpretation of results and the identification of sources of variation.

In [2]:
#Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?
import numpy as np
from scipy import stats

# Sample data for one-way ANOVA
group1 = [23, 25, 27, 30, 32]
group2 = [20, 22, 25, 28, 30]
group3 = [18, 21, 24, 26, 28]

# Combine all groups into a single array
data = np.concatenate([group1, group2, group3])

# Calculate overall mean (Grand Mean)
grand_mean = np.mean(data)

# Calculate Total Sum of Squares (SST)
sst = np.sum((data - grand_mean) ** 2)

# Calculate group means
group_means = [np.mean(group) for group in [group1, group2, group3]]

# Calculate Explained Sum of Squares (SSE)
sse = np.sum([len(group) * (mean - grand_mean) ** 2 for group, mean in zip([group1, group2, group3], group_means)])

# Calculate Residual Sum of Squares (SSR)
ssr = sst - sse

# Output group means
print("Group Means:")
for i, mean in enumerate(group_means, start=1):
    print(f"Group {i}: {mean}")

print("\nTotal Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)


Group Means:
Group 1: 27.4
Group 2: 25.0
Group 3: 23.4

Total Sum of Squares (SST): 224.93333333333337
Explained Sum of Squares (SSE): 40.53333333333333
Residual Sum of Squares (SSR): 184.40000000000003
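
The residual sum of squares above is obtained by subtraction. As a cross-check, the sketch below computes the within-group sum of squares directly from the same three groups, confirms that the two components add up to SST, and rebuilds the F-statistic from the mean squares. The variable names `ss_between` and `ss_within` are choices made for this sketch; they correspond to SSE and SSR in the question's terminology.

```python
import numpy as np
from scipy.stats import f_oneway

groups = [
    np.array([23, 25, 27, 30, 32], dtype=float),
    np.array([20, 22, 25, 28, 30], dtype=float),
    np.array([18, 21, 24, 26, 28], dtype=float),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group (explained) and within-group (residual) sums of squares
sst = np.sum((all_values - grand_mean) ** 2)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F = mean square between / mean square within
k = len(groups)
n_total = all_values.size
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))
f_scipy, p_scipy = f_oneway(*groups)

print("SST equals SSE + SSR:", np.isclose(sst, ss_between + ss_within))
print("Manual F matches scipy:", np.isclose(f_manual, f_scipy))
```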


In [None]:
#Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

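# `df` is assumed to be a pandas DataFrame with the columns
# 'dependent_variable', 'factor1' and 'factor2'.
# Fit the ANOVA model with both main effects and their interaction
model = ols('dependent_variable ~ C(factor1) + C(factor2) + C(factor1):C(factor2)', data=df).fit()

# Build the ANOVA table (Type II sums of squares)
anova_table = anova_lm(model, typ=2)

# Main effects: F-statistic and p-value for each factor
main_effects = anova_table.loc[['C(factor1)', 'C(factor2)'], ['F', 'PR(>F)']]

# Interaction effect: F-statistic and p-value for the interaction term
interaction_effect = anova_table.loc['C(factor1):C(factor2)', ['F', 'PR(>F)']]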


In [None]:
#Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?
In this scenario:

F-statistic = 5.23
p-value = 0.02
Since the p-value (0.02) is less than the commonly chosen significance level of 0.05, we reject the null hypothesis. This means that there is sufficient evidence to conclude that there are statistically significant differences between the groups.

Interpretation:

The differences between the groups are statistically significant, as indicated by the low p-value.
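The F-statistic (5.23) means that the variability between the group means is about five times the variability within the groups, which is more than would be expected by chance if all group means were equal.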
In summary, based on the results of the one-way ANOVA:

We reject the null hypothesis of equal group means.
There are statistically significant differences between the groups.
This suggests that at least one group mean is different from the others, but further post-hoc tests or pairwise comparisons would be needed to determine which specific groups differ from each other.
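
To make the link between the F-statistic and the p-value concrete, here is a small sketch using the F distribution from SciPy. The question does not state the degrees of freedom, so the values below (3 groups, 17 observations) are assumptions, chosen only because they roughly reproduce the reported p-value. The effect size uses the one-way identity eta squared = F * df_between / (F * df_between + df_within).

```python
from scipy.stats import f

f_statistic = 5.23
df_between = 2    # assumption: 3 groups
df_within = 14    # assumption: 17 observations in total
alpha = 0.05

# p-value: probability of an F at least this large if all group means are equal
p_value = f.sf(f_statistic, df_between, df_within)

# Critical value: the smallest F that would be significant at alpha
f_critical = f.ppf(1 - alpha, df_between, df_within)

# Effect size: share of total variability attributable to group membership
eta_squared = (f_statistic * df_between) / (f_statistic * df_between + df_within)

print(f"p-value: {p_value:.4f}")
print(f"Critical F at alpha={alpha}: {f_critical:.2f}")
print(f"Eta squared: {eta_squared:.3f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

With different degrees of freedom the same F-statistic yields a different p-value, which is why both numbers should be reported together.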

In [None]:
#Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?
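Listwise deletion: Exclude cases with missing data from the analysis entirely. This approach ensures that only complete cases are included, but it reduces the sample size and statistical power and can bias the results if data are not missing completely at random.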
Pairwise deletion: Analyze only the available data for each pairwise combination of variables. This approach maximizes the use of available data but may lead to biased estimates if data are not missing completely at random.
Mean substitution: Replace missing values with the mean of the observed values for that variable. This approach preserves the sample size but can distort the variance and covariance structure of the data.
Last observation carried forward (LOCF): Replace missing values with the last observed value for that variable. This approach assumes that missing values remain constant over time and may introduce bias if this assumption is violated.
Interpolation: Estimate missing values based on the observed values before and after the missing data point. This approach assumes a linear or nonlinear trend between observed data points and may provide more accurate estimates than mean substitution or LOCF.
Multiple imputation: Generate multiple plausible values for each missing data point based on observed data and impute them using statistical models. This approach accounts for uncertainty associated with missing data and provides more accurate parameter estimates than single imputation methods.
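
Several of the simpler methods above map directly onto pandas operations. The sketch below assumes a small hypothetical wide-format table (one row per subject, one column per time point) and shows listwise deletion, mean substitution, LOCF and linear interpolation. Multiple imputation is not shown because it requires a dedicated modelling step.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: rows are subjects, columns are time points
scores = pd.DataFrame(
    {
        "t1": [10.0, 12.0, 11.0, 9.0],
        "t2": [11.0, np.nan, 12.0, 10.0],
        "t3": [13.0, 15.0, np.nan, 12.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Listwise deletion: keep only subjects with complete data
listwise = scores.dropna()

# Mean substitution: fill each gap with that time point's observed mean
mean_filled = scores.fillna(scores.mean())

# LOCF: carry each subject's last observed value forward in time
locf = scores.ffill(axis=1)

# Linear interpolation between a subject's neighbouring time points
interpolated = scores.interpolate(axis=1)

print("Subjects kept by listwise deletion:", list(listwise.index))
print("s2 at t2 -> mean:", mean_filled.loc["s2", "t2"],
      "| LOCF:", locf.loc["s2", "t2"],
      "| interpolated:", interpolated.loc["s2", "t2"])
```

The same missing cell receives a different value under each method, which illustrates why the choice can change the ANOVA result.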

In [None]:
#Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

Common Post-Hoc Tests Used After ANOVA:

Tukey's Honestly Significant Difference (HSD) Test
Bonferroni Correction
Scheffé Test
Dunnett's Test
Fisher's Least Significant Difference (LSD) Test
Games-Howell Test
Sidak Correction
Example Situation:
Suppose you conducted a one-way ANOVA to compare the mean scores of three different teaching methods (A, B, and C) on student performance. The ANOVA results indicate that there is a statistically significant difference among the means of the three groups (p < 0.05). However, ANOVA does not provide information on which specific groups differ from each other. In this case, a post-hoc test is necessary to determine pairwise differences between the teaching methods.

You would use each post-hoc test based on specific considerations such as the assumptions of the test, the number of comparisons being made, and the desired level of significance. For example:

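Tukey's HSD test is commonly used when all pairwise comparisons need to be made. It assumes homogeneity of variances; the Tukey-Kramer variant extends it to unequal sample sizes.
Bonferroni correction is a conservative method that adjusts the significance level to control the familywise error rate when conducting multiple comparisons.
Scheffé test is the most conservative option and is suited to testing all possible contrasts, including complex ones that combine several groups, rather than only pairwise comparisons.
Dunnett's test is appropriate when comparing multiple treatment groups to a control group.
Fisher's LSD test is the least conservative method because it applies no correction for multiple comparisons; it is generally reserved for a small number of groups following a significant overall F-test.
Games-Howell test is suitable when variances or sample sizes are unequal across groups, since it does not assume homogeneity of variances.
Sidak correction adjusts the significance level in a similar way to Bonferroni but is slightly less conservative; it assumes the comparisons are independent.
The choice of post-hoc test depends on the specific research question, assumptions, and desired level of control for Type I error.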
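
As an illustration of the teaching-methods example, the sketch below runs all pairwise t-tests and applies a Bonferroni correction by multiplying each raw p-value by the number of comparisons. The scores are hypothetical values invented for this sketch, and Bonferroni is used here only because it is simple to compute by hand; it is not the only reasonable choice.

```python
from itertools import combinations
from scipy.stats import ttest_ind

# Hypothetical scores for teaching methods A, B and C
scores = {
    "A": [72, 75, 78, 71, 74, 77],
    "B": [80, 83, 79, 85, 82, 81],
    "C": [73, 76, 74, 72, 77, 75],
}

pairs = list(combinations(scores, 2))
n_comparisons = len(pairs)

# Bonferroni: multiply each raw p-value by the number of comparisons (capped at 1)
adjusted = {}
for first, second in pairs:
    t_stat, p_raw = ttest_ind(scores[first], scores[second])
    adjusted[(first, second)] = min(p_raw * n_comparisons, 1.0)

for pair, p_adj in adjusted.items():
    verdict = "differs" if p_adj < 0.05 else "no evidence of a difference"
    print(pair, round(p_adj, 4), verdict)
```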

In [47]:
#Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.
from scipy.stats import f_oneway
import numpy as np
# Simulated weight loss for 50 participants per diet (no random seed is set, so results vary between runs)
diet_a = np.round(np.random.uniform(1.5,2.9,50),1)
diet_b = np.round(np.random.uniform(1.5,2.9,50),1)
diet_c = np.round(np.random.uniform(1.5,2.9,50),1)
# Compute the F-statistic and p-value, then compare the p-value against alpha = 0.05
alpha=0.05
f_statistics,p_value=f_oneway(diet_a,diet_b,diet_c)
print("F-Statistic:", f_statistics)
print("P-Value:", p_value)
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the mean weight loss of the three diets.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the mean weight loss of the three diets.")


F-Statistic: 0.32571598650704403
P-Value: 0.7225299512887295
Fail to reject the null hypothesis. There is no significant difference between the mean weight loss of the three diets.
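
The question can also be read as 50 participants in total, split across the three diets. The sketch below follows that reading and uses a seeded generator so that a run is repeatable. The 17/17/16 split and the mean and spread of the simulated weight loss are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)  # seeded so the simulated data are repeatable

# Randomly assign 50 hypothetical participants to diets A, B and C (17/17/16 split)
assignments = rng.permutation(np.array(["A"] * 17 + ["B"] * 17 + ["C"] * 16))

# Simulated weight loss in kg; the mean and spread are illustrative assumptions
weight_loss = rng.normal(loc=2.2, scale=0.4, size=assignments.size)

# One sample of weight-loss values per diet
samples = [weight_loss[assignments == diet] for diet in ("A", "B", "C")]

f_stat, p_val = f_oneway(*samples)
print(f"F = {f_stat:.3f}, p = {p_val:.3f}")
```

Because every participant is drawn from the same distribution, a non-significant result is the expected outcome here.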


In [53]:
#Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to
#complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
# Generate sample data
np.random.seed(0)  # for reproducibility
n = 30
software_programs = np.random.choice(['A', 'B', 'C'], n)
experience_levels = np.random.choice(['Novice', 'Experienced'], n)
task_completion_time = np.random.normal(loc=10, scale=2, size=n)  # sample task completion times
# Create DataFrame
df = pd.DataFrame({'Software': software_programs, 'Experience': experience_levels, 'Time': task_completion_time})
# Fit the ANOVA model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=df).fit()
# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)


                              sum_sq    df         F    PR(>F)
C(Software)                11.141545   2.0  2.113814  0.142706
C(Experience)               2.102143   1.0  0.797652  0.380665
C(Software):C(Experience)   6.013261   2.0  1.140857  0.336272
Residual                   63.249921  24.0       NaN       NaN

Interpretation: At the 0.05 significance level, none of the effects is statistically significant. The main effect of software (F(2, 24) = 2.11, p = 0.143), the main effect of experience (F(1, 24) = 0.80, p = 0.381), and the software-by-experience interaction (F(2, 24) = 1.14, p = 0.336) all have p-values above 0.05. This is expected here because the completion times were simulated independently of both factors.


In [127]:
#Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a 
#two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.
import numpy as np
from scipy.stats import ttest_ind
#np.random.seed(0)  # for reproducibility
control_group_scores = np.random.normal(loc=70, scale=10, size=100)
experimental_group_scores = np.random.normal(loc=75, scale=10, size=100)
t_statistic, p_value = ttest_ind(control_group_scores, experimental_group_scores)
print("T-Statistic:", t_statistic)
print("P-Value:", p_value)
alpha = 0.05  # significance level
if p_value < alpha:
    print("There is a significant difference in test scores between the control group and the experimental group.")
else:
    print("There is no significant difference in test scores between the control group and the experimental group.")

# With only two groups, the t-test above already identifies the pair that differs;
# Tukey's HSD is run here only because the question asks for a post-hoc test.
if p_value < alpha:
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    # Stack the scores end to end so that they line up with the group labels
    scores = np.concatenate([control_group_scores, experimental_group_scores])
    labels = ['Control'] * 100 + ['Experimental'] * 100

    # Perform Tukey's HSD test
    tukey_results = pairwise_tukeyhsd(scores, labels, alpha=0.05)
    print(tukey_results)




T-Statistic: -5.3443793223209415
P-Value: 2.4835668985382e-07
There is a significant difference in test scores between the control group and the experimental group.


In [None]:
 #Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
#significant differences in sales between the three stores. If the results are significant, follow up with a post- hoc test to determine which store(s) differ significantly from each other.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd
np.random.seed(0)  # fix the seed so the simulated sales are reproducible
store_a_sales = np.random.normal(loc=100, scale=20, size=30)
store_b_sales = np.random.normal(loc=110, scale=20, size=30)
store_c_sales = np.random.normal(loc=120, scale=20, size=30)
sales_data = pd.DataFrame({
    'Day': np.arange(1, 31),  # each day acts as the repeated "subject"
    'Store A': store_a_sales,
    'Store B': store_b_sales,
    'Store C': store_c_sales
})
# AnovaRM expects long format: one row per (day, store) observation
sales_long = sales_data.melt(id_vars='Day', var_name='Store', value_name='Sales')
# Perform the repeated measures ANOVA with store as the within-subject factor
anova_results = AnovaRM(sales_long, depvar='Sales', subject='Day', within=['Store']).fit()
print(anova_results.anova_table)

p_value = anova_results.anova_table.loc['Store', 'Pr > F']
if p_value < 0.05:
    # Perform Tukey's HSD test. Note that it treats the stores as independent groups;
    # paired t-tests with a Bonferroni correction would respect the repeated design.
    tukey_results = pairwise_tukeyhsd(sales_long['Sales'], sales_long['Store'], alpha=0.05)
    print(tukey_results)

