In [None]:
#Q1
Analysis of Variance (ANOVA) is a statistical technique used to compare means across multiple groups
or treatments. To use ANOVA effectively, certain assumptions must be met. Violations of these assumptions
can impact the validity of the results. The key assumptions for ANOVA are:

1. Independence: Observations within and between groups should be independent of each other. In experimental
designs, this means that the treatment or condition assigned to one subject should not affect another
subject's response. Violation: If observations are not independent (e.g., repeated measures on the same 
subject without proper modeling), standard errors are typically underestimated and Type I error rates
are inflated.

2. Normality: The residuals (differences between observed values and group means) should follow a normal 
distribution. This assumption applies to the residuals, not to the raw data pooled across groups. 
Violation: Departures from normality can result in misleading p-values and confidence intervals, especially
with small or unbalanced samples. If the residuals are strongly non-normal, it may be necessary to transform
the data or consider non-parametric tests.

3. Homogeneity of Variance (Homoscedasticity): The variance of residuals should be roughly equal across 
groups. Violation: Heteroscedasticity (unequal variances) distorts the F-test, particularly when group sizes
are unequal. If variances differ substantially, consider Welch's ANOVA, other robust methods, or
transformations.

Examples of violations and their impact:

1. Non-normality: If the residuals are not normally distributed, it can lead to incorrect p-values and 
confidence intervals. For example, count data with many zeros usually produce right-skewed residuals,
and using ANOVA without an appropriate transformation could give misleading results.

2. Heteroscedasticity: If the assumption of equal variances is violated, the F-test's error rate can drift
from its nominal level. For instance, in a study comparing the impact of two teaching methods on student
scores, if one method consistently shows more variation in scores and that group is also the smaller one,
the test tends to reject the null hypothesis too often.

3. Lack of Independence: In longitudinal studies or clustered data, failing to account for dependencies
among observations can result in incorrect p-values. For example, in a study measuring blood pressure 
across time points within the same individuals, ignoring this dependency can lead to underestimated 
standard errors.

Addressing Assumption Violations:

1. Transformations: If data violate normality or homoscedasticity, transforming the data 
(e.g., taking logarithms of positive, right-skewed values) may help meet these assumptions.

2. Non-parametric Tests: If normality cannot be assumed, consider rank-based tests like the
Kruskal-Wallis test, which does not assume normally distributed residuals. It still requires independent
observations and, to be read as a comparison of medians, similarly shaped distributions across groups.

3. Robust Methods: Welch's ANOVA does not assume equal variances, and other robust ANOVA methods can be
used when there are concerns about both non-normality and heteroscedasticity.

4. Repeated Measures and Mixed Models: For repeated measures or clustered data, use repeated measures ANOVA,
mixed-design ANOVA, or mixed-effects models, all of which account for dependencies among observations.

It's important to assess these assumptions before interpreting ANOVA results to ensure the validity and 
reliability of your conclusions. Violations should be addressed or alternative methods considered when 
assumptions are not met.
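
The checks described above can be sketched in a few lines. This is a minimal illustration, not a complete diagnostic workflow: the three groups are made-up values drawn from a random generator, and the choice of the Shapiro-Wilk and Levene tests is an assumption about which diagnostics to use.

```python
import numpy as np
from scipy import stats

# Hypothetical example data: three independent groups (values are made up).
rng = np.random.default_rng(42)
groups = [rng.normal(loc=mu, scale=2.0, size=20) for mu in (10.0, 12.0, 11.0)]

# Residuals are each observation minus its own group mean.
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk tests the pooled residuals for departure from normality.
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene's test (median-centered by default in SciPy) checks equality of variances.
levene_stat, levene_p = stats.levene(*groups)

# Kruskal-Wallis is a rank-based alternative if normality looks doubtful.
kw_stat, kw_p = stats.kruskal(*groups)

print(f"Shapiro-Wilk p-value:   {shapiro_p:.3f}")
print(f"Levene p-value:         {levene_p:.3f}")
print(f"Kruskal-Wallis p-value: {kw_p:.3f}")
```

A small p-value from Shapiro-Wilk or Levene signals a possible violation; independence cannot be tested this way and has to be judged from the study design.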

In [None]:
#Q2
Analysis of Variance (ANOVA) is a statistical technique used to analyze the differences among group
means in a dataset. There are three main types of ANOVA, each suited for different situations:

One-Way ANOVA:

Situation: One-Way ANOVA is used when you want to compare the means of three or more independent 
(unrelated) groups or treatments. It answers the question of whether there are statistically 
significant differences among these groups.
Example: You have four different types of fertilizers, and you want to test if they result in 
significantly different crop yields.

Two-Way ANOVA:

Situation: Two-Way ANOVA is used when you want to simultaneously analyze the effects of two independent
categorical variables (factors) on a dependent variable. It helps determine if there are main effects of
each factor and whether there is an interaction effect between them.
Example: You want to study the effects of both diet type (Factor A: low-fat, high-fat) and exercise 
frequency (Factor B: sedentary, moderate, intense) on weight loss.

Repeated Measures ANOVA:

Situation: Repeated Measures ANOVA is used when you have a within-subjects design, meaning you collect
multiple measurements from the same subjects under different conditions or time points. It helps determine 
if there are significant differences across these conditions or time points.
Example: You measure the blood pressure of the same individuals before and after they undergo three 
different stress tests.

Additional Notes:

Factor: In ANOVA, a factor is a categorical independent variable that defines the groups or conditions
you are comparing.
Level: Each category within a factor is called a level. For example, if you have a factor "Treatment" 
with levels "A," "B," and "C," you have three levels.
Interaction Effect: Two-Way ANOVA assesses whether the interaction between two factors (e.g., Diet and Exercise) 
significantly affects the dependent variable. It tells you if the combination of factors has a different
impact than you would expect based on their individual effects.

In summary, One-Way ANOVA is used for comparing means across multiple independent groups, Two-Way ANOVA
assesses the effects of two independent factors, and Repeated Measures ANOVA analyzes differences across
repeated measurements within the same subjects. The choice of ANOVA type depends on the research design 
and the specific hypotheses you want to test.

In [None]:
#Q3
The partitioning of variance in Analysis of Variance (ANOVA) is a fundamental concept in statistics
that helps researchers understand how the total variability in a dataset can be attributed to different
sources or factors. ANOVA is a statistical technique used to analyze and compare means among two or more
groups or treatments to determine if there are significant differences between them. Understanding the 
partitioning of variance is essential in ANOVA because it provides insights into the sources of variability
and allows researchers to draw conclusions about the significance of the factors being studied.

In ANOVA, the total variability in the data is decomposed or partitioned into different components:

Total Variance (Total Sum of Squares, SST): This represents the overall variability in the data, which is
calculated as the sum of the squared differences between each data point and the overall mean of all the data 
points.

Between-Group Variance (Between-Group Sum of Squares, SSB): This represents the variability that can be 
attributed to differences between the group means. It is calculated as the sum, over groups, of each group's
size multiplied by the squared difference between that group's mean and the overall mean.

Within-Group Variance (Within-Group Sum of Squares, SSW): This represents the variability within each group 
or treatment. It is calculated as the sum of the squared differences between each individual data point and 
its group mean.

These components add up exactly: SST = SSB + SSW.

The partitioning of variance helps researchers answer questions such as:

Are the group means significantly different from each other? (This is determined by the F-statistic, the
ratio of the between-group mean square, SSB / (k - 1), to the within-group mean square, SSW / (N - k),
where k is the number of groups and N is the total number of observations.)

What proportion of the total variability can be explained by the group differences? (This is SSB / SST,
often reported as the "explained variance" or the effect size eta squared.)

How much of the total variability is due to random or unexplained factors? (This is SSW / SST, often
referred to as "unexplained variance" or "error variance.")

Understanding the partitioning of variance allows researchers to assess the significance of the factors 
they are studying and determine if there is evidence to support their hypotheses. It is a powerful tool 
for comparing multiple groups and making informed decisions about whether the differences observed are 
likely to be due to real effects or simply random variation. In essence, ANOVA helps researchers quantify
and attribute sources of variation, which is critical in fields such as experimental research, social sciences,
and many other areas of scientific inquiry.
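
As a rough sketch of how the partition turns into a test statistic, the helper below computes the three sums of squares, the F-statistic, and eta squared for any list of groups. The function name `partition_variance` and the dictionary keys are illustrative choices, not part of any library; the sample values are the three groups used in the Q4 cell.

```python
import numpy as np

def partition_variance(groups):
    """Return one-way ANOVA sums of squares, the F-statistic, and eta squared."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()

    # Total: every observation against the grand mean.
    sst = np.sum((all_values - grand_mean) ** 2)
    # Between: each group's squared mean deviation, weighted by group size.
    ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Within: every observation against its own group mean.
    ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ssb / df_between) / (ssw / df_within)
    return {"SST": sst, "SSB": ssb, "SSW": ssw, "F": f_stat, "eta_sq": ssb / sst}

result = partition_variance([[22, 24, 28, 20, 25],
                             [30, 32, 28, 34, 29],
                             [18, 16, 20, 21, 17]])
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The F-statistic is the ratio of mean squares, not of raw sums of squares, which is why the degrees of freedom enter the calculation.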

In [1]:
#Q4
import numpy as np

group1 = np.array([22, 24, 28, 20, 25])
group2 = np.array([30, 32, 28, 34, 29])
group3 = np.array([18, 16, 20, 21, 17])

data = [group1, group2, group3]

grand_mean = np.mean(np.concatenate(data))

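# Total sum of squares: every observation against the grand mean
sst = np.sum([(x - grand_mean)**2 for group_data in data for x in group_data])

# Explained (between-group) sum of squares: group means against the grand mean,
# weighted by group size. This is the quantity called SSB in Q3.
sse = np.sum([len(group_data) * (np.mean(group_data) - grand_mean)**2 for group_data in data])

# Residual (within-group) sum of squares: every observation against its own
# group mean. This is the quantity called SSW in Q3.
ssr = np.sum([(x - np.mean(group_data))**2 for group_data in data for x in group_data])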

print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)


Total Sum of Squares (SST): 450.93333333333334
Explained Sum of Squares (SSE): 373.7333333333335
Residual Sum of Squares (SSR): 77.19999999999999


In [5]:
#Q5
import statsmodels.api as sm
from statsmodels.formula.api import ols
import pandas as pd
import numpy as np

# Create a dataframe with a crossed design: np.repeat gives blocks of fertilizer
# frequency and np.tile alternates watering frequency, so every fertilizer level
# occurs with every watering level and the two factors can be separated.
dataframe = pd.DataFrame({'Fertilizer': np.repeat(['daily', 'weekly'], 15),
                          'Watering': np.tile(['daily', 'weekly'], 15),
                          'height': [14, 16, 15, 15, 16, 13, 12, 11,
                                     14, 15, 16, 16, 17, 18, 14, 13,
                                     14, 14, 14, 15, 16, 16, 17, 18,
                                     14, 13, 14, 14, 14, 15]})

# Performing two-way ANOVA (anova_lm takes the keyword `typ`, not `type`)
model = ols('height ~ C(Fertilizer) + C(Watering) + C(Fertilizer):C(Watering)',
            data=dataframe).fit()
result = sm.stats.anova_lm(model, typ=2)

# Mean square (sum of squares / degrees of freedom) for each term
ms_fertilizer = result.loc['C(Fertilizer)', 'sum_sq'] / result.loc['C(Fertilizer)', 'df']
ms_watering = result.loc['C(Watering)', 'sum_sq'] / result.loc['C(Watering)', 'df']
ms_interaction = result.loc['C(Fertilizer):C(Watering)', 'sum_sq'] / result.loc['C(Fertilizer):C(Watering)', 'df']

# Print the ANOVA table (F-statistics and p-values) and the mean squares
print(result)
print("Mean Square, Fertilizer:", ms_fertilizer)
print("Mean Square, Watering:", ms_watering)
print("Mean Square, Interaction:", ms_interaction)


In [None]:
#Q6
In a one-way ANOVA, the F-statistic is used to assess whether there are significant differences
among the means of three or more groups. The p-value associated with the F-statistic helps determine
the statistical significance of these differences.

In your scenario, you obtained an F-statistic of 5.23 and a p-value of 0.02. To interpret these results:

Null Hypothesis (H0): The null hypothesis in a one-way ANOVA is that all the population group means
are equal.

Alternative Hypothesis (Ha): The alternative hypothesis is that at least one group mean differs
from the others.

Based on the given F-statistic and p-value:

The F-statistic (5.23) indicates the ratio of the variance between groups to the variance within groups.
A larger F-statistic suggests greater variability between groups relative to within groups.

The p-value (0.02) is the probability of observing an F-statistic at least this large if the null hypothesis
were true. The smaller it is, the stronger the evidence against the null hypothesis.

Interpretation:

Since the p-value (0.02) is less than the typical significance level (e.g., 0.05), we reject the null 
hypothesis (H0) at the 0.05 significance level. At a stricter level such as 0.01, we would not reject it.

This provides evidence that at least one group mean differs from the others.

However, the ANOVA itself doesn't tell you which specific groups are different; it only tells you that 
there are differences somewhere among the groups. You would need to follow up with post hoc tests 
(e.g., Tukey's HSD, Bonferroni, etc.) to determine which specific group(s) differ from each other.

In summary, based on the results of the one-way ANOVA, you can conclude that there are significant
differences among the groups. Further analysis is needed to identify which groups are different from 
each other.
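
The link between an F-statistic and its p-value can be sketched with SciPy's F distribution. The question does not state the number of groups or observations, so the design below (3 groups, 30 observations in total) is purely hypothetical, and the resulting p-value is not expected to match the reported 0.02.

```python
from scipy import stats

f_statistic = 5.23

# ASSUMPTION: the study design is not given; these counts are hypothetical.
n_groups, n_total = 3, 30
df_between = n_groups - 1
df_within = n_total - n_groups

# Survival function: P(F >= observed value) under the null hypothesis.
p_value = stats.f.sf(f_statistic, df_between, df_within)

# Smallest F that would be significant at alpha = 0.05 for these degrees of freedom.
critical_f = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.4f}")
print(f"critical F at alpha = 0.05: {critical_f:.3f}")
```

Rejecting H0 when the p-value is below alpha is equivalent to rejecting when the observed F exceeds the critical value.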

In [None]:
#Q7
Handling missing data in a repeated measures ANOVA is crucial to ensure the validity and reliability 
of your analysis. Repeated measures ANOVA involves measuring the same subjects or entities multiple times,
and missing data can occur for various reasons, such as dropout, non-response, or technical errors. Here 
are some common methods for handling missing data in repeated measures ANOVA and their potential consequences:

1. Complete Case Analysis (Listwise Deletion):

In this approach, any subject with missing data on any time point is excluded from the analysis.
Pros:
Simple and straightforward.
Keeps the design balanced, since every retained subject has all time points.
Cons:
Reduces sample size, potentially leading to a loss of statistical power.
Gives biased estimates unless the data are missing completely at random (MCAR).

2. Mean Imputation:

Replace missing values with the mean of observed values for that variable.
Pros:
Preserves sample size.
Easy to implement.
Cons:
Underestimates the variability in the data, potentially leading to overly optimistic p-values.
Can introduce bias if data are not MCAR.
Does not account for within-subject correlation over time.

3. Linear Interpolation or Last Observation Carried Forward (LOCF):

Linear interpolation fills a missing value from the same subject's neighboring time points; LOCF 
carries the subject's last observed value forward.
Pros:
Preserves sample size.
Interpolation can give reasonable estimates when a subject's values change roughly linearly over time.
Cons:
Interpolation assumes a linear trend between observed points, which may not be valid.
LOCF assumes no change after the last observation and can bias the estimated treatment effect in either direction.
Both treat imputed values as if they had been observed, which understates uncertainty.

4. Multiple Imputation:

Generates multiple complete datasets with imputed values, analyzes each, and combines the results.
Pros:
Preserves sample size.
Accounts for uncertainty in imputation.
Valid when data are missing at random (MAR), a weaker requirement than MCAR.
Cons:
More complex and computationally intensive.
Requires specifying a model for imputation, which can be challenging.

5. Mixed-Effects Models (Longitudinal Analysis):

Uses all available data, including subjects with missing data at some time points.
Pros:
Accounts for within-subject correlation.
Gives valid estimates when data are missing at random (MAR).
Cons:
Requires specialized software and statistical expertise.
Results can be sensitive to the model assumptions.
Data that are missing not at random (MNAR) still require additional modeling of the missingness mechanism.

The choice of method for handling missing data should depend on the nature of the data and the reasons 
for missingness. It's essential to carefully consider the potential consequences of each method, 
including its impact on statistical power, bias, and the validity of conclusions drawn from the analysis.
In practice, sensitivity analyses and exploring different methods can help assess the robustness of 
results to missing data handling techniques.
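
The simpler strategies above can be sketched with pandas. The table is hypothetical: four made-up subjects (`s1` to `s4`) measured at three time points (`t1` to `t3`), stored in wide format with one row per subject. Multiple imputation and mixed-effects models need dedicated modeling tools and are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point.
wide = pd.DataFrame(
    {"t1": [120.0, 130.0, 125.0, 140.0],
     "t2": [118.0, np.nan, 123.0, 138.0],
     "t3": [115.0, 126.0, np.nan, 135.0]},
    index=["s1", "s2", "s3", "s4"],
)

# 1. Complete case analysis: drop any subject with a missing time point.
complete_cases = wide.dropna()

# 2. Mean imputation: fill each time point with that column's observed mean.
mean_imputed = wide.fillna(wide.mean())

# 3a. LOCF: carry each subject's last observed value forward along the row.
locf = wide.ffill(axis=1)

# 3b. Linear interpolation between a subject's neighboring time points.
interpolated = wide.interpolate(axis=1)

print(len(wide), "subjects originally;", len(complete_cases), "complete cases")
print(locf)
```

Comparing the ANOVA results across these versions of the data is one simple form of the sensitivity analysis mentioned above.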

In [None]:
#Q8
Post-hoc tests are used after conducting an Analysis of Variance (ANOVA) to make pairwise comparisons 
between groups when the ANOVA reveals a significant difference among the group means. These tests help 
identify which specific group(s) differ from each other. Common post-hoc tests include:

1. Tukey's Honestly Significant Difference (HSD):

When to use: Tukey's HSD is a widely used post-hoc test and is appropriate when you have conducted a 
one-way ANOVA. It controls the familywise error rate and is suitable for comparing all possible pairs 
of groups.
Example: In a study comparing the mean scores of three different teaching methods (A, B, and C) on 
student performance, a one-way ANOVA reveals a significant difference. Tukey's HSD can be used to 
determine which teaching methods differ significantly from each other.

2. Bonferroni Correction:

When to use: The Bonferroni correction is a conservative approach used when you have conducted multiple 
pairwise comparisons after an ANOVA (or any other test). It controls the overall Type I error rate by
dividing the desired significance level (e.g., 0.05) by the number of comparisons.
Example: Suppose you conducted 10 pairwise comparisons after a one-way ANOVA. To maintain an overall alpha
level of 0.05, you can use a significance level of 0.05/10 = 0.005 for each individual comparison.

3. Sidak Correction:

When to use: Similar to the Bonferroni correction, the Sidak correction controls the overall Type I 
error rate but is slightly less conservative. Each comparison is tested at 1 - (1 - alpha)^(1/m) for m
comparisons, which is exact when the comparisons are independent.
Example: If you have multiple comparisons to make after an ANOVA and want to control the familywise 
error rate, you can use the Sidak correction with a specified alpha level.

4. Duncan's Multiple Range Test (MRT):

When to use: Duncan's MRT is used after a one-way ANOVA and has more power than Tukey's HSD because it is
less conservative. The cost is that it does not control the familywise error rate, so false positives
become more likely as the number of groups grows.
Example: In a study comparing the yields of different crop varieties, a one-way ANOVA indicates significant
differences. Duncan's MRT can be used to group varieties with similar yields.

5. Games-Howell Test:

When to use: The Games-Howell test is suitable when group variances are unequal, and you have conducted a 
one-way ANOVA. It does not assume equal variances across groups, unlike Tukey's HSD.
Example: In a study comparing the exam scores of students from different schools, a one-way ANOVA shows 
significant differences. The Games-Howell test can be applied when group variances are not equal.

6. Holm-Bonferroni Method:

When to use: The Holm-Bonferroni method is a step-down correction that controls the familywise error 
rate and can be applied to any situation with multiple comparisons. It is never less powerful than the
plain Bonferroni correction.
Example: After conducting multiple pairwise comparisons in various situations (e.g., clinical trials, 
market research), you can use the Holm-Bonferroni method to adjust significance levels while controlling 
the overall Type I error rate.

When to use a specific post-hoc test depends on the characteristics of your data, the design of your study,
and your desired level of control over Type I errors. The choice of a post-hoc test should align with the 
research question and the assumptions underlying the statistical analysis.
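
The three correction methods can be sketched in plain Python. The helper names (`bonferroni_alpha`, `sidak_alpha`, `holm_reject`) and the five p-values are hypothetical, chosen only to illustrate the arithmetic.

```python
def bonferroni_alpha(alpha, m):
    """Per-comparison significance level under Bonferroni."""
    return alpha / m

def sidak_alpha(alpha, m):
    """Per-comparison significance level under Sidak."""
    return 1 - (1 - alpha) ** (1 / m)

def holm_reject(p_values, alpha=0.05):
    """Holm step-down: test sorted p-values against alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Hypothetical p-values from five pairwise comparisons.
p_values = [0.001, 0.012, 0.020, 0.040, 0.300]

bonferroni_reject = [p <= bonferroni_alpha(0.05, len(p_values)) for p in p_values]
print("Bonferroni threshold (m=10):", bonferroni_alpha(0.05, 10))
print("Sidak threshold (m=10):     ", sidak_alpha(0.05, 10))
print("Bonferroni rejections:", bonferroni_reject)
print("Holm rejections:      ", holm_reject(p_values))
```

With these made-up values Holm rejects one more hypothesis than Bonferroni, which reflects its extra power at the same familywise error rate.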

In [6]:
#Q9
import numpy as np
import scipy.stats as stats

# Sample data for the three diets
diet_A = [2.5, 3.2, 2.8, 3.5, 2.9, 3.7, 2.6, 3.0, 3.1, 2.8,
          3.3, 3.4, 3.1, 2.7, 3.0]
diet_B = [2.1, 2.4, 2.2, 2.5, 2.0, 2.6, 2.3, 2.7, 2.8, 2.2,
          2.4, 2.5, 2.3, 2.6, 2.1]
diet_C = [1.8, 1.9, 2.0, 1.7, 1.9, 1.8, 2.1, 2.0, 2.2, 1.6,
          1.9, 1.8, 2.0, 2.1, 2.2]

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Print the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)


F-statistic: 68.92282352941194
p-value: 5.436714823633692e-14


In [None]:
Now, let's interpret the results:

F-statistic: The F-statistic is the ratio of the variability between the group means to 
the variability within the groups. A larger F-statistic means the group means are spread further
apart than within-group variation alone would explain.

p-value: The p-value associated with the F-statistic is the probability of obtaining an F-statistic at
least as large as the observed one if there were no true differences among the group means. A small
p-value (typically less than 0.05) suggests that there are significant differences among the groups.

Interpretation:

In this analysis, the F-statistic is approximately 68.92 and the p-value is approximately 5.4e-14, far
below the 0.05 significance level, so we reject the null hypothesis. There are significant differences in
mean weight loss among the three diets (A, B, and C).

Had the p-value been greater than the chosen significance level, we would fail to reject 
the null hypothesis, indicating insufficient evidence to conclude that the mean weight 
loss differs among the three diets.

In summary, the data show significant differences in weight loss among the three diets. The ANOVA does not
show which diets differ from each other; a post-hoc test such as Tukey's HSD is needed for that.
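
A one-way ANOVA only indicates that some difference exists among the diets. As a hedged sketch of a follow-up, the block below runs Tukey's HSD on the same diet data using `scipy.stats.tukey_hsd`, which is assumed to be available (it was added in SciPy 1.8); the choice of Tukey's HSD over other post-hoc tests is also an assumption.

```python
from scipy import stats

# Same sample data as in the ANOVA cell above.
diet_A = [2.5, 3.2, 2.8, 3.5, 2.9, 3.7, 2.6, 3.0, 3.1, 2.8,
          3.3, 3.4, 3.1, 2.7, 3.0]
diet_B = [2.1, 2.4, 2.2, 2.5, 2.0, 2.6, 2.3, 2.7, 2.8, 2.2,
          2.4, 2.5, 2.3, 2.6, 2.1]
diet_C = [1.8, 1.9, 2.0, 1.7, 1.9, 1.8, 2.1, 2.0, 2.2, 1.6,
          1.9, 1.8, 2.0, 2.1, 2.2]

# Pairwise comparisons with familywise error control.
result = stats.tukey_hsd(diet_A, diet_B, diet_C)

# result.pvalue[i, j] is the adjusted p-value for group i versus group j
# (0 = diet A, 1 = diet B, 2 = diet C).
labels = ["A", "B", "C"]
for i in range(3):
    for j in range(i + 1, 3):
        print(f"Diet {labels[i]} vs Diet {labels[j]}: "
              f"mean difference = {result.statistic[i, j]:.3f}, "
              f"adjusted p = {result.pvalue[i, j]:.4g}")
```

Pairs with an adjusted p-value below the chosen significance level are the ones whose mean weight loss differs.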

In [7]:
#Q10
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data
data = pd.DataFrame({
    'Software': np.repeat(['A', 'B', 'C'], 30),
    'Experience': np.tile(['Novice', 'Experienced'], 45),
    'Time': np.random.randint(10, 60, 90)  # Random placeholder times; no seed is set, so each run differs (replace with your data)
})

# Perform two-way ANOVA
formula = 'Time ~ C(Software) + C(Experience) + C(Software):C(Experience)'
model = ols(formula, data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the results
print(anova_table)


                                 sum_sq    df         F    PR(>F)
C(Software)                   96.200000   2.0  0.208900  0.811897
C(Experience)                 28.900000   1.0  0.125514  0.724018
C(Software):C(Experience)    262.466667   2.0  0.569950  0.567725
Residual                   19341.333333  84.0       NaN       NaN


In [None]:
Now, let's interpret the results from the ANOVA table:

Main Effect of Software (C(Software)):

The F-statistic and p-value associated with C(Software) test whether there is a significant difference 
in task completion times among the different software programs (A, B, and C).
A significant p-value indicates that there is a main effect of software, suggesting that at least one
software program has a different average task completion time.

Main Effect of Experience (C(Experience)):

The F-statistic and p-value associated with C(Experience) test whether there is a significant difference 
in task completion times between novice and experienced employees.
A significant p-value suggests that there is a main effect of employee experience level, indicating that 
novice and experienced employees, on average, have different task completion times.

Interaction Effect (C(Software):C(Experience)):

The F-statistic and p-value associated with C(Software):C(Experience) test whether there is an interaction 
effect between software programs and employee experience levels.
A significant p-value for the interaction effect indicates that the effect of software on task completion 
times differs depending on the employee's experience level.

Interpretation:

If the p-value for a main effect (C(Software) or C(Experience)) is less than your 
chosen significance level (e.g., 0.05), you can conclude that there is a significant main effect of that factor.

If the p-value for the interaction effect (C(Software):C(Experience)) is significant, it suggests that the
effect of software on task completion times depends on the employee's experience level. In that case,
interpret the main effects with caution, because they average over different patterns.

In the output shown above, none of the three p-values (about 0.81 for software, 0.72 for experience, and
0.57 for the interaction) is below 0.05, so there is no evidence of main effects or of an interaction.
This is unsurprising, because the task times were generated at random rather than measured.
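
A table of cell means is a simple way to see what an interaction would look like. The sketch below uses the same layout as the cell above, but with a seeded NumPy generator so the numbers are repeatable; the data are still random placeholders, and the variable names `cell_means` and `gap` are illustrative.

```python
import numpy as np
import pandas as pd

# Placeholder data in the same layout as above, seeded for repeatability.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Software": np.repeat(["A", "B", "C"], 30),
    "Experience": np.tile(["Novice", "Experienced"], 45),
    "Time": rng.integers(10, 60, 90),
})

# Mean task time for every Software x Experience combination.
cell_means = data.groupby(["Software", "Experience"])["Time"].mean().unstack()
print(cell_means.round(2))

# Novice-minus-experienced gap within each software program. Similar gaps
# across programs suggest no interaction; very different gaps suggest one.
gap = cell_means["Novice"] - cell_means["Experienced"]
print(gap.round(2))
```

The ANOVA interaction test formalizes this comparison of gaps by checking whether they differ more than random variation would explain.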

In [8]:
#Q11
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import MultiComparison

# Sample data (replace with your actual data)
np.random.seed(0)  # For reproducibility
control_group_scores = np.random.normal(75, 10, 50)  # Control group test scores
experimental_group_scores = np.random.normal(80, 10, 50)  # Experimental group test scores

# Perform a two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group_scores, experimental_group_scores)

# Print the results of the t-test
print("Two-Sample T-Test Results:")
print("T-statistic:", t_statistic)
print("P-value:", p_value)

# Check if the results are significant (use a significance level, e.g., 0.05)
if p_value < 0.05:
    print("The results are significant, indicating a difference between the groups.")
    # Perform post-hoc tests (Tukey's HSD) to determine which group(s) differ significantly
    data = pd.DataFrame({'Scores': np.concatenate([control_group_scores, experimental_group_scores]),
                         'Group': np.repeat(['Control', 'Experimental'], 50)})
    mc = MultiComparison(data['Scores'], data['Group'])
    result = mc.tukeyhsd()
    print("\nPost-Hoc Test Results (Tukey's HSD):")
    print(result)
else:
    print("The results are not significant: insufficient evidence of a difference between the groups.")


Two-Sample T-Test Results:
T-statistic: -1.6677351961320235
P-value: 0.09856078338184605
The results are not significant: insufficient evidence of a difference between the groups.


In [None]:
Interpretation:

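In the output above, the p-value (about 0.099) is greater than 0.05, so we fail to reject the null
hypothesis: the data do not provide sufficient evidence of a difference in mean test scores between the
control and experimental groups. This is not proof that the two groups are equal.

Had the t-test been significant, a post-hoc test would add nothing here. With only two groups there is a
single pairwise comparison, and Tukey's HSD leads to the same conclusion as the t-test. Post-hoc tests
become useful when three or more groups are compared.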
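
With only two groups, a one-way ANOVA and the pooled-variance t-test are the same test, which is why a post-hoc step is redundant in this design. The sketch below checks this numerically on the same simulated scores; the shortened variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Same simulated data as in the cell above.
np.random.seed(0)
control = np.random.normal(75, 10, 50)
experimental = np.random.normal(80, 10, 50)

# Pooled-variance two-sample t-test (SciPy's default, equal_var=True).
t_statistic, t_p_value = stats.ttest_ind(control, experimental)

# One-way ANOVA on the same two groups.
f_statistic, f_p_value = stats.f_oneway(control, experimental)

print("t squared:  ", t_statistic ** 2)
print("F-statistic:", f_statistic)
print("t-test p-value:", t_p_value)
print("ANOVA p-value: ", f_p_value)
```

The F-statistic equals the square of the t-statistic and the two p-values agree, so either test can be reported.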

In [9]:
#Q12
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Sample data (replace with your actual data)
np.random.seed(0)  # For reproducibility
store_A_sales = np.random.randint(1000, 5000, 30)  # Sales for Store A
store_B_sales = np.random.randint(900, 4500, 30)   # Sales for Store B
store_C_sales = np.random.randint(800, 4800, 30)   # Sales for Store C

# Create a DataFrame
data = pd.DataFrame({
    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Sales': np.concatenate([store_A_sales, store_B_sales, store_C_sales])
})

# Perform one-way ANOVA
formula = 'Sales ~ C(Store)'
model = ols(formula, data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA results
print(anova_table)


                sum_sq    df         F    PR(>F)
C(Store)  4.014787e+06   2.0  1.721829  0.184773
Residual  1.014289e+08  87.0       NaN       NaN


In [None]:
Now, interpret the results:

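If the p-value associated with the 'C(Store)' factor in the ANOVA table is less than your chosen 
significance level (e.g., 0.05), you can conclude that there are significant differences in daily 
sales between at least two of the stores.

In the output above, the F-statistic is about 1.72 and the p-value is about 0.185, which is greater than
0.05. We therefore fail to reject the null hypothesis: these randomly generated sales figures do not show
a significant difference in mean daily sales among the three stores, so a post-hoc test is not needed.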