
**Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.**

- ANOVA (Analysis of Variance) is a statistical method used to test whether the means of two or more groups differ significantly. The assumptions required to use ANOVA are:
    - Independence: observations are independent of each other, so the error terms are uncorrelated.
    - Normality: the error term (the residuals within each group) follows a normal distribution.
    - Homogeneity of variance (homoscedasticity): the error term has the same variance in every group.
    - The dependent variable is measured on a continuous (interval or ratio) scale and the groups are defined by categorical factors.

- Normality and homoscedasticity (homogeneity of variance) are the main assumptions to check. Unequal group sizes do not violate an assumption by themselves, but they make ANOVA less robust to violations of normality or homogeneity of variance.

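- For example, if the data are strongly skewed or contain outliers in small samples, the reported p-value can be unreliable and lead to incorrect conclusions about the significance of differences between groups. Unequal variances, especially combined with unequal group sizes, distort the Type I error rate of the F-test, so differences between group means may be wrongly declared significant or missed. Analysing repeated measurements of the same subject as if they came from different subjects violates independence.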
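
The normality and equal-variance checks described above can be sketched with SciPy. This is a minimal illustration, assuming three hypothetical groups of simulated data (the group names, means, and sample sizes are made up for the example); Shapiro-Wilk and Levene's test are common choices, not the only ones.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups simulated from normal distributions
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10.0, scale=2.0, size=30)
group_b = rng.normal(loc=11.0, scale=2.0, size=30)
group_c = rng.normal(loc=12.0, scale=2.0, size=30)
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk test on the residuals (value minus its group mean)
residuals = np.concatenate([g - g.mean() for g in groups])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Homogeneity of variance: Levene's test across the groups
levene_stat, levene_p = stats.levene(*groups)

print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test suggests the corresponding assumption may not hold, in which case a transformation, Welch's ANOVA, or a non-parametric test such as Kruskal-Wallis is worth considering.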

**Q2. What are the three types of ANOVA, and in what situations would each be used?**

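There are three types of ANOVA tests:

- One-Way Analysis of Variance
- Two-Way Analysis of Variance
- N-Way (factorial) Analysis of Variance

  - One-Way ANOVA is used when there is one categorical independent variable (factor) with two or more groups and you want to test whether the group means of a single dependent variable differ, for example exam scores across three teaching methods.
  - Two-Way ANOVA is used when there are two factors and you want to test the main effect of each factor as well as their interaction, for example plant height by watering frequency and sun exposure.
  - N-Way ANOVA is used when there are more than two factors and you want to test their main effects and interactions on a single dependent variable. It should not be confused with MANOVA (multivariate ANOVA), which is a separate technique used when there are multiple dependent variables.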

**Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?**

- In ANOVA, the observed variance in a particular variable is partitioned into components attributable to different sources of variation, following the law of total variance. ANOVA uses this partition to test whether two or more population means are equal, and therefore generalizes the t-test beyond two means.
- In a one-way ANOVA, the total variation is split into a between-groups part (variation of the group means around the grand mean) and a within-groups part (variation of the observations around their own group mean).
- In a one-way repeated measures ANOVA, the total variation is first split into variation between subjects and variation within subjects.
    - Variation within subjects consists of two components: differences between treatments and error (residual) variation.
    - The total is therefore partitioned into three parts: a between-subjects part, a between-treatments part, and an error part. Removing the between-subjects part from the error term reduces its size, which makes the test more sensitive.
- Understanding the partition matters because the F-statistic is the ratio of the mean square for an effect to the mean square for error, so the partition shows where every number in the ANOVA table comes from.
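
For the one-way case, the partition can be written out as follows. The notation is an assumption introduced for this sketch: $y_{ij}$ is observation $i$ in group $j$, $\bar{y}_j$ is the mean of group $j$, $\bar{y}$ is the grand mean, $n_j$ is the size of group $j$, $k$ is the number of groups, and $N$ is the total number of observations.

```latex
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{j=1}^{k} n_j\left(\bar{y}_j-\bar{y}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}_j\right)^2}_{SS_{\text{within}}}
\qquad
F=\frac{SS_{\text{between}}/(k-1)}{SS_{\text{within}}/(N-k)}
```

A large $F$ means the group means vary more than would be expected from the within-group variation alone.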

**Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?**

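To calculate the three sums of squares in a one-way ANOVA using Python, you can use the following steps. A note on naming: this answer writes SSR for the explained (regression, between-group) sum of squares and SSE for the residual (error, within-group) sum of squares, which is the reverse of the abbreviations used in the question.

- Calculate the overall mean of the response variable.
- Calculate the predicted value for each observation (in a one-way ANOVA, the mean of its group).
- Calculate the sum of squares total (SST): the squared differences between each observation and the overall mean, summed.
- Calculate the sum of squares regression (SSR): the squared differences between each predicted value and the overall mean, summed.
- Calculate the sum of squares error (SSE): the squared differences between each observation and its predicted value, summed.

The following relationship exists between these three measures: SST = SSR + SSE.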

In [7]:
import pandas as pd

# load data into a pandas dataframe
df= pd.DataFrame({"hours_studies":[1,2,2,3,4,5],
                  "exam_score":[68,77,81,82,88,90]})
df

| | hours_studies | exam_score |
|---|---|---|
| 0 | 1 | 68 |
| 1 | 2 | 77 |
| 2 | 2 | 81 |
| 3 | 3 | 82 |
| 4 | 4 | 88 |
| 5 | 5 | 90 |


In [9]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols



# fit one-way ANOVA model
model = ols('exam_score ~ C(hours_studies)', data=df).fit()

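# calculate SST, SSR, and SSE
# note: statsmodels names are the reverse of the variable names used here
ssr = model.ess  # explained (regression) sum of squares
sse = model.ssr  # sum of squared residuals (error)
sst = ssr + sse  # total sum of squares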

print('SSR:', ssr)
print('SSE:', sse)
print('SST:', sst)


SSR: 308.0
SSE: 8.0
SST: 316.0
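
As a cross-check, the same three quantities can be computed by hand from the steps listed earlier. This is a minimal sketch that reuses the `hours_studies` and `exam_score` data from above and follows this answer's naming (SSR = explained, SSE = residual); the intermediate variable names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({"hours_studies": [1, 2, 2, 3, 4, 5],
                   "exam_score": [68, 77, 81, 82, 88, 90]})

# Step 1: overall mean of the response variable
grand_mean = df["exam_score"].mean()

# Step 2: predicted value of each observation = mean of its group
group_mean = df.groupby("hours_studies")["exam_score"].transform("mean")

# Steps 3-5: the three sums of squares
sst = ((df["exam_score"] - grand_mean) ** 2).sum()   # total
ssr = ((group_mean - grand_mean) ** 2).sum()         # explained (between groups)
sse = ((df["exam_score"] - group_mean) ** 2).sum()   # residual (within groups)

print("SSR:", ssr)  # 308.0
print("SSE:", sse)  # 8.0
print("SST:", sst)  # 316.0
```

The values agree with the statsmodels output, and SST = SSR + SSE holds (316 = 308 + 8).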


**Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?**

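- To calculate the main effects and interaction effects in a two-way ANOVA using Python, you can use the statsmodels library: fit an OLS model that contains both factors and their interaction term (`water:sun` in the example below), then pass the fitted model to `anova_lm`. In the resulting table, the `water` and `sun` rows are the main effects and the `water:sun` row is the interaction effect.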

In [13]:
import numpy as np
import pandas as pd

#create data
df = pd.DataFrame({'water': np.repeat(['daily', 'weekly'], 15),
                   'sun': np.tile(np.repeat(['low', 'med', 'high'], 5), 2),
                   'height': [6, 6, 6, 5, 6, 5, 5, 6, 4, 5,
                              6, 6, 7, 8, 7, 3, 4, 4, 4, 5,
                              4, 4, 4, 4, 4, 5, 6, 6, 7, 8]})

#view first ten rows of data
df[:10]

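| | water | sun | height |
|---|---|---|---|
| 0 | daily | low | 6 |
| 1 | daily | low | 6 |
| 2 | daily | low | 6 |
| 3 | daily | low | 5 |
| 4 | daily | low | 6 |
| 5 | daily | med | 5 |
| 6 | daily | med | 5 |
| 7 | daily | med | 6 |
| 8 | daily | med | 4 |
| 9 | daily | med | 5 |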


In [14]:
import statsmodels.api as sm
from statsmodels.formula.api import ols

#perform two-way ANOVA
model = ols('height ~ water + sun + water:sun', data=df).fit()
sm.stats.anova_lm(model, typ=2)


| | sum_sq | df | F | PR(>F) |
|---|---|---|---|---|
| water | 8.533333 | 1.0 | 16.0 | 0.000527 |
| sun | 24.866667 | 2.0 | 23.3125 | 2e-06 |
| water:sun | 2.466667 | 2.0 | 2.3125 | 0.120667 |
| Residual | 12.8 | 24.0 | | |


**Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?**

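- At a significance level of α = 0.05, an F-statistic of 5.23 with a p-value of 0.02 means we reject the null hypothesis that all group means are equal: at least one group mean differs significantly from the others. The ANOVA alone does not show which groups differ; a post-hoc test is needed for that.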
- The F-statistic represents how much the variability among the means exceeds that expected due to chance. The higher the F-value, the higher the variation between sample means relative to the variation within the samples.
- An F-statistic greater than the critical value is equivalent to a p-value less than alpha, and both mean that you reject the null hypothesis.
- Each F-value has a corresponding p-value, and if the p-value is less than a chosen threshold (e.g. α = .05), we conclude that the factor has a statistically significant effect on the outcome being measured.

**Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?**

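- In repeated measures ANOVA, missing data can be handled by using marginal and mixed models, which treat each occasion as a different observation of the same variable and therefore use every available observation from each subject. For example, GraphPad Prism can replace repeated measures ANOVA with a mixed-effects model fitted by restricted maximum likelihood (REML), which can handle the missing values.

- Ignoring missing data (i.e. analyzing only the observed data) assumes that the observed data are completely representative of the missing data, which requires that the missingness has no connection whatsoever with the outcomes you are interested in (this is called “missing completely at random”, MCAR). If that assumption fails, the estimates are biased. Even when it holds, dropping every subject with an incomplete record (listwise deletion) reduces the sample size and the statistical power. Simple imputation, such as filling in the mean, keeps the sample size but understates the variability, which can make effects look more significant than they are.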
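
The mixed-model approach can be sketched in Python with statsmodels. This is a minimal illustration under stated assumptions: the data are simulated, the column names (`subject`, `time`, `score`) are hypothetical, and a random-intercept model is only one of several possible specifications.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated long-format data: 20 hypothetical subjects measured at 3 occasions
rng = np.random.default_rng(42)
n_subjects = 20
time_effect = {"t1": 0.0, "t2": 1.0, "t3": 2.0}
subject_effect = rng.normal(0, 2, n_subjects)

rows = []
for s in range(n_subjects):
    for t, effect in time_effect.items():
        rows.append({"subject": s, "time": t,
                     "score": 50 + subject_effect[s] + effect + rng.normal(0, 1)})
long_df = pd.DataFrame(rows)

# Remove 8 measurements at random to mimic missed visits
missing_idx = rng.choice(long_df.index, size=8, replace=False)
observed = long_df.drop(index=missing_idx)

# Random-intercept mixed model; statsmodels fits it by REML by default.
# Subjects with incomplete records still contribute their observed rows.
result = smf.mixedlm("score ~ C(time)", observed, groups=observed["subject"]).fit()
print(result.summary())
```

Unlike `AnovaRM`, which requires complete, balanced data, this model keeps the remaining measurements of subjects that have gaps.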

**Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.**

- Post-hoc tests are used after ANOVA to determine which groups are significantly different from each other. There are many post-hoc tests available, but some of the most common ones include Tukey’s HSD test, Bonferroni correction, Scheffe’s test, and Dunnett’s test.

- Tukey’s HSD test is used when you want to compare all possible pairs of means. It was designed for equal sample sizes; the Tukey-Kramer variant extends it to unequal sample sizes.
- Bonferroni correction is used when you have a small number of planned comparisons and you want to control the family-wise error rate.
- Scheffe’s test is used when you want to test any contrast among the means, including complex ones (for example, the average of two groups versus a third). It works with unequal sample sizes but is more conservative than Tukey’s HSD for simple pairwise comparisons.
- Dunnett’s test is used when you have a control group and you want to compare every other group to the control group.

For example, suppose we have three groups: A, B, and C. We conduct an ANOVA and find that there is a statistically significant difference between the means of the three groups. We then conduct a Tukey’s HSD test and find that the mean of group A is significantly different from the mean of group B (p < 0.05), but not significantly different from the mean of group C (p > 0.05).
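
A Bonferroni-corrected set of pairwise comparisons for groups A, B, and C could look like the sketch below. The data are simulated and the group means are hypothetical; `multipletests` from statsmodels applies the correction to the raw p-values from independent-samples t-tests.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical data for three independent groups
rng = np.random.default_rng(1)
groups = {"A": rng.normal(10.0, 2.0, 25),
          "B": rng.normal(12.0, 2.0, 25),
          "C": rng.normal(10.5, 2.0, 25)}

# Raw p-values for every pair of groups
pairs = list(combinations(groups, 2))  # (A, B), (A, C), (B, C)
raw_p = [stats.ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]

# Bonferroni: each p-value is multiplied by the number of comparisons (capped at 1)
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p, p_adj, r in zip(pairs, raw_p, adj_p, reject):
    print(f"{a} vs {b}: raw p = {p:.4f}, adjusted p = {p_adj:.4f}, reject = {r}")
```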

**Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.**

In [2]:
import pandas as pd
import scipy.stats as stats

# Create a dataframe with the data
data = {'diet': ['A', 'B', 'C'] * 50,
        'weight_loss': [1.2, 1.5, 1.8] * 50}
df = pd.DataFrame(data)

# Conduct the one-way ANOVA
f_statistic, p_value = stats.f_oneway(df[df['diet'] == 'A']['weight_loss'],
                                       df[df['diet'] == 'B']['weight_loss'],
                                       df[df['diet'] == 'C']['weight_loss'])

# Report the F-statistic and p-value
print(f"F-statistic: {f_statistic:.2f}")
print(f"p-value: {p_value:.4f}")


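F-statistic: inf
p-value: 0.0000

- The infinite F-statistic is an artifact of the placeholder data rather than evidence about the diets: every diet group contains 50 copies of a single value, so the within-group variance is zero and the F ratio divides by zero. The data frame also has 150 rows, not the 50 participants described in the question. With real measurements the F-statistic is finite, and a p-value below 0.05 would indicate that the mean weight loss of at least one diet differs from the others.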
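
As a sketch of a non-degenerate version of this analysis, the following simulates 50 participants with within-group variability. The group sizes (17, 17, 16), the standard deviation of 0.5, and the random seed are assumptions made for illustration; the group means reuse the three values from the data above.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)
sizes = {"A": 17, "B": 17, "C": 16}      # assumed split of the 50 participants
means = {"A": 1.2, "B": 1.5, "C": 1.8}   # group means reused from the example above

# Simulated weight loss with an assumed standard deviation of 0.5
df = pd.DataFrame({
    "diet": np.repeat(list(sizes), list(sizes.values())),
    "weight_loss": np.concatenate(
        [rng.normal(means[d], 0.5, n) for d, n in sizes.items()]),
})

# One-way ANOVA across the three diets
samples = [g["weight_loss"].to_numpy() for _, g in df.groupby("diet")]
f_statistic, p_value = stats.f_oneway(*samples)

print(f"F-statistic: {f_statistic:.2f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("At least one diet has a different mean weight loss.")
else:
    print("No significant difference between the diets was detected.")
```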




**Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.**

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Create a pandas DataFrame with illustrative data
# (the 30 times below are repeated 3 times, giving 90 rows)
data = {'program': ['A', 'B', 'C'] * 30,
        'experience': ['novice'] * 45 + ['experienced'] * 45,
        'time': [10.2, 9.8, 10.5, 9.7, 10.1, 10.3, 10.4, 9.9, 10.3, 10.2,
                 9.6, 9.7, 9.8, 9.5, 9.6, 9.7, 10.1, 10.2, 10.3, 10.4,
                 10.5, 10.6, 10.7, 10.8, 11.0, 11.1, 11.2, 11.3, 11.4,
                 11.5] * 3}

df = pd.DataFrame(data)

# Fit the model
model = ols('time ~ C(program) + C(experience) + C(program):C(experience)', data=df).fit()

# Print the summary of the model
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                   time   R-squared:                       0.073
Model:                            OLS   Adj. R-squared:                  0.018
Method:                 Least Squares   F-statistic:                     1.327
Date:                Mon, 31 Jul 2023   Prob (F-statistic):              0.261
Time:                        16:57:44   Log-Likelihood:                -72.985
No. Observations:                  90   AIC:                             158.0
Df Residuals:                      84   BIC:                             173.0
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                                              coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------

**Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.**

In [5]:
import numpy as np
from scipy.stats import ttest_ind

# create two illustrative samples (toy data, not the 100 students from the question)
data_group1 = np.array([1, 2, 3, 4, 5])
data_group2 = np.array([6, 7, 8, 9, 10])

# perform two sample t-test
t_statistic, p_value = ttest_ind(data_group1, data_group2)

print("t-statistic:", t_statistic)
print("p-value:", p_value)


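t-statistic: -5.0
p-value: 0.001052825793366539

- The p-value (about 0.001) is below 0.05, so the difference between the two group means is statistically significant for this illustrative data. With only two groups a post-hoc test is unnecessary, because the t-test already identifies the single pair that differs.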


**Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.**

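- A repeated measures ANOVA can be run with the `AnovaRM()` function from the statsmodels library, treating each day as the subject and the store as the within-subject factor.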

In [11]:
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# create a pandas dataframe with the sales data
df = pd.DataFrame({
    'store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,
    'sales': [10, 12, 8, 9, 11, 13, 10, 12, 8, 9, 11, 13, 10, 12, 8,
              9, 11, 13, 10, 12, 8, 9, 11, 13, 10, 12, 8, 9, 11,
              13] * 3,
    'day': list(range(1,31)) *3
})

# perform the repeated measures ANOVA using AnovaRM()
results = AnovaRM(data=df,
                  depvar='sales',
                  subject='day',
                  within=['store']).fit()

print(results)


               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
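store -7.3238 2.0000 58.0000 1.0000

- A negative F value is not meaningful. It arises here because the same 30 sales values are repeated for every store, so both the store effect and the residual variation are zero and the F ratio is computed from floating-point rounding error. The p-value of 1.0 correctly reflects that the illustrative data contain no difference between the stores.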



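- To determine which store(s) differ significantly from each other, you can follow up with a post-hoc test such as Tukey’s HSD. Note that `pairwise_tukeyhsd` treats the three stores as independent groups and ignores the pairing by day. A post-hoc test is only needed when the ANOVA is significant, which is not the case for this data.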

In [12]:
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# perform Tukey's HSD test
tukey_results = pairwise_tukeyhsd(df['sales'], df['store'])

print(tukey_results)


Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj  lower  upper  reject
--------------------------------------------------
     A      B      0.0   1.0 -1.0694 1.0694  False
     A      C      0.0   1.0 -1.0694 1.0694  False
     B      C      0.0   1.0 -1.0694 1.0694  False
--------------------------------------------------
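
Because the sales figures above are identical for all three stores, every comparison shows a mean difference of 0. The sketch below shows the same workflow on simulated data in which the stores do differ; the store means, the day-to-day variation, and the seed are hypothetical. The follow-up uses paired t-tests with a Bonferroni correction, one common option that respects the pairing by day.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

rng = np.random.default_rng(3)
days = np.arange(1, 31)
day_effect = rng.normal(0, 1.5, size=30)          # shared day-to-day variation
store_means = {"A": 10.0, "B": 11.0, "C": 12.5}   # hypothetical store means

# Wide table: one row per day, one column per store
wide = pd.DataFrame({s: m + day_effect + rng.normal(0, 1, 30)
                     for s, m in store_means.items()}, index=days)

# Long format expected by AnovaRM: one row per (day, store)
long_df = (wide.rename_axis("day").reset_index()
               .melt(id_vars="day", var_name="store", value_name="sales"))

rm_results = AnovaRM(data=long_df, depvar="sales",
                     subject="day", within=["store"]).fit()
print(rm_results)

# Post-hoc: paired t-tests with Bonferroni correction
pairs = list(combinations(store_means, 2))
for a, b in pairs:
    t_stat, p_raw = stats.ttest_rel(wide[a], wide[b])
    p_adj = min(p_raw * len(pairs), 1.0)
    print(f"{a} vs {b}: t = {t_stat:.2f}, Bonferroni-adjusted p = {p_adj:.4f}")
```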
