## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

## Assumptions:
### 1. Normality: the population from which each sample is drawn should be normally distributed.
### 2. Independence of cases: the sample cases should be independent of each other.
### 3. Homogeneity of variance: the variance within each group should be approximately equal.

### The one-way ANOVA is considered robust to violations of the normality assumption: it can tolerate non-normal data (skewed or kurtotic distributions) with only a small effect on the Type I error rate. However, platykurtosis can have a profound effect when group sizes are small. If normality is in doubt, there are two options: (1) transform the data so that the distributions are closer to normal, or (2) use the nonparametric Kruskal-Wallis H test, which does not require the assumption of normality. Examples of violations of the other two assumptions: measuring the same participants more than once breaks independence, and one group whose spread is several times larger than the others breaks homogeneity of variance.
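### These assumptions can be checked before running the test. The following is a minimal sketch, assuming three hypothetical groups simulated with NumPy (the names `group_a`, `group_b`, `group_c` and the seed are illustrative, not part of the original): Shapiro-Wilk checks normality within each group, Levene's test checks equal variances, and Kruskal-Wallis is the nonparametric fallback.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups drawn from normal distributions (illustrative only)
rng = np.random.default_rng(0)
group_a = rng.normal(5.0, 2.0, size=20)
group_b = rng.normal(6.0, 2.0, size=20)
group_c = rng.normal(5.5, 2.0, size=20)
groups = [group_a, group_b, group_c]

# Normality: Shapiro-Wilk on each group (a small p-value suggests non-normality)
shapiro_p = [stats.shapiro(g)[1] for g in groups]

# Homogeneity of variance: Levene's test (a small p-value suggests unequal variances)
levene_stat, levene_p = stats.levene(*groups)

# Nonparametric alternative when normality is doubtful: Kruskal-Wallis H test
h_stat, kruskal_p = stats.kruskal(*groups)

print("Shapiro-Wilk p-values:", shapiro_p)
print("Levene p-value:", levene_p)
print("Kruskal-Wallis p-value:", kruskal_p)
```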

## Q2. What are the three types of ANOVA, and in what situations would each be used?

### A one-way ANOVA has just one independent variable. For example, differences in IQ can be assessed by Country, and Country can have 2, 20, or more categories to compare.

### A two-way ANOVA (also called a factorial ANOVA) uses two independent variables. Expanding the example above, a two-way ANOVA can examine differences in IQ scores (the dependent variable) by Country (independent variable 1) and Gender (independent variable 2). A two-way ANOVA can also examine the interaction between the two independent variables. An interaction indicates that differences are not uniform across all categories of the independent variables. For example, females may have higher IQ scores overall compared to males, but this difference could be greater (or smaller) in European countries than in North American countries.

### A researcher can also use more than two independent variables; this is an n-way ANOVA, with n being the number of independent variables. For example, potential differences in IQ scores can be examined by Country, Gender, Age group, Ethnicity, etc., simultaneously.

## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

### An ANOVA uses an F-test to evaluate whether the variance among the groups is greater than the variance within a group. Another way to view this problem is that we could partition variance, that is, we could divide the total variance in our data into the various sources of that variation. Here, some of the variance results from variation among replicates within each group, and the rest comes from variation among the groups. Partitioning variance is an incredibly important concept within statistics. For scientists, it is a useful way of looking at the world: variation is everywhere, and you should ask what causes that variation.
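### For a one-way design with $k$ groups, $n_i$ observations in group $i$, and $N$ observations in total, the partition can be written as follows. The symbols are standard notation introduced here for illustration, not taken from the original.

```latex
\begin{aligned}
SS_{\text{total}} &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2 \\
&= \underbrace{\sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2}_{SS_{\text{between}}}
 + \underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2}_{SS_{\text{within}}} \\[4pt]
F &= \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
\end{aligned}
```

### Here $\bar{y}$ is the grand mean and $\bar{y}_i$ is the mean of group $i$. A large $F$ means the variation among groups is large relative to the variation within groups.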

## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [2]:
import statsmodels.api as sm

In [None]:
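# x (group indicator columns) and y (response values) are assumed to be defined already
x = sm.add_constant(x)

# fit linear regression model
model = sm.OLS(y, x).fit()

# total sum of squares (centered on the mean of y)
print(model.centered_tss)
# explained sum of squares
print(model.ess)
# residual sum of squares
print(model.ssr)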
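### The three quantities can also be computed by hand, which makes the partition explicit. This is a minimal sketch on small hypothetical data (the values in `groups` are made up for illustration); it follows the question's naming, with SSE as the explained sum of squares and SSR as the residual sum of squares, and checks the resulting F against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([7.0, 8.0, 6.0, 7.0]),
    np.array([5.0, 4.0, 3.0, 4.0]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total sum of squares: every observation against the grand mean
sst = np.sum((all_values - grand_mean) ** 2)
# Explained (between-group) sum of squares: group means against the grand mean
sse = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Residual (within-group) sum of squares: observations against their own group mean
ssr = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F-statistic built from the partition, compared with scipy's result
k, n = len(groups), len(all_values)
f_manual = (sse / (k - 1)) / (ssr / (n - k))
f_scipy, p_value = stats.f_oneway(*groups)

print("SST:", sst, "SSE:", sse, "SSR:", ssr)
print("F (manual):", f_manual, "F (scipy):", f_scipy, "p:", p_value)
```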

## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?
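### With p = 0.02, the null hypothesis that all group means are equal is rejected at the conventional 5% level, so at least one group mean differs from the others. The F-test does not say which groups differ, so a post-hoc test would be the next step. The sketch below only illustrates that the conclusion depends on the chosen significance level; the two alpha values are assumptions, since the question does not state one.

```python
f_statistic = 5.23  # value given in the question
p_value = 0.02      # value given in the question

decisions = {}
for alpha in (0.05, 0.01):
    # Reject H0 (all group means equal) only when p falls below the chosen level
    decisions[alpha] = "reject H0" if p_value < alpha else "fail to reject H0"
    print(f"alpha = {alpha}: {decisions[alpha]}")
```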

## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

### One of the biggest problems with traditional repeated measures ANOVA is missing data on the response variable. Repeated measures ANOVA treats each measurement as a separate variable and uses listwise deletion, so if one measurement is missing, the entire case is dropped. This shrinks the sample, reduces power, and can bias the results when the data are not missing completely at random. Alternatives are marginal and mixed models, which treat each occasion as a different observation of the same variable: you lose the measurement with missing data, but not the other responses from the same subject.
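### The cost of listwise deletion can be seen by counting what survives in each data layout. This is a minimal sketch on hypothetical data (the subjects, occasions `t1` to `t3`, and scores are made up): the wide layout is what classical repeated measures ANOVA needs, and the long layout is what mixed models work with.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per occasion
wide = pd.DataFrame({
    "subject": [1, 2, 3, 4],
    "t1": [5.1, 4.8, 6.0, 5.5],
    "t2": [5.4, np.nan, 6.2, 5.9],
    "t3": [5.9, 5.2, 6.8, np.nan],
})

# Listwise deletion: any subject with a missing occasion is dropped entirely
complete_cases = wide.dropna()

# Long format: one row per subject-occasion pair,
# so only the missing measurements themselves are lost
long = wide.melt(id_vars="subject", var_name="occasion", value_name="score").dropna()

print(len(complete_cases) * 3, "measurements kept after listwise deletion")
print(len(long), "measurements kept in long format")
```

### Two missing values cost two whole subjects under listwise deletion, while the long layout keeps every subject.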

## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

### Common post-hoc tests include Tukey's HSD (all pairwise comparisons), Dunnett's test (each group against a single control), and the Bonferroni correction (a small number of planned comparisons). For example, imagine we are testing four materials that we’re considering for making a product part, and we want to determine whether the mean differences between their strengths are statistically significant. If the one-way ANOVA is significant, we can reject the null hypothesis and conclude that the four means are not all equal. However, we don’t know which pairs of groups are significantly different. To compare group means, we need to perform post hoc tests, also known as multiple comparisons.
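### The four-material scenario can be sketched with Tukey's HSD from statsmodels. The strength values are simulated and the labels `M1` to `M4`, the group means, and the seed are assumptions made for illustration only.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical strength measurements for four materials (simulated, illustrative)
rng = np.random.default_rng(2)
means = {"M1": 40.0, "M2": 41.0, "M3": 47.0, "M4": 40.5}
strength = np.concatenate([rng.normal(m, 2.0, size=10) for m in means.values()])
material = np.repeat(list(means.keys()), 10)

# Omnibus test first: are the four means all equal?
f_stat, p_value = stats.f_oneway(*[strength[material == m] for m in means])

# Tukey's HSD compares every pair while controlling the family-wise error rate
tukey = pairwise_tukeyhsd(endog=strength, groups=material, alpha=0.05)

print("F:", f_stat, "p:", p_value)
print(tukey.summary())
```

### The summary lists one row per pair of materials, with a `reject` column showing which pairs differ at the chosen level.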

## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [1]:
import numpy as np
import scipy.stats as stat

In [25]:
# No dataset is given, so weight loss is simulated: 16 + 16 + 18 = 50 participants
A=np.random.normal(5,scale=5,size=16)
B=np.random.normal(6.6,scale=5,size=16)
C=np.random.normal(4.2,scale=5,size=18)

In [26]:
A

array([ 2.4740655 ,  9.30279443,  2.06703141,  4.90884087, 10.33694422,
       -0.02239828, 10.26001633,  9.20744752,  4.88239026,  6.16705082,
        1.42054154,  4.35061027,  2.28276727,  6.06647   ,  1.94600642,
       -3.66293696])

In [27]:
B

array([16.29467803, -0.68674509, 10.6780615 ,  2.90572568,  6.12971078,
       10.13767272,  8.84252868,  0.9832229 ,  3.89093102,  5.10426296,
        1.47027182, 11.27581586,  1.15628297, 15.00199929, -5.3143011 ,
        4.9888399 ])

In [28]:
C

array([ 5.23906616,  6.31116605,  2.04342699,  1.84236056,  1.35914206,
       -3.90309296,  3.8875906 ,  4.26676556,  9.32600843,  9.45012947,
        6.9498447 , -1.79689642,  6.23235962,  4.68133352,  1.14015308,
       10.27216866,  7.09881087,  4.34644587])

In [30]:
f_statistics,p_value=stat.f_oneway(A,B,C)

In [31]:
f_statistics

0.482927303734956

In [32]:
p_value

0.620002262158216

In [35]:
dfb=3-1
dfw=50-3
significance_value=0.05

In [37]:
critical_value=stat.f.ppf(q=1-significance_value,dfn=dfb,dfd=dfw)

In [38]:
critical_value

3.195056280737215

In [40]:
if f_statistics > critical_value:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")

we fail to reject the null hypothesis

### With F = 0.483 below the critical value of 3.195 and p = 0.620 above 0.05, there is no evidence that mean weight loss differs between the three diets.


## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [45]:
A=np.random.normal(5,scale=1.59,size=10)
B=np.random.normal(6,scale=1.67,size=10)
C=np.random.normal(5,scale=1.34,size=10)

In [46]:
A

array([4.44971989, 6.59922221, 7.47112224, 2.98295775, 5.27791198,
       3.4999388 , 4.29987577, 5.19647777, 5.6113129 , 3.34532801])

In [47]:
B

array([8.25749023, 9.80356043, 5.37122993, 6.24737935, 6.90098152,
       4.74202307, 8.54175311, 4.59855132, 5.85486604, 6.43984543])

In [48]:
C

array([8.73836471, 5.36154602, 5.84913982, 5.5137801 , 4.89632818,
       5.46838294, 4.25925673, 3.37438941, 6.94681681, 2.30976715])

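In [49]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [51]:
# anova_lm takes a fitted model (not raw arrays) and the keyword is typ, not type.
# Assumption: experience was never simulated, so the first 5 per program are novice.
df = pd.DataFrame({"time": np.concatenate([A, B, C]),
                   "program": np.repeat(["A", "B", "C"], 10),
                   "experience": np.tile(np.repeat(["novice", "experienced"], 5), 3)})
# String columns are treated as categorical; C(...) is avoided because C is an array here.
model = ols("time ~ program * experience", data=df).fit()
sm.stats.anova_lm(model, typ=2)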

## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [53]:
control_g=np.random.randint(35,99,size=50)
experimental_g=np.random.randint(35,99,size=50)

In [54]:
control_g

array([83, 85, 78, 69, 68, 64, 49, 36, 59, 85, 70, 77, 65, 80, 37, 44, 84,
       60, 63, 83, 39, 55, 36, 52, 87, 64, 76, 51, 71, 45, 42, 61, 98, 83,
       38, 76, 53, 47, 48, 39, 66, 51, 36, 69, 66, 76, 69, 63, 95, 37])

In [55]:
experimental_g

array([84, 58, 36, 68, 57, 44, 83, 36, 49, 76, 72, 38, 64, 75, 77, 87, 39,
       66, 92, 96, 83, 97, 35, 90, 41, 69, 80, 35, 37, 59, 53, 52, 98, 61,
       65, 75, 98, 86, 38, 71, 45, 84, 68, 41, 43, 36, 61, 41, 50, 87])

In [56]:
f_statistics,p_value=stat.f_oneway(control_g,experimental_g)

In [57]:
f_statistics

0.06469907274748686

In [58]:
dfb=2-1
dfw=100-2
significance_value=0.05

In [59]:
critical_value=stat.f.ppf(q=1-significance_value,dfn=dfb,dfd=dfw)

In [60]:
critical_value

3.938111078003371

In [61]:
if f_statistics > critical_value:
    print("we reject the null hypothesis")
else:
    print("we fail to reject the null hypothesis")

we fail to reject the null hypothesis


### That is, the difference between the two groups is not significant, so no post-hoc test is needed.
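### The question asks for a two-sample t-test, while the cells above use `f_oneway`. With exactly two groups the two tests are equivalent (F equals t squared), and a post-hoc test adds nothing because a significant result can only mean that those two groups differ. A minimal sketch on separately simulated data (the seed and the names `control` and `experimental` are assumptions):

```python
import numpy as np
from scipy import stats

# Simulated scores for illustration (seeded so the run is repeatable)
rng = np.random.default_rng(3)
control = rng.integers(35, 99, size=50)
experimental = rng.integers(35, 99, size=50)

# Two-sample t-test with pooled variance (scipy's default, equal_var=True)
t_stat, t_p = stats.ttest_ind(control, experimental)

# One-way ANOVA on the same two groups
f_stat, f_p = stats.f_oneway(control, experimental)

print("t:", t_stat, "p:", t_p)
print("t squared:", t_stat ** 2, "F:", f_stat)  # these two values agree
```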

## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.
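### One way to approach this is `AnovaRM` from statsmodels, treating each day as the repeated "subject" and store as the within-subject factor. The sketch below uses simulated sales because no data is given; the store means, the spread, and the seed are assumptions. The post-hoc step shown is paired t-tests with a Bonferroni correction, which is one option among several.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.anova import AnovaRM

# Simulated daily sales (no real data is given): 30 days x 3 stores, long format
rng = np.random.default_rng(4)
days = np.arange(30)
sales = pd.DataFrame({
    "day": np.tile(days, 3),
    "store": np.repeat(["A", "B", "C"], 30),
    "sales": np.concatenate(
        [rng.normal(m, 50.0, size=30) for m in (500.0, 520.0, 560.0)]
    ),
})

# Each day is measured once per store, so the design is balanced as AnovaRM requires
result = AnovaRM(data=sales, depvar="sales", subject="day", within=["store"]).fit()
table = result.anova_table
print(table)

# Post-hoc sketch: paired t-tests with a Bonferroni correction (3 comparisons)
wide = sales.pivot(index="day", columns="store", values="sales")
pairs = [("A", "B"), ("A", "C"), ("B", "C")]
adjusted_p = {
    pair: min(1.0, stats.ttest_rel(wide[pair[0]], wide[pair[1]])[1] * len(pairs))
    for pair in pairs
}
print(adjusted_p)
```

### The post-hoc comparisons are only worth interpreting when the p-value in the ANOVA table is below the chosen significance level.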