#### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

#### Ans:- ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups. It relies on three main assumptions:
#### Normality: The data within each group should follow a normal distribution.

#### Homogeneity of variance: The variance within each group should be approximately equal.

#### Independence: The observations within each group must be independent of each other.

If any of these assumptions are violated, the results of ANOVA may not be valid or reliable. Examples of violations include:
#### Non-normality: If the data within each group is not normally distributed, the ANOVA results may not be valid. For example, if the data is heavily skewed or has extreme outliers, the assumption of normality may be violated.

#### Non-homogeneity of variance: If the variance within each group is not approximately equal, the ANOVA results may not be reliable. For example, if the variance in one group is much larger than in the others, the F-test can reject the null hypothesis too often or too rarely, particularly when the group sizes are unequal.

#### Lack of independence: If the observations within each group are not independent of each other, the ANOVA results may not be valid. For example, if the same students take a test several times and each score is treated as a separate observation, the scores from one student are correlated, so the observations within the group are not independent.

#### Interaction between factors: This is not a distributional assumption, but it affects interpretation. If two factors interact and the model omits the interaction term, the ANOVA results can be misleading. For example, if a drug has different effects on different genders or age groups, a model with only main effects does not capture that interaction.
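The normality and equal-variance assumptions above can be checked before running ANOVA. The following is a minimal sketch, assuming simulated data for three hypothetical groups; it uses the Shapiro-Wilk test (`scipy.stats.shapiro`) for normality and Levene's test (`scipy.stats.levene`) for homogeneity of variance. Independence cannot be tested this way and has to be judged from the study design.

```python
import numpy as np
from scipy import stats

# Simulated data for three hypothetical groups (values are illustrative only)
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=50, scale=5, size=30),
    "B": rng.normal(loc=52, scale=5, size=30),
    "C": rng.normal(loc=55, scale=5, size=30),
}

# Normality: Shapiro-Wilk test within each group
# (a small p-value suggests a departure from normality)
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variance: Levene's test across all groups
# (a small p-value suggests unequal variances)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk p-value, group {name}: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

If either check fails, alternatives such as Welch's ANOVA or the Kruskal-Wallis test may be more appropriate.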

#### Q2. What are the three types of ANOVA, and in what situations would each be used?

#### Ans:- There are three common types of ANOVA: one-way ANOVA, two-way ANOVA, and repeated measures ANOVA.
#### One-way ANOVA is used when there is one independent variable (factor) with three or more levels, for example comparing mean test scores across three teaching methods.
#### Two-way ANOVA is used when there are two independent variables and you want to test each main effect as well as their interaction.
#### Repeated measures ANOVA is used for within-subjects designs, where the same subjects are measured under every condition or at several time points.
#### All three share the assumptions of normality and homogeneity of variance. One-way and two-way ANOVA also require independent observations, whereas repeated measures ANOVA accounts for the dependence of measurements within a subject and assumes sphericity instead.
#### Before using ANOVA, it is important to check for violations of these assumptions and consider alternative methods if necessary.

#### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

#### Ans:- The partitioning of variance in ANOVA refers to the breakdown of the total variation in the data into different sources of variation. ANOVA partitions the total variance in the dependent variable into components associated with the independent variables and with error. The components are:

#### The between-group variation (SSB): This represents the variation in the dependent variable that is due to differences between the groups defined by the independent variable.

#### The within-group variation (SSW): This represents the variation in the dependent variable that is due to individual differences within each group.

#### The residual variation (SSE): In a one-way ANOVA this is the same quantity as the within-group variation (SSW). It is the variation in the dependent variable that the independent variable does not explain, so the total sum of squares satisfies SST = SSB + SSW.

The partitioning of variance in ANOVA is important because it helps researchers to identify the sources of variation in their data and to determine the relative importance of each source. This information can help researchers to interpret their results and to make informed decisions about their research questions. Additionally, understanding the partitioning of variance can help researchers to choose appropriate statistical techniques for their data analysis and to avoid misinterpretations of their results.
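To make the partition concrete, here is a small worked sketch with NumPy. The three groups and their values are hypothetical and chosen only so the arithmetic is easy to follow; the code computes the total, between-group, and within-group sums of squares by hand and forms the F-statistic from them.

```python
import numpy as np

# Hypothetical scores for three groups (illustrative values only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([6.0, 7.0, 8.0]),
    np.array([9.0, 10.0, 11.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total variation: squared deviations of every observation from the grand mean
sst = np.sum((all_values - grand_mean) ** 2)

# Between-group variation: squared deviations of group means from the grand mean,
# weighted by group size
ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group variation: squared deviations of observations from their own group mean
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# F-statistic: ratio of the between-group to the within-group mean square
k, n = len(groups), len(all_values)
f_stat = (ssb / (k - 1)) / (ssw / (n - k))

print("SST:", sst)
print("SSB:", ssb)
print("SSW:", ssw)
print("F:", f_stat)
```

For these values the between-group part is 38 and the within-group part is 6, which add up to the total of 44 (up to floating-point rounding), and the F-statistic is 19.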

#### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [5]:
df = pd.read_csv("https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv")

In [6]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [8]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# load data
data = pd.read_csv('https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv')

# fit one-way ANOVA model: Fare is the dependent variable (an illustrative choice)
# and passenger class (Pclass) is the grouping factor
model = ols('Fare ~ C(Pclass)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=1)

# explained (between-group) sum of squares, called SSE in the question
sse = anova_table.loc['C(Pclass)', 'sum_sq']

# residual (within-group) sum of squares, called SSR in the question
ssr = anova_table.loc['Residual', 'sum_sq']

# total sum of squares is the sum of the two components
sst = sse + ssr

print('SST:', sst)
print('SSE (explained):', sse)
print('SSR (residual):', ssr)


#### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [19]:
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Load the data into a Pandas DataFrame
data = pd.DataFrame({
    'Group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'Factor_1': ['Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High', 'Low', 'High'],
    'Factor_2': ['Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No'],
    'Score': [70, 80, 85, 95, 60, 75, 90, 100, 55, 70, 80, 90]
})

# Fit the ANOVA model
model = ols('Score ~ C(Group) + C(Factor_1) + C(Factor_2) + C(Factor_1):C(Factor_2)', data).fit()

# Print the ANOVA table
print(anova_lm(model, typ=2))

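# The main effects and the interaction are tested by the C(Factor_1), C(Factor_2)
# and C(Factor_1):C(Factor_2) rows of the ANOVA table printed above.
# The fitted coefficients show effect sizes relative to the reference levels.

# Intercept and Group coefficients (first three entries of model.params)
group_coefs = model.params[:3]
print('Intercept and Group coefficients:')
print(group_coefs)

# Coefficient of Factor_1 = Low relative to High (fourth entry of model.params)
factor1_coef = model.params.iloc[3]
print('C(Factor_1)[T.Low] coefficient:')
print(factor1_coef)


                              sum_sq   df          F    PR(>F)
C(Group)                  179.166667  2.0   6.142857  0.035328
C(Factor_1)               408.333333  1.0  28.000000  0.001845
C(Factor_2)              1408.333333  1.0  96.571429  0.000064
C(Factor_1):C(Factor_2)     8.333333  1.0   0.571429  0.478309
Residual                   87.500000  6.0        NaN       NaN
Intercept and Group coefficients:
Intercept        98.333333
C(Group)[T.B]    -1.250000
C(Group)[T.C]    -8.750000
dtype: float64
C(Factor_1)[T.Low] coefficient:
-9.999999999999986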


#### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

#### Ans:- An F-statistic of 5.23 with a p-value of 0.02 provides evidence of differences between the groups.
The F-statistic is the ratio of between-group variance to within-group variance, so a larger F-value means the group means differ more relative to the variation within groups. The p-value is the probability of obtaining an F-statistic at least as large as the observed one if the null hypothesis is true. Here, a p-value of 0.02 means there is only a 2% chance of seeing an F-statistic this large or larger if all group means were equal.
#### Since 0.02 is below the conventional significance level of 0.05, we reject the null hypothesis that all the group means are equal and conclude that at least one group mean differs from the others. However, we cannot determine which specific groups differ from one another without further analysis, such as post-hoc tests.
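The link between the F-statistic and the p-value can be sketched with `scipy.stats.f`. The question does not state the degrees of freedom, so the design below (3 groups, 15 observations in total) is purely an assumption for illustration; with it, the p-value comes out close to the reported 0.02.

```python
from scipy import stats

f_stat = 5.23

# Hypothetical design (not given in the question): 3 groups, 15 observations in total
k, n = 3, 15
df_between = k - 1   # numerator degrees of freedom
df_within = n - k    # denominator degrees of freedom

# p-value: upper-tail probability of the F distribution at the observed statistic
p_value = stats.f.sf(f_stat, df_between, df_within)

# Critical value at the 5% significance level
alpha = 0.05
f_crit = stats.f.ppf(1 - alpha, df_between, df_within)

print(f"p-value: {p_value:.3f}")
print(f"critical F at alpha={alpha}: {f_crit:.2f}")
print("reject H0:", f_stat > f_crit)
```

Comparing the F-statistic with the critical value and comparing the p-value with alpha are two views of the same decision rule.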

#### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

#### Ans:- There are several methods to handle missing data in a repeated measures ANOVA. One common approach is imputation, where missing values are replaced with estimates based on the observed data. Imputation methods include mean imputation, regression imputation, and multiple imputation. Another approach is to exclude participants with missing data from the analysis (listwise deletion), although this reduces the sample size and can bias the results if the data are not missing completely at random. The choice of method has several potential consequences:

#### Biased estimates: If missing data are not handled correctly, it can lead to biased estimates of group means, variances, and effect sizes.

#### Reduced statistical power: Excluding participants with missing data can lead to a reduced sample size, which can reduce statistical power and increase the risk of type II errors.

#### Increased type I error rate: Single imputation methods such as mean imputation treat the imputed values as if they were observed, which understates the variability in the data. Standard errors then become too small, which can increase the risk of type I errors, especially when the proportion of missing data is high.

#### Invalid results: If missing data are not handled correctly, it can lead to invalid statistical results, which can have implications for the interpretation of the findings.
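Two of the approaches above, listwise deletion and mean imputation, can be compared on a tiny example. This is a minimal sketch with pandas on hypothetical wide-format data (one row per participant, one column per time point); it shows how deletion shrinks the sample and how mean imputation shrinks the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame({
    "t1": [10.0, 12.0, 11.0, 14.0, 13.0],
    "t2": [11.0, np.nan, 12.0, 15.0, 14.0],
    "t3": [13.0, 14.0, np.nan, 17.0, 15.0],
})

# Listwise deletion: drop every participant with at least one missing value
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with that time point's observed mean
mean_imputed = scores.fillna(scores.mean())

print("Participants before deletion:", len(scores))
print("Participants after listwise deletion:", len(complete_cases))
print("Variance of t2, observed values only:", scores["t2"].var())
print("Variance of t2, after mean imputation:", mean_imputed["t2"].var())
```

Here two of the five participants are lost to deletion, and the imputed column has a smaller variance than the observed values, which is the mechanism behind the overly small standard errors.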

#### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

#### Ans:- Tukey's Honestly Significant Difference (HSD): Tukey's HSD test compares all pairs of group means while controlling the familywise error rate. It is commonly used when all pairwise comparisons are of interest and the group sizes are equal or nearly equal.

#### Bonferroni correction: The Bonferroni correction is a conservative adjustment that divides the significance level by the number of comparisons. It is commonly used when the number of planned comparisons is small.

#### Scheffe's test: Scheffe's test is a very conservative post-hoc test that controls the familywise error rate for all possible contrasts, not only pairwise ones. It is commonly used for complex comparisons and when the group sizes are unequal.

#### Dunnett's test: Dunnett's test is a post-hoc test that compares all groups to a control group. It is commonly used when there is a control group and multiple treatment groups.

An example of a situation where a post-hoc test might be necessary is when a researcher is conducting an experiment to test the effects of three different types of exercise on cardiovascular health. After conducting an ANOVA, the researcher finds that there is a significant difference between the groups. To determine which specific groups are different from one another, the researcher could use a post-hoc test, such as Tukey's HSD test, to compare the mean cardiovascular health scores of each group.
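The exercise example can be sketched with `pairwise_tukeyhsd` from statsmodels. The exercise names, group means, and sample sizes below are assumptions made up for illustration, and the scores are simulated.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Simulated cardiovascular health scores for three hypothetical exercise types
rng = np.random.default_rng(42)
scores = np.concatenate([
    rng.normal(loc=60, scale=5, size=20),  # walking
    rng.normal(loc=65, scale=5, size=20),  # cycling
    rng.normal(loc=75, scale=5, size=20),  # swimming
])
labels = np.repeat(["walking", "cycling", "swimming"], 20)

# Tukey's HSD compares every pair of groups while controlling the familywise error rate
tukey = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(tukey.summary())
```

The summary lists one row per pair of groups, with the mean difference, a confidence interval, and whether the null hypothesis of equal means is rejected for that pair.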


#### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [20]:
import numpy as np
from scipy.stats import f_oneway

# Simulate weight loss data for the three diets (50 values per diet, 150 in total)
np.random.seed(123)
A = np.random.normal(loc=5, scale=2, size=50)
B = np.random.normal(loc=7, scale=2, size=50)
C = np.random.normal(loc=9, scale=2, size=50)

# Perform one-way ANOVA
F, p = f_oneway(A, B, C)

# Print results
print("F-statistic:", F)
print("p-value:", p)


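F-statistic: 45.78264848608969
p-value: 3.4969532692130786e-16

#### Ans:- The F-statistic is about 45.78 and the p-value is about 3.5e-16, which is far below 0.05. For this simulated data we reject the null hypothesis that the three diets have the same mean weight loss, so at least one diet differs from the others. A post-hoc test such as Tukey's HSD would be needed to identify which diets differ.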


#### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Create a simulated dataset: 30 employees, 10 per software program,
# with 5 novice and 5 experienced employees in each program.
# Completion times are drawn from a single normal distribution,
# so no true program or experience effect is built into the data.
np.random.seed(123)
n = 30
employees = ['Employee {}'.format(i) for i in range(1, n+1)]
programs = ['Program A', 'Program B', 'Program C']
experience = ['Novice', 'Experienced']
data = pd.DataFrame({'Employee': employees,
                     'Program': np.repeat(programs, n // 3),
                     'Experience': np.tile(np.repeat(experience, 5), 3),
                     'Time': np.random.normal(loc=10, scale=2, size=n)})

# Convert categorical variables to categorical data type
data['Program'] = data['Program'].astype('category')
data['Experience'] = data['Experience'].astype('category')

# Fit the ANOVA model with both main effects and their interaction
model = ols('Time ~ C(Program) * C(Experience)', data=data).fit()

# Print the ANOVA table: the C(Program) and C(Experience) rows test the main effects,
# and the C(Program):C(Experience) row tests the interaction
print(anova_lm(model, typ=2))

#### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [23]:
pip install pingouin


In [34]:
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind
from pingouin import pairwise_tukey

# Generate random test score data for the control and experimental groups
np.random.seed(123)
control_scores = np.random.normal(loc=70, scale=10, size=50)
experimental_scores = np.random.normal(loc=75, scale=10, size=50)

# Perform two-sample t-test
t, p = ttest_ind(control_scores, experimental_scores)

# Print results
print("t-value:", t)
print("p-value:", p)

# Perform post-hoc test if significant differences are found
# (with only two groups this repeats the single comparison made by the t-test)
if p < 0.05:
    # pairwise_tukey expects a long-format DataFrame plus the names of the
    # dependent-variable column (dv) and the grouping column (between)
    scores = pd.DataFrame({
        'score': np.concatenate([control_scores, experimental_scores]),
        'group': ['control'] * 50 + ['experimental'] * 50
    })
    posthoc = pairwise_tukey(data=scores, dv='score', between='group')
    print("Post-hoc test results:")
    print(posthoc)


t-value: -2.315158728279605
p-value: 0.022690065589586535

#### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Create a simulated dataset: the same 30 days are observed at every store
np.random.seed(123)
base_sales = np.array([10, 12, 15, 11, 13, 16, 14, 16, 18, 12, 13, 15, 11, 14, 16, 15, 17, 20, 11, 12, 14, 16, 18, 20, 17, 19, 21, 15, 18, 20])
data = pd.DataFrame({
    'store': ['A']*30 + ['B']*30 + ['C']*30,
    # store-specific shift plus noise (assumed values), so the stores are not identical copies
    'sales': np.concatenate([base_sales + shift + np.random.normal(0, 1, 30) for shift in (0, 1, 2)]),
    'day': list(range(1, 31)) * 3
})

# Fit a repeated measures ANOVA: 'day' is the repeated unit (subject)
# and 'store' is the within-subject factor
rm_anova = AnovaRM(data, depvar='sales', subject='day', within=['store']).fit()

# Print the ANOVA table
print(rm_anova)
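The question also asks for a post-hoc follow-up. For repeated measures, one common option is a paired t-test for each pair of stores with a Bonferroni correction. The following is a sketch on simulated data: the store means, the shared day-to-day variation, and the noise level are all hypothetical, and the data are generated inside the block so it runs on its own.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
from statsmodels.stats.multitest import multipletests

# Simulated daily sales for three stores over the same 30 days (hypothetical values)
rng = np.random.default_rng(123)
day_effect = rng.normal(loc=0, scale=2, size=30)  # shared day-to-day variation
wide = pd.DataFrame({
    "A": 15 + day_effect + rng.normal(0, 1, 30),
    "B": 16 + day_effect + rng.normal(0, 1, 30),
    "C": 18 + day_effect + rng.normal(0, 1, 30),
})

# Paired t-test for every pair of stores (same days, so the observations are paired)
pairs = list(combinations(wide.columns, 2))
raw_p = [ttest_rel(wide[a], wide[b]).pvalue for a, b in pairs]

# Bonferroni correction for the three comparisons
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for (a, b), p, r in zip(pairs, adj_p, reject):
    print(f"Store {a} vs Store {b}: adjusted p = {p:.4f}, significant = {r}")
```

Paired tests are used because each day contributes one observation per store, so the three sales series are not independent samples.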