## Analysis of Variance (ANOVA)

## Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

1. Normality of the sampling distribution of means: the distribution of the sample means is normally distributed.

2. Absence of outliers
   Outlying scores need to be identified and removed from the dataset.
    
3. Homogeneity of variance
   Each population has the same variance: the population variances across the levels of the independent variable are equal.
   
4. Samples are independent and random.

ANOVA (Analysis of Variance) is a statistical technique used to compare means of three or more groups. The following are the assumptions required to use ANOVA:

1. Independence: The observations within each group should be independent of each other.

2. Normality: The data within each group should follow a normal distribution.

3. Homogeneity of variance: The variance within each group should be equal.

Examples of violations that could impact the validity of the results are:

1. Violation of Independence Assumption: If observations within a group are not independent of each other, then ANOVA may produce inaccurate results. For example, if the same group of people is tested repeatedly over time, then the observations within each group are not independent.

2. Violation of Normality Assumption: If the data within each group does not follow a normal distribution, then ANOVA may produce inaccurate results. For example, if the data are strongly skewed or contain extreme outliers, then the normality assumption may be violated. The problem is most serious when sample sizes are small, because the sample means are then less likely to be approximately normal.

3. Violation of Homogeneity of Variance Assumption: If the variance within each group is not equal, then ANOVA may produce inaccurate results. For example, if one group has a much larger variance than the others, then the homogeneity of variance assumption may be violated.

It is important to check for these assumptions before using ANOVA, and if any are violated, it may be necessary to use alternative statistical techniques.
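As a minimal sketch of how the normality and equal-variance assumptions could be checked in practice, the snippet below runs a Shapiro-Wilk test within each group and Levene's test across groups using `scipy.stats`. The three groups are simulated, hypothetical data; independence cannot be tested this way and has to be argued from the study design.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three groups of 30 simulated observations each.
rng = np.random.default_rng(0)
groups = {
    "A": rng.normal(loc=10.0, scale=2.0, size=30),
    "B": rng.normal(loc=11.0, scale=2.0, size=30),
    "C": rng.normal(loc=12.0, scale=2.0, size=30),
}

# Normality: Shapiro-Wilk test within each group
# (null hypothesis: the sample comes from a normal distribution).
shapiro_p = {}
for name, values in groups.items():
    _, p = stats.shapiro(values)
    shapiro_p[name] = p

# Homogeneity of variance: Levene's test across all groups
# (null hypothesis: all groups have equal variance).
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(f"Shapiro-Wilk, group {name}: p = {p:.3f}")
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

A small p-value in either test is evidence against the corresponding assumption; a large p-value does not prove that the assumption holds.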

## Q2. What are the three types of ANOVA, and in what situations would each be used?

The three types of ANOVA, classified by design, are:

1. One-way ANOVA: one factor with at least 2 levels, where the levels are independent.

2. Repeated measures ANOVA: one factor with at least 2 levels, where the levels are dependent (the same subjects are measured at every level).

3. Factorial ANOVA: two or more factors (each with at least 2 levels), where the levels can be either independent or dependent.

They are also often described by the number of factors:

1. One-way ANOVA: It tests the null hypothesis that the means of two or more independent groups are equal. It is commonly used when a single factor (independent variable) is compared across multiple groups.

2. Two-way ANOVA: It tests the main effect of each of two factors (independent variables) on the dependent variable, as well as the null hypothesis that there is no interaction between them. It is commonly used when two factors are studied and their interaction is of interest.

3. N-way ANOVA: It is used when more than two factors are tested and their interactions on the dependent variable are of interest. The number of factors determines the "N" in N-way ANOVA.

The choice of ANOVA depends on the research question and the design of the study. If the study involves only one factor being tested, then one-way ANOVA can be used. If there are multiple factors being tested, then two-way or N-way ANOVA may be more appropriate. It is important to choose the right type of ANOVA to answer the research question accurately.


## Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Partitioning of variance in ANOVA refers to the process of decomposing the total variation observed in a dataset into different components that are associated with specific sources of variation.
In ANOVA, the total variance of the dependent variable is partitioned into two components:

1. Between-group variance

2. Within-group variance

Between-group variance refers to the variation in the dependent variable that is attributable to the differences between the groups being compared in the study. It reflects the extent to which the means of the groups differ from each other. Within-group variance refers to the variation in the dependent variable that is due to the differences within each group, and it reflects the variability of the data within each group.

The partitioning of variance is important because it allows us to determine the relative contributions of the different sources of variation to the total variation in the data. By comparing the between-group variance to the within-group variance, we can determine if there is a statistically significant difference between the means of the groups being compared. This helps us to draw conclusions about the effects of the independent variable on the dependent variable, and to make valid inferences about the population means.


Moreover, understanding the partitioning of variance can help us to identify potential sources of error or bias in our study. For example, if a large proportion of the total variance is within-group variance, most of the variation is not explained by the grouping factor; this may point to measurement noise or to other variables (possibly confounders) that affect the outcome. By understanding the partitioning of variance, we can make more accurate and reliable conclusions about our data.
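The partition can be verified numerically. The sketch below uses NumPy only and three small hypothetical groups (the values are made up for illustration) to compute the total, between-group, and within-group sums of squares and the resulting F-ratio.

```python
import numpy as np

# Hypothetical measurements for three groups (illustrative values only).
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([7.0, 8.0, 9.0]),
    np.array([1.0, 2.0, 3.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total sum of squares: every observation against the grand mean.
ss_total = ((all_values - grand_mean) ** 2).sum()

# Between-group sum of squares: each group mean against the grand mean,
# weighted by the group size.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: every observation against its own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F-ratio: between-group mean square divided by within-group mean square.
k = len(groups)        # number of groups
n = len(all_values)    # total number of observations
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))

print("SS total:  ", ss_total)
print("SS between:", ss_between)
print("SS within: ", ss_within)
print("F:         ", f_stat)
```

For these values the total sum of squares is 60, which splits into 54 between groups and 6 within groups, giving F = 27.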


## Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [1]:
import pandas as pd
from statsmodels.formula.api import ols

# Load data. Assumes data.csv exists in the working directory with a numeric
# 'response' column and a categorical 'group' column; otherwise read_csv
# raises FileNotFoundError.
data = pd.read_csv('data.csv')

# Fit one-way ANOVA model
model = ols('response ~ group', data=data).fit()

# Calculate total sum of squares (SST): observations vs. the grand mean
SST = ((data['response'] - data['response'].mean())**2).sum()

# Calculate explained sum of squares (SSE): fitted values vs. the grand mean
SSE = ((model.fittedvalues - data['response'].mean())**2).sum()

# Calculate residual sum of squares (SSR): observations vs. fitted values
SSR = ((data['response'] - model.fittedvalues)**2).sum()

print('SST:', SST)
print('SSE:', SSE)
print('SSR:', SSR)

## Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

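In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load the data into a pandas DataFrame. Assumes data.csv has a numeric
# 'response' column and categorical (string) 'factor1' and 'factor2' columns.
data = pd.read_csv('data.csv')

# Define the ANOVA model; 'factor1 * factor2' expands to both main effects
# plus their interaction
model = ols('response ~ factor1 * factor2', data).fit()

# Get the ANOVA table (Type II sums of squares). The F and PR(>F) columns
# give the test of each main effect and of the interaction.
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

# Express each effect's sum of squares relative to the residual sum of squares
# (an effect-size ratio, not a significance test)
main_effect_1 = anova_table.loc['factor1', 'sum_sq'] / anova_table.loc['Residual', 'sum_sq']
main_effect_2 = anova_table.loc['factor2', 'sum_sq'] / anova_table.loc['Residual', 'sum_sq']
interaction_effect = anova_table.loc['factor1:factor2', 'sum_sq'] / anova_table.loc['Residual', 'sum_sq']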

## Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
## What can you conclude about the differences between the groups, and how would you interpret these results?

If a one-way ANOVA produces an F-statistic of 5.23 and a p-value of 0.02, it means that there is significant evidence to reject the null hypothesis that the group means are equal. In other words, at least one group mean is different from the others. However, this result alone does not tell us which group(s) differ from the others.

The F-statistic represents the ratio of the variance between groups to the variance within groups. A larger F-statistic means that the variance between groups is larger relative to the variance within groups. The p-value is the probability of obtaining an F-statistic at least as large as the one observed if the null hypothesis (that the group means are equal) is true. In this case, the p-value of 0.02 is below the conventional significance level of 0.05, so the result is statistically significant.

To interpret the results, you could report that the one-way ANOVA showed a significant difference between the groups (F = 5.23, p = 0.02, reported together with the between-group and within-group degrees of freedom of the study, which the question does not state), indicating that at least one group mean is different from the others. Further post-hoc tests could be conducted to determine which group(s) differ significantly from the others.
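The p-value attached to an F-statistic depends on the degrees of freedom. As a rough illustration, the sketch below uses `scipy.stats.f.sf` to compute the upper-tail probability for F = 5.23 under two hypothetical pairs of degrees of freedom; neither pair comes from the question, they are assumptions chosen only to show the effect.

```python
from scipy.stats import f

f_stat = 5.23

# Hypothetical (between, within) degrees of freedom; the question gives none.
candidate_dfs = [(2, 12), (2, 57)]

p_values = {}
for df_between, df_within in candidate_dfs:
    # Survival function: P(F >= f_stat) under the null hypothesis.
    p_values[(df_between, df_within)] = f.sf(f_stat, df_between, df_within)

for (df_between, df_within), p in p_values.items():
    print(f"F({df_between}, {df_within}) = {f_stat}: p = {p:.4f}")
```

Under these assumed degrees of freedom the p-value is about 0.023 for F(2, 12) and about 0.008 for F(2, 57), which is why the degrees of freedom should always be reported alongside F and p.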


## Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

In a repeated measures ANOVA, missing data can be handled using various methods, including listwise deletion, pairwise deletion, mean substitution, and maximum likelihood estimation. Each method has its advantages and disadvantages, and the choice of method depends on the nature of the missing data and the research question.

Listwise deletion involves removing any participant with missing data, resulting in a smaller sample size. This method can potentially bias the results if the missing data are not missing at random and may reduce the statistical power of the analysis.

Pairwise deletion uses all available data for each individual comparison, so different comparisons may be based on different subsets of participants. This method can result in biased estimates of the standard errors and may lead to type I errors.

Mean substitution involves replacing missing values with the mean of the available data for that variable. This method artificially shrinks the variance of the variable and can distort the correlations between repeated measurements, so it tends to bias the results even when the data are missing completely at random.

Maximum likelihood estimation involves using all available data to estimate the parameters of the model. This method can provide unbiased estimates if the missing data are missing at random but may introduce bias if the data are not missing at random.

Overall, it is important to carefully consider the nature of the missing data and choose an appropriate method that is consistent with the assumptions of the analysis.
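To make the difference between two of these methods concrete, here is a small pandas sketch on hypothetical wide-format data (one row per participant, one column per time point; all names and values are made up). It compares listwise deletion with mean substitution and shows how mean substitution shrinks the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: rows are participants, columns are time points.
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [6.0, np.nan, 8.0, 9.0],
        "t3": [7.0, 8.0, np.nan, 10.0],
    },
    index=["p1", "p2", "p3", "p4"],
)

# Listwise deletion: drop every participant with at least one missing value.
listwise = wide.dropna()

# Mean substitution: fill each missing value with that time point's observed mean.
mean_filled = wide.fillna(wide.mean())

print("Participants kept (listwise):", len(listwise), "of", len(wide))
print("Variance per time point, observed values only:")
print(wide.var())
print("Variance per time point, after mean substitution:")
print(mean_filled.var())
```

Listwise deletion keeps only two of the four participants here, while mean substitution keeps all four but lowers the variance of `t2` and `t3`.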


## Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

After conducting an ANOVA, post-hoc tests are often used to determine which specific groups have significant differences. Here are some common post-hoc tests and situations where they may be used:

1. Tukey's Honestly Significant Difference (HSD) Test: This test compares all pairs of group means while controlling the family-wise Type I error rate. It is designed for equal sample sizes in each group (the Tukey-Kramer variant handles unequal sizes) and is the usual choice when the researcher wants to know which of several groups differ from each other.

2. Bonferroni Correction: This is an adjustment rather than a test in its own right: each comparison is tested at the significance level divided by the number of comparisons. It works with any set of pairwise tests, including unequal sample sizes, and suits a small number of planned comparisons. With many comparisons it is more conservative than Tukey's HSD, so it minimizes false positives at the cost of statistical power.

3. Scheffe's Test: This test is used when the comparisons are not fixed in advance, including complex contrasts such as one group against the average of the others. It controls the overall Type I error rate across all possible contrasts, which makes it the most conservative of the three for simple pairwise comparisons.

A situation where a post-hoc test might be necessary is when an ANOVA finds a significant difference between groups, but the researcher wants to determine which specific groups are significantly different from each other. For example, in a study comparing the effectiveness of three different medications for treating a specific condition, an ANOVA might find a significant difference between the three groups. A post-hoc test could be used to determine which specific medications are significantly different from each other.
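As one possible sketch of the medication example, the snippet below runs pairwise two-sample t-tests with `scipy.stats.ttest_ind` and applies a Bonferroni adjustment by hand. The three samples are simulated, and the group names `med_A`, `med_B`, and `med_C` are hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

# Hypothetical outcome scores for three medications (simulated, seeded).
rng = np.random.default_rng(42)
samples = {
    "med_A": rng.normal(loc=50.0, scale=5.0, size=20),
    "med_B": rng.normal(loc=55.0, scale=5.0, size=20),
    "med_C": rng.normal(loc=50.0, scale=5.0, size=20),
}

# All pairwise comparisons between the three groups.
pairs = list(combinations(samples, 2))
n_comparisons = len(pairs)

adjusted = {}
for a, b in pairs:
    _, p_raw = ttest_ind(samples[a], samples[b])
    # Bonferroni: multiply each raw p-value by the number of comparisons,
    # capped at 1.
    adjusted[(a, b)] = min(1.0, p_raw * n_comparisons)

for (a, b), p_adj in adjusted.items():
    print(f"{a} vs {b}: Bonferroni-adjusted p = {p_adj:.4f}")
```

Comparing each adjusted p-value with 0.05 is equivalent to comparing each raw p-value with 0.05 divided by the number of comparisons.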


## Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [3]:
import numpy as np
from scipy.stats import f_oneway

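# Generate simulated weight loss data (no fixed seed, so results vary per run).
# Note: this draws 50 observations per diet, whereas the question describes
# 50 participants in total.
diet_a = np.random.normal(loc=5, scale=2, size=50)
diet_b = np.random.normal(loc=6, scale=3, size=50)
diet_c = np.random.normal(loc=4, scale=1.5, size=50)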

# Conduct one-way ANOVA
f_stat, p_val = f_oneway(diet_a, diet_b, diet_c)

# Print results
print("F-statistic:", f_stat)
print("p-value:", p_val)

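F-statistic: 3.9902823798175473
p-value: 0.020532745579574834

For the run shown above, the p-value (about 0.021) is below 0.05, so we reject the null hypothesis that the three diets have the same mean weight loss: at least one diet differs from the others. The ANOVA does not say which one, so a post-hoc test would be the next step. Because the data are simulated without a fixed random seed, the exact values change on every run.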


## Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Completion times, 10 per program. The original listing contained 50 values
# for 30 employees; only the first 30 are used here (an assumption).
times = [23, 25, 22, 26, 28, 30, 29, 27, 24, 25,   # Program A
         30, 32, 27, 29, 31, 28, 26, 25, 29, 28,   # Program B
         32, 34, 30, 31, 29, 27, 25, 28, 30, 26]   # Program C

# create a dataframe with a balanced design: within each program the first 5
# employees are labelled novice and the last 5 experienced (an assumption, so
# that every Program x Experience cell contains data)
df = pd.DataFrame({
    'Program': ['A']*10 + ['B']*10 + ['C']*10,
    'Experience': (['Novice']*5 + ['Experienced']*5) * 3,
    'Time': times
})

# create the ANOVA model
model = ols('Time ~ C(Program) + C(Experience) + C(Program):C(Experience)', data=df).fit()

# print the ANOVA table; the F and PR(>F) columns give the F-statistic and
# p-value for each main effect and for the interaction
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

## Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [7]:
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Test scores as listed in the original cell (57 values)
scores = [78, 85, 88, 82, 79, 92, 83, 80, 81, 86, 91, 87, 84, 89, 90, 77, 93, 85, 81, 79,
          84, 88, 83, 86, 82, 80, 85, 89, 91, 87, 90, 92, 94, 81, 83, 79, 84, 88, 80, 86,
          87, 92, 81, 85, 89, 90, 93, 78, 82, 84, 87, 85, 81, 79, 83, 90, 88]

# The original cell does not say which score belongs to which group. As an
# assumption for illustration, the first half is labelled control and the
# rest experimental.
n_control = len(scores) // 2
data = pd.DataFrame({
    'group': ['control'] * n_control + ['experimental'] * (len(scores) - n_control),
    'scores': scores
})

# Conduct the two-sample t-test
control_scores = data.loc[data['group'] == 'control', 'scores']
experimental_scores = data.loc[data['group'] == 'experimental', 'scores']
t_statistic, p_value = ttest_ind(control_scores, experimental_scores)
print(f"T-statistic: {t_statistic:.2f}")
print(f"P-value: {p_value:.3f}")

# Conduct the post-hoc test using Tukey's HSD (from statsmodels). With only
# two groups this repeats the comparison made by the t-test; it becomes
# informative with three or more groups.
tukey_results = pairwise_tukeyhsd(data['scores'], data['group'])
print(tukey_results)

## Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [13]:
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Create a data frame with sales data for each store on each day
# (a small illustrative sample of 3 days rather than the 30 in the question)
data = pd.DataFrame({'Store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
                     'Day': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                     'Sales': [100, 120, 110, 80, 90, 70, 150, 140, 160]})

# Fit a repeated measures ANOVA: each day is the repeated unit ("subject")
# and Store is the within-subject factor. The original cell used ols with a
# Store x Day interaction, which leaves no residual degrees of freedom when
# there is one observation per cell, so AnovaRM is used here instead.
rm_results = AnovaRM(data, depvar='Sales', subject='Day', within=['Store']).fit()

# Print the ANOVA table
print(rm_results)

# Conduct Tukey's HSD test to compare all pairwise combinations of stores.
# Note: Tukey's HSD treats the stores as independent groups and ignores the
# pairing of observations by day.
tukey_results = pairwise_tukeyhsd(data['Sales'], data['Store'])
print(tukey_results)