## 13MAR
### Assignment

### Q1

In [None]:
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

In [None]:
Ans:- ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups. Its results are only reliable
when several assumptions are met:

=> Independence: The observations must be independent of each other, both within and across groups.

=> Normality: The data in each group (equivalently, the model residuals) should be approximately normally distributed.

=> Homogeneity of variance: The population variance should be approximately equal across groups.

Examples of violations that could impact the validity of the results are:

=> Non-independence: When the observations are not independent of each other. For example, in a study where the same participants are measured
in more than one group, their responses are likely to be correlated, which typically inflates the type I error rate of a standard ANOVA.

=> Non-normality: When the data in each group is not normally distributed. For example, strongly skewed data or extreme outliers, especially in
small samples, can distort the F-test. ANOVA is fairly robust to mild departures from normality when the groups are large and similar in size.

=> Heterogeneity of variance: When the variance of the data in each group is not equal. For example, if the variance of one group is much larger
than the variance of another group, the F-test can become too liberal or too conservative, especially when the group sizes are unequal.

When these assumptions are violated, the results of ANOVA may not be valid, and alternative methods such as Welch's ANOVA (for unequal
variances) or non-parametric tests such as the Kruskal-Wallis test may be more appropriate.
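
The normality and equal-variance assumptions can be screened before running the ANOVA. The following is a minimal sketch, assuming three small
made-up groups; the variable names and values are illustrative only. It uses the Shapiro-Wilk test (scipy.stats.shapiro) for normality and
Levene's test (scipy.stats.levene) for equality of variances.

```python
from scipy import stats

# Hypothetical measurements for three groups (illustrative values only)
group_a = [23.1, 24.8, 22.5, 25.0, 23.9, 24.2]
group_b = [26.4, 27.1, 25.8, 28.0, 26.9, 27.5]
group_c = [22.0, 30.5, 18.7, 35.2, 25.1, 28.9]  # visibly more spread out

groups = {"A": group_a, "B": group_b, "C": group_c}

# Shapiro-Wilk: a small p-value suggests the group is not normally distributed
shapiro_p = {}
for name, values in groups.items():
    w_stat, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk, group {name}: W = {w_stat:.3f}, p = {p:.3f}")

# Levene: a small p-value suggests the group variances are not equal
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")
```

With samples this small, both tests have little power, so plots of the residuals are a useful complement to the p-values.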

### Q2

In [None]:
Q2. What are the three types of ANOVA, and in what situations would each be used?

In [None]:
Ans:- The three types of ANOVA are:

=> One-way ANOVA: Used when there is only one independent variable, with two or more levels or groups, and one dependent variable. For example,
a study comparing the effect of different teaching methods on test scores in a single subject, with the teaching method as the independent
variable and test scores as the dependent variable.

=> Two-way ANOVA: Used when there are two independent variables, each with two or more levels or groups, and one dependent variable. This type 
of ANOVA is used to examine the effects of two factors on the dependent variable and whether there is an interaction effect between the two 
independent variables. For example, a study comparing the effect of two different drugs and two different dosages of each drug on blood pressure,
with drug type and dosage as the independent variables and blood pressure as the dependent variable.

=> Mixed ANOVA: Used when there are two or more independent variables, at least one of which is a within-subjects factor (repeated measures) and
one is a between-subjects factor. This type of ANOVA is used to examine the effects of multiple factors on the dependent variable, including any
interaction effects between the factors. For example, a study comparing the effects of a drug intervention and a behavioral intervention on
anxiety levels over time, with the intervention type (drug or behavioral) as the between-subjects factor and time (pre- and post-intervention) 
as the within-subjects factor.

The choice of ANOVA depends on the design of the study and the number of independent variables and their levels or groups.

### Q3

In [None]:
Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

In [None]:
Ans:- The partitioning of variance is a key concept in ANOVA (Analysis of Variance) that explains how the total variation in a dependent variable
is divided into different sources of variation. In ANOVA, the total variation in the dependent variable is partitioned into two types of 
variation:

=> Between-group variation: This is the variation in the dependent variable that is due to the differences between the groups being compared. 
For example, in a study comparing the effectiveness of three different treatments for a medical condition, the between-group variation would 
reflect the differences in outcomes between the three treatment groups.

=> Within-group variation: This is the variation in the dependent variable that is due to differences within the groups being compared. For 
example, in the same study comparing the effectiveness of three different treatments, the within-group variation would reflect the differences 
in outcomes within each treatment group.

By partitioning the total variation into these two components, ANOVA is able to determine if the differences between groups are statistically 
significant, or if they are simply due to chance.

Understanding the partitioning of variance is important because it allows researchers to identify the sources of variation in the data, and to 
determine if these sources of variation are statistically significant. This information is useful for understanding the underlying mechanisms 
that are driving the differences between groups, and for identifying potential areas for further research or intervention.
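
The partition described above can be written compactly. The notation here is an assumption made for illustration: k groups, n_i observations in
group i, N observations in total, y_ij the j-th observation in group i, \bar{y}_i the mean of group i, and \bar{y} the grand mean.

```latex
\begin{aligned}
SS_{\text{total}}   &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( y_{ij} - \bar{y} \right)^2 \\
SS_{\text{between}} &= \sum_{i=1}^{k} n_i \left( \bar{y}_i - \bar{y} \right)^2 \\
SS_{\text{within}}  &= \sum_{i=1}^{k} \sum_{j=1}^{n_i} \left( y_{ij} - \bar{y}_i \right)^2 \\
SS_{\text{total}}   &= SS_{\text{between}} + SS_{\text{within}} \\
F &= \frac{SS_{\text{between}} / (k - 1)}{SS_{\text{within}} / (N - k)}
\end{aligned}
```

The F-statistic compares the between-group variation per degree of freedom with the within-group variation per degree of freedom, which is
why a large F points to group differences that are unlikely to be due to chance alone.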

### Q4

In [None]:
Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
Ans:- In a one-way ANOVA, the total sum of squares (SST) is the sum of squares of deviations of each observation from the grand mean. The 
explained sum of squares (SSE) is the sum of squares of deviations of group means from the grand mean, weighted by the number of observations 
in each group. The residual sum of squares (SSR) is the sum of squares of deviations of each observation from its respective group mean.

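To calculate these values in Python, you can use the ols function from the statsmodels module to fit an ANOVA model and then read the relevant
sum of squares values from the attributes of the fitted model. Note that statsmodels uses its own names: model.ess is the explained sum of
squares and model.ssr is the sum of squared residuals. Here's an example: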

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# create a sample data frame with a categorical variable and a continuous variable
data = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'C', 'C'],
                     'value': [2, 3, 5, 6, 8, 9]})

# fit a one-way ANOVA model
model = ols('value ~ group', data=data).fit()

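# read the sum of squares values from the fitted model's attributes
SSE = model.ess   # explained (between-group) sum of squares
SSR = model.ssr   # residual (within-group) sum of squares
SST = SSE + SSR   # total sum of squares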


In [None]:
In this example, the data DataFrame contains a categorical variable group with three levels (A, B, and C) and a continuous variable value. The 
ols function is used to fit a one-way ANOVA model with value as the response variable and group as the predictor variable. The resulting model 
is stored in the model object. SSE is read from model.ess (the explained sum of squares), SSR is read from model.ssr (the residual sum of
squares), and SST is obtained by adding the two together.
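
As a cross-check, the same three quantities can be computed directly from their definitions without fitting a model. This is a minimal
sketch that reuses the sample data from the example above; the lowercase names sst, sse, and ssr are chosen here for illustration.

```python
import pandas as pd

# same sample data as in the statsmodels example
data = pd.DataFrame({'group': ['A', 'A', 'B', 'B', 'C', 'C'],
                     'value': [2, 3, 5, 6, 8, 9]})

grand_mean = data['value'].mean()
# mean of each observation's own group, repeated once per row
group_means = data.groupby('group')['value'].transform('mean')

sst = ((data['value'] - grand_mean) ** 2).sum()   # total sum of squares
sse = ((group_means - grand_mean) ** 2).sum()     # explained (between-group)
ssr = ((data['value'] - group_means) ** 2).sum()  # residual (within-group)

print(sst, sse, ssr)  # 37.5 36.0 1.5
```

Because the group mean is repeated for every row, summing over rows weights each group by its number of observations, and the explained and
residual parts add up to the total.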

### Q5

In [None]:
Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [None]:
Ans:- In a two-way ANOVA, we can calculate the main effects and interaction effects using Python by performing an ANOVA on the model that 
includes all of the variables of interest. The main effects represent the impact of each independent variable on the dependent variable, while
the interaction effect represents the joint impact of the independent variables on the dependent variable.

To calculate the main effects, we can use the sm.stats.anova_lm() function from the statsmodels library in Python. We can specify the formula 
for the model, which includes both independent variables and their interaction term. Then, we can use the F statistic and associated p-value to
determine if each main effect is statistically significant.

Here's an example:

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# load data into a pandas dataframe
data = pd.read_csv('data.csv')

# specify the model formula; 'a * b' expands to 'a + b + a:b'
# (both independent variables are assumed to be categorical columns)
model_formula = 'dependent_variable ~ independent_variable1 * independent_variable2'

# fit the ANOVA model
model = ols(model_formula, data).fit()
table = sm.stats.anova_lm(model, typ=2)

# extract the main effect for independent_variable1
main_effect1 = table.loc['independent_variable1', 'F']
main_effect1_p = table.loc['independent_variable1', 'PR(>F)']

# extract the main effect for independent_variable2
main_effect2 = table.loc['independent_variable2', 'F']
main_effect2_p = table.loc['independent_variable2', 'PR(>F)']

# extract the interaction effect
interaction_effect = table.loc['independent_variable1:independent_variable2', 'F']
interaction_effect_p = table.loc['independent_variable1:independent_variable2', 'PR(>F)']


In [None]:
In this example, we load the data into a pandas dataframe and specify the model formula that includes the two independent variables and their 
interaction term. We then use the ols() function to fit the model and the sm.stats.anova_lm() function to extract the ANOVA table. Finally, we 
extract the main effect and interaction effect F statistics and associated p-values from the table.

Note that we specify typ=2 to request Type II sums of squares. With this method, each main effect is tested after adjusting for the other main
effect but not for the interaction, which is a common choice when the interaction is small or not significant. Type III sums of squares (typ=3)
test each main effect in the presence of the interaction.

It's important to note that Type III results depend on the coding scheme used for the independent variables and usually call for sum-to-zero
contrasts. In this example, we used the default treatment coding together with Type II sums of squares, which do not depend on the coding.
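
To build intuition for what the ANOVA table is testing, the main effects and the interaction can be read off a table of cell means. The
following is a minimal descriptive sketch, assuming a balanced 2x2 design with made-up values; the column names drug, dose, and bp are
hypothetical. It describes the effects but does not test them.

```python
import pandas as pd

# Hypothetical balanced 2x2 design: two drugs, two doses, two readings per cell
df = pd.DataFrame({
    "drug": ["X", "X", "X", "X", "Y", "Y", "Y", "Y"],
    "dose": ["low", "low", "high", "high", "low", "low", "high", "high"],
    "bp":   [10, 12, 14, 16, 11, 13, 21, 23],
})

# mean response in each drug/dose cell
cell_means = df.pivot_table(index="drug", columns="dose", values="bp", aggfunc="mean")

# main effects: compare the marginal means of each factor
drug_means = cell_means.mean(axis=1)
dose_means = cell_means.mean(axis=0)

# interaction: does the dose effect differ between the two drugs?
dose_effect_x = cell_means.loc["X", "high"] - cell_means.loc["X", "low"]
dose_effect_y = cell_means.loc["Y", "high"] - cell_means.loc["Y", "low"]
interaction_contrast = dose_effect_y - dose_effect_x

print(cell_means)
print(drug_means)
print(dose_means)
print(interaction_contrast)
```

A non-zero interaction contrast means the effect of dose is not the same for both drugs; the ANOVA table then tells us whether that
difference is larger than would be expected from the within-cell variation.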

### Q6

In [None]:
Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

In [None]:
Ans:- A one-way ANOVA tests whether the means of the groups are all equal. The F-statistic of 5.23 is the ratio of between-group variability to
within-group variability, and the p-value of 0.02 is below the conventional significance level of 0.05, so we reject the null hypothesis that
all group means are equal. Specifically, the p-value means that, if there were no true difference among the group means, an F-statistic at
least as large as 5.23 would be observed about 2% of the time.

To interpret the results, one can say that at least one group mean differs significantly from the others. The ANOVA alone does not show which
groups differ or how large the differences are, so additional tests, such as post-hoc tests, are needed to determine which specific groups are
significantly different from each other.

### Q7

In [None]:
Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

In [None]:
Ans:- In a repeated measures ANOVA, missing data can occur due to various reasons such as participant dropout or technical issues during data 
collection. Handling missing data is essential to obtain accurate results and to avoid biased conclusions. One common approach to handle missing
data is to impute the missing values using methods such as mean imputation, regression imputation, or multiple imputation.

However, different methods of handling missing data can have varying consequences on the results. Mean imputation replaces every missing value
with the same number, which shrinks the variance of the data and can lead to underestimating the standard error and overestimating the
statistical significance of the results. Regression imputation can be affected by model misspecification and also understates the variability
of the imputed values, which can again overstate significance. Multiple imputation is considered to be one of the best methods to handle missing
data because it carries the uncertainty about the missing values into the results, but it can be computationally intensive.

Another approach is to exclude the participants with missing data from the analysis (listwise deletion). If the data is missing completely at
random (MCAR), this does not bias the results, but it reduces the sample size and therefore the statistical power. If the data is missing at
random (MAR) or missing not at random (MNAR), excluding the participants with missing data can lead to biased results.

Therefore, it is important to carefully consider the methods for handling missing data in a repeated measures ANOVA and to perform sensitivity 
analyses to evaluate the robustness of the results to different methods.
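
The effect of two of these choices can be seen on a small table. This is a minimal sketch with made-up scores for five subjects measured at
three time points; the column names t1, t2, and t3 are hypothetical. It compares listwise deletion with mean imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures in wide format, with some values missing
wide = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5],
    "t1": [10.0, 12.0, 11.0, 14.0, 13.0],
    "t2": [11.0, np.nan, 12.0, 15.0, np.nan],
    "t3": [13.0, 14.0, np.nan, 17.0, 15.0],
}).set_index("subject")

# listwise deletion: keep only subjects with no missing values
complete_cases = wide.dropna()

# mean imputation: fill each gap with that time point's observed mean
mean_imputed = wide.fillna(wide.mean())

print(len(wide), "subjects originally,", len(complete_cases), "after listwise deletion")
print("std of t2, observed values only:", wide["t2"].std())
print("std of t2, after mean imputation:", mean_imputed["t2"].std())
```

Listwise deletion keeps only two of the five subjects here, while mean imputation keeps all five but makes the t2 scores look less variable
than the observed values suggest.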

### Q8

In [None]:
Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

In [None]:
Ans:- After conducting an ANOVA, if we found that at least one group mean is different from the others, we may conduct post-hoc tests to 
determine which specific group means are different. Some common post-hoc tests include Tukey's Honestly Significant Difference (HSD), Bonferroni 
correction, Scheffe's method, and Dunnett's test.

Tukey's HSD is commonly used when all pairwise comparisons between groups are of interest. It controls the family-wise error rate across all
pairwise comparisons.

Bonferroni correction is used to control the family-wise error rate, which is the probability of making at least one type I error across all 
comparisons. It divides the significance level by the number of comparisons, so it is simple and general but conservative: it is less likely
to make a type I error, but it may also result in lower power and increased type II error rates, especially when there are many comparisons.

Scheffe's method also controls the family-wise error rate, and it does so for all possible contrasts, not only pairwise ones. It is used when
complex comparisons are of interest (for example, one group against the average of two others). For purely pairwise comparisons it is more
conservative, and therefore less powerful, than Tukey's HSD.

Dunnett's test is used when we have one control group and want to compare all other groups to the control group.

An example of a situation where a post-hoc test might be necessary is in a medical study comparing the effectiveness of three different
medications on reducing blood pressure. If we found a significant difference between the groups in the ANOVA, we could conduct a post-hoc test 
to determine which specific medications were significantly different. This information would help guide physicians in selecting the best 
medication for their patients.
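
One way to carry out such a follow-up is with pairwise t-tests and a Bonferroni correction. The following is a minimal sketch, assuming
made-up blood pressure reductions for three medications; the names med_a, med_b, and med_c and all values are hypothetical.

```python
from itertools import combinations
from scipy import stats

# Hypothetical blood pressure reductions (mmHg) for three medications
reductions = {
    "med_a": [10, 12, 11, 13, 12],
    "med_b": [15, 17, 16, 18, 16],
    "med_c": [11, 12, 12, 13, 11],
}

pairs = list(combinations(reductions, 2))
adjusted = {}
for g1, g2 in pairs:
    t_stat, p_raw = stats.ttest_ind(reductions[g1], reductions[g2])
    # Bonferroni: multiply each p-value by the number of comparisons, cap at 1
    p_adj = min(1.0, p_raw * len(pairs))
    adjusted[(g1, g2)] = p_adj
    print(f"{g1} vs {g2}: t = {t_stat:.2f}, adjusted p = {p_adj:.4f}")
```

With these values, med_b differs from both other medications after correction, while med_a and med_c do not differ from each other.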

### Q9

In [None]:
Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [None]:
Ans:- To conduct a one-way ANOVA using Python to compare the mean weight loss of three diets, we can use the f_oneway function from the 
scipy.stats module. Here's an example:

In [5]:
import numpy as np
from scipy.stats import f_oneway

# generate sample data
np.random.seed(123)
diet_a = np.random.normal(loc=5, scale=1, size=50)
diet_b = np.random.normal(loc=7, scale=1, size=50)
diet_c = np.random.normal(loc=6, scale=1, size=50)

# conduct ANOVA
f_stat, p_val = f_oneway(diet_a, diet_b, diet_c)

# print results
print("F-statistic:", f_stat)
print("p-value:", p_val)


F-statistic: 42.55335848035801
p-value: 2.6289208585015248e-15


In [None]:
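In this example, we generated sample data for each of the three diets using the np.random.normal function, with mean weight losses of 5, 7, 
and 6 pounds for diets A, B, and C, respectively. Note that the code simulates 50 participants per diet (150 in total) for illustration,
whereas the question describes 50 participants in total. We then conducted the ANOVA using the f_oneway function and printed the results.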

Assuming a significance level of 0.05, we can interpret the results as follows:

Since the p-value is less than 0.05, we reject the null hypothesis that the mean weight loss of the three diets is equal. This indicates that 
there is at least one significant difference between the diets. The F-statistic of the ANOVA is a measure of the ratio of variance between groups
to variance within groups, with larger values indicating greater differences between groups relative to differences within groups. In this case,
the F-statistic is relatively large, further supporting the conclusion that there are significant differences between the mean weight losses of
the three diets.
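
To match the question more closely, the same analysis can be run with 50 participants in total. This is a minimal sketch on simulated data;
the 17/17/16 split, the means, and the random seed are assumptions made for illustration, not real study data.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# 50 participants in total, split 17 / 17 / 16 across the diets (assumed split)
diet_a = rng.normal(loc=5, scale=1, size=17)
diet_b = rng.normal(loc=7, scale=1, size=17)
diet_c = rng.normal(loc=6, scale=1, size=16)

f_stat, p_val = f_oneway(diet_a, diet_b, diet_c)

# degrees of freedom to report alongside the F-statistic
n_total = len(diet_a) + len(diet_b) + len(diet_c)
df_between = 3 - 1
df_within = n_total - 3

print(f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_val:.4g}")
```

Reporting the degrees of freedom together with the F-statistic and p-value makes the result easier to check and compare.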

### Q10

In [None]:
Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [None]:
Ans:- To conduct a two-way ANOVA in Python, we can use the statsmodels package. Here's an example code to perform the analysis:

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# load the data into a DataFrame
data = pd.read_csv('task_times.csv')

# define the formula for the ANOVA
formula = 'time ~ software + experience + software:experience'

# fit the ANOVA model
model = ols(formula, data).fit()

# perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# print the results
print(anova_table)


In [None]:
Assuming that the data is stored in a CSV file called task_times.csv with the columns time, software, and experience, the output of the code is
an ANOVA table with the following columns: sum_sq, df, F, and PR(>F). The sum_sq column represents the sum of squares for each effect, the df
column represents the degrees of freedom, the F column represents the F-statistics, and the PR(>F) column represents the p-values. The table
has one row for each main effect (software and experience), one row for the interaction (software:experience), and one row for the residual.

No output is shown here because the data file is not included, so the interpretation depends on the values that the table reports. A main
effect with a p-value less than 0.05 indicates that the mean task completion time differs across the levels of that factor. If the p-value for
the interaction is greater than 0.05, there is no evidence that the effect of software program on task completion time differs between novice
and experienced employees. If it is less than 0.05, the effect of the software program depends on experience level, and the main effects should
be interpreted with caution.

### Q11

In [None]:
Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [None]:
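Ans:- To conduct a two-sample t-test and post-hoc test using Python, we can use ttest_ind from scipy.stats and pairwise_tukeyhsd from
statsmodels. The example uses 50 illustrative scores (25 per group) rather than the 100 students described in the question: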

In [7]:
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# create a dataframe with test scores and group assignment
df = pd.DataFrame({'score': [84, 78, 80, 76, 91, 83, 89, 73, 79, 82,
                             72, 81, 77, 86, 90, 79, 75, 88, 85, 87,
                             76, 75, 81, 83, 78, 80, 84, 77, 79, 82,
                             89, 86, 92, 80, 87, 81, 78, 75, 82, 85,
                             81, 77, 90, 83, 79, 82, 88, 86, 84, 80],
                   'group': ['control']*25 + ['experimental']*25})

# conduct two-sample t-test
control_scores = df[df['group'] == 'control']['score']
experimental_scores = df[df['group'] == 'experimental']['score']
t_statistic, p_value = ttest_ind(control_scores, experimental_scores, equal_var=False)

# print results
print('Two-sample t-test results:')
print(f't-statistic: {t_statistic:.2f}')
print(f'p-value: {p_value:.4f}')
if p_value < 0.05:
    print('There is a significant difference in test scores between the control and experimental groups.')
else:
    print('There is not a significant difference in test scores between the control and experimental groups.')

# conduct post-hoc test
tukey_results = pairwise_tukeyhsd(df['score'], df['group'])

# print post-hoc test results
print('Post-hoc test results:')
print(tukey_results.summary())


Two-sample t-test results:
t-statistic: -1.12
p-value: 0.2673
There is not a significant difference in test scores between the control and experimental groups.
Post-hoc test results:
   Multiple Comparison of Means - Tukey HSD, FWER=0.05    
 group1    group2    meandiff p-adj   lower  upper  reject
----------------------------------------------------------
control experimental     1.56 0.2671 -1.2334 4.3534  False
----------------------------------------------------------


In [None]:
The two-sample t-test results show that there is no significant difference in test scores between the control and experimental groups 
(t = -1.12, p = 0.2673, which is greater than 0.05). The test was run with equal_var=False, so it is Welch's t-test, which does not assume
equal variances.

Because the t-test is not significant, and because there are only two groups, a post-hoc test is not needed here; Tukey's HSD is shown only for
completeness. Its results agree with the t-test: the mean test score for the experimental group (M = 82.68) is 1.56 points higher than the mean
test score for the control group (M = 81.12), but the adjusted p-value is 0.2671 and the 95% confidence interval (-1.23 to 4.35) includes zero,
so the difference is not statistically significant.
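
With only two groups, a follow-up comparison carries the same information as the original test. This is a minimal sketch with made-up
scores showing that a pooled two-sample t-test and a one-way ANOVA on two groups are the same test, with F equal to t squared.

```python
from scipy.stats import f_oneway, ttest_ind

# Hypothetical scores for two groups (illustrative values only)
control = [72, 75, 78, 80, 81, 83, 85]
experimental = [76, 79, 82, 84, 86, 88, 90]

# pooled-variance t-test (equal_var=True) and one-way ANOVA on the same data
t_stat, t_p = ttest_ind(control, experimental, equal_var=True)
f_stat, f_p = f_oneway(control, experimental)

print(f"t^2 = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
print(f"t-test p = {t_p:.4f}, ANOVA p = {f_p:.4f}")
```

Post-hoc tests become informative only when there are three or more groups, because only then does a significant overall test leave open
the question of which groups differ.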

### Q12

In [None]:
Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [None]:
Ans:- Since this is a repeated measures design, we need to have sales data for each store for all 30 days. Here each day plays the role of the
"subject" and store is the within-subject factor. Assuming this data is available, we can conduct a repeated measures ANOVA using Python with
the following steps:

=> Load the necessary libraries and read in the data.
=> Reshape the data from wide to long format so that each row represents a single observation (i.e., one store on one day) and includes columns for Day, Store, and Sales.
=> Use the statsmodels library to conduct the repeated measures ANOVA.
=> If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

Here's some example code to accomplish this:

In [None]:
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# read in the data
sales_data = pd.read_csv("sales_data.csv")

# reshape the data
sales_data = pd.melt(sales_data, id_vars=['Day'], var_name='Store', value_name='Sales')

# conduct the repeated measures ANOVA
aovrm = AnovaRM(sales_data, 'Sales', 'Day', within=['Store'])
res = aovrm.fit()
print(res.summary())

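# conduct post-hoc tests if results are significant
# (the fitted result exposes the ANOVA table as a DataFrame indexed by factor name)
store_p_value = res.anova_table.loc['Store', 'Pr > F']
if store_p_value < 0.05:
    # note: Tukey's HSD treats the stores as independent samples
    posthoc = pairwise_tukeyhsd(sales_data['Sales'], sales_data['Store'])
    print(posthoc.summary())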


In [None]:
Assuming the data is in a CSV file called "sales_data.csv" with a Day column and one column per store, this code will load the data, reshape
it, and conduct the repeated measures ANOVA using the AnovaRM class from the statsmodels library. If the results are significant (i.e., the
p-value for the Store factor is less than 0.05), the code will then conduct a post-hoc test using the Tukey HSD method to determine which
store(s) differ significantly from each other. The results will be printed to the console. Note that Tukey's HSD treats the three stores as
independent samples, so paired comparisons with a correction for multiple testing are often a better match for repeated measures data.
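
A follow-up that respects the pairing by day can be sketched with paired t-tests and a Bonferroni correction. This is a minimal sketch on
simulated data; the store means, the shared day-to-day effect, and the random seed are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)
n_days = 30

# shared day-to-day swings make the three stores' sales correlated
day_effect = rng.normal(0, 50, size=n_days)
wide = pd.DataFrame({
    "Store A": 1000 + day_effect + rng.normal(0, 20, size=n_days),
    "Store B": 1050 + day_effect + rng.normal(0, 20, size=n_days),
    "Store C": 1000 + day_effect + rng.normal(0, 20, size=n_days),
})

pairs = list(combinations(wide.columns, 2))
results = {}
for s1, s2 in pairs:
    # paired t-test: compares the two stores on the same days
    t_stat, p_raw = ttest_rel(wide[s1], wide[s2])
    # Bonferroni: multiply by the number of comparisons, cap at 1
    results[(s1, s2)] = min(1.0, p_raw * len(pairs))
    print(f"{s1} vs {s2}: t = {t_stat:.2f}, adjusted p = {results[(s1, s2)]:.4f}")
```

Because each comparison uses the day-by-day differences, variation that is common to all stores on a given day cancels out, which is the
main advantage of the repeated measures design.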