### 1 . Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results

##### Assumptions in ANOVA :
1. Normality : the observations within each group (equivalently, the residuals) are approximately normally distributed, so the sampling distribution of each group mean is approximately normal.
2. Absence of outliers : extreme scores can distort group means and variances, so they need to be identified and, where justified, removed from the dataset.
3. Homogeneity of variance : the population variance should be equal across the levels of each independent variable (factor).
4. Independence : observations are independent of one another and are randomly sampled.

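##### Examples of Violations :
1. If the data are heavily skewed or contain outliers, the normality assumption is violated and the F-test may give inaccurate results, especially with small samples.
2. If observations depend on one another (for example, repeated measurements on the same subject treated as independent), this amounts to pseudoreplication and inflates the Type I error rate.
3. Non-random sampling may lead to biased estimates, limiting the generalizability of the results.
4. Using an ordinal or nominal dependent variable with ANOVA may lead to inappropriate conclusions, because group means are not meaningful for such data.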
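
The normality and equal-variance assumptions can be checked before running the test. The following is a minimal sketch of one possible approach, not a prescribed procedure: it applies `scipy.stats.shapiro` to the within-group residuals and `scipy.stats.levene` to the groups. The three groups are hypothetical data generated only for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical data for three groups (illustration only)
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=20) for mu in (10, 12, 11)]

# Residuals: each observation minus its own group mean
residuals = np.concatenate([g - g.mean() for g in groups])

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal
shapiro_stat, shapiro_p = stats.shapiro(residuals)

# Levene's test: a small p-value suggests the group variances differ
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-value :", shapiro_p)
print("Levene p-value :", levene_p)
```

A p-value above the chosen significance level in either test means only that no violation was detected, not that the assumption is guaranteed to hold.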

### 2 . What are the three types of ANOVA ? In what situation would each be used ?

##### Types of ANOVA : 
1. One-way ANOVA
2. Two-way ANOVA
3. Repeated Measures ANOVA

##### Situations when each type is used :
- One-way ANOVA : Used when there is one independent variable (factor) with two or more levels (groups).

- Two-way ANOVA : Used when there are two independent variables, each with two or more levels, and we want to examine the main effect of each variable as well as their interaction.

- Repeated Measures ANOVA : Used when the same subjects are measured under every treatment (repeated measurements), often in a longitudinal or within-subject design.

### 3 . What is partitioning in ANOVA, and why is it important to understand this concept ?

##### Partitioning in ANOVA :
- Partitioning in ANOVA refers to the decomposition of the total variance observed in a dataset into components attributed to different sources.

- Understanding this concept is important because it helps researchers identify and quantify the sources of variability in the data, which in turn allows them to assess the significance of these sources and draw meaningful conclusions about the factors influencing the dependent variable.

##### Understanding the partitioning of variance is important for several reasons:

1. Identifying significant effects : researchers can determine whether there are significant differences among the group means. This helps identify the effect of the independent variable.

2. Interpreting F-statistics : the F-statistic from ANOVA is a ratio of variances (between-group variability over within-group variability). Knowing how the variance is partitioned helps in interpreting the F-statistic correctly and making informed decisions about the significance of the overall model or of individual factors.

3. Designing experiments : knowledge of the partitioning of variance is essential for designing experiments. Researchers can allocate resources efficiently by focusing on the factors that contribute significantly to the overall variance.
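
For a one-way design, the partition and the resulting F-statistic can be written out as below. The notation is standard but is an assumption here, since it is not defined in the text above: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, $\bar{x}_j$ the mean of group $j$, and $\bar{x}$ the overall mean.

```latex
\begin{aligned}
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}\right)^2}_{\text{total}}
&= \underbrace{\sum_{j=1}^{k} n_j\left(\bar{x}_j-\bar{x}\right)^2}_{\text{between groups}}
 + \underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(x_{ij}-\bar{x}_j\right)^2}_{\text{within groups}} \\[4pt]
F &= \frac{\text{between-group sum of squares}/(k-1)}{\text{within-group sum of squares}/(N-k)}
\end{aligned}
```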


### 4 . How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

##### Total variance (total sum of squares, SST):
- This represents the overall variability in the dependent variable, calculated as the sum of squared differences between each individual data point and the overall mean.

In [1]:
## To calculate SST 
## importing required modules
import numpy as np
from scipy import stats

## example data
group1 = [4,5,6,3,5]
group2 = [8,7,6,9,10]
group3 = [12,10,11,13,11]

## Combine the data into a single array
data = np.concatenate([group1,group2,group3])

## overall mean
overall_mean = np.mean(data)

## calculating SST
sst = np.sum((data - overall_mean)**2)

## printing the result
print(sst)

136.0


##### Within-group variance (within-group sum of squares, SSW; this is the residual sum of squares, called SSR in the question):
- This component represents the variability within each group, calculated as the sum of squared differences between individual data points and their respective group means.

In [6]:
## calculating SSW
ssw_g1 = np.sum((group1 - np.mean(group1))**2)
ssw_g2 = np.sum((group2 - np.mean(group2))**2)
ssw_g3 = np.sum((group3 - np.mean(group3))**2)

ssw = ssw_g1+ssw_g2+ssw_g3

## printing value
print(ssw)

20.4


##### Between-group variance (between-group sum of squares, SSB; this is the explained sum of squares, called SSE in the question):
- This component represents the variability in the dependent variable that can be attributed to differences between the group means. It is calculated as the sum of squared differences between each group mean and the overall mean, with each squared difference weighted by the number of observations in that group.

In [5]:
## to calculate SSB
## calculating group means and group sizes
group_means = np.array([np.mean(group1), np.mean(group2), np.mean(group3)])
group_sizes = np.array([len(group1), len(group2), len(group3)])

## calculate SSB: each group's squared deviation from the overall mean is
## weighted by that group's own size (summing all three deviations once per
## group would count SSB three times)
ssb = np.sum(group_sizes * (group_means - overall_mean)**2)

## printing result (rounded to suppress floating-point noise)
print(round(ssb, 2))

115.6
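
A quick consistency check on the three quantities: by the partition of variance, the total sum of squares should equal the between-group plus the within-group sum of squares. The sketch below recomputes all three from the same example data and also derives the F-statistic by hand, comparing it with `scipy.stats.f_oneway`. The variable names `f_manual` and `f_scipy` are introduced here only for illustration.

```python
import numpy as np
from scipy import stats

# Same example data as above
group1 = np.array([4, 5, 6, 3, 5])
group2 = np.array([8, 7, 6, 9, 10])
group3 = np.array([12, 10, 11, 13, 11])
groups = [group1, group2, group3]

data = np.concatenate(groups)
overall_mean = data.mean()

sst = np.sum((data - overall_mean) ** 2)
ssw = sum(np.sum((g - g.mean()) ** 2) for g in groups)
ssb = sum(len(g) * (g.mean() - overall_mean) ** 2 for g in groups)

# The partition: SST should equal SSB + SSW up to floating-point error
print("SST == SSB + SSW :", np.isclose(sst, ssb + ssw))

# F = (SSB / (k - 1)) / (SSW / (N - k))
k, n_total = len(groups), len(data)
f_manual = (ssb / (k - 1)) / (ssw / (n_total - k))
f_scipy, p_scipy = stats.f_oneway(*groups)
print("F (manual) :", round(f_manual, 4))
print("F (scipy)  :", round(f_scipy, 4))
```

If the partition check fails, one of the three sums of squares has been computed incorrectly.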


### 5 . In a two-way ANOVA, how would you calculate main effects and interaction effects using Python ?

In [15]:
## we need a statistical model to run a two-way ANOVA
## importing required modules
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

## Example data for two independent factors with two levels each
## (no random seed is set, so the values printed below change on every run)
data = pd.DataFrame({
    'A':np.repeat(['A1','A2'],20),
    'B':np.tile(['B1','B2'],20),
    'Response':np.random.rand(40)
})

## model
## this formula specifies the main effects of A and B and their interaction A:B
model = ols('Response ~ A + B + A:B', data=data).fit() 

## perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

## Mean squares (sum of squares / degrees of freedom) for each main effect and
## the interaction; the significance tests of the effects themselves are the
## F and PR(>F) columns of anova_table
mean_sq_A = anova_table['sum_sq']['A'] / anova_table['df']['A']
mean_sq_B = anova_table['sum_sq']['B'] / anova_table['df']['B']
mean_sq_AB = anova_table['sum_sq']['A:B'] / anova_table['df']['A:B']

## printing the results
print("Mean square A :", mean_sq_A)
print("Mean square B :", mean_sq_B)
print("Mean square A:B :", mean_sq_AB)

Mean square A : 0.10856569305704605
Mean square B : 0.09282694317026069
Mean square A:B : 0.06010079135128697


### 6 . Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

- The F-statistic compares the variance between group means to the variance within groups. A higher F-statistic suggests larger differences between group means relative to the variability within each group.

- The associated p-value gives the probability of observing an F-statistic at least this large if all group means were equal, and so determines the statistical significance of the observed F-statistic.

##### Interpretation :
- With a p-value of 0.02, which is less than the commonly used significance level of 0.05, there is evidence to reject the null hypothesis that all group means are equal.
- We can conclude that at least one group mean differs significantly from the others. The F-test alone does not show which groups differ; a post-hoc test is needed for that.
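
The link between an F-statistic and its p-value can be sketched with `scipy.stats.f.sf`, which returns the right-tail probability of the F distribution. The question does not state the degrees of freedom, so the values used below (3 groups and 30 observations) are purely hypothetical, and the resulting p-value will not necessarily match the 0.02 quoted in the question.

```python
from scipy import stats

f_stat = 5.23
df_between = 2   # hypothetical: k - 1 with k = 3 groups
df_within = 27   # hypothetical: N - k with N = 30 observations

# Probability of an F value at least this large under the null hypothesis
p_value = stats.f.sf(f_stat, df_between, df_within)
print("p-value :", round(p_value, 4))

# Decision rule at the 0.05 significance level
alpha = 0.05
reject_null = p_value < alpha
print("Reject H0 at alpha = 0.05 :", reject_null)
```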

### 7 . In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data ?

##### The approach chosen to handle missing data can affect the validity and reliability of the analysis. Common methods for handling missing data are :

###### 1. Complete case analysis : Exclude every participant who has a missing measurement.
- Consequence : Reduces the sample size, leading to a loss of statistical power.

###### 2. Mean imputation : Replace missing values with the mean of the observed values for that variable.
- Consequence : Preserves the sample size but underestimates variability, because every imputed value sits exactly at the mean.

###### 3. Last observation carried forward (LOCF) or next observation carried backward (NOCB) : Replace missing values with the last (or next) observed value for that participant.
- Consequence : Assumes that a participant's value stays constant across the missing time point, which can distort estimates of change over time.

###### 4. Interpolation or linear regression imputation : Use statistical techniques to estimate missing values from the observed data.
- Consequence : Usually better than mean imputation, but it relies on assumptions about the underlying data that may not hold, and it is sensitive to outliers and non-linear patterns.

###### 5. Multiple imputation : Generate several plausible imputations for each missing value, analyze each imputed dataset separately, then combine the results.
- Consequence : Requires more computation and assumes the data are missing completely at random, or missing at random conditional on the observed variables.
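
The first three methods can be illustrated on a small wide-format table with one row per participant and one column per time point. This is a minimal sketch using pandas; the data frame `wide` and its values are hypothetical, and methods 4 and 5 are not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures: 4 subjects measured at 3 time points
wide = pd.DataFrame(
    {
        "t1": [5.0, 6.0, 7.0, 8.0],
        "t2": [5.5, np.nan, 7.5, 8.5],
        "t3": [6.0, 6.5, np.nan, 9.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# 1. Complete case analysis: drop every subject with any missing value
complete_cases = wide.dropna()

# 2. Mean imputation: fill each gap with the mean of that time point
mean_imputed = wide.fillna(wide.mean())

# 3. LOCF: carry each subject's last observed value forward in time
locf = wide.ffill(axis=1)

print("Subjects kept by complete case analysis :", len(complete_cases))
print("Variance of t2, observed values only :", round(wide["t2"].var(), 4))
print("Variance of t2, after mean imputation :", round(mean_imputed["t2"].var(), 4))
print("LOCF value for s3 at t3 :", locf.loc["s3", "t3"])
```

Comparing the two variances of `t2` shows the shrinkage that mean imputation introduces, while the row count shows how much data complete case analysis discards.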

### 8 . What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary

##### Common post-hoc tests used after ANOVA :
###### 1. Tukey's Honestly Significant Difference (HSD) : 
- When to use : Suitable when you want to compare all pairs of group means while controlling the family-wise error rate, ideally with a balanced design (equal sample sizes). It is appropriate for three or more groups.

###### 2. Bonferroni Correction :
- When to use : It is a conservative correction that is useful when conducting multiple pairwise comparisons. It controls the family-wise error rate by dividing the significance level by the number of comparisons.

###### 3. Sidak correction : 
- When to use : Similar to Bonferroni, Sidak is another conservative correction method for multiple comparisons. It is slightly less conservative than Bonferroni and is appropriate when you have a moderate number of comparisons.

###### 4. Dunnett's test :
- When to use : Dunnett's test is suitable when you have a control group and want to compare each treatment group to the control. It protects against the inflated Type I error rate associated with multiple comparisons.

###### 5. Holm's method :
- When to use : It is a step-down procedure that controls the family-wise error rate. It is less conservative than Bonferroni and is suitable for a variety of post-hoc situations.

##### Example situation where a post-hoc test is necessary :
- A teacher tries three different teaching methods to see whether they affect students' exam scores, and a one-way ANOVA finds a significant effect. The ANOVA shows only that the methods differ somewhere, not which specific pairs of teaching methods lead to significantly different scores. This is where a post-hoc test comes in: run after the ANOVA, it compares Method A, Method B, and Method C with each other to pinpoint exactly where the differences lie.
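
The relative conservativeness of the Bonferroni, Sidak, and Holm corrections can be seen by applying all three to the same set of pairwise p-values. The sketch below does this for the teaching-method scenario with `statsmodels.stats.multitest.multipletests`. The exam scores are hypothetical values generated for illustration, and plain two-sample t-tests are used for the pairwise comparisons as a simplifying assumption.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Hypothetical exam scores for three teaching methods
rng = np.random.default_rng(2)
scores = {
    "Method A": rng.normal(70, 8, 25),
    "Method B": rng.normal(75, 8, 25),
    "Method C": rng.normal(78, 8, 25),
}

# Raw p-values from all pairwise two-sample t-tests
pairs = list(combinations(scores, 2))
raw_p = [stats.ttest_ind(scores[a], scores[b]).pvalue for a, b in pairs]

# Adjusted p-values under each correction (second element of the result)
adjusted = {
    method: multipletests(raw_p, alpha=0.05, method=method)[1]
    for method in ("bonferroni", "sidak", "holm")
}

for i, (a, b) in enumerate(pairs):
    print(
        f"{a} vs {b} | raw: {raw_p[i]:.4f}"
        f" | bonferroni: {adjusted['bonferroni'][i]:.4f}"
        f" | sidak: {adjusted['sidak'][i]:.4f}"
        f" | holm: {adjusted['holm'][i]:.4f}"
    )
```

For every comparison, the Sidak and Holm adjusted p-values are no larger than the Bonferroni one, which is what "less conservative" means in practice.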

### 9 . A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [6]:
## importing required modules 
import numpy as np
import scipy.stats as stats

## Generating sample weight-loss data
## (assumption: 50 participants per diet, 150 in total)
np.random.seed(3) 
diet_A = np.random.normal(loc=2, scale=1, size=50)
diet_B = np.random.normal(loc=3, scale=1, size=50)
diet_C = np.random.normal(loc=1.5, scale=1, size=50)

## perform one-way ANOVA (f_oneway takes each group as a separate argument)
f_stat,p_value = stats.f_oneway(diet_A,diet_B,diet_C)

## printing results
print("One-Way ANOVA results:")
print("F-statistic :", f_stat)
print("p-value is : ",p_value)

## interpretation
if p_value < 0.05:
    print("The one-way ANOVA indicates a significant difference in mean weight loss")
else:
    print("There is not enough evidence from one-way ANOVA to reject null hypothesis ")

One-Way ANOVA results:
F-statistic : 37.73466963664033
p-value is :  5.935632151744614e-14
The one-way ANOVA indicates a significant difference in mean weight loss
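
The ANOVA shows that the diets differ but not which ones. As a possible follow-up that the question does not ask for, the sketch below runs Tukey's HSD with `pairwise_tukeyhsd` from statsmodels on data regenerated with the same seed and parameters as above.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Regenerate the same simulated weight-loss data (same seed and order)
np.random.seed(3)
diet_A = np.random.normal(loc=2, scale=1, size=50)
diet_B = np.random.normal(loc=3, scale=1, size=50)
diet_C = np.random.normal(loc=1.5, scale=1, size=50)

# Long format: one value per participant plus a matching group label
weight_loss = np.concatenate([diet_A, diet_B, diet_C])
diet = np.repeat(["A", "B", "C"], 50)

# All pairwise comparisons with the family-wise error rate held at 0.05
tukey = pairwise_tukeyhsd(endog=weight_loss, groups=diet, alpha=0.05)
print(tukey)
```

Each row of the printed table is one pair of diets; the `reject` column indicates whether that pair's mean difference is significant at the chosen level.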


### 10 . A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.


In [54]:
## importing required modules 
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

## Generate hypothetical data
np.random.seed(4)
n = 30 ## number of employees per software program (assumption: 90 employees in total)

## software program: A,B,C
software = np.repeat(['A','B','C'],n)

## employee experience level: novice/experienced
experience = np.tile(['Novice','Experienced'], 45 ) ## 45 repeats of the pair give 90 rows, matching the software column

## randomly generated task completion time 
task_time = np.random.normal(loc = 10, scale = 2, size = n * 3)

## creating dataframe 
df = pd.DataFrame({'Software':software, 'Experience': experience, 'Tasktime' : task_time})

## fit a two-way ANOVA model
model = ols('Tasktime ~ Software * Experience', data=df).fit()

## perform ANOVA 
anova_table = sm.stats.anova_lm(model,typ=2)

## mean squares (sum of squares / degrees of freedom) for the main effects and the interaction;
## the F-statistics and p-values are in the F and PR(>F) columns of anova_table
mean_sq_software = anova_table['sum_sq']['Software'] / anova_table['df']['Software']
mean_sq_experience = anova_table['sum_sq']['Experience'] / anova_table['df']['Experience']
mean_sq_interaction = anova_table['sum_sq']['Software:Experience'] / anova_table['df']['Software:Experience']

## print ANOVA results
print("Two-way ANOVA results!")
print("Mean square for software : ",mean_sq_software)
print("Mean square for experience : ", mean_sq_experience)
print("Mean square for interaction : ",mean_sq_interaction)
print(anova_table)
print("""Interpretation:\n All the p-values are above the significance level of 0.05, 
hence there are no significant main effects or interaction effects in the data""")

Two-way ANOVA results!
Mean square for software :  0.4009398360337057
Mean square for experience :  0.1355715026777578
Mean square for interaction :  6.135878517014176
                         sum_sq    df         F    PR(>F)
Software               0.801880   2.0  0.094920  0.909543
Experience             0.135572   1.0  0.032096  0.858249
Software:Experience   12.271757   2.0  1.452632  0.239770
Residual             354.813810  84.0       NaN       NaN
Interpretation:
 All the p-values are above the significance level of 0.05, 
hence there are no significant main effects or interaction effects in the data


### 11 . An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either a control group or an experimental group and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which groups differ significantly from each other.

In [68]:
## importing required modules
import numpy as np
from scipy.stats import ttest_ind

## setting seed 
np.random.seed(5)

## generating data for the test scores
control_group = np.random.normal(loc=70,scale=10,size=50)
experimental_group = np.random.normal(loc=75,scale=10,size=50)

## perform two sample t-test
t_statistic,p_value = ttest_ind(control_group,experimental_group)

## printing results
print("Two sample T-test Results :")
print("t-statistic is : ",t_statistic)
print("P-Value is : ",p_value)
print("""\nInterpretation : \n Since the p-value is less than the significance level of 0.05, we 
conclude that mean test scores differ significantly between the control and experimental groups.\n 
Because a two-sample t-test compares only two groups, no post-hoc test is needed: 
the significant result already identifies the one pair that differs. Post-hoc tests are 
used after an ANOVA with three or more groups.""")

Two sample T-test Results :
t-statistic is :  -2.606383481758298
P-Value is :  0.010577861462870278

Interpretation : 
 Since the p-value is less than the significance level of 0.05, we 
conclude that mean test scores differ significantly between the control and experimental groups.
 
Because a two-sample t-test compares only two groups, no post-hoc test is needed: 
the significant result already identifies the one pair that differs. Post-hoc tests are 
used after an ANOVA with three or more groups.
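
Two optional checks can accompany the t-test; neither is required by the question. The sketch below reruns the comparison as Welch's t-test (`equal_var=False`), which does not assume equal variances, and computes Cohen's d as a measure of effect size. The pooled standard deviation formula used here assumes equal group sizes, as in this example.

```python
import numpy as np
from scipy.stats import ttest_ind

# Regenerate the same simulated test scores (same seed and order)
np.random.seed(5)
control_group = np.random.normal(loc=70, scale=10, size=50)
experimental_group = np.random.normal(loc=75, scale=10, size=50)

# Student's t-test (equal variances assumed) and Welch's t-test
t_student, p_student = ttest_ind(control_group, experimental_group)
t_welch, p_welch = ttest_ind(control_group, experimental_group, equal_var=False)

# Cohen's d: mean difference in units of the pooled standard deviation
# (simple average of the two variances, valid for equal group sizes)
pooled_sd = np.sqrt(
    (control_group.var(ddof=1) + experimental_group.var(ddof=1)) / 2
)
cohens_d = (experimental_group.mean() - control_group.mean()) / pooled_sd

print("Student p-value :", p_student)
print("Welch p-value :", p_welch)
print("Cohen's d :", cohens_d)
```

With equal group sizes the two tests give the same t-statistic and differ only in their degrees of freedom, so a large gap between the two p-values would point to unequal variances.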


### 12 . A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [3]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Set seed for reproducibility
np.random.seed(42)

# Generate hypothetical daily sales data for three stores
store_A = np.random.normal(loc=100, scale=20, size=30)
store_B = np.random.normal(loc=120, scale=20, size=30)
store_C = np.random.normal(loc=110, scale=20, size=30)

# Create a DataFrame with repeated measures structure
df_sales_repeated = pd.DataFrame({
    'Day': np.tile(np.arange(1, 31), 3),
    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Sales': np.concatenate([store_A, store_B, store_C])
})

# Fit repeated measures ANOVA model
model = AnovaRM(df_sales_repeated, 'Sales', 'Day', within=['Store'])
results = model.fit()

# Print ANOVA results
print("Repeated Measures ANOVA Results:")
print(results.summary())
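## The p-value for Store (0.0001) is below 0.05, so mean sales differ between at least two stores
## performing post-hoc test
## Tukey's HSD test (note: it treats the three samples as independent and ignores the pairing by day)
tukey_results = pairwise_tukeyhsd(df_sales_repeated['Sales'], df_sales_repeated['Store'])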
print("\nTukey's HSD Post-Hoc Test:")
print(tukey_results)

Repeated Measures ANOVA Results:
               Anova
      F Value Num DF  Den DF Pr > F
-----------------------------------
Store 10.3408 2.0000 58.0000 0.0001


Tukey's HSD Post-Hoc Test:
 Multiple Comparison of Means - Tukey HSD, FWER=0.05 
group1 group2 meandiff p-adj   lower    upper  reject
-----------------------------------------------------
     A      B  21.3397 0.0001   9.7429 32.9365   True
     A      C  14.0206 0.0136   2.4238 25.6175   True
     B      C  -7.3191 0.2936 -18.9159  4.2778  False
-----------------------------------------------------
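
Because the three stores are measured on the same days, a post-hoc procedure that respects this pairing is an alternative to Tukey's HSD. The following is a sketch of one such option, not the method used above: paired t-tests (`scipy.stats.ttest_rel`) for each pair of stores, with Holm's correction applied through `multipletests`. The data are regenerated with the same seed and parameters as above.

```python
from itertools import combinations

import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

# Regenerate the same simulated sales data (same seed and order)
np.random.seed(42)
sales = {
    "A": np.random.normal(loc=100, scale=20, size=30),
    "B": np.random.normal(loc=120, scale=20, size=30),
    "C": np.random.normal(loc=110, scale=20, size=30),
}

# Paired t-test for each pair of stores (observations matched by day)
pairs = list(combinations(sales, 2))
raw_p = [stats.ttest_rel(sales[a], sales[b]).pvalue for a, b in pairs]

# Holm's step-down correction for the three comparisons
reject, holm_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

for (a, b), p_adj, rej in zip(pairs, holm_p, reject):
    print(f"Store {a} vs Store {b} | Holm-adjusted p: {p_adj:.4f} | reject: {rej}")
```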
