### Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

### Groups are Independent:

1. **Easy Version**: People or things in one group don't affect those in another group.
**Problem Example**: If the groups are connected, for example because the same people are measured twice, the observations are no longer independent and the p-value from the F-test can no longer be trusted.

### Equal Variances:

2. **Easy Version**: The spread (variance) within each group should be about the same.
**Problem Example**: If one group varies far more than another, especially when the group sizes are unequal, the F-test becomes unreliable.

### Errors are Independent:

3. **Easy Version**: The error in one measurement doesn't depend on the error in another.
**Problem Example**: If the errors are correlated, for example when measurements taken close together in time drift in the same direction, the p-value can be misleading.

### Random Sampling:

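4. **Easy Version**: You picked your samples in a fair (random) way, not because of some characteristic related to the outcome.
**Problem Example**: If your samples were chosen in a biased manner, your results might not represent the population you want to draw conclusions about.

### Normality of Residuals:

5. **Easy Version**: Within each group, the values should be roughly normally distributed around the group mean.
**Problem Example**: If the data are strongly skewed or contain extreme outliers, especially with small samples, the F-test can give misleading results.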
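
Two of these assumptions can be checked numerically before running ANOVA. The sketch below is only an illustration on made-up samples (`group_a`, `group_b`, `group_c` are hypothetical names and values): it uses Levene's test for equal variances and the Shapiro-Wilk test for normality of the residuals, both from `scipy.stats`.

```python
import numpy as np
from scipy import stats

# Hypothetical samples, generated only to illustrate the checks.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=10, scale=2, size=30)
group_b = rng.normal(loc=11, scale=2, size=30)
group_c = rng.normal(loc=12, scale=2, size=30)

# Levene's test: the null hypothesis is that all groups have equal variances.
levene_stat, levene_p = stats.levene(group_a, group_b, group_c)

# Shapiro-Wilk on the residuals (each value minus its own group mean):
# the null hypothesis is that the residuals are normally distributed.
residuals = np.concatenate([g - g.mean() for g in (group_a, group_b, group_c)])
shapiro_stat, shapiro_p = stats.shapiro(residuals)

print(f"Levene p-value:  {levene_p:.3f}")
print(f"Shapiro p-value: {shapiro_p:.3f}")
```

A small p-value in either test is a warning sign that the corresponding assumption may not hold. Independence and random sampling cannot be tested this way; they depend on how the study was designed.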


### Q2. What are the three types of ANOVA, and in what situations would each be used?

1. **One-Way ANOVA**: One factor with at least two levels, and the levels are independent.
For example: we are comparing the heights of people from three different cities. One-way ANOVA helps you figure out if there's a significant difference in mean height between these cities.

2. **Two-Way ANOVA**: Two factors, each with at least two levels. It tests the main effect of each factor and the interaction between them.
For example: suppose you're studying how both diet and exercise affect weight loss. Two-way ANOVA helps you see if diet, exercise, or their combination has a significant impact.

3. **Repeated Measures ANOVA**: One factor with at least two levels, where the levels are dependent because the same subjects are measured under every level.
For example: if you measure the blood pressure of the same group before and after a new medication, repeated measures ANOVA helps determine if the change is due to the medication or just random variation.

### Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in Analysis of Variance (ANOVA) is a statistical concept that helps us understand the sources of variability in a set of data. In simple terms, it breaks down the total variation observed in a dataset into different components or sources.

**Total Variance (Total Sum of Squares, or SST):**
This represents the overall variability in your data. It's like looking at the total differences between individual data points and the overall mean.


**Between-Group Variance (Between-Group Sum of Squares, or SSB):**

This part of the variance measures how much the means of different groups (or categories) deviate from the overall mean. When it is large relative to the within-group variance, it points to real differences between the groups.


**Within-Group Variance (Within-Group Sum of Squares, or SSW):**

This part of the variance measures the variability within each group. It shows how much individual data points within a group deviate from their group mean.
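
Written as formulas, the three pieces described above fit together as follows. The notation is an assumption introduced here for illustration: $k$ groups, $n_i$ observations in group $i$, $N$ observations in total, $\bar{y}_i$ the mean of group $i$, and $\bar{y}$ the overall mean.

```latex
\[
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2}_{\text{SST}}
=
\underbrace{\sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2}_{\text{SSB}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2}_{\text{SSW}}
\]

\[
F = \frac{\text{SSB} / (k - 1)}{\text{SSW} / (N - k)}
\]
```

The F-statistic compares the two components after each is divided by its degrees of freedom.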


**Why is this concept important?**

ANOVA helps you determine if the differences between group means are statistically significant. If the between-group variance is much larger than the within-group variance, it suggests that there are significant differences between the groups.

By partitioning the variance, ANOVA helps identify whether the variation in your data is mainly due to differences between groups or to variability within each group.

### Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [2]:
import numpy as np
from scipy import stats

In [3]:
group1 = np.array([4, 5, 6, 7, 8])
group2 = np.array([9, 10, 11, 12, 13])
group3 = np.array([14, 15, 16, 17, 18])

# Combine the data into a single array
all_data = np.concatenate([group1, group2, group3])

In [4]:
all_data

array([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18])

In [7]:
# Calculate the overall mean
overall_mean = np.mean(all_data)

# Calculate Total Sum of Squares (SST)
sst = np.sum((all_data - overall_mean) ** 2)

In [20]:
overall_mean

11.0

In [21]:
sst

280.0

In [9]:

mean_group1 = np.mean(group1)
mean_group2 = np.mean(group2)
mean_group3 = np.mean(group3)

# Calculate Explained Sum of Squares (SSE)
sse = len(group1) * (mean_group1 - overall_mean)**2 + \
      len(group2) * (mean_group2 - overall_mean)**2 + \
      len(group3) * (mean_group3 - overall_mean)**2

# Calculate Residual Sum of Squares (SSR)
ssr = np.sum((group1 - mean_group1)**2) + \
      np.sum((group2 - mean_group2)**2) + \
      np.sum((group3 - mean_group3)**2)

In [22]:
sse

250.0

In [23]:
ssr

30.0

In [10]:
# Degrees of freedom
df_between = 3 - 1  # Number of groups minus 1
df_within = len(all_data) - 3  # Total number of observations minus the number of groups

In [24]:
df_between

2

In [25]:
df_within

12

In [14]:
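# Mean squares: each sum of squares divided by its degrees of freedom
ms_between = sse / df_between
ms_within = ssr / df_within

# F-statistic: ratio of between-group to within-group mean square
f_statistic = ms_between / ms_within

# p-value: upper-tail probability of the F distribution
p_value = 1 - stats.f.cdf(f_statistic, df_between, df_within)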

In [26]:
ms_between

125.0

In [27]:
ms_within

2.5

In [15]:
# Print results
print("Total Sum of Squares (SST):", sst)
print("Explained Sum of Squares (SSE):", sse)
print("Residual Sum of Squares (SSR):", ssr)
print("F-statistic:", f_statistic)
print("p-value:", p_value)

Total Sum of Squares (SST): 280.0
Explained Sum of Squares (SSE): 250.0
Residual Sum of Squares (SSR): 30.0
F-statistic: 50.0
p-value: 1.5127924217761546e-06
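
As a cross-check on the manual calculation, the same three groups can be passed to `scipy.stats.f_oneway`, which should report the same F-statistic. This is a small sketch that repeats the group definitions so it runs on its own; the `_check` suffix is just a naming choice here.

```python
import numpy as np
from scipy import stats

# Same three groups as in the manual calculation.
group1 = np.array([4, 5, 6, 7, 8])
group2 = np.array([9, 10, 11, 12, 13])
group3 = np.array([14, 15, 16, 17, 18])

# f_oneway performs the whole one-way ANOVA in a single call.
f_check, p_check = stats.f_oneway(group1, group2, group3)

print("F-statistic from f_oneway:", f_check)
print("p-value from f_oneway:", p_check)
```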


### Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [29]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [30]:
# Example data
data = {'A': [10, 20, 30, 40, 50, 60],
        'B': [5, 15, 10, 25, 20, 30],
        'Y': [25, 40, 55, 70, 85, 100]}

df = pd.DataFrame(data)

In [34]:
df

    A   B    Y
0  10   5   25
1  20  15   40
2  30  10   55
3  40  25   70
4  50  20   85
5  60  30  100


In [31]:
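# Fit a model with both main effects and their interaction (A * B expands to A + B + A:B)
# Note: A and B are numeric here, so the formula treats them as continuous predictors;
# wrap them as C(A) and C(B) to treat them as categorical factors.
model = ols('Y ~ A * B', data=df).fit()
# typ=2 requests Type II sums of squares
anova_table = sm.stats.anova_lm(model, typ=2)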

In [36]:
anova_table

                sum_sq   df             F        PR(>F)
A         8.485714e+02  1.0  1.635586e+28  6.114017e-29
B         4.557869e-27  1.0  8.785101e-02  7.948726e-01
A:B       9.515699e-28  1.0  1.834111e-02  9.046731e-01
Residual  1.037636e-25  2.0           NaN           NaN


In [32]:
# Extract main effects and interaction effects
# (sum_sq / df is the mean square of each term; whether an effect is
# significant is read from the F and PR(>F) columns of anova_table)
main_effect_A = anova_table.loc['A', 'sum_sq'] / anova_table.loc['A', 'df']
main_effect_B = anova_table.loc['B', 'sum_sq'] / anova_table.loc['B', 'df']
interaction_effect = anova_table.loc['A:B', 'sum_sq'] / anova_table.loc['A:B', 'df']

In [33]:
# Print results
print("Main Effect A:", main_effect_A)
print("Main Effect B:", main_effect_B)
print("Interaction Effect:", interaction_effect)

Main Effect A: 848.5714285714279
Main Effect B: 4.557868938585036e-27
Interaction Effect: 9.5156988668933e-28


### Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

**F-Statistic (5.23):**

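Think of the F-statistic like a signal that there might be differences between the groups. It is the ratio of the between-group mean square to the within-group mean square, so a value of 5.23 means the variation between group means is 5.23 times the variation within groups. The higher the ratio, the stronger the evidence of a difference.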

**P-Value (0.02):**

The p-value measures how strong the evidence is against the idea that there are no differences between the groups. A p-value of 0.02 means that, if the group means were truly equal, an F-statistic of 5.23 or larger would occur only about 2% of the time.


**Interpretation:**

Since the p-value is less than 0.05 (a common threshold), it suggests that the evidence against the idea of no differences is strong.


**Conclusion:**

In summary, based on the F-statistic and p-value, you would reject the null hypothesis and conclude that at least one group mean differs significantly from the others. ANOVA alone does not say which groups differ; a post-hoc test is needed for that.
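
The link between an F-statistic and its p-value depends on the degrees of freedom, which the question does not give. As a rough illustration only, the sketch below assumes 3 groups and 15 observations in total (a hypothetical design), which happens to give a p-value of roughly 0.02 for F = 5.23.

```python
from scipy import stats

f_observed = 5.23

# Hypothetical design: 3 groups, 15 observations in total.
df_between = 3 - 1
df_within = 15 - 3

# Upper-tail probability of the F distribution at the observed value.
p_value = stats.f.sf(f_observed, df_between, df_within)

print(f"p-value for F = {f_observed} with ({df_between}, {df_within}) df: {p_value:.4f}")
```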



### Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

#### Potential Consequences:
**Complete Case Analysis (CCA):**

**Consequence**: You drop every subject with a missing value, which reduces sample size and statistical power. Results may be biased unless the data are missing completely at random (MCAR).


**Mean Imputation:**

**Consequence**: Preserves sample size but may underestimate variability. Assumes missing values have the same mean as observed values.


**Multiple Imputation (MI):**

**Consequence**: Preserves variability and accounts for uncertainty. Requires a more complex analysis. Assumes missingness is at least missing at random (MAR).
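
The first two consequences can be seen on a tiny example. This is a sketch on made-up numbers: `wide` is a hypothetical table with one row per subject and one column per time point.

```python
import numpy as np
import pandas as pd

# Hypothetical repeated measures data: 4 subjects, 3 time points.
wide = pd.DataFrame({
    "t1": [5.0, 6.0, 7.0, 8.0],
    "t2": [5.5, np.nan, 7.5, 9.0],
    "t3": [6.0, 6.5, np.nan, 9.5],
})

# Complete case analysis: drop every subject with any missing value.
complete_cases = wide.dropna()

# Mean imputation: replace each missing value with its column mean.
mean_imputed = wide.fillna(wide.mean())

print("Subjects before / after dropping:", len(wide), "/", len(complete_cases))
print("Variance of t2, observed values only:", wide["t2"].var())
print("Variance of t2 after mean imputation:", mean_imputed["t2"].var())
```

Dropping incomplete subjects halves the sample here, while mean imputation keeps all four subjects but shrinks the variance of `t2`.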


### Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

**Tukey's Honestly Significant Difference (Tukey HSD):**

Use When:
You have three or more groups and want to compare every pair of group means.
You want to know which specific groups differ while keeping the overall (family-wise) error rate under control.

Example:
Imagine testing three types of plant fertilizers to see which one grows taller plants. If ANOVA says there's a difference, Tukey HSD helps pinpoint which fertilizers are significantly different from each other.


**Bonferroni Correction:**

Use When:

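You're comparing multiple pairs, and you want to control the family-wise Type I error rate (the risk of saying there's a difference when there isn't). Each comparison is tested at the significance level divided by the number of comparisons.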

Example:

Suppose you're testing different diets to see which one helps people lose weight. If ANOVA indicates a difference, Bonferroni correction helps you compare each pair of diets without increasing the risk of making a wrong conclusion.


**Scheffé's Test:**

Use When:

You want to test any contrast among the group means, not just pairwise differences, and the group sizes may be unequal.
You're okay with a more conservative test, which has less power for simple pairwise comparisons.

Example:

Let's say you're testing the effectiveness of various teaching methods (A, B, C, D) on student performance. If ANOVA suggests a difference, Scheffé's test helps identify which teaching methods are truly different.


**Why Post-hoc Tests?**

Imagine you have three different ice cream flavors, and you want to know if people like them equally. ANOVA tells you there's a difference in preferences, but it doesn't say which flavors are preferred. Post-hoc tests step in to say, "Ah, people like chocolate more than vanilla, but strawberry is no different from vanilla."

In simpler terms, post-hoc tests are like detectives helping us understand the details after we know there's a difference between groups. They tell us exactly where those differences lie.
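
As an illustration of the fertilizer example, the sketch below runs Tukey's HSD with `pairwise_tukeyhsd` from statsmodels. The plant heights and the labels `F1`, `F2`, `F3` are made up for the example.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical plant heights (cm) for three fertilizers, 5 plants each.
heights = np.array([4, 5, 6, 7, 8,
                    9, 10, 11, 12, 13,
                    14, 15, 16, 17, 18], dtype=float)
fertilizer = np.repeat(["F1", "F2", "F3"], 5)

# Compare every pair of fertilizers at a family-wise error rate of 0.05.
tukey = pairwise_tukeyhsd(endog=heights, groups=fertilizer, alpha=0.05)

print(tukey.summary())
```

Each row of the summary is one pair of groups, and the `reject` column shows whether that pair differs significantly.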

### Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

In [39]:
import numpy as np
from scipy.stats import f_oneway

In [40]:
# Generate random weight loss data for three diets
np.random.seed(42)  # Setting seed for reproducibility
weight_loss_A = np.random.normal(loc=5, scale=2, size=50)  # Mean = 5, Standard Deviation = 2
weight_loss_B = np.random.normal(loc=6, scale=2, size=50)  # Mean = 6, Standard Deviation = 2
weight_loss_C = np.random.normal(loc=4, scale=2, size=50)  # Mean = 4, Standard Deviation = 2

In [42]:
# Combine the data
all_weight_loss = np.concatenate([weight_loss_A, weight_loss_B, weight_loss_C])

# Create labels for each group
labels = ['A'] * 50 + ['B'] * 50 + ['C'] * 50

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(weight_loss_A, weight_loss_B, weight_loss_C)

In [46]:
f_statistic

16.574213049400626

In [47]:
p_value

3.2283781469409867e-07

In [44]:
all_weight_loss

array([5.99342831, 4.7234714 , 6.29537708, 8.04605971, 4.53169325,
       4.53172609, 8.15842563, 6.53486946, 4.06105123, 6.08512009,
       4.07316461, 4.06854049, 5.48392454, 1.17343951, 1.55016433,
       3.87542494, 2.97433776, 5.62849467, 3.18395185, 2.1753926 ,
       7.93129754, 4.5484474 , 5.13505641, 2.15050363, 3.91123455,
       5.22184518, 2.69801285, 5.75139604, 3.79872262, 4.4166125 ,
       3.79658678, 8.70455637, 4.97300555, 2.88457814, 6.64508982,
       2.5583127 , 5.41772719, 1.08065975, 2.3436279 , 5.39372247,
       6.47693316, 5.34273656, 4.76870344, 4.39779261, 2.04295602,
       3.56031158, 4.07872246, 7.11424445, 5.68723658, 1.47391969,
       6.64816794, 5.22983544, 4.646156  , 7.22335258, 8.06199904,
       7.86256024, 4.32156495, 5.38157525, 6.66252686, 7.95109025,
       5.04165152, 5.62868205, 3.78733005, 3.60758675, 7.62505164,
       8.71248006, 5.85597976, 8.0070658 , 6.72327205, 4.70976049,
       6.72279121, 9.07607313, 5.92834792, 9.12928731, 0.76050

In [48]:
# Print results
print("One-Way ANOVA Results:")
print("F-Statistic:", f_statistic)
print("p-value:", p_value)

One-Way ANOVA Results:
F-Statistic: 16.574213049400626
p-value: 3.2283781469409867e-07


In [49]:
# Interpret the results
if p_value < 0.05:
    print("There is significant evidence to suggest that there are differences in the mean weight loss between at least two diets.")
else:
    print("There is not enough evidence to conclude that there are differences in the mean weight loss between diets.")

There is significant evidence to suggest that there are differences in the mean weight loss between at least two diets.


### Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

In [51]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [52]:
# Setting seed for reproducibility
np.random.seed(42)

# Generate random data
n_per_group = 30
experience_levels = ['Novice', 'Experienced']
software_programs = ['Program A', 'Program B', 'Program C']

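# Creating a DataFrame with random data
# Note: this repeat/tile pattern never pairs Program A with Experienced or Program C with Novice,
# so two of the six cells are empty and the Experience row of the ANOVA table comes out as NaN.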
data = {
    'Time': np.random.normal(loc=10, scale=2, size=n_per_group * len(software_programs) * len(experience_levels)),
    'Program': np.repeat(np.tile(software_programs, n_per_group), len(experience_levels)),
    'Experience': np.tile(np.repeat(experience_levels, len(software_programs)), n_per_group)
}

df = pd.DataFrame(data)

In [55]:
df

          Time    Program   Experience
0    10.993428  Program A       Novice
1     9.723471  Program A       Novice
2    11.295377  Program B       Novice
3    13.046060  Program B  Experienced
4     9.531693  Program C  Experienced
..         ...        ...          ...
175  11.654366  Program A       Novice
176  10.026004  Program B       Novice
177  12.907068  Program B  Experienced
178   9.470686  Program C  Experienced


In [53]:
# Perform two-way ANOVA
formula = 'Time ~ Program + Experience + Program:Experience'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

  F /= J


In [54]:
# Print results
print("Two-Way ANOVA Results:")
print(anova_table)

Two-Way ANOVA Results:
                        sum_sq     df          F        PR(>F)
Program               1.484605    2.0   0.210750  6.467466e-01
Experience                 NaN    1.0        NaN           NaN
Program:Experience  321.988587    2.0  45.708515  1.944404e-10
Residual            619.906287  176.0        NaN           NaN


### Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

In [58]:
import numpy as np
from scipy.stats import ttest_ind
import statsmodels.stats.multicomp as mc

In [59]:
# Setting seed for reproducibility
np.random.seed(42)

# Generate random test scores for the control and experimental groups
control_group = np.random.normal(loc=70, scale=10, size=100)
experimental_group = np.random.normal(loc=75, scale=10, size=100)

In [62]:
control_group

array([74.96714153, 68.61735699, 76.47688538, 85.23029856, 67.65846625,
       67.65863043, 85.79212816, 77.67434729, 65.30525614, 75.42560044,
       65.36582307, 65.34270246, 72.41962272, 50.86719755, 52.75082167,
       64.37712471, 59.8716888 , 73.14247333, 60.91975924, 55.87696299,
       84.65648769, 67.742237  , 70.67528205, 55.75251814, 64.55617275,
       71.1092259 , 58.49006423, 73.75698018, 63.9936131 , 67.0830625 ,
       63.98293388, 88.52278185, 69.86502775, 59.42289071, 78.22544912,
       57.7915635 , 72.08863595, 50.40329876, 56.71813951, 71.96861236,
       77.3846658 , 71.71368281, 68.84351718, 66.98896304, 55.2147801 ,
       62.80155792, 65.39361229, 80.57122226, 73.4361829 , 52.36959845,
       73.24083969, 66.1491772 , 63.23078   , 76.11676289, 80.30999522,
       79.31280119, 61.60782477, 66.90787624, 73.31263431, 79.75545127,
       65.20825762, 68.14341023, 58.93665026, 58.03793376, 78.12525822,
       83.56240029, 69.27989878, 80.03532898, 73.61636025, 63.54

In [63]:
experimental_group

array([ 60.84629258,  70.79354677,  71.57285483,  66.97722731,
        73.38714288,  79.04050857,  93.86185901,  76.74577813,
        77.57550391,  74.25554084,  55.81228785,  74.73486125,
        75.6023021 ,  99.63242112,  73.07639035,  78.01547342,
        74.6528823 ,  63.31321962,  86.42822815,  82.51933033,
        82.91031947,  65.90612545,  89.02794311,  60.98148937,
        80.86857094,  96.90455626,  65.09463675,  69.3370227 ,
        75.99651365,  69.96524346,  59.49336569,  75.68562975,
        64.37696286,  79.73592431,  65.80575766,  90.49934405,
        67.16746708,  71.77938484,  83.13517217,  62.69135684,
        77.27459935,  88.07142754,  58.92516765,  76.84633859,
        77.59882794,  82.81822872,  62.63049289,  61.79543387,
        80.21941566,  77.96984673,  77.5049285 ,  78.46448209,
        68.19975278,  77.32253697,  77.93072473,  67.85648582,
        93.65774511,  79.73832921,  63.08696503,  81.56553609,
        65.2531833 ,  82.87084604,  86.58595579,  66.79

In [60]:
# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group, experimental_group)


print("Two-Sample T-Test Results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Check if results are significant
if p_value < 0.05:
    print("The difference in test scores between the control and experimental groups is statistically significant.")
else:
    print("There is not enough evidence to conclude a significant difference in test scores between the groups.")


Two-Sample T-Test Results:
t-statistic: -4.754695943505281
p-value: 3.819135262679478e-06
The difference in test scores between the control and experimental groups is statistically significant.


In [61]:
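# If results are significant, follow up with post-hoc test
# (with only two groups, Tukey's HSD repeats the same comparison as the t-test;
# post-hoc tests become informative with three or more groups)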
if p_value < 0.05:
    # Combine data for post-hoc test
    all_scores = np.concatenate([control_group, experimental_group])
    group_labels = ['Control'] * 100 + ['Experimental'] * 100
    
    # Perform post-hoc test (e.g., Tukey's HSD)
    posthoc = mc.MultiComparison(all_scores, group_labels)
    posthoc_results = posthoc.tukeyhsd()
    
    # Print post-hoc test results
    print("\nPost-Hoc Test (Tukey's HSD) Results:")
    print(posthoc_results)


Post-Hoc Test (Tukey's HSD) Results:
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   6.2615   0.0 3.6645 8.8585   True
--------------------------------------------------------


### Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

In [70]:
import numpy as np
import pandas as pd
import pingouin as pg

In [71]:
# Setting seed for reproducibility
np.random.seed(42)

# Generate random daily sales data for three stores
store_A_sales = np.random.normal(loc=100, scale=20, size=30)
store_B_sales = np.random.normal(loc=110, scale=25, size=30)
store_C_sales = np.random.normal(loc=95, scale=15, size=30)

# Combine data into a DataFrame
df = pd.DataFrame({
    'Sales': np.concatenate([store_A_sales, store_B_sales, store_C_sales]),
    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Day': np.tile(np.arange(1, 31), 3)
})

In [72]:
# Perform repeated measures ANOVA
rm_anova_result = pg.rm_anova(data=df, dv='Sales', within='Store', subject='Day')

# Print repeated measures ANOVA results
print("Repeated Measures ANOVA Results:")
print(rm_anova_result)

Repeated Measures ANOVA Results:
  Source  ddof1  ddof2         F     p-unc       ng2       eps
0  Store      2     58  3.686103  0.031118  0.074828  0.959186


In [73]:
# Check if results are significant
if rm_anova_result['p-unc'][0] < 0.05:
    print("\nThe difference in sales between the three stores is statistically significant.")
    
    # Follow up with post-hoc test (e.g., pairwise t-tests with correction)
    # Note: newer pingouin releases provide this function as pg.pairwise_tests
    posthoc_result = pg.pairwise_ttests(data=df, dv='Sales', within='Store', subject='Day', padjust='holm')
    
    # Print post-hoc test results
    print("\nPost-Hoc Test Results:")
    print(posthoc_result)
else:
    print("\nThere is not enough evidence to conclude a significant difference in sales between the three stores.")



The difference in sales between the three stores is statistically significant.

Post-Hoc Test Results:
  Contrast  A  B  Paired  Parametric         T   dof alternative     p-unc  \
0    Store  A  B    True        True -2.101014  29.0   two-sided  0.044446   
1    Store  A  C    True        True  0.243517  29.0   two-sided  0.809319   
2    Store  B  C    True        True  2.369288  29.0   two-sided  0.024695   

     p-corr p-adjust   BF10    hedges  
0  0.088892     holm  1.314 -0.509182  
1  0.809319     holm    0.2  0.062386  
2  0.074086     holm  2.117  0.595064  

After the Holm correction, none of the three pairwise comparisons is significant at the 0.05 level (p-corr = 0.089, 0.809, and 0.074), even though the overall repeated measures ANOVA was significant (p = 0.031). The omnibus test indicates that the stores differ somewhere, but the corrected pairwise tests cannot pin down which pair is responsible.


