## **Q1:-**  
### **Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.**

### **Ans:-**

### **1.Independence:-**
#### The observations are assumed to be independent of each other, both within each group and across groups. In other words, the value of one observation should not depend on the value of any other observation.
##### Example of Violation: In a repeated measures design where the same subjects are used in each group, independence is violated. For example, if you measure the performance of the same individuals under different conditions, their performance may be correlated due to carryover effects.


### **2.Homogeneity of Variance (Homoscedasticity):-**
#### The variances within each group are assumed to be equal. This means that the spread of the data (i.e., the variance) is approximately the same in each group.
##### Example of Violation: In a one-way ANOVA, if one group has a much larger variance compared to the other groups, it can violate this assumption. For instance, if the data in one group is much more spread out than the data in other groups, it may lead to unequal variances.

### **3.Normality:-** 
#### The data within each group (equivalently, the residuals of the model) should follow an approximately normal, bell-shaped distribution.
##### Example of Violation: If the data in one or more groups are strongly skewed or contain extreme outliers, the normality assumption is violated. For instance, if a group's data is highly skewed and the sample is small, ANOVA results may be unreliable.

### **4.Equal Sample Sizes:-**  
#### Equal sample sizes (a balanced design) are not a strict requirement of one-way ANOVA, but some ANOVA models assume them, and a balanced design makes the F-test more robust to unequal variances.
##### Example of Violation: If one group has a much smaller sample size than the others and the group variances also differ, the F-test can become too liberal or too conservative, which affects the validity of the results.
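
#### The normality and equal-variance assumptions can be checked before running ANOVA. The sketch below is one possible approach, using `scipy.stats.shapiro` and `scipy.stats.levene`. The three groups are made-up data (an assumption for illustration), with the third group deliberately given a larger spread.

```python
import numpy as np
from scipy import stats

# Hypothetical data: three normal groups, the third with a much larger spread.
rng = np.random.default_rng(0)
groups = {
    "g1": rng.normal(loc=10, scale=1, size=40),
    "g2": rng.normal(loc=10, scale=1, size=40),
    "g3": rng.normal(loc=10, scale=6, size=40),
}

# Shapiro-Wilk: a small p-value suggests the group is not normally distributed.
for name, values in groups.items():
    w_stat, p_norm = stats.shapiro(values)
    print(f"{name}: Shapiro-Wilk W={w_stat:.3f}, p={p_norm:.3f}")

# Levene's test: a small p-value suggests the group variances are not equal.
levene_stat, p_levene = stats.levene(*groups.values())
print(f"Levene W={levene_stat:.3f}, p={p_levene:.4g}")
```

#### Independence cannot be tested this way; it has to be judged from how the data were collected.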


## **Q2:-**  
### **What are the three types of ANOVA, and in what situations would each be used?**

### **Ans:-**

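### **ANOVA is used to compare treatments, analyze the impact of factors on a variable, or compare means across multiple groups. The three types usually listed are one-way ANOVA (comparing the means of groups defined by a single independent variable), two-way ANOVA (examining the effects of two independent variables, and their interaction, on a dependent variable), and N-way ANOVA (extending the same idea to more than two independent variables).**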

## **Q3:-**  
### **What is the partitioning of variance in ANOVA, and why is it important to understand this concept?**

### **Ans:-**

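### **The total variation present in a set of data may be partitioned into a number of non-overlapping components according to the nature of the classification. The systematic procedure to achieve this is called Analysis of Variance (ANOVA). In a one-way layout the total variation splits into a between-group part and a within-group part. Understanding this partitioning matters because the hypothesis test compares these parts: group differences are judged significant only when the between-group variation is large relative to the within-group variation.**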
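
#### For a one-way layout the partition can be written out explicitly. The notation is introduced here for illustration: $y_{ij}$ is the $j$-th observation in group $i$, $\bar{y}_i$ is the mean of group $i$, $\bar{y}$ is the grand mean, $n_i$ is the size of group $i$, and $k$ is the number of groups.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}\right)^2}_{\text{total}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{y}_i-\bar{y}\right)^2}_{\text{between groups}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(y_{ij}-\bar{y}_i\right)^2}_{\text{within groups}}
$$

#### The between-group term has $k-1$ degrees of freedom and the within-group term has $N-k$, where $N=\sum_i n_i$.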

## **Q4:-** 
### **How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?**

### **Ans:-**


### 1.Calculate the Grand Mean (GM):

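##### Calculate the mean of all the data points across all groups.
### 2.Calculate SST (Total Sum of Squares):

##### Find the squared difference between each data point and the Grand Mean.
##### Sum up these squared differences.
### 3.Calculate SSE (Explained Sum of Squares):

##### Calculate the mean of each group's data points.
##### Find the squared difference between each group mean and the Grand Mean, and multiply it by the number of data points in that group.
##### Sum up these weighted squared differences for all groups.
### 4.Calculate SSR (Residual Sum of Squares):

##### Subtract SSE from SST: SSR = SST - SSE. This equals the sum of squared differences between each data point and its own group mean.
##### Note: SSE (explained) and SSR (residual) follow the naming used in the question; many textbooks use these two abbreviations the other way round.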


In [1]:
import numpy as np
from scipy.stats import f_oneway

In [2]:
group_A = np.array([10, 12, 15, 11, 14])
group_B = np.array([20, 22, 25, 21, 24])
group_C = np.array([30, 32, 35, 31, 34])

In [3]:
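# Pool all observations into one array
all_data = np.concatenate([group_A, group_B, group_C])

In [4]:
# Grand mean of all observations
grand_mean = np.mean(all_data)

In [5]:
# SST: squared deviations of every observation from the grand mean
sst = np.sum((all_data - grand_mean)**2)

In [6]:
# SSE: squared deviations of each group mean from the grand mean, weighted by group size
group_means = [np.mean(group_A), np.mean(group_B), np.mean(group_C)]
sse = np.sum([(len(group) * (mean - grand_mean)**2) for group, mean in zip([group_A, group_B, group_C], group_means)])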

In [7]:
# Calculate SSR (Residual Sum of Squares)
ssr = sst - sse


In [8]:
# Perform one-way ANOVA
f_statistic, p_value = f_oneway(group_A, group_B, group_C)

In [9]:
print(f"SST (Total Sum of Squares): {sst:.2f}")
print(f"SSE (Explained Sum of Squares): {sse:.2f}")
print(f"SSR (Residual Sum of Squares): {ssr:.2f}")
print(f"F-statistic: {f_statistic:.2f}")
print(f"P-value: {p_value:.4f}")


SST (Total Sum of Squares): 1051.60
SSE (Explained Sum of Squares): 1000.00
SSR (Residual Sum of Squares): 51.60
F-statistic: 116.28
P-value: 0.0000
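
#### As a cross-check, the residual sum of squares can also be computed directly from the within-group deviations, and the F-statistic can be rebuilt from the sums of squares. The sketch below reuses the three groups from the cells above. The variable names are chosen here for illustration.

```python
import numpy as np
from scipy.stats import f_oneway

groups = [
    np.array([10, 12, 15, 11, 14]),
    np.array([20, 22, 25, 21, 24]),
    np.array([30, 32, 35, 31, 34]),
]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()

# Explained (between-group) and residual (within-group) sums of squares,
# each computed directly from its definition.
ss_explained = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_residual = sum(((g - g.mean()) ** 2).sum() for g in groups)
ss_total = ((all_data - grand_mean) ** 2).sum()

# Degrees of freedom: k - 1 between groups, N - k within groups.
df_between = len(groups) - 1
df_within = len(all_data) - len(groups)

# F is the ratio of the two mean squares.
f_manual = (ss_explained / df_between) / (ss_residual / df_within)
f_scipy, _ = f_oneway(*groups)
print(f"Residual SS (direct): {ss_residual:.2f}")
print(f"F (manual): {f_manual:.2f}, F (scipy): {f_scipy:.2f}")
```

#### Both routes give a residual sum of squares of 51.60 and an F-statistic of 116.28, matching the output above.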


## **Q5:-** 
### **In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?**


### **Ans:-**

In [10]:
import numpy as np
from scipy.stats import f_oneway

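np.random.seed(42)  # fixed seed so the random data are reproducible

levels_a = ['A1', 'A2']
levels_b = ['B1', 'B2']
n_samples = 50

# 100 random values with no built-in effect of either factor
data = np.random.randn(2 * n_samples)
# Factor A: the first 50 values are 'A1', the last 50 are 'A2'
factor_a = np.repeat(levels_a, n_samples)
# Factor B: alternates 'B1', 'B2' across the 100 values
factor_b = np.tile(levels_b, n_samples)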

In [11]:
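# Approximate each main effect with a separate one-way ANOVA.
# Note: each test ignores the other factor. A full two-way ANOVA fits both
# factors and their interaction in one model with a shared error term.
f_statistic_a, p_value_a = f_oneway(data[factor_a == 'A1'], data[factor_a == 'A2'])
f_statistic_b, p_value_b = f_oneway(data[factor_b == 'B1'], data[factor_b == 'B2'])

In [12]:
# One-way ANOVA across the four A x B cells. This tests whether any cell
# mean differs, which is not the same as a test of the A x B interaction.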
interaction_data_A1B1 = data[(factor_a == 'A1') & (factor_b == 'B1')]
interaction_data_A1B2 = data[(factor_a == 'A1') & (factor_b == 'B2')]
interaction_data_A2B1 = data[(factor_a == 'A2') & (factor_b == 'B1')]
interaction_data_A2B2 = data[(factor_a == 'A2') & (factor_b == 'B2')]

f_statistic_interaction, p_value_interaction = f_oneway(
    interaction_data_A1B1, interaction_data_A1B2, interaction_data_A2B1, interaction_data_A2B2
)

In [13]:
# Print results
print(f"Main Effect A - F-statistic: {f_statistic_a}, p-value: {p_value_a}")
print(f"Main Effect B - F-statistic: {f_statistic_b}, p-value: {p_value_b}")
print(f"Interaction Effect - F-statistic: {f_statistic_interaction}, p-value: {p_value_interaction}")

Main Effect A - F-statistic: 1.8082615862606561, p-value: 0.18182095878579319
Main Effect B - F-statistic: 0.12174365935810394, p-value: 0.7278995353995543
Interaction Effect - F-statistic: 0.8783854922263483, p-value: 0.4551995251282638
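
#### The one-way tests above only approximate a two-way analysis. For a balanced design, the main effects and the interaction can be computed from a single partition of the sums of squares, as in the sketch below. The data are hypothetical (2 x 2 cells with 25 replicates each and an assumed shift for level A2), and the variable names are chosen for illustration. The `ols` and `anova_lm` functions from `statsmodels`, used in Q10, are an alternative way to get a two-way ANOVA table.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced design: 2 levels of A x 2 levels of B, 25 replicates per cell.
rng = np.random.default_rng(42)
n_a, n_b, n_rep = 2, 2, 25
# cells[i, j, :] holds the replicates for level i of A and level j of B.
cells = rng.normal(size=(n_a, n_b, n_rep))
cells[1] += 2.0  # assumed effect: level A2 is two units higher than A1

grand_mean = cells.mean()
mean_a = cells.mean(axis=(1, 2))  # one mean per level of A
mean_b = cells.mean(axis=(0, 2))  # one mean per level of B
mean_ab = cells.mean(axis=2)      # one mean per cell

# Sums of squares for a balanced design.
ss_a = n_b * n_rep * ((mean_a - grand_mean) ** 2).sum()
ss_b = n_a * n_rep * ((mean_b - grand_mean) ** 2).sum()
ss_cells = n_rep * ((mean_ab - grand_mean) ** 2).sum()
ss_ab = ss_cells - ss_a - ss_b                         # interaction
ss_error = ((cells - mean_ab[:, :, None]) ** 2).sum()  # within-cell residual

df_a, df_b = n_a - 1, n_b - 1
df_ab = df_a * df_b
df_error = n_a * n_b * (n_rep - 1)
ms_error = ss_error / df_error

# Every effect is tested against the same within-cell error term.
results = {}
for name, ss, df in [("A", ss_a, df_a), ("B", ss_b, df_b), ("A:B", ss_ab, df_ab)]:
    f_value = (ss / df) / ms_error
    p_value = stats.f.sf(f_value, df, df_error)
    results[name] = (f_value, p_value)
    print(f"{name}: F={f_value:.3f}, p={p_value:.4g}")
```

#### These formulas hold only when every cell has the same number of observations; unbalanced data need a model-based approach.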


## **Q6:-** 
### **Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.**
### **What can you conclude about the differences between the groups, and how would you interpret these results?**

### **Ans:-**

### **1.Significance of the F-statistic (ANOVA test):-**

#### The F-statistic is a test statistic that compares the variance between groups to the variance within groups. If the F-statistic is significantly greater than 1, it suggests that there are significant differences among at least some of the group means.

### **2.P-value (p):-**
#### The p-value is the probability of observing an F-statistic at least as large as the one obtained, if the null hypothesis (that all group means are equal) were true.
#### A small p-value (e.g., p < 0.05) indicates that the differences among the group means are statistically significant, and you reject the null hypothesis.


### Based on your results:-
#### The F-statistic of 5.23 means that the variation between the group means is about five times the variation within the groups.
#### The p-value of 0.02 is less than the typical significance level of 0.05, so you reject the null hypothesis and conclude that at least one group mean differs from the others. ANOVA does not show which groups differ; a post-hoc test is needed for that.
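
#### The link between an F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not give the degrees of freedom, so the values below (3 groups of 5 observations) are purely an assumption for illustration.

```python
from scipy import stats

f_statistic = 5.23
# Hypothetical degrees of freedom: 3 groups of 5 observations each
# gives k - 1 = 2 and N - k = 12. The question does not state these.
df_between, df_within = 2, 12

# Survival function = P(F >= observed value) under the null hypothesis.
p_value = stats.f.sf(f_statistic, df_between, df_within)
# Critical value the F-statistic must exceed at the 5% level.
f_critical = stats.f.ppf(0.95, df_between, df_within)

print(f"p-value: {p_value:.3f}, critical F at alpha=0.05: {f_critical:.2f}")
```

#### With other degrees of freedom the same F-statistic would give a different p-value, which is why the two numbers should always be reported together.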

## **Q7:-** 
### **In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?**

### **Ans:-**

### **1.Listwise Deletion (Complete Case Analysis):-**

#### This method removes every case with missing data from the analysis. In a repeated measures design, that means dropping a subject entirely if any one of its measurements is missing.
#### Consequences:
##### Pros: It's straightforward and does not require imputation.
##### Cons: Can result in a significant loss of data, reduced statistical power, and potentially biased estimates if data are not missing completely at random (MCAR).
### **2.Pairwise Deletion (Available Case Analysis):-**
#### This method uses all available data for each pairwise comparison, even if some data points are missing for specific variables.
#### Consequences:
##### Pros: Maximizes the use of available data and may result in higher statistical power compared to listwise deletion.
##### Cons: Estimates can be inconsistent due to the use of different subsets of the data for different comparisons, which may lead to biased results.
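
#### The difference between the two deletion methods can be seen on a small example. The sketch below uses a hypothetical wide-format array (rows are subjects, columns are repeated conditions) with `np.nan` marking missing measurements.

```python
import numpy as np

# Hypothetical wide-format data: rows are subjects, columns are three
# repeated conditions; np.nan marks a missing measurement.
scores = np.array([
    [5.0, 6.0, 7.0],
    [4.0, np.nan, 6.5],
    [6.0, 7.0, np.nan],
    [5.5, 6.5, 7.5],
    [4.5, 5.5, 6.0],
])

# Listwise deletion: keep only subjects with all three measurements.
complete_rows = ~np.isnan(scores).any(axis=1)
listwise = scores[complete_rows]
print("Subjects kept (listwise):", listwise.shape[0])
print("Condition means (listwise):", listwise.mean(axis=0))

# Available-case summary: each condition uses every value it has,
# so the conditions are summarised over different sets of subjects.
n_available = (~np.isnan(scores)).sum(axis=0)
print("Values per condition (available case):", n_available)
print("Condition means (available case):", np.nanmean(scores, axis=0))
```

#### Listwise deletion keeps 3 of the 5 subjects, while the available-case summary uses 5, 4, and 4 values for the three conditions, so its condition means are based on different subjects.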


## **Q8:-** 
### **What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.**

### **Ans:-**

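### **The post hoc test I'll use is Tukey's method. There are a variety of post hoc tests you can choose from, but Tukey's method is the most common for comparing all possible group pairings. Other common choices are the Bonferroni correction (for a small number of planned comparisons), Scheffé's method (for complex contrasts as well as pairwise comparisons), and Dunnett's test (for comparing each treatment against a single control group). There are two ways to present post hoc test results: adjusted p-values and simultaneous confidence intervals.**
### **Example: a one-way ANOVA on three groups gives a significant F-statistic. That only shows that at least one mean differs, so a post hoc test is needed to find out which pairs of groups differ.**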
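
#### A minimal sketch of Tukey's method follows, assuming a SciPy release recent enough to provide `scipy.stats.tukey_hsd`. The three groups are hypothetical measurements, and the labels are chosen for illustration.

```python
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

# Hypothetical measurements for three groups.
group_a = np.array([10, 12, 15, 11, 14])
group_b = np.array([20, 22, 25, 21, 24])
group_c = np.array([30, 32, 35, 31, 34])

# The omnibus test only says that at least one mean differs.
f_statistic, p_value = f_oneway(group_a, group_b, group_c)
print(f"ANOVA: F={f_statistic:.2f}, p={p_value:.4g}")

# Tukey's HSD compares every pair while controlling the family-wise error rate.
result = tukey_hsd(group_a, group_b, group_c)
labels = ["A", "B", "C"]
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        print(f"{labels[i]} vs {labels[j]}: adjusted p={result.pvalue[i, j]:.4g}")

# Simultaneous confidence intervals for the pairwise mean differences.
ci = result.confidence_interval(confidence_level=0.95)
print("Lower bounds:\n", ci.low)
print("Upper bounds:\n", ci.high)
```

#### A pair is judged different when its adjusted p-value is below the chosen significance level, or equivalently when its confidence interval excludes zero.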

## **Q9:-** 
### **A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets.**
### **Report the F-statistic and p-value, and interpret the results.**

### **Ans:-**

In [14]:
import numpy as np
import scipy.stats as stats

In [15]:
# Illustrative weight-loss values: 30 per diet, not the 50 participants described in the question
diet_A = [1.5, 2.0, 1.8, 1.6, 1.7, 1.9, 2.1, 1.8, 2.0, 1.6, 1.7, 2.1, 1.9, 2.0, 1.8, 1.6, 1.7, 1.9, 2.1, 1.8, 2.0, 1.6, 1.7, 2.1, 1.9, 2.0, 1.8, 1.6, 1.7, 1.9]
diet_B = [1.0, 1.2, 1.1, 1.3, 1.5, 1.4, 1.2, 1.1, 1.3, 1.5, 1.4, 1.2, 1.1, 1.3, 1.5, 1.4, 1.2, 1.1, 1.3, 1.5, 1.4, 1.2, 1.1, 1.3, 1.5, 1.4, 1.2, 1.1, 1.3, 1.5]
diet_C = [0.5, 0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.6, 0.8, 0.7, 0.9, 0.6, 0.8]

In [16]:
# Perform a one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

In [17]:
print(f"F-statistic: {f_statistic:.2f}")
print(f"P-value: {p_value:.4f}")
alpha = 0.05  # Significance level


F-statistic: 383.08
P-value: 0.0000


In [18]:
if p_value < alpha:
    print("Reject the null hypothesis. There are significant differences between the mean weight loss of the three diets.")
else:
    print("Fail to reject the null hypothesis. There are no significant differences between the mean weight loss of the three diets.")


Reject the null hypothesis. There are significant differences between the mean weight loss of the three diets.


## **Q10:-**  
### **A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.**

### **Ans:-**

In [19]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

In [20]:
# Illustrative balanced data set: 90 rows (15 per Software x Experience cell),
# not the 30 employees described in the question
data = {
    'Software': ['A', 'B', 'C'] * 30, 
    'Experience': ['Novice', 'Experienced'] * 45, 
    'Time': [25, 30, 28, 22, 27, 26, 24, 29, 31] * 10, 
}
df = pd.DataFrame(data)

In [21]:
# Perform a two-way ANOVA
formula = 'Time ~ Software + Experience + Software:Experience'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

In [22]:
print("ANOVA Results:")
print(anova_table)


ANOVA Results:
                           sum_sq    df             F        PR(>F)
Software             4.688889e+02   2.0  8.951515e+01  1.511775e-21
Experience           1.767190e-27   1.0  6.747454e-28  1.000000e+00
Software:Experience  5.127596e-29   2.0  9.789047e-30  1.000000e+00
Residual             2.200000e+02  84.0           NaN           NaN

#### Interpretation: the main effect of Software is significant (F ≈ 89.5, p ≈ 1.5e-21), so the mean completion time differs between the programs. The main effect of Experience and the Software:Experience interaction are not significant (F ≈ 0, p = 1.0). In this illustrative data set, experience level makes no difference to completion time, and the effect of the program does not depend on experience.


## **Q11:-**  
### **An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.**

### **Ans:-**

In [23]:
import numpy as np
import scipy.stats as stats
# Illustrative scores: 30 per group, not the 100 students described in the question
control_group_scores = np.array([75, 80, 82, 78, 76, 79, 81, 77, 74, 76, 80, 82, 78, 75, 77, 79, 76, 80, 82, 78, 76, 79, 81, 77, 74, 76, 80, 82, 78, 76])
experimental_group_scores = np.array([85, 88, 90, 87, 84, 89, 86, 88, 87, 86, 90, 91, 88, 85, 87, 89, 86, 90, 91, 88, 86, 85, 88, 87, 84, 89, 86, 88, 87, 86])

In [24]:
t_statistic, p_value = stats.ttest_ind(control_group_scores, experimental_group_scores)

print(f"t-statistic: {t_statistic:.2f}")
print(f"P-value: {p_value:.4f}")

# Interpret the results
alpha = 0.05  # Significance level

t-statistic: -16.09
P-value: 0.0000


In [25]:
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in test scores between the two groups.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in test scores between the two groups.")


Reject the null hypothesis. There is a significant difference in test scores between the two groups.

#### Because only two groups are compared, the t-test already identifies the pair that differs, so no post-hoc test is needed. The negative t-statistic shows that the control group's mean score is lower than the experimental group's.


## **Q12:-** 
### **A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post- hoc test to determine which store(s) differ significantly from each other.**

### **Ans:-**

In [26]:
import numpy as np
import scipy.stats as stats

In [27]:
store_A_sales = np.array([100, 110, 120, 130, 115, 105, 125, 110, 115, 120, 130, 135, 140, 115, 125, 130, 105, 120, 115, 110, 135, 125, 110, 120, 125, 135, 130, 120, 110, 125])
store_B_sales = np.array([90, 95, 100, 105, 110, 115, 95, 100, 105, 110, 115, 120, 125, 130, 100, 105, 110, 115, 120, 95, 100, 105, 110, 115, 120, 105, 110, 95, 100, 105])
store_C_sales = np.array([70, 75, 80, 85, 90, 95, 70, 75, 80, 85, 90, 95, 70, 75, 80, 85, 90, 95, 70, 75, 80, 85, 90, 95, 80, 85, 70, 75, 80, 85])

In [28]:

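# Perform a one-way ANOVA.
# Note: f_oneway treats the three stores as independent groups. A repeated
# measures ANOVA would also account for the same 30 days being observed at every store.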
f_statistic, p_value = stats.f_oneway(store_A_sales, store_B_sales, store_C_sales)

print(f"F-statistic: {f_statistic:.2f}")
print(f"P-value: {p_value:.4f}")

# Interpret the results
alpha = 0.05  # Significance level

F-statistic: 128.45
P-value: 0.0000


In [29]:
if p_value < alpha:
    print("Reject the null hypothesis. There are significant differences in sales between the three stores.")
else:
    print("Fail to reject the null hypothesis. There are no significant differences in sales between the three stores.")


Reject the null hypothesis. There are significant differences in sales between the three stores.
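
#### The one-way test above does not use the repeated structure of the data. The sketch below shows one way to compute a repeated measures ANOVA by hand, treating each day as the repeated "subject", and then follows up with paired t-tests under a Bonferroni correction. The six-day sales table is hypothetical and kept small for readability. The choice of Bonferroni-corrected paired t-tests as the post-hoc method is an assumption, not the only option.

```python
import numpy as np
from scipy import stats

# Hypothetical sales: rows are days (the repeated "subject"), columns are stores A, B, C.
sales = np.array([
    [100, 92, 70],
    [110, 95, 78],
    [120, 104, 80],
    [130, 105, 88],
    [115, 110, 86],
    [105, 101, 75],
], dtype=float)
n_days, n_stores = sales.shape
grand_mean = sales.mean()

# Partition the total variation into store, day, and residual parts.
ss_total = ((sales - grand_mean) ** 2).sum()
ss_stores = n_days * ((sales.mean(axis=0) - grand_mean) ** 2).sum()
ss_days = n_stores * ((sales.mean(axis=1) - grand_mean) ** 2).sum()
ss_error = ss_total - ss_stores - ss_days

# The store effect is tested against the residual left after removing day-to-day variation.
df_stores = n_stores - 1
df_error = (n_stores - 1) * (n_days - 1)
f_value = (ss_stores / df_stores) / (ss_error / df_error)
p_value = stats.f.sf(f_value, df_stores, df_error)
print(f"Repeated measures ANOVA: F={f_value:.2f}, p={p_value:.4g}")

# Post-hoc: paired t-tests with a Bonferroni correction for 3 comparisons.
pairs = [(0, 1, "A vs B"), (0, 2, "A vs C"), (1, 2, "B vs C")]
adjusted = {}
for i, j, label in pairs:
    t_stat, p_raw = stats.ttest_rel(sales[:, i], sales[:, j])
    adjusted[label] = min(p_raw * len(pairs), 1.0)
    print(f"{label}: t={t_stat:.2f}, Bonferroni-adjusted p={adjusted[label]:.4g}")
```

#### Removing the day-to-day variation from the error term is what distinguishes this analysis from the one-way ANOVA; this sketch also assumes sphericity, which should be checked on real data.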
