Q1. Ans

ANOVA assumes that the data within each group are approximately normally distributed; when this holds, the Type I error rate stays close to the alpha level specified for the test. ANOVA also assumes homogeneity of variance, meaning the variance within each group should be approximately equal, and that the observations are independent of each other.

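Examples of violations that can affect the validity of the results:

1. Violation of the homogeneity of variance assumption (heteroscedasticity), for example when one group is far more spread out than the others.
2. Violation of the sphericity assumption in repeated-measures designs, where the variances of the differences between conditions are unequal.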
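These assumptions can be checked before running the ANOVA. The sketch below is one possible approach, not the only one: it applies the Shapiro-Wilk test to each group and Levene's test across groups using SciPy. The data are synthetic and invented purely for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data, invented for illustration only: three groups of 20 values.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=20) for mu in (10.0, 11.0, 12.0)]

# Normality: Shapiro-Wilk on each group (H0: the sample comes from a normal distribution).
shapiro_p = [stats.shapiro(g).pvalue for g in groups]

# Homogeneity of variance: Levene's test across groups (H0: equal variances).
levene_stat, levene_p = stats.levene(*groups)

print("Shapiro-Wilk p-values:", np.round(shapiro_p, 3))
print("Levene p-value:", round(float(levene_p), 3))
```

A small p-value from either test is a warning sign that the corresponding assumption may not hold.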

Q2. Ans

There are two main types: one-way and two-way. Two-way tests can be with or without replication.

1. One-way ANOVA between groups: used to test whether the means of two or more independent groups differ on a single factor. With exactly two groups it is equivalent to an independent-samples t-test.
2. Two-way ANOVA without replication: used when there are two factors and only one observation for each combination of factor levels. For example, you test one set of individuals before and after they take a medication (individual and time point are the two factors) to see whether it works.
3. Two-way ANOVA with replication: used when there are two factors and more than one observation for each combination of factor levels. For example, patients from two different hospitals trying two different therapies, with several patients in each hospital-therapy combination.

Q3. Ans

Analysis of Variance (ANOVA) is a statistical method that compares the means of different groups by analyzing variances. It is used in a range of scenarios to determine whether there is any difference between the group means.

An ANOVA uses an F-test to evaluate whether the variance among the groups is greater than the variance within the groups. Another way to view this is as a partition of variance: the total variation in the data is divided into the variation between groups and the variation within groups.
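The partition of variance can be sketched numerically. The example below uses invented values for three groups, computes the between-group and within-group sums of squares by hand with NumPy, and compares the resulting F-statistic with `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Invented measurements for three groups (illustration only).
groups = [
    np.array([4.0, 5.0, 6.0, 5.0]),
    np.array([6.0, 7.0, 8.0, 7.0]),
    np.array([9.0, 8.0, 10.0, 9.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k, n = len(groups), all_values.size

# Between-group sum of squares: group means around the grand mean.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: observations around their own group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
# Total sum of squares: observations around the grand mean.
ss_total = ((all_values - grand_mean) ** 2).sum()

# F is the ratio of the between-group to the within-group mean square.
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual = stats.f.sf(f_manual, k - 1, n - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print("manual F:", f_manual, "scipy F:", f_scipy)
print("SS total:", ss_total, "= between + within:", ss_between + ss_within)
```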

Q4. Ans

1. Sum of Squares Total (SST) – The sum of squared differences between the individual data points ($y_i$) and the mean of the response variable ($\bar{y}$).

$SST = \sum_i (y_i - \bar{y})^2$

2. Sum of Squares Regression (SSR) – The sum of squared differences between the predicted data points ($\hat{y}_i$) and the mean of the response variable ($\bar{y}$).

$SSR = \sum_i (\hat{y}_i - \bar{y})^2$

3. Sum of Squares Error (SSE) – The sum of squared differences between the predicted data points ($\hat{y}_i$) and the observed data points ($y_i$).

$SSE = \sum_i (\hat{y}_i - y_i)^2$

The error is the difference between the observed value and the predicted value, and we usually want to minimize it. For a least-squares fit that includes an intercept, the three terms satisfy $SST = SSR + SSE$.

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

# reading csv file as pandas dataframe
data = pd.read_csv('headbrain2.csv')

# independent variable
x = data['Head Size(cm^3)']

# output variable (dependent)
y = data['Brain Weight(grams)']

# adding constant
x = sm.add_constant(x)

#fit linear regression model
model = sm.OLS(y, x).fit()

#display model summary
print(model.summary())

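# residual sum of squares (SSE above; statsmodels calls it `ssr`)
print(model.ssr)

# explained (regression) sum of squares (SSR above; statsmodels calls it `ess`)
print(model.ess)

# total sum of squares about the mean (SST above)
print(model.centered_tss)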
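As a self-contained check of the three sums of squares, the sketch below fits a straight line with NumPy to a handful of invented (x, y) pairs (not the head/brain dataset) and confirms that the total sum of squares equals the regression part plus the error part.

```python
import numpy as np

# Invented (x, y) pairs for illustration; not the head/brain dataset.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Least-squares line with an intercept.
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)      # total
ssr = np.sum((y_hat - y.mean()) ** 2)  # regression (explained)
sse = np.sum((y - y_hat) ** 2)         # error (residual)

print("SST:", sst)
print("SSR + SSE:", ssr + sse)  # agrees with SST up to floating-point rounding
print("R^2:", ssr / sst)
```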

Q5. Ans

In a two-way ANOVA, you can calculate the main effects and interaction effects in Python with a statistical library such as statsmodels, which fits the model and produces the ANOVA table.

Here's a general outline of how you can calculate the main effects and interaction effects using Python:

1. Import the required libraries
2. Prepare your data:

Load your data into a pandas DataFrame, with columns representing the factors and the response variable.

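If necessary, make sure the factors are treated as categorical, for example by wrapping them in C() in the model formula; plain numeric codes would otherwise be fitted as continuous predictors.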

3. Perform the two-way ANOVA:

Create a model using the ols() function from statsmodels.formula.api module, specifying the formula for the model.

Fit the model using the fit() method.

Extract the ANOVA table using the anova_lm() function from statsmodels.stats.anova module.

4. Interpret the results:

Look for the main effects and interaction effects in the ANOVA table.

The main effects represent the effect of each factor independently on the response variable.

The interaction effect represents how the effect of one factor on the response variable changes across the levels of the other factor.
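To make the last step concrete, here is a small numeric sketch of what main effects and an interaction mean in a 2x2 design. The cell means are invented for illustration, and the factor names follow the placeholders used in this answer.

```python
import numpy as np

# Invented cell means for a 2x2 design (illustration only).
# Rows: factor1 levels (a1, a2); columns: factor2 levels (b1, b2).
cell_means = np.array([[10.0, 12.0],
                       [14.0, 20.0]])

# Main effect of factor1: difference between the row (marginal) means.
main_factor1 = cell_means[1].mean() - cell_means[0].mean()

# Main effect of factor2: difference between the column (marginal) means.
main_factor2 = cell_means[:, 1].mean() - cell_means[:, 0].mean()

# Interaction: how much the factor2 effect changes between the factor1 levels.
effect_b_at_a1 = cell_means[0, 1] - cell_means[0, 0]
effect_b_at_a2 = cell_means[1, 1] - cell_means[1, 0]
interaction = effect_b_at_a2 - effect_b_at_a1

print("main effect factor1:", main_factor1)
print("main effect factor2:", main_factor2)
print("interaction contrast:", interaction)
```

A non-zero interaction contrast means the lines in an interaction plot would not be parallel; the ANOVA table tests whether that departure is larger than expected from noise.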

In [1]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Load your data into a pandas DataFrame
data = pd.read_csv("your_data.csv")

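# Create the ANOVA model; C() marks each factor as categorical and
# `*` expands to both main effects plus their interaction
model = ols('response_variable ~ C(factor1) * C(factor2)', data=data).fit()

# Perform the ANOVA (Type II sums of squares)
anova_table = sm.stats.anova_lm(model, typ=2)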

# Print the ANOVA table
print(anova_table)



Q6. Ans

In a one-way ANOVA, the F-statistic and p-value provide important information about the differences between the groups and the statistical significance of those differences.

In the given scenario, where the F-statistic is 5.23 and the p-value is 0.02, we can interpret the results as follows:

F-Statistic: The F-statistic measures the ratio of between-group variability to within-group variability. A larger F-statistic indicates a larger difference between the group means relative to the variability within each group. In this case, the F-statistic of 5.23 means the between-group mean square is about 5.23 times the within-group mean square, which suggests that there are some differences between the group means.

P-Value: The p-value is the probability of observing an F-statistic at least as large as the one obtained if the null hypothesis is true, that is, if all group means are equal. A smaller p-value indicates stronger evidence against the null hypothesis and suggests that the observed differences between the group means are unlikely to occur by chance alone. In this case, the p-value of 0.02 is below the usual 0.05 threshold, so there is sufficient evidence to reject the null hypothesis.

Based on these results, we can conclude the following:

The differences between the groups are statistically significant. The probability of obtaining the observed F-statistic or a more extreme value if there were no true differences between the group means is 0.02 or 2%. This is below the typical significance level of 0.05, indicating that the observed differences are unlikely to be due to random chance.

We can infer that there are significant differences between at least two of the groups in terms of the variable being studied. However, the one-way ANOVA itself does not provide specific information about which groups differ from each other. Additional post-hoc tests, such as Tukey's HSD or pairwise comparisons, can be performed to determine which specific group means are significantly different.

It is important to note that the statistical significance does not imply practical or meaningful significance. While the differences between the groups are statistically significant, further analysis or domain knowledge is required to assess the practical importance of these differences.
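One common way to gauge practical importance is an effect size such as eta squared, the share of total variation explained by group membership. The sketch below recovers it from the F-statistic. The scenario does not state the degrees of freedom, so (2, 14) is an assumption, chosen only because it is consistent with a p-value of about 0.02.

```python
from scipy import stats


def eta_squared(f_stat: float, df_between: int, df_within: int) -> float:
    """Eta squared (SS_between / SS_total) recovered from a one-way ANOVA F-statistic."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)


f_stat = 5.23                   # from the scenario
df_between, df_within = 2, 14   # ASSUMED: not given in the scenario

p_value = stats.f.sf(f_stat, df_between, df_within)
effect = eta_squared(f_stat, df_between, df_within)

print("p-value under the assumed degrees of freedom:", round(float(p_value), 3))
print("eta squared:", round(effect, 3))
```

With different degrees of freedom the same F-statistic would give a different effect size, which is why the sample sizes matter when judging practical significance.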

Q7. Ans

One of the most effective ways of dealing with missing data is multiple imputation (MI). Using MI, we can create multiple plausible replacements of the missing data, given what we have observed and a statistical model (the imputation model).

Once the pattern of missingness has been classified, the missing values need to be treated.

1. Deletion:

The deletion technique removes the missing values from the dataset. The following are the types of deletion.

2. Listwise deletion:

Listwise deletion is preferred when there is a Missing Completely at Random case. In Listwise deletion entire rows(which hold the missing values) are deleted. It is also known as complete-case analysis as it removes all data that have one or more missing values.

3. Pairwise Deletion:

Pairwise deletion is used when values are missing completely at random (MCAR). Each analysis uses every case that has the variables it needs, so a row is excluded only from the calculations that involve its missing values. This reduces the data loss that happens in listwise deletion.

4. Dropping complete columns

If a column holds a lot of missing values, say more than 80%, and the feature is not meaningful, we can drop the entire column.

5. Imputation with constant value:

As the name suggests, it replaces the missing values with either zero or any other constant value.

6. Imputation using Statistics:

The syntax is the same as imputation with a constant; only the strategy passed to scikit-learn's SimpleImputer changes. It can be "mean", "median", or "most_frequent".

7. K_Nearest Neighbor Imputation:

The KNN algorithm helps to impute missing data by finding the closest neighbors using the Euclidean distance metric to the observation with missing data and imputing them based on the non-missing values in the neighbors.
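Several of the techniques above can be sketched with pandas alone. The DataFrame below is invented for illustration, and pandas methods stand in for the scikit-learn imputers mentioned in the text, so this is a sketch of the ideas rather than the exact API the answer refers to.

```python
import numpy as np
import pandas as pd

# Invented data with gaps (illustration only).
df = pd.DataFrame({
    "age":    [25.0, np.nan, 31.0, 40.0, 28.0, 35.0],
    "income": [50.0, 62.0, np.nan, 80.0, 58.0, 70.0],
    "notes":  [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],  # 5 of 6 missing
})

# Dropping complete columns: remove columns with more than 80% missing values.
missing_share = df.isna().mean()
trimmed = df.drop(columns=missing_share[missing_share > 0.8].index)

# Listwise deletion: drop every row that has at least one missing value.
listwise = trimmed.dropna()

# Pairwise deletion: DataFrame.corr() uses all complete pairs for each column pair.
pairwise_corr = trimmed.corr()

# Imputation with a constant value, then with column statistics.
constant_filled = trimmed.fillna(0)
mean_filled = trimmed.fillna(trimmed.mean())
median_filled = trimmed.fillna(trimmed.median())

print(listwise)
print(mean_filled)
```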

Q8. Ans

The post hoc test I'll use is Tukey's method. There are a variety of post hoc tests you can choose from, but Tukey's method is the most common for comparing all possible group pairings. There are two ways to present post hoc test results—adjusted p-values and simultaneous confidence intervals.

Post hoc tests are an integral part of ANOVA. When you use ANOVA to test the equality of at least three group means, statistically significant results indicate that not all of the group means are equal. Use post hoc tests to explore differences between multiple group means while controlling the experiment-wise error rate.
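A minimal sketch of this workflow, assuming SciPy 1.8 or later (which provides `scipy.stats.tukey_hsd`): run the omnibus one-way ANOVA first, then Tukey's HSD for the pairwise comparisons. The three samples are small illustrative values, not data from any study discussed here.

```python
import numpy as np
from scipy import stats

# Small illustrative samples for three groups (not real study data).
group_a = np.array([24.5, 23.5, 26.4, 27.1, 29.9])
group_b = np.array([28.4, 34.2, 29.5, 32.2, 30.1])
group_c = np.array([26.1, 28.3, 24.3, 26.2, 27.8])

# Omnibus test first: are all three means plausibly equal?
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

# Tukey's HSD compares every pair while controlling the family-wise error rate.
tukey = stats.tukey_hsd(group_a, group_b, group_c)
ci = tukey.confidence_interval(confidence_level=0.95)

print("ANOVA p-value:", p_value)
print(tukey)         # pairwise differences, adjusted p-values, confidence intervals
print(tukey.pvalue)  # 3x3 matrix of adjusted p-values
```

The two presentation styles mentioned above correspond to `tukey.pvalue` (adjusted p-values) and `ci.low` / `ci.high` (simultaneous confidence intervals).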

Q9. Ans

In [3]:
import scipy.stats as stats
import numpy as np

# Define the weight loss data for each diet
diet_A = np.array([2.1, 1.8, 2.3, ..., 1.9])  # Replace with actual weight loss data for Diet A
diet_B = np.array([1.5, 1.7, 1.6, ..., 1.9])  # Replace with actual weight loss data for Diet B
diet_C = np.array([1.2, 1.3, 1.1, ..., 1.4])  # Replace with actual weight loss data for Diet C

# Perform the one-way ANOVA
f_statistic, p_value = stats.f_oneway(diet_A, diet_B, diet_C)

# Print the F-statistic and p-value
print("F-statistic:", f_statistic)
print("p-value:", p_value)



To interpret the results:

F-Statistic: The F-statistic measures the ratio of between-group variability to within-group variability. As an illustration, suppose the test returns an F-statistic of 2.91 and a p-value of 0.058 (the actual values depend on the data supplied). An F-statistic of 2.91 means the variability between the mean weight loss of the three diets is about 2.91 times the variability within each diet.

P-Value: The p-value is the probability of observing an F-statistic at least as large as the one obtained if the null hypothesis is true, that is, if the mean weight loss is the same for all three diets. A p-value of 0.058 means there is a 5.8% chance of observing such an F-statistic if the three diets had equal mean weight loss.

Based on the results:

Since the p-value (0.058) is greater than the typical significance level of 0.05, we fail to reject the null hypothesis. This means that we do not have enough evidence to conclude that there are significant differences between the mean weight loss of the three diets.

It is important to note that the decision not to reject the null hypothesis does not necessarily imply that there are no differences between the diets. It simply means that we do not have sufficient evidence to claim that there are significant differences based on the given sample data.

Further analysis or studies with larger sample sizes may be needed to make more conclusive statements about the differences between the mean weight loss of the three diets.

Remember to replace the placeholder weight loss data for each diet (diet_A, diet_B, diet_C) with the actual weight loss data you have collected for each diet.
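For a version that runs end to end, the sketch below uses invented weight-loss values (in kg, purely illustrative and not the data for this question) and applies the decision rule at a significance level of 0.05. The variable names `demo_a`, `demo_b`, and `demo_c` are hypothetical.

```python
import numpy as np
from scipy import stats

# Invented weight-loss values in kg (illustration only; not study data).
demo_a = np.array([3.0, 2.5, 3.5, 3.0, 3.0])
demo_b = np.array([2.0, 2.5, 1.5, 2.0, 2.0])
demo_c = np.array([1.0, 1.5, 0.5, 1.0, 1.0])

# One-way ANOVA across the three diets.
f_stat, p_val = stats.f_oneway(demo_a, demo_b, demo_c)

# Decision rule at the chosen significance level.
alpha = 0.05
reject_null = p_val < alpha

print("F-statistic:", f_stat)
print("p-value:", p_val)
print("Reject H0 (equal means)?", reject_null)
```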

Q10. Ans

In [4]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

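# Create a DataFrame with a fully crossed design (assumed balanced layout):
# 3 programs x 2 experience levels x 5 employees per cell = 30 rows
data = pd.DataFrame({
    'Software': ['A'] * 10 + ['B'] * 10 + ['C'] * 10,
    'Experience': (['Novice'] * 5 + ['Experienced'] * 5) * 3,
    'Time': [15.2, 14.5, 17.3, ..., 16.9]  # Replace with the 30 actual time values
})

# Perform the two-way ANOVA (main effects plus interaction)
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=data).fit()
anova_table = sm.stats.anova_lm(model, typ=2)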

# Print the ANOVA table
print(anova_table)



To interpret the results:

The ANOVA table will provide the F-statistics and p-values for the main effects of the software programs and employee experience level, as well as the interaction effect between the two factors.

Each F-statistic is the ratio of the variability attributed to that effect (its mean square) to the residual variability. Higher F-values indicate larger differences between the corresponding group means relative to the unexplained variability.

Each p-value is the probability of observing an F-statistic at least as large as the one obtained if the corresponding effect were truly absent. Smaller p-values indicate stronger evidence against the null hypothesis of no effect.

If the p-value is less than the chosen significance level (e.g., 0.05), you can reject the null hypothesis and conclude that there are significant effects present. This would indicate the presence of main effects or interaction effects.

If the p-value is greater than or equal to the significance level, you fail to reject the null hypothesis and cannot conclude that there are significant effects.

Q11. Ans

In [5]:
import scipy.stats as stats
import statsmodels.stats.multicomp as mc

# Define the test scores for the control group and the experimental group
control_group = [80, 75, 85, ..., 78]  # Replace with actual test scores for the control group
experimental_group = [85, 90, 82, ..., 88]  # Replace with actual test scores for the experimental group

# Perform the two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)

# Print the t-statistic and p-value
print("t-statistic:", t_statistic)
print("p-value:", p_value)

# Perform post-hoc test (Tukey's HSD) for pairwise comparisons
data = control_group + experimental_group
group_labels = ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group)

posthoc = mc.MultiComparison(data, group_labels)
result = posthoc.tukeyhsd()

# Print the post-hoc test results
print(result)



To interpret the results:

Two-Sample t-Test: The t-statistic measures the difference between the means of the two groups relative to the variability within each group. The p-value is the probability of observing a t-statistic at least as extreme as the one obtained if the null hypothesis is true, that is, if the two groups have the same mean. If the p-value is less than the chosen significance level (e.g., 0.05), you can reject the null hypothesis and conclude that there is a significant difference in test scores between the control group and the experimental group.

Post-Hoc Test (Tukey's HSD): The post-hoc test (Tukey's HSD) is used for pairwise comparisons to determine which specific group means differ significantly from each other. The test provides confidence intervals and p-values for all possible pairwise comparisons between the groups. If the confidence interval does not include zero and the p-value is less than the chosen significance level, you can conclude that the corresponding group means differ significantly. With only two groups there is a single pairwise comparison, so Tukey's HSD leads to the same conclusion as the pooled-variance t-test; it becomes informative when three or more groups are compared.
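The relationship between the two tests for exactly two groups can be checked numerically. The sketch below uses invented scores and `scipy.stats.tukey_hsd` (assumed available, SciPy 1.8 or later) rather than the statsmodels class used in the cell above; the names `control_demo` and `experimental_demo` are hypothetical.

```python
import numpy as np
from scipy import stats

# Invented test scores (illustration only).
control_demo = np.array([72.0, 68.0, 75.0, 70.0, 74.0, 69.0])
experimental_demo = np.array([78.0, 74.0, 80.0, 77.0, 73.0, 79.0])

# Pooled-variance two-sample t-test (SciPy's default, equal_var=True).
t_stat, t_p = stats.ttest_ind(control_demo, experimental_demo)

# Tukey's HSD on the same two groups: only one pair exists.
tukey_p = stats.tukey_hsd(control_demo, experimental_demo).pvalue[0, 1]

print("t-statistic:", t_stat)
print("t-test p-value:", t_p)
print("Tukey p-value: ", tukey_p)
```

The two p-values agree up to numerical precision, which is why the post-hoc step adds information only when there are at least three groups.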

Q12. Ans

In [6]:
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import MultiComparison

# Create a long-format DataFrame: one row per store per day
# (assumes every store is observed on each of the 30 days, giving 90 rows)
data = pd.DataFrame({
    'Day': [day for day in range(30) for _ in range(3)],
    'Store': ['A', 'B', 'C'] * 30,
    'Sales': [100, 120, 90, ..., 110]  # Replace with the 90 actual sales values
})

# Perform the repeated measures ANOVA: each day is the repeated unit
# ("subject") and Store is the within-subject factor
anova_result = AnovaRM(data, depvar='Sales', subject='Day', within=['Store']).fit()

# Print the ANOVA table
print(anova_result.anova_table)

# Perform post-hoc test (Tukey's HSD) for pairwise comparisons
# (this treats the stores as independent groups and ignores the pairing by day)
posthoc = MultiComparison(data['Sales'], data['Store'])
result = posthoc.tukeyhsd()

# Print the post-hoc test results
print(result)

To interpret the results:

The ANOVA table reports the F-statistic, degrees of freedom, and p-value for the within-subject factor, Store, after accounting for day-to-day variation in sales.

The p-value in the ANOVA table is the probability of observing an F-statistic at least as large as the one obtained if the null hypothesis is true, that is, if the stores have the same mean sales. Smaller p-values indicate stronger evidence against the null hypothesis.

If the p-value is less than the chosen significance level (e.g., 0.05), you can reject the null hypothesis and conclude that there are significant differences in sales between the three stores.

The post-hoc test (Tukey's HSD) for pairwise comparisons allows you to determine which specific store means differ significantly from each other. The test provides confidence intervals and p-values for all possible pairwise comparisons between the stores. You can interpret the results by comparing the confidence intervals and the corresponding p-values. If the confidence interval does not include zero and the p-value is less than the chosen significance level, you can conclude that the corresponding store means differ significantly.
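A runnable sketch of a repeated-measures ANOVA on synthetic data is shown below. It assumes every store is observed on each of the 30 days and treats each day as the repeated unit, using `AnovaRM` from statsmodels. The sales figures are generated with NumPy and are not real; the name `demo` and the chosen store and day effects are hypothetical.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Synthetic daily sales (illustration only): every store is observed on every day.
rng = np.random.default_rng(7)
days = np.repeat(np.arange(30), 3)
stores = np.tile(["A", "B", "C"], 30)

store_level = {"A": 100.0, "B": 110.0, "C": 95.0}   # invented store averages
day_effect = rng.normal(scale=5.0, size=30)          # shared by all stores on a given day
noise = rng.normal(scale=3.0, size=90)

sales = np.array([store_level[s] for s in stores]) + day_effect[days] + noise
demo = pd.DataFrame({"Day": days, "Store": stores, "Sales": sales})

# Day is the "subject" (repeated unit); Store is the within-subject factor.
result = AnovaRM(demo, depvar="Sales", subject="Day", within=["Store"]).fit()
print(result.anova_table)
```

Because the day-to-day variation is removed before the stores are compared, this design is usually more sensitive than a one-way ANOVA that ignores which day each sale came from.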