
# **Statistics Advance 06 - Assignment**

## **Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.**


**ANalysis Of VAriance (ANOVA)** is a statistical technique used to compare the means of more than two groups. It is based on the **F-distribution** and assumes the following:

- **1. Independence:** The observations are independent of each other. This means that the outcome of one observation does not affect the outcome of another observation.

- **2. Normality:** The data within each group follows a normal distribution. This assumption can be checked using Q-Q plots or tests like the Shapiro-Wilk test.

- **3. Homogeneity of variances:** The variance of the data is equal across all groups. This assumption can be checked using Levene's test or Bartlett's test.

***Violations of these assumptions can impact the validity of the results and lead to incorrect conclusions. Here are some examples of violations:***

- **1. Dependence:** If the observations are not independent, the F-test used in ANOVA may not be valid. For example, if you are comparing the test scores of students in different classrooms, and some students have siblings in other classrooms, the scores may not be independent.

- **2. Non-normality:** If the data within a group does not follow a normal distribution, the F-test may not be valid. For example, if you are comparing the weights of mice from different treatment groups, and the weights of mice in one group are heavily skewed, the normality assumption is violated.

- **3. Heteroscedasticity:** If the variance of the data is not equal across all groups, the F-test may not be valid. For example, if you are comparing the salaries of employees in different departments, and the variance of salaries is much larger in one department than in others, the homogeneity of variances assumption is violated.

***When these assumptions are violated, it may still be possible to use ANOVA, but alternative methods such as the Welch ANOVA or non-parametric tests like the Kruskal-Wallis test may be more appropriate. It is important to assess the impact of these violations on the results and interpret them with caution.***
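
The checks named above can be sketched with `scipy.stats`. The three groups below are made-up illustration data and the helper name `check_anova_assumptions` is hypothetical; this is a minimal sketch, not a complete diagnostic workflow.

```python
from scipy import stats

# Hypothetical example data: three independent groups
group_a = [23.1, 25.4, 24.8, 26.0, 22.7, 25.1, 24.3, 23.9]
group_b = [27.2, 26.5, 28.1, 27.8, 26.9, 28.4, 27.0, 27.5]
group_c = [24.0, 29.5, 21.2, 31.8, 19.9, 30.4, 22.6, 28.7]


def check_anova_assumptions(*groups):
    """Return p-values for normality (per group) and equal variances."""
    # Shapiro-Wilk: the null hypothesis is that the sample is normally distributed
    shapiro_p = [stats.shapiro(g).pvalue for g in groups]
    # Levene: the null hypothesis is that all groups share the same variance
    levene_p = stats.levene(*groups).pvalue
    return {"shapiro_p": shapiro_p, "levene_p": levene_p}


checks = check_anova_assumptions(group_a, group_b, group_c)
print(checks)

# Kruskal-Wallis is a rank-based alternative when normality is doubtful
kruskal_result = stats.kruskal(group_a, group_b, group_c)
print(f"Kruskal-Wallis H = {kruskal_result.statistic:.3f}, p = {kruskal_result.pvalue:.4f}")
```

A small p-value from either check is evidence against the corresponding assumption. Independence cannot be tested this way; it has to come from the study design.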

## **Q2. What are the three types of ANOVA, and in what situations would each be used?**


**1. One-way ANOVA** is used when there is one independent variable with two or more levels or categories.

*For example, comparing the test scores of students from different schools or the effect of different fertilizers on plant growth.*

**2. Two-way ANOVA** is used when there are two independent variables, each with two or more levels or categories.

*For example, comparing the test scores of students from different schools, broken down by gender, or the effect of different fertilizers and watering schedules on plant growth.*

*Two-way ANOVA can also be used to examine the interaction between the two independent variables.*

**3. N-way ANOVA (with n being the number of independent variables)** is used when there are three or more independent variables, each with two or more levels or categories.

*For example, comparing the test scores of students from different schools, broken down by gender and socioeconomic status, or the effect of different fertilizers, watering schedules, and sunlight exposure on plant growth.*

## **Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?**


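In ANOVA, partitioning of variance means splitting the total variability in the data (the total sum of squares) into a part explained by differences between group means (the between-group sum of squares) and a part left over within the groups (the within-group, or residual, sum of squares). Understanding this partition is important because **it allows researchers to determine the relative contributions of different sources of variation to the total variation in the data**, and because the F-statistic is the ratio of the between-group mean square to the within-group mean square. This is useful for scientists and researchers who want to understand the relative importance of different factors that may be influencing the outcome of their experiments or studies.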
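
For a one-way design, the partition can be written out explicitly. The notation is introduced here for illustration only: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, group means $\bar{y}_j$, and grand mean $\bar{y}$.

$$
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}\right)^2}_{SS_{\text{total}}}
=
\underbrace{\sum_{j=1}^{k} n_j\left(\bar{y}_j-\bar{y}\right)^2}_{SS_{\text{between}}}
+
\underbrace{\sum_{j=1}^{k}\sum_{i=1}^{n_j}\left(y_{ij}-\bar{y}_j\right)^2}_{SS_{\text{within}}}
$$

Dividing $SS_{\text{between}}$ by $k-1$ and $SS_{\text{within}}$ by $N-k$ gives the two mean squares, and their ratio is the F-statistic.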

Several ANOVA variants and related procedures can be used in different situations, depending on the number of factors or independent variables being studied, the number of levels or groups within each factor, and whether the usual assumptions hold.

- **One-way ANOVA** is used when there is a single factor or independent variable with multiple levels or groups. This is the simplest form of ANOVA and is used to compare the means of the different groups within the factor.
- **Two-way ANOVA** is used when there are two factors or independent variables, each with multiple levels or groups. This allows researchers to evaluate the individual and joint effects of the two factors on the dependent variable.
- **Factorial ANOVA** is used when there are multiple factors or independent variables, each with multiple levels or groups. This allows researchers to examine the combined effects of all the factors on the dependent variable.
- **Welch's ANOVA** is used when the assumption of equal variances is not met, and the variances of the different groups are not equal.
- **Ranked ANOVA** is used when the data is ordinal or when the assumptions of ANOVA are violated. This involves replacing the values with their rank ordering and running a ranked ANOVA on the transformed data.
- **Games-Howell test** is used as a post-hoc test when the assumption of homogeneity of variances has been violated, and the variances of the different groups are not equal.


***In general, ANOVA is used to compare the means of multiple groups and determine if there are any statistical differences between the means. It is important to ensure that the assumptions of ANOVA are met in order for the test results to be valid.***

## **Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?**


In [None]:
import numpy as np
import scipy.stats as stats

# example data
group1 = [3, 5, 6, 7, 9]
group2 = [4, 5, 5, 7, 10]
group3 = [3, 4, 4, 6, 7]

# combining all groups into a single list
groups = [group1, group2, group3]

# grand mean over every observation
all_values = np.concatenate(groups)
grand_mean = np.mean(all_values)

# total sum of squares (SST): squared deviations of every observation from the grand mean
sst = np.sum((all_values - grand_mean) ** 2)

# explained sum of squares (SSE): between-group variation
sse = np.sum([len(xi) * (np.mean(xi) - grand_mean) ** 2 for xi in groups])

# residual sum of squares (SSR): within-group variation, equal to SST - SSE
ssr = sst - sse

print(f"Total Sum of Squares (SST): {sst:.2f}")
print(f"Explained Sum of Squares (SSE): {sse:.2f}")
print(f"Residual Sum of Squares (SSR): {ssr:.2f}")

Total Sum of Squares (SST): 59.33
Explained Sum of Squares (SSE): 5.73
Residual Sum of Squares (SSR): 53.60
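
As a sanity check, the sums of squares can be turned into an F-statistic and compared with `scipy.stats.f_oneway`. The sketch below reuses the same example data; the variable names `ss_total`, `ss_between`, and `ss_within` are chosen here and correspond to SST, SSE (explained), and SSR (residual) in the question's terminology.

```python
import numpy as np
from scipy.stats import f_oneway

# Same example data as in the cell above
group1 = [3, 5, 6, 7, 9]
group2 = [4, 5, 5, 7, 10]
group3 = [3, 4, 4, 6, 7]
groups = [np.asarray(g, dtype=float) for g in (group1, group2, group3)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Total, between-group (explained), and within-group (residual) sums of squares
ss_total = np.sum((all_values - grand_mean) ** 2)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups
k, n_total = len(groups), len(all_values)
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

f_scipy = f_oneway(*groups).statistic
print(f"SST = {ss_total:.2f}, SSB + SSW = {ss_between + ss_within:.2f}")
print(f"F (manual) = {f_manual:.4f}, F (scipy) = {f_scipy:.4f}")
```

If the partition is computed correctly, the between-group and within-group parts add up to the total, and the manual F matches the library value.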


## **Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?**


In a two-way ANOVA (Analysis of Variance), you can calculate the main effects and interaction effects in Python by fitting a linear model with `statsmodels` and requesting its ANOVA table. Here's a general outline of the process:

In [None]:
# import necessary libraries
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# load the dataset
# Assuming you have a DataFrame 'df' with columns 'A', 'B', and 'Y'
# A and B are categorical variables, and Y is the dependent variable
data = {'A': ['A1', 'A2', 'A1', 'A2', 'A1', 'A2'],
        'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1'],
        'Y': [5, 8, 6, 9, 7, 10]}
df = pd.DataFrame(data)

# fit the model
model = ols('Y ~ A * B', data=df).fit()

# perform ANOVA
anova_table = anova_lm(model, typ=2)

#Interpret the result

# Main Effects:
#       A and B rows in the table will provide information about the main effects of variables A and B.
# Interaction Effect:
#       The interaction term A:B in the table represents the interaction effect.

# Print the ANOVA table
print(anova_table)


                sum_sq   df             F   PR(>F)
A         1.350000e+01  1.0  6.750000e+00  0.12169
B         2.777448e-30  1.0  1.388724e-30  1.00000
A:B       1.643460e-32  1.0  8.217301e-33  1.00000
Residual  4.000000e+00  2.0           NaN      NaN


**1. Main Effects (A and B):**

- Look at the rows corresponding to the main effects of variables A and B.
- Check the p-values associated with these main effects.
- If the p-value is below your chosen significance level (e.g., 0.05), you reject the null hypothesis, suggesting there is a significant main effect.

**2. Interaction Effect (A:B):**

- Look at the row corresponding to the interaction term A:B.
- Check the p-value associated with the interaction effect.
- If the p-value is below your chosen significance level, it suggests there is a significant interaction effect between variables A and B.

**3. Conclusion:**

- If there is a significant main effect for A, it means that the levels of A have a statistically significant impact on the dependent variable.
- If there is a significant main effect for B, it means that the levels of B have a statistically significant impact on the dependent variable.
- If there is a significant interaction effect, it means that the combined effect of A and B is not simply the sum of their individual effects.

**4. Interpretation Example:**

- If p-values for A, B, and A:B are all below 0.05, you might conclude that both variables A and B have significant main effects, and there is a significant interaction effect between them.

***Keep in mind that the interpretation might vary based on the specific context of your study and the nature of your variables. The p-values are crucial for determining whether the observed effects are statistically significant.***

## **Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?**


Based on the given **one-way ANOVA** results, with an **F-statistic of 5.23** and a **p-value of 0.02**, we can conclude that ***there is a statistically significant difference between at least two of the groups being compared***.

The **F-statistic** is a ***measure of the ratio of the variation between the groups to the variation within the groups***.

A **larger F-statistic indicates that the differences between the group means are large relative to the variability within the groups**.

> ***In this case, an F-statistic of 5.23 means the between-group variance estimate is about five times the within-group estimate; whether that ratio is statistically significant is judged from the p-value.***

The **p-value** is a **measure of the probability of observing the given F-statistic (or a more extreme value) if the null hypothesis is true**. In this case, the null hypothesis is that all group means are equal. A smaller p-value indicates a lower probability of observing the given F-statistic (or a more extreme value) under the null hypothesis.

> ***In this case, the p-value of 0.02 is less than the typical significance level of 0.05, indicating that the observed F-statistic is unlikely to have occurred by chance if the null hypothesis is true.***

Therefore, **we can reject the null hypothesis and conclude that there is a statistically significant difference between at least two of the groups**.

However, the ANOVA test does not identify which specific groups are different. To determine which groups are different, we can perform post-hoc tests, such as Tukey's HSD test (which compares every pair of group means) or Dunnett's test (which compares each group against a control group).

***In summary, the ANOVA results suggest that there is a statistically significant difference between at least two of the groups being compared. To determine which specific groups are different, you can perform post-hoc tests.***
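
The link between an F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not give the degrees of freedom, so the values below (3 groups, 15 observations) are purely an assumption for illustration; the resulting p-value is therefore close to, but not necessarily equal to, the reported 0.02.

```python
from scipy.stats import f

f_statistic = 5.23
# Hypothetical degrees of freedom: 3 groups (k - 1 = 2) and 15 observations (N - k = 12)
df_between, df_within = 2, 12

# p-value: upper-tail area of the F-distribution beyond the observed statistic
p_value = f.sf(f_statistic, df_between, df_within)

# Critical value that the F-statistic must exceed at alpha = 0.05
f_critical = f.ppf(0.95, df_between, df_within)

print(f"p-value = {p_value:.4f}")
print(f"critical F at alpha = 0.05: {f_critical:.3f}")
print("Reject H0" if p_value < 0.05 else "Fail to reject H0")
```

Comparing the p-value with the significance level and comparing the F-statistic with the critical value are two views of the same decision.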

## **Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?**



**1.** In a repeated measures ANOVA, **missing data is typically handled using listwise deletion**: any subject with a missing value at one or more measurement occasions is dropped from the analysis entirely. This reduces the sample size and therefore the statistical power, and it can bias the results unless the data are missing completely at random.

> A significant result such as the one in the previous question (F = 5.23, p = 0.02) may no longer be detectable once subjects with incomplete data are dropped, simply because fewer subjects remain in the analysis.

**2. One alternative to listwise deletion is to use multiple imputation**, which involves creating multiple copies of the data set and **replacing missing values with plausible values** based on the observed data. This approach can help to maintain power and reduce bias, but it requires making assumptions about the missing data mechanism and can be **computationally intensive**.

**3.** Another alternative is to use a **mixed effects model**, which can handle missing data by **modeling the correlation between repeated measures within subjects**. This approach uses every available observation from each subject instead of discarding incomplete subjects, and it gives valid estimates when the data are missing at random, but it requires making assumptions about the covariance structure of the repeated measures.

***Ultimately, the choice of method for handling missing data in a repeated measures ANOVA will depend on the specific research context and the assumptions that can be made about the missing data mechanism. It is important to consider the potential consequences of different methods and to report any assumptions made in the analysis.***
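
The difference between listwise deletion and a mixed effects model can be sketched as follows. The data are simulated, the column names (`subject`, `time`, `score`) are hypothetical, and the missing values are inserted by hand, so this only illustrates the mechanics rather than a recommended analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subjects, times = 20, ["t1", "t2", "t3"]

# Hypothetical wide-format data: one row per subject, one column per time point
subject_effect = rng.normal(0, 2, size=n_subjects)
wide = pd.DataFrame(
    {t: 10 + i + subject_effect + rng.normal(0, 1, size=n_subjects)
     for i, t in enumerate(times)}
)
wide.insert(0, "subject", np.arange(n_subjects))

# Remove a few measurements by hand to mimic missing data
wide.loc[[1, 5, 9], "t2"] = np.nan
wide.loc[[3, 5], "t3"] = np.nan

# Listwise deletion: every subject with any missing value is dropped
complete_cases = wide.dropna()
print(f"Subjects kept by listwise deletion: {len(complete_cases)} of {n_subjects}")

# Long format keeps every observed measurement, including partial subjects
long_df = wide.melt(id_vars="subject", var_name="time", value_name="score").dropna()
print(f"Observations available to the mixed model: {len(long_df)}")

# Random intercept per subject models the within-subject correlation
model = smf.mixedlm("score ~ C(time)", data=long_df, groups=long_df["subject"])
result = model.fit()
print(result.summary())
```

Here listwise deletion discards four subjects, while the mixed model still uses their observed measurements.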

## **Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.**


After performing an ANOVA, post-hoc tests are used to determine which specific pairs or groups of means are significantly different from one another.

These tests are necessary when the ANOVA reveals a significant overall effect, but does not identify which groups are different. For example, if a one-way ANOVA comparing three diets is significant, a post-hoc test is needed to tell whether A differs from B, A from C, or B from C. Here are some examples of post-hoc tests that can be used in different situations:

**1. Bonferroni test:** This correction is used when you have a small number of comparisons, often ones planned in advance. It divides the significance level by the number of comparisons, which makes it a conservative test that controls the family-wise error rate, which is the probability of making at least one type I error (false positive) in all the comparisons.

**2. Tukey's HSD test:** This test is used when you want to compare every possible pair of means, regardless of the number of groups. For all pairwise comparisons it is generally less conservative than the Bonferroni test, and it also controls the family-wise (also called ***experiment-wise***) error rate.

**3. Scheffe test:** This test is used when you want to compare any subset of means, not just pairs. It is a very flexible test that can handle any number of groups and any number of comparisons, but it is also a conservative test that controls the family-wise error rate.

**4. Student-Newman-Keuls test:** This test is used when you want to compare groups in a stepwise manner, starting with the groups that have the most extreme means. It is a less conservative test than the Bonferroni and Scheffe tests, but it does not control the family-wise error rate.

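**5. Dunnett's test:** This test is used when you want to compare a control group to one or more experimental groups. Because it makes fewer comparisons than an all-pairs procedure, it is more powerful for this purpose than Tukey's HSD or Bonferroni while still controlling the family-wise error rate.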

***The choice of post-hoc test depends on the specific research question, the number of groups, and the number of comparisons. It is important to choose a test that is appropriate for the data and the research question, and to report the results of the post-hoc tests along with the ANOVA results.***
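
As one concrete case, Tukey's HSD can be run with `pairwise_tukeyhsd` from `statsmodels`. The scores below are simulated with deliberately well-separated means, so the data and group labels are assumptions made for illustration.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)

# Hypothetical scores for three groups with clearly different means
scores = np.concatenate([
    rng.normal(loc=50, scale=2, size=20),
    rng.normal(loc=60, scale=2, size=20),
    rng.normal(loc=70, scale=2, size=20),
])
labels = np.repeat(["A", "B", "C"], 20)

# One row per pair of groups, with adjusted p-value and reject decision
tukey = pairwise_tukeyhsd(endog=scores, groups=labels, alpha=0.05)
print(tukey.summary())
```

Each row of the summary reports the mean difference for one pair, its adjusted p-value, a confidence interval, and whether the null hypothesis of equal means is rejected at the chosen family-wise level.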

## **Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.**


In [None]:
import pandas as pd
import numpy as np

# Set a random seed for reproducibility
np.random.seed(42)

# Simulate weight loss for each diet (note: 50 values per diet, 150 in total, rather than 50 participants overall)
weight_loss_A = np.random.normal(loc=3, scale=1, size=50)
weight_loss_B = np.random.normal(loc=5, scale=1, size=50)
weight_loss_C = np.random.normal(loc=4, scale=1, size=50)

# Create a DataFrame
data = pd.DataFrame({
    'Diet': ['A'] * 50 + ['B'] * 50 + ['C'] * 50,
    'WeightLoss': np.concatenate([weight_loss_A, weight_loss_B, weight_loss_C])
})

# Display the DataFrame
data

     Diet  WeightLoss
0       A    3.496714
1       A    2.861736
2       A    3.647689
3       A    4.523030
4       A    2.765847
..    ...         ...
145     C    4.781823
146     C    2.763049
147     C    2.679543
148     C    4.521942


In [None]:
import pandas as pd
from scipy.stats import f_oneway

# Assuming you have created the 'data' DataFrame as provided in the previous example

# Perform one-way ANOVA
anova_result = f_oneway(data[data['Diet'] == 'A']['WeightLoss'],
                        data[data['Diet'] == 'B']['WeightLoss'],
                        data[data['Diet'] == 'C']['WeightLoss'])

# Extract F-statistic and p-value from the result
f_statistic = anova_result.statistic
p_value = anova_result.pvalue

# Print the results
print(f"F-statistic: {f_statistic}")
print(f"P-value: {p_value}")

# Interpret the results
if p_value < 0.05:
    print("The p-value is less than 0.05, indicating a significant difference between at least two diets.")
else:
    print("No significant difference found between the mean weight loss of the three diets.")


F-statistic: 70.8279510701712
P-value: 2.883112717640751e-22
The p-value is less than 0.05, indicating a significant difference between at least two diets.


The result of the one-way ANOVA is as follows:

- **F-statistic: 70.83**
- **P-value: 2.88e-22 (very close to zero)**

**Interpretation:**

- **F-statistic:** This value is a measure of the variation between group means compared to the variation within the groups. ***A higher F-statistic suggests a greater difference between group means.***

- **P-value:** The p-value is the probability of obtaining the observed F-statistic (or more extreme) if the null hypothesis is true. ***In this case, the p-value is extremely small (2.88e-22), much less than the typical significance level of 0.05.***

**Interpretation of the p-value:**

The p-value is less than 0.05, which is the conventional significance level. This indicates strong evidence against the null hypothesis.

**CONCLUSION:**

***Since the p-value is very small, we reject the null hypothesis. There is significant evidence to suggest that there is a difference in mean weight loss among at least two of the three diets (A, B, C).***

***In summary, the result suggests that there are significant differences in weight loss between at least two of the three diets. Further post-hoc tests or pairwise comparisons can be conducted to identify which specific pairs of diets are significantly different from each other. The ANOVA itself indicates that there is some difference but doesn't specify where the differences lie.***

## **Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.**


In [None]:
import pandas as pd
import numpy as np
import random
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Set a random seed for reproducibility
np.random.seed(42)

# Base count: each Program x Experience cell gets n_employees // 2 = 15 simulated times (90 rows in total)
n_employees = 30

# Generate random completion times for each program and experience level
completion_times_A_novice = np.random.normal(loc=20, scale=5, size=n_employees // 2)
completion_times_A_experienced = np.random.normal(loc=22, scale=4, size=n_employees // 2)

completion_times_B_novice = np.random.normal(loc=25, scale=7, size=n_employees // 2)
completion_times_B_experienced = np.random.normal(loc=28, scale=6, size=n_employees // 2)

completion_times_C_novice = np.random.normal(loc=30, scale=8, size=n_employees // 2)
completion_times_C_experienced = np.random.normal(loc=32, scale=7, size=n_employees // 2)

# Combine data into a DataFrame
data = {
    'Program': (['A'] * (n_employees // 2) + ['B'] * (n_employees // 2) + ['C'] * (n_employees // 2)) * 2,
    'Experience': (['Novice'] * (n_employees // 2) + ['Experienced'] * (n_employees // 2)) * 3,
    'CompletionTime': np.concatenate([
        completion_times_A_novice, completion_times_A_experienced,
        completion_times_B_novice, completion_times_B_experienced,
        completion_times_C_novice, completion_times_C_experienced
    ])
}

df = pd.DataFrame(data)
df

    Program   Experience  CompletionTime
0         A       Novice       22.483571
1         A       Novice       19.308678
2         A       Novice       23.238443
3         A       Novice       27.615149
4         A       Novice       18.829233
..      ...          ...             ...
85        C  Experienced       28.487701
86        C  Experienced       38.407815
87        C  Experienced       34.301258
88        C  Experienced       28.291679


In [None]:
# Perform two-way ANOVA
formula = 'CompletionTime ~ Program + Experience + Program:Experience'
model = ols(formula, data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

                         sum_sq    df          F        PR(>F)
Program              173.206587   2.0   2.253037  1.113937e-01
Experience           122.914359   1.0   3.197692  7.734806e-02
Program:Experience  1685.592415   2.0  21.925854  2.177989e-08
Residual            3228.831214  84.0        NaN           NaN


The ANOVA table provides information about the sources of variation in your data and whether these sources contribute significantly to the observed differences. Let's interpret the table:

**1. Program:**

- sum_sq (Sum of Squares): 173.21
- df (Degrees of Freedom): 2
- F (F-statistic): 2.25
- PR(>F) (p-value): 0.111

***Interpretation: The p-value (0.111) is greater than the significance level (usually 0.05). Therefore, we do not have enough evidence to reject the null hypothesis for the variable "Program." In other words, there is no significant difference in completion times between the three software programs.***

**2. Experience:**

- sum_sq: 122.91
- df: 1
- F: 3.20
- PR(>F): 0.077

***Interpretation: The p-value (0.077) is close to the significance level, suggesting a marginal effect of "Experience" on completion times. While it is not statistically significant at the conventional 0.05 level, it might be worth further investigation.***

**3. Program:Experience (Interaction effect between Program and Experience):**

- sum_sq: 1685.59
- df: 2
- F: 21.93
- PR(>F): 2.18e-08

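***Interpretation: The p-value (2.18e-08) is far below 0.05, indicating a strong interaction effect between "Program" and "Experience." This suggests that the effect of one variable on completion times depends on the level of the other variable. Note, however, that the `Program` labels in the DataFrame cycle through A, B, C in blocks of 15, while the completion times are concatenated as A-novice, A-experienced, B-novice, B-experienced, C-novice, C-experienced. Only the first and last blocks carry the label of the distribution they were drawn from, so this interaction likely reflects the label order rather than the simulated means.***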

**4. Residual (Error):**

- sum_sq: 3228.83
- df: 84

Interpretation: This represents the unexplained variability in the data that is not accounted for by the model.

**CONCLUSION:**

***The choice of software program alone does not significantly affect completion times. There is a suggestion that the experience level might have a marginal effect (p-value close to 0.05).***

***The interaction between program and experience is highly significant, indicating that the effect of the program on completion times depends on the experience level.***
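
When cell-wise samples are combined into one DataFrame, the factor labels have to follow the same order as the values. One way to make that hard to get wrong is to build a small frame per cell and concatenate them, then inspect the cell means behind the interaction term. The sketch below uses a hypothetical cell size and hypothetical cell means; the names `cell_means`, `df_cells`, and `means_table` are chosen here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_per_cell = 15  # hypothetical cell size

# Hypothetical cell means: (program, experience) -> mean completion time
cell_means = {
    ("A", "Novice"): 20, ("A", "Experienced"): 22,
    ("B", "Novice"): 25, ("B", "Experienced"): 28,
    ("C", "Novice"): 30, ("C", "Experienced"): 32,
}

# Build one small frame per cell so each value carries its own labels
frames = [
    pd.DataFrame({
        "Program": program,
        "Experience": experience,
        "CompletionTime": rng.normal(loc=mean, scale=5, size=n_per_cell),
    })
    for (program, experience), mean in cell_means.items()
]
df_cells = pd.concat(frames, ignore_index=True)

# Cell means: one row per program, one column per experience level
means_table = df_cells.pivot_table(
    index="Program", columns="Experience", values="CompletionTime", aggfunc="mean"
)
print(means_table)
```

If the gap between the two experience columns is roughly the same in every row, there is little interaction; gaps that change size or sign across programs are what a significant `Program:Experience` term picks up.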

## **Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.**


In [3]:
import numpy as np
import pandas as pd
from scipy import stats
import pingouin as pg

# Create data
np.random.seed(42)  # for reproducibility
control_group = np.random.normal(loc=70, scale=10, size=50)  # control group scores
experimental_group = np.random.normal(loc=75, scale=10, size=50)  # experimental group scores

# Combine data into a DataFrame
data = pd.DataFrame({
    'Group': ['Control'] * 50 + ['Experimental'] * 50,
    'Test_Scores': np.concatenate([control_group, experimental_group])
})

# Conduct two-sample t-test
t_statistic, p_value = stats.ttest_ind(control_group, experimental_group)
print(f'Two-sample t-test results: t = {t_statistic}, p = {p_value}')

# Post-hoc test (Tukey's HSD)
posthoc = pg.pairwise_tukey(data=data, dv='Test_Scores', between='Group')
print('\nPost-hoc test results:')
print(posthoc)

Two-sample t-test results: t = -4.108723928204809, p = 8.261945608702611e-05

Post-hoc test results:
         A             B    mean(A)    mean(B)      diff        se         T  \
0  Control  Experimental  67.745261  75.177809 -7.432548  1.808967 -4.108724   

    p-tukey   hedges  
0  0.000083 -0.81544  


The results of the two-sample t-test and the post-hoc test provide insights into the significant differences between the control group (traditional teaching method) and the experimental group (new teaching method) in terms of student test scores.

**1. Two-sample t-test results:**

- t-value: -4.108723928204809
- p-value: 8.261945608702611e-05

The t-value is the difference between the two sample means divided by the standard error of that difference. A negative t-value indicates that the control group mean is lower than the experimental group mean.

The extremely low p-value (8.26e-05) suggests strong evidence against the null hypothesis of equal means, indicating a significant difference between the groups.

**2. Post-hoc test results (Tukey's HSD):**

- Group A (Control): mean = 67.75
- Group B (Experimental): mean = 75.18
- Difference in means: -7.43
- p-value for Tukey's test: 8.26e-05

With only two groups there is just one pairwise comparison, so Tukey's HSD reproduces the t-test (T = -4.108724, with essentially the same p-value) and adds no new information. Post-hoc tests become useful when three or more groups are compared and a significant overall result has to be traced to specific pairs.

**3. Interpretation:**

- The negative t-value and the negative difference in means suggest that, on average, students in the control group (traditional teaching) scored lower than those in the experimental group (new teaching method).
- The extremely low p-values from both the two-sample t-test and the post-hoc test provide strong evidence against the null hypothesis, supporting the conclusion that there are significant differences in test scores between the two teaching methods.

**4. Implications:**

- The findings suggest that the new teaching method is associated with higher student test scores compared to the traditional teaching method.
- Educators may consider implementing the new teaching method to potentially improve student performance.

***Note: The Hedges' g effect size (the `hedges` column) is a standardized measure of the difference between the groups and speaks to practical rather than statistical significance. Its sign is negative because the control mean is lower than the experimental mean, and its magnitude of about 0.82 is conventionally regarded as a large effect in favor of the experimental group.***
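
A standardized effect size of this kind can be sketched by hand. The function name `hedges_g` and the tiny samples are made up for illustration, and the correction factor used is a common approximation; a library may apply a slightly different formula.

```python
import numpy as np


def hedges_g(x, y):
    """Standardized mean difference with a small-sample bias correction."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    nx, ny = len(x), len(y)
    # Pooled variance from the two sample variances (ddof=1)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    cohens_d = (x.mean() - y.mean()) / np.sqrt(pooled_var)
    # Common approximate correction factor for small-sample bias
    correction = 1 - 3 / (4 * (nx + ny) - 9)
    return cohens_d * correction


# Tiny made-up samples, only to show the call
control = [1, 2, 3, 4, 5]
experimental = [2, 3, 4, 5, 6]
print(f"Hedges' g = {hedges_g(control, experimental):.3f}")
```

The sign follows the order of the arguments: a negative value means the first sample has the lower mean.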









## **Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.**