
# **Statistics Advance 06 - Assignment**

## **Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.**


**ANalysis Of VAriance (ANOVA)** is a statistical technique used to compare the means of three or more groups (with only two groups it is equivalent to the equal-variance two-sample t-test). It is based on the **F-distribution** and assumes the following:

- **1. Independence:** The observations are independent of each other. This means that the outcome of one observation does not affect the outcome of another observation.

- **2. Normality:** The data within each group follows a normal distribution. This assumption can be checked using Q-Q plots or tests like the Shapiro-Wilk test.

- **3. Homogeneity of variances:** The variance of the data is equal across all groups. This assumption can be checked using Levene's test or Bartlett's test.
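The checks named above can be sketched in a few lines. This is a minimal illustration on simulated data: the three samples, their means and spreads, and the seed are assumptions made up for the example, not data from this assignment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical samples for three groups (simulated, for illustration only)
samples = {
    "group_a": rng.normal(loc=50, scale=5, size=30),
    "group_b": rng.normal(loc=52, scale=5, size=30),
    "group_c": rng.normal(loc=55, scale=5, size=30),
}

# Normality within each group: Shapiro-Wilk test (H0: the sample is normal)
shapiro_p = {}
for name, values in samples.items():
    stat, p = stats.shapiro(values)
    shapiro_p[name] = p
    print(f"Shapiro-Wilk {name}: W={stat:.3f}, p={p:.3f}")

# Homogeneity of variances: Levene's test is less sensitive to non-normality,
# Bartlett's test is more powerful when the data really are normal
levene_stat, levene_p = stats.levene(*samples.values())
bartlett_stat, bartlett_p = stats.bartlett(*samples.values())
print(f"Levene:   W={levene_stat:.3f}, p={levene_p:.3f}")
print(f"Bartlett: T={bartlett_stat:.3f}, p={bartlett_p:.3f}")
```

A small p-value in either kind of test is evidence against the corresponding assumption; a large p-value does not prove the assumption holds, so a Q-Q plot is still worth a look.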

***Violations of these assumptions can impact the validity of the results and lead to incorrect conclusions. Here are some examples of violations:***

- **1. Dependence:** If the observations are not independent, the F-test used in ANOVA may not be valid. For example, if you are comparing the test scores of students in different classrooms, and some students have siblings in other classrooms, the scores may not be independent.

- **2. Non-normality:** If the data within a group does not follow a normal distribution, the F-test may not be valid. For example, if you are comparing the weights of mice from different treatment groups, and the weights of mice in one group are heavily skewed, the normality assumption is violated.

- **3. Heteroscedasticity:** If the variance of the data is not equal across all groups, the F-test may not be valid. For example, if you are comparing the salaries of employees in different departments, and the variance of salaries is much larger in one department than in others, the homogeneity of variances assumption is violated.

***When these assumptions are violated, it may still be possible to use ANOVA, but alternative methods such as the Welch ANOVA or non-parametric tests like the Kruskal-Wallis test may be more appropriate. It is important to assess the impact of these violations on the results and interpret them with caution.***
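As a rough sketch of those alternatives, the snippet below runs the classic F-test, Welch's ANOVA, and the Kruskal-Wallis test on the same data. The department salaries are simulated with deliberately unequal spreads and are purely hypothetical; Welch's ANOVA is taken here from `statsmodels.stats.oneway.anova_oneway` with `use_var="unequal"`.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.oneway import anova_oneway

rng = np.random.default_rng(0)

# Hypothetical salaries (in thousands) with very different variances per department
dept_a = rng.normal(loc=60, scale=2, size=25)
dept_b = rng.normal(loc=62, scale=10, size=25)
dept_c = rng.normal(loc=65, scale=25, size=25)
samples = [dept_a, dept_b, dept_c]

# Classic one-way ANOVA (assumes equal variances)
classic = stats.f_oneway(*samples)

# Welch's ANOVA (does not assume equal variances)
welch = anova_oneway(samples, use_var="unequal")

# Kruskal-Wallis (rank-based, no normality assumption)
kruskal = stats.kruskal(*samples)

print(f"Classic ANOVA:  F={classic.statistic:.3f}, p={classic.pvalue:.4f}")
print(f"Welch ANOVA:    F={welch.statistic:.3f}, p={welch.pvalue:.4f}")
print(f"Kruskal-Wallis: H={kruskal.statistic:.3f}, p={kruskal.pvalue:.4f}")
```

Comparing the three p-values on the same data gives a feel for how much the equal-variance assumption is driving the classic result.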

## **Q2. What are the three types of ANOVA, and in what situations would each be used?**


**1. One-way ANOVA** is used when there is one independent variable with two or more levels or categories.

*For example, comparing the test scores of students from different schools or the effect of different fertilizers on plant growth.*

**2. Two-way ANOVA** is used when there are two independent variables, each with two or more levels or categories.

*For example, comparing the test scores of students from different schools, broken down by gender, or the effect of different fertilizers and watering schedules on plant growth.*

*Two-way ANOVA can also be used to examine the interaction between the two independent variables.*

**3. N-way ANOVA (with n being the number of independent variables)** is used when there are three or more independent variables, each with two or more levels or categories.

*For example, comparing the test scores of students from different schools, broken down by gender and socioeconomic status, or the effect of different fertilizers, watering schedules, and sunlight exposure on plant growth.*
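In `statsmodels`, the three designs differ mainly in the model formula. The sketch below uses a made-up plant-growth dataset (the column names `fertilizer`, `watering`, `sunlight`, and `growth`, and the simulated values, are assumptions for illustration) and fits one formula per design.

```python
import itertools
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(0)

# Hypothetical balanced design: 3 fertilizers x 2 watering schedules x 2 sunlight
# levels, with 3 plants per combination (36 rows in total)
rows = [
    {"fertilizer": f, "watering": w, "sunlight": s}
    for f, w, s in itertools.product(["F1", "F2", "F3"], ["daily", "weekly"], ["low", "high"])
    for _ in range(3)
]
df = pd.DataFrame(rows)
df["growth"] = rng.normal(loc=10.0, scale=1.0, size=len(df))  # simulated response

# One formula per design; `*` expands to main effects plus all interactions
formulas = {
    "one-way": "growth ~ C(fertilizer)",
    "two-way": "growth ~ C(fertilizer) * C(watering)",
    "three-way": "growth ~ C(fertilizer) * C(watering) * C(sunlight)",
}
tables = {name: anova_lm(ols(f, data=df).fit(), typ=2) for name, f in formulas.items()}

for name, table in tables.items():
    print(f"\n{name} ANOVA")
    print(table)
```

Each additional factor adds its own main-effect row and the interaction rows that involve it, which is why higher-order designs need more data per cell to stay interpretable.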

## **Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?**


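**Partitioning of variance** means splitting the total variability in the data (the total sum of squares) into a part explained by differences between the group means (the between-group sum of squares) and a part left over inside the groups (the within-group, or residual, sum of squares). The F-statistic is the ratio of the corresponding mean squares. Understanding this partitioning is important because **it allows researchers to determine the relative contributions of different sources of variation to the total variation in the data**. This is useful for researchers who want to understand the relative importance of the factors that may be influencing the outcome of their experiments or studies.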
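For a one-way design the partition can be written out explicitly. The notation is an assumption introduced for illustration: $k$ groups, $n_i$ observations in group $i$, $N = \sum_i n_i$ observations in total, $\bar{x}_i$ the mean of group $i$, and $\bar{x}$ the grand mean.

$$
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}\right)^2}_{\text{total}}
=
\underbrace{\sum_{i=1}^{k} n_i\left(\bar{x}_i-\bar{x}\right)^2}_{\text{between groups (explained)}}
+
\underbrace{\sum_{i=1}^{k}\sum_{j=1}^{n_i}\left(x_{ij}-\bar{x}_i\right)^2}_{\text{within groups (residual)}}
$$

$$
F = \frac{\text{between-group sum of squares}/(k-1)}{\text{within-group sum of squares}/(N-k)}
$$

A large share of the total falling in the between-group term, relative to the within-group term, is what produces a large F-statistic.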

There are several types of ANOVA that can be used in different situations, depending on the number of factors or independent variables being studied and the number of levels or groups within each factor.

- **One-way ANOVA** is used when there is a single factor or independent variable with multiple levels or groups. This is the simplest form of ANOVA and is used to compare the means of the different groups within the factor.
- **Two-way ANOVA** is used when there are two factors or independent variables, each with multiple levels or groups. This allows researchers to evaluate the individual and joint effects of the two factors on the dependent variable.
- **Factorial ANOVA** is used when there are multiple factors or independent variables, each with multiple levels or groups. This allows researchers to examine the combined effects of all the factors on the dependent variable.
- **Welch's ANOVA** is used when the assumption of equal variances is not met, and the variances of the different groups are not equal.
- **Ranked ANOVA** is used when the data is ordinal or when the assumptions of ANOVA are violated. This involves replacing the values with their rank ordering and running a ranked ANOVA on the transformed data.
- **Games-Howell test** is used as a post-hoc test when the assumption of homogeneity of variances has been violated, and the variances of the different groups are not equal.


***In general, ANOVA is used to compare the means of multiple groups and determine if there are any statistical differences between the means. It is important to ensure that the assumptions of ANOVA are met in order for the test results to be valid.***

## **Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?**


In [39]:
import numpy as np

# example data
group1 = [3, 5, 6, 7, 9]
group2 = [4, 5, 5, 7, 10]
group3 = [3, 4, 4, 6, 7]

# collect the groups and pool all observations to get the grand mean
groups = [group1, group2, group3]
all_values = np.concatenate(groups)
grand_mean = np.mean(all_values)

# total sum of squares (SST): every observation vs. the grand mean
sst = np.sum((all_values - grand_mean) ** 2)

# explained (between-group) sum of squares (SSE): group means vs. the grand mean
# note: SSE follows the question's naming; some texts use SSE for the error sum of squares
sse = np.sum([len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups])

# residual (within-group) sum of squares (SSR): every observation vs. its own group mean
ssr = np.sum([np.sum((np.asarray(g) - np.mean(g)) ** 2) for g in groups])

print(f"Total Sum of Squares (SST): {sst:.2f}")
print(f"Explained Sum of Squares (SSE): {sse:.2f}")
print(f"Residual Sum of Squares (SSR): {ssr:.2f}")

Total Sum of Squares (SST): 59.33
Explained Sum of Squares (SSE): 5.73
Residual Sum of Squares (SSR): 53.60
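As a cross-check, the sums of squares for this example data can be turned into an F-statistic by hand and compared with `scipy.stats.f_oneway`. This is a minimal sketch; the variable names (`ss_between`, `ss_within`, `ss_total`) are chosen here for clarity and are not part of the original cell.

```python
import numpy as np
from scipy import stats

# Same example data as in the cell above
group1 = [3, 5, 6, 7, 9]
group2 = [4, 5, 5, 7, 10]
group3 = [3, 4, 4, 6, 7]
groups = [np.asarray(g, dtype=float) for g in (group1, group2, group3)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group (explained), within-group (residual), and total sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ss_total = ((all_values - grand_mean) ** 2).sum()

# Degrees of freedom: k - 1 between groups, N - k within groups
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"between + within = {ss_between + ss_within:.2f}, total = {ss_total:.2f}")
print(f"F (manual) = {f_manual:.4f}, F (scipy) = {f_scipy:.4f}, p = {p_scipy:.4f}")
```

If the partition is computed correctly, the between-group and within-group terms add up to the total, and the hand-computed F matches the library value.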


## **Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?**


In a two-way ANOVA (Analysis of Variance), the main effects and the interaction effect can be estimated in Python by fitting a linear model with `statsmodels` and reading the resulting ANOVA table. The formula `Y ~ A * B` expands to both main effects plus the `A:B` interaction. Here's a general outline of the process:

In [44]:
# import necessary libraries
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# load the dataset
# Assuming you have a DataFrame 'df' with columns 'A', 'B', and 'Y'
# A and B are categorical variables, and Y is the dependent variable
data = {'A': ['A1', 'A2', 'A1', 'A2', 'A1', 'A2'],
        'B': ['B1', 'B1', 'B2', 'B2', 'B1', 'B1'],
        'Y': [5, 8, 6, 9, 7, 10]}
df = pd.DataFrame(data)

# fit the model
model = ols('Y ~ A * B', data=df).fit()

# perform ANOVA
anova_table = anova_lm(model, typ=2)

#Interpret the result

# Main Effects:
#       A and B rows in the table will provide information about the main effects of variables A and B.
# Interaction Effect:
#       The interaction term A:B in the table represents the interaction effect.

# Print the ANOVA table
print(anova_table)


                sum_sq   df             F   PR(>F)
A         1.350000e+01  1.0  6.750000e+00  0.12169
B         2.777448e-30  1.0  1.388724e-30  1.00000
A:B       1.643460e-32  1.0  8.217301e-33  1.00000
Residual  4.000000e+00  2.0           NaN      NaN


**1. Main Effects (A and B):**

- Look at the rows corresponding to the main effects of variables A and B.
- Check the p-values associated with these main effects.
- If the p-value is below your chosen significance level (e.g., 0.05), you reject the null hypothesis, suggesting there is a significant main effect.

**2. Interaction Effect (A:B):**

- Look at the row corresponding to the interaction term A:B.
- Check the p-value associated with the interaction effect.
- If the p-value is below your chosen significance level, it suggests there is a significant interaction effect between variables A and B.

**3. Conclusion:**

- If there is a significant main effect for A, it means that the levels of A have a statistically significant impact on the dependent variable.
- If there is a significant main effect for B, it means that the levels of B have a statistically significant impact on the dependent variable.
- If there is a significant interaction effect, it means that the combined effect of A and B is not simply the sum of their individual effects.

**4. Interpretation Example:**

- If p-values for A, B, and A:B are all below 0.05, you might conclude that both variables A and B have significant main effects, and there is a significant interaction effect between them.

***Keep in mind that the interpretation might vary based on the specific context of your study and the nature of your variables. The p-values are crucial for determining whether the observed effects are statistically significant.***
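The reading steps above can be sketched as a small post-processing step on the ANOVA table. The snippet refits the same toy data so that it runs on its own; the significance level `alpha` and the added `significant` column are illustrative choices, not part of `statsmodels`.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

# Same toy data as in the cell above
df = pd.DataFrame({
    "A": ["A1", "A2", "A1", "A2", "A1", "A2"],
    "B": ["B1", "B1", "B2", "B2", "B1", "B1"],
    "Y": [5, 8, 6, 9, 7, 10],
})

model = ols("Y ~ A * B", data=df).fit()
anova_table = anova_lm(model, typ=2)

alpha = 0.05  # chosen significance level (an assumption; pick it before looking at the data)

# Keep only the effect rows (A, B, A:B) and flag those with p < alpha
effects = anova_table.drop(index="Residual")
summary = effects[["F", "PR(>F)"]].copy()
summary["significant"] = summary["PR(>F)"] < alpha

print(summary)
```

With only six observations, none of the effects in this toy dataset reaches significance at the 0.05 level, which is consistent with the p-values in the table above.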

## **Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?**


Based on the given **one-way ANOVA** results, with an **F-statistic of 5.23** and a **p-value of 0.02**, we can conclude that ***there is a statistically significant difference between at least two of the groups being compared***.

The **F-statistic** is a ***measure of the ratio of the variation between the groups to the variation within the groups***.

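A **larger F-statistic indicates that the group means differ more than would be expected from the variation within the groups**.

> ***In this case, the F-statistic of 5.23 means the between-group mean square is about 5.23 times the within-group mean square. Whether that ratio is statistically significant is judged from the p-value.***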

The **p-value** is a **measure of the probability of observing the given F-statistic (or a more extreme value) if the null hypothesis is true**. In this case, the null hypothesis is that all group means are equal. A smaller p-value indicates a lower probability of observing the given F-statistic (or a more extreme value) under the null hypothesis.

> ***In this case, the p-value of 0.02 is less than the typical significance level of 0.05, indicating that the observed F-statistic is unlikely to have occurred by chance if the null hypothesis is true.***
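The link between the F-statistic and its p-value can be sketched with `scipy.stats.f`. The question does not give the degrees of freedom, so the values below (2 and 14, for example three groups and 17 observations) are purely hypothetical, chosen only because they give a tail probability close to the quoted 0.02.

```python
from scipy import stats

f_stat = 5.23
dfn, dfd = 2, 14  # hypothetical degrees of freedom (not given in the question)
alpha = 0.05

# p-value: probability of an F at least this large when the null hypothesis is true
p_value = stats.f.sf(f_stat, dfn, dfd)

# Equivalent decision rule: compare F with the critical value at level alpha
f_crit = stats.f.ppf(1 - alpha, dfn, dfd)

reject_null = p_value < alpha
print(f"p-value = {p_value:.4f}, critical F = {f_crit:.2f}, reject H0: {reject_null}")
```

Comparing the p-value with `alpha` and comparing F with the critical value are the same test, so both always lead to the same decision.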

Therefore, **we can reject the null hypothesis and conclude that there is a statistically significant difference between at least two of the groups**.

However, the ANOVA test does not identify which specific groups are different. To determine which groups differ, we can perform post-hoc tests, such as Tukey's HSD test, which compares the means of every pair of groups, or Dunnett's test, which compares each group with a single control group.

***In summary, the ANOVA results suggest that there is a statistically significant difference between at least two of the groups being compared. To determine which specific groups are different, you can perform post-hoc tests.***
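A post-hoc comparison of this kind might look like the sketch below, which uses `pairwise_tukeyhsd` from `statsmodels`. The three groups and their scores are invented for illustration, since the question reports only the F-statistic and p-value, not the underlying data.

```python
import numpy as np
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical scores for three groups (made up for this example)
scores = {
    "A": [20, 21, 19, 22, 20],
    "B": [25, 26, 24, 27, 25],
    "C": [20, 22, 21, 19, 21],
}

# Long format: one array of values and one array of matching group labels
values = np.concatenate([np.asarray(v, dtype=float) for v in scores.values()])
labels = np.repeat(list(scores.keys()), [len(v) for v in scores.values()])

# Tukey's HSD compares every pair of groups while controlling the family-wise error rate
result = pairwise_tukeyhsd(endog=values, groups=labels, alpha=0.05)
print(result.summary())
```

The `reject` column of the summary shows which pairs differ at the chosen level; in this made-up data, group B stands apart from A and C, while A and C are not distinguishable.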

## **Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?**



## **Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.**


## **Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.**


## **Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.**


## **Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.**


## **Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.**