
Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact
the validity of the results.

Ans:
Normality: The data within each group should be approximately normally distributed. However, ANOVA is robust to moderate departures from normality, especially if the group sizes are equal.
Violation Example: If the data within a group significantly deviates from a normal distribution, it may affect the accuracy of the ANOVA results.
Homogeneity of Variances: The variances of the different groups should be approximately equal. This is known as homogeneity of variances.
Violation Example: If the group variances differ substantially, the actual Type I error rate of the F-test can drift above or below the nominal level, and power can suffer. The problem is most pronounced when the group sizes are also unequal.
Independence: Each observation should be independent of every other observation, both within its own group and across groups. For example, the same subject should not contribute values to more than one group.
Violation Example: If there is a lack of independence, it may lead to biased estimates of variability and significance.
Random Sampling: The data should be collected using a random sampling method to ensure that the samples are representative of the populations from which they are drawn.
Violation Example: If the samples are not randomly selected, there is a risk that the results may not generalize well to the larger population.
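
The normality and equal-variance assumptions above can be screened before running ANOVA. The sketch below is one possible way to do it, assuming simulated (hypothetical) data and using the Shapiro-Wilk and Levene tests from SciPy; it is a screening aid, not a substitute for looking at the data.

```python
import numpy as np
from scipy import stats

# Hypothetical samples: three groups simulated from normal distributions
# with the same spread (these are not real measurements).
rng = np.random.default_rng(0)
groups = [rng.normal(loc=mu, scale=2.0, size=30) for mu in (10.0, 11.0, 12.0)]

# Shapiro-Wilk: the null hypothesis is that a sample comes from a normal distribution.
shapiro_p = [stats.shapiro(g).pvalue for g in groups]

# Levene: the null hypothesis is that all groups have equal variances.
levene_stat, levene_p = stats.levene(*groups)

for i, p in enumerate(shapiro_p, start=1):
    print(f"Group {i} Shapiro-Wilk p-value: {p:.3f}")
print(f"Levene p-value: {levene_p:.3f}")
```

A small p-value in either test is a signal to inspect that assumption more closely, for example with a histogram or a Q-Q plot.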

Q2. What are the three types of ANOVA, and in what situations would each be used?

Ans:
One-Way ANOVA:
Purpose: Used when comparing the means of three or more independent (unrelated) groups to determine if there are any statistically significant differences between these groups.
Example Scenario: Testing the effectiveness of different teaching methods by comparing the average test scores of students who were taught using Method A, Method B, and Method C.
Two-Way ANOVA:
Purpose: Extends the One-Way ANOVA by examining the influence of two categorical independent variables on a dependent variable.
Example Scenario: Analyzing the effects of both gender and treatment on the average test scores of students. It helps determine if there are interactions between these two factors.
Repeated Measures ANOVA:
Purpose: Used when the same subjects are used for each treatment or under different conditions, meaning there is a repeated measurement taken on the same subjects.
Example Scenario: Analyzing the impact of a drug treatment on patients' blood pressure levels measured at different time points. Repeated Measures ANOVA accounts for the correlation between measurements taken on the same subject.
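
To illustrate how repeated measures ANOVA accounts for the correlation within subjects, the sketch below computes a one-way repeated measures F-test by hand. The blood pressure values are hypothetical, and the layout (rows as subjects, columns as time points) is an assumption made for the example. The key step is that variability between subjects is removed from the error term.

```python
import numpy as np
from scipy import stats

# Hypothetical data: rows are subjects, columns are three time points.
scores = np.array([
    [120.0, 115.0, 110.0],
    [140.0, 136.0, 130.0],
    [130.0, 124.0, 121.0],
    [150.0, 147.0, 140.0],
])
n_subjects, n_conditions = scores.shape
grand_mean = scores.mean()

# Total variability, then the parts due to time points and to subjects.
ss_total = ((scores - grand_mean) ** 2).sum()
ss_conditions = n_subjects * ((scores.mean(axis=0) - grand_mean) ** 2).sum()
ss_subjects = n_conditions * ((scores.mean(axis=1) - grand_mean) ** 2).sum()

# The error term excludes between-subject variability.
ss_error = ss_total - ss_conditions - ss_subjects

df_conditions = n_conditions - 1
df_error = (n_subjects - 1) * (n_conditions - 1)
f_stat = (ss_conditions / df_conditions) / (ss_error / df_error)
p_value = stats.f.sf(f_stat, df_conditions, df_error)

print(f"F({df_conditions}, {df_error}) = {f_stat:.2f}, p-value = {p_value:.4f}")
```

Because `ss_subjects` is subtracted out, the error term is smaller than it would be if the same values were treated as independent groups.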

Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

Ans: The partitioning of variance in Analysis of Variance (ANOVA) refers to the decomposition of the total variability observed in a dataset into different components. Understanding this partitioning is crucial for interpreting the results of an ANOVA and gaining insights into the sources of variation in the data.
Understanding the partitioning of variance is essential for the following reasons:
Hypothesis Testing: ANOVA tests whether the differences between group means are statistically significant. By partitioning the total variance into between-group and within-group components, ANOVA helps determine if the observed differences are likely due to actual group effects rather than random chance.
Effect Size: By examining the proportion of variability explained by between-group differences (SSB/SST), researchers can assess the practical significance or effect size of the group differences.
Model Assessment: It allows researchers to evaluate the adequacy of the model in explaining the observed variability. A significant between-group variability suggests that the model captures meaningful group differences.
Variable Importance: For factorial ANOVA or ANOVA with multiple factors, partitioning helps assess the importance of each factor in explaining the variability in the data.
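
For a one-way design with k groups, n_i observations in group i, and N observations in total, the partition described above can be written as follows. The symbols SSB (between groups) and SSW (within groups) follow the SSB/SST notation used above.

```latex
\begin{aligned}
SST &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2 = SSB + SSW \\
SSB &= \sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2 \\
SSW &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2 \\
F &= \frac{SSB / (k - 1)}{SSW / (N - k)}, \qquad \eta^2 = \frac{SSB}{SST}
\end{aligned}
```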

Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual
sum of squares (SSR) in a one-way ANOVA using Python?

In [None]:
import pandas as pd
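import scipy.stats  # import the stats submodule explicitly; plain `import scipy` may not load it in older SciPy versions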

In [None]:
data = {'Group1': [5, 7, 8, 6, 9],
        'Group2': [10, 12, 15, 11, 14],
        'Group3': [8, 6, 9, 11, 10]}

df = pd.DataFrame(data)

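# Cross-check: SciPy's built-in one-way ANOVA on the same three groups
f_stats, p_value = scipy.stats.f_oneway(df['Group1'], df['Group2'], df['Group3'])

# Degrees of freedom
total_sample = df.size          # N: total number of observations
total_groups = df.shape[1]      # k: number of groups
dof_total = total_sample - 1
dof_within = total_sample - total_groups
dof_between = total_groups - 1

grand_mean = df.stack().mean()
# SST: squared deviations of every observation from the grand mean
sst = ((df - grand_mean)**2).sum().sum()
# SSE (explained, between groups): group size times squared deviation of each group mean
sse = (df.count() * (df.mean() - grand_mean)**2).sum()
# SSR (residual, within groups): the variability left unexplained
ssr = sst - sse
print("SST:", sst)
print("SSE:", sse)
print("SSR:", ssr)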

SST: 117.6
SSE: 75.60000000000001
SSR: 41.999999999999986
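
As a check on the sums of squares above, the F-statistic can be rebuilt from them and compared with SciPy's result. This is a minimal sketch that reuses the same three groups in plain Python; the variable names `ss_between` and `ss_within` correspond to SSE and SSR in the question's notation.

```python
from scipy import stats

# Same three groups as in the cell above.
groups = [
    [5, 7, 8, 6, 9],
    [10, 12, 15, 11, 14],
    [8, 6, 9, 11, 10],
]
all_values = [x for g in groups for x in g]
grand_mean = sum(all_values) / len(all_values)

# Explained (between-group) and residual (within-group) sums of squares.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares.
f_manual = (ss_between / df_between) / (ss_within / df_within)
f_scipy, p_scipy = stats.f_oneway(*groups)

print(f"F (manual): {f_manual:.4f}")
print(f"F (scipy):  {f_scipy:.4f}")
print(f"p-value:    {p_scipy:.4f}")
```

The two F values should agree up to floating-point error, which confirms that the partition feeds directly into the test.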


Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using python?

In [None]:
import pandas as pd
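import scipy.stats  # import the stats submodule explicitly; plain `import scipy` may not load it in older SciPy versions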

data = {'FactorA': ['A1', 'A1', 'A2', 'A2', 'A3', 'A3', 'A4', 'A4'],
        'FactorB': ['B1', 'B2', 'B1', 'B2', 'B1', 'B2', 'B1', 'B2'],
        'DependentVariable': [10, 12, 14, 16, 8, 10, 18, 20]}

df = pd.DataFrame(data)

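# One-way F-test across all four levels of FactorA (A1 to A4), not only A1 and A2
groups_A = [g['DependentVariable'] for _, g in df.groupby('FactorA')]
result_A = scipy.stats.f_oneway(*groups_A)
# One-way F-test across the two levels of FactorB
result_B = scipy.stats.f_oneway(df[df['FactorB'] == 'B1']['DependentVariable'], df[df['FactorB'] == 'B2']['DependentVariable'])
# Cell means describe the A-by-B pattern; with one observation per cell they are
# descriptive only and do not provide a significance test of the interaction
interaction_effect = df.groupby(['FactorA', 'FactorB'])['DependentVariable'].mean()

print(f"Main Effect of Factor A: {result_A.statistic:.4f} {result_A.pvalue:.4f}")
print(f"Main Effect of Factor B: {result_B.statistic:.4f} {result_B.pvalue:.4f}")
print("Interaction Effect:\n", interaction_effect)

Main Effect of Factor A: 19.6667 0.0074
Main Effect of Factor B: 0.4068 0.5472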
Main Effect of Factor B: 0.4067796610169491 0.5471616124270274
Interaction Effect:
 FactorA  FactorB
A1       B1         10.0
         B2         12.0
A2       B1         14.0
         B2         16.0
A3       B1          8.0
         B2         10.0
A4       B1         18.0
         B2         20.0
Name: DependentVariable, dtype: float64
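
Testing an interaction requires more than one observation per cell. The sketch below shows one way to compute the main effects and the interaction for a balanced two-way design by partitioning the sums of squares directly. The data are hypothetical (2 levels of A, 2 levels of B, 3 replicates per cell), and the array layout is an assumption made for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical balanced design: y[i, j, r] is replicate r in cell (A_i, B_j).
y = np.array([
    [[10.0, 11.0, 12.0], [14.0, 15.0, 13.0]],   # A1: B1, B2
    [[16.0, 18.0, 17.0], [15.0, 14.0, 16.0]],   # A2: B1, B2
])
a, b, n = y.shape
grand = y.mean()
mean_a = y.mean(axis=(1, 2))   # marginal means of factor A
mean_b = y.mean(axis=(0, 2))   # marginal means of factor B
cell = y.mean(axis=2)          # cell means

# Partition of the total sum of squares (valid for balanced designs).
ss_a = b * n * ((mean_a - grand) ** 2).sum()
ss_b = a * n * ((mean_b - grand) ** 2).sum()
ss_ab = n * ((cell - mean_a[:, None] - mean_b[None, :] + grand) ** 2).sum()
ss_within = ((y - cell[:, :, None]) ** 2).sum()
ss_total = ((y - grand) ** 2).sum()

df_within = a * b * (n - 1)
ms_within = ss_within / df_within

results = {}
for name, ss, df in [("A", ss_a, a - 1), ("B", ss_b, b - 1), ("A:B", ss_ab, (a - 1) * (b - 1))]:
    f_stat = (ss / df) / ms_within
    p_value = stats.f.sf(f_stat, df, df_within)
    results[name] = (f_stat, p_value)
    print(f"{name}: F = {f_stat:.3f}, p-value = {p_value:.4f}")
```

For unbalanced designs this simple partition no longer holds, and a model-based approach such as `ols` with `anova_lm` from statsmodels is the usual choice.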


Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

Ans: Interpretation:
In our case:
F-Statistic: 5.23
The F-statistic is the ratio of the variance between groups to the variance within groups. The larger it is, the less plausible it becomes that all group means are equal.
P-Value: 0.02
The p-value is less than the typical significance level of 0.05. Therefore, you have evidence to reject the null hypothesis.
Conclusion:
Given the results:
Reject the Null Hypothesis:
The small p-value (0.02) suggests that we have enough evidence to reject the null hypothesis, indicating that there are statistically significant differences between at least two group means.
Differences Between Groups:
The significant result doesn't tell us which specific groups are different or how many groups are different; it only indicates that, overall, there are differences between at least two groups.

Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

Ans: Handling missing data in a repeated measures ANOVA is crucial for obtaining accurate and reliable results. Repeated measures ANOVA involves analyzing data collected from the same subjects over multiple time points or conditions. Missing data can arise due to various reasons such as dropout, non-response, or technical issues. Here are common methods for handling missing data and their potential consequences:
Methods for Handling Missing Data:
Complete Case Analysis (Listwise Deletion):
Method: Exclude cases with any missing data.
Potential Consequences:
Reduces the sample size.
May introduce bias if missingness is related to the outcome or other variables.
Mean Imputation:
Method: Replace missing values with the mean of the observed values for the variable.
Potential Consequences:
Preserves the sample size.
May underestimate variability and produce biased estimates if missingness is not random.
Last Observation Carried Forward (LOCF):
Method: Replace missing values with the last observed value for that subject.
Potential Consequences:
Assumes that the last observation is a good estimate of the missing value.
Can lead to biased results if patterns of missingness are related to the variable being measured.
Interpolation or Linear Regression Imputation:
Method: Predict missing values based on observed values using interpolation or linear regression.
Potential Consequences:
Assumes a linear relationship between observed values and may introduce bias if the relationship is nonlinear.
Multiple Imputation:
Method: Generate multiple sets of plausible values for missing data, considering uncertainty.
Potential Consequences:
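Gives approximately unbiased estimates, with standard errors that reflect the uncertainty of the imputation, provided the imputation model is correctly specified and the data are missing at random.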
Requires specialized software and may be computationally intensive.
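
The first three methods above can be sketched with pandas. The data below are hypothetical, arranged in wide format (one row per subject, one column per time point); the column and index names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point.
wide = pd.DataFrame(
    {
        "t1": [120.0, 135.0, 128.0, 142.0],
        "t2": [118.0, np.nan, 126.0, 140.0],
        "t3": [115.0, 130.0, np.nan, 137.0],
    },
    index=["s1", "s2", "s3", "s4"],
)

# Complete case analysis: drop every subject with any missing value.
complete_cases = wide.dropna()

# Mean imputation: fill each gap with the mean of that time point.
mean_imputed = wide.fillna(wide.mean())

# LOCF: carry the previous time point forward within each subject.
locf = wide.ffill(axis=1)

print("Subjects kept after listwise deletion:", len(complete_cases))
print("Std of t2, observed values only:", round(wide["t2"].std(), 2))
print("Std of t2, after mean imputation:", round(mean_imputed["t2"].std(), 2))
print(locf)
```

The drop in the standard deviation of `t2` after mean imputation shows the underestimated variability mentioned above, while listwise deletion halves the sample in this small example.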

Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Ans: After conducting an Analysis of Variance (ANOVA) and finding a significant difference among group means, post-hoc tests are often employed to identify which specific groups differ from each other. Common post-hoc tests include Tukey's Honestly Significant Difference (HSD), Bonferroni correction, Scheffé test, and Dunnett's test. The choice of post-hoc test depends on the characteristics of the data and the assumptions made. Here's an overview of some common post-hoc tests:
Tukey's Honestly Significant Difference (HSD):
Use Case:
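When we want to compare every pair of group means while controlling the familywise error rate. It assumes homogeneity of variances; with unequal sample sizes the Tukey-Kramer variant is used.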
Example:
Suppose we conducted an ANOVA for testing the effect of different teaching methods on students' test scores and found a significant difference. Tukey's HSD can be used to compare each pair of teaching methods to identify where the significant differences lie.
Bonferroni Correction:
Use Case:
When we want to control the familywise error rate (overall Type I error rate) by adjusting significance levels.
Example:
If we have conducted multiple pairwise comparisons after ANOVA, the probability of making a Type I error increases. Bonferroni correction adjusts the significance level for each individual test to maintain an overall Type I error rate.
Dunnett's Test:
Use Case:
When we have a control group and we want to compare other groups to the control.
Example:
In a drug trial, if we have a control group receiving a placebo and several experimental groups receiving different doses of a drug, Dunnett's test can be used to compare each experimental group to the control group.
Example Scenario:
We conducted a study to evaluate the effectiveness of four different types of exercise programs (A, B, C, D) on weight loss.
After running a one-way ANOVA, we found a significant difference among the means of the exercise programs.
Post-hoc Test:
Since there are multiple groups (exercise programs) to compare, we decide to conduct a post-hoc test to identify which specific pairs of exercise programs differ significantly in terms of weight loss.
Choice of Post-hoc Test:
We choose Tukey's Honestly Significant Difference (HSD) post-hoc test because our sample sizes are equal, and we assume homogeneity of variances.
Results:
The Tukey HSD test reveals that Exercise Programs A and B have significantly different effects on weight loss, while Programs A and C, A and D, B and C, B and D, and C and D do not show significant differences.
In this example, Tukey's HSD helps identify specific pairs of exercise programs that have significantly different effects on weight loss, providing more detailed insights than the initial ANOVA results.
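
The Bonferroni correction described above can be sketched as pairwise t-tests whose p-values are multiplied by the number of comparisons. The weight-loss values and program labels below are hypothetical and chosen only to illustrate the mechanics.

```python
from itertools import combinations
from scipy import stats

# Hypothetical weight-loss values (kg) for three exercise programs.
samples = {
    "A": [2.1, 2.5, 1.9, 2.3, 2.2],
    "B": [4.0, 4.4, 3.9, 4.2, 4.1],
    "C": [2.0, 2.4, 2.2, 2.1, 2.3],
}

pairs = list(combinations(samples, 2))
adjusted = {}
for g1, g2 in pairs:
    # Raw p-value from an independent two-sample t-test.
    p_raw = stats.ttest_ind(samples[g1], samples[g2]).pvalue
    # Bonferroni: multiply by the number of comparisons, capped at 1.
    adjusted[(g1, g2)] = min(1.0, p_raw * len(pairs))

for (g1, g2), p_adj in adjusted.items():
    verdict = "significant" if p_adj < 0.05 else "not significant"
    print(f"{g1} vs {g2}: adjusted p-value = {p_adj:.4f} ({verdict})")
```

Multiplying the p-values is equivalent to dividing the significance level by the number of comparisons, which is the form of the correction described above.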

Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [10]:
import pandas as pd
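import scipy.stats  # import the stats submodule explicitly; plain `import scipy` may not load it in older SciPy versions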

data = {'Diet': ['A']*20 + ['B']*20 + ['C']*10,
        'WeightLoss': [3.2, 4.1, 2.8, 3.5, 4.0, 2.9, 3.8, 4.2, 3.0, 3.7,
                       2.5, 3.9, 4.1, 3.3, 3.6, 2.7, 3.9, 4.0, 2.8, 3.5,
                       3.2, 3.8, 2.9, 3.6, 4.2, 3.0, 3.5, 4.1, 2.7, 3.4,
                       4.0, 3.3, 3.8, 2.9, 3.6, 4.1, 2.8, 3.5, 3.0, 3.7,
                       4.1, 3.3, 3.6, 2.7, 3.9, 4.2, 3.0, 3.7, 2.9, 3.8]}

df = pd.DataFrame(data)

result  = scipy.stats.f_oneway(df['WeightLoss'][df['Diet']== 'A'],
                               df['WeightLoss'][df['Diet']== 'B'],
                               df['WeightLoss'][df['Diet']== 'C']
                               )
p_value = result.pvalue
if p_value < 0.05:
    print("The one-way ANOVA result is statistically significant.")
    print("There is evidence of at least one diet having a different mean weight loss.")
else:
    print("The one-way ANOVA result is not statistically significant.")
    print("There is no strong evidence of differences in mean weight loss among the diets.")

The one-way ANOVA result is not statistically significant.
There is no strong evidence of differences in mean weight loss among the diets.


Q10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [26]:
import pandas as pd
import scipy
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

data = {
    'Program': ['A']*10 + ['B']*10 + ['C']*10,
    'ExperienceLevel': ['Novice']*15 + ['Experienced']*15 ,
    'CompletionTime': [
        15, 18, 16, 20, 17, 19, 21, 22, 18, 20, 25, 24, 22, 23, 20,
        30, 28, 29, 32, 31, 30, 27, 29, 28, 31, 22, 25, 24, 26, 23
    ]
}
df = pd.DataFrame(data)

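# Note: in this data Program A has only novices and Program C only experienced employees,
# so two Program x ExperienceLevel cells are empty and the interaction cannot be fully estimated
formula = 'CompletionTime ~ Program + ExperienceLevel + Program:ExperienceLevel'
model = ols(formula, df).fit()
anova_results = anova_lm(model)
# The Residual row has no p-value (NaN), and NaN < 0.05 evaluates to False
if any(anova_results['PR(>F)'] < 0.05):
    print("The two-way ANOVA result is statistically significant.")
    print("There is evidence of at least one main effect or interaction effect.")
else:
    print("The two-way ANOVA result is not statistically significant.")
    print("There is no strong evidence of differences in completion time.")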

The two-way ANOVA result is statistically significant.
There is evidence of at least one main effect or interaction effect.
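
Before interpreting main effects and interactions, it helps to check how many observations fall in each combination of the two factors. The sketch below rebuilds only the two factor columns from the cell above and tabulates the cell counts with `pd.crosstab`; the names `design`, `cell_counts`, and `empty_cells` are introduced here for illustration.

```python
import pandas as pd

# Same factor layout as in the cell above (the outcome column is not needed here).
design = pd.DataFrame({
    "Program": ["A"] * 10 + ["B"] * 10 + ["C"] * 10,
    "ExperienceLevel": ["Novice"] * 15 + ["Experienced"] * 15,
})

# Number of employees in each Program x ExperienceLevel combination.
cell_counts = pd.crosstab(design["Program"], design["ExperienceLevel"])
print(cell_counts)

# Count the combinations with no observations at all.
empty_cells = int((cell_counts == 0).sum().sum())
print("Empty cells:", empty_cells)
```

Empty cells mean that program and experience level are partly confounded in this data, so the reported effects should be read with caution.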


Q11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [38]:
import pandas as pd
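import scipy.stats  # import the stats submodule explicitly; plain `import scipy` may not load it in older SciPy versions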
from statsmodels.stats.multicomp import pairwise_tukeyhsd
data = {'Group': ['Control']*50 + ['Experimental']*50,
        'TestScores': [75, 80, 85, 78, 82, 79, 83, 81, 77, 80,
                       90, 85, 88, 92, 87, 89, 86, 91, 84, 88,
                       78, 82, 79, 83, 81, 77, 80, 75, 80, 85,
                       88, 92, 87, 89, 86, 91, 84, 88, 90, 85,
                       78, 82, 79, 83, 81, 77, 80, 75, 80, 85,
                       75, 80, 85, 78, 82, 79, 83, 81, 77, 80,
                       90, 85, 88, 92, 87, 89, 86, 91, 84, 88,
                       78, 82, 79, 83, 81, 77, 80, 75, 80, 85,
                       88, 92, 87, 89, 86, 91, 84, 88, 90, 85,
                       78, 82, 79, 83, 81, 77, 80, 75, 80, 85]}

df = pd.DataFrame(data)
control_score = df['TestScores'][df['Group']=='Control']
experimental_score = df['TestScores'][df['Group']=='Experimental']

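# Independent two-sample t-test (assumes equal variances by default)
t_stats, p_value = scipy.stats.ttest_ind(control_score, experimental_score)

if p_value < 0.05:
    # With only two groups, Tukey's HSD simply repeats the single pairwise comparison
    posthoc = pairwise_tukeyhsd(df['TestScores'], df['Group'])
    print(posthoc)
else:
    print("No significant differences observed.")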

No significant differences observed.


Q12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a post-
hoc test to determine which store(s) differ significantly from each other.

In [59]:
import pandas as pd
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import pairwise_tukeyhsd

data = {'Day': list(range(30))*3,
        'Store': ['Store A']*30 + ['Store B']*30 + ['Store C']*30,
        'Sales': [50, 55, 48, 52, 54, 53, 56, 50, 58, 52,
                  60, 55, 52, 57, 59, 56, 54, 58, 51, 53,
                  49, 50, 55, 48, 52, 54, 53, 56, 50, 58,
                  60, 55, 52, 57, 59, 56, 54, 58, 51, 53,
                  49, 50, 55, 48, 52, 54, 53, 56, 50, 58,
                  60, 55, 52, 57, 59, 56, 54, 58, 51, 53,
                  49, 50, 55, 48, 52, 54, 53, 56, 50, 58,
                  60, 55, 52, 57, 59, 56, 54, 58, 51, 53,
                  49, 50, 55, 48, 52, 54, 53, 56, 50, 58]}

df = pd.DataFrame(data)

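# Repeated measures ANOVA: each Day acts as the "subject", measured once per Store
rmanova = AnovaRM(df, depvar='Sales', subject='Day', within=['Store'])
result = rmanova.fit()

if result.anova_table["Pr > F"]["Store"] < 0.05:
    # Note: Tukey's HSD treats the stores as independent groups and ignores the pairing by Day
    posthoc = pairwise_tukeyhsd(df['Sales'], df['Store'])
    print(posthoc)
else:
    print("No significant differences observed.")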

No significant differences observed.
