
1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

ANOVA (Analysis of Variance) is a statistical method used to compare means across two or more groups to determine if there are statistically significant differences between them. However, ANOVA comes with certain assumptions that need to be met for the results to be valid. Here are the key assumptions and examples of violations:

1. Independence of Observations: This assumption requires that the observations within each group are independent of each other. In other words, the value of one observation should not be influenced by the value of another observation within the same group. Violation of this assumption can occur in clustered or correlated data, such as repeated measures designs or nested data.

 * Example of violation: Conducting ANOVA on data collected from twins where the observations within pairs may be correlated, violating the independence assumption.

2. Normality: ANOVA assumes that the residuals (the differences between observed and predicted values) are normally distributed for each group. Violation of this assumption can lead to inaccurate p-values and confidence intervals.

 * Example of violation: Performing ANOVA on a small sample size where the distribution of residuals is highly skewed or has heavy tails, indicating non-normality.

3. Homogeneity of Variance (Homoscedasticity): This assumption states that the variance of the residuals is constant across all levels of the independent variable. In other words, the spread of data points around the mean should be similar across groups. Violation of this assumption can result in unequal variances, affecting the reliability of the F-test.

 * Example of violation: Comparing the exam scores of students from different schools, where one school has much higher variability in scores compared to others, violating the assumption of homogeneity of variance.

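4. Independence of Errors: This assumption states that the residuals of the model are independent of each other, that is, the error terms are not correlated. It is closely related to assumption 1 but is stated for the model's errors rather than the raw observations. Violation does not bias the estimated group means, but it distorts their standard errors, so the F-test and its p-value can be badly miscalibrated (typically too liberal when errors are positively correlated).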

 * Example of violation: Analyzing time series data using ANOVA without accounting for autocorrelation in the residuals, leading to correlated errors.

When these assumptions are violated, alternative approaches or corrections may be needed to ensure the validity of the results. For example, transformations of the data (e.g., log transformation) can sometimes address violations of normality or homogeneity of variance. Additionally, non-parametric tests like the Kruskal-Wallis test can be used as alternatives to ANOVA when assumptions are severely violated.
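
The homogeneity-of-variance assumption described above can be checked before running ANOVA. The following is a minimal sketch of the median-centered Levene (Brown-Forsythe) statistic using only the standard library. The function name `levene_w` and the sample scores are hypothetical; in practice `scipy.stats.levene` computes the same statistic and also returns a p-value.

```python
from statistics import mean, median

def levene_w(groups):
    """Median-centered Levene (Brown-Forsythe) W statistic for a list of groups."""
    # Absolute deviation of each observation from its own group median
    z = [[abs(x - median(g)) for x in g] for g in groups]
    n_total = sum(len(g) for g in z)
    k = len(z)
    z_bar = mean(v for g in z for v in g)      # overall mean deviation
    z_group = [mean(g) for g in z]             # per-group mean deviation
    between = sum(len(g) * (m - z_bar) ** 2 for g, m in zip(z, z_group))
    within = sum((v - m) ** 2 for g, m in zip(z, z_group) for v in g)
    if within == 0:
        return 0.0 if between == 0 else float("inf")
    return (n_total - k) / (k - 1) * between / within

# Hypothetical scores: schools a and b have the same spread, school c is far more variable
school_a = [1, 2, 3, 4, 5]
school_b = [6, 7, 8, 9, 10]
school_c = [1, 5, 9, 13, 17]

w_similar = levene_w([school_a, school_b])
w_unequal = levene_w([school_a, school_c])
print(w_similar)             # 0.0
print(round(w_unequal, 3))   # 5.445
```

A larger W indicates more unequal spreads; W is compared against an F distribution with (k - 1, N - k) degrees of freedom to obtain a p-value.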

2. What are the three types of ANOVA, and in what situations would each be used?

The three main types of ANOVA are:

1. One-Way ANOVA: This type of ANOVA is used when comparing the means of three or more independent groups on a single continuous dependent variable. It's appropriate when there is one categorical independent variable with three or more levels (groups).

 * Example: Comparing the effectiveness of three different teaching methods (lectures, online modules, and hands-on activities) on student test scores.

2. Two-Way ANOVA: Also known as factorial ANOVA, this type of ANOVA is used when there are two categorical independent variables (factors) and one continuous dependent variable. It examines the main effects of each independent variable as well as any interaction effect between them.

 * Example: Investigating the effects of both gender and treatment type on patient recovery time after a medical procedure. Here, gender and treatment type are the two independent variables, and patient recovery time is the dependent variable.

3. Repeated Measures ANOVA: This type of ANOVA is used when measurements are taken on the same subjects at multiple points in time or under different conditions. It is appropriate when each subject contributes three or more measurements of the same outcome; with only two measurements per subject, a paired t-test is sufficient.

 * Example: Assessing participants' cardiovascular fitness levels measured at baseline, 6 weeks, and 12 weeks of an exercise program. Here, the participants serve as their own controls, and the three measurements represent the repeated measures. (If participants were also split across three different programs, such as low-, moderate-, and high-intensity, program would be a between-subjects factor and the design would call for a mixed ANOVA.)

Each type of ANOVA is suited to different experimental designs and research questions. One-Way ANOVA is appropriate when comparing multiple groups on a single variable, Two-Way ANOVA is useful for examining the effects of two independent variables simultaneously, and Repeated Measures ANOVA is ideal for analyzing data with repeated measurements on the same subjects. Choosing the correct type of ANOVA depends on the specific research design and hypotheses being tested.

3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

The partitioning of variance in ANOVA refers to the division of the total variance observed in the data into different components that can be attributed to various sources. Understanding this concept is crucial because it provides insights into the relative contributions of different factors to the variability observed in the dependent variable. This partitioning allows researchers to determine the significance of each factor and assess whether the observed differences between groups are statistically significant or simply due to random variation.

The partitioning of variance involves three sums of squares: two components and their total.

1. Between-Group Variance (SS_between): This component of variance represents the variability between the group means. It measures the extent to which the means of the groups differ from each other. In other words, it quantifies the differences among the group means that can be attributed to the effect of the independent variable(s) being studied.

2. Within-Group Variance (SS_within or SS_error): Also known as residual variance, this component accounts for the variability within each group. It reflects the extent of variation among individual observations within the same group, after accounting for the differences between group means. It includes random error and any other unexplained sources of variability.

3. Total Variance (SS_total): This represents the overall variability observed in the data, regardless of group membership. It equals the sum of the between-group and within-group sums of squares (SS_total = SS_between + SS_within) and provides a baseline measure of the total variability in the dependent variable before considering the effects of the independent variable(s).

Understanding the partitioning of variance helps researchers interpret the results of ANOVA by evaluating the relative importance of different factors in explaining the observed differences between groups. It allows for a more nuanced analysis of the data and facilitates the identification of significant effects, which can inform theoretical understanding and practical decision-making. Additionally, partitioning of variance enables researchers to assess the goodness-of-fit of the ANOVA model and the proportion of variability accounted for by the factors under investigation. Overall, comprehending this concept is essential for conducting valid and meaningful analyses using ANOVA.
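
The partition described above can be verified numerically. Below is a minimal sketch in plain Python, assuming a small made-up dataset with unequal group sizes; the function name `partition_sums_of_squares` and the data are hypothetical. It also derives the F-statistic and eta-squared (the proportion of variability accounted for by the grouping factor) from the components.

```python
def partition_sums_of_squares(groups):
    """Return (ss_between, ss_within, ss_total) for a dict of group -> observations."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        # Between: squared distance of the group mean from the grand mean,
        # counted once per observation in the group
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        # Within: squared distance of each observation from its own group mean
        ss_within += sum((x - group_mean) ** 2 for x in values)
    return ss_between, ss_within, ss_total

# Hypothetical data with unequal group sizes
groups = {"A": [2, 4, 6], "B": [7, 9], "C": [9, 11, 11, 13]}
ss_between, ss_within, ss_total = partition_sums_of_squares(groups)

n_total = sum(len(v) for v in groups.values())
k = len(groups)
f_stat = (ss_between / (k - 1)) / (ss_within / (n_total - k))
eta_squared = ss_between / ss_total

print(ss_between, ss_within, ss_total)  # 84.0 18.0 102.0
print(f_stat)                           # 14.0
print(round(eta_squared, 3))            # 0.824
```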

4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

In [3]:
group1_data = [1, 2, 3, 4, 5]
group2_data = [6, 7, 8, 9, 10]
group3_data = [11, 12, 13, 14, 15]

In [7]:
import numpy as np

def one_way_anova(data):
    """Return (SST, SSE, SSR) for a 2-D array whose rows are equal-sized groups."""
    data = np.asarray(data, dtype=float)

    # Calculate grand mean over all observations
    grand_mean = np.mean(data)

    # Calculate total sum of squares (SST)
    SST = np.sum((data - grand_mean)**2)

    # Calculate group means: each row is one group, so average along axis=1
    group_means = np.mean(data, axis=1)
    n_per_group = data.shape[1]

    # Calculate explained (between-group) sum of squares (SSE),
    # weighting each squared deviation by the group size
    SSE = n_per_group * np.sum((group_means - grand_mean)**2)

    # Calculate residual (within-group) sum of squares (SSR)
    SSR = np.sum((data - group_means[:, None])**2)

    return SST, SSE, SSR

# Example data (replace this with your actual data)
data = np.array([group1_data, group2_data, group3_data])

# Calculate sums of squares
SST, SSE, SSR = one_way_anova(data)

print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)

Total Sum of Squares (SST): 280.0
Explained Sum of Squares (SSE): 250.0
Residual Sum of Squares (SSR): 30.0


5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

In [15]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data (replace this with your actual data)
# Assuming you have two independent variables (factors) A and B, and a dependent variable Y
data = {
    'A': [1, 1, 1, 2, 2, 2, 3, 3, 3],
    'B': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'Y': [10, 12, 14, 15, 17, 19, 20, 22, 24]
}

# Create a DataFrame
df = pd.DataFrame(data)



In [9]:
missing_values = df.isnull().sum()
print(missing_values)

A    0
B    0
Y    0
dtype: int64


In [10]:
invalid_values = np.isinf(df).sum() + np.isnan(df).sum()
print(invalid_values)

A    0
B    0
Y    0
dtype: int64


In [16]:
column_means = df.mean(axis=0)

In [17]:
df = df.fillna(column_means)
print(df)

   A  B   Y
0  1  1  10
1  1  2  12
2  1  3  14
3  2  1  15
4  2  2  17
5  2  3  19
6  3  1  20
7  3  2  22
8  3  3  24


In [None]:
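# Fit the two-way ANOVA model with both main effects and their interaction.
# Note: the F-tests need at least two observations per A x B cell. The 9-row
# example above has one per cell (zero residual degrees of freedom), so
# replace it with replicated data before relying on the table.
model = ols('Y ~ C(A) + C(B) + C(A):C(B)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Share of the total sum of squares attributed to each term,
# selected by row label rather than by position
total_ss = anova_table['sum_sq'].sum()
main_effects = anova_table.loc[['C(A)', 'C(B)'], 'sum_sq'] / total_ss
interaction_effect = anova_table.loc['C(A):C(B)', 'sum_sq'] / total_ss

print("Main Effects:")
print(main_effects)
print("\nInteraction Effect:")
print(interaction_effect)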
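
To show what the ANOVA table computes for main and interaction effects, here is a minimal sketch in plain Python for a balanced design with replication. The function name `two_way_anova_ss` and the 2 x 2 data (two replicates per cell) are hypothetical and chosen so the arithmetic is easy to follow; the formulas below hold only when every cell has the same number of observations.

```python
from itertools import product

def two_way_anova_ss(cells):
    """Sums of squares and degrees of freedom for a balanced two-way design.

    `cells` maps (a_level, b_level) -> list of replicate observations,
    with the same number of replicates (at least 2) in every cell.
    """
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    r = len(next(iter(cells.values())))        # replicates per cell

    def mean(xs):
        return sum(xs) / len(xs)

    grand = mean([x for obs in cells.values() for x in obs])
    cell_mean = {key: mean(obs) for key, obs in cells.items()}
    a_mean = {a: mean([cell_mean[(a, b)] for b in b_levels]) for a in a_levels}
    b_mean = {b: mean([cell_mean[(a, b)] for a in a_levels]) for b in b_levels}

    ss = {
        "A": r * len(b_levels) * sum((a_mean[a] - grand) ** 2 for a in a_levels),
        "B": r * len(a_levels) * sum((b_mean[b] - grand) ** 2 for b in b_levels),
        # Interaction: what remains of each cell mean after removing both main effects
        "A:B": r * sum((cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
                       for a, b in product(a_levels, b_levels)),
        "error": sum((x - cell_mean[key]) ** 2
                     for key, obs in cells.items() for x in obs),
    }
    na, nb = len(a_levels), len(b_levels)
    df = {"A": na - 1, "B": nb - 1, "A:B": (na - 1) * (nb - 1),
          "error": na * nb * (r - 1)}
    return ss, df

# Hypothetical balanced data: 2 levels of A, 2 levels of B, 2 replicates per cell
cells = {(1, 1): [1, 3], (1, 2): [5, 7], (2, 1): [5, 7], (2, 2): [13, 15]}
ss, df = two_way_anova_ss(cells)

ms_error = ss["error"] / df["error"]
f_stats = {term: (ss[term] / df[term]) / ms_error for term in ("A", "B", "A:B")}

print(ss)       # {'A': 72.0, 'B': 72.0, 'A:B': 8.0, 'error': 8.0}
print(f_stats)  # {'A': 36.0, 'B': 36.0, 'A:B': 4.0}
```

Each F-statistic is then compared against an F distribution with the term's degrees of freedom and the error degrees of freedom, which is what `anova_lm` reports in its `PR(>F)` column.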

6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02.
What can you conclude about the differences between the groups, and how would you interpret these
results?

When you conduct a one-way ANOVA and obtain an F-statistic of 5.23 and a p-value of 0.02, you can make several conclusions about the differences between the groups. Here’s how to interpret these results:

1. F-statistic (5.23): The F-statistic is the ratio of the between-group mean square to the within-group mean square. An F-statistic of 5.23 means that the variability between the group means is 5.23 times what would be expected from the variability within the groups alone. Under the null hypothesis this ratio should be close to 1, so a value this large points to a real difference between at least some group means.

2. P-value (0.02): The p-value is the probability of observing an F-statistic as extreme as, or more extreme than, the observed value if the null hypothesis (all group means are equal) were true. A p-value of 0.02 means that, if the group means were truly equal, an F-statistic of 5.23 or larger would arise in only about 2% of samples. It is not the probability that the null hypothesis is true.

3. Significance Level (typically α = 0.05): In most scientific research, a significance level of 0.05 is used as a threshold to determine statistical significance. Since the p-value (0.02) is less than the significance level (0.05), you reject the null hypothesis.

Conclusion
* Reject the Null Hypothesis: Since the p-value is 0.02, which is less than the common significance level of 0.05, you reject the null hypothesis. This suggests that there is sufficient evidence to conclude that at least one group mean is significantly different from the others.

Interpretation
* Statistically Significant Differences: The results indicate that there are statistically significant differences between the means of the groups. Differences this large would be unlikely to arise from sampling variation alone if all population means were equal.

Additional Considerations
* Post-Hoc Tests: While the one-way ANOVA indicates that there are significant differences among the group means, it does not tell you which specific groups are different from each other. To determine which groups differ, you would need to conduct post-hoc tests, such as Tukey's HSD, Bonferroni correction, or other pairwise comparison methods.

* Effect Size: It’s also helpful to calculate the effect size (e.g., eta-squared, partial eta-squared) to understand the magnitude of the differences between groups. This provides more context on the practical significance of your findings.

Example Interpretation

Suppose you are comparing the effectiveness of three different teaching methods (A, B, and C) on student performance. Your one-way ANOVA results (F-statistic = 5.23, p-value = 0.02) suggest that the teaching method has a statistically significant effect on student performance. Specifically, at least one teaching method leads to a significantly different average performance compared to the others. To identify which specific methods differ, you would perform post-hoc tests. Additionally, calculating an effect size would help you understand the practical importance of these differences in educational settings.
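
The effect size mentioned above can be recovered from the F-statistic once the degrees of freedom are known. The sketch below is illustrative only: the question does not give the degrees of freedom, so the values (2, 14) are an assumption, chosen because they make the implied p-value come out close to 0.02. Both function names are hypothetical, and the closed-form p-value applies only when the between-group degrees of freedom equal 2.

```python
def eta_squared_from_f(f_stat, df_between, df_within):
    """Eta-squared for a one-way ANOVA, recovered from F and its degrees of freedom."""
    return (f_stat * df_between) / (f_stat * df_between + df_within)

def p_value_two_df_between(f_stat, df_within):
    """Upper-tail p-value of the F distribution; valid only when df_between == 2."""
    return (1 + 2 * f_stat / df_within) ** (-df_within / 2)

# F = 5.23 comes from the question; the degrees of freedom are assumed
# (for example, 3 groups and 17 observations in total).
f_stat = 5.23
df_between, df_within = 2, 14

eta_sq = eta_squared_from_f(f_stat, df_between, df_within)
p_approx = p_value_two_df_between(f_stat, df_within)

print(round(eta_sq, 3))    # 0.428
print(round(p_approx, 3))  # 0.02
```

Under these assumed degrees of freedom, roughly 43% of the variability in the outcome would be associated with group membership; with more observations per group the same F would correspond to a smaller eta-squared.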









7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential
consequences of using different methods to handle missing data?

In [20]:
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example DataFrame with missing values
data = pd.DataFrame({
    'subject': [1, 2, 3, 4, 5],
    'time1': [5.1, 7.2, np.nan, 6.3, 8.1],
    'time2': [6.2, np.nan, 5.8, 7.1, 7.9],
    'time3': [np.nan, 6.8, 6.1, 6.9, 7.5]
})

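# Impute missing values with IterativeImputer (chained-equations style).
# Note: a single run like this is single imputation, which understates uncertainty
# compared with true multiple imputation; listwise deletion would instead drop
# subjects 1-3 entirely and leave only two complete subjects.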
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed_data = imputer.fit_transform(data.drop(columns=['subject']))

# Create a DataFrame with the imputed data
imputed_df = pd.DataFrame(imputed_data, columns=['time1', 'time2', 'time3'])

# Add the subject column back
imputed_df['subject'] = data['subject']

# Reshape the DataFrame for repeated measures ANOVA
long_df = pd.melt(imputed_df, id_vars=['subject'], value_vars=['time1', 'time2', 'time3'],
                  var_name='time', value_name='score')

# Fit the repeated measures ANOVA as an OLS model with subject as a blocking factor
model = ols('score ~ C(time) + C(subject)', data=long_df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

print(anova_table)

              sum_sq   df          F    PR(>F)
C(time)     0.517599  2.0   1.214264  0.346312
C(subject)  9.330815  4.0  10.944842  0.002497
Residual    1.705062  8.0        NaN       NaN
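
To make the consequences of simpler missing-data strategies concrete, here is a minimal sketch in plain Python that reuses the scores from the example above. It contrasts listwise deletion (which shrinks the sample) with mean imputation (which artificially shrinks the variance). The variable names are hypothetical.

```python
from statistics import mean, variance

# Scores from the example above; None marks a missing value
time1 = [5.1, 7.2, None, 6.3, 8.1]
time2 = [6.2, None, 5.8, 7.1, 7.9]
time3 = [None, 6.8, 6.1, 6.9, 7.5]

# Listwise deletion: keep only subjects observed at every time point
complete_rows = [row for row in zip(time1, time2, time3) if None not in row]
print(len(complete_rows))  # 2 of the 5 subjects remain

# Mean imputation: fill the gap in time1 with the mean of its observed values
observed = [x for x in time1 if x is not None]
fill_value = mean(observed)
imputed = [fill_value if x is None else x for x in time1]

var_observed = variance(observed)
var_imputed = variance(imputed)
print(round(var_observed, 4))  # 1.6425
print(round(var_imputed, 4))   # 1.2319
```

Listwise deletion discards three of five subjects here, costing power and risking bias if the data are not missing completely at random. Mean imputation keeps every subject but understates variability, which can make F-tests too liberal. Model-based approaches such as multiple imputation or mixed-effects models are generally preferred when missingness is substantial.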


8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide
an example of a situation where a post-hoc test might be necessary.

Post-hoc tests are used after ANOVA when the null hypothesis is rejected, indicating that at least one group mean is significantly different from the others. These tests help determine which specific groups are different. Here are some common post-hoc tests and the situations in which you would use them:

1. Tukey's Honest Significant Difference (HSD) Test

When to Use:

Tukey's HSD is used when you have conducted a one-way ANOVA with equal or nearly equal sample sizes.
It controls the Type I error rate and is appropriate for pairwise comparisons among all group means.

Example:

You have tested the effect of different diets (Diet A, Diet B, Diet C) on weight loss in a study and found a significant effect using ANOVA. Tukey's HSD can determine which diets significantly differ from each other.

2. Bonferroni Correction

When to Use:

The Bonferroni correction is used when you need to make multiple pairwise comparisons and want to control the familywise error rate.
It is simple and works with any test, but it is conservative: it suits a small number of planned comparisons and loses power quickly as the number of comparisons grows.

Example:

In a clinical trial comparing the efficacy of five different drugs, the Bonferroni correction can be used to adjust the significance level for each pairwise comparison to maintain the overall alpha level.

3. Scheffé's Test

When to Use:

Scheffé's test is useful for making complex comparisons (e.g., comparing combinations of group means).
It is more flexible but less powerful than Tukey's HSD for simple pairwise comparisons.

Example:

You want to compare the average test scores of students from three different teaching methods (traditional, online, hybrid) and also investigate the difference between the average of two methods combined versus the third.

4. Newman-Keuls Test

When to Use:

Newman-Keuls is used for pairwise comparisons and is less conservative than Tukey's HSD.
It does not control the familywise error rate as strictly, which can be a downside.

Example:

After finding a significant ANOVA result in a study comparing four different exercise routines on muscle gain, you might use Newman-Keuls to find out which specific routines differ from each other.

5. Dunnett's Test

When to Use:

Dunnett's test is specifically designed to compare multiple treatment groups to a single control group.
It controls the Type I error rate for comparisons against the control.

Example:

In a pharmaceutical study, you compare three new drug formulations to a placebo. Dunnett's test will help you identify which formulations are significantly different from the placebo.

Example Situation for Post-Hoc Test

Scenario:

You conducted a one-way ANOVA to test the effectiveness of four different study techniques (A, B, C, and D) on students' final exam scores. The ANOVA results showed a significant difference among the techniques.

Post-Hoc Test Needed:

Since the ANOVA indicated significant differences, you need to determine which specific techniques differ from each other.

Choice of Post-Hoc Test:

* Tukey's HSD would be appropriate to compare all pairs of study techniques, as it is designed for pairwise comparisons and controls the Type I error rate effectively.
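
As a simpler alternative for the same scenario, the Bonferroni adjustment can be sketched in a few lines of plain Python. The unadjusted p-values below are hypothetical placeholders, not results from a real analysis; in practice they would come from pairwise t-tests. For Tukey's HSD itself, statsmodels provides `pairwise_tukeyhsd` in `statsmodels.stats.multicomp`.

```python
from itertools import combinations

techniques = ["A", "B", "C", "D"]
pairs = list(combinations(techniques, 2))   # all pairwise comparisons
m = len(pairs)                              # k * (k - 1) / 2 = 6 for k = 4

# Hypothetical unadjusted p-values, one per pair, in the same order as `pairs`
raw_p = [0.001, 0.020, 0.300, 0.004, 0.650, 0.012]

alpha = 0.05
# Bonferroni: multiply each p-value by the number of comparisons, capped at 1
adjusted_p = [min(1.0, p * m) for p in raw_p]
significant = [pair for pair, p in zip(pairs, adjusted_p) if p < alpha]

print(m)            # 6
print(significant)  # [('A', 'B'), ('B', 'C')]
```

Note that the pairs with raw p-values of 0.020 and 0.012 would be significant without adjustment but are not after it, which illustrates how the correction guards against false positives at the cost of power.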

9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from
50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python
to determine if there are any significant differences between the mean weight loss of the three diets.
Report the F-statistic and p-value, and interpret the results.

In [25]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data: 60 illustrative observations (20 per diet) standing in for the
# study's 50 participants. Replace with actual data.
data = {
    'weight_loss': [2.5, 3.0, 2.8, 2.7, 2.9, 3.1, 3.2, 2.6, 2.8, 3.0,
                    4.1, 4.3, 4.2, 4.5, 4.0, 4.2, 4.1, 4.4, 4.3, 4.0,
                    5.1, 5.3, 5.4, 5.2, 5.0, 5.5, 5.3, 5.2, 5.4, 5.1,
                    3.3, 3.5, 3.2, 3.4, 3.6, 3.3, 3.5, 3.6, 3.2, 3.4,
                    4.6, 4.7, 4.8, 4.5, 4.9, 4.6, 4.7, 4.8, 4.5, 4.9,
                    3.3, 3.5, 3.2, 3.4, 3.6, 3.3, 3.5, 3.6, 3.2, 3.4],
    'diet':  ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
             'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
             'C', 'C', 'C', 'C', 'C', 'C' ,'C', 'C', 'C', 'C',
              'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
             'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B',
             'C', 'C', 'C', 'C', 'C', 'C' ,'C', 'C', 'C', 'C']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Fit the ANOVA model
model = ols('weight_loss ~ C(diet)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

          sum_sq    df          F        PR(>F)
C(diet)   21.337   2.0  28.564259  2.550935e-09
Residual  21.289  57.0        NaN           NaN

Interpretation: for this example data, F ≈ 28.56 with p ≈ 2.55e-09, far below 0.05, so the null hypothesis of equal mean weight loss across the three diets is rejected. At least one diet's mean differs from the others; a post-hoc test such as Tukey's HSD would identify which pairs differ.


10. A company wants to know if there are any significant differences in the average time it takes to
complete a task using three different software programs: Program A, Program B, and Program C. They
randomly assign 30 employees to one of the programs and record the time it takes each employee to
complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or
interaction effects between the software programs and employee experience level (novice vs.
experienced). Report the F-statistics and p-values, and interpret the results.

In [26]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data: task completion times for 36 illustrative employees
# (6 per program x experience cell); the study describes 30. Replace with actual data.
data = {
    'time': [
        15, 18, 16, 22, 20, 19,  # Program A, Novice
        14, 17, 16, 20, 18, 19,  # Program B, Novice
        16, 19, 17, 21, 20, 18,  # Program C, Novice
        10, 12, 11, 15, 14, 13,  # Program A, Experienced
        9, 11, 10, 14, 13, 12,   # Program B, Experienced
        10, 13, 12, 15, 14, 13   # Program C, Experienced
    ],
    'program': ['A'] * 6 + ['B'] * 6 + ['C'] * 6 + ['A'] * 6 + ['B'] * 6 + ['C'] * 6,
    'experience': ['Novice'] * 18 + ['Experienced'] * 18
}

# Create a DataFrame
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('time ~ C(program) * C(experience)', data=df).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Print the ANOVA table
print(anova_table)

                              sum_sq    df          F        PR(>F)
C(program)                 10.500000   2.0   1.270161  2.954548e-01
C(experience)             300.444444   1.0  72.688172  1.634405e-09
C(program):C(experience)    0.055556   2.0   0.006720  9.933036e-01
Residual                  124.000000  30.0        NaN           NaN

Interpretation: for this example data, experience level has a significant main effect (F ≈ 72.69, p ≈ 1.6e-09), while the software program does not (F ≈ 1.27, p ≈ 0.30), and there is no evidence of a program-by-experience interaction (F ≈ 0.007, p ≈ 0.99). Experienced employees complete the task faster, and that advantage is about the same for all three programs.


11. An educational researcher is interested in whether a new teaching method improves student test
scores. They randomly assign 100 students to either the control group (traditional teaching method) or the
experimental group (new teaching method) and administer a test at the end of the semester. Conduct a
two-sample t-test using Python to determine if there are any significant differences in test scores
between the two groups. If the results are significant, follow up with a post-hoc test to determine which
group(s) differ significantly from each other.

In [28]:
import pandas as pd
from scipy import stats

# Example data: test scores for 80 illustrative students (the study describes 100)
# 40 students in control group, 40 students in experimental group
# Replace with actual data.
data = {
    'test_score': [
        75, 80, 82, 78, 74, 81, 77, 79, 80, 76,
        85, 87, 83, 86, 88, 84, 89, 82, 86, 90,
        68, 72, 70, 65, 69, 67, 71, 73, 75, 64,
        93, 91, 89, 92, 90, 94, 88, 95, 91, 87,
        79, 82, 80, 78, 81, 83, 77, 85, 84, 86,
        73, 71, 69, 72, 75, 68, 70, 74, 66, 72,
        91, 90, 92, 89, 94, 93, 95, 87, 86, 88,
        80, 78, 76, 81, 82, 84, 79, 83, 85, 77
    ],
    'group': ['control'] * 40 + ['experimental'] * 40
}

# Create a DataFrame
df = pd.DataFrame(data)

# Separate the data into two groups
control_group = df[df['group'] == 'control']['test_score']
experimental_group = df[df['group'] == 'experimental']['test_score']

# Perform two-sample t-test
t_stat, p_value = stats.ttest_ind(control_group, experimental_group)

print(f'T-statistic: {t_stat}')
print(f'P-value: {p_value}')

T-statistic: 0.15074732332139876
P-value: 0.8805642122808464

Interpretation: t ≈ 0.15 with p ≈ 0.88, well above 0.05, so for this example data there is no evidence of a difference in mean test scores between the control and experimental groups. No post-hoc test is needed: the result is not significant, and with only two groups the t-test itself already identifies the pair being compared.


12. A researcher wants to know if there are any significant differences in the average daily sales of three
retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store
on those days. Conduct a repeated measures ANOVA using Python to determine if there are any
significant differences in sales between the three stores. If the results are significant, follow up with a
post-hoc test to determine which store(s) differ significantly from each other.

In [29]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import AnovaRM

# Example data: daily sales for 30 days for each of the three stores
# For simplicity, we create a small example dataset. Replace with actual data.
data = {
    'day': list(range(1, 31)) * 3,
    'sales': [
        150, 160, 145, 155, 165, 140, 150, 170, 160, 150,
        155, 160, 150, 165, 170, 155, 160, 150, 160, 155,
        165, 170, 150, 160, 165, 150, 155, 160, 150, 155,
        140, 150, 135, 145, 155, 130, 140, 160, 150, 140,
        145, 150, 140, 155, 160, 145, 150, 140, 150, 145,
        155, 160, 140, 150, 155, 140, 145, 150, 140, 145,
        180, 190, 175, 185, 195, 170, 180, 200, 190, 180,
        185, 190, 180, 195, 200, 185, 190, 180, 190, 185,
        195, 200, 180, 190, 195, 180, 185, 190, 180, 185
    ],
    'store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30
}

# Create a DataFrame
df = pd.DataFrame(data)

# Perform repeated measures ANOVA
aovrm = AnovaRM(df, 'sales', 'day', within=['store'])
res = aovrm.fit()

print(res.summary())

                             Anova
                    F Value                Num DF  Den DF Pr > F
----------------------------------------------------------------
store 1722152748378555409280192741376.0000 2.0000 58.0000 0.0000

Interpretation: the p-value is effectively zero, so the hypothesis of equal mean sales across the three stores is rejected. The enormous F value is an artifact of the example data: Store B's sales are exactly Store A's minus 10 and Store C's are exactly Store A's plus 30 on every day, so the residual variance is essentially zero. With real data, a significant result would be followed by pairwise comparisons (for example, paired t-tests with a Bonferroni correction) to determine which stores differ.

