In [1]:
# Q.1
"""ANOVA (Analysis of Variance) is used to compare the means of three or more groups. For it to provide valid results, several assumptions must be met:

Independence: The observations must be independent of each other. This means the outcome of one observation should not influence another.

Example Violation: If you’re studying the effect of different teaching methods on students’ test scores, but some students share study materials or discuss their answers, their scores might not be independent.

Normality: The data within each group should be approximately normally distributed.

Example Violation: If the test scores are heavily skewed or have outliers, the normality assumption may be violated, affecting the accuracy of ANOVA results.

Homogeneity of Variances (Homoscedasticity): The variances of the populations should be equal across the groups.

Example Violation: If one teaching method leads to much higher variability in test scores compared to others, this assumption is violated, potentially leading to incorrect conclusions.

Potential Impacts of Violations:
Independence: Violations can lead to underestimated standard errors and inflated Type I error rates, so ANOVA may report significant differences when there are none.

Normality: With small samples, marked non-normality (strong skew or outliers) can distort the F-test's p-values. With larger, roughly equal group sizes the F-test is fairly robust to moderate departures.

Homogeneity of Variances: Unequal variances distort the F-test, particularly when group sizes are unequal. The Type I error rate tends to rise when the smaller groups have the larger variances, and the test becomes conservative in the opposite case."""
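
The normality and equal-variance assumptions above can be screened before running ANOVA. The following is a minimal sketch, assuming SciPy is available; the teaching-method scores are simulated for illustration and are not data from this notebook. Independence cannot be tested this way, since it depends on how the data were collected.

```python
import numpy as np
from scipy import stats

# Hypothetical test scores for three teaching methods (simulated, illustrative only).
rng = np.random.default_rng(0)
scores = {
    "Method A": rng.normal(loc=70, scale=5, size=30),
    "Method B": rng.normal(loc=74, scale=5, size=30),
    "Method C": rng.normal(loc=72, scale=15, size=30),  # deliberately more variable
}

# Normality: Shapiro-Wilk test within each group (H0: the data are normal).
for name, values in scores.items():
    w_stat, p_norm = stats.shapiro(values)
    print(f"{name}: Shapiro-Wilk W = {w_stat:.3f}, p = {p_norm:.3f}")

# Homogeneity of variances: Levene's test across groups (H0: equal variances).
lev_stat, p_levene = stats.levene(*scores.values())
print(f"Levene statistic = {lev_stat:.3f}, p = {p_levene:.4f}")
```

A small p-value from either test flags a possible violation. With large samples these tests detect even trivial departures, so a plot of each group is a useful companion.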


In [2]:
# Q.2
"""There are three main types of ANOVA, each used in different scenarios to compare group means:

One-Way ANOVA:

Situation: Used when you want to compare the means of three or more independent (unrelated) groups.

Example: Comparing the effectiveness of three different diets on weight loss.

Two-Way ANOVA:

Situation: Used when you want to study the effect of two different independent variables on a dependent variable, and to see if there is an interaction between the two variables.

Example: Investigating the effects of both diet and exercise on weight loss, where diet and exercise are the two independent variables.

Repeated Measures ANOVA:

Situation: Used when the same subjects are used for each treatment (like a within-subject design). It’s useful for measuring the changes in subjects over time.

Example: Assessing the impact of a training program on performance, where the same individuals’ performances are measured at different time points (e.g., before, during, and after the training)."""


In [3]:
#Q.3 
'''Partitioning of variance in ANOVA refers to breaking down the total variability observed in data into components attributed to different sources. Here’s the breakdown:

Total Sum of Squares (SST): Represents the total variability in the dataset. It's the sum of the squared differences between each observation and the overall mean.

Between-Group Sum of Squares (SSB): Measures the variability due to the differences between group means. It’s the sum of the squared differences between each group mean and the overall mean, weighted by the number of observations in each group.

Within-Group Sum of Squares (SSW): Measures the variability within each group. It’s the sum of the squared differences between each observation and its respective group mean.

The relationship is: $$ SST = SSB + SSW $$

Importance:
Identifies Sources of Variability: Understanding how much of the total variability is due to differences between groups versus within groups helps researchers identify the impact of the independent variable(s).

Informs Significance Testing: By comparing the between-group and within-group variability, ANOVA assesses whether the observed differences in group means are statistically significant or could have occurred by chance.

Enhances Understanding: Partitioning variance offers a clear picture of the structure of the data, enabling better interpretation and informed decisions.'''
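
The three sums of squares described above can be written out explicitly. The notation here is an assumption introduced for illustration: $k$ groups, $n_j$ observations in group $j$, $N$ observations in total, $x_{ij}$ the $i$-th observation in group $j$, $\bar{x}_j$ the mean of group $j$, and $\bar{x}$ the overall mean.

```latex
\begin{aligned}
SST &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left( x_{ij} - \bar{x} \right)^2 \\
SSB &= \sum_{j=1}^{k} n_j \left( \bar{x}_j - \bar{x} \right)^2 \\
SSW &= \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left( x_{ij} - \bar{x}_j \right)^2 \\
F   &= \frac{SSB / (k - 1)}{SSW / (N - k)}
\end{aligned}
```

The last line shows how the partition feeds the significance test: each sum of squares is divided by its degrees of freedom, and the ratio of the two mean squares is the F-statistic.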

In [8]:
# Q.4
import numpy as np

data = {
    'A':[20,30,25],
    'B':[27,29,30],
    'C':[25,28,27]
}

groups = [np.array(values) for values in data.values()]
all_data = np.concatenate(groups)

overall_mean = np.mean(all_data)

sst = np.sum((all_data - overall_mean) ** 2)

# Explained (between-group) sum of squares. The built-in sum() is used because
# passing a generator to np.sum() is deprecated.
ssb = sum(len(group) * (np.mean(group) - overall_mean) ** 2 for group in groups)

# Residual (within-group) sum of squares
ssr = sum(np.sum((group - np.mean(group)) ** 2) for group in groups)

print(f"Total Sum of Squares (SST): {sst}")
print(f"Explained Sum of Squares (SSE, between groups): {ssb}")
print(f"Residual Sum of Squares (SSR, within groups): {ssr}")

Total Sum of Squares (SST): 79.55555555555556
Explained Sum of Squares (SSE, between groups): 20.222222222222236
Residual Sum of Squares (SSR, within groups): 59.33333333333333
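
The sums of squares lead directly to the F-statistic. The sketch below reuses the same three groups, divides each sum of squares by its degrees of freedom, and checks the result against `scipy.stats.f_oneway`. The variable names are illustrative, and SciPy is assumed to be installed.

```python
import numpy as np
from scipy import stats

# Same three groups as in the Q.4 cell.
groups = [np.array([20, 30, 25]), np.array([27, 29, 30]), np.array([25, 28, 27])]
all_data = np.concatenate(groups)
grand_mean = all_data.mean()

ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1              # k - 1
df_within = len(all_data) - len(groups)   # N - k

# F is the ratio of the between-group and within-group mean squares.
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

# Cross-check against SciPy's one-way ANOVA.
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual: F = {f_manual:.4f}, p = {p_manual:.4f}")
print(f"scipy : F = {f_scipy:.4f}, p = {p_scipy:.4f}")
```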


In [14]:
# Q.5
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols



# Sample data
data = {
    'Teaching_Method': ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'A', 'B', 'B', 'B'],
    'Study_Time': ['1 hour', '1 hour', '1 hour', '1 hour', '1 hour', '1 hour', '2 hours', '2 hours', '2 hours', '2 hours', '2 hours', '2 hours'],
    'Test_Score': [88, 92, 85, 91, 89, 90, 90, 94, 93, 95, 92, 96]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Test_Score ~ C(Teaching_Method) + C(Study_Time) + C(Teaching_Method):C(Study_Time)', data=df).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Display the results
print(anova_table)


                                     sum_sq   df         F    PR(>F)
C(Teaching_Method)                10.083333  1.0  1.833333  0.212739
C(Study_Time)                     52.083333  1.0  9.469697  0.015179
C(Teaching_Method):C(Study_Time)   0.083333  1.0  0.015152  0.905071
Residual                          44.000000  8.0       NaN       NaN


In [11]:
pip install statsmodels




In [15]:
# Q.6
"""Reject the Null Hypothesis: Since the p-value (0.02) is less than the significance level (commonly 0.05), you reject the null hypothesis. This means there is enough evidence to suggest that at least one of the group means is significantly different from the others.

Interpretation:
F-Statistic (5.23): This value indicates the ratio of variance between the group means to the variance within the groups. A higher F-statistic suggests a greater degree of difference between group means.

P-Value (0.02): The p-value is the probability of obtaining an F-statistic at least as extreme as the one observed, assuming the null hypothesis is true. A p-value of 0.02 means that, if all group means were truly equal, an F-statistic of 5.23 or larger would occur in about 2% of samples. It is not the probability that the null hypothesis is true.

So, at the 0.05 level, the result indicates a statistically significant difference among the group means. The F-test does not show which groups differ; a post-hoc test is needed for that. """
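
The link between an F-statistic and its p-value can be sketched with the survival function of the F distribution. The question gives F = 5.23 and p = 0.02 but no degrees of freedom, so the values (2, 14) below are a hypothetical choice (three groups, 17 observations), picked only because they give a p-value close to 0.02.

```python
from scipy import stats

f_statistic = 5.23               # from the question
df_between, df_within = 2, 14    # hypothetical: 3 groups, 17 observations
alpha = 0.05

# P(F >= observed) under the null hypothesis of equal group means.
p_value = stats.f.sf(f_statistic, df_between, df_within)

print(f"p = {p_value:.4f}")
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```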


In [16]:
# Q.7
"""
Handling missing data in repeated measures ANOVA is crucial to maintain the integrity and validity of your analysis. Here’s how you can handle it, along with the potential consequences of different methods:

Methods to Handle Missing Data:
Listwise Deletion (Complete Case Analysis):

What it is: Remove any subject with missing data in any of the repeated measures.

Consequences: Reduces sample size and statistical power, and biases the results unless the data are missing completely at random (MCAR).

Pairwise Deletion:

What it is: Use all available data by only removing cases with missing data for specific analyses.

Consequences: Retains more data, but each comparison is based on a different subset of subjects, which makes the results hard to reconcile and can still bias them when data are not missing completely at random.

Mean Substitution:

What it is: Replace missing values with the mean of the observed values for that variable.

Consequences: Can underestimate the variability and inflate type I errors (false positives).

Multiple Imputation:

What it is: Replace each missing value with several plausible values drawn from a model of the observed data, analyze each completed dataset, and pool the results.

Consequences: More accurate than mean substitution because it reflects the uncertainty of the imputed values, but it is computationally intensive and assumes the data are missing at random (MAR) given the observed variables.

Last Observation Carried Forward (LOCF):

What it is: Replace missing values with the last observed value for that subject.

Consequences: Simple to implement but biased whenever scores change over time, because it assumes a subject's value stays fixed after the last observation. It also understates variability.

Mixed-Effects Models:

What it is: Fit a model with random effects for subjects, which uses every available observation instead of dropping subjects with incomplete data.

Consequences: Flexible and valid when data are missing at random (MAR), but requires more complex modeling and assumptions about the data structure."""
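
Two of the simpler strategies above, listwise deletion and mean substitution, can be illustrated with pandas. The data frame below is hypothetical (six subjects measured before, during, and after training) and is only meant to show the loss of subjects and the shrinkage in variance.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per subject, one column per time point.
wide = pd.DataFrame({
    "subject": [1, 2, 3, 4, 5, 6],
    "before": [10.0, 12.0, 11.0, 13.0, 9.0, 14.0],
    "during": [12.0, np.nan, 13.0, 15.0, 10.0, 16.0],
    "after": [15.0, 16.0, np.nan, 18.0, 12.0, 19.0],
}).set_index("subject")

# Listwise deletion: keep only subjects observed at every time point.
complete_cases = wide.dropna()
print(f"Subjects before: {len(wide)}, after listwise deletion: {len(complete_cases)}")

# Mean substitution: fill each gap with that time point's observed mean.
mean_filled = wide.fillna(wide.mean())

# Imputed points sit exactly at the mean, so the sample variance shrinks.
var_observed = wide["during"].var()
var_filled = mean_filled["during"].var()
print(f"Variance of 'during', observed only: {var_observed:.2f}")
print(f"Variance of 'during', mean-filled:   {var_filled:.2f}")
```

The lower variance after mean substitution is what makes the F-test look more significant than the data justify.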


In [17]:
#Q.8 
"""Post-Hoc Tests After ANOVA
Post-hoc tests are statistical procedures used to identify which specific groups differ significantly from one another after a significant ANOVA result. These tests are necessary when the ANOVA indicates an overall difference among groups but doesn't specify which groups are different.

Here are some common post-hoc tests:

1. Tukey's HSD (Honestly Significant Difference)
When to use: When you want all pairwise comparisons between group means while controlling the family-wise error rate (the probability of making at least one Type I error). It assumes equal sample sizes; the Tukey-Kramer variant handles unequal ones.
Example: Comparing the mean exam scores of students from three different schools.
2. Bonferroni Correction
When to use: When you want a simple, general way to control the family-wise error rate. It is best suited to a modest number of comparisons, because it becomes conservative as the number grows.
Example: Comparing the effects of five different fertilizers on plant growth.
3. Sidak Correction
When to use: Similar to Bonferroni but often slightly more powerful (i.e., less likely to miss true differences).
Example: Comparing the effectiveness of four different marketing campaigns on sales.
4. Fisher's LSD (Least Significant Difference)
When to use: When the overall ANOVA is significant and you accept weaker protection against Type I errors. It applies unadjusted t-tests, so it does not strictly control the family-wise error rate.
Example: Comparing the mean blood pressure of patients in three different treatment groups.
5. Scheffé's Test
When to use: When you want to control the family-wise error rate across complex contrasts, not only pairwise comparisons (e.g., the average of two groups versus a third).
Example: Testing whether average plant growth under two fertilizers differs from growth under a third.
Example of when a post-hoc test might be necessary:

Imagine a study comparing the effectiveness of three different teaching methods (Method A, Method B, and Method C) on student test scores. An ANOVA reveals a significant difference among the three methods. However, the ANOVA alone doesn't tell us which method(s) are significantly different from the others. To pinpoint the specific differences, a post-hoc test (e.g., Tukey's HSD) would be used to compare the mean test scores of each method pair."""
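
The teaching-method scenario above can be sketched with Bonferroni-adjusted pairwise t-tests. The scores are hypothetical and chosen only for illustration. Note that each t-test here pools variance from just the two groups being compared, which is a simplification relative to procedures such as Tukey's HSD that use the ANOVA error term.

```python
from itertools import combinations
from scipy import stats

# Hypothetical test scores for the three teaching methods (illustrative only).
scores = {
    "Method A": [78, 82, 75, 80, 79, 77, 81, 76],
    "Method B": [85, 88, 84, 90, 86, 87, 83, 89],
    "Method C": [79, 81, 78, 83, 80, 82, 77, 84],
}

pairs = list(combinations(scores, 2))
adjusted = {}
for first, second in pairs:
    t_stat, p_raw = stats.ttest_ind(scores[first], scores[second])
    # Bonferroni: multiply each raw p-value by the number of comparisons, cap at 1.
    adjusted[(first, second)] = min(p_raw * len(pairs), 1.0)

for (first, second), p_adj in adjusted.items():
    verdict = "differ" if p_adj < 0.05 else "no clear difference"
    print(f"{first} vs {second}: adjusted p = {p_adj:.4f} ({verdict})")
```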

In [19]:
# Q.9
import pandas as pd
import numpy as np
from scipy.stats import f_oneway

# Example data
data = {
    'Diet A': [2, 3, 2.5, 4, 3.5, 2, 3, 2.5, 4, 3.5, 2, 3, 2.5, 4, 3.5, 2, 3, 2.5, 4, 3.5],
    'Diet B': [3, 4, 3.5, 5, 4.5, 3, 4, 3.5, 5, 4.5, 3, 4, 3.5, 5, 4.5, 3, 4, 3.5, 5, 4.5],
    'Diet C': [1, 1.5, 1, 2, 1.5, 1, 1.5, 1, 2, 1.5, 1, 1.5, 1, 2, 1.5, 1, 1.5, 1, 2, 1.5]
}

# Convert to DataFrame
df = pd.DataFrame(data)

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(df['Diet A'], df['Diet B'], df['Diet C'])

print("F-statistic:", f_statistic)
print("P-value:", p_value)
"""Interpretation:
F-statistic: This value indicates the ratio of the variance between group means to the variance within groups. A higher F-statistic suggests a greater degree of difference between group means.

P-value: If the p-value is less than the significance level (e.g., 0.05), it indicates that there are statistically significant differences between the group means.

In this run the output is:

F-statistic: 86.00000000000009
P-value: 6.125226319567677e-18

Since the p-value is far below 0.05, you reject the null hypothesis and conclude that mean weight loss differs among the three diets. The F-test alone does not identify which diets differ; that requires a post-hoc test."""

F-statistic: 86.00000000000009
P-value: 6.125226319567677e-18

In [20]:
# Q.10
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data
data = {
    'Software_Program': ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'],
    'Experience_Level': ['Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced', 'Novice', 'Novice', 'Novice', 'Experienced', 'Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced', 'Novice', 'Novice', 'Novice', 'Experienced', 'Novice', 'Novice', 'Novice', 'Experienced', 'Experienced', 'Experienced', 'Novice', 'Novice', 'Novice', 'Experienced'],
    'Completion_Time': [25, 30, 28, 20, 22, 24, 26, 29, 27, 21, 32, 34, 33, 28, 26, 27, 35, 31, 30, 29, 24, 26, 25, 22, 21, 23, 27, 29, 28, 25]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Fit the two-way ANOVA model
model = ols('Completion_Time ~ C(Software_Program) + C(Experience_Level) + C(Software_Program):C(Experience_Level)', data=df).fit()

# Perform the ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Display the results
print(anova_table)


                                         sum_sq    df       F        PR(>F)
C(Software_Program)                       194.6   2.0  31.136  2.148308e-07
C(Experience_Level)                       168.2   1.0  53.824  1.410515e-07
C(Software_Program):C(Experience_Level)     4.9   2.0   0.784  4.679224e-01
Residual                                   75.0  24.0     NaN           NaN
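
Reading the table at the 0.05 level: both main effects (software program and experience level) are significant, while the interaction is not (p ≈ 0.47). As a rough complement to the p-values, the sketch below computes partial eta squared for each effect from the sums of squares printed above. The dictionary keys are shortened labels introduced here for readability.

```python
# Sums of squares copied from the ANOVA table printed above.
sum_sq = {
    "Software_Program": 194.6,
    "Experience_Level": 168.2,
    "Software_Program:Experience_Level": 4.9,
}
ss_residual = 75.0

# Partial eta squared = SS_effect / (SS_effect + SS_residual).
partial_eta_sq = {
    effect: ss / (ss + ss_residual) for effect, ss in sum_sq.items()
}

for effect, value in partial_eta_sq.items():
    print(f"{effect}: partial eta^2 = {value:.3f}")
```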


In [23]:
# Q.11
import numpy as np
from scipy.stats import ttest_ind

control_group = [70, 68, 75, 73, 71, 69, 72, 74, 76, 70, 71, 73, 69, 72, 74, 75, 70, 73, 71, 68, 71, 69, 75, 73, 70, 74, 72, 71, 70, 69, 75, 72, 74, 71, 73, 70, 69, 74, 72, 71, 70, 69, 73, 72, 70, 71, 74, 73, 71, 70]
experimental_group = [78, 80, 77, 76, 82, 79, 75, 81, 83, 77, 79, 82, 80, 76, 78, 81, 83, 79, 76, 78, 77, 80, 81, 78, 76, 82, 79, 77, 80, 78, 81, 77, 79, 76, 81, 78, 80, 77, 82, 79, 76, 80, 81, 78, 76, 82, 77, 80, 79]

t_statistic, p_value = ttest_ind(control_group, experimental_group)

print("T-statistic:", t_statistic)
print("p-value:", p_value)

T-statistic: -17.213085682401932
p-value: 3.0362969915200846e-31


In [24]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Combine data into a DataFrame
df = pd.DataFrame({
    'Group': ['Control'] * len(control_group) + ['Experimental'] * len(experimental_group),
    'Scores': control_group + experimental_group
})

# Perform ANOVA
anova_model = ols('Scores ~ C(Group)', data=df).fit()
anova_table = sm.stats.anova_lm(anova_model, typ=2)
print("ANOVA Table:\n", anova_table)

# Perform Tukey's HSD test
tukey = pairwise_tukeyhsd(endog=df['Scores'], groups=df['Group'], alpha=0.05)
print(tukey)


ANOVA Table:
                sum_sq    df           F        PR(>F)
C(Group)  1310.988349   1.0  296.290319  3.036297e-31
Residual   429.193469  97.0         NaN           NaN
  Multiple Comparison of Means - Tukey HSD, FWER=0.05   
 group1    group2    meandiff p-adj lower  upper  reject
--------------------------------------------------------
Control Experimental   7.2784   0.0 6.4391 8.1176   True
--------------------------------------------------------
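
With only two groups, one-way ANOVA and the pooled-variance t-test (the default for `ttest_ind`) are the same test: the F-statistic equals the square of the t-statistic, and the p-values match (3.036297e-31 in both outputs above). A quick check using the printed values:

```python
t_statistic = -17.213085682401932   # from the ttest_ind output above
f_from_anova = 296.290319           # from the ANOVA table above

# For two groups, F = t^2.
f_from_t = t_statistic ** 2
print(f"t^2 = {f_from_t:.6f}, ANOVA F = {f_from_anova:.6f}")
```

For the same reason, Tukey's HSD adds no new information here: with a single pair of groups there is only one comparison to make.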


In [25]:
# Q.12
import pandas as pd
import numpy as np
from statsmodels.stats.anova import AnovaRM
from statsmodels.stats.multicomp import MultiComparison

# Example data
data = {
    'Day': np.tile(np.arange(1, 31), 3),
    'Store': np.repeat(['A', 'B', 'C'], 30),
    'Sales': [200, 220, 230, 240, 250, 210, 200, 230, 210, 220, 250, 230, 220, 210, 240, 220, 230, 240, 250, 210, 200, 220, 230, 240, 250, 210, 200, 230, 210, 220,
              210, 230, 240, 250, 260, 220, 210, 240, 220, 230, 260, 240, 230, 220, 250, 230, 240, 250, 260, 220, 210, 230, 240, 250, 260, 220, 210, 240, 220, 230,
              190, 210, 220, 230, 240, 200, 190, 220, 200, 210, 240, 220, 210, 200, 230, 210, 220, 230, 240, 200, 190, 210, 220, 230, 240, 200, 190, 220, 200, 210]
}

# Create a DataFrame
df = pd.DataFrame(data)

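# Fit the repeated measures ANOVA model.
# Each day is treated as the repeated "subject" and Store as the within factor.
# Note: in this example data, Store B and Store C sales are Store A's shifted by
# exactly +10 and -10, so the residual variance is zero up to rounding error and
# the F value printed below is numerically meaningless.
aovrm = AnovaRM(df, 'Sales', 'Day', within=['Store']).fit()

# Print the repeated measures ANOVA table
anova_table = aovrm.summary()
print(anova_table)

# Post-hoc test using Tukey's HSD
# Caution: combining Day and Store yields 90 groups with one observation each,
# so no within-group variance can be estimated and the adjusted p-values and
# intervals below are nan. Passing df['Store'] as the groups would compare stores.
df['Group'] = df['Day'].astype(str) + df['Store']
mc = MultiComparison(df['Sales'], df['Group'])
post_hoc_res = mc.tukeyhsd()
print(post_hoc_res)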


                             Anova
                    F Value               Num DF  Den DF Pr > F
---------------------------------------------------------------
Store 139328309725521489401368018944.0000 2.0000 58.0000 0.0000





Multiple Comparison of Means - Tukey HSD, FWER=0.05
group1 group2 meandiff p-adj lower upper reject
-----------------------------------------------
   10A    10B     10.0   nan   nan   nan  False
   10A    10C    -10.0   nan   nan   nan  False
   10A    11A     30.0   nan   nan   nan  False
   10A    11B     40.0   nan   nan   nan  False
   10A    11C     20.0   nan   nan   nan  False
   10A    12A     10.0   nan   nan   nan  False
   10A    12B     20.0   nan   nan   nan  False
   10A    12C      0.0   nan   nan   nan  False
   10A    13A      0.0   nan   nan   nan  False
   10A    13B     10.0   nan   nan   nan  False
   10A    13C    -10.0   nan   nan   nan  False
   10A    14A    -10.0   nan   nan   nan  False
   10A    14B      0.0   nan   nan   nan  False
   10A    14C    -20.0   nan   nan   nan  False
   10A    15A     20.0   nan   nan   nan  False
   10A    15B     30.0   nan   nan   nan  False
   10A    15C     10.0   nan   nan   nan  False
   10A    16A      0.0   nan   nan  