In [1]:
##Statistics Advance-6 Assignment

In [None]:
##Q1. Explain the assumptions required to use ANOVA and provide examples of violations that could impact the validity of the results.

##A1. 

##ANOVA (Analysis of Variance) is a statistical method used to compare the means of three or more groups. To use ANOVA reliably, certain assumptions must be met. Here are the key assumptions of ANOVA:

##Independence of observations: The observations are independent of one another, both within each group and across groups. This means that the value of one data point should not be influenced by any other data point. For example, in a study comparing the effectiveness of different teaching methods on student performance, the performance of one student should not affect the performance of another student. A violation would occur if students worked together on the test or if the same student contributed scores to more than one group.

##Normality: The data within each group (equivalently, the residuals of the model) are approximately normally distributed. This means that the distribution of data points within each group follows a bell-shaped curve. Violation of this assumption could occur when the data are heavily skewed or have outliers. For instance, if a study measures the reaction times of participants to a stimulus and the reaction times are heavily skewed, the assumption of normality may be violated.

##Homogeneity of variances (homoscedasticity): The variance of the data within each group is equal. In other words, the spread of data points within each group should be similar. Violation of this assumption, known as heteroscedasticity, occurs when the variability in one group is significantly different from the variability in another group. For example, if a study compares the heights of individuals across different age groups but one age group has much greater variation in heights than the others, it violates the assumption of homogeneity of variances.

##Interval or ratio scale: The dependent variable (the variable being measured) is measured on an interval or ratio scale. This means that the data are numerical and the intervals between consecutive values are equal. For example, if a study compares the effectiveness of different doses of a drug on blood pressure, blood pressure measurements must be continuous numerical values.

##Violations of these assumptions can impact the validity of ANOVA results:

##Independence: If observations within groups are not independent, it could lead to biased estimates of the variability between groups and within groups.

##Normality: Violations of normality can affect the accuracy of p-values and confidence intervals calculated by ANOVA. Skewed distributions or outliers can inflate or deflate the Type I error rate (false positive rate).

##Homogeneity of variances: If variances are not homogeneous across groups, the F-test in ANOVA may become unreliable. It can lead to an increased Type I error rate and decrease the power of the test.

##Interval or ratio scale: If the dependent variable is not measured on an interval or ratio scale, ANOVA may not be appropriate. Using ANOVA with categorical or ordinal data violates the assumptions and may lead to incorrect conclusions.

##Overall, it's essential to assess these assumptions before interpreting the results of an ANOVA analysis. 
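
##The normality and homogeneity-of-variances assumptions above can be checked before running ANOVA. The sketch below is one possible way to do it with scipy, using the Shapiro-Wilk test within each group and Levene's test across groups. The group names and scores are hypothetical values chosen only for illustration, and the third group is deliberately given a much larger spread.

```python
import numpy as np
from scipy import stats

# Hypothetical scores for three teaching methods (illustrative values only)
groups = {
    "method_1": np.array([10, 11, 9, 10, 12, 10, 11, 9], dtype=float),
    "method_2": np.array([20, 21, 19, 20, 22, 20, 21, 19], dtype=float),
    "method_3": np.array([30, 50, 35, 45, 25, 55, 40, 60], dtype=float),  # much larger spread
}

# Normality: Shapiro-Wilk test within each group (null hypothesis: the data are normal)
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}

# Homogeneity of variances: Levene's test across groups (null hypothesis: equal variances)
levene_stat, levene_p = stats.levene(*groups.values())

for name, p in shapiro_p.items():
    print(name, "Shapiro-Wilk p-value:", p)
print("Levene p-value:", levene_p)
```

##A small Levene p-value (below 0.05) suggests that the group variances differ, in which case an alternative such as Welch's ANOVA may be more appropriate than the standard F-test.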


In [None]:
##Q2. What are the three types of ANOVA, and in what situations would each be used?

##A2. ANOVA (Analysis of Variance) comes in several variations, each designed for different experimental designs and research questions. The three main types of ANOVA are:

##One-way ANOVA: One-way ANOVA is used when you have one independent variable (factor) with three or more levels (groups). It's used to determine whether there are any statistically significant differences between the means of the groups. For example:

##A study comparing the effect of three different types of fertilizer on plant growth (independent variable: type of fertilizer, levels: fertilizer A, B, and C).
##Analyzing the effect of different doses of a drug on pain relief (independent variable: dose, levels: low, medium, high).

##Two-way ANOVA: Two-way ANOVA is used when you have two independent variables (factors), and you want to examine the interaction between them and their individual effects on the dependent variable. It allows you to test for main effects of each independent variable as well as their interaction. For example:
##Investigating the effects of both gender and treatment type on patient recovery time (independent variables: gender and treatment type).
##Examining the influence of both temperature and humidity on plant growth (independent variables: temperature and humidity).

##Repeated measures ANOVA: Repeated measures ANOVA is used when the same participants are measured under different conditions or at different time points. It's useful for longitudinal or within-subject designs. This type of ANOVA tests for differences in means across the levels of one or more within-subjects factors. For example:
##Evaluating the effectiveness of a new teaching method by measuring student performance before and after the intervention (within-subject factor: time).
##Assessing changes in anxiety levels in participants before, during, and after exposure to a stressor (within-subject factor: time points).
##Each type of ANOVA has its specific use cases and assumptions. Choosing the appropriate type of ANOVA depends on the research design, the number of independent variables, and the nature of the data being analyzed. It's essential to understand the design of your study and the type of data you have to select the most suitable ANOVA method.

In [None]:
##Q3. What is the partitioning of variance in ANOVA, and why is it important to understand this concept?

##A3. The partitioning of variance in ANOVA refers to the decomposition of the total variance in the data into different components, each representing the variability attributable to specific sources or factors.

##The partitioning of variance in ANOVA typically involves three main components:

##Between-group variance: This component represents the variability between the means of different groups or levels of the independent variable(s). It reflects the extent to which group means differ from each other and is typically associated with the effect of the treatment or experimental conditions.

##Within-group variance: Also known as error variance, this component represents the variability within each group or level of the independent variable(s). It reflects the random variability or noise in the data that is not attributable to the treatment or experimental conditions.

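##Interaction variance (if applicable): In designs involving multiple independent variables, the interaction variance represents the variability associated with the combined effects of the independent variables. It reflects whether the effect of one independent variable depends on the level of another independent variable.

##Understanding this partition is important because the F-statistic is built directly from it: each component's sum of squares is divided by its degrees of freedom to give a mean square, and F is the ratio of the between-group mean square to the within-group mean square. A large ratio indicates that the group means differ by more than random within-group variation would explain. The partition also shows how much of the total variability the factor accounts for (for example, the proportion SS_between / SS_total).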
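
##As a rough illustration of the partition in a one-way design, the sketch below splits the total sum of squares into between-group and within-group parts, forms the F ratio by hand, and compares it with scipy's f_oneway. The three groups are made-up values, and the variable names are chosen only for this example.

```python
import numpy as np
from scipy.stats import f_oneway

# Made-up observations for three groups (illustrative values only)
groups = [
    np.array([8.0, 10.0, 12.0]),
    np.array([13.0, 15.0, 17.0]),
    np.array([18.0, 20.0, 22.0]),
]
all_obs = np.concatenate(groups)
grand_mean = all_obs.mean()

# Partition of the total sum of squares
ss_total = np.sum((all_obs - grand_mean) ** 2)
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(np.sum((g - g.mean()) ** 2) for g in groups)

# Mean squares and the F ratio
df_between = len(groups) - 1
df_within = len(all_obs) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# Cross-check against scipy's one-way ANOVA
f_scipy, p_scipy = f_oneway(*groups)

print("SS total:", ss_total, "= SS between:", ss_between, "+ SS within:", ss_within)
print("F (by hand):", f_manual, "F (scipy):", f_scipy)
```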

In [1]:
##Q4. How would you calculate the total sum of squares (SST), explained sum of squares (SSE), and residual sum of squares (SSR) in a one-way ANOVA using Python?

##A4.

import numpy as np

# Sample data: raw observations for each group (example data).
# The sums of squares need the individual observations, not only the group means.
groups = [
    np.array([8, 10, 12]),
    np.array([13, 15, 17]),
    np.array([18, 20, 22]),
]

# Overall mean (grand mean) across all observations
all_observations = np.concatenate(groups)
overall_mean = np.mean(all_observations)

# Calculate SST (Total Sum of Squares): every observation vs. the grand mean
SST = np.sum((all_observations - overall_mean) ** 2)

# Calculate SSE (Explained Sum of Squares): group means vs. the grand mean, weighted by group size
# (naming follows the question; many texts call this SSB and use SSE for the error term)
SSE = np.sum([len(group) * (np.mean(group) - overall_mean) ** 2 for group in groups])

# Calculate SSR (Residual Sum of Squares): every observation vs. its own group mean
SSR = np.sum([np.sum((group - np.mean(group)) ** 2) for group in groups])

# SST always equals SSE + SSR, so SSR can never be negative
print("Total Sum of Squares (SST):", SST)
print("Explained Sum of Squares (SSE):", SSE)
print("Residual Sum of Squares (SSR):", SSR)

Total Sum of Squares (SST): 174.0
Explained Sum of Squares (SSE): 150.0
Residual Sum of Squares (SSR): 24.0


In [2]:
##Q5. In a two-way ANOVA, how would you calculate the main effects and interaction effects using Python?

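##A5. In a two-way ANOVA, you can calculate the main effects and interaction effects using Python by fitting a linear model that contains both factors and their interaction, then reading the sum of squares, F-statistic, and p-value for each term from the ANOVA table. Both factors must be treated as categorical, and each combination of levels needs more than one observation so that residual variance can be estimated.
##The statsmodels library in Python provides tools for fitting linear models and conducting ANOVA.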

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Prepare your data (example data): 3 levels of A x 2 levels of B, 2 observations per cell
data = {
    'A': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'B': [1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2],
    'Y': [4, 6, 5, 7, 6, 8, 7, 9, 8, 10, 12, 14]
}

df = pd.DataFrame(data)

# Fit a linear model; C() marks A and B as categorical factors rather than numeric predictors
model = ols('Y ~ C(A) + C(B) + C(A):C(B)', data=df).fit()

# Conduct ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Extract main effects and interaction effects by row label
# (the last row of the table is the residual, not the interaction)
main_effects = anova_table.loc[['C(A)', 'C(B)'], 'sum_sq']
interaction_effect = anova_table.loc['C(A):C(B)', 'sum_sq']

print("Main effects:")
print(main_effects.round(4))
print("Interaction effect:")
print(round(interaction_effect, 4))

Main effects:
C(A)    62.0
C(B)    12.0
Name: sum_sq, dtype: float64
Interaction effect:
6.0


In [None]:
##Q6. Suppose you conducted a one-way ANOVA and obtained an F-statistic of 5.23 and a p-value of 0.02. What can you conclude about the differences between the groups, and how would you interpret these results?

##A6. In a one-way ANOVA, the F-statistic tests whether there are significant differences between the means of the groups. The associated p-value is the probability of observing an F-statistic at least as large as the one obtained if the null hypothesis (all group means are equal) were true.

##Given the F-statistic of 5.23 and a p-value of 0.02:

##Interpretation of the F-statistic: The F-statistic is a measure of the ratio of the variance between groups to the variance within groups. A larger F-statistic suggests a larger difference between the means of the groups relative to the variation within each group. In this case, the F-statistic of 5.23 indicates that there is some evidence of differences between the group means.

##Interpretation of the p-value: The p-value represents the probability of obtaining the observed data, or more extreme data, if the null hypothesis were true. A p-value less than the chosen significance level (usually 0.05) indicates that the observed differences are statistically significant. In this case, the p-value of 0.02 is less than 0.05, suggesting that the observed differences between group means are statistically significant.

##Based on these results, we can conclude that there are significant differences between the groups. Specifically, at least one group mean differs significantly from at least one other group mean. However, the ANOVA test does not tell us which specific group means are different from each other. To determine that, post-hoc tests (e.g., Tukey's HSD test) can be conducted.

##In summary, with an F-statistic of 5.23 and a p-value of 0.02, we reject the null hypothesis and conclude that there are statistically significant differences between the group means in the study.
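
##The link between an F-statistic and its p-value can be sketched with scipy's F distribution. The question does not state the degrees of freedom, so the values below (3 groups and 18 observations in total) are purely hypothetical; they were chosen because they give a p-value close to the reported 0.02.

```python
from scipy.stats import f

f_statistic = 5.23
df_between = 2   # hypothetical: 3 groups - 1
df_within = 15   # hypothetical: 18 observations - 3 groups

# p-value: probability of an F at least this large under the null hypothesis
p_value = f.sf(f_statistic, df_between, df_within)

# Critical value at the 0.05 significance level for the same degrees of freedom
critical_value = f.ppf(0.95, df_between, df_within)

print("p-value:", p_value)
print("Critical F at alpha = 0.05:", critical_value)
print("Reject the null hypothesis:", f_statistic > critical_value)
```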

In [None]:
##Q7. In a repeated measures ANOVA, how would you handle missing data, and what are the potential consequences of using different methods to handle missing data?

##A7. Handling missing data in repeated measures ANOVA requires careful consideration because missing data can introduce bias and reduce statistical power. Here are some common methods for handling missing data in repeated measures ANOVA and their potential consequences:

##Complete Case Analysis (CCA): In CCA, only cases with complete data for all variables are included in the analysis, and cases with missing data are excluded. While CCA is straightforward, it can lead to biased results if the missing data are not missing completely at random (MCAR). Excluding cases with missing data may also reduce the sample size and statistical power.

##Pairwise Deletion: In pairwise deletion, cases with missing data for some variables are included in the analysis for those variables where data are available. This method maximizes the use of available data but may lead to biased estimates if the missing data are related to the outcome or if the missingness mechanism is not MCAR. Additionally, it can inflate Type I error rates and reduce statistical power, especially when the amount of missing data is large.

##Mean Imputation: Mean imputation involves replacing missing values with the mean of the observed values for that variable. While mean imputation is simple to implement, it can lead to biased estimates and underestimation of standard errors. It also artificially reduces the variability of the data and can distort relationships between variables.

##In summary, the choice of method for handling missing data in repeated measures ANOVA depends on the nature of the missing data, the assumptions underlying the analysis, and the goals of the study. It's essential to carefully consider the potential consequences of different methods and to perform sensitivity analyses to assess the robustness of the results to different assumptions about the missing data mechanism.
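
##The difference between complete case analysis and mean imputation can be seen on a small example. The sketch below uses pandas on a hypothetical wide-format table (one row per participant, one column per time point); the participant labels and scores are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical wide-format data: one row per participant, one column per time point
scores = pd.DataFrame(
    {
        "t1": [10, 12, 11, 13, 9],
        "t2": [11, np.nan, 12, 15, 10],
        "t3": [13, 14, 12, np.nan, 11],
    },
    index=["p1", "p2", "p3", "p4", "p5"],
)

# Complete case analysis: drop every participant with any missing time point
complete_cases = scores.dropna()

# Mean imputation: replace each missing value with that time point's observed mean
mean_imputed = scores.fillna(scores.mean())

print("Participants kept by complete case analysis:", len(complete_cases))
print("Variance of t2, observed values only:", scores["t2"].var())
print("Variance of t2 after mean imputation:", mean_imputed["t2"].var())
```

##Complete case analysis keeps only three of the five participants here, while mean imputation keeps all five but shrinks the variance of the imputed column, which is the underestimation of variability described above.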

In [None]:
##Q8. What are some common post-hoc tests used after ANOVA, and when would you use each one? Provide an example of a situation where a post-hoc test might be necessary.

##A8. After conducting an ANOVA and finding a significant difference between groups, post-hoc tests are often used to determine which specific groups differ from each other. Some common post-hoc tests include:

##Tukey's Honestly Significant Difference (HSD) test: Tukey's HSD test compares all possible pairs of group means and provides simultaneous confidence intervals for each pairwise comparison. It is best suited to comparing every pair of groups when the sample sizes are equal across groups; the Tukey-Kramer variant handles unequal sample sizes.

##Bonferroni correction: The Bonferroni correction divides the significance level by the number of comparisons (for example, 0.05 / 6 for six pairwise comparisons) to control the familywise error rate. It is a conservative approach and is best suited to a small number of planned comparisons.

##An example situation where a post-hoc test might be necessary is in a clinical trial comparing the efficacy of four different treatments for a particular medical condition. 
##After conducting an ANOVA, if the result indicates a significant difference between treatment groups, a post-hoc test such as Tukey's HSD or Bonferroni correction can be used to identify which specific treatment groups differ from each other in terms of efficacy. 
##This helps clinicians make informed decisions about the most effective treatment option for patients.
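
##A minimal sketch of the Bonferroni approach for the four-treatment example is shown below, using pairwise independent t-tests from scipy. The treatment labels and outcome scores are hypothetical, with treatments T3 and T4 deliberately given higher values than T1 and T2.

```python
from itertools import combinations
from scipy.stats import ttest_ind

# Hypothetical outcome scores for four treatments (illustrative values only)
treatments = {
    "T1": [10, 11, 9, 10, 12, 10],
    "T2": [10, 12, 9, 11, 10, 11],
    "T3": [15, 16, 14, 15, 17, 16],
    "T4": [15, 14, 16, 15, 16, 17],
}

alpha = 0.05
pairs = list(combinations(treatments, 2))  # 6 pairwise comparisons
adjusted_alpha = alpha / len(pairs)        # Bonferroni-adjusted significance level

# Compare every pair and keep those significant at the adjusted level
significant_pairs = []
for first, second in pairs:
    t_statistic, p_value = ttest_ind(treatments[first], treatments[second])
    if p_value < adjusted_alpha:
        significant_pairs.append((first, second))

print("Adjusted alpha:", adjusted_alpha)
print("Significantly different pairs:", significant_pairs)
```

##With these values, only the pairs that combine one of T1 or T2 with one of T3 or T4 are flagged, which is the kind of pairwise detail the overall ANOVA F-test cannot provide.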

In [3]:
##Q9. A researcher wants to compare the mean weight loss of three diets: A, B, and C. They collect data from 50 participants who were randomly assigned to one of the diets. Conduct a one-way ANOVA using Python to determine if there are any significant differences between the mean weight loss of the three diets. Report the F-statistic and p-value, and interpret the results.

##A9. 
import numpy as np
from scipy.stats import f_oneway

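# Weight loss data for each diet (example data, randomly generated)
# Note: this simulation draws 50 participants per diet, and no random seed is set,
# so the exact F-statistic and p-value change from run to run
diet_A = np.random.normal(loc=5, scale=1, size=50)  # Mean weight loss of 5 kg for diet A
diet_B = np.random.normal(loc=6, scale=1, size=50)  # Mean weight loss of 6 kg for diet B
diet_C = np.random.normal(loc=7, scale=1, size=50)  # Mean weight loss of 7 kg for diet C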

# Combine data from all diets
all_data = [diet_A, diet_B, diet_C]

# Perform one-way ANOVA
f_statistic, p_value = f_oneway(*all_data)

# Report the results
print("F-statistic:", f_statistic)
print("p-value:", p_value)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference between the mean weight loss of the three diets.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference between the mean weight loss of the three diets.")
    
##We interpret the results based on the significance level (alpha = 0.05). 
##If the p-value is less than alpha, we reject the null hypothesis and conclude that there is a significant difference between the mean weight loss of the three diets. Otherwise, we fail to reject the null hypothesis.

F-statistic: 48.613520712327926
p-value: 6.237112736215829e-17
Reject the null hypothesis. There is a significant difference between the mean weight loss of the three diets.


In [4]:
##Q10. A company wants to know if there are any significant differences in the average time it takes to complete a task using three different software programs: Program A, Program B, and Program C. They randomly assign 30 employees to one of the programs and record the time it takes each employee to complete the task. Conduct a two-way ANOVA using Python to determine if there are any main effects or interaction effects between the software programs and employee experience level (novice vs. experienced). Report the F-statistics and p-values, and interpret the results.

##A10.
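import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example data (randomly generated): 30 employees, 5 per Software x Experience combination.
# All three columns must have the same length (30).
data = {
    'Software': ['A', 'B', 'C'] * 10,  # 30 employees assigned to three software programs
    'Experience': ['Novice', 'Experienced'] * 15,  # 15 novice and 15 experienced employees
    'Time': np.random.normal(loc=10, scale=2, size=30)  # Randomly generated completion times
}

df = pd.DataFrame(data)

# Fit a linear model
model = ols('Time ~ C(Software) + C(Experience) + C(Software):C(Experience)', data=df).fit()

# Perform ANOVA
anova_table = sm.stats.anova_lm(model, typ=2)

# Report the results
print(anova_table)

##The table has one row per term: C(Software) and C(Experience) are the main effects, and C(Software):C(Experience) is the interaction. The F column gives each F-statistic and the PR(>F) column gives its p-value.
##A p-value below 0.05 for a main effect indicates that mean completion time differs across the levels of that factor; a p-value below 0.05 for the interaction indicates that the effect of the software program depends on experience level.
##Because the times here are simulated without any real effect of software or experience, and no random seed is set, the exact values change from run to run and most runs are expected to show no significant effects.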

In [5]:
##Q11. An educational researcher is interested in whether a new teaching method improves student test scores. They randomly assign 100 students to either the control group (traditional teaching method) or the experimental group (new teaching method) and administer a test at the end of the semester. Conduct a two-sample t-test using Python to determine if there are any significant differences in test scores between the two groups. If the results are significant, follow up with a post-hoc test to determine which group(s) differ significantly from each other.

##A11. 
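import numpy as np
from scipy.stats import ttest_ind

# Example data (randomly generated; this simulation draws 100 scores per group,
# and no random seed is set, so the exact results change from run to run)
control_group_scores = np.random.normal(loc=70, scale=10, size=100)  # Control group scores
experimental_group_scores = np.random.normal(loc=75, scale=10, size=100)  # Experimental group scores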

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(control_group_scores, experimental_group_scores)

# Report the results
print("Two-sample t-test results:")
print("t-statistic:", t_statistic)
print("p-value:", p_value)

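# Interpret the results
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in test scores between the control and experimental groups.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in test scores between the control and experimental groups.")

##Because only two groups are compared, no post-hoc test is needed: a significant t-test already shows that these two groups differ from each other.
##The sign of the t-statistic gives the direction of the difference. A negative value means the control group's mean score is lower than the experimental group's.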


Two-sample t-test results:
t-statistic: -3.4460457375291544
p-value: 0.0006945405293571882
Reject the null hypothesis. There is a significant difference in test scores between the control and experimental groups.


In [6]:
##Q12. A researcher wants to know if there are any significant differences in the average daily sales of three retail stores: Store A, Store B, and Store C. They randomly select 30 days and record the sales for each store on those days. Conduct a repeated measures ANOVA using Python to determine if there are any significant differences in sales between the three stores. If the results are significant, follow up with a post-hoc test to determine which store(s) differ significantly from each other.

##A12.
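import numpy as np
import pandas as pd
from itertools import combinations
from scipy.stats import ttest_rel
from statsmodels.stats.anova import AnovaRM

# Example data (randomly generated): each day acts as a "subject" measured at all three stores
data = {
    'Day': list(range(1, 31)) * 3,  # 30 days randomly selected
    'Store': ['A'] * 30 + ['B'] * 30 + ['C'] * 30,  # Three stores
    'Sales': np.random.randint(100, 200, size=90).astype(float)  # Random sales data
}

df = pd.DataFrame(data)

# Fit a repeated measures ANOVA with Store as the within-subject factor.
# (An OLS model with the full Store x Day interaction is saturated: it leaves
# zero residual degrees of freedom, so its F-tests cannot be computed.)
rm_anova_results = AnovaRM(df, depvar='Sales', subject='Day', within=['Store']).fit()

# Report the results
print("Repeated measures ANOVA results:")
print(rm_anova_results)

# Follow up with paired t-tests and a Bonferroni correction, which respect the
# pairing of observations by day (Tukey's HSD assumes independent groups)
wide = df.pivot(index='Day', columns='Store', values='Sales')
pairs = list(combinations(wide.columns, 2))
print("\nPost-hoc test results (Bonferroni-adjusted p-values):")
for store_1, store_2 in pairs:
    t_statistic, p_value = ttest_rel(wide[store_1], wide[store_2])
    print(store_1, "vs", store_2, "t =", t_statistic, "adjusted p =", min(p_value * len(pairs), 1.0))

##If the p-value for Store in the ANOVA table is below 0.05, the post-hoc comparisons show which pairs of stores differ.
##Because the sales figures here are drawn at random from the same distribution for every store, and no random seed is set, the exact values change from run to run and most runs are expected to show no significant differences.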