# Significance tests and Inference

- Hypothesis testing
- Analysis of variance
- t-test
- Z-test
- One way Anova
- Two way Anova
- Chi square test

## Hypothesis Testing

This is a statistical method used to make inferences or draw conclusions about a population based on a sample of data. It involves formulating a hypothesis about a population parameter and then collecting and analyzing data to determine whether the evidence supports or contradicts the hypothesis.

### Terminologies used
**Null Hypothesis (H0)** - the default claim that there is no effect or no difference, for example that a population parameter equals a stated value or that two populations have the same mean.

**Alternative Hypothesis (Ha)** - the claim that there is a significant difference between the population parameters. Simply put, it is the opposite of the Null Hypothesis.<br>

**Level of significance (α)** - It is a predefined threshold used in hypothesis testing to determine the critical region for making decisions about the null hypothesis.
<br>
The level of significance is chosen by the researcher before conducting the test and typically ranges from 0.01 to 0.10, with 0.05 being the most commonly used value. The significance level determines how strong the evidence must be against the null hypothesis in order to reject it.<br>

Example:<br>
If α = 5%, we accept a 5% risk of concluding that a difference exists when there is no actual difference.<br>
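
To make the 5% risk concrete, here is a minimal simulation sketch. It assumes a two-tailed one-sample Z-test with a known standard deviation of 1, and the function name `false_rejection_rate` and its parameters are illustrative only. Every sample is drawn with the Null Hypothesis true, so every rejection is a false alarm.

```python
import random
from statistics import NormalDist, mean

def false_rejection_rate(alpha=0.05, n=30, trials=2000, seed=0):
    """Share of trials in which a true null hypothesis (mean = 0) is rejected."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(0, 1) for _ in range(n)]  # H0 is true by construction
        z = mean(sample) / (1 / n ** 0.5)             # known sigma = 1
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

rate = false_rejection_rate()
print(f"Observed false-rejection rate: {rate:.3f}")
```

The observed rate should land close to α, which is what "taking a 5% risk" means in practice.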

**Critical value (C)** - the boundary value in the distribution of the test statistic beyond which the Null Hypothesis is rejected. The test statistic is compared against it.<br>

**Test Statistic (t)** - a value computed from the sample; its formula depends on the test that we run. It is the deciding factor for rejecting or failing to reject the Null Hypothesis. <br>

*Types of Test Statistics:* <br>

![Alt text](image-28.png)

**P-value (p)** - the probability, assuming the Null Hypothesis is true, of obtaining a test statistic at least as extreme as the one observed.<br>

Example<br>
Assume we are running a two-tailed Z-test at 95% confidence. Then the level of significance is α = 5% = 0.05. A proportion (1-α) = 0.95 of the distribution lies at the center, and the remaining α = 0.05 is shared equally between the two tails, so each tail holds (α/2) = 0.025. <br>
The critical value, i.e., Z95% or Zα/2 = 1.96, is read from the Z-scores table.<br>
The graph below illustrates this:

![Alt text](image-29.png)
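
The table lookup in the example can also be reproduced in code. The sketch below uses `statistics.NormalDist` from the Python standard library rather than a printed Z-table; the helper name `two_tailed_p_value` is an illustrative choice.

```python
from statistics import NormalDist

alpha = 0.05
std_normal = NormalDist()  # mean 0, standard deviation 1

# Two-tailed test: alpha is split equally between the two tails
z_critical = std_normal.inv_cdf(1 - alpha / 2)
print(f"Critical value: {z_critical:.2f}")  # 1.96

def two_tailed_p_value(z):
    """Probability of a statistic at least as extreme as z under H0."""
    return 2 * (1 - std_normal.cdf(abs(z)))

print(f"p-value at the critical value: {two_tailed_p_value(z_critical):.3f}")  # 0.050
```

A test statistic that sits exactly on the critical value has a p-value equal to α, which is why comparing the statistic with the critical value and comparing the p-value with α lead to the same decision.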

### Steps of Hypothesis Testing
1. Specify the Null and Alternative Hypotheses about a population parameter.
2. Set the Level of Significance (α).
3. Collect sample data and calculate the Test Statistic and P-value by running a hypothesis test that suits the data.
4. Make a conclusion: reject or fail to reject the Null Hypothesis.



![Alt text](image-27.png)
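
The four steps can be sketched end to end with a two-tailed one-sample Z-test. This is only an illustration: the function `one_sample_z_test`, the sample values, the target of 2.0 cm and the assumed known standard deviation of 0.15 cm are all hypothetical.

```python
from statistics import NormalDist, mean

def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-tailed one-sample Z-test with a known population standard deviation."""
    # Step 1: H0: population mean == mu0, Ha: population mean != mu0
    # Step 2: the significance level alpha is fixed before looking at the data
    # Step 3: compute the test statistic and the p-value
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / n ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Step 4: reject H0 only when the p-value is at or below alpha
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    return z, p_value, decision

# Hypothetical pipe diameters in cm, tested against a target of 2.0 cm
diameters = [2.1, 2.0, 2.2, 2.3, 1.9, 2.1, 2.2, 2.0]
z, p_value, decision = one_sample_z_test(diameters, mu0=2.0, sigma=0.15)
print(f"z = {z:.2f}, p = {p_value:.3f}: {decision}")
```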


### Confusion Matrix in Hypothesis Testing

![Alt text](image-30.png)

**Confidence** - The probability of not rejecting a true Null Hypothesis. It is denoted as (1-α).<br>

**Power of test** - The probability of rejecting a false Null Hypothesis, i.e., the ability of the test to detect a difference. It is denoted as (1-β) and its value lies between 0 and 1.<br>
The factors that affect the power of the test are the sample size, the population variability, the confidence level and the size of the true difference.<br>

**Type I error**: Occurs when we reject a true Null Hypothesis. Its probability is denoted as α.<br>

**Type II error**: Occurs when we fail to reject a false Null Hypothesis. Its probability is denoted as β.<br>

**Accuracy** - Number of correct predictions / Total number of cases.
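
How sample size and variability drive the power of a test can be sketched analytically. The example below assumes a two-tailed one-sample Z-test with known standard deviation; `z_test_power` and its arguments are illustrative names, not part of any library.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-tailed one-sample Z-test for a true mean shift `effect`."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect * n ** 0.5 / sigma  # true shift in standard-error units
    # Probability that the statistic lands in either rejection region
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

for n in (10, 30, 100):
    power = z_test_power(effect=0.5, sigma=1.0, n=n)
    print(f"n = {n:>3}: power = {power:.2f}, beta = {1 - power:.2f}")
```

With no true difference (`effect=0`) the function returns α, the Type I error rate; larger samples or smaller variability raise the power and lower β.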


### Hypothesis tests distribution tree

![Alt text](image-31.png)


### Example 1
Assume we are cylindrical pipe makers and want to check whether the diameter of the pipes follows a Normal/Gaussian distribution.


In [14]:
##Step1: Data Collection
import pandas as pd
df = pd.read_csv('Datasets/circle.csv')  # forward slash works on every platform

df.head()

| | Diameter (cm) |
|---|---|
| 0 | 0.015 |
| 1 | 0.02 |
| 2 | 0.025 |
| 3 | 1.025 |
| 4 | 2.025 |


In [25]:
##Step 2: Define H0 and Ha

H0 = "Data is normal"
Ha = "Data is not normal"

#Set the significance level(α)= 5%
alpha = 0.05

In [57]:
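##Step 3: Run a test to check the normality
'''
The Shapiro-Wilk test will be used in this case
'''
from scipy.stats import shapiro

# shapiro returns (W statistic, p-value); the p-value is the second element
w_statistic, p_value = shapiro(df['Diameter (cm)'])
rounded_p_value = round(p_value, 2)

print("W statistic:", round(w_statistic, 2))
print("Rounded p-value:", rounded_p_value)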


In [58]:
#Step 4: Conclude using the p-value from step 3
if p_value > alpha:
    print(f" {rounded_p_value} > {alpha}. We fail to reject the Null Hypothesis. {H0}")
else:
    print(f" {rounded_p_value} <= {alpha}. We reject the Null Hypothesis. {Ha}")


### Example 2
Assume the business has two production lines that produce the pipes. Check if there is any significant difference in the average diameter of pipes between the two production lines.<br>

*Note:* <br>
Diameter is continuous data and we are comparing data from two units. Therefore, <br>
Y: Continuous<br>
X: Discrete

In [67]:
#Step 1: Generate dataset
import numpy as np

#Set the random seed for reproducibility
np.random.seed(42)

#Generate 100 values per line from a normal distribution (mean 0, standard deviation 1);
#np.abs keeps the diameters non-negative, which makes each sample half-normal
sample_data_line1 = np.abs(np.random.normal(loc= 0, scale=1, size = 100))
sample_data_line2 = np.abs(np.random.normal(loc= 0, scale=1, size = 100))

#Create a DataFrame using the generated sample data
pipes = pd.DataFrame({'Line 1': sample_data_line1,
                      'Line 2': sample_data_line2})

#Display first 5 rows
pipes.head()

| | Line 1 | Line 2 |
|---|---|---|
| 0 | 0.496714 | 1.415371 |
| 1 | 0.138264 | 0.420645 |
| 2 | 0.647689 | 0.342715 |
| 3 | 1.52303 | 0.802277 |
| 4 | 0.234153 | 0.161286 |


In [68]:
#Step 2:Defining the H0 and Ha
H0 = 'Data is Normally Distributed'
Ha = 'Data is not Normally Distributed'

#Set the significance level(α)= 5%
alpha = 0.05

**Explanation**<br>
We have set the significance level (α) to 0.05, which means we are willing to accept a 5% chance of making a Type I error (rejecting the null hypothesis when it is true).

In [73]:
##Step 3: Run a test to check the normality
'''
Function to check normality using Shapiro-Wilk test
'''

from scipy.stats import shapiro
def check_normality(data):
	for columnName, columnData in data.items():
		print('\n' + "Shapiro Test Results of '{}' ".format(columnName))
		# shapiro returns (W statistic, p-value); index 1 is the p-value
		p_value = round(shapiro(columnData.values)[1], 2)

		if p_value>alpha:
			print(f"{p_value} > {alpha}. We fail to reject the Null Hypothesis. '{columnName}' {H0}")
		else:
			print(f"{p_value} <= {alpha}. We reject the Null Hypothesis. '{columnName}' {Ha}")

# Function call to check normality
check_normality(pipes)

*Note:* <br>
Because `np.abs` folds the simulated values into a half-normal distribution, this test may reject normality for both columns. The steps that follow assume the data is approximately normal.


In [82]:
#Step 4: Check if variances are equal

#Setting H0 and Ha
H0 = 'Variances are equal'
Ha = 'Variances are not equal'

from scipy.stats import levene
def check_variances(dataset):
	print('\n' + "Variances Test Results")
	# levene returns (statistic, p-value); index 1 is the p-value
	p_value = round(levene(dataset['Line 1'], dataset['Line 2'])[1],2)

	if p_value>alpha:
		print(f"{p_value} > {alpha}. We fail to reject the Null Hypothesis. {H0}")
	else:
		print(f"{p_value} <= {alpha}. We reject the Null Hypothesis. {Ha}")

check_variances(pipes)


Variances Test Results
0.73 > 0.05. We fail to reject the Null Hypothesis. Variances are equal


In [84]:
#Step 5: Run the T-test for the two samples with equal variances
from scipy.stats import ttest_ind

# Defining Null and Alternative Hypotheses
H0 = 'There is no significant difference.'
Ha = 'There exists a significant difference.'

#T-test function
def t_test(df):
    print('\n' + "2 Sample T Test Results ")
    # equal_var=True runs the pooled-variance (Student) t-test
    test_results = ttest_ind(df['Line 1'], df['Line 2'], equal_var=True)

    p_value = round(test_results[1],2)

    if p_value>alpha:
        print(f"{p_value} > {alpha}. We fail to reject the Null Hypothesis. {H0}")
    else:
        print(f"{p_value} <= {alpha}. We reject the Null Hypothesis. {Ha}")

t_test(pipes)


2 Sample T Test Results 
0.68 > 0.05. We fail to reject the Null Hypothesis. There is no significant difference.


## Analysis of Variance (ANOVA)

This is a statistical test used to compare the means of three or more groups or populations. It determines whether at least one group mean differs significantly from the others; identifying which specific groups differ requires follow-up post-hoc tests.

![Alt text](image-32.png)

### ANOVA Terminologies
**Factor** - This is another term for the *independent variable* in your analysis. In a one-way ANOVA, there is one factor, while in a two-way ANOVA, there are two factors.

**Levels** - These are the different groups or categories within a factor. For example, if the factor is ‘diet’ the levels might be ‘low fat’, ‘medium fat’, and ‘high fat’.

**Response Variable** -  the *dependent variable* or the outcome that you are measuring.

**Within-group Variance** - the variance or spread of scores within each level of your factor.

**Between-group Variance** - the variance or spread of scores between the different levels of your factor.

**Grand Mean** - the overall mean when you consider all the data together, regardless of the factor level.

**Treatment Sums of Squares (SS)** - this represents the *between-group variability*. It is the sum of the squared differences between the group means and the grand mean.

**Error Sums of Squares (SS)** - this represents the *within-group variability*. It is the sum of the squared differences between each observation and its group mean.

**Total Sums of Squares (SS)** - this is the sum of the Treatment SS and the Error SS. It represents the *total variability* in the data.

**Degrees of Freedom (df)** - the number of values that have the freedom to vary when computing a statistic. <br>
For example, if you have ‘n’ observations in one group, then the degrees of freedom for that group is ‘n-1’.

**Mean Square (MS)** - the average squared deviation, calculated by dividing the sum of squares by the corresponding degrees of freedom.

**F-Ratio** - This is the test statistic for ANOVAs, and it’s the ratio of the between-group variance to the within-group variance.<br>
If the between-group variance is significantly larger than the within-group variance, the F-ratio will be large and likely significant.

**Null Hypothesis (H0)** - This is the hypothesis that there is no difference between the group means.

**Alternative Hypothesis (H1)** - This is the hypothesis that there is a difference between at least two of the group means.

**p-value** - the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.<br>
If the p-value is less than the significance level (usually 0.05), then the null hypothesis is rejected in favor of the alternative hypothesis.

**Post-hoc tests** - these are follow-up tests conducted after an ANOVA when the null hypothesis is rejected, to determine which specific groups’ means (levels) are different from each other. Examples include Tukey’s HSD, Scheffe, Bonferroni, among others.
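
As a rough illustration of a post-hoc step, the sketch below runs pairwise t-tests with a Bonferroni-adjusted significance level. It is a simple stand-in, not Tukey's HSD, and the `groups` data for the three diet levels is made up for the example.

```python
from itertools import combinations
from scipy.stats import ttest_ind

# Hypothetical measurements for three levels of a 'diet' factor
groups = {
    "low fat": [5.1, 4.9, 5.6, 5.8, 5.2, 5.4],
    "medium fat": [5.9, 6.1, 6.4, 6.0, 6.3, 6.2],
    "high fat": [5.3, 5.0, 5.5, 5.7, 5.1, 5.6],
}

alpha = 0.05
pairs = list(combinations(groups, 2))
adjusted_alpha = alpha / len(pairs)  # Bonferroni: split alpha across comparisons

for name_a, name_b in pairs:
    p_value = ttest_ind(groups[name_a], groups[name_b]).pvalue
    verdict = "different" if p_value <= adjusted_alpha else "not distinguishable"
    print(f"{name_a} vs {name_b}: p = {p_value:.4f} -> {verdict}")
```

Dividing α by the number of comparisons keeps the overall chance of a false positive at or below α, at the cost of some power.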

### Types of Analysis of Variance (ANOVA)

- **One-way (or one-factor) ANOVA**

This is the simplest type of ANOVA, which involves *one independent variable*. For example, comparing the effect of different types of diet (vegetarian, pescatarian, omnivore) on cholesterol level.

- **Two-way (or two-factor) ANOVA**

This involves *two independent variables*. This allows for testing the effect of each independent variable on the dependent variable, as well as testing if there’s an interaction effect between the independent variables on the dependent variable.

- **Repeated Measures ANOVA**
This is used when the same subjects are measured multiple times under different conditions, or at different points in time. This type of ANOVA is often used in longitudinal studies.

- **Mixed Design ANOVA**
This combines features of both between-subjects (independent groups) and within-subjects (repeated measures) designs. In this model, one factor is a between-subjects variable and the other is a within-subjects variable.

- **Multivariate Analysis of Variance (MANOVA)**
This is used when there are two or more dependent variables. It tests whether changes in the independent variable(s) correspond to changes in the dependent variables.

- **Analysis of Covariance (ANCOVA)**
This combines ANOVA and regression. ANCOVA tests whether certain factors have an effect on the outcome variable after removing the variance for which quantitative covariates (interval variables) account. This allows the comparison of one variable outcome between groups, while statistically controlling for the effect of other continuous variables that are not of primary interest.

- **Nested ANOVA**
This model is used when the groups can be clustered into categories. For example, if you were comparing students’ performance from different classrooms and different schools, “classroom” could be nested within “school.”

### ANOVA formulas

- #### Sum of Squares Total (SST)
This represents the total variability in the data. It is the sum of the squared differences between each observation and the overall mean.

![Alt text](image-33.png)

Where:<br>
*yi* - each individual data point<br>
*y_mean* - the grand mean (mean of all observations)

- #### Sum of Squares Within (SSW)
This represents the variability within each group or factor level. It is the sum of the squared differences between each observation and its group mean.

![Alt text](image-34.png)

Where:<br>
*yij* - each individual data point within a group<br>
*y_mean_i* - the mean of the ith group

- #### Sum of Squares Between (SSB)
This represents the variability between the groups. It is the sum of the squared differences between the group means and the grand mean, multiplied by the number of observations in each group.

![Alt text](image-35.png)

Where:<br>
*ni* - the number of observations in the ith group<br>
*y_mean_i* - the mean of the ith group<br>
*y_mean* - the grand mean (mean of all observations)
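
Written out in one consistent notation, the three sums of squares described above are as follows. This sketch assumes $k$ groups with $n_i$ observations in group $i$, writes $\bar{y}$ for the grand mean and $\bar{y}_i$ for the mean of the ith group; the symbols used in the images may differ.

$$
\begin{aligned}
SST &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}\right)^2 \\
SSW &= \sum_{i=1}^{k}\sum_{j=1}^{n_i} \left(y_{ij} - \bar{y}_i\right)^2 \\
SSB &= \sum_{i=1}^{k} n_i \left(\bar{y}_i - \bar{y}\right)^2 \\
SST &= SSB + SSW
\end{aligned}
$$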

- #### Degrees of Freedom
The degrees of freedom are the number of values that have the freedom to vary when calculating a statistic.<br><br>

*For within groups (dfW)*

![Alt text](image-36.png)

*For between groups(dfB)*

![Alt text](image-37.png)


*For total(dfT)*

![Alt text](image-38.png)

<br>
Where:<br>
N represents the total number of observations<br>
k represents the number of groups
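
In the standard one-way layout, with $N$ total observations and $k$ groups, the degrees of freedom are:

$$
df_B = k - 1, \qquad df_W = N - k, \qquad df_T = N - 1 = df_B + df_W
$$

For instance, three groups of 100 observations each give $df_B = 2$, $df_W = 297$ and $df_T = 299$.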

- #### Mean Squares
Mean squares are the sum of squares divided by the respective degrees of freedom.


*Mean Squares Between (MSB):* 

![Alt text](image-39.png)


*Mean Squares Within (MSW):*

![Alt text](image-40.png)

- #### F-Statistic
The F-statistic is used to test whether the variability between the groups is significantly greater than the variability within the groups.

![Alt text](image-41.png)

If the F-statistic is significantly higher than what would be expected by chance, we reject the null hypothesis that all group means are equal.
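
To tie the formulas together, here is a small by-hand computation using only the standard library. It follows the definitions in this section (SSB, SSW, the mean squares, then F = MSB / MSW); the function name `one_way_anova_f` and the tiny dataset are illustrative.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (SSB, SSW, F) for a list of groups of observations."""
    all_values = [y for group in groups for y in group]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)

    # Between-group sum of squares: group size times the squared distance
    # of the group mean from the grand mean
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: squared distance of each observation
    # from its own group mean
    ssw = sum((y - mean(g)) ** 2 for g in groups for y in g)

    msb = ssb / (k - 1)        # mean square between, df = k - 1
    msw = ssw / (n_total - k)  # mean square within, df = N - k
    return ssb, ssw, msb / msw

# Hypothetical data: three groups of three observations each
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
ssb, ssw, f_stat = one_way_anova_f(groups)
print(f"SSB = {ssb:.2f}, SSW = {ssw:.2f}, F = {f_stat:.2f}")
# prints: SSB = 26.00, SSW = 6.00, F = 13.00
```

The resulting F value is then compared with the F distribution with (k - 1, N - k) degrees of freedom to obtain a p-value.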

### When to use ANOVA
ANOVA (Analysis of Variance) is used when you have three or more groups and you want to compare their means to see if they are significantly different from each other. It is a statistical method that is used in a variety of research scenarios.

Different scenarios when one might use ANOVA:

- #### Comparing Groups
If you want to compare the performance of more than two groups, for example, testing the effectiveness of different teaching methods on student performance.

- #### Evaluating Interactions
In a two-way or factorial ANOVA, you can test for an interaction effect. This means you are not only interested in the effect of each individual factor, but also whether the effect of one factor depends on the level of another factor.

- #### Repeated Measures
If you have measured the same subjects under different conditions or at different time points, you can use repeated measures ANOVA to compare the means of these repeated measures while accounting for the correlation between measures from the same subject.

- #### Experimental Designs
ANOVA is often used in experimental research designs when subjects are randomly assigned to different conditions and the goal is to compare the means of the conditions.

**Here are the assumptions that must be met to use ANOVA:**
- **Normality**: The data should be approximately normally distributed.
- **Homogeneity of Variances**: The variances of the groups you are comparing should be roughly equal. This assumption can be tested using Levene’s test or Bartlett’s test.
- **Independence**: The observations should be independent of each other. This assumption is met if the data is collected appropriately with no related groups (e.g., twins, matched pairs, repeated measures).
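
The first two assumptions can be checked in code before running an ANOVA. The sketch below uses `scipy.stats.shapiro` and `scipy.stats.levene` on simulated data; the three groups are hypothetical, and independence has to be judged from the study design rather than from a test.

```python
import numpy as np
from scipy.stats import levene, shapiro

rng = np.random.default_rng(0)
# Hypothetical response values for three independent groups
groups = [rng.normal(loc=30, scale=5, size=40) for _ in range(3)]

alpha = 0.05

# Normality: Shapiro-Wilk on each group (H0: the group is normally distributed)
for index, group in enumerate(groups, start=1):
    p_normal = shapiro(group).pvalue
    print(f"Group {index}: Shapiro-Wilk p = {p_normal:.3f}")

# Homogeneity of variances: Levene's test (H0: all group variances are equal)
p_levene = levene(*groups).pvalue
print(f"Levene p = {p_levene:.3f}")
print("Equal variances plausible" if p_levene > alpha else "Variances differ")
```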

### Advantages of ANOVA

- #### Comparing Multiple Groups
One of the key advantages of ANOVA is the ability to compare the means of three or more groups in a single test. This makes it more flexible than the t-test, which is limited to comparing two groups.

- #### Control of Type I Error
When comparing multiple groups, the chances of making a Type I error (false positive) increases. One of the strengths of ANOVA is that it controls the Type I error rate across all comparisons. This is in contrast to performing multiple pairwise t-tests which can inflate the Type I error rate.

- #### Testing Interactions
In factorial ANOVA, you can test not only the main effect of each factor, but also the interaction effect between factors. This can provide valuable insights into how different factors or variables interact with each other.

- #### Handling Continuous and Categorical Variables
ANOVA can handle both continuous and categorical variables. The dependent variable is continuous and the independent variables are categorical.

- #### Robustness
ANOVA is considered robust to violations of normality assumption when group sizes are equal. This means that even if your data do not perfectly meet the normality assumption, you might still get valid results.

- #### Provides Detailed Analysis
ANOVA provides a detailed breakdown of variances and interactions between variables which can be useful in understanding the underlying factors affecting the outcome.

- #### Capability to Handle Complex Experimental Designs
Advanced types of ANOVA (like repeated measures ANOVA, MANOVA, etc.) can handle more complex experimental designs, including those where measurements are taken on the same subjects over time, or when you want to analyze multiple dependent variables at once.
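
The Type I error inflation mentioned under "Control of Type I Error" can be quantified with a short calculation. The sketch below assumes the pairwise t-tests are independent, which is a simplification, and `familywise_error_rate` is an illustrative name.

```python
from math import comb

def familywise_error_rate(num_groups, alpha=0.05):
    """Chance of at least one false positive across all pairwise tests,
    assuming the tests are independent (a simplification)."""
    num_tests = comb(num_groups, 2)
    return num_tests, 1 - (1 - alpha) ** num_tests

for k in (3, 4, 5):
    tests, fwer = familywise_error_rate(k)
    print(f"{k} groups -> {tests} pairwise t-tests, error rate about {fwer:.3f}")
```

With three groups the chance of at least one false positive is already about 14%, compared with the 5% of a single ANOVA F-test.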

### Disadvantages of ANOVA
- #### Assumptions
ANOVA relies on several assumptions including normality (the data follows a normal distribution), independence (the observations are independent of each other), and homogeneity of variances (the variances of the groups are roughly equal). If these assumptions are violated, the results of the ANOVA may not be valid.

- #### Sensitivity to Outliers
ANOVA can be sensitive to outliers. A single extreme value in one group can affect the sum of squares and consequently influence the F-statistic and the overall result of the test.

- #### Dichotomous Variables
ANOVA is not suitable when the dependent variable is dichotomous (a variable that can take only two values, like yes/no or male/female). It is used to compare the means of groups for a continuous dependent variable.

- #### Lack of Specificity
Although ANOVA can tell you that there is a significant difference between groups, it doesn’t tell you which specific groups are significantly different from each other. You need to carry out further post-hoc tests (like Tukey’s HSD or Bonferroni) for these pairwise comparisons.

- #### Complexity with Multiple Factors
When dealing with multiple factors and interactions in factorial ANOVA, interpretation can become complex. The presence of interaction effects can make main effects difficult to interpret.

- #### Requires Larger Sample Sizes
To detect an effect of a certain size, ANOVA generally requires larger sample sizes than a t-test.

- #### Equal Group Sizes
While not always a strict requirement, ANOVA is most powerful and its assumptions are most likely to be met when groups are of equal or similar sizes.



### Example
Consider a situation where there are three different fertilizer treatments (Treatment A, Treatment B, and Treatment C). We would like to compare the effectiveness of these treatments on the yield of maize.<br>
We have measured the maize yield for each treatment. Let us conduct an ANOVA analysis.

In [88]:
#Importing Libraries
import numpy as np
import pandas as pd
import scipy.stats as stats

# Setting the random seed for reproducibility
np.random.seed(42)

#Generating three samples of size 100 from a normal distribution (mean 30, standard deviation 10)
sample_A = np.abs(np.random.normal(loc= 30, scale=10, size = 100))
sample_B = np.abs(np.random.normal(loc= 30, scale=10, size = 100))
sample_C = np.abs(np.random.normal(loc= 30, scale=10, size = 100))

# Ensure the values are between 10 and 50
sample_A = np.clip(sample_A, 10, 50)
sample_B = np.clip(sample_B, 10, 50)
sample_C = np.clip(sample_C, 10, 50)

# Convert the dataset to integers
sample_A = sample_A.astype(int)
sample_B = sample_B.astype(int)
sample_C = sample_C.astype(int)

#Create a DataFrame using the generated sample data
fertlizer = pd.DataFrame({'Treatment A': sample_A,
                          'Treatment B': sample_B,
                          'Treatment C': sample_C})

#Print the dataset
fertlizer.head()

| | Treatment A | Treatment B | Treatment C |
|---|---|---|---|
| 0 | 34 | 15 | 33 |
| 1 | 28 | 25 | 35 |
| 2 | 36 | 26 | 40 |
| 3 | 45 | 21 | 40 |
| 4 | 27 | 28 | 16 |


**Explanation** <br>
*np.random.normal* - generates a random dataset of size 100 with a normal distribution <br>
*loc* - this parameter sets the mean of the distribution (30) <br>
*scale* - this parameter sets the standard deviation (10) <br>
*size* - parameter sets the size of the dataset.

In [90]:
#Performing ANOVA
fvalue, pvalue = stats.f_oneway(fertlizer['Treatment A'],
                                fertlizer['Treatment B'],
                                fertlizer['Treatment C'])

# Set the significance level
alpha = 0.05

# Print the ANOVA results
print('ANOVA Results:')
print('F-value:', fvalue)
print('p-value:', pvalue)

# Check if the p-value is less than the significance level
if pvalue < alpha:
    print('There is a significant difference between the means of the treatments.')
else:
    print('There is no significant difference between the means of the treatments.')

ANOVA Results:
F-value: 0.7166036863869475
p-value: 0.48925071938017795
There is no significant difference between the means of the treatments.


**Explanation** <br>
The *f_oneway* function from the *scipy.stats* module is used to perform the one-way ANOVA. It returns the F-statistic and the p-value.