# Significance tests and Inference

- Hypothesis testing
- Analysis of variance
- t-test
- Z-test Anova
- One way Anova
- Two way Anova
- Chi square test

## Hypothesis Testing

This is s statistical method used to make inferences or draw conclusion about a population based on a sample of data. It involves formulating a hypothesis about a population parameter and then collecting and analyzing data to determine whether the evidence supports or contradicts the hypothesis.

### Terminologies used
**Null Hypothesis (Ho)** - a statistical theory that suggests that no statistical significance exists between the populations (sample population versus whole population)

**Alternative Hypothesis (Ha)** - it suggests there is a significant difference between the population parameters. Simply put, it is the contrast of the Null Hypothesis.<br>

**Level of significance (α)** - It is a predefined threshold used in hypothesis testing to determine the critical region for making decisions about the null hypothesis.
<br>
The level of significance is chosen by the researcher before conducting the test and typically ranges from 0.01 to 0.10, with 0.05 being the most commonly used value. The significance level determines how strong the evidence must be against the null hypothesis in order to reject it.<br>

Example:<br>
If α =5%, that means we are okay to take a 5% risk and conclude there exists a difference when there is no actual difference.<br>

**Critical value (C)** - A value in the distribution beyond which leads to the rejection of the Null Hypothesis. It is compared to the test statistic.<br>

**Test Statistic (t)** - it is dependent on the test that we run. It is the decidig factor to reject or accept the Null Hypothesis. <br>

*Types of Test Statistics:* <br>

![Alt text](image-28.png)

**P-value (p)** it is the propotion of samples (assuming the Null Hypothesis is true) that would be as extreme as the test statistic.<br>

Example<br>
Assume we are running a two-tailed Z-Test at 95% confidence. Then, the level of significance (α) = 5% = 0.05. Thus, we will have (1-α) = 0.95 proportion of data at the center, and α = 0.05 proportion will be equally shared to the two tails. Each tail will have (α/2) = 0.025 proportion of data. <br>
The critical value i.e., Z95% or Zα/2 = 1.96 is calculated from the Z-scores table.<br>
Take a look of a graphical illustration of the information above:

![Alt text](image-29.png)

### Steps of Hypothesis Testing
1. Specify the Nully and Alternative Hypotheses about a population parameter.
2. Set the Level of Significance(α)
3. Collect Sample Data and calculate the Test Statistics and P-Value by running a Hypothesis test that well suits the data.
4. Make a Conclusion: Reject or Accept the Null Hypothesis 



![Alt text](image-27.png)


### Confison Matrix in Hypothesis Testing

![Alt text](image-30.png)

**Confidence** - The probability of accepting a True Null Hypothesis. It is denoted as (1-α)<br>

**Power of test** - The probability of rejecting a False Null Hypothesis i.e., the ability of the test to detect a difference. It is denoted as (1-β) and its value lies between 0 and 1.<br>
The factors that affect the power of the test are sample size, population variability and the confidence.<br>

**Type I error**: Occurs when we reject a True Null Hypothesis and is denoted as α.<br>

**Type II error**: Occurs when we accept a False Null Hypothesis and is denoted as β.<br>

**Accuracy** - Number of correct predictions / Total number of cases.


### Hypothesis tests distribution tree

![Alt text](image-31.png)


## Example 1
Assume we are cylindrical pipe makers, we are interested in checking if the diameter of the pipe follows a Normal/Gaussian distribution.


In [14]:
##Step1: Data Collection
import pandas as pd
df = pd.read_csv('Datasets\circle.csv')

df.head()

Unnamed: 0,Diameter (cm)
0,0.015
1,0.02
2,0.025
3,1.025
4,2.025


In [25]:
##Step 2: Define H0 and Ha

H0 = "Data is normal"
Ha = "Data is not normal"

#Set the significance level(α)= 5%
alpha = 0.05

In [57]:
##Step 3: Run a test to check the normality
'''
The shapiro test will be used in this case
'''
from scipy.stats import shapiro
p_value = shapiro(df)[0]
rounded_p_value = round(p_value, 2)


print("Rounded p-value:", rounded_p_value)

Rounded p-value: 0.95


In [58]:
#Step 4: Conclude using the p-value from step 3
if p > alpha:
    print(f" {rounded_p_value} > {alpha}. We accept the Null Hypothesis. {H0}")
else:
    print(f" {rounded_p_value} <= {alpha}. We reject the Null Hypothesis. {H0}")


 0.95 <= 0.05. We reject the Null Hypothesis. Data is Normally Distributed


## Example 2
Assuming the business has two production lines that make that produce the pipes. Check if there is any significant difference in the average diameter of pipes between the two production lines.<br>

*Note:* <br>
Diameter is continuous data and we are comparit data from two units. Therefore, <br>
Y: Continuous<br>
X: Discrete

In [67]:
#Step 1: Generate dataset
import numpy as np

#Set the random seed for reproducibility
np.random.seed(42)

#Generate a sample dataset of 100 values from a normal distribution with a mean 0 and standard deviation of 1
sample_data_line1 = np.abs(np.random.normal(loc= 0, scale=1, size = 100))
sample_data_line2 = np.abs(np.random.normal(loc= 0, scale=1, size = 100))

#Create a DataFrame using the generated sample data
pipes = pd.DataFrame({'Line 1': sample_data_line1,
                      'Line 2': sample_data_line2})

#Display first 5 rows
pipes.head()

Unnamed: 0,Line 1,Line 2
0,0.496714,1.415371
1,0.138264,0.420645
2,0.647689,0.342715
3,1.52303,0.802277
4,0.234153,0.161286


In [68]:
#Step 2:Defining the H0 and Ha
H0 = 'Data is Normally Distributed'
Ha = 'Data is not Normally Distributed'

#Set the significance level(α)= 5%
alpha = 0.05

**Explanation**<br>
We have set the significance level (α) to 0.05, which means we are willing to accept a 5% chance of making a Type I error (rejecting the null hypothesis when it is true).

In [73]:
##Step 3: Run a test to check the normality
'''
Function to check normality using Shapiro-Wilk test
'''

# Function to check normality using Shapiro-Wilk test

from scipy.stats import shapiro
def check_normality(data):
	for columnName, columnData in pipes.items():
		print('\n' + "Shapiro Test Results of '{}' ".format(columnName))
		p_value = round(shapiro(columnData.values)[0], 2)

		if p_value>alpha:
			print(f"{p_value} > {alpha}. We accept Null Hypothesis. '{columnName}' {H0}")
		else:
			print(f"{p_value} <= {alpha}. We reject Null Hypothesis. '{columnName}' {Ha}")

# Function call to check normality
check_normality(pipes)


Shapiro Test Results of 'Line 1' 
0.91 > 0.05. We accept Null Hypothesis. 'Line 1' Data is Normally Distributed

Shapiro Test Results of 'Line 2' 
0.92 > 0.05. We accept Null Hypothesis. 'Line 2' Data is Normally Distributed


In [82]:
#Step 4: Check if variances are equal

#Setting H0 and Ha
H0 = 'Variances are equal or greater'
Ha = 'Variances are not equal'

from scipy.stats import levene
def check_variances(dataset):
	print('\n' + "Variances Test Results' ")
	p_value = round(levene(pipes['Line 1'], pipes['Line 2'])[1],2)

	if p_value>alpha:
		print(f"{p_value} > {alpha}. We accept the Null Hypothesis. {H0}")
	else:
		print(f"{p_value} <= {alpha}. We reject Null Hypothesis. {Ha}")

check_variances(pipes)


Variances Test Results' 
0.73 > 0.05. We accept the Null Hypothesis. Variances are equal or greater


In [84]:
#Step 5: Run the T-test for the two samples with greater variances
from scipy.stats import ttest_ind

# Defining Null and Alternative Hypotheses
H0 = 'There is no significant difference.'
Ha = 'There exist a significant difference.'

#T-test function
def t_test(df):
    print('\n' + "2 Sample T Test Results ")
    test_results = ttest_ind(pipes['Line 1'], pipes['Line 2'], equal_var=True)

    p_value = round(test_results[1],2)

    if p_value>alpha:
        print(f"{p_value} > {alpha}. We accept the Null Hypothesis. {H0}")
    else:
        print(f"{p_value} <= {alpha}. We reject Null Hypothesis. {Ha}")

t_test(pipes)


2 Sample T Test Results 
0.68 > 0.05. We accept the Null Hypothesis. There is no significant difference.


## Analysis of Variance (ANOVA)