# Hypothesis Testing and Mechanism
## Introduction to Hypothesis
- A hypothesis is a statement that proposes a possible explanation for an observed phenomenon or relationship, which scientists can test through investigation.

## Hypothesis Components
- A hypothesis has two components: an independent variable (the presumed cause) and a dependent variable (the presumed effect).

- Example:
    - If you do not clean the fish tank once every three days, the fish will probably not survive for more than three months.

    - "Do not clean the fish tank once every three days" is the independent variable.
    - "Fish will probably not survive for more than three months" is the dependent variable.


## Null and Alternative Hypothesis

- There are two types of hypotheses:

    - Null hypothesis [ 𝐻0 ]
    - Alternative hypothesis [ 𝐻𝑎 ]

- Example: Sales Promotion Drive

    - Hypothesis: A sales promotion drive will increase the average monthly sales (µ) by 500 units.

    - According to the null hypothesis (𝐻0), the sales promotion drive has no significant impact on the average monthly sales, so the historical average (µ0) still holds: µ = µ0.

    - According to the alternative hypothesis (𝐻𝑎), the sales promotion drive has a significant impact on the average monthly sales. Thus, µ > µ0.

- Example: Hourly Output of Two Machines

    - The null hypothesis states that the average hourly output of machine A (µ1) does not significantly differ from that of machine B (µ2). So, µ1 = µ2.


    - The alternative hypothesis is that the average hourly output of machine A (µ1) is significantly larger than that of machine B (µ2). So, µ1 > µ2.

Generally, a null hypothesis is the negation of the assertion, while an alternative hypothesis in itself is an assertion.

## Hypothesis Testing
Hypothesis testing assesses a hypothesis's plausibility using sample data. The sample data may come from a larger population or from a data-generating experiment.

- A hypothesis test always involves four steps. These are:

    1. State the hypotheses: Formulate the null hypothesis and the alternative hypothesis

    2. Set the criteria for a decision: A plan that outlines how to evaluate data

    3. Compute sample statistics: Execution of the plan to physically carry out the analysis

    4. Make a decision: Analysis of the result that rejects the null hypothesis or states that the null hypothesis is plausible

## Hypothesis Testing Outcomes: Type I and Type II Errors
- There are four possible outcomes for a hypothesis test:


    - 𝐻0 (null hypothesis) is TRUE and it is rejected: this is a Type I error

    - 𝐻0 (null hypothesis) is TRUE and it is not rejected: correct decision

    - 𝐻0 (null hypothesis) is FALSE and it is rejected: correct decision

    - 𝐻0 (null hypothesis) is FALSE and it is not rejected: this is a Type II error

    - The probability of the occurrence of Type I errors is denoted by Alpha (α), and the probability of Type II errors is denoted by Beta (β).

- When inferences are based on samples, it is not possible to make both α and β zero at the same time. For a fixed sample size, reducing one typically increases the other, so the aim is to keep both as small as is practical.

- Common choices for α include 0.05 or 0.01. To achieve a low β (and thus high power), researchers typically use large sample sizes, careful experimental design, and sometimes more sophisticated statistical techniques.

- The selected value of Alpha is known as the level of significance. For example, when Alpha is equal to 0.05, the level of significance is 5%.

- This is a simple and effective way to estimate the outcomes of hypothesis testing.
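
The meaning of α can be illustrated with a small simulation. The sketch below assumes a setting in which the null hypothesis is true by construction (samples drawn from a normal population with mean 0), so every rejection is a Type I error. The number of experiments, the sample size, and the seed are hypothetical choices, not values from this lesson.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable
alpha = 0.05
n_experiments = 2000   # hypothetical number of repeated experiments
sample_size = 30       # hypothetical sample size per experiment

# H0 is true by construction: every sample comes from a population with mean 0
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, sample_size))

# One-sample t-test for each row (each row is one experiment)
_, p_values = stats.ttest_1samp(samples, popmean=0.0, axis=1)

# Each rejection here is a Type I error, because H0 is actually true
type_i_rate = np.mean(p_values < alpha)
print(f"Observed Type I error rate: {type_i_rate:.3f} (alpha = {alpha})")
```

The observed rejection rate should land close to the chosen α, which is what the level of significance is meant to control.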

## Steps Involved in Hypothesis Testing
- Create a null hypothesis and an alternative hypothesis.

- Decide on a level of significance, that is, alpha = 5% or 1%.

- Choose the type of test you want to perform per the sample data (z-test, t-test, or chi-square).

- Calculate the test statistic (z-score, t-score) using the respective formula of the test chosen.

- Obtain the critical value in the sampling distribution to construct the rejection region of size alpha using the z-table, t-table, or chi-square table.

- Compare the test statistic with the critical value and locate the position of the calculated test statistic, that is, see if it is in the rejection region or the non-rejection region.

    - If the test statistic lies in the rejection region, you reject the null hypothesis; that is, the sample data provides sufficient evidence against the null hypothesis, and there is a significant difference between the hypothesized value and the observed value of the parameter.

    - If the test statistic lies in the non-rejection region, you do not reject the null hypothesis; that is, the sample data does not provide sufficient evidence against the null hypothesis, and the difference between the hypothesized value and the observed value of the parameter can be attributed to sampling fluctuation.
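
The steps above can be sketched for a right-tailed z-test. All numbers here (sample mean, hypothesized mean, population standard deviation, and sample size) are hypothetical, and the sketch assumes the population standard deviation is known.

```python
import math
from scipy.stats import norm

# Hypothetical inputs for a right-tailed test (H0: mu = 50, Ha: mu > 50)
sample_mean = 52.0
mu_0 = 50.0       # hypothesized population mean
sigma = 6.0       # population standard deviation, assumed known
n = 36            # sample size
alpha = 0.05      # level of significance

# Test statistic: how many standard errors the sample mean is from mu_0
z_stat = (sample_mean - mu_0) / (sigma / math.sqrt(n))

# Critical value: the rejection region is everything to the right of it
critical_value = norm.ppf(1 - alpha)

reject_null = z_stat > critical_value
print(f"z = {z_stat:.3f}, critical value = {critical_value:.3f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```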

## Confidence Interval
- A confidence interval (CI) indicates the amount of uncertainty in an estimate computed from a sample. It is usually expressed as a pair of numbers (a lower limit and an upper limit) and can be computed for a given sample statistic, such as the mean.

- A CI is a range of values, computed from the sample, that is expected to contain a particular population parameter at a stated confidence level.

- CIs can be built for any confidence level, the most common being 95% and, in some cases, 99%. When the population standard deviation is not known, the confidence interval is deduced from the sample data using the T-distribution.

- In practice, the interval is centered on the sample statistic (for example, the sample mean) and extends an equal distance on either side of it.

- A typical confidence interval in a statistical distribution is shown below:


- An upper limit and a lower limit of CI are marked on either side of the distribution.
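
As a minimal sketch, a 95% confidence interval for a mean can be computed with the T-distribution when the population standard deviation is unknown. The sample values below are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of 10 observations
sample = np.array([28, 32, 30, 35, 27, 31, 29, 33, 30, 34], dtype=float)

n = sample.size
mean = sample.mean()
sem = stats.sem(sample)  # standard error of the mean: s / sqrt(n)

# 95% CI from the T-distribution with n - 1 degrees of freedom
lower, upper = stats.t.interval(0.95, n - 1, loc=mean, scale=sem)
print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```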

## Margin of Error
The margin of error (MoE) indicates by how many percentage points the results may differ from the real population value.

Consider the following statement:

A 95% confidence level with a 3% margin of error implies that the survey estimate is expected to fall within 3 percentage points of the real population value in 95% of repeated samples.

The MoE is thus an important part of the confidence interval, without which one can’t accept the inference from statistical analysis. The lower the margin of error, the better the acceptability of the population statistic. MoE is popularly used in poll and election surveys. A poll survey MoE must be scrutinized before accepting the confidence interval.

For Example:

Consider the Gallup poll survey conducted in the 2012 US Presidential elections. The survey indicated 49% voting in favor of Mitt Romney, and 47% in favor of Barack Obama, with 95% CI and +/- 2% MoE. However, Barack Obama polled 51%, while Mitt Romney got 47% in the actual election. Obama's result in particular fell outside the range implied by the Gallup poll's MoE of +/-2%. This illustrates why the CI, the confidence level (CL), and the MoE must all be taken into consideration when interpreting a statistic.
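
For a survey proportion, the margin of error is commonly approximated as the critical z-value times the standard error of the proportion. The sketch below uses a hypothetical sample size and proportion; they are not the figures from the poll described above.

```python
import math
from scipy.stats import norm

confidence_level = 0.95
p_hat = 0.5   # hypothetical observed proportion
n = 1000      # hypothetical number of respondents

# Two-sided critical value for the chosen confidence level
z_critical = norm.ppf(1 - (1 - confidence_level) / 2)

# Margin of error = critical value * standard error of the proportion
margin_of_error = z_critical * math.sqrt(p_hat * (1 - p_hat) / n)
print(f"Margin of error: +/- {margin_of_error * 100:.1f} percentage points")
```

Under these assumptions, roughly 1,000 respondents give a margin of error of about 3 percentage points; a smaller margin requires a larger sample.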

## Confidence Levels
A confidence level is the percentage of confidence intervals that would contain the true population parameter if random samples were drawn repeatedly and an interval was computed from each.

- In statistics, confidence levels are expressed as a percentage (for example, a 99%, 95%, or 80% confidence level). However, to support or discard the null hypothesis, scientists and engineers usually work with a level of 95% or more. On the other hand, most governmental organizations and departments use 90% as the limit for confidence level.

- A graphical representation of a confidence level of 95% is shown below:

- A confidence level of 95% means that when the experiment or the poll survey is repeated many times, about 95% of the resulting confidence intervals will contain the true population value. Confidence levels for different fields are different and are usually adopted in consultation with domain experts. This is done to ensure that the prediction from the statistic is reliable.

## Z-Distribution (Standard Normal Distribution)

- Shape: The Z-distribution has a bell-shaped curve, symmetric around the mean.

Characteristics:

- The mean, median, and mode of the distribution are all zero.
- It has a standard deviation of one.
- The total area under the curve is equal to 1.

Usage:

- It is used for hypothesis testing and confidence intervals when the sample size is large and the population standard deviation is known.
- Any normal distribution can be standardized by converting its values into z-scores.
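
Standardizing means subtracting the mean and dividing by the standard deviation. A minimal sketch, using hypothetical values and a hypothetical population mean and standard deviation:

```python
import numpy as np

# Hypothetical raw values from a normal population
values = np.array([85, 100, 115, 130], dtype=float)
mu = 100.0     # hypothetical population mean
sigma = 15.0   # hypothetical population standard deviation

# z-score: distance from the mean, measured in standard deviations
z_scores = (values - mu) / sigma
print(z_scores)
```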

## T-Distribution

- Shape: The T-distribution is similar to the Z-distribution, but it has heavier tails. This means it is more prone to producing values that fall far from its mean.

- Degree of Freedom: The T-distribution uses only one parameter, which is called the degree of freedom (df). It refers to the number of independent observations, that is, if n = total no. of observations, then df = n-1

Characteristics:

- It's symmetric and bell-shaped, like the Z-distribution, but its exact shape depends on the degrees of freedom.
- The mean of the T-distribution is zero (for df > 1), and its variance is df / (df − 2) for df > 2, which is slightly greater than 1 and approaches 1 as the degrees of freedom grow.
- As the degrees of freedom increase, the T-distribution approaches the Z-distribution.

Usage:

- The T-distribution is used when the sample size is small and the population standard deviation is unknown. It's particularly useful for estimating the mean of a normally distributed population in situations where the sample size is small.

## Plotting T- and Z-Distribution
- The plot below illustrates the differences between the Z-distribution (standard normal distribution) and the T-distributions with different degrees of freedom:


- Z-Distribution: Represented by the solid line, this standard normal distribution is symmetric and bell-shaped, centering around zero with a standard deviation of one.

- T-Distribution (df=5): The dashed line represents the T-distribution with 5 degrees of freedom. It's also symmetric and bell-shaped but has thicker tails compared to the Z-distribution. This indicates a higher likelihood of values far from the mean.

- T-Distribution (df=30): The dotted line represents the T-distribution with 30 degrees of freedom. While it still has slightly thicker tails than the Z-distribution, it's much closer to the Z-distribution's shape than the T-distribution with fewer degrees of freedom.
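
The difference in tail thickness can also be checked numerically. This sketch compares the probability of a value falling more than 3 units from the center under each distribution; the threshold of 3 is an arbitrary choice for illustration.

```python
from scipy import stats

threshold = 3.0  # arbitrary cut-off, chosen only for illustration

# Two-sided tail probability P(|X| > threshold) for each distribution
tail_z = 2 * stats.norm.sf(threshold)
tail_t5 = 2 * stats.t.sf(threshold, df=5)
tail_t30 = 2 * stats.t.sf(threshold, df=30)

print(f"Z-distribution:         {tail_z:.4f}")
print(f"T-distribution (df=5):  {tail_t5:.4f}")
print(f"T-distribution (df=30): {tail_t30:.4f}")
```

The tail probability shrinks toward the Z-distribution's value as the degrees of freedom increase.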

## T-Test
- A T-test is a statistical hypothesis-testing tool that is primarily utilized when the population variance is unknown and the sample size is relatively small (n < 30). There are three common types of T-tests: one-sample, two-sample, and paired sample T-tests.

- For a one-sample T-test:

- The one-sample T-test is used to compare the mean of a sample with a known population mean. This test employs the standard deviation of the sample instead of the population standard deviation due to the unknown population variance. The formula for a one-sample T-test is as follows:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

- Where x̄ is the sample mean, s is the standard deviation of the sample, μ is the mean of the population, and n is the sample size.
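
The one-sample statistic, t = (x̄ − μ) / (s / √n), can be computed by hand and checked against `scipy.stats.ttest_1samp`. The sample values and the hypothesized mean below are hypothetical.

```python
import numpy as np
from scipy import stats

# Hypothetical sample and hypothesized population mean
sample = np.array([31, 28, 35, 33, 29, 30, 36, 27, 32, 34], dtype=float)
mu = 30.0

n = sample.size
x_bar = sample.mean()
s = sample.std(ddof=1)  # sample standard deviation (divides by n - 1)

# Manual t statistic
t_manual = (x_bar - mu) / (s / np.sqrt(n))

# Same test with SciPy, which also returns the two-sided p-value
t_scipy, p_value = stats.ttest_1samp(sample, popmean=mu)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, p = {p_value:.4f}")
```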

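- For a two-sample T-test:

- The two-sample T-test, on the other hand, is conducted to compare the means of two different samples to determine whether the differences between these two means are statistically significant. One common form of the statistic, which does not assume equal variances (Welch's T-test), is:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

- Where x̄1 and x̄2 are the sample means, s1 and s2 are the sample standard deviations, and n1 and n2 are the sample sizes.
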
- For a paired sample T-test:

- The paired sample T-test, also known as the dependent sample T-test, is used to compare the means of two related groups to determine if there is a statistically significant difference between these means. It is typically used when the same subjects are used for each treatment (for example, before and after a treatment in a medical study). The assumption is that the paired differences are approximately normally distributed. The formula for a paired sample T-test is:

$$t = \frac{\bar{D}}{s_D / \sqrt{n}}$$

Where,

- D̄ is the mean of the differences between the paired observations.

- s_D is the standard deviation of these differences.

- n is the number of pairs.

- The paired sample T-test is particularly useful for before and after studies. This test accounts for the fact that the two groups are related and not independent of each other.
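
The paired statistic, t = D̄ / (s_D / √n), reduces to a one-sample test on the differences. A minimal sketch with hypothetical before and after measurements, checked against `scipy.stats.ttest_rel`:

```python
import numpy as np
from scipy import stats

# Hypothetical measurements on the same 8 subjects
before = np.array([140, 152, 138, 145, 160, 148, 155, 142], dtype=float)
after = np.array([135, 150, 136, 140, 158, 149, 150, 139], dtype=float)

differences = before - after
n = differences.size
d_bar = differences.mean()       # mean of the paired differences
s_d = differences.std(ddof=1)    # standard deviation of the differences

# Manual paired t statistic
t_manual = d_bar / (s_d / np.sqrt(n))

# Same test with SciPy
t_scipy, p_value = stats.ttest_rel(before, after)
print(f"manual t = {t_manual:.4f}, scipy t = {t_scipy:.4f}, p = {p_value:.4f}")
```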

##  Z-Test
A Z-test is a statistical test used to determine whether a sample mean differs from a population mean, or whether two population means differ, when the variances are known and the sample size is large (n > 30). There are two main types of Z-tests: one-sample and two-sample Z-tests.

- For a one-sample Z-test:

- The one-sample Z-test is used to compare a population mean with the sample mean. The formula for a one-sample Z-test is as follows:

$$z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

- Where x̄ is the sample mean, σ is the standard deviation of the population, μ is the mean of the population, and n is the sample size.

- For a two-sample Z-test:

- The two-sample Z-test is used to compare the means of two different samples. The formula for a two-sample Z-test is as follows:

$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$

- Where x̄1 and x̄2 are the sample means, σ1 and σ2 are the population standard deviations, and n1 and n2 are the sample sizes.

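
A two-sample Z-test can be sketched directly from summary statistics: the difference in sample means is divided by the square root of σ1²/n1 + σ2²/n2. The means, standard deviations, and sample sizes below are hypothetical, and the population standard deviations are assumed to be known.

```python
import math
from scipy.stats import norm

# Hypothetical summary statistics for two independent samples
x_bar_1, x_bar_2 = 105.0, 100.0   # sample means
sigma_1, sigma_2 = 15.0, 15.0     # population standard deviations, assumed known
n_1, n_2 = 50, 50                 # sample sizes

# Standard error of the difference between the two sample means
standard_error = math.sqrt(sigma_1**2 / n_1 + sigma_2**2 / n_2)

z_stat = (x_bar_1 - x_bar_2) / standard_error

# Two-tailed p-value (H0: the two population means are equal)
p_value = 2 * norm.sf(abs(z_stat))
print(f"z = {z_stat:.4f}, p-value = {p_value:.4f}")
```
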
## Choosing between T-Test and Z-Test

- Sample Size
    - Z-test: It is used when the sample size is large (usually, n ≥ 30). The larger the sample size, the more the sample mean's distribution will resemble a normal distribution due to the central limit theorem.
    - T-test: It is preferred for smaller sample sizes (n < 30). The T-test is more adaptable to small sample sizes since it accounts for the extra uncertainty introduced by estimating the population standard deviation.

-  Population Standard Deviation
    - Z-test: It requires that the population standard deviation is known. This circumstance is less common in real-world scenarios because knowing the population standard deviation typically requires access to the entire population data.
    - T-test: It is used when the population standard deviation is unknown and is estimated using the sample standard deviation. The T-test adjusts for the fact that the sample standard deviation varies between samples.

- Distribution of the Data
    - Z-test: It assumes that the data follows a normal distribution. This assumption becomes less of a concern with large sample sizes due to the central limit theorem.
    - T-test: It also assumes that the data is approximately normally distributed. When you are unsure whether this holds, particularly with smaller sample sizes, non-parametric tests are more suitable.
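
One way to see why the sample size matters is to compare two-sided critical values. This sketch, using a hypothetical set of sample sizes, shows the T-distribution's critical value shrinking toward the Z-distribution's value as n grows.

```python
from scipy import stats

alpha = 0.05
sample_sizes = (5, 10, 30, 100)  # hypothetical sample sizes

# Two-sided critical value from the standard normal distribution
z_crit = stats.norm.ppf(1 - alpha / 2)

# Matching critical values from the T-distribution with n - 1 degrees of freedom
t_crits = {n: stats.t.ppf(1 - alpha / 2, df=n - 1) for n in sample_sizes}

print(f"z critical value: {z_crit:.3f}")
for n, t_crit in t_crits.items():
    print(f"n = {n:>3}: t critical value = {t_crit:.3f}")
```

For small samples the T-test demands a noticeably larger test statistic before rejecting the null hypothesis, which reflects the extra uncertainty from estimating the standard deviation.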

## P-Value
- The p-value is a crucial component in statistical hypothesis testing. It is the smallest level of significance at which you can reject a null hypothesis. Since the p-value offers more information than the critical value, it is generally recommended in many statistical tests.
- The p-value is the probability of observing results at least as extreme as the current results, given that the null hypothesis is true.
- The application and interpretation of the p-value depend on the nature of the hypothesis test being conducted. Here's how it applies to different types of tests:

- Right-Tailed Test:
    - In a right-tailed test, the rejection region lies in the right end (larger values) of the distribution. The p-value in this case is calculated as:
    - P-Value = P[Test statistic >= observed value of the test statistic]

- Left-Tailed Test:
    - In a left-tailed test, the rejection region lies in the left end (smaller values) of the distribution. The p-value is then calculated as:
    - P-Value = P[Test statistic <= observed value of the test statistic]

- Two-Tailed Test:

    - A two-tailed test is used for differences in either direction (larger or smaller). For a test statistic whose distribution is symmetric about zero, the p-value in a two-tailed test is calculated as:
    - P-Value = 2 * P[Test statistic >= |observed value of the test statistic|]

- Remember that a smaller p-value provides stronger evidence against the null hypothesis.
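
The three cases can be sketched for a z statistic using the standard normal distribution. The observed value below is hypothetical.

```python
from scipy.stats import norm

z_observed = 1.8  # hypothetical observed test statistic

# Right-tailed: probability of a value at least as large as the observed one
p_right = norm.sf(z_observed)

# Left-tailed: probability of a value at least as small as the observed one
p_left = norm.cdf(z_observed)

# Two-tailed: both tails beyond |z_observed|
p_two_tailed = 2 * norm.sf(abs(z_observed))

print(f"right-tailed: {p_right:.4f}")
print(f"left-tailed:  {p_left:.4f}")
print(f"two-tailed:   {p_two_tailed:.4f}")
```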

## Decision-Making Using P-Value
- The p-value is compared to the significance level (alpha) for decision-making on the null hypothesis.
- If the p-value is greater than alpha, you do not reject the null hypothesis.
- If the p-value is smaller than the alpha, you reject the null hypothesis.

## One-Sample T-Test: Example
- For a particular organization, the average age of the employees was claimed to be 30 years. The authorities collected a random sample of 10 employees' age data to check the claim made by the organization. Construct a hypothesis test to validate the hypothesis at a significance level of 0.05.

- $H_0$: the average age of the employees is 30
- $H_1$: the average age of the employees is not 30

```python
# Import necessary libraries
from scipy.stats import ttest_1samp
import pandas as pd

# Load dataset
ages = pd.read_csv("Ages.csv")

# Assuming 'ages' is the column of interest; replace it with the actual column name
age_column = ages['ages']
mean_age = age_column.mean()
print(f"Mean age: {mean_age}")

# Perform one-sample t-test against the claimed mean of 30
t_statistic, p_value = ttest_1samp(age_column, 30)
print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value}")

# Decision-making
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis.")
else:
    print("Fail to reject the null hypothesis.")
```

## Paired Sample T-Test: Example

- For a particular hospital, it is advertised that a particular chemotherapy session does not affect the patient's health based on blood pressure. It is to be checked if the blood pressure before the treatment is equivalent to the blood pressure after the treatment. Perform a statistical test at the alpha = 0.05 level to help validate the claim.

- $H_0$: the mean difference between the two samples is 0
- $H_1$: the mean difference between the two samples is not 0

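```python
import pandas as pd
from scipy import stats

# Load the before/after blood pressure readings
df = pd.read_csv("blood_pressure.csv")
print(df.head())

# Summary statistics for both measurements
print(df[['bp_before', 'bp_after']].describe())

# Paired t-test: the same patients are measured before and after treatment
t_stat, p_val = stats.ttest_rel(df['bp_before'], df['bp_after'])
print(p_val)

alpha = 0.05
if p_val < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```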
## Two-Sample T-Test: Example
- Employee satisfaction is a crucial factor that can influence the productivity and success of a company. The Human Resources department wants to assess whether the satisfaction levels are consistent across different departments. For this analysis, we will focus on two key departments: Sales and Marketing

- Objective: To determine if there is a statistically significant difference in the average employee satisfaction scores between the sales and marketing departments

- Null Hypothesis ( 𝐻0 ): There is no significant difference in the average satisfaction scores between employees in the Sales department and those in the Marketing department. (Mean_Sales = Mean_Marketing)

- Alternative Hypothesis ( 𝐻1 ): There is a significant difference in the average satisfaction scores between employees in the Sales department and those in the Marketing department. (Mean_Sales ≠ Mean_Marketing)

```python
import pandas as pd
from scipy import stats

# Load the dataset
data = pd.read_csv('employee_satisfaction.csv')

# Separate the satisfaction scores for each department
sales_scores = data[data['Department'] == 'Sales']['Satisfaction_Score']
marketing_scores = data[data['Department'] == 'Marketing']['Satisfaction_Score']

# Perform the independent two-sample t-test (equal_var=False gives Welch's t-test)
t_stat, p_val = stats.ttest_ind(sales_scores, marketing_scores, equal_var=False)
print(f"P-value: {p_val}")

# Interpretation
alpha = 0.05
if p_val < alpha:
    print("Reject the null hypothesis - there is a significant difference in satisfaction scores between departments")
else:
    print("Fail to reject the null hypothesis - no significant difference in satisfaction scores between departments")
```

## Z-Test: Example
- A school principal claims that the students in their school are more intelligent than those of other schools. A random sample of 50 students' IQ scores has a mean score of 110. The mean population IQ is 100, with a standard deviation of 15. State whether the claim of the principal is right or not at a 5% significance level.

- 𝐻0  = average population IQ score is 100

- 𝐻𝑎  = average population IQ score is above 100

```python
# Import the libraries
import numpy as np
from statsmodels.stats.weightstats import ztest

# Generate a random array of 50 numbers with mean 110 and standard deviation 15,
# similar to the IQ scores data
mean_iq = 110
sd_iq = 15
alpha = 0.05
null_mean = 100
data = sd_iq * np.random.randn(50) + mean_iq

# Print mean and SD of the simulated sample
print('mean=%.2f stdv=%.2f' % (np.mean(data), np.std(data)))

# Pass the data, the mean under the null hypothesis (value), and
# alternative='larger' to test whether the true mean exceeds null_mean
ztest_score, p_value = ztest(data, value=null_mean, alternative='larger')
print('z-score=%.3f p-value=%.4f' % (ztest_score, p_value))

# Compare the p-value with alpha: if it is smaller than alpha, reject the
# null hypothesis; otherwise, fail to reject it
if p_value < alpha:
    print("Reject null Hypothesis")
else:
    print("Fail to Reject null Hypothesis")
```

## Chi-Square Distribution
- A chi-square distribution, pronounced "kai square", is a continuous probability distribution widely used in statistical inference. The Greek letter χ is used, and χ² is termed chi-square.
- The χ² distribution and the standard normal distribution are related. If a random variable Z has a standard normal distribution, then Z² has the χ² distribution with one degree of freedom.

    - The shape of the χ² distribution varies with the degrees of freedom, k. For k = 1, the PDF tends to infinity as χ² approaches 0. For k = 2, the PDF is 0.5 at χ² = 0. For higher values of k (3 or more), the χ² distribution is a positively skewed, single-peaked curve, and as the degrees of freedom increase further, its skewness and excess kurtosis decrease, so the distribution becomes increasingly symmetric and closer to a normal distribution.

    * In any χ² distribution, the mean (μ) is k, the number of degrees of freedom, and the variance is 2k.

    * For example, for k = 3 in the diagram, μ = 3, while the variance is 2 x k, or 6.
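
Both properties can be checked numerically. This sketch reads the mean and variance for k = 3 from `scipy.stats.chi2` and simulates Z² from standard normal draws; the number of draws and the seed are arbitrary choices.

```python
import numpy as np
from scipy.stats import chi2

# Theoretical mean and variance of the chi-square distribution with k = 3
k = 3
mean, variance = chi2.stats(k, moments='mv')
print(f"k = {k}: mean = {float(mean)}, variance = {float(variance)}")

# Squares of standard normal draws follow a chi-square distribution with 1 degree of freedom
rng = np.random.default_rng(seed=0)
z = rng.standard_normal(100_000)
z_squared = z ** 2

# The sample mean of Z^2 should be close to 1, the mean for k = 1
print(f"sample mean of Z^2: {z_squared.mean():.3f}")
```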

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency

# Create chi-square samples with varying degrees of freedom
data1 = np.random.chisquare(df=1, size=1000)
data2 = np.random.chisquare(df=2, size=1000)
data3 = np.random.chisquare(df=3, size=1000)
print(data1[:10])

# Set the style of seaborn
sns.set_style('whitegrid')

# Plot the distributions using kdeplot
sns.kdeplot(data1, label='dof 1')
sns.kdeplot(data2, label='dof 2')
sns.kdeplot(data3, label='dof 3')

# Show the legend and the plot
plt.legend()
plt.show()

# Chi-square test of independence on two small contingency tables
data = [[10, 20, 30], [6, 9, 17]]
stat, p_value, dof, chi_array = chi2_contingency(data)
print(p_value)

data = [[10, 20, 30], [9, 1, 8]]
stat, p_value, dof, chi_array = chi2_contingency(data)
print(p_value)
```

## Chi-Square Test and Independence Test
- Statistical methods are often employed to understand the patterns and relationships in data. One such method is the chi-square test, which is used in two different but related scenarios: as a goodness-of-fit test and as a test for independence.

- Chi-Square Test as a Test for Independence: A chi-square test for independence is used to determine whether there's a significant association between two categorical variables in a sample. It's non-parametric, meaning it does not assume any specific distribution for the variables involved.

    - Define the null and alternative hypotheses based on the data. For a test of independence, 𝐻0 states that the two variables are independent, while 𝐻1 states that they are associated. For a goodness-of-fit test, 𝐻0 states that the data follow the expected distribution, while 𝐻1 states that they do not.

    - State the alpha value; as mentioned earlier, you usually work with a value of 0.05.

    - Calculate the degrees of freedom. For a goodness-of-fit test with K categories, this is K − 1. For a test of independence on a table with r rows and c columns, it is (r − 1)(c − 1).

    - State the decision rule. Find the critical value from the alpha value and the degrees of freedom. If the χ² statistic exceeds the critical value (equivalently, if the p-value is below alpha), reject 𝐻0; otherwise, fail to reject 𝐻0.

```python
# Example:
import pandas as pd
from scipy.stats import chi2, chi2_contingency

# Load the chi-test.csv file and build the contingency table
df_chi = pd.read_csv('chi-test.csv')
contingency_table = pd.crosstab(df_chi["Gender"], df_chi["Shopping"])
print('contingency_table :-\n', contingency_table)

# Observed values
observed_values = contingency_table.values
print("Observed Values :-\n", observed_values)

# Expected values under independence (fourth item returned by chi2_contingency)
_, _, _, expected_values = chi2_contingency(contingency_table)
print("Expected Values :-\n", expected_values)

# Degrees of freedom: (rows - 1) * (columns - 1)
no_of_rows, no_of_columns = contingency_table.shape
ddof = (no_of_rows - 1) * (no_of_columns - 1)

alpha = 0.05

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells
chi_square_statistic = ((observed_values - expected_values) ** 2 / expected_values).sum()

# Critical value and p-value from the chi-square distribution
critical_value = chi2.ppf(q=1 - alpha, df=ddof)
p_value = chi2.sf(chi_square_statistic, df=ddof)

print('Significance level: ', alpha)
print('Degree of Freedom: ', ddof)
print('chi-square statistic:', chi_square_statistic)
print('critical_value:', critical_value)
print('p-value:', p_value)

# Decision using the critical value
if chi_square_statistic >= critical_value:
    print("Reject H0, there is a relationship between the 2 categorical variables")
else:
    print("Retain H0, there is no relationship between the 2 categorical variables")

# Decision using the p-value
if p_value <= alpha:
    print("Reject H0, there is a relationship between the 2 categorical variables")
else:
    print("Retain H0, there is no relationship between the 2 categorical variables")
```
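
The goodness-of-fit use of the chi-square test mentioned at the start of this section can be sketched with `scipy.stats.chisquare`. The observed counts below are hypothetical (120 rolls of a six-sided die), and the expected counts assume a fair die.

```python
from scipy.stats import chisquare

# Hypothetical observed counts for faces 1 to 6 over 120 rolls
observed = [18, 22, 16, 25, 19, 20]

# Expected counts if the die is fair: 120 / 6 = 20 per face
expected = [20] * 6

# H0: the observed counts follow the expected (uniform) distribution
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square statistic = {stat:.2f}, p-value = {p_value:.4f}")

alpha = 0.05
if p_value < alpha:
    print("Reject H0: the counts do not fit the expected distribution")
else:
    print("Fail to reject H0: the counts are consistent with the expected distribution")
```

Here the degrees of freedom are K − 1 = 5, since there are six categories.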