## Hypothesis Testing

Hypothesis testing is a statistical method used in various industries to make decisions or draw conclusions about a population based on sample data. Here are several industry-based examples of hypothesis testing:

**1. Pharmaceuticals: Drug Efficacy**

**Scenario:** A pharmaceutical company develops a new drug to treat a specific medical condition and wants to determine if the drug is more effective than the existing treatment.
Hypotheses:
- **Null Hypothesis (H0):** The new drug is no more effective than the existing treatment.
- **Alternative Hypothesis (H1):** The new drug is more effective than the existing treatment.

**Test:** A clinical trial is conducted, and statistical tests are performed to analyze the data and determine if there is enough evidence to reject the null hypothesis in favor of the alternative.

**2. Finance: Investment Strategy**

**Scenario:** A financial analyst proposes a new investment strategy that claims to outperform the current market average.
Hypotheses:
- **Null Hypothesis (H0):** The new investment strategy does not outperform the market average.
- **Alternative Hypothesis (H1):** The new investment strategy outperforms the market average.

**Test:** Historical data is collected and analyzed using statistical tests to assess whether the returns from the proposed strategy are significantly higher than the market average.

**3. Manufacturing: Production Process Improvement**

**Scenario:** A manufacturing plant implements changes to its production process with the goal of reducing defects in the final product.
Hypotheses:
- **Null Hypothesis (H0):** The changes to the production process do not reduce defects.
- **Alternative Hypothesis (H1):** The changes to the production process reduce defects.

**Test:** Data on defect rates before and after the changes are collected, and statistical tests are performed to determine if there is a significant improvement.

In each of these examples, hypothesis testing provides a structured approach to assess claims or changes within different industries, helping decision-makers make informed choices based on statistical evidence.

Hypothesis Testing starts with the formulation of these two hypotheses:

**Null hypothesis (H₀)**: There is no difference between the observed sample and the general population, or between two samples; any observed difference is due to chance.

**Alternate hypothesis (H₁)**: There is a real difference or relationship (for example, between two variables) that is not due to chance alone.

Defining the hypotheses is often tricky, so we can follow these rules to formulate both of them correctly.

The null hypothesis always contains one of these signs:  `=  OR   ≤   OR    ≥`

The alternate hypothesis always contains one of these signs:  `≠   OR  >   OR    <`

**Example 1**:  Zomato claimed that its total valuation in December 2023 was at least $500 million. Here, the claim contains ≥ sign (i.e. the at least sign), so the null hypothesis is the original claim.

**Example 2**:  Zomato claimed that its total valuation in December 2023 was greater than $500 million. Here, the claim contains > sign (i.e. the ‘more than’ sign), so the null hypothesis is the complement of the original claim.

## Making a Decision

Once you have formulated the null and alternate hypotheses, you have to decide whether to reject or fail to reject the null hypothesis.

Let's say Maruti has to buy tires for its cars from a tire manufacturer. The manufacturer claims that the life of its tires is 36 months. Now we have to find out whether this claim is correct.

- Null hypothesis (H₀) : Life of tire = 36 months
- Alternate hypothesis (H₁) : Life of tire ≠ 36 months

Now Maruti tests 100 tires and the average life comes out to be 32 months. Do we accept the claim or reject it? And what if the average comes out to be 28 months?

For this, we define a critical region with a lower and an upper critical value. If the sample average is below the lower critical value or above the upper critical value, we reject the null hypothesis.

The formulation of the null and alternate hypotheses determines the type of the test and the position of the critical regions in the normal distribution.

You can tell the type of the test and the position of the critical region on the basis of the ‘sign’ in the alternate hypothesis.

| Sign in H₁ | Type of test | Rejection region |
|---|---|---|
| ≠ | Two-tailed test | Both sides of the distribution |
| < | Lower-tailed test | Left side of the distribution |
| > | Upper-tailed test | Right side of the distribution |
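As a rough sketch, assuming α = 0.05 and a test statistic that follows the standard normal distribution (as in the z-tests later in this notebook), the critical values for each type of test can be read off with `scipy.stats.norm.ppf`:

```python
from scipy.stats import norm

alpha = 0.05  # significance level, assumed for illustration

# Two-tailed: alpha is split equally between both tails
two_tailed = norm.ppf(1 - alpha / 2)   # reject H0 if |Z| > two_tailed
# Lower-tailed: all of alpha sits in the left tail
lower_tailed = norm.ppf(alpha)         # reject H0 if Z < lower_tailed
# Upper-tailed: all of alpha sits in the right tail
upper_tailed = norm.ppf(1 - alpha)     # reject H0 if Z > upper_tailed

print(f"Two-tailed:   reject if Z < {-two_tailed:.3f} or Z > {two_tailed:.3f}")
print(f"Lower-tailed: reject if Z < {lower_tailed:.3f}")
print(f"Upper-tailed: reject if Z > {upper_tailed:.3f}")
```

Note how a one-tailed test puts all of α in one tail, so its critical value (about 1.645) is closer to zero than the two-tailed one (about 1.96).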

### Steps to perform Hypothesis Testing:

1. **Formulate the Hypotheses**
The first step is to clearly define the null hypothesis (H₀) and the alternative hypothesis (H₁).
    - `Null Hypothesis (H₀)`: There is no significant difference between the observed sample and the general population or between two samples.
    - `Alternative Hypothesis (H₁)`: There is a significant difference or effect that is not due to chance alone.


2. **Select the Significance Level (α)**
The significance level (α) is the probability of rejecting the null hypothesis when it is actually true (Type I error). Commonly used α levels are 0.05 (5%) or 0.01 (1%).


3. **Collect and Analyze Data**
Data collection can be done through random sampling or experimental design. The data must be relevant to the hypotheses and collected in a way that minimizes bias. Once collected, the data is analyzed to summarize its main characteristics, often through descriptive statistics or visualizations.



4. **Calculate the Test Statistic**
The test statistic is a numerical value calculated from the sample data that, under the null hypothesis, follows a known distribution. The choice of test statistic depends on the nature of the data and the hypothesis being tested. Common tests include t-tests, z-tests, chi-square tests, and ANOVA, among others.


5. **Determine the Critical Region or Calculate the P-value**
    - `Critical Region`: This approach involves comparing the test statistic to critical values that define regions where the null hypothesis would be rejected. The critical values are determined based on the significance level and the distribution of the test statistic.
    - `P-value`: Alternatively, the p-value approach calculates the probability of observing a test statistic as extreme as, or more extreme than, the value calculated from the sample data, assuming the null hypothesis is true. If the p-value is less than or equal to the significance level (α), the null hypothesis is rejected.
    - **We prefer using the p-value method over the critical-region method.**

6. **Make a Decision**
    - If the test statistic falls within the critical region or the p-value is less than or equal to α, reject the null hypothesis in favor of the alternative hypothesis.
    - If the test statistic does not fall within the critical region or the p-value is greater than α, do not reject the null hypothesis.


7. **Draw Conclusions**
Finally, interpret the results in the context of the original research question or business problem. This involves stating whether the findings support the alternative hypothesis and discussing the implications of the results for the problem at hand.
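For the same α, the critical-region rule and the p-value rule from steps 5 and 6 always reach the same decision. The sketch below checks this for a two-tailed z-test; the helper name `decide` and the z values are illustrative, not part of any library:

```python
from scipy.stats import norm

def decide(z, alpha=0.05):
    """Two-tailed z-test decision by both methods (illustrative helper)."""
    z_crit = norm.ppf(1 - alpha / 2)       # critical value for the two tails
    p_value = 2 * norm.sf(abs(z))          # two-sided p-value
    reject_by_region = bool(abs(z) > z_crit)
    reject_by_p = bool(p_value <= alpha)
    return reject_by_region, reject_by_p, p_value

for z in [-4.3, -1.5, 0.2, 2.5]:
    by_region, by_p, p = decide(z)
    print(f"z = {z:5.2f}  p = {p:.4f}  reject (region): {by_region}  reject (p-value): {by_p}")
```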

### **Z-Test**

A z-test is a statistical method used in hypothesis testing to determine if there is a significant difference between sample and population means, or between the means of two samples. It is particularly useful when the population standard deviation is known and the sample size is large (typically greater than 30).

There are two main types of z-test:

1. **One-Sample Z-Test:** Used to determine whether the mean of a single sample is different from a known population mean.
2. **Two-Sample Z-Test:** Used to compare the means of two independent samples to see if they are significantly different from each other.

**Assumptions of Z-Tests**

For a z-test to yield valid results, these assumptions must be met:
1. **Normal Distribution:** The data should be approximately normally distributed. With large samples, the central limit theorem makes the sampling distribution of the mean approximately normal even when the data themselves are not.
2. **Known Population Standard Deviation:** The standard deviation of the population must be known. If it is unknown, using a `t-test` is more appropriate.
3. **Random Sampling:** The sample data should be randomly drawn from the population, ensuring that it is representative of the actual population data.
4. **Independence:** The samples must be independent of each other, particularly in two-sample z-tests.
5. **Continuous Data:** The z-test is applicable for continuous data, where the variable of interest can take any numeric value.

### Z-Test Example

**Example 1:** Google claims that its internet browser ‘Chrome’ is the best in the industry, with a mean boot time of only 250 ms and a standard deviation of 9 ms. Sam, a tech geek, wants to test Google's claim. He randomly collects boot-time data from 165 devices running Chrome and gets a sample mean of 247 ms.

- H₀: μ = 250, i.e., the mean boot time is 250 ms.
- H₁: μ ≠ 250, i.e., the mean boot time is not 250 ms.

#### **Solution using Critical Region/Value method**

For Z-Test these two conditions should be met:

- Condition 1: n > 30, i.e., the sample size should be greater than 30 observations.

- Condition 2: 𝝈 is known, i.e., the population standard deviation is known.

If either of these conditions is not met, we use a t-test.

The next step is to compute the test statistic. A test statistic, in simple terms, is a value calculated from the sample data and then compared with tabulated critical values.

The test statistic for a normal distribution or a Z-test is defined as:

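$$Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$$

where $\bar{x}$ is the sample mean, $\mu$ is the population mean under H₀, $\sigma$ is the population standard deviation and $n$ is the sample size.

$$Z = \frac{247 - 250}{9 / \sqrt{165}} \approx \frac{-3}{0.70} \approx -4.3$$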

We will now test the hypothesis at a 5% significance level (i.e., a 95% confidence level). For a two-tailed test at this level, the critical Z values are +1.96 and -1.96; these are the upper and lower critical values, respectively. The test statistic we calculated is -4.3.

The region between +1.96 and -1.96 is called the acceptance region, and the region outside it is called the critical region.

If the calculated Z-statistic is in the region of acceptance, you fail to reject the null hypothesis. If the calculated Z-statistic lies outside the region of acceptance, i.e., in the critical region, you reject the null hypothesis.

In our case, the test statistic value is -4.3, which lies outside the region of acceptance of ±1.96. So, you reject the null hypothesis.
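The critical-region calculation above can be reproduced in a few lines. This is a sketch that uses only the summary numbers from Example 1 and `scipy.stats.norm`:

```python
import math
from scipy.stats import norm

# Summary figures from Example 1
mu0 = 250      # claimed mean boot time (ms)
sigma = 9      # known population standard deviation (ms)
n = 165        # devices sampled
x_bar = 247    # observed sample mean (ms)
alpha = 0.05   # 5% significance level, two-tailed

z = (x_bar - mu0) / (sigma / math.sqrt(n))   # Z = (x̄ - μ) / (σ/√n), about -4.28
z_crit = norm.ppf(1 - alpha / 2)              # about 1.96

print(f"Z = {z:.2f}, acceptance region = [-{z_crit:.2f}, {z_crit:.2f}]")
print("Reject H0" if abs(z) > z_crit else "Fail to reject H0")
```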

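#### **Solution of Example 1 using the p-value method**

The cell below uses randomly generated boot times in place of Sam's actual measurements, so its output changes from run to run and does not correspond to the 247 ms sample mean above.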

In [1]:
from statsmodels.stats.weightstats import ztest
import numpy as np

boot_times = list(np.random.randint(200, 270, size = 165))  # 165 random integers from 200 to 269 (randint's upper bound is exclusive)

# Perform one-sample Z-test
z_statistic, p_value = ztest(boot_times, value=250)   #here the value parameter is the population mean

# Interpret the results
if p_value < 0.05:  # ztest returns a two-sided p-value, so compare it with alpha = 0.05 directly
    print(f"Z-statistic: {z_statistic:.4f}, p-value: {p_value:.4f}")
    print("We have sufficient evidence to reject the null hypothesis.")
else:
    print(f"Z-statistic: {z_statistic:.4f}, p-value: {p_value:.4f}")
    print("We do not have sufficient evidence to reject the null hypothesis.")

Z-statistic: -11.1823, p-value: 0.0000
We have sufficient evidence to reject the null hypothesis.


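**Note:** statsmodels' `ztest` estimates the standard deviation from the sample itself, so the cell above never uses the known population standard deviation of 9 ms. The next cell tries to bring σ in by standardizing each observation with (x − μ)/(σ/√n) and then testing against 0. Be aware that this rescaling is linear: `ztest` divides by the standard deviation of the rescaled values, so σ cancels out and the statistic is the same as testing the raw data against 250. To really use a known σ, compute Z = (x̄ − μ)/(σ/√n) directly and get the p-value from the standard normal distribution.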

In [2]:
from statsmodels.stats.weightstats import ztest
import numpy as np

boot_times = list(np.random.randint(180, 300, size = 165))  #generating a list of 165 random integers between 180 and 300

population_mean = 250
pop_std_dev = 9

standardized_boot_times = [(x - population_mean) / (pop_std_dev/np.sqrt(len(boot_times))) for x in boot_times]

# Perform one-sample Z-test
z_statistic, p_value = ztest(standardized_boot_times, value=0)  #here value = 250 but after standardization we always write value = 0

# Interpret the results
if p_value < 0.05:
    print(f"Z-statistic: {z_statistic:.4f}, p-value: {p_value:.4f}")
    print("We have sufficient evidence to reject the null hypothesis.")
else:
    print(f"Z-statistic: {z_statistic:.4f}, p-value: {p_value:.4f}")
    print("We do not have sufficient evidence to reject the null hypothesis.")

Z-statistic: -3.0306, p-value: 0.0024
We have sufficient evidence to reject the null hypothesis.


Suppose a company is evaluating the impact of a new training program on the productivity of its employees. The company has data on the average productivity of its employees before implementing the new training program: the average productivity was 50 units per day, with a known population standard deviation of 5 units. After implementing the training program, the company measures the productivity of a random sample of 30 employees. The sampled employees have an average productivity of 53 units per day. The company wants to know whether the new training program has significantly improved the productivity of the employees.

In [3]:
import numpy as np
from statsmodels.stats.weightstats import ztest

sample_data = list(np.random.randint(50, 59, size = 30))

pm = 50
pop_std_dev = 5

standardized_sample_data = [(x - pm) / (pop_std_dev / np.sqrt(len(sample_data))) for x in sample_data]

z_statistic, p_value = ztest(standardized_sample_data, value = 0)

In [4]:
z_statistic, p_value

(np.float64(6.984328542534465), np.float64(2.8622155375075756e-12))

In [5]:
sample_data

[np.int64(56),
 np.int64(51),
 np.int64(52),
 np.int64(51),
 np.int64(53),
 np.int64(54),
 np.int64(53),
 np.int64(55),
 np.int64(53),
 np.int64(55),
 np.int64(51),
 np.int64(50),
 np.int64(58),
 np.int64(57),
 np.int64(54),
 np.int64(50),
 np.int64(53),
 np.int64(53),
 np.int64(52),
 np.int64(50),
 np.int64(52),
 np.int64(50),
 np.int64(54),
 np.int64(52),
 np.int64(51),
 np.int64(58),
 np.int64(51),
 np.int64(52),
 np.int64(56),
 np.int64(52)]
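Because `ztest` estimates the standard deviation from the sample, the cells above do not actually use the known σ = 5, and they test a two-sided alternative. A minimal sketch that does use σ, assuming the summary figures from the problem statement (x̄ = 53, n = 30) and our own framing of the question "has productivity improved?" as H₀: μ ≤ 50 versus H₁: μ > 50 (an upper-tailed test):

```python
import math
from scipy.stats import norm

# Summary figures from the training-program problem
mu0 = 50      # mean productivity before training (units/day)
sigma = 5     # known population standard deviation (units/day)
n = 30        # employees sampled after training
x_bar = 53    # sample mean after training (units/day)
alpha = 0.05

# H0: mu <= 50, H1: mu > 50 (upper-tailed test)
z = (x_bar - mu0) / (sigma / math.sqrt(n))
p_value = norm.sf(z)   # upper-tail area P(Z >= z)

print(f"Z = {z:.4f}, one-sided p-value = {p_value:.6f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```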

### T-test

The t-distribution resembles the normal distribution: it is also symmetric and single-peaked, but it is less concentrated around its peak and has heavier tails. In layman’s terms, a t-distribution is shorter and flatter around the centre than a normal distribution, and it approaches the normal distribution as the degrees of freedom grow. It is used to study the mean of a population whose distribution is fairly close to a normal distribution (but not an exact normal distribution).

**Two simple conditions to determine when to use the t-statistic are as follows:**

- The population standard deviation is unknown.
- The sample size is less than 30.

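If either condition applies, a t-test is the appropriate choice. The formula for the t-statistic is:

$$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$$

where $s$ is the sample standard deviation; for a one-sample test the statistic has $n - 1$ degrees of freedom.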


A company claims that its new algorithm can process a specific dataset in an average of 20 minutes, which is faster than the current average processing time of 22 minutes using the standard algorithm. To validate this claim, a data scientist decides to conduct a t-test.
The data scientist collects a sample of processing times using the new algorithm. The sample consists of 10 processing times (in minutes):

Sample data: 19, 18, 21, 20, 19, 22, 18, 17, 21, 20

The data scientist wants to test if the new algorithm significantly reduces the processing time compared to the standard average of 22 minutes. The hypothesis for the t-test would be set up as follows:
- `Null Hypothesis (H₀)`: The mean processing time using the new algorithm is at least 22 minutes. (μ ≥ 22)
- `Alternative Hypothesis (H₁)`: The mean processing time using the new algorithm is less than 22 minutes. (μ<22)
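As a cross-check, here is a sketch that computes the t-statistic directly from the formula above, together with the one-sided (lower-tail) p-value that matches H₁: μ < 22. It assumes SciPy 1.6 or later, where `ttest_1samp` accepts the `alternative` argument:

```python
import numpy as np
from scipy.stats import t, ttest_1samp

sample = np.array([19, 18, 21, 20, 19, 22, 18, 17, 21, 20])
mu0 = 22  # current average processing time (minutes)

n = len(sample)
x_bar = sample.mean()
s = sample.std(ddof=1)                    # sample standard deviation
t_stat = (x_bar - mu0) / (s / np.sqrt(n))
p_less = t.cdf(t_stat, df=n - 1)          # lower-tail p-value for H1: mu < 22

# SciPy's built-in one-sided version of the same test
result = ttest_1samp(sample, mu0, alternative="less")

print(f"manual: t = {t_stat:.4f}, p = {p_less:.6f}")
print(f"scipy:  t = {result.statistic:.4f}, p = {result.pvalue:.6f}")
```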

#### **One-Sample t-test**

In [6]:
from scipy.stats import ttest_1samp

global_average_score = 22
sample_scores = [19,18,21,20,19,22,18,17,21,20]

t_stat, p_value = ttest_1samp(sample_scores, global_average_score)

In [7]:
from scipy.stats import ttest_1samp

pm = 22

sample_data = [19,18,21,20,19,22,18,17,21,20]

t_statistic, p_value = ttest_1samp(sample_data, pm)

In [8]:
p_value

np.float64(0.0007389679098032424)

The p-value of about 0.0007 is well below the usual α = 0.05, so we reject H₀: there is significant evidence that the new algorithm reduces the processing time below the standard 22 minutes. Note that `ttest_1samp` reports a two-sided p-value by default; for the one-sided alternative μ < 22 (with a negative t-statistic, as here) the p-value is half of that, which only strengthens the conclusion.

Problem Statement: Suppose a manufacturer claims that the average protein content of its new chocolate bars is 50 grams. We doubt this claim and want to check it, so we take a sample of 25 chocolate bars and measure their protein content: the sample mean is 49.7 grams and the sample standard deviation is 1.2 grams. Use a significance level of 0.05.
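The problem gives only summary statistics, so one way to solve it is to plug them into the t-statistic formula. This sketch assumes a two-tailed test (H₀: μ = 50, H₁: μ ≠ 50), since the statement doubts the claim without saying in which direction:

```python
import math
from scipy.stats import t

mu0 = 50      # claimed mean protein content (g)
n = 25        # bars sampled
x_bar = 49.7  # sample mean (g)
s = 1.2       # sample standard deviation (g)
alpha = 0.05

# H0: mu = 50, H1: mu != 50 (two-tailed)
t_stat = (x_bar - mu0) / (s / math.sqrt(n))   # -0.3 / 0.24 = -1.25
df = n - 1
p_value = 2 * t.sf(abs(t_stat), df)           # two-sided p-value
t_crit = t.ppf(1 - alpha / 2, df)             # two-tailed critical value

print(f"t = {t_stat:.3f}, df = {df}, critical value = ±{t_crit:.3f}, p = {p_value:.4f}")
print("Reject H0" if p_value <= alpha else "Fail to reject H0")
```

Here |t| = 1.25 is inside ±2.064, so under these assumptions we would fail to reject the manufacturer's claim.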

#### **Two-Sample t-test**

A two-sample t-test is often used to determine if there is a significant difference between the means of two groups.

A researcher wants to evaluate whether there is a significant difference in the average test scores of students who used two different study methods (Method A and Method B) for an exam. The researcher randomly selects two independent groups of students: one group uses Method A, and the other uses Method B. The test scores for each group are recorded as follows:

- Method A Scores: [78, 84, 92, 88, 75, 80, 85, 90, 87, 79]
- Method B Scores: [82, 88, 75, 90, 78, 85, 88, 77, 92, 80]

In [9]:
import numpy as np
from scipy import stats

# Sample data for Method A and Method B
method_a_scores = np.array([78, 84, 92, 88, 75, 80, 85, 90, 87, 79])
method_b_scores = np.array([82, 88, 75, 90, 78, 85, 88, 77, 92, 80])

# Step 1: Perform Shapiro-Wilk test for normality
shapiro_a = stats.shapiro(method_a_scores)
shapiro_b = stats.shapiro(method_b_scores)

print("Shapiro-Wilk Test for Method A: Statistic =", shapiro_a.statistic, ", p-value =", shapiro_a.pvalue)
print("Shapiro-Wilk Test for Method B: Statistic =", shapiro_b.statistic, ", p-value =", shapiro_b.pvalue)

# Step 2: Perform Levene's test for equal variances
levene_test = stats.levene(method_a_scores, method_b_scores)
print("Levene's Test: Statistic =", levene_test.statistic, ", p-value =", levene_test.pvalue)

# Step 3: Perform independent two-sample t-test
t_statistic, p_value = stats.ttest_ind(method_a_scores, method_b_scores)

# Step 4: Print results
alpha = 0.05
print("T-statistic:", t_statistic)
print("P-value:", p_value)

# Decision based on p-value
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in average scores.")
else:
    print("Fail to reject the null hypothesis: No significant difference in average scores.")

Shapiro-Wilk Test for Method A: Statistic = 0.9634209538811345 , p-value = 0.8240632802317183
Shapiro-Wilk Test for Method B: Statistic = 0.9429848795756282 , p-value = 0.5866793733496176
Levene's Test: Statistic = 0.16879219804951237 , p-value = 0.6860370886859155
T-statistic: 0.11617981913799512
P-value: 0.9087964141018375
Fail to reject the null hypothesis: No significant difference in average scores.


In [10]:
import numpy as np
from scipy import stats

# Sample data for Method A and Method B
method_a_scores = np.array([78, 84, 92, 88, 75, 80, 85, 90, 87, 79])
method_b_scores = np.array([82, 88, 75, 90, 78, 85, 88, 77, 92, 80])

# Step 1: Perform Shapiro-Wilk test for normality
shapiro_a = stats.shapiro(method_a_scores)
shapiro_b = stats.shapiro(method_b_scores)

print(shapiro_a.pvalue)
print(shapiro_b.pvalue)

0.8240632802317183
0.5866793733496176


In [11]:
levene_test = stats.levene(method_a_scores, method_b_scores)
print(levene_test.pvalue)

0.6860370886859155


In [12]:
t_statistic, p_value = stats.ttest_ind(method_a_scores, method_b_scores)

print(p_value)

0.9087964141018375


Problem 2: Let's consider a scenario where a company wants to test if a new training program improves employee productivity. The productivity scores (measured in units of work completed per day) of employees who underwent the training are compared to those who did not.

**Hypothesis**
- Null Hypothesis ($H_0$): There is no difference in productivity between trained and untrained employees.
- Alternative Hypothesis ($H_1$): There is a difference in productivity between trained and untrained employees.

In [13]:
import numpy as np
from scipy.stats import ttest_ind

# Generating synthetic productivity data for trained employees
np.random.seed(0)  # for reproducibility of the random values generated by the np.random.normal calls below
trained_productivity = np.random.normal(loc=80, scale=10, size=40)  # Mean 80, std dev 10

# Generating synthetic productivity data for untrained employees
untrained_productivity = np.random.normal(loc=75, scale=10, size=40)  # Mean 75, std dev 10

# Performing a two-sample t-test on the productivity data
t_statistic, p_value = ttest_ind(trained_productivity, untrained_productivity, equal_var = False)

# Interpret the results
alpha = 0.05  # ttest_ind returns a two-sided p-value by default, so compare it with alpha directly
if p_value < alpha:
    print(f"T-statistic: {t_statistic:.4f}, p-value: {p_value:.9f}")
    print("We have sufficient evidence to reject the null hypothesis.")
    print("There is a significant difference in productivity between trained and untrained employees.")
else:
    print(f"T-statistic: {t_statistic:.4f}, p-value: {p_value:.4f}")
    print("We do not have sufficient evidence to reject the null hypothesis.")
    print("There is no significant difference in productivity between trained and untrained employees.")

T-statistic: 5.5380, p-value: 0.000000460
We have sufficient evidence to reject the null hypothesis.
There is a significant difference in productivity between trained and untrained employees.


This Python code uses the ttest_ind function from the scipy.stats module to perform a two-sample t-test assuming unequal variances (Welch's t-test). The equal_var=False parameter is set to handle cases where the two groups have different variances.
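To see what `equal_var=False` changes, here is a sketch of Welch's t-test computed by hand on the Method A / Method B scores from earlier, using the Welch–Satterthwaite approximation for the degrees of freedom, and compared with SciPy's result:

```python
import numpy as np
from scipy import stats

a = np.array([78, 84, 92, 88, 75, 80, 85, 90, 87, 79])  # Method A scores
b = np.array([82, 88, 75, 90, 78, 85, 88, 77, 92, 80])  # Method B scores

va = a.var(ddof=1) / len(a)   # squared standard error of the mean of A
vb = b.var(ddof=1) / len(b)   # squared standard error of the mean of B

t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
# Welch–Satterthwaite approximation to the degrees of freedom
df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
p_value = 2 * stats.t.sf(abs(t_stat), df)

welch = stats.ttest_ind(a, b, equal_var=False)
print(f"manual: t = {t_stat:.4f}, df = {df:.2f}, p = {p_value:.4f}")
print(f"scipy:  t = {welch.statistic:.4f}, p = {welch.pvalue:.4f}")
```

With equal group sizes the Welch t-statistic equals the pooled one; only the degrees of freedom (and hence the p-value) differ.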

**Problem Statement 3:**

A retail company wants to evaluate the effectiveness of two different marketing strategies on sales performance. They implement Strategy A in one region and Strategy B in another region. After a month, they collect sales data from both regions. The company wants to determine if there is a statistically significant difference in the average sales between the two regions.

**Hypothesis Statements:**
- Null Hypothesis ($H_0$): There is no difference in the average sales between the two marketing strategies. ($\mu_{\text{A}} = \mu_{\text{B}}$)
- Alternative Hypothesis ($H_1$): There is a difference in the average sales between the two marketing strategies. ($\mu_{\text{A}} \neq \mu_{\text{B}}$)

In [14]:
import numpy as np
from scipy.stats import ttest_ind

# Sample data: sales in thousands of dollars
sales_strategy_A = np.random.normal(loc=100, scale=15, size=50)  # Mean 100, std dev 15
sales_strategy_B = np.random.normal(loc=110, scale=15, size=50)  # Mean 110, std dev 15

# Perform two-sample t-test
t_statistic, p_value = ttest_ind(sales_strategy_A, sales_strategy_B)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print(f"T-statistic: {t_statistic:.4f}, p-value: {p_value:.4f}")
    print("We have sufficient evidence to reject the null hypothesis.")
    print("There is a significant difference in average sales between the two marketing strategies.")
else:
    print(f"T-statistic: {t_statistic:.4f}, p-value: {p_value:.4f}")
    print("We do not have sufficient evidence to reject the null hypothesis.")
    print("There is no significant difference in average sales between the two marketing strategies.")

T-statistic: -1.0967, p-value: 0.2755
We do not have sufficient evidence to reject the null hypothesis.
There is no significant difference in average sales between the two marketing strategies.


#### Another Example of two sample t-test


Consider a dataset from a recent health survey that records participants' gender (male or female) and their cholesterol levels (a quantitative variable). A data scientist wants to investigate whether there is a significant difference in the mean cholesterol levels between male and female participants.

In [15]:
import numpy as np
from scipy import stats

gender = np.array(["Male", "Female", "Female", "Male", "Female", "Male", "Male",
                   "Female", "Male", "Female"])

cholesterol = np.array([200, 220, 210, 190, 205, 195, 180, 230, 175, 225])

# Since the gender array contains categorical data, we need to separate the cholesterol data by gender
male_cholesterol = cholesterol[gender == "Male"]
female_cholesterol = cholesterol[gender == "Female"]

# Perform the two-sample t-test
t_stat, p_value = stats.ttest_ind(male_cholesterol, female_cholesterol)

print(f"T-statistic: {t_stat}")
print(f"P-value: {p_value}")

T-statistic: -4.57495710997814
P-value: 0.0018139585097282133


The `ttest_ind` function is used to compare the means of two independent samples, which in this case are the cholesterol levels of males and females. The gender array is used to filter the cholesterol array into two groups: male_cholesterol and female_cholesterol.

### Chi-Squared Test

A Chi-Squared test is a statistical method used in hypothesis testing to determine if there is a significant difference between observed and expected frequencies in one or more categories. It is particularly useful for analyzing categorical data and is often employed to test relationships between categorical variables.

**Chi-squared test of independence**: This is used to determine whether or not there is a significant relationship between two nominal (categorical) variables.

**Problem Statement:**

A marketing team at a software company wants to determine if there is a relationship between the type of advertising medium (online, print, television) and software purchases. They collect data from a sample of customers, noting the advertising medium each customer was exposed to and whether they purchased the software.

**Hypotheses:**

- Null Hypothesis ($H_0$): There is no association between the type of advertising medium and software purchases. The variables are independent.
- Alternative Hypothesis ($H_1$): There is an association between the type of advertising medium and software purchases. The variables are not independent.

In [16]:
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequency data in a contingency table
# Rows in the data array represent: Advertising Medium (Online, Print, Television)
# Columns in the data array represent: Purchase (Yes, No)
data = np.array([[30, 10],  # Online
                 [20, 20],  # Print
                 [50, 30]]) # Television

# Perform Chi-Square Test of Independence
chi2_stat, p_value, dof, expected = chi2_contingency(data)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print(f"Chi-Square Statistic: {chi2_stat:.4f}, p-value: {p_value:.4f}")
    print("We have sufficient evidence to reject the null hypothesis.")
    print("There is a significant association between the advertising medium and software purchases.")
else:
    print(f"Chi-Square Statistic: {chi2_stat:.4f}, p-value: {p_value:.4f}")
    print("We do not have sufficient evidence to reject the null hypothesis.")
    print("There is no significant association between the advertising medium and software purchases.")

Chi-Square Statistic: 5.3333, p-value: 0.0695
We do not have sufficient evidence to reject the null hypothesis.
There is no significant association between the advertising medium and software purchases.


**Problem Statement 2:**
A restaurant chain wants to determine if there is an association between the type of cuisine offered (Italian, Chinese, Mexican) and customer satisfaction levels (satisfied, neutral, dissatisfied). They conduct a survey among customers across several locations to collect this data.

**Hypotheses:**
- Null Hypothesis ($H_0$): There is no association between the type of cuisine and customer satisfaction levels. The variables are independent.
- Alternative Hypothesis ($H_1$): There is an association between the type of cuisine and customer satisfaction levels. The variables are not independent.

In [17]:
import numpy as np
from scipy.stats import chi2_contingency

# Observed frequency data in a contingency table
# Rows: Type of Cuisine (Italian, Chinese, Mexican)
# Columns: Customer Satisfaction (Satisfied, Neutral, Dissatisfied)
data = np.array([[40, 30, 10],  # Italian
                 [35, 25, 20],  # Chinese
                 [25, 30, 15]]) # Mexican

# Perform Chi-Square Test of Independence
chi2_stat, p_value, dof, expected = chi2_contingency(data)

# Interpret the results
alpha = 0.05
if p_value < alpha:
    print(f"Chi-Square Statistic: {chi2_stat:.4f}, p-value: {p_value:.4f}")
    print("We have sufficient evidence to reject the null hypothesis.")
    print("There is a significant association between the type of cuisine and customer satisfaction levels.")
else:
    print(f"Chi-Square Statistic: {chi2_stat:.4f}, p-value: {p_value:.4f}")
    print("We do not have sufficient evidence to reject the null hypothesis.")
    print("There is no significant association between the type of cuisine and customer satisfaction levels.")

Chi-Square Statistic: 6.4983, p-value: 0.1649
We do not have sufficient evidence to reject the null hypothesis.
There is no significant association between the type of cuisine and customer satisfaction levels.


In both examples, the Chi-Square test evaluates whether the observed contingency table differs significantly from the table that would be expected if there were no association between the variables.
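The expected table is straightforward to compute by hand: each expected count is (row total × column total) / grand total. This sketch does that for the advertising-medium data and compares the result with `chi2_contingency`. Since this table has 2 degrees of freedom, SciPy's Yates continuity correction (applied only when dof = 1) does not come into play, so the two results should match:

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Advertising-medium table (rows: Online, Print, Television; columns: Yes, No)
observed = np.array([[30, 10],
                     [20, 20],
                     [50, 30]])

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / n
stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(stat, dof)

scipy_stat, scipy_p, scipy_dof, scipy_expected = chi2_contingency(observed)

print(expected)
print(f"manual: chi2 = {stat:.4f}, dof = {dof}, p = {p_value:.4f}")
print(f"scipy:  chi2 = {scipy_stat:.4f}, dof = {scipy_dof}, p = {scipy_p:.4f}")
```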

### Chi-Squared Goodness of fit

A game manufacturer wants to test if a six-sided die is fair. The die is rolled 60 times, and the observed frequencies for each face are: [10, 8, 12, 11, 9, 10]. Assuming a fair die, each face should appear equally often (expected frequency = 10 for each face).

In [18]:
import numpy as np
from scipy.stats import chisquare

observed = np.array([10, 8, 12, 11, 9, 10])
expected = np.array([10, 10, 10, 10, 10, 10])

chi2_stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print("Chi-squared Statistic:", chi2_stat)
print("P-value:", p_value)

Chi-squared Statistic: 1.0
P-value: 0.9625657732472964


A shop owner claims that an equal number of customers visit the shop on each weekday. To test this claim, the owner records the number of customers over one week: 50 (Monday), 60 (Tuesday), 40 (Wednesday), 47 (Thursday), and 53 (Friday). The expected frequency for each day is total customers / 5 = 250 / 5 = 50.

In [19]:
# Observed and expected frequencies
observed_customers = np.array([50, 60, 40, 47, 53])
expected_customers = np.array([50, 50, 50, 50, 50])

# Perform Chi-Squared Goodness of Fit test
chi2_stat_customers, p_value_customers = chisquare(f_obs=observed_customers, f_exp=expected_customers)

print("Chi-squared Statistic:", chi2_stat_customers)
print("P-value:", p_value_customers)

Chi-squared Statistic: 4.359999999999999
P-value: 0.3594720674366307
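The goodness-of-fit statistic is Σ (observed − expected)² / expected, with k − 1 degrees of freedom for k categories. A sketch for the shop-owner data that derives the expected counts from the total instead of hard-coding them:

```python
import numpy as np
from scipy.stats import chi2, chisquare

observed = np.array([50, 60, 40, 47, 53])  # customers, Monday to Friday
k = len(observed)

# Under "equal visits on each weekday", each day expects total / k customers
expected = np.full(k, observed.sum() / k)   # 250 / 5 = 50 per day

stat = ((observed - expected) ** 2 / expected).sum()
p_value = chi2.sf(stat, df=k - 1)            # df = number of categories - 1

result = chisquare(observed, f_exp=expected)
print(f"manual: chi2 = {stat:.2f}, p = {p_value:.4f}")
print(f"scipy:  chi2 = {result.statistic:.2f}, p = {result.pvalue:.4f}")
```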


A candy company claims that all four flavors in their product are equally distributed. A sample of candies yields these counts: [22 (Flavor A), 30 (Flavor B), 23 (Flavor C), and 25 (Flavor D)]. The expected frequency for each flavor is 25.

In [20]:
observed_candies = np.array([22, 30, 23, 25])
expected_candies = np.array([25, 25, 25, 25])

chi2_stat_candies, p_value_candies = chisquare(f_obs=observed_candies, f_exp=expected_candies)
print("Chi-squared Statistic:", chi2_stat_candies)
print("P-value:", p_value_candies)

Chi-squared Statistic: 1.5199999999999998
P-value: 0.6776620931894994


### **ANOVA (Analysis of Variance)**

Analysis of Variance (ANOVA) is a statistical method used to compare the means of three or more groups to determine if there are any statistically significant differences between them. It helps to identify whether the observed differences among group means are due to actual differences or random chance.
`ANOVA uses F-tests` to statistically test the equality of means.

A two-sample t-test can only compare two groups at a time. With three or more groups you would have to run a separate t-test for every pair of groups, which is tedious, and each additional test `increases the overall probability of a Type I error` (a false positive somewhere among the tests). ANOVA avoids this by testing all group means at once.
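
The inflation can be quantified. Assuming the $m$ pairwise tests are independent and each uses significance level $\alpha$, the probability of at least one false positive is $1 - (1 - \alpha)^m$. The sketch below evaluates this for a few group counts; the independence assumption is a simplification, since pairwise tests that share a group are correlated.

```python
from math import comb


def family_wise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one Type I error across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


alpha = 0.05
for n_groups in (3, 4, 5):
    n_tests = comb(n_groups, 2)  # one t-test per pair of groups
    fwer = family_wise_error_rate(alpha, n_tests)
    print(f"{n_groups} groups -> {n_tests} t-tests, P(at least one Type I error) = {fwer:.3f}")
```

With three groups the rate is already about 14% rather than 5%, and it climbs to roughly 40% with five groups.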

**One-Way ANOVA**

A One-Way ANOVA is used to determine whether there are statistically significant differences between the means of three or more independent groups. For the results of a One-Way ANOVA to be valid, certain assumptions must be met:
1. `Independence of Observations:` The observations in each group must be independent of each other. This means that the data collected from one group should not influence the data collected from another group.
2. `Normality:` The data within each group should be approximately normally distributed. This assumption is particularly important when the sample sizes are small, as ANOVA is robust to deviations from normality with larger sample sizes.
3. `Homogeneity of Variances:` The variances among the different groups should be approximately equal. This can be tested using **Levene's test for homogeneity of variances**.
4. `Continuous Dependent Variable:` The dependent variable should be measured on a continuous scale.

**Problem Statement:** A researcher wants to investigate whether different diets lead to different weight loss outcomes. Three different diet plans (Diet A, Diet B, and Diet C) are tested on groups of participants, and the weight loss (in kilograms) is recorded after a month. The researcher wants to determine if there is a significant difference in mean weight loss among the three diet plans.

**Hypotheses:**
- Null Hypothesis ($H_0$): There is no difference in the mean weight loss among the three diet plans. ($\mu_A = \mu_B = \mu_C$)
- Alternative Hypothesis ($H_1$): At least one diet plan has a different mean weight loss compared to the others.

In [21]:
import numpy as np
from scipy.stats import f_oneway, levene

# Simulated weight-loss data for each diet plan (no random seed is set, so results vary between runs)
diet_A_loss = np.random.normal(loc=5, scale=1.5, size=30)  # Mean 5 kg, std dev 1.5
diet_B_loss = np.random.normal(loc=6, scale=1.5, size=30)  # Mean 6 kg, std dev 1.5
diet_C_loss = np.random.normal(loc=4.5, scale=1.5, size=30)  # Mean 4.5 kg, std dev 1.5

# Perform Levene's Test for equal variances
levene_stat, levene_p_value = levene(diet_A_loss, diet_B_loss, diet_C_loss)

# Interpret Levene's Test results
alpha = 0.05
print(f"Levene's Test p-value: {levene_p_value:.4f}")
if levene_p_value < alpha:
    print("The variances are significantly different. Consider Welch's ANOVA or the Kruskal-Wallis test instead.")
else:
    print("The variances are not significantly different. Proceeding with ANOVA.")

# Perform One-Way ANOVA
anova_stat, anova_p_value = f_oneway(diet_A_loss, diet_B_loss, diet_C_loss)

# Interpret ANOVA results
print(f"ANOVA p-value: {anova_p_value:.4f}")
if anova_p_value < alpha:
    print("We have sufficient evidence to reject the null hypothesis.")
    print("There is a significant difference in mean weight loss among the diet plans.")
else:
    print("We do not have sufficient evidence to reject the null hypothesis.")
    print("No significant difference in mean weight loss was detected among the diet plans.")

Levene's Test p-value: 0.4653
The variances are not significantly different. Proceeding with ANOVA.
ANOVA p-value: 0.0000
We have sufficient evidence to reject the null hypothesis.
There is a significant difference in mean weight loss among the diet plans.


In [24]:
import numpy as np
from scipy import stats
from scipy.stats import f_oneway, levene

# Sample data: weight loss for each diet plan
diet_A_loss = np.random.normal(loc=5, scale=1.5, size=30)  # Mean 5 kg, std dev 1.5
diet_B_loss = np.random.normal(loc=6, scale=1.5, size=30)  # Mean 6 kg, std dev 1.5
diet_C_loss = np.random.normal(loc=4.5, scale=1.5, size=30)  # Mean 4.5 kg, std dev 1.5

# Step 1: Shapiro-Wilk test for normality within each group
shapiro_a = stats.shapiro(diet_A_loss)
shapiro_b = stats.shapiro(diet_B_loss)
shapiro_c = stats.shapiro(diet_C_loss)

# p-values above 0.05 mean there is no evidence against normality for that group
print(shapiro_a.pvalue)
print(shapiro_b.pvalue)
print(shapiro_c.pvalue)

0.7203177602421449
0.9624176708216704
0.5935965401670127


In [25]:
# Step 2: Levene's test for equal variances
levene_stat, levene_p_value = levene(diet_A_loss, diet_B_loss, diet_C_loss)
print(levene_stat)


0.03465251345678155


In [27]:
print(levene_p_value)  # p > 0.05: no evidence that the variances differ, so the equal-variance assumption is reasonable


0.9659543346349586


In [29]:
# Perform One-Way ANOVA
anova_stat, anova_p_value = f_oneway(diet_A_loss, diet_B_loss, diet_C_loss)
print(anova_stat)
print(anova_p_value)

7.786183153461138
0.0007749366335931776


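The `f_oneway` function returns the `F-value and the P-value`. The `F-value is the test statistic`: the ratio of the variance between the group means to the variance within the groups, so larger values are stronger evidence against equal means. The P-value tells us whether the observed differences in means across the groups are statistically significant. Here F ≈ 7.79 and p ≈ 0.00077 < 0.05, so we reject the null hypothesis that all three diets produce the same mean weight loss.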
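
As a rough sketch of what `f_oneway` computes, the F-statistic can be rebuilt from sums of squares: $F = \dfrac{SS_B / (k - 1)}{SS_W / (N - k)}$, where $k$ is the number of groups and $N$ the total number of observations. The data is freshly simulated with an arbitrary fixed seed (an assumption for reproducibility), so the numbers will not match the outputs above; the point is that the manual and SciPy results agree.

```python
import numpy as np
from scipy.stats import f as f_dist, f_oneway

rng = np.random.default_rng(42)  # arbitrary seed, chosen only for reproducibility
groups = [
    rng.normal(loc=5, scale=1.5, size=30),
    rng.normal(loc=6, scale=1.5, size=30),
    rng.normal(loc=4.5, scale=1.5, size=30),
]

k = len(groups)                        # number of groups
n_total = sum(len(g) for g in groups)  # total number of observations
grand_mean = np.concatenate(groups).mean()

# Between-group sum of squares: distance of each group mean from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-group sum of squares: spread of observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_manual = ms_between / ms_within

# Upper-tail probability of the F distribution with (k - 1, N - k) degrees of freedom
p_manual = f_dist.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = f_oneway(*groups)
print(f"Manual   F = {f_manual:.4f}, p = {p_manual:.6f}")
print(f"f_oneway F = {f_scipy:.4f}, p = {p_scipy:.6f}")
```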

Remember, ANOVA tells us if `there's at least one significant difference` but doesn't specify where it is. If the test is significant, you would typically `follow up with post-hoc tests` to find out which specific groups differ from each other.
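
Tukey's HSD is a common post-hoc choice. A simpler, more conservative option that needs only `scipy.stats.ttest_ind` is to run pairwise t-tests and apply a Bonferroni correction (multiply each p-value by the number of comparisons, capped at 1). The sketch below assumes simulated diet data with an arbitrary fixed seed, so its numbers are illustrative only.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)  # arbitrary seed so the example is reproducible
diets = {
    "Diet A": rng.normal(loc=5, scale=1.5, size=30),
    "Diet B": rng.normal(loc=6, scale=1.5, size=30),
    "Diet C": rng.normal(loc=4.5, scale=1.5, size=30),
}

alpha = 0.05
pairs = list(combinations(diets, 2))  # every pair of diet names
n_comparisons = len(pairs)

adjusted = {}
for name_1, name_2 in pairs:
    _, p_raw = ttest_ind(diets[name_1], diets[name_2])
    # Bonferroni: scale each p-value by the number of comparisons, capped at 1
    p_adj = min(p_raw * n_comparisons, 1.0)
    adjusted[(name_1, name_2)] = p_adj
    verdict = "differ" if p_adj < alpha else "no significant difference"
    print(f"{name_1} vs {name_2}: adjusted p = {p_adj:.4f} -> {verdict}")
```

Bonferroni keeps the family-wise error rate at or below `alpha`, at the cost of some power; Tukey's HSD is usually less conservative when all pairs are compared.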

### How to Choose a Hypothesis Test

**`Frequently used hypothesis tests`**:

1. **Z-Test**
    - `Use When`: Comparing the mean of a sample to a known population mean when the population variance is known (in practice it is also used when the sample is large, n > 30, so the sample variance is a reliable estimate).
    - `Example`: Testing if the average height of a sample of students is different from the known average height of the population.

---

2. **T-Test**
    - `One-Sample T-Test`: Compare the sample mean to a known value.
    - `Two-Sample T-Test`: Compare the means of two independent samples.
    - `Paired T-Test`: Compare means from the same group at different times.
    - `Use When`: The population variance is unknown and must be estimated from the sample; this matters most when the sample size is small (n < 30).
    - `Example`: Testing if the average test scores of two different classes are significantly different.

---

3. **ANOVA (Analysis of Variance)**
    - `One-Way ANOVA`: Compare means of three or more independent groups.
    - `Two-Way ANOVA`: Compare means across two categorical independent variables (factors), including their interaction.
    - `Use When`: Comparing the means of three or more groups to see if at least one group mean is different.
    - `Example`: Testing if different teaching methods result in different student performance.

---

4. **Chi-Square Test**
    - `Chi-Square Goodness of Fit Test`: Determine if the observed frequencies of one categorical variable match an expected distribution.
    - `Chi-Square Test of Independence`: Determine if two categorical variables are independent.
    - `Use When`: Dealing with counts of categorical data, either to compare them with expected frequencies or to test relationships between variables.
    - `Example`: Testing if there is an association between gender and voting preference.

---
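
The goodness-of-fit test was demonstrated earlier; for the test of independence, a minimal sketch with `scipy.stats.chi2_contingency` follows. The contingency table is entirely hypothetical (counts invented for illustration), with rows standing for two gender categories and columns for three parties.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table (invented counts):
# rows = gender (two categories), columns = preferred party (X, Y, Z)
observed = np.array([
    [40, 35, 25],
    [30, 45, 25],
])

# chi2_contingency derives the expected counts from the row and column totals
chi2_stat, p_value, dof, expected = chi2_contingency(observed)

print("Chi-squared Statistic:", chi2_stat)
print("Degrees of freedom:", dof)  # (rows - 1) * (cols - 1)
print("P-value:", p_value)
print("Expected counts under independence:\n", expected)
```

Each expected count is (row total × column total) / grand total; a small p-value would suggest that voting preference depends on gender in the population.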

**`Other Tests`**:

1. **Mann-Whitney U Test**:
    - `Use When`: Comparing differences between two independent groups when the data is not normally distributed.
    - `Example`: Testing if the distribution of scores differs between two different teaching methods.

---

2. **Wilcoxon Signed-Rank Test**:
    - `When to use`: Compare the distributions of two related groups (e.g., before and after a treatment).
    - `Description`: Tests if the distributions of two related groups are significantly different.
    - `Example`: Testing if there is a difference in performance before and after a training program.

---

3. **Kruskal-Wallis H Test**:
    - `When to use`: Comparing three or more independent groups when the data is not normally distributed. It is the nonparametric alternative to one-way ANOVA and tests whether the groups' distributions differ significantly.
    - `Example`: Testing if there are differences in customer satisfaction scores across multiple stores.

---

4. **Shapiro-Wilk Test or D'Agostino's $K^2$ Test or Anderson-Darling Test**:
    - `When to use`: Check if the data follows a normal distribution.
    - `Description`: Tests if the data is normally distributed.

---

5. **Augmented Dickey-Fuller Test or Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test**:
    - `When to use`: Check if a time series is stationary.
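    - `Description`: Tests if a time series is stationary. The two tests use opposite null hypotheses: the ADF null is that the series has a unit root (non-stationary), while the KPSS null is that the series is stationary.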
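
A minimal sketch of the nonparametric workflow for the customer-satisfaction example: check normality with Shapiro-Wilk, then fall back to Kruskal-Wallis for several groups and Mann-Whitney U for one pair. The scores are hypothetical, right-skewed values drawn from exponential distributions with an arbitrary seed; only `scipy.stats` functions are used.

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu, shapiro

rng = np.random.default_rng(1)  # arbitrary seed for reproducibility
# Hypothetical right-skewed satisfaction-style scores for three stores
store_1 = rng.exponential(scale=2.0, size=40)
store_2 = rng.exponential(scale=2.5, size=40)
store_3 = rng.exponential(scale=3.0, size=40)

# Step 1: check normality; small p-values suggest the data is not normal
for name, sample in [("Store 1", store_1), ("Store 2", store_2), ("Store 3", store_3)]:
    print(f"{name} Shapiro-Wilk p = {shapiro(sample).pvalue:.4f}")

# Step 2: Kruskal-Wallis H test across all three independent groups
h_stat, p_kw = kruskal(store_1, store_2, store_3)
print(f"Kruskal-Wallis H = {h_stat:.3f}, p = {p_kw:.4f}")

# Step 3: Mann-Whitney U test for a single pair of independent groups
u_stat, p_mw = mannwhitneyu(store_1, store_3, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_mw:.4f}")
```

These rank-based tests drop the normality assumption, but they still assume independent, randomly sampled observations.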

### Important Note

Beyond choosing a hypothesis test, it is important to check whether your data meets the assumptions of the test you want to run. Each hypothesis test has its own set of assumptions; however, one assumption is shared by all of them: the data was randomly sampled from the population of interest.

This is important because random sampling makes the sample representative of the population, on average, in terms of both observed and unobserved characteristics. Unfortunately, random sampling is sometimes impossible, and in those situations it is important to understand how the sampling method can bias the results of a test.

For example, let’s return to the example with the yogurt company “The Dairy Culture”. Let’s say the company had multiple factories, but the quality assurance team only collected yogurts from one specific factory. The data is thus not randomly sampled from the entire population that we care about (all factories), and could be biased if the quality of yogurt differs at each one.

Ethical issues can also arise when a sample is not representative of a population. When developing and testing a vaccine, for example, researchers must recruit volunteers in appropriate proportions of genders, races, age ranges, pre-existing conditions, and so on, so that efficacy is tested for the entire population the vaccine will be used on. If the manufacturers test on a sample that lacks sufficient data for one race, the vaccine's efficacy for that group may turn out to be lower than reported, or may simply remain unknown.

It can often be challenging to find a representative sample, or even to recognize biased data, but it is essential to consider this when designing an experiment.