In [1]:
import numpy as np
from scipy import stats

# Levene Test

YouTube: https://www.youtube.com/watch?v=x51GDTiPIfI

The Levene test, named after its developer Howard Levene, is a statistical test used to assess whether the variances of two or more groups or samples are equal or significantly different from each other. It is often used as a preliminary step in statistical analysis to determine whether the assumption of equal variances (homoscedasticity) is met before conducting certain parametric statistical tests, such as the t-test or analysis of variance (ANOVA).

Here's how the Levene test works:

1. Formulate the null hypothesis (H0): The null hypothesis in the Levene test is that the variances of the groups or samples being compared are equal.

2. Formulate the alternative hypothesis (H1): The alternative hypothesis is that the variances of the groups or samples are not equal.

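3. Calculate a test statistic: The Levene test calculates a test statistic by comparing the absolute deviations of individual data points from their respective group centers. Levene's original formulation uses the group means; the median-based variant (often called the Brown-Forsythe test) is more robust to skewed data and is what `scipy.stats.levene` uses by default (`center='median'`). The statistic itself is the one-way ANOVA F statistic computed on those absolute deviations.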

4. Determine the p-value: The test statistic is then used to calculate a p-value, which represents the probability of obtaining the observed variance differences (or more extreme differences) if the null hypothesis were true.

5. Make a decision: Based on the calculated p-value and a chosen significance level (e.g., α = 0.05), you can decide whether to reject the null hypothesis. If the p-value is less than the chosen significance level, you reject the null hypothesis, indicating that there is significant evidence that the variances are not equal. If the p-value is greater than or equal to the significance level, you fail to reject the null hypothesis, suggesting that there is not enough evidence to conclude that the variances are different.

If the Levene test indicates that the variances are significantly different, you may need to use alternative statistical methods that do not assume equal variances, such as Welch's t-test or non-parametric tests like the Mann-Whitney U test or Kruskal-Wallis test.

Overall, the Levene test is a useful tool for assessing the assumption of equal variances, which is important for the validity of various statistical analyses.
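As a rough sketch of the fallback options mentioned above, SciPy's `ttest_ind` runs Welch's t-test when called with `equal_var=False`, and `mannwhitneyu` provides the rank-based alternative for two groups. The two samples below are made-up illustrative values, not data from this notebook.

```python
from scipy import stats

# Hypothetical samples with visibly different spreads (illustrative values only)
group_x = [50, 51, 49, 50, 52, 48]
group_y = [40, 60, 35, 65, 45, 58]

# Welch's t-test: compares means without assuming equal variances
welch_stat, welch_p = stats.ttest_ind(group_x, group_y, equal_var=False)

# Mann-Whitney U test: rank-based alternative for two independent groups
u_stat, u_p = stats.mannwhitneyu(group_x, group_y, alternative="two-sided")

print("Welch t-test p-value:", welch_p)
print("Mann-Whitney U p-value:", u_p)
```

For three or more groups, `stats.kruskal` plays the same role as the Kruskal-Wallis test named above.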

### Problem
Suppose you are a researcher studying the performance of three different teaching methods (A, B, and C) on a group of students. You want to determine if there is a significant difference in the variances of test scores achieved by students in these three teaching methods. Your null hypothesis is that the variances are equal, and your alternative hypothesis is that they are not equal.

You have the following data for each teaching method:

Teaching Method A:
[85, 88, 90, 92, 87, 89]

Teaching Method B:
[82, 81, 80, 84, 79, 85]

Teaching Method C:
[88, 85, 82, 81, 87, 84]

Using Python, you want to perform Levene's test to test the equality of variances.

In [4]:
# Solution Using Python
# Data for each teaching method
teaching_method_A = [85, 88, 90, 92, 87, 89]
teaching_method_B = [82, 81, 80, 84, 79, 85]
teaching_method_C = [88, 85, 82, 81, 87, 84]

# Perform Levene's Test
statistic, p_value = stats.levene(teaching_method_A, teaching_method_B, teaching_method_C)

# Set the significance level (alpha)
alpha = 0.05

# Print the results
print("Levene's Test:")
print("Test Statistic:", statistic)
print("P-Value:", p_value)

# Decide whether to reject the null hypothesis
if p_value < alpha:
    print("Reject the null hypothesis. There is evidence that the variances are not equal.")
else:
    print("Fail to reject the null hypothesis. There is no significant evidence that the variances are different.")

Levene's Test:
Test Statistic: 0.12820512820512814
P-Value: 0.8806265049681821
Fail to reject the null hypothesis. There is no significant evidence that the variances are different.
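To make the origin of the test statistic concrete, the following sketch reproduces it by hand for the same three groups. It assumes the median-centered form of the test, which is SciPy's default (`center="median"`): take each score's absolute deviation from its group median, then run a one-way ANOVA on those deviations.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([85, 88, 90, 92, 87, 89]),  # Teaching Method A
    np.array([82, 81, 80, 84, 79, 85]),  # Teaching Method B
    np.array([88, 85, 82, 81, 87, 84]),  # Teaching Method C
]

# Step 1: absolute deviation of every score from its own group's median
abs_dev = [np.abs(g - np.median(g)) for g in groups]

# Step 2: a one-way ANOVA on those deviations yields the Levene statistic
manual_stat, manual_p = stats.f_oneway(*abs_dev)

# Reference values from SciPy, with the default centering made explicit
scipy_stat, scipy_p = stats.levene(*groups, center="median")

print("Manual statistic:", manual_stat)
print("SciPy statistic: ", scipy_stat)
```

Passing `center="mean"` to `stats.levene` (and using `np.mean` in step 1) gives Levene's original mean-centered version instead.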


# Shapiro-Wilk Test

The Shapiro-Wilk test, named after its developers Samuel Shapiro and Martin Wilk, is a statistical test used to assess whether a given dataset follows a normal distribution or not. It is one of the methods used to check the normality assumption, which is important for many statistical techniques, such as parametric tests like t-tests and analysis of variance (ANOVA).

Here's how the Shapiro-Wilk test works:

1. **Null Hypothesis (H0):** The null hypothesis of the Shapiro-Wilk test is that the data follows a normal distribution.

2. **Alternative Hypothesis (H1):** The alternative hypothesis is that the data does not follow a normal distribution.

3. **Test Statistic:** The test calculates a statistic (W) that measures how closely the ordered sample values match the values expected from a normal distribution. W lies between 0 and 1: values close to 1 are consistent with normality, and smaller values indicate a greater departure from it.

4. **P-Value:** The test produces a p-value, which indicates the probability of obtaining the observed departure from normality (or more extreme deviations) if the null hypothesis were true. A small p-value (typically less than the chosen significance level, often 0.05) suggests evidence to reject the null hypothesis, indicating that the data is not normally distributed. Conversely, a larger p-value suggests that there is insufficient evidence to conclude that the data departs significantly from a normal distribution.

5. **Decision:** Based on the calculated p-value and a chosen significance level (alpha), you can decide whether to reject the null hypothesis or not. If p-value < alpha, you reject the null hypothesis, concluding that the data is not normally distributed. If p-value ≥ alpha, you fail to reject the null hypothesis, suggesting that the data may reasonably follow a normal distribution.

The Shapiro-Wilk test is most useful for small to moderately sized datasets (typically fewer than 5,000 observations). With larger datasets, the test's high power means that even minor, practically irrelevant departures from normality can come out as statistically significant.

In practice, the Shapiro-Wilk test is often used in conjunction with visual methods like histograms, Q-Q plots, and normal probability plots to assess the normality assumption of a dataset before applying parametric statistical tests that assume normality. If the data significantly deviates from normality, alternative non-parametric tests or transformations of the data may be considered.
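One way to pair the test with a Q-Q check without drawing a figure is `scipy.stats.probplot`, which returns the Q-Q plot coordinates together with a straight-line fit. The sketch below uses a synthetic, seeded sample (an assumption for illustration, not data from this notebook); a correlation `r` close to 1 means the points lie close to the reference line.

```python
import numpy as np
from scipy import stats

# Synthetic sample drawn from a normal distribution (seed fixed for repeatability)
rng = np.random.default_rng(0)
sample = rng.normal(loc=100, scale=15, size=50)

# probplot returns the Q-Q coordinates and a least-squares line through them
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(sample, dist="norm")

# Shapiro-Wilk test on the same sample
w_stat, p_value = stats.shapiro(sample)

print("Q-Q correlation r:", r)
print("Shapiro-Wilk W:", w_stat)
print("Shapiro-Wilk p-value:", p_value)
```

Passing `plot=plt` (with `import matplotlib.pyplot as plt`) to `probplot` draws the same points as an actual Q-Q plot.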

### Problem:

A manufacturer of light bulbs is concerned about the quality of its products. They suspect that the lifespan (in hours) of a certain type of light bulb they produce may not be normally distributed. To investigate this, they randomly select a sample of 30 light bulbs from their production line and record their lifespans. The company wants to determine whether the lifespans of these light bulbs follow a normal distribution.

In [6]:
# Solve this problem
# Lifespan data of 30 randomly selected light bulbs
lifespans = np.array([1100, 1150, 1125, 1080, 1200, 1225, 1175, 1180, 1230, 1195,
                      1220, 1165, 1210, 1240, 1170, 1120, 1135, 1215, 1190, 1245,
                      1160, 1145, 1125, 1205, 1155, 1235, 1190, 1185, 1210, 1140])

# Perform the Shapiro-Wilk test
statistic, p_value = stats.shapiro(lifespans)

# Print the results
print("Shapiro-Wilk Test:")
print("Test Statistic:", statistic)
print("P-Value:", p_value)

# Significance level (set here so the cell does not depend on an earlier one)
alpha = 0.05

if p_value < alpha:
    print("Reject the null hypothesis. The lifespans do not follow a normal distribution.")
else:
    print("Fail to reject the null hypothesis. There is no significant evidence that the lifespans are not normally distributed.")

Shapiro-Wilk Test:
Test Statistic: 0.9725183248519897
P-Value: 0.610126256942749
Fail to reject the null hypothesis. There is no significant evidence that the lifespans are not normally distributed.


# K-S Test

The Kolmogorov-Smirnov (K-S) test is a non-parametric statistical test used to assess whether a given dataset follows a particular probability distribution or if two datasets are drawn from the same underlying distribution. It is named after Andrey Kolmogorov and Nikolai Smirnov, who developed the test.

Because it compares entire cumulative distribution functions, the test responds to differences in location, spread, or shape, although it is more sensitive near the center of the distribution than in the tails.

There are two main variants of the K-S test:

1. **One-sample K-S test:** This version of the test is used to compare a single dataset to a known theoretical distribution. The null hypothesis (H0) assumes that the dataset follows the specified theoretical distribution. The test calculates a test statistic (D) that represents the maximum absolute difference between the empirical cumulative distribution function (ECDF) of the observed data and the cumulative distribution function (CDF) of the theoretical distribution. The test also produces a p-value, which indicates the probability of observing a maximum difference at least this large under the null hypothesis. The reference distribution must be fully specified in advance; if its parameters are estimated from the same data, the standard p-value is too large and the test becomes conservative.

2. **Two-sample K-S test:** This version of the test is used to compare two independent datasets to determine if they come from the same underlying population distribution. The null hypothesis (H0) assumes that the two datasets are drawn from the same distribution. Like the one-sample test, it calculates a test statistic (D) representing the maximum absolute difference between the ECDFs of the two datasets and produces a p-value.

Here's a summary of the steps for both versions of the K-S test:

**One-sample K-S test:**

1. Formulate the null hypothesis (H0): The data follows the specified theoretical distribution.

2. Formulate the alternative hypothesis (H1): The data does not follow the specified theoretical distribution.

3. Calculate the test statistic (D).

4. Determine the p-value, representing the probability of observing D or a larger value under H0.

5. Compare the p-value to a chosen significance level (alpha) to make a decision. If p-value < alpha, reject H0, indicating that the data does not follow the specified distribution.
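Step 3 can be sketched directly from the definition of D. The example below uses a synthetic, seeded sample and a standard normal reference (both are assumptions for illustration) and checks the hand-computed value against `scipy.stats.kstest`.

```python
import numpy as np
from scipy import stats

# Synthetic sample, sorted so the ECDF steps are easy to index
rng = np.random.default_rng(42)
x = np.sort(rng.normal(loc=0.0, scale=1.0, size=25))
n = len(x)

# Theoretical CDF evaluated at each observation
cdf = stats.norm.cdf(x, loc=0.0, scale=1.0)

# The ECDF jumps from (i-1)/n to i/n at the i-th sorted observation,
# so the largest gap can occur just after or just before a jump
d_plus = np.max(np.arange(1, n + 1) / n - cdf)
d_minus = np.max(cdf - np.arange(0, n) / n)
d_manual = max(d_plus, d_minus)

d_scipy, p_value = stats.kstest(x, "norm", args=(0.0, 1.0))

print("Manual D:", d_manual)
print("SciPy D: ", d_scipy)
```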

**Two-sample K-S test:**

1. Formulate the null hypothesis (H0): The two datasets are drawn from the same underlying distribution.

2. Formulate the alternative hypothesis (H1): The two datasets are not drawn from the same underlying distribution.

3. Calculate the test statistic (D) by comparing the ECDFs of the two datasets.

4. Determine the p-value, representing the probability of observing D or a larger value under H0.

5. Compare the p-value to a chosen significance level (alpha) to make a decision. If p-value < alpha, reject H0, indicating that the two datasets come from different distributions.

In Python, SciPy provides both variants: `scipy.stats.kstest` for the one-sample test and `scipy.stats.ks_2samp` for the two-sample test. NumPy is used here only to hold the data.

### Problem One Sample K-S Test:

A manufacturer of computer chips claims that the processing time of its chips is normally distributed with a mean of 24 milliseconds. To verify this claim, a quality control manager randomly selects 38 computer chips from the production line and records their processing times (in milliseconds):

[22.5, 25.2, 23.8, 26.1, 24.9, 23.0, 25.7, 24.4, 23.2, 26.0, 24.3, 25.6, 23.7, 25.1, 24.8, 26.2, 23.9, 25.0, 24.6, 26.4, 23.5, 25.3, 24.7, 26.3, 23.6, 25.5, 24.2, 26.5, 23.1, 25.4, 24.5, 25.8, 24.1, 26.6, 23.3, 25.9, 24.0, 26.7]

Perform a one-sample Kolmogorov-Smirnov (K-S) test to determine whether the processing times of these computer chips follow a normal distribution with the claimed mean of 24 milliseconds, at a significance level of 0.05.

In [4]:
# Solve this problem
# Recorded processing times
processing_times = np.array([22.5, 25.2, 23.8, 26.1, 24.9, 23.0, 25.7, 24.4, 23.2, 26.0,
                              24.3, 25.6, 23.7, 25.1, 24.8, 26.2, 23.9, 25.0, 24.6, 26.4,
                              23.5, 25.3, 24.7, 26.3, 23.6, 25.5, 24.2, 26.5, 23.1, 25.4,
                              24.5, 25.8, 24.1, 26.6, 23.3, 25.9, 24.0, 26.7])

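# Parameters of the reference normal distribution
mu = 24  # Mean processing time according to the claim
# The claim gives no standard deviation, so it is estimated from the data.
# Note: np.std defaults to ddof=0 (population formula), and estimating a
# parameter from the same sample makes the K-S p-value only approximate.
sigma = np.std(processing_times)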

# Perform the one-sample K-S test
statistic, p_value = stats.kstest(processing_times, 'norm', args=(mu, sigma))

# Set the significance level (alpha)
alpha = 0.05

# Print the results
print("One-Sample Kolmogorov-Smirnov (K-S) Test:")
print("Test Statistic:", statistic)
print("P-Value:", p_value)

# Decide whether to reject the null hypothesis
if p_value < alpha:
    print("Reject the null hypothesis. The processing times do not follow a normal distribution with mean 24 ms.")
else:
    print("Fail to reject the null hypothesis. There is no significant evidence that the processing times deviate from a normal distribution with mean 24 ms.")

One-Sample Kolmogorov-Smirnov (K-S) Test:
Test Statistic: 0.2857645374741111
P-Value: 0.003064886737276562
Reject the null hypothesis. The processing times do not follow a normal distribution with mean 24 ms.
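The null hypothesis tested above is a normal distribution with mean 24, so a rejection can come from the mean as much as from the shape. The sketch below, which is an illustrative check rather than part of the original solution, compares the D statistic against the claimed mean with the D statistic against the sample mean. Only D is compared, because p-values are not valid when the reference parameters are estimated from the same data.

```python
import numpy as np
from scipy import stats

processing_times = np.array([22.5, 25.2, 23.8, 26.1, 24.9, 23.0, 25.7, 24.4, 23.2, 26.0,
                             24.3, 25.6, 23.7, 25.1, 24.8, 26.2, 23.9, 25.0, 24.6, 26.4,
                             23.5, 25.3, 24.7, 26.3, 23.6, 25.5, 24.2, 26.5, 23.1, 25.4,
                             24.5, 25.8, 24.1, 26.6, 23.3, 25.9, 24.0, 26.7])

sample_mean = processing_times.mean()
sample_sd = processing_times.std(ddof=1)  # sample standard deviation

# D against the claimed mean of 24 versus D against the sample mean
d_claimed = stats.kstest(processing_times, "norm", args=(24, sample_sd)).statistic
d_fitted = stats.kstest(processing_times, "norm", args=(sample_mean, sample_sd)).statistic

print("Sample mean:", sample_mean)
print("D with claimed mean 24:", d_claimed)
print("D with sample mean:    ", d_fitted)
```

A much smaller D for the fitted mean suggests that the gap between the sample mean and the claimed 24 ms contributes heavily to the rejection.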


### Problem Two Sample K-S Test:

A company manufactures two types of batteries, Type A and Type B, both of which are supposed to have consistent and normally distributed lifetimes. To assess the quality of these batteries, a quality control manager selects random samples of each type and records their lifetimes. The recorded lifetimes (in hours) are as follows:

Type A Batteries:
[420, 425, 430, 435, 440, 445, 450, 455, 460, 465]

Type B Batteries:
[410, 415, 420, 425, 430, 435, 440, 445, 450, 455]

Perform a two-sample Kolmogorov-Smirnov (K-S) test to determine whether the lifetimes of Type A and Type B batteries come from the same underlying distribution at a significance level of 0.05.

In [5]:
# Solve this Problem
# Lifetimes of Type A and Type B batteries
type_a_lifetimes = np.array([420, 425, 430, 435, 440, 445, 450, 455, 460, 465])
type_b_lifetimes = np.array([410, 415, 420, 425, 430, 435, 440, 445, 450, 455])

# Perform the two-sample K-S test
statistic, p_value = stats.ks_2samp(type_a_lifetimes, type_b_lifetimes)

# Set the significance level (alpha)
alpha = 0.05

# Print the results
print("Two-Sample Kolmogorov-Smirnov (K-S) Test:")
print("Test Statistic:", statistic)
print("P-Value:", p_value)

# Decide whether to reject the null hypothesis
if p_value < alpha:
    print("Reject the null hypothesis. The lifetimes of Type A and Type B batteries come from different distributions.")
else:
    print("Fail to reject the null hypothesis. There is no significant evidence that the lifetimes of Type A and Type B batteries come from different distributions.")

Two-Sample Kolmogorov-Smirnov (K-S) Test:
Test Statistic: 0.2
P-Value: 0.9944575548290717
Fail to reject the null hypothesis. There is no significant evidence that the lifetimes of Type A and Type B batteries come from different distributions.
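The statistic of 0.2 can be reproduced by hand as the largest gap between the two empirical CDFs. This is a minimal sketch for the same battery data; the variable names are chosen for illustration.

```python
import numpy as np
from scipy import stats

type_a = np.array([420, 425, 430, 435, 440, 445, 450, 455, 460, 465])
type_b = np.array([410, 415, 420, 425, 430, 435, 440, 445, 450, 455])

# Evaluate both ECDFs at every observed value in the pooled data
pooled = np.sort(np.concatenate([type_a, type_b]))
ecdf_a = np.searchsorted(np.sort(type_a), pooled, side="right") / len(type_a)
ecdf_b = np.searchsorted(np.sort(type_b), pooled, side="right") / len(type_b)

# D is the maximum absolute difference between the two ECDFs
d_manual = np.max(np.abs(ecdf_a - ecdf_b))
d_scipy = stats.ks_2samp(type_a, type_b).statistic

print("Manual D:", d_manual)
print("SciPy D: ", d_scipy)
```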


# Fisher's Test

YouTube: https://www.youtube.com/watch?v=udyAvvaMjfM

Fisher's test, also known as Fisher's exact test, is a statistical significance test used to determine whether there is a nonrandom association between two categorical variables. It is named after its developer, Sir Ronald A. Fisher, who introduced it in the 1930s. Fisher's test is particularly valuable when dealing with small sample sizes or when the assumptions of other statistical tests, such as the chi-squared test, are not met.

This test is often used in 2x2 contingency tables, which are used to examine the relationships between two categorical variables. These variables are typically organized into rows and columns of the table. The goal of Fisher's test is to determine whether there is a significant association between the two variables or if the proportions of observations in different categories of one variable are independent of the categories of the other variable.

Here are the fundamental steps involved in performing Fisher's exact test:

1. Set up a 2x2 contingency table displaying the observed frequencies of the categories for the two variables. The table typically looks like this:

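```
                          Variable 1
                        | Category A | Category B |
---------------------------------------------------
Variable 2 | Category X |     a      |     b      |
---------------------------------------------------
Variable 2 | Category Y |     c      |     d      |
---------------------------------------------------
```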

2. Calculate the probability of observing the data under the assumption of independence, that is, assuming there is no association between the two variables. With the row and column totals held fixed at their observed values, the probability of each possible table follows a hypergeometric distribution, and this step computes that probability for every such table.

3. Determine if the observed table is significantly different from the expected table under the assumption of independence. This is accomplished by comparing the observed probability to the probabilities of all possible tables. If the observed probability is very low (indicating that the observed table is unlikely under the assumption of independence), it suggests a significant association between the two variables.

4. Calculate a p-value, which represents the probability of obtaining a table as extreme as, or more extreme than, the observed table, assuming the null hypothesis (independence) is true. If the p-value is below a predetermined significance level (e.g., 0.05), you reject the null hypothesis and conclude that there is a significant association between the two variables.

Fisher's exact test is frequently employed in fields such as biology, epidemiology, and the social sciences, where small sample sizes or infrequent events make other statistical tests less suitable. It provides a means to assess the significance of associations in contingency tables while accounting for data limitations.

### Problem:

A medical researcher wants to investigate whether there is an association between two treatments (Treatment A and Treatment B) and the recovery status of patients (Recovered or Not Recovered) after a certain period. The researcher collected data on 50 patients who received Treatment A and 50 patients who received Treatment B. The data is summarized in the following contingency table:

```
                Treatment A    Treatment B
Recovered       20             10
Not Recovered   30             40
```

Perform Fisher's exact test to determine if there is a significant association between the type of treatment and the recovery status of patients at a significance level of 0.05.



In [7]:
# Solve this problem
# Create a 2x2 contingency table
observed_data = [[20, 10], [30, 40]]

# Perform Fisher's exact test
odds_ratio, p_value = stats.fisher_exact(observed_data)

# Set the significance level
alpha = 0.05

# Check if the p-value is less than alpha to determine significance
if p_value < alpha:
    result = "significant"
else:
    result = "not significant"

# Display the results
print(f"Odds Ratio: {odds_ratio:.2f}")
print(f"P-value: {p_value:.4f}")
print(f"The association between treatment and recovery is {result} at alpha = {alpha}.")


Odds Ratio: 2.67
P-value: 0.0486
The association between treatment and recovery is significant at alpha = 0.05.
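The two-sided p-value above can be reproduced from the hypergeometric distribution described in the steps. The sketch below assumes the common two-sided convention of summing the probabilities of all tables that are no more likely than the observed one; the tolerance factor is an illustrative choice to guard against floating-point ties.

```python
import numpy as np
from scipy import stats

table = np.array([[20, 10], [30, 40]])
a = table[0, 0]                 # observed top-left count
row1_total = table[0].sum()     # patients who recovered
col1_total = table[:, 0].sum()  # patients on Treatment A
grand_total = table.sum()

# With the margins fixed, the top-left count is hypergeometric:
# population size, number of "successes" in the population, number of draws
dist = stats.hypergeom(grand_total, col1_total, row1_total)

# All values the top-left cell can take given the margins
lo = max(0, row1_total + col1_total - grand_total)
hi = min(row1_total, col1_total)
support = np.arange(lo, hi + 1)
probs = dist.pmf(support)
p_observed = dist.pmf(a)

# Sum the probabilities of tables no more likely than the observed one
p_manual = probs[probs <= p_observed * (1 + 1e-7)].sum()

odds_ratio, p_scipy = stats.fisher_exact(table)

print("Manual p-value:", p_manual)
print("SciPy p-value: ", p_scipy)
```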
