# Hypothesis

A hypothesis is a statement or assumption about a phenomenon that can be tested using scientific methods. In scientific research, it serves as a proposed explanation for an observation, and it is evaluated through experimentation, observation, and data analysis.

The process of forming a hypothesis usually starts with observation and background research, followed by formulating a question or problem, and then developing a statement or explanation that can be tested through data collection and analysis. The goal of hypothesis testing is to determine whether the hypothesis is supported or rejected based on the evidence.

A well-formed hypothesis should be testable, falsifiable, and specific. It should also be clear and concise, and it should have the potential to be supported or rejected through empirical data.




## Types of hypotheses

1. Null hypotheses

   Null hypotheses state that there is no significant difference or relationship between two or more variables.

2. Alternative hypotheses

   Alternative hypotheses state that there is a significant difference or relationship.

3. Research hypotheses

   Research hypotheses are more specific and detail the direction and nature of the expected relationship or difference between variables.
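As a rough illustration of how a non-directional alternative hypothesis differs from a directional research hypothesis, the sketch below runs the same comparison both ways. The scores are made-up values used only for illustration, and the `alternative` argument assumes SciPy 1.6 or newer.

```python
from scipy import stats

# Hypothetical scores, invented purely for illustration
treatment = [82, 85, 88, 90, 79, 86, 91, 84]
control = [78, 80, 83, 77, 81, 79, 85, 76]

# Alternative hypothesis (non-directional): the two means differ
two_sided = stats.ttest_ind(treatment, control, alternative="two-sided")

# Research hypothesis (directional): the treatment mean is greater
one_sided = stats.ttest_ind(treatment, control, alternative="greater")

print(f"t-statistic: {two_sided.statistic:.3f}")
print(f"two-sided p-value: {two_sided.pvalue:.4f}")
print(f"one-sided p-value: {one_sided.pvalue:.4f}")
```

When the observed difference lies in the predicted direction, the one-sided p-value is half the two-sided one, which is why the direction should be fixed before looking at the data.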


## Common hypothesis tests

1. t-test
2. Z-tests
3. ANOVA (Analysis of Variance)
4. Chi-Square Tests
5. Regression Analysis
6. Non-parametric Tests


### t-test

A t-test is a statistical test that is used to determine whether the difference between
the means of two groups is statistically significant. 

There are two types of t-tests:

1. Independent samples t-test (unrelated groups)
2. Paired samples t-test (related groups)

The difference between independent samples t-test and paired samples t-test lies in the type of data being compared.

An independent samples t-test is used to compare the means of two separate, unrelated groups. No observation in one group is linked to any observation in the other, and the individuals within each group are also independent of each other.

A paired samples t-test, on the other hand, is used to compare the means of two related groups. Each value in one group has a corresponding value in the other group, and the groups are considered "paired" because the individuals are related in some way. For example, you might use a paired samples t-test to compare the pre- and post-test scores of a group of individuals.

In summary, the main difference between independent samples t-test and paired samples t-test is that the former is used to compare independent groups, while the latter is used to compare related or dependent groups.
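To make the distinction concrete, here is a minimal sketch that applies both tests to the same numbers. The pre- and post-test scores are hypothetical values invented for this illustration; `stats.ttest_rel` is SciPy's paired samples t-test.

```python
from scipy import stats

# Hypothetical pre- and post-test scores for the same six individuals
pre_scores = [70, 65, 80, 75, 60, 85]
post_scores = [74, 70, 82, 79, 66, 86]

# Paired test: each pre score is matched with the same person's post score
paired = stats.ttest_rel(post_scores, pre_scores)

# Independent test on the same numbers ignores the pairing
independent = stats.ttest_ind(post_scores, pre_scores)

print(f"paired:      t = {paired.statistic:.3f}, p = {paired.pvalue:.4f}")
print(f"independent: t = {independent.statistic:.3f}, p = {independent.pvalue:.4f}")
```

Because the paired test works on within-person differences, it can detect a consistent shift that the independent test may miss when the variation between individuals is large.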
   

Use Cases:

1. Medical research: To compare the effectiveness of different treatments or interventions
2. Marketing research: To compare the means of different product groups or market segments
3. Educational research: To compare the means of different teaching methods or educational programs
4. Psychological research: To compare the means of different therapeutic techniques or mental health interventions
5. Sports research: To compare the performance of different training methods or athletes

In [11]:
import scipy.stats as stats


group1 = [7, 8, 9, 2, 4]
group2 = [1, 6, 5, 3, 7]

t_statistic, p_value = stats.ttest_ind(group1, group2)

# # Generate two samples of data from different populations
# sample1 = stats.norm.rvs(loc=10, scale=1, size=100)
# sample2 = stats.norm.rvs(loc=11, scale=1, size=100)

# # Conduct an independent samples t-test to compare the means of the two samples
# t_test_result = stats.ttest_ind(sample1, sample2)

# # Extract the t-statistic and p-value from the t-test result
# t_statistic = t_test_result.statistic
# p_value = t_test_result.pvalue

# Print the results
print(f"T-Statistic: {t_statistic}")
print(f"P-Value: {p_value}")

# Interpret the results
if p_value < 0.05:
    print("Reject the null hypothesis.")
    print("The means of the two samples are significantly different.")
else:
    print("Fail to reject the null hypothesis.")
    print("The means of the two samples are not significantly different.")


T-Statistic: 0.9460998335825319
P-Value: 0.37179326080163555
Fail to reject the null hypothesis.
The means of the two samples are not significantly different.


In this code, two small samples are hard-coded as group1 and group2, and an independent samples t-test (stats.ttest_ind) is conducted to compare their means. The commented-out lines show an alternative setup that draws two samples of 100 values from normal populations with means 10 and 11.

The t-statistic and p-value are then printed.
Finally, the results are interpreted based on the p-value. If the p-value is less than 0.05, the null hypothesis is rejected,
and it is concluded that the means of the two samples are significantly different.

If the p-value is greater than or equal to 0.05, the null hypothesis is not rejected:
the data do not provide enough evidence that the means differ, which is not the same as showing that they are equal. Here the p-value is about 0.37, so the null hypothesis is not rejected.


#### Example

Suppose a researcher wants to determine if there is a significant difference in the reading scores of students who receive a new reading program (Group A) and students who receive the traditional reading program (Group B). The reading scores of 20 students from Group A and 25 students from Group B are collected and recorded. The mean reading score for Group A is 80 and the mean reading score for Group B is 75. The standard deviations for both groups are 5 and 4, respectively.

To perform a t-test, the following steps can be taken:

1. State the null hypothesis (H0): There is no significant difference in the reading scores of students in Group A and Group B.
2. State the alternate hypothesis (Ha): There is a significant difference in the reading scores of students in Group A and Group B.
3. Determine the t-statistic: t = (mean(A) - mean(B)) / sqrt((s^2 / n1) + (s^2 / n2)), where n1 is the sample size of Group A, n2 is the sample size of Group B, mean(A) and mean(B) are the mean scores of each group, and s is the pooled standard deviation, with s^2 = ((n1 - 1) * sA^2 + (n2 - 1) * sB^2) / (n1 + n2 - 2) for group standard deviations sA and sB.
4. Calculate the degrees of freedom (df): df = n1 + n2 - 2
5. Determine the critical value using the t-distribution table for a significance level of 0.05 and df = 43 (n1 + n2 - 2 = 20 + 25 - 2).
6. Compare the calculated t-statistic to the critical value. For a two-tailed test, if the absolute value of the t-statistic is greater than the critical value, reject the null hypothesis in favor of the alternate hypothesis. Otherwise, fail to reject the null hypothesis.

In this example, if the calculated t-statistic is greater than the critical value, it can be concluded that there is a significant difference in the reading scores of students who receive the new reading program and students who receive the traditional reading program.
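The steps above can be sketched directly from the summary statistics in the example. This is a minimal sketch that assumes the stated standard deviations (5 and 4) are sample standard deviations and that the two groups share a common variance; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Summary statistics from the example
mean_a, std_a, n1 = 80, 5, 20
mean_b, std_b, n2 = 75, 4, 25

# Step 4: degrees of freedom
df = n1 + n2 - 2

# Pooled variance (assumes both populations share a common variance)
pooled_var = ((n1 - 1) * std_a**2 + (n2 - 1) * std_b**2) / df

# Step 3: t-statistic using the pooled variance
t_statistic = (mean_a - mean_b) / np.sqrt(pooled_var / n1 + pooled_var / n2)

# Step 5: two-tailed critical value at a significance level of 0.05
alpha = 0.05
critical_value = stats.t.ppf(1 - alpha / 2, df)

# Step 6: compare the absolute t-statistic with the critical value
reject_null = abs(t_statistic) > critical_value

print(f"df = {df}")
print(f"t-statistic = {t_statistic:.3f}")
print(f"critical value = {critical_value:.3f}")
print("Reject H0" if reject_null else "Fail to reject H0")

# Cross-check against SciPy's helper for summary statistics
check = stats.ttest_ind_from_stats(mean_a, std_a, n1, mean_b, std_b, n2)
print(f"SciPy t-statistic = {check.statistic:.3f}, p-value = {check.pvalue:.5f}")
```

Working from summary statistics avoids inventing raw scores, and the cross-check with `stats.ttest_ind_from_stats` confirms that the hand calculation follows the pooled formula.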



In [13]:
import numpy as np
from scipy import stats

group_a = [75, 80, 75, 80, 80, 75, 80, 75, 80, 80, 75, 80, 75, 80, 80, 75, 80, 75, 80, 80]
group_b = [70, 75, 70, 75, 75, 70, 75, 70, 75, 75, 70, 75, 70, 75, 75, 70, 75, 70, 75, 75, 70, 75, 70, 75, 70]

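mean_a = np.mean(group_a)
mean_b = np.mean(group_b)
# Note: np.std defaults to ddof=0 (population standard deviation);
# pass ddof=1 to get the sample standard deviation instead.
std_a = np.std(group_a)
std_b = np.std(group_b)
n1 = len(group_a)
n2 = len(group_b)

# Standard error built from each group's own variance (unpooled)
t_statistic = (mean_a - mean_b) / np.sqrt((std_a ** 2 / n1) + (std_b ** 2 / n2))
degrees_of_freedom = n1 + n2 - 2

# Two-tailed p-value from the survival function of the t-distribution
p_value = stats.t.sf(np.abs(t_statistic), degrees_of_freedom) * 2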

print("t-statistic: ", t_statistic)
print("p-value: ", p_value)

# Interpret the results
if p_value < 0.05:
    print("Reject the null hypothesis.")
    print("The means of the two samples are significantly different.")
else:
    print("Fail to reject the null hypothesis.")
    print("The means of the two samples are not significantly different.")


t-statistic:  7.034739149736234
p-value:  1.146104401286446e-08
Reject the null hypothesis.
The means of the two samples are significantly different.


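In this code, the t-statistic is calculated by subtracting the mean of Group B from the mean of Group A and dividing by the standard error. The standard error is calculated as the square root of the sum of the variances of each group divided by their respective sample size, so it uses each group's own variance rather than the pooled standard deviation from step 3. Because np.std defaults to the population standard deviation (ddof=0), these variances are slightly smaller than the sample variances (ddof=1). The hard-coded scores are illustrative and do not exactly reproduce the summary values stated in the example: their means are 78 and 72.8 rather than 80 and 75. The degrees of freedom are calculated by adding the sample sizes of the two groups and subtracting 2. The p-value is calculated using the survival function of the t-distribution (stats.t.sf) and is multiplied by 2 to obtain a two-tailed p-value.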
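The unpooled standard error described above has the same form as the one used in Welch's t-test. As a hedged cross-check, the sketch below recomputes the statistic with sample variances (ddof=1, which is an assumption that differs from the code above) and compares it with `stats.ttest_ind` called with `equal_var=False`. The pooled result is printed alongside for comparison.

```python
import numpy as np
from scipy import stats

group_a = [75, 80, 75, 80, 80, 75, 80, 75, 80, 80,
           75, 80, 75, 80, 80, 75, 80, 75, 80, 80]
group_b = [70, 75, 70, 75, 75, 70, 75, 70, 75, 75,
           70, 75, 70, 75, 75, 70, 75, 70, 75, 75,
           70, 75, 70, 75, 70]

n1, n2 = len(group_a), len(group_b)

# Sample variances (ddof=1), unlike np.std's default of ddof=0
var_a = np.var(group_a, ddof=1)
var_b = np.var(group_b, ddof=1)

# Unpooled standard error, as used in Welch's t-test
standard_error = np.sqrt(var_a / n1 + var_b / n2)
t_manual = (np.mean(group_a) - np.mean(group_b)) / standard_error

# SciPy equivalents: Welch (unequal variances) and pooled (equal variances)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)
pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

print(f"manual unpooled t: {t_manual:.4f}")
print(f"Welch t:           {welch.statistic:.4f}, p = {welch.pvalue:.3g}")
print(f"pooled t:          {pooled.statistic:.4f}, p = {pooled.pvalue:.3g}")
```

With these data the two variants give similar statistics because the group variances are close; they diverge more when the variances or sample sizes differ substantially.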



### Z-tests

A Z-test is a statistical test used to compare the mean of a sample to a known population mean.
It is appropriate when the population standard deviation is known and the sample size is large (n >= 30).

The Z-statistic measures the difference between the sample mean and the population mean in units of the standard error (the population standard deviation divided by the square root of the sample size),
while the p-value is the probability of observing a Z-statistic at least as extreme as the one calculated, assuming the null hypothesis of equal means is true.

A p-value of less than 0.05 is often considered significant and suggests that the sample mean is significantly different from the population mean.


Use Cases:

Proportion test: 
   
A Z-test can be used to compare the observed proportion of successes in a sample to an expected proportion in a population. For example, a survey is conducted to determine if the proportion of people who support a new policy is different from the national average of 50%. A sample of 100 people is taken and 55% of them support the policy. The Z-test can be used to determine if the difference between the observed proportion (0.55) and the expected proportion (0.5) is statistically significant.
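The survey example can be worked through as a one-sample proportion Z-test. This is a minimal sketch using the figures from the example (n = 100, observed 0.55, expected 0.5); the standard error is computed under the null hypothesis, and the variable names are illustrative.

```python
import numpy as np
from scipy import stats

n = 100
observed_proportion = 0.55
expected_proportion = 0.50

# Standard error of the proportion under the null hypothesis
standard_error = np.sqrt(expected_proportion * (1 - expected_proportion) / n)

z_statistic = (observed_proportion - expected_proportion) / standard_error

# Two-tailed p-value from the standard normal distribution
p_value = stats.norm.sf(abs(z_statistic)) * 2

print(f"z-statistic: {z_statistic:.3f}")
print(f"p-value: {p_value:.4f}")
```

Here the Z-statistic is 1, giving a two-tailed p-value of roughly 0.32, so a 55% result in a sample of 100 is not significantly different from 50% at the 0.05 level.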

Mean test: 

A Z-test can be used to test the difference between an observed sample mean and a population mean. For example, a manufacturer claims that the average weight of their products is 10 pounds. A sample of 25 products is taken and the average weight is found to be 9.8 pounds. The Z-test can be used to determine if the difference between the observed mean (9.8) and the claimed mean (10) is statistically significant.

Single population mean: 

A Z-test can be used to test if a sample mean is significantly different from a known population mean. For example, the average height of adult men in the US is known to be 69 inches. A sample of 50 men is taken and the average height is found to be 68 inches. The Z-test can be used to determine if the difference between the observed mean (68) and the population mean (69) is statistically significant.

Two population means: 

A Z-test can be used to test if the means of two independent samples are significantly different. For example, a researcher wants to compare the average height of men and women. A sample of 50 men and 50 women is taken and the average heights are found to be 68 inches and 65 inches respectively. The Z-test can be used to determine if the difference between the two means is statistically significant.
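A two-sample Z-test for this example might look like the sketch below. The example does not state the population standard deviations, so the values of `sigma_men` and `sigma_women` are hypothetical placeholders chosen only to make the code runnable.

```python
import numpy as np
from scipy import stats

# Sample means and sizes from the example
mean_men, n_men = 68, 50
mean_women, n_women = 65, 50

# ASSUMPTION: the population standard deviations are not given in the
# example; these values are hypothetical placeholders.
sigma_men = 3.0
sigma_women = 2.5

# Standard error of the difference between two independent means
standard_error = np.sqrt(sigma_men**2 / n_men + sigma_women**2 / n_women)

z_statistic = (mean_men - mean_women) / standard_error

# Two-tailed p-value from the standard normal distribution
p_value = stats.norm.sf(abs(z_statistic)) * 2

print(f"z-statistic: {z_statistic:.3f}")
print(f"p-value: {p_value:.3g}")
```

The conclusion depends on the assumed standard deviations, so the printed values should be read as an illustration of the calculation rather than a result about real heights.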


In [10]:
import numpy as np
from scipy import stats


sample = [7, 8, 9, 2, 4]
population_mean = 5
population_stddev = 1.5

z_statistic = (np.mean(sample) - population_mean) / (population_stddev / np.sqrt(len(sample)))
p_value = stats.norm.sf(abs(z_statistic)) * 2

print("z-statistic: ", z_statistic)
print("p-value: ", p_value)

# Interpret the results
if p_value < 0.05:
    print("Reject the null hypothesis.")
    print("The sample mean is significantly different from the population mean.")
else:
    print("Fail to reject the null hypothesis.")
    print("The sample mean is not significantly different from the population mean.")


z-statistic:  1.4907119849998598
p-value:  0.13603712811414362
Fail to reject the null hypothesis.
The sample mean is not significantly different from the population mean.
