In [26]:
import numpy as np
from scipy import stats

# Chi Square Test

Youtube Link: https://www.youtube.com/watch?v=HKDqlYSLt68

Chi-square (χ²) refers both to a probability distribution and to a family of hypothesis tests built on it. The tests apply to categorical data: they evaluate whether the observed frequencies differ significantly from the frequencies expected under a specific hypothesis, for example the hypothesis that two categorical variables in a dataset are independent.

There are two common types of chi-square tests:

1. **Chi-square goodness-of-fit test**: This test is used when you want to determine if an observed categorical distribution fits an expected theoretical distribution. It answers questions like, "Does the observed data fit the expected proportions?"

2. **Chi-square test of independence**: This test is used to assess if there is a significant association between two categorical variables. It helps answer questions like, "Are these two categorical variables independent of each other or related?"

Here's a brief overview of how each test works:

- **Chi-square goodness-of-fit test**: In this test, you have one categorical variable, and you compare the observed frequencies (counts) of each category to the expected frequencies that you would expect if a particular hypothesis were true. The chi-square statistic quantifies how much the observed frequencies deviate from the expected frequencies.

- **Chi-square test of independence**: In this test, you have two categorical variables, and you want to determine if they are related or independent of each other. You create a contingency table that shows the joint distribution of the two variables, and then you calculate the chi-square statistic to test whether the observed frequencies in the table differ significantly from what would be expected if the variables were independent.
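Both tests rely on the same statistic. As a compact reference, here it is in standard textbook notation, where $O_i$ and $E_i$ denote the observed and expected counts of category (or cell) $i$. These symbols are introduced here for illustration and are not used elsewhere in this notebook:

$$
\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}
$$

For a goodness-of-fit test with $k$ categories the degrees of freedom are $k - 1$. For a test of independence on an $r \times c$ contingency table the sum runs over all cells and the degrees of freedom are $(r - 1)(c - 1)$.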



## Chi Square Goodness of Fit

Youtube: https://www.youtube.com/watch?v=ZNXso_riZag&t=22s

The Chi-square goodness of fit test is a statistical test used to determine whether an observed frequency distribution of categorical data fits a specified theoretical distribution. In other words, it assesses whether the observed data is consistent with what we would expect under a certain hypothesis or theoretical model.

Here are the key components and steps involved in a Chi-square goodness of fit test:

Hypotheses:

Null Hypothesis (H0): The observed data follows a specified theoretical distribution. In other words, there is no significant difference between the observed and expected frequencies. <br/>
Alternative Hypothesis (H1): The observed data does not follow the specified theoretical distribution. There is a significant difference between the observed and expected frequencies.


**Problem:**
A university claims that the distribution of its undergraduate students by major follows a certain national average distribution. According to the national average, the distribution of undergraduate majors is as follows:
- 30% Business
- 25% Engineering
- 20% Social Sciences
- 15% Natural Sciences
- 10% Arts and Humanities

To investigate whether the university's student distribution matches the national average, a random sample of 400 undergraduate students is selected, and their majors are recorded. The results are as follows:

- Business: 120 students
- Engineering: 90 students
- Social Sciences: 70 students
- Natural Sciences: 65 students
- Arts and Humanities: 55 students

Perform a Chi-square goodness of fit test to determine if the university's distribution of majors is significantly different from the national average distribution at a 5% level of significance.

**Solution:**

To solve this problem, follow these steps:

1. Set up the null (H0) and alternative (H1) hypotheses:
   - H0 (Null Hypothesis): The distribution of the university's undergraduate students by major matches the national average distribution (30% Business, 25% Engineering, 20% Social Sciences, 15% Natural Sciences, 10% Arts and Humanities).
   - H1 (Alternative Hypothesis): The distribution of the university's undergraduate students by major does not match the national average distribution.

2. Calculate the expected number of students for each major based on the national average distribution and the total number of students (400 in this case).

3. Calculate the Chi-square statistic.

4. Determine the degrees of freedom (number of categories - 1) and find the critical value of Chi-square for a 5% level of significance.

5. Compare the calculated Chi-square statistic with the critical value and make a decision regarding the null hypothesis.

If the calculated Chi-square statistic exceeds the critical value, reject the null hypothesis: the distribution of the university's undergraduate students by major is significantly different from the national average distribution. Otherwise, fail to reject it. Equivalently, reject the null hypothesis when the p-value is below the significance level (0.05 here).

In [12]:
# Solve this Problem
# Observed frequencies
observed = np.array([120, 90, 70, 65, 55])

# Expected frequencies based on the national average distribution
expected = np.array([0.30, 0.25, 0.20, 0.15, 0.10]) * 400
print("Expected Table:", expected)

# Perform the Chi-square goodness of fit test
chi2, p = stats.chisquare(observed, f_exp=expected)

# Define the significance level
alpha = 0.05

# Calculate the critical value from the Chi-square distribution
dof = len(observed) - 1
critical_value_chi2 = stats.chi2.ppf(1 - alpha, dof)

# Print the Result
print("Chi-Square Statistic:", chi2)
print("Degrees of Freedom:", dof)
print("Critical Value:", critical_value_chi2)
print("P-Value:", p)

# Compare the chi-square statistic with the critical value
if chi2 > critical_value_chi2:
    print("Reject the null hypothesis. The distribution of majors is significantly different from the national average.")
else:
    print("Fail to reject the null hypothesis. The distribution of majors is not significantly different from the national average.")

Expected Table: [120. 100.  80.  60.  40.]
Chi-Square Statistic: 8.291666666666666
Degrees of Freedom: 4
Critical Value: 9.487729036781154
P-Value: 0.08145976939917328
Fail to reject the null hypothesis. The distribution of majors is not significantly different from the national average.
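To make the statistic less of a black box, the same result can be reproduced by hand. The following is a minimal sketch, assuming the observed counts and national proportions from the problem statement; the names `proportions`, `contributions`, `chi2_manual`, and `p_manual` are introduced here for illustration only.

```python
import numpy as np
from scipy import stats

# Observed counts and national proportions from the problem statement
observed = np.array([120, 90, 70, 65, 55])
proportions = np.array([0.30, 0.25, 0.20, 0.15, 0.10])
expected = proportions * observed.sum()

# Each category contributes (O - E)^2 / E to the chi-square statistic
contributions = (observed - expected) ** 2 / expected
chi2_manual = contributions.sum()

# Right-tail probability of the chi-square distribution with k - 1 degrees of freedom
dof = len(observed) - 1
p_manual = stats.chi2.sf(chi2_manual, dof)

majors = ["Business", "Engineering", "Social Sciences",
          "Natural Sciences", "Arts and Humanities"]
for major, contribution in zip(majors, contributions):
    print(f"{major}: {contribution:.3f}")
print("Chi-square (manual):", round(chi2_manual, 4))
print("P-value (manual):", round(p_manual, 4))
```

Listing the per-category contributions shows where the deviation comes from: Arts and Humanities accounts for 5.625 of the total, while Business contributes nothing because its observed count equals the expected count.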


## Chi Square Test of Independence

Youtube Link: https://www.youtube.com/watch?v=NTHA9Qa81R8

**Problem:**
A research study is conducted to investigate whether there is a relationship between gender (male or female) and the preference for two different types of sports: basketball and soccer. A random sample of 200 individuals is selected, and their preferences are recorded in the following contingency table:

```
                Basketball   Soccer
Male              50          30
Female            40          80
```

Perform a chi-square test of independence to determine if there is a significant association between gender and sports preference at a 5% level of significance.

**Solution:**

To solve this problem, we will follow these steps:

1. Set up the null (H0) and alternative (H1) hypotheses:
   - H0 (Null Hypothesis): Gender and sports preference are independent (i.e., there is no relationship between them).
   - H1 (Alternative Hypothesis): Gender and sports preference are dependent (i.e., there is a relationship between them).


In [7]:
# Solve this Problem
# Create the observed contingency table
observed = np.array([[50, 30],
                    [40, 80]])

# Perform the chi-square test of independence (stats.chi2_contingency)
# Note: for a 2x2 table (dof == 1) SciPy applies Yates' continuity correction by default
chi2, p, dof, expected = stats.chi2_contingency(observed)

# Define the significance level
alpha = 0.05

# Calculate the critical value from the chi-square distribution
# (the chi-square test is right-tailed, so the full alpha goes in the upper tail)
chi2_critical_value = stats.chi2.ppf(1 - alpha, dof)

# Print the result
print("Chi-Square value is:", chi2)
print("P-Value is:", p)
print("Critical value is:", chi2_critical_value)
print("Degrees of Freedom:", dof)

# Conclusion
if chi2 < chi2_critical_value:
    print("Fail to reject the Null Hypothesis. There is no significant association between gender and sports preference.")
else:
    print("Reject the Null Hypothesis. There is a significant association between gender and sports preference.")

Chi-Square value is: 15.34090909090909
P-Value is: 8.975175678418872e-05
Critical value is: 3.841458820694124
Degrees of Freedom: 1
Reject the Null Hypothesis. There is a significant association between gender and sports preference.
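The expected counts that `stats.chi2_contingency` returns come directly from the row and column totals. The sketch below recomputes them by hand and shows how Yates' continuity correction (SciPy's default for a 2x2 table) changes the statistic. The names `expected_manual`, `chi2_uncorrected`, and `chi2_yates` are illustrative, not part of the original notebook.

```python
import numpy as np
from scipy import stats

observed = np.array([[50, 30],
                     [40, 80]])

# Expected count per cell under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()
print("Expected counts:\n", expected_manual)

# Pearson chi-square without any continuity correction
chi2_uncorrected = ((observed - expected_manual) ** 2 / expected_manual).sum()

# Yates' correction shrinks each |O - E| by 0.5 before squaring
chi2_yates = ((np.abs(observed - expected_manual) - 0.5) ** 2 / expected_manual).sum()

print("Chi-square without correction:", round(chi2_uncorrected, 4))
print("Chi-square with Yates' correction:", round(chi2_yates, 4))
```

Passing `correction=False` to `stats.chi2_contingency` gives the uncorrected value. Here both versions lead to the same decision, but the correction can matter when the statistic is close to the critical value.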


# ANOVA

Youtube: https://www.youtube.com/watch?v=0NwA9xxxtHw&t=11s <br/>
ANOVA stands for "Analysis of Variance," and it is a statistical technique used to analyze the differences among group means in a sample. ANOVA is particularly useful when you want to compare the means of three or more groups to determine whether there are statistically significant differences between them. It helps you answer questions like:

Are there any significant differences between the means of multiple groups? <br/>
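Which group or groups are significantly different from the others? (ANOVA itself only signals that some difference exists; identifying the specific groups requires a follow-up post hoc test.) <br/>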
Is there more variation within groups or between groups? <br/>

## One Way ANOVA

Youtube: https://www.youtube.com/watch?v=9cnSWads6oo


One-way ANOVA, or one-way analysis of variance, is a statistical technique used to compare the means of three or more independent (unrelated) groups to determine whether there are statistically significant differences among them. It is a parametric test, which means it makes certain assumptions about the data, including the assumption of normality and homogeneity of variances. Here are the key components and steps involved in conducting a one-way ANOVA:

Groups or Categories: You have one independent variable (factor) that categorizes the data into three or more groups. These groups represent different levels, treatments, or categories of the factor. For example, if you are studying the effect of different teaching methods on student test scores, the groups could be different teaching methods (e.g., Method A, Method B, Method C).

Null Hypothesis (H0): The null hypothesis in a one-way ANOVA states that there are no statistically significant differences among the group means. In other words, all groups have the same population mean.

Alternative Hypothesis (Ha): The alternative hypothesis suggests that there is at least one group with a different population mean compared to the others.


### Problem Statement:

A researcher wants to determine whether three different fertilizers (Fertilizer A, Fertilizer B, and Fertilizer C) have a statistically significant effect on the growth of tomato plants. To test this, the researcher randomly selects 30 tomato plants and divides them into three groups, with each group receiving a different fertilizer treatment. After six weeks, the heights of the tomato plants are measured. The data is as follows (in centimeters):

**Fertilizer A:** 28, 29, 31, 30, 32, 30, 28, 31, 32, 29

**Fertilizer B:** 26, 25, 24, 27, 28, 26, 25, 27, 26, 28

**Fertilizer C:** 35, 34, 36, 34, 35, 36, 33, 35, 34, 36

Conduct a one-way ANOVA to determine if there are any statistically significant differences in tomato plant growth among the three fertilizer treatments. Use a significance level of 0.05.

**Solution:**

To solve this problem, you would follow these steps:

1. **Set up Hypotheses:**
   - Null Hypothesis (H0): There is no significant difference in tomato plant growth among the three fertilizer treatments (μ1 = μ2 = μ3).
   - Alternative Hypothesis (Ha): There is a significant difference in tomato plant growth among the three fertilizer treatments (at least one μi is different).

2. **Perform the ANOVA Test:** Using statistical software or a calculator, compute the one-way ANOVA. The software will provide you with the F-statistic and its associated p-value.

3. **Determine Significance:** Compare the calculated p-value to the chosen significance level (α = 0.05). If the p-value is less than 0.05, you would reject the null hypothesis.
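One-way ANOVA assumes roughly normal data and similar variances across groups, so a quick check before running the test can be useful. The following is a rough sketch, assuming Levene's test and the Shapiro-Wilk test are acceptable checks for this data; the notebook does not prescribe them, and the dictionary `samples` is a name introduced here.

```python
import numpy as np
from scipy import stats

# Plant heights (cm) from the problem statement
samples = {
    "Fertilizer A": np.array([28, 29, 31, 30, 32, 30, 28, 31, 32, 29]),
    "Fertilizer B": np.array([26, 25, 24, 27, 28, 26, 25, 27, 26, 28]),
    "Fertilizer C": np.array([35, 34, 36, 34, 35, 36, 33, 35, 34, 36]),
}

# Levene's test: H0 says all groups share the same variance
levene_stat, levene_p = stats.levene(*samples.values())
print(f"Levene: W = {levene_stat:.3f}, p = {levene_p:.3f}")

# Shapiro-Wilk test per group: H0 says the sample comes from a normal distribution
shapiro_p = {}
for name, values in samples.items():
    w_stat, p_val = stats.shapiro(values)
    shapiro_p[name] = p_val
    print(f"Shapiro-Wilk, {name}: W = {w_stat:.3f}, p = {p_val:.3f}")
```

A p-value above the significance level means the check found no evidence against the assumption. With only ten plants per group these tests have limited power, so they are best read as a sanity check rather than proof.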


In [4]:
# Solve this Problem
# Data for each fertilizer treatment
fertilizer_a = np.array([28, 29, 31, 30, 32, 30, 28, 31, 32, 29])
fertilizer_b = np.array([26, 25, 24, 27, 28, 26, 25, 27, 26, 28])
fertilizer_c = np.array([35, 34, 36, 34, 35, 36, 33, 35, 34, 36])

# Set the significance level
alpha = 0.05

# Perform One Way ANOVA
f_statistic, p_value = stats.f_oneway(fertilizer_a, fertilizer_b, fertilizer_c)
print("F Statistic:", f_statistic)
print("P-Value:", p_value)

# Interpret Result
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in tomato plant growth among the fertilizer treatments.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in tomato plant growth among the fertilizer treatments.")

F Statistic: 110.9469026548673
P-Value: 9.489004073242531e-14
Reject the null hypothesis: There is a significant difference in tomato plant growth among the fertilizer treatments.
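The F statistic is the ratio of between-group variance to within-group variance. As a hedged illustration, the sketch below rebuilds it from sums of squares for the fertilizer data; the variable names (`ss_between`, `ss_within`, `f_manual`, and so on) are introduced here and do not appear in the original cell.

```python
import numpy as np
from scipy import stats

groups = [
    np.array([28, 29, 31, 30, 32, 30, 28, 31, 32, 29]),  # Fertilizer A
    np.array([26, 25, 24, 27, 28, 26, 25, 27, 26, 28]),  # Fertilizer B
    np.array([35, 34, 36, 34, 35, 36, 33, 35, 34, 36]),  # Fertilizer C
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
p_manual = stats.f.sf(f_manual, df_between, df_within)

print("SS between:", round(ss_between, 4), "SS within:", round(ss_within, 4))
print("F (manual):", round(f_manual, 4))
print("p (manual):", p_manual)
```

The group means (30.0, 26.2, and 34.8 cm) are far apart relative to the small spread inside each group, which is why the F statistic is so large.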


## Two Way ANOVA

Youtube: https://www.youtube.com/watch?v=SnlOUfT55So <br/>
Youtube 2: https://www.youtube.com/watch?v=0K-bfzLTRiY

A two-way ANOVA is a statistical test that measures the effect of two nominal (categorical) predictor variables, called factors, on a continuous outcome variable. It tests the main effect of each factor and whether the two factors interact, that is, whether the effect of one factor depends on the level of the other.

In ANOVA terminology, random factors are considered to have no statistical influence on a data set, while systematic factors are considered to have statistical significance.

### Problem Statement:

A researcher is investigating the effects of two factors, "Type of Diet" and "Exercise Regimen," on weight loss in a weight management study. The study participants are divided into four groups based on the combination of diet type and exercise regimen. The researcher wants to determine whether both factors, individually and interactively, have a significant impact on weight loss.

The four groups are as follows:

Group A: High Protein Diet + High-Intensity Exercise <br/>
Group B: High Protein Diet + Low-Intensity Exercise <br/>
Group C: Low Protein Diet + High-Intensity Exercise <br/>
Group D: Low Protein Diet + Low-Intensity Exercise <br/>

In [24]:
# Solve this Problem
# Define the data for each group
group_a = np.array([10.5, 11.2, 9.8, 11.0, 10.7, 9.5, 10.9, 11.4, 11.8, 10.2, 9.7, 11.5, 10.3, 11.1, 10.8, 10.6, 11.3, 10.4, 11.7, 9.9, 11.6, 10.1, 11.9, 10.0, 10.0, 11.0, 10.5, 11.2, 10.6, 11.4])
group_b = np.array([8.5, 8.2, 8.9, 8.3, 8.6, 8.1, 8.8, 8.7, 8.4, 8.0, 8.5, 8.2, 8.7, 8.6, 8.3, 8.9, 8.4, 8.8, 8.5, 8.1, 8.6, 8.3, 8.2, 8.7, 8.0, 8.5, 8.4, 8.6, 8.8, 8.1])
group_c = np.array([9.0, 9.4, 9.2, 9.7, 9.5, 9.1, 9.6, 9.3, 9.8, 9.9, 9.0, 9.7, 9.2, 9.4, 9.3, 9.6, 9.8, 9.1, 9.5, 9.2, 9.7, 9.0, 9.4, 9.3, 9.6, 9.9, 9.1, 9.5, 9.3, 9.8])
group_d = np.array([7.5, 7.8, 7.3, 7.6, 7.9, 7.2, 7.5, 7.4, 7.7, 7.6, 7.8, 7.3, 7.5, 7.4, 7.9, 7.7, 7.6, 7.2, 7.5, 7.4, 7.8, 7.6, 7.9, 7.3, 7.5, 7.4, 7.7, 7.6, 7.8, 7.3])

# Combine data into a single array
all_data = np.concatenate([group_a, group_b, group_c, group_d])

# Create factors for Type of Diet and Exercise Regimen
diet_factors = np.array(["High Protein", "High Protein", "Low Protein", "Low Protein"])
exercise_factors = np.array(["High Intensity", "Low Intensity", "High Intensity", "Low Intensity"])

# Set the Level of significance
alpha = 0.05

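# Perform the ANOVA
# Note: stats.f_oneway is a one-way ANOVA. It treats the four diet/exercise combinations
# as levels of a single factor, so it tests whether any group mean differs but does not
# separate the diet effect, the exercise effect, or their interaction.
f_statistic, p_value = stats.f_oneway(group_a, group_b, group_c, group_d)

# Print results and interpret
print("ANOVA Results (four groups compared with f_oneway):")
print("F-statistic:", f_statistic)
print("P-value:", p_value)

# Interpret the Result
if p_value < alpha:
    print("Reject the null hypothesis: There is a significant difference in weight loss based on the Type of Diet or Exercise Regimen.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference in weight loss based on the Type of Diet or Exercise Regimen.")

ANOVA Results (four groups compared with f_oneway):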
F-statistic: 348.26798850063767
P-value: 7.889602812416034e-58
Reject the null hypothesis: There is a significant difference in weight loss based on the Type of Diet or Exercise Regimen.
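The `stats.f_oneway` call above compares the four groups as if they were levels of a single factor. To separate the diet effect, the exercise effect, and their interaction, the variation has to be split into three parts. The following is a minimal sketch of that decomposition, assuming a balanced design (30 observations in every cell, as in this data). The array `data` and the `ss_*`, `df_*`, and `results` names are introduced here for illustration; for unbalanced data these formulas do not apply directly, and a dedicated library such as statsmodels would be a better fit.

```python
import numpy as np
from scipy import stats

group_a = np.array([10.5, 11.2, 9.8, 11.0, 10.7, 9.5, 10.9, 11.4, 11.8, 10.2, 9.7, 11.5, 10.3, 11.1, 10.8, 10.6, 11.3, 10.4, 11.7, 9.9, 11.6, 10.1, 11.9, 10.0, 10.0, 11.0, 10.5, 11.2, 10.6, 11.4])
group_b = np.array([8.5, 8.2, 8.9, 8.3, 8.6, 8.1, 8.8, 8.7, 8.4, 8.0, 8.5, 8.2, 8.7, 8.6, 8.3, 8.9, 8.4, 8.8, 8.5, 8.1, 8.6, 8.3, 8.2, 8.7, 8.0, 8.5, 8.4, 8.6, 8.8, 8.1])
group_c = np.array([9.0, 9.4, 9.2, 9.7, 9.5, 9.1, 9.6, 9.3, 9.8, 9.9, 9.0, 9.7, 9.2, 9.4, 9.3, 9.6, 9.8, 9.1, 9.5, 9.2, 9.7, 9.0, 9.4, 9.3, 9.6, 9.9, 9.1, 9.5, 9.3, 9.8])
group_d = np.array([7.5, 7.8, 7.3, 7.6, 7.9, 7.2, 7.5, 7.4, 7.7, 7.6, 7.8, 7.3, 7.5, 7.4, 7.9, 7.7, 7.6, 7.2, 7.5, 7.4, 7.8, 7.6, 7.9, 7.3, 7.5, 7.4, 7.7, 7.6, 7.8, 7.3])

# Axis 0 = diet (high, low protein), axis 1 = exercise (high, low intensity), axis 2 = participants
data = np.array([[group_a, group_b],
                 [group_c, group_d]])
n_diet, n_exercise, n_rep = data.shape

grand_mean = data.mean()
cell_means = data.mean(axis=2)
diet_means = data.mean(axis=(1, 2))
exercise_means = data.mean(axis=(0, 2))

# Sums of squares for a balanced two-factor design
ss_diet = n_exercise * n_rep * ((diet_means - grand_mean) ** 2).sum()
ss_exercise = n_diet * n_rep * ((exercise_means - grand_mean) ** 2).sum()
ss_cells = n_rep * ((cell_means - grand_mean) ** 2).sum()
ss_interaction = ss_cells - ss_diet - ss_exercise
ss_error = ((data - cell_means[:, :, None]) ** 2).sum()

df_diet = n_diet - 1
df_exercise = n_exercise - 1
df_interaction = df_diet * df_exercise
df_error = n_diet * n_exercise * (n_rep - 1)
ms_error = ss_error / df_error

# One F test per source of variation, each against the error mean square
results = {}
for name, ss, df in [("Diet", ss_diet, df_diet),
                     ("Exercise", ss_exercise, df_exercise),
                     ("Diet x Exercise", ss_interaction, df_interaction)]:
    f_value = (ss / df) / ms_error
    results[name] = (f_value, stats.f.sf(f_value, df, df_error))
    print(f"{name}: F = {f_value:.2f}, p = {results[name][1]:.3g}")
```

Each line of output answers one of the researcher's questions: whether diet matters, whether exercise matters, and whether the effect of one depends on the other.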


# F Test

The F-test, also known as Fisher's F-test, is a statistical hypothesis test used to compare the variances or standard deviations of two or more groups or samples. It assesses whether the variances among the groups differ significantly. The F-test is a fundamental tool in statistics and is often used in analysis of variance (ANOVA) and regression analysis.

There are two main types of F-tests:

Two-Sample F-Test (F-Test for Variances): This type of F-test is used to compare the variances of two independent samples or populations. It helps determine whether the variability in one sample is significantly greater or smaller than the variability in another sample. The null hypothesis (H0) typically states that the variances are equal.<br/>
Example: Comparing the variances of test scores in two different schools to see if one school has significantly more variability in scores than the other.

ANOVA F-Test (Analysis of Variance F-Test): This type of F-test is used to compare the means of two or more groups or samples to determine if there are significant differences among them. It is commonly used to test whether the groups have similar population means. The null hypothesis (H0) typically states that all group means are equal. <br/>
Example: Comparing the mean test scores of students in multiple classes or comparing the effects of different treatments on a group of patients.
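For the two-sample case, the statistic is simply the ratio of the two sample variances. As a compact reference in standard notation (the symbols $s_1^2$, $s_2^2$, $n_1$, and $n_2$ are introduced here and denote the sample variances and sample sizes):

$$
F = \frac{s_1^2}{s_2^2}, \qquad df_1 = n_1 - 1, \quad df_2 = n_2 - 1
$$

Under the null hypothesis of equal population variances, and assuming both samples come from normal distributions, $F$ follows an F-distribution with $(df_1, df_2)$ degrees of freedom. Values far from 1 in either direction count as evidence against the null hypothesis.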

In [30]:
# Two Sample F Test
# Generate two sample data arrays (replace these with your actual data)
sample1 = np.array([15, 18, 22, 25, 30])
sample2 = np.array([12, 16, 20, 24, 28])

# Calculate the sample variances (ddof=1 gives the unbiased estimate)
variance1 = np.var(sample1, ddof=1)
variance2 = np.var(sample2, ddof=1)

# Calculate the F-statistic
F_statistic = variance1 / variance2

# Define the degrees of freedom for each sample (n - 1)
df1 = len(sample1) - 1
df2 = len(sample2) - 1

# Calculate the two-sided p-value, since H0 (equal variances) can be violated in either direction
cdf_value = stats.f.cdf(F_statistic, df1, df2)
p_value = 2 * min(cdf_value, 1 - cdf_value)

# Set the significance level
alpha = 0.05

# Compare the p-value to the significance level
if p_value < alpha:
    print('Reject the null hypothesis. The variances are significantly different.')
else:
    print('Fail to reject the null hypothesis. The variances are not significantly different.')

# Print the results (rounded to four decimal places)
print(f'F-statistic: {F_statistic:.4f}')
print(f'p-value: {p_value:.4f}')

Fail to reject the null hypothesis. The variances are not significantly different.
F-statistic: 0.8625
p-value: 0.8895


Youtube: https://www.youtube.com/watch?v=YrhlQB3mQFI&t=28s