## Example 1: 
The Department of Natural Resources (DNR) received a complaint from recreational fishermen that a community was releasing sewage into the river where they fished.
These types of releases lower the level of dissolved oxygen in the river and hence cause damage to the fish residing in the river. An inspector from the DNR designs a study to investigate the fishermen’s claim.

Fifteen water samples are taken at locations on the river upstream from the community, and fifteen are taken downstream from it. The dissolved oxygen readings, in parts per million
(ppm), are listed below:

$$
\text{Upstream} = [5.2,4.8,5.1,5.0,4.9,4.8,5.0,4.7,4.7,5.0,4.6,5.2,5.0,4.9,4.7]
$$

$$
\text{Downstream} = [3.2,3.4,3.7,3.9,3.6,3.8,3.9,3.6,4.1,3.3,4.5,3.7,3.9,3.8,3.7]
$$

1. For the discharge to affect fish health, the dissolved oxygen must drop by at least 0.5 ppm. Do the data provide sufficient evidence that the reduction in mean dissolved oxygen between the upstream and downstream water is large enough to affect the health of the fish?
2. Do the required conditions to use the test in part (1) appear to be valid?
3. What is the level of significance of the test in part (1)?
4. Estimate the size of the difference in the mean dissolved oxygen readings for the two locations on the river using a 99\% confidence interval.
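
Part (1) can be framed as a one-sided test with null value $D_0 = 0.5$. The formulas below are a sketch that assumes a pooled-variance two-sample t-test (equal population variances). The symbols $\mu_U$, $\mu_D$, $D_0$, and $s_p$ are notation introduced here for illustration and do not come from the problem statement.

$$
H_0: \mu_U - \mu_D \le D_0 \quad \text{vs.} \quad H_a: \mu_U - \mu_D > D_0, \qquad D_0 = 0.5
$$

$$
t = \frac{\bar{y}_U - \bar{y}_D - D_0}{s_p \sqrt{\frac{1}{n_U} + \frac{1}{n_D}}}, \qquad s_p^2 = \frac{(n_U - 1)s_U^2 + (n_D - 1)s_D^2}{n_U + n_D - 2}
$$

Under these assumptions, $H_0$ is rejected at level $\alpha$ when $t > t_{\alpha,\, n_U + n_D - 2}$.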

In [1]:
import numpy as np
from scipy import stats

# Dissolved oxygen readings (ppm)
upstream = np.array([5.2, 4.8, 5.1, 5.0, 4.9, 4.8, 5.0, 4.7, 4.7, 5.0, 4.6, 5.2, 5.0, 4.9, 4.7])
downstream = np.array([3.2, 3.4, 3.7, 3.9, 3.6, 3.8, 3.9, 3.6, 4.1, 3.3, 4.5, 3.7, 3.9, 3.8, 3.7])

mean_upstream = np.mean(upstream)
mean_downstream = np.mean(downstream)

# Sample standard deviations (ddof=1)
std_upstream = np.std(upstream, ddof=1)
std_downstream = np.std(downstream, ddof=1)

# Standard error of the difference in means. Because both samples have the
# same size, this equals the pooled-variance standard error.
std_error = np.sqrt((std_upstream**2 / len(upstream)) + (std_downstream**2 / len(downstream)))

# Part 1: one-sided test of H0: mu_up - mu_down <= 0.5 vs Ha: mu_up - mu_down > 0.5
alpha = 0.01
t_stats = (mean_upstream - mean_downstream - 0.5) / std_error
df = len(upstream) + len(downstream) - 2
t_crit = stats.t.ppf(1 - alpha, df)
is_significant = t_stats > t_crit

# Part 4: a two-sided 99% interval needs the 0.995 quantile, not the
# one-sided 0.99 quantile used for the test.
t_crit_ci = stats.t.ppf(1 - 0.01 / 2, df)
margin_of_error = t_crit_ci * std_error
confidence_interval_lower = mean_upstream - mean_downstream - margin_of_error
confidence_interval_upper = mean_upstream - mean_downstream + margin_of_error

print("1. Is there a significant reduction in dissolved oxygen greater than 0.5 ppm?")
print(f"   - T-statistic: {t_stats:.3f}")
print(f"   - Critical t-value at alpha={alpha}: {t_crit:.3f}")
print("   - Is the reduction significant?", is_significant)

print("\n2. Validity of Test Conditions:")
print("   - Normality: assumed, not verified here (n = 15 per group).")
print("   - Independence and equal variances: assumed from the study design, not verified here.")

print("\n3. Level of Significance:")
print(f"   - Alpha (Level of Significance): {alpha}")

print("\n4. 99% Confidence Interval for the Difference in Means:")
print(f"   - Confidence Interval (Lower): {confidence_interval_lower:.3f}")
print(f"   - Confidence Interval (Upper): {confidence_interval_upper:.3f}")

1. Is there a significant reduction in dissolved oxygen greater than 0.5 ppm?
   - T-statistic: 6.963
   - Critical t-value at alpha=0.01: 2.467
   - Is the reduction significant? True

2. Validity of Test Conditions:
   - Normality: assumed, not verified here (n = 15 per group).
   - Independence and equal variances: assumed from the study design, not verified here.

3. Level of Significance:
   - Alpha (Level of Significance): 0.01

4. 99% Confidence Interval for the Difference in Means:
   - Confidence Interval (Lower): 0.902
   - Confidence Interval (Upper): 1.431
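
As a cross-check on the hand computation above, the same test can be run with `scipy.stats.ttest_ind`, which also reports a p-value that may be useful for part (3). This is a minimal sketch and not part of the original solution: shifting the upstream sample by 0.5 to encode the null value, and using the Shapiro-Wilk and Levene tests to probe the conditions in part (2), are choices made here for illustration.

```python
import numpy as np
from scipy import stats

upstream = np.array([5.2, 4.8, 5.1, 5.0, 4.9, 4.8, 5.0, 4.7, 4.7, 5.0, 4.6, 5.2, 5.0, 4.9, 4.7])
downstream = np.array([3.2, 3.4, 3.7, 3.9, 3.6, 3.8, 3.9, 3.6, 4.1, 3.3, 4.5, 3.7, 3.9, 3.8, 3.7])

# H0: mu_up - mu_down <= 0.5  vs  Ha: mu_up - mu_down > 0.5.
# Subtracting 0.5 from the upstream readings moves the null value to zero,
# so a standard pooled two-sample t-test applies (assumes equal variances).
t_stat, p_value = stats.ttest_ind(
    upstream - 0.5, downstream, equal_var=True, alternative="greater"
)
print(f"t = {t_stat:.3f}, one-sided p-value = {p_value:.2e}")

# Illustrative assumption checks (not from the original solution):
# Shapiro-Wilk for normality of each sample, Levene for equal variances.
_, p_norm_up = stats.shapiro(upstream)
_, p_norm_down = stats.shapiro(downstream)
_, p_equal_var = stats.levene(upstream, downstream)
print(f"Shapiro-Wilk p-values: upstream = {p_norm_up:.3f}, downstream = {p_norm_down:.3f}")
print(f"Levene p-value: {p_equal_var:.3f}")
```

Small p-values from the Shapiro-Wilk or Levene tests would suggest that the normality or equal-variance condition is questionable. With only 15 readings per group these checks have limited power, so plots of the data remain worth a look.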


## Example 2:
Eight pairs of twins were randomly sampled, and within each pair one twin was randomly assigned to treatment A and the other to treatment B. The responses, listed pair by pair, are:

$$
\text{Treatment A} = [48.3,44.6,49.7,40.5,54.3,55.6,45.8,35.4]
$$

$$
\text{Treatment B} = [43.5,43.8,53.7,43.9,54.4,54.7,45.2,34.4]
$$

1. Is there significant evidence that the two treatments differ?
2. Place a 95% confidence interval on the mean difference between the responses from the two treatments.

In [None]:
# TODO
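
One possible approach, given that the observations come in twin pairs, is a paired t-test on the within-pair differences. The sketch below is one way to attempt the exercise, not an official solution. It assumes the values are listed in the same pair order for both treatments and that the differences are roughly normal; the variable names are chosen here for illustration.

```python
import numpy as np
from scipy import stats

treatment_a = np.array([48.3, 44.6, 49.7, 40.5, 54.3, 55.6, 45.8, 35.4])
treatment_b = np.array([43.5, 43.8, 53.7, 43.9, 54.4, 54.7, 45.2, 34.4])

# Part 1: paired t-test of H0: mean difference = 0 (two-sided alternative)
t_stat, p_value = stats.ttest_rel(treatment_a, treatment_b)

# Part 2: 95% confidence interval for the mean within-pair difference (A - B)
diffs = treatment_a - treatment_b
n = len(diffs)
mean_diff = diffs.mean()
std_error = diffs.std(ddof=1) / np.sqrt(n)
t_crit = stats.t.ppf(1 - 0.05 / 2, n - 1)
ci_lower = mean_diff - t_crit * std_error
ci_upper = mean_diff + t_crit * std_error

print(f"t = {t_stat:.3f}, two-sided p-value = {p_value:.3f}")
print(f"Mean difference (A - B): {mean_diff:.3f}")
print(f"95% CI: ({ci_lower:.3f}, {ci_upper:.3f})")
```

If the interval contains zero, the data do not show a significant difference between the treatments at the 5% level, which agrees with a p-value above 0.05.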

## Example 3:
Suppose you have a class of students, and you want to test if there is a significant association between their gender and their favorite subject (Math, Science, or English). You have collected the following data:

```
import pandas as pd

# Create a pandas DataFrame with the data
data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Favorite_Subject': ['Math', 'Science', 'Math', 'English', 'Science', 'Math']
})
```

Perform a chi-square test of independence to see whether the favorite subject is independent of gender.

In [None]:
## TODO
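
One way to approach this exercise is to cross-tabulate the two columns and pass the resulting contingency table to `scipy.stats.chi2_contingency`. The sketch below is an illustration under that assumption, not an official solution; the variable names are chosen here.

```python
import pandas as pd
from scipy import stats

data = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Favorite_Subject': ['Math', 'Science', 'Math', 'English', 'Science', 'Math']
})

# Contingency table of observed counts: rows = gender, columns = subject
table = pd.crosstab(data['Gender'], data['Favorite_Subject'])

# Chi-square test of independence (H0: subject preference is independent of gender)
chi2, p_value, dof, expected = stats.chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
print("Expected counts under independence:")
print(expected)
```

With only six students, every expected count is below 5, so the chi-square approximation is unreliable here. The result is best read as a demonstration of the mechanics rather than as real evidence about the association.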

## Example 4:
Suppose you have three groups of professionals: Data Scientists, Software Engineers, and Data Engineers. For each group, we record each individual's weekly salary (in thousands of dollars) as follows:

$$
\text{Data Scientists} = [7,3,6,6]
$$

$$
\text{Software Engineers} = [6,5,5,8]
$$

$$
\text{Data Engineers} = [4,7,6,7]
$$

Is there evidence that the average salary differs for at least one occupation?

In [8]:
##TODO 
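
Comparing the means of three independent groups is commonly done with a one-way ANOVA. The sketch below uses `scipy.stats.f_oneway` as one possible way to attempt the exercise, not an official solution. It assumes the salaries in each group are independent, roughly normal, and similar in variance; the variable names are chosen here for illustration.

```python
from scipy import stats

data_scientists = [7, 3, 6, 6]
software_engineers = [6, 5, 5, 8]
data_engineers = [4, 7, 6, 7]

# One-way ANOVA: H0 says all three group means are equal
f_stat, p_value = stats.f_oneway(data_scientists, software_engineers, data_engineers)

print(f"F = {f_stat:.3f}, p-value = {p_value:.3f}")
print("Reject H0 at alpha = 0.05?", p_value < 0.05)
```

A p-value above 0.05 means the data do not provide evidence, at the 5% level, that any occupation's mean salary differs from the others.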