This code is for the EN5423 class at GIST, Republic of Korea, and was created by Dr. Hyunglok Kim.  
**Contact information**: hyunglokkim@gist.ac.kr  
**License**: This work is licensed for non-commercial use only.  
**Restrictions**: Do not use this material without permission for teaching or developing other classes.

In [1]:
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

In [3]:
# Set the random seed to ensure reproducibility
np.random.seed(100)

# Generate random normal variables
x = np.random.normal(loc=40, scale=5, size=2)
y = np.random.normal(loc=50, scale=5, size=5)

# Perform the one-sided Mann-Whitney U test (equivalent to the Wilcoxon rank-sum test).
# Note: in scipy's mannwhitneyu function, alternative='less' specifies a one-sided
# test whose alternative hypothesis is that x tends to take smaller values than y
# (the distribution underlying x is stochastically less than that underlying y).
u_statistic, p_value = mannwhitneyu(x, y, alternative='less')

print(f"U-statistic: {u_statistic}, P-value: {p_value}")

U-statistic: 0.0, P-value: 0.047619047619047616


In [4]:
u_statistic, p_value = mannwhitneyu(x, y, alternative='two-sided')

print(f"U-statistic: {u_statistic}, P-value: {p_value}")

U-statistic: 0.0, P-value: 0.09523809523809523
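
The two p-values above can be checked by hand. With 2 and 5 tie-free observations there are C(7, 2) = 21 equally likely ways to assign ranks to `x` under the null hypothesis, and U = 0 corresponds to exactly one of them (`x` holding the two smallest ranks). The sketch below uses made-up values `x_demo` and `y_demo` rather than the simulated `x` and `y`, and assumes a SciPy version that accepts the `method` keyword (1.7 or later).

```python
from math import comb

from scipy.stats import mannwhitneyu

# Hypothetical tie-free samples in which every x value lies below every y value,
# mirroring the U = 0 situation above (these are not the simulated x and y).
x_demo = [1.0, 2.0]
y_demo = [3.0, 4.0, 5.0, 6.0, 7.0]

# Under H0, each choice of 2 ranks out of 7 for x is equally likely.
n_arrangements = comb(len(x_demo) + len(y_demo), len(x_demo))

# U = 0 occurs only when x holds the two smallest ranks: one arrangement.
p_one_sided = 1 / n_arrangements
# The two-sided p-value doubles the smaller tail (capped at 1).
p_two_sided = min(1.0, 2 * p_one_sided)

res_less = mannwhitneyu(x_demo, y_demo, alternative="less", method="exact")
res_two = mannwhitneyu(x_demo, y_demo, alternative="two-sided", method="exact")

print(f"By hand: one-sided = {p_one_sided:.6f}, two-sided = {p_two_sided:.6f}")
print(f"SciPy:   one-sided = {res_less.pvalue:.6f}, two-sided = {res_two.pvalue:.6f}")
```

With samples this small, 1/21 is the smallest one-sided p-value the test can produce, so the two-sided test cannot reach significance at the 0.05 level no matter how well separated the groups are.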


# HW05 #1 Hypothesis Testing with Environmental Data

#### Background:

Suppose you are an environmental scientist analyzing the impact of two different waste management practices on the concentration of a specific contaminant in groundwater. Practice A is an older, traditional waste management method, whereas Practice B incorporates newer, potentially more environmentally friendly technology.
This exercise is designed to enhance your understanding of how to apply statistical hypothesis testing to environmental data using Python. It encompasses data simulation, visualization, normality testing, and selecting the appropriate hypothesis test based on the characteristics of the data.


#### Objective:

Your goal is to determine if there is a statistically significant difference in contaminant concentration levels in groundwater between the two waste management practices.

First, load the CSV file, which contains two columns: Contaminant_A and Contaminant_B.

In [10]:
import pandas as pd
# Load the CSV file
file_path = "contaminant_levels.csv"  # Students should update the path accordingly
data = pd.read_csv(file_path)

# Display the first few rows of the dataframe
print(data.head())

   Contaminant_A  Contaminant_B
0      33.006818       0.815941
1      17.491861     110.449275
2      38.385836       2.645885
3      92.114269       0.327777
4      15.892489      14.084710
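
Before moving on to the tasks, it can help to confirm that the columns are numeric and free of missing or negative values. The sketch below is only an illustration: it reads an inline copy of the five rows printed above instead of the real `contaminant_levels.csv`, and the variable names `csv_text` and `preview` are placeholders.

```python
import io

import pandas as pd

# Inline copy of the five rows printed above, standing in for contaminant_levels.csv.
csv_text = """Contaminant_A,Contaminant_B
33.006818,0.815941
17.491861,110.449275
38.385836,2.645885
92.114269,0.327777
15.892489,14.084710
"""
preview = pd.read_csv(io.StringIO(csv_text))

# Confirm the expected columns are present and numeric.
expected_columns = ["Contaminant_A", "Contaminant_B"]
assert list(preview.columns) == expected_columns
print(preview.dtypes)

# Count missing values per column; these need handling before any testing.
print(preview.isna().sum())

# Concentrations should not be negative.
print((preview < 0).any())
```

The same three checks can be run on the full `data` frame once the file is loaded.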


## Tasks

### Task 1: Data Visualization

Visualize the distribution of contaminant concentrations for both practices using histograms or box plots. Utilize libraries such as Matplotlib or Seaborn for visualization.
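
One possible layout is sketched below: overlaid histograms on shared bins next to side-by-side box plots. The arrays `practice_a` and `practice_b` are synthetic right-skewed samples generated only for illustration (their parameters are assumptions, not estimates from the data); replace them with `data['Contaminant_A']` and `data['Contaminant_B']`.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic right-skewed stand-ins for the two columns (assumed, not the real data).
rng = np.random.default_rng(0)
practice_a = rng.lognormal(mean=3.0, sigma=0.8, size=50)
practice_b = rng.lognormal(mean=1.5, sigma=1.5, size=50)

fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Overlaid histograms on shared bins so the two shapes are directly comparable.
bins = np.histogram_bin_edges(np.concatenate([practice_a, practice_b]), bins=20)
ax_hist.hist(practice_a, bins=bins, alpha=0.6, label="Practice A")
ax_hist.hist(practice_b, bins=bins, alpha=0.6, label="Practice B")
ax_hist.set_xlabel("Contaminant concentration")
ax_hist.set_ylabel("Count")
ax_hist.legend()

# Side-by-side box plots highlight medians, spread, and outliers.
ax_box.boxplot([practice_a, practice_b])
ax_box.set_xticks([1, 2])
ax_box.set_xticklabels(["Practice A", "Practice B"])
ax_box.set_ylabel("Contaminant concentration")

fig.tight_layout()
```

In a notebook the figure renders at the end of the cell; in a plain script, add `plt.show()`.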

```python
# Example Python code snippet for visualization
import matplotlib.pyplot as plt
import seaborn as sns
```

In [20]:
# Visualization code here


### Task 2: Normality Test

Before applying a parametric hypothesis test, assess if the contaminant concentration levels follow a normal distribution for both practices. Employ the Shapiro-Wilk test for normality.

In [16]:
from scipy.stats import shapiro


# Perform Shapiro-Wilk test
# shapiro_test_statistic, p_value = shapiro(data['Contaminant_A'])
# Interpret the results


Contaminant_A: Statistics=0.7232310175895691, p=3.460963853285648e-06
Contaminant_B: Statistics=0.4158380627632141, p=1.1465015742340157e-10
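
Both p-values above are far below 0.05, so the Shapiro-Wilk test rejects normality for both columns. The sketch below shows one way to compute and read such results; `check_normality` is a hypothetical helper, and the two samples are synthetic (a log-normal and a normal draw) rather than the homework data.

```python
import numpy as np
from scipy.stats import shapiro

ALPHA = 0.05


def check_normality(sample, label, alpha=ALPHA):
    """Run the Shapiro-Wilk test and report whether normality is rejected."""
    statistic, p_value = shapiro(sample)
    rejected = bool(p_value < alpha)
    verdict = "reject normality" if rejected else "no evidence against normality"
    print(f"{label}: W={statistic:.4f}, p={p_value:.3g} -> {verdict}")
    return rejected


# Synthetic samples for illustration only (parameters are assumptions).
rng = np.random.default_rng(1)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=50)
symmetric = rng.normal(loc=40.0, scale=5.0, size=50)

skewed_rejected = check_normality(skewed, "Skewed sample")
symmetric_rejected = check_normality(symmetric, "Normal sample")
```

A large p-value does not prove normality; it only means the test found no evidence against it at the chosen significance level.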


### Task 3: Hypothesis Testing

Determine if there is a statistically significant difference in the mean contaminant concentration levels between the two waste management practices.

- If both samples are normally distributed, use a t-test: Student's t-test (pooled variance) if the variances are equal; otherwise, Welch's t-test.
- If the normality assumption is violated, opt for a non-parametric test such as the Mann-Whitney U test.
- Set your significance level at 0.05.
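
The decision rules above can be sketched as a single function. This is one possible arrangement, not a prescribed solution: `compare_groups` is a hypothetical helper, Levene's test is an assumed choice for checking equal variances (the task does not name one), and the demo samples are synthetic.

```python
import numpy as np
from scipy.stats import levene, mannwhitneyu, shapiro, ttest_ind


def compare_groups(a, b, alpha=0.05):
    """Pick a two-sample test from the normality and variance checks, then run it."""
    # Normality is retained only if Shapiro-Wilk fails to reject for both samples.
    both_normal = shapiro(a).pvalue >= alpha and shapiro(b).pvalue >= alpha
    if not both_normal:
        result = mannwhitneyu(a, b, alternative="two-sided")
        return "Mann-Whitney U", result.statistic, result.pvalue

    # Assumed variance check: Levene's test (not specified in the task).
    equal_var = bool(levene(a, b).pvalue >= alpha)
    result = ttest_ind(a, b, equal_var=equal_var)
    name = "Student's t-test" if equal_var else "Welch's t-test"
    return name, result.statistic, result.pvalue


# Synthetic skewed samples for illustration only (parameters are assumptions).
rng = np.random.default_rng(2)
group_a = rng.lognormal(mean=3.0, sigma=1.0, size=50)
group_b = rng.lognormal(mean=1.5, sigma=1.0, size=50)

test_name, statistic, p_value = compare_groups(group_a, group_b)
print(f"{test_name}: statistic={statistic:.3f}, p={p_value:.3g}")
```

Passing `equal_var=False` to `ttest_ind` is what switches it from the pooled-variance t-test to Welch's t-test.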

In [19]:
from scipy.stats import ttest_ind, mannwhitneyu



T-test: Statistics=3.020929996361693, p=0.003638331782043606
Mann-Whitney U test: Statistics=885.0, p=1.1188853840957398e-06


### Task 4: Interpretation

Based on the p-value obtained from your hypothesis test, provide an interpretation of whether there is a statistically significant difference in groundwater contaminant concentrations between the two waste management practices. 


### Deliverables

Submit a Jupyter notebook containing:

- The code for data simulation, visualization, normality testing, and hypothesis testing.
- Comments explaining your analysis steps and the interpretation of the test results.