## Description
The dataset (downloaded from the UCI Machine Learning Repository) comes from a wastewater treatment plant that uses an activated sludge process to remove organic matter and suspended solids from municipal wastewater.

In this process (Figure A7), suspended solids are first settled physically (primary treatment), and the wastewater is then treated biologically to oxidize the biodegradable organic matter (secondary treatment).

Data from on-line sensors at different stages of the process are provided for 38 variables over 527 days of operation. Seven of the 38 variables characterize the effluent water quality.

<div style="text-align:center; margin-top:2rem;">

![water treatment](water-treatment.png)

</div>


## Sensor Data:

- **Influent:**

    1. DATE        (date)

    2. Q-E         (input flow to plant)

    3. ZN-E        (input Zinc to plant)

    4. PH-E        (input pH to plant)

    5. DBO-E       (input Biological demand of oxygen to plant)

    6. DQO-E       (input chemical demand of oxygen to plant)

    7. SS-E        (input suspended solids to plant)

    8. SSV-E       (input volatile suspended solids to plant)

    9. SED-E       (input sediments to plant)

    10. COND-E     (input conductivity to plant)

- **Input to *Primary* Settler**

    11. PH-P       (input pH to primary settler)

    12. DBO-P      (input Biological demand of oxygen to primary settler)

    13. SS-P       (input suspended solids to primary settler)

    14. SSV-P      (input volatile suspended solids to primary settler)

    15. SED-P      (input sediments to primary settler)

    16. COND-P     (input conductivity to primary settler)

- **Input to *Secondary* Settler**

    17. PH-D       (input pH to secondary settler)

    18. DBO-D      (input Biological demand of oxygen to secondary settler)

    19. DQO-D      (input chemical demand of oxygen to secondary settler)

    20. SS-D       (input suspended solids to secondary settler)

    21. SSV-D      (input volatile suspended solids to secondary settler)

    22. SED-D      (input sediments to secondary settler)

    23. COND-D     (input conductivity to secondary settler)

- **Output from *Secondary* Settler (Effluent)**

    24. PH-S       (output pH)

    25. DBO-S      (output Biological demand of oxygen)

    26. DQO-S      (output chemical demand of oxygen)

    27. SS-S       (output suspended solids)

    28. SSV-S      (output volatile suspended solids)

    29. SED-S      (output sediments)

    30. COND-S     (output conductivity)

- **Performance Indicators**

    31. RD-DBO-P   (performance input Biological demand of oxygen in primary settler)

    32. RD-SS-P    (performance input suspended solids to primary settler)

    33. RD-SED-P   (performance input sediments to primary settler)

    34. RD-DBO-S   (performance input Biological demand of oxygen to secondary settler)

    35. RD-DQO-S   (performance input chemical demand of oxygen to secondary settler)

    36. RD-DBO-G   (global performance input Biological demand of oxygen)

    37. RD-DQO-G   (global performance input chemical demand of oxygen)

    38. RD-SS-G    (global performance input suspended solids)

    39. RD-SED-G   (global performance input sediments)


---

## **EDA for Wastewater Treatment Process**

### **Overview**
Exploratory Data Analysis (EDA) is a critical step in understanding the structure and characteristics of the dataset. For the wastewater treatment process, EDA helps identify trends, variability, and relationships between input, intermediate, and output parameters. It also highlights data quality issues such as missing values and outliers, which could affect subsequent analyses.
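
As a quick illustration of the data quality check mentioned above, the sketch below counts missing entries per column. It runs on a tiny made-up sample, not the plant data, and it assumes that gaps are encoded as `?` in the raw file; the column names `Q_E` and `PH_S` follow the naming used in this section and may differ in your copy of the data.

```python
import io
import pandas as pd

# Tiny made-up sample, NOT the real plant data. Treating "?" as the
# missing-value marker is an assumption about how the raw file encodes gaps.
raw = io.StringIO(
    "DATE,Q_E,PH_S\n"
    "day-1,40000,7.6\n"
    "day-2,?,7.5\n"
    "day-3,32000,?\n"
    "day-4,35000,7.7\n"
)

# na_values tells pandas to convert the "?" marker into NaN while parsing.
sample = pd.read_csv(raw, na_values="?")

# Number of missing entries in each column.
missing_per_column = sample.isna().sum()

# Fraction of rows that have at least one missing entry.
incomplete_rows = sample.isna().any(axis=1).mean()

print(missing_per_column)
print(f"Share of rows with at least one gap: {incomplete_rows:.0%}")
```

If the real file uses a different marker (or none), the same `isna()` calls still apply once the data are loaded as numeric columns.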

---

### **Key Components of EDA and Their Purposes**

#### **1. Descriptive Statistics**
   - **Purpose**: Summarize the central tendency, spread, and range of key variables to understand the data's overall characteristics.
   - **Examples**:
     - Calculate the mean and standard deviation of **Input Biological Oxygen Demand (BOD\_E, listed above as DBO-E)** to understand typical values and variability.
     - Compare the median and mean for **Intermediate Suspended Solids (SS\_P)** to check for skewness.
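
The two examples above can be sketched as follows. The data here are synthetic stand-ins generated with NumPy, and the column names `BOD_E` and `SS_P` are assumed to match the names used in this section; the distributions and their parameters are chosen only to make the skewness check visible, not to mimic the plant.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): one roughly symmetric column and
# one right-skewed column, so the mean/median comparison has something to show.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "BOD_E": rng.normal(loc=190, scale=60, size=500),
    "SS_P": rng.lognormal(mean=5.3, sigma=0.5, size=500),
})

# One table with the central tendency, spread, and range of each column.
summary = demo.agg(["mean", "median", "std", "min", "max"])
print(summary.round(1))

# A mean clearly above the median hints at a right-skewed distribution.
mean_minus_median = summary.loc["mean"] - summary.loc["median"]
print(mean_minus_median.round(1))

# Sample skewness gives the same hint as a single number per column.
print(demo.skew().round(2))
```

On real data, `df[["BOD_E", "SS_P"]].describe()` gives a similar overview in one call, with quartiles included.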

---

#### **2. Outlier Detection**
   - **Purpose**: Identify and handle extreme values that could distort analysis or indicate anomalies in the treatment process.
   - **Examples**:
     - Use the **Interquartile Range (IQR) method** to detect outliers in **Input Flow (Q\_E)**.
     - Visualize outliers in **Output pH (PH\_S)** using box plots.
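
A minimal sketch of the IQR method follows, using a handful of hypothetical flow readings in which two values are deliberately extreme. The numbers are invented for illustration, and the 1.5 multiplier is the conventional choice rather than something dictated by the dataset.

```python
import pandas as pd

# Hypothetical flow readings; the last two values are deliberately extreme.
q_e = pd.Series(
    [35000, 36500, 37200, 38100, 39000, 40200, 41000, 60000, 10000],
    name="Q_E",
)

# Quartiles and interquartile range.
q1 = q_e.quantile(0.25)
q3 = q_e.quantile(0.75)
iqr = q3 - q1

# Conventional fences: 1.5 * IQR beyond each quartile.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask marking readings outside the fences.
is_outlier = (q_e < lower) | (q_e > upper)
outliers = q_e[is_outlier]
cleaned = q_e[~is_outlier]

print(f"Fences: [{lower}, {upper}]")
print(f"Outliers found: {len(outliers)}")
```

A box plot of `cleaned` (for example with seaborn's `boxplot`) is a quick visual check of what was removed. Whether flagged points should be dropped or investigated is a judgment call, since an extreme reading may reflect a real process upset rather than a sensor fault.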

---

#### **3. Correlation Statistics**
   - **Purpose**: Quantify the strength and direction of relationships between input, intermediate, and output variables.
   - **Examples**:
     - Compute the correlation between **Input Suspended Solids (SS\_E)** and **Output Suspended Solids (SS\_S)** to assess how strongly effluent solids track the influent load; a weak correlation suggests that removal is robust to changes in load.
     - Explore the relationship between **Intermediate pH (PH\_P)** and **Output pH (PH\_S)** to evaluate the system’s consistency.
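
The correlation examples above can be sketched with pandas as shown below. The data are synthetic, and the linear relationship between `SS_E` and `SS_S` is an assumption built into the toy data purely so that there is a correlation to measure; it says nothing about the real plant.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic influent solids.
ss_e = rng.normal(230, 50, size=300)

# Assumed toy relationship: effluent solids weakly track influent solids,
# plus independent noise.
ss_s = 0.05 * ss_e + rng.normal(10, 4, size=300)

demo = pd.DataFrame({"SS_E": ss_e, "SS_S": ss_s})

# Pearson correlation for one pair (Series.corr uses Pearson by default).
r = demo["SS_E"].corr(demo["SS_S"])

# Full correlation matrix for all numeric columns.
corr_matrix = demo.corr(method="pearson")

print(f"Pearson r between SS_E and SS_S: {r:.2f}")
print(corr_matrix.round(2))
```

Pearson's coefficient captures only linear association and is sensitive to outliers, so it is worth checking a scatter plot alongside the number.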

---

### **How EDA Enhances Analysis**
- **Identifying Patterns**:
  - Descriptive statistics and correlations help uncover trends and relationships, such as whether higher input BOD is associated with higher effluent BOD.
- **Ensuring Data Integrity**:
  - Outlier detection and data quality checks ensure the dataset is clean and ready for downstream modeling.
- **Understanding Process Performance**:
  - By analyzing correlations and variability, we can identify which input parameters are most strongly associated with treatment efficiency.



---

## Exercise

In [None]:
# import packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

In [None]:
# Load the dataset
df = pd.read_csv('data.csv')

# Display the first few rows of the dataset
df.head()

**1. Descriptive Statistics**

Examine the following parameters:
- **Input Biological Oxygen Demand (BOD\_E)**
- **Intermediate Suspended Solids (SS\_P)**
- **Output pH (PH\_S)**

In [None]:
"""
1. Calculate the **mean**, **median**, **minimum**, **maximum**, and **standard deviation** for these variables.
"""
# TODO


**2. Outlier Detection**

Focus on detecting outliers in the **Input Biological Oxygen Demand (BOD\_E)** column using the **Interquartile Range (IQR) method**.


In [None]:
"""
1. Compute the first quartile (Q1) and third quartile (Q3).
2. Calculate the interquartile range (IQR = Q3 - Q1).
3. Define outliers as values outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
4. Count the number of outliers and remove them from the dataset.
5. Confirm their removal using a **boxplot**.
"""
# TODO


**3. Pearson’s Correlation**

Explore the relationships between input/intermediate stage parameters and output parameters:
- BOD\_E (Input Biological Oxygen Demand) and PH\_S (Output pH).
- SS\_P (Intermediate Suspended Solids) and PH\_S (Output pH).


In [None]:
"""
1. Compute the **Pearson correlation coefficient** for each pair of variables.
2. Visualize the relationships using **scatterplots** and add **regression lines**.
"""
# TODO
