## Description
The dataset (downloaded from the UCI Machine Learning Repository) comes from a wastewater treatment plant that uses an activated sludge process to remove organic matter and suspended solids from municipal wastewater.

In this process (Figure A7), the suspended solids are first physically settled (primary treatment), and the wastewater is then biologically treated to oxidize the biodegradable organic matter (secondary treatment).

Data from on-line sensors at different stages of the process are provided for 38 variables over 527 days of operation. Seven of the 38 variables characterize the effluent water quality.

<div style="text-align:center; margin-top:2rem;">

![water treatment](water-treatment.png)

</div>


## Sensor Data:

- **Influent:**

    1. DATE        (date)

    2. Q-E         (input flow to plant)

    3. ZN-E        (input Zinc to plant)

    4. PH-E        (input pH to plant)

    5. DBO-E       (input Biological demand of oxygen to plant)

    6. DQO-E       (input chemical demand of oxygen to plant)

    7. SS-E        (input suspended solids to plant)

    8. SSV-E       (input volatile suspended solids to plant)

    9. SED-E       (input sediments to plant)

    10. COND-E     (input conductivity to plant)

- **Input to *Primary* Settler**

    11. PH-P       (input pH to primary settler)

    12. DBO-P      (input Biological demand of oxygen to primary settler)

    13. SS-P       (input suspended solids to primary settler)

    14. SSV-P      (input volatile suspended solids to primary settler)

    15. SED-P      (input sediments to primary settler)

    16. COND-P     (input conductivity to primary settler)

- **Input to *Secondary* Settler**

    17. PH-D       (input pH to secondary settler)

    18. DBO-D      (input Biological demand of oxygen to secondary settler)

    19. DQO-D      (input chemical demand of oxygen to secondary settler)

    20. SS-D       (input suspended solids to secondary settler)

    21. SSV-D      (input volatile suspended solids to secondary settler)

    22. SED-D      (input sediments to secondary settler)

    23. COND-D     (input conductivity to secondary settler)

- **Output from *Secondary* Settler (Effluent)**

    24. PH-S       (output pH)

    25. DBO-S      (output Biological demand of oxygen)

    26. DQO-S      (output chemical demand of oxygen)

    27. SS-S       (output suspended solids)

    28. SSV-S      (output volatile suspended solids)

    29. SED-S      (output sediments)

    30. COND-S     (output conductivity)

- **Performance Indicators**

    31. RD-DBO-P   (performance input Biological demand of oxygen in primary settler)

    32. RD-SS-P    (performance input suspended solids to primary settler)

    33. RD-SED-P   (performance input sediments to primary settler)

    34. RD-DBO-S   (performance input Biological demand of oxygen to secondary settler)

    35. RD-DQO-S   (performance input chemical demand of oxygen to secondary settler)

    36. RD-DBO-G   (global performance input Biological demand of oxygen)

    37. RD-DQO-G   (global performance input chemical demand of oxygen)

    38. RD-SS-G    (global performance input suspended solids)

    39. RD-SED-G   (global performance input sediments)
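
The dataset preview in the exercise section suggests that the global performance indicators are percentage removals between plant input and output. For example, DQO-E = 344 and DQO-S = 97 in the first row correspond to RD-DQO-G = 71.8. The sketch below assumes that definition; the formula is inferred from the sample rows rather than taken from the dataset documentation, and `removal_pct` is a hypothetical helper name.

```python
def removal_pct(inlet: float, outlet: float) -> float:
    """Percentage of a pollutant removed between two measurement points."""
    if inlet <= 0:
        raise ValueError("inlet value must be positive")
    return 100.0 * (inlet - outlet) / inlet


# Values copied from the first row of the dataset preview (1990-01-01).
dqo_removal = removal_pct(344.0, 97.0)  # compare with RD-DQO-G = 71.8
ss_removal = removal_pct(136.0, 17.0)   # compare with RD-SS-G = 87.5

print(round(dqo_removal, 1), round(ss_removal, 1))
```

Recomputing an indicator this way is a quick sanity check on the data before plotting it.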

### **Communicating Data Science Insights: Interactive Plots and Dashboards**

---

### **Overview**
Communicating data science insights effectively requires interactive tools and dashboards that make it easier to explore data and share findings with stakeholders. Tools like **Plotly** and **Dash** enable interactive data visualizations, while **Microsoft Power BI** provides a business-ready solution for building visual reports.

For the wastewater treatment process, interactive plots and dashboards can help:
- Explore trends in input and output parameters dynamically.
- Identify relationships and anomalies in real time.
- Provide actionable insights into treatment efficiency.


### **Key Components of Communication**


#### **1. Clarity and Simplicity**
   - **Purpose**: Ensure that insights are easily understood by a diverse audience, including non-technical stakeholders.
   - **Best Practices**:
     - Avoid jargon and overly technical terms when presenting results.
     - Use concise labels, titles, and annotations in visualizations.
   - **Examples**:
     - Clearly label axes in charts (e.g., "Input pH (`PH-E`)" instead of just "pH").
     - Summarize complex statistical metrics with intuitive explanations.


#### **2. Visual Storytelling**
   - **Purpose**: Use visuals to tell a compelling story that emphasizes the key findings and guides the audience through the analysis.
   - **Best Practices**:
     - Highlight trends, patterns, and key points in visualizations.
     - Use consistent color schemes and chart styles for readability.
   - **Examples**:
     - Use a line plot to show daily trends and highlight spikes with annotations.
     - Create heatmaps to visually communicate correlations between variables.


#### **3. Interactivity**
   - **Purpose**: Allow stakeholders to explore data and insights dynamically, enabling deeper understanding and engagement.
   - **Best Practices**:
     - Include interactive filters, tooltips, and sliders in visualizations and dashboards.
     - Provide multiple views for drilling down into specific data subsets.
   - **Examples**:
     - Use dropdown menus in a dashboard to select variables for analysis.
     - Create tooltips in scatter plots to display additional information about each data point.


#### **4. Relevance and Audience-Focus**
   - **Purpose**: Tailor the communication style and content to the audience’s needs and decision-making requirements.
   - **Best Practices**:
     - Focus on the metrics and visuals most relevant to the problem at hand.
     - Provide actionable recommendations based on insights.
   - **Examples**:
     - For plant operators: Focus on daily trends and compliance metrics.
     - For managers: Emphasize high-level summaries and performance metrics.


#### **5. Accuracy and Transparency**
   - **Purpose**: Ensure the data, methods, and results are accurate and presented with full transparency to build trust.
   - **Best Practices**:
     - Double-check calculations and visualizations for correctness.
     - Clearly explain assumptions, limitations, and potential biases in the analysis.
   - **Examples**:
     - Include a note in dashboards about missing data or outliers.
     - Highlight confidence intervals or uncertainty in model predictions.
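
A note about missing data can be generated from the data itself rather than written by hand. The following is a minimal sketch using pandas; the readings are hypothetical and only illustrate how such a note might be assembled for a dashboard caption.

```python
import pandas as pd

# Hypothetical readings; None marks a missing sensor value.
df = pd.DataFrame({
    "PH-S": [7.5, 7.6, None, 7.5, 7.4],
    "DBO-S": [16.0, None, None, 28.0, 21.0],
})

# Count missing values per column and keep only columns that have any.
missing = df.isna().sum()
missing = missing[missing > 0]

note = "Missing values: " + ", ".join(
    f"{column} ({count} of {len(df)} days)" for column, count in missing.items()
)
print(note)
```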


#### **6. Call to Action**
   - **Purpose**: Conclude with actionable insights and clear next steps for the audience to act on.
   - **Best Practices**:
     - End presentations with specific recommendations based on findings.
     - Provide tools (e.g., dashboards or reports) that empower stakeholders to act independently.
   - **Examples**:
     - Recommend process adjustments to improve compliance rates.
     - Share a dashboard with filters for exploring scenarios interactively.


### **How Communication Enhances Understanding**
- **Improves Stakeholder Engagement**:
  - Clear and interactive tools enable stakeholders to explore and understand data more effectively.
- **Supports Decision-Making**:
  - Relevant insights presented transparently help in making informed and timely decisions.
- **Drives Action**:
  - Visual storytelling and actionable recommendations inspire proactive improvements and optimizations.




---

## **Exercise: Interactive Plots with Plotly**



```python
# Import necessary packages
import pandas as pd
import plotly.express as px

# Load the dataset
df = pd.read_csv('data.csv')

# Display the first few rows of the dataset
df.head()
```


| Unnamed: 0 | DATE | DAY-OF-WEEK | Q-E | ZN-E | PH-E | DBO-E | DQO-E | SS-E | SSV-E | SED-E | ... | RD-DQO-G | RD-SS-G | RD-SED-G | PH-S | DBO-S | DQO-S | SS-S | SSV-S | SED-S | COND-S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1990-01-01 | Monday | 41230.0 | 0.35 | 7.6 | 120.0 | 344.0 | 136.0 | 54.4 | 4.5 | ... | 71.8 | 87.5 | 99.4 | 7.5 | 16.0 | 97.0 | 17.0 | 51.8 | 0.03 | 903.0 |
| 1 | 1990-01-02 | Tuesday | 37386.0 | 1.4 | 7.9 | 165.0 | 470.0 | 170.0 | 76.5 | 4.0 | ... | 79.4 | 89.4 | 100.0 | 7.6 | 22.0 | 97.0 | 18.0 | 80.6 | 0.0 | 1481.0 |
| 2 | 1990-01-03 | Wednesday | 34535.0 | 1.0 | 7.8 | 232.0 | 518.0 | 220.0 | 65.5 | 5.5 | ... | 71.8 | 85.9 | 99.8 | 7.5 | 29.0 | 146.0 | 31.0 | 77.4 | 0.01 | 1492.0 |
| 3 | 1990-01-04 | Thursday | 32527.0 | 3.0 | 7.8 | 187.0 | 460.0 | 180.0 | 67.8 | 5.2 | ... | 77.2 | 83.3 | 100.0 | 7.5 | 28.0 | 105.0 | 30.0 | 82.0 | 0.0 | 1590.0 |
| 4 | 1990-01-07 | Sunday | 27760.0 | 1.2 | 7.6 | 199.0 | 466.0 | 186.0 | 74.2 | 4.5 | ... | 73.8 | 86.6 | 99.6 | 7.4 | 21.0 | 122.0 | 25.0 | 84.0 | 0.02 | 1411.0 |



### **1. Line Plot: Trends in Output pH (`PH-S`)**
#### **Tasks**
1. Create a **line plot** of `PH-S` over time using Plotly.
2. Add a title, axis labels, and tooltips showing the date and the `PH-S` value.
3. Highlight any significant dips or spikes in the trend.



```python
# TODO: Create a line plot for PH-S over time
```




### **2. Scatter Plot: Relationship Between `DBO-E` and `PH-S`**
#### **Tasks**
1. Create an **interactive scatter plot** to explore the relationship between `DBO-E` and `PH-S`.
2. Use tooltips to display additional information, such as `Q-E` (input flow). *Pass a list of column names to `hover_data` to show additional details in the hover cards.*
3. Add a trendline to highlight the relationship.



```python
# TODO: Create an interactive scatter plot with tooltips
```



### **3. Heatmap: Correlations Between Key Variables**
#### **Tasks**
1. Compute the **correlation matrix** for numerical variables in the dataset.
2. Create an **interactive heatmap** to visualize correlations. *Use `px.imshow` to plot the heatmap.*
3. Identify the strongest and weakest correlations from the heatmap.



```python
# TODO: Create a heatmap of correlations
correlation_matrix = df.select_dtypes(include='number').corr()
```



---

## **Assignment**

### **Build the Same Dashboard Using `Dash` and `Microsoft Power BI`**
Create an interactive dashboard to explore the wastewater treatment data. Include:
- A dropdown menu to select a variable for analysis (e.g., `DBO-E`, `Q-E`).
- A line chart to visualize trends over time.
