# QCTO - Workplace Module

### Project Title: Buenos Aires contaminated river water analysis
#### Done By: Tikedzani Geraldine Vele

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Introduce the project, outline its goals, and explain its significance.
* **Details:** Include information about the problem domain, the specific questions or challenges the project aims to address, and any relevant background information that sets the stage for the work.
---

**Project Overview**

**1.1 Introduction**

Environmental pollution, particularly in water bodies, has become a critical concern in many regions. In Buenos Aires, one of the most pressing environmental challenges is the contamination of rivers, which threatens both the ecosystem and public health. Monitoring water quality and predicting key indicators like dissolved oxygen (DO) levels are essential for effective environmental management and policy-making.

Predictive modeling, powered by machine learning and data analysis techniques, can provide insights into water quality and help predict critical parameters. This project aims to develop a model to predict the DO levels of a contaminated river in Buenos Aires using various physico-chemical features of the water samples.

**1.2 Problem Statement**

The rising levels of pollution in Buenos Aires' rivers have made it difficult to assess water quality effectively through manual sampling and analysis. There is a need for an automated system that can predict the DO levels, a key indicator of water quality, based on other physico-chemical parameters. The goal of this project is to create a model that can accurately predict DO levels, thereby aiding in the monitoring and management of water quality in the region.

**1.3 Data Sourcing**

The dataset for this project includes water sample data collected from various points along a polluted river in Buenos Aires. It contains 219 samples and 16 columns, including the sampling point, ambient and sample temperature, pH, electrical conductivity (EC), total dissolved solids (TDS), total suspended solids (TSS), water level, turbidity, hardness, and total chloride, alongside the dissolved oxygen (DO) target. The data has been split into training and testing sets to facilitate the development and evaluation of the predictive model.

**1.4 Importance of the Study**

* **Improved Environmental Monitoring:** By providing accurate predictions of DO levels, the system can help environmental agencies monitor water quality more effectively.
* **Operational Efficiency:** The automated system reduces the need for extensive manual sampling and analysis, freeing up resources for other critical tasks.
* **Scalability:** The model can be adapted to monitor multiple rivers or extended to other regions facing similar challenges.
* **Data-Driven Decision-Making:** Insights from the model can inform environmental policies, regulatory measures, and remediation efforts.

**1.5 Key Questions**

* **Effectiveness of Feature Selection:** Which physico-chemical parameters are most influential in predicting DO levels?
* **Optimal Algorithm Selection for Prediction:** Which machine learning algorithms perform best in predicting DO levels based on the available data?
* **Real-World Applicability and Integration:** How can the predictive model be integrated into existing environmental monitoring frameworks?
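A simple first pass at the feature-influence question is to rank each candidate feature by the absolute value of its Pearson correlation with DO. The sketch below is only illustrative: the feature names are simplified, hypothetical stand-ins and the values are made up rather than taken from the river dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rank_features(features, target):
    """Return (name, r) pairs sorted by |r| with the target, strongest first."""
    scores = {name: pearson(values, target) for name, values in features.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)

# Made-up illustrative values, NOT taken from the river dataset.
do_mg_l = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "sample_temp_c": [25.0, 23.0, 21.0, 19.0, 17.0],  # falls steadily as DO rises
    "ph": [8.1, 7.9, 8.3, 8.0, 8.2],                  # only loosely related to DO
}

ranking = rank_features(features, do_mg_l)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

Correlation only captures linear, one-feature-at-a-time relationships, so it is best treated as a screening step before model-based importance measures.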



**1.6 Aim**

To develop a predictive model that accurately forecasts the DO levels in the contaminated river using machine learning techniques.

To deploy the model as a web application using Streamlit, providing a user-friendly interface for environmental agencies and other stakeholders to monitor water quality.

To present findings, including insights into the key features of the dataset, and provide recommendations for improving water quality monitoring.

**1.7 Expected Outcomes**

* **Accurate Prediction Model:** A model that reliably predicts DO levels based on other physico-chemical features.
* **Streamlit Web Application:** An interactive web application that allows users to input water sample data and receive DO level predictions.
* **Enhanced Monitoring Efficiency:** An automated system that reduces the need for manual analysis and improves the efficiency of water quality monitoring.
* **Comprehensive Documentation:** Detailed documentation of the project, including methodology, results, and recommendations for future work.

**1.8 Project Objectives**

The objective of this project is to apply a structured approach to developing a predictive model for DO levels in a contaminated river using machine learning techniques. This involves:

* **Data Loading and Exploration:** Load the dataset and perform an initial exploration to understand its structure and identify any data quality issues.
* **Data Preprocessing:** Clean and preprocess the data to prepare it for model training, including handling missing values and encoding categorical data.
* **Feature Engineering:** Extract and transform features from the water sample data to create meaningful input for machine learning models.
* **Model Training and Evaluation:** Train and evaluate multiple machine learning models to select the best one for predicting DO levels.
* **Integration with Monitoring Systems:** Develop methods to incorporate the predictive model into existing water quality monitoring frameworks.
* **Deployment with Streamlit:** Develop a web application using Streamlit to allow users to input water sample data and receive DO level predictions.
* **Documentation and Reporting:** Document the entire process and prepare a report and presentation for stakeholders.

Through these steps, the project aims to develop a robust system that can accurately predict DO levels and support effective environmental management in Buenos Aires.
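The preprocessing and training steps listed above can be chained into a single scikit-learn `Pipeline`, so that imputation, scaling, and encoding are fitted on the training data only. The following is a minimal sketch under stated assumptions: the data is randomly generated, the column names (`sample_temp_c`, `ph`, `turbidity_ntu`, `do_mg_l`, `sampling_point`) are hypothetical simplifications of the real ones, and the random forest is just one possible regressor.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data; column names are hypothetical and values are random.
rng = np.random.default_rng(0)
n = 60
data = pd.DataFrame({
    "sampling_point": rng.choice(["Puente Bilbao", "Arroyo_Las Torres"], size=n),
    "sample_temp_c": rng.uniform(12, 28, size=n),
    "ph": rng.uniform(7.2, 8.7, size=n),
    "turbidity_ntu": rng.uniform(1, 200, size=n),
})
data.loc[::10, "turbidity_ntu"] = np.nan  # simulate missing readings
data["do_mg_l"] = 10 - 0.3 * data["sample_temp_c"] + rng.normal(0, 0.5, size=n)

numeric_cols = ["sample_temp_c", "ph", "turbidity_ntu"]
categorical_cols = ["sampling_point"]

# Numeric columns: mean-impute then standardise. Categorical: one-hot encode.
preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=0)),
])

X = data[numeric_cols + categorical_cols]
y = data["do_mg_l"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model.fit(X_train, y_train)  # imputer and scaler statistics come from X_train only
predictions = model.predict(X_test)
print(predictions[:5])
```

Keeping the preprocessing inside the pipeline also means the same fitted object can later be reused unchanged by a deployment layer such as the planned Streamlit app.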

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.
---

In [None]:
# Please use code cells to code in and do not forget to comment your code.
# Standard library
import datetime
import warnings

# Data handling and visualisation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing and feature selection
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Models
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from xgboost import XGBRegressor

# Model selection and evaluation
from sklearn.metrics import make_scorer, mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import TimeSeriesSplit, cross_val_score, train_test_split

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.
import pandas as pd

# Load the dataset
dataset = pd.read_csv('River water parameters.xlsx - Base de Datos.csv')

# Overview of the dataset
def dataset_overview(dataset):
    print("### Data Collection and Description\n")
    print("**Purpose:**")
    print("The purpose of this section is to describe how the data was collected and provide an overview of its characteristics.\n")

    print("**Data Collection:**")
    print("The data for this project was collected from the following sources:")
    print("- Water samples were collected from various locations along the contaminated river in Buenos Aires.")
    print("- Data was gathered using sensors and lab analysis for various physico-chemical parameters.")
    print("- The dataset was obtained from [source name, e.g., local environmental agency, research databases, etc.].\n")

    print("**Methods Used for Collection:**")
    print("- The data was collected using a combination of laboratory analysis and field sensors.")
    print("- Parameters like temperature, pH, turbidity, and nutrient levels were measured directly at sampling points.")
    print("- Data is stored in a structured format in a CSV file, with rows representing individual samples.\n")

    print("**Dataset Characteristics:**")
    print(f"- The dataset contains **{dataset.shape[0]}** samples (rows) and **{dataset.shape[1]}** features (columns).")
    print("- The types of data available include numerical (e.g., temperature, pH) and categorical (e.g., sample location).")

    # Display the first few rows of the dataset
    print("\n**Sample of the Data:**")
    print(dataset.head())

    # Overview of the dataset's columns and data types
    print("\n**Dataset Columns and Data Types:**")
    print(dataset.dtypes)

    # Summary statistics of the dataset
    print("\n**Summary Statistics:**")
    print(dataset.describe())

# Call the function to display the dataset overview
dataset_overview(dataset)

### Data Collection and Description

**Purpose:**
The purpose of this section is to describe how the data was collected and provide an overview of its characteristics.

**Data Collection:**
The data for this project was collected from the following sources:
- Water samples were collected from various locations along the contaminated river in Buenos Aires.
- Data was gathered using sensors and lab analysis for various physico-chemical parameters.
- The dataset was obtained from [source name, e.g., local environmental agency, research databases, etc.].

**Methods Used for Collection:**
- The data was collected using a combination of laboratory analysis and field sensors.
- Parameters like temperature, pH, turbidity, and nutrient levels were measured directly at sampling points.
- Data is stored in a structured format in a CSV file, with rows representing individual samples.

**Dataset Characteristics:**
- The dataset contains **219** samples (rows) and **16** features (columns).
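- The types of data available include numerical (e.g., temperature, pH) and categorical (e.g., sample location).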

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.
# Import necessary libraries
import pandas as pd

# Load the dataset
file_path = 'River water parameters.xlsx - Base de Datos.csv'
dataset = pd.read_csv(file_path)

# Overview of the loaded data
def load_data_overview(dataset):
    print("### Loading Data\n")
    print("**Purpose:**")
    print("The purpose of this section is to load the data into the notebook for manipulation and analysis.\n")

    print("**Data Loading:**")
    print(f"Data has been loaded from the file: **{file_path}**\n")

    print("**First Few Rows of the Dataset:**")
    print(dataset.head(), "\n")

    print("**Dataset Columns:**")
    print(dataset.columns.tolist(), "\n")

    print("**Dataset Shape:**")
    print(f"The dataset contains **{dataset.shape[0]}** samples (rows) and **{dataset.shape[1]}** features (columns).")

# Call the function to display the data loading overview
load_data_overview(dataset)

### Loading Data

**Purpose:**
The purpose of this section is to load the data into the notebook for manipulation and analysis.

**Data Loading:**
Data has been loaded from the file: **River water parameters.xlsx - Base de Datos.csv**

**First Few Rows of the Dataset:**
  Date (DD/MM/YYYY) Time (24 hrs XX:XX)     Sampling point  \
0         9/05/2023               14:15      Puente Bilbao   
1        14/06/2023               14:30      Puente Bilbao   
2        14/06/2023               14:30      Puente Bilbao   
3        14/06/2023               15:00  Arroyo_Las Torres   
4        14/06/2023               15:00  Arroyo_Las Torres   

   Ambient temperature (°C)  Ambient humidity  Sample temperature (°C)   pH  \
0                      17.0              0.47                     19.0  8.3   
1                      11.9              0.47                     13.0  8.1   
2                      11.9              0.47                     13.0  8.2   
3                      11.9             

In [None]:
print(dataset.shape)


(219, 16)


In [None]:
print(dataset.columns)

Index(['Date (DD/MM/YYYY)', 'Time (24 hrs XX:XX)', 'Sampling point',
       'Ambient temperature (°C)', 'Ambient humidity',
       'Sample temperature (°C)', 'pH', 'EC\n(µS/cm)', 'TDS\n(mg/L)',
       'TSS\n(mL sed/L)', 'DO\n(mg/L)', 'Level (cm)', 'Turbidity (NTU)',
       'Hardness\n(mg CaCO3/L)', 'Hardness classification',
       'Total Cl-\n(mg Cl-/L)'],
      dtype='object')
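Several column names above contain embedded newline characters (for example `'DO\n(mg/L)'`), which makes them awkward to type and breaks the layout of `df.info()`. One possible remedy, sketched below, is a small helper that collapses any run of whitespace into a single space. The helper name `clean_column_name` is hypothetical and not part of the original notebook.

```python
import re

def clean_column_name(name: str) -> str:
    """Collapse newlines and repeated whitespace in a column name into single spaces."""
    return re.sub(r"\s+", " ", name).strip()

# A few of the raw column names shown in the output above.
raw_columns = ["pH", "EC\n(µS/cm)", "DO\n(mg/L)", "Hardness\n(mg CaCO3/L)"]
cleaned = [clean_column_name(col) for col in raw_columns]
print(cleaned)
```

In the notebook this could be applied with `dataset = dataset.rename(columns=clean_column_name)`. Any later cell that refers to the original names, such as `'DO\n(mg/L)'`, would then need to use the cleaned names instead.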


In [None]:
dataset.describe()

| | Ambient temperature (°C) | Ambient humidity | Sample temperature (°C) | pH | EC\n(µS/cm) | TDS\n(mg/L) | TSS\n(mL sed/L) | DO\n(mg/L) | Level (cm) | Turbidity (NTU) | Hardness\n(mg CaCO3/L) | Total Cl-\n(mg Cl-/L) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 219.0 | 219.0 | 219.0 | 219.0 | 219.0 | 219.0 | 213.0 | 219.0 | 180.0 | 218.0 | 217.0 | 213.0 |
| mean | 17.640183 | 0.559954 | 19.594977 | 8.031507 | 1264.56621 | 624.246575 | 61.015023 | 2.620639 | 38.277778 | 144.954083 | 190.714286 | 102.629108 |
| std | 5.163841 | 0.165303 | 3.875319 | 0.289991 | 273.320004 | 135.540892 | 87.08314 | 1.95751 | 12.532887 | 234.590553 | 56.058761 | 32.785301 |
| min | 10.4 | 0.19 | 12.8 | 7.2 | 200.0 | 140.0 | 0.1 | 0.0 | 10.0 | 1.06 | 86.0 | 15.0 |
| 25% | 13.8 | 0.47 | 16.8 | 7.9 | 1075.0 | 530.0 | 30.0 | 1.17 | 30.0 | 27.5 | 146.0 | 71.0 |
| 50% | 17.0 | 0.54 | 19.3 | 8.1 | 1330.0 | 660.0 | 48.0 | 1.87 | 35.0 | 59.25 | 188.0 | 109.0 |
| 75% | 20.0 | 0.69 | 22.1 | 8.2 | 1470.0 | 725.0 | 66.0 | 4.0 | 48.0 | 136.0 | 228.0 | 125.0 |
| max | 30.5 | 0.87 | 28.1 | 8.7 | 1710.0 | 850.0 | 650.0 | 9.12 | 70.0 | 1000.0 | 316.0 | 174.0 |
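The `count` row of the `describe()` output above already reveals which numeric columns have missing values, since the dataset has 219 rows in total. The short sketch below derives the number and share of missing entries per column from those counts, which were copied by hand from the output.

```python
# Non-null counts copied from the `count` row of the describe() output above.
TOTAL_ROWS = 219
non_null_counts = {
    "TSS\n(mL sed/L)": 213,
    "Level (cm)": 180,
    "Turbidity (NTU)": 218,
    "Hardness\n(mg CaCO3/L)": 217,
    "Total Cl-\n(mg Cl-/L)": 213,
}

# Missing entries per column = total rows minus non-null entries.
missing = {col: TOTAL_ROWS - count for col, count in non_null_counts.items()}

# Report columns from most to least affected.
for col, n_missing in sorted(missing.items(), key=lambda kv: kv[1], reverse=True):
    share = 100 * n_missing / TOTAL_ROWS
    print(f"{col!r}: {n_missing} missing ({share:.1f}%)")
```

In the notebook itself, `dataset.isna().sum()` gives the same information directly. `Level (cm)` stands out with 39 missing values, so the choice of imputation strategy may matter more for that column than for the others.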


In [None]:
dataset.info()

---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
file_path = 'River water parameters.xlsx - Base de Datos.csv'
df = pd.read_csv(file_path)

# Convert date and time columns to datetime format
df['Date'] = pd.to_datetime(df['Date (DD/MM/YYYY)'], format='%d/%m/%Y')
df['Time'] = pd.to_datetime(df['Time (24 hrs XX:XX)'], format='%H:%M').dt.time

# Drop duplicates if any
df = df.drop_duplicates()

# Handling missing values
# Use SimpleImputer to fill missing numeric values with the mean
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')

# Columns with missing values
missing_columns = ['TSS\n(mL sed/L)', 'Level (cm)', 'Turbidity (NTU)',
                   'Hardness\n(mg CaCO3/L)', 'Total Cl-\n(mg Cl-/L)']

df[missing_columns] = imputer.fit_transform(df[missing_columns])

# Filter out unrealistic values (outliers)
# Example: Filtering out negative values and extremely high values

# Filter for reasonable pH values (usually between 0 and 14)
df = df[(df['pH'] >= 0) & (df['pH'] <= 14)]

# Filter for realistic temperature values
df = df[(df['Sample temperature (°C)'] > 0) & (df['Sample temperature (°C)'] < 100)]

# Removing outliers using Z-score for 'DO\n(mg/L)'
from scipy import stats

z_scores = np.abs(stats.zscore(df['DO\n(mg/L)']))
df = df[z_scores < 3]  # Retain only rows with Z-score less than 3

# Dropping unnecessary columns
df = df.drop(columns=['Date (DD/MM/YYYY)', 'Time (24 hrs XX:XX)'])

# Display the cleaned data
print(df.info())
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
Index: 218 entries, 0 to 218
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   Sampling point            218 non-null    object        
 1   Ambient temperature (°C)  218 non-null    float64       
 2   Ambient humidity          218 non-null    float64       
 3   Sample temperature (°C)   218 non-null    float64       
 4   pH                        218 non-null    float64       
 5   EC
(µS/cm)                218 non-null    int64         
 6   TDS
(mg/L)                218 non-null    int64         
 7   TSS
(mL sed/L)            218 non-null    float64       
 8   DO
(mg/L)                 218 non-null    float64       
 9   Level (cm)                218 non-null    float64       
 10  Turbidity (NTU)           218 non-null    float64       
 11  Hardness
(mg CaCO3/L)     218 non-null    float64       
 12  Hardness classification   2
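To make the Z-score rule used in the cleaning cell concrete, here is a dependency-free sketch of the same idea: a value is kept only if it lies fewer than three standard deviations from the mean. It uses the population standard deviation, which matches the default `ddof=0` of `scipy.stats.zscore`. The function name `zscore_filter` is hypothetical and the readings are made up.

```python
import statistics

def zscore_filter(values, threshold=3.0):
    """Keep values whose absolute z-score (population std, ddof=0) is below threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs((v - mean) / std) < threshold]

# Made-up DO readings in mg/L with one extreme value; not taken from the dataset.
readings = [2.0] * 19 + [40.0]
kept = zscore_filter(readings)
print(len(readings), "->", len(kept))
```

Because the mean and standard deviation are themselves pulled toward extreme values, this rule can be lenient when several outliers are present. A percentile-based rule such as the interquartile range is a common alternative worth comparing.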

---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---
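As a starting point for the exploration described above, the sketch below shows two common summaries with pandas: DO statistics per sampling point and a correlation matrix of the numeric columns. The sampling-point names come from the notebook output, but the numbers are made up and the column names are simplified.

```python
import pandas as pd

# Tiny made-up sample; the sampling-point names come from the notebook output,
# the numbers do not.
sample = pd.DataFrame({
    "Sampling point": ["Puente Bilbao", "Puente Bilbao",
                       "Arroyo_Las Torres", "Arroyo_Las Torres"],
    "DO (mg/L)": [1.0, 3.0, 4.0, 6.0],
    "Sample temperature (°C)": [22.0, 20.0, 16.0, 14.0],
})

# Per-site summary of the target variable.
do_by_site = sample.groupby("Sampling point")["DO (mg/L)"].agg(["mean", "min", "max"])
print(do_by_site)

# Pairwise Pearson correlations between the numeric columns.
correlations = sample.select_dtypes("number").corr()
print(correlations.round(2))
```

On the real data, the correlation matrix could be passed to `sns.heatmap(correlations, annot=True)` for a visual overview, and per-site box plots would show whether DO differs systematically between sampling points.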


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---
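One way to compare candidate models on equal footing is to score each with the same cross-validation folds. The sketch below does this for two of the regressors imported earlier, using synthetic data in place of the cleaned feature matrix. The choice of models, fold count, and RMSE metric are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for the cleaned feature matrix.
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=120)

candidates = {
    "linear_regression": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Reuse the same shuffled folds for every candidate.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
results = {}
for name, estimator in candidates.items():
    # scikit-learn maximises scores, so RMSE is exposed as a negative value.
    scores = cross_val_score(estimator, X, y, cv=cv,
                             scoring="neg_root_mean_squared_error")
    results[name] = -scores.mean()

# Print candidates from lowest to highest mean RMSE.
for name, rmse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: mean CV RMSE = {rmse:.3f}")
```

Since the river samples are dated, shuffled folds may give optimistic estimates if neighbouring samples are similar. Swapping `KFold` for the `TimeSeriesSplit` imported earlier, on data sorted by date, is an alternative worth considering.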


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
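* **Details:** Present metrics used to evaluate the models. Because predicting DO is a regression task, suitable metrics include RMSE, MAE, and R² rather than classification metrics such as accuracy, precision, recall, or F1-score. Discuss validation techniques employed, such as cross-validation or train/test split.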
---
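Because DO is a continuous quantity, the models here are evaluated with regression metrics. The sketch below computes MAE, RMSE, and R² from first principles so that the definitions are explicit. The observed and predicted values are made up, and the helper name `regression_metrics` is hypothetical.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for paired observed/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    return {"mae": mae, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}

# Made-up observed and predicted DO values in mg/L.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

In the notebook, `mean_absolute_error`, `mean_squared_error`, and `r2_score` from `sklearn.metrics` compute the same quantities. MAE and RMSE are in mg/L, which makes them easy to relate to the DO range seen in the data.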

In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

## Additional Sections to Consider

* ### Appendix:
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Contributors:
If this is a group project, list the contributors and their roles or contributions to the project.
