# QCTO - Workplace Module

### Project Title: River Water Pollution in Buenos Aires
#### Done By: Thando Calana

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Introduce the project, outline its goals, and explain its significance.
* **Details:** Include information about the problem domain, the specific questions or challenges the project aims to address, and any relevant background information that sets the stage for the work.
---

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.
---

In [16]:
# Core libraries for data manipulation and numerical work
import pandas as pd
import numpy as np
# LabelEncoder is imported for encoding the categorical hardness classification later
from sklearn.preprocessing import LabelEncoder

---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [18]:
# Load the river water parameters dataset from the local CSV file
df = pd.read_csv("river_water_parameters.csv")

In [124]:
df.head()

| | Date (DD/MM/YYYY) | Time (24 hrs XX:XX) | Sampling point | Ambient temperature (°C) | Ambient humidity | Sample temperature (°C) | pH | EC (µS/cm) | TDS (mg/L) | TSS (mL sed/L) | DO (mg/L) | Level (cm) | Turbidity (NTU) | Hardness (mg CaCO3/L) | Hardness classification | Total Cl- (mg Cl-/L) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 09/05/2023 | 14:15 | Puente Bilbao | 17.0 | 0.47 | 19.0 | 8.3 | 1630 | 810 | 1.8 | 4.3 | NaN | NaN | 147.0 | BLANDA | 156.0 |
| 1 | 14/06/2023 | 14:30 | Puente Bilbao | 11.9 | 0.47 | 13.0 | 8.1 | 1000 | 490 | 18.0 | 5.3 | NaN | 41.2 | 94.0 | BLANDA | 78.0 |
| 2 | 14/06/2023 | 14:30 | Puente Bilbao | 11.9 | 0.47 | 13.0 | 8.2 | 1000 | 490 | 18.0 | 4.67 | NaN | 38.9 | 86.0 | BLANDA | 82.0 |
| 3 | 14/06/2023 | 15:00 | Arroyo_Las Torres | 11.9 | 0.47 | 13.0 | 8.3 | 1350 | 670 | 0.1 | 7.01 | NaN | 30.7 | 200.0 | SEMIDURA | 117.0 |
| 4 | 14/06/2023 | 15:00 | Arroyo_Las Torres | 11.9 | 0.47 | 13.0 | 8.5 | 1350 | 660 | 0.1 | 7.23 | NaN | 25.6 | 196.0 | SEMIDURA | 117.0 |


---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---

In [20]:
# We will start off by getting general information on the data we are working with
# This will help identify column names, columns with null values and a count of non-nulls
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219 entries, 0 to 218
Data columns (total 16 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Date (DD/MM/YYYY)         219 non-null    object 
 1   Time (24 hrs XX:XX)       219 non-null    object 
 2   Sampling point            219 non-null    object 
 3   Ambient temperature (°C)  219 non-null    float64
 4   Ambient humidity          219 non-null    float64
 5   Sample temperature (°C)   219 non-null    float64
 6   pH                        219 non-null    float64
 7   EC
(µS/cm)                219 non-null    int64  
 8   TDS
(mg/L)                219 non-null    int64  
 9   TSS
(mL sed/L)            213 non-null    float64
 10  DO
(mg/L)                 219 non-null    float64
 11  Level (cm)                180 non-null    float64
 12  Turbidity (NTU)           218 non-null    float64
 13  Hardness
(mg CaCO3/L)     217 non-null    float64
 14  Hardness c

In [22]:
# We create a copy of the df, which we will perform all the cleaning operations on
cleaned_df = df.copy()

In [24]:
# Before dealing with NULL values, we rename the columns for better readability
# Here, we create a list that contains the new names that will be assigned to the columns
columns = ["Date", "Time", "Sampling_point", "Ambient_temperature",
           "Ambient_humididty", "Sample_temperature", "ph", "Conductivity",
           "TDS", "TSS", "DO", "Level", "Turbidity", "Hardness", "Hard_class", "CL"]

In [26]:
# Here, we assign the new column names
cleaned_df.columns = columns
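
Assigning a list to `.columns` works by position, so it relies on the list having exactly one name per column, in the original order. A minimal sketch of a guarded version, shown on a small made-up frame (`toy_df`, `new_names` and `rename_positionally` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical miniature frame with raw headers similar to the dataset's
toy_df = pd.DataFrame({
    "Date (DD/MM/YYYY)": ["09/05/2023"],
    "pH": [8.3],
    "EC\n(µS/cm)": [1630],
})

new_names = ["Date", "ph", "Conductivity"]


def rename_positionally(frame, names):
    """Return a copy of frame with its columns renamed by position.

    Raises ValueError when the number of names does not match the
    number of columns, instead of silently mislabelling data.
    """
    if len(names) != len(frame.columns):
        raise ValueError(
            f"expected {len(frame.columns)} names, got {len(names)}"
        )
    # Build an explicit old -> new mapping so the pairing is easy to inspect
    mapping = dict(zip(frame.columns, names))
    return frame.rename(columns=mapping)


renamed = rename_positionally(toy_df, new_names)
print(list(renamed.columns))
```

Printing the `mapping` dictionary before renaming is a cheap way to confirm that each raw header is paired with the intended short name.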

All but one of the columns with null values are numeric, so the numeric and categorical columns are handled separately: the imputation step is the same in outline, but the fill value has to suit each data type.

In [29]:
# From the df.info() output, we have identified the following as columns with null values
numeric_columns_with_nulls = ["TSS", "Level", "Turbidity", "Hardness", "CL"]
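
The list above was read off the `df.info()` output by hand. As a hedged alternative, the same split could be derived programmatically. The sketch below uses a small made-up frame (`toy_df` and the variable names are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one numeric column with a null, one complete numeric
# column, and one categorical column with a null
toy_df = pd.DataFrame({
    "TSS": [1.8, np.nan, 18.0],
    "DO": [4.3, 5.3, 4.67],
    "Hard_class": ["BLANDA", None, "SEMIDURA"],
})

# Columns that contain at least one null value
cols_with_nulls = toy_df.columns[toy_df.isnull().any()]

# Split those columns by data type
numeric_nulls = (
    toy_df[cols_with_nulls].select_dtypes(include="number").columns.tolist()
)
categorical_nulls = (
    toy_df[cols_with_nulls].select_dtypes(exclude="number").columns.tolist()
)

print(numeric_nulls)
print(categorical_nulls)
```

Deriving the lists this way keeps them in step with the data if the CSV is later updated with new samples.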

In [31]:
# For the numeric columns, we will use mean imputation.
# Filling with the mean leaves each column's mean unchanged, although it does reduce the variance.
# Imputing, rather than dropping rows or columns, avoids losing data.
for col in numeric_columns_with_nulls:
    cleaned_df[col] = cleaned_df[col].fillna(cleaned_df[col].mean())
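
To make the effect of mean imputation concrete, here is a small sketch on made-up values (the series `s` is hypothetical, not taken from the dataset). It illustrates that the mean is preserved while the spread shrinks, which is worth keeping in mind for columns with many missing values:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing readings
s = pd.Series([10.0, 20.0, np.nan, 30.0, np.nan])

# Same operation as in the loop above: fill nulls with the column mean
filled = s.fillna(s.mean())

# mean() and std() skip nulls by default, so both calls are comparable
print("mean before:", s.mean(), "after:", filled.mean())
print("std before:", s.std(), "after:", filled.std())
```

The mean is the same before and after, but the standard deviation drops because the filled values sit exactly at the centre of the distribution. For strongly skewed columns, filling with the median is a common alternative.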

In [33]:
# For the categorical column, we will use mode imputation.
# A mean is not defined for categories, so we use the most frequently occurring value instead
categorical_columns_with_nulls = ["Hard_class"]

In [35]:
# Fill the nulls in each categorical column with that column's most frequent value
for col in categorical_columns_with_nulls:
    cleaned_df[col] = cleaned_df[col].fillna(cleaned_df[col].mode()[0])
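
A short sketch of what `mode()[0]` does, using made-up hardness labels (the series `hard` is hypothetical). `Series.mode()` ignores nulls by default and returns a Series, because several values can tie for most frequent; taking `[0]` picks the first of them in sorted order:

```python
import pandas as pd

# Hypothetical categorical column with two missing labels
hard = pd.Series(["BLANDA", "SEMIDURA", None, "BLANDA", None])

# mode() returns a Series of the most frequent value(s); nulls are ignored
most_common = hard.mode()[0]

# Replace the missing labels with the most frequent one
filled = hard.fillna(most_common)

print(most_common)
print(filled.tolist())
```

If two classes were equally frequent, the choice between them would be decided by sort order alone, so it is worth checking the value counts before relying on the result.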

In [37]:
cleaned_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 219 entries, 0 to 218
Data columns (total 16 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Date                 219 non-null    object 
 1   Time                 219 non-null    object 
 2   Sampling_point       219 non-null    object 
 3   Ambient_temperature  219 non-null    float64
 4   Ambient_humididty    219 non-null    float64
 5   Sample_temperature   219 non-null    float64
 6   ph                   219 non-null    float64
 7   Conductivity         219 non-null    int64  
 8   TDS                  219 non-null    int64  
 9   TSS                  219 non-null    float64
 10  DO                   219 non-null    float64
 11  Level                219 non-null    float64
 12  Turbidity            219 non-null    float64
 13  Hardness             219 non-null    float64
 14  Hard_class           219 non-null    object 
 15  CL                   219 non-null    flo

In [39]:
# Now, we will check if there are any null values left in our data
cleaned_df.isnull().sum()

Date                   0
Time                   0
Sampling_point         0
Ambient_temperature    0
Ambient_humididty      0
Sample_temperature     0
ph                     0
Conductivity           0
TDS                    0
TSS                    0
DO                     0
Level                  0
Turbidity              0
Hardness               0
Hard_class             0
CL                     0
dtype: int64

In [41]:
# Now that there are no more null values, we can re-assign our cleaned_df to the
# main df
df = cleaned_df

The date, time and sampling point columns add little to the statistical analysis methods that will be used, so they will be dropped. Before dropping the sampling point column, however, we will visualise the data to see where the samples were drawn from. The assumption here is that the date, time and location of sampling do not significantly affect the quality of the water in the rivers. The categorical variable, hardness classification, will be encoded and potentially used later, depending on the analysis done.

In [44]:
# Drop the date and time columns; Sampling_point is kept for now so it can be visualised first
df = df.drop(columns=["Date", "Time"])
df.head()

| | Sampling_point | Ambient_temperature | Ambient_humididty | Sample_temperature | ph | Conductivity | TDS | TSS | DO | Level | Turbidity | Hardness | Hard_class | CL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Puente Bilbao | 17.0 | 0.47 | 19.0 | 8.3 | 1630 | 810 | 1.8 | 4.3 | 38.277778 | 144.954083 | 147.0 | BLANDA | 156.0 |
| 1 | Puente Bilbao | 11.9 | 0.47 | 13.0 | 8.1 | 1000 | 490 | 18.0 | 5.3 | 38.277778 | 41.2 | 94.0 | BLANDA | 78.0 |
| 2 | Puente Bilbao | 11.9 | 0.47 | 13.0 | 8.2 | 1000 | 490 | 18.0 | 4.67 | 38.277778 | 38.9 | 86.0 | BLANDA | 82.0 |
| 3 | Arroyo_Las Torres | 11.9 | 0.47 | 13.0 | 8.3 | 1350 | 670 | 0.1 | 7.01 | 38.277778 | 30.7 | 200.0 | SEMIDURA | 117.0 |
| 4 | Arroyo_Las Torres | 11.9 | 0.47 | 13.0 | 8.5 | 1350 | 660 | 0.1 | 7.23 | 38.277778 | 25.6 | 196.0 | SEMIDURA | 117.0 |
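
The two remaining steps described above, summarising where the samples come from and encoding the hardness classification, could look roughly like the sketch below. It runs on a small made-up frame (`toy_df`, `samples_per_point`, `class_mapping` and the `Hard_class_encoded` column are illustrative names, and using `LabelEncoder` here is an assumption based on the import at the top of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature version of the cleaned frame
toy_df = pd.DataFrame({
    "Sampling_point": ["Puente Bilbao", "Puente Bilbao", "Arroyo_Las Torres"],
    "Hard_class": ["BLANDA", "BLANDA", "SEMIDURA"],
})

# Number of samples per sampling point; this is the data a bar chart would show
samples_per_point = toy_df["Sampling_point"].value_counts()

# Encode the hardness classes as integers (classes are sorted alphabetically)
encoder = LabelEncoder()
toy_df["Hard_class_encoded"] = encoder.fit_transform(toy_df["Hard_class"])

# Keep the label -> code mapping so the encoding can be interpreted later
class_mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

# Once the sampling points have been inspected, the column can be dropped
toy_df = toy_df.drop(columns=["Sampling_point"])

print(samples_per_point)
print(class_mapping)
```

Note that the integer codes follow alphabetical order, not water hardness, so they should not be read as an ordered scale unless the mapping happens to match.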


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Explore and visualize the data to uncover patterns, trends, and relationships.
* **Details:** Use statistics and visualizations to explore the data. This may include histograms, box plots, scatter plots, and correlation matrices. Discuss any significant findings.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.
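
As a possible starting point for this section, the sketch below computes summary statistics and a correlation matrix. It uses only the three numeric columns and five rows visible in the `df.head()` output above, so the numbers are illustrative and say nothing about the full dataset (`toy_df`, `summary` and `correlations` are assumed names):

```python
import pandas as pd

# First five rows of three numeric columns, copied from the head() output above
toy_df = pd.DataFrame({
    "ph": [8.3, 8.1, 8.2, 8.3, 8.5],
    "Conductivity": [1630, 1000, 1000, 1350, 1350],
    "TDS": [810, 490, 490, 670, 660],
    "Hard_class": ["BLANDA", "BLANDA", "BLANDA", "SEMIDURA", "SEMIDURA"],
})

# Count, mean, standard deviation and quartiles for each numeric column
summary = toy_df.describe()

# Pairwise Pearson correlations between the numeric columns only
correlations = toy_df.select_dtypes(include="number").corr()

print(summary)
print(correlations.round(3))
```

In these five rows, conductivity and TDS move almost in lockstep. Whether that holds across all 219 samples would need to be checked on the full frame.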

---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
* **Details:** Present metrics used to evaluate the models, such as accuracy, precision, recall, F1-score, etc. Discuss validation techniques employed, such as cross-validation or train/test split.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

## Additional Sections to Consider

* ### Appendix: 
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.
<ul>
    <li>
       <a href=https://github.com/ThandoCalana/QCTO-Workplace>GitHub</a> 
    </li>
    <li>
        <a href=https://trello.com/b/H3W6cShr/qcto-workplace>Trello Board</a>
    </li>
    <li>
        <a href=https://www.canva.com/design/DAGicydbHBg/lyCpbZMosfMPa9YJtTpl3w/edit> Presentation Slides </a>
    </li>
</ul>

* ### Contributors: 
If this is a group project, list the contributors and their roles or contributions to the project.
