# QCTO - Workplace Module

### Project Title: Please Insert your Project Title Here
#### Done By: Name and Surname

© ExploreAI 2024

---

<a id="cont"></a>
## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Introduce the project, outline its goals, and explain its significance.
* **Details:** Include information about the problem domain, the specific questions or challenges the project aims to address, and any relevant background information that sets the stage for the work.
---

---
<a id="one"></a>
## **Importing Packages**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Set up the Python environment with necessary libraries and tools.
* **Details:** List and import all the Python packages that will be used throughout the project such as Pandas for data manipulation, Matplotlib/Seaborn for visualization, scikit-learn for modeling, etc.
---

In [1]:
#Please use code cells to code in and do not forget to comment your code.
import pandas as pd
import numpy as np
import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer

In [2]:
!pip install scikit-learn



In [3]:
!pip install seaborn



In [5]:
# Modelling utilities (LabelEncoder and KNNImputer are already imported in the first cell)
from sklearn.model_selection import train_test_split, TimeSeriesSplit, cross_val_score
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error, mean_absolute_error, make_scorer


---
<a id="two"></a>
## **Data Collection and Description**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Describe how the data was collected and provide an overview of its characteristics.
* **Details:** Mention sources of the data, the methods used for collection (e.g., APIs, web scraping, datasets from repositories), and a general description of the dataset including size, scope, and types of data available (e.g., numerical, categorical).
---

In [10]:
#Please use code cells to code in and do not forget to comment your code.
data = pd.read_csv('River water parameters.csv')
data.head()

| | Date (DD/MM/YYYY) | Time (24 hrs XX:XX) | Sampling point | Ambient temperature (°C) | Ambient humidity | Sample temperature (°C) | pH | EC (µS/cm) | TDS (mg/L) | TSS (mL sed/L) | DO (mg/L) | Level (cm) | Turbidity (NTU) | Hardness (mg CaCO3/L) | Hardness classification | Total Cl- (mg Cl-/L) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 09/05/2023 | 14:15 | Puente Bilbao | 17.0 | 0.47 | 19.0 | 8.3 | 1630 | 810 | 1.8 | 4.3 | NaN | NaN | 147.0 | BLANDA | 156.0 |
| 1 | 14/06/2023 | 14:30 | Puente Bilbao | 11.9 | 0.47 | 13.0 | 8.1 | 1000 | 490 | 18.0 | 5.3 | NaN | 41.2 | 94.0 | BLANDA | 78.0 |
| 2 | 14/06/2023 | 14:30 | Puente Bilbao | 11.9 | 0.47 | 13.0 | 8.2 | 1000 | 490 | 18.0 | 4.67 | NaN | 38.9 | 86.0 | BLANDA | 82.0 |
| 3 | 14/06/2023 | 15:00 | Arroyo_Las Torres | 11.9 | 0.47 | 13.0 | 8.3 | 1350 | 670 | 0.1 | 7.01 | NaN | 30.7 | 200.0 | SEMIDURA | 117.0 |
| 4 | 14/06/2023 | 15:00 | Arroyo_Las Torres | 11.9 | 0.47 | 13.0 | 8.5 | 1350 | 660 | 0.1 | 7.23 | NaN | 25.6 | 196.0 | SEMIDURA | 117.0 |
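
The overview this section asks for (size, column types, missing values) could be produced along the following lines. This is only a sketch: the small frame is a stand-in built from a few columns of the rows shown above, and the variable names are illustrative. On the real file the same calls would be applied to `data`.

```python
import numpy as np
import pandas as pd

# Stand-in sample: a few columns copied from the rows displayed above.
sample = pd.DataFrame({
    "Sampling point": ["Puente Bilbao", "Puente Bilbao", "Puente Bilbao",
                       "Arroyo_Las Torres", "Arroyo_Las Torres"],
    "pH": [8.3, 8.1, 8.2, 8.3, 8.5],
    "Level (cm)": [np.nan] * 5,
    "Turbidity (NTU)": [np.nan, 41.2, 38.9, 30.7, 25.6],
    "Hardness classification": ["BLANDA", "BLANDA", "BLANDA",
                                "SEMIDURA", "SEMIDURA"],
})

n_rows, n_cols = sample.shape                 # size of the dataset
dtypes = sample.dtypes                        # storage type of each column
missing = sample.isna().sum()                 # missing values per column
# Everything that is not numeric is treated as categorical here.
categorical = sample.select_dtypes(exclude="number").columns.tolist()

print(f"{n_rows} rows x {n_cols} columns")
print(dtypes)
print(missing)
print("Categorical columns:", categorical)
```

Reporting the missing-value counts at this stage makes it easier to justify the imputation choices made during cleaning.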


---
<a id="three"></a>
## **Loading Data**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Load the data into the notebook for manipulation and analysis.
* **Details:** Show the code used to load the data and display the first few rows to give a sense of what the raw data looks like.
---

In [11]:
#Please use code cells to code in and do not forget to comment your code.
# Load the raw data; the following cells refer to this dataframe as `rd`
rd = pd.read_csv('River water parameters.csv')

In [12]:
rd.columns

Index(['Date (DD/MM/YYYY)', 'Time (24 hrs XX:XX)', 'Sampling point',
       'Ambient temperature (°C)', 'Ambient humidity',
       'Sample temperature (°C)', 'pH', 'EC\n(µS/cm)', 'TDS\n(mg/L)',
       'TSS\n(mL sed/L)', 'DO\n(mg/L)', 'Level (cm)', 'Turbidity (NTU)',
       'Hardness\n(mg CaCO3/L)', 'Hardness classification',
       'Total Cl-\n(mg Cl-/L)'],
      dtype='object')
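
Several of the printed names contain embedded newlines and units, which makes them awkward to type. One possible way to shorten them is an explicit old-to-new mapping, so that the result does not depend on column order. The sketch below uses an empty stand-in frame with the header shown above; the short names are the ones used in the cleaning section, and `RENAME` is an illustrative name.

```python
import pandas as pd

# Old name -> short name. Keys are copied from the printed column index.
RENAME = {
    "Date (DD/MM/YYYY)": "Date",
    "Time (24 hrs XX:XX)": "Time",
    "Sampling point": "Sampling_point",
    "Ambient temperature (°C)": "Ambient_temperature",
    "Ambient humidity": "Ambient_humididty",  # spelling kept as used in this notebook
    "Sample temperature (°C)": "Sample_temperature",
    "pH": "ph",
    "EC\n(µS/cm)": "Conductivity",
    "TDS\n(mg/L)": "TDS",
    "TSS\n(mL sed/L)": "TSS",
    "DO\n(mg/L)": "DO",
    "Level (cm)": "Level",
    "Turbidity (NTU)": "Turbidity",
    "Hardness\n(mg CaCO3/L)": "Hardness",
    "Hardness classification": "Hard_class",
    "Total Cl-\n(mg Cl-/L)": "CL",
}

# Empty stand-in frame carrying the raw header, in place of the loaded CSV.
raw = pd.DataFrame(columns=list(RENAME))

# Fail loudly if the file's header ever differs from what the mapping expects.
unexpected = set(raw.columns) - set(RENAME)
absent = set(RENAME) - set(raw.columns)
if unexpected or absent:
    raise KeyError(f"Header mismatch: unexpected={unexpected}, absent={absent}")

renamed = raw.rename(columns=RENAME)
print(renamed.columns.tolist())
```

Compared with assigning a list to `df.columns`, a mapping keeps working if the columns are reordered and raises an error if one is missing.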

---
<a id="four"></a>
## **Data Cleaning and Filtering**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Prepare the data for analysis by cleaning and filtering.
* **Details:** Include steps for handling missing values, removing outliers, correcting errors, and possibly reducing the data (filtering based on certain criteria or features).
---

In [18]:
#Please use code cells to code in and do not forget to comment your code.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer

# Define column names
COLUMNS = ["Date", "Time", "Sampling_point", "Ambient_temperature",
           "Ambient_humididty", "Sample_temperature", "ph", "Conductivity",
           "TDS", "TSS", "DO", "Level", "Turbidity", "Hardness", "Hard_class", "CL"]

# Copy the dataframe and assign new column names
df = rd.copy()
df.columns = COLUMNS

# Transform Date and Time columns to datetime format
df.Date = pd.to_datetime(df.Date, format="%d/%m/%Y")
df.Time = pd.to_datetime(df.Time, format="%H:%M")

# Create a new Datetime column by combining Date and Time
df["Datetime"] = df.Date + pd.to_timedelta(df.Time.dt.strftime("%H:%M:%S"))

# Set the Datetime column as the index
df.index = df.Datetime

# Drop the original Date and Time columns
df.drop(["Date", "Time"], axis=1, inplace=True)

# Extract minute, hour, day, month, and day of the week from Datetime
df["minute"] = df.Datetime.dt.minute
df["hour"] = df.Datetime.dt.hour
df["day"] = df.Datetime.dt.day
df["month"] = df.Datetime.dt.month
df["day_of_week"] = df.Datetime.dt.dayofweek

# Convert categorical variable 'Sampling_point' into dummy/indicator variables
df = pd.get_dummies(df, columns=["Sampling_point"], prefix="pt")

# Encode the 'Hard_class' column using LabelEncoder
labelencoder = LabelEncoder()
df.Hard_class = labelencoder.fit_transform(df.Hard_class)

# Drop the Datetime column (it is already the index) and the Hard_class column.
# Note: Hard_class is dropped right after being encoded, so the encoding above
# does not affect the final dataframe.
df.drop(["Datetime", "Hard_class"], axis=1, inplace=True)

# Handle missing values using KNNImputer
knn = KNNImputer(missing_values=np.nan, n_neighbors=5)
df = pd.DataFrame(knn.fit_transform(df), index=df.index, columns=df.columns)

# Display the first 3 rows of the cleaned dataframe
df.head(3)


| Datetime | Ambient_temperature | Ambient_humididty | Sample_temperature | ph | Conductivity | TDS | TSS | DO | Level | Turbidity | ... | minute | hour | day | month | day_of_week | pt_Arroyo Salguero | pt_Arroyo_Las Torres | pt_Puente Bilbao | pt_Puente Falbo | pt_Puente Irigoyen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2023-05-09 14:15:00 | 17.0 | 0.47 | 19.0 | 8.3 | 1630.0 | 810.0 | 1.8 | 4.3 | 32.0 | 47.54 | ... | 15.0 | 14.0 | 9.0 | 5.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2023-06-14 14:30:00 | 11.9 | 0.47 | 13.0 | 8.1 | 1000.0 | 490.0 | 18.0 | 5.3 | 45.6 | 41.2 | ... | 30.0 | 14.0 | 14.0 | 6.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2023-06-14 14:30:00 | 11.9 | 0.47 | 13.0 | 8.2 | 1000.0 | 490.0 | 18.0 | 4.67 | 44.0 | 38.9 | ... | 30.0 | 14.0 | 14.0 | 6.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
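
The date and time handling in the cell above can also be done in a single parsing step. The following is a minimal sketch, assuming the `Date` and `Time` columns are still strings in the `DD/MM/YYYY` and `HH:MM` layouts shown in the raw data; the three sample rows are copied from that output.

```python
import pandas as pd

# Stand-in sample with the raw string layout of the Date and Time columns.
sample = pd.DataFrame({
    "Date": ["09/05/2023", "14/06/2023", "14/06/2023"],
    "Time": ["14:15", "14:30", "15:00"],
})

# Concatenate the two strings and parse them once with an explicit format,
# which avoids the intermediate datetime -> string -> timedelta round trip.
sample["Datetime"] = pd.to_datetime(
    sample["Date"] + " " + sample["Time"], format="%d/%m/%Y %H:%M"
)

# Calendar features derived from the combined timestamp.
features = pd.DataFrame({
    "minute": sample["Datetime"].dt.minute,
    "hour": sample["Datetime"].dt.hour,
    "day": sample["Datetime"].dt.day,
    "month": sample["Datetime"].dt.month,
    "day_of_week": sample["Datetime"].dt.dayofweek,  # Monday = 0
})

print(sample["Datetime"])
print(features)
```

Passing an explicit `format` also guards against day and month being swapped for dates such as 09/05/2023.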


In [19]:
from scipy import stats
# Handling Outliers
# Calculate Z-scores for each column
z_scores = np.abs(stats.zscore(df.select_dtypes(include=[np.number])))

# Set a threshold for Z-score
threshold = 3

# Filter out rows with Z-scores above the threshold
df = df[(z_scores < threshold).all(axis=1)]

# Display the first 3 rows after removing outliers
df.head(3)


| Datetime | Ambient_temperature | Ambient_humididty | Sample_temperature | ph | Conductivity | TDS | TSS | DO | Level | Turbidity | ... | minute | hour | day | month | day_of_week | pt_Arroyo Salguero | pt_Arroyo_Las Torres | pt_Puente Bilbao | pt_Puente Falbo | pt_Puente Irigoyen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2023-05-09 14:15:00 | 17.0 | 0.47 | 19.0 | 8.3 | 1630.0 | 810.0 | 1.8 | 4.3 | 32.0 | 47.54 | ... | 15.0 | 14.0 | 9.0 | 5.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2023-06-14 14:30:00 | 11.9 | 0.47 | 13.0 | 8.1 | 1000.0 | 490.0 | 18.0 | 5.3 | 45.6 | 41.2 | ... | 30.0 | 14.0 | 14.0 | 6.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2023-06-14 14:30:00 | 11.9 | 0.47 | 13.0 | 8.2 | 1000.0 | 490.0 | 18.0 | 4.67 | 44.0 | 38.9 | ... | 30.0 | 14.0 | 14.0 | 6.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
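
One thing to watch with the Z-score filter above is that it is applied to every numeric column, including the 0/1 indicator columns. An indicator that is 1 for only a few rows gets a large Z-score on exactly those rows, so they may be removed even though nothing about the measurement is unusual. The sketch below illustrates this on made-up data; `zscore_mask` and the column `pt_rare` are hypothetical names, and the Z-score is computed with the population standard deviation (`ddof=0`), the same default as `scipy.stats.zscore`.

```python
import pandas as pd

# Made-up data: 20 rows, one extreme reading, one rarely-set indicator.
frame = pd.DataFrame({
    "Turbidity": [10.0] * 19 + [100.0],  # last row is an extreme reading
    "pt_rare": [1.0] + [0.0] * 19,       # indicator that is 1 for one row only
})


def zscore_mask(data, columns, threshold=3.0):
    """Return True for rows whose |z| is below the threshold in all given columns."""
    subset = data[columns]
    z = (subset - subset.mean()) / subset.std(ddof=0)
    return (z.abs() < threshold).all(axis=1)


# Filtering on every column also drops the row with the rare indicator.
kept_all_columns = frame[zscore_mask(frame, frame.columns.tolist())]

# Filtering on the continuous measurement only drops just the extreme reading.
kept_continuous = frame[zscore_mask(frame, ["Turbidity"])]

print(len(frame), len(kept_all_columns), len(kept_continuous))
```

Restricting the check to continuous measurements is one option; whether it is the right one depends on how the indicator columns are meant to be used later.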


In [20]:
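# Fill any remaining missing values with the mean of each column.
# The KNN imputation above should already have filled the gaps, so this acts as a safeguard.
# Assigning the result avoids modifying a filtered slice in place.
df = df.fillna(df.mean())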

# Alternatively, fill missing values with the median of each column
# df.fillna(df.median(), inplace=True)

# Display the first 3 rows after filling missing values
df.head(3)


| Datetime | Ambient_temperature | Ambient_humididty | Sample_temperature | ph | Conductivity | TDS | TSS | DO | Level | Turbidity | ... | minute | hour | day | month | day_of_week | pt_Arroyo Salguero | pt_Arroyo_Las Torres | pt_Puente Bilbao | pt_Puente Falbo | pt_Puente Irigoyen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2023-05-09 14:15:00 | 17.0 | 0.47 | 19.0 | 8.3 | 1630.0 | 810.0 | 1.8 | 4.3 | 32.0 | 47.54 | ... | 15.0 | 14.0 | 9.0 | 5.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2023-06-14 14:30:00 | 11.9 | 0.47 | 13.0 | 8.1 | 1000.0 | 490.0 | 18.0 | 5.3 | 45.6 | 41.2 | ... | 30.0 | 14.0 | 14.0 | 6.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2023-06-14 14:30:00 | 11.9 | 0.47 | 13.0 | 8.2 | 1000.0 | 490.0 | 18.0 | 4.67 | 44.0 | 38.9 | ... | 30.0 | 14.0 | 14.0 | 6.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |


In [21]:
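# Removing duplicate rows
# Flag rows whose column values repeat an earlier row
# (the Datetime index is not part of the comparison)
duplicates = df.duplicated()

# Remove duplicate rows
df = df[~duplicates]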

# Display the first 3 rows after removing duplicates
df.head(3)


| Datetime | Ambient_temperature | Ambient_humididty | Sample_temperature | ph | Conductivity | TDS | TSS | DO | Level | Turbidity | ... | minute | hour | day | month | day_of_week | pt_Arroyo Salguero | pt_Arroyo_Las Torres | pt_Puente Bilbao | pt_Puente Falbo | pt_Puente Irigoyen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2023-05-09 14:15:00 | 17.0 | 0.47 | 19.0 | 8.3 | 1630.0 | 810.0 | 1.8 | 4.3 | 32.0 | 47.54 | ... | 15.0 | 14.0 | 9.0 | 5.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2023-06-14 14:30:00 | 11.9 | 0.47 | 13.0 | 8.1 | 1000.0 | 490.0 | 18.0 | 5.3 | 45.6 | 41.2 | ... | 30.0 | 14.0 | 14.0 | 6.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2023-06-14 14:30:00 | 11.9 | 0.47 | 13.0 | 8.2 | 1000.0 | 490.0 | 18.0 | 4.67 | 44.0 | 38.9 | ... | 30.0 | 14.0 | 14.0 | 6.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
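
`DataFrame.duplicated()` compares column values only and ignores the index. Two readings taken at different times but with identical values would therefore be treated as duplicates. The sketch below shows the difference on made-up readings; bringing the index back as a column with `reset_index()` is one possible way to include the timestamp in the comparison.

```python
import pandas as pd

# Made-up readings: the last row repeats the values of the first row,
# but was taken on a different day.
idx = pd.to_datetime([
    "2023-06-14 14:30", "2023-06-14 14:30",
    "2023-06-14 15:00", "2023-07-01 09:00",
])
frame = pd.DataFrame(
    {"ph": [8.1, 8.2, 8.3, 8.1], "TDS": [490.0, 490.0, 670.0, 490.0]},
    index=idx,
)
frame.index.name = "Datetime"

# Values only: the last row is flagged although its timestamp differs.
by_values = frame.duplicated()

# Values plus timestamp: nothing is flagged.
by_values_and_time = frame.reset_index().duplicated()

# Keep rows that are not duplicated once the timestamp is taken into account.
deduped = frame[~by_values_and_time.to_numpy()]

print(by_values.tolist())
print(by_values_and_time.tolist())
print(len(deduped))
```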


In [22]:
#Normalizing/Scaling Data
from sklearn.preprocessing import StandardScaler

# Initialize the scaler
scaler = StandardScaler()

# Scale the numerical columns
df[df.select_dtypes(include=[np.number]).columns] = scaler.fit_transform(df.select_dtypes(include=[np.number]))

# Display the first 3 rows after scaling
df.head(3)


| Datetime | Ambient_temperature | Ambient_humididty | Sample_temperature | ph | Conductivity | TDS | TSS | DO | Level | Turbidity | ... | minute | hour | day | month | day_of_week | pt_Arroyo Salguero | pt_Arroyo_Las Torres | pt_Puente Bilbao | pt_Puente Falbo | pt_Puente Irigoyen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2023-05-09 14:15:00 | -0.131037 | -0.51417 | -0.202964 | 0.917224 | 1.407575 | 1.43722 | -1.666714 | 0.893597 | -0.636094 | -0.425391 | ... | -0.760467 | -1.187962 | -0.770847 | -2.170024 | -0.739105 | -0.503155 | -0.511025 | 2.321012 | -0.534522 | -0.518875 |
| 2023-06-14 14:30:00 | -1.141125 | -0.51417 | -1.749603 | 0.175957 | -1.021266 | -1.036626 | -1.069852 | 1.420793 | 0.505518 | -0.480287 | ... | 0.114954 | -1.187962 | -0.248507 | -1.574095 | 0.056236 | -0.503155 | -0.511025 | 2.321012 | -0.534522 | -0.518875 |
| 2023-06-14 14:30:00 | -1.141125 | -0.51417 | -1.749603 | 0.54659 | -1.021266 | -1.036626 | -1.069852 | 1.08866 | 0.37121 | -0.500202 | ... | 0.114954 | -1.187962 | -0.248507 | -1.574095 | 0.056236 | -0.503155 | -0.511025 | 2.321012 | -0.534522 | -0.518875 |
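
As the output shows, the indicator columns (`pt_...`) are standardized together with the measurements and no longer read as 0/1. Standardization subtracts each column's mean and divides by its population standard deviation, which is what `StandardScaler` computes. The sketch below applies that formula to the continuous columns only and leaves the indicators untouched. The values are a small sample taken from the raw data shown earlier, and the choice of which columns count as continuous is an assumption.

```python
import pandas as pd

# Small sample: two measurements and one 0/1 indicator column.
frame = pd.DataFrame({
    "ph": [8.3, 8.1, 8.2, 8.3, 8.5],
    "Conductivity": [1630.0, 1000.0, 1000.0, 1350.0, 1350.0],
    "pt_Puente Bilbao": [1.0, 1.0, 1.0, 0.0, 0.0],
})
continuous = ["ph", "Conductivity"]  # assumed split; indicators are excluded

# Keep the statistics so the transformation can be reused or inverted later.
means = frame[continuous].mean()
stds = frame[continuous].std(ddof=0)  # population standard deviation

scaled = frame.copy()
scaled[continuous] = (frame[continuous] - means) / stds

# Inverting the transformation recovers the original units.
restored = scaled[continuous] * stds + means

print(scaled)
```

Keeping `means` and `stds` around is useful when the same scaling has to be applied to new samples or when results need to be reported in the original units.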


In [23]:
#Filtering Data Based on Criteria
start_date = '2023-01-01'
end_date = '2023-12-31'
df_filtered = df[(df.index >= start_date) & (df.index <= end_date)]

# Display the first 3 rows of the filtered data
df_filtered.head(3)


| Datetime | Ambient_temperature | Ambient_humididty | Sample_temperature | ph | Conductivity | TDS | TSS | DO | Level | Turbidity | ... | minute | hour | day | month | day_of_week | pt_Arroyo Salguero | pt_Arroyo_Las Torres | pt_Puente Bilbao | pt_Puente Falbo | pt_Puente Irigoyen |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2023-05-09 14:15:00 | -0.131037 | -0.51417 | -0.202964 | 0.917224 | 1.407575 | 1.43722 | -1.666714 | 0.893597 | -0.636094 | -0.425391 | ... | -0.760467 | -1.187962 | -0.770847 | -2.170024 | -0.739105 | -0.503155 | -0.511025 | 2.321012 | -0.534522 | -0.518875 |
| 2023-06-14 14:30:00 | -1.141125 | -0.51417 | -1.749603 | 0.175957 | -1.021266 | -1.036626 | -1.069852 | 1.420793 | 0.505518 | -0.480287 | ... | 0.114954 | -1.187962 | -0.248507 | -1.574095 | 0.056236 | -0.503155 | -0.511025 | 2.321012 | -0.534522 | -0.518875 |
| 2023-06-14 14:30:00 | -1.141125 | -0.51417 | -1.749603 | 0.54659 | -1.021266 | -1.036626 | -1.069852 | 1.08866 | 0.37121 | -0.500202 | ... | 0.114954 | -1.187962 | -0.248507 | -1.574095 | 0.056236 | -0.503155 | -0.511025 | 2.321012 | -0.534522 | -0.518875 |
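
A detail worth knowing about the comparison-based filter above: the string `'2023-12-31'` is interpreted as midnight at the start of that day, so a sample taken later on 31 December would be excluded. Label-based slicing with `.loc` on a sorted `DatetimeIndex` includes the whole end day. The sketch below contrasts the two on made-up timestamps and values.

```python
import pandas as pd

# Made-up samples, including one taken during the last day of the range.
idx = pd.to_datetime([
    "2023-05-09 14:15", "2023-06-14 14:30",
    "2023-12-31 10:00", "2024-01-02 09:00",
])
frame = pd.DataFrame({"ph": [8.3, 8.1, 8.0, 7.9]}, index=idx).sort_index()

start_date = "2023-01-01"
end_date = "2023-12-31"

# Comparison: end_date is read as 2023-12-31 00:00, so the 10:00 sample is dropped.
by_comparison = frame[(frame.index >= start_date) & (frame.index <= end_date)]

# Slicing: the date-only end label covers the entire day.
by_slice = frame.loc[start_date:end_date]

print(len(by_comparison), len(by_slice))
```

Sorting the index first matters, because slicing with labels that are not present in the index requires it to be monotonic.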


---
<a id="six"></a>
## **Modeling**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Develop and train predictive or statistical models.
* **Details:** Describe the choice of models, feature selection and engineering processes, and show how the models are trained. Include code for setting up the models and explanations of the model parameters.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.
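
The utilities imported at the top of the notebook (`SelectKBest`, `f_regression`, `TimeSeriesSplit`, `cross_val_score`) suggest a regression set-up with time-ordered validation. The following sketch shows how those pieces could fit together. It runs on synthetic data, and both the target and the `LinearRegression` estimator are placeholders chosen for illustration, not the model selected for this project.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data: 120 time-ordered samples, 6 features,
# of which only the first two drive the (hypothetical) target.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=120)

# Feature selection sits inside the pipeline so that it is re-fitted on
# each training fold and never sees the corresponding validation fold.
model = make_pipeline(
    SelectKBest(score_func=f_regression, k=2),
    LinearRegression(),
)

# TimeSeriesSplit always trains on earlier samples and validates on later ones.
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
mae_per_fold = -scores  # scikit-learn reports errors as negative scores

print(mae_per_fold)
```

With real data, `X` would hold the cleaned feature columns sorted by timestamp and `y` the water-quality parameter chosen as the target.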

---
<a id="seven"></a>
## **Evaluation and Validation**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Evaluate and validate the effectiveness and accuracy of the models.
* **Details:** Present metrics used to evaluate the models, such as accuracy, precision, recall, F1-score, etc. Discuss validation techniques employed, such as cross-validation or train/test split.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.
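
For a regression target, the error metrics imported earlier (`mean_absolute_error`, `mean_squared_error`) are the relevant ones rather than accuracy or F1-score. The sketch below spells out what they compute using plain NumPy; the true and predicted values are made up for illustration.

```python
import numpy as np

# Made-up true values and predictions.
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

errors = y_pred - y_true

mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, in the target's units

print(f"MAE:  {mae:.3f}")
print(f"MSE:  {mse:.3f}")
print(f"RMSE: {rmse:.3f}")
```

MAE weights all errors equally, while RMSE penalizes large errors more heavily, so reporting both gives a fuller picture.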

---
<a id="eight"></a>
## **Final Model**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Present the final model and its performance.
* **Details:** Highlight the best-performing model and discuss its configuration, performance, and why it was chosen over others.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="nine"></a>
## **Conclusion and Future Work**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Summarize the findings and discuss future directions.
* **Details:** Conclude with a summary of the results, insights gained, limitations of the study, and suggestions for future projects or improvements in methodology or data collection.
---


In [None]:
#Please use code cells to code in and do not forget to comment your code.

---
<a id="ten"></a>
## **References**
<a href=#cont>Back to Table of Contents</a>

* **Purpose:** Provide citations and sources of external content.
* **Details:** List all the references and sources consulted during the project, including data sources, research papers, and documentation for tools and libraries used.
---

In [None]:
#Please use code cells to code in and do not forget to comment your code.

## Additional Sections to Consider

* ### Appendix: 
For any additional code, detailed tables, or extended data visualizations that are supplementary to the main content.

* ### Contributors: 
If this is a group project, list the contributors and their roles or contributions to the project.
