In [1]:

# Notebook Summary:

# V.History: 
# Date Last Modified: 14 May 2025

#--------------------------------------------------------------------------------------------------
'''
    This notebook filters the dataset based on SME feedback: it removes redundant determinands 
    and keeps only surface water measurements. It produces two datasets: 
    1) including reactive monitoring observations, and 2) excluding them. 
    In total, more than 34 million of the roughly 70 million observations are filtered out; 
    the remaining observations are used to build the model.
    Date: March 25, 2025

'''
#--------------------------------------------------------------------------------------------------


#--------------------------------------------------------------------------------------------------
#Prerequisites: 
    #Kernel: Python 3 (ipykernel) is required to run this notebook 
    #Python version: 3.10.15, with compatible NumPy and scikit-learn libraries

#Old Name: 12_NB_10Transpose_2959_All1.ipynb
#--------------------------------------------------------------------------------------------------


#--------------------------------------------------------------------------------------------------
#Proposed validation catchments: 
#a. River Dee (Wales)
#b. Catchments in Ireland (potentially available through Teagasc)
#c. Skerne
#d. Browney
#e. Frome
#f. Wye
#--------------------------------------------------------------------------------------------------

'''
    Intro section
'''

'\n    Intro section\n'

In [2]:
#Check python version compatibility 3.10 or above is required
!python -V
python_version=!(python --version 2>&1)
print (python_version)

Python 3.10.15
['Python 3.10.15']


In [3]:
ls ~/.local/share/jupyter/kernels

[0m[01;34mchem2[0m/


In [4]:
#Check python version compatibility 3.10 or above is required

#sudo nano  ~/.local/share/jupyter/kernels/python311/kernel.json
#to display the system version of the pandas
#!pip show pandas
#to display this notebook's kernel version of the pandas
#%pip show pandas

import sys
sys.version

'3.10.15 | packaged by conda-forge | (main, Oct 16 2024, 01:24:24) [GCC 13.3.0]'
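
The version requirement stated above (3.10 or above) can also be checked programmatically rather than by reading the printed string. The following is a minimal sketch; the helper name `meets_minimum` is hypothetical and not part of the notebook's utilities.

```python
import sys

def meets_minimum(version_info, minimum=(3, 10)):
    """Return True when the (major, minor) pair is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)

# Warn instead of failing, so the check stays non-intrusive in a notebook.
if not meets_minimum(sys.version_info):
    print("Warning: Python 3.10 or above is required, found", sys.version.split()[0])
```

Comparing tuples avoids parsing the version string, which can carry build details such as the conda-forge suffix shown above.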

In [5]:
####################################################################################################
#Begin CARD: filter data based on multiple rounds of domain SME feedback | WRC, Xylem, ADAS & CTS
####################################################################################################

In [6]:
%run "..//99_Common_Utils/99_NB_CommonUtils.ipynb" #Library Declaration section - Installing or Initiating all required Python Libraries

Intalling required libraries and utilities.....
Uses Python 3 (ipykernel) (Local)
Python 3.10.15
['Python 3.10.15']

[notice] A new release of pip is available: 25.0.1 -> 25.1.1
[notice] To update, run: pip install --upgrade pip

2025-05-13 17:19:06.449905: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1747156746.472172  341578 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1747156746.479375  341578 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1747156746.498621  341578 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1747156746.498641  341578 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1747156746.498643  341578 computation_placer.cc:177] computation placer alr


|| Completed intalling required libraries and utilities ||


In [7]:
'''
There are 52 distinct units of measure in the WQ monitoring data, and 38 of them occur in the P-Family. Phosphate or Orthophosphate is 
measured in 'mg/l'. We considered converting all of these units to a single unit of 'mg/l' (aligned with the P unit of measure). 
Below are the conversion functions / code snippets for that. However, the conversion was later discarded in favour of relative weights for 
different metrics in spatio-temporal models. The code is kept here to document this experiment.

A few notes on the considerations behind the conversion:

1. Density considerations: Some conversions, such as % w/w or mg/kg, assume the substance has a density similar to water (1 g/ml). 
If the density differs, adjust the conversion factor accordingly.

2. Context-specific units: Units like bq/l (radioactivity), cfu/0.1l (colony-forming units), us/cm (conductivity), 
and rfu (fluorescence) require additional context or calibration to convert meaningfully to mg/l.

3. Load units: Units like kg/ann, t/d, and t/qtr are load measurements; converting them depends on the total volume of water per time period.

The code provides a basic framework, but real-world applications may require more specific formulas and data, especially for 
context-specific units.


There are 38 unique units for the Phosphate family:
mg/l , coded, ug/l , % , mg/kg, ug/kg, m, ppm, ngr, unitless, phi, ng/l, min, deg, no/g, kg/d, 
deccafix, pres/nf, mw.s/sq. , no. , no/100ml, abs/cm, nm, bq/l, text, % v/v, kt/qtr, t/qtr, kg/qtr,
t/d, % w/w, yes/no, l/kg/h, cfu/0.1l, us/cm, g/l, t/wk, ug
'''

'''
Metrics conversion
'''

'\nMetrics conversion\n'
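
Note 3 above says that load units can only be turned into a concentration when the water volume per time period is known. As a minimal sketch of that arithmetic, assuming a daily load in kg/d and a measured flow in l/d (the function name and its parameters are hypothetical, not part of this notebook):

```python
def load_to_mg_per_l(load_kg_per_day, flow_l_per_day):
    """Convert a daily mass load (kg/d) to a concentration (mg/l)."""
    if flow_l_per_day <= 0:
        raise ValueError("flow_l_per_day must be positive")
    mg_per_day = load_kg_per_day * 1_000_000  # 1 kg = 1,000,000 mg
    return mg_per_day / flow_l_per_day

# Illustrative numbers only: 2 kg/d carried by a flow of 4,000,000 l/d.
print(load_to_mg_per_l(2, 4_000_000))  # 0.5
```

The same pattern would apply to weekly, quarterly, or annual loads once the flow is expressed over the matching period.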

In [8]:
# Conversion functions
def first_convert_to_mg_per_l(value, unit, density=1.0):
    """
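    Convert various units to mg/l.
    
    Parameters:
    - value: The numerical value to convert.
    - unit: The unit of the given value as a string.
    - density: Density of the substance in g/ml (default is 1.0, equivalent to water at room temperature).
      Currently unused: every density-dependent branch assumes 1 g/ml.

    Returns:
    - A (value, unit) tuple: the converted value with 'mg/l', or the original value and unit
      when the unit cannot be converted without further context.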
    """
    cUnit = 'mg/l'
    if unit == "mg/l":
        return value, cUnit
    elif unit == "%":
        # 1% = 10,000 mg/l
        return value * 10000, cUnit
    elif unit == "ug/l":
        # 1 ug/l = 0.001 mg/l
        return value * 0.001, cUnit
    elif unit == "mg/kg":
        # Assuming water density, 1 mg/kg = 1 mg/l
        return value, cUnit
    elif unit == "% v/v":
        # 1% v/v = 10,000 mg/l (for liquids like ethanol in water)
        return value * 10000, cUnit
    elif unit == "% w/w":
        # 1% w/w = 10,000 mg/l (assuming density is similar to water)
        return value * 10000, cUnit
    elif unit == "bq/l":
        # Conversion depends on the specific isotope (radioactivity), so the value is returned unconverted
        #raise ValueError("Conversion for 'bq/l' requires context-specific information.")
        return value, unit
    elif unit == "cfu/0.1l":
        # Colony forming units conversion needs microbiological context
        #raise ValueError("Conversion for 'cfu/0.1l' requires context-specific information.")
        return value, unit
    elif unit == "g/l":
        # 1 g/l = 1000 mg/l
        return value * 1000, cUnit
    elif unit == "kg/ann":
        # Annual load conversion requires volume flow data
        #raise ValueError("Conversion for 'kg/ann' requires annual volume flow data.")
        return value, unit
    elif unit == "kg/d":
        # Daily load conversion requires volume flow data
        #raise ValueError("Conversion for 'kg/d' requires daily volume flow data.")
        return value, unit
    elif unit == "kg/qtr":
        # Quarterly load conversion requires volume flow data
        #raise ValueError("Conversion for 'kg/qtr' requires quarterly volume flow data.")
        return value, unit
    elif unit == "kt/qtr":
        # Kilotons per quarter conversion requires volume flow data
        #raise ValueError("Conversion for 'kt/qtr' requires quarterly volume flow data.")
        return value, unit
    elif unit == "ug/kg":
        # Assuming water density, 1 ug/kg = 0.001 mg/l
        return value * 0.001, cUnit
    elif unit == "ng/l":
        # 1 ng/l = 1e-6 mg/l
        return value * 1e-6, cUnit
    elif unit == "no/100ml":
        # Conversion depends on the context of measurement
        #raise ValueError("Conversion for 'no/100ml' requires context-specific information.")
        return value, unit
    elif unit == "us/cm":
        # Conductivity conversion needs additional information about the substance
        #raise ValueError("Conversion for 'us/cm' requires substance-specific details.")
        return value, unit
    elif unit == "no/g":
        # Conversion for counts per gram needs context
        #raise ValueError("Conversion for 'no/g' requires context-specific information.")
        return value, unit
    elif unit == "ppm":
        # 1 ppm = 1 mg/l (assuming water density)
        return value * 1, cUnit
    elif unit == "t/ann":
        # Tons per annum conversion requires flow data
        #raise ValueError("Conversion for 't/ann' requires annual volume flow data.")
        return value, unit
    elif unit == "t/d":
        # Tons per day conversion requires flow data
        #raise ValueError("Conversion for 't/d' requires daily volume flow data.")
        return value, unit
    elif unit == "t/qtr":
        # Tons per quarter conversion requires flow data
        #raise ValueError("Conversion for 't/qtr' requires quarterly volume flow data.")
        return value, unit
    elif unit == "t/wk":
        # Tons per week conversion requires flow data
        #raise ValueError("Conversion for 't/wk' requires weekly volume flow data.")
        return value, unit
    elif unit == "ug":
        # Micrograms conversion needs volume data
        #raise ValueError("Conversion for 'ug' requires volume data.")
        return value, unit
    elif unit == "ng/kg":
        # Assuming water density, 1 ng/kg = 1e-6 mg/l
        return value * 1e-6, cUnit
    elif unit == "pg/l":
        # 1 pg/l = 1e-9 mg/l
        return value * 1e-9, cUnit
    elif unit == "pg/m3":
        # Conversion requires volume information
        #raise ValueError("Conversion for 'pg/m3' requires volume context.")
        return value, unit
    elif unit == "rfu":
        # Relative fluorescence units are context-specific
        #raise ValueError("Conversion for 'rfu' is context-specific and requires calibration data.")
        return value, unit
    else:
        #raise ValueError("Unknown unit: {}".format(unit))
        return value, unit

# Example usage
try:
    result, resultUnit = first_convert_to_mg_per_l(5, "%")
    print("Converted value: {} {}".format(result, resultUnit))
except ValueError as e:
    print("Error:", e)


Converted value: 50000 mg/l
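
The function above accepts a `density` argument but never applies it. A minimal sketch of how the density note could be honoured for mass-fraction units follows, assuming density is given in g/ml (numerically equal to kg/l); the names `MG_PER_KG_FACTORS` and `mass_fraction_to_mg_per_l` are hypothetical.

```python
# Factors that take each mass-fraction unit to mg/kg.
MG_PER_KG_FACTORS = {
    "mg/kg": 1.0,
    "ug/kg": 1e-3,
    "ng/kg": 1e-6,
    "% w/w": 10_000.0,  # 1% w/w = 10 g/kg = 10,000 mg/kg
}

def mass_fraction_to_mg_per_l(value, unit, density_g_per_ml=1.0):
    """Convert a mass fraction to mg/l: (mg/kg) * (kg/l) = mg/l."""
    mg_per_kg = value * MG_PER_KG_FACTORS[unit]
    return mg_per_kg * density_g_per_ml

# With the default density of water the result equals the mg/kg value.
print(mass_fraction_to_mg_per_l(5, "mg/kg"))
# A denser medium (1.5 g/ml is an illustrative value) scales the result.
print(mass_fraction_to_mg_per_l(2, "% w/w", density_g_per_ml=1.5))
```

With the default density the sketch reproduces the water-density assumption used in the original branches.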


In [9]:
# Conversion functions
def second_convert_to_mg_per_l(value, unit):
    """
    Converts various metrics to mg/L.
    Only applicable units are converted; others raise a ValueError.
    """
    unit = unit.lower()
    conversion_factors = {
        "mg/l": 1,            # Already in mg/L
        "%": 10_000,          # Assumes % w/v (1% = 10,000 mg/L)
        "ug/l": 1e-3,         # Micrograms per liter
        "mg/kg": 1,           # Assuming 1 kg/L density
        "% v/v": 10_000,      # Assumes % v/v in water
        "% w/w": 10_000,      # Assumes % w/w in water
        "g/l": 1_000,         # Grams per liter
        "ug/kg": 1e-3,        # Micrograms per kg (assuming 1 kg/L density)
        "ng/l": 1e-6,         # Nanograms per liter
        "ppm": 1,             # Equivalent to mg/L
        "bq/l": 1,            # Assumes radionuclide mass equivalent in mg/L
        "cfu/0.1l": 10,       # Colony-forming units per 0.1L to mg/L (arbitrary mass equivalence)
        "kg/d": 1 / 0.001,    # Assuming flow of 1 L/d
        "kg/qtr": 1 / (0.001 * 90),  # Quarterly conversion with 90 days assumed
        "kt/qtr": 1e6 / (0.001 * 90),# Kilotons per quarter
        "no/100ml": 10,       # Assuming 1 unit mass equivalent
        "us/cm": None,        # Cannot convert electrical conductivity to mg/L directly
        "no/g": None,         # Cannot convert counts to mg/L directly
        "t/d": 1e6 / 0.001,   # Tons per day assuming 1 L/d
        "t/qtr": 1e6 / (0.001 * 90),
        "t/wk": 1e6 / (0.001 * 7),
        "coded": None,        # Unsupported
        "m": None,            # Unsupported
        "ngr": None,          # Unsupported
        "unitless": None,     # Unsupported
        "phi": None,          # Unsupported
        "min": None,          # Unsupported
        "deg": None,          # Unsupported
        "deccafix": None,     # Unsupported
        "pres/nf": None,      # Unsupported
        "mw.s/sq.": None,     # Unsupported
        "no.": None,          # Unsupported
        "abs/cm": None,       # Unsupported
        "nm": None,           # Unsupported
        "text": None,         # Unsupported
        "yes/no": None,       # Unsupported
        "l/kg/h": None,       # Unsupported
        "ug": 1e-3,           # Micrograms to mg
    }

    if unit in conversion_factors and conversion_factors[unit] is not None:
        return value * conversion_factors[unit]
    else:
        #raise ValueError(f"Unsupported or non-convertible unit: {unit}")
        #raise ValueError(f"Unsupported")
        raise ValueError(np.nan)


# Example usage
try:
    #User input
    #value = float(input("Enter the value: "))
    #unit = input("Enter the unit: ")
    
    #Hardcoded values
    value = 1.1
    unit = 'ug'
    unit = 'text'
    result = second_convert_to_mg_per_l(value, unit)
    print(f"{value} {unit} is equal to {result} mg/L.")
except ValueError as e:
    print(e)

nan
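
The cell above signals a non-convertible unit by raising `ValueError(np.nan)`, which prints `nan` when caught. An alternative, sketched below under the assumption that callers prefer a missing value over an exception, is to return NaN directly so that a whole column of readings can be processed in one pass. The factor table is a small illustrative subset and the function name is hypothetical.

```python
import math

# Illustrative subset of concentration units with exact factors to mg/l.
FACTORS_TO_MG_PER_L = {"mg/l": 1.0, "ug/l": 1e-3, "ng/l": 1e-6, "g/l": 1_000.0}

def to_mg_per_l_or_nan(value, unit):
    """Return the value in mg/l, or NaN when the unit has no known factor."""
    factor = FACTORS_TO_MG_PER_L.get(unit.strip().lower())
    return math.nan if factor is None else value * factor

# Made-up readings; 'us/cm' (conductivity) has no mass-concentration equivalent.
readings = [(250, "ug/l"), (0.4, "mg/l"), (7, "us/cm")]
converted = [to_mg_per_l_or_nan(v, u) for v, u in readings]
print(converted)
```

Returning NaN keeps unsupported rows visible in the result, so they can be counted or dropped explicitly later.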


In [10]:
#STEP 1: Use the 24 years of WQ data for England gathered for the Phosphate family 
# to carry out the transpose operation
#==========================================================================================================

In [None]:
# Read Full Phosphate dataset
folderpath = cleansed
#filename = '12_CSV_Full_Phosphate_Dataset_w_Clusters.csv'

#filename = '10_WQEA_2019_24_full.csv' March 1 2025
#filename = '03_WQEA_2000_2024_Cleansed_Sorted_New.csv' #Commented on 12 May 2025
filename = '03_WQEA_2000_2024_Cleansed_New.csv'
showtime()
df = loaddata(folderpath, filename, path = 'gcs://rdmai_dev_data/')
showtime()

print (df['determinand_notation'].nunique(), len(df))
df.head(2)

13 May 2025 17:19:21
gcs://rdmai_dev_data/cleansed/03_WQEA_2000_2024_Cleansed_New.csv


In [None]:
''' #Commented out so that the correlation of all 2959 determinands is considered - conventional way suggested by Xylem
#Section_82_Yes 20 Determinands After Xylem discussion
Section_82_Yes = [61,68,76,81,82,119,3169,3656,3976,5282,6872,7064,
                  7786,8153,8958,9261,9803,9901,9921,9924]

#Section_82_Unsure
lstSection_82_Unsure = [152,153,159,162,5245,8467,8471,9087,9088,9089,9821]

#First 18 parameters using Pearson correlation for the Phosphate - Feb 27 2025
First_18_Pearson_Pt = [912,914,8128,384,1361,7245,1284,8648,6971,
                       947,6972,82,6453,8476,7668,81,6854,162,8504]

#353
#7956
#7568
#3305

altitude = [5]

temperature = [22, 24, 76, 1181, 3026, 6530, 8091, 8918]

orthoP = [180]
orthoPLike = [188, 191, 8068, 8755, 9398, 9856]

Phosphate = [192]
PhophateLike = [4127, 7315, 8504]


from itertools import chain
full_list = list(chain(First_18_Pearson_Pt, 
                       Section_82_Yes,
                       altitude, temperature, 
                       orthoP, orthoPLike,
                       Phosphate, PhophateLike))
full_list = list(dict.fromkeys(full_list))

dfPfull = df[df.determinand_notation.isin(full_list)]
''' #Commented out so that the correlation of all 2959 determinands is considered - conventional way suggested by Xylem

dfPfull = df.copy() #added this to retain the dataset of all 2959 determinands

print(dfPfull['purpose_name'].value_counts())
print(dfPfull['sampledMaterialType_name'].value_counts())
dfPfull.head(2)


In [None]:
print(dfPfull['determinand_notation'].nunique())
print ('Whole Dataset Rows: ', len(dfPfull))

In [None]:

#Set by WRC/Xylem Domain SMEs on 31 Mar 2025 
#Only look at samples from River/Running/Surface. Rivers are difficult enough on their own. Lakes, sea, canal, and estuaries are too different.
sMaterialType = ['RIVER / RUNNING SURFACE WATER']

'''
#Set by Nicolai on 26 Mar 2025 along with WRC input on the same
sMaterialType = ['RIVER / RUNNING SURFACE WATER',
                 'ESTUARINE WATER','SEA WATER','CANAL WATER',
                 'POND / LAKE / RESERVOIR WATER']
'''

'''
# Old and discarded - Pasu 27 Mar 2025
sMaterialType =['WHITEBAIT / JUVENILES - GILL',
                'WASTE - BULK MATERIAL',
                'UNCODED',
                'SURFACE DRAINAGE',
                'STORM TANK INFLUENT',
                'STORM SEWER OVERFLOW DISCHARGE',
                'SOLEA SOLEA - COMMON SOLE - WHOLE ANIMAL',
                'SOLEA SOLEA - COMMON SOLE - MUSCLE',
                'SOLEA SOLEA - COMMON SOLE - LIVER OR DIGESTIVE GLAND',
                'SOIL WATER',
                'SOIL - <90UM FRACTION',
                'SOIL - <63UM FRACTION',
                'SOIL - <2000UM FRACTION',
                'SOIL',
                'SILTY CLAY SOIL',
                'SILTY CLAY LOAM SOIL',
                'SILT LOAM SOIL',
                'SEA WATER AT LOW TIDE',
                'SEA WATER AT HIGH TIDE',
                'SEA WATER - INTERTIDAL',
                'SEA WATER',
                'SCROBICULARIA PLANA - PEPPERY FURROW SHELL - WHOLE ANIMAL',
                'SCALLOPS (MOLLUSC) - WHOLE ANIMAL',
                'SANDY SOIL',
                'SANDY SILT LOAM SOIL',
                'SANDY LOAM SOIL',
                'SANDY CLAY SOIL',
                'SALMO TRUTTA - SEA TROUT - TISSUE - SEE COMMENTS',
                'SALMO TRUTTA - SEA TROUT - MUSCLE',
                'SALMO TRUTTA - SEA TROUT - LIVER OR DIGESTIVE GLAND',
                'SALMO TRUTTA - SEA TROUT - GILL',
                'SALMO TRUTTA - BROWN TROUT - WHOLE ANIMAL',
                'SALMO TRUTTA - BROWN TROUT - TISSUE - SEE COMMENTS',
                'SALMO TRUTTA - BROWN TROUT - MUSCLE',
                'SALMO TRUTTA - BROWN TROUT - LIVER OR DIGESTIVE GLAND',
                'SALMO TRUTTA - BROWN TROUT - GILL',
                'RYNOCHOSTEGIUM RIPARIOIDES - MOSS - WHOLE PLANT',
                'RUTILUS RUTILUS - ROACH - WHOLE ANIMAL',
                'RUTILUS RUTILUS - ROACH - MUSCLE',
                'RUTILUS RUTILUS - ROACH - LIVER OR DIGESTIVE GLAND',
                'RUTILUS RUTILUS - ROACH - GILL',
                'RUNNING SURFACE WATER SEDIMENT - <63UM FRACTION',
                'RUNNING SURFACE WATER SEDIMENT - <2000UM FRACTION',
                'RUNNING SURFACE WATER SEDIMENT',
                'RIVER / RUNNING SURFACE WATER',
                'RANUNCULUS SPP - BUTTERCUP - WHOLE PLANT',
                'RANUNCULUS SPP - BUTTERCUP - TISSUE - SEE COMMENTS',
                'PRECIPITATION',
                'POTABLE WATER',
                'POND / LAKE / RESERVOIR WATER SEDIMENT - <63UM FRACTION',
                'POND / LAKE / RESERVOIR WATER SEDIMENT',
                'POND / LAKE / RESERVOIR WATER',
                'PLEURONECTES PLATESSA - PLAICE - WHOLE ANIMAL',
                'PLEURONECTES PLATESSA - PLAICE - MUSCLE',
                'PLEURONECTES PLATESSA - PLAICE - LIVER OR DIGESTIVE GLAND',
                'PLATICTHYS FLESUS - FLOUNDER - WHOLE ANIMAL',
                'PLATICTHYS FLESUS - FLOUNDER - MUSCLE',
                'PLATICTHYS FLESUS - FLOUNDER - LIVER OR DIGESTIVE GLAND',
                'OYSTERS (MOLLUSC) - WHOLE ANIMAL',
                'OYSTERS (MOLLUSC) - MUSCLE',
                'OSTEREA EDULIS - NATIVE OYSTER - WHOLE ANIMAL',
                'NEREIS DIVERSICOLOR - RAG WORM - WHOLE ANIMAL',
                'MYTILUS EDULIS - MUSSEL - WHOLE ANIMAL',
                'MYTILUS EDULIS - MUSSEL - TISSUE - SEE COMMENTS',
                'MYTILUS EDULIS - MUSSEL - MUSCLE',
                'MINEWATER (FLOWING/PUMPED)',
                'MINEWATER',
                'MACOMA BALTHICA - BALTIC TELLIN - WHOLE ANIMAL',
                'LOAMY SAND SOIL',
                'LOAM SOIL',
                'LIMANDA LIMANDA - DAB - WHOLE ANIMAL',
                'LIMANDA LIMANDA - DAB - MUSCLE',
                'LIMANDA LIMANDA - DAB - LIVER OR DIGESTIVE GLAND',
                'LANDFILL WASTE',
                'INCINERATOR ASH',
                'FUCUS VESICULOSUS - BLADDER WRACK - WHOLE PLANT',
                'FUCUS VESICULOSUS - BLADDER WRACK - TISSUE - SEE COMMENTS',
                'FUCUS SPIRALIS - SPIRAL WRACK - TISSUE - SEE COMMENTS',
                'FONTINALIS ANTIPYRETICA - MOSS - WHOLE PLANT',
                'ESTUARY SEDIMENT - SUB TIDAL - <90UM FRACTION',
                'ESTUARY SEDIMENT - SUB TIDAL - <63UM FRACTION',
                'ESTUARY SEDIMENT - SUB TIDAL - <2000UM FRACTION',
                'ESTUARY SEDIMENT - SUB TIDAL',
                'ESTUARY SEDIMENT - INTER TIDAL - <63UM FRACTION',
                'ESTUARY SEDIMENT - INTER TIDAL',
                'ESTUARY SEDIMENT - <63UM FRACTION',
                'ESTUARY SEDIMENT',
                'ESTUARINE WATER AT LOW TIDE',
                'ESTUARINE WATER AT HIGH TIDE',
                'ESTUARINE WATER - INTERTIDAL',
                'ESTUARINE WATER',
                'CRASSOSTEREA GIGAS - PACIFIC OYSTER - WHOLE ANIMAL',
                'CRANGON CRANGON - BROWN SHRIMP - WHOLE ANIMAL',
                'CONSTRUCTION WASTE',
                'COASTAL / MARINE SEDIMENT - <63UM FRACTION',
                'COASTAL / MARINE SEDIMENT',
                'CLAY SOIL',
                'CLAY LOAM SOIL',
                'CERASTODERMA EDULE - COCKLE - WHOLE ANIMAL',
                'CERASTODERMA EDULE - COCKLE - TISSUE - SEE COMMENTS',
                'CERASTODERMA EDULE - COCKLE - MUSCLE',
                'CANAL WATER SEDIMENT - <63 UM FRACTION',
                'CANAL WATER SEDIMENT',
                'CANAL WATER - SALINE',
                'CANAL WATER',
                'CALIBRATION WATER',
                'BOREHOLE GAS',
                'ARENICOLA MARINA - LUGWORM - WHOLE ANIMAL',
                'ANY WATER',
                'ANY TIPPED SOLIDS',
                'ANY SOLID/SEDIMENT - UNSPECIFIED',
                'ANY SHELLFISH (MOLLUSC & CRUSTN) - TISSUE - SEE COMMENTS',
                'ANY SHELLFISH (MOLLUSC & CRUSTACEAN) - WHOLE ANIMAL',
                'ANY OIL',
                'ANY NON-AQUEOUS LIQUID',
                'ANY LEACHATE',
                'ANY INVERTEBRATE - WHOLE ANIMAL',
                'ANY HIGHER PLANT - WHOLE PLANT',
                'ANY HIGHER PLANT - TISSUE - SEE COMMENTS',
                'ANY GAS',
                'ANY FLATFISH - WHOLE ANIMAL',
                'ANY FLATFISH - MUSCLE',
                'ANY FLATFISH - LIVER OR DIGESTIVE GLAND',
                'ANY FISH - NOT INCLUDING FLATFISH - WHOLE ANIMAL',
                'ANY FISH - NOT INCLUDING FLATFISH - MUSCLE',
                'ANY FISH - NOT INCLUDING FLATFISH - LIVER OR DIG. GLAND',
                'ANY FISH - NOT INCLUDING FLATFISH - GILL',
                'ANY BRYOPHYTE - MOSSES & LIVERWORTS - WHOLE PLANT',
                'ANY BRYOPHYTE - MOSSES & LIVERWORTS - TISSUE-SEE COMMENTS',
                'ANY BRYOPHYTE - MOSSES & LIVERWORTS - 2 CM TIPS (PLANT)',
                'ANY BIOTA',
                'ANY ASH',
                'ANY AGRICULTURAL',
                'ANY  ALGAE / SEAWEED - WHOLE PLANT',
                'ANY  ALGAE / SEAWEED - TISSUE - SEE COMMENTS',
                'ANY  ALGAE / SEAWEED - 2 CM TIPS (PLANT)',
                'ANGUILLA ANGUILLA - EEL - WHOLE ANIMAL',
                'ANGUILLA ANGUILLA - EEL - MUSCLE',
                'ANGUILLA ANGUILLA - EEL - LIVER OR DIGESTIVE GLAND',
                'ANGUILLA ANGUILLA - EEL - GILL',
                'AIR']

#Whole dataset (all material types), rows              :  68470660
#Whole dataset (the five material types below), rows   :  46761409
#   1. RIVER / RUNNING SURFACE WATER, 2. ESTUARINE WATER, 
#   3. SEA WATER, 4. CANAL WATER, 
#   5. POND / LAKE / RESERVOIR WATER 
#Whole dataset (one material type only), rows          :  33770650
#   1. RIVER / RUNNING SURFACE WATER 
'''

dfPfull = dfPfull[dfPfull.sampledMaterialType_name.isin(sMaterialType)]
print ('Whole Dataset Rows: ', len(dfPfull), dfPfull['determinand_notation'].nunique())

#Whole Dataset Rows:  33770650 1897
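
The filter-then-print pattern used in this cell recurs in the following cells. A minimal sketch of a reusable helper is shown below; `filter_and_report` is a hypothetical name, and the toy DataFrame is invented purely for illustration (only the column names come from the notebook).

```python
import pandas as pd

def filter_and_report(df, column, values, keep=True):
    """Keep (or drop) rows whose `column` is in `values`, then print the counts."""
    mask = df[column].isin(values)
    out = df[mask] if keep else df[~mask]
    print("Whole Dataset Rows: ", len(out), out["determinand_notation"].nunique())
    return out

# Toy data: five invented rows, one of them sea water.
river = "RIVER / RUNNING SURFACE WATER"
toy = pd.DataFrame({
    "determinand_notation": [180, 3267, 90, 180, 76],
    "sampledMaterialType_name": [river, river, "SEA WATER", river, river],
})

rivers = filter_and_report(toy, "sampledMaterialType_name", [river])
cleaned = filter_and_report(rivers, "determinand_notation", [3267, 3724], keep=False)
```

Passing `keep=False` covers the exclusion filters, so the row and determinand counts are reported the same way at every stage.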


In [None]:
# SME feedback 1
# Pasupathi: The approach taken above is a hybrid that takes the best of both. Once we obtain the model metrics, we can determine 
# whether a further deep dive is required. However, the determinands below are removed.

#1. Useless: "type of flow as description" (Determinand ID 3267, with 193445 records), "beach signage confirmation", 
# "state of visible pollution scale 0-3" (Determinand ID 3724, with 224 records). 

useless_Determinands_To_Remove = [3267, 3724]
dfPfull = dfPfull[~dfPfull.determinand_notation.isin(useless_Determinands_To_Remove)]
print ('Whole Dataset Rows: ', len(dfPfull), dfPfull['determinand_notation'].nunique())

#Whole Dataset Rows:  33576981 1895


In [16]:
# SME feedback 2
# Pasupathi: The approach taken above is a hybrid that takes the best of both. Once we obtain the model metrics, we can determine whether a further 
# deep dive is required. However, the determinands below are removed.

#2. Chemicals might not be as relevant/useful for the model. Worth testing the model with and without them in training? <<The chemical determinands 
#   removed below are 90, 91, 92, 8080, 8728>>
chemicals_To_Remove = [90, 91, 92, 8080, 8728]
dfPfull = dfPfull[~dfPfull.determinand_notation.isin(chemicals_To_Remove)]
print ('Whole Dataset Rows: ', len(dfPfull), dfPfull['determinand_notation'].nunique())

#Whole Dataset Rows:  33424159 1891


Whole Dataset Rows:  33424159 1891


In [17]:
with_Reactive_Monitoring = ['UNPLANNED REACTIVE MONITORING (POLLUTION INCIDENTS)', 'UNPLANNED REACTIVE MONITORING FORMAL (POLLUTION INCIDENTS)']
dfPfull_without_react_mon = dfPfull[~dfPfull.purpose_name.isin(with_Reactive_Monitoring)]
print ('Whole Dataset Rows: ', len(dfPfull_without_react_mon), dfPfull_without_react_mon['determinand_notation'].nunique())

#Whole Dataset Rows:  32437285 1637


Whole Dataset Rows:  32437285 1637
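
The row counts recorded in the comments and outputs above can be lined up to see how much each filter removes. The sketch below only redoes that arithmetic with the figures already shown in this notebook; the stage labels and the function name `stage_summary` are illustrative.

```python
# Row counts copied from the comments and outputs above.
stages = [
    ("all material types", 68_470_660),
    ("river / running surface water only", 33_770_650),
    ("after removing determinands 3267 and 3724", 33_576_981),
    ("after removing the five chemical determinands", 33_424_159),
    ("after excluding reactive monitoring", 32_437_285),
]

def stage_summary(stages):
    """Return (label, rows, rows dropped at this stage, % of original retained)."""
    total = stages[0][1]
    previous = total
    rows = []
    for label, count in stages:
        rows.append((label, count, previous - count, round(100 * count / total, 1)))
        previous = count
    return rows

summary = stage_summary(stages)
for label, count, dropped, pct in summary:
    print(f"{label:<48} {count:>11,} dropped {dropped:>11,} retained {pct}%")
```

The material-type filter accounts for 34,700,010 of the removed rows, and excluding reactive monitoring removes a further 986,874.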


In [18]:
# Save full Family Dataset (Phosphate, Phosphorus, Section82) - 24 Years of data 

#dfPfull.to_csv('../NW_DataPP/12_NB_10Transpose_2959_All1.csv', index=False)
#savedata(dfPfull, "12_NB_10Transpose_2959_All1.csv", 'gcs://rdmai_dev_data/')                                    #Commented on 11 May 2025
#savedata(dfPfull, "12_NB_10Transpose_2959_All1_With_R_Mon.csv", 'gcs://rdmai_dev_data/')                         #Commented on 11 May 2025
#savedata(dfPfull_without_react_mon, "12_NB_10Transpose_2959_All1_Without_R_Mon.csv", 'gcs://rdmai_dev_data/')    #Commented on 11 May 2025

savedata(dfPfull, "07_nb_transpose_2959_all1.csv", 'gcs://rdmai_dev_data/')                                    #Added on 11 May 2025
savedata(dfPfull, "07_nb_transpose_2959_all1_with_r_mon.csv", 'gcs://rdmai_dev_data/')                         #Added on 11 May 2025
savedata(dfPfull_without_react_mon, "07_nb_transpose_2959_all1_without_r_mon.csv", 'gcs://rdmai_dev_data/')    #Added on 11 May 2025

saved, Location:  gcs://rdmai_dev_data/cleansed/07_nb_transpose_2959_all1.csv
saved, Location:  gcs://rdmai_dev_data/cleansed/07_nb_transpose_2959_all1_with_r_mon.csv
saved, Location:  gcs://rdmai_dev_data/cleansed/07_nb_transpose_2959_all1_without_r_mon.csv


('saved, Location: ',
 'gcs://rdmai_dev_data/cleansed/07_nb_transpose_2959_all1_without_r_mon.csv')

In [19]:
dfPfull['determinand_notation'].value_counts()

determinand_notation
111     1515362
76      1490833
116     1456760
61      1433943
180     1421184
9901    1371705
118     1335528
117     1264227
119     1241236
9924    1238488
162      926019
85       922484
135      851016
77       760222
6450     672422
6455     667400
172      666128
1183     408959
664      330008
241      306958
237      304849
158      301450
6841     292488
1181     279969
6528     267395
62       224968
348      209300
6452     206477
6529     206258
6530     205192
6460     189533
3410     187567
52       186633
6531     186385
108      183509
3408     179771
301      177553
182      175275
3409     173098
50       161699
7887     155748
6462     151590
3164     146950
211      146641
106      143076
6396     137435
9686     127610
6051     126989
192      126240
9992     118755
3683      99951
239       96198
114       95709
6458      88214
113       84274
3461      80478
183       79913
105       78773
6423      76929
2348      73331
6045      72505
235

In [20]:
dfPfull['determinand_definition'].value_counts()

determinand_definition
Ammoniacal Nitrogen as N                                                  1515362
Temperature of Water                                                      1490833
Nitrogen, Total Oxidised as N                                             1456760
pH                                                                        1433943
Orthophosphate, reactive as P                                             1421184
Oxygen, Dissolved, % Saturation                                           1371705
Nitrite as N                                                              1335528
Nitrate as N                                                              1264227
Ammonia un-ionised as N                                                   1241236
Oxygen, Dissolved as O2                                                   1238488
Alkalinity to pH 4.5 as CaCO3                                              926019
BOD : 5 Day ATU                                                            

In [21]:
dfPfull['determinand_name'].value_counts()

determinand_name
Ammonia(N)      1515362
Temp Water      1490833
N Oxidised      1456760
pH              1433943
Orthophospht    1421191
O Diss %sat     1371705
Nitrite-N       1335528
Nitrate-N       1264227
NH3 un-ion      1241236
Oxygen Diss     1238488
Alky pH 4.5      926019
BOD ATU          922484
Sld Sus@105C     851016
Cond @ 25C       760222
Cu Filtered      672422
Zinc - as Zn     667400
Chloride Ion     666128
WethPresPrec     408959
Oil & Grs Vs     330008
Calcium - Ca     307006
Magnesium-Mg     304897
Hardness         301450
Phenol Odour     292488
WethPresTemp     279969
Weth-Visibty     267395
Cond @ 20C       224968
Phosphorus-P     209300
Copper - Cu      206477
Weth7Dy-Prec     206258
Weth7Dy-Temp     205192
Fe- Filt         189533
Ni- Filtered     187567
Pb Filtered      186633
Weth7Dy-Vsy      186385
Cadmium - Cd     183509
Zn- Filtered     179771
C - Org Filt     177571
SiO2 Rv          175275
Cr- Filtered     173098
Lead - as Pb     161699
CHLOROPHYLL      155748

In [22]:
dfPfull['purpose_name'].value_counts()

purpose_name
ENVIRONMENTAL MONITORING STATUTORY (EU DIRECTIVES)            17746873
ENVIRONMENTAL MONITORING (GQA & RE ONLY)                       5512281
PLANNED INVESTIGATION (LOCAL MONITORING)                       3379090
MONITORING  (UK GOVT POLICY - NOT GQA OR RE)                   2155835
MONITORING  (NATIONAL AGENCY POLICY)                           1448691
STATUTORY FAILURES (FOLLOW UPS AT NON-DESIGNATED POINTS)        971620
UNPLANNED REACTIVE MONITORING (POLLUTION INCIDENTS)             780258
PLANNED INVESTIGATION (NATIONAL AGENCY POLICY)                  618719
STATUTORY FAILURES (FOLLOW UPS AT DESIGNATED POINTS)            268395
COMPLIANCE AUDIT (PERMIT)                                       253343
UNPLANNED REACTIVE MONITORING FORMAL (POLLUTION INCIDENTS)      206616
WASTE MONITORING (OPERATOR SELF-MONITORING DATA)                 36525
WASTE MONITORING (AGENCY INVESTIGATION)                          16995
WASTE MONITORING (AGENCY AUDIT - PERMIT)                        

In [23]:
dfPfull['isComplianceSample'].value_counts()

isComplianceSample
False    33166122
True       258037
Name: count, dtype: int64
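
The value counts printed above can be cross-checked against the row totals reported earlier. This is a minimal sketch that reuses only numbers already shown in this notebook; the helper name `reconcile` is hypothetical.

```python
def reconcile(total_rows, parts):
    """Return True when the category counts in `parts` add up to `total_rows`."""
    return sum(parts.values()) == total_rows

# Counts copied from the purpose_name and isComplianceSample outputs above.
reactive = {
    "UNPLANNED REACTIVE MONITORING (POLLUTION INCIDENTS)": 780_258,
    "UNPLANNED REACTIVE MONITORING FORMAL (POLLUTION INCIDENTS)": 206_616,
}
compliance = {False: 33_166_122, True: 258_037}

rows_with_reactive = 33_424_159
rows_without_reactive = 32_437_285

# The two reactive purposes should explain the gap between the two saved datasets.
print(reconcile(rows_with_reactive - rows_without_reactive, reactive))  # True
# The compliance flag should partition the full dataset.
print(reconcile(rows_with_reactive, compliance))  # True
```

Both checks pass with the figures above, which suggests the reactive-monitoring filter removed exactly the two intended purposes.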

In [24]:
dfPfull['sampledMaterialType_name'].value_counts()

sampledMaterialType_name
RIVER / RUNNING SURFACE WATER    33424159
Name: count, dtype: int64

In [25]:
dfPfull['samplingPoint_notation'].value_counts()

samplingPoint_notation
AN-WEN250      105060
AN-WAV120       75456
NW-88000045     75175
NW-88003872     74592
AN-YAR180       72344
NE-49100488     69373
MD-13598380     64780
MD-50050        59302
NW-88003843     54334
NW-88002634     42381
NW-88000879     40439
NE-49400409     39779
NW-88006220     39124
NW-88003532     36884
NW-88002884     34463
MD-23314180     33678
NE-49500343     33337
MD-46247300     32924
NE-49000137     32146
NW-88002348     31697
NW-88004563     31697
NW-88003147     31315
NE-49700156     31171
NW-88005740     31144
NW-88004397     30954
SO-F0002886     30600
TH-PTHR0107     30497
MD-38473020     30402
NE-42300080     30386
NE-49301624     30324
NW-88022212     30302
SO-G0003885     30281
SO-F0002151     29931
NW-88002001     29764
NW-88004024     29494
NW-88006264     29313
NW-88002065     29169
MD-00025085     28995
SO-E0000362     28292
TH-PRGR0038     28156
NW-88023668     28097
NW-88003521     28038
SO-G0003786     28026
NW-88003561     27937
TH-PLER00

In [26]:
#clean memory
import gc
#del(dfPfull, dfclusters)
#del(dfPfull)
#gc.collect()

In [27]:
#dbutils.notebook.exit("End Workload - Script stopped")

In [28]:
#End CARD
#In line comments completed 11-May-2025

In [None]:
'''
    Theme: BB1C - Predicting Phosphate / Orthophosphate in UK Catchments Using AIML Models
    Pasupathi N, Lead Data Scientist, Cognizant - UK.
'''