In [1]:
import pandas as pd
import numpy as np
import json

In [3]:
# Use a raw string so the backslashes in the Windows path are not treated as escapes
archivo_csv = r"C:\Users\JSLV3\Documents\5to Semestre\ETL\Proyecto Water Quality\1ra parte\watersucia.csv"

# Load the data from the semicolon-delimited CSV file
water = pd.read_csv(archivo_csv, delimiter=';')

# Show the first rows of the DataFrame
print(water.head())



    Año NombreDepartamento  Div_dpto NombreMunicipio  Divi_muni IrcaMinimo  \
0  2010            Bolívar        13        El Guamo      13248          0   
1  2010            Bolívar        13        El Guamo      13248          0   
2  2010            Bolívar        13        El Guamo      13248          0   
3  2010            Bolívar        13        El Guamo      13248          0   
4  2010            Bolívar        13        El Guamo      13248          0   

  IrcaMaximo IrcaPromedio NombreParametroAnalisis2  MuestrasEvaluadas  \
0        100        37,32        Alcanilidad Total                 67   
1        100        37,32                 Aluminio                 67   
2        100        37,32                 Arsénico                 67   
3        100        37,32                   Cadmio                 67   
4        100        37,32                   Calcio                 67   

   MuestrasTratadas  MuestrasSinTratar  NumeroParametrosMinimo  \
0                67       
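The preview shows `IrcaPromedio` written with a decimal comma (`37,32`), so pandas reads that column as text unless told otherwise. As a minimal sketch, assuming the whole file uses the comma as its decimal separator, the `decimal` argument of `pd.read_csv` parses such columns as numbers at load time. The two-row sample below is made up for illustration and is not taken from `watersucia.csv`.

```python
import io

import pandas as pd

# Hypothetical sample that mimics the layout of the real file:
# semicolon-delimited fields and decimal commas.
muestra_csv = "Año;IrcaPromedio;MuestrasEvaluadas\n2010;37,32;67\n2011;4,5;12\n"

# sep=';' splits the fields; decimal=',' makes '37,32' parse as the float 37.32.
muestra = pd.read_csv(io.StringIO(muestra_csv), sep=';', decimal=',')

print(muestra.dtypes)
```

If the column is parsed as numeric this way, later steps that call string methods on its values would need to accept floats instead.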

## Creating greater value from our data
### Transformations

##### We identified the analysis parameters that have the greatest influence on water pollution:

In [14]:
parametros_influencia = water.groupby('NombreParametroAnalisis2')['IrcaPromedio'].mean().sort_values(ascending=False)
top_15_parametros = parametros_influencia.head(15)
top_15_parametros

NombreParametroAnalisis2
Giardia                          6.589981
Cryptosporidium                  6.589981
Plaguicidas Totales              6.589981
Trihalometanos totales           6.561441
Molibdeno                        6.561441
Nitratos                         6.560811
Hierro total                     6.560811
Magnesio                         6.560811
Manganeso                        6.560811
Mercurio                         6.560811
Mesófilos                        6.560811
Olor                             6.560811
Nitritos                         6.560811
Fosfatos                         6.560811
Organofosforados y carbamatos    6.560811
Name: IrcaPromedio, dtype: float64

These 15 parameters have nearly identical average IRCA values, between about 6.56 and 6.59. Because the averages differ by less than 0.03, the ranking separates the parameters only weakly, but it can still serve as a starting point for deciding which water quality parameters need closer attention in monitoring and treatment programs.

In [None]:
water = water[water['NombreParametroAnalisis2'].isin(top_15_parametros.index)]

water.head(), water.shape

The dataset now contains only the rows for the top 15 analysis parameters; all other rows have been removed.

##### Classification of the Water Quality Risk Index (IRCA)

In [6]:
def clasificar_irca(irca):
    """Map an IRCA value (0 to 100) to its risk category."""
    try:
        # Accept numbers as well as strings with a decimal comma, e.g. '37,32'
        irca = float(str(irca).replace(',', '.'))
    except ValueError:
        return 'No clasificado'
    # Contiguous ranges, so values such as 5.0005 no longer fall between categories
    if irca == 0:
        return 'Sin información'
    elif 0 < irca <= 5:
        return 'Sin riesgo'
    elif 5 < irca <= 14:
        return 'Riesgo bajo'
    elif 14 < irca <= 35:
        return 'Riesgo medio'
    elif 35 < irca <= 80:
        return 'Riesgo alto'
    elif 80 < irca <= 100:
        return 'Inviable sanitariamente'
    # Negative values, values above 100, and NaN end up here
    return 'No clasificado'

water['rango_irca'] = water['IrcaPromedio'].apply(clasificar_irca)


The `clasificar_irca` function converts the numerical average IRCA into descriptive categories ranging from 'Sin riesgo' (no risk) to 'Inviable sanitariamente' (sanitarily unviable). A value of exactly 0 is labeled 'Sin información' (no information), and anything that cannot be parsed or falls outside 0 to 100 is labeled 'No clasificado' (not classified). The categories make the index easier to interpret, allow quick comparisons between regions, and provide a categorical variable for later visualization and modeling.
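The same banding can also be done on the whole column at once. The sketch below is an alternative, not the notebook's method: it uses `pd.cut` with right-closed bins, and the function name `clasificar_irca_vectorizado` and the sample values are hypothetical.

```python
import pandas as pd

def clasificar_irca_vectorizado(serie):
    """Classify a Series of IRCA values (numbers or decimal-comma strings)."""
    # Normalize decimal commas; unparseable entries become NaN.
    valores = pd.to_numeric(
        serie.astype(str).str.replace(',', '.', regex=False), errors='coerce'
    )
    # Right-closed bins: (0, 5], (5, 14], (14, 35], (35, 80], (80, 100].
    rangos = pd.cut(
        valores,
        bins=[0, 5, 14, 35, 80, 100],
        labels=['Sin riesgo', 'Riesgo bajo', 'Riesgo medio',
                'Riesgo alto', 'Inviable sanitariamente'],
        right=True,
    )
    # Values outside every bin (including NaN) are not classified.
    resultado = rangos.astype(object).where(rangos.notna(), 'No clasificado')
    # Exactly zero means no information was reported.
    return resultado.mask(valores == 0, 'Sin información')

ejemplo = pd.Series(['0', '37,32', '5', '5,0005', '100', 'abc', '120'])
print(clasificar_irca_vectorizado(ejemplo).tolist())
```

On large tables this avoids calling a Python function once per row, at the cost of being slightly less explicit than the `if`/`elif` chain.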

##### Treatment Category

In [7]:
def categorize_treatment(row):
    if row['MuestrasTratadas'] == 0:
        return 'Sin tratamiento'
    elif row['MuestrasTratadas'] == row['MuestrasEvaluadas']:
        return 'Tratamiento completo'
    else:
        return 'Tratamiento parcial'

water['TratamientoCategoría'] = water.apply(categorize_treatment, axis=1)
water.head()


| | Año | NombreDepartamento | Div_dpto | NombreMunicipio | Divi_muni | IrcaMinimo | IrcaMaximo | IrcaPromedio | NombreParametroAnalisis2 | MuestrasEvaluadas | MuestrasTratadas | MuestrasSinTratar | NumeroParametrosMinimo | NumeroParametrosMaximo | NumeroParametrosPromedio | ResultadoMinimo | ResultadoMaximo | ResultadoPromedio | rango_irca | TratamientoCategoría |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2010 | Bolívar | 13 | El Guamo | 13248 | 0 | 100 | 37,32 | Alcanilidad Total | 67 | 67 | 0 | 2 | 7 | 2 | 23.0 | 23.0 | 23.0 | Riesgo alto | Tratamiento completo |
| 1 | 2010 | Bolívar | 13 | El Guamo | 13248 | 0 | 100 | 37,32 | Aluminio | 67 | 67 | 0 | 2 | 7 | 2 | NaN | NaN | NaN | Riesgo alto | Tratamiento completo |
| 2 | 2010 | Bolívar | 13 | El Guamo | 13248 | 0 | 100 | 37,32 | Arsénico | 67 | 67 | 0 | 2 | 7 | 2 | NaN | NaN | NaN | Riesgo alto | Tratamiento completo |
| 3 | 2010 | Bolívar | 13 | El Guamo | 13248 | 0 | 100 | 37,32 | Cadmio | 67 | 67 | 0 | 2 | 7 | 2 | NaN | NaN | NaN | Riesgo alto | Tratamiento completo |
| 4 | 2010 | Bolívar | 13 | El Guamo | 13248 | 0 | 100 | 37,32 | Calcio | 67 | 67 | 0 | 2 | 7 | 2 | 14.0 | 14.0 | 14.0 | Riesgo alto | Tratamiento completo |



The `TratamientoCategoría` (treatment category) column classifies each record by how many of its evaluated samples were treated. This helps compare how treatment is applied across locations. The categories are:

•	'Sin tratamiento' (no treatment): none of the evaluated samples were treated.

•	'Tratamiento parcial' (partial treatment): some, but not all, of the evaluated samples were treated.

•	'Tratamiento completo' (complete treatment): all evaluated samples were treated.
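The three categories can also be assigned without a row-wise `apply`. The following is a minimal sketch using `np.select` on a made-up three-row frame; the conditions are checked in order, so a record with zero treated samples is labeled 'Sin tratamiento' even when zero samples were evaluated, which matches the `if`/`elif` order of `categorize_treatment`.

```python
import numpy as np
import pandas as pd

# Hypothetical records covering the three cases.
muestras = pd.DataFrame({
    'MuestrasEvaluadas': [67, 40, 10],
    'MuestrasTratadas': [67, 0, 4],
})

# Conditions are evaluated in order; the first match wins.
condiciones = [
    muestras['MuestrasTratadas'] == 0,
    muestras['MuestrasTratadas'] == muestras['MuestrasEvaluadas'],
]
etiquetas = ['Sin tratamiento', 'Tratamiento completo']

# Rows matching neither condition were only partially treated.
muestras['TratamientoCategoría'] = np.select(
    condiciones, etiquetas, default='Tratamiento parcial'
)
print(muestras)
```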

##### Treatment rate

In [8]:
# Share of evaluated samples that were treated; a zero denominator yields inf or NaN
water['TasaTratamiento'] = water['MuestrasTratadas'] / water['MuestrasEvaluadas']
# Assign the result back instead of calling replace(..., inplace=True) on a column
# selection, which raises a FutureWarning and will stop working in pandas 3.0
water['TasaTratamiento'] = water['TasaTratamiento'].replace([np.inf, -np.inf], np.nan).fillna(0)
water[['NombreDepartamento', 'NombreMunicipio', 'Año', 'MuestrasEvaluadas', 'MuestrasTratadas', 'TasaTratamiento']].head()


| | NombreDepartamento | NombreMunicipio | Año | MuestrasEvaluadas | MuestrasTratadas | TasaTratamiento |
|---|---|---|---|---|---|---|
| 0 | Bolívar | El Guamo | 2010 | 67 | 67 | 1.0 |
| 1 | Bolívar | El Guamo | 2010 | 67 | 67 | 1.0 |
| 2 | Bolívar | El Guamo | 2010 | 67 | 67 | 1.0 |
| 3 | Bolívar | El Guamo | 2010 | 67 | 67 | 1.0 |
| 4 | Bolívar | El Guamo | 2010 | 67 | 67 | 1.0 |


The treatment rate (`TasaTratamiento`) is the proportion of treated water samples relative to the total samples evaluated in each record. It is a direct indicator of treatment coverage in a given municipality and year; on its own it does not measure how effective the treatment was.

How to interpret the treatment rate:

•	Value of 1.0: all evaluated samples were treated, which indicates complete treatment coverage.

•	Value between 0 and 1.0: only part of the evaluated samples was treated. The closer the value is to 1.0, the better the treatment coverage.

•	Value of 0: none of the evaluated samples were treated, which may call for follow-up to understand why treatment is missing. Records with no evaluated samples are also set to 0, because their rate is undefined.
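The edge cases behind the rate are easier to see on a small example. The sketch below uses made-up counts (not rows from the dataset) to show what a zero in `MuestrasEvaluadas` does to the division and how those results end up as 0.

```python
import numpy as np
import pandas as pd

# Hypothetical counts: full coverage, partial coverage,
# and two records with no evaluated samples.
conteos = pd.DataFrame({
    'MuestrasTratadas': [67, 4, 0, 5],
    'MuestrasEvaluadas': [67, 10, 0, 0],
})

# 0 / 0 gives NaN and 5 / 0 gives inf in pandas, not an exception.
tasa_cruda = conteos['MuestrasTratadas'] / conteos['MuestrasEvaluadas']

# Treat every undefined rate as 0, mirroring the convention used for TasaTratamiento.
conteos['TasaTratamiento'] = tasa_cruda.replace([np.inf, -np.inf], np.nan).fillna(0)
print(conteos)
```

Keeping the undefined rates as `NaN` instead of 0 would be an equally reasonable design choice if records without evaluated samples must stay distinguishable from untreated ones.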


##### Data Cleaning and Variable Selection

In [9]:
water = water.drop(['ResultadoMinimo', 'ResultadoMaximo', 'ResultadoPromedio'], axis=1)


The justification for removing the `ResultadoMinimo`, `ResultadoMaximo`, and `ResultadoPromedio` columns is documented in the EDA_water_quality file, which contains the exploratory data analysis that guided the cleaning of this dataset.

In [None]:
columnas_a_eliminar = ['MuestrasTratadas', 'MuestrasEvaluadas', 'MuestrasSinTratar',
                      'NumeroParametrosMinimo', 'NumeroParametrosMaximo', 'EsAtipico']
water = water.drop(columns=columnas_a_eliminar)


water.head()

In [None]:
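# Rename by name rather than by position, so the result does not depend on
# the current column order or on the list having exactly the right length.
new_column_names = {
    "Año": "año",
    "NombreDepartamento": "nombre_departamento",
    "Div_dpto": "div_dpto",
    "NombreMunicipio": "nombre_municipio",
    "Divi_muni": "divi_muni",
    "IrcaMinimo": "irca_minimo",
    "IrcaMaximo": "irca_maximo",
    "IrcaPromedio": "irca_promedio",
    "NombreParametroAnalisis2": "nombre_parametro_analisis",
    "NumeroParametrosPromedio": "numero_parametros_promedio",
}
# rango_irca, TasaTratamiento and TratamientoCategoría keep their names
water = water.rename(columns=new_column_names)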
print(water.head())
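Before moving on to the dimensional model, it can help to confirm that the renamed frame has exactly the expected columns. The helper below is a hypothetical sketch (`verificar_columnas` is not part of the notebook or of pandas); it reports missing and unexpected names, which catches slips such as two names fused into one by a missing comma.

```python
import pandas as pd

def verificar_columnas(df, esperadas):
    """Return (missing, unexpected) column names relative to the expected list."""
    actuales = set(df.columns)
    faltantes = sorted(set(esperadas) - actuales)
    sobrantes = sorted(actuales - set(esperadas))
    return faltantes, sobrantes

# Hypothetical frame in which two column names were accidentally fused.
ejemplo = pd.DataFrame(
    columns=["año", "irca_promedio", "TasaTratamientoTratamientoCategoría"]
)
faltantes, sobrantes = verificar_columnas(
    ejemplo, ["año", "irca_promedio", "TasaTratamiento", "TratamientoCategoría"]
)
print("Missing:", faltantes)
print("Unexpected:", sobrantes)
```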

### Dimensional Modeling