# Using pipe() in Python

The pipe() function from the toolz library passes a value through a sequence of functions, feeding each function's output into the next. This is helpful when you want to process data step by step without creating many temporary variables or writing deeply nested function calls.

In this notebook, we use toolz.pipe() to clean and label a dataset. Each step runs in order, making the logic easier to follow.
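
As a rough illustration of what pipe() does, the sketch below reimplements the idea with functools.reduce. The name pipe_sketch and the three helper functions are made up for this example and are not part of toolz; toolz.pipe should give the same result for these calls.

```python
from functools import reduce


def pipe_sketch(data, *funcs):
    """Pass data through funcs from left to right (a stand-in for toolz.pipe)."""
    return reduce(lambda value, func: func(value), funcs, data)


# Hypothetical steps for a list of raw text readings.
def strip_blanks(values):
    # Trim whitespace and drop empty entries.
    return [v.strip() for v in values if v.strip()]


def to_floats(values):
    # Turn comma decimals such as "1,5" into floats.
    return [float(v.replace(",", ".")) for v in values]


def average(values):
    return sum(values) / len(values)


raw = [" 1,5", "2,5 ", "", "5,0"]

# Nested calls read inside-out.
nested = average(to_floats(strip_blanks(raw)))

# The piped version lists the same steps in the order they run.
piped = pipe_sketch(raw, strip_blanks, to_floats, average)

print(nested, piped)
```

Both versions compute the same value; the piped form simply lists the steps in execution order.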



## Why use pipe()?

*   It keeps your code tidy.
*   It helps you separate each step of your process.
*   It avoids creating a temporary variable for each intermediate result.

## Avoid it when:

*  You’re only doing one or two steps.
*  The steps rely on side effects (for example, modifying data in place) instead of returning a new value.
*  The code is already simple.

### **Note**:
This example is simple enough to write without pipe(). We use pipe() here for practice and illustration, to show how it can improve readability and structure in more complex data pipelines.
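
For comparison, here is one way the same three steps might look without toolz, using named functions and pandas' own DataFrame.pipe method. This is a sketch under assumptions: the function names and the tiny raw frame are made up for illustration, and text columns are selected with is_numeric_dtype rather than a dtype == 'object' check.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_numeric_dtype


def drop_counter(df):
    # Step 1: drop the column named 'n'.
    return df.drop(columns="n")


def commas_to_float(df):
    # Step 2: convert text columns with comma decimals to floats.
    text_cols = [c for c in df.columns if not is_numeric_dtype(df[c])]
    return df.assign(**{
        c: pd.to_numeric(df[c].str.replace(",", ".", regex=False), errors="coerce")
        for c in text_cols
    })


def label_concentration(df):
    # Step 3: label each float value 'high' if above its column mean.
    float_cols = [c for c in df.columns if is_float_dtype(df[c])]
    return df.assign(**{
        f"{c}_concentration": np.where(df[c] > df[c].mean(), "high", "moderate")
        for c in float_cols
    })


# Made-up input; not taken from the real dataset.
raw = pd.DataFrame({"n": [1, 2, 3], "Ca2+": ["1,5", "2,5", "8,0"]})

result = (
    raw
    .pipe(drop_counter)
    .pipe(commas_to_float)
    .pipe(label_concentration)
)
print(result)
```

Named functions are easier to test one at a time than inline lambdas, at the cost of a few more lines.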

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from toolz import pipe

# Load the dataset from Google Sheets (tab-separated)
df = pd.read_csv(
    "https://docs.google.com/spreadsheets/d/e/2PACX-1vQjd8--aWHNZ_XIPa0PeruM3Sm91BBt2dI5TabRGrCEeCNgOeMUTyecka8KCNCM3g/pub?gid=426756467&single=true&output=tsv",
    sep="\t"
)

def clean_data(df):
    """Clean and label the dataset.

    Steps:
    1. Drop the column named 'n'.
    2. Convert all columns with dtype 'object' to float.
    3. For each float column, add a column with the '_concentration' suffix,
       labeled 'high' if the value is above the column mean, otherwise 'moderate'.
    """
    return pipe(
        df,

        # Step 1: Drop column 'n'
        lambda x: x.drop(columns="n"),

        # Step 2: Convert object columns to float (handle comma decimals)
        lambda x: x.assign(
            **{
                col: pd.to_numeric(x[col].str.replace(',', '.'), errors='coerce')
                for col in x.columns if x[col].dtype == 'object'
            }
        ),

        # Step 3: Label high vs. moderate concentration
        # (a comparison with NaN is False, so missing values are labeled 'moderate')
        lambda x: x.assign(
            **{
                f"{col}_concentration": np.where(
                    x[col] > x[col].mean(), 'high', 'moderate'
                )
                for col in x.columns if x[col].dtype == 'float64'
            }
        )
    )

# Apply the pipeline
clean_df = clean_data(df)

# Preview result
clean_df.head()


| Unnamed: 0 | Año | Ca2+ | Mg2+ | Na+ | K+ | HCO3– | SO42– | Cl– | Ca2+_concentration | Mg2+_concentration | Na+_concentration | K+_concentration | HCO3–_concentration | SO42–_concentration | Cl–_concentration |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2002 | 1.7001 | 1.3001 | 0.56982 | 0.11 | 2.0994 | 0.32002 | 1.26 | moderate | moderate | moderate | moderate | moderate | moderate | moderate |
| 1 | 2002 | 2.0 | 3.4001 | 0.39 | 0.1 | 3.3007 | 1.8399 | 0.75001 | moderate | moderate | moderate | moderate | moderate | moderate | moderate |
| 2 | 2002 | 1.7999 | 1.7001 | 0.77991 | 0.13001 | 1.6 | 1.8299 | 0.97989 | moderate | moderate | moderate | moderate | moderate | moderate | moderate |
| 3 | 2002 | 1.4002 | 1.5997 | 0.55981 | 0.1 | 1.0 | 1.2501 | 1.41 | moderate | moderate | moderate | moderate | moderate | moderate | moderate |
| 4 | 2002 | 1.0 | 1.5001 | 0.65986 | 0.14001 | 0.79994 | 0.67001 | 1.6901 | moderate | moderate | moderate | moderate | moderate | moderate | moderate |
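
The preview shows only the first five rows, so it does not reveal how the labels are distributed over the whole table. One quick check is to count the labels in each '_concentration' column. The sketch below does this on a small made-up frame (demo and its values are hypothetical, not taken from the dataset). It also shows that a missing value ends up labeled 'moderate', because a comparison with NaN is False.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the numeric columns of clean_df (made-up values).
demo = pd.DataFrame({
    "Ca2+": [1.0, 2.0, 9.0, np.nan],
    "Mg2+": [0.5, 1.5, 0.25, 0.75],
})

# Same labeling rule as Step 3: 'high' if above the column mean.
labeled = demo.assign(**{
    f"{col}_concentration": np.where(demo[col] > demo[col].mean(), "high", "moderate")
    for col in demo.columns
})

# Count how often each label occurs per '_concentration' column.
label_cols = [c for c in labeled.columns if c.endswith("_concentration")]
counts = {c: labeled[c].value_counts().to_dict() for c in label_cols}
print(counts)

# The row with a missing Ca2+ value is labeled 'moderate'.
print(labeled.loc[3, "Ca2+_concentration"])
```

If missing values should stay unlabeled, one option is to mask them separately before or after labeling.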
