# Pandas pipe function

* The `pipe` function chains multiple preprocessing functions into a single, ordered operation
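As a minimal sketch of the idea, `df.pipe(func, *args)` calls `func(df, *args)`, so a chain of `pipe` calls reads top to bottom instead of inside out. The `add_constant` and `scale` helpers below are hypothetical and exist only for illustration.

```python
import pandas as pd

def add_constant(df, value):
    # Hypothetical step: add a constant to every column.
    return df + value

def scale(df, factor):
    # Hypothetical step: multiply every column by a factor.
    return df * factor

toy = pd.DataFrame({"x": [1, 2, 3]})

# Nested calls read inside out ...
nested = scale(add_constant(toy, 1), 10)

# ... while the piped version lists the steps in execution order.
piped = toy.pipe(add_constant, 1).pipe(scale, 10)

print(piped.equals(nested))
```

Both forms produce the same DataFrame; the piped one is easier to extend when more steps are added.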

## Exercise 1 - Creating a DataFrame

In [1]:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "id": [100, 100, 101, 102, 103, 104, 105, 106],
        "A": [1, 2, 3, 4, 5, 2, np.nan, 5],
        "B": [45, 56, 48, 47, 62, 112, 54, 49],
        "C": [1.2, 1.4, 1.1, 1.8, np.nan, 1.4, 1.6, 1.5]
    }
)

df

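| | id | A | B | C |
|---|---|---|---|---|
| 0 | 100 | 1.0 | 45 | 1.2 |
| 1 | 100 | 2.0 | 56 | 1.4 |
| 2 | 101 | 3.0 | 48 | 1.1 |
| 3 | 102 | 4.0 | 47 | 1.8 |
| 4 | 103 | 5.0 | 62 | NaN |
| 5 | 104 | 2.0 | 112 | 1.4 |
| 6 | 105 | NaN | 54 | 1.6 |
| 7 | 106 | 5.0 | 49 | 1.5 |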


## Exercise 2 - function to handle missing values

* Replace the missing values in the numerical columns with the mean value of the column.

In [2]:
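def fill_missing_values(df):
    # Fill NaNs in every numeric column with that column's mean.
    for col in df.select_dtypes(include="number").columns:
        val = df[col].mean()
        # Assign the result back; fillna(inplace=True) on a column selection
        # may not update df in newer pandas versions.
        df[col] = df[col].fillna(val)
    return df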
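If the step should not touch the DataFrame it receives, one possible variant returns a new object instead. This is a sketch only: `fill_missing_values_copy` is a hypothetical name, and it relies on `DataFrame.fillna` accepting a Series of per-column fill values.

```python
import numpy as np
import pandas as pd

def fill_missing_values_copy(df):
    # Per-column means, computed for numeric columns only.
    means = df.mean(numeric_only=True)
    # fillna matches the Series labels to column names and returns a new DataFrame.
    return df.fillna(means)

sample = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [10, 20, 30]})
filled = fill_missing_values_copy(sample)

print(filled["A"].tolist())
print(sample["A"].isna().sum())  # the input still contains its NaN
```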

## Exercise 3 - function to remove duplicate values

In [3]:
def drop_duplicates(df, column_name):
   df = df.drop_duplicates(subset=column_name)
   return df

* It drops the rows that repeat an earlier value in the given column or columns, keeping the first occurrence.
* In addition to the DataFrame, this function also takes a column name as an argument. We can pass such additional arguments to the pipe as well.
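The ways of handing extra arguments to `pipe` can be sketched as follows. Positional and keyword arguments are forwarded to the function after the DataFrame; when the function does not take the DataFrame as its first argument, `pipe` also accepts a `(callable, keyword_name)` tuple. The `dedupe_by` helper is hypothetical and only illustrates that last case.

```python
import pandas as pd

def drop_duplicates(df, column_name):
    return df.drop_duplicates(subset=column_name)

def dedupe_by(column_name, frame):
    # Hypothetical helper whose DataFrame argument is not the first one.
    return frame.drop_duplicates(subset=column_name)

sample = pd.DataFrame({"id": [100, 100, 101], "B": [45, 56, 48]})

by_position = sample.pipe(drop_duplicates, "id")
by_keyword = sample.pipe(drop_duplicates, column_name="id")
# The tuple tells pipe to pass the DataFrame as the keyword argument `frame`.
by_tuple = sample.pipe((dedupe_by, "frame"), "id")

print(by_position.index.tolist())
```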

## Exercise 4 - function to eliminate outliers

In [4]:
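def remove_outliers(df, column_list):
    for col in column_list:
        avg = df[col].mean()
        std = df[col].std()
        low = avg - 2 * std
        high = avg + 2 * std
        # Keep only the rows inside [low, high]; both conditions must hold,
        # so they are combined with & (with | every row would pass).
        df = df[(df[col] >= low) & (df[col] <= high)]
    return df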

What this function does is as follows:

* It takes a DataFrame and a list of columns as input
* For each column in the list, it calculates the mean and standard deviation
* It calculates a lower and upper bound using the mean and standard deviation
* It removes the rows whose values fall outside the range defined by the lower and upper bounds
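The bounds described above can be sketched for a single column. The values below are those of column `B` once the duplicate `id` row is dropped; `Series.between` is used here as an equivalent, slightly shorter way to express "at least `low` and at most `high`".

```python
import pandas as pd

b = pd.Series([45, 48, 47, 62, 112, 54, 49], name="B")

avg = b.mean()
std = b.std()  # pandas uses the sample standard deviation (ddof=1) by default
low = avg - 2 * std
high = avg + 2 * std

# A value is kept only if it satisfies both bounds.
inside_and = (b >= low) & (b <= high)
inside_between = b.between(low, high)  # inclusive on both ends by default

kept = b[inside_between].tolist()
print(round(low, 2), round(high, 2))
print(kept)
```

Only 112 lies more than two standard deviations from the mean, so it is the single value filtered out.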

## Exercise 5 - creating a pipe

In [5]:
df_processed = (df.
                pipe(fill_missing_values).
                pipe(drop_duplicates, "id").
                pipe(remove_outliers, ["A","B"]))

* This pipe executes the functions in the given order. 
* We can pass the arguments to the pipe along with the function names.

## Exercise 6 - keeping the original DataFrame unchanged

* One thing to mention here is that `fill_missing_values` fills the columns of the DataFrame it receives. Thus, running the pipe as shown above updates `df` as well.
* One option to overcome this issue is to use a copy of the original DataFrame in the pipe.
* If you do not care about keeping the original DataFrame as is, you can just use it in the pipe.
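The difference between piping the DataFrame itself and piping a copy can be checked on a small example. This is a sketch with a hypothetical `fill_in_place` step that, like `fill_missing_values`, assigns into the DataFrame it receives.

```python
import numpy as np
import pandas as pd

def fill_in_place(df):
    # Hypothetical step that writes into the DataFrame passed to it.
    df["A"] = df["A"].fillna(df["A"].mean())
    return df

original = pd.DataFrame({"A": [1.0, np.nan, 3.0]})

# Piping a copy leaves `original` untouched.
safe = original.copy().pipe(fill_in_place)
missing_after_copy = int(original["A"].isna().sum())

# Piping the DataFrame itself hands the same object to the step.
result = original.pipe(fill_in_place)
missing_after_direct = int(original["A"].isna().sum())
same_object = result is original

print(missing_after_copy, missing_after_direct, same_object)
```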

In [6]:
df

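| | id | A | B | C |
|---|---|---|---|---|
| 0 | 100 | 1.0 | 45 | 1.2 |
| 1 | 100 | 2.0 | 56 | 1.4 |
| 2 | 101 | 3.0 | 48 | 1.1 |
| 3 | 102 | 4.0 | 47 | 1.8 |
| 4 | 103 | 5.0 | 62 | 1.428571 |
| 5 | 104 | 2.0 | 112 | 1.4 |
| 6 | 105 | 3.142857 | 54 | 1.6 |
| 7 | 106 | 5.0 | 49 | 1.5 |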


In [7]:
df_processed

| | id | A | B | C |
|---|---|---|---|---|
| 0 | 100 | 1.0 | 45 | 1.2 |
| 2 | 101 | 3.0 | 48 | 1.1 |
| 3 | 102 | 4.0 | 47 | 1.8 |
| 4 | 103 | 5.0 | 62 | 1.428571 |
| 6 | 105 | 3.142857 | 54 | 1.6 |
| 7 | 106 | 5.0 | 49 | 1.5 |


In [8]:
df = pd.DataFrame(
    {
        "id": [100, 100, 101, 102, 103, 104, 105, 106],
        "A": [1, 2, 3, 4, 5, 2, np.nan, 5],
        "B": [45, 56, 48, 47, 62, 112, 54, 49],
        "C": [1.2, 1.4, 1.1, 1.8, np.nan, 1.4, 1.6, 1.5]
    }
)

df_copy = df.copy()

df_processed = (df_copy.
                pipe(fill_missing_values).
                pipe(drop_duplicates, "id").
                pipe(remove_outliers, ["A","B"]))

In [9]:
df_processed

| | id | A | B | C |
|---|---|---|---|---|
| 0 | 100 | 1.0 | 45 | 1.2 |
| 2 | 101 | 3.0 | 48 | 1.1 |
| 3 | 102 | 4.0 | 47 | 1.8 |
| 4 | 103 | 5.0 | 62 | 1.428571 |
| 6 | 105 | 3.142857 | 54 | 1.6 |
| 7 | 106 | 5.0 | 49 | 1.5 |


In [10]:
df

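| | id | A | B | C |
|---|---|---|---|---|
| 0 | 100 | 1.0 | 45 | 1.2 |
| 1 | 100 | 2.0 | 56 | 1.4 |
| 2 | 101 | 3.0 | 48 | 1.1 |
| 3 | 102 | 4.0 | 47 | 1.8 |
| 4 | 103 | 5.0 | 62 | NaN |
| 5 | 104 | 2.0 | 112 | 1.4 |
| 6 | 105 | NaN | 54 | 1.6 |
| 7 | 106 | 5.0 | 49 | 1.5 |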


## Exercise 7 - start the pipeline with a copy step

* We can start the pipeline with a dedicated first step that just copies the original DataFrame, so the later steps never touch `df`.

In [11]:
def start_pipeline(df):
    return df.copy()

In [12]:
df_processed = (df.
                pipe(start_pipeline).
                pipe(fill_missing_values).
                pipe(drop_duplicates, "id").
                pipe(remove_outliers, ["A","B"]))

In [13]:
df

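| | id | A | B | C |
|---|---|---|---|---|
| 0 | 100 | 1.0 | 45 | 1.2 |
| 1 | 100 | 2.0 | 56 | 1.4 |
| 2 | 101 | 3.0 | 48 | 1.1 |
| 3 | 102 | 4.0 | 47 | 1.8 |
| 4 | 103 | 5.0 | 62 | NaN |
| 5 | 104 | 2.0 | 112 | 1.4 |
| 6 | 105 | NaN | 54 | 1.6 |
| 7 | 106 | 5.0 | 49 | 1.5 |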


In [14]:
df_processed

| | id | A | B | C |
|---|---|---|---|---|
| 0 | 100 | 1.0 | 45 | 1.2 |
| 2 | 101 | 3.0 | 48 | 1.1 |
| 3 | 102 | 4.0 | 47 | 1.8 |
| 4 | 103 | 5.0 | 62 | 1.428571 |
| 6 | 105 | 3.142857 | 54 | 1.6 |
| 7 | 106 | 5.0 | 49 | 1.5 |


## Exercise 8 - logging

* We have a pipeline that consists of 4 steps. Depending on the raw data and the task at hand, we may need to create pipelines that have several more steps.

* In such workflows, it is important to keep track of what happens at each step so it will be easier to debug in case something goes wrong.

* We can achieve this by logging some information after each step. In our pipeline, the shape of the DataFrame (rows, columns) after each step shows whether a step dropped more or fewer rows than expected.

* Let’s print the shape of the DataFrame after each step is applied in the pipeline. Since the steps are functions, we can use a Python decorator for this task.

* A decorator is a function that takes another function and extends its behavior. The base function is not modified. The decorator wraps it and adds additional functionality.

In [15]:
from functools import wraps

def logging(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        print(f"The size after {func.__name__} is {result.shape}.")
        return result
    return wrapper
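The same wrapping pattern can report more than the final shape. As a sketch, the hypothetical `log_row_change` decorator below records the row count before and after a step, which makes it obvious which step removed rows. It assumes the DataFrame is the first positional argument, as is the case when the step is called through `pipe`.

```python
from functools import wraps

import pandas as pd

def log_row_change(func):
    # Hypothetical decorator: report how many rows a step keeps.
    @wraps(func)
    def wrapper(df, *args, **kwargs):
        rows_before = len(df)
        result = func(df, *args, **kwargs)
        print(f"{func.__name__}: {rows_before} -> {len(result)} rows")
        return result
    return wrapper

@log_row_change
def drop_duplicates(df, column_name):
    return df.drop_duplicates(subset=column_name)

sample = pd.DataFrame({"id": [100, 100, 101]})
deduped = sample.pipe(drop_duplicates, "id")
```

Because of `@wraps(func)`, the decorated function keeps its original `__name__`, which is what the log message relies on.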

## Exercise 9 - using the decorator

* We will “decorate” the functions used in the pipeline as follows:

In [16]:
@logging
def start_pipeline(df):
    return df.copy()

@logging
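def fill_missing_values(df):
    # Fill NaNs in every numeric column with that column's mean.
    for col in df.select_dtypes(include="number").columns:
        val = df[col].mean()
        # Assign the result back; fillna(inplace=True) on a column selection
        # may not update df in newer pandas versions.
        df[col] = df[col].fillna(val)
    return df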

@logging
def drop_duplicates(df, column_name):
   df = df.drop_duplicates(subset=column_name)
   return df

@logging
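def remove_outliers(df, column_list):
    for col in column_list:
        avg = df[col].mean()
        std = df[col].std()
        low = avg - 2 * std
        high = avg + 2 * std
        # Keep only the rows inside [low, high]; both conditions must hold,
        # so they are combined with & (with | every row would pass).
        df = df[(df[col] >= low) & (df[col] <= high)]
    return df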

## Exercise 10 - run the pipeline with decorators

In [17]:
df_processed = (df.
                pipe(start_pipeline).
                pipe(fill_missing_values).
                pipe(drop_duplicates, "id").
                pipe(remove_outliers, ["A","B"]))

The size after start_pipeline is (8, 4).
The size after fill_missing_values is (8, 4).
The size after drop_duplicates is (7, 4).
The size after remove_outliers is (6, 4).
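For longer pipelines, the same decorator idea can write to Python's standard `logging` module instead of calling `print`, so that messages can be filtered by level or redirected to a file. The sketch below makes a few assumptions: the decorator is given the hypothetical name `log_shape` (a decorator called `logging` would shadow the module of the same name), the module is imported under the alias `std_logging`, and the logger name `"pipeline"` is arbitrary.

```python
import logging as std_logging
from functools import wraps

import pandas as pd

std_logging.basicConfig(level=std_logging.INFO, format="%(levelname)s %(message)s")
logger = std_logging.getLogger("pipeline")  # hypothetical logger name

def log_shape(func):
    # Hypothetical decorator: log the shape of the step's result at INFO level.
    @wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        logger.info("The size after %s is %s.", func.__name__, result.shape)
        return result
    return wrapper

@log_shape
def start_pipeline(df):
    return df.copy()

sample = pd.DataFrame({"id": [100, 101], "B": [45, 48]})
copied = sample.pipe(start_pipeline)
```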
