# Using pipe

1. Pipeline on Titanic

- Create a pipeline to:
   - Filter passengers who survived (Survived == 1).
   - Fill missing Age values with the mean.
   - Create a new column, Fare_Per_Age, by dividing Fare by Age.

In [3]:
import pandas as pd

df = pd.read_excel('data/data/titanic.xlsx')

# Define the pipeline steps as functions
def filter_survived(df):
    # .copy() returns an independent DataFrame, so later steps never write to a slice of the original
    return df[df['Survived'] == 1].copy()

def fill_missing_age(df):
    # The mean is computed over the rows passed in (here, the survivors only).
    # assign() returns a new DataFrame; calling fillna(inplace=True) on df['Age']
    # acts on a temporary copy and raises a chained-assignment warning.
    mean_age = df['Age'].mean()
    return df.assign(Age=df['Age'].fillna(mean_age))

def create_fare_per_age(df):
    return df.assign(Fare_Per_Age=df['Fare'] / df['Age'])


# Chain the transformations using .pipe()
df_transformed = (df
                  .pipe(filter_survived)          # Step 1: Filter passengers who survived
                  .pipe(fill_missing_age)        # Step 2: Fill missing Age values with the mean
                  .pipe(create_fare_per_age))    # Step 3: Create the Fare_Per_Age column

# The resulting DataFrame
display(df_transformed)


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Fare_Per_Age
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,1.875876
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,0.304808
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,1.517143
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,0.412344
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,2.147914
...,...,...,...,...,...,...,...,...,...,...,...,...,...
875,876,1,3,"Najib, Miss. Adele Kiamie ""Jane""",female,15.0,0,0,2667,7.2250,,C,0.481667
879,880,1,1,"Potter, Mrs. Thomas Jr (Lily Alexenia Wilson)",female,56.0,0,1,11767,83.1583,C50,C,1.484970
880,881,1,2,"Shelley, Mrs. William (Imanita Parrish Hall)",female,25.0,0,1,230433,26.0000,,S,1.040000
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S,1.578947
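
As a quick sanity check, the same three steps can be run on a small hand-made frame where the expected values are easy to verify by hand. This is only a sketch: the `toy` data is invented for illustration and is not part of the Titanic file, and the step functions are written so that they return new DataFrames rather than modifying their input.

```python
import pandas as pd

# Hypothetical toy data: two of the three survivors have a known Age (20 and 40)
toy = pd.DataFrame({
    'Survived': [1, 0, 1, 1],
    'Age': [20.0, 50.0, None, 40.0],
    'Fare': [10.0, 99.0, 15.0, 20.0],
})

def filter_survived(df):
    # Keep survivors only; copy so the result is independent of the input
    return df[df['Survived'] == 1].copy()

def fill_missing_age(df):
    # Fill missing ages with the mean age of the rows that reached this step
    return df.assign(Age=df['Age'].fillna(df['Age'].mean()))

def create_fare_per_age(df):
    return df.assign(Fare_Per_Age=df['Fare'] / df['Age'])

result = (toy
          .pipe(filter_survived)
          .pipe(fill_missing_age)
          .pipe(create_fare_per_age))

print(result)
```

Because the filter runs first, the missing age is filled with the survivors' mean (30.0 here), not the mean over all passengers. The input frame `toy` keeps its missing value and gains no new column.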


2. Pipeline on Flights

- Create a pipeline to:
   - Filter flights with a departure delay greater than 30 minutes.
   - Add a column Delay_Per_Hour by dividing the delay by the scheduled flight duration.

In [None]:
import glob

import pandas as pd

# Load the files
# 'data/flights' is a directory. Mapping pd.read_parquet over the string itself
# iterates over its characters and fails with "No such file or directory: 'd'".
# Assumes the files in that directory use the .parquet extension.
files = sorted(glob.glob('data/flights/*.parquet'))

# Combine all files into one DataFrame
df = pd.concat(map(pd.read_parquet, files), ignore_index=True)

# Pipeline:
# 1. Filter flights where Departure Delay is greater than 30 minutes
# 2. Add a new column `Delay_Per_Hour` = Delay / Scheduled Flight Duration

def filter_delayed_flights(df):
    # .copy() avoids writing to a slice of the original DataFrame in later steps
    return df[df["DepDelay"] > 30].copy()

def add_delay_per_hour(df):
    # Ratio of the departure delay to the scheduled flight time, as defined in the task
    return df.assign(Delay_Per_Hour=df["DepDelay"] / df["CRSFlightTime"])

# Build the pipeline
df_filtered = (
    df
    .pipe(filter_delayed_flights)  # 1. Filter flights
    .pipe(add_delay_per_hour)     # 2. Add Delay_Per_Hour column
)

# Display the result
display(df_filtered)

# Save the filtered flights
df_filtered.to_csv("filtered_flight_stats.csv", index=False)
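
The two flight steps can also be tried on hand-made data, which does not depend on any files being present. The sketch below uses an invented `flights` frame and a hypothetical `min_delay` parameter to show that `.pipe()` forwards extra keyword arguments to the step function. The units of `CRSFlightTime` are not stated in the task, so the ratio is computed exactly as the task defines it; if both columns are in minutes, multiply the result by 60 to express it per hour.

```python
import pandas as pd

# Hypothetical toy data, assumed to be in minutes
flights = pd.DataFrame({
    "DepDelay": [10, 45, 90, 31],
    "CRSFlightTime": [60, 90, 180, 62],
})

def filter_delayed_flights(df, min_delay=30):
    # min_delay is an illustrative parameter, not part of the original task code
    return df[df["DepDelay"] > min_delay].copy()

def add_delay_per_hour(df):
    # Delay divided by scheduled flight time, as defined in the task
    return df.assign(Delay_Per_Hour=df["DepDelay"] / df["CRSFlightTime"])

delayed = (
    flights
    .pipe(filter_delayed_flights, min_delay=30)  # extra arguments are passed through by .pipe()
    .pipe(add_delay_per_hour)
)

print(delayed)
```

Only the three flights delayed by more than 30 minutes remain, and `flights` itself is left unchanged.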