Create a collection of functions, including a lambda function and at least one custom
function, to systematically clean and process a dataset. Apply these functions to a
Pandas DataFrame to demonstrate the efficiency of functional programming
concepts in data manipulation.

1. Setup and Sample Data

In [1]:
import pandas as pd
import numpy as np

# Create the "dirty" DataFrame
data = {
    'EmployeeID': ['A101', 'A102', 'A103', 'A104', 'A105', 'A106'],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': ['$75,000', '52000', '$N/A', '61,500', '$48,000.50', np.nan],
    'Department': ['Sales', 'IT', 'sales', 'Marketing', 'IT', 'Finance'],
    'BirthDate': ['1990-05-15', '1985-11-23', '1992-07-30', '2000-01-01', '1995-03-12', '1988-08-25']
}
df = pd.DataFrame(data)

print("--- Original 'Dirty' DataFrame ---")
print(df)
print("\n--- Original Data Types ---")
df.info()

--- Original 'Dirty' DataFrame ---
  EmployeeID     Name      Salary Department   BirthDate
0       A101    Alice     $75,000      Sales  1990-05-15
1       A102      Bob       52000         IT  1985-11-23
2       A103  Charlie        $N/A      sales  1992-07-30
3       A104    David      61,500  Marketing  2000-01-01
4       A105      Eve  $48,000.50         IT  1995-03-12
5       A106    Frank         NaN    Finance  1988-08-25

--- Original Data Types ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   EmployeeID  6 non-null      object
 1   Name        6 non-null      object
 2   Salary      5 non-null      object
 3   Department  6 non-null      object
 4   BirthDate   6 non-null      object
dtypes: object(5)
memory usage: 372.0+ bytes


2. The Function Collection

Custom Function 1: clean_salary

In [2]:
def clean_salary(salary_val):
    """
    Converts a salary string (e.g., '$75,000') to a numeric type (float).
    Handles NaNs and invalid entries.
    """
    # Pass through NaNs
    if pd.isna(salary_val):
        return np.nan
    
    try:
        # Convert to string, remove symbols, and convert to float
        salary_str = str(salary_val).replace('$', '').replace(',', '')
        return float(salary_str)
    except ValueError:
        # Return NaN for entries like '$N/A'
        return np.nan
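The per-value function above can also be expressed on a whole column at once. The following is a minimal sketch, not part of the original notebook: the name `clean_salary_vectorized` is hypothetical, and it assumes that any value that still fails to parse after stripping `$` and `,` should become NaN, as in `clean_salary`.

```python
import numpy as np
import pandas as pd

def clean_salary_vectorized(salaries: pd.Series) -> pd.Series:
    """Hypothetical column-wise counterpart to clean_salary."""
    # Cast to str so numeric inputs are handled too, then drop '$' and ','
    stripped = salaries.astype(str).str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns anything unparseable (e.g. 'N/A') into NaN
    return pd.to_numeric(stripped, errors="coerce")

sample = pd.Series(['$75,000', '52000', '$N/A', '61,500', '$48,000.50', np.nan])
cleaned = clean_salary_vectorized(sample)
print(cleaned)
```

On large columns this form usually avoids calling a Python function once per row, although edge cases such as surrounding whitespace may be treated slightly differently from `float()`.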

Custom Function 2: standardize_department

In [3]:
def standardize_department(dept):
    """
    Standardizes department names to a consistent title-case format.
    """
    if pd.isna(dept):
        return 'Unknown'
    
    # .strip() removes surrounding whitespace; .title() capitalizes the first letter of
    # each word and lowercases the rest, so the acronym 'IT' becomes 'It'
    return str(dept).strip().title()
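Because `.title()` lowercases everything after the first letter, an acronym such as 'IT' comes out as 'It'. If that is not wanted, one possible variant keeps a small set of known acronyms in upper case. This is a sketch only: the function name and the `KNOWN_ACRONYMS` set are assumptions made for illustration.

```python
import pandas as pd

# Assumption: an illustrative set of acronyms to keep in upper case
KNOWN_ACRONYMS = {"IT", "HR"}

def standardize_department_keep_acronyms(dept):
    """Hypothetical variant of standardize_department that preserves acronyms."""
    if pd.isna(dept):
        return "Unknown"
    cleaned = str(dept).strip()
    # Compare case-insensitively so 'it' and 'It' also map to 'IT'
    if cleaned.upper() in KNOWN_ACRONYMS:
        return cleaned.upper()
    return cleaned.title()

departments = pd.Series(['Sales', 'IT', 'sales', 'Marketing', 'IT', 'Finance'])
standardized = departments.apply(standardize_department_keep_acronyms)
print(standardized.tolist())
# ['Sales', 'IT', 'Sales', 'Marketing', 'IT', 'Finance']
```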

Custom Function 3: calculate_age

In [4]:
def calculate_age(birthdate):
    """
    Calculates the age in years from a birthdate (Timestamp object).
    """
    if pd.isna(birthdate):
        return np.nan
    
    # 'today' is recomputed on every call, which is acceptable for this small example
    today = pd.to_datetime('today')
    
    # Approximate the age: elapsed days divided by the mean year length (365.25),
    # truncated to a whole number of years
    age = (today - birthdate).days / 365.25
    return int(age)
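Dividing by 365.25 is an approximation and can be off by one close to a birthday. A calendar-based comparison avoids that. The sketch below is not from the original notebook: `calculate_age_exact` and its optional `today` parameter are hypothetical, and the reference date is fixed only to make the example reproducible.

```python
import pandas as pd

def calculate_age_exact(birthdate, today=None):
    """Hypothetical calendar-based alternative to calculate_age."""
    if pd.isna(birthdate):
        return float("nan")
    if today is None:
        today = pd.Timestamp.today()
    # Has this year's birthday already happened (or is it today)?
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return today.year - birthdate.year - (0 if had_birthday else 1)

# Assumption: a fixed reference date chosen only for illustration
fixed_today = pd.Timestamp("2025-06-01")
birthdates = pd.to_datetime(pd.Series(['1990-05-15', '1992-07-30']))
ages = birthdates.apply(calculate_age_exact, today=fixed_today)
print(ages.tolist())
# [35, 32]
```

Passing `today` explicitly also makes the function deterministic, which helps when testing.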

Lambda Function: categorize_salary

In [5]:
# A lambda function to be used with .apply().
# Note: every comparison with NaN is False, so missing salaries fall through to 'Low'.
categorize_salary = lambda s: 'High' if s > 60000 else ('Medium' if s >= 50000 else 'Low')
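Comparisons against NaN always evaluate to False, so the lambda above labels a missing salary as 'Low'. If missing values should stay distinguishable, one option is to test for them first. This is a sketch under assumptions: the name `categorize_salary_safe` and the 'Unknown' label are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Assumption: 'Unknown' is an illustrative label for missing salaries
categorize_salary_safe = lambda s: (
    'Unknown' if pd.isna(s)
    else 'High' if s > 60000
    else 'Medium' if s >= 50000
    else 'Low'
)

salaries = pd.Series([75000.0, 52000.0, np.nan, 61500.0, 48000.5, np.nan])
categories = salaries.apply(categorize_salary_safe)
print(categories.tolist())
# ['High', 'Medium', 'Unknown', 'High', 'Low', 'Unknown']
```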

3. Applying the Functions Systematically
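As a complement to chaining `.assign()`, cleaning steps can also be written as small DataFrame-to-DataFrame functions and composed with `.pipe()`. The sketch below is illustrative only: the helper names `strip_currency` and `title_case` and the three-row sample are assumptions, not part of the original notebook.

```python
import pandas as pd

def strip_currency(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical step: return a copy with `column` parsed as a number."""
    cleaned = frame[column].astype(str).str.replace(r"[$,]", "", regex=True)
    return frame.assign(**{column: pd.to_numeric(cleaned, errors="coerce")})

def title_case(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Hypothetical step: return a copy with `column` trimmed and title-cased."""
    return frame.assign(**{column: frame[column].str.strip().str.title()})

# Illustrative sample data (not the notebook's DataFrame)
raw = pd.DataFrame({
    "Salary": ["$75,000", "61,500", "$N/A"],
    "Department": ["Sales", " sales ", "Marketing"],
})

# Each step takes a DataFrame and returns a new one, so `raw` is left untouched
tidy = (
    raw
    .pipe(strip_currency, column="Salary")
    .pipe(title_case, column="Department")
)
print(tidy)
```

Because every step returns a new DataFrame, the steps can be reordered, reused on other columns, or tested in isolation.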

In [6]:
print("\n--- Processing and Cleaning Data ---")

# We build a new, clean DataFrame in a single, readable chain
df_clean = (
    df.assign(
        # 1. Clean Salary (uses custom `def` function)
        Salary_Clean=lambda x: x['Salary'].apply(clean_salary),
        
        # 2. Standardize Department (uses custom `def` function)
        Department_Std=lambda x: x['Department'].apply(standardize_department),
        
        # 3. Convert BirthDate (prerequisite for age calculation)
        BirthDate_DT=lambda x: pd.to_datetime(x['BirthDate'])
    )
    # This second .assign() can see the columns created in the first one
    .assign(
        # 4. Calculate Age (uses custom `def` function)
        Age=lambda x: x['BirthDate_DT'].apply(calculate_age)
    )
    # This third .assign() can see the 'Salary_Clean' column
    .assign(
        # 5. Apply Lambda Function
        Salary_Category=lambda x: x['Salary_Clean'].apply(categorize_salary)
    )
    # Finally, drop the original "dirty" and intermediate columns
    .drop(columns=['Salary', 'Department', 'BirthDate', 'BirthDate_DT'])
)

print("\n--- Final 'Clean' DataFrame ---")
print(df_clean)
print("\n--- Final Data Types ---")
df_clean.info()


--- Processing and Cleaning Data ---

--- Final 'Clean' DataFrame ---
  EmployeeID     Name  Salary_Clean Department_Std  Age Salary_Category
0       A101    Alice       75000.0          Sales   35            High
1       A102      Bob       52000.0             It   39          Medium
2       A103  Charlie           NaN          Sales   33             Low
3       A104    David       61500.0      Marketing   25            High
4       A105      Eve       48000.5             It   30             Low
5       A106    Frank           NaN        Finance   37             Low

--- Final Data Types ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   EmployeeID       6 non-null      object 
 1   Name             6 non-null      object 
 2   Salary_Clean     4 non-null      float64
 3   Department_Std   6 non-null      object 
 4   Age              6