### 2.6 Preparation for Modeling

#### Goal:
Before training any model, prepare the data so that it aligns with the campaign’s strategic objectives. This stage connects the data, variables, and modeling decisions to the goals of the marketing campaign.

#### Data as a Strategic Asset:
The first step is to recognize that **data is not just raw input** but a **representation of business operations and customer behavior**. The quality of the data limits the potential of the analytics. The data must include various dimensions that provide insight into customer behavior:

1. **Customer Characteristics**: Demographic details (age, job, education), financial attributes (loan status, default), and behavioral indicators (e.g., past purchase behavior).
2. **Campaign Interaction Data**: Number of contacts, communication channel used, month and day of contact, and previous campaign outcomes.
3. **Historical Engagement**: Past interactions, such as whether the customer has been contacted before or has subscribed in the past.
4. **Contextual and Economic Conditions**: Variables like employment variation, consumer confidence, interest rates, and other factors affecting customer behavior.
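
One way to make these four dimensions concrete is to tag each column with the dimension it belongs to and check which ones a given dataset actually contains. The sketch below is illustrative only: the column names follow the UCI bank marketing layout and are assumptions, and `DIMENSIONS` and `audit_dimensions` are hypothetical names introduced here.

```python
import pandas as pd

# Hypothetical mapping from dimension to expected columns (assumed UCI-style names).
DIMENSIONS = {
    "customer_characteristics": ["age", "job", "education", "default", "loan"],
    "campaign_interaction": ["campaign", "contact", "month", "day_of_week", "poutcome"],
    "historical_engagement": ["pdays", "previous"],
    "economic_context": ["emp.var.rate", "cons.conf.idx", "euribor3m"],
}

def audit_dimensions(df: pd.DataFrame) -> pd.DataFrame:
    """Report, per dimension, which expected columns are present or missing."""
    rows = []
    for dimension, columns in DIMENSIONS.items():
        present = [c for c in columns if c in df.columns]
        missing = [c for c in columns if c not in df.columns]
        rows.append({"dimension": dimension, "present": present, "missing": missing})
    return pd.DataFrame(rows).set_index("dimension")

# Toy frame with only a few columns, to show what a coverage gap looks like.
toy = pd.DataFrame({"age": [35], "job": ["admin."], "campaign": [2], "pdays": [999]})
report = audit_dimensions(toy)
print(report)
```

A dimension with an empty `present` list (here, the economic context) signals that the model would have no view of that aspect of customer behavior.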

#### Feature Engineering and Strategic Design:
**Feature Engineering** is more than just a technical task; it is **strategic**. Each new feature or transformation represents a hypothesis about customer behavior. The features you create should be connected to the campaign’s goal.

For example:
- In the **bank marketing example**, creating the `previous_contact` feature (indicating whether a client had been contacted before) is a **strategic decision** that provides insight into how a customer might respond to future contacts.
- Similarly, **grouping job categories** or **creating age ranges** are strategic decisions that affect the model's performance and interpretability.
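
Treating a feature as a hypothesis means it can be checked against the outcome before any model is trained. The sketch below compares subscription rates between clients with and without a prior contact. The data is a made-up toy frame used purely for illustration, and the `pdays` sentinel of 999 and the `yes`/`no` target labels are assumed to match the bank marketing example.

```python
import pandas as pd

# Toy data: made-up values for illustration, not real campaign results.
toy = pd.DataFrame({
    "pdays": [999, 999, 3, 6, 999, 2],
    "y":     ["no", "no", "yes", "no", "no", "yes"],
})

# The feature encodes the hypothesis "prior contact changes the response".
toy["previous_contact"] = (toy["pdays"] != 999).astype(int)

# Subscription rate per group: a clear gap suggests the feature carries signal.
rate_by_contact = (toy["y"] == "yes").groupby(toy["previous_contact"]).mean()
print(rate_by_contact)
```

On real data, a gap like this is only a first indication; it does not by itself show that prior contact causes the difference.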

In [1]:
import pandas as pd
import numpy as np

def prepare_data_for_modeling(df):
    """
    This function prepares the dataset by transforming and creating features relevant for predictive modeling.
    
    Parameters:
    df (pd.DataFrame): The dataset containing customer and campaign data.
    
    Returns:
    pd.DataFrame: The preprocessed dataset with relevant features.
    """
    
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Step 1: Create 'previous_contact' feature based on 'pdays' column
    # 'pdays' represents days since last contact; 999 means never contacted before
    if 'pdays' in df.columns:
        df['previous_contact'] = np.where(df['pdays'] == 999, 0, 1)  # 0 = never contacted, 1 = contacted before
        df = df.drop(columns=['pdays'])  # Drop 'pdays'; note this also discards how recent the last contact was

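    # Step 2: Create 'age_group' based on age (customer characteristics)
    # Bins are half-open [lower, upper) because right=False; the open-ended outer bins
    # keep ages below 18 or above 100 from silently becoming NaN
    bins = [0, 30, 40, 50, 60, np.inf]
    labels = ['<30', '30-39', '40-49', '50-59', '60+']
    df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)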
    
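    # Step 3: Group job categories (customer characteristics)
    # Vectorized and NaN-safe: missing jobs fall into 'Other' instead of raising a TypeError
    df['job_group'] = np.where(df['job'].astype(str).str.contains('admin', regex=False), 'Admin', 'Other')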

    # Step 4: Convert 'education' to numeric (0 = basic, 1 = higher education)
    # The label 'higher' must match the values actually present in the data
    # (the UCI bank-additional file, for example, uses 'university.degree')
    df['education_numeric'] = (df['education'] == 'higher').astype(int)

    # Step 5: Add an indicator for whether a customer has a loan or not (financial data)
    df['has_loan'] = (df['loan'] == 'yes').astype(int)
    
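    # Step 6: Handle missing values (example: replace NaNs in numeric columns with the column mean)
    # Restricting to numeric columns avoids the TypeError that df.mean() raises on text columns in pandas 2.x
    numeric_cols = df.select_dtypes(include='number').columns
    df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())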
    
    # Step 7: Remove the target variable 'y' from the feature set. It is still needed as the
    # label for training, so save it (e.g., y = df['y']) before calling this function.
    df = df.drop(columns=['y'], errors='ignore')
    
    return df

# Example usage:
# Assuming you have a DataFrame `df`:
# df = pd.read_csv("bank_marketing_data.csv")
# y = df['y']  # keep the target separately; the function removes it from the features
# df_preprocessed = prepare_data_for_modeling(df)
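
The `right=False` argument used for the age bins makes each interval closed on the left and open on the right, which decides where boundary ages such as 30 end up. The sketch below is a small illustration on made-up ages; it uses bounded outer edges (18 and 100) to show what happens to values that fall outside them.

```python
import pandas as pd

ages = pd.Series([17, 29, 30, 59, 60, 98])

# Half-open bins [lower, upper): 30 falls in the second bin, not the first.
groups = pd.cut(ages, bins=[18, 30, 40, 50, 60, 100], right=False)
print(groups.tolist())

# Values outside the outermost edges (17 here) are not assigned to any bin.
outside = groups.isna()
print(outside.tolist())
```

Unassigned values show up as missing categories, so it is worth checking `isna()` on a binned column before moving on.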

In [2]:
def feature_engineering(df):
    """
    This function performs feature engineering by creating and transforming variables that will help improve the predictive model.
    
    Parameters:
    df (pd.DataFrame): The preprocessed dataset with customer data.
    
    Returns:
    pd.DataFrame: The dataset with engineered features.
    """
    
    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # Step 1: Create a 'contact_frequency' feature based on the number of contacts
    df['contact_frequency'] = df['campaign']  # Assuming 'campaign' contains the number of contacts
    
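    # Step 2: Transform 'month' and 'day_of_week' into cyclical features (important for campaign timing)
    # Map labels to their calendar position first: category codes follow alphabetical order
    # (apr, aug, dec, ...), which would scramble the cycle.
    # Assumes labels like 'jan'..'dec' and 'mon'..'sun'; unmapped labels become NaN.
    month_order = ['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec']
    day_order = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
    month_idx = df['month'].str.lower().map({m: i for i, m in enumerate(month_order)})
    day_idx = df['day_of_week'].str.lower().map({d: i for i, d in enumerate(day_order)})
    df['month_sin'] = np.sin(2 * np.pi * month_idx / 12)  # Month as sine
    df['month_cos'] = np.cos(2 * np.pi * month_idx / 12)  # Month as cosine
    df['day_of_week_sin'] = np.sin(2 * np.pi * day_idx / 7)  # Day as sine
    df['day_of_week_cos'] = np.cos(2 * np.pi * day_idx / 7)  # Day as cosine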
    
    # Step 3: Economic condition features (example: make sure the consumer confidence index is numeric)
    df['consumer_conf_numeric'] = pd.to_numeric(df['cons.conf.idx'], errors='coerce')  # Non-numeric entries become NaN
    
    return df

# Example usage:
# Assuming you have a preprocessed DataFrame `df_preprocessed`:
# df_final = feature_engineering(df_preprocessed)
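
To see why months and weekdays are encoded as a sine/cosine pair, it helps to compare distances between neighbouring months. The sketch below is a minimal illustration; the helper `cyclical` is a hypothetical name, and it assumes months are given as 0-based calendar positions (0 = January, 11 = December).

```python
import numpy as np

def cyclical(position: int, period: int) -> np.ndarray:
    """Return the (sin, cos) point for a 0-based position on a cycle."""
    angle = 2 * np.pi * position / period
    return np.array([np.sin(angle), np.cos(angle)])

jan, feb, dec = cyclical(0, 12), cyclical(1, 12), cyclical(11, 12)

# Neighbouring months are equally far apart, including across the year boundary.
dist_jan_feb = np.linalg.norm(jan - feb)
dist_dec_jan = np.linalg.norm(dec - jan)
print(dist_jan_feb, dist_dec_jan)

# With plain integer codes, December (11) and January (0) would be 11 units apart.
```

This property only holds when the positions follow calendar order, so the mapping from month labels to positions matters as much as the trigonometric transform itself.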