### Step 1: Data Preprocessing and Feature Engineering

#### Goal:
Transform the raw data into features a predictive model can use: clean the dataset, remove features that are irrelevant or unavailable at prediction time, and engineer new ones.

#### Explanation:
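1. **Remove Irrelevant Features**: Some features, such as `duration` (the length of the call), are known only after the contact has taken place. They are unavailable at prediction time and would leak the outcome, so they must not be used to train a predictive model.
2. **Feature Engineering**: Create new features, such as `previous_contact`, a binary flag derived from `pdays` (where the value 999 marks a client who was never contacted before), which may capture patterns in the data better than the raw column.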

In [1]:
import pandas as pd
import numpy as np

def preprocess_data(df):
    """
    This function preprocesses the bank marketing dataset by performing feature selection and engineering.
    
    Parameters:
    df (pd.DataFrame): The raw dataset with customer interaction data.
    
    Returns:
    pd.DataFrame: The processed dataset ready for modeling.
    """
    
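    # Work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()
    
    # Remove 'duration': it is only known after the call, so it would leak the outcome
    if 'duration' in df.columns:
        df = df.drop(columns=['duration'])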
    
    # Feature Engineering: Create a new feature, previous_contact, based on 'pdays' column
    if 'pdays' in df.columns:
        df['previous_contact'] = np.where(df['pdays'] == 999, 0, 1)  # 0 = never contacted, 1 = contacted
        df = df.drop(columns=['pdays'])  # Drop the 'pdays' column as it's now represented by 'previous_contact'
    
    # Other preprocessing steps, such as encoding categorical variables, handling missing values, etc.
    # For simplicity, we'll assume that all categorical variables are already encoded (if necessary).
    
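    # Convert the yes/no target 'y' to a binary 0/1 column named 'subscribed',
    # which is the column name the later steps expect
    if 'y' in df.columns:
        df['subscribed'] = df['y'].map({'no': 0, 'yes': 1})
        df = df.drop(columns=['y'])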
    
    return df

# Example usage:
# Assuming you have a DataFrame `df` with the original data:
# df = pd.read_csv("bank_marketing.csv")
# df_processed = preprocess_data(df)
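
The function above assumes categorical variables are already encoded. If they are not, one common option is one-hot encoding with `pd.get_dummies`. The following is a minimal sketch on a toy DataFrame; the column names `job` and `contact` and their values are illustrative assumptions, not read from the actual file.

```python
import pandas as pd

# Toy rows with hypothetical values for two categorical columns
toy = pd.DataFrame({
    "age": [34, 51, 28],
    "job": ["admin.", "technician", "admin."],
    "contact": ["cellular", "telephone", "cellular"],
})

# Columns to encode are listed explicitly (an assumption for this sketch)
categorical_cols = ["job", "contact"]

# One-hot encode the listed columns; 'age' passes through unchanged
encoded = pd.get_dummies(toy, columns=categorical_cols, dtype=int)

print(encoded.columns.tolist())
```

Each category becomes its own 0/1 column, so the result contains only numeric values and can be passed to a model that expects numeric input.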

### Step 2: Defining KPIs

#### Goal:
Define measurable outcomes that reflect the performance of the campaign. KPIs provide a connection between business goals and analytical models.

#### KPIs:
1. **Conversion Rate (CR)**: The percentage of contacted clients who subscribe to the term deposit.
2. **Cost per Acquisition (CPA)**: The total campaign cost divided by the number of new subscribers.
3. **Response Lift**: The ratio of the conversion rate among clients targeted by the predictive model to the conversion rate under random targeting. A value above 1 means the model outperforms random selection.
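
As a quick illustration of how the three KPIs relate, the arithmetic can be worked through by hand. All figures below (contact counts, the flat cost per contact, the size of the model-selected group) are made up for this sketch and are not results from the dataset.

```python
# Hypothetical campaign figures, chosen only to make the arithmetic easy to follow
contacted = 1000          # clients contacted at random
subscribed = 110          # of those, clients who subscribed
cost_per_contact = 5.0    # assumed flat cost per call

# Conversion Rate, as a percentage of contacted clients
conversion_rate = subscribed / contacted * 100

# Cost per Acquisition: total cost divided by number of subscribers
cost_per_acquisition = contacted * cost_per_contact / subscribed

# Suppose a model selects 200 clients and 60 of them subscribe
targeted_rate = 60 / 200
baseline_rate = subscribed / contacted
response_lift = targeted_rate / baseline_rate

print(f"CR   = {conversion_rate:.1f}%")
print(f"CPA  = {cost_per_acquisition:.2f}")
print(f"Lift = {response_lift:.2f}x")
```

With these numbers the model-selected group converts at 30% against an 11% baseline, so the lift is well above 1.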

In [2]:
def calculate_kpis(df):
    """
    This function calculates the KPIs for a marketing campaign, such as Conversion Rate, Cost per Acquisition, and Response Lift.
    
    Parameters:
    df (pd.DataFrame): The processed dataset with 'subscribed', 'cost', and 'predicted_probability' columns.
    
    Returns:
    dict: A dictionary with KPI values.
    """
    
    # Calculate Conversion Rate (CR)
    cr = df['subscribed'].mean() * 100  # Percentage of subscribed customers
    
    # Calculate Cost per Acquisition (CPA)
    total_cost = df['cost'].sum()
    total_conversions = df['subscribed'].sum()
    cpa = total_cost / total_conversions if total_conversions > 0 else float('nan')  # undefined without conversions
    
    # Calculate Response Lift: conversion rate among model-targeted clients
    # relative to the overall rate, which stands in for random targeting
    targeted = df[df['predicted_probability'] > 0.5]
    response_rate_random = df['subscribed'].mean()
    if len(targeted) > 0 and response_rate_random > 0:
        lift = targeted['subscribed'].mean() / response_rate_random
    else:
        lift = float('nan')  # undefined when nobody is targeted or nobody subscribed
    
    # Return KPIs as a dictionary
    return {
        'Conversion Rate (CR)': cr,
        'Cost per Acquisition (CPA)': cpa,
        'Response Lift': lift
    }

# Example usage:
# Assuming you have a DataFrame `df` with the necessary columns:
# df = pd.read_csv("processed_bank_marketing.csv")
# kpis = calculate_kpis(df)
# print(kpis)

### Step 3: Formulating the Analytical Question

#### Goal:
Translate the business problem into a predictive modeling task.

#### Analytical Question:
Given customer data (demographics, campaign history, economic indicators), what is the probability that a customer will subscribe to a term deposit if contacted during the campaign?

#### Model Task:
- **Target Variable**: `subscribed` (binary 0/1, derived from the raw dataset's yes/no column `y`)
- **Predictors**: Customer attributes (e.g., age, job type), campaign variables (e.g., number of contacts, contact method), and economic indicators (e.g., employment variation rate, Euribor rate).

In [6]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

def train_logistic_regression(df):
    """
    This function trains a logistic regression model to predict customer subscriptions.
    
    Parameters:
    df (pd.DataFrame): The processed dataset with features and target variable.
    
    Returns:
    model: The trained logistic regression model.
    """
    
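    # Prepare the features (X) and target (y). Every remaining column is treated as a
    # numeric predictor, so drop helper columns such as 'cost' before calling this function
    X = df.drop(columns=['subscribed'])
    y = df['subscribed']
    
    # Split into training and test sets, stratifying so both keep the same class balance
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )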
    
    # Initialize and train the logistic regression model
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    
    # Calculate accuracy and AUC
    accuracy = accuracy_score(y_test, y_pred)
    auc_score = roc_auc_score(y_test, y_prob)
    
    print(f"Model Accuracy: {accuracy * 100:.2f}%")
    print(f"Model AUC: {auc_score:.2f}")
    
    return model

# Example usage:
# Assuming you have a DataFrame `df` with the necessary columns:
# model = train_logistic_regression(df)

### Step 4: Business Application

#### Goal:
Use the model's predictions to inform marketing decisions and prioritize outreach.

#### Strategy:
- **High Probability** (predicted probability above 0.7): Target aggressively with personalized messaging.
- **Medium Probability** (above 0.4 and up to 0.7): Use cost-effective methods to reach out.
- **Low Probability** (0.4 or below): Consider deprioritizing or leaving out of the campaign.

#### Business Impact:
Concentrating outreach on high-probability customers spends the contact budget where subscriptions are most likely, which should raise the conversion rate per contact and lower the cost per acquisition.
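
One way to check whether this prioritization holds up is to compare observed conversion rates across segments once outcomes are known. The sketch below assumes a DataFrame with `segment` and `subscribed` columns; the rows are invented for illustration only.

```python
import pandas as pd

# Hypothetical segmented clients with known outcomes (illustrative data only)
toy = pd.DataFrame({
    "segment": [
        "High Probability", "High Probability",
        "Medium Probability", "Medium Probability",
        "Low Probability", "Low Probability",
        "Low Probability", "Low Probability",
    ],
    "subscribed": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Per-segment size, number of subscribers, and observed conversion rate
summary = (
    toy.groupby("segment")["subscribed"]
       .agg(clients="count", conversions="sum", conversion_rate="mean")
       .sort_values("conversion_rate", ascending=False)
)

print(summary)
```

If the segments are meaningful, the observed conversion rate should fall from the high-probability group to the low-probability group, as it does in this toy data.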

In [7]:
def segment_clients(df, model):
    """
    This function segments clients based on their predicted probability of subscribing.
    
    Parameters:
    df (pd.DataFrame): The processed dataset containing the same feature columns the model was trained on.
    model: The trained model used to predict customer subscription probabilities.
    
    Returns:
    pd.DataFrame: A copy of the input with added 'predicted_probability' and 'segment' columns.
    """
    
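    # Work on a copy, and drop the target plus any columns added by an earlier call
    # so the features match those used in training
    df = df.copy()
    features = df.drop(columns=['subscribed', 'predicted_probability', 'segment'], errors='ignore')
    
    # Predict the probability of subscription for all clients
    df['predicted_probability'] = model.predict_proba(features)[:, 1]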
    
    # Segment clients based on predicted probability
    conditions = [
        (df['predicted_probability'] > 0.7),  # High Probability
        (df['predicted_probability'] <= 0.7) & (df['predicted_probability'] > 0.4),  # Medium Probability
        (df['predicted_probability'] <= 0.4)  # Low Probability
    ]
    choices = ['High Probability', 'Medium Probability', 'Low Probability']
    df['segment'] = np.select(conditions, choices, default='Low Probability')
    
    return df

# Example usage:
# Assuming you have a DataFrame `df` and a trained model `model`:
# df_segmented = segment_clients(df, model)
# print(df_segmented[['predicted_probability', 'segment']].head())