# Customer Personality Prediction Model

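This notebook develops a machine learning model to predict customer personality traits from marketing campaign data. Concretely, the prediction target is the binary `Response` column, and the demographic, behavioral, and purchase features serve as inputs.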

## Problem Statement
In competitive markets, businesses struggle to design marketing strategies that effectively target customers. This project aims to leverage advanced machine learning models to predict customer personality profiles based on demographic, behavioral, and purchase history data.

## Objectives
- Develop a machine learning model to predict customer personality traits
- Improve marketing campaign effectiveness through targeted personalization
- Provide actionable insights for marketing teams


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, classification_report, confusion_matrix
import xgboost as xgb
import lightgbm as lgb
from sklearn.neural_network import MLPClassifier
import warnings
warnings.filterwarnings('ignore')

# Set style for plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## 1. Data Collection and Loading


In [None]:
# Load the dataset
df = pd.read_csv('dataset_file.rtfd/marketing_campaign.csv.xls', sep='\t')

print(f"Dataset shape: {df.shape}")
print("\nFirst few rows:")
df.head()


In [None]:
# Basic information about the dataset
print("Dataset Info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nMissing values:")
print(df.isnull().sum())
print("\nDataset Statistics:")
df.describe()


## 2. Data Preprocessing


In [None]:
# Create a copy for preprocessing
df_processed = df.copy()

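# Handle missing values in Income (assign the result back instead of calling
# fillna(inplace=True) on a column selection, which newer pandas versions warn about)
if df_processed['Income'].isnull().sum() > 0:
    df_processed['Income'] = df_processed['Income'].fillna(df_processed['Income'].median())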

# Convert Dt_Customer to datetime and extract features
# Note: age and tenure are measured against a fixed reference date (2024-01-01)
df_processed['Dt_Customer'] = pd.to_datetime(df_processed['Dt_Customer'], format='%d-%m-%Y')
df_processed['Customer_Age'] = 2024 - df_processed['Year_Birth']
df_processed['Days_Since_Customer'] = (pd.Timestamp('2024-01-01') - df_processed['Dt_Customer']).dt.days

# Drop unnecessary columns
df_processed = df_processed.drop(['ID', 'Year_Birth', 'Dt_Customer', 'Z_CostContact', 'Z_Revenue'], axis=1)

print("After preprocessing:")
print(f"Shape: {df_processed.shape}")
print(f"\nMissing values: {df_processed.isnull().sum().sum()}")
df_processed.head()
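
The median imputation above is computed on the full dataset, so rows that later land in the test set contribute to the fill value. The following is a minimal sketch of an alternative that learns the median from the training split only. It assumes the split happens before imputation, uses scikit-learn's `SimpleImputer` (a swapped-in technique, not what this notebook does), and runs on a made-up toy frame rather than the real data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real data; all values are made up for illustration.
toy = pd.DataFrame({
    "Income": [30000.0, np.nan, 52000.0, 61000.0, np.nan, 45000.0, 70000.0, 38000.0],
    "Response": [0, 0, 1, 0, 1, 0, 1, 0],
})

# Split first, so the test rows cannot influence the imputation value.
train, test = train_test_split(toy, test_size=0.25, random_state=42)

# Learn the median on the training rows only, then reuse it for the test rows.
imputer = SimpleImputer(strategy="median")
train_income = imputer.fit_transform(train[["Income"]])
test_income = imputer.transform(test[["Income"]])

print("Median learned from training rows:", imputer.statistics_[0])
```

The fitted imputer could then be saved alongside the other preprocessing objects so that new data is filled with the same value.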


## 3. Feature Engineering


In [None]:
# Create derived features
df_processed['Total_Spent'] = (df_processed['MntWines'] + df_processed['MntFruits'] + 
                                df_processed['MntMeatProducts'] + df_processed['MntFishProducts'] + 
                                df_processed['MntSweetProducts'] + df_processed['MntGoldProds'])

df_processed['Total_Purchases'] = (df_processed['NumDealsPurchases'] + df_processed['NumWebPurchases'] + 
                                    df_processed['NumCatalogPurchases'] + df_processed['NumStorePurchases'])

df_processed['Total_Accepted_Campaigns'] = (df_processed['AcceptedCmp1'] + df_processed['AcceptedCmp2'] + 
                                             df_processed['AcceptedCmp3'] + df_processed['AcceptedCmp4'] + 
                                             df_processed['AcceptedCmp5'])

# Adding 1 to the denominator avoids division by zero for customers with no purchases
df_processed['Avg_Purchase_Value'] = df_processed['Total_Spent'] / (df_processed['Total_Purchases'] + 1)
df_processed['Children'] = df_processed['Kidhome'] + df_processed['Teenhome']
df_processed['Family_Size'] = df_processed['Children'] + 1  # Children plus one adult; a partner is not counted

# Spending patterns
df_processed['Wine_Ratio'] = df_processed['MntWines'] / (df_processed['Total_Spent'] + 1)
df_processed['Meat_Ratio'] = df_processed['MntMeatProducts'] / (df_processed['Total_Spent'] + 1)
df_processed['Gold_Ratio'] = df_processed['MntGoldProds'] / (df_processed['Total_Spent'] + 1)

# Purchase channel preferences
df_processed['Web_Purchase_Ratio'] = df_processed['NumWebPurchases'] / (df_processed['Total_Purchases'] + 1)
df_processed['Store_Purchase_Ratio'] = df_processed['NumStorePurchases'] / (df_processed['Total_Purchases'] + 1)
df_processed['Catalog_Purchase_Ratio'] = df_processed['NumCatalogPurchases'] / (df_processed['Total_Purchases'] + 1)

print("Feature engineering completed!")
print(f"New shape: {df_processed.shape}")
df_processed.head()
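
The ratio features above add 1 to every denominator to avoid division by zero, which also shifts every ratio slightly downward (a customer who spends 100 entirely on wine gets a ratio of about 0.99 rather than 1.0). A possible alternative, sketched here with a hypothetical helper named `safe_ratio` and made-up values, divides exactly and falls back to 0 only where the denominator is zero.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Return numerator / denominator, using 0.0 where the denominator is 0."""
    # Replace zeros with NaN so the division yields NaN instead of inf or an error.
    denom = denominator.replace(0, np.nan)
    return (numerator / denom).fillna(0.0)


# Toy values for illustration; the middle customer has spent nothing.
toy = pd.DataFrame({"MntWines": [50, 0, 10], "Total_Spent": [100, 0, 40]})
toy["Wine_Ratio"] = safe_ratio(toy["MntWines"], toy["Total_Spent"])
print(toy)
```

Whether the exact ratio or the smoothed `+ 1` version works better for the models is an empirical question; the sketch only shows how the two differ.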


In [None]:
# Encode categorical variables
le_education = LabelEncoder()
le_marital = LabelEncoder()

df_processed['Education_Encoded'] = le_education.fit_transform(df_processed['Education'])
df_processed['Marital_Status_Encoded'] = le_marital.fit_transform(df_processed['Marital_Status'])

# Drop original categorical columns
df_processed = df_processed.drop(['Education', 'Marital_Status'], axis=1)

# Separate features and target
X = df_processed.drop('Response', axis=1)
y = df_processed['Response']

print(f"Features shape: {X.shape}")
print(f"Target distribution:\n{y.value_counts()}")
print(f"\nTarget distribution %:\n{y.value_counts(normalize=True) * 100}")
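
`LabelEncoder` maps each category to an integer, which gives nominal values such as marital status an arbitrary order and raises an error when it meets a category it has not seen. One commonly used alternative is one-hot encoding. The sketch below uses scikit-learn's `OneHotEncoder` with `handle_unknown="ignore"` on made-up category values; it is an illustration of the technique, not the encoding this notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy categories for illustration; "Widow" does not appear in the fitting data.
train_cats = pd.DataFrame({"Marital_Status": ["Single", "Married", "Divorced", "Married"]})
new_cats = pd.DataFrame({"Marital_Status": ["Married", "Widow"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at prediction time.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train_cats)
encoded = encoder.transform(new_cats).toarray()

print("Learned categories:", list(encoder.categories_[0]))
print(encoded)
```

Tree-based models often cope well with integer codes, so the practical gain is likely largest for the neural network.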


In [None]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Scale numerical features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"\nTraining target distribution:\n{y_train.value_counts()}")
print(f"\nTest target distribution:\n{y_test.value_counts()}")
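
The scaler above is fitted once on the whole training set. If cross-validation is later run on `X_train_scaled`, each validation fold has already influenced the scaling statistics. A minimal sketch of one way around this, assuming a scikit-learn `Pipeline` and synthetic data from `make_classification` (both are illustrative choices, not part of this notebook), refits the scaler inside every fold:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data; not the marketing dataset.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, weights=[0.85, 0.15], random_state=42
)

# The pipeline fits the scaler on the training part of each fold only.
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=42),
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(f"CV ROC-AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```

A pipeline also keeps the scaler and the model in a single object, which simplifies saving and loading.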


## 4. Model Development

We'll train and compare multiple models:
1. Random Forest
2. XGBoost
3. LightGBM
4. Neural Network
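
If the target distribution printed earlier turns out to be skewed toward non-responders, the classifiers listed above may favor the majority class. The sketch below shows how balanced class weights are derived, using a made-up 85/15 label split as an assumption. `class_weight="balanced"` is a scikit-learn option, and the negative-to-positive ratio is the value commonly passed to XGBoost's `scale_pos_weight` parameter.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 85 non-responders and 15 responders.
y_demo = np.array([0] * 85 + [1] * 15)

# "balanced" weights follow n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
print(dict(zip([0, 1], weights.round(3))))

# Ratio of negatives to positives, a common choice for scale_pos_weight.
scale_pos_weight = (y_demo == 0).sum() / (y_demo == 1).sum()
print(f"scale_pos_weight: {scale_pos_weight:.3f}")

# The same weighting can be requested directly from a scikit-learn model.
rf_balanced = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
```

Weighting usually trades some precision for recall, so it is worth comparing both variants on the metrics reported below.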


In [None]:
# Function to evaluate models
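def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """Fit the model, score it on the train/test split, and run 5-fold CV on the training set.

    Returns the fitted model, a dict of metrics, the test predictions,
    and the predicted probabilities for the positive class.
    """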
    # Train
    model.fit(X_train, y_train)
    
    # Predictions
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    y_test_proba = model.predict_proba(X_test)[:, 1]
    
    # Metrics
    train_accuracy = accuracy_score(y_train, y_train_pred)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    precision = precision_score(y_test, y_test_pred)
    recall = recall_score(y_test, y_test_pred)
    f1 = f1_score(y_test, y_test_pred)
    roc_auc = roc_auc_score(y_test, y_test_proba)
    
    # Cross-validation
    cv_scores = cross_val_score(model, X_train, y_train, cv=StratifiedKFold(n_splits=5), scoring='roc_auc')
    
    results = {
        'Model': model_name,
        'Train_Accuracy': train_accuracy,
        'Test_Accuracy': test_accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1_Score': f1,
        'ROC_AUC': roc_auc,
        'CV_ROC_AUC_Mean': cv_scores.mean(),
        'CV_ROC_AUC_Std': cv_scores.std()
    }
    
    return model, results, y_test_pred, y_test_proba


### 4.1 Random Forest Classifier


In [None]:
# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10, min_samples_split=5)
rf_trained, rf_results, rf_pred, rf_proba = evaluate_model(
    rf_model, X_train, X_test, y_train, y_test, 'Random Forest'
)

print("Random Forest Results:")
for key, value in rf_results.items():
    if key == 'Model':
        continue  # the model name is a string and cannot be formatted with :.4f
    print(f"{key}: {value:.4f}")


### 4.2 XGBoost Classifier


In [None]:
# XGBoost
xgb_model = xgb.XGBClassifier(n_estimators=100, random_state=42, max_depth=6, learning_rate=0.1)
xgb_trained, xgb_results, xgb_pred, xgb_proba = evaluate_model(
    xgb_model, X_train, X_test, y_train, y_test, 'XGBoost'
)

print("XGBoost Results:")
for key, value in xgb_results.items():
    if key == 'Model':
        continue  # the model name is a string and cannot be formatted with :.4f
    print(f"{key}: {value:.4f}")


### 4.3 LightGBM Classifier


In [None]:
# LightGBM
lgb_model = lgb.LGBMClassifier(n_estimators=100, random_state=42, max_depth=6, learning_rate=0.1, verbose=-1)
lgb_trained, lgb_results, lgb_pred, lgb_proba = evaluate_model(
    lgb_model, X_train, X_test, y_train, y_test, 'LightGBM'
)

print("LightGBM Results:")
for key, value in lgb_results.items():
    if key == 'Model':
        continue  # the model name is a string and cannot be formatted with :.4f
    print(f"{key}: {value:.4f}")


### 4.4 Neural Network (MLP Classifier)


In [None]:
# Neural Network (using scaled data)
nn_model = MLPClassifier(hidden_layer_sizes=(100, 50), max_iter=500, random_state=42, alpha=0.01)
nn_trained, nn_results, nn_pred, nn_proba = evaluate_model(
    nn_model, X_train_scaled, X_test_scaled, y_train, y_test, 'Neural Network'
)

print("Neural Network Results:")
for key, value in nn_results.items():
    if key == 'Model':
        continue  # the model name is a string and cannot be formatted with :.4f
    print(f"{key}: {value:.4f}")


## 5. Model Comparison


In [None]:
# Compare all models
results_df = pd.DataFrame([rf_results, xgb_results, lgb_results, nn_results])
results_df = results_df.set_index('Model')
print("Model Comparison:")
print(results_df.round(4))


In [None]:
# Visualize model comparison
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Accuracy comparison
results_df[['Test_Accuracy', 'Train_Accuracy']].plot(kind='bar', ax=axes[0, 0], title='Accuracy Comparison')
axes[0, 0].set_ylabel('Accuracy')
axes[0, 0].legend()
axes[0, 0].tick_params(axis='x', rotation=45)

# Precision, Recall, F1
results_df[['Precision', 'Recall', 'F1_Score']].plot(kind='bar', ax=axes[0, 1], title='Precision, Recall, F1 Comparison')
axes[0, 1].set_ylabel('Score')
axes[0, 1].legend()
axes[0, 1].tick_params(axis='x', rotation=45)

# ROC-AUC
results_df[['ROC_AUC', 'CV_ROC_AUC_Mean']].plot(kind='bar', ax=axes[1, 0], title='ROC-AUC Comparison')
axes[1, 0].set_ylabel('ROC-AUC')
axes[1, 0].legend()
axes[1, 0].tick_params(axis='x', rotation=45)

# Cross-validation scores
results_df['CV_ROC_AUC_Mean'].plot(kind='bar', ax=axes[1, 1], title='Cross-Validation ROC-AUC Mean', color='green')
axes[1, 1].set_ylabel('CV ROC-AUC Mean')
axes[1, 1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()


## 6. Select Best Model and Save


In [None]:
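# Select best model based on test-set ROC-AUC
# (the test set then doubles as a selection set, so its reported metrics are somewhat optimistic)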
best_model_name = results_df['ROC_AUC'].idxmax()
print(f"Best model: {best_model_name} with ROC-AUC: {results_df.loc[best_model_name, 'ROC_AUC']:.4f}")

# Get the best model
if best_model_name == 'Random Forest':
    best_model = rf_trained
    use_scaled = False
elif best_model_name == 'XGBoost':
    best_model = xgb_trained
    use_scaled = False
elif best_model_name == 'LightGBM':
    best_model = lgb_trained
    use_scaled = False
else:
    best_model = nn_trained
    use_scaled = True

print("\nBest model details:")
print(results_df.loc[best_model_name])
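
Choosing the winner by test-set ROC-AUC means the test set is also used for model selection. One option is to select on the cross-validated score and keep the test set for a final check only. The sketch below assumes this approach and uses hypothetical model names and made-up scores, not results from this notebook; it also replaces the `if`/`elif` chain with a dictionary lookup.

```python
import pandas as pd

# Hypothetical scores for illustration only; these are not results from this notebook.
demo_results = pd.DataFrame({
    "Model": ["Model A", "Model B", "Model C"],
    "ROC_AUC": [0.91, 0.89, 0.90],
    "CV_ROC_AUC_Mean": [0.86, 0.88, 0.87],
    "CV_ROC_AUC_Std": [0.04, 0.01, 0.02],
}).set_index("Model")

# Hypothetical registry: model name -> (fitted model placeholder, needs scaled input?)
registry = {
    "Model A": ("model_a", False),
    "Model B": ("model_b", False),
    "Model C": ("model_c", True),
}

# Select on the cross-validated score, which never touched the test set.
best_name = demo_results["CV_ROC_AUC_Mean"].idxmax()
best_model_demo, use_scaled_demo = registry[best_name]
print(f"Selected by CV ROC-AUC: {best_name}")
```

In this toy table the model with the highest test ROC-AUC is not the one with the highest cross-validated score, which is the situation where the two selection rules disagree.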


In [None]:
# Save the best model and preprocessing objects
import json
import os

import joblib

# Create models directory
os.makedirs('models', exist_ok=True)

# Save model
joblib.dump(best_model, 'models/best_model.pkl')
joblib.dump(scaler, 'models/scaler.pkl')
joblib.dump(le_education, 'models/le_education.pkl')
joblib.dump(le_marital, 'models/le_marital.pkl')

# Save feature names
with open('models/feature_names.json', 'w') as f:
    json.dump(list(X.columns), f)

# Save model metadata
model_metadata = {
    'model_name': best_model_name,
    'use_scaled': use_scaled,
    'roc_auc': float(results_df.loc[best_model_name, 'ROC_AUC']),
    'accuracy': float(results_df.loc[best_model_name, 'Test_Accuracy']),
    'precision': float(results_df.loc[best_model_name, 'Precision']),
    'recall': float(results_df.loc[best_model_name, 'Recall']),
    'f1_score': float(results_df.loc[best_model_name, 'F1_Score'])
}

with open('models/model_metadata.json', 'w') as f:
    json.dump(model_metadata, f, indent=2)

print("Model and preprocessing objects saved successfully!")
print(f"\nModel metadata:\n{json.dumps(model_metadata, indent=2)}")
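
The saved artifacts are only useful if the consumer rebuilds the input columns in the same order as during training. The following sketch shows a save, load, and predict round trip under stated assumptions: it trains a tiny stand-in model on made-up data, writes to a temporary directory instead of `models/`, and uses only two illustrative feature columns.

```python
import json
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-in training data; values are made up for illustration.
train = pd.DataFrame({
    "Income": [20000, 30000, 60000, 80000, 25000, 75000],
    "Total_Spent": [50, 100, 900, 1500, 60, 1200],
})
labels = [0, 0, 1, 1, 0, 1]
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(train, labels)

with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)

    # Save the model and the training column order.
    joblib.dump(model, model_dir / "best_model.pkl")
    (model_dir / "feature_names.json").write_text(json.dumps(list(train.columns)))

    # Load them back, as a consuming application would.
    loaded_model = joblib.load(model_dir / "best_model.pkl")
    feature_names = json.loads((model_dir / "feature_names.json").read_text())

    # Incoming data may arrive with columns in a different order.
    new_row = pd.DataFrame([{"Total_Spent": 900, "Income": 65000}])
    new_row = new_row[feature_names]  # restore the training column order
    proba = loaded_model.predict_proba(new_row)[:, 1]

print(f"Predicted response probability: {proba[0]:.2f}")
```

If the selected model needs scaled input, the saved scaler would have to be applied to `new_row` before calling `predict_proba`.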


## 7. Feature Importance Analysis


In [None]:
# Feature importance (for tree-based models)
if hasattr(best_model, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'feature': X.columns,
        'importance': best_model.feature_importances_
    }).sort_values('importance', ascending=False)
    
    plt.figure(figsize=(12, 8))
    sns.barplot(data=feature_importance.head(15), x='importance', y='feature')
    plt.title('Top 15 Feature Importances')
    plt.xlabel('Importance')
    plt.tight_layout()
    plt.show()
    
    print("Top 10 Most Important Features:")
    print(feature_importance.head(10))
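
The check above produces no output when the best model has no `feature_importances_` attribute, as is the case for `MLPClassifier`. Permutation importance is a model-agnostic alternative: it measures how much a score drops when one feature's values are shuffled. The sketch below uses scikit-learn's `permutation_importance` on synthetic data with placeholder feature names, both of which are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with placeholder feature names.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

# A small scaled MLP, which exposes no feature_importances_ attribute.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=42),
)
model.fit(X_tr, y_tr)

# Shuffle each feature 10 times on held-out data and record the drop in ROC-AUC.
result = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=42
)
importance = pd.DataFrame({
    "feature": feature_names,
    "importance": result.importances_mean,
}).sort_values("importance", ascending=False)
print(importance)
```

Because it is computed on held-out data, permutation importance reflects what the model relies on for generalization, which can differ from impurity-based importances of tree models.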


## 8. Confusion Matrix and Classification Report


In [None]:
# Get predictions from best model
if use_scaled:
    y_pred_best = best_model.predict(X_test_scaled)
else:
    y_pred_best = best_model.predict(X_test)

# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_best)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No Response', 'Response'], 
            yticklabels=['No Response', 'Response'])
plt.title(f'Confusion Matrix - {best_model_name}')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

# Classification Report
print(f"\nClassification Report - {best_model_name}:")
print(classification_report(y_test, y_pred_best, target_names=['No Response', 'Response']))
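
The confusion matrix and report above rely on `predict`, which applies a fixed 0.5 probability cutoff. When responders are rare, a different cutoff can give a better balance of precision and recall. The sketch below picks the threshold that maximizes F1 using `precision_recall_curve`, on made-up labels and scores. In practice the threshold should be chosen on validation data rather than on the test set.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and predicted probabilities for illustration.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.15, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall have one more entry than thresholds; drop the final point.
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]

# Apply the chosen cutoff instead of the default 0.5.
y_pred = (y_score >= best_threshold).astype(int)
print(f"Best threshold by F1: {best_threshold:.2f}")
print("Predictions:", y_pred.tolist())
```

Other criteria, such as a minimum recall or a cost-weighted score, can be substituted for F1 depending on what the marketing team values.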


## Summary

This notebook has:
1. ✅ Loaded and explored the marketing campaign dataset
2. ✅ Preprocessed the data (handled missing values, encoded categorical variables)
3. ✅ Created derived features (spending patterns, purchase behavior, etc.)
4. ✅ Trained multiple models (Random Forest, XGBoost, LightGBM, Neural Network)
5. ✅ Evaluated models using multiple metrics (Accuracy, Precision, Recall, F1, ROC-AUC)
6. ✅ Selected the best model and saved it for deployment
7. ✅ Analyzed feature importance and model performance

The best model, its preprocessing objects, and its metadata are saved under `models/`, ready to be loaded by the web application.
