# Business Analysis: CRISP-DM Methodology
* Data Source
This dataset was downloaded from Kaggle: https://www.kaggle.com/datasets/rabieelkharoua/predict-conversion-in-digital-marketing-dataset/data

## Overview
> This dataset provides a comprehensive view of customer interactions with digital marketing campaigns. It includes demographic data, marketing-specific metrics, customer engagement indicators, and historical purchase data, making it suitable for predictive modeling and analytics in the digital marketing domain.

## Objective
We aim to utilize this dataset to uncover patterns within the customer base that can inform decision-making for digital marketing budget allocation. Additionally, we will develop predictive models to determine the likelihood of a customer purchasing a product based on their interaction history and demographic information.

## Goals
1. Identify Key Patterns using K-means:

 - Analyze the dataset to identify significant patterns and trends in customer interactions and behaviors to see if we have any natural separation.
 - Understand which demographic and engagement metrics are most indicative of conversion success.
 
2. Predictive Modeling using XGBoost:

 - Develop machine learning models to predict whether a customer is likely to purchase a product.
 - Use the models to segment customers based on their likelihood to convert, enabling targeted marketing strategies.
 
3. Optimize Budget Allocation:

 - Utilize the insights gained from pattern recognition and predictive modeling to optimize the allocation of the digital marketing budget.
 - Focus resources on high-potential customer segments to maximize return on investment (ROI).


## Features

|Variable              |Description          | 
|------------------------|:-------------------| 
|CustomerID          | Unique identifier for each customer| 
|Age | Age of the customer | 
|Gender           | Gender of the customer (Male/Female)  | 
|Income  | Annual income of the customer in USD  | 
|CampaignChannel          | The channel through which the marketing campaign is delivered (Email, Social Media, SEO, PPC, Referral) | 
|CampaignType| Type of the marketing campaign (Awareness, Consideration, Conversion, Retention).|
|AdSpend| Amount spent on the marketing campaign in USD.|
|ClickThroughRate| Rate at which customers click on the marketing content.|
|ConversionRate| Rate at which clicks convert to desired actions (e.g., purchases).|
|AdvertisingPlatform| Confidential.|
|AdvertisingTool| Confidential.|
|WebsiteVisits| Number of visits to the website.|
|PagesPerVisit| Average number of pages visited per session.|
|TimeOnSite| Average time spent on the website per visit (in minutes).|
|SocialShares| Number of times the marketing content was shared on social media.|
|EmailOpens| Number of times marketing emails were opened.|
|EmailClicks| Number of times links in marketing emails were clicked.|
|PreviousPurchases| Number of previous purchases made by the customer.|
|LoyaltyPoints| Number of loyalty points accumulated by the customer.|
|**Target variable**| |
|Conversion| Binary variable indicating whether the customer converted (1) or not (0).|

## Packages needed

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns


from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA
import xgboost as xgb
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics import f1_score

from collections import Counter
import warnings
warnings.filterwarnings("ignore")

## Loading the data

In [None]:
df = pd.read_csv('./digital_marketing_campaign_dataset.csv')

In [None]:
df.head()

## Analyzing NaN and Duplicates

In [None]:
# Checking NA
df.isna().sum()

In [None]:
# Checking duplicate rows
print(f"Total duplicate rows: {df.duplicated().sum()}")

In [None]:
# Number of rows in the DataFrame
num_linhas = df.shape[0]
print(f"Lines: {num_linhas}")

In [None]:
# data types
df.info()

# EDA 

In [None]:
disc_variables = ["Age", "WebsiteVisits", "SocialShares", "EmailOpens", "EmailClicks", "PreviousPurchases"]
cont_variables = ["Income", "AdSpend", "ClickThroughRate", "ConversionRate", "PagesPerVisit", "TimeOnSite", "LoyaltyPoints"]
cat_variables = ["Gender", "CampaignType", "AdvertisingPlatform", "AdvertisingTool", "Conversion"]

In [None]:
# Data
gender_counts = Counter(df["Gender"])
labels = list(gender_counts.keys())
sizes = list(gender_counts.values())  # a list is safer for plt.pie than a dict view

# Colors and explode settings
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
explode = [0.1 if i == max(sizes) else 0 for i in sizes]  # explode the largest slice

# Create a pie chart
plt.figure(figsize=(8, 6))
plt.pie(sizes, 
        labels=labels, 
        autopct='%.1f%%', 
        shadow=True, 
        startangle=140, 
        colors=colors, 
        explode=explode,
        wedgeprops={'edgecolor': 'black'})

# Add title
plt.title('Gender Distribution', fontsize=16, fontweight='bold')

# Show plot
plt.show()

In [None]:
# Data
CampaignType_counts = Counter(df["CampaignType"])
labels = list(CampaignType_counts.keys())
sizes = list(CampaignType_counts.values())  # a list is safer for plt.pie than a dict view

# Colors and explode settings
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
explode = [0.1 if i == max(sizes) else 0 for i in sizes]  # explode the largest slice

# Create a pie chart
plt.figure(figsize=(8, 6))
plt.pie(sizes, 
        labels=labels, 
        autopct='%.1f%%', 
        shadow=True, 
        startangle=140, 
        colors=colors, 
        explode=explode,
        wedgeprops={'edgecolor': 'black'})

# Add title
plt.title('Campaign Type', fontsize=16, fontweight='bold')

# Show plot
plt.show()

In [None]:
# Data
Conversion_counts = Counter(df["Conversion"])
labels = list(Conversion_counts.keys())
sizes = list(Conversion_counts.values())  # a list is safer for plt.pie than a dict view

# Colors and explode settings
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
explode = [0.1 if i == max(sizes) else 0 for i in sizes]  # explode the largest slice

# Create a pie chart
plt.figure(figsize=(8, 6))
plt.pie(sizes, 
        labels=labels, 
        autopct='%.1f%%', 
        shadow=True, 
        startangle=140, 
        colors=colors, 
        explode=explode,
        wedgeprops={'edgecolor': 'black'})

# Add title
plt.title('Conversions', fontsize=16, fontweight='bold')

# Show plot
plt.show()

In [None]:
# Data
AdvertisingTool_counts = Counter(df["AdvertisingTool"])
labels = list(AdvertisingTool_counts.keys())
sizes = list(AdvertisingTool_counts.values())  # a list is safer for plt.pie than a dict view

# Colors and explode settings
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
explode = [0.1 if i == max(sizes) else 0 for i in sizes]  # explode the largest slice

# Create a pie chart
plt.figure(figsize=(8, 6))
plt.pie(sizes, 
        labels=labels, 
        autopct='%.1f%%', 
        shadow=True, 
        startangle=140, 
        colors=colors, 
        explode=explode,
        wedgeprops={'edgecolor': 'black'})

# Add title
plt.title('Advertising Tool', fontsize=16, fontweight='bold')

# Show plot
plt.show()

In [None]:
# Data
AdvertisingPlatform_counts = Counter(df["AdvertisingPlatform"])
labels = list(AdvertisingPlatform_counts.keys())
sizes = list(AdvertisingPlatform_counts.values())  # a list is safer for plt.pie than a dict view

# Colors and explode settings
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99']
explode = [0.1 if i == max(sizes) else 0 for i in sizes]  # explode the largest slice

# Create a pie chart
plt.figure(figsize=(8, 6))
plt.pie(sizes, 
        labels=labels, 
        autopct='%.1f%%', 
        shadow=True, 
        startangle=140, 
        colors=colors, 
        explode=explode,
        wedgeprops={'edgecolor': 'black'})

# Add title
plt.title('Advertising Platform', fontsize=16, fontweight='bold')

# Show plot
plt.show()

In [None]:
# Discrete variables analysis


for col in disc_variables:
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    sns.histplot(df[col], kde=True)
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)
    
    plt.show()


In [None]:
# Continuous variables analysis


for col in cont_variables:
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    sns.histplot(df[col], kde=True)
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Frequency')
    
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    plt.xlabel(col)
    
    plt.show()

From this analysis, the features appear well distributed, and the target, `Conversion`, is slightly unbalanced.

The advertising tool and platform columns are marked as confidential in the feature table, so they are unlikely to add much to our modeling.
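A quick numeric check can back up both observations. The sketch below uses a small toy frame (made-up values, not the Kaggle data) to show how the class shares and any single-valued columns could be measured; on the real data, `toy` would be replaced by `df`. The `"masked"` placeholder value is an assumption for illustration only.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; all values here are made up for illustration.
toy = pd.DataFrame({
    "Conversion": [1, 1, 1, 0, 1, 1, 0, 1],
    "AdvertisingTool": ["masked"] * 8,   # placeholder for a confidential column
    "Gender": ["Male", "Female"] * 4,
})

# Share of each target class (the shares sum to 1.0)
class_share = toy["Conversion"].value_counts(normalize=True)

# Columns with a single distinct value carry no information for a model
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]

print(class_share)
print("Constant columns:", constant_cols)
```

If a column shows up in `constant_cols` on the real data, dropping it loses nothing for clustering or classification.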



In [None]:
df = df.drop(columns = ['AdvertisingTool', 'AdvertisingPlatform','CustomerID'])

In [None]:
plt.figure(figsize=(30, 30))
corr = df.corr(numeric_only=True)  # skip string columns such as Gender and CampaignType

# Mask the upper triangle
mask = np.triu(np.ones_like(corr, dtype=bool))

# Adjust the heatmap to display only the lower triangle
sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm', annot_kws={"size": 16}, fmt='.2f')
plt.show()

# Data preparation

* selection: have we selected all the data we need?
* construction: build new variables.
* integration: when there are multiple data sources, we need to integrate them.
* formatting: convert the data to the correct format.

In [None]:
# Separate features and target
X = df.drop('Conversion', axis=1)
y = df['Conversion']

In [None]:
# Define preprocessing steps for numerical and categorical features
numerical_features = ['Age', 'WebsiteVisits', 'SocialShares', 'EmailOpens', 'EmailClicks', 'PreviousPurchases', 
                      'Income', 'AdSpend', 'ClickThroughRate', 'ConversionRate', 'PagesPerVisit', 'TimeOnSite', 'LoyaltyPoints']
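categorical_features = ['Gender', 'CampaignType']
# Note: 'CampaignChannel' is in neither list, so ColumnTransformer (remainder='drop' by default) discards it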

# Numerical features: Impute missing values with median and scale to zero mean and unit variance

numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical features: Impute missing values with the most frequent value and apply one-hot encoding

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps into a single preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Fit and transform the data
X_preprocessed = preprocessor.fit_transform(X)

# Extract column names for the transformed data
num_cols = numerical_features
cat_cols = preprocessor.named_transformers_['cat'].named_steps['onehot'].get_feature_names_out(categorical_features)
all_cols = num_cols + list(cat_cols)

# Create a DataFrame with the new column names
X_pad = pd.DataFrame(X_preprocessed, columns=all_cols)

X_pad.head()

### We need to find the best number of clusters for K-means; we will use two methods for this.

In [None]:
# Determine optimal number of clusters using the elbow method
inertia = []
for n in range(1, 15):
    kmeans = KMeans(n_clusters=n, random_state=42)
    kmeans.fit(X_preprocessed)
    inertia.append(kmeans.inertia_)

# Plot the elbow curve
plt.plot(range(1, 15), inertia)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()



### We could not determine a clear optimal number of clusters with this methodology.
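For reference, the inertia plotted in the elbow curve is the sum of squared distances from each point to its assigned centroid. A minimal sketch on synthetic blobs (toy data from `make_blobs`, not the campaign dataset) recomputes it by hand and compares it with the value scikit-learn reports:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three groups (illustrative only)
X_toy, _ = make_blobs(n_samples=150, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_toy)

# Centroid of the cluster each point was assigned to
assigned_centroids = km.cluster_centers_[km.labels_]

# Sum of squared distances to those centroids
manual_inertia = float(((X_toy - assigned_centroids) ** 2).sum())

print(manual_inertia, km.inertia_)
```

Because inertia always decreases as clusters are added, the elbow method looks for the point where the decrease flattens; when the curve bends smoothly, as reported here, no single value stands out.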

In [None]:
# Silhouette Method
silhouette_scores = []
for n in range(2, 15):
    kmeans = KMeans(n_clusters=n, random_state=42)
    cluster_labels = kmeans.fit_predict(X_preprocessed)
    silhouette_avg = silhouette_score(X_preprocessed, cluster_labels)
    silhouette_scores.append(silhouette_avg)

# Plot silhouette scores
plt.figure(figsize=(8, 6))
plt.plot(range(2, 15), silhouette_scores, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Method')
plt.grid(True)
plt.show()

### With this method, the candidates are 2 clusters or 14 or more. Fourteen clusters are too many for this analysis, so we will use 2.
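For contrast, it can help to see what the silhouette curve looks like when the data really do contain separated groups. The sketch below is an illustration on synthetic data with three planted clusters (the centers and spread are assumptions chosen for clarity, not properties of the campaign data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated synthetic groups (illustrative only)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X_toy, _ = make_blobs(n_samples=150, centers=centers, cluster_std=0.5, random_state=0)

# Average silhouette score for each candidate number of clusters
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels)

# The candidate with the highest average silhouette
best_k = max(scores, key=scores.get)
print(scores)
print("Best k:", best_k)
```

A sharp peak like the one this toy data produces is the signal the method relies on; a flat or low curve suggests the clusters are weakly separated, which is worth keeping in mind when interpreting the two-cluster solution.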

# Modeling K-means

In [None]:
# Fit K-Means with the optimal number of clusters
optimal_clusters = 2  # chosen from the silhouette analysis above
kmeans = KMeans(n_clusters=optimal_clusters, random_state=42)
clusters = kmeans.fit_predict(X_preprocessed)

# Add cluster labels to the original dataframe
df['Cluster'] = clusters

In [None]:
# Perform PCA
pca = PCA(n_components=2)
X_pad_t = pca.fit_transform(X_pad)

In [None]:
df_pca = pd.DataFrame(data=X_pad_t, columns=['PC1', 'PC2'])
df_pca['Cluster'] = clusters

In [None]:
def biplot(score,coeff, y, labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scalex = 2/(xs.max() - xs.min())
    scaley = 2/(ys.max() - ys.min())
    
    fig, ax = plt.subplots(figsize=(10, 10))
#    scatter = ax.scatter(xs * scalex,ys * scaley, c = y)
    sns.kdeplot(x = xs * scalex, y = ys * scaley, hue=y, ax=ax, fill=True, alpha=.6, palette='viridis')
#    ax.legend(*scatter.legend_elements())
    
    for i in range(n):
        ax.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5, 
                 length_includes_head=True, head_width=0.04, head_length=0.04)
        if labels is None:
            ax.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'k', ha = 'center', va = 'center')
        else:
            ax.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'k', ha = 'center', va = 'center')
    ax.set_xlim(-1.2,1.2)
    ax.set_ylim(-1.2,1.2)
    ax.set_xlabel("PC{0}, {1:.1%} explained variance ratio".format(1, pca.explained_variance_ratio_[0]))
    ax.set_ylabel("PC{0}, {1:.1%} explained variance ratio".format(2, pca.explained_variance_ratio_[1]))
    ax.grid()


In [None]:
# Plot the biplot
biplot(X_pad_t, np.transpose(pca.components_),df_pca['Cluster'], labels=X_pad.columns)

### Feature Importance:

- Email Clicks, Conversion Rate, Income, Pages Per Visit, and Time on Site are crucial for PC1. Customers with higher values in these features are likely contributing to higher engagement and conversion metrics.
- Previous Purchases, Website Visits, and Loyalty Points are important for distinguishing customers in Cluster 0, suggesting these features are key differentiators for this group.
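The arrows in the biplot are the PCA loadings, i.e. the transposed rows of `pca.components_`, so the reading above can be checked against a table instead of the plot alone. The sketch below uses random toy data with a few column names borrowed from the dataset (the values are made up); with the notebook's objects, `pca_toy` and `X_toy` would be `pca` and `X_pad`.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy standardized features: random values, real column names only for readability
X_toy = pd.DataFrame(
    rng.normal(size=(200, 4)),
    columns=["EmailClicks", "Income", "PreviousPurchases", "LoyaltyPoints"],
)

pca_toy = PCA(n_components=2).fit(X_toy)

# One row per feature, one column per principal component
loadings = pd.DataFrame(pca_toy.components_.T, index=X_toy.columns, columns=["PC1", "PC2"])

# Rank features by the absolute size of their PC1 loading
pc1_ranking = loadings["PC1"].abs().sort_values(ascending=False)

print(loadings.round(2))
print(pc1_ranking)
print("Explained variance ratio:", pca_toy.explained_variance_ratio_)
```

Reading the loadings together with the explained variance ratio matters: if the first two components explain only a small share of the variance, the biplot shows a limited slice of the data.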

### Cluster-Specific Analysis:

Cluster 0:
- Higher Previous Purchases and Website Visits indicate these customers have engaged more with the product historically.
- High Loyalty Points suggest a potential focus on loyalty programs and rewards.
- The absence of a high Conversion Rate in this cluster might indicate that these customers are not frequent purchasers, potentially due to the product being a high-value item or requiring a longer decision-making process.

Cluster 1:
- Higher values in Email Clicks, Conversion Rate, Income, Pages Per Visit, and TimeOnSite suggest this cluster consists of more actively engaged customers with a higher likelihood of conversion.

### Marketing Strategy:

* Focus on Cluster 0:

Objective: Increase conversion rates by leveraging existing engagement metrics.

Tactics:

- Targeted Campaigns: Design personalized marketing campaigns aimed at converting high website visitors and those with high loyalty points. Emphasize exclusive offers or loyalty point redemptions to drive purchases.
- Engagement Strategies: Use email marketing and reminders for customers with high email open and click rates to encourage conversions.
- Behavioral Analysis: Conduct deeper behavioral analysis to understand why customers in this cluster, despite high engagement metrics, are not converting. This could involve surveys, focus groups, or further segmentation.

* Utilize Insights from Cluster 1:

Objective: Maintain and enhance engagement.

Tactics:

- Reward High Converters: Implement loyalty programs or rewards for frequent purchasers to maintain high engagement levels.
- Upsell/Cross-sell: Use the high engagement metrics to introduce related products or services, leveraging the high income and time spent on the site.

# XGBoost - Modeling whether the campaign will convert or not

XGBoost is an ensemble technique that combines several decision trees to produce better predictive performance than a single decision tree. The principle is that multiple weak learners can collectively yield better results than one strong learner.

As its name suggests, XGBoost is a boosting technique: each new tree is fit to correct the errors of the ensemble built so far. At each step it follows the gradient of the loss function to reduce the remaining error, and its built-in regularization helps limit overfitting.

What makes this technique so popular is its speed, especially on datasets with thousands of features and many candidate splits. XGBoost improves efficiency by summarizing the distribution of each feature (for example with quantile-based or histogram-based binning), which reduces the number of split points it has to evaluate.

Additionally, XGBoost's ability to handle missing values and its regularization techniques contribute to its robustness and effectiveness in a wide range of applications.
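The boosting idea described above can be sketched in a few lines. Note that this is plain gradient boosting with squared error, built from scikit-learn regression trees on synthetic data; it is an illustration of the principle, not XGBoost's actual algorithm, which adds regularization and second-order information. The learning rate, tree depth, and number of rounds are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic regression problem: noisy sine wave
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())          # start from a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    # For squared error, the negative gradient is the residual
    residuals = y - prediction
    # Fit a shallow tree (a weak learner) to what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Add a damped version of the new tree's output to the ensemble
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print("Training MSE before boosting:", mse_history[0])
print("Training MSE after boosting: ", mse_history[-1])
```

Each round shrinks the training error a little; the learning rate controls how much of each tree is added, which is one of the levers used against overfitting.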

In [None]:
# To avoid data leakage, split the data into training and test sets before re-fitting the preprocessor
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply preprocessing
X_train_preprocessed = preprocessor.fit_transform(X_train)
X_test_preprocessed = preprocessor.transform(X_test)
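The same leak-free pattern can also be expressed by wrapping preprocessing and model in one `Pipeline`, so that a single `fit` call on the training split fits both and the test split is only ever transformed. In the sketch below, the toy data, `StandardScaler`, and `LogisticRegression` are stand-ins (assumptions) for the notebook's `preprocessor` and XGBoost classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Toy features and a toy binary target (illustrative only)
X_toy = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_toy = (X_toy[:, 0] > 5.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Preprocessing and model bundled together
clf = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
clf.fit(X_tr, y_tr)  # the scaler sees the training rows only

# The scaler's statistics match the training split, not the full data
fitted_mean = clf.named_steps["scaler"].mean_
print("Scaler fitted on train only:", np.allclose(fitted_mean, X_tr.mean(axis=0)))
print("Test accuracy:", clf.score(X_te, y_te))
```

A pipeline like this can also be passed to cross-validation utilities, where refitting the preprocessing inside each fold happens automatically.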

In [None]:

# Train XGBoost classifier
model = xgb.XGBClassifier(random_state=42, n_estimators = 350)
model.fit(X_train_preprocessed, y_train)

# Make predictions
y_pred = model.predict(X_test_preprocessed)

# Evaluate model
print('Accuracy:', accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Feature importance
importances = model.feature_importances_
feature_names = preprocessor.get_feature_names_out()
feature_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(feature_importances)

In [None]:
# Confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
conf_matrix_df = pd.DataFrame(conf_matrix, index=model.classes_, columns=model.classes_)

# Plot confusion matrix
plt.figure(figsize=(10, 7))
sns.heatmap(conf_matrix_df, annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

# Model evaluation

The XGBoost model demonstrates strong performance in predicting whether a customer will respond to a campaign, with an overall accuracy of **91.81%**. Key performance metrics include:

- Precision: 0.78 for non-responders (class 0) and 0.93 for responders (class 1).
- Recall: 0.45 for non-responders and 0.98 for responders.
- F1-Score: 0.57 for non-responders and 0.95 for responders.
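
This indicates that the model is particularly effective at identifying responders, with high precision and recall for that class. It is much weaker on non-responders: a recall of 0.45 means it misses more than half of them.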
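As a consistency check, the F1-scores can be recomputed from the precision and recall values reported above, since F1 is their harmonic mean. The sketch also averages the two recalls, which is the definition of balanced accuracy; that metric is not reported in the notebook and is included here only as an illustration of how class imbalance affects the headline accuracy.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the classification report above
f1_class0 = f1(0.78, 0.45)   # non-responders
f1_class1 = f1(0.93, 0.98)   # responders

# Balanced accuracy is the mean of the per-class recalls
balanced_accuracy = (0.45 + 0.98) / 2

print(round(f1_class0, 2), round(f1_class1, 2))
print(round(balanced_accuracy, 3))
```

The recomputed F1 values round to 0.57 and 0.95, matching the report, while the mean recall of about 0.715 is noticeably lower than the overall accuracy.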

**Conclusion Based on Model Predictions:<br>
The model can accurately predict if a customer will respond to a campaign. Before launching a new campaign, we can use the model to predict the likely response of target customers, enabling us to anticipate the campaign's performance and make data-driven decisions to optimize marketing efforts.**

## Feature Importance Analysis

The feature importance values from the XGBoost model provide additional insights into which factors most influence the prediction of customer responses. Key observations include:

Top Influential Features:

- Campaign Type - Conversion: The type of campaign focused on conversion has the highest importance score, indicating that the nature of the campaign itself plays a critical role in predicting customer response.
- Previous Purchases: Customers' historical purchase behavior is a strong predictor of future responses.
- Email Clicks and Opens: Email engagement metrics are also highly influential, showing the importance of customer interaction with email marketing efforts.

Other Notable Features:

- Click Through Rate and Pages Per Visit: These engagement metrics indicate that higher interaction with digital content correlates with higher campaign response rates.
- Ad Spend: Investment in advertising contributes to the prediction. Importance scores do not indicate direction, so this alone does not show that higher ad spend leads to higher engagement.
- Time on Site: The amount of time customers spend on the website is also a significant predictor, reflecting their level of interest and engagement.

Demographic and Other Factors:

- Income and Age: Demographic factors like income and age, although not the top predictors, still play a role in customer response.
- Gender: Gender appears to have minimal importance in this model, suggesting that other factors are more influential in determining campaign response.