# **Conversion Prediction in Digital Marketing Campaigns**

## **I. Introduction**

### **Problem Context**
In digital marketing, identifying customers with the highest likelihood of conversion is crucial to maximizing return on investment (ROI) and optimizing resource allocation. This project addresses this challenge by using Machine Learning techniques to predict conversions in advertising campaigns, helping to prioritize the most effective channels, audiences, and strategies.

### **Objectives**
- Develop a binary classification model to predict the conversion probability of each customer.
- Optimize advertising resources by identifying the most effective channels and campaigns.
- Provide data-driven recommendations to improve commercial decision-making.

### **Scope**
The analysis focuses on simulating general campaigns with an emphasis on conversion and retention, highlighting the most effective channels and campaign types to maximize marketing efficiency.

---

## **II. Dataset**

### **Dataset Description**
- **Source:** [Predict Conversion in Digital Marketing](https://www.kaggle.com/datasets/rabieelkharoua/predict-conversion-in-digital-marketing-dataset/data).
- **Records:** 8,000.
- **Variables:** 20 (no missing values).

#### **Key Variables:**
1. **Demographics:**
   - `Age`: Customer's age.
   - `Gender`: Gender (`Male`, `Female`).
   - `Income`: Annual income ($).

2. **Digital Interactions:**
   - `ClickThroughRate`: Percentage of clicks on ads.
   - `ConversionRate`: Conversion percentage (removed because it is too general and is derived from the target).
   - `WebsiteVisits`, `PagesPerVisit`, `TimeOnSite`: Indicators of engagement level.
   - `SocialShares`: Social media shares.
   - `EmailOpens`, `EmailClicks`: Indicators of email campaign effectiveness.

3. **Advertising Campaigns:**
   - `CampaignChannel`: Distribution channel (`Social Media`, `Email`, etc.).
   - `CampaignType`: Campaign type (`Awareness`, `Retention`, etc.).
   - `AdSpend`: Advertising expenditure.

4. **Customer History:**
   - `PreviousPurchases`: Customer's previous purchases.
   - `LoyaltyPoints`: Loyalty points (removed due to bias).

5. **Target Variable:**
   - `Conversion`: Indicates whether (1) or not (0) the desired action was performed.

---

### **EDA and Data Cleaning**

1. **Removal of Irrelevant Columns:**
   - `AdvertisingPlatform` and `AdvertisingTool` were removed because each holds a single constant value across all rows (`IsConfid` and `ToolConfid`, respectively) and therefore carries no predictive information.

2. **Outliers:**
   - Exploration using boxplots revealed no significant outliers in numerical variables.

3. **Data Distribution:**
   - Most numerical and categorical variables exhibit heterogeneous distributions.
   - The `Gender` variable shows a higher representation of women (60.5%) compared to men (39.5%).
   - The target variable `Conversion` is highly imbalanced: 87.65% of observations are positive conversions and only 12.35% are negative. This calls for balancing techniques during modeling.

   **Relevant Visualizations:**
   - **Distributions of Numerical and Categorical Variables:**
     ![Variable Distributions](resources/img/dist_var.png)

4. **Correlation Analysis:**
   - The correlation matrix shows weak correlations between most variables and the target variable (`Conversion`).
   - **Collinearity:** No significant collinearity issues were observed.

   **Relevant Visualization:**
   - ![Correlation Matrix](resources/img/mc_num.png)
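
The constant-column removal and the class-balance check described above can be sketched as follows. This is a minimal illustration with pandas on a made-up four-row frame, not the project's actual code; the toy values and the variable names `constant_cols`, `df_clean`, and `class_share` are assumptions.

```python
import pandas as pd

# Toy frame standing in for the real dataset; the values are invented for illustration.
df = pd.DataFrame({
    "Age": [25, 40, 31, 58],
    "AdvertisingPlatform": ["IsConfid"] * 4,
    "AdvertisingTool": ["ToolConfid"] * 4,
    "Conversion": [1, 1, 0, 1],
})

# A column with a single distinct value cannot help a model separate classes.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
df_clean = df.drop(columns=constant_cols)

# Share of each class in the target, to quantify the imbalance.
class_share = df_clean["Conversion"].value_counts(normalize=True)
print(constant_cols)
print(class_share)
```

Checking `nunique` rather than hard-coding column names keeps the step reusable if other constant columns appear in a future export.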

---

### **Encoding and Transformation**

1. **Categorical Variable Encoding:**
   - **One-Hot Encoding** was applied to categorical variables (`Gender`, `CampaignChannel`, `CampaignType`) to convert them into a numerical format suitable for algorithms.

2. **Boolean Value Conversion:**
   - Variables with `True/False` values were converted to numeric `1/0` values for uniformity.

3. **Encoded Correlation Matrix:**
   - After encoding, the correlation matrix was recalculated. The correlations are similar to those observed before encoding, confirming the absence of strong linear relationships between the predictors and the target variable.

   **Relevant Visualization:**
   - ![Encoded Correlation Matrix](resources/img/mc_cod.png)
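
One possible way to carry out the encoding steps above is `pandas.get_dummies`. The sketch below is an assumption about tooling (the project may have used a different encoder), and the three-row frame is invented for illustration.

```python
import pandas as pd

# Toy rows; the category labels follow the dataset description, the values are made up.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "CampaignChannel": ["Email", "PPC", "Social Media"],
    "CampaignType": ["Awareness", "Retention", "Conversion"],
    "AdSpend": [1200.0, 850.5, 430.0],
})

categorical_cols = ["Gender", "CampaignChannel", "CampaignType"]

# One-Hot Encoding: one indicator column per category level.
encoded = pd.get_dummies(df, columns=categorical_cols)

# Depending on the pandas version, the indicators are bool or uint8;
# cast them explicitly to integer 1/0 for uniformity.
dummy_cols = [col for col in encoded.columns if col not in df.columns]
encoded[dummy_cols] = encoded[dummy_cols].astype(int)
print(encoded.head())
```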

---

### **Feature Importance**

Despite weak correlations observed in the correlation matrix, the **Feature Importance** analysis using models like Random Forest and Gradient Boosting reveals significant relationships between certain variables and the target variable (`Conversion`).

**Relevant Visualization:**
- ![General Feature Importance](resources/img/figm.png)

Specific analyses for each campaign type reveal that key features vary depending on the campaign focus.

- ![Feature Importance Awareness](resources/img/fiam.png)
- ![Feature Importance Consideration](resources/img/ficm.png)
- ![Feature Importance Conversion](resources/img/fivm.png)
- ![Feature Importance Retention](resources/img/firm.png)
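
As a rough sketch of how such importances can be obtained, the example below fits a scikit-learn `RandomForestClassifier` on synthetic data and reads its `feature_importances_` attribute. The data, the `Noise` column, and the rule that generates the target are all assumptions made for illustration; they do not reproduce the project's results.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors; only the column names echo the dataset.
X = pd.DataFrame({
    "AdSpend": rng.uniform(100, 10_000, n),
    "TimeOnSite": rng.uniform(0.5, 15, n),
    "Noise": rng.normal(size=n),
})

# By construction, the target depends on TimeOnSite alone (a threshold rule).
y = (X["TimeOnSite"] > 7).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized to sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

A threshold rule like this one can produce a modest linear correlation while still being picked up cleanly by a tree ensemble, which is one reason importances and correlation coefficients may disagree.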

---

### **Conclusions from Exploratory Analysis and Data Cleaning**

1. **Data Distribution and Balance:**
   - The target variable `Conversion` is imbalanced, requiring techniques like **SMOTE** for modeling.
   - Variables show heterogeneous distributions, with no significant outliers.

2. **Correlation and Feature Relevance:**
   - The correlation matrix indicates weak relationships with the target.
   - Feature Importance analysis identifies key variables for predicting conversions.

3. **Applied Transformations:**
   - Irrelevant columns were removed.
   - Categorical variables were encoded using **One-Hot Encoding**, and boolean values were transformed.

This process leaves the data ready for modeling: uninformative columns are removed and the remaining variables are in a numerical format.

---

## **III. Data Preprocessing**

### **Quality Verification**
- No missing values were found.
- No significant outliers were found.
- Categorical variables (`Gender`, `CampaignChannel`, `CampaignType`) were encoded using One-Hot Encoding.

### **Transformations for the First Predictive Model**

1. **Class Balancing:**
   - **SMOTE + Tomek Links** was applied to balance classes in the target variable, since negative conversions represented only about 12% of the data.

This transformation changed the initial class distribution:

| Conversion | Initial Proportion |
|------------|--------------------|
| 1          | 0.8765            |
| 0          | 0.1235            |

To a perfectly balanced proportion:

| Conversion | Balanced Proportion |
|------------|---------------------|
| 1          | 0.5                |
| 0          | 0.5                |

2. **Feature Scaling:**
   - **StandardScaler** was used to scale numerical variables (zero mean, unit variance) so that all features are on a comparable scale.
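
The interpolation idea at the core of SMOTE, used in the class-balancing step above, can be sketched with NumPy alone. This is a simplified illustration, not the project's code: it omits the Tomek Links cleaning step, and the function name `smote_like_oversample` and the toy points are hypothetical. In practice a library such as imbalanced-learn (for example its `SMOTETomek` class) would typically be used.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their nearest minority neighbors (SMOTE-style)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances among minority samples.
    dist = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor

    k = min(k, len(X_min) - 1)
    neighbors = np.argsort(dist, axis=1)[:, :k]

    # Pick a random base sample and one of its k neighbors for each new point.
    base = rng.integers(0, len(X_min), size=n_new)
    chosen = neighbors[base, rng.integers(0, k, size=n_new)]

    # New point = base + lambda * (neighbor - base), with lambda in [0, 1).
    lam = rng.random((n_new, 1))
    return X_min[base] + lam * (X_min[chosen] - X_min[base])

# Toy minority class (standing in for the Conversion = 0 rows).
X_minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
                       [1.2, 2.5], [1.8, 1.6], [2.1, 2.4]])
synthetic = smote_like_oversample(X_minority, n_new=20, k=3)
print(synthetic.shape)
```

Because every synthetic point lies on a segment between two existing minority samples, the new data stays inside the region the minority class already occupies.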

---

### **Transformations for the Second Predictive Model**

1. **Clustering with KMeans and User Behavior Analysis:**
   - **KMeans** was used to cluster users based on objective, relevant, non-redundant, and unbiased variables such as:
     - `Income`, `ClickThroughRate`, `TimeOnSite`, `SocialShares`, `EmailClicks`, and `PreviousPurchases`.

2. **Campaign and Channel Assignments Based on Clusters:**
   - Patterns observed in the clusters were used to assign new communication channels and personalized campaigns to each customer.

3. **Conflicting Variables:**
   - **`LoyaltyPoints`:** This variable is biased due to subjective criteria. Visualizations show low loyalty points assigned to customers with high interaction and multiple purchases, making it unreliable for predictions.
     - ![lp1](resources/img/lp1.png)
     - ![lp2](resources/img/lp2.png)

   - **`ConversionRate`:** A general variable that does not represent specific user behavior. It is derived from the target, introducing redundancy into the model. Graphs show high conversion rates can arise from various actions, such as email clicks, social media shares, or purchases:
     - ![cr1](resources/img/cr1.png)
     - ![cr2](resources/img/cr2.png)

4. **AdSpend Adjustment:**
   - The `AdSpend` column was predicted under the new campaign configuration, excluding the `Conversion` variable to avoid bias.

5. **Conversion Predictions in the New Scenario:**
   - New conversion predictions were made considering cluster changes, campaign assignments, and the updated `AdSpend`.
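
The clustering step in this pipeline can be sketched as follows, using scikit-learn's `KMeans` on the six behavioral variables listed above. The data is randomly generated, `n_clusters=4` is chosen only to keep the example small, and scaling the features before clustering is an assumption rather than something stated in this report.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 300
features = ["Income", "ClickThroughRate", "TimeOnSite",
            "SocialShares", "EmailClicks", "PreviousPurchases"]

# Randomly generated stand-in data; ranges are illustrative only.
df = pd.DataFrame({
    "Income": rng.uniform(20_000, 150_000, n),
    "ClickThroughRate": rng.uniform(0.01, 0.30, n),
    "TimeOnSite": rng.uniform(0.5, 15, n),
    "SocialShares": rng.integers(0, 100, n),
    "EmailClicks": rng.integers(0, 10, n),
    "PreviousPurchases": rng.integers(0, 10, n),
})

# KMeans is distance-based, so features are standardized first (assumption).
X_scaled = StandardScaler().fit_transform(df[features])

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
df["Cluster"] = kmeans.fit_predict(X_scaled)

# Per-cluster feature means: a starting point for choosing channels and campaigns.
cluster_profile = df.groupby("Cluster")[features].mean()
print(cluster_profile)
```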

---

## **IV. Modeling**

### **First Predictive Model Based on Original Campaign**
#### **Model Training**
Several classification algorithms were evaluated, including:
- **Random Forest**
- **Gradient Boosting**
- **XGBoost**

#### **Hyperparameter Optimization**
- **GridSearchCV** was used to find the best parameters, optimizing metrics such as `recall`, `precision`, and `ROC-AUC`.
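
A multi-metric grid search of this kind might look like the sketch below. The parameter grid, the synthetic data from `make_classification`, and the choice to refit on ROC-AUC are all assumptions for illustration; the report does not state which grid or refit metric was used.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced binary problem standing in for the real data.
X, y = make_classification(n_samples=400, n_features=8,
                           weights=[0.2, 0.8], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Small illustrative grid; a real search would likely be wider.
param_grid = {"n_estimators": [50, 100], "max_depth": [4, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring={"recall": "recall", "precision": "precision", "roc_auc": "roc_auc"},
    refit="roc_auc",  # with several metrics, one must be named for the final refit
    cv=3,
)
search.fit(X_train, y_train)

best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

The per-metric cross-validation scores are available in `search.cv_results_` under keys such as `mean_test_recall` and `mean_test_precision`.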

#### **Evaluation**
- **Selected Model:** Random Forest, which showed the best balance between sensitivity and precision in predictions.

- **Final Model Metrics:**
  - **Recall (Test):** 0.989
  - **Precision (Test):** 0.975
  - **F1-Score:** 0.982
  - **ROC-AUC:** 0.965
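
As a quick consistency check on the reported metrics, the F1-score is the harmonic mean of precision and recall. The helper name `f1_from` below is hypothetical; the inputs are the test values listed above.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = f1_from(precision=0.975, recall=0.989)
print(round(f1, 3))  # 0.982, matching the reported F1-Score
```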

---

## **V. Predictions in Simulated Campaign Scenario and Final Results**

### **Key Results**
1. **Important Variables:**
   - `AdSpend`, `TimeOnSite`, `ClickThroughRate`, `SocialShares`, `Income`, `PreviousPurchases`, `EmailClicks`.

2. **Clustering with K-Means:**
   - **31 clusters** were created to personalize strategies.

3. **Optimized Channels and Campaigns:**
   - **Key Channels:** `Email`, `PPC`, `Social Media`.
   - **Effective Campaign Types:** `Conversion` and `Retention`.

---

## **VI. Conclusions and Future Steps**

### **Results Analysis**
- The use of clustering and Feature Importance analysis optimized strategies for specific segments.
- Lack of product, margin, and temporal data limits ROI analysis.

### **Future Steps**
1. **Data Collection:**
   - Incorporate product, margin, and cost data to calculate ROI and cost per conversion.
   - Add temporal data to evaluate trends and enable dynamic analysis.

2. **Simulations:**
   - Refine campaigns for specific types (`Awareness`, `Conversion`, `Retention`, `Consideration`).

3. **Data Quality:**
   - Design systems to ensure unbiased data collection with clear purposes.

**Final Conclusion:** While current results are useful for general campaign optimization, improving data quality and specificity is essential to maximize prediction fidelity and applicability.
