# Assignment 2

In this Python notebook, the dataset ... will be examined, pre-processed, and used to train and evaluate three different predictive models.


## Contextualizing the Problem

Data mining is an essential practice for uncovering patterns and insights from extensive datasets, enabling businesses to make informed, data-driven decisions. This approach is especially valuable in marketing, where understanding customer behavior can lead to more targeted campaigns and optimized resource allocation. This project aims to apply data mining to the marketing domain, specifically focusing on a sample dataset provided by a food retail company that offers various product categories through multiple sales channels. The primary objective is to develop a predictive model to maximize the efficiency and profitability of marketing campaigns.

The dataset at hand is derived from a pilot marketing campaign conducted by the company, involving 2,240 randomly selected customers. This pilot aimed to test the effectiveness of a marketing initiative promoting a new gadget, with results recorded as customer responses (purchases or non-purchases). Despite generating revenue from respondents, the overall campaign ran at a loss, with a profit of -3,046 monetary units (MU). This underscores the importance of leveraging data mining to predict and target likely respondents more accurately.

## Problem Statement

**Given the data from a sample marketing campaign, how can we accurately predict which customers are most likely to respond to future campaigns, and what insights can we derive to improve marketing strategies and profitability?**

## Business Objectives and Value Proposition (Business Understanding)

The main goal is to build a robust predictive model that can identify customers most likely to respond to future campaigns. This will enable the marketing team to strategically target high-potential customers, enhancing campaign efficiency and profitability. Achieving this goal involves understanding the characteristics of past respondents and using this insight to refine future efforts. The success of this initiative is expected to justify the value of data-driven marketing strategies and contribute to sustained revenue growth.

Key business objectives include:

- Increasing Campaign Profitability: By accurately predicting respondents, the company can focus its marketing efforts on customers who are more likely to engage, minimizing costs and maximizing revenue.

- Customer Insight and Segmentation: Gaining a deeper understanding of customer behaviors and attributes that correlate with positive responses to marketing campaigns.

- Strategic Resource Allocation: Enabling the marketing team to allocate budget and resources more effectively.

## Data Mining Methodology: CRISP-DM Approach

To ensure a systematic and effective analysis, we will adopt the CRISP-DM (Cross-Industry Standard Process for Data Mining) framework. This method structures the project into six key phases:
<div style="text-align: center;">
    <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/b/b9/CRISP-DM_Process_Diagram.png/639px-CRISP-DM_Process_Diagram.png" alt="CRISP-DM Process" style="width:400px;height:400px;margin-top: 20px;">
</div>

1. Business Understanding: Define the project’s goals and success criteria, focusing on predicting customer responses and maximizing campaign profit. The aim is to ensure targeted marketing that improves the return on investment (ROI).

2. Data Understanding: Explore the dataset, which includes socio-demographic and firmographic information about customers and their responses to the pilot campaign. This step will involve analyzing data distributions, correlations, and initial insights into what differentiates respondents from non-respondents.

3. Data Preparation: Prepare the dataset for modeling by handling missing values, outliers, and data transformations. Categorical variables will be encoded, and numerical features will be scaled to enhance model performance. This phase also includes creating new features if necessary to capture additional insights (e.g., interaction terms between relevant variables).

4. Modeling: Develop multiple predictive models, such as logistic regression, decision tree-based models, and ensemble methods (e.g., Random Forest). Train these models using a split of the data into training and test sets, followed by cross-validation and hyperparameter tuning for optimization.

5. Evaluation: Assess the models based on key performance metrics, such as accuracy, precision, recall, F1-score, and ROC-AUC. The goal is to select the best-performing model that aligns with the business objectives. Additionally, analyze potential biases or limitations in the model’s performance.

6. Deployment: While deployment is beyond the scope of this initial analysis, successful implementation could lead to integrating the predictive model into the company’s marketing platform for real-time targeting and campaign management.
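The preparation, modeling, and evaluation phases above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses scikit-learn (not imported elsewhere in this notebook), runs on a small synthetic table rather than the campaign data, and picks logistic regression and ROC-AUC as one example of the models and metrics listed. The column selection and all parameter values are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (NOT the campaign dataset); columns are illustrative.
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "Income": rng.normal(50_000, 15_000, n),
    "Recency": rng.integers(0, 100, n),
    "Education": rng.choice(["Graduation", "PhD", "Master"], n),
})
# Let the toy target depend on the features so there is signal to learn.
logit = (toy["Income"] - 50_000) / 15_000 - (toy["Recency"] - 50) / 30
toy["Response"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
# Inject a few missing values to exercise the imputation step.
toy.loc[toy.sample(10, random_state=0).index, "Income"] = np.nan

X, y = toy.drop(columns="Response"), toy["Response"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Data preparation: impute and scale numeric columns, one-hot encode categoricals.
numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric_pipeline, ["Income", "Recency"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Education"]),
])

# Modeling: preprocessing and classifier are fitted together inside one pipeline.
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])

# Evaluation: 5-fold cross-validation on the training split, then a held-out test score.
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
model.fit(X_train, y_train)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"CV ROC-AUC: {cv_auc.mean():.3f}, test ROC-AUC: {test_auc:.3f}")
```

Wrapping the preprocessing in the same pipeline as the classifier means the imputer and scaler are refitted on each training fold, which avoids leaking information from the validation folds.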


### Importing modules

In [4]:
import os
import pandas as pd

### Data Understanding

In [10]:
# Prerequisite: create a folder named "data" next to this notebook and place the CSV file in it
def check_data_folder_dataset_load_df(dataset_csv_name="marketing_campaign.csv"):
    """Check that data/<dataset_csv_name> exists and load it into a DataFrame."""
    dataset_path = os.path.join("data", dataset_csv_name)
    if not os.path.isdir("data"):
        raise FileNotFoundError("there is no data folder")
    if not os.path.isfile(dataset_path):
        raise FileNotFoundError(f"there is no {dataset_csv_name} in the data folder")

    print("data is correctly stored in data folder")
    print("return the pandas dataframe")
    # Read the file with ';' as the field delimiter
    return pd.read_csv(dataset_path, delimiter=";")

# Show every column when a DataFrame is displayed
pd.set_option('display.max_columns', None)
df_marketing = check_data_folder_dataset_load_df()
df_marketing

data is correctly stored in data folder
return the pandas dataframe


Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,2012-09-04,58,635,88,546,172,88,88,3,8,10,4,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,2014-03-08,38,11,1,6,2,1,6,2,1,1,2,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,2013-08-21,26,426,49,127,111,21,42,1,8,2,10,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,2014-02-10,26,11,4,20,10,3,5,2,2,0,4,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,2014-01-19,94,173,43,118,46,27,15,5,5,3,6,5,0,0,0,0,0,0,3,11,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2235,10870,1967,Graduation,Married,61223.0,0,1,2013-06-13,46,709,43,182,42,118,247,2,9,3,4,5,0,0,0,0,0,0,3,11,0
2236,4001,1946,PhD,Together,64014.0,2,1,2014-06-10,56,406,0,30,0,0,8,7,8,2,5,7,0,0,0,1,0,0,3,11,0
2237,7270,1981,Graduation,Divorced,56981.0,0,0,2014-01-25,91,908,48,217,32,12,24,1,2,3,13,6,0,1,0,0,0,0,3,11,0
2238,8235,1956,Master,Together,69245.0,0,1,2014-01-24,8,428,30,214,80,30,61,2,6,5,10,3,0,0,0,0,0,0,3,11,0


The dataset contains 29 columns and 2240 rows. The column descriptions can be found below:

1.	ID - Customer’s id
2.	Year_Birth - Customer’s year of birth
3.	AcceptedCmp1 - 1 if customer accepted the offer in the 1st campaign, 0 otherwise
4.	AcceptedCmp2 - 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
5.	AcceptedCmp3 - 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
6.	AcceptedCmp4 - 1 if customer accepted the offer in the 4th campaign, 0 otherwise
7.	AcceptedCmp5 - 1 if customer accepted the offer in the 5th campaign, 0 otherwise
8.	Response (target) - 1 if customer accepted the offer in the last campaign, 0 otherwise
9.	Complain - 1 if customer complained in the last 2 years, 0 otherwise
10.	Dt_Customer - date of customer’s enrolment with the company
11.	Education - customer’s level of education
12.	Marital_Status - customer’s marital status
13.	Kidhome - number of small children in customer’s household
14.	Teenhome - number of teenagers in customer’s household
15.	Income - customer’s yearly household income
16.	MntFishProducts - amount spent on fish products in the last 2 years
17.	MntMeatProducts - amount spent on meat products in the last 2 years
18.	MntFruits - amount spent on fruit products in the last 2 years
19.	MntSweetProducts - amount spent on sweet products in the last 2 years
20.	MntWines - amount spent on wine products in the last 2 years
21.	MntGoldProds - amount spent on gold products in the last 2 years
22.	NumDealsPurchases - number of purchases made with discount
23.	NumCatalogPurchases - number of purchases made using catalogue
24.	NumStorePurchases - number of purchases made directly in stores
25.	NumWebPurchases - number of purchases made through company’s web site
26.	NumWebVisitsMonth - number of visits to company’s web site in the last month
27.	Recency - number of days since the last purchase
28.	Z_CostContact - cost to contact a customer
29.	Z_Revenue - revenue after the client accepts the campaign
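Given the descriptions of `Z_CostContact` and `Z_Revenue`, campaign profit can be sketched as revenue from respondents minus the cost of contacting everyone. This reading is an assumption: it treats `Z_Revenue` as earned once per acceptance and `Z_CostContact` as paid once per contacted customer. The helper names `campaign_profit` and `break_even_rate` are hypothetical, and the frame below is a toy example, not the real data.

```python
import pandas as pd


def campaign_profit(df: pd.DataFrame) -> float:
    """Revenue from respondents minus the cost of contacting every customer.

    Assumes Z_Revenue is earned per acceptance and Z_CostContact is paid per contact.
    """
    revenue = (df["Response"] * df["Z_Revenue"]).sum()
    cost = df["Z_CostContact"].sum()
    return float(revenue - cost)


def break_even_rate(cost_per_contact: float, revenue_per_response: float) -> float:
    """Response rate at which revenue exactly covers the contact cost."""
    return cost_per_contact / revenue_per_response


# Toy frame: four contacted customers, one of whom accepted the offer.
toy = pd.DataFrame({
    "Response": [1, 0, 0, 0],
    "Z_CostContact": [3] * 4,
    "Z_Revenue": [11] * 4,
})
print(campaign_profit(toy))              # 11 - 12 = -1.0
print(round(break_even_rate(3, 11), 3))  # 0.273
```

If this reading is right, a contact cost of 3 MU and a revenue of 11 MU over 2,240 contacts would give the quoted profit of -3,046 MU for 334 acceptances (334 × 11 − 2,240 × 3 = −3,046). That count is worth checking against `df_marketing["Response"].sum()`.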

As the output below shows, the only column with missing values is Income, which is missing in 24 rows:

In [9]:
df_marketing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i

As we can see, there are indeed 24 rows with null values in the Income column:

In [26]:
rows_with_nulls = df_marketing[df_marketing['Income'].isnull()]
print(rows_with_nulls)

         ID  Year_Birth   Education Marital_Status  Income  Kidhome  Teenhome  \
10     1994        1983  Graduation        Married     NaN        1         0   
27     5255        1986  Graduation         Single     NaN        1         0   
43     7281        1959         PhD         Single     NaN        0         0   
48     7244        1951  Graduation         Single     NaN        2         1   
58     8557        1982  Graduation         Single     NaN        1         0   
71    10629        1973    2n Cycle        Married     NaN        1         0   
90     8996        1957         PhD        Married     NaN        2         1   
91     9235        1957  Graduation         Single     NaN        1         1   
92     5798        1973      Master       Together     NaN        0         0   
128    8268        1961         PhD        Married     NaN        0         1   
133    1295        1963  Graduation        Married     NaN        0         1   
312    2437        1989  Gra

Since only about 1% of the rows (24 of 2,240) have a null Income value, we can safely remove them:

In [27]:
df_marketing = df_marketing.dropna()
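
Before dropping rows, it can help to measure how many are affected and to restrict the drop to the column in question. The snippet below is a minimal sketch on a toy frame (not the real data). On the real data, the same expression applied to `df_marketing["Income"]` should give roughly 1.1% (24 of 2,240 rows).

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Income value out of four rows.
toy = pd.DataFrame({
    "Income": [58138.0, np.nan, 71613.0, 26646.0],
    "Response": [1, 0, 0, 0],
})

# isna() gives booleans, so their mean is the share of rows with a missing value.
missing_share = toy["Income"].isna().mean()
print(f"{missing_share:.1%} of rows lack Income")  # 25.0% of rows lack Income

# subset= limits the drop to rows where Income itself is missing.
cleaned = toy.dropna(subset=["Income"])
print(len(toy) - len(cleaned), "rows dropped")  # 1 rows dropped
```

Using `subset=["Income"]` makes the intent explicit and protects against accidentally dropping rows if another column later gains missing values.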

In [28]:
df_marketing.describe()

Unnamed: 0,ID,Year_Birth,Income,Kidhome,Teenhome,Recency,MntWines,MntFruits,MntMeatProducts,MntFishProducts,MntSweetProducts,MntGoldProds,NumDealsPurchases,NumWebPurchases,NumCatalogPurchases,NumStorePurchases,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
count,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0,2216.0
mean,5588.353339,1968.820397,52247.251354,0.441787,0.505415,49.012635,305.091606,26.356047,166.995939,37.637635,27.028881,43.965253,2.323556,4.085289,2.671029,5.800993,5.319043,0.073556,0.074007,0.073105,0.064079,0.013538,0.009477,3.0,11.0,0.150271
std,3249.376275,11.985554,25173.076661,0.536896,0.544181,28.948352,337.32792,39.793917,224.283273,54.752082,41.072046,51.815414,1.923716,2.740951,2.926734,3.250785,2.425359,0.261106,0.261842,0.260367,0.24495,0.115588,0.096907,0.0,0.0,0.357417
min,0.0,1893.0,1730.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
25%,2814.75,1959.0,35303.0,0.0,0.0,24.0,24.0,2.0,16.0,3.0,1.0,9.0,1.0,2.0,0.0,3.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
50%,5458.5,1970.0,51381.5,0.0,0.0,49.0,174.5,8.0,68.0,12.0,8.0,24.5,2.0,4.0,2.0,5.0,6.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
75%,8421.75,1977.0,68522.0,1.0,1.0,74.0,505.0,33.0,232.25,50.0,33.0,56.0,3.0,6.0,4.0,8.0,7.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,11.0,0.0
max,11191.0,1996.0,666666.0,2.0,2.0,99.0,1493.0,199.0,1725.0,259.0,262.0,321.0,15.0,27.0,28.0,13.0,20.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,11.0,1.0
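
The summary above shows two things that may matter for data preparation: `Z_CostContact` and `Z_Revenue` have a standard deviation of 0 (they are constant), and some extremes look implausible, such as a minimum `Year_Birth` of 1893 and a maximum `Income` of 666666. A possible way to flag such cases is sketched below. The helper names and the 1.5 × IQR rule are illustrative choices, not something this notebook has settled on, and the frame is a toy example.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]


# Toy frame: one extreme income and two constant columns.
toy = pd.DataFrame({
    "Income": [35303.0, 51381.5, 68522.0, 52000.0, 666666.0],
    "Z_CostContact": [3] * 5,
    "Z_Revenue": [11] * 5,
})
print(constant_columns(toy))                 # ['Z_CostContact', 'Z_Revenue']
print(iqr_outliers(toy["Income"]).tolist())  # [666666.0]
```

Constant columns carry no information for a predictive model, so they are natural candidates for removal. Flagged outliers still need a manual decision on whether to drop, cap, or keep them.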
