### Business Problem

The client wants to identify which customers are most likely to respond to the next marketing campaign so that the marketing team can allocate budget efficiently. Our goal is to raise campaign ROI by at least 10 % relative to the last campaign’s ROI (the baseline, once it has been calculated) by sending offers only to customers with a predicted acceptance probability ≥ 25 %, while keeping reach above 60 % (~1 500 customers).
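
The targeting rule can be sketched as follows. This is a minimal illustration, not the project's model: `select_targets` and the example probabilities are hypothetical, and "reach" is assumed here to mean the share of scored customers who would receive the offer.

```python
import numpy as np

def select_targets(probabilities, threshold=0.25):
    """Return a boolean mask of customers to contact and the resulting reach."""
    probs = np.asarray(probabilities, dtype=float)
    mask = probs >= threshold          # contact only customers at or above the threshold
    reach = mask.mean()                # share of customers who would receive the offer
    return mask, reach

# Hypothetical predicted acceptance probabilities for five customers
example_probs = [0.10, 0.25, 0.40, 0.05, 0.60]
mask, reach = select_targets(example_probs)
print(mask.tolist(), f"reach = {reach:.0%}")
```

Comparing `reach` with the 60 % floor shows whether the 25 % threshold can be met without shrinking the audience too far.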

### Column Expectations
| Column          | Expected dtype | Nulls allowed? | Valid range / set  | Planned action |
|-----------------|----------------|----------------|--------------------|----------------|
| ID              | int64          | No    | > 0, unique        | Investigate zeros, set as index |
| Year_Birth      | int64          | No    | 1900–2005          | Flag births < 1910 as outliers |
| Education       | category       | No    | {Graduation, PhD, Master, Basic, 2n Cycle} | Standardize spelling, dtype=category |
| Marital_Status  | category       | No    | collapse variants  | Map “Absurd”, “YOLO” to “Single”? |
| Income          | float64        | Yes   | 0–200 000          | Impute nulls (median) and cap at 99th percentile |
| Dt_Customer     | datetime64[ns] | No    | 2012-08-14 → 2014-06-29 | Parse date; derive `Customer_Tenure` |
| Recency         | int64          | No    | 0–120              | Validate non-negative; cast to int |
| …               | …             | …     | …                  | … |
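
The range and null expectations can be checked mechanically. The sketch below is one possible approach: `check_ranges` and `EXPECTED_RANGES` are illustrative names, only the numeric columns from the table are covered, and the fourth row of the toy frame is a made-up bad record.

```python
import pandas as pd

# Range expectations copied from the table above (numeric columns only)
EXPECTED_RANGES = {
    "Year_Birth": (1900, 2005),
    "Income": (0, 200_000),
    "Recency": (0, 120),
}

def check_ranges(frame, ranges):
    """Count non-null values that fall outside the expected range, per column."""
    report = {}
    for col, (low, high) in ranges.items():
        values = frame[col].dropna()
        report[col] = int((~values.between(low, high)).sum())  # between() is inclusive
    return report

# First three rows come from the notebook output; the fourth is a made-up bad row
sample = pd.DataFrame({
    "Year_Birth": [1957, 1954, 1965, 1893],
    "Income": [58138.0, 46344.0, 71613.0, None],
    "Recency": [58, 38, 26, 130],
})
print(check_ranges(sample, EXPECTED_RANGES))
print(sample.isna().sum().to_dict())
```

On the real data, the same call would take `df` in place of `sample`.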

### Glossary

| Term (abbr.)                          | Plain-English meaning                                                                                                          | How to calculate it                                                                                                                                                                               | Why it matters in this project                                                                                        |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- |
| **ROI** — *Return on Investment*      | “Did the money we spent bring back more money than it cost?”                                                                   | $\text{ROI}=\frac{\text{Profit}}{\text{Cost}}=\frac{\text{Revenue} - \text{Cost}}{\text{Cost}}$<br>• Example: Spend \$10 000 on a campaign, earn \$15 000 in sales ⇒ Profit \$5 000, ROI = 50 %. | Management cares whether the campaign paid off. When we find segments with higher ROI, we target them more next time. |
| **KPI** — *Key Performance Indicator* | A metric everyone agrees shows success                                                                                         | Could be “conversion rate,” “average order value,” “customer lifetime value.”                                                                                                                     | Keeps the team focused on a number that matters instead of random stats.                                              |
| **Conversion / Conversion Rate**      | A *conversion* is when a customer does the action we want (e.g., buys, signs up). Conversion rate = % of people who converted. | $\text{Conv. Rate}= \frac{\text{Conversions}}{\text{People targeted}}$                                                                                                                            | A 0/1 response flag (1 = responded) lets us compute conversion rate for each segment. The raw file has `Response`; `AcceptedCmpOverall` is not among its columns and would presumably need to be derived first. |
| **Segment / Segmentation**            | Grouping customers by shared traits (age, income, etc.).                                                                       | — (concept, not formula)                                                                                                                                                                          | Helps tailor offers; data analysis shows which segments respond best.                                                 |
| **Outlier**                           | A data point way outside the “usual” range.                                                                                    | E.g., income \$666 666 when most are \$20 k–\$80 k.                                                                                                                                               | Outliers can skew averages; we decide whether to cap, remove, or keep them.                                           |
| **Imputation**                        | Filling in missing values                                                                                                      | Median imputation: replace missing incomes with the median income                                                                                                                                 | Keeps dataset usable when nulls exist; must note in documentation.                                                    |
| **Recency**                           | How recently a customer made a purchase (in days)                                                                              | Given directly as `Recency` in dataset                                                                                                                                                            | Lower = bought recently; important for churn prediction or targeting.                                                 |
| **EDA** — *Exploratory Data Analysis* | First‑look, open‑ended exploration of the data                                                                                 | Visuals (histograms, boxplots), stats (`describe()`)                                                                                                                                              | Helps spot patterns, anomalies, and guides cleaning/modeling.                                                         |
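
The ROI and conversion-rate formulas translate directly into code. This is a small sketch: the `roi` helper is illustrative, the toy segment data is invented, and `Response` is assumed to be the 0/1 response flag.

```python
import pandas as pd

def roi(revenue, cost):
    """ROI = (Revenue - Cost) / Cost."""
    return (revenue - cost) / cost

# Worked example from the table: spend 10 000, earn 15 000
print(f"{roi(15_000, 10_000):.0%}")

# Hypothetical segment data: 1 = responded, 0 = did not
toy = pd.DataFrame({
    "Education": ["PhD", "PhD", "Master", "Master", "Master"],
    "Response": [1, 0, 0, 0, 1],
})

# The mean of a 0/1 flag is conversions / people targeted
conv_rate = toy.groupby("Education")["Response"].mean()
print(conv_rate)
```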


In [8]:
import numpy as np
import pandas as pd

pd.set_option("display.max_columns", None)  # show every column of wide frames
path = "../data/raw/marketing_campaign.csv"
df = pd.read_csv(path, sep=";")  # file is read as semicolon-delimited

In [9]:
display(df.head(3))
print(df.dtypes.value_counts())

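|   | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |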


int64      25
object      3
float64     1
Name: count, dtype: int64


In [10]:
# Bucket column names by analytical role; the lists are filled in the next cell
roles = {
    "identifier": [],
    "date": [],
    "numeric_count": [],
    "currency_amount": [],
    "categorical": []
}

In [11]:
roles["identifier"].append("ID")
roles["date"].append("Dt_Customer")
roles["numeric_count"].extend(["Kidhome", "Teenhome", "Recency"])
roles["currency_amount"].extend(["Income", "MntWines", "MntFruits", "MntMeatProducts"])
# AcceptedCmpOverall does not appear in the raw columns printed above, so it has to be derived before use
roles["categorical"].extend(["Education", "Marital_Status", "AcceptedCmpOverall"])
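
Because `roles` is filled in by hand, it helps to check it against the columns that were actually loaded. The helper below is a hypothetical sketch (`missing_columns` is not part of the notebook). The column list is copied from the `df.head(3)` output and the role lists from the cell above.

```python
def missing_columns(roles, columns):
    """Map each role to the listed column names that are absent from `columns`."""
    available = set(columns)
    return {
        role: [c for c in cols if c not in available]
        for role, cols in roles.items()
        if any(c not in available for c in cols)
    }

# Column names as printed by df.head(3)
raw_columns = [
    "ID", "Year_Birth", "Education", "Marital_Status", "Income", "Kidhome",
    "Teenhome", "Dt_Customer", "Recency", "MntWines", "MntFruits",
    "MntMeatProducts", "MntFishProducts", "MntSweetProducts", "MntGoldProds",
    "NumDealsPurchases", "NumWebPurchases", "NumCatalogPurchases",
    "NumStorePurchases", "NumWebVisitsMonth", "AcceptedCmp3", "AcceptedCmp4",
    "AcceptedCmp5", "AcceptedCmp1", "AcceptedCmp2", "Complain",
    "Z_CostContact", "Z_Revenue", "Response",
]

# Same assignments as in the notebook cell
roles_demo = {
    "identifier": ["ID"],
    "date": ["Dt_Customer"],
    "numeric_count": ["Kidhome", "Teenhome", "Recency"],
    "currency_amount": ["Income", "MntWines", "MntFruits", "MntMeatProducts"],
    "categorical": ["Education", "Marital_Status", "AcceptedCmpOverall"],
}

print(missing_columns(roles_demo, raw_columns))
```

In the notebook itself the call would be `missing_columns(roles, df.columns)`. Any name it reports has to be created or corrected before the role lists are used to select columns.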

In [12]:
print(df['Income'].head())
print(df['Dt_Customer'].head())

0    58138.0
1    46344.0
2    71613.0
3    26646.0
4    58293.0
Name: Income, dtype: float64
0    2012-09-04
1    2014-03-08
2    2013-08-21
3    2014-02-10
4    2014-01-19
Name: Dt_Customer, dtype: object
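
`Dt_Customer` is still `object`, so the planned date parsing has not happened yet. The sketch below applies the planned actions for both columns to the five values printed above. Several choices are assumptions rather than settled decisions: tenure is measured in days up to the latest enrolment date in the data, the cap is computed before imputation, and the trailing `None` in `income` is a made-up missing value.

```python
import pandas as pd

# First five values of each column, as printed above
sample = pd.DataFrame({
    "Income": [58138.0, 46344.0, 71613.0, 26646.0, 58293.0],
    "Dt_Customer": ["2012-09-04", "2014-03-08", "2013-08-21", "2014-02-10", "2014-01-19"],
})

# Parse the ISO-style strings into a datetime column
sample["Dt_Customer"] = pd.to_datetime(sample["Dt_Customer"], format="%Y-%m-%d")

# Assumption: tenure = days between enrolment and the latest enrolment date in the data
reference_date = sample["Dt_Customer"].max()
sample["Customer_Tenure"] = (reference_date - sample["Dt_Customer"]).dt.days

# Income plan: median imputation, then a 99th-percentile cap.
# The trailing None is a made-up missing value for illustration.
income = pd.Series([58138.0, 46344.0, 71613.0, 26646.0, 58293.0, None], name="Income")
cap = income.quantile(0.99)  # quantile() ignores missing values
income_clean = income.fillna(income.median()).clip(upper=cap)

print(sample.dtypes)
print(sample[["Dt_Customer", "Customer_Tenure"]])
print(income_clean)
```

A fixed reference date, such as the campaign launch date, may be a better anchor for tenure than the maximum in the data; that choice should be recorded alongside the imputation.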
