### Column Expectations
| Column          | Expected dtype | Null? | Valid range / set  | Planned action |
|-----------------|---------------|-------|--------------------|----------------|
| ID              | int64         | No    | > 0, unique        | Investigate zeros, set as index |
| Year_Birth      | int64         | No    | 1900–2005          | Flag births < 1910 as outliers |
| Education       | category      | No    | {Graduation, PhD, Master, Basic, 2n Cycle} | Standardize spelling, dtype=category |
| Marital_Status  | category      | No    | collapse variants  | Map “Absurd”, “YOLO” to “Single”? |
| Income          | float64       | Yes   | 0–200 000          | Impute nulls (median) & cap at 99th pct |
| Dt_Customer     | datetime64[ns]| No    | 2012‑08‑14 → 2014‑06‑29 | Parse date; derive `Customer_Tenure` |
| Recency         | int64         | No    | 0–120              | Validate non‑neg, dtype int |
| …               | …             | …     | …                  | … |
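
The expectations in the table can be turned into automated checks. The sketch below is one possible way to do that, not the project's actual validation code: `check_expectations` is a hypothetical helper, and the toy frame uses illustrative values (the last row is made up to violate several rules). It covers only the numeric columns listed above.

```python
import pandas as pd

# Toy frame with illustrative values; the last row is hypothetical and
# deliberately breaks the ID, Year_Birth, and Income expectations.
toy = pd.DataFrame({
    "ID": [5524, 2174, 4141, 0],
    "Year_Birth": [1957, 1954, 1965, 1893],
    "Income": [58138.0, 46344.0, 71613.0, None],
    "Recency": [58, 38, 26, 45],
})


def check_expectations(frame: pd.DataFrame) -> dict:
    """Count the rows that violate each expectation (hypothetical helper)."""
    income = frame["Income"].dropna()  # nulls are allowed, so range-check the rest
    return {
        "ID not > 0": int((frame["ID"] <= 0).sum()),
        "ID duplicated": int(frame["ID"].duplicated().sum()),
        "Year_Birth outside 1900-2005": int((~frame["Year_Birth"].between(1900, 2005)).sum()),
        "Income null": int(frame["Income"].isna().sum()),
        "Income outside 0-200000": int((~income.between(0, 200_000)).sum()),
        "Recency outside 0-120": int((~frame["Recency"].between(0, 120)).sum()),
    }


violations = check_expectations(toy)
for rule, count in violations.items():
    print(f"{rule}: {count}")
```

Running the same helper on the full DataFrame gives a quick count of how many rows each planned action would touch.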




| Term (abbr.)                          | Plain‑English meaning                                                                                                          | How to calculate it                                                                                                                                                                               | Why it matters in your project                                                                                        |
| ------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------- |
| **ROI** — *Return on Investment*      | “Did the money we spent bring back more money than it cost?”                                                                   | $\text{ROI}=\frac{\text{Profit}}{\text{Cost}}=\frac{\text{Revenue} - \text{Cost}}{\text{Cost}} $<br>• Example: Spend \$10 000 on a campaign, earn \$15 000 in sales ⇒ Profit \$5 000, ROI = 50 %. | Management cares whether the campaign paid off. When we find segments with higher ROI, we target them more next time. |
| **KPI** — *Key Performance Indicator* | A metric everyone agrees shows success                                                                                         | Could be “conversion rate,” “average order value,” “customer lifetime value.”                                                                                                                     | Keeps the team focused on a number that matters instead of random stats.                                              |
| **Conversion / Conversion Rate**      | A *conversion* is when a customer does the action we want (e.g., buys, signs up). Conversion rate = % of people who converted. | $\text{Conv. Rate}= \frac{\text{Conversions}}{\text{People targeted}}$                                                                                                                            | Our dataset’s `AcceptedCmpOverall` (1 = responded) lets us compute conversion rate for each segment.                  |
| **Segment / Segmentation**            | Grouping customers by shared traits (age, income, etc.).                                                                       | — (concept, not formula)                                                                                                                                                                          | Helps tailor offers; data analysis shows which segments respond best.                                                 |
| **Outlier**                           | A data point way outside the “usual” range.                                                                                    | E.g., income \$666 666 when most are \$20 k–\$80 k.                                                                                                                                               | Outliers can skew averages; we decide whether to cap, remove, or keep them.                                           |
| **Imputation**                        | Filling in missing values                                                                                                      | Median imputation: replace missing incomes with the median income                                                                                                                                 | Keeps dataset usable when nulls exist; must note in documentation.                                                    |
| **Recency**                           | How recently a customer made a purchase (in days)                                                                              | Given directly as `Recency` in dataset                                                                                                                                                            | Lower = bought recently; important for churn prediction or targeting.                                                 |
| **EDA** — *Exploratory Data Analysis* | First‑look, open‑ended exploration of the data                                                                                 | Visuals (histograms, boxplots), stats (`describe()`)                                                                                                                                              | Helps spot patterns, anomalies, and guides cleaning/modeling.                                                         |
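
The ROI and conversion-rate formulas in the glossary can be checked with a few lines of pandas. This is a minimal sketch: the `roi` function, the `segment` and `responded` column names, and the toy rows are all assumptions made for illustration, not part of the dataset.

```python
import pandas as pd


def roi(revenue: float, cost: float) -> float:
    """ROI = (revenue - cost) / cost, returned as a fraction (0.5 means 50 %)."""
    return (revenue - cost) / cost


# The glossary's worked example: spend 10 000, earn 15 000
example_roi = roi(revenue=15_000, cost=10_000)
print(f"ROI = {example_roi:.0%}")

# Hypothetical toy data: one row per targeted customer, responded is 1 or 0
toy = pd.DataFrame({
    "segment": ["A", "A", "A", "A", "B", "B"],
    "responded": [1, 0, 0, 0, 0, 1],
})

# Conversion rate = conversions / people targeted; the mean of a 0/1 flag is that ratio
conv_rate = toy.groupby("segment")["responded"].mean()
print(conv_rate)
```

Taking the mean of a 0/1 response flag per group is a compact way to get a conversion rate for every segment at once.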


In [5]:
import pandas as pd
import numpy as np
path = "../data/raw/marketing_campaign.csv"
df = pd.read_csv(path, sep=";")
df.head(3)

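|   | ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Z_CostContact | Z_Revenue | Response |
|---|----|------------|-----------|----------------|--------|---------|----------|-------------|---------|----------|-----|-------------------|--------------|--------------|--------------|--------------|--------------|----------|---------------|-----------|----------|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635 | ... | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11 | ... | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | ... | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 11 | 0 |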
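
`Dt_Customer` arrives as text, so the planned "parse date; derive `Customer_Tenure`" step might look like the sketch below. It uses only the three dates from the sample rows above. The reference date for tenure is an assumption (here, the latest enrolment date in the frame); the project may prefer a fixed analysis date instead.

```python
import pandas as pd

# The three Dt_Customer values from the sample rows above
sample = pd.DataFrame({"Dt_Customer": ["2012-09-04", "2014-03-08", "2013-08-21"]})

# Parse with an explicit format; strings that do not match become NaT instead of raising
sample["Dt_Customer"] = pd.to_datetime(
    sample["Dt_Customer"], format="%Y-%m-%d", errors="coerce"
)

# ASSUMPTION: tenure is counted in days up to the latest enrolment date in the data
reference_date = sample["Dt_Customer"].max()
sample["Customer_Tenure"] = (reference_date - sample["Dt_Customer"]).dt.days

print(sample)
```

With `errors="coerce"`, counting `NaT` values after parsing shows how many dates failed to match the expected format.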


In [6]:
print(df.shape)
df.info()
df.describe(include="all").T.head(15)

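|   | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|-------|--------|-----|------|------|-----|-----|-----|-----|-----|-----|
| ID | 2240.0 | NaN | NaN | NaN | 5592.159821 | 3246.662198 | 0.0 | 2828.25 | 5458.5 | 8427.75 | 11191.0 |
| Year_Birth | 2240.0 | NaN | NaN | NaN | 1968.805804 | 11.984069 | 1893.0 | 1959.0 | 1970.0 | 1977.0 | 1996.0 |
| Education | 2240.0 | 5.0 | Graduation | 1127.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Marital_Status | 2240.0 | 8.0 | Married | 864.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Income | 2216.0 | NaN | NaN | NaN | 52247.251354 | 25173.076661 | 1730.0 | 35303.0 | 51381.5 | 68522.0 | 666666.0 |
| Kidhome | 2240.0 | NaN | NaN | NaN | 0.444196 | 0.538398 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| Teenhome | 2240.0 | NaN | NaN | NaN | 0.50625 | 0.544538 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 |
| Dt_Customer | 2240.0 | 663.0 | 2012-08-31 | 12.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Recency | 2240.0 | NaN | NaN | NaN | 49.109375 | 28.962453 | 0.0 | 24.0 | 49.0 | 74.0 | 99.0 |
| MntWines | 2240.0 | NaN | NaN | NaN | 303.935714 | 336.597393 | 0.0 | 23.75 | 173.5 | 504.25 | 1493.0 |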
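
The summary shows 2216 non-null `Income` values out of 2240 rows (24 missing) and a maximum of 666666, which is what the planned "impute nulls (median) and cap at 99th pct" action targets. A minimal sketch of that step follows, using hypothetical toy incomes rather than the real column. Computing the cap after imputation, rather than before, is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy incomes: one missing value and one extreme value
income = pd.Series([35_000.0, 51_000.0, np.nan, 68_000.0, 666_666.0], name="Income")

# Median imputation: Series.median() skips NaN by default
median_income = income.median()
imputed = income.fillna(median_income)

# Cap (winsorize the upper tail) at the 99th percentile of the imputed values
cap = imputed.quantile(0.99)
capped = imputed.clip(upper=cap)

print(f"median used for imputation: {median_income}")
print(f"99th percentile cap: {cap:.2f}")
print(capped)
```

On five toy rows the 99th percentile still sits close to the extreme value, so the cap barely moves it; on the full 2240-row column the percentile is determined by the bulk of the data, and the effect is much stronger.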
