## Importing Python libraries

In [51]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from toolbox_DS import *

## Functions

## Load dataset

In [52]:
df = pd.read_excel('./data/marketing_campaign.xlsx')
df.head(8)

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,2012-09-04,58,635,...,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,2014-03-08,38,11,...,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,2013-08-21,26,426,...,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,2014-02-10,26,11,...,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,2014-01-19,94,173,...,5,0,0,0,0,0,0,3,11,0
5,7446,1967,Master,Together,62513.0,0,1,2013-09-09,16,520,...,6,0,0,0,0,0,0,3,11,0
6,965,1971,Graduation,Divorced,55635.0,0,1,2012-11-13,34,235,...,6,0,0,0,0,0,0,3,11,0
7,6177,1985,PhD,Married,33454.0,1,0,2013-05-08,32,76,...,8,0,0,0,0,0,0,3,11,0


In [53]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2240 entries, 0 to 2239
Data columns (total 29 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   2240 non-null   int64  
 1   Year_Birth           2240 non-null   int64  
 2   Education            2240 non-null   object 
 3   Marital_Status       2240 non-null   object 
 4   Income               2216 non-null   float64
 5   Kidhome              2240 non-null   int64  
 6   Teenhome             2240 non-null   int64  
 7   Dt_Customer          2240 non-null   object 
 8   Recency              2240 non-null   int64  
 9   MntWines             2240 non-null   int64  
 10  MntFruits            2240 non-null   int64  
 11  MntMeatProducts      2240 non-null   int64  
 12  MntFishProducts      2240 non-null   int64  
 13  MntSweetProducts     2240 non-null   int64  
 14  MntGoldProds         2240 non-null   int64  
 15  NumDealsPurchases    2240 non-null   i

## Variables and meaning

| Column Name             | Description                                |
|-------------------------|--------------------------------------------------------|
| ID                      | Unique identifier for each customer                    |
| Year_Birth              | Year of birth of the customer                          |
| Education               | Education level of the customer                        |
| Marital_Status          | Marital status of the customer                         |
| Income                  | Annual income of the customer                          |
| Kidhome                 | Number of small children in the customer's household   |
| Teenhome                | Number of teenagers in the customer's household        |
| Dt_Customer             | Date when the customer joined                          |
| Recency                 | Number of days since the last purchase                 |
| MntWines                | Amount spent on wine in the last 2 years               |
| MntFruits               | Amount spent on fruits in the last 2 years             |
| MntMeatProducts         | Amount spent on meat in the last 2 years               |
| MntFishProducts         | Amount spent on fish in the last 2 years               |
| MntSweetProducts        | Amount spent on sweets in the last 2 years             |
| MntGoldProds            | Amount spent on gold in the last 2 years               |
| NumDealsPurchases       | Number of purchases made with a discount               |
| NumWebPurchases         | Number of purchases made through the company's website |
| NumCatalogPurchases     | Number of purchases made using a catalogue             |
| NumStorePurchases       | Number of purchases made directly in stores            |
| NumWebVisitsMonth       | Number of visits to company's website in the last month|
| AcceptedCmp3            | 1 if customer accepted the offer in the 3rd campaign, 0 otherwise |
| AcceptedCmp4            | 1 if customer accepted the offer in the 4th campaign, 0 otherwise |
| AcceptedCmp5            | 1 if customer accepted the offer in the 5th campaign, 0 otherwise |
| AcceptedCmp1            | 1 if customer accepted the offer in the 1st campaign, 0 otherwise |
| AcceptedCmp2            | 1 if customer accepted the offer in the 2nd campaign, 0 otherwise |
| Complain                | 1 if customer complained in the last 2 years, 0 otherwise          |
| Z_CostContact           | Contact cost of the customer (constant value)          |
| Z_Revenue               | Revenue from the customer (constant value)             |
| Response                | 1 if customer accepted the offer in the last campaign, 0 otherwise |

## Describe variables

In [54]:
describe_df(df).T

Unnamed: 0,DATE_TYPE,MISSINGS (%),UNIQUE_VALUES,CARDIN (%)
ID,int64,0.0,2240,100.0
Year_Birth,int64,0.0,59,2.63
Education,object,0.0,5,0.22
Marital_Status,object,0.0,8,0.36
Income,float64,1.07,1974,89.08
Kidhome,int64,0.0,3,0.13
Teenhome,int64,0.0,3,0.13
Dt_Customer,object,0.0,663,29.6
Recency,int64,0.0,100,4.46
MntWines,int64,0.0,776,34.64
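`describe_df` comes from the local `toolbox_DS` module, whose source is not shown here. The block below is only a guess at what such a helper could look like, reconstructed from the column names in the output above. The name `describe_df_sketch` and the toy frame are hypothetical. Dividing by the non-null count is an assumption too, chosen because 1974 / 2216 ≈ 89.08% matches the `Income` row.

```python
import pandas as pd


def describe_df_sketch(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for toolbox_DS.describe_df; the real helper may differ."""
    summary = pd.DataFrame(
        {
            "DATE_TYPE": data.dtypes.astype(str),
            "MISSINGS (%)": (data.isna().mean() * 100).round(2),
            "UNIQUE_VALUES": data.nunique(),
            # Assumption: cardinality is relative to the non-null count of each column.
            "CARDIN (%)": (data.nunique() / data.count() * 100).round(2),
        }
    )
    # Metrics are returned as rows, so calling `.T` yields one row per variable.
    return summary.T


# Made-up toy data, only to exercise the function.
toy = pd.DataFrame({"ID": [1, 2, 3, 4], "Income": [10.0, None, 10.0, 30.0]})
report = describe_df_sketch(toy).T
print(report)
```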


In [55]:
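# Inspect the value distribution of low-cardinality variables (fewer than 17 unique values)
summary = describe_df(df).T
low_card_cols = summary.loc[summary['UNIQUE_VALUES'] < 17].index
print(low_card_cols)
for col in low_card_cols:
    # normalize=True returns proportions instead of raw counts
    print(df[col].value_counts(normalize=True))
    print()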

Index(['Education', 'Marital_Status', 'Kidhome', 'Teenhome',
       'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases',
       'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp3',
       'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2',
       'Complain', 'Z_CostContact', 'Z_Revenue', 'Response'],
      dtype='object')
Education
Graduation    0.503125
PhD           0.216964
Master        0.165179
2n Cycle      0.090625
Basic         0.024107
Name: proportion, dtype: float64

Marital_Status
Married     0.385714
Together    0.258929
Single      0.214286
Divorced    0.103571
Widow       0.034375
Alone       0.001339
Absurd      0.000893
YOLO        0.000893
Name: proportion, dtype: float64

Kidhome
0    0.577232
1    0.401339
2    0.021429
Name: proportion, dtype: float64

Teenhome
0    0.516964
1    0.459821
2    0.023214
Name: proportion, dtype: float64

NumDealsPurchases
1     0.433036
2     0.221875
3     0.132589
4     0.084375
5     0.041964
6     0.027

In [56]:
# Inspect the rows with 'YOLO' or 'Absurd' in the Marital_Status variable
df.loc[df['Marital_Status'].isin(['Absurd', 'YOLO'])]

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
2093,7734,1993,Graduation,Absurd,79244.0,0,0,2012-12-19,58,471,...,1,0,0,1,1,0,0,3,11,1
2134,4369,1957,Master,Absurd,65487.0,0,0,2014-01-10,48,240,...,2,0,0,0,0,0,0,3,11,0
2177,492,1973,PhD,YOLO,48432.0,0,1,2012-10-18,3,322,...,8,0,0,0,0,0,0,3,11,0
2202,11133,1973,PhD,YOLO,48432.0,0,1,2012-10-18,3,322,...,8,0,0,0,0,0,0,3,11,1


## Null values

In [57]:
df.isna().sum()

ID                      0
Year_Birth              0
Education               0
Marital_Status          0
Income                 24
Kidhome                 0
Teenhome                0
Dt_Customer             0
Recency                 0
MntWines                0
MntFruits               0
MntMeatProducts         0
MntFishProducts         0
MntSweetProducts        0
MntGoldProds            0
NumDealsPurchases       0
NumWebPurchases         0
NumCatalogPurchases     0
NumStorePurchases       0
NumWebVisitsMonth       0
AcceptedCmp3            0
AcceptedCmp4            0
AcceptedCmp5            0
AcceptedCmp1            0
AcceptedCmp2            0
Complain                0
Z_CostContact           0
Z_Revenue               0
Response                0
dtype: int64

In [58]:
df.Income.value_counts(dropna=False)

Income
NaN        24
7500.0     12
35860.0     4
37760.0     3
83844.0     3
           ..
40760.0     1
41452.0     1
6835.0      1
33622.0     1
52869.0     1
Name: count, Length: 1975, dtype: int64

In [59]:
df.loc[df['Income'].isna()]

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
10,1994,1983,Graduation,Married,,1,0,2013-11-15,11,5,...,7,0,0,0,0,0,0,3,11,0
27,5255,1986,Graduation,Single,,1,0,2013-02-20,19,5,...,1,0,0,0,0,0,0,3,11,0
43,7281,1959,PhD,Single,,0,0,2013-11-05,80,81,...,2,0,0,0,0,0,0,3,11,0
48,7244,1951,Graduation,Single,,2,1,2014-01-01,96,48,...,6,0,0,0,0,0,0,3,11,0
58,8557,1982,Graduation,Single,,1,0,2013-06-17,57,11,...,6,0,0,0,0,0,0,3,11,0
71,10629,1973,2n Cycle,Married,,1,0,2012-09-14,25,25,...,8,0,0,0,0,0,0,3,11,0
90,8996,1957,PhD,Married,,2,1,2012-11-19,4,230,...,9,0,0,0,0,0,0,3,11,0
91,9235,1957,Graduation,Single,,1,1,2014-05-27,45,7,...,7,0,0,0,0,0,0,3,11,0
92,5798,1973,Master,Together,,0,0,2013-11-23,87,445,...,1,0,0,0,0,0,0,3,11,0
128,8268,1961,PhD,Married,,0,1,2013-07-11,23,352,...,6,0,0,0,0,0,0,3,11,0


## Duplicated values

In [60]:
df.duplicated().sum()

0
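`df.duplicated()` only flags rows that match in every column, `ID` included. The two 'YOLO' rows shown earlier (IDs 492 and 11133) agree in the displayed columns apart from `ID` and `Response`, so a check that ignores those two columns may be worth running. The sketch below uses a made-up toy frame with a handful of columns; which columns to exclude is an assumption.

```python
import pandas as pd

# Toy frame: the first two rows mirror the displayed values of IDs 492 and 11133.
toy = pd.DataFrame(
    {
        "ID": [492, 11133, 5524],
        "Year_Birth": [1973, 1973, 1957],
        "Income": [48432.0, 48432.0, 58138.0],
        "Response": [0, 1, 1],
    }
)

# Exact duplicates: every column must match, so unique IDs hide repeated customers.
exact_dupes = toy.duplicated().sum()

# Near duplicates: compare on everything except the identifier and the target.
feature_cols = list(toy.columns.difference(["ID", "Response"]))
near_dupes = toy.duplicated(subset=feature_cols, keep=False)

print(exact_dupes)
print(toy[near_dupes])
```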

## First look at numeric values

In [61]:
df.describe().T.round(2)

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
ID,2240.0,5592.16,3246.66,0.0,2828.25,5458.5,8427.75,11191.0
Year_Birth,2240.0,1968.81,11.98,1893.0,1959.0,1970.0,1977.0,1996.0
Income,2216.0,52247.25,25173.08,1730.0,35303.0,51381.5,68522.0,666666.0
Kidhome,2240.0,0.44,0.54,0.0,0.0,0.0,1.0,2.0
Teenhome,2240.0,0.51,0.54,0.0,0.0,0.0,1.0,2.0
Recency,2240.0,49.11,28.96,0.0,24.0,49.0,74.0,99.0
MntWines,2240.0,303.94,336.6,0.0,23.75,173.5,504.25,1493.0
MntFruits,2240.0,26.3,39.77,0.0,1.0,8.0,33.0,199.0
MntMeatProducts,2240.0,166.95,225.72,0.0,16.0,67.0,232.0,1725.0
MntFishProducts,2240.0,37.53,54.63,0.0,3.0,12.0,50.0,259.0


In [62]:
# Unusual values in Year_Birth variable
df.loc[df['Year_Birth'] < 1941]

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
192,7829,1900,2n Cycle,Divorced,36640.0,1,0,2013-09-26,99,15,...,5,0,0,0,0,0,1,3,11,0
239,11004,1893,2n Cycle,Single,60182.0,0,1,2014-05-17,23,8,...,4,0,0,0,0,0,0,3,11,0
339,1150,1899,PhD,Together,83532.0,0,0,2013-09-26,36,755,...,1,0,0,1,0,0,0,3,11,0
1950,6663,1940,PhD,Single,51141.0,0,0,2013-07-08,96,144,...,5,0,0,0,0,0,0,3,11,0


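Three customers have a birth year of 1900 or earlier (1893, 1899 and 1900), which would make them more than 110 years old when they enrolled in 2013-2014. These values are most likely data-entry errors.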

In [63]:
# Unusual values in Income variable
df.loc[df['Income'] > 100000]

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
124,7215,1983,Graduation,Single,101970.0,0,0,2013-03-12,69,722,...,2,0,1,1,1,0,0,3,11,1
164,8475,1973,PhD,Married,157243.0,0,1,2014-03-01,98,20,...,0,0,0,0,0,0,0,3,11,0
203,2798,1977,PhD,Together,102160.0,0,0,2012-11-02,54,763,...,4,0,1,1,1,0,0,3,11,1
252,10089,1974,Graduation,Divorced,102692.0,0,0,2013-04-05,5,168,...,2,0,1,1,1,1,0,3,11,1
617,1503,1976,PhD,Together,162397.0,1,1,2013-06-03,31,85,...,1,0,0,0,0,0,0,3,11,0
646,4611,1970,Graduation,Together,105471.0,0,0,2013-01-21,36,1009,...,3,0,0,1,1,0,0,3,11,1
655,5555,1975,Graduation,Divorced,153924.0,0,0,2014-02-07,81,1,...,0,0,0,0,0,0,0,3,11,0
687,1501,1982,PhD,Married,160803.0,0,0,2012-08-04,21,55,...,0,0,0,0,0,0,0,3,11,0
1300,5336,1971,Master,Together,157733.0,1,0,2013-06-04,37,39,...,1,0,0,0,0,0,0,3,11,0
1653,4931,1977,Graduation,Together,157146.0,0,0,2013-04-29,13,1,...,1,0,0,0,0,0,0,3,11,0


In [64]:
df.loc[df['Income'] < 5000]

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
21,5376,1979,Graduation,Married,2447.0,1,0,2013-01-06,42,1,...,1,0,0,0,0,0,0,3,11,0
981,3955,1965,Graduation,Divorced,4861.0,0,0,2014-06-22,20,2,...,14,0,0,0,0,0,0,3,11,0
1245,6862,1971,Graduation,Divorced,1730.0,0,0,2014-05-18,65,1,...,20,0,0,0,0,0,0,3,11,0
1524,11110,1973,Graduation,Single,3502.0,1,0,2013-04-13,56,2,...,14,0,0,0,0,0,0,3,11,0
1846,9931,1963,PhD,Married,4023.0,1,1,2014-06-23,29,5,...,19,0,0,0,0,0,0,3,11,0
1975,10311,1969,Graduation,Married,4428.0,0,1,2013-10-05,0,16,...,1,0,0,0,0,0,0,3,11,0


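The maximum Income value is $666,666 (see the `describe()` output above), which looks more like a placeholder or data-entry error than a real income.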

In [65]:
# Unusual values in MntMeatProducts variable
df.loc[df['MntMeatProducts'] > 1000]

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
21,5376,1979,Graduation,Married,2447.0,1,0,2013-01-06,42,1,...,1,0,0,0,0,0,0,3,11,0
164,8475,1973,PhD,Married,157243.0,0,1,2014-03-01,98,20,...,0,0,0,0,0,0,0,3,11,0
687,1501,1982,PhD,Married,160803.0,0,0,2012-08-04,21,55,...,0,0,0,0,0,0,0,3,11,0
1653,4931,1977,Graduation,Together,157146.0,0,0,2013-04-29,13,1,...,1,0,0,0,0,0,0,3,11,0
2228,8720,1978,2n Cycle,Together,,0,0,2012-08-12,53,32,...,0,0,1,0,0,0,0,3,11,0


## First conclusions

### Summary

**Size**:
- 2240 entries
- 29 columns

**Variables**:
- **ID**: 100% cardinality, so it can serve as the index. -> **ACTION: set as index**
- **Dt_Customer**: has type 'object' but holds dates. -> **ACTION: change type to 'datetime'**
- **Z_CostContact** and **Z_Revenue**: have only one value each. -> **ACTION: drop columns**
- **Education**, **Marital_Status**, **Kidhome**, **Teenhome**, **AcceptedCmp3**, **AcceptedCmp4**, **AcceptedCmp5**, **AcceptedCmp1**, **AcceptedCmp2** and **Complain**: have type 'object' or 'int64' and could be 'category'. -> **ACTION: change type**
- **Marital_Status**: has two values, 'Absurd' and 'YOLO', that don't make sense.
    - YOLO: these rows will be removed. -> **ACTION: remove**
    - Absurd: will be grouped into a single category called 'others'. -> **ACTION: change to 'others'**
- **Response**: could be the target variable.
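The variable actions listed above could be sketched as follows. This is a minimal illustration on a made-up toy frame, not the notebook's final cleaning code. The name `clean` and the choice to detect constant columns with `nunique()` are assumptions.

```python
import pandas as pd

# Toy frame with a few of the columns discussed above (values partly made up).
toy = pd.DataFrame(
    {
        "ID": [5524, 7734, 492, 2174],
        "Marital_Status": ["Single", "Absurd", "YOLO", "Single"],
        "Kidhome": [0, 0, 0, 1],
        "Dt_Customer": ["2012-09-04", "2012-12-19", "2012-10-18", "2014-03-08"],
        "Z_CostContact": [3, 3, 3, 3],
        "Z_Revenue": [11, 11, 11, 11],
    }
)

# ID has 100% cardinality, so use it as the index.
clean = toy.set_index("ID")

# Parse the enrollment date stored as text.
clean["Dt_Customer"] = pd.to_datetime(clean["Dt_Customer"])

# Drop columns that hold a single value (here Z_CostContact and Z_Revenue).
constant_cols = [c for c in clean.columns if clean[c].nunique() == 1]
clean = clean.drop(columns=constant_cols)

# Remove 'YOLO' rows and fold 'Absurd' into 'others'.
clean = clean[clean["Marital_Status"] != "YOLO"].copy()
clean["Marital_Status"] = clean["Marital_Status"].replace({"Absurd": "others"})

# Convert low-cardinality variables to the 'category' dtype.
for col in ["Marital_Status", "Kidhome"]:
    clean[col] = clean[col].astype("category")

print(clean.dtypes)
```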

**Null values**:
- **Income**: has 24 null values (1.07% of the rows). -> **ACTION: remove these rows**, since there are only 24
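A minimal sketch of that action, assuming the rows are dropped rather than imputed. The toy frame is made up; only the column name comes from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame: two of the three rows lack an Income value.
toy = pd.DataFrame({"ID": [5524, 1994, 5255], "Income": [58138.0, np.nan, np.nan]})

# Share of missing values, useful to confirm that dropping them is cheap.
missing_share = toy["Income"].isna().mean()

# Drop only the rows where Income is missing; other columns are not considered.
without_nulls = toy.dropna(subset=["Income"])

print(f"{missing_share:.2%} missing, {len(without_nulls)} rows kept")
```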

**Outliers**: -> **ACTION: Remove**
- Year_Birth (birth years of 1900 or earlier, i.e. customers who would be more than 110 years old; the next-oldest customer was born in 1940):
    - ID number: 
        - 7829
        - 11004
        - 1150
- Income ($666,666):
    - ID number: 9432
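Removing these rows could look like the sketch below, assuming `ID` has already been set as the index. The toy frame reuses the birth years and incomes shown earlier; the `Year_Birth` of ID 9432 is not shown in this notebook, so the value used here is a made-up placeholder. The thresholds in the rule-based variant are assumptions as well.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "ID": [7829, 11004, 1150, 6663, 9432, 5524],
        # 1977 for ID 9432 is a placeholder, not a value from the dataset.
        "Year_Birth": [1900, 1893, 1899, 1940, 1977, 1957],
        "Income": [36640.0, 60182.0, 83532.0, 51141.0, 666666.0, 58138.0],
    }
).set_index("ID")

# Option 1: drop the outliers by the IDs listed above.
outlier_ids = [7829, 11004, 1150, 9432]
by_id = toy.drop(index=outlier_ids)

# Option 2: filter by rule (assumed thresholds); same result on this toy frame.
by_rule = toy[(toy["Year_Birth"] >= 1940) & (toy["Income"] < 600_000)]

print(by_id)
```

Dropping by ID is explicit and reproducible; the rule-based filter would also catch similar values if the dataset were refreshed.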



### Possible new columns:
- Age
- Customer_seniority
- Household_members
- Total_expenditure
- Num_offers_accepted
- Total_number_purchase
- Avg_expenditure (Total_expenditure/Total_number_purchase)
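One possible way to derive these columns is sketched below. Every definition is an assumption, since the list above only names the columns: the reference date, counting two adults for 'Married'/'Together' and one otherwise, and which purchase columns are summed. The toy frame uses a subset of the real columns with made-up values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Year_Birth": [1957, 1954],
        "Marital_Status": ["Single", "Together"],
        "Kidhome": [0, 1],
        "Teenhome": [0, 1],
        "Dt_Customer": pd.to_datetime(["2012-09-04", "2014-03-08"]),
        "MntWines": [635, 11],
        "MntFruits": [80, 1],
        "NumWebPurchases": [8, 0],
        "NumStorePurchases": [4, 0],
        "AcceptedCmp1": [0, 0],
        "AcceptedCmp2": [0, 1],
    },
    index=pd.Index([5524, 2174], name="ID"),
)

reference_date = pd.Timestamp("2014-12-31")  # assumed "today" for age and seniority

mnt_cols = [c for c in toy.columns if c.startswith("Mnt")]
cmp_cols = [c for c in toy.columns if c.startswith("AcceptedCmp")]
purchase_cols = ["NumWebPurchases", "NumStorePurchases"]  # assumed purchase channels

feats = toy.copy()
feats["Age"] = reference_date.year - feats["Year_Birth"]
feats["Customer_seniority"] = (reference_date - feats["Dt_Customer"]).dt.days
# Assumption: couples count as two adults, everyone else as one.
adults = feats["Marital_Status"].isin(["Married", "Together"]).astype(int) + 1
feats["Household_members"] = adults + feats["Kidhome"] + feats["Teenhome"]
feats["Total_expenditure"] = feats[mnt_cols].sum(axis=1)
feats["Num_offers_accepted"] = feats[cmp_cols].sum(axis=1)
feats["Total_number_purchase"] = feats[purchase_cols].sum(axis=1)
# Replace zero purchases with NaN so the ratio is NaN rather than inf.
feats["Avg_expenditure"] = feats["Total_expenditure"] / feats[
    "Total_number_purchase"
].replace(0, np.nan)

print(feats.iloc[:, -7:])
```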

### Possible discretization of variables
- Income
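As an illustration, Income could be discretized into quartile-based bands with `pd.qcut`. The number of bins and the labels are assumptions, and the sample values are just a few incomes taken from the outputs above.

```python
import pandas as pd

income = pd.Series(
    [1730.0, 35303.0, 51381.5, 68522.0, 162397.0, 26646.0, 58138.0, 71613.0],
    name="Income",
)

# Quantile-based bins: each band holds roughly the same number of customers.
income_band = pd.qcut(income, q=4, labels=["low", "mid_low", "mid_high", "high"])

print(income_band.value_counts().sort_index())
```

`pd.cut` with fixed edges is the alternative when the bands should have a business meaning rather than equal sizes.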